BOWO PRASETYO
A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF TOKYO IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF INFORMATION SCIENCE AND TECHNOLOGY UNIVERSITY OF TOKYO
SUMMER 2002
ACKNOWLEDGMENTS
I wish to thank Prof. Masaru Kitsuregawa for allowing me to explore my areas of interest, for guiding me through this thesis, and for his gracious understanding of the various difficulties that punctuated my graduate study. I wish to thank all my friends, Iko Pramudiono, Katsumi Takahashi, Masashi Toyoda and others whom I cannot mention one by one, for being there when I needed them. I wish to express my gratitude to NTT Directory Services Co. for providing me with the web access log data of NTT i-Townpage served on the i-Mode website, which is used as experiment data in this thesis. I wish to express my gratitude to the Japan Monbu-Kagakusho, who granted me a scholarship during my period of study at the University of Tokyo. I will be eternally indebted to them. I also wish to thank my parents for their love and patience from my childhood until now, and for allowing me to learn from their precious experience. I hope I can repay their eternal love and affection some time in the future. Last but not least, I wish to thank my beloved wife, Silvia Surini, for her patience and support during my study. Without her nice smile and delicious food I would not have been able to finish this thesis.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
TABLE OF CONTENTS
1. INTRODUCTION
   A) Importance of Analyzing Website Visitor Behavior
   B) Visual Data Analysis as Effective Tool
   C) Motivation and Purpose of This Thesis
2. BACKGROUND AND RELATED WORK
   A) Web Usage Mining to Discover Behavior Patterns
   B) A Mining Technique Generalized Sequential Pattern
   C) Visualization Tools of Visitor Behavior
3. NAVIZ CONCEPT
   A) Appropriate Web Traversal Properties
   B) Visitor Behavior in Traversal Diagram
   C) Algorithm for Drawing Directed Graphs
4. NAVIZ IMPLEMENTATION
   A) System Architecture
   B) Visualization Features
5. NAVIZ EXPERIMENT
   A) NTT i-Townpage served on i-Mode
   B) Global Visitor Behavior
   C) Comparison of Visitor Behavior by Time, Place and Category Class
6. CONCLUSION AND FUTURE WORK
References
APPENDIX
1. INTRODUCTION
This chapter introduces the motivation behind and the purpose of this thesis. It is organized into three sections. The first gives a brief overview of the Internet and the importance of studying visitor behavior on a website, and introduces data mining as a method for studying that behavior. The second is a general overview of visual analysis of data, which is needed to analyze data mining results. The third section describes the motivation and purpose of this thesis. The chapter ends with an outline of the structure of the thesis.
A) Importance of Analyzing Website Visitor Behavior
The Internet, or World Wide Web (WWW), is probably the most rapidly growing technology of human society in this century. In just a few years it has penetrated all aspects of human life, such as doing business through e-commerce, providing and receiving education through online schools, and managing organizations through intranets. The most direct effect of WWW technology is the complete change in how information is collected, conveyed, and exchanged, through services such as email, websites, ftp, chat, and newsgroups. With the rapid progress of WWW technology and its ever-growing popularity, the Web has become the largest information source available on earth. More than 9% of the world population, or about 581 million people, are now connected to the Web, and the number has been growing by as much as 9 million users per month over the last two years [13]. Popular websites such as NTT Docomo i-Townpage can have hundreds of thousands of visitors in a day.
Fig. 1. Visitor behavior on a website (figure labels: Prefecture List Local Top Input Form Search)
In order to enhance website performance and improve system design, webmasters need a way to learn how visitors behave on their website. Every web server usually has a mechanism that registers a web log entry for every access it receives, recording items such as the URL requested, the IP address from which the request originated, and a timestamp. This web log record contains a wealth of information about the visitors of a website and is a treasure for webmasters, who can use it to track visitors' browsing behavior down to individual mouse clicks. The first attempt to take advantage of the web log record was statistical analysis, which has been successful in providing webmasters with basic information such as the number of requests made, the IP addresses that originated the requests, total files and kilobytes served, totals and averages by specific time periods, the URLs from which users came to the site, the browsers making the requests, etc. Another way to extract more information from the web log record is to use data mining techniques such as association rules, sequential patterns, and clustering, which have recently been researched actively. By applying data mining techniques such as the Generalized Sequential Pattern algorithm [1] to a web log record, certain patterns (rules) that are hidden inside it and cannot be revealed by statistical methods can be discovered. For example: about 30% of visitors on a website exit as soon as they enter the Top page (Fig. 1); around 2% of visitors reach the Search Result page through the path of Top Region List Result. Knowing such patterns of visitor behavior helps webmasters improve the design and enhance the performance of their website.
With the rapid progress of WWW technology and the ever-growing popularity of the Web, a huge number of web log records are being collected. Popular websites can have their web log growing by hundreds of megabytes every day. Given this volume of raw web log data, retrieving significant and useful patterns of visitor behavior is a difficult task. A common problem with data mining techniques applied to the web log record is that they produce so many patterns that a human cannot analyze them directly. A tool is therefore needed that can analyze the large number of patterns discovered from the web log record easily and effectively. Naviz, proposed in this thesis, is a tool designed for this purpose: it visualizes visitor behavior patterns so that they can be analyzed visually.
B) Visual Data Analysis as Effective Tool
The data mining process results in a set of patterns in the data. While a small number of patterns may be understood easily by examining them by hand, a large number of patterns requires a more effective way of analysis. Often the most effective way to understand and analyze a set of information is to look at a picture of it (analysis by visual) rather than to investigate its exact values in a list (analysis by value). A large number of discovered patterns can be analyzed effectively by using visuals, which may be traditional statistical images such as bar graphs and pie diagrams or innovative custom displays related to the analysis in question. Moreover, combining visuals with analysis by value, as Naviz does, gives the fullest understanding of the significance and meaning of the data. The visualization of data mining patterns on the web log record provided by Naviz also meets the requirements for graphical displays proposed by Edward R. Tufte [9], whose works are often considered the standards by which data graphics are judged. He says that graphical displays should:
Show the data.
Induce the viewer to think about the substance rather than about methodology, graphics design, the technology of graphic production, or something else.
Avoid distorting what the data have to say.
Present many numbers in small space.
Make large data sets coherent.
Encourage comparison between data.
Supply a broad overview to fine detail.
Serve a clear purpose.
Be closely integrated with the statistical and verbal descriptions of the data set.
On the other hand, constructing an effective web log visualization is a challenging task. As stated in the work of Mukherjea et al. [2], various problems are involved:
The navigational views are two or three dimensional projections of generally multidimensional hypermedia networks.
Even if such a view is developed, the resulting structure would be very complex for any non-trivial hypermedia system.
As the size of the underlying information space increases, it becomes very difficult to fit the whole information structure on a screen.
And yet the user should be able to get an idea of not only the structure but the actual contents of the nodes and links just by looking at the navigational views.
C) Motivation and Purpose of This Thesis
With WWW technology growing continuously, website developers, designers, and maintainers have difficulty analyzing the efficiency of their websites. Two of the major problems with current web log analysis are the difficulty of understanding what visitors are trying to do on a website and how they are doing it. Solving them requires an analysis of visitor navigational behavior on the website, which tells webmasters how to improve the structure of the hyperlinks as well as the content of the documents in the website. Some data mining techniques for discovering visitor behavior patterns have been proposed. However, because of the large number of discovered patterns, an effective method for visualizing the results is needed. This thesis proposes Naviz, a website visitor behavior visualization system that can be used interactively to present visually the behavior of users visiting a website. Its purpose is to provide a visual analysis tool in the area of web usage mining, which can be used to visualize easily and effectively the result of a data mining process on a web log file.
The second chapter summarizes the background and academic research in the area of web usage mining, with a focus on visualizing the web log record of a website. The third chapter describes in detail the visualization concept used in this thesis. Chapter 4 describes the implementation of the Naviz system and details the steps in a typical use of the system. Chapter 5 describes the results of the experiments conducted on the web log file of the NTT Docomo Mobile i-Townpage website. The last chapter gives concluding remarks and outlines areas of future work and development.
2. BACKGROUND AND RELATED WORK
A) Web Usage Mining to Discover Behavior Patterns
Applying data mining techniques to Web data in order to automatically discover and analyze useful information is called Web mining [10]. Based on the type of web data to be mined, Web mining can be classified into three types: web content mining, web structure mining, and web usage mining. Web content mining tries to mine the data contained in the Web itself, such as the text of a website or the results of a search engine. Web structure mining focuses on the HTML structure of websites and tries to find relationship patterns in the hyperlink structure between websites. Web usage mining takes the web log record as its target and tries to discover the usage patterns of visitors on a website. Web usage mining is of particular interest for this thesis, since it is used to discover visitor behavior from the web log file.
Web usage mining takes the web log record as the target to which data mining techniques are applied, and consists of three distinct phases: preprocessing, pattern discovery, and pattern analysis. The diagram of the web usage mining process in Fig. 2 helps in understanding its architecture.
Fig. 2. Web Usage Mining
Before the web usage mining process is described, it is necessary to clarify the definitions of the related data abstractions. The following definitions are from the Web characterization terminology and definitions sheet draft published by the World Wide Web Consortium (W3C) Web characterization activity [12].
User is the principal using a client to interactively retrieve and render resources or resource manifestations.
Page view is the visual rendering of a web page in a specific client environment at a specific point in time.
Click stream is a sequential series of page view requests.
User session is a delimited set of user clicks (click stream) across one or more web servers.
Server session is a collection of user clicks to a single web server during a user session. It is also called a visit.
Episode is a subset of related user clicks that occur within a user session.
The term Visitor, which is used throughout this thesis, refers to the term User in the above definitions.
[Table 1. Example of web log record. Surviving sample values: b23db4675c57, DoCoMo/1.0/N501i, 20000501024444, 45c2bf42898b, 00eeiCojPzJo]
Preprocessing
Before a data mining algorithm is applied to the web log record, data preparation must be performed to convert the raw web log data into the data abstractions needed by the later phases. As shown in Fig. 2, the input of the preprocessing phase is the web log files and the output is the session files. Data preprocessing includes data cleansing, user identification, sessionizing, path completion and formatting.
In the web usage mining process, the data of interest are the access records of web pages (usually identified by a filename extension such as html, cgi, php, pl, asp, jsp, etc.), not byproduct records such as accesses to images (gif, jpg, bmp), sounds (ra, mp3), movies (mov, mpg), etc. Data cleansing is therefore needed to eliminate those irrelevant byproduct items from the web log record; they are not used in the data mining process and usually take up a large portion of the web log file. Next, user identification is the process of identifying the user, and can be done using the ip-address, agent, cookies, user-id or client-side tracking. Sessionizing divides the page accesses of each user, who is likely to visit the website more than once, into individual server sessions (visits). The simplest way to do this is to use a timeout to break a user's click-stream into sessions. Thirty minutes is used as the default timeout, following the work of L. Catledge et al. [11]. Another important step is path completion, the process of determining whether there are important accesses that are not recorded in the access log. Methods similar to those used for user identification can be used for path completion. The final step of preprocessing is formatting, a preparation module that formats the sessions properly so that they can be used in the later phases.
An example of a web log record is given in Table 1, based on data actually used as experiment data in this thesis. It includes information such as cookie, user agent, time, ip-address, cgi parameters, etc. In this thesis, only accesses from the mobile phone (DoCoMo) user agent are used. Since the cgi parameters passed from the user contain a lot of information, such as the requested page (CGI_1), user-id (CGI_2), place, category, etc., we can make use of them in our experiment. Instead of the cookie and ip-address, the user-id embedded in the cgi parameters can be used to identify the user exactly. As a result, 3 users are identified from the web log record example, 00eeiCojPbrB, 00eeiCojPzJo, and 00eeiCoLhErx, as shown in Table 2. For mobile phone web log records, sessionizing is usually performed with a 5-minute timeout to break a user's click-stream, not 30 minutes as for other types of access. From the web log record, 3 sessions are found, one for each user. The session of user 00eeiCojPbrB can be identified as the page sequence T1 → Note, that of user 00eeiCojPzJo as the sequence T1 → T2 → T3 → J1 → GS → J1 → NR → G1, and that of user 00eeiCoLhErx includes T1 → T2. After the data have been formatted into a suitable form, they are ready to be supplied to the pattern discovery algorithm, which tries to reveal the patterns hidden inside the data.
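The cleansing and sessionizing steps described above can be sketched as follows. This is a minimal illustration, not the preprocessing code used in this thesis: the record layout `(user_id, timestamp, page)`, the timestamp format `YYYYMMDDhhmmss`, and the function names are assumptions made for the example; the 5-minute timeout is the value quoted above for mobile phone logs.

```python
from datetime import datetime, timedelta

# Extensions treated as page accesses (taken from the list in the text).
PAGE_EXTENSIONS = ("html", "cgi", "php", "pl", "asp", "jsp")


def is_page_access(resource):
    """Data cleansing: keep page accesses, drop images, sounds, movies."""
    name = resource.split("?", 1)[0]          # ignore cgi parameters
    return name.rsplit(".", 1)[-1].lower() in PAGE_EXTENSIONS


def sessionize(records, timeout=timedelta(minutes=5)):
    """Split each user's click-stream whenever the gap exceeds the timeout.

    records: iterable of (user_id, timestamp, page); this layout and the
    timestamp format YYYYMMDDhhmmss are assumptions of the sketch.
    Returns a list of (user_id, [page, ...]) sessions.
    """
    by_user = {}
    for user, stamp, page in records:
        when = datetime.strptime(stamp, "%Y%m%d%H%M%S")
        by_user.setdefault(user, []).append((when, page))

    sessions = []
    for user, clicks in by_user.items():
        clicks.sort()                          # order clicks by time
        current = [clicks[0][1]]
        for (prev, _), (when, page) in zip(clicks, clicks[1:]):
            if when - prev > timeout:          # gap too long: new session
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions
```

In practice the records would first be filtered with `is_page_access` and keyed by the user-id taken from the cgi parameters, as described above.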
Pattern Discovery
This is the key component of web usage mining. Pattern discovery on the web log record draws on algorithms and techniques from several research areas, such as data mining, machine learning, statistics, and pattern recognition. Among data mining techniques, the Generalized Sequential Pattern algorithm proposed by R. Srikant et al. [1] is of particular interest in this thesis, since it is used to discover patterns from the web log record. This technique finds patterns in which a set of items follows the presence of another in a time-ordered set of sessions or episodes, and is described in detail in the next section.
Pattern Analysis
Pattern analysis is the final stage of web usage mining. Its goal is to eliminate the irrelevant rules or patterns and to extract the interesting ones from the output of the pattern discovery process. The output of web usage mining algorithms is often a very large number of patterns that are not easy for a human to understand, and which therefore need to be transformed into a format that can be understood easily. This can be done with the help of analysis methodologies and tools. There are several approaches to pattern analysis. One is to use a knowledge query mechanism such as SQL; another is to construct a multi-dimensional data cube and perform OLAP operations; more recent techniques use graphical visualization, as does Naviz, proposed in this thesis.
B) A Mining Technique Generalized Sequential Pattern
R. Srikant et al. [1] proposed the Generalized Sequential Pattern (GSP) algorithm to find, from a database of data-sequences, all sequences whose support is greater than the user-specified minimum support. Let $L = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. An itemset is a non-empty set of items. A sequence is an ordered list of itemsets. A sequence $s$ is denoted by $(s_1\, s_2 \ldots s_n)$, where $s_j$ is an itemset, also called an element of the sequence. An element of a sequence is denoted by $(x_1, x_2, \ldots, x_m)$, where $x_j$ is an item. An item can occur only once in an element of a sequence, but can occur multiple times in different elements. An itemset is considered to be a sequence with a single element. A sequence $(a_1\, a_2 \ldots a_n)$ is a subsequence of another sequence $(b_1\, b_2 \ldots b_m)$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1}$, $a_2 \subseteq b_{i_2}$, ..., $a_n \subseteq b_{i_n}$. For example, the sequence ((3) (4,5) (8)) is a subsequence of ((7) (3,8) (9) (4,5,6) (8)), since $(3) \subseteq (3,8)$, $(4,5) \subseteq (4,5,6)$, and $(8) \subseteq (8)$. However, the sequence ((3) (5)) is not a subsequence of ((3,5)) (and vice versa).
Let a database D of sequences, called data-sequences, be given; each data-sequence is a list of transactions, ordered by increasing transaction-time. A transaction has the following fields: sequence-id, transaction-id, transaction-time, and the items present in the transaction. The support for a sequence is defined as the fraction of total data-sequences that contain the sequence. In the absence of taxonomies, sliding windows and time constraints, a data-sequence contains a sequence s if s is a subsequence of the data-sequence. The sliding window generalization relaxes the definition of when a data-sequence contributes to the support of a sequence by allowing a set of transactions to contain an element of a sequence, as long as the difference in transaction-times between the transactions in the set is less than the user-specified window-size. The time constraint is of interest in this thesis, since it is used to implement the GSP algorithm in web log mining.
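It restricts the time gap between sets of transactions that contain consecutive elements of the sequence. Formally, a data-sequence $d = (d_1 \ldots d_m)$ contains a sequence $s = (s_1 \ldots s_n)$ if there exist integers $l_1 \le u_1 < l_2 \le u_2 < \ldots < l_n \le u_n$ such that $s_i$ is contained in $\bigcup_{k=l_i}^{u_i} d_k$ for $1 \le i \le n$, with the transaction-times of the matched transactions satisfying the window-size and time-gap constraints.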
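For the basic case without taxonomies, sliding windows or time constraints, the subsequence relation defined above can be sketched in a few lines. The function name and the representation of a sequence as a list of tuples are assumptions of this illustration; it is not the candidate-counting procedure of GSP itself.

```python
def is_subsequence(s, d):
    """Return True if itemset sequence s is a subsequence of d.

    Each element of s must be a subset of a distinct element of d, and
    the matched elements of d must appear in the same order as in s.
    Sliding windows and time constraints are deliberately ignored.
    """
    pos = 0
    for element in s:
        # Greedily take the earliest remaining element of d that
        # contains the current element of s.
        while pos < len(d) and not set(element) <= set(d[pos]):
            pos += 1
        if pos == len(d):
            return False
        pos += 1
    return True


# The examples from the text.
print(is_subsequence([(3,), (4, 5), (8,)],
                     [(7,), (3, 8), (9,), (4, 5, 6), (8,)]))  # True
print(is_subsequence([(3,), (5,)], [(3, 5)]))                 # False
```

Matching each element at the earliest possible position is sufficient here because taking an earlier match never rules out a later one; once a max-gap constraint is added this greedy choice is no longer safe.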
Fig. 1. Algorithm of GSP
To implement the GSP algorithm in web log mining, the first step is to transform each access in the web log record into its counterpart in data mining. Each user session can be thought of in two ways: either as a single transaction of many page references, or as a set of many transactions each consisting of a single page reference. The goal of transaction identification is to create meaningful clusters of references for each user. In the web log record used in this experiment, a single page reference forms a transaction and a session forms a data-sequence; therefore in Table 3 the data-sequence (T1 Note) consists of 2 transactions, while the data-sequence (T1 T2 T3 J1 GS J1 NR G1) consists of 8 transactions, etc. Since a transaction consists of only a single item, there is no need for the sliding window generalization (window-size = 0). The time constraint that restricts the time gap between sets of transactions containing consecutive elements of the sequence is the same as the timeout used to break a user's click-stream in the sessionizing process (min-gap = 0, max-gap = 5 min). For example, in Table 3 the log record database D contains 3 data-sequences. Page T1 is contained in the 3 data-sequences 1, 2, and 3, therefore sup(T1) = 3/3; sequence (T1 T2) is contained in 2 and 3, therefore sup(T1 T2) = 2/3; sequence (T1 T2 T3) is contained in 2 only, therefore sup(T1 T2 T3) = 1/3.
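The support values in this example can be reproduced with a short sketch. Data-sequences 1 and 2 below are the ones quoted in the text; the third is an assumed stand-in chosen only to be consistent with the stated supports, and the function names are hypothetical. Since every session already respects the 5-minute timeout, the max-gap constraint is not checked again here.

```python
from fractions import Fraction


def contains(data_sequence, pattern):
    """True if the pages of pattern occur in data_sequence in order.

    Each transaction is a single page, so plain in-order matching is
    enough; gaps between matched pages are allowed.
    """
    remaining = iter(data_sequence)
    return all(page in remaining for page in pattern)


def support(database, pattern):
    """Fraction of data-sequences that contain the pattern."""
    hits = sum(contains(seq, pattern) for seq in database)
    return Fraction(hits, len(database))


D = [
    ["T1", "Note"],                                     # data-sequence 1
    ["T1", "T2", "T3", "J1", "GS", "J1", "NR", "G1"],   # data-sequence 2
    ["T1", "T2"],                                       # assumed stand-in for 3
]

print(support(D, ["T1"]))               # 1
print(support(D, ["T1", "T2"]))         # 2/3
print(support(D, ["T1", "T2", "T3"]))   # 1/3
```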
C) Visualization Tools of Visitor Behavior
In this section, some of the methods for visualizing the web log record that have been proposed in the past are reviewed. WebViz [4] is known as the first attempt made in the field of web log visualization; it uses the statistical properties of the data. Subsequently proposed methods for web log visualization also incorporate some data mining techniques.
WebViz: A Tool for World-Wide Web Access Log Analysis
Pitkow et al. proposed WebViz [4] in 1994 as a tool for web log analysis that provides a graphical view of a website's local documents and access patterns. By incorporating the Web-Path paradigm [4], i.e. the establishment of a relationship between the documents in the database and the web access log, into an interactive tool, it lets webmasters see the documents in their website as well as the hyperlinks traveled (represented visually as links) by visitors. WebViz also enables webmasters to filter the access log by domain names or DNS numbers, directory names, and start and stop times, and to play back the events in the access log. The drawback of WebViz is that it was designed to visualize only the statistical properties of the data (i.e. frequency and recency information); it cannot handle sequential patterns (i.e. it has no related data mining tools). It displays web log accesses in a two-dimensional directed graph, but without enough consideration of appropriate web traversal properties. WebViz also does not visualize modern dynamic pages effectively.
WUM: A Web Utilization Miner
In 1998, Spiliopoulou et al. presented the Web Utilization Miner (WUM) [5], a mining system for the discovery of interesting navigation patterns. One of the most important features of WUM is that, using WUM's mining language MINT, a human expert can dynamically specify the interestingness criteria for navigation patterns. These include criteria of a statistical, structural and textual nature. To discover the navigation patterns that satisfy the expert's criteria, WUM uses an Aggregation Service that extracts information from the web access log and retains aggregated statistical information. Although WUM provides an integrated and robust environment for web log analysis, it focuses mainly on data mining and its mining language, and so lacks an aesthetic visualization of the data as well as consideration of appropriate web traversal properties and dynamic pages.
VISVIP: 3D Visualization of Paths through Websites
Cugini et al. proposed VISVIP [6] in 1999, a tool that allows website developers and usability engineers to visualize the paths taken through a site. The graphical layout of the website can be dynamically customized and simplified, and the subjects whose paths are to be viewed can be selected dynamically. VISVIP also provides an animated representation of traversal along a path through the website. The time spent on each page visit can be represented using the third dimension of the 3D display. VISVIP gives a good visualization of the paths taken through the site, but it has drawbacks: for example, it needs client customization to instrument a website so as to record the activity of a subject navigating the site. Like almost all currently available visualizations, VISVIP also does not consider appropriate web traversal properties or dynamic pages.
In the most recent work, Hong et al. proposed WebQuilt [7] in 2001 as a web logging and visualization system that helps web design teams run usability tests and analyze the collected data. To overcome many of the problems of server-side and client-side logging, WebQuilt uses a proxy to log the activity. It aggregates the logged usage traces and visualizes them in a zooming interface that shows the web pages viewed. It also shows the most common paths taken through the website for a given task, as well as the optimal path for that task. Like almost all currently available visualizations, WebQuilt has the drawback that it does not consider appropriate web traversal properties or dynamic pages.
3. NAVIZ CONCEPT
Agrawal et al. [1] laid the foundation of sequential pattern mining in 1995; it was later also widely used to discover traversal paths from web log data. However, finding interesting knowledge in the discovered patterns is not an easy task, since the number of discovered patterns is generally large and there is no metric that represents their usefulness well. A visualization tool is thus needed to help human users interpret them. As mentioned in the previous chapter, the drawbacks of current visualization tools include the inability to handle sequential patterns, the lack of an aesthetic visualization, and the need for client-side customization; a drawback common to most of them is that they do not consider modern dynamic pages. Taking into account the inherent web traversal topology, which is intrinsically a directed cyclic graph, Naviz was designed to overcome these drawbacks by combining a two-dimensional graph of the traversal diagram with facilities for visualizing interesting traversal paths, selected by specifying certain visited pages and path attributes such as the number of hops, support and confidence. It combines the power of sequential pattern mining tools with the intuitive look of graph drawing tools to create an interactive visualization of traversal paths in an aesthetic graph layout. Since the log miner and the graph layout producer are separate from the visualization part, Naviz can easily adopt newer versions of these two components whenever they become available, while the development of its visualization part can continue independently.
A) Appropriate Web Traversal Properties
A visualization of web log data should, first of all, give a good understanding of global visitor access. To achieve this goal, it is important to consider the inherent web traversal topology, i.e. that WWW visitor access traversals are intrinsically directed cyclic graphs. This can be thought of as a hypermedia-like structure, with nodes representing pages and edges representing traversals between pages. For visualizing visitor traversal on a website, introducing the two web traversal properties below was found to be very helpful.
Hierarchical structure regarding traversal traffic, i.e. more heavily traversed edges are placed at higher levels and less traversed edges at lower levels. This gives an intuitive picture of visitors traversing the site, who enter at the top of the graph and exit at the bottom.
Grouping of related pages, i.e. pages that have a high transition probability among them are better placed near each other. This page grouping helps webmasters find easily which pages are related to each other.
By incorporating these two properties, Naviz can visualize visitor behavior on a website in a way that is effective and easy to understand.
B) Visitor Behavior in Traversal Diagram
First, the notions of page class and instance used in this thesis are explained. A page class is an abstraction of web pages that offer the same navigational functions in the web traversal topology. An instance is a member of a page class that has certain semantics. Instance membership is user defined. Usually an instance is defined by the parameters of its page class; however, some parameters can be ignored so that one instance covers several different representations in the web access logs. For example, logs often include the visitor ID as a CGI parameter. The user can define logs with certain CGI parameters as one instance even though they have different visitor IDs, since this CGI parameter does not affect the semantics of the instance. The page class is a generalization of the usual definition of a web page, since a static web page can be seen as a page class with a single instance.
To represent navigational behavior, a traversal path, i.e. a sequence of page classes traversed by visitors, is used. Following association rules, some parameters for assessing the strength of a traversal path can be defined:
Support of a traversal path A → B is the percentage of sessions that contain the sequence A → B out of the total number of sessions.
Users of a traversal path A → B is the percentage of visitors that traversed from A to B out of the total number of visitors.
The confidence of a traversal path is the probability that a visitor visits the last page in the sequence of the traversal path. For example, the confidence of traversal path A → B → C → D is the probability that a visitor visits page D after visiting pages A → B → C consecutively. It can be obtained by dividing the number of sessions that contain the sequence A → B → C → D by the number of sessions that contain the sequence A → B → C. When the sequence contains only two page classes A → B, the confidence can also be interpreted as the page transition probability from A to B.
To derive traversal paths from web access logs, the GSP algorithm [1] is used. Note that Naviz is not limited to this algorithm. The capability to set constraints, such as deriving only the traversal paths of visitors with certain attributes, has also been added. For example, to extract the traversal paths of daytime visitors, the web logs are filtered to those whose accesses fall between 6:00 and 18:00.
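The support and confidence of a traversal path defined above can be sketched as follows. This is an illustration only: the function names are hypothetical, sessions are assumed to be lists of page-class names, and a path is counted only when its pages are visited consecutively, following the wording of the confidence definition. A miner such as GSP may instead allow other pages in between.

```python
def count_containing(sessions, path):
    """Number of sessions in which path occurs as consecutive visits."""
    n = len(path)
    return sum(
        any(session[i:i + n] == path for i in range(len(session) - n + 1))
        for session in sessions
    )


def support(sessions, path):
    """Share of all sessions that contain the traversal path."""
    return count_containing(sessions, path) / len(sessions)


def confidence(sessions, path):
    """Sessions containing the whole path divided by the sessions
    containing the path without its last page."""
    prefix_count = count_containing(sessions, path[:-1])
    if prefix_count == 0:
        return 0.0
    return count_containing(sessions, path) / prefix_count


sessions = [
    ["A", "B", "C", "D"],
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
]
print(support(sessions, ["A", "B"]))               # 0.75
print(confidence(sessions, ["A", "B", "C", "D"]))  # 0.5
print(confidence(sessions, ["A", "B"]))            # 0.75
```

For the two-page path the confidence equals the transition probability from A to B, as noted in the text: three of the four sessions that visit A continue directly to B.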
[Fig. 3. Traversal diagram: page as node, traversal as edge, support as thickness, confidence as color]
To acquire a global view of navigational behavior, all traversal paths above certain parameter thresholds need to be drawn. As mentioned before, the drawing must handle directed cyclic graphs, hierarchical structure and node grouping. The shortest traversal paths, those with a pair of page classes, are used as the underlying unit of drawing, since they also convey the relationship between two adjacent page classes. Finally, the traversal diagram is drawn as a directed graph with page classes as the nodes and the traversal paths between any pair of page classes as the weighted edges (Fig. 3).
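One way to hand such a diagram to a graph drawing tool is to emit it in the Graphviz DOT language. The sketch below is an assumption about how this could be done, not the Naviz code: the function name, the edge data layout, the width scaling and the two-color rule are all invented for the example. `rankdir`, `penwidth` and `color` are standard DOT attributes.

```python
def to_dot(edges, min_support=0.0, min_confidence=0.0):
    """Build a Graphviz DOT description of a traversal diagram.

    edges maps (source, target) page-class pairs to (support, confidence).
    Edges below either threshold are left out of the drawing.
    """
    lines = ["digraph traversal {", "  rankdir=TB;"]
    for (src, dst), (sup, conf) in sorted(edges.items()):
        if sup < min_support or conf < min_confidence:
            continue
        width = 1 + 9 * sup                       # support as thickness
        color = "red" if conf >= 0.5 else "gray"  # confidence as color
        lines.append(
            f'  "{src}" -> "{dst}" [penwidth={width:.1f}, color="{color}"];'
        )
    lines.append("}")
    return "\n".join(lines)


# Hypothetical edge data, for illustration only.
edges = {
    ("Top", "Region"): (0.30, 0.60),
    ("Top", "Exit"): (0.01, 0.30),
}
print(to_dot(edges, min_support=0.05))
```

Raising `min_support` or `min_confidence` reduces the number of edges in the output, which mirrors the way thresholds control how much of the diagram is drawn.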
C) Algorithm for Drawing Directed Graphs
To draw the underlying graph layout that represents the traversal diagram, Naviz uses a graph drawing tool called Graphviz, created by Gansner et al. In their 1993 paper [3], they describe a four-pass algorithm for drawing directed graphs. The first pass assigns discrete ranks to nodes. In a top-to-bottom drawing, ranks determine Y coordinates. Edges that span more than one rank are broken into chains of virtual nodes and unit-length edges. The second pass orders nodes within ranks to minimize crossings. The third pass sets the X coordinates of nodes to keep edges short. The last pass routes edge splines. In addition, as guidance for the algorithm, Gansner et al. propose the following aesthetic principles:
Expose hierarchical structure in the graph.
Avoid visual anomalies that do not convey information about the underlying graph.
Keep edges short.
Favor symmetry and balance.
This algorithm has been shown to draw graph layouts well and quickly while still maintaining the aesthetic principles. Furthermore, it satisfies the requirements of the appropriate website traversal properties that Naviz needs in order to draw the traversal diagram.
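To make the first pass more concrete, the sketch below assigns ranks with a simple longest-path rule after ignoring edges that close a cycle. This is a simplified stand-in chosen for illustration, not the ranking procedure of [3]; the function name and the input layout are assumptions.

```python
def assign_ranks(nodes, edges):
    """Assign a discrete rank to every node (rank 0 is drawn at the top).

    Edges that close a cycle are ignored while ranking; every remaining
    edge then points from a lower rank to a strictly higher one.
    """
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)

    state = {n: "new" for n in nodes}
    forward = {n: [] for n in nodes}   # edges kept for ranking
    postorder = []

    def visit(n):
        state[n] = "active"
        for m in successors[n]:
            if state[m] == "active":
                continue               # edge back into the current path
            forward[n].append(m)
            if state[m] == "new":
                visit(m)
        state[n] = "done"
        postorder.append(n)

    for n in nodes:
        if state[n] == "new":
            visit(n)

    # Reverse postorder is a topological order of the kept edges, so a
    # node's rank is final by the time its outgoing edges are relaxed.
    rank = {n: 0 for n in nodes}
    for n in reversed(postorder):
        for m in forward[n]:
            rank[m] = max(rank[m], rank[n] + 1)
    return rank


# Hypothetical page classes with a cycle back to the entry page.
nodes = ["Top", "List", "Result"]
edges = [("Top", "List"), ("List", "Result"), ("Result", "Top")]
print(assign_ranks(nodes, edges))  # {'Top': 0, 'List': 1, 'Result': 2}
```

In a top-to-bottom drawing these ranks would become Y coordinates, so the entry page ends up at the top and pages reached later further down.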
4. NAVIZ IMPLEMENTATION
A) System Architecture
Naviz was designed to visualize traversal paths discovered by data mining on a web log file. It was developed as a Java JApplet program, which runs on most modern browsers with the Java 2 plug-in installed. It combines a separate log miner that uses the GSP sequential pattern mining algorithm [1]; a graph layout producer, Graphviz [8], an open source graph drawing package from AT&T that implements the algorithm in [3]; and Naviz's own ability to analyze the discovered sequential patterns interactively. As shown in Fig. 4, Naviz is a client-server application that consists of the following components.

[Figure: Pattern Repository, Graphviz, Log Miner, Log File]
Fig. 4. Naviz architecture

A Java applet as the user interface on the client side. The Naviz applet displays the traversal diagram and traversal paths to the user, responds interactively to user requests and passes them to the server through servlets. Users can set the thresholds of support value and confidence degree to control the number of nodes and edges displayed in the traversal diagram. For traversal path visualization, users can specify the number of hops and the visited pages to find interesting paths. Furthermore, users can define the place, category and time class of the underlying web log records to be visualized, so that the behavior of different visitor groups can be compared. Users can also interactively change the hierarchization and page grouping strategies to explore the best possible structure for the traversal diagram.

Java servlets on the server side, as the interface between client and server. Because of the security restrictions applied to applets, the Naviz applet can neither write directly to the server's file system nor run programs on it. Hence the Naviz servlets act as the interface between the applet and the server-side programs, over HTTP connections between client and server.

Server-side applications:
Pattern repository, containing the discovered patterns. Naviz uses a pattern repository to manage the miner's output files, such as the traversal paths discovered from web log data. Currently the repository contains traversal paths only. As the visualization capability of Naviz improves, the repository will be expanded to handle other patterns, such as clustering results.
Graphviz, which is responsible for drawing the graph layout. It lets us specify detailed drawing parameters that reflect the appropriate web traversal properties. Moreover, it outputs the coordinate and size information of the graph in an easily readable format for Naviz to use.
Log miner, which discovers traversals and paths from web log data. Naviz currently implements interactive communication between the applet and the log miner: if the applet requests a pattern that does not exist in the pattern repository, the Naviz servlets automatically launch the log miner to find that pattern in the web log file. The Naviz mining engine currently implements association rule mining, sequential pattern mining (traversal paths), and clustering.
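The on-demand mining just described (look in the pattern repository first, launch the log miner only when the pattern is missing) can be sketched as follows. The class, its in-memory dictionary, and the stand-in miner are hypothetical simplifications written for illustration; they do not reproduce the Naviz servlets or the real repository.

```python
class PatternRepository:
    """Stores mined patterns per (time class, place, category) request."""

    def __init__(self, miner):
        self._miner = miner      # callable that mines patterns for one request
        self._patterns = {}      # request key -> mined patterns
        self.miner_runs = 0      # counts how often the miner was launched

    def get(self, time_class="All", place="All", category="All"):
        key = (time_class, place, category)
        if key not in self._patterns:
            # Pattern not in the repository yet: launch the miner and store its output.
            self.miner_runs += 1
            self._patterns[key] = self._miner(*key)
        return self._patterns[key]


def fake_log_miner(time_class, place, category):
    """Stand-in for the real log miner; returns one made-up path."""
    return [{"nodes": ["Start", "Top"], "support": 50.0,
             "classes": (time_class, place, category)}]


repository = PatternRepository(fake_log_miner)
first = repository.get(place="Tokyo")    # triggers mining
second = repository.get(place="Tokyo")   # answered from the repository
print(repository.miner_runs)
```

The second request for the same class instance is served without running the miner again, which is the point of keeping a repository between the applet and the log file.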
[Figure: Pattern Repository (paths data, edges data), Naviz Servlet (paths array, edges array), Naviz Applet, GraphViz (graph layout: nodes, edges), Layout Cache, Log Miner]
Fig. 5. Naviz detail mechanism

Fig. 5 shows the detailed mechanism of Naviz. Naviz operates on an internal representation of the graph and lets the user view the graph as a traversal diagram. Naviz, which runs as an applet in the client browser, starts GraphViz on the server through servlets to compute the graph layout. Whenever a user asks for a new layout, Naviz first tries to find the layout in the cache; if it does not exist there, Naviz reads the edge data from the pattern repository and sends it to GraphViz. GraphViz computes the layout and sends the graph back to the Naviz applet, with the coordinate and size information attached as graph attributes. The applet then redraws the traversal diagram using the new layout. The traversal path data is read from the pattern repository, sent directly back to the applet, and drawn on top of the traversal diagram. When edge and path data are read from the pattern repository, the user-specified minimum support and confidence are applied as a filter to reduce the number of patterns. Whenever a user requests a pattern that is not available in the pattern repository, the Naviz servlet launches the log miner to find that pattern in the web log file.
of the panel indicate the place, category, and start and finish time classes of the traversal diagram being displayed. "All" means that the traversal diagram includes all instances of the corresponding class. These combo boxes also define the properties of the new traversal diagram that is displayed by clicking the View button; the size of the new window can be specified in the text fields beside the button. The Minsup and Minconf slide bars set the minimum support and confidence of the edges shown in the traversal diagram. The Path button switches between traversal diagram mode and traversal path mode, which are described below. The slide bar below the button selects the number of the sequential rule to be shown. The four buttons ||, <>, <|, and |> search the repository for sequential rules using the closed-closed, open-open, open-closed, and closed-open methods (described later), respectively. The combo box beside them has the options sup, conf, and hop, which sort the sequential rules by support, confidence, and hop number, respectively, in either ascending (v button) or descending (^ button) order. The check box and text box below the combo box enable the hop number restriction on path searching and take its value. The Layout button changes the layout of the underlying traversal diagram according to the minimum support, the minimum confidence and other properties. Clicking the _|_|_| button opens a new window with six traversal diagrams that display the top six traversal paths in order. The remaining Time, Place, and Ctgry buttons are a convenient way to compare behavior by time, place, and category class, respectively.
B) Visualization Features
Using Naviz, webmasters can analyze web log data to discover the navigational behavior of their visitors. In our experiment Naviz revealed 1) the traversal diagram of visitor traversals on the site, 2) how many visitors visit each page, 3) how high the transition probability between two pages is, 4) which pages are related to each other, 5) how many hops are needed to reach a certain page, 6) which path drove the most visitors to exit from the site, 7) which path led the most visitors to find their goal, and 8) whether some visitors got lost in the website.

Naviz has two operation modes: traversal diagram mode and traversal path mode. In traversal diagram mode, Naviz displays a traversal diagram that describes global visitor traversals over the entire site. It is a two-dimensional graph whose nodes represent pages and whose edges represent traversals between pages. The thickness of an edge represents the support value of the traversal, and the color of an edge (ranging from blue to red) represents its confidence degree (low to high). In traversal path mode, Naviz displays a traversal path on top of the traversal diagram, one path at a time. The paths shown can be filtered by specifying the visited pages and, optionally, the number of hops required to reach those pages. The paths found are then displayed in the traversal diagram one after another and can be ordered by support, confidence, or number of hops. Naviz uses several strategies to form an effective view of the traversal diagram and traversal paths:
1. Consideration of appropriate web traversal properties.
Edge hierarchization. By default, more traversed edges are placed at upper levels and less traversed ones at lower levels. This intuitively describes visitor traversals on a website from the top to the bottom of the graph and gives a better understanding of the traversal diagram. The strategy can be changed interactively, for example to place higher-confidence edges at upper levels and lower-confidence edges at lower levels, although this may not be suitable for visualizing the traversal diagram. Moreover, the graph orientation can be switched between vertical and horizontal.
Page grouping. In visualizing the traversal diagram, it is often useful to group pages that are related to each other. This is done by giving each edge a weight that indicates its importance, so that a more important edge has a shorter length. By default Naviz binds the weight to the confidence degree of edges, so that pages with a high degree of confidence (transition probability) are grouped together. The weight can be rebound interactively to the support value, so that pages with high support (heavy traversal traffic among them) are grouped together. Again, this latter strategy may not be suitable for visualizing the traversal diagram.

2. Navigational behavior comparison.
Web log data contains information about various properties of a visit, such as time, place and category. Using this information, Naviz can display traversal diagrams of the same website for different property values in different windows placed side by side. This is particularly useful for comparing visitor behavior on the same website across those property values. In the case of Mobile Townpage, three important properties have been defined: time, place, and category. The time class has several instances of time spans in a day; the place class has instances of prefectures in Japan, such as Tokyo and Osaka; and the category class has instances such as karaoke, izakaya and restaurant. Naviz can therefore compare, for example, day behavior vs. night behavior, Tokyo people's behavior vs. Osaka people's behavior, or karaoke people's behavior vs. izakaya people's behavior.

3. Traversal path filtering.
Traversal paths are visualized over the underlying traversal diagram, with only one path shown at a time. The path number slide bar is used to select the path to show. For traversal path visualization, Naviz implements two filtering mechanisms: filtering by number of hops (path length) and filtering by visited pages. The number of displayed paths decreases or increases as we decrease or increase the number of hops. We can choose which pages are visited and find the paths of visitors that traverse those pages. There is no limit on the number of pages to choose. There are four methods to find the paths:
Closed-closed method: find paths that begin exactly at the first chosen page, traverse the chosen pages in order, and finish exactly at the last chosen page.
Closed-open method: find paths that begin exactly at the first chosen page, traverse the chosen pages in order, and finish at any page.
Open-closed method: find paths that begin at any page, traverse the chosen pages in order, and finish exactly at the last chosen page.
Open-open method: find paths that begin at any page, traverse the chosen pages in order, and finish at any page.
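The four search methods differ only in whether the two ends of a path are pinned to the first and last chosen pages. The Python sketch below is one possible reading, assuming that the chosen pages must occur in the given order but that other pages may lie between them; the function names and sample paths are hypothetical and are not taken from the Naviz source.

```python
# (start pinned to first chosen page, end pinned to last chosen page)
METHODS = {
    "closed-closed": (True, True),
    "closed-open": (True, False),
    "open-closed": (False, True),
    "open-open": (False, False),
}


def visits_in_order(path, chosen):
    """True if the chosen pages occur in path in the given order."""
    remaining = iter(path)
    return all(page in remaining for page in chosen)


def matches(path, chosen, method):
    closed_start, closed_end = METHODS[method]
    if not path or not chosen:
        return False
    if closed_start and path[0] != chosen[0]:
        return False
    if closed_end and path[-1] != chosen[-1]:
        return False
    return visits_in_order(path, chosen)


def find_paths(paths, chosen, method, max_hops=None):
    """Filter by visited pages and, optionally, by number of hops (pages - 1)."""
    found = [p for p in paths if matches(p, chosen, method)]
    if max_hops is not None:
        found = [p for p in found if len(p) - 1 <= max_hops]
    return found


p1 = ["Start", "Top", "Local Top", "Input Form", "Search Result", "End"]
p2 = ["Top", "Local Top", "Input Form", "Search Result"]
p3 = ["Start", "Top", "End"]
p4 = ["Top", "Local Top", "Input Form", "Search Result", "End"]
paths = [p1, p2, p3, p4]
chosen = ["Top", "Search Result"]

for method in METHODS:
    print(method, len(find_paths(paths, chosen, method)))
```

With these sample paths the open-open search is the most permissive and the closed-closed search the most restrictive, as the definitions suggest.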
4. Interactive environment.
Furthermore, Naviz uses its interactive environment to provide more ways of exploring navigational behavior in the data. The layout of the traversal diagram and the strategies of hierarchization, grouping, and filtering can all be changed interactively to select the structure that best represents the web log data. When the user points the mouse at a particular node or edge, Naviz shows detailed information about it below the graph, such as the page's title and URL or the edge's support value and confidence degree. Clicking on a node either opens the corresponding page in the browser (in traversal diagram mode) or selects the page for path searching (in traversal path mode).
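The visual encoding and the grouping strategy described in this section map naturally onto edge attributes of the Graphviz DOT language. The following Python sketch generates such a description under stated assumptions: `penwidth`, `color`, `weight` and `rankdir` are attributes of current Graphviz releases, while the scaling factors, the function names and the sample values are illustrative choices and not the input that Naviz actually sends to Graphviz. Edge hierarchization is not modeled here.

```python
def confidence_color(confidence):
    """Map confidence (0-100) linearly from blue (low) to red (high)."""
    t = max(0.0, min(100.0, confidence)) / 100.0
    red = round(255 * t)
    blue = 255 - red
    return f"#{red:02x}00{blue:02x}"


def to_dot(edges, group_by="confidence", rankdir="TB"):
    """Render {(src, dst): (support, confidence)} as DOT text."""
    lines = ["digraph traversal {", f"  rankdir={rankdir};"]
    for (src, dst), (support, confidence) in sorted(edges.items()):
        grouping = confidence if group_by == "confidence" else support
        attrs = [
            f"penwidth={1 + support / 10:.1f}",          # thickness from support
            f'color="{confidence_color(confidence)}"',   # color from confidence
            f"weight={max(1, round(grouping))}",         # heavier weight, shorter edge
        ]
        lines.append(f'  "{src}" -> "{dst}" [{", ".join(attrs)}];')
    lines.append("}")
    return "\n".join(lines)


# Sample values only, rounded for readability.
edges = {
    ("Top", "Region List"): (46.9, 55.1),
    ("Top", "End"): (30.8, 36.2),
}
dot = to_dot(edges)
print(dot)
```

Passing `group_by="support"` rebinds the edge weight to support, and `rankdir="LR"` switches the orientation from vertical to horizontal.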
5. NAVIZ EXPERIMENT
A) NTT i-Townpage served on i-Mode
Townpage is the name of the "yellow pages", a directory service of phone numbers in Japan. It consists of 11 million listings under 2000 categories. In 1995 it started an internet version, i-Townpage, at http://itp.ne.jp/. Visitors of i-Townpage can specify the location and other search conditions, such as a business category or free keywords, and get a list of the companies or shops that match, together with their phone numbers and addresses. Visitors can input the location by browsing the address hierarchy or by giving the nearest station or landmark. Currently i-Townpage records about 40 million page views monthly. At this moment i-Townpage has four versions:
Standard version: for access with an ordinary web browser; it is also equipped with features such as online maps.
Lite version: a simplified version for devices with limited display capability, such as PDAs.
Mobile Townpage: a version for i-Mode users. An illustration of its usage is shown in Fig. 7. i-Mode users can make a call directly from the search results.
L-mode version: a version for L-mode access, a new service from NTT that enables internet access from stationary (fixed) phones.

[Fig. 7 screens: category selection (1. Hotels, 2. Restaurants, 3. Airlines, 4. Travel Agencies, 5. Book Dealers-Retail); search confirmation ("Ready for search?"); result list (Results 1-10 of 495; 1. Agnes Hotel, 2-20-1 Kagurazaka, Shinjuku-ku, Tokyo, 03-3267-5505)]

Because of the limited display and communication capabilities of mobile phones, a website for mobile phones has to be designed carefully so that visitors can reach their goal with the fewest clicks. Mobile Townpage is not simply a reduced version of the standard one; it was completely redesigned to meet the demands of i-Mode users. Nearly 40% of i-Townpage visitors access it from mobile phones. Fig. 7 gives a simple illustration of typical Mobile Townpage usage. A visitor first inputs an industry category, either by choosing from the prearranged Category List or by entering free keywords in the Input Form. The visitor then decides the location. Afterward he or she can begin the search or browse to a more detailed location, and then gets the Search Result.
Table 4. Edges discovered from web log record sorted by support value
start_node | end_node | support | confidence | from_hour | to_hour | place | category
Start | Top | 71.3054 | 76.8053 | 0 | 0 | 0 | 0
Prefecture List | Local Top | 61.6683 | 82.5526 | 0 | 0 | 0 | 0
Region List | Prefecture List | 47.684 | 92.1699 | 0 | 0 | 0 | 0
Top | Region List | 46.9162 | 55.1483 | 0 | 0 | 0 | 0
Start | Top | 35.8722 | 75.0839 | 13 | 18 | 0 | 0
Start | Top | 35.4333 | 78.6304 | 19 | 12 | 0 | 0
Prefecture List | Local Top | 33.9377 | 84.0869 | 13 | 18 | 0 | 0
Local Top | Category List | 32.5073 | 35.2533 | 0 | 0 | 0 | 0
Local Top | Input Form | 31.8483 | 34.5386 | 0 | 0 | 0 | 0
Top | End | 30.8386 | 36.2496 | 0 | 0 | 0 | 0
Prefecture List | Local Top | 27.7306 | 80.7494 | 19 | 12 | 0 | 0
Search Result | End | 27.3661 | 34.0821 | 0 | 0 | 0 | 0
Region List | Prefecture List | 26.2213 | 93.0353 | 13 | 18 | 0 | 0
Top | Region List | 25.7128 | 60.3139 | 13 | 18 | 0 | 0
Category Menu 1 | Category Menu 2 | 25.0853 | 74.1879 | 0 | 0 | 0 | 0
Region List | Prefecture List | 21.4627 | 91.1342 | 19 | 12 | 0 | 0
Top | Region List | 21.2034 | 49.9595 | 19 | 12 | 0 | 0
Local Top | Input Form | 18.8155 | 37.511 | 13 | 18 | 0 | 0
Category Menu 2 | Category Menu 3 | 18.6382 | 43.5817 | 0 | 0 | 0 | 0

Table 5. Paths discovered from web log record sorted by support value
id | node | support | confidence | from_hour | to_hour | place | category
11 | Start | 76.8053 | -0.1041 | 0 | 0 | 0 | 0
11 | Top | 76.8053 | -0.1041 | 0 | 0 | 0 | 0
26 | Prefecture List | 57.9264 | -0.0785 | 0 | 0 | 0 | 0
26 | Local Top | 57.9264 | -0.0785 | 0 | 0 | 0 | 0
182 | Top | 45.806 | -0.0621 | 0 | 0 | 0 | 0
182 | Region List | 45.806 | -0.0621 | 0 | 0 | 0 | 0
291 | Region List | 45.6374 | -0.0619 | 0 | 0 | 0 | 0
291 | Prefecture List | 45.6374 | -0.0619 | 0 | 0 | 0 | 0
1453 | Top | 43.5543 | 95.0844 | 0 | 0 | 0 | 0
1453 | Region List | 43.5543 | 95.0844 | 0 | 0 | 0 | 0
1453 | Prefecture List | 43.5543 | 95.0844 | 0 | 0 | 0 | 0
382 | Region List | 43.4543 | 95.2163 | 0 | 0 | 0 | 0
382 | Prefecture List | 43.4543 | 95.2163 | 0 | 0 | 0 | 0
382 | Local Top | 43.4543 | 95.2163 | 0 | 0 | 0 | 0
1149 | Start | 42.2664 | 55.0306 | 0 | 0 | 0 | 0
1149 | Top | 42.2664 | 55.0306 | 0 | 0 | 0 | 0
1149 | Region List | 42.2664 | 55.0306 | 0 | 0 | 0 | 0
3798 | Top | 41.1211 | 94.4135 | 0 | 0 | 0 | 0
3798 | Region List | 41.1211 | 94.4135 | 0 | 0 | 0 | 0
3798 | Prefecture List | 41.1211 | 94.4135 | 0 | 0 | 0 | 0
3798 | Local Top | 41.1211 | 94.4135 | 0 | 0 | 0 | 0
2892 | Start | 40.0719 | 94.8079 | 0 | 0 | 0 | 0
2892 | Top | 40.0719 | 94.8079 | 0 | 0 | 0 | 0
2892 | Region List | 40.0719 | 94.8079 | 0 | 0 | 0 | 0
2892 | Prefecture List | 40.0719 | 94.8079 | 0 | 0 | 0 | 0
Data mining was performed on the web log data of the Mobile Townpage site from 1 to 7 May 2000, 15 GB in size, using a minimum support threshold of 0.1%. It currently yields a set of traversal patterns containing 8611 edges (traversals between two pages), such as those in Table 4, and 58125 paths (traversals consisting of more than two pages), such as those in Table 5 (these figures may increase as more patterns of instances are discovered). Edges are essentially paths consisting of two pages only, start_node and end_node, and are the basic elements that form the traversal diagram; an example is the edge Start -> Top. Paths usually consist of more than two pages, for example path no. 1453: Top -> Region List -> Prefecture List. The support of an edge or path reflects the traffic of visitors going through it. Confidence is meaningful for edges only, where it is the transition probability from start_node to end_node. The fields from_hour and to_hour hold an instance of class Time; place and category hold instances of class Place and Category, respectively. This instance information is used to perform behavior comparison.

Traversal diagram visualization
Fig. 8 shows the edges as visualized by Naviz in traversal diagram mode; it gives a global view of visitor traversals over the entire Mobile Townpage site. Here the minimum-maximum support is set to 0.6%-100% and the minimum-maximum confidence to 0%-100%. In total, the 115 most important edges out of the 8611 edges available in the repository are visualized. The Start and End pages are not actual pages of the site; they stand for other sites located somewhere on the Web and indicate the entry and exit doors to and from the site. The important pages include Top, the site's top page; Category List, where visitors choose their category option; Input Form, where visitors fill in free keywords; and Search Result, which indicates that visitors found the answer and thus reached their goal. We can also see in the traversal diagram that many edges come into or go out from Search Result, which indicates the importance of that page.

When visualizing the traversal diagram, Naviz considers the appropriate web traversal properties, so that we can think of visitors as coming into the site at the top of the graph and going out at the bottom. The most traversed edges, the thick ones, which form the page sequence Top -> Region List -> Prefecture List -> Local Top, are placed in the upper part of the graph, while less traversed edges, such as City Menu 1 -> City Menu 2 and Station Menu 1 -> Station Menu 2, are placed in the lower part. Naviz also lets us find related pages easily. There are several groups of related pages, which indicates that those pages have a high transition probability among them: the group (Top, Region List, Prefecture List, Local Top, Commercial), the group (Category Menu 1, Category Menu 2), the group (Station Menu 1, Station Menu 2), and so on. Just by viewing the traversal diagram we can easily grasp the global behavior of visitors over the site, since looking at the traversal diagram is like looking at a picture of the visitor behavior itself. Furthermore, by analyzing the detailed values of the traversal diagram's components, we learn that more than 30% of visitors flow from the Top page directly out of the site, with a probability of more than 36%, while only about 11% of visitors go from the Local Top page to Commercial, with a probability of less than 12%. By pointing and clicking with the mouse, we can easily and interactively investigate each node and edge of the traversal diagram that we consider important and look closely at its detailed values.

Traversal path visualization
Switching to traversal path mode, Naviz discovered some interesting visitor behavior on the Mobile Townpage site. First, success paths, in which visitors succeed in finding their goal, are of interest to many webmasters. They can be identified as paths that run from Start through Search Result to End. We can find such a path with the following steps: in traversal path mode, select the pages Start, Search Result, and End in that order, optionally specify the number of hops, and then search by clicking one of the four search buttons. Since Start is the first page and End is the last page of every complete path, the four search methods (closed-closed, closed-open, open-closed, and open-open) have the same effect. As a result, Fig. 9 shows the success path with the highest traffic: about 2.8% of visitors reached the Search Result page through the path Start -> Top -> Region List -> Prefecture List -> Local Top -> Input Form -> Search Result and then exited from the site. We can see from the traversal diagram that, since there is no way to reach Search Result from the pages before the Input Form, this top success path is the most efficient way for people to reach the Search Result page, except for paths that use a bookmark. Fig. 10 shows the success paths ranked 2 to 5. The second success path shows that about 0.85% of visitors run through Start -> Search Result -> End, which means that these visitors came into the site directly at the Search Result page (they had bookmarked the result). The third success path, with about 0.84% of visitors, is the very long path Start -> Top -> Region List -> Prefecture List -> Local Top -> Category List -> City Menu 1 -> City Menu 2 -> Search Confirm -> Search Result -> End. The fourth and fifth success paths are supported by 0.63% and 0.54% of visitors, respectively.

Next of interest is the exit path, the path taken by visitors to exit from the site. The top exit path, shown in Fig. 11, gives the striking result that 28.76% of visitors left the site as soon as they arrived at the Top page. Fig. 12 shows the exit paths ranked 2 to 5. The second exit path is interesting because it is also the top success path. Ideally, webmasters want the top exit path of their website to be the top success path as well. The third, fourth, and fifth exit paths are supported by 2.44%, 2.43%, and 2.34% of visitors, respectively.

Finally, another interesting kind of path is the lost path, in which visitors fail to find their goal even after a long struggle in the labyrinth of the site. Fig. 13 shows that 0.5% of visitors failed to find their goal on a path that passes through the pages Start, Top, Region List, Prefecture List, Local Top, Category List and City Menu 1. Fig. 14 shows another lost path, in which 0.06% of visitors failed to find their goal: they first tried browsing the Category List without success; went back to Local Top and filled in some keywords in the Input Form, and failed again; then tried browsing Category Menu 1, also in vain; and eventually exited the site without getting the answer. Using Naviz in traversal path mode, we can easily and interactively investigate every path by specifying the traversed pages and the hop number. The discovered paths can then be browsed one by one on top of the traversal diagram, and their order of appearance can be set by support, confidence or hop number, in either ascending or descending order.
C) Comparison of Visitor Behavior by Time, Place and Category Class
Time based comparison
Using the comparison function of Naviz, some interesting differences between the day behavior and night behavior of i-Mode internet users can be discovered. Fig. 15 shows the top exit path of both behaviors; they share a common top exit path but differ in support value, 12.7% for day and 16.1% for night. Fig. 16 shows the second exit path; 1.82% of day visitors successfully found their goal, while 1.33% of night visitors chose instead to leave the site. The pie chart below the graph indicates the ratio of total visitors that support each behavior. Night visitors slightly outnumber day visitors. Fig. 17 shows the top success path of both behaviors. As with the exit path, they share a common top success path but differ in support value, 1.82% for day and 1.0% for night. However, the second success path in Fig. 18 shows a significant difference between the two behaviors: while 0.52% of day visitors bookmarked their previous results, 0.44% of night visitors seem to have enjoyed browsing a long path to find their goal. These facts suggest that daytime visitors have a more definite objective when browsing the site than nighttime visitors.

Location based comparison
Besides time based comparison, Naviz can also perform location based behavior comparison. Fig. 19 compares the behavior of Tokyo people and Osaka people. Since the first and second exit paths did not reveal any significant differences, the third exit path is used. It tells us that 0.36% of Tokyo people took a longer path to exit than the path taken by 0.23% of Osaka people. Fig. 20 shows the third success path of both behaviors. According to this result, 0.1% of Tokyo people took a longer path to reach their goal than the path taken by 0.06% of Osaka people. Another interesting difference is that Tokyo people tend to view the Category List page before entering the Input Form page, whereas Osaka people seem to go directly to the Input Form page. This may indicate that Tokyo people like guidance from the Category List to decide what to search for, while Osaka people already know what they want to search for.

Category based comparison
Another Naviz feature is category based behavior comparison. Fig. 21 shows the fourth exit path (which is also the third success path) of hotel people and izakaya people. According to this result, 0.07% of hotel people took a longer path than 0.05% of izakaya people to reach their goal and exit the site. Since the first three exit paths and the first two success paths do not reveal any significant differences, we may conclude that visitors of both categories behave almost the same. In essence, the comparison function of Naviz lets us compare visitor behavior on a website by time, place, and category, or any combination of them. There is no restriction on how the comparison conditions are combined.
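The comparisons above amount to ranking the paths of each class instance by support and then looking at paths of the same rank side by side. A minimal Python sketch of that idea follows; the record layout, the function name `kth_top_path` and all values are made up for illustration and are not results from the experiment.

```python
def kth_top_path(paths, k, **instance):
    """Return the k-th path (1-based) by support among paths of one class instance."""
    matching = [p for p in paths
                if all(p[field] == value for field, value in instance.items())]
    ranked = sorted(matching, key=lambda p: p["support"], reverse=True)
    return ranked[k - 1] if 1 <= k <= len(ranked) else None


# Made-up exit paths tagged with a time class instance.
paths = [
    {"time": "day", "nodes": ["Start", "Top", "End"], "support": 12.0},
    {"time": "day", "nodes": ["Start", "Top", "Input Form", "Search Result", "End"], "support": 2.0},
    {"time": "night", "nodes": ["Start", "Top", "End"], "support": 16.0},
    {"time": "night", "nodes": ["Start", "Top", "Region List", "End"], "support": 1.5},
]

for k in (1, 2):
    day = kth_top_path(paths, k, time="day")
    night = kth_top_path(paths, k, time="night")
    print(k, day["nodes"], day["support"], "|", night["nodes"], night["support"])
```

Because the filter accepts any number of keyword conditions, combined comparisons (for example a time instance together with a place instance) would follow the same pattern, provided the records carry those fields.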
Fig. 11. Top exit path of visitor behavior of NTT Mobile Townpage.
Fig. 14. Another lost path of visitor behavior of NTT Mobile Townpage.
Fig. 15. Top exit path of day behavior vs. night behavior
Fig. 16. Second top exit path of day behavior vs. night behavior
Fig. 17. Top success path of day behavior vs. night behavior
Fig. 18. Second top success path of day behavior vs. night behavior
Fig. 19. Third top exit path of Tokyo people vs. Osaka people
Fig. 20. Third top success path of Tokyo people vs. Osaka people
Fig. 21. Fourth top exit path of Hotel people vs. Izakaya people
6. CONCLUSION AND FUTURE WORK
The experiment in visualizing traversal paths discovered from the web log data of NTT i-Townpage served on the i-Mode website shows that Naviz successfully visualized a traversal diagram of visitor behavior, discovered various interesting visitor behaviors on the website, and performed various behavior comparisons based on time, place and category. We conclude that the important factors in visualizing visitor navigational behavior from web log data are: to consider appropriate web traversal properties, which make the global visitor behavior easy to grasp; to perform behavior comparison based on the classes and instances found as CGI parameters in modern dynamic pages; and to use an interactive environment, which provides greater capability to analyze the discovered traversal paths. Of course, these are not all the important factors in web log visualization; more study and experiment in web log data mining on various kinds of websites are needed. In the future, Naviz will be extended with the capability to predict future visitor behavior from past behavior. Besides visualizing traversals and paths, Naviz will also be extended to visualize the results of web access log clustering.
References
[1] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Fifth Int'l Conference on Extending Database Technology (EDBT'96), pages 3-17, Avignon, France, March 1996.
[2] S. Mukherjea and J. D. Foley. Visualizing the World-Wide Web with the Navigational View Builder. Computer Networks and ISDN Systems, 27(6):1075-1087, 1995.
[3] E. Gansner, E. Koutsofios, S. North, and K. Vo. A technique for drawing directed graphs. IEEE Transactions on Software Engineering, 19(3):214-230, March 1993.
[4] J. Pitkow and K. Bharat. WebViz: A Tool for World-Wide Web Access Log Analysis. In Proceedings of the First International Conference on the World-Wide Web, 1994.
[5] M. Spiliopoulou and L. C. Faulstich. WUM: A Web Utilization Miner. In EDBT Workshop WebDB'98, Valencia, Spain, 1998. Springer Verlag.
[6] J. Cugini and J. Scholtz. VISVIP: 3D Visualization of Paths through Websites. In Proceedings of the International Workshop on Web-Based Information Visualization (WebVis'99), pages 259-263, Florence, Italy, September 1-3, 1999. IEEE Computer Society.
[7] J. I. Hong and J. A. Landay. WebQuilt: A Framework for Capturing and Visualizing the Web Experience. In Proceedings of the Tenth International World Wide Web Conference (WWW10), Hong Kong, May 2001.
[8] http://www.research.att.com/sw/tools/graphviz/
[9] E. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, 1983.
[10] O. Etzioni. The World Wide Web: Quagmire or Gold Mine. Communications of the ACM, 39(11):65-68, 1996.
[11] L. Catledge and J. Pitkow. Characterizing browsing behaviors on the World Wide Web. Computer Networks and ISDN Systems, 27(6), 1995.
[12] http://www.w3.org/WCA, http://www.w3.org/1999/05/WCAterms/
[13] http://www.nua.ie/surveys/how_many_online/
APPENDIX: NAVIZ PROGRAM SOURCE CODE
A1 NAVIZ APPLET
A2 NAVIZ SERVLETS
A2.1 NAVIZ SERVLET
A2.2 NAVIZ PATH SERVLET
A2.3 NAVIZ ATTRIBUTE SERVLET
A3
A4 DATA TYPE CLASSES
A4.1 TYPES OF NODE AND EDGE DATA
A4.2 TYPES OF REQUEST AND RESPONSE OF SERVLETS
A5

FIGURE A 1. NAVIZ ARCHITECTURE
FIGURE A 2. NAVIZ APPLET MECHANISM
FIGURE A 3. NAVIZ LAUNCHER
FIGURE A 4. ALGORITHM FOR PATH SELECTION
FIGURE A 5. ALGORITHM FOR INCLUDE FUNCTION: OPEN OPEN
FIGURE A 6. ALGORITHM FOR INCLUDE FUNCTION: OPEN CLOSED
FIGURE A 7. ALGORITHM FOR INCLUDE FUNCTION: CLOSED OPEN
FIGURE A 8. ALGORITHM FOR INCLUDE FUNCTION: CLOSED CLOSED
Naviz is designed as a client-server application with an applet on the client side and several servlets plus applications on the server side. Roughly, as shown in Figure A 1, the Naviz components are as follows:
1. Naviz Applet Class
2. Naviz Servlet Classes
   1. Naviz Servlet: NavizServlet
   2. Naviz Path Servlet
   3. Naviz Attribute Servlet
3. Helper Classes
   1. Class of Canvas to Draw Traversal Diagram (NavizCanvas)
   2. Class of Thread Parallelization (SwingWorker)
   3. Classes of Applet Communication
      Message Relayer
      Applet Communicator (Interface)
4. Data Type Classes
   1. Types of Node and Edge Data
      Naviz Node
      Naviz Edge
   2. Types of Request and Response of Servlets
      Request and Response of Naviz Servlet: Naviz Request, Naviz Response
      Request and Response of Naviz Path Servlet: Naviz Path Request, Naviz Path Response
      Request and Response of Naviz Attribute Servlet: Naviz Attribute Request, Naviz Attribute Response
A1
[Figure A 1. Naviz Architecture. Diagram labels: Main Class, Naviz Request, Naviz Response, Naviz Path, GraphViz, Log Miner, MySQL Database]

NavizApplet implements the AppletCommunicator interface so that it can communicate with other applets on the client side, and it instantiates MessageRelayer to open the communication. It uses NavizCanvas to draw the traversal diagram, and SwingWorker instances, each of which starts a new thread that opens an HTTP connection to one servlet on the server side, so that the connections run in parallel. The primary task of NavizServlet is to read node (NavizNode) and edge (NavizEdge) data from the MySQL database (the Pattern Repository) and feed it to GraphViz to lay out the traversal diagram. If the requested patterns do not exist in the repository, it launches Log Miner to mine the patterns from the web log file and store them in the repository. NavizServlet receives requests from the applet as NavizRequest objects and sends responses as NavizResponse objects. NavizPathServlet reads path data from the repository and uses NavizPathRequest and NavizPathResponse to communicate with the applet. NavizAttributeServlet uses NavizAttributeRequest and NavizAttributeResponse to communicate with the applet and reads attribute data from the repository, such as time, place, and category, which are used for behavior comparison.

A1 Naviz Applet
[Figure A 2. Naviz Applet Mechanism: connectToServlet and connectToPathServlet each run in a SwingWorker (construct, finish); the event handlers of the controls (support and confidence scroll bars, hop fields, sort, order, weight, width and color selectors, path and path-type buttons) call drawBezier, minmaxSupConf and getSubPaths]
[Figure A 3. Naviz Launcher: NavizLauncher.jsp launches NavizApplet.jsp, which embeds NavizApplet]

Naviz Applet is the core of the Naviz visualization system. It is embedded in a JSP page called NavizApplet.jsp, which is launched by NavizLauncher.jsp as shown in Figure A 3. Its mechanism is shown in Figure A 2. When NavizApplet is initialized, it calls connectToServlet, connectToPathServlet, and connectToAttributeServlet, each of which opens an HTTP connection to the corresponding servlet in parallel. After its response is received, connectToServlet calls drawBezier to spline the edges and sets the data on NavizCanvas. Upon receiving its response, connectToPathServlet calls minmaxSupConf to assign the minimum and maximum support and confidence to each path, calls getSubPaths to find the paths that satisfy the current conditions, and sets those paths on NavizCanvas. These functions must be called after connectToServlet has received its response from the servlet. If connectToPathServlet receives its response first, it does not call them; connectToServlet calls them instead. connectToAttributeServlet sets the comparison data on NavizCanvas. Many event handlers also call connectToServlet to request a new layout from the server, and getSubPaths to find the satisfying paths. Once the edges and paths of NavizCanvas are set, the canvas is responsible for drawing the traversal diagram or path on the screen.

The getSubPaths function has the important task of searching paths by hop count and visited pages using four methods: closed-closed, closed-open, open-closed, and open-open. Its mechanism is shown in Figure A 4. First, each path is checked for any edge whose support or confidence is outside the selected range. If hop checking is selected, the path length is checked against the hop range. Then the include function checks whether the path contains the specified pages (nodes). In the flowcharts, the yes branch goes down by default unless marked otherwise. The open-open method is shown in Figure A 5. It consists of two loops, i from 0 to the specified nodeSize and j from 0 to path.nodeSize. It first checks whether nodes[i] = path.node[j], and then checks whether nodes and path.node are in the same order by comparing j with order[i-1]. As shown in Figure A 6, the open-closed method stops both loops at the original size minus 1 and checks the last node independently. In contrast, the closed-open method starts both loops from 1 and checks the first node independently (Figure A 7). Finally, the closed-closed method checks both the first and the last node independently (Figure A 8).
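The four matching methods can be illustrated with a compact sketch. This is not the thesis's include function: it is a simplified restatement that assumes paths and specified pages are lists of node names, and the class name PathMatchSketch, the enum PathType, and the method signature are hypothetical. It replaces the explicit order[] bookkeeping of the flowcharts with a single forward scan, which gives the same in-order test.

```java
import java.util.List;

// Hypothetical sketch of the four "include" variants described above.
class PathMatchSketch {
    enum PathType { OPEN_OPEN, OPEN_CLOSED, CLOSED_OPEN, CLOSED_CLOSED }

    // Returns true if all 'nodes' occur in 'path' in the same order.
    // A closed start pins nodes[0] to the first page of the path;
    // a closed end pins the last node to the last page of the path.
    static boolean include(List<String> path, List<String> nodes, PathType type) {
        if (nodes.isEmpty()) return true;           // nothing specified: every path qualifies
        if (nodes.size() > path.size()) return false;
        boolean closedStart = type == PathType.CLOSED_OPEN || type == PathType.CLOSED_CLOSED;
        boolean closedEnd = type == PathType.OPEN_CLOSED || type == PathType.CLOSED_CLOSED;
        int i = 0, j = 0;
        int iEnd = nodes.size(), jEnd = path.size();
        if (closedStart) {                          // check the first node independently
            if (!nodes.get(0).equals(path.get(0))) return false;
            i = 1;
            j = 1;
        }
        if (closedEnd) {                            // check the last node independently
            if (!nodes.get(iEnd - 1).equals(path.get(jEnd - 1))) return false;
            iEnd--;
            jEnd--;
        }
        // Scan the remaining pages once; advancing i only on a match keeps the order.
        while (i < iEnd && j < jEnd) {
            if (nodes.get(i).equals(path.get(j))) i++;
            j++;
        }
        return i >= iEnd;
    }
}
```

For example, with the path a, b, c, d and the specified pages b, d, the open-open and open-closed methods accept the path, while closed-open rejects it because the path does not start at b.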
[Figure A 4. Algorithm for path selection: flowchart looping over paths i = 0 .. pathSize; tests minsup<=sup[i]<=maxsup and minconf<=conf[i]<=maxconf, then (if hop checking is on) minhop<=hop[i]<=maxhop, then include(i); qualifying paths are added with subPaths.insert(i); returns subPaths]
[Figure A 5. Algorithm for include function: open open]
[Figure A 6. Algorithm for include function: open closed]
[Figure A 7. Algorithm for include function: closed open]
[Figure A 8. Algorithm for include function: closed closed]
/* * NavizApplet.java *
/**
 *
 * @author praz
 */
public class NavizApplet extends JApplet implements AppletCommunicator {
    // General Variables
    private boolean initiated;   // flag to indicate whether an applet has been initiated or not
    private boolean copiedData;  // flag to indicate that the edge and path are copied from another applet
    private String baseDir;      // base directory of applet address
    private int windowColNum;    // column number of browser window
    private int windowRowNum;    // row number of browser window
    private double windowMaxHor; // maximum horizontal of screen
    private double windowMaxVer; // maximum vertical of screen
    private int screenTop;
    private int screenLeft;
    private double screenWidth;
    private double screenHeight;
    public NavizCanvas canvas;        // diagram drawing area
    public JSObject jsobjectWin;      // current browser containing applet
    public NumberFormat numberFormat; // format number in sup, conf textbox
    private int placeIndex;
    private int categoryIndex;
    private int fromHourIndex;
    private int toHourIndex;
    private int newWinHeight;

    private int appletNum; // applet number
    public int getAppletNum() { return appletNum; }
    public void setAppletNum(int appletNum){ this.appletNum = appletNum; }

    private int parentNum; // applet parent number
    public int getParentNum() { return parentNum; }
    public void setParentNum(int parentNum){ this.parentNum = parentNum; }

    private String propertyPage;
    public void setPropertyPage(String propertyPage){ this.propertyPage = propertyPage; }
    public String getPropertyPage(){ return propertyPage; }

    public void setPathButton(String pathButtonText){
        if(initiated){
            if(pathButtonText.equals("Diagram"))
                pathButton.setText("Path");
            else
                pathButton.setText("Diagram");
            pathButtonActionPerformed(null);
        }
    }
    public String getPathButton(){ return pathButton.getText(); }

    public void changeLayout(){ layoutButtonActionPerformed(null); }
    public void detailControl(){ detailButtonActionPerformed(null); }
    private int orderByRow; // hierarchize edges by sup or conf
    public void setOrderByRow(int orderByRow){
        this.orderByRow = orderByRow;
        if(initiated)
            orderComboBox.setSelectedIndex(orderByRow);
    }
    public int getOrderByRow(){ return orderByRow; }

    private int weightByRow; // group nodes by sup or conf
    public void setWeightByRow(int weightByRow){
        this.weightByRow = weightByRow;
        if(initiated)
            weightComboBox.setSelectedIndex(weightByRow);
    }
    public int getWeightByRow(){
        return weightByRow;
            sortComboBox.setSelectedIndex(sortPath==NavizPathServlet.PATH_SUPPORT_ROW?0:(sortPath==NavizPathServlet.PATH_CONFIDENCE_ROW?1:2));
    }
    public int getSortPath(){ return sortPath; }

    public int sortType; // sort paths in path mode: ASCENDING, DESCENDING;
    public void setSortType(int sortType){
        this.sortType = sortType;
        if(initiated){
            if(sortType==ASCENDING)
                ascButtonActionPerformed(null);
            else
                descButtonActionPerformed(null);
        }
    }
    public int getSortType(){ return sortType; }

    public int pathType; // type of path: CLOSED_CLOSED, OPEN_CLOSED, CLOSED_OPEN, OPEN_OPEN
    public void setPathType(int pathType){
        this.pathType = pathType;
        if(initiated){
            switch(pathType){
                case OPEN_OPEN:
                    openOpenButtonActionPerformed(null);
                    break;
                case OPEN_CLOSED:
                    openClosedButtonActionPerformed(null);
                    break;
                case CLOSED_OPEN:
                    closedOpenButtonActionPerformed(null);
                    break;
                case CLOSED_CLOSED:
                    closedClosedButtonActionPerformed(null);
                    break;
            }
        }
    }
    public int getPathType(){ return pathType; }

    private int pathNo; // current path number showed in path mode
    public int getPathNo() {
        return pathNo;
/** Creates new form NavizApplet */ public NavizApplet() { initComponents (); // Create the drawing area canvas = new NavizCanvas(); canvas.setDoubleBuffered (true); canvas.setOpaque (true); canvas.setBackground(Color.white); canvas.addMouseListener(new MouseAdapter() { public void mouseClicked(MouseEvent evt) { canvasMouseClicked(evt); } } ); canvas.addMouseMotionListener(new MouseMotionAdapter() { public void mouseMoved(MouseEvent evt) { canvasMouseMoved(evt); } } ); canvasPanel.add(canvas); canvas.setApplet(this); // Create the diagram popup menu.
maxconfTextField.setText("100"); maxconfTextField.setToolTipText("Maximum confidence of data"); maxconfTextField.addActionListener(new ActionListener() { public void actionPerformed(ActionEvent evt) { maxconfTextFieldActionPerformed(evt); } }); maxconfTextField.addFocusListener(new FocusAdapter() { public void focusLost(FocusEvent evt) { maxconfTextFieldFocusLost(evt); } }); controlView.add(maxconfTextField); maxconfTextField.setBounds(930, 20, 30, 20); maxconfPercent.setText("%"); controlView.add(maxconfPercent); maxconfPercent.setBounds(960, 20, 9, 16); incsupLabel.setFont(new Font("Dialog", 1, 10)); incsupLabel.setText("Inc"); incsupLabel.setToolTipText("Incremental value of support"); controlView.add(incsupLabel); incsupLabel.setBounds(970, 0, 20, 20); incsupTextField.setText("0.1"); incsupTextField.setToolTipText("Incremental value of support"); incsupTextField.addActionListener(new ActionListener() { public void actionPerformed(ActionEvent evt) { incsupTextFieldActionPerformed(evt); } }); incsupTextField.addFocusListener(new FocusAdapter() { public void focusLost(FocusEvent evt) { incsupTextFieldFocusLost(evt); } }); controlView.add(incsupTextField); incsupTextField.setBounds(990, 0, 30, 20); incsupPercent.setText("%");
controlView.add(orderComboBox); orderComboBox.setBounds(1070, 0, 50, 20); weightLabel.setFont(new Font("Dialog", 0, 10)); weightLabel.setText("Weight"); weightLabel.setToolTipText("Weighten edge by support or confidence"); controlView.add(weightLabel); weightLabel.setBounds(1030, 20, 33, 20); weightComboBox.setFont(new Font("Dialog", 0, 10)); weightComboBox.setMaximumRowCount(2); weightComboBox.setModel(new DefaultComboBoxModel(new String[] { "sup", "conf" })); weightComboBox.setToolTipText("Weighten edge by support or confidence"); weightComboBox.addActionListener(new ActionListener() { public void actionPerformed(ActionEvent evt) { weightComboBoxActionPerformed(evt); } }); controlView.add(weightComboBox); weightComboBox.setBounds(1070, 20, 50, 20); widthLabel.setFont(new Font("Dialog", 0, 10)); widthLabel.setText("Width"); widthLabel.setToolTipText("Edge width by support or confidence"); controlView.add(widthLabel); widthLabel.setBounds(1120, 0, 27, 20); widthComboBox.setFont(new Font("Dialog", 0, 10)); widthComboBox.setMaximumRowCount(2); widthComboBox.setModel(new DefaultComboBoxModel(new String[] { "sup", "conf" })); widthComboBox.setToolTipText("Edge width by support or confidence"); widthComboBox.addActionListener(new ActionListener() { public void actionPerformed(ActionEvent evt) { widthComboBoxActionPerformed(evt); } }); controlView.add(widthComboBox); widthComboBox.setBounds(1150, 0, 50, 20);
private void windowRowNumTextFieldActionPerformed(ActionEvent evt) {//GEN-FIRST:event_windowRowNumTextFieldActionPerformed int val = Integer.parseInt(windowRowNumTextField.getText()); if(val>0 && val<=9 && val!=windowRowNum){ windowRowNum = val; }else{ windowRowNumTextField.setText(String.valueOf(windowRowNum)); } }//GEN-LAST:event_windowRowNumTextFieldActionPerformed private void windowColNumTextFieldActionPerformed(ActionEvent evt) {//GEN-FIRST:event_windowColNumTextFieldActionPerformed int val = Integer.parseInt(windowColNumTextField.getText()); if(val>0 && val<=9 && val!=windowColNum){ windowColNum = val; }else{ windowColNumTextField.setText(String.valueOf(windowColNum)); } }//GEN-LAST:event_windowColNumTextFieldActionPerformed private void detailButtonActionPerformed(ActionEvent evt) {//GEN-FIRST:event_detailButtonActionPerformed if(detailButton.getText().equals("Detail >>")){ detailButton.setText("<< Hide"); controlView.setPreferredSize(new Dimension(DETAIL_CONTROL_WIDTH,controlView.getHeight())); controlView.revalidate(); maxsupLabel.setVisible(true); maxsupScrollBar.setVisible(true); maxsupTextField.setVisible(true); maxsupPercent.setVisible(true); incsupLabel.setVisible(true); incsupTextField.setVisible(true); incsupPercent.setVisible(true); maxconfLabel.setVisible(true); maxconfScrollBar.setVisible(true); maxconfTextField.setVisible(true); maxconfPercent.setVisible(true); incconfLabel.setVisible(true); incconfTextField.setVisible(true); incconfPercent.setVisible(true); orderComboBox.setVisible(true); weightComboBox.setVisible(true); widthComboBox.setVisible(true); colorComboBox.setVisible(true);
                    System.err.println("\nNavizApplet.connectToPathServlet error: "+e.toString());
                }
            }
            return pathResponsePacket;
        }
        public void finished(){
            if(workFinished){
                initiated = true;
                setIncsup(incsup);
                setMinsup(minsup);
                setMaxsup(maxsup);
                setIncconf(incconf);
                setMinconf(minconf);
                setMaxconf(maxconf);
                if(!(pathCompleted || copiedData)){
                    // attach the min/max support and confidence to each path tuple
                    double[] minmax;
                    for(int i=0; i<pathResponsePacket.paths.size(); i++){
                        minmax = minmaxSupConf((Vector)pathResponsePacket.paths.elementAt(i));
                        ((Vector)pathResponsePacket.paths.elementAt(i)).insertElementAt(new Double(minmax[0]),NavizPathServlet.PATH_MINSUP_ROW);
                        ((Vector)pathResponsePacket.paths.elementAt(i)).insertElementAt(new Double(minmax[1]),NavizPathServlet.PATH_MAXSUP_ROW);
                        ((Vector)pathResponsePacket.paths.elementAt(i)).insertElementAt(new Double(minmax[2]),NavizPathServlet.PATH_MINCONF_ROW);
                        ((Vector)pathResponsePacket.paths.elementAt(i)).insertElementAt(new Double(minmax[3]),NavizPathServlet.PATH_MAXCONF_ROW);
                    }
                    pathCompleted = true;
                }
                if(pathNodes.size()>0){
                    permanentPathNodes = (Vector)pathNodes.clone();
                    pathNodes.removeAllElements();
                }
                if(avoidNodes.size()>0){
                    permanentAvoidNodes = (Vector)avoidNodes.clone();
                    avoidNodes.removeAllElements();
                }
                Vector paths = getSubPaths(permanentPathNodes,permanentAvoidNodes,pathType);
                    outputToServlet.writeObject(MessageRelayer.getAttributeRequestPacket());
                    outputToServlet.flush();
                    outputToServlet.close();
                    // now, let's read the response from the servlet.
                    ObjectInputStream inputFromServlet = new ObjectInputStream(servletConnection.getInputStream());
                    MessageRelayer.setAttributeResponsePacket((NavizAttributeResponse)inputFromServlet.readObject());
                    inputFromServlet.close();
                }catch(Exception e){
                    System.err.println("\nNavizApplet.connectToAttributeServlet error: "+e.toString());
                }
            }
            return MessageRelayer.getAttributeResponsePacket();
        }
        public void finished(){
            // fill the place combo box and remember the index of the current place
            for(int i=0;i<MessageRelayer.getAttributeResponsePacket().places.size();i++){
                placeComboBox.addItem(((Vector)MessageRelayer.getAttributeResponsePacket().places.elementAt(i)).elementAt(NavizAttributeServlet.ATTRIBUTE_NAME));
                if(((Integer)((Vector)MessageRelayer.getAttributeResponsePacket().places.elementAt(i)).elementAt(NavizAttributeServlet.ATTRIBUTE_ID)).intValue()==place){
                    placeIndex=i;
                }
            }
            // fill the category combo box and remember the index of the current category
            for(int i=0;i<MessageRelayer.getAttributeResponsePacket().categories.size();i++){
                categoryComboBox.addItem(((Vector)MessageRelayer.getAttributeResponsePacket().categories.elementAt(i)).elementAt(NavizAttributeServlet.ATTRIBUTE_NAME));
                if(((Integer)((Vector)MessageRelayer.getAttributeResponsePacket().categories.elementAt(i)).elementAt(NavizAttributeServlet.ATTRIBUTE_ID)).intValue()==category){
                    categoryIndex=i;
pathNodes.removeElementAt(pathNodes.lastIndexOf(selectedNode)); }catch(Exception e){ } try{ avoidNodes.removeElementAt(avoidNodes.lastIndexOf(selectedNode)); }catch(Exception e){ } canvas.repaint(); if(parentNum==-1){ String[] appletNames = MessageRelayer.getAppletNames(); for(int i=0; i<appletNames.length; i++) if(!appletNames[i].equals(this.getName())){ NavizApplet app = (NavizApplet)MessageRelayer.getApplet(appletNames[i]); if(!(app==null||app.getParentNum()!=appletNum)) app.pathNodeCleared(selectedNode); } } } public void addApplet(AppletCommunicator remote_app){} public void removeApplet(AppletCommunicator remote_app){} public void postCommand(Object command){} class PopupListener extends MouseAdapter { public void mousePressed(MouseEvent e) { maybeShowPopup(e); } public void mouseReleased(MouseEvent e) { maybeShowPopup(e); } private void maybeShowPopup(MouseEvent e) { if (e.isPopupTrigger()) { popup.show(e.getComponent(), e.getX(), e.getY()); } } } class PathPopupListener extends MouseAdapter { public void mousePressed(MouseEvent e) { maybeShowPopup(e); } public void mouseReleased(MouseEvent e) { maybeShowPopup(e); }
A2 Naviz Servlets

[Figure A 9: flowchart with steps readNode, readEdge, ifDataExist, mineLog]

As shown in Figure A 9, the main task of NavizServlet is to respond to graph layout requests:
1. Check for the layout in the cache (ifFileCached), and send it if available.
2. Read the general edge data from the repository (readDiagram), and feed it to GraphViz (buildDot) to make a new layout (runCommand).
3. Read the layout created by GraphViz (readPlain), store it in the cache, and send it.
4. If the specific edge data for comparison is not available in the repository (ifDataExist), launch Log Miner to mine the desired patterns (mineLog) and store them in the repository.
5. Read the specific edge data needed for comparison (readEdge), and send it.
The job of the buildDot function is to write the node and edge data to a .dot file that is read by GraphViz's dot program to create the layout. While writing out the edge data, NavizServlet orders the edges by their support value, so that high-support edges are written first. This causes GraphViz to create a layout with a hierarchical structure according to edge support (visitor traffic). Furthermore, NavizServlet gives each edge a weight according to its confidence, so that GraphViz draws edges with heavier weights shorter. Edges with high confidence are therefore drawn shorter, which has the effect of grouping pages according to edge confidence (transition probability).
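The ordering and weighting performed by buildDot can be sketched as follows. This is an illustration under assumptions, not the thesis's buildDot: the class DotSketch, its Edge type, and the rule that rounds the confidence percentage to an integer weight are hypothetical. Only the DOT syntax and the edge attribute weight are standard GraphViz features.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: emit DOT text with edges ordered by support
// and weighted by confidence.
class DotSketch {
    static final class Edge {
        final String start, end;
        final double sup, conf;   // support and confidence in percent
        Edge(String start, String end, double sup, double conf) {
            this.start = start; this.end = end; this.sup = sup; this.conf = conf;
        }
    }

    static String buildDot(List<Edge> edges) {
        // Write high-support edges first.
        List<Edge> sorted = new ArrayList<>(edges);
        sorted.sort(Comparator.comparingDouble((Edge e) -> e.sup).reversed());

        StringBuilder sb = new StringBuilder("digraph naviz {\n");
        for (Edge e : sorted) {
            // Assumed mapping: confidence percentage rounded to an integer weight of at least 1.
            int weight = (int) Math.max(1, Math.round(e.conf));
            sb.append("  \"").append(e.start).append("\" -> \"").append(e.end)
              .append("\" [weight=").append(weight).append("];\n");
        }
        return sb.append("}\n").toString();
    }
}
```

The resulting text could be saved to a .dot file and passed to the dot program, for example with its plain output format, which is what a function such as readPlain would then parse.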
/*
 * NavizServlet.java
 *
 * Created on 2002/07/12, 11:42
 */
package naviz;

import javax.servlet.*;
import javax.servlet.http.*;
import java.util.*;
    public static final int NODE_INITIAL_CAPACITY = 100;
    public static final int NODE_CAPACITY_INCREMENT = 20;
    public static final int NODE_TUPLE_SIZE = 5;
    public static final int NODE_ID_ROW = 0;      // row number of id
    public static final int NODE_NAME_ROW = 1;    // row number of name
    public static final int NODE_COMMENT_ROW = 2; // row number of
    public static final int NODE_LINK_ROW = 3;    // row number of link
    public static final int NODE_EXIST = 4;       // row number of existence in
    public static final int FILE_ACCURACY = 100;
    public static final double MAX_LINE_WEIGHT = 100.0;
    public String baseDir;
    public String dirSeparator = "";
    //public String dirSeparator = "/";
    // end default value of arguments

    private NavizRequest requestPacket;
    private NavizResponse responsePacket;
    private String statusFile;
    private String dotFile;
    /** Initializes the servlet. */
    public void init(ServletConfig config) throws ServletException {
        super.init(config);
        System.out.println("NavizServlet.init: initiated...");
        try{
            Class.forName("org.gjt.mm.mysql.Driver");
            navizDB = DriverManager.getConnection("jdbc:mysql:///wl","praz","4it2womd");
            //navizDB = DriverManager.getConnection("jdbc:mysql://nogizaka.tkl.iis.u-tokyo.ac.jp:3366/wl3","praz","4it2womd");
            selectEdge = navizDB.prepareStatement("select start_node,end_node,support,confidence from edge where support>=? and support<=? and confidence>=? and confidence<=? and from_hour=? and to_hour=? and place=? and category=? order by ? desc");
            selectEdgeTest = navizDB.prepareStatement("select start_node from edge where from_hour=? and to_hour=? and place=? and category=?");
            selectMinMaxSupConf = navizDB.prepareStatement("select min(support) as minsup, max(support) as maxsup, min(confidence) as minconf, max(confidence) as maxconf from edge where support>=? and support<=? and confidence>=? and confidence<=? and from_hour=? and to_hour=? and place=? and category=?");
            selectSessionNumAnd = navizDB.prepareStatement("select count(*) as senum from sedata where type=1 and time>=? and time<=? and addr=? and cate=?");
            selectSessionNumOr = navizDB.prepareStatement("select count(*) as senum from sedata where type=1 and (time>=? or time<=?) and addr=? and cate=?");
[Figure: servlet flowchart with steps readNode, timer, ifDataExist (dataExist=false / dataExist=true), readPath]
/*
 * NavizPathServlet.java
 *
 * Created on 2002/07/12, 13:52
 */
package naviz;

import javax.servlet.*;
import javax.servlet.http.*;
import java.util.*;
import java.sql.*;
import java.io.*;

/**
 *
 * @author praz
 * @version
 */
public class NavizPathServlet extends HttpServlet implements SingleThreadModel {
    // default value of arguments
    public static final int PATH_INITIAL_CAPACITY = 5000;
    public static final int PATH_CAPACITY_INCREMENT = 500;
    public static final int PATH_LENGTH_ROW = 0;  // row number of path length
    public static final int PATH_SUPPORT_ROW = 1; // row number of path
    /** Initializes the servlet. */
    public void init(ServletConfig config) throws ServletException {
        super.init(config);
        System.out.println("NavizPathServlet.init: initiated...");
        try{
            Class.forName("org.gjt.mm.mysql.Driver");
            navizDB = DriverManager.getConnection("jdbc:mysql:///wl","praz","4it2womd");
            //navizDB = DriverManager.getConnection("jdbc:mysql://nogizaka.tkl.iis.u-tokyo.ac.jp:3366/wl3","praz","4it2womd");
            selectSequence = navizDB.prepareStatement("select id, support, confidence, node, order_num from sequence, seq_body where from_hour=? and
/**
 *
 * @author praz
 * @version
 */
public class NavizAttributeServlet extends HttpServlet implements SingleThreadModel {
    // default value of arguments
    public static final int ATTRIBUTE_INITIAL_CAPACITY = 1000;
    public static final int ATTRIBUTE_CAPACITY_INCREMENT = 200;
    public static final int ATTRIBUTE_ID = 0;    // row number of id
    public static final int ATTRIBUTE_COUNT = 1; // row number of count
    public static final int ATTRIBUTE_CODE = 2;  // row number of code
    public static final int ATTRIBUTE_NAME = 3;  // row number of name
    public static final int ATTRIBUTE_TUPLE_SIZE = 4;
    // end default value of arguments

    private NavizAttributeRequest attributeRequestPacket;
    private NavizAttributeResponse attributeResponsePacket;
    private Connection navizDB=null;
    private PreparedStatement selectTotalSessionNum=null;
    private PreparedStatement selectPlace=null;
    private PreparedStatement selectCategory=null;
    private PreparedStatement selectStation=null;

    /** Initializes the servlet. */
    public void init(ServletConfig config) throws ServletException {
        super.init(config);
        System.out.println("NavizAttributeServlet.init: initiated...");
        try{
            Class.forName("org.gjt.mm.mysql.Driver");
            navizDB = DriverManager.getConnection("jdbc:mysql:///wl","praz","4it2womd");
            //navizDB = DriverManager.getConnection("jdbc:mysql://nogizaka.tkl.iis.u-tokyo.ac.jp:3366/wl3","praz","4it2womd");
            selectTotalSessionNum = navizDB.prepareStatement("select count(*) as senum from sedata where type=1 and time>=10 and time<=21 and addr=0 and cate=0");
            selectPlace = navizDB.prepareStatement("select * from attributes where left(code,1)='J'");
            selectCategory = navizDB.prepareStatement("select * from attributes where left(code,1)='G'");
            selectStation = navizDB.prepareStatement("select * from attributes where left(code,1)='E'");
        }catch(Exception e){
            System.out.println("NavizAttributeServlet.init error: "+e.toString());
            e.printStackTrace();
        }
    }

    /** Destroys the servlet. */
    public void destroy() {
    }

    /** Processes requests for both HTTP <code>GET</code> and <code>POST</code> methods.
     * @param request servlet request
A3 Helper Classes
/**
 *
 * @author Bowo Prasetyo
 * @version
 */
public class NavizCanvas extends JPanel implements ItemListener, ActionListener{
    public static final int TOP = 75; // pixels
    public static final int LEFT = 0;
    public static final int HEIGHT = 600;
    public static final int WIDTH = 900;
    public static final double PIE_CHART_WIDTH = 75.0;
    public static final double PIE_CHART_HEIGHT = 75.0;
    public static final double SUP_CHART_WIDTH = 15.0;
    public static final double SUP_CHART_HEIGHT = 75.0;
    public static final double CONF_CHART_WIDTH = 15.0;
    public static final double CONF_CHART_HEIGHT = 75.0;
    public static final int LEFT_MARGIN = (int)PIE_CHART_WIDTH + (int)SUP_CHART_WIDTH + (int)CONF_CHART_WIDTH + 15;
    public static final double LINE_WIDTH_RATIO = 0.75;
    private static final double FONT_WIDTH_RATIO = 0.65;
    private static final int ALPHA = 256;

    private Hashtable nodes;
    private Hashtable edges;
    private FontMetrics fm;
    private DefaultPieDataset pieDataset;
    private PiePlot piePlot;
    private DefaultCategoryDataset supDataset;
    private DefaultCategoryDataset confDataset;
    private VerticalCategoryPlot supPlot;
    private VerticalCategoryPlot confPlot;

    // property that has set method only
    private NavizApplet applet;
    private Vector paths;
    public void setApplet(NavizApplet applet){ this.applet = applet; }
            g.drawString(thisNode.getComment(),(int)(thisNode.getX()-bound.getWidth()/2),(int)(thisNode.getY()+bound.getHeight()/4));
        }
        // draw every edge, colored and stroked by support or confidence
        for(Enumeration e=edges.elements();e.hasMoreElements();){
            NavizEdge thisEdge = (NavizEdge)e.nextElement();
            g.setColor(Color.getHSBColor((float)applet.normalizeLineColor(applet.getColorByRow()==NavizServlet.EDGE_SUPPORT_ROW?thisEdge.getSup():thisEdge.getConf(),applet.getColorByRow()),(float)1.0,(float)0.95));
            g.setStroke(new BasicStroke((float)(applet.normalizeLineWidth(applet.getWidthByRow()==NavizServlet.EDGE_SUPPORT_ROW?thisEdge.getSup():thisEdge.getConf(),applet.getWidthByRow())*LINE_WIDTH_RATIO)));
            g.draw(thisEdge.getPath());
            g.fillPolygon(thisEdge.getArrowXPoints(),thisEdge.getArrowYPoints(),3);
        }
        g.setFont(new Font(applet.getNodeFontname(),Font.PLAIN,10));
        String footer;
        Rectangle2D.Double supChartArea = null;
        Rectangle2D.Double confChartArea = null;
        if(node!=null){
            g.setColor(Color.black);
            g.setStroke(new BasicStroke());
            footer = "Node: "+node.getName();
            footer += " Label: "+node.getComment();
            footer += " URL: "+node.getUrl();
            g.drawString(footer,LEFT_MARGIN,getHeight()-5);
        }else if(selectedEdge!=null){
            // redraw the selected edge in a brighter color
            g.setColor(Color.getHSBColor((float)applet.normalizeLineColor(applet.getColorByRow()==NavizServlet.EDGE_SUPPORT_ROW?selectedEdge.getSup():selectedEdge.getConf(),applet.getColorByRow()),(float)1.0,(float)0.95).brighter());
            g.setStroke(new BasicStroke((float)(applet.normalizeLineWidth(applet.getWidthByRow()==NavizServlet.EDGE_SUPPORT_ROW?selectedEdge.getSup():selectedEdge.getConf(),applet.getWidthByRow())*LINE_WIDTH_RATIO)));
            g.draw(selectedEdge.getPath());
            g.fillPolygon(selectedEdge.getArrowXPoints(),selectedEdge.getArrowYPoints(),3);
            g.setColor(Color.black);
            g.setStroke(new BasicStroke());
            footer = "Edge: "+selectedEdge.getStart()+" -->
connectToAttributeServlet, which connects to NavizAttributeServlet to request the property data needed for behavior comparison. The SwingWorker class used to implement thread parallelization is the standard way to do this job in the Java Swing environment.
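The parallel fan-out to the three servlets can also be sketched with the standard java.util.concurrent API. This is a substitute technique for illustration, not the SwingWorker 3 class listed in this appendix; the class ParallelFetchSketch and the method fetchAll are hypothetical, and each Callable stands in for one HTTP round trip to a servlet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: run several requests in parallel and collect
// their responses in the order the requests were given.
class ParallelFetchSketch {
    static List<String> fetchAll(List<Callable<String>> requests) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, requests.size()));
        try {
            // invokeAll blocks until every request has completed.
            List<Future<String>> futures = pool.invokeAll(requests);
            List<String> responses = new ArrayList<>();
            for (Future<String> f : futures) {
                responses.add(f.get());
            }
            return responses;
        } finally {
            pool.shutdown();
        }
    }
}
```

Unlike SwingWorker, this sketch does not hand the results back to the event dispatch thread; in a Swing program the caller would still need to do so, for example with SwingUtilities.invokeLater.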
package naviz; import javax.swing.SwingUtilities; /** * This is the 3rd version of SwingWorker (also known as * SwingWorker 3), an abstract class that you subclass to * perform GUI-related work in a dedicated thread. For * instructions on using this class, see: * * http://java.sun.com/docs/books/tutorial/uiswing/misc/threads.html * * Note that the API changed slightly in the 3rd version: * You must now invoke start() on the SwingWorker after
/** * Start a thread that will call the <code>construct</code> method * and then exit. */ public SwingWorker() { final Runnable doFinished = new Runnable() { public void run() { finished(); } }; Runnable doConstruct = new Runnable() {
/* Copyright Notice Web Techniques grants permission to use this source code for private or commercial use provided that credit to Web Techniques and the author is clearly expressed within the comments of the source code. All source code is provided as-is, and should be utilized at your own risk. For questions, contact editors@webtechniques.com. */ /**
NavizNode
NavizNode contains all the information needed about a node, such as name (a unique identifier for each node), comment (can be shown as the node title), url (url
NavizAttributeRequest, that are used to send request from applet to servlet, and three of its counterpart, NavizResponse, NavizPathResponse, and
/** * * @author Bowo Prasetyo * @version */ public class NavizNode extends Object implements Serializable {
public int getX() { return x; } public void setX(int x){ this.x = x; } public int getY() { return y; } public void setY(int y){ this.y = y; } public int getH() { return h; } public void setH(int h){ this.h = h; } public int getW() { return w; } public void setW(int w){ this.w = w; } public String getName() { return name; } public void setName(String name) { this.name = name; } public String getComment() { return comment; } public void setComment(String comment) { this.comment = comment; } public String getUrl() { return url;
NavizEdge
NavizEdge contains all the information needed about an edge, such as start_node + end_node (together a unique identifier for each edge), sup (can be visualized as the edge's width), conf (can be visualized as the edge's color), the number of Bezier points needed to spline the edge, and their locations (x, y).
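How a list of Bezier points becomes a drawable spline can be sketched with java.awt.geom.GeneralPath. This is an illustrative sketch, not the thesis's drawBezier: the class BezierSketch and the method toPath are hypothetical, and the sketch assumes the points arrive as 3k+1 control points forming k connected cubic segments, which is how GraphViz normally reports edge splines.

```java
import java.awt.geom.GeneralPath;

// Hypothetical sketch: build a path of cubic Bezier segments from
// control points given as parallel x and y arrays.
class BezierSketch {
    static GeneralPath toPath(int[] xs, int[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("xs and ys must have the same length");
        }
        GeneralPath path = new GeneralPath();
        if (xs.length == 0) return path;
        path.moveTo((float) xs[0], (float) ys[0]);
        // Each segment uses two control points and one end point.
        for (int i = 1; i + 2 < xs.length; i += 3) {
            path.curveTo((float) xs[i], (float) ys[i],
                         (float) xs[i + 1], (float) ys[i + 1],
                         (float) xs[i + 2], (float) ys[i + 2]);
        }
        return path;
    }
}
```

A path built this way can be passed to Graphics2D.draw, which matches how the canvas code draws thisEdge.getPath().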
/*
 * NavizEdge.java
 *
 * Created on 2001/07/13, 18:33
 */
package naviz;

import java.io.Serializable;
import java.awt.geom.GeneralPath;
import java.awt.Point;

/**
public String getHashKey(){ return start+"->"+end; } public int getPointNum() { return pointNum; } public void setPointNum(int pointNum){ this.pointNum = pointNum; xPoints = new int[pointNum]; yPoints = new int[pointNum]; points = new Point[pointNum]; } public int getLabelX() { return labelX; } public void setLabelX(int labelX){ this.labelX = labelX; } public int getLabelY() { return labelY; } public void setLabelY(int labelY){ this.labelY = labelY; } public double getWidth() {
    public boolean dotCenter;      // centers drawing on page: true, false
    public boolean dotConcentrate; // enable edges concentrators: true, false
    public int dotFontsize;        // point size of label: in points
    public double dotXMargin;
    public double dotYMargin;      // margin included in page:
    public int dotMclimit;         // adjust mincross iterations: f times
    public double dotNodesep;      // separation between nodes, in inches
    public int dotNslimit;         // bounds network simplex iterations by f (num of nodes)
    //public double dotXPage;
    //public double dotYPage;      // unit of pagination: in inches
    int edgeFontsize;
    int edgeMinlen;       // minimum distance between start and end: in
    String edgeColor;
    String edgeFontcolor;
    String edgeFontname;
    String edgeDir;       // edge direction: forward, back, both, none

NavizResponse

/*
 * NavizResponse.java
 *
 * Created on 2001/09/13, 10:42
 */

package naviz;

import java.io.Serializable;
import java.util.Hashtable;

/**
 *
 * @author praz
 * @version
 */
public class NavizResponse extends Object implements Serializable {

    public Hashtable nodes;
    public Hashtable edges;
    public double scaleFactor;
    public double dataMinsup;
    public double dataMaxsup;
    public double dataMinconf;
    public double dataMaxconf;
    public double height;
    public double width;
    public int sessionNum;

    /** Creates new NavizResponse */
    public NavizResponse() {
        nodes = new Hashtable(NavizServlet.EDGE_INITIAL_CAPACITY);
        edges = new Hashtable(NavizServlet.EDGE_INITIAL_CAPACITY);
        dataMinsup = 100.0;
        dataMaxsup = 0.0;
        dataMinconf = 100.0;
        dataMaxconf = 0.0;
    }

    public Object clone() {
        NavizResponse res = new NavizResponse();
        res.nodes = (Hashtable) nodes.clone();
        res.edges = (Hashtable) edges.clone();
        res.scaleFactor = scaleFactor;
        res.dataMinsup = dataMinsup;
        res.dataMaxsup = dataMaxsup;
        res.dataMinconf = dataMinconf;
        res.dataMaxconf = dataMaxconf;
        res.height = height;
        res.width = width;
        return res;
    }
}
import java.io.Serializable;
import java.util.Vector;

/**
 *
 * @author praz
 * @version
 */
public class NavizPathResponse extends Object implements Serializable {

    public Vector paths;
    public double pathMinsup;
    public double pathMaxsup;
    public double pathMinconf;
    public double pathMaxconf;

    /** Creates new NavizPathResponse */
    public NavizPathResponse() {
        paths = new Vector(NavizPathServlet.PATH_INITIAL_CAPACITY,
                           NavizPathServlet.PATH_CAPACITY_INCREMENT);
        pathMinsup = 100.0;
        pathMaxsup = 0.0;
        pathMinconf = 100.0;
        pathMaxconf = 0.0;
    }

    public Object clone() {
        NavizPathResponse res = new NavizPathResponse();
        res.paths = (Vector) paths.clone();
        res.pathMinsup = pathMinsup;
        res.pathMaxsup = pathMaxsup;
        res.pathMinconf = pathMinconf;
        res.pathMaxconf = pathMaxconf;
        return res;
    }
}
    /** Creates new NavizAttributeResponse */
    public NavizAttributeResponse() {
        totalSessionNum = 0;
        places = new Vector(NavizAttributeServlet.ATTRIBUTE_INITIAL_CAPACITY,
                            NavizAttributeServlet.ATTRIBUTE_CAPACITY_INCREMENT);
        categories = new Vector(NavizAttributeServlet.ATTRIBUTE_INITIAL_CAPACITY,
                                NavizAttributeServlet.ATTRIBUTE_CAPACITY_INCREMENT);
        stations = new Vector(NavizAttributeServlet.ATTRIBUTE_INITIAL_CAPACITY,
                              NavizAttributeServlet.ATTRIBUTE_CAPACITY_INCREMENT);
    }
}
A5

A5.1 GraphViz

To draw the graph layout that underlies the traversal diagram, Naviz uses GraphViz [8], an open source graph drawing tool from AT&T that implements the algorithm proposed by Gansner et al. [3]. Naviz assumes that GraphViz is installed and that its location is included in the PATH environment variable.

A5.2 Log Miner

Currently Naviz uses a separate log miner based on the GSP (Generalized Sequential Pattern) mining algorithm [1]. This Log Miner mines edge and path patterns from the web log files of NTT Docomo Mobile Townpage and stores them in the pattern repository described below.

A5.3 Pattern Repository

Naviz uses a MySQL database as the repository for the discovered patterns. The tables used by Naviz are:

1. node: to store node data, such as id, name, comment, and link.

CREATE TABLE `node` (
  `id` int(11) NOT NULL default '0',
  `name` char(32) NOT NULL default '',
  `comment` char(255) default NULL,
  `link` char(255) default NULL,
  PRIMARY KEY (`id`),
  KEY `e_name` (`name`)
) TYPE=MyISAM;

2. edge: to store edge information, such as start and end node, support, confidence, from_hour, to_hour, place, and category.

CREATE TABLE `edge` (
  `start_node` int(11) NOT NULL default '0',
  `end_node` int(11) NOT NULL default '0',
  `support` double(16,4) NOT NULL default '0.0000',
  `confidence` double(16,4) NOT NULL default '0.0000',
  `from_hour` int(11) NOT NULL default '0',
  `to_hour` int(11) NOT NULL default '0',
  `place` int(11) NOT NULL default '0',
  `category` int(11) NOT NULL default '0',
  KEY `e_supconf` (`support`,`confidence`),
  KEY `e_fromto` (`from_hour`,`to_hour`),
  KEY `e_place` (`place`),
  KEY `e_category` (`category`)
) TYPE=MyISAM;

3. sequence: to store path information, such as id, support, confidence, place, category, from_hour, and to_hour.

CREATE TABLE `sequence` (
  `id` int(11) NOT NULL auto_increment,
  `support` double(16,4) NOT NULL default '0.0000',
  `confidence` double(16,4) NOT NULL default '0.0000',
  `place` int(11) NOT NULL default '0',
  `category` int(11) NOT NULL default '0',
  `from_hour` int(11) NOT NULL default '0',
  `to_hour` int(11) NOT NULL default '0',
  PRIMARY KEY (`id`),
  KEY `e_category` (`category`),
  KEY `e_fromto` (`from_hour`,`to_hour`),
  KEY `e_supconf` (`support`,`confidence`),
  KEY `e_place` (`place`)
) TYPE=MyISAM;
4. seq_body: to store the body of each path, that is, its nodes and their order.

CREATE TABLE `seq_body` (
  `sequence` int(11) NOT NULL default '-1',
  `order_num` int(11) default NULL,
  `node` int(11) default NULL,
  KEY `e_sequence` (`sequence`)
) TYPE=MyISAM;

5. attributes: to store the values of properties, such as place and category, that are used for behavior comparison.

CREATE TABLE `attributes` (
  `item` int(10) unsigned NOT NULL default '0',
  `cnt` int(11) default NULL,
  `code` varchar(20) NOT NULL default '',
  `attrib` varchar(100) NOT NULL default '',
  PRIMARY KEY (`item`)
) TYPE=MyISAM;
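To illustrate how the repository and GraphViz fit together, the following sketch turns rows of the edge table into GraphViz DOT text. It is a minimal example under stated assumptions, not the Naviz implementation: the class `EdgeDotSketch`, the nested `EdgeRow` type, the `n<id>` node naming, and the label format are all hypothetical, and the rows are assumed to have been fetched from MySQL already.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: converts rows of the `edge` table into DOT text. */
final class EdgeDotSketch {

    /** One row of the edge table, reduced to the columns used here. */
    static final class EdgeRow {
        final int startNode;
        final int endNode;
        final double support;
        final double confidence;

        EdgeRow(int startNode, int endNode, double support, double confidence) {
            this.startNode = startNode;
            this.endNode = endNode;
            this.support = support;
            this.confidence = confidence;
        }
    }

    /** Builds a directed graph with one labelled DOT edge per row. */
    static String toDot(List<EdgeRow> rows) {
        StringBuilder dot = new StringBuilder("digraph naviz {\n");
        for (EdgeRow row : rows) {
            // Locale.ROOT keeps the decimal point independent of the platform locale.
            dot.append(String.format(Locale.ROOT,
                    "  n%d -> n%d [label=\"%.2f/%.2f\"];\n",
                    row.startNode, row.endNode, row.support, row.confidence));
        }
        return dot.append("}\n").toString();
    }

    public static void main(String[] args) {
        List<EdgeRow> rows = new ArrayList<>();
        rows.add(new EdgeRow(1, 2, 12.5, 40.0));
        rows.add(new EdgeRow(2, 3, 3.25, 18.0));
        System.out.print(toDot(rows));
    }
}
```

The generated text could then be passed to the GraphViz `dot` program, for example with the `-Tplain` output format, which reports node positions and the control points of each edge spline. How Naviz itself invokes GraphViz is not shown in this section.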