ROBUST MULTI-RESOLUTION WEB USAGE MINING WITH GENETIC NICHE CLUSTERING Olfa Nasraoui Dept.

of Electrical and Computer Engineering The University of Memphis 206 Engr. Science Bldg. Memphis, TN 38152 onasraou@memphis.edu Raghu Krishnapuram IBM India Research Lab Block 1 Indian Institute of Technology, Hauz Khas New Delhi 110016, India kraghura@in.ibm.com

ABSTRACT In this paper, we present a new hierarchical clustering technique based on the concept of genetic niches, called Hierarchical Unsupervised Niche Clustering (HUNC) which is considerably faster than its non-hierarchical counterpart (UNC), and offers the advantage of multi-resolution clustering. We use HUNC as part of a complete system of knowledge discovery in Web usage data. Our new approach does not necessitate fixing the number of clusters in advance, is insensitive to initialization, can handle noisy data, general non-differentiable similarity measures,and can provide profiles to match any desired level of detail or resolution. Our experiments show that our algorithm is not only capable of extracting meaningful user profiles on real Web sites, but also discovers associations between distinct URL pages on a site, with no additional cost. Unlike content based association methods, our approach discovers associations between different Web pages based only on the user access patterns and not on the page content. Also, unlike traditional context-blind association discovery methods, HUNC discovers context-sensitive associations. INTRODUCTION Manually entered Web user profiles have raised serious privacy concerns, are subjective, and do not adapt to the users’ changing interests. Mass profiling, on the other hand, is based on general trends of usage patterns (thus protecting privacy) compiled from all users on a site, and can be achieved by mining or discovering user profiles from the historical data stored in server access logs. Current Web usage mining approaches avoid the feature representation dilemma of Web data by resorting to memory and computation intensive relational clustering (require the computation of all pairwise dissimilarities) or computation intensive association rule discovery (because very low support thresholds are needed to discover typical profiles). A classical non-relational approach requires a differentiable dissimilarity measure. However, for DM problems, a domain specific similarity measure should be designed free of any constraints. Recently (Nasraoui and Krishnapuram, 2000), we have presented a new evolutionary approach to robust clustering based on the Unsupervised Niche Clustering algorithm (UNC). UNC seeks dense areas in feature space and determines their number by converting the clustering problem into a multimodal function optimization problem within the context of genetic niching and striving to locate the peaks of niches or subpopulations in the search space. Niching methods (Holland, 1975; 1

Each peak in a mutlimodal domain can be thought of as a niche. the user session is encoded as an -dimensional binary attribute vector with the property if the user accessed the URL during the session otherwise The ensemble of all sessions extracted from the server log file is denoted . can provide profiles to match any desired level of detail or resolution. 1 shows the evolution of an initial random population (denoted by square symbols) using UNC. The components of are URL relevence 2 A #! $"  %           ¨ ¦ ¤ ¡¡¡¢©§¥£ # $! £ !     ¢¡      ¡¢¡¢ #! $" 9 @ F D GE4 ( 0(' 1)©& B ¨ 8 0( 6 4 1) 75' 3 2 . Each log entry consists of: (i) User’s IP address. We propose a Hierarchical modification of UNC. etc. (ii) Access time.. In nature. A user session consists of accesses originating from the same IP address within a predefined time period. The similarity measure between two user-sessions (Nasraoui et al. (c) population after 30 generations. and requires no analytical derivation of the prototypes. 1992) were designed to identify multiple optima within multimodal domains. Genetic Optimization makes UNC much less prone to suboptimal solutions than other objective function based approaches. where is the total number of valid URLs. 1999) . 1999. Our new approach does not necessitate fixing the number of clusters in advance. Thus. Each URL in the site is assigned a unique number . Fig. The sesare summarized by a typical session “profile” vector (Nasraoui sions in cluster ( C T R P USQ( F H I( ( C et al. (iii) URL of the page accessed.(a) (b) (c) (d) Figure 1: Evolution of the population using UNC: (a) original data set (b) Initial population. a hierarchy of clusters which give more insight to the Web mining process. and satisfies the desirable property of becoming more stringent as the accessed URLs get farther from the root because the amount of specificity in user accesses increases correspondingly. called HUNC. 2000) takes the site’s structure into account. We use HUNC as part of a complete system of knowledge discovery in Web usage data. . niches correspond to different subspaces of the environment that can support different types of life such as species or organisms. that departs from the traditional limited flat view of the data. (d) final extracted centers Mahfoud. and generates instead. for a noisy data set. Thus. it can handle a vast array of general subjective.. even non-metric dissimilarities. making it suitable for many applications including data and Web mining. THE KNOWLEDGE DISCOVERY PROCESS OF WEB SESSION PROFILING The access log for a given Web server consists of a record of all files accessed by users. toward the correct niches in subsequent generations.

and its maximum allowable mean squared error. which for the cluster. and the new Web session dissimilarity measure is used instead of the Euclidean distance to take the Web site topology in account. 2000). . example from our experiments: to . Let denote the data set partitionned at level . 3 q " p‚€x§ v " y! w( C Q  h f )p¢d U 0 H   @ r sh  ' B ( H H C PIG & t     ¡¢  #! $¢ ( "H 0 r t ƒh uH 0 H E F ' B A r sh ' C D B bc a3Y`WX¢  S S S P£TH  A B@ !( ¢ ©£¦¤      ¨§ ¥ 4 0¨ "H " ( r sh 2 q ' B H " B U V@ S S S S P£PTH 4 H 4 ¡ ( ¢ £¡  (   C R T 6 9 78 52 3£¢ 3D 6 4 ¨  !( h f d ige ¨ 0 ( & 1)'4 2 q q !( $ ( B (1) (2) . . Since. . In other words. Let denote the current level. instead of clustering the entire data set on a single level which would necessitate a larger population size.weights. where is the th cluster found at level . is met. they are not crucial to the performance of HUNC This is because. This means that at any given level. based on the minimum acceptable size of a cluster. estimated by the probability of access of each URL during the sessions of . Hierarchical clustering proceeds by re-clustering the data in each of the above clusters in a recursive fashion to obtain the next level . this complexity is much lower than that of relational clustering techniques such as Agglomerative Hierarchical Clustering (AHC) (Duda and Hart. The computational complexiy of UNC is . and the closely related graph theoretic based Minimum Spanning Tree (MST) (Duda and Hart. UNC’s computational time can be significantly reduced if we perform clustering in a hierarchical mode. Even though the above parameters will eventually determine the number of clusters at the last level of the partitionning. except for a few differences that result from the distinct nature of the session data: The solution space for possible session prototypes consists of binary chromosome strings which are defined to be the binary session attribute vectors . HIERARCHICAL UNSUPERVISED NICHE CLUSTERING AND ITS APPLICATION TO WEB USAGE MINING We retain the principal structure of UNC (Nasraoui and Krishnapuram. The final profiles can be evaluated based on the average dissimilarity. and on a multiomodal optimization approach where multiple clusters are sought in parallel at each level. unlike classical divisive hierarchical clustering. is given by #! $ b ¢ Another measure is the robust cardinality given by  !( %  ©§ ¥ $   ¨ # is a robust weight (that is high for inliers/good data where and low for outliers/noise). where is the population size and is the number of data points to be clustered. our approach relies on robust weights to suppress the influence of outliers and data belonging to other clusters. The hierarchical clustering is performed recursively starting from the top level (lowest resolution) until a termination criterion. They measure the significance of a given URL to the profile. can usually be a very small fraction of (typical in the hierarchical mode. techniques. we could cluster smaller subsets of the data using a smaller population size at multiple levels. HUNC is expected to identify as many good clusters as the population size. 1973). 1973).

This allows us to concentrate on the core of each profile by filtering out the noise sessions assigned to the closest profile. 2. the partition is not subject to any final commitment at each level. 3. with each such profile showing a more specific kind of interest in the department. Also. The same observation can be made about the first cluster (general inquiries about the CECS department) which at level 2 gets split into profiles No. MU-CECS1 Data The 12-day access record (during 1998) of the Web site of the Dept. and 4. of Computer Science and Computer Engineering at the University of Missouri. 7. 5. Columbia generated 1703 sessions accessing 369 distinct URLs. one of the well known pitfalls of hierarchical clustering techniques. 6. Also. the core of the spurious cluster No. CONCLUSION 4 E E ¥ ¢ S8  3 ( ¨ F ¦ 4 4   £ #( £ ¢ " )¨   ¨ $ 4 8 G¨  " S8  0 ¦4  4    C © r #( £ #( £  8 ¢¨ $ $ § ¥ £ ¨¦4 ¤F % &`( !( £ $ $ ( B "   ¨  ¢S8 ¤ 0 `' &  (¨ ¨ ¢ S8   ¦ 4 4   ¡F ' ¡  S8 (  B  4 4 !(  h f igd  S S8   3 ( $  4 !( h f ig¢ d # $! # $! 8 ¢¨ $   £ (3) . WEB USAGE MINING EXPERIMENTAL RESULTS The parameters for the robust hierarchical UNC were fixed to the following values: The crossover and mutation probabilities are and respectively. to the contrary of classical hierarchical techniques. hence was discovmaking weak profiles. .while classical hierarchical approaches are expected to yield the optimal cluster prototypes only at the optimal level of the partition that corresponds to the known correct number of clusters. (ii) Multiresolution profiling: Note how profile 2 (at ) in Table 2 is split into many profiles with distinct user interests (at ) (profiles No. The quality of these clusters is confirmed by their low average dissimilarity compared to the maximal value of . and 9) as shown in Table 3. Since all session dissimilarities are confined in it is reasonable to choose The profile vectors are displayed in the format URL in table 1. Thus. and end up having less than 20 members. exceed a given threshold. 1. ered to contain sessions accessing the site managers’ pages. illustrating typical profiles. (i) Robust profiling is obtained by retaining profile members whose robust weights. (iii) Inferring Associations between different URLs: Profile 2 (at ) in Table 2 contains accesses to two different courses taught by different professors. HUNC re-partitions the data at the end of clustering (the last level of the hierarchy). CECS 352 students in profile 7. . Different values generate different -cuts of the profiles when these are viewed as fuzzy sets. 8. prospective students in profile 2 and 4. signaling an association. equal to in our experiments. . The results obtained at levels are summarized in Table 3 showing profiles that reflect typical access patterns – the general “outside visitor” is captured in profiles 1 and 3. UNC used generations per clustering with a population size. The -core of the profile is defined as  The cores of profiles Nos. etc. It was later revealed that one of the courses (CECS 352: Operating systems) relies for the implementation of its projects on the C programming language which is taught in the other course (CECS 333: Object Oriented Design).

Thus.html . course 0. people and main degree page Dr. R.19 .54 0.html . and requires no analytical derivation of the prototypes.19 ./courses100. P. even non-metric dissimilarities. Joshi’s course pages (CECS 333 and CECS 352 respectively) Accesses to the CECS227 class pages Dr Shi’s CECS345 pages Dr./CECS computer. with no additional cost.93 .0 enquiries. we presented an adaptation of UNC. can provide profiles to match any desired level of detail or resolution. 5 4 Table 2: 4 of the 7 profiles discovered by HUNC from MU-CECS1 Data at description 1 572 312 362.html . Also.95 ./courses index.class ./ 1. We have illustrated through several examples that our clustering process results in the discovery of associations between different URL addresses on a given site./faculty.html .08 For Web usage mining.95 . the associations are meaningful only within well defined distinct profiles/contexts (context-sensitive) as opposed to all or none of the data (context-blind). This approach of discovering context-sensitive associations via clustering can be generalized to other transactional data. Saab’s and Dr. Pattern Classification and Scene Analysis.html . Hart.0 102. it can handle a vast array of general subjective.html . References Duda. Wiley Interscience. 1973.class ( C " S8  (   4 B   " !( £ " ( $ B "  4  ./people.html . Our new approach does not necessitate fixing the number of clusters in advance.95 .0 124. Shang’s course pages 0.19 ./courses200. the session dissmilarity measure is not a distance metric.2 0. and dealing with relational data is impractical given the huge dimension of the data sets. NY.2 main page./people index./ ./courses.32 ¢  ¡ ¨ (     £¡ ¡ (    3 ¡ ¡ ¡ ¡ ¡   ¤¡   ¢¡  £¡   £¡   ¤¡   " ¡   1 .00 .19 . ACKNOWLEDGMENTS Partial support of this work by the National Science Foundation Grant IIS 9800899 is gratefully acknowledged.83 ..2 51.67 ..Table 1: Examples of Profiles discovered by HUNC from MU-CECS1 Data at and 2 3 4 5 305 185 162 73 170 111 84 56 191./CECS computer. class list. called Hierarchical Unsupervised Niche Clustering (HUNC) which is considerably faster than its non-hierarchical counterpart. making it suitable for many applications in data and Web mining.37 0. Therefore.

. J.5 11 12 13 Holland. May 2000. individual faculty. Parallel problem Solving from Nature. O.27 0. class list. (  " ¡ (  B " 4  " ( B " 0 4  0. In: Ninth IEEE International Conference on Fuzzy Systems. people. Mahfoud.. 1992. 1999.. class list.5 77. San Antonio. Joshi. To appear in International Journal of Artificial Intelligence . O.19 0. Brussels.. Sep. pp. research and graduate degree pages Dr. R. In: 2nd Conf.2 29.4 123. J.19 6 . Saab’s CECS333 pages (long detailed sessions) Dr. Krishnapuram. 705–709. Nasraoui.3 30. R.3 49. NY. course enquiries and people main page. MIT Press. PPSN ’92. H.4 33.0 22. Krishnapuram.0 91. 2000.16 0. A novel approach to unsupervised robust clustering using genetic niching. 1975..Table 3: Some of the 16 profiles discovered by HUNC from MU-CECS1 Data at and description main page. Saab’s CECS303 pages (long detailed sessions) Accesses to the CECS227 class pages Dr Shi’s CECS345 (main page and Java examples) Dr Shi’s CECS345 (long sessions: lectures and project No.16 0. 170–175. Extracting web user profiles using relational competitive fuzzy clustering.6 80. A. pp.26 0.. W. O..46 0.. R. Krishnapuram. Mining web access logs using a relational clustering algorithm based on a robust estimator. In: NAFIPS Conference. Nasraoui. Saab’s CECS333 pages (short sessions) Dr. 1) Dr Shi’s CECS345 (short sessions to main page) ( ¢ ¡   ¡ 1 2 3 4 6 8 9 10 219 119 140 129 133 47 53 184 77 47 34 132 73 85 71 80 28 111 49 30 - 140. Nasraoui.13 0. Belgium. Crowding and preselection revisited.. Jun.2 0. New York. TX. S.7 85. H. Adaptation in natural and artificial systems. A. course and undergraduate degree enquiries Short sessions mostly limited to main page and class list main page...27 0. Frigui..39 0.