
CHAPTER-6

WEB USAGE MINING USING CLUSTERING

6.1 Related work in Clustering Technique

6.2 Quantifiable Analysis of Distance Measurement Techniques

6.3 Approaches to Formation of Clusters

6.4 Conclusion

Prediction Model for Web Caching and Prefetching with Web Usage Mining to optimize web objects 75
This chapter deals in detail with the clustering web usage mining technique for pattern
discovery. Cluster analysis is not a single algorithm; it comprises many different algorithms for
pattern discovery that can be chosen according to the application and the targeted results. The
first section of the chapter presents a literature survey of clustering techniques in general. The
second section deals with partitioning clustering techniques in detail. The next section covers
distance measurement techniques, and the remaining sections deal with the new approach to
pattern discovery and its associated analysis.

6.1 Related work in Clustering Technique


Clustering is an unsupervised data mining technique that divides data among different
groups for the pattern discovery task. It is also known as exploratory data analysis, since no
labeled data are available [5]. The ultimate goal of clustering is to separate a finite set of
unlabeled data into a discrete and finite set of useful, valid, and hidden groupings. A cluster is
characterized by internal homogeneity and external separation [87]. The clustering procedure is
applied to preprocessed data that consists of only selected attributes. After preprocessing, an
appropriate algorithm is selected or designed to achieve the targeted results. Finally, result
interpretation is performed to provide meaningful insights to the end user. In [98] several
characteristics of clusters are described which can be considered for better cluster formation.
Clustering is very useful in several applications of data mining, document retrieval, image
segmentation, and pattern classification, for purposes of pattern analysis, grouping, decision
making, and machine learning. Clustering in data mining has been studied extensively in the
literature in the form of information retrieval and text mining [32,102], while very little work
has been done for web analysis.
There are three main components in cluster analysis: (1) effective similarity measures, (2)
criterion functions, and (3) algorithms.

6.1.1 Similarity Measures


Similarity is typically measured as the distance between two objects. Different domains
such as web analysis, information retrieval, recommendation systems, and social network
analysis require efficient techniques for measuring similarity among diverse objects. To
measure distance, various similarity measures have been proposed in the literature. There are
two main categories of similarity measures: (I) content based and (II) link based. The content
based methods [118,46,73,95] evaluate similarity among various objects such as web pages,
persons, and multimedia objects. The link based similarity measures [6,93] are useful to
determine the similarity of the links of two web objects for search engine purposes. Link based
similarity measures are out of the scope of the proposed research, since they are vital for the
search engine aspect rather than for web usage mining. The most popular and commonly used
distance metric studied in the literature is the Euclidean distance [1,68,53]. It is the ordinary
distance between two points, as would be measured with a ruler, and it is derived from the
Pythagorean formula. The distance between two data points in the plane with coordinates
(p, q) and (r, s) is given by:

DIST((p, q), (r, s)) = sqrt((p − r)² + (q − s)²)
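For concreteness, the formula can be sketched in a few lines of Python. This is an illustrative sketch only; the function name and the n-dimensional generalization are our own, not part of the thesis tooling:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance derived from the Pythagorean formula;
    works for points of any (equal) number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # sqrt(3^2 + 4^2) = 5.0
```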

The usefulness of the Euclidean distance depends on the circumstances. For points in a plane it
provides fairly good results, but with slow speed, and it is not useful for string distance
measurement. The Euclidean distance can be extended to any number of dimensions. Another
popular distance metric is the Manhattan distance, which computes the distance from one data
point to another when a grid-like path is followed. The Manhattan distance between two items
is the sum of the differences of their corresponding elements [103]. The distance between two
data points P = (p1, p2, …, pn) and Q = (q1, q2, …, qn) is given by:
DIST(P, Q) = Σ |pi − qi|, summed over i = 1 … n

It is the summation of the horizontal and vertical components, whereas the diagonal
(straight-line) distance would be computed using the Pythagorean formula. It is closely related
to the Euclidean distance and exhibits similar characteristics. It is useful, for example, in
grid-based applications such as chess for counting moves from one square to another.
Minkowski is another famous distance measurement technique that can be considered a
generalization of both the Euclidean and Manhattan distances. Several studies [4,39,43] have
been carried out based on the Minkowski distance measurement technique to determine
similarity among objects. The Minkowski distance of order p between two points
P = (x1, x2, …, xn) and Q = (y1, y2, …, yn) ∈ Rⁿ is defined as:

DIST(P, Q) = ( Σ |xi − yi|^p )^(1/p), summed over i = 1 … n

If the value of p equals one, the Minkowski distance becomes the Manhattan distance, while for
p = 2 it becomes the Euclidean distance. All the distances discussed so far are ordinary
distances, of the kind measured with a ruler; they are not suitable for measuring the similarity
of two strings, and are therefore not appropriate in the context of the proposed research, since
a web session consists of a number of strings in the form of URLs. The Hamming distance is a
popular similarity measure for strings that determines similarity by counting the number of
positions at which the corresponding characters differ; more formally, the Hamming distance
between two strings P and Q is Σ [Pi ≠ Qi], the count of positions i at which they differ.
Hamming distance theory is used widely in several applications such as the quantification of
information, the study of the properties of codes, and secure communication. The main
limitation of the Hamming distance is that it is only applicable to strings of the same length. It
is widely used in the field of error-free communication [2,112] because the byte length remains
the same for both parties involved in the communication.
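As an illustrative sketch (the names are our own), the Hamming computation for equal-length strings can be written as:

```python
def hamming_distance(p, q):
    """Number of positions at which corresponding characters differ;
    defined only for strings of equal length."""
    if len(p) != len(q):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(p, q))

print(hamming_distance("karolin", "kathrin"))  # 3 positions differ
```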

Prediction Model for Web Caching and Prefetching with Web Usage Mining to optimize web objects 77
The Hamming distance is not very useful in the context of web usage mining, since the lengths
of web sessions are not all the same. The Levenshtein distance, or edit distance, is a more
sophisticated distance measurement technique for string similarity. It is a key distance in
several fields such as optical character recognition, text processing, computational biology,
fraud detection, and cryptography, and it has been studied extensively by many authors
[79,42,9]. The Levenshtein distance between two strings S1 and S2 is given by
lev(|S1|, |S2|), where

lev(i, j) = max(i, j)  if min(i, j) = 0; otherwise

lev(i, j) = min( lev(i−1, j) + 1, lev(i, j−1) + 1, lev(i−1, j−1) + [S1i ≠ S2j] )

where [S1i ≠ S2j] is 1 if the two characters differ and 0 if they are the same.

The Levenshtein distance measurement technique is well suited to the web session context,
since it is applicable to strings of unequal size.
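The recurrence above can be implemented bottom-up with dynamic programming. The following is an illustrative sketch, not the thesis implementation:

```python
def levenshtein(s1, s2):
    """Edit distance via dynamic programming; works for unequal-length strings."""
    m, n = len(s1), len(s2)
    # dist[i][j] = edit distance between s1[:i] and s2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # i deletions turn s1[:i] into the empty string
    for j in range(n + 1):
        dist[0][j] = j          # j insertions turn the empty string into s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match/substitution
    return dist[m][n]

print(levenshtein("kitten", "sitting"))  # 3 (two substitutions, one insertion)
```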

Several bioinformatics distance measurement techniques that are used to align protein or
nucleotide sequences can be applied from a web mining perspective to cluster web sessions of
unequal size. One of the most important techniques of this category was invented by Saul B.
Needleman and Christian D. Wunsch [80] to align protein sequences of unequal size. This
technique uses dynamic programming, i.e. it solves a complex problem by breaking it down into
simpler subproblems. It is a global alignment technique, most appropriate for closely related
sequences of similar length. Alignment is performed from the beginning to the end of the
sequences to find the best possible alignment. The technique uses a scoring system: a positive
or higher value is assigned for a match and a negative or lower value for a mismatch. It uses
gap penalties to maximize the meaning of the alignment; the gap penalty is subtracted for each
gap that is introduced. There are two main types of gap penalty: open and extension. The open
penalty is applied at the start of a gap, and the gaps following it are given a gap extension
penalty, which is smaller than the open penalty. Typical values are −12 for gap opening and −4
for gap extension. According to the Needleman-Wunsch algorithm, an initial matrix is created
with dimension N × M, where N (the number of rows) equals the number of characters of the
first string plus one and M (the number of columns) equals the number of characters of the
second string plus one. The extra row and column are used to align with a gap. After that, a
scoring scheme is introduced, which can be user defined with specific scores. A simple basic
scoring scheme is: if the characters at positions i and j are the same, the match score is 1
(S(i,j) = 1); if they are not the same, the mismatch score is assumed to be −1 (S(i,j) = −1). The
gap penalty is assumed to be −1. When any operation such as insertion or deletion is
performed, the dynamic programming matrix is filled in three different steps:

1. Initialization Phase: the gap score is added cumulatively along the first row and the first
column of the matrix.

2. Matrix Filling Phase: this is the most crucial phase; matrix filling starts from the upper
left-hand corner of the matrix. To find the maximum score of each cell, the diagonal, left, and
top scores of the current position must be known. The match or mismatch score is added to the
diagonal value, and the gap penalty is added to the left and top values. The maximum of these
three values (diagonal, left, and top) is placed in cell (i, j). The equation to calculate the
maximum score is:

M(i,j) = Max[ M(i−1,j−1) + S(i,j), M(i,j−1) + W, M(i−1,j) + W ]

where i and j denote row and column, M(i,j) is the matrix value of the required cell, S(i,j) is the
match/mismatch score for that cell, and W is the gap penalty.

3. Alignment through Trace Back: the final step in the Needleman-Wunsch algorithm is tracing
back for the best possible alignment, which is identified using the maximum alignment score.

The Needleman-Wunsch distance measurement technique is well suited to string similarity, so
this technique is also considered in the proposed research context.
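The three phases above can be sketched as follows. This is an illustrative Python sketch using the simple 1/−1/−1 scoring assumed in the text; it returns only the alignment score and omits the trace-back:

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment score via the M(i,j) = max(diagonal + S, left + W, top + W)
    recurrence; the final cell holds the score."""
    m, n = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap              # initialization phase: gaps down the first column
    for j in range(1, n + 1):
        M[0][j] = j * gap              # initialization phase: gaps along the first row
    for i in range(1, m + 1):          # matrix filling phase
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # diagonal: match/mismatch
                          M[i][j - 1] + gap,     # gap in s1
                          M[i - 1][j] + gap)     # gap in s2
    return M[m][n]                     # global alignment score

print(needleman_wunsch("GATTACA", "GCATGCU"))  # classic example scores 0
```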

Smith-Waterman is an important bioinformatics technique for aligning different strings. This
technique compares segments of all possible lengths and optimizes the measure of similarity.
Temple F. Smith and Michael S. Waterman [97] were the founders of this technique. The main
difference in comparison with Needleman-Wunsch is that negative scoring matrix cells are set
to zero, which makes local alignment visible. The technique compares segments of diverse
lengths instead of looking at the entire sequence at once. The main advantages of the
Smith-Waterman technique are:

 It gives the conserved regions between the two sequences.

 It aligns partially overlapping sequences.
 It can align a subsequence of a sequence to itself.

Like Needleman-Wunsch, this technique also uses a scoring matrix system. The same concepts
of scoring and gap analysis used in Needleman-Wunsch are applicable here in Smith-Waterman,
and it uses the same steps of initialization, matrix filling, and alignment through trace back.
The equation to calculate the maximum score is the same as in Needleman-Wunsch. The main
differences between Needleman-Wunsch and Smith-Waterman are:

 Needleman-Wunsch performs global alignment while Smith-Waterman focuses on local
alignment.
 Smith-Waterman requires the alignment score for a pair of residues to be >= 0 (negative
cells are set to zero), while in Needleman-Wunsch it may be positive or negative.
 For Needleman-Wunsch no gap penalty is required for processing, while for
Smith-Waterman a gap penalty is required for efficient work.
 In Needleman-Wunsch the score cannot decrease between two cells of a pathway, while
in Smith-Waterman the score can increase, decrease, or remain the same between two
cells of a pathway.
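The differences above can be seen in code: relative to the global alignment sketch, the only changes are clamping negative cells to zero and taking the best score anywhere in the matrix. An illustrative sketch, not the thesis implementation:

```python
def smith_waterman(s1, s2, match=1, mismatch=-1, gap=-1):
    """Local alignment score: same recurrence as Needleman-Wunsch, but negative
    cells are clamped to zero and the best score may occur anywhere in the matrix."""
    m, n = len(s1), len(s2)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay zero
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,                     # clamp: this makes the alignment local
                          H[i - 1][j - 1] + s,   # diagonal: match/mismatch
                          H[i][j - 1] + gap,     # gap in s1
                          H[i - 1][j] + gap)     # gap in s2
            best = max(best, H[i][j])
    return best

print(smith_waterman("XXABCXX", "YYABCYY"))  # local match on "ABC" scores 3
```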

Table 6.1 compares the different distance metric techniques in the context of the proposed
research.

Table 6.1 Comparison of distance metrics techniques

1. Euclidean Distance
Description: the distance between two points as would be measured with a ruler, calculated
using the Pythagorean theorem.
Advantages: (1) fast for determining correlation among points; (2) a fair measure because it
compares data points based on actual ratings.
Disadvantages: (1) not suitable for ordinal data such as strings; (2) requires actual data, not
ranks.

2. Levenshtein
Description: a string metric for measuring the difference between two strings.
Advantages: fast and well suited to string similarity.
Disadvantages: does not consider the order of the sequence of characters while comparing.


3. Needleman-Wunsch
Description: a bioinformatics algorithm that provides global alignment between strings while
comparing.
Advantages: well suited to string comparison because it considers the ordering of the sequence
of characters.
Disadvantages: requires strings of the same length while comparing.

4. Smith-Waterman
Description: a bioinformatics algorithm that provides local alignment between strings while
comparing.
Advantages: well suited to string comparison because it considers the ordering of the sequence
of characters, and it is applicable to strings of similar or dissimilar lengths.
Disadvantages: rather more complex than any global alignment technique.

From the above table it can be seen that the Euclidean distance is not suitable for the proposed
research, because web sessions consist of sequences of web objects, which are in string format.
The Levenshtein distance is a very good technique for string sequence similarity, but for a
prediction model of web caching and prefetching the ordering of web objects is an important
aspect, which this distance metric ignores, so it too is not appropriate in the proposed research
context. Both Needleman-Wunsch and Smith-Waterman consider the ordering of the sequence
for string matching, so they are suitable for this context. Web sessions are not always of the
same length, so the Needleman-Wunsch algorithm is not a perfect fit for the formation of web
session clusters, as it only provides global alignment. The Smith-Waterman algorithm is
applicable to sequences of both the same and dissimilar lengths, so it is the ideal algorithm for
the formation of clusters in this proposed research.

6.1.2 Categories of Clustering Algorithms

Once an appropriate distance metric is identified, the next step is to determine an
appropriate category of clustering algorithm. Clustering algorithms are basically of two main
types, described in the following figure.

Hierarchical clustering is also known as connectivity based clustering. It is based on the
fundamental idea that objects are more related to nearby objects than to objects far away.
Hierarchical clustering algorithms connect different objects based on their distance
measurements. The different clusters in hierarchical clustering are represented in binary tree
format. Hierarchical clustering algorithms are either top-down or bottom-up. Top-down
clustering is also known as a splitting algorithm: it proceeds by splitting clusters recursively
until individual objects are reached. Bottom-up clustering is known as a merging algorithm.
Bottom-up

Clustering Algorithms
- Hierarchical Clustering: Bottom Up, Top Down
- Partitioning Clustering: Centroid, Medoid

(Figure-6.1 Categories of Clustering Algorithms)

clustering algorithms [104-124] begin with n clusters, each containing a single sample or point.
The two clusters whose distance is least are then merged, and this is repeated. The graphical
representation of both techniques is as follows:

[Figure: clusters C1–C5 are merged bottom-up into (C1,C2), (C3,C4), then (C3,C4,C5), and
finally (C1,C2,C3,C4,C5); top-down splitting proceeds in the reverse direction.]

(Figure-6.2 Graphical Representation of Hierarchical Clustering Techniques)
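The bottom-up merging just illustrated can be sketched as follows. This is an illustrative single-linkage sketch with a pluggable distance function, not the thesis implementation:

```python
def agglomerative(points, target_clusters, dist):
    """Bottom-up clustering: start with one cluster per point, then repeatedly
    merge the pair of clusters with the least single-linkage distance."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # pick the pair of clusters whose closest members are nearest to each other
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: min(dist(x, y)
                                      for x in clusters[ab[0]]
                                      for y in clusters[ab[1]]))
        clusters[i].extend(clusters.pop(j))   # merge the two closest clusters
    return clusters

print(agglomerative([1, 2, 10, 11, 50], 2, lambda a, b: abs(a - b)))
# -> [[1, 2, 10, 11], [50]]
```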

The main advantages of hierarchical algorithms are:

(1) It is not required to specify the number of clusters in advance.

(2) The generation of smaller clusters is possible, which may be helpful for the discovery of
important information.

However, this category of clustering has a number of limitations:

(a) Objects might be grouped incorrectly, so results should be examined closely before
proceeding to the next phase.

(b) Use of different distance techniques may generate different results.

(c) Interpretation of the results is subjective.

(d) Interpretation of the hierarchy is complex and often confusing.

(e) Research shows that most hierarchical algorithms do not revisit clusters once they are
built.

Hierarchical clustering is not ideal in the context of the proposed research, because it is
not flexible in cluster formation; it is rigid in terms of optimizing clustering results. Sometimes
the grouping of clusters is also not up to the mark. Moreover, hierarchical clustering uses
ordinary distance metric techniques that do not suit the web usage mining process.

Partitioning clustering techniques are another category of clustering techniques. They
are very effective, relocation based techniques. There are two main approaches: (I) centroid
and (II) medoid. In the centroid approach, the gravity center of the objects is taken as the
measure representing each cluster. K-means is the well known centroid based algorithm. There
are three main steps in the k-means algorithm:

(i) A center point is determined and each cluster is associated with a center point.

(ii) Each point is assigned to the cluster with the closest center point.

(iii) The number of clusters, K, must be specified.

In k-means, the initial cluster centers are selected randomly [44-125]. Using the
Euclidean distance measurement technique, every item is assigned to its nearest cluster center;
each cluster center is then moved to the mean of its assigned items. Assignment and center
movement are repeated until the change in cluster assignments becomes less than a threshold
value. The k-means algorithm exhibits a number of characteristics:

(A) It is most suitable for large data sets.

(B) It is sensitive to noise.

(C) It terminates at a local optimum, i.e. it is optimal within a neighboring set of candidate
solutions.

(D) The clusters formed tend to be spherical in shape.
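The assignment and update loop described above can be sketched on one-dimensional data as follows; this is an illustrative sketch only, not the thesis code:

```python
import random

def k_means(items, k, iters=100):
    """Lloyd's k-means on 1-D data: pick k random centres, assign each item to
    its nearest centre, move each centre to the mean of its items, repeat."""
    centres = random.sample(items, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in items:                                         # assignment step
            nearest = min(range(k), key=lambda c: abs(x - centres[c]))
            clusters[nearest].append(x)
        new_centres = [sum(c) / len(c) if c else centres[i]     # update step
                       for i, c in enumerate(clusters)]
        if new_centres == centres:                              # converged
            break
        centres = new_centres
    return centres, clusters

random.seed(0)
centres, clusters = k_means([1.0, 2.0, 9.0, 10.0], 2)
print(sorted(centres))  # two well-separated groups -> centres 1.5 and 9.5
```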

K-means is a sound partition based clustering algorithm, but it exhibits certain
limitations in the context of the proposed research work:

1. It is not possible to predict the number of clusters K in advance in the proposed research.

2. K-means has problems when clusters are of different sizes; in the proposed research it is not
possible for all clusters to have the same size.

3. K-means has an outlier problem.

4. Empty clusters are possible in k-means.

5. It is applicable only when the mean of a cluster is defined, and it is not suitable for
categorical data; in the proposed research the data may be categorical.

6. Results depend heavily on the initial partitions.

7. One object may not be part of more than one cluster, whereas in the proposed research one
session might be part of a number of clusters.

8. The distance to the centroid is calculated using an ordinary distance metric such as the
Euclidean distance, which would be measured with a ruler and calculated using the
Pythagorean theorem; it does not suit string data, while web sessions are in the form of
strings.

K-medoid is another powerful partitioning clustering algorithm, based on the medoid
philosophy. It is more robust to noise and outliers than k-means because it minimizes a sum of
pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid is the most
centrally located point in the cluster, the one with the smallest average dissimilarity to all the
objects in the cluster. The most important and common algorithm of this category is
Partitioning Around Medoids (PAM). The PAM algorithm, like k-means, requires selecting K
clusters randomly. Each object is associated with the closest medoid, where closeness is
defined using a valid distance metric, most commonly the Euclidean distance. For each medoid
and non-medoid object, the two are swapped and the total cost of the configuration is
computed; the configuration with the lowest cost is selected. The steps of association to the
closest medoid and swapping are repeated until there is no change in the medoids. The main
advantage of K-medoid is that the outlier problem is removed, but it still faces a number of the
same limitations as k-means: K clusters are required in advance, one object may not be part of
more than one cluster, and an ordinary distance metric such as the Euclidean distance is used
for calculating the medoid.
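The PAM swap step described above can be sketched as follows. This is a greedy illustration with a deterministic start; the actual algorithm selects its initial medoids randomly, and the names here are our own:

```python
def pam(items, k, dist):
    """Partitioning Around Medoids (greedy swap): keep swapping a medoid with a
    non-medoid whenever the total cost (sum of distances to nearest medoid) drops."""
    medoids = list(items[:k])          # deterministic start for illustration only

    def cost(meds):
        return sum(min(dist(x, m) for m in meds) for x in items)

    improved = True
    while improved:
        improved = False
        for i in range(k):
            for x in items:
                if x in medoids:
                    continue
                candidate = medoids[:i] + [x] + medoids[i + 1:]
                if cost(candidate) < cost(medoids):   # keep the cheaper configuration
                    medoids = candidate
                    improved = True
    return medoids

print(sorted(pam([1, 2, 3, 10, 11, 12], 2, lambda a, b: abs(a - b))))  # -> [2, 11]
```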

Prediction Model for Web Caching and Prefetching with Web Usage Mining to optimize web objects 84
In both the k-means and k-medoid techniques, one object cannot be part of more
than one cluster, while in the proposed research context one web object can be part of more
than one cluster. One popular technique of clustering is Fuzzy C-Means [81-126], which
attempts to divide n elements into a collection of m fuzzy clusters with respect to some
criterion. The Fuzzy C-Means algorithm is simple and contains three main steps. The first step
is to select the number of clusters. The second step is to assign random coefficients to each
point for being in the clusters. The third step has two sub-steps: compute the centroid for each
cluster, and for each point compute its coefficients of being in the clusters. The third step is
repeated until the change in coefficients between two iterations is no more than a threshold
value. In this kind of clustering every object has a degree of belonging to clusters, as in fuzzy
logic, rather than belonging completely to just one cluster. The main advantage is that the
algorithm minimizes intra-cluster variance, but the same problems remain: it requires
predicting the number of clusters in advance, and results depend on the initial choice of
weights.
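The three FCM steps can be sketched on one-dimensional data as follows. This is an illustrative sketch; the fuzzifier m = 2, the fixed iteration count, and all names are our own assumptions:

```python
import random

def fuzzy_c_means(points, c, m=2.0, iters=100):
    """Fuzzy C-Means on 1-D data: every point holds a degree of membership in
    each of the c clusters, rather than belonging to exactly one."""
    # step 2: random initial membership coefficients, each row normalized to sum to 1
    u = [[random.random() for _ in range(c)] for _ in points]
    u = [[w / sum(row) for w in row] for row in u]
    centres = [0.0] * c
    for _ in range(iters):
        # step 3a: centroid of each cluster = membership-weighted mean (weights ** m)
        centres = [sum(u[i][j] ** m * p for i, p in enumerate(points)) /
                   sum(u[i][j] ** m for i in range(len(points)))
                   for j in range(c)]
        # step 3b: recompute each membership from relative distances to all centres
        for i, p in enumerate(points):
            for j in range(c):
                dj = abs(p - centres[j]) or 1e-12
                u[i][j] = 1.0 / sum((dj / (abs(p - ck) or 1e-12)) ** (2 / (m - 1))
                                    for ck in centres)
    return centres, u

random.seed(1)
centres, u = fuzzy_c_means([1.0, 2.0, 9.0, 10.0], 2)
print(sorted(round(ck, 1) for ck in centres))
```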

The following table summarizes the above-mentioned clustering techniques in the
context of the proposed research work in a condensed manner.

Table 6.2 Clustering techniques characteristics

1. Top-Down Hierarchical Clustering
Characteristics: it splits clusters recursively until individual objects are reached; the number of
clusters need not be specified in advance; smaller clusters are possible.
Justification with proposed research: this technique does not revisit clusters once they are
built, but web session clusters require repetition until appropriate clusters are formed.

2. Bottom-Up Hierarchical Clustering
Characteristics: it begins with n clusters and repeatedly merges the two clusters whose
distance is least.
Justification with proposed research: the same problem as top-down clustering of not revisiting
clusters after formation.

3. K-Means (centroid based partitioning algorithm)
Characteristics: using the Euclidean distance measurement technique, it assigns every item to
its nearest cluster center.
Justification with proposed research: it requires predicting the number of clusters K in
advance, which is not possible in the proposed research context.

Table 6.2 Clustering techniques characteristics (continued)

3. K-Means (continued)
Justification with proposed research: it uses an ordinary distance measurement technique that
is applicable only to numerical data, but web sessions are in the form of string data and the
sequence is also important during cluster formation; one object may not be part of more than
one cluster, but according to the proposed research one session may be part of a number of
clusters; outlier or noise problems may arise.

4. K-Medoid (medoid based partitioning algorithm)
Characteristics: it is based on the medoid philosophy; it is more robust to noise and outliers; it
minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.
Justification with proposed research: it exhibits similar limitations to k-means in the context of
the proposed research work.

5. Fuzzy C-Means
Characteristics: it attempts to divide n elements into a collection of m fuzzy clusters with
respect to some criterion, so one object may be part of a number of clusters.
Justification with proposed research: it requires predicting the number of clusters in advance,
which is not possible in the proposed research work; it also uses an ordinary distance
measurement technique that is not suitable in the context of the proposed research work.

Table 6.2 shows that no clustering technique is perfect for the prediction model for
web caching and web prefetching. There are two main limitations across all of the
above-mentioned clustering techniques: first, no appropriate distance measurement technique
is used, and second, the number of clusters cannot be predicted in advance. In the proposed
research context no clustering technique can be used directly. The first important point is to
determine an appropriate distance measurement technique for the proposed research work.
The next section of this chapter deals with a quantifiable analysis of the distance measurement
techniques that suit the proposed research context.

6.2 Quantifiable Analysis of Distance Measurement Techniques

It is clear from section 6.1.1 that the Levenshtein, Needleman-Wunsch, and Smith-
Waterman distance measurement techniques are appropriate for the clustering of web sessions
in the prediction model of web caching and web prefetching. This section analyzes all of these
distance measurement techniques in a quantifiable manner and decides which one is the most
efficient in the context of the current work.

6.2.1 Infrastructure and experimental environment

The following infrastructure is used in the experiment for analyzing the distance
measurement techniques in a quantifiable manner.

(1) Personal Computer: Intel Pentium 4 CPU, 2.40 GHz, 1 GB of RAM, 20 GB hard disk.

(2) Online tool for distance measurement: available at
http://asecuritysite.com/forensics/simstring, supporting many distance measurement
techniques.

(3) Internet: used for download purposes.

(4) Sample raw log file: downloaded from the NASA site
(http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html), containing transactions of users
between 15-11-2009 and 31-11-2009.

(5) Operating System: Microsoft Windows XP Professional version 2002, Service Pack 2.

As far as the experimental environment is concerned, the following steps are taken:

(a) Web object numbers are converted into the corresponding alphabet letters, as the online
tool deals with string data.
For example: Cluster 1: 2, 5, 7, 8, 9, 10 is converted to BEGHIJ
Cluster 2: 6, 8, 9, 12, 15, 2, 5 is converted to FHILOBE
(b) The same sample data as for the Markov model is taken to experiment with the results of
the Levenshtein distance measurement technique.
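The conversion in step (a) simply maps each web object number to the corresponding letter of the alphabet (1 → A, 2 → B, …); a one-line sketch with an illustrative function name:

```python
def session_to_string(session):
    """Map web object numbers to letters: 1 -> A, 2 -> B, ..., 26 -> Z."""
    return "".join(chr(ord("A") + n - 1) for n in session)

print(session_to_string([2, 5, 7, 8, 9, 10]))      # BEGHIJ
print(session_to_string([6, 8, 9, 12, 15, 2, 5]))  # FHILOBE
```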

6.2.2 Levenshtein Analysis

Table 6.3 describes the distance between clusters using the Levenshtein distance
measurement technique. The distance between sessions is calculated using the equation
described in section 6.1. Figure 6.3 is a snapshot of the online tool used for distance
measurement. The online tool requires strings to match, so the sessions are converted into
string form.

(Figure-6.3 Snapshot of online tool for distance measurement)

Table 6.3 Distance between clusters using Levenshtein

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

1 0 0 25 31 8 17 20 21 21 55 25 21 8 43 33 21 27 25 7 19 14 14 25 20 46

2 0 0 33 38 33 21 20 21 29 27 31 29 31 50 50 21 33 25 29 25 0 0 0 40 15

3 25 33 0 31 31 33 42 21 79 33 25 36 15 0 33 29 27 75 21 25 33 8 50 17 46

4 31 38 31 0 23 8 54 29 36 46 81 71 31 14 77 7 87 25 36 38 31 23 23 38 8

5 8 33 31 23 0 18 27 14 36 45 19 21 69 14 25 21 20 19 79 19 9 9 0 27 8

6 17 21 33 8 18 0 36 14 29 27 6 14 15 7 8 79 7 31 14 25 36 9 27 9 31

7 20 20 42 54 27 36 0 36 36 45 44 36 38 7 50 43 47 44 29 31 30 20 30 30 31

8 21 21 21 29 14 14 36 0 36 36 19 14 14 7 29 21 20 31 7 88 36 14 21 21 21

9 21 29 79 36 36 29 36 36 0 43 31 36 21 14 36 29 33 56 21 31 21 7 36 29 29

10 55 27 33 46 45 27 45 36 43 0 44 36 31 7 50 29 47 31 36 31 27 9 18 45 38

11 25 31 25 81 19 6 44 19 31 44 0 69 25 38 62 6 88 12 25 19 25 31 19 31 12

12 21 29 36 71 21 14 36 14 36 36 69 0 21 14 50 0 73 19 29 25 36 14 29 29 21

13 8 31 15 31 69 15 38 14 21 31 25 21 0 14 31 14 27 12 57 12 15 15 15 23 0

14 43 50 0 14 14 7 7 7 14 7 38 14 14 0 14 0 27 19 14 12 14 21 7 29 29

15 33 50 33 77 25 8 50 29 36 50 62 50 31 14 0 14 67 31 21 31 25 17 25 42 15

16 21 21 29 7 21 79 43 21 29 29 6 0 14 0 14 0 7 31 7 31 29 14 21 21 36

17 27 33 27 87 20 7 47 20 33 47 88 73 27 27 67 7 0 12 27 25 27 27 20 33 7

18 25 25 75 25 19 31 44 31 56 31 12 19 12 19 31 31 12 0 6 25 25 19 38 19 56

19 7 29 21 36 79 14 29 7 21 36 25 29 57 14 21 7 27 6 0 6 14 21 0 21 0

20 19 25 25 38 19 25 31 88 31 31 19 25 12 12 31 31 25 25 6 0 31 12 19 19 25

21 14 0 33 31 9 36 30 36 21 27 25 36 15 14 25 29 27 25 14 31 0 14 12 10 38

22 14 0 8 23 9 9 20 14 7 9 31 14 15 21 17 14 27 19 21 12 14 0 25 10 31

23 25 0 50 23 0 27 30 21 36 18 19 29 15 7 25 21 20 38 0 19 12 25 0 20 46

24 20 40 17 38 27 9 30 21 29 45 31 29 23 29 42 21 33 19 21 19 10 10 20 0 15

25 46 15 46 8 8 31 31 21 29 38 12 21 0 29 15 36 7 56 0 25 38 31 46 15 0

Analysis of Levenshtein Patterns

[Figure: accuracy percentage of high, average, and low accuracy patterns plotted against
threshold values from 50 to 80.]

(Figure-6.4 Levenshtein Pattern Analysis)

From the above graph it is observed that a threshold value of 70 is ideal for the Levenshtein
distance; it provides a mean accuracy of 78.99, and all patterns have a mean accuracy
between 70 and 78.99.

The following are several limitations of the Levenshtein measure for pattern discovery in the
current research context:

(1) Session 1: 2 5 7 8 9 10

Session 10: 2 5 7 8 9 10 12 13 14 10

Here both sessions request the same web objects and the order of the web objects is also
similar, but the distance between them is only 55.

(2) Session 8: 7 6 5 2 1 5 6 9 10 12 14 11 10 9

Session 20: 7 6 5 2 1 5 6 9 10 12 14 11 10 9 13 5

Here the case is the same as (1), but the distance between them is 88, which means
Levenshtein also considers the lengths of the two strings.

(3) Session 5: 5 7 9 11 12 13 14 15 2 3 14

Session 13: 3 6 9 11 12 13 14 15 2 3 14 10 12

Here the orders of the web objects are not exactly the same and some web objects differ
between the two sessions, yet the distance between them is 69.

(4) Session 1: 2 5 7 8 9 10

Session 2: 6 8 9 12 15 2 5

Here the order is not the same but several web objects are similar (2, 5, 8, 9), yet the distance
between them is 0.

(5) Session 1: 2 5 7 8 9 10

Session 3: 3 4 5 6 7 9 10 11 12 15 14 13

Here the case is the same as the previous one, but the distance measure is 25.

6.2.3 Needleman-Wunsch Analysis

Table 6.4 describes the distance between clusters using the Needleman-Wunsch distance
measurement technique. The distance between sessions is calculated using the equation
described in section 6.1. Figure 6.5 shows the analysis of patterns according to the
Needleman-Wunsch distance measurement technique.

From the analysis it is found that it is very difficult to decide the threshold value for this
technique. The ideal threshold value is 85, but it covers only 32% of clusters, so it affects the
cache hit ratio. According to the Needleman-Wunsch distance metric, every session is at least
half similar to every other session, which is not actually true. Because it is based on global
alignment, it also considers the lengths of the sessions while comparing them. Pattern
discovery based on Needleman-Wunsch has several limitations, illustrated below:

(1) Session 1: 2 5 7 8 9 10

Session 2: 6 8 9 12 15 2 5

Here the distance is 50% even though the ordering as well as the number of web objects differs.

(2) Session 1: 2 5 7 8 9 10

Session 3 : 3 4 5 6 7 9 10 11 12 15 14 13

Here more web objects are similar than in the previous case, yet the distance is still 50%.

(3) Session 1 : 2 5 7 8 9 10

Session 10 : 2 5 7 8 9 10 12 13 14 10

Table 6.4 Distance between clusters using Needleman-Wunsch

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

1 0 50 50 50 50 50 50 50 54 55 59 57 50 71 50 50 53 50 54 50 57 50 56 50 58

2 50 0 58 54 55 50 60 54 50 55 50 50 50 50 62 50 53 56 54 53 50 50 50 60 58

3 50 58 0 54 50 62 58 57 82 62 50 57 50 50 58 57 50 88 54 56 50 50 58 54 65

4 50 54 54 0 65 54 62 57 64 62 81 82 58 50 85 54 87 53 57 69 54 58 50 58 50

5 50 55 50 65 0 50 50 50 57 59 50 57 77 50 58 50 50 50 79 59 50 50 50 59 50

6 50 50 62 54 50 0 64 50 57 55 50 50 54 50 54 89 50 59 50 56 50 50 55 50 62

7 50 60 58 62 50 64 0 64 54 68 53 50 50 50 62 61 53 62 57 56 55 55 55 55 54

8 50 54 57 57 50 50 64 0 57 64 53 54 54 50 61 54 53 59 54 88 50 50 50 57 54

9 54 50 82 64 57 57 54 57 0 64 56 64 54 57 61 57 57 72 54 56 50 54 54 57 50

10 55 55 62 62 59 55 68 64 64 0 53 57 50 54 67 54 57 59 54 56 50 50 50 59 54

11 59 50 50 81 50 50 53 53 56 53 0 78 53 69 69 50 91 50 56 50 53 53 53 50 53

12 57 50 57 82 57 50 50 54 64 57 78 0 54 57 68 50 83 50 54 59 54 54 50 50 50

13 50 50 50 58 77 54 50 54 54 50 53 54 0 54 54 54 57 53 75 50 50 54 50 50 50

14 71 50 50 50 50 50 50 50 57 54 69 57 54 0 50 50 57 50 57 53 54 54 50 50 57

15 50 62 58 85 58 54 62 61 61 67 69 68 54 50 0 57 73 56 50 62 50 50 50 62 54

16 50 50 57 54 50 89 61 54 57 54 50 50 54 50 57 0 54 62 54 59 50 50 54 54 64

17 53 53 50 87 50 50 53 53 57 57 91 83 57 57 73 54 0 53 62 54 59 50 54 54 64

18 50 56 88 53 50 59 62 59 72 59 50 50 53 50 56 62 53 0 53 56 50 50 56 53 69

19 54 54 54 57 79 50 57 54 54 54 56 54 75 57 50 54 62 53 0 53 57 54 50 54 50

20 50 53 56 69 59 56 56 88 56 56 50 59 50 53 62 59 54 56 53 0 50 53 50 50 56

21 57 50 50 54 50 50 55 50 50 50 53 54 50 54 50 50 59 50 57 50 0 50 50 50 58

22 50 50 50 58 50 50 55 50 54 50 53 54 54 54 50 50 50 50 54 53 50 0 50 50 50

23 56 50 58 50 50 55 55 50 54 50 53 50 50 50 50 54 54 56 50 50 50 50 0 55 65

24 50 60 54 58 59 50 55 57 57 59 50 50 50 50 62 54 54 53 54 50 50 50 55 0 54

25 58 58 65 50 50 62 54 54 50 54 53 50 50 57 54 64 64 69 50 56 58 50 65 54 0

[Chart: accuracy (%) of high-accuracy, average-accuracy and low-accuracy patterns plotted against threshold values from 55 to 90]

(Figure-6.5 Needleman-Wunsch Pattern Analysis)

Here both sessions request the same pages, and the order of the web objects is also the same,
but the distance between them is only 55.
(4) Session 3: 3 4 5 6 7 9 10 11 12 15 14 13

Session 18: 8 9 10 2 3 4 5 6 7 9 10 11 12 15 14 13

Here the order as well as the number of web objects differs, yet the distance between them is 88%.

(5) Session 6: 3 8 7 9 4 6 10 11 12 13 15

Session 7: 2 3 4 6 9 11 12 14 15 8

Here the order as well as the number of web objects differs, yet their distance is 64%, which is
higher than in case (3).
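The global-alignment behaviour described above can be sketched with a standard Needleman-Wunsch scorer over session token sequences. The scoring scheme used here (match +1, mismatch -1, gap -1) is an assumption for illustration; the percentage figures in Table 6.4 were evidently produced with a different normalization.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: every token of both sessions must be
    matched, mismatched, or gapped, so sequence length always matters."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i * gap                        # leading gaps against b
    for j in range(n + 1):
        dp[0][j] = j * gap                        # leading gaps against a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,  # (mis)match
                           dp[i - 1][j] + gap,    # gap in b
                           dp[i][j - 1] + gap)    # gap in a
    return dp[m][n]

# Case (3): identical prefix, but Session 10's extra tail is charged as
# gaps, so the global score is dragged down by the length difference.
score = needleman_wunsch([2, 5, 7, 8, 9, 10],
                         [2, 5, 7, 8, 9, 10, 12, 13, 14, 10])  # 6 matches - 4 gaps = 2
```

This is the length penalty the analysis above objects to: a perfect prefix match still loses score for every unmatched trailing token.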

6.2.4 Smith-Waterman Analysis

Table 6.5 describes the distance between clusters using the Smith-Waterman technique. Figure
6.6 shows the analysis of patterns using the Smith-Waterman distance measurement technique.

Table 6.5 Distance between clusters using Smith-Waterman

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

1 0 33 50 50 42 25 17 33 50 100 50 58 17 100 50 25 50 50 42 33 33 17 17 33 83

2 33 0 29 57 36 21 36 29 29 36 57 43 43 100 64 21 57 29 36 29 21 29 29 43 29

3 50 29 0 29 36 41 50 38 92 32 29 38 25 25 29 42 29 100 33 38 36 43 50 10 58

4 50 57 29 0 27 23 50 35 27 50 100 62 27 31 83 19 100 31 27 35 43 36 31 45 27

5 42 36 36 27 0 36 45 14 36 50 27 14 73 23 27 36 27 36 100 14 14 29 25 35 45

6 25 21 41 23 36 0 30 18 41 32 23 18 32 14 23 100 23 41 36 18 21 21 25 20 41

7 17 36 50 50 45 30 0 30 40 25 50 20 60 25 50 35 50 60 45 30 14 29 38 25 30

8 33 29 38 35 14 18 30 0 43 32 32 25 19 18 38 14 32 32 11 100 57 29 25 20 23

9 50 29 92 27 36 41 40 43 0 32 25 25 23 21 29 36 25 79 29 43 36 43 38 14 46

10 100 36 32 50 50 32 25 32 32 0 50 41 36 55 50 32 50 32 50 32 43 29 12 45 59

11 50 57 29 100 27 23 50 32 25 50 0 57 27 29 83 18 87 25 25 28 43 36 31 45 27

12 58 43 38 62 14 18 20 25 25 41 57 0 12 25 42 14 61 32 14 25 43 36 44 30 46

13 17 43 25 27 73 32 60 19 23 36 27 12 0 23 29 27 27 23 62 19 14 29 25 35 27

14 100 100 25 31 23 14 25 18 21 55 29 25 23 0 42 11 29 21 18 18 29 29 25 30 38

15 50 64 29 83 27 23 50 38 29 50 83 42 29 42 0 21 83 33 25 38 43 36 31 45 29

16 25 21 42 19 36 100 35 14 36 32 18 14 27 11 21 0 18 36 29 14 21 29 25 20 35

17 50 57 29 100 27 23 50 32 25 50 87 61 27 29 83 18 0 27 25 30 43 36 31 45 27

18 50 29 100 31 36 41 60 32 79 32 25 32 23 21 33 36 27 0 29 28 36 43 50 20 58

19 42 36 33 27 100 36 45 11 29 50 25 14 62 18 25 29 25 29 0 11 14 36 25 35 38

20 33 29 38 35 14 18 30 100 43 32 28 25 19 18 38 14 30 28 11 0 57 29 25 20 23

21 33 21 36 43 14 21 14 57 36 43 43 43 14 29 43 21 43 36 14 57 0 14 14 21 36

22 17 29 43 36 29 21 29 29 43 29 36 36 29 29 36 29 36 43 36 29 14 0 57 14 29

23 17 29 50 31 25 25 38 25 38 12 31 44 25 25 31 25 31 50 25 25 14 57 0 12 44

24 33 43 10 45 35 20 25 20 14 45 45 30 35 30 45 20 45 20 35 20 21 14 12 0 30

25 83 29 58 27 45 41 30 23 46 59 27 46 27 38 29 35 27 58 38 23 36 29 44 30 0

[Chart: accuracy (%) of high-accuracy, average-accuracy and low-accuracy patterns plotted against threshold values from 50 to 100]

(Figure-6.6 Smith-Waterman Pattern Analysis)

Any threshold value from 65 to 100 is suitable for Smith-Waterman; the choice is made based
on the space available in the proxy server. Because the technique is based on local alignment,
it does not take the lengths of the strings into consideration. Several observations about the
Smith-Waterman distance metric follow:

(1) Session 1: 2 5 7 8 9 10

Session 10: 2 5 7 8 9 10 12 13 14 10

Here the order as well as all web objects referred to in both sessions are the same, so the
distance is 100%.

(2) Session 1: 2 5 7 8 9 10

Session 14: 6 8 9 12 15 2 5 1 2 5 7 8 9 10

Here the lengths of the strings are dissimilar, but a portion of the second string matches the
first string in order, so the distance is 100%.

(3) Session 3: 3 4 5 6 7 9 10 11 12 15 14 13

Session 9: 6 4 5 6 7 9 10 11 12 15 14 13 10

Here the first string is not exactly the same as the second, but it is nearly the same, so the
distance is 92%.

(4) Session 4: 2 4 6 8 9 10 12 14 15 3 9 8 6

Session 15: 2 4 6 8 9 10 12 14 15 3 2 1

Here the first string is nearly similar to the second string, but less so than in the previous
case (3), so the distance is lower, at 83%.

(5) Session 1: 2 5 7 8 9 10

Session 2: 6 8 9 12 15 2 5

Here only two web objects are similar, so the distance is 33%.

(6) Session 1: 2 5 7 8 9 10

Session 3: 3 4 5 6 7 9 10 11 12 15 14 13

Here a total of four web objects are similar, of which two are in order, so the distance is
naturally higher than in the previous case, at 50%.
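Observations (1)-(6) are consistent with a standard Smith-Waterman local-alignment scorer, since a local alignment ignores unmatched prefixes and suffixes. The scoring scheme (match +1, mismatch -1, gap -1) and the normalization by the shorter session are assumptions for illustration, although with them cases (1) and (5) reproduce the reported 100% and 33% almost exactly.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score: cells are floored at 0, so unmatched
    prefixes/suffixes cost nothing and string length is not penalized."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(0,
                           dp[i - 1][j - 1] + s,  # (mis)match
                           dp[i - 1][j] + gap,    # gap in b
                           dp[i][j - 1] + gap)    # gap in a
            best = max(best, dp[i][j])
    return best

s1 = [2, 5, 7, 8, 9, 10]
# Case (1): Session 1 aligns fully inside Session 10 -> local score 6,
# i.e. 100% of the shorter session.
case1 = smith_waterman(s1, [2, 5, 7, 8, 9, 10, 12, 13, 14, 10])
# Case (5): only short common runs ("2 5" or "8 9") align -> score 2,
# i.e. 2/6 = 33% of the shorter session.
case5 = smith_waterman(s1, [6, 8, 9, 12, 15, 2, 5])
```

The zero floor in each cell is what makes the measure local: the extra tail of Session 10 simply never enters the best-scoring region, unlike in the two global measures above.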

6.3 Approaches to Formation of Clusters


As discussed in the related work on clustering techniques, hierarchical clustering is not an
efficient technique for cluster formation because most hierarchical algorithms never revisit
clusters once they are formed. For the prediction model of web caching and prefetching,
partitioning-relocation clustering is an ideal solution for cluster formation. There are two
main techniques for partitioning clustering: (1) K-means and (2) K-medoids. Both techniques
require the specification of K, the number of clusters to be formed. In the proposed research
it is not possible to predict the number of clusters, so the value of K cannot be given at the
initial stage, and both techniques are therefore unsuitable in this context. Moreover, in both
K-means and K-medoids one object cannot be part of more than one cluster, while in the
proposed research context one web object can be part of more than one cluster. One popular
clustering technique is Fuzzy C-Means, which attempts to divide n elements into m fuzzy
clusters under some criterion, but this algorithm also requires choosing the number of
clusters, so it does not fit the proposed research work either. A new approach for cluster
formation is therefore suggested in this work. The approach consists of the following steps:

[1] Determine the distance metric based on the Smith-Waterman technique.

Based on the quantifiable analysis of all relevant distance measurement techniques, it is
found that Smith-Waterman is the most suitable technique in the context of the proposed work.
It is suitable for comparing sequences of either similar or dissimilar lengths, and it also
considers the ordering of web objects.

[2] Decide the threshold value in the context of the proxy server cache memory.

The threshold value should be decided based on the capacity of the proxy server cache
memory, which in turn is determined from the application's perspective.

[3] Form clusters of web objects based on the threshold value.

Clusters are formed according to the selected threshold value.

[4] Repeat step 3 with a new threshold value if required.

The formation of clusters is repeated with a new threshold value whenever the application
requires it.
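Steps [1]-[4] above can be sketched as a small threshold-based procedure. The helper below normalizes the Smith-Waterman local score by the shorter session, which is an assumed normalization, and the seeding strategy (every session seeds a candidate cluster; duplicates are dropped) is likewise one plausible reading of the approach, not a definitive implementation.

```python
def sw_similarity(a, b, match=1, mismatch=-1, gap=-1):
    """Step [1]: Smith-Waterman local score, expressed as a percentage
    of the shorter session (assumed normalization)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(0, dp[i - 1][j - 1] + s,
                           dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            best = max(best, dp[i][j])
    return best / min(m, n) * 100

def form_clusters(sessions, threshold):
    """Steps [2]-[4]: given a threshold chosen for the proxy cache, each
    session seeds a candidate cluster of all sessions at least
    `threshold`% similar to it.  Overlap is allowed, so one session may
    appear in several clusters; rerun with a new threshold if the
    application requires it (step [4])."""
    clusters, seen = [], set()
    for seed in sessions:
        members = frozenset(j for j, other in enumerate(sessions)
                            if sw_similarity(seed, other) >= threshold)
        if len(members) > 1 and members not in seen:
            seen.add(members)
            clusters.append(sorted(members))
    return clusters

sessions = [[2, 5, 7, 8, 9, 10],
            [2, 5, 7, 8, 9, 10, 12, 13, 14, 10],
            [6, 8, 9, 12, 15, 2, 5]]
clusters = form_clusters(sessions, threshold=80)   # [[0, 1]]
```

Lowering the threshold (step [4]) merges the third session in as well, which illustrates how the threshold trades cluster tightness against the number of objects prefetched into the proxy cache.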

6.4 Conclusion
This chapter has described related work on clustering data mining techniques from the
perspective of web data. From the literature survey it is found that there are two main
categories of clustering techniques: (1) hierarchical and (2) partition based. Hierarchical
clustering is divided into two main approaches, top-down and bottom-up, but both suffer from
many limitations and do not suit the context of the proposed work. The main limitation is that
clusters are not revisited once they are formed. Partitioning techniques overcome the
limitations of hierarchical clustering in terms of optimizing cluster formation. There are two
main partitioning approaches, centroid based and medoid based. Both require the number of
clusters to be known in advance, which is not feasible in the context of the current work.
Another limitation is that both techniques use ordinary distance measures such as Euclidean
distance, which is not suitable for categorical data and does not consider the ordering of
objects. A further limitation is that one object must be part of exactly one cluster, which does
not hold in the proposed research, so this chapter also described the Fuzzy C-Means
technique. Fuzzy C-Means overcomes that limitation by allowing one object to be part of more
than one cluster, but it still requires the number of clusters in advance, so it is not a perfect
technique for forming clusters in the context of the proposed work.

The challenge of the proposed work is to identify an appropriate distance measurement
technique that is suitable for all categories of data and that also considers the ordering of
objects. This chapter dealt with distance measurement techniques, performed a quantitative
analysis of them, and identified the most relevant one in this context. Finally, the chapter has
presented a new approach to forming clusters, based on the distance measurement technique
appropriate to the web caching and prefetching criteria. The new approach is threshold based,
where the value of the threshold is decided according to the memory of the proxy server. The
approach is also iterative: it provides the liberty to select a new threshold value if the
previous one does not give satisfactory results.

