
MINING USER ACCESS LOG USING EVOLUTIONARY APPROACH FOR CLUSTERING

GUIDED BY:
Mrs. Seema Thamman
Sr. Lecturer, Computer Science & Engg.

SUBMITTED BY:
Rahul Mewada
Sachin Choudhary
Sachin Thakur
Vikas Garg

Introduction
Information overload: too much information to
sift/browse through in order to find the desired information.
Most information on the Web is irrelevant to a particular user.
This is what motivated interest in techniques for Web
personalization.
As they surf a website, users leave a wealth of historical data
about what pages they have viewed, choices they have made,
etc.
Web Usage Mining: a branch of Web Mining (itself a
branch of data mining) that aims to discover interesting
patterns from Web usage data (typically Web access log
data/clickstreams).
THE MAIN GOALS OF THE PROJECT

Extract user sessions from the Web access log
Calculate the similarity between user sessions
Cluster user sessions using UNC
Summarize each cluster by a typical session
profile vector

Objectives for the Above Goals
Collect the Web log data of a particular website
Preprocess the Web log by data cleaning and
then create sessions
Cluster the sessions using Unsupervised Niche
Clustering
Summarize each cluster by a typical session
profile vector
Classical Knowledge Discovery Process For
Web Usage Mining
Source of Data: Web clickstreams that get recorded
in Web log files:
date, time, IP address/cookie, URL accessed, etc.
Profile 1:
URLs a, b, c
Sessions in profile 1
Profile 2:
URLs x, y, z, w
Sessions in profile 2
etc

Fig. Web usage mining process: Web server -> server log (Web log) -> data preprocessing -> analysis of Web data by the UNC algorithm -> interpretation and evaluation of user profiles
Complete KDD process:
Preprocessing: selecting and cleaning data (result = session data)
UNC (Learning Phase): clustering algorithm to categorize sessions into usage categories/modes/profiles; algorithms to discover frequent usage patterns/profiles
Derivation and Interpretation of results: computing profiles, evaluating results, analyzing profiles
PREPROCESSING & SEGMENTATION
DEFINING THE SIMILARITY BETWEEN USER SESSIONS-
FILTERING-
Log entries are removed if they (i) result in any error (indicated by the
error code), (ii) use a request method other than GET, or (iii) record
accesses to image files (.gif, .jpeg, etc.). The log is also filtered by
(iv) IP address and (v) date.
USER SESSION-
A sequence of accesses from the same IP address such that the time
elapsed between any two consecutive accesses in the session is
within a prespecified threshold.
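
The filtering and session rules above can be sketched as follows. This is only an illustrative sketch: the dictionary keys (`ip`, `time`, `method`, `url`, `status`), the function names, the rule that a status of 400 or above counts as an error, the extra `.jpg` suffix, and the 30-minute default threshold are all assumptions, not details taken from the project.

```python
from datetime import datetime, timedelta

# Assumed list of image suffixes; the slides only name .gif and .jpeg.
IMAGE_SUFFIXES = (".gif", ".jpeg", ".jpg")

def keep_entry(entry):
    """Return True if a parsed log entry survives filtering."""
    if entry["status"] >= 400:                         # (i) error response (assumed rule)
        return False
    if entry["method"] != "GET":                       # (ii) method other than GET
        return False
    if entry["url"].lower().endswith(IMAGE_SUFFIXES):  # (iii) image file
        return False
    return True

def build_sessions(entries, threshold=timedelta(minutes=30)):
    """Group filtered entries into sessions per IP address.

    A new session starts whenever the gap between two consecutive
    accesses from the same IP address exceeds the threshold.
    """
    sessions = []
    last_seen = {}  # ip -> (time of last access, index into sessions)
    for e in sorted(filter(keep_entry, entries), key=lambda e: e["time"]):
        ip = e["ip"]
        if ip in last_seen and e["time"] - last_seen[ip][0] <= threshold:
            idx = last_seen[ip][1]
        else:
            sessions.append({"ip": ip, "urls": []})
            idx = len(sessions) - 1
        sessions[idx]["urls"].append(e["url"])
        last_seen[ip] = (e["time"], idx)
    return sessions
```

Filtering by IP address or date, items (iv) and (v), would be one more condition in `keep_entry`.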


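Similarity Measure

Map the $N_U$ URLs on the site to indices $1, \ldots, N_U$.
User session vector $s^{(i)}$: a temporally compact sequence of Web accesses by a user, encoded as

$$ s_j^{(i)} = \begin{cases} 1 & \text{if the user accessed the } j^{th} \text{ URL during the } i^{th} \text{ session} \\ 0 & \text{otherwise} \end{cases} $$

If site structure is ignored: cosine similarity

$$ S_{1,kl} = \frac{\sum_{i=1}^{N_U} s_i^{(k)} s_i^{(l)}}{\sqrt{\sum_{i=1}^{N_U} s_i^{(k)}} \, \sqrt{\sum_{i=1}^{N_U} s_i^{(l)}}} $$

Taking site structure into account: relate distinct URLs
$p_i$: path from the root to the node of the $i^{th}$ URL

Syntactic similarity-

$$ S_u(i,j) = \min\left(1, \frac{|p_i \cap p_j|}{\max\left(1, \max(|p_i|, |p_j|) - 1\right)}\right) $$

In correlation with the URL attribute it is-

$$ S_{2,kl} = \frac{\sum_{i=1}^{N_U} \sum_{j=1}^{N_U} s_i^{(k)} s_j^{(l)} S_u(i,j)}{\sum_{i=1}^{N_U} s_i^{(k)} \, \sum_{j=1}^{N_U} s_j^{(l)}} $$

New similarity with the properties of $S_1$ and $S_2$-

$$ S_{kl} = \max\left(S_{1,kl}, S_{2,kl}\right) $$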
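
A minimal sketch of these similarity measures is given below. Several points are assumptions rather than details from the slides: the function names, taking the path length |p| as the number of path components (edges) of a URL, taking S_2 as the average URL-to-URL syntactic similarity over all pairs of URLs accessed in the two sessions, and combining S_1 and S_2 by taking the larger of the two.

```python
from math import sqrt

def cosine_similarity(s_k, s_l):
    """S_1: cosine similarity between two binary session vectors."""
    dot = sum(a * b for a, b in zip(s_k, s_l))
    # For binary entries the sum of squares equals the plain sum.
    norm = sqrt(sum(s_k)) * sqrt(sum(s_l))
    return dot / norm if norm else 0.0

def url_path(url):
    """Path components from the site root to the URL's node (assumed encoding)."""
    return [part for part in url.strip("/").split("/") if part]

def syntactic_similarity(url_i, url_j):
    """S_u: overlap between the root-to-node paths of two URLs."""
    p_i, p_j = url_path(url_i), url_path(url_j)
    common = 0
    for a, b in zip(p_i, p_j):  # length of the shared path prefix
        if a != b:
            break
        common += 1
    return min(1.0, common / max(1, max(len(p_i), len(p_j)) - 1))

def structural_similarity(s_k, s_l, urls):
    """S_2 (assumed form): mean S_u over all pairs of accessed URLs."""
    total = sum(
        syntactic_similarity(urls[i], urls[j])
        for i, a in enumerate(s_k) if a
        for j, b in enumerate(s_l) if b
    )
    denom = sum(s_k) * sum(s_l)
    return total / denom if denom else 0.0

def session_similarity(s_k, s_l, urls):
    """Combined similarity (assumed): the larger of S_1 and S_2."""
    return max(cosine_similarity(s_k, s_l),
               structural_similarity(s_k, s_l, urls))
```

Here `urls[j]` is the URL mapped to index `j`, so a session vector and the URL list must have the same length.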

Clustering
Definition:
Clustering is the process of grouping objects
in such a way that the objects belonging to
the same group are similar and those belonging to
different groups are dissimilar.
It can be considered an unsupervised learning problem,
as it deals with finding structure in a collection
of unlabeled data.

Example (figure): similarity criterion is distance-based

Applications:
City planning: identifying groups of houses
according to their house type, value and
geographical location.
WWW: document classification; clustering
Web log data to discover groups of similar
access patterns, etc.
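Unsupervised Niche Clustering (UNC)

Representation: binary chromosome strings
(one substring per feature), e.g. 0 1 1 1 0 1 0
Deterministic Crowding Selection: children
replace the closest parent if they have better fitness.
Robust weight:

$$ w_{ij} = \exp\left(-\frac{d_{ij}^2}{2\sigma_i^2}\right) $$

Density fitness measure:

$$ f_i = \frac{\sum_{j=1}^{N} w_{ij}}{\sigma_i^2} $$

where $d_{ij}$ is the distance from the $i^{th}$ candidate cluster center to the $j^{th}$ session, $\sigma_i^2$ is the scale (dispersion) of the $i^{th}$ cluster, and $N$ is the number of sessions.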
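
The robust weight and density fitness can be sketched as below. The function names and the choice to pass squared distances as a plain list are assumptions made for illustration.

```python
from math import exp

def robust_weights(distances_sq, sigma_sq):
    """w_ij = exp(-d_ij^2 / (2 * sigma_i^2)) for every session j."""
    return [exp(-d2 / (2.0 * sigma_sq)) for d2 in distances_sq]

def density_fitness(distances_sq, sigma_sq):
    """f_i = (sum_j w_ij) / sigma_i^2 for one candidate cluster center."""
    return sum(robust_weights(distances_sq, sigma_sq)) / sigma_sq

# A candidate center surrounded by close sessions scores higher than
# one whose sessions are far away, for the same scale.
dense = density_fitness([0.01] * 5, sigma_sq=0.5)
sparse = density_fitness([4.0] * 5, sigma_sq=0.5)
```

With these inputs `dense` exceeds `sparse`, which is the behaviour the fitness is meant to reward: centers located in dense regions of the session space.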
GENETIC ALGORITHM(GA)
Basic GAs operate in steps. In each step a new generation
is created, so we refer to these steps as
generations.
The algorithm starts by initializing the first
population, filling it with randomly generated
individuals.
In each step, every individual is first evaluated and a real
value, its fitness, is associated with it.
After all the individuals are evaluated, the genetic
operators are applied to the current population, leading to
the creation of a new generation.
GENETIC OPERATORS
1. CROSSOVER-
Crossover is a genetic operator that combines (mates)
two chromosomes (parents) to produce a new
chromosome (offspring).
Consider the following 2 parents (figure)

2. MUTATION-
The mutation operator involves a probability that an
arbitrary bit in a genetic sequence will be
changed from its original state.
Consider an example:

Before mutation: 1010010
After mutation:  1010110
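
Both operators can be sketched on bit strings as follows. One-point crossover is used here as an assumed variant, since the slide's crossover figure is not available; the function names are also illustrative.

```python
import random

def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length bit strings at `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def flip_bit(chromosome, index):
    """Return a copy of the bit string with one bit inverted."""
    flipped = "1" if chromosome[index] == "0" else "0"
    return chromosome[:index] + flipped + chromosome[index + 1:]

def mutate(chromosome, rate, rng=random):
    """Flip each bit independently with probability `rate`."""
    for i in range(len(chromosome)):
        if rng.random() < rate:
            chromosome = flip_bit(chromosome, i)
    return chromosome

# The mutation example from the slide: the fifth bit is flipped.
mutated = flip_bit("1010010", 4)
```

Flipping the bit at index 4 of `1010010` yields `1010110`, matching the example above.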




GENETIC NICHING
Niching GAs preserve population
diversity by using niching techniques, which
divide the population into different niches.
In this way the solutions in each area
or niche can survive during the evolutionary
process independently of their global quality.
Several niching algorithms exist. Here we use the
deterministic crowding algorithm.
fig. Replacement process in deterministic crowding
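
One replacement step of deterministic crowding can be sketched as below. Hamming distance between bit strings is an assumption for the "closest parent" test, and the function names are illustrative; the fitness function is passed in by the caller.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def crowding_replace(p1, p2, c1, c2, fitness, distance=hamming):
    """Return the two survivors of one deterministic crowding step."""
    # Pair each child with the parent it resembles most.
    if distance(p1, c1) + distance(p2, c2) <= distance(p1, c2) + distance(p2, c1):
        pairs = [(p1, c1), (p2, c2)]
    else:
        pairs = [(p1, c2), (p2, c1)]
    # A child replaces its paired parent only if it has better fitness.
    return [child if fitness(child) > fitness(parent) else parent
            for parent, child in pairs]
```

Because each child competes only with its most similar parent, a good solution in one niche is not displaced by offspring that belong to a different niche.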
INTERPRETATION AND EVALUATION OF
THE RESULTS
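The user sessions are assigned to the closest clusters based on the
distance $d_{ik}$ from the $i^{th}$ cluster to the $k^{th}$ session. This creates $C$
clusters

$$ X_i = \{ s^{(k)} \in S \mid d_{ik} < d_{jk} \ \forall j \neq i \}, \quad 1 \le i \le C. $$

The sessions in cluster $X_i$ are summarized by a typical session profile vector

$$ P_i = (P_{i1}, \ldots, P_{iN_U})^t $$

whose components are estimated by the conditional probability

$$ P_{ij} = p\left(s_j^{(k)} = 1 \mid s^{(k)} \in X_i\right) = \frac{|X_{ij}|}{|X_i|}, \quad \text{where } X_{ij} = \{ s^{(k)} \in X_i \mid s_j^{(k)} > 0 \}. $$

The URL weights $P_{ij}$ measure the significance of a given URL to
the $i^{th}$ profile.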
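
The assignment and summarization steps can be sketched as follows. The function names and the list-of-lists layout (`distances[i][k]` for the distance from cluster i to session k) are assumptions; ties between clusters are broken in favour of the lower cluster index, which the slides do not specify.

```python
def assign_sessions(distances):
    """Assign every session to its closest cluster.

    distances[i][k] is the distance from cluster i to session k.
    Returns one list of session indices per cluster.
    """
    n_clusters = len(distances)
    n_sessions = len(distances[0])
    clusters = [[] for _ in range(n_clusters)]
    for k in range(n_sessions):
        # min() keeps the first minimum, so ties go to the lower index.
        closest = min(range(n_clusters), key=lambda i: distances[i][k])
        clusters[closest].append(k)
    return clusters

def profile_vector(sessions, members):
    """P_ij: fraction of the cluster's sessions that accessed URL j."""
    n_urls = len(sessions[0])
    if not members:
        return [0.0] * n_urls
    return [sum(1 for k in members if sessions[k][j] > 0) / len(members)
            for j in range(n_urls)]

def most_visited_url(profile):
    """Index of the URL with the largest weight in a profile."""
    return max(range(len(profile)), key=lambda j: profile[j])
```

For example, a cluster holding the sessions `[1, 1, 0]` and `[1, 0, 0]` has the profile `[1.0, 0.5, 0.0]`: every session visited the first URL and half of them visited the second.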



System Design
Use Case Diagram:- (figure)

Sequence Diagram:-
Lifelines (as labeled in each diagram): User login, Program interface, File1, File2

Preprocessing:-
Request for login
Enter login id and password
Verify details: authenticate, or report invalid details
Enter IP address, date, time threshold
Verify IP address and date: authenticate, or report invalid IP address and date
Request for filtering
Filtering log
Filtered log
Request for session
Creating session
Get session vector
Log out

UNC:-
Request for login
Enter login id and password
Verify details: authenticate, or report invalid details
Request for extracting cluster centers
Extract cluster centers
Response with extracted cluster centers
Log out

Post Processing:-
Request for login
Enter login id and password
Verify details: authenticate, or report invalid details
Request for profile vector
Find profile vector
Get profile vector
Request for most visited URL
Find most visited URL
Get most visited URL
Log out

Activity Diagram:-
Pre Processing, UNC, Post Processing (figures)

Data Flow Diagram:- (figure)
CONCLUSION
The proposed methods were successfully tested on
the log files, both for removal of irrelevant entries and for
identification of user sessions. The results obtained after
the analysis were satisfactory and contained valuable
information about the log files, i.e.
the URL weights Pij, which measure the significance of
a given URL to the ith profile.
FUTURE SCOPE
We can investigate different ways to make our
approach scalable to large data sets, to use it
for clustering text documents/Web content,
and to incorporate Web content into Web user
profiling.

THANK YOU
