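If the site structure is ignored, sessions are compared with the cosine similarity.
Map the $N_U$ URLs on the site to indices $1, \ldots, N_U$.
User session vector $s^{(i)}$: a temporally compact sequence of Web accesses by a user, encoded as
$$ s_j^{(i)} = \begin{cases} 1 & \text{if the user accessed the } j\text{th URL during the } i\text{th session} \\ 0 & \text{otherwise} \end{cases} $$
Cosine similarity between the $k$th and $l$th sessions:
$$ S_{1,kl} = \frac{\sum_{i=1}^{N_U} s_i^{(k)} s_i^{(l)}}{\sqrt{\sum_{i=1}^{N_U} s_i^{(k)}} \, \sqrt{\sum_{i=1}^{N_U} s_i^{(l)}}} $$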
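A minimal sketch of the binary session vector and the cosine similarity between two sessions. The URL-to-index map and the two example sessions are made up for illustration, and the helper names are not from the original slides.

```python
import math

def session_vector(accessed_urls, url_index):
    """Binary vector: entry j is 1 if the session accessed the URL mapped to index j."""
    vec = [0] * len(url_index)
    for url in accessed_urls:
        vec[url_index[url]] = 1
    return vec

def cosine_similarity(s_k, s_l):
    """Shared URLs divided by sqrt(#URLs in s_k) * sqrt(#URLs in s_l)."""
    shared = sum(a * b for a, b in zip(s_k, s_l))
    norm = math.sqrt(sum(s_k)) * math.sqrt(sum(s_l))
    return shared / norm if norm else 0.0

# Hypothetical site with four URLs mapped to indices 0..3
url_index = {"/": 0, "/courses": 1, "/courses/cs101": 2, "/faculty": 3}
s_a = session_vector(["/", "/courses", "/courses/cs101"], url_index)
s_b = session_vector(["/", "/faculty"], url_index)
similarity = cosine_similarity(s_a, s_b)  # one shared URL out of 3 and 2
```

Because the vectors are binary, the sum of a vector equals its squared norm, which is why the denominator uses the plain sums under the square roots.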
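Taking the site structure into account relates distinct URLs.
$p_i$: path from the root to the $i$th URL's node.
Syntactic similarity between the $i$th and $j$th URLs:
$$ S_u(i,j) = \min\left(1, \frac{|p_i \cap p_j|}{\max\left(1, \max(|p_i|, |p_j|) - 1\right)}\right) $$
Combined with this URL attribute, a second session similarity $S_2$ is obtained, and from it a new similarity with the properties of both $S_1$ and $S_2$.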
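The syntactic URL similarity can be sketched as below. This assumes the form S_u(i,j) = min(1, |p_i ∩ p_j| / max(1, max(|p_i|, |p_j|) - 1)) and, as a further assumption, treats a path as the list of edges from the root to the URL's node, each edge identified by the URL prefix it leads to. The example URLs are hypothetical.

```python
def url_path(url):
    """Edges from the site root to the URL's node, one per path component.

    Each edge is identified by the URL prefix it leads to (an assumed encoding).
    """
    parts = [p for p in url.strip("/").split("/") if p]
    return ["/" + "/".join(parts[:n]) for n in range(1, len(parts) + 1)]

def url_similarity(url_i, url_j):
    """Overlap of the two root paths, scaled by the longer path and capped at 1."""
    p_i, p_j = url_path(url_i), url_path(url_j)
    overlap = len(set(p_i) & set(p_j))
    longest = max(1, max(len(p_i), len(p_j)) - 1)
    return min(1.0, overlap / longest)

same_branch = url_similarity("/courses/cs101", "/courses/cs102")
deeper_page = url_similarity("/courses/cs101/notes", "/courses/cs102")
other_branch = url_similarity("/courses/cs101", "/faculty")
```

Under these assumptions, URLs that share most of their path score close to 1, while URLs in unrelated branches of the site score 0.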
Clustering
Definition:
Clustering is the process of grouping objects
together in such a way that the objects belonging to
the same group are similar and those belonging to
different groups are dissimilar
It can be considered an unsupervised learning problem,
as it deals with finding structure in a collection
of unlabeled data.
Example:
Similarity criterion: distance-based
Applications:
City planning: identifying groups of houses
according to their house type, value and
geographical location;
WWW: document classification; clustering
weblog data to discover groups of similar
access patterns, etc.
Unsupervised Niche Clustering (UNC)
Representation: binary chromosome strings
(one substring per feature), e.g. 0 1 1 1 0 1 0
Deterministic Crowding Selection: children
replace the closest parent if they have better fitness.
Density fitness measure:
$$ f_i = \frac{\sum_{j=1}^{N} w_{ij}}{\sigma_i^2} $$
Robust weight:
$$ w_{ij} = \exp\left(-\frac{d_{ij}^2}{2\sigma_i^2}\right) $$
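A minimal sketch of the robust weight and the density fitness, assuming the forms w_ij = exp(-d_ij^2 / (2 sigma_i^2)) and f_i = (sum over j of w_ij) / sigma_i^2. The function names and the example distances are illustrative only.

```python
import math

def robust_weights(distances_sq, sigma_sq):
    """w_ij for one candidate center i, given squared distances d_ij^2 to all points."""
    return [math.exp(-d2 / (2.0 * sigma_sq)) for d2 in distances_sq]

def density_fitness(distances_sq, sigma_sq):
    """f_i: sum of the robust weights divided by the scale sigma_i^2."""
    return sum(robust_weights(distances_sq, sigma_sq)) / sigma_sq

# Hypothetical squared distances from one candidate center to four sessions
distances_sq = [0.0, 0.5, 2.0, 9.0]
fitness = density_fitness(distances_sq, sigma_sq=1.0)
```

Points far from the candidate center receive weights near zero, so outliers contribute little to the fitness; a center in a dense region with a small scale scores highest.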
GENETIC ALGORITHM(GA)
Basic GAs operate in steps. In each step a new generation
is created, so we refer to these steps as
generations.
The algorithm starts by initializing the first
population, filling it with randomly generated
individuals.
In each step, every individual is first evaluated and a real
value, its fitness, is associated with it.
After all the individuals are evaluated, the genetic
operators are applied to the current population, leading to
the creation of a new generation.
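The generational loop described above can be sketched as follows. The slides do not fix the selection scheme or the operator details, so tournament selection, single-point crossover, bit-flip mutation, and the toy fitness (number of 1 bits) are all assumptions made for this example.

```python
import random

def run_ga(fitness, n_bits=7, pop_size=20, generations=30, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    # Initialize the first population with randomly generated individuals
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate every individual and associate a fitness value with it
        scores = [fitness(ind) for ind in population]
        new_population = []
        while len(new_population) < pop_size:
            # Tournament selection of two parents (assumed scheme)
            parents = []
            for _ in range(2):
                a, b = rng.sample(range(pop_size), 2)
                parents.append(population[a] if scores[a] >= scores[b] else population[b])
            # Apply the genetic operators: crossover, then mutation
            cut = rng.randint(1, n_bits - 1)
            child = parents[0][:cut] + parents[1][cut:]
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]
            new_population.append(child)
        population = new_population
    return max(population, key=fitness)

best = run_ga(fitness=sum)  # toy fitness: count of 1 bits
```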
GENETIC OPERATORS
1. CROSSOVER
Crossover is a genetic operator that combines (mates)
two chromosomes (parents) to produce a new
chromosome (offspring).
Consider the following 2 parents
2. MUTATION
The mutation operator involves a probability that an
arbitrary bit in a genetic sequence will be
changed from its original state.
Consider an example in which the fifth bit is flipped:
1010010 (before mutation)
1010110 (after mutation)
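Both operators can be sketched on bit strings as below. Single-point crossover is an assumed variant (the slides do not say which kind is used), and the second parent and the cut position are hypothetical; the mutation call reproduces the 1010010 to 1010110 example.

```python
def single_point_crossover(parent_a, parent_b, cut):
    """Swap the tails of two equal-length bit strings after position `cut`."""
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def flip_bit(chromosome, position):
    """Return a copy of the bit string with the bit at `position` (0-based) flipped."""
    flipped = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + flipped + chromosome[position + 1:]

# Hypothetical second parent and cut point
child_a, child_b = single_point_crossover("1010010", "0111010", cut=3)

# Flipping the fifth bit (index 4) gives the mutated string from the example
mutated = flip_bit("1010010", position=4)
```

In a full GA the cut point and the mutated positions are drawn at random; they are fixed here so the result is easy to follow by hand.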
GENETIC NICHING
Niching GAs preserve population
diversity by using niching techniques, which
divide the population into different niches.
In this way the solutions in each area
or niche can survive during the evolutionary
process independently of their global quality.
There are three niching algorithms;
here we use the deterministic
crowding algorithm.
fig. Replacement process in deterministic crowding
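A minimal sketch of the replacement step in deterministic crowding: each child is paired with its closest parent and replaces that parent only if it has better fitness. Hamming distance between bit strings and the toy fitness are assumptions for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def deterministic_crowding(parents, children, fitness, distance=hamming):
    """Return the two survivors of one parent pair and its two children."""
    p1, p2 = parents
    c1, c2 = children
    # Choose the parent-child pairing with the smaller total distance
    if distance(p1, c1) + distance(p2, c2) <= distance(p1, c2) + distance(p2, c1):
        pairs = [(p1, c1), (p2, c2)]
    else:
        pairs = [(p1, c2), (p2, c1)]
    # A child replaces its paired parent only if it is fitter
    return [c if fitness(c) > fitness(p) else p for p, c in pairs]

def count_ones(bits):
    return bits.count("1")  # toy fitness for illustration

survivors = deterministic_crowding(("0000", "1111"), ("1110", "0001"), count_ones)
```

Because competition happens only between similar individuals, a good solution in one niche is not displaced by a fitter solution from a different niche.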
INTERPRETATION AND EVALUATION OF
THE RESULTS
The user sessions are assigned to the closest clusters based on the distance $d_{ik}$ from the $i$th cluster to the $k$th session. This creates $C$ clusters
$$ X_i = \{ s^{(k)} \in S \mid d_{ik} < d_{jk} \ \forall j \neq i \}, \quad 1 \le i \le C. $$
The sessions in cluster $X_i$ are summarized by the profile vector
$$ P_i = (P_{i1}, \ldots, P_{iN_U})^t, $$
whose components are estimated by the conditional probability
$$ P_{ij} = p\left(s_j^{(k)} = 1 \mid s^{(k)} \in X_i\right) = \frac{|X_{ij}|}{|X_i|}, \quad \text{where } X_{ij} = \{ s^{(k)} \in X_i \mid s_j^{(k)} > 0 \}. $$
The URL weights $P_{ij}$ measure the significance of a given URL to
the $i$th profile.
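The assignment and summarization steps can be sketched as follows, assuming the URL weight is estimated as the fraction of sessions in a cluster that accessed the URL, P_ij = |X_ij| / |X_i|. The distance matrix and session vectors are made-up inputs, and ties in distance go to the lower cluster index.

```python
def assign_sessions(distances):
    """distances[i][k] is d_ik; returns clusters[i] = indices k of sessions closest to cluster i."""
    n_clusters = len(distances)
    n_sessions = len(distances[0])
    clusters = [[] for _ in range(n_clusters)]
    for k in range(n_sessions):
        closest = min(range(n_clusters), key=lambda i: distances[i][k])
        clusters[closest].append(k)
    return clusters

def profile_vector(sessions, members):
    """P_ij: fraction of the cluster's sessions that accessed URL j."""
    n_urls = len(sessions[0])
    if not members:
        return [0.0] * n_urls
    return [sum(1 for k in members if sessions[k][j] > 0) / len(members)
            for j in range(n_urls)]

# Hypothetical data: 3 sessions over 3 URLs, 2 clusters
sessions = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
distances = [[0.1, 0.2, 0.9],   # d_0k
             [0.8, 0.7, 0.1]]   # d_1k
clusters = assign_sessions(distances)
profiles = [profile_vector(sessions, members) for members in clusters]
```

The URL with the largest weight in a profile is the one visited by the most sessions in that cluster.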
System Design
Use Case Diagram
Sequence Diagrams
Participants: User, Login, Program interface, File1, File2
Each sequence starts with the same login exchange: request for login; enter login id and password; verify details; authenticate (or report invalid details).
Preprocessing:
1. Enter IP address, date, time threshold
2. Verify IP address and date; authenticate (or report invalid IP address and date)
3. Request for filtering; filtering log; filtered log
4. Request for session; creating session; session vector returned
5. Log out
UNC:
1. Request for extracting cluster centers
2. Extract cluster centers
3. Response with extracted cluster centers
4. Log out
Post processing:
1. Request for profile vector; find profile vector; get profile vector
2. Request for most visited URL; find most visited URL; get most visited URL
3. Log out
Activity Diagrams: Preprocessing, UNC, Post processing
Data Flow Diagram
CONCLUSION
The proposed methods were successfully tested on
the log files for both removal and user session
identification. The results obtained from
the analysis were satisfactory and contained valuable
information about the log files, i.e.
the URL weights $P_{ij}$, which measure the significance of
a given URL to the $i$th profile.
FUTURE SCOPE
We can investigate different ways to make our
approach scalable to large data sets, apply it
to clustering text documents/Web content,
and incorporate Web content into Web user
profiling.
?
THANK YOU