
International Journal of Computer Trends and Technology (IJCTT) volume 5 number 2 Nov 2013

ISSN: 2231-2803 http://www.ijcttjournal.org Page107



Web User Behavior Analysis Using Improved
Naive Bayes Prediction Algorithm

B.Harindra Varma
M.Tech(C.S.E)
Gudlavalleru Engineering College,
Gudlavalleru
K.Ashok Reddy
Assistant Professor(C.S.E)
Gudlavalleru Engineering
College, Gudlavalleru
S.Narayana
Associate Professor (C.S.E),
Gudlavalleru Engineering College




Abstract: With the continued growth and proliferation of Web
services and Web-based information systems, the volume of user
data has reached astronomical proportions. Analyzing such data
using Web Usage Mining can help to determine the visiting
interests or needs of Web users. Because the Web log is
incremental in nature, predicting exactly how users browse
websites becomes a crucial issue. Web miners need predictive
mining techniques to filter out unwanted categories and so reduce
the operational scope. Markov models and their variations have
been used to analyze the Web navigation behavior of users. A
user's Web link transitions on a particular website can be
modeled using first-order, second-order, or higher-order Markov
models, which can be used to predict future navigation and to
personalize the Web page for an individual user. The
all-higher-order Markov model promises higher prediction accuracy
and better coverage than any single-order Markov model, but at
the cost of high state-space complexity. Hence a Hybrid Markov
Model is required to improve operational performance and
prediction accuracy significantly. The Markov model is treated as
a probability model by which users' browsing behavior can be
predicted at the category level. Bayes' theorem can also be
applied to represent and infer users' browsing behavior at the
webpage level. In this research, Markov models and Bayes' theorem
are combined and a two-level prediction model is designed. With
the Markov model, the system can effectively filter the possible
categories of websites, and Bayes' theorem helps to predict
websites accurately. The experiments show that the proposed model
has a good hit ratio for prediction.

Keywords: Web usage, Hidden Markov, Bayes, Data Mining.


I. INTRODUCTION

The Web is a huge, explosive, diverse, dynamic, and mostly
unstructured data repository. It supplies an incredible amount of
information, and it also raises the complexity of dealing with
that information from the different perspectives of users, Web
service providers, and business analysts. Users want effective
search tools to find relevant information easily and precisely.
Web service providers want ways to predict users' behavior and
personalize information, so as to reduce the traffic load and
design websites suited to different groups of users. Business
analysts want tools to learn the needs of users and consumers.
All of them expect tools or techniques that help them satisfy
their demands and solve the problems encountered on the Web. Web
mining has therefore become a popular, active area and is taken
as the research topic for this investigation. Web Usage Mining is
the application of data mining techniques to discover interesting
usage patterns from Web data, in order to understand and better
serve the needs of Web-based applications.
Our task is related to Web usage mining, which covers tasks
concerned with how the Web is used: access to the Web is
considered, navigation patterns are mined, and prediction is
performed. For this kind of mining, we use a database in the form
of Web log files and generate results on the basis of that
database. Markov models have been used for studying and
understanding stochastic processes, and have been shown to be
well suited for modeling and predicting a user's browsing
behavior on a website. In general, the input for these problems
is the sequence of Web pages accessed by a user, and the goal is
to build Markov models [2] that can be used to model and predict
the Web page that the user will most likely access next [3]. In
many applications, first-order Markov models are not very
accurate in predicting the user's browsing behavior, since these
models do not look far enough into the past to correctly
discriminate the different observed patterns. As a result,
higher-order models are often used.
Unfortunately, these higher-order models have a number of
limitations associated with high state-space complexity, reduced
coverage, and sometimes even worse prediction accuracy. One
method proposed to overcome this problem is clustering and
cloning, which duplicates the states corresponding to pages that
require a longer history to understand the choice of link that
users made. Initially, when the Web log is not available because
the website is newly launched, the prediction or navigation
decision is made on the basis of page rank. Our page rank
strategy is also used to resolve ambiguity in the model. Our
model is therefore built on two basic strategies, page rank and a
variable-length Markov model: ambiguity in the Markov model is
resolved on the basis of page rank, and page rank is also used in
the initial stage when the Web log file is not available.

Markov models have been used for studying and understanding
stochastic processes, and are well suited for modeling and
predicting a user's browsing behavior on the Web. In general, the
input for these problems is the sequence of Web pages accessed by
a user, and the goal is to build a Markov model that can be used
to predict the Web user's usage behavior. The state space of the
Markov model depends on the number of previous actions used in
predicting the next action. The simplest Markov model predicts
the next action by looking only at the last action performed by
the user. In this model, also known as the first-order Markov
model, each action that can be performed by a user corresponds to
a state in the model. A somewhat more complicated model computes
the prediction by looking at the last two actions performed by
the user. This is called the second-order Markov model, and its
states correspond to all possible pairs of actions that can be
performed in sequence. This approach is generalized to the
Nth-order Markov model, which computes the prediction by looking
at the last N actions performed by the user, leading to a state
space that contains all possible sequences of N actions.
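
The following minimal Python sketch illustrates the idea of an Nth-order model as a table of next-page counts keyed by the last N pages. It is an illustration only, not the implementation used in this paper; the function names `train_markov` and `predict_next` and the sample sessions are hypothetical.

```python
from collections import Counter, defaultdict


def train_markov(sessions, order):
    """Count next-page frequencies for every state of `order` consecutive pages."""
    counts = defaultdict(Counter)
    for session in sessions:
        for i in range(order, len(session)):
            state = tuple(session[i - order:i])  # the last `order` pages
            counts[state][session[i]] += 1       # page that followed them
    return counts


def predict_next(counts, history, order):
    """Return the most frequent next page for the last `order` pages, or None."""
    state = tuple(history[-order:])
    if len(state) < order or state not in counts:
        return None  # history too short, or state never seen in training
    return counts[state].most_common(1)[0][0]


# Hypothetical sessions, for illustration only.
sessions = [
    ["home", "news", "sports"],
    ["home", "news", "sports"],
    ["blog", "news", "weather"],
    ["home", "shop", "cart"],
]
first_order = train_markov(sessions, 1)
second_order = train_markov(sessions, 2)

print(predict_next(first_order, ["blog", "news"], 1))
print(predict_next(second_order, ["blog", "news"], 2))
```

On this toy data the first-order model predicts "sports" after "news" regardless of how the user arrived there, whereas the second-order model, which also sees "blog", predicts "weather". A history that was never observed during training yields no prediction, which is the coverage problem of higher-order models.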

In most applications, the first-order Markov model has low
accuracy in achieving right predictions, which is why extensions
to higher-order models are necessary. The all-higher-order Markov
model promises higher prediction accuracy and improved coverage
than any single-order Markov model, at the expense of a dramatic
increase in state-space complexity.

II. LITERATURE SURVEY

Myra Spiliopoulou [1] suggests applying Web usage mining to
website evaluation to determine needed modifications, primarily
to the site's design of page content and link structure between
pages. Eirinaki et al. [2] propose a method that incorporates
link analysis, such as the page rank measure, into a Markov model
in order to provide Web path recommendations. Schechter et al.
[3] utilized a tree-based data structure that represents the
collection of paths inferred from the log data to predict the
next page access. Chen and Zhang [4] utilized a Prediction by
Partial Match forest that restricts the roots to popular nodes;
assuming that most user sessions start on popular pages, the
branches having a non-popular page as their root are pruned.
R. Walpole, R. Myers and S. Myers [5] proposed that Bayes'
theorem can be used to predict the user's most probable next
request.

The Hybrid Successive Markov Predictive Model (HSMP) has been
used for investigating and understanding stochastic processes,
and it has been shown to be well suited for modeling and
predicting users' browsing behavior in the Web log scenario. In
most applications, the first-order Markov model has low accuracy
in achieving right predictions, which is why extensions to
higher-order models are necessary. The all-higher-order Markov
model promises higher prediction accuracy and improved coverage
than any single-order Markov model, at the expense of a dramatic
increase in state-space complexity. Hence, the authors propose
techniques for intelligently combining different-order Markov
models so that the resulting model has low state-space complexity
and improved prediction accuracy, and retains the coverage of the
all-higher-order Markov model.
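
One simple way to combine models of different orders is to try the highest order first and fall back to lower orders whenever the current history was never seen in training. The Python sketch below illustrates that fallback scheme only; it is not necessarily the combination technique of the cited authors, and the names `train_all_orders` and `predict_with_fallback` and the sample sessions are hypothetical.

```python
from collections import Counter, defaultdict


def train_all_orders(sessions, max_order):
    """Build one next-page frequency table for each order 1..max_order."""
    models = {}
    for order in range(1, max_order + 1):
        counts = defaultdict(Counter)
        for session in sessions:
            for i in range(order, len(session)):
                counts[tuple(session[i - order:i])][session[i]] += 1
        models[order] = counts
    return models


def predict_with_fallback(models, history):
    """Try the highest order first; fall back to lower orders on unseen states."""
    for order in sorted(models, reverse=True):
        if len(history) < order:
            continue  # not enough history for this order
        state = tuple(history[-order:])
        if state in models[order]:
            page = models[order][state].most_common(1)[0][0]
            return page, order
    return None, 0  # no model covers this history


# Hypothetical sessions, for illustration only.
sessions = [
    ["home", "news", "sports"],
    ["home", "news", "sports"],
    ["blog", "news", "weather"],
    ["home", "shop", "cart"],
]
models = train_all_orders(sessions, max_order=2)

print(predict_with_fallback(models, ["blog", "news"]))
print(predict_with_fallback(models, ["shop", "news"]))
```

The second call shows the coverage benefit: the pair ("shop", "news") never occurs in training, so the second-order table has no answer, but the first-order table still supplies a prediction. The price is that every order's table must be stored, which is the state-space cost discussed above.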

Problems in Existing Work:

1) We propose a new two-tier prediction framework to
improve prediction time. Such a framework can
accommodate various prediction models.
2) We present an analysis study for the Markov model and
the all-Kth model.
3) We propose a new modified Markov model that
handles the excess memory requirements in the case of large
data sets by reducing the number of paths during the
training and testing phases.
4) We conduct extensive experiments on three
benchmark data sets to study different aspects of the
WPP using the Markov model, the modified Markov model,
ARM, and the all-Kth Markov model. Our analysis and
results show that the higher-order Markov model produces
better prediction accuracy.



III. PROPOSED SYSTEM


In this section, we propose another improved variation of the
Markov model that reduces the number of paths in the model so
that it can fit in memory and predict faster [1]. Web prediction
is performed on the following data:










HMM BASED BAYES APPROACH:

BAYESIAN CLASSIFICATION

Bayesian classifiers are statistical classifiers.
They can predict class membership probabilities,
such as the probability that a given tuple belongs
to a particular class. Bayesian classification is
based on Bayes' theorem. Naive Bayesian
classifiers assume that the effect of an attribute
value on a given class is independent of the values
of the other attributes. This assumption is called
class conditional independence. It is made to
simplify the computations involved and, in this
sense, is considered naive. Bayesian belief
networks are graphical models which, unlike naive
Bayesian classifiers, allow the representation of
dependencies among subsets of attributes.
Bayesian belief networks can also be used for
classification.

Bayes' Theorem
Bayes' theorem is named after Thomas
Bayes, a nonconformist English clergyman who did
early work in probability and decision theory
during the 18th century. Let X be a data tuple. In
Bayesian terms, X is considered evidence. As
usual, it is described by measurements made on a
set of n attributes. Let H be some hypothesis, such
as that the data tuple X belongs to a specified class
C. For classification problems, we want to
determine P(H/X), the probability that the
hypothesis H holds given the evidence or
observed data tuple X. In other words, we are
looking for the probability that tuple X belongs to
class C, given that we know the attribute
description of X. P(H/X) is the posterior
probability, or a posteriori probability, of H
conditioned on X. For example, suppose our world
of data tuples is confined to customers described
by the attributes age and income, respectively, and
that X is a 35-year-old customer with an income of
$40,000. Suppose that H is the hypothesis that our
customer will buy a computer. Then P(H/X)
reflects the probability that customer X will buy a
computer given that we know the customer's age
and income. In contrast, P(H) is the prior
probability, or a priori probability, of H. For our
example, this is the probability that any given
customer will buy a computer, regardless of age,
income, or any other information, for that matter.
The posterior probability, P(H/X), is based on
more information (e.g., customer information) than
the prior probability, P(H), which is independent
of X. Similarly, P(X/H) is the posterior probability
of X conditioned on H. That is, it is the probability
that a customer, X, is 35 years old and earns
$40,000, given that we know the customer will buy
a computer. P(X) is the prior probability of X. Using
our example, it is the probability that a person
from our set of customers is 35 years old and earns
$40,000. P(X/H), and P(X) may be estimated from
the given data, as we shall see below. Bayes'
theorem is useful in that it provides a way of
calculating the posterior probability, P(H/X), from
P(H), P(X/H), and P(X). Bayes' theorem is

$$P(H/X) = \frac{P(X/H)\,P(H)}{P(X)}$$
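
To make the relationship among P(H), P(X/H), P(X), and P(H/X) concrete, the short Python sketch below computes a posterior from three input probabilities. The numeric values are hypothetical and chosen only for illustration; they are not taken from any data set in this paper, and the function name `posterior` is likewise an assumption.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H/X) = P(X/H) * P(H) / P(X)."""
    if evidence_x <= 0:
        raise ValueError("P(X) must be positive")
    return likelihood_x_given_h * prior_h / evidence_x


# Hypothetical values, for illustration only.
p_h = 0.5            # P(H): a customer buys a computer
p_x_given_h = 0.02   # P(X/H): a buyer is 35 years old and earns $40,000
p_x = 0.025          # P(X): any customer is 35 years old and earns $40,000

print(posterior(p_h, p_x_given_h, p_x))
```

With these assumed inputs the posterior P(H/X) is approximately 0.4, lower than the prior of 0.5, because the evidence X is slightly less common among buyers than among customers overall.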

Naive Bayesian Classification

The naive Bayesian classifier, or simple Bayesian
classifier, works as follows:
1. Let D be a training set of tuples and their
associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X
= (x1, x2, ..., xn), depicting n measurements
made on the tuple from n attributes, respectively,
A1, A2, ..., An.



2. Suppose that there are m classes, C1, C2, ...,
Cm. Given a tuple, X, the classifier will predict
that X belongs to the class having the highest
posterior probability, conditioned on X. That is,
the naive Bayesian classifier predicts that tuple X
belongs to the class Ci if and only if P(Ci/X) >
P(Cj/X) for 1 ≤ j ≤ m, j ≠ i. Thus we maximize
P(Ci/X). The class Ci for which P(Ci/X) is
maximized is called the maximum posteriori
hypothesis. By Bayes' theorem,

$$P(C_i/X) = \frac{P(X/C_i)\,P(C_i)}{P(X)}$$


3. As P(X) is constant for all classes, only
P(X/Ci)P(Ci) need be maximized. If the class prior
probabilities are not known, then it is commonly
assumed that the classes are equally likely, that is,
P(C1) = P(C2) = ... = P(Cm). Given data sets
with many attributes, it would be extremely
computationally expensive to compute P(X/Ci). In
order to reduce computation in evaluating P(X/Ci),
the naive assumption of class conditional
independence is made. This presumes that the
values of the attributes are conditionally
independent of one another, given the class label
of the tuple (i.e., that there are no dependence
relationships among the attributes). Thus,

$$P(X/C_i) = \prod_{k=1}^{n} P(x_k/C_i) = P(x_1/C_i) \times P(x_2/C_i) \times \cdots \times P(x_n/C_i)$$

We can easily estimate the probabilities P(x1/Ci),
P(x2/Ci), ..., P(xn/Ci) from the training tuples.
Recall that here xk refers to the value of attribute
Ak for tuple X. For each attribute, we look at
whether the attribute is categorical or continuous
valued.
Finally, the existing prediction model is applied for
classifying the rules [1].
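
The steps above can be sketched for categorical attributes as follows. This is a minimal illustration under stated assumptions, not the improved algorithm evaluated in this paper: probabilities are plain relative frequencies with no smoothing, continuous attributes are not handled, and the function names `train_naive_bayes` and `classify` and the tiny training set are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(tuples, labels):
    """Collect the counts needed to estimate P(Ci) and P(xk/Ci)."""
    class_counts = Counter(labels)
    # value_counts[(class, attribute_index)][value] = number of training tuples
    value_counts = defaultdict(Counter)
    for x, c in zip(tuples, labels):
        for k, value in enumerate(x):
            value_counts[(c, k)][value] += 1
    return class_counts, value_counts


def classify(x, class_counts, value_counts):
    """Return the class Ci that maximizes P(X/Ci) * P(Ci), with its score."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total  # P(Ci)
        for k, value in enumerate(x):
            score *= value_counts[(c, k)][value] / n_c  # P(xk/Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score


# Hypothetical training tuples (age group, income level), for illustration only.
tuples = [
    ("young", "high"), ("young", "low"), ("middle", "high"),
    ("senior", "low"), ("senior", "high"), ("middle", "low"),
]
labels = ["no", "no", "yes", "yes", "yes", "no"]

class_counts, value_counts = train_naive_bayes(tuples, labels)
print(classify(("middle", "high"), class_counts, value_counts))
```

For the tuple ("middle", "high") the score for "yes" is 0.5 × 1/3 × 2/3 and the score for "no" is 0.5 × 1/3 × 1/3, so the sketch predicts "yes". Because P(X) is the same for every class, it is never computed. An attribute value that never occurs with a class drives that class's score to zero, which is why practical implementations usually add a smoothing correction.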




IV. RESULTS


All experiments were performed on an Intel(R) Core(TM)2 CPU at
2.13 GHz with 2 GB RAM, running Microsoft Windows XP
Professional (SP2).



Existing results:

Country:
Texas -> 23.0
Florida -> 54.0
Illinois -> 24.0
Ontario -> 28.0
Washington -> 35.0
Oklahoma -> 53.0
California -> 29.0
Oregon -> 26.0
Alberta -> 41.0
Kentucky -> 49.0
North_Carolina -> 18.0
Georgia -> 26.0
Pennsylvania -> 24.0
Indiana -> 55.0
Virginia -> 25.0
Australia -> 27.0
Michigan -> 28.0
Ohio -> 28.0
Connecticut -> 17.0
Rhode_Island -> 41.0
New_York -> 26.0
United_Kingdom -> 22.0
Massachusetts -> 41.0
Saskatchewan -> 34.0
Idaho -> 60.0
Wisconsin -> 17.0
New_Jersey -> 45.0
Italy -> 37.0
South_Dakota -> 23.0
Louisiana -> 28.0
Vermont -> 44.0
Missouri -> 25.0
Mississippi -> 36.0
Netherlands -> 28.0
Kansas -> 28.0
Alaska -> 69.0
Minnesota -> 28.0
Colorado -> 26.0
Maryland -> 32.0
Utah -> 28.0
Nevada -> 27.0
Washington_D.C. -> 35.0
Wyoming -> 27.0
Arizona -> 41.0
New_Hampshire -> 53.0
South_Carolina -> 53.0
Delaware -> 49.0
Tennessee -> 25.0
Sweden -> 28.0
Afghanistan -> 36.0
Iowa -> 35.0
British_Columbia -> 53.0
Arkansas -> 25.0
Montana -> 41.0
France -> 26.0
Alabama -> 39.0
Kuwait -> 50.0

Finland -> 49.0
Switzerland -> 30.0
New_Zealand -> 19.0
Belgium -> 30.0
China -> 25.0
Spain -> 25.0
Manitoba -> 16.0
Maine -> 49.0
Hong_Kong -> 51.0
Nebraska -> 44.0
Germany -> 43.0
West_Virginia -> 55.0
Brazil -> 28.0
New_Brunswick -> 27.0
Quebec -> 34.0
Other -> 33.0
Colombia -> 33.0
Hawaii -> 28.0
Japan -> 30.0
South_Africa -> 35.0
Portugal -> 30.0
New_Mexico -> 28.0
Austria -> 49.0
India -> 34.0
Namibia -> 35.0
Argentina -> 66.0
Israel -> 31.0
Ireland -> 32.0
(123/672 instances correct)

Accuracy for single country prediction:

Correctly Classified Instances 16 4.381 %
Incorrectly Classified Instances 656 95.619 %

Proposed Approach Results:

Primary_Language = English &&
Actual_Time = Other &&
Race = White &&
Age = 35.0 ==> Professional

Primary_Language = English &&
Actual_Time = Other &&
Community_Membership_Religious > 0 &&
Who_Pays_for_Access_Self > 0 &&
Not_Purchasing_Security <= 0 ==> Professional

Primary_Language = English &&
Community_Membership_Religious > 0 &&
Community_Membership_Family > 0 ==> Other

Primary_Language = English &&
Actual_Time = Other &&
Community_Membership_Religious <= 0 &&
Race = White &&
Major_Geographical_Location = USA &&
Disability_Not_Impaired <= 0 &&
who > 90441 ==> Computer

Primary_Language = English &&
Race = White &&
Actual_Time = Other &&
Community_Membership_Religious <= 0 &&
Major_Geographical_Location = USA &&
Not_Purchasing_No_credit > 0 &&
Community_Membership_Hobbies <= 0 &&
Opinions_on_Censorship <= 3 ==> Other

Primary_Language = English &&
Race = White &&
Actual_Time = Other &&
Community_Membership_Religious <= 0 &&
Major_Geographical_Location = USA &&
Not_Purchasing_No_credit > 0 ==> Computer

Primary_Language = English &&
Race = White &&
Age = 29.0 ==> Professional

Primary_Language = English &&
Race = White &&
Age = Not_Say ==> Computer

Primary_Language = English &&
Race = White &&
Not_Purchasing_Not_option > 0 &&
Not_Purchasing_Cant_find <= 0 ==> Computer

Primary_Language = English &&
Race = White &&
Age = 42.0 ==> Professional

Primary_Language = English &&
Race = White &&
Age = 37.0 &&
Sexual_Preference = Heterosexual ==> Professional

Primary_Language = English &&
Race = White &&
Age = 37.0 ==> Management

Primary_Language = English &&
Race = White &&
Age = 27.0 &&
Not_Purchasing_Privacy <= 0 ==> Management

Primary_Language = English &&
Race = White &&
Age = 27.0 ==> Professional

Primary_Language = English &&
Race = White &&
Age = 47.0 ==> Other

Primary_Language = English &&
Race = White &&
Age = 38.0 &&
Gender = Male ==> Professional


Primary_Language = English &&
Race = White &&
Age = 30.0 &&
How_You_Heard_About_Survey_Others <= 0 ==>
Management

Primary_Language = English &&
Race = White &&
Age = 45.0 &&
Gender = Male ==> Professional

Primary_Language = English &&
Race = White &&
Age = 26.0 ==> Professional

Primary_Language = English &&
Race = White &&
Age = 40.0 ==> Computer

Primary_Language = English &&
Race = White &&
Age = 54.0 ==> Education

Primary_Language = English &&
Race = White &&
Age = 30.0 ==> Professional

Primary_Language = English &&
Age = 24.0 &&
Community_Membership_Hobbies <= 0 ==>
Professional

Primary_Language = English &&
Race = White &&
How_You_Heard_About_Survey_Others <= 0 ==>
Professional

Primary_Language = English &&
Falsification_of_Information = Never ==> Other

Not_Purchasing_Other <= 0 &&
Gender = Male &&
Not_Purchasing_Bad_press <= 0 ==> Computer

Community_Membership_Other <= 0 &&
Major_Geographical_Location = USA ==> Management

Community_Membership_Other <= 0 ==> Professional

: Other

Number of Rules : 89


Time taken to build model: 1.01 seconds
Time taken to test model on training data: 0.04 seconds

=== Error on training data ===

Correctly Classified Instances 617 93.8155 %
Incorrectly Classified Instances 55 6.1845 %
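
The rule listing above can be read as an ordered decision list. The Python sketch below shows one way such a list could be applied, assuming that rules are tested from top to bottom, that the first rule whose conditions all hold determines the class, and that the final ": Other" entry is the default. Only three of the 89 rules are encoded, so the sketch does not reproduce the reported accuracy; the names `RULES`, `DEFAULT`, and `classify` are hypothetical.

```python
import operator

# Comparison operators that appear in the rule listing.
OPS = {"=": operator.eq, ">": operator.gt, "<=": operator.le}

# A small subset of the listed rules: (conditions, predicted class).
RULES = [
    ([("Primary_Language", "=", "English"),
      ("Race", "=", "White"),
      ("Age", "=", 29.0)], "Professional"),
    ([("Primary_Language", "=", "English"),
      ("Falsification_of_Information", "=", "Never")], "Other"),
    ([("Community_Membership_Other", "<=", 0)], "Professional"),
]
DEFAULT = "Other"  # the final ": Other" entry in the listing


def classify(record, rules=RULES, default=DEFAULT):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(attr in record and OPS[op](record[attr], value)
               for attr, op, value in conditions):
            return label
    return default  # no rule matched


print(classify({"Primary_Language": "English", "Race": "White", "Age": 29.0}))
print(classify({"Community_Membership_Other": 1}))
```

Under the first-match assumption, rule order matters: a record that satisfies several rules receives the class of the earliest one, which is why more specific rules appear before more general ones in the listing.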

Performance Analysis:

The graph below compares the running time of the existing and
proposed approaches.

[Figure: bar chart of time (ms). Existing HMM: 23; Proposed Bayes: 11.]

The graph below compares the accuracy of the existing and
proposed approaches.

[Figure: bar chart of correctly vs. incorrectly classified instances (%). Existing HMM prediction: 20.3 correctly, 79.69 incorrectly; Proposed Bayes Based HMM: 75.95 correctly, 24.04 incorrectly.]

V. CONCLUSION AND FUTURE SCOPE

Because of the huge quantity of Web page data on many portal
sites, Web pages are, for convenience, assembled by category. In
this paper, users' browsing behavior is observed at two levels to
match the nature of Web usage data. This massively reduces the
scope of calculation. Next, at level two, using Bayes' theorem to
predict the user's browsing page is more effective and accurate.
The experimental results show that the hit ratio is good at both
levels. The proposed approach gives more predictions on multiple
attributes than the existing approach, with a lower error rate.

REFERENCES

[1] Mamoun A. Awad and Issa Khalil, "Prediction of User's
Web-Browsing Behavior: Application of Markov Model," IEEE
Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics, vol. 42, no. 4, August 2012.

[2] V.V.R. Maheswara Rao, "An Efficient Hybrid Successive Markov
Model for Predicting Web User Usage Behavior using Web Usage
Mining," International Journal of Data Engineering (IJDE).
[3] S. Schechter, M. Krishnan, and M. Smith, "Using Path Profiles
to Predict HTTP Requests," Computer Networks and ISDN Systems,
vol. 30, pp. 457-467, 1998.
[4] X. Chen and X. Zhang, "A Popularity-Based Prediction Model
for Web Prefetching," Computer, pp. 63-70, 2003.
[5] Eugene Charniak, Statistical Language Learning. The MIT
Press, 1996.
[6] X. Chen and X. Zhang, "A popularity-based prediction model
for web prefetching," Computer, 36(3):63-70, March 2003.
[7] M. Deshpande and G. Karypis, "Selective Markov models for
predicting web page accesses," ACM Transactions on Internet
Technology, 4(2):163-184, 2004.
[8] X. Dongshan and S. Junyi, "A new Markov model for web access
prediction," Computing in Science and Engineering, 4(6):34-39,
November/December 2002.