(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010
efficiency. Researchers have proposed a number of methods for making better use of contextual characteristics; feature selection is a popular one. In addition to features drawn from HTML tags, a webpage can be classified based on its own URL. URLs are highly effective in classification. First, a URL is easily retrievable: each URL identifies exactly one webpage, and each webpage has its own URL. Second, if this method is used alone, classifying a webpage from its URL removes the need to download the whole page. This makes it an appropriate method for classifying pages that do not exist or cannot be downloaded, or where time or space is critical, for example in real-time classification.
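As a rough illustration of working from the URL alone, the sketch below splits a URL into lowercase tokens that could serve as classification features. The function name `url_tokens` and the tokenization rule (splitting on non-alphanumeric characters) are assumptions made for this example; the text above does not specify how URL features are extracted.

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """Split a URL's host, path and query into lowercase alphanumeric tokens.

    Hypothetical helper: the splitting rule is an assumption, not the paper's method.
    """
    parts = urlparse(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    # Split on any run of characters that is not a lowercase letter or digit.
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

print(url_tokens("http://www.example.edu/courses/cs101/index.html"))
# ['www', 'example', 'edu', 'courses', 'cs101', 'index', 'html']
```

Only the URL string is needed here, so no page content has to be downloaded.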
C. Bayesian Algorithm
Bayesian inference provides a probabilistic method for inference. It is built on the assumption that the quantities of interest follow probability distributions, and that optimal decisions can be made by reasoning about these probabilities together with the observed data. Because it offers a quantitative way of weighing the evidence that supports different hypotheses, it is of great importance in machine learning. Bayesian inference provides a direct method for handling probabilities in learning algorithms, and it also creates a framework for analyzing the performance of algorithms that do not deal with probabilities directly.

In many cases, the problem is to find the best hypothesis in a hypothesis space H given the training data D. One way to express the best hypothesis is to say that we are looking for the most probable hypothesis given the data D together with initial knowledge of the prior probabilities of the hypotheses in H. Bayes' theorem is a direct method for calculating these probabilities.

To state Bayes' theorem, P(h) denotes the initial probability that hypothesis h is true, before the training data are observed. P(h) is usually called the prior probability, and it expresses any prior knowledge about the chance that hypothesis h is correct. If there is no initial knowledge about the hypotheses, we can assign the same probability to every hypothesis in H. Likewise, P(D) denotes the prior probability that the data D are observed, in other words the probability of observing D when nothing is known about which hypothesis is correct. P(D|h) denotes the probability of observing D in a world where hypothesis h is true. In machine learning we look for P(h|D), i.e. the probability that hypothesis h is correct given that the training data D have been observed. P(h|D) is called the posterior probability, as it expresses our confidence in hypothesis h after observing the data D.

Bayes' theorem is the main building block of Bayesian learning, as it provides a method for calculating the posterior probability P(h|D) from P(h), P(D) and P(D|h):

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)} \qquad (1)$$

As expected, P(h|D) increases as P(h) and P(D|h) increase. It is also reasonable that P(h|D) decreases as P(D) increases, because the more probable it is that D is observed independently of h, the less evidence D provides in support of h.

In many learning scenarios, the learner considers a set of hypotheses H and is interested in finding the hypothesis h ∈ H that is the most probable one, or at least one of the most probable ones. Any such hypothesis is called a maximum a posteriori (MAP) hypothesis. Using Bayes' theorem, the MAP hypothesis can be found by calculating the posterior probability of each candidate. In other words, HMAP is a hypothesis that
$$
\begin{aligned}
H_{MAP} &= \arg\max_{h_j \in H} P(h_j \mid D_i) \\
        &= \arg\max_{h_j \in H} \frac{P(D_i \mid h_j)\, P(h_j)}{P(D_i)} \qquad (2) \\
        &= \arg\max_{h_j \in H} P(D_i \mid h_j)\, P(h_j)
\end{aligned}
$$

Note that P(Di) is dropped in the final step because it does not depend on hj and is therefore a constant. When a problem is described by several features, the rule can be generalized to all of their values. For feature values D1, D2, …, Dn, assuming the features are conditionally independent given the hypothesis:

$$H_{MAP} = \arg\max_{h_j \in H} P(h_j) \prod_{i=1}^{n} P(D_i \mid h_j) \qquad (3)$$

P(Di|hj) is calculated as follows:

$$P(D_i \mid h_j) = \frac{n_c + m\,p}{n + m} \qquad (4)$$

Where:
n = number of examples for which h = hj
nc = number of examples for which D = Di and h = hj
p = prior estimate of P(Di|hj)
m = equivalent sample size, i.e. the weight given to the prior estimate p relative to the observed data
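Equations (3) and (4) can be sketched together as a small classifier. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy feature values and labels, and the defaults p = 0.5 and m = 2 are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count the quantities used by Eqs. (3) and (4).

    `examples` is a list of (feature_values, label) pairs (hypothetical format).
    """
    class_counts = Counter()               # n for each hypothesis h_j
    feature_counts = defaultdict(Counter)  # n_c, keyed by (feature index i, h_j)
    for features, label in examples:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def m_estimate(n_c, n, p, m):
    """Eq. (4): smoothed estimate of P(D_i | h_j)."""
    return (n_c + m * p) / (n + m)

def classify(features, class_counts, feature_counts, p=0.5, m=2):
    """Eq. (3): pick the label maximizing P(h_j) * prod_i P(D_i | h_j)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total  # prior P(h_j)
        for i, value in enumerate(features):
            n_c = feature_counts[(i, label)][value]
            score *= m_estimate(n_c, n, p, m)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data: each example is (top-level domain, one path token) with a made-up label.
examples = [
    (("edu", "course"), "academic"),
    (("edu", "people"), "academic"),
    (("com", "shop"), "commercial"),
    (("com", "course"), "commercial"),
]
class_counts, feature_counts = train(examples)
print(classify(("edu", "course"), class_counts, feature_counts))  # academic
```

With m > 0, a feature value never seen with a class still gets a nonzero probability, so a single unseen value does not force the whole product in Eq. (3) to zero.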
D. Using Features of Neighbor Webpages
Although the URL of a webpage contains useful features, these features may be missing, misleading, or unrecognizable for a particular URL. For example, some pages have very short addresses or little textual content. In such cases, it is difficult for a classifier to decide accurately from the URL features of the webpage alone. To solve this problem, features can be extracted from neighbor pages that are related to the webpage under classification. In this study, pages that share a parent with the target page, i.e. sibling pages, are employed to achieve higher classification accuracy.
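The idea of collecting sibling pages can be sketched as follows. This assumes that a sibling is any page linked from the same parent page as the target, and that the link structure is available as a dictionary from parent URL to linked URLs; both the `sibling_pages` function and the `links` input format are hypothetical and are not taken from the paper.

```python
def sibling_pages(target, links):
    """Return the pages that share at least one parent with `target`.

    `links` maps a parent URL to the list of URLs it links to (assumed input format).
    """
    siblings = set()
    for parent, children in links.items():
        if target in children:
            # Every other page linked from this parent is a sibling of the target.
            siblings.update(c for c in children if c != target)
    return sorted(siblings)

links = {
    "http://example.edu/": ["http://example.edu/a", "http://example.edu/b", "http://example.edu/c"],
    "http://example.edu/x": ["http://example.edu/c", "http://example.edu/d"],
}
print(sibling_pages("http://example.edu/c", links))
# ['http://example.edu/a', 'http://example.edu/b', 'http://example.edu/d']
```

The URLs of the returned siblings could then supply additional features when the target's own URL is too short to be informative.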
169 http://sites.google.com/site/ijcsis/ ISSN 1947-5500