Published by: ijcsis on Jun 12, 2010
Copyright:Attribution Non-commercial

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010
Webpage Classification based on URL Features and Features of Sibling Pages

Dr. Mashallah Abbasi Dezfuli
Department of Computer Engineering, Science and Research Branch, Islamic Azad University (IAU), Khouzestan, Iran, Abbasi_masha@yahoo.com

Dr. Amir Masoud Rahmani
Department of Computer Engineering, Science and Research Branch, Islamic Azad University (IAU), Tehran, Iran, rahmani@srbiau.ac.ir

Sara Meshkizadeh
Department of Computer Engineering, Science and Research Branch, Islamic Azad University (IAU), Khouzestan, Iran, Sara_meshkizadeh@yahoo.com

Abstract: Webpage classification plays an important role in information organization and retrieval. It involves assigning a webpage to one or more predetermined categories. The uncontrolled nature of web content implies that webpage classification requires more work than traditional text classification. The interconnected nature of hypertext, however, provides some features that contribute to the process, for example the URL features of a webpage. This study shows that using such features together with features of sibling pages, i.e. pages that are siblings of the page under classification, and a Bayesian algorithm for combining the results of these features, makes it possible to improve the accuracy of webpage classification.
 
 Keywords: classification, hyper text, URL, sibling pages, Bayesian algorithm
I. INTRODUCTION

Traditional classification is treated as supervised learning, in which a set of labeled data is used to train a classifier that is then employed for future classification. Webpage classification differs from traditional text classification in a number of respects. First, traditional text classification is usually performed on structured documents written in a fixed style (e.g. news articles), whereas webpage content is far from having such characteristics. Second, webpages are HTML-structured documents that are rendered visually for the user. Classification plays a significant role in information management and retrieval tasks. On the web, webpage classification is essential for focused crawling, web directories, topic-specific web links, contextual advertising, and topical structure. It can be of great help in increasing search quality on the web.

Using URL-related information, [1] achieved 48% accuracy in classification. [2] uses URL information as well as webpage pictures to provide a classification with a tree structure. [3] combines parent page information with HTML features to arrive at a classification. [4] reaches 80% accuracy through a combination of web mining information. By combining some features of HTML and URL, nine features in total, classification has been achieved with 80% accuracy. This study introduces a novel method for webpage classification that enhances existing algorithms. The method combines three different features of the webpage URL, and it also employs information from neighboring pages that are siblings of the page being classified, i.e. sibling pages, to increase classification accuracy.

In this paper, Section 2 describes related concepts. Section 3 introduces the proposed algorithm in detail. Section 4 evaluates the suggested algorithm, and Section 5 provides conclusions and insights towards future work.

II. RELATED CONCEPTS
 A. Selecting Database
As webpage classification is usually treated as supervised learning, classified samples are needed for training. In addition, some samples are required to test the classifier for evaluation. Manual labeling demands considerable human effort; therefore, available web directories are often used for research. One of the most widely used is ODP [12]. In this database, 4519050 websites are classified by 84430 classification editors. This research was performed on the universities, shopping, forums and FAQ (frequently asked questions) categories of ODP.
 B. Feature Extraction from Webpage URL
The textual content is the most important feature available on a webpage. However, given the wide range of noise on webpages, using a bag of words directly may not lead to higher efficiency. Researchers have proposed a number of methods for making better use of textual characteristics; feature selection is a popular one. In addition to the features of HTML tags, a webpage can be classified based on its own URL. URLs are highly effective in classification. First, a URL is easily retrievable, each URL points to one webpage, and each webpage has one specific URL. Second, if this method is used on its own, classifying a webpage by its URL removes the need to download the whole page. This makes it an appropriate method for classifying pages that no longer exist or cannot be downloaded, or where time or space is critical, for example in real-time classification.
C. Bayesian Algorithm
Bayesian inference provides a probabilistic method for inference. It is built on the assumption that the quantities of interest follow probability distributions, and that optimal decisions can be made by reasoning about these probabilities together with observed data. As it offers a quantitative way of weighing the evidence that supports different hypotheses, it is of great importance in machine learning. Bayesian inference provides a direct method for dealing with probabilities in learning algorithms, and it also creates a framework for analyzing the performance of algorithms that do not deal with probabilities directly. In many cases, the problem is finding the best hypothesis in a hypothesis space H given the available training data D. One way to express the best hypothesis is to say that we are looking for the most probable hypothesis given the data D together with initial knowledge of the prior probabilities of the hypotheses in H. Bayes' theorem is a direct method for calculating these probabilities.

To define Bayes' theorem, P(h) denotes the initial probability that hypothesis h holds before the training data are observed. P(h) is usually called the prior probability, and it expresses any prior knowledge about the chance that hypothesis h is correct. If there is no such prior knowledge, we can assign the same probability to every hypothesis in H. Likewise, P(D) denotes the prior probability that the data D are observed, in other words the probability of observing D when nothing is known about which hypothesis is correct. P(D|h) denotes the probability of observing D in a world where hypothesis h is true. In machine learning we look for P(h|D), i.e. the probability that hypothesis h is correct given the observed training data D. P(h|D) is called the posterior probability, as it expresses our confidence in hypothesis h after observing D.

Bayes' theorem is the main building block of Bayesian learning, as it provides a method for calculating the posterior probability P(h|D) from P(h), P(D) and P(D|h):

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)} \tag{1}$$

As expected, P(h|D) increases with P(h) and with P(D|h). It is also reasonable that P(h|D) decreases as P(D) increases, because the more probable it is that D is observed independently of h, the less evidence D provides in support of h. In many learning scenarios, the learner considers a set of hypotheses H and is interested in finding the hypothesis h ∈ H that is the most probable one, or at least one of the most probable ones. Any hypothesis with this property is called a maximum a posteriori (MAP) hypothesis. The MAP hypothesis can be found by using Bayes' theorem to calculate the posterior probability of each candidate. In other words, h_MAP is the hypothesis that satisfies

$$h_{MAP} = \arg\max_{h_j \in H} P(h_j \mid D) = \arg\max_{h_j \in H} \frac{P(D \mid h_j)\,P(h_j)}{P(D)} = \arg\max_{h_j \in H} P(D \mid h_j)\,P(h_j) \tag{2}$$

Note that P(D) is dropped in the final step, as it does not depend on h and is therefore a constant. When the data consist of several features D_1, D_2, ..., D_n, and these features are assumed to be conditionally independent given the hypothesis, the theorem can be generalized over all of them:

$$h_{MAP} = \arg\max_{h_j \in H} P(h_j) \prod_{i=1}^{n} P(D_i \mid h_j) \tag{3}$$

P(D_i|h_j) is calculated as follows:

$$P(D_i \mid h_j) = \frac{n_c + m\,p}{n + m} \tag{4}$$

Where:
n = number of examples where h = h_j
n_c = number of examples where D = D_i and h = h_j
p = initial estimate for P(D_i|h_j)
m = equivalent sample size, which sets how strongly the initial estimate p is weighted
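
As an illustration of how equations (3) and (4) fit together, the following is a minimal Python sketch, not the authors' implementation. The function names, the toy training pairs, the uniform choice of p (one over the number of distinct values seen for a feature) and the default m = 1 are all assumptions made for the example. Logarithms are summed instead of multiplying probabilities, which leaves the arg max unchanged.

```python
import math
from collections import Counter, defaultdict


def m_estimate(n_c, n, p, m):
    """Equation (4): smoothed estimate of P(D_i | h_j)."""
    return (n_c + m * p) / (n + m)


def train(examples):
    """examples: list of (feature_tuple, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    joint = defaultdict(Counter)   # joint[(i, label)][value] holds n_c
    values = defaultdict(set)      # distinct values seen for feature i
    for features, label in examples:
        for i, value in enumerate(features):
            joint[(i, label)][value] += 1
            values[i].add(value)
    return class_counts, joint, values


def classify(features, model, m=1.0):
    """Equation (3): pick the class maximizing P(h_j) * prod_i P(D_i | h_j)."""
    class_counts, joint, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)            # log P(h_j)
        for i, value in enumerate(features):
            p = 1.0 / len(values[i])           # assumed uniform initial estimate
            score += math.log(m_estimate(joint[(i, label)][value], n, p, m))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (postfix, directory) -> class
EXAMPLES = [
    ((".edu", "none"), "university"),
    ((".ac.ir", "none"), "university"),
    ((".com", "forum"), "forum"),
    ((".com", "faq"), "faq"),
    ((".com", "none"), "shopping"),
]
MODEL = train(EXAMPLES)
```

Calling `classify((".edu", "none"), MODEL)` on this toy data selects the university class. Because of the m-estimate, a feature value never seen with a class still receives a small non-zero probability instead of zeroing out the whole product.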
 D. Using Features of Neighbor Webpage
Although the URL address of a webpage contains useful features, in a particular URL they may be missing, misleading, or unrecognizable for some reason. For example, some pages have very short addresses or sparse textual content. In such cases, it would be difficult for a classifier to make an accurate decision based on the URL features of the webpage alone. To solve the problem, these features can be extracted from neighbor pages that are related to the webpage under classification. In this study, pages that are siblings of the page under classification, i.e. sibling pages, are employed to arrive at more accurate classification.
III. PROPOSED ALGORITHM
A. Pre-processing of Web Content
In most studies, pre-processing is performed before the web content is fed to the classifier. In the method proposed here, only the URL address and the page title are required to run the classification algorithm. First, the URL address and page title are pre-processed: words shorter than 3 characters, numbers, and the conjunctions and prepositions stored in a table called Stopwords are removed from this content. Using the Porter stemmer [13], the remaining useful words are reduced to their stems (to avoid data redundancy), and they are then stored in the relevant table of the program's data bank together with the frequency of each word.
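
The pre-processing steps described above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions: the stopword set is a small hypothetical sample (including URL scheme words such as "http" and "www", which the paper does not mention), and `crude_stem` is a deliberately simple suffix stripper standing in for the Porter stemmer [13], not an implementation of it. The example URL is a placeholder.

```python
import re
from collections import Counter

# Hypothetical stand-in for the Stopwords table (conjunctions, prepositions,
# plus URL scheme words added here as an assumption).
STOPWORDS = {"and", "but", "for", "with", "from", "into", "http", "https", "www"}


def crude_stem(word):
    """Very rough suffix stripping; a placeholder for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(url, title):
    """Return stem frequencies for the words of a URL and page title."""
    # Keeping only alphabetic runs drops numbers and punctuation.
    tokens = re.findall(r"[a-z]+", (url + " " + title).lower())
    kept = [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]
    return Counter(crude_stem(t) for t in kept)


frequencies = preprocess(
    "http://www.example.edu/forums/12", "Discussion Forums for Students"
)
```

Here `frequencies` maps each stem to its count, which corresponds to the word-plus-frequency rows that the text stores in the program's data bank.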
B. Using Extracted Features from Webpage URL
1) First Feature
URLs have features that are well suited to classification. Two sets of features that are easily extracted from page URLs are Postfix and Directory. The general format of a postfix is usually Abbreviation or Abbreviation.Abbreviation. For example, .edu, .ac.ir or .ac.uk indicate pages related to universities or academic websites. The general format of a Directory feature is mostly Word(Abbreviation) followed by a slash. For example, a directory named FAQ or Forum, represented as /Faq/ or /Forum/, can indicate the relation between the current page and the relevant class. These features are listed in Table 1.
Table 1: Features of URL Addresses
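
A rough Python sketch of extracting the Postfix and Directory features might look like the following. The two lookup dictionaries are hypothetical and contain only the examples named in the text; the class labels, the function name `url_features` and the example URLs are assumptions for illustration, not the paper's actual tables.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables built from the examples in the text.
POSTFIX_CLASSES = {".edu": "university", ".ac.ir": "university", ".ac.uk": "university"}
DIRECTORY_CLASSES = {"faq": "faq", "forum": "forum", "forums": "forum"}


def url_features(url):
    """Return the (postfix, directory) features found in a URL, or None for each."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()
    # Try longer postfixes first so that ".ac.uk" wins over any shorter match.
    postfix = next(
        (p for p in sorted(POSTFIX_CLASSES, key=len, reverse=True) if host.endswith(p)),
        None,
    )
    segments = [seg.lower() for seg in parsed.path.split("/") if seg]
    directory = next((seg for seg in segments if seg in DIRECTORY_CLASSES), None)
    return postfix, directory
```

For example, `url_features("http://www.example.ac.uk/Faq/index.html")` yields the postfix `.ac.uk` and the directory `faq`, each of which can then be mapped to a candidate class.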
 
2) Second Feature
The second important feature of a webpage URL concerns domains that the system has already seen and whose correct class has been identified. Once the class of a webpage is determined at the last step and its accuracy is confirmed by the user, the page's address is registered in the URL table along with the correct class. Afterwards, if a webpage with the same domain is given to the system, the system can recognize it more simply and quickly based on the similarity of the addresses. Take for example the address www.aut.ac.ir/sites/e-shopping/raja.ir, whose domain is recognized as shopping and which is therefore registered under the shopping class in the URL table. If a webpage such as www.aut.ac.ir/sites/e-shopping/iranair.ir is given to the system as a test, then because the domain www.aut.ac.ir/sites/e-shopping/ is in the table, the system quickly and simply assigns the shopping class to the second address. It should be noted that using the domain similarity of webpage URLs is a new idea which has not previously been taken into account in webpage classification.
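
The lookup described above can be sketched as a small table keyed by URL prefix. This is a minimal Python illustration, assuming that the "domain" in the example is the address with its last path segment removed; the names `url_prefix` and `UrlTable` are hypothetical.

```python
def url_prefix(url):
    """Return the address up to and including its last slash (assumed 'domain')."""
    stripped = url.split("://", 1)[-1].rstrip("/")
    head, sep, _ = stripped.rpartition("/")
    return head + "/" if sep else stripped + "/"


class UrlTable:
    """Stores confirmed classes by URL prefix and looks up new addresses."""

    def __init__(self):
        self._classes = {}

    def register(self, url, label):
        # Called once a page's class has been confirmed by the user.
        self._classes[url_prefix(url)] = label

    def lookup(self, url):
        # Returns None when the prefix has not been seen before.
        return self._classes.get(url_prefix(url))


table = UrlTable()
table.register("www.aut.ac.ir/sites/e-shopping/raja.ir", "shopping")
```

With this table, `table.lookup("www.aut.ac.ir/sites/e-shopping/iranair.ir")` returns the shopping class, mirroring the example in the text, while an address with an unseen prefix returns `None` and falls through to the other features.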
3) Third Feature
To enhance the efficiency of the proposed algorithm, another method based on the URL address can also be employed. This method involves expanding the URL of a webpage to make better use of its existing elements. For example, considering http://www.washington.edu/news/nytimes together with the page title, i.e. NewYorkTimes, a human easily understands that the address refers to the NewYorkTimes news database. The machine, however, misses the point. To solve the problem and to bring this kind of human guess into the procedure, a likelihood function is presented that builds on the function used in [1], with some changes in definition and usage. Table 2 illustrates the method. The likelihood function compares a URL token, letter by letter, with a similar word in the page title.
Table 2: Likelihood Function
For example, consider the news website Newyorktimes with the address http://www.nytimes.com. Now take the title "The New York Times - Breaking News, World News & Multimedia". The above-mentioned function calculates the likelihood for the case where the URL token is Nytimes and the similar word in the page title is Newyorktimes, as below:

N (Condition 1), E (Condition 5), W (Condition 5), Y (Condition 1), O (Condition 5), R (Condition 5), K (Condition 5), T (Condition 1), I (Condition 3), M (Condition 3), E (Condition 3), S (Condition 4)

With the ranks given in Table 2, these conditions sum to 2 + 0 + 0 + 2 + 0 + 0 + 0 + 2 + 1 + 1 + 1 + 2 = 11. In this case, the word Newyorktimes receives a score of 11 from the words of the page title, which is the highest score among the tokens, and according to the likelihood
Table 1: Features of URL Addresses

| Number | Specification | URL feature |
|---|---|---|
| 1 | To find class of pages based on Stem URL Address | Postfix: .edu, .ac.ir |
| 2 | To find class of pages based on rest of Stem URL Address | Directory: Forums/, Faq/ |

Table 2: Likelihood Function

| Number | Condition | Rank |
|---|---|---|
| 1 | First letter of URL token is the same as first letter of similar word at page Title | 2 |
| 2 | First letter of URL token is the same as another letter of similar word at page Title | 1 |
| 3 | A letter of URL token is the same as a letter of similar word at page Title | 1 |
| 4 | Last letter of URL token is the same as last letter of similar word at page Title | 2 |
| 5 | Ignoring a character of URL token | 0 |
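
One possible reading of the likelihood function in Table 2 is sketched below in Python. The paper does not spell out the alignment procedure, so several points are assumptions inferred from the Nytimes / Newyorktimes example: title letters are scanned left to right and matched greedily against the next unmatched token letter, Condition 1 is applied whenever the matched letter begins a title word, Condition 4 only when the last token letter matches the last title letter, and unmatched title letters fall under Condition 5. The function name and return shape are hypothetical.

```python
# Rank of each condition, taken from Table 2.
RANK = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0}


def likelihood_score(url_token, title_words):
    """Score a URL token against the words of a page title.

    Returns (score, trace), where trace lists (letter, condition) pairs.
    """
    token = url_token.lower()
    words = [w.lower() for w in title_words if w]
    total_letters = sum(len(w) for w in words)
    j = 0      # index of the next unmatched token letter
    seen = 0   # number of title letters scanned so far
    trace = []
    for word in words:
        for k, ch in enumerate(word):
            seen += 1
            if j < len(token) and ch == token[j]:
                if j == len(token) - 1 and seen == total_letters:
                    cond = 4   # last letter matches last letter
                elif k == 0:
                    cond = 1   # match on the first letter of a title word
                elif j == 0:
                    cond = 2   # first token letter matches a later letter
                else:
                    cond = 3   # any other matching letter
                j += 1
            else:
                cond = 5       # letter skipped, contributes nothing
            trace.append((ch.upper(), cond))
    score = sum(RANK[cond] for _, cond in trace)
    return score, trace


score, trace = likelihood_score("nytimes", ["New", "York", "Times"])
```

Under these assumptions the sketch assigns the same condition to each letter as the worked example in the text and arrives at the same score of 11 for the token Nytimes.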
http://sites.google.com/site/ijcsis/ ISSN 1947-5500
