
Information Systems International Conference (ISICO), 2-4 December 2013

Web Tags Formatting With Multilevel Numbering


Andrea Stevens Karnyoto* and Marius Limpo**
* Department of Information System, Engineering Faculty, Toraja Indonesia Christian University ** Department of Information System, Engineering Faculty, Toraja Indonesia Christian University

Keywords: HTML Tags, Website Characteristic, Tags Mapping

ABSTRACT
An extraction method is needed to locate the headline in a web page or blog automatically, because each website or blog has its own characteristic placement of advertisements, headlines, and link lists. A system maps the HTML tags, whether paired or single, in the website or blog code, and the result can either be stored as a pattern or compared against the saved patterns. The extraction method uses both vertical and horizontal multilevel numbering. Sample data were taken from 20 websites of government institutions, schools, and private companies to obtain tag mappings, which are used to determine the number of patterns produced by a website. With templates compared up to the 5th level, each extracted website or blog contains hundreds of pages, from a minimum of 110 to a maximum of 280, and has fewer than ten patterns. The research concludes that, for every extracted website, the number of patterns is less than 10% of the total number of pages. Copyright 2013 Information Systems International Conference. All rights reserved.

Corresponding Author: Andrea Stevens Karnyoto Department of Information System, Engineering Faculty, Toraja Indonesia Christian University, Jl. Nusantara No. 13, Makale, Tana Toraja, South Sulawesi 90000, Indonesia. Email: karnyoto@gmail.com

1. INTRODUCTION

Over the past decades, research in computing science and information technology has been carried out intensively worldwide to bring computer systems into every area of human life [1]. Managing knowledge on the World Wide Web has become an important issue given the large volume of information that is now available on the Internet. However, managing this knowledge is a difficult task because of the dynamic nature of the Internet [2]. Currently, most Web content is written in HTML, which follows a rigid format in displaying content, because web pages on the syntax-based HTML Web are written for human comprehension [2]. The HEAD element contains information, also called metadata, about a web page, such as its title, keywords, description, language, and place of publication that may be useful to search engines, and other metadata that are not considered page content [3]. The aim of Meta tags is to let search engines find key information about the website [4], [5]. The Meta tag provides authors and Web site owners a means to control how their information is displayed and retrieved in a search engine [6]. It has been observed that collaborative tagging users exhibit a great variety in their sets of tags; some users have many tags, and others have few [7]. Other research has revealed that 27% to 38% of website pages contain Meta tag descriptions [8]. This means that 62% to 73% of web pages do not have Meta tag descriptions. Meta tag descriptions are also limited in character length [9], [10], many web pages do not have a Meta tag description at all [11], the Meta tag description may not match the web page content, and it describes the website as a whole rather than each page. Patterns must therefore be based on tags, because tags follow standard usage rules and the BODY element consists of tags, news and links.

Therefore, the authors created a method to extract and map the BODY element of every website page so that each tag is converted to a unique series number in a tiered series numbering, called the multilevel numbering method. A website page is converted to a tiered series numbering. If this tiered series numbering has nothing in common with the existing tiered series numberings in the database, it is stored as a pattern; otherwise it is saved as a comparison result. The authors created a system using the multilevel numbering method that is able to extract and map the information in the BODY element based on web tags. The result can be used later as a pattern or as comparison material against the useful patterns, for example to find the main content of a website page.

2. RESEARCH METHOD

The purpose of this research is to create an extraction method called multilevel numbering, build a system based on the method, test the system against 20 websites that are online, and count the web tag patterns and the error pages. The system is able to read three website pages at once, while pattern extraction and comparison with the patterns stored in the database run in the background.

The use of tags to identify potential description wording, usually in the text of Web pages, has proved relatively unsuccessful [8]. Multilevel numbering is a numbering made up of level groups that are put into order based on a number assigned according to the position of the element. Tags in the BODY element of the web code are used for mapping, so that a web page can become a pattern or comparison material against patterns.

[Figure 1 labels: Web page; Web tags formatting with multilevel numbering; Result]

Figure 1. Overview of system

In general, the system extracts web page code into a level numbering arrangement (see Figure 1). The developed system takes all pages from a website and processes them one by one. Thus, each page has an extraction result. Every page in a website is extracted to determine the level of the tags in it. The result is compared with the patterns saved in a database. If the pattern is not the same as the pattern of a previous page, it is kept as a new pattern. This system has three classes, i.e. TDownloadPageHttp, TPartingStringHtml and TDatabaseAdder.

Figure 2. Collaborative Process

The TDownloadPageHttp class takes a website page from the internet and sends the page to TPartingStringHtml. The TPartingStringHtml class performs extraction and mapping on the downloaded page. The TDatabaseAdder class stores the extraction result in the database if the extraction result is not yet registered in the database (see Figure 2).
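The collaboration between the three classes can be illustrated with a minimal Python sketch of a three-stage pipeline, one thread per stage. This is not the authors' code: the function names, the `fetch` callback (which stands in for a real HTTP download), the in-memory dictionary used as the "database", and the simplified pattern (the plain sequence of opening and closing tag names rather than the full multilevel numbering) are all assumptions made for illustration.

```python
import queue
import re
import threading

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
STOP = object()  # sentinel telling the next stage that no more items will come


def download_worker(urls, fetch, pages):
    """Stage 1 (role of TDownloadPageHttp): fetch each page and pass it on."""
    for url in urls:
        pages.put((url, fetch(url)))
    pages.put(STOP)


def parse_worker(pages, patterns):
    """Stage 2 (role of TPartingStringHtml): turn a page into a pattern.

    Simplification: the pattern here is only the tag sequence,
    with 1 marking an opening tag and 2 a closing tag.
    """
    while (item := pages.get()) is not STOP:
        url, html = item
        pattern = tuple(
            (2 if slash else 1, name.lower())
            for slash, name in TAG_RE.findall(html)
        )
        patterns.put((url, pattern))
    patterns.put(STOP)


def store_worker(patterns, database):
    """Stage 3 (role of TDatabaseAdder): keep a pattern only if it is new."""
    while (item := patterns.get()) is not STOP:
        url, pattern = item
        database.setdefault(pattern, url)  # first page seen with this pattern


def run_pipeline(urls, fetch):
    pages, patterns, database = queue.Queue(), queue.Queue(), {}
    workers = [
        threading.Thread(target=download_worker, args=(urls, fetch, pages)),
        threading.Thread(target=parse_worker, args=(pages, patterns)),
        threading.Thread(target=store_worker, args=(patterns, database)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return database


# Hypothetical three-page "website" held in memory instead of being downloaded.
site = {
    "/a": "<div><p>x</p></div>",
    "/b": "<div><p>y</p></div>",
    "/c": "<div><ul></ul></div>",
}
db = run_pipeline(list(site), site.__getitem__)
print(len(db))  # 2: pages /a and /b share one pattern, /c has another
```

Because pages /a and /b differ only in their text, they collapse into a single stored pattern, which mirrors the idea that many pages of one website share a small number of templates.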

Only the BODY element of the downloaded page code is used. The tags inside the BODY element are extracted into the multilevel numbering format.
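As a rough illustration of restricting extraction to the BODY element, the following Python sketch collects every tag inside `<body>` together with its start and end offsets in the page source. It is not the authors' implementation: the regular expressions, the function name `body_tags`, and the choice of 0-based inclusive offsets are assumptions made for this example, and a regex scan of this kind will misread tags that appear inside comments or scripts.

```python
import re

# Matches an opening or closing tag; group 1 is "/" for closing tags.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
# Captures everything between <body ...> and </body>.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def body_tags(html):
    """Return (kind, name, start, end) for each tag inside the BODY element.

    kind is 1 for an opening tag and 2 for a closing tag; start and end are
    inclusive character offsets into the full page source.
    """
    body = BODY_RE.search(html)
    if body is None:
        return []
    offset = body.start(1)  # where the BODY content begins in the page
    tags = []
    for tag in TAG_RE.finditer(body.group(1)):
        kind = 2 if tag.group(1) else 1
        tags.append(
            (kind, tag.group(2).lower(), offset + tag.start(), offset + tag.end() - 1)
        )
    return tags


page = (
    "<html><head><title>t</title></head>"
    '<body><div id="menu"><ul><li>a</li></ul></div></body></html>'
)
tags = body_tags(page)
print(tags[0])  # (1, 'div', 41, 55)
```

Tags in the HEAD element, such as `<title>`, are skipped because only the text captured between `<body>` and `</body>` is scanned.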

Figure 3. Web Tags Map Formatting

In Figure 3, A = 1 for an opening tag and 2 for a closing tag, B = level of the tag, C = the tag in the web page, and D = position of the tag in the web page.

Table 1. Tags Position on Multilevel Numbering
A   B(1)  B(2)  B(3)  B(4)  B(5)  B(6)  B(7)  B(8)  ....  B(n)
1   1     2     2     1     2     1     4
1   1     2     2     1     2     1     4     1
2   1     2     2     1     2     1     4     1
1   1     2     2     1     2     1     4     2
2   1     2     2     1     2     1     4     2
2   1     2     2     1     2     1     4

As Table 1 shows, an opening tag and its closing partner have the same level but a different code in column A: 1 marks an opening tag and 2 marks a closing tag. Columns B(1) to B(n) give the position of the tag in multilevel numbering order.
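The numbering scheme in Table 1 can be sketched in Python as follows. This is one possible reading of the table, not the authors' code: each opening tag receives the numbering of its parent extended by its position among its siblings, and the matching closing tag repeats the same numbering with code 2 in column A. The function names are hypothetical, single tags such as `<br>` are not handled, and raising an error on an unmatched closing or unclosed opening tag is an assumption about how malformed pages might be flagged.

```python
def multilevel_numbering(tags):
    """Assign a multilevel number to each tag.

    tags: iterable of (kind, name), kind 1 = opening tag, 2 = closing tag.
    Returns rows of (kind, levels, name), where levels is a tuple B(1)..B(n).
    """
    rows = []
    path = []           # numbering of the currently open tags
    child_count = [0]   # children seen so far at each open depth
    open_names = []     # names of the currently open tags
    for kind, name in tags:
        if kind == 1:
            child_count[-1] += 1          # next sibling position at this depth
            path.append(child_count[-1])
            child_count.append(0)
            open_names.append(name)
            rows.append((1, tuple(path), name))
        else:
            if not open_names or open_names[-1] != name:
                raise ValueError(f"unmatched closing tag </{name}>")
            rows.append((2, tuple(path), name))  # same levels as its opening tag
            path.pop()
            child_count.pop()
            open_names.pop()
    if open_names:
        raise ValueError(f"unclosed tag <{open_names[-1]}>")
    return rows


def format_row(kind, levels):
    """Render a row as comma-separated numbers: column A, then B(1)..B(n)."""
    return ",".join(str(n) for n in (kind, *levels)) + ","


example = [
    (1, "div"), (1, "ul"),
    (1, "li"), (2, "li"),
    (1, "li"), (2, "li"),
    (2, "ul"), (2, "div"),
]
rows = multilevel_numbering(example)
for kind, levels, name in rows:
    print(format_row(kind, levels), name)
```

In this example the two `li` tags receive the levels (1, 1, 1) and (1, 1, 2), so siblings differ only in the last number while an opening tag and its closing partner differ only in column A.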
1,1,2,1,2,       --,--,--,--,<div id="menu">+=188,202=+
1,1,2,1,2,1,     --,--,--,--,--,<ul>+=203,206=+
1,1,2,1,2,1,1,   --,--,--,--,--,--,<li>+=207,210=+
2,1,2,1,2,1,1,   --,--,--,--,--,--,</li>+=242,246=+
1,1,2,1,2,1,2,   --,--,--,--,--,--,<li>+=247,250=+
2,1,2,1,2,1,2,   --,--,--,--,--,--,</li>+=311,315=+
1,1,2,1,2,1,3,   --,--,--,--,--,--,<li>+=316,319=+
2,1,2,1,2,1,3,   --,--,--,--,--,--,</li>+=369,373=+
1,1,2,1,2,1,4,   --,--,--,--,--,--,<li>+=374,377=+
2,1,2,1,2,1,4,   --,--,--,--,--,--,</li>+=433,437=+
1,1,2,1,2,1,5,   --,--,--,--,--,--,<li>+=438,441=+
2,1,2,1,2,1,5,   --,--,--,--,--,--,</li>+=497,501=+
2,1,2,1,2,1,     --,--,--,--,--,</ul>+=557,561=+
2,1,2,1,2,       --,--,--,--,</div>+=562,567=+

Figure 4. Web Tags Level Trim

To keep tags only up to the fifth level, all tags located beyond the fifth level are erased. In Figure 4 the shaded tags (the <li> rows, which lie on the sixth level) are erased, leaving the result shown in Figure 5.
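The trimming step described above can be sketched in Python as a simple filter on the number of levels in each row. The row layout (column A, a tuple of levels, the tag text) and the function name `trim_levels` are assumptions made for this example; the data are the rows of Figure 4.

```python
def trim_levels(rows, max_level=5):
    """Drop every row whose tag lies deeper than max_level."""
    return [row for row in rows if len(row[1]) <= max_level]


# Rows of Figure 4 as (A, levels, tag): A = 1 for opening, 2 for closing.
rows = [
    (1, (1, 2, 1, 2), '<div id="menu">'),
    (1, (1, 2, 1, 2, 1), "<ul>"),
]
for i in range(1, 6):  # the five <li> ... </li> pairs on the sixth level
    rows.append((1, (1, 2, 1, 2, 1, i), "<li>"))
    rows.append((2, (1, 2, 1, 2, 1, i), "</li>"))
rows += [
    (2, (1, 2, 1, 2, 1), "</ul>"),
    (2, (1, 2, 1, 2), "</div>"),
]

trimmed = trim_levels(rows)
for kind, levels, tag in trimmed:
    print(kind, levels, tag)
# Only the <div> and <ul> rows remain, four rows in total.
```

Because every `<li>` row carries six level numbers, the filter removes all ten of them and keeps the four `<div>` and `<ul>` rows.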
1,1,2,1,2,       --,--,--,--,<div id="menu">+=188,202=+
1,1,2,1,2,1,     --,--,--,--,--,<ul>+=203,206=+
2,1,2,1,2,1,     --,--,--,--,--,</ul>+=557,561=+
2,1,2,1,2,       --,--,--,--,</div>+=562,567=+

Figure 5. After Trim Tag Result

The trimmed result shown in Figure 5 is the final pattern produced by the extraction and mapping of a website page. The pattern is saved in the database and contains level information: the numbering of each tag position, the tags in order, and the start and end position of each tag in the web page source code. The pattern is also used as comparison material for the patterns of other web pages in the website.

3. RESULTS AND ANALYSIS

Sample data to test the constructed system were taken from 20 websites, including government, school, and private company websites. The tag extraction results of each website are used to find out how many patterns a website produces, with templates compared up to the 5th level of numbering. The system has three threads, whose duties are to download, extract, and store templates in the database (see Figure 6). Data for the extracted pages can be seen in Figure 7. All templates and related page information are saved in the database.


Figure 6. The Process of Page Extracting

Figure 7. List of page download

Table 2. The Result of Extracting Websites
Website                               Pages  Templates  Errors
http://www.upxanel.com                129    4          0
http://ukitoraja.ac.id                110    3          1
http://resensifilmbagus.blogspot.com  280    6          0
http://www.pa-amurang.go.id           214    9          2
http://www.pa-sidrap.go.id            271    7          3
http://karnyoto.blogspot.com          153    4          0
http://www.sman1-pacitan.sch.id       122    6          3
http://www.sekolahciputra.sch.id      112    4          2
http://www.acsjakarta.sch.id          206    8          6
http://www.sekolahbogorraya.com       168    3          4
http://www.amartakarya.co.id          139    2          0
http://www.anneahira.com              113    4          0
http://www.metrodata.co.id            114    4          0
http://idrusmudeng.wordpress.com      117    3          0
http://www.akprind.ac.id              167    7          3
http://dprd-mamujuutarakab.go.id      123    4          4
http://dprd-pareparekota.go.id        140    6          1
http://www.barrukab.go.id             181    8          3
http://sementonasa.co.id              111    3          3
http://www.ekaristi.org               194    4          2

Table 2 shows the results of running the system to extract and map the pages of 20 websites. The fields show the number of pages processed, the number of templates produced by multilevel numbering, and the number of pages that could not be processed. Errors in the mapping and extraction of several pages of a website are caused by errors in the tag syntax written by the website programmer, even though a browser can still read or ignore them.
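The figures in Table 2 can be checked with a short Python calculation of the template-to-page ratio and the error rate for each website. The three lists below copy the Pages, Templates, and Errors columns in table order; the variable names are chosen for this example only.

```python
# Columns of Table 2, in the order the websites are listed.
pages = [129, 110, 280, 214, 271, 153, 122, 112, 206, 168,
         139, 113, 114, 117, 167, 123, 140, 181, 111, 194]
templates = [4, 3, 6, 9, 7, 4, 6, 4, 8, 3,
             2, 4, 4, 3, 7, 4, 6, 8, 3, 4]
errors = [0, 1, 0, 2, 3, 0, 3, 2, 6, 4,
          0, 0, 0, 0, 3, 4, 1, 3, 3, 2]

# Share of pages that became distinct templates, per website.
template_ratios = [t / p for t, p in zip(templates, pages)]
# Share of pages that could not be processed, per website.
error_ratios = [e / p for e, p in zip(errors, pages)]

max_template_ratio = max(template_ratios)  # about 4.9% (6 templates / 122 pages)
max_error_ratio = max(error_ratios)        # about 3.3% (4 errors / 123 pages)

print(f"pages per website: {min(pages)} to {max(pages)}")
print(f"largest template ratio: {max_template_ratio:.2%}")
print(f"largest error ratio: {max_error_ratio:.2%}")
```

On these numbers, no website exceeds nine templates, the template ratio stays below 10% for every website, and the highest error rate stays below 3.5%.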

4. CONCLUSION

The content of the BODY element is a combination of news and web tags, so a method to extract and map web tags into a unique series number is required; this research uses the multilevel numbering method. For every extracted website, the number of patterns is less than 10% of the number of pages in the website. The smallest number of pages is 110 and the largest is 280. The system produces less than 3.5% errors using the multilevel numbering method. The system results, which are web tag patterns, can already be used as material to determine the position of headlines, advertisements and link lists, since all component positions are listed in a unique tiered series numbering. Sometimes the web programmer does not pay attention to how tags are written in a website. As a result, some tags have an opening but no closing, or vice versa, and the system reads these as errors. Further research is suggested to obtain the headline of a website using the multilevel numbering method, and to format website tags using other methods.

ACKNOWLEDGEMENTS

This research was fully funded and supported by the Department of Information System, Engineering Faculty, Toraja Indonesia Christian University.

REFERENCES
[1] Y. Bassil and M. Alwani, "Autonomic Html Interface Generator For Web Applications", International Journal of Web & Semantic Technology, vol. 3(1), pp. 1-15, January 2012.
[2] T.C. Du, et al., "Managing knowledge on the Web - Extracting ontology from HTML Web", Decision Support Systems, vol. 47(4), pp. 319-331, Nov 2009.
[3] A. Noruzi, "A Study of HTML Title Tag Creation Behavior of Academic Web Sites", Journal of Academic Librarianship, vol. 33(4), pp. 501-506, 2007.
[4] H. Joseph Wen, et al., "E-commerce Website design: strategies and models", Information Management & Computer Security, vol. 9(1), pp. 5-12, 2001.
[5] H. Han and R. Emasri, "Learning Rules for Conceptual Structure on the Web", Journal of Intelligent Information Systems, vol. 22(3), pp. 237-256, 2004.
[6] R. Henshaw, "The First Monday Metadata Project", Libri, 1999, pp. 125-131.
[7] S.A. Golder and B.A. Huberman, "Usage Patterns of Collaborative Tagging Systems", Journal of Information Science, vol. 32(2), pp. 198-208, 2006.
[8] T. C. Craven, et al., "HTML Tags as Extraction Cues for Web Page Description Construction", Informing Science Journal, vol. 6, pp. 1-12, 2003.
[9] A. Riad, et al., "Web Image Retrieval Search Engine based on Semantically Shared Annotation", IJCSI International Journal of Computer Science, vol. 9, Issue 2, No 3, March 2012.
[10] A.W. Huang and T.R. Chuang, "Social Tagging, Online Communication, and Peircean Semiotics: A Conceptual Framework", Journal of Information Science, vol. 20(10), pp. 1-18, 2008.
[11] S. Gupta, et al., "Automating Content Extraction of HTML Documents", World Wide Web Journal, vol. 1, pp. 3-7, 2004.

BIBLIOGRAPHY OF AUTHORS


Andrea Stevens Karnyoto, S.Kom., MT., born on Sept 8th, 1979, obtained his Bachelor in Informatics Technology from STMIK Dipanegara, Makassar, Indonesia, in 2003, and his Master in Electrical Engineering with a concentration in Informatics Technology (2010) from Hasanuddin University. He has been head of the informatics technologies consultant at CV. Anugrah Empat Pilar since 2007 and a lecturer in the Informatics Technology department of Toraja Indonesia Christian University since 2012.

Marius Limpo, S.Kom., born in Makassar on May 19th, 1985, obtained his Bachelor in Informatics Technology from ATMA JAYA University Makassar, Indonesia, in 2010. He was a research student at KYUSHU UNIVERSITY, Japan, from October 2011 until March 2012, and has been a lecturer in the Informatics Technology department of Toraja Indonesia Christian University since 2012.
