From: FLAIRS-01 Proceedings. Copyright © 2001, AAAI (www.aaai.org). All rights reserved.

Structural Web Search Using a Graph-Based Discovery System
Nitish Manocha, Diane J. Cook, and Lawrence B. Holder
Department of Computer Science and Engineering
Box 19015
University of Texas at Arlington
Arlington, TX 76019-0015
{cook,holder}@cse.uta.edu

Abstract

Accessing information of interest on the Internet presents a challenge to scientists and analysts, particularly if the desired information is structural in nature. Our goal is to design a structural search engine which uses the hyperlink structure of the Web, in addition to textual information, to search for sites of interest. To design a structural search engine, we use the SUBDUE graph-based discovery tool. The tool, called WebSUBDUE, is enhanced by WordNet features that allow the engine to search for synonym terms. Our search engine retrieves sites corresponding to structures formed by graph-based user queries. We demonstrate the approach on a number of structural web queries.

Introduction

The World Wide Web (WWW) provides an immense source of information. Web search engines sift through keywords collected from millions of pages when responding to a particular query. These search engines typically ignore the structure of hyperlink information and cannot resolve complicated queries like 'find all pages that link to a page containing term x'. A hyperlink from one page to another represents an organization that a web page designer wants to establish, or relationships that exist between web sites. A structural search engine searches not only for particular words or topics but also for a desired hyperlink structure.

The WWW can be represented as a labeled graph. SUBDUE is a data mining tool that discovers repetitive substructures in graph-based data [Cook et al. 2000]. We hypothesize that SUBDUE can form the heart of a structural search engine, called WebSUBDUE. In particular, this graph-based data mining tool can discover structure formed by a user query within a graph representation of the WWW. To demonstrate the use of WebSUBDUE as a structural search engine, we perform various experimental queries on the Computer Science and Engineering Department of the UTA web site and compare the results with those of posting similar queries to a popular search engine, Altavista.

In response to a user query, a structural search engine searches not only the text but also the structure formed by the set of pages to find matches to the query. The web consists of hyperlinks that connect one page to another, and this hyperlink structure represents a relationship between the source and destination pages. Such relationships may include a hierarchical relation between a top-level page and a child page containing more detailed information, connections between current, previous and next sites in a sequence of pages, or an implicit endorsement of a site which represents an authority on a particular topic. A structural search engine utilizes this hyperlink structure along with the textual content to find sites of interest. Structural web queries can locate online presentations, web communities, or nepotistic sites. By searching the organization of a set of pages as well as the content, a search engine can provide richer information on the content of web information.

Even a purely textual search can face challenges because human language is very expressive and full of synonymy. A query for 'automobile', for example, may miss the many pages that use the term 'car'. One strategy to overcome this limitation is to inform search algorithms about semantic relations between words. A structural search engine could use this information to improve the retrieval of web sites that reflect the structure and the semantics of the user query.

Much research has focused on using hyperlink information in some way to enhance web search. The Clever system scans the most authoritative sites on topics by making use of hyperlinks [Clever]. A respected authority is a page that is referenced by many good hubs; a useful hub is a location that points to many valuable authorities. Google is another search engine that uses the power of hyperlinks to search the web [Brin and Page 1999]. Google evaluates a site by summing the scores of other locations pointing to the page. Although these systems use hyperlink structures to rank retrieved web pages, they do not perform searches for a range of structural queries. In contrast, WebSUBDUE searches for any type of query describing structure embedded with textual content.
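The example query from the introduction, 'find all pages that link to a page containing term x', can be illustrated on a tiny labeled graph. The following Python sketch is only an illustration under stated assumptions: the page names, the `links` and `words` dictionaries, and the function `pages_linking_to_term` are hypothetical and are not part of WebSUBDUE, which answers such queries by graph matching rather than by this direct lookup.

```python
# Hypothetical toy web site: hyperlinks and the words found on each page.
links = {
    "index.html": ["courses.html", "people.html"],
    "courses.html": ["ai.html"],
    "people.html": [],
    "ai.html": ["index.html"],
}
words = {
    "index.html": {"department", "welcome"},
    "courses.html": {"courses", "schedule"},
    "people.html": {"faculty", "robotics"},
    "ai.html": {"search", "robotics"},
}


def pages_linking_to_term(links, words, term):
    """Return the pages that hyperlink to at least one page containing term."""
    # Step 1: pages whose text contains the term.
    targets = {page for page, page_words in words.items() if term in page_words}
    # Step 2: pages with at least one hyperlink into that set.
    return sorted(
        source
        for source, destinations in links.items()
        if any(dest in targets for dest in destinations)
    )


print(pages_linking_to_term(links, words, "robotics"))
```

A keyword index alone cannot answer this query, because the pages returned need not contain the term themselves; the answer depends on the hyperlink edges.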


We first review SUBDUE's approach to unsupervised discovery. We then introduce the application of SUBDUE as a structural search engine, called WebSUBDUE, in which the search query and the WWW are represented as labeled graphs and discovered instances are reported as the results of the query.

SUBDUE

SUBDUE is a knowledge discovery system that discovers patterns in structural data. SUBDUE accepts data in the form of a labeled graph and performs various types of data mining on the graph. As an unsupervised discovery algorithm, SUBDUE discovers repetitive patterns, or subgraphs, in the graph. As a concept learner, the system identifies concepts that describe structural data in one class and exclude data in other classes. The system can also discover structural hierarchical conceptual clusters.

A substructure in SUBDUE consists of a subgraph defining the substructure and all of the substructure instances throughout the graph. SUBDUE uses a polynomial-time beam search to find the subgraph which yields the best compression of the input graph. The initial state of the search is the set of subgraphs consisting of all uniquely labeled vertices. The only search operator is the Extend Subgraph operator, which extends a subgraph in all possible ways by a single edge and a neighboring vertex. Substructures are kept on an ordered queue of length determined by the beam width, and are evaluated by their ability to compress the graph, following the Minimum Description Length principle.

Once the search terminates and returns the list of best subgraphs, the graph can be compressed using the best subgraph. The compression procedure replaces all instances of the subgraph in the input graph by a single representative vertex. SUBDUE can iterate on this compressed graph to generate a hierarchical description of discovered substructures.

In many databases, patterns exist with small variations between instances. SUBDUE applies a polynomial-time inexact match algorithm to allow for small differences between a substructure definition and a substructure instance. The dissimilarity of two graphs is determined by the number of transformations needed to make one graph isomorphic to the other. Allowed transformations include adding or deleting an edge or vertex, changing a label, and reversing the direction of an edge. Two graphs match if the number of transformations is no more than a user-defined threshold value multiplied by the size of the larger graph. If the threshold value is defined as 0, an exact match is required.

SUBDUE supports biasing the discovery process toward specified types of substructures. Predefined substructures can be provided as input by creating a predefined substructure graph that is stored in a separate file. SUBDUE will locate and expand these substructures, thus "jump-starting" the discovery process. The inclusion of background knowledge has been shown to be of benefit when expert-supplied knowledge is available [Cook and Holder 1999].

Data Preparation

Viewed as a graph, the World Wide Web structure exhibits properties that are of interest to a number of researchers [Kleinberg et al. 1999]. For our structural search engine, we transform web data to a labeled graph for input to the WebSUBDUE system. Data collection is performed using a web robot written in Perl. The web robot only follows links to pages residing on specified servers. As it traverses a web site, the web robot generates a graph file representing the specified site. URLs are processed using a breadth-first search until all pages on the site have been visited or the user-specified page limit is exceeded.

In this graph, each URL is represented as a vertex labeled '_page_', and an edge labeled '_hyperlink_' points from parent URLs to child URLs. If the document currently being processed is either an HTML file or a text file, the file is processed to obtain keyword information. Vertices labeled with the corresponding words are added to the graph, and an edge labeled '_word' connects the page in which a keyword is found to the word vertex. Once the URLs have been scanned and the associative array has been created, a labeled graph is generated representing the web site. The current search engine allows the user to create a graph for a new domain or search an existing graph. New web sites can also be incrementally added to an existing graph.

For this project, WebSUBDUE invokes SUBDUE in predefined substructure mode, where the search query is represented in the form of a substructure. SUBDUE discovers the instances of the predefined substructure in the graph, representing the results of the query. WebSUBDUE reports the graph vertices, edges, and corresponding URLs for each discovered instance. SUBDUE's inexact graph match algorithm can be used by WebSUBDUE to find web sites that closely, but not exactly, match the user query.

WordNet

WordNet is an electronic lexicon database which attempts to organize information according to the meanings of the words instead of the forms of the words [Miller et al. 1991]. In this project, our goal is to augment the structural search technique with stored information about relations between words. We enhance the capabilities of WebSUBDUE's structural search engine by integrating some of WordNet's functionality. In particular, WebSUBDUE uses WordNet to increase its search capacity by searching for words similar to the query term rather than searching only for the query term itself. Here, vertices representing a word within a page are matched with vertices containing a different label if they hold one of the valid relationships defined by WordNet. Valid considerations include matching the base form of the word (e.g., match 'base' or 'basis' with query 'bases'), derived forms of the word (e.g., match 'Jan' with query 'January'), and synonyms (e.g., match 'employment', 'position', or 'work' with query 'job').

Experimental Results

We conducted a number of experiments to evaluate the capabilities of the WebSUBDUE structural search engine. To demonstrate the structural searching capability of WebSUBDUE, we posed several queries to the system and compared the results to similar searches using Altavista.

In these experiments, we restrict the search space to a single sample domain. The sample domain chosen is the web domain for the UTA CSE department, http://www-cse.uta.edu, thus we eliminate the domain name from our summaries. A graph file of the CSE-UTA domain was created using the web robot. The graph representation of the CSE-UTA web site contains 113,933 vertices and 125,657 edges covering 5,825 URLs.

Where possible, we compare query results of WebSUBDUE with search results generated using a popular search engine, Altavista. Altavista's advanced search features include the use of Boolean expressions, the limitation of a search to a particular host or domain, and other search criteria that provide a valuable point of comparison for the results discovered by WebSUBDUE. The search range of Altavista was also restricted to the CSE-UTA domain using the define-host advanced search option in Altavista.

Searching for Presentation Pages

One type of structural query might be to find online lecture materials or HTML papers discussing a particular topic. A query to search for all presentation-style pages was posed to WebSUBDUE. A presentation page is defined here as a series of URLs with links to the next page, the previous page and the top page, the top page being a menu page which has links to all the other pages. This query structure is shown in Figure 1.

WebSUBDUE discovered 22 unique instances, or 22 presentation pages, on the CSE-UTA site. The discovered URLs are listed in Figure 1. These discovered web pages represent presentation pages covering various topics, with a top menu page containing a link to sub-topic pages and each sub-topic page pointing to the next sub-topic page, the previous page and the top menu page.

Figure 1. Presentation page structural query: '_page_' vertices connected by '_hyperlink_' edges, shown with the list of the 22 discovered URLs (lecture and course pages under ~cook, ~holder, and ~reyes).

The Altavista search engine has no method of responding to structural search queries. To assist Altavista, we provided additional information to guide the search, whereas WebSUBDUE just requires the structure information. Altavista was run in the advanced search mode with the following option: "host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif". This search option asks Altavista to search in the CSE-UTA domain for pages containing links to next icon, previous icon, and top icon files.

In response to the presentation page query, Altavista discovered 12 instances compared to the 22 instances WebSUBDUE discovered. Because the actual names of the icons varied between presentations, not all presentation pages were discovered using Altavista. This shows the difficulty of posing these types of structural queries using a text search engine.
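The data preparation step (a breadth-first traversal that emits '_page_' vertices, '_hyperlink_' edges, and '_word' edges) can be sketched as follows. This is a minimal illustration in Python rather than the authors' Perl robot, and it crawls a hypothetical in-memory `site` dictionary instead of fetching real pages; the function name `build_graph`, the vertex numbering, and the tuple-based edge format are all assumptions made for the example.

```python
from collections import deque


def build_graph(site, start, page_limit):
    """Breadth-first crawl of an in-memory site (no network access).

    site maps URL -> (list of linked URLs, list of keywords).
    Returns (vertices, edges, page_id): vertices maps id -> label,
    edges is a list of (source id, target id, label), and page_id
    maps each URL to its vertex id.
    """
    vertices, edges = {}, []
    page_id, word_id = {}, {}

    def vertex(table, key, label):
        # Create a vertex for key on first use and reuse it afterwards.
        if key not in table:
            table[key] = len(vertices) + 1
            vertices[table[key]] = label
        return table[key]

    queue, seen, visited = deque([start]), {start}, 0
    while queue and visited < page_limit:
        url = queue.popleft()
        visited += 1
        out_links, keywords = site[url]
        src = vertex(page_id, url, "_page_")
        for word in keywords:
            edges.append((src, vertex(word_id, word, word), "_word"))
        for child in out_links:
            if child not in site:  # only follow links on the specified site
                continue
            edges.append((src, vertex(page_id, child, "_page_"), "_hyperlink_"))
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return vertices, edges, page_id


# Hypothetical three-page site with one off-site link that is ignored.
site = {
    "a.html": (["b.html", "c.html", "http://elsewhere.example/x"], ["jobs"]),
    "b.html": (["c.html"], ["jobs", "science"]),
    "c.html": (["a.html"], []),
}
vertices, edges, page_id = build_graph(site, "a.html", page_limit=10)
for edge in edges:
    print(edge)
```

Each distinct keyword gets one shared vertex, so two pages containing the same word are connected through it, which is what lets a structural query combine hyperlink structure with text.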

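The inexact match rule (two graphs match if the number of transformations is no more than the threshold multiplied by the size of the larger graph) reduces to a one-line comparison once the transformation count is known. The sketch below assumes that graph size is counted as vertices plus edges, which the text does not state, and it takes the transformation count as given rather than computing it; the function names and the example graph sizes are hypothetical.

```python
def graph_size(num_vertices, num_edges):
    """Size of a graph, counted here as vertices plus edges (an assumption)."""
    return num_vertices + num_edges


def graphs_match(num_transformations, size_a, size_b, threshold):
    """Match rule: transformations <= threshold * size of the larger graph."""
    return num_transformations <= threshold * max(size_a, size_b)


# Hypothetical query graph: 6 vertices and 9 edges, so size 15.
query_size = graph_size(6, 9)
# Hypothetical instance that lacks two of the query's edges.
instance_size = graph_size(6, 7)

# With threshold 0.2 the budget is 0.2 * 15 = 3 transformations.
print(graphs_match(2, query_size, instance_size, threshold=0.2))
# With threshold 0 only an exact match (zero transformations) is accepted.
print(graphs_match(2, query_size, instance_size, threshold=0.0))
```

Because the budget scales with the larger graph, the same threshold tolerates more missing or extra edges on large query structures than on small ones.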
A total of 13 instances are identified with this query.

Text Searching Using Synonyms

To demonstrate the synonym searching capability of WebSUBDUE, we posed queries requesting all pages on 'jobs in computer science' to WebSUBDUE, using the structural query shown in Figure 2 and the synonym match capability, as well as to Altavista. WebSUBDUE discovered 33 instances of the search query, as summarized in Figure 2 along with the match words found in the corresponding page. WebSUBDUE finds 'employment', 'job', 'work', 'problem', and 'task' as synonyms and allowable forms of the search terms. Only exact matches were found for 'computer' and 'science' in these pages. Altavista discovered only 2 instances of the search query, as shown in Figure 2, both of which were also discovered by WebSUBDUE. WebSUBDUE thus covers a wider search range using synonyms of the query terms along with the keywords themselves.

Figure 2. Structural query to find pages on 'jobs in computer science', shown with the WebSUBDUE results (each URL listed with its match word, such as employment, work, job, jobs, problem, or task) and the Altavista results.

Inexact Match for Hubs and Authorities

SUBDUE's inexact graph match capability can be used to enhance WebSUBDUE's web search results. Using an inexact match, web sites can be retrieved which approximate the query structure. The amount of allowable difference is controlled by the user-defined match threshold.

To demonstrate the impact of using an inexact graph match, we search for hub (pages that point to at least three authorities) and authority (pages pointed to by at least three hubs) sites that focus on 'algorithms' with an inexact threshold of 0.2. Two instances of the query structure shown in Figure 3 are retrieved, as opposed to the single instance found using an exact match. The first instance is missing two edges that are provided in the query structure. The second instance is missing three edges and includes an extra node with an associated edge.

Conclusions

In this paper, we introduce the concept of a structural search engine and demonstrate that web search can be enhanced by searching not only for text but also for structure that has been created to reflect relationships between individual pages. We have demonstrated how WebSUBDUE can be used successfully as a structural search engine, and how the search can be enhanced using WordNet functions and an inexact graph match algorithm. Results of these experiments indicate that WebSUBDUE can successfully perform structural search, text search, synonym search, and combinations of these searches. These experiments also demonstrate the need for structural search engines and the inability of existing search engines to perform these functions.

To improve the textual search capabilities of WebSUBDUE, additional features of WordNet, such as hyponyms and hypernyms, can be integrated. Distances between words can be calculated using techniques such as marker passing [Lytinen et al. 2000] or Latent Semantic Analysis [Landauer et al. 1998] and included as part of the graph transformation cost. A user interface that facilitates generation of structural queries is also being considered.

In addition to using WebSUBDUE as a structural search engine, we are also exploring other methods by which SUBDUE can perform data mining on web data. The unsupervised discovery, concept learning, and hierarchical structural clustering capabilities may provide useful insight into the structure and content of web data.

Figure 3. Structural query for hub and authority pages on 'algorithms', shown with the retrieved hub pages and authority pages (course and lecture index pages under ~cook/aa, ~holder/courses/cse2320, and ~holder/courses/cse5311).

Acknowledgements

This work was supported by NSF grant EIA-0086260.

References

Brin, S., and Page, L. 1999. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the Seventh International World Wide Web Conference.

Clever. http://www.almaden.ibm.com/cs/k53/clever.html.

Cook, D. J., and Holder, L. B. 2000. Graph-Based Data Mining. IEEE Intelligent Systems, 15(2), 32-41.

Cook, D. J., Holder, L. B., Galal, G., and Maglothin, R. 2000. Approaches to Parallel Graph-Based Knowledge Discovery. Journal of Parallel and Distributed Computing.

Kleinberg, J., Chakrabarti, S., Dom, B., Gibson, D., Kumar, S. R., Raghavan, P., Rajagopalan, S., and Tomkins, A. 1999. Mining the Link Structure of the World Wide Web. IEEE Computer, 32(8), 60-67.

Landauer, T. K., Foltz, P. W., and Laham, D. 1998. Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.

Lytinen, S., Tomuro, N., and Repede, T. 2000. The Use of WordNet Sense Tagging in FAQFinder. In Proceedings of the AAAI Workshop on Artificial Intelligence for Web Search.

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. J. 1991. Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4), 235-244.
