KMIS2011

Structure-based interrogation and automatic query reformulation

Mohamed Ben Aouicha, Ines Kamoun, Mohamed Tmar and Abdelmajid Ben Hamadou
MIRACL, Sfax University

Keywords: XML retrieval, XML query optimization, virtual links, INEX

Abstract: This paper presents an information retrieval system on XML documents based on tree matching. Queries and documents are represented by extended trees. An extended tree is built starting from the original tree, with additional weighted virtual links between each node and its indirect descendants.

No cycles may occur in a tree path: if $N_1 \to N_2 \to \dots \to N_n$ is a path resulting from the tree, all the $N_i$ should be different (property 1). This brings us to another property, which requires that each node but the root has only one parent (property 2). An XML tree is then a set of paths that fulfills properties 1 and 2.

Structure retrieval can be viewed as a comparison between the structures of two XML trees. Comparison between trees was initially introduced by the tree-to-tree correction theory (Selkow, 1977). To compare two trees, we tend to build one tree starting from the second and then compute the correction costs. The correction process considers hidden relations between nodes. For example, if node A is the parent of node B, which is the parent of node C in the tree $T$, and node A is the parent of node C in the tree $T'$, the $T$ to $T'$ correction can be built by removing node B and linking node A with node C. Each update operation has a cost, and the tree-to-tree correction cost (the edit distance) is the total cumulated cost of the necessary correction operations. The total cost $c$ of the tree-to-tree correction can be viewed as the inverse of the similarity between the two trees. Typically, if $c = 0$, no correction operation is needed to build one tree starting from the other, so the two trees are quite similar. This process can be applied to estimate the structure similarity between two XML documents, or between an XML document and an XML query.
The main motivation of this application is, first, that it allows structure comparison, which is necessary in XML retrieval, and second, that it provides a ranked list of document trees (or document subtrees), where the score is the inverse of the total cumulated cost of the document tree to query tree correction (or of the query tree to document tree correction). However, for practical reasons, this cannot be applied directly to XML retrieval: tree-to-tree correction algorithms are not scalable to XML retrieval on a very large corpus.

The structural retrieval process should look for the deepest and widest subtree shared by both representations (Selkow, 1977). Instead of applying a flexible tree-to-tree correction algorithm to the original trees, we use a dual process: an exact match algorithm on flexible representations of each tree. A flexible representation of an XML tree explores all hidden relations that could exist between nodes. We call such a representation the extended tree. This operation is done when indexing the XML corpus.

To do so, we add to each path $A \to B$ a weight reflecting the importance of the relation between nodes A and B. For the parent-child relation, this weight is equal to 1. The more distant A is from B in the original path, the lower the weight. The weight thus depends on the distance between two nodes that occur in a path. We use the weight function $f$ defined by

$$f(A \to B) = \exp(1 - d(A, B))$$

where $d(A, B)$ is the distance that separates node A from node B. We denote this path by $A \xrightarrow{w} B$, where $w = \exp(1 - d(A, B))$ is the weight of the path $A \to B$.

We apply an exact match algorithm on the extended trees (Ben Aouicha et al., 2006) (Ben Aouicha et al., 2009). As the complexity of exact match algorithms is lower than that of approximate matching algorithms, we use this kind of algorithm to compare two flexible representations. All the complexity is then moved to the representation phase, which corresponds to the indexing process.
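The construction of the extended tree described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the indexing code used in the paper: the function name `extend_tree`, the `children` input format and the use of unique node identifiers are hypothetical choices made for the example. Only the weight function $\exp(1 - d(A, B))$ comes from the text.

```python
import math


def extend_tree(children):
    """Return the weighted links (ancestor, descendant) -> weight of a tree.

    `children` maps each node identifier to the list of its child nodes
    (hypothetical input format; identifiers are assumed to be unique).
    A pair of nodes at distance d gets the weight exp(1 - d), so a direct
    parent-child link weighs exactly 1 and virtual links weigh less.
    """
    links = {}

    def visit(node, ancestors):
        # `ancestors` lists the nodes from the root down to the parent of `node`.
        for distance, ancestor in enumerate(reversed(ancestors), start=1):
            links[(ancestor, node)] = math.exp(1 - distance)
        for child in children.get(node, []):
            visit(child, ancestors + [node])

    # A root is a node that never appears as a child.
    all_children = {c for kids in children.values() for c in kids}
    for node in children:
        if node not in all_children:
            visit(node, [])
    return links


# Example: article -> sec -> p gains one virtual link, article -> p.
tree = {"article": ["sec"], "sec": ["p"]}
extended = extend_tree(tree)
```

In this example the two original links keep the weight 1, while the virtual link between `article` and `p` (distance 2) receives $\exp(-1)$. Because all links are computed once, the cost is paid at indexing time, as the text states.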
This phase is only performed once. The interrogation process, in contrast, is performed by applying an exact match algorithm to the extended tree representations, which are stored in the index. Starting from tree $T$ and tree $T'$, this algorithm looks for the deepest and widest subtree shared by $T$ and $T'$ and computes the similarity depending on the weights of the paths appearing in each one.

Relevant fragments are those having a structure similar to that of the user query. They can be found by looking in the extended document for fragments having exactly the same structure as the extended query or as a part of it. This allows flexible XML retrieval (Bordogna and Pasi, 2000).

The search strategy we adopt is iterative. We start from the extended document and query trees and build a returned fragment incrementally. Starting from potentially common roots, we build for each one its common child nodes, then for each child node we build its common child nodes, and so on until the leaf nodes. When building relevant candidate fragments, we compute the structure-based score by cumulating the products of the path weights in the document and of their matched ones in the query tree.

The final score is a linear combination of the structure-based score and the content-based score (equation 3), where a parameter emphasizes the retrieval process on either the structure constraints or the keywords. For our experiments, we set this parameter to 0.6.

4 AUTOMATIC QUERY REFORMULATION

In our approach we focus on the structure of the original query and on that of the document fragments deemed relevant to the user structure hints. This study allows us to reinforce the importance of these structures in the reformulated query in order to better identify the fragments most relevant to the user's needs. The analysis of structures allows us to identify the most relevant nodes and the relationships involved.
4.1 Line of descent matrix

According to most approaches to automatic query reformulation, the query is constructed by building a representative structure of the relevant objects and another structure for the irrelevant ones, and then building a representation close to the first and farther from the second. For example, Rocchio's method (Rocchio, 1971) represents a document set by its centroid. A linear combination of the original query and of the centroids of the relevant and irrelevant documents can be assumed to be a potentially suitable expression of the user need.

We propose to translate the documents and the query into a matrix format instead of a weighted term vector as in Rocchio's method, which is more suitable for flat documents. We build for each document a matrix called the line of descent matrix (LDM), which must show all existing ties of kinship between the different nodes. This representation should also reflect the positions of the various nodes in the fragments, as they are also important in the query reformulation. For an XML tree (or subtree) $A$, we associate the matrix $M_A$ defined by

$$M_A[n, n'] = \begin{cases} P & \text{if } n \text{ is the parent of } n' \text{ in } A \\ 0 & \text{otherwise} \end{cases}$$

where $P$ is a constant value which represents the weight of the descent relationship.

We represent the relevant fragments and the initial query in the LDM format. The value of the constant $P$ used to construct the query LDM is greater than the one used to construct the other LDMs (which represent the relevant fragments), in order to strengthen the weight of the initial query edges. This follows the principle used in Rocchio's method, which uses reformulation parameters having different effects (1 for the initial query, $\alpha$ for the centroid of the relevant documents and $\beta$ for the centroid of the non-relevant documents, where $0 \le \alpha \le 1$ and $-1 \le \beta \le 0$). Note that no complexity analysis is needed here because the number of documents judged relevant is low compared to the corpus size.
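The LDM definition above can be illustrated with a short sketch. The function name `build_ldm`, the `parent_of` input format and the values 2.0 and 1.0 chosen for $P$ are assumptions made for the example; the text only states that $P$ is larger for the query than for the relevant fragments.

```python
def build_ldm(tags, parent_of, p=1.0):
    """Build a line of descent matrix as a nested dict m[n][n2].

    `tags` is the list of node names indexing the rows and columns.
    `parent_of` maps each child name to its parent name (hypothetical format).
    m[n][n2] is p when n is the parent of n2, and 0 otherwise.
    """
    m = {n: {n2: 0.0 for n2 in tags} for n in tags}
    for child, parent in parent_of.items():
        m[parent][child] = p
    return m


tags = ["article", "sec", "p"]

# Illustrative values only: the query gets a larger constant than a fragment.
query_ldm = build_ldm(tags, {"sec": "article", "p": "sec"}, p=2.0)
fragment_ldm = build_ldm(tags, {"p": "sec"}, p=1.0)
```

Indexing every matrix by the same list of tags keeps the LDMs of the query and of the fragments directly comparable, cell by cell.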
In our experiments, we undertake the query reformulation in a pseudo-feedback way on the top 20 ranked documents resulting from the first round of retrieval. On the other hand, the total number of tags is over 160 in the whole collection, while each relevant fragment is represented separately and uses only a few of them, so the matrix size is not an issue.

A fragment can be returned even if the structural conditions of the query are not entirely fulfilled. This means that if a fragment of an XML document is similar but not identical to the query, it can be returned. The information retrieval system now has to process the query with tolerated differences (a few missing elements or some additional ones) between the query structure and the document. Consequently, we believe that the most effective way to bring this tolerance is to ensure that an element is connected not only to its child nodes, but to all of its direct and indirect descendants. A relationship between nodes in the same line of descent is weighted by their distance in the XML tree.

We propose the TC function, which is a transitive closure on the weights of the edges of nodes with a common ancestor. The resulting value is added to the weight of the edge itself in the LDM. This transitive closure is applied to the LDM of each fragment judged relevant and also to the LDM of the query.

The reformulated query structure is then built starting from the obtained LDMs. Let $F$ be the set of fragments judged relevant and $Q_0$ the initial query. The query structure is built starting from the cumulated LDM $S$:

$$S[n, n'] = \sum_{f \in F} M_f[n, n'] + M_{Q_0}[n, n']$$

4.2 Building the new query structure

The query reformulation starts by identifying its root. The root is characterized by a high number of child nodes and a negligible number of parents. For example, to find the root we simply return the element $R$ which has the greatest weight in the rows of the matrix $S$ and the lowest weight in its columns.
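The closure, accumulation and root selection steps can be sketched as below. Several points are assumptions rather than the paper's definitions: the exact TC function is not reproduced here, so the sketch gives an indirect link the product of the weights along its path, damped by a hypothetical `decay` factor per intermediate node; the root score `(row sum - column sum) / total` is likewise only one plausible reading of "greatest weight in the rows, lowest weight in the columns". The function names are illustrative.

```python
def transitive_closure(m, decay=0.5):
    """Return a copy of the nested-dict LDM `m` with indirect links weighted.

    Assumption: a link reached through an intermediate node k receives
    decay * w(i, k) * w(k, j). In a tree each pair of nodes is joined by a
    single path, so keeping the maximum is the same as adding the value to
    an entry that was initially zero.
    """
    tags = list(m)
    closed = {n: dict(row) for n, row in m.items()}
    for k in tags:
        for i in tags:
            if closed[i][k] == 0.0:
                continue
            for j in tags:
                via_k = decay * closed[i][k] * closed[k][j]
                if via_k > closed[i][j]:
                    closed[i][j] = via_k
    return closed


def cumulate(matrices):
    """Sum LDMs cell by cell; all matrices must share the same tags."""
    tags = list(matrices[0])
    return {n: {n2: sum(m[n][n2] for m in matrices) for n2 in tags} for n in tags}


def find_root(s):
    """Pick the node with a heavy row and a light column (assumed score)."""
    total = sum(sum(row.values()) for row in s.values()) or 1.0

    def score(n):
        row_sum = sum(s[n].values())
        column_sum = sum(s[other][n] for other in s)
        return (row_sum - column_sum) / total

    return max(s, key=score)


def make_ldm(tags, edges, p):
    """Small helper for the example: p on each (parent, child) edge."""
    m = {a: {b: 0.0 for b in tags} for a in tags}
    for parent, child in edges:
        m[parent][child] = p
    return m


tags = ["article", "sec", "p"]
query = transitive_closure(make_ldm(tags, [("article", "sec"), ("sec", "p")], 2.0))
fragment = transitive_closure(make_ldm(tags, [("sec", "p")], 1.0))
s = cumulate([query, fragment])
root = find_root(s)
```

With these illustrative weights, `article` has the heaviest row and an empty column, so it is selected as the root of the reformulated query.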
The root $R$ is then the node that maximizes this criterion over the matrix $S$. The argument to maximize reflects that a candidate root should have values as high as possible in its row ($\sum_{n'} S[R, n']$) and as low as possible in its column ($\sum_{n'} S[n', R]$), relative to the total sum of the matrix values.

Once the root has been established, we proceed to the development phase of the tree representing the structure of the new query. The development of the tree starts at the root $R$ by determining all the child nodes of $R$. The same operation is then performed recursively for the child nodes of $R$ until the leaf elements are reached.

5 EXPERIMENTS

Experiments have been undertaken on a dataset provided by INEX. The INEX evaluation metric is based on the exhaustivity and specificity measures, which are analogous to the traditional recall and precision measures (Piwowarski and Dupret, 2006). Specificity is the extent to which a document component is focused on the information need while being an informative unit. Exhaustivity is the extent to which the information contained in a document component satisfies the information need. Each document component is assigned one of the following values of exhaustivity (non exhaustive, slightly exhaustive, exhaustive, very exhaustive) and a specificity between 0 and 1. These measures are quantized onto a single relevance value.

First experiments were undertaken to show the effectiveness of our XML retrieval approach.
Figures 1, 2 and 3 show our results (bold curves) compared to the official INEX results obtained by all participants on the MAep (mean average effort precision) measure. We experimented with different values of the parameter used in equation 3. The value that provides the best results is 0.6, which shows that the structure-based retrieval contribution is over 40% of the whole XML retrieval process.

Figure 1: Comparative results with INEX official participants, generalized quantization function

Figure 2: Comparative results with INEX official participants, generalized lifted quantization function

Figure 3: Comparative results with INEX official participants, strict quantization function

It can be seen in figure 3 that we obtain better MAep results than all the INEX'2005 participants with the strict quantization function. Figures 1 and 2 show that our system is very competitive with the INEX'2005 participants according to the other quantization functions. On the other measures, the results are much closer to the best INEX'2005 official results. The results obtained with the strict quantization function over all measures show that our approach is able to return highly specific elements for the INEX topics, which are structure oriented. Note that the result obtained for the most structured topic over the INEX 2005 topics confirms again the effectiveness of our approach in the context of highly structured queries on the one hand, and the impact of structure on XML retrieval on the other.

Additional experiments have been undertaken to show the effectiveness of our query reformulation approach. Since INEX does not provide a training dataset, we built one manually by dividing the INEX corpus into a training dataset containing 450 documents, while the 16369 remaining documents are used as a test corpus.
To evaluate the impact of the query reformulation, we use the same XML retrieval process as described above. In Table 1 we present a comparison between the values obtained before reformulation (BR), after reformulation (AR) and the mean average of the official results of all INEX participants (AVR).

Table 1: Comparative results before and after automatic query reformulation

We can see through our experiments that our query reformulation approach significantly improves the results on the strict quantization function (up to 0.38 %), which considers an element to be relevant provided that it is both exhaustive and specific. This function is an excellent illustration of structured information retrieval. We note that during these experiments we reformulate only the structures of the queries without changing their original content. We therefore believe that this reformulation has brought an improvement that could be accentuated by reformulating the content.

6 CONCLUSIONS

We have presented in this paper an XML retrieval approach based on tree matching. The approach consists of comparing document and query representations, computing a structure score and a content score for each document node, and then combining them into a final score. Each score is computed independently of the other, and the final score depends on both of them. Computing the content and structure scores independently leads to the independence between the content and structure score distributions. This assumption is needed to estimate the score of each document node by its probability of relevance.

We also proposed a representation of the original query and of the relevant fragments in the form of a matrix. After some processing, calculations and analysis on the obtained matrix, we have been able to identify the most relevant nodes and the relationships that connect them.
The obtained results show that the automatic reformulation of the query structure contributes to the improvement of XML retrieval. The reformulation strategy is based on a matrix representation of the XML trees of the fragments deemed relevant and of the original query.

We experimented with our approach on a dataset provided by INEX. We have placed the emphasis on the CAS tasks, which represent an excellent illustration of flexible XML retrieval. This task is the most appropriate for our retrieval model, and the system was designed especially for such tasks.

Additional experiments have been undertaken to show the effectiveness of using an automatic query reformulation model. Through our experiments, we show that the structure hints are very important not only for XML retrieval but also for XML query reformulation.

REFERENCES

Balog, K., Bron, M., and de Rijke, M. (2010). Category-based query modeling for entity search. In 32nd European Conference on Information Retrieval, ECIR, pages 313-331.

Ben Aouicha, M., Tmar, M., Boughanem, M., and Abid, M. (2006). Vers un modèle de recherche d'information structurée basé sur la comparaison d'arbres. In CORIA.

Ben Aouicha, M., Tmar, M., Boughanem, M., and Abid, M. (2009). Experiments on element and document statistics for XML retrieval based on tree matching. International Journal of Computer and Information Science and Engineering (IJCISE), 3(1):7-16.

Bordogna, G. and Pasi, G. (2000). Flexible querying of structured documents. In Proc. of the Fourth International Conference on Flexible Query Answering Systems.

Fuhr, N. and Großjohann, K. (2001). XIRQL: A query language for information retrieval in XML documents. In Proc. of the ACM SIGIR Conference, New Orleans, USA.

Piwowarski, B. and Dupret, G. (2006). Evaluation in XML information retrieval: Expected precision-recall with user modelling (EPRUM). In Proc. of the 29th Annual ACM SIGIR Conference.

Rocchio, J. (1971). Relevance feedback in information retrieval. Prentice Hall Inc., Englewood Cliffs, NJ.

Schlieder, T. and Meuss, H. (2002). Querying and ranking XML documents. Journal of the American Society for Information Science and Technology, 53(6):489-503.

Selkow, S. M. (1977). The tree-to-tree editing problem. Information Processing Letters, 6(6):184-186.
