Structural-based relevance feedback in XML retrieval

KAMOUN FOURATI Inès, TMAR Mohamed, and BEN HAMADOU Abdelmajid

Multimedia Information systems and Advanced Computing Laboratory, Higher Institute of Computer Science and Multimedia, University of Sfax, Sfax, Tunisia

Abstract. Contrary to classical information retrieval systems, the systems that treat structured documents include the structural dimension in the comparison of document and query. Thus, relevant results are the document fragments that match the user need rather than the whole document. In this case, the document and query structure should be taken into account in the retrieval process as well as during the reformulation. Query reformulation should also include the structural dimension. In this paper we propose an approach of query reformulation based on structural relevance feedback. We start from the original query on one hand and the fragments judged as relevant by the user on the other. Structure hints analysis allows us to identify nodes that match the user query and to rebuild it during the relevance feedback step. The main goal of this paper is to show the impact of structural hints in XML query optimization. Some experiments have been undertaken on a dataset provided by INEX¹ to show the effectiveness of our proposals.

Keywords: relevance feedback, XML, INEX, line of descent matrix

1 Introduction

The goal of information retrieval systems (IRS) is to satisfy the information needs of a user. This need is expressed by a query to be matched against all the documents in the corpus in order to select those that could answer the user need.
Because of the ambiguity and the incompleteness of his query, the user is, in most cases, not satisfied with the returned results. To overcome this problem, there can be alternatives to the initial query so as to improve the results. Among the most popular patterns in information retrieval [11], we cite relevance feedback (RF), which is based on the judgments of relevance of the documents found by the IRS and is intended to re-express the information need from the initial query in an effort to find more relevant documents.

But the standardization of the Web to XML schemas² presents new problems and hence new needs for customized information access. However, the traditional IRS do not exploit this structure of documents, including in the RF function. Indeed, the user can express his need by a set of keywords, as in the traditional IRS, and can add structural constraints to better target the sought semantics. Thus, taking into account the structure of the documents and that of the query by the information retrieval systems handling structured documents is necessary in the feedback process.

Many initiatives of relevance feedback have been proposed to rewrite the user query. The majority of these approaches are content-based, which means that only the query terms are updated and relatively reweighted to improve the result. Only a few approaches modify the query structure. In this paper, we propose an approach of structure-based relevance feedback. We assume that the query structure could be reformulated based on the structure of the document elements judged as relevant.

¹ INitiative for the Evaluation of XML retrieval, an evaluation forum that aims at promoting retrieval capabilities on XML documents.
This paper is organized as follows: in the second section, we give a survey of the related work on XML relevance feedback. We present in the third section our approach of query reformulation, based on structural relevance feedback. In the fourth section, we present the experiments and the obtained results. The fifth section concludes.

2 Related work

Many initiatives of XML query reformulation have been proposed. In most cases, RF approaches have been adapted in order to take into account the structural dimension. Villatoro-Tello et al. describe in [12] a system developed by the Language and Reasoning Group of UAM for the Relevance Feedback track of INEX 2012. The system focuses on the problem of ranking documents in accordance with their relevance. It is mainly based on different hypotheses, such as that current IR machines are able to retrieve relevant documents for most general queries but cannot generate a pertinent ranking, and that focused feedback could provide more and better elements for the ranking process than isolated query terms. The authors aim to demonstrate that by using relevance feedback it is possible to improve the final ranking of the retrieved documents.

Balog et al. propose a general probabilistic framework for entity search to evaluate and provide insight into the many ways of using these types of input for query modelling [1]. They focus on the use of category information and demonstrate the effectiveness of category-based expansion using example entities.

Schenkel and Theobald describe two approaches which focus on the incorporation of structural aspects in the feedback process. The first approach re-ranks results returned by an initial keyword-based query using structural features derived from results with known relevance. Their second approach involves expanding traditional keyword queries into content-and-structure queries. Official results, evaluated using the INEX 2005 [5] assessment method based on rank freezing, show that reranking outperforms the query expansion method on this data.

Among these approaches, only a few consider that RF on the query structure is necessary. It is common to rewrite the query based on its structure and the content of the relevant elements, like in [3], [9] and [15], but modification of the query structure itself is not addressed. In our approach, we consider that structural RF is necessary, particularly if the XML retrieval system takes into account the structural dimension in the matching process. Since we use an XML retrieval system that matches the structure in addition to the content, we assume that the structure reformulation could improve the retrieval performance.

² A structured document (such as an XML document) is characterized by a content and a structure. This structure possibly completes the semantics expressed by the content and becomes a constraint with which the IRS must comply in order to satisfy the user information needs.

3 Structural-based relevance feedback: our approach

In our approach we focus essentially on the structure of the original query and that of the document fragments deemed to be relevant to the user. Indeed, this study allows us to reinforce the importance of these structures in the reformulated query to better identify the most relevant fragments to the user needs. The analysis of structures allows us to identify the most relevant nodes and the involved relationships. The content of these fragments and that of the initial query are also taken into account. Their analysis allows us to select the most relevant terms that will be injected in the new query.

Our approach is based on two major phases. The first aims at representing the query and the relevant fragments in a single representative structure. The second is focused on query rewriting.

3.1 Query and relevant fragments representation

According to most approaches of relevance feedback, the query construction is done by building a representative pattern for relevant objects and another pattern for irrelevant ones, and then building a representation close to the first and far from the second. For example, Rocchio's method [8] represents a document set by its centroid. A linear combination of the original query and the centroids of the relevant and irrelevant documents can be assumed to be a potentially suitable expression of the user need.

Although simplistic, Rocchio's method is the most widespread. This simplicity is due to the nature of the manipulated objects. Indeed, Rocchio's method is adapted to the case where documents are full text; in such a case, each document is expressed by a vector (generally a vector of weighted terms). Where the documents embody structural relations, the vector representation becomes simplistic: this results in a significant loss of structural contrast, and therefore the reconstruction of a unified structure becomes impossible.

As for us, we believe that the structure is an additional dimension. A unique dimension (a one-dimension vector) is not enough to encode the structural information, thus we need to encode all documents into two dimensions, by using matrices rather than vectors. That reasoning has led us to translate the documents and the query into a matrix format instead of a weighted term vector.
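For reference, the Rocchio update discussed in this subsection can be sketched on sparse term vectors. This is only an illustrative sketch: the function names `centroid` and `rocchio`, the dictionary representation, the values chosen for `alpha` and `beta`, and the clipping of non-positive weights are assumptions made here, not details taken from the paper. The original query keeps weight 1, the relevant centroid is scaled by `alpha`, and the irrelevant centroid by a negative `beta`.

```python
from collections import defaultdict


def centroid(vectors):
    """Average a list of sparse term-weight vectors (dicts term -> weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    n = len(vectors)
    return {t: w / n for t, w in total.items()} if n else {}


def rocchio(query, relevant, irrelevant, alpha=0.5, beta=-0.5):
    """q' = q + alpha * centroid(relevant) + beta * centroid(irrelevant)."""
    new_query = defaultdict(float, query)
    for term, weight in centroid(relevant).items():
        new_query[term] += alpha * weight
    for term, weight in centroid(irrelevant).items():
        new_query[term] += beta * weight  # beta is negative
    # Assumption: terms whose weight is not positive are dropped.
    return {t: w for t, w in new_query.items() if w > 0}


# Hypothetical toy data, for illustration only.
result = rocchio(
    {"xml": 1.0},
    relevant=[{"xml": 1.0, "feedback": 1.0}, {"xml": 1.0}],
    irrelevant=[{"java": 1.0}],
)
```

The flat dictionary makes the limitation visible: nothing in this representation records which element a term came from, which is the structural information a matrix representation is meant to keep.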
Those matrices are enriched by values calculated from a transitive relationship function. Then, the representative structure of the query and the judged relevant fragments (that we call S) is constructed.

Line of descent matrix

We build for each document a matrix called line of descent matrix (LDM), which must show all existing ties of relation between different nodes. This representation should also reflect the positions of the various nodes in the fragments, as they are also important in structural relevance feedback. For an XML tree (or sub-tree) $A$, we associate the matrix defined by

$$M_A[n, n'] = \begin{cases} P & \text{if } n' \text{ is a child of } n \text{ in } A \\ 0 & \text{otherwise} \end{cases}$$

where $P$ is a constant value which represents the weight of the descent relationship, and $n$ and $n'$ are two nodes of the tree $A$.

As for us, we represent each of the relevant fragments and the initial query in the LDM form³. The value of the constant $P$ for the query LDM construction is greater than that used for the construction of the other LDMs (which represent the relevant fragments), to strengthen the weight of the initial query. This follows the principle used in Rocchio's method, which uses reformulation parameters having different effects (1 for the initial query, $\alpha$ for the relevant documents centroid and $\beta$ for the non relevant documents centroid, where $0 \le \alpha \le 1$ and $-1 \le \beta \le 0$).

Content integration in LDM

The content of each element represented in LDM must be taken into account. In RF in XML retrieval we aim to rewrite the query's structure and the set of terms of this query. So, we propose to integrate the terms of each element in LDM. Each element node $n_i$ in LDM is characterized by a tag name and a set of weighted terms:

$$n_i = (tag_i, \{(t_1, w(t_1, n_i)), (t_2, w(t_2, n_i)), \ldots, (t_k, w(t_k, n_i))\})$$

where:
$tag_i$: tag name of element $n_i$,
$t_j$: a term in $n_i$,
$w(t_j, n_i)$: weight of term $t_j$ in element $n_i$, based on its frequency in $n_i$ and the total number of elements that contain it [2].
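The LDM construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the nested `(tag, [subtrees])` tuple encoding, the function name `build_ldm`, the sparse `{(parent, child): weight}` storage (absent pairs stand for 0), the identification of nodes by tag name, and the concrete values of `p` are all hypothetical choices made for illustration.

```python
def build_ldm(tree, p=1.0):
    """Return {(parent_tag, child_tag): p} for every parent-child edge.

    `tree` is a nested (tag, [subtrees]) tuple. Pairs that are absent
    from the returned dict correspond to 0 entries of the matrix.
    """
    ldm = {}
    stack = [tree]
    while stack:
        tag, children = stack.pop()
        for child in children:
            ldm[(tag, child[0])] = p  # direct descent relationship only
            stack.append(child)
    return ldm


# Hypothetical fragment and query, for illustration only.
fragment = ("article", [("title", []), ("sec", [("p", [])])])
query = ("article", [("sec", [])])

fragment_ldm = build_ldm(fragment, p=1.0)
query_ldm = build_ldm(query, p=2.0)  # larger P gives the query more weight
```

Storing only non-zero entries keeps the sketch short; a dense array indexed by tag would work equally well for the small matrices involved.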
³ The matrix size grows with the number of elements, but complexity is not an issue here because of the low number of judged documents compared to the corpus size. In our experiments, we undertake the relevance feedback in a pseudo-feedback way on the top 20 ranked fragments resulting from the first round of retrieval. On the other hand (INEX'05), the matrix size cannot exceed about 190 for the whole collection and about 3 for a single fragment.

Let us consider that the element named A appears in three positions in an XML tree, as A[1], A[2] and A[3], each with its own set of weighted terms. A[1] and A[2] have relatively the same content, which is different from that of A[3]. In the LDM matrix, A will appear twice, with two different contents. The content similarity is computed by the inner product, as in the vector model. We assume that A[1] and A[2] are similar since they have the same tag name (A) and A[1] × A[2] = 0.926, which exceeds Th. This is not the case for A[1] × A[3] = 0.051, which is below Th, nor for A[2] × A[3] (Th is an experimental threshold; in this example, we take Th = 0.5). A[1] and A[2] will be aggregated and represented in LDM by a single node A, which is the centroid of A[1] and A[2].

Setting relationships between a node and its descendants

XML retrieval is usually done in a vague way [11]. The XML retrieval system has to answer the query with tolerated differences (a few missing elements or some additional ones) between the query structure and the document. Consequently, we believe that the most effective way to bring this tolerance is to ensure that one element is connected not only to its child nodes, but to all its descendants. A relationship between nodes in the same line of descent is weighted by their distance in the XML tree. So, we use the Transitive Relationship function $TR$, which is applied as follows:

$$\forall (n, n', n'') \in N_A^3, \quad M_A[n, n''] = M_A[n, n''] + TR(M_A[n, n'], M_A[n', n''])$$

where $N_A$ is the set of all different nodes in the tree $A$, $M_A$ is its LDM, and $TR(x, y)$ gives the weight of the derived relationship from the weights $x$ and $y$ of the two links it chains.

Matrix S construction

The new query structure is built starting from the obtained LDMs. Let us consider $F = \{f_1, f_2, \ldots, f_k\}$, where the $f_i$ are the fragments judged relevant, and $Q_{init}$ the initial query. The query structure is built starting from the cumulated LDMs, $S$:

$$S[n, n'] = M_{Q_{init}}[n, n'] + \sum_{f \in F} M_f[n, n']$$

The content of each node of $S$ is combined in the same way from the initial query and the relevant fragments. The constants are the same as those used in the line of descent matrix construction, to strengthen the weight of the old query's terms.

If a column $n$ contains several low values, then the node will tend to appear as a root node in the reformulated query. If, on the contrary, a row contains several low values, then the node will tend to be seen as a leaf node in the reformulated query if, in addition, the corresponding column contains several high values; otherwise, the node will tend to appear as an internal node. Thus, in order to build the new query structure, we first determine the new root.

3.2 Query rewriting

Root identification

The construction of the query structure starts by identifying its root. The root is characterized by a high number of child nodes and the lowest number of parents. To find the root, we simply return the element $R$ which has the greatest weight in its row of the matrix $S$ and the lowest weight in its column. The root $R$ is then such that

$$R = \arg\max_{n} \left( \sum_{n'} S[n, n'] \times \log \frac{\sum_{n', n''} S[n', n'']}{\sum_{n'} S[n', n]} \right)$$

The argument to maximize reflects that the candidate nodes to represent the root should have values as high as possible in their row ($\sum_{n'} S[n, n']$) and values as low as possible in their column ($\sum_{n'} S[n', n]$). We use the latter relative to the total sum of the matrix values, inspired by the tf × idf factor (term frequency, inverse document frequency) commonly used in traditional information retrieval [7], which assigns importance to a term for a document proportionally to its frequency in the document (term frequency) and inversely proportionally to the number of documents in the collection where it appears at least once.

Building the new query structure

Once the root has been established from the matrix $S$, we proceed to the recursive development phase of the tree representing the structure of the new query. The development of the tree starts with the root $R$, then determines all the child nodes of $R$; the same operation is performed recursively for the child nodes of $R$ until reaching the leaf nodes. Each element $n$ is developed by attributing to it its potential child nodes $n'$ ($n' \ne n$) whose $S[n, n']$ exceeds a threshold calculated from the mean $\mu_n$ and the standard deviation $\sigma_n$ of its relative child nodes. Indeed, the mean and the standard deviation illustrate the probability that a node is an actual child node of the current node $n$. This threshold is defined as

$$S[n, n'] > \mu_n - \gamma \, \sigma_n \quad \text{with} \quad \mu_n = \frac{1}{|N|} \sum_{n'} S[n, n'] \quad \text{and} \quad \sigma_n = \sqrt{\frac{1}{|N|} \sum_{n'} \left( S[n, n'] - \mu_n \right)^2}$$

where $N$ is the set of nodes of $S$. If the value of $\gamma$ is relatively high, the resulting tree will tend to be shallow and ramified, and vice versa. The value of $\gamma$ allows the estimation, for each element, of the number of child nodes. The objective of this interval is to reconstruct a tree as wide and deep as the XML fragments from which the query should be inferred. This value is then defined experimentally.

4 Experiments and results

To carry out our experiments we use the INEX'05 dataset, and we only considered the VVCAS query type (topics whose relevance vaguely depends on the structural constraints).
Indeed, the need for reformulation of the query structure is appropriate to this task. We also use the metrics proposed by INEX, which are based on the extended cumulated gain (XCG) [6]. For a given rank $i$, the value of $nxCG[i]$ reflects the relative gain the user accumulated up to that rank. We only present the results of the generalized quantization function, which is most suitable for VVCAS queries (10 queries proposed by INEX).

Table 1 shows the results obtained from XIVIR, a research system based on tree matching. This table presents a comparison between the values obtained before (BRF) and after RF (ARF). Δ is the absolute improvement of the relevance feedback run over the original base run.

Table 1. Comparative results before (BRF) and after (ARF) structural RF

In our experiments we assume that the top k fragments are relevant. Table 2 shows the results obtained from different numbers of relevant fragments.

Table 2. Results for different numbers of relevant fragments

We can see through our experiments that our RF approach significantly improves the results. We note that during these experiments we reformulate the query structures without changing their original content, and therefore we believe that this reformulation has brought an improvement that could be accentuated by the reformulation of the content.

5 Conclusions and Future Work

We have proposed in this paper our approach to structural relevance feedback in XML retrieval.
We proposed a representation of the original query and the relevant fragments in a matrix form. After some processing and calculations on the obtained matrix, and after some analysis, we have been able to identify the most relevant nodes and the relationships that connect them.

The obtained results show that structural relevance feedback contributes to the improvement of XML retrieval.
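The matrix processing summarized in this conclusion (transitive enrichment of each LDM, cumulation into S, root identification, recursive construction of the new query tree) can be sketched end to end. This is a rough sketch resting on several assumptions that are not taken from the paper: the closed form `x * y / (x + y)` used for `tr` is a stand-in for the transitive relationship function, the `+ 1.0` smoothing in `find_root` is added only to avoid dividing by an empty column, the threshold is taken as `mu - gamma * sigma` clipped at 0, nodes are identified by tag name, matrices are stored as sparse `{(parent, child): weight}` dicts, and all function names, the toy data and `gamma=0.25` are hypothetical.

```python
from itertools import product
from math import log, sqrt


def tr(x, y):
    """ASSUMED transitive-relationship function (stand-in, not the paper's)."""
    return x * y / (x + y) if x > 0 and y > 0 else 0.0


def close_ldm(ldm, nodes):
    """Link every node to all of its descendants, not only to its children."""
    m = dict(ldm)
    for mid in nodes:  # mid plays the role of the intermediate node n'
        for src, dst in product(nodes, repeat=2):
            w = tr(m.get((src, mid), 0.0), m.get((mid, dst), 0.0))
            if w > 0:
                m[(src, dst)] = m.get((src, dst), 0.0) + w
    return m


def cumulate(ldms):
    """Matrix S: entry-wise sum of the enriched LDMs (query and fragments)."""
    s = {}
    for ldm in ldms:
        for pair, w in ldm.items():
            s[pair] = s.get(pair, 0.0) + w
    return s


def find_root(s, nodes):
    """Pick the node with a heavy row and a light column (tf x idf style)."""
    total = sum(s.values())

    def score(n):
        row = sum(s.get((n, c), 0.0) for c in nodes)
        col = sum(s.get((p, n), 0.0) for p in nodes)
        # The +1.0 smoothing is an assumption, added to handle empty columns.
        return row * log((total + 1.0) / (col + 1.0))

    return max(nodes, key=score)


def build_tree(s, nodes, root, gamma=0.25):
    """Recursively attach children whose weight exceeds mu - gamma * sigma."""

    def expand(n, path):
        row = [s.get((n, c), 0.0) for c in nodes]
        mu = sum(row) / len(row)
        sigma = sqrt(sum((v - mu) ** 2 for v in row) / len(row))
        threshold = max(mu - gamma * sigma, 0.0)  # never accept zero entries
        kids = [c for c in nodes
                if c not in path and s.get((n, c), 0.0) > threshold]
        return (n, [expand(c, path | {c}) for c in kids])

    return expand(root, frozenset([root]))


# Hypothetical toy data: one query (larger P) and two identical fragments.
nodes = ["article", "p", "sec", "title"]
query = {("article", "sec"): 2.0}
frag = {("article", "sec"): 1.0, ("sec", "p"): 1.0, ("article", "title"): 1.0}

s = cumulate([close_ldm(m, nodes) for m in (query, frag, frag)])
root = find_root(s, nodes)
tree = build_tree(s, nodes, root)
```

With these toy inputs the derived `article`-to-`p` link stays below the threshold for `article`, so `p` is attached under `sec` only; a larger `gamma` lowers the threshold and would also attach it directly under the root, which is the shallow and ramified behaviour associated with high parameter values.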
