Structural relevance feedback in XML retrieval

KAMOUN FOURATI Inés, TMAR Mohamed, and BEN HAMADOU Abdelmajid

Multimedia Information systems and Advanced Computing Laboratory, High Institute of Computer Science and Multimedia, University of Sfax, Sfax, Tunisia
{ines.kamoun, mohamed.tmar, abdelmajid.benhamadou}@isimsf.rnu.tn

Abstract. Contrary to classical information retrieval systems, systems that treat structured documents include the structural dimension in the comparison between document and query. Thus, retrieving relevant results means retrieving the document fragments that match the user need rather than the whole document. The notion of structure should therefore be taken into account during the retrieval process as well as during the reformulation. In this paper we propose an approach of query reformulation based on structural relevance feedback. We start from the original query on the one hand and the fragments judged as relevant by the user on the other. The analysis of structure hints allows us to identify the nodes that match the user query and to rebuild it during the relevance feedback step. The main goal of this paper is to show the impact of structural hints in XML query optimization. Some experiments have been undertaken on a dataset provided by INEX¹ to show the effectiveness of our proposals.

Key words: relevance feedback, XML, INEX, line of descent matrix

1 Introduction

The goal of information retrieval systems (IRS) is to satisfy the informational needs of a user. This need is expressed in the form of a query to be matched against all the documents in the corpus in order to select those that could answer the user's query. Because of the ambiguity and the incompleteness of his query, the user is in most cases confronted with the problems of silence or noise. To overcome this problem, there must be alternatives to the initial query so as to improve the search. Among the most popular techniques in information retrieval (IR),
we cite relevance feedback (RF), which has held a crucial place since the first attempts in IR. It is based on the relevance judgments of the documents found by the IRS and is intended to re-express the information need from the initial query in an effort to find more relevant documents.

Due to the great importance of structured information, XML documents cover a big part not only of the web, but also of modern digital libraries, and essentially of Web services oriented software [13]. This standardization of the Web to XML schemas presents new problems and hence new needs for customized information access. Being a very powerful and often unavoidable tool for customized access to information of all kinds, information retrieval systems arise at the forefront of this issue. However, traditional IRS do not exploit this structure of documents, including in the RF phase, although a structured document is characterized by both its content and its structure. This structure possibly completes the semantics expressed by the content and becomes a constraint with which the IRS must comply in order to satisfy the user information needs. Indeed, the user can express his need by a set of keywords, as in traditional IRS,

¹ Initiative for the Evaluation of XML retrieval, an evaluation forum that aims at promoting retrieval capabilities on XML documents.
and can add structural constraints to better target the sought semantics. Thus, taking into account the structure of the documents and that of the query by the information retrieval systems handling structured documents is necessary in the feedback process.

We propose in this paper to evaluate the impact of structure handling in the query reformulation process by way of structural relevance feedback. The structure hints in the user query are taken into account first (before any content treatment) in the query reformulation process: the query structure can be subject to some modification based on the structure of the document fragments judged relevant. Thus, we put the emphasis on the structure by analyzing the structural features and relations that are the most significant in the relevance feedback process.

This paper is organized into five sections. The second section gives a survey of related work on RF in XML retrieval. The third section presents our approach in this context. In the fourth section, we present the experiments and the obtained results. The fifth section concludes.

2 Related work

Schenkel and Theobald [3] describe two approaches which focus on the incorporation of structural aspects in the feedback process. Their first approach reranks results returned by an initial, keyword-based query using structural features derived from results with known relevance. Their second approach involves expanding traditional keyword queries into content and structure queries. Official results, evaluated using the INEX 2005 [11] assessment method based on rank-freezing, show that reranking outperforms the query expansion method on this data.

Mihajlovic et al. [4] extend their database approach to include what they refer to as "structural relevance feedback". They assume that knowledge of component relevance provides "implicit structural hints" which may be used to improve performance.
Their implementation is based first on "extracting the structural relevance" of the top-ranked elements and then on restructuring the query and tuning the system based on RF information. They argue that if a component is relevant for a given topic, the document to which it belongs will contain similar information, so the document name is used to model structural relevance. Based on the structural information and assessments associated with the relevant elements, the query is rewritten and evaluated. In [5], two experiments are described. One analyzes the effects of assigning different weights to the structural information found in the top 20 elements. The second seeks to determine which of the two types of structural information is more useful in this context.

Sauvagnat et al. [6] describe their experiments in relevance feedback as follows. The "structure-oriented" approach first seeks to identify the structure shared by the largest number of relevant elements and then uses this information to modify the query. A second method, called "content-oriented", uses terms from relevant elements for feedback. A third method involves a combination of both approaches. Official results show improvement in some cases but are not consistent across query types.

Crouch et al. [7] implement relevance feedback in a conventional information retrieval environment based on the Vector Space Model. Their approach to flexible retrieval allows the system to retrieve relevant information at the element level. The paragraph is selected as the basic indexing unit, and the collection is indexed on paragraphs. A simple experiment in relevance feedback is performed as follows. The top 2 paragraphs retrieved from an initial search are examined for relevance.
A feedback query is constructed based on Rocchio's algorithm. The result of the feedback iteration is another list of rank-ordered paragraphs. Flexible retrieval is performed on this set to produce the associated elements. Again, small increases in average recall-precision were produced.

Mass and Mandelbrod [8] propose an approach that determines the types of the most informative items or components in the collection (articles, sections, and paragraphs for INEX) and creates an index for each type. The automatic query reformulation process is based on identifying the best elements from an ordered list to select the most relevant ones. The scores in the retrieved sets are normalized to enable comparison across indices and then scaled by a factor related to the score of the containing article. They use the Rocchio algorithm [9] associated with the lexical affinity.

Hanglin [10] proposes a framework for feedback-driven XML query refinement and addresses several building blocks, including reweighting of query conditions and ontology-based query expansion. He points out the issues that arise specifically in the XML context and cannot be simply addressed by straightforward use of traditional IR techniques, and presents approaches toward tackling them. He presents in [1] a demonstration that shows this approach for extracting user information needs by relevance feedback, maintaining personal ontologies, clarifying uncertainties, reweighting atomic conditions, expanding the query and automatically generating a refined query for the XML retrieval system XXL [2].

Among these approaches, only a few consider that RF on the query structure is necessary. It is common to rewrite the query based on its structure and the content of the relevant elements, without any modification of the query structure itself.
In our approach, we consider that structural RF is necessary, particularly if the XML retrieval system takes into account the structural dimension in the matching process. Since we use an XML retrieval system that matches the structure in addition to the content [14], we assume that the structure reformulation could improve the retrieval performance.

3 Structural relevance feedback: Our approach

In our approach we focus on the structure of the original query and that of the document fragments deemed relevant to the user (structure hints). An example is shown in figure 1.

Fig. 1. Example of query structure and relevant fragments (the query and Fragments 1 to 5)

Indeed, this study allows us to reinforce the importance of these structures in the reformulated query to better identify the fragments most relevant to the user's needs. The analysis of structures allows us to identify the most relevant nodes and the relationships involved. Our approach is based on two major phases. The first aims at representing the query structure and the judged relevant fragments in a single representation. The second rewrites the query structure from that representation.

3.1 Query and relevant fragment representation

According to most approaches of relevance feedback, the query construction is done by building a representative structure for relevant objects and another structure for irrelevant ones, and then building a representation close to the first. For example, Rocchio's method [9] represents a document set by its centroid. A linear combination of the original query and the centroids of the relevant and irrelevant documents can be taken as a
potentially suitable user need.

Although simplistic, Rocchio's method is the most widespread. This simplicity is due to the nature of the manipulated objects. Indeed, Rocchio's method is adapted to the case where documents are full text, a context in which each document is expressed by a vector (generally a vector of weighted terms). Where the documents embody structural relations, the vector representation becomes too simplistic: it results in a significant loss of structural contrast, and the reconstruction of a unified structure therefore becomes almost impossible.

We believe that the structure is an additional dimension. A single dimension (a one-dimension vector) is not enough to encode the structural information, so we need to encode all documents in two dimensions, using matrices rather than vectors.

That reasoning has led us to translate the documents and the query into a matrix format instead of a weighted term vector. Those matrices are enriched by values calculated from a transitive relationship function. Then, the representative structure of the query and the judged relevant fragments (which we call S) is constructed.

Line of descent matrix

We build for each document a matrix called line of descent matrix (LDM), which must show all existing ties of kinship between different nodes. This representation should also reflect the position of the various nodes in the fragments, as they are also important in the structural relevance feedback. For an XML tree (or subtree) $A$, we associate the matrix $M_A$ defined by:

$$M_A[n, n'] = \begin{cases} P & \text{if } n \text{ is the parent of } n' \\ 0 & \text{otherwise} \end{cases}$$

where $P$ is a constant value which represents the weight of the descent relationship.
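As a rough illustration of the LDM definition above, the following Python sketch builds the matrix of a small fragment as a nested dictionary. It assumes that nodes are identified by their tag names and that absent entries stand for 0. The names `build_ldm` and `weight`, the sample fragment and the value P = 1.0 are illustrative choices, not taken from the paper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_ldm(xml_text, p=1.0):
    """Return a sparse LDM as ldm[parent_tag][child_tag] = p (assumed layout)."""
    root = ET.fromstring(xml_text)
    ldm = defaultdict(dict)
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            # Direct parent-child link: weight P. Missing entries mean 0.
            ldm[node.tag][child.tag] = p
            stack.append(child)
    return {parent: dict(children) for parent, children in ldm.items()}


def weight(ldm, n, n_prime):
    """Read M[n, n'] from the sparse LDM, defaulting to 0."""
    return ldm.get(n, {}).get(n_prime, 0.0)


# Hypothetical fragment, used only to exercise the functions.
fragment = "<article><sec><p/><p/></sec><bdy><sec/></bdy></article>"
ldm = build_ldm(fragment, p=1.0)
print(weight(ldm, "article", "sec"), weight(ldm, "article", "p"))
```

At this stage only direct parent-child links carry a weight, so the link between `article` and `p` is still 0 in this sketch.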
We represent each of the relevant fragments and the initial query in the LDM form. The value of the constant $P$ used for the construction of the query LDM is greater than that used for the construction of the other LDMs (which represent the relevant fragments), in order to strengthen the weight of the initial query. We recall the principle used in Rocchio's method, which uses reformulation parameters having different effects (1 for the initial query, $\alpha$ for the centroid of the relevant documents and another parameter for the centroid of the non-relevant documents [...]).

Note that no complexity analysis is needed here because of the low number of documents judged relevant compared to the corpus size. In our experiments, we undertake the relevance feedback in a pseudo-feedback way on the top 21 ranked documents resulting from the first round of retrieval. On the other hand, the total number of tags is over 160 in the whole collection (the INEX'05 collection), so the matrix size cannot exceed [...].

Setting relationship between a node and its descendants

XML retrieval is usually done in a vague way [15]. A fragment can be returned even if the structural conditions of the query are not entirely fulfilled.
This means that if a fragment of an XML document is similar but not identical to the query, it can still be returned, with some differences (a few missing elements or some additional ones) between the query structure and the document. Consequently, we believe that the most effective way to bring this tolerance is to ensure that an element is connected not only to its child nodes, but to all of its direct and indirect descendants.

A relationship between nodes in the same line of descent is weighted by their distance in the XML tree. For example, if node A is the parent of B and the latter is the parent of C, the descent link between A and C is weighted with a value that depends on the weight of the link between A and B and of the link between B and C (the direct links are weighted by the value of $P$ in the LDM). This example is illustrated in figure 2.

Fig. 2. Relationship between a node and its descendants

The $TR$ function is a transitive relationship on the weights of the edges between nodes with a common ancestor. The resulting value is added to the weight of the edge itself in the LDM as follows:

$$\forall (n, n', n'') \in V^3: \quad M_A[n, n''] \leftarrow M_A[n, n''] + TR(M_A[n, n'], M_A[n', n''])$$

where $V$ is the set of all different nodes in the tree $A$ and $M_A$ is its LDM. $TR$ is a function of two edge weights $x$ and $y$. $TR(x, y)$ should be less than the values of $x$ and $y$ because of the transitivity (the weight of the relationship decreases). Furthermore, the $TR$ function must be increasing: the higher the edge weights are, the more important the descent link is. We use a function that meets these criteria: $TR(x, y) = $ [equation not recoverable].

This transitive relationship is applied to the LDM of each fragment judged as relevant and also to the LDM of the query. Figure 3 illustrates an XML tree and its associated LDM.

Fig. 3. Example of LDM ($P$ = [...])

Matrix S construction

The new query structure is built starting from the obtained LDMs. Let us consider the set (A_1, A_2, ...,
A_n, R), where R is the initial query and the A_i are the fragments judged relevant. The query structure is built starting from the cumulated LDMs, which form the matrix $S$ [equation not recoverable].

A node whose row in $S$ contains only low values will tend to appear as a leaf node in the reformulated query. If, on the contrary, its row contains several high values, the node will tend to appear higher in the reformulated query; if, in addition, the corresponding column contains several high values, the node will tend to appear as an internal node.

Fig. 4. Example of matrix S

3.2 Structural query rewriting

Root identification

The construction of the new query structure starts with its root. The root is characterized by a high number of child nodes and a negligible number of parents. For example, to find the root we simply return the element $R$ which has the greatest weight in the rows of the matrix $S$ and the lowest weight in its columns. The root $R$ is then the node $n$ that maximizes a score built from the sum of its row, $\sum_{n'} S[n, n']$, the sum of its column, $\sum_{n'} S[n', n]$, and the sum of all the matrix values, $\sum_{n', n''} S[n', n'']$.

The argument to maximize reflects that the candidate nodes to represent the root should have values as high as possible in their row and as low as possible in their column. We are inspired by the tf × idf factor (term frequency × inverse document frequency) commonly used in traditional information retrieval [16], which allocates importance to a term $t$ for a document $d$ proportionally to its frequency in the document $d$ (term frequency) and inversely proportionally to the number of documents in the collection where it appears at least once.

Building the new query structure

Once the root has been established from the matrix $S$, we proceed to the recursive development phase of the tree representing the structure of the new query.
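The steps described so far (enriching each LDM with transitive links, cumulating the LDMs into S, then choosing the root) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `tr` function is just one function that is smaller than both of its positive arguments and increasing in each, S is assumed to be the element-wise sum of the enriched LDMs, and the root score is an assumed tf × idf-like combination of row sum, column sum and total sum. All names and the toy data are hypothetical.

```python
import math


def tr(x, y):
    # Assumed TR: below both positive weights and increasing in each of them.
    return x * y / (x + y)


def enrich(ldm, nodes):
    """Add TR-weighted links between a node and its indirect descendants."""
    # Dense copy of the sparse LDM; missing links count as 0.
    m = {a: {b: ldm.get(a, {}).get(b, 0.0) for b in nodes} for a in nodes}
    for mid in nodes:          # intermediate node n'
        for top in nodes:      # ancestor n
            for low in nodes:  # descendant n''
                if len({top, mid, low}) < 3:
                    continue   # skip degenerate triples (self links)
                if m[top][mid] > 0 and m[mid][low] > 0:
                    m[top][low] += tr(m[top][mid], m[mid][low])
    return m


def cumulate(matrices, nodes):
    # Assumption: S is the element-wise sum of the enriched LDMs.
    return {a: {b: sum(m[a][b] for m in matrices) for b in nodes} for a in nodes}


def find_root(s, nodes):
    total = sum(s[a][b] for a in nodes for b in nodes)

    def score(n):
        row = sum(s[n][b] for b in nodes)  # weight towards descendants
        col = sum(s[a][n] for a in nodes)  # weight received from ancestors
        # Assumed tf x idf-like score; the +1 terms only avoid division by zero.
        return row * math.log((total + 1.0) / (col + 1.0))

    return max(nodes, key=score)


nodes = ["article", "sec", "p"]
query = {"article": {"sec": 2.0}}  # query LDM built with a larger P
fragments = [
    {"article": {"sec": 1.0}, "sec": {"p": 1.0}},
    {"sec": {"p": 1.0}},
]
s = cumulate([enrich(m, nodes) for m in [query] + fragments], nodes)
root = find_root(s, nodes)
print(root)
```

In this toy example the chain article, sec, p in the first fragment creates an indirect link between `article` and `p`, weaker than the direct links, and `article` has a heavy row and an empty column, which is the profile the text associates with the root.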
The development of the tree starts from the root $R$ and then detects all the child nodes of $R$; the same operation is performed recursively for the child nodes of $R$ until reaching the leaf elements. Each element $n$ is developed by attributing to it its potential child nodes $n'$ whose $S[n, n'] > Threshold_n$.

We note that $Threshold_n$ is calculated from the mean $\mu_n$ and the standard deviation $\sigma_n$ of its relative child nodes. Indeed, the mean and the standard deviation illustrate the probability that a node is an actual child node of the current node $n$. This threshold is defined from $\mu_n$, $\sigma_n$ and a parameter $\gamma$ [equation not recoverable].

If the value of $\gamma$ is relatively big, the resulting tree will tend to be shallow and ramified, and vice versa. The value of $\gamma$ allows the estimation, for each element, of the number of child nodes. The objective of this interval is to reconstruct a tree as wide and deep as the XML fragments from which the query should be inferred. This value is then defined experimentally.

We present in figure 5 the query structure obtained from the matrix $S$ of figure 4.

Fig. 5. Structural reformulated query

4 Experiments and results

Our experiments have been undertaken on the INEX'05 dataset, which contains 16819 articles taken from IEEE publications in 21 journals. The INEX metrics used for evaluating systems are based on two dimensions of relevance (exhaustivity and specificity), which are quantized into a single relevance value. We distinguish two quantization functions:

- A strict quantization, to evaluate whether a given retrieval approach is able to retrieve highly exhaustive and highly specific document components: $f_{strict}(s, e) = 1$ if $e = 2$ and $s = 1$, and $0$ otherwise.
- A generalized quantization, to evaluate document components according to their degree of relevance.

Official metrics are based on the extended cumulated gain (XCG) [12]. The XCG metrics are a family of metrics that aim to consider the dependency of XML elements within the evaluation.
The XCG metrics include the user-oriented measures of normalized extended cumulated gain (nxCG) and the system-oriented effort-precision/gain-recall measures (ep/gr). The xCG metric accumulates the relevance scores of retrieved documents along a ranked list. For a given rank $i$, the value of $nxCG[i]$ reflects the relative gain the user has accumulated up to that rank, compared to the gain he could have attained if the system had produced the optimum best ranking. For any rank, the normalized value of 1 represents the ideal performance.

The effort-precision $ep$ is defined as:

$$ep[r] = \frac{e_{ideal}}{e_{run}}$$

where $e_{ideal}$ is the rank position at which the cumulated gain of $r$ is reached by the ideal curve, and $e_{run}$ is the rank position at which the cumulated gain of $r$ is reached by the system run. A score of 1 reflects the ideal performance, where the user needs to spend the minimum necessary effort to reach a given level of gain.

In evaluation, we use the uninterpolated mean average effort-precision, denoted MAep, which is calculated as the average of the effort-precision values measured at each natural gain-recall point.

To carry out our experiments we only considered the VVCAS [11] type queries (queries whose relevance vaguely depends on the structural constraints), because the need for reformulation of the query structure is appropriate to this task. We present only the results using the generalized quantization, which is most suitable for VVCAS queries. Table 1 shows the results obtained from the search system based on tree matching [14]. This table presents a comparison between the values obtained before RF (BRF) and after RF (ARF).

Table 1. Comparative results between BRF and ARF

We can see through our experiments that our structural RF approach significantly improves the results. We note that during these experiments we reformulate only the structures of the queries, without changing their original content.

5 Conclusions and Future Work

We have proposed in this
paper an approach to structural relevance feedback in XML retrieval. We proposed a representation of the original query and relevant fragments in matrix form. After some processing and calculations on the obtained matrix, and after some analysis, we have been able to identify the most relevant nodes and the relationships that connect them.

The obtained results show that structural relevance feedback contributes to the improvement of XML retrieval. The strategy of the reformulation is based on a matrix representation of the XML trees of the fragments deemed relevant and of the original query. This representation and the transformations achieved favor the flexibility of the search.

We plan to use these results to reformulate the content of the initial query, relying on the terms in the relevant elements. The selected terms will be injected in the content of the query elements. We also plan to conduct our tests on other corpora such as that of Wikipedia.

References

1. Pan Hanglin, Theobald Anja and Schenkel Ralf: Query Refinement by Relevance Feedback in an XML Retrieval System (Demo). 23rd International Conference on Conceptual Modeling, Lecture Notes in Computer Science, Shanghai, China.
2. Schenkel Ralf, Theobald Anja and Weikum Gerhard: XXL @ INEX 2003. Proceedings of the Second INEX Workshop, Dagstuhl, Germany.
3. Schenkel Ralf and Theobald Martin: Relevance Feedback for Structural Query Expansion. Advances in XML Information Retrieval and Evaluation, 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, Lecture Notes in Computer Science, Dagstuhl Castle, Germany.
4. Vojkan Mihajlovic, Georgina Ramirez, Arjen P. de Vries, Djoerd Hiemstra, Henk Ernst Blok: TIJAH at INEX 2004: Modeling Phrases and Relevance Feedback. Proceedings of the 3rd INEX Workshop, LNCS, Springer.
5. Vojkan Mihajlovic, Georgina Ramirez, Thijs Westerveld, Djoerd Hiemstra, Henk Ernst Blok and Arjen P. de Vries: TIJAH Scratches INEX 2005: Vague Element Selection, Image Search, Overlap, and Relevance Feedback. Advances in XML Information Retrieval and Evaluation, 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, pp. 72-87, Dagstuhl Castle, Germany, November 2005.
6. Lobna Hlaoua, Mouna Torjmen, Karen Pinel-Sauvagnat, Mohand Boughanem: XFIRM at INEX 2006. Ad-Hoc, Relevance Feedback and MultiMedia Tracks. Comparative Evaluation of XML Information Retrieval Systems, 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, Dagstuhl Castle, Germany, December 2006.
7. Carolyn J. Crouch, Aniruddha Mahajan and Archana Bellamkonda: Flexible Retrieval Based on the Vector Space Model. Advances in XML Information Retrieval, Third International Workshop of the Initiative for the Evaluation of XML Retrieval, Dagstuhl Castle, Germany, December 2004.
8. Y. Mass and M. Mandelbrod: Relevance Feedback for XML Retrieval. In INEX 2004 Proceedings, Dagstuhl, Germany, 2004.
9. J. J. Rocchio: Relevance feedback in information retrieval. Prentice Hall Inc., Englewood Cliffs, NJ, 1971.
10. Pan Hanglin: Relevance Feedback in XML Retrieval. EDBT Workshops.
11. Norbert Fuhr, Mounia Lalmas, Saadia Malik and Gabriella Kazai: Advances in XML Information Retrieval and Evaluation, 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2005, Dagstuhl Castle, Germany, November 2005.
12. G. Kazai and M. Lalmas: INEX 2005 evaluation metrics. In INEX 2005 Workshop Proceedings, pp. 401-408, Germany, November 2005.
13. World Wide Web Consortium (W3C): Extensible Markup Language (XML). http://www.w3.org/TR/REC-xml
14. M. Ben Aouicha, M. Tmar, M. Boughanem and M. Abid: XML Information Retrieval Based on Tree Matching. IEEE International Conference on Engineering of Computer-Based Systems (ECBS), 2008.
15. Vojkan Mihajlovic, Djoerd Hiemstra, Henk Ernst Blok: Vague Element Selection and Query Rewriting for XML Retrieval. Proceedings of the sixth Dutch-Belgian Information Retrieval workshop.
16. G. Salton: A comparison between manual and automatic indexing methods. Journal of American Documentation, volume 20(1).
