DOTTORATO DI RICERCA IN INGEGNERIA DELL’INFORMAZIONE XVIII CICLO Sede Amministrativa Universit` degli Studi di MODENA e REGGIO EMILIA a

TESI PER IL CONSEGUIMENTO DEL TITOLO DI DOTTORE DI RICERCA

Information Retrieval Techniques for Pattern Matching

Ing. Riccardo Martoglia

Relatore: Chiar.mo Prof. Paolo Tiberio

Anno Accademico 2004 - 2005

Keywords: Pattern matching Similarity searching Textual information searching XML query processing Approximate XML query answering

Contents
Acknowledgments Introduction 1 3

I

Pattern Matching for Plain Text
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7
9 11 11 13 18 20 20 21 22 25 27 30 30 31 32 32 33 33 34 35 36 39 40

1 Approximate (sub)sequence matching 1.1 Foundation of approximate matching for (sub)sequences 1.1.1 Background . . . . . . . . . . . . . . . . . . . . . 1.1.2 Approximate sub2 sequence matching . . . . . . . 1.2 Approximate matching processing . . . . . . . . . . . . . 1.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . 1.3.1 Data sets . . . . . . . . . . . . . . . . . . . . . . 1.3.2 Implementation . . . . . . . . . . . . . . . . . . . 1.3.3 Performance . . . . . . . . . . . . . . . . . . . . . 1.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . 2 Approximate matching for EBMT 2.1 Research in the EBMT field . . . . . . . . . . . . . . 2.1.1 Logical representation of examples . . . . . . 2.1.2 Similarity metrics and scoring functions . . . . 2.1.3 Efficiency and flexibility of the search process 2.1.4 Evaluation of EBMT systems . . . . . . . . . 2.1.5 Some notes about commercial systems . . . . 2.2 The suggestion search process in EXTRA . . . . . . . 2.2.1 Definition of the metric . . . . . . . . . . . . . 2.2.2 The involved processes . . . . . . . . . . . . . 2.3 Document analysis . . . . . . . . . . . . . . . . . . . 2.4 The suggestion search process . . . . . . . . . . . . . 2.4.1 Approximate whole matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

ii 2.4.2 2.4.3

CONTENTS Approximate sub2 matching . . . . . . . . . . . . . . Meeting suggestion search and ranking with translator needs . . . . . . . . . . . . . . . . . . . . . . . . . . . Experimental Evaluation . . . . . . . . . . . . . . . . . . . . 2.5.1 Implementation notes . . . . . . . . . . . . . . . . . . 2.5.2 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . 2.5.3 Effectiveness of the system . . . . . . . . . . . . . . . 2.5.4 Efficiency of the system . . . . . . . . . . . . . . . . 2.5.5 Comparison with commercial systems . . . . . . . . . detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 . . . . . . . . . . . . . . . . . . . . . 42 43 44 44 45 55 57 61 64 64 65 68 70 71 73 75 77 80 81 83 83 85

2.5

3 Approximate matching for duplicate document 3.1 Document similarity measures . . . . . . . . . . 3.1.1 Logical representation of documents . . . 3.1.2 The resemblance measure . . . . . . . . 3.1.3 Other possible indicators . . . . . . . . . 3.2 Data reduction . . . . . . . . . . . . . . . . . . 3.2.1 Filtering . . . . . . . . . . . . . . . . . . 3.2.2 Intra-document reduction . . . . . . . . 3.2.3 Inter-document reduction . . . . . . . . 3.3 Related work . . . . . . . . . . . . . . . . . . . 3.4 Experimental Evaluation . . . . . . . . . . . . . 3.4.1 The similarity computation module . . . 3.4.2 Document generator . . . . . . . . . . . 3.4.3 Document collections . . . . . . . . . . . 3.4.4 Test results . . . . . . . . . . . . . . . .

II

Pattern Matching for XML Documents
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

99
101 . 103 . 105 . 106 . 109 . 113 . 115 . 118 . 119 . 120 . 120 . 121

4 Query processing for XML databases 4.1 Tree signatures . . . . . . . . . . . . . . . . 4.1.1 The signature . . . . . . . . . . . . . 4.1.2 Twig pattern inclusion evaluation . . 4.2 A formal account of twig pattern matching . 4.2.1 Conditions on pre-orders . . . . . . . 4.2.2 Conditions on post-orders . . . . . . 4.2.3 On the computation of new answers . 4.2.4 Characterization of the delta answers 4.3 Exploiting content-based indexes . . . . . . 4.3.1 Path matching . . . . . . . . . . . . 4.3.2 Ordered twig matching . . . . . . . .

2 Personalized access to versions .3 Unordered twig matching . . . . . . . . 6 Multi-version management and personalized access to XML documents 6. . . . . . . . . . . .1 Experimental setting . . . . . . . . . . .4 4. . . .1. . . . . . 5.2 Semantic versioning and personalization support . . . .5.2 Structural disambiguation . . . . .1 Temporal XML representation and querying . . . . 4. . . . . . Experimental evaluation . . . . .5 4. . . .2 Personalized access to XML documents . . . . . . . . . . 4. . . . . . . . . . . . . . . . . . .6 4. . . . . . Unordered decomposition approach . . . 5. . . 5.2. . . . . . . . . . . .3 Related work . 6. . . 5.4 Experimental evaluation . . . . . . . . . . . . . . . . .1 Matching and rewriting . . . . . . . .1 Matching and rewriting services . . 5.7 5 Approximate query answering in heterogeneous XML collections 5. . . .3 Evaluating the impact of each condition . . . . . . . . . . . . . . . . 5. 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .CONTENTS 4. . . . . . . . . . . . . . . . . . . . . . .4 Experimental evaluation . . . . . . . . . iii 128 130 135 136 139 142 145 145 148 149 154 157 159 162 167 169 171 174 177 177 178 179 180 181 185 191 195 197 197 200 208 208 210 214 214 216 217 4. .3. . . . . . . . . . . . . . . .3. . .1.3 Related work . . .2 Free-text disambiguation . 5. . . . . . . . . . . . 4. . . 5. .2 The disambiguation algorithm . . . . . . . .4. . .3. . 6. . .2 Efficient computation of the answer set . . . . . . . . . .4 Decomposition approach performance evaluation . . 4. . . . . . . . . . . . . . . . . . . . . . 5. .2. . . . . 5.1. . . . . . .2 Providing a native support for temporal slicing 6. . . . . . 4. . .7. . . . . . . . . . . . . .3. . .1 Identification of the answer set .7.3 Structural disambiguation . . . . . . . . An overview of pattern matching algorithms . . . .4. . . . . .1 Preliminaries . . . . . .2 Structural disambiguation service . . . 6.2 Automatic query rewriting .2. . . . . . . .1 Schema matching . . . . . . . . . . . .1 The complete infrastructure .7. . . . . . . . . .7. . . . . . . . 6. . . . . .1 Overview of the approach . The XML query processor architecture . . . 5. . . . . . . . . .1 Approximate query answering . . . . . . . . . . . . . . . 6. . 6. . . . . 6. . . . .2. . . . . . . . . . . . . . .2 General performance evaluation . . 6. . . . . . .3.1 Temporal versioning and slicing support . . . . .1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4. . . .5 Future extensions towards Peer-to-Peer scenarios . . . . . . . . .5. . .3. 5.

. . . B. . . . . . . . . . . . . A. .1 Notation and common basis .2 The M ultiEditDistance algorithms . . . . . . . . .4 Proofs of of of of Chapter 1 .1 Standard version . . . . . . . . . . . . . . . . .3 Verb disambiguation .4. . . . . . B. . . . . . . . . .iv 6. . .1 Preliminary notions . . . . . 271 . . . .1 6. . . . . . . . . . . . . . . .2 Content-based index optimized version B. . . 261 . .1 Basic filter . . . . . . . . .2 Content-based index optimized version B. 248 . . . . 270 . . . . . . . 245 . . .1 Proofs C. . . . . . . . . . . . . . . . . 229 . . . . . . . . . . . . . B. . . . . . . . . . . . . . . . . B. . . . 227 . 217 Personalized access . . . . . . . . . . . 235 . . .2 CONTENTS Temporal slicing . . . . . . .4.4 Properties of the confidence functions A. . . . 266 . . . . . . 253 . . .3. . . . . . . . . . . . . . . . . . . .2 Content-based index optimized version B. . . . .1. . . . . . . 222 Conclusions and Future Directions 225 A More on EXTRA techniques A. . . . . . . . . B. . . . . . . . . . . . . . . . . . . . . . 273 273 274 275 279 B The complete XML matching algorithms B. . Chapter 4 . . A. . . . . B. . . . . 270 . . . . . . . . .4.2.1. . . . . .1 The disambiguation techniques . . . .1. . .2 Proofs C. . . . . . . . . . . . . . . . .2 Content-based filter . B. . .4. .2 Noun disambiguation . . . . . .2. . C Proofs C. . . . . . . . . . . . . 241 . . . . . . . . . . . . . . . . . . . . . . . 242 . . . . . . . . . . . Appendix B . . . . . . . . . . . B.4 Unordered twig pattern matching . . . . . . . . . A. .1. . . 227 . .5 Sequential scan range filters . . . . 227 . . . . 242 . . .5.5.3 Ordered twig pattern matching . . . . . . . . .3. 248 . . . . . . . . .3 Proofs C. . . . . . 233 and optimization 235 . . . . . .1 Standard version . 261 . A. . .1 Standard version . . . . . . . . . . . . . . . . B. . . . . . . . . .2 Path matching . . . . Chapter 3 . 241 . . . .

. . . . . . . . . . . . . . . . . . . Running time tests .2 3. . Web of duplicate documents . Scalability tests for Collection 2 . . . . . . . . . . . . . . . . .1 3. . . .9 2.7 2. Trends of total translation time in simulation . . .12 3. . . . . . . . . . Coverage of the collections as outcome of the search process . .5 3. . . . . . .5 2. . . . . . . .3 3. . . . . . . . Effectiveness of the resemblance measure for Times100L . . . . . . . . . .11 3. . . The syntactic and semantic analysis and the actual WSD phase The role of the DBMS in pretranslation . . . . .8 3. . . . . . .3 2. . . . . . Examples of full and partial matches . . . . . Percentages of disambiguation success . .2 2. . . . . . . . . . . . . . . . The suggestion search and document analysis processes . . . . . . . . . .2 1. . . . . . . . . . . .7 3. . . . Schema of a generic translation model . .4 1.4 3.8 2. . . . . . .5 2. . . . . . . . . .13 sub2 Position Filter . . . . . . .4 2. . . . Further efficiency tests . . . . Running time tests . . . . . .6 2. . . . . . . . . . . . .11 3. . . . .9 3. Representation of a mapping between chunks . . . . . . . . . . . .10 3. . . . . . . . . . . . . . . . . . . Efficiency tests results for intra-doc data reduction . Document bubble experiments for Times100S and Times500s . . . . . . . . . . Chunk size tests results for Cite100M . Query for approximate sub2 sequence match filtering Filtering tests . . . . . . . . . . . .10 2. . . . . . . . . . . . . . . . . 15 19 22 23 24 28 35 38 44 45 46 49 51 54 56 57 65 77 80 86 87 88 89 89 90 92 95 96 97 Processes in an EBMT system . Further efficiency tests results .6 3. . . . Distance between document bubbles . The DANCER Architecture . . . . . . . . . . . . . . . . . . . . Results of the runtime tests with no data reduction . . . . . . . . . . . . . . .1 2. . . . . . Chunk size and length rank tests results in CiteR25 . . . . . . . . . . . . . . . . . . . . . Sampling and length-rank selection experiments for Times100L Chunk clustering experiments for Times100L . . . . . . . .List of Figures 1. . . . . . .3 1. . . .1 1. .

189 . . . . . . .16 4. . . . . . . . . . . . . .16 5. . .16: an example . . . . . .9 5. . .8 5. 163 Example of two related schemas and of the expected matches . . . . 136 Structural join of Example 4. . . . .17 4. .15 4. 111 Representation of the pre-order conditions in the domain space 113 Representation of the post-order conditions in the domain space117 Target list management example . . . . . . . . . . . .2 4. . . .13 5. . . . . . . . . . 104 Properties of the pre-order and post-order ranks. . . 170 The STRIDER graph disambiguation service. . . . . . 143 Structure and content of a Datastore . . . . . . . 138 Structural join of Example 4. 129 Pattern matching algorithms . 182 Results of schema matching between DBLP and SIGMOD . . . .12 4. . . . . . . . . . . . . . 175 The ContextCorr() function . . . . . . . . . . . 122 Ordered Twig Examples . . . . . 184 Mean precision levels for the three groups . . . . . . .18 4.vi 4.13 4. 110 Behavior of the domains during the scanning . . . . . . .11 4. . . . . . . 139 The unordered tree pattern evaluation algorithm . . . . . . 165 RDF and corresponding PCG for portions of Schemas A and B 166 Examples of query rewriting between Schema A and Schema B 168 A portion of the eBay categories. . . . . . . . . . . 175 The T ermCorr() function . .17 LIST OF FIGURES Pre-order and post-order sequences of a tree . . . . . . .6 5.11 5.8 4. . .1 4. . . . . 132 Examples for decomposition approach . . . . . . . . . . . . .11 . . . .7 4. . . . .14 5. . . . . . . .2 5. . . . . . . .7 5. . . 152 Comparison between time in different settings . .19 4. . . . . . . .3 5. . . . . . . .6 4.4 4.5 5.23 5. . . . . 153 The role of schema matching and query rewriting . . . . . . . . . . .5 4. . . . . . . . . . . . . .15 5. . . 140 Evaluation of paths in Algorithm of Figure 4. . . . . . . . 141 Abstract Architecture of XSiter . . . . 187 Typical context selection behavior for Group1 (Yahoo tree) . . . . 162 Example of structural schema expansion (Schema A) . . . . . . . . . . . .22 4. . .12 5. . . . . . . . .9 . . . . . . . . . .10 5. 108 An example of a data tree and pattern matching results . . . . . . . . . . . . . . . . .1 5. . . .3 4. .14 4. . . . . . 171 The disambiguation algorithm . . . . . .4 5. . . . . . . . .21 4. . . 125 Unordered Twig Examples . . . . 177 A small selection of the matching results before filtering . . . 161 The XML S3 MART matching and rewriting services. . . . . .20 4. .10 4. 144 The queries used in the main tests . . . . . . 105 Sample query tree Q . . . . . . . . 147 The query templates used in the decomposition approach tests 147 Comparison for mean domains size in different settings . . . . . . .9 4. . . . . . . . . . . . . . . . . . . . . 160 A given query is rewritten in order to be compliant . . . . 124 Ordered Twig Examples . . . 188 Typical context selection behavior for Group2 (IMDb tree) . . .

. . . . . . . . . . . . .1 6. . . . . . Content index optimized unordered twig matching algorithm .14 B. . .16 B. . . . . . . . . . . . The Complete Infrastructure . . . . . . . . . . Unordered twig matching solution construction (part 1) . . . . . . . . .2 B. . . . .5 B. . . . . . . . . . . . . . . . . . . . . Comparison between the two collections C-R and C-S. . . .8 6. . . . Ordered twig matching auxiliary functions . . .11 B. Content index optimized ordered twig matching algorithm auxiliary functions . . . . . . . Ordered twig matching solution construction .4 6. . . . . . Unordered twig matching algorithm . . . . . Example of WordNet hypernym hierarchy . . . . .6 B. . . . . . . . . .12 B. . . . . . . . . . State of levels L1 and L2 during the first iteration . . . . . Comparison between TS1 and TS2. . . . . Content index optimized unordered twig matching algorithm auxiliary functions . . . . . . . . . . . . . . . . . .LIST OF FIGURES vii 5. . . . . . .5 B. . . . .2 6. . An example of civic ontology . . . . . . Example of approximate sub2 matches with inclusion filter . Scalability results for TS1. . . . . . . . Ordered twig matching algorithm . . . . . . . . . . . . . . . . . . . .4 A. .15 B. . . .6 6. . Unordered Target list management functions . . .3 6. . . . . . . .13 B. . . . . . . . .3 A. . . . .9 6. .12 A. . . .18 Confidence delta values for OLMS . Approximate sub2 matching with inclusion filtering . Unordered twig matching auxiliary functions . .7 6. . . . . . . .17 Reference example. .11 6. . . Skeleton of the holistic twig join algorithms (HTJ algorithms) The buffer loading algorithm Load . . . . . . Additional execution time comparisons. . . . . . . Ordered Target list management functions (part 1) . . . . . . . . . . . . .7 B. . . . . . . . . . . . The temporal inverted indices for the reference example .2 A. . . . . . . Content index optimized ordered twig matching algorithm . . .1 A.8 B. . . . . . . . . . . . . . . . . . . . . MultiEditDistance between subsequences in σ q and σ . . . . . . . . Example of the gaussian decay function D . Content index optimized path matching algorithm . . Path matching algorithm . . . . . . . . .4 B. . . . Unordered twig matching solution construction (part 2) .9 B. . . . . 198 201 202 203 205 206 209 212 219 220 221 222 228 231 236 237 238 242 244 247 250 251 252 256 257 259 260 262 263 264 265 268 269 270 How domains and pointers are implemented . . . . . . . . . .1 B. .5 6. . . The basic holistic twig join four level architecture .3 B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10 6. . . .10 B. . Ordered Target list management functions (part 2) . . 191 6.

.

. . . 186 Delta values of the selected senses . . . . . .1 3. .4 3. . . . . . . . . . . . . . . . . . . . .3 4. .6 Path matching functions . . . . . . . . . . . . .1 5. . . .1 6.4 4. . . . . . . . . . . . . .4 B. . . . . . The XML data collections used for experimental evaluation DBLP Test-Collection Statistics . . . . . . Path matching functions for content-based search . Correlation discovery comparison between Citeseer and Dancer Results of violation detection tests . . . . . . . . . . Ordered twig matching functions .2 2. . . . . . .List of Tables 2. . .5 4. .2 B. Behavior of the isCleanable() and isNeeded() functions . . . . . Summary of the discussed cases . . . . . . Performance comparison for unordered tree inclusion . Pre-translation comparison test results for Collection1 . . . . . . . .1 2. . . . . . . .2 3. . 190 Evaluation of the computation scenarios with TS1. . . . . . . . . . . Specifications of the real collections used in the tests . .6 5. . . .2 Examples of improvements offered by stemming and WSD Description of the simulation input parameters . .3 3. . .4 3. . . . . . . . .3 B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Affinity and noise values for the CiteR25 collection . . . . 229 B. . . Target list management functions . Pattern matching results . . . . . . Unordered twig matching functions . . . . . 218 Features of the test queries and query execution time . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 2. . . . Features of the tested trees . .1 Symbols and meanings . . . . . . . . . 243 246 249 254 255 261 . . 48 52 53 59 85 85 90 91 93 145 146 148 150 151 154 Specifications of the synthetic collections used in the tests . . Results of the simulation runs . . . . . . 223 A. . .1 B.1 4. . . . . . . . . . . . . . . . . Ordered twig matching functions for content-based search . . . .5 4.2 4. . . . . .2 6.5 B. . . .

7 Target list management functions . 267 . . . .x LIST OF TABLES B. . . . . . . . . . . . .

D. wonderful pleasantness and high-quality work that distinguish you. it is always incredible for me to see your “multi-tasking” brain getting all those brilliant ideas. to all my friends at LabInfo and “lunch mates”. experience. this thesis would not have been as it is without your precious collaboration! Special thanks to Sonia Bergamaschi and Domenico Beneventano. and thank you Simona.. which have always been friendly and helpful to me in many occasions. to thank you properly. the people that made all this possible: My supervisor Paolo Tiberio. an ideal co-author.Acknowledgments I would like to begin this thesis with some words of gratitude to all people who helped me and were close to me during my Ph. you have been (and are) an amazing supervisor. and Federica Mandreoli. and a good friend. working with you is always a truly inspiring experience.. Indeed. Many thanks also to Pavel Zezula and Fabio Grandi. Francesco. for all the enjoyable moments we have spent and the remarkable results we achieved together. Federica. such as driving your car back to Bologna or taking care of your sons. even when you are away from university doing something else.. Thank you all. Last but not least. a colleague working with whom is more pleasure than just “work”.. First of all. Robby. I would like to thank all my colleagues and coauthors (and friends) at ISGroup: Thank you Enrico and Mattia. 2006 Riccardo Martoglia . Maurizio.. an always stimulating and incredibly kind person who gave me the possibility of working on the topics that interested me in the Information Systems research group.. the cooperation with you has just started but you instantly left your mark in our group (and in me) with the exceptional enthusiasm. Thank you for showing me the beauty of research and for all the interesting discussions we had. Luca. what could I say in few words? Ever since my master thesis. I would have to write several pages. to my parents: A huge “Thank You” for the constant support you provided in all these years! January 25.

.

in particular XML documents. deeply discussed in Part II. Efficiency in retrieving the most similar parts available in . one of the available textual data). such as genetic sequences or plain natural language sentences. Sequences. Depending on the particular application scenario and on the type of information that has to be managed and searched. In Chapter 1 we present a purely syntactic approach for searching similarities within sequences. can be considered the backbone of textual data. created the right conditions for the global diffusion of the Web and. The underlying similarity measure is exploitable for any language since it is based on the similarity between sequences of terms such that the parts most close to a given one are those which maintain most of the original form and contents. doubling every six months) and at its richness and flexibility of use. along with the constant drop of Internet access costs and data management and storing. it is necessary to rely on pattern matching techniques. and semi-structured data. meant as logic units of meaningful term successions. Information Retrieval (IR) techniques must be able to answer every User Information Need both effectively and efficiently. to which Part I is dedicated. including education. in particular. for them to create added value in all areas of Internet/Information economy.e. of new research tools able to analyze information and their contents. on approximate (similarity) matching ones. The work presented in this thesis mainly deals with two types of information: Plain text. research and engineering. In order to exploit the full potentialities of textual sequence repositories. Text information is the main form of electronic information representation: It is sufficient to think at the quantity of text available on the web (more than 8 billions of indexed pages.e. However. different techniques need to be devised. developing new algorithms and data structures that go beyond exact match and are able to find out how much a given text (i. and. a query) is similar to another (i. The recent developments in computing power and telecommunications. guiding the user through the sea of information and not confusing him with information overload.Introduction Information is the main value of Information Society. more generally.

respectively. specifically.4 Introduction the sequence repository is ensured by exploiting specifically devised filtering and data reduction techniques. we define a complete set of conditions under which data nodes accessed at a given step are no longer necessary for the subsequent generation of matching solutions and show how such conditions can be used to write pattern matching algorithms that are correct and which. In order to fully exploit the data available in such document repositories. such as the one presented. providing effective and efficient search among large numbers of “related” XML documents is a particularly challenging task. which also incorporates. are necessary and should be exploited in every next-generation structural search engine. Even more than for text. The idea behind evaluating tree pattern queries. nowadays XML has quickly become the de facto standard for data exchange and for heterogenous data representation over the Internet. have been designed. developed and successfully tested. ad-hoc indexing structures and storing facilities. robust query processing techniques over data that conform to the labelled-tree data model are needed. cannot be further improved. from a numbering scheme point of view. by taking advantage of pre/post order coding scheme and of the sequential nature of a compact representation of the data. efficient evaluation techniques for all the main types of tree pattern matching are presented. Such repositories often collect documents coming from different sources. an entire ensemble of systems and services is needed to help users to easily find and access the information they are looking for. The proposed algorithms have formed the backbone of a complete and extensible XML query processor. in a flexible architecture. heterogeneous for what concerns the structures adopted for their representations but related for the contents they deal with. named XSiter. In Chapter 5. named EXTRA and DANCER. XML data is becoming more and more popular. This is also due to the recent emergence of wrappers for the translation of web documents into XML format. however they are not sufficient alone to fully answer the user needs in the most advanced scenarios. semi-structured and. First of all. from Example-Based Machine Translation (to which Chapter 2 is dedicated) to syntactical document similarity search and independent sentence repositories correlation (analyzed in Chapter 3). querying and accessing in an effective and efficient way semi-structured information requires a lot of effort in several synergic areas. is to find all existing ways of embedding the pattern in the data. In particular. which allow the approximation of . In Chapter 4. The proposed techniques have been specialized and applied to a number of application contexts. we propose novel approximate query answering techniques. Efficient exact structural search techniques. Besides textual information. sometimes called the twig queries. Indeed. For both these areas complete working systems.

Introduction 5 the user queries with respect to the different documents available in a collection. which includes the STRIDER disambiguation component. with the extracted information. Finally. i. web directories but also ontologies. in a completely general setting. banking. addressing the question of how to construct a complete XML query processor supporting temporal querying. the structural components of the query (query rewriting) are interpreted and adapted. the possibility to deal with historical information is essential to many computer applications. the presented approach is completely generic and versatile and can be used to make explicit the meaning of a wide range of structure based information. we also consider the versioning aspect of XML management and deal with the problem of managing and querying timevarying multi-version XML documents. we firstly propose a native solution to the temporal slicing problem. law. medical records and customer relationship management. In this context. we propose a further service for automatic structural disambiguation which can prove valuable in enhancing the effectiveness of the matching (and rewriting) techniques. in Chapter 6. The central issue of supporting temporal versioning. All the discussed services have been implemented in our XML S3 MART system. such as accounting. is time-slicing the input data while retaining period timestamping. then. Then. Indeed. we focus on one of the most interesting scenarios in which such techniques can be successfully exploited. To this end. Standard XML query engines are not aware of the temporal semantics and thus it makes more difficult to map temporal XML queries into efficient “vanilla” queries and to apply query optimization and indexing techniques particularly suited for temporal XML documents. most temporal queries in any language.e. Further. as data changes over time. we propose additional techniques in order to support a personalized access to them. like XML schemas. A time-varying XML document records a version history and temporal slicing makes the different states of the document available to the application needs. we present how the slicing technology can be adapted and exploited in a complete normative system in order to provide efficient access to temporal XML norm texts repositories. Indeed. Such techniques first exploit a reworking of the documents’ schemas (schema matching). . the eGovernment one. the structures of XML documents. In the light of these facts. The key to a good effectiveness is to exploit the right meaning of the terminology employed in the schemas.

.

Part I Pattern Matching for Plain Text .

.

Such an approach is often based on information retrieval techniques. where the terms are genetic symbols. such as the analysis of term frequencies. and has been widely adopted for text segmentation. genetic sequences. Consider. Sequences. can be considered the backbone of textual data. or plain natural language sentences. With the advent of databases. strictly connected to the application they serve. categorization and summarization [7]. storing large amounts of textual data has become an effortless and widespread task. exploiting the full potentiality of unstructured repositories and thus understanding the utility of the information they contain is a much more complex task. The latter approach disregards the semantical content and focuses on the structure of the sequence thus enabling the location of similar word sequences. meant as logic units of meaningful term successions. Searching in sequence repositories often requires to go beyond exact matching to determine the sequences which are similar or close to a given query sentence (approximate matching). formed by words. thus allowing information extraction and manipulation. To name just few examples of sequence use. We will now briefly introduce the two main applications that motivated us in devising the new generic approximate sequence matching techniques we will present in this . The former considers the meaning of the terms in the sequences. for instance. The similarity involved in this process can be based either on the semantics of the sequence or just on its syntax. Many applications may benefit from such a facility. consider the adoption of sentences for the description of the real world modelled in the database and their role in composing documents.Chapter 1 Approximate (sub)sequence matching Textual data is the main electronic form of knowledge representation. On the other hand.

one of the most promising paradigms for multilingual document translation. Besides EBMT. Although complex. correlation between the data should be based on approximate joins which also takes flexibility in specifying sentence attributes into account. based on a purely syntactic approach for searching similarities within sequences [90]. When a document is to be compared against the stored documents. The similarity matching we refer to attempts to match any parts of data sequences against any query parts.10 Approximate (sub)sequence matching chapter. Large corpora of bilingual text are maintained in a database. documents are usually broken up into more primitive units such as sentences and inserted into a database. in most cases it is rather difficult for a translation memory to store the exact translation of a given sentence. Thus. such as syntactical document similarity search and independent sentence repositories correlation. This type of document similarity can be exploited both for copy detection [122] and for similar document retrieval services. An EBMT system translates by analogy: it is given a set of sentences in the source language (from which one is translating) and their corresponding translations in the target language. Chapters 2 and 3 will then provide a much more detailed analysis of how such techniques can be exploited in real applications. we are not aware of works related to finding syntactic similarities between sequences. For this purpose. Due to the high complexity and extent of languages. The underlying similarity measure is exploitable for any language since it is based on the . In this chapter. known as translation memory. as the one offered by the digital library CiteSeer [80]. The correlation of independent sentence repositories is a prerequisite of applications such as warehousing and mining which analyse textual data. In this context. only the documents that overlap at the unit level will be considered. Even if some works in literature address the problem of similarity search in the context of information extraction. similar source-language sentences into the target language. We argue that the kind of similarity matching useful for most applications should go beyond the search for whole sequences. there are many other motivating applications. an EBMT system is proved to be a useful translator assistant only if most of the suggestions provided are based on similarity searching rather than exact matching. this kind of search enables the detection of similarities that could otherwise be unidentified. we are not aware of solutions fitting into a DBMS context. which represents the most common choice adopted by the above cited applications for managing their large amount of textual data. The first scenario is the one of EBMT (Example-Based Machine Translation). Syntactical document similarity search involves the comparison of a query document against the data documents so that some relevance between the document and the query is obtained. In particular. we propose this kind of solution. and uses those examples to translate other.

Finally. thus ensuring efficient processing. In Section 1. we characterize the problem of approximate matching between sequences as a problem of searching for similar whole sequences or parts of them.1. can be reused. mainly focusing on sequences of characters. like the query optimizer.1 we briefly review the notion of sequence matching from the literature.1 introduces the foundation.1. Section 1. Applying an approximate sub2 sequence matching algorithm to a given query sequence and a collection of data sequences is extremely time consuming. i. Filtering is based on the fact that it may be much easier to state that two sequences do not match than to tell that they match.2 we show how sequence similarity search can be mapped into SQL expressions and optimized by conventional optimizers.1. efficiently ensuring no false dismissals and few false positives. 1. Then.3 we assess and evaluate the results of the conducted experiments and. 1.1 Foundation of approximate matching for (sub)sequences The problem of searching similarities between sequences is addressed by introducing a syntactic approach which analyzes the sequence contents in order to find similar parts. in Section 1.1 Foundation of approximate matching for (sub)sequences 11 similarity between sequences of terms such that the parts most close to a given one are those which maintain most of the original form and contents. In Section 1. As far as the matching processing is concerned.4. The immediate practical benefit of our techniques is twofold: approximate sub2 sequence matching can be widely deployed without changes to the underlying database and existing facilities.1.e. in Section 1. the similarity measure and filters. we chose a solution that would require minimal changes to existing databases. in Section 1.2 we introduce the notion of approximate sub2 sequence matching. we discuss related works on approximate matching. In particular. We introduce two new filters for the approximate sub2 sequence matching which quickly discard sequences that cannot match. Efficiency in retrieving the most similar parts available in the sequence repository is ensured by exploiting filtering techniques. of our approximate sub2 sequence matching. .1 Background The problem of sequence matching has been extensively studied in the literature as sequences constitute a large portion of data stored in computers.

find those data sequences that contain matching subsequences. of the involved sequences. Therefore. Filtering is based on the fact that it may be much easier to state that two sequences do not match than to state that they match. σ2 )) is the minimum number of edit operations (i. ˆ σ[i . deletions. that is subsequences that are within a given distance threshold from the query sequence. Given a sequence σ. position filtering. approximate matching based on a specified distance function can be classified in two categories [55]: ˆ Whole matching. i. ˆ σ[i] is i-th element of the sequence.. Since . a string. and substitutions) of single elements needed to transform the first sequence into the second. 126. In particular. 127.e.1 (Edit Distance between sequences) Let σ1 and σ2 be two sequences. we opt to exploit the analogy between a sequence of terms. and a sequence of characters. dealing with sequences of characters implies adopting the edit distance notion to capture the concept of approximate equality between strings [103]. denoted as q-grams [134. insertions. find those data sequences that are within a given distance threshold from the query sequence. its positional q-grams are obtained by “sliding” a window of length q over the elements of σ. In most cases. Given a query sequence and a collection of data sequences. One of the techniques for computing approximate string matching efficiently relies on filters. Given a query sequence and a collection of data sequences.12 Approximate (sub)sequence matching In this context. As far as the underlying distance function is involved. ˆ Subsequence matching. The edit distance between σ1 and σ2 (ed(σ1 . three well known filtering techniques are widely used for approximate string matching: count filtering. filters quickly discard sequences that cannot match. . In the following. and thus we rely on the existing distance function between strings. efficiently ensuring no false dismissals and few false positives. . j] is the subsequence of length j − i + 1 starting at position i and ending at position j. we introduce the notion of edit distance between sequences of elements by adopting the notation below: given a sequence σ of elements. They rely on matching short parts of length q. Definition 1. then: ˆ |σ| is the length of the sequence. and length filtering.e. 61]. such as a sentence.

j2 ]) denotes the edit distance between the two parts σ1 [i1 . In particular. . σ2 [i2 . consider a sentence example: the part “the dog eats the cat” is more similar to “the dog eats the mouse” than “the cat eats the dog”. . . σ2 [i2 . 1. . We adopt the edit distance defined in Def.3 (Length Filtering) If sequences σ1 and σ2 are within an edit distance of d. j2 ]) will be denoted as ed(σ1 . . Definition 1. ed(σ1 [i1 .Approximate sub2 sequence matching 13 q-grams at the beginning and the end of the sequence can have fewer than q terms from σ. ed(σ1 [i1 . then the cardinality of Gσ1 ∩ Gσ2 . Thus. . j2 = |σ2 |. j1 ]. .2 (Positional q-gram) A positional q-gram of a sequence σ is a pair (i. . 127]. Proposition 1. . . i + q − 1] is the q-gram of σ that starts at position i. j1 ].1 as similarity measure between (parts of) sequences. the kinds of approximate matches between sequences considered are based on one part of one sequence being similar to another. [i . i + q − 1]). new terms “#” and “$” not in the term grammar are introduced. . given two sequences as sequences of terms σ1 and σ2 . If σ1 and σ2 are within an edit distance of d. and the sequence σ is conceptually extended by prefixing it with q − 1 occurrences of “#” and suffixing it with q − 1 occurrences of “$”. . their lengths cannot differ by more than d. . if i1 = 1. This approach enables us to take the position of terms in the sequence into account: we consider two parts as much similar as they maintain the same order of the same terms. j2 ]. i2 = 1. . counting on the extended sequence. where [i .2 Approximate sub2 sequence matching In this section we introduce the concept of similarity searches within sequences. ignoring positional information. The filtering techniques basically take the total number of q-gram matches and the position of individual q-gram match into account: Proposition 1. Proposition 1. then a positional q-gram in one cannot correspond to a positional q-gram in the other that differs from it by more than d positions. σ2 ) and represents the edit distance between the two whole sequences of terms. . j1 ] and σ2 [i2 .2 (Position Filtering) If sequences σ1 and σ2 are within an edit distance of d. 1.1 (Count Filtering) Consider sequences σ1 and σ2 . In particular. Proof and explanations of the above filters can be found in [126.1. must be at least max(|σ1 |. The set Gσ of all positional q-grams of a sequence σ is the set of all the |σ| + q − 1 pairs constructed from all q-grams of σ. . j1 = |σ1 |. |σ2 |)−1−(d−1)∗q. For instance. .

(j2 − i2 + 1) ≥ minL and ed(σ1 [i1 . σ2 [i2 . j2 ]) such that σ1 ∈ Q. .3 (Approximate sub2 sequence matching) Given a collection of query sequences Q and a collection of data sequences D not necessarily distinct. In particular. in the following we propose two new filters. j1 ]. Proposition 1. for approximate sub2 sequence matches we consider q-grams from sequences not extended with “#” and “$”. . For instance. . sub2 Count filtering The concept underlying the count filter is still applicable since the two sequences must obviously have a certain minimum number of q-grams. j2 ]) ≤ d. (j2 −i2 +1) ≥ minL and ed(σ1 [i1 . whereas subsequence matching applies with only one of them as a whole.3. a distance threshold d and a minimum length minL. only that of the candidate answers will be further analyzed. σ2 [i2 . . find all pairs of sequences (σ1 [i1 .4 If two sequences σ1 . . . . we reexamine the approaches introduced for approximate whole matching. j2 ]) ≤ d then the cardinality of Gσ1 ∩ Gσ2 must be at least minL + 1 − (d + 1) ∗ q. . the number of common q-grams must be determined with respect to some parts of the sequences. Thus. Such filtering techniques should operate on whole sequence pairs and efficiently hypothesize a small set of them as matching candidates. . σ2 have a pair of sequences (σ1 [i1 . . σ2 ∈ D. Applying an approximate sub2 sequence matching algorithm to a given query sequence and a collection of data sequences is extremely time consuming. Notice that whole matching applies when both sequences σ1 and σ2 are considered as a whole. Definition 1. . . For this purpose. j1 ]. As far as the properties underlying count filtering and position filtering are considered. j2 ]) such that (j1 − i1 + 1) ≥ minL. j1 ]. Since our problem has less bonds than the one studied in literature. The main challenge is thus to find filtering techniques suitable for the problem introduced in Def. (j1 −i1 +1) ≥ minL. most of the properties that should be satisfied by the matching sequence pairs and on which filters are based are no longer true. σ2 [i2 . . As to sequence content. j1 ]. 1. . σ2 [i2 . . Unfortunately it is almost impractical to extend all the possible parts of all the sequences for q-gram computation. length filtering is clearly not applicable in this case. . in order to contain the one a part suggestion for the other. .14 Approximate (sub)sequence matching The operation of approximate matching that we introduce in the following definition extends the notion of subsequence/whole matching in order to locate (parts of) sequences that match (parts of) query sequences.

1 Let minL = 4. f o r ( p1 = 1 . // c o u n t t h r e s h o l d i n t [ ] S1 c . | S1 | ) // i n n e r s e n t e n c e c y c l e { i f ( S1 [ p1 ] = S2 [ p2−w ] ) { f o r ( i = 0 . . q = 1. } } } return f a l s e . 15 // c o u n t e r s // p o s i t i o n s i n s e n t e n c e s // i n c r e m e n t / decrement l i m i t a t i o n c h e c k f o r ( p2 = 1 . | S2 | ) // o u t e r s e n t e n c e c y c l e { i f ( p2−w > 0 ) { S1 lim . it does not take advantage of position and order information. i n t d ) { i n t w ← minL . . // window s i z e i n t c ← minL−d . i n t minL . S1 l i m [ p1+i ] ← true . . p2 . . words) threshold for sub2 Count filtering is 3. i n t p1 . } } i f ( S1 c [ p1 ] >=c ) return true .Approximate sub2 sequence matching sub2 P osF ilter ( S t r i n g S1 . d = 1. boolean S1 l i m [ ] . S1 l i m [ p1+i ] ← true . f o r ( p1 = 1 . w−1) // i n c r e m e n t c y c l e { i f ( ! S1 l i m [ p1+i ] ) {S1 c [ p1+i ]++. Example 1. } } } } } S1 l i m ← f a l s e . the minimum number of common q-grams (in this case. . } Figure 1. | S1 | ) // i n n e r s e n t e n c e c y c l e { i f ( S1 [ p1 ] = S2 [ p2 ] ) { f o r ( i = 0 . S t r i n g S2 . ← f a l s e . Then. w−1) // decrement c y c l e { i f ( ! S1 l i m [ p1+i ] ) {S1 c [ p1+i ]−−.1: sub2 Position Filter sub2 Position filtering While sub2 Count filtering is effective in improving the efficiency of approximate sub2 sequence matching. . . .

e. the filter stops and returns TRUE. preventing the modification of each counter more than once for each outer cycle.e. the filter returns FALSE. is equal to the current one in the outer cycle. i. To provide a position filter for approximate sub2 sequence matching is not a simple task. q = 1). the term counters σ1 c[p1 ] through σ1 c[p1 + w − 1]. sub2 Position filter works by dynamically analyzing the relative positions and (partially) the order of equal terms in the sequence. σ1 [p1 ]. Notice that the count threshold c used in sub2 Position filtering is the same as the one used in the sub2 Count filter when q-grams have length 1 (i. being the actual number of common words 3. The schematized dynamic programming algorithm is shown in Figure 1. that is if none of the counters reach the threshold. notice that the filter algorithm “cleverly” marks an entire window of terms ahead of the equal term found. The filter receives as input the two sequences σ1 and σ2 . and performs two nested cycles: an outer cycle on the terms of σ2 and an inner cycle on the terms of σ1 . In particular. to decrement the counters of the terms in w or more positions earlier.σ2 ) returns TRUE even though they do not contain similar parts satisfying the parameters specified at the beginning of the example. An increment/decrement limitation is also used: it improves the filter effectiveness in the case of clusters of repeated terms. otherwise.1. should still be efficient and effective.e. the filter also provides a decrement cycle. sub2 CountFilter (σ1 . consider the following examples. efficiently pruning out a higher number of sequence pairs. .16 Approximate (sub)sequence matching Consider the following sequence (here. sub2 Position filter is a new filtering technique explicitly designed to further improve the performance of the approximate sub2 sequence matching operation: it offers a much better filtration rate than simple count filtering. As you can see. say σ2 [p2 ]. √ √ √ σ2 : XPaint is a new computer graphics sof tware. Furthermore. sentence) pair. When the term counter σ1 c[p1 ] at the position p1 reaches the count threshold c and its term. where equal words are emphasized: √ √ √ σ1 : ABC sof tware welcomes you to the world of computer graphics. σ2 [p2 ]. a minimum length minL and a distance threshold d. i. say σ1 [p1 ]. each time it is equal to a term in σ1 . are incremented. For further accuracy. Each position p1 in σ1 has an associated counter σ1 c[p1 ]: given a term in σ2 . being w (w = minL) the filter window size. without knowing where the candidate matching sequences start. As an explanation of the filter mechanisms. together with the increment cycle. making it easier to identify “clusters” of equal terms without having to analyze the surroundings of each one of them. since the algorithm.

Consider the following sequence (sentence) pair which represents a wrong candidate answer for the standard position filter since it contains two close and equal terms: σ1 : XPaint is very easy to use. Notice that none of the counters reaches the threshold 3.σ2 ) returns FALSE. the algorithm works as follows (the term marked with the √ symbol is the current term in the outer cycle. the filter is even able to ensure a smaller number of false positives than the one offered by standard position filters. so sub2 PosFilter (σ1 . σ2 : XPaint is a new computer graphics sof tware. σ2 : Is XP aint a bitmap processing software? . the number of over the term in the position p1 in the inner cycle correspond to the value of σ1 c[p1 ]. From p2 = 5 to p2 = 7. d = 1. σ2 : Is XPaint a bitmap processing software? The filter counters are updated for the two first (common) terms: √ √ √ σ1 : XPaint is very easy to use. Consider the sentence pair of Example 1. By using the ahead-marking technique. p2 = 6 : √ √√ σ1 : ABC software welcomes you to the world of computer graphics. Then. Then. equal terms are in italics): p2 = 5 : √ √ σ1 : ABC software welcomes you to the world of computer graphics. d = 1. σ2 : XPaint is a new computer graphics software. the threshold c for sub2 Position filtering is 2 and the window size w is 3. σ2 : XPaint is a new computer graphics software.3 Let minL = 3. as shown in the following example: Example 1. σ2 : Is XPaint a bitmap processing software? √ √√ √√ √ σ1 : XP aint is very easy to use.2 Let minL = 4. the threshold c for sub2 Position filtering is 3 and the window size w is 4.1 where the first four terms of σ2 are not found in σ1 . correctly pruning out this pair of sequences (sentences).Approximate sub2 sequence matching 17 Example 1. p2 = 7 : √ √ √ √ √ √√ σ1 : ABC sof tware welcomes you to the world of computer graphics.

. . which serves as the foreign key attribute to the table storing S. Both tables share the same schema (COD. If exists a pair (σ1 [i1 . 1. sub2 PosFilter(σ1 . maintained in D and Q respectively. σ2 [i2 . with the query table Q and the data table D to retrieve the sequence pairs to be further . we show how the algorithms used for sequence similarity search can be mapped into SQL expressions and optimized by conventional optimizer. d)) returns TRUE. where POS identifies the position of the q-gram Qgram. j2 ] ∈ σ2 ) such that (j1 − i1 + 1) ≥ minL.2. d be a distance threshold and minL be a minimum length. It shows that filters can be expressed as an SQL expression and efficiently implemented by a commercial relational query engine. The involved SQL expression joins the auxiliary tables for q-gram sequences. . . we provide the following theorem.e. then. 2). the database must be augmented with the data about q-grams corresponding to the data and query sequences.18 Approximate (sub)sequence matching As you can see. where COD is the key attribute and SEQ the sequence. some counters reach the threshold (i. the sub2 position filter extP osF ilter(σ1 .σ2 ) returns FALSE. SEQ). such as Oracle and DB2. j2 ]) ≤ d. σ2 [i2 . j1 ] ∈ σ1 . . The immediate practical benefit of our techniques is that approximate search can be widely and efficiently deployed without changes to the underlying database. σ2 . Let D be a table containing the data sequences and Q an auxiliary table storing the query sequences. Theorem 1. . correctly pruning out even this pair of sentences. As far as the correctness of the filter is concerned. Dq and Qq. .Qgram). In order to enable approximate sub2 sequence matching processing through filtering techniques based on q-grams. So. which is created on-the-fly. The positional q-grams of S share the same value for the attribute COD. its positional q-grams are represented as separate tuples in the above tables.POS. (j2 − i2 + 1) ≥ minL and ed(σ1 [i1 . In the rest of the section. but not σ1 c[1] which is the counter of the only term equal to the current term in the outer cycle. The SQL expression exploiting the filtering techniques for approximate sub2 sequence matches has the form pictured in Figure 1. j1 ]. For each sequence σ. minL. . and stored in two auxiliary tables Qq and Dq with the same schema (COD.2 Approximate matching processing The approximate sub2 sequence matching problem can be easily expressed in any database system supporting user-defined functions (UDFs).1 Let σ1 and σ2 be two sequences.

position filtering sub2Position(S1.2 Approximate matching processing 19 SELECT FROM WHERE AND AND AND GROUP BY HAVING S1. The sub2 Count filtering is implemented by comparing the number of q-gram matches with the length of the involved sequences. The fact that the parts of the sequences analyzed for filtering purposes cannot benefit from extended qgrams (as already outlined in subsection 1. In this context. d) -.2: Query for approximate sub2 sequence match filtering analysed for approximate sub2 sequence matches. d+1 Intuitively.COD = S1q. i.2 requires that at least one q-gram is shared by the two sequences.1. d+1 Once the candidate sequence pairs for approximate sub2 sequence matching have been selected by the filters.COD AS COD1. they must be further analysed to locate the approximate sub2 sequence matches. Qq S1q.d). Besides considering the introduction of an ad-hoc algorithm.COD AS COD2 Q S1.SEQ. S2.1 is implemented by means of an UDF function sub2Position(S1. we explored the possibility to keep .SEQ. minL.COD S2. choosing a q-gram size too big with respect to the minimum length minL and/or too small with respect to the number of allowed errors d could imply that a sequence pair share no q-gram even if they have some approximate matching parts.COD S1q. and we distribute the d allowed errors.(d + 1)*q Figure 1.count filtering S1.e.S2. Dq S2q S1. S2.1. S2.Qgram = S2q. we are not mainly interested in finding a specific solution for this problem. The sub2 Position filtering algorithm shown in Figure 1. in order to establish the q limits we consider the minimum length of the part suggestion. Indeed. The worst case occurs when we have d + 1 “safe” parts with the same size: minL .COD = S2q. minL.COD. D S2.5 The propositional formula “If σ1 and σ2 have some approximate matching parts then there is at least one q-gram shared by σ1 and σ2 ” is true if and only if q ∈ [1.COD COUNT(*) >= minL + 1 .2) also influences the size of q-grams stored in tables Dq and Qq. Notice that the structure of the query of Figure 1. minL ]. Proposition 1.minL.Qgram -.

. we can implement one of the algorithms surveyed in [103]. Interested readers can refer to [61. filtering techniques exist and a mapping into SQL expression is shown in [61]. Thus. It consists of 1497 reference sentences corresponding to one version of the manual and 400 query sentences corresponding to a part of the following version. the problem can be reduced to a subsequence matching case. we compute the edit distance ed(σ1 [i1 .3. Before doing so. Whole matching can be efficiently processed by means of the filtering techniques shown in Section 1. Given σ1 [i1 . taken from two versions of a software technical manual.20 Approximate (sub)sequence matching using filtering techniques relying on a DBMS by reducing the problem to the two well-known approaches of whole matching and subsequence matching. Indeed.2. Following a similar approach. where the considered sequences are sentences. . we consider all possible subsequences of both sequences. Also in this case. To reduce the problem to a whole matching case. having length greater than minL. The only problem is that such algorithms do not locate the subsequence of σ2 matching σ1 [i1 .1. where the input of this query would be the output of the one shown in Figure 1. j1 ] and σ2 [i2 . 1. 91]. .1 we show the data set used for experiments and in Section 1. we used two real data sets: ˆ Collection1. say σ1 . j1 ].1. 1. . in Section 1. j2 ]) and return such subsequence pair if and only if the corresponding value is less than d. in this section the terms “sentence” and “sequence” will be used as synonyms.1 Data sets To effectively test our techniques.3. means to transform the candidate answers so that the edit distance function can be applied. σ2 ). j1 ]. In this case.3 Experimental Evaluation In this section we present the results of an experimental evaluation of the techniques described in the previous sections. . .3. . but most of them can be extended in order to fulfill this requirement. for each pair of candidate answers (σ1 . σ2 [i2 . we chose the EBMT environment. . a mapping into SQL expression is also possible. Therefore. if we consider all possible subsequences of length greater than minL of only one of the two sequences. As reference application. query and data sentences deal with the same topic and have similar style. j2 ]. .2 we summarize some interesting implementation aspects. More precisely. . .

It contains translations from one of their customers’ technical manuals. Since our experiments do not focus on such a computation performance. the ad-hoc algorithm we implemented is a naive algorithm which performs two nested cycles for each possible starting term in the two sequences to compute the matrixes of the edit distance dynamic programming algorithm. storing and exploiting the new q-grams: not only for their bigger quantity (approximately 2(|S| − minL)(q − 1) more for each sentence) but also for the complications of considering each extended q-gram in the right place. 1. As for query efficiency. multi-edit-distance(σ1 . In order to implement the two queries for whole and subsequence matching we further extended the database by introducing a new table storing all possible starting and ending positions for each part of each sentence and by extending the q-gram tables with all possible extended q-grams. σ1 . We measured the execution time of these solutions.minL. a complete Translation Memory provided by LOGOS. i.e.Implementation 21 ˆ Collection2.3. Collection2 presents a lower homogeneity with respect to Collection1. Such function will be presented more in depth in Chapter 2. There are 34550 reference sentences and 421 query sentences. given two sequences σ1 and σ2 . the underlying DBMS is Oracle 9i Enterprise Edition running on a Pentium IV 1.2. we just added some appropriate indexes on the q-gram tables to further speed up the filtering and search processes. As to the computation of sub2 sequence matching. and a a distance threshold d. Because of its greater size. Despite the high computational complexity of the implemented algorithm. we noticed that its performance was better than that of the two full-database solutions. our complete EBMT system EXTRA. in the context where we first tested it. by issuing conventional SQL expressions we have been able to rely on the DBMS standard optimization algorithms. but already from the first tests they turned out to be more than ten times slower than the naive algorithm solution and so in the following discussions they will not be . In particular. we tested the three alternatives described in Section 1. a minimum length minL.d) locates and presents all possible matching parts along with their distance.8Ghz Microsoft Windows XP Pro workstation.2 Implementation The similarity search techniques described in the previous sections have been implemented using Java2 and JDBC code. Suffice it to say that the benefits of using q-gram filtering were completely nullified by the enormous overhead in generating. a worldwide leader in multilingual document translation.

enabling different filters each time. for a filter to be useful its response time should not be greater than the processing time of just the match algorithm on the whole cross product. In order to examine how efficient each combination of filters and matching algorithm is.0E+07 1.0E+05 1. K=1. 1. . enabling different filters each time.0E+02 1. In order to examine how effective each filter and each combination of filters is we ran different queries.0E+06 Candidate Set Size 1. Q=1 MinL=4. and measured the response time also considering the scalability for the two most meaningful cases.3 Performance In this chapter. The main objective of filtering techniques is to reduce the number of candidate answer pairs. Q=3 MinL=4. Indeed. Obviously.3.0E+06 Sub2Pos Sub2Count Cross Product 1. Q=2 (a) Collection1 (b) Collection2 Figure 1. the more filters are effective the more the size of candidate answers gets near to the size of the answer set. Q=3 MinL=4.22 Approximate (sub)sequence matching Real 1. We started by considering the most meaningful minimum length minL that. and the values of q-gram size allowed by Prop. K=1.3. K=1. Q=1 MinL=6. that is the minimum length minL. Effectiveness of filters We conducted experiments on both the data sets introduced in Section 1. in most cases.0E+05 1.0E+03 1. we ran different queries.5. K=1. K=2. Another key aspect of filtering techniques is their efficiency.0E+08 Real Sub2Pos Sub2Count Cross Product 1. 1. the number of allowed errors d.1. Q=2 1. Q=1 MinL=4.0E+04 Candidate Set Size MinL=3. Performance trends were observed under the parameters that are associated with our problem.3: Filtering tests further considered. K=1.0E+02 1.0E+04 1.0E+01 MinL=3. we are mainly interested in presenting the new filters performance. K=2. and measured the size of the candidate set with respect to the cross product of the query sentences and data sentences. Q=1 MinL=6.0E+03 1. K=1.0E+01 1.

Obviously. Q=3 MinL=3. K=1. In any case.Performance 23 MED MinL=4. Q=1 MinL=3. In particular the sub2 Count filter gave a candidate answer that was between 0.8% . the sub2 Count filter works better with q values greater than 1 and preferably smaller than 4. Q=1 0 20 40 60 80 100 Seconds 120 140 160 180 0 200 400 600 800 Seconds 1000 1200 1400 1600 (a) Collection1 (b) Collection2 Figure 1. K=1. besides counting the number of equal terms (with the same threshold as sub2 Count when q = 1).4: Running time tests is 3 and thus we allowed at most one error in order to get a significative answer set (q must be 1). Q=2 MED 22 Sub2Count > MED Sub2Pos > MED Sub2Count + Sub2Pos > MED 4 26 6 139 19 36 31 138 3 25 5 103 21 31 35 155 540 68 1427 131 640 380 1369 17 456 19 1052 144 625 358 1504 MinL=4. since q = 1 does not allow for q-gram overlap. K=2. K=1. Q=3 MinL=4. For this reason. The comparison of the two alternatives having minL = 4 shows it: setting the similarity threshold to 75% enables q = 2 and thus doubles the filtering performance.003% to 11% of the cross-product size and sub2 Position filter filters from five to tens of times better than the sub2 Count filter. K=1. Q=2 Sub2Count > MED Sub2Pos > MED Sub2Count + Sub2Pos > MED MinL=4. In particular. the sub2 Position filter always filters better than the sub2 Count filter since. Q=1 MinL=6. Even more evident is the case of q = 3 where the reduction of all the data sets is more than 99. we did not take into account possible combinations of filters. Then we analysed the effect of increasing each of the two parameters and the subsequent value(s) of q. Notice that the sub2 Count filter filters better on Collection2 than on Collection1 since Collection1 contains more homogeneous sentences and thus it is more likely that they share common terms. The most meaningful experiments are shown in Figure 1.4 presents the response time of the experiments detailed in the effectiveness of filters paragraph. Q=1 MinL=6. K=1. Efficiency of filters Figure 1. K=1. it also considers their positions. K=2. it shows the times required to get the answer sentence pairs for each possible combination of filters and .3.

The assessment and evaluation of the obtained values focus on determining the best choice for filters with respect to the parameter values.24 Approximate (sub)sequence matching MED Sub2Count > MED Sub2Pos > MED Sub2Count + Sub2Pos > MED MED Sub2Count > MED Sub2Pos > MED Sub2Count + Sub2Pos > MED 10000 10000 1000 1000 Seconds 100 Seconds 100 10 10 1 25% 50% 100% 1 25% 50% 100% (a) minL = 3. It shows that all combinations grow linearly with the number of sentence pairs and that the best choice is generally to turn on all the available filters. Figure 1. The reason why . any combination that includes the sub2 Count filter always improves the performances. 11-13-15 secs at 25%-50%-100%). using indexes on the codes of the phrases and on the q-grams. In particular. joins are computed using sort-merge-join algorithms. we selected one parameter combination with q = 1 and one with q > 1 as representative cases characterizing the filters behaviours and we analyzed their scalability. In detail. From the four parameter combinations. q = 1 (b) minL = 4. As a final remark. on Collection 2. This improvement is even more evident for Collection2. q = 2 Figure 1. Moreover.3). Enabling filtering techniques reduces the response time of at least 7 times in the worst case. even if it filters less than the sub2 Position filter (see Figure 1. it plays an important role by pruning out a large portion of sentence pairs and thus leaving a small set of them on which the sub2 Position filter or directly the matching algorithm is applied. k = 1. Indeed. note that the sub2 Count filter has a less than logarithmical growth (i. In particular. especially for values of q equal to 2 and 3.5 shows the scalability of all the possible filter combinations for q = 1 and q = 2. turning the sub2 Position filter on provides an even faster execution. where response time is significantly reduced. Moreover.e. its execution requires a small amount of time since it relies on the facilities offered by the DBMS. with particular benefits for q = 1. we observed the plans generated by the DBMS showing that. k = 1.5: Scalability tests for Collection 2 the matching algorithm (denoted as MED).

g. [3. Such solutions are limited to the problem of string matching and substring matching. A revised version of such algorithm could be adopted for implementing our multi-edit-distance. [8]). 57]). indexes (e.g. in the total times shown. 55] addressing the problem of whole and subsequence matching for sequences of the same length. and time series (e. In particular.4 Related work A large work body has been devoted to the problem of matching sequences with respect to a similarity measure. [34. They adopt a suffix tree as indexing method in order to efficiently compare all possible sequences. 105]). 73]). such as the approach we propose. 1. genetics (e. As to indexes. 29. The problem has been addressed by proposing solutions based on specific algorithms (e. where in the latter case the main objective is to verify if a pattern string is contained in a given text without necessarily locating the exact position of the occurrences. however. the work [103] is an excellent survey on the current techniques to cope with the problem of string matching allowing errors.g.1.4 Related work 25 this strong performance is not evident from the graphs is that. and filters (e. we found the work [61] of particular interest. It presents some filtering techniques relying on a DBMS for approximate string joins and it offered the starting ideas for our work. A notable exception is the search of the Longest Common Subsequence (LCS) [114] between two sequences which. customized secondary storage indexes or indexing techniques for arbitrary metric spaces have to be supported by the DBMS in order to be useful for techniques accessing large amounts of data stored in databases. the problem has been considered in different fields such as text and information processing. Starting from the works of Faloutsos et al. 40]). the paper [8] presents a fast algorithm for all-against-all genetic sequence matching. As far as text and information processing is concerned. [61. is limited to the location of the longest part by allowing only insertions and deletions. the linear-growing MED time prevails.g. . Amongst the others. [2.g. [104.

.

their expertise and skill is not alone entirely sufficient in order to achieve highly effective and efficient translation performance. The consequent allocation towards communication will grow from 550 to more than 800 Million Euros. while the number of pages to be translated by European institutes was more than 2 Million in 2004 and is expected to grow by 40% each year. while ensuring optimal translation times and costs. supported by highly experienced linguistic revisors. Devised with the aim of preserving and treasuring the richness and accuracy that only human translation can achieve. however. The best way for translating very large quantities of documents. Machine-Assisted (-aided) Human Translation (MAHT) and. with an ever increasing quality. The importance – and the expenses – for inter-language communication and translation are soaring both in the private and in the public fields. they may face instances where similar contents is translated several times for separate uses and this can result in inconsistencies and uncontrolled translation expenses. Consider the case of the European Union for instance: With the recent joining of ten countries. Indeed. in par- . The same problems arise in the private field: As enterprises grow. is to exploit the constantly growing Machine Translation (MT) possibilities. both in public and in private enterprises. and an extensive number of freelance translators are more frequently employed in every context. the number of official languages has increased by 82% and has reached 20 units. Large teams of professional translators. MT tools provide a way to automate such process and have a specifically definite goal: Bringing better quality and quantity in less time.Chapter 2 Approximate matching for EBMT Nowadays we are witnessing the need to translate ever increasing quantities of texts. translation is a repetitive activity requiring a high degree of attention and the full range of human knowledge.

In this chapter we show how the approximate matching techniques de- . Then. Each time a new document is translated.28 Approximate matching for EBMT Suggestion search Translation suggestions (Target language) Text File (Source language) Translation Memory Code Original Sentence (Source language ) 5673 Welcome to the world of computer generated art! 32567 … Now press the right mouse button. similar source-language sentences into the target language. An EBMT system consequently proves to be a useful translator assistant only if most of the suggestions provided are based on similarity searching rather than exact matching. which include source language (from which one is translating) and target language text. with the suggestion search process.1: Processes in an EBMT system ticular. in most cases it is quite unlikely that a translation memory stores the exact translation of a given sentence. they search and manage terminologies. EBMT systems can speed up a translator’s work significantly: They analyse the technical issues of the different document formats. Due to the high complexity and extent of languages. Example-based translation is essentially translation by analogy. the TM is updated with the set of sentences in the source language and their corresponding translations in the target language. … Translated Sentence (Target (Target language ) ) language Benvenuti nel mondo della grafica generata al computer! Ora premere il pulsante destro del mouse. Example-Based Machine Translation (EBMT) represent one of the most promising translation paradigms. But how do EBMT systems basically work and attempt to achieve all this? All past translations are stored and organized in a large text repository known as Translation Memory (TM) (see Figure 2.1). glossaries and reference texts. Such TM is organized in a series of translation examples. … Translator Text File (Target language) Pairs of sentences TM update Figure 2. the system uses those examples to help to translate other. together with possible additional information. such process is denoted as TM update. generate translation suggestions to speed up the translation process itself and ultimately help translators in achieving a good level of terminological uniformity in the result.

it applies a metric which is the basis for the suggestion-ranking and whose properties are exploited to take full advantage of the TM content and to speed up the retrieval process. As to experimental evaluation. the EBMT system we have developed over the last few years [91. EXTRA is able to propose effective suggestions through the cascade execution of two processes. The work presented in this chapter has been developed as a joint effort with Logos Group. The Logos solution includes professional in-country specialist localization teams. bringing high flexibility in its design and complete functionalities to the translator. and propose an in-depth analysis of its features. we dedicate specific sections to the simulation of ad-hoc statistical. Furthermore. ensuring a good trade-off between the effectiveness of the results and the efficiency of the processes involved. The rest of the chapter is organized as follows: In Section 2. In . The heart of EXTRA is its innovative suggestion search engine. particularly for evaluation of the effectiveness. also in complex search settings. process-oriented. To this end. it performs analyses of the available bilingual texts ranging from syntactic text operations to advanced semantic word sense disambiguation techniques. EXTRA does not use external knowledge requiring the intervention of users.29 scribed in the previous chapter can be succesfully applied to the EBMT scenario. the objective and detailed evaluation of the results provided by EBMT systems has always been a complex matter. centralized quality control. the system is able to efficiently search large amounts of bilingual texts stored in the translation memory for suggestions. discrete-event models for quantifying the benefits offered by EXTRA assisted translation over manual translation. and to the specific analysis of the effectiveness of the semantic analysis of the texts. and is completely customizable and portable as it has been implemented on top of a standard DBMS. such as annotated corpora. We also provides a thorough evaluation of both the effectiveness and the efficiency of EBMT systems and a comparison between the results offered by our system and the major commercially available systems. we founded EXTRA’s search algorithms on advanced Information Retrieval techniques. software engineering and dedicated project management and translation memories. Then. 92] to support the translation of texts written in Western languages. Logos is a worldwide leader in multilingual technical document translation. we present EXTRA (EXample-based TRanslation Assistant). EXTRA is designed to be completely modular. Furthermore. First. Instead of relying on Artificial Intelligence. together with an extensive evaluation of the results that it can achieve. Indeed. whose foundation is built on a specialization of the approximate matching techniques discussed in Chapter 1.1 we present an overview of research carried out in the EBMT and related fields.

they require highly extensive monolingual and/or bilingual corpora for the unavoidable training of their models. we present an overview of the processes involved.1 Research in the EBMT field Since the EBMT paradigm is founded on predicting which sentences in the translation memory are useful and which are not. Machine Translation is the research area that has produced most studies on these aspects and to which most of the researchers proposing EBMT techniques have turned. the reader should not be surprised to discover that part of the work that has been done in the area of Information Retrieval (IR) can indeed prove valuable for reviewing the EBMT issues from a new and interesting perspective and devising new techniques to these aims. 2. Good points of view to distinguish representations of text fragments are the . since EBMT systems basically manage text information.1. However. In this context. Finally. In particular. the available approaches differ in the way they logically represent and store the available examples in the translation memory. The key to the working of EBMT systems is to support techniques allowing the system to distinguish what text segment is similar to the submitted one and how similar it is. 2.1 Logical representation of examples First of all. to which Sections 2. 51] to be good surveys. i. one main problem regarding an EBMT system is the suggestion search activity. and they are strongly dependent on the particular language and subject that they were trained for. we found the papers [124.4 is dedicated. document analysis (analysed in detail in Section 2. which are not always available. such a choice is particularly critical. being them quite peculiar and different in their working from our system.30 Approximate matching for EBMT Section 2.5 presents a detailed discussion on the results of the experiments performed. Section 2. and of the algorithms for searching the TM for useful text segments. The dissertation does not deal with statistical approaches.e. topics that have drawn major attention on behalf of researchers over the last two decades are: The conception of different ways to represent the examples in the TM. since what is stored in the TM is also likely to be exploited by the similarity metric in order to determine the text similarity. the definition of metrics and scoring functions. While such techniques relieve EBMT systems of the problem of evaluating the goodness (similarity) of the translations. For this purpose.2 we present our approach and discuss the reasons behind it. Naturally.3) and suggestion search.

Character-based metrics are not particularly effective for most Western languages.1. Other approaches [42. approaches such as [118] perform advanced parsing on the text and choose to represent the full complexity of its syntax by means of complex syntactic tree structures. For instance the approach in [24] stores examples as strings of words together with some alignment and information on equivalence classes (numbers. It should be noticed that more an approach and the consequent logical representation are complex more information can be exploited in the search of useful material in the translation memory. The ones reported in the literature can be characterised depending on the text patterns that they are applied to: They can be character-. 2. weekdays. suggested an early word-based metric that attempts to emulate human translation practice in recognizing the similarity of a new source language sentence to an example by selecting identical phrases available in the translation memory except for a similar content word.and knowledge-dependent and perform some text operations (e. word. such as sequences or trees. Finally. such complexity is paid in terms of huge computational costs for the creation.2 Similarity metrics and scoring functions The example(s) an EBMT system determines to be equivalent (or at least similar enough) to the text to be translated varies according to the approach adopted for the similarity metrics. and the amount and the kind of linguistic processing performed [41].or structure-based. such as the equivalence classes. the pioneer in EBMT system. The partial match is performed by allowing equivalence classes. Further.g. 41] are more language. Nagao [102]. Furthermore. in wordbased metrics. Brown [24] proposes a matching algorithm searching the example base for contiguous occurrences of successive words by means of an inverted index. As to word-based metrics. storage and matching / retrieval algorithms (see next paragraph). In [125] the similarity between the words is determined by means of a thesaurus abstraction hierarchy. they compare individual words and define a similarity of the fragment by combining the words’ similarities. etc. instead some of the cited approaches assume the availability of a particular knowledge strictly related to the language involved. the exploitation of the information provided by the order of .).Similarity metrics and scoring functions 31 employed structure(s). a strong advantage of EBMT should be the ability to develop systems automatically despite lack of external knowledge resources [124]. providing only a superficial way to compare text fragments. The closeness of the match would be determined by a semantic distance between the two content words as measured by some metric based on a thesaurus or ontology. POS tagging and stemming) on the sentences in order to store more structural information about them. however.

4 2. most of the EBMT researchers do not specifically propose algorithms and data structures in order to make the search techniques more efficient. then the matching function makes a collection of matched chunk material. the works [41] and [42] perform advanced hybrid word and structure based similarity matching techniques. For instance.e. 118]. evaluation is often achieved by computing recall and precision figures over well known test collections. Such low flexibility is a big issue: Consider. in a large number of the proposed research systems the obvious and intuitive “grain-size” for examples seems to be the sentence [124]. both the examples and the query are first decomposed into “chunks”.e. In [24] on the other hand a rough ranking mechanism based on the freshness of the translation memory suggestion is proposed. the entire sentence is the smallest unit of search.1. . Most of the similarity metrics proposed in literature are also exploited in the fundamental suggestion-ranking operation [42.3 Efficiency and flexibility of the search process Another fundamental aspect of EBMT systems. In the Information Retrieval field. taking advantage of the defined similarity metric and of adaptability scores between the source and target language of a particular translation suggestion. the frequent cases in which a translator decides to merge two sentences into a single one. In [41] a different approach is taken. Furthermore. 2. i.32 Approximate matching for EBMT the words can be fundamental: Word order-sensitive approaches are demonstrated to generally outperform bag-of-words methods [10]. together with the similarity metric that they are based on. For a discussion of approximate matching related work refer to Section 1. or when the translation memory contains no whole suggestions. 42] the assumption is that the aim of the matching process is to find a single best-matching example. in [102. while in [41] a more subtle ranking mechanism is exploited. Indeed. thus the approach still lacks the flexibility needed by translators. i. While research on part matching is not particularly encouraging in the Machine Translation field.1.4 Evaluation of EBMT systems The evaluation of the effectiveness of an EBMT system is not particularly straightforward. For instance. is the flexibility they offer in order to extract useful translation parts from the translation memory. however such subdivision in chunks is entirely static and determined before the query is submitted. along with the efficiency of the proposed search algorithms. for instance. much has been done and many ideas can be taken from the approximate string matching field and adapted to example-based suggestion search.

2. which. A popular set of more advanced products include translation memory software such as Trados [133] and D´j` Vu [46]. linguistic and semantic analysis such as stemming or word sense disambiguation is generally lacking and the employed similarities between the text fragments appear to be very simple character-based ones. the contribution of EXTRA to the state of the art in the way it retrieves useful . they are clearly not completely applicable to EBMT systems.Some notes about commercial systems 33 However. They offer some interesting applications for document ea management. the authors propose a way to measure the closeness between the output of an MT system and a “reference” translation by measuring it in proportion to the number of matching words. as commercial systems take an important role in the context of computer-aided translation. The above techniques require the existence of a complete set of reference (hand made) and automatic (machine generated) translations to be applicable. In particular. by definition. In Section 2.5 we will delve further into the performance and features of commercial EBMT systems. we present the suggestion search activity devised for EXTRA (Example-based TRanslation Assistant).5. In [97]. thus this could lead to wrong and only superficially similar suggestions.5 Some notes about commercial systems A related work on EBMT would not be complete without a final mention to commercially available EBMT systems. but they basically work in the same manner and show the drawbacks discussed previously. do not generally provide a complete translation as their output.1. The “BLEU” measure [108] enhances this technique by partially considering the order between the words: They count the number of equal q-grams. such as semi-automatic processes for document alignment. such as ParaConc [109]. there are no universally accepted ways to automatically and objectively evaluate MT systems. such “reference” collections have never been defined in the Machine Translation field and. 2. The simplest form of EBMT software which is able to search for and retrieve information from a bilingual parallel corpus are bilingual concordancers. Therefore. In particular. It attempts to overcome some of the above mentioned deficiencies of existing approaches through an extensive use of innovative ideas conceived in the IR field. Such tools are very straightforward but are generally designed just to search for words or very short phrases in an exact way [18]. comparing them with the ones of EXTRA. more generally.2 The suggestion search process in EXTRA In this section.

The adopted similarity metric should consequently be independent from the translation context. Thinking about the retrieval task in an EBMT system designed to support the translation of texts in Western languages. as . as it has been implemented on top of a standard DBMS. Editing the translation means adding text. such as involved languages (provided that they are Western languages) and subjects. and that more it meets the human intuition of what constitutes a pair of similar objects. Moreover. more it is effective. any suggestion allows the translator to save time only when editing the suggestion takes less time than generating a translation from scratch. effective and rigorous metric which is the basis for the suggestion-ranking and whose properties are exploited to speed up the retrieval process. heavily dependent on the information retrieval needs. ˆ it does not use external knowledge requiring the intervention of users. it is well known that the definition of a measure quantifying the similarity between objects is a complex task.34 Approximate matching for EBMT suggestions can be summarized in the following items: ˆ it relies on a polymorph. 2. language-independent. ˆ it is versatile because it allows the combination of different ways to logically represent the examples with different ways to search them. we cannot forget that the translator can submit text in different languages and that the retrieved suggestions should help translators in speeding-up the translation process and achieving a good level of terminological uniformity in the result. swapping text. ˆ it fully exploits the TM content also in complex settings by supporting different grain-sizes of search.2. Moreover. for EXTRA too the similarity metric plays a fundamental role in the selection of the examples in the TM that are useful and those that are not and it obviously influences the logical representation of the examples. Intuitively. The first step towards the definition of a similarity measure is the identification of the properties that characterize the objects to be compared and that can influence the similarity judgement. ˆ it is able to efficiently search large amounts of bilingual texts stored in the translation memory for suggestions. deleting text. ˆ it is customizable and portable. modifying the text and so on in the retrieved examples.1 Definition of the metric As for all the EBMT systems relying on similarity matching.

both the sentences “The cat eats the dog” and “The dog eats the cat” would be represented by the set {dog.σq) Translations (sentS.The involved processes 35 New document (to be translated ) Translated documents Document Analysis Sequence of tokens Queries (sentSq. The translation memory of EXTRA contains a collection of translations. Document analysis 1 2 Each token is a group of characters with collective significance. Each document submitted to the system goes through a document analysis process. σ is the sequence of tokens corresponding to the logical representation of sentS and sentT is the translation of sentS in the target language. In the following we will use indifferently the terms translation and example .sentT) Translation Memory Suggestion Search Translation suggestions Figure 2. Each translation t is a tuple (sentS . either because it has to be added into the translation memory or because the translator searches useful suggestions for it. Instead. 2.g. the examples can be logically represented as sequences of representative items named tokens 1 . as they do not take into account any notion of word order and position (e.2: The suggestion search and document analysis processes much as an example maintains the same words in the same positions of the unit of text to be translated as much its translation is a good suggestion. the metric we presented in Chapter 1 for general sequence matching and which constitutes the foundation of the suggestion search process in EXTRA. σ. eat.σ.2 The involved processes Figure 2. cat}).2. sentT ) where sentS is the sentence2 in the source language. we argue that the classical IR models based on bag of words approaches are not suitable for our purposes. The knowledge concerning the position of the tokens in the example can thus be exploited through the edit distance [103]. For these reasons.2 depicts the flow of a document.

Setting out the internal representation of the sentences to be compared. σ q )}. independently from the properties of the involved tokens with the exception of their positions. each represented by the tuple (sentS . we considered various text operations for the document analysis process. to be added to the translation memory. where each query q is represented by the tuple (sentq . the logical representation of a sentence can be extracted by means of text operations. The same process is applied to each document to be translated when it is submitted to the system: it is transformed into a collection of queries. thanks to the alignment processes.36 Approximate matching for EBMT is the process of converting the stream of characters representing an input document into a set of sentences and generate a logical view of them (i. in particular. σ q ) where sentq is the sentence S S for which the translator needs suggestions in the target language and σ q the corresponding sequence of tokens. while the suggestion search takes a query sequence as input and produces a ranked list of suggestions for it. sequences of tokens). For the same reason. compares their logical representaS tion with that of the examples {(sentS . EXTRA is able to switch from the source to the target language in order to present to the translator a ranked list of suggestions. respectively. Notice that in the design of the EXTRA system we modularized the suggestion search activity by providing two autonomous processes: The document analysis always produces sequences of tokens. it obviously influences the effectiveness and the efficiency of the search process.e. σ. Document analysis is a fundamental step towards the suggestion search process. it is transformed through the document analysis into a set of translations. The suggestion search process takes the collection of queries as input {(sentq . 2. the doc- . In this way. is limited to the document analysis phase where the text analysis procedure takes place. independently from the way they are generated. the management of the translation context and. In our approach. When a translated document is submitted to the system. Finally. sentT ). we introduce a good level of versatility in the search process by allowing the testing of the impact of different logical representations in the search process. the importance of which cannot be ignored. of the involved languages. As we mentioned in Section 2.3 Document analysis The main task of the document analysis process is to transform each document into a set of sentences and to transform each sentence sentS or sentq S into its logical representation σ or σ q . sentT )} through the edit distance and some measure of relevance between the queries and the examples is obtained.1. Starting from their original formats. σ.

i. word sense disambiguation (WSD). word stemming (and stopword removal).2.3 Document analysis 37 uments submitted to EXTRA first go through a “chunking” process where they are broken into a set of sentences. This type of approach is not resilient to small changes in the sentences.e. notice that in the former representation all the words in the sentence are equally important while the latter disregards common words such as articles and so on while focusing on the most meaningful terms. Then. stemming also includes punctuation removal and WSD also includes stemming. we perform a semantic analysis to disambiguate the main terms in the sequence. There are various alternatives. as they are the product of a syntactic analysis where the meanings of terms are not taken into account. By switching between the first two options. translators usually find it easier to translate common words rather than far-fetched terms. such as the modification of an article. thus enabling the comparison between meanings instead of terms. the logical representation of each sentence ranges from the sentence itself with the exception of the punctuation to the sequence of its most meaningful terms with the exception of their inflections. a stem change to a term.3). each sentence has to be transformed into a sequence of tokens as required by the edit distance. For instance. we considered gradually more complex alternatives to pre-process a sentence and produce its logical representation. the author could refer to the artistic creation of the user as both “image” or “picture”. which should be considered as two equivalent . Finally. 3.e. consider the technical context of a computer graphic software manual: In such a context. simple punctuation removal. and produce different logical representations with an increasing level of resilience. Both the above mentioned representations disregard semantic aspects. the one on which the similarity search algorithms can be applied: 1. the stemmed version of its worth surviving terms (see the upper block of Figure 2. The syntactic representation of a sentence is thus a sequence of tokens. each of which is able to affect the search process as it determines the internal representation of the managed corpora. On the other hand. The most obvious way to do it is to directly compare the sentences as they are. thus the latter representation helps to find more useful suggestions than the former representation as only the differences among the fundamental terms are important. i. Because of this reason. Such options are incremental. or the addition of a comma. 2. In a suggestion search perspective. with the Word Sense Disambiguation option.

On the other hand.3: The syntactic and semantic analysis and the actual WSD phase words. Note that we disregarded such categories. which are usually less ambiguous and assume a marginal role in the meaning of the sentence. the electronic device. By reconsidering the previous example. we devised two completely automatic techniques for the disambiguation of nouns and verbs that ensure good effectiveness while not requiring any additional external data or corpora besides WordNet and an English language probabilistic Parts-OfSpeech (POS) tagger. as adjectives and adverbs. knowledge-driven ones. he could describe the picture of a “mouse” from Cinderella. The WSD techniques we designed can be categorized as relational information. mouse Semantic analysis Stemmed. tagged sentence 1 The x x 2 white JJ white 3 cat NN cat 4 is x x 5 hunting VBG hunt 6 the x x 7 mouse. while the two instances of “mouse” would be considered different. can be judged by the comparison scheme to be the same. On the other hand. . have different meanings can be considered as distinct tokens. Specifically. have the same sense. different terms which. which should not be mistaken with the “mouse”. in the context of the sequences they belong to. NN mouse Tagging N (nouns) Nouns and verbs lists extraction 3 7 cat mouse ? ? 5 V (verbs) hunt ? WSD: nouns and verbs disambiguation N (nouns) 3 cat 7 mouse V (verbs) 5 hunt *N-1788952* *N-1993014* *V-903354* Sequence of Tokens white *N-1788952* *V-903354* *N-1993014* Figure 2. exploiting one of the most known lexical resources for the English language: WordNet [100. terms which. Stemming Approximate matching for EBMT Syntactic analysis Stemmed sentence 1 2 3 4 5 6 7 Output Sequence of Tokens white cat hunt mouse The x white white cat cat is x hunting hunt the x mouse. 101]. used in different contexts. The syntactic analysis precedes the actual WSD phase which receives a stemmed sentence as input and produces a disambiguated version of it.38 Input Original sentence The white cat is hunting the mouse . “image” and “picture” would be considered to be the same. obviously meant as an animal. By employing WSD. used to digitally paint it.

4 The suggestion search process 39 where the WordNet-disambiguated nouns and verbs are substituted with the codes of their meanings (named synsets in Wordnet). In particular. Such lists are the input for the WSD techniques themselves. In Figure 2. complete with pointers to the positions of the words in the sentence. the word “white” is unchanged because it is not a noun nor a verb. To give an idea of the usefulness of the two approaches. the suggestion search process accesses the translation memory by comparing the submitted text with past translations and returns a ranked list of useful suggestions. Indeed. The first operations we perform on the stemmed sentence is POS tagging. which attempt to associate each entry with the code of the WordNet synset that best matches its sense. let us show the work habits of two possible types of translators.3 for a discussion). they usually know the material they are going to translate and thus their main objective is simply to carry out the assigned . 2.4. Confident translators might prefer to translate from scratch such sentences for which no very close suggestions are available. EXTRA tries to meet the skills and the work habits of the translators in the best way possible by putting two similarity matching approaches at their disposal. confident and “casual”. The tags shown in the figure are exactly the ones produced by our tagger.1 for detailed description of our WSD techniques. which is a variant of the common BrownPenn tagset [96]. we need to identify nouns (tags starting with N) and verbs (tags starting with V) and create stemmed lists N and V of each of these two categories. Notice that the same would happen for words tagged as nouns or verbs but not present in WordNet. Refer to Appendix A.3 we show the different steps of the WSD elaboration and an example of the produced output.2. which can be freely combined in order to obtain the kind of suggestions they consider to be the most useful ones (see Section 2. The approximate whole matching searches the TM for such sentences that are similar enough to the query sentence. and their plausible interaction with the system. Not all of the stemmed terms are substituted with a code: In the example. which associates each of the terms in the sentence with a tag corresponding to a part of speech. whereas the approximate sub2 matching extends the above approach by comparing parts of the query with parts of the available sentences. The final step is to assemble the partial results in the sequence of tokens corresponding to its worth surviving terms.4 The suggestion search process Given a document to be translated. Each translator has her/his own way to gain her/his objective. To enhance the readability of the output we add a “N-” (noun) or “V-” (verb) prefix to each of the codes.

They correspond to translation memory sentences whose sequences are far from the query sequence less than an absolute distance threshold related to the specified relative one. and return (sentT .40 Approximate matching for EBMT job as soon as possible. sentT )} of translations. we will only describe how the general techniques have been modified and customized to the EBMT scenario in EXTRA. More precisely. As to the distance threshold. searching for sentences within 3 errors given a query of length 6 is totally different to searching sentences within the same number of errors given a query of length 20. In the following. σ)). the approximate whole matching works on the translation memory {(sentS . σ q )} of S queries. sentT )} and on the document to be translated {(sentq . provided that they are of good quality.4. they might be willing to edit and combine suggestions covering part of the submitted corpora. σ) ≤ round(d ∗ |σ q |). For example. the maximum number of allowed errors is defined as a user-specified percentage d of the query sequence length instead of an absolute number that is usually specified in string processing [7]. S Definition 2. Different is the case of more “casual” translators who look for such suggestions allowing them to obtain an acceptable coherence in the adopted terminology. ed(σ q . Indeed. the approximate sub2 matching might prove to be particularly useful in this situation. σ.1. searching for sentences within a number of errors that is independent of the length of the query could be of little meaning. For this reason we consider d as the percentage of admitted errors with respect to the sentences to be .1 (Approximate whole matching in EXTRA) Given a document to be translated corresponding to a collection Q = {(sentq . find all translations t ∈ T M such that for some query q ∈ Q ed(σ q . and a relative distance threshold d. As the translation memory often does not contain good quality whole matches. in the context of EBMT systems. The approximate whole matching probably represents the main source of suggestions for this kind of translators. In this case. For a detailed description of the foundation of the employed approximate matching techniques see Section 1. σ q )} as specified in the following definition.1 Approximate whole matching Given a collection of queries representing a document to be translated and a relative distance threshold. a collection T M = {(sentS . σ. the approximate whole matching efficiently retrieves the most similar sentences available in the translation memory and returns their target-language counterparts together with the likelihood values. 2.

a relative distance threshold d and a minimum length minL. . (j q − iq + 1) ≥ minL and (j − i + 1) ≥ minL. Although complex.1. σ. this kind of search enables the detection of similarities that could otherwise be unidentified. j q ] of a query q ∈ Q.2 Approximate sub2 matching Approximate whole matching is not the only search mechanism provided by EXTRA. the powerful similarity matching technique introduced in Section 1. . . Performing approximate sub2 matching generates a number of new and challenging issues. sentT ). the retrieval does not concern whole sentences. j q ]. Given a translation t ∈ T M . In order to correctly solve such link between the sequence and the target . Efficiency in retrieving the most similar sequences available in the translation memory is once again ensured by exploiting filtering techniques (see Section 1. Experience with several language pairs has shown that producing an EBMT system that provides reasonable translation coverage of unrestricted texts requires a large number of pre-translated texts [24].r. sentT ) ∈ T M . σ[i .3 would allow 3 errors w.4. . j].t. .t. and so on. to fully exploit the translation memory potentialities. Then. the result of the approximate sub2 matching are those parts of the target sentence sentT corresponding to the subsequences satisfying the specified distance threshold. return (sentT [iσ . It goes beyond “standard” whole and subsequence matching. σ q )} of S queries. a query of length 10. . . . a query of length 20. . . For example. as it attempts to match any part of the sequences in the translation memory against any part of the query sequences. j q ]. setting d = 0. such that for some subsequence σ q [iq . EXTRA exploits approximate sub2 matching. σ.2 (Approximate sub2 matching in EXTRA) Given a document to be translated corresponding to a collection Q = {(sentq .Approximate sub2 matching 41 translated.1). . . j] of the translations t = (sentS . . . the sentences stored in the translation memory could be partially useful. . Definition 2. find all subsequences σ[i . j σ ] the part of the sentence in the target language corresponding to σ[i . In particular. 6 errors w. . Thus. j]) ≤ round(d ∗ (j q − iq + 1)). j] the subsequence of σ ranging from its i-th to its j-th token and with sentT [iσ . ed(σ q [iq . . . a collection T M = {(sentS . σ[i . Firstly. we denote with σ[i .r. Anyway. 2. represented by a tuple (sentS . . . such that ed(σ q [iq . j])). unlike whole matching. but parts of them. Consequently. sentT )} of translations. j σ ]. σ. translators may submit sentences for which no whole match exists.

a special alignment technique is needed where the sentence in the source language acts as “bridge” between the sequence and the sentence in the target language. a possible way to combine the two matching approaches is to search for matching parts only for the query sentences for which no whole match exists. .4.1 and 2. 2.3 Meeting suggestion search and ranking with translator needs Whole matching extends sub2 sequence matching. i. In EXTRA. identify the right part sentT [iσ . which is easier and takes less time than sub2 sequence matching. however. consequently. in this case.e. just read only the top suggestions where the likelihood values help in quickly evaluating the level of similarity with the material submitted. 2. . The two matching techniques being the basis of the suggestion search process provide suggestions for the translation of a given document together with the edit distance values (see Defs. and. there is no point in applying the two search techniques to the same document to be translated. For some translators. Another interesting issue related to the search process is the ranking of the suggestions. sub2 sequence matching is able to identify a whole match too. Such problem. j σ ]. for each of the sentences they must translate. only the intervention of the whole matching process is required. In this way. in most cases. As a matter of fact.2). some translators may prefer to receive only suggestions for the whole query sentences. we exploit the new filtering techniques described in Section 1. Furthermore. i. For this reason. On the other hand. Translators who submit a document to an EBMT system for suggestion search expect a list of suggestions in the target language ranked in a meaningful order.1. In particular. has been addressed in [92]. such kinds of suggestions may not be sufficient for their needs. Indeed.42 Approximate matching for EBMT language sentence. such filters are adapted to the context of a relative distance threshold but are based on the same properties which have been previously described.e. the suggestions appearing at the top of this order are the ones with the lower values of edit distance and thus are the ones for which it should take less time to adapt them to the . which we call word alignment. in order to keep a good efficiency. Thus the most straightforward way to rank the retrieved suggestions for each sentence in the submitted document is to sort them by increasing edit distance values. most existing commercial systems only support this kind of search process. they do not want to lose time and. consequently they can rely on sub2 sequence matching. considering that translators may submit sentences for which no whole match exists but the sentences stored in the translation memory could be partially useful.

Different is the case of sub2 sequence matching that suggests parts of the TM sentences that match parts of the query sentence. all experiments were carried out by setting the relative distance thresholds d and dSub to 0. Unless explicitly specified. we analysed the quality of the assistance offered by EXTRA by means of some new metrics we introduced for EBMT systems. and of the ranking on the translation process has been subject of several experiments. A detailed account of the results that we obtained is presented in Section 2. As far as efficiency is concerned. The impact of document analysis.5. For additional efficiency tests of the approximate matching algorithms underlying suggestion search refer to Section 1. Moreover.2.3. 2. we avoid computing the unnecessary suggestions by exploiting an ad-hoc algorithm. i. it takes longer to use a large number of short suggestions than using a smaller number of longer suggestions. This is certainly true for the results of whole matching as they are suggestions for the whole sentence to be translated. since they would typically only slow the work of the translator down while giving no additional hint. In this case. Thus. . the length is also a factor that can affect the time required to complete the actual translation. As for each starting point the longest suggestions contain other suggestions. for each starting point the suggestions can be ordered by length and then by edit distance values. Thus. in most cases. of the type of suggestions. even when the longer are less similar to the involved parts than the short ones. considering the position of the first word of the query segment for which a translation is proposed. An algorithm that implements this idea is shown in Appendix A. Contained matches are in fact usually not useful. both for document analysis and for suggestion search. one possible way to prepare the suggestions for presentation is to group them by starting point.5 Experimental Evaluation 43 actual translation. together with the edit distance value.e. Indeed.2. another possibility is to output only such suggestions and to sort them on the basis of the edit distance values.5 Experimental Evaluation In this section we present the results of an experimental evaluation of the techniques described in the previous sections. The suggestions concern parts of the query sentence starting in different positions and have variable lengths. System performance was tested both in terms of the effectiveness and the efficiency of the proposed search techniques.2. As to effectiveness. we experimentally observed performance trends under different settings.

4 traces the broad outline of the interaction with the DBMS where the examples are recorded into as many tables and an auxiliary table is created on-the-fly to store the queries whenever the translator submits a document.2 Data Sets To effectively test the system. on the other hand. The immediate practical benefit of our techniques is that approximate search in translation memory can be widely and efficiently deployed without changes to the underlying database.2). 2. the whole and sub2 matching algorithms and the corresponding filtering techniques were implemented on top of a standard DBMS by mapping them into vanilla SQL expressions (by following the guidelines proposed in Section 1. Figure 2. was mainly used to test the effectiveness of the system in finding useful suggestions in a relatively small translation memory. we used it not only to test the reaction of the system to more repetitive data. Therefore.44 Approximate matching for EBMT Whole sentences Whole matching Doc analysis Translation (Permanent Memory DB table) New document (to be translated) On-the-fly DB table Parts of sentences Sub2 matching Figure 2. Designing a solution that fits into a DBMS context allowed us to efficiently manage the large bilingual corpora of the translation memory and ensure total compatibility with other applications.5. was created by professional translators and contains years of translation work on a very specific technical subject. EXTRA has been implemented using Java JDK 1.3. equipped with 512MB RAM and a RAID0 cluster of 2 80GB EIDE disks with NT file system (NTFS). Collection1.4: The role of the DBMS in pretranslation 2.5Ghz Windows XP Professional workstation.5. containing only English sentences. such collection is much better established than Collection1 and over 20 times larger. we used the two real data sets described in Section 1.1 Implementation notes As far as the design of the suggestion search process is concerned. the underlying DBMS is Oracle 10g Enterprise Edition running on a Pentium 4 2. but . Collection2.2 and JDBC code. For these reasons.4.

Some of such examples for Collection2 are shown in Figure 2. Given those premises. more challenging scenario. Moreover. Corresponding sentence in the target language: Posizionare le 4 mollette (A) come indicato e alla distanza prevista.5: Examples of full and partial matches also to test the efficiency of the system in a larger and. Collection2 contains both English sentences and their Italian translations. However. Figure 2. 2. Corresponding sentence in the target language: Dopo aver eseguito il collegamento elettrico. Suggestion in the target language: collegamento elettrico. montare il piano cottura dall’alto Sentence containing a similar part: Secure it by means of the clips.5 where the parts emphasized are those considered to be interesting suggestions.3 Effectiveness of the system To evaluate the effectiveness of a retrieval system means to assess the quality of the answer set with respect to a user query. the system could also be tested in aligning them and giving suggestions in the target language. there are no universally accepted ways to automatically and objectively evaluate MT systems. At a first . Query sentence: On completion of electrical connections. we first caught on to the quality of the suggestions by examining a significant sample of matches retrieved by EXTRA. more generally. the few efforts to define some measures of effectiveness do not apply to EBMT systems.1. fit the hob from the top and hook it to the support springs. montare il piano cottura dall’ alto e agganciarlo alle molle di supporto come da figura. In the Information Retrieval field.5. fit the cooktop in place from the top and secure it by means of the clips as shown. thus. such “reference” collections have never been defined in the Machine Translation field and. being the sentences available in two languages. Suggestion in the target language: Fissare definitivamente per mezzo dei ganci. Similar sentence in the source language: Position the 4 clips (A) as shown and at the specified distance. this is often achieved by computing recall and precision figures over well known test collections. as we already outlined in Section 2. Furthermore.Effectiveness of the system 45 Query sentence: Position the 4 clips (D) as shown and at the specified dimensions. according to the illustration. Sentence containing a similar part: After the electrical connection.

dSub d.6-b shows .r.7% 2.1 Sub 1.6: Coverage of the collections as outcome of the search process glance. Figure 2.3% 12. dSub d.1 Collection1 Sub 2. named coverage.5.2 0.1 1.3 2. dSub 0.0% 62. along with the benefits offered by our stemming and WSD techniques. When a translator is going to translate texts concerning subjects that have already been dealt with. focused on our particular problem. In order to quantify the ability of EXTRA in retrieving suggestions for the submitted text.6% Approximate matching for EBMT 18 16 Suggestions per sentece 14 12 10 8 6 4 2 0 d.2 0.1 d.9% 97.8% d. We then decided to introduce and evaluate “ad hoc” test figures. obtained both from a whole or a sub match.5. dSub 0.2 d.0 11. completeness) of the translation suggestions that EXTRA proposes to its users. which corresponds to the percentage of query sentences for which at least one suggestion. Furthermore. dSub 0. they seemed to be well-aimed compared to the submitted text.3 25.46 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% d. by performing several simulation runs on discrete event models that we specifically designed for this purpose. dSub d.t. dSub d.5 12. dSub 0.6% 96. Such measure is a good indicator of the effectiveness of a suggestion search process only if there is a good correlation between the text to be translated and the translation memory as it is for our collections.5 10.1 0.3 Whole 1.6 4.6-a shows that our search techniques ensure a good coverage for the considered collections. Coverage The content of the translation memory represents the history of past translations. has been found in the translation memory.0% Collection1 58. the EBMT system should help him/her save time by exploiting the potentialities of the translation memory contents. dSub 0.1% 95.7% Collection2 1. we precisely quantified the benefits of such translation assistance w. as shown in Section 2.0% 17. we analyzed the quality (pertinence.5% 8.1 1. standard manual translation. we propose a new measure. Moreover it proves to be useful in the comparisons of different systems.1 0. In particular.2 Whole 2. while Figure 2.875 Collection2 Sentence coverage (%) Sub Whole (a) Percentages of sentence coverage (b) Mean suggestions per sentence Figure 2.5 15.4 15.

notice that. sub2 matching covers a remarkable percentage of query sentences and becomes essential to further exploit the translation memory potentialities. In particular.2. Collection1 and Collection2. it enables the retrieval of many additional useful suggestions having very similar meanings but different forms. Furthermore. where most of the suggestions concern whole matches. since the effects of WSD can be even more manifest when applied to semantically and lexically richer texts. but they are nonetheless very similar in meaning and. In particular. Table 2. “just by”.3 have on the final results further in detail. but also on some classic literature works. plural forms (“users”. therefore. we felt the need to analyze the impact that the document analysis techniques described in Section 2. thus proving the good flexibility of our matching techniques. The upper part of the table shows some of these cases.g. even with a very restrictive setting of d and dSub = 0. we compared the suggestion fragments that are retrieved by EXTRA by employing the syntactic analysis with the ones offered by the WSD analysis techniques.6-b). “brushes”). which is relatively small and not so well established.1 shows a selection of some of the most interesting results of the effectiveness of the document analysis comparisons. and so on). notice that. the possibilities of finding similar whole sentences in small TMs are quite low). EXTRA can offer even more quality in the finally proposed suggestions. In particular. the mean number of partial suggestions is sufficiently high for both scenarios (see Figure 2. where the differences between the query (left column) and the retrieved fragments (right column) are emphasized in italic: The useful fragments contain contractions (“you’re”).Effectiveness of the system 47 the high number of retrieved suggestions per sentence and demonstrates the wide range of proposed translations. as we expected. These specific tests were performed not only on the two technical translation collections. prepositions and other non-semantically significant words (“through”. As to Collection1. their translation can be useful to speed up the translator’s work. “other”. The use of stemming techniques not only makes it possible to significantly speed up the suggestion search phase (see Section 2. Effectiveness of the document analysis techniques After having evaluated the quality and coverage of the translation suggestions. while the mean number of whole suggestions per sentence is generally particularly sensitive to the TM size (e. the enhance- .5. The good size and consolidation of Collection2 implies a very high level of coverage (over 97%.1). the obtained coverage is more than 70% of the available sentences. By coupling stemming with our WSD techniques. more importantly.4) but. by setting the distance threshold d and dSub at at least 0.

. . to understand a program feature . . . . . . . . . . . . . . . since synonyms are considered as equivalent words (see the central part of Table 2. . . . . “lad” and “fellow” have clearly the same meaning in the shown fragments and. . . .1). . . created custom brushes . . Table 2. . . . . the fellow shall be in my power . you are new to computer graphics . . to save translation time. which in the query fragment is a verb. . . In order to quantify the effectiveness of the WSD techniques we also extracted a significant sample of 100 sentences from each of the two collections and systematically analyzed and judged the correctness of the disambiguation of their nouns and verbs. . . . For instance. begin with Chapter Two. . . . . . knob to set . . cream puffs . . . . . . . . . . .7 depicts the results of such analysis for each of the collections. as if the original image had been . the term “cream” in “cream puffs” and in “cream of asparagus” has obviously different meanings. be sure you have the proper information . . are equal for the suggestion search purposes. . . that you understand other program features . . . . . . . Wrong fragments that would be retrieved without WSD . . . . since two words having the same form but different meaning are now considered to be different (lower part of Table 2. . the poor fellow was an orphan . ultimately. . . specify a beginning and ending . what goes on upon the moor .1). .48 Query fragments . do not go across the moor . . . . be sure to have information . . in the original picture has been . move on to Chapter 1. while in the translation memory segment is a noun: Avoiding such deceptively similar suggestions is clearly important for the quality of the offered translation aid and. . . . . . . . . . .1: Examples of the improvements in the effectiveness of the suggestion search process offered by stemming and WSD ment to the effectiveness of the suggestion retrieval offered by our in-depth semantic analysis concerns both precision and recall: Not only more useful suggestions are delivered to the translator. . . . On the other hand. we should be powerless . . . . Figure 2. Along with the “right” and “wrong” . . . . create a custom brush . The same applies to the term “knob”. . . . . . . with a context window of one (“standard”) or three (“expanded context”) sentences. . Query fragments . . . . . consult your PC Users Guide . Getting Started . . . . . . . . . cream of asparagus . . if you’re new to computer graphics . . but also the number of wrong or uncorrelated suggestions is significantly reduced. . legend of the fiend dog . Getting Started . therefore. . . just by specifying the beginning and ending . . . the legend of the demon dog . . . browse through your PC User’s guide . Useful fragments that are retrieved only with WSD . poor lad is an orphan . . . . control knob set to . Approximate matching for EBMT Useful fragments that are retrieved only with stemming . . . . “demon” and “fiend”. . . . . the nouns “picture” and “image”. Query fragments . . . . . .

but also the senses “a relational difference between states”.Effectiveness of the system 100% % of terms disambiguation success 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Partially Right Right Standard 9% 72% Exp. since in many cases the WordNet senses can be very similar and more than one disambiguation can be considered sufficiently correct. an implement that has hairs or bristles firmly set into a handle”. for instance. “action of changing something” (identified by EXTRA). We also noticed further positive enhancement. For instance. our WSD techniques provide rather good precision. Do not use abrasive products bleach oven cleaner spray or pan scourers. On both collections. context 9% 76% Standard 7% 71% Exp. By considering the next sentence. the term “maintenance” in the first sentence would be disambiguated as “the act of sustaining”.7: Percentages of disambiguation success on samples from the two collections classifications. thanks to the presence of words such as “painting” in the preceding sentence. context 7% 79% 49 Collection1 Collection2 Figure 2. in the sentence “In addition you can make global changes to your artwork with ease”. Here the disambiguation of “brush” changes from “the act of brushing your teeth” to “brush. we found that context window expansion is useful in a good number of cases. Consider. which is not repre- . Your brush works like a little animation. Indeed. the best sense for the term “change” is “alteration.”. which is very close to 90% when the context window is expanded. the following paragraph from Collection2: “Before any maintenance on the appliance disconnect it from the electrical power supply. The following is another example from Collection1: “You can even pick up an animation as a brush and produce a painting with it. modification”. which contains topical nouns such as “cleaner” and “products”. the term is correctly disambiguated as “activity involved in maintaining something in good working order”.” Without expanding the context. and “the result of alteration or modifications” can be considered sufficiently close to it. we also considered a “partially right” one.

Such time is the central element of our model and depends on the type of translation (assisted or manual) and on a number of factors that we will analyze later in this section. in the sentence “For example you can learn painting with the mouse”. discrete-event models and simulated them with the aid of specific simulation toolkits. we will typically configure them for just one translator. the verb learn would be disambiguated as “commit to memory”.r. thus.7.50 Approximate matching for EBMT sented in the graph of Figure 2. while the ones differentiating the two scenarios are shown at the bottom of the table.t.2 shows all the input parameters that we used to describe the two translation models. the main result that we are interested in is the overall simulation time. we assume a unary capacity of each translator. involving a configurable number of translators working on the translation of the text (“servers” in the simulation field) and of the sentences of the text (“users”). the mean service time of a translator will be the time spent to translate a given sentence. Most of the parameters are common to the manual and assisted models and are presented on the top. Table 2. namely the time required to translate the whole document in each particular setting. since we are not mainly interested in varying the number of translators or in settings involving large groups of translators in the tests we present. Translation models and simulations In order to quantify the benefits offered by the EXTRA translation assistance w. For instance. in verb disambiguation by exploiting not only the definitions but also the examples of their senses. together with their default value. standard manual translation. The two models simulate the manual and assisted translation of a certain document and allow us to estimate the time that it would take real translators in actual translation sessions. The simulation ends when all the sentences have been translated. Parameters describing aleatory variables. as well as the effectiveness of the suggestions provided by the approximate matching process. are expressed in terms of their mean value (x) and standard . In this context. offering insight into the dynamic behavior of such processes. we devised two process-oriented. that is each translator works. such as the ones involving time. Each sentence of the document comes from the source and waits for the translator in a standard FIFO queue. such as “she learned dancing from her sister”.8. Furthermore. it is correctly disambiguated as “acquire or gain knowledge or skills”. considering the examples of usage of its different senses. Both models share the same structural scheme shown in Figure 2. Note that the models we envisaged are completely general but. while. on a single sentence at a time. as one would expect.

. less confidence) to a maximum (end of the translation) value. Therefore. while the probabilities of recalling a word translation Prec range linearly from a minimum (beginning of the translation. Exit Translators Figure 2. . on average. For these reasons. there are a number of very common and/or frequent words the translation time of which can be less than base translation time since the translator has already translated them in the preceding sentences and is very confident on their meaning. the base word translation time tword base . the word recall saved time parameter tword rec models such a time save for each of the recalled words. avoiding inconsistencies with the previous sentences and with the other translators. since higher the number of translators working on the translation. working on the same document. Let us now specifically analyze the manual translation model: One peculiar parameter is the time required to make the “hand-made” translation coherent in its terminology. All the time values are expressed in seconds. . The default setting is that. On the other hand. certain words may require an additional amount of time. For this reason we added a parameter. representing the time needed to translate one word whether or not the translator is confident in their translation. The maximum value is also proportional to the number of translators working on the same task. Depending on the translator’s experience and on the difficulty of the text (factors described by the probability of word look-up parameter Plook ). which we call word look-up time tword look : It is the time needed to look-up the word in a dictionary and to decide the right meaning and translation.Effectiveness of the system 51 S Sentences . The amount of time needed to perform a translation is proportional to the length (in words) of the sentences to be translated. such sentence coherence-check time tsent coher is added for each of the translated sentences and ranges from a minimum to a maximum value in proportion to the number of sentences already translated. The number of translators (Ntrans ) is 1 by default. while the document to be translated consists of the (Nsent ) query sentences from Collection1. 1 out of 20 words (5%) requires such additional time. for the other parameters we simply specify the corresponding value (val ).8: Schema of a generic translation model deviation (σ). if any.

52
Input param Ntrans Nsent tword base Plook tword look Prec tword rec tsent coher Nread tword read val / x 1 400 2.5 0.05 10.0 0.0 → 0.1 1.0 0.0 → 5.0 5 0.2 σ 0.5 3.0 0.5 2.0 0.05

Approximate matching for EBMT
Description Number of translators Number of sentences to be translated Base word translation time Probability of word look-up Word look-up time Probability in recalling a word translation Word recall saved time Sentence coherence-check time (per translator) Maximum number of suggestions read Suggestion word reading time Man Ass

Table 2.2: Description of the simulation input parameters (time in seconds) greater is the coordination time needed to obtain a final coherent work. The following formula can be used to summarize all the contributions to the total document translation time for one non-assisted translator, while setting aside all probability considerations: tmanual =
sentences

tsent coher +
words

(tword base + tword look − tword rec )

(2.1)

As to assisted translation, its behavior can be modelled starting from the manual model but applying the following substantial modifications:
ˆ if the offered suggestions are error free (null distance to the query segments), they automatically translate the words of the source sentences that they cover. For this reason, all the “covered” words do not require translation time; ˆ all the “uncovered” words can be treated as they are in manual translation. The same applies for each of the erroneous words that need to be modified in the suggestions in order to produce the new translation; ˆ the suggestions do speed up translation but they also require time to be read and chosen: For this reason we introduce the tword read parameter representing the time needed to read one suggestions’ word, and the maximum number of suggestions to be read Nread (the default value of this parameter is 5, an optimal value as we will show later in the analysis of the simulation’s results); ˆ finally, the sentence coherence time is considerably reduced, since assisted translation suggestions automatically guarantee an optimal coherence on already translated segments and terminology and can, substantially, be ignored.

Effectiveness of the system
Scenario Manual, 1 translator Mean sentence translation time Max sentence translation time Total translation time Manual, 2 translators Mean sentence translation time Max sentence translation time Total translation time Assisted, 1 translator Mean sentence translation time Max Sentence translation time Total translation time Confidence interval 28.175 < µ < 28.347 76.057 < µ < 81.559 11270.125 < µ < 11338.425 30.836 < µ < 31.058 90.172 < µ < 98.228 6167.397 < µ < 6211.589 18.859 < µ < 19.037 76.856 < µ < 84.084 7542.862 < µ < 7615.218 x 28.261 78.808 11304.275 30.947 94.2 6189.493 18.948 80.47 7579.04 σ2

53

0.014 14.232 2193.06 0.023 30.505 918.107 0.015 24.561 2461.268

Table 2.3: Results of the simulation runs (time in seconds) The approximate formula that simplifies these assisted translation contributions to the total document translation time is the following: tassisted =
sentences s words

tword read +
u words

(tword base + tword look − tword rec )

(2.2) where s words denotes the words of the read suggestions, while u words denotes the query words which either are not covered by a suggestion or, while being covered, still need to be modified. In our simulation experiments, we considered three scenarios, manual translation with one or two translators and assisted translation with one translator, for each of which we were interested in three figures: The mean and maximum time required for the translation of one sentence and the total time required for the translation of the whole document. Then we chose a level of confidence (probability that a confidence interval or region will contain the true figures) of 95% from which we derived the confidence interval and the minimum number of runs. Table 2.3 shows the mean value (x) and the variance (σ 2 ) of the 10 runs we performed in order to estimate the three figures accurately. The time required by an unassisted translator to translate the query sentences in the given document is approximately 3 hours and 8 minutes; notice that, as expected, the time required to perform the same work by two translators working in team is not exactly half of this time but is slightly more, i.e. 1 hour and 43 minutes. This is realistic and is due to the overhead given by the coordination (or coherence check) between the translators working on different sentences in the same document: In fact, the

54

Approximate matching for EBMT
1 aut

12000 Total translation time (sec.) 10000 8000 6000 4000 2000 0

1 aut

1 man

2 man

7850 7800 7750 7700 7650 7600 7550 7500 7450

30000 25000 20000 15000 10000 5000 0

1 aut

1 man

10 0 15 0 20 0 25 0 30 0 35 0 40 0

50

0. 01 0. 02

0. 05

0. 4

0. 1

(a) Nsent

(b) Nread

(c) Plook

Figure 2.9: Trends of total translation time as obtained in the simulation runs by varying... mean and maximum translation time per sentence is higher in this scenario. Now, notice that the assisted translation time is slightly more than 2 hours (2 hours and 6 minutes) and is significantly closer to the performance of the team of two translators than to the one of the single translator. This is good proof of the significant improvement that can be obtained by employing assisted translation software but, overall, it quantifies the real effectiveness of the EXTRA translation suggestions. In particular, the time required to adapt the suggestions retrieved from past translations is not only significantly lower than the one required to produce the same translations from scratch, but the speed gain is also particularly consistent: The mean sentence translation time is by far the lowest of the three scenarios considered, and this corresponds to the highest per-translator productivity. Furthermore, notice that the maximum time required to translate a sentence in the given query document is also particularly close to the manual single translator scenario, this is because even for the query sentences for which many suggestions are available the time spent in reading them is limited by the maximum number of suggestions parameter. Figure 2.9 shows the trends of the total translation time for automatic and/or manual translation obtained by our simulation models by varying the number of query sentences (Nsent , Figure 2.9-a), the maximum number of suggestions to be read (Nread , Figure 2.9-b), and the probability of look-up (Plook , Figure 2.9-c) parameters. Notice that the variation of the number of sentences produces a sort of scalability graph, with linear trends for the three models: The automatic translation trend stands, as expected, between the two manual ones. Figure 2.9-b demonstrates the trade-off between the time saved by exploiting the translation suggestions and the time spent in reading

0. 2

0. 3

0. 5

1

2

3

4

5

6

7

8

9

0

Efficiency of the system

55

and selecting them: The best trade-off is given by reading (and presenting) 4 or 5 suggestions at the most, therefore such values are the optimal ones that have been used in the other assisted translation simulations and that can be used in EXTRA itself in order to deliver a balanced suggestion range to the user. Finally, Figure 2.9-c shows the trends of total translation time by varying the Plook parameter, which may represent the ability and experience of the translator: In this case, the graph shows that the translation assistance is equally useful both for experienced translators (Plook = 0.01) and inexperienced ones (Plook > 0.3).

2.5.4

Efficiency of the system

In this section we analyze the efficiency of the document analysis and similarity search techniques. Figure 2.10 shows the results of the tests we performed in order to estimate the running time of our algorithms on Collection 1 (400 query sentences) and on Collection 2 (421 query sentences). In particular, in subfigure 2.10-a, the time required for whole and sub2 matching on the two collections is shown for three different configurations of the relative distance parameters d and dSub. Notice that, for these tests, the two search techniques were applied sequentially: sub2 matching was applied to the sentences that were not covered by whole suggestions, as this is the most popular scenario for translators. Further, we only performed tests with d less or equal to 0.3, since greater values would have led to almost useless suggestions. Translators do indeed usually set the value of d and dSub from 0.2 to 0.1. The algorithms perform efficiently for all the parameters’ settings, with total running time (whole + sub2 ) of less than 3 seconds for Collection1 and less than 12 seconds for Collection2 in the most demanding setting. The value of d = dSub = 0.2 proves to be the optimal one since, while delivering a very good coverage, nearly the same as d = dSub = 0.3 (see Section 2.5.3), it also enables particularly fast response time (30% faster than the 0.3 setting). Figure 2.10-b shows the scalability of the total matching time on the two collections. It shows that, in both cases, time grows linearly with the number of query sentence pairs. For further discussion on the matching algorithms performance see Section 1.3, where the performance trends were observed in a generic environment under all the different parameters associated to the problem, and the minimum length minL and the values of z-gram size are also considered. The fast running time of the EXTRA search algorithms is mainly due to the good performance of the filtering and document analysis techniques that we presented in the previous sections. The tests results are depicted in Figure 2.11-a and show the impact that the whole and sub2 filtering and the document analysis techniques have on the total time shown in the previous

56
10,000 9,000 8,000 7,000 Time (mSec) 6,000

Approximate matching for EBMT

10,000 9,000 8,000 7,000 Time (mSec) 6,000 5,000 4,000 3,000 2,000
Whole matching 390 422 437 Sub^2 matching 532 1,312 1,950 Collection1 Whole matching 4,719 6,766 9,047 Sub^2 matching 2,605 2,957 2,709 Collection2

5,000 4,000 3,000 2,000 1,000 0

1,000 0 Collection1 Collection2 25% 211 3,928 50% 554 5,743 75% 1,027 7,541 100% 1,734 9,723

d, dSub 0.1 d, dSub 0.2 d, dSub 0.3

(a) Whole and sub2 matching time

(b) Total matching time scalability

Figure 2.10: Running time tests

experiment, where all the filters were on and stemming was performed on all the query and the TM sentences. Notice that the graph employs a logarithmic scale. Disabling stemming, but keeping the filters enabled, produces a total running time of nearly 22 seconds for Collection1 and approximately 185 seconds for Collection2, more than 20 times higher than in standard configuration; conversely, disabling filters, but keeping the stemming enabled, produces a huge performance loss, with total searching time of more than 2 minutes and 20 minutes, respectively, for the first and second collection. Thus, filters have great impact on the final execution time of the search algorithms. By enabling all the available whole and sub2 filters allows the system to reduce the overall response times by a factor of at least 1:70. As to document analysis, it can be extremely useful not only for enhancing the effectiveness of the retrieved suggestions (see Section 2.5.3) but also for a clear increase in performance. Notice that the document analysis time is not included in the total response time, since this is not strictly related to the search algorithms; such time is shown for the query sentences of the two collections in subfigure 2.11-b, both for stemming and WSD. In particular, the “no analysis” time corresponds to the time required to read the documents, extract their sentences and store them in the DB, along with their z-grams. For instance this time is 6 seconds for the 400 query sentences in Collection1. By enabling stemming, such time increases by only 2 seconds, thus showing the performance of over 200 sentences per second offered by our stemming algorithms. The same graph also reports on the WSD analysis time; such analysis is much more complex and, therefore, the time required for it is higher than for stemming. However, 10 sentences per second are

Comparison with commercial systems
10,000,000 1,000,000 50 100,000 Time (mSec) 40 10,000 1,000 100 10 1 Filters + Stem Stem Off Filters Off Time (Sec) 30 20 10 0 No analysis Stemming Stemming + WSD

57

60

Total Time 1,734 21,812 128,453 Collection1

Total Time 9,723 185,320 1,250,438 Collection2

Collection1 6 8 51

Collection2 7 9 54

(a) Total time with / without filters and stemming

(b) Query document analysis time

Figure 2.11: Further efficiency tests: impact of the filtering and document analysis techniques still elaborated (approx. 50 seconds in total) and such time is still quite low, especially if we consider that WSD can prove valuable to achieve an optimal effectiveness in the search techniques (see Section 2.5.3).

2.5.5

Comparison with commercial systems

In this concluding section, we present a series of tests that we performed in order to delve into the suggestion search process of commercial EBMT systems and to compare the performances of such systems with those offered by EXTRA. The softwares analyzed are two of the most successful products in this field: Trados version 5 and IBM Translation Manager 2.6.0, which is no longer commercially available but is still widely used. On the suggestion search process of commercial EBMT systems Software producers usually provide few or even no information about the techniques adopted in their systems. As the role of commercial systems in the computer-aided translation field cannot be ignored, we tried to understand the principles underlying the suggestion search process of two of the most diffused EBMT systems through a series of targeted tests. In these experiments we initially build a translation memory using Collection1, then we submit a new text containing ad-hoc modifications of the reference sentences or taken from the query sentences of the same collection. Finally we pre-translated it and analyzed the results.

After several tests we came to the following conclusions: ˆ the programs identify the different parts up to a certain level of modifications. we initially modified (deleted) a word from it to find out if the systems were still able to identify the match and. the search algorithms appeared to be order sensitive. Trados identified and displayed the moved segments with a particular specific color. In this case. Then we modified (deleted) other words and analyzed the penalty trend.58 Approximate matching for EBMT The first series of tests concerns the way past translations are logically represented and the penalty scheme adopted by the two systems to judge the differences between the submitted text and the TM content. how much the matching score was penalized. The text was created in three different ways: ˆ separation with a conjunction (“s1 and s2 ”). In the second series of experiments. where the new term is completely unconnected to the original one from a linguistic point of view. the systems give an equal penalty to the modification of “graphics” in “graphic” and of “graphics” in “raphics”. ˆ inclusion of one sentence s in s1 previously separated with a comma (“s1 . we inverted some of the positions of the words in a given sentence in order to verify if the comparison mechanism is sensitive to the order of words. we inverted the terms “graphics tool” in “tool graphics” and “ease and precision” in “precision and ease”. and not simply based on a “bag of words” approach. In particular. For example. if so. similarly to EXTRA edit distance approach. s2 ”). the systems seem to perform no stemming to the TM and query sentences’ words. we joined two whole sentences (s1 and s2 in the following) contained in the TM and tried to pre-translate the newly created sentence in each of the systems. s2 ”). In these experiments. we tested the ability of such systems to identify interesting parts in the stored text. Starting from a sentence in the TM. Further. The penalties given seem to depend not only on the number of modified (deleted) words but also on the length of the original sentences. while they should penalize much more heavily the second variation. s. since it is able to identify the suggestions from the two original sentences by exploiting our sub2 matching . ˆ separation with a comma (“s1 . ˆ unlike the EXTRA approach. 1 1 These scenarios create no problem to EXTRA. Finally.

In particular. in all three cases Trados was not able to automatically retrieve any suggestions. adding the comma and the conjunction as permanent separators. Pre-translation coverage and efficiency In order to evaluate coverage and efficiency. (s)he would have to change the segmentation rules. manually selecting the partial segments to look for in the TM. Some limitations still remain: The system is not able to suggest the interesting part of the TM sentence that matches a part of the query sentence. To retrieve the suggestions to our queries. for the segments to be identified. Translation Manager tries to find more suggestions for the unknown sequences of words by re-analyzing the sequences of which a segment consists. or. The only way to automatically retrieve matches between segments is indeed to insert the interesting parts. in the TM and to split the query sentences in multiple segments.7 s 59 Exact Match Fuzzy Match Total Coverage Time Table 2. for example. the user would have two possibilities. Furthermore. but it just presents the whole sentence to the user.96% 14. otherwise they cannot be retrieved by the engine.30% 1. In particular we kept the distinction made by the two programs between ex- . s.74% 19.91% 24.79% 26. already segmented. As to commercial systems.4: Pre-translation comparison test results for Collection1 algorithms. The Translation Manager similarity model seems to be more complex than the Trados one: While Trados does not analyze further unknown sentences. then it 1 1 specifically restricted the search to the contaminating segment (s). they must match the majority of a TM sentence. we found out that the system was not able to dynamically segment the query and/or the examples in order to find sub-matches. Translation Manager was able to partially solve the three queries we submitted. we performed the pre-translation of Collection1 with both systems and examined how many sentences were matched with the ones in the TM and the time that it took for the search.70% 23 s TrMan 1. both of them being quite impractical: To translate in Concordance mode. The problem is that the user does not know which segments could match and adding static segmentation rules would not be a general solution. s2 ” the system first identified the sentence s1 . in the sentence “s1 . For example.70% 3. in batch mode.5 s EXTRA n/a n/a 74.Comparison with commercial systems Trados 4.

the level of coverage provided in EXTRA by our whole and sub2 matching algorithms is nearly three times higher and guarantees a much more accurate suggestion retrieval. but Translation Manager is able to find almost twice the number of fuzzy matches and ultimately proves to have the most effective similarity search engine between the two commercial systems. The results are shown in Table 2. For EXTRA we considered the default parameters’ values and. notice that the time required for the pre-translation operation is quite different for the three systems. the results obtained by EXTRA on the same collection. for ease of comparison. . further. for the time comparison. On the other hand. we considered the total time for whole and sub2 matching. The quantity of exact matches favors Trados. where we also report.4. the EXTRA search algorithms are more complex but do not require more time. the distinction between exact and fuzzy match does not apply to our system.60 Approximate matching for EBMT act match (with a similarity greater than 95%) and fuzzy match (from 50% to 95% of similarity). the EXTRA search techniques do indeed actually reveal to be the most efficient ones. Parts with a smaller similarity were categorized as “not found”. Finally. In particular.

The preference access channel is the Internet. For . digital libraries represent a concrete alternative to traditional libraries. where the information is stored in digital formats and accessible over a network” [136]. for instance. However. re-elaboration and distribution of other people’s work. new software systems and new laws for copyrighted work copy prevention and new technologies developed in order to limit the access and/or inhibit the copy of digital information. especially for reference needs. such a wide web of data portals makes it much easier to collect and distribute duplicates. providing encryption. mainly textual data. and so on. the main drawback of these approaches is that. which reduces the marginal costs of the distribution of digital contents and makes their fruition readily and easily accessible to the users. As already observed in [22]. while never being completely secure. Nowadays. This gives rise to problems of protecting the owners of intellectual property. In the last years. An informal definition of a digital library is that of a “managed collection of information. One of the reasons for this danger is that electronic media facilitates malicious users in the illegal appropriation. Much work has been done in this direction. One possibility of addressing these problems lies in preventing violations from occurring. along with its intrinsic property of extraordinary accessibility have made the building of digital libraries easier. involving. the sheer size of such amount of electronic information. with associated services. new hardware for authorization. created the right conditions for the global diffusion of data sources for nearly every topic.Chapter 3 Approximate matching for duplicate document detection The recent advances in computing power and telecommunications and the constant drop of the costs for Internet access and data management and storage. they are quite cumbersome and impose too many restrictions to the users.

is that of proper duplicate detection techniques [22. Another approach. Moreover. On the other hand. a good trade-off between protecting the information and ensuring its availability can be reached by using duplicate detection approaches that allow the free access to the documents while identifying the works and the users that violate the rules. The level of security delivered by duplicate detection techniques is variable and is strictly correlated to the scheme adopted for the comparison of documents. by rearranging portions of text). 121]. A duplicate detection technique ensuring a good level of security can thus be employed as service of an infrastructure that gives users access to a wide variety of digital libraries and information sources [22] and that. as it has been recently highlighted in [83]. a duplicate detection service could be supported by a number of other important services. For instance. duplicates are widely diffused for a number of reasons other than the il- . the notion of security represents how hard it is for a malicious user to break the system [22]. Digital libraries are open and shared systems. since copies are typically what we preserve. 23. works that are copy protected are less likely to survive into the future [21]. Nonetheless. Indeed. Obviously. copy prevention is not properly the best choice for the digital library context. the use of duplicate detection techniques is not limited to the safeguard of the intellectual property. it has been noted that. the source will be known.g. since a few insignificant changes to a document may prevent it from identifying it as a duplicate. in a digital library context. accessibility and flexibility of digital libraries. In contexts requiring secrecy. Further. subsets. where a publisher adds a unique signature in a document at publishing time so that when an unauthorized copy is found. 66.62 Approximate matching for duplicate document detection these reasons. protects the owners of intellectual properties by detecting different levels of violation such as plagiarisms. such as encryption and authorization mechanisms. watermarks-based protection systems can be easily broken by attackers who remove embedded watermarks or cause them to be undetectable. These techniques are able to identify such violations that occur when a document infringes upon another document in some way (e. the application of too many and too strict limitations in such a domain often results in the paradoxical and intolerable result of severely limiting the usefulness. For a duplicate detection technique. stability. which is the one we advocate. In our opinion. at the same time. 33. which would help too in the protection of intellectual property. any approach relying on an exact comparison of documents can not be particularly secure. overlaps and so on. One of the techniques following this approach relies on watermark schemes [78]. supposed to be designed to advance and spread knowledge. thus also limiting the preservation of our cultural heritage. not all the duplicate detection approaches present the same level of security.

being the document collection of a digital library the union of data obtained from multiple sources. in Section 3. various summarization levels. As a consequence.1 we define the new document similarity measures for duplicate detection. partially overlapping data sources) or different versions of a document may be available (i. other schemes relying on crisp similarities. we address efficiency and scalability by introducing a number of data reduction techniques that are able to reduce both time and space requirements without affecting the good quality of the results. it usually presents an high level of duplication. as well as the amount of space required to store the logical representation of documents (intra-document reduction) and the document search space (inter-document reduction). it is evident the need of a solid pair-wise document comparison scheme which is able to detect different levels of duplication. The security delivered by such approach is particularly improved w.3 we discuss related works. For a duplicate detection technique to be secure. multiple revisions and updates over time. In order to do it. mirror sites. The comparison of the information conveyed by the chunks allows us to accurately quantify the similarity between the involved documents. “Legal” duplicates do not give any extra information to the user and therefore lower the accuracy of the results of searches in a digital library. This is achieved by reducing the number of useless comparisons (filtration).e. Further.r.2 presents the data reduction techniques. . devising effective similarity measures that allow us to accurately determine how much a document is similar to (or is contained into) another. The chapter is organized as follows: In Section 3. Section 3. In Section 3. in this chapter we specialize them to the document comparison scenario. thus detecting different levels of duplication and violations. ranging from (near) duplicates up to (partial) overlaps.4 we show the DANCER architecture and the results of some experiments. our pair-wise document comparison scheme tries to detect the resemblance between the content of documents. since it is able to identify with much more precision and reliability the actual level of overlap and similarity between two documents. Finally. By exploiting the approximate matching techniques described in Chapter 1.e. The availability of duplicate detection techniques represents an added value for a digital library search engine as they improve the quality and the correctness of search results. we do not extract stand-alone keywords but we consider the document chunks representing the contexts of words selected from the text. Conceptually.t.63 legal ones: the same document may be stored in almost identical form in multiple places (i. different formats). Such techniques have been implemented in a system prototype named DANCER (Document ANalysis and Comparison ExpeRt) [93].

The following subsection describes the logical view we adopt to represent documents.1. one chunk can be equal to one sentence. chunks undergo some filtrations to improve the resilience of our approach to insignificant document changes.1 Logical representation of documents The resemblance and containment measures compare the content of pairs of documents getting up to a specific level of detail. k sentences. whereas length filtration allows us to perform an initial selection amongst the chunks by applying a length threshold expressing the minimum length (in words) of . of paragraphs. Our measures do not actually compare documents by analyzing their original formats but their logical representation. Unlike previous approaches. documents that are to be compared undergo a “chunking” process [121] where they are broken up into more primitive units. Starting from their original formats. sequences of words. we need to establish measures that effectively quantify the level of duplication between two given documents. For the above reason. The similarity measures we are going to define capture the informal notion of “roughly the same” (resemblance measure) and “roughly contained” (containment measure) in a rigorous way.64 Approximate matching for duplicate document detection 3. two sentences. but also to one paragraph. We agree with the definition given in [33] where Chowdhury et al. Such two properties are essential for the comparison scheme we are going to introduce since it relies on the correlation of the information conveyed by the chunks. state that if a document contains roughly the same content it is a duplicate whether or not it is a precise syntactic match. with syntactic filtration we remove suffixes and stopwords from the document chunks and stem the chunks by converting the set of the remaining terms to a common root form [7]. of sentences.g. and so on) known as chunks [22] or shingles [33]. in the following the chunking unit will be one sentence unless specified.1 Document similarity measures In order to deal with the problems we outlined. Documents can have different formats. For the sake of simplicity. single words cannot be considered as chunks and the smallest unit is the sentence which represents a context of words whose order determines the meaning. and so on. we will only consider units having a stand alone meaning and defining a context. The unit of chunking determines the logical representation of documents. 3. Once documents have been broken up into an initial set of chunks. Thus. In particular. In particular. documents contain some structural information which can be exploited in order to divide them into well defined units (e. three sentences. The definition of what constitutes a duplicate is at present unclear.

and corresponding permutation Figure 3. D2 = BAC.The resemblance measure 65 Di c1i c2i c3i c4i c5i Dj c1j c2j c3j c4j c5j c1i c2i c3i c4i c5i Di pm(2)=1 pm(1)=2 pm(4)=3 pm(3)=4 pm(m)=5 Dj c2j c1j c4j c3j cmj . n} will denote the corresponding index set and |ck | will denote the length in words of the chunk ck . . In the following In = {1. . (b) . More formally.2 The resemblance measure Before introducing the similarity measures.. . cni chunks . Intuitively. . the above comparison approach requires two documents Di and Dj logically represented by the sequences c1 . . pm(5)=m chunks c5j (a) Mapping between chunks. . D3 is contained in D0 whereas D4 contains D0 .1. cni chunks .. sentences) indicated with upper case letters ABC. D5 = A .. and D6 = A BC where n times chunks identified by different capital letters are completely different whereas A is a “variant” of A. . . cm .. Informally. we give an intuition about what. 3.. and a similarity measure between chunks sim(ck . . we start from a document D0 to be compared to a generic document Di .. To this end. D5 is somewhat similar for low n and not very similar for high n.. . .g. cn extracted from D. D4 = ABCD. .1: Representation of a mapping between chunks the worth surviving chunks. in our opinion.. Let D0 be represented by a sequence of chunks (e. D3 and D4 are quite similar to D0 where D3 (D4 ) is the more similar the shorter is the chunk C (D) with respect to A and B. ch ) stating how j i much two chunks are approximatively equal (an effective similarity measure will be introduced in the following). . The resulting logical representation of a document D consists of a sequence of filtered chunks c1 . D6 is the more similar the more A is similar to A. as in [121]. good measures of resemblance should satisfy.. we consider the following documents: D1 = ABC. cmj chunks . cn and c1 .. D3 = AB.. rej j i i spectively. A . . . . . a possible similarity mea- . we would expect to obtain the following suggestions: D0 is the exact duplicate of D1 and D2 where the latter is a chunk level rearrangement of D0 . ..

1) |ck | i k=1 + Notice that the similarity measure is defined in order to return a similarity score ranging from 0 (totally different contents) to 1 (equal contents. cn . the similarity Sim(Di . Dj ) | pm is a permutation of Im } where n |ck | + |cj m i Simpm (Di . ch ) i j j i j between chunks. Dj ) between Di and Dj is the maximum of the following set {Simpm (Di . . |D| = 7. Dj ) = k=1 n m p (k) | · sim(ck . and |A | = 8. |B| = 8. Let us suppose that the lengths of the involved chunks expressed in terms of word numbers is the following: |A| = 10. We look for the permutations pm maximizing the overall document similarity obtained by combining the similarities between the chunks in the same position (see Figure 3. 3. will be 1 if ck = ch . Moreover. such that m ≥ n.e. Definition 3. similar small parts of the two documents will have a low weight in the computation of the similarity.1 (Resemblance measure) Given two documents Di = c1 i . and let us consider the index set Im of Dj .1 Let us consider the situation depicted at the beginning of the present section. Dj = c1 . The j i permutation on the chunk indexes of the longest document maximizing the . m > n.66 Approximate matching for duplicate document detection sure between documents satisfying the requirements stated above is the one that looks for the mapping between chunks maximizing the overall document similarity (see Figure 3. A ). and a similarity measure sim(ck .1 weights similarities between chunks p (k) on the basis of their relative lengths denoted with |ck | and |cj m |. ch ) j i between all chunks but sim(A. i in the denominator we sum the lengths (in words) of the two documents whereas in the numerator we only sum the lengths of the matching chunks.a). .1. In this way. cj m i |ck | j k=1 p (k) ) (3. Indeed. also rearranged). Its formalization is given in the following definition. We remind that different capital letters correspond to completely different chunks whereas A is a “variant” of A. 0 otherwise. then a permutation of Im is a function pm allowing the rearrangement of the positions of the Dj chunks. i. Example 3. Let us suppose that the longest document in terms of number of chunks is Dj . Intuitively.1. the relative lengths of the non-matching parts affect the resulting score: the bigger they are the lower is the resemblance value. the best mapping is the one which associates the (approximatively) equal chunks in the two documents. cm . notice that Eq. Further. The statement can also be formulated in the following way. |C| = 2. the similarities sim(ck . . Thus.b). .

the more A’ is similar to A the more the score Sim(D0 .A )+(8+8)∗1+(2+2)∗1 (10+8+2)+(8+8+2) Notice that. they state that “the unit of chunking is critical since it shapes the subsequent overlap search and storage cost”. being the simplest and most straightforward measure to define. comparison between chunks is usually quantified by means of a crisp function. As to D5 . e. notice that the score is 0. 121]. For these reasons. For instance.667 (10+8+2)+n∗10 ≥ 0. 3 → 3} Identity Identity Identity Identity Similarity with D0 1 1 (10+10)∗1+(8+8)∗1 = 0. Indeed.851 (10+8+2)+(10+8+2+7) (10+10)∗1 ≤ 0.4 when n = 3 and so on. we consider a chunk as a sequence of terms and we exploit the edit distance metric. In the literature. In our opinion this measure has one major drawback: The obtained scores are completely dependent on the size of the chunks and are not satisfyingly precise in any case.811.667 when n = 1. such as words. 0. A ) = 0. On the other hand. then we obtain 0. In this context. D0 is estimated to be equal to D1 and D2 . would produce a very large number of small chunks. Such a problem has been recognized in other related papers [22. any change. we propose the introduction of a chunk comparison function that goes beyond equality by analyzing their contents and computing how much they are similar. too small to be able to truly capture the contents of the document. D3 and D4 are both quite similar to D0 but D3 is more similar to D0 than D4 since C is shorter than D. propose a copy detection system and measure the notion of security in terms of how many changes need to be made to a document so that it will not be identified as a copy. 66. as we expected. 2 → 1.9 then the similarity score is 0.526 (8+10)∗sim(A. the .953. To this end. even to a very limited portion of the documents. Finally.The resemblance measure 67 similarity with D0 and the resulting score are summarized in the following table: Doc D1 D2 D3 D4 D5 D6 Permutation Identity {1 → 2. The resemblance measure introduced above relies on a comparison measure between chunks.947 (10+8+2)+(10+8) (10+10)∗1+(8+8)∗1+(2+2)∗1 = 0. whereas if the chunk similarity is lower. where in D2 the D1 chunks are simply rearranged. D6 ) is high.5 when n = 2. by using large chunks and a crisp chunk comparison measure. For instance. by using chunks smaller than the sentence. 23. In previous chapters. if sim(A.6. it is 0.g. 0. in [22] GarciaMolina et al. would increase the probability of missing actual overlaps thus considerably decreasing the security.

beside the above resemblance measure. . . . such that m > n. . m} be the corresponding index sets. . . . since it is much more independent from the chosen size of the chunks and it is able to identify with much more precision the actual level of overlap and similarity between two documents. cn and Dj = c1 . cj m i for each k ∈ [s. The security delivered by such approach is particularly improved w.r.1. Then: ˆ The maximum contiguous part in Di overlapping Dj w. This gives rise to the maximum contiguous overlap measures.3 Other possible indicators In a duplicate detection context. . other schemes relying on crisp similarities.t. pm is deDj noted M axOverlap(Di )pm and is defined as:   maxs. cj ) = (3.2) 0 otherwise By computing Eq. we are able to obtain a good level of effectiveness since we perform a term comparison without giving up the context-awareness which is guaranteed by the use of chunks.68 Approximate matching for duplicate document detection edit distance has proved to be a good metric for detecting syntactic similarities between sentences.4 for an extensive experimental evaluation).3) )>0 where [s.2 (Maximum contiguous overlap w.t. . Another possible similarity measure is an asymmetric one representing the length of the maximum contiguous part of a document (approximately) overlapping another document. e].e] |ck | i p (k) (3.|ck |)) ≤ t h |. n} and Im = {1.1 of the resemblance measure with the above similarity measure between chunks.e∈In :s≤e  k∈[s. . .r. a permutation) Let Di = c1 . Given two chunks ch ∈ Di and ck ∈ Dj and an edit distance threshold t.ck ) 1 − ed(ch . 3. 3.ck ) i j j if max(|ci |.t. we extend the concept of sentence to that of chunk having a stand alone meaning and defining a context. Let sim(ck . . thus detecting different levels of duplication and violations (see Section 3. . e] is a sequence of indexes in In such that sim(ck .r. In this context. and i i j j let In = {1.|ck |)) h h k max(|ci j i j sim(ci . we define i j the similarity between ch and ck in the following way: i j  ed(ch . cm be two documents. ch ) be a similarity measure between chunks and pm be a permutation i j on Im . Definition 3. some applications could also be interested in other indicators.

Dj ) | pm is a permutation of Im } where n |ck | · sim(ck . ch ) j i j j i between chunks. 3.e] p (k) | (3. the measure sim(D0 .e]  i j |cj m k∈[s. . When it is computed w. Dj ) = k=1 n p (k) ) (3. It can be measured by introducing an asymmetric measure as variant of the resemblance measure: The containment measure.947 is indicative of a high similarity but gives no idea about the maximum overlap.r. As we have already shown.2 Let us reconsider the document set of example 3.r. . For a deeper analysis we compute M axOverlap(D0 )D2 = M axOverlap(D2 )D0 = (10 + 8) = 18.6) |ck | i k=1 . such as an entire paragraph copied from one document to another.e∈In :s≤e∧sim(ck . it is an additional indicator that helps to understand the meaning of the score of the resemblance measure.cpm (k) )>0∀k∈[s. the final document. cj m i i aSimpm (Di . Dj ) of Def. Example 3. Otherwise. cm . pm is denoted M axOverlap(Dj )Di and is defined as: pm   maxs. Such scores state that the content of D2 is approximately a copy of a great part of D0 . . and a similarity measure sim(ck . . Dj = c1 . The formula to be computed is: M axOverlap(Di )Dj = maxpm M axOverlap(Di )Dj pm D (3. it can provide a different score quantifying the maximum contiguous part overlapping another document.4) The role of the above measure is twofold. the two scores M axOverlap(Di )pm and M axOverlap (Di )Dj are different.1.3 (Containment measure) Given two documents Di = c1 i .1. such that m > n. Dj ) of Di in Dj is estimated by the maximum of the following value set {aSimpm (Di .Other possible indicators 69 ˆ The maximum contiguous part in Dj overlapping Di w.r.t. in general. For instance. even if the relative size of the copied part is very small w. D3 ) = 0.5) j Notice that.t.t. one of the permutations pm associated to the maximum value Simpm (Di . Definition 3. cn . the containment aSim(Di . such score can help to easily identify partial plagiarisms. Another useful indicator is the one telling how much one of the two documents is contained in the other.

and aSim(D4 . D0 ) and aSim(D0 . ˆ inter-document reduction. with clustering the system analyzes a large set of documents and collects them into clusters. 10+8+2 10+8+2+7 In particular. In a similarity search context.2 Data reduction So far we have introduced a similarity measure between documents and we have given an intuition of its efficacy (more details will be given in Section 3. we address the problems of efficiency and scalability by introducing three fine tunable approaches for data reduction. given a query document. in particular. in turn. the containment of D3 in D0 and of D0 in D4 and viceversa. In this section.74. aSim(D3 . we assume that the detection of duplicates relies on the resemblance measure given in Def. i Example 3. They are orthogonal and thus fully combinable. is a time consuming operation. The order of presentation corresponds to the increasing impact on the document search space. More precisely.9.70 Approximate matching for duplicate document detection The containment aSim(Dj .1. D0 ) = 10∗1+8∗1 = 1. to reduce the number of chunks in each document. D3 ) = 10∗1+8∗1 = 0. Filtering . is fully contained in document D4 . The approaches we devised are the following: ˆ filtering. we concentrate our attention on similarity search and clustering in a document search space. we are aware that the comparison of documents. However. 10+8 10+8+2 aSim(D0 . in particular for large collections.4). aSim(D0 . the system retrieves the documents in the search space whose level of duplication with the query exceeds a given similarity threshold (similarity search) or which are the most “duplicate” ones (k-NN search). On the other hand. that is collections of duplicate documents that can thus be treated collectively as a group. The scores are aSim(D3 . 3. Di ) of Dj in Di can be obtained by substituting p (k) ck with cj m .1 and.3 Let us consider again the document set of example 3. With the term “data reduction” we mean the reduction of the number of comparisons performed by such techniques which rely on a similarity function to detect duplicates. D4 ) = 10∗1+8∗1+2∗1 = 1. D4 ) state that document D3 is fully contained in document D0 which. D0 ) = 10∗1+8∗1+2∗1 = 0. ˆ intra-document reduction. to reduce the number of stored documents. In both cases. 3. to reduce the number of useless comparisons.

r. The mapping computation phase is the second and last phase . The tests we performed and which will be presented in Section 3. the number of required comparisons between chunks (sentences) can be reduced from hundreds of thousands to some dozens of comparisons.4 show that the similarity measures are robust w. In this section. The pair-wise document comparison scheme introduced in Def.2 allows the identification of pairs of chunks similar enough. inter-document reduction approximates the document search space representation by pruning out less significant documents while maintaining significant ones. thus exponentially reducing the required time. and length filtering. we consider the application of filtering techniques based on (dis)similarity thresholds for the comparison of both documents and their chunks. without missing any of the final results. Because of the way such filters work it is not possible to give a theoretical approximation of the computational benefits offered by such techniques: In fact all such filters’ behavior is dependent on the data on which they are applied. i. Exploiting chunk filters allows us to greatly reduce the costs of approximate chunk match.1 Filtering Filters allow the reduction of the number of useless comparisons while always ensuring the correctness of the results. although filters reduce the document search space.e. we will refer to such filters as chunk filters. 3. 3.1 and its variants essentially try to find the best mapping between the document chunks by considering the similarities between all possible pairs of chunks themselves.Filtering 71 leaves it unchanged since. in a typical case. 3. Indeed. Since the edit distance threshold t in Eq. such three different data reduction techniques and that a good trade-off between the effectiveness of the measures and the efficiency of the duplicate detection technique can be reached as data reduction allows us to decrease requirements for both time and space while keeping a very high effectiveness in the results. the results offered by the length filter are strictly influenced by the distribution of the sentences’ lengths. they ensure no false dismissals. For instance.t. we can ensure efficiency in the chunk similarity computation by applying the filtering techniques described in Chapter 1 before the edit distance computation: count filtering. filtering is based on the fact that it may be much easier to state that two data items do not match than to state that they match. the first phase of the document similarity computation. intra-document reduction introduces an approximation in the logical representation of documents by storing a selected number of chunk samples. In the following. For instance. position filtering.2.

for each document D in the collection check whether aSim(Q. As for chunk filters. Dj ) = n k=1 m h=1 |ck | · sim(ck . further details on the approach we exploit in order to solve such mapping problem. Q) > s) which quickly discards documents that cannot match. . Di ) ≥ s then one of the following two conditions holds: ˆ aSim(Di . The intuition is that two documents that are very similar have a great number of similar chunks. Theorem 3. since the number of document pairs having at least one similar chunk is very high. Consequently. we can first apply the filter shown in the theorem (i. ˆ aSim(Di .e. thus also discussing its complexity. ˆ aSim(Dj .4.1. cn .1 Given a pair of documents Di = c1 . In other words.e. Dj = c1 . The basic idea of our filter is to take advantage of the information conveyed by the similarities between the involved chunks. Dj ) ≥ s. we introduce a filtering technique which is able to output a candidate set of document pairs. will be given in Section 3. whenever we solve range queries where a query document Q and a duplication threshold s are specified. Dj ) ≥ s. . Di ) ≥ s where aSim(Di . ignoring possible mappings between the chunks. then if at least one of the following three conditions holds: ˆ Sim(Di . .72 Approximate matching for duplicate document detection in the document similarity computation. we ensure no false dismissals. cm . D) > s or aSim(D. Then. ˆ aSim(Dj . Dj ) ≥ s. In order to improve the efficiency also in this phase. . that none of the pairs of documents whose similarity actually exceeds a given threshold is left out by the filter. we compute the resemblance level between the query document and each document in the resulting candidate set. i. 1]). j j i i where m ≥ n and a minimum document similarity threshold s (s ∈ [0.7) The above theorem shows the correctness of the filter we devised for data reduction. it is not possible to define an a priori formula quantifying in general terms the reduction offered by the document filter to the computational complexity of the mapping computation . ch ) i i j n k k=1 |ci | (3.

both in efficacy and efficiency. for document collections of different size and type (see Section 3.Intra-document reduction 73 between the chunks. For these reasons. On the other hand. Even though it is a simple idea. It acts by selecting the longest chunks of each document. By following the arguments given in [61]. Moreover. we consider two different techniques: length-rank chunk selection and chunk clustering.4 for more details on the results). The number of chunks selected from each document is the chunkRatio percentage of the total number of its chunks. we found it to be particularly effective. .4 we quantify the effectiveness of the chunk filters and the document filter in the reduction of the number of comparisons.2. we thus ignore the similarities between small chunks and stress the similarities between large portions of the involved documents. Indeed. especially for real data sets as shown in Section 3. In Section 3. it prunes out small chunks which have a little impact on the document similarity computation.4 we describe the effect of such filter by means of experimental evaluation on different document collections. With length-rank chunk selection.4. 66]. Being the chunk filters and the document filter correct. we state that the required space is bounded by some linear function of q times the size of the corresponding chunks. they reduce but do not approximate the document search space. in Section 3. the adoption of the chunk filtering techniques requires a space overhead to store the qgram repository. since the improvement is dependent on the data which it is applied to. To this end. as it is weighted on the chunk lengths. From our tests. we infer that filters ensure a small number of false positives as in most cases the level of duplication between documents is heavily dependent on the number of similar chunks. it works better than sampling. In particular. keeping good quality results while significantly reducing computing time and required stored space. small chunks can also be source of noise. Each of these techniques is completely fine tunable to achieve good results. Length-rank chunk selection The length-rank chunk selection is a variant of the chunk sampling approach previously addressed in [22. 3. Both act by selecting for each document the percentage of its chunks specified by a given reduction ratio chunkRatio. it aims at selecting similar chunks from similar documents. Finally.2 Intra-document reduction The intra-document reduction aims at reducing the number of chunks per document which have to be stored and compared.

For each cluster γ. 3. The intuition is that. given a document D containing n chunks. Dj = 1 m k k k k γj . Dj ) between Di and Dj is the maximum of the similarity scores between the two documents. if a document contains two or more very similar chunks. we keep some features which will be exploited in the document similarity computation. also the total length |γ| in words and the number N of the chunks in γ are considered. Ni(j) . To this end. Rj m m k=1 k |γj | p (k) ) |γik | + (3. More precisely. in order to weight the γ contribution to the document similarity. γj where the k-th cluster γi(j) in Di (Dj ) is a tuple (Ri(j) . we start from two documents Di and Dj of n and m chunks.1 is required. it follows that if n ≤ m then n ≤ m . we devised two variants of Eq. . and length |γi(j) |. In the revised form of the resemblance measure of Def. chunk clustering is able to choose the “right” representatives. each one computed on a permutation pm on Im . Its effectiveness is mainly due to the availability of the proximity measure defined in Eq. To this end. By featuring the similarities among chunks. chunk clustering stores just one of them. More precisely. 3. among those proposed in the literature. respectively. . 3. 3.74 Approximate matching for duplicate document detection Chunk clustering The chunk clustering is the process of cluster analysis in the chunk search space representing one document. which have been clustered in n = chunkRatio ∗ n and m = chunkRatio ∗ m clusters. Moreover. γin . and thus are represented by the sequences Di = γi1 .1. documents are no longer represented as sequences of chunks but as sequences of chunk clusters and an adjustment of the resemblance measure of Def. the similarity Sim(Di . we experimentally found out that the most suitable for our needs is a hierarchical complete link agglomerative clustering [70]. As to the clustering algorithm. Cluster-Based Function The Cluster-Based Function is a straightforward adaptation of Eq.2 which is appropriate to the chunk domain and on which relies the document resemblance measure. . giving particularly good results for documents with a remarkable inner repeatedness. the cluster centroid R will be used in place of the chunks it represents. The centroid corresponds to the chunk minimizing the maximum of the distances between itself and the other elements of the cluster. in the chunk clustering setting. Therefore.1: CSimpm (Di . 3. |γi(j) |) k k k with centroid Ri(j) . . Dj ) = n k=1 |γik | + |γj m n k=1 p (k) k | · sim(Ri . number of chunks Ni(j) . the clustering algorithm produces chunkRatio ∗ n clusters in the chunk space.8) .1. Being the reduction ratio chunkRatio common to the two documents.

On the other hand. Data bubbles provide a very general framework for applying hierarchical clustering algorithm to data . by introducing an approximation in the logical representation of documents we would expect an approximation of the duplication scores computed on the original document search space. The technique we propose derives from a recent idea based on the notion of data bubble [20]. 3. given two documents Di and Dj having ni and nj chunks respectively. Starting from the original definition of a data bubble. we define the concept of document bubble.4. The loss of accuracy is analyzed in Section 3. Rj m k m |γj | k k=1 Nj p (k) ) ALCSimpm (Di .2. Intuitively. keeping just a representative (in this case. Thus. for each document represented by n chunks. which is the key for inter-document reduction. The intuition behind a data bubble is that of a “convenient abstraction summarizing the sufficient information on which clustering can be performed” [20]. More precisely.9) As to the accuracy of the intra-document reduction techniques. such techniques reduce the space required to store the logical representation of documents and thus the number of pair-wise chunk comparisons which is strictly correlated to the response time. Dj ) = k |γi | n k k=1 Ni + (3. too.Inter-document reduction 75 Average-Length Cluster-Based Function The second function provides a different way to weight the similarities between the cluster representatives by using the average length of each cluster. a document bubble works for inter-document reduction as a chunk cluster works in intra-document data reduction: It reduces the amount of data to be stored and elaborated for similar documents. It is defined as follows: n k=1 k |γi | k Ni + |γj m Nj m p p (k) | (k) k · sim(Ri . by adopting the intradocument reduction we cut down the number of pair-wise chunk comparisons from ni × nj to chunkRatio2 × ni × nj . which is a particular data structure employed to reduce the amount of data on which to perform hierarchical clustering algorithms [69].3 Inter-document reduction In this subsection. both techniques store the n × chunkRatio most representative chunks. It is thus able to distinguish large clusters with small chunks from small clusters with large ones by giving more weight to the latter than the former. a document) and a series of values summarizing the involved set of similar documents. we discuss the approach we follow for the inter-document reduction.

In comparison with traditional hierarchical clustering. 1 − sim(Di . ˆ N is the number of documents in D. In such context. assuming only that a distance measure is defined for the original objects [20]. the features of the resulting cluster are computed and stored as a document bubble. The remaining documents are assigned to the “closest” originator by applying the standard document similarity algorithm. the inter-document reduction summarizes the starting set of DS documents to be clustered in bubRatio ∗ |DS| document bubbles which approximate the original document search space. Dn } be a set of n documents. ext. the distance measure suitable for hierarchical clustering is no longer the distance d(Di . for each document set. .1.e.r. innC )    (otherwise) . by applying the Vitter’s algorithm [135] we perform a random sampling in order to select the initial bubRatio ∗ |DS| document originators. RC ) − (extB + extC ) + (innB + innC )    (if d(RB . Dj ) is the complement to 1 of the resemblance measure shown in Def. Dj ) between documents but the distance measure between two bubbles B and C given in [20]:  d(RB . ˆ inn is the average of the nearest neighbor distances within the set of documents D. .t. i.76 Approximate matching for duplicate document detection items randomly chosen from an arbitrary data set. The outcome is a collection of bubRatio∗|DS| sets of documents. Therefore.10) max(innB . ˆ ext is a real number such that the documents of D are located within a radius ext around R. We build document bubbles in the following way: Given a set of documents DS and a reduction ratio bubRatio. Finally. inter-document reduction allows us to cut down the number of pair-wise document comparisons from |DS| × |DS| to bubRatio2 × |DS| × |DS|. D is a tuple (R. inn). We introduce the notion of document bubble which is a specialization of the data bubble notion in the context of duplicate detection where the distance measure between documents d(Di . where: ˆ R is the representative document for D corresponding to the document in D minimizing the maximum of the distances between itself and the other documents of the cluster. RC ) − (extB + extC ) ≥ 0) d(B. Dj ). N. The document bubble B w. 3. Definition 3.4 (Document bubble) Let D = {D1 . C) = (3.

The COPS working prototype [22] is a copy detection service where original documents can be registered.repC) 77 dist(repB. A good dissertation on the state of the art is given in [33]. overlap. they address the efficiency and scalability of OOTs by applying random selections at the chunk and document levels. which is even more substantial for documents of unequal sizes. if the document bubbles overlap.repC) B C repB repC B C repB repC dist(rep B. they define different violation tests.2): The distance between two non-overlapping document bubbles is the distance of their representatives minus their radii plus the average of their nearest neighbor distances.3 Related work dist(repB.repC)-(extB+extC)>0 dist(rep B. Finally. To this end. and plagarism. Since no semantic premise is used to reduce the amount of data to be compared.3 Related work The problem of duplicate document detection has been addressed in different contexts and for purposes also going beyond security issues. a degree of approximation is introduced to the matching process resulting in a decay of the accuracy. 3. and copies can be detected.2: Distance between document bubbles The two conditions correspond to non overlapping and overlapping document bubbles (see Figure 3.repC)-(extB+extC)<0 Non overlapping doc bubbles Overlapping doc bubbles Figure 3. Such OOTs essentially count the number of common chunks between each pair of documents and select such registered documents that exceed some threshold fraction of matching chunks. we take the maximum of the averages of their nearest neighbor distances as their distance. and as many ordinary operational tests (OOTs) that approximate the desired violation tests.3. It has been devised as one of the main components of an information infrastructure that gives users access to a wide variety of digital libraries. The same approach is followed in the KOALA [66] . Otherwise. subset.

we are able to accurately detect different levels of duplication. b is 1. chunks or shingles are obtained by “sliding” a window of fixed length over the tokens of the document. 3. the authors show that. at least. al.1. The problem is also outlined in [121] where Garcia-Molina et. For the above approaches. as outlined by the authors. Such approaches. b. As to accuracy. the larger the chunk. Our approach differs from the above cited ones since it is less dependent from the size of the chunks to be compared as our comparison scheme enters into chunks. of those depicted in Section 3. DSC-SS. In such approaches. d and ak with k = 1. In this way. The DSC algorithm has a more efficient alternative. the similarity between a. which uses super shingles but does not work well with short documents and thus is not suitable for web clustering.1 computes the number of common chunks weighted on their lengths. As to search cost. Moreover. such approaches are usually not resilient to small changes occurring within chunks as only equal chunks are considered. SCAM investigates the use of words as the unit of chunking. Indeed. it does not fully meet the characteristics of a good measure of resemblance or. whereas no similarity is detected with k > 0. It essentially measures the similarity between pairs of documents by considering the corresponding vectors of words together with their occurrences and by adopting a variant of the cosine similarity measure which uses the relative frequencies of words as indicators of similar word usage. finding the chunk size ensuring a “good” level of efficiency is almost impossible. Recently.78 Approximate matching for duplicate document detection and in the DSC [23] techniques which address the general problem of textual matching for document search and clustering and plagiarism/copyright applications. an approach based on collection statistics has been proposed in . documents differing in contents but sharing more or less the same words are judged very similar. the similarity measure of Eq. notice that if we adopt a crisp similarity. c and a. the bigger the chunking unit the lower the probability of matching unrelated documents but the higher the probability of missing actual overlaps. the same holds for a. b. address again the problem of copy detection by proposing a different approach named SCAM and compare SCAM with COPS by showing that the adoption of sentence unit as chunk implies a percentage of false negatives of approximatively 50% (more than 1000 netnews articles). setting the measure parameters to the values adopted in the implementation. For instance. the lower is the running time but the higher the potential number of distinct chunks that will be stored. are heavily dependent on the type and the dimension of chunks which affect both the accuracy and the efficiency of the systems. as documents are broken up into standalone words with no notion of context and the adopted measure relies on the cosine similarity measure which is insensitive to word order. Moreover. c. c against a. b. In our opinion.

It thus turns out to be not much suitable for plagiarism or similar problems where the detection of different degrees of duplication within a collection is required. Data bubbles capitalize on sampling as means of improving the efficiency of the clustering task. They are orthogonal to the data reduction techniques we proposed and thus they could be combined in order to deal with duplicate detection problems involving large amount of documents. which should ideally be 1.g. As far as the techniques for data reduction are concerned.r.e.t. such techniques. the fact that our resemblance measure is not a metric (it does not satisfy the triangle inequality property) does not prevent us from using a MAM such as the M-Tree [36] for indexing the document search space.3 Related work 79 [33] for clustering purposes. [61]) where the properties of strings whose edit distance is within a given threshold are exploited in order to quickly discard strings which cannot match. In our context. we have revised and adapted to our context the general data bubble framework proposed in [20]. they show that the average of document clusters formed for each seed document. several indexing techniques have been proposed to efficiently support similarity queries. In our work. In particular any distance function that is lower-bounded by the distance used to build the . the Metric Access Methods (MAMs) only require that the distance function used to measure the (dis)similarity of the objects is a metric. The approach extracts from each document a set of significant words selected on the basis of their inverse document frequency and then computes a single hash value for the document by using the SHA1 algorithm [1] which is order sensitive. The construction of the data bubble could be sped up by adopting the very recent data summarization method proposed in [142] which is very quick and is based only on distance information.3. is 3. Although they introduce a mechanism which lowers the probability of modification. Among those. 10 seed documents and 10 variants for each seed document. the main objective of our work has been to devise techniques able to reduce the number of document comparisons relying on our scheme and to test the robustness of the resemblance measures w. Indeed. Finally. The approach supports an high similarity threshold justified by the aim of clustering large amounts of documents. a comparison with our approach is also possible. Indeed. As to the efficacy of the measure. The paper [35] shows that the M-Tree and any other MAM can be extended so as to process queries using distance functions based on user preferences. i. Filters have been successfully employed in the context of approximate string matching (e. they conducted some experiments on synthetic documents generated in a similar way as ours. All documents resulting in the same hash value are duplicates.3 with a maximum of 9. we devised an effective filter suitable for the resemblance measure of the comparison scheme we propose.

1). When a document is submitted to the system. In this enlarged scenario.80 Approximate matching for duplicate document detection Pre-processing Conversion to ASCII Chunk Identification Documents Chunks Syntactic Filtration Length Filtration Statistic and Graphic Analysis Document Repository Inter / Intra document Data reduction Similarity Computation Chunk Similarity Search Chunk Mapping Optimization Figure 3. where the query and/or the comparison distance functions can also be nonmetric. we conducted a number of exploratory experiments using a system prototype named DANCER (Document ANalysis and Comparison ExpeRt).3.1. then it performs a simple finite-states automa in order to identify both sentences and paragraphs. several distances can be dealt with at a time. the index distance (used to build the tree) and the comparison distance (used to quickly discard uninteresting index paths). the module first converts to ASCII the contents of the documents. The aim of the document pre-processing modules is to translate the documents to be stored in DANCER into their logical representations suitable for the similarity computation (see Subsection 3.4 Experimental Evaluation To evaluate both the effectiveness and the efficiency of the ideas and the techniques presented so far. and provides the user with the option of choosing the . 3. whose architecture is summarized in Figure 3. a metric which is a lower-bound of our resemblance measure can be easily defined and our filters can be used as comparison distances.3: The DANCER Architecture tree can be used to correctly query the index without any false dismissal. the query (user-defined) distance. As a matter of fact.

As shown in Section 3. have some similarities with other chunks of Di . in turn. The following sections describe: ˆ the similarity computation module.1 The similarity computation module The Similarity Computation module is able to compute the similarities between two sets V and W of documents by following two phases: the Chunk Approximate Matching and the Chunk Mapping Optimization.2. Dj ) between Di and Dj . 3. Those chunks in Dj could. ˆ assessment and evaluation of the experiments we conducted. quickly filters pairs of chunks that do not match by means of the chunk filtering techniques described in the previous sections (for more details see Chapter 1). for each Di ∈ V and Dj ∈ W it returns all pairs of chunks ch ∈ Di and ck ∈ Dj such that ed(ci . as required by Eq. and generations of synthetic document collections.2.3 respectively). statistic and graphic analysis of the document collections.1. making the problem quite complex. Indeed.The similarity computation module 81 desired size of chunks in terms of a fixed number of sentences or paragraphs. The logical representations of the submitted documents are then stored in the document repository. The former approximately matches the chunks of the involved documents. Given a similarity measure. 3. cj ) ≤ t. The remaining system modules access such information in order to perform different tasks: document similarity computations. and so on. ˆ the document generator tool we developed to create synthetic document collections..2. that is the permutation pm maximizing . 3. finding the similarities between the different chunks could not be sufficient to correctly compute the resemblance measure Sim(Di .1. i. ˆ the document collections used in the tests.1. before computing the edit distance. data reduction functions. both intra and inter-document (see Sections 3.2 and 3. a chunk in document Di could be similar to more than one chunk in document Dj . the goal of Chunk Mapping Optimization is to find the best mapping between the chunks in the two documents. It is implemented on top of i j the DBMS managing the document repository by means of an SQL expression which.1. The last two steps are the chunk syntactic filtration and the chunk length filtration which have already been described in Subsection 3.e.4. say the symmetric one introduced in Def.

Whenever a range query w.6.1. Finally.a).k = 1 if chunk ch is associated to chunk ck j i 0 otherwise under the following constraints: ∀k h xh. the more are the pairs of similar chunks the more is the probability that the intervention of the ILP package is required.4 or lower values. ck ) · xh.82 Approximate matching for duplicate document detection Eq. the mapping between chunks can be straightforwardly computed in linear time w.t.k i j i j n m |ck | i k=1 + k=1 |ck | j where xh. ck ) actually i j participates to the similarity computation and thus assuming the values xh.k ≤ 1 xh. such as [74]. we express the problem in the following form: We find the maximum of function m k=1 n h=1 |ch | + |ck | · sim(ch . Further. also including ck . the computation of the resemblance measure requires the intervention of ILP algorithms only when a pair of chunks ch and ck exists such that ch has similarity greater than 0 with i j i more than one chunk in document Dj .k ≤ 1 k ∀h that is a chunk in a document is coupled with at most one chunk in the other document (see Figure 3. which make the costs of this type of calculation feasable. we set the edit distance threshold t to 0. that is we only consider pairs of chunks whose similarity score is at least 0. in our case.1. The complexity of algorithms solving the linear programming problem has generally been high.k is a boolean variable stating if the pair of chunks (ch .r.2. which are the majority. In our experiments. In j all the other cases. the number of matching chunk pairs. We found out that such a setting has no impact on the effectiveness of the measure (see Section 3. With those premises. a query . let us go back to the two kinds of content-based document analysis techniques we depicted in Section 3. the resemblance measure computation becomes a matter of solving an integer linear programming problem. however there exist worst-case polynomial-time algorithms.4) and allows the improvement of system efficiency since it implies that the ILP is activated only few times. for which a standard ILP package can be employed. Instead of generating all possible permutations and testing them. and vice-versa.r. 3.t. Obviously.

the elements of the variation documents to be used.e.2 Document generator To test the performance more in depth. sentences and words. The types of modifications are amongst deletions. Clustering. i. insertions or substitutions. Times500S. in case of insertions or substitutions. The frequency of modifications to the three different levels of the document structure are specified by as many parameters: parF req. the Similarity Computation module computes the similarities between Q. in the document similarity / copy detection field there are no standard reference collections or test beds which are universally known and accepted.t. as well the efficiency of the different matching techniques together with the impact of the proposed data reduction approaches. 3. every possible situation. designed to automatically produce random variations out of an initial set of seed documents. Times100L.r. swaps. Finally. and the data documents contained in W . V = {Q}. The algorithm takes one seed document at a time and a set of “variation” documents. coming from various contexts and studied to stress the different features and capabilities of our approach in various scenarios. Tough they are not intended to be a reference or to be comprehensive w. an additional parameter specifies the number of documents (variations) to generate out of each seed document.4. is performed on a document collection.Document generator 83 document Q is submitted to DANCER. V = W where V represents the documents to be clustered. 3. sentF req. five of which are synthetic ones and three are real data sets.4. and involve paragraphs. The synthetic ones were created with our document generator: ˆ Times1000S. i. the document sets we use allow us to analyze in detail the effectiveness in copy (related) and in violation detection scenarios. The generator algorithm is based on independent streams of random numbers which determine the type of modification to operate and. and wordF req. The document generator is fully parameterizable and works by randomly applying modifications to the different parts of each seed document. Times100S: collections of . we developed a document generator tool. from which we extract the new material needed to modify the seed documents.3 Document collections As stated in [22]. We report the experiments performed on eight different data sets. For our experiments we created / collected different document sets. instead.e.

50. Finally. by following the same technique. we selected two sets of real documents: CiteR25 and CiteR8. In particular. i. In the “biggest” collections. These are collections of 25 (long) and 8 (long) scientific articles coming from the Citeseer digital library. 10 and 10 seed documents. Times1000S and Times500S.000. In this way. therefore a random selection of uncorrelated articles was clearly not the right choice. Details about the different collections are depicted in Tables 3. 100 (long) and 100 (short) documents. For the CiteR8 collection we started from the article “A Faster Algorithm for Approximate String Matching” [9] and then recursively followed the “similar on the sentence level” links to select 7 correlated documents. 500 (short).2. extracted from articles of the Times newspaper digital library. we experimented a slightly more “aggressive” setting (sentF req as one each 4 sentences and wordF req as one each 5 words). the obtained collections of scientific articles are significant real data sets since there is a certain level of repeatedness within each group while the inter-group cross correlation is quite low. For the Times100S. we simulate a sufficiently varied but still realistic scenario in document changes by setting the following document generator parameters: sentF req as one each 6 sentences and wordF req as one each 8 words.000 different pairs. The contents of such documents should guide their grouping into distinct sets of (near) duplicates. with a number of chunks comparisons near 330. Times100L and Cite100M collections.e. containing 10 variations (including the seed documents themselves) for each of the 10 seed documents extracted from various scientific articles coming from the Citeseer [80] digital library.1 and 3. we decided to search for some real documents’ sets. Notice that they involve a cross product of up to 1. we collected groups of documents starting from selected articles and then exploiting the documents suggested as similar by the Citeseer digital library search engine. containing 10 variations (including the seed documents themselves) for each of the 100. The task of finding sets that could be significant to evaluate the effectiveness of the measures was not a trivial task. . to test the system behavior also with less inner-correlated documents. In CiteR25 we selected 5 initial documents and.000 (1000*1000) documents comparisons. ˆ Cite100M: collection of 100 medium sized documents.84 Approximate matching for duplicate document detection 1000 (short). respectively. In order to fulfill such requirements. we formed 5 groups of 5 correlated documents each.000.

t. the different data reduction techniques described in Section 3. we present an experimental analysis of the level of security offered by our scheme.2 to 0. further.Test results Times1000S 1000 17231 392478 Times500S 500 8606 219438 Times100L 100 10833 223171 Times100S 100 2078 44192 85 Cite100M 100 4045 91273 # docs # sents # words Table 3. and their robustness w. Unless otherwise specified. length threshold 3) and an edit distance threshold t between chunks ranging from 0. a minimum chunk length of 3 words (i.1: Specifications of the synthetic collections used in the tests # docs # sents # words CiteR25 25 8223 164935 CiteR8 8 4086 81339 Table 3. we inserted the documents of each collection in the system and we queried the collection against itself in order to obtain a duplication matrix containing a duplication score between each pair of documents.4.r.4.4 Test results The system performance was tested both with respect to the effectiveness of the resemblance and the other measures and the efficiency of the proposed data reduction techniques. For our experiments. All experiments were run on a Pentium 4 2.2: Specifications of the real collections used in the tests 3.2. tests were performed with chunk size of one sentence.5Ghz Windows XP Pro workstation. Effectiveness tests In this section. we present the experiments we performed to evaluate the effectiveness of the similarity measures. . Then. we analyzed the duplication matrix in different ways: ˆ direct graphic visualization of the duplication matrix values.e. the effectiveness of the measures was also tested at increasing levels of data reduction with the aim of evaluating the robustness and of finding a good tradeoff between the accuracy and the efficiency of the system. equipped with 512MB RAM and a RAID0 cluster of 2 80GB EIDE disks. In particular.

containing 10 groups of 10 (near) duplicates.86 Approximate matching for duplicate document detection 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 Figure 3. This technique allows us to show the quality of the computation at a glance. all the documents generated from the same seed document are inserted with subsequent indexes and.4: Effectiveness of the resemblance measure for Times100L ˆ numerical computation of average and standard deviation of the duplication scores. In Figure 3.r. by associating a particular shade of color to each duplication value. Let us start from the Times collections. .r.t. in particular from the Times100L. with a dark color associated to nonduplicate pairs and a brighter one for more duplicate ones. The obtained results show the robustness of the similarity measures w. each pixel represents the level of duplication of one pair of documents. We use a direct visualization of the duplication matrix. meaning that the system is able to distinguish the 10 clusters we generated. Most of the tests we present involve the synthetic collections. As testbeds for duplicate detection are lacking. we evaluated the decay of the effectiveness by comparing the “original” duplication matrix of a reference collection with those obtained by applying different data reduction settings. To this end. Figure 3. while preserving most of the detail needed for an in-depth analysis. The other aspect we considered is the robustness of the resemblance measure w.4 shows the image generated with this technique for the Times100L. visualized in a specific zone of the plot. Obviously. in order to make the image instantly readable. The level of security of our duplicate detection technique is thus good: The similarity scores it computes are proportional to the number of changes applied to the seed documents and many changes need to be made so that the resulting documents are identified as unrelated to the original documents. text modifications. thus.4 the 10 groups of 10 (near) duplicates are clearly visible. the use of synthetic collections makes the effectiveness evaluation more objective since the generation process we adopted together with its settings provide a clear direction of the awaited results. Notice that.t data reduction. In this way.

Here.e.5 10 10 (b) sampling 0. i.5 (e) length-rank sel. the reference collection is again Times100L and thus the decay of the effectiveness can be evaluated by comparing the image matrixes of Figure 3.3 (f) length-rank sel. As to the intra-document reduction.1 the quality of the results is quite poor. Since the two implementations produced approximately the same quality level. Figure 3.3 the 10 groups are still recognizable but the intra-group duplication scores are much reduced (pixels are darker than the corresponding ones in Figure 3. the third column represents the image matrixes obtained by applying a reduction ratio of 0. With sampling 0.1.5 compares the effect of the different ratios of sampling (first line) and length-rank chunk selection (second line).1) and thus we do not affect the accuracy of the system. in both cases we kept one every 10 chunks per document.5 with the image matrix of Figure 3. With sampling 0. For instance.1 20 20 20 30 30 30 40 40 40 50 50 50 60 60 60 70 70 70 80 80 80 90 90 90 100 10 20 30 40 50 60 70 80 90 100 100 10 20 30 40 50 60 70 80 90 100 100 10 20 30 40 50 60 70 80 90 100 (d) length-rank sel.4.3 10 (c) sampling 0. we first compare the results of the length-rank chunk selection with the sampling approach used in [22]. we only present random sampling results.4). The quality of the length-rank chunk .1 Figure 3.Test results 87 10 10 10 20 20 20 30 30 30 40 40 40 50 50 50 60 60 60 70 70 70 80 80 80 90 90 90 100 10 20 30 40 50 60 70 80 90 100 100 10 20 30 40 50 60 70 80 90 100 100 10 20 30 40 50 60 70 80 90 100 (a) sampling 0. 0. 0.2.5: Sampling and length-rank selection experiments for Times100L we did not perform tests with filtering techniques which have been proved to be correct (see Subsection 3. 0.5 and 0. Sampling was implemented both through a fixed-step sampling algorithm keeping every i-th chunk and with the sequential random sampling algorithm by Vitter [135].

very good and even at the highest selection level (0. respec- . instead.6: Chunk clustering experiments for Times100L selection is. Notice that the A-L C-B resemblance measure (see Eq.1 20 20 20 30 30 30 40 40 40 50 50 50 60 60 60 70 70 70 80 80 80 90 90 90 100 10 20 30 40 50 60 70 80 90 100 100 10 20 30 40 50 60 70 80 90 100 100 10 20 30 40 50 60 70 80 90 100 (d) A-L C-B 0. that is 20 and 100 document bubbles were generated for Times100S and Times500S.1) the similarity scores are much better than any sampling 0.9) produces “smooth” results with respect to the different reduction settings: Results have an acceptable quality up to 0.3 10 (c) C-B 0.3 (f) A-L C-B 0.8) behaves almost in a crisp way: the resemblance results near to the extremes.1. 3.2. Figure 3.2.6 shows the results of the experiments we conducted with the chunk clustering technique at three different levels of reduction ratio. The reduction ratio was set to 0. degrade very slowly at the different reduction ratios and several document pairs are recognized with a full score even at 0.88 Approximate matching for duplicate document detection 10 10 10 20 20 20 30 30 30 40 40 40 50 50 50 60 60 60 70 70 70 80 80 80 90 90 90 100 10 20 30 40 50 60 70 80 90 100 100 10 20 30 40 50 60 70 80 90 100 100 10 20 30 40 50 60 70 80 90 100 (a) C-B 0. where duplication scores were computed using the clustered-based (C-B) and the average-length clustered-based (A-L C-B) functions (see Section 3.5 (e) A-L C-B 0. On the other hand. the C-B resemblance measure (see Eq.1 reduction ratio. 3.1 Figure 3.7. the results are not as good as for length-rank chunk selection.5. The collections tested are the Times100S and Times500S.2). The results of the effectiveness experiments we conducted for inter document reduction are depicted in Figure 3. 0 and 1.20.5 10 10 (b) C-B 0. Since the inner repeatedness of the collection is quite low.

0.7. succeeds in correlating the information conveyed .r. Such results are due to the fact that our comparison scheme. bub. As shown in Figure 3. The results concern Cite100M. in comparison with the original duplication matrix.t. 0.8: Chunk size tests results for Cite100M We also performed tests on the impact of the chunking unit on the quality of the obtained results (Figure 3.7: Document bubble experiments for Times100S and Times500s tively.Test results 89 2 10 4 20 6 30 8 40 10 50 12 60 14 70 16 80 18 90 20 2 4 6 8 10 12 14 16 18 20 100 10 20 30 40 50 60 70 80 90 100 (a) Times100S. In this way. 10 10 20 20 30 30 40 40 50 50 60 60 70 70 80 80 90 90 100 10 20 30 40 50 60 70 80 90 100 100 10 20 30 40 50 60 70 80 90 100 (a) sentences as chunks (b) paragraphs as chunks Figure 3. Figure 3.2 Figure 3. We kept the document bubbles ordered as much as possible as the original documents. Clearly. By choosing a bigger chunk size (i. a good effectiveness with document bubbles would be represented by a lower resolution duplication matrix maintaining the original pattern of squares. through the chunk similarity measure. a pixel in the matrix represents the resemblance score between a pair of document bubbles.8). a paragraph. other syntethic collections behaved similarly. the document bubble technique. the resemblance measure is robust also w.2 (b) Times500S.8a).8b) the similarity scores were quite lowered but the similar document groups were still perfectly identifiable. bub. The tests shown so far have been performed with the chunk size of one sentence (Figure 3.e.

In order to test the effectiveness of the resemblance measure in determining duplicate documents also in real heterogeneous document sets.00009 ± 0. we performed “ad-hoc” tests on the CiteR25 and CiteR8 collections.00035 ± 0. Noise represents undesired matches between documents belonging to different groups and the value reported in each line of Table 3.r.32580 ± 0.31576 ± 0.40008 0.50332 ± 0.00074 0.00011 ± 0.28984 ± 0. Affinity represents how close documents in the same group are in terms of duplication.9a. no red.32384 ± 0.32719 0.3 is the average a and the standard deviation d over the intragroup document comparisons. (b) paragraphs as chunks (c) length-rank sel.40992 0.00067 0. 0.41103 0. we inserted into the system the 5 groups of 5 documents of CiteR25 and then computed the affinity and noise values for the symmetric and asymmetric similarity measures.00109 0.00011 ± 0. even for real and therefore not extremely correlated document sets.39617 0. the other groups.00037 0.3: Affinity and noise values for the CiteR25 collection 5 5 5 10 10 10 15 15 15 20 20 20 25 5 10 15 20 25 25 5 10 15 20 25 25 5 10 15 20 25 (a) chunk=sent.00066 0.00024 ± 0.00022 ± 0.39189 0. Notice the extremely low noise and the high affinity values.40308 0.35909 ± 0.29572 ± 0.37176 ± 0.00053 Asim Affinity Noise 0.9: Chunk size and length rank tests results in CiteR25 by the chunks.3 is the average and the standard deviation over the comparisons between the documents of the group w.90 Approximate matching for duplicate document detection Sim Affinity Noise 0.36496 ± 0.00012 ± 0.32709 0.38697 0.00030 ± 0. which are three orders of magnitude higher than the noise ones.40268 0.3 Figure 3.9b shows the effect of the variation of the chunking .00035 0.00038 0.50104 ± 0.00019 ± 0.00040 0.38557 0.32352 ± 0.. First. Table 3. The image matrix of CiteR25 is depicted in Figure 3.39356 0. whereas Figure 3.00015 ± 0.t.00013 ± 0.00041 0.00053 0.36402 ± 0.00045 ± 0. The affinity value a ± d reported in each line of Table 3. This once again confirms the goodness of our resemblance measures.00059 Group 1 2 3 4 5 Total Avg.

Test results
Citeseer Doc. 4 5.2% 8.8% DANCER Doc. 4 3.4% 3.2% 8.3% 100.0% 2.8% 18.5% 14.0%(E) 14.5%(F )

91

Doc. Doc. Doc. Doc. Doc. Doc. Doc. Doc.

1 2 3 4 5 6 7 8

Doc. 1 41.1% Doc. 1 100.0% 57.3% 2.0% 1.8% 0.9% 1.8% 2.3% 1.8%

Doc. 2 31.5% Doc. 2 44.3% 100.0% 2.9% 1.3% 0.7% 1.2% 1.8% 1.7%

Doc. 3 18.0% 47.6% 41.6% 19.7% 20.7% 14.3% 12.0% Doc. 3 25.3% 46.2% 100.0% 54.6% 22.7% 36.2% 32.0% 30.6%

Doc. 5 5.2% Doc. 5 3.7% 3.7% 7.1% 5.9%(B) 100.0% 4.5% 4.1% 3.9%

Doc. 6 9.2% 35.4% 27.4% Doc. 6 3.4% 3.1% 5.5%(A) 18.2% 2.2% 100.0% 48.6% 47.8%

Doc. 7 20.5% 59.4% Doc. 7 2.6% 2.6% 3.0% 8.8%(C) 4.1% 31.0% 100.0% 83.8%

Doc. 8 13.8% 68.1% Doc. 8 1.6% 1.9% 2.3% 7.2%(D) 0.9% 24.1% 66.7% 100.0%

Doc. Doc. Doc. Doc. Doc. Doc. Doc. Doc.

1 2 3 4 5 6 7 8

Table 3.4: Correlation discovery: a comparison between Citeseer search engine and Dancer unit and Figure 3.9c that of the length-rank chunk selection. As to the first aspect (Figure 3.9b), the obtained results are comparable to the ones obtained in the synthetic tests. As to the second aspect, notice that the quality of the computation on the reduced data is even better than the original one, as noise decreases and affinity increases. This is due to the fact that length-rank chunk selection stresses the similarities between large portions of the involved documents while small chunks are excluded from the computation. In the Citeseer real data sets, small chunks often correspond to the few small sentences, such as “The figure shows the results ”, “Technical report”, “Experimental evaluation”, or even the authors’ affiliations, that are quite common in scientific articles, so that even unrelated documents might both contain them. They are thus source of noise that are left out by the length-rank chunk selection. The second test on real collections involved CiteR8 and was devised to stress the DANCER capabilities in discovering inter-correlations between sets of scientific articles coming from a digital library. Such correlations are quite important in order to classify the documents and to show to the user a web of links involving articles similar to the one being accessed. In fact, the 8 documents were selected by starting from the article [9] and navigating the

92

Approximate matching for duplicate document detection

Doc. 3 14% 5,5% (A) Doc. 6 8,8%
(C) (E)

Doc. 4

5,9%

(B)

Doc. 5

14,5%

(F)

7,2% Doc. 8

(D)

Doc. 7

Figure 3.10: Web of duplicate documents

“similar on the sentence level” links proposed by the Citeseer digital library search engine. They correspond to variations of the original article and to articles on the same subject from the same authors, and include extended versions, extended abstracts and surveys. In order to provide the links, the Citeseer engine adopts a sentence based approach to detect the similarities between the documents: It maintains the collection of all sentences of all the stored documents and then computes the percentage of identical sentences between all documents. The percentages computed by Citeseer between the CiteR8 documents are shown in the upper part of Table 3.4 (Doc. 1 is the starting document). Notice that the matrix is asymmetric as each value represents the percentage of the sentences that the document in the row shares with the document in the column. For instance, Doc. 1 is linked to Doc.2 (with which it shares 31.5% of its sentences) and Doc.3 (18.0%). Further, notice that the resulting correlation matrix is quite sparse and unpopulated. Now, consider the corresponding matrix produced by DANCER (lower part of Table 3.4). To make the resulting values directly comparable with the ones from Citeseer, we used the containment measure (see Subsection 3.1.3). Notice that the output given by our approach is much more detailed, providing more precise similarities between all the available document pairs. For instance, Citeseer does not provide a link from Doc.1 to Doc.4, whereas our approach reveals that the first shares 3.4% of its contents with the latter. Even by considering a 5.0% of minimum similarity threshold, that could be employed by Citeseer not to flood the user with low similarity links, there still remains a good number of links that have to be discovered in order to consider the similarity computation a complete and correct one. Such significant links, which are not provided by Citeseer, are marked in Table 3.4 with capital letters A to F and are also graphically represented in Figure 3.10. Their significance can not be ignored, since they involve similarities even up

Test results
Type of test Exact copy Partial copy Amount 50% subset 25% subset 10% subset 5% subset 2 paragraphs 1 paragraph Low(S) Med(S) High(S) Low(S) Low(W) Low(S) Med(W) Low(S) High(W) Med(S) Low(W) Med(S) Med(W) Med(S) High(W) Similarity range (DANCER) Resemblance Containment Overlap 100%(A) 69%-78%(B) 35%-51% 14%-27% 6%-13%(C) 6%-11% 3%-5%(D) 88%-97%(E) 65%-73% 26%-38% 84%-92%(F ) 41%-48% 12%-18%(G) 61%-69% 33%-39%(H) 8%-13% 1%-12%(I) 100% 100%(B) 100% 100% 100%(C) 5%-12% 2%-7% 86%-98% 61%-77% 23%-39% 81%-94% 39%-49% 11%-20% 60%-71% 31%-40% 7%-15% 1%-14% 1924-2705 1090-1254 460-595 211-267 102-132 202-285 110-134(D) 422-793 88-179 50-99 380-764 64-148 35-47 290-383 42-125 18-29 5-25

93
Similarity range (others) Crisp, stem Crisp 100%(A) 69%-78%(B) 35%-51% 14%-27% 6%-13%(C) 6%-11% 3%-5%(D) 88%-97%(E) 65%-73% 26%-38% 19%-30%(F ) 2%-6% 1%-2%(G) 12%-27% 2%-3%(H) 1%-2% 1%-3%(I) 100% 69%-78% 35%-51% 14%-27% 6%-13% 6%-11% 3%-5% 88%-97%(E) 65%-73% 26%-38% 5%-13%(F ) 1% 0%(G) 3%-8% 1%(H) 0% 1%(I)

Plagiarism

Genuine (related)

Table 3.5: Results of violation detection tests to 14.5%. Finally, in order to specifically test the level of security provided by our approach, we performed an in-depth violation detection analysis, whose results are shown in Table 3.5. The tests consisted in verifying the output (i.e. the similarity scores) of our approach obtained by simulating well-defined expected-case violation scenarios: Exact copy, several levels of partial copy, several levels and types of plagiarism. Each row of data in the table summarizes the results of 20 similarity tests performed on a specific scenario; for this reason each cell of the results is represented by a similarity range, showing the lowest and the highest computed similarity score. The table shows all the three DANCER similarity measures (resemblance, containment and maximum contiguous overlap) defined in Section 3.1. For the two asymmetric measures, the presented values refer to the violating document w.r.t. the original one. The overlap measure is expressed in number of (stemmed and stopword-filtered) words. Further, we computed and compared the resemblance similarities that would be delivered by other approaches based on crisp chunk similarities, such as [22, 33], with and without syntactic filtrations (such as stemming): Such values are shown on the two last columns of the table. For each test a document of the Times collection was chosen as the “original” document and its similarity was measured w.r.t. a “violating” document, derived from the original by applying various modifications. By choosing specific similarity thresholds, such as 15%-20% for resemblance and containment, and 100 words for overlap, such values can be used to help the human judgement in the various violations, even well substituting and ap-

94

Approximate matching for duplicate document detection

proximating it where highly effective measures such as the ones we described are employed. For the exact copy test the violating document is exactly the same as the original one. The computed resemblance is 100% clearly reporting the violation, as expected, both for DANCER and for standard crisp approaches (values A in Table 3.5). As to partial copy, we simulated the “subset” scenario, in which only a portion of the original document is kept in order to obtain a smaller document, and the even more typical scenario in which a portion (for instance, a whole paragraph or two) of the original document is copied and inserted in the context of a new document. When the copied document is a big subset of the original one (50%, 25%) the resemblance measure can still be a good indicator, even in the crisp approach (values B). However, when the copied document represents a small portion of the original one (10%, 5%), or only one or two small paragraphs are copied, such violations are difficultly detected by standard symmetric measure, while the DANCER containment and overlap measures are clearly able to report them: See, for instance, values (C), where containment is maximum, and (D), where the detected overlap is more than 100 words and should not pass unnoticed. As to the plagiarism test we performed the analysis by synthetically generating violating documents at different plagiarism levels, even very subtle ones: In fact, it often happens that only very small parts (i.e., sparse sentences) are copied from an original document and then slightly modified, for instance by changing a word or two. In order to test the DANCER effectiveness also from such subtle but very frequent behavior, we exploited our document generator and, by varying the value of the sentF req and wordF req parameters, produced several “gradual” variations of the original documents. The variations include three levels of modifications on the sentence level (denoted on the table with Low(S), Med(S), and High(S)), involving deletions, insertions, swaps or substitutions of whole sentences, and three levels of modifications on the words of a sentence (Low(W), Med(W), High(W)). When the modifications only affect sentences as a whole, the resemblance measures detected by DANCER, as well as the crisp ones, clearly identify the violations (see, for instance, values E), even for high amounts of modifications. However, when the modifications also affect some of the words on the sentence, the DANCER results are very different from the crisp ones. Even for very low amounts of modifications on the words (for instance, one each ten) the crisp measures significantly drop and no longer truthfully represent the gravity of the violation (see values F). This is even more true for the crisp measure applied with no stemming. In the case of medium and high amounts of modifications (for instance values G and H) the plagiarism is less accentuated but still present: Here the crisp approaches clearly show an un-

Test results

95

Tim

Chunk Matching
es 10 0S 10 0M

Doc Sim Computation

Ci te

Ci teR Ci teR Tim

8

25

es 50 0S es 10 0L

Tim Tim

es 10 00 S

0

50

100 Seconds

150

200

250

Figure 3.11: Results of the runtime tests with no data reduction acceptable security level, showing similarities well below threshold and thus producing false negatives. The last row of the table shows the resulting similarities in a “genuine” document scenario: Here the analyzed document pairs are extracted from related documents, i.e. documents involving the same topics but with no common or plagiarized parts. As expected, being the DANCER approach much more sensitive, the obtained similarities are higher than the crisp approach, however they are still low and below typical violation thresholds. Then, the percentage of false positives delivered by our approach is still kept very low, similarly to crisp techniques and differently from other approaches which are able to identify subtle violations but at the cost of very high percentages of false positives, such as [121]. Efficiency tests As far as the efficiency of the proposed techniques is concerned, we measured the system runtime performance in performing the resemblance computations by analyzing the impact of the data reduction techniques. The computing time required to query each collection against itself when only chunk filters are applied is shown in Figure 3.11. The graph makes a distinction between the time required for each of the two phases, that is chunk matching and similarity scores’ computation. The similarity score computation time represents the time required to compute the similarity scores for each pair of documents. The overall time can be significantly decreased by applying the data reduction techniques we devised. As to the document filter, setting the similarity threshold s to a value

96
Sampling

Approximate matching for duplicate document detection
Clustering LenRank 160 140

Sampling

Clustering

LenRank 60

50
120

40
100 Seconds 80 60

30

20
40

10
20 0 1,0 0,9 0,8 0,7 0,6 0,5 0,4 0,3 0,2 0,1

0 1,0 0,9 0,8 0,7 0,6 0,5 0,4 0,3 0,2 0,1 Reduction Ratio

Reduction Ratio

(a) Times100L

(b) Cite100M

Figure 3.12: Efficiency tests results for intra-document data reduction

greater than 0, even at a minimal setting as 0.1, halves the total required time. Such an improvement is due to the ability of the filter to quickly discard pairs of documents which cannot match and to ensure few false positives thus limiting the number of useless comparisons. For instance, for Times100L the cross product computation requires 100*100/2 comparisons (resemblance measure is symmetric). By setting the similarity threshold at 0.1, the document filter leaves out the 91% of the document pairs while the worth surviving document pairs on which we compute the similarity scores are 450. From such a computation, we found out that all the pairs contained in the candidate set are actually similar enough, that is the similarity score is at least 0.1. In this case we have no false positives and, thus, the best filter performance. The same applies to threshold values up to 0.6. As the threshold grows over 0.6, the number of positive document pairs obviously decrease (up to 0 when the threshold is 0.9) and the filter leaves out more document pairs. The worst case occurs when the threshold is 0.7 where we have a candidate set of 378 document pairs containing 5 false positives. Such a behavior is mainly due to the fact that documents having a certain number of similar chunks are often similar documents. In most of these cases, the mapping between chunks can be straightforwardly computed without requiring the intervention of the ILP package (see Subsection 3.4.1). Moreover, the intra-document reduction techniques provide us with further ways to enhance the system performance, as shown in Figure 3.12, where we compare the time required to compute the resemblance matrix employing the three intra-document reduction techniques on Times100L and Cite100M at different values of reduction ratios, starting from 1.0 which means no reduction. Using length-rank chunk selection at 0.1, for instance, allows us to

Seconds

Moreover. For example. Notice that we showed the runtime impact of the various data reduction techniques separately but it is possible to combine them to achieve even faster computations.13a). but at the price of a reduced quality of the similarity scores. comparable to the sampling approach. clustering and document bubbles construction) is not shown because it is not part of the computation itself.e. while the most complex ones such as chunk clustering and document bubbles require a bigger amount of time. our intra-document reduction techniques offer a good trade-off between effectiveness and efficiency. for example a setting of 0. As a final remark. but with much higher quality results.2 is more than 20 times faster than the standard computation.3. In general. document bubbles construction required approximately the same time as a complete resemblance computation on the same data set.Test results 97 No clustering 60 Clustering 0. Document bubbles bring an extremely consistent time reduction (Figure 3. it could be reduced by adopting the approach proposed in [142]. Further efficiency test results are on inter-document data reduction.5 60 No clustering Clustering 0. As we have already stated in Section 3. notice that such construction techniques could be modified in order to make . Figure 3. allowing remarkable time reductions.13: Further efficiency tests results reduce the total computing time by a factor of up to 1:9. from our tests. without compromising the good quality of the results (see the previous section).13b shows the impact of chunk size: Changing the default size of a chunk from a sentence to an entire paragraph reduces the number of total chunks and therefore produces a significant speed up.5 50 50 40 Seconds 40 Seconds 30 30 20 20 10 10 0 Sentences as chunks Paragraphs as chunks 0 Sentences as chunks Paragraphs as chunks (a) Document bubble technique (b) Different chunk sizes on Cite100M Figure 3. but it is an initial preparation phase that has to be done just once. Further. the pre-processing time of the various data reduction approaches (i. notice that simpler techniques like length-rank chunk selection are almost immediate. Anyway.

.98 Approximate matching for duplicate document detection them dynamic to the addition of new documents in the database.

Part II Pattern Matching for XML Documents .

.

so the problem can be characterized as the ordered tree pattern matching. this could exactly be the person we are searching for. sometimes called the twig queries. This is called unordered tree pattern matching. The way to solve this problem is to consider the query twig as an unordered tree in which each node has a label and where only the ancestor-descendant relationships are important – the preceding-following relationships are unimportant. For example. . From the formal point of view. when searching for a twig of the element person with the subelements first name and last name (possibly with specific values). is to find all existing ways of embedding the pattern in the data. However. A naive approach to solve the problem is to first decompose complex relationships into binary structural relationships (parent-child or ancestor-descendant) between pairs of nodes. Though there are certainly situations where the ordered tree pattern matching perfectly reflects the information needs of users. efficient evaluation techniques for all types of tree pattern matching are needed. ordered matching would not consider the case where the order of the first name and the last name is reversed. In general.Chapter 4 Query processing for XML databases With the rapidly increasing popularity of XML for data representation. there is a lot of interest in query processing over data that conform to the labelledtree data model. XML data objects can be seen as ordered labelled trees. there are many other that would prefer to consider query trees as unordered. three main types of pattern matching exist: One involving paths and two involving trees. then match each of the binary relationships against the XML database. and finally complete together results from those basic matches. The idea behind evaluating tree pattern queries. of which the path pattern matching can be seen as a particular case. since XML data collections can be very large.

some data must be retained in special structures to make the reasoning possible. 72]. 64]. In order to compactly represent partial results of individual query root-to-leaf paths. In order to build correct and efficient twig-matching algorithms. stored possibly in main memory. skipping areas of obviously no query answers. we first characterize the pattern matching problems from a pre. the holistic twig join approach was proposed [25]. and in particular of the involved pre/post order coding scheme and of its sequential nature. and at the same time never return back in the processed sequence. in general. which retains structural relationships between tree elements. By taking advantage of the tree signature properties (Section 4. In order to avoid such problems.1). The generic trends aiming at efficiency of matching call for compact structures. intersection of paths. With local changes to the traditional (relational) database kernel. whenever possible. (ii) they process the supporting data structures in a sequential way. and at evaluation algorithms able to provide (partial) solutions as soon as possible. a chain of linked stacks is used as a structure. strong theoretical bases must be established so that the skipped area can be maximized with the minimum amount of retained data.and post-order point of view (Section 4. In this chapter we deal with the three problems of pattern matching (path. This is fundamental for the important segment of XML applications processing data streams [76]. even for quite small search results.1) and then we show: ˆ that the pre/post-order ranks are sufficient to define a complete set of conditions under which a data node accessed at a given step is no longer necessary for the subsequent generation of a matching solution (Section 4. Most of the advanced solutions share two characteristics: (i) they are based on a coding scheme. Another disadvantage is that users must wait long to get (even partial) results. Such approach was further improved by using additional index structures on element sets to quickly locate the first match for a sub-twig pattern. the system is easily able to identify a subtree size. Another interesting way to support tree data processing tries to make the relational system aware of tree-structured data [63.2). inclusion or disjointness of subtrees. Then the algorithm merges the sorted lists of participating element sets together and in this way avoids creating large intermediate results. and improves the performance of XPaths. see for example [32. which has originally been proposed for the ordered tree matching. . ordered and unordered twig matching) by exploiting the tree signature approach [137]. The obvious key to success is to skip as much of the underlying data as possible.102 Query processing for XML databases The main disadvantage of such decomposition based approach is that intermediate result sizes can become very large. However.

i. cannot be further improved (Section 4. ˆ how the discovered conditions and properties can be used to write pattern matching algorithms that are correct and which. left to right.4 presents a summary of their main features). which in certain querying scenarios can provide an equivalently high (or even better) efficiency than that of the “standard” algorithms (Section 4. 4. . we provide extensive experimental evaluation performed on real and synthetic data of all the proposed algorithms (Section 4.5). All the twig matching algorithms have been implemented in the XSiter (XML SIgnaTure Enhanced Retrieval) system.4.6 for an overview of the system architecture and features). If the path from the root to v has length n. ik . As a coding schema. An ordered tree T is a rooted tree in which the children of each node are ordered. A labelled tree T associates a label (name) tv ∈ Σ (the domain of tree node labels) with each node v ∈ T .1 Tree signatures 103 ˆ the properties of the twig matching solutions. If a node v ∈ T has k children then the children are uniquely identified. Finally. . ˆ how to take advantage from the indexes built on the content of the document nodes (Section 4. A detailed description of the complete pattern matching algorithms is available in Appendix B. we also consider an alternative approach [138] specific for unordered tree matching. and extended string processing algorithms are applied to identify the tree inclusion.1 Tree signatures The idea of tree signatures proposed in [137] is to maintain a small but sufficient representation of the tree structures able to decide the ordered tree inclusion problem for the XML data processing. . from a numbering scheme point of view. i2 . we consider ordered labelled trees. Further. . level(v) = n. tree structures are linearized.2). a native and extensible XML query processor providing very high querying performance in general XML querying settings (see Section 4. generated at each step of the scanning (Section 4. In this section. Such approach is based on decomposition and structurally consistent joins. size(v) denotes the number of nodes rooted at v – the size of any leaf node is zero. Finally. . the pre-order and post-order ranks [47] are used. In this way.7).e. as i1 .3). we say that the node v is on the level n.

the following properties are important towards our objectives: ˆ all nodes x with pre(x) < pre(v) are ancestors or preceding of v. before its children are recursively traversed from left to right. In a pre-order sequence. a tree node v is traversed and assigned its (increasing) post-order rank.2) O D G 4 E B 5 G O 6 F P 7 (9. post(v). Pre. ˆ for any v ∈ T . respectively.1: Pre-order and post-order sequences of a tree The pre-order and post-order sequences are ordered lists of all nodes of a given tree T . if post(vi+1 ) > post(vi ). For illustration.and post-order ranks are also indicated in brackets near the nodes. see the pre-order and post-order sequences of our sample tree in Figure 4. after its children are recursively traversed from left to right. a tree node v is traversed and assigned its (increasing) pre-order rank.8) D (4. . ˆ if pre(v) = 1.10) B (2. For all the other neighboring nodes vi and vi+1 in the pre-order sequence.6) P O F 9 P A 10 (10.7) pre: post: rank: C C 3 H H 8 Figure 4.4) H (8. In the post-order sequence.9) C (3. vi is a leaf. ˆ all nodes x with post(x) > post(v) are ancestors or following of v. pre(v).104 Query processing for XML databases Sample Data Tree A (1. v is the root. Given a node v ∈ T with pre(v) and post(v) ranks. if pre(v) = n.1 – the node’s position in the sequence is its pre-order/post-order rank.1) E A D 1 B E 2 (5. ˆ all nodes x with post(x) < post(v) are descendants or preceding of v.3) G (6. ˆ all nodes x with pre(x) > pre(v) are descendants or following of v. v is a leaf.5) F (7. we have pre(v) − post(v) + size(v) = level(v).

sig(T ) = t1 . g. 4. Except the node name. tn .2: Properties of the pre-order and post-order ranks. so the value serves actually two purposes. The first signature element a is the tree root. Definition 4. The signature of T is a sequence. t2 . d. f.The signature post n 105 A v F P D n pre Figure 4. o. Leaf nodes in signatures are all nodes with post-order smaller than the post-order of the following node in the signature sequence. post(t1 ).2 for illustration. and following (F) nodes of v are strictly located in the proper regions. 10.1 Let T be an ordered labelled tree. b. preceding (P). Observe that the index in the signature sequence is the node’s pre-order. As proposed in [62]. in our example it is node p. h. See Figure 4. e. of n = |T | entries. the closest ancestor of the reference. we use the term index. 4. i. 1. For example.1 The signature The tree signature is a list of entries for all nodes in acceding pre-order. c. descendant (D). . where the ancestor (A). post(tn ) . that is nodes d. . where ti is a name of the node with pre(ti ) = i. The post(ti ) value is the post-order value of the node named ti and the pre-order value i.1. o – the last node. 5. e. which is constrained on the left by the reference node and on the right by the first following node (or the end of the sequence). 8. . post(t2 ). Notice that in the pre-order sequence all descendant nodes (if they exist) form a continuous sequence. In the following. 9. p. a. 7 is the signature of the tree from Figure 4.e. The parent node of the reference is the ancestor with the highest pre-order rank. 2. We can also deter- . each entry also contains the node’s position in the post-order sequence. is always a leaf. 3. 6.1. such properties can be summarized in a two dimensional diagram. when we consider the position of the node’s entry in the signature sequence. we use the term pre-order if we mean the rank of the node. g.

1. 2. 5. f. Extended Signatures By extending entries of tree signatures with two pre-order numbers representing pointers to the first following. 5. which retains the original hierarchical relationships of elements in T . 8 . 3. that is the sub-signature representing the subtree rooted at the node b of our sample tree. Naturally. 6). h. such that 1 ≤ s1 < s2 < . 11. post(ts2 ). the set operations of the union and the intersection can be applied on sub-signatures provided the sub-signatures are derived from the same signatures and the results are kept sorted. The generic entry of the i-th extended signature entry is ti . f f i . p. 3. post(ti ). sk ) of indexes (pre-order values) in sig(T ). 6.106 Query processing for XML databases mine the level of leaf nodes. 4. 3. f f . because the size(i) = 0 for all leaves i. . 3. s2 . 7. c.1. e.1. Specifically. d. post(tsn ) is a sub-sequence of sig(T ). 5. 3. 7. 11. b. the extended signature is: sig(T ) = a. .2 Twig pattern inclusion evaluation The problem of twig pattern inclusion evaluation on tree signatures can be seen as a problem of finding all sub-signatures of a given data signature matching with the twig pattern at node name level and satisfying some of the relationships of parent-child (ancestor-descendant) and sibling between the nodes. 1. g. . 11. . 6. 10. 6. 10. < sk ≤ n. ts2 . f a. 4) and S2 = (2. . defined by the ordered set S = (s1 . 2. 4. and unordered tree inclusion. In the following we will consider three kinds of twig pattern inclusion evaluation: ordered tree inclusion. 4. 6). and the first ancestor. 1. . 0. 7. 7. For the tree from Figure 4. tsn . path inclusion. Such version of the tree signatures makes possible to compute levels for any node as: level(i) = f f i − post(i)−1. 8. thus level(i) = i − post(i). The union of S1 and S2 is the set (2. 8. o. The cardinality of the descendant node set can also be computed: size(i) = f f i − i − 1. . 5. 9. sub sigS (T ) = ts1 . 2. nodes. defined by ordered sets S1 = (2. 11. . 1. post(ts1 ). f ai . the extended signatures are defined in [137]. . For example consider two subsignatures of the signature representing the tree from Figure 4. 3. Sub-Signatures A sub-signature sub sigS (T ) is a specialized (restricted) view of T through signatures.

Let sub sigS (D) be the sub-signature (i. post(q1 ). Example 4. . qn . 3.1 The query tree Q is included in the data tree D in an ordered fashion if the following two conditions are satisfied: (1) on the level of node names. Such a modified query tree is . dm . and the post-order of node o is smaller than the post-order of node p (both in sig(Q) and sub sigS (T )). because qi = dsi for all i. q2 = d9 . . 9. sig(Q) is sequence-included in sig(D) determining sub sigS (D) through the ordered set of indexes S = (s1 . If we change in our query tree Q the label h for f . we get sig(Q) = f. 2 . and q3 = d10 . .1 defines a weak inclusion of the query tree in the data tree. Observe that Lemma 4. Such query qualifies in D. (2) for all pairs of entries i and j in sig(Q) and sub sigS (D). consider the data tree D in Figure 4. . i. post(qn ) .Twig pattern inclusion evaluation Ordered tree inclusion evaluation 107 Let D and Q be ordered labelled trees. s2 . if required. 6.3. j = 1. due to the properties of pre-order and post-order ranks. . d2 . |Q| − 1 and i + j ≤ |Q|. 8. . such constraints can easily be strengthened. sn ) where s1 < . . sig(Q) = q1 . . if D contains all elements (nodes) of Q and when the sibling and ancestor relationships of the nodes in D are the same as in Q. The tree Q is included in D. The following lemma defines the necessary constraints for qualification. 1. . 2. but the corresponding entries can have different post-order values. . Suppose the data tree D and the query tree Q specified by signatures sig(D) = d1 . we can formally define the ordered tree inclusion problem as follows.e. any sub sigS (D) ≡ sig(Q). post(d1 ). 10). p. . a subsequence) of sig(D) induced by a name sequence-inclusion of sig(Q) in sig(D) – a specific query signature can determine zero or more data sub-signatures. Lemma 4. Using the concept of signatures. Regarding the node names. post(dm ) . whenever post(qi+j ) > post(qi ) it is also true that post(dsi+j ) > post(dsi ) and whenever post(qi+j ) < post(qi ) it is also true that post(dsi+j ) < post(dsi ). q2 .1 and the query tree Q in Figure 4. . . < sn . p. (2) the post-order of node h is higher than the post-order of nodes o and p. 1. o. post(d2 ). p. post(q2 ). 7 through the ordered set S = (8.1 For example. . o. 3. However. because sig(Q) = h. 2 determines sub sigS (T ) = h. because (1) q1 = d8 . in the sense that the parent-child relationships of the query are implicitly reflected in the data tree as only the ancestor-descendant. o.

o. .1) P (3. 6. . and implicitly considers all such relationships as ancestordescendant. 3. p. Lemma 4. which can be verified in Figure 4. . p. 2 . . . . . tn . it means that the post-order values of subsequent entries i and j (i. the query tree with the root g. resulting in sig(Q) = g. the tree signatures can easily deal with such situations just by simply distinguishing between node names and their unique occurrences. 10).1.2) Figure 4. Following the numbering scheme of a path P signature sig(P ) = t1 . . even though it is also sequence-included (on the level of names) as the sub-signature sub sigS (D) = g. (2) for each i ∈ [1. < sn . . . 9. That means that o is not a descendant node of g. sub sigP (Q) is sequence-included in sig(D) determining sub sigS (D) through the ordered set of indexes S = (s1 . .1 does not insist on the strict parentchild relationships. n − 1 and i + j ≤ n) satisfy the inequality post(qpi ) < post(qpi+j ). However. . sn ) where s1 < . Multiple nodes with common names may result in multiple tree inclusions. because Lemma 4. does not qualify. Path inclusion evaluation The path inclusion evaluation is a special case of the ordered tree inclusion evaluation as all the relationships between the nodes in any path P are of parent-child (ancestor-descendant) type. j = 1. .3: Sample query tree Q also included in D. as required by the query. post(tn ) . As demonstrated in [137].3) O (2. 4. 7 |S = (6. The lemma below easily follows from the above observation and from the fact that inequalities are transitive. 1. The reason is that the query requires the post-order to go down from g to o (from 3 to 1) . 2. o. while in the sub-signature it actually goes up (from 4 to 6). |P | − 1]: post(dsi ) < post(dsi+1 ). . post(t1 ).108 Query processing for XML databases Sample Twig Query H (1.2 A path P is included in the data tree D if the following two conditions are satisfied: (1) on the level of node names.

. qn . . . n. indexes are not necessarily in an increasing order. post(dm ) . . represented by the signatures and sig(D) = d1 . dm . Should the signature sig(Q) of the query not be included on the level of node names in the signature sig(D) of the data.1. j = 1. post(d1 ). if post(qi+j ) < post(qi ) then post(dsi+j ) < post(dsi ) ∧ si+j > si . . any S satisfying the properties specified in Lemma 4. an unordered tree inclusion does not necessarily imply the node-name inclusion of the query signature in the data signature. . . d2 . . q2 . post(q1 ). s2 .2 A formal account of twig pattern matching Unordered tree inclusion evaluation 109 Let Q and D be ordered labelled trees. |Q| − 1 and i + j ≤ |Q|. 2. . Figure 4. Using the concept of signature.3 can always undergo a sorting process in order to determine the corresponding sub-signature of sig(D) qualifying the unordered tree inclusion of Q in D. n. such that dsi = qi . . Anyway. such that only the ancestor-descendant structural relationships between nodes in Q are satisfied by the corresponding nodes in D. . (2) for all pairs of entries i and j. S would not determine the qualifying sub-signature sub sigS (D). . 4. . In the shown trees. the nodes’ names are appended by their pre-order and post-order ranks in brackets. . . as shown in [138]. . . . i.2 A formal account of twig pattern matching sig(Q) = q1 . . An unordered tree inclusion of Q in D is identified by a total mapping from nodes in Q to some nodes in D. In other words. unlike the ordered inclusion of Lemma 4. post(d2 ). post(qn ) Given a twig pattern Q and a data tree D. the query tree Q is included in the data tree D in an unordered fashion if at least one qualifying index set exists. Notice that the index set S is ordered but. . Lemma 4.3 The query tree Q is included in the data tree D in an unordered fashion if the following two conditions are satisfied: (1) on the level of node names. The unordered tree inclusion evaluation essentially searches for a node mapping keeping the ancestor-descendant relationships of the query nodes in the target data nodes. sn ) exists. for i = 1. 1 ≤ si ≤ m for i = 1.4 shows further examples of the three types of pattern matching. . an ordered set of indexes S = (s1 . post(q2 ). .4.

Let Σi designate the set (domain) of all positions j in the data signature where the query node name qi occurs (i.5) firstName (8. .4 > < 6.7 > Figure 4.1) title (2. For the sake of brevity we will use the notation (U )ansQ (D) to designate situations which apply to both the cases.2) firstName (2. . we have shown the properties an index set must satisfy so that it is an answer to the inclusion of Q in D. In all the three cases.3) Ordered twig matching book (1. if Q is a path P then ansP (D) = U ansP (D). because the Cartesian product is empty. The set of answers (U )ansQ (D) is a subset of the Cartesian product of the domains Σ1 × Σ2 × .7) firstName (3.1) lastName (3. even more important.3) (2. The intrinsic limitation of this approach is twofold: it can produce very large intermediate results and. Obviously.2.e.6. n] and then discard from the Cartesian product Σ1 × Σ2 × . In the previous section. dj = qi ). such evaluation . if one of the Σi sets is empty.5.3.1) lastName (4. × Σn the tuples the corresponding sub-signatures of which do not satisfy the properties required by specific pattern matching constraints. . for i ∈ [1. × Σn determined by Lemma 4. A na¨ strategy to compute the desired Cartesian product subset is to ıve first compute the sets Σi . respectively.2) Matches: < 1.7 > Matches: < 2. Q is not included in D.1) lastName (3. .4 > <1.3) Unordered twig matching author (1. Obviously.4) author (6.8.2) (3.8) author (2.4: An example of a data tree and pattern matching results we denote with ansQ (D) the set of answers to the ordered inclusion of Q in D and with U ansQ (D) the set of answers to the unordered inclusion of Q in D.110 Query processing for XML databases Sample Data Tree book (1. the matching on the level of node names is required.1 or Lemma 4.6) Path matching book author lastName (1.3.7> Matches: < 1.3) title (5.2) lastName (7.

for j = 1. Σj . . . (U )ans1 (D) ⊆ (U )ans2 (D) ⊆ . . Σj . . .4.. Σj dei notes the set of all positions in the range from 1 up to j where qi occurs and (U )ansj (D) denotes the set of answers to the (un)ordered inclusion Q of the twig pattern Q in D computable from the domains Σj . . ⊆ (U )ansm (D). 2 1 n Notice that the sets ansj (D) and U ansj (D) are subsets of the Cartesian Q Q product Σj × Σj × . and 3 2 1 ans5 (D) = ∅.. In principle. Σ5 = {5}.. . Assuming that ans0 (D) = ∅. consider the ordered twig matching scenario in Figure 4.e. domains should be maintained in main memory . m. we denote with Q j−1 ∆(U )ansj (D) = (U )ansj (D) \ (U )ansQ (D) the j-th delta answer. . . . . . the set of answers can be incrementally Q constructed by sequentially scanning the data signature.. Thus. an inclusion relationship between the answer sets holds.. Our algorithms exploit properties of the pre-order/post-order numbering scheme adopted in the construction of tree signatures. .. m SEQUENTIAL SCAN Figure 4. and the last Q Q Q answer set is the set of answers to the (un)ordered inclusion of Q in D. At the step 5 of the sequential scanning.2 A formal account of twig pattern matching 111 Σh k Q 1 h n D 1 2 . For example. Notice that the complete set of matches is the union of the delta answers and that the matches which can be computed at step j result from the domains Σj . the Q Q set of matches which can be computed (decided) at step j and which have not been computed at an earlier step. At each step j of the sequential scan of D. the domains are Σ5 = {1}. Σj . k .. . i.4. . × Σj and that ansj (D) ⊆ U ansj (D) as any answer 2 1 n Q Q to the ordered inclusion of Q in D is an answer to the unordered inclusion of Q in D. (U )ansm (D) = (U )ansQ (D). In the following sections we present algorithms that perform the twig pattern matching by sequentially scanning the data signature. . Obviously. Σ5 = {4}. 1 n For efficiency reasons.5: Behavior of the domains during the scanning procedure does not exploit the sequential nature of tree signatures. .. . at each step new answers can be added to the Q answer set of the earlier one.

Q . for each j ≥ k.5). . at a given step j. Σj are necessary for the generation of the delta answers 1 n ∆(U )ansj (D). In this context. We assume that lk (D) matches lh (Q) and thus k should be added to ∆Σk . our ultimate goal is to regulate the growth of the domains so that the pattern matching problems are solved efficiently even for tricky queries and data. . Moreover. .and postorder conditions ensuring that a data node already accessed (or accessed at a given step) is not necessary for the generation of the solutions from that step up to the end of the scanning. domains grow in an uncontrolled way. i. ∆(U )ansm (D). XML documents) and the 80-20 law tells us that most of the queries involve a limited set of labels.5 showing the domains on a plane where the y-axis represents the query nodes and the x-axis the data nodes (domain space in the following). it can even make the management of the domains unfeasible due to the constraints on the main memory size. Thus. which are in fact the most frequent in the data tree. we characterize the delta answers that can be decided at each step. To this end.g. . we want to maintain the domains as compact as possible by putting nothing which is useless and by deleting elements that are no longer necessary for the generation of the subsequent answers.or post-order relationships required by Lemmas 4. . Necessary conditions are founded on the relative positions between such data node and the other data nodes accessed so far. k. Tree structured data often have a considerable number of nodes (see e. The combination of these two issues deteriorates the performance of the matching processes. . For illustration. . their growth poses some fundamental problems from the performance point of view. because at a given step of the sequential scanning we have no information about the properties of the data nodes that follow. we show the pre. those with preorders 1. unfortunately. consider Figure 4. During the sequential scan. Such a growth is continuous and is influenced by the peculiarities of the data and query. . either the data node has already been used in the generation of the previous delta answers or it will never be used and thus is unnecessary for the generation of ∆(U )ansj (D). will always violate one of the pre. .3 with the data nodes already accessed or those following the k-th. we will consider the snapshot of the sequential scanning process occurring at the k-th step (see Figure 4.e. By exploiting the pre/post ordering scheme. Our main aim is to determine the conditions h under which any of the already accessed data nodes. not all data nodes represented in the domains Σj . In other words. . we denote with ∆Σj the “reduced” Q i Q j versions of the original domains Σi which are needed to decide the delta answers from the j-th step.112 Query processing for XML databases where. In the following.1-4. . . In this case.

n ∩ Ø h=n 1 ..6-b. Moreover. m]: k ∈ S. .6: Representation of the pre-order conditions in the domain space 4. any set of indexes (pre-order values) S = (s1 . k’ = . It states that a data node matching the last query node qn then it will never belong to the solutions that can be computed in the following steps.2. n Besides the previous Lemma. . i. h From the previous Lemma. ∆Σk only contains k... ..PR stands for PRe-order. Thus k does not / Q belong to ∆Σj . . For illustration see Figure 4. sn ) qualifying the ordered inclusion is also ranked according to pre-order.. < sn ≤ m. Thus k does not belong to ∆Σj . j m D (a) Condition PRO1 (b) Condition PRO2 Figure 4.6-a depicting Condition PRO1 on the domain space. h The second Lemma extends the condition of the previous one to the subsequent steps. . in this case.. For illustration see Figure 4. 1 .e. for each j ∈ [k. . m]... Lemma 4. h − 1]. it follows that ∆Σj is always empty but when n dk = qn and. the following Lemmas states the conditions under which dk is no longer necessary for the generation of the delta answers. A direct consequence of this property is given by the following Lemma (Condition PRO1 . Lemma 4. k k+1 . i ∈ [1. . 1 ≤ s1 < .5 (Condition PRO2 applied to k) If ∆Σk = ∅.... j m D . O stands for Ordered). s2 . The first one states that if at the k-th step a domain ∆Σk preceding ∆Σk is i h empty then k will never belong to the solutions that can be computed in the k-th step and in the following ones.1 Conditions on pre-orders Let us first consider pre-order codes and the ordered case. Its proof is similar to the previous one. Recall that a sequential scan of a signature means that the data nodes are visited according to their increasing pre-order codes.. k .. i then for each S ∈ ∆ansj (D). h’-1 h’ .. a total order is required.Conditions on pre-orders Q 1 113 Q 1 i .4 (Condition PRO1) If h = n then k ∈ S for each S ∈ ∆ansj / Q (D) for each j ∈ [k + 1...

i ∈ [1. Thus k does not belong to ∆Σj . for each j ∈ [k. Thus k does not belong to ∆Σj . whenever post(qi+j ) < post(qi ) it is required that post(dsi+j ) < post(dsi ) and that si+j > si . is no longer necessary.4. / h Lemma 4. Thus k does not belong to ∆Σj .1 (Completeness) For the ordered case. Thus Lemmas 4. and 4.8 (Condition PRU applied to k < k) If k ∈ ∆Σk−1 . then for each S ∈ i ∆U ansj (D). . In particular. Condition PRO2 avoids the insertion of the data node 4 in the pertinence domain ∆Σ4 .6 rewritten in the following way are still sound. for each j ∈ [k. / h The following Theorem shows that the three previous conditions together constitute the sufficient conditions such that a data node. sn ). Notice that at the 4-th step. then for each S ∈ ∆ansj (D). 4.7 (Condition PRU applied to k) If ∆Σk = ∅.6. due to its pre-order value. the Lemmas above are no longer sound. . 3 2 Moreover at the 8-th step. notice that the pre-order values of any qualifying set of indexes are not required to be completely ordered as it is for the ordered evaluation. Theorem 4. / Q h . m]: k ∈ S. the Table above shows the impact of Conditions PRO1 and PRO2 on the composition of the delta domains in the domain space during the sequential scan for ordered twig matching. . any data node due to its pre-order value does not belong to the solutions which will be generated in the following steps. h − 1] i and post(qi ) > post(qh ) then for each S ∈ ∆U ansj (D) . h − 1]. For this reason. 1 2 3 {1} {} {} 1 {1} {} {} 2 {1} {} {} 3 {1} {} {} 4 {1} {5} {} 5 {1} {5} {} 6 {1} {5} {7} 7 {1} {5} {} 8 Example 4. for each i i Q j ∈ [k. beyond the conditions expressed in Lemmas 4. h − 1]. as ∆Σ4 is empty.5.2 With respect to the example of Figure 4. thanks to Condition PRO1 ∆Σ8 is empty.114 Query processing for XML databases Lemma 4. Lemma 4. However. there is no other condition ensuring that at each step k. m]: k ∈ S. ∆Σk ∩ i h ∆Σk = ∅. . 3 As to the unordered case. the unordered evaluation requires a partial order among the pre-order values of a qualifying set of indexes (s1 . i ∈ [1. i ∈ [1.4. m]: Q k ∈ S.5 and 4.6 (Condition PRO2 applied to k < k) If k ∈ ∆Σk−1 and h ∆Σk ∩ ∆Σk = ∅. and post(qi ) > post(qh ).

. First of all. here the distinction is between the path matching and the (un)ordered twig matching. m]. we first introduce some general rules which will then be used to study each of the matching approaches. the data node dk accessed at the k-th step can always be useful for the generation of the answers in the subsequent steps unless no data node in the delta sets exists for which it is required that its pre-order value is smaller than k. as no total order among the pre-order values is required.2 Conditions on post-orders As far as post-order requirements are involved. sn−1 . any data node due to its pre-order value does not belong to the solutions which will be generated in the following steps. the following Lemma easily follows from the property on postorder values a solution for twig pattern matching must satisfy.Conditions on post-orders 115 On the other hand. there is no other condition ensuring that at each step k. post(dsi ) < post(dk ) is required but post(dsi ) > post(dk ) or i post(dsi ) > post(dh ) is required but post(dsi ) < post(dk ) then k ∈ S. In this way.2. . Thus k can be deleted from ∆Σj . Thus. h − 1] exists such that for each si ∈ ∆Σk−1 . . in the unordered case. . we have shown the sufficient and necessary pre-order conditions for the exclusion of a data node in the generation of the delta solutions of the ordered and unordered inclusion of a query tree in a data tree. path matching requires a total order among the post-order values of the data nodes belonging to a match whereas in the twig matching only a partial order is sufficient. beyond the conditions expressed in Lemmas 4. .4. in the ordered case a concept of “last” query node from a pre-order point of view exists and thus whenever dk = qn any solution (s1 . . Instead. Indeed. . Lemma 4. there is no counterpart for Lemma 4.2 (Completeness) For the unordered case. the “position” of the query node matching the data node does not influence the use of such data node in the solutions which will be generated in the following steps.8.7 and 4. More precisely. Q h .9 (Condition POT1) If i ∈ [1. For this reason the following Theorem is sound and the proof is similar to that of Theorem 4. for each S ∈ ∆(U )ansj (D) for each j ∈ [k. Indeed. 4.1. This last aspect is considered in the two Lemmas above. sn−1 of all the other data nodes are required to be smaller than k (and thus accessed in the steps preceding the k-th). k) involving dk is generated at the k-th steps as the pre-order values s1 . Theorem 4. . It allows us to prevent the introduction of the current data node dk in the pertinence domain ∆Σj for the construction of the solutions in the steps from j = k to h j = m.

Lemma 4. and that of a preceding node post(dj ) (j < k).t. It is true when post(dj ) > post(dk ). is no longer necessary to generate the answers to the path inclusion evaluation. In the pattern matching definition. post(dk ). by considering the properties of the pre-order and post-order ranks given in Figure 4. Given the relationship between the post-order value of the k-th node. as in the pre-order case. we can state that dj is no longer necessary due to its post-order value. si ∈ ∆Σk where i i ∈ [1. at the k-th step of the sequential scanning.11 (Condition POP) Let Q be a path P . two kinds of relationships are taken into account between the post-order values of two data nodes di and dj : either it is required that post(di ) < post(dj ) or post(di ) > post(dj ). we want to predict what kind of inequality relationship will hold between the post-order value of dj and those of the nodes following dk in the sequential scanning. Only if we are able to do it. Notice that. Lemma 4. the other nodes will always be violate by the data nodes following the k-th in the sequential scanning. a node belonging to a delta domain will no longer be necessary for the generation of the delta answer sets due to its post-order value if one of the required post-order relationships w. Lemma 4. due to its post-order value. both in the cases of path and (un)ordered twig matching. We deeply analyze such situations in the following. a data node dj can be deleted from the pertinence delta domain if and only if it is required that its post-order value is greater than that of another data node but that condition shall never be verified from a particular step of the scanning process. n] and post(dsi ) < post(dk ).10 allows the introduction of the following Lemma showing a sufficient condition such that a data node. It follows that si ∈ S. i P . At first glance. Thus si can be deleted from ∆Σj . It means that either dj with j < k has already been used in the generation of the previous delta answer set or it will never be used.116 Query processing for XML databases Different is the case of the deletion of a data node preceding the k-th in the sequential scanning and thus already belonging to a delta domain. for each j ∈ [k.2. m]. Thus. for each S ∈ ∆ansj (D) for each j ∈ [k.r. Let us first consider the case of path matching. It follows that post(dj ) < post(dj ). it seems that the post-order relationships between post(dj ) and post(dk ) and post(dj ) and post(dj ) with j > k are completely independent.10 Let j < k and post(dj ) < post(dk ). Different is the case of the other post-order relatioship post(dj ) < post(dk ) which is taken into account in the following Lemma. For illustration see Figure 4.7-a depicting Condition POP on the domain space. m].

11.e.. Intuitively.11 also acts on the nodes preceding dk in i the sequential scanning. it states the only possible condition such that a data node due to its post-order value can be deleted. Notice that.. both the two kind of postorder relationship are in principle allowed between any two data nodes in . h − 1]. k ∈ S. 1 ∆Σ6 = {}.12 If for each i ∈ [1. j m D (a) Condition POP (b) Condition POT2 Figure 4. they have already been used in the 2 3 generation of the delta answer ans4 = { 1. On the other hand. j m D .. we only consider the case post(qi ) < post(qh ). < POST k-1 k s . 4 } at the 4-th step and they P will never be used again because node 6 belongs to another path. h It should be emphasized that Condition POP is a necessary and sufficient condition.11 can be used in place of Lemma 4. The proof is given in the following proposition. the post-order of the 3 current data node is post6 (D) = 7 which is greater than both post2 (D) = 3 and post4 (D) = 2. Given the situation depicted in Lemma 4. post(dsi ) < i post(dk ) then.4 where we show that in the computation of the delta answers there is no condition on post-order values to be checked.. Lemma 4..e. n 1 . and ∆Σ6 = {4} (lastName). when the composition of the delta domains is the following: ∆Σ6 = {1} (book).Conditions on post-orders P 1 117 Q 1 i i+1 si s Ø i si . for each si ∈ ∆Σk−1 .. being the query Q a path P .7: Representation of the post-order conditions in the domain space Example 4.3 Consider the example of Figure 4.9. the previous Lemma produces the same effect on the current data node dk as Lemma 4.. m]. si k-1 k < POST .9. For this reason..5 and 4. The proof is Theorem 4. The problem of the generic twig inclusion evaluation is more involved than the path matching problem. and ∆Σ6 = {}.. In these cases. ∆Σ6 = {2} 1 2 (author). i.. for each S ∈ ∆ansj (D) P for each j ∈ [k. Thus nodes 2 and 4 can be deleted from their pertinence domains and the composition of the delta domains becomes ∆Σ6 = {1}. Thus k can be deleted from ∆Σj . Lemma 4. due to Lemmas 4. n 1 si .. i.4 at step 6. In this case. Lemma 4. 2.. its deletion from ∆Σj .9.

and ∆Σ6 = {4} (lastName).13 and 4. A direct consequence of this property is given by the following fact. beyond the conditions expressed in Lemmas 4. which become empty.e. i k If ¯ ∈ [1.7-b. there is no other condition ensuring that at each step k. For this problem. 1 k Lemma 4. for each S ∈ ∆(U )ansQ (D) for each j ∈ [k.3 (Completeness) For the twig case. 4.118 Query processing for XML databases a set qualifying the pattern matching and only a partial order is required.15 If h = n then ∆ansk = ∅. For illustration see Figure 4.4 at step 6 when the delta domains are as follows: ∆Σ6 = {2} (author). It only considers the inequality condition on postorder values where it is required that dk is greater than any other data node. the node 6. stating that the set of matches which can be computed at step k is empty unless lk (D) matches with the “last” query node ln (Q).14. At this step. / Example 4. a new 2 3 root arrives. Lemma 4. Consequently.11 as it shows the post-order conditions under which the nodes preceding dk in the sequential scanning can be deleted. Lemma 4.e. no 1 check on the other delta domains is required and all the nodes s1 ∈ ∆Σk 1 such that post(ds1 ) < post(dk ) can be deleted from ∆Σk .13 (Condition POT2) Let si ∈ ∆Σk and post(dsi ) < post(dk ). we must again consider two cases: the ordered and the unordered. whenever Condition POT2 involves the root domain. post(ds ) > post(dsi ).4 Consider the unordered twig matching of the example of Figure 4. Theorem 4. 1 ∆Σ6 = {3} (firstName). The following Lemma is the counterpart of Lemma 4. Finally. ∆Σk .14 (Condition POT3) Let s1 ∈ ∆Σ1 and post(ds1 ) < post(dk ) j then s1 ∈ S. the 1 delta domains ∆Σ6 and ∆Σ6 can also be emptied thanks to Condition PRO2. i. i. m]. For the ordered case.3 On the computation of new answers In this subsection. Q . 2 3 Note that such situation is very frequent in data-centric XML scenarios. as post6 (D) > post2 (D). m]. we exploit the total order among the pre-order values of the data nodes in a match. Thus Condition POT3 allows us to delete node 2 from ∆Σ6 .2. for each / i j S ∈ ∆(U )ansQ (D) for each j ∈ [k. n] exists such that post(q¯) < post(qi ) and ∆Σ¯ = ∅ or for each ı ı ı k s ∈ ∆Σ¯ such that s > si . then si ∈ S. we want to detect at which step of the sequential scanning new matches can be decided. any data node due to its post-order value does not belong to the solutions which will be generated in the following steps.

. n − 1 and i + j ≤ n. 4.1 delete all the data nodes which are no longer necessary.5 If Lemmas 4. .Characterization of the delta answers 119 For the unordered case.5.6. . instead.16 If i > h exists such that post(qh ) > post(qi ) then ∆ansk = ∅.4 Characterization of the delta answers In this subsection. . . . the completeness of the conditions shown in the previous section ensures that we cover all possible node deletions from the pre.2.7. if posti+j (Q) < posti (Q) then postsi+j (D) < s postsi (D) and si ∈ ∆Σi i+j .14 have been applied at each step of the sequential scanning. 4. . n − 1]) 1 n For the ordered case. . ∆Σk only contains k. It shows how the set of delta answers can be computed at each step of the sequential scanning.and post-order points of view.8. 4. In particular. It also shows that Lemma 4. Lemma 4. . 2.14 have been applied at each step of the sequential scanning then the set of answers ∆ansk (D) Q which can be generated at step k for the twig Q is such subset of the Cartesian s product ∆Σk × . we characterize the delta answers generated at each sequential step. . respecting our three kinds of pattern matching strategies. × ∆Σk so that for each n 1 (s1 . . and 4. . Q Theorem 4.5. new matches can only be decided when the data node k matches with a query node which is a leaf. . is such subset of the Cartesian product ∆Σk × . sh .13. Thus.9. . where only a partial order is required. Q which can be generated for the twig Q at step k whenever h is a leaf. The following Theorem represents a considerable result for the path matching. On the other hand.5. it only contains new matches. .6. Theorem 4. . .1 are satisfied. sn ): (1) si ∈ ∆Σi i+1 for 1 n each i < n (2) condition on the post-order values expressed in Lemma 4. 4. and 4. 4.11 together with Theorem 4. sn ): (1) sh = k. due to Lemma 4. . as the delta domains of the query leaves are not always empty. 4.13.11 have been applied at each step of the sequential scanning then the set of answers ∆ansk (D) which can P be generated at step k for the path P is such subset of the cartesian product s ∆Σk × . and 4. we must avoid producing redundant results.9. the “last” delta domain ∆Σk is always empty unless the current data node k matches with n the n-th query node and. j = 1. sn ) | si ∈ ∆Σi i+1 for each i ∈ [1. × ∆Σk where for each (s1 .4 If Lemmas 4. i. then the set of answers ∆U ansk (D). In the unordered case. .6 If Lemmas 4. . 4. Theorem 4. (2) for all pairs of entries i and j. . . . 4.4. . 4. × ∆Σk defined as ((s1 . . in this case. post-order values must be checked. Q 4. .4. in this case. whenever n ∆ansk (D) is not empty.

if the query specifies a value condition on the leaf node ql and we have a content-based index built on ql we can improve the scanning process as follows.10 if post(du ) < post(t(ql )) and u < pre(t(ql )) then . each solution to the path matching will contain one of the occurrences conteined in T (ql ). therefore. Given a query leaf ql that specifies a value condition and associated to a content-based index we can can obtain from the index all the the occurrences of ql in the current document that satisfy the condition. we access a node du (with u < pre(t(ql ))) having post(du ) < post(t(ql )) we are guaranteed that node du does not belong to any answer ending with node t(ql ) as post-order values must be in descending order. t(ql ) is the next potential match for the query node ql . first we can avoid to insert useless nodes into delta domains and second we can avoid to scan some fragments of the data document since we are guaranteed that no useful nodes will be found in the skipped document part.e. in the following t(ql ) is also called current target for ql . But we are also guaranteed that node du does not belong to any answer ending with any node t (ql ) ∈ T (ql ) : pre(t (ql )) > pre(t(ql )). 4. for the path case t(ql ) is defined as the first node in T (ql ) that has a pre-order value greater then the current document pre-order value. since due to Lemma 4. From the index we can obtain the list T (ql ). not all occurrences will belong to a solution.3 Exploiting content-based indexes In the following sections we describe how to take advantage from available indexes built on the content of documents nodes. i. during the sequential scan knowing a priori the post-order value of the next node matching with the query leaf enables us to avoid the domain insertion of useless nodes and to reduce the fragment of the data tree to be scanned. Current Target Definition As we said. during the sequential scan.1 Path matching Consider the evaluation of a path. Let T (ql ) be such set of occurrences ordered by increasing value of pre-order and let t(ql ) ∈ T (ql ) be the next potential match for ql . obviously the opposite is not true.3. Generally two kinds of operations could be exploited by taking in consideration content-based index information. More precisely if.120 Query processing for XML databases 4. Insertion Avoidance and Skipping Policy A path answer requires that pre-order and post-order values of its element are totally ordered in increasing and decreasing order respectively.

Moreover since post(du ) < post(du ) for each du descendant of du it follows that conditions above hold also for these nodes so we can safely discard them. from those indexes we will obtain as many lists of data nodes that satisfy the conditions and we have to coherently manage them. we can directly continue the scan from the first following of du . i.. First we start analyzing the main differences between the path and the twig matching. The second difference is that in the path matching any query node is an ancestor of the leaf (unless it is the leaf itself) and the same relationship must be retained in each solution whereas in the twig matching the relationship between each query node and a query leaf can be either of ancestor or following-preceding type.e. If the signature does not contain the first following values f fi we can still safely skip a part of the document due to the following observation. From the above considerations we can conclude that if during the sequential scan we access a node du such that post(du ) < post(t(ql )).. Given a twig query we assume to have a list L = {ql1 . If we have at least two content-based indexes on the leaves subject to value conditions. We first discuss the definition of the current targets and the management of the lists T (qli ) and then we discuss the skip policy. In this section we extend those observations to support a similar improvement for the ordered twig pattern matching.Ordered twig matching 121 post(du ) < post(t (ql )) for each t (ql ) ∈ T (ql ) : pre(t (ql )) > pre(t(ql )). if Tl is scanned in sorted order by pre-order values then we can safely discard the node du since it will never belong to any answer. The first difference is that while in a path we can have at most one value condition. on the leaf of the path.2 Ordered twig matching In the previous section we described how to speed up the sequential scan in the path matching when an index on the value condition specified on the leaf of the path query is available.3. Under these conditions. ql2 . . the twig query can contain more than one condition. qlr } that contains all the query leaves subject to value condition and associated with a content index and that the list is ordered according to their pre-order values (li < li+1 ∀i ∈ [1. . We have f f du = u + size(du ) + 1 and size(du ) = post(du ) − u + level(du ) since 0 ≤ level(du ) ≤ h where h is the height of the data tree we can safely continue the scan from the node having pre-oreder euqals to post(du ) + 1. 4. For each leaf qli ∈ L through the associated index we can obtain a list T (qli ) of all the occurrences of qli in the current document that satisfy the specified condition ordered according to pre-order values. r − 1]).

5 Consider Figure 4. Current targets are related to each other and depend on the current document pre-order value and on the current state of the delta domains.8.2) B (4.1) C (3.C3 } in TC3 will never be candidate for domain insertion because no element TB2 has a smaller pre-order value.1). C3 . however an ordered twig answer requires that preorders of its elements are totally ordered in increasing order (see Lemma 4. Definition 4. suppose that each document leaf satisfies the correspondent condition.4) C (6.6) B (2.3) C (5. An answer to the query requires that any node matching “C” follows (i. The query in the example has two value constrained leaves.122 Query processing for XML databases Ordered Twig Query A (1.1) C (3. Example 4. we say that two list T (qli ) and T (qli+1 ) are aligned iff: . then the lists T (B2 ) and T (C3 ) obtained through the index are {B4 } and {C2 . let k be the current document pre-order. so not all nodes in each T (qli ) will be candidate for the domain insertion. where the nodes are represented as circles filled with different shades: ˆ a white circle is a generic node.5) Figure 4. While we perform a sequential scan over the input document we associate each list T (qli ) with the current target t(qli ) that represents the next potential match for the element qli . C6 } respectively.2) C (2. ˆ a dark-grey circle is a query node with a value condition (or a document node with a value).3) Data Tree A (1.2 During the sequential scan. ˆ a light-grey patterned circle is a query node ancestor of at least one value constrained leaf. has a greater pre-order value) any node matching “B” thus we know that element {C2 .8: Target list management example Current Target Definition and List Management Each element in T (qli ) matches with qli . C5 .e.

i.Ordered twig matching ˆ pre(t(qli )) > k and pre(t(qli+1 )) > k ˆ pre(t(qli+1 )) > pre(t(qli )) or pre(t(qli+1 )) > minP re(∆Σki ) l 123 Where minP re(∆Σki ) is the minimun pre-order value for nodes in the delta l domain ∆Σli at the k-th step of the algorithm. If the target lists are aligned we also say that current targets are aligned. More precisely given minP re = max{k. this could lead to the definition of new targets. The alignment property is transitive.e. min{pre(t(qli )). the updating of the targets should be performed during the sequential scan and whenever a deletion is performed on a ∆Σli delta domain. Then. In order to take advantage from the transitive property and in order to minimize the number of operations we perform the update starting from t(qli ) and then propagate that to the targets on its right (with increasing value of i). in this case the alignment depends only on the elements contained in the lists because each delta domain ∆Σli is empty. we can safely skip some part of the input document because we are sure that no useful elements for the current query will be found in the skipped document parts. minP re(∆Σki )}} l t(qli+1 ) is then the first element in T (qli+1 ) that has a pre-order value greater or equal to minP re. Sequential scan progressively increments the current document pre-order. l l With these three values we can define the minimum pre-order value that the target t(qli+1 ) must assume. Insertion Avoidance and Skipping Policy For the path case we have shown that during a sequential scan. From the definition of the alignment property we can derive how new targets should be defined. The first alignment is performed before starting the sequential scan. if T (qli ) is aligned to T (qli+1 ) and T (qli+1 ) is aligned to T (qli+2 ) then T (qli ) is aligned to T (qli+2 ). depending on the current target and on post-order value of the current document’s element. The target t(qli+1 ) for the element qli+1 depends on: ˆ the current document pre-order (k). ˆ the pre-order of the current target for the element qli (t(qli )). As for the path case the skipping policy can only be based on conditions on the post-order values between document nodes . For the ordered twig case we need to introduce more constraints that limit the applicability of the skipping strategy. ˆ the minimum pre-order of elements in ∆Σki (minP re(∆Σki )).

At this step.n+1) A (n+3. Again we access element “X” but this time even if the current post-order value is smaller than the current target one we cannot skip the subtree of the current element.1) C (3. in particular Lemma 4.9: Ordered Twig Examples and current targets. m]. In this case the first element matches with the root query element and being an ancestor of the current target it will be inserted into the correspondent delta domains. the delta domain associated to “A” is empty and the current document post-order is smaller than the current target one (i. the current target is outside the subtree rooted by “X”).e.n+2) C (n+5.n+3) B (3.n+1) D (n+3.n+2) C (n+5. for each j ∈ [k. Let us first analyze the first data tree.n+4) B (n+4.10 ensures that if j < k and post(dj ) < post(dk ) then post(dj ) < post(dj ). the second data tree case is different. Example 4. suppose that for both input documents “C” elements satisfy this condition. During the sequential scan we first access “D” element that does not match with any query element.n+5) Data Tree 2 A (1.1) B (n+4. For the skipping policy point of view this means that while we are looking for answers that include current targets.n+4) X (2. in fact the delta domain associated to “A” is not empty and possible matches for “B” (whose post-order is not required to be greater than the one of “C”) could be lost if we skip the current element subtree (as the example shows). Instead.124 Ordered Twig Query A (1. This simple example shows that. then we access its first child and independently from its label “X” we can entirely skip its subtree. The sample query specifies a value condition over its leaf “C”.9. differently from the path case.e.3) Query processing for XML databases Data Tree 1 D (1.6 Consider Figure 4.n+3) Figure 4.2) X (2. we are looking for an ancestor of the targets) but it is actually smaller then we can skip all descendants of the current document element because none of them could be useful. if we access a document node that should have a post-order value greater than the targets’s one (i.n+5) B (2. we cannot establish if a skip is safe or not by taking into consideration only post-order .

B6 . thus we associate each query node. say qx . Otherwise.3) A (5. then. Example 4.3) Data Tree D (1. As before. < lu and also that post(ql1 ) < x x x post(ql2 ) < .. Then we access A2 that matches with the root of the query. we need to make some considerations on the relationship between a generic query node and the query leaves subject to a value condition.7 Consider Figure 4. suppose that each document leaf satisfies the correspondent value condition. that contains all the query leaves in L that are x x x x descendants of qx . In particular let dk be the current data node matching with qx .12) B (2.5) C (9.. dk can potentially belong to an answer and thus it could be inserted in the correspondent domain. Before explaining under which conditions a skip is safe. For this reason we can avoid to insert dk in the domain associated to qx . By convention if qx ∈ L then L = {qx }. to a sublist x x x of L. Lx = {ql1 . current targets are t(B2 ) = B3 and t(C3 ) = C9 . x if post(dk ) >= post(t(qlu ))..2) B (6.9) C (12. in order to establish if a node matching qx with Lx = ∅ is useless we can simply consider the post-order of the current x target for qlu . dk does not belong to any subsequent answer x due to Lemma 4.Ordered twig matching 125 Ordered Twig Query A (1.4) A (7.1) E (4.7) B (11. < post(qlu ).8) A (10. now since post(t(C3 ))=6 and .10) E (8. C12 } and they are aligned.10: Ordered Twig Examples values. ql2 .11) B (3.1) C (3. The condition above is necessary but not sufficient in order to establish if a node is useful. C3 }. In particular if ∆Σkj = ∅ l x ∀qlj ∈ Lx we also know that dk is an ancestor of all t(qli ) since post(dk ) x > post(t(qli )) for each i < u. qlu }.6) Figure 4.10.10. Given a set of aligned targets.2) A (2. we have L1 = {B2 .. initial target lists are T (B2 ) = {B3 . B11 } and T (C3 ) = {C9 . The first accessed node is D1 that does not match with any query node. one or more query leaves subject to a value condition. x then if post(dk ) < post(t(qlu )). Notice that l1 < l2 < . Each query node can be ancestor of zero. . as T (qlu ) is ordered by the pre-order value.

instead of checking only the post-order of t(C3 ). in order to perform a skip. Next we find and insert B11 and C12 into the correspondent domain and we generate the last solution {10. . We arrive at A5 that matches with the query node and since post(t(C3 ))=6 and the current document post-order value is 8 the element could be useful and we insert it into the correspondent domain. The next node does not match with any query node so we continue the scan. Now we access B6 that is the current target for B2 . t(C3 ) becomes C12 . Now we access B3 that is the current target for B2 . The rough condition above could be refined as follows: ˆ none of the current partial solutions can be extended by nodes that are descendants of the current document node. The successive node does not match with any query node so we continue the scan. The observation above enables us to avoid the insertion of useless nodes but. For each twig node qx we have: ˆ no reference target if Lx = ∅. the one with the highest pre-order in Lx . Since ∆Σ5 = ∅ and ∆Σ5 = ∅ we know that A5 is an ancestor of the current 2 3 targets for B2 and C3 . we had also checked the post-order of t(B2 ).11. Node A10 matches with the query root and since post(t(C3 ))=10 and the current document post-order value is 11 the element could be useful and we insert it into the correspondent domain. Like before we also know that A10 is an ancestor of both t(B2 ) and t(C3 ). the node is useful and we could generate the first solution {5.9}. we now access C9 that is the current target for C3 . we must first define under which conditions a skip is safe. x ˆ one reference target qlu .6. Next node is A10 that is a following of A5 so all the domains are emptied. the target for B2 becomes B6 (that is still coherent with the current t(C3 )).12}. we would have concluded that A7 is an useless element. i. Basically a skip is safe if there is no useful element is in the document skipped part. Now we access A7 and for the same reasons explained before we need to insert it into the correspondent domain. since the parent domain is not empty we can insert it into the domain. in the implementation one could choose which condition has to be applied depending on the desired trade-off.e.126 Query processing for XML databases the current document post-order value is 3 A2 is an useless element and it will not be inserted into the correspondent domain. It has to be noted that if. the target for B2 becomes B11 that is beyond the t(C3 ) but since we have inserted the previous target into the domain t(C3 ) does not change (see the previous section). but since the domain of A1 is empty we cannot insert into the domain.

If the first empty domain has a reference target or belongs to a target itself (say qli ) then we know that we are still looking for a match for qli so if post(dk ) < post(t(qli )) we know that our target is not a descendant of dk and. the condition above will be: j ∃qi : ∆Σk = ∅∧ ∀j ∈ [1. i] is not descendant of dk that is necessary in order not to miss any useful match. the same happens when at least one non empty domain associated to a node qi that has Li = ∅ exists. any subsequent t(qli ) is not a descendant of dk . in other words if Di is empty then each Dj with j ∈ [i + 1. A matching node can be inserted into its domain if the delta domains of its preceding and ancestor elements are not empty. A skip is considered safe iff: j j ∃qi : ∆Σk = ∅∧ ∀j ∈ [1. It has to be noted that even in the previous situation we have no information about the next occurrences of nodes matching with qi with Li = ∅. if we cannot extend a partial solution. i] ∃qlv ∈ Lj : post(dk ) < post(t(qlv )) i j Instead of checking the existence of a qlv whose target is not in the subtree of the current document node we could simply check if the current reference j target t(qlu ) is descendant of the current document node. dk will not . the two conditions collapse. even if dk matches with a query node qj . in this condition we can not perform a skip. however this condition is subsumed by the previous one since. for such nodes we have no information about the next occurrence so we cannot be sure that no useful occurrence is a descendant of dk . Finally we highlight that if we can perform a skip at step k then we are guaranteed that. For the ordered twig matching we know that domains are filled “from left to right”. If the check succeeds we are guaranteed that no solution will be completely built with nodes that are descendant of dk . it is not possible to build a complete one and. if no partial solutions exist. If the first empty domain belongs to a node qi with Li = ∅ then it belongs to a node that is related to targets by a following-preceding relationship. Let us analyze the first condition. but we know that each occurrence of them will be useless because at least one preceding domain will be empty. The j condition ensures us that also each reference target t(qlu ) with j ∈ [1. as we have shown for the path case. n] is also empty.Ordered twig matching 127 ˆ it is not possible to build a complete solution with nodes that are descendants of the current document node. i] post(dk ) < post(t(qlu )) i In order to verify if a solution could be completely built with nodes that are descendants of the current data node it is sufficient to check if post(dk ) < 1 post(t(qlu )). Analyzing the delta domains we can establish if a skip is safe or not.

then at least one empty delta domain ∆Σk that prevents the insertion exists. we cannot assume that the one with the highest query pre-order value is also the one whose current target has the highest post-order value: Since matching nodes need to be ancestor of each associated target they need to have a post-order value greater than each associated target. Otherwise if j > i. . Since there is no order constrain. we associate each list T (qli ) with the current target t(qli ) that represents the next potential match for the element qli .e. then the skip condition ensure us j that t(qu ) is not a descendant of dk . ql2 . For the unordered matching.128 Query processing for XML databases be inserted into ∆Σj since if 1 ≤ j ≤ i. i.3 Unordered twig matching Similarly to the ordered case. Consider a node with more than one associated target. i 4. We can still define a single reference target (the one whose current target has the highest post-order value) for each query node but this reference target needs to be dynamically updated along with the . The current target for the leaf qli is the first element in T (qli ) that has a pre-order value greater or equal to the current document pre-order value. for the unordered twig matching we do not have any alignment property between the target lists. we can have more than one leaf with a specified value condition and associated to a content index and similar to the previous case we need to coherently manage the lists of occurrences obtained through the indexes. greater than the highest one..3.. For the ordered case we could statically define for each twig node its reference target between the associated targets. qlr } that contains all the query leaves subject to value conditions and associated with a content index. Current Target Definition and List Management Like the ordered case. given a twig query we can obtain a list L = {ql1 . for the unordered case this static definition is not possible. This means that current targets are not related to each other and do not depend on the current state of the delta domains. During the sequential scan. Insertion Avoidance and Skipping Policy Before starting to analyze the skipping policy we need to highlight another difference between the unordered and the ordered case induced by the absence of the order constraint. current targets depend only on the current document pre-order value. for each leaf qli ∈ L through the associated index we can obtain a list T (qli ) of all the occurrences of qli in the current document that satisfy the specified condition ordered according to pre-order values.

1) C (3.11: Unordered Twig Examples update of the related current targets. For the unordered case a matching node could be inserted in the correspondent domain if there is at least a supporting occurrence (a node with a greater post-order value) in the domain of its parent/ancestor.n+3) Figure 4. the query specifies a value condition on the node “B”.1) C (n+4. at this step we have all the delta domains empty and the current document postorder value is smaller than the post-order value of the current target for “B”.8 Consider Figure 4.n+4) X (2. Let us analyze the first data tree. independently of its label we can avoid to scan its subtree. then it accesses its first child.n+2) B (n+5. The shown scenario is very similar to the one for the ordered case.3) 129 Data Tree 2 A (1.n+5) B (2. if x post(dk ) < post(t(qli )) then the node dk is useless and we can avoid to insert it into the delta domain associated to qx .n+1) A (n+3. since it is ancestor of the current target for “B”.n+4) C (n+4.11. in order to establish if a skip is safe or not we need to analyze the status of the delta domains. Now let us analyze the second data tree case. Even if we can build a partial solution with nodes found in the subtree of the current element we will never be able to complete this partial solution with the required “B” node (next match for “B” is outside the subtree and since a match for “A” is still missing it is not possible to complete partial solutions with this node).Unordered twig matching Unordered Twig Query A (1.n+1) D (n+3. Now we access the first child of the root. Assume that the current reference tarx get for qx is qli and that qx matches with the current document node dk . like the ordered case.2) X (2. we can insert it into the delta domain associated to “A”. suppose again that for both input documents all “B” nodes satisfy this condition. Example 4. Now we can discuss the skipping policy used for the unordered matching algorithm.n+5) Data Tree 1 D (1.n+2) B (n+5.n+3) C (3. The root of the document matches with “A” and. Like the previous cases the skipping policy can only be based on conditions over the post-order values between nodes and current targets and. the sequential scan first accesses node “D” that does not match with any query node. again the current document post-order value is smaller than the one of the current target for “B” but this .

First if the delta domain associated to qi is empty the skip is considered safe because we are guaranteed that matching elements with qi or with any descendant of qi will never i be part of a solution since there is not a valid match for qlu in the subtree of the current document node. if for at least one child the skip is considered unsafe then the skip is unsafe also for qi . Starting from these examples we can derive the conditions under which a skip is safe for the unordered matching algorithm. . It should be noticed that a smart management of the domains in the sequential scanning does not prevent the adoption of other improvements like filters or the use of indexes. In order to establish if a skip is safe or not it is sufficient to check the condition above for the query root q1 . to efficiently put the theoretical framework in practice. The second part of the condition is less intuitive.2 we have introduced a theoretical framework consisting of a set of pre/post-order conditions for a node to be deleted.e. the skip is unsafe because.130 Query processing for XML databases time we cannot skip the subtree. we know that at least one target will be lost if we perform a skip. If the delta domain associated to qi is not empty we need to verify if the skip is safe for all the children of qi . in the former case. The set of conditions is complete thus ensuring that the domains are maintained as compact as possible from a numbering scheme point of view. or its current reference target is descendant of the current document node. From a qi point of view a skip is considered safe iff: i Li = ∅ ∧ post(dk ) < post(t(qlu )) ∧ (∆Σk = ∅∨ skip is safe ∀qj ∈ i children(qi )) It is obvious that if qi is related to targets only by following-preceding relationships (i. if we introduce the order constraint. The delta domain associated to “A” is not empty so possible matches for “C” (like the one shown in the example) will be useful. 4. it also means to find implementation solutions consuming little time. the same subtree could be safely skipped (since “C” matches are useless unless we found a preceding match for “B”). Li = ∅). At the same time. we have no information about next matches for these kinds of node and. in the latter case.4 An overview of pattern matching algorithms In Section 4. It has to be noted that. The next challenge is to conceive sequential pattern matching algorithms that exploit the theoretical framework to manage the domains efficiently.

Indeed. each data node k in the pertinence domain Dh consists of a pair: (post(k). Otherwise it can be reduced by filters or auxiliary structures as in [25.3. 32. For a full discussion of the complete algorithms and all their different versions see Appendix B. The three kinds of considered pattern matching share a common skeleton shown on top of Figure 4. PRO1 for Condition PRO1).12 shows the algorithm skeletons with place-holders implementing the complete set of deletion conditions.g. for each query node identified by its pre-order value i we assume that the post(i) operator accesses its post-order value. each pointer indicates the pair which is at the top of Dprev(h) . Figure 4. Moreover. Even if the implementation of the complete set of the deletion conditions ensures the compactness of domains. then. respectively). it .4. . the l(i) operator accesses its label.12. We will in this context only show a sketch of the ideas and the basic skeletons of the algorithms. . 72]. A deep analysis of this aspect is provided in Section 4. Line 0 determines the sequence of data nodes for the sequential scanning. Indeed. the sequential scanning can start from first(l(1)). In particular. to which case our theoretical framework can easily be adopted. In the worst case it is the whole data tree. 4. Thus the data nodes from the bottom up to the top of each domain are in pre-order increasing order. . . and we associate a domain Di together with the maximum and the minimum post-order values of the data nodes stored in Di (accessible by means of the minPost and maxPost operators. By recursively following such links from Dprev(h) to k D1 . ∆Σ1 . to put the theoretical framework in practice also means to select the most effective reduction conditions with respect to the pattern and the data tree. Nodes are scanned in sorted order of their pre-order values and insertions in the domains are always performed on the top by means of the push operator. whenever the pre-order value of the first occurrence first(l) of each label l in D and of the last one last(l) are available. we will not analyze the content index optimized versions exploiting the properties described in Section 4. Further.6 (e. In this way.4. we can derive ∆Σk prev(prev(h)) . In the ordered case.4 An overview of pattern matching algorithms 131 In this section we show how the conditions presented so far can be used in pattern matching algorithms to manage the domains associated with the query nodes. pointer to a node in Dprev(h) ) where prev(h) is h − 1 in the ordered case whereas it is the parent of h in Q in the unordered case. the set of data nodes in Dprev(h) from the bottom up to the data node pointed by k implements ∆Σk prev(h) . as suggested by Theorems 4.5 and 4. thanks to Condition PRO2.7. When the data node is inserted into Dh . in some cases the CPU time spent to apply a condition is not repaid by the advantages gained on the domain dimensions and/or the execution time.

132 Query processing for XML databases (0) getRange(start. (1) for each k where k ranges from start to end (2) and matches with the query node h: l(k)=l(h) (3) (4) (5) (6) POP PRO2 Lemma 4. Therefore. domains can be treated as stacks. in the unordered case.16 (b) OTMatch(Q) (c) UTMatch(Q) Figure 4. Instead of checking all nodes as specified in Condition POP. that is deletions are implemented following a LIFO policy by means of the pop operator. we stop looking at the nodes in Di whenever post(top(Di ))>post(k)) (line 2).12: Pattern matching algorithms can end at last(l(n)) due to Lemma 4. the ordered algorithms delete k if it is the last node. then they check whether the intersection between domains at different steps is empty (line 4). end). if the algorithms are performed on compressed structural surrogates of the original data tree. and thus it should be added to Dh (lines 1-2). Indeed. Finally. node deletions are only performed in the POP fragment which corresponds to the following code: (1) for each Di where i ranges from 1 to n (2) while(¬isEmpty(Di ) ∧ post(top(Di ))<post(k)) (3) pop(Di ). then the first and last values can be computed in the surrogate construction phase. finally they work on the current node. As to the PMatch(P) algorithm.15 PRO1 (3) (4) (5) (6) POT2 & POT3 PRU POT1 Lemma 4. assuming that the current data node k matches with the h-th query node. the three algorithms implement the required conditions in the most effective order. Lemma 4. It fully implements Condition POP because if post(top(Di ))>post(k) then . they try to delete nodes by means of the conditions on post-orders (line 3).15 PRO1 (a) PMatch(P) (3) (4) (5) (6) (7) POT2 & POT3 PRO2 POT1 Lemma 4.16 suggests to set the end as the maximum value among last(l(l)) for each leaf l in the query. In this case. First. they delete all data nodes in the domains specified in the corresponding Conditions.15 whereas. Note that. In particular the twig algorithms implement Condition POT1 to check whether k can be added to Dh and all the algorithms verify if solutions can be generated. through the PRO1 code fragment.

si )) (4) pos ← index(Di . the fact that domains are stacks allows us to implement Condition PRO2 (isEmpty(Di ) checks whether Di is empty whereas empty(Di ) empties Di ): (1) for each Di where i ranges from 1 to n (2) if(isEmpty(Di )) (3) for each Di where i ranges from i + 1 to n (4) empty(Di ). (5) delete(Di .1) is a recursive function implementing Theorem 4. More details can be found in Appendix B. Finally the Lemma 4. OTMatch(Q) and UTMatch(Q).pos).1). In these cases the code fragment corresponding to POT2 & POT3 is: (1) for each Di where i ranges from 1 to n (2) for each si in Di in ascending order (3) if(post(si )<post(k) ∧ isCleanable(i.15 code fragment is: (1) if(h = n) (2) showSolutions(h.4 An overview of pattern matching algorithms 133 post(si )>post(k) for each si ∈ Di and thus Condition POP can no longer be applied. (6) if(i = n) (7) updateLists(i.(post(k).pointerToTop(Dh−1 ))). . it can be shown that in order to delete the nodes belonging to a domain Di at step k. Moreover. Notice that when both POP and PRO2 are applied. cannot be stacks because they are not ordered on post-order values thus deletions can be applied at any position of the domains. (5) if(¬isEmpty(Dh−1 )) (6) push(Dh .4 and the PRO1 one is: (1) if(h = n) (2) pop(Dh ). it is first necessary to delete the nodes belonging to Di at a step preceding the k-th one.si ).4. where showSolutions(h. we only check whether Di is empty (line 2). Indeed.si ). where at line 6 the addition of k in Dh took place. Observe that. instead of checking the intersection between the state of the domains at different steps as required by Condition PRO2. the external cycle shared by both the conditions (line 1) are merged. Domains of the other two algorithms.

pointerToTop(Dprev(h) ))). As to PRO2 and PRU.(post(k). Such emptying are performed by the second fragment. its pointer becomes dangling. Such an update is performed in a descending order and stops when a node pointing to a node below si is accessed. Whenever a node si is deleted.16 check if new solutions can be generated and. for instance. which is the application of Conditions PRO2 and PRU to a node k preceding k and thus already belonging to a domain Dh . due to the deletions applied in the POT2 & POT3 code fragment.k)) (2) push(Dh . k is deleted in the updateLists() procedure when. We recall that. where it is sufficient to only check Dprev(h) because. they consists of two code fragments. Lemma 4. In particular. In particular. . The first one is the application of Conditions PRO2 and PRU to the current data node k: (1) if(¬isEmpty(Dprev(h) ) (2) push(Dh . The same can be recursively applied to the other domains k ∆Σk prev(prev(h )) . if a domain Di is empty then all the domains “following” Di are emptied. where the boolean function isNeeded() checks the condition shown in Condition POT1 by using the minPost(D) and maxPost(D) values for each domain D instead of comparing k with each data node in D. . Obviously. the two conditions of Lines 1 are put together: ¬isEmpty(Dprev(h) ) ∧ isNeeded(h.134 Query processing for XML databases where the boolean function isCleanable() checks whether si can be deleted. the updateLists() function updates the pointer of all the nodes pointing to si in order to make it point to the node below si . . As to POT1. Thus.pointerToTop(Dprev(h) ))).15 and Lemma 4.6. By anology to the path matching algorithm. In this case. call recursive functions implementing Theorem 4. links connecting each domain Di with the domains of the descendants of i in the twig Q and the minPost operator are exploited to speed up the process. In this case.5 and 4. ∆Σk prev(h ) is implemented by that portion of Dprev(h ) between the bottom and the data node pointed by k and ∆Σk prev(h ) is the current state of Dprev(h ) . in this case. if the pointer of k is k dangling it means that ∆Σk prev(h ) ∩ ∆Σprev(h ) = ∅ as required by Conditions PRO2 and PRU. respectively. ∆Σ1 . whenever Conditions PRO2 and PRU are applied. the code fragment is: (1) if (isNeeded(h. whenever both POT1 and PRO2 or PRU are applied. instead of checking the post-order value of each data node in the domains.k). . if i is the root.(post(k). . intuitively. it simply returns true (Condition POT3). otherwise it checks the conditions expressed in Condition POT2. we check if minPost(D)>post(k) for each domain D.

The outcome of this phase is an ordered set rew(Q) of the subsignatures sub sigPj (Q) defined by the index sets Pj . for each leaf j. Any path Pi represents all (and only) the ancestor-descendant relationships between the involved nodes. identification of the set of answers to the unordered inclusion of Q in D. In principle. therefore we will now focus on the last point. suppose the data tree D specified by signature sig(D) and the query tree Q specified by signature sig(Q). 3. which in certain querying scenarios can provide an equivalently high (or even better) efficiency than that of the “standard” algorithms presented in the previous section. 2. Then the structurally consistent path qualifications are joined to find unordered query tree inclusions in the data. The query decomposition process transforms a query twig into a set of root-to-leaf paths so that the ordered tree inclusion can be safely applied. . decomposition of the query Q into a set of paths Pi . i. Evaluating the path inclusions is a quite straightforward operation (see previous sections for the required properties). we sort the paths on the basis of their selectivity. identifying the set of answers to the unordered inclusion of Q in D. The query tree Q is decomposed into a set of root-to-leaf paths Pi and the inclusion of the corresponding signatures sig(Pi ) in the data signature sig(D) is evaluated. Thus. See [138] for more details. an ordered inclusion of sig(Pi ) in sig(D) states that a mapping. the unordered tree inclusion of Q in D is found.4.5 Unordered decomposition approach In this section we propose an alternative approach specific for unordered tree matching.e. evaluation of the inclusion of the corresponding signatures sig(Pi ) in the data signature sig(D). the decomposition approach consists of the following three steps: 1. The idea is not to consider the twig as a whole but to decompose it into a collection of root to leaf paths and search for their embedding in the data trees. keeping the ancestor-descendant relationships. so that in the next phase. More formally. If there are structurally consistent answers to the ordered inclusion of all the paths Pi in D. For efficiency reasons. the more selective paths are evaluated before the less selective ones. exists from the nodes in Pi to some nodes in D.5 Unordered decomposition approach 135 4.

f. 3} only matches at the level of node names but it is not a qualifying one. 1. 3}}.3.2) Figure 4. in particular the first unordered query shown. respectively. in order that they are joinable. and the data tree. 5} ∈ ansP2 (D) and {1. The rewriting of Q gives rise to the following paths rew(Q) = {P2 .5. 4 The only sub-signature qualifying the unordered tree inclusion of Q in D is defined by the index set {1.3 whereas the index set {2. In this case.4) B (3. 2. 1. 5. From the cartesian product of ansP2 (D) and ansP3 (D) it follows that the index sets {1. which we will call Q. The index 1 occurs in the first position both in P2 and P3 . 5. Example 4. The common sub-path between P2 and P3 is P2 ∩ P3 = {1}. 5} ∈ ansP2 (D) and {2.9 Consider Figure 4. 3. P3 }. {2. 3. c.3) A (2. 2} and P3 = {1. f. 3}.1) C (4. f. whereas {1. 2. 4. 5}} and ansP3 = {{1.2) F (4. . 5. 1. 3} and the corresponding sub-signature is sub sig{1. a.5) F (2.3) Query processing for XML databases Unordered Twig Query 2 A (1. The main problem is to establish how to join the answers for the paths in rew(Q) to get the answers of the unordered inclusion of Q in D. 0. Notice that the index set {1. and the outcome of their evaluation is ansP2 = {{1.1) B (3.1 Identification of the answer set The answer set ansQ (D) of the unordered inclusion of Q in D can be determined by joining compatible answer sets ansP (D). Such commonalities and differences must meet a correspondence in any pair of index sets Si ∈ ansPi (D) and Sj ∈ ansPj (D). 4. we state that Si ∈ ansPi (D) and Sj ∈ ansPj (D) are structurally consistent. 1. b. 3} ∈ ansP3 (D) are not structurally compatible and thus not joinable.5} (D) = a. 3}. 3} satisfies both conditions of Lemma 4.136 Unordered Twig Query 1 A (1. 1 and sig(D) = a. The condition is that any pair of paths Pi and Pj share a common sub-path (at least the root) and differ in the other nodes (at least the leaves). 3. 5.2) F (2.13: Examples for decomposition approach 4. 4 . 3} ∈ ansP3 (D) are structurally consistent as they share the same value in the first position and have different values in the second position.4) Data Tree A (1.13. D. b. We have sig(Q) = a. b. for all P ∈ rew(Q).1) B (3. 5. Not all pairs of answers of two distinct sets are necessarily “joinable”.3) F (5. where P2 = {1.

. Sj ∈ ansTj (D)). . . . . sh = sk . . . tk } as the index set {s1 . tm }. one for each P ∈ rew(Q). ansTj (D)) between the two sets ansTi (D) and ansTj (D) is the set ansT (D) where: ˆ T = {t1 . the join of Si and Sj . where Ti = {t1 . . . . tk } is the ordered set obtained by the union Ti ∪ Tj of the ordered sets Ti and Tj . . ansTi (D) and ansTj (D) the answers of the unordered inclusion of Ti and Tj in D. Ti = {t1 . D a data tree. . . . . Tj = i i {t1 .4 (Join of answers) Given two structurally consistent answers Si ∈ ansTi (D) and Sj ∈ ansTj (D). . . . Ti and Tj two ordered sets of indexes determining sub sigTi (Q) and sub sigTj (Q). . . . . respectively. . We denote such operation as the structural join. . sn } ∈ ansTi (D) and Sj = {s1 . respectively. . j i j i Definition 4. respectively. . . . . . The answer set ansQ (D) can thus be computed by sequentially joining the sets of answers of the evaluation of the path queries. i i ˆ for each h = 1. . sh = sk . tm } two ordered sets of indexes deterj j i i mining sub sigTi (Q) and sub sigTj (Q). is defined on the ordered set Ti ∪ Tj = {t1 . . l ∈ {1. i j i j ˆ for each pair of different indexes th = tk . Si = {s1 . . k} exists such that th = tl and j sh = sl .Identification of the answer set 137 The following definition states the meaning of structural consistency for two generic subtrees Ti and Tj of Q – paths Pi and Pj are particular instances of Ti and Tj . . The structural join sj(ansTi (D). . respectively. D a data tree.5 (Structural join) Let Q be a query twig. Definition 4. . . . . .3 (Structural consistency) Let Q be a query twig. tn }. . tn } and Tj = {t1 . . sn } and Sj = {s1 . . . . k} exists such that th = tl and sh = sl . sm }. . m. . . . sk } where: ˆ for each h = 1. sm } ∈ ansTj (D) are i i j j structurally consistent if: ˆ for each pair of common indexes th = tk . . . . . . l ∈ {1. . ansTi (D) and ansTj (D) the answers of the unordered inclusions of Ti and Tj in D. . . . identifying distinct paths in sig(D). ˆ ansT (D) contains the join Si Sj of each pair of structurally consistent answers (Si ∈ ansTi (D). . j j i i j j Si Sj . j Any answer to the unordered inclusion of Q in D is the result of a sequence of joins of structurally consistent answers. Definition 4. . . n. Si = {s1 .

ansP3 (D)). 3} ∈ sj(ansP2 (D). Example 4. 2. Since the structural join operator is associative and symmetric. P4 } where P2 = {1. . . we get the answer set ansQ (D) identifying the unordered inclusion of Q in D by incrementally merging the answer sets by means of the structural join. . ansP4 (D)) = ∅ The answer sets of the separate paths and of sj(ansP2 (D).13 again. 4}. which we will call Q. . 2}.11 Consider Figure 4. 5} ∈ ansP4 (D) is not structurally consistent: the two different query nodes 2 ∈ P2 ∪ P3 and 4 ∈ P4 (4. 3} is the ordered set {1.14: Structural join of Example 4. ansP3 (D). ansPxk (D)) where rew(Q) = {Px1 .15. . ansTj (D)) thus returns an answer set defined on the union of two sub-queries Ti and Tj as the join of the structurally consistent answers of ansTi (D) and ansTj (D). ansP3 ): Figure 4. 3} ∈ ansP3 (D). . It joins the only pair of structurally consistent answers: {1. ansP3 (D)) = ansP2 ∪P3 (D) where P2 ∪ P3 = {1. ansPxk (D)} for paths in rew(Q).10 The answer set ansQ (D) of Example 4. The answers to the individual paths and the final answers are shown in Figure 4. and P4 = {1. The final result ansQ (D) is the outcome of the structural join: sj(ansP2 (D). ansP3 (D)) are shown in Figure 4. . The rewriting phase produces the set rew(Q) = {P2 .9 The structural join sj(ansTi (D). P3 = {1. 3}. It can be easily verified that there is no qualifying sub-signature since at most two of the three paths find a correspondence in the data tree.9 is the outcome of the structural join sj(ansP2 (D). Pxk }. The final result is empty since the only pair of joinable answers {1.14 (the first line of each table represents the query). ansP4 (D)) = = sj(sj(ansP2 (D). 3}. . 5} ∈ ansP2 (D) and {1. in the data tree D. .1) . In this example we show the evaluation of the unordered tree inclusion of the second twig query depicted. ansP3 (D)) and {1. Example 4. Starting from the set of answers {ansPx1 (D). . . P3 . 2} ∪ {1. we can compute ansQ (D) as: ansQ (D) = sj(ansPx1 (D). .138 P2 1 2 1 5 Query processing for XML databases P3 ansP3 (D): P2 ∪ P3 1 2 1 5 3 3 1 3 1 3 2 3 ansP2 (D): sj(ansP2 . 5.

As paths are sorted by their selectivity. we have specified two distinct phases for the decomposition approach for unordered tree pattern matching: the computation of the answer set for each root-to-leaf path of the query and the structural join of such sets. Theorem 4. it iterates the process by joining the partial answer set anspQ (D) with the answer set ansP (D) of the next path P of rew(Q). from step 1 to step 3. In the following. the answer set ansQ (D) as defined by Eq.3. The full algorithm is depicted in Figure 4.16. From step 4 to step 12. 4. 4.1 contains all and only the index sets S qualifying the unordered inclusion of Q in D according to Lemma 4. The basic idea is to evaluate at each step the most selective path among the available ones and to directly combine the partial results computed with structurally consistent answers of the paths. In particular. The main drawback of this approach is that many intermediate results may not be part of any final answer. at each step. ansP3 (D)): 1 1 2 3 5 3 Figure 4.5. The algorithm essentially computes the answer set by incrementally joining the partial answers collected up to that moment with the answer set of the next path P in rew(Q). it does not properly compute first the answer set ansP (D) and . It makes use of the pop operation which extracts the next element from the ordered set of paths rew(Q).Efficient computation of the answer set P2 1 2 1 5 P3 ansP3 (D): P4 1 3 1 3 2 3 139 ansP2 (D): ansP4 (D): 1 4 1 5 P2 ∪ P3 sj(ansP2 (D).15: Structural join of Example 4.7 Given a query twig Q and a data tree D.11 correspond to the same data node 5.2 Efficient computation of the answer set In the previous section. It means that there are not as many data tree paths as query tree paths. we show how these two phases can be merged into one to avoid unnecessary computations. Notice that. the algorithm initializes the partial query pQ evaluated up to moment to the most selective path P and stores in the partial answer set anspQ (D) the evaluation of the inclusion of pQ in D. P is the most selective path among those which have not been evaluated yet.

4. In order to do it. as the part of the path P3 corresponding to the query node a has already been evaluated while evaluating P2 .. we compute only such answers in ansP (Q).f fsk −1} (D))) to P Ans. by considering Example 4. the algorithm tries to extend each answer in anspQ (D) to the answers to pQ ∪ P by only evaluating such sub-path of P which has not been evaluated in pQ. As each pair of index sets must be structurally consistent in order to be joinable. then at step 5 pQ = P2 and. (9) for each answer S in anspQ (D) (10) evaluate anspP (sub sig{sk +1.. (11) if(anspP (sub sig{sk +1. For each index set S ∈ anspQ (D).. anspP (sub sig{sk +1. (2) pQ = P .. Step 7 identifies tk as the parent of the partial path pP where k is its position in pQ.17-b... the partial path pP to be evaluated and the parent tk of pP are depicted in Figure 4. which is structurally consistent with S. If rew(Q) = {P2 . only such answers may be part of the answers to Q. ansP (D)) as shown in Eq. must share the same values in the positions corresponding to the common sub-path P ∩ pQ. In particular. (3) evaluate anspQ (D).. we assume . the two paths P2 and P3 of the query Q are depicted in Figure 4. (7) tk is the parent of pP . P3 }. (8) P Ans = ∅. which are structurally consistent with some of the answers in anspQ (D). (13) pQ = pQ ∪ P .. As a matter of fact. each index set in ansP (Q). (6) pP = P \ (P ∩ pQ). k is the position in pQ..17-a..16: The unordered tree pattern evaluation algorithm the structural join sj(anspQ (D).140 Query processing for XML databases Input: the paths of the rewriting phase rew(Q) Output: ansQ (D) Algorithm: (1) P = pop(rew(Q)).} Figure 4.f fsk −1} (D)). In other words. (4) while((rew(Q) not empty) AND (anspQ (D) not empty)) (5) P = pop(rew(Q))..9.. step 6 stores in the sub-path pP such part of the path P to be evaluated which is not in common with the query pQ evaluated up to that moment: P \ (P ∩ pQ).1.f fsk −1} (D)) not empty) (12) add sj({S}. but it rather applies a sort of nested loop algorithm in order to perform the two phases in one shot. (14) anspQ (D)=P Ans. For instance.

In particular.17: Evaluation of paths in Algorithm of Figure 4.16: an example that the part of the path P which is common to pQ has already been evaluated and that the indexes of the data nodes matching P ∩ pQ are contained in S. Either the evaluation of the partial path pP fails (line 11). ansP (D)) = ∅. Thus. by shrinking the index interval to a limited portion of the data signature. As the path P has been split into two branches P ∩ pQ and pP . Then it joins S with such answer set by only checking that different query entries correspond to different data entries (step 12). which means that none of the answers in ansP (D) share the same values of S in the positions corresponding to the common sub-path P ∩pQ. join with S. the algorithm extends S to the answers to P ∪ pQ by only evaluating in the “right” sub-tree of the data tree the inclusion of the part pP of the path P which has not been evaluated yet (step 10). for each answer S in anspQ (D) two alternatives exists. where tk is the parent of pP and S contains a set of indexes matching P ∩ pQ. Notice that. or the structural join between S and the answers to pP fails (line 12). which means that some of the answers in ansP (D) share the same values of S in positions corresponding to different indexes in P and pQ. The latter case occurs when we evaluate a path P having no answer which is structurally consistent with those in anspQ (D): sj(anspQ (D).Efficient computation of the answer set tk 141 A A F B F B P2 (a) P3 pQ (b) pP Figure 4. the index sk in S actually represents the entry of the data node matching the query node corresponding to tk . we are able to reduce the computing time for the sequence inclusion evaluation. then. In this case. in step 10. in order to compute the answers in ansP (D) that are structurally consistent with S and. the evaluation of pP must be limited to the descendants of the data node dsk which in the tree signature corresponds to the sequence of nodes having pre-order values from sk + 1 up to f fsk − 1. Example 4.16 to Ex- . The algorithm ends when we have evaluated all the paths in rew(Q) or when the partial answer set collected up to that moment anspQ (D) becomes empty.12 Let us apply the algorithm described in Figure 4.

1. c. the import process of documents and queries and the query processing (Core System) and finally the persistence of managed documents (Store System).6 The XML query processor architecture All the twig matching algorithms we described in the previous sections have been implemented in the XSiter (XML SIgnaTure Enhanced Retrieval) system. 4.4} (D) to the inclusion of sub sig{pP } (Q) = f. Then. the proposed solution performs a small number of additional operations on the paths of the query twig Q. The system is essentially composed by three subsystems. 5}}. 3}}. {2. a. from the top to the bottom. 3} and evaluate anspP (D) on the descendants of the data node having index s1 = 1. 3}. 4. 1. 1. For the next index set {2. 3. 3.3. 3}}. Essentially. 5.18 we can see the abstract architecture of XSiter. it is required to evaluate sub sig{pP } (Q) on the sub-tree rooted by s1 = 2 that is sub sig{3. 1. In Figure 4. 4. 4. Thus ansQ (D) = {{1. 3}}. In the Store System. 2} whose answer set is ansP2 (D) = {{1. f. P3 = {1. 5} in ansP2 (D) and stores in anspP (D) the index sets qualifying the inclusion of the query sub sig{3} (Q) = b.9 where the signatures involved are sig(Q) = a.19) are managed. tk = 1 where k = 1. The outcome is thus anspP (D) = {{3}} and ansQ (D) = {{1. 1.17. It computes P2 ∩ P3 = {1}. We then consider the first index set {1. .3. 1. the interaction with the user (GUI). b. 2. we remarkably reduce computing efforts. c. 1 on the sub-tree rooted by the data node labelled with a and having index s1 = 1 that is in the signature sub sig{2. 4 . b. that respectively manage. 3. 0. Since the two paths are of the same length. 1. 5. 3}. 2. 2.4. a datastore is a collection used for aggregating conceptually related documents and for keeping their internal representations persistent among different query sessions. f. 3}. 4 . in the way shown in Figure 4. Being P2 and P3 of the same length. f. c. 1 and sig(D) = a. 2. a native and extensible XML query processor providing very high querying performance in general XML querying settings. 2.5} (D) = a.4} (D) = b. we can also start from ansP3 (D) = {{1. 4 is {{5}}.5} (D) = a. e.4. but dramatically reduces the number of operations on the data trees by avoiding the computation of useless path answers. In this way. In this section we briefly describe the main system architecture and features.142 Query processing for XML databases ample 4. 3. and pP = {3}. In this case pP = {2} while. tk = 1 where k = 1.3. in sub sig{2. as in the previous case. the algorithm essentially deals with the next path. datastore structures (Figure 4. we start from P2 = {1. In summary. It then considers the only index set S = {1. 3. 5. 2 and the answer set anspP (D) is empty. b. b.g. 3. The answer anspP (sub sig{2. c. 1. f.

that are stored separately and are evaluated only by need (value constraints).xml Query Engine Doc Importer Internal Doc Representation Core System Datastore Datastore Offline Process Datastore Collection Store System Figure 4. ˆ Values.4. as we have seen. In particular. can be indexed (Content Based Indexes) in order to speed up the search process. is used for solving tree pattern matching (structural constraints). Such part is optional and is generated according to user needs. named . in a datastore two shared global structures are also kept (see left part of Figure 4. in the following we will briefly describe only the document one. for each tag that is present in a document. elements contents or attribute values.19): ˆ We have the (Tree Signature). being it the most complex. along with the document internal representations. ˆ The signature does not include values. ˆ A simple document summary (Local Tag Index ) is used as a filter for limiting the search space.6 The XML query processor architecture 143 Query Language Query Specifier Result Visualizer GUI Internal Query Representation Query Importer Doc.18: Abstract Architecture of XSiter Queries and documents are transformed in an almost homogenous representation. The chosen internal representation addresses the main issues needed for querying XML documents and consists of four main parts (see right part of Figure 4.19). Finally. that. the first and the last document positions are kept.

” Local TagIndex Content Based Indexes Values </> # Stored Documents (Internal Representation ) Datastore Figure 4. Further.19: Structure and content of a Datastore Global TagIndex and TagMapping. These features. TagMapping provides mappings between textual tags and correspondent unique numbers (ids). as we deeply discussed in the previous sections. The Global TagIndex keeps track of which tag is present in each managed document and is used to quickly filter out documents that can not contain matches for a particular query.144 Query processing for XML databases </> Global TagIndex TagMapping Signature </> “. which are used to store tags in the most compact possible way. The key to their efficiency is to skip as much of the underlying data as possible. specialized index structures can be easily integrated and exploited in our system in order to better match different application needs. In general. all XSiter algorithms efficiently process the supporting data structures in a sequential way. In particular.. also thanks to the document filters exploiting Global TagIndex Data. make the system suitable for querying and managing very large documents. a filter based on Local TagIndexes is also available. which is used to limit the parts of a document that have to be scanned in order to solve a particular query. XSiter is currently implemented as a general purpose system. Further. skipping areas of obviously no query answers. . meaning that little special domain optimizations have been currently applied but the architecture was developed to be simply extensible. Instead. Repositories involving a large number of documents are efficiently managed and queried. whenever possible. and at the same time never return back in the processed sequence. together with the minimal memory usage of the algorithms. current algorithms use a very high abstraction level of content index access that enables us to substantially use different kinds of indexes without changing the search algorithms..

It is a very “flat” (3 levels) and wide (very high root fan out) data tree. we chose the complete DBLP Computer Science Bibliography archive. Table 4.2 shows more details about this XML archive.7.8 Millions of elements. respectively).7 Experimental evaluation 145 4.7 Experimental evaluation In this section we present the results of the experimental evaluation of the XSiter query processor matching algorithms.3) we show the performance of each of the algorithms of Section 4. fan-out (3) and root fan-out (50000 and 30. In particular. Further. However. as described in Section 4.7. as random trees.2 and 4.t. In fact. while Gen1 . The file consists of over 3. respectively). In order to show the performance of the matching algorithms on real-world “data-centric” XML scenarios.1). Gen1 and Gen2. in the main part of the tests (Subsections 4. We present the results we obtained on one real and two synthetic collections (see Table 4. the whole set presents repetitions of typical patterns (for instance. such as POT2. we specifically evaluate the performance of our decomposition technique for unordered tree inclusion. 4.1: The XML data collections used for experimental evaluation In our tests.4. we also generated two synthetic data sets.1 Experimental setting The data sets Type Real Synth DBLP Gen1 Gen2 Dimensions Depth F/O F/O (root) 3 2-12 376698 5 3 50000 8 3 30 Labels # Equi 20 7 7 Size 3814975 2000000 32000 Table 4. Both synthetic collections differ from the DBLP set in their labels distribution as they are uniformly distributed. in Subsection 4. For this reason.r. as described in the previous sections. this would not allow us to test some of the most complex conditions.4 and we evaluate the benefits offered by each of the conditions discussed. both in terms of the reduced size of the domains and in the amount of saved time w. as in typical “datacentric” XML documents. Since typical real data sets are very flat. “article-author”).5.4.7. their standard execution. we used both real and synthetic data sets. As in most real data sets.7. using the following parameters: depth (5 and 8. the distribution of the node labels is non equiprobable.

the query depth is limited to 2. This is typically clear after a significant portion of the data is scanned. . therefore we tried to differentiate the queries by means of an increasing fanout. Our aim is mainly to analyze the behavior of the algorithms and the trends of the sizes of domains. As to Gen1 and Gen2. The testing queries Figure 4.2: DBLP Test-Collection Statistics proposes a similarly wide and slightly deeper tree. while the lower one defines queries for the synthetically generated collections (denoted with Gn).146 Query processing for XML databases Middle-level Element name Occs inproceedings 241244 article 129468 proceedings 3820 incollection 1079 book 1010 phdthesis 72 mastersthesis 5 Leaf-level Element name Occs author 823369 title 376737 year 376709 url 375192 pages 361989 booktitle 245795 ee 143068 crossref 141681 editor 8032 publisher 5093 isbn 4450 school 77 Summary Total number of elements 3814975 Total number of values 3438237 Maximum tree height 3 Table 4. we created queries G2 and G3 deeper than the DBLP ones. The labels are specifically chosen amongst the less selective. we provide a path and two twigs. Note that the size of collections (in particular the root fan-out) is not very important. in order to test our algorithms in the most demanding settings.4. Gen2 is very deep and has smaller root fan-out. For DBLP. They are specifically conceived to test all the conditions which would not be activated in the shallow DBLP setting.20 shows the testing queries we used to perform the main tests on the algorithms discussed in Section 4. thus simulating more “text centric” trees. The upper row shows the queries on the DBLP collection (denoted with Dn). In both cases.

b is the branching factor. and h the tree height.e.21. We refer to the templates as “xSb-h’. where S stands for element name selectivity and can be H(igh) or L(ow).21: The query templates used in the decomposition approach tests Further. and different tree heights.2. i. the number of elements having a given element name. derived from the six query twig templates of Figure 4. we used inproceedings for the low selectivity and book and phdthesis for the high selectivity. we performed specific series of tests on the decomposition approach (Section 4.20: The queries used in the main tests Template xL 2-2 inproceedings Template xL 3-2 inproceedings Template xL 8-2 inproceedings author title author title year author title pages ee url year crossref booktitle Template xH 2-2 book Template xH 3-2 book Template xH 7-3 dblp phdthesis author title author title year author school author book title year publisher isbn Figure 4. We have con- . showing the number of occurrences of each name in the DBLP data set.e.Experimental setting Path D1 article 147 Twig D2 inproceedings Twig D3 inproceedings author author author title title pages year crossref ee url booktitle Path G1 A B C E Twig G2 A Twig G3 A B D G B E C F D G Figure 4. refer to Table 4. the maximum number of sibling elements.5) by means of another set of queries. different branching factors. Such templates present different element name selectivity. In particular. i. To understand the element name selectivity.

09 1. In this way.3: Pattern matching results for the different queries and collections ducted experiments by using not only queries defined by the plain templates (designated as “NSb-h”) which only contain tree structural relationships.3 summarizes the results obtained by applying our pattern matching algorithms to solve the proposed queries.35 1.4 1. such as the author name. because we believe that this kind of queries is especially significant for highly selective fields.43 149700 2.3 1.17 Gen1 collection 13209 1.53 1. for each of the three collections. equipped with 512MB RAM and a RAID0 cluster of 2 80GB EIDE disks with NT file system (NTFS). On the other hand.2 General performance evaluation Table 4.32 1.82 2.5Ghz Windows XP Professional workstation. 390008 1038632 1843762 1041697 2245014 390036 392751 393119 487100 587464 8425 8511 8709 11061 14039 144% 39% 51% 38% 24% 120% 264% 410% 193% 240% 69% 178% 276% 114% 133% P-D1 O-D2 O-D3 U-D2 U-D3 P-G1 O-G2 O-G3 U-G2 U-G3 P-G1 O-G2 O-G3 U-G2 U-G3 Table 4.47 5.99 2788 10.47 920 2.4 26854 132.148 Query Time (msec) 1109 2890 5984 3750 7266 941 1392 2394 2238 2854 90 113 190 180 961 # sols Query processing for XML databases sols MDS /constr DBLP collection 260540 1 1 553997 2. All experiments are executed on a Pentium 4 2.75 Gen2 collection 1320 1. we measure the response time of twelve queries.7. half of which contain predicates.59 1.92 Insertions # % avoid.05 848 1.59 1412 1.93 3173 6. where the templates are extended by predicates on the author name.17 559209 2.64 5. the performance of queries with low selectivity fields should be very close to the corresponding templates.44 98934 2. We have chosen the highly content-selective predicates.59 1.56 1. but also queries (designated as “VSb-h”). 4.2 884 7.9 7. Value accesses are supported by a content index.32 94 2. .58 1.

identifying the most significant cases for each of them. We will present all the specific results and graphs only for . Keeping the MDS low is essential for efficiency reasons. In order to simplify the analysis we discuss the path and twig matching separately.for ordered and U. the mean number of solutions constructed each time a solution construction is started.e. since the time spent in constructing the solutions is roughly proportional to the Cartesian product of the domains size. i. since an overflow of the domains would mean a total failure. the domains behavior and execution time. the total number of node insertions.Evaluating the impact of each condition 149 Queries are denoted with a prefix signifying the applied matching algorithm (P. 4. even though each of the settings presents non-trivial query execution challenges: a very wide data tree for both DBLP and Gen1.8 nodes for DBLP and Gen1. the mean domain size (denoted with MDS).for unordered twig matching). in many cases. and the percentage of avoided insertions with respect to their total number. nearly one or even two orders of magnitude larger than for the other two collections). which is very significant for all collections (e. Finally. the total execution time (in milliseconds). the total number of solutions retrieved. For each query setting. we present the fundamental details of the algorithms execution. measuring how much each of the conditions applied in the algorithms influence their trend. nearly a million avoided insertions for DBLP O-D3 query.e.7. Observe that in all the cases the time is in the order of a few seconds (7 seconds at most for query U-D3). In the following. Its low values in each of the settings (less than 1. the MDS parameter is particularly significant for all the queries. and especially the high percentage of avoided insertions. over half a million for queries D2) and a very deep and involved tree for Gen2 (notice the high number of solutions for each solution construction.g. but. reaching 7.3 Evaluating the impact of each condition Now that the effectiveness of the algorithms and of the conditions is clear. In particular. we deeply analyze the MDS. while in Gen1 and Gen2 twigs the number of non-inserted nodes is much higher than the ones inserted). O. Also observe the large number of node insertions. it is also essential for the good outcome of the matching. this means that the number of deletions is very near to the number of insertions. a considerable repetitiveness for DBLP labels and patterns (notice the very high number of solutions.92 for the most complex query in the deep Gen2) testimony the good efficacy of our reduction conditions. since the mean domains size is low.for path matching. It represents the mean size of the domains measured each time a solution construction is called for the whole size of the collection. i. we still need information about the benefits offered by enabling each of them.

4 presents a summary of the cases we will discuss. Case P-B produces larger but still controlled domain sizes (20%-30% higher MDS than the “standard” cases). The deletion of the nodes in the last stacks. since. shortly discussing the others in words. Path matching Let us start by considering the simpler path matching scenario.4: Summary of the discussed cases (disabled conditions are marked with a ‘x’) some of the most complex and interesting cases. Then are the cases involving deactivation of pre-order conditions (P-B. which . In the following. Notice that the modified path algorithm for Case P-B is equivalent to the one proposed in [25]. P-B and P-C. In fact. where we distinguish between cases P-A.150 Case P-A P-B / T-E P-C / T-F T-A T-B T-C T-D POP x Query processing for XML databases Post-order POT1 POT2 Pre-order PRO1 PRO2(PRU) x x x x x x x POT3 Table 4. together with the associated disabled conditions. which again influence deletions but in a lesser manner. preventing the conclusion of the matching for all the query settings. at least for short paths. This is expected. P-C). Case P-B means we do not empty the domains on the “right” when a given domain becomes empty (thus there will be “dangling” pointers). First is the case P-A where we turn off the post-order condition and which is expected to significantly degrade the matching performance since such a condition is clearly the key to the high number of node deletions. we refer to the case where all the conditions are turned on as the “standard” (std) case. case P-A produces an uncontrolled growth in the domains’ size. the time required to apply the conditions compensates the shorter solution generation time. while in P-C the nodes in the last domain are no longer deleted after solution construction. both for domain size overflow and for the consequently exploded execution time. with case P-C we obtained results which were almost identical to the standard case. Finally. Table 4. while the execution time is nearly unchanged (only 2% less).

on mean. This results in nearly identical execution time and mean domain size (nearly 6% larger).74% 1429090 762319 37.07% 2780212 Gen1 collection 417154 68.54% 6% 23679 109385 4.Evaluating the impact of each condition Query isCleanable() Calls % true % true Calls (POT3) (POT2) DBLP collection 1389881 17. which are the main source of avoided insertions and deletions.14% 0.79% 1999979 Gen2 collection 27645 17. we will first inspect the pre-order conditions (i.68% 1999979 542592 52. while for isNeeded() means preventing an useless insertion (returning false for POT1). is equally carried out by the other conditions (i.86% n/a n/a 9.25% 27.02% 8.43% n/a n/a 6.15% n/a n/a O-D2 O-D3 U-D2 U-D3 O-G2 O-G3 U-G2 U-G3 O-G2 O-G3 U-G2 U-G3 Table 4.48% 15.4).17% 3.54% 14.47% 24. For isCleanable() “success” means allowing a node deletion (returning true. We started by analyzing the “percentages of success” of such functions for each call in each of the queries – Table 4.5 provides such statistics.25% 53.e.92% 70.71% 59.28% 57.5% 71.63% 54. the following are key functions which activate POT conditions: isCleanable() for deletions (exploiting POT2 and POT3) and isNeeded() (which will be isNeededOrd() for the ordered and isNeededUnord() for the unordered case) for insertions (exploiting POT1).25% 66. both for POT3 or POT2).3% 4. Notice that POT1 can be satisfied by examining nodes in the parent domain (denoted in table with POT1p) or.14% 1429087 456767 62.67% 32758 151 isNeeded() % false % false (POT1p) (POT1s) 27.6% 32758 55344 8. As for paths.29% 1439703 7866217 3.48% 65. . Twig matching As to twig matching.34% 14.5: Behavior of the isCleanable() and isNeeded() functions would be immediately provided by condition PRO1.21% 14.48% 23679 57733 8.e.1% 7.36% 1439703 7477979 3.65% 19. the number of available conditions requires a deeper analysis. POP. As shown in the algorithms (Section 4. POT).23% 2780210 1395093 17.65% 21.65% 19. PRO2) just a few steps later.

T-B and T-D Figure 4. we distinguish cases T-A to T-D (see Table 4.00 4. while case T-D will allow useless insertions.99 7.92 8. the main contribution is given in situation POT1p. the percentage of deletion success is lower than in the other collections. vs.22-a shows a plot comparing.70 5. the mean stack size of case T-A to the standard one for O-G3 query.44 1.32 1.88 1. Case T-A is conceptually equivalent to the case P-A. we also performed some CPU utilization tests and found out that their contribution is generally significant also from an execution time point of view. This is due both to the more repetitive and simple structure of its data tree and to the inapplicability of condition POT2 (DBLP queries have only two levels).4). T-A (O-G3.00 8. preventing the termination of the matching in acceptable time.00 O-G2 O-G3 U-G2 U-G3 O-G2 O-G3 U-G2 U-G3 (Gen1) (Gen1) (Gen1) (Gen1) (Gen2) (Gen2) (Gen2) (Gen2) 1.00 Mean domains size 10. 14. The first three cases will clearly produce less node deletions.59 1. std. Such condition proves instead quite useful in the other collections. deletions are almost totally prevented. Gen1) (b) MDS.00 2.74 12. The percentage of success for both functions is considerable in all cases. Case T-B Case T-D Document pre-order (a) MDS.11 Case T-B Case T-D 500 400 300 200 100 0 60 0 12 00 18 00 24 00 30 00 36 00 42 00 48 00 54 00 60 00 66 00 72 00 78 00 84 00 90 00 96 00 0 Case std. in the siblings ones (POT1s in table). In DBLP. Even if the difference in size may not seem particularly significant. Like in P-A.22-b).94 7. As to cases T-B and TD.96 1.00 0.85 1. To quantify the specific effects of the conditions on the domain size and time. after each data node.50 7. 600 Mean domains size Case T-A Query processing for XML databases Case std.00 6. we found out that the domain sizes grow uncontrollably. the graph in Figure 4. About the two functions we are discussing.75 2.16 2.65 1.29 2.42 8.93 4. i. As an example. the algorithms generally produced domain sizes which were larger than the standard case (see Figure 4. collection Gen1.22: Comparison for mean domains size in different settings only for unordered. where its application comes often near the one of POT3. std. thus the differences in execution time may become . particularly for unordered matching. we have to consider that the time spent in constructing the solutions is roughly proportional to the cartesian product of the size of each domains.152 Case std.81 5.00 12.40 3.40 6. since their percentage of CPU utilization is typically less than 4% of the total CPU times. while also POT1s can give a good contribution in the ordered matching. vs.47 1.e. as expected. As to POT1.83 1.

we can also briefly analyze the cases involving the activation of the pre-order conditions (PRO. For instance. in order to verify execution time savings in more complex situations. Notice that. as for the paths. named U-G2b. since the time required to check the conditions compensates the shorter solution generation time. we also employed new queries specifically for these tests. denoted as T-E and T-F in Table 4. As an example.23 shows.4. such case produces uncontrolled growth in domains size and in time. the comparison between the standard case time and the one of the T-D case. proving that such conditions are essential for more complex queries. there will be dangling pointers. while it proves essential for unordered matching (time and domain size grow uncontrollably). As seen in the graph. each solution construction run becomes nearly 20 times longer. if the domains are on mean one and a half larger. i. Differently from P-B.23: Comparison between time in different settings more evident. case T-D). Disabling condition POT3 produces almost no variations in the ordered matching (condition POT2 produces the same deletions at the cost of a little more time spent in checking the hypotheses). Note that if we disabled PRO1 together with POT3.g.e. for the most complex queries and for collection Gen2. a modified version of query U-G2. 3000 2500 Time (msec) 2000 1500 1000 500 0 O-G2 (Gen2) 113 152 O-G2b (Gen2) 188 227 O-G3 (Gen2) 190 266 U-G2 (Gen2) 180 319 U-G2b (Gen2) 592 2516 U-G3 (Gen2) 961 2552 Case T-D 153 Case std. PRU). As to case T-F. the difference in execution time can reach a proportion of 1:5 (query U-G2b). where second level nodes have two children instead of one. The results obtained for case T-C are very different between the ordered and unordered settings. case T-C would degenerate for the ordered setting too. e. Finally. this is . Simulating case T-E (which is conceptually similar to the P-B one for paths) means to disable the deletions produced by the pointers update. for the most complex settings the difference in execution time can be remarkable. Case T-D Figure 4. as for query UG-3 (collection Gen2. While for the most simple queries we found out that execution time is still not much affected. Figure 4.Evaluating the impact of each condition Case std. since the last domain would not be empty and POT2 could no longer be always applied.

03 8.016 0. 4.28 4.94 40320 4. expressed in seconds.015 0.014 0. 2).154 Query elem.2 2 2.028 6 0.9 259. we measure the time needed to process the different query twigs using the paths decomposition approach. For each query twig. It is evident that the decomposition approach is superior and scores a lower time in every respect. (sec) 0. the total number of elements and predicates.64 2 0.4 Decomposition approach performance evaluation In this section we provide specific evaluation of the performance of our decomposition technique for unordered tree inclusion.015 0.65 0. # 3 4 10 3 4 9 3 4 10 3 4 9 Query processing for XML databases Evaluation Permutation N mean (sec) total (sec) 2 0.2 4.e.096 288 0. For the permutation approach. 3. the number of solutions (inclusions) found in the data set.1 6.56 6 2.031 0. as described in Section 4. # 0 0 0 0 0 0 1 1 2 1 1 1 solutions # 1343 1343 90720 559209 559209 149700 1 1 1 39 36 29 Decomp.832 1.664 6 1. such approach is twice as faster for both selectivity settings.7. the number of needed permutations and the mean per-permutation processing time are also presented.69 0. 8) the speed increment becomes larger and larger – the num- . Table 4. With high branching factors (i.016 0. In particular.6 summarizes the results of the unordered tree inclusion performance tests for both approaches we considered.718 Table 4.016 0.49 14.014 0. and compare the obtained results with the query processing performance of a na¨ permutation apıve proach.3 92736 Twig NH2-2 NH3-2 NH7-3 NL2-2 NL3-2 NL8-2 VH2-2 VH3-2 VH7-3 VL2-2 VL3-2 VL8-2 pred. and the processing time.6 40320 2.1 2. The permutation approach considers all the permutations the query satisfying its ancestor-descendant relationships and computes the answers to the ordered inclusion of the corresponding signatures in the data signature.016 1.105 288 0.5.028 6 0.6: Performance comparison for unordered tree inclusion equivalent to case P-C and the results obtained for ordered matching confirm the one discussed for such case. with low branching factors (i. In particular. The union of the partial answers is the result of the unordered inclusion.7 0.8 193536 2 0.e. are reported.2 7.

2 seconds. Of course. Evaluating query NL2-2 starting with the title path produces a response time of 2. Finally.7 sec (author path). the time spent is nearly proportional to the number of occurrences of such path in the data. This holds for all the query twigs as well – for NL8-2. the processing time remains in milliseconds.9 sec. notice that the permutation approach also requires an initial “startup” phase where all the different permutation twigs are generated.). For the decomposition method. the time ranges from 7.7 sec (crossref path) up to 15. In particular. . while starting with the less selective author path. the time used to generate such permutations is not taken into account. for the predicate queries the best time is obtained by starting the evaluation from the value-enabled paths. we measured the time needed to solve each query for each of the possible order of path evaluation and reported only the lower one.Decomposition approach performance evaluation 155 ber of permutations required in the alternative approach grows factorially: for queries NL8-2 and VL8-2 the decomposition method is more than 25. The decomposition approach is particularly fast with the high selectivity queries. as we do not have statistics on the path selectivity at our disposal. in VH7-3 ). the time would nearly double (3.e. we found that starting with the most highly selective paths always increases the query evaluation efficiency. Even for greater heights (i. As we expected.000 times faster.

.

Such repositories often collect data coming from different sources. In such a sea of electronic information. XML has quickly become the de facto standard for data exchange and for heterogenous data representation over the Internet. are more and more widespread on the web. such as actual text documents or metadata on textual and/or multimedia documents. for instance those providing scientific data and articles. along with the considerable drop in digitizing costs. access and diffuse via the Web a large number of digital documents and multimedia data. This is also due to the recent emergence of wrappers (e. languages for describing the structures and data types and for querying XML documents are becoming more and more popular. In this context.Chapter 5 Approximate query answering in heterogeneous XML collections In recent years. Heterogeneous collections of various types of documents. for the former and the latter purposes respectively. the user can easily get lost in her/his struggle to find the information (s)he requires. the syntax and semantics of XML Schema [132] and of XQuery [16] are W3C recommendations/working drafts.g. Thus. have fostered development of systems which are able to electronically store. Among the several languages proposed in recent years. in a large number of heterogeneous . [12. Think of the several available portals and digital libraries offering search capabilities. Along with XML. The documents are heterogeneous for what concerns the structures adopted for their representations but are related for the contents they deal with. or those assisting the users in finding the best bargains for their shopping needs. 43]) for the translation of web documents into XML format. the constant integration and enhancements in computational resources and telecommunications.

while the problem of making explicit the meanings of words is usually demanded to human intervention. on the other hand. but they are still far from perfect in delivering the information required by the user. while structural queries submitted to XML web search engines are written in XQuery. are indeed necessary and should be exploited in the underlying search engines. one key issue which is still an open problem is the effective and efficient search among large numbers of “related” XML documents. the adoption of XQuery allows users to perform structural inquiries. from one side. such high flexibility could also mean more complexity: Hardly a user knows the exact structure of all the documents contained in the document base. in most application contexts human intervention is not always feasible. Given those premises. even though they could be useful in order to satisfy the user’s information need. In this chapter. they should be able to know the right meaning of the employed terminology. we propose a series of techniques trying to give an answer to the above mentioned needs and providing altogether an effective and efficient approach for approximate query answering in heterogeneous document bases in XML format.158 Approximate query answering in heterogeneous XML collections web collections. which are able . could use largely different structures and element names. In particular: ˆ in Section 5. but they are not sufficient to fully answer the user needs in these scenarios. due to the ambiguity of natural languages. if. Efficient exact structural search techniques. Sites offering access to large document bases are now widely available all over the web. terms describing information usually have several meanings and making explicit their semantics can be a very complex and tedious task. like the ones described in Chapter 4. Further. Indeed. Further. in order to provide a good effectiveness. an entire ensemble of systems and services is needed to help users to easily find and access the information they are looking for. In order to exploit the data available in such document repositories. In particular. it is also evident that such solutions should focus on the structural properties of the accessed information and. a language expressive enough to allow users to perform structural inquiries. but coming from different sources. Indeed. However. data are most likely expressed in XML and are associated to XML Schemas. for instance compact disks in a music store. going beyond the “flat” bag of words approaches of common plain text search engines. also on the ones which do not exactly comply with the structural part of the query itself but which are similar enough. the need for solutions to the problem of performing queries on all the useful documents of the document base.1 we propose XML S3 MART services [94]. becomes apparent. XML documents about the same subject and describing the same reality.

1 Matching and rewriting services The services we propose rely on the information about the structures of the XML documents which we suppose to be described in XML Schema.2 we propose a further service for automatic structural disambiguation [86. In Section 5. 5. A schema matching process extracts the semantic and structural similarities between the schema elements. Indeed.4 we provide extensive experimental evaluation of all the proposed techniques. Instead of working directly on the data. query rewriting and structural disambiguation related work. In fact. which are then exploited in the proper query processing phase where we perform the rewriting of the submitted queries. Finally. Such services have been implemented in our XML S3 MART (XML Semantic Structural Schema Matcher for Automatic query RewriTing) system. From a technological point of view. designing and implementing the matching and rewriting functionalities was to offer a solution allowing easy extensions of the offered features and promoting information exchange between different systems. next generation web .5. the presented approach is completely generic and versatile and can be used to make explicit the meaning of a wide range of structure based information. in Sections 5.1.1 Matching and rewriting services 159 to approximate the user queries with respect to the different documents available in a collection.3 we provide a discussion on schema matching. in particular.5. we breifly describe how we plan to enhance the XML S3 MART system in order to support distributed Peer-to-Peer (P2P) systems and. ˆ in Section 5. we first exploit a reworking of the documents’ schemas (schema matching).2 we specifically analyze the matching and rewriting features. like XML schemas.1. The queries produced by the rewriting phase can thus be issued to a “standard” XML engine and enhance the effectiveness of the searches. In this section we give a brief overview of XML S3 MART motivations and functionalities and introduce how such a module can work in an open-architecture web repository offering advanced XML search functions. in Section 5. Then. the principle we followed in planning.1 and 5. 87] which can prove valuable in enhancing the effectiveness of the matching (and rewriting) techniques. then with the extracted information we interpret and adapt the structural components of the query (query rewriting). In Section 5. Peer Data Management Systems (PDMS) settings. in order to make them compatible with the available documents’ structures. web directories but also ontologies. the structures of XML documents.

e. from a structural point of view but related in their contents. XML S3 MART provides web services making use of SOAP which. Notice that the returned results can be actual XML textual documents but also multimedia data for which XML metadata are available in the document base. to query the available XML corpora by drawing their request on one of the XML Schemas (named source schema). The matching and rewriting services offered by XML S3 MART can be thought as the middleware (Figure 5. Then. give the architecture an high level of inter-operability. For all these reasons. At the interface level users can exploit a graphical user interface. they should access data and be accessed by other modules. one for each of the XML schemas the other useful documents are associated to (target schemas). deliver all the required functionalities to the users.1) of a system offering access to large XML repositories containing documents which are heterogenous. such as [28].160 Interface logic Approximate query answering in heterogeneous XML collections XQuery GUI client Business logic XML S3MART Data logic Data Manager & Search Engine XML Repository Multimedia Repository Figure 5. i. XML S3 MART automatically rewrites the query expressed on the source schema into a set of XML queries. as they are consistent with the structures of the useful documents in the corpus. together with the XML standard. The basic premise is that the struc- . The results are then gathered. incompatible.1: The role of schema matching and query rewriting in an openarchitecture web repository offering advanced XML search functions systems offering access to large XML document repositories should follow an open architecture standard and be partitioned in a series of modules which. together. ranked and sent to the user interface component. Let us now concentrate on the motivation and characteristics of our approximate query answering approach. such as [19]. ultimately tying their functionalities together into services. Such middleware interacts with other services in order to provide advanced search engine functionalities to the user. Such modules should be autonomous but should cooperate in order to make the XML data available and accessible on the web. the resulting XML queries can be submitted to a standard underlying data manager and search engine.

effective and efficient way. audio. described by XML schemas. Consider for instance the query shown in the upper left part of Figure 5. etc. all such documents can be useful to answer a query only if. using all the information extracted by such analysis. thus requiring to go beyond the exact match. which depicts the role of the two operations and the interaction between them. Then. See also Figure 5. which has been considered elsewhere (for instance in [130]). FOR $x IN /musicStore WHERE $x/storage/stock/compactDisk /albumTitle = "Then comes the sun " RETURN $x/signboard/namesign … rewritten query FOR $x IN /cdStore WHERE $x/cd/cdTitle = "Then comes the sun " RETURN $x/name Document in Repository 161 <cdStore> <name>Music World Shop </name> <address> .. are used to search the documents as they are involved in the query formulation. .. though being different.. the approximate query answering process is performed by first applying a query rewriting operation in a completely automatic. Being such similarities independent from the queries which could be issued.. the query needs to be rewritten (lower left part of Figure 5. Due to the intrinsic nature of the semi-structured data. thus fully exploiting the potentialities of the data.5. it would not be returned by a standard XML search engine.2: A given query is rewritten in order to be compliant with useful documents in the repository tural parts of the documents.. some kind of approximation could also be required for the values expressed in the queries as they usually concern the contents of the stored documents (texts. As a final remark.1 Matching and rewriting services Original query.). since its structure and element names are different. images. </address> <cd> <cdTitle>Then comes the sun </cdTitle> <vocalist>Elisa</vocalist> </cd> . however. the target schemas share meaningful similarities with the source one. </cdStore> Figure 5. In order to retrieve all such useful documents available in the document base. both structural (similar structure of the underlying XML tree) and semantical (employed terms have similar meanings) ones.2.3. notice that. video. We concentrate our attention only on the structural parts of the submitted queries and we do not deal with the problem of value approximation. The document shown in the right part of the figure would clearly be useful to answer such need.2). asking for the names of the music stores selling a particular album. they are identified by a schema matching operation which is preliminary to the proper query processing phase..

It is composed by three sub-processes (as shown in Figure 5. there is a basic difference between complex types. for each pair of schemas. which allow elements as their content and may carry attributes.3). Structural Schema Expansion The W3C XML Schema [132] recommendation defines the structure and data types for XML documents. which cannot have element content and attributes.1. are needed to maximize the effectiveness of the third phase. In XML Schema. the first two of which. An XML document referencing an XML schema uses (some of) the elements introduced in the schema and the structural relationships between them to describe the structure of the document itself. 5. In the structural schema expansion phase.e.1 Schema matching The schema matching operation takes as input the set of XML schemas characterizing the structural parts of the documents in the repository and. each XML schema is modified and expanded in order to make the structural relationships between the involved elements more explicit and thus to represent the class of XML documents it defines. identifies the “best” matches between the attributes and the elements of the two schemas. The purpose of a schema is to define a class of XML documents. the structural expansion and the terminology disambiguation. and declarations which enable elements and attributes with specific names and types (both simple and complex) to appear in document instances. and simple types. the structural part of the XML documents referencing . i. There is also a major distinction between definitions which create new types (both simple and complex).3: The XML S3 MART matching and rewriting services.162 Approximate query answering in heterogeneous XML collections Schema matching S Original XML Schemas Structural expansion Semantic annotation “Expanded” XML Schemas “Expanded & Annotated” XML Schemas Matching computation M Q Submitted query (source schema) Matching information Query Rewriting Q Rewritten queries (on target schemas ) Figure 5. the real matching one.

.e. locationType) and regular expression keywords (i. the original XML Schema contains.w3. For instance the path /musicStore/storage/stock selects all the stock elements that have a storage parent and a musicStore grandparent which is the root element. and ultimately better captures the tree structure underlying the concepts expressed in the schema. Figure 5.4. global definitions. for instance. every complex type . town country colorsign namesign stock .e. for instance. such as complex type definitions.org/2001/XMLSchema"> <xsd:element name="musicStore"> <xsd:element name="location"> <xsd:element name="town" type="xsd:string"/> <xsd:element name="country" type="xsd:string"/> </xsd:element> . </xsd:all> </xsd:complexType> <xsd:complexType name="locationType"> <xsd:all> <xsd:element name="town" type="xsd:string"/> <xsd:element name="country" type="xsd:string"/> 163 Underlying Tree structure (fragment) musicStore location signboard storage . which may interfere or even distort the discovery of the real underlying tree structure.. Going back to the example of Figure 5. while in the original XML Schema location was the child of a all node. elements musicStore.. element references..org/2001/XMLSchema"> <xsd:element name="musicStore" type="musicStoreType"/> <xsd:complexType name="musicStoreType"> <xsd:all> <xsd:element name="location" type="locationType"/> .Schema matching Original XML Schema (fragment) <xsd:schema xmlns:xsd="http://www.. In general. which was child of a complex type. Consider. is conceptually a child of musicStore: This relation is made explicit only in the expanded version of the schema. As a matter of fact. which is essential for an effective schema matching. location). and so on. Expanded XML Schema (fragment) <xsd:schema xmlns:xsd="http://www. XML Schema constructions need to be resolved and rewritten in a more explicit way in order for the structure of the schema to be the most possibly similar to its underlying conceptual tree structure involving elements and attributes. all). along with a fragment of the corresponding expanded schema file and a representation of the underlying tree structure expressing the structural relationship between the elements which can appear in the XML documents complying with the schema.w3.. searches are performed on the XML documents stored in the repository and a query in XQuery usually contains paths expressing structural relationships between elements and attributes. The resulting expanded schema file abstracts from several complexities of the XML schema syntax. complex types musicStoreType..e. Further. Figure 5. also type definitions (i. along with the element definitions whose importance is definitely central (i.. As can be seen from the figure. the element location.4 showing a fragment of an XML Schema describing the structural part of documents about music stores and their merchandiser.4: Example of structural schema expansion (Schema A) it.

. i. Terminology disambiguation As discussed in the introduction. by using an ad-hoc GUI. in Section 5. In this step. every element or attribute node has a name. leaf elements (or attributes) can hold a textual value. representing different meanings or senses. providing completely automatic terminology disambiguation.164 Approximate query answering in heterogeneous XML collections and keyword is discarded. which actually rely on the distance between meanings. To this end. The WordNet (WN) lexical database is conceptually organized in synonym sets or synsets. For each pair of schemas. Each term in WN is usually associated to more than one synset.e. whose primitive type is maintained and specified in their “type=. then. we exploit one of the most known lexical resources for the English language: WordNet [100]. Middle elements have children and these can be deduced immediately from the new explicit structure. This time the focus is on the semantics of the terms used in the element and attribute definitions. therefore we will consider term disambiguation as a semi-automatic operation where the operator. In this section the focus is on the rewriting and matching computation techniques. signifying it is polysemic.” parameter of the output schema. Indeed..2. and will be specifically discussed later in this chapter. Matching Computation The matching computation phase performs the actual matching operation between the expanded annotated schemas made available by the previous steps. a further step is required in order to refine and complete even more the information delivered by each schema. to select one of its synsets. is required to “annotate” each term used in each XML schema with the best candidate among the WN terms and. On the other hand.1). Our new structural disambiguation technique greatly enhances this approach. it has more than one meaning (some preliminary WordNet concepts are also available in Appendix A. thus maximizing the successive matching computation’s effectiveness. that is its meaning is made explicit as it will be used for the identification of the semantical similarities between the elements and attributes of the schemas. In the resulting expanded schema. we identify the “best” matchings between the attributes and the elements of the two schemas by considering both the structure of the corresponding trees and the semantics of the involved terms. each term is disambiguated. after having made explicit the structural relationships of a schema with the expansion process. in our opinion the meanings of the terms used in the XML schemas .

i. On the other hand.e. it is only the position of the corresponding nodes which can help us to better contextualize the selected meaning.5. In particular. For instance.Tree structure musicStore location C B A 165 Schema B .Tree structure cdStore storage E A B F signboard D name E address C cd D J K G town country colorsign namesign stock F city K street state vocalist cdTitle trackList passage H compactDisk songList track songTitle I G albumTitle title I H singer J Figure 5. For instance. where each entity represents an . it does not provide meanings for specific context. the terms used in the two schemas have already been disambiguated by choosing the best WN synset. which are quite common. At this step. it should be clear that the node albumTitle matches with the node cdTitle.5. as both refer to song title. title. The steps we devised for the matching computation are partially derived from the ones proposed in [98] and are the following: 1. Thus. they both describe the contents of the albums sold by music stores. the best choice is to associate terms as albumTitle and songTitle for Schema A and cdTitle and title for Schema B with the same WN term.Schema matching Schema A . for which the information about their location is also represented. Thus. and that songTitle matches with the node title. among the results of a query expressed by using Schema A we would also expect documents consistent with Schema B. As WordNet is a general purpose lexical ontology. for which the best synset can thus be chosen. a careful reader would probably identify the matches which are represented by the same letter. Though being different in the structure and in the adopted terminology.5: Example of two related schemas and of the expected matches cannot be ignored as they represent the semantics of the actual content of the XML documents. by looking at the two schemas of Figure 5. In these cases. let us consider the two expanded schemas represented by the trees shown in Figure 5. the involved schemas are first converted into directed labelled graphs following the RDF specifications [79]. as both refer to album title. the structural part of XML documents cannot be considered as a plain set of terms as the position of each node in the tree provides the context of the corresponding term.

Exploiting the semantics of the terms in the XML schemas provided in the disambiguation phase. In this case. the scores for each pair of synsets (s1 .6 shows a portion of the RDF graph and of the pairwise connectivity graph for Schema A and Schema B. Figure 5. We recall that hypernym relations are also known as IS-A relations (for instance “feline” is a hypernym for “cat”. since you can say “a cat is a feline”). /musicStore/location) and each literal represents a particular name (e. one for each RDF graph. In [98] the scores are obtained using a simple string matcher that compares common prefixes and suffixes of literals. which captures the involved schema structure. /cdStore name “musicStore” . 2. in order to maximize the matching effectiveness.g. Such label set can be optionally extended for further flexibility in the matching process. As to the labels on the arcs. “cdStore” child child child /musicStore/location name “location” /cdStore/address name “address” /musicStore/location . which quantifies the distance between the involved meanings by comparing the WN hypernyms hierarchies of the involved synsets. Then. location) or a primitive type (e. and name. we chose to adopt an in-depth semantic approach. an initial similarity score is computed for each node pair contained in the PCG. is constructed [98] in which a labelled edge connects two pairs of nodes. we mainly employ two kinds of them: child.g. /cdStore/address name “location” . if such labelled edge connects the involved nodes in the RDF graphs. we follow a linguistic approach in the computation of the similarities between pairs of literals (names). This is one of the most important steps in the matching process. xsd:string) which more than one element or attribute of the schema can share. “cdStore” Figure 5.166 RDF Model for Schema A (portion) Approximate query answering in heterogeneous XML collections RDF Model for Schema B (portion) Pairwise connectivity graph (portion) /musicStore name “musicStore” /cdStore name “cdStore” /musicStore . s2 ) are obtained by computing the depths of the synset in the WN hypernyms hierarchy and the length of the path connecting . From the RDF graphs of each pair of schemas a pairwise connectivity graph (PCG). Instead.g. involving node pairs.6: RDF and corresponding PCG for portions of Schemas A and B element or attribute of the schema identified by the full path (e.

At present. Finally. all the full paths in the query are rewritten by using the best matches between the nodes in the given source schema and target schema (e. depth of s1 + depth of s2 167 3. on the target schemas. we support conjunctive queries with standard variable use. The initial similarities.g. the query given in Figure 5.1. In fact. 4. In other words. Each rewrite is assigned a score.2 Automatic query rewriting By exploiting the best matches provided by the matching computation. written w. we straightforwardly rewrite a given query. y ) exists such that x is more similar to y than to y and y is more similar to x than to x .Automatic query rewriting them as follows: 2 ∗ depth of the least common ancestor . are refined by an iterative fixpoint calculation as in the similarity flooding algorithm [98]. predicates and wildcards (e.2). a source schema. The fixpoint computation is iterated until the similarities converge or a maximum number of iterations is reached. . reflecting the semantics of the single node pairs. no other pair (x . we rewrite the query for each of the target schemas in the following way: 1. this method is one of the most versatile and also provides realistic metrics for match accuracy [48]. in order to allow the ranking of the results retrieved by the query. After having substituted each path in the WHERE and RETURN clauses with the corresponding full paths and then discarded the variable introduced in the FOR clause. y). which brings the structural information of the schemas in the computation.t. The intuition behind this computation is that two nodes belonging to two distinct schemes are the more similar the more their adjacent nodes are similar. We will now briefly explain the approach and show some meaningful examples. for each pair of nodes (x. 5.r. the similarity of two elements propagates to their respective adjacent nodes. The stable marriage filter guarantees that. Query rewriting is simplified by the fact that the previous phases were devised for this purpose: The expanded structure of the schemas summarizes the actual structure of the XML data where elements and attributes are identified by their full paths and have a key role in an XQuery FLWOR expression paths.g. we apply a stable marriage filter which produces the “best” matching between the elements and attributes of the two schemas.

The value of the $x variable in the submitted query is the path of the element track in Schema A and the corresponding element in Schema B is passage (label H in Figure 5. For instance. a variable is reconstructed and inserted in the FOR clause in order to link all the rewritten paths (its value will be the longest common prefix of the involved paths).5). It is the average of the scores assigned to each path rewriting which is based on the similarity between the involved nodes.7: Examples of query rewriting between Schema A and Schema B the path /musicStore/storage/stock/compactDisk of Schema A is automatically rewritten in the corresponding best match. /cdStore/cd of Schema B). 3. all the corresponding paths are rewritten and put in an OR clause. The submitted queries are written by using Schema A of Figure 5. Figure 5. Query 2 demonstrates the rewriting behavior in the variable management.7 shows some examples of query rewriting.5) will be the one used in the rewrite.5 and the resulting rewriting on Schema B is shown on the right of the figure. as specified in the match. However . the best matches are accessed not exactly but by means of regular expressions string matching. Query 1 involves the rewriting of a query containing paths with wildcards: In order to successfully elaborate them. 2. When more than one path of the source schema satisfies a wildcard path.168 Original Query on Source Schema A Approximate query answering in heterogeneous XML collections Automatically Rewritten Query on Target Schema B Query 1 FOR $x IN /musicStore WHERE $x/storage/*/compactDisk//singer = "Elisa" AND $x//track/songTitle = "Gift" RETURN $x/signboard/namesign Query 2 FOR $x IN /musicStore/storage/stock /compactDisk/songlist/track WHERE $x/singer = "Elisa" AND $x/songtitle = "Gift" RETURN $x Query 3 FOR $x IN /musicStore WHERE $x/storage/stock/compactDisk = "Gift" AND $x/location = "Modena" RETURN $x FOR $x IN /cdStore WHERE $x/cd/vocalist = "Elisa" AND $x/cd/trackList/passage/title = "Gift" RETURN $x/name FOR $x IN /cdStore/cd WHERE $x/vocalist = "Elisa" AND $x/trackList/passage/title = "Gift" RETURN $x/trackList/passage FOR $x IN /cdStore WHERE ( $x/cd/vocalist = "Gift" OR $x/cd/cdTitle = "Gift" OR $x/cd/trackList/passage/title = "Gift" ) AND ( $x/address/city = "Modena" OR $x/address/street = "Modena" OR $x/address/state = "Modena" ) RETURN $x Figure 5. the only path of the tree structure of Schema A satisfying the path /musicStore/storage/*/compact Disk//singer is /musicStore/storage/stock/compactDisk/songList/ track/singer and the corresponding match in Schema B (label J in Figure 5. a score is assigned to the rewritten query.

We refer to “hot” research topics. Indeed. which are represented adopting XML or ontology based data models. textual values.2 Structural disambiguation service In this section we propose a service for automatic structural disambiguation which can prove valuable in enhancing the effectiveness of the matching (and rewriting) techniques described in the previous section and. in these cases. Notice that.. the predicates expressed on such element are rewritten as OR clauses on the elements which are descendants of the matching target element and which contain a compatible value. the corresponding best matches in Schema B. It contains many polysemous words. the query is correctly rewritten as we first substitute each path with the corresponding full path and then we reconstruct the variable. better enabling computers and people to work in cooperation” [14].8 shows the hierarchical representation of a portion of the categories offered by eBay . of the majority of the available knowledge-based applications. In these contexts. but also XML data clustering and classification [128. At present.. For example. one of the most famous world’s online marketplaces (nodes are univocally identified by their pre-order values). as considered in the previous section for XML S3 MART and also in peer data management systems (PDMS) [84]. cdTitle and title. approaches which exploit the semantics of the information they access. are not both descendants of passage. which in this case holds the value of path cdStore/cd. an extension of the current web in which information is given well-defined meaning. not only schema matching and query rewriting.2 Structural disambiguation service 169 directly translating the variable value in the rewritten query would lead to a wrong rewrite: While the elements singer and songTitle referenced in the query are descendants of track. In example 3 an additional rewrite feature is highlighted concerning the management of predicates involving values and. therefore the condition is rewritten on the descendant leaves vocalist. most of the proposed approaches share a common basis: They focus on the structural properties of the accessed information. in particular. the element compactDisk and its match cd on Schema B are not leaf elements. 53]. Figure 5. For instance. 5. whenever the best match for an element containing a value is a middle element. are rapidly acquiring more and more importance in a wide range of application contexts. that is songTitle and vocalist. from string to batteries and memory. all going in the direction of the Semantic Web “. in general. to which com- . i.5. knowledge-based approaches.e. 131] and ontology-based annotation of web pages and query expansion [39. and their effectiveness is heavily dependent on knowing the right meaning of the employed terminology.

can be compared against the graph context to refine the results. that string are “stringed instruments that are played with a bow” and batteries are electronic devices and not a group of guns or whatever else. and the sense contexts.170 buy computers desktop PC components Approximate query answering in heterogeneous XML collections 1 2 8 9 cameras 11 antiques musical instruments 15 string 14 accessories 12 forniture batteries 13 3 10 chair 4 memory 5 speaker 6 fan 7 mouse Figure 5. peers not necessarily store actual data. as suggested by most of the classic wsd studies. In this way. web directories.2. disambiguation is founded on the hypernymy/hyponymy hierarchy. We follow instead a different approach: The exploitation of the information provided by commonly available thesauri such as WordNet [100]. starting from a given node. In particular. monly available vocabularies associates several meanings.8: A portion of the eBay categories. while the proper disambiguation algorithm is presented in . extracted from the thesaurus. including XML schemas as employed for instance in XML S3 MART. It can be used to make explicit the meaning of a wide range of structure based information. For instance. Such service has been implented in our STRIDER (STRucture-based Information Disambiguation ExpeRt) system. Subsection 5. where several solutions have been proposed for free text. for instance. More precisely. Our generic disambiguation service works on graph-like structured information. Moreover. and ontologies. which are not always available. Starting from the lesson learnt in the word sense disambiguation (wsd) field [68].1 presents an overview of our disambiguation approach. in a PDMS. the disambiguation method does not depend on training data or extensions. The outcome of the overall process is a ranking of the plausible senses for each term. we have conceived a versatile approach which tries to disambiguate the terms occurring in the nodes’ labels by analysing their context and by using an external knowledge source. we are able to support both the assisted annotation and the completely automatic one whenever the top sense is selected. The information given by the surrounding nodes allows us to state. but also the structures of XML documents. we support several ways of navigating the graph in order to extract the context which can thus be tailored on the specific application needs. mainly focusing on trees.

2. The service is able to disambiguate XML schemas. Terms / Senses Suggestions Polysemous Terms Context extraction Terms / Senses Selection Graph Context Extraction Context Expansion Contextualized Polysemous Terms Terms Disambiguation {0. web directories. the structures of XML documents. The only external source is a thesaurus associating each word with the concepts or senses it is able to express.8. without loss of soundness. . we will show that the service can be straightforwardly extended to graphs.g. property). such information descriptions which can be represented as trees.. …} Arc weights External knowledge sources Terms / senses feedback Figure 5. The tree contains a set of nodes whose labels must be disambiguated and a set of arcs which connect pairs of nodes and which may as well be labelled (e.9: The STRIDER graph disambiguation service. in general. The individuation of the correct sense for each label can be possible by analysing the context of the involved terms and by using an external knowledge source.Overview of the approach Graph Terms’ Sense Rankings 171 .1 Overview of the approach In this section we present the functional architecture of the generic structural disambiguation service (see Figure 5. 5. Indeed.9) and introduce relevant terminology..2. and abstract from the complexity of the language syntax. thus capturing the element context. Each arc label is associated with two weights between 0 and 1 (default value 1). We emphasize that no extension or training data is required for our disambiguation purposes as they are not always available. As a particular case. Section 5. type. XML schemas are represented as trees which make explicit the structural relationships between the involved elements. at the end of the present section. Weights will be used to compute the distance between two nodes in the graph and the lower the weight of an arc is the closer two nodes connected by such arc are. and. Being trees particular kinds of graphs.2. Arcs are particularly important as they connect each label with its context. one for each crossing direction (direct and inverse). in the following we will use indifferently the terms tree and graph.

we also provide the possibility of including the siblings of the term’s node in the context. In principle. . It actually represents the conceptual structure of the most common application contexts such as web directories. For instance. The above options can be freely combined. sk ]. N ) is then associated with its context. the more one node is close to the term’s node and is connected by arcs with low weights the more it influences the term disambiguation.172 Approximate query answering in heterogeneous XML collections The “terms/senses selection” component in Figure 5.9 takes the label of each node N of the tree. we count the number of instances corre1 Notice that the same term could be included more than once and that the disambiguation is strictly dependent on the node each instance belongs to. When the only crossing direction is the direct one. N )1 with a list of senses Senses(t. the “graph context extraction” component in Figure 5. we associate each reachable node Nc in the context with a weight weight(Nc ) computed as follows. N ) by extracting its graph context Gcontext(t. Given the path from the node Nc to the term’s node N . it is represented by the ancestors. . . the context is defined by the descendants or subtree of the term’s node. Each polysemous term (t. for the eBay example. Conversely.9 contextualizes each polysemous term (t. XML documents. and siblings whereas the whole structure would be useful for structures dealing with more “contextualized” topics such as book descriptions. different applications require different contexts. one of the best crossing settings is to include ancestors. Indeed. let us consider trees having no label on the arcs. N ) from the set of terms belonging to the reachable nodes. For this reason. descendants. As a special case. that is it is possible to specify which kinds of arcs are crossable. . in which direction and the maximum number of crossings (distance from the term’s node). The set of crossable arc labels and the corresponding crossing directions is shrinkable. The context is first extracted from the tree but it does not necessarily coincide with the entire tree. and XML schemas. Moreover. By default. using categories such as women’s clothing would be quite misleading. . In principle.8) and associates each of these terms (t. while disambiguating the term string in the musical instruments category of eBay . Not all nodes contribute with the same weight to the disambiguation of a term. s2 . For instance. the nodes reachable by the term’s node N through any arc belong to the term’s context. N ) = [s1 . Thus we support different contexts by means of different crossing settings. extracts the contained terms (which can also be more than one as for instance desktop PC components in Figure 5. as we deal with trees. such list is the complete list of senses provided by the thesaurus but it can also be a shrunk version suggested either by human or machine experts or as feedback of a previous disambiguation process. Given a crossing setting.

N ). The context of each term (t. It supports several disambiguation needs by means of parameters which can be freely combined.5).5). It is particularly useful when the graph context provides too little information. 3. 3).8.e. N ) is disambiguated by using the previously extracted context. The proper disambiguation process is the subject of the following section.95.e. The result is a ranked version of Senses(t.1 Assume that in the eBay tree (Figure 5. respectively. weight(2) = 0. The overall approach is quite versatile. The graph context of the term (mouse. (PC. As most of the semantics is carried by noun words [68]. 4). the examples and any other explanation of the sense provided by the thesaurus. 2). Then. the ranking approach has been conceived in order to support two types of graph disambiguation services: The assisted and the completely automatic one. N ) where each sense s ∈ Senses(t. 1 arc crossed in the opposite direction with weight 0. Finally. 4 (5 and 6) are 1 (i.Overview of the approach 173 sponding to each pattern specified in the crossing setting (i. 6). In particular for each sense we consider the definitions. (memory. Then. (components. Senses(tc . and (fan. and that the maximum number of crossings is 2. that the weight of the parent/child arcs is 1 in the direct direction and 0. weight(3) = 0.8) the context is made up of the siblings and ancestors. the “context expansion” module in Figure 5. 0. In the former case. 1 arc crossed in the opposite direction with weight 0.5 and 1 arc crossed in the direct direction with weight 1). the disambiguation d2 . Nc ). weight(Nc )) defined from each term tc belonging to each reachable node Nc . weight(Nc ) is computed by applying a gaussian distance decay function defined on d: e− 8 2 weight(Nc ) = 2 · √ + 1 − √ 2π 2π Thus each element of the graph context is a triple ((tc . N ) can be expanded by the contexts Scontext (s) of each sense s in Senses(t.9 defines Scontext(s) as the set of nouns contained in the sense explanation.5 (i. and weight(4) = weight(5) = weight(6) = 0. The distance between node 7 and nodes 2.91. arc label and arc crossing direction) and we define the distance d between N and Nc as the sum of the product of the weights associated to each pattern and the corresponding number of instances. N ) is associated with a confidence φ(s) in choosing s as a sense of (t.5 in the opposite one. N ). 2 arcs crossed in the opposite direction with weight 0. from the weights to the graph context. each term (t. 5). (speaker. Nc ).5 (i. and 1. 3).e. Moreover. N ) with its senses Senses(t. 7) is made up of the terms (computers.e. Example 5. (desktop. 3).

2. Moreover. in particular. s2 . It takes in input a term (t. N ). trees are our main focus of interest and. Note that some of the techniques presented in this section are borrowed and adapted from the ones of the free-text word sense disambiguator we devised for our EXTRA system (see Chapter 2 and Appendix A. The only problem is in the weight computation where more than one path can connect a pair of nodes. The obtained confidence vector tunes two contributions (line 11): That of the context. . N ) to be disambiguated and produces a vector φ of confidences in choosing each of the senses in Senses(t. there is no human intervention and the selected sense can be the top one. at present.2 The disambiguation algorithm The algorithm for disambiguation we devised follows a relational information and knowledge-driven approach. . However. Indeed. γ + = 1. we would be able to disambiguate ontologies written in different languages such as OWL and RDF where arc labels are quite frequent (e. given Senses(t. In this way. . whose weight is expressed by the constant α and which is subdivided in the graph context (confidence vector φG . in RDF arcs can be of subClassOf type or range type or many other types). In the latter case. we use additional information provided by thesauri: The hypernymy/hyponymy relationships among senses and the sense explanations and frequencies of use. Indeed.1 for a detailed description). only the context extraction phase accesses the submitted structure whereas the actual disambiguation algorithm is completely independent from it. In particular. . trees having no label on the arcs which have been subject of our tests. N ). Moreover. and that of the frequency of sense use in English language (confidence vector φU ) with weight β. .2 The terms surrounding a given one provide a good informational context and good hints about what sense to choose for it. the context is not merely considered as a bag of words but other information such as their distance from the polysemous term to be disambiguated and semantic relations are also extracted. while we plan to deal with general graphs in the future.10. the one with the lower distance could be selected. the above approach can be straightforwardly applied also to graphs. sk ].g. The algorithm is shown in Figure 5. weight γ) and the expanded context (confidence vector φE . α + β = 1. In this case. N ) = [s1 .174 Approximate query answering in heterogeneous XML collections task is committed to a human expert and the disambiguation service assists him/her by providing useful suggestions. weight ). 5. The contribution of the 2 All operations on the vectors are the usually defined ones. φ is a vector of k values and φ[i] is the confidence in choosing si as sense of (t.

The disambiguation algorithm algorithm Disambiguate(t. . Nc ) is weighted by the relative position in the graph of the tc ’s node. tc ) is ancestor of si (5) φC [i]=sim(t.10: The disambiguation algorithm function T ermCorr(t.e. . . Nc ) in Gcontext(t.t. Scontext(si )) (10) φU [i]=decay(si ) (11) φ=α(γ ∗ φG + ∗ φE ) + β ∗ φU 175 Figure 5. . φG is the sum of the values measuring the level of semantic correlation between the polysemous term t and the ones in the graph context Gcontext(t. N ). norm) (1) c(t. Finally in step 6 the whole vector φG is divided by the norm value in order to obtain normalized confidences. . w. N ) (step 4). . In particular. tc . N ) (4) if c(t. tc .N ) (02) norm = 0 (03) for each (tc . The contribution of each context term (tc . The basis of function T ermCorr() (see Figure 5. N ) (08) if expanded context (09) φE [i]= ContextCorr(Gcontext(t. N ) (04) φG = φG + weight(Nc ) ∗ T ermCorr(t. the intuition behind the similarity is . N (i. N ) //graph context contribution (01) φG = [0.11) derives from the one in [113].11: The T ermCorr() function graph context is computed from step 1 to step 6. the confidence in choosing one of the senses associated with each term is directly proportional to the semantic similarities between that term and each term in the context. 0] # senses in Senses(tc . tc ) is the minimum common hypernymy of t and tc (2) φC = [0. tc ) (7) return φC Figure 5. Nc . tc ) (6) norm=norm + sim(t. norm) (05) norm = norm ∗ weight(Nc ) (06) φG =φG /norm //expanded context contribution (07) for i from 1 to the number of senses in Senses(t. 0] (3) for i from 1 to the number of senses in Senses(t. weight(Nc )). As in [113]. .r. .

N ) and each sense in Senses(tc . tc ) is the minimum among the number of links connecting each sense in Senses(t.10.e. line 9). in WordNet the minimum path length between the terms “cat” and “mouse” is 5. one of the most promising measures is the Leacock-Chodorow [82]. the value assigned in φG to each sense is then the proportion of support it receives. line 6). tc ) as it does not rely on a training phase on large pre-classified corpora but exploits the hypernymy hierarchy of the thesaurus.176 Approximate query answering in heterogeneous XML collections that the more similar two terms are. At the end of the process (Figure 5.10. also the expanded context can be exploited in the disambiguation process (Figure 5. Nc ) and H is the height of the hypernymy hierarchy (in WordNet it is 16).1 is decreasing as one moves higher in the taxonomy thus guaranteeing that “more abstract” is synonymous of “less informative”. Therefore. N ) which are descendants of the minimum common hypernym (lines 3-4) and the increment is proportional to how informative the minimum common hypernym is (line 5). which has been reviewed in the following way: sim(t. lines 7-9). that crossed in the computation of len(t. function T ermCorr() increases the confidence of such senses in Senses(t. For instance. we define the minimum common hypernym c(t.tc ) 2·H 0 if ∃ a common hypernymy otherwise (5.1) where len(t. our approach differs in the semantic similarity measure sim(t. It essentially computes the semantic similarity between each term ti in the graph context and the terms in the sense context Scontext(s) (lines 3-7) by calling the T ermCorr() function for each term ts in Scontext(s) j (line 6) and then by computing the maximum of the obtained confidence . out of the support possible which is kept updated by function T ermCorr() (line 6) and in the main algorithm (Figure 5. Eq. tc )). N ) of the polysemous term (t.12. tc ) = −ln len(t. line 5). Obviously these two values are not computed within the function but once for each pair of the involved terms.10. N ) represented by Scontext(s). the main objective is to quantify the semantic correlation between the context Gcontext(t. tc ) of t and tc as the sense which is the most specific (lowest in the hierarchy) of the hypernyms common to the two senses (i. the more informative will be the most specific concept that subsumes them both. The pseudocode of function ContextCorr() is shown in Figure 5. N ) and the explanation of each sense s in Senses(t. 5. However. In this case. In particular.10. the confidence in choosing s is proportional to the computed similarity value (Figure 5. Beside the contribution of the graph context. since the senses of such nouns that join most rapidly are “cat (animal)” and “mouse (animal)” and the minimum common hypernym is “placental mammal”. Moreover. In this context.

3 Related work function ContextCorr([t1 . we quantify the frequency of the senses where the first sense has no decay and the last sense has a decay of 1:5.10. Much research has been .12: The ContextCorr() function vector φT . the first is the most common sense. [ts . The last contribution is that of function decay(). .5. In particular. . etc. ts .3 5. line 9). . . The returned value (line 8) is the mean of the similarity values computed for the terms in Gcontext(t. exploiting the frequency of use of the senses in English language (Figure 5. . . N ) in a way which is inversely proportional to its position. several works took into account the problem of answering approximate structural queries against XML documents. We increment the confidence in choosing each sense s in Senses(t. norm) j (7) φC [i]=max(φT /norm) (8) return mean(φC ) 177 Figure 5. notice that for the sake of simplicity of presentation. for efficiency reasons.3. . . pos(s). 0] (2) for i from 1 to n (3) φT = [0. 5. .8 and |W N Senses(t)| is the cardinality of W N Senses(t). In this way. algorithm Disambiguate() takes one term at a time.). N ). . Such an adjustment attempts to emulate the common sense of a human in choosing the right meaning of a noun when the context gives little help. . . in such ordered list: decay(si ) = 1 − ρ pos(si ) − 1 |W N Senses(t)| where 0 < ρ < 1 is a parameter we usually set at 0. tn ]. 0] (4) norm = 0 (5) for j from 1 to m (6) φT = φT + T ermCorr(ti . . . . However. in the actual implementation the sim() computation is performed only once for a given pair of terms (also swapped as sim() is a symmetric measure).e. . ts ]) m 1 (1) φC = [0. As a final remark. WordNet orders its list of senses W N Senses(t) of each term t on the basis of the frequency of use (i.1 Related work Approximate query answering Recently.

in particular for regular path queries. [111] presents an approach based on the exploitation of a description logic. DB and knowledge representation communities [15. in general. The necessity of looking at the context of a word in order to correctly disambiguate it is universally accepted.3. in order to rewrite the submitted queries on the involved sources: For instance. and Similarity Flooding (SF) [98]. most of the works present interesting and complex theoretical studies [107]. are equally very expensive and generally deliver inadequate performance due to the very large size of most of the available XML data trees. the proposed rewriting approaches rarely actually benefit from the great promises of the schema matching methods. such as [37].178 Approximate query answering in heterogeneous XML collections done on the instance level. the process of unordered tree matching is difficult and extremely time consuming. 48]. providing a particularly versatile graph matching algorithm. As to rewriting. In [27] the theoretical foundations for query rewriting techniques based on views of semi-structured data. combining a name and a structural matching algorithm. we will first briefly review our disambiguation approach in the more “classic” and well studied field of wsd for free text. CUPID [85]. for instance. which supports the combination of different schema matching techniques. However. the edit distance on unordered trees was found in [141] N P hard. such as LSD and GLUE [50]. Schema matching is a problem which has been the focus of work since the 1970s in the AI. while [106] deals with the problem of the informative capability of each of the sources. most of the work on XML schema matching has been motivated by the problem of schema integration: A global view of the schemas is constructed and from this point the fundamental aspect of query rewriting remains a particularly problematic and difficult to be solved aspect [115].2 Free-text disambiguation Before discussing the few approaches proposed for the “structural” disambiguation problem. Some rewriting methods have also been studied in the context of mediator systems. However. which are based on machine learning approaches needing a preliminary training phase. 5. Ad-hoc approaches based on explicit navigation of the nodes’ instances. Many systems have been developed: The most interesting ones working on XML data are COMA [49]. nonetheless two different approaches exist: The . On the other hand. Many approaches also combine schema level with instance level analysis. trying to reduce the approximate structural query evaluation problem to well-known unordered tree inclusion [119] or tree edit distance [65] problems directly on the data trees. a large number of approaches prefer to address the problem of structural heterogeneity by first trying to solve the differences between the schemas on which data are based. However.

WordNet is. The disambiguation algorithm developed by us adopts the relational information approach which is more complex but generally performs much better. [98. the most common one is the corpus-based or statistic approach where the context of a word is combined with previously disambiguated instances of such word. Recently. Among the alternative approaches. such as simple string matching possibly considering its synonyms (e. Also. This problem prevents their use in the application contexts we refer to. the semantic closeness between nodes relies on syntactic approaches. Such approach often benefits from a general applicability and is able to achieve good effectiveness even when it is not restricted to specific domains. new methods relying on the entire web textual data. the most common of which are the path based ones [82]. Further. and in particular on the page count statistics gathered by search engines like Google [38. Our disambiguation method is a knowledge-driven method as it combines the context of the word to be disambiguated with additional information extracted from an external knowledge source.Structural disambiguation 179 bag of words approach. schema matching and the XML data clustering.3. However. huge textual corpora. In the literature. 39] have also been proposed. 5.g. the descriptions and glosses provided for each term can deliver additional ways to perform or refine the disambiguation: The gloss overlap approach [11] is one of them. generally speaking. In many schema matching approaches. peers not necessarily store actual data).g. which extends the former with other information such as their distance or relation with the involved word. which are not always available. and few actual structural disambiguation approaches have recently been presented. such as electronically oriented dictionaries and thesauri. the problem of such approaches is that they are extremely data hungry and require extensive training. the most used external knowledge source [26] and its hypernym hierarchies constitute a very solid foundation on which to build effective relatedness and similarity measures. a further distinction is based on the kind of information source used to assign a sense to each word occurrence [68]. 85]). in a PDMS. However. a good number of statistical wsd approaches have . to our best knowledge. extracted from annotated corpora [4. and the relational information approach. with no doubt. 5].3 Structural disambiguation Structural disambiguation is acknowledged as a very real and frequent problem for many semantic-aware applications. and/or a very large quantity of manual work to produce the sense-annotated corpora they rely on. up to now it has only been partially considered in two contexts. as even “raw” data are not always available (e. where the context is merely a set of words next to the term to disambiguate.

they rely on additional data which may not always be available. where disambiguation is performed on the documents’ tag names. . with the WordNet ones.4. Further.180 Approximate query answering in heterogeneous XML collections been proposed in the matching context (e. It fully exploits the potentialities of the context of a node in a graph structure and its extraction is flexible enough to include relational information between the nodes and different kinds of relationships. the approach we presented in Section 5. However. and this appears as a quite strong requirement. For the graph construction. the schema relations have to coincide. the textual content of the element and the text of the subordinate elements and then it is enlarged by including related words retrieved with WordNet.1) for the matching and rewriting services. and STRIDER for the structural disambiguation one (Section 5. in [131] the authors propose a technique for XML data clustering. Generalizing. as we already outlined. In a similar scenario.g. 5. In order for this approach to be fully effective. we present a selection of the most interesting results we obtained through the experimental evaluation performed on the prototypes of XML S3 MART (Section 5. and in particular the hypernym ones which are the most used for building effective relatedness measures between terms in free text wsd.2 differs from the existing structural disambiguation approaches as it has not been conceived in a particular scenario but it is versatile enough to be applicable to different semantic-aware application contexts. This context is then compared to the ones associated to the different WordNet senses of the term to be disambiguated by means of standard vector model techniques. In a schema matching application. the method proposed in [128] performs disambiguation by applying a shortest path algorithm on a weighted graph constructed on the terms in the path from each node to the root and on their related WordNet terms. In particular. we fully exploit WordNet hierarchies. [84]).4 Experimental evaluation In this section we provide an extensive experimental evaluation of the techniques we proposed in this chapter. The local context of a tag is captured as a bag of words containing the tag name itself.4. WordNet relations are navigated just one level. As to the proper structural disambiguation approaches. descendants or siblings. [17] presents a node disambiguation technique exploiting the hierarchical structure of a schema tree together with WordNet hierarchies. at least partially.2). such as ancestors.

and schemas officially adopted in worldwide DLs in order to describe bibliography metadata or audio-visual content.4. IFLA-FRBR (International Federation of Library Associations and Institutions Functional Requirements for Bibliographic Records) and RSLP (Research Support Libraries Programme) Collection Description proposal. Such schemas include the ones we devised ad hoc in order to precisely evaluate the behavior of the different features of our approach. such as e-books. notably MPEG-7. Experimental setting We evaluated the effectiveness of our techniques in a wide range of contexts by performing tests on a large number of different XML schemas.1 Matching and rewriting Since in our method the proper rewriting phase and its effectiveness is completely dependent on the schema matching phase and becomes quite straightforward. In particular. ˆ XML multimedia description standards. proposed by the Digital Rights Management community for declaring the rights connected with digital distributions of works. ˆ specific IFLA-FRBR extensions for the description of audio-visual content. such as DCMI (Dublin Core Metadata Initiative).5. In this section we will discuss the results obtained for the music store example and for a real case concerning the schemas employed for storing the two most popular digital libraries of scientific references in XML format: . in particular DBLP Computer Science Bibliography archive and ACM SIGMOD Record XML Collection.Matching and rewriting 181 5. we further tested XML S3 MART on carefully selected pairs of similar “official” schemas derived from: ˆ generic digital libraries metadata description standards. ˆ schemas employed in popular XML digital libraries of scientific works. such the ones proposed in the ECHO (European CHronicles Online) Project. we will mainly focus on the quality of the matches produced by the matching process. ˆ specific XML languages such as ODRL (Open Digital Rights Language). an example of which is the music store example introduced in Figure 5.

27 compactDisk 0. For instance.18 0. After annotation.98 cdStore 0.26 Approximate query answering in heterogeneous XML collections 0. are also the key to identify matches B. annotated as title) and of the very few terms not present in WN (e. Consider for instance the small excerpt of the results before filtration shown in Figure . similarity scores are shown on the edges. XML S3 MART iterative matching algorithm automatically identified the best matches among the node pairs which coincides with the ones shown in Figure 5.5. Effectiveness of matching For the music store schema matching. location and address) and also different conceptual organizations (notably singer. Notice that before applying the stable marriage filtering each node in Schema A is matched to more than one node in Schema B. all four annotations are the same (title) but the different contexts of surrounding nodes allow XML S3 MART to identify the right correspondences. cdTitle. matches A. The annotation phase for the ad-hoc schemas was quite straightforward. D. vocalist. H and J. in which we associated each of the different terms to the most similar term and sense available in WordNet.g. we performed a careful annotation. F and G between nodes with identical annotation and a similar surrounding context are clearly identified. compactDisk annotated as cd ). associated to each of the tracks of a cd in Schema A. together with similar but not identical annotations. In these cases. The matches I and K require particular attention: Schema A songTitle and albumTitle are correctly matched respectively with Schema B title and cdTitle. pertaining to a whole cd in Schema B). we devised the two schemas so to have both different terms describing the same concept (such as musicStore and cdStore. simply choosing the best matching node in Schema B for each of the nodes in Schema A would not represent a good choice. since we basically only had to associate each term with the corresponding WN term. A very similar context of surrounding nodes.31 0.g. The DBLP Computer Science Bibliography archive and the ACM SIGMOD Record.13: A small selection of the matching results between the nodes of Schema A (on the left) and B (on the right) before filtering. E. vs.59 0. Firstly.182 musicStore 0.08 cd stock street Figure 5. the only peculiarities are the annotations of composite terms (e. C.

while stock. is ultimately matched with street. Figure 5. and cd . involving articles and proceedings. Notice that the nodes having no match (weak matches were pruned out by filtering them with a similarity threshold of 0. The effectiveness is very high: Practically all the matches. terms like procedings (DBLP) and issue (SIGMOD) were annotated with the respective WN terms. such as an higher number of nodes.1) and will be finally filtered out by a threshold filter.14. and we tested if the similarities between the involved nodes increased as expected. The same applies between stock . describing the proceedings of conferences along with the articles belonging to conference proceedings. is annotated as link. the link to the electronic edition of an article.14 shows the two involved schemas. the proposed pair of schemas presents additional challenges. the score for such match is very low (< 0. to artificially hint at the right matches by selecting identical annotations for different corresponding terms: For instance. In particular. In general. with a higher depth for SIGMOD). a node which has no correspondent in Schema B. a term not available in WN. structures describing the same reality with different levels of detail (as for author) and different distribution of the nodes (more linear for DBLP. Along with the complexities already discussed in the ad-hoc test. For instance. we tried to be as objective and as faithful to the schemas’ terms as possible. we modified Schema B by making the vocalist node child of the node passage. Therefore.1) actually represent concepts not covered in the other schema. the matches for musicStore and compactDisk are correctly selected (similarities in bold). passage and vocalist and their Schema B correspondents were 10% to 30% higher. Each match is identified by the same letter inside the nodes and is associated with a similarity score (on the right). to the most subtle.compactDisk. avoiding. After annotation. such as . are correctly identified without any manual intervention. We also tested our system against specific changes in the ad-hoc schemas. as article. for instance. dblp was annotated as bibligraphy while sigmodRecord as record. inprocedings.Matching and rewriting 183 5. inproceedings and article).13: The best match for stock (Schema A) is cdStore but such node has a better match with musicStore. we noticed that the similarities of the matches between cd. In such real cases the annotation process is no longer trivial and many terms could not have a WN correspondence: For instance DBLP’s ee. making the whole schema more similar to Schema A structure. However.cd. such as different terms describing the same concept (proceedings and issue. from the fundamental ones like B and G. As to the tests on a real case. making the evaluation of the matching phase particularly critical and interesting. such as L involving the link to electronic editions. XML S3 MART matcher automatically produced the matches shown in Figure 5.

L. represented by a letter. simplifying the process of making annotations of complex terms like these.22 0.21 0. for nodes like Schema B articles. or. E. of m:n correspondences be- .29 0. G.14: Results of schema matching between Schema DBLP and Schema SIGMOD.29 0. On the other hand. where the similarity of the annotations and contexts is very high.28 0.48 0. In this regard.Tree structure Approximate query answering in heterogeneous XML collections Schema SIGMOD . The same applies for the node author in Schema A. is associated to a similarity score (shown on the right).29 issue B D E F B C volume number year C D E F month conference location key title volume number year inproceedings G articles article G H I J K L H J K key author title pages ee crossref article Code title init end Page Page L toFull Text authors href author Position author I author Name Figure 5. while the key of a proceedings has no match in Schema B. nodes like key and title are present twice in DBLP but are nonetheless correctly matched thanks to the influence of the surrounding similar nodes: In particular. even more generally. F.22 0. from matches D. representing the name of an author. Each match. an additional improvement.Tree structure sigmod Record issues dblp proceedings A A Sim Scores A B C D E F G H I J K L 0. and as such has been annotated and correctly matched. Indeed. we think that a good future improvement. would be to allow and exploit composite annotations such has “the title of a conference”.31 0. the key of an inproceedings is associated to the articleCode of an article. and J. “a group of authors”.64 0.184 Schema DBLP . Matches C and I are particularly critical and can be obtained by annotating the terms with particular care: The node conference in Schema B actually represents the title of a given conference. might be to enable the identification of 1:n or.29 0. to A and B. involving terms with similar contexts and identical annotations.98 0. “the name of an author”. also the fixed point computation relying on the structure of the involved schemas is quite important. authorPosition or location for SIGMOD. this time to the matching algorithm. Further. The semantics delivered by the terminology disambiguation have a great role in deciding the matches.

On the other hand. it has been necessary to find a good trade-off between the influence of the similarities between given pairs of nodes and that of the surrounding nodes. Experimental setting Tests were conceived in order to show the behavior of our disambiguation approach in different scenarios. We performed many other tests on XML S3 MART effectiveness. The second dimension. it is the case of web directo- . For each feasible combination of these properties we formed a group by selecting the three most representative trees. though having the same annotation. For instance. indicates how much a tree is contextualized in a particular scope. for instance between Schema B and DBLP. Trees with high polysemy contain terms with very different meanings: For instance. polysemy. 5. such as a web directory. trees with low specificity can be used to describe heterogeneous concepts.Structural disambiguation 185 tween nodes: For instance match K is not completely correct since the node pages would have to be matched with two nodes of Schema B. This is because. such as title. in particular.4. Finally notice that. Group1 is characterized by a low specificity and a polysemy which increases along with the level of the tree. The first dimension. initPage and endPage. in order to obtain such matching results. We tested 3 groups of trees characterized by 2 dimensions of interest.e. rock and track whose meanings radically change in different contexts. we conducted “mixed” tests between little correlated schemas. trees with low polysemy contain mostly terms whose senses are characterized by subtle shades of meaning. more than three times smaller than the corresponding DBLP-SIGMOD match. the nodes had a completely different surrounding context. the nodes labelled title in Schema B (title of a song) and DBLP (title of articles and proceedings) were matched with a very low score. and not just with initPage. Among them. we tested several graph coefficient propagation formulas [98] and we found that the one delivering the most effective results is the inverse total average.2 Structural disambiguation In this section we present the results of the actual implementation of our STRIDER disambiguation approach. generally confirming the correct identification of at least 90% of the available matches. between the annotations and context of nodes. specificity. indicates how much the terms are ambiguous. whereas trees with high specificity are used to represent specialized fields such as data about movies and their features and staff. In this case. the matches’ scores were very low as we expected. i.

991 2.330 0.667 5.198 0.366 0.org) specifications.901 2.062 8 3. dublincore. Internet Movie Database (IMDb . Group2 is characterized by a high specificity and a high polysemy. the percentage of correct senses between all the possible senses and the average similarities among the senses of each given term in the tree (computed by using a variant of Eg. Low specificity and high polysemy are hardly compatible.imdb.522 11 2.31 2.1). The mean number of senses of Group2 and Group3 is almost double than that of Group1.000 41 21 15 25. a low polysemy at low levels and high polysemy at the leaf level.667 14 17 18 16. thus we expect their disambiguation to be harder. Group2 DBLP DCMI Sigmod Group3 16 23 15 18.327 0. The trees we selected for Group1 are a small portion of Google—’s and Yahoo ’s web directories and of eBay ’s catalog. www. Even though not big.484 2. The other features are instead important in order to understand the difficulty of the disambiguation task: For instance.6 2.106 8. From left to right: The number of terms.983 2.444 13 5. correct 0.org) and a possible On Line Music Shop (OLMS).286 17 8 29 6.159 0. 5.201 3.152 2. Finally. we chose structures extracted from XML documents of Shakespeare’s plays.1 shows the features of each tree involved in our experimental evaluation.828 Table 5.624 11.321 3.429 11 5 10 6.1: Features of the tested trees ries in which we usually find very different categories under the same root.854 10 6. therefore we will not consider this one as a feasible combination.047 18. higher is the number of senses of the involved terms more difficult will be their disambiguation.733 6 3.224 0. Notice that our trees are composed by 15-40 terms. 3.333 Perc.333 Approximate query answering in heterogeneous XML collections # senses mean max 3. Group3 is characterized by a high specificity and a low polysemy and contains representative XML schemas from the DBLP and SIGMOD Record scientific digital libraries and the Dublin Core Metadata Initiative (DCMI . This is confirmed by the percentage of correct .278 0. Table 5.298 2. their composition allows us to generate a significant variety of graph contexts.219 Sense simil.333 3.372 3.296 0. the mean and maximum number of terms’ senses.235 0.133 0.186 # terms eBay Google Yahoo Group1 IMDb OLMS Shakes.190 0.

r.000 P(3) 1. the polysemy dimension: the higher is the average of the sense similarity the lower is the polysemy and the different senses have a closer meaning.914 1. precision P is the mean of the number of terms correctly disambiguated divided by the number of terms in the trees of each group.944 0.816 0.t.960 Exp 0.000 1. Figure 5. the expanded context one (Exp) and the combined one (Comb).500 0.900 Precision level 0.617 0. Recall parameter is not considered because its computation is usually based on frequent repetitions of the same terms in different documents.15 shows the precision results for the disambiguation of the three groups.844 0.000 Graph 0.944 Group 2 Comb 0.832 0.694 0.r. Traditionally. Since we have at our disposal .900 1.15: Mean precision levels for the three groups senses between all the possible senses. confirming the initial hypothesis.944 Graph 0.614 0.t.811 0.963 Group 3 Comb 0.631 0.865 0.960 0.400 0. the different tree scenarios.700 0.800 0.892 P(2) 1.883 0.000 Group 1 Comb 0. Efficiency evaluation is not crucial for a disambiguation approach so it will not be deepened (in any case.300 Graph Exp 0. The last feature partially expresses how the trees are positioned w.000 Figure 5. and we are not interested in evaluating the wsd quality from a single term perspective.Structural disambiguation 187 1. Effectiveness evaluation In our experiments we evaluated the performances of our disambiguation algorithm mainly in terms of effectiveness. which can be considered an even more significant “ease factor” and is higher in Group1. Three contributions are presented: The graph context one (Graph). In order to produce a deeper analysis not only of the quality of the results but also of its possible motivations w. we considered the precision figure along with a number of newly introduced indicators.939 P(1) 0. This is true in particular for Group1 and Group3 trees. The disambiguation algorithm has first been tested on the entire collection of trees using the default graph context: all the terms in the tree.600 0. In general. wsd algorithms are evaluated in terms of precision and recall figures [68]. the disambiguation process for the analysed trees required at most few seconds).627 0.939 Exp 0.000 0.

in all the three cases the combination of the two contributions (Comb) produces better results than each of the contributions alone. by considering the results up to the first M ranks: For instance.000 1.000 1.17 show two illustrative comparisons between complete and selected contexts for Yahoo tree (Group1) and IMDb tree (Group2). descendant and/or sibling terms. P(1) will be the percentage of terms in which the correct senses are at the top position of the ranking.16 and 5. descendants and siblings. with “selected” contexts including only ancestor. meaning a good efficacy of this approach too. we compute precision P(M) at different levels of quality. further. and (0.7) β. respectively. notice that its contribution alone (Exp) is generally very near to the graph context one.500 0. while Group2 and Group3 show an opposite trend. particularly in the complex Group3 setting.867 1.533 0.000 Selected context Comb 1. The next step was to evaluate the different behaviors of the trees disambiguation by varying the composition of their terms’ context.867 1. Notice that.000 Exp 0. Precision results for Group3 are lower (nearly 70%). even in this difficult settings.867 1. respectively.000 0. This is due to the fact that Group1 concepts are very heterogeneous and including in the context only directly related terms reduces .3) weights.300 P(1) P(2) P(3) Graph 0. This is achieved by using optimal values for the α. in the first case.188 1. Figures 5.000 1.600 0. and discovered two main behaviors: Group1 trees respond well to a more careful context selection. We tested an extensive range of combinations for all the available trees. As to the effectiveness of the context expansion.000 1.000 Figure 5.700 0. γ (0. as obtained from a series of exploratory tests.000 1. the results are quite encouraging.400 0.000 Approximate query answering in heterogeneous XML collections Graph 1. Combination of graph context and expanded context contributions produces good P(1) precision levels of 90% and of over 83% for groups Group1 and Group2. particularly if we notice that P(2) is above 88%.867 1. but we have to consider the large number and higher similarity between the senses of the involved terms.900 Precision level 0.000 Complete context Comb 0.000 Exp 0. the combined precision P(1) raises from 86% to a perfect 100% for a selected setting involving only ancestors.800 0.533 0.16: Typical context selection behavior for Group1 (Yahoo tree) complete ranking results.000 1.

976 Selected context Comb 0. precision P(1) might be high thanks to the effectiveness of the approach but also for the possibly small number of senses of the involved terms (think of terms with just one sense). meaning that our approach correctly disambiguates also terms with very high number of senses.951 Exp 0.659 0.878 0.976 Complete context Comb 0. the result is the opposite: Notice the IMDb combined precision dropping from nearly 88% to 80% when only ancestors and descendant terms are kept (Figure 5. Further.17: Typical context selection behavior for Group2 (IMDb tree) the disambiguation “noise” produced by completely uncorrelated ones. At a first glance.600 0. Instead.2): The left part of the table shows delta values between rank positions. the top sense is that related to the health science as the process is wrongly influenced by terms like neurology and cardiology contained in the medicine category.500 0.r.700 0. Also notice the very high rank1 delta of some trees. For instance. we wanted to analyze the results more in depth and from different perspectives.756 0. Group1. For a given rank. second (rank2) and third (rank3) position.000 0.400 0.951 0. In order to deepen our analysis. Group2 and Group3 confirm their inherent complexity w.951 189 Figure 5. when the tree terms are specific and more contextualized.659 0.951 0.878 0. Precision figures are the fundamental way to evaluate a wsd approach. while the right part shows delta values between confidences. we computed additional “delta” parameters (see Table 5.t. in our case. Delta rank values express the mean difference between the position in the ranking of the correct sense and that of the last one. the harder the disambiguation task should be.951 0. in particular: . we indicate by a ‘-’ the situation where there are no correct senses with that rank. we computed them when the right senses appear in the first (rank1 in table).951 0.902 0. For instance. however. such as in the other two groups.951 Exp 0.800 0. when the complete Yahoo tree is used to disambiguate the term hygiene in the health category. we wanted to analyze the actual confidence values and.900 Precision level 0.805 0.17).Structural disambiguation 1.300 P(1) P(2) P(3) Graph 0.951 Graph 0. In general. where rank1 delta values are nearly double. the higher the “delta to last” rank value is.927 0. such as the Shakespeare one.

Up to now we have not considered the contribution of the terms/senses’ feedback to the overall effectiveness of the results.015 0.038 eBay Google Yahoo Group1 IMDb OLMS Shakes. We see that the “to the followings” values are sufficiently high (from 14% of Group3 to over 24% of Group1). the upper bars are evidence of good disambiguation confidence and reliability. how much the algorithm is confident in its choices (delta confidence to the followings.171 -0. as we expected. the graph shows that. with peaks of over 40%. in particular in the disam- . Further.5 9.667 2.667 3. Notice that the wsd choices performed on Group1. from top 0. are also the most “reliable” ones. Figure 5. In Table 5.190 Approximate query answering in heterogeneous XML collections Delta to last (rank) rank1 rank2 rank3 2. which gave the best results in terms of precision.733 2. how much the correct sense confidence is far from the chosen one (delta confidence from the top). Notice that for two terms no contributions are present: This is due to the fact that these terms have only one available sense and. In most cases. their disambiguation is not relevant.244 -0.333 2.161 -0.168 -0.171 -0.043 0.125 -0.154 0.2: Delta values of the selected senses How much the right senses’ confidences are far from the incorrect ones.e.273 4 5 5.145 -0.118 10. meaning very few and small mistakes. i.039 0.229 5.2 we showed aggregate delta values for each group. the bottom bars are not null (wrong disambiguation choices).4 6.035 0.014 5 Delta (conf ) to foll. Group2 DBLP DCMI Sigmod Group3 Table 5.142 -0. while the “from the top” ones are nearly null. where the horizontal axis represents the 21 terms of the On Line Music Shop tree.18 shows the double histogram composed by the delta to the followings (top part) and the delta from the top (bottom part) values.042 0.307 -0.14 -0.245 -0.039 0.122 3.5 4.129 1.058 5.375 4.2 5.003 0.184 0 0. however we also found interesting to investigate the visual trend of the delta confidences of the terms of a tree. and. first column of the right part of the table). but only in a very limited number of cases. thus.125 2. when the upper bars are not particularly high (low confidence).026 0.017 0. when the choice is not right.444 2.5 2 1.667 4.02 0.

4 Delta confidence 0. Moreover. Peer . this has a positive effect on the disambiguation of the near terms since the “noise” produced by the wrong senses is eliminated. we briefly describe how we plan to enhance the XML S3 MART system and some of its previously described features in order to support distributed Peer-to-Peer (P2P) systems and. our algorithm is able to correctly disambiguate it). The flexibility of our approach allows also to benefit from a completely automatic feedback. such as 2 or 3. where the results of a given run are refined by automatically disabling the contributions of all but the top X senses in the following runs. the results of the second run show a precision increment of almost 17%.1 -0.5 0. in particular. since the right sense is typically occupying the very top positions in the ranking. and similar results are generally obtainable on all the considered trees. Notice that. for instance for the speaker term.2 Terms Figure 5. For illustration. 5. and in many others. For instance. suggesting the correct meaning of the term volume in the DBLP tree as a book helps the algorithm in choosing the right meaning for number as a periodic publication. the feedback on the term merely confirms the top sense in the ranking (i. by choosing X = 2 in the SIGMOD tree. in this case.5 Future extensions towards Peer-to-Peer scenarios In this concluding section. nonetheless.1 0 -0.3 0. where the position of the right sense passes from second to first in the ranking.5. suggesting the correct meaning of the term line (part of character’s speech) in the Shakespeare tree produces better disambiguation results. We can generally choose a very low X value.e.5 Future extensions towards Peer-to-Peer scenarios 191 0.18: Confidence delta values for OLMS biguation of the most ambiguous terms in the tree.2 0.

it would be very inefficient to involve entities which contain unrelated data. In choosing the most convenient neighbors for query propagation. A further key issue in PDMS querying is the following: It is not always convenient for a peer to propagate a query towards all other peers. What would be needed is to exploit the mappings discovered by XML S3 MART in order to select. each user should be able to search for and exploit all the available contents. even when it resides on peers which are different from the queried one. P2P systems are more and more diffused on the Internet and are characterized by high flexibility. With the modified algorithms. the mapping computation needs to be modified and transformed in an incremental computation. requiring additional advanced techniques for efficient and effective approximate query answering. One interesting way to handle this problem should be devising and exploiting . In order to deal with the problem of effective search in a PDMS architecture. a technique which can be defined as routing by mapping. each peer can establish how each of its concepts can be approximated in its “neighbor” entities by means of a numerical score expressing their semantic closeness. since the requesting peer would be overloaded with a large number of insignificant results and the network traffic would be uselessly multiplied. this is a particularly challenging scenario. Indeed. all the information available in the subnetworks rooted by them should be taken in consideration. which of the neighbors are potentially able to solve it effectively. which can efficiently update the semantic mappings when new peers connect and disconnect from the network. To this end. In a PDMS architecture. In order to exploit XML S3 MART schema matching features. for each of the incoming queries. suitable semantic mappings containing all the correspondences between their different concepts need to be computed. PDMSs [129] represent a recent proposal trying to make a synthesis between the flexibility of P2P systems and the semantic expressiveness of the recent database and XML search techniques. Thus. the schema matching and query rewriting approach of the XML S3 MART system can be very useful: Since single peers are independent entities. it is necessary to deal with the data heterogeneity of the PDMS participants. they might adopt very different schemas to represent their data. in a P2P context it is not possible for each node to perform schema matching on every other possible entity in the network. many of the techniques presented in this chapter can be exploited. searching for particular information in large P2P networks is often quite a long and disappointing task for users. Indeed. rapid evolution and decentralization. However. queries could be propagated only to the peers having a satisfying mapping score for the involved concepts. however several additional issues need to be addressed. First of all.192 Approximate query answering in heterogeneous XML collections Data Management Systems (PDMS) settings. however. In this way.

eventually. to verify the effectiveness of its matching and semantic indexing features. Some very promising initial results about all the ideas and techniques described in this section can be found in [117]. Finally. in which each peer could store summary information on how its contents are semantically approximated in the whole subnetworks rooted on the neighboring peers.5 Future extensions towards Peer-to-Peer scenarios 193 ad-hoc data structures. another essential improvement to the XML S3 MART system would be to develop and immerse it in a complete PDMS simulation environment in which to verify its behavior in a distributed setting and.5. . which could be named Semantic Routing Indexes (SRIs).

.

56. standard XML query engines are not aware of the temporal semantics and thus it makes more difficult to map temporal XML queries into efficient “vanilla” queries and to apply query optimization and indexing techniques . researchers have tried to provide answers to this need by proposing models and languages for representing and querying the temporal aspect of XML data. The central issue of supporting temporal versioning. where it is complicated by the fact that timestamps are distributed throughout XML documents. The solution proposed in [56] relies on a stratum approach whose advantage is that they can exploit existing techniques in the underlying XML query engine. As data changes over time. the paper [56] has the merit of having been the first to raise the temporal slicing issue in the XML context. 99]. most temporal queries in any language. i. However. the possibility to deal with historical information is essential to many computer applications. is time-slicing the input data while retaining period timestamping. While a great deal of work has been done on temporal slicing in the database field [54]. such as accounting. such as query optimization and query evaluation. In the last years. the problem of supporting structural querying in XML databases is an appealing research topic for the database community. medical records and customer relationship management. A time-varying XML document records a version history and temporal slicing makes the different states of the document available to the application needs. Recent works on this topic include [44. as we have also seen in the last two chapters.Chapter 6 Multi-version management and personalized access to XML documents Nowadays XML is universally accepted as the standard for structural data representation and exchange and.e. banking. law. 58.

we propose additional techniques in order to support semantic versioning and personalized access to them. and we conclude by providing .3). In this chapter we propose new techniques for the effective and efficient management and querying of time varying XML documents and for their personalized access. ˆ in the second part (Section 6. we are witnessing a strong institutional push towards the implementation of eGovernment support services. another versioning dimension. thus allowing personalization facilities. a given norm may contain some articles which are only applicable to particular classes of citizens (e.gov.g. collections of norm texts and legal information presented to citizens are becoming popular on the internet and one of the main objectives of many reasearch activities and projects is the development of techniques supporting temporal querying but also. www. Further.196 Multi-version management and personalized access to XML documents particularly suited for temporal XML documents. Indeed. because some norms or some of their parts have or acquire a limited applicability.g.it). we provide a related work discussion on temporal representation. In particular: ˆ in the first part (Section 6.g. a citizen accessing the repository may be interested in finding a personalized version of the norm. Thus. the need for a native solution to the temporal slicing problem becomes apparent to be able to effectively and efficiently manage temporal versioning. aimed at a higher level of integration and involvement of the citizens in the Public Administration (PA) activities that concern them.1) we deal with the problem of managing and querying time-varying multi-version XML documents in a completely general scenario.italia. Here. Hence. In this context. the semantic one. www. as in [52]. whereas flexible and on-demand personalization services are lacking. querying and personalized access (Section 6.2) we focus on the eGovernment scenario and we present how the slicing technology described in Section 6.it) or predefined by human experts and hardwired in the repository structure (e. In particular. Finally. we propose a native solution to the temporal slicing problem. public employees).normeinrete. personalization is either absent (e.1 can be adapted and exploited in a complete normative system in order to provide efficient access to temporal XML norm texts repositories [89]. personalization plays an important role. One of the most interesting scenarios in which this is particularly essential is the eGovernment one. For example. addressing the question of how to construct an XML query processor supporting time-slicing [88]. In existing works. that is a version only containing articles which are applicable to his/her personal case.

we devise a flexible technology supporting temporal slicing (remaining parts of Section 6. addressing the question of how to construct an XML query processor supporting time-slicing. Finally. timestamps are defined on a single time dimension and the granularity is the year. and that. They include the introduction of novel algorithms and the exploitation of different access methods.1. where the temporal slicing problem is defined.1.1 Preliminaries A time-varying XML document records a version history. which consists of the information in each version. For simplicity’s sake. and we show how a timevarying XML document can be encoded in it. We begin by providing some background in Section 6. we can freely extend them to become temporally aware.6. The underlying idea is to propose the changes that a “conventional” XML pattern matching engine would need to be able to slice time-varying XML documents. it consists in computing simultaneously the portion of each state . The proposed solutions act at the different levels of the holistic twig join architectures with the aim of limiting main memory space requirements. Temporal slicing is essentially the snapshot of the time-varying XML document(s) at a given time point but. along with timestamps indicating the lifetime of that version [44].2).1 Temporal versioning and slicing support 197 extensive experimental evaluation of all the proposed techniques (Section 6. The left part of Figure 6.1 Temporal versioning and slicing support In this section we propose a native solution to the temporal slicing problem. Data nodes are identified by capital letters.1 shows the tree representation of our reference time-varying XML document taken from a legislative repository of norms. all relying on the holistic twig join approach [25]. 6.1. which adopts the inverted list technology proposed in [139] for XML databases and changes it in order to allow the storing of time-varying XML documents. 6. where the focus is on the structural aspects which are intrinsic also in temporal XML data. It consists in alternative solutions supporting temporal slicing on the above storing scheme. I/O and CPU costs. Then.1.2).4). in its broader meaning. The advantage of this solution is that we can benefit from the XML pattern matching techniques present in the literature. we propose a novel temporal indexing scheme (first subsection of Section 6.1. which is one of the most popular approaches for XML pattern matching. at the same time.

now] [1996. The timestamp is a temporal element chosen from one or more temporal dimensions and records the life- . now] + contents B [1991. now] contents B article G article [1970. as well as single periods. now] contents article section C [1970. On the contrary. 2003] [2004. A time-varying XML database is a collection of XML documents. it is able to deal with time-varying XML documents containing timestamps defined on an arbitrary number of temporal dimensions and represented as temporal elements [54]. it is often required to combine the results back into a period-stamped representation [56] in the period [1994. of time-varying XML document(s) which is contained in a given period and which matches with a given XML query twig pattern. now] contents B article E D E G [2004. 2003] section F [2004.1: Reference example. preceding-following). parent-child. 2003] article Figure 6. 1998] U [2001. now] and for the query twig contents//article. The right part of Figure 6. In the following. i. also containing time-varying documents. We denote with DT a time-varying XML document represented as an ordered labelled tree containing timestamped elements and attributes (in the following denoted as nodes) related by some structural relationships (ancestordescendant. disjoint union of periods. our proposal is not bound to a specific temporal XML model. Moreover. Document representation A temporal XML model is required when there is the need of managing temporal information in XML documents and the adopted solution usually depends on the peculiarities of the application one wants to support.1 shows the output of a temporal slicing example. we will refer to time-varying XML documents by adopting part of the notation introduced in [44].198 Multi-version management and personalized access to XML documents Time varying XML document (fragment) law A Time-slicing example [1994. This section introduces a notation for time-varying XML documents and a formal definition for the temporal slicing problem.e. For the sake of generality. 1998] U [2001. 1990] article [1995. 1994] U [1996.

Timestamps are not represented in the snapshot. otherwise. by specifying a pattern of selection predicates on multiple elements having some specified tree structured relationships. . The t-window parameter is the temporal window on which the time-slice operator has to be applied.g. . The time-slice operator The time-slice operator is applied to a time-varying XML database and is defined as time-slice(twig. by default temporal slicing is applied to the whole time-lines. . . . (iii) (n1 . nk ) = lif etime(n1 ) ∩ . we follow the semantics given in [45]: If no temporal semantics is provided.Preliminaries 199 time of a node. i. a temporal window t-window and a timevarying XML database TXMLdb. i. . . the parent-child and ancestordescendant relationships between query nodes are satisfied by the correspond[T ] [T ] ing document nodes. nk ) of the database nodes that identify a distinct match of twig in TXMLdb. nk ) is temporally consistent. . . .e. i. We will use the notation nT to signify that node n has been timestamped and lif etime(nT ) to denote its lifetime. . . ∩ lif etime(nk ) is not . The snapshot operator is defined as snp(t. With t-window. Given a twig pattern twig. . . e. it is possible to restrict the set of time points by specifying a collection of periods chosen from one or more temporal dimensions. which can be either temporal or snapshot. The snapshot operator is an auxiliary operation which extracts a complete snapshot or state of a time-varying document at a given instant and which is particularly useful in our context. nk ) is structurally consistent. . t∞ ). It defines the portion of interest in each state of the documents contained in the database. such that: (i) query node predicates are satisfied by the [T ] [T ] corresponding document nodes thus determining the tuple (n1 . A snapshot at time t replaces each timestamped node nT with its non-timestamped copy x if t is in lif etime(nT ) or with the empty string. XQuery. It can also be the whole document. DT ) = D where D is the snapshot at time t of DT . for each newly added temporal dimension we set the value on this dimension to the whole time-line. that is by using every single time point contained in the time-varying documents. (ii) [T ] [T ] (n1 . a slice is a mapping from nodes in twig to nodes in TXMLdb. More precisely. In this case.e. . . Sometimes it can be necessary to extend the lifetime of a node n[T ] . Not all nodes are necessarily timestamped. The twig parameter is a nontemporal node-labeled twig pattern which is defined on the snapshot schema [44] of the database through any XML query languages.t-window). its life[T ] [T ] [T ] [T ] time lif etime(n1 . [t0 .e. to a temporal dimension not specified in its timestamp. .

D) is structurally but not temporally consistent as lif etime(B) ∩ lif etime(D) = ∅. . where t ∈ lif espan(n1 . . for each distinct slice [T ] [T ] (n1 . lif etime(n1 . LeftPos:RightPos. . [139]) shows that capturing the XML document structure using traditional indices is a good solution. LeftPos. the position of an element occurrence as a tuple (DocId. . . . The temporal indexing scheme The indexing scheme described in [139] is an extension of the classic inverted index data structure in information retrieval which maps elements and strings to inverted lists. nk ). Being timestamps distributed throughout the structure of XML documents. nk )) with its [T ] [T ] pertinence lif etime(n1 . . .LevelNum) where (a) DocId is the identifier of the document. In particular.2 Providing a native support for temporal slicing Existing work on “conventional” XML query processing (see. on which it is possible to devise efficient structural or containment join algorithms for twig pattern matching. . . . For instance. we consider the temporal slicing problem: Given a twig pattern twig. analogously. 6. nk ). . a temporal window t-window and a time-varying XML database TXMLdb. and (c) LevelNum is the depth of the node in the document. (b) LeftPos and RightPos can be generated by counting word numbers from the beginning of the document DocId until the start and end of the element. respectively. . .LevelNum) and. the tuple (B. The position of a string occurrence in the XML database is represented in each inverted list as a tuple (DocId. . N2 ) is a descendent of the tree node n1 encoded as . nk ) ⊆ t-window.t-window) computes the snap[T ] [T ] [T ] [T ] shot snp(t. (n1 . time-slice(twig. . In this chapter.200 Multi-version management and personalized access to XML documents [T ] [T ] empty and it is contained in the temporal window. nk )). L2 : R2 . . . . it is possible to provide a period-timestamped representation of [T ] [T ] the results by associating each distinct state snp(t. . we decided to start from one of the most popular approaches for XML query processing whose efficiency in solving structural constraints is proved. structural relationships between tree nodes can be easily determined: (i) ancestor-descendant: A tree node n2 encoded as (D2 . .1. (n1 . in the reference example. for example. In this context. . Obviously. nk ) in t-window. . our solution for temporal slicing support consists in an extension to the indexing scheme described in [139] such that time-varying XML databases can be implemented and in alternative changes to the holistic twig join technology [25] in order to efficiently support the time-slice operator in different scenarios. .

is extended to the temporal dimension by setting the pertinence of the corresponding tuple to [1970.. whose lifetime is a temporal element containing a number of periods. 1 | 1970:now) (1. is encoded through as many tuples having the same projection on the non-temporal attributes (DocId. (ii) parent-child : n2 is a child of n1 iff it is a descendant of n1 and L2 = L1 + 1. It consists of a sequence of From:To temporal attributes. and R2 < R1 . Our proposal adds time to the interval-based indexing scheme by substituting the inverted indices in [139] with temporal inverted indices. All the temporal inverted indices are defined on the same temporal dimensions such that tuples coming from different inverted indices are always comparable from a temporal point of view. the resulting lifetime is extended to the temporal dimensions on which it has not been defined by following the approach described in Subsection 6. and represents a period. now].Fromh :Toh . N1 ) iff D1 = D2 .Providing a native support for temporal slicing law contents section (1. (1.2: The temporal inverted indices for the reference example (D1 . when the time-slice operation is applied. 10:11.LevelNum|TempPer) contains an implicit temporal attribute [54]. 6:7. whose label is law. timestamped nodes have a specific semantics which should be exploited when documents are accessed and. our temporal inverted indices are in 1NF and each timestamped node nT . 6:7. (1. 4 | 1970:1990). each time-varying XML document to be inserted in the database undergoes a pre-processing phase where (i) the lifetime of each node is derived from the timestamps associated with it. TempPer.. 4 | 1995:1998). 9:12. 1:14. one for each involved temporal dimension. L1 < L2 . On the other hand. in particular. .. (1. 2:13. they can be indexed using the interval-based scheme described above and thus by indexing timestamps as “standard” tuples. 4 | 2004:now) 201 article Figure 6. Figure 6. 3:8. TempPer is From1 :To1 . As temporal XML documents are XML documents containing time-varying data. 2:13.LevelNum) but with different TempPer values. given the number h of the different temporal dimensions represented in the time-varying XML database. (ii) in case. Thus. In this context. 2 | 1996:now) (1. (1.1. 3 | 2004:now) (1. 4:5. Notice that the snapshot node A.. the tuple (DocId. 4 | 2001:2003). (1. each representing a period. LeftPos:RightPos. Therefore. In each temporal inverted index. 3 | 1970:2003).2 illustrates the structure of the four indices for the reference example.1. L1 : R1 . besides the position of an element occurrence in the time-varying XML database. LeftPos:RightPos. 2 | 1991:1994).

In particular.. Sq2 n q1 Sq n n q2 . they remove partial answers form the stacks that cannot be extended to total answers and push the node nq into the stack ¯ Sq . The algorithms presented in [25] have been .. Sqn (level L2 in Figure). let it be nq . n qn Iq1 I q2 Iqn Figure 6. . {ptr} . the two stack-based algorithms presented in [25].. Similarly to the tree signature twig matching algorithms we described in Chapter 4. one for path matching and the other for twig matching.. thanks to a deletion policy the set of stacks contains data nodes which are guaranteed to lie on a root-to-leaf path in the XML database and thus represents in linear space a compact encoding of partial and total answers to the query twig pattern. Bq2 ID .. given a path involving the nodes q1 . the set of stacks contains an encoding of total answers and the algorithms output these answers.. . . Then. During the computation. The skeleton of the two holistic twig join algorithms (HTJ algorithms in the following) is presented in Figure 6. ... ¯ given knowledge of such node.3. ... Bq n .. {ptr} . which are then composed to obtain matches for the twig pattern (level SOL in Figure).. {ptr} . . for each query node q.. the algorithms choose the node with the smaller value. . ..202 Multi-version management and personalized access to XML documents A technology for the time-slice operator Level SOL Level L2 .. the approach maintains in main-memory a chain of linked stacks to compactly represent partial results to root-to-leaf query paths. q1 q2 . qn Sq 1 Level L1 Level L0 Bq 1 ID . At each iteration the algorithms identify the next node to be processed. at level L1 is the node in the inverted index Iq with the smaller LeftPos value and not yet processed... ID . . Iqn (level L0 in Figure) and build solutions from the stacks Sq1 ..3: The basic holistic twig join four level architecture The basic four level architecture of the holistic twig join approach is depicted in Figure 6. work on the inverted indices Iq1 . . Whenever a node associated with a leaf node of the query path is pushed ¯ on a stack.4. . Among those. . To this end... qn .

a temporal window t-window.4: Skeleton of the holistic twig join algorithms (HTJ algorithms) further improved in [30. In this way. . (D. Given a twig pattern twig. Indeed. . L1 : R1 . XML documents usually contain millions of nodes and this is absolutely true in the temporal context where documents record the history of the applied changes. Thus the holistic twig join algorithms continue to work as they are responsible for the structural consistency of the slices and provide the best management of the stacks from this point of view. 72]. Nk |Tk )) it is necessary to intersect the periods represented by the values T1 . . Lk : Rk . . a slice is the snapshot of any answer to twig which is temporally consistent. for each potential slice ((D. instead. we refer interested readers to the above cited papers. the snapshot operation is simply a projection of the temporally consistent answers on the non-temporal attributes. Thus. . we have described the “first step” towards the realization of a temporal XML query processor. must be checked on each answer output of the overall process. Tk and then check both that such intersection is not empty and that it is contained in the temporal window. On the other hand.Providing a native support for temporal slicing While there are nodes to be processed (1) Choose the next node nq ¯ (2) Apply the deletion policy (3) Push the node nq into the pertinence stack Sq ¯ ¯ (4) Output solutions 203 Figure 6. . This situation implies useless computations due to an uncontrolled growth of the the number of tuples put on the stacks. N1 |T1 ). Finally. . We devised alternative solutions which rely on the two different aspects of temporal consistency and act at the different . the performances of this first solution are strictly related to the peculiarities of the underlying database. Temporal consistency. The time-slice operator can be implemented by applying minimal changes to the holistic twig join architecture. As our solutions do not modify the core of such algorithms. the holistic twig join algorithms can produce a lot of answers which are structurally consistent but which are eventually discarded as they are not temporally consistent. The time-varying XML database is recorded in the temporal inverted indices which substitute the “conventional” inverted index at the lower level of the architecture and thus the nodes in the stacks will be represented both by the position and the temporal attributes. In particular. . Temporal consistency considers two aspects: The intersection of the involved lifetimes must be non-empty (non-empty intersection constraint in the following) and it must be contained in the temporal window (containment constraint in the following).

204

Multi-version management and personalized access to XML documents

levels of the architecture with the aim of limiting the number of temporally useless nodes the algorithms put in the stacks. The reference architecture is slightly different from the one presented in Figure 6.3. Indeed, in our context, any timestamped node whose lifetime is a temporal element is encoded into more tuples (e.g. see the encoding of the timestamped node E in the reference example). Thus, at level L1, each node nq must be interpreted as the set of tuples encoding nq . They are stored in buffer Bq and step 3 of the HTJ algorithms empties Bq and pushes the tuples in the stack Sq . Non-empty intersection constraint Not all temporal tuples which enter level L1 will at the end belong to the set of slices. In particular, some of them will be discarded due to the nonempty intersection constraint. The following Lemma characterizes this aspect. Without lose of generality, it only considers paths as the twig matching algorithm relies on the path matching one. Proposition 6.1 Let (D, L : R, N |T ) be a tuple belonging to the temporal inverted index Iq , Iq1 , . . . , Iqk the inverted indices of the ancestors of q and T Pqi = σLeftPos<L (Iqi )|TempPer , for i ∈ [1, k], the union of the temporal pertinences of all the tuples in Iqi having LeftPos smaller than L. Then (D, L : R, N |T ) will belong to no slice if the intersection of its temporal pertinence with T Pq1 , . . . , T Pqk is empty, i.e. T ∩ T Pq1 ∩ . . . ∩ T Pqk = ∅. Notice that, at each step of the process, the tuples having LeftPos smaller than L can be in the stacks, in the buffers or still have to be read from the inverted indices. However, looking for such tuples in the three levels of the architecture would be quite computationally expensive. Thus, in the following we introduce a new approach for buffer loading which allows us to look only at the stack level. Moreover, we avoid accessing the temporal pertinence of the tuples contained in the stacks by associating a temporal pertinence to each stack (temporal stack ). Such a temporal pertinence must therefore be updated at each push and pop operation. At each step of the process, for efficiency purposes both in the update and in the intersection phase, such a temporal pertinence is the smaller multidimensional period Pq containing the union of the temporal pertinence of the tuples in the stack Sq . The aim of our buffer loading approach is to avoid loading the temporal tuples encoding a node n[T ] in the pertinence buffer Bq if the inverted indices associated with the parents of q contain tuples with LeftPos smaller than that of nq and not yet processed. Such an approach is consistent with step 1 of the HTJ algorithms as it chooses the node at level L1 with the smaller LeftPos value and ensures that when n[T ] enters Bq all the tuples involved

Providing a native support for temporal slicing
− Input: Twig pattern twig, the last processed node n← q Output: Next node nq to be processed ¯ Algorithm Load:

205

(1) if all buffers are empty (2) start=root(twig); (3) else − (4) start=←; q (5) for each query node q from start to leaf(twig) (6) get nq ; (7) minq is the minimum between nq .LeftPos and minparent(q) ; (8) if nq .LeftPos is equal to minq (9) load nq into Bq ; (10)return the last node inserted into the buffers Figure 6.5: The buffer loading algorithm Load in Prop. 6.1 are in the stacks. The algorithm implementing step 1 of the HTJ algorithms is shown in Figure 6.5. We associate each buffer Bq with the minimum minq among the LeftPos values of the tuples contained in the buffer itself and those of its ancestors. Assuming that all buffers are empty, the algorithm starts from the root of the twig (step 2) and, for each node q up to the leaf, it updates the minimum minq and inserts nq , the node in Iq with the smaller LeftPos value and not yet processed, if it is smaller than minq . The same applies when some buffers are not empty. In this case, it starts from the query node matching with the previously processed data node and it can be easily shown that the buffers of the ancestors of such node are not empty whereas the buffers of the subpath rooted by such node are all empty. Lemma 6.1 Assume that step 1 of the HTJ algorithms depicted in Figure 6.4 is implemented by the algorithm Load. The tuple (D, L : R, N |T ) in Bq will belong to no slice if the intersection of its temporal pertinence T with the multidimensional period Pq1 →qk = Pq1 ∩ . . . ∩ Pqk intersecting the periods of the stacks of the ancestors q1 , . . . , qk of q is empty. For instance, at the first iteration of the HTJ algorithms applied to the reference example, step 1 and step 3 produce the situation depicted in Figure 6.6. Notice that when the tuple (1, 4 : 5, 4|1970 : 1990) encoding node D (label article) enters level L1 all the tuples with LeftPos smaller than 4 are already at level L2 and due to the above Lemma we can state that it will belong to no slice.

206

Multi-version management and personalized access to XML documents

Level L2

(1, 2:13, 2 | 1996:now) (1, 2:13, 2 | 1991:1994) [1991, now] [ ]

STEP 3

Level L1

(1, 2:13, 2 | 1996:now) (1, 2:13, 2 | 1991:1994) mincontents = 2 contents minarticle = 2 article

STEP 1

Figure 6.6: State of levels L1 and L2 during the first iteration Thus, the non-empty intersection constraint can be exploited to prevent the insertion of useless nodes into the stacks by acting at level L1 and L2 of the architecture. At level L2 we act at step 3 of the HTJ algorithms by simply avoiding pushing into the stack Sq each temporal tuple (D, L : R, N |T ) encoding the next node to be processed which satisfies Lemma 6.1, i.e. such that T ∩ Pq1 →qk = ∅. At level L1, instead, we act at step 9 of the algorithm Load by avoiding loading in any buffer Bq each temporal tuple encoding nq which satisfies Lemma 6.1. More precisely, given the LeftPos value of the last processed node, say CurLef tP os, we only load each tuple (D, L : R, N |T ) such that L is the minimum value greater than CurLef tP os and T intersects Pq1 →qk . To this purpose, our solution uses time-key indices combining the LeftPos attribute with the attributes Fromj :Toj in the TempPer implicit attribute representing one temporal dimension in order to improve the performances of range-interval selection queries on the temporal inverted indices. In particular, we considered two access methods: The B+-tree and a temporal index, the Multiversion B-tree (MVBT) [13]. An one-dimensional index like the B+-tree, clusters data primarily on a single attribute. Thus, we built B+-trees that cluster first on the LeftPos attribute and than on the interval end time Toj . In this way, we can take advantage of sequential I/O as tree leaf pages are linked and records in them are ordered. In particular, we start with the first leaf page that contains a LeftPos value greater than CurLef tP os and a Toj value greater than or equal to Pq1 →qk |Fromj , i.e. the projection of the period Pq1 →qk on the interval start time Fromj . Then we proceed by loading the records until the leaf page with the next LeftPos value or with a Fromj value greater than Pq1 →qk |Toj is met. This has the effect of selecting each tuple (D, L : R, N |T ) where L is the smaller value greater than CurLef tP os and its period T |Fromj :Toj intersect the period Pq1 →qk |Fromj :Toj , as T |Toj ≥ Pq1 →qk |Fromj and T |Fromj ≤ Pq1 →qk |Toj .

Providing a native support for temporal slicing

207

The alternative approach we considered is to maintain multiple versions of a standard B+-tree through an MVBT. An MVBT index record contains a key, a time interval and a pointer to a page and, thus, this structure is able to directly support our range-interval selection requirements. Containment constraint The following proposition is the equivalent of Prop. 6.1 when the containment constraint is considered. Proposition 6.2 Let (D, L : R, N |T ) be a tuple belonging to the temporal inverted index Iq . Then (D, L : R, N |T ) will belong to no slice if the intersection of its temporal pertinence with the temporal window t-window is empty. It allows us to act at level L1 and L2, but also between level L0 and level L1. At level L1 and L2 the approach is the same as the non-empty intersection constraint; it is sufficient to use the temporal window t-window, and thus Prop. 6.2, instead of Lemma 6.1. Moreover, it is also possible to add an intermediate level between level L0 and level L1 of the architecture, which we call “under L1” (UL1), where the only tuples satisfying Prop. 6.2 are selected from each temporal inverted index, are ordered on the basis of their (DocId,LeftPos) values and then pushed into the buffers. Similarly to the approach explained in the previous section, to speed up the selection, we exploit B+ -tree indices built on one temporal dimension. Notice that this solution deals with buffers as streams of tuples and thus it provides interesting efficiency improvements only when the temporal window is quite selective. Combining solutions The non-empty intersection constraint and the containment constraint are orthogonal thus, in principle, the solutions presented in the above subsections can be freely combined in order to decrease the number of useless tuples we put in the stacks. Each combination gives rise to a different scenario denoted as “X/Y”, where “X” and “Y” are the employed solutions for the non-empty intersection constraint and for the containment constraint, respectively (e.g. scenario L1/L2 employs solution L1 for the non-empty intersection constraint and solution L2 for the containment constraint). Some of these scenarios will be discussed in the following. First, scenario L1/UL1 is not applicable since in solution UL1 selected data is kept and read directly from buffers, with no chance of additional indexing. Instead, in scenario L1/L1 the management of the two constraints can be easily combined by querying the indices with the

208

Multi-version management and personalized access to XML documents

intersection of the temporal pertinence of the ancestors (Proposition 6.1) and the required temporal window. All other combinations are straightforwardly achievable, but not necessarily advisable. In particular, when L1 is involved for any of the two constraints the L1 indices have to be built and queried: Therefore, it is best to combine the management of the two constraints as in L1/L1 discussed above. Finally, notice that the baseline scenario is the SOL/SOL one, involving none of the solutions discussed in this chapter.

6.2

Semantic versioning and personalization support

In this section, we show how the slicing technology described in Section 6.1 can be adapted and exploited in a real application scenario. In particular, in the context of the research activity entitled “Semantic web techniques for the management of digital identity and the access to norms”, which we have carryied out as part of the PRIN national project “European Citizen in eGovernance: legal-philosophical, legal, computer science and economical aspects” [52], we focus on eGovernment and on the development of a complete normative querying system providing efficient access to temporal XML norm texts repositories. Further, we add support for the semantic versioning dimension, allowing a fully personalized access to the required documents. Indeed, as we have seen in the introduction of the chapter, in eGovernment personalization plays a very important role, because some norms or some of their parts have or acquire a limited applicability. For example, a given norm may contain some articles which are only applicable to particular classes of citizens (e.g. public employees). Hence, a citizen accessing the repository may be interested in finding a personalized version of the norm, that is a version only containing articles which are applicable to his/her personal case. The following subsections are organized as follows: Section 6.2.1 describes the complete eGovernment infrastructure, while Section 6.2.2 investigates the aspects of personalized access to multi-version documents, where the versioning is both temporal and semantic.

6.2.1

The complete infrastructure

In order to enhance the participation of the citizens to an eGovernance procedure of interest, their automatic and accurate positioning within the reference legal framework is needed. To solve this problem we employ Semantic Web techniques and introduce a civic ontology, which corresponds to a classification of citizens based on the distinctions introduced by subsequent norms

The complete infrastructure

209

WEB SERVICES OF PUBLIC ADMINISTRATION

PUBLIC ADMINISTRATION DB XML REPOSITORY OF ANNOTATED NORMS

1 IDENTIFICATION
SIMPLE ELABORATION UNIT WEB SERVICES WITH ONTOLOGY

2 CLASSIFICATION
CLASS

CX

OC

CREATION /UPDATE

3 QUERYING

RESULTS

Figure 6.7: The Complete Infrastructure which imply some limitation (total or partial) in their applicability. In the following, we refer to such norms as founding acts. Moreover, we define the citizen’s digital identity as the total amount of information concerning him/her –necessary for the sake of classification with respect to the ontology– which is available online [116]. Such information must be retrievable in an automatic, secure and reliable way from the PA databases through suitable Web services (identification services). For instance, in order to see whether a citizen is married, a simple query concerning his/her marital status can be issued to registry databases. In this way, the classification of the citizen accessing the repository makes it possible to produce the most appropriate version of all and only norms which are applicable to his/her case. Hence, the resulting complete infrastructure is composed by various components that have to communicate between each other to collect partial and final results (see Figure 6.7). Firstly, in order to obtain a personalized access, a secure authentication is required for a citizen accessing the infrastructure. This is performed through a simple elaboration unit, also acting as user interface, which processes the citizen’s requests and manages the results. Then, we can identify the following phases:
ˆ the identification phase (step 1 in Figure 6.7) consists of calls to identification services to reconstruct the digital identity of the authenticated user on-the-fly. In this phase the system collects pieces of information from all the involved PA web services and composes the identity of the citizen. ˆ the citizen classification phase (step 2 in Figure 6.7) in which the classification service uses the collected digital identity to classify the citizen with respect to the civic ontology (OC in Figure 6.7), by means of an embedded reasoning service. In Figure 6.7, the most specific class

In this way. a Description Logic [6]). For example.g. classification and creation/update services. We then extend the temporal framework with semantic versioning in order to provide personalized access to norm texts.210 Multi-version management and personalized access to XML documents CX has been assigned to the citizen. Furthermore. the matching between the civic ontology classes and the citizen’s digital identity can be reduced to a standard reasoning task (e. Notice that this process (and also the introduction of semantic annotations into the multi-version XML documents) is a delicate task which needs advice by human experts and “official validation” of the outcomes and. during the classification procedure. in the querying phase (step 3 in Figure 6. by accessing and reconstructing the appropriate version of all and only norms which are applicable to the class CX .2. the civic ontology used in step 2 and 3 requires to be created and constantly maintained: each time a new founding act is enforced.g.g. In fact. ˆ finally. For the specification of the identification. In order to supply the desired services. . the version of norms which must be applied to the case is the one that was in force then. the execution of a creation/update phase is needed. if a Court has to pass judgment today on some fact committed in the past. The querying phase will be deeply analyzed in the next Section. whereas it is crucial to reconstruct the current (consolidated) version of a norm as it is the one that currently belongs to the regulations and must be enforced today. based on XML/SOAP [123]).7) the citizen’s query is executed on the multi-version XML repository. computer tools and graphic environments (e. 6. it can only partially be automated. The study of the services and of the mechanisms necessary to their semi-automatic specification will be dealt with in future research work.2 Personalized access to versions Temporal concerns are widespread in the eGovernment domain and a legal information system should be able to retrieve or reconstruct on demand any version of a given document to meet common application requirements. we plan to adopt a standard declarative formalism (e. Temporal versioning aspects are examined in the next subsection. thus. ontology entailment for the underlying Description Logic [67]).g. based on the Prot´g´ platform [112]) could be provided to assist the human experts e e to perform this task. the digital identity is modelled and represented within the system in a form such that it can be translated into the same language used for the ontology (e. However. not only for historical reasons. also past versions are still important.

As a global and unchangeable norm property. where the modified norm becomes applicable to a limited category of citizens only (e. it is crucial to maintain the mapping between each portion of a norm and the maximal class of citizens it applies to in order to support an effective personalization service. While such cases do exist. the norm continues its efficacy even if no longer in force. whereas the rest of the citizens remain subject to the unmodified norm. it is not used as a versioning dimension. Our work on these aspects is based on our previous research experiences [58. due to the limited applicability that norms or some of their parts have or acquire. It is the time of publication of the norm on the Official Journal. The data model was defined via an XML Schema. The considered dimensions are: Validity time.g. It has the same semantics as event time in temporal databases [75]. a new norm might state a modification to a preexisting norm.1. Transaction time. Publication time. where the structure of norms is defined by means of a contents-section-article-paragraph hierarchy and multiple content versions can be defined at each level of the hierarchy.Personalized access to versions 211 Semantic versioning also plays an important role. It is the time the norm can be applied to concrete cases. since it represents the time the norm actually belongs to the regulations in the real world. Hence. we developed a temporal XML data model which uses four time dimensions to correctly represent the evolution of norms in time and their resulting versioning. It is the time the norm is in force. It is the time the norm is stored in a computer system. It has the same semantics of valid time as in temporal databases [71]. To this purpose. Each version is characterized by timestamp attributes defining its temporal . 59. Efficacy time. Finally. It also has a semantics of valid time although it is independent from validity time. It has the same semantics of transaction time as in temporal databases [71]. Temporal versioning We first focused on the temporal aspects and on the effective and efficient management of time-varying norm texts. 60] and on the work discussed in Section 6. For instance. notice that temporal and limited applicability aspects though orthogonal may also interplay in the production and management of versions. retired persons).

the semantic versioning dimension encodes information about the applicability of different parts of a norm text to the relevant classes of the civic ontology defined in the infrastructure (OC in Figure 6. efficacy and transaction time dimensions. the left part of Figure 6. that is based on a taxonomy induced by ISA relationships.8: An example of civic ontology. that is retrieval of all and only norm provisions that are applicable to a given citizen according to his/her digital identity. Legal text repositories are usually managed by traditional information retrieval systems where users are allowed to access their contents by means of keyword-based queries expressing the subjects they are interested in. We extended such a framework by offering users the possibility of expressing temporal specifications for the reconstruction of a consistent version of the retrieved normative acts (consolidated act).1) Unemployed (3.4) Subordinate (7.2) Public (6. .212 The sample ontology (1. The tree-like civic ontology is sufficient to satisfy basic application requirements as to applicability constraints and personalization services.7).8 depicts a simple civic ontology built from a small corpus of norms ruling the status of citizens with respect to their work position.7) (2.6) Employee Retired (4. pertinence with respect to each of the validity. Semantic information is mapped onto a tree-like civic ontology. For instance. where each class has a name and is associated to a (pre.8) Multi-version management and personalized access to XML documents A fragment of an XML document supporting personalized access <article num="1"> <ver num="1"> <aa applies_to="3"/> [… Temporal attributes … ] <paragraph num="1"> <ver num="1"> [ … Text … ] <aa applies_to="4"/> [… Temporal attributes … ] </ver> </paragraph> <paragraph num="2"> <ver num="1"> [ … Text … ] <aa applies_also="8"/> [… Temporal attributes … ] </ver> </paragraph> </ver> </article> Citizen (8. Semantic versioning The temporal multi-version model described above has then been enhanced to include a semantic versioning mechanism to provide personalized access. though more advanced application requirements may need a more sophisticated ontology definition. The right part shows a fragment of a multi-version XML norm text supporting personalized access with respect to this ontology. Hence.post) pair.3) Private Figure 6.5) Self-employed (5. and a fragment of a XML norm containing applicability annotations.

The representation of extensions and restrictions gives rise to high expressiveness and flexibility in such a context.e. structural. the class “Employee” has pre-order “3”. Let us focus first on the applicability constraint. Consider again the ontology and norm fragment in Figure 6. but only the second paragraph will be actually presented as applicable. ˆ which are applicable to his personal case (applicability constraint). whereas the second one is applicable to class “8” (tag applies also).. which is a restriction. textual and applicability. which is an extension. which is also its identifier. Such constraints are completely orthogonal and allow users to perform very specific searches in the XML norm repository. in the form: (pre-order. by means of redefinitions we can also introduce. Hence.post-order).Personalized access to versions 213 As we currently manage tree-like ontologies. ˆ which contain paragraphs (structural constraint) dealing with health care (textual constraint).. For example. for each part of a document.. For instance. The article in the XML fragment on the right-hand-side of Figure 6..8 and let John Smith be a “self-employed” citizen (i.. complex applicability properties including extensions or restrictions with respect to ancestors. John Smith could be interested in all the norms . the applicability constraint can be combined with the other three ones in order to fully support a multi-dimensional retrieval. belonging to class “7”) retrieving the norm: hence. . its first paragraph is applicable to class “4”. The reconstruction of pertinent versions of the norm based on its applicability annotations is very important in an eGovernment scenario. Furthermore. For instance. whereas its post order is “6”. Notice that applicability is inherited by descendant nodes unless locally redefined. this allows us to exploit the preorder and post-order properties of trees in order to enumerate the nodes and check ancestor-descendant relationships between the classes. the sample article in the Figure will be selected as pertinent. . Accessing the right version for personalization The queries that can be submitted to the system can contain four types of constraints: temporal.. the whole article in the Figure is applicable to civic class “3” (tag applies to) and by default to all its descendants.8 is composed of two paragraphs and contains applicability annotations (tag aa). These codes are represented in the upper left part of the ontology classes in the Figure. ˆ which were valid and in effect between 2002 and 2004 (temporal constraint). However. .

the authors study the problem of consistently deriving a scheme for managing the temporal counterpart of non-temporal XML documents. the main objective of the work presented in [58] has been the development of a computer system for the temporal management of . Finally.1 Related work Temporal XML representation and querying In the last years. efficacy and transaction). validity. ’health AND care’) tempConstr (’vTime OVERLAPS PERIOD(’2002-01-01’. all consistently reconstructed by the system on the basis of the user’s requirements and personalized on the basis of his/her identity. The paper [56] presents a temporal XML query language. τ XQuery. with which the authors add temporal support to XQuery by extending its syntax and semantics to three kinds of temporal queries: Current.’2004-12-31’)’) tempConstr (’eTime OVERLAPS PERIOD(’2002-01-01’. [58] where the history of changes XML data undergo is represented into a single document from which versions can be extracted when needed. by means of validity and efficacy time constraints. temporal and applicability constraints. In [44]. sequenced. Notice that the temporal constraints can involve all the four available time dimensions (publication. Similarly. starting from the definition of their schema. [44]. 6. Mendelzon et al. and applConstr are suitable functions allowing the specification of the textual. allowing high flexibility in satisfying the information needs of users in the eGovernment scenario. tempConstr. Gao and Snodgrass [56]. In particular.’2004-12-31’)’) applConstr (’class 7’) $a where textConstr. respectively (the structural constraint is implicit in the XPath expressions used in the XQuery statement). or to access past versions of particular norm texts. and representational.3.214 Multi-version management and personalized access to XML documents Such a query can be issued to our system using the standard XQuery FLWR syntax as follows: FOR WHERE AND AND AND RETURN $a IN norm textConstr ($a//paragraph//text(). there has been a growing interest in representing and querying the temporal aspect of XML data. [99]. Recent papers on this topic include those of Currim et al. a user is able to extract consolidated current versions from the multi-version repository.3 6. and Grandi et al. the TXPath query language described in [99] extends XPath for supporting temporal queries.

article) which are temporally consistent. While twig querying is not directly handled in this approach. Moreover.Temporal XML representation and querying 215 multiversion norms represented as XML documents and made available on the Web. as it requires the full navigation of the document collection structure. Similarly to the structural join approach [139] proposed for XML query pattern matching.[1994. Each time one or more updates occur on a multiversion XML document. and the execution of a binary join between them at each level of the query path. However the main drawback of the structural join approach is that the sizes of the results of . Moreover. this solution results in long XQuery programs also for simple temporal queries and postprocessing phases in order to coalesce the query results. a great deal of work has been devoted to the processing of temporal join (see e. standard path indexing. Even if they propose different optimizations of the initial time-slicing approach.t.g. In [99] the authors propose an approach for evaluating TXPath queries which integrates the temporal dimension into a path indexing scheme by taking into account the available continuous paths from the root to the elements. we could have extended temporal join algorithms to the structural join problem or vice versa.r. Given the temporal indexing scheme. The authors suggest an implementation based on a stratum approach to exploit the availability of XQuery implementations. an XQuery engine is not aware of the temporal semantics and thus it makes more difficult to apply query optimization and indexing techniques particularly suited for temporal XML documents. proposed in [31. Native solutions are. query processing can still be quite heavy for large documents.section) and (section. the temporal slicing problem can be naturally decomposed into a set of temporal-structural constraints. path query performance is enhanced w. instead. For instance solving time-slice(//contents// section//article. i. This leads to large overheads when “conventional” queries involving structural constraints and spanning over multiple versions are submitted to the system. the previous version. 99].r. Gao and Snodgrass [56] need to time-slice documents in a given period and to evaluate a query in each time slice of the documents. paths that are valid continuously during a certain time interval. even though the main memory representation of their indices is more than 10 times the size of the original documents. in order to access the required element tables. The paper [31] introduces techniques for storing and querying multiversion XML documents. the proposed versioning scheme creates a new physical version of the document where it stores the differences w.t. now]) means to find all occurrences in a temporal XML database of the basic ancestor-descendant relationships (contents. [110]) also using indices [140]. Closer to our definition of time-slice operator. In the literature.e.

rather than user profiles. On the other hand. in [77] is presented a contextual search system that allows users to explicitly and conveniently define their context features to obtain personalized results. especially when it is done on the server-side. On the other hand. Our native approach extends one of the most efficient approaches for XML query processing and the underlying indexing scheme in order to support temporal slicing and overcome most of the previously discussed problems.3. The biggest concerns of personalized search through the analysis of the user context is privacy.216 Multi-version management and personalized access to XML documents binary structural joins can get very large. such as the one we presented. an implicit feedback is provided. in [120] a client-side web search agent is presented that uses query expansion based on previous searches and on an immediate result re-ranking based on clickthrough information. to learn a retrieval function which is used to produce a customized ranking of search results that suits users preferences. our personalized search techniques differ from the other above cited approaches because they exploit a domain ontology. For instance. Starting from the holistic twig join approach [25]. to perform queries on the repository: This implies the inter- . the personalization in web search has been mainly influenced by two factors: The analysis of the user context and the exploitation of ontologies. the user context could be collected through an explicit feedback. 6. even when the input and the final result sizes obtained by stitching together the basic matches are much more manageable. such as the subject of the visited pages and the dwelling time on a page. Whenever this kind of information is collected without any effort of the user. The analysis of the user context involves the recording of the user behavior during searches. To conclude. For instance. avoid this problem exploiting specific domain ontologies to answer user queries on a conceptual level. These approaches tend to gather semantics from the documents to define an ontology of the concepts.2 Personalized access to XML documents The problem of information overload when browsing and searching the web becomes more and more crucial as the web keeps growing exponentially: A personalized access to resources is the most popular remedy for increasing the quality and speed-up these tasks. ontology-based approaches. In particular. which directly avoids the problem of very large intermediate results size by using a chain of linked stacks to compactly represent partial results. we proposed new flexible technologies consisting in alternative solutions and extensively experimented them in different settings. as we experienced in recent years. rather than use classification techniques to automatically create user profiles.

such as semi-structured ones. On average. where three temporal dimensions are involved. we access more complex documents. and on several variations of it. involving three nodes. in subsection 6. 6. the high flexibility we provide makes it possible to access fragments of the documents. and different temporal windows as our focus is not on the structural aspects.4. each one in 2-3 versions composed by the union of 1-2 distinct periods.6.1 Temporal slicing Experimental setting The document collections follow the structure of the documents used in [58]. in subsection 6.4 Experimental evaluation In this section we present the results of an actual implementation of our XML query processor supporting multi-version XML documents. rather than unstructured documents such as web pages. consisting of 5000 documents (120 MB) generated following a uniform distribution and characterized by not much scattered nodes.2 we specifically evaluate the system in the eGovernment scenario. In particular. thus directly affecting the containment constraint. Finally. we investigate different kinds of probability density functions generating collections with different distributions. 6.4 Experimental evaluation 217 vention of human experts to build the ontology. a depth level of 10. returning all and only the ones that fit to the user needs and avoiding the retrieval of useless information. 10-15 of these nodes are timestamped nodes nT . each document contains 30-40 nodes. All experiments have been performed on a Pentium 4 3Ghz Windows XP Professional workstation. showing the behavior of our system on different document collections and in different execution scenarios. equipped with 1GB RAM and an 160GB EIDE HD with NT file system (NTFS). also testing its personalized access performances.4. . Experiments were conducted on a reference collection (C-R). Then. but avoids privacy issues. We tested the performance of the time-slice operator with different twig and t-window parameters. We are also able to change the length of the periods and the probability that the temporal pertinence of the document nodes overlap. indeed. and have been generated by a configurable XML generator.1 we focus on temporal slicing support.4. In this context we will deepen the performance analysis by considering the same path. Moreover.

37 % 7. The best result is given by the computation scenario L1/L1: Its execution time is more than 6 times faster than the execution time of the baseline scenario SOL/SOL. avoids I/O costs for reading unnecessary tuples and their further elaboration cost at the upper levels. the percentage of tuples that are put in the buffers and in the stacks w.10 % 23.95 % 7.e.51 % Tuples (%) Buffer Stack 7.76 % 9.218 Evaluation scenarios: L1/L1 L2/L1 SOL/L1 L1/L2 L2/L2 SOL/L2 L1/SOL L2/SOL SOL/SOL Multi-version management and personalized access to XML documents Execution Time (ms) 1890 1953 2000 2625 2797 2835 12125 12334 12688 Non-Consistent Solutions (%) 23. 20% of the tuples stored in the temporal inverted indexes involved by the twig pattern intersect the temporal window. the percentage of potential solutions at level SOL that are not temporally consistent and. Evaluation of the default setting We started by testing the time-slice operator with a default setting (denoted as TS1 in the following). Let us now have a look to the other scenarios.23 % 7.74 % 96.33 % 88.85 % 99.1 shows the performance of each scenario when executing TS1. Table 6. i.17 % 17.80 % 9.00 % 100.00 % Table 6. */L1.85 % 100. The decrease of read tuples from 100% of SOL/SOL to just 7. TS1 represents a typical querying setting where the containment constraint is much more selective than the non-empty intersection constraint.1% of L1/L1 represent a remarkable result in terms of efficiency.17 % 88.76 % 23. */L2 and */SOL.43 % 9.76 % 9. This consideration induces us to analyse the obtained performances by partitioning the scenarios in three groups.92 % 88.r. The returned solutions are 5584.99% of L1/L1 and the decrease of temporally inconsistent solutions from 96.99 % 7. such as L1.74 % 95. the temporal inverted indices exploited at level L1 are B+-trees.10 % 39.13 % 23. In particular. on the basis of the adopted containment constraint . the total number of tuples involved in the evaluation. in the last two columns.76 % 23.51% of SOL/SOL to 23. the comparison of the performances between the B+ -tree and MVBT implementations will be shown in the following. Such a result clearly shows that combining solutions at a low level of the architecture.13 % 95. from the left: The execution time. Its temporal window has a selectivity of 20%.10 % 23. Notice that.10 % 39.1: Evaluation of the computation scenarios with TS1.t.

In group */L1 the low percentage of tuples in buffers (10%) means low I/O costs and this has a good influence on the execution time.00 100. solution. However. for TS1 the execution time follows the same trend as the read tuples whereas for TS2 the execution time of different scenarios are closer.95 25. this explains also the same execution time.9: Comparison between TS1 and TS2.15 SOL/L2 23. while the execution time is about 1.23 17. Moreover.9 shows the percentage of read tuples (Figure 6.5 times higher. In group */L2 the percentages of tuples in buffers are more than double of those of group */L1. .58 L2/L1 9. The scenarios within each group show similar execution time and percentages of tuples.99 15. respectively.00 SOL/SOL 12688 12697 (a) Percentage of tuples in the buffers (b) Execution Time (ms) Figure 6. group */SOL is characterized by percentages of tuples in buffers and execution time approximately ten and six time higher than those in */T1.68 L1/L2 17. Finally. In this case.07 SOL/SOL 100. the lower selectivity of the temporal window makes the benefits achievable by the L1 solutions less appreciable.80 32. Notice that the trend of growth of the percentage of read tuples along the different scenarios is similar.Temporal slicing 100 90 80 70 60 50 40 30 20 10 0 TS1 TS2 14000 12000 Execution Time (ms) 10000 8000 6000 4000 2000 0 TS1 TS2 L1/L1 1290 2812 L2/L1 1953 2844 L1/L2 2625 3422 SOL/L2 2835 3547 219 % of Tuples in the Buffers L1/L1 7. Notice that. in the SOL/SOL scenario both queries have the same number of tuples in the buffers because no selectivity is applied at the lower levels. Figure 6.9-b) of TS1 compared with our reference time-slice setting (TS1). Changing the selectivity of the temporal window We are now interested in showing how our XML query processor responds to the execution of temporal slicing with different selectivity levels. to this purpose we considered a second time-slice (TS2) having a selectivity of 31% (lower than TS1) and returning 12873 solutions.9-a) and the execution time (Figure 6. within each group it should be noticed that rising the non-empty intersection constraint solution from level L1 to level SOL produces more and more deterioration in the overall performances.

Notice that in this indexing scheme tuples must have different LeftPos and RightPos values and thus each temporal XML document must be converted into an XML document where each timestamped node gives rise to a number of distinct nodes equal to the number of distinct periods.r. Notice that when MVBT indices are used to access data the execution time is generally higher than the B+-tree solution. Evaluation of the performance of solution UL1 In order to evaluate the results of exploiting access methods at level UL1 we considered a third time-slice (TS3) that is characterized by a highly selective temporal window (1%) and returns 123 solutions.10-a compares the execution time of the scenarios involving UL1 solutions (*/UL1) with the best and the baseline scenarios shown above (L1/L1 and SOL/SOL).t.10-b where it is clear that the execution time of the purely structural . In particular the best computation scenario for TS3 is L2/UL1. the MVBT.220 Multi-version management and personalized access to XML documents 14000 12000 Execution Time (ms) 10000 8000 6000 4000 2000 0 TS1 TS2 TS3 L1/L1 1290 2812 1031 L2/UL1 3078 3938 797 SOL/UL1 3081 3953 813 SOL/SOL 12688 12691 9891 Execution Time (ms) 20000 15000 10000 5000 0 L1/L1 B+TREE 1290 2812 L1/L1 MVBT 2655 5709 SOL/SOL STRUCT 12688 12691 17750 17859 TS1 TS2 (a) UL1 scenarios performances (b) MVBT and structural approach performances Figure 6.10: Additional execution time comparisons. The last comparison involves the holistic twig join algorithms applied on the original indexing scheme proposed in [139] where temporal attributes are added to the index structure but are considered as common attributes. Figure 6. while they are the best ones with high-selectivity setting. it shows that */UL1 scenarios are inefficient for lowselectivity settings. Comparison with MVBT and purely structural techniques In Figure 6. The results are shown on the right of Figure 6.10-b we compare the execution time for scenario L1/L1 when the access method is the B+-tree w. As one would expect. This might be due to the implementation we used which is a beta-version included in the XXL package [130].

Evaluation on differently distributed collections We also considered the performance of our XML query processor on another collection (C-S) of the same size of the reference one. Figure 6.11: Comparison between the two collections C-R and C-S. depicted in Figure 6.23 98.5 SOL/L1 39.17 SOL/L1 43. L1.e. Therefore. when temporal slicing is applied to this kind of collections the best way to process it is to adopt a solution exploiting the non-empty intersection constraint at the lowest level.Temporal slicing 14000 12000 10000 Execution Time (ms) 8000 6000 4000 2000 0 C-R C-S L1/L1 1890 906 SOL/L1 2000 1383 SOL/SOL 12688 9766 L1/L1 2812 1250 SOL/L1 2859 1797 SOL/SOL 12691 9875 % Non-Consistent Solutions 100 90 80 70 60 50 40 30 20 10 0 C-R C-S 221 L1/L1 23. The non-empty intersection constraint is mainly influenced by the temporal sparsity of the nodes in the collection: The more the nodes are temporally scattered the more the number of temporally inconsistent potential solutions increases.88 TS1 TS2 TS1 TS2 (a) Execution time (b) Percentage of Non-Consistent Solutions Figure 6.98 L1/L1 29.t.95 99. The execution time of scenarios L1/L1 and SOL/L1.3 where we provide additional discussion of state of the art techniques w. Notice also that the percentage of temporally inconsistent potential solutions when no solution is applied under level SOL is limited in the C-R case but explodes in the C-S case (see for instance SOL/L1 in Figure 6. i. approach (STRUCT) is generally higher than our baseline scenario and thus also than the other scenarios (13 times slower than the best scenario).13 95.96 63. whereas the difference is more remarkable for both temporal slicing settings for collection C-S.r. We refer the interested reader also to Section 6. ours. . This demonstrates that the introduction of our temporal indexing scheme alone brings significant benefits on temporal slicing performance.11-b).11 shows the execution time and the number of temporally inconsistent potential solutions of TS1 and TS2 on both collections.01 SOL/SOL 96.11-a. shows that it is almost unchanged for collection C-R.22 SOL/SOL 91.10 32.51 99. but that is characterized by temporally scattered nodes.

t.4. articles. Scalability Figure 6. thus showing the good scalability of the processor in every type of query context. textual and applicability query facilities in a single component. together with additional ad-hoc data structures (relying on embedded “light” DBMS libraries) and algorithms which allow users to store and reconstruct on-the-fly the XML norm texts satisfying the four types of constraints.222 100000 Multi-version management and personalized access to XML documents Execution Time (ms) 10000 1000 L1/L1 L2/L2 SOL/SOL 5000 Docs 1890 2797 12688 10000 Docs 3531 5329 22893 20000 Docs 5654 9844 45750 Figure 6.12 (notice the logarithmic scales) reports the performance of our XML query processor in executing TS1 for the reference collection C-R and for two collections having the same characteristics but different sizes: 10000 and 20000 documents. The system accesses and retrieves only the .2 Personalized access Experimental setting The architecture of the system we designed for the eGovernment scenario is based on an “XML-native” approach. Such tests have also been performed on the other temporal slicing settings where we measured a similar trend. The execution time grew linearly in every scenario.r. as it is composed of a Multi-version XML Query Processor designed on purpose. and so on).12: Scalability results for TS1. with a proportion of approximately 0. 6. structural. such a component stores the XML norms not as entire documents but by converting them into a collection of ad-hoc temporal tuples. the number of documents for our best scenario L1/L1. paragraphs.75 w. representing each of its multi-version parts (i. which is able to manage the XML data repository and to support all the temporal. The prototype exploits the temporal slicing technology whose evaluation has been shown in the previous subsection. Textual constraints are handled by means of an inverted index. As in the temporal slicing section.e.

Owing to the properties of the adopted pre. respectively. 2KB and 125KB.65%) 2.and post-order encoding of the civic classes.68% (0. All the queries require structural support . collection C1) strictly necessary data by querying ad-hoc and temporally-enhanced structures without accessing whole documents. the average.6% (0. The total size of the collections is 120MB.77%) Performance (msec) 1046 (1095) 2970 (3004) 6523 (6760) 1015 (1020) 2550 (2602) 223 Table 6. Execution time on main collection Experiments were conducted by submitting queries of five different types (Q1-Q5). In the following.2: Features of the test queries and query execution time (time in msecs. the system is able to very efficiently deal with applicability constraints during query processing by means of simple comparisons involving such encodings and semantic annotations. respectively. we will present in detail the results obtained on the main collection (C1).31%) 1. there is no need to build space-consuming structures such as DOM trees to process a query and only the parts which satisfy the query constraints are used for the reconstruction of the results.23%) 4. Furthermore.000 XML norm text documents). we built a specific query benchmark and conducted a number of exploratory experiments to test its behavior under different workloads.2 presents the features of the test queries and the query execution time for each of them. then we will briefly describe the scalability performance shown on the other two collections.000 documents). C2 (10. 240MB.9% (1.Personalized access Query Q1 Q2 Q3 Q4 Q5 (Q1-A) (Q2-A) (Q3-A) (Q4-A) (Q5-A) Constraints Tm St Tx Selectivity 0.02% (1. which is able to produce different documents compliant to our multi-version model. and 480MB.3%) 0. hence. We performed the tests on three XML document sets of increasing size: collection C1 (5.000 documents) and C3 (20.46% (0. the architecture also provides support to personalized access by handling the applicability constraints. For each collection. Table 6. In all collections the documents were synthetically generated by means of an ad-hoc XML generator we developed. In order to evaluate the performance of our system. minimum and maximum document size is 24KB.

The time needed to answer the personalized access versions of the Q1–Q5 queries is approximately 0.2 (performance figures in parentheses). thus showing the good scalability of the system in every type of query context. therefore.5-1% more than for the original versions. valid and publication time. types Q4 and Q5 mix the previous ones since they involve both keywords and temporal conditions. the system was less than 30% slower than with C2). even when large amounts of documents containing some (typically small) relevant portions have to be retrieved. the size of the tuples with applicability annotations is practically unchanged (only a 3-4% storage space overhead is required with respect to documents without semantic versioning).000 documents of collection C3.e. Also with the other queries the measured trend was the same. 58]. the average response time was 1. Notice that this property is also very promising towards future extensions to cope with concurrent multi-user query processing.366 msec (i. Scalability Finally. on the 20. denoted as Qx-A in Table 6. even with quite complex annotations involving several applicability extensions and restrictions. Notice that the selectivity of the query predicates does not impair performances. as it happens for queries Q2 and Q3. types Q1 and Q2 also involve textual search by keywords (Tx constraint). the main memory requirements of the Multi-version XML Query Processor are very small.741 msec (i. Furthermore.000 documents of collection C2 (which is as double as C1) took 1. providing a short response time (including query analysis. For instance. for the same reasons. For each query type.224 Multi-version management and personalized access to XML documents (St constraint). retrieval of the qualifying norm parts and reconstruction of the result) of approximately one or two seconds for most of the queries. Our system is able to deliver a fast and reliable performance in all cases.e. . let us comment the performance in querying the other two collections C2 and C3 and. the system was only 30% slower). Our approach shows a good efficiency in every context. concerning the scalability of the system. similarly. We ran the same queries of the previous tests on the larger collections and saw that the computing time always grew sub-linearly with the number of documents. type Q3 contains temporal conditions (Tm constraint) on three time dimensions: transaction. Let us first focus on the “standard” queries. since it practically avoids the retrieval of useless document parts. Moreover. query Q1 executed on the 10. since the applicability annotations of each part of an XML document are stored as simple integers. we also present a personalized access variant involving an additional applicability constraint. with different selectivities. consider that. less than 5% with respect to “DOM-based” approaches such as the one adopted in [59.

i. The effectiveness of such application widely benefits from the definition and exploitation of a versatile approach for the disambiguation of the schema terminological information. we deeply studied XML pattern matching properties and we defined a set of reduction conditions. ˆ We considered the problem of approximate query answering for heterogeneous XML document bases and we proposed a method for structural query approximation which is able to automatically identify semantic and structural similarities between the involved schemas and to exploit them in order to automatically rewrite a query written on a source schema towards other available schemas. In particular. we considered different application scenarios and two main kinds of information. We successfully applied such techniques to the EBMT and duplicate document detection scenarios. The main contributions of our work are the following: ˆ We proposed similarity measures between text sequences and we defined novel approximate matching algorithms searching for matches within them.e. also fulfilling additional requirements which are peculiar to these tasks. . textual and semi-structured XML documents. which is complete and which is applicable to the three kinds of tree pattern matching.Conclusions and Future Directions In this thesis we presented different techniques allowing effective and efficient management and search in large amounts of information. ˆ As to semi-structured information. We showed that such a theoretical framework can be applied for building efficient pattern matching algorithms. Efficiency and portability are ensured by the introduction of ad-hoc filtering techniques and a mapping into plain SQL expressions. respectively. based on the pre/post-order numbering scheme.

bringing approximate query answering techniques toward the requirements of such distributed settings. We showed that the results and performance that our systems achieve can widely benefit the different reference applications. and by extending and testing their applicability toward graph-like schemas and ontologies. we delved into the multi-version XML management and querying scenario. for instance. ˆ Extending the multi-version query processor framework in order to fully support semantic versioning through generic graph-like ontologies. we developed a temporal and semantical XML query processor supporting both temporal versioning. ˆ Enhancing the schema matching. also improving the XML documents annotation scheme and their storage organization. structural disambiguation and query rewriting techniques. Starting from the ideas and work presented in this thesis. We also applied and specialized such approach to the eGovernment scenario: In the context of a complete and modular infrastructure. exploiting and supporting. several interesting issues for future research can be considered.226 Conclusions and Future Directions ˆ Finally. starting from the holistic twig join approach. ˆ Deeply analyzing the Peer-to-Peer and Peer Data Management Systems scenarios. These include: ˆ Testing and specializing XML matching techniques towards multimedia querying. and semantic versioning. essential in normative systems. ˆ For all the presented techniques. . we proposed new flexible technologies consisting in alternative solutions supporting the temporal slicing problem. by studying automatic deduction of the schemas underlying the submitted queries and ad-hoc rewriting rules. we implemented them in full system prototypes and performed very detailed experimental evaluation on them. the peculiar features and potentialities of the XML-based MPEG-7 audio-visual metadata description schema [95]. We proposed a novel temporal indexing scheme and.

1 A. s2 . It is organized conceptually in synonym sets or synsets. “cat (animal)”). in WordNet Sw is an ordered list of senses [sw . . which are linked by a variety of semantic relations. we introduce the notion of minimum common hypernym csi . In particular. . is hypernymy/hyponymy (IS-A). . Section A..1. and the one we are particularly interested in.1 explains the noun and verbs disambiguation techniques. Because not all the senses of a term are equally frequent in any language. sm } and with Sw the set of senses (Sw ⊆ S) a word w (which can be a noun n or a verb v) is able to express. We indicate the set of WordNet senses as S = {s1 .. sw ]. The predominant relationship for nouns.sj (also called most informative subsumer in [113]) between two senses si and sj . if avail- . as the most specific (lowest in the hierarchy) of their common hypernyms. .g. we introduce some preliminary notions based on WordNet. The senses constitute the nodes of a word sense m 2 1 network.1 The disambiguation techniques Preliminary notions Before analyzing our Word Sense Disambiguation techniques in depth. representing different concepts or senses. An example of hypernymy/hyponymy hierarchy is shown in Figure A.2 details the M ultiEditDistance matching algorithms.1. By considering any pair of senses.Appendix A More on EXTRA techniques In this appendix we present and discuss in detail some of the techniques of the EXTRA system which could not be systematically analyzed in Chapter 2. sw .. where the senses are represented with a short textual description (e. while Section A. WordNet is a lexical database [100] containing the definitions of most of the English language words. A.

we define the path length between two nouns ni . For instance. since the senses of such nouns that join most rapidly are the ones depicted in Figure A. we call path length denoted as len(si .1: Example of WordNet hypernym hierarchy able (e. one of the most promising one that does not require a training phase on large pre-classified corpora is the Leacock-Chodorow measure [81]. we can introduce the minimum common hypernym cni .g. whenever a minimum common hypernym between the two senses si and sj is available. nj as: len(ni .1. felid Mouse (animal) s2 Cat (animal ) s1 Figure A. Among the proposed measures. Furthermore. we introduce a more general variant of such measure for quantifying the similarity between two nouns as follows.1) In the same way. the hypernym hierarchies and the notion of path length between pairs of synsets has been extensively studied as a backbone for the definition of similarities between two senses [26]. nj ) = minsni ∈Sn k nj i . by starting from senses s1 “cat (animal)” and s2 “mouse (animal)” and moving up in the hierarchy . sj ) the number of links connecting s1 with s2 and passing from the minimum common hypernym node. s2 ) = 3 + 2 = 5 (3 links from s1 to cs1 . sk j ) k n (A. In the following definition. which therefore is the minimum common hypernym cs1 . Definition A. s2 Placental mammal More on EXTRA techniques Carnivore Rodent Feline. we see that they first intersect on the common sense of “placental mammal”.1 (Semantic similarity between nouns) Given two nouns .s2 and the path length is len(s1 . note that the noun hierarchy has nine root senses).nj between two nouns ni .s2 plus 2 from cs1 . nj as the minimum common hypernym sense corresponding to the minimum path length between the two nouns. As to nouns. Their minimum common hypernym is the sense “placental mammal”. where we also consider the case where the minimum common hypernym is not available.sk ∈Snj len(sni .228 c s 1. the minimum path length between nouns “cat” and “mouse” is 5. In the literature.s2 to s2 ). For instance.

the semantic similarity sim(ni . that is the most probable. For example. nj ) = −ln 0 len(ni . nj Sn v. ni . it is well known that. vj Sv sn (sv ) k k cni . vh . the techniques that we have devised compute the confidences in choosing each of the senses associated with each term and the sense with the highest confidence.nj ) 2·Hmax if ∃ cni . nj ) Set of noun tokens Set of verb tokens Nouns Set of senses of n Verbs Set of senses of v k-th sense of n (v) Minimum common hypernym between ni and nj Semantic similarity between ni and nj Table A.2 Noun disambiguation After the premises. will be the chosen one. we will first consider the latter.Noun disambiguation N V n. nj .2) where Hmax has value 16 has it is the maximum height of WordNet IS-A hierarchy.1 summarizes the set of symbols introduced so far and that will be used in the following. when two polysemous nouns are correlated. Given the list of nouns (N ) and verbs (V ) to be disambiguated. The resulting similarity is a non-negative number. nh .86. the similar5 ity between the nouns “cat” and “mouse” is −ln( 32 ) = 1. A null similarity corresponds to the similarity between two nouns having no common hypernyms. Table A. consider . A.1: Symbols and meanings 229 ni . In fact. For instance. vi .nj otherwise (A. The basic idea of such technique is that the nouns surrounding a given one in a sentence provide a good informational context and good hints about what sense to choose for it. nj ) between ni and nj is defined as: sim(ni .nj sim(ni . Since the technique for verbs disambiguation is strictly based on the nouns one. it is quite probable that their minimum common hypernym provides valuable information about which sense of each noun is the relevant one. we are now ready to discuss the core of our WSD technique.1.

As far as the first aspect is considered. we weigh the contribution of the surrounding nouns w. “cat (animal)” and “mouse (animal)”. we refined the confidence computation by introducing two enhancements which make our approach more effective in such context. In particular. If we consider their minimum common hypernym. will have the highest confidence and will be the ones chosen. In particular. the decay is almost absent (values near 1). Thus. For example. more correlated the two nouns are and therefore more important will be the information extracted from the computation of their semantic similarity. while a cat could also be a vehicle. their positions and we consider the frequency of use of such sense in English language. the two senses that join through it. i. click the mouse button to freeze it in that position” (see Figure A. the best hint of this fact is given by the presence of the noun “button”. note that the contributes of the surrounding nouns to the confidence of each of the senses of a given noun are not equally important. we assume that closer the positions of the two nouns are in the sentence.2). we deal with sentences of variable lengths and whose meaning is also affected by the relative positions of their terms.r.j ) = 2 · √ 2 +1− √ 2π 2π d2 h. “placental mammal”. If no context is available. while the surrounding nouns forming its context are underlined. located close to the term “mouse” (point B in the figure. consider the following technical sentence: “As soon as the curve is the right shape. while for distant words the decay asymptotically tends towards values around 1/5 of the full value. we weigh the similarity between nh ∈ N and each of the surrounding nouns nj ∈ N on their relative positions by adopting a gaussian distance decay function D centered on nh and defined on their distance dh. given a noun nh ∈ N to be disambiguated. in the computation of the confidence in choosing one of the senses associated with each noun. Thus.j : e− D(dh.e.t. The basis of the technique we propose derives from the one in [113]. the “mouse” is clearly an electronic device and not an animal.3) For two close nouns.3 again: The nouns to disambiguate are “cat” and “mouse”.230 More on EXTRA techniques the sentence in Figure 2. In this example. More distant nouns such as “curve” or “position” are less correlated to “mouse” . low decay). The decay function is centered on “mouse”. the noun to be disambiguated. As in [113]. the confidence in choosing one of the senses associated with each term is directly proportional to the semantic similarities between that term and the other ones. clearly providing a good way to understand the meaning. On the other hand. a mouse could be an animal.j 8 (A. but also an electronic device.

In this case. which is based on the frequency of use of the senses in English language.j) B A 0.4) k m−1 where 0 < γ < 1 is a parameter we usually set at 0.2: The gaussian decay function D for the disambiguation of “mouse” defined on the words of the example sentence and have a lesser influence on its disambiguation (point A in the figure. by incrementing the confidence of one sense of a given noun n in a way which is inversely proportional to its position in the list Sn . The confidence in choosing the meaning of a noun in a sentence is formally defined as follows. high decay).8 0. All the concepts discussed are involved in the notion of confidence in choosing a sense for a noun nh ∈ N . while the second represents the contribution of the sense decay. .2 1 0. In this way. we quantify the frequency of the senses where the first sense has no decay and the last sense has a decay of 1:5.j Figure A. In our experiments. we exploit the rank of the senses. the first one involving the semantic similarity and the distance decay between nh and the surrounding nouns.6 231 D(d i. It is defined as a sum of two components. we defined a linear sense-decay function R on the sense number k (ranging from 1 to m) of a given noun n: k−1 R(sn ) = 1 − γ (A. Such an adjustment attempts to emulate the common sense of a human in choosing the right meaning of a noun when the context gives little help. it has been proved to be particularly useful for short sentences where the choice of a sense can be led astray by the small number of surrounding nouns.Noun disambiguation 1.2 0 -6 -5 soon -4 curve -3 right -2 -1 0 1 2 3 that 4 position 5 6 shape click mouse button freeze di.4 0.8. As to the second aspect.

k k By varying the values of the α and β parameters we can change the relative weight of each of the components. i. 0. As previously described.j. defined as: xh.j ) + β · R(snh ) k (A.k nj ∈N. If we consider the following sentence. in the sentence “You can even pick up an animation as a brush and paint a picture with it”. for instance. an high value of ϕ(snh ) indicates a high probability k that the correct sense of nh is snh .j=h sim(nh .3.j ) · xh.4). the default values. in order to provide good effectiveness .j. Confidences are values between 0 and 1.232 More on EXTRA techniques Definition A. however.6) The meaning chosen for nh ∈ N in the sentence is the sense snh ∈ Sni such that ϕ(snh ) = max{ϕ(snh ) | snh ∈ Snh }.k is a boolean variable depending on the minimun common hypernym cnh . nj ) · D(dh.5) where α > 0 and β > 0 (α + β = 1) and xh. for example when the nouns in N are not strictly inter-correlated or when the length of the sentence is too short. Consider.e. such as “a dense growth of bushes” or “the act of brushing your teeth”. the confidence ϕ(snh ) in choosing sense snh ∈ Snh k k of noun nh ∈ N is: ϕ(snh ) = α k nj ∈N. For instance. However. Therefore. the name “dog” would clearly be very useful in order to disambiguate the meaning of “cat” and “beetle” as animals.nj is an hypernym of snh k 0 otherwise (A.k = 1 if cnh . At the end of the algorithm the sense having the highest confidence (0. make the contribution of the first component predominant. Therefore. which is the correct one and will be the one chosen. we choose the sense with highest k confidence as the most probable meaning for each noun in the sentence. there may be some situations in which such context is not sufficient.7 and 0. Our disambiguation algorithm is able to correctly disambiguate it by measuring the semantic similarities between it and the other nouns. the nouns N contained in a sentence provide a good informational context for the correct disambiguation of each noun nh ∈ N .2 (Meaning of a noun in a sentence) Given the set N of nouns in a given sentence.79) is “an implement that has hairs or bristles firmly set into a handle”. “animation” and “picture”. the sentence “The cat hunts the beetle”: The two nouns “cat” and “beetle” alone could be correlated both as vehicles and as animals.nj between nh and nj .j.j=h sim(nh . “It just ran away from the dog”. while other off-topic senses. all have a much lower confidence (less than 0. nj ) · D(dh. the noun “brush” has several senses in WordNet. respectively.

1. in order to disambiguate a verb vh ∈ V of a given sentence. we can formally define the meaning of a verb in a sentence. much more resource and knowledge “hungry” than a tagger. other than WordNet. As with nouns. as one would expect. the nouns from the near sentences automatically contribute less than the ones from the central sentence. Consider the verbs in the technical sentence of the previous examples: “As soon as the curve is the right shape. we allow the expansion of the dimension of the context window to include the nouns of the sentences before and after the one involved. The approach we adopt for verbs capitalizes on the good effectiveness of the noun technique. and of external knowledge sources.3 (Meaning of a verb in a sentence) Given the set V of . However.Verb disambiguation 233 also in such difficult situations. making the computation of semantic similarities between them quite inappropriate. The adoption of the same approach used for noun disambiguation would not be effective for several reasons. verbs could be better disambiguated by analyzing their relations with the objects and subjects. Definition A. A. but this would generate two main problems: The need for a complete sentence parser. where the nouns’ hierarchies should be connected with the verbs’ ones. The distance decay is still defined on the distance. In fact. the sentences in the context window are considered as one long sentence.3 Verb disambiguation The automatic disambiguation of verbs is a more complex and a less studied problem. while also exploiting one of the most interesting additional information WordNet provides for verbs: The usage examples. It is evident that the disambiguation of “click” is totally independent from the one of “freeze”. between the nouns in the sentences: For this purpose. Basically. the verbs of a sentence have a much lower semantic inter-correlation than nouns. In particular. in words. The nouns contained in this window constitute a wider noun set N and they will contribute to the disambiguation of the nouns N of the central sentence. In this way. verbs in WordNet are organized in troponimy (“is a manner of”) hierarchies. As nouns are organized in hypernymy structures. where the similarity measure is applied between each noun ni ∈ N of the given sentence and each noun nj in the usage examples of the verb vh . we denote with N (svh ) the nouns in the usage examples of the sense svh of the k k verb vh and with N (Svh ) the nouns in the usage examples of all the senses of verb vh . click the mouse button to freeze it in that position”. we introduce a technique similar to the one used for noun disambiguation.

since this match greatly increments its confidence. the confidence ϕ(svh ) in choosing sense svh ∈ Svh k k of a verb vh ∈ V is computed as: maxnj ∈N (svh ) sim(ni . it is able to choose the correct sense for the verb. we identify the most semantically close noun from the usage examples (maxnj ∈N (svh ) sim(ni . along with the nouns of the usage examples. we choose the sense whose usage examples best reflect the use of the verb in the sentence.h ) + βR(svh ) k (A. that is “to move or strike with a click”. the algorithm chooses “light” as the best match both for “mouse” and “button” and.234 More on EXTRA techniques verbs of a given sentence. The meaning chosen for vh ∈ V in the sentence is the sense svh ∈ Svh such that ϕ(svh ) = max{ϕ(svh ) | svh ∈ Svh }. . our approach is also able to extract. nj ) · D(di. “button” and “light” as electronic devices are very high. we choose the sense with highest confidence as the most probable meaning for each verb in the sentence. the most probable sense is the one whose nouns k in its usage examples best match the nouns in the sentence. A large contribution to the correct disambiguation of the verb “hunt” comes from the nouns of the definition “pursue for food or sport (as of wild animals)”: The semantic similarity between “cat”. Consider the sentence “the white cat is hunting the mouse” again. In this way. As for each noun n ∈ N in the sentence. Since the semantic similarities between “mouse”. consider the disambiguation of the verb “click” in the fragment “click the mouse button”.7) where α > 0 and β > 0 (α + β = 1). nj )). each with different usage examples. contains the following usage fragment: “he clicked on the light”. an high value of ϕ(svh ) indicates a high probability for svh to be the correct sense of vh . nj ) · D(di. The correct sense. consequently. k k Therefore. “mouse” and “animal” is obviously very high and decisive in correctly choosing such a sense as the preferred one.h ) k ϕ(svh ) = α k ni ∈N ni ∈N maxnj ∈N (Svh ) sim(ni . k k Again. or instead of them. In particular. Good hints for the disambiguation of a verb can also be extracted from the definitions of its different senses. the nouns present in such definitions. notice that the confidence is a value between 0 and 1. In WordNet such verb has 7 different senses. computing the similarities between them and the nouns of the original sentence. For example.

.3 shows some aspects of the computation for the two sequences σ q “white dog hunt mouse” and σ “cat hunt mouse”. the computation of the confidence ϕ(sk+1 ) for the n subsequent sense snh is performed only if ϕ(sk h ) < α + βR(snh ). ≥ ϕ(snh ) (m m k+1 is the cardinality of Snh ) as the α component is between 0 and 1 and R(·) is a decay function. Figure A. ϕ(snh ). . . k+1 k+1 nh nh nh whenever ϕ(sk ) ≥ α + βR(sk+1 ) then ϕ(sk ) ≥ ϕ(snh ) ≥ . the complexity of the disambiguation process could be very high. the same holds k n for ϕ(svh ). . 4] “dog hunt mouse” and σ[1 . .4 Properties of the confidence functions and optimization Both the confidence functions ϕ(snh ) and ϕ(svh ) have two components dek k pending on the k-th sense. After the computation of the confidence ϕ(snh ) where ϕ(sk h ) is k k nh the local maximum value. .r. In the following we will show it for ϕ(snh ). . For instance.t. . Indeed. coming out from the filtering phase. whose values are between 0 and 1 and whose weights are specified by the α and β parameters. Each cell M [i][j] of each matrix M represents the edit distance value between the subsequences ranging from the active starting positions to the i-th token of the first sequence and the j-th token of the second one. the cell denoted as A in the figure represents the edit distance value between the subsequences σ q [2 . while the cell denoted as B represents the edit distance value between the subsequences σ q [3 . Thus. . By computing all the cells’ values with multiEditDistance and by checking them w. Since verbs and nouns in WordNet can have quite a large number of senses (in some cases over 30).1. 4] “hunt mouse” and σ[2 .Properties of the confidence functions and optimization 235 A. For this purpose. . the multiEditDistance function performs two nested cycles for each possible starting token in the two sequences and for each pair of starting positions it computes the matrices of the edit distance dynamic programming algorithm [103] thus allowing the identification of the sub2 matches. To prevent the computation of the confidences of all the different senses. 3] “hunt mouse”. we adopt a simple optimization based on the monotonic properties of the R(·) sense decay function. 3] “cat hunt mouse”. the minimum length minL and the maximum distance d. . m k+1 A. the fact that the meaning chosen for nh has the maximum value of ϕ(·) among the computed ones allows us to avoid computing the subsequent values ϕ(snh ) . whose value can be directly and easily computed from the sense position.2 The M ultiEditDistance algorithms Given a query sequence σ q and a TM sequence σ. it . a function is needed to locate all the possible matching parts along with their distances. .

In order to better understand the set of requirements. is also required in order to prune out the shorter matches contained in the longer ones. the translator could be interested just in suggestions that are not contained in other suggestions. a filtration process.3]) ed (σq[3.3: MultiEditDistance between subsequences in σ q and σ is possible to perform all the steps required for sub2 matching and to identify all the valid sub2 matches.e.. We welcome our guests to the Madrid Art Expo! Welcome to the world of computer aided translation! We welcome you to the world of computer generated fractals This is a computer generated art work.4]. consider the following example of approximate sub2 matching with inclusion filtering. i. Suppose that minL = 3 and d = 0..4].236 do g hu nt m ou se wh it do e g hu nt m ou se More on EXTRA techniques hu nt m ou se m ou se 0 1 0 1 2 3 4 0 1 2 3 0 1 2 cat hunt mouse 1 2 3 cat hunt mouse wh it do e g hu nt m ou se 1 2 3 cat hunt A do g hu nt m ou se 1 2 3 cat hunt mouse hu nt m ou se 1 2 3 Increasing σq starting token mouse 0 1 2 3 4 0 1 2 3 0 1 2 0 hunt mouse 1 2 hunt mouse wh it do e g hu nt m ou se 1 2 hunt mouse se 1 2 hunt B se 1 2 mouse 0 1 2 3 4 0 do g hu nt m ou 1 2 3 0 1 2 0 mouse 1 mouse Increasing σ starting token 1 mouse 1 mouse [1 ] 1 A : B : ed (σq[2. sentence in the source language in the second one): 2518 3945 5673 10271 13456 So.. welcome to the world of music. which we call inclusion filtering. welcome to the world of computer generated art” where “welcome world compute generate art” is the sequence outcome of the syntactic analysis. σ[1. To satisfy this particular need.3]) σq : white dog hunt mouse [2 ] [1 ] σ : cat hunt mouse Figure A. The logical representation is the following: [3 ]         [2 ] [4 ] [3 ] q q q q m ou se 1 hu nt m ou m 1 ou se       . σ[2. as discussed in Chapter 2.. However... the longest ones.3 and that we search for the sub2 matches for the query sentence “So. Let us suppose that the translation memory contains the five sentences shown in the following table (number of sentence in the first column.

It is applied to the collection of the pairs of sequences coming out from the filters. at least partially. art]. . it could be possible to exclude the computation of the shorter matches that would not satisfy the minimum length requirement and of those matches that would be included in already identified ones.A. The remaining sequences all satisfy the minL and d requirements. compute] since it is contained in σ 10271 [welcome . . note that the inclusion requirement excludes σ 5673 [welcome . and inclusion constraints and prune out all the invalid matches. 13456. . 10271. the multiEditDistance algorithm could still be used as a first step to produce all candidate sub2 matches.4). However. 5673. . minL. . .4: Example of approximate sub2 matches with inclusion filter 2518 3945 5673 10271 13456 welcome world music welcome guest Madrid art Expo welcome world compute aid translation welcome world compute generate fractal be compute generate art work All the above sequences present some tokens in common with the query sequence (see Figure A. we developed a modified version of the multiEditDistance function. then a subsequent phase could check for the d. In particular.2 The M ultiEditDistance algorithms welcome world compute generate art 237 2518.3 ∗ 5). However. which is able to efficiently solve the sub2 matching problem by considering all the above constraints. To do so. in this way all the pairs of sequences sharing the same query sequence will be one after the other. . σ q [welcome . the subsequence of 2518 is not a valid sub2 match since it is too short. . art]) > round(0. 3945. In order to implement a sub2 matching satisfying all the discussed requirements. the part of 3945 is not considered since ed(σ 3945 [welcome . Figure A. named multiEditDistanceOpt. generate].5. a more efficient way to proceed is to avoid the computation of the undesired matches. . The pseudo-code is shown in Figure A. However. Such collection is ordered on the basis of the query sequences σ q .

. . σ) ∈ sub 2 Count and sub 2 P o s i t i o n f i l t e r r e s u l t s q i f ( σlast = σ q ) q σlast ← σ q L ← |σ q | − minL + 1 i f (L < 1) n e x t (σ q . DM [i][j] ) maxEnd[iext ] ← i − 1 i f ( i = |σ q | ) lastP os ← iext return pM atches } Figure A. which also manages the set of solutions and some auxiliary structures used in the computation. . minL. pM atches ) matches ← matches ∪ pM atches return matches } m u l t i E d i t D i s t a n c e O p t ( σ q . |σ| − minL) // c y c l e f o r each s t a r t i n g p o s i t i o n i n σ ∀i ∈ ((iext + 1) . L .5: Algorithms for approximate sub2 matching with inclusion filtering For each pair (line 5) the multiEditDistanceOpt function is called by the sub2 M atchingOpt function. |σ|) Compute DM [i][j] // d i s t a n c e m a t r i x i n pos i . . j. minL ) { matches ← ∅ // matches s e t o f matches q σlast ← empty s e q u e n c e ∀(σ q . d. i. . . . σ. . lastP os) // c y c l e f o r each s t a r t i n g p o s i t i o n i n σ q ∀jext ∈ (0 . lastP os. ∅} // a r r a y o f L s e t s pM atches = m u l t i E d i t D i s t a n c e O p t ( σ q . . . . . . |σ q |) ∀j ∈ ((jext + 1) . σ. d. maxEnd.238 More on EXTRA techniques 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 sub 2 MatchingOpt ( d. L) i f ( ( pos < iext and maxEnd[pos] ≥ i − 1 ) // c o n t a i n e d match o r ( pos = iext and maxEnd[pos] ≥ i − 1) ) insertF lag ← f a l s e e l s e i f ( pos = iext and maxEnd[pos] < i − 1 ) // c o n t a i n i n g match pM atches[pos] ← empty s e t i f ( insertF lag = true ) // new match i n s e r t i o n pM atches[iext ] ← pM atches[iext ] ∪ subMatch ( iext + 1. The matches set (initialized at line 3) will store all the resulting matches. pM atches ) { ∀iext ∈ (0 . L. . . . . maxEnd. jext + 1. σ) lastP os ← L −1 maxEnd[L] ← {−1. minL. lastP os. . . −1} // a r r a y o f L i n t e g e r s matches ← matches ∪ pM atches pM atches[L] ← {∅. j i f ( ( i − iext ≥ minL ) and ( j − jext ≥ minL ) and ( DM [i][j] ≤ round(d ∗ (i − iext )) ) insertF lag ← true // new match ∀pos ∈ (0 .

. with the help of the auxiliary structures.r. for the properties of the inclusion requirements. σ) is passed to the multiEditDistanceOpt. the minimum length constraint. Notice that. the insertion is not performed by setting insertF lag to false (line 34). L is less than 1 the query sequence is discarded for insufficient length and is not analyzed any further (lines 9-10). L is a constant representing the number of acceptable starting positions for a match on the query sequence. maxEnd is an array of L integers. computes the corresponding distance matrix (line 25-27). the start and end positions in σ and the computed edit distance. its initial value is L − 1 (line 11). thanks to the inclusion filter. Such function considers each starting position in each of the two sequences (lines 23-24). a few auxiliary data structures are exploited in the multiEditDistanceOpt function and initialized by the sub2 M atchingOpt function (lines 8. inserts it in the relevant set (lines 38-39) and updates the maxEnd and lastP os values accordingly (lines 40-42). the inclusion filtering process is performed quite efficiently.2 The M ultiEditDistance algorithms 239 Each match is a tuple containing the start and end positions in σ q . note that lastP os needs to be updated only when the ending position of the query sequence in the new match coincides with the last token of such sequence (line 41): In this case. pM atches is an array of L sets. lastP os is set at the starting position of the found match and. lastP os represents the current last acceptable starting position for the query sequence. 14) for each new query sequence σ q . Furthermore. Finally. The goal of the integers L and lastP os is to optimize the computation w. each one pM atches[i] containing the computed matches starting at the i-th position in the query. whereas that of the arrays maxEnd and pM atches is to dynamically apply the inclusion filtering. For the same reason. 12. for such query sequence. without ever re-accessing the already computed matches. Besides matches. the following starting positions will not be analyzed any further since they would only produce shorter contained matches. if the new match is longer than the already computed matches having the same starting position. if the match is included in others. If.A. eventually. More precisely. identifies new matches (lines 28-29). for a given σ q . 11. the function empties the relevant match set (line 36). the current pair of sequences (σ q . it depends on the length of the query sequence and the minimum length of any suggestion: |σ q | − minL + 1 (line 8).t. In particular. As to the inclusion filter structures. each one L[i] representing the maximum end position of all the already computed matches starting at the i-th position in the query. if the match is not included in others. checks if a match is included or includes other matches (lines 31-36) and. otherwise.

Appendix B The complete XML matching algorithms
In this appendix we present and discuss the complete versions of the sequential twig pattern matching algorithms whose basic properties and ideas have been shown as an overview in Section 4.4. We have three classes of algorithms which, respectively, perform path matching (Section B.2), ordered (Section B.3) and unordered (Section B.4) twig matching. For each class of algorithms we present both the “standard” and the content-based index optimized versions. All these algorithms perform a sequential scan of the tree signature; in Section B.5 we give more detail over the current solutions used to delimit the scan range. The algorithms associate one domain to each query node and rely on two principles: generate the qualifying index set as soon as possible and delete from the domains those data nodes which are no longer needed for the generation of the subsequent answers. With reference to the previous section, during the scanning process the algorithms generate the “delta” answer sets ∆(U )ansk , that is the set of answers which can be Q computed at step k.

B.1

Notation and common basis

Let sig(D) = d1 , post(d1 ); d2 , post(d2 ); . . . ; dm , post(dm ) denote the signature of the data tree and sig(Q) = q1 , post(q1 ); q2 , post(q2 ); . . . ; qn , post(qn ) denote the signature of a query twig pattern. To distinguish a path from a general tree, we use the capital letter P in place of Q. As in Chapter 4, for each query node qi , we assume that the post(qi ) operator accesses its post-order value and Di represents its domain. Together with the domain, the maximum and the minimum post-order values of the data nodes stored

242

The complete XML matching algorithms

∆Σprev(h’)
k

∆Σprev(h’)
k'

dk' Dprev(h’) Dh’

k Figure B.1: How ∆Σk prev(h ) (and ∆Σprev(h ) ) are implemented

in Di , accessible by means of the minPost and maxPost operators, respectively, are associated to each query node. first(qi ) is the pre-order value of the first occurrence of the name of qi in D, i.e. the minimum pre-order, and last(qi ) is the last one, i.e. the maximum pre-order value of node with name qi . Recall that a pre-order of a node is also the node’s position (index) in the signature. Notice that both these values can be computed while constructing the data tree signature. Insertions in the domains are always performed on the top by means of the push() operator. Thus the data nodes from the bottom up to the top of each domain are in pre-order increasing order. Moreover, each data node dk in the pertinence domain Dh consists of a pair: (post(dk ), pointer to a node in Dprev(h) ) where prev(h) is h−1 in the ordered case whereas it is the parent of h in Q in the unordered case. When the data node dk is inserted into Dh , its pointer indicates the pair which is at the top of Dprev(h) . For illustration see Figure B.1, where a node dk preceding dk is represented with its pointers. In this way, the set of data nodes in Dprev(h ) from the bottom up to the data node pointed by k implements ∆Σk prev(h ) and the whole content k of Dprev(h ) at step k implements ∆Σprev(h ) . By recursively following such k links from Dprev(h ) to D1 , we can derive ∆Σk prev(prev(h )) , . . . , ∆Σ1 . As a final note, we allow the access to a particular position (from the bottom to the top) in each domain by means of the dot notation. For instance D3 .2 means the second entry from the bottom of D3 .

B.2
B.2.1

Path matching
Standard version

In this subsection we propose the algorithm for the matching of a path pattern in a data tree. In this case, domains can be treated as stacks, that is deletions

Standard version
DATA STACK MANAGEMENT empty(stack) Empties stack isEmpty(stack) Returns true if stack is empty, false otherwise pointer(elem) Returns the pointer of the given element pointerToTop(stack) Returns a pointer to the top of the given stack pop(stack) Pops (and returns) an element from stack push(stack,elem) Pushes given elem in stack SOLUTION CONSTRUCTION showSolutions(...) Recursively builds the path solutions

243

Table B.1: Path matching functions are implemented following a LIFO policy. Table B.1 provides a summary of the path matching auxiliary functions. In particular, the top of the table presents all the functions which are needed in order to manage the stacks; in addition to the already introduced push() function, notice the complementary pop() one, which performs the deletions. Further, isEmpty() checks whether a given stack is empty, whereas empty() empties the stack. In the lower part of the table the functions needed for the solutions construction are shown, in our case the only showSolutions(). We assume that the query pattern has unique node names. The algorithm for path matching, which is depicted in Figure B.2, can be easily extended to the case where multiples query nodes have the same names and we will show it in the following. Finally, for path matching we do not need the maximum and the minimum post-order values associated to each stack. The key idea of the algorithm is to scan the portion of the signature from start to end, which in this case will be from first(q1 ) to last(qn ). For each data node, from Line 2 to Line 8, it deletes those data nodes in the stacks which are no longer useful to generate the delta answers. Then it adds the k-th data node in the proper stack (Line 10) and, if the data node matches with the leaf of the query path, it shows the answers which can be generated (Line 12). Notice that the k-th data node points to that data node which matches qh−1 and which has the highest pre-order value smaller than k. Such pointers will be used in the showSolutions() function. At line 13, due to the PRO1 condition, the algorithm delete k if it is the last node. Instead of checking all nodes as specified in Condition POP, we stop looking at the nodes in Di whenever post(top(Di ))>post(dk )) (line 4). It fully implements Condition POP because, as we will prove in the following, if post(top(Di ))>post(dk ) then post(si )>post(dk ) for each si ∈ Di and thus Condition POP can no longer be applied. Moreover, Condition PRO2 is

244

The complete XML matching algorithms

Input: path P having signature sig(P ); rew(Q) Output: ansP (D) algorithm PathMatch(P) (0) getRange(start, end); (1) for each dk where k ranges from start to end (2) for each h such that qh = dk in descending order (3) for each Di where i ranges from 1 to n (4) while(¬isEmpty(Di ) ∧ post(top(Di ))<post(dk )) (5) pop(Di ); (6) if(isEmpty(Di )) (7) for each Di where i ranges from i + 1 to n (8) empty(Di ); (9) if(¬isEmpty(Dh−1 )) (10) push(Dh ,(post(dk ),pointerToTop(Dh−1 ))); (11) if(h = n) (12) showSolutions(h,1); (13) pop(Dh ); (14) for each Di where i ranges from 1 to n (15) if(isEmpty(Di ) ∧ last(qi )<k) (16) exit; procedure showSolutions(h,p) (1) index[h] ← p; (2) if(h = 1) output(D1 .index[1],...,Dn .index[n]) (3) else (4) for i = 1 to pointer(Dh .index[h]) (5) showSolutions(h − 1,i);

Figure B.2: Path matching algorithm implemented in Lines 6-9. Observe that, instead of checking the intersection between the state of the domains at different steps as required by Condition PRO2, we only check whether Dh−1 is empty (line 9). Indeed, we will show that, in order to delete the nodes belonging to a domain Di at step k, it is first necessary to delete the nodes belonging to Di at a step preceding the k-th one. Before demonstrating the correctness of the algorithm, we present three properties of the data nodes stored in each stack Di . Lemma B.1 If post(top(Di ))>post(dk ) then post(dk )>post(dk ) for each k ∈ Di .

Content-based index optimized version

245

Lemma B.2 For each i ∈ [1, n] ∆Σj ∩∆Σk = ∅ iff the condition isEmpty(Di ) i i is true at any step j with k ≤ j ≤ j. Lemma B.3 At each step j and for each query index i, the stack Di is a subset of Σj containing only the data entries that cannot be deleted from Σj , i i i.e. it has the same content as ∆Σj when Lemmas 4.4, 4.5, 4.6, and 4.11 i have been applied. Starting from the data node in the leaf stack Dn (function call at Line 12 of the main algorithm), function showSolutions() uses the pointer associated with each data node dk to “jump” from one stack Di to the previous one Di−1 (up to D1 ) and recursively combines dk with each node from the bottom of Di−1 up to the node pointed by dk . The correctness of the algorithm follows from the properties shown so far. Theorem B.1 For each data node dj , S = (s1 , . . . , sn ) ∈ ∆ansj (D) iff the P algorithm, by calling the function showSolutions(), generates the solution S. Finally, let us consider the correctness of the scanned range, which is between the first occurrence of q1 and the last occurrence of qn in sig(D). As D1 is empty before first(q1 ) has been accessed, we can avoid to access all the data nodes before first(q1 ) which should be inserted in D2 , . . . , Dn but which will never been used due to Lemma 4.5. On the other hand, from Lemma 4.4, it follows that ∆ansk (D) = ∅ for k ∈ [last(qn ) + 1, m]. Q last(qn ) Therefore ansQ (D) = k= first(q ) ∆ansk . Moreover, the algorithm exits Q 1 whenever any stack Di is empty and no data node matching with qi will be accessed, i.e. last(qi )< k (Lines 14-16). It means that ∆Σk = ∅ for each i k ∈ [k, m] and thus that ∆ansk (D) = ∅. Q

B.2.2

Content-based index optimized version

In the previous section we have described the general algorithm for the matching of a path pattern in a data tree, in this section we propose another path matching algorithm (detailed in Figure B.3) that can be used for path queries with a specified value condition and if it is available a content index over the leaf of the query. In this condition we can reduce the part of the document that has to be scanned by simply considering the post-orders of the document’s nodes that satisfy the value condition specified by the query. Since we need to use a content-based index we have to introduce some functions to deal with it (see Table B.2).

the proposed algorithm supposes to use a signature that does not contain the first following values. Returns an iterator that holds the list of condition) elements in the document correspondent to element that satisfies the condition condition with the value value.3. Each element of the list is sequentially used as a target element (curLeaf ) for the search. if the signature contains those values we can simply replace Line 14 with k ← f f (dk ). there is not any element ql that satisfies the specified value condition) so we can terminate the algorithm (Lines 3 and 4) otherwise we can start the search.2: Path matching functions for content-based search Differently from the previous algorithm we first retrieve the list of document’s elements of type ql (last element of the path. then answers ended by the second element. The loop from Lines 8 to 12 updates the target of the search. Such skip is made by the loop at Lines 13 and 14. otherwise no other answers can be found and we can terminate the algorithm (Line 12). following observations made in Section 4.246 The complete XML matching algorithms PATH QUERY NAVIGATION getLastElement(signature) Returns the last element of the path getValue(element) Returns the value for the specified element getCondition(element) Returns the value condition for the specified element INDEX getMatchList(element. if needed. In order to keep curLeaf and k coherently updated we need to repeat the two loops described before (Lines 8 to 14) until we have curLeaf and k such that k ≤ pre(curLeaf ) and post(dk ) . Since the list is ordered by the pre-order value and answers are sought sequentially from start to end we will first find (if they exist) answers that end with the first element of the list.e. the only with a specified value condition) that satisfy the specified value condition by calling the function getMatchList() (Line 2). ITERATOR MANAGEMENT hasNext(iterator) Checks if exists a subsequent element getNext(iterator) Returns the subsequent element Table B.value. i. If the list is empty no answer can be found (i. If we have gone past the curLeaf (Line 8) we need to change the target: if a next target exists (Line 9) we simply update the curLeaf variable. List is ordered by pre-orders values.1 we can skip document subtrees rooted from nodes dk such that post(dk ) < post(curLeaf ). and so on.e.

247 Figure B. rew(Q) Output: ansP (D) algorithm PathMatchCont(P) (0) getRange(start. on Line 24 before generating answers we have also to check that dk is the same node as curLeaf because we need to check that the value of dk satisfies the specified condition (and that is possible only if dk = curLeaf ). From Lines 15 to 29 the algorithm is substantially the same of the previous section. (11) else (12) exit. .Content-based index optimized version Input: path P having signature sig(P ). (6) for each dk where k ranges from start to end (7) while((k > pre(curLeaf )) ∨ ((post(dk ) < post(curLeaf )))) (8) while(k > pre(curLeaf )) (9) if(hasNext(docLeaves)) (10) curLeaf ← getNext(docLeaves). (5) curLeaf ← getNext(docLeaves). (19) if(isEmpty(Di )) (20) for each Di where i ranges from i + 1 to n (21) empty(Di ). (1) ql ← getLastElement(P ).3: Content index optimized path matching algorithm ≥ post(curLeaf ). (3) if (¬hasNext(docLeaves)) (4) exit.1). (13) while(post(dk ) < post(curLeaf )) (14) k ← post(dk ) + 1. (26) pop(Dh ).pointerToTop(Dh−1 ))). (24) if(h = n ∧ dk = curLeaf ) (25) showSolutions(h. (22) if(¬isEmpty(Dh−1 )) (23) push(Dh . (15) for each h such that qh = dk in descending order (16) for each Di where i ranges from 1 to n (17) while(¬isEmpty(Di ) ∧ post(top(Di ))<post(dk )) (18) pop(Di ). end).getValue(ql ). (2) docLeaves ← getMatchList(ql .(post(dk ).getCondition(ql )). (27) for each Di where i ranges from 1 to n (28) if(isEmpty(Di ) ∧ last(qi )<k) (29) exit.

we work on the current node (Lines 10-14). the boolean function isCleanable() (see Figure B. together with the main algorithm. condition PRO1 is used to delete a node from the last stack (Line 14) and the algorithm exits whenever any stack Di is empty and no data node matching with qi will be accessed (Lines 15-17). The ordered twig matching algorithm is shown in Figure B. Finally.5 for the complete code) checks whether di can be deleted. in particular POT2 and POT3. checking whether a node can be deleted and whether a node insertion can be avoided.4. As in path matching. updating domains after a deletion. First. In particular. Line 10) and verifying if new solutions can be generated (Lines 12-13). Here. Each of these matching functions will be shown in detail. Besides the functions managing lists.248 The complete XML matching algorithms B. we try to delete nodes by means of the post-order conditions. thus deletions can be applied at any position of the domains.1 Ordered twig pattern matching Standard version As already discussed in Section 4.3 shows a summary of the functions employed in the ordered twig matching algorithms.3 B. as for the path algorithm. in this case we will implement the query domains as lists.1. Table B.3. the domains of the twig matching algorithms cannot be stacks because they are not ordered on post-order values. which are the same but applied to lists instead of stacks). if a deletion is performed.4. if i is the index of the . as we will show. deleteLast() is the equivalent of the pop() used for stacks. also in this case we implement the required conditions in the most effective order. Therefore. Then. checking if an insertion is needed (condition PRO2 and POT1. note that we omit the functions which we already discussed in the path case (functions such as empty() or push(). twig query navigation functions and more advanced solution construction functions are now needed. Let us first analyze the deletion part of the algorithm (Lines 3-9). Further. we update the pointers in the right lists (Line 9). The scanned range is the same as the one we previously discussed for paths. and thus we will also exploit the minPost and maxPost operators discussed in Section B. Further. the rest of the table shows the functions which interact with the main algorithm in order to produce the matching results: The boolean functions isCleanable() and isNeededOrd(). and updateRightLists(). in the twig matching algorithms we also need the maximum and the minimum post-order values associated to each domain. (Lines 3-9) and. respectively. in this case descendants() and the other ones shown in the lower part of the table and that will be discussed later while explaining the solution construction algorithm.

otherwise it checks the conditions expressed in Condition POT2. shown in detail in Figure B.) Recursively builds the ordered twig solutions checkPostDir(node. false otherwise isNeededOrd(pre. following the deletion of the pos-th element from the pre-th list TWIG QUERY NAVIGATION descendants(pre) Return the descendants of the given twig node SOLUTION CONSTRUCTION findSolsOrd(.elem) Checks if insertion of given elem is needed in the pre-th list (ordered version) updateRightLists(pre.post) Updates top element of stack if less than post Table B. Returns true if given nodes have the correct precN ode) post-order direction w. -n for n blocks closing.elem) Returns true if current elem can be deleted from the pre-th list. updates the pointer of all the nodes pointing to di in order to make it point to the node below di (Line 12 of the function). instead of checking the post-order value of each data node in the domains. In particular. As to the current node insertion (Lines 10-11 of the main algorithm).5.r.elem) Returns the position of elem in list noEmptyLists() Returns false if there is at least an empty list pointerToLast(list) Returns a pointer to the top of the given list MATCHING isCleanable(pre. Such an update is performed in a descending order and stops when a node pointing to a node below di is accessed (Line 3).t. Whenever a node di is deleted. the updateRightLists() function.. possibly propagating the update. . the query twigBlock(pre) Returns the block information for the pre-th query node (1 for block opening.Standard version 249 DATA LIST MANAGEMENT decreasePointer(elem) Decreases by 1 the pointer in given elem delete(list.. we check if minPost(D¯)>post(di ) for each domain D¯ (Line4 of the isCleanable() ı ı code).pos) Updates the pointers in the (pre + 1)-th list.elem) Deletes given elem from list deleteLast(list) Deletes last elem from list index(list. 0 otherwise) updateStackMax(stack. it simply returns true (Condition POT3). links connecting each domain Di with the domains of the descendants ¯ of the i-th twig node and the ı minPost operator are exploited to speed up the process. In this case.3: Ordered twig matching functions twig root.

We recall that. The same can be recursively applied to the other domains ∆Σk −2 . di+1 is deleted when. . Such emptying are performed in the updateRightLists() code from Lines 5 to 10. 1 h Before inserting a new node (Lines 10-11 of the main algorithm).4: Ordered twig matching algorithm condition PRO2 is exploited. making it sufficient to only check Dh−1 . .(post(dk ). due to the deletions applied in the main algorithm. if the pointer of di+1 is dangling it means that ∆Σk −1 ∩ ∆Σk −1 = h h ∅ as required by Conditions PRO2. end). where h > i. (15) for each Di where i ranges from 1 to n (16) if(isEmpty(Di ) ∧ last(qi )<k) (17) exit.pos).di )) (6) pos ← index(Di . Thus. whenever Condition PRO2 is applied. ∆Σk −1 is implemented by that portion of Dh −1 between the bottom and the h data node pointed by di+1 and ∆Σk −1 is the current state of Dh −1 . for instance.pointerToLast(Dh−1 ))). rew(Q) Output: ansQ (D) algorithm OrderedTwigMatch(Q) (0) getRange(start.di ). In this case. (1) for each dk where k ranges from start to end (2) for each h such that qh = dk in descending order (3) for each Di where i ranges from 1 to n (4) for each di in Di in ascending order (5) if(post(di )<post(dk ) ∧ isCleanable(i. which checks the condition shown in Condition POT1 by using the minPost(D) and maxPost(D) values for each domain D instead of comparing the current data node with each data . Figure B. and are the application of PRO2 to a node inserted at step k (di+1 in the algorithm) preceding the current k-th data node and thus already belonging to a domain Dh .di ). (10) if(¬isEmpty(Dh−1 ) ∧ isNeededOrd(h. if a domain Dh is empty then all the domains “following” Dh are emptied.1). (7) delete(Di .dk )) (11) push(Dh . . its pointer becomes dangling. we also call the boolean function isNeededOrd(). (12) if(h = n) (13) findSolsOrd(h.250 The complete XML matching algorithms Input: query Q having signature sig(Q). h intuitively. ∆Σk . (8) if(i = n) (9) updateRightLists(i. (14) deleteLast(Dh ). This is because. .

From the above analysis. 251 Figure B.di ) (1) if(i = 1) (2) return true. (7) delete(Di+1 .di+1 ). Finally. (6) return false. procedure updateRightLists(i. function isNeededOrd(i. due to the transitivity of the relationships.15) and. Lemma B.pos) (1) for each di+1 in Di+1 in descending order (2) if(pointer(di+1 )<pos) (3) return.Standard version function isCleanable(i.5.di ) (1) if(i = 1) (2) return true. we only perform the check in the parent domain (Line 3 of isNeededOrd()) and in the first left sibling (Line 5). by analogy to the path matching algorithm.4 At each step j and for each query index i.rP os). (11) else (12) decreasePointer(di+1 ). the list Di is a subset of Σj containing only the data entries that cannot be deleted from Σj .di+1 ). in this case. in order to speed up the process. call (Line 13 of the main algorithm) the recursive function findSolsOrd() implementing Theorem 4.e. (7) return true. (3) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di )) (4) return false. (5) if(isEmpty(Di−1 ) ∨ minPost(Di−1 )>post(di )) (6) return false. (10) return. i. it easily follows the Lemma below. it i i . (8) if(i + 1 = n) (9) updateRightLists(i + 1. (3) for each ¯ in descendants(i) ı (4) if(isEmpty(D¯) ∨ minPost(D¯)>post(di )) ı ı (5) return true. we check if new solutions can be generated (Lemma 4. (4) if(pointer(di+1 )=1) (5) for each di+1 from di+1 in descending order (6) rP os ← index(Di+1 .5: Ordered twig matching auxiliary functions node in D. Notice that.

and 4.9. 4..index[h]. and in order to maintain the step-by-step backward construction behavior... in particular.6.4.p) (1) index[h] ← p.i)) (14) if(twigBlock(h − 1)<=0) (15) okT oContinue ← true. (11) for i = 1 to pointer(Dh .post(Dh−1 . starting from the last domain and outputting the current solution when reaching the first domain (Line 2).6: Ordered twig matching solution construction has the same content as ∆Σj when Lemmas 4.t.index[h])). The ordered twig matching solution construction. by means of the checkPostDir() function. Moreover. Instead of performing all these checks.index[h]) (12) okT oContinue ← false.. it is a function which recursively builds each solution one step at a time. i. shown in detail in Figure B.post(Dh .Dn . (19) updateStackMax(postStack. is inspired by the path one. at each step the algorithm verifies. (16) else (17) if(post(Dh−1 .index[1]. (7) else if (twigBlock(h)=0) (8) updateStackMax(postStack.6.13. if the post order of the current node and the one in the preceding domain have the correct post-order direction (increasing / decreasing) w. Figure B.i).252 The complete XML matching algorithms procedure findSolsOrd(h. (9) if(twigBlock(h − 1)>0) (10) curP ost ← pop(postStack). 4.post(Dh . 4.i)>curP ost) (18) okT oContinue ← true.index[h])). in the ordered twig matching we would have to do all the post-order checks defined in Lemma 4. (20) if(okT oContinue) (21) findSolsOrd(h − 1.Dh−1 .e. the function is more complex since the solutions have to be checked while being built. (13) if(checkPostDir(Dh .index[n]) (3) else (4) if(twigBlock(h)<0) (5) for i = 0 to twigBlock(h) (6) push(postStack.14 i have been applied.r. the algorithm checks that the post-orders of all the children of a given node are actually smaller than the parent one: This is done by using a stack . However.i)). (2) if(h = 1) output(D1 .1.5. the corresponding twig query nodes (Line 13). 4.

Beside these functions.7). . We modified the isNeedOrd() function (now called isNeedOrdCont()) and updateRightLists() (now called updateRightListsCont) in order to consider target information. in case of a parent node (block opening. . -n for n blocks closing (i. in particular some that help to identify value constrained leaves (isConstrainedLeaf() and getValueConstrainedLeaves()). If a block is closing (Line 4). S = (s1 . and a function named twigBlock(). the twigBlock() function returns an integer: 1 for block opening (i. Now that we have briefly introduced the new used functions we can start to analyze the algorithm (detailed in Figure B. for a given node. if all the check succeed. Finally.2 Content-based index optimized version Basically. Then. by calling the function findSolsOrd().2) but since a twig could have more then one leaf with a specified value condition and some query element could not be related to target leaves by an ancestor-descendant relationship we have some differences. which helps in identifying the structure of the query. Theorem B. i.5). the algorithm continues by recursively calling itself (Line 21). In the following we assume that the returned list is not empty and for each leaf li contained . Line 17 checks if such value.4). establish which is the current document target leaf for a specified element (getTarget()). representing the maximum post order of the children. B. such post order is then updated in case of other siblings (Lines 7-8) in order to keep the maximum one of the current block. its “blocks” of children.2. generates the solution S. the given node is the father of other nodes). we introduced a completely new set of functions needed to manage document target lists (see Table B.Content-based index optimized version 253 structure named postStack. Before starting to analyze the algorithm we need to introduce some new functions (see Table B. sn ) ∈ ∆ansj (D) iff the Q algorithm. After the computation of the scan range (Line 0) we retrieve the list of query leaves that specify a value condition by calling the function getValueConstrainedLeaves() (Line 1). . 0 otherwise (no blocks opening or closing). is less than the post of the current node (Line 17).e.e. the current post order is saved in the stack (Line 6).e. and check if a skip is applicable or not (canSkipOrd()). Line 9) such value is retrieved and.3. the given node is a leaf and is the last children for n parents). and this will be the first case since we are constructing the solutions from the last domain. . In particular. the content-based index optimized algorithm follows the same principles explained for the path algorithm (see Section B.2 For each data node dj . if the post order direction check succeeds (Line 13). in which the post orders of the children nodes are kept.

From Lines 2 to 5 we retrieve (through the index) for each leaf li in targetLeaves an iterator that holds the list of document elements that match with element li and satisfy the specified value condition (function getMatchList() on Line 3). If one of these lists is found to be empty (Lines 4 and 5) we can stop the algorithm because no answer will be found in the input document. possibly propagating the update. Updates the pointers in the (pre + 1)-th curDocP re) list.4: Ordered twig matching functions for content-based search in the list we have an available content-based index. if the adjustment fails we can stop the algorithm because some target list is found to be empty and the relative domain is also empty (i. following the deletion of the pos-th element from the pre-th list. At each step of the scan we first try to adjust current targets in order to be coherent with the current document pre-order value (k) and with the current state of domains (Line 9). no answer can be found in the input document.254 The complete XML matching algorithms QUERY NAVIGATION isConstrainedLeaf(element) Returns true if the passed element is a value constrained leaf getValueConstrainedLeaves Returns the list of leaf that specifies (signature) a value condition getTarget(pre) Returns the current target for twig pre-th element.pos.e. current targets and current domains is possible to perform a Skip Table B. Then we perform the first target list alignment by calling the function firstAlignment() (Line 6) that will be discussed later. by convention if the element is a target itself it returns the relative target MATCHING isNeededOrdCont(pre. no other match for the element could . Again. at least one list result empty after the attempt to align the lists) we can stop the algorithm.elem) Checks if insertion of given elem is needed in the pre-th list considering the current targets updateRightListsCont(pre.e. If the alignment fails (i. It also perform target alignment if it is needed canSkipOrd(post) Returns true if evaluating the current document post-order value. otherwise we can start the sequential scan.

On Line 21.10 we are sure that di .8).Content-based index optimized version 255 TARGET LIST MANAGEMENT adjustTargetsOrd(pre) Adjusts the current targets depending on the current document pre and domains.2. taking in consideration pre as the start scanning pre-order value. For each update we need to re-adjust the current targets so on Lines 13 and 14 we apply the same schema applied on Lines 9 and 10. Returns true if the operation was completed successfully firstAlignment(pre) Performs the first target list alignment. If the check succeeds. Let us now discuss more deeply the newly proposed functions (shown in Figure B.2). it has to be noted that on Line 24 we call the new defined function updateRightListsCont() instead of updateRightLists() and on Line 25 we call isNeedOrdCont() instead of isNeedOrd(). We start with the modified version of isNeedOrd() called isNeedOrdCont(). at least is an ancestor of one target) we check if the post-order value of di is smaller than the the post-order value of the current target related to the i-th element. From Line 11 to Line 14 we check if it is possible to perform a skip by repetitively calling the function canSkipOrd(). If we can perform a skip on Line 12 we update the current document pre-order value k with the approximate value of the first following of current element (see Section B. all the differences between the original version are on the first two lines. The rest of the algorithm from Line 23 to Line 32 works exactly as the previous one. Returns true if the operation was completed successfully Table B. Returns true if the operation was completed successfully alignment(pre. From Line 15 to 20 the algorithm works exactly as the previous one.e. we know that element di is not an ancestor of the related current target and due the Lemma 4.element) Performs the alignment between the target list associated to element and the subsequent one. If the i-th element of the query twig has a reference target (i. after the deletion of an element from the relative domain.5: Target list management functions be found). we check if the deletion is relative to the first element of a domain that belong to a target element (say li ). in this case we need to perform an alignment between the target list targetListli and the the target list targetListli+1 (see Definition 4.3.2) by calling the function alignment() (Line 22). thus it will not be further discussed.

(post(dk ).pointerToLast(Dh−1 ))). (21) if(isConstrainedLeaf(qi ) ∧ pos = 1) (22) alignment(k.getCondition(li )). (1) targetLeaves ← getValueConstrainedLeaves(Q). (20) delete(Di . Also the function updateRightLists() has been modified since current targets depend also .256 The complete XML matching algorithms Input: query Q having signature sig(Q).k). (25) if(¬isEmpty(Dh−1 ) ∧ isNeededOrdCont(h. rew(Q) Output: ansQ (D) algorithm OrderedTwigMatchCont(Q) (0) getRange(start. (8) for each dk where k ranges from start to end (9) if(¬adjustTargetsOrd(k)) (10) exit.di ). Figure B.di ).di )) (19) pos ← index(Di .getValue(li ). (11) while(canSkipOrd(post(dk ))) (12) k ← post(dk ) + 1. (27) if(h = n) (28) findSolsOrd(h. (15) for each h such that qh = dk in descending order (16) for each Di where i ranges from 1 to n (17) for each di in Di in ascending order (18) if(post(di )<post(dk ) ∧ isCleanable(i.qi ) (23) if(i = n) (24) updateRightListsCont(i.dk )) (26) push(Dh . (29) deleteLast(Dh ). (13) if(¬adjustTargetsOrd(k)) (14) exit. so di can be considered an useless element and we can directly return false (Line 1). (2) for each li in targetLeaves (3) targetListli ← getMatchList(li . (4) if(¬hasNext(targetListli )) (5) exit.7: Content index optimized ordered twig matching algorithm will not be an ancestor of any subsequent target. end).pos.1). (30) for each Di where i ranges from 1 to n (31) if(isEmpty(Di ) ∧ last(qi )<k) (32) exit. (6) if(¬firstAlignment(start)) (7) exit.

function canSkipOrd(post) (0) for i from 1 to n (1) if( ¬(hasAReferenceTarget(i) ∨ isContrainedLeaf(qi )) ∨ post ≥post(getTarget(i)) ) (2) return false.rP os.Content-based index optimized version function isNeededOrdCont(i. we check if we delete the first occurrence of a . 257 Figure B. (4) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di )) (5) return false. (3) else (4) if( isEmpty(Di ) ∨ (hasAReferenceTarget(i) ∧ post >maxPost(Di )) ) (5) return true. (7) delete(Di+1 . procedure updateRightListsCont(i. (13) else (14) decreasePointer(di+1 ).8: Content index optimized ordered twig matching algorithm auxiliary functions by the status of the domains associated to target leaves.di ) (0) if(hasAReferenceTarget(i) ∧ post(di ) < post(getTarget(i))) (1) return false. (4) if(pointer(di+1 )=1) (5) for each di+1 from di+1 in descending order (6) rP os ← index(Di+1 . (8) return true.pos.di+1 ). (6) if(isEmpty(Di−1 ) ∨ minPost(Di−1 )>post(di )) (7) return false. (8) if(isConstrainedLeaf(qi+1 ) ∧ rP os = 1) (9) alignment(pre.pre) (1) for each di+1 in Di+1 in descending order (2) if(pointer(di+1 )<pos) (3) return.di+1 ). After the deletion of an occurrence (Line 7). (12) return. (2) if(i = 1) (3) return true.pre).qi+1 ) (10) if(i + 1 = n) (11) updateRightListsCont(i + 1.

7) domains are cleaned (Lines 16 to 24) only when a match occurs. since preceding domains are not empty.9 and B. As in the path matching algorithm. otherwise we return false. If the check fails we need to check the state of the domain Di . It has to be noted that there is no default return value. In the latter case the next match for li is a descendant of the current document element. The condition explained in Section 4.3.10). . Now let us analyze the target lists management functions (Figures B. In Section 4.3. If the domain is empty or the current document post-order value is greater than the maximum postorder value of its occurrences. The rest of the function remains equal to the previous version.2 takes in consideration only empty domains however since in the main matching algorithm (see Figure B. a domain could be not physically but substantially empty. For each domain. The function canSkipOrd() establishes if it is possible to perform a skip. useful matches for element qi could be found in the current document subtree. the functions that we are going to describe are used to align and maintain aligned the used lists. After the retrieving. At the beginning of the new algorithm we retrieve through the content indexes the lists of document leaves that satisfy the value conditions specified by the query. if the check succeeds we must return false because in the former case we have no information about next matches for elements that are related with target leaves only by following-preceding relationships and.258 The complete XML matching algorithms domain associated to a value constrained leaf. this is because at least the last domain is always empty (in the main matching algorithm as we insert an element in the last domain we generate all the correspondent answers and then we remove it) so if we reach the last domain (all preceding domain are not empty and no return condition has be found) if the last leaf is value constrained and the next match for it is not descendant of the current document element we return true. if the check succeeds we need to perform an alignment between the target list targetListqi+1 and the the subsequent one by calling the function alignment() (Line 9). we can safely perform a skip and we return true. for each managed list we hold the last accessed element (for list targetListli the last accessed element is curT argetli ) that represents the current document target for the correspondent element.2 and this function simply checks those conditions. the conditions under which a skip is safe are described in Section 4. this is the reason why we have introduced the second part of the condition. lists are accessed by an iterator pattern.3. The function is made by a main loop that analyzes each Di in increasing order of i.2 we have introduced the definition of alignment property that holds for the ordered twig matching. we first check if it belongs to an element related to targets by a following-preceding relationship or if the current document post-order value is greater or equal to the post-order value of the current target for li .

we have a valid target for each element) we can return true (Line 10). function firstAlignment(pre) (0) foreach targetListli in targetLists (1) if(i = 1) (2) minP re ← pre. 259 Figure B. . (2) while((curT argetl1 is not null) ∧ pre(curT argetl1 )<pre) (3) if(hasNext(targetListl1 )) (4) curT argetl1 ← getNext(targetListl1 ).9: Ordered Target list management functions (part 1) lists are not aligned. The minP re value is the first pre-order value scanned by the main algorithm (the value start returned by the getRange() function) for the first list or the pre-order value of the target of the preceding list for the other lists.e.l1 ). (7) if(isEmpty(Dl1 )) (8) return false. if we reach the end of the main loop (i. The function firstAlignment() moves the iterator of each list until it founds an element that have a pre-order value greater than minP re (loop from Lines 5 to 9). (3) else (4) minP re ← pre(curT argetli−1 ). we return false (Line 7). At each step of the main algorithm we may need to update the targets. (9) while(pre(curT argetli )<minP re) (10) return true. (9) return alignment(pre.Content-based index optimized version function adjustTargetsOrd(pre) (0) if((curT argetl1 is null) ∧ isEmpty(Dl1 )) (1) return false. in order to perform a first alignment we call the function firstAlignment(). As we reach the end of a list. (8) curT argetli ← getNext(targetListli ). it has to be noted that instead of introducing a new function we could perform the same task by initializing all the curT argets and then call the function adjustTargetsOrd() but since the first alignment is simpler than a normal adjustment (we have not to check domains because at the beginning they are all empty) we have chosen to develop an ad-hoc function. (5) else (6) curT argetl1 ← null. (5) do (6) if(¬hasNext(targetListli )) (7) return false.

From Line 8 to Line 14 we update the target for leaf li+1 if it is needed as we described for the . (13) if(isEmpty(Dli+1 )) (14) return false.li+1 .e. on Line 0 if we have required the alignment between the last constrained leaf and the subsequent one (that does not exist) we simply return true (Line 1). (11) else (12) curT argetli+1 ← null. Basically the function is recursive. From Line 2 to Line 8 we update the current target for the first constrained leaf if it is needed (Line 4) and until we reach the end of the list. The alignment between the target list for element li and element li+1 is performed by the function alignment(). during precedent adjustments we reach the end of the target list) and the correspondent domain is empty (Line 0) we can return false (Line 1) because independently of any other condition we are sure than no other answer could be found in the input document.260 The complete XML matching algorithms function alignment(pre.10: Ordered Target list management functions (part 2) this is made by calling the function adjustTargetsOrd(). If we do not have a target for the first constrained leaf (i. Finally we start the alignment for the subsequent targets by calling the function alignment() and we return its result (Line 9). in the latter case we set the current target as null (Line 6) and if the domain associated to the first constrained leaf is empty we return false (Line 8) for the same reasons explained before. (6) if(minP re < pre) (7) minP re ← pre. From Line 2 to Line 7 we establish the minimum pre-order value (minP re) that target for leaf li+1 must assume. (8) while((curT argetli+1 is not null) ∧ (pre(curT argetli+1 )<minP re)) (9) if(hasNext(targetListli+1 )) (10) curT argetli+1 ← getNext(targetListli+1 ). Code from Line 2 to Line 7 establishes the value for minP re following the above definition and taking into account that some of the value could be not defined. (4) else (5) minP re ← pre(Dli [0]).li ) (0) if( li+1 ) (1) return true. (15) return alignment(pre. (2) if(isEmpty(Dli )) (3) minP re ← pre(curT argetli ).targetLists. Figure B.

In particular. -1 if the node is a leaf firstLeaf() Returns the pre-order of the first leaf in twig isLeaf(pre) Returns true if given twig node is a leaf..6: Unordered twig matching functions adjustTargetsOrd() function..elem) Checks if insertion of given elem is needed in the pre-th list (unordered version) updateDescLists(pre.) Used by findSolsUnord to navigate domains extendSols(. false otherwise parent(pre) Returns the pre-order of the parent of the given twig node siblings(pre) Returns the pre-orders of the siblings of the given twig node SOLUTION CONSTRUCTION findSolsUnord(.3 shows a summary of the functions employed which have not already introduced for the ordered case.. B. and updateDescLists()..4 B. commenting on the parts which differ from the ordered one discussed in previous section. following the deletion of the pos-th element from the pre-th list TWIG QUERY NAVIGATION firstChild(pre) Returns the pre-order of the first child of the given twig node..pos) Updates the pointers in the descendants of the pre-th list.B.4 Unordered twig pattern matching 261 MATCHING isNeededUnord(pre.4. On Line 15 we recursively call the function in order to perform the alignment between the target list for li+1 and the one for li+2 and we return its result.1 Unordered twig pattern matching Standard version In this section we will show the complete unordered twig matching algorithm.) Used by findSolsUnord to build the solutions Table B.) Recursively builds the unordered twig solutions preVisit(. updating domains after a deletion. possibly propagating the update. Table B.. the upper part shows the new functions which interact with the main algorithm in order to produce the matching results: isNeededUnord(). Such functions are the unordered counterparts of the isNeededOrd() and updateRightLists() discussed in the or- . checking whether a node insertion can be avoided.

and.dk )) (11) push(Dh . since the solution construction is different from the ordered case. we work on the current node (Lines 10-13). in particular POT2 and POT3. whenever a node di is deleted.pos). additional twig query navigation functions are needed. (14) for each Di where i ranges from 1 to n (15) if(isEmpty(Di ) ∧ last(qi )<k) (16) exit. indexesList). thus. new functions are also needed in this respect (findSolsUnord(). The unordered twig matching algorithm is shown in Figure B.11. preVisit() and extendSols()). Figure B. the . we update the pointers in the subsequent lists. which are quite self-explanatory. This time.262 The complete XML matching algorithms Input: query Q having signature sig(Q). if a deletion is performed. (7) delete(Di . condition PRO1 is not available and. (12) if(isLeaf(h) ∧ noEmptyLists()) (13) findSolsUnord(firstLeaf(). (8) if(¬isLeaf(i)) (9) updateDescLists(i.di ). h.(post(dk ). These functions will be discussed later while explaining the solution construction algorithm. (1) for each dk where k ranges from start to end (2) for each h such that qh = dk in descending order (3) for each Di where i ranges from 1 to n (4) for each di in Di in ascending order (5) if(post(di )<post(dk ) ∧ isCleanable(i.di ). (10) if(¬isEmpty(Dparent(h) ) ∧ isNeededUnord(h. (Lines 3-9) and.-1. rew(Q) Output: ansQ (D) algorithm UnorderedTwigMatch(Q) (0) getRange(start.pointerToLast(Dparent(h) )))). in this case in the descendant ones (Line 9). In this case.di )) (6) pos ← index(Di . The boolean function isCleanable() is the same of the ordered case and will not be further discussed. Line 10) and verifying if new solutions can be generated (Lines 12-13). Finally. the algorithm exits whenever any stack Di is empty and no data node matching with qi will be accessed (Lines 14-16). we first try to delete nodes by means of the post-order conditions. Then. As in the other two algorithms we analysed. Further. checking if an insertion is needed (condition PRU and POT1. end).11: Unordered twig matching algorithm dered case. we do not delete a node from the last stack after solution construction as in the other algorithms.

but in this case condition PRU is exploited instead of PRO2. checks the condition shown in Condition POT1. is the parent child one (Line 3 of isNeededUnOrd()). at each call it updates the pointers in all the domains children of the given one. (4) for each h in children (5) for each dh in Dh in descending order (6) if(pointer(dh )<pos) (7) break. before inserting a new node (Lines 10-11 of the main algorithm). instead of updating the pointers on the following domain. (15) else (16) decreasePointer(dh ). In this case the only relation required and.Standard version function isNeededUnord(i. updates the pointer of all the nodes pointing to di in order to make it point to the node below di (Line 16 of the function). like isNeededOrd(). It basically works in the same manner as updateRightLists() for the ordered case but.dh ). As to the current node insertion (Lines 10-11 of the main algorithm). 263 Figure B.12: Unordered twig matching auxiliary functions updateDescLists() function. all the considerations done for the ordered case are still true.di ) (1) if(i = 1) (2) return true. which. making it sufficient to only check Dparent(h) .12. (2) if(i = -1) (3) children ← i ∪ siblings(i). we call the boolean function isNeededUnord().dh ). Finally. (8) if(pointer(dh )=1) (9) for each dh from dh in descending order (10) dP os ← index(Dh . (12) if(¬isLeaf(h)) (13) updateDescLists(h. the only one that has to be checked. thus. (14) break. (3) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di )) (4) return false. possibly propagating the update to the descendants. procedure updateDescLists(i.pos) (1) i ← firstChild(i). (5) return true. shown in detail in Figure B. Then. (11) delete(Dh . as in the other .dP os).

and all its required functions. call (Line 13 of the main algorithm) the recursive function findSolsUnord() implementing Theorem 4.lastLeaf . (7) for each s in siblings(prec) (8) preVisit(s. (5) for each s in siblings(h) (6) preVisit(s.indexesList).13: Unordered twig matching solution construction (part 1) algorithms.lastLeaf .14.264 The complete XML matching algorithms procedure findSolsUnord(h.e. i.lastLeaf .i. gradually checking and producing all the answers by extending them with the nodes contained in the associated domains with the extendSols() function. 4. are shown in detail in Figure B. (2) h ← firstChild(h). In this case the solution construction starts from the first leaf (see the initial call at Line 13 of the main algorithm).indexesList) (3) if(h>1) (4) findSolsUnord(parent(h).9.lastLeaf .indexesList). The unordered twig matching solution construction. (3) if(h = -1) (4) preVisit(h.prec.indexesList).h. (5) else (6) extendSols (h.lastLeaf . 4.h.lastLeaf . then navigates all the query nodes one by one.-1.lastLeaf . we check if new solutions can be generated (following Lemma 4.indexesList). in this case. since the pointers are now from children to parents and not from right to left domains. The solutions are kept in indexesList.13.0.lastLeaf . Chapter 4) and. it i i has the same content as ∆Σj when Lemmas 4.14 i have been applied.parent(h). Lemma B.lastLeaf . 4.1.cont+1). procedure preVisit(h.indexesList). as in the path and ordered matching case.8.7. an index array pointing to the domain nodes. Figure B. which contains. the list Di is a subset of Σj containing only the data entries that cannot be deleted from Σj .prec.indexesList).6. (9) if(h>1) (10) findSolsUnord(parent(h).16. and 4.indexesList) (1) if(isLeaf(h)) (2) extendSols (h.13 and B. for each of them.5 At each step j and for each query index i. .indexesList) (1) extendSols (h. In this case the step-by-step backward construction behavior of the other cases would not be the best choice.prec.

t. (10) if(dir>0) (11) for each i = 1 to pointer(dprec ) (12) if(checkPostDir(Dh .prec. (6) index[h] ← i. in this way we extend the current solution only with the pointed node.lastLeaf . doing the opposite: Starting from the last node in the child domain to be included in the solution. which is a sort of upper bound. thus exploiting the available node pointers going in the same direction. Indeed. findSolsUnord() goes up the query twig by recursively calling itself up to the query root node (Lines 4. (20) i ← i-1. 10 of its code). before navigating up it calls on each of the siblings the preVisit() function (Line 8).dir.r. we extend the solutions with all the nodes which point to the parent node.14: Unordered twig matching solution construction (part 2) Let us examine the way in which the query domains are navigated in order to build the solutions: Starting from the first leaf.6 of its code). first we go up from a leaf to its parent. the pointers in the domain nodes and the solution construction. we go downward from it to its other children.dprec )) (18) extend indexes in indexesList. all the query nodes (domains) are covered and we move from one domain to the other in the most suitable way w. Then. or to any node above it (Lines 15-20 of . (7) put index in indexesList. which recursively explores in pre-visit.i. 265 Figure B. For each of the navigated nodes having right sibilings. (8) else (9) dprec ← Dprec . from parent to child.dprec )) (13) extend indexes in indexesList.indexesList) (1) indexesListOrig ← indexesList.i.Standard version procedure extendSols(h.index[prec]. (14) else (15) i ← |Dh | (16) while(pointer(Dh . the subtrees of the given nodes (Lines 4. (2) for each index in |indexesListOrig| (3) if(dir=0) (4) for i = 1 to |Dh | (5) if (lastLeaf =h) i ← |Dh |. when we have reached a parent node. (19) if (lastLeaf =h) break. In this way. thus covering the left most path. and the ones underlying it (the same as in path and ordered construction) (Lines 10-13 of extendSols()).i)>=index[prec]) (17) if(checkPostDir(Dh .

S = (s1 . all the post-order checks defined in Lemma 4.3. as we said before. From Line 9 to Line 12 we check if it is possible to perform a skip by repetitively call the function canSkipUnord(). . Lines 3-7 of extendSols()). This is necessary since. Line 2 of findSolsUnord(). duplicate solutions would be generated. The dir parameter actually codes the direction in which to perform the extension: 1 if going up.3 For each data node dj . In other words. by calling the function findSolsUnord().3 are performed (Lines 12.4. Since there are not many differences with respect to the previous cases. we first briefly review the main matching algorithm and then we show the new introduced functions. During the extesions. no other match for the element could be found). . the node starting solution construction is not deleted immediately after it. If we can perform a skip on Line 10 we update the current document pre-order value k with the approximate value of the first following of current element (see Section B. without this check.266 The complete XML matching algorithms extendSols()).2). . sn ) ∈ ∆U ansj (D) iff the Q algorithm. Starting from Line 6 we have the main scanning loop that reads the input document using the previously computed range. the parameter lastLeaf is the pre-order of the query node which started solution construction: This is used at Line 19 of extendSols() in order to limit the extension relative to this domain only to the last inserted node. 0 is used to insert the first domain nodes in the solutions (no actual extension. and therefore. Theorem B. At each step of the scan we first try to adjust current targets in order to be coherent with the current document pre-order value k (Line 7). in the “downward” solution extension the parent node acts as a lower bound. Finally. If the adjustment fails we can stop the algorithm because some target list is found to be empty and the relative domain is also empty (i. as in the other matching algorithms.7). . 17). -1 if going down. From Line 0 to Line 5 we perform the same operations discussed for the ordered match algorithm (see Section B.e. We have also modified the function verifying if an element is needed in order to take in consideration target information (See Table B. generates the solution S.3. As we did for the ordered case we introduce the functions needed to adjust current targets and to establish if it is possible to perform a skip.15) is based on the observations made in Section 4.2.3. For each update we need to re-adjust the current targets so on Lines 11 and 12 we apply the same schema applied .2 Content-based index optimized version The content-based optimized algorithm for unordered pattern matching (detailed in Figure B. B.2).

On Line 0 we check if the current document postorder is smaller than the reference target of the current element. The function canSkipUnord() simply calls the recursive function checkSkipUnord() by passing the post-order value passed to it and 1 as pre-order value. It has to be noted that the function must be called on elements that have a reference target or are target themselves (the root element has always a reference target as long as a target exists). As for the ordered case all the differences are in the first two lines where if the ith twig element has a reference target and the current document post-order value is smaller than the one of the current target for the element then we can directly return false. current targets and current domains is possible to perform a Skip TARGET LIST MANAGEMENT adjustTargetsUnord(pre) Adjusts the current targets depending on the current document pre. and possibly its child/descendant domains. It has to be noted that on Line 21 we call the newly defined function isNeedUnordCont() instead of isNeedUnord().elem) Checks if insertion of given elem is needed in the pre-th list considering the current targets canSkipUnord(post) Returns true if evaluating the current document post-order value. thus it will not be further discussed. If the domain associated to the pre-element is empty or the current document . if so we go ahead with the analysis otherwise we directly return false (Line 15. In Figure B.16 we have the matching auxiliary functions.Content-based index optimized version 267 MATCHING isNeededUnordCont(pre.7: Target list management functions on Lines 7 and 8. that means that the function will start the domain check from the root domain. Function checkSkipUnord() returns true if analyzing the pre-domain. Now we can examine the auxiliary functions used by the main algorithm. First we have the modified version of isNeededUnord(). otherwise it returns false. Returns true if the operation was completed successfully Table B. the target for the current element is a descendant of the current document element). The rest of the function is equal to the previous version and it will not be discussed further. a skip is considered safe. From Line 13 to 28 the algorithm works exactly as the previous one. called isNeededUnordCont().

pointerToLast(Dparent(h) )))).getValue(li ).di ). (6) for each dk where k ranges from start to end (7) if(¬adjustTargetsUnord(k)) (8) exit.3 for the Skipping Policy) we can directly return true (Line 2).di ).268 The complete XML matching algorithms Input: query Q having signature sig(Q). end). (4) if(¬hasNext(targetListli )) (5) exit.dk )) (22) push(Dh . (1) targetLeaves ← getValueConstrainedLeaves(Q). (25) for each Di where i ranges from 1 to n (27) if(isEmpty(Di ) ∧ last(qi )<k) (28) exit. If the current element has at least one child that is related to targets only by following- . h.-1.3. (23) if(isLeaf(h) ∧ noEmptyLists()) (24) findSolsUnord(firstLeaf(). (18) delete(Di .(post(dk ). (11) if(¬adjustTargetsUnord(k)) (12) exit. If the domain is not empty we need to verify what kind of children has the current element and possibly call the function again on them.getCondition(li )).di )) (17) pos ← index(Di . indexesList). (2) for each li in targetLeaves (3) targetListli ← getMatchList(li . (13) for each h such that qh = dk in descending order (14) for each Di where i ranges from 1 to n (15) for each di in Di in ascending order (16) if(post(di )<post(dk ) ∧ isCleanable(i.15: Content index optimized unordered twig matching algorithm post-order value is greater then the maximum post-order value in the predomain (recall that since domains are cleaned only when we found a match a domain could be not physically empty but substantially empty and see Section 4. (9) while(canSkipUnord(post(dk ))) (10) k ← post(dk ) + 1. Figure B. rew(Q) Output: ansQ (D) algorithm UnorderedTwigMatchCont(Q) (0) getRange(start. (21) if(¬isEmpty(Dparent(h) ) ∧ isNeededUnordCont(h. (19) if(¬isLeaf(i)) (20) updateDescLists(i.pos).

post) (0) if(post < post(getTarget(pre))) (1) if( isEmpty(Dpre ) ∨ (hasAReferenceTarget(pre) ∧ post >maxPost(Dpre )) ) (2) return true. function canSkipUnord(post) (0) return checkSkipUnord(1.Content-based index optimized version function isNeededUnordCont(i.post)) (12) return false.16: Content index optimized unordered twig matching algorithm auxiliary functions preceding relationship (Line 9. Finally we have the target list management functions (see Figure B. (7) children ← j ∪ siblings(j).17). Since there is no order constraint. 269 Figure B.post). (5) if(j = -1) (6) return true.di ) (0) if(hasAReferenceTarget(i) ∧ post(di ) < post(getTarget(i))) (1) return false. (6) return true. (8) for each qi in children (9) if(¬(hasAReferenceTarget(i) ∨ isContrainedLeaf(qi ))) (10) return false. it does not exist an alignment property. for those elements we have no information about next matches and since the parent domain is not empty useful match could be found in the subtree of the current document element) or at least one check over its child fails (Line 11). function checkSkipUnord(pre. (3) else (4) j ← firstChild(pre). (13) return true. (4) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di )) (5) return false. (14) else (15) return false. so the only target management function is adjustTargetsUnord(). (2) if(i = 1) (3) return true. This . (11) if(¬checkSkipUnord(i. we return false (Lines 10 and 12) otherwise we can safely return true (Line 13).

(10)return true. If during the search of a new target we reach the end of the relative target list we set that target to null (Line 7) and if the relative domain is empty we return false for the same reasons explained before. B.5. (3) while((curT argetli is not null) ∧ pre(curT argetli )<pre) (4) if(hasNext(targetListli )) (5) curT argetli ← getNext(targetListli ). Each filter outputs a rough range and the final range (used by algorithms) is obtained by intersecting those ranges. Otherwise we update each target as needed until we found a target that has a pre-order value greater or equal to the current document one (loop from Line 3 to Line 9).17: Unordered Target list management functions function adjusts the current targets in order to be coherent to the current document pre-order value.270 The complete XML matching algorithms function adjustTargetsUnord(pre) (0) for each li in targetLeaves (1) if((curT argetli is null) ∧ isEmpty(Dli )) (2) return false. (6) else (7) curT argetli ← null.5 Sequential scan range filters In this section we will give more detail over the current solutions used to delimit the portion of the document that has to be scanned. Figure B.4 and B. B. The function is made by a main loop (from Line 0 to Line 9) that goes through all the targets possibly trying to update them.1. Finally if we have successfully updated all the targets we can return true (Line 10). Scan range is actually computed by two different kinds of filter. Thanks to condition PRO2 we can safely start the sequential .1 Basic filter The first kind of filter operates starting from the observations made in Section 4. (8) if(isEmpty(Dli )) (9) return false. If at least one target is found to be null (that means that previously we reach the end of its target list) and its domain is found to be empty (Line 1) then we return false (Line 2) because no other match could be found in the current document.

Ordered match requires by definition (see Lemma 4. Given a query node qj that specifies a value condition and with a content-based index built on it we can obtain from the index all the document elements that satisfy the specified condition. By analyzing specific ranges we can simply conclude if a document cannot contain a solution.lastV(qi )].last(qi )]. given an element de that satisfies the value condition specified by the query. If some twig element specifies a value condition and we have a content index for those elements we can limit the specific range for an element qi to [firstV(qi ). we call this ancestor d1e ) that matches with q1 . for each element of the query we can compute a specific range that represents the part of the document where we can find occurrences of that element that is [first(qi ). Unordered match requires only a partial order between answer node preordes so checks related to a node qi are performed only with node qj with qj ∈ descendants(qi ). if the document contains a solution with de then this solution lies on the subtree rooted from the furthest ancestor of de (i. the one with the smallest pre-order value. For each query node qi with i ∈ [1.1) that answer elements are totaly ordered by their pre-order values so subsequent elements in the query must be subsequent in the answer. while to establish the right limit of the range we need to separate the ordered (and path) case form the unordered one. During the computation of the range we can also make some checks that could identify an empty answer space. Again we need to analyze separately ordered and unordered case. Lemma 4. respectively. B. due to Lemma 4. where firstV(qi ) and lastV(qi ) return the first and the last pre-order value for element qi in the document satisfying the specified value condition. It is obvious that these checks represent a necessary condition but not a sufficient one. n − 1] we need to perform the described check with any node qj with j ∈ [i + 1.f fd1e −1] . we can stop the scan with the last occurrence of qn (last(qn )) whereas.15. the document cannot contain any answer. For each of these elements de we can identify a range [d1e .e.16 suggests to stop with the maximum value among last(ql ) for each leaf l in the query. this filter can be used only if the query specifies some value condition and if we have content-based index built on nodes that specify those conditions.Content-based filter 271 scan from the first occurrence of q1 (first(q1 )).5. n].2 Content-based filter As the name suggests. In the former case. in the latter case. If a specific range of an element qj ends before the beginning of the specific range of qi with i < j. The basic idea for this filter is that.

maxEnd(qj )]. .272 The complete XML matching algorithms and considering all these ranges we can identify a single range for the query node qj that is [minStart(qj ). The second kind of filter computes such range for all nodes qj that satisfy the conditions explained before and then returns a range obtained from the intersections of these ranges with the whole data tree range. where minStart(qj ) represents the minimum start (start with the smallest pre-order) for the previous ranges while maxEnd(qj ) represents the maximum end (end with the greatest preorder) for the previous ranges.

minL. j1 ] ∈ σ1 . . . the formula is minL − 1 − (d − 1) ∗ q − (q − 1) ∗ 2. . In this second case. σ2 [i2 . . σ2 [p2 −w +1 . In this case. . j1 ] ∈ σ1 . notice that showing Theorem 1. . .4. . σ2 [i2 . 2.1. j2 ] ∈ σ2 ) does .1. 1. j1 ]. (qsub − 1) ∗ 2. Let extP osF ilter(σ1 . Proof of Theorem 1. p2 ]) > d then it is obvious that a pair (σ1 [i1 . For each common term σ1 [p1 ] = σ2 [p2 ] the corresponding counter has the following value: σ1 c[p1 ] < c. There is no common term between the two sequences. σ2 [i2 . Let us start from the threshold of count filter of Prop. then does not exist a pair (σ1 [i1 . . .1 Proofs of Chapter 1 Proof of Proposition 1. σ2 [i2 . . max(|σ1 |. σ2 . . |σ2 |) −1 − (d − 1) ∗ q. d) returns FALSE. . .e. if we show that for each common term σ1 [p1 ] = σ2 [p2 ] if σ1 c[p1 ] < c then ed(σ1 [p1 −w +1 . (j2 − i2 + 1) ≥ minL and ed(σ1 [i1 . . . First. . which is equivalent to minL + 1 − (d + 1) ∗ q. σ2 . p1 ]. σ2 [i2 . . j2 ] ∈ σ2 ) such that (j1 − i1 + 1) ≥ minL. . j1 ]. j2 ]) ≤ d. . (j2 − i2 + 1) ≥ minL and ed(σ1 [i1 . and let us substitute max(|σ(σ1 )|. . |σ(σ2 )|) with the minimum length minL and subtract the number of qsub -grams with wild cards to the total count. Then. it is obvious that a pair (σ1 [i1 . d) return FALSE then there are two alternatives: 1. minL.1 is equivalent to prove the following statement: if extP osF ilter(σ1 . . i. j1 ] ∈ σ1 . . .Appendix C Proofs C. j2 ] ∈ σ2 ) does not exist so that (j1 − i1 + 1) ≥ minL. j2 ]) ≤ d.

σ2 (w2 )) = card(P P3 ) = σ1 c[p1 ] < c P P3 ⊆P P2 ≤ card(P P2 ) P P2 ⊆P P1 ≤ card(P P1 ) where card(P P1 ) = σ1 c[p1 ] < c from the filter definition and by hypothesis. Let w = minL be the window size. From the edit distance definition and from the fact that P P3 contains the ordered and non-repeated equal terms. . k2 ) ∈ P P2 : (k1 . . k2 ) ∈ P P1 : k2 = k2 . ˆ P P2 be a subset of P P1 . with the following additional property: ∀(k1 .274 Proofs not exist so that (j1 − i1 + 1) ≥ minL. Indeed.e. j2 ]) ≤ d. . . It follows that ed(σ1 (w1 ). m]} obviously contains the set of pairs j i p (k) given by the permutation pm maximizing the similarity: {(ck . Given the set P P of position pairs (k1 . satisfying the following additional property: ∀(k1 . k2 ) ∈ P P2 : k1 = k1 . C.1. . that is the two approximations are upper bounds of the corresponding asymmetric document similarity values. ch ) | k ∈ [1. p1 ]. (k1 . cj m ) | k ∈ i [1. σ2 (w2 )) > d. k2 ) ∈ P P1 : (k1 . j1 ].2 Proofs of Chapter 3 Proof of Theorem 3. it directly follows that the P P3 cardinality is as follows: w − ed(σ1 (w1 ). But. Dj ) and aSim(Dj . First notice that aSim(Di . . n]. p2 ] be the two windows. (j2 − i2 + 1) ≥ minL and ed(σ1 [i1 . n]}. k2 ). σ2 [i2 . h ∈ [1. since c = w − d then w − ed(σ1 (w1 ). . σ2 (w2 )) < c. the set {(ck . . ˆ P P3 be the total ordered set from P P2 . Dj ) ≥ aSim(Di . Di ). . Di ) ≥ aSim(Dj . k2 ) in the interval (k1 ∈ w1 and k2 ∈ w2 ) whose corresponding terms in the two sequences are equal (σ1 [k1 ] = σ2 [k2 ]). Then. and w1 ≡ [p1 − w + 1 . k2 ) : if k1 < k1 then k2 < k2 . i. w2 ≡ [p2 − w + 1 . w − ed(σ1 (w1 ). σ2 (w2 )) < w − d. let us introduce three subsets of P P of pairs of terms with the following properties: ˆ P P1 is obtained from P P by pruning out the pairs for which another pair in P P exists with the same position in σ2 : ∀(k1 .

Di ) then aSim(Di . sn ) ∈ ∆ansj (D) and sh = k. Dj ). If post(qi+j ) < post(qi ). By analogy. Dj ) ≤ Sim(Di . Dj ) ≤ sim(Di . the node i + j in the query is a descendant of the node i.3 Proofs of Chapter 4 275 Furthermore. 3. Di ) ≤ Sim(Di . Let us suppose that aSim (Di . that is asim(Dj . It follows that if aSim(Di . 3. Di ) or the other way round. Dj ) ≤ γ δ aSim(Dj . and δ are positive values. either aSim(Di . Moreover from Eq. Obviously. whereas if aSim(Di . Dj ). node i + j must be either the descendant or the following node of i. if post(qi+j ) > post(qi ). Di ) then α ≤ β ⇒ α ≤ α+β ≤ β that is aSim(Di . Dj ) γ δ γ γ+δ δ ≤ Sim(Di . Di ) ≤ aSim(Dj . α+β γ+δ ? ? ? ? C. we state that either asim(Di . As to the second inequality notice that ≤ β ⇔ (α + β)δ ≤ (γ + δ)β ⇔ (αδ + βδ) ≤ (βγ + βδ) ⇔ αδ ≤ βγ which δ is true since α ≤ β . β.C. Dj ) ≤ aSim(Dj . thus also post(dsi+j ) > post(dsi ) must hold. Dj ) ≤ asim(Di . Dj ) ≤ asim (Dj . Di ).6. Let us suppose that j ∈ [k. Dj ) ≤ aSim(Dj . Di ). the node i + j in the query is a following node of i. Dj ) ≥ aSim(Dj . thus also post(dsi+j ) < post(dsi ) is required. Indeed. Notice that it should be si < k Q . . . . Dj ) ≤ aSim(Di . m] exists such that S = (s1 . Di ) ≤ sim(Di . γ.1. Dj ) = α and aSim(Dj . we have aSim(Di . Di ) or aSim(Di . Dj ) ≤ aSim(Dj . cj m i p (k) ) |ck | + i k=1 γ h=1 |ch | j δ where α. Proof of Lemma 4.3 Proofs of Chapter 4 Proof of Lemma 4. Dj ) ≤ aSim(Dj . Di ) then aSim(Dj . Di ). cj m ) + i m |cj m k=1 n p (k) | · sim(ck . Dj ) ≤ aSim (Di . . Dj ) ≤ aSim (Di . it follows that sim(Di . Di ) = β . from Eq. Dj ) ≤ aSim(Dj . Dj ) from which follow the statements of the theorem. In the same way. if aSim(Di . Di ) then γ δ aSim(Dj . Dj ) ≥ aSim(Dj . Dj ) is equal to α n n β |ck | i k=1 · p (k) sim(ck .5.1. Di ) ≤ Sim(Di . Because the index i increases according to the preorder sequence.

. Notice that (s1 . × ∆Σj −1 .10. ∆Σk = ∅ i h and ∆Σk ∩ ∆Σk = ∅ then si ∈ ∆Σk exists such that si < k . . . m].6) 4. h − 1]. wherever the facts 2. sh −1 ) ∈ Q ∆Σj × . . Therefore. then k does not belong to ∆ansj (D). . without any / Q knowledge about the data nodes following k in the sequential scanning. m] exists such that S = (s1 . j ∈ [k. . . . such that for each i. si ∈ S. . . As to j ∈ [k + 1. due to the premise. h = n (condition of Lemma 4. Proof of Lemma 4. for each i < h . sn ) ∈ ∆ansj (D). m]. ∆Σk = ∅ (condition of Lemma 4. In the former case. let us consider the two possible alternatives: either post(dk ) < post(dj ) or post(dk ) > post(dj ). Moreover h = n i i i and thus for any solution S = (s1 . for each i ∈ [1. it easily follows that post(dj ) < post(dj ). . .5) i 3. for each i ∈ [1. . .11. It follows that dj is a preceding of dj and thus post(dj ) < post(dj ).×∆Σk −1 exists. . . i Proof of Theorem 4.3. . . post(dj ) < post(dj ) when j = k. at step k. . Q Proof of Lemma 4. . at step k . and 4 are true then (s1 . . . . . .10. sn ) such that sh = k it must be sh+1 > k . . from Lemma 4.276 Proofs but ∆Σk = ∅ and thus no index si exists such that dsi = qi and si < k. sh+1 . . Q 2. then S can not belong to ∆ansj . . The latter case means that dk is an ancestor of dj since k < j . . . The proof is ab absurdo. sh −1 ) belongs to an answer S and k belongs to it too. Obviously. In particular we will show that the following four facts together constitute an absurd: 1. sh−1 ) ∈ ∆Σk × 1 . . . sn ) ∈ ∆ansj (D) Q Q iff j exist such that sj > k for each j ≥ j. Let us consider sj . sh . post(dsi ) < post(dsj ). because for each (s1 . Moreover dj is a preceding of dk since post(dj ) < post(dk ) and j < k. But it is impossible since post(dsi ) < post(dk ) and. ∆Σk ∩ ∆Σk = ∅ (condition of Lemma i i 4. . As. si < k . .4) Indeed. either i exists such that si > k or given that 1 h (s1 . Being si < sj then S is a solution iff post(dsi ) > post(dsj ). . sn > k that is S ∈ ansk (D). Let dk be a data node with k ≤ k and dk = qh . . S could belong to ∆ansj . . Let us suppose that j ∈ [k. h − 1].1.

. In this way. .13. thus for each j ∈ [k.3.3 Proofs of Chapter 4 277 Proof of Lemma 4. (si+1 . k can be deleted from ∆Σj . k 2. both in the ı ı case of ordered and unordered matching must satisfy the following property: k post(dsi ) > post(ds¯). Thus the above ı condition can not be satisfied. . . i h Proof of Lemma 4. si > si and post(dsi ) < post(dsi ). wherever the fact above is true. . post(ds ) > post(dsi ). due to Lemma 4. . Proof of Lemma 4. h − 1]. m] exists such that S = (s1 . can be deleted from its domain. si does not belong to ∆ansj (D). Q Proof of Theorem 4. the node ı ı matching with q¯. being post(q¯) < post(qi ).×∆Σn .13) Indeed. i = 1 and for each ¯ ∈ [1. m] for each solution S = (s1 . × ∆Σj n i+1 exists such that for each i > i such that post(qi ) < post(qi ). the condition post(dsi ) > post(ds¯) can be ı i satisfied iff s¯ > k. .12. . the data nodes of such a partial solution can .e. n] exists such ı Q that post(q¯) < post(qi ). i exists such that it is required that post(dsi ) < post(dsi ) but post(dsi ) > post(dsi ). . . Notice that due to Lemma 4. n]. and that ¯ ∈ [1. whenever post(dsi ) < post(dk ) we will show that the following two facts together constitute an absurd: 1. . sn ) ∈ ∆(U )ansj (D) and si ∈ S. But being post(dsi ) < post(dk ). that with pre-order value equal to s¯. sn ) ∈ ∆(U )ansj (D) involving s1 it should Q be post(ds1 ) > post(dsi ). Q j j sn ) ∈ ∆Σi+1 ×. On the other hand. On the other hand.10 it follows that post(dsi ) < post(ds ) for each s > k and S ∈ ∆(U )ansj (D) iff at least one si is greater than k.10 ı post(dsi ) < post(ds ) for each s > k and thus also for s = s¯.14. On the other hand being post(ds1 ) < post(dk ).11 each data node in ∆Σk−1 . for i ∈ [1. . . For a data node si ∈ ∆Σk whenever post(dsi ) > i post(dk ) there is no way to predict the relationship between the post-order of the data node si and that of the nodes after k. Indeed. sn ) ∈ ∆Σj × . .5. . . .C. wherever ∆Σ¯ = ∅ or for each ı ı k s ∈ ∆Σ¯ . . First notice that for each qi for i ∈ [2. from Lemma 4. . i. Notice that. . n] such that post(q¯) < post(qi ) and ∆Σ¯ = ∅ ı ı ı k and exists s ∈ ∆Σ¯ such that s > k and post(ds ) > post(dsi ) (condition i of Lemma 4. s2 . j ∈ [k. post(q1 ) > post(qi ) since q1 is the root of the twig pattern. Let us suppose that j ∈ [k. i each ∆Σk is empty and thus due to Lemma 4. because for each (si+1 . m].

P We first show that (s1 . . Thus S is a solution if si > k that is si must be accessed after k in the sequential scanning. . . . sn ) can s is a solution as s1 < s2 < . 4.11. The set of index tuples specified in the theorem is a subset of the cartesian product ∆Σk × . If (s1 . × ∆Σk as. × ∆Σk as. . . sn ) which satisfies conditions 1 and 2 is a solution and . . sn ) ∈ ansk−1 (D) is the Q same as in the path case. sn ) ∈ ∆ansk (D). . . . sn ) is a solution.8. .14. and 4. n − 1] and vice versa. . sn ) ∈ ∆ansk (D) then P s si ∈ ∆Σi i+1 for each i ∈ [1. . .13. .9. and 4.5. n − 1] then (s1 . Proof of Theorem 4. Moreover any index tuple (s1 . post(dsi ) > post(dsi+1 ) since. . Moreover by hypothesis i > h exists Q such that post(qh ) > post(qi ). . 4.4. Let us suppose that S = (s1 .7. we have to show that if (s1 . < sn whereas condition 2 explicitly requires that the relationships between postorders are satisfied. . 4. . as in the path case. . . Proofs Proof of Lemma 4.5. . . .278 be the ones specified in the fact above. 4. 1 n and 4. Notice that (s1 . . sn ) ∈ ansk−1 (D). 4. we never delete useful data nodes. .11. . Finally the fact that (s1 . .16. . sn ) exists such that S ∈ ∆ansk and then sh = k. < sn and thus si must be processed before si+1 in the sequens tial scanning that is si ∈ ∆Σi i+1 . . by applying Lemmas 4. . . due n P to Lemma 4. . . 4. it straightforwardly follow that sn ∈ Σk−1 n which is one of the domains of ansk−1 (D). sn ) ∈ ∆ansk (D) P then s1 < .4. Q Proof of Theorem 4. . k = sn and thus. . . . . . for each i ∈ [1.6. sh .6. by applying Lemmas 1 n 4. < sn because si ∈ ∆Σi i+1 and. we have to show s that if si ∈ ∆Σi i+1 for each i ∈ [1. sn ) which satisfies conditions 1 and 2 is a solution because. The other way around. It is true as sn ∈ ∆Σk if. . . we never delete useful data nodes. by applying Lemmas n 1 4. .4. . . .13. we never delete useful data nodes. .9. . . n − 1]. i Finally. × ∆Σk as. . condition 1 implies that s1 < s2 < . we show that (s1 .14. we have that post(dsi ) > post(dj ). P Proof of Theorem 4. . . . 4. The set of index tuples specified in the theorem is a subset of the cartesian product ∆Σk × . 4. Given that premise. 4.5. .6. . For this reason S can not belong to ∆ansk . Moreover any index tuple (s1 . . due to Lemma 4.4. The set of answers ∆ansk (D) is a subset of the P cartesian product ∆Σk × . for each j such that si < j ≤ k being si ∈ ∆Σk .

for i i i j ≥ j. as such domains are stacks and k ≤ (r + 1) thus (i1 . Indeed. i at Line 10 the algorithm adds dj to the right stack. . let i i i us suppose that ∆Σr ∩ ∆Σk = (i1 . . . If k can not belong to ∆Σj due to its pre-order value. The proof i i i is by induction. then it is sufficient to check stack Dh−1 . In the first case.4 Proofs of Appendix B the proof is similar to the above ones.C.∪Dn . At Lines 9-10. as whenever a stack is empty we empty all the stacks at “its right”. in ) are i at the bottom of the stack and the deletion of (i1 . the algorithm should have deleted k from Di (see Line 5). .1. Then there are two alternatives: either post(dk )>post(dk ) or post(dk )<post(dk ). notice that Di ⊆ Σj as for each data node dj . at the j-th step we delete all the “unnecessary” data nodes. Let us suppose that ∆Σj = ∅ with k ≤ j ≤ j. but it is impossible because the ∆Σr+1 = ∅. it straightforward follows that post(dk )>post(dk ) as due to the premise post(dk )>post(dk ) whereas the second case is impossible as when k was added to Di . Let k ∈ Di and let top(Di )=k then k < k as k is at the top of Di and the data signature is scanned in a sequential way. we show that if for each step j with k ≤ j ≤ j we have that ∆Σj = ∅ then ∆Σj ∩ ∆Σk = ∅. . . assuming that Di = ∆Σj . . in Lemma B. In particular.5. . due i to Lemma 4.4 Proofs of Appendix B Proof of Lemma B. The algorithm detects all these conditions and delete k from Di . Moreover the algorithm deletes some indexes from Di by means of the pop and empty operators. . in ) implies the deletion of all the data node in ∆Σr+1 . Proof of Lemma B. We have to show that if for each i i step j with k ≤ j ≤ (r + 1) ∆Σj = ∅ then ∆Σr+1 ∩ ∆Σk = ∅. . .1. .2. i i Proof of Lemma B. 279 C. . When j = k and ∆Σk = ∅. i then being k ≤ j and j ≤ j it easily follows that ∆Σj ∩ ∆Σk = ∅. i i As far as the opposite direction is involved.2 it has been shown that ∆Σj must be empty in order that ∆Σk ∩∆Σj = ∅. Notice that i i i ∆Σr+1 ∩ ∆Σk = ∅ iff at step r + 1 we delete the index set (i1 .3. At Lines 6-8. . . First. it applies Lemma 4. it can be due to one of three i possible alternatives shown in Theorem 4. . then ∆Σk ∩ ∆Σk = ∆Σk = ∅. in ). As if for each step j with k ≤ j ≤ r ∆Σj = ∅ then ∆Σr ∩ ∆Σk = ∅. in ) from i i ∆Σr+1 . i i i i Let us suppose that the statement is true for j = r. . . Thus. But.6 it deletes all the nodes in Di+1 ∪. we show it for j = r + 1. .

i Proof of Theorem B. it applies Lemma 4. Moreover. It follows that the algorithm never deletes from the stacks data nodes which could belong to a delta answer set ∆ansj (D) for j ∈ [j.4. m]. the nodes from the bottom of Di up to the node pointed by si+1 are all those nodes matching qi and whose pre-order value is s less than si+1 . .1. . sn ) such that s si ∈ ∆Σi i+1 . no other data node can be deleted from that stack due to its post-order value. n]. . .4. . in this case. i. it sets the pointer to the top of the “preceding” stack Dh−1 (Line 10).3 states that si ∈ ∆Σk iff si ∈ Di at step k. the “chain” of i pointers followed by the function showSolutions() allows us to state that such a function only generates those solutions S = (s1 . as the algorithm sequentially scans the data signature.1. k ∈ Di at step j iff P k ∈ ∆Σj . S = (s1 . whenever the algorithm adds a new data node to any stack Dh . si ∈ ∆Σi i+1 and si ∈ ∆Σk . the algorithm correctly avoid to go on with Di whenever post(top(Di ))>post(dj ) as. Thus.e. Indeed. all those in Σi i+1 . .280 Proofs Finally.11. Lemma i P B. . Thus. the algorithm deletes k from Di due to its post-order value at Lines 4-5 where it applies Lemma 4. due to Lemma B. Moreover. Notice that. . . Due to Theorem 4. at Line 13. sn ) belongs s to ∆ansj (D) iff for each i ∈ [1. for each data node si+1 in Di+1 .

-I. Verdejo. Amitay. W. Niblack. In Proc. [9] R. Soffer. [2] R. Gonnet. A. Addison Wesley. and U. of the International Workshop and Symposium on String Processing and Information Retrieval (SPIRE 1999). 1995. C. Baeza-Yates and G. and Translation in Time-Series Databases. K. [3] R. H. 1996. Sivan. Modern Information Retrieval. Agrawal. Baader. A. In Combinatorial Pattern Matching. 2000. The Effects of Word Order and Segmentation on Translation Retrieval Performance. In Proc. R. Sattler. N. u [7] R. and F. Department of Commerce/National Institute of Standards and Technology. Baldwin and H. 7th Annual Symposium. In Proc. Nelken.Bibliography [1] Secure Hash Standard. Tanaka. . Scaling. Multi-resolution disambiguation of term occurrences. Baeza-Yates and G. and A. H. Navarro. Faloutsos. 2003. of 4th International Conference on Foundations of Data Organization and Algorithms (FODO 1993). Description logics for the semantic web. [5] J. Baeza-Yates and B. Artiles. 2002. Sawhney. 16(4). Efficient Similarity Search In Sequence Databases. Swami. and A. 1999. Agrawal. of Senseval-3. [6] F. U. Shim. Technical Report FIPS PUB 180-1. 1995. In Proc. of 21th International Conference on Very Large Data Bases (VLDB 1995). 2004. A. Fast Similarity Search in the Presence of Noise.S. S. R. [8] R. 1999. In Proc. Ribeiro-Neto. Penas. Word Sense Disambiguation based on term to term similarity in a context space. [10] T. Horrocks. K¨nstliche Intelligenz. 1993. A Fast Algorithm on Average for AllAgainst-All Sequence Matching. Lin. A Faster Algorithm for Approximate String Matching. of the 18th International Conference on Computational Linguistics (COLING 2000). of the 12th Conference on Information and Knowledge Management (CIKM 2003). [4] E. In Proc. and K. I.

and G. Copy Detection Mechanisms for Digital Documents. [12] R. Seeger. Copy Protection http://www.282 BIBLIOGRAPHY [11] S. 2003. 2004. of the Twenty-seventh Int. and J. [13] B. and Hector Garcia-Molina. J. Visual Web information extraction with Lixto.0: An XML Query Language. On Matching Schemas Automatically. and J. H. 2001. [22] Sergev Brin. 1996. Z. [23] A. Technical Report MSR-TR-2001-17. Microsoft Research (MSR). Florescu. L. Robs the Future. Pedersen. Conference on Web Information Systems Engineering. S. 284(5). and G. M. Braga and A. Bernstein and Erhard Rahm. 2001. An asymptotically optimal multiversion b-tree. a [16] S. 1995. [21] D. Sander. A Graphical Environment to Query XML Data with XQuery. M. In Proc. Widmayer. C. of the 2nd ISWC Conference. Kr¨ger. Gschwind.. Bowker and M. 5(4). Flesca. W3C Working Draft. D. Semantic coordination: a new approach and an application. Sim´on. B. 1997. Breunig. Robie. M.bricklin. 2003. Manasse. Barlow. Research and Training (LR4Trans-II 2004). Ohler. P. Hendler. S. [15] Philip A. S. Scientific American. e 2003. In Proc. [17] P. SIGMOD Record (ACM Special Interest Group on Management of Data). Bricklin. James Davis. [14] T. T. P. 30(2). In Proc. [18] L. of the 4th Intl. Banerjee and T. . [20] M. Campi. Fern´ndez. Computer Networks and ISDN Systems. Gottlob. of 18th IJCAI Conference. VLDB J. and O. Kriegel. Bilingual concordancers and translation memories: A comparative evaluation. Berners-Lee. XQuery 1. and S. In Proc. Zanobini. S. Chamberlin. Zweig. of the 2nd International Workshop on Language Resources for Translation Work.com/robfuture. Lassila. J. In Proc. of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD 1995). Bouquet. Conference on Very Large Data Bases. 2001. Syntactic clustering of the Web. Serafini. In Proc. 29(8-13). [19] D. Extended gloss overlaps as a measure of semantic relatedness. Baumgartner. F.htm. Glassman. 2003. Broder. Boag. The Semantic Web. Becker. and P. D. 2001. Data Bubbles: o Quality Preserving Performance Boosting for Hierarchical Clustering.

Chvez and G. Holistic twig joins: optimal XML pattern matching. J. Ciaccia and M. Hirst. [35] P. [30] T. Trans. Chien. In Proc. of 23rd International Conference on Very Large Data Bases (VLDB). In Proc. Searching in Metric Spaces with User-defined and Approximate Distances. and M. [36] P. Z. Chan and A. Lu. Tsotras. In Proc. Lenzerini. Zaniolo. Frieder. 2002. M-Tree: An efficient access method for similarity search in metric spaces. Fu. 20(2). 1999. Grossman. Koudas. Calvanese. Tsotras. In Proc. Patella. Navarro. 2005. In Proc. 2002. Zezula.. Castelli and P. M.-C. G. Ciaccia. 2002. 2001. In Proc. [28] D. and D. V.BIBLIOGRAPHY 283 [24] R. 2002. [25] N. A System for Building Expandable Digital Libraries. and T. M. Srivastava. 11(4). Zhang. application-oriented evaluation of five measures. 2000. 2002. 1996. Wang Ling. Chien. J. Chen.-P. [34] E. Efficient time series matching by wavelets. Budanitsky and G. In Proc. and P.D. on Database Systems (TODS). 2003. Patella.-Y. [29] K. Semantic distance in wordnet: an experimental. V. Efficient structural joins on indexed XML documents. Bruno. 2002. D. and C. [26] A. ACM Transactions on Information Systems. of the 15th International Conference on Data Engineering (ICDE 1999). Example-Based Machine Translation in the Pangloss Systems. Collection statistics for fast duplicate document detection. of 16th International Conference on Computational Linguistics. of the Third ACM/IEEE-CS Joint Conference on Digital Libraries. of the ACM SIGMOD. 1997. N. and D. VLDB J. [27] D. Pagano. of the NAACL 2001 Workshop on WordNet and Other Lexical Resources. Y. Zaniolo. Vagena. Chowdhury. In Proceedings of the 2002 International Conference on Management of Data (SIGMOD 2002). On boosting holism in xml twig pattern matching using structural indexing techniques. . Vardi. In Proceedings of 28th International Conference on Very Large Data Bases (VLDB 2002). [32] S. 4(27). and C. of the Nineteenth ACMSIGMOD-SIGACT-SIGART (PODS-00). [33] A. Efficient schemes for managing multiversionxml documents. A metric index for approximate string matching. In Proc. O. of the 5th Latin American Symposium on Theoretical Informatics. [31] S. De Giacomo. W. View-Based Query Processing for Regular Path Queries with Inverse. Brown.

[42] L. Fast Approximate Matching Using Suffix Trees. Currim. Papageorgiou. Scalas. [46] Atril Deja Vu . and S. Comparison of schema matching evaluations. Piperidis. In Proc. Cimiano. Rahm. Snodgrass. [40] A. of the 2002 ACM SIGMOD Int. S. A Tale of Two Schemas: Creating a Temporal Schema from a Snapshot Schema with τ XSchema. 2004. 1982. 2002. Towards the self-annotating web.com. [38] R. Handschuh. [45] C. (EWCBR 1996).atril. 1993. Heraklion. Halevy. Domingos. of the 15th International Conference on Computational Linguistics (COLING 1994). Reconciling schemas of disparate data sources: a machine-learning approach. of the VLDB EEXTT Workshop. [41] P. [47] P. of the 2nd Int. S. In Proceedings of 14th Anual ACM Symposium on Theory of Computing (STOC 1982). G. 1996. of the 28th Conference on Very Large Data Bases (VLDB 2002). P. of the 3rd European Workshop on Advances in Case-Based Reasoning. In Proc. R. [44] F. In Proc. [49] H. In Proc. of EDBT. In Proc. S. A Matching Technique In Example-Based Machine Translation. Automatic meaning discovery using Google. Mecca. Relevance ranking tuning for similarity queries on xml data. Dietz. Maintaining Order in a Linked List. Cunningham. 2001. C. In Proc. University of Amsterdam. Do. In Proc. of the 6th International Symposium on Combinatorial Pattern Matching (CPM 1995).284 BIBLIOGRAPHY [37] P. Technical report. Y. H. .F. 2002. T. Merialdo. Dyreson. and R. De Castro. 1995. [43] V. Cobbs. 2002. RoadRunner: automatic data extraction from data-intensive web sites. 30(2). SIGMOD Record. Melnik. 1994. 2004. of the 13th World Wide Web Conference (WWW 2004). Vitanyi. and M. and E. [39] P. 2004. Crescenzi. Penzo. and A.M. Staab. Collins and P.Translation Memory and Productivity System. and S. F. Greece. Semantic interoperability of multitemporal relational databases. of ER. 2002. Adaptation Guided Retrieval in EBMT: A Case-Based Approach to Machine Translation. Doan. Home page http://www. and P. Cranias. Ciaccia and W. [50] A. In Proc. Workshop on Web Databases. In Proc. Rahm. Conference on Management of Data (SIGMOD 2002). [48] H.B. Grandi. Cilibrasi and P. In Proc. Do and E. Currim. COMA – A system for flexible combination of schema matching approaches.

1995. Srivastava. P. and D. and J. Eng. Staircase Join: Teach a Relational DBMS to Watch its (Axis) Steps. In Proc. Gravano. . Tiberio. and J. June 2003. Mandreoli. Teubner. Van Keulen. M. [63] T. Ipeirotis. S. Jordan. of the 4th South American Workshop on String Processing. Mandreoli. [62] T. Grandi and F. [60] F. 1997. of the ACM SAC. A temporal data model and management system for normative texts in xml format. P. [57] R. In Proceedings of 29th International Conference on Very Large Data Bases (VLDB 2003). Muthukrishnan. Temporal modelling and management of normative documents in xml format.V. P. T.cirsfid. and Y. 1994. Temporal slicing in the evaluation of xml queries. Bergonzini. F. Snodgrass. Koudas. F. and M. Bergonzini. of VLDB. 2001. T.it/eGov03. [55] C. Data Knowl. Ranganathan. Grust. Grandi. Benoit..BIBLIOGRAPHY 285 [51] B. Accelerating XPath location steps. 54(3). Fast Subsequence Matching in Time-Series Databases. [52] The “Semantic web techniques for the management of digital identity and the access to norms” PRIN Project Home Page. http://www. 1999. [58] F. H. and M. Dorr. of the 11th Natlional Conference on Advanced Database Systems (SEBD). Ontology-focused crawling of web documents. In Proc. In Proc. Italy. [59] F. In Proc. 2005.G. 49. The TSQL2 Temporal Query Language. LA. Hischke. Grandi. November 2003. M. In Proceedings of the ACM International Conference on Management of Data (SIGMOD 2002). 2003. In Proc. New Orleans. A General Technique to Improve Filter Algorithms for Approximate String Matching. Jagadish. Kluwer Academic Publishing. Gao and R. of the 1994 ACM SIGMOD International Conference on Management of Data (ICMD 1994). S. Berlin. Faloutsos. and Enno Ohlebusch. Advances in Computers. Grust. Snodgrass et al. Tiberio. F.unibo. A survey of current research in machine translation. New York. In Proc. Kurtz. 2003. of 27th International Conference on Very Large DataBases (VLDB 2001). Giegerich. P. Approximate String Joins in a Database (Almost) for Free. In Proc. 2003. of the 15th ACM Intl’ Workshop on Web Information and Data Management (WIDM). Cetraro. Manolopoulos. Germany. N. [53] Marc Ehrig and Alexander Maedche. [56] D. Mandreoli. A temporal data model and system architecture for the management of normative texts. [61] L. [54] R.

A Polynomial Algorithm in Linear Programming. [76] C. H. August 31 – September 3. 1979. J. [75] Seung-Kyum Kim and Sharma Chakravarthy. Jensen. Van Keulen. 1988. M. Variable Length Queries for Time Series Data. Koch. K. and J. of ISWC 2003. Xu Yu. Dyreson. Algorithms for Clustering Data. [67] I. of the 17th International Conference on Data Engineering (ICDE 2001). ACM Transactions on Database Systems. S. Scherzinger. N. [65] S. 2002. Accelerating XPath evaluation in any RDBMS. Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art.) et al. 2004. Horrocks and P. Veronis. Jain.286 BIBLIOGRAPHY [64] T. N. Canada. [72] H. FluXQuery: An Optimizing XQuery Processor for Streaming XML Data. Modeling time: Adequacy of three distinct time concepts for temporal data. and S. Toronto. Heintze. V. 24(1). 2003. Lu. In Proc. and P. 1998. 29(1). In Proceedings of 29th International Conference on Very Large Data Bases (VLDB 2003). and (Eds. [71] C. Khachiyan.G. S. Grust. The Consensus Glossary of Temporal Database Concepts . Stegmaier. Etzion. Flynn. C. and T. 1999. S. Temporal Databases — Research and Practice. Teubner. Singh. N. Sripada. C. and J. 31(3). D. [68] N. [74] L. Dubes. 244. In Proc. Srivastava. and B. [66] N. of 30th International Confl on Very Large Data Bases (VLDB 2004). Jiang. Guha. In Proc. 2003. In O. In Proc. Doklady Akademii Nauk SSSR. Yu. ACM Computing Surveys. In Proc. TX. [73] T. 2004. K. K. F. E. Springer-Verlag. of ACM SIGMOD. LNCS No. Reducing owl entailment to description logic satisfiability. Data clustering: a review. Kahveci and A. H. Koudas. Murty. Wang. 1998. LNCS No. [69] A. Schweikardt. Second Usenix Workshop on Electronic Commerce. December 1993. Prentice Hall Inc. [70] A. Jagadish. 823. Arlington. 2001. editors. Holistic Twig Joins on Indexed XML Documents. Approximate XML joins. Scalable Document Fingerprinting. of 12th International Conference on the Entity-Relationship Approach (ER’93). 1996. W. Jain and R. Jajodia.. 1399. Computational Linguistics.February 1998 Version. . Ide and J. Patel-Schneider. M. 2004.

Indexing and Retrieval of Scientific Literature. of CIKM 2005. of the 27th Conference on Very Large Data Bases (VLDB 2001). Versatile structural disambiguation for semantic-aware applications. R. [89] F. In Proc. Rahm. of the 10th International Conference on Extending Database Technology (EDBT 2006). and M. Corpus-based schema matching. Resource Description Framework (RDF) model and syntax specification. Combining local context and WordNet similarity for word sense identification. Swick. 36(2). 2005. Grandi. Martoglia. Martoglia. Y. R. 1998. of the International Conference on E-Government (DEXA EGOV 2005). P. Personalized access to multi-version norm texts in an egovernment scenario. Using Corpus Statistics and WordNet Relations for Sense Identification. Lee Giles. Doan. H. and E. Leacock and M. Madhavan. [88] F. . Bernstein.s of 8th International Conference on Information and Knowledge Management (CIKM 1999). [82] C. 2006. In Proc. In Proc. and C. Ronchetti. Scalas. Fellbaum. 1998. and C. Lawrence. Communications of the ACM. In Proc. 2005. Watermark-based copyright protection system security. 24(1). In C. M. A. editor. and E. F. Mandreoli. Miller. Madhavan. 46(10). Bernstein. 2006. P. ACM SIGIR Forum. K. [87] F. 2005. Supporting temporal slicing in xml databases. [80] S. [79] O. Y!Q: Contextual Search at the Point of Inspiration. R. 2003. 1998. Generic Schema Matching with Cupid. Computational Linguistics. Kwok. E. 2005. In Proc. Martoglia. Ronchetti. [85] J. Martoglia. C. W3C Working Draft WD-rdf-syntax-19981008. [83] J. Tiberio. [84] J. 1999. In Proc. [78] S. and A. Bremen. In Proc. Halevy. Chodorow. Chodorow. 2001.BIBLIOGRAPHY 287 [77] R. Ronchetti. and G. Chang. of the 10th International Conference on Extending Database Technology (EDBT 2006). and E. [86] F. WordNet: An electronic lexical database. Germany. F. R. Kraft. Maghoul. A. of 21st International Conference on Data Engineering (ICDE 2005). Mandreoli. Digital Copyright and the Progress of Science. Mandreoli. Leacock. of the 14th International Conference on Information Knowledge and Management (CIKM 2005). [81] C. Mandreoli. Strider: a versatile system for structural disambiguation. A. Lassila and R. MIT Press. Bollacher. R. A. P. Litman. Ronchetti. 2002. and E. In Proc.

Martoglia. In Proc. Five Papers on WordNet. Gross. 2003. Beckwith. of the 13th IEEE International Workshop on Research Issues in Data Engineering: Multi Lingual Information Management (RIDE-MLIM 2003). tagger. [91] F. Martoglia.288 BIBLIOGRAPHY [90] F. Mandreoli. Martoglia. [95] J. Similarity Flooding: A Versatile Graph Matching Algorithm and ist Application to Schema Matching. Mendelzon. International Journal of Digital Libraries. In Proc. Mandreoli. F. http://www. [101] G. Nato Publications. Rahm. Miller. Vaisman. of the 18th International Conference on Data Engineering (ICDE 2002). M. [99] A.Mason/software/tagger/. of the 10th Convegno su Sistemi Evoluti per Basi di Dati (SEBD 2002). WordNet: A Lexical Database for English.org/mpeg/standards/mpeg-7. [100] G. 2002. of the 5th International Conference on Web Information Systems Engineering (WISE 2004). Tiberio. QTag. Tiberio. In Proc. Nagao. In Proc.uk/O. R. and P. R.bham. Miller. of the 11th ACM Conference of Information and Knowledge Management (CIKM 2002). Dan Melamed. A Framework of a Mechanical Translation between Japanese and English by Analogy Principle. [93] F. MPEG-7 standard overview. 2004. Tiberio. 2002. Mandreoli. C. Tiberio. H. A. 4(3). 1995. Rizzolo. D. Martoglia. Indexing temporal xml documents. In Proc. Exploiting multi-lingual text potentialities in EBMT systems. [102] M. 2002. [94] F. Fellbaum. Miller. [96] O. O. [98] S. Martoglia. R. Mandreoli.A. Mandreoli. Searching Similar (Sub)sentences for Example Based Machine Translation.ac. and P. Melnik. and P. Automatic Evaluation and Uniform Filter Cascades for Inducing N-Best Translation Lexicons. In Proc. Tiberio. A Syntactic Approach for Searching Similarities within Sentences.chiariglione. 1993.A. 2004. Garcia-Molina. [92] F. of the Third Workshop on Very Large Corpora (WVLC3). R. Technical report. Princeton University’s Cognitive Science Laboratory. In Proc. and E. In CACM 38. ISO/IEC JTC1/SC29/WG11 N6828. and P. Martinez. and A. and K. R. A Document Comparison Scheme for Secure Duplicate Detection. and P. 2004. of VLDB. Mason. . Approximate Query Answering for a Heterogeneous XML Document Base. R. a probabilistic parts-of-speech http://web. 1984. [97] I. 1995.

BIBLIOGRAPHY 289 [103] G. Peim. and C. http://www. [106] Y.edu/plugins/owl/. Query rewriting for semistructured data. 1999. A. 6(1). Baeza-Yatesa. X. 2000. Rick. In Proc. Ma.com. Haas. Rodot`. [115] N. In Proc. [107] Y. of VLDB. Vasilevsky. R. Chen. Query Processing with Description Logic Ontologies Over Object-Wrapped Databases. W. Franconi. Information Processing Letters.stanford. 2004. 1998. A. of the 14th International Conference on Scientific and Statistical Database Management. and W. M. Roukos. 33(1). S. 72. New and faster filters for multiple approximate string matching. University of Bonn. [108] K. and C. Capabilities-Based Query Rewriting in Mediator Systems. 2002. In Proc. Distributed and Parallel Databases. Yuan. Papineni. 2002. Semantic Access: Semantic Interface for Querying Databases. e e [113] P. 2001. J. Navarro and R. In Proc. Paton. Dyreson. Jensen. 20(1). 2002. Paris. 1994. [111] M. Baeza-Yatesa. Bach Pedersen. In Proc. A. [105] G. ACM Computing Surveys. Shaposhnikov.athel. Disambiguating Noun Groupings with Respect to WordNet Senses. Rishe. Ward. [109] ParaConc . C. Extending practical pre-aggregation in on-line analytical processing. Vassalos. A. Introduction to the “one world. A. S. Goble. of the Third Workshop on Very Large Corpora. France. Athauda. A New Flexible Algorithm for the Longest Common Subsequence Problem. T. [112] Owl plugin for prot´g´. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002). of the 26th Conference on Very Large Data Bases (VLDB 2000). [114] C. one privacy” session. Gupta. [116] S. E. E. Bleu: a Method for Automatic Evaluation of Machine Translation. Computer Science Department IV. and D. S. . Zhu. 1999. In Proc. Navarro and R. Lu. Vaschillo. Technical report. N.Multilingual Concordancer. http://protege. A guided tour to approximate string matching. [104] G. Resnik. 1995. Very Fast and Simple Approximate String Matching. [110] T. 1999. of the ACM SIGMOD. Navarro. X. Papakonstantinou. Papakonstantinou and V. Random Structures and Algorithms. In Proc. of a 23rd Data Protection Commissioners Conference. A. and L.

Building a scalable and accurate copy detection mechanism. Shivakumar and H. Approssimazione semantica per routing di interrogazioni in un PDMS. Schlieder and Felix Naumann. In Proc. Review Article: Example-based Machine Translation. Sutinen and J. of the 2nd International Conference on Theory and Practice of Digital Libraries. of the 5th International Conference on Web Information Systems Engineering (WISE 2004). [125] E. 1991. In Proc. of the 7th annual Symposium on Combinatorial Pattern Matching. Iida. In Proc. Greco. X. In Proc.w3. Implicit user modeling for personalized search. 1995. . In Proc. In Proc. Approximate tree embedding for querying XML data. Experiments and Prospects of Example-based Machine Translation. Germany. [121] N. Universit` degli studi di Modena e Reggio Emilia. Sumita and H. On Using q-gram Locations in Approximate String Matching. Sassatelli. Garcia-Molina. Tarhio. In Proc. W3C Consortium. Efficient Query Reformulation in Peer Data Management Systems. Tarhio. Filtration with q-samples in Approximate String Matching. In Proc. 1990. of the 13th International Conference on Computational linguistics (COLING 1990). services activity. [126] E. of ACM SIGMOD. 1995. [119] T. 1999. of the 29th Annual Meeting of the Association for Computational Linguistics (ACL 1991). 2004. 2004. Zhai. [127] E. 14(2). Toward Memory-based Translation. 2004. 2005. Halevy. 1996. Sutinen and J. Tagarelli and S. In Proc. of the 1st ACM International Conference on Digital Libraries (ICDL 1996). Tatarinov and A. [118] S. Machine Translation. Garcia-Molina. Sato and M. [124] H. Somers. of ACM SIGIR Workshop On XML and Information Retrieval. Shivakumar and H. Shen. of 3rd Annual European Symposium. [129] I. [123] Web http://www. Master thesis. and C. B. SCAM: A Copy Detection Mechanism for Digital Documents. of CIKM 2005. 2000.290 BIBLIOGRAPHY [117] S. [128] A. [120] X. Clustering Transactional XML Data with Semantically-Enriched Content and Structural Features.org/2000/xp/Group/. Tan. Bremen. 1996. a 2004/2005. [122] N. In Proc. Nagao.

2003. W3C Recommendation. The MIT Press. of the WebDB Workshop. Zhang. F. Rabitti. 1992. Home page [134] E. 2002. Approximate String Matching with q-grams and Maximal Matches. Weikum. [141] K. In Proc. D. and Ontological Knowledge for Automatic Classification of XML Data. 2001. 1987. On supporting containment queries in relational database management systems. Mandreoli. [142] J. 2003. Y. 13(1). Amato. Maloney. Vitter.BIBLIOGRAPHY 291 [130] A. In Proc. Efficient temporal join processing using indices. Theobald. S. Lohman. R. and F. [131] M. Q. Zezula. S. Debole. R. In Proc. F. Annotation. 2000.com. The index-based XXL search engine for querying XML data with relevance ranking. 2002. ACM Transactions on Mathematical Software. [138] P. Shasha. Digital Libraries. 2001. Tree Signatures and Unordered XML Pattern Matching. Sander.Translation Memory Technologies. . Seeger. Tsotras. [140] D. Zhang. DeWitt. Zhou and J. and R. and G. and N. http://www. M. Exploiting Structure. Zhang. In Proc. 1992. Naughton. Theobald and Gerhard Weikum. 2287. Theoretical Computer Science. 2004. J. Statman. [139] C. of ACM SIGMOD. Arms. 92(1). Tree Signatures for XML Querying and Navigation. 2003. and G. and B. of 29th International Conference on Very Large Data Bases (VLDB). of ICDE. F. On the editing distance between unordered labeled trees. Information Processing Letters. XMLSchema. J. V. G. [137] P. Lecture Notes in Computer Science. [133] Trados Team Edition . Thompson. Zezula. Martoglia. An Efficient Algorithm for Sequential Random Sampling. Data bubbles for non-vector data: Speeding-up hierarchical clustering in arbitrary metric spaces. and D. Ukkonen. [132] H. In Proceedings of the XML Database Symposium (XSym 2003). [135] J. 42(3). M. Mendelsohn. D. Luo. Schenkel. In Proceedings of 30th International Conference on Current Trends in Theory and Practice of Informatics (SOFSEM 2004).trados. Beech. [136] W. J.