Edit Distance Textual Entailment Suite (EDITS)

User Manual - Version 2.1
Milen Kouylekov and Matteo Negri
Fondazione Bruno Kessler, FBK-irst
{kouylekov,negri}@fbk.eu

1 Introduction
Textual Entailment (TE) has been proposed as a unifying generic framework for modeling language variability and semantic inference in different Natural Language Processing (NLP) tasks. The Recognizing Textual Entailment (RTE) task (Dagan and Glickman, 2007) consists in deciding, given two text fragments (called Text, T, and Hypothesis, H, respectively), whether the meaning of H can be inferred from the meaning of T, as in:

T: "Yahoo acquired Overture"
H: "Yahoo owns Overture"

EDITS has been designed following three basic requirements:
• Modularity. The system architecture is such that the overall processing task is broken up into major modules. Modules can be composed through a configuration file, and extended as plug-ins according to individual requirements. The system's work-flow, the behavior of the basic components, and their I/O formats are described in comprehensive documentation available upon download.
• Flexibility. The system is general-purpose, and suited for any TE corpus provided in a simple XML format. In addition, both language-dependent and language-independent configurations are allowed by algorithms that manipulate different representations of the input data.
• Adaptability. Modules can be tuned over training data to optimize performance along several dimensions (e.g. overall Accuracy, Precision/Recall trade-off on YES and NO entailment judgements). In addition, an optimization component based on genetic algorithms is available to automatically set parameters starting from a basic configuration.

EDITS is open source, and available under the GNU Lesser General Public License (LGPL). The tool is implemented in Java, runs on Unix-based operating systems, and has been tested on Mac OS X, Linux, and Sun Solaris. The latest release of the package can be downloaded from http://edits.fbk.eu.
EDITS comes pre-packaged with:
• System Graphical Interface - an editor capable of handling the system core data structures;
• Set of System Configurations - configurations used to run EDITS, described in Section 4.2;
• Set of Cost Schemes - XML files used by the entailment engine, described in Section 5;
• Entailment Rules - XML files that contain knowledge extracted from WordNet, VerbOcean and Wikipedia, represented as rules;
• Set of RTE Datasets - the publicly available RTE corpora (RTE 1, 2, 3) in ETAF format, described in Section 4.1;
• Trained Models - configured and trained entailment engines used by the FBK Textual Entailment group;


• HTML Reference - an HTML document describing all the modules available in the system;
• INSTALL.txt - a file that describes the installation of EDITS.


2 System Overview

Figure 1: Entailment Engine

The EDITS package allows users to:
• Create an Entailment Engine (Figure 1) by defining its basic components (i.e. algorithms, cost schemes, rules, and optimizers);
• Train the Entailment Engine over an annotated RTE corpus (containing T-H pairs annotated in terms of entailment) to learn a Model;
• Use the Entailment Engine and the Model to assign an entailment judgment and a confidence score to each pair of an un-annotated test corpus.

EDITS implements a distance-based framework which assumes that the probability of an entailment relation between a given T-H pair is inversely proportional to the distance between T and H (i.e. the higher the distance, the lower the probability of entailment). Within this framework the system implements and harmonizes different approaches to distance computation, providing both edit distance algorithms and similarity algorithms (see Section 3.1). Each algorithm returns a normalized distance score (a number between 0 and 1). At the training stage, distance scores calculated over annotated T-H pairs are used to estimate a threshold that best separates positive from negative examples. The threshold, which is stored in a Model, is used at the test stage to assign an entailment judgment and a confidence score to each test pair.

In the creation of a distance Entailment Engine, algorithms are combined with cost schemes (see Section 3.2) that can be optimized to determine their behavior (see Section 3.3), and with optional external knowledge represented as rules (see Section 3.4). Besides the definition of a single Entailment Engine, a unique feature of EDITS is that it allows for the combination of multiple Entailment Engines in different ways (see Section 4.4).

Basic components are already provided with EDITS, making it possible to create a variety of entailment engines. The modular architecture also supports fast prototyping of new solutions: the system can be extended with new algorithms, cost schemes, rules, or plug-ins to new language processing components.
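The training and test stages of the distance-based framework can be sketched in a few lines of Python. This is an illustrative sketch only, not the EDITS implementation: the function names `learn_threshold` and `classify`, the toy distance scores, and the convention that a pair is judged YES when its distance is at or below the threshold are all assumptions made for the example.

```python
from typing import List, Tuple


def learn_threshold(scored: List[Tuple[float, str]]) -> float:
    """Pick the distance threshold that maximizes accuracy on annotated pairs.

    `scored` holds (normalized distance, gold label) tuples, label "YES" or "NO".
    Assumption: a pair is judged YES when distance <= threshold.
    """
    best_threshold, best_correct = 0.0, -1
    # Only distances observed in training need to be tried as thresholds.
    for candidate in sorted({distance for distance, _ in scored}):
        correct = sum(
            (distance <= candidate) == (label == "YES")
            for distance, label in scored
        )
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def classify(distance: float, threshold: float) -> str:
    """Assign an entailment judgment to an unseen pair."""
    return "YES" if distance <= threshold else "NO"


# Hypothetical training data: (distance score, annotated entailment).
training = [
    (0.10, "YES"), (0.25, "YES"), (0.30, "YES"),
    (0.40, "NO"), (0.70, "NO"), (0.90, "NO"),
]
threshold = learn_threshold(training)
print(threshold, classify(0.20, threshold), classify(0.80, threshold))
```

The sketch maximizes overall accuracy, which is the default criterion mentioned in Section 4.3; other criteria (e.g. a Precision/Recall trade-off) would only change the quantity counted inside the loop.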


3 Basic Components
This section overviews the main components of a distance Entailment Engine, namely: i) algorithms, ii) cost schemes, iii) the cost optimizer, and iv) entailment/contradiction rules.

3.1 Algorithms
Algorithms are used to compute a distance score between T-H pairs. EDITS provides a set of predefined algorithms, including edit distance algorithms, and similarity algorithms adapted to the proposed distance framework. The choice of the available algorithms is motivated by their wide use documented in the RTE literature.

Edit distance algorithms cast the RTE task as the problem of mapping the whole content of H into the content of T. Mappings are performed as sequences of editing operations (i.e. insertion, deletion, substitution of text portions) needed to transform T into H, where each edit operation has a cost associated with it. The distance algorithms available in the current release of the system are:
• Token Edit Distance: a token-based version of the Levenshtein distance algorithm, with edit operations defined over sequences of tokens of T and H;
• Tree Edit Distance: an implementation of the algorithm described in (Zhang and Shasha, 1990), with edit operations defined over single nodes of a syntactic representation of T and H.

Similarity algorithms are adapted to the EDITS distance framework by transforming measures of the lexical/semantic similarity between T and H into distance measures. These algorithms are also adapted to use the three edit operations to support overlap calculation, and to define term weights. For instance, substitutable terms in T and H can be treated as equal, and non-overlapping terms can be weighted proportionally to their insertion/deletion costs. The following similarity algorithms are available:
• Word Overlap: computes an overall (distance) score as the proportion of common words in T and H. In the current implementation the algorithm uses the cost scheme to find the least costly substitution of a word from H with a word from T. One word from T can substitute more than one of the words in H. The score returned by the algorithm is the sum of the costs of all substitutions divided by the number of words in H;
• Jaro-Winkler distance: a similarity algorithm between strings, adapted to similarity between words. The algorithm uses the cost scheme to define if two words are the same (they have a 0 cost of substitution). The entailment score is obtained by subtracting the obtained Jaro-Winkler metric from 1 (i.e. score(A,B)=1-JW(A,B));
• Cosine Similarity: a common vector-based similarity measure. The EDITS implementation uses the cost scheme to define if two words are the same (they have a 0 cost of substitution) and the weight of words (the cost of insertion for a word in H and the cost of deletion for a word in T);
• Longest Common Subsequence: searches for the longest possible sequence of words appearing both in T and H in the same order, normalizing its length by the length of H. The algorithm uses the cost scheme to define if two words are the same (they have a 0 cost of substitution). The entailment score is obtained by subtracting the obtained similarity from 1 (i.e. score(A,B)=1-(LCS(A,B)/words(B)));
• Jaccard Coefficient: compares the intersection of words in T and H to their union. The algorithm uses the cost scheme to define if two words are the same (they have a 0 cost of substitution). The entailment score is obtained by subtracting the obtained similarity from 1 (i.e. score(A,B)=1-JSC(A,B));


• Rouge: our implementation of a set of metrics for summarization evaluation (Lin, 2004).

3.2 Cost Schemes
Cost schemes are used to define the cost of each edit operation. Cost schemes are defined as XML files that explicitly associate a cost (a positive real number) to each edit operation applied to elements of T and H. Elements, referred to as A and B, can be of different types, depending on the algorithm used. For instance, Tree Edit Distance will manipulate nodes in a dependency tree representation, whereas Token Edit Distance and similarity algorithms will manipulate words.

<scheme>
  <insertion><cost>10</cost></insertion>
  <deletion><cost>10</cost></deletion>
  <substitution>
    <condition>(equals A B)</condition>
    <cost>0</cost>
  </substitution>
  <substitution>
    <condition>(not (equals A B))</condition>
    <cost>20</cost>
  </substitution>
</scheme>
Figure 2: Example of XML Cost Scheme

Figure 2 shows an example of a cost scheme, where edit operation costs are defined as follows:
Insertion(B)=10 - inserting an element B from H to T, no matter what B is, always costs 10;
Deletion(A)=10 - deleting an element A from T, no matter what A is, always costs 10;
Substitution(A,B)=0 if A=B - substituting A with B costs 0 if A and B are equal;
Substitution(A,B)=20 if A!=B - substituting A with B costs 20 if A and B are different.

In the distance-based framework adopted by EDITS, the interaction between algorithms and cost schemes plays a central role. Given a T-H pair, in fact, the distance score returned by an algorithm directly depends on the cost of the operations applied to transform T into H (edit distance algorithms), or on the cost of mapping words in H with words in T (similarity algorithms). Such interaction determines the overall behavior of an Entailment Engine, since distance scores returned by the same algorithm with different cost schemes can be considerably different. This allows users to define (and optimize, as explained in Section 3.3) the cost schemes that best suit the RTE data they want to model. For instance, when dealing with T-H pairs composed by texts that are much longer than the hypotheses (as in the RTE5 Campaign), setting low deletion costs avoids penalization to short Hs fully contained in the Ts.

To facilitate the usage of the system EDITS provides a mechanism for automatic generation of cost schemes. This mechanism creates a simple cost scheme that is compatible with the distance algorithm used and the resources of entailment/contradiction accessible to the system. More information about this mechanism can be found in Section 8.

EDITS provides two predefined cost schemes:
• Simple Cost Scheme - the one shown in Figure 2, setting fixed costs for each edit operation.

• IDF Cost Scheme - insertion and deletion costs for a word w are set to the inverse document frequency of w (IDF(w)). The substitution cost is set to 0 if a word w1 from T and a word w2 from H are the same, and IDF(w1)+IDF(w2) otherwise.

In the creation of new cost schemes, users can express edit operation costs, and conditions over the A and B elements, using a meta-language based on a lisp-like syntax (e.g. (+ (IDF A) (IDF B)), (not (equals A B))). The system also provides functions to access data stored in hash files. For example, the IDF Cost Scheme accesses the IDF values of the most frequent 100K English words (calculated on the Brown Corpus) stored in a file distributed with the system. Users can create new hash files to collect statistics about words in other languages, or other information to be used inside the cost scheme. The definition of a cost scheme is described in detail in Section 5. EDITS provides a mechanism for automatic generation of cost schemes, as described in Section 4.6.

3.3 Cost Optimizer
A cost optimizer is used to adapt cost schemes (either those provided with the system, or new ones defined by the user) to specific datasets. The optimizer is based on cost adaptation through genetic algorithms, as proposed in (Mehdad, 2009). To this aim, cost schemes can be parametrized by externalizing as parameters the edit operations costs. The optimizer iterates over training data using different values of these parameters until an optimal set is found (i.e. the one that best performs on the training set). The optimization mechanism is described in Section 4.5.

3.4 Rules
Rules are used to provide the Entailment Engine with knowledge (e.g. lexical, syntactic, semantic) about the probability of entailment or contradiction between elements of T and H. Rules are invoked by cost schemes to influence the cost of substitutions between elements of T and H. Typically, the cost of the substitution between two elements A and B is inversely proportional to the probability that A entails B. Rules are stored in XML files called Rule Repositories, with the format shown in Figure 3. Each rule consists of three parts: i) a left-hand side, ii) a right-hand side, iii) a probability that the left-hand side entails (or contradicts) the right-hand side.

<rule entailment="ENTAILMENT">
  <t>acquire</t>
  <h>own</h>
  <probability>0.95</probability>
</rule>
<rule entailment="CONTRADICTION">
  <t>beautiful</t>
  <h>ugly</h>
  <probability>0.88</probability>
</rule>
Figure 3: Example of XML Rule Repository

The format of the entailment rules repository is described in Section 6.
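To make the interplay of algorithm, cost scheme and rules more concrete, the following Python sketch computes a token-level edit distance using the fixed costs of the Simple Cost Scheme (10, 10, 0, 20). It is a hedged illustration, not the EDITS code. Three points are assumptions introduced for the example: rules are keyed on surface tokens rather than lemmas, a rule with probability p scales the substitution cost linearly to 20*(1-p), and the score is normalized by the cost of deleting all of T plus inserting all of H.

```python
from typing import Dict, List, Optional, Tuple

INSERTION_COST = 10.0     # Insertion(B) in the Simple Cost Scheme
DELETION_COST = 10.0      # Deletion(A) in the Simple Cost Scheme
SUBSTITUTION_COST = 20.0  # Substitution(A,B) when A != B

Rules = Dict[Tuple[str, str], float]  # (token in T, token in H) -> probability


def substitution_cost(a: str, b: str, rules: Optional[Rules] = None) -> float:
    if a == b:
        return 0.0
    if rules and (a, b) in rules:
        # Assumption: the cost shrinks linearly as entailment probability grows.
        return SUBSTITUTION_COST * (1.0 - rules[(a, b)])
    return SUBSTITUTION_COST


def token_edit_distance(t: List[str], h: List[str],
                        rules: Optional[Rules] = None) -> float:
    # dist[i][j]: cheapest way to turn the first i tokens of T
    # into the first j tokens of H.
    dist = [[0.0] * (len(h) + 1) for _ in range(len(t) + 1)]
    for i in range(1, len(t) + 1):
        dist[i][0] = dist[i - 1][0] + DELETION_COST
    for j in range(1, len(h) + 1):
        dist[0][j] = dist[0][j - 1] + INSERTION_COST
    for i in range(1, len(t) + 1):
        for j in range(1, len(h) + 1):
            dist[i][j] = min(
                dist[i - 1][j] + DELETION_COST,
                dist[i][j - 1] + INSERTION_COST,
                dist[i - 1][j - 1]
                + substitution_cost(t[i - 1], h[j - 1], rules),
            )
    return dist[len(t)][len(h)]


def normalized_score(t: List[str], h: List[str],
                     rules: Optional[Rules] = None) -> float:
    # Assumption: normalize by deleting all of T and inserting all of H.
    worst = len(t) * DELETION_COST + len(h) * INSERTION_COST
    return token_edit_distance(t, h, rules) / worst if worst else 0.0


t = "Yahoo acquired Overture".split()
h = "Yahoo owns Overture".split()
print(normalized_score(t, h))
print(normalized_score(t, h, {("acquired", "owns"): 0.95}))
```

With the hypothetical rule in place, substituting "acquired" with "owns" becomes much cheaper, so the normalized distance drops and the pair moves toward the YES side of any learned threshold.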

4 Using the System
This section provides basic information about the use of EDITS, which can be run with commands in a Unix Shell.

4.1 EDITS Input
The input of the system is an entailment corpus represented in the EDITS Text Annotation Format (ETAF), a simple XML internal annotation format (DTD in Figure 4). ETAF is used to represent both the input T-H pairs, and the entailment and contradiction rules. ETAF allows texts to be represented at two different levels: i) as sequences of tokens with their associated morpho-syntactic properties, or ii) as syntactic trees with structural relations among nodes. Plug-ins for several widely used annotation tools (including TreeTagger, Stanford Parser, and OpenNLP) can be downloaded from the system's website. Users can also extend EDITS by implementing plug-ins to convert the output of other annotation tools into ETAF.

<!ELEMENT entailment-corpus (pair+)>
<!ELEMENT pair (t, h, tAnnotation*, hAnnotation*)>
<!ELEMENT t (#PCDATA)>
<!ELEMENT h (#PCDATA)>
<!ELEMENT tAnnotation (string?, word*, tree?, semantics?)>
<!ELEMENT hAnnotation (string?, word*, tree?, semantics?)>
<!ELEMENT string (#PCDATA)>
<!ELEMENT word (attribute+)>
<!ELEMENT attribute (#PCDATA)>
<!ELEMENT tree (node+, edge*)>
<!ELEMENT node (word|label)>
<!ELEMENT label (#PCDATA)>
<!ELEMENT edge EMPTY>
<!ELEMENT semantics (entity*, relation*)>
<!ELEMENT entity EMPTY>
<!ELEMENT relation EMPTY>
<!ATTLIST pair
  id CDATA #REQUIRED
  entailment (YES|NO|UNKNOWN) #REQUIRED
  task CDATA #IMPLIED
  length CDATA #IMPLIED >
<!ATTLIST tAnnotation id CDATA #IMPLIED>
<!ATTLIST hAnnotation id CDATA #IMPLIED>
<!ATTLIST word id CDATA #IMPLIED>
<!ATTLIST attribute name CDATA #REQUIRED>
<!ATTLIST tree id CDATA #IMPLIED>
<!ATTLIST node id CDATA #IMPLIED>
<!ATTLIST edge
  name CDATA #IMPLIED
  from CDATA #REQUIRED
  to CDATA #REQUIRED >
<!ATTLIST relation
  name CDATA #REQUIRED
  source CDATA #IMPLIED >
<!ATTLIST entity
  name CDATA #REQUIRED
  start CDATA #IMPLIED
  end CDATA #IMPLIED
  source CDATA #IMPLIED >
Figure 4: DTD of the ETAF annotation format

The basic level of annotation represents texts as sequences of tokens with morpho-syntactic features. Common properties are token, lemma, morpho and part of speech, but other linguistic annotations may be used at this level, including named entities, weights of tokens (like IDF), and many others. For example, the following is the morpho-syntactic representation of the sentence "Edison invented the Kinetoscope.", with user-defined attribute names for two pos tagging sets (pos - TextPro tagset, and wnpos - WordNet part of speech), token, lemma, full morphological analysis (full_morpho) and sentence boundaries.

<hAnnotation>
  <string>Edison invented the Kinetoscope.</string>
  <word id="2232">
    <attribute name="wnpos">n</attribute>
    <attribute name="token">Edison</attribute>
    <attribute name="lemma">edison</attribute>
    <attribute name="pos">NP0</attribute>
  </word>
  <word id="2233">
    <attribute name="full_morpho">invent+v+part+past invented+adj+zero+invent+v+indic+past</attribute>
    <attribute name="wnpos">v</attribute>
    <attribute name="token">invented</attribute>
    <attribute name="lemma">invent</attribute>
    <attribute name="pos">VVD</attribute>
  </word>
  <word id="2234">
    <attribute name="full_morpho">the+adv the+art</attribute>
    <attribute name="wnpos"></attribute>
    <attribute name="token">the</attribute>
    <attribute name="lemma">the</attribute>
    <attribute name="pos">AT0</attribute>
  </word>
  <word id="2235">
    <attribute name="wnpos">n</attribute>
    <attribute name="token">Kinetoscope</attribute>
    <attribute name="lemma">kinetoscope</attribute>
    <attribute name="pos">NN1</attribute>
  </word>
  <word id="2236">
    <attribute name="full_morpho">.+punc</attribute>
    <attribute name="sentence">&lt;eos&gt;</attribute>
    <attribute name="wnpos"></attribute>
    <attribute name="token">.</attribute>
    <attribute name="lemma">.</attribute>
    <attribute name="pos">PUN</attribute>
  </word>
</hAnnotation>
Figure 5: Morpho-Syntactic Annotation Example

The second level of annotation represents texts as syntactic trees with their structural features. Both nodes (terminal and non-terminal) and edges with syntactic relations are represented. Nodes are typically described with their morpho-syntactic properties. The example below shows the output of the Stanford Parser for the sentence "Edison invented the Kinetoscope." converted into ETAF.

<hAnnotation>
  <string>Edison invented the Kinetoscope.</string>
  <tree root="2">
    <node id="1">
      <word id="1">
        <attribute name="token">Edison</attribute>
        <attribute name="lemma">Edison</attribute>
        <attribute name="pos">NNP</attribute>
      </word>
    </node>
    <node id="2">
      <word id="2">
        <attribute name="token">invented</attribute>
        <attribute name="lemma">invent</attribute>
        <attribute name="pos">VBD</attribute>
      </word>
    </node>
    <node id="3">
      <word id="3">
        <attribute name="token">the</attribute>
        <attribute name="lemma">the</attribute>
        <attribute name="pos">DT</attribute>
      </word>
    </node>
    <node id="4">
      <word id="4">
        <attribute name="token">Kinetoscope</attribute>
        <attribute name="lemma">Kinetoscope</attribute>
        <attribute name="pos">NN</attribute>
      </word>
    </node>
    <edge to="1" name="nsubj" from="2"/>
    <edge to="3" name="det" from="4"/>
    <edge to="4" name="dobj" from="2"/>
  </tree>
</hAnnotation>
Figure 6: Syntactic Annotation Example

Publicly available RTE corpora (RTE 1-3, and EVALITA 2009), annotated in ETAF at both the annotation levels, are delivered together with the system to be used as first experimental datasets.

The annotation of an entailment corpus with one of the annotation tools known by the system is done with the following command:

edits -a -name-of-the-tool -o output-file input-file

where "-a" indicates that EDITS must annotate a file. The "-name-of-the-tool" option indicates the annotation tool (e.g. "stanford-parser", "texpro" or "opennlp") used to perform the annotation. The "-o" option indicates the file where EDITS will store the result of the annotation. For example, the following command will annotate the RTE2 corpus with the Stanford Parser:

edits -a -stanford-parser -o RTE2_dev-annotated.xml rte/RTE2_dev.xml

ETAF annotated files can be visualised with the EDITS graphical interface. For example snapshots check Section 8.

4.2 Configuration
The creation of an Entailment Engine is done by defining its basic components (algorithms, cost schemes, optimizer, and rules) through an XML configuration file.

<conf>
  <module alias="distance">
    <module alias="tree"/>
    <module alias="xml">
      <option name="scheme-file" value="${EDITS}/share/cost-schemes/idfscheme.xml"/>
      <option name="hash-file" id="idf" value="${EDITS}/share/cost-schemes/idf.txt"/>
      <option name="hash-file" id="stopwords" value="${EDITS}/share/cost-schemes/stopwords.txt"/>
    </module>
    <module alias="pso"/>
  </module>
</conf>
Figure 7: An Example of a Configuration File

The configuration file in Figure 7 is divided into modules, each having a set of options. This configuration defines a distance Entailment Engine that combines Tree Edit Distance as a core distance algorithm, and the predefined IDF Cost Scheme that will be optimized on training data with the Particle Swarm Optimization algorithm ("pso") as in (Mehdad, 2009). Adding external knowledge to an entailment engine can be done by extending the configuration file with a reference to a rules file (e.g. "wordnet-rules.xml") as follows:

<module alias="memory">
  <option name="rules-file" value="${EDITS}/share/cost-schemes/wordnet-rules.xml"/>
</module>

The DTD of the format and more information about the configuration file appear in Section 7. Configuration files can be created, visualized and modified with the EDITS graphical interface. For example snapshots check Section 8.

4.3 Training and Test

4.3.1 Training
Given a configuration file and an RTE corpus annotated in ETAF, the user can run the training procedure to learn a model. This is done using the following command:

edits -r -c configuration_file -sm model annotated_entailment_corpus

where "-r" instructs the system to train a model using the entailment engine defined in the configuration file specified by the "-c" option on the annotated entailment corpus file or directory provided as input. The obtained model will be saved in the file specified by the "-sm" option. For example, if we want to train a model on the RTE3 development dataset with the predefined configuration specified in the file share/configurations/conf2.xml we use the following command:

edits -r -c share/configurations/conf2.xml -sm RTE3dev-model-conf2 rte/etaf/morphosyntax/RTE3dev.xml

The output of the training phase is a model: a zip file that contains the learned threshold, the configuration file, the cost scheme, and the entailment/contradiction rules used to calculate the threshold. The explicit availability of all this information in the model [1] allows users to share, replicate and modify experiments.

[1] Our policy is to publish online the models we use for participation in the RTE Challenges. We encourage other users of EDITS to do the same, thus creating a collaborative environment that allows new users to quickly modify working configurations and replicate results.

At the end of the training phase, a summary of the system's performance over the training set is printed on screen. This summary reports: (i) the distance model, including the distance threshold, (ii) the accuracy of the annotation of the whole training set, (iii) separate precision, recall and F-measure scores for the YES and NO training pairs. An example summary is shown in Figure 8.

Calculated Threshold: 0.7948717948717948
******************************
Accuracy: 0.61
###############################################
# Examples # Precision # Recall # FMeasure # Confidence # Class #
# 412 # 0.6244 # 0.6092 # 0.6167 # 0.1454 # YES #
# 388 # 0.5955 # 0.6108 # 0.6031 # 0.3313 # NO #
##############################################
###############
# # YES # NO #
# YES # 251 # 161 #
# NO # 151 # 237 #
###############
Figure 8: Example of Training Summary

The training stage also allows performance to be tuned along several dimensions (e.g. overall Accuracy, Precision/Recall trade-off on YES and/or NO entailment judgments). By default the system maximizes the overall accuracy (distinction between YES and NO pairs). To make such adjustments the user should modify the configuration file, adding the following option to the definition of the distance entailment engine:

<module alias="distance">
  ...
  <option name="optimize-per-metric" value="METRIC"/>
  ...
</module>
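The figures reported in a training summary can be recomputed from its confusion matrix, which is a quick way to check what each column means. The sketch below is illustrative only: the function name `summarize` and the reading of the matrix (rows are gold annotations, columns are system decisions) are assumptions, although that reading is the one that reproduces the Precision, Recall and FMeasure values printed in Figure 8.

```python
from typing import Dict, Tuple

LABELS = ("YES", "NO")


def summarize(confusion: Dict[Tuple[str, str], int]) -> Dict[str, float]:
    """Compute accuracy and per-class precision/recall/F-measure.

    `confusion[(gold, predicted)]` is the number of pairs with that
    gold annotation and that system decision.
    """
    total = sum(confusion.values())
    report = {
        "accuracy": sum(confusion[(label, label)] for label in LABELS) / total
    }
    for label in LABELS:
        hits = confusion[(label, label)]
        predicted = sum(confusion[(gold, label)] for gold in LABELS)
        annotated = sum(confusion[(label, pred)] for pred in LABELS)
        precision = hits / predicted if predicted else 0.0
        recall = hits / annotated if annotated else 0.0
        denominator = precision + recall
        f_measure = 2 * precision * recall / denominator if denominator else 0.0
        report[f"precision_{label}"] = precision
        report[f"recall_{label}"] = recall
        report[f"fmeasure_{label}"] = f_measure
    return report


# Confusion matrix copied from Figure 8 (rows: gold, columns: predicted).
figure8 = {
    ("YES", "YES"): 251, ("YES", "NO"): 161,
    ("NO", "YES"): 151, ("NO", "NO"): 237,
}
for name, value in summarize(figure8).items():
    print(f"{name}: {value:.4f}")
```

Optimizing for a metric other than overall accuracy amounts to choosing the threshold that maximizes one of these per-class values instead of the first one.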

More information on the values of the "optimize-per-metric" option and on the other options of the distance entailment engine can be found in the HTML reference guide downloadable with the system.

4.3.2 Test
Given a model and an ETAF annotated RTE corpus as input, the test procedure produces a file containing for each pair: i) the decision of the system (YES, NO), ii) the confidence of the decision, iii) the entailment score, iv) the sequence of edit operations made to calculate the entailment score. The command used to invoke the test procedure is the following:

edits -e -m model -o edits_result entailment_corpus

where "-e" instructs the system to load the entailment engine stored in the file specified by the "-m" option and to annotate the entailment relation for the pairs of the input file. EDITS will store the result for each entailment pair in the output file specified by the "-o" option. An example output is the following XML fragment:

<pair task="IE" length="short" id="1" entailment="NO" confidence="0.22" benchmark="YES">
  <t>Claude Chabrol (born June 24, 1930) is a French movie director and has become well-known in the 40 years since his first film, Le Beau Serge , for his chilling tales of murder, including Le Boucher .</t>
  <h>Le Beau Serge was directed by Chabrol.</h>
</pair>
Figure 9: Simple EDITS output

The entailment relation found by EDITS is reported in the entailment attribute. If the pairs in the corpus already had an entailment relation assigned (in the case of training data or an annotated test set), the additional attribute benchmark is added to record the original value. The user can also obtain the edit distance operations made by the system for each pair by controlling the verbosity of the output with the "-ot" option. This makes it possible to produce the Extended Output (-ot=extended) represented in Figure 10 or the Full output represented in Figure 11.

<pair task="IE" score="0.84" normalization="500.0" length="short" id="1" entailment="NO" distance="420.0" confidence="0.22" benchmark="YES">
  <t>Claude Chabrol (born June 24, 1930) is a French movie director and has become well-known in the 40 years since his first film, Le Beau Serge , for his chilling tales of murder, including Le Boucher .</t>
  <h>Le Beau Serge was directed by Chabrol.</h>
</pair>
Figure 10: Extended Output

<pair task="IE" score="0.84" normalization="500.0" length="short" id="1" entailment="NO" distance="420.0" confidence="0.22" benchmark="YES">
  <t>Claude Chabrol (born June 24, 1930) is a French movie director and has become well-known in the 40 years since his first film, Le Beau Serge , for his chilling tales of murder, including Le Boucher .</t>
  <h>Le Beau Serge was directed by Chabrol.</h>
  <log xsi:type="EditOperations" cost="420.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <operation type="deletion" scheme="deletion" cost="10.0">
      <source xsi:type="Word" id="1">
        <attribute name="lemma">claude</attribute>
        <attribute name="wnpos">n</attribute>
        <attribute name="token">Claude</attribute>
        <attribute name="full_morpho">claude+pn</attribute>
        <attribute name="sentence">-</attribute>
        <attribute name="pos">NP0</attribute>
      </source>
    </operation>
    ...
    <operation type="insertion" scheme="insertion" cost="10.0">
      <target xsi:type="Word" id="48">
        <attribute name="lemma">by</attribute>
        <attribute name="token">by</attribute>
        <attribute name="full_morpho">by+prep by+adv by+adj+zero</attribute>
        <attribute name="sentence">-</attribute>
        <attribute name="pos">PRP</attribute>
      </target>
    </operation>
    ...
    <operation type="substitution" scheme="equal" cost="0.0">
      <source xsi:type="Word" id="2">
        <attribute name="lemma">chabrol</attribute>
        <attribute name="wnpos">n</attribute>
        <attribute name="token">Chabrol</attribute>
        <attribute name="sentence">-</attribute>
        <attribute name="pos">NN1</attribute>
      </source>
      <target xsi:type="Word" id="50">
        <attribute name="lemma">chabrol</attribute>
        <attribute name="wnpos">n</attribute>
        <attribute name="token">Chabrol</attribute>
        <attribute name="sentence">-</attribute>
        <attribute name="pos">NN1</attribute>
      </target>
    </operation>
  </log>
</pair>
Figure 11: Full output

EDITS annotated files can be visualised with the EDITS graphical interface. For example snapshots check Section 8.

4.4 Combining Engines
A relevant feature of EDITS is the possibility to combine multiple Entailment Engines into a combined entailment engine, as shown in Figure 12. This can be done by grouping their definitions as sub-modules in the configuration file. EDITS allows users to define customized combination strategies, or to use two predefined combination modalities provided with the package, namely: i) Linear Combination, and ii) Classifier Combination. The two modalities combine in different ways the entailment scores produced by multiple independent engines, and return a final decision for each T-H pair.
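As a rough illustration of the first modality, Linear Combination, which the manual describes as a weighted sum of the scores returned by the individual engines, the Python sketch below merges two engine scores and compares the result with a threshold. The function names, the example scores and weights, the threshold value, and the decision rule (a low combined distance means YES, as for a single engine) are assumptions made for the example, not details of the EDITS implementation.

```python
from typing import Sequence


def linear_combination(scores: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Weighted sum of the entailment scores returned by each engine."""
    if len(scores) != len(weights):
        raise ValueError("one weight is needed per engine")
    return sum(weight * score for weight, score in zip(weights, scores))


def decide(scores: Sequence[float], weights: Sequence[float],
           threshold: float) -> str:
    # Assumption: as for a single engine, a low combined distance means YES.
    combined = linear_combination(scores, weights)
    return "YES" if combined <= threshold else "NO"


# Hypothetical scores from two engines, e.g. one based on Tree Edit
# Distance and one based on Cosine Similarity, with equal weights.
engine_scores = [0.84, 0.40]
weights = [0.5, 0.5]
combined = linear_combination(engine_scores, weights)
print(round(combined, 4), decide(engine_scores, weights, threshold=0.6))
```

In this picture the weights play the same role as the externalized cost parameters of a cost scheme: they are the values an optimizer would adjust on training data.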

weighti is an ad-hoc weight parameter for each entailment engine. To this aim. used to train a classifier with Weka.e.Figure 12: Combined Entailment Engine Linear Combination returns an overall entailment score as the weighted sum of the entailment scores returned by each engine: In this formula. Optimal weight parameters can be determined using the same optimization strategy used to optimize the cost schemes.3 4. EDITS provides a plug-in that uses the Weka machine learning workbench as a core. but other Weka algorithms can be specified as options in the configuration file. as described in Sections 3. By default the plug-in uses an SVM classifier.5. Classifier Combination is based on using the entailment scores returned by each engine as features to train a classifier (see Figure 10). the other based on Cosine Similarity). one based on Tree Edit Distance.xml"/> </module> </module> <module alias="distance" id="2"> <module alias="cosine"/> <module alias="xml"/> 14 . <module alias="weka"> <module alias="distance" id="1"> <module alias="tree"/> <module alias="xml"> <option name="scheme-file" value=scheme1. The configuration file in Figure 13 describes configuration file the describes a combination of two engines (i.

        <option name="scheme-file" value="scheme2.xml"/>
        <module alias="xml"/>
      </module>
    </module>

Figure 13: Configuration file of Combined Entailment Engine

A linear combination can be easily obtained by changing the alias of the highest-level module ("weka") into "linear". Entailment engines can be combined by merging their configurations using the EDITS graphical interface. For example snapshots check Section 8.

4.5 Optimization

The goal of the optimization process is to change the values of certain parameters (optimizable parameters) of an entailment engine in order to make it perform better on the training set. The procedure is performed by modules called engine-optimizers. EDITS provides two such modules, "genetic" and "pso", as plug-ins. For both modules the optimization procedure uses some form of genetic search to find the optimal values.

The optimizable parameters are specific to the optimized engine. The optimizable parameters of the simple entailment engine are the OP constants of the cost scheme (more information in Section 5.2.1). The optimizable parameters of the linear combination engine are the weights of each sub-engine. The Weka-based entailment engine cannot be optimized.

Figure 14 shows the configuration file conf-optimize.xml (which can be found in the share/configurations folder). It represents a simple entailment engine using an optimizable cost scheme (scheme-optimize.xml) that will be optimized when the training process is activated.

    <conf>
      <module alias="distance">
        <module alias="token"/>
        <module alias="xml">
          <option name="scheme-file" value="${EDITS_PATH}/share/costs-schemes/scheme-optimze.xml"/>
        </module>
        <module alias="genetic"/>
        <option name="optimize-parameters" value="true"/>
      </module>
    </conf>

Figure 14: Configuration file that represents an entailment engine that will be optimized during training

The option "optimize-parameters" indicates to EDITS that it should use the engine-optimizer module "genetic" to tune the performance of the entailment engine. Figure 15 describes the result of the optimization process, showing the new values of the optimizable parameters.

    bin/edits -r -c share/configurations/conf-optimize.xml rte/etaf/morpho-syntax/RTE2_dev.xml

    Parameter: OPinsertion Value: 0.7482828808106775
    Parameter: OPdeletion Value: 0.08147454355821715
    Parameter: OPsubstitution Value: 0.7209179287932125
    Calculated Threshold: 0.5636193641242588

    *******************************
    Accuracy: 0.61625
    #############################################
    # Examples # Precision # Recall # FMeasure # Confidence # Class #
    # 400      # 0.5947    # 0.73   # 0.6554   # 0.2658     # YES   #
    # 400      # 0.6505    # 0.5025 # 0.567    # 0.2097     # NO    #
    ###########################################
    ###################
    #     # YES # NO  #
    # YES # 292 # 108 #
    # NO  # 199 # 201 #
    ###################

Figure 15: Result of Optimization Process

4.6 Automatic Generation of Cost Schemes

EDITS provides a mechanism for quick experimentation with different distance algorithms by allowing the user to avoid the specification of a cost scheme and still have a fully functional entailment engine. In this case EDITS automatically generates a cost scheme that is adapted to the algorithm and to the resources of entailment rules defined in the configuration file. For example, the cost scheme in Figure 17 is automatically generated for the simple entailment engine that uses the token edit distance algorithm, with the configuration file shown in Figure 16.

    <conf>
      <module alias="distance">
        <module alias="token"/>
      </module>
    </conf>

Figure 16: Simple Entailment Engine that uses Token Edit Distance Algorithm

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <scheme>
      <constant value="1" type="number" name="OPinsertion"/>
      <constant value="1" type="number" name="OPdeletion"/>
      <constant value="0" type="number" name="OPsubstitution1"/>
      <constant value="1" type="number" name="OPsubstitution2"/>
      <insertion>
        <cost>(* OPinsertion (size (words T)))</cost>
      </insertion>
      <deletion>
        <cost>(* OPdeletion (size (words H)))</cost>
      </deletion>
      <substitution>
        <condition>(equals (a.token A) (a.token B))</condition>
        <cost>OPsubstitution1</cost>
      </substitution>
      <substitution>
        <cost>(* OPsubstitution2 (+ (size (words T)) (size (words H))))</cost>
      </substitution>

    </scheme>

Figure 17: Automatically Generated Cost Scheme

The generated cost scheme is also optimizable. Users are encouraged to start from automatically generated cost schemes as templates for future cost scheme development.

4.7 Automatic Generation of Configuration File

EDITS allows the user to make quick experiments without providing a configuration file. For example the following command:

    bin/edits -r rte/etaf/morpho-syntax/RTE2_dev.xml

will automatically generate a configuration file, like the one shown in Figure 16, that describes a simple entailment engine using token edit distance as distance algorithm and an automatically generated cost scheme.

The following command will create a simple entailment engine with tree edit distance as distance algorithm:

    bin/edits -r -tree rte/etaf/syntax/RTE2_dev.xml

The following command will generate a configuration file that describes a simple entailment engine with tree edit distance as distance algorithm, and will optimize the automatically created cost scheme with the genetic entailment engine optimizer:

    bin/edits -r -op -genetic -tree rte/etaf/syntax/RTE2_dev.xml

4.8 Experiment

The experiment file allows the user to reduce the length of command lines and to quickly repeat the same experiment. The user can specify in the experiment file all the options that handle the models created during training and the output of a test. An example of an experiment file is shown in Figure 18.

    <experiment use-memory="true" overwrite="false" output-type="NO"
        configuration="/hardmnt/queneau0/tcc/kouylekov/edits/share/configurations/conf1.xml"
        adddate="false">
      <training>/hardmnt/queneau0/tcc/kouylekov/edits/rte/etaf/morpho-syntax/RTE2_dev.xml</training>
      <test>/hardmnt/queneau0/tcc/kouylekov/edits/rte/etaf/morpho-syntax/RTE2_test.xml</test>
    </experiment>

Figure 18: Example of Experiment File

This file defines an experiment in which EDITS will use the configuration defined in the file "share/configurations/conf1.xml" to train a model on the RTE2 development set and will test this model on the RTE2 test set. The DTD of an experiment file is shown in Figure 19.

    <!ELEMENT experiment (training+, test*)>
    <!ELEMENT training (#PCDATA)>
    <!ELEMENT test (#PCDATA)>
    <!ATTLIST experiment
        configuration CDATA #IMPLIED

        model CDATA #IMPLIED
        configuration CDATA #IMPLIED
        output CDATA #IMPLIED
        output-type CDATA #IMPLIED
        use-memory CDATA #IMPLIED
        overwrite CDATA #IMPLIED
        add-data CDATA #IMPLIED
    >

Figure 19: DTD of Experiment File

New experiments can be easily created using the EDITS graphical interface. For example snapshots check Section 8.
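The evaluation report printed after training (Figure 15 in Section 4.5) can be cross-checked by hand. The sketch below is not part of EDITS: it is a small Python illustration that re-derives accuracy, precision, recall and F-measure from the confusion matrix of Figure 15, under the assumption that rows are gold labels and columns are system judgements. The function and variable names are hypothetical.

```python
# Confusion matrix copied from Figure 15.
# Assumption: outer keys are gold labels, inner keys are system outputs.
confusion = {
    "YES": {"YES": 292, "NO": 108},
    "NO": {"YES": 199, "NO": 201},
}


def evaluate(matrix):
    """Return accuracy plus (precision, recall, f_measure) for each class."""
    labels = list(matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in labels)
    report = {"accuracy": correct / total}
    for label in labels:
        true_positives = matrix[label][label]
        predicted = sum(matrix[gold][label] for gold in labels)  # column sum
        gold_count = sum(matrix[label].values())                 # row sum
        precision = true_positives / predicted
        recall = true_positives / gold_count
        f_measure = 2 * precision * recall / (precision + recall)
        report[label] = (precision, recall, f_measure)
    return report


report = evaluate(confusion)
print(report["accuracy"])  # 0.61625, as in Figure 15
```

With this reading of the matrix, the per-class values round to the Precision, Recall and FMeasure columns of Figure 15; the Confidence column is not reproduced because the manual does not say how it is computed.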

5 Defining Cost Schemes for Edit Operations

According to the distance-based approach, T entails H if there exists a sequence of transformations applied to T such that we can obtain H with an overall cost below a certain threshold. The underlying assumption is that pairs between which an entailment relation holds have a low cost of transformation. EDITS allows for the definition of the cost of each edit operation carried out by the distance algorithm in order to find the best (i.e. least costly) sequence of edit operations that transforms T into H.

The basic data structure in EDITS for the definition of costs is the cost scheme. A cost scheme is invoked by the edit distance algorithm with three parameters: (i) an edit operation, (ii) an element of T, called the source and referred to through the variable A, and (iii) an element of H, called the target and referred to through the variable B. One or more cost schemes can be associated to each edit operation, and they are collected in a cost scheme file that can be created by the user. Each cost scheme for a certain edit operation consists of three parts:

1. Name - Every cost scheme must have a user-defined unique name.
2. Condition - A set (possibly empty) of constraints over the source and the target elements, which need to be satisfied in order to activate the cost scheme. Each constraint is expressed in a lisp-like syntax, and all constraints must be satisfied (i.e. they have to return true) in order for the cost scheme to be applied.
3. Cost - A fixed value, or a function that returns a numerical value, expressing the cost of the edit operation applied to the source and to the target. A cost function can consider as parameters the source element, the target element, the text T, and the hypothesis H.

EDITS adopts a combination of XML annotations and functional expressions to define the cost schemes. The XML Document Type Definition (DTD) of the cost scheme file is reported in Figure 20.

    <!ELEMENT schemes (scheme+)>
    <!ELEMENT scheme (constant*,insertion*,deletion*,substitution*)>
    <!ELEMENT constant EMPTY>
    <!ELEMENT insertion (condition*,cost)>
    <!ELEMENT deletion (condition*,cost)>
    <!ELEMENT substitution (condition*,cost)>
    <!ELEMENT condition (#PCDATA)>
    <!ELEMENT cost (#PCDATA)>
    <!ATTLIST scheme name CDATA #REQUIRED>
    <!ATTLIST insertion name CDATA #REQUIRED>
    <!ATTLIST deletion name CDATA #REQUIRED>
    <!ATTLIST substitution name CDATA #REQUIRED>
    <!ATTLIST constant
        name CDATA #REQUIRED
        type (string|number|boolean) #REQUIRED
        value CDATA #REQUIRED
    >

Figure 20: XML Cost Scheme DTD.

A simple example of a predefined cost scheme file (simple-scheme.xml, introduced in Section 3.2) is shown in Figure 21.

    <scheme>
      <insertion name="insertion">
        <cost>10</cost>
      </insertion>

      <deletion name="deletion">
        <cost>10</cost>
      </deletion>
      <substitution name="equal">
        <condition>(equals (attribute "token" A) (attribute "token" B))</condition>
        <cost>0</cost>
      </substitution>
      <substitution name="not-equal">
        <condition>(not (equals (attribute "token" A) (attribute "token" B)))</condition>
        <cost>20</cost>
      </substitution>
    </scheme>

Figure 21: Simple Cost Scheme

This cost scheme applies to elements of T and H, referred to respectively as A and B, which are annotated as words (see ETAF in Section 3). Within the example, there are four edit operations (1 for insertion, 1 for deletion, and 2 for substitution), assigned different costs:

• insertion(B)= 10 - inserting an element B, no matter what B is, always costs 10.
• deletion(A)= 10 - deleting an element A, no matter what A is, always costs 10.
• substitution(A,B)= 0 if A=B - substituting A with B costs 0 if the token of A and the token of B are equal (i.e. they are the same string). The function (attribute "token" A) returns the token of the source element.
• substitution(A,B)= 20 if A != B - substituting A with B costs 20 if the tokens of A and B are not equal.

Figure 22 shows a more complex example of a cost scheme for the substitution operation. In the example, a token from T is substituted by a token from H with a cost equal to 0 if their lemmas and parts of speech are equal.

    <substitution name="equal">
      <condition>(and (equals (attribute "lemma" A) (attribute "lemma" B))
                      (equals (attribute "pos" A) (attribute "pos" B)))</condition>
      <cost>0</cost>
    </substitution>

As shown in the previous examples, EDITS allows the cost of the edit operations to be defined by means of user-defined attributes. In the example in Figure 23 the cost scheme exploits the pre-computed frequency of a token to calculate the cost of insertion, according to the intuition that the most frequent words should have a lower cost of insertion.

    <insertion name="insertion_frequency">
      <condition>(not (null (attribute "freq" B)))</condition>
      <cost>(* (/ 1 (number (attribute "freq" B))) 20)</cost>
    </insertion>

Figure 23: Insertion Based on Frequency

Cost schemes can be easily created, modified and viewed using the EDITS graphical interface. For example snapshots check Section 8.

5.1 Data Types and Functions

Conditions and costs are defined using a set of functions, expressed in a lisp-like syntax. Such functions can consider as parameters the source A, the target B, the text T, and the hypothesis H. The arguments of such functions are the variables (i.e. A, B, T, H) which are instantiated within a specific cost scheme. EDITS provides functions for the most relevant elements defined in the ETAF linguistic representation. This means that all the information about T and H derived from their linguistic processing (e.g. part of speech, syntactic structure, etc.) is available for defining conditions and cost functions. As an example, typical constraints involve checking the token and the part of speech of A and B, while typical cost functions are computed considering the lexical similarity between A and B, possibly normalizing such value over the length of T and H.

5.1.1 Data Types

Basic elements for defining constraints and costs in a cost scheme are derived from the three representation levels defined in ETAF. Functions use the following primitive object data types:

1. String - a sequence of characters, for example: "Dolomiti" or "Milen".
2. Number - a real number, for example: 0, 10, 3.14 etc.
3. Boolean - True or False.
4. List - a sequence of elements (words, nodes, numbers, booleans, etc.).
5. Word - represents a token in T and H and it is instantiated by the variables A and B in a cost scheme.
6. Node - represents a tree element in T and H and it is instantiated by the variables A and B in a cost scheme.
7. Edges - represents an edge in the dependency tree of T and H and it is instantiated by the variables A and B in a cost scheme.
8. Tree - represents a syntactic tree; it is obtained using the function tree.
9. Set - a group of elements of a given type (strings, numbers, or booleans) loaded from a file.
10. Hash - an object that contains maps from keys to values, loaded from a file. Keys are strings, while values are either strings, numbers, or booleans.

Sets and Hashs are objects that have to be read from an external file (i.e. they can not be created inside a cost scheme or read from the entailment corpus).

The format of a file containing a Hash is shown in Figure 24. The type is the data-type of the values in the file, and the key is separated from a value with a tab.

    type
    key-1 value-1
    ...
    key-n value-n

Figure 24: Hash file format

A fragment from the file "share/cost-scheme/idf.txt" containing the IDF of words is shown in Figure 25.

    number
    speak 12.23
    ride 3.23
    read 10.32
    ...

Figure 25: Hash file example
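To make the Hash file format of Figure 24 concrete, here is a minimal Python sketch of a reader for it. This is an illustration only, not the EDITS loader: the function name parse_hash is hypothetical, and the sketch assumes that the type line is one of "string", "number" or "boolean" and that booleans are spelled "true"/"false".

```python
import io


def parse_hash(stream):
    """Read a Hash file: a type line, then one key<TAB>value pair per line."""
    # Assumption: the type names mirror the constant types used in cost schemes.
    converters = {
        "string": str,
        "number": float,
        "boolean": lambda text: text.strip().lower() == "true",
    }
    lines = [line.rstrip("\n") for line in stream if line.strip()]
    convert = converters[lines[0].strip()]
    table = {}
    for line in lines[1:]:
        key, value = line.split("\t", 1)  # key and value are tab-separated
        table[key] = convert(value)
    return table


# In-memory sample using two of the entries shown in Figure 25.
sample = io.StringIO("number\nspeak\t12.23\nride\t3.23\n")
idf = parse_hash(sample)
print(idf["speak"])
```

A Set file (one element per line after the type line) could be read the same way, collecting the converted elements into a Python set instead of a dictionary.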

The format of a file containing a Set is shown in Figure 26. The type is the data-type of the elements in the Set.

    type
    element-1
    ...
    element-n

Figure 26: Set file format

A fragment from the file "share/cost-scheme/stopwords.txt" containing stop words is shown in Figure 27.

    string
    speak
    ride
    ...
    read

Figure 27: Set file example

The Hashs and Sets are defined as options in the configuration of the CostScheme module. They are accessed using the functions set-contains and hash-value, described in the following section, by referring to their ID. The fragment in Figure 28 represents a simple definition of a hash and a set inside a configuration file.

    <module alias="xml">
      <option name="scheme-file" value="${EDITS}/share/cost-schemes/idfscheme.xml"/>
      <option name="hash-file" id="idf" value="${EDITS}/share/cost-schemes/idf.txt"/>
      <option name="hash-file" id="stopwords" value="${EDITS}/share/cost-schemes/stopwords.txt"/>
    </module>

Figure 28: Definition of a Hash in configuration file

5.1.2 Functions

Functions over AnnotatedText

• (string AnnotatedText) - returns the text of the AnnotatedText (i.e. the text of T or H).
• (words AnnotatedText) - returns the list of words in the AnnotatedText.
• (tree AnnotatedText) - returns the syntactic tree of the AnnotatedText.

Functions for accessing Entailment Rules

• (entail SimpleRulesObject1 SimpleRulesObject2) - checks for the existence of an entailment rule (see Section 6) where the left hand side of the rule matches SimpleRulesObject1 and the right hand side of the rule matches SimpleRulesObject2. The two arguments must be of the same data type. The allowed types are: String, Word and Node. If the rule exists, then the probability associated to the rule is returned, otherwise the output of the function is null.
• (contradict SimpleRulesObject1 SimpleRulesObject2) - checks for the existence of a contradiction rule (see Section 6) where the left hand side of the rule matches SimpleRulesObject1 and the right hand side of the rule matches SimpleRulesObject2. The two arguments must be of the same data type.

The allowed types are: String, Word and Node. If the rule exists, then the probability associated to the rule is returned, otherwise the output of the function is null.

Functions over Words

• (attribute String Word) - returns the value of the attribute String of the word. If the attribute is missing the function returns null.

Functions over Nodes

• (word Node) - returns the word of the node.
• (label Node) - returns the label (i.e. the syntactic category) of the node.
• (edge Node) - returns the edge (i.e. the syntactic relation) entering in the node.
• (is-word-node Node) - returns true if the node contains a word.
• (is-label-node Node) - returns true if the node contains a label.

Functions over Trees

• (nodes Tree) - returns the list of nodes of a tree.
• (parent Node Tree) - returns the parent of a node in the syntactic tree of T or H.
• (children Node Tree) - returns the children of a node in the syntactic tree of T or H.
• (from Edge) - returns the from node of an edge in the syntactic tree of T or H.
• (to Edge) - returns the to node of an edge in the syntactic tree of T or H.

Functions with string arguments

• (equals String1 String2) - returns True if String1 is equal to String2.
• (equals-ignore-case String1 String2) - compares two strings ignoring their case.
• (to-lower-case String) - converts String to lower case.
• (length String) - returns the number of characters in String.
• (capitalized String) - returns true if the string is capitalized.
• (contains String1 String2) - returns True if String1 contains String2. For instance, (contains "reenacting" "act") returns True.
• (starts-with String1 String2) - returns True if String1 starts with String2. For instance: (starts-with "reading" "read") is True.
• (ends-with String1 String2) - returns True if String1 ends with String2.
• (substring String Number1 Number2) - returns the sub-string of String from the position corresponding to Number1 till the position Number2.
• (char String Number) - returns the character in String at the position corresponding to Number.
• (distance String1 String2 :normalize) - returns the Levenshtein distance between String1 and String2. If the ":normalize" parameter is present the function returns a normalized distance (with respect to the length of the two arguments) between 0 and 1.
• (number String) - reads a number from String. For instance, (number "3.14") returns 3.14.
• (boolean String) - reads a boolean from String. The possible arguments are "true" and "false".

Functions with numeric arguments

• (= Number1 Number2) - returns True if Number1 is equal to Number2.
• (< Number1 Number2) - returns True if Number1 is less than Number2.
• (> Number1 Number2) - returns True if Number1 is more than Number2.
• (+ Number1 ... Numbern) - makes a sum of numbers (example (+ 1 2) is equal to 3).

• (- Number1 ... Numbern) - subtracts numbers from Number1 (example (- 5 2 1) is equal to 2).
• (* Number1 ... Numbern) - multiplies numbers (example (* 2 2) is equal to 4).
• (/ Number1 ... Numbern) - divides Number1 by the rest (example (/ 24 3 2) is equal to 4).

Functions with boolean arguments

• (and Boolean1 ... Booleann) - returns True if all the arguments are True.
• (or Boolean1 ... Booleann) - returns True if at least one of the arguments is True.
• (not Boolean) - returns True if the argument is False, and False if it is True.

Functions with list arguments

• (member List Object) - returns True if List contains Object. For example: the list (1 2 3) contains the number 1, the list ("sum" "plus" "minus") contains the string "plus".
• (size List) - returns the number of elements in List.
• (nth Number List) - returns the n-th (Number) element of List. The first element of a list is returned by (nth 1 list), the last with (nth (- (size list) 1) list).
• (subseq List Number1 Number2) - returns the sub-list of the list from the position corresponding to Number1 till the position Number2.

Functions handling Hash and Set

• (hash-value String1 String2) - returns the value of the hash with id equal to String1 for the key String2.
• (set-contains String Object) - returns True if the set with id equal to String contains Object.

Conditional Functions

• (if Boolean Object1 Object2) - if the Boolean is equal to True then the function returns Object1, otherwise Object2. If Object2 is not defined the function returns null.

Null handling functions

• (null Object) - returns True if the argument is null. For example, to express that a word A from T does not have an attribute "freq", the expression (null (attribute "freq" A)) can be used.

5.2 Using constants

    <constant name="COST" type="number" value="10"/>
    <insertion name="insertion">
      <cost>COST</cost>
    </insertion>
    <deletion name="deletion">
      <cost>COST</cost>
    </deletion>

Figure 29: Example of Cost Scheme constants.

The constants are used in a cost scheme to externalize certain values that can be used by more than one of the operations of the cost scheme.

In the XML scheme in Figure 29 the constant "COST" is used as the cost of both the insertion and deletion operations. Each constant must have a type that defines what type of data the cost scheme will interpret its value as. The possible types are "string", "number" and "boolean".

EDITS provides two constants, T and H, that the user can use inside cost schemes. The contents of these constants are the annotations of T and H, respectively. For example, if the user wants to set as the cost of deletion the number of words in T, he/she must use the following XML fragment:

    <deletion name="deletion">
      <cost>(size (words T))</cost>
    </deletion>

5.2.1 Optimizing Cost Schemes

The constants also play an important role in optimizing the cost scheme. The constants that have a name that starts with the capital letters OP are considered by the system as parameters of the cost scheme that can be optimized. A cost scheme is optimizable if it contains at least one such parameter. An example of a cost scheme with optimizable constants is the fragment in Figure 30 (which can be found in share/cost-schemes/optimize-scheme.xml). In this scheme the costs of insertion, of deletion, and of substitution in case A and B are not equal are optimized.

    <scheme>
      <constant name="OPinsertion" value="10" type="number"/>
      <constant name="OPdeletion" value="10" type="number"/>
      <constant name="OPsubstitution" value="20" type="number"/>
      <insertion name="insertion">
        <cost>OPinsertion</cost>
      </insertion>
      <deletion name="deletion">
        <cost>OPdeletion</cost>
      </deletion>
      <substitution name="equal">
        <condition>(equals (attribute "token" A) (attribute "token" B))</condition>
        <cost>0</cost>
      </substitution>
      <substitution name="not-equal">
        <condition>(not (equals (attribute "token" A) (attribute "token" B)))</condition>
        <cost>OPsubstitution</cost>
      </substitution>
    </scheme>

Figure 30: Optimizable Cost Scheme

5.3 Matching Edges

From version 2.1, all algorithms that process tokens (the only one that works on trees is tree edit distance) are adapted to work with the edges of the syntactic tree. To do this the user must use the match-edges option either in the configuration file or on the command line. For certain algorithms like token edit distance or longest common subsequence this is not semantically motivated, as the edges do not have a predefined order.

    bin/edits -train -overlap -match-edges

    <module alias="overlap">

      <option name="match-edges" value="true"/>
    </module>
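As a recap of Section 5, the following Python sketch shows how fixed costs such as those in the simple cost scheme of Figure 21 (insertion 10, deletion 10, substitution 0 for equal tokens and 20 otherwise) drive a token-level edit distance. It is a generic dynamic-programming illustration with hypothetical names, not the EDITS implementation, and it ignores conditions, normalization and thresholds.

```python
def edit_distance(text_tokens, hyp_tokens, insertion=10, deletion=10,
                  substitution_equal=0, substitution_different=20):
    """Cheapest cost of transforming the tokens of T into the tokens of H."""
    rows, cols = len(text_tokens) + 1, len(hyp_tokens) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):   # delete every token of T
        cost[i][0] = cost[i - 1][0] + deletion
    for j in range(1, cols):   # insert every token of H
        cost[0][j] = cost[0][j - 1] + insertion
    for i in range(1, rows):
        for j in range(1, cols):
            same = text_tokens[i - 1] == hyp_tokens[j - 1]
            substitute = cost[i - 1][j - 1] + (
                substitution_equal if same else substitution_different)
            cost[i][j] = min(cost[i - 1][j] + deletion,
                             cost[i][j - 1] + insertion,
                             substitute)
    return cost[-1][-1]


text = "Yahoo acquired Overture".split()
hypothesis = "Yahoo owns Overture".split()
print(edit_distance(text, hypothesis))  # 20: one substitution of unequal tokens
```

Under the distance-based approach, a pair would be judged as entailment when such a cost falls below the threshold estimated during training.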

6 Defining rules in EDITS

EDITS allows the use of sets of rules, both entailment rules and contradiction rules, in order to provide specific knowledge (e.g. lexical, syntactic, semantic) about transformations between T and H. Rules can be manually created, or they can be extracted from any available resource (e.g. WordNet, Wikipedia, DIRT) and stored in XML files which are called Rule Repositories. Each rule in EDITS consists of five parts:

1. Name - a unique identifier of the rule within a certain rule repository. If not provided by the user, the rule name is automatically generated by the system. This is used for logging purposes only, in order to help the user to understand which rules have been applied by the system for a certain pair.
2. Type (entailment) - specifies the type of rule: entailment or contradiction.
3. t - a text T, i.e. the left hand side of the rule.
4. h - a hypothesis H, i.e. the right hand side of the rule.
5. Probability - a probability that the rule maintains either the entailment or the contradiction between T and H. Both in entailment and contradiction rules, a probability equal to 0 means that the relation between T and H is unknown, while a probability equal to 1 means that the entailment/contradiction between T and H is fully preserved.

6.1 Rule format

Both T and H can be defined using the Edits Text Annotation Format (ETAF) described in Section 4. ETAF allows text portions to be represented at three different levels of annotation: just as strings (i.e. the STRING object), as sequences of tokens with their morpho-syntactic features (i.e. the WORD object), and as syntactic trees (e.g. the NODE object). Rules in EDITS can be defined using the three datatypes, either STRING, or WORD or NODE, provided that they are used consistently in T and H. The XML Document Type Definition (DTD) of the rules file is reported in Figure 30.

    <!ELEMENT rules (rule+)>
    <!ELEMENT rule (t,h,probability)>
    <!ELEMENT t (string?,word*,tree?,semantics?)>
    <!ELEMENT h (string?,word*,tree?,semantics?)>
    <!ELEMENT string (#PCDATA)>
    <!ELEMENT word (attribute+)>
    <!ELEMENT attribute (#PCDATA)>
    <!ELEMENT tree (node+,edge*)>
    <!ELEMENT node (word|label)>
    <!ELEMENT label (#PCDATA)>
    <!ELEMENT edge EMPTY>
    <!ELEMENT semantics (entity*,relation*)>
    <!ELEMENT entity EMPTY>
    <!ELEMENT relation EMPTY>
    <!ATTLIST rule name CDATA #REQUIRED>
    <!ATTLIST t id CDATA #IMPLIED>
    <!ATTLIST h id CDATA #IMPLIED>
    <!ATTLIST word id CDATA #IMPLIED>
    <!ATTLIST attribute name CDATA #REQUIRED>
    <!ATTLIST tree id CDATA #IMPLIED>
    <!ATTLIST node id CDATA #IMPLIED>
    <!ATTLIST edge
        name CDATA #IMPLIED
        from CDATA #REQUIRED
        to CDATA #REQUIRED
    >

0. <rule name="1" entailment="Entailment"> <t><string>invented</string></t> <h><string>pioneered</string></h> <probability>1. The following are examples of entailment rules at the different levels allowed by ETAF.0</probability> </rule> Figure 32: Morpho-Syntactic Entailment Rule The entailment rule in Figure states that the lemma invent entails the lemma pioneer with a probability equal to 1. with some degree of confidence.<!ATTLIST relation name CDATA #REQUIRED entailment CDATA #IMPLIED source CDATA #IMPLIED > <!ATTLIST entity name CDATA #REQUIRED start CDATA #IMPLIED end CDATA #IMPLIED source CDATA #IMPLIED > Figure 30: DTD of the rule file In the current release of EDITS only rules that contain just one element both in t and h (i. the entailment relation between T and H. lexical rules) are allowed. <rule name="2" entailment="Entailment"> <t> <word> <attribute name="lemma">invent<attribute> <attribute name="pos">v</attribute> </word> </t> <h> <word> <attribute name="lemma">pioneer<attribute> <attribute name="pos">v</attribute> </word> </h> <probability>1.e. 6.1 Entailment Rules Entailment rules preserve.0</probability> </rule> Figure 31: String Entailment Rule The string entailment rule in Figure 31 states that the word "invented" entails the word "pioneered" with a probability equal to 1.0.1. <rule name="3" entailment="Entailment"> <t> 28 .

        <node>
          <attribute name="edge-to-parent">dobj</attribute>
          <word>
            <attribute name="token">home</attribute>
          </word>
        </node>
      </t>
      <h>
        <node>
          <attribute name="edge-to-parent">dobj</attribute>
          <word>
            <attribute name="token">habitation</attribute>
          </word>
        </node>
      </h>
      <probability>1.0</probability>
    </rule>

Figure 33: Syntactic Entailment Rule

The entailment rule in Figure 33 states that the node "home" in the dependency relation of direct object with its syntactic head (e.g. "John bought a house", where "house" is the direct object of the verb "buy") entails the node "habitation" in the dependency relation of direct object with its syntactic head (e.g. "John bought a habitation", where "habitation" is the direct object of the verb "buy") with a probability equal to 1.0.

6.1.2 Contradiction rules

Contradiction rules represent, with some degree of confidence, the semantic incompatibility between T and H. The following are examples of contradiction rules at the different levels allowed by ETAF.

    <rule name="1" entailment="Contradiction">
      <t><string>beautiful</string></t>
      <h><string>ugly</string></h>
      <probability>1.0</probability>
    </rule>

Figure 34: String Contradiction Rule

The contradiction rule in Figure 34 states that the string "beautiful" contradicts the string "ugly" with a probability equal to 1.0.

    <rule name="2" entailment="Contradiction">
      <t>
        <word>
          <attribute name="lemma">extend</attribute>
          <attribute name="pos">v</attribute>
        </word>
      </t>
      <h>
        <word>
          <attribute name="lemma">shorten</attribute>
          <attribute name="pos">v</attribute>
        </word>
      </h>

      <probability>1.0</probability>
    </rule>

Figure 35: Morpho-Syntactic Contradiction Rule

The contradiction rule in Figure 35 states that the lemma "extend" contradicts the lemma "shorten" with a probability equal to 1.0.

    <rule name="3" entailment="Contradiction">
      <t>
        <node>
          <attribute name="edge-to-parent">amod</attribute>
          <word>
            <attribute name="token">white</attribute>
          </word>
        </node>
      </t>
      <h>
        <node>
          <attribute name="edge-to-parent">amod</attribute>
          <word>
            <attribute name="token">black</attribute>
          </word>
        </node>
      </h>
      <probability>1.0</probability>
    </rule>

Figure 36: Syntactic Contradiction Rule

The contradiction rule in Figure 36 states that the node "white" in the dependency relation of adjectival modifier with its syntactic head (e.g. "Mary wears a white T-shirt", where "white" is the adjective modifying the noun "T-shirt") contradicts the node "black" in the dependency relation of adjectival modifier with its syntactic head (e.g. "Mary wears a black T-shirt", where "black" is the adjective modifying the noun "T-shirt") with a probability equal to 1.0.

6.2 Rules repository

In order to be used by EDITS, both entailment and contradiction rules have to be stored in a rule repository. EDITS allows multiple XML rule files to be declared and used as sets of entailment or contradiction rules that can be referred to using user-defined identifiers. An example of entailment rules repository configuration can be found in Figure 37. As an example, the declaration below defines two rule repositories, which contain a rule file each, called, respectively, entailment-rules-wordnet and contradiction-rules-wordnet.

    <module type="entailment-engine">
      <module type="rule-repository" alias="memory" id="wordnet-entailment">
        <option name="rules-file" value="${EDITS}/share/rules/entailment-ruleswordnet.xml"/>
      </module>
      <module type="rule-repository" alias="memory" id="wordnet-contradiction">
        <option name="rules-file" value="${EDITS}/share/rules/contradicion-ruleswordnet.xml"/>
      </module>
    </module>

Figure 37: Entailment Rules Repository Configuration

6.3 Rule activation

The basic way of activating a rule is through one of the two functions, entail and contradict, which can be used within a cost scheme (see Section 5). The two functions check for the existence of an entailment or a contradiction rule between the values assumed by A and B in a cost scheme. If a rule exists in the specified repository which matches both A and B, then the probability associated to the rule is returned, otherwise null. The entail and contradict functions are called in cost schemes, typically the cost scheme defined for the substitution edit operation. The two functions accept four parameters:

1. The first two parameters X and Y are portions of T and H managed by the distance algorithm and the cost scheme.
2. The name of a set of rules in the rules repository where the search has to be carried out.
3. The search modality. Two search modalities are allowed: First, which selects the first rule that matches the X and Y parameters; and Max, which selects the rule that matches X and Y with the highest probability. This parameter is optional.

A rule is activated when the X parameter of the entail/contradict function matches the T part of the rule, i.e. the left hand side, and the Y parameter of the function matches the H part of the rule, i.e. the right hand side. All the elements of the X/Y argument have to match against all elements of the rule. In case the rule contains variables, their assignments to corresponding elements of the X/Y argument need to be satisfied.

As an example, the following call to the entail function:

    (entail X Y "wordnet-entailment" :max)

searches for the rule with the highest probability among those that are activated by the A and B parameters and that are contained in a rules repository called entailment-rules-wordnet.

Figure 38 shows a substitution that calculates the cost of substituting A with B based on the inverse of the probability of the entailment rule between A and B in the repository called entailment-rules-wordnet.

    <substitution name="entail">
      <cost>(- 1 (entail A B "wordnet-entailment" :max))</cost>
    </substitution>

Figure 38: Substitution using Entailment Function
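The two search modalities and the cost expression of Figure 38 can be illustrated with a small Python sketch. Everything in it is hypothetical and only mimics the behaviour described above: the rule list, its probabilities, the function names, and the choice of "first" as the default modality are assumptions, since the manual does not state which modality applies when the parameter is omitted.

```python
# Hypothetical in-memory repository: (left-hand side, right-hand side, probability).
RULES = [
    ("invented", "pioneered", 0.6),
    ("invented", "pioneered", 1.0),
    ("beautiful", "pretty", 0.8),
]


def entail(x, y, rules, mode="first"):
    """Return the probability of a rule matching x -> y, or None if none exists."""
    matches = [prob for lhs, rhs, prob in rules if lhs == x and rhs == y]
    if not matches:
        return None                      # plays the role of null
    return max(matches) if mode == "max" else matches[0]


def substitution_cost(a, b, rules):
    """Mirror of the Figure 38 expression: 1 minus the best rule probability."""
    probability = entail(a, b, rules, mode="max")
    if probability is None:
        return None                      # behaviour for null is not specified in the manual
    return 1 - probability


print(entail("invented", "pioneered", RULES))             # first match: 0.6
print(entail("invented", "pioneered", RULES, "max"))      # best match: 1.0
print(substitution_cost("invented", "pioneered", RULES))  # 0.0
```

Only exact string matching is sketched here; matching of WORD and NODE objects, and rules with variables, would require comparing attributes rather than plain strings.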

7 EDITS Configuration File

The purpose of a configuration file is to define the three basic modules (i.e. distance algorithm, cost scheme and rule repositories), and their corresponding parameters, that will actually be used while running EDITS on a certain dataset. Only modules defined in an EDITS Configuration File (ECF) can be used for training and testing with the command bin/edits (see Sections 4.2 and 4.3). The XML Document Type Definition (DTD) of the configuration file is reported in Figure 39. Examples of configuration files can be found in the share/configuration folder.

    <!ELEMENT conf (constant*, module*)>
    <!ELEMENT module (module*, option*, mlink*)>
    <!ELEMENT option EMPTY>
    <!ELEMENT mlink EMPTY>
    <!ATTLIST module
        type CDATA #IMPLIED
        id CDATA #IMPLIED
        alias CDATA #IMPLIED
        className CDATA #IMPLIED >
    <!ATTLIST option
        name CDATA #REQUIRED
        id CDATA #IMPLIED
        value CDATA #IMPLIED >
    <!ATTLIST mlink
        idref CDATA #REQUIRED >

Figure 39: DTD of the configuration file

7.1 Module Configuration

Modules are defined by the following pieces of information:

• alias: an internal identifier known to the system. All modules with their internal identifiers are listed in the HTML reference.
• className: a path to the Java class of the module, referring to the code that will be executed when the module is activated.
• type: indicates the category of the module being defined. Accepted values for the type attribute are entailment-engine, distance-algorithm, cost-scheme, rules-repository etc. (see the HTML reference).
• id: a unique identifier for the module, assigned by the user.
• option: sets the options of the module.

A module may require that another module is defined in order to work. Such dependencies are expressed in a configuration file through nested modules. The whole EDITS configuration is itself considered a module, the most global one, called the entailment engine, which requires three nested modules, respectively for the distance algorithm, the cost scheme and the rules repository.

7.2 Usage of Constants

EDITS allows the values of options that are frequently used in the configuration file to be referred to through constants, declared at the beginning of the configuration file. For example, the configuration in Figure 40 uses the ${DATA_PATH} variable to indicate the access path to cost scheme resources. All configurations can access the path of the EDITS installation using the constant "EDITS_PATH".

    <conf>
      <constant name="DATA_PATH">/home/epack/edits</constant>
      <module alias="distance">
        <module alias="tree"/>
        <module alias="xml">
          <option name="scheme-file" value="${DATA_PATH}/share/cost-schemes/idfscheme.xml"/>
          <option name="hash-file" id="idf" value="${DATA_PATH}/share/cost-schemes/idf.txt"/>
          <option name="hash-file" id="stopwords" value="${DATA_PATH}/share/cost-schemes/stopwords.txt"/>
        </module>
        <module alias="pso"/>
      </module>
    </conf>

Figure 40: Example of variable in the configuration file.
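To make the nesting of modules and the expansion of constants concrete, the following Python sketch reads a shortened copy of the Figure 40 configuration and prints each module with its expanded option values. It is only an illustration of the file format and not how EDITS itself loads a configuration; the helper names load_constants, expand and describe are hypothetical.

```python
import string
import xml.etree.ElementTree as ET

# Shortened copy of the configuration in Figure 40.
CONFIG = """<conf>
  <constant name="DATA_PATH">/home/epack/edits</constant>
  <module alias="distance">
    <module alias="tree"/>
    <module alias="xml">
      <option name="scheme-file" value="${DATA_PATH}/share/cost-schemes/idfscheme.xml"/>
    </module>
    <module alias="pso"/>
  </module>
</conf>"""


def load_constants(root):
    """Collect the <constant> elements declared at the top of the file."""
    return {c.get("name"): (c.text or "").strip() for c in root.findall("constant")}


def expand(value, constants):
    """Replace ${NAME} references; unknown names are left untouched."""
    return string.Template(value).safe_substitute(constants)


def describe(module, constants, depth=0):
    """Return one indented line per module and per option, depth first."""
    name = module.get("alias") or module.get("id") or "?"
    lines = ["  " * depth + "module " + name]
    for option in module.findall("option"):
        value = expand(option.get("value", ""), constants)
        lines.append("  " * (depth + 1) + option.get("name") + " = " + value)
    for child in module.findall("module"):
        lines.extend(describe(child, constants, depth + 1))
    return lines


root = ET.fromstring(CONFIG)
constants = load_constants(root)
for line in describe(root.find("module"), constants):
    print(line)
```

Because safe_substitute is used, a reference such as ${EDITS_PATH} that is not declared in the file is passed through unchanged rather than raising an error, which leaves room for constants supplied by the system.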

8 EDITS Graphical Interface

This section contains several snapshots of the EDITS graphical interface. The graphical interface is started with the command:

    edits -g

The graphical interface is a simple editor for configurations, cost schemes and experiments. It can also display entailment corpora, EDITS output and EDITS models. The interface is organized as a desktop in which different views are opened as windows. The user can copy and cut objects from one window and paste them into another window of the same type.

Figure 41: New Configuration Simple Engine

The interface in Figure 41 is a dialog for creating a new entailment engine. The user must select one algorithm to create a configuration for a simple entailment engine, or more than one for a combined entailment engine. In the latter case a combination strategy must also be selected, as shown in Figure 42.

Figure 42: New Configuration Combined Engine

Figure 43: Configuration Editor - Conf1.xml

EDITS provides an interface for editing a configuration file, as presented in Figure 43. The configuration file is represented as a tree. The user can interact with it through the context menu that appears when right-clicking a node of the tree.

Figure 44: Cost Scheme Editor - Simple Cost Scheme

EDITS provides an interface for creating and editing a cost scheme file, as presented in Figure 44. The cost scheme file is represented as a tree. The user can interact with it through the context menu that appears when right-clicking a node of the tree.

Figure 45: ETAF Corpus Representation

EDITS provides an interface for browsing entailment corpora, as presented in Figure 45. The files are represented as a table containing the entailment pairs in the rows and their contents in the columns. The user can view the annotation of each pair by clicking the view button. The morpho-syntactic annotation of a pair is presented as a table in Figure 46. The syntactic annotation of a pair is presented as a tree in Figure 47.

Figure 46: ETAF Morpho-Syntactic Annotation of an Entailment Pair

Figure 47: ETAF Syntactic Annotation of a Pair

EDITS provides an interface for browsing an EDITS output, as presented in Figure 48. The table is similar to the interface for browsing an entailment corpus. New columns are added to represent the additional information (score, confidence etc.) that comes in the EDITS output. The view button provides a simple viewer, shown in Figure 49, for the edit operations log attached to each pair.

Figure 48: EDITS output

Figure 49: EDIT Operations

EDITS provides an interface for creating and editing an experiment file, as presented in Figure 50. The interface allows the user to change the basic elements of an experiment and execute the experiment to obtain a result, as shown in Figure 51.

Figure 50: Experiment interface

Figure 51: Experiment Result

EDITS provides an interface for viewing the contents of an EDITS model, as presented in Figure 52.

Figure 52: EDITS Model View