
Improving the Comprehension of Legal Documentation:

The Case of Patent Claims

Nadjet Bouayad-Agha, Gerard Casamayor, Gabriela Ferraro, Simon Mille, Vanesa Vidal
Universitat Pompeu Fabra and Barcelona Media
Barcelona, Spain
firstname.lastname@upf.edu

Leo Wanner
Institució Catalana de Recerca i Estudis Avançats (ICREA) and Universitat Pompeu Fabra
Barcelona, Spain
leo.wanner@icrea.es

ABSTRACT
With their abstract vocabulary and overly long sentences, patent claims, like several other genres of legal discourse, are notoriously difficult to read and comprehend. The enormous number of both native and non-native users reading patent claims on a daily basis raises the demand for means that make them easier and faster to understand. An obvious way to satisfy this demand is to paraphrase the original material, i.e., to rewrite it in a more appropriate style, or—even better—to summarize it in the language of preference of the reader such that the reader can rapidly grasp its essence. PATExpert is a patent processing service which incorporates, among other technologies, paraphrasing and multilingual summarization of patent claims. With the goal to offer the user the most suitable options and to evaluate alternative techniques that are based on different contextual and linguistic criteria, both paraphrasing and summarization implement “surface-oriented” strategies and “deep” strategies. The surface strategies make use of shallow linguistic criteria such as punctuation and syntactic and lexical markers. The deep strategies operate on deep-syntactic structures of the claims, using a full-fledged text generator for the synthesis of the paraphrase or summary, respectively.

1. INTRODUCTION
With their abstract vocabulary and overly long sentences, patent claims, like several other genres of legal discourse, are notoriously difficult to read and comprehend. Yet given that they define the legal scope of the invention they describe, their comprehension is crucial to specialists and non-specialists alike. The enormous number of both native and non-native users reading patent claims on a daily basis further raises the demand for means that make patent claims easier and faster to understand. An obvious way to satisfy this demand is to paraphrase the original material, i.e., to rewrite it in a more appropriate style, or—even better—to summarize it in the language of preference of the reader such that the reader can rapidly grasp its essence and decide then whether he wants to go on reading the paraphrased version or the original.
Paraphrasing and summarization are popular research topics in computational linguistics; a number of techniques have thus already been developed. However, the peculiar style and abstract vocabulary of patent claims make the application of standard techniques that target general discourse difficult. In particular in the area of text summarization, it is well known that specialized discourse requires discourse- and genre-specific strategies [1, 2, 14]. This is especially true for patent summarization.
We present the techniques for paraphrasing and multilingual summarization of patent claims developed in the PATExpert project [15].¹ In order to account for the different needs a user might have and to evaluate alternative techniques that are based on different contextual and linguistic criteria, both paraphrasing and summarization implement “surface-oriented” strategies and “deep” strategies. The surface strategies make use of shallow linguistic criteria such as punctuation and syntactic and lexical markers. The deep strategies operate on deep-syntactic or shallow semantic structures of the claims and thus presuppose a full-fledged text generator for the synthesis of the paraphrase or summary, respectively. All strategies share several common preprocessing stages, which are first of all due to the peculiar linguistic style of the claims. Therefore, we implement them in one single module—although paraphrasing and summarization have traditionally been considered clearly divided tasks.
The paper is organized as follows. In Section 2, we present an overview of the system architecture. The overview is followed by a detailed description of each of the submodules, together with the presentation of the paraphrasing and summarization techniques used in the corresponding submodule (Sections 3 to 5). In Section 6, we discuss the works that are related to ours. Finally, Section 7 concludes with some remarks and plans for future work. An Annex, which has been added for the convenience of the reader, contains an original patent claim and the corresponding PATExpert summaries in English and French.

¹ PATExpert has been partially funded by the European Commission under the contract number FP6-028116.

2. SYSTEM ARCHITECTURE
The different paraphrasing and summarization strategies
realized in PATExpert are interconnected in that the result
of one strategy serves as starting point for another strategy.
We can thus speak of a single pipeline of sub-modules. Such an
architecture has two advantages: (i) a single pipeline is easier
to maintain than several isolated sub-modules that act
independently of each other, and (ii) the user can be offered the
output of the intermediate modules as a valid paraphrase or
summary, respectively. Figure 1 illustrates in detail the ar-
chitecture of the paraphrasing and summarization module.
The architecture is divided into three main submodules:

1. Patent claims dependency structuring, which is responsible for the identification and representation of the relation between independent and dependent claims.

2. Claim sentence simplification, which segments each original claim sentence and “repairs” the segments so as to obtain grammatical sentences. Furthermore, this submodule determines the discourse structure over the obtained segments and the co-referential links between noun phrases. Thus no information is lost during the simplification process and the full meaning can be reproduced in the next stage.

3. Generation and deep-syntactic (or shallow semantic) pruning, which, given a deep-syntactic structure of each shorter sentence, first either removes parts of the structure according to specific summarization criteria or leaves it as it is, and then generates from the pruned (respectively original) structure a text in English or in any of the languages available. So far, we cover, apart from English, German, French and Spanish.

Figure 1: Architecture of the paraphrasing and summarization module (Greyed-out processes are optional)

3. CLAIMS DEPENDENCY STRUCTURING
The text structure of a patent claim section is a tree that is predefined by the dependency between the claims, made explicit by textual references such as according to Claim 1, as set forth in claims 1 to 3, as in claims 3-5, as defined by claims 1 and 2, etc. Structures of different complexity can be encountered. Errors are not uncommon and must be dealt with. Thus, one of the errors that is especially disturbing for discourse processing is a claim referring back to itself, directly or indirectly, or referring back to a claim that does not exist.
Level 1 of the resulting text structure tree corresponds to independent claims (claims that do not refer back to superordinated claims), whilst the other levels correspond to dependent claims. Consider, for illustration, a sample claim structure of a German patent in Figure 2. The numbers stand for the individual claims; the arcs connect dependent claims with the claims on which they depend. Dependent claims cannot stand on their own and must be read in the context of the claims they depend on in order to be fully understood [10]. Furthermore, we can assume that the deeper a claim is in the claim structure, the more detailed is the information it renders. This is exploited in the claim dependency structure based summarization approach, which prunes the dependency tree according to a user-defined percentage, with 100% meaning all dependent claims are kept, and 0% meaning no dependent claims are kept.

Figure 2: Illustration of the claim dependency structure of a German claim
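The following minimal Python sketch illustrates the two operations just described: building the claim dependency structure from back-references and pruning dependent claims according to a user-defined percentage. The reference pattern, the function names and the policy of discarding the deepest claims first are illustrative assumptions, not the actual PATExpert implementation.

import re

REF = re.compile(r'claims?\s+(\d+)(?:\s*(?:to|-|and)\s*(\d+))?', re.IGNORECASE)

def claim_parents(claim_text):
    """Return the claim numbers a claim refers back to (empty = independent)."""
    parents = set()
    for m in REF.finditer(claim_text):
        lo = int(m.group(1))
        hi = int(m.group(2)) if m.group(2) else lo
        parents.update(range(lo, hi + 1))
    return parents

def depth(n, parents, memo):
    # Depth 1 = independent claim; deeper claims carry more detailed information.
    if n not in memo:
        memo[n] = 1 if not parents[n] else 1 + max(depth(p, parents, memo) for p in parents[n])
    return memo[n]

def prune_claims(claims, rate):
    """Keep all independent claims plus the given fraction of dependent claims,
    discarding the deepest (most detailed) ones first."""
    parents = {n: {p for p in claim_parents(t) if p < n} for n, t in claims.items()}
    memo = {}
    dependent = sorted((n for n in claims if parents[n]), key=lambda n: depth(n, parents, memo))
    keep = {n for n in claims if not parents[n]}
    keep.update(dependent[:int(round(rate * len(dependent)))])
    return {n: claims[n] for n in sorted(keep)}

claims = {1: "An optical disk drive comprising ...",
          2: "The drive according to claim 1, wherein ...",
          3: "The drive as in claims 1 and 2, wherein ..."}
print(sorted(prune_claims(claims, 0.5)))   # e.g. [1, 2]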
4. SIMPLIFICATION
As shown in Figure 1, the simplification proper of each original claim sentence is performed in five separate stages:

(i) part-of-speech (POS) tagging and chunking using TreeTagger [11] with its off-the-shelf English parameter set;

(ii) segmentation of a claim sentence into clausal discourse units;

(iii) establishment of coreference links between NPs that denote the same object;

(iv) building a clause-discourse tree drawing upon coreference links and other information such as tree configuration and discourse markers;

(v) reconstruction of the individual segments in order to obtain grammatically correct independent sentences out of each of them.

Let us discuss these five stages (and, in particular, the fourth of them, since clause-discourse structuring is primary for both the simplification and the subsequent discourse summarization).

4.1 Stages of the simplification
POS tagging is a standard procedure in corpus-oriented computational linguistics and does not require any further mention here. Sentence segmentation is carried out using the best from a set of simple rule-based and machine-learning based segmentation algorithms that use lexical, punctuation, POS and chunk information. For coreference determination, a simple coreference resolution algorithm has been implemented that relies on the nominal phrase (NP) repetition characteristic of patent claims to realize coreference.² Roughly, we consider that two NPs corefer if they have identical N′ (i.e., the two NPs without determiners are identical). Given that each claim is realized in a single sentence, the discourse tree construction relies on the identification of the sentence's clause structure, that is, the identification of the sentence's coordination, subordination and juxtaposition relations. Clause-discourse structuring consists of the following main substages:

² NP repetition (instead of, for instance, pronominalization) is one of the means used in patents to avoid ambiguity.

1. Clause structuring: The output of this substage is a binary tree whose terminal nodes are the segments (aka clauses) obtained in the previous stage and whose intermediate nodes specify whether the subtree is a subordination, coordination or juxtaposition.

2. Clause tree flattening: The obtained binary clause tree is flattened to account for n-ary constructs such as coordinations.

3. Projection of the clause structure onto the discourse structure: Each intermediate node of the tree is enriched with discourse information based on Rhetorical Structure Theory (RST) [5], labelling the discourse units (= spans) as nucleus/satellite and determining the relation between them according to a set of rules that use the type of constructs and lexical information.

Clause structuring searches for the best clause structure in a space restricted by a set of weighted rules, each of which is further restricted by a set of constraints. The rules encode the fundamental features for the identification of coordination, subordination and juxtaposition relations between spans, whilst the constraints ensure the syntactic correctness and global coherence of the tree under construction. This rule-based approach proved reasonable given the highly regulated linguistic style of the patent domain.

A rule R is a quadruple <F1, F2, W0, Wc>, where:

• F1 and F2 are the feature tuples describing the two spans S1 and S2 to join.

• W0 is the rule's initial weight.

• Wc is a set of weighted constraints.

The feature tuples are quintuples Fi of the kind <punct, coord, subord, syntagm, colon>, where punct is the clause-delimiter punctuation preceding the segment, coord is a coordination marker, subord is a subordination marker, syntagm is the span's syntactic group (mainly Sentence, Noun Phrase or Verb Phrase), and colon indicates whether the clause contains a colon.³ For instance, one of the rules addresses the plain coordination between spans whose features are <*,0,+,+,*> and <',',+,+,+,*>. We currently have ten rules, each specializing in one of the following constructs: plain coordination, semi-colon coordination, coordination of subordinations, juxtaposition, and subordination. The initial weight of the rules is currently set to 1, apart from a fall-back rule which is set to 0.001, as it must only apply if all else fails (see immediately below).

³ In our approach, a segment is a sentence if it starts with a noun phrase followed by a verb phrase.

The currently implemented constraints are presented in Table 1. The rules, constraints and associated weights for rule application were adjusted in a series of iterative trials based on the evaluation of a small development corpus of manually structured claims. The development corpus consists of eight claims (104 segments) from the Optical Recording Devices domain, selected because of their reasonable automatic segmentation and size (an average of 10 segments per claim), with the aim that they provide good coverage for the first round of experiments (see the evaluation). The weights reflect whether a constraint is a strong or soft preference or a hard syntactic constraint. For example, constraint 7 makes coordination a barrier to subordination, disallowing the situation in which S2 is subordinated to S1 if S1 is a coordinated clause. Constraint 8 rewards thematic continuity, giving the highest weight to the case in which the first NP of S2 corefers with some NP of S1 and the lowest weight to the one in which no NP of S2 corefers in S1.
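Before turning to the constraints listed in Table 1, the following sketch shows one possible encoding of the rule quadruples and feature quintuples just described. How W0 and the constraint weights are combined is not spelled out above, so a simple product is assumed here purely for illustration; the pattern values and names are likewise assumptions.

from dataclasses import dataclass
from typing import Callable, List, Tuple

# Feature quintuple <punct, coord, subord, syntagm, colon> of a span.
Features = Tuple[str, bool, bool, str, bool]

@dataclass
class Constraint:
    name: str
    weight_if_true: float
    weight_if_false: float
    test: Callable[[Features, Features], bool]

@dataclass
class Rule:
    name: str
    f1: tuple            # feature pattern of span S1 ('*' = any value)
    f2: tuple            # feature pattern of span S2
    w0: float             # initial weight W0
    wc: List[Constraint]   # weighted constraints Wc

    def matches(self, s1: Features, s2: Features) -> bool:
        return all(p == '*' or p == v for p, v in zip(self.f1, s1)) and \
               all(p == '*' or p == v for p, v in zip(self.f2, s2))

    def score(self, s1: Features, s2: Features) -> float:
        # Assumed combination: initial weight times the weight of each constraint.
        w = self.w0
        for c in self.wc:
            w *= c.weight_if_true if c.test(s1, s2) else c.weight_if_false
        return w

# Constraint C1 of Table 1: both spans have the same syntagm.
same_syntagm = Constraint("same syntagm", 1.0, 0.5, lambda s1, s2: s1[3] == s2[3])

plain_coord = Rule("plain coordination",
                   ('*', False, True, 'S', '*'),
                   (',', True, True, 'S', '*'),
                   w0=1.0, wc=[same_syntagm])

s1 = ('', False, True, 'S', False)
s2 = (',', True, True, 'S', False)
print(plain_coord.matches(s1, s2), plain_coord.score(s1, s2))   # True 1.0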
Constraints                                                                   Weights

Plain coordination, Coordination of subordinations:
C1) Spans have the same syntagm.                                              Yes=1, No=0.5
C2) If spans are not separated by a comma, then their core clauses            Yes=1, No=0.5
    are contiguous (and the reverse is true).                                 (or reverse)
C3) Non-terminal S2 is constructed with a coordination.                       Yes=1, No=0

Coordination of subordinations:
C4) Spans must be introduced by the same subordination marker.                Yes=1, No=0.1

Semi-colon coordination:
C5) Non-terminal S2 is constructed with a semi-colon coordination.            Yes=1, No=0

Subordination:
C6) Favour right branching.                                                   Yes=1, No=0.1
C7) Non-terminal S1 is a coordination.                                        Yes=0, No=1

Subordination, Juxtaposition:
C8) Favour spans with higher-ranking coreferring NPs, taking into account     1st=1, 2nd=0.9,
    a focus rank of noun phrases: 1st > 2nd > 3rd > none                      3rd=0.8, none=0.7

Table 1: Constraints on clause structuring.

For a given set of segments, the number of possible syntactic analyses which a set of rules can yield grows exponentially with the number of segments a sentence can be divided into, thus rendering the problem of exploring all possible trees intractable when the number of segments is high enough. Due to the complexity of claim sentences, large segment sets are common. For this reason, a variation of a local beam search algorithm was used to search amongst the various possible syntactic trees, with the metric calculated for each application of a rule serving as the objective function guiding the algorithm. The goal of this algorithm is a rooted tree. In order to obtain the optimal tree, the algorithm is allowed to backtrack once it has reached a goal and explore alternative syntactic trees, up to a fixed limit set to ensure computational efficiency.
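To make the search procedure concrete, the following self-contained sketch implements a local beam search over clause forests in the spirit described above. The stand-in scoring function and the representation of partial trees as nested tuples are simplifying assumptions, and the backtracking step mentioned above is omitted.

import heapq
from itertools import count

def join_score(left, right):
    # Stand-in objective: mildly prefer right-branching joins, in the spirit of
    # constraint C6 of Table 1. In the real module this score comes from the
    # weighted rules and constraints.
    return 1.0 if isinstance(right, tuple) or isinstance(left, str) else 0.5

def beam_search(segments, beam_width=3):
    tick = count()                                  # tie-breaker for the heap
    states = [(0.0, next(tick), list(segments))]    # (negative score, id, forest)
    while True:
        finished = [s for s in states if len(s[2]) == 1]
        if finished:
            best = min(finished)                    # lowest negative score = highest score
            return best[2][0], -best[0]
        successors = []
        for neg, _, forest in states:
            for i in range(len(forest) - 1):
                # Join two adjacent partial trees into a binary node.
                joined = forest[:i] + [(forest[i], forest[i + 1])] + forest[i + 2:]
                score = -neg + join_score(forest[i], forest[i + 1])
                heapq.heappush(successors, (-score, next(tick), joined))
        # Keep only the best beam_width partial analyses.
        states = [heapq.heappop(successors) for _ in range(min(beam_width, len(successors)))]

tree, score = beam_search(["S1", "S2", "S3", "S4"])
print(tree, score)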
Once the clause-discourse structuring is finished, the reconstruction of each segment takes place, which transforms each segment into a complete independent sentence using a set of rules that add a missing subject or verb, conjugate verbs in non-finite clauses, remove initial markers, or change the order of constituents.

4.2 Evaluation experiments
We present the evaluation of the most relevant stages of our simplification module, i.e., segmentation (as it is the basis for clause structuring and sentence reconstruction), coreference (as it is one of the outputs of the simplification module) and clause structuring (as it is the basis for our discourse structuring approach).

Evaluation of the segmentation
We manually constructed a gold standard corpus for the segmentation task totalling 1011 claims and 6723 segments in the Optical Recording Devices domain and 486 claims and 3101 segments in the Machine Tools domain. Each claim was segmented by two independent annotators, who then discussed their differences and reached a consensus on the correct segmentation.
We experimented with both rule-based and machine learning segmentation. For the latter, we applied Weka's J48 decision tree learner [17] to a vector of basic features comprising POS, single-keyword and punctuation information for the previous, current and next token, using cross-validation on 581 claims (3942 segments) of our gold standard in the Optical Recording Devices domain. For the rule-based approach, we experimented with a variety of strategies on a test corpus from the gold standard. We found that the strategy that yields the best results uses the following information: semi-colon, comma, and about twenty representative lexical markers and expressions of the patent domain such as in which, characterized in that, so as to, and other more ambiguous ones such as and, for, by when followed by a verb phrase. We used a strict evaluation that counts 1:1 alignments. The results are shown in Table 2. The machine learning results are close to the rule-based results and will probably improve given richer information such as chunking and multi-word keywords.

Evaluation of the coreference resolution
We evaluated our coreference resolution approach on a set of 30 claims, 15 from the Optical Recording Devices domain and 15 from the Machine Tools domain, evenly selected from the set of independent and dependent claims and manually annotated for coreference. The results presented in Table 3 show a precision of 83% and a recall of 79%—which reinforces the observation of NP repetition in patent material. We performed an error analysis and found the following main types of errors:

Chunking errors. The chunker failed to mark a chunk as an NP, resulting in a false negative.

NP modifier. Identical NPs used as modifiers in a more complex NP were marked as coreferring while they were not—for example, inner surface of the substrate and inner surface of the boot.

Abstract NPs. Abstract nouns like information or direction are often repeated within a claim but do not corefer.

Partial match. Some co-referring NPs such as rotary valve assembly and valve assembly are not an exact match and thus are not marked as co-referring.

Evaluation of clause structuring
We performed the evaluation of clause structuring on our development and test corpora, using as input both the raw claims and the claims manually segmented and coreferenced (i.e., perfect input). Our baseline performs clause structuring based on right branching, given the number of segments in the gold standard. For the evaluation, we count the number of identical spans between the automatic and the manual structuring, excluding the top span as this would always be counted as correct. In order to be able to compare spans, we automatically map each segment of the raw input to its corresponding gold standard segment.
# manual # automatic #1:1 Prec Rec F-score
Development corpus, optical recordings:
Rule-based 3942 3522 2509 71% 63% 66%
J48 3942 3411 2217 64% 56% 59%
Test corpus, machine tools:
Rule-based 3101 2807 1867 66% 60% 62%
J48 3101 2606 1639 62% 52% 56%
Test corpus, optical recordings:
Rule-based 2781 2535 1664 65% 59% 61%
J48 2781 2352 1521 64% 54% 58%

Table 2: Evaluation results for segmentation (# = number of segments)
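For illustration, the following is a rough sketch of the marker-based splitting underlying the rule-based figures in Table 2. The marker list is only a small, illustrative subset of the roughly twenty markers mentioned above, and the "followed by a verb phrase" condition for the ambiguous markers is crudely approximated by a gerund check.

import re

MARKERS = ["characterized in that", "so as to", "in which", "wherein"]
AMBIGUOUS = ["and", "for", "by"]   # split only if followed by (an approximation of) a verb phrase

SPLIT = re.compile(
    r";\s+|,\s+(?=(?:{amb})\s+\w+ing\b)|\s+(?=(?:{mrk})\b)".format(
        amb="|".join(AMBIGUOUS), mrk="|".join(MARKERS)), re.IGNORECASE)

def segment(claim):
    return [s.strip(" ,;") for s in SPLIT.split(claim) if s.strip(" ,;")]

claim = ("An optical disk drive comprising: a laser light source for emitting "
         "a laser beam; an optical system for conversing the laser beam, "
         "characterized in that the light is transmitted")
for s in segment(claim):
    print(s)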

#automatic #manual #correct Precision Recall F-score


Machine Tools 106 118 93 87% 78% 82%
Optical Recordings 84 81 66 78% 81% 79%
Total 190 199 159 83% 79% 81%

Table 3: Evaluation results for coreference (# = number of coreferring NPs)
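The NP-repetition criterion of Section 4.1, whose results are reported in Table 3, can be sketched as follows; the determiner list and the string-level NP representation are illustrative assumptions (in the module, NP chunks are obtained from the TreeTagger-based chunking).

DETERMINERS = {"a", "an", "the", "said", "this", "these", "that", "those"}

def n_bar(np):
    """Strip leading determiners from an NP and normalize case (identical N')."""
    words = np.lower().split()
    while words and words[0] in DETERMINERS:
        words = words[1:]
    return " ".join(words)

def corefer(np1, np2):
    return n_bar(np1) == n_bar(np2)

print(corefer("a laser light source", "the laser light source"))   # True
print(corefer("the rotary valve assembly", "the valve assembly"))  # False (a "partial match" error)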

The results of the evaluation with the test corpus are shown in Table 4. In the development corpus (i.e., 8 claims, 104 segments), we achieved an F-score of 62% with raw input in the Optical Recording Devices domain, whilst in the test corpus our F-score is 42% for the corresponding input and domain, as shown in the table. Our best score is achieved in the Optical Recording Devices domain—the same domain as the development corpus—with an F-score of 61% with perfect input, whilst clause structuring in the same domain achieves 45% from raw input and 32% with the right-branching strategy.
In order to find out about the contribution of each constraint defined in Table 1 to the overall performance, we ran nine rounds of the evaluation with the perfect input in both domains, each time omitting one of the constraints. The F-scores of these rounds are presented in Figure 3, with the label none referring to the situation where no constraints are omitted, and the label all to the situation where all constraints are omitted. Not surprisingly, the constraint that contributes most to the overall performance in both domains is right branching. On the other hand, S1 coord and same syntagm in the Optical Recording Devices domain and S2 coord and same subordinator in the Machine Tools domain do not contribute to the overall performance.

4.3 Discourse structure based summarization
Discourse structure based summarization was originally tackled by Marcu [6] by pruning the satellites of an RST tree in order to obtain a summary of the desired length. In our approach, the discourse structure is pruned according to the following three criteria:

Depth: all spans above a given depth (expressed as a percentage) are preserved.

Discourse relations: a list of discourse relations can be provided so that the discourse summarizer prunes the spans whose relations are in the list, starting with the first one on that list. A sample hierarchy is: PURPOSE > MEANS > ELABORATION-OBJECT-ATTRIBUTE > ELABORATION-LOCATION. This means that ELABORATION-LOCATION spans can be eliminated before ELABORATION-OBJECT-ATTRIBUTE spans. In their turn, ELABORATION-OBJECT-ATTRIBUTE spans can be eliminated before MEANS spans, which can be eliminated before PURPOSE spans.

Discourse markers: a list of discourse markers at the beginning of spans that can be considered for elimination is provided in descending order. A sample hierarchy is: WHEN > BY > FOR.

The system can be configured such that any combination of the three criteria can be selected and the criteria can be applied in any order desired. In other words, the discourse summarization algorithm will apply one criterion after another and stop once it has reached the desired summarization rate. As an example, we show the discourse summarization of claim example (1), reproduced below, whose abbreviated RST discourse structure in Figure 4 has a maximum depth of 4, and where nucleus spans are indicated by solid lines and satellite spans by dashed lines. The required summarization rate is 50%, and the criteria to apply are depth of 50% (that is, up to 2 levels in our example), followed by discourse relation based pruning given the partial order above. The depth criterion eliminates segments S5, S7, S11, S17, S18 and S20, whilst the discourse relation criterion eliminates S9, S10 and S14, thus satisfying the 50% summarization rate.
Input           Domain   #automatic  #manual  #correct  Prec  Rec  F-score
Gold standard   MT          112        113       58      51%   51%   51%
                OR           89         91       56      62%   61%   61%
                Total       201        204      114      56%   55%   55%
Raw             MT           82        113       40      48%   35%   40%
                OR           61         91       35      57%   38%   45%
                Total       143        204       75      52%   36%   42%
Baseline        MT          123        113       47      38%   41%   39%
                OR          104         91       32      30%   35%   32%
                Total       227        204       79      34%   38%   35%

Table 4: Performance of clause structuring on the test corpus (# = number of spans; MT = Machine Tools, OR = Optical Recordings)

Figure 3: Performance of clause structuring with constraint omission from perfect input in both domains

Figure 4: The (abbreviated) discourse structure for the claim in (1)


(1) [An optical disk drive comprising: a laser light source]S1 [for emitting a laser beam;]S2 [an optical system]S3 [for conversing the laser beam from the laser light source on a signal plane of optical disk]S4 [on which signal marks are formed]S5 [and for transmitting the light]S6 [reflected from the signal plane;]S7 [one or more optical components]S8 [arranged in the optical path between the laser light source and the optical disk]S9 [for making the distribution of the laser beam converged by the conversing means]S10 [located on a ring belt just after the passage of an aperture plane of the optical system;]S11 [a detection means]S12 [for detecting the light]S13 [reflected from the optical disk;]S14 [and a signal processing circuit]S15 [for generating a secondary differential signal]S16 [by differentiating the signals]S17 [detected by the detection means]S18 [and for detecting the edge positions of the signal marks]S19 [by comparing the secondary differential signal with a detection level.]S20
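A minimal sketch of the pruning procedure of Section 4.3 is given below, assuming a simple tuple representation (segment, relation to parent, children) for the discourse tree. The tree in the example is a small toy fragment, not the actual structure of claim (1) in Figure 4, and the relation-order list merely mirrors the sample hierarchy above.

RELATION_ORDER = ["ELABORATION-LOCATION", "ELABORATION-OBJECT-ATTRIBUTE",
                  "MEANS", "PURPOSE"]   # satellites are eliminated in this order

def prune_by_depth(node, max_depth, depth=1):
    seg, rel, children = node
    kept = [prune_by_depth(c, max_depth, depth + 1)
            for c in children if depth + 1 <= max_depth]
    return (seg, rel, kept)

def prune_by_relation(node, relation):
    seg, rel, children = node
    kept = [prune_by_relation(c, relation) for c in children if c[1] != relation]
    return (seg, rel, kept)

def segments(node):
    seg, _, children = node
    return [seg] + [s for c in children for s in segments(c)]

tree = ("S1", None, [
          ("S2", "PURPOSE", []),
          ("S4", "PURPOSE", [("S5", "ELABORATION-LOCATION", [])]),
          ("S9", "ELABORATION-LOCATION", []),
       ])

pruned = prune_by_depth(tree, max_depth=2)           # depth criterion
pruned = prune_by_relation(pruned, RELATION_ORDER[0])  # first relation in the order
print(segments(pruned))   # ['S1', 'S2', 'S4']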
5. DEEP PARAPHRASING AND SUMMARIZATION
Deep paraphrasing and multilingual summarization in PATExpert are performed in the Meaning-Text Theory (MTT) framework [7]. MTT is a linguistic theory that is very popular in text generation, mainly due to two of its principal features: (i) the use of dependency (instead of constituency) structures and (ii) the use of a multistratal (i.e., a series of levels of representation) language model. These features are equally valuable for the tasks of deep paraphrasing and summarization, as we can abstract from the idiosyncrasies of a particular language by using a higher-level structure (for more details see [16]), more specifically, by starting the regeneration process of each “repaired” segment from the Deep-Syntactic Structure (DSyntS) of the MTT model.
As illustrated in Figure 1, the stage of deep paraphrasing and summarization consists of the following three substages: (i) a preprocessing stage which consists of dependency parsing and mapping to DSyntS, (ii) DSynt summarization (if a summary rather than a paraphrase is desired), and (iii) generation proper (from a pruned DSyntS in the case of summarization and from the original DSyntS in the case of paraphrasing). In the case of generation in another language, structure transfer takes place prior to (iii). Let us focus in what follows on (i)–(iii).

5.1 Preprocessing
The preprocessing stage produces a DSyntS for each “repaired” segment. First, the segment is parsed using the off-the-shelf dependency parser Minipar [4]. Minipar has been chosen because it produces syntactic structures that can be mapped onto Surface-Syntactic Structures (SSyntSs) of the MTT model. Due to the theoretical divergences between Minipar and the MTT framework, the Minipar-to-SSyntS conversion is not straightforward. The current version of the Minipar-to-SSyntS conversion grammar contains 137 rules [16]. Its evaluation on 1324 sentences has shown that 99% of well-formed Minipar structures are correctly mapped onto SSyntSs. The SSyntSs are, in their turn, mapped onto DSyntSs.

5.2 Deep Summarization
A small set of summarization criteria is based on specific patterns recognized within the input DSyntSs, taking into account discourse and dependency structure information. Consider some of these patterns and the effect of the application of the corresponding summarization rules, namely the removal of the chunks (in reality, branches of the DSyntS) that appear in square brackets:⁴

1. A noun has a postponed attribute:
(a) The optical component is a shading member [arranged near the optical axis around the aperture plane of the optical system].
(b) The recesses are formed in the upper face and extend from a land surface [adjacent to said cutting edge].

2. A definite noun is modified by a full statement:
(a) An automatic focusing apparatus comprises the actuator [which controls the focusing means depending upon the output of the phase detector].

3. A noun in a dependent claim is modified by a has-part relation (in an independent claim, it can bear important information):
(a) A unitary ridge is formed on the top face [having side surfaces constituting the first and second side chip deflector surfaces].

4. A noun in a dependent claim is modified by a PURPOSE relation (for + gerund):
(a) The apparatus comprises a lens [for converting the light from the signal plane].

5. A number appears in a sentence of a dependent claim:
(a) The reflective component-containing layer has a film thickness of 0.01 µm to 0.5 µm.
(b) [The film thickness is 0.01 µm to 0.09 µm].

⁴ For further details on deep-syntactic summarization, see [8].
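As an illustration of how such patterns can be operationalized, the following sketch removes a "for + gerund" modifier branch attached to a noun in a dependent claim (pattern 4 above). The tuple representation (word, part of speech, children) is a stand-in for the actual MTT deep-syntactic structures, and the pattern test is deliberately simplistic.

def prune_purpose_modifiers(node, in_dependent_claim):
    word, pos, children = node
    kept = []
    for child in children:
        c_word, c_pos, c_children = child
        is_for_gerund = (c_word == "for" and c_children
                         and c_children[0][1] == "GERUND")
        if in_dependent_claim and pos == "NOUN" and is_for_gerund:
            continue                      # drop the bracketed PURPOSE branch
        kept.append(prune_purpose_modifiers(child, in_dependent_claim))
    return (word, pos, kept)

# "The apparatus comprises a lens [for converting the light ...]"
dsynt = ("comprise", "VERB", [
            ("apparatus", "NOUN", []),
            ("lens", "NOUN", [
                ("for", "PREP", [("convert", "GERUND", [("light", "NOUN", [])])]),
            ]),
        ])

print(prune_purpose_modifiers(dsynt, in_dependent_claim=True))
# ('comprise', 'VERB', [('apparatus', 'NOUN', []), ('lens', 'NOUN', [])])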
5.3 Generation proper
The (re)generation starts from the pruned DSyntSs. It consists of the following four major substages:

(i) aggregation of repetitive and isolated fragments of the DSyntS,

(ii) introduction of discourse markers into the SSyntS,

(iii) generation of referring expressions (anaphoric references) in the SSyntS,

(iv) syntactic generation.

The final result after these four substages conveys, in the case of paraphrasing, exactly the same information as the original (but in a form that is much easier to comprehend than the original) and, in the case of summarization, the essentials of the information of the original. In what follows, we describe each of these stages.

Aggregation
Aggregation is the fusion of several separate sentence or phrase structures that share common parts into one structure in which the previously common parts occur only once. The main criterion for allowing two sentences to be aggregated is their coreference. For illustration, consider the following aggregation rule and its application to a part of a simplified claim:

  [X(id=n) Yverb Z1] + [X(id=n) Yverb Z2] + ... + [X(id=n) Yverb Zn]
  --> X Yverb [Z1 and Z2 and ... and Zn]

For example:
  1. [An optical disk drive]X [comprises]Y [a laser light source]Z1.
  2. [An optical disk drive]X [comprises]Y [an optical system]Z2.
becomes:
  (1+2) An optical disk drive comprises a laser light source and an optical system.
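A minimal, string-level sketch of the aggregation rule above follows. The real module operates on DSyntS trees, and the (subject, verb, object) triples and the coreference-by-identical-subject assumption are simplifications for illustration only.

from collections import OrderedDict

def aggregate(triples):
    groups = OrderedDict()
    for subj, verb, obj in triples:
        # Sentences with a coreferring subject and the same verb are fused.
        groups.setdefault((subj, verb), []).append(obj)
    sentences = []
    for (subj, verb), objs in groups.items():
        coordinated = objs[0] if len(objs) == 1 else ", ".join(objs[:-1]) + " and " + objs[-1]
        sentences.append(f"{subj} {verb} {coordinated}.")
    return sentences

claims = [("An optical disk drive", "comprises", "a laser light source"),
          ("An optical disk drive", "comprises", "an optical system"),
          ("The laser light source", "emits", "a laser beam")]
print(aggregate(claims))
# ['An optical disk drive comprises a laser light source and an optical system.',
#  'The laser light source emits a laser beam.']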
Discourse marker insertion
In order to improve the readability of the aggregated text, some rules add discourse markers to the top verb of the SSyntS. Depending on the discourse relation which is introduced during the simplification stage, the marker can be retrieved from a discourse-marker dictionary.

Referring expression generation
This stage is essentially rule-based. For example, we have a rule that introduces a relative pronoun as the subject of a sentence if it corefers with the object of the previous sentence. Two further restrictions apply to this rule: 1) the phrase to become the relative clause must not be syntactically “too heavy” (i.e., contain “too many” nodes); if it is, the introduction of a deictic is preferred; and 2) the rule does not apply either if the object of the first sentence has undergone aggregation, so as not to rebuild sentences that would be very long and potentially ambiguous (e.g., the aggregation might result in a coordination of noun phrases which would make the following relative clause ambiguous). “Too many” is an empirical figure; currently, we experiment with 15 to 20 nodes, depending on the complexity of the structure.

Syntactic generation
This stage is responsible for completing the syntactic generation, realizing agreement, word order, any necessary elisions and other surface morphological phenomena, and punctuation, and finally getting the word forms to include in the surface text.

5.4 Evaluation of Summarization and Generation
Given that no reliable unique evaluation metric exists as yet for multilingual summarization and paraphrasing, we performed a preliminary evaluation of our multilingual summarization strategy from the perspective of the quality of the summary and from the perspective of the quality of the multilinguality. The evaluation of the quality of our summary has been performed using ROUGE [3]. As baseline, we used the MS Word automatic summarizer (MSAS), with the summarization parameter set to 50%. Out of a list of 50 patents that underwent simplification, 30 were randomly selected and summarized with our regeneration module and with MSAS. The summaries used as reference were produced by a patent specialist. Our summarization obtained an overall F-score of 61% over quadrigrams and trigrams, while MSAS reached 43%. That we did not surpass 61% can be partially explained by the object/method dichotomy in some patent claims, which we cannot identify reliably in an automatic way. If a patent claim section contains claims referring to both the invented object and the method of applying this object, both kinds of claims tend to contain largely the same information. Human-created reference summaries avoid the repetition of this information, whilst our module is currently not able to differentiate an object-related claim from a method-related one. Furthermore, it is worth noting that the evaluation carried out so far does not take into account the quality of the text, for which a qualitative evaluation would be necessary.
For the evaluation of the quality of the multilinguality, we chose human evaluation in order to balance the purely statistical metrics of the ROUGE evaluation and to obtain some objective opinions from native speakers and experts. For this purpose, six native speakers were asked to rate twelve different claim descriptions in their native tongue produced by PATExpert with summarization switched off (such that only simplification, transfer and regeneration were effective) against the online Google translation of the original claims as baseline.⁵ Given that the recall of our multilingual generator is still very much hampered by the shortage of multilingual resources, we consider this evaluation a general indication of the potential of “deep” translation techniques when combined with the preprocessing of the claims. The evaluation was based on a questionnaire largely inspired by [9]. It consists of three categories: “intelligibility”, “simplicity” and “accuracy”. The first two categories deal with the quality of the transferred text; both have a five-value scale. The third category, which has a seven-value scale, captures how well the content of the English input is conveyed in the transferred text. Due to the lack of space, we do not cite the questionnaire itself here. Table 5 shows the scores regarding each of the three quality categories for regeneration and for the baseline. As expected, the output of our multilingual summarization module is much less complex, hence the intelligibility is about 9% higher. But surprisingly, there is no significant difference regarding the accuracy of the two translations, which might show that no meaningful information is lost during the simplification stage compared to a non-simplified output.

⁵ Since our goal was to evaluate the multilingual output of our system with the original claims as input, we consider it correct to run the Google translator on the original claims.

                  baseline   regeneration
Intelligibility      49          58
Simplicity           49          74
Accuracy             47          51

Table 5: Accuracy of the PATExpert Multilingual Summarizer (regeneration) against the Google translator (baseline)

6. RELATED WORK
Shinmori et al. [13] address the simplification of Japanese claim sentences. Their rule-based approach first identifies the claim's coarse-grained discourse structure, with patent-specific relations such as PROCEDURE and PRECONDITION for process-style and Jepson-style claims respectively, using cue phrases and grammatical patterns, and then paraphrases each discourse segment.
Sheremetyeva [12] proposes a parsing methodology that is tuned to the analysis of original US patent claims. The resulting analysis structure is a syntactic dependency tree. The intermediate chunking structure of the parse is used for the generation of simple sentences (with the goal to improve the readability of the claims).
Some work has also been done on the shallow extractive summarization of legal texts [1, 2]. These approaches rely on the text-type-specific discourse/thematic structure and, in the case of [2], on general and specialized shallow features such as tf*idf, location, sentence length, quotations and cue phrases. Given the highly repetitive nature of patent claim sentences, their overly long sentences, and the lack of a corpus of patent claims aligned with summaries, the use of such shallow features is not possible. For this reason, as well as for the generation of multilingual summaries, we have chosen a deep summarization approach.
7. CONCLUSION
We have presented a system that uses shallow and deep summarization and paraphrasing techniques to produce paraphrases and multilingual summaries of patent claims. The effectiveness of the different summarization and paraphrasing strategies still needs to be formally evaluated with patent experts and with users who are not trained in patent information processing. Informal talks with patent experts confirmed the benefits of claim dependency structure based summarization. On the other hand, some specific discourse summarization strategies might be needed: for example, a PURPOSE followed by a MEANS relation expresses important information; likewise, an ELABORATION relation starting with the cue phrase in which is also important and should not be pruned. As far as paraphrasing and deep summarization are concerned, our prototypical technique proved useful. However, further research is needed, for instance, to establish more profound summarization criteria. Another topic is to extend the summarization to other sections of patents.
Given that other genres of legal discourse (such as, e.g., laws) tend to be equally difficult to read and comprehend by non-experts in the field, paraphrasing and summarization (gist generation) techniques such as the ones presented in this paper could serve as a blueprint for further analogous applications.

8. REFERENCES
[1] A. Farzindar, G. Lapalme, and J.-P. Desclés. Résumé de textes juridiques par identification de leur structure thématique. Traitement Automatique des Langues, 45(1):39–64, 2004.
[2] B. Hachey and C. Grover. Extractive summarisation of legal texts. Artificial Intelligence and Law, 14(4):305–345, 2006.
[3] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop "Text Summarization Branches Out", Barcelona, 2004.
[4] D. Lin. Dependency-based evaluation of Minipar. In Proceedings of the Workshop on the Evaluation of Parsing Systems, LREC'98, Granada, 1998.
[5] W. C. Mann and S. Thompson. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281, 1988.
[6] D. Marcu. The Theory and Practice of Discourse Parsing and Summarization. The MIT Press, Cambridge, MA, 2000.
[7] I. Mel'čuk. Dependency Syntax: Theory and Practice. State University of New York Press, Albany, NY, 1988.
[8] S. Mille and L. Wanner. Making text resources accessible to the reader: The case of patent claims. In Proceedings of the International Language Resources and Evaluation Conference (LREC), Marrakech, 2008.
[9] M. Nagao, J.-i. Tsujii, and J.-i. Nakamura. The Japanese government project for machine translation. Computational Linguistics, 11(2–3), 1985.
[10] D. Pressman. Patent It Yourself. Nolo, Berkeley, CA, 2006.
[11] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK, 1994.
[12] S. Sheremetyeva. Natural language analysis of patent claims. In Proceedings of the ACL Workshop on Patent Corpus Processing, 2003.
[13] A. Shinmori, M. Okumura, Y. Marukawa, and M. Iwayama. Patent claim processing for readability: Structure analysis and term explanation. In Proceedings of the Workshop on Patent Corpus Processing held at the ACL Meeting, pages 56–65, 2003.
[14] S. Teufel and M. Moens. Summarizing scientific articles: Experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409–445, 2002.
[15] L. Wanner, R. Baeza-Yates, S. Brügmann, J. Codina, B. Diallo, E. Escorsa, M. Giereth, Y. Kompatsiaris, S. Papadopoulos, E. Pianta, G. Piella, I. Puhlmann, G. Rao, M. Rotard, P. Schoester, L. Serafini, and V. Zervaki. Towards content-oriented patent document processing. World Patent Information Journal, 28(4):409–445, 2007.
[16] L. Wanner, S. Bott, N. Bouayad-Agha, G. Casamayor, G. Ferraro, J. Graën, A. Joan, F. Lareau, S. Mille, V. Rodríguez, and V. Vidal. Paraphrasing and multilingual summarization of patent claims. In L. Wanner, S. Brügmann, and B. Diallo, editors, PATExpert: A Next Generation Patent Processing Service. IOS Press, Amsterdam, in press.
[17] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2005.
Annex: Example
In what follows, we present first an original claim and then two PATExpert summaries (in English and French) thereof.

Original claim
An optical disk drive comprising: a laser light source for emitting a laser beam; an optical system for conversing the laser beam from the laser light source on a signal plane of optical disk on which signal marks are formed and for transmitting the light reflected from the signal plane; one or more optical components arranged in the optical path between the laser light source and the optical disk for making the distribution of the laser beam converged by the conversing means located on a ring belt just after the passage of an aperture plane of the optical system; a detection means for detecting the light reflected from the optical disk; and a signal processing circuit for generating a secondary differential signal by differentiating the signals detected by the detection means and for detecting the edge positions of the signal marks by comparing the secondary differential signal with a detection level.

PATExpert generated summaries
An optical disk drive comprises a laser light source, an optical system, a detection means, and a signal processing circuit. The laser light source emits a laser beam. The optical system converses the laser beam from the laser light source on a signal plane of optical disk and transmits the light. The optical disk drive also comprises one or more optical components. It is arranged in the optical path between the laser light source and the optical disk. The detection means detects the light. The signal processing circuit generates a secondary differential signal and detects the edge positions of the signal mark.

Une unité de disque optique comprend une source de lumière laser, un système optique, un système de détection, et un circuit de traitement de signal. La source de lumière laser émet un faisceau laser. Le système optique inverse le faisceau laser de la source de lumière laser sur un plan de signal de disque optique et transmet la lumière. L'unité de disque optique comprend également un ou plusieurs composants optiques. Il est arrangé dans le chemin optique entre la source de lumière laser et le disque optique. Le système de détection détecte la lumière. Le circuit de traitement de signal génère un signal de différentiel secondaire et détecte les positions de bord de la marque de signal.
