2. SYSTEM ARCHITECTURE
The different paraphrasing and summarization strategies realized in PATExpert are interconnected in that the result of one strategy serves as the starting point for another strategy. We can thus speak of a single pipeline of sub-modules. Such an architecture has two advantages: (i) a single pipeline is easier to maintain than several isolated sub-modules that act independently of each other; (ii) the user can be offered the output of the intermediate modules as a valid paraphrase or summary, respectively. Figure 1 illustrates in detail the architecture of the paraphrasing and summarization module. The architecture is divided into three main submodules:
3. CLAIMS DEPENDENCY STRUCTURING
The text structure of a patent claim section is a tree that is predefined by the dependency between the claims, made explicit by textual references such as according to Claim 1, as set forth in claims 1 to 3, as in claims 3-5, as defined by claims 1 and 2, etc. Structures of different complexity can be encountered. Errors are not infrequent and must be dealt with. Thus, one of the errors that is especially disturbing for discourse processing is a claim referring back to itself directly or indirectly, or referring back to a claim that does not exist.
Level 1 of the resulting text structure tree corresponds to independent claims (claims that do not refer back to superordinated claims), whilst the other levels correspond to dependent claims. Consider, for illustration, a sample claim structure of a German patent in Figure 2. The numbers stand for the individual claims; the arcs connect dependent claims with the claims on which they depend. Dependent claims cannot stand on their own and must be read in the context of the claims they depend on in order to be fully understood [10]. Furthermore, we can assume that the deeper a claim is in the claim structure, the more detailed the information it renders. This is exploited in the claim dependency structure based summarization approach, which prunes the deepest levels of the claim structure tree.

4. SIMPLIFICATION
As shown in Figure 1, the simplification proper of each original claim-sentence is performed in five separate stages:

(i) part-of-speech (POS) tagging and chunking using TreeTagger [11] with its off-the-shelf English parameter set;

(ii) segmentation of a claim sentence into clausal discourse units;

(iii) establishment of coreference links between NPs that denote the same object;

(iv) building a clause-discourse tree drawing upon coreference links and other information such as tree configuration and discourse markers;

(v) reconstruction of the individual segments in order to obtain grammatically correct independent sentences out of each of them.

Let us discuss these five stages (and, in particular, the fourth of them, since clause-discourse structuring is primary for both the simplification and the subsequent discourse summarization).
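These five stages can be made concrete with a minimal, runnable sketch. Every function body below is a deliberately naive stand-in (semicolon segmentation, last-token coreference, a right-branching tree) rather than the actual PATExpert implementation:

```python
# Sketch of the five-stage simplification pipeline. All stage bodies are
# illustrative stand-ins, not the PATExpert modules.

def pos_tag_and_chunk(text):
    # (i) stand-in for TreeTagger POS tagging and chunking
    return [(token, "TOK") for token in text.split()]

def segment_into_units(text):
    # (ii) naive segmentation of a claim into clausal discourse units
    return [unit.strip() for unit in text.split(";") if unit.strip()]

def link_coreferent_nps(units):
    # (iii) naive coreference: link a unit to the first earlier unit
    # sharing its final token (patent claims repeat NPs heavily)
    first_seen, links = {}, {}
    for i, unit in enumerate(units):
        head = unit.split()[-1]
        if head in first_seen:
            links[i] = first_seen[head]
        first_seen.setdefault(head, i)
    return links

def build_clause_tree(units):
    # (iv) right-branching baseline in place of the rule-based tree search
    tree = units[-1]
    for unit in reversed(units[:-1]):
        tree = (unit, tree)
    return tree

def reconstruct_sentences(units):
    # (v) turn each unit into a standalone sentence (capitalize, add period)
    return [u[0].upper() + u[1:].rstrip(".") + "." for u in units]

def simplify(claim_text):
    tagged = pos_tag_and_chunk(claim_text)   # stage (i)
    units = segment_into_units(claim_text)   # stage (ii)
    coref = link_coreferent_nps(units)       # stage (iii)
    tree = build_clause_tree(units)          # stage (iv)
    return reconstruct_sentences(units)      # stage (v)

print(simplify("an optical disk drive comprising a laser light source; "
               "the source emits a laser beam"))
# → ['An optical disk drive comprising a laser light source.',
#    'The source emits a laser beam.']
```

In the real module, stage (v) operates on the clause-discourse tree built in stage (iv); here the tree is built but reconstruction works directly on the segment list for brevity.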
Clause structuring searches for the best clause structure in a space restricted by a set of weighted rules, each of which is further restricted by a set of constraints. The rules encode the fundamental features for the identification of coordination, subordination and juxtaposition relations between spans, whilst the constraints ensure syntactic correctness and global coherence of the tree under construction. This rule-based approach proved reasonable given the regulations of the linguistic style in the patent domain.
A rule R is a quadruple <F1, F2, W0, Wc>, where:

In order to obtain the optimal tree, the algorithm is allowed to backtrack once it has reached a goal and explore alternative syntactic trees, up to a fixed limit set to ensure computational efficiency.
Once the clause-discourse structuring is finished, the reconstruction of each segment takes place, which transforms each segment into a complete independent sentence using a set of rules that add a missing subject or verb, conjugate verbs in non-finite clauses, remove initial markers, or change the order of constituents.

4.2 Evaluation experiments
We present the evaluation of the most relevant stages of our simplification module, i.e., segmentation (as it is the basis for clause structuring and sentence reconstruction), coreference (as it is one of the outputs of the simplification module) and clause structuring (as it is the basis for our discourse structuring approach).

Evaluation of the segmentation
We manually constructed a gold standard corpus for the segmentation task, totalling 1011 claims and 6723 segments in the Optical Recording Devices domain and 486 claims and 3101 segments in the Machine Tools domain. Each claim was segmented by two independent annotators, who then discussed their differences and reached a consensus on the correct segmentation.
We experimented with both rule-based and machine learning segmentation. Thus, we applied Weka's J48 decision tree learner [17] on a vector of basic features comprising POS, single keyword and punctuation information for the previous, current and next token, using cross-validation on 581 claims (3942 segments) of our gold standard in the Optical Recording Devices domain. For the rule-based approach, we experimented with a variety of strategies on a test corpus from the gold standard. We found that the strategy yielding the best results used the following information: semi-colon, comma, and about twenty representative lexical markers and expressions of the patent domain such as in which, characterized in that, so as to, and other more ambiguous ones such as and, for, by when followed by a verb phrase. We used a strict evaluation that counts 1:1 alignments. The results are shown in Table 2. The machine learning results are close to the rule-based results and will probably improve the segmentation given richer information such as chunking and multi-word keywords.

Table 2: Results of the segmentation evaluation.

                                          # manual  # automatic  # 1:1  Prec  Rec  F-score
Development corpus, optical recordings:
  Rule-based                                  3942         3522   2509   71%  63%      66%
  J48                                         3942         3411   2217   64%  56%      59%
Test corpus, machine tools:
  Rule-based                                  3101         2807   1867   66%  60%      62%
  J48                                         3101         2606   1639   62%  52%      56%
Test corpus, optical recordings:
  Rule-based                                  2781         2535   1664   65%  59%      61%
  J48                                         2781         2352   1521   64%  54%      58%

Evaluation of the coreference resolution
We evaluated our coreference resolution approach on a set of 30 claims, 15 from the Optical Recording Devices domain and 15 from the Machine Tools domain, evenly selected from the set of independent and dependent claims and manually annotated for coreference. The results presented in Table 3 show a precision of 83% and a recall of 79%, which reinforces the observation of NP repetition in patent material. We performed an error analysis and found the following main types of errors:

Chunking errors. The chunker failed to mark a chunk as an NP, resulting in a false negative.

NP modifier. Identical NPs used as modifiers in a more complex NP were marked as coreferring while they were not; for example, inner surface of the substrate and inner surface of the boot.

Abstract NPs. Abstract nouns like information or direction are often repeated within a claim but do not corefer.

Partial match. Some co-referring NPs such as rotary valve assembly and valve assembly are not an exact match and thus are not marked as co-referring.

Evaluation of clause structuring
We performed the evaluation of clause structuring on our development and test corpora, using as input both the raw claims and the claims manually segmented and coreferenced (i.e., perfect input). Our baseline performs clause structuring based on right branching, given the number of segments in the gold standard. For the evaluation, we count the number of identical spans between the automatic and the manual structuring, excluding the top span, as this would always be counted as correct. In order to be able to compare spans, we automatically map each segment of the raw input to its corresponding gold standard segment. The results of evaluation
with the test corpus are shown in Table 4. In the development corpus (i.e., 8 claims, 104 segments), we achieved an F-score of 62% with raw input in the Optical Recording Devices domain, whilst in the test corpus, our F-score is 42% for corresponding input and domain, as shown in the table. Our best score is achieved in the Optical Recording Devices domain (the same domain as the development corpus), with an F-score of 61% with perfect input, whilst clause structuring in the same domain achieves 45% from raw input and 32% with the right branching strategy.
In order to find out about the contribution of each constraint defined in Table 1 to the overall performance, we ran nine rounds of the evaluation with the perfect input in both domains, each time omitting one of the constraints. The F-scores of these rounds are presented in Figure 3, with the label none referring to the situation where no constraints are omitted, and the label all to the situation where all constraints are omitted. Not surprisingly, the constraint that contributes most to the overall performance in both domains is right branching. On the other hand, S1 coord and same syntagm in the Optical Recording Devices domain, and S2 coord and same subordinator in the Machine Tools domain, do not contribute to the overall performance.

4.3 Discourse structure based summarization
Discourse structure based summarization was originally tackled by Marcu [6] by pruning the satellites of an RST tree in order to obtain a summary of the desired length. In our approach, the discourse structure is pruned according to the following three criteria:

Depth: all spans above a given depth are preserved (expressed in percentage).

Discourse relations: a list of discourse relations can be provided so that the discourse summarizer prunes the spans whose relations are in the list, starting with the first one on that list. A sample hierarchy is: PURPOSE > MEANS > ELABORATION-OBJECT-ATTRIBUTE > ELABORATION-LOCATION. This means that ELABORATION-LOCATION spans can be eliminated before ELABORATION-OBJECT-ATTRIBUTE spans. In their turn, ELABORATION-OBJECT-ATTRIBUTE spans can be eliminated before MEANS spans, which can be eliminated before PURPOSE spans.

Discourse markers: a list of discourse markers at the beginning of spans that can be considered for elimination is provided in descending order. A sample hierarchy is: WHEN > BY > FOR.

The system can be configured such that any combination of the three criteria can be selected, and the criteria can be applied in any order desired. In other words, the discourse summarization algorithm will apply one criterion after another and stop once it has reached the desired summarization rate. As an example, we show the discourse summarization of claim example (1), whose abbreviated RST discourse structure in Figure 4 shows a maximum depth of 4, and where nucleus spans are indicated by solid lines and satellite spans by dashed lines. The required summarization rate is 50%, and the criteria to apply are the depth of 50% (that is, of up to 2 levels in our example), followed by discourse relation based pruning given the partial order above. The depth criterion eliminates segments S5, S7, S11, S17, S18 and S20, whilst the discourse relation criterion eliminates S9, S10 and S14, thus satisfying the 50% summarization rate.

(1) [An optical disk drive comprising: a laser light source]S1 [for emitting a laser beam;]S2 [an optical system]S3 [for conversing the laser beam from the laser light source on a signal plane of optical disk]S4 [on which signal marks are formed]S5 [and for transmitting the light]S6 [reflected from the signal plane;]S7 [one or more optical components]S8 [arranged in the optical path between the laser light source and the optical disk]S9 [for making the distribution of the laser beam converged by the conversing means]S10 [located on a ring belt just after the passage of an aperture plane of the optical system;]S11 [a detection means]S12 [for detecting the light]S13 [reflected
Table 4: Results of the clause structuring evaluation (MT = machine tools, OR = optical recordings).

                       # automatic  # manual  # correct  Prec  Rec  F-score
Gold standard  MT              112       113         58   51%  51%      51%
               OR               89        91         56   62%  61%      61%
               Total           201       204        114   56%  55%      55%
Raw            MT               82       113         40   48%  35%      40%
               OR               61        91         35   57%  38%      45%
               Total           143       204         75   52%  36%      42%
Baseline       MT              123       113         47   38%  41%      39%
               OR              104        91         32   30%  35%      32%
               Total           227       204         79   34%  38%      35%
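For reference, the precision, recall and F-scores reported in Tables 2 and 4 follow the standard definitions over matched segments or spans. A quick check against the Raw/Total row above (the integer percentages in the table are rounded, so small differences remain):

```python
def precision_recall_f(correct, automatic, manual):
    """Standard span-matching metrics as used in Tables 2 and 4."""
    precision = correct / automatic  # fraction of system output that is right
    recall = correct / manual        # fraction of the gold standard recovered
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Raw/Total row of the clause structuring evaluation:
# 143 automatic spans, 204 manual spans, 75 identical between the two.
p, r, f = precision_recall_f(correct=75, automatic=143, manual=204)
print(f"Prec {p:.1%}  Rec {r:.1%}  F {f:.1%}")
# → Prec 52.4%  Rec 36.8%  F 43.2%
```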
Figure 3: Performance of clause structuring with constraint omission from perfect input in both domains
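The three pruning criteria of Section 4.3 amount to successive filters over the satellite spans of the discourse tree, applied until the summarization rate is reached. The span encoding below (dicts with id, depth, relation, marker and nucleus fields) is an assumption made for this sketch, not the actual PATExpert data structure:

```python
# Sketch of the three-criterion discourse pruning of Section 4.3.
# Span encoding and function names are illustrative assumptions.

RELATION_ORDER = ["ELABORATION-LOCATION", "ELABORATION-OBJECT-ATTRIBUTE",
                  "MEANS", "PURPOSE"]   # eliminated in this order
MARKER_ORDER = ["when", "by", "for"]    # WHEN > BY > FOR

def summarize(spans, rate, max_depth):
    """Apply depth, then relation, then marker pruning to satellite spans,
    stopping as soon as only `rate` of the spans remain."""
    target = int(len(spans) * rate)
    kept = list(spans)

    def done():
        return len(kept) <= target

    # Criterion 1: depth -- only spans above the given depth are preserved.
    kept = [s for s in kept if s["nucleus"] or s["depth"] <= max_depth]
    # Criterion 2: discourse relations, following the partial order above.
    for relation in RELATION_ORDER:
        if done():
            break
        kept = [s for s in kept if s["nucleus"] or s["relation"] != relation]
    # Criterion 3: span-initial discourse markers, in descending order.
    for marker in MARKER_ORDER:
        if done():
            break
        kept = [s for s in kept if s["nucleus"] or s["marker"] != marker]
    return [s["id"] for s in kept]

spans = [
    {"id": "S1", "depth": 1, "relation": None, "marker": None, "nucleus": True},
    {"id": "S2", "depth": 2, "relation": "PURPOSE", "marker": "for", "nucleus": False},
    {"id": "S3", "depth": 2, "relation": "MEANS", "marker": "by", "nucleus": False},
    {"id": "S4", "depth": 3, "relation": "ELABORATION-LOCATION", "marker": None, "nucleus": False},
]
print(summarize(spans, rate=0.5, max_depth=2))  # → ['S1', 'S2']
```

Here the depth criterion drops S4, and the relation criterion then drops S3 (MEANS) before ever touching the PURPOSE span S2, reaching the 50% target.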