© British Computer Society 2002
Practical Earley Parsing
JOHN AYCOCK¹ AND R. NIGEL HORSPOOL²
¹ Department of Computer Science, University of Calgary, 2500 University Drive NW, Calgary, Alberta, Canada T2N 1N4
² Department of Computer Science, University of Victoria, Victoria, BC, Canada V8W 3P6
Email: aycock@cpsc.ucalgary.ca
Earley’s parsing algorithm is a general algorithm, able to handle any context-free grammar. As with most parsing algorithms, however, the presence of grammar rules having empty right-hand sides complicates matters. By analyzing why Earley’s algorithm struggles with these grammar rules, we have devised a simple solution to the problem. Our empty-rule solution leads to a new type of finite automaton expressly suited for use in Earley parsers and to a new statement of Earley’s algorithm. We show that this new form of Earley parser is much more time efficient in practice than the original.
 Received 5 November 2001; revised 17 April 2002
1. INTRODUCTION
Earley’s parsing algorithm [1, 2] is a general algorithm, capable of parsing according to any context-free grammar. (In comparison, the LALR(1) parsing algorithm used in Yacc [3] is limited to a subset of unambiguous context-free grammars.) General parsing algorithms like Earley parsing allow unfettered expression of ambiguous grammar constructs, such as those used in C, C++ [4], software reengineering [5], Graham/Glanville code generation [6] and natural language processing.

Like most parsing algorithms, Earley parsers suffer from additional complications when handling grammars containing $\epsilon$-rules, i.e. rules of the form $A \to \epsilon$ which crop up frequently in grammars of practical interest. There are two published solutions to this problem when it arises in an Earley parser, each with substantial drawbacks. As we discuss later, one solution wastes effort by repeatedly iterating over a queue of work items; the other involves extra run-time overhead and couples logically distinct parts of Earley’s algorithm.

This led us to study the nature of the interaction between Earley parsing and $\epsilon$-rules more closely, and arrive at a straightforward remedy to the problem. We use this result to create a new type of automaton customized for use in an Earley parser, leading to a new, efficient form of Earley’s algorithm.

We have organized the paper as follows. We introduce Earley parsing in Section 2 and describe the problem that $\epsilon$-rules cause in Section 3. Section 4 presents our solution to the $\epsilon$-rule problem, followed by a proof of correctness in Section 5. In Sections 6 and 7, we show how to construct our new automaton and how Earley’s algorithm can be restated to take advantage of this new automaton. Section 8 discusses how derivations of the input may be reconstructed. Finally, some empirical results in Section 9 demonstrate the efficacy of our approach and why it makes Earley parsing practical.
2. EARLEY PARSING
We assume familiarity with standard grammar notation [7]. Earley parsers operate by constructing a sequence of sets, sometimes called Earley sets. Given an input $x_1 x_2 \ldots x_n$, the parser builds $n + 1$ sets: an initial set $S_0$ and one set $S_i$ for each input symbol $x_i$. Elements of these sets are referred to as (Earley) items, which consist of three parts: a grammar rule, a position in the right-hand side of the rule indicating how much of that rule has been seen and a pointer to an earlier Earley set. Typically Earley items are written as

$[A \to \alpha \bullet \beta, j]$

where the position in the rule’s right-hand side is denoted by a dot ($\bullet$) and $j$ is a pointer to set $S_j$.

An Earley set $S_i$ is computed from an initial set of Earley items in $S_i$, and $S_{i+1}$ is initialized, by applying the following three steps to the items in $S_i$ until no more can be added.

SCANNER. If $[A \to \ldots \bullet a \ldots, j]$ is in $S_i$ and $a = x_{i+1}$, add $[A \to \ldots a \bullet \ldots, j]$ to $S_{i+1}$.

PREDICTOR. If $[A \to \ldots \bullet B \ldots, j]$ is in $S_i$, add $[B \to \bullet \alpha, i]$ to $S_i$ for all rules $B \to \alpha$.

COMPLETER. If $[A \to \ldots \bullet, j]$ is in $S_i$, add $[B \to \ldots A \bullet \ldots, k]$ to $S_i$ for all items $[B \to \ldots \bullet A \ldots, k]$ in $S_j$.

An item is added to a set only if it is not in the set already. The initial set $S_0$ contains the item $[S' \to \bullet S, 0]$ to begin with (we assume the grammar is augmented with a new start rule $S' \to S$), and the final set must contain $[S' \to S \bullet, 0]$ for the input to be accepted. Figure 1 shows an example of Earley parser operation.

We have not used lookahead in this description of Earley parsing, for two reasons. First, it is easier to reason about Earley’s algorithm without lookahead. Second, it is not clear what role lookahead should play in Earley’s algorithm.

THE COMPUTER JOURNAL, Vol. 45, No. 6, 2002
 
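$S_0$:
    $S' \to \bullet E, 0$
    $E \to \bullet E + E, 0$
    $E \to \bullet n, 0$

$S_1$ (after scanning $n$):
    $E \to n \bullet, 0$
    $S' \to E \bullet, 0$
    $E \to E \bullet + E, 0$

$S_2$ (after scanning $+$):
    $E \to E + \bullet E, 0$
    $E \to \bullet E + E, 2$
    $E \to \bullet n, 2$

$S_3$ (after scanning $n$):
    $E \to n \bullet, 2$
    $E \to E + E \bullet, 0$
    $E \to E \bullet + E, 2$
    $S' \to E \bullet, 0$

FIGURE 1. Earley sets for the grammar $E \to E + E \mid n$ and the input $n + n$. Items in bold are ones which correspond to the input’s derivation.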
Earley recommended using lookahead for the COMPLETER step [2]; it was later shown that a better approach was to use lookahead for the PREDICTOR step [8]; later it was shown that prediction lookahead was of questionable value in an Earley parser which uses finite automata [9] as ours does.

In terms of implementation, the Earley sets are built in increasing order as the input is read. Also, each set is typically represented as a list of items, as suggested by Earley [1, 2]. This list representation of a set is particularly convenient, because the list of items acts as a ‘work queue’ when building the set: items are examined in order, applying SCANNER, PREDICTOR and COMPLETER as necessary; items added to the set are appended onto the end of the list.
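The three steps and the work-queue representation can be illustrated with a minimal Python sketch. It is not the authors’ implementation: the `Item` tuple, the dictionary encoding of the grammar and the function names `earley_sets` and `accepts` are assumptions made for this example, and the grammar is assumed to contain no empty right-hand sides.

```python
from typing import NamedTuple


class Item(NamedTuple):
    lhs: str      # left-hand side of the rule
    rhs: tuple    # right-hand side symbols
    dot: int      # how many rhs symbols have been seen
    parent: int   # index j of the Earley set where the item started


def earley_sets(grammar, start, tokens):
    """Build S_0..S_n; each set is a list that doubles as a work queue."""
    sets = [[] for _ in range(len(tokens) + 1)]

    def add(i, item):
        if item not in sets[i]:            # add only if not already present
            sets[i].append(item)

    add(0, Item("S'", (start,), 0, 0))     # augmented start rule S' -> S
    for i in range(len(tokens) + 1):
        k = 0
        while k < len(sets[i]):            # appended items are visited later
            item = sets[i][k]
            k += 1
            if item.dot == len(item.rhs):                          # COMPLETER
                for cand in list(sets[item.parent]):
                    if cand.dot < len(cand.rhs) and cand.rhs[cand.dot] == item.lhs:
                        add(i, cand._replace(dot=cand.dot + 1))
            elif item.rhs[item.dot] in grammar:                    # PREDICTOR
                nonterminal = item.rhs[item.dot]
                for rhs in grammar[nonterminal]:
                    add(i, Item(nonterminal, tuple(rhs), 0, i))
            elif i < len(tokens) and item.rhs[item.dot] == tokens[i]:  # SCANNER
                add(i + 1, item._replace(dot=item.dot + 1))
    return sets


def accepts(grammar, start, tokens):
    """Accept if the final set contains [S' -> S . , 0]."""
    return Item("S'", (start,), 1, 0) in earley_sets(grammar, start, tokens)[-1]


# Grammar E -> E + E | n; symbols that are not dictionary keys are terminals.
grammar = {"E": [["E", "+", "E"], ["n"]]}
print(accepts(grammar, "E", ["n", "+", "n"]))   # True
print(accepts(grammar, "E", ["n", "+"]))        # False
```

Because `tokens` is indexed from zero, `tokens[i]` plays the role of $x_{i+1}$ in the SCANNER step.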
3. THE PROBLEM OF $\epsilon$
At any given point $i$ in the parse, we have two partially-constructed sets. SCANNER may add items to $S_{i+1}$ and $S_i$ may have items added to it by PREDICTOR and COMPLETER. It is this latter possibility, adding items to $S_i$ while representing sets as lists, which causes grief with $\epsilon$-rules.

When COMPLETER processes an item $[A \to \bullet, j]$ which corresponds to the $\epsilon$-rule $A \to \epsilon$, it must look through $S_j$ for items with the dot before an $A$. Unfortunately, for $\epsilon$-rule items, $j$ is always equal to $i$; COMPLETER is thus looking through the partially-constructed set $S_i$.³ Since implementations process items in $S_i$ in order, if an item $[B \to \ldots \bullet A \ldots, k]$ is added to $S_i$ after COMPLETER has processed $[A \to \bullet, j]$, COMPLETER will never add $[B \to \ldots A \bullet \ldots, k]$ to $S_i$. In turn, items resulting directly and indirectly from $[B \to \ldots A \bullet \ldots, k]$ will be omitted too. This effectively prunes potential derivation paths, which can cause correct input to be rejected. Figure 2 gives an example of this happening.

³ $j = i$ for $\epsilon$-rule items because they can only be added to an Earley set by PREDICTOR, which always bestows added items with the parent pointer $i$.
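Grammar:
    $S \to AAAA$
    $A \to a$
    $A \to E$
    $E \to \epsilon$

$S_0$:
    $S' \to \bullet S, 0$
    $S \to \bullet AAAA, 0$
    $A \to \bullet a, 0$
    $A \to \bullet E, 0$
    $E \to \bullet, 0$
    $A \to E \bullet, 0$
    $S \to A \bullet AAA, 0$

$S_1$ (after scanning $a$):
    $A \to a \bullet, 0$
    $S \to A \bullet AAA, 0$
    $S \to AA \bullet AA, 0$
    $A \to \bullet a, 1$
    $A \to \bullet E, 1$
    $E \to \bullet, 1$
    $A \to E \bullet, 1$
    $S \to AAA \bullet A, 0$

FIGURE 2. An unadulterated Earley parser, representing sets using lists, rejects the valid input $a$. Missing items in $S_0$ sound the death knell for this parse.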
Two methods of handling this problem have been proposed. Grune and Jacobs aptly summarize one approach:

    ‘The easiest way to handle this mare’s nest is to stay calm and keep running the Predictor and Completer in turn until neither has anything more to add.’ [10, p. 159]

Aho and Ullman [11] specify this method in their presentation of Earley parsing and it is used by ACCENT [12], a compiler–compiler which generates Earley parsers.

The other approach was suggested by Earley [1, 2]. He proposed having COMPLETER note that the dot needed to be moved over $A$, then looking for this whenever future items were added to $S_i$. For efficiency’s sake, the collection of non-terminals to watch for should be stored in a data structure which allows fast access. We used this method initially for the Earley parser in the SPARK toolkit [13].

In our opinion, neither approach is very satisfactory. Repeatedly processing $S_i$, or parts thereof, involves a lot of activity for little gain; Earley’s solution requires an extra, dynamically-updated data structure and the unnatural mating of COMPLETER with the addition of items. Ideally, we want a solution which retains the elegance of Earley’s algorithm, only processes items in $S_i$ once and has no run-time overhead from updating a data structure.
4. AN ‘IDEAL’ SOLUTION
Our solution involves a simple modification to PREDICTOR, based on the idea of nullability. A non-terminal $A$ is said to be nullable if $A \Rightarrow^* \epsilon$; terminal symbols, of course, can never be nullable. The nullability of non-terminals in a grammar may be easily precomputed using well-known techniques [14, 15]. Using this notion, our PREDICTOR can be stated as follows (our modification is the final sentence):

If $[A \to \ldots \bullet B \ldots, j]$ is in $S_i$, add $[B \to \bullet \alpha, i]$ to $S_i$ for all rules $B \to \alpha$. If $B$ is nullable, also add $[A \to \ldots B \bullet \ldots, j]$ to $S_i$.
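The nullability precomputation can be done with a simple fixpoint iteration. The sketch below shows one standard way to do it and is not necessarily the technique of [14, 15]; the dictionary grammar encoding and the name `nullable_nonterminals` are assumptions for this example.

```python
def nullable_nonterminals(grammar):
    """Return the set of non-terminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            if lhs in nullable:
                continue
            # lhs is nullable if some right-hand side consists solely of
            # nullable symbols; an empty right-hand side qualifies vacuously.
            if any(all(sym in nullable for sym in rhs) for rhs in alternatives):
                nullable.add(lhs)
                changed = True
    return nullable


# Grammar of Figure 2: S -> AAAA, A -> a | E, E -> (empty).
grammar = {"S": [["A", "A", "A", "A"]], "A": [["a"], ["E"]], "E": [[]]}
print(sorted(nullable_nonterminals(grammar)))   # ['A', 'E', 'S']
```

Only dictionary keys are ever added to the result, so terminal symbols are never reported as nullable.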
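$S_0$:
    $S' \to \bullet S, 0$
    $S \to \bullet AAAA, 0$
    $S' \to S \bullet, 0$
    $A \to \bullet a, 0$
    $A \to \bullet E, 0$
    $S \to A \bullet AAA, 0$
    $E \to \bullet, 0$
    $A \to E \bullet, 0$
    $S \to AA \bullet AA, 0$
    $S \to AAA \bullet A, 0$
    $S \to AAAA \bullet, 0$

$S_1$ (after scanning $a$):
    $A \to a \bullet, 0$
    $S \to A \bullet AAA, 0$
    $S \to AA \bullet AA, 0$
    $S \to AAA \bullet A, 0$
    $S \to AAAA \bullet, 0$
    $A \to \bullet a, 1$
    $A \to \bullet E, 1$
    $S' \to S \bullet, 0$
    $E \to \bullet, 1$
    $A \to E \bullet, 1$

FIGURE 3. An Earley parser accepts the input $a$, using our modification to PREDICTOR.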
In other words, we eagerly move the dot over a non-terminal if that non-terminal can derive $\epsilon$ and effectively ‘disappear’. Using the grammar from the ill-fated Figure 2, Figure 3 demonstrates how our modified PREDICTOR fixes the problem.
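A small Python sketch of a list-based recognizer with the modified PREDICTOR follows. It illustrates the idea and is not the authors’ implementation: the tuple encoding of items, the function names and the `eager_nullable` switch (which turns the modification off so the two behaviours can be compared) are assumptions introduced here.

```python
def nullable_set(grammar):
    """Fixpoint computation of the nullable non-terminals."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, alts in grammar.items():
            if lhs not in nullable and any(
                    all(s in nullable for s in rhs) for rhs in alts):
                nullable.add(lhs)
                changed = True
    return nullable


def earley_sets(grammar, start, tokens, eager_nullable=True):
    nullable = nullable_set(grammar) if eager_nullable else set()
    sets = [[] for _ in range(len(tokens) + 1)]

    def add(i, item):
        if item not in sets[i]:
            sets[i].append(item)

    add(0, ("S'", (start,), 0, 0))
    for i in range(len(tokens) + 1):
        k = 0
        while k < len(sets[i]):             # each item is processed once
            lhs, rhs, dot, parent = sets[i][k]
            k += 1
            if dot == len(rhs):                                   # COMPLETER
                for l, r, d, p in list(sets[parent]):
                    if d < len(r) and r[d] == lhs:
                        add(i, (l, r, d + 1, p))
            elif rhs[dot] in grammar:                             # PREDICTOR
                for alt in grammar[rhs[dot]]:
                    add(i, (rhs[dot], tuple(alt), 0, i))
                if rhs[dot] in nullable:    # modification: move dot over B
                    add(i, (lhs, rhs, dot + 1, parent))
            elif i < len(tokens) and rhs[dot] == tokens[i]:       # SCANNER
                add(i + 1, (lhs, rhs, dot + 1, parent))
    return sets


# Grammar of Figure 2: S -> AAAA, A -> a | E, E -> (empty).
grammar = {"S": [["A", "A", "A", "A"]], "A": [["a"], ["E"]], "E": [[]]}
accept_item = ("S'", ("S",), 1, 0)
print(accept_item in earley_sets(grammar, "S", ["a"])[-1])                        # True
print(accept_item in earley_sets(grammar, "S", ["a"], eager_nullable=False)[-1])  # False
```

With the modification disabled, the sketch falls back to the unmodified list-based behaviour and rejects $a$; with it enabled, each item is still examined only once and no extra data structure is updated at run time.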
5. PROOF OF CORRECTNESS
Our solution is correct in the sense that it produces exactly the same items in an Earley set $S_i$ as would Earley’s original algorithm as described in Section 2 (where an Earley set may be iterated over until no new items are added to $S_i$ or $S_{i+1}$). In the following proof, we write $S_i$ and $S'_i$ to denote the contents of Earley set $S_i$ as computed by Earley’s method and our method, respectively. Earley’s SCANNER, PREDICTOR and COMPLETER steps are denoted by $E_S$, $E_P$, and $E_C$; ours are denoted by $E'_S$, $E'_P$, and $E'_C$.

We begin with some results which will be used later.

LEMMA 5.1. Let $I$ be the Earley item $[A \to \alpha \bullet, i]$. If $I \in S_i$, then $\alpha \Rightarrow^* \epsilon$.

Proof. There are two cases, depending on the length of $\alpha$.

Case 1. $|\alpha| = 0$. The lemma is trivially true, because $\alpha$ must be $\epsilon$.

Case 2. $|\alpha| > 0$. The key observation here is that the parent pointer of an Earley item indicates the Earley set where the item first appeared. In other words, the item $[A \to \bullet \alpha, i]$ must also be present in $S_i$, and because both it and $I$ are in $S_i$, it means that $\alpha$ must have made its debut and been recognized without consuming any terminal symbols. Therefore $\alpha \Rightarrow^* \epsilon$.

LEMMA 5.2. If $S_0 = S'_0, S_1 = S'_1, \ldots, S_{i-1} = S'_{i-1}$, then $S_i \subseteq S'_i$.

Proof. Assume that there is some Earley item $I \in S_i$ such that $I \notin S'_i$. How could $I$ have been added to $S_i$?

Case 1. $I$ was added initially. In this case, $i = 0$, $I = [S' \to \bullet S, 0]$. Obviously, $I \in S'_i$ as well.

Case 2. $I$ was added by $E_S$ being applied to some item $I' \in S_{i-1}$. $E'_S$ is the same as $E_S$ and $S_{i-1} = S'_{i-1}$, so $I$ must also be in $S'_i$.

Case 3. $I$ was added by $E_P$. Say $I = [A \to \bullet \alpha, i]$. Then there must exist $I' = [B \to \ldots \bullet A \ldots, j] \in S_i$. If $I'$ is also in $S'_i$, then $I \in S'_i$ because $E'_P$ adds at least those items that $E_P$ adds.

Case 4. $I$ was added by $E_C$, operating on some $I' = [A \to \alpha \bullet, j] \in S_i$. Say $I = [B \to \ldots A \bullet \ldots, k]$. First, assume $j < i$. If $I' \in S'_i$, then $I \in S'_i$ because $E_C$ and $E'_C$ are the same, and would be referring back to Earley set $j$ where $S_j = S'_j$. Now assume that $j = i$. By Lemma 5.1, $\alpha \Rightarrow^* \epsilon$ and thus $A \Rightarrow^* \epsilon$. There must therefore also be an item $I'' = [B \to \ldots \bullet A \ldots, k] \in S_i$. If $I'' \in S'_i$, then $I \in S'_i$, added by $E'_P$.

Therefore $S_i \subseteq S'_i$ by contradiction, since no $I$ can be chosen such that $I \in S_i$ and $I \notin S'_i$.

LEMMA 5.3. If $S_0 = S'_0, S_1 = S'_1, \ldots, S_{i-1} = S'_{i-1}$, then $S'_i \subseteq S_i$.

Proof. We take the same tack as before: posit the existence of an Earley item $I \in S'_i$ but $I \notin S_i$. How could $I$ have been added to $S'_i$?

Case 1. $I$ was added initially. This is the same situation as Case 1 of Lemma 5.2.

Case 2. $I$ was added by $E'_S$. For this case, the proof is directly analogous to that given in Lemma 5.2, Case 2, and we therefore omit it.

Case 3. $I$ was added by $E'_P$. If $I = [A \to \bullet \alpha, i]$, then this case is analogous to Case 3 of Lemma 5.2 and we omit it. Otherwise, $I = [A \to \ldots B \bullet \ldots, j]$. By definition of $E'_P$, $B \Rightarrow^* \epsilon$ and there is some $I' = [A \to \ldots \bullet B \ldots, j]$ in $S'_i$. If $I' \in S_i$, then $I \in S_i$ because, if it were not, Earley’s algorithm would have failed to recognize that $B \Rightarrow^* \epsilon$.⁴

Case 4. $I$ was added by $E'_C$. This case is analogous to Lemma 5.2, Case 4, and the proof is omitted.

As before, no $I$ can be chosen such that $I \in S'_i$ and $I \notin S_i$, proving that $S'_i \subseteq S_i$ by contradiction.

We can now state the main theorem.

THEOREM 5.1. $S_i = S'_i$, $0 \le i \le n$.

Proof. The proof follows from the previous two lemmas by induction. The basis for the induction is $S_0 = S'_0$. Assuming $S_k = S'_k$, $0 \le k \le i - 1$, then $S_i = S'_i$ from Lemmas 5.2 and 5.3.

Our solution thus produces the same item sets as Earley’s algorithm.
6. THE BIRTH OF A NEW AUTOMATON
The Earley items added by our modified PREDICTOR can be precomputed. Moreover, this precomputation can take place in such a way as to produce a new variant of LR(0) deterministic finite automata (DFA) which is very well suited to Earley parsing.

In our description of Earley parsing up to now, we have not mentioned the use of automata. However, the idea to use efficient deterministic automata as the

⁴ Earley’s algorithm recognizes that $B$ derives the empty string with a series of PREDICTOR and COMPLETER steps [16].