Introduction to packrat parsing for PEGs (Parsing Expression Grammars)

PyCon APAC 2011, Singapore
Gavin Bong
Wednesday, June 8, 2011, 21:29

roadmap
Motivation         04 mins
PEG theory         05 mins
pyparsing          16 mins
PyMeta             07 mins
PyPy rlib/parsing  01 min
Closing            01 min

Total              34 mins

motivation
How to parse texts with PEGs:
- Natural languages (NLTK)
- Mini languages (DSLs)
- Structured / unstructured file formats

Four thoughts:
i.   Aren't structured formats like JSON, XML, and HTML well served by existing parsers?
ii.  Parsing log files and configuration files is easy with plain Python.
iii. Regular expressions are good enough.
iv.  What is wrong with the classical way of writing parsers?

CFG (Context-Free Grammars)
In formal language theory, CFGs are suitable for modeling both natural and computer languages. BNF is the de facto notation for describing the syntax of CFGs.

EBNF:
if_stmt ::= "if" expression ":" suite
            ( "elif" expression ":" suite )*
            [ "else" ":" suite ]

Grammars are built from sequence, decision (choice), repetition, and recursion. Original BNF had no repetition operator; repetition is expressed through recursion:

S → S a
S → ε        (equivalent to EBNF's a*)

CFG & Ambiguity
CFG grammars are potentially ambiguous. The dangling-else problem:

if( x > 5 )
    if( y > 5 )
        console.log("heaven");
else    console.log("limbo");

[Figure: AST #1, one of the two possible parses: nested IfExp nodes with the else bound one way.]

CFG & Ambiguity (2)
[Figure: AST #2: the same statement parsed with the else bound to the other if.]

Definitions
Parse trees vs ASTs

Parse tree = concrete:
- nodes are nonterminals from the grammar
- retains whitespace, braces, semicolons

AST = abstract:
- tree nodes are specific to language constructs

Top-down:
- begin with the start nonterminal
- work down the parse tree

vs

Bottom-up:
- identify terminals
- infer nonterminals
- climb the parse tree

Definitions (2)
Recursive descent parsing
* A top-down parser constructed from recursive functions.
* Each function represents a rule in the grammar.

version ::= <digit> '.' <digit>
digit   ::= '0' | '1' | ... | '9'

def version(source, position=0):
    digit(source, position)
    period(source, position + 1)
    digit(source, position + 2)

Run (pymeta): nosetests --nocapture -v test_rdp_list.py

Recursive Descent Parsing

def expect(source, position, comparator):
    try:
        expecting, msg = comparator
        if not expecting(source[0]):
            raise ParseError(position, msg)
        source.popleft()  # consume!
    except IndexError:
        raise EOFError(position)

def digit(source, position):
    fn = (lambda t: t in string.digits, this_rule())
    expect(source, position, fn)

def period(source, position):
    fn = (lambda t: t == '.', this_rule())
    expect(source, position, fn)

Recursive Descent Parsing (2)

>>> import collections
>>> version(collections.deque('1.6'))
>>> version(collections.deque('A.6'))
ParseError: (0, 'expected <digit>')
>>> version(collections.deque('1,6'))
ParseError: (1, 'expected <period>')
>>> version(collections.deque('1.'))
EOFError: (2, [('message', 'end of input')])

Classical method of parsing
1. Flesh out a grammar in BNF.
2. Lexical analysis phase:
   lexer(patterns, stream-of-characters) => stream of tokens
3. Parsing phase:
   parser(grammar, stream-of-tokens) => parse tree / AST
4. Use your parser.

Specific to LALR(1) bottom-up parsers.
Photo attribution: http://www.flickr.com/photos/j_aroche/2160902499/

Spectrum of parsing solutions
Regex
Handwritten recursive descent parsers
PEG parsers
Parser generators: ANTLR, Lex/Yacc (GNU flex/bison)

Other python parsing toolkits
PLY, Yapps, funcparserlib
http://wiki.python.org/moin/LanguageParsing

PEG
Formalized by Bryan Ford in 2002-2004.
Grammar mimics a recursive descent parser (+ backtracking).
Scanner-less.
A PEG grammar consists of a set of parsing expressions of the form A → e.
One expression is denoted the starting expression.

PEG operators (note: not EBNF):
e1 e2       Sequence
e1 / e2     Ordered choice
e*  e+  e?  Repetition
&e  !e      Predicates

PEG's ordered choice
S → "Hitch" / "Hitchens"
Q. Given an input string of "Hitchens", what is the result of the parse?
Law #1: Given an input A, a parsing expression matches a prefix A' of A or fails.
Law #2: A rule S → M / N will first try to parse an M. If that fails, backtrack and look for an N.
Answer: Hitch
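The two laws above can be sketched in a few lines of plain Python (the helper names are hypothetical, not part of any library): ordered choice commits to the first alternative that succeeds, and a successful parse consumes only a prefix of the input.

```python
def literal(s):
    """Parsing expression that matches the exact string s."""
    def parse(text, pos):
        if text.startswith(s, pos):
            return pos + len(s)   # new position on success
        return None               # failure
    return parse

def ordered_choice(*alternatives):
    """e1 / e2: try alternatives in order, commit to the first success."""
    def parse(text, pos):
        for alt in alternatives:
            result = alt(text, pos)
            if result is not None:
                return result
        return None
    return parse

S = ordered_choice(literal("Hitch"), literal("Hitchens"))
end = S("Hitchens", 0)
print("Hitchens"[:end])   # -> Hitch: only a prefix is consumed
```

Because "Hitch" succeeds first, the "Hitchens" alternative is never tried; this is exactly why the answer above is "Hitch" and not the longer match.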

PEG vs CFG

                                     PEG         CFG
Syntax definition philosophy         Analytical  Generative
Choice (e1/e2)                       Ordered     Commutative
Handles ambiguous grammars           No          Yes
Requires a lexical analysis phase?   No          Yes (lex/yacc)
Left recursion                       No*         Yes

* Warth et al., "Packrat parsers can support left recursion" (2008)

PEG & Packrat parsing
Context: recursive descent parsing with backtracking.
Problem: an input substring might be re-parsed during backtracking.

grammar ::= A B | A C

Solution: memoization guarantees linear-time performance.

(Neotoma cinerea: a packrat.)
Photo attribution: http://en.wikipedia.org/wiki/File:Neotoma_cinerea.jpg
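The memoization idea can be sketched in plain Python (rule and input names are hypothetical): each (rule, position) pair is evaluated at most once, so when the parser backtracks from A B to A C, the second attempt at A is a cache hit.

```python
from functools import lru_cache

TEXT = "across"
calls = {"A": 0}

@lru_cache(maxsize=None)           # the packrat cache: one entry per (rule, position)
def A(pos):
    calls["A"] += 1
    return pos + 1 if TEXT.startswith("a", pos) else None

def B(pos):
    return pos + 1 if TEXT.startswith("b", pos) else None

def C(pos):
    return pos + 1 if TEXT.startswith("c", pos) else None

def grammar(pos):                  # grammar ::= A B | A C
    r = A(pos)                     # first alternative: A B
    if r is not None and B(r) is not None:
        return B(r)
    r = A(pos)                     # backtrack: A is now a cache lookup
    if r is not None and C(r) is not None:
        return C(r)
    return None

grammar(0)
print(calls["A"])   # -> 1: A was evaluated once despite being used twice
```

Since every (rule, position) result is stored, total work is bounded by rules × positions, which is where the linear-time guarantee comes from.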

case study #1 : problem statement
Parse modern Japanese dates in various formats.

If the date parses successfully, convert it to its equivalent datetime.date instance.

case study #1 : The four ERAs

HEISEI   (Akihito)    1989 Jan 8  - present
SHOWA    (Hirohito)   1926 Dec 25 - 1989 Jan 7
TAISHOU  (Yoshihito)  1912 Jul 30 - 1926 Dec 24
MEIJI    (Mutsuhito)  1868 Sep 8  - 1912 Jul 29

case study #1 : liberties taken
1. No support for days-of-the-week tagged onto the end.
2. Numbers use Western digits, not kanji.
3. Some eras have overlapping days. Ignore.
4. For the 1st year of an era, no support for gannen.

case study #1 : initial attempt

from pyparsing import Literal, Word, nums

year       = Literal( u'\u5e74' )         # 年
month      = Literal( u'\u6708' )         # 月
day        = Literal( u'\u65e5' )         # 日
heisei_era = Literal( u'\u5e73\u6210' )   # 平成
integer    = Word(nums)                   # (a 2-digit variant: Word(nums, exact=2))

case study #1 : initial attempt (2)

day_spec   = integer.setResultsName('dd') + day
month_spec = integer('mm') + month

western_year  = integer('yyyy') + year
imperial_year = heisei_era + western_year
year_spec = (imperial_year('imperial') | western_year('western'))
grammar   = year_spec + month_spec + day_spec

case study #1 : initial attempt (3)
result = grammar.parseString(japanese_date)
print result.dump()

pyparsing : introduction
Easy-to-use PEG-based text parser.
Grammar definitions are written in Python.
Framework distributed as a single file: pyparsing.py
Runs on both Python 2.x & 3.x. Future releases after 1.5.x will focus on Python 3.x only.
Not classified as recursive descent!

pyparsing : framework overview
[Figure: overview diagram of the pyparsing framework.]

pyparsing & PEGs : correlation

PEG        pyparsing
e1 e2      e1 + e2  == And(e1, e2)
e1 / e2    e1 | e2  == MatchFirst([e1, e2])
e*         ZeroOrMore(e)
e+         OneOrMore(e)
e?         Optional(e)
&e         FollowedBy(e)
!e         ~e  == NotAny(e)


pyparsing : ordered choice
MatchFirst will short-circuit as soon as a match is found. It is not commutative.
Avoid shadowing literals, i.e. alternatives where one literal is a substring of the other.
(Keywords behave differently.)

pyparsing : backtracking
Or forces the parser to make an exhaustive search of the alternatives (match longest).
Or might introduce ambiguities. It is no better than non-PEG parsers.
Tweak the order of alternatives: put the most probable (e.g. by frequency of occurrence) first. This avoids wasteful backtracking.
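The difference between first-match (MatchFirst, `|`) and exhaustive longest-match (Or, `^`) can be sketched without pyparsing; this is a plain-Python illustration of the two strategies, not pyparsing's actual implementation.

```python
def match_first(text, alternatives):
    """MatchFirst-style: commit to the first alternative that matches a prefix."""
    for alt in alternatives:
        if text.startswith(alt):
            return alt
    return None

def match_longest(text, alternatives):
    """Or-style: exhaustively test every alternative, keep the longest match."""
    hits = [alt for alt in alternatives if text.startswith(alt)]
    return max(hits, key=len) if hits else None

alts = ["Hitch", "Hitchens"]
print(match_first("Hitchens", alts))    # -> Hitch
print(match_longest("Hitchens", alts))  # -> Hitchens
```

Note that match_longest must try every alternative even after one succeeds, which is exactly the extra backtracking cost the slide warns about.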

pyparsing : backtracking
Ballon d'Or 2011 example:

p1, p2, p3, p4, p5 = map(Literal, ['ronaldo', 'messi', 'park-ji-sung', 'xavi', 'iniesta'])
first  = p2 + p1 + p4
second = p2 + p1 + p5
third  = p2 + p1 + p3
grammar = first | second | third
print grammar.parseString("messi ronaldo park-ji-sung")

pyparsing : left factored

p1, p2, p3, p4, p5 = map(Literal, ['ronaldo', 'messi', 'park-ji-sung', 'xavi', 'iniesta'])
absolute_certainty = p2 + p1
too_close_to_call  = p4 | p5 | p3
grammar = absolute_certainty + too_close_to_call
print grammar.parseString("messi ronaldo park-ji-sung")

pyparsing : packrat
Memoization must be manually turned on:
ParserElement.enablePackrat()

Caches:
a. ParseResults
b. Exceptions thrown

Caveat emptor: a grammar whose parse actions have side effects does not always play well with memoization turned on.

Run: python select_parser.py

pyparsing : semantic actions
In pyparsing parlance, a ParserElement can have zero or more parse actions.
Four forms of parse actions:
fn(s, loc, toks)
fn(loc, toks)
fn(toks)
fn()

Usage:
ParserElement.setParseAction( *fn )
ParserElement.addParseAction( *fn )

Uses:
1. Perform validation (see ParseException).
2. Process the matched token(s) & modify them. Returning a value overwrites the matched token(s).
3. Annotate with custom types (corollary of #2).

case study #1 : Semantic action
All users of the integer expression inherit its parse action:

integer = Word(nums).setParseAction(lambda t: int(t[0]))

Selective assignment of parse actions to copies:

integer.copy().addParseAction( .. )
integer( 'result_name' ).addParseAction( .. )

def range_check(toks):
    month = int(toks[0])
    if month <= 0 or month >= 13:
        raise ParseException('month must be in range 1..12')

month_spec = integer('m').addParseAction(range_check) + month

Show: japan_simple.py

case study #1 : test files
imperial.utf8
western.utf8

case study #1 : complete solution
Demo:

@traceParseAction
def convert_kanji_year(toks):
    if 'imperial' in toks.keys():
        year = toks.imperial.yearZero + toks.imperial.yy
        toks['era']  = toks.imperial.type_
        toks['yyyy'] = year
    elif 'western' in toks.keys():
        year = toks.yyyy
    try:
        toks['modernDate'] = date(year, toks.mm, toks.dd)
    except ValueError, error:
        raise ParseException(error.args[0])

Show: japan_dates.py

case study #2 : problem statement
Parse Gmail search criteria.

Supports a tiny subset of the full grammar:
from : ( <sender> )
label : inbox
-label : sent
yyyyy
-yyyyy
"zzzzz"
-"zzzzz"

case study #2: example strings
from : ( bruno manser )
from : ( bruno.manser@swiss.org )
from : ( @swiss.org )
label : sarawak
-label : not-urgent
"penan injustice"
-logging

case study #2: email addresses

emailfull = Regex(r"(?P<user>[A-Za-z0-9._%+-]+)@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,4})")
emailpartial = Regex(r"@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,4})")

email = (emailpartial | emailfull)
squeeze = lambda t: ' '.join(t[0].split())
name = ZeroOrMore(Word(alphanums + ' ')).setParseAction(squeeze)
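The two regex patterns above can be exercised directly with the standard `re` module (independently of pyparsing's Regex wrapper) to see which named groups they capture:

```python
import re

emailfull = re.compile(
    r"(?P<user>[A-Za-z0-9._%+-]+)@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,4})")
emailpartial = re.compile(
    r"@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,4})")

# a full address captures user, hostname, and tld
m = emailfull.match("bruno.manser@swiss.org")
print(m.group("user"), m.group("hostname"), m.group("tld"))   # -> bruno.manser swiss org

# a partial address (hostname only) captures hostname and tld
m = emailpartial.match("@swiss.org")
print(m.group("hostname"), m.group("tld"))                    # -> swiss org
```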

case study #2: email addresses (2)
opener, closer, colon = map(Suppress, '():')
enclosed = email | name
nested = opener + enclosed + closer
grammar_email = Combine(Suppress('from') + colon + nested)

case study #2: email addresses (3)
result = grammar_email.parseString('from:(bruno.manser.25@borneo.org)')
print result.dump()

result = grammar_email.parseString('from:( Marco de Gasperi )')
print result.dump()

Run: nosetests -v testFromTo.py

case study #2: labels
GOAL: group the excluded and included labels into their own sub-lists.
E.g. label : fukushima1 -label : aloo-gobi

hyphen = Suppress('-')

# delimitedList(expr, delim, combine=True) == Combine(expr + ZeroOrMore(delim + expr))
label_rhs = delimitedList(Word(alphanums), delim='-', combine=True)
label_include = Combine(Suppress('label') + colon + label_rhs)
label_exclude = Combine(hyphen + label_include)
label_all = MatchFirst([
    label_exclude.setResultsName('labels.exclude', listAllMatches=True),
    label_include('labels.include*')])   # expr('name*') shorthand: pyparsing 1.5.6
grammar_label = ZeroOrMore(label_all)

case study #2: labels (2)

result = grammar_label.parseString('-label:fukushima1 label:onagawa -label:aloo-gobi label:cheese-naan')
print result.dump()

Question: will this grammar work if the user entered LABEL instead of label?
Answer: use CaselessLiteral('label').

case study #2: search strings
GOAL: group the excluded and included search strings into their own sub-lists.
E.g. rumi -"jack kerouac"

key_single   = Word(alphanums)
key_quoted   = quotedString.setParseAction(removeQuotes)
key_included = key_quoted | key_single
key_excluded = Combine(hyphen + key_included)
key_all = MatchFirst([key_excluded("key.exclude*"),
                      key_included("key.include*")])
grammar_key = ZeroOrMore(key_all)

case study #2: search strings (2)
result = grammar_key.parseString(' -osama obama -"bin laden" "white house" ')
print result.dump()

Question: if the user entered single instead of double quotes, will it conform to the grammar?
Answer: yes (quotedString accepts both).

case study #2: Final solution
Let's compose all the individual pieces together.

nested = opener + Group(enclosed) + closer
email_all = grammar_email('from*')
gmail = (ZeroOrMore(email_all | label_all | key_all)
         + Suppress(restOfLine))

result = gmail.parseString('love label:writing-tips "bird by bird" '
    'from:(Anne Lamott) -"dalai lama" -label:macchu-pichu '
    'from:(agnes.obel@sparrow.net) -label:french-guiana -"epictetus" '
    'label:yoga "bugle podcast" label from:(@microsoft.com)')
print result.dump()

case study #2: Final solution (output)

['love', 'writing-tips', 'bird-by-bird', 'Anne Lamott', 'dalai lama', 'macchu-pichu', '@microsoft.com', 'agnes.obel@sparrow.net', 'french-guiana', 'epictetus', 'yoga', 'bugle podcast', 'label']
- from: ['Anne Lamott', '@microsoft.com', 'agnes.obel@sparrow.net']
- key.exclude: ['dalai lama', 'epictetus']
- key.include: ['love', 'bird by bird', 'bugle podcast', 'label']
- labels.exclude: ['macchu-pichu', 'french-guiana']
- labels.include: ['writing-tips', 'yoga']

pyparsing: Recursion
A grammar is recursive when some nonterminal appears on the right-hand side of its own production rule.

number ::= digit rest
rest   ::= digit rest | empty

digit = Word(nums, exact=1).setName('1-digit')
rest = Forward()
rest << Optional(digit + rest)
number = Combine(digit + rest, adjacent=False)('digit-list')
grammar = number.setParseAction(lambda t: int(t[0])) + Suppress(restOfLine)

Run
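The same right-recursive grammar can be sketched as plain recursive Python functions (the function names are hypothetical); this is roughly the structure that the Forward/Optional combination wires up:

```python
def number(source, pos=0):
    """number ::= digit rest -- returns (value, position after last digit)."""
    if pos >= len(source) or not source[pos].isdigit():
        raise ValueError("expected digit at position %d" % pos)
    end = rest(source, pos + 1)
    return int(source[pos:end]), end

def rest(source, pos):
    """rest ::= digit rest | empty -- recurse while digits remain."""
    if pos < len(source) and source[pos].isdigit():
        return rest(source, pos + 1)
    return pos

print(number("1234xyz"))  # -> (1234, 4)
```

Right recursion like this terminates because every recursive call advances the position; the left-recursive variant on a later slide does not have that property.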

case study #3: binary tree
Parse parenthesis notation for binary trees:

(nil,4,nil)
((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))

[Figure: the corresponding binary tree, rooted at 4.]

Convert it to list notation in Python.

case study #3: recursive solution
BNF:
node ::= '(' node ',' number ',' node ')' | empty

Code:
left, right, comma = map(Suppress, '(),')
empty = (CaselessLiteral('nil')
         .setParseAction(replaceWith(None)))
tree = Forward()
value = Word(nums).setParseAction(lambda t: int(t[0]))
tree << ((left + Group(tree) + bookend(value) + Group(tree) + right)
         | empty)

Run

case study #3: recursive solution (2)
Input:
"((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))"
Output:
[[[None],2,[[None],3,[None]]],4,[[[None],5,[[None],6,[None]]],7,[None]]]

How to fix the extra [None] wrapping: re-implement Group in Group(tree):

class TreeGroup(TokenConverter):
    def postParse(self, instring, loc, tokenlist):
        if len(tokenlist) == 1 and tokenlist[0] is None:
            return tokenlist
        else:
            return [tokenlist]

pyparsing : left recursion
pyparsing does not support left recursion.

term ::= \d+
expr ::= expr '+' term | term

pyparsing will raise a RuntimeError with the message 'maximum recursion depth exceeded'.

@raises(RecursiveGrammarException)
def test_left_recursion(self):
    expr.validate()

Eliminate left recursion if you want it to work in pyparsing.
Run
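The standard elimination rewrites expr ::= expr '+' term | term into the iterative form expr ::= term ('+' term)*, which needs no unbounded recursion. A minimal plain-Python sketch of the rewritten rule (the tokenizer here is a hypothetical stand-in, not part of pyparsing):

```python
import re

def parse_expr(text):
    """expr ::= term ('+' term)*  -- left recursion eliminated."""
    tokens = re.findall(r"\d+|\+", text.replace(" ", ""))
    value, pos = int(tokens[0]), 1       # consume the first term
    while pos < len(tokens) and tokens[pos] == "+":
        value += int(tokens[pos + 1])    # fold each "+ term" left to right
        pos += 2
    return value

print(parse_expr("1 + 2 + 39"))  # -> 42
```

The loop plays the role that the left-recursive call played in the original grammar, and left-associativity is preserved by folding as the loop advances.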

PyMeta : introduction
OMeta is a language prototyping system (PEG-based), implemented in several programming languages.
* Packrat memoization
* Grammar: a BNF dialect (with host-language snippets)
* Object-oriented: inheritance, overriding rules

lowercase ::= <char_range 'a' 'z'>
def rule_lowercase(): # ..body..

* <anything> consumes one object from the input stream (c.f. regex).
* Built-in rules: <letter> <digit> <letterOrDigit> <token '?'>

PEGs & PyMeta

PEG                                      PyMeta
e1 e2                                    e1 e2
e1 / e2                                  e1 | e2
e*                                       e*
e+                                       e+
e?                                       e?
&e  (syntactic predicate,                ~~e
     unlimited lookahead)
!e                                       ~e

case study #1 : in PyMeta
Modest goals:
a) recognize western and Heisei imperial dates
b) read & parse both imperial.utf8 & western.utf8

Separate files:
common.py            : common rules & utilities
western_dates.py     : grammar to recognize western dates
era_heisei.py        : grammar to recognize heisei dates
japan_date_parser.py : final grammar

case study #1 : in PyMeta pt A
common.py

from pymeta.grammar import OMeta

baseGrammar = r"""
# common literals for all ERAs
year  ::= <token u'\u5E74'>
month ::= <token u'\u6708'>
day   ::= <token u'\u65E5'>

range_num :min :max ::= <digit>+:m
    ?(int(join(m)) >= min and int(join(m)) <= max) => m
rest_of_line   ::= <anything>* <token '\n'>? => None
empty_line     ::= <spaces> <rest_of_line> => None
python_comment ::= <token '#'> <rest_of_line> => None
"""

def join(x):
    return ''.join(x)

JapanCommonParser = OMeta.makeGrammar(baseGrammar, globals(), "JapanCommonParser")

case study #1 : in PyMeta pt B
western_dates.py

westernGrammar = r"""
western ::= <spaces> <digit>+:y <year>
            <range_num 1 12>:m <month>
            <range_num 1 31>:d <day>
            <rest_of_line>
    => westernized(int(join(y)), int(join(m)), int(join(d)))
grammar ::= <python_comment> | <western>"""

def westernized(yyyy, mm, dd):
    retval = JapanDate()
    retval['western'] = date(yyyy, mm, dd)
    return retval

WesternParser = JapanCommonParser.makeGrammar(westernGrammar, globals(), 'WesternParser')

case study #1 : in PyMeta pt C
era_heisei.py

era_heisei = Era('Heisei', 'Akihito',
                 (u'\u5E73\u6210', u'\u337B'),
                 startDate=date(1989, 1, 8))

def heisei_year_ok(yy):
    return yy >= 1 and yy <= era_heisei.maxYearUnit

def collect(yy, mm, dd):
    retval = JapanDate()
    retval['imperial'] = date(era_heisei.yearZero + yy, mm, dd)
    retval['era'] = [era_heisei.name, yy]
    return retval

case study #1 : in PyMeta pt C (2)
era_heisei.py (continued)

heiseiGrammar = r"""
hlong  ::= <token u'\u5e73\u6210'>
hshort ::= <token u'\u337b'>
heisei ::= (<hlong> | <hshort>) <digit>+:y
           ?(heisei_year_ok(int(join(y)))) <year>
           <range_num 1 12>:m <month>
           <range_num 1 31>:d <day>
           <rest_of_line>
    => collect(int(join(y)), int(join(m)), int(join(d)))
"""
HeiseiParser = JapanCommonParser.makeGrammar(heiseiGrammar, globals(), 'HeiseiParser')

case study #1 : in PyMeta pt D
japan_date_parser.py

finalGrammar = r"""
# override 'grammar' in WesternParser
grammar ::= <super> | <heisei> | <empty_line>"""

class BaseParser(HeiseiParser, WesternParser):
    pass

BaseParser.globals.update(WesternParser.globals)
BaseParser.globals.update(HeiseiParser.globals)

JapanDateParser = BaseParser.makeGrammar(finalGrammar, globals(), "JapanDateParser")

case study #1 : in PyMeta pt D (2)
japan_date_parser.py (continued)

def parse_file(filename):
    """ iterate through each line """
    # .... snipped ...
    parser = JapanDateParser(line)
    result, error = parser.apply('grammar')
    # .... snipped ...

results = parse_file('imperial.utf8')
results = parse_file('western.utf8')

Run

case study #1 : PyMeta output
[Figure: sample output of the PyMeta date parser.]

PyMeta : Left Recursion
PyMeta can handle left recursion.

recursiveGrammar = r"""
num ::= <num>:n <digit>:d => n * 10 + d
      | <digit>
digit ::= :d ?((d >= '0') & (d <= '9')) => int(d)"""

Quiz: is the following grammar equivalent?

num ::= <digit> | <num>:n <digit>:d => n * 10 + d

Run

PyMeta : Matching objects
Matching a python list:

listGrammar = """
digit  ::= :x ?(x.isdigit()) => int(x)
interp ::= [<digit>:x '+' <digit>:y] => x + y"""

g = OMeta.makeGrammar(listGrammar, {})
parser = g([['600', '+', '66']])   # the input is an iterable of objects
result, error = parser.apply('interp')
>>> result
666
>>> error
ParseError(2, [])

PyMeta : Matching objects (2)
Object graph (e.g. a tree): the python_rewriter project visits the AST created by the compiler module (python 2.x) & regenerates the python statement.

>>> import compiler
>>> print compiler.parse('import ctypes')
Module(None, Stmt([Import([('ctypes', None)])]))

import :i ::= <anything>:a ?(a.__class__ == Import)
    => 'import ' + ', '.join(import_match(a.names))

pyparsing vs PyMeta

                                pyparsing                            PyMeta
Whitespace sensitive?           No (turn on via leaveWhitespace())   Yes (use the <spaces> rule to eat whitespace)
Left recursion                  No                                   Yes
Packrat memoization             Yes (off by default)                 Yes (only no-arg rules)
Operates on character streams   Yes                                  Yes
Operates on object streams      No                                   Yes
Syntactic predicates            Yes                                  Yes
Semantic predicates             No (@see parse actions)              Yes
Semantic actions                Yes                                  Yes
Regex support                   Yes                                  No

PyPy rlib/parsing
Library for generating tokenizers & parsers in RPython. Consists of:
- regex / packrat parser
- tree structure / EBNF parser

Sample JSON EBNF:
NUMBER: "\-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][\+\-]?[0-9]+)?";
value: <STRING> | <NUMBER> | <object> | <array>
     | <"null"> | <"true"> | <"false">;
array: ["["] (value [","])* value ["]"];
entry: STRING [":"] value;

The resulting parse tree can be transformed or traversed with custom visitors. (Graphviz dot output is available.)

Topics not covered
- Usage of syntactic predicates
- Parsing grammars of mathematical expressions while preserving operator precedence
- Handling indents/dedents in order to parse indentation-sensitive languages (e.g. coffeescript, python, haskell)

Resources
pyparsing
  http://pyparsing.wikispaces.com/
  https://github.com/marcua/tweeql
PyMeta
  http://www.tinlizzie.org/ometa/
  http://gitorious.org/python-decompiler/python_rewriter
PyPy RPython parsing library
  http://doc.pypy.org/en/latest/rlib.html

rubycoder@gmail.com
