
MELJUN P.

CORTES, MBA,MPA,BSCS,ACS
CSC 3130: Automata theory and formal languages

Parsers for programming languages

Andrej Bogdanov
http://www.cse.cuhk.edu.hk/~andrejb/csc3130

CFG of the java programming language
Identifier:
    IDENTIFIER

QualifiedIdentifier:
    Identifier { . Identifier }

Literal:
    IntegerLiteral
    FloatingPointLiteral
    CharacterLiteral
    StringLiteral
    BooleanLiteral
    NullLiteral

Expression:
    Expression1 [AssignmentOperator Expression1]

AssignmentOperator:
    = += -= *= /= &= |=

from http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html#52996

Parsing java programs
class Point2d {
    /* The X and Y coordinates of the point--instance variables */
    private double x;
    private double y;
    private boolean debug;    // A trick to help with debugging

    public Point2d (double px, double py) {    // Constructor
        x = px;
        y = py;
        debug = false;    // turn off debugging
    }

    public Point2d () {    // Default constructor
        this (0.0, 0.0);    // Invokes 2 parameter Point2D constructor
    }

    // Note that a this() invocation must be the BEGINNING of
    // statement body of constructor

    public Point2d (Point2d pt) {    // Another constructor
        x = pt.getX();
        y = pt.getY();
    }
}

Simple java program: about 1000 symbols

Parsing algorithms
• How long would it take to parse this?
exhaustive algorithm: about 10^80 years (longer than life of universe)
CYK algorithm: about 1 week!

• Can we parse faster?
• Not really! CYK is among the fastest known general-purpose parsing algorithms

Another way of thinking

Scientist:
Find an algorithm that can parse strings in any grammar

Engineer:
Design your grammar so it has a very fast parsing algorithm

An example
Stack    Input      Action
ε        abaabbc    shift
a        baabbc     shift
ab       aabbc      reduce (5)
A        aabbc      reduce (3)
T        aabbc      shift
Ta       abbc       shift
Taa      bbc        shift
Taab     bc         reduce (5)
TaA      bc         reduce (3)
TaT      bc         shift
TaTb     c          reduce (4)
TA       c          reduce (2)
T        c          shift
Tc       ε          reduce (1)
S        ε

S → Tc (1)
T → TA (2) | A (3)
A → aTb (4) | ab (5)

input: abaabbc
[Figure: parse tree for abaabbc. S → Tc at the root; T → TA; the inner T → A → ab; the second A → aTb, with its T → A → ab]
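The trace on this slide can be checked mechanically. The following is a minimal sketch, not the parsing algorithm itself: it only replays a given list of actions. The names `PRODUCTIONS`, `ACTIONS` and `replay` are hypothetical, chosen for this illustration.

```python
# Hypothetical encoding of the slide's grammar: production number -> (lhs, rhs).
PRODUCTIONS = {1: ("S", "Tc"), 2: ("T", "TA"), 3: ("T", "A"),
               4: ("A", "aTb"), 5: ("A", "ab")}

def replay(text, actions):
    """Apply shift/reduce actions in order; return every (stack, input) pair."""
    stack, rest = "", text
    trace = [(stack, rest)]
    for action in actions:
        if action == "shift":
            # Move the next input symbol onto the stack.
            stack, rest = stack + rest[0], rest[1:]
        else:
            # An integer names the production to reduce by.
            lhs, rhs = PRODUCTIONS[action]
            if not stack.endswith(rhs):
                raise ValueError(f"cannot reduce by ({action}): stack is {stack!r}")
            stack = stack[:len(stack) - len(rhs)] + lhs
        trace.append((stack, rest))
    return trace

# The action column of the slide, top to bottom.
ACTIONS = ["shift", "shift", 5, 3, "shift", "shift", "shift", 5, 3,
           "shift", 4, 2, "shift", 1]
trace = replay("abaabbc", ACTIONS)
for stack, rest in trace:
    print(f"{stack or 'ε':6} {rest or 'ε'}")
```

The replay ends with the stack holding only S and no input left, which is what it means for abaabbc to be parsed successfully.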

Items
S → Tc (1)   T → TA (2)   T → A (3)   A → aTb (4)   A → ab (5)

Items of each production:
S → Tc (1):    S → •Tc    S → T•c    S → Tc•
T → TA (2):    T → •TA    T → T•A    T → TA•
T → A (3):     T → •A     T → A•
A → aTb (4):   A → •aTb   A → a•Tb   A → aT•b   A → aTb•
A → ab (5):    A → •ab    A → a•b    A → ab•

Stack    Input      Action
•        abaabbc    shift
a•       baabbc     shift
ab•      aabbc      reduce (5)
A•       aabbc      reduce (3)
T•       aabbc      shift
Ta•      abbc       shift

Idea of parsing algorithm: Try to match complete items to top of stack
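As an illustration of what an item is, here is a short sketch that lists all items of the grammar above. The representation of an item as a triple `(lhs, rhs, dot)` and the helper names are assumptions made for this example, not notation from the slides.

```python
# The grammar from the slide, one (lhs, rhs) pair per production.
GRAMMAR = [("S", "Tc"), ("T", "TA"), ("T", "A"), ("A", "aTb"), ("A", "ab")]

def items_of(productions):
    """Every item (lhs, rhs, dot): one per dot position in each production."""
    return [(lhs, rhs, dot)
            for lhs, rhs in productions
            for dot in range(len(rhs) + 1)]

def is_complete(item):
    """An item is complete when the dot is at the right end."""
    lhs, rhs, dot = item
    return dot == len(rhs)

def show(item):
    """Render an item in the slide's notation, e.g. A → a•Tb."""
    lhs, rhs, dot = item
    return f"{lhs} → {rhs[:dot]}•{rhs[dot:]}"

ITEMS = items_of(GRAMMAR)
for item in ITEMS:
    print(show(item), "(complete)" if is_complete(item) else "")
```

A production with a right-hand side of length k contributes k + 1 items, exactly one of which is complete.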

Some terminology
Stack    Input      Action
ε        abaabbc    shift
a        baabbc     shift
ab       aabbc      reduce (5)
A        aabbc      reduce (3)
T        aabbc      shift
Ta       abbc       shift
Taa      bbc        shift
Taab     bc         reduce (5)
TaA      bc         reduce (3)
TaT      bc         shift
TaTb     c          reduce (4)
TA       c          reduce (2)
T        c          shift
Tc       ε          reduce (1)
S        ε

S → Tc (1)
T → TA (2) | A (3)
A → aTb (4) | ab (5)

input: abaabbc
handle: the string on top of the stack that the next reduce replaces
valid items: a•Tb, a•b
valid items: T•a, T•c, aT•b

Outline of LR(0) parsing algorithm
• As the string is being read, it is pushed on a stack
• Algorithm keeps track of all valid items
• Algorithm can perform two actions:
shift: no complete item is valid
reduce: there is one valid item, and it is complete

Running the algorithm
A → aAb | ab          A ⇒ aAb ⇒ aabb

Stack    Input    Action    Valid items
ε        aabb     S         A → •aAb   A → •ab
a        abb      S         A → a•Ab   A → a•b   A → •aAb   A → •ab
aa       bb       S         A → a•Ab   A → a•b   A → •aAb   A → •ab
aab      b        R         A → ab•
aA       b        S         A → aA•b
aAb      ε        R         A → aAb•
A        ε


How to update viable items
• Initial set of valid items
S → •α for every production S → α

• Updating valid items on “shift b”
A → α•bβ is updated to A → αb•β
A → α•Xβ disappears if X ≠ b

– After these updates, for every valid item A → α•Cβ and production C → δ, we also add C → •δ as a valid item

notation:
a, b: terminals
A, B: variables
X, Y: mixed symbols
α, β: mixed strings

How to update viable items
• Updating valid items on “reduce β to B”
– First, we backtrack to viable items before reduce
– Then, we apply same rules as for “shift B” (as if B were a terminal)
A → α•Bβ is updated to A → αB•β
A → α•Xβ disappears if X ≠ B
C → •δ is added for every valid item A → α•Cβ and production C → δ
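The update rules can be sketched in a few lines for the grammar A → aAb | ab. This is an illustrative sketch under stated assumptions: items are triples `(lhs, rhs, dot)`, uppercase letters are variables, and the function names are hypothetical. It also applies the "add C → •δ" rule to the initial set, which matters for grammars whose start productions begin with a variable. Backtracking on a reduce is modelled simply by recomputing the valid items from the shortened stack.

```python
# Grammar from the slides: A → aAb | ab. Uppercase letters are variables.
GRAMMAR = [("A", "aAb"), ("A", "ab")]
START = "A"

def closure(items):
    """Add C → •δ for every item whose dot stands in front of a variable C."""
    result = set(items)
    pending = list(items)
    while pending:
        lhs, rhs, dot = pending.pop()
        if dot < len(rhs) and rhs[dot].isupper():
            for head, body in GRAMMAR:
                new = (head, body, 0)
                if head == rhs[dot] and new not in result:
                    result.add(new)
                    pending.append(new)
    return frozenset(result)

def initial_items():
    """S → •α for every production of the start variable (closed: an assumption)."""
    return closure({(lhs, rhs, 0) for lhs, rhs in GRAMMAR if lhs == START})

def advance(items, symbol):
    """Shift update: move the dot over `symbol`; all other items disappear."""
    moved = {(lhs, rhs, dot + 1) for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)

def valid_items(stack):
    """Valid items for a stack, recomputed from the bottom one symbol at a time."""
    items = initial_items()
    for symbol in stack:
        items = advance(items, symbol)
    return items

def show(items):
    return sorted(f"{lhs} → {rhs[:dot]}•{rhs[dot:]}" for lhs, rhs, dot in items)

for stack in ["", "a", "aa", "aab", "aA", "aAb"]:
    print(f"{stack or 'ε':4}", show(valid_items(stack)))
```

Reducing ab to A with stack aab corresponds to `valid_items("aA")`: the stack without the handle, followed by a "shift A".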

Viable item updates by NFA
• States of NFA will be items (plus a start state q0)
• For every item S → •α we have a transition
q0 --ε--> S → •α

• For every item A → α•Xβ we have a transition
A → α•Xβ --X--> A → αX•β

• For every item A → α•Cβ and production C → δ we have a transition
A → α•Cβ --ε--> C → •δ
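The three rules translate directly into a construction of the NFA's edges. The sketch below assumes items are triples `(lhs, rhs, dot)` and uses the empty string as the label of an ε-transition; `build_item_nfa` is a hypothetical name.

```python
def build_item_nfa(grammar, start):
    """Return the NFA edges as (source, label, target); "" stands for ε."""
    edges = set()
    for lhs, rhs in grammar:
        if lhs == start:
            # Rule 1: q0 --ε--> S → •α
            edges.add(("q0", "", (lhs, rhs, 0)))
        for dot, symbol in enumerate(rhs):
            item = (lhs, rhs, dot)
            # Rule 2: A → α•Xβ --X--> A → αX•β
            edges.add((item, symbol, (lhs, rhs, dot + 1)))
            # Rule 3: A → α•Cβ --ε--> C → •δ for every production of C
            for head, body in grammar:
                if head == symbol:
                    edges.add((item, "", (head, body, 0)))
    return edges

edges = build_item_nfa([("A", "aAb"), ("A", "ab")], "A")
for source, label, target in sorted(edges, key=str):
    print(source, label or "ε", target)
```

For the grammar A → aAb | ab this yields nine edges: four ε-transitions and five transitions labeled by a, b or A.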

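Example
A → aAb | ab

NFA transitions:
q0 --ε--> A → •aAb
q0 --ε--> A → •ab
A → •aAb --a--> A → a•Ab
A → a•Ab --A--> A → aA•b
A → aA•b --b--> A → aAb•
A → a•Ab --ε--> A → •aAb
A → a•Ab --ε--> A → •ab
A → •ab --a--> A → a•b
A → a•b --b--> A → ab•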

Convert NFA to DFA
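DFA states:
1: A → •aAb, A → •ab
2: A → a•Ab, A → a•b, A → •aAb, A → •ab
3: A → ab•
4: A → aA•b
5: A → aAb•
die: target of every transition not listed below

DFA transitions:
1 --a--> 2
2 --a--> 2
2 --b--> 3
2 --A--> 4
4 --b--> 5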

states correspond to sets of valid items transitions are labeled by variables / terminals
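The conversion is the usual subset construction with ε-closures. The sketch below hard-codes the NFA for A → aAb | ab as a list of edges (state names are plain strings, "" marks an ε-transition) and numbers the DFA states in the order they are discovered. The names `EDGES`, `eps_closure`, `move` and `subset_construction` are hypothetical, and the "die" state is left implicit as a missing table entry.

```python
# NFA of items for A → aAb | ab; "" is an ε-transition.
EDGES = [
    ("q0", "", "A→•aAb"), ("q0", "", "A→•ab"),
    ("A→•aAb", "a", "A→a•Ab"), ("A→a•Ab", "A", "A→aA•b"), ("A→aA•b", "b", "A→aAb•"),
    ("A→•ab", "a", "A→a•b"), ("A→a•b", "b", "A→ab•"),
    ("A→a•Ab", "", "A→•aAb"), ("A→a•Ab", "", "A→•ab"),
]

def eps_closure(states):
    """All NFA states reachable from `states` using only ε-transitions."""
    result, pending = set(states), list(states)
    while pending:
        state = pending.pop()
        for src, label, dst in EDGES:
            if src == state and label == "" and dst not in result:
                result.add(dst)
                pending.append(dst)
    return frozenset(result)

def move(states, symbol):
    """ε-closure of the states reached from `states` by reading `symbol`."""
    return eps_closure({dst for src, label, dst in EDGES
                        if src in states and label == symbol})

def subset_construction():
    # q0 has no incoming edges, so it is dropped to match the slide's state 1.
    start = eps_closure({"q0"}) - {"q0"}
    states, table, pending = [start], {}, [start]
    while pending:
        current = pending.pop(0)
        for symbol in "abA":
            target = move(current, symbol)
            if not target:
                continue  # the empty set is the "die" state
            if target not in states:
                states.append(target)
                pending.append(target)
            table[(states.index(current) + 1, symbol)] = states.index(target) + 1
    return states, table

states, table = subset_construction()
for number, items in enumerate(states, start=1):
    print(number, sorted(items))
print(table)
```

With the symbols tried in the order a, b, A, the discovered states receive the same numbers 1 to 5 as on the slide.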

Attempt at parsing with DFA
A → aAb | ab          A ⇒ aAb ⇒ aabb

Stack    Input    Action    DFA state
ε        aabb     S         1: A → •aAb   A → •ab
a        abb      S         2: A → a•Ab   A → a•b   A → •aAb   A → •ab
aa       bb       S         2: A → a•Ab   A → a•b   A → •aAb   A → •ab
aab      b        R         3: A → ab•
aA       b                  ?: A → aA•b

Remember the state in stack!
A → aAb | ab          A ⇒ aAb ⇒ aabb

Stack      Input    Action    DFA state
1          aabb     S         1: A → •aAb   A → •ab
1a2        abb      S         2: A → a•Ab   A → a•b   A → •aAb   A → •ab
1a2a2      bb       S         2: A → a•Ab   A → a•b   A → •aAb   A → •ab
1a2a2b3    b        R         3: A → ab•
1a2A4      b        S         4: A → aA•b
1a2A4b5    ε        R         5: A → aAb•
1A         ε
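Keeping the states on the stack gives a complete parser for this grammar. The following sketch hard-codes the DFA for A → aAb | ab; the tables `GOTO` and `REDUCE` and the function `parse` are hypothetical names, and the "accept" row is an assumption added so that the function has a way to report success.

```python
# DFA for A → aAb | ab; a missing entry leads to the "die" state.
GOTO = {(1, "a"): 2, (2, "a"): 2, (2, "b"): 3, (2, "A"): 4, (4, "b"): 5}
# States whose only valid item is complete, with the production to reduce by.
REDUCE = {3: ("A", "ab"), 5: ("A", "aAb")}
START = "A"

def parse(text):
    """Return the rows (stack, remaining input, action); raise ValueError on failure."""
    stack, rest, rows = [1], text, []
    while True:
        shown = "".join(map(str, stack))
        if stack == [1, START] and not rest:
            rows.append((shown, rest, "accept"))
            return rows
        state = stack[-1]
        if state in REDUCE:
            lhs, rhs = REDUCE[state]
            rows.append((shown, rest, "R"))
            # Pop the handle together with the state stored after each symbol.
            del stack[len(stack) - 2 * len(rhs):]
            below = stack[-1]
            stack.append(lhs)
            # The remembered state `below` tells us where to continue.
            if (below, lhs) in GOTO:
                stack.append(GOTO[(below, lhs)])
        elif rest and (state, rest[0]) in GOTO:
            rows.append((shown, rest, "S"))
            stack += [rest[0], GOTO[(state, rest[0])]]
            rest = rest[1:]
        else:
            raise ValueError(f"parse error with stack {shown!r} and input {rest!r}")

for shown, rest, action in parse("aabb"):
    print(f"{shown:8} {rest or 'ε':5} {action}")
```

After a reduce, the state left on top of the stack replaces the "?" of the previous attempt: the parser never has to reread the stack from the bottom.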

LR(0) grammars and deterministic PDAs
• The parsing procedure can be implemented by a deterministic pushdown automaton
• A PDA is deterministic if in every state there is at most one possible transition
– for every input symbol and pop symbol, including ε

• Example: PDA for w#w^R is deterministic, but PDA for ww^R is not
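To see why the marker # makes the difference, here is a sketch of a deterministic PDA for w#w^R, simulated with a Python list as the stack. The alphabet {a, b} and the function name are assumptions made for this example. At every step exactly one move is possible: push before the #, pop and compare after it. For ww^R there is no such marker, so a PDA has to guess where the middle is.

```python
def accepts_w_hash_wr(text):
    """Simulate a deterministic PDA for strings w#w^R with w over {a, b}."""
    stack = []
    phase = "push"  # before '#': push symbols; after '#': pop and compare
    for symbol in text:
        if phase == "push":
            if symbol == "#":
                phase = "pop"
            elif symbol in "ab":
                stack.append(symbol)
            else:
                return False
        else:
            # The popped symbol must equal the symbol being read.
            if not stack or stack.pop() != symbol:
                return False
    # Accept only if '#' was seen and every pushed symbol was matched.
    return phase == "pop" and not stack

for sample in ["ab#ba", "#", "ab#ab", "abba"]:
    print(sample, accepts_w_hash_wr(sample))
```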

LR(0) grammars and deterministic PDAs
• Not every PDA can be made deterministic
• Since PDAs recognize exactly the context-free languages, LR(0) parsing algorithm must fail for some CFLs!
• When does LR(0) parsing algorithm fail?

Outline of LR(0) parsing algorithm
• Algorithm can perform two actions:
shift (S): no complete item is valid
reduce (R): there is one valid item, and it is complete

• What if:
some valid items complete, some not: S / R conflict
more than one valid complete item: R / R conflict
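The four cases can be written as one small function over a set of valid items. This is a sketch under stated assumptions: items are triples `(lhs, rhs, dot)`, the name `classify` is hypothetical, an empty set is reported as "error" (a case the slide does not discuss), and a set with several complete items is reported as an R / R conflict even if it also contains incomplete items. The two conflicting grammars in the examples are made up for illustration.

```python
def classify(items):
    """Decide what the LR(0) algorithm does for a set of valid items."""
    if not items:
        return "error"  # assumption: no valid item means the input is rejected
    complete = [item for item in items if item[2] == len(item[1])]
    incomplete = [item for item in items if item[2] < len(item[1])]
    if not complete:
        return "shift"
    if len(complete) > 1:
        return "R / R conflict"
    if incomplete:
        return "S / R conflict"
    return "reduce"

# State 2 of the DFA for A → aAb | ab: no complete item.
print(classify({("A", "aAb", 1), ("A", "ab", 1), ("A", "aAb", 0), ("A", "ab", 0)}))
# State 3 of the same DFA: one valid item, and it is complete.
print(classify({("A", "ab", 2)}))
# Hypothetical grammar A → a | ab, after reading a.
print(classify({("A", "a", 1), ("A", "ab", 1)}))
# Hypothetical grammar S → A | B, A → a, B → a, after reading a.
print(classify({("A", "a", 1), ("B", "a", 1)}))
```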

Hierarchy of context-free grammars
[Figure: nested classes of grammars, from largest to smallest]
context-free grammars: parse using CYK algorithm (slow)
LR(∞) grammars
LR(1) grammars
LR(0) grammars: parse using LR(0) algorithm
Languages shown in the figure: java, perl, python, …
to be continued…