
Disclaimer: A report submitted to Dublin City University, School of Computing for module
CA448: Compiler Construction 1, 2009/2010. I hereby certify that the work presented and
the material contained herein is my own except where explicitly stated references to other material
are made.

Lexical and Syntax Analyser

Eoin Costelloe

CA448: Compiler Construction 1


Lexer:

The lexer handles the following:

- Skips. The text between tokens (tabs, spaces, new lines) that is not significant to the lexer
  is skipped first. Single-line comments (//) and multi-line comments (/* */) are also skipped.
  For simplicity, multi-line comments cannot be nested.
- Simple tokens such as punctuation, keywords, signs and comparators.
- Boolean-specific tokens such as boolean constants (true, false) and boolean signs (!, &&, ||).
- Strings. A valid string starts with a double quote ("), written in the lexer as "\"", contains
  any characters except the new line character (\n) and the return character (\r), and ends with
  a double quote. A string cannot exceed its maximum length (5000 characters, excluding the 2
  quotes).
- Integers and identifiers. Integer constants may have leading zeros and carry no sign (they are
  assumed to be positive). Integer constants that exceed the variable size (greater than 2^31)
  are not allowed. Identifiers must begin with a letter, which can be followed by any number of
  digits, letters or "_" characters.
- Error checking for unsupported special symbols in strings, unfinished strings, unsupported
  characters that are not within a comment or a string constant, and unfinished comments.
  Supported special symbols are \n (newline), \t (tab), \" (double quote), \\ (backslash) and \f
  (form feed). All other special symbols are unsupported. An unfinished string is one that
  contains a new line character or a return character. An unfinished comment is one without an
  ending */.
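
The string rules above can be sketched in plain Java. This is only an illustration, not the token definitions used in the lexer itself: the class name StringLiteralCheck, the method check and the result messages are hypothetical, and counting an escape as two characters towards the 5000 limit is an assumption. Only the limit, the forbidden characters and the list of supported special symbols come from the description above.

```java
/** Hypothetical helper that mirrors the string rules described in the report. */
final class StringLiteralCheck {
    static final int MAX_LENGTH = 5000;                 // limit from the report, quotes excluded
    static final String SUPPORTED_ESCAPES = "nt\"\\f";  // \n \t \" \\ \f

    /** Returns "OK" or a short description of the first problem found. */
    static String check(String literal) {
        if (literal.isEmpty() || literal.charAt(0) != '"') {
            return "not a string";
        }
        int i = 1;
        int length = 0; // characters between the quotes, counted as written (assumption)
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '\n' || c == '\r') {
                return "unfinished string"; // raw line break inside the string
            }
            if (c == '"') {
                if (i != literal.length() - 1) {
                    return "text after closing quote";
                }
                return length > MAX_LENGTH ? "string too long" : "OK";
            }
            if (c == '\\') {
                if (i + 1 >= literal.length()) {
                    return "unfinished string";
                }
                if (SUPPORTED_ESCAPES.indexOf(literal.charAt(i + 1)) < 0) {
                    return "unsupported special symbol";
                }
                i += 2;      // skip the backslash and the escaped character
                length += 2;
                continue;
            }
            i++;
            length++;
        }
        return "unfinished string"; // ran out of input before the closing quote
    }
}
```

Calling StringLiteralCheck.check on a source fragment returns "OK" for a well-formed string and one of the messages above otherwise, which corresponds to the error categories listed for the lexer.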

When the lexer reaches the end of file character (EOF), it checks for any errors found.

Parser:

The parser starts from a copy of the Bantam Java grammar rules:

Program -> (Class)+


Class -> class identifier [ extends identifier ] { (Member)* }
Member -> Method | Field
Method -> identifier identifier ( Formal (, Formal)* ) { Stmt RetnStmt }
Field -> identifier identifier [ = Expr ] ;
Formal -> identifier identifier
Stmt -> ExprStmt | DeclStmt | IfStmt | WhileStmt | BlockStmt
RetnStmt -> return [ Expr ] ;
ExprStmt -> Expr ;
DeclStmt -> identifier identifier = Expr ;
IfStmt -> if ( Expr ) Stmt [ else Stmt ]
WhileStmt -> while ( Expr ) Stmt
BlockStmt -> { Stmt }
Expr -> AssignExpr | DispatchExpr | NewExpr | InstanceofExpr | CastExpr | BinaryExpr |
UnaryExpr | ConstExpr | VarExpr | ( Expr )
AssignExpr -> VarExpr = Expr
DispatchExpr -> [ Expr . ] identifier ( Expr (, Expr)* )
NewExpr -> new identifier ( )
InstanceofExpr -> Expr instanceof identifier
CastExpr -> ( identifier ) ( Expr )
BinaryExpr -> BinaryArithExpr | BinaryCompExpr | BinaryLogicExpr
UnaryExpr -> UnaryNegExpr | UnaryNotExpr
ConstExpr -> int const | boolean const | string const
BinaryArithExpr -> Expr + Expr | Expr - Expr | Expr * Expr | Expr / Expr | Expr % Expr
BinaryCompExpr -> Expr == Expr | Expr != Expr | Expr < Expr | Expr <= Expr | Expr > Expr |
Expr >= Expr
BinaryLogicExpr -> Expr && Expr | Expr || Expr
UnaryNegExpr -> - Expr
UnaryNotExpr -> ! Expr
VarExpr -> [ identifier . ] identifier

However, this grammar is left-recursive through DispatchExpr, InstanceofExpr and
BinaryExpr, which a top-down parser cannot handle directly. I first converted DispatchExpr to
the following:

void DispatchExpr() : {}
{
    <ID> <LEFT_BRACKET> Expr() ExtraExprs() <RIGHT_BRACKET>
  | Expr() <MEMBER_REFERENCE> <ID> <LEFT_BRACKET> Expr() ExtraExprs() <RIGHT_BRACKET>
}

I then added an ExprPrime to the end of each alternative of Expr as follows:

void Expr() : {}
{
    AssignExpr() ExprPrime()
  | DispatchExpr() ExprPrime()
  | NewExpr() ExprPrime()
  | CastExpr() ExprPrime()
  | UnaryExpr() ExprPrime()
  | ConstExpr() ExprPrime()
  | VarExpr() ExprPrime()
  | <LEFT_BRACKET> Expr() <RIGHT_BRACKET> ExprPrime()
}

I then moved all left-recursive rules into ExprPrime and converted them as follows:

void ExprPrime() : {}
{
    DispatchExpr() ExprPrime()
  | InstanceofExpr() ExprPrime()
  | BinaryExpr() ExprPrime()
  | {}
}

I also removed the leading Expr from each alternative in BinaryArithExpr, BinaryCompExpr
and BinaryLogicExpr as follows:

void BinaryArithExpr() : {}
{
    <PLUS_SIGN> Expr()
  | <MINUS_SIGN> Expr()
  | <MULTIPLY_SIGN> Expr()
  | <DIVIDE_SIGN> Expr()
  | <MODULO_SIGN> Expr()
}

void BinaryCompExpr() : {}
{
    <EQUALS_COMPARATOR> Expr()
  | <NOT_EQUALS_COMPARATOR> Expr()
  | <LESS_THAN_COMPARATOR> Expr()
  | <LESS_THAN_OR_EQUAL_COMPARATOR> Expr()
  | <GREATER_THAN_COMPARATOR> Expr()
  | <GREATER_THAN_OR_EQUAL_COMPARATOR> Expr()
}

void BinaryLogicExpr() : {}
{
    <AND_BOOLEAN_SIGN> Expr()
  | <OR_BOOLEAN_SIGN> Expr()
}
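
The effect of this transformation can be illustrated with a small hand-written recursive-descent parser of the same shape (Expr -> Primary ExprPrime, ExprPrime -> operator Expr ExprPrime | empty). This is a simplified sketch, not the generated parser: the class ExprSketch, its methods and the use of plain strings as tokens are hypothetical, and primaries are reduced to identifiers, integer constants and bracketed expressions.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical recognizer with the same Expr / ExprPrime shape as the rules above. */
final class ExprSketch {
    private static final Set<String> BINARY_OPS = Set.of(
            "+", "-", "*", "/", "%", "==", "!=", "<", "<=", ">", ">=", "&&", "||");

    private final List<String> tokens;
    private int pos = 0;

    private ExprSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    /** Parses the whole token list and returns the expression with explicit brackets. */
    static String parse(List<String> tokens) {
        ExprSketch parser = new ExprSketch(tokens);
        String result = parser.expr();
        if (parser.pos != tokens.size()) {
            throw new IllegalArgumentException("unexpected token " + parser.peek());
        }
        return result;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "<EOF>";
    }

    // Expr -> Primary ExprPrime
    private String expr() {
        return exprPrime(primary());
    }

    // ExprPrime -> operator Expr ExprPrime | (empty)
    private String exprPrime(String left) {
        if (!BINARY_OPS.contains(peek())) {
            return left; // empty alternative
        }
        String op = tokens.get(pos++);
        String right = expr();
        return exprPrime("(" + left + " " + op + " " + right + ")");
    }

    // Primary -> ( Expr ) | identifier | integer constant
    private String primary() {
        String token = peek();
        if (token.equals("(")) {
            pos++;
            String inner = expr();
            if (!peek().equals(")")) {
                throw new IllegalArgumentException("expected ) but found " + peek());
            }
            pos++;
            return inner;
        }
        if (token.matches("[A-Za-z][A-Za-z0-9_]*|[0-9]+")) {
            pos++;
            return token;
        }
        throw new IllegalArgumentException("unexpected token " + token);
    }
}
```

In this sketch, ExprSketch.parse(List.of("a", "-", "b", "-", "c")) returns "(a - (b - c))": the recursion terminates, but operators nest to the right and no precedence is applied, so a grammar of this shape would need precedence and associativity to be handled separately.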

To remove conflict problems, I have modified several of the rules as follows:


To avoid VarExpr being chosen instead of DispatchExpr, I have removed the
<MEMBER_REFERENCE> option from VarExpr. I believe the grammar rule should be
VarExpr -> [ Expr . ] identifier, as the original rule seems too restrictive, but for simplicity I
have chosen to implement the rule VarExpr -> identifier, as follows:

void VarExpr() : {}
{
    <ID>
}

I have added a lookahead of 2 in Stmt to choose between ExprStmt and DeclStmt and avoid
the conflict on <ID>. I have joined DispatchExpr, VarExpr and AssignExpr to avoid a common
prefix of <ID> as follows:

void DispatchVarOrAssignExpr() : {}
{
    <ID> DispatchVarOrAssignExprPrime()
}

void DispatchVarOrAssignExprPrime() : {}
{
    DispatchExpr()
  | AssignExpr()
  | {} // basic var expression
}
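
Once the shared <ID> has been consumed, a single token is enough to pick the alternative. The following sketch shows that decision in plain Java. The class IdPrefixSketch, both enums and the token name ASSIGN are hypothetical and only stand in for the real token kinds; LEFT_BRACKET echoes the name used in the rules above.

```java
/** Hypothetical illustration of the choice made after the common <ID> prefix. */
final class IdPrefixSketch {
    /** Stand-in token kinds; ASSIGN and OTHER are assumed names. */
    enum Tok { ID, LEFT_BRACKET, ASSIGN, OTHER }

    /** The three expression forms that were joined into one rule. */
    enum Kind { DISPATCH, ASSIGN, VAR }

    /** Chooses the alternative from the token that follows the identifier. */
    static Kind afterId(Tok next) {
        switch (next) {
            case LEFT_BRACKET:
                return Kind.DISPATCH; // <ID> ( ... )
            case ASSIGN:
                return Kind.ASSIGN;   // <ID> = Expr
            default:
                return Kind.VAR;      // empty alternative: basic var expression
        }
    }
}
```

Because the three alternatives now start with different tokens (or with nothing), no extra lookahead is needed at this point in the sketch.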

I have added a lookahead of 2 in CastOrBracketExpr to avoid the conflict with <ID> ")". I have
also joined CastExpr and BracketExpr to avoid a common prefix of <LEFT_BRACKET> as follows:

void CastOrBracketExpr() : {}
{
    <LEFT_BRACKET> CastOrBracketExprPrime()
}

void CastOrBracketExprPrime() : {}
{
    CastExpr()
  | Expr() <RIGHT_BRACKET> // bracket expression
}
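
The two places where a lookahead of 2 is used (Stmt and CastOrBracketExpr) both come down to inspecting a pair of tokens before committing to an alternative. The sketch below shows that check in plain Java. It is an assumption about how the decision could be expressed, not the lookahead specification from the parser: the class LookaheadTwoSketch, its methods and most token names are hypothetical.

```java
import java.util.List;

/** Hypothetical illustration of the two-token lookahead decisions. */
final class LookaheadTwoSketch {
    /** Stand-in token kinds; only ID and the bracket names echo the rules above. */
    enum Tok { ID, LEFT_BRACKET, RIGHT_BRACKET, ASSIGN, OTHER }

    /** Returns the token kind at index i, or OTHER past the end of the input. */
    private static Tok at(List<Tok> tokens, int i) {
        return i < tokens.size() ? tokens.get(i) : Tok.OTHER;
    }

    /** Stmt: two identifiers in a row (type, then name) indicate a DeclStmt. */
    static boolean startsDeclStmt(List<Tok> tokens, int pos) {
        return at(tokens, pos) == Tok.ID && at(tokens, pos + 1) == Tok.ID;
    }

    /** After "(" has been consumed: an identifier followed by ")" selects CastExpr. */
    static boolean startsCast(List<Tok> tokens, int pos) {
        return at(tokens, pos) == Tok.ID && at(tokens, pos + 1) == Tok.RIGHT_BRACKET;
    }
}
```

Under this assumption a bracketed variable such as (x) is also classified as a cast, since it starts with the same two tokens; checking a third token for the "(" that opens the cast operand would separate the two cases.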

I then added code to return an Abstract Syntax Tree (AST), which is documented in
bantamjava/api/html/index.html. Please note that the default parent node for Class is "Object" and
the default reference expression for DispatchExpr is "this".

In summary, the problems with my parser are that it uses a lookahead of 2 in Stmt and
CastOrBracketExpr, and that it does not implement the <ID> "." <ID> option of VarExpr, as it
conflicts with the <ID> "." <ID> "(" form of DispatchExpr.
