
Lex and yacc

Lex
 The main job of a lexical analyzer (scanner) is to break up an input stream into
more usable elements (tokens)
 For example, the statement a = b + c * d; is broken into the tokens
 ID ASSIGN ID PLUS ID MULT ID SEMI (as sketched below)
 Lex is a tool for writing lexical analyzers.
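 A sketch of the kind of Lex rules that could produce these tokens (the token names ID, ASSIGN, PLUS, MULT and SEMI are assumed to be integer codes defined elsewhere, e.g. by the parser):

[a-zA-Z_][a-zA-Z0-9_]*   { return ID; /* identifiers such as a, b, c, d */ }
"="                      { return ASSIGN; }
"+"                      { return PLUS; }
"*"                      { return MULT; }
";"                      { return SEMI; }
[ \t\n]+                 { /* skip whitespace */ }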
Lex Source Program

 A Lex source program is a table of
– regular expressions and
– corresponding program fragments
Lex Source to C Program

 The table is translated to a C program (lex.yy.c) which
– reads an input stream,
– partitions the input into strings which match the given expressions, and
– copies them to an output stream if necessary
An Overview of Lex

 An input file, which we call lex.l, is written in the Lex language and describes the lexical
analyzer to be generated.
 The Lex compiler transforms lex.l to a C program, in a file that is always named lex.yy.c.
 The latter file is compiled by the C compiler into a file called a.out, as always.
 The C-compiler output is a working lexical analyzer that can take a stream of input
characters and produce a stream of tokens.
 It is a C function that returns an integer, which is a code for one of the possible token
names.
 The attribute value, whether it be another numeric code, a pointer to the symbol table, or
nothing, is placed in a global variable yylval, which is shared between the lexical
analyzer and parser, thereby making it simple to return both the name and an attribute
value of a token.
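 For example, a rule for numbers could return a token code and leave the attribute value in yylval (a sketch; the token name NUM and an integer-valued yylval are assumptions):

[0-9]+   { yylval = atoi(yytext);   /* attribute value, shared with the parser */
           return NUM;              /* token name: an integer code */ }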
Structure of Lex Programs

 A Lex program has the following form
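The standard layout (the part names here are descriptive; the parts are separated by %% lines):

declarations
%%
translation rules
%%
auxiliary functions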


Structure of Lex Programs

 The declarations section includes declarations of variables, manifest constants
(identifiers declared to stand for a constant, e.g., the name of a token), and regular
definitions
 The translation rules each have the form
Pattern { Action }
 Each pattern is a regular expression, which may use the regular definitions of
the declaration section.
 The actions are fragments of code, typically written in C.
 The third section holds whatever additional functions are used in the actions.
 Alternatively, these functions can be compiled separately and loaded with the
lexical analyzer.
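 Putting the three sections together, a minimal sketch of a complete Lex program (an illustrative word-and-line counter; %option noyywrap is a flex convenience, with classic lex you would link the lex library instead):

%{
/* declarations section: C code, manifest constants, variables */
#include <stdio.h>
int words = 0, lines = 0;
%}
%option noyywrap
letter   [a-zA-Z]
%%
{letter}+   { words++;   /* translation rules: pattern followed by an action */ }
\n          { lines++; }
.           { /* ignore any other character */ }
%%
/* auxiliary functions used by the rules */
int main(void) {
    yylex();
    printf("%d words, %d lines\n", words, lines);
    return 0;
}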
Lex Regular Expressions

 A regular expression matches a set of strings


 Regular expression
– Operators
– Character classes
– Arbitrary character
– Optional expressions
– Alternation and grouping
– Context sensitivity
– Repetitions and definitions
Operators

 “\[]^-?.*+|()$/{}%<>
 If they are to be used as text characters, an escape should be used, e.g.
\$ = “$” and \\ = “\”
 Every character but blank, tab (\t), newline (\n) and the list above is always a text
character
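 For example, a rule that matches a dollar amount must escape or quote the operator characters $ and . (the action is illustrative):

\$[0-9]+"."[0-9]+    { printf("amount: %s\n", yytext); }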
Designing patterns

 . A dot will match any single character except a newline.


 *, + Star and plus match zero or more / one or more copies of the preceding expression, respectively.
 ? Matches zero or one copy of the preceding expression.
 | A logical ‘ or ’ statement - matches either the pattern before it, or the pattern
after.
 ^ Matches the very beginning of a line.
 $ Matches the end of a line.
 / Matches the preceding regular expression, but only if followed by the
subsequent expression.
Designing patterns

 [ ] Brackets are used to denote a character class, which matches any single
character within the brackets. If the first character is a ‘^’, this negates the
brackets causing them to match any character except those listed. The ‘-’ can be
used in a set of brackets to denote a range.
 “ ” Match everything within the quotes literally - don’t use any special meanings
for characters.
 ( ) Group everything in the parentheses as a single unit for the rest of the
expression.
Character Classes []

 [moc] matches a single character, which may be m, o, or c


 Inside brackets, every operator loses its special meaning except \ , - and ^
 e.g. [ab] => a or b
 [a-z] => a or b or c or … or z
 [-+0-9] => all the digits and the two signs
 [^x] => any character but x
Optional & Repeated Expressions

 a? => zero or one instance of a


 a* => zero or more instances of a
 a+ => one or more instances of a
 E.g. ab?c => ac or abc
 [a-z]+ => all strings of lower case letters
 [a-zA-Z][a-zA-Z0-9]* => all alphanumeric strings with a leading alphabetic
character
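 As rules, these repetition operators might look like this (a sketch; the printf actions are only for illustration):

[a-zA-Z][a-zA-Z0-9]*   { printf("identifier: %s\n", yytext); }
[0-9]+                 { printf("integer: %s\n", yytext); }
[0-9]+"."[0-9]*        { printf("real: %s\n", yytext); }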
Designing patterns
x{m,n} m through n occurrences of x
xx|yy Either xx or yy
x| The action on x is the action for the next rule

(x) x
x/y x but only if followed by y
{xx} The translation of xx from the definitions
section
x$ x at the end of a line
^x x at the beginning of a line
[ \t] matches either a space or tab character
[^a-d] matches any character other than a, b, c and d
Precedence of Operators

 Levels of precedence (highest to lowest):
 Kleene closure (*), ?, +
 concatenation
 alternation (|)
 All operators are left associative.
 Ex: a*b|cd* = ((a*)b)|(c(d*))
Pattern Matching Primitives
Lex Predefined Variables

 yytext – pointer to the matched string
 yyleng – the length of the matched string
 yyin – the input stream pointer; the default input of the default main() is stdin
 yyout – the output stream pointer; the default output of the default main() is stdout.
E.g. [a-z]+ printf(“%s”, yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
Input that matches no pattern is handled by a default action, which ECHOes it from the
input to the output.
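 A sketch of using these variables together with a hand-written main() that redirects yyin to a file (the file handling is illustrative; %option noyywrap is a flex convenience):

%option noyywrap
%%
[a-zA-Z]+   { printf("word: %s (%d chars)\n", yytext, yyleng); }
.|\n        ;
%%
int main(int argc, char **argv) {
    if (argc > 1) {
        yyin = fopen(argv[1], "r");   /* default is stdin */
        if (!yyin) { perror(argv[1]); return 1; }
    }
    yylex();
    return 0;
}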
Lex Library Routines

 yylex() – the default main() contains a call to yylex()
 yymore() – append the next match to the current yytext instead of replacing it
 yyless(n) – return all but the first n characters of the current match to the input; the
first n characters are retained in yytext
 yywrap() – is called whenever Lex reaches an end-of-file – the default yywrap()
always returns 1
 You can use your Lex routines in the same ways you use routines in other
programming languages.
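 A sketch of a driver that replaces the default main() and calls yylex() in a loop (the token codes are illustrative; yylex() returns 0 at end of input):

%{
enum { NUMBER = 258, WORD };   /* illustrative token codes */
%}
%option noyywrap
%%
[0-9]+      { return NUMBER; }
[a-zA-Z]+   { return WORD; }
[ \t\n]+    ;
%%
int main(void) {
    int tok;
    while ((tok = yylex()) != 0)
        printf("token %d: %s\n", tok, yytext);
    return 0;
}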
Conflict Resolution in Lex

 We have alluded to the two rules that Lex uses to decide on the proper lexeme to
select, when several prefixes of the input match one or more patterns:
1. Always prefer a longer prefix to a shorter prefix.
2. If the longest possible prefix matches two or more patterns, prefer the pattern listed first
in the Lex program.
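 For example (an illustrative fragment; the token names are assumed to be defined elsewhere): on the input <= the second rule wins because it matches a longer prefix, and on the input if the keyword rule wins over the identifier rule because it is listed first:

"<"                    { return LT; }
"<="                   { return LE;  /* the longer match wins for <= */ }
if                     { return IF;  /* listed before ID, so if is treated as a keyword */ }
[a-zA-Z][a-zA-Z0-9]*   { return ID; }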
 To run Lex on a source file, e.g. test1.l, type
flex test1.l
 It produces a file named lex.yy.c which is a C program for the lexical analyzer.
 To compile lex.yy.c, type gcc lex.yy.c -o test1.exe
 To run the lexical analyzer program, type test1.exe
Yacc - Yet Another Compiler Compiler
What is YACC ?

 A tool which will produce a parser for a given grammar.
 YACC (Yet Another Compiler Compiler) is a program designed to compile a
LALR(1) grammar and to produce the source code of the syntactic analyzer of the
language produced by this grammar.
How YACC Works?
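 In outline (a sketch of the usual build flow; the file names calc.y and calc.l are illustrative): YACC turns the grammar into a C parser containing yyparse(), which calls yylex() for its tokens:

yacc -d calc.y                (produces y.tab.c, the parser, and y.tab.h, the token codes)
lex calc.l                    (produces lex.yy.c, the scanner)
cc y.tab.c lex.yy.c -o calc   (you supply main() and yyerror(), or link -ly -ll)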
A YACC File Example
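 A minimal sketch of a YACC input file (illustrative; it assumes a scanner that returns the token NUM with its value in yylval and passes single characters such as '+' through):

%{
#include <stdio.h>
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
int yylex(void);
%}
%token NUM
%%
line  : expr '\n'        { printf("= %d\n", $1); }
      ;
expr  : expr '+' term    { $$ = $1 + $3; }
      | term
      ;
term  : term '*' factor  { $$ = $1 * $3; }
      | factor
      ;
factor: '(' expr ')'     { $$ = $2; }
      | NUM
      ;
%%
int main(void) { return yyparse(); }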
YACC File Format
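 The general layout mirrors that of a Lex file (the standard yacc layout; the part names here are descriptive):

%{
C declarations (#includes, types, variables)
%}
yacc declarations (%token, %start, %union, precedence)
%%
grammar rules and actions
%%
additional C code (e.g. main(), yyerror(), or a hand-written yylex())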
Definitions Section
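 Typical contents of the definitions section (an illustrative fragment):

%{
#include <stdio.h>      /* C code copied verbatim into the parser */
%}
%token NUM ID           /* token names shared with the scanner via y.tab.h */
%left '+' '-'           /* operator precedence and associativity */
%left '*' '/'
%start program          /* optional: override the default start symbol */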
Start Symbol

 The first non-terminal specified in the grammar specification section.


 To override it, use the %start declaration:
%start non-terminal
Rules Section

 This section defines the grammar.
 Example
expr : expr '+' term | term;
term : term '*' factor | factor;
factor : '(' expr ')' | ID | NUM;
The Position of Rules
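 Within the rules, an action is normally placed at the end of a production; $$ denotes the value of the left-hand side and $1, $2, ... the values of the right-hand-side symbols (a sketch based on the grammar above):

expr : expr '+' term   { $$ = $1 + $3; }   /* action at the end of the production */
     | term            { $$ = $1; }        /* explicit copy; this is also the default */
     ;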
