
Lex And Yacc

Roll No: 23    Group: 5    Name: Gavin Fernandes

Lex is officially known as a "Lexical Analyser". Its main job is to break an input stream up into more usable elements or, in other words, to identify the "interesting bits" in a text file. For example, if you are writing a compiler for the C programming language, the symbols { } ( ) ; all have significance on their own. The letter a usually appears as part of a keyword or variable name, and is not interesting on its own; instead, we are interested in the whole word. Spaces and newlines are completely uninteresting, and we want to ignore them entirely, unless they appear within quotes "like this". All of these things are handled by the Lexical Analyser.

Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine. Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream.

2. Use of Lex in syntactic analysis

The purpose of lexical analysis is to transform a series of symbols into tokens (a token may be a number, a "+" sign, a reserved word of the language, etc.). Once this transformation is done, the syntactic analyser will be able to do its job (see below). So, the aim of the lexical analyser is to consume symbols and to pass tokens back to the syntactic analyser. A Lex description file can be divided into three parts, using the following plan:

declarations
%%
productions
%%
additional code

in which no part is required. However, the first %% is required, in order to mark the separation between the declarations and the productions.
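To make that layout concrete, here is a minimal but complete Lex file sketch (the rule shown is illustrative, not taken from the original text):

```lex
%{
/* declarations: C code copied verbatim to the top of the generated file */
#include <stdio.h>
%}
%%
[0-9]+    printf("number: %s\n", yytext);   /* a production: pattern + action */
%%
/* additional code: a default-style main driving the analyser */
int main(void) { yylex(); return 0; }
```

Each of the three parts is explained in turn below.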

2.1. First part of a Lex file: declarations

This part of a Lex file may contain:

Code written in the target language (usually C or C++), enclosed between %{ and %}, which will be placed at the top of the file that Lex will create. That is the place where we usually put the include files. Lex will put "as is" all the content enclosed between these signs into the target file. The two signs have to be placed at the beginning of a line.

Regular expressions, defining non-terminal notions, such as letters, digits, and numbers. These specifications have the form:

notion regular_expression

You will be able to use the notions defined this way in the rest of the first part of the file, and in the second part of the file, if you enclose them between { and }.

%{
#include "calc.h"
#include <stdio.h>
#include <stdlib.h>
%}
/* Regular expressions */
/* ------------------- */
white    [\t\n ]+
letter   [A-Za-z]
digit10  [0-9]          /* base 10 */
digit16  [0-9A-Fa-f]    /* base 16 */

int10    {digit10}+

The example by itself is, I hope, easy to understand, but let's have a deeper look into regular expressions.

2.2. Regular expressions

Symbol         Meaning

x              the "x" character
.              any character except \n
[xyz]          either x, y or z
[^bz]          any character, except b and z
[a-z]          any character between a and z
[^a-z]         any character, except those between a and z
R*             zero R or more; R can be any regular expression
R+             one R or more
R?             one or zero R (that is, an optional R)
R{2,5}         2 to 5 R
R{2,}          2 R or more
R{2}           exactly 2 R
"[xyz\"foo"    the string "[xyz"foo"
{NOTION}       expansion of NOTION, that has been defined above in the file
\X             if X is "a", "b", "f", "n", "r", "t", or "v", this represents the ANSI-C interpretation of \X
\0             ASCII 0 character
\123           the character whose ASCII code is 123, in octal
\x2A           the character whose ASCII code is 2A, in hexadecimal
RS             R followed by S
R|S            R or S
R/S            R, only if followed by S
^R             R, only at the beginning of a line
R$             R, only at the end of a line
<<EOF>>        end of file

So the definition:

identifier {letter}(_|{letter}|{digit10})*

will match as identifiers the words "integer", "a_variable", "a1", but not "_ident" nor "1variable". Easy, isn't it? As a last example, this is the definition of a real number:
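The definition announced here is missing from the text; a plausible reconstruction, built from the notions defined above (my sketch, not the original's), could be:

```lex
real    {int10}"."{int10}((e|E)("+"|"-")?{int10})?
```

This would match "3.14" and "1.0e-5", but not "3." nor ".14".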

2.3. Second part of a Lex file: productions

This part is aimed at instructing Lex what to do, in the generated analyser, when it encounters one notion or another. It may contain:

Some specifications, written in the target language (usually C or C++), enclosed between %{ and %} (at the beginning of a line). These specifications will be put at the beginning of the yylex() function, which is the function that consumes the tokens and returns an integer.

Productions, having the syntax:

regular_expression action

If the action is missing, Lex will put the matching characters as is into the standard output. If an action is specified, it has to be written in the target language. If it contains more than one instruction or is written on more than one line, you will have to enclose it between { and }.

You should also note that comments such as /* ... */ can be present in the second part of a Lex file only if enclosed between braces, in the action part of the statements. Otherwise, Lex would consider them as regular expressions or actions, which would give errors or, at least, a weird behaviour. Finally, the yytext variable used in the actions contains the characters accepted by the regular expression. It is a char array, of length yyleng (i.e., char yytext[yyleng]).

Example:

%%
[ \t]+$    ;
[ \t]+     printf(" ");

This little Lex file will generate a program that suppresses the space characters that are not useful. You can also notice with this little program that Lex is not restricted to interpreters or compilers, and can be used, for example, for searches and replaces, etc.
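When Lex feeds a parser instead of rewriting text, the actions typically return token codes and store the token's value in yylval; a sketch, assuming the notions int10 and identifier defined earlier, and assuming NUMBER and IDENT are token codes declared elsewhere (usually in a header generated by Yacc):

```lex
{int10}       { yylval = atoi(yytext); return NUMBER; }
{identifier}  { return IDENT; }
[ \t\n]+      ;   /* skip whitespace: empty action, nothing is echoed */
```

Each return hands one token back to the caller of yylex(), which is how the lexical and syntactic analysers cooperate.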

2.4. Third part: additional code

You can put in this optional part all the code you want. If you don't put anything here, Lex will consider that it is just:

main() {
    yylex();
}

Yacc is officially known as a "parser". Its job is to analyse the structure of the input stream, and operate on the "big picture". In the course of its normal work, the parser also verifies that the input is syntactically sound. Consider again the example of a C compiler. In the C language, a word can be a function name or a variable, depending on whether it is followed by a ( or a =. There should be exactly one } for each { in the program. YACC stands for "Yet Another Compiler Compiler". This is because this kind of analysis of text files is normally associated with writing compilers. However, as we will see, it can be applied to almost any situation where text-based input is being used. For example, a C program may contain something like:
{
    int int;
    int = 33;
    printf("int: %d\n", int);
}

In this case, the lexical analyser would have broken the input stream into a series of "tokens", like this:
{ int int ; int = 33 ; printf ( "int: %d\n" , int ) ; }

Note that the lexical analyser has already determined that where the keyword int appears within quotes, it is really just part of a literal string. It is up to the parser to decide if the token int is being used as a keyword or a variable. Or it may choose to reject the use of the name int as a variable name. The parser also ensures that each statement ends with a ; and that the brackets balance.
Computer program input generally has some structure; in fact, every computer program that does input can be thought of as defining an ``input language'' which it accepts. An input language may be as complex as a programming language, or as simple as a sequence of numbers. Unfortunately, usual input facilities are limited, difficult to use, and often are lax about checking their inputs for validity. Yacc provides a general tool for describing the input to a computer program. The Yacc user specifies the structures of his input, together with code to be invoked as each such structure is recognized. Yacc turns such a specification into a subroutine that handles the input process; frequently, it is convenient and appropriate to have most of the flow of control in the user's application handled by this subroutine.

3. Syntactic analysis with Yacc

Yacc (Yet Another Compiler Compiler) is a program designed to compile a LALR(1) grammar and to produce the source code of the syntactic analyser of the language produced by this grammar. It is also possible to make it perform semantic actions. As for a Lex file, a Yacc file can be divided into three parts:

declarations
%%
productions
%%
additional code

and only the first %% and the second part are mandatory.

3.1. The first part of a Yacc file

The first part of a Yacc file may contain:

Specifications written in the target language, enclosed between %{ and %} (each symbol at the beginning of a line), that will be put at the top of the parser generated by Yacc.

Declaration of the tokens that can be encountered, with the %token keyword.

The type of the terminals, using the reserved word %union.

Information about operators' priority or associativity.

The axiom of the grammar, using the reserved word %start (if not specified, the axiom is the first production of the second part of the file).

The yylval variable, implicitly declared of the %union type, is really important in the file, since it is the variable that contains the description of the last token read.
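A sketch of such a first part (the token names and value types here are illustrative assumptions, not from the original text):

```yacc
%{
#include <stdio.h>
%}
%union {
    int   ival;   /* value of a number token */
    char *sval;   /* text of an identifier */
}
%token <ival> NUMBER
%token <sval> IDENT
%left '+' '-'     /* lower priority, left-associative */
%left '*' '/'     /* higher priority */
%start program
```

The %left lines both declare the operator tokens and rank their priority: the later the line, the tighter the operator binds.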

3.2. Second part of a Yacc file

This part cannot be empty. It may contain:

Declarations and/or definitions enclosed between %{ and %}.

Productions of the language's grammar.

These productions look like:

nonterminal_notion:
      body_1    { semantical_action_1 }
    | body_2    { semantical_action_2 }
    | ...
    | body_n    { semantical_action_n }
    ;

where each body_i may be made of terminal or nonterminal notions of the language. And finally...
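For instance, a classic arithmetic-expression production with semantic actions might read as follows (a sketch; it assumes a NUMBER token carrying an integer value and a matching %type declaration for expr in the first part). The $$ pseudo-variable denotes the value of the left-hand side, and $i the value of the i-th symbol of the body:

```yacc
expr: expr '+' expr   { $$ = $1 + $3; }
    | expr '*' expr   { $$ = $1 * $3; }
    | NUMBER          { $$ = $1; }
    ;
```

Here $2 would be the operator character itself, which carries no value, so only $1 and $3 are used.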

3.3. Third part of a Yacc file

This part contains the additional code. It must contain a main() function (that should call the yyparse() function), and a yyerror(char *message) function, which is called when a syntax error is found.
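A minimal third part meeting those requirements could look like this (a sketch, not the original's code; yyparse() is supplied by the code Yacc generates from the first two parts):

```c
#include <stdio.h>

int yyerror(char *message)
{
    fprintf(stderr, "syntax error: %s\n", message);
    return 0;
}

int main(void)
{
    return yyparse();   /* returns 0 if the input is syntactically correct */
}
```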