You are on page 1of 6

Lex Overview Usage Paradigm of Lex

• Lex is a tool for creating lexical analyzers.

• Lexical analyzers tokenize input streams.

• Tokens are the terminals of a language.

• Regular expressions define tokens .

To Use Lex and Yacc Together Lex Internals Mechanism
Lex source Yacc source
(Lexical Rules) (Grammar Rules) • Converts regular expressions into DFAs.

• DFAs are implemented as table driven
Lex Yacc

state machines.
lex.yy.c y.tab.c
call
Input yylex() yyparse() Parsed Input

return token

lex.yy.c : What it produces Running Lex
• To run lex on a source file, use the
command:
lex source.l
• This produces the file lex.yy.c which is the
C source for the lexical analyzer.
• To compile this, use:
cc -o prog -O lex.yy.c -ll

1

Any source not intercepted • Three parts are separated by %% by Lex is copied into the generated • Tips: The first part defines patterns. – Anything included between lines containing only %{ and %} – Anything after the second %% delimiter Regular Policy of Translated Position of Copied Source Source • source input prior to the first %% • Various variables or tables whose name – external to any function in the generated code prefixed by yy • after the first %% and prior to the second – yyleng. yysvec[]. the third part defines actions. yylex()… actions • Various definition whose name are capital • after the second %% – BEGIN. and Win32 version • lex & yacc . INITIAL… – after the Lex generated output 2 .Versions and Reference Books General Format of Lex Source • AT&T lex. GNU flex. yywarp(). then we do some action”.Levine. yywork[] %% • Various functions whose name prefixed by – appropriate place for declarations in the yy function generated by Lex which contains the – yyless(). by Jeffrey E. O’Reilly • Mastering Regular Expressions. Friedl.2/e by John R. program. O’Reilly • Input specification file is in 3 parts Regular Policy of None- – Declarations: Definitions translated Source – Rules: Token Descriptions and actions • Remember that Lex is turning the rules – Auxiliary Procedures: User-Written code into a program.F. Tony Mason & Doug Brown. the – Any line which is not part of a Lex rule or second part puts together to express action which begins with a blank or tab “If we see some pattern. yymore().

the default output of default main() is ll will link a default main() which calls stdout./a. char *argv[]) { } /* ([ab]*aa[ab]*bb[ab]*)|([ab]*bb[ab]*aa[ab]*) { * call yylex to use the generated lexer /* match all strings over */ * (a. bin_digit [01] It deletes from the input all blanks or tabs at the %% ends of lines.b) containing aa and bb yylex(). default input of default main() is stdin and • If the third part dose not contain a main(). * Example lex source file • A trivial program to deletes three spacing * This first section contains necessary characters: * C declarations and includes %% * to use throughout the lex specifications. the third part and the with the input and the output of yylex(). int main(int argc. the second %% are optional. yytext). * make sure everything was printed } */ \n . Default Rules and Actions Default Input and Output • The first and second part must exist. */ • Another trivial example: #include <stdio. yylex() then exits. – stdin usually is to be keyboard input • Unmatched patterns will perform a default stdout usually is to be screen output action. . */ /* printf("AABB\n").out < inputfile > outputfile to the output Some Simple Lex Source A General Lex Source Example Examples • A minimum lex program: %{ %% /* It only copies the input to the output unchanged. /* ignore newlines */ fflush(yyout). [ \t\n]. } 3 .h> %% %} [ \t]+$. exit(0). {bin_digit}* { %% /* match all strings of 0's and 1's */ /* /* Print out message with matching * Now this is where you want your main * text program */ */ printf("BINARY: %s\n". but • If you don’t write your own main() to deal may be empty. which consists of copying the input – cs20: %.

double quote • a{1. • NAME REG_EXPR – These usually go at the start of this section. • NAME must be a valid C identifier • %{ • {NAME} is replaced by prior REG_EXPR • int linecount = 1."{digs} – Includes usually go here. – expreal {digs}".3} is a or aa or aaa – \\ -. – digs [0-9]+ marked by %{ at the beginning and %} at the end – integer {digs} or the line which begins with a blank or tab .matches only if followed by right part of optional / • ab? == a|ab • 0/1 means the 0 of 01 but not 02 or 03 or … • (ab)? == ab|ε – ( ) -. – plainreal {digs}".backspace – {n. $ ^ [ ] .this means the preceding was – / -..? * + | ( ) / { } % < > – + -.is whitespace • (except CR).means at the beginning of the line (except newline) (unless it is inside of a [ ]) – “ and \ -. same as /\n – \t -.grouping Definitions • The definitions can also contain variables and other declarations used by the Code generated by Lex.means anything except – \n -.) ( Extended Regular Expression ) – NOTE: . • %} 4 .Positive Closure – concatenation (put characters together) – alternation (a|b|c) • Examples: • [ab] == a|b – [0-9]+". Token Definitions • Elementary Operations (cont.m} – m through n occurrences – \" -.|i|j|k • note: without the quotes it could be any character • [a-z0-9] == any letter or digit • [^a] == any character but a – [ \t]+ -..newline • \"[^\"]*\" is a double quoted string – \b -.Kleene Closure – single characters • except “ \ .) –."[0-9]+ • [a-k] == a|b|c|.\ – {definition} – translation from definition –? -.quote the part as text – $ means at the end of the line."{digs}[Ee][+-]?{digs} – It is usually convenient to maintain a line counter so that error messages can be keyed to the lines – real {plainreal}|{expreal} in which the errors are found. • There is a blank space character before the \t • Special Characters: • Special Characters (cont.tab – [^ ] -. -. matches any character except the newline • Elementary Operations – * -.matches any single character –^ -.

Lex Special Variables Lex library function calls • identifiers used by Lex and Yacc begin with • yylex() yy – default main() contains a return yylex(). • Four special options: – {integer} { | ECHO. BEGIN. – the next input expression recognized is to be tacked • } on to the end of this input – C++ Comments -. &yylval). • Transition rules can be classified into • To overide the choice."%d". • //. • A null statement . • printf("I found an integer\n"). Transition Rules Tokens and Actions • ERE <one or more blanks> { program statement • Example: program – {real} return FLOAT.a string containing the lexeme • yywarp() – yyleng -. REJECT... which will be match ex: she {s++. statement } – begin return BEGIN.. REJECT. that ECHO from the input to the output • } • | indicates that the action for this rule is from the action for the next rule Ambiguous Source Rules Multiple States • lex allows the user to explicitly declare • If 2 rules match the same pattern. will ignore the input – {newline} linecount++.// .* . %s COMMENT • Lex always chooses the longest matching • Default states is INITIAL or 0 substring for its tokens. • The unmatched token is using a default action • return INTEGER. • yymore() • return INTEGER.} depend on states he {h++. | \n .} • BEGIN is used to change state . use action REJECT different states. Lex will multiple states ( in Definitions section ) use the first rule.. – yytext -.the length of the lexeme – called by lexical analyzer if end of the input file – yyin – the input stream pointer – default yywarp() always return 1 • Example: • yyless(n) – {integer} { – n characters in yytext are retained • printf("I found an integer\n"). • sscanf(yytext. REJECT. 5 .

i.c” in the last section of yacc input. actions are complicated enough that it is %% yywrap() better to describe them with a function call.yy. %% • The actions associated with any given [a-z]+ lengs[yyleng]++. • Definitions of this sort go in the last section printf("Length No. words\n"). i++) if (lengs[i] > 0) printf("%5d%10d\n". int i. token are normally specified using . where the appropriate token value is returned. for(i=0. • An easy way is placing the line: #include “lex. of the Lex input. } More Example 2 Using yacc with lex • yacc will call yylex() to get the token from the input so that each lex rule should end with: return(token). User Written Code More Example 1 int lengs[100]. return(1). 6 . But occasionally the \n . { and define the function elsewhere.| statements in C. i<100.lengs[i]).