Lex - A Lexical Analyzer Generator

Lex helps write programs whose control flow is directed by instances of regular expressions
in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine. Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream. The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.

Introduction:

Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character string matching, and produces a program in a general purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source specifications given to Lex. The Lex-written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings, program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed. The user supplies the additional code beyond expression matching needed to complete his tasks, possibly including code written by other generators.
The program that recognizes the expressions is generated in the general purpose programming language employed for the user's program fragments. Thus, a high level expression language is provided to write the string expressions to be matched while the user's freedom to write actions is unimpaired. This avoids forcing the user who wishes to use a string manipulation language for input analysis to write processing programs in the same and often inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called ``host languages.'' Just as general purpose languages can produce code to run on different computer hardware, Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user. Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user's background, and the properties of local implementations.

Lex turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected. See Figure 1.

                +-------+
      Source -> |  Lex  | -> yylex
                +-------+

                +-------+
      Input ->  | yylex | -> Output
                +-------+

            An overview of Lex
                 Figure 1

For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.

    %%
    [ \t]+$   ;

is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate the character class made of blank and tab; the + indicates ``one or more ...''; and the $ indicates ``end of line,'' as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To change any remaining string of blanks or tabs to a single blank, add another rule:

    %%
    [ \t]+$   ;
    [ \t]+    printf(" ");

The finite automaton generated for this source will scan for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or tabs.

Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc [3].
Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context free grammars, but require a lower level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces. The flow of control in such a case (which might be the first half of a compiler, for example) is shown in Figure 2. Additional programs, written by other generators or by hand, can be added easily to programs written by Lex.

                 lexical        grammar
                  rules          rules
                    |              |
                    v              v
               +---------+    +---------+
               |   Lex   |    |  Yacc   |
               +---------+    +---------+
                    |              |
                    v              v
               +---------+    +---------+
      Input -> |  yylex  | -> | yyparse | -> Parsed input
               +---------+    +---------+

                      Lex with Yacc
                        Figure 2

Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so that the use of this name by Lex simplifies interfacing.

Lex generates a deterministic finite automaton from the regular expressions in the source [4]. The automaton is interpreted, rather than compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of rules is the size of the finite automaton, and therefore the size of the program generated by Lex.

In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine.

Lex is not limited to source which can be interpreted on the basis of one character lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before "cd. . .". Such backup is more costly than the processing of simpler languages.

Lex Source:

The general format of Lex source is:

    {definitions}
    %%
    {rules}
    %%
    {user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is thus

    %%

(no definitions, no rules) which translates into a program which copies the input to the output unchanged.

In the outline of Lex programs shown above, the rules represent the user's control decisions; they are a table, in which the left column contains regular expressions (see section 3) and the right column contains actions, program fragments to be executed when the expressions are recognized. Thus an individual rule might appear

    integer   printf("found keyword INT");

to look for the string integer in the input stream and print the message ``found keyword INT'' whenever it appears. In this example the host procedural language is C and the C library function printf is used to print the string. The end of the expression is indicated by the first blank or tab character. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling. Lex rules such as

    colour      printf("color");
    mechanise   printf("mechanize");
    petrol      printf("gas");

would be a start. These rules are not quite enough, since the word petroleum would become gaseum; a way of dealing with this will be described later.

Lex Regular Expressions: The definitions of regular expressions are very similar to those in QED [5]. A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters; thus the regular expression integer matches the string integer wherever it appears and the expression a57D looks for the string a57D.

Operators. The operator characters are

    " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should be used. The quotation mark operator (") indicates that whatever is contained between a pair of quotes is to be taken as text characters. Thus

    xyz"++"

matches the string xyz++ when it appears. Note that a part of a string may be quoted. It is harmless but unnecessary to quote an ordinary text character; the expression

    "xyz++"

is the same as the one above. Thus by quoting every non-alphanumeric character being used as a text character, the user can avoid remembering the list above of current operator characters, and is safe should further extensions to Lex lengthen the list.

An operator character may also be turned into a text character by preceding it with \ as in

    xyz\+\+

which is another, less readable, equivalent of the above expressions. Another use of the quoting mechanism is to get a blank into an expression; normally, as explained above, blanks or tabs end a rule. Any blank character not contained within [] (see below) must be quoted. Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is backspace. To enter \ itself, use \\. Since newline is illegal in an expression, \n must be used; it is not required to escape tab and backspace. Every character but blank, tab, newline and the list above is always a text character.

Character classes. Classes of characters can be specified using the operator pair []. The construction [abc] matches a single character, which may be a, b, or c. Within square brackets, most operator meanings are ignored. Only three characters are special: these are \ - and ^. The - character indicates ranges. For example,

    [a-z0-9<>_]

indicates the character class containing all the lower case letters, the digits, the angle brackets, and underline. Ranges may be given in either order. Using - between any pair of characters which are not both upper case letters, both lower case letters, or both digits is implementation dependent and will get a warning message. (E.g., [0-z] in ASCII is many more characters than it is in EBCDIC). If it is desired to include the character - in a character class, it should be first or last; thus

    [-+0-9]

matches all the digits and the two signs.

In character classes, the ^ operator must appear as the first character after the left bracket; it indicates that the resulting string is to be complemented with respect to the computer character set. Thus

    [^abc]

matches all characters except a, b, or c, including all special or control characters; or

    [^a-zA-Z]

is any character which is not a letter. The \ character provides the usual escapes within character class brackets.

Arbitrary character. To match almost any character, the operator character . is the class of all characters except newline. Escaping into octal is possible although non-portable:

    [\40-\176]

matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde).

Optional expressions. The operator ? indicates an optional element of an expression. Thus

    ab?c

matches either ac or abc.

Repeated expressions. Repetitions of classes are indicated by the operators * and +.

    a*

is any number of consecutive a characters, including zero; while

    a+

is one or more instances of a. For example,

    [a-z]+

is all strings of lower case letters. And

    [A-Za-z][A-Za-z0-9]*

indicates all alphanumeric strings with a leading alphabetic character. This is a typical expression for recognizing identifiers in computer languages.

Alternation and Grouping. The operator | indicates alternation:

    (ab|cd)

matches either ab or cd. Note that parentheses are used for grouping, although they are not necessary on the outside level;

    ab|cd

would have sufficed. Parentheses can be used for more complex expressions:

    (ab|cd+)?(ef)*

matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.

Context sensitivity. Lex will recognize a small amount of surrounding context. The two simplest operators for this are ^ and $. If the first character of an expression is ^, the expression will only be matched at the beginning of a line (after a newline character, or at the beginning of the input stream). This can never conflict with the other meaning of ^, complementation of character classes, since that only applies within the [] operators. If the very last character is $, the expression will only be matched at the end of a line (when immediately followed by newline).

The latter operator is a special case of the / operator character, which indicates trailing context. The expression

    ab/cd

matches the string ab, but only if followed by cd. Thus

    ab$

is the same as

    ab/\n

Left context is handled in Lex by start conditions as explained in section 10. If a rule is only to be executed when the Lex automaton interpreter is in start condition x, the rule should be prefixed by

    <x>

using the angle bracket operator characters. If we considered ``being at the beginning of a line'' to be start condition ONE, then the ^ operator would be equivalent to

    <ONE>

Start conditions are explained more fully later.

Repetitions and Definitions. The operators {} specify either repetitions (if they enclose numbers) or definition expansion (if they enclose a name). For example

    {digit}

looks for a predefined string named digit and inserts it at that point in the expression. The definitions are given in the first part of the Lex input, before the rules. In contrast,

    a{1,5}

looks for 1 to 5 occurrences of a.

Finally, initial % is special, being the separator for Lex source segments.

Lex Actions:

When an expression written as above is matched, Lex executes the corresponding action. This section describes some features of Lex which aid in writing actions. Note that there is a default action, which consists of copying the input to the output. This is performed on all strings not otherwise matched. Thus the Lex user who wishes to absorb the entire input, without producing any output, must provide rules to match everything. When Lex is being used with Yacc, this is the normal situation. One may consider that actions are what is done instead of copying the input to the output; thus, in general, a rule which merely copies can be omitted. Also, a character combination which is omitted from the rules and which appears as input is likely to be printed on the output, thus calling attention to the gap in the rules.

One of the simplest things that can be done is to ignore the input. Specifying a C null statement, ; as an action causes this result. A frequent rule is

    [ \t\n]   ;

which causes the three spacing characters (blank, tab, and newline) to be ignored.

Another easy way to avoid writing actions is the action character |, which indicates that the action for this rule is the action for the next rule. The previous example could also have been written

    " "    |
    "\t"   |
    "\n"   ;

with the same result, although in different style. The quotes around \n and \t are not required.

In more complex actions, the user will often want to know the actual text that matched some expression like [a-z]+. Lex leaves this text in an external character array named yytext. Thus, to print the name found, a rule like

    [a-z]+   printf("%s", yytext);

will print the string in yytext. The C function printf accepts a format argument and data to be printed; in this case, the format is ``print string'' (% indicating data conversion, and s indicating string type), and the data are the characters in yytext. So this just places the matched string on the output. This action is so common that it may be written as ECHO:

    [a-z]+   ECHO;

is the same as the above. Since the default action is just to print the characters found, one might ask why give a rule, like this one, which merely specifies the default action? Such rules are often required to avoid matching some other rule which is not desired. For example, if there is a rule which matches read it will normally match the instances of read contained in bread or readjust; to avoid this, a rule of the form [a-z]+ is needed. This is explained further below.

Sometimes it is more convenient to know the end of what has been found; hence Lex also provides a count yyleng of the number of characters matched. To count both the number of words and the number of characters in words in the input, the user might write

    [a-zA-Z]+   {words++; chars += yyleng;}

which accumulates in chars the number of characters in the words recognized. The last character in the string matched can be accessed by

    yytext[yyleng-1]

Occasionally, a Lex action may decide that a rule has not recognized the correct span of characters. Two routines are provided to aid with this situation. First, yymore() can be called to indicate that the next input expression recognized is to be tacked on to the end of this input. Normally, the next input string would overwrite the current entry in yytext. Second, yyless(n) may be called to indicate that not all the characters matched by the currently successful expression are wanted right now. The argument n indicates the number of characters in yytext to be retained. Further characters previously matched are returned to the input. This provides the same sort of lookahead offered by the / operator, but in a different form.

Example: Consider a language which defines a string as a set of characters between quotation (") marks, and provides that to include a " in a string it must be preceded by a \. The regular expression which matches that is somewhat confusing, so that it might be preferable to write

    \"[^"]*   {
                  if (yytext[yyleng-1] == '\\')
                      yymore();
                  else
                      ... normal user processing
              }

which will, when faced with a string such as "abc\"def" first match the five characters "abc\ ; then the call to yymore() will cause the next part of the string, "def , to be tacked on the end. Note that the final quote terminating the string should be picked up in the code labeled ``normal processing''.

The function yyless() might be used to reprocess text in various circumstances. Consider the C problem of distinguishing the ambiguity of ``=-a''. Suppose it is desired to treat this as ``=- a'' but print a message. A rule might be

    =-[a-zA-Z]   {
                     printf("Op (=-) ambiguous\n");
                     yyless(yyleng-1);
                     ... action for =- ...
                 }

which prints a message, returns the letter after the operator to the input stream, and treats the operator as ``=-''. Alternatively it might be desired to treat this as ``= -a''. To do this, just return the minus sign as well as the letter to the input:

    =-[a-zA-Z]   {
                     printf("Op (=-) ambiguous\n");
                     yyless(yyleng-2);
                     ... action for = ...
                 }

will perform the other interpretation. Note that the expressions for the two cases might more easily be written

    =-/[A-Za-z]

in the first case and

    =/-[A-Za-z]

in the second; no backup would be required in the rule action. It is not necessary to recognize the whole identifier to observe the ambiguity. The possibility of ``=-3'', however, makes

    =-/[^ \t\n]

a still better rule.

In addition to these routines, Lex also permits access to the I/O routines it uses. They are:

1) input() which returns the next input character;
2) output(c) which writes the character c on the output; and
3) unput(c) pushes the character c back onto the input stream to be read later by input().

By default these routines are provided as macro definitions, but the user can override them and supply private versions. These routines define the relationship between external files and internal characters, and must all be retained or modified consistently. They may be redefined, to cause input or output to be transmitted to or from strange places, including other programs or internal memory; but the character set used must be consistent in all routines; a value of zero returned by input must mean end of file; and the relationship between unput and input must be retained or the Lex lookahead will not work. Lex does not look ahead at all if it does not have to, but every rule ending in + * ? or $ or containing / implies lookahead. Lookahead is also necessary to match an expression that is a prefix of another expression. See below for a discussion of the character set used by Lex. The standard Lex library imposes a 100 character limit on backup.

Another Lex library routine that the user will sometimes want to redefine is yywrap() which is called whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex continues with the normal wrap-up on end of input. Sometimes, however, it is convenient to arrange for more input to arrive from a new source. In this case, the user should provide a yywrap which arranges for new input and returns 0. This instructs Lex to continue processing. The default yywrap always returns 1.

This routine is also a convenient place to print tables, summaries, etc. at the end of a program. Note that it is not possible to write a normal rule which recognizes end-of-file; the only access to this condition is through yywrap. In fact, unless a private version of input() is supplied a file containing nulls cannot be handled, since a value of 0 returned by input is taken to be end-of-file.

Ambiguous Source Rules:

Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:

1) The longest match is preferred.
2) Among rules which matched the same number of characters, the rule given first is preferred.

Thus, suppose the rules

    integer   keyword action ...;
    [a-z]+    identifier action ...;

to be given in that order. If the input is integers, it is taken as an identifier, because [a-z]+ matches 8 characters while integer matches only 7. If the input is integer, both rules match 7 characters, and the keyword rule is selected because it was given first. Anything shorter (e.g. int) will not match the expression integer and so the identifier interpretation is used.

The principle of preferring the longest match makes rules containing expressions like .* dangerous. For example,

    '.*'

might seem a good way of recognizing a string in single quotes. But it is an invitation for the program to read far ahead, looking for a distant single quote. Presented with the input

    'first' quoted string here, 'second' here

the above expression will match

    'first' quoted string here, 'second'

which is probably not what was wanted. A better rule is of the form

    '[^'\n]*'

which, on the above input, will stop after 'first'. The consequences of errors like this are mitigated by the fact that the . operator will not match newline. Thus expressions like .* stop on the current line. Don't try to defeat this with expressions like (.|\n)+ or equivalents; the Lex generated program will try to read the entire input file, causing internal buffer overflows.

Note that Lex is normally partitioning the input stream, not searching for all possible matches of each expression. This means that each character is accounted for once and only once. For example, suppose it is desired to count occurrences of both she and he in an input text. Some Lex rules to do this might be

    she   s++;
    he    h++;
    \n    |
    .     ;

where the last two rules ignore everything besides he and she. Remember that . does not include newline. Since she includes he, Lex will normally not recognize the instances of he included in she, since once it has passed a she those characters are gone.

Sometimes the user would like to override this choice. The action REJECT means ``go do the next alternative.'' It causes whatever rule was second choice after the current rule to be executed. The position of the input pointer is adjusted accordingly. Suppose the user really wants to count the included instances of he:

    she   {s++; REJECT;}
    he    {h++; REJECT;}
    \n    |
    .     ;

these rules are one way of changing the previous example to do just that. After counting each expression, it is rejected; whenever appropriate, the other expression will then be counted. In this example, of course, the user could note that she includes he but not vice versa, and omit the REJECT action on he; in other cases, however, it would not be possible a priori to tell which input characters were in both classes.

Consider the two rules

    a[bc]+   { ... ; REJECT;}
    a[cd]+   { ... ; REJECT;}

If the input is ab, only the first rule matches, and on ad only the second matches. The input string accb matches the first rule for four characters and then the second rule for three characters. In contrast, the input accd agrees with the second rule for four characters and then the first rule for three.

In general, REJECT is useful whenever the purpose of Lex is not to partition the input stream but to detect all examples of some items in the input, and the instances of these items may overlap or include each other. Suppose a digram table of the input is desired; normally the digrams overlap, that is the word the is considered to contain both th and he. Assuming a two-dimensional array named digram to be incremented, the appropriate source is

    %%
    [a-z][a-z]   { digram[yytext[0]][yytext[1]]++; REJECT; }
    .            ;
    \n           ;

where the REJECT is necessary to pick up a letter pair beginning at every character, rather than at every other character.

Lex Source Definitions:

Remember the format of the Lex source:

    {definitions}
    %%
    {rules}
    %%
    {user routines}

So far only the rules have been described. The user needs additional options, though, to define variables for use in his program and for use by Lex. These can go either in the definitions section or in the rules section.

Remember that Lex is turning the rules into a program. Any source not intercepted by Lex is copied into the generated program. There are three classes of such things.

1) Any line which is not part of a Lex rule or action which begins with a blank or tab is copied into the Lex generated program. Such source input prior to the first %% delimiter will be external to any function in the code; if it appears immediately after the first %%, it appears in an appropriate place for declarations in the function written by Lex which contains the actions. This material must look like program fragments, and should precede the first Lex rule. As a side effect of the above, lines which begin with a blank or tab, and which contain a comment, are passed through to the generated program. This can be used to include comments in either the Lex source or the generated code. The comments should follow the host language convention.

2) Anything included between lines containing only %{ and %} is copied out as above. The delimiters are discarded. This format permits entering text like preprocessor statements that must begin in column 1, or copying lines that do not look like programs.

3) Anything after the third %% delimiter, regardless of formats, etc., is copied out after the Lex output.

Definitions intended for Lex are given before the first %% delimiter. Any line in this section not contained between %{ and %}, and beginning in column 1, is assumed to define Lex substitution strings. The format of such lines is

    name translation

and it causes the string given as a translation to be associated with the name. The name and translation must be separated by at least one blank or tab, and the name must begin with a letter. The translation can then be called out by the {name} syntax in a rule. Using {D} for the digits and {E} for an exponent field, for example, might abbreviate rules to recognize numbers:

    D                   [0-9]
    E                   [DEde][-+]?{D}+
    %%
    {D}+                printf("integer");
    {D}+"."{D}*({E})?   |
    {D}*"."{D}+({E})?   |
    {D}+{E}             printf("real");

Note the first two rules for real numbers; both require a decimal point and contain an optional exponent field, but the first requires at least one digit before the decimal point and the second requires at least one digit after the decimal point. To correctly handle the problem posed by a Fortran expression such as 35.EQ.I, which does not contain a real number, a context-sensitive rule such as

    [0-9]+/"."EQ   printf("integer");

could be used in addition to the normal rule for integers.

The definitions section may also contain other commands, including the selection of a host language, a character set table, a list of start conditions, or adjustments to the default size of arrays within Lex itself for larger source programs. These possibilities are discussed below under ``Summary of Source Format,'' section 12.

Usage:

There are two steps in compiling a Lex source program. First, the Lex source must be turned into a generated program in the host general purpose language. Then this program must be compiled and loaded, usually with a library of Lex subroutines. The generated program is on a file named lex.yy.c.

Lex and Yacc:

If you want to use Lex with Yacc, note that what Lex writes is a program named yylex(), the name required by Yacc for its analyzer. Normally, the default main program on the Lex library calls this routine, but if Yacc is loaded, and its main program is used, Yacc will call yylex(). In this case each Lex rule should end with

    return(token);

where the appropriate token value is returned. An easy way to get access to Yacc's names for tokens is to compile the Lex output file as part of the Yacc output file by placing the line

    #include "lex.yy.c"

in the last section of Yacc input. Supposing the grammar to be named ``good'' and the lexical rules to be named ``better'' the UNIX command sequence can just be:

    yacc good
    lex better
    cc y.tab.c -ly -ll

The Yacc library (-ly) should be loaded before the Lex library, to obtain a main program which invokes the Yacc parser. The generations of Lex and Yacc programs can be done in either order.

Writing the Lex:

Step 1. Identify all the tokens. If one has the BNF, then all the terminals are the tokens generated by Lex.
Step 2. Write all the regular expressions describing the tokens. All the keywords in the grammar form the legal tokens.
Step 3. Identify all Lex substitutions (if possible) from the set of regular expressions.
Step 4. To avoid reduce/reduce errors, identify possible start states and break conflicting tokens into multiple tokens.

Writing the Yacc:

Step 1. Identify the terminal and non-terminal symbols from the BNF and Lex.
Step 2. Try coding all the grammar rules in Yacc with empty actions. Compile, link it to Lex and check for conflicts. This is an easy way of validating the BNF for reduce/reduce and shift/reduce conflicts.
Step 3. Search for any reduce/reduce conflict. Resolve it in Lex.
Step 4. Resolve any shift/reduce conflicts. Details on resolving them are given later.
Step 5. Write rules for all possible syntax errors. Details on error handling are given later.
Step 6. Code the yyerror function in the subroutine section.
Step 7. Design the data structure which can be easily integrated with the grammar rules for syntax directed translation.
Step 8. From the data structures and Lex needs, formulate the correct stack. The stack must have pointers for all the data structures.
Step 9. Do the appropriate type binding to all tokens and Yacc variables (non-terminals).
Step 10. Write all the data structures in a separate file and include it in Yacc.
Step 11. Code all the actions. Restrict the actions in case of error, i.e. no data structure should be built but parsing should continue to get more errors.

Eliminating shift/reduce errors:

1. Use the -v switch of Yacc to create the description file (y.output). This file will contain the full transition diagram description and the points at which any conflict arises.
2. Try assigning precedence and associativity to tokens and literals by using %left, %right, %nonassoc and %prec. Note that tokens declared on the same line have the same precedence level, and that precedence increases down the lines.
3. In the majority of cases a shift/reduce conflict is in the vicinity of left/right recursion. These conflicts might not be due to associativity or precedence relations. E.g. consider the rules

    s->XabY
    a->E|aXAY        (E = empty transition)
    b->E|bXBY

The syntax of these rules says that there is a block XY which can contain zero or more blocks of type A and B. These rules have a shift/reduce conflict on the symbol X, since in s->XabY, when making the transition from a to b with input X, the parser has no way to tell whether it should reduce or shift another token. The rules can be rewritten as follows:

    s->XaY
    a->E|aXbY
    b->A|B

Syntax checking and error recovery

This is one of the toughest parts of parsing. The simplest method is just to use the pseudo literal 'error'. Whenever there is an error in a rule, Yacc pushes the pseudo literal error and takes in the next input. On identifying the rule it pops the stack and takes the proper actions. In this way the file pointer will always point to the right location and the next rule can be looked for correctly. In our example code,

    error class { yyerror ("Missing Relation"); }

says that if only class exists then an error must be flagged. The literal error is pushed on the stack if relation is missing, and then class is pushed. On reduction it calls yyerror with the message string. There are many functions like yyerrok etc. However, not all Yacc versions support them.
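The idea behind this kind of recovery (report the error, discard input up to a known point, then continue so that later errors are still found) can be sketched by hand in C. This is not Yacc's internal mechanism. The statement form DIGITS ';', the struct parse_result type and the report_error() helper are assumptions made for the illustration.

```c
#include <ctype.h>
#include <stdio.h>

/* Result of a parse: how many statements were accepted or rejected. */
struct parse_result {
    int accepted;
    int errors;
};

/* Hypothetical error reporter, playing the role of yyerror(). */
static void report_error(const char *msg, int position)
{
    fprintf(stderr, "error at offset %d: %s\n", position, msg);
}

/*
 * Parse statements of the assumed form  DIGITS ';'  from text.
 * On a malformed statement, report it, skip ahead to the next ';'
 * (the synchronizing token) and continue with the next statement.
 */
struct parse_result parse_statements(const char *text)
{
    struct parse_result r = {0, 0};
    int i = 0;

    while (text[i] != '\0') {
        int start = i;
        while (isdigit((unsigned char)text[i]))
            i++;
        if (i > start && text[i] == ';') {
            r.accepted++;                 /* well-formed statement */
            i++;
            continue;
        }
        report_error("malformed statement", start);
        r.errors++;
        while (text[i] != '\0' && text[i] != ';')
            i++;                          /* discard input up to ';' */
        if (text[i] == ';')
            i++;                          /* consume the synchronizing token */
    }
    return r;
}
```

No result is built for a rejected statement, yet scanning resumes right after it, which matches the advice to restrict actions on error while continuing to parse.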

PROGRAM-1
Objective: Lex program to count the number of vowels and consonants.

%{
#include <stdio.h>
int v=0, c=0;   /* vowel and consonant counts */
%}
%%
[aeiouAEIOU]  v++;
[a-zA-Z]      c++;
%%
int main(void)
{
    printf("ENTER INPUT : \n");
    yylex();
    printf("VOWELS=%d\nCONSONANTS=%d\n", v, c);
    return 0;
}

PROGRAM-2
Objective: Lex program to count the type of numbers.

%{
#include <stdio.h>
/* counts of positive/negative integers and fractions */
int pi=0, ni=0, pf=0, nf=0;
%}
%%
\+?[0-9]+           pi++;
\+?[0-9]*\.[0-9]+   pf++;
\-[0-9]+            ni++;
\-[0-9]*\.[0-9]+    nf++;
%%
int main(void)
{
    printf("ENTER INPUT : ");
    yylex();
    printf("\nPOSITIVE INTEGER : %d", pi);
    printf("\nNEGATIVE INTEGER : %d", ni);
    printf("\nPOSITIVE FRACTION : %d", pf);
    printf("\nNEGATIVE FRACTION : %d\n", nf);
    return 0;
}
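What the four patterns of this program accept can be checked with a plain C classifier. This is a sketch, assuming the whole string is a single candidate number (Lex itself scans a stream and picks the longest match); enum num_kind and classify_number() are hypothetical names.

```c
#include <ctype.h>

enum num_kind { NOT_A_NUMBER, POS_INT, NEG_INT, POS_FRAC, NEG_FRAC };

/* Classify s the way the four patterns would: an optional sign, digits,
   and for fractions a '.' followed by at least one digit. */
enum num_kind classify_number(const char *s)
{
    int negative = 0, int_digits = 0, frac_digits = 0, has_dot = 0;

    if (*s == '-') {
        negative = 1;
        s++;
    } else if (*s == '+') {
        s++;
    }

    while (isdigit((unsigned char)*s)) {
        int_digits++;
        s++;
    }
    if (*s == '.') {
        has_dot = 1;
        s++;
        while (isdigit((unsigned char)*s)) {
            frac_digits++;
            s++;
        }
    }
    if (*s != '\0')
        return NOT_A_NUMBER;              /* trailing characters */
    if (has_dot) {
        if (frac_digits == 0)
            return NOT_A_NUMBER;          /* "3." has no fraction digits */
        return negative ? NEG_FRAC : POS_FRAC;
    }
    if (int_digits == 0)
        return NOT_A_NUMBER;              /* a bare sign or empty string */
    return negative ? NEG_INT : POS_INT;
}
```

As in the patterns, digits before the decimal point are optional, so ".5" counts as a positive fraction.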

PROGRAM-3
Objective: Lex program to count the number of printf and scanf statements.

%{
#include <stdio.h>
int pf=0, sf=0;   /* printf and scanf counts */
%}
%%
printf  { pf++; fprintf(yyout, "%s", "writef"); }
scanf   { sf++; fprintf(yyout, "%s", "readf"); }
%%
int main(void)
{
    yyin = fopen("file1.l", "r+");
    yyout = fopen("file2.l", "w+");
    if (yyin == NULL || yyout == NULL) {
        perror("fopen");
        return 1;
    }
    yylex();
    printf("NUMBER OF PRINTF IS %d\n", pf);
    printf("NUMBER OF SCANF IS %d\n", sf);
    return 0;
}

PROGRAM-4
Objective: Lex program to find simple and compound sentences.

%{
#include <stdio.h>
#include <stdlib.h>
%}
%%
"and"|"or"|"but"|"because"|"nevertheless"  { printf("COMPOUND SENTENCE\n"); exit(0); }
\n   return 0;
%%
int main(void)
{
    printf("\nENTER THE SENTENCE : ");
    yylex();
    printf("SIMPLE SENTENCE\n");
    return 0;
}

PROGRAM-5
Objective: Lex program to count the number of identifiers.

%{
#include <stdio.h>
int id=0, flag=0;   /* flag is set while inside a declaration */
%}
%%
"int"|"char"|"float"|"double"  { flag=1; printf("%s", yytext); }
";"                            { flag=0; printf("%s", yytext); }
[a-zA-Z][a-zA-Z0-9]*           { if (flag != 0) id++; printf("%s", yytext); }
[a-zA-Z0-9]*"="[0-9]+          { id++; printf("%s", yytext); }
[0]                            return(0);
%%
int main(void)
{
    printf("\n *** output\n");
    yyin = fopen("f1.l", "r");
    if (yyin == NULL) {
        perror("f1.l");
        return 1;
    }
    yylex();
    printf("\nNUMBER OF IDENTIFIERS = %d\n", id);
    fclose(yyin);
    return 0;
}
int yywrap()
{
    return(1);
}

PROGRAM-6
Objective: Lex program to count the number of words, characters, blank spaces and lines.

%{
#include <stdio.h>
int c=0, w=0, l=0, s=0;   /* characters, words, lines, spaces */
%}
%%
\n          l++;
[ \t]       s++;
[^ \t\n]+   { w++; c += yyleng; }
%%
int main(int argc, char *argv[])
{
    if (argc == 2 && (yyin = fopen(argv[1], "r")) != NULL)
    {
        yylex();
        printf("\nNUMBER OF SPACES = %d", s);
        printf("\nCHARACTER=%d", c);
        printf("\nLINES=%d", l);
        printf("\nWORD=%d\n", w);
    }
    else
        printf("ERROR");
    return 0;
}
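For comparison, the same four counts can be computed by a hand-written loop in C. This sketch assumes only blank and tab count as spaces and that the text is already in memory; struct counts and count_text() are hypothetical names.

```c
#include <stddef.h>

struct counts {
    size_t chars;    /* characters that belong to words */
    size_t words;    /* maximal runs of non-whitespace */
    size_t spaces;   /* blanks and tabs */
    size_t lines;    /* newline characters */
};

/* Count lines, spaces, words and word characters in text. */
struct counts count_text(const char *text)
{
    struct counts c = {0, 0, 0, 0};
    int in_word = 0;

    for (; *text != '\0'; text++) {
        if (*text == '\n') {
            c.lines++;
            in_word = 0;
        } else if (*text == ' ' || *text == '\t') {
            c.spaces++;
            in_word = 0;
        } else {
            c.chars++;
            if (!in_word) {        /* first character of a new word */
                c.words++;
                in_word = 1;
            }
        }
    }
    return c;
}
```

The in_word flag does by hand what the + operator and longest-match rule give the Lex version for free.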

PROGRAM-7
Objective: Lex program to count the number of comment lines.

%{
#include <stdio.h>
int cc=0;   /* number of comments seen */
%}
%%
"/*"[a-zA-Z0-9 \t\n]*"*/"   cc++;
"//"[a-zA-Z0-9 \t]*         cc++;
%%
int main(void)
{
    yyin = fopen("f1.l", "r");
    yyout = fopen("f2.l", "w");
    if (yyin == NULL || yyout == NULL) {
        perror("fopen");
        return 1;
    }
    yylex();
    fclose(yyin);
    fclose(yyout);
    printf("\nTHE NUMBER OF COMMENT LINES = %d\n", cc);
    return 0;
}

PROGRAM-8
Objective: Lex program to check the validity of an arithmetic statement.

%{
#include <stdio.h>
int opr=0, opd=0;   /* operator and operand counts */
int n;
%}
%%
[\+\-\*\/]                  { printf("OPERATORS ARE %s\n", yytext); opr++; }
[a-zA-Z]+                   { printf("OPERANDS ARE %s\n", yytext); opd++; }
[0-9]+                      { printf("OPERANDS ARE %s\n", yytext); opd++; }
[a-zA-Z]+\+\-\*\/[a-zA-Z]+  { n=0; }
[0-9]+\+\-\*\/[0-9]+        { n=0; }
%%
int main(void)
{
    printf("\nENTER THE EXPRESSION : \n");
    yylex();
    printf("\nNUMBER OF OPERATORS ARE %d", opr);
    printf("\nNUMBER OF OPERANDS ARE %d", opd);
    if ((n == 0) && (opd == opr + 1))
        printf("\nVALID EXPRESSION\n");
    else
        printf("\nINVALID EXPRESSION\n");
    return 0;
}

PROGRAM-9
Objective: Lex program to find the number of constants.

%{
#include <stdio.h>
int cons=0;   /* number of numeric constants seen */
%}
%%
[0-9]+  { printf("\n%s", yytext); cons++; }
%%
int main(int argc, char *argv[])
{
    if (argc == 2 && (yyin = fopen(argv[1], "r")) != NULL)
    {
        yylex();
        printf("\nNUMBER OF CONSTANTS : %d\n", cons);
    }
    else
        printf("\nERROR");
    return 0;
}

PROGRAM-10
Objective: Finds the longest word (defined as a contiguous string of upper and lower case letters) in the input. Although the range notation "[a-zA-Z]" would match any single letter, tacking on a "+" after it produces a regular expression matching any sequence of one or more letters, as long as possible (i.e. until it sees a non-letter).

%{
#include <stdio.h>
#include <string.h>
int longest = 0;
char longword[60];
%}
%%
[a-zA-Z]+  { /* keep only words that fit in longword */
             if (yyleng > longest && yyleng < (int) sizeof longword) {
                 longest = yyleng;
                 strcpy (longword, yytext);
             }
           }
.          |
\n         ;
%%
int main (void)
{
    yylex ();
    printf ("The longest word was \"%s\", which was %d characters long.\n", longword, longest);
    return 0;
}
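The "as long as possible" behaviour of the + operator can be reproduced by hand in C. In this sketch longest_word() is a hypothetical helper that works on an in-memory string and, as an assumption, uses isalpha() in the default C locale to stand in for [a-zA-Z].

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy the longest run of letters in text into out (capacity cap,
   including the terminating NUL) and return the run's full length.
   On a tie the earliest run wins. */
size_t longest_word(const char *text, char *out, size_t cap)
{
    size_t best_len = 0;
    const char *best = text;

    while (*text != '\0') {
        if (!isalpha((unsigned char)*text)) {
            text++;                        /* not a letter: skip it */
            continue;
        }
        const char *start = text;
        while (isalpha((unsigned char)*text))
            text++;                        /* extend the match as far as possible */
        if ((size_t)(text - start) > best_len) {
            best_len = (size_t)(text - start);
            best = start;
        }
    }
    if (cap > 0) {
        size_t n = best_len < cap - 1 ? best_len : cap - 1;
        memcpy(out, best, n);              /* truncate if out is too small */
        out[n] = '\0';
    }
    return best_len;
}
```

Passing the buffer capacity explicitly avoids the overflow that an unchecked strcpy into a fixed 60-byte array would risk on very long words.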

PROGRAM-11
Objective: Recognizes a number of words as verbs and nonverbs.

%{
/* this sample demonstrates very simple word recognition: verbs & other */
#include <stdio.h>
%}
%%
[\t ]+     /* ignore whitespace */ ;
is         |
am         |
are        |
were       |
was        |
be         |
being      |
been       |
do         |
does       |
did        |
will       |
would      |
should     |
can        |
could      |
has        |
have       |
had        |
go         { printf ("\"%s\" is a verb\n", yytext); }
[a-zA-Z]+  { printf ("\"%s\" is not a verb\n", yytext); }
.          |
\n         ECHO;   /* which is the default anyway */
%%
int main (void)
{
    return yylex ();
}

PROGRAM-12
Objective: Distinguishes keywords, integers, floats, identifiers, operators, and comments.

%{
/* need these for the calls to atoi() and atof() below */
#include <stdio.h>
#include <stdlib.h>
%}
DIGIT [0-9]
ID    [a-z][a-z0-9]*
%%
{DIGIT}+             { printf( "An integer: %s (%d)\n", yytext, atoi( yytext ) ); }
{DIGIT}+"."{DIGIT}*  { printf( "A float: %s (%g)\n", yytext, atof( yytext ) ); }
if|then|begin|end|procedure|function  { printf( "A keyword: %s\n", yytext ); }
{ID}                 printf( "An identifier: %s\n", yytext );
"+"|"-"|"*"|"/"      printf( "An operator: %s\n", yytext );
"{"[^}\n]*"}"        /* eat up one-line comments */
[ \t\n]+             /* eat up whitespace */
.                    printf( "Unrecognized character: %s\n", yytext );
%%
int main( int argc, char **argv )
{
    ++argv, --argc;   /* skip over program name */
    if ( argc > 0 )
        yyin = fopen( argv[0], "r" );
    else
        yyin = stdin;
    return yylex();
}

PROGRAM-13
Objective: The yacc and flex portions of a simple calculator program.

%{
#include <stdio.h>
int yylex (void);
int yyerror (char *msg);
%}
%token NAME NUMBER
%%
statement:  NAME '=' expression   { printf("pretending to assign %s the value %d\n", $1, $3); }
    |       expression            { printf("= %d\n", $1); }
    ;
expression: expression '+' NUMBER { $$ = $1 + $3; printf ("Recognized '+' expression.\n"); }
    |       expression '-' NUMBER { $$ = $1 - $3; printf ("Recognized '-' expression.\n"); }
    |       NUMBER                { $$ = $1; printf ("Recognized a number.\n"); }
    ;
%%
int main (void)
{
    return yyparse();
}
/* Added because panther doesn't have liby.a installed. */
int yyerror (char *msg)
{
    return fprintf (stderr, "YACC: %s\n", msg);
}

%{
#include <stdio.h>
#include <stdlib.h>
#include "ch301.tab.h"
extern int yylval;
%}
%%
[0-9]+  { yylval = atoi (yytext); printf ("scanned the number %d\n", yylval); return NUMBER; }
[ \t]   { printf ("skipped whitespace\n"); }
\n      { printf ("reached end of line\n"); return 0; }
.       { printf ("found other data \"%s\"\n", yytext); return yytext[0]; }
%%

PROGRAM-14
Objective: A more advanced version that handles parentheses and order of operations.

%{
#include <stdio.h>
int yylex (void);
int yyerror (char *msg);
%}
%token NAME NUMBER
%left '-' '+'
%left '*' '/'
%nonassoc UMINUS
%%
statement:  NAME '=' expression       { printf("pretending to assign %s the value %d\n", $1, $3); }
    |       expression                { printf("= %d\n", $1); }
    ;
expression: expression '+' expression { $$ = $1 + $3; printf ("Recognized '+' expression.\n"); }
    |       expression '-' expression { $$ = $1 - $3; printf ("Recognized '-' expression.\n"); }
    |       expression '*' expression { $$ = $1 * $3; printf ("Recognized '*' expression.\n"); }
    |       expression '/' expression { if ($3 == 0) yyerror ("divide by zero");
                                        else $$ = $1 / $3;
                                        printf ("Recognized '/' expression.\n"); }
    |       '-' expression %prec UMINUS { $$ = - $2; printf ("Recognized negation.\n"); }
    |       '(' expression ')'        { $$ = $2; printf ("Recognized parenthesized expression.\n"); }
    |       NUMBER                    { $$ = $1; printf ("Recognized a number.\n"); }
    ;
%%
int main (void)
{
    return yyparse();
}
/* Added because panther doesn't have liby.a installed. */
int yyerror (char *msg)
{
    return fprintf (stderr, "YACC: %s\n", msg);
}
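The effect of the %left and %prec declarations can be imitated by a hand-written evaluator in C. The sketch below uses precedence climbing, a different technique from the table-driven parser Yacc generates, chosen here only to make the precedence levels visible. All function names are hypothetical, input is assumed to be an in-memory string of non-negative integers and operators, and returning 0 on division by zero is an assumption of the sketch.

```c
#include <ctype.h>

static const char *p;   /* cursor into the expression being evaluated */

static void skip_blanks(void)
{
    while (*p == ' ' || *p == '\t')
        p++;
}

/* Binding strength of a binary operator, 0 if c is not one.
   Mirrors %left '-' '+' (lower) followed by %left '*' '/' (higher). */
static int precedence(char c)
{
    if (c == '+' || c == '-') return 1;
    if (c == '*' || c == '/') return 2;
    return 0;
}

static int parse_expr(int min_prec);

/* primary: NUMBER | '(' expr ')' | '-' primary  (unary minus binds tightest) */
static int parse_primary(void)
{
    skip_blanks();
    if (*p == '-') {
        p++;
        return -parse_primary();
    }
    if (*p == '(') {
        p++;
        int inner = parse_expr(1);
        skip_blanks();
        if (*p == ')')
            p++;
        return inner;
    }
    int v = 0;
    while (isdigit((unsigned char)*p))
        v = v * 10 + (*p++ - '0');
    return v;
}

/* Parse operators of precedence >= min_prec, left-associatively. */
static int parse_expr(int min_prec)
{
    int lhs = parse_primary();
    for (;;) {
        skip_blanks();
        int prec = precedence(*p);
        if (prec == 0 || prec < min_prec)
            return lhs;
        char op = *p++;
        int rhs = parse_expr(prec + 1);   /* prec + 1 gives left associativity */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs = rhs != 0 ? lhs / rhs : 0; break;  /* assumption: 0 on divide by zero */
        }
    }
}

/* Evaluate a whole expression string. */
int evaluate(const char *text)
{
    p = text;
    return parse_expr(1);
}
```

Because '*' and '/' sit one level above '+' and '-', 2+3*4 groups as 2+(3*4), and because each level is left-associative, 8-3-2 groups as (8-3)-2.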

PROGRAM-15
Objective: A program which makes a quick-and-dirty attempt at reading English sentences.

%{
#include <stdio.h>
int at_end = 0;
extern int yychar;
int yylex (void);
int yyerror (char *msg);
%}
%token ARTICLE VERB NOUN ADJECTIVE ADVERB PREPOSITION END
%start sentence
%%
sentence:     nphrase VERB termpunct           { printf ("Simple noun-verb sentence.\n"); YYACCEPT; }
    |         nphrase VERB vmodifier termpunct { printf ("Sentence with modified verb\n"); YYACCEPT; }
    |         nphrase VERB nphrase termpunct   { printf ("Sentence with object\n"); YYACCEPT; }
    |         END                              { printf ("Got EOF from lex.\n"); YYACCEPT; }
    ;
/* All these YYACCEPTS are needed so yyparse will return immediately,
   rather than waiting for the first token of the next sentence. They
   wouldn't be necessary if the main program were only calling
   yyparse() once. */
termpunct:    '.'
    |         '!'
    ;
nphrase:      modifiednoun
    |         ARTICLE modifiednoun   { printf ("\tGot an article\n"); }
    ;
modifiednoun: NOUN
    |         nmodifier modifiednoun { printf ("\tmodified noun\n"); }
    ;
nmodifier:    ADJECTIVE
    |         ADVERB nmodifier       { printf ("\tadded an adverb to a noun modifier\n"); }
    ;
vmodifier:    ADVERB
    |         ADVERB vmodifier       { printf ("\tadded an adverb to a verb modifier\n"); }
    |         PREPOSITION nphrase    { printf ("\tprepositional phrase\n"); }
    ;
%%
int main (void)
{
    while (! at_end)
    {
        yyparse();
    }
    printf ("Wasn't that fun?\n");
    return 0;
}
/* Added because panther doesn't have liby.a installed. */
int yyerror (char *msg)
{
    return fprintf (stderr, "YACC: %s; yychar=%d\n", msg, yychar);
}

%{
#include "words2.tab.h"
extern int yylval;
extern int at_end;
%}
%%
<<EOF>>   { at_end = 1; return END; }
" "       |
\n        |
"\t"      ;
[Tt]he    |
[Aa]n?    return ARTICLE;
go(es)?   |
jumps?    |
runs?     |
likes?    |
eats?     return VERB;
dogs?     |
cats?     |
fish      |
fox(es)?  |
moose     return NOUN;
quick     |
slow      |
lazy      |
clever    |
smart     |
stupid    |
brown     |
black     |
blue      |
red       |
orange    |
white     |
big       |
small     return ADJECTIVE;
quickly   |
easily    |
openly    |
slowly    return ADVERB;
over      |
under     |
around    |
through   |
between   return PREPOSITION;
.         return yytext[0];   /* default action allows yacc to see literal characters */
%%

List of Practicals

1. Lex program to count the number of vowels and consonants.
2. Lex program to count the type of numbers.
3. Lex program to count the number of printf and scanf statements.
4. Lex program to find simple and compound sentences.
5. Lex program to count the number of identifiers.
6. Lex program to count the number of words, characters, blank spaces and lines.
7. Lex program to count the number of comment lines.
8. Lex program to check the validity of an arithmetic statement.
9. Lex program to find the number of constants.
10. Find the longest word (defined as a contiguous string of upper and lower case letters) in the input. Although the range notation "[a-zA-Z]" would match any single letter, tacking on a "+" after it produces a regular expression matching any sequence of one or more letters, as long as possible (i.e. until it sees a non-letter).
11. Recognizes a number of words as verbs and nonverbs.
12. Distinguishes keywords, integers, floats, identifiers, operators, and comments.
13. The Yacc and flex portions of a simple calculator program.
14. A more advanced version that handles parentheses and order of operations.
15. A program which makes a quick-and-dirty attempt at reading English sentences.

