UNIT-I LEXICAL ANALYSIS

TOPIC-I (INTRODUCTION TO COMPILING)
Definition of Compilers: Simply stated, a compiler is a program that reads a program written in one language, the source language, and translates it into an equivalent program in another language, the target language (see Fig. 1). As an important part of this translation process, the compiler reports to its user the presence of errors in the source program.

Fig. 1. A compiler: Source Program → Compiler → Target Program, with error messages reported to the user.
Compilers are sometimes classified as single-pass, multi-pass, load-and-go, debugging, or optimizing, depending on how they have been constructed or on what function they are supposed to perform. Despite this apparent complexity, the basic tasks that any compiler must perform are essentially the same.

Analysis – Synthesis Model of Compilation:
The process of compilation has two parts, namely:
• Analysis
• Synthesis
Analysis: The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program. The analysis part is often called the front end of the compiler.
Synthesis: The synthesis part constructs the desired target program from the intermediate representation. The synthesis part is the back end of the compiler.

Software Tools:
Many software tools that manipulate source programs first perform some kind of analysis. Some examples of such tools include:
• Structure editors
• Pretty printers
• Static checkers
• Interpreters
 Structure Editors: A structure editor takes as input a sequence of commands to build a source program. The structure editor not only performs the text-creation and modification functions of an ordinary text editor, but it also analyzes the program text, putting an appropriate hierarchical structure on the source program. For example, it can supply matching constructs such as while … do and begin … end.
 Pretty Printers: A pretty printer analyzes a program and prints it in such a way that the structure of the program becomes clearly visible.
 Static Checkers: A static checker reads a program, analyzes it, and attempts to discover potential bugs without running the program.
 Interpreters: Rather than producing a target program in assembly or machine language, an interpreter carries out the operations of a high-level-language program (BASIC, FORTRAN, etc.) directly. Interpreters are frequently used to execute command languages.


Examples of Compilers: The following tools work in much the same way as a conventional compiler.
• Text formatters: A text formatter takes as input a stream of characters, which includes commands to indicate paragraphs, figures, etc.
• Silicon compilers: Variables represent logical signals (0 and 1) and the output is a circuit design.
• Query interpreters: A query interpreter translates a predicate containing relational and Boolean operators into commands to search a database.

TOPIC-2 ANALYSIS OF THE SOURCE PROGRAM

The analysis phase breaks up the source program into constituent pieces and creates an intermediate representation of the source program. Analysis consists of three phases:
• Linear analysis
• Hierarchical analysis
• Semantic analysis
 Linear analysis (Lexical analysis or Scanning): The lexical analysis phase reads the characters in the source program and groups them into tokens, sequences of characters having a collective meaning.
Example: position := initial + rate * 60
    Identifiers – position, initial, rate
    Assignment symbol – :=
    Operators – +, *
    Number – 60
    Blanks – eliminated
 Hierarchical analysis (Syntax analysis or Parsing): It involves grouping the tokens of the source program hierarchically into nested collections that are used by the compiler to synthesize output. The result can be drawn as a syntax tree, sketched below.

[Figure: syntax tree for the statement – ':=' at the root with children id1 and '+'; '+' with children id2 and '*'; '*' with children id3 and the constant, which semantic analysis wraps in an inttoreal node.]

 Semantic analysis: This phase checks the source program for semantic errors and gathers type information for the subsequent code-generation phase. An important component of semantic analysis is type checking.
Example: int-to-real conversion – an integer constant used in a real-valued expression is converted with inttoreal before it is used.
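The following is a minimal sketch of how such a coercion might be inserted during type checking. It is only an illustration; the node layout and the names Node, TYPE_REAL and coerce_to_real are inventions of this sketch, not part of the text.

    #include <stdlib.h>

    typedef enum { TYPE_INT, TYPE_REAL } Type;

    typedef struct Node {                  /* one node of the syntax tree         */
        const char *op;                    /* "+", "*", "inttoreal", "num", ...   */
        Type type;                         /* type computed by semantic analysis  */
        struct Node *left, *right;
    } Node;

    /* Wrap an integer-typed operand in an inttoreal conversion node. */
    static Node *coerce_to_real(Node *n) {
        if (n->type == TYPE_REAL) return n;
        Node *conv = malloc(sizeof *conv);
        conv->op = "inttoreal";
        conv->type = TYPE_REAL;
        conv->left = n;
        conv->right = NULL;
        return conv;
    }

    /* Type-check a binary arithmetic node: if either operand is real,
       the other is coerced and the result is real; otherwise it is int. */
    static void check_arith(Node *n) {
        if (n->left->type == TYPE_REAL || n->right->type == TYPE_REAL) {
            n->left  = coerce_to_real(n->left);
            n->right = coerce_to_real(n->right);
            n->type  = TYPE_REAL;
        } else {
            n->type = TYPE_INT;
        }
    }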


TOPIC-3 THE DIFFERENT PHASES OF A COMPILER

Conceptually, a compiler operates in phases, each of which transforms the source program from one representation to another. The first four phases (i.e. Lexical Analysis, Syntax Analysis & Semantic Analysis, and Intermediate Code Generation) form the bulk of the analysis portion of a compiler. The next two phases (i.e. Code Optimization and Code Generation) form the bulk of the synthesis portion. Symbol-table management and error handling are shown interacting with all six phases.

Fig. 2. Phases of a compiler

The Analysis Phase:
I) Lexical Analysis (or Scanner): Consider the statement
    position := initial + rate * 10
 The lexical analysis phase reads the characters in the source program and groups them into a stream of tokens, in which each token represents a logically cohesive sequence of characters, such as an identifier, a keyword, etc.
 The character sequence forming a token is called the lexeme for the token.
 Tokens identified during this phase are stored in the symbol table along with their properties, called attributes.
 The representation of the statement given above after lexical analysis would be:
    id1 := id2 + id3 * 10

Some "three-address instructions may have fewer than three operands. := id1 id2 id3 + * IV)Intermediate Code Generation:  The intermediate representation should have two important properties: • it should be easy to produce and • it should be easy to translate into the target machine  Some of the intermediate forms are three address codes. b k/ t Properties of three-address instructions: 1. h tp t // : cs tu e inttoreal 10 e. a binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point numbers.com/ . where the compiler checks that each operator has matching operands. Each three-address assignment instruction has at most one operator on the right side. If the operator is applied to a floating-point number and an integer.  For example.II) Syntax analysis (or) Parser:  The tokens from the lexical analyzer are grouped hierarchically into nested collections with collective meaning called “Parse Tree” followed by syntax tree as output. postfix notation etc.weebly. The compiler must generate a temporary name to hold the value computed by a threeaddress instruction.  A Syntax Tree is a compressed representation of the parse tree in which the operators appears as interior nodes & the operands as child nodes. 3. the source pgm might look like this. In three-address code. the compiler may convert the integer into a floating-point number. temp1: = inttoreal (10) temp2: = id3 * temp1 temp3: = id2 + temp2 id1: = temp3 http://csetube. := id1 id2 id3 + * 10 III)Semantic Analysis:  An important part of semantic analysis is type checking. 2. which consists of a sequence of assembly-like instructions with three operands per instruction. We consider an intermediate form called three-address code.

The Synthesis Phase:
V) Code Optimization: The code optimization phase attempts to improve the intermediate code, so that faster-running machine code will result. Some optimizations are trivial. There is great variation in the amount of code optimization different compilers perform; in those that do the most, called 'optimizing compilers', a significant fraction of the time of the compiler is spent on this phase.
Two optimization techniques:
i) Local optimization: elimination of common subexpressions, copy propagation.
ii) Loop optimization: finding loop invariants and avoiding their recomputation inside the loop.
For our statement the output will look like this:
    temp1 := id3 * 10.0
    id1 := id2 + temp1

VI) Code Generation:
• The code generator takes as input an intermediate representation of the source program and maps it into the target language.
• If the target language is machine code, registers or memory locations are selected for each of the variables used by the program.
• Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task, for example:
    MOVF id3, R2
    MULF #10.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1

Symbol-table management:
A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly. When an identifier in the source program is detected by the lexical analyzer, the identifier is entered into the symbol table.

Error Handler:
Each phase can encounter errors; a feature of the compiler is to detect and report them.
1. Lexical Analysis – characters may be misspelled
2. Syntax Analysis – the structure of the statement violates the rules of the language
3. Semantic Analysis – no meaning in the operation involved
4. Intermediate Code Generation – operands have incompatible data types
5. Code Optimizer – certain statements may never be reached
6. Code Generation – a constant is too long
7. Symbol Table – multiply-declared variables
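Returning to the symbol table described above, a minimal sketch of one possible layout (the linear-search table and the names used here are only an illustration):

    #include <string.h>

    #define MAX_SYMBOLS 100

    /* One record per identifier, with fields for its attributes. */
    struct symrec {
        char name[32];
        char type[16];              /* e.g. "real", "integer" */
    } symtab[MAX_SYMBOLS];
    static int nsyms = 0;

    /* Return the index of name, inserting it if it is not yet present.
       (Overflow checking is omitted in this sketch.)                   */
    int lookup_or_insert(const char *name) {
        for (int i = 0; i < nsyms; i++)
            if (strcmp(symtab[i].name, name) == 0)
                return i;
        strncpy(symtab[nsyms].name, name, sizeof symtab[nsyms].name - 1);
        symtab[nsyms].type[0] = '\0';   /* attributes filled in later */
        return nsyms++;
    }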

Fig. 3. Output of the phases of the compiler for position := initial + rate * 10:

Lexical analyzer:
    id1 := id2 + id3 * 10
Syntax analyzer:
    syntax tree with ':=' at the root, children id1 and '+'; '+' has children id2 and '*'; '*' has children id3 and 10
Semantic analyzer:
    the same tree with 10 replaced by inttoreal(10)
Intermediate code generator:
    temp1 := inttoreal(10)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
Code optimizer:
    temp1 := id3 * 10.0
    id1 := id2 + temp1
Code generator:
    MOVF id3, R2
    MULF #10.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1

TOPIC-4 COUSINS OF THE COMPILER

Definition: The cousins of the compiler means "the context in which a compiler typically operates". The cousins of the compiler are:
 Preprocessor
 Assembler
 Loader and Link-editor

I) Preprocessor: A preprocessor is a program that processes its input data to produce output that is used as input to another program. The preprocessor is executed before the actual compilation of code begins. Preprocessors may perform the following functions:
1. Macro processing
2. File inclusion
3. "Rational" preprocessors
4. Language extension

1. Macro processing: A macro is a rule or pattern that specifies how a certain input sequence (often a sequence of characters) should be mapped to an output sequence (also often a sequence of characters) according to a defined procedure.
Macro definitions (#define, #undef): To define preprocessor macros we can use #define. Its format is:
    #define identifier replacement
When the preprocessor encounters this directive, it replaces any occurrence of identifier in the rest of the code by replacement.
Example:
    #define TABLE_SIZE 100
    int table1[TABLE_SIZE];
After the preprocessor has replaced TABLE_SIZE, the code becomes equivalent to:
    int table1[100];

2. File inclusion: The preprocessor includes header files into the program text. When the preprocessor finds an #include directive, it replaces it by the entire content of the specified file. There are two ways to specify a file to be included:
    #include "file"
    #include <file>
The only difference between the two forms is the places (directories) where the compiler is going to look for the file.
 In the first case, where the file name is specified between double quotes, the file is searched for first in the directory containing the file with the #include directive. If it is not there, the compiler searches the default directories where it is configured to look for the standard header files.
 If the file name is enclosed between angle brackets <>, the file is searched for directly in the directories where the compiler is configured to look for the standard header files.
Therefore, standard header files are usually included in angle brackets, while other, program-specific header files are included using quotes.
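A tiny runnable illustration of macro replacement (the SQUARE macro is only an example, not from the text):

    #include <stdio.h>      /* standard header, so it is written in angle brackets;
                               a project-local header would be written "myheader.h" */

    #define SQUARE(x) ((x) * (x))     /* function-like macro */

    int main(void) {
        /* Before compilation proper, the preprocessor textually replaces
           SQUARE(5) by ((5) * (5)).                                      */
        printf("%d\n", SQUARE(5));
        return 0;
    }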

3. "Rational" preprocessors: These processors augment older languages with more modern flow-of-control and data-structuring facilities. For example, such a preprocessor might provide the user with built-in macros for constructs like while-statements or if-statements, where none exist in the programming language itself.

4. Language extension: These processors attempt to add capabilities to the language by what amounts to built-in macros. For example, the language Equel is a database query language embedded in C. Statements beginning with ## are taken by the preprocessor to perform the database access.

II) Assembler: Typically a modern assembler creates object code by translating assembly-instruction mnemonics into opcodes, and by resolving symbolic names for memory locations and other entities. There are two types of assemblers, based on how many passes through the source are needed to produce the executable program:
 One-pass
 Two-pass
A one-pass assembler goes through the source code once and assumes that all symbols will be defined before any instruction that references them. Two-pass assemblers create a table with all symbols and their values in the first pass, then use the table in a second pass to generate code.

III) Linkers and Loaders: The process of loading consists of taking relocatable machine code, altering the relocatable addresses and placing the altered instructions and data in memory at the proper locations. A linker or link editor is a program that takes one or more objects generated by a compiler and combines them into a single executable program.

TOPIC-5 COMPILER-CONSTRUCTION TOOLS

Definition: These tools have been developed for helping implement various phases of a compiler. These systems have often been referred to as compiler-compilers, compiler-generators or translator writing systems. Some commonly used compiler-construction tools include:
 Parser generator
 Scanner generator
 Syntax-directed translation engine
 Data-flow engine
 Automatic code generator

I) Scanner generators:
    Input (regular expressions) --------> Output (lexical analyzer)

- Automatically generate lexical analyzers from a specification based on regular expressions.
- The basic organization of the resulting lexical analyzer is a finite automaton.

II) Parser generators:
    Input (context-free grammar) --------> Output (syntax analyzer)
- Produce syntax analyzers from input that is based on a context-free grammar.
- Many parser generators utilize powerful parsing algorithms that are too complex to be carried out by hand.

III) Syntax-directed translation engines:
    Input (parse or syntax tree) --------> Output (intermediate code)
- Produce collections of routines that walk a parse tree, generating intermediate code.
- The basic idea is that one or more "translations" are associated with each node of the parse tree. Each translation is defined in terms of translations at its neighbor nodes in the tree.

IV) Data-flow analysis engines:
    Input (intermediate code) --------> Output (optimized code)
- Gather information about how values are transmitted from one part of a program to each other part.
- Data-flow analysis is a key part of code optimization of intermediate code.

V) Automatic code generators:
    Input (optimized intermediate code) --------> Output (object code)
- A tool takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for a target machine.
- The rules must include sufficient detail that the different possible access methods for data can be handled.

TOPIC-6 GROUPING OF PHASES

Definition: Activities from more than one phase are often grouped together. The phases are collected into a front end and a back end.
 Front and Back Ends:
Front End:
• The front end consists of those phases, or parts of phases, that depend primarily on the source language and are largely independent of the target machine.
• Lexical and syntactic analysis, semantic analysis and the generation of intermediate code are included, together with the associated symbol-table operations.
• A certain amount of code optimization can be done by the front end.
• It also includes the error handling that goes along with each of these phases.

Back End:
• The back end includes those portions of the compiler that depend on the target machine; these portions do not depend on the source language.
• Here we find parts of the code optimization phase and code generation, along with the necessary error handling and symbol-table operations.

 Passes:
 Several phases of compilation are usually implemented in a single pass, consisting of reading an input file and writing an output file.
 It is common for several phases to be grouped into one pass, and for the activity of these phases to be interleaved during the pass.
 E.g.: Lexical analysis, syntax analysis, semantic analysis and intermediate code generation might be grouped into one pass. If so, the token stream after lexical analysis may be translated directly into intermediate code.

 Reducing the number of passes:
• It is desirable to have relatively few passes, since it takes time to read and write intermediate files.
• If we group several phases into one pass, we may be forced to keep the entire program in memory, because one phase may need information in a different order than a previous phase produces it.

CHAPTER-2
TOPIC-I THE ROLE OF THE LEXICAL ANALYZER

The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. As shown in the figure, upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.

    Source program → Lexical analyzer --token--> Parser
                     (the parser issues "get next token"; both consult the symbol table)
Fig 4. Interaction of lexical analyzer with parser

Since the lexical analyzer is the part of the compiler that reads the source text, it may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program comments and white space in the form of blank, tab, and newline characters. Another is correlating error messages from the compiler with the source program. Sometimes lexical analyzers are divided into a cascade of two phases, the first called "scanning" and the second "lexical analysis".

Issues in Lexical Analysis
There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing.
1) Simpler design is the most important consideration. The separation of lexical analysis from syntax analysis often allows us to simplify one or the other of these phases.
2) Compiler efficiency is improved.
3) Compiler portability is enhanced.

Tokens, Patterns and Lexemes
In most programming languages, the following constructs are treated as tokens: keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as parentheses, commas, and semicolons. There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token; the pattern is said to match each string in the set. A lexeme is a sequence of characters in the source program that is matched by the pattern for the token. For example, in the Pascal statement
    const pi = 3.1416;
the substring pi is a lexeme for the token identifier.

TOKEN      SAMPLE LEXEMES          INFORMAL DESCRIPTION OF PATTERN
const      const                   const
if         if                      if
relation   <, <=, =, <>, >, >=     < or <= or = or <> or >= or >
id         pi, count, D2           letter followed by letters and digits
num        3.1416, 0, 6.02E23      any numeric constant
literal    "core dumped"           any characters between " and " except "

A pattern is a rule describing the set of lexemes that can represent a particular token in source programs. The pattern for the token const in the above table is just the single string const that spells out the keyword. In the example, when the character sequence pi appears in the source program, the token representing an identifier is returned to the parser. The returning of a token is often implemented by passing an integer corresponding to the token; it is this integer that is referred to as the boldface id in the above table.

Certain language conventions impact the difficulty of lexical analysis. Languages such as FORTRAN require certain constructs to appear in fixed positions on the input line. Thus the alignment of a lexeme may be important in determining the correctness of a source program.

Attributes of Tokens
The lexical analyzer returns to the parser a representation for the token it has found. The representation is an integer code if the token is a simple construct such as a left parenthesis, comma, or colon. The representation is a pair consisting of an integer code and a pointer to a table if the token is a more complex element such as an identifier or constant. The integer code gives the token type; the pointer points to the value of that token. Pairs are also returned whenever we wish to distinguish between instances of a token.

Input Buffering
The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered; a lookahead pointer scans ahead of the beginning point until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice, each buffering scheme adopts one convention: either a pointer is at the symbol last read, or at the symbol it is ready to read.

    [Figure: two-half input buffer with the token-beginning and lookahead pointers]
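A rough sketch of the two-half buffering scheme just described. The buffer size, the sentinel convention and the helper names are illustrative assumptions of this sketch, not a prescribed interface.

    #include <stdio.h>

    #define HALF 100                     /* size of each buffer half */

    static char buf[2 * HALF + 2];       /* two halves, each followed by a sentinel slot */
    static char *lexeme_beginning;       /* marks the start of the current token */
    static char *forward;                /* lookahead pointer */

    /* Fill one half from the source file and terminate it with an EOF sentinel. */
    static void load_half(FILE *src, char *half) {
        size_t n = fread(half, 1, HALF, src);
        half[n] = (char)EOF;
    }

    static void init_buffer(FILE *src) {
        load_half(src, buf);
        forward = lexeme_beginning = buf;
    }

    /* Advance the lookahead pointer, reloading the other half when a
       sentinel is reached.  (Error handling is omitted in this sketch.) */
    static int next_char(FILE *src) {
        if (*forward == (char)EOF) {
            if (forward == buf + HALF) {                 /* end of first half  */
                load_half(src, buf + HALF + 1);
                forward = buf + HALF + 1;
            } else if (forward == buf + 2 * HALF + 1) {  /* end of second half */
                load_half(src, buf);
                forward = buf;
            } else {
                return EOF;                              /* real end of input  */
            }
        }
        return *forward++;
    }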

The distance which the lookahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see
    DECLARE (ARG1, ARG2, …, ARGn)
Without knowing whether DECLARE is a keyword or an array name, we cannot decide how to treat it until we see the character that follows the right parenthesis; in either case, the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file. Since the buffer shown in the above figure is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead travelled to the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that lookahead is limited.

Specification of Tokens

Strings and Languages
The term alphabet or character class denotes any finite set of symbols. Typical examples of symbols are letters and characters. The set {0, 1} is the binary alphabet. A string over some alphabet is a finite sequence of symbols drawn from that alphabet. In language theory, the terms sentence and word are often used as synonyms for the term "string". Certain terms for parts of a string are prefix, suffix, substring, and subsequence. The term language denotes any set of strings over some fixed alphabet. This definition is very broad: abstract languages like φ, the empty set, or {ε}, the set containing only the empty string, are languages under this definition. There are several important operations, like union, concatenation and closure, that can be applied to languages.

Regular Expressions
In Pascal, an identifier is a letter followed by zero or more letters or digits. Regular expressions allow us to define precisely sets such as this. With this notation, Pascal identifiers may be defined as
    letter (letter | digit)*
The vertical bar here means "or", the parentheses are used to group subexpressions, the star means "zero or more instances of" the parenthesized expression, and the juxtaposition of letter with the remainder of the expression means concatenation. A regular expression is built up out of simpler regular expressions using a set of defining rules. Each regular expression r denotes a language L(r). The defining rules specify how L(r) is formed by combining in various ways the languages denoted by the subexpressions of r. Unnecessary parentheses can be avoided in regular expressions if we adopt the conventions that:
1. the unary operator * has the highest precedence and is left associative,
2. concatenation has the second-highest precedence and is left associative,
3. | has the lowest precedence and is left associative.
For example, under these conventions (a)|((b)*(c)) may be written a|b*c.

Regular Definitions
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
    d1 → r1
    d2 → r2
    …
    dn → rn
where each di is a distinct name, and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, …, di−1}, i.e. the basic symbols and the previously defined names.
Example: The set of Pascal identifiers is the set of strings of letters and digits beginning with a letter. The regular definition for this set is
    letter → A | B | … | Z | a | b | … | z
    digit → 0 | 1 | 2 | … | 9

    id → letter (letter | digit)*
Unsigned numbers in Pascal are strings such as 5280, 56.77, or 6.25E4. The following regular definition provides a precise specification for this class of strings:
    digit → 0 | 1 | … | 9
    digits → digit digit*
This definition says that digit can be any numeral from 0 to 9, while digits is a digit followed by zero or more occurrences of a digit.

Notational Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.
1. One or more instances: the unary postfix operator + means "one or more instances of".
2. Zero or one instance: the unary postfix operator ? means "zero or one instance of". The notation r? is shorthand for r | ε.
3. Character classes: the notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a|b|c. An abbreviated character class such as [a-z] denotes the regular expression a|b|…|z.

Recognition of Tokens
This section considers how to recognize the tokens. The language generated by the following grammar fragment is used as an example:
    stmt → if expr then stmt
         | if expr then stmt else stmt
         | ε
    expr → term relop term
         | term
    term → id
         | num
where the terminals if, then, else, relop, id, and num generate sets of strings given by the following regular definitions:
    if → if
    then → then
    else → else
    relop → < | <= | = | <> | > | >=
    id → letter (letter | digit)*
    num → digit+ (. digit+)? (E (+|-)? digit+)?
For this language fragment the lexical analyzer will recognize the keywords if, then and else, as well as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved; that is, they cannot be used as identifiers. We also assume lexemes are separated by white space, consisting of non-null sequences of blanks, tabs and newlines. Our lexical analyzer will strip out white space. It will do so by comparing a string against the regular definition ws:
    delim → blank | tab | newline
    ws → delim+
If a match for ws is found, the lexical analyzer does not return a token to the parser; rather, it proceeds to find a token following the white space and returns that to the parser. Our goal is to construct a lexical analyzer that will isolate the lexeme for the next token in the input buffer and produce as output a pair consisting of the appropriate token and attribute value, using the translation table given below. The attribute values for the relational operators are given by the symbolic constants LT, LE, EQ, NE, GT, GE.

REGULAR EXPRESSION    TOKEN    ATTRIBUTE VALUE
ws                    –        –
if                    if       –
then                  then     –
else                  else     –
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE

Transition Diagrams
A transition diagram is a stylized flowchart. Transition diagrams are used to keep track of information about characters that are seen as the forward pointer scans the input; we do so by moving from position to position in the diagrams as characters are read. Positions in a transition diagram are drawn as circles and are called states. The states are connected by arrows, called edges. Edges leaving a state s have labels indicating the input characters that can next appear after the transition diagram has reached state s; the label other refers to any character that is not indicated by any of the other edges leaving s. One state is labeled the start state; it is the initial state of the transition diagram, where control resides when we begin to recognize a token. Certain states may have actions that are executed when the flow of control reaches that state. On entering a state we read the next input character; if there is an edge from the current state whose label matches this input character, we go to the state pointed to by the edge. Otherwise, we indicate failure. A transition diagram for >= is shown in the figure: from the start state, an edge labeled > leads to a state from which the edge labeled = leads to an accepting state (return relop with attribute GE), while the edge labeled other leads to an accepting state that returns relop with attribute GT after retracting the lookahead pointer by one character.

Lex – A Lexical Analyzer Generator

Introduction
Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine. Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character-string matching, and produces a program in a general-purpose language which recognizes regular expressions. Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized, the corresponding program fragment is executed. The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.
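Referring back to the >= transition diagram above, here is a small sketch of how such a diagram can be coded directly as a case analysis on states. The state numbers and the token and attribute codes are inventions of this sketch.

    /* Hypothetical token and attribute codes for this sketch. */
    enum { RELOP = 1 };
    enum { GT, GE };

    /* Recognize > or >= at the start of s.  On success, report the token,
       its attribute, and how many characters were consumed (the caller
       retracts the lookahead pointer in the "other" case).                */
    int recognize_gt(const char *s, int *attr, int *consumed) {
        int state = 0;
        for (;;) {
            char c = *s++;
            switch (state) {
            case 0:
                if (c == '>') { state = 6; break; }
                return 0;                                   /* fail */
            case 6:
                if (c == '=') { *attr = GE; *consumed = 2; return RELOP; }
                /* label "other": accept >, one character of lookahead to retract */
                *attr = GT; *consumed = 1; return RELOP;
            }
        }
    }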

Lex can write code in different host languages; the host language is used for the output code generated by Lex and also for the program fragments added by the user. Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user's background, and the properties of local implementations. Lex turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected.

    Source → Lex → yylex
    Input → yylex → Output
An overview of Lex

For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.
    %%
    [ \t]+$   ;
is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate the character class made of blank and tab, the + indicates "one or more", and the $ indicates "end of line", as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters; everything else will be copied. To change any remaining string of blanks or tabs to a single blank, add another rule:
    %%
    [ \t]+$   ;
    [ \t]+    printf(" ");
The finite automaton generated for this source will scan for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or tabs.

Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical-analysis phase; it is particularly easy to interface Lex and Yacc. Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages.

Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context-free grammars, but requires a lower-level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces. Additional programs, written by other generators or by hand, can be added easily to programs written by Lex. Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so the use of this name by Lex simplifies interfacing. The flow of control in such a case (which might be the first half of a compiler, for example) is shown below.

    lexical rules → Lex → yylex
    grammar rules → Yacc → yyparse
    Input → yylex → (tokens) → yyparse → Parsed input
Lex with Yacc

Lex generates a deterministic finite automaton from the regular expressions in the source. The automaton is interpreted, rather than compiled, in order to save space; the result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of rules is the size of the finite automaton, and therefore the size of the program generated by Lex. Lex is not limited to source which can be interpreted on the basis of one-character lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd. Such backup is more costly than the processing of simpler languages.

General format of Lex source is:
    {definitions}
    %%
    {rules}
    %%
    {user subroutines}
where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is %% (no definitions, no rules), which translates into a program that copies the input to the output unchanged.

In the outline of Lex programs shown above, the rules represent the user's control decisions. They form a table in which the left column contains regular expressions and the right column contains actions, program fragments to be executed when the expressions are recognized. Thus an individual rule might appear
    integer    printf("found keyword INT");
to look for the string integer in the input stream and print the message "found keyword INT" whenever it appears. In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch; the automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine. In this example the host procedural language is C and the C library function printf is used to print the string.

The end of the regular expression in a rule is indicated by the first blank or tab character. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling. Lex rules such as
    colour      printf("color");
    mechanise   printf("mechanize");
    petrol      printf("gas");
would be a start.

Lex Regular Expressions
A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters; thus the regular expression integer matches the string integer wherever it appears, and the expression a57D looks for the string a57D.

Metacharacter   Matches
.               any character except newline
\n              newline
*               zero or more copies of the preceding expression
+               one or more copies of the preceding expression
?               zero or one copy of the preceding expression
^               beginning of line
$               end of line
a|b             a or b
(ab)+           one or more copies of ab (grouping)
"a+b"           literal "a+b" (C escapes still work)
[]              character class

Expression      Matches
abc             abc
abc*            ab, abc, abcc, abccc, …
abc+            abc, abcc, abccc, …
a(bc)+          abc, abcbc, abcbcbc, …
a(bc)?          a, abc
[abc]           a, b, c
[a-z]           any letter, a through z
[a\-z]          a, -, z
[-az]           -, a, z
[A-Za-z0-9]+    one or more alphanumeric characters
[ \t\n]+        whitespace
[^ab]           anything except a or b
[a^b]           a, ^, b
[a|b]           a, |, b
a|b             a or b
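Tying this back to the token definitions used earlier in this unit (if, then, else, relop, id, num and ws), the following is a hedged sketch of a Lex specification. The token codes, attribute constants and the use of yylval are assumptions of this sketch; in a real compiler they would normally come from the parser (for example a Yacc-generated header).

    %{
    /* Sketch only: token codes and attribute constants are assumed here;
       with Yacc they would normally come from y.tab.h.                   */
    enum { IF = 258, THEN, ELSE, RELOP, ID, NUM };
    enum { LT, LE, EQ, NE, GT, GE };
    int yylval;
    %}

    delim   [ \t\n]
    ws      {delim}+
    letter  [A-Za-z]
    digit   [0-9]
    id      {letter}({letter}|{digit})*
    number  {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?

    %%
    {ws}      { /* strip white space: no token is returned to the parser */ }
    if        { return IF; }
    then      { return THEN; }
    else      { return ELSE; }
    {id}      { /* an install_id() routine would enter the lexeme in the symbol table */
                return ID; }
    {number}  { return NUM; }
    "<"       { yylval = LT; return RELOP; }
    "<="      { yylval = LE; return RELOP; }
    "="       { yylval = EQ; return RELOP; }
    "<>"      { yylval = NE; return RELOP; }
    ">"       { yylval = GT; return RELOP; }
    ">="      { yylval = GE; return RELOP; }
    %%

Run through lex and compiled with the usual main and yywrap supplied (for example by linking with the Lex library), the generated yylex would return one token per call, which is what the parser expects.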

Lex also provides a number of predefined names:

name               function
int yylex(void)    call to invoke the lexer; returns a token
char *yytext       pointer to the matched string
yyleng             length of the matched string
yylval             value associated with the token
int yywrap(void)   wrap-up; return 1 if done, 0 if not done
FILE *yyout        output file
FILE *yyin         input file
INITIAL            initial start condition
BEGIN              switches the start condition
ECHO               writes the matched string to the output

FINITE AUTOMATA
A recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise. We compile a regular expression into a recognizer by constructing a transition diagram called a finite automaton. A finite automaton can be deterministic or nondeterministic, where nondeterministic means that more than one transition out of a state may be possible on the same input symbol. DFAs are faster recognizers than NFAs, but can be much bigger than equivalent NFAs.

Nondeterministic finite automata (NFA)
A mathematical model consisting of:
1) a set of states S
2) an input alphabet Σ
3) a transition function mapping state–symbol pairs to sets of states
4) an initial (start) state s0
5) a set of final (accepting) states

Transition table (example):
state    a         b
0        {0, 1}    {0}
1        –         {2}
2        –         {3}

Deterministic finite automata (DFA)
A special case of the NFA in which:
1) no state has an ε-transition, and
2) for each state s and input symbol a, there is at most one edge labeled a leaving s.

Conversion of an NFA to a DFA
Subset construction algorithm.  Input: an NFA N.  Output: an equivalent DFA D.
Operations on NFA states:

operation        description
ε-closure(s)     set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T)     set of NFA states reachable from some NFA state s in T on ε-transitions alone
move(T, a)       set of NFA states to which there is a transition on input symbol a from some NFA state s in T

Subset construction:
    Initially, ε-closure(s0) is the only state in Dstates, and it is unmarked.
    while there is an unmarked state T in Dstates do begin
        mark T;
        for each input symbol a do begin
            U := ε-closure(move(T, a));
            if U is not in Dstates then
                add U as an unmarked state to Dstates;
            Dtran[T, a] := U
        end
    end
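A compact sketch of the two set operations at the heart of the subset construction, representing each set of NFA states as a bitmask. The 32-state limit and the array layout are assumptions of this sketch.

    #include <stdint.h>

    #define MAX_STATES 32
    #define NSYMS      2            /* example alphabet {a, b}, encoded as 0 and 1 */

    /* NFA encoded as bitmasks: trans[s][c] = set of states reachable from
       state s on symbol c; eps[s] = set reachable from s on one ε-move.   */
    uint32_t trans[MAX_STATES][NSYMS];
    uint32_t eps[MAX_STATES];

    /* ε-closure(T): all states reachable from T using ε-transitions only. */
    uint32_t eps_closure(uint32_t T) {
        uint32_t closure = T, prev;
        do {
            prev = closure;
            for (int s = 0; s < MAX_STATES; s++)
                if (closure & (1u << s))
                    closure |= eps[s];
        } while (closure != prev);
        return closure;
    }

    /* move(T, a): states reachable from some state in T on input symbol a. */
    uint32_t move_set(uint32_t T, int a) {
        uint32_t result = 0;
        for (int s = 0; s < MAX_STATES; s++)
            if (T & (1u << s))
                result |= trans[s][a];
        return result;
    }

    /* One step of the subset construction: U := ε-closure(move(T, a)). */
    uint32_t dfa_step(uint32_t T, int a) {
        return eps_closure(move_set(T, a));
    }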

CONSTRUCTION OF AN NFA FROM A REGULAR EXPRESSION
Thompson's Construction
To convert a regular expression r over an alphabet Σ into an NFA N accepting L(r):
 Parse r into its constituent sub-expressions.
 Construct NFAs for each of the basic symbols in r.
 For ε, construct the NFA
    start → (i) --ε--> ((f))
Here i is a new start state and f is a new accepting state. This NFA recognizes {ε}.
 For a in Σ, construct the NFA
    start → (i) --a--> ((f))
Again i is a new start state and f is a new accepting state. This NFA accepts {a}.

 If a occurs several times in r, then a separate NFA is constructed for each occurrence.
 Suppose N(s) and N(t) are NFAs for the regular expressions s and t.
 Keeping the syntactic structure of the regular expression in mind, combine these NFAs inductively until the NFA for the entire expression is obtained:
(a) For the regular expression s|t, construct the composite NFA N(s|t): a new start state i has ε-edges to the start states of N(s) and N(t), and the final states of N(s) and N(t) have ε-edges to a new final state f.
(b) For the regular expression st, construct the composite NFA N(st): the final state of N(s) is identified with the start state of N(t), so N(s) and N(t) are run in sequence.
(c) For the regular expression s*, construct the composite NFA N(s*): a new start state i and a new final state f are added, with ε-edges from i to f (zero occurrences), from i to the start of N(s), from the final state of N(s) back to the start of N(s), and from the final state of N(s) to f.
(d) For the parenthesized regular expression (s), use N(s) itself as the NFA.
Each intermediate NFA produced during the course of the construction corresponds to a sub-expression r and has several important properties: it has exactly one final state, no edge enters the start state, and no edge leaves the final state.
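A compact sketch of these rules in code. The fragment representation below (one start and one accepting state, at most two outgoing edges per state) and all of its names are assumptions of this sketch, not a full implementation.

    #include <stdlib.h>

    #define EPS 0                      /* edge label 0 stands for an ε-edge; a label
                                          with a NULL target means "no edge"         */

    typedef struct State {
        int c1, c2;                    /* labels of the two possible outgoing edges  */
        struct State *out1, *out2;
    } State;

    /* An NFA fragment: one start state and one accepting state; no edge
       leaves the accepting state (the key invariant of the construction). */
    typedef struct { State *start, *accept; } Frag;

    static State *new_state(void) { return calloc(1, sizeof(State)); }

    /* Basic NFA for a single symbol a (or for ε when a == EPS). */
    static Frag frag_symbol(int a) {
        Frag f = { new_state(), new_state() };
        f.start->c1 = a; f.start->out1 = f.accept;
        return f;
    }

    /* N(st): link the accepting state of s to the start state of t by an ε-edge. */
    static Frag frag_concat(Frag s, Frag t) {
        s.accept->c1 = EPS; s.accept->out1 = t.start;
        return (Frag){ s.start, t.accept };
    }

    /* N(s|t): a new start state with ε-edges to both fragments; both accepting
       states get ε-edges to a new accepting state.                              */
    static Frag frag_union(Frag s, Frag t) {
        Frag f = { new_state(), new_state() };
        f.start->c1 = EPS; f.start->out1 = s.start;
        f.start->c2 = EPS; f.start->out2 = t.start;
        s.accept->c1 = EPS; s.accept->out1 = f.accept;
        t.accept->c1 = EPS; t.accept->out1 = f.accept;
        return f;
    }

    /* N(s*): ε-edges allow zero occurrences and looping back for repetition. */
    static Frag frag_star(Frag s) {
        Frag f = { new_state(), new_state() };
        f.start->c1 = EPS; f.start->out1 = s.start;
        f.start->c2 = EPS; f.start->out2 = f.accept;    /* zero occurrences */
        s.accept->c1 = EPS; s.accept->out1 = s.start;   /* loop back        */
        s.accept->c2 = EPS; s.accept->out2 = f.accept;
        return f;
    }

For example, the NFA for (a|b)*abb could then be assembled as frag_concat(frag_concat(frag_concat(frag_star(frag_union(frag_symbol('a'), frag_symbol('b'))), frag_symbol('a')), frag_symbol('b')), frag_symbol('b')).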
