1.3.1 The Phases of Compiler

As we have discussed earlier, the process of compilation is carried out in two parts: analysis and synthesis. The analysis is carried out in three phases: lexical analysis, syntax analysis and semantic analysis. The synthesis is carried out with the help of intermediate code generation, code optimization and code generation. Let us discuss these phases one by one.

1. Lexical Analysis

Lexical analysis is also called scanning. It is the phase of compilation in which the complete source code is scanned and the source program is broken up into groups of strings called tokens. A token is a sequence of characters having a collective meaning. For example, suppose an assignment statement in the source code is as follows:

    total = count + rate * 10

Compiler Design 1-5 Introduction to Compiling

Then in the lexical analysis phase this statement is broken up into the following series of tokens:
1. The identifier total.
2. The assignment symbol =.
3. The identifier count.
4. The plus sign.
5. The identifier rate.
6. The multiplication sign.
7. The constant number 10.

The blank characters which are used in the programming statement are eliminated during the lexical analysis phase.

2. Syntax Analysis

Syntax analysis is also called parsing. In this phase the tokens generated by the lexical analyzer are grouped together to form a hierarchical structure. The syntax analysis determines the structure of the source string by grouping the tokens together. The hierarchical structure generated in this phase is called a parse tree or syntax tree. For the expression total = count + rate * 10 the parse tree can be generated as follows:

              =
            /   \
       total     +
               /   \
          count     *
                  /   \
              rate     10

Fig. 1.5 Parse tree for total = count + rate * 10

In the statement 'total = count + rate * 10', first of all rate * 10 will be considered, because in an arithmetic expression the multiplication operation is performed before the addition. Then the addition operation will be considered.
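The tokenization step described under lexical analysis above can be sketched in a few lines. This is a minimal illustrative scanner, not the lexer of any particular compiler; the token names below are simply chosen to match the list given above.

```python
# Minimal tokenizer sketch for statements like "total = count + rate * 10".
# Blanks are skipped, mirroring how the lexical analysis phase eliminates them.
import re

TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("ASSIGN",     r"="),
    ("PLUS",       r"\+"),
    ("STAR",       r"\*"),
    ("SKIP",       r"\s+"),   # blank characters are eliminated, not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("total = count + rate * 10"))
```

Running it on the statement above yields exactly the seven tokens listed: total, =, count, +, rate, * and 10.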
For building such a syntax tree, production rules have to be designed. The rules are usually expressed by a context free grammar. For the above statement the production rules are:

(1) E → identifier
(2) E → number
(3) E → E1 + E2
(4) E → E1 * E2
(5) E → ( E1 )

where E stands for an expression.

* By rule (1) count and rate are expressions, and
* by rule (2) 10 is also an expression.
* By rule (4) we get rate * 10 as an expression.
* And finally count + rate * 10 is an expression.

3. Semantic Analysis

Once the syntax is checked in the syntax analyzer phase, the next phase, i.e. semantic analysis, determines the meaning of the source string. For example, determining the meaning of the source string involves matching of parentheses in the expression, matching of if...else statements, performing arithmetic operations on expressions that are type compatible, or checking the scope of operations.

Fig. 1.6 Semantic analysis: the conversion int_to_float is applied to the constant 10 before it is multiplied by rate

Thus these three phases perform the task of analysis. After these phases an intermediate code gets generated.

4. Intermediate Code Generation

The intermediate code is a kind of code which is easy to generate, and this code can be easily converted to the target code. This code comes in a variety of forms such as three address code, quadruples and triples. Here we will consider the three address code form, which is like an assembly language. The three address code consists of instructions, each of which has at most three operands. For example:

    t1 := int_to_float(10)
    t2 := rate * t1
    t3 := count + t2
    total := t3

There are certain properties which should be possessed by the three address code, and those are:
1. Each three address instruction has at most one operator in addition to the assignment. Thus the compiler has to decide the order of the operations devised by the three address code.
2. The compiler must generate a temporary name to hold the value computed by each instruction.
3.
Some three address instructions may have fewer than three operands, for example the first and the last instructions of the above given three address code.

5. Code Optimization

The code optimization phase attempts to improve the intermediate code. This is necessary to obtain faster executing code or less consumption of memory. Thus by optimizing the code the overall running time of the target program can be improved.

6. Code Generation

In the code generation phase the target code gets generated. The intermediate code instructions are translated into a sequence of machine instructions:

    MOV rate, R1
    MUL #10.0, R1
    MOV count, R2
    ADD R2, R1
    MOV R1, total

Example: Show how the input a = b + c * 60 gets processed in the compiler. Show the output at each stage of the compiler. Also show the contents of the symbol table.

Solution:

Lexical analysis: a = b + c * 60 becomes id1 = id2 + id3 * 60.

Syntax analysis: a parse tree is built with = at the root, id1 as its left child, and + above id2 and the subtree id3 * 60.

Semantic analysis: the conversion int_to_float(60) is inserted so that the multiplication is type compatible.

Intermediate code generation:

    t1 := int_to_float(60)
    t2 := id3 * t1
    t3 := id2 + t2
    id1 := t3

Code optimization:

    t1 := id3 * 60.0
    id1 := id2 + t1

Symbol table: entries for the identifiers a (id1), b (id2) and c (id3).

Symbol table management

* To support these phases of the compiler a symbol table is maintained. The task of the symbol table is to store the identifiers (variables) used in the program. The symbol table also stores information about the attributes of each identifier. The attributes of an identifier are usually its type, its scope, and information about the storage allocated for it.
* The symbol table also stores information about the subroutines used in the program. In the case of a subroutine, the symbol table stores the name of the subroutine, the number of arguments passed to it, the types of these arguments, the method of passing these arguments (call by value or call by reference) and the return type, if any.
* Basically a symbol table is a data structure used to store information about identifiers.
* The symbol table allows us to find the record for each identifier quickly, and to store or retrieve data from that record efficiently.
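A symbol table of the kind described in these points can be sketched as a small dictionary from names to attribute records. This is a schematic illustration only (real compilers add hashing and scope handling); the attribute names used below are invented for the example.

```python
# Sketch of a symbol table: maps an identifier to a record of attributes.
# The attribute fields (type, size) follow the text above; the concrete
# values are made up for illustration.
class SymbolTable:
    def __init__(self):
        self.table = {}

    def insert(self, name, **attributes):
        # The lexical analyzer makes the first (empty) entry; later phases
        # fill in attributes such as the type and the storage allocated.
        self.table.setdefault(name, {}).update(attributes)

    def lookup(self, name):
        # Returns the record for the identifier, or None if absent.
        return self.table.get(name)

st = SymbolTable()
st.insert("total")                        # entered by the lexical analyzer
st.insert("total", type="float", size=4)  # attributes added by later phases
print(st.lookup("total"))
```

The point of the two insert calls is the division of labour described above: the scanner only records the name, and later phases attach the attributes to the same record.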
* During compilation the lexical analyzer detects the identifiers and makes their entries in the symbol table. However, the lexical analyzer cannot determine all the attributes of an identifier, and therefore the attributes are entered by the remaining phases of the compiler.
* Various phases can use the symbol table in various ways. For example, while doing the semantic analysis and intermediate code generation we need to know what the types of the identifiers are. Then during code generation, information about how much storage is allocated to an identifier is typically needed.

Error detection and handling

To err is human. As programs are written by human beings they cannot be free from errors. In compilation, each phase detects errors. These errors must be reported to an error handler whose task is to handle the errors so that the compilation can proceed. Normally the errors are reported in the form of messages. When the characters from the input do not form a token, the lexical analyzer detects it as an error. A large number of errors can be detected in the syntax analysis phase; such errors are popularly called syntax errors. During semantic analysis, type mismatch kinds of errors are detected.

Fig. 1.7 Phases of compiler (source program passing through all the phases down to target machine code)

1.4 Language Processors

1. Preprocessors - The output of a preprocessor may be given as the input to a compiler. The tasks done by preprocessors are given below. A preprocessor replaces macros in the program: a macro means some set of instructions which can be used repeatedly in the program, and this macro processing is done by the preprocessor. A preprocessor also performs file inclusion. For example:

    #include <stdio.h>

By this statement the header file stdio.h can be included, and the user can make use of the functions defined in this header file. This task of the preprocessor is called file inclusion.

Fig. 1.8 Role of preprocessor (skeleton source program → preprocessor → source program)

Macro Preprocessor - A macro is a small set of instructions.
Whenever a macro name is identified in a program, that name is replaced by the macro definition (i.e. the set of instructions defining the corresponding macro). For a macro there are two kinds of statements: macro definition and macro use. The macro definition is given by a keyword like "define" or "macro" followed by the name of the macro. For example:

    #define PI 3.14

Whenever "PI" is encountered in the program it is replaced by the value 3.14.

2. Assemblers - Some compilers produce assembly code as output, which is given to an assembler as input. The assembler is a kind of translator which takes the assembly program as input and produces machine code as output.

Fig. 1.9 Role of assembler (source assembly program → assembler → machine code)

An assembly code is a mnemonic version of machine code. Typical assembly instructions are as given below:

    MOV a, R1
    MUL #5, R1
    ADD #7, R1
    MOV R1, b

The assembler converts these instructions into a binary language which can be understood by the machine. Such a binary code is often called machine code. This machine code is a relocatable machine code that can be passed directly to the loader/linker for execution purposes.

The assembler converts the assembly program to low level machine language using two passes. A pass means one complete scan of the input program. The output of the second pass is the relocatable machine code.

3. Loaders and Link Editors - A loader is a program which performs two functions: loading and link editing. Loading is a process in which the relocatable machine code is read and the relocatable addresses are altered. Then that code with the altered instructions and data is placed in the memory at the proper location. The job of the link editor is to make a single program from several files of relocatable machine code. If code in one file refers to a location in another file, then such a reference is called an external reference.
The link editor resolves such external references as well.

Fig. Role of loader (relocatable machine code → loading and link editing)

2 Lexical Analysis

2.1 Introduction

The process of compilation starts with the first phase, called lexical analysis. In this phase the input is scanned completely in order to identify the tokens. The token structure can be recognized with the help of some diagrams. These diagrams are popularly known as finite automata, and to construct such finite automata regular expressions are used. These diagrams can be translated into a program for identifying tokens. In this chapter we will see what the role of the lexical analyzer is in the compilation process. We will also discuss the method of identifying tokens from the source program. Finally we will learn about LEX, a tool which automates the construction of lexical analyzers.

2.2 Role of Lexical Analyzer

The lexical analyzer is the first phase of the compiler. The lexical analyzer reads the input source program from left to right one character at a time and generates the sequence of tokens. Each token is a single logical cohesive unit such as an identifier, keyword, operator or punctuation mark. The parser can then use these tokens to determine the syntax of the source program. The role of the lexical analyzer in the process of compilation is as shown below.

Fig. 2.1 Role of lexical analyzer (input string → lexical analyzer, which supplies tokens on demand to the parser; the parser produces the syntax tree)

As the lexical analyzer scans the source program to recognize the tokens, it is also called a scanner. Apart from token identification the lexical analyzer also performs the following functions.

Functions of lexical analyzer
1. It produces the stream of tokens.
2. It eliminates blanks and comments.
3. It generates the symbol table, which stores the information about identifiers and constants encountered in the input.
4. It keeps track of line numbers.
5. It reports the errors encountered while generating the tokens.

The lexical analyzer works in two phases.
In the first phase it performs scanning, and in the second phase it does lexical analysis, which means it generates the series of tokens.

2.2.1 Tokens, Patterns, Lexemes

Let us learn some terminology which is frequently used when we talk about the activity of lexical analysis.

Tokens: A token describes the class or category of an input string. For example, identifiers, keywords and constants are called tokens.

Patterns: A pattern is the set of rules that describe the token.

Lexemes: A lexeme is a sequence of characters in the source program that is matched with the pattern of a token. For example: int, i, num, ans, choice.

Let us take one example of a programming statement to clearly understand these terms:

    if (a < b)

Here "if", "(", "a", "<", "b" and ")" are all lexemes. "if" is a keyword, "(" is an opening parenthesis, "a" is an identifier, "<" is an operator, and so on. The pattern for an identifier could be:
1. An identifier is a collection of letters.
2. An identifier is a collection of alphanumeric characters whose beginning character must necessarily be a letter.

Since the lexical analyzer scans the source program and produces a sequence of tokens, it is also called a scanner. For example, a piece of source code is given below:

    int MAX (int a, int b)
    {
        if (a > b)
            return a;
        else
            return b;
    }

The blank and new line characters can be ignored. This stream of tokens will be given to the syntax analyzer.

2.3 Input Buffering

The lexical analyzer scans the input string from left to right one character at a time. It uses two pointers, begin_ptr (bp) and forward_ptr (fp), to keep track of the portion of the input scanned. Initially both pointers point to the first character of the input string, as shown below.

Fig. 2.2 Initial configuration (bp and fp both at the first character)

The forward_ptr moves ahead to search for the end of the lexeme. As soon as a blank space is encountered it indicates the end of the lexeme. In the above example, as soon as forward_ptr (fp) encounters a blank space, the lexeme "int" is identified.
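The bp/fp movement just described can be sketched as follows. This is a simplified illustration: as in the text, a blank space is taken to be the only end-of-lexeme marker.

```python
# Sketch of the bp/fp scheme: fp advances until a blank is seen, then the
# characters between bp and fp form one lexeme and both pointers move on.
def scan_lexemes(source):
    lexemes = []
    bp = fp = 0
    while fp <= len(source):
        at_end = fp == len(source)
        if at_end or source[fp] == " ":   # blank marks the end of a lexeme
            if fp > bp:
                lexemes.append(source[bp:fp])
            bp = fp + 1                   # bp jumps past the blank
        fp += 1
    return lexemes

print(scan_lexemes("int i = 10"))
```

For the input "int i = 10" the function identifies the lexemes int, i, = and 10, exactly as the pointer description above suggests.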
Fig. 2.3 Input buffering (fp halts at the blank space after "int")

The fp is moved ahead on white space: when fp encounters white space it ignores it and moves ahead, and then both begin_ptr (bp) and forward_ptr (fp) are set at the next token.

Fig. 2.4 Input buffering (bp and fp positioned at the start of the next lexeme)

The input characters are thus read from secondary storage. But reading in this way from secondary storage is costly. Hence a buffering technique is used. A block of data is first read into a buffer, and then scanned by the lexical analyzer. There are two methods used in this context: the one buffer scheme and the two buffer scheme.

1. One buffer scheme

In the one buffer scheme only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long then it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, and that overwrites the first part of the lexeme.

Fig. 2.5 One buffer scheme for storing the input string

2. Two buffer scheme

To overcome the problem of the one buffer scheme, in this method two buffers are used to store the input string. The first and the second buffer are scanned alternately: when the end of the current buffer is reached, the other buffer is filled. The only problem with this method is that if the length of a lexeme is longer than the length of a buffer, then the complete lexeme cannot be scanned.

Fig. 2.6 Two buffer scheme for storing the input string

Initially both bp and fp point to the first character of the first buffer. Then fp moves towards the right in search of the end of the lexeme. As soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To identify the boundary of the first buffer, an end of buffer character should be placed at the end of the first buffer. Similarly, the end of the second buffer is recognized by the end of buffer mark present at the end of the second buffer. When fp encounters the first eof, the end of the first buffer is recognized and hence the filling up of the second buffer is started.
In the same way, when the second eof is obtained it indicates the end of the second buffer. Alternately both the buffers are filled up until the end of the input program, and the stream of tokens is identified. This eof character introduced at the end is called a sentinel, and it is used to identify the end of a buffer.

Code for input buffering (a sketch, where eof(...) stands for the test for the sentinel at the end of the given buffer):

    if (fp == eof(buff1)) {          /* encountered end of first buffer */
        /* refill buffer2 */
        fp++;
    }
    else if (fp == eof(buff2)) {     /* encountered end of second buffer */
        /* refill buffer1 */
        fp++;
    }
    else if (fp == eof(input))
        return;                      /* terminate scanning */
    else
        fp++;                        /* still remaining input has to be scanned */

2.4 Specification of Tokens

To specify tokens, regular expressions are used. When a pattern is matched by some regular expression, the token can be recognized. Let us first understand the fundamental concepts of language.

2.4.1 Strings and Language

A string is a collection of a finite number of alphabets or letters. Strings are synonymously called words.

* The length of a string S is denoted by |S|.
* The empty string is denoted by ε.
* The empty set of strings is denoted by φ.

The following terms are commonly used for strings.

* Prefix of string: a string obtained by removing zero or more trailing symbols. For example, for the string Hindustan a prefix could be 'Hindu'.
* Suffix of string: a string obtained by removing zero or more leading symbols. For example, for the string Hindustan a suffix could be 'stan'.
* Substring: a string obtained by removing a prefix and a suffix of a given string. For example, for the string Hindustan, 'ndus' is a substring.
* Subsequence of string: any string formed by removing zero or more, not necessarily contiguous, symbols. For example, for the string Hindustan, 'Hista' is a subsequence.

2.4.2 Operations on Language

As we have seen, a language is a collection of strings.
There are various operations which can be performed on a language.

* Union of two languages L1 and L2: L1 ∪ L2 = {set of strings in L1 and set of strings in L2}.
* Concatenation of two languages L1 and L2: L1L2 = {set of strings in L1 followed by set of strings in L2}.
* Kleene closure of L: L* denotes zero or more concatenations of L, i.e. L* = L0 ∪ L1 ∪ L2 ∪ ...
* Positive closure of L: L+ denotes one or more concatenations of L, i.e. L+ = L1 ∪ L2 ∪ ...

For example, let L be the set of alphabets, L = {A, B, C, ..., Z, a, b, c, ..., z}, and let D be the set of digits, D = {0, 1, 2, ..., 9}. Then by performing the various operations discussed above, new languages can be generated as follows:

* L ∪ D is the set of letters and digits.
* LD is the set of strings consisting of a letter followed by a digit.
* L3 is the set of all three-letter strings.
* L* is the set of all strings of letters, including ε.
* L+ is the set of all strings of letters, except ε.

2.4.3 Regular Expressions

Regular expressions are mathematical symbolisms which describe the set of strings of a specific language. They provide a convenient and useful notation for representing tokens. Here are some rules that describe the definition of the regular expressions over the input set denoted by Σ:

1. ε is a regular expression that denotes the set containing the empty string.
2. If R1 and R2 are regular expressions then R = R1 + R2 (the same can also be represented as R = R1 | R2) is also a regular expression, which represents the union operation.
3. If R1 and R2 are regular expressions then R = R1.R2 is also a regular expression, which represents the concatenation operation.
4. If R1 is a regular expression then R = R1* is also a regular expression, which represents the Kleene closure.

A language denoted by a regular expression is said to be a regular set or a regular language. Let us see some examples of regular expressions.

Example 2.1: Write a regular expression (R.E.) for the language containing the strings of length two over Σ = {0, 1}.

Solution: R.E.
= (0+1) (0+1)

Example 2.2: Write a regular expression for the language containing strings which end with "abb" over Σ = {a, b}.

Solution: R.E. = (a+b)*abb

Example 2.3: Write a regular expression for recognizing an identifier.

Solution: For denoting an identifier we will consider a set of letters and digits, because an identifier is a combination of letters, or of letters and digits, but always having a letter as its first character. Hence the R.E. can be denoted as

    R.E. = letter (letter + digit)*

where letter = (A + B + ... + Z + a + b + ... + z) and digit = (0 + 1 + 2 + ... + 9).

Example 2.4: Design the regular expression for the language accepting all combinations of a's except the null string, over Σ = {a}.

Solution: The regular expression has to be built for the language L = {a, aa, aaa, ...}. This set indicates that there is no null string, so we can write

    R.E. = a+

The + is called the positive closure.

Example 2.5: Design the regular expression for the language containing all the strings with any number of a's and b's.

Solution: The R.E. will be

    R.E. = (a + b)*

The set for this R.E. will be L = {ε, a, aa, ab, b, ba, bab, abab, ..., any combination of a and b}. The (a + b)* means any combination of a and b, including the null string.

Example 2.6: Construct a regular expression for the language containing all strings having any number of a's and b's except the null string.

Solution: R.E. = (a + b)+

This regular expression gives the set of strings of any combination of a's and b's except the null string.

Example 2.7: Construct the R.E. for the language accepting all the strings which end with 00, over Σ = {0, 1}.

Solution: The R.E. has to be formed such that at the end there is 00. That means

    R.E. = (any combination of 0's and 1's) 00
    R.E. = (0 + 1)* 00

Thus the valid strings are 100, 0100, 1000 and so on.

Example 2.8: Write the R.E. for the language of all the strings which start with 1 and end with 00, over Σ = {0, 1}.

Solution: The first symbol in the R.E. should be 1, and at the end there should be 00. Hence

    R.E. = 1 (0 + 1)* 00
Example 2.9: Write a regular expression to denote the language L over Σ*, where Σ = {a, b, c}, in which every string is such that any number of a's is followed by any number of b's, followed by any number of c's.

Solution: Any number of a's means a*. Similarly, any number of b's and any number of c's means b* and c*. So the regular expression is

    R.E. = a* b* c*

Example 2.10: Write the R.E. to denote a language L over Σ*, where Σ = {a, b}, such that the third character from the right end of the string is always a.

Solution:

    R.E. = (a + b)* a (a + b) (a + b)

Thus the valid strings are babb, baaa, abb, baba, aaab and so on.

Example 2.11: Construct the R.E. for the language which consists of exactly two b's over the set Σ = {a, b}.

Solution: There should be exactly two b's. Hence

    R.E. = a* b a* b a*

a* indicates either any number of a's or the null string. Thus we can derive any string having exactly two b's and any number of a's.

2.4.4 Notations used for Representing Regular Expressions

Regular expressions are tiny units which are useful for representing the set of strings belonging to some specific language. Let us see the notations used for writing regular expressions.

1. One or more instances - To represent one or more instances the + sign is used. If r is a regular expression then r+ denotes one or more occurrences of r. For example, the set of strings in which there are one or more occurrences of 'a' over the input set {a} can be written as a+. It basically denotes the set {a, aa, aaa, aaaa, ...}.

2. Zero or more instances - To represent zero or more instances the * sign is used. If r is a regular expression then r* denotes zero or more occurrences of r. For example, consider the set of strings in which there are zero or more occurrences of 'a'
For there are zero or more occurrences of a Compiler Design 2-10 Lexleal Anatyay Compiler Design over the input set {a} then the regular expression can be written as a* h basically denotes the set of {e, a, aa, aaa,...)- 3, Character classes - A class of symbols can be denoted by [ ]. For exampie [012] means 0 or 1 or 2. Similarly a complete class of a small letters troy atoz can be represented by a regular expression [a-z]. The hyphen indicates the range. We can also write a regular expression for representing any wor of small letters as [a-z]*. ‘as we know the language can be represented by regular expression but not all the languages. are represented by RE. such a set of languages is called non regular language. Let us understand them. 2.4.5 Non Regular Language A language which cannot be described by a regular expression is called a non regular language and the set denoted by such language is called non regular set. There are some languages which cannot be described by regular expressions. For example we cannot write a regular expression to check whether the given language is palindrome or not. Similarly we cannot write a regular expression to check whether the string is having balanced parenthesis. Thus regular expressions can be used to denote only fixed number of repetitions Unspecific information cannot be represented by regular expression. 2.5 Recognition of Tokens For a programming language there are various types of tokens such as identifier keywords, constants, and operators and so on. The token is pair token type and token value. usually represented by 1 For example ; We will consider some encoding of tokens as follows Consider, a program code as if (acl0) isit2; else imi-2; Our lexical analyzer will generate following token stream. 1, 8,1), 6,100), (7,1), (6,105), (8,2), (5,107), 10, (5,107), (9,1), (6,110), 25,107), 10, 6,107), (9,2), (6,110). 
The corresponding symbol table for identifiers and constants will be:

    Location    Type          Lexeme
    100         identifier    a
    105         constant      10
    107         identifier    i
    110         constant      2

In the above example the scanner scans the input string and recognizes "if" as a keyword, and returns token type 1, since in the given encoding code 1 indicates the keyword "if"; hence 1 is at the beginning of the token stream. Next is the pair (8,1), where 8 indicates a parenthesis and 1 indicates the opening parenthesis '('. Then we scan the input 'a': the scanner recognizes it as an identifier and searches the symbol table to check whether an entry for it is already present. If not, it inserts the information about this identifier into the symbol table and returns the location 100. If the same identifier or variable is already present in the symbol table, then the lexical analyzer does not insert it into the table; instead it returns the location where it is present.

3 Syntax Analysis

The modified view of the front end is as shown below.

Fig. 3.2 Front end of compiler

3.1.2 Role of Parser

In the process of compilation the parser and the lexical analyzer work together. That means that when the parser requires a string of tokens it invokes the lexical analyzer. In turn, the lexical analyzer supplies the tokens to the syntax analyzer (parser).

Fig. 3.3 Role of parser

The parser collects a sufficient number of tokens and builds a parse tree. Thus by building the parse tree the parser smartly finds the syntactical errors, if any. It is also necessary that the parser should recover from commonly occurring errors, so that the processing of the remaining input can be continued.

3.1.3 Why are the Lexical and Syntax Analyzers Separated?

The lexical analyzer scans the input program and collects the tokens from it. On the other hand, the parser builds a parse tree using these tokens. These are two important activities, and they are carried out independently by these two phases. Separating out these two phases has two advantages: firstly, it accelerates the process of compilation, and secondly, the errors in the source input can be identified precisely.
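The scanner behaviour described in the example above, returning (token type, value) pairs and consulting the symbol table before handing tokens over to the parser, can be sketched as follows. The numeric codes are the illustrative ones from the example (1 = if, 2 = else, 5 = identifier, 6 = constant), not a standard; locations here are simply assigned sequentially, so they differ from the 100/105/107/110 used in the text.

```python
# Sketch: scanner returning (token-type, value) pairs and sharing a
# symbol table with the parser. Codes follow the running example:
# 1 = "if", 2 = "else", 5 = identifier, 6 = constant.
KEYWORDS = {"if": 1, "else": 2}

symbol_table = {}      # lexeme -> symbol table location
next_location = 100

def token(lexeme):
    global next_location
    if lexeme in KEYWORDS:                  # keywords carry no value part
        return KEYWORDS[lexeme]
    kind = 6 if lexeme.isdigit() else 5     # constant or identifier
    if lexeme not in symbol_table:          # insert only on first occurrence
        symbol_table[lexeme] = next_location
        next_location += 1
    return (kind, symbol_table[lexeme])

stream = [token(lx) for lx in ["if", "a", "10", "i", "i", "2"]]
print(stream)
```

Note that the second occurrence of "i" produces the same pair as the first: the scanner finds the existing entry and returns its location instead of inserting a new one, exactly as described above.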
These are two important Activities and these activities are independently carried out by these two phases ting out these two phases has two advantages - Firstly it accelerates the Process compilation and secondly the errors in the source input can be identified precisely complerDeegn 3.1.4 Context Free Grammar ; , collection of following things. The context free grammar G is @ 1. V is a set of non-terminals. 2. Tis a set of terminals. 3. Sis a start symbol. 4. P is a set of production rules. ‘Thus G can be represented as G = (V.T/SP) ‘The production rules are given in following form- Non-terminal (V U T)* ‘mp Example 3.1: Let the language L = a"b" where nT Let G = (V,T,S,P ) Where'V = {S}, T = {a,b} And S is a start symbol then, give the production rules. Solution : Pe{ Ss + aSb S + ab } The production rules actually defines the language a"b". Syntax Analyay | The non-terminal symbol occurs at the left symbo hand side. The i te : . These are the Is which ale oi oe] ae nothing but the tokens used in the as can lefined by grammar. For example if we want to define the declarative a eee ones grammat then it could be as follows. State > Type List Terminator Type > int | float List + Listid List + id Terminator + ; Using above rules we can derive, State Type List Terminator in XN | . id Fig. 3.4 Parse tree for derivation of int id, Id, id; Hence int id, id, id; can be defined by means of above Context Free Grammar. Following rules to be followed while writing a CFG. 1) A single non-terminal should be at LHS. 2) The rule should be always in the form of LHS. + RH. where RHS inay be the combination of non-terminal and terminal symbols. 3) The NULL derivation can be specified as NT > . 4) One of the non-terminals should be start symbol and conventionally we should write the rules for this non-terminal. 32 Derivation and Parse Trees Derivation from $ means generation of string w from S. For deivation two things are important. ') Choice of non-terminal from several others. 
ii) Choice of rule from the production rules for the corresponding non-terminal.

Definition of Derivation Tree: Let G = (V, T, P, S) be a context free grammar. The derivation tree is a tree which can be constructed with the following properties:
i) The root has the label S.
ii) Every vertex has a label from V ∪ T ∪ {ε}.
iii) If there exists a vertex A with children R1, R2, ..., Rn, then there should be a production A → R1 R2 ... Rn.
iv) The leaf nodes are from the set T ∪ {ε}, and the interior nodes are from the set V.

Instead of choosing an arbitrary non-terminal, one can choose:
i) the leftmost non-terminal in a sentential form; such a derivation is called a leftmost derivation;
ii) the rightmost non-terminal in a sentential form; such a derivation is called a rightmost derivation.

Example 3.2: Let G be a context free grammar for which the production rules are given below:

    S → aB | bA
    A → a | aS | bAA
    B → b | bS | aBB

Derive the string aaabbabbba using the above grammar.

Solution:

Leftmost derivation:
    S ⇒ aB ⇒ aaBB ⇒ aaaBBB ⇒ aaabSBB ⇒ aaabbABB ⇒ aaabbaBB ⇒ aaabbabB ⇒ aaabbabbS ⇒ aaabbabbbA ⇒ aaabbabbba

Rightmost derivation:
    S ⇒ aB ⇒ aaBB ⇒ aaBbS ⇒ aaBbbA ⇒ aaBbba ⇒ aaaBBbba ⇒ aaaBbbba ⇒ aaabSbbba ⇒ aaabbAbbba ⇒ aaabbabbba

Fig. 3.5 Parse trees for the leftmost and rightmost derivations

The structure shown above is called a parse tree.

Example 3.3: Design a derivation tree for the following grammar:

    S → aS
    S → aSbS
    S → ε

Also obtain the leftmost and rightmost derivations for the string "aaabaab".

Solution:

Leftmost derivation:
    S ⇒ aS          (S → aS)
    ⇒ aaS           (S → aS)
    ⇒ aaaSbS        (S → aSbS)
    ⇒ aaabS         (S → ε)
    ⇒ aaabaSbS      (S → aSbS)
    ⇒ aaabaaSbS     (S → aS)
    ⇒ aaabaabS      (S → ε)
    ⇒ aaabaab       (S → ε)

Rightmost derivation:
    S ⇒ aSbS        (S → aSbS)
    ⇒ aSbaSbS       (S → aSbS)
    ⇒ aSbaSb        (S → ε)
    ⇒ aSbaaSb       (S → aS)
    ⇒ aSbaab        (S → ε)
    ⇒ aaSbaab       (S → aS)
    ⇒ aaaSbaab      (S → aS)
    ⇒ aaabaab       (S → ε)

Thus aaabaab is derived. The derivation tree can be drawn as below.
Fig. Derivation trees for the leftmost and the rightmost derivation of aaabaab

Example 3.4: Consider the grammar given below:

    E → E + E | E - E | E * E | E / E | a | b

Obtain leftmost and rightmost derivations for the string a + b * a + b.

Solution:

Leftmost derivation:
    E ⇒ E + E ⇒ E + E + E ⇒ a + E + E ⇒ a + E * E + E ⇒ a + b * E + E ⇒ a + b * a + E ⇒ a + b * a + b

Rightmost derivation:
    E ⇒ E + E ⇒ E + E + E ⇒ E + E + b ⇒ E + E * E + b ⇒ E + E * a + b ⇒ E + b * a + b ⇒ a + b * a + b

3.3 Ambiguous Grammar

A grammar G is said to be ambiguous if it generates more than one parse tree for some sentence of the language L(G).

For example, consider:

    E → E + E | E * E | id

Then for id + id * id two parse trees can be constructed: one in which id * id is grouped below the +, and one in which id + id is grouped below the *.

Fig. 3.6 Ambiguous grammar: two parse trees for id + id * id

Example 3.5: Find whether the following grammar is ambiguous or not:

    S → AB
    A → a | aA
    B → b | aB

Solution: We will construct parse trees for the string 'aab'.

Parse tree 1: S ⇒ AB, with A ⇒ a and B ⇒ aB ⇒ ab.
Parse tree 2: S ⇒ AB, with A ⇒ aA ⇒ aa and B ⇒ b.

There are two different parse trees deriving the string 'aab'. Hence the above given grammar is an ambiguous grammar.

Example 3.6: Show that the following grammar is ambiguous:

    S → aSbS
    S → bSaS
    S → ε

Solution: Consider the string 'abab'. We can construct two different parse trees deriving 'abab':

Parse tree 1: S ⇒ aSbS ⇒ abS ⇒ abaSbS ⇒ abab (the remaining S's derive ε).
Parse tree 2: S ⇒ aSbS ⇒ abSaSbS ⇒ abab (the first inner S is expanded with S → bSaS and the remaining S's derive ε).

Hence the grammar is ambiguous.

3.4 Parsing Techniques

As we know, there are two parsing techniques. These parsing techniques work on the following principle:
1. The parser scans the input string from left to right and identifies whether the derivation is leftmost or rightmost.
2. The parser makes use of the production rules for choosing the appropriate derivation.

The different parsing techniques use different approaches in selecting the appropriate rules for derivation, and finally a parse tree is constructed. When the parse tree is constructed from the root towards the leaves, the parser is called a top-down parser.
The name itself tells us that when the parse tree is constructed from the root towards the leaves, such a parser is called a top-down parser. When the parse tree is constructed from the leaves towards the root, such a parser is called a bottom-up parser; that is, the parse tree is built in bottom-up manner.
Fig. 3.7 Parsing techniques
Let us discuss the parsing techniques in detail.

3.5 Top-down Parser
As we have seen, in top-down parsing the parse tree is generated from top to bottom (i.e. from root to leaves), and the derivation terminates when the required input string is obtained. The leftmost derivation matches this requirement. The main task in top-down parsing is to find the appropriate production rule in order to produce the correct input string. We will understand the process of top-down parsing with the help of an example.
Consider a grammar,
S → xPz
P → yw | y
Consider the input string xyz placed in the input buffer.
Now we will construct the parse tree for the above grammar deriving the given input string. And for this derivation we will make use of the top-down approach.
Step 1 : Fig. 3.8 (a)
Step 2 : Fig. 3.8 (b)
Step 3 : Fig. 3.8 (c)
The first leftmost leaf of the parse tree matches the first input symbol x. Hence we advance the input pointer. The next leaf node is P. We have to expand the node P. After expanding P by P → yw we get the node y, which matches the input symbol y. Now the next node is w, which does not match the input symbol z. Hence we go back to see whether there is another alternative for P. The other alternative for P is y, which matches the current input symbol. And thus we could produce a successful parse tree for the given input string. We halt and declare that the parsing is completed successfully.
In top-down parsing, selection of the proper rule is a very important task. And this selection is based on a trial and error technique. That means we select a particular rule, and if it does not produce the correct input string then we need to backtrack and try another production.
This process has to be repeated until we get the correct input string. If, after trying all the productions, no match for the string is found, then the parse tree cannot be built.

3.5.1 Problems with Top-down Parsing
There are certain problems in top-down parsing. In order to implement the parsing we need to eliminate these problems. Let us discuss these problems and how to remove them.
1) Backtracking
Backtracking is a technique in which, for expansion of a non-terminal symbol, we choose one alternative, and if some mismatch occurs then we try another alternative, if any.
For example :
S → xPz
P → yw | y
While parsing the string xyz, if P → yw is tried first then a mismatch occurs at w, and the parser must backtrack and try P → y.
2) Left recursion
If left recursion is present in the grammar then it creates a serious problem : the top-down parser can enter an infinite loop.
Fig. 3.10 Left recursion
For a rule A → Aα, expansion of A causes further expansion of A only, and due to the generation of A, Aα, Aαα, Aααα, ..., the input pointer is never advanced. This causes a major problem in top-down parsing, and therefore elimination of left recursion is a must.
To eliminate left recursion we need to modify the grammar. Let G be a context free grammar having a production rule with left recursion :
A → Aα       ...(1)
A → β
Then we eliminate the left recursion by re-writing the production rules as :
A → βA'
A' → αA'     ...(2)
A' → ε
Thus a new symbol A' is introduced. We can also verify whether the modified grammar is equivalent to the original or not.
Fig. 3.11
For example : Consider the grammar
E → E + T | T
We can map this grammar with the rule A → Aα | β.
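The transformation from A → Aα | β to A → βA', A' → αA' | ε is mechanical, so it can be sketched in code. The following Python sketch is an illustration only (the encoding of productions as tuples of symbols, and the function name, are assumptions, not from the text); it applies the rule to E → E + T | T.

```python
def eliminate_left_recursion(nonterminal, productions):
    """Remove immediate left recursion from one non-terminal.

    productions: list of right-hand sides (tuples of symbols) for
    A -> A a1 | ... | A am | b1 | ... | bn.  Returns the new rules
    for A and the helper non-terminal A'; () stands for epsilon.
    """
    prime = nonterminal + "'"
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == nonterminal]
    betas = [rhs for rhs in productions if not rhs or rhs[0] != nonterminal]
    if not alphas:                      # no left recursion, nothing to do
        return {nonterminal: productions}
    return {
        nonterminal: [beta + (prime,) for beta in betas],       # A  -> b A'
        prime: [alpha + (prime,) for alpha in alphas] + [()],   # A' -> a A' | eps
    }

# E -> E + T | T   becomes   E -> T E'  and  E' -> + T E' | eps
rules = eliminate_left_recursion("E", [("E", "+", "T"), ("T",)])
print(rules)
```

Running the sketch on E → E + T | T reproduces exactly the equivalent grammar derived in the text.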
Then using equation (2) we can say,
A = E, α = + T, β = T
Then the rules become,
E → TE'
E' → + TE' | ε
Similarly, for the rule
T → T * F | F
we can eliminate the left recursion as
T → FT'
T' → * FT' | ε
The grammar for arithmetic expressions can be equivalently written as :
E → E + T | T      ⇒   E → TE'
                        E' → + TE' | ε
T → T * F | F      ⇒   T → FT'
                        T' → * FT' | ε
F → (E) | id            F → (E) | id
3) Left factoring
If the grammar is left factored then it becomes suitable for use. Basically, left factoring is used when it is not clear which of two alternatives should be used to expand a non-terminal. By left factoring we may be able to re-write the productions so that the decision can be deferred until enough of the input is seen.
In general, if A → αβ1 | αβ2 is a production then it is not possible for us to decide whether to choose the first rule or the second. In such a situation the above grammar can be left factored as
A → αA'
A' → β1 | β2
For example : Consider the following grammar.
S → iEtS | iEtSeS | a
E → b
The left factored grammar becomes,
S → iEtSS' | a
S' → eS | ε
E → b
4) Ambiguity
An ambiguous grammar is not desirable in top-down parsing. Hence we need to remove the ambiguity from the grammar if it is present.
For example :
E → E + E | E * E | id
is an ambiguous grammar. We can design two parse trees for id + id * id as follows.
Fig. 3.12 Ambiguous grammar : (a) Parse tree 1 (b) Parse tree 2
For removing the ambiguity we will apply one rule : if the grammar has a left associative operator (such as +, -, *, /) then induce left recursion, and if the grammar has a right associative operator (the exponential operator) then induce right recursion.
The unambiguous grammar is
E → E + T | T
T → T * F | F
F → id
Note one thing : the grammar is now unambiguous, but it is left recursive, and elimination of such left recursion is again a must.
Fig. 3.13 Types of top-down parsers
There are two types by which the top-down parsing can be performed.
1. Backtracking
2. Predictive parsing
A backtracking parser will try different production rules to find the match for the input string by backtracking each time. Backtracking is more powerful than predictive parsing, but this technique is slower and it requires exponential time in general. Hence backtracking is not preferred for practical compilers.
As the name suggests, the predictive parser tries to predict the next construction using one or more lookahead symbols from the input string. There are two types of predictive parsers :
1. Recursive descent
2. LL(1) parser
Let us discuss these types along with some examples.

3.5.2 Recursive Descent Parser
A parser that uses a collection of recursive procedures for parsing the given input string is called a Recursive Descent (RD) parser. In this type of parser the CFG is used to build the recursive routines. The R.H.S. of each production rule is directly converted to a program. For each non-terminal a separate procedure is written, and the body of the procedure (code) is the R.H.S. of the corresponding non-terminal.
Basic steps for construction of RD parser
The R.H.S. of the rule is directly converted into program code, symbol by symbol.
1. If the symbol is a non-terminal then a call to the procedure corresponding to that non-terminal is made.
2. If the symbol is a terminal then it is matched with the lookahead from the input. The lookahead pointer has to be advanced on matching of the input symbol.
3. If the production rule has many alternates then all these alternates have to be combined into a single body of procedure.
4. The parser should be activated by a procedure corresponding to the start symbol.
Let us take one example to understand the construction of the RD parser. Consider the grammar having start symbol E.
E → num T
T → * num T | ε

procedure E
{
    if lookahead = num then
    {
        match(num);
        T();                    /* call to procedure T */
    }
    else error;
    if lookahead = $ then
        declare success;        /* return on success */
    else error;
}   /* end of procedure E */

procedure T
{
    if lookahead = '*' then
    {
        match('*');
        if lookahead = num then
        {
            match(num);
            T();
        }   /* inner if closed */
        else error;
    }   /* outer if closed */
    else NULL;  /* indicates ε; the other alternate is combined
                   into the same procedure T */
}   /* end of procedure T */

procedure match(token t)
{
    if lookahead = t then
        lookahead = next_token;   /* lookahead pointer is advanced */
    else error;
}   /* end of procedure match */

procedure error
{
    print("Error");
}

Consider that the input string is 3 * 4. We will parse this string using the above given procedures. The parser is activated by calling procedure E. Since the first input token 3 matches num, match(num) is invoked and the lookahead then points to the next token. Then a call to procedure T is given. A match with '*' is found, hence lookahead = next_token. Now '4' matches num, hence again match(num) is fulfilled. Then procedure T is invoked recursively, and this time T is replaced by ε. As the lookahead pointer now points to $, we quit by reporting success. Thus the input string can be parsed using a recursive descent parser.
Construction of a recursive descent parser is easy. However, the programming language that we choose for the RD parser should support recursion. The internal details are not accessible in this type of parser : for instance, we cannot access the current leftmost sentential form, and at any instant we cannot access the stack containing the recursive calls. Writing procedures for a left recursive grammar is crucial, and to make such procedures simpler we first have to left factor the grammar. We cannot write recursive descent parsers for all types of context free grammars.
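The pseudocode above can be turned into a small runnable sketch. The following Python version is an illustration only (the tokenizer and the '$' end marker are assumptions added for self-containment); it parses the same grammar E → num T, T → * num T | ε.

```python
import re

# E -> num T ,  T -> * num T | eps
# Every digit run becomes the token 'num'; '$' marks the end of input.

def tokenize(s):
    return ['num' if t.isdigit() else t
            for t in re.findall(r'\d+|\*', s)] + ['$']

def parse(s):
    tokens = tokenize(s)
    pos = 0

    def lookahead():
        return tokens[pos]

    def match(t):
        nonlocal pos
        if lookahead() == t:
            pos += 1                      # advance the lookahead pointer
        else:
            raise SyntaxError(f'expected {t}, got {lookahead()}')

    def E():                              # procedure E
        match('num')
        T()

    def T():                              # procedure T
        if lookahead() == '*':            # T -> * num T
            match('*')
            match('num')
            T()
        # else T -> eps : do nothing

    E()
    return lookahead() == '$'             # success if all input is consumed

print(parse('3*4'))   # True
```

Note how each non-terminal becomes one function, and how the ε-alternate of T is simply the "do nothing" branch, exactly as in the pseudocode.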
3.5.3 Predictive LL(1) Parser
This top-down parsing algorithm is of non-recursive type. In this type of parsing a table is built. In the name LL(1), the first L means the input is scanned from left to right, the second L means it uses leftmost derivation for the input string, and the number 1 means it uses only one input symbol (lookahead) to predict the parsing process.
The simple block diagram for the LL(1) parser is as given below.
Fig. 3.15 Model for LL(1) parser
The data structures used by LL(1) are : i) input buffer ii) stack iii) parsing table.
The LL(1) parser uses the input buffer to store the input tokens. The stack is used to hold the left sentential form. The symbols in the R.H.S. of a rule are pushed into the stack in reverse order, i.e. from right to left. Thus the use of a stack makes this algorithm non-recursive. The table is basically a two dimensional array. The table has a row for each non-terminal and a column for each terminal. An entry is denoted M[A, a], where A is a non-terminal and a is the current input symbol.
The parser works as follows : the parsing program reads the top of the stack and the current input symbol. With the help of these two symbols the parsing action is determined. The parsing actions are :

Top of stack       Input      Parsing action
$                  $          Parsing successful, halt.
terminal a         a          Pop a and advance lookahead to the next token.
terminal a         b ≠ a      Error.
non-terminal A     a          Refer table M[A, a]. If the entry at M[A, a] is a rule
                              A → R1 R2 ... Rk, then pop A and push Rk, then Rk-1,
                              ..., then R1 (i.e. push the R.H.S. in reverse order).
                              If the entry at M[A, a] is error, report error.

One by one these configurations are performed until the input is successfully parsed. When the parser reaches the halting configuration, i.e. the stack contains only $ and the next token is $, it corresponds to a successful parse.
3.5.3.1 Construction of Predictive LL(1) Parser
The construction of a predictive LL(1) parser is based on two very important functions, FIRST and FOLLOW. For construction of the predictive LL(1) parser we have to follow the steps given below :
1.
Computation of the FIRST and FOLLOW functions.
2. Construct the predictive parsing table using the FIRST and FOLLOW functions.
3. Parse the input string with the help of the predictive parsing table.
FIRST function
FIRST(α) is the set of terminal symbols that can appear first in some string derived from α. If α ⇒* ε then ε is also in FIRST(α). The following rules are used to compute the FIRST function.
1. If a is a terminal symbol then FIRST(a) = {a}.
2. If there is a rule X → ε then ε is in FIRST(X).
3. For the rule A → X1 X2 ... Xk : FIRST(A) contains FIRST(X1) - {ε}; if ε is in FIRST(Xj) for all 1 ≤ j ≤ i - 1 then FIRST(A) also contains FIRST(Xi) - {ε}; and if ε is in FIRST(Xj) for all 1 ≤ j ≤ k then ε is in FIRST(A).
FOLLOW function
FOLLOW(A) is defined as the set of terminal symbols that can appear immediately to the right of A in some sentential form. In other words,
FOLLOW(A) = { a | S ⇒* αAaβ, where α and β are some grammar strings }
The rules for computing the FOLLOW function are as given below.
1. For the start symbol S, place $ in FOLLOW(S).
2. If there is a production A → αBβ then everything in FIRST(β) except ε is to be placed in FOLLOW(B).
3. If there is a production A → αB, or A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
Let us take some examples to compute the FIRST and FOLLOW functions using the above rules.
Example 3.7 : Consider the grammar
E → TE'
E' → + TE' | ε
T → FT'
T' → * FT' | ε
F → (E) | id
Find the FIRST and FOLLOW functions for the above grammar.
Solution : As E → TE' is a rule in which the first symbol on the R.H.S. is T, and T → FT' is a rule in which the first symbol on the R.H.S. is F, and there is a rule F → (E) | id,
FIRST(E) = FIRST(T) = FIRST(F)
As F → (E) and F → id,
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
As E' → + TE' and E' → ε, by referring to computation rule 2, the first symbol appearing on the R.H.S. of the production rule for E' is added in the FIRST function.
FIRST(T') = { *, ε }
As T' → * FT' and T' → ε, the first terminal symbol appearing on the R.H.S. of the production rule for T' is added in the FIRST function.
Now we will compute the FOLLOW functions.
FOLLOW(E) :
i) As there is a rule F → (E), the symbol ) appears immediately to the right of E. Hence ) will be in FOLLOW(E). By computation rule 2, mapping F → (E) with A → αBβ :
A = F, α = (, B = E, β = )
FOLLOW(B) = FIRST(β) = FIRST( ) ) = { ) }
FOLLOW(E) = { ) }
ii) Since E is the start symbol, add $ to FOLLOW(E).
Hence FOLLOW(E) = { ), $ }
FOLLOW(E') :
i) For E → TE' the computational rule is A → αBβ with A = E, α = T, B = E', β = ε. Then by computational rule 3, everything in FOLLOW(A) is in FOLLOW(B), i.e. everything in FOLLOW(E) is in FOLLOW(E').
FOLLOW(E') = { ), $ }
ii) For E' → + TE' the computational rule is A → αBβ with A = E', α = + T, B = E', β = ε. Then by computational rule 3, everything in FOLLOW(E') is in FOLLOW(E').
Hence FOLLOW(E') = { ), $ }
We can observe in the given grammar that ) really does follow E.
FOLLOW(T) :
We have to observe two rules : E → TE' and E' → + TE'.
i) Consider E → TE'. We map it with A → αBβ :
A = E, α = ε, B = T, β = E'
By computational rule 2, FOLLOW(B) ⊇ FIRST(β) - {ε}. That is
FOLLOW(T) ⊇ FIRST(E') - {ε} = { +, ε } - {ε} = { + }
ii) Consider E' → + TE'. We map it with A → αBβ :
A = E', α = +, B = T, β = E'
Since ε is in FIRST(E'), by computational rule 3, FOLLOW(T) ⊇ FOLLOW(E') = { ), $ }
Finally FOLLOW(T) = { + } ∪ { ), $ } = { +, ), $ }
We can observe in the given grammar that + and ) really do follow T.
FOLLOW(T') :
i) Consider T → FT'. We map this rule with A → αBβ : A = T, α = F, B = T', β = ε. Then by computational rule 3,
FOLLOW(T') = FOLLOW(T) = { +, ), $ }
ii) Consider T' → * FT'. We map this rule with A → αBβ : A = T', α = * F, B = T', β = ε. Then
FOLLOW(T') = FOLLOW(T') = { +, ), $ }
Hence FOLLOW(T') = { +, ), $ }
FOLLOW(F) :
Consider T → FT' and T' → * FT'. Then by computational rule 2 :

T → FT'     : A → αBβ with A = T, α = ε, B = F, β = T'
              FOLLOW(F) ⊇ FIRST(T') - {ε} = { * }
T' → * FT'  : A → αBβ with A = T', α = *, B = F, β = T'
              FOLLOW(F) ⊇ FIRST(T') - {ε} = { * }

Since ε is in FIRST(T'), consider T' → * FT' by computational rule 3 :
FOLLOW(F) ⊇ FOLLOW(T') = { +, ), $ }
Finally FOLLOW(F) = { * } ∪ { +, ), $ } = { *, +, ), $ }
To summarize the above computation :

Non-terminal    FIRST         FOLLOW
E               { (, id }     { ), $ }
E'              { +, ε }      { ), $ }
T               { (, id }     { +, ), $ }
T'              { *, ε }      { +, ), $ }
F               { (, id }     { *, +, ), $ }

Algorithm for predictive parsing table
The construction of the predictive parsing table is an important activity in the predictive parsing method. This algorithm requires the FIRST and FOLLOW functions.
Input : The context free grammar G.
Output : Predictive parsing table M.
Algorithm : For each rule A → α of grammar G :
1. For each a in FIRST(α), create entry M[A, a] = A → α, where a is a terminal symbol.
2. If ε is in FIRST(α), create entry M[A, b] = A → α for each b in FOLLOW(A).
3. If ε is in FIRST(α) and $ is in FOLLOW(A), then create entry M[A, $] = A → α.
4. All the remaining entries in the table M are marked as SYNTAX ERROR.
Now we will make use of the above algorithm to create the parsing table for the grammar :
E → TE'
E' → + TE' | ε
T → FT'
T' → * FT' | ε
F → (E) | id
First create the empty table with rows E, E', T, T', F and columns id, +, *, (, ), $. Now we will fill up the entries in the table using the above given algorithm. For that, consider each rule one by one.
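The table-construction algorithm just stated, together with the driver loop of the LL(1) model, can be sketched in code. In the Python sketch below the FIRST and FOLLOW sets of Example 3.7 are hard-coded as given data, and the encoding (tuples for right-hand sides, 'eps' for ε, a '$' sentinel) is an assumption added for self-containment.

```python
EPS = 'eps'

# FIRST and FOLLOW sets from Example 3.7, taken as given data.
FIRST = {'E': {'(', 'id'}, "E'": {'+', EPS}, 'T': {'(', 'id'},
         "T'": {'*', EPS}, 'F': {'(', 'id'}}
FOLLOW = {'E': {')', '$'}, "E'": {')', '$'}, 'T': {'+', ')', '$'},
          "T'": {'+', ')', '$'}, 'F': {'*', '+', ')', '$'}}
GRAMMAR = [('E', ('T', "E'")), ("E'", ('+', 'T', "E'")), ("E'", ()),
           ('T', ('F', "T'")), ("T'", ('*', 'F', "T'")), ("T'", ()),
           ('F', ('(', 'E', ')')), ('F', ('id',))]
NONTERMS = set(FIRST)

def first_of(rhs):
    # FIRST of a whole right-hand side, using the sets above
    out = set()
    for x in rhs:
        fx = FIRST.get(x, {x})
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    out.add(EPS)
    return out

def build_table():
    M = {}
    for head, rhs in GRAMMAR:
        f = first_of(rhs)
        for a in f - {EPS}:
            M[(head, a)] = rhs        # rule 1: a in FIRST(alpha)
        if EPS in f:
            for b in FOLLOW[head]:    # rules 2 and 3: b in FOLLOW(A), incl. $
                M[(head, b)] = rhs
    return M                          # absent entries mean SYNTAX ERROR

def ll1_parse(tokens, M):
    stack = ['$', 'E']                # start symbol on top
    i = 0
    while stack:
        top = stack.pop()
        if top == '$':
            return tokens[i] == '$'   # accept iff the input is exhausted too
        if top in NONTERMS:
            rhs = M.get((top, tokens[i]))
            if rhs is None:
                return False          # error entry
            stack.extend(reversed(rhs))   # push R.H.S. right-to-left
        elif top == tokens[i]:
            i += 1                    # match terminal, advance lookahead
        else:
            return False
    return False

M = build_table()
print(ll1_parse(['id', '+', 'id', '*', 'id', '$'], M))   # True
```

The hand-filled entries that follow in the text are exactly the pairs this sketch places in M.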
E → TE' : A → α with A = E, α = TE'. FIRST(TE') = FIRST(T) = { (, id }
M[E, (] = E → TE'
M[E, id] = E → TE'
E' → + TE' : FIRST(+TE') = { + }
Hence M[E', +] = E' → + TE'
E' → ε : FIRST(ε) = { ε }, so use FOLLOW(E') = { ), $ }
M[E', )] = E' → ε
M[E', $] = E' → ε
T → FT' : FIRST(FT') = FIRST(F) = { (, id }
Hence M[T, (] = T → FT'
And M[T, id] = T → FT'
T' → * FT' : FIRST(*FT') = { * }
Hence M[T', *] = T' → * FT'
T' → ε : FOLLOW(T') = { +, ), $ }
Hence M[T', +] = T' → ε
M[T', )] = T' → ε
M[T', $] = T' → ε
F → (E) : FIRST((E)) = { ( }
Hence M[F, (] = F → (E)
F → id : FIRST(id) = { id }
Hence M[F, id] = F → id
The complete table can be as shown below.

      id        +           *           (         )        $
E     E → TE'                           E → TE'
E'              E' → +TE'                         E' → ε   E' → ε
T     T → FT'                           T → FT'
T'              T' → ε      T' → *FT'             T' → ε   T' → ε
F     F → id                            F → (E)

Now the input string id + id * id $ can be parsed using the above table. At the initial configuration the stack contains the start symbol E, and the input string is placed in the input buffer.
Now the symbol E is at the top of the stack and the input pointer is at the first id, hence M[E, id] is referred. This entry tells us E → TE', so we pop E and push E' first, then T.

Stack       Input            Action
$E          id + id * id $   E → TE'
$E'T        id + id * id $   T → FT'
$E'T'F      id + id * id $   F → id
$E'T'id     id + id * id $   match id
$E'T'       + id * id $      T' → ε
$E'         + id * id $      E' → + TE'
$E'T+       + id * id $      match +
$E'T        id * id $        T → FT'
$E'T'F      id * id $        F → id
$E'T'id     id * id $        match id
$E'T'       * id $           T' → * FT'
$E'T'F*     * id $           match *
$E'T'F      id $             F → id
$E'T'id     id $             match id
$E'T'       $                T' → ε
$E'         $                E' → ε
$           $                Accept

Thus it is observed that the input is scanned from left to right and we always follow the leftmost derivation while parsing the input string. Also, at a time only one input symbol is referred to for taking the parsing action. Hence the name of this parser is LL(1). The LL(1) parser is a table driven predictive parser. Left recursion and ambiguity are not allowed for the LL(1) parser.
Example 3.8 : Show that the following grammar :
S → AaAb | BbBa
A → ε
B → ε
is LL(1).
Solution : Consider the grammar :
S → AaAb
S → BbBa
A → ε
B → ε
Now we will compute the FIRST and FOLLOW functions.
FIRST(S) = { a, b }, because if we put A → ε in S → AaAb we get S ⇒ aAb, and if we put B → ε in S → BbBa we get S ⇒ bBa.
FIRST(A) = FIRST(B) = { ε }
FOLLOW(S) = { $ }
FOLLOW(A) = { a, b }
FOLLOW(B) = { a, b }
The LL(1) parsing table is :

      a            b            $
S     S → AaAb     S → BbBa
A     A → ε        A → ε
B     B → ε        B → ε

Now consider the string "ba".
For parsing :

Stack      Input    Action
$S         ba$      S → BbBa
$aBbB      ba$      B → ε
$aBb       ba$      match b
$aB        a$       B → ε
$a         a$       match a
$          $        Accept

No cell of the table contains more than one entry, and the string is parsed successfully. This shows that the given grammar is LL(1).
Example 3.9 : Construct the LL(1) parsing table for the following grammar.
S → aB | aC | Sd | Se
B → bBc | f
C → g
Solution : Consider the grammar
S → aB
S → aC
S → Sd
S → Se
B → bBc
B → f
C → g
Now we will construct FIRST and FOLLOW for the above grammar.
FIRST(S) = { a }
FIRST(B) = { b, f }
FIRST(C) = { g }
FOLLOW(S) = { d, e, $ }
FOLLOW(B) = { c, d, e, $ }
FOLLOW(C) = { d, e, $ }
The LL(1) parsing table can be as shown below.

      a                              b          c   d   e   f        g
S     S → aB, S → aC, S → Sd, S → Se
B                                    B → bBc            B → f
C                                                                    C → g

The above table shows multiple entries at M[S, a] : every S-production has a in its FIRST set (the grammar is also left recursive because of S → Sd | Se). This shows that the given grammar is not LL(1).

3.6 Bottom-up Parsing
In this section we will discuss how an input string gets parsed efficiently. In compilers this task is done by a program called the parser. The task of the parser is twofold : it checks the input string completely for its syntax, and it generates error messages for syntactically invalid input strings.
In the bottom-up parsing method, the input string is taken first and we try to reduce this string with the help of the grammar, trying to obtain the start symbol. The process of parsing halts successfully as soon as we reach the start symbol. The parse tree is constructed from bottom to top, that is from leaves to root. In this process, the input symbols are placed at the leaf nodes after successful parsing. The bottom-up parse tree is created starting from the leaves; the leaf nodes together are reduced further to internal nodes, these internal nodes are further reduced, and eventually a root node is obtained. The internal nodes are created from the list of terminal and non-terminal symbols.
This involves : internal nodes are labelled with non-terminals, while their children may be labelled with non-terminals or terminals.
In this process the parser basically tries to identify the R.H.S. of a production rule and replace it by the corresponding L.H.S. This activity is called reduction. Thus the crucial but prime task in bottom-up parsing is to find the productions that can be used for reduction. The bottom-up parse tree construction process indicates that the tracing of derivations is to be done in reverse order.
For example
Consider the grammar for a declarative statement,
S → TL ;
T → int | float
L → L , id | id
The input string is : float id, id, id ;
Parse Tree
Step 1 : We will start from the leaf node float.
Step 2 : Read id from the input.
Step 3 : Reduce float to T by T → float.
Step 4 : Read the next string from the input.
Step 5 : Reduce id to L by L → id.
Step 6 : Read the next string from the input.
Step 7 : Read the next string from the input.
Step 8 : L , id gets reduced to L by L → L , id (applied once for each remaining id).
Step 9 : T L ; is reduced to S by S → TL ;.
Fig. 3.16 Bottom-up parsing
Step 10 : The sentential forms produced while constructing this parse tree are
float id, id, id ;  →  T id, id, id ;  →  T L, id, id ;  →  T L, id ;  →  T L ;  →  S
Step 11 : Thus, looking at the sentential forms, we can say that the rightmost derivation in reverse order is performed.
Thus the basic steps in bottom-up parsing are :
1) Reduction of the input string to the start symbol.
2) The sentential forms that are produced in the reduction process should trace out the rightmost derivation in reverse.
As said earlier, the crucial task in bottom-up parsing is to find the substring that can be reduced by an appropriate non-terminal. Such a substring is called a handle.
In other words, a handle is a substring that matches the R.H.S. of some production, such that reducing it to the corresponding L.H.S. non-terminal represents one step along the reverse of a rightmost derivation. Formally we can define a handle as follows.
"A handle of a right sentential form γ is a production A → β and a position in γ where the string β may be found and replaced by A to produce the previous right sentential form in a rightmost derivation of γ."
For example
Consider the grammar
E → E + E
E → id
Now consider the string id + id + id. The rightmost derivation is
E ⇒ E + E
  ⇒ E + E + E
  ⇒ E + E + id
  ⇒ E + id + id
  ⇒ id + id + id
The handles at each step of the reverse derivation are :

Right sentential form    Handle    Reducing production
id + id + id             id        E → id
E + id + id              id        E → id
E + E + id               id        E → id
E + E + E                E + E     E → E + E
E + E                    E + E     E → E + E

3.6.1 Shift Reduce Parser
A shift reduce parser attempts to construct the parse tree from leaves to root. Thus it works on the same principle as a bottom-up parser. A shift reduce parser requires the following data structures :
1. An input buffer storing the input string.
2. A stack for storing and accessing the L.H.S. and R.H.S. of rules.
The initial configuration of the shift reduce parser is as shown below.
Fig. 3.17 Initial configuration
The parser performs the following basic operations.
1. Shift : Moving of the symbols from the input buffer onto the stack; this action is called shift.
2. Reduce : If the handle appears on the top of the stack, then reduction of it by the appropriate rule is done. That means the R.H.S. of the rule is popped off and the L.H.S. is pushed in. This action is called reduce.
3. Accept : If the stack contains only the start symbol and the input buffer is empty at the same time, then that action is called accept. When the accept action is obtained in the process of parsing, it means a successful parsing is done.
4. Error : A situation in which the parser can neither shift nor reduce the symbols, and cannot even perform the accept action, is called an error.
Let us take some examples to learn the shift-reduce parser.
Example 3.10 : Consider the grammar
E → E - E
E → E * E
E → id
Perform shift-reduce parsing of the input string "id1 - id2 * id3".
Solution :

Stack          Input buffer         Parsing action
$              id1 - id2 * id3 $    Shift
$ id1          - id2 * id3 $        Reduce by E → id
$ E            - id2 * id3 $        Shift
$ E -          id2 * id3 $          Shift
$ E - id2      * id3 $              Reduce by E → id
$ E - E        * id3 $              Shift
$ E - E *      id3 $                Shift
$ E - E * id3  $                    Reduce by E → id
$ E - E * E    $                    Reduce by E → E * E
$ E - E        $                    Reduce by E → E - E
$ E            $                    Accept

Here we have followed two rules.
1.
If the incoming operator has more priority than the operator inside the stack, then perform shift.
2. If the operator inside the stack has the same or higher priority than the incoming operator, then perform reduce.
Example 3.11 : Consider the following grammar
S → TL ;
T → int | float
L → L , id | id
Parse the input string int id, id ; using a shift-reduce parser.
Solution :

Stack       Input buffer      Parsing action
$           int id, id ; $    Shift
$ int       id, id ; $        Reduce by T → int
$ T         id, id ; $        Shift
$ T id      , id ; $          Reduce by L → id
$ T L       , id ; $          Shift
$ T L ,     id ; $            Shift
$ T L , id  ; $               Reduce by L → L , id
$ T L       ; $               Shift
$ T L ;     $                 Reduce by S → TL ;
$ S         $                 Accept

Example 3.12 : Consider the following grammar
S → (L) | a
L → L , S | S
Parse the input string (a, (a, a)) using a shift-reduce parser.
Solution :

Stack             Input buffer     Parsing action
$                 (a, (a, a)) $    Shift
$ (               a, (a, a)) $     Shift
$ ( a             , (a, a)) $      Reduce by S → a
$ ( S             , (a, a)) $      Reduce by L → S
$ ( L             , (a, a)) $      Shift
$ ( L ,           (a, a)) $        Shift
$ ( L , (         a, a)) $         Shift
$ ( L , ( a       , a)) $          Reduce by S → a
$ ( L , ( S       , a)) $          Reduce by L → S
$ ( L , ( L       , a)) $          Shift
$ ( L , ( L ,     a)) $            Shift
$ ( L , ( L , a   )) $             Reduce by S → a
$ ( L , ( L , S   )) $             Reduce by L → L , S
$ ( L , ( L       )) $             Shift
$ ( L , ( L )     ) $              Reduce by S → (L)
$ ( L , S         ) $              Reduce by L → L , S
$ ( L             ) $              Shift
$ ( L )           $                Reduce by S → (L)
$ S               $                Accept

3.6.2 Operator Precedence Parser
A grammar G is said to be operator precedence if it possesses the following properties :
1. No production on the right side is ε.
2. There should not be any production rule possessing two adjacent non-terminals on the right hand side.
Consider the grammar for arithmetic expressions,
E → EAE | (E) | -E | id
A → + | - | * | / | ↑
This grammar is not an operator precedence grammar, as the production rule E → EAE contains two consecutive non-terminals. Hence first we will convert it into an equivalent operator precedence grammar by removing A :
E → E + E | E - E | E * E | E / E | E ↑ E
E → (E) | -E | id
In operator precedence parsing we will first define the precedence relations <·, ≐ and ·> between pairs of terminals. The meanings of these relations are :
p <· q     p gives precedence to q.
p ≐ q      p has the same precedence as q.
p ·> q     p takes precedence over q.
These meanings appear similar to the less than, equal to and greater than operators. Now, by considering the precedence relations between the arithmetic operators, we will construct the operator precedence table. The terminals we have considered are id, +, * and $.
Fig. 3.18 Precedence relation table
Now consider the string
id + id * id
We will insert $ symbols at the start and at the end, and insert the precedence relations by referring to the table. The handle is then found as follows :
i) Scan the string from left to right until the first ·> is encountered.
ii) Scan backwards over any ≐ until <· is encountered.
iii) The handle is the string between <· and ·>.
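The relation-driven handle finding of steps i) to iii) can be sketched in code. The Python sketch below is a simplification (it keeps only terminals on the stack and records each single-terminal reduction instead of building E-nodes; the dictionary encoding of the relation table is an assumption), but it mechanizes the <· / ≐ / ·> relations of the table for id, +, * and $.

```python
# '<' stands for the yields-precedence relation, '>' for takes-precedence,
# '=' for equal precedence, for the terminals id, +, * and $.
REL = {('$', 'id'): '<', ('$', '+'): '<', ('$', '*'): '<',
       ('id', '+'): '>', ('id', '*'): '>', ('id', '$'): '>',
       ('+', 'id'): '<', ('+', '+'): '>', ('+', '*'): '<', ('+', '$'): '>',
       ('*', 'id'): '<', ('*', '+'): '>', ('*', '*'): '>', ('*', '$'): '>'}

def op_precedence_parse(tokens):
    stack = ['$']                      # holds terminals only; E's are implicit
    tokens = tokens + ['$']
    i = 0
    reductions = []                    # terminals reduced, in reduction order
    while True:
        top, tok = stack[-1], tokens[i]
        if top == '$' and tok == '$':
            return reductions          # nothing left between the $ markers
        rel = REL.get((top, tok))
        if rel in ('<', '='):
            stack.append(tok); i += 1  # shift
        elif rel == '>':
            handle = stack.pop()       # handle found between < and >
            reductions.append(handle)
        else:
            raise SyntaxError(f'no relation between {top} and {tok}')

print(op_precedence_parse(['id', '+', 'id', '*', 'id']))
# ['id', 'id', 'id', '*', '+']  -- * is reduced before +
```

The order of the recorded reductions shows exactly the behaviour traced by hand next : the three id's are reduced first, then the * handle, then the + handle.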
is encountered , iii) The handle is a string between < and ->, Compiler Design 3-41 Syntax Analysis The parsing can be done as follows. $<-id->4+ cid ->ec-id o§ Handle id is obtained between <- . > Reduce this by E -» id Etcid oe 8 Handle id is obtained between <. . > Reduce this by E > id Et Eecid oS Handle id is obtained between <--> Reduce this by E > id E+EoE Remove all the non-terminals. Insert $ at the beginning at the end. Also insert the precedence operators. The * operator is surrounded by <- -> .This indicates that * becomes handle. That means we have to reduce EsE operation first Now + becomes handle. Hence we evaluate Ere Parsing is done. Advantage of Operator Precedence Parsing 1. This type of parsing is simple to implement. Disadvantages of Operator Precedence Parsing 1. The operator like minus has two different precedence (unary and binary). Hence it is hard to handle tokens like minus sign. 2. This kind of parsing is applicable to only small class of grammars. Application The operator precedence parsing is done in a language having operators. Hence in SNOBOL operator precedence parsing is done. 3.7 Introduction to LR Parsing This is the most efficient method of bottom-up parsing which can be used to parse the large class of context free grammars. This method is also called LR(k) parsing Here * L stands for left to right scanning. * R stands for rightmost derivation in reverse. * kis number of input symbols, When k is omitted k is _agsumed to be 1. a a Syntax Analysis Com 3-42 Design Properties of LR parser LR parser is widely used for following reasons. 1. LR parsers can be constructed to recognize | ext free grammar can most of the programming languages for which cont written. _ 2. The class of grammar that can be parsed by LR parser is of grammars that can be parsed using predictive parsers. ; ani 3. LR parser works using non backtracking shift reduce technique y efficient one. LR Parsers detect syntactical errors very efficiently. 
Structure of LR parsers
The structure of the LR parser is as given in Fig. 3.19. It consists of an input buffer for storing the input string, a stack for storing the grammar symbols, an output, and a parsing table comprised of two parts, namely action and goto. There is one parsing program which is actually a driving program; it reads the input symbols one at a time from the input buffer.
Fig. 3.19 Structure of LR parser
The driving program works on the following lines.
1. It initializes the stack with the start symbol and invokes the scanner (lexical analyzer) to get the next token.
2. It determines s, the state currently on the top of the stack, and a, the current input symbol.
3. It consults the parsing table for the entry action[s, a], which can have one of four values :
i) si, meaning shift to state i.
ii) rj, meaning reduce by rule j.
iii) Accept, meaning successful parsing is done.
iv) Error, indicating a syntactical error.
Types of LR parser
The following diagram represents the types of LR parser.
Fig. 3.20 Techniques of LR parsers : SLR parser, LALR parser, canonical LR parser
SLR means simple LR parser, LALR means lookahead LR parser, and the canonical LR (or simply "LR") parser : these are the three members of the LR family. The overall structure of all these LR parsers is the same; all are table driven parsers. The relative power of these parsers is SLR(1) < LALR(1) < LR(1). That means canonical LR parses a larger class than LALR, and LALR parses a larger class than SLR.

3.7.1 Simple LR Parsing
We will start with the simplest form of LR parsing, called the SLR parser. It is the weakest of the three methods, but it is the easiest to implement. The parsing can be done as follows.
Fig. 3.21 Working of SLR(1) : context free grammar → construction of canonical set of items → construction of SLR parsing table → parsing of input string → output
A grammar for which an SLR parser can be constructed is called an SLR grammar.
Definition of LR(0) items and related terms
1) An LR(0) item for grammar G is a production rule in which the symbol • is inserted at some position in the R.H.S. of the rule. For example, for
S → ABC
the items are
S → •ABC
S → A•BC
S → AB•C
S → ABC•
The production S → ε generates only one item, S → •.
2) Augmented grammar : If a grammar G has start symbol S, then the augmented grammar is a new grammar G' in which S' is a new start symbol, with the added rule S' → S. The purpose of this rule is to indicate the acceptance of the input : when the parser is about to reduce by S' → S, it reaches the acceptance state.
3) Kernel items : The collection consisting of the item S' → •S and all the items whose dots are not at the leftmost end of the R.H.S. of the rule.
Non-kernel items : The collection of all the items in which • is at the left end of the R.H.S. of the rule.
4) Functions closure and goto : These are two important functions required to create the collection of canonical sets of items.
5) Viable prefix : The set of prefixes of right sentential forms of a production A → α. This set can appear on the stack during shift/reduce actions.
Closure operation
For a context free grammar G, if I is a set of items, then the function closure(I) can be constructed using the following rules.
1. Initially, every item in I is added to closure(I).
2. If A → α•Bβ is a rule in closure(I) and there is another rule for B such as B → γ, then add B → •γ to closure(I) :
A → α•Bβ
B → •γ
This rule has to be applied until no more new items can be added to closure(I).
That means the goto operation simply shifts • one position ahead over the grammar symbol (which may be a terminal or a non-terminal). If the item A → α•Bβ is in I, then the same goto function can be written as goto(I, B).

Example 3.13: Consider the grammar
X → Xb | a
Compute closure(I) and goto(I).

Solution: Let I : X → •Xb
closure(I) = X → •Xb
             X → •a
The goto function can be computed as
goto(I, X) = X → X•b
Similarly, goto(I, a) gives X → a•

Example 3.14: Consider the grammar
S → AS | b
A → SA | a
Compute closure(I) and goto(I).

Solution: We will first write the grammar using the dot operator.

I0: S → •AS
    S → •b
    A → •SA
    A → •a    } closure

Let us call this state I0. Now we apply goto on each symbol in I0.

I1: goto(I0, b)
    S → b•

I2: goto(I0, S)
    A → S•A
    A → •SA
    A → •a
    S → •AS
    S → •b

I3: goto(I0, a)
    A → a•

(Applying goto on A would similarly give S → A•S together with its closure.)

Example 3.15: Consider the following grammar:
S → Aa | bAc | Bc | bBa
A → d
B → d
Compute closure(I) and goto(I).

Solution: First we will write the grammar using the dot operator.

I0: S → •Aa
    S → •bAc
    S → •Bc
    S → •bBa
    A → •d
    B → •d

Now we will apply goto on each symbol from state I0. Each goto(I0, X) will generate a new subsequent state.

I1: goto(I0, A)
    S → A•a

I2: goto(I0, b)
    S → b•Ac
    S → b•Ba
    A → •d
    B → •d

I3: goto(I0, B)
    S → B•c

I4: goto(I0, d)
    A → d•
    B → d•

Construction of the canonical collection of sets of items:
1. For the grammar G, initially add S' → •S to the set of items C.
2. For each set of items Ii in C and for each grammar symbol X (which may be a terminal or a non-terminal), add closure(goto(Ii, X)). This process should be repeated by applying goto(Ii, X) for each X in Ii such that goto(Ii, X) is not empty and not already in C. The sets of items have to be constructed until no more sets of items can be added to C.

Now we will consider one grammar and construct the set of items by applying the closure and goto functions.

Example:
E → E + T
E → T
T → T * F
T → F
F → (E)
F → id

For this grammar we add the augmented production E' → •E to I0 and then apply closure().
I0: E' → •E
    E → •E + T
    E → •T
    T → •T * F
    T → •F
    F → •(E)
    F → •id

The item set I0 is constructed starting with E' → •E. Immediately to the right of • is E, hence we apply closure and add the E-productions with • at the left end of the rule; that means we add E → •E + T and E → •T to I0. But again we can see that the rule E → •T, which we have added, contains the non-terminal T immediately to the right of •, so we have to add the T-productions to I0: T → •T * F and T → •F.

In the T-productions, T and F respectively come after •. Since we have already added the T-productions we will not add those again, but we will add all the F-productions with dots: F → •(E) and F → •id. Now we can see that after the dot the symbols ( and id come in these two productions. The ( and id are terminal symbols and do not derive any rule; hence our closure function terminates here. Since there is no further rule, we stop creating I0.

Now apply goto(I0, E):

E' → •E      shift dot to the right: E' → E•
E → •E + T   shift dot to the right: E → E• + T

Thus I1 becomes,

goto(I0, E)
I1: E' → E•
    E → E• + T

Since in I1 there is no non-terminal after the dot, we cannot apply closure(I1).

By applying goto on T of I0,

goto(I0, T)
I2: E → T•
    T → T• * F

Since in I2 there is no non-terminal after the dot, we cannot apply closure(I2).

By applying goto on F of I0,

goto(I0, F)
I3: T → F•

Since after the dot in I3 there is nothing, we cannot apply closure(I3).

By applying goto on ( of I0, E comes after the dot, hence we apply closure on E, then on T, then on F:

goto(I0, ()
I4: F → (•E)
    E → •E + T
    E → •T
    T → •T * F
    T → •F
    F → •(E)
    F → •id

By applying goto on id of I0,

goto(I0, id)
I5: F → id•

Since in I5 there is no non-terminal to the right of the dot, we cannot apply the closure function here.

Thus we have completed applying gotos on I0. We will consider I1 for applying goto. In I1 there are two items, E' → E• and E → E• + T. There is no point applying goto on E' → E•, hence we will consider E → E• + T for application of goto:

goto(I1, +)
I6: E → E + •T
    T → •T * F
    T → •F
    F → •(E)
    F → •id

There is no other rule in I1 for applying goto, so we will move to I2.
In I2 there are two items, E → T• and T → T• * F. We will apply goto on *:

goto(I2, *)
I7: T → T * •F
    F → •(E)
    F → •id

The goto cannot be applied on I3. Now apply goto on I4. In I4 there are two productions having E after the dot, F → (•E) and E → •E + T. Hence we apply goto on E for both of these productions, and I8 becomes,

goto(I4, E)
I8: F → (E•)
    E → E• + T

If we apply goto on T in I4, that is goto(I4, T), we get E → T• and T → T• * F, which is I2 only. Hence we will not apply goto on T. Similarly we will not apply goto on F, ( and id, as we get the states I3, I4 and I5 again. These gotos are not applied, to avoid duplicate states. There is no point applying goto on I5. Hence now we move ahead by applying goto on I6 for T:

goto(I6, T)
I9: E → E + T•
    T → T• * F

goto(I7, F)
I10: T → T * F•

goto(I8, ))
I11: F → (E)•

Applying goto on I9, I10 and I11 adds no new item to the set of items. Thus the collection of sets of items is from I0 to I11.

Construction of the SLR parsing table

As we have seen in the structure of the SLR parser, there are two parts of the SLR parsing table, namely action and goto. By considering the basic parsing actions, shift, reduce, accept and error, we fill up the action table; the goto table can be filled up using the goto function. Let us see the algorithm.

Input: An augmented grammar G'.
Output: The SLR parsing table.
Algorithm:
1. Initially construct the set of items C = {I0, I1, I2, ..., In}, where C is the collection of sets of LR(0) items for the input grammar G'.
2. The parsing actions are based on each item Ii. The actions are as given below.
a. If A → α•aβ is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j". Note that a must be a terminal symbol.
b. If there is a rule A → α• in Ii, then set action[i, a] to "reduce A → α" for all symbols a, where a ∈ FOLLOW(A). Note that A must not be the augmented start symbol S'.
c. If S' → S• is in Ii, then the entry in the action table is action[i, $] = "accept".
3. The goto part of the SLR table can be filled as follows: the goto transitions for state i are considered for non-terminals only. If goto(Ii, A) = Ij then set goto[i, A] = j.
4. All the entries not defined by rules 2 and 3 are considered to be "error".

Example 3.16: Construct the SLR(1) parsing table for
(1) E → E + T
(2) E → T
(3) T → T * F
(4) T → F
(5) F → (E)
(6) F → id

Solution: We will first construct the collection of canonical sets of items for the above grammar. The items generated by this method are LR(0) items; as there is no lookahead symbol used in constructing this set of items, zero is put in the bracket.

I0: E' → •E
    E → •E + T
    E → •T
    T → •T * F
    T → •F
    F → •(E)
    F → •id

I1: goto(I0, E)         I2: goto(I0, T)         I3: goto(I0, F)
    E' → E•                 E → T•                  T → F•
    E → E• + T              T → T• * F

I4: goto(I0, ()         I5: goto(I0, id)
    F → (•E)                F → id•
    E → •E + T
    E → •T
    T → •T * F
    T → •F
    F → •(E)
    F → •id

I6: goto(I1, +)         I7: goto(I2, *)         I8: goto(I4, E)
    E → E + •T              T → T * •F              F → (E•)
    T → •T * F              F → •(E)                E → E• + T
    T → •F                  F → •id
    F → •(E)
    F → •id

I9: goto(I6, T)         I10: goto(I7, F)        I11: goto(I8, ))
    E → E + T•               T → T * F•              F → (E)•
    T → T• * F

We can design a DFA for the above set of items as follows.

Fig. 3.22 DFA for set of items

Note that the state I0 is the initial state, and in the given DFA every state is a final state. The DFA recognizes viable prefixes. For example, for the items

I1 = goto(I0, E): E' → E•, E → E• + T
I6 = goto(I1, +): E → E + •T, ...
I9 = goto(I6, T): E → E + T•, ...

the viable prefixes E, E+ and E+T are recognized. Continuing in this fashion, the DFA can be constructed for the set of items. Thus the DFA helps in recognizing the viable prefixes that can appear on the top of the stack.

Now we will also prepare the FOLLOW sets for all the non-terminals, because we require the FOLLOW sets according to rule 2.b of the parsing table algorithm.

[Note for readers: Below is a manual method of obtaining FOLLOW. We have already discussed the method of obtaining FOLLOW by using rules. This manual method can be used for getting FOLLOW quickly.]

FOLLOW(E') = {$} : as E' is the start symbol, $ will be placed.

FOLLOW(E):
E' → E means E' = E = start symbol, so we will add $.
E → E + T : the + is following E, so we will add +.
F → (E) : the ) is following E, so we will add ).
FOLLOW(E) = {+, ), $}

FOLLOW(T):
As E' → E and E → T, we get E' = E = T = start symbol, so we will add $.
From E → E + T and E → T we get T + T; the + is following T, hence we will add +.
T → T * F : as * is following T, we will add *.
From F → (E) and E → T we get F → (T); as ) is following T, we will add ).
FOLLOW(T) = {+, *, ), $}

FOLLOW(F):
As E' → E, E → T and T → F, we get E' = E = T = F = start symbol, so we will add $.
From E → E + T, E → T and T → F, the + is following F, hence we will add +.
From T → T * F and T → F we get F * F; as * is following F, we will add *.
From F → (E), E → T and T → F we get F → (F); as ) is following F, we will add ).
FOLLOW(F) = {+, *, ), $}

Building of the parsing table

Now, from the canonical collection of sets of items, consider I0.
Consider F → •(E). It matches A → α•aβ with A = F, α = ε, a = (, β = E). Since
goto(I0, () = I4
we set action[0, (] = shift 4. Similarly, for F → •id, goto(I0, id) = I5, so the entry in the action table is action[0, id] = shift 5. The other items in I0 do not give any action. In the same way we find the shift actions from I1 to I11:

          Action                          Goto
State | id    +    *    (    )    $  |  E    T    F
  0   | s5             s4            |
  1   |      s6                      |
  2   |           s7                 |
  3   |                              |
  4   | s5             s4            |
  5   |                              |
  6   | s5             s4            |
  7   | s5             s4            |
  8   |      s6             s11      |
  9   |           s7                 |
 10   |                              |
 11   |                              |

Thus the SLR parsing table is filled up with the shift actions. Now we will fill it up with the reduce and accept actions.

According to rule 2.c of the parsing table algorithm, E' → E• is in I1. Hence we will add the entry "accept" in action[1, $].

Now in state I2,
E → T•
matches rule 2.b, A → α•, with A = E and α = T.
FOLLOW(E) = {+, ), $}
Hence we add rule E → T in the row of state 2 and in the columns of +, ) and $. In the given grammar E → T is rule number 2, so action[2, +] = r2, action[2, )] = r2, action[2, $] = r2.

Similarly, in state I3,
T → F•
matches rule 2.b with A = T and α = F.
FOLLOW(T) = {+, *, ), $}
Hence we add rule T → F in the row of state 3 and in the columns of +, *, ) and $. In the given grammar T → F is rule number 4. Therefore action[3, +] = r4, action[3, *] = r4, action[3, )] = r4, action[3, $] = r4.
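The FOLLOW sets obtained manually above can be cross-checked with a short fixpoint computation. This is a sketch, not the book's method, and the simplification is only valid for this particular grammar: no production derives the empty string, so the FIRST handling stays trivial.

```python
# Cross-check of the FOLLOW sets for the expression grammar (a sketch;
# this simplified fixpoint works here because no production of this
# grammar derives the empty string).

PRODUCTIONS = [
    ("E'", ["E"]),
    ("E", ["E", "+", "T"]), ("E", ["T"]),
    ("T", ["T", "*", "F"]), ("T", ["F"]),
    ("F", ["(", "E", ")"]), ("F", ["id"]),
]
NT = {"E'", "E", "T", "F"}

def follow_sets():
    follow = {n: set() for n in NT}
    follow["E'"].add("$")                 # rule: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for head, body in PRODUCTIONS:
            for i, sym in enumerate(body):
                if sym not in NT:
                    continue
                if i + 1 < len(body):
                    nxt = body[i + 1]
                    # FIRST(nxt) is {nxt} for a terminal; for a non-terminal
                    # of this grammar FIRST is {(, id} (branch unused here).
                    first = {nxt} if nxt not in NT else {"(", "id"}
                    new = first - follow[sym]
                else:
                    # A -> alpha B : everything in FOLLOW(A) is in FOLLOW(B)
                    new = follow[head] - follow[sym]
                if new:
                    follow[sym] |= new
                    changed = True
    return follow

f = follow_sets()
print(f["E"], f["T"], f["F"])
```

Running this reproduces FOLLOW(E) = {+, ), $} and FOLLOW(T) = FOLLOW(F) = {+, *, ), $}, matching the manual derivation.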
Thus we can find the match for the rule A → α• in the remaining states I5 to I11 and fill up the action table with the respective "reduce" entries.

For the goto table, goto(I0, E) = I1, hence goto[0, E] = 1; goto(I0, T) = I2, hence goto[0, T] = 2; goto(I0, F) = I3, hence goto[0, F] = 3. Continuing in this fashion we can fill up the goto entries of the SLR parsing table. Finally the SLR(1) parsing table will look as:

          Action                          Goto
State | id    +    *    (    )    $   |  E    T    F
  0   | s5             s4             |  1    2    3
  1   |      s6                  acc  |
  2   |      r2   s7        r2   r2   |
  3   |      r4   r4        r4   r4   |
  4   | s5             s4             |  8    2    3
  5   |      r6   r6        r6   r6   |
  6   | s5             s4             |       9    3
  7   | s5             s4             |            10
  8   |      s6             s11       |
  9   |      r1   s7        r1   r1   |
 10   |      r3   r3        r3   r3   |
 11   |      r5   r5        r5   r5   |

The remaining blank entries in the table are considered as syntactical errors.

Parsing the input using the parsing table

Now it is time to parse an actual string using the above parsing table. Consider the parsing algorithm:
a) If action[s, a] = shift j, then push a, then push j onto the stack, and advance the input pointer.
b) If action[s, a] = reduce A → β, then pop 2 * |β| symbols; if i is then on the top of the stack, push A and then push goto[i, A] on the top of the stack.
c) If action[s, a] = accept, then halt the parsing process. It indicates successful parsing.

Let us take one valid string for the grammar.
(1) E → E + T (2) E → T (3) T → T * F (4) T → F (5) F → (E) (6) F → id
Input string: id * id + id

We will consider two data structures while taking the parsing actions, and those are the stack and the input buffer.

Stack            | Input       | Action table          | Goto table     | Parsing action
$ 0              | id*id+id $  | action[0, id] = s5    |                | Shift
$ 0 id 5         | *id+id $    | action[5, *] = r6     | goto[0, F] = 3 | Reduce by F → id
$ 0 F 3          | *id+id $    | action[3, *] = r4     | goto[0, T] = 2 | Reduce by T → F
$ 0 T 2          | *id+id $    | action[2, *] = s7     |                | Shift
$ 0 T 2 * 7      | id+id $     | action[7, id] = s5    |                | Shift
$ 0 T 2 * 7 id 5 | +id $       | action[5, +] = r6     | goto[7, F] = 10| Reduce by F → id
$ 0 T 2 * 7 F 10 | +id $       | action[10, +] = r3    | goto[0, T] = 2 | Reduce by T → T * F
$ 0 T 2          | +id $       | action[2, +] = r2     | goto[0, E] = 1 | Reduce by E → T
$ 0 E 1          | +id $       | action[1, +] = s6     |                | Shift
$ 0 E 1 + 6      | id $        | action[6, id] = s5    |                | Shift
$ 0 E 1 + 6 id 5 | $           | action[5, $] = r6     | goto[6, F] = 3 | Reduce by F → id
$ 0 E 1 + 6 F 3  | $           | action[3, $] = r4     | goto[6, T] = 9 | Reduce by T → F
$ 0 E 1 + 6 T 9  | $           | action[9, $] = r1     | goto[0, E] = 1 | Reduce by E → E + T
$ 0 E 1          | $           | action[1, $] = accept |                | Accept

In the first row of the above table we get action[0, id] = s5; that means we shift id from the input buffer onto the stack and then push the state number 5. In the second row we get action[5, *] = r6, which means reduce by rule 6, F → id; hence on the stack id is replaced by F. By referring to goto[0, F] we get state number 3, hence 3 is pushed onto the stack. Note that for every reduce action a corresponding goto is performed.
This process of parsing is continued in this fashion, and finally when action[1, $] = accept we halt, having successfully parsed the input string.

Example 3.17: Consider the following grammar
E → E + T | T
T → T F | F
F → F * | a | b
Construct the SLR parsing table for this grammar. Also parse the input a + b + a.

Solution: Let us number the production rules in the grammar.
(1) E → E + T
(2) E → T
(3) T → T F
(4) T → F
(5) F → F *
(6) F → a
(7) F → b

Now we will build the canonical collection of LR(0) items. We will first introduce the augmented grammar E' → •E, and then the initial set of items I0 will be generated as follows.

I0: E' → •E
    E → •E + T
    E → •T
    T → •T F
    T → •F
    F → •F *
    F → •a
    F → •b

As after • the symbol E appears, we add the rules of E; as after • T appears, we add the rules of T; and as after • F appears, we add the rules of F.

Now we will use the goto function. From state I0 the goto function is applied step by step. Each goto transition will generate a new state Ii.

I1: goto(I0, E)
    E' → E•
    E → E• + T
