Chapter I: INTRODUCTION TO COMPILING

1.1 WHAT ARE COMPILERS?

A compiler is a program which translates a program written in one language (the source language) into an equivalent program in another language (the target language).

Fig. 1.1: A compiler takes a source program and produces a target program and error messages.

- A compiler translates from one representation of the program to another.
- Typically it translates from high-level source code to low-level machine code or object code.
- Source code is normally optimized for human readability:
  - Expressive: it matches our notion of languages.
  - Redundant, to help avoid programming errors.
- Machine code is optimized for hardware:
  - Redundancy is reduced.
  - Information about the intent is lost.

1.2 PARTS OF COMPILATION

There are two parts to compilation:
(i) analysis
(ii) synthesis

(i) The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program.
(ii) The synthesis part constructs the desired target program from the intermediate representation.

Software tools that manipulate source programs first perform some kind of analysis. Some examples of such tools are:
(i) structure editors
(ii) pretty printers
(iii) static checkers
(iv) interpreters

(i) Structure Editors
- A structure editor takes a sequence of commands as input to build a source program.
- The structure editor not only performs the text creation and modification functions of an ordinary text editor, but it also analyzes the program text, putting an appropriate hierarchical structure on the source program.
- For example, it can check that the input is correctly formed, can supply keywords automatically, and can jump from a begin or left parenthesis to its matching end or right parenthesis.

(ii) Pretty Printers
- A pretty printer analyzes a program and prints it in such a way that the structure of the program becomes clearly visible.
- For example, comments may appear in a special font, and statements may appear with an amount of indentation proportional to the depth of their nesting in the hierarchical organization of the statements.

(iii) Static Checkers
- A static checker reads a program, analyzes it, and attempts to discover potential bugs without running the program.
- For example, it may detect the parts of the source program that can never be executed.
- It can catch logical errors, such as trying to use a real variable as a pointer.

(iv) Interpreters
- An interpreter performs the operations implied by the source program. For an assignment statement, an interpreter might build a tree and then carry out the operations at the nodes as it "walks" the tree.
- Interpreters are frequently used to execute command languages, since each operator executed in a command language is usually an invocation of a complex routine such as an editor or compiler.

Fig. 1.2: Syntax tree for a = b + c * 10
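The tree-walking evaluation just described can be sketched in a few lines of C. This is only an illustration of ours, not an example from the text: the node layout and the values assumed for b and c are invented.

#include <stdio.h>

/* Sketch of a tree-walking interpreter for the tree of Fig. 1.2.
   The node layout and the values of b and c are assumed for illustration. */
typedef struct Node {
    char op;                    /* '+', '*', or 0 for a leaf */
    int value;                  /* leaf value (a constant, or a variable's value) */
    struct Node *left, *right;
} Node;

static int eval(const Node *n) {
    if (n->op == 0)
        return n->value;                        /* leaf node */
    int l = eval(n->left), r = eval(n->right);  /* evaluate the subtrees */
    return n->op == '+' ? l + r : l * r;        /* apply the operator at the node */
}

int main(void) {
    Node ten = {0, 10, NULL, NULL};
    Node c   = {0, 3,  NULL, NULL};   /* assume c = 3 */
    Node b   = {0, 2,  NULL, NULL};   /* assume b = 2 */
    Node mul = {'*', 0, &c, &ten};    /* c * 10 */
    Node add = {'+', 0, &b, &mul};    /* b + c * 10 */
    printf("a = %d\n", eval(&add));   /* prints a = 32 */
    return 0;
}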
The techniques used in compiler design are applicable in many other places. The analysis portion of each of the following examples is similar to that of a conventional compiler.

(1) Text formatters: A text formatter takes input that is a stream of characters, most of which is text to be typeset, but some of which includes commands to indicate paragraphs, figures, or subscripts and superscripts.

(2) Silicon compilers: A silicon compiler has a source language that is similar to a conventional programming language. The variables of the source language represent logical signals (1 or 0) or groups of signals in a switching circuit. The output is a circuit design in an appropriate language.

(3) Query interpreters: A query interpreter translates a predicate containing relational and boolean operators into commands to search a database for records satisfying that predicate.

1.3 ANALYSIS OF THE SOURCE PROGRAM

Analysis consists of three phases.

(i) Linear analysis: Linear analysis is called lexical analysis or scanning. It is the process of reading the characters from left to right and grouping them into tokens having a collective meaning.

(ii) Hierarchical analysis: Hierarchical analysis is called parsing or syntax analysis. It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output.

(iii) Semantic analysis: Semantic analysis checks the source program for semantic errors and gathers type information for the subsequent code-generation phase.

1.4 PHASES OF A COMPILER

A compiler operates in phases, each of which transforms the source program from one representation into another. All phases communicate with the error handler and with the symbol table.

Fig. 1.3: Phases of a compiler (lexical analyzer, syntax analyzer, semantic analyzer, intermediate code generator, code optimizer, code generator, producing the target program; the symbol-table manager and error handler interact with every phase).

1.4.1 Symbol-Table Management

- An essential function of a compiler is to record the identifiers used in the source program and collect information about various attributes of each identifier.
- These attributes may provide information about the storage allocated for an identifier, its type, its scope and, in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument, and the type returned.
- A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier.
- The data structure allows us to find the record for each identifier quickly and to store and retrieve data from that record quickly.
- When an identifier in the source program is detected by the lexical analyzer, the identifier is entered into the symbol table.
- However, the attributes of an identifier cannot normally be determined during lexical analysis. For example, in a Pascal declaration like

  var a, b, c : real;

  the type real is not known when a, b and c are seen by the lexical analyzer.
- The remaining phases enter information about identifiers into the symbol table and then use this information in various ways.
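A symbol-table record of this kind can be sketched in C as below. The field set and the linear-list lookup are our own simplifications; more efficient organizations, such as hash tables, are discussed later in the text.

#include <string.h>

/* Sketch of a symbol-table record; the field names are illustrative. */
struct SymEntry {
    char name[32];   /* the identifier's lexeme */
    char type[16];   /* e.g. "real"; filled in by later phases */
    int  offset;     /* storage offset, fixed during code generation */
};

static struct SymEntry table[100];
static int nsyms = 0;

/* Return the index of name, inserting a fresh record if necessary. */
int lookup_or_insert(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    strncpy(table[nsyms].name, name, sizeof table[nsyms].name - 1);
    return nsyms++;
}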
1.4.2 Error Detection and Reporting

- Each phase can encounter errors. However, after detecting an error, a phase must deal with that error so that compilation can proceed, allowing further errors in the source program to be detected.
- The lexical phase can detect errors where the characters remaining in the input do not form any token of the language.
- Errors where the token stream violates the structure rules of the language are determined by the syntax-analysis phase.
- During semantic analysis the compiler tries to detect constructs that have the right syntactic structure but no meaning to the operation involved.

1.4.3 Lexical Analysis

- The lexical-analysis phase reads the characters in the source program and groups them into a stream of tokens. Each token represents a logically cohesive sequence of characters, such as an identifier, a keyword, a punctuation character, or an operator.
- The character sequence forming a token is called the lexeme for the token.
- Certain tokens are augmented by a "lexical value". The lexical analyzer not only generates a token, but also enters the lexeme into the symbol table.
- For example, consider the expression

  b + c * 20

  The representation of this expression after lexical analysis is

  id1 + id2 * 20

1.4.4 Syntax Analysis

Syntax analysis groups tokens together into syntactic structures. For example, the three tokens representing A * B might be grouped into a syntactic structure called an expression. Expressions might further be combined to form statements. Often the syntactic structure can be regarded as a tree whose interior nodes represent strings of tokens that logically belong together.

1.4.5 Semantic Analysis

- An important component of semantic analysis is type checking. Here the compiler checks that each operator has operands that are permitted by the source-language specification.
- For example, many programming-language definitions require a compiler to report an error every time a real number is used to index an array.
- The language specification may permit some operand coercions, for example when a binary arithmetic operator is applied to an integer and a real. In this case, the compiler may need to convert the integer to a real.

1.4.6 Intermediate Code Generation

- This phase uses the structure produced by the syntax analyzer to create a stream of simple instructions: instructions with one operator and a small number of operands.
- The intermediate representation should have two important properties: it should be easy to produce, and easy to translate into the target program.
- The intermediate representation can have a variety of forms. One of the forms is called "three-address code", which is like the assembly language for a machine in which every memory location can act like a register. Three-address code consists of a sequence of instructions, each of which has at most three operands.
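For instance (our rendering, reusing the tokenized expression from Section 1.4.3 and assuming an enclosing assignment a := b + c * 20), the intermediate code generator might emit the three-address code:

  t1 := id2 * 20
  t2 := id1 + t1
  a  := t2

Each instruction has a single operator on its right side, and the compiler-generated temporaries t1 and t2 name the intermediate values.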
1.4.7 Code Optimization

Code optimization is designed to improve the intermediate code so that the ultimate object program runs faster and/or takes less space. Optimization may involve:
- detection and removal of dead (unreachable) code
- calculation of constant expressions and terms
- collapsing of repeated expressions into temporary storage
- loop unrolling
- moving code outside of loops
- removal of unnecessary temporary variables

Chapter II: LEXICAL ANALYSIS

2.1 ROLE OF THE LEXICAL ANALYZER

- The lexical analyzer reads the source program character by character to produce tokens. Normally a lexical analyzer does not return a list of tokens in one shot; it returns a token when the parser asks for one.

Fig. 2.1: Interaction of lexical analyzer with parser (source program into the lexical analyzer; tokens flow to the parser, which requests them with "get next token"; both consult the symbol table).

- The lexical analyzer also might do some housekeeping, such as eliminating white space and comments, and correlating error messages from the compiler with the source program.
- After lexical analysis, individual characters are no longer examined by the compiler; instead, tokens are used.

2.1.1 Lexical Analysis Versus Parsing

Why separate lexical analysis from parsing? The reasons are basically software-engineering concerns.

1. Simplicity of design: When one detects a well-defined subtask (produce the next token), it is often good to separate out the task (modularity). For example, a parser embodying the conventions for comments and white space is significantly more complex than one that can assume comments and white space have already been removed by a lexical analyzer.

2. Efficiency: With the tasks separated, it is easier to apply specialized techniques. For example, specialized buffering techniques for reading input characters and processing tokens can significantly speed up the performance of a compiler.

3. Portability: Input-alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer. The representation of special or non-standard symbols, such as ↑ in Pascal, can be isolated in the lexical analyzer.

2.1.2 Tokens, Patterns, Lexemes

Token: A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming language.

Examples of tokens:
- type tokens (id, num, real, ...)
- punctuation tokens (if, void, return, ...)
- alphabetic tokens (keywords)

Examples of non-tokens:
- comments, preprocessor directives, macros, blanks, tabs, newlines

Patterns: There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. Regular expressions are an important notation for specifying patterns. For example, the pattern for the token id is

  id → letter (letter/digit)*

Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. For example, the pattern for the RELOP (relational operator) token contains six lexemes (=, <, <=, <>, >, >=), so the lexical analyzer should return a RELOP token whenever it sees any one of the six.

2.1.3 Attributes for Tokens

- Since a token can represent more than one lexeme, additional information identifying the specific lexeme must be kept. This additional information is called the attribute of the token.
- For simplicity, a token may have a single attribute which holds the required information for that token. For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds the actual attributes for that token.
- Some attributes:
  - for an identifier: attr, a pointer to the symbol table
  - for an assignment operator: no attribute is needed (if there is only one assignment operator)
  - for a number: val, the actual value of the number
- The token type and the attribute together uniquely identify a lexeme.
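One common way to represent a token together with its attribute is a tagged record. The C sketch below is ours; the type and field names are illustrative.

/* Sketch: a token paired with its attribute, as described above. */
enum TokenKind { TOK_ID, TOK_NUM, TOK_RELOP, TOK_ASSIGN };

struct Token {
    enum TokenKind kind;
    union {
        int symtab_index;   /* TOK_ID: pointer/index into the symbol table */
        int value;          /* TOK_NUM: the actual numeric value */
        int relop;          /* TOK_RELOP: which of =, <, <=, <>, >, >= */
    } attr;                 /* TOK_ASSIGN carries no attribute */
};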
2.1.4 Lexical Errors

- Few errors are discernible at the lexical level alone, because a lexical analyzer has a very localized view of the source program.
- If the string fi is encountered in a C program for the first time in the context

  fi (a == f(x)) ...

  a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid identifier, the lexical analyzer must return the token for an identifier and let some other phase of the compiler handle the error.
- Suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix of the remaining input. "Panic mode" recovery may be taken: delete successive characters from the remaining input until the lexical analyzer can find a well-formed token. This recovery technique may occasionally confuse the parser, but in an interactive computing environment it may be quite adequate.
- Other possible error-recovery actions are:
  1. deleting an extraneous character
  2. inserting a missing character
  3. replacing an incorrect character by a correct character
  4. transposing two adjacent characters

2.2 INPUT BUFFERING

- This section covers some efficiency issues concerned with the buffering of input.
- Determining the next lexeme often requires reading the input beyond the end of that lexeme. For example, determining the end of an identifier normally requires reading the first white-space character after it. Also, just reading > does not determine the lexeme, since it could be >=. When you determine the current lexeme, the characters you read beyond it may need to be read again to determine the next lexeme.

Some operations on languages, for a set of letters L = {a, b, c, d}:
- L^3 = all strings of length three (using a, b, c, d)
- L* = all strings using the letters a, b, c, d, including the empty string
- L+ = the same, but not including the empty string

2.4 REGULAR EXPRESSIONS

- We use regular expressions to describe the tokens of a programming language.
- A regular expression is built up out of simpler regular expressions using a set of defining rules.
- Each regular expression denotes a language.
- A language denoted by a regular expression is called a regular set.

Rules: The regular expressions over an alphabet Σ specify a language according to the following rules.
1. ε is a regular expression that denotes {ε}, that is, the set containing the empty string.
2. If a is a symbol in the alphabet, then a is a regular expression that denotes {a}, that is, the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
   (a) (r)/(s) is a regular expression denoting L(r) ∪ L(s)
   (b) (r)(s) is a regular expression denoting L(r)L(s)
   (c) (r)* is a regular expression denoting (L(r))*
   (d) (r) is a regular expression denoting L(r)

The reason we do not include the positive closure is that, for any regular expression r, r+ = rr*.

- Unnecessary parentheses can be avoided in regular expressions using the following conventions:
  1. The unary operator * (Kleene closure) has the highest precedence and is left associative.
  2. Concatenation has the second-highest precedence and is left associative.
  3. Union (/) has the lowest precedence and is left associative.
- Under these conventions, (a(b)*)/(c) is equivalent to ab*/c.

Examples: Let Σ = {0, 1}. Then:
- 0/1 denotes {0, 1}
- (0/1)(0/1) denotes {00, 01, 10, 11}

Pattern Matching with an NFA

The combined NFA, built from all the token patterns of a lex specification, recognizes the longest prefix of the input that is matched by a pattern. In the combined NFA, there is an accepting state for each pattern Pi. When we simulate the NFA, we construct the sequence of sets of states that the combined NFA can be in after seeing each input character. Even if we find a set of states that contains an accepting state, to find the longest match we continue to simulate the NFA until it reaches termination, that is, a set of states from which there are no transitions on the current input symbol.

To find the correct match, we make two modifications to the NFA-simulation algorithm:
1. Whenever we add an accepting state to the current set of states, we record the current input position and the pattern Pi corresponding to this accepting state.
2. We continue making transitions until we reach termination; upon termination, we retract the forward pointer to the position at which the last match occurred.

DFA for Lexical Analyzers

Another approach to the construction of a lexical analyzer from a lex specification is to use a DFA to perform the pattern matching. When we convert an NFA to a DFA using the subset-construction algorithm, there may be several accepting states in a given subset of non-deterministic states. In such a situation, the accepting state corresponding to the pattern listed first in the lex specification has priority. As in the NFA simulation, the only other modification we need to perform is to continue making state transitions until we reach a state with no next state for the current input symbol. To find the lexeme matched, we return to the last input position at which the DFA entered an accepting state.
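The "remember the last accepting position, then retract" bookkeeping can be seen in miniature in a hand-written scanner fragment. The sketch below is ours, hard-coded for just the two patterns > and >=; a generated analyzer would drive an automaton instead.

/* Sketch (ours): longest match with retraction for the patterns > and >= . */
enum { GT = 1, GE = 2 };

static const char *buf = "a >= b > c";   /* input buffer */
static int forward = 2;                  /* positioned at the first '>' */

static int relop_token(void) {
    forward++;                           /* consume '>': pattern > accepts here */
    if (buf[forward] == '=') {           /* a longer match, >=, is possible */
        forward++;
        return GE;
    }
    /* Termination: no pattern extends the match. The character just
       examined stays in the buffer and is read again for the next lexeme. */
    return GT;
}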
2.8 FINITE AUTOMATA

- A recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise.
- The generalized transition diagram for a regular expression is called a finite automaton. It is a labeled directed graph: the nodes are the states and the labeled edges are the transitions.
- The automata can be grouped into two classes, as given below:
  1. non-deterministic finite automata (NFA)
  2. deterministic finite automata (DFA)
- "Non-deterministic" means that more than one transition out of a state may be possible on the same input symbol.
- Both kinds of finite automata are capable of recognizing precisely the regular sets.

Non-Deterministic Finite Automata

A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
1. a set of states S
2. a set of input symbols Σ
3. a transition function, move, that maps state-symbol pairs to sets of states
4. a state s0 distinguished as the start state
5. a set of states F distinguished as accepting (final) states

Fig. 2.11: An NFA.

Deterministic Finite Automata

A deterministic finite automaton (DFA) is a special case of an NFA in which:
1. no state has an ε-transition, i.e., a transition on input ε, and
2. for each state S and input symbol a, there is at most one edge labeled a leaving S.

A DFA has at most one transition from each state on any input. If we use a transition table to represent the transition function of a DFA, then each entry in the transition table is a single state. A DFA accepting the same language (a/b)*abb:

Fig. 2.12: DFA accepting (a/b)*abb.

For implementing finite automata, the following things must be done:
1. conversion from a regular expression to an NFA
2. conversion of the NFA to a DFA
3. minimization of the DFA

2.8.1 Construction of an NFA from a Regular Expression

Algorithm: Thompson's construction, which builds an NFA from a regular expression.
Input: a regular expression r over an alphabet Σ.
Output: an NFA N accepting L(r).
Here, the following notations are used:
- i : initial state
- f : final state
- r : regular expression
- N : NFA

(Figure: the transition diagram for the DFA, and the sequence of moves made in processing the input string ababbab.)

2.8.3 From a Regular Expression to a DFA

To construct a DFA directly from an augmented regular expression (r)#, begin by constructing a syntax tree T for (r)# and then compute four functions, nullable, firstPos, lastPos and followPos, by making traversals over T. Finally, construct the DFA from followPos.

The functions nullable, firstPos and lastPos are defined on the nodes of the syntax tree and are used to compute followPos, which is defined on the set of positions. At each node n of the syntax tree of a regular expression:
- nullable(n): true if the node n is nullable, otherwise false.
- firstPos(n): the set of positions that can match the first symbol of a string generated by the subexpression rooted at n.
- lastPos(n): the set of positions that can match the last symbol in such a string.
- followPos(i): for a position i, the set of positions that can follow i in some string generated by the regular expression.
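The DFA of Fig. 2.12 can be implemented directly as a transition table. The sketch below is ours, with the states numbered 0 to 3 and state 3 accepting.

#include <stdio.h>

/* Table-driven recognizer for (a/b)*abb, per the DFA of Fig. 2.12.
   Row = state; column 0 = on 'a', column 1 = on 'b'; state 3 accepts. */
static const int move[4][2] = {
    {1, 0},   /* state 0 */
    {1, 2},   /* state 1 */
    {1, 3},   /* state 2 */
    {1, 0},   /* state 3 */
};

static int accepts(const char *s) {
    int state = 0;
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b')
            return 0;                      /* symbol not in the alphabet */
        state = move[state][*s == 'b'];
    }
    return state == 3;
}

int main(void) {
    printf("%d\n", accepts("abb"));    /* 1: accepted */
    printf("%d\n", accepts("ababb"));  /* 1: accepted */
    printf("%d\n", accepts("ab"));     /* 0: rejected */
    return 0;
}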
Chapter III: SYNTAX ANALYSIS

3.1 INTRODUCTION

Every programming language has rules that prescribe the syntactic structure of well-formed programs. For example, a C or Pascal program is made out of blocks, a block out of statements, a statement out of expressions, an expression out of tokens, and so on. The syntax of all these high-level constructs, such as expressions, statements and blocks, can be described by context-free grammars or BNF (Backus-Naur Form) notation. The low-level constructs called tokens can be described using regular expressions, as described in the previous chapters.

3.1.1 Advantages of Grammars for Syntactic Specifications

1. A grammar gives a precise, easy-to-understand syntactic specification for a programming language.
2. An efficient parser can be constructed automatically from a properly designed grammar.
3. A grammar imparts a structure to a program that is useful for its translation into object code and for the detection of errors.
4. A language evolves over a period of time, acquiring new constructs and performing additional tasks. These new constructs can be added to the language easily if there is an existing implementation based on a grammatical description of the language.

3.1.2 The Role of the Parser

The parser, or syntactic analyzer, obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language. The parser also reports any syntax errors in the program. It should also recover from commonly occurring errors so that it can continue processing the remainder of its input.

Fig.: The parser in the front end (source program into the lexical analyzer, tokens to the parser via "get next token", parse tree onward to the rest of the front end, producing an intermediate representation).

Input to the parser: the sequence of tokens from the scanner.
Output: a parse tree.

Functions:
1. It checks that the stream of tokens can be generated by the grammar of the source language.
2. It constructs the parse tree.
3. It reports errors.
4. It performs error recovery.

Issues: The parser cannot detect errors such as:
1. variable re-declaration (whether a variable is declared already or not)
2. variable initialization before use
3. data-type mismatch in an operation

All these issues are handled by the semantic-analysis phase.

3.1.3 Types of Parsers

There are two types of parsers:
1. top-down
2. bottom-up

Top-down parsers build parse trees from the top (root) to the bottom (leaves), while bottom-up parsers start from the leaves and work up to the root. In both cases, the input is scanned from left to right, one symbol at a time. The most efficient top-down and bottom-up parsing methods work only with restricted classes of grammars:
1. LL grammars
2. LR grammars

Parsers implemented by hand often work with LL grammars. Parsers for the larger class of LR grammars are usually constructed by automated tools.

3.1.4 Syntax Error Handling

A program can contain errors at many different levels. Errors can be:
1. lexical, such as misspelling an identifier, keyword or operator;
2. syntactic, such as an arithmetic expression with unbalanced parentheses;
3. semantic, such as an operator applied to an incompatible operand;
4. logical, such as an infinitely recursive call.

Often much of the error detection and recovery in a compiler is centered around the syntax-analysis phase, due to two reasons:
1. Many errors are syntactic in nature, i.e., errors occur due to disobeying the grammar rules.
2. Modern parsing methods detect the syntactic errors in programs very efficiently.

The goals of the error handler in a parser are:
- It should report the presence of errors clearly and accurately.
- It should recover from each error quickly enough to be able to detect subsequent errors.
- It should not significantly slow down the processing of correct programs.

Several parsing methods, such as the LL and LR methods, detect an error as soon as possible.

3.1.5 Error-Recovery Strategies

A parser uses the following strategies to recover from syntactic errors:
- panic mode
- phrase level
- error productions
- global correction

(i) Panic-Mode Recovery: On discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found, without checking for additional errors. Typical synchronizing tokens are delimiters such as ; or end.

(ii) Phrase-Level Recovery: The parser may replace a prefix of the remaining input by some string that allows the parser to continue. Examples of local corrections are:
- replace a comma by a semicolon
- delete an extra semicolon
- insert a missing semicolon

(iii) Error Productions: If we have a good idea of the common errors that might be encountered, the grammar can be augmented (extended) with productions that generate the erroneous constructs. A parser constructed with this augmented grammar can then generate appropriate error diagnostics to indicate the erroneous construct that has been recognized in the input.

(iv) Global Correction: Ideally, a compiler should make a minimal sequence of changes to obtain a globally least-cost correction. There are algorithms for choosing such corrections: given an incorrect input string X and a grammar G, these algorithms will find a parse tree for a related string Y such that the number of insertions, deletions and changes of tokens required to transform X into Y is as small as possible.
3.2 CONTEXT-FREE GRAMMARS

A context-free grammar is a quadruple that consists of terminals, non-terminals, a start symbol and productions.

1. Terminals: These are the basic symbols from which strings are formed. The word "token" is a synonym for "terminal" as far as grammars for programming languages are concerned. Examples of terminals are keywords (if, then, else, etc.), operators (+, -, etc.) and special symbols (;, :, etc.).

2. Non-terminals: These are syntactic variables that denote sets of strings. For example, stmt and expr are non-terminals that denote sets of strings made of terminals.

3. A start symbol: One non-terminal in the grammar is selected as the "start symbol" or "distinguished symbol", and the set of strings it denotes is the language defined by the grammar.

4. Productions: These are otherwise called "rewriting rules"; they specify the manner in which the terminals and non-terminals can be combined to form strings. A grammar is a set of productions, and each production consists of a non-terminal, followed by an arrow symbol (→ or ::=), followed by a string of non-terminals and terminals.

3.2.1 Example of a Context-Free Grammar

An example of a context-free grammar that defines simple arithmetic expressions:

  expr → expr op expr
  expr → (expr)
  expr → - expr
  expr → id
  op → +
  op → -
  op → *
  op → /
  op → ↑

Chapter IV: PARSING

Parsing is the process of analyzing a continuous stream of input in order to determine its grammatical structure with respect to a given formal grammar. The task of the parser is essentially to determine if and how the input can be derived from the start symbol within the rules of the formal grammar. This can be done in two ways.

1. Top-down parsing: A parser can start with the start symbol and try to transform it to the input; i.e., the parser starts from the largest elements and breaks them down into incrementally smaller parts. Example: LL parsers.

2. Bottom-up parsing: A parser can start with the input and attempt to rewrite it to the start symbol; i.e., the parser attempts to locate the most basic elements, then the elements containing these, and so on. Example: LR parsers.

Parsers: A parser for grammar G is a program that takes a string ω as input and produces as output either a parse tree for ω, if ω is a sentence of G, or an error message indicating that ω is not a sentence of G. (Note that parsing is a process and a parser is a tool that does parsing.)

4.1 TOP-DOWN PARSING

Top-down parsing can be viewed as an attempt to find a leftmost derivation for an input string. It can also be viewed as an attempt to construct a parse tree for the input, starting from the root and creating the nodes of the parse tree in preorder. For example, consider the grammar

  S → cAd
  A → ab / a

and the input string ω = cad. The parse tree for this sentence is constructed top-down as follows.

Step 1: Initially create a tree with a single node labeled S. An input pointer points to c, the first symbol of ω. Expand the tree with the production of S, giving the children c, A and d. The leaf c matches the first input symbol, so the input pointer advances to a and we expand A. Trying the first alternative A → ab, the leaf a matches a, but b fails to match the remaining input symbol d; so we backtrack, reset the input pointer, and try the second alternative A → a. Now a matches a, and the last leaf d matches d, so the parse succeeds. A small recursive-descent parser realizing this strategy is sketched below.
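For a grammar this small, the backtracking procedure can be written directly. The C sketch below is ours, with the input fixed to ω = cad and error handling kept minimal.

#include <stdio.h>

/* Backtracking recursive-descent parser for S -> cAd, A -> ab / a. */
static const char *input = "cad";
static int pos = 0;

static int match(char c) {
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

static int A(void) {
    int save = pos;
    if (match('a') && match('b'))
        return 1;                 /* first alternative: A -> ab */
    pos = save;                   /* backtrack */
    return match('a');            /* second alternative: A -> a */
}

static int S(void) {
    return match('c') && A() && match('d');
}

int main(void) {
    if (S() && input[pos] == '\0')
        printf("cad is a sentence of G\n");
    else
        printf("syntax error\n");
    return 0;
}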
4.3.2 LR Grammars

A grammar for which we can construct an LR parsing table (a set of states with an action function for all the terminals and a goto function for all the non-terminals) is said to be an LR grammar.

- An LR parser does not have to scan the entire stack when the handle appears on top.
- The state symbol on top of the stack contains all the information it needs.
- By only knowing the grammar symbol, the finite automaton recognizes the handle on top of the stack. The goto function is the finite automaton.
- LR parsers can be used to make shift-reduce decisions by examining the next k input symbols, where typically k = 0 or k = 1.
- A grammar that can be parsed by an LR parser examining up to k input symbols on each move is called an LR(k) grammar.

4.3.3 Construction of SLR Parsing Tables

The "Simple LR" or "SLR" parsing table is the least powerful but easiest to implement of the LR parsing tables. A grammar for which an SLR parser can be constructed is said to be an SLR grammar. Before getting into the actual construction we need to know certain terms, as follows.

4.3.3.1 LR(0) Items

An item of a grammar G is a production of G with a dot at some position of the right side. Thus the production A → XYZ yields the four items

  A → . XYZ
  A → X . YZ
  A → XY . Z
  A → XYZ .

The production A → ε generates only the item A → . .

4.3.3.2 Sets of Items

We group the items together into sets, which give rise to the states of the SLR parser. These sets can be viewed as the states of a finite automaton. A collection of sets of LR(0) items, which we call the canonical LR(0) collection, provides the basis for constructing SLR parsers. To construct the canonical LR(0) collection for a grammar, we define:
(1) an augmented grammar, and
(2) two functions, closure and goto.

Chapter V: TYPE CHECKING

5.1 TYPE CHECKING

A compiler must check that the source program follows both the syntactic and the semantic conventions of the source language; checking of this kind done at compile time is called "static checking". Examples of static checks are:

(i) Type checks: A compiler generates an error if an operator is applied to an incompatible operand.
(ii) Flow-of-control checks: Statements that cause flow of control to leave a construct must have some place to which to transfer the flow of control, e.g. the break statement.
(iii) Uniqueness checks: An object must be defined exactly once, e.g. labels in a case statement.
(iv) Name-related checks: Sometimes, the same name must appear two or more times.

5.2 TYPE SYSTEMS

In both Pascal and C, types are either basic or derived. Basic types are atomic types with no internal structure as far as the programmer is concerned, e.g. boolean, character, integer and real. Derived types include arrays, records and sets.

Type Expressions: A type expression is either a basic type or is formed by applying an operator called a type constructor to other type expressions.
1. A basic type is a type expression.
2. A type name is a type expression.
3. A type constructor applied to type expressions is a type expression. Constructors include:
   (a) Arrays: If T is a type expression, then array(I, T) is a type expression denoting the type of an array with elements of type T and index set I, e.g.

     var A : array [1..10] of integer;

   (b) Products: If T1 and T2 are type expressions, then their Cartesian product T1 × T2 is a type expression.

A data object may occupy more space than its data strictly requires (for example, being padded to 32 bits) because it is too inefficient to store it without padding. Sometimes, due to space constraints, padding may not be possible, so that the data has to be packed, leaving no gaps. Since the machine generally expects aligned data, special instructions may then be required at run time to position packed data so that it can be operated on as if aligned.

6.3 STORAGE ALLOCATION STRATEGIES

Different storage-allocation strategies are used in run-time memory organization:
(i) Static allocation lays out storage at compile time for all data objects.
(ii) Stack allocation manages the run-time storage as a stack.
(iii) Heap allocation allocates and de-allocates storage as needed at run time from a heap.

These allocation strategies are applied to allocate memory for activation records. Different languages use different strategies for this purpose. For example, FORTRAN used static allocation, Algol uses stack allocation, and LISP uses heap allocation. The three strategies are contrasted in the sketch below.
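As a rough illustration in C terms (our example, not from the text), one program can exhibit all three strategies at once.

#include <stdio.h>
#include <stdlib.h>

/* Illustration (ours) of the three allocation strategies in C terms. */
static int counter = 0;          /* static: laid out once, at compile time */

void visit(void) {
    int local = 0;               /* stack: fresh storage for each activation */
    counter++;                   /* retains its value across calls */
    local++;                     /* always becomes 1; lost on return */
    printf("counter=%d local=%d\n", counter, local);
}

int main(void) {
    int *p = malloc(sizeof *p);  /* heap: lifetime independent of any call */
    *p = 42;
    visit();                     /* prints counter=1 local=1 */
    visit();                     /* prints counter=2 local=1 */
    printf("heap cell: %d\n", *p);
    free(p);
    return 0;
}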
6.3.1 Static Allocation

The fundamental characteristics of static allocation are as follows:
(i) Name binding occurs during compilation; there is no need for a run-time support package.
(ii) Bindings do not change at run time.
(iii) On every invocation of a procedure, its names are bound to the same storage locations. This property allows the values of local names to be retained across activations of a procedure. That is, when control returns to a procedure, the values of the locals are the same as they were when control left the last time.

For example, suppose we had the following code, written in a language using static allocation:

function F()
{
    int a;
    print(a);
    a = 10;
}

After calling F() once, if it were called a second time, the value of a would initially be 10, the value it held when control last left F.

- The type of a name determines its storage requirement. The address for this storage is an offset from the procedure's activation record, and the compiler must eventually decide where the activation records go, relative to the target code and to one another.
- After this position has been decided, the addresses of the activation records, and hence of the storage for each name in the records, are fixed.
- Thus, at compile time, the addresses at which the target code can find the data it operates upon can be filled in. The addresses at which information is to be saved when a procedure call takes place are also known at compile time.

Static allocation does have some limitations:
(i) The size of a data object, as well as any constraints on its position in memory, must be available at compile time.
(ii) No recursion is possible, because all activations of a given procedure use the same bindings for local names.
(iii) No dynamic data structures are possible, since no mechanism is provided for run-time storage allocation.
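Limitation (ii) can be made concrete with a small example of ours. Each live activation of the function below needs its own binding of n; under static allocation, all activations of fact would share a single cell for n, so the recursion could not work.

/* Our illustration: fact(3) has three simultaneously live activations,
   each needing its own n, which one statically allocated cell cannot hold. */
int fact(int n) {
    return n < 2 ? 1 : n * fact(n - 1);
}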
6.3.2 Stack Allocation

- Stack allocation is based on the idea of a control stack: storage is organized as a stack, and activation records are pushed and popped as activations begin and end, respectively.
- Storage for the locals in each call of a procedure is contained in the activation record for that call. Thus locals are bound to fresh storage in each activation, because a new activation record is pushed onto the stack when a call is made.
- The values of locals are deleted when the activation ends; i.e., the values are lost because the storage for locals disappears when the activation record is popped.

The following describes the activation records that are pushed onto and popped from the run-time stack as control flows through the activation tree of a sorting program with procedures sort, readarray, qsort and partition:
- First the procedure sort is activated.
- The activation of readarray is pushed onto the stack when control reaches the first line in the procedure sort.
- After control returns from the activation of readarray, its activation is popped.
- In the activation of sort, control then reaches a call of qsort with actuals 1 and 9, and an activation of qsort(1, 9) is pushed onto the top of the stack.
- At a later stage, the activations for partition(1, 3) and qsort(1, 0) have begun and ended during the life of qsort(1, 3), so their activation records have come and gone from the stack, leaving the activation record for qsort(1, 3) on top.

Dangling References

- A dangling reference is a reference to storage that has been de-allocated.
- It is a logical error to use dangling references, since the value of de-allocated storage is undefined according to the semantics of most languages.
- Since that storage may later be allocated to another datum, mysterious bugs can appear in programs with dangling references.

Example:

int *dangle();

int main()
{
    int *p;
    p = dangle();   /* dangling reference */
}

int *dangle()
{
    int i = 23;
    return &i;
}

The pointer is created by the address-of operator applied to i. When control returns from dangle, the storage for locals is freed and can be used for other purposes. Since p in main refers to this storage, the use of p is a dangling reference.

6.3.3 Heap Allocation

- The limitations of stack allocation were mentioned in the previous section; in those cases, the de-allocation of activation records cannot occur in last-in first-out fashion.
- Heap allocation gives out pieces of contiguous storage for activation records.
- Pieces may be de-allocated in any order, so over time the heap will consist of alternating areas that are free and in use. The heap manager is supposed to make use of the free space.
- For efficiency reasons it may be helpful to handle small activations as a special case: for each size of interest, keep a linked list of free blocks of that size.
- Fill a request of size s with a block of size s', where s' is the smallest available size greater than or equal to s.
- For large blocks of storage, use the heap manager.

Chapter VII: INTERMEDIATE CODE GENERATION

- Intermediate codes are machine-independent codes, but they are close to machine instructions.
- The given program in a source language is converted to an equivalent program in an intermediate language by the intermediate code generator.
- The advantages of using a machine-independent intermediate form are as follows:
  1. Retargeting is facilitated: a compiler for a different machine can be created by attaching a back end for the new machine to an existing front end.
  2. A machine-independent code optimizer can be applied to the intermediate representation.
- For intermediate code generation, there is a notational framework that is an extension of context-free grammars. This framework is called a syntax-directed translation scheme. It allows subroutines or semantic actions to be attached to the productions of a grammar; these generate intermediate code when called at appropriate times by a parser for that grammar.
- There are two notations for associating semantic rules with productions:
  (i) syntax-directed definitions
  (ii) translation schemes
- Syntax-directed definitions are high-level specifications for translations. They hide many implementation details, and the user need not specify the order in which translation takes place.
- Translation schemes indicate the order in which semantic rules are to be evaluated, so they allow some implementation details to be shown.

7.1 SYNTAX-DIRECTED TRANSLATION

- The definition of a syntax-directed translation specifies the translation of a construct in terms of attributes associated with its syntactic components.
- The definition uses a context-free grammar to specify the syntactic structure of the input. With each grammar symbol it associates a set of attributes, and with each production a set of semantic rules for computing the values of the attributes associated with the symbols appearing in that production.

7.2 SYNTAX TREES

An (abstract) syntax tree is a condensed form of parse tree useful for representing language constructs. The construction of a syntax tree for an expression is similar to the translation of the expression into postfix form.

Fig.: Syntax tree for b - 5 + c.

Each node in a syntax tree can be implemented as a record with several fields. The following functions are used to create the nodes of syntax trees for expressions with binary operators:
1. mknode(op, left, right): creates an operator node with label op and two fields containing pointers to left and right.
2. mkleaf(id, entry): creates an identifier node with label id and a field containing entry, a pointer to the symbol-table entry for the identifier.
3. mkleaf(num, val): creates a number node with label num and a field containing val, the value of the number.

Example: the syntax tree for the expression b - 5 + c is built by the sequence

1. P1 := mkleaf(id, entry for b);
2. P2 := mkleaf(num, 5);
3. P3 := mknode('-', P1, P2);
4. P4 := mkleaf(id, entry for c);
5. P5 := mknode('+', P3, P4);

In this sequence, P1, P2, ..., P5 are pointers to nodes, and "entry for b" and "entry for c" are pointers to the symbol-table entries for the identifiers b and c respectively.
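In C, these constructors might be sketched as below. The record layout is our assumption; for simplicity we split mkleaf into two functions rather than overloading on the label.

#include <stdlib.h>

/* Sketch of syntax-tree node constructors; the layout is illustrative. */
typedef struct Node {
    int label;                  /* operator character, or 'I' / 'N' tags */
    struct Node *left, *right;  /* children, for operator nodes */
    void *entry;                /* symbol-table entry, for id leaves */
    int value;                  /* numeric value, for num leaves */
} Node;

Node *mknode(int op, Node *left, Node *right) {
    Node *n = calloc(1, sizeof *n);
    n->label = op; n->left = left; n->right = right;
    return n;
}

Node *mkleaf_id(void *entry) {
    Node *n = calloc(1, sizeof *n);
    n->label = 'I'; n->entry = entry;   /* 'I' stands in for the id label */
    return n;
}

Node *mkleaf_num(int val) {
    Node *n = calloc(1, sizeof *n);
    n->label = 'N'; n->value = val;
    return n;
}

/* b - 5 + c, mirroring steps P1..P5 above:
   mknode('+', mknode('-', mkleaf_id(b_entry), mkleaf_num(5)), mkleaf_id(c_entry)) */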
Symbol Tables for Nested Procedures

For the above procedures, entries for x and a, together with pointers to the symbol tables of readarray, exchange and quicksort, are created in the symbol table of sort. Similarly, entries for k and v and a pointer to the symbol table of partition are created in the symbol table of quicksort. The headers of the symbol tables of quicksort and partition contain pointers back to sort and quicksort respectively.

Fig. 7.1: Symbol tables for nested procedures.

Creating Symbol Tables

The semantic rules are defined in terms of the following operations:
1. mktable(previous): creates a new symbol table and returns a pointer to the new table. The argument previous points to the table of the enclosing procedure; it is recorded in a header of the new symbol table.
2. enter(table, name, type, offset): creates a new entry for name in the symbol table pointed to by table.
3. addwidth(table, width): records the cumulative width of the entries of a table in its header.
4. enterproc(table, name, newtable): creates a new entry for procedure name in the symbol table pointed to by table; newtable is a pointer to the symbol table for this procedure.

Type Checking in Expressions

Only a fragment of the semantic action for E → E1 + E2 survives in this excerpt: the action compares the types of E1 and E2, and in the case where the result is real it executes

  emit(E.place ':=' E1.place 'real+' E2.place); E.type := real

and otherwise sets E.type := type_error.

Example: for the declarations

  real x, y; int i, j;

the assignment x := y + i * j generates the code

  t1 := i int* j
  t2 := inttoreal t1
  t3 := y real+ t2
  x := t3

7.6 BOOLEAN EXPRESSIONS

Boolean expressions are composed of the boolean operators applied to elements that are boolean variables or relational expressions. In programming languages, boolean expressions serve two primary purposes:
1. They are used to compute logical values.
2. They are used as conditional expressions in statements that alter the flow of control, such as if-then-else or while-do statements.

The boolean operators used are: and, or, not. The following grammar generates boolean expressions:

  E → E or E / E and E / not E / (E) / id relop id / true / false

7.6.1 Methods of Translating Boolean Expressions

There are two principal methods of representing the value of a boolean expression:
(i) Encode true and false numerically and evaluate a boolean expression analogously to an arithmetic expression. Normally 1 is used for true and 0 for false.
(ii) Represent the value of a boolean expression by a position reached in the program; this flow-of-control method is used for conditions in branching statements.
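Under the numerical encoding, a relational expression such as a < b is typically translated into three-address code that stores 1 or 0 into a temporary. The instruction numbering below is ours.

  100:  if a < b goto 103
  101:  t := 0
  102:  goto 104
  103:  t := 1
  104:  ...

If a < b holds, control jumps to 103 and t becomes 1 (true); otherwise t is set to 0 at 101 and control skips over 103.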
7.11 Procedure Calls

When control returns from a called procedure, the following actions are performed:
- If the called procedure returns a value, the value is placed where the calling procedure can find it.
- The activation record of the calling procedure is restored.
- A jump to the return address (in the calling procedure) is generated.

7.11.3 Syntax-Directed Translation Scheme of Procedure Call

(1) S → call id (Elist)
      { for each item P on queue do emit('param' P);
        emit('call' id.place) }

The code for S is the code for Elist, which evaluates the arguments, followed by a param statement for each argument, followed by a call statement.

(2) Elist → Elist, E
      { append E.place to the end of queue }

(3) Elist → E
      { initialize queue to contain only E.place }

Here queue is emptied and then gets a single pointer to the symbol-table location for the name that denotes the value of E.

7.12 SYMBOL TABLE

A compiler uses a symbol table to keep track of scope and binding information about names. The symbol table is searched every time a name is encountered in the source text; changes to the table occur if a new name or new information about an existing name is discovered. A symbol-table mechanism must allow us to add new entries and find existing entries efficiently.

7.12.1 Implementation

Each entry of a symbol table can be implemented as a record consisting of several fields, depending upon the information to be saved about the name. Since the information about a name depends on the usage of the name, to keep symbol-table records uniform it may be convenient for some of the information about a name to be kept outside the table entry, with only a pointer to this information stored in the record.

7.12.2 Entering Information into the Symbol Table

Information is entered into the symbol table at various times. Keywords may be entered into the symbol table before lexical analysis begins. Alternatively, if the lexical analyzer itself recognizes reserved keywords, then they need not appear in the symbol table at all. A symbol-table entry for a name is created when the syntactic role played by this name is discovered.

Storing the Characters of a Name

If there is a modest upper bound on the length of a name, then the characters in the name can be stored directly in the symbol-table entry, in a fixed-size name field of each record.

If there is no limit on the length of a name, an indirect scheme can be used, in which a separate array of characters called the string table is used to store the names, and a pointer to the name is stored in the symbol-table record.

Fig. 7.6: A string table (each symbol-table record holds a pointer to the name and its attributes; the characters of all names are stored consecutively in one array).

7.12.3 Storage-Allocation Information

Information about the storage locations that will be bound to names at run time is kept in the symbol table. If the target code is assembly language, then the assembler can take care of storage for the various names: after generating assembly code for the program, the compiler scans the symbol table and generates assembly-language data definitions, one for each name, to be appended to the assembly program. If machine code is to be generated by the compiler, however, the position of each data object relative to a fixed origin, such as the beginning of an activation record, must be ascertained.

The various data structures used for implementing the symbol table are:
1. linear lists
2. search trees
3. hash tables

Hash Tables

With a hash table, comparatively few comparisons are required in order to locate an entry, so this method has advantages over the linear-list organisation. The basic hashing scheme consists of two parts:
a. a hash table consisting of a fixed array of m pointers to table entries, and
b. table entries organized into m separate linked lists, called buckets. Each record in the symbol table appears on exactly one of these lists.

Fig.: A hash table (an array of list headers; each bucket is a linked list of name-and-info records).

To enter a name into the symbol table, we find the hash value of the name by applying a hash function, which maps the name into an integer between 0 and m-1. Using the value generated by the hash function as an index into the hash table, we search the list of symbol-table records built on that hash index; if the name is not present in that list, we create a record for the name and insert it at the head of the list built on that hash index.

The retrieval of the information associated with a name is done as follows: first, the hash value of the name is obtained, and the list built on this hash value is searched for the record containing the information about the name.
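A minimal version of this bucket scheme in C might look as follows. This is our sketch: the hash function and the bucket count are arbitrary choices.

#include <stdlib.h>
#include <string.h>

/* Sketch (ours) of the bucket scheme: m = 211 buckets, chained entries. */
#define M 211

struct Entry {
    char *name;               /* the lexeme */
    struct Entry *next;       /* next record on the same bucket list */
    /* attribute fields (type, offset, ...) would follow */
};

static struct Entry *bucket[M];

static unsigned hash(const char *s) {    /* illustrative hash function */
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % M;
}

struct Entry *lookup(const char *name) {
    for (struct Entry *e = bucket[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

struct Entry *insert(const char *name) { /* insert at the head of its list */
    unsigned h = hash(name);
    struct Entry *e = malloc(sizeof *e);
    e->name = malloc(strlen(name) + 1);
    strcpy(e->name, name);
    e->next = bucket[h];
    bucket[h] = e;
    return e;
}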
Chapter VIII: CODE GENERATION

Fig. 8.4: The code generator in context (source program into the front end, intermediate code through the code optimizer to the code generator, target program out; all phases consult the symbol table).

If the code generator takes the intermediate code after code optimization has been performed, then the compiler is called an "optimizing" compiler; it produces more efficient target code.

8.1 ISSUES IN THE DESIGN OF A CODE GENERATOR

Since the code-generation phase is system dependent, the following issues arise during code generation:
1. input to the code generator (intermediate code)
2. the target program
3. memory management
4. instruction selection
5. register allocation
6. evaluation order

8.1.1 Input to the Code Generator

The input to the code generator is intermediate code, which may be of several forms:
a. a linear representation, such as postfix notation
b. a three-address representation, such as quadruples
c. a virtual-machine representation, such as stack-machine code
d. a graphical representation, such as syntax trees or DAGs

Chapter IX: CODE OPTIMIZATION

9.1 INTRODUCTION

The term "code optimization" refers to techniques a compiler can employ in an attempt to produce a better object-language program than the most obvious one for a given source program. Compilers that apply code-improving transformations are called optimizing compilers.

Optimizations are classified into two categories:
(i) Machine-independent optimizations: program transformations that improve the target code without taking into consideration any properties of the target machine.
(ii) Machine-dependent optimizations: program transformations that use properties of the target machine, e.g. register allocation and utilization of special machine-instruction sequences.

In this section, we review the options available to a programmer and a compiler for creating efficient target programs.

Criteria for Code-Improving Transformations

The best program transformations are those that yield the most benefit for the least effort. The transformations provided by an optimizing compiler should have the following properties:
(i) A transformation must preserve the meaning of the program. That is, an optimization must not change the output produced by a program for a given input, or cause an error.
(ii) A transformation must speed up the program by a measurable amount.
(iii) A transformation must be worth the effort.

Achieving Better Performance

Better performance is achieved by reducing the running time of a program. This can be done by improving the program at all levels, from the source level to the target level. At the source level, the options available to the programmer are to use a better algorithm and to implement a given algorithm so that fewer operations are performed. The optimization can be done as follows:
Here the variable x15 eliminate Code Elimination 2 Lagranseg esis Q «my Beat ei ree na program ifthevalusconained ine 1 Oe subsequent, iable is said 10 «so program if the value contained in, Avariable i said to be dead ata point in a prog! Ned inti, theother hand, the variable is never been used. ‘The code containing suc performed by eliminating such ha variable supposed to be a dead code. An optimization egy adead code. Example: x= is considered is not used in the progan as dead code-if the value assigned to. Example: if(@=1) { a=b+5; ie Here, if statement is a dead code because this condition will never get satisfied. Hence! statement can be eliminated. (iv) Constant Folding The substitution of values for names whose values are constant a= 3.14157/2 can be replaced by a= 1.570 there by eliminating a division Operation. 9.2.2 Loop Optimization Another. important place fc imizati | ner. or optimization is lo jing ti am improved if we decrease the number of instructions ¢, eo amount code outside the loop. ns in an inner loop, even if we Three techniques for loop optimization are (i) Code motion (ii) Induction variable elimination (iii) Reduction in strength. a Syntax Analysis 33 2. Syntactic, such as an arithmetic expression with unbalanced paranthesis. 3. Semantic, such as an operator applied to an incompatible operand. 4. Logical, such as an infinitely recursive calls. Often much of the error detection and recovery in a compiler is centered around the syntax analysis phase due to two reasons. 1. Many errows are syntactic in nature. (i.e) errors occurs due to disobeying the grammar rules. 2. Modern parsing methods detect the syntactic errors in programs very efficiently. The goals of the error handler in Parser are: % It should report the presence of errors clearly and accurately. X€_ It should recover from each error quickly enough to be able to detect subsequent errors, x {tshould not significantly slow down the processing of correct programs. Several parsing methods such as the LL and LR methods detect an error as soon as possible. 3.1.5. Error-Recovery Strategies A parser uses the following strategies to recover from a syntactic errors. %€ Panic mode %€ Phrase level % Error protections € Global corrections. (i) Panic-mode Recovery On discovering an error, the parser discards in set of synchronizing token is found without chee tokens are delimiters such as; or end. iput symbols one at atime until one ofa designated king for additional errors. (eg,) for synchronizing (ii) Phrase-level Recovery may replace a prefix of the remaining input by some string that allows the Parser to continue. (eg.) for local corrections are: ¥ replace a comma by a semicolon X€ delete a extra semicolon € insert a missing semicolon, ) Error Productions Ifwe have a good idea of the common errors that might be encountered, the grammar can Pe augmented (extended) with productions that generate erroneous constructs. Parser can then
