UNIT - I

Introduction: Language Processing, Structure of a Compiler, The Evolution of Programming Languages, The Science of Building a Compiler, Applications of Compiler Technology, Programming Language Basics.

Lexical Analysis: The role of the lexical analyzer, input buffering, specification of tokens, recognition of tokens, the lexical-analyzer generator Lex.

Introduction:

Programming languages can be categorized into the following levels:
* Low level or machine level language.
* Assembly level language.
* High level language.

A low level language is one that consists only of binary digits (0 and 1). Since it uses binary digits to construct instructions, it is also called binary language. An assembly level language is one that uses symbolic or mnemonic representations for instructions or commands. A high level language is one that uses English-like, near-natural-language statements to construct instructions.

Since the computer system can understand only machine language, instructions written in assembly level and high level languages have to be translated into machine-understandable language. These translations are done by a set of programs called language translators.

In other words, a language translator can be defined as system software that converts a program written in assembly level language or high level language into an equivalent machine language program. It takes as input a program written in one programming language (the source language) and produces as output a program in another language (the object or target language).

An assembler is the system software that takes an assembly level program as its input and converts it into a machine language program as its output.

A compiler is the system software that takes a high level program as its input and gives the equivalent machine language program as its output.

Shafiullah Academy: Miracles happen everyday.
Executing a program written in a high level programming language is a two-step process:
* The source program must first be compiled, i.e., translated into the object program.
* The resulting object program is loaded into memory and executed.

The first compiler was the FORTRAN compiler, which took 18 staff-years to implement. Nowadays a compiler can be built with much less effort, and can even be implemented as a student project. The principal developments of the past twenty years which led to this improvement are:
1. The understanding of how to organize and modularize the process of compilation.
2. The discovery of systematic techniques for handling many of the important tasks that occur during compilation.
3. The development of software tools that facilitate the implementation of compilers and compiler components.

Translator

The term translator denotes any language processor that accepts programs in some source language as input and produces equivalent programs in an object language. A translator is a program which performs a translation from a programming language (PL) into the machine language of a computer. Apart from translation, a translator should have error-detection capability. Any violation of the high-level language specification should be detected and reported to the programmer. The working of a translator is shown in fig.

[Figure: working of a translator. Source program -> Translator -> Machine language program; the machine language program takes Data and produces Results.]

Ex: assemblers, compilers, interpreters, loaders or link editors, preprocessors or macro processors, etc.

Types of Translator: the following are the types of translators:
1. Preprocessors
2. Compilers
3. Assemblers

Language Processors:
An integrated software development environment includes many different kinds of language processors such as compilers, interpreters, assemblers, linkers, loaders, etc.

Language-processing system:-

The preprocessor may expand shorthands, called macros, into source language statements. The modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and is easier to debug. The assembly language is then processed by a program called an assembler that produces relocatable machine code as its output.

Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other relocatable object files and library files into the code that actually runs on the machine. The linker resolves external memory addresses, where the code in one file may refer to a location in another file. The loader then puts together all of the executable object files into memory for execution.

Preprocessor:

A preprocessor is a program that processes its input data (i.e., the source program) to produce output (i.e., a modified source program) that is used as input to the compiler. The preprocessor expands shorthands, called macros, into source language statements. The preprocessor is shown in fig.

[Figure: Source program (extended form of HLL) -> Preprocessor -> Target program (standard format of HLL).]

A preprocessor may perform the following functions:
(a) Macro-processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
(b) File inclusion: A preprocessor may include header files into the program text.
(c) Rational preprocessors: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
(d) Language extensions: These preprocessors add capabilities to the language by what amounts to built-in macros. For example, the language Equel is a database query language embedded in C. Statements beginning with ## are taken by the preprocessor to be database-access statements, unrelated to C, and are translated into procedure calls on routines that perform the database access.

Compiler:-

A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language.

[Figure: High-level language program -> Compiler -> Object program.]

An important role of the compiler is to report any errors in the source program that it detects during the translation process.

Aims of Compilers
* Fast compilation
* Efficient object code generation
* Error detection capability
* Generation of programs with short execution time

Compilers are the language processors that translate the source program into a machine language program prior to execution. Once the source program is translated, the machine language program is the one that is executed.

Types of Compiler

Ideal Compiler: An ideal compiler should produce object code that is smaller in size and executes faster, and the compiler itself should take less time to compile.

Cross Compiler: This runs on one machine but produces object code for another machine.

Incremental Compiler: This compiler allows only the modified portion of a program to be recompiled. An incremental compiler is designed to combine the main advantages of interpreters and compilers.

Interpreter:-

An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter directly executes the operations specified in the source program on inputs supplied by the user.
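The idea of executing source operations directly, instead of producing a target program, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy statement format `name = operand op operand`, the function name `interpret`, and the operator table `OPS` are hypothetical and do not come from the notes.

```python
import operator

# Hypothetical toy language: every statement is "name = operand op operand".
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def interpret(source, inputs):
    """Execute each statement directly, one at a time, on user-supplied inputs."""
    env = dict(inputs)  # current values of all variables
    for line_no, line in enumerate(source.strip().splitlines(), start=1):
        target, expr = (part.strip() for part in line.split("="))
        left, op, right = expr.split()
        try:
            # An operand is either a known variable or a numeric constant.
            a = env[left] if left in env else float(left)
            b = env[right] if right in env else float(right)
        except ValueError:
            # The error is reported against the statement being executed.
            raise NameError(f"line {line_no}: undefined name in '{line.strip()}'")
        env[target] = OPS[op](a, b)
    return env
```

For instance, `interpret("t = x * 2\ny = t + 1", {"x": 3})` executes the two statements in order and returns the final variable values. No target program is ever produced, and an error is reported with the number of the statement that was being executed when it occurred.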
The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error messages than a compiler, because it executes the source program statement by statement.

Assembler:-

The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and is easier to debug. The assembly language is then processed by a program called an assembler, which produces relocatable machine code as its output.

[Figure: Source program (assembly code) -> Assembler -> Target program (machine code).]

Assembly code is the mnemonic version of machine code, in which names are used instead of binary codes for operations, and names are also given to memory addresses.

Each reading of the source program is called a pass. An assembler that reads the input program once is called a one-pass assembler, and one that reads it twice is called a two-pass assembler. Therefore, assemblers are of two types:
1. Single pass (one-pass assembler)
2. Multiple pass (two-pass assembler)

One-pass assembler: The assembly program is read once and converted to an intermediate form, and thereafter stored in a table in memory.

Two-pass assembler: Two-pass assemblers make two passes over the input. In the first pass, all the identifiers that denote storage locations are identified and stored in a symbol table. In the second pass, the assembler translates each operation code into the sequence of bits representing that operation in machine language. We can conclude that:
1. In pass one of a two-pass assembler, the definitions of symbols, statement labels, etc., are collected and stored in a table known as the symbol table.
2. In pass two, each statement can be read, assembled, and output, as the values of all symbols are known.

Linker:-

The linker is a program which links the object programs of functions to the main program.
A linker is a system program that accepts a set of object modules as input, resolves external references, and produces a single output module ready for loading. In actual practice, a complete program is built from many smaller routines, possibly written by many people. All these routines have to be logically connected and linked to form a single program. A linker accounts for and reconciles all address references within and among modules, and replaces those references with a single consistent scheme of relative addresses. Linking is done after the code is generated and is closely associated with loading.

A linker program that accepts a set of object modules as input and produces a single output program is shown in fig.

[Figure: the object modules Main, Sort, Search, and Count are linked into a single program.]

Compilers or other translators basically translate one procedure at a time and put the translated output on the disk. All the translated procedures have to be located and linked together to be run as a unit called an executable binary program.

In MS-DOS, Windows OS, etc., object modules have the extension .obj and the executable binary program has the extension .exe. In UNIX, object modules have the extension .o and executable programs have no extension.

There are two main types of linking, namely, static and dynamic.
1. Static Linking: All references are resolved at link time, before the program is executed.
2. Dynamic Linking: References made to code in an external module are resolved during run time. It takes advantage of the full capabilities of virtual memory. The disadvantage is the considerable overhead and complexity incurred by postponing these actions until run time.

Loader:-

The loader loads the program from the hard disk into main memory, loads the starting address of the program into the program counter (PC), and makes the program ready for execution.
A loader is a system program that loads machine language programs into memory and prepares them for execution, i.e., loaders are the system programs that load the binary code into memory and make it ready for execution. The loader transfers control to the first instruction and is responsible for locating the program in main memory every time it is executed.

There are various loading schemes:
* Assemble-and-go Loader: The assembler simply places the code into memory, and the loader executes a single instruction that transfers control to the starting instruction of the assembled program. In this scheme, some portion of the memory is used by the assembler itself, which would otherwise have been available for the object program.
* Absolute Loader: Object code must be loaded at absolute addresses in memory to run. If there are multiple subroutines, then each absolute address has to be specified explicitly.
* Relocating Loader: This loader modifies the actual instructions of the program while loading it, so that the effect of the load address is taken into account.

Bootstrapping

Bootstrapping is the concept of obtaining a compiler for a language by using a compiler for a subset of the same language, i.e., compiling a compiler written in its own language. Using the facilities offered by a language to compile itself is the essence of bootstrapping.

For bootstrapping purposes, a compiler is characterized by three languages, namely, the source language S that it compiles, the target language T that it generates code for, and the implementation language I that it is written in. We represent the three languages using a diagram called a T-diagram, as shown in fig.

[Figure: T-diagram with source language S, target language T, and implementation language I.]

Suppose we have a cross-compiler for a new language L, written in implementation language S, that generates code for machine N; that is, we create $L_S N$.
If an existing compiler for S runs on machine M and generates code for M, it is characterized by $S_M M$. If $L_S N$ is run through $S_M M$, we get a compiler $L_M N$, that is, a compiler from L to N that runs on M. This process is illustrated in Fig.

[Figure: the T-diagrams for $L_S N$, $S_M M$, and $L_M N$ joined together.]

When T-diagrams are joined together as depicted in fig, the implementation language S of the compiler $L_S N$ must be the same as the source language of the existing compiler $S_M M$, and the target language M of the existing compiler must be the same as the implementation language of the translated form $L_M N$.

$L_S N + S_M M = L_M N$

Structure of a compiler

The compilation process can be subdivided into two main parts. They are
1. Analysis and
2. Synthesis

Analysis phase:- The analysis part is often called the front end of the compiler. In the analysis phase, an intermediate representation is created from the given source program. The Lexical Analyzer, Syntax Analyzer, and Semantic Analyzer are the parts of this phase. It breaks up the source program and checks whether the source program is syntactically and semantically correct. It provides informative error messages, so that the user can take corrective action. The analysis part also collects information about the source program and stores it in a data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part.

Synthesis phase:- The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table. The Intermediate Code Generator, Code Optimizer, and Code Generator are the parts of this phase. The synthesis part is the back end of the compiler.

Phases of a Compiler:- The compilation process operates as a sequence of phases. Each phase transforms the source program from one representation into another representation.
A compiler is decomposed into phases as shown in Fig. The symbol table, which stores information about the entire source program, is used by all phases of the compiler. During the compilation process, each phase can encounter errors. The error handler is the component that reports the presence of errors clearly and accurately. It also determines how to recover from each error quickly enough to detect subsequent errors.

Lexical Analyzer:-

The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the source program and groups the characters into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the form <token-name, attribute-value>.

A token is a sequence of characters having a collective meaning. Tokens are mainly of two kinds, viz.
1. Fixed elements of the language such as keywords, the vocabulary of the language, operators, signs, etc.
2. Identifiers and constants.

Examples of Tokens:
* Constants: 123, 4566
* Identifiers: AH2035B, Speed
* Operator symbols: +, -, *, =
* Keywords: if, goto, while, do, switch, etc.
* Punctuation symbols: (, ), [, ], :, ;

Consider an assignment statement: X=A+B*C. This would be grouped into the following tokens:
1. The identifier X
2. The assignment symbol =
3. The identifier A
4. The plus sign
5. The identifier B
6. The multiplication sign
7. The identifier C

The blanks separating the characters are eliminated during lexical analysis. When the lexical analyzer finds the identifier X, it generates a token, say id. The identifiers X, A, B, and C are the lexemes. The statement after lexical analysis is given by: id1=id2+id3*id4, where id1, id2, id3, and id4 are the tokens for X, A, B, and C respectively. This output is passed to the subsequent phase, i.e., syntax analysis.

Syntax Analyzer:-

The second phase of the compiler is syntax analysis or parsing.
The parser uses the tokens produced by the lexical analyzer to create a syntax tree. In a syntax tree, each interior node represents an operation and the children of the node represent the arguments of the operation.

The syntax analyzer also determines the structure of the source string, which may be represented by syntax trees. It basically involves grouping the tokens of the statements into grammatical phrases that the compiler uses to generate the final output. The grammatical phrases of the source program are usually represented by a parse tree.

Consider the structure of the statement: a=4*6.0-b. A parse tree is a structural representation of the input being parsed. A parse tree for the string a=4*6.0-b is shown in fig.

[Figure: parse tree for a=4*6.0-b. The root "assignment statement" has the children identifier (a), "=", and expression; the expression subtree contains the integer 4, the real 6.0, and the identifier b.]

Nowadays, there are tools to generate parsers. For example, in UNIX systems, a tool called YACC (Yet Another Compiler Compiler) is available for this purpose.

Semantic Analyzer:-

The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic errors. It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.

An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands. For example, in a language that does not allow arithmetic on character data, a variable defined as type char cannot be used in arithmetic operations. The semantic analyzer takes the syntax tree as its input and produces an annotated (type-checked) syntax tree as its output.

Example: if a real number is used to index an array, i.e., A[1.5], then the compiler will report an error. This error is detected during semantic analysis.
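The type-checking step described above can be sketched over a small syntax tree. This is a minimal illustration, not a complete semantic analyzer: the node classes `Num`, `Var`, `BinOp`, and `Index`, the function `type_of`, and the rule that mixing int and real yields real are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class Num:            # numeric literal: int or float
    value: object

@dataclass
class Var:            # reference to a declared variable
    name: str

@dataclass
class BinOp:          # interior node: an operation and its two arguments
    op: str
    left: object
    right: object

@dataclass
class Index:          # array access such as A[i]
    array: str
    index: object

def type_of(node, symbols):
    """Return 'int' or 'real' for an expression, or raise TypeError."""
    if isinstance(node, Num):
        return "int" if isinstance(node.value, int) else "real"
    if isinstance(node, Var):
        return symbols[node.name]          # type recorded in the symbol table
    if isinstance(node, BinOp):
        left = type_of(node.left, symbols)
        right = type_of(node.right, symbols)
        # Assumed rule: int combined with real is widened to real.
        return "int" if left == right == "int" else "real"
    if isinstance(node, Index):
        if type_of(node.index, symbols) != "int":
            raise TypeError(f"index of array {node.array} must be an integer")
        return symbols[node.array]         # element type of the array
    raise TypeError(f"unknown node: {node!r}")
```

With `symbols = {"b": "real", "A": "int"}`, the tree `BinOp("-", BinOp("*", Num(4), Num(6.0)), Var("b"))` for 4*6.0-b checks as real, while `Index("A", Num(1.5))` raises a TypeError, which corresponds to the A[1.5] error in the example.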
Intermediate Code Generator:-

In the process of translating a source program into target code, a compiler may construct one or more intermediate representations, which can have a variety of forms. Syntax trees are a form of intermediate representation; they are commonly used during syntax and semantic analysis.

After syntax and semantic analysis of the source program, many compilers generate an intermediate representation. This intermediate representation should have two important properties: it should be easy to produce and it should be easy to translate into the target code.

One form of intermediate code is three-address code, which consists of a sequence of assembly-like instructions with at most three operands per instruction.

(or)

Some compilers generate an explicit intermediate representation of the source program after syntax and semantic analysis. This intermediate representation can be thought of as a program for an abstract machine and should have two main properties, viz.
1. It should be easy to produce.
2. It should be easy to translate into the target program.

Intermediate representations have a variety of forms, and there are many algorithms for generating intermediate code for typical programming language constructs. One common form is three-address code. The aim is to choose an intermediate language rich enough to bridge the source programs and the target programs.

Suppose we have to write compilers for m languages targeted at n machines. The obvious approach would be to write m*n compilers. When the intermediate code generated is the same for all languages and machines, only m front ends and n back ends are needed, so intermediate code generation reduces the number of back-end conversions required across the machines.
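Generating three-address code from a syntax tree can be sketched as a post-order walk that introduces one temporary per interior node. This is a minimal illustration under assumptions: the tuple tree format `(op, left, right)`, the function name `gen_three_address`, and the temporary names `t1`, `t2`, ... are hypothetical choices for this example.

```python
import itertools

def gen_three_address(target, expr):
    """Flatten a nested (op, left, right) tuple tree into three-address code."""
    code = []
    counter = itertools.count(1)      # numbering for temporaries t1, t2, ...

    def walk(node):
        if not isinstance(node, tuple):
            return str(node)          # leaf: identifier or constant
        op, left, right = node
        left_name = walk(left)        # evaluate operands first (post-order)
        right_name = walk(right)
        temp = f"t{next(counter)}"
        code.append(f"{temp} = {left_name} {op} {right_name}")
        return temp

    code.append(f"{target} = {walk(expr)}")
    return code

# X = A + B * C, written as a tree that respects operator precedence
for line in gen_three_address("X", ("+", "A", ("*", "B", "C"))):
    print(line)
# prints:
# t1 = B * C
# t2 = A + t1
# X = t2
```

Each emitted instruction has at most one operator on its right-hand side, which is what makes this form easy to translate into target instructions later.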
Code Optimizer:-

The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result. The target code generated should execute faster and consume less power.

(or)

It tries to improve the intermediate code to achieve faster-running machine code. The code optimizer transforms the intermediate code to improve execution time and memory usage. The overall output of the code must not change after optimization. Some examples of code optimization techniques are common subexpression elimination, dead code elimination, loop optimization, etc.

Common subexpression elimination avoids recomputing an expression by making use of a previously computed value. For example, consider the following program fragment:

[Program fragment not recoverable from the source.]

The optimizer can substitute y=x and z=x at the appropriate points, or detect i*2 as a common subexpression.

Dead code elimination means the removal of unreachable and useless code in the program. Consider the following fragment:

[Program fragment not recoverable from the source.]

In this fragment, the else branch will never be executed, since the value of x cannot be greater than or equal to 100.

Loop optimization is another major target, because when a program is in execution a lot of time is spent in loops. There are various ways to perform optimizations inside a loop. For example, if a statement such as TEMP=5 appears inside a loop and is not affected by the other statements, it can be moved outside the loop. Such an optimization is called code motion or frequency reduction.

Code Generator:-

The code generator takes the intermediate representation of the source program and converts it into the target code. If the target language is machine code, registers or memory locations are selected for each of the variables used by the program.
Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task.

(or)

It generates the target code. The final phase of the compiler is the generation of the target code, which normally consists of relocatable machine code or assembly code. Memory locations are selected for all the variables used by the program. The intermediate instructions are then translated into a sequence of machine instructions that perform the same task.

Suppose we have the expression x=4*6.0+y. The code generator outputs the target code:

    LOAD R1, 4
    MUL R1, 6.0
    ADD R1, y
    STR x

In the first instruction, 4 is loaded into register R1. The second instruction multiplies the value stored in register R1 by 6.0. The third instruction adds the value in y to the previous result. The final result is stored in x.

Symbol Table:-

The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name. The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly.

(or)

It is an essential function of a compiler to record the identifiers used in the source program and to collect information about the various attributes of each identifier. A symbol table is a data structure containing a record for each identifier, with the fields of the record holding the attributes of the identifier. The symbol table keeps track of the attributes of each symbol, such as its name, type (int, char, etc.), size (in bytes), and address. This helps us to locate the record for each identifier easily and to store or retrieve data from the record quickly. When an identifier in the source program is detected during lexical analysis, its information is stored in the symbol table.
The remaining phases of the compiler enter the information about the attributes of the identifier.

Error Handler:-

The error handler should report the presence of an error. It must report the place in the source program where the error is detected. Common programming errors can occur at many different levels.
* Lexical errors include misspellings of identifiers, keywords, or operators.
* Syntax errors include misplaced semicolons or extra or missing braces.
* Semantic errors include type mismatches between operators and operands.
* Logical errors can be anything from incorrect reasoning on the part of the programmer to the use in a C program of the assignment operator = instead of the comparison operator ==.
* The intermediate code generator may detect an operator whose operands have incompatible types.
* The code optimizer, during control flow analysis, may detect that certain statements can never be reached.
* The code generator may find a compiler-created constant that is too large to fit in a word of the target machine.
* While entering information into the symbol table, the symbol table management may discover an identifier that has been declared multiple times.

The main goals of the error handler are to:
1. Report the presence of errors clearly and accurately.
2. Recover from each error quickly enough to detect subsequent errors.
3. Add minimal overhead to the processing of correct programs.

Both the symbol table management and the error handler interact with all phases of the compiler.

[Figure: Phases of the Compiler with inputs and outputs.]
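How symbol-table management and error reporting interact can be sketched with a small table that records one attribute set per identifier and reports repeated declarations. This is a minimal illustration under assumptions: the class name `SymbolTable`, its methods, and the chosen attribute fields are hypothetical and are not taken from any particular compiler.

```python
class SymbolTable:
    """One record per identifier, keyed by name for fast lookup."""

    def __init__(self):
        self._records = {}   # name -> dictionary of attributes
        self.errors = []     # collected messages; compilation can continue

    def declare(self, name, type_, size, line):
        """Enter a new identifier; report it if it was declared before."""
        if name in self._records:
            first = self._records[name]["line"]
            self.errors.append(
                f"line {line}: '{name}' already declared at line {first}")
            return False
        self._records[name] = {
            "type": type_, "size": size, "line": line, "address": None}
        return True

    def lookup(self, name):
        """Return the record for name, or None if it is not declared."""
        return self._records.get(name)

    def set_attribute(self, name, key, value):
        """Later phases fill in attributes such as the address."""
        self._records[name][key] = value
```

A dictionary is used here so that finding a record takes roughly constant time. A duplicate declaration is stored as a message with its line number instead of stopping the run, which matches the goal of recovering quickly enough to detect subsequent errors.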
Show the output generated by each phase for the following expression:

Pos = newval + oldval * 10

[Figure: the expression Pos = newval + oldval * 10 passing through the lexical analyzer, syntax analyzer, and semantic analyzer, with symbol-table entries 2 (newval) and 3 (oldval).]
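The first step of this exercise, lexical analysis, can be sketched as follows. This is a minimal illustration under assumptions: the function name `tokenize`, the regular expression, and the token spelling `<id,n>` (where n is the position of the identifier in the symbol table) are choices made for this example; the identifier names newval and oldval are written without spaces.

```python
import re

# One alternative per token class: identifier, number, or operator.
TOKEN_RE = re.compile(
    r"(?P<id>[A-Za-z_]\w*)|(?P<num>\d+(?:\.\d+)?)|(?P<op>[=+\-*/])")

def tokenize(text):
    """Return (tokens, symbol_table) for a simple assignment statement."""
    symtab = []              # identifier lexemes; entry i has index i + 1
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():          # blanks are eliminated
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(
                f"unexpected character {text[pos]!r} at position {pos}")
        pos = match.end()
        if match.group("id"):
            lexeme = match.group("id")
            if lexeme not in symtab:     # enter each identifier only once
                symtab.append(lexeme)
            tokens.append(f"<id,{symtab.index(lexeme) + 1}>")
        elif match.group("num"):
            tokens.append(f"<{match.group('num')}>")
        else:
            tokens.append(f"<{match.group('op')}>")
    return tokens, symtab

tokens, symtab = tokenize("Pos = newval + oldval * 10")
print(" ".join(tokens))   # <id,1> <=> <id,2> <+> <id,3> <*> <10>
print(symtab)             # ['Pos', 'newval', 'oldval']
```

The token stream is what the syntax analyzer receives next, and the symbol table entries 1, 2, and 3 correspond to Pos, newval, and oldval.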
