Need for compilers

When we speak about compilers, we mainly talk about two types of languages: the source language and the machine language. Machine language, or the target language, is the one a CPU or computer can understand; it consists of strings of 0s and 1s. Since it is difficult for humans to write, debug and test programs in binary, we prefer writing programs in a high-level language such as C, C++ or Java. To convert programs written in a high-level language into a language understandable by the CPU, we use translators called compilers.

A compiler's most important goals are:
1. Correctness: all valid programs must compile correctly.
2. The size of the generated code should be as small as possible.
3. The time taken to execute the translated code should be reasonably low.
4. The overall cost should be low.

Analysis & Synthesis model
The translation process of a compiler works much the same way humans translate sentences from one language to another. For example, when a person is given a sentence to translate, say in English, he first recognizes the characters and words of the sentence. This is exactly what happens in the lexical analysis phase of a compiler. Next we try to understand the structure of the sentence, i.e. we work out whether each word is a noun, pronoun, verb or adjective. In syntax analysis we do exactly the same, and thus check whether the sentence or statement is legal and can be generated from the grammar of the programming language. Once the sentence structure is understood, we try to understand the meaning of the sentence; the semantic analysis phase of the compiler handles this task. As their names suggest, the first three phases of a compiler (lexical, syntax and semantic analysis) deal with analyzing and understanding the source code. The next three phases (intermediate code generation, optimization and code generation) do the actual conversion of the source program to the target program.
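As a minimal sketch of the first of these phases, the following Python snippet tokenizes a single statement with regular expressions. The token categories and the sample statement are illustrative assumptions, not part of the notes:

```python
import re

# Token specification: each syntactic category is a regular expression.
# These categories are assumed for illustration only.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+(?:\.\d+)?"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("SKIP",       r"\s+"),
]

def tokenize(source):
    """Lexical analysis: turn a character stream into (token, lexeme) pairs."""
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    tokens = []
    for m in re.finditer(pattern, source):
        if m.lastgroup != "SKIP":       # stripping whitespace is a secondary task
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("position = initial + rate * 60"))
```

Each pair in the output is a token (the syntactic category) together with its lexeme (the matched character sequence), mirroring the token/lexeme distinction discussed below.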

Why translate in steps/phases?
In order to translate high-level code to machine code, one needs to go step by step, with each step doing a particular task and passing its output to the next step in the form of another program representation. The steps can be parse tree generation, high-level intermediate code generation, low-level intermediate code generation, and then the machine language conversion. As the translation proceeds, the representation becomes more and more machine specific, increasingly dealing with registers, memory locations and the like. Translating in steps lets each step handle a reasonably simple, logical and well-defined task.

Lexical Analysis:
Lexical analysis is the first phase of the compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. It also correlates error messages from the compiler with the source program; for example, the lexical analyzer can keep track of the number of newline characters seen, so that a line number can be associated with an error message. It also performs certain secondary tasks, such as stripping comments and whitespace characters (blank, tab and newline characters) from the source code.

Token: A token is a syntactic category. Sentences consist of a string of tokens. For example, number, identifier, keyword and string are tokens.

Lexeme: The sequence of characters making up a token is a lexeme. For example, 100.01, counter, const and "How are you?" are lexemes.

The scanner is tasked with determining that the input stream can be divided into valid symbols of the source language, but has no smarts about which token should come where. Token structure is modeled through regular expressions, and that structure is recognized through finite state automata. Few errors can be detected at the lexical level alone, because the scanner has a very localized view of the source program without any context. The scanner can report characters that are not valid tokens (e.g. an illegal or unrecognized symbol) and a few other malformed entities (illegal characters within a string constant, unterminated comments, etc.). It does not look for or detect garbled sequences, tokens out of place, undeclared identifiers, misspelled keywords, mismatched types and the like.

SLIDE 2
How do we implement a lexical analyzer? A simple and efficient way to build a lexical analyzer is to construct a diagram that illustrates the structure of the tokens, and then hand-translate the diagram into a program for finding tokens. There are three possibilities:
- Write the lexical analyzer in assembly language, explicitly managing read and write I/O. This is the most efficient way and produces very fast lexical analyzers, but this approach is very difficult to implement and maintain.
- Write the lexical analyzer in a conventional systems programming language such as C, using the I/O facilities of that language to read the input. Even this approach is reasonably efficient.
- Use tools like lex. This approach is the easiest to implement and maintain.

SLIDE 3
lex is a lexical analyzer generator. You specify the scanner you want in the form of patterns to match and actions to apply for each token. lex takes your specification, generates a combined NFA to recognize all your patterns, converts it to an equivalent DFA, minimizes the automaton as much as possible, and generates C code that will implement it.

Syntax Analysis:
SLIDE 1
Before we speak about syntax analysis, we need to know what a grammar is. Just as natural languages such as English have a grammar associated with them, every computer programming language has grammar rules that prescribe the syntactic structure of programs. A grammar gives a precise, unambiguous specification of a programming language. We represent all the program constructs, such as conditional statements and looping statements, with such a grammar. For example, consider the grammar S -> (S)S | ε, where S represents a string of balanced parentheses. When the syntax analyzer is given a string by the lexical analyzer, it checks whether the string can be generated from the available grammar. For certain classes of parsing techniques, the grammar should be an unambiguous grammar.

Ambiguous grammar:
E -> E A E | (E) | -E | id | num
A -> + | - | * | /

Unambiguous grammar:
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> id | num
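The balanced-parentheses grammar S -> (S)S | ε can be recognized by a tiny top-down (recursive-descent) parser, with one function standing in for the nonterminal S. A minimal Python sketch, not part of the original notes:

```python
def parse(s):
    """Recognizer for the grammar S -> (S)S | epsilon (balanced parentheses)."""
    def S(i):
        # Try production S -> (S)S; otherwise apply S -> epsilon.
        if i < len(s) and s[i] == "(":
            i = S(i + 1)                      # parse the inner S
            if i >= len(s) or s[i] != ")":
                raise SyntaxError("expected ')'")
            return S(i + 1)                   # parse the trailing S
        return i                              # epsilon: consume nothing

    try:
        return S(0) == len(s)                 # accept only if all input consumed
    except SyntaxError:
        return False

print(parse("(()())"))   # balanced
print(parse("(()"))      # unbalanced
```

Each grammar production corresponds directly to a branch in the function, which is why top-down parsers are comparatively easy to write by hand.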

The syntax analysis phase verifies that the string of tokens can be generated by the grammar of the source language. In case of any syntax errors in the program, the parser tries to report as many errors as possible. Error reporting and recovery form a very important part of the syntax analyzer. The error handler in the parser has the following goals:
- It should report the presence of errors clearly and accurately.
- It should recover from each error quickly enough to be able to detect subsequent errors.
- It should not significantly slow down the processing of correct programs.

A top-down parser begins with the start symbol at the top of the parse tree and works downward, driving productions in forward order until it gets to the terminal leaves. As the name suggests, bottom-up parsing works in the opposite direction from top-down: a bottom-up parse starts with the string of terminals itself and builds from the leaves upward, working backwards to the start symbol by applying the productions in reverse. Along the way, a bottom-up parser searches for substrings of the working string that match the right side of some production. When it finds such a substring, it reduces it, i.e. substitutes the left-side nonterminal for the matching right side. The goal is to reduce all the way up to the start symbol and report a successful parse. In general, bottom-up parsing algorithms are more powerful than top-down methods, but not surprisingly, the constructions required are also more complex. It is difficult to write a bottom-up parser by hand for anything but trivial grammars, but fortunately there are excellent parser generator tools like yacc that build a parser from an input specification, not unlike the way lex builds a scanner to your spec.

yacc is a parser generator: it is to parsers what lex is to scanners. You provide a grammar specification as input, and it generates an LALR(1) parser to recognize sentences in that grammar. yacc stands for "yet another compiler compiler", and it is probably the most common of the LALR tools out there. Our programming projects are configured to use the updated version bison, a close relative of the yak, but all of the features we use are present in the original tool. Our course web page includes a link to an online bison user's manual for those who really want to dig deep and learn everything there is to learn about parser generators.

Semantic Analysis:
Parsing only verifies that the program consists of tokens arranged in a syntactically valid combination. Semantic analysis is the front end's penultimate phase and the compiler's last chance to weed out incorrect programs. For a program to be semantically valid, all variables, functions, classes, etc. must be properly defined, expressions and variables must be used in ways that respect the type system, access control must be respected, and so forth. As the compiler encounters a new declaration, it records the type information assigned to that identifier. Then, as it continues examining the rest of the program, it verifies that the type of an identifier is respected in terms of the operations being performed. For example, the type of the right-side expression of an assignment statement should match the type of the left side, and the left side needs to be a properly declared and assignable identifier. The parameters of a function should match the arguments of a function call in both number and type. In many languages, identifiers have to be declared before they are used. The language may require that identifiers be unique, thereby forbidding two global declarations from sharing the same name.
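The declare-before-use and assignment-type checks described above can be sketched with a symbol table. The following Python snippet works over a hypothetical mini-language whose statement format is assumed purely for illustration:

```python
# A toy semantic checker. Statements are either ("decl", type, name) or
# ("assign", name, expr_type); this format is an assumption for the sketch.
def check(program):
    symbols = {}              # symbol table: identifier -> declared type
    errors = []
    for stmt in program:
        if stmt[0] == "decl":
            _, typ, name = stmt
            if name in symbols:                  # identifiers must be unique
                errors.append(f"redeclaration of {name}")
            else:
                symbols[name] = typ              # record type at declaration
        elif stmt[0] == "assign":
            _, name, expr_type = stmt
            if name not in symbols:              # declare-before-use rule
                errors.append(f"undeclared identifier {name}")
            elif symbols[name] != expr_type:     # RHS type must match LHS
                errors.append(f"type mismatch assigning {expr_type} to {name}")
    return errors

print(check([("decl", "int", "x"), ("assign", "x", "int"),
             ("assign", "y", "int"), ("assign", "x", "double")]))
```

A real semantic analyzer would also handle scopes, function signatures and compound types, but the pattern is the same: declarations populate the table, later uses are checked against it.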

These are examples of the things checked in the semantic analysis phase.

Types and Declarations:
A type is a set of values and a set of operations operating on those values. In many languages, for example in Pascal, a programmer must first establish the name and type of any data object (e.g. variable, function, type, etc.); in addition, the programmer usually defines its lifetime. A declaration is a statement in a program that communicates this information to the compiler.

Base types: int, char, double, float, bool, string, etc. These are the primitive types provided directly by the underlying hardware. There may be a facility for user-defined variants on the base types (such as C enums).

Compound types: arrays, structs, records, unions, pointers, classes, etc. These types are constructed as aggregations of the base types and simpler compound types.

Type Checking:
Type checking is the process of verifying that each operation executed in a program respects the type system of the language. For example: arithmetic operands will need to be numeric, perhaps even of the exact same type (no automatic int-to-double conversion, for instance); the mod operator should be applied only to integer operands; indexing is applied only to array types; and dereferencing is applied only to a pointer type.

Designing a Type Checker:
When designing a type checker for a compiler, here is the process:
1. identify the types that are available in the language
2. identify the language constructs that have types associated with them
3. identify the semantic rules for the language

1. Identify the types that are available in the language: we have base types (int, double, bool, string) and compound types (arrays, classes, interfaces).

2. Identify the language constructs that have types associated with them:
- Constants: obviously, every constant has an associated type.
- Variables: all variables (global, local, and instance) must have a declared type of one of the base types or the supported compound types.
- Functions: functions have a return type, and each parameter in the function definition has a type, as does each argument in a function call.
- Expressions: an expression can be a constant, variable, function call, or some operator (binary or unary) applied to expressions. Each expression has a type based on the type of the constant, variable, return type of the function, or type of the operands.

3. Identify the semantic rules for the language. Rules to parse a variable declaration:
VariableDecl -> Variable ;
Variable -> Type identifier
Type -> int | bool | double | string | identifier | Type[]
The scanner stores the name for an identifier lexeme, which the parser records as an attribute attached to the token. When reducing the Variable production, we have the type associated with the Type symbol (passed up from the Type production) and the name associated with the identifier symbol (passed from the scanner). We create a new variable declaration, declaring that identifier to be of that type, which can be stored in a symbol table for lookup later on.

Code Optimization:
- Induction variable elimination can reduce the number of additions (or subtractions) in a loop, and improve both run-time performance and code space.
- Strength reduction is a compiler optimization in which expensive operations are replaced with equivalent but less expensive ones, which reduces running time. The classic example of strength reduction converts "strong" multiplications inside a loop into "weaker" additions.
- Constant folding is the process of simplifying constant expressions at compile time. Terms in constant expressions are typically simple literals, such as the integer 2, but can also be variables whose values are never modified, or variables explicitly marked as constant.
- Dead code elimination is a compiler optimization that removes code that does not affect the program. Dead code includes code that can never be executed (unreachable code), and code that only affects dead variables, that is, variables that are irrelevant to the program. Removing such code has two benefits: it shrinks program size, an important consideration in some contexts, and it lets the running program avoid executing irrelevant operations.
- Loop interchange is the process of exchanging the order of two iteration variables.

Code Generation:
Code generation is the process by which a compiler's code generator converts some internal representation of source code into a form (e.g. machine code) that can be readily executed by a machine. The input to the code generator typically consists of a parse tree or an abstract syntax tree. The tree is converted into a linear sequence of instructions, usually in an intermediate language such as three-address code. In addition to the basic conversion from an intermediate representation into a linear sequence of machine instructions, a typical code generator tries to optimize the generated code in some way: it may try to use faster instructions, use fewer instructions, exploit available registers, and avoid redundant computations. Tasks which are typically part of a sophisticated compiler's "code generation" phase include:
 Instruction selection: which instructions to use. Instruction selection can be implemented with a backwards dynamic programming algorithm, which computes the "optimal" tiling for each point starting from the end of the program and working back from there; it can also be implemented with a greedy algorithm that chooses a local optimum at each step.
 Instruction scheduling: in which order to put those instructions. Instruction scheduling is a compiler optimization used to improve instruction-level parallelism; it is a speed optimization that can have a critical effect on machines with instruction pipelines.
 Register allocation: the allocation of variables to processor registers. Register allocation is the process of assigning a large number of target program variables onto a small number of CPU registers. Two variables in use at the same time cannot be assigned to the same register without corrupting one of their values. Variables which cannot be assigned to some register must be kept in RAM and loaded in/out for every read/write, a process called spilling; accessing RAM is significantly slower than accessing registers and slows down the execution speed of the compiled program, so an optimizing compiler aims to assign as many variables to registers as possible. Register allocation can happen over a basic block (local register allocation), over a whole function/procedure (global register allocation), or across functions as part of a calling convention (interprocedural register allocation).
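Two of the optimizations above, constant folding and dead code elimination, can be sketched over a toy three-address-code list. The (dest, op, arg1, arg2) instruction format below is an assumption made for illustration, not a standard IR:

```python
# Toy optimization passes over a made-up three-address-code format:
# each instruction is (dest, op, arg1, arg2).
def constant_fold(code):
    """Replace operations whose operands are all literals with their value."""
    folded, consts = [], {}
    for dest, op, a, b in code:
        a = consts.get(a, a)                     # propagate known constants
        b = consts.get(b, b)
        if op == "+" and isinstance(a, int) and isinstance(b, int):
            consts[dest] = a + b                 # fold at "compile time"
            folded.append((dest, "const", a + b, None))
        else:
            folded.append((dest, op, a, b))
    return folded

def eliminate_dead(code, live_out):
    """Drop instructions whose result is never used later (dead code)."""
    live, kept = set(live_out), []
    for dest, op, a, b in reversed(code):        # walk backwards, tracking liveness
        if dest in live:
            kept.append((dest, op, a, b))
            live.discard(dest)
            live.update(x for x in (a, b) if isinstance(x, str))
        # else: dest is dead here, so the instruction is removed
    return list(reversed(kept))

code = [
    ("t1", "+", 2, 3),        # both operands literal: folds to 5
    ("t2", "+", "t1", 4),     # t1 becomes a known constant: folds to 9
    ("t3", "+", "x", 1),      # depends on variable x, cannot fold
    ("t4", "+", 1, 1),        # result never used: dead code
]
print(eliminate_dead(constant_fold(code), live_out={"t3"}))
```

Note how the two passes interact: folding turns t1 and t2 into constants that nothing reads any more, and dead code elimination then removes them along with t4, leaving only the one instruction that feeds a live value.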
