A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code). The most common reason for transforming source code is to create an executable program. The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a lower-level language (e.g., assembly language or machine code). If the compiled program can run on a computer whose CPU or operating system differs from the one on which the compiler runs, the compiler is known as a cross-compiler. A program that translates from a low-level language to a higher-level one is a decompiler. A program that translates between high-level languages is usually called a language translator, source-to-source translator, or language converter. A language rewriter is usually a program that translates the form of expressions without a change of language.

A compiler is likely to perform many or all of the following operations: lexical analysis, preprocessing, parsing, semantic analysis (syntax-directed translation), code generation, and code optimization. Program faults caused by incorrect compiler behavior can be very difficult to track down and work around; therefore, compiler implementors invest significant effort to ensure the correctness of their software. The term compiler-compiler is sometimes used to refer to a parser generator, a tool often used to help create the lexer and parser.

Structure of a compiler
Compilers bridge source programs in high-level languages with the underlying hardware. A compiler must 1) determine the correctness of the syntax of programs, 2) generate correct and efficient object code, 3) organize the run-time environment, and 4) format output according to assembler and/or linker conventions. A compiler consists of three main parts: the front end, the middle end, and the back end.

The front end checks whether the program is correctly written in terms of the programming language syntax and semantics. Here legal and illegal programs are recognized, and errors, if any, are reported in a useful way. Type checking is also performed by collecting type information. The front end then generates an intermediate representation (IR) of the source code for processing by the middle end.

The middle end is where optimization takes place. Typical transformations are removal of useless or unreachable code, discovery and propagation of constant values, relocation of computation to a less frequently executed place (e.g., out of a loop), and specialization of computation based on the context. The middle end generates another IR for the back end. Most optimization efforts are focused on this part.

The back end is responsible for translating the IR from the middle end into assembly code. Target instructions are chosen for each IR instruction, and register allocation assigns processor registers to program variables where possible. The back end exploits the hardware by figuring out how to keep parallel execution units busy, fill delay slots, and so on. Although most optimization problems are NP-hard, heuristic techniques are well developed.
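The three-part structure can be sketched on a toy scale. The following Python sketch (the function names front_end, middle_end, and back_end are illustrative, and Python's own ast module stands in for a hand-written parser) parses an arithmetic expression, folds constants in the middle end, and emits two-address pseudo-assembly in the back end:

```python
import ast

def front_end(src):
    """Front end: parse and syntax-check; the AST serves as the IR."""
    return ast.parse(src, mode="eval").body

def middle_end(node):
    """Middle end: constant folding, one classic optimization (2*3 -> 6)."""
    if isinstance(node, ast.BinOp):
        node.left = middle_end(node.left)
        node.right = middle_end(node.right)
        if isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
            ops = {ast.Add: lambda a, b: a + b, ast.Mult: lambda a, b: a * b}
            return ast.Constant(ops[type(node.op)](node.left.value, node.right.value))
    return node

def back_end(node, code, reg=0):
    """Back end: emit MOV/ADD/MUL pseudo-instructions; naive register use."""
    if isinstance(node, ast.Constant):
        code.append(f"MOV #{node.value}, R{reg}")
        return
    if isinstance(node, ast.Name):
        code.append(f"MOV {node.id}, R{reg}")
        return
    back_end(node.left, code, reg)
    back_end(node.right, code, reg + 1)
    op = "ADD" if isinstance(node.op, ast.Add) else "MUL"
    code.append(f"{op} R{reg + 1}, R{reg}")

code = []
back_end(middle_end(front_end("2 * 3 + x")), code)
print(code)   # constant folding leaves only: load 6, load x, add
```

Because the middle end folds 2 * 3 before the back end runs, the emitted code never computes the product at run time.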
Phases of compiler:

The main task of the lexical analyzer is to read a stream of characters as input and produce a sequence of tokens, such as names, keywords, punctuation marks, etc., for the syntax analyzer. It discards the white space and comments between tokens and also keeps track of line numbers. <fig: 3.1 pp. 84>
• Tokens, Patterns, Lexemes
• Specification of Tokens
  o Regular Expressions
  o Notational Shorthand
• Finite Automata
  o Nondeterministic Finite Automata (NFA)
  o Deterministic Finite Automata (DFA)
  o Conversion of an NFA into a DFA
  o From a Regular Expression to an NFA

Tokens, Patterns, Lexemes

Token
A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language. Examples of tokens:

• Type tokens (id, num, real, . . . )
• Punctuation tokens (IF, void, return, . . . )
• Alphabetic tokens (keywords)

Examples of non-tokens:

Comments, preprocessor directives, macros, blanks, tabs, newlines, . . .

Patterns
There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. Regular expressions are an important notation for specifying patterns. For example, the pattern for the Pascal identifier token, id, is: id → letter (letter | digit)*.
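The identifier pattern can be tried out directly with a regular-expression library; this is a minimal sketch in Python, where the pattern string is a direct transcription of letter (letter | digit)*:

```python
import re

# letter (letter | digit)* -- a direct transcription of the id pattern.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")

print(bool(ID_PATTERN.fullmatch("count2")))   # True: letter, then letters/digits
print(bool(ID_PATTERN.fullmatch("2count")))   # False: must start with a letter
```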

Lexeme
A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. For example, the pattern for the RELOP token matches six lexemes (=, <>, <, <=, >, >=), so the lexical analyzer should return a RELOP token to the parser whenever it sees any one of the six.
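A lexer must try the longer lexemes first so that <= is not split into < followed by =. A minimal sketch (the function name next_relop is illustrative):

```python
# Try longer lexemes first so "<=" is recognized as one RELOP,
# not as "<" followed by "=".
RELOP_LEXEMES = ["<>", "<=", ">=", "<", ">", "="]

def next_relop(text, pos):
    """Return (token, lexeme, new_pos) if a RELOP starts at pos, else None."""
    for lexeme in RELOP_LEXEMES:
        if text.startswith(lexeme, pos):
            return ("RELOP", lexeme, pos + len(lexeme))
    return None

print(next_relop("a<=b", 1))   # ('RELOP', '<=', 3) -- longest match wins
```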

3.3 Specification of Tokens

An alphabet or a character class is a finite set of symbols. Typical examples of symbols are letters and digits. The set {0, 1} is the binary alphabet. ASCII and EBCDIC are two examples of computer alphabets.

Strings
A string over some alphabet is a finite sequence of symbols taken from that alphabet. For example, banana is a sequence of six symbols (i.e., a string of length six) taken from the ASCII computer alphabet. The empty string, denoted by ε, is the special string with zero symbols (i.e., string length 0).

If x and y are two strings, then the concatenation of x and y, written xy, is the string formed by appending y to x. For example, if x = dog and y = house, then xy = doghouse. For the empty string ε, we have Sε = εS = S.

String exponentiation concatenates a string with itself a given number of times:

S2 = SS
S3 = SSS
S4 = SSSS

and so on. By definition, S0 is the empty string ε, and S1 = S. For example, if x = ba and y = na, then xy2 = banana.

Languages
A language is a set of strings over some fixed alphabet. The language may contain a finite or an infinite number of strings. Let L and M be two languages where L = {dog, ba, na} and M = {house, ba}. Then:

• Union: L ∪ M = {dog, ba, na, house}
• Concatenation: LM = {doghouse, dogba, bahouse, baba, nahouse, naba}
• Exponentiation: L2 = LL
• By definition: L0 = {ε} and L1 = L

The Kleene closure of a language L, denoted by L*, is "zero or more concatenations of" L:

L* = L0 ∪ L1 ∪ L2 ∪ L3 ∪ . . . ∪ Ln ∪ . . .

For example, if L = {a, b}, then L* = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, . . . }.

The positive closure of a language L, denoted by L+, is "one or more concatenations of" L:

L+ = L1 ∪ L2 ∪ L3 ∪ . . . ∪ Ln ∪ . . .

For example, if L = {a, b}, then L+ = {a, b, aa, ab, ba, bb, aaa, aab, aba, . . . }.
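The closures are infinite sets, but enumerating their members up to a length bound is straightforward; a minimal sketch (the function name closure_up_to is illustrative):

```python
from itertools import product

def closure_up_to(alphabet, max_len, positive=False):
    """All concatenations of symbols from `alphabet` up to max_len symbols.
    positive=False includes the empty string (Kleene closure L*);
    positive=True starts at one concatenation (positive closure L+)."""
    start = 1 if positive else 0
    words = []
    for n in range(start, max_len + 1):
        words += ["".join(p) for p in product(sorted(alphabet), repeat=n)]
    return words

print(closure_up_to({"a", "b"}, 2))                 # starts with '' (epsilon)
print(closure_up_to({"a", "b"}, 2, positive=True))  # no empty string
```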
Code generation:

Phases of a typical compiler and the position of code generation. <fig: 9.1 - page 513>

Since optimal code generation is an undecidable problem (mathematically speaking), we must be content with heuristic techniques that generate "good" code (not necessarily optimal code). Code generation must do the following things:
• Produce correct code
• Make use of the machine architecture
• Run efficiently

Issues in the Design of a Code Generator
The code generator is concerned with:

1. Memory management
2. Instruction selection
3. Register utilization (allocation)
4. Evaluation order

1. Memory Management
Mapping names in the source program to addresses of data objects is done cooperatively by pass 1 (the front end) and pass 2 (the code generator). Quadruples → address instructions.

Local variables (local to functions or procedures) are stack-allocated in the activation record, while global variables are in a static area.

2. Instruction Selection
The nature of the instruction set of the target machine determines selection. Selection is "easy" if the instruction set is regular, that is, uniform and complete. Uniform: all operations accept the same addressing forms. Complete: any register can be used for any operation. If we don't care about the efficiency of the target program, instruction selection is straightforward. For example, for the three-address code

a := b + c
d := a + e

inefficient assembly code is:

1. MOV b, R0      R0 ← b
2. ADD c, R0      R0 ← c + R0
3. MOV R0, a      a ← R0
4. MOV a, R0      R0 ← a
5. ADD e, R0      R0 ← e + R0
6. MOV R0, d      d ← R0

Here the fourth statement is redundant, and so is the third statement if 'a' is not subsequently used.

3. Register Allocation
Registers can be accessed faster than memory words, so frequently accessed variables should reside in registers (register allocation). Register assignment is picking a specific register for each such variable. Formally, there are two steps in register allocation:

1. Register allocation (which variables?): select the set of variables that will reside in registers.
2. Register assignment (which register?): pick the specific register that will hold each such variable.

Note that this is an NP-complete problem. Some of the issues that complicate the register allocation problem:

1. Special uses of hardware: for example, some instructions require specific registers.
2. Software conventions: for example,
• Register R6 (say) always holds the return address.
• Register R5 (say) is the stack pointer.
• Similarly, registers are assigned for branch and link, frames, heaps, etc.
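The redundant moves noted in the instruction-selection example can be removed by a simple peephole pass; a minimal sketch (the representation of instructions as tuples is illustrative):

```python
def peephole(code):
    """Drop a MOV y, x that immediately follows MOV x, y:
    after the first move, the value is already in both places."""
    out = []
    for instr in code:
        if out:
            prev_op, prev_src, prev_dst = out[-1]
            op, src, dst = instr
            if op == "MOV" and prev_op == "MOV" and (src, dst) == (prev_dst, prev_src):
                continue  # redundant: value already where this MOV would put it
        out.append(instr)
    return out

code = [("MOV", "b", "R0"), ("ADD", "c", "R0"), ("MOV", "R0", "a"),
        ("MOV", "a", "R0"), ("ADD", "e", "R0"), ("MOV", "R0", "d")]
print(peephole(code))   # the fourth instruction, MOV a, R0, is dropped
```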

3. Choice of Evaluation Order
Changing the order of evaluation may produce more efficient code. Finding the best order is an NP-complete problem, but we can bypass this hindrance by generating code for quadruples in the order in which they are produced by the intermediate code generator. For example, reordering

ADD x, y, T1
ADD a, b, T2

is legal because x, y and a, b are different (not dependent).

The Target Machine
Familiarity with the target machine and its instruction set is a prerequisite for designing a good code generator. A typical architecture target machine is:

1. Byte addressable (factor of 4)
2. 4 bytes per word
3. 16 to 32 (or n) general-purpose registers
4. Two-address instructions of the form: op source, destination, e.g., MOV A, B; ADD A, D

An alternative typical architecture target machine is:

1. Bit addressable (factor of 1)
2. Word-sized general-purpose registers
3. Three-address instructions of the form: op source1, source2, destination, e.g., ADD A, B, C

For the cost examples below, we assume byte-addressable memory with 4 bytes per word and n general-purpose registers, R0, R1, . . . , Rn-1. Each integer requires 2 bytes (16 bits). Instructions are two-address, of the form: mnemonic source, destination. The addressing modes are:

MODE                FORM     ADDRESS                       EXAMPLE            ADDED COST
Absolute            M        M                             ADD temp, R1       1
Register            R        R                             ADD R0, R1         0
Indexed             c(R)     c + contents(R)               ADD 100(R2), R1    1
Indirect register   *R       contents(R)                   ADD *R2, R1        0
Indirect indexed    *c(R)    contents(c + contents(R))     ADD *100(R2), R1   1
Literal             #c       constant c                    ADD #3, R1         1

Instruction costs:
Each instruction has a cost of 1 plus the added costs for the source and destination:

cost of instruction = 1 + cost of source address mode + cost of destination address mode

This cost corresponds to the length (in words) of the instruction.

Examples
1. Move register to memory: M ← R0
   MOV R0, M
   cost = 1 + 1 = 2

2. Indirect indexed mode:
   MOV *4(R0), M
   cost = 1 (instruction word) + 1 (indirect-indexed source) + 1 (address of M) = 3

3. Indexed mode:
   MOV 4(R0), M
   cost = 1 + 1 + 1 = 3

4. Literal mode:
   MOV #1, R0
   cost = 1 + 1 = 2

5. Move memory to memory:
   MOV m, m
   cost = 1 + 1 + 1 = 3
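The cost rule is mechanical enough to compute automatically; a minimal sketch that classifies each operand by its addressing mode and applies the added costs from the table (the function names are illustrative):

```python
import re

def operand_cost(op):
    """Added cost of one operand: 0 for register and indirect-register
    modes, 1 for modes needing an extra word (absolute, indexed,
    indirect indexed, literal)."""
    if re.fullmatch(r"R\d+", op):        # register mode: R0, R1, ...
        return 0
    if re.fullmatch(r"\*R\d+", op):      # indirect register mode: *R0
        return 0
    return 1                             # absolute, c(R), *c(R), #c

def instruction_cost(instr):
    """1 for the instruction word plus the added cost of each operand."""
    _, operands = instr.split(None, 1)
    return 1 + sum(operand_cost(o.strip()) for o in operands.split(","))

print(instruction_cost("MOV R0, M"))        # 2
print(instruction_cost("MOV *4(R0), M"))    # 3
print(instruction_cost("MOV #1, R0"))       # 2
```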
