
Chapter 1: Introduction to Compiling

Homework: Read chapter 1.

1.1: Language Processors

A compiler is a translator from one language, the input or source language, to another language, the output or target language. Often, but not always, the target language is an assembly language or the machine language for a computer processor. Note that using a compiler requires a two-step process to run a program.

1. Execute the compiler (and possibly an assembler) to translate the source program into a machine-language program.
2. Execute the resulting machine-language program, supplying appropriate input.

This should be compared with an interpreter, which accepts the source-language program and the appropriate input, and itself produces the program output.

Sometimes both compilation and interpretation are used. For example, consider typical Java implementations. The (Java) source code is translated (i.e., compiled) into bytecodes, the machine language for an idealized virtual machine, the Java Virtual Machine or JVM. Then an interpreter of the JVM (itself normally called a JVM) accepts the bytecodes and the appropriate input, and produces the output. This technique was quite popular in academia with the Pascal programming language and P-code.
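The Java-style compile-then-interpret pipeline can be illustrated with Python's own bytecode machinery, used here purely as a stand-in for javac and the JVM (this is not the actual Java toolchain):

```python
# Step 1: 'compile' translates the source program into bytecode for a
# virtual machine (a Python code object, analogous to Java bytecodes).
# Step 2: 'exec' plays the role of the interpreter (analogous to the JVM).
source = "print(2 + 3)"                   # the source program
code = compile(source, "<demo>", "exec")  # step 1: compile to bytecode
exec(code)                                # step 2: interpret; prints 5
```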

1.2: The Structure of a Compiler

Modern compilers contain two (large) parts, each of which is often subdivided. These two parts are the front end and the back end (shown in green and pink, respectively, in the accompanying diagram). The front end analyzes the source program, determines its constituent parts, and constructs an intermediate representation of the program. Typically the front end is independent of the target language. The back end synthesizes the target program from the intermediate representation produced by the front end. Typically the back end is independent of the source language.

This front/back division greatly reduces the work for a compiling system that can handle several (N) source languages and several (M) target languages. Instead of N×M compilers, we need only N front ends and M back ends. For gcc (originally abbreviating GNU C Compiler, but now abbreviating GNU Compiler Collection), N=7 and M≈30, so the savings are considerable.
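A toy sketch of this split, assuming an invented tuple-shaped intermediate representation (the function names and IR here are illustrative, not gcc's actual interfaces):

```python
# One of the N front ends: analyze source, emit a shared IR.
# The IR here is a made-up (operator, operand, operand) tuple.
def c_front_end(src):
    return ("add", 2, 3)          # toy analysis of "2 + 3"

# One of the M back ends: synthesize target code from the same IR.
def stack_back_end(ir):
    op, a, b = ir
    return [f"push {a}", f"push {b}", op]

print(stack_back_end(c_front_end("2 + 3")))
# -> ['push 2', 'push 3', 'add']
```

Any new source language only needs a new front end emitting this IR; any new target only needs a new back end consuming it.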

Other Analyzers and Synthesizers

Other compiler-like applications also use analysis and synthesis. Some examples include:

1. Pretty printer: can be considered a real compiler with the target language a formatted version of the source.
2. Interpreter: the synthesis traverses the intermediate code and executes the operation at each node (rather than generating machine code to do so).

Multiple Phases

The front and back ends are themselves each divided into multiple phases. Conceptually, the input to each phase is the output of the previous one. Sometimes a phase changes the representation of the input. For example, the lexical analyzer converts a character-stream input into a token-stream output. Sometimes the representation is unchanged. For example, the machine-dependent optimizer transforms target-machine code into (hopefully improved) target-machine code. The diagram is definitely not drawn to scale, in terms of effort or lines of code; in practice, the optimizers dominate.

Conceptually, there are three phases of analysis, with the output of one phase the input of the next. Each of these phases changes the representation of the program being compiled. The phases are called lexical analysis or scanning, which transforms the program from a string of characters to a string of tokens; syntax analysis or parsing, which transforms the program into some kind of syntax tree; and semantic analysis, which decorates the tree with semantic information.

Note that the above classification is conceptual; in practice more efficient representations may be used. For example, instead of having all the information about the program in the tree, tree nodes may point to symbol-table entries. Thus the information about the variable counter is stored once and pointed to at each occurrence.
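The "interpreter as synthesis" idea above, walking the tree and executing each node rather than emitting machine code, can be sketched as follows (the tuple-shaped tree is invented for the example):

```python
# Tree-walking interpreter: execute the operation at each node.
# A node is either an int leaf or an (operator, left, right) tuple.
def evaluate(node):
    if isinstance(node, int):
        return node                        # leaf: a number
    op, left, right = node                 # interior node
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b

tree = ("+", 1, ("*", 2, 3))               # represents 1 + 2 * 3
print(evaluate(tree))                      # -> 7
```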

1.2.1: Lexical Analysis (or Scanning)

The character stream input is grouped into meaningful units called lexemes, which are then mapped into tokens, the latter constituting the output of the lexical analyzer. For example, any one of the following C statements
x3 = y + 3;
x3 = y + 3 ;
x3 =y+ 3 ;

but not
x 3 = y + 3;

would be grouped into the lexemes x3, =, y, +, 3, and ;. A token is a <token-name,attribute-value> pair. For example:

1. The lexeme x3 would be mapped to a token such as <id,1>. The name id is short for identifier. The value 1 is the index of the entry for x3 in the symbol table produced by the compiler. This table is used to gather information about the identifiers and to pass this information to subsequent phases.

2. The lexeme = would be mapped to the token <=>. In reality it is probably mapped to a pair whose second component is ignored. The point is that there are many different identifiers, so we need the second component, but there is only one assignment symbol =.
3. The lexeme y is mapped to the token <id,2>.
4. The lexeme + is mapped to the token <+>.
5. The lexeme 3 is somewhat interesting and is discussed further in subsequent chapters. It is mapped to <number,something>, but what is the something? On the one hand there is only one 3, so we could just use the token <number,3>. However, there can be a difference between how this should be printed (e.g., in an error message produced by subsequent phases) and how it should be stored (fixed vs. float vs. double). Perhaps the token should point to the symbol table, where an entry for this kind of 3 is stored. Another possibility is to have a separate numbers table.
6. The lexeme ; is mapped to the token <;>.

Note that non-significant blanks are normally removed during scanning. In C, most blanks are non-significant. That does not mean the blanks are unnecessary. Consider
int x;
intx;

The blank between int and x is clearly necessary, but it does not become part of any token. Blanks inside strings are an exception: they are part of the token (or, more likely, of the table entry pointed to by the second component of the token). Note that we can define identifiers, numbers, and the various symbols and punctuation without using recursion (compare with parsing below).

Review Questions

1. What is a language processor?
2. What two-step process is required to run a program with a compiler?
3. How is Java both a compiled and an interpreted language?
4. What are the front end and the back end of a compiler?
5. What are some applications of the analysis and synthesis phases of a compiler?
6. How does a lexical analyzer process the source language?
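Returning to the statement x3 = y + 3;, the scanning described in this section can be sketched with a toy lexer. The token names follow the text (<id,index>, <number,...>, and the one-of-a-kind symbols); the regular expressions and table layout are illustrative only:

```python
import re

# A toy scanner: group characters into lexemes, map lexemes to tokens.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("sym",    r"[=+;]"),
    ("skip",   r"\s+"),        # non-significant blanks
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    symtab, tokens = {}, []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "skip":
            continue                          # blanks vanish during scanning
        if kind == "id":
            index = symtab.setdefault(lexeme, len(symtab) + 1)
            tokens.append(("id", index))      # <id, symbol-table index>
        elif kind == "number":
            tokens.append(("number", lexeme)) # <number, lexeme> for now
        else:
            tokens.append((lexeme,))          # one-of-a-kind: <=>, <+>, <;>
    return tokens

print(scan("x3 = y + 3;"))
# -> [('id', 1), ('=',), ('id', 2), ('+',), ('number', '3'), (';',)]
```

Note that scan("x3 =y+ 3 ;") produces exactly the same token stream, matching the claim above that all three spacings of the statement are grouped into the same lexemes.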