
Need for compilers:

When we speak about compilers, we mainly talk about two types of languages: the source language and the
machine language.

Machine language, or the target language, is the one which the CPU of a computer can understand. It
consists of strings of 0's and 1's. As it is difficult for humans to write, debug and test programs in binary,
we prefer writing programs in a high-level language such as C, C++ or Java. In order to convert
programs written in a high-level language to a language understandable to the CPU, we use translators
called compilers.

Goals:
1. A compiler's most important goal is correctness: all valid programs must compile correctly.
2. After translation, the size of the generated code should be as small as possible.
3. The time taken to execute the translated code should be reasonably low.
4. The overall cost should be low.

Analysis & Synthesis model


The translation process in a compiler works much the same way a human translates sentences from one language to
another. For example, when a person is given a sentence to translate, say in English, he first recognizes the
characters and words of the sentence. This is exactly what happens in the lexical analysis phase of a
compiler.

Next we try to understand the structure of the sentence, i.e. we determine whether each word is a noun,
pronoun, verb or an adjective. In syntax analysis we do exactly the same and thus check whether the sentence
or statement is legal and can be generated from the grammar of the programming language.

Once the sentence structure is understood we try to understand the meaning of the sentence. Semantic
analysis phase of the compiler handles this task.

As their names suggest, the first three phases of a compiler (lexical, syntax and semantic analysis) deal
with analyzing and understanding the source code.

The next three phases (intermediate code generation, optimization and code generation) perform the actual
conversion of the source program into the target program.

Why translate in steps/phases ?


In order to translate high-level code to machine code, one needs to go step by step, with each step
doing a particular task and passing its output to the next step in the form of another program
representation. The steps can be parse tree generation, high level intermediate code generation, low
level intermediate code generation, and then the machine language conversion. As the translation
proceeds the representation becomes more and more machine specific, increasingly dealing with
registers, memory locations etc.

• Translate in steps. Each step handles a reasonably simple, logical, and well defined task.

Lexical Analysis:

Lexical Analysis is the first phase of the compiler. Its main task is to read the input characters and
produce as output a sequence of tokens that the parser uses for syntax analysis.

It also performs certain secondary tasks such as stripping comments and white space characters (blank,
tab and newline characters) from the source code. It also helps correlate error messages from the compiler with
the source program. For example, the lexical analyzer can keep track of the number of newline characters
seen, so that a line number can be associated with an error message.

The scanner is tasked with determining that the input stream can be divided into valid
symbols in the source language, but has no smarts about which token should come
where. Few errors can be detected at the lexical level alone because the scanner has a
very localized view of the source program without any context. The scanner can report
about characters that are not valid tokens (e.g., an illegal or unrecognized symbol) and a
few other malformed entities (illegal characters within a string constant, unterminated
comments, etc.) It does not look for or detect garbled sequences, tokens out of place,
undeclared identifiers, misspelled keywords, mismatched types and the like.

The structure of tokens is modeled through regular expressions and recognized through finite state
automata.

Token: A token is a syntactic category. Sentences consist of a string of tokens. For example, number,
identifier, keyword and string are tokens.

Lexeme: A lexeme is the sequence of characters that forms an instance of a token. For example, 100.01, counter, const and "How are
you?" are lexemes.
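
For instance, for the statement

    total = count + 10 ;

the scanner would produce the token stream identifier, assignment operator, identifier, plus operator, number, semicolon, with the corresponding lexemes "total", "=", "count", "+", "10" and ";".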

SLIDE – 2

How do we implement a lexical analyzer ?

A simple & efficient way to build a lexical analyzer is to construct a diagram that illustrates the structure
of tokens and then to hand-translate the diagram into a program for finding tokens.

Three possibilities:
• Write the lexical analyzer in assembly language, explicitly managing the reading and buffering of input. This is the
most efficient approach and produces very fast lexical analyzers, but it is very
difficult to implement and maintain.
• Write the lexical analyzer in a conventional systems programming language such as C, using the I/O
facilities of that language to read the input. This approach is still reasonably efficient; a small hand-written sketch in C follows below.
• Use a tool like lex. This approach is very easy to implement and maintain, although the generated scanner may be somewhat less efficient than a hand-crafted one.
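
To illustrate the second option, the following is a minimal hand-written scanner sketch in C. It assumes input on stdin and only two token classes (identifiers and numbers); the names TOK_ID, TOK_NUM, TOK_EOF and next_token are invented for this illustration, and a real scanner would also handle keywords, string literals, comments, line counting and buffer limits.

    #include <ctype.h>
    #include <stdio.h>

    /* token categories handed to the parser (values above 255 so they
       cannot clash with single-character tokens)                       */
    enum { TOK_ID = 256, TOK_NUM, TOK_EOF };

    /* Reads one token from stdin, copying its lexeme into 'lexeme'.
       Single-character tokens such as '+' or ';' are returned as themselves. */
    int next_token(char lexeme[])
    {
        int c = getchar();
        while (c == ' ' || c == '\t' || c == '\n')      /* strip white space */
            c = getchar();
        if (c == EOF)
            return TOK_EOF;

        int i = 0;
        if (isalpha(c)) {                               /* identifier: letter (letter|digit)* */
            while (isalnum(c)) { lexeme[i++] = (char)c; c = getchar(); }
            ungetc(c, stdin);
            lexeme[i] = '\0';
            return TOK_ID;
        }
        if (isdigit(c)) {                               /* number: digit+ */
            while (isdigit(c)) { lexeme[i++] = (char)c; c = getchar(); }
            ungetc(c, stdin);
            lexeme[i] = '\0';
            return TOK_NUM;
        }
        lexeme[0] = (char)c;                            /* operator, punctuation, or an
                                                           illegal character              */
        lexeme[1] = '\0';
        return c;
    }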

SLIDE – 3

lex is a lexical analyzer generator. You specify the scanner you want in the form of patterns to match and
actions to apply for each token. lex takes your specification and generates a combined NFA to recognize
all your patterns, converts it to an equivalent DFA, minimizes the automaton as much as possible, and
generates C code that will implement it.
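
As an illustration, a minimal lex specification might look roughly as follows. This is only a sketch: it simply prints each token it recognizes, whereas a scanner feeding a parser would return a token code instead, and the three patterns shown (white space, numbers, identifiers) are assumptions rather than a complete language definition.

    %{
    #include <stdio.h>
    %}

    %%
    [ \t\n]+                 { /* strip blanks, tabs and newlines */ }
    [0-9]+                   { printf("NUM(%s)\n", yytext); }
    [a-zA-Z_][a-zA-Z0-9_]*   { printf("ID(%s)\n", yytext); }
    .                        { printf("ILLEGAL(%s)\n", yytext); }
    %%

    int main(void) { yylex(); return 0; }
    int yywrap(void) { return 1; }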

Syntax Analysis:
SLIDE – 1

Before we speak about syntax analysis, we need to know what a grammar is.

Just as natural languages such as English have a grammar associated with them, every computer
programming language has a grammar, i.e. rules that prescribe the syntactic structure of programs. A
grammar gives a precise, unambiguous specification of a programming language.

For example, consider the grammar S -> (S)S | ε, where S represents a string of balanced parentheses. We
represent all the program constructs, such as conditional statements, looping statements etc., with such
grammars. When the syntax analyzer is given a string of tokens by the lexical analyzer, it checks whether the string can be
generated from the available grammar.
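
For instance, the string (()) is a sentence of this grammar; one derivation is

    S => (S)S => ((S)S)S => (()S)S => (())S => (())

A string such as (() cannot be derived from S, and the syntax analyzer would report it as an error.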

Ambiguous Grammar:

E -> E A E | (E) | -E | id | num

A -> + | - | * | /

Unambiguous Grammar:

E -> E + T | E - T | T

T -> T * F | T / F | F

F -> id | num

For certain classes of parsing techniques, the grammar must be unambiguous.
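
To see the difference, consider the string id + id * num. Under the ambiguous grammar E -> E A E it can be parsed either as (id + id) * num or as id + (id * num), so the grammar alone does not decide which operator is applied first. Under the unambiguous grammar there is exactly one parse, corresponding to id + (id * num):

    E => E + T => T + T => F + T => id + T => id + T * F => id + F * F => id + id * F => id + id * num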


As the name suggests, bottom-up parsing works in the opposite direction from top-down. A top-down
parser begins with the start symbol at the top of the parse tree and works downward, driving
productions in forward order until it gets to the terminal leaves. A bottom-up parse starts with the string
of terminals itself and builds from the leaves upward, working backwards to the start symbol by applying
the productions in reverse. Along the way, a bottom-up parser searches for substrings of the working
string that match the right side of some production. When it finds such a substring, it reduces it, i.e.,
substitutes the left side nonterminal for the matching right side. The goal is to reduce all the way up to
the start symbol and report a successful parse. In general, bottom-up parsing algorithms are more
powerful than top-down methods, but not surprisingly, the constructions required are also more
complex. It is difficult to write a bottom-up parser by hand for anything but trivial grammars, but
fortunately, there are excellent parser generator tools like yacc that build a parser from an input
specification, not unlike the way lex builds a scanner to your spec.
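
As a small worked example with the unambiguous expression grammar above, a bottom-up parser would reduce the string id + id * num as follows, replacing a matched right-hand side by its left-hand nonterminal at each step:

    id + id * num
    F  + id * num        (F -> id)
    T  + id * num        (T -> F)
    E  + id * num        (E -> T)
    E  + F  * num        (F -> id)
    E  + T  * num        (T -> F)
    E  + T  * F          (F -> num)
    E  + T               (T -> T * F)
    E                    (E -> E + T)

Reaching the start symbol E means the input is syntactically valid.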

The syntax analysis phase verifies that the string can be generated by the grammar for the source
language. In case of any syntax errors in the program, the parser tries to report as many errors as
possible. Error reporting and recovery form a very important part of the syntax analyzer.

The error handler in the parser has the following goals:
• It should report the presence of errors clearly and accurately.
• It should recover from each error quickly enough to be able to detect subsequent errors.
• It should not significantly slow down the processing of correct programs.

yacc is a parser generator. It is to parsers what lex is to scanners. You provide the input of a grammar
specification and it generates an LALR(1) parser to recognize sentences in that grammar. yacc stands for
"yet another compiler compiler" and it is probably the most common of the LALR tools out there. Our
programming projects are configured to use the updated version bison, a close relative of the yak, but
all of the features we use are present in the original tool, so this handout serves as a brief overview of
both. Our course web page includes a link to an online bison user’s manual for those who really want to
dig deep and learn everything there is to learn about parser generators.
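
A minimal yacc/bison specification for the unambiguous expression grammar above might look roughly as follows. This is only a sketch: the token names ID and NUM and the error routine are assumptions, and the semantic actions that would normally build a parse tree or evaluate the expression are omitted.

    %{
    #include <stdio.h>
    int yylex(void);
    void yyerror(const char *msg) { fprintf(stderr, "parse error: %s\n", msg); }
    %}

    %token ID NUM

    %%
    E : E '+' T
      | E '-' T
      | T
      ;
    T : T '*' F
      | T '/' F
      | F
      ;
    F : ID
      | NUM
      ;
    %%

If yacc reports shift/reduce or reduce/reduce conflicts for a grammar, that is often a sign the grammar is ambiguous or otherwise problematic for LALR parsing.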

Semantic Analysis:
Parsing only verifies that the program consists of tokens arranged in a syntactically valid combination.
For a program to be semantically valid, all variables, functions, classes, etc. must be properly defined,
expressions and variables must be used in ways that respect the type system, access control must be
respected, and so forth. Semantic analysis is the front end’s penultimate phase and the compiler’s last
chance to weed out incorrect programs.

In many languages, identifiers have to be declared before they’re used. As the compiler encounters a
new declaration, it records the type information assigned to that identifier.

Then, as it continues examining the rest of the program, it verifies that the type of an identifier is
respected in terms of the operations being performed.

For example, the type of the right side expression of an assignment statement should match the type of
the left side, and the left side needs to be a properly declared and assignable identifier.
The parameters of a function should match the arguments of a function call in both number and type.

The language may require that identifiers be unique, thereby forbidding two global declarations from
sharing the same name.

Arithmetic operands will need to be of a numeric type, perhaps even exactly the same type (no automatic int-
to-double conversion, for instance). These are examples of the things checked in the semantic analysis
phase.

For example, in Pascal the mod operator may be applied only to integer operands.

The analyzer also verifies that dereferencing is applied only to a pointer type and that indexing is applied only to array types.
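
A few C fragments that a scanner and parser would happily accept, but a semantic analyzer must reject, are sketched below (each line is an independent illustrative error, not a complete program):

    int    x;
    int   *p;
    double d;

    x = "hello";      /* type mismatch: the right side is not an int            */
    *d = 1.0;         /* dereference applied to something that is not a pointer */
    p[1.5] = 0;       /* an index expression must have an integer type          */
    x = x % d;        /* like Pascal's mod, C's % requires integer operands     */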

Types and Declarations


A type is a set of values and a set of operations operating on those values.

In many languages, a programmer must first establish the name and type of any data
object (e.g., variable, function, type, etc). In addition, the programmer usually defines
the lifetime. A declaration is a statement in a program that communicates this
information to the compiler.

Base types: int, float, double, char, bool, etc. These are the primitive
types provided directly by the underlying hardware. There may be
a facility for user-defined variants on the base types (such as C enums).

Compound types: arrays, pointers, records, structs, unions, classes, and so on.
These types are constructed as aggregations of the base types and
simple compound types.

Type Checking
Type checking is the process of verifying that each operation executed in a program
respects the type system of the language.

Designing a Type Checker


When designing a type checker for a compiler, here’s the process:
1. identify the types that are available in the language
2. identify the language constructs that have types associated with them
3. identify the semantic rules for the language
1. identify the types that are available in the language:
we have base types (int, double, bool, string)
and compound types (arrays, classes, interfaces)

2. identify the language constructs that have types associated with them:

Constants: obviously, every constant has an associated type. The scanner tells
us these types as well as the associated lexeme.

Variables: all variables (global, local, and instance) must have a declared
type of one of the base types or the supported compound types.

Functions: functions have a return type, and each parameter in the function
definition has a type, as does each argument in a function call.

Expressions: an expression can be a constant, variable, function call, or some
operator (binary or unary) applied to expressions. Each of the
various expressions has a type based on the type of the constant,
variable, return type of the function, or type of operands.

3. identify the semantic rules for the language

• Rules to parse a variable declaration:

VariableDecl -> Variable ;

Variable -> Type identifier

Type -> int | bool | double | string | identifier | Type[]

The scanner stores the name for an identifier lexeme, which the parser records as an
attribute attached to the token. When reducing the Variable production, we have the
type associated with the Type symbol (passed up from the Type production) and the
name associated with the identifier symbol (passed from the scanner). We create a
new variable declaration, declaring that identifier to be of that type, which can be stored
in a symbol table for lookup later on.
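
A minimal sketch of how such a declaration might be recorded is shown below, assuming a simple linked-list symbol table in C. The names Symbol, declare and lookup are invented for this illustration; a real compiler would typically use a hash table and handle nested scopes.

    #include <stdlib.h>
    #include <string.h>

    typedef enum { T_INT, T_BOOL, T_DOUBLE, T_STRING, T_ARRAY } TypeKind;

    typedef struct Symbol {
        char          *name;   /* identifier lexeme recorded by the scanner */
        TypeKind       type;   /* type passed up from the Type production   */
        struct Symbol *next;
    } Symbol;

    static Symbol *table = NULL;

    /* called when the VariableDecl production is reduced */
    void declare(const char *name, TypeKind type)
    {
        Symbol *s = malloc(sizeof(Symbol));
        s->name = strdup(name);              /* strdup is POSIX, not ISO C */
        s->type = type;
        s->next = table;
        table   = s;
    }

    /* later phases look the identifier up to type-check its uses */
    Symbol *lookup(const char *name)
    {
        for (Symbol *s = table; s != NULL; s = s->next)
            if (strcmp(s->name, name) == 0)
                return s;
        return NULL;                         /* use of an undeclared identifier */
    }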

Code Optimization:
Induction variable elimination can reduce the number of additions (or subtractions) in a loop, and
improve both run-time performance and code space.
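
As a sketch of the kind of rewrite this enables (assuming C-like source): in the first loop the index i exists only to step through the array, so the compiler can rewrite the loop to advance a single pointer instead, removing the per-iteration index arithmetic.

    /* before: i is an induction variable used only for addressing */
    for (i = 0; i < n; i++)
        a[i] = 0;

    /* after: the index has been eliminated in favour of a pointer */
    for (p = a; p < a + n; p++)
        *p = 0;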

Strength reduction is a compiler optimization where expensive operations are replaced with equivalent
but less expensive operations. The classic example of strength reduction converts "strong" multiplications
inside a loop into "weaker" additions.
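
A small illustrative sketch in C: the multiplication performed on every iteration of the first loop is replaced by a running addition in the second.

    /* before: one multiplication per iteration      */
    for (i = 0; i < n; i++)
        b[i] = i * 8;

    /* after: the multiplication becomes an addition */
    for (i = 0, t = 0; i < n; i++, t += 8)
        b[i] = t;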

Constant folding is the process of simplifying constant expressions at compile time. Terms in constant
expressions are typically simple literals, such as the integer 2, but can also be variables whose values are
never modified, or variables explicitly marked as constant.
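
For example, the constant subexpression below can be evaluated at compile time, so no multiplication by 2 remains in the generated code:

    circumference = 2 * 3.14 * r;     /* as written              */
    circumference = 6.28 * r;         /* after constant folding  */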

Dead code elimination is a compiler optimization that removes code that does not affect the program.
Removing such code has two benefits: it shrinks program size, an important consideration in some
contexts, and it lets the running program avoid executing irrelevant operations, which reduces its running
time. Dead code includes code that can never be executed (unreachable code), and code that only
affects dead variables, that is, variables that are irrelevant to the program.
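
A small C sketch: after dead code elimination the function below reduces to just the return statement.

    int f(int x)
    {
        int unused = x * 2;     /* assigns a variable that is never read: dead code */
        if (0) {
            x = x + 1;          /* the condition is always false: unreachable code  */
        }
        return x;
    }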

Loop interchange is the process of exchanging the order of two iteration variables, i.e. swapping the inner and outer loops.
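
A sketch in C: for a row-major array, interchanging the two loops below makes the inner loop touch consecutive memory locations, which typically improves cache behaviour.

    /* before: the inner loop jumps between rows on every iteration */
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i][j] = 0;

    /* after interchange: the inner loop walks one row contiguously */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 0;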

Code Generation:
Code generation is the process by which a compiler's code generator converts some internal
representation of source code into a form (e.g., machine code) that can be readily executed by a
machine.

The input to the code generator typically consists of a parse tree or an abstract syntax tree. The tree is
converted into a linear sequence of instructions, usually in an intermediate language such as three
address code.
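
For example, an assignment such as x = a + b * c might be lowered to three-address instructions along the lines of the following, where the temporaries t1 and t2 are introduced by the compiler:

    t1 = b * c
    t2 = a + t1
    x  = t2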

In addition to the basic conversion from an intermediate representation into a linear sequence of machine
instructions, a typical code generator tries to optimize the generated code in some way. The generator
may try to use faster instructions, use fewer instructions, exploit available registers, and avoid redundant
computations.

Tasks which are typically part of a sophisticated compiler's "code generation" phase include:

• Instruction selection: which instructions to use. Instruction selection can be implemented with a
backwards dynamic programming algorithm which computes the "optimal" tiling for each point, starting
from the end of the program and working back from there. Instruction selection can also be implemented
with a greedy algorithm that chooses a local optimum at each step.
• Instruction scheduling: in which order to put those instructions. Scheduling is a speed
optimization that can have a critical effect on pipelined machines. Instruction scheduling is
a compiler optimization used to improve instruction-level parallelism, which improves performance on
machines with instruction pipelines.
• Register allocation: the allocation of variables to processor registers. Register allocation is the
process of assigning a large number of target program variables onto a small number
of CPU registers. Register allocation can happen over a basic block (local register allocation), over a
whole function/procedure (global register allocation), or across function boundaries as part of the calling
convention (interprocedural register allocation). Two variables in use at the same time cannot be
assigned to the same register without corrupting one of their values. Variables which cannot be assigned to
some register must be kept in RAM and loaded in and out for every read or write, a process called spilling.
Accessing RAM is significantly slower than accessing registers and slows down the execution speed
of the compiled program, so an optimizing compiler aims to assign as many variables to registers as
possible. A small worked sketch follows this list.
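
As a rough illustration (register names and assignments invented for the sketch), a three-address sequence such as

    t1 = a + b
    t2 = t1 * c
    d  = t2 + e

might be mapped onto two registers: a and t1 can share R1 because a is no longer needed once t1 is computed, while b, c and e take turns in R2. If more values were live at the same time than there are registers, one of them would be spilled to memory and reloaded when needed.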
