When we speak about compilers, we mainly talk about two kinds of languages: the source language and the machine language.
Machine language, or the target language, is the one a CPU can understand: strings of 0s and 1s. Because it is difficult for humans to write, debug, and test programs in binary, we prefer writing programs in a high-level language such as C, C++, or Java. To convert programs written in a high-level language into a language the CPU understands, we use translators called compilers.
Goals:
1. A compiler's most important goal is correctness: all valid programs must compile correctly.
2. After translation, the size of the generated code should be as small as possible.
3. The time taken to execute the translated code should be reasonably low.
4. The overall cost should be low.
Compiling a program can be compared to understanding a sentence in a natural language. First we recognize the individual words; lexical analysis does the same for a program by splitting it into tokens. Next we try to understand the structure of the sentence, i.e., whether each word is a noun, pronoun, verb, or adjective. Syntax analysis does exactly the same, checking whether the sentence or statement is legal, i.e., whether it can be generated from the grammar of the programming language.
Once the sentence structure is understood, we try to understand the meaning of the sentence. The semantic analysis phase of the compiler handles this task.
As their names suggest, the first three phases of a compiler (lexical, syntactic, and semantic analysis) deal with analyzing and understanding the source code.
The next three phases (intermediate code generation, optimization, and code generation) perform the actual conversion of the source program into the target program.
The compiler translates in steps; each step handles a reasonably simple, logical, and well-defined task.
Lexical Analysis:
Lexical Analysis is the first phase of the compiler. Its main task is to read the input characters and
produce as output a sequence of tokens that the parser uses for syntax analysis.
It also performs certain secondary tasks, such as stripping comments and whitespace characters (blank, tab, and newline characters) from the source code. It also correlates error messages from the compiler with the source program. For example, the lexical analyzer can keep track of the number of newline characters seen, so that a line number can be associated with an error message.
The scanner is tasked with determining that the input stream can be divided into valid
symbols in the source language, but has no smarts about which token should come
where. Few errors can be detected at the lexical level alone because the scanner has a
very localized view of the source program without any context. The scanner can report
about characters that are not valid tokens (e.g., an illegal or unrecognized symbol) and a
few other malformed entities (illegal characters within a string constant, unterminated
comments, etc.). It does not look for or detect garbled sequences, tokens out of place, undeclared identifiers, misspelled keywords, mismatched types, and the like.
The structure of tokens is modeled with regular expressions and recognized using finite state automata.
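To make the automaton side concrete, here is a minimal, illustrative DFA (not taken from any real compiler) that recognizes C-style identifiers, i.e. the language of the regular expression [A-Za-z_][A-Za-z0-9_]*:

```python
# A minimal DFA for C-style identifiers: [A-Za-z_][A-Za-z0-9_]*
# States: 0 = start, 1 = accepting (inside an identifier).
# Illustrative sketch only, not the scanner of any particular compiler.

def classify(ch):
    """Map a character to an input class for the transition table."""
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# TRANSITION[state][input class] -> next state (None = reject)
TRANSITION = {
    0: {"letter": 1, "digit": None, "other": None},
    1: {"letter": 1, "digit": 1, "other": None},
}
ACCEPTING = {1}

def is_identifier(s):
    state = 0
    for ch in s:
        state = TRANSITION[state][classify(ch)]
        if state is None:
            return False
    return state in ACCEPTING
```

The transition table is exactly the structure a tool like lex builds automatically after converting the regular expression to a DFA.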
Token: A token is a syntactic category. Sentences consist of strings of tokens. For example, number, identifier, keyword, and string are tokens.
Lexeme: A lexeme is the sequence of characters forming a token. For example, 100.01, counter, const, and "How are you?" are lexemes.
SLIDE – 2
A simple and efficient way to build a lexical analyzer is to construct a diagram that illustrates the structure of tokens and then to hand-translate the diagram into a program for finding tokens.
Three possibilities:
1. Write the lexical analyzer in assembly language, explicitly managing the reading and writing of input. This is the most efficient approach and produces very fast lexical analyzers, but it is very difficult to implement and maintain.
2. Write the lexical analyzer in a conventional systems programming language such as C, using the I/O facilities of that language to read the input. This approach is still reasonably efficient.
3. Use a tool such as lex. This approach is the easiest to implement and maintain, though the generated scanner may be somewhat less efficient than a hand-written one.
SLIDE – 3
lex is a lexical analyzer generator. You specify the scanner you want in the form of patterns to match and
actions to apply for each token. lex takes your specification and generates a combined NFA to recognize
all your patterns, converts it to an equivalent DFA, minimizes the automaton as much as possible, and
generates C code that will implement it.
Syntax Analysis:
SLIDE – 1
Before we speak about syntax analysis, we need to know what a grammar is.
Just as natural languages such as English have grammars associated with them, every computer programming language has a grammar: rules that prescribe the syntactic structure of programs. A grammar gives a precise, unambiguous specification of a programming language.
For example, consider the grammar S → (S)S | ε, where S represents the strings of balanced parentheses. We represent all program constructs, such as conditional statements, looping statements, etc., with such grammars. When the syntax analyzer is given a string from the lexical analyzer, it checks whether the string can be generated from the available grammar.
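As an illustration, the balanced-parenthesis grammar can be turned almost directly into a recognizer. The Python sketch below derives S recursively, choosing the production (S)S when the next character is "(" and ε otherwise:

```python
# Recursive-descent recognizer for the grammar S -> (S)S | epsilon,
# which generates exactly the strings of balanced parentheses.

def parse_S(s, i=0):
    """Try to derive S starting at position i; return the position just
    past the derived substring, or None on failure."""
    if i < len(s) and s[i] == "(":          # S -> ( S ) S
        j = parse_S(s, i + 1)               # inner S
        if j is None or j >= len(s) or s[j] != ")":
            return None                     # no matching ')'
        return parse_S(s, j + 1)            # trailing S
    return i                                # S -> epsilon

def is_balanced(s):
    """The string is in the language iff S derives all of it."""
    return parse_S(s) == len(s)
```

This is the same checking the syntax analyzer performs, only on a grammar of two productions instead of a full programming language.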
Ambiguous Grammar:
E → E A E | (E) | -E | id | num
A → + | - | * | /
Unambiguous Grammar:
E → E + T | E - T | T
T → T * F | T / F | F
F → id | num
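To see how the unambiguous grammar drives a parser, here is a sketch of a recursive-descent evaluator for it in Python. The left recursion in E and T is rewritten as iteration (the standard move for recursive descent), which preserves left associativity; for simplicity F derives only numbers, and the input is assumed to be an already-tokenized list:

```python
# Recursive-descent evaluator for the unambiguous grammar
#   E -> E + T | E - T | T,   T -> T * F | T / F | F,   F -> num
# (F -> id is omitted for simplicity). Illustrative sketch only.

def parse_expression(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_F():                     # F -> num
        nonlocal pos
        tok = peek()
        if not isinstance(tok, (int, float)):
            raise SyntaxError(f"expected a number, got {tok!r}")
        pos += 1
        return tok

    def parse_T():                     # T -> T * F | T / F | F, as a loop
        nonlocal pos
        value = parse_F()
        while peek() in ("*", "/"):
            op = tokens[pos]; pos += 1
            rhs = parse_F()
            value = value * rhs if op == "*" else value / rhs
        return value

    def parse_E():                     # E -> E + T | E - T | T, as a loop
        nonlocal pos
        value = parse_T()
        while peek() in ("+", "-"):
            op = tokens[pos]; pos += 1
            rhs = parse_T()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = parse_E()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result
```

Because T sits below E in the grammar, * and / bind tighter than + and -: parse_expression([2, "+", 3, "*", 4]) gives 14, not 20.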
The syntax analysis phase verifies that the string can be generated by the grammar for the source
language. In case of any syntax errors in the program, the parser tries to report as many errors as
possible. Error reporting and recovery form a very important part of the syntax analyzer.
The error handler in the parser has the following goals:
1. It should report the presence of errors clearly and accurately.
2. It should recover from each error quickly enough to be able to detect subsequent errors.
3. It should not significantly slow down the processing of correct programs.
yacc is a parser generator. It is to parsers what lex is to scanners. You provide the input of a grammar
specification and it generates an LALR(1) parser to recognize sentences in that grammar. yacc stands for
"yet another compiler compiler" and it is probably the most common of the LALR tools out there. Our
programming projects are configured to use the updated version bison, a close relative of the yak, but
all of the features we use are present in the original tool, so this handout serves as a brief overview of
both. Our course web page includes a link to an online bison user’s manual for those who really want to
dig deep and learn everything there is to learn about parser generators.
Semantic Analysis:
Parsing only verifies that the program consists of tokens arranged in a syntactically valid combination.
For a program to be semantically valid, all variables, functions, classes, etc. must be properly defined,
expressions and variables must be used in ways that respect the type system, access control must be
respected, and so forth. Semantic analysis is the front end’s penultimate phase and the compiler’s last
chance to weed out incorrect programs.
In many languages, identifiers have to be declared before they’re used. As the compiler encounters a
new declaration, it records the type information assigned to that identifier.
Then, as it continues examining the rest of the program, it verifies that the type of an identifier is
respected in terms of the operations being performed.
For example, the type of the right side expression of an assignment statement should match the type of
the left side, and the left side needs to be a properly declared and assignable identifier.
The parameters of a function should match the arguments of a function call in both number and type.
The language may require that identifiers be unique, thereby forbidding two global declarations from
sharing the same name.
Arithmetic operands need to be of numeric type, perhaps even exactly the same type (no automatic int-to-double conversion, for instance). These are examples of the things checked in the semantic analysis phase.
For example, in Pascal, the mod operator may be applied only to integer operands. The compiler also verifies that dereferencing is applied only to pointer types and that indexing is applied only to array types.
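A few of these checks can be sketched as functions over a toy representation; the type names, node shapes, and error messages below are invented for illustration:

```python
# A sketch of semantic checks like those described above.
# "symbol_table" maps declared names to type names; all details are toy.

def check_assign(symbol_table, name, expr_type):
    """The left side must be a declared identifier, and the right side's
    type must match the declared type."""
    if name not in symbol_table:
        raise TypeError(f"undeclared identifier '{name}'")
    declared = symbol_table[name]
    if declared != expr_type:
        raise TypeError(f"cannot assign {expr_type} to '{name}' of type {declared}")
    return declared

def check_binary(op, left_type, right_type):
    """Arithmetic operands must be numeric; 'mod' requires integers
    (as in Pascal); no implicit conversions are performed."""
    numeric = {"int", "double"}
    if left_type not in numeric or right_type not in numeric:
        raise TypeError(f"operator '{op}' needs numeric operands")
    if op == "mod" and (left_type != "int" or right_type != "int"):
        raise TypeError("'mod' is only defined on integer operands")
    if left_type != right_type:
        raise TypeError("operand types must match (no implicit conversion here)")
    return left_type
```

A real semantic analyzer applies rules like these while walking the syntax tree, one rule per language construct.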
In many languages, a programmer must first establish the name and type of any data object (e.g., variable, function, type, etc.). In addition, the programmer usually defines the lifetime. A declaration is a statement in a program that communicates this information to the compiler.
Base types: int, float, double, char, bool, etc. These are the primitive types provided directly by the underlying hardware. There may be a facility for user-defined variants on the base types (such as C enums).
Compound types: arrays, pointers, records, structs, unions, classes, and so on. These types are constructed as aggregations of the base types and simple compound types.
Type Checking:
Type checking is the process of verifying that each operation executed in a program respects the type system of the language. To do this, the compiler must identify the language constructs that have types associated with them:
Constants: Obviously, every constant has an associated type. The scanner tells us these types, as well as the associated lexeme.
Variables: All variables (global, local, and instance) must have a declared type, either one of the base types or one of the supported compound types.
Functions: Functions have a return type, and each parameter in the function definition has a type, as does each argument in a function call.
Expressions: An expression can be a constant, a variable, a function call, or some operator (binary or unary) applied to expressions. Each of the various expressions has a type based on the type of the constant or variable, the return type of the function, or the types of the operands.
The scanner stores the name for an identifier lexeme, which the parser records as an
attribute attached to the token. When reducing the Variable production, we have the
type associated with the Type symbol (passed up from the Type production) and the
name associated with the identifier symbol (passed from the scanner). We create a
new variable declaration, declaring that identifier to be of that type, which can be stored
in a symbol table for lookup later on.
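A minimal symbol table along these lines might look as follows; the scoping discipline (a stack of dictionaries) is one common choice, assumed here for illustration:

```python
# A sketch of a symbol table: each declaration records an identifier's
# type, and later lookups retrieve it. Nested scopes are kept as a stack
# of dictionaries; details are illustrative, not from a real compiler.

class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                 # start with the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_name):
        scope = self.scopes[-1]
        if name in scope:                  # e.g. two global declarations sharing a name
            raise NameError(f"'{name}' is already declared in this scope")
        scope[name] = type_name

    def lookup(self, name):
        """Search from the innermost scope outward."""
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"undeclared identifier '{name}'")
```

The same table also supports the uniqueness rule mentioned earlier: re-declaring a name in the same scope is rejected, while an inner scope may shadow an outer one.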
Code Optimization:
Induction variable elimination can reduce the number of additions (or subtractions) in a loop, and
improve both run-time performance and code space.
Strength reduction is a compiler optimization in which expensive operations are replaced with equivalent but less expensive operations. The classic example of strength reduction converts "strong" multiplications inside a loop into "weaker" additions.
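Hand-applying the transform to a small example makes it concrete. Both functions below compute the same sequence of byte offsets; the second replaces the per-iteration multiply with an addition to an induction variable:

```python
# The classic strength-reduction example: i * 8 inside a loop is replaced
# by repeated addition. This is a hand-applied illustration of the
# transform a compiler would perform on the intermediate code.

def offsets_naive(n):
    out = []
    for i in range(n):
        out.append(i * 8)          # "strong" multiplication every iteration
    return out

def offsets_reduced(n):
    out = []
    addr = 0                       # induction variable tracking i * 8
    for i in range(n):
        out.append(addr)
        addr += 8                  # "weaker" addition replaces the multiply
    return out
```

The same pattern underlies induction variable elimination: once addr tracks i * 8, references to i itself may disappear entirely.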
Dead code elimination is a compiler optimization that removes code that does not affect the program.
Removing such code has two benefits: it shrinks program size, an important consideration in some
contexts, and it lets the running program avoid executing irrelevant operations, which reduces its running
time. Dead code includes code that can never be executed (unreachable code), and code that only
affects dead variables, that is, variables that are irrelevant to the program.
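The "dead variables" case can be illustrated by a simple backward pass over straight-line code; the instruction format here is an invented simplification of three-address code:

```python
# A sketch of dead-code elimination on straight-line code.
# Each instruction is (target, list_of_operands). A backward pass keeps
# an assignment only if its target is live, i.e. used later or required
# in the final result set.

def eliminate_dead_code(instructions, live_out):
    """instructions: list of (target, operands); live_out: names needed at the end."""
    live = set(live_out)
    kept = []
    for target, operands in reversed(instructions):
        if target in live:
            kept.append((target, operands))
            live.discard(target)           # this definition satisfies the use
            live.update(operands)          # its operands become live in turn
        # else: assignment to a dead variable -> dropped
    kept.reverse()
    return kept
```

Given a = x; b = a; c = x; d = b with only d needed afterwards, the pass drops c = x, since c is never used.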
Loop interchange is the process of exchanging the order of two iteration variables in a nested loop.
Code Generation:
Code generation is the process by which a compiler's code generator converts some internal representation of the source code into a form (e.g., machine code) that can be readily executed by a machine.
The input to the code generator typically consists of a parse tree or an abstract syntax tree. The tree is
converted into a linear sequence of instructions, usually in an intermediate language such as three
address code.
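A sketch of this conversion for expression trees: each interior node gets a fresh temporary, and the tree is flattened into a linear sequence of three-address instructions (the IR syntax below is illustrative):

```python
# A sketch of translating an expression tree into three-address code.
# A tree is either a leaf (a variable name or constant) or a tuple
# (op, left, right). Each interior node receives a fresh temporary
# t1, t2, ...; the textual IR format is invented for illustration.

def gen_three_address(tree, code=None, counter=None):
    """Return (result_name, code): the name holding the tree's value and
    the accumulated list of three-address instructions."""
    if code is None:
        code, counter = [], [0]
    if not isinstance(tree, tuple):        # leaf: use the name/constant directly
        return str(tree), code
    op, left, right = tree
    l, _ = gen_three_address(left, code, counter)
    r, _ = gen_three_address(right, code, counter)
    counter[0] += 1
    temp = f"t{counter[0]}"                # fresh temporary for this node
    code.append(f"{temp} = {l} {op} {r}")
    return temp, code
```

For the tree for a + b * c, this emits t1 = b * c followed by t2 = a + t1, mirroring the linearization described above.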
In addition to the basic conversion from an intermediate representation into a linear sequence of machine
instructions, a typical code generator tries to optimize the generated code in some way. The generator
may try to use faster instructions, use fewer instructions, exploit available registers, and avoid redundant
computations.
Tasks which are typically part of a sophisticated compiler's "code generation" phase include: