
Introduction

Unit-1
• A computer understands instructions in machine code, i.e. in the form of 0s and 1s.

• It is a tedious task to write a computer program directly in machine code.

• Programs are written mostly in high-level languages like Java, C++, Python, etc., and are called source code.

• This source code cannot be executed directly by the computer; it must be converted into machine language to be executed.
• Hence, special translator system software, called a Language Processor, is used to translate a program written in a high-level language into machine code.
The language processors can be any of the following three types:
• Compilers
• Assemblers
• Interpreters
Compiler
• The language processor that reads the complete source program, written in a high-level language, as a whole in one go and translates it into an equivalent program in machine language is called a Compiler.
• Example: C, C++, C#, Java.
• In a compiler, the source code is translated to object code successfully only if it is free of errors. If there are any errors in the source code, the compiler specifies them, with line numbers, at the end of compilation.
• The errors must be removed before the compiler can successfully recompile the source code.
Assembler
• The Assembler is used to translate a program written in assembly language into machine code.
• The source program, containing assembly language instructions, is the input to the assembler.
• The output generated by the assembler is the object code, or machine code, understandable by the computer.
Interpreter
• The language processor that translates a single statement of the source program into machine code and executes it immediately, before moving on to the next line, is called an interpreter.
• If there is an error in the statement, the interpreter terminates its translating process at that statement and displays an error message.
• The interpreter moves on to the next line for execution only after removal of the error.
• An interpreter directly executes instructions written in a programming or scripting language without previously converting them to object code or machine code.

Example: Perl, Python and Matlab.


Difference between Compiler and Interpreter

1. A compiler is a program which converts the entire source code of a programming language into executable machine code for a CPU. An interpreter takes a source program and runs it line by line, translating each line as it comes to it.

2. A compiler takes a large amount of time to analyze the entire source code, but the overall execution time of the program is comparatively faster. An interpreter takes less time to analyze the source code, but the overall execution time of the program is slower.

3. A compiler generates its error messages only after scanning the whole program, so debugging is comparatively hard, as the error can be present anywhere in the program. Debugging under an interpreter is easier, as it continues translating the program until the error is met.

4. A compiler generates intermediate object code. An interpreter generates no intermediate object code.

5. Compiler examples: C, C++, Java. Interpreter examples: Python, Perl.
• Linkers
• A linker collects code, separately compiled or assembled in different object files, into a file that can be directly executed.

• Loaders
• Often a compiler, assembler or linker will produce code that is not yet completely fixed and ready to execute, but whose principal memory references are all made relative to an undetermined starting location, which can be anywhere; such code is said to be relocatable.
• A loader will resolve all relocatable addresses relative to a given base, or starting, address.
• Preprocessors
• A separate program that is called by the compiler before actual translation begins is called a preprocessor.
• It can delete comments, include other files and perform macro substitution.
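For instance, the C preprocessor performs all three of these tasks. A minimal sketch (the macro SQUARE is just an illustration, not from the text):

#include <stdio.h>              /* file inclusion: the contents of stdio.h are inserted here */
#define SQUARE(x) ((x) * (x))   /* macro definition */

int main(void) {
    /* this comment is deleted by the preprocessor */
    printf("%d\n", SQUARE(5));  /* macro substitution: expands to ((5) * (5)) */
    return 0;
}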
• Macro Processors
• A macro processor substitutes the actual parameters for the formal parameters in the body of a macro; the transformed body then replaces the macro use itself.
• A macro consists of a sequence of assembly language statements, and possibly other macro statements.
• A macro preprocessor replaces a macro statement with a sequence of assembly language statements and other macro statements. It deals with macro definition and macro use.
• Text Editor
• A text editor is a program that allows editing the source program in the form of text.

• Debugger
• A debugger is a system that allows a programmer to look at a program's data while that program is running.
• It is usually invoked when a program error, such as overflow, occurs, or when certain statements indicated in the source code are reached.
Why do we need compilers?
• With machine language, we must communicate directly with a computer in terms of bits, registers and very primitive machine operations.
• This language is a sequence of 0's and 1's, in which programming a complex algorithm is very tedious.
• Therefore, high-level languages were developed so that programmers can express algorithms in more natural notations.
• To translate such a high-level language into machine language, a compiler is needed.
Introduction to Language Processor
• A language processor is software which bridges the specification gap and the execution gap.
• Language processing activities arise due to:
• the difference between the software designer's ideas about the behavior of the software and the manner in which these ideas are implemented.

Introduction to Language Processor
• The designer expresses the ideas in terms related to the application domain.
• To implement these ideas, they must be expressed in terms related to the execution domain.
• The difference between the two domains is termed the semantic gap:

Application Domain <-- Semantic Gap --> Execution Domain
Introduction to Language Processor
• The semantic gap causes many difficulties, some of the important ones being:
• large development time and effort, and
• poor quality of software.
• These issues are tackled through the use of a programming language (PL).
• Software implementation using a PL introduces a new domain, the PL domain.
Introduction to Language Processor
• Now the semantic gap is bridged by two software engineering steps.
• The first step bridges the gap between the application domain and the PL domain, known as the specification gap.
• The second step bridges the gap between the PL domain and the execution domain, known as the execution gap.

Application Domain <-- Specification Gap --> PL Domain <-- Execution Gap --> Execution Domain
Introduction to Language Processor
• The specification gap is bridged by the software development team.
• The execution gap is bridged by the designer of the programming language processor, such as a compiler or interpreter.
• The language processor also provides a diagnostic capability which detects and indicates errors in its input. This helps in improving the quality of the software.

Introduction to Language Processor
• A range of language processors is defined to meet practical requirements:

1. A language translator bridges an execution gap to the machine language; examples are assemblers and compilers.
2. A detranslator bridges the same gap as a language translator, but in the reverse direction.
3. A preprocessor is a language processor which bridges an execution gap but is not a language translator.
4. A language migrator bridges the specification gap between two PLs.
Data Structure for Language Processing
• Language processor implementations are highly influenced by the kind of storage structure used for program variables and data.
• Language processing requires allocating and managing the memory used by the program in static or dynamic environments.
• Implementation of a suitable data structure becomes critical in designing and executing system programs.
Data Structure for Language Processing
• Program behavior depends on the search and allocation structures, the language support and its features, such as external references, recursion, etc.
• Vital operations: search and allocation.
Data Structure for Language Processing
• Choice of data structure:
1. Nature of the data structure: linear or non-linear
2. Purpose of the data structure: allocation or search
3. Lifetime of the data structure
Linear Data Structure
• A linear data structure involves a linear arrangement of data elements.
• Search efficiency is obtained at the cost of requiring contiguous memory allocation for the elements.
• Sometimes the designer may be compelled to overestimate the memory requirements of a linear data structure, to ensure that its memory requirement will not grow beyond the allocation later on.
• This often leads to wastage of memory.
• Easy access.
• Implemented as arrays or linked lists.
Nonlinear Data Structure
• Nonlinear data structures are the alternative.
• They use pointer implementations to access elements.
• Elements need not occupy contiguous memory locations.
• More flexible in allocation and availability of space.
• Lower search efficiency.
• Elements may occupy interleaved memory locations.
Purpose of Data Structure

Search data structures:
• provide efficient search,
• maintain attribute information,
• include table and sequential organization, binary search organization, hash tables, linked lists and tree structures.
Purpose of Data Structure

Allocation data structures:
• include stack and heap organizations,
• do not require a search operation; the address of the allocated memory is known directly.

• Language processors use both search and allocation data structures.
Lifetime of Data Structure
• Whether the structure is used during language processing or during target program execution.
• For example, a data structure used during language processing has scope only for the duration of that processing.
Search Data Structure
• Also called a search structure.
• Designed and used for its search efficiency during language processing.
• A set of entries, each encompassing the information concerning one entity.
• Maintains attribute information about the different entities defined and used in the source program.
Search Data Structure
• An entry is made only once but may be searched many times.
• Primarily created to store tables of information.
• Mainly created and used during the analysis phase; the target program rarely uses this data structure.
• Used for symbol table implementation.
• Characterized by a 'key': a special symbol field containing the name of the entity, to assist the search operation.
Features: Search Data Structure
• Entry: a set of fields, referred to as a record.
• An entry contains two parts; the value in the fixed part determines the information to be stored in the variable part.
• Fixed (tag) part: symbol, class, type, length, dimension information, number of parameters, their addresses, type of returned value, length of returned value, statement number.
• Variable part: name, class, statement number.
Features: Search Data Structure
• Fixed-length entries enable the use of homogeneous linear data structures, e.g. arrays, which contain records having an identical format.
• They suffer from inefficient memory usage.
• Variable-length entries lead to a compact organization with little memory wastage.
Features: Search Data Structure

Fixed-length entry: more access efficiency, less memory efficiency.
Variable-length entry: less access efficiency, more memory efficiency.
Features: Search Data Structure
• Combining the functionality of the two gives the hybrid entry format:

[ Fixed Part | Length | Pointer ] --> [ Variable Part ]
Symbol Table
• A kind of data structure which is used in language processing as both a search and an allocation data structure.

Fixed (Tag) Part: Variant Part
Procedure: address of parameter list, count of parameters
Variable: type, length, dimension information
Label: statement number
Function: length of returned value, type of returned value, number of parameters and their addresses
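A minimal C sketch of such a tagged symbol-table entry (the field names are illustrative, not from any particular compiler):

typedef enum { PROCEDURE, VARIABLE, LABEL, FUNCTION } Tag;

typedef struct {
    char name[32];        /* key: the symbol's name */
    Tag  tag;             /* fixed (tag) part: selects the variant */
    union {               /* variant part, interpreted according to the tag */
        struct { int param_list_addr; int param_count; } proc;
        struct { int type; int length; int dimensions; } var;
        struct { int statement_number; }                 label;
        struct { int ret_type; int ret_length;
                 int param_count; }                      func;
    } info;
} SymbolEntry;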
Operations: Search Data Structure
• Insert
• Search
• Delete

Algorithm for Generic Search
1. Predict the entry e in the search data structure at which the symbol symb is stored.
2. Let symb_e be the symbol found at the e-th entry. Compare symb_e with symb.
   a. If the two match, exit with success.
   b. Else go to the next step.
3. Repeat steps 1 and 2 until all entries have been evaluated, or it is concluded that the symbol does not exist in the search data structure.
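A minimal C sketch of this generic search over a sequentially organized table (the names table, occupied and NOT_FOUND are illustrative):

#include <string.h>

#define NOT_FOUND (-1)

/* Steps 1-3: visit entries 0..occupied-1, comparing each stored
   symbol with symb; return the entry number e on a match. */
int search(char table[][32], int occupied, const char *symb) {
    for (int e = 0; e < occupied; e++) {      /* step 1: predict entry e */
        if (strcmp(table[e], symb) == 0)      /* step 2: compare symb_e with symb */
            return e;                         /* step 2a: match, exit with success */
    }
    return NOT_FOUND;                         /* step 3: all entries evaluated */
}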
Search Organization
• Table organization
• Sequential search organization:
  • occupied entries vs. free entries
  • physical deletion
  • logical (active/deleted) records
Search Organization
• Binary search organization:
  • search is based on relational operators,
  • an entry number must not change after adding a record.
• Hash table organization:
  • a hash function is used to map a key value to the slot in the hash table where that value belongs,
  • the hash function takes any key value from the collection and computes an integer value from it in the range of slot names, between 0 and m-1.
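A minimal example of such a mapping for string keys, assuming a table with m slots (the multiplier 31 is a common but arbitrary choice):

/* Map a symbol name to a slot name in the range 0..m-1. */
unsigned hash(const char *key, unsigned m) {
    unsigned h = 0;
    while (*key != '\0')
        h = h * 31 + (unsigned char)*key++;   /* fold each character into h */
    return h % m;                             /* reduce to a valid slot name */
}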
Introduction to Compiling

Compilers
• A compiler is a program that reads a program written in one language (the source language) and translates it into an equivalent program in another language (the target language).

Source Program --> Compiler --> Target Program
                      |
                      v
               Error messages

The Analysis-Synthesis Model of Compilation
• There are two parts to compilation:
• The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program.
• The synthesis part constructs the desired target program from the intermediate representation.
• Analysis determines the operations implied by the source program, which are recorded in a tree structure.
Source Code:
position = initial + rate * 60 ;

Abstract Syntax Tree:

            =
           / \
   position   +
             / \
      initial   *
               / \
           rate   60
Front-end, Back-end Division

Source code --> Front end --> IR --> Back end --> Machine code
                 (errors are reported by both ends)

• The front end consists of those phases, or parts of phases, that depend primarily on the source language and are largely independent of the target machine.
• The back end includes those portions of the compiler that depend on the target machine and, generally, those portions that do not depend on the source language.
The Many Phases of a Compiler

Source Program
1. Lexical Analyzer
2. Syntax Analyzer
3. Semantic Analyzer
4. Intermediate Code Generator
5. Code Optimizer
6. Code Generator
7. Peephole Optimization
Target Program

Phases 1-5 form the front end; phases 6-7 form the back end. The symbol-table manager and the error handler interact with all phases.
Lexical Analysis
• The given program is scanned from left to right and grouped into tokens: sequences of characters having a collective meaning.
Example
• position = initial + rate * 60

• position --> id1
• = --> assignment operator
• initial --> id2
• + --> addition operator
• rate --> id3
• * --> multiplication operator
• 60 --> literal
Syntax Analysis
• In this phase, the tokens generated by lexical analysis are taken as input and arranged in a tree structure.
Example
• position = initial + rate * 60

• assignment statement => identifier = exp
• exp => exp + exp | exp * exp | identifier | literal

assignment statement
  ├─ identifier: id1 (position)
  ├─ :=
  └─ expression
       ├─ expression → identifier: id2 (initial)
       ├─ +
       └─ expression
            ├─ expression → identifier: id3 (rate)
            ├─ *
            └─ expression → literal: 60

Nodes of the tree are constructed using a grammar for the language.
Semantic Analysis
• Here we gather information, such as the data types of identifiers, which can then be used in the code generation phase.

assignment statement
  ├─ identifier: id1 (position)
  ├─ :=
  └─ expression
       ├─ expression → identifier: id2 (initial)
       ├─ +
       └─ expression
            ├─ expression → identifier: id3 (rate)
            ├─ *
            └─ expression → inttofloat(60)   (the literal 60 is converted from int to float)
Intermediate Phase
• Translates the abstract syntax tree into intermediate code.
• In this phase we generate three-address code.

Example:
temp1 = inttofloat(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
Code Optimization
• To increase the speed of the program.
• To consume fewer memory locations.

temp4 = id3 * 60.0
id1 = id2 + temp4
Code Generation
• The final phase of the compiler is the generation of target code, consisting normally of relocatable machine code or assembly code.

Example:
MOV id3, R1
MUL #60.0, R1
MOV id2, R2
ADD R2, R1
MOV R1, id1
The Symbol Table
• When identifiers are found, they are entered into a symbol table, which holds all relevant information about identifiers.
• This information is used later by the semantic analyzer and the code generator.
Error Detection and Reporting
■ Semantic errors: type mismatches between operators and operands, e.g. a return statement with a value in a function whose return type is void.
■ Syntactic errors: misplaced semicolons, extra or missing braces.
■ Lexical errors: misspellings of keywords, identifiers, and operators.
Major Data Structures in a Compiler
• Tokens
• The syntax tree
• The symbol table
• The literal table
• The intermediate code
• Temporary files
Bootstrapping
• The language in which the compiler itself is written is called the host language.
• For a compiler to execute immediately, this host language must be the same as the machine language.
• This is how the first compiler was designed.

A compiler for language A written in language B, fed through an existing compiler for language B, yields a running compiler for language A.

• If the existing compiler for language B runs on a machine different from the target machine, the situation is a bit more complicated: compilation then produces a cross compiler.
T-Diagrams Describing Complex Situations
• A compiler written in language H that translates language S into language T is drawn as a T-diagram, abbreviated here as [S -> T in H].
• T-diagrams can be combined in two basic ways.
The First T-diagram Combination

[A -> B in H] + [B -> C in H] = [A -> C in H]

• Two compilers run on the same machine H:
• the first translates from A to B,
• the second translates from B to C,
• the result translates from A to C on H.
The Second T-diagram Combination

[A -> B in H] run through [H -> K in M] = [A -> B in K]

• To translate the implementation language of a compiler from H to K,
• use another compiler that translates from H to K.
The First Scenario

[A -> H in B] run through [B -> H in H] = [A -> H in H]

• Translate a compiler from A to H written in B,
• using an existing compiler for language B on machine H.
The Second Scenario

[A -> H in B] run through [B -> K in K] = [A -> H in K]

• Use an existing compiler for language B on a different machine K;
• the result is a cross compiler.
Process of Bootstrapping
• It is common to write a compiler in the same language that it is to compile: [S -> T in S].
• This is called the circularity problem, and to overcome this difficulty we use bootstrapping.
The First Step in Bootstrapping

[A -> H in A] run through [A -> H in H (quick and dirty)] = [A -> H in H]

• A "quick and dirty" compiler is written directly in machine language H.
• The compiler proper is written in its own language A.
• The result is a running but inefficient compiler.
The Second Step in Bootstrapping

[A -> H in A] run through [A -> H in H (from step 1)] = [A -> H in H]

• The compiler written in its own language A is compiled by the compiler from step 1.
• The result is the final version of the compiler.
Step 1 in Porting

[A -> K in A] run through [A -> H in H (original)] = [A -> K in H]

• The compiler source code is retargeted to K.
• It is compiled by the original compiler from bootstrapping.
• The result is a cross compiler.
Step 2 in Porting

[A -> K in A] run through [A -> K in H (cross compiler)] = [A -> K in K]

• The retargeted compiler source code is compiled by the cross compiler.
• The result is the retargeted compiler.
Compiler-Construction Tools
• Scanner generators
• Parser generators
• Syntax-directed translation engines
• Automatic code generators
• Data-flow engines

Analysis Tools
• Structure editors
• Pretty printers
• Static checkers
• Interpreters
Lexical Analysis
• Role of the lexical analyzer
• Specification of tokens
• Recognition of tokens
• Lexical analyzer generator
• Design of a lexical analyzer generator
• LEX tool
The Role of the Lexical Analyzer

Source program --> Lexical Analyzer --(token)--> Parser --> to semantic analysis

The parser requests each token by calling getNextToken; both the lexical analyzer and the parser consult the symbol table.
Why Separate Lexical Analysis and Parsing?
1. Simplicity of design
2. Improved compiler efficiency
3. Enhanced compiler portability
Tokens, Patterns and Lexemes
• A token is a word found in the programming language description.
• There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule, called a pattern, associated with the token.
• A lexeme is a sequence of characters in the source program that matches the pattern for a token.
Example

Token        Informal description                      Sample lexemes
if           characters i, f                           if
else         characters e, l, s, e                     else
comparison   < or > or <= or >= or == or !=            <=, !=
id           letter followed by letters and digits     pi, score, D2
number       any numeric constant                      3.14159, 0, 6.02e23
literal      anything but ", surrounded by "s          "core dumped"

E.g. in const pi = 3.14, the substring pi is a lexeme for the token id.
Attributes for Tokens
• The lexical analyzer collects information about tokens into their associated attributes.

• E = M * C ** 2
• <id, pointer to symbol-table entry for E>
• <assign-op>
• <id, pointer to symbol-table entry for M>
• <mult-op>
• <id, pointer to symbol-table entry for C>
• <exp-op>
• <number, integer value 2>
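One common way to represent such token/attribute pairs, sketched in C (the type and field names are illustrative):

typedef enum { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUMBER } TokenType;

typedef struct {
    TokenType type;           /* the token itself */
    union {                   /* attribute, depending on the token type */
        int symtab_index;     /* for ID: the symbol-table entry */
        int int_value;        /* for NUMBER: the numeric value */
    } attr;
} Token;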

Lexical Errors
• Few errors are caught at the lexical level alone, because a lexical analyzer has a very localized view of the source program.
• e.g. fi (a == f(x)) … (fi is itself a valid identifier, so the misspelling of if is not a lexical error)
• However, the lexical analyzer may be able to recognize errors like:
• d = 2r
• Such errors are recognized when no pattern for a token matches a character sequence.
Error Recovery
• Panic mode: successive characters are ignored until we reach a well-formed token.
• Other possible recovery actions:
• delete one character from the remaining input,
• insert a missing character into the remaining input,
• replace a character by another character,
• transpose two adjacent characters.
Input Buffering
• There are times when the lexical analyzer needs to look ahead several characters beyond the lexeme for a pattern before a match can be announced.
• We use a buffer divided into two N-character halves, where N is the number of characters in one disk block (e.g. 1024 or 4096).

E = M * C * 2
^                 ^
lexeme_beginning  forward pointer
Code to Advance the Forward Pointer

if forward at end of first half then begin
    reload second half;
    forward := forward + 1;
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half;
end
else forward := forward + 1;

Drawback:
Under this scheme we must test two conditions on every advance of the forward pointer, even though most advances simply increment it by 1.
Sentinels

E = M eof | * C * * 2 eof | ... | eof

• A sentinel is a special character, placed at the end of each buffer half, that cannot be part of the source program.
• The most commonly used sentinel is "eof".
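With sentinels, the common case needs only a single test per character. A sketch of the lookahead code under this scheme, in the style of the pseudocode above:

forward := forward + 1;
if forward points at eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1;
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half;
    end
    else terminate lexical analysis   (a real eof: end of the input)
end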


Specification of Tokens
• Alphabet (or character class): any finite set of symbols.
• String: a finite sequence of symbols drawn from some alphabet.
• Language: any set of strings over some fixed alphabet.
Regular Expressions
• A regular expression is a method of describing all the possible tokens that can appear in the input string.
• * : zero or more instances
• + : one or more instances
• ? : zero or one instance
• A regular expression r over alphabet Σ defines the language L(r) corresponding to r.
• Regular set: a language denoted by a regular expression.
• Basic symbols:
  • the empty string ε,
  • any symbol a in the input symbol set Σ.
• Basic operators:
  • disjunction (OR, union): r | s
  • concatenation (AND): r s (or simply rs)
  • closure (repetition): r*
  • identity (parenthesized): (r)

Regular Definitions
• We may wish to give names to regular expressions and use these names in subsequent expressions:

d1 -> r1
d2 -> r2
...
dn -> rn
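For example, identifiers and unsigned numbers are conventionally named this way:

letter -> A | B | ... | Z | a | b | ... | z
digit  -> 0 | 1 | ... | 9
id     -> letter ( letter | digit )*
digits -> digit digit*
number -> digits ( . digits )? ( E ( + | - )? digits )?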
Recognition of Tokens
• One way to begin the design of any program is to describe the behavior of the program by a flowchart.
• A specialized kind of flowchart for a lexical analyzer is called a transition diagram.
• In the transition diagram, the boxes of the flowchart are drawn as circles, called states.
• The states are connected by arrows, called edges.
• The labels on the various edges leaving a state S indicate the input characters that can appear after that state S.
• Transition diagram for relop
• Transition diagram for reserved words and identifiers
Code for the Transition Diagram of an Identifier

State 0: c = getchar()
         if letter(c) then goto state 1
         else fail()

State 1: c = getchar()
         if letter(c) or digit(c) then goto state 1
         else if delimiter(c) then goto state 2
         else fail()

• In state 2 we return to the parser a pair consisting of the integer code for an identifier, denoted ID, and a value that is a pointer to the symbol table, returned by INSTALL.

State 2: retract()
         return (ID, INSTALL())
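A compact C rendering of these three states (fail, retract and install are the helper routines named above, declared here but assumed to be defined elsewhere; it is slightly simplified in that any non-letter, non-digit character is treated as the delimiter):

#include <ctype.h>
#include <stdio.h>

extern void fail(void);      /* error routine from the pseudocode */
extern void retract(void);   /* push the delimiter back onto the input */
extern int  install(void);   /* enter the lexeme in the symbol table */

/* Returns the symbol-table value; the parser also receives the
   integer code for the token ID (see the table below). */
int recognize_id(void) {
    int c = getchar();                        /* state 0 */
    if (!isalpha(c)) { fail(); return -1; }
    do {                                      /* state 1: letters and digits */
        c = getchar();
    } while (isalpha(c) || isdigit(c));
    retract();                                /* state 2: give back delimiter */
    return install();
}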
E.g.

Token        Code   Value
begin        1      -
end          2      -
if           3      -
then         4      -
else         5      -
identifier   6      pointer to symbol table
constant     7      pointer to symbol table
<            8      1
<=           8      2
=            8      3
<>           8      4
>            8      5
>=           8      6
Finite Automata
• A recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise.
NFA: Nondeterministic Finite Automata
• An NFA is defined as a five-tuple M = (Q, Σ, δ, q0, F):
• Q: a finite set of states
• Σ: a finite set of input symbols
• δ: a transition function that maps (state, symbol) pairs to sets of states
• q0: a state distinguished as the start state
• F: a set of states distinguished as final states
Transition Diagram (NFA) for (a | b)*abb

States 0, 1, 2, 3; start state 0; final state 3.
State 0 loops to itself on a and on b, and the chain 0 --a--> 1 --b--> 2 --b--> 3 accepts the suffix abb.

Transition Table

State   a       b
0       {0,1}   {0}
1       ---     {2}
2       ---     {3}
Deterministic Finite Automata
• A DFA is a special case of an NFA in which
• no state has an ε-transition, and
• for each state s and input symbol a, there is at most one edge labeled a leaving s.
• Q = {q0, q1, q2, q3}
• Σ = {a, b, c}
• start state: q0
• F = {q3}
• Transition table:

State   a      b      c
q0      q1     ---    q2
q1      ---    q3     ---
q2      q3     ---    ---
q3      ---    ---    ---
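A table-driven simulation of this DFA in C, directly encoding the transition table above (-1 marks a missing transition):

#include <stdio.h>

enum { Q0, Q1, Q2, Q3, NSTATES };

static const int delta[NSTATES][3] = {
    /*        a    b    c  */
    /* q0 */ { Q1, -1,  Q2 },
    /* q1 */ { -1,  Q3, -1 },
    /* q2 */ { Q3, -1, -1 },
    /* q3 */ { -1, -1, -1 },
};

/* Returns 1 if the DFA accepts s, i.e. consumes all of s and ends in q3. */
int accepts(const char *s) {
    int state = Q0;
    for (; *s != '\0'; s++) {
        int col = *s - 'a';               /* map 'a','b','c' to columns 0,1,2 */
        if (col < 0 || col > 2) return 0; /* symbol outside the alphabet */
        state = delta[state][col];
        if (state == -1) return 0;        /* no transition: reject */
    }
    return state == Q3;
}

int main(void) {
    printf("%d %d %d\n", accepts("ab"), accepts("ca"), accepts("abc")); /* 1 1 0 */
    return 0;
}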
What is Lex?
• The main job of a lexical analyzer (scanner) is to break up an input stream into more usable elements (tokens):

a = b + c * d;
ID ASSIGN ID PLUS ID MULT ID SEMI

• Lex is a utility to help you rapidly generate your scanners.
Lex – Lexical Analyzer
• Lexical analyzers tokenize input streams.
• Tokens are the terminals of a language:
  • English: words, punctuation marks, …
  • Programming language: identifiers, operators, keywords, …
• Regular expressions define the terminals/tokens.
An Overview of Lex

Lex source program --> Lex --> lex.yy.c
lex.yy.c --> C compiler --> a.out
input --> a.out --> tokens
Lex Source
• Lex source is separated into three sections by %% delimiters.
• The general format of Lex source is:

{declarations}
%%                       (required)
{translation rules}
%%                       (optional)
{auxiliary procedures}

• The absolute minimum Lex program is thus:
%%
• Translation rules are statements of the form:

p1 {action 1}
p2 {action 2}
...
pn {action n}

• where each pi is a regular expression and each action i is a program fragment describing what action the lexical analyzer should take when pi matches a lexeme.
Lex Source Program
• Lex source is a table of
• regular expressions and
• corresponding program fragments:

digit    [0-9]
letter   [a-zA-Z]
%%
{letter}({letter}|{digit})*   printf("id: %s\n", yytext);
\n                            printf("new line\n");
%%
main() {
    yylex();
}
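To try this program, it can be saved as, say, ex.l (the file name is arbitrary) and built in the usual way:

lex ex.l            (generates lex.yy.c)
cc lex.yy.c -ll     (-ll links the Lex library; with flex, use -lfl instead)
./a.out             (reads standard input and prints the matched tokens)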
Lex Source to C Program
• The table is translated to a C program (lex.yy.c) which
• reads an input stream,
• partitions the input into strings which match the given expressions, and
• copies it to an output stream if necessary.
Lex vs. Yacc
• Lex
• Lex generates C code for a lexical analyzer, or scanner.
• Lex uses patterns that match strings in the input and converts the strings to tokens.
• Yacc
• Yacc generates C code for a syntax analyzer, or parser.
• Yacc uses grammar rules that allow it to analyze tokens from Lex and create a syntax tree.
Implementation of a Lexical Analyzer
• Lex can build from its input a lexical analyzer that behaves roughly like a finite automaton.
• The idea is to construct a nondeterministic finite automaton N for each token pattern p in the translation rules, and then link these NFAs together with a new start state and ε-transitions.
• Next, we convert this NFA to a DFA using the subset construction.
• E.g., consider the following Lex program:

regular definitions
(none)

translation rules
a     {}      /* actions are omitted here */
abb   {}
a*b+  {}
