
Language Processing System in Compiler Design

Introduction:
A computer is a combination of software and hardware. Hardware is the physical machinery, and its behaviour is directed by the software that runs on it. The hardware interprets instructions as electrical signals, which correspond to binary code, a language consisting only of 0s and 1s. Writing code directly in binary, as a long series of 0s and 1s, would be an inconvenient and error-prone task for programmers, so we write programs in a high-level language, which is convenient for us to comprehend and remember. These programs are then fed into a series of translators and operating system (OS) components to obtain the code that can be used by the machine. This chain of translators is known as a language processing system.

Figure – Language processing system


Components of a language processing system:
The diagram above shows the following components. Let's discuss them one by one.
 Preprocessor–
The preprocessor handles header-file inclusion and macro expansion. (A macro is a piece of code that is given a name; wherever the name is used, it is replaced by the contents of the macro by the compiler or interpreter. The purpose of macros is either to automate frequently used sequences of code or to enable more powerful abstraction.) The preprocessor takes source code as input and produces modified source code as output. It is also known as a macro expander. This step is optional: a language that supports neither #include nor macros does not require it.
 
 Compiler–
The compiler takes the modified code as input and produces the
target code as output.


 Assembler–
The assembler takes the target code as input and produces relocatable machine code as output.
 
 Linker–
A linker or link editor is a program that takes a collection of objects
(created by assemblers and compilers) and combines them into an
executable program.
 
 Loader–
The loader loads the linked program into main memory for execution.
 
 Executable code– 
It is low-level, machine-specific code that the machine can understand and execute directly. Once the job of the linker and loader is done, the object code is finally converted into executable code.
Differences between Linker and Loader:
The differences between the linker and the loader are as follows.

1. The linker is part of the library files, whereas the loader is part of the operating system.
2. The linker performs the linking operation, whereas the loader loads the program for execution.
3. The linker also connects user-defined functions to user-defined libraries, whereas loading a program involves reading the contents of an executable file into memory.

Functions of loader :
1. Allocation – 
It allocates space in memory for the object program. A translator cannot allocate this space, because doing so could cause overlap or large wastage of memory.
 
2. Linking – 
It combines two or more different object programs and resolves symbolic references between the object modules. It also provides the necessary information to allow references between them. Linking is of two types, as follows.
Static Linking:
It copies all the library routines used in the program into the executable image. This requires more disk space and memory.
Dynamic Linking:
It resolves undefined symbols while the program is running. This means that the executable code still contains undefined symbols, plus a list of the objects or libraries that will provide definitions for them.
 
3. Relocation – 
It modifies the object program so that it can be loaded at an address different from the location originally specified, and adjusts all address-dependent locations accordingly.
 
4. Loading – 
Physically, it keeps machine instructions and data in memory for
execution.
A compiler itself has two broad phases: an analysis phase and a synthesis phase. The analysis phase creates an intermediate representation from the given source code; the synthesis phase creates an equivalent target program from that intermediate representation.

Symbol Table – It is a data structure used and maintained by the compiler that holds all the identifiers' names along with their types. It helps the compiler function smoothly by letting it find identifiers quickly.

The analysis of a source program is divided into three main phases:
1. Linear Analysis –
This involves a scanning phase in which the stream of characters is read from left to right and grouped into tokens having a collective meaning.
2. Hierarchical Analysis –
In this phase, based on their collective meaning, the tokens are grouped hierarchically into nested structures.
3. Semantic Analysis –
This phase checks whether the components of the source program are meaningful or not.

The compiler has two modules, namely the front end and the back end. The front end consists of the lexical analyzer, syntax analyzer, semantic analyzer and intermediate code generator; the remaining phases together form the back end.

1. Lexical Analyzer –
It is also called a scanner. It takes as input the output of the preprocessor (which performs file inclusion and macro expansion), which is still pure high-level language. It reads the characters of the source program and groups them into lexemes (sequences of characters that "go together"); each lexeme corresponds to a token. Tokens are defined by regular expressions, which the lexical analyzer understands. It also detects lexical errors (e.g., erroneous characters) and strips out comments and white space.
2. Syntax Analyzer – It is sometimes called a parser. It takes the tokens one by one and uses a context-free grammar to construct the parse tree.

Why Grammar?
The rules of a programming language can be represented by a small set of productions. Using these productions we can describe what a valid program looks like, and the input can be checked against them to see whether it is in the desired form.
The parse tree is also called the derivation tree. Parse trees are generally constructed to check for ambiguity in the given grammar. There are certain rules associated with the derivation tree.

o Any identifier is an expression.
o Any number is an expression.
o Applying an operator to expressions again yields an expression; for example, the sum of two expressions is also an expression.
o The parse tree can be compressed to form a syntax tree.

Syntax errors can be detected at this level if the input is not in accordance with the grammar.

2. Semantic Analyzer – It verifies the parse tree, checking whether it is meaningful or not, and produces a verified parse tree. It also performs type checking, label checking and flow-control checking.
3. Intermediate Code Generator – It generates intermediate code, a form that can be readily translated into target machine code. There are many popular intermediate representations, for example three-address code. The intermediate code is converted to machine language by the last two phases, which are platform dependent.

Up to intermediate code generation the process is the same for every compiler; after that it depends on the platform. To build a new compiler we therefore do not need to start from scratch: we can take the intermediate code from an already existing compiler and build only the last two parts.

4. Code Optimizer – It transforms the code so that it consumes fewer resources and runs faster, without altering the meaning of the code. Optimization can be categorized into two types: machine dependent and machine independent.
5. Target Code Generator – Its main purpose is to produce code that the machine can understand; it also performs register allocation, instruction selection, etc. The output depends on the type of assembler. This is the final stage of compilation. The optimized code is converted into relocatable machine code, which then forms the input to the linker and loader.
All these six phases are associated with the symbol table manager and error handler as shown in
the above block diagram.

Nearly all compilers separate the task of analyzing syntax into two distinct parts, lexical analysis and syntax analysis.

o Lexical analysis deals with small-scale language constructs, such as names and literals.

o Syntax analysis deals with large-scale language constructs, such as expressions, statements and program units.

Reasons why lexical analysis is separated from syntax analysis:

• Simplicity
o Lexical analysis can be simplified because its techniques are less complex than those of syntax analysis.
o The syntax analyzer can be smaller and cleaner once the low-level details of lexical analysis are removed from it.

• Efficiency – separate, selective optimization
o Lexical analysis is worth optimizing because it accounts for a significant portion of total compile time.
o The syntax analyzer need not be optimized as heavily.

• Portability
o The lexical analyzer is somewhat system dependent, since it handles input processing.
o The syntax analyzer is not system dependent.

Symbol Table in Compiler


Prerequisite – Phases of a Compiler
A symbol table is an important data structure created and maintained by the compiler in order to keep track of the semantics of variables, i.e., it stores information about the scope and binding of names, and about instances of various entities such as variable and function names, classes, objects, etc.
 It is built in the lexical and syntax analysis phases.
 The information is collected by the analysis phases of the compiler and is used by the synthesis phases to generate code.
 It is used by the compiler to achieve compile-time efficiency.
 It is used by the various phases of the compiler as follows:
1. Lexical Analysis: Creates new entries in the table, e.g., entries for tokens.
2. Syntax Analysis: Adds information about attributes such as type, scope, dimension, line of reference, use, etc.
3. Semantic Analysis: Uses the information in the table to check that expressions and assignments are semantically correct (type checking) and updates it accordingly.
4. Intermediate Code Generation: Consults the symbol table to know how much run-time storage is allocated and of what type; the table also helps in adding information about temporary variables.
5. Code Optimization: Uses information present in the symbol table for machine-dependent optimization.
6. Target Code Generation: Generates code using the address information of the identifiers present in the table.
Symbol table entries – Each entry in the symbol table is associated with attributes that support the compiler in its different phases.
Items stored in Symbol table:
 Variable names and constants
 Procedure and function names
 Literal constants and strings
 Compiler generated temporaries
 Labels in source languages
Information used by compiler from Symbol table:
 Data type and name
 Declaring procedures
 Offset in storage
 If structure or record then, pointer to structure table.
 For parameters, whether parameter passing by value or by
reference
 Number and type of arguments passed to function
 Base Address
Operations on the symbol table – The basic operations defined on a symbol table are insert(), which adds a name together with its attributes, and lookup(), which searches for a name and retrieves its information.
Implementation of Symbol table –
Following are commonly used data structure for implementing symbol
table :-
1. List –
 In this method an array is used to store names and their associated information.
 A pointer "available" is maintained at the end of all stored records, and new names are added in the order in which they arrive.
 To search for a name we scan from the beginning of the list up to the available pointer; if the name is not found we report the error "use of undeclared name".
 While inserting a new name we must ensure that it is not already present, otherwise the error "multiply defined name" occurs.
 Insertion is fast O(1), but lookup is slow for large tables – O(n) on average.
 Its advantage is that it takes the minimum amount of space.
2. Linked List –
 This implementation uses a linked list: a link field is added to each record.
 Names are searched for in the order indicated by the link fields.
 A pointer "First" points to the first record of the symbol table.
 Insertion is fast O(1), but lookup is slow for large tables – O(n) on average.
3. Hash Table –
 In the hashing scheme two tables are maintained – a hash table and a symbol table; this is the most commonly used method for implementing symbol tables (a minimal sketch follows this list).
 The hash table is an array indexed from 0 to tablesize – 1; its entries are pointers to the names in the symbol table.
 To search for a name we apply a hash function that yields an integer between 0 and tablesize – 1.
 Insertion and lookup can be made very fast – O(1).
 Its advantage is that searching is quick; its disadvantage is that hashing is more complicated to implement.
4. Binary Search Tree –
 Another approach is to use a binary search tree, i.e., two link fields, left child and right child, are added to each record.
 All names are stored as nodes that maintain the binary search tree property.
 Insertion and lookup are O(log2 n) on average.
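As a concrete illustration of the hash-table scheme, here is a minimal sketch in Python; the names SymbolTable, insert and lookup are illustrative assumptions rather than any particular compiler's API, and a Python dict stands in for the hash-table/symbol-table pair described above:

    # Minimal hash-table symbol table (illustrative sketch only).
    class SymbolTable:
        def __init__(self):
            self.table = {}                      # a dict is itself a hash table

        def insert(self, name, **attributes):
            if name in self.table:               # "multiply defined name"
                raise KeyError(f"multiple definition of '{name}'")
            self.table[name] = attributes        # O(1) on average

        def lookup(self, name):
            if name not in self.table:           # "use of undeclared name"
                raise KeyError(f"use of undeclared name '{name}'")
            return self.table[name]              # O(1) on average

    symtab = SymbolTable()
    symtab.insert("a", type="int", address=1000)
    symtab.insert("c", type="float", address=1004)
    print(symtab.lookup("c"))                    # {'type': 'float', 'address': 1004}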
Input Buffering in Compiler Design
The lexical analyzer scans the input from left to right one character at a time. It uses two pointers, begin ptr (bp) and forward ptr (fp), to keep track of the portion of the input scanned.
Initially both pointers point to the first character of the input string.
The forward pointer moves ahead in search of the end of the lexeme. As soon as a blank space is encountered, it indicates the end of the lexeme; for example, as soon as fp encounters a blank space the lexeme "int" is identified.
When fp encounters white space it skips over it, and then both the begin ptr (bp) and the forward ptr (fp) are set to the start of the next token.
The input characters are read from secondary storage, but reading one character at a time from secondary storage is costly, hence a buffering technique is used: a block of data is first read into a buffer and then scanned by the lexical analyzer. Two schemes are used in this context: the one-buffer scheme and the two-buffer scheme. These are explained below.

1. One Buffer Scheme:

In this scheme only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
2. Two Buffer Scheme:
To overcome the problem of the one-buffer scheme, two buffers are used to store the input string and are scanned alternately: when the end of the current buffer is reached, the other buffer is filled. The only remaining problem is that if the length of a lexeme is longer than the length of a buffer, the input cannot be scanned completely.

Initially both bp and fp point to the first character of the first buffer. Then fp moves to the right in search of the end of the lexeme; as soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To identify the boundary of the first buffer, an end-of-buffer character is placed at the end of the first buffer.

Similarly, the end of the second buffer is recognized by the end-of-buffer mark at its end. When fp encounters the first eof, the end of the first buffer is recognized and filling of the second buffer begins. In the same way, when the second eof is reached it indicates the end of the second buffer. The two buffers are filled alternately until the end of the input program, and the stream of tokens is identified. The eof character introduced at the end is called a sentinel and is used to identify the end of a buffer, as sketched below.
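The following Python sketch imitates the two-buffer scheme with sentinels. It is illustrative only: the buffer size N, the sentinel character and the function names are assumptions, and a real lexer would work on raw disk blocks rather than a Python file object.

    N = 16                        # buffer-half size; a real lexer would use a disk block, e.g. 4096
    EOF = "\0"                    # sentinel, assumed never to occur in the source program

    def refill(f):
        # Read up to N characters into a fresh buffer half and append the sentinel.
        data = f.read(N)
        return list(data) + [EOF], len(data) < N     # second value: was this the last block?

    def characters(f):
        # Yield source characters one at a time, reloading buffer halves on demand.
        halves = [None, None]
        halves[0], exhausted = refill(f)
        current, i = 0, 0
        while True:
            ch = halves[current][i]
            if ch != EOF:                        # the common case: a single test per character
                yield ch
                i += 1
                continue
            if exhausted:                        # this sentinel marks the true end of input
                return
            other = 1 - current                  # this sentinel marks the end of a buffer half:
            halves[other], exhausted = refill(f)     # reload the other half ...
            current, i = other, 0                    # ... and continue scanning there

    import io
    print("".join(characters(io.StringIO("int x = 42 ;"))))   # int x = 42 ;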
Tokens, patterns and lexemes

 The words generated by the linear analysis may be of different kinds:


o identifier,
o keyword (if, while, ...),
o punctuation character,
o multi-character operator (:=, ->, ...).
 Such a kind is called a TOKEN and an element of a kind is called a LEXEME.
 A word is recognized to be a lexeme for a certain token by PATTERN MATCHING.
For instance letter followed by letters and digits is a pattern that matches a
word like x or y with the token id (= identifier).

Token      Lexemes           Pattern
ID         x, y, n0          letter followed by letters and digits
NUM        -123, 1.456e-5    any numeric constant
IF         if                if
LPAREN     (                 (
LITERAL    "Hello"           any string of characters (except ") between " and "


Tokens, patterns and lexemes

Token:  Token is a sequence of characters that can be treated as a single logical entity. Typical tokens
are,                                                                                       

1) Identifiers 2) keywords 3) operators 4) special symbols 5)constants

Pattern: A set of strings in the input for which the same token is produced as output. This set of strings
is described by a rule called a pattern associated with the token.

Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for
a token.

Example:          

                              Description of tokens

Token      Lexemes                Pattern
const      const                  const
if         if                     if
relation   <, <=, =, <>, >=, >    < or <= or = or <> or >= or >
id         pi                     letter followed by letters and digits
num        3.14                   any numeric constant
literal    "core"                 any characters between " and ", except "

 A pattern is a rule describing the set of lexemes that can represent a particular token in a source program.

Analysis part

• The analysis part breaks the source program into its constituent pieces, imposes a grammatical structure on them, and uses this structure to create an intermediate representation of the source program.
• It is also termed the front end of the compiler.
• Information about the source program is collected and stored in a data structure called the symbol table.

Synthesis part

• The synthesis part takes the intermediate representation as input and transforms it into the target program.
• It is also termed the back end of the compiler.

The design of a compiler can be decomposed into several phases, each of which converts one form of the source program into another.
The different phases of compiler are as follows:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Code optimization
6. Code generation
All of the aforementioned phases involve the following tasks:
• Symbol table management.
• Error handling.
Lexical Analysis

• Lexical analysis is the first phase of the compiler; it is also termed scanning.
• The source program is scanned to read the stream of characters, and the characters are grouped into sequences called lexemes, for which tokens are produced as output.
• Token: A token is a sequence of characters that represents a lexical unit matching a pattern, such as a keyword, operator or identifier.
• Lexeme: A lexeme is an instance of a token, i.e., a group of characters forming a token.
• Pattern: A pattern describes the rule that the lexemes of a token follow. It is the structure that must be matched by strings.
• Once a token is generated, a corresponding entry is made in the symbol table.
Input: stream of characters
Output: Token
Token Template: <token-name, attribute-value>
(eg.) c=a+b*5;
                                                 Lexemes and tokens

Lexemes    Tokens
c          identifier
=          assignment symbol
a          identifier
+          + (addition symbol)
b          identifier
*          * (multiplication symbol)
5          5 (number)

Hence, the token stream is <id,1> <=> <id,2> <+> <id,3> <*> <5>.
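To make this concrete, here is a minimal scanner sketch in Python for the statement above. The token names and patterns are illustrative assumptions that roughly match the table, not a fixed specification.

    import re

    # Token specification: (token name, pattern) pairs; illustrative only.
    TOKEN_SPEC = [
        ("num",    r"\d+(?:\.\d+)?"),
        ("id",     r"[A-Za-z_][A-Za-z_0-9]*"),
        ("assign", r"="),
        ("plus",   r"\+"),
        ("times",  r"\*"),
        ("semi",   r";"),
        ("skip",   r"\s+"),
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

    def tokenize(source):
        # Group characters into lexemes and emit <token-name, attribute-value> pairs.
        symbol_table, tokens = [], []
        for m in MASTER.finditer(source):
            kind, lexeme = m.lastgroup, m.group()
            if kind == "skip":
                continue                              # white space is stripped out
            if kind == "id":                          # identifiers get a symbol-table entry
                if lexeme not in symbol_table:
                    symbol_table.append(lexeme)
                tokens.append(("id", symbol_table.index(lexeme) + 1))
            else:
                tokens.append((kind, lexeme))
        return tokens

    print(tokenize("c=a+b*5;"))
    # [('id', 1), ('assign', '='), ('id', 2), ('plus', '+'), ('id', 3), ('times', '*'), ('num', '5'), ('semi', ';')]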


Syntax Analysis

• Syntax analysis is the second phase of the compiler; it is also called parsing.
• The parser converts the tokens produced by the lexical analyzer into a tree-like representation called a parse tree.
• A parse tree describes the syntactic structure of the input.

• A syntax tree is a compressed representation of the parse tree in which the operators appear as interior nodes and the operands of an operator are the children of the node for that operator.
Input: Tokens
Output: Syntax tree
Semantic Analysis

• Semantic analysis is the third phase of compiler.


• It checks for the semantic consistency.
• Type information is gathered and stored in symbol table or in syntax tree.
• Performs type checking.
Intermediate Code Generation

• Intermediate code generation produces intermediate representations for the source


program which are of the following forms:
     o Postfix notation
     o Three address code
     o Syntax tree
Most commonly used form is the three address code.
        t1 = inttofloat(5)
        t2 = id3 * t1
        t3 = id2 + t2
        id1 = t3
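The following Python sketch shows one way such three-address code could be produced from a syntax tree by a bottom-up walk. The tuple-based tree shape and the names gen and tree are illustrative assumptions, not the representation used by any particular compiler.

    import itertools

    # A tiny syntax tree for: id1 = id2 + id3 * 5  (with an int-to-float conversion node).
    tree = ("=", "id1", ("+", "id2", ("*", "id3", ("inttofloat", 5))))
    _temp = itertools.count(1)

    def gen(node, code):
        # Emit three-address instructions bottom-up; return the place holding the value.
        if not isinstance(node, tuple):
            return str(node)                          # identifier or constant: used directly
        op, *operands = node
        places = [gen(child, code) for child in operands]
        if op == "=":                                 # assignment stores into the target name
            code.append(f"{places[0]} = {places[1]}")
            return places[0]
        t = f"t{next(_temp)}"                         # every other operator gets a new temporary
        if len(places) == 1:
            code.append(f"{t} = {op}({places[0]})")
        else:
            code.append(f"{t} = {places[0]} {op} {places[1]}")
        return t

    code = []
    gen(tree, code)
    print("\n".join(code))
    # t1 = inttofloat(5)
    # t2 = id3 * t1
    # t3 = id2 + t2
    # id1 = t3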
Properties of intermediate code

• It should be easy to produce.


• It should be easy to translate into target program.
Code Optimization

• Code optimization phase gets the intermediate code as input and produces optimized
intermediate code as output.
• It results in faster running machine code.
• It can be done by reducing the number of lines of code for a program.
• This phase reduces the redundant code and attempts to improve the intermediate
code so that faster-running machine code will result.
• During the code optimization, the result of the program is not affected.
• To improve the code generation, the optimization involves
       o Detection and removal of dead code (unreachable code).
       o Calculation of constants in expressions and terms (constant folding).
       o Collapsing of repeated expressions into a single temporary.
       o Loop unrolling.
       o Moving code outside the loop.
       o Removal of unwanted temporary variables.
                   t1 = id3 * 5.0
                   id1 = id2 + t1
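As a rough illustration of a machine-independent optimization, the Python sketch below folds the inttofloat conversion of an integer constant at compile time. It is not a full optimizer: a further copy-propagation pass (not shown) would merge t3 into id1 to give exactly the two instructions above. The function name and the string form of the instructions are assumptions.

    import re

    code = [
        "t1 = inttofloat(5)",
        "t2 = id3 * t1",
        "t3 = id2 + t2",
        "id1 = t3",
    ]

    def fold_inttofloat(code):
        # Replace inttofloat of an integer literal by the float literal itself.
        known = {}                                        # temporary -> folded constant
        out = []
        for line in code:
            m = re.fullmatch(r"(\w+) = inttofloat\((\d+)\)", line)
            if m:
                known[m.group(1)] = str(float(m.group(2)))    # fold at compile time
                continue                                      # the instruction disappears
            for temp, const in known.items():                 # substitute the folded constants
                line = re.sub(rf"\b{temp}\b", const, line)
            out.append(line)
        return out

    print(fold_inttofloat(code))      # ['t2 = id3 * 5.0', 't3 = id2 + t2', 'id1 = t3']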
Code Generation

• Code generation is the final phase of a compiler.


• It gets input from code optimization phase and produces the target code or object
code as result.
• Intermediate instructions are translated into a sequence of machine instructions that
perform the same task.
• The code generation involves
     o Allocation of registers and memory.
     o Generation of correct references.
     o Generation of correct data types.
     o Generation of missing code.
                LDF R2, id3
                MULF R2, # 5.0
                LDF R1, id2
                ADDF R1, R2
                STF id1, R1
Symbol Table Management

• Symbol table is used to store all the information about identifiers used in the program.
• It is a data structure containing a record for each identifier, with fields for the attributes
of the identifier.
• It allows finding the record for each identifier quickly and to store or retrieve data from
that record.
• Whenever an identifier is detected in any of the phases, it is stored in the symbol
table.
Example
int a, b; float c; char z;

Symbol name    Type     Address
a              int      1000
b              int      1002
c              float    1004
z              char     1008

 
 
 
Example

extern double test (double x);

double sample (int count) {
    double sum = 0.0;
    for (int i = 1; i <= count; i++)
        sum += test((double) i);
    return sum;
}

Symbol name    Type                Scope
test           function, double    extern
x              double              function parameter
sample         function, double    global
count          int                 function parameter
sum            double              block local
i              int                 for-loop statement
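A scope-aware symbol table like the one implied by this example is often built as a stack of hash tables, one per open scope. The Python sketch below is illustrative only; the class and method names are assumptions, not an established API.

    # Scoped symbol table as a stack of hash tables (illustrative sketch).
    class ScopedSymbolTable:
        def __init__(self):
            self.scopes = [{}]                       # index 0 is the global scope

        def enter_scope(self):
            self.scopes.append({})                   # e.g. on entering a function body or block

        def exit_scope(self):
            self.scopes.pop()

        def declare(self, name, **attrs):
            self.scopes[-1][name] = attrs            # declare in the innermost scope

        def lookup(self, name):
            for scope in reversed(self.scopes):      # inner scopes shadow outer ones
                if name in scope:
                    return scope[name]
            raise KeyError(f"undeclared identifier '{name}'")

    # Roughly mirroring the example above:
    st = ScopedSymbolTable()
    st.declare("test", type="function, double", scope="extern")
    st.declare("sample", type="function, double", scope="global")
    st.enter_scope()                                 # body of sample()
    st.declare("count", type="int", scope="function parameter")
    st.declare("sum", type="double", scope="block local")
    st.enter_scope()                                 # the for statement
    st.declare("i", type="int", scope="for-loop statement")
    print(st.lookup("sum"))                          # found in the enclosing scope
    st.exit_scope()
    st.exit_scope()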
Error Handling

• Each phase can encounter errors. After detecting an error, a phase must handle the
error so that compilation can proceed.
• In lexical analysis, errors occur in separation of tokens.
• In syntax analysis, errors occur during construction of syntax tree.
• In semantic analysis, errors may occur at the following cases:
(i) When the compiler detects constructs that have right syntactic structure but no
meaning
(ii) During type conversion.
• In code optimization, errors occur when the result is affected by the optimization. In
code generation, it shows error when code is missing etc.
The figure illustrates the translation of source code through each phase, considering the statement
    c = a + b * 5
Error Encountered in Different Phases

Each phase can encounter errors. After detecting an error, a phase must somehow deal with the error so that compilation can proceed.
A program may have the following kinds of errors at various stages:
Lexical Errors

These include an incorrect or misspelled name of some identifier, i.e., identifiers typed incorrectly.
Syntactical Errors

These include a missing semicolon or unbalanced parentheses. Syntactic errors are handled by the syntax analyzer (parser).
When an error is detected, it must be handled by parser to enable the parsing of the
rest of the input. In general, errors may be expected at various stages of compilation
but most of the errors are syntactic errors and hence the parser should be able to
detect and report those errors in the program.
The goals of the error handler in the parser are:
• Report the presence of errors clearly and accurately.
• Recover from each error quickly enough to detect subsequent errors.
• Add minimal overhead to the processing of correct programs.
There are four common error-recovery strategies that can be implemented in the
parser to deal with errors in the code.
o Panic mode.
o Phrase level.
o Error productions.
o Global correction.
Semantical Errors

These errors are a result of incompatible value assignment. The semantic errors that
the semantic analyzer is expected to recognize are:
• Type mismatch.
• Undeclared variable.
• Reserved identifier misuse.
• Multiple declaration of variable in a scope.
• Accessing an out of scope variable.
• Actual and formal parameter mismatch.
Logical errors

These errors occur because of code that can never be reached, for example code following an infinite loop.


The phases of a compiler can be grouped as:
Front end

Front end of a compiler consists of the phases


• Lexical analysis.
• Syntax analysis.
• Semantic analysis.
• Intermediate code generation.
Back end

Back end of a compiler contains


• Code optimization.
• Code generation.
Front End

• The front end comprises the phases that depend on the input (source language) and are independent of the target machine (target language).
• It includes lexical and syntactic analysis, symbol-table management, semantic analysis and the generation of intermediate code.
• Some code optimization can also be done by the front end.
• It also includes error handling for the phases concerned.

           
Back End

• The back end comprises those phases of the compiler that depend on the target machine and are independent of the source language.
• This includes code optimization and code generation.
• In addition, it also encompasses the associated error handling and symbol-table management operations.

           
Passes

• The phases of a compiler can be implemented in passes, where a pass performs the primary actions of reading an input file and writing an output file.
• Several phases of the compiler are grouped into one pass in such a way that the operations of each of those phases are carried out during the pass.
• (e.g.) Lexical analysis, syntax analysis, semantic analysis and intermediate code generation might be grouped into one pass. If so, the token stream after lexical analysis may be translated directly into intermediate code.
Reducing the Number of Passes

• Minimizing the number of passes improves time efficiency, since reading from and writing to intermediate files is reduced.
• When grouping phases into one pass, the entire program may have to be kept in memory to ensure proper information flow to each phase, because one phase may need information in a different order than the order in which the previous phase produces it.
The internal representation of the source or target program differs from its external form, so the memory needed for the internal form may be larger than that needed for the input and output.

Compiler Construction tools – Compiler Design


By Dinesh Thakur

Some commonly used compiler-construction tools include:


1. Parser generators.
2. Scanner generators.
3. Syntax-directed translation engines.
4. Automatic code generators.
5. Data-flow analysis engines.
6. Compiler-construction toolkits.
Parser Generators

Input: Grammatical description of a programming language


Output: Syntax analyzers.
Parser generator takes the grammatical description of a programming language and
produces a syntax analyzer.
Scanner Generators

Input: Regular expression description of the tokens of a language


Output: Lexical analyzers.
Scanner generator generates lexical analyzers from a regular expression description
of the tokens of a language.
Syntax-directed Translation Engines

Input: Parse tree.
Output: Intermediate code.
Syntax-directed translation engines produce collections of routines that walk a parse tree and generate intermediate code.
Automatic Code Generators

Input: Intermediate language.
Output: Machine language.
Code-generator takes a collection of rules that define the translation of each
operation of the intermediate language into the machine language for a target
machine.
Data-flow Analysis Engines

Data-flow analysis engine gathers the information, that is, the values transmitted
from one part of a program to each of the other parts. Data-flow analysis is a key
part of code optimization.
Compiler Construction Toolkits

Compiler-construction toolkits provide an integrated set of routines for constructing the various phases of a compiler.

Lexical Analysis – Compiler Design


By Dinesh Thakur

Lexical analysis is the process of converting a sequence of characters from source


program into a sequence of tokens.
A program which performs lexical analysis is termed as a lexical analyzer (lexer),
tokenizer or scanner.
Lexical analysis consists of two stages of processing which are as follows:
• Scanning
• Tokenization
Token, Pattern and Lexeme

Token

A token is a valid sequence of characters represented by a lexeme. In a programming language,
• keywords,
• constants,
• identifiers,
• numbers,
• operators and
• punctuation symbols
are possible tokens to be identified.
Pattern

Pattern describes a rule that must be matched by sequence of characters (lexemes)


to form a token. It can be defined by regular expressions or grammar rules.
Lexeme

Lexeme is a sequence of characters that matches the pattern for a token i.e.,
instance of a
token.
(eg.) c=a+b*5;
                                               Lexemes and tokens
 

Lexemes Tokens

c identifier

= assignment symbol

a identifier

+ + (addition symbol)

b identifier

* * (multiplication symbol)

5 5 (number)

 
 
 
 
 
 
 
 
 
 
 
 
 
The sequence of tokens produced by lexical analyzer helps the parser in analyzing
the syntax of programming languages.
Role of Lexical Analyzer

                        
Lexical analyzer performs the following tasks:
• Reads the source program, scans the input characters, group them into lexemes
and produce the token as output.
• Enters the identified token into the symbol table.
• Strips out white spaces and comments from source program.
• Correlates error messages with the source program i.e., displays error message
with its occurrence by specifying the line number.
• Expands the macros if it is found in the source program.
Tasks of lexical analyzer can be divided into two processes:
Scanning: Performs reading of input characters, removal of white spaces and
comments.
Lexical Analysis: Produce tokens as the output.
Need of Lexical Analyzer

Simplicity of compiler design: the removal of white space and comments enables the syntax analyzer to process syntactic constructs efficiently.
Compiler efficiency is improved: specialized buffering techniques for reading characters speed up the compilation process.
Compiler portability is enhanced: input-device-specific peculiarities can be restricted to the lexical analyzer.
Issues in Lexical Analysis

Lexical analysis is the process of producing tokens from the source program. It has
the following issues:
• Lookahead
• Ambiguities
Lookahead

Lookahead is required to decide where one token ends and the next token begins. Simple examples with lookahead issues are i vs. if, and = vs. ==. Therefore a way to describe the lexemes of each token is required.
A way is also needed to resolve ambiguities:
• Is if two variables i and f, or the keyword if?
• Is == two equal signs =, =, or the single operator ==?
• arr(5, 4) vs. fn(5, 4) in Ada (array reference syntax and function call syntax are similar).
Hence, the amount of lookahead to be considered and a way to describe the lexemes of each token are both needed.
Regular expressions are one of the most popular ways of representing tokens.
Ambiguities

Lexical analysis programs written with lex accept ambiguous specifications and choose the longest match possible at each input point. When more than one expression can match the current input, lex chooses as follows:
• The longest match is preferred.
• Among rules which matched the same number of characters, the rule given first is
preferred.
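A minimal Python sketch of this "longest match, earliest rule" policy follows; the rule set is an illustrative assumption, not lex's actual specification syntax.

    import re

    # Rules in priority order: earlier rules win when matches have equal length.
    RULES = [
        ("IF",     re.compile(r"if")),
        ("ID",     re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
        ("EQ",     re.compile(r"==")),
        ("ASSIGN", re.compile(r"=")),
    ]

    def next_token(text, pos=0):
        # Return (token, lexeme) using the longest match; ties go to the earlier rule.
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        return best

    print(next_token("if2 = 0"))    # ('ID', 'if2')  -- the longer match beats the IF rule
    print(next_token("if (x)"))     # ('IF', 'if')   -- equal lengths, so the first rule wins
    print(next_token("== 1"))       # ('EQ', '==')   -- one == token, not two = tokens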
Lexical Errors

• A character sequence that cannot be scanned into any valid token is a lexical error.
• Lexical errors are uncommon, but they still must be handled by a scanner.
• Misspellings of identifiers, keywords, or operators are considered lexical errors.
Usually, a lexical error is caused by the appearance of some illegal character, mostly
at the beginning of a token.
Error Recovery Schemes

• Panic mode recovery


• Local correction
   o Source text is changed around the error point in order to get a correct text.
   o Analyzer will be restarted with the resultant new text as input.
• Global correction
   o It is an enhanced panic mode recovery.
   o Preferred when local correction fails.
Panic mode recovery

In panic mode recovery, unmatched patterns are deleted from the remaining input,
until the lexical analyzer can find a well-formed token at the beginning of what input
is left.
(e.g.) Suppose the string fi is encountered for the first time in a C program in the context:
fi (a == f(x))
A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
Since fi is a valid lexeme for the token id, the lexical analyzer will return the token id to the parser.
Local correction

Local correction performs deletion/insertion and/or replacement of any number of


symbols in the error detection point.
(eg.) In Pascal, c[i] ‘=’; the scanner deletes the first quote because it cannot legally
follow the closing bracket and the parser replaces the resulting’=’ by an assignment
statement.
Most of the errors are corrected by local correction.
(eg.) The effects of lexical error recovery might well create a later syntax error,
handled by the parser. Consider
· · · for $tnight · · ·
The $ terminates scanning of for. Since no valid token begins with $, it is deleted.
Then tnight is scanned as an identifier.
In effect it results in
· · · for tnight · · ·
which will cause a syntax error. Such false errors are unavoidable, though a syntactic error repair may help.
Lexical error handling approaches

Lexical errors can be handled by the following actions:


• Deleting one character from the remaining input.
• Inserting a missing character into the remaining input.
• Replacing a character by another character.
• Transposing two adjacent characters.
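These four actions correspond to repairs at edit distance one. The Python sketch below generates all such candidate repairs for a misspelled lexeme and checks them against a keyword set; the function name and the keyword list are illustrative assumptions.

    import string

    def single_edit_repairs(lexeme, alphabet=string.ascii_lowercase):
        # All strings reachable from `lexeme` by one deletion, insertion, replacement or transposition.
        candidates = set()
        for i in range(len(lexeme) + 1):
            if i < len(lexeme):
                candidates.add(lexeme[:i] + lexeme[i + 1:])              # delete one character
            for c in alphabet:
                candidates.add(lexeme[:i] + c + lexeme[i:])              # insert one character
                if i < len(lexeme):
                    candidates.add(lexeme[:i] + c + lexeme[i + 1:])      # replace one character
            if i + 1 < len(lexeme):
                candidates.add(lexeme[:i] + lexeme[i + 1] + lexeme[i] + lexeme[i + 2:])  # transpose
        return candidates

    KEYWORDS = {"if", "else", "while", "for", "return"}
    print(KEYWORDS & single_edit_repairs("fi"))      # {'if'} -- recovers the misspelled keyword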

Input Buffering – Compiler Design


By Dinesh Thakur

• To ensure that the right lexeme is found, one or more characters may have to be looked at beyond the next lexeme.
• Hence a two-buffer scheme is introduced to handle large lookaheads safely.
• Techniques for speeding up the process of lexical analyzer such as the use of
sentinels to mark the buffer end have been adopted.
There are three general approaches for the implementation of a lexical analyzer:
(i) By using a lexical-analyzer generator, such as lex compiler to produce the lexical
analyzer from a regular expression based specification. In this, the generator
provides routines for reading and buffering the input.
(ii) By writing the lexical analyzer in a conventional systems-programming language,
using I/O facilities of that language to read the input.
(iii) By writing the lexical analyzer in assembly language and explicitly managing the
reading of input.
Buffer Pairs

Because moving characters consumes a large amount of time, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.
Fig shows the buffer pairs which are used to hold the input data.

          
Scheme

• Consists of two buffers, each consists of N-character size which are reloaded
alternatively.
• N-Number of characters on one disk block, e.g., 4096.
• N characters are read from the input file to the buffer using one system read
command.
• eof is inserted at the end if the number of characters is less than N.
Pointers

Two pointers lexemeBegin and forward are maintained.


lexemeBegin points to the beginning of the current lexeme, whose end is yet to be found.
forward scans ahead until a match for a pattern is found.
• Once a lexeme is found, lexemebegin is set to the character immediately after the
lexeme which is just found and forward is set to the character at its right end.
• Current lexeme is the set of characters between two pointers.
Disadvantages of this scheme

• This scheme works well most of the time, but the amount of lookahead is limited.
• This limited lookahead may make it impossible to recognize tokens in situations
where the distance that the forward pointer must travel is more than the length of
the buffer.
(e.g.) DECLARE (ARG1, ARG2, . . . , ARGn) in a PL/I program;
• The scanner cannot determine whether DECLARE is a keyword or an array name until it sees the character that follows the right parenthesis.
Sentinels

• In the previous scheme, each time when the forward pointer is moved, a check is
done to ensure that one half of the buffer has not moved off. If it is done, then the
other half must be reloaded.
• Therefore the ends of the buffer halves require two tests for each advance of the
forward pointer.
Test 1: For end of buffer.
Test 2: To determine what character is read.
• The usage of sentinel reduces the two tests to one by extending each buffer half to
hold a sentinel character at the end.
• The sentinel is a special character that cannot be part of the source
program. (eof character is used as sentinel).
         

Advantages

• Most of the time, It performs only one test to see whether forward pointer points
to an eof.
• Only when it reaches the end of the buffer half or eof, it performs more tests.
• Since N input characters are encountered between eofs, the average number of
tests per input character is very close to 1.

Regular Expression – Compiler Design


By Dinesh Thakur

• Regular expressions are a notation to represent lexeme patterns for a token.


• They are used to represent the language for lexical analyzer.
• They assist in finding the type of token that accounts for a particular lexeme.
Strings and Languages

An alphabet is a finite, non-empty set of input symbols.

               Σ = {0, 1} – the binary alphabet
A string is a finite sequence of symbols drawn from the alphabet.
               w = {0, 1, 00, 01, 10, 11, 001, 010, … }
w above indicates the set of possible strings over the given binary alphabet Σ.
A language (L) is a collection of strings, for example those accepted by a finite automaton.
                L = { 0^n 1 | n >= 0 }
The length of a string is the number of input symbols in the string. It is written with the | | operator.
             Let ω = 0101
             | ω | = 4
The empty string denotes zero occurrences of input symbols. It is represented by Ɛ. Concatenation of two strings p and q is denoted by pq.
        Let       p = 010
        and       q = 001
                  pq = 010001
                  qp = 001010
                  i.e., pq ≠ qp

The empty string is the identity under concatenation.
     Let x be a string. Then
                    Ɛx = xƐ = x
Prefix: a prefix of a string s is obtained by removing zero or more symbols from the end of s.
          (e.g.) s = balloon
Possible prefixes include: ball, balloon
Suffix: a suffix of a string s is obtained by removing zero or more symbols from the beginning of s.
          (e.g.) s = balloon
Possible suffixes include: loon, balloon
Proper prefix: a proper prefix p of a string s satisfies p ≠ s and p ≠ Ɛ.
Proper suffix: a proper suffix x of a string s satisfies x ≠ s and x ≠ Ɛ.
Substring: a substring is the part of a string obtained by removing any prefix and any suffix from s.
Operations on Languages

Important operations on a language are:


• Union
• Concatenation and
• Closure
Union

Union of two languages L and M produces the set of strings that are in language L, in language M, or in both. It can be denoted as
L ∪ M = { p | p is in L or p is in M }
Concatenation

Concatenation of two languages L and M produces the set of strings formed by taking a string in L followed by a string in M. It can be represented as
LM = { pq | p is in L and q is in M }
Closure

Kleene closure (L*)

Kleene closure refers to zero or more concatenations of strings from the language, i.e., it includes the empty string Ɛ (the set of strings with 0 or more occurrences of input symbols).

                                       L* = L^0 ∪ L^1 ∪ L^2 ∪ …

Positive closure (L+)
Positive closure indicates one or more concatenations, i.e., it excludes the empty string Ɛ (the set of strings with 1 or more occurrences of input symbols).
                                       L+ = L^1 ∪ L^2 ∪ L^3 ∪ …
L^3 – the set of strings each of length 3.
(e.g.) Let L = Σ = {a, b}
L* = { Ɛ, a, b, aa, ab, ba, bb, aab, aba, aaba, … }
L+ = { a, b, aa, ab, ba, bb, aab, aaba, … }
L^3 = { aaa, aab, aba, abb, baa, bab, bba, bbb }
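These operations are easy to try out on small finite languages. The Python sketch below is illustrative only; the names L, M and power are assumptions.

    from itertools import product

    L = {"a", "b"}                                   # a small finite language
    M = {"0", "1"}

    union = L | M                                    # L U M
    concat = {p + q for p, q in product(L, M)}       # LM = { pq | p in L and q in M }

    def power(lang, n):
        # L^n: all strings formed by concatenating n strings from lang (L^0 = { empty string }).
        result = {""}
        for _ in range(n):
            result = {p + q for p, q in product(result, lang)}
        return result

    print(sorted(power(L, 3)))
    # ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba', 'bbb']
    # The Kleene closure L* is the union of power(L, n) over all n >= 0.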
Precedence of operators

• The unary operator (*) has the highest precedence.
• The concatenation operator (·) is second highest and is left associative.
           letter_ (letter_ | digit)*
• The union operator ( | or ∪ ) has the lowest precedence and is left associative.
Based on this precedence, the regular expression is transformed into a finite automaton when implementing the lexical analyzer.
Regular Expressions

Regular expressions are a combination of input symbols and language operators such as union, concatenation and closure.
They can be used to describe the identifiers of a language: an identifier is a collection of letters, digits and underscores that must begin with a letter. Hence, the regular expression for an identifier can be given by
letter_ (letter_ | digit)*
Note: the vertical bar ( | ) means 'or' (the union operator).
The following table gives the language denoted by each kind of regular expression:
                                       Languages for regular expressions

S.No.   Regular expression   Language
1       r                    L(r)
2       a                    L(a) = { a }
3       r | s                L(r) ∪ L(s)
4       rs                   L(r) L(s)
5       r*                   (L(r))*

 
Regular set: the language defined by a regular expression.
Two regular expressions are equivalent if they represent the same regular set, e.g.,
                                        (p | q) = (q | p)
 
                                    Algebraic laws of regular expressions
 

Law                                        Description
r | s = s | r                              | is commutative
r | (s | t) = (r | s) | t                  | is associative
r(st) = (rs)t                              concatenation is associative
r(s | t) = rs | rt;  (s | t)r = sr | tr    concatenation distributes over |
Ɛr = rƐ = r                                Ɛ is the identity for concatenation
r* = (r | Ɛ)*                              Ɛ is guaranteed in a closure
r** = r*                                   * is idempotent

Regular Definition

A regular definition gives names (aliases) to regular expressions and uses them for convenience. A sequence of definitions is of the form
d1 –> r1
d2 –> r2
d3 –> r3
…
dn –> rn
in which the names d1, d2, … can be used in later expressions in place of r1, r2, … respectively. For example:
letter_ –> A | B | · · · | Z | a | b | · · · | z | _
digit –> 0 | 1 | 2 | · · · | 9
id –> letter_ (letter_ | digit)*
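Translated into a Python regular expression, the same regular definition looks as follows; the variable names are illustrative.

    import re

    letter_ = r"[A-Za-z_]"                 # letter_ -> A | ... | Z | a | ... | z | _
    digit   = r"[0-9]"                     # digit   -> 0 | 1 | ... | 9
    id_re   = re.compile(rf"{letter_}(?:{letter_}|{digit})*")   # id -> letter_ (letter_ | digit)*

    for word in ["count", "x1", "_tmp", "2fast"]:
        print(word, bool(id_re.fullmatch(word)))
    # count True, x1 True, _tmp True, 2fast False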

Difference between Compiler and Interpreter


1. Compiler: 
A compiler is a translator which takes as input a high-level language and produces as output a low-level language, i.e., machine or assembly language.
 A compiler is more intelligent than an assembler: it checks all kinds of limits, ranges, errors, etc.
 But its translation takes more time and occupies a larger part of memory, because the compiler goes through the entire program and then translates the entire program into machine code.

Figure – Compiler-Process 
2. Interpreter: 
An interpreter is a program that translates a programming language into a comprehensible language.
 It translates only one statement of the program at a time.
 Interpreters, more often than not, are smaller than compilers.
Figure – Interpreter-Process 
Let's see the difference between a compiler and an interpreter:
1. A compiler scans the whole program in one go, whereas an interpreter translates the program one statement at a time.
2. As the compiler scans the code in one go, the errors (if any) are shown at the end, together; since an interpreter scans the code one line at a time, errors are shown line by line.
3. The main advantage of compilers is their execution time; interpreters are preferred less because they are slow in executing the object code.
4. A compiler converts the source code into object code; an interpreter does not convert source code into object code, instead it scans it line by line.
5. A compiler does not require the source code for later execution; an interpreter requires the source code for later execution.
Examples: compiled languages include C, C++, C#, etc.; interpreted languages include Python, Ruby, Perl, SNOBOL, MATLAB, etc.
A more detailed comparison:

Programming steps –
Compiler: write a program in source code; the compiler analyzes the program statements and checks their correctness, and if an error is found it throws an error message; if the program contains no errors, the compiler converts the source code into machine code; it links all the code files into a single runnable program, known as the exe file; finally, the program is run and output is generated.
Interpreter: write a program in source code; no linking of files happens and no machine code is generated separately; the source code statements are executed line by line during execution, and if an error is found at any specific statement the interpreter stops further execution until the error is removed.

Translation type –
Compiler: translates the complete high-level program into machine code at once.
Interpreter: translates one statement of the program at a time into machine code.

Advantage –
Compiler: as the source code is already converted into machine code, code execution time is short.
Interpreter: as the source code is interpreted line by line, error detection and correction are easy.

Disadvantage –
Compiler: if you want to change your program for any reason, whether because of an error or a logical change, you can do it only by going back to the source code.
Interpreter: interpreted programs can run only on computers that have the same interpreter.

Machine code –
Compiler: stores the machine code produced from your source program on disk.
Interpreter: never stores machine code on disk at all.

Running time –
Compiler: takes an enormous amount of time to analyze the source code; however, the compiled code runs faster compared with interpreted code.
Interpreter: takes less time to analyze the source code compared with a compiler; however, the interpreted code runs slower compared with compiled code.

Program generation –
Compiler: generates an output program (in the form of an exe file) that can run separately from the source code program.
Interpreter: does not generate a separate machine-code program as output, so it checks the source code every time during execution.

Execution –
Compiler: program execution takes place separately from the compilation process, and only after the complete program has been compiled.
Interpreter: program execution is a part of the interpretation steps, so it happens line by line as the program is interpreted.

Memory requirement –
Compiler: a compiled program is generated via an intermediate object code which further requires linking, so more memory is required.
Interpreter: an interpreted program does not generate an intermediate code, so there is no requirement for extra memory.

Best suited for –
Compiler: a compiled program is bound to the specific target machine and requires the same compiler on that machine to execute; C and C++ are the most popular programming languages based on the compilation model.
Interpreter: in web environments, where load time is essential and even small pieces of code may not run many times, interpreters are better; JavaScript, Python and Ruby are based on the interpreter model.

Error execution –
Compiler: shows the complete errors and warning messages at compilation time, so it is not possible to run the program without fixing the program errors; debugging is comparatively complex while working with a compiler.
Interpreter: reads the program line by line and shows the error, if present, at that specific line; you must correct the error first to interpret the next line of the program; debugging is comparatively easy while working with an interpreter.

Interpreter vs. Compiler:
• An interpreter translates the program one statement at a time; a compiler scans the entire program and translates it as a whole into machine code.
• Interpreters usually take less time to analyze the source code, but the overall execution time is comparatively slower than with compilers; compilers usually take a large amount of time to analyze the source code, but the overall execution time is comparatively faster.
• An interpreter generates no object code, hence it is memory efficient; a compiler generates object code which further requires linking, hence it requires more memory.
• Programming languages like JavaScript, Python and Ruby use interpreters; programming languages like C, C++ and Java use compilers.
