
The Compilation Process

The compilation process is a sequence of phases. Each phase takes input from the previous phase, has its own representation of the source program, and feeds its output to the next phase of the compiler. Let us walk through the phases of a compiler.
Lexical Analysis
The first phase of the compiler works as a text scanner. It scans the source code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens:
<token-name, attribute-value>
Syntax Analysis
The next phase is called syntax analysis, or parsing. It takes the tokens produced by lexical analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements are checked against the grammar of the source language; that is, the parser checks whether the expression made by the tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language; for example, that values are assigned only between compatible data types, so that adding a string to an integer is reported as an error. The semantic analyzer also keeps track of identifiers, their types, and expressions, and checks whether identifiers are declared before use. It produces an annotated syntax tree as output.
Intermediate Code Generation
After semantic analysis, the compiler generates intermediate code for the source program. This code represents a program for some abstract machine; it sits between the high-level language and the machine language. The intermediate code should be generated in such a way that it is easy to translate into the target machine code.
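The idea can be illustrated with three-address code, a common intermediate representation in which each instruction carries at most one operator on its right-hand side. For instance, a source statement such as a = b + c * d (an illustrative statement, not taken from the text above) might be lowered to:

```
t1 = c * d
t2 = b + t1
a  = t2
```

The temporaries t1 and t2 are compiler-generated names; instructions this simple map readily onto most target instruction sets.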
Code Optimization
The next phase performs code optimization on the intermediate code. Optimization can be thought of as removing unnecessary lines of code and arranging the sequence of statements so as to speed up program execution without wasting resources (CPU, memory).
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the target machine language. The code generator translates the intermediate code into a sequence of (generally) relocatable machine code. This sequence of machine instructions performs the same task as the intermediate code.
Symbol Table
The symbol table is a data structure maintained throughout all phases of a compiler. All identifiers' names, along with their types, are stored here. The symbol table makes it easier for the compiler to quickly search for an identifier's record and retrieve it. It is also used for scope management.
https://www.tutorialspoint.com/compiler_design/compiler_design_phases_of_compiler.htm

Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code from language preprocessors, written in the form of sentences, and breaks it into a series of tokens, removing any whitespace and comments in the source code.
If the lexical analyzer finds an invalid token, it generates an error. The lexical analyzer works closely with the syntax analyzer: it reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer on demand.

Tokens
A lexeme is a sequence of characters in the source program that is matched as a token. There are predefined rules for every lexeme to be identified as a valid token. These rules are defined by means of patterns: a pattern describes what can be a token, and patterns are defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can be considered tokens.
For example, in C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
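To make this concrete, here is a minimal sketch of such a tokenizer in Java. The class name Tokenizer and the token categories are illustrative assumptions, not part of any standard API, and real lexers handle many more cases:

```java
import java.util.ArrayList;
import java.util.List;

public class Tokenizer {
    // Keywords recognized by this toy lexer (an illustrative subset).
    static final List<String> KEYWORDS = List.of("int", "float", "char");

    // Splits a C-style declaration into tokens of the form
    // <token-name, attribute-value>.
    public static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isLetter(c)) {            // keyword or identifier
                int start = i;
                while (i < source.length() && Character.isLetterOrDigit(source.charAt(i))) i++;
                String lexeme = source.substring(start, i);
                String name = KEYWORDS.contains(lexeme) ? "keyword" : "identifier";
                tokens.add("<" + name + ", " + lexeme + ">");
            } else if (Character.isDigit(c)) {      // integer constant
                int start = i;
                while (i < source.length() && Character.isDigit(source.charAt(i))) i++;
                tokens.add("<constant, " + source.substring(start, i) + ">");
            } else if (c == '=') {
                tokens.add("<operator, =>");
                i++;
            } else {                                // any other single character
                tokens.add("<symbol, " + c + ">");
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the five tokens of the declaration above.
        System.out.println(tokenize("int value = 100;"));
    }
}
```

Running it on the declaration prints the five tokens in <token-name, attribute-value> form.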

Specifications of Tokens
Let us understand how language theory defines the following terms:
Alphabets
Any finite set of symbols is an alphabet: {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.
Strings
Any finite sequence of symbols (characters) is called a string. The length of a string is the total number of occurrences of symbols in it; e.g., the length of the string tutorialspoint is 14, denoted |tutorialspoint| = 14. A string of zero length, containing no symbols, is known as the empty string and is denoted by ε (epsilon).
Special symbols
A typical high-level language contains the following symbols:

Arithmetic symbols   Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
Punctuation          Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment           =
Special assignment   +=, /=, *=, -=
Comparison           ==, !=, <, <=, >, >=
Preprocessor         #
Location specifier   &
Logical              &, &&, |, ||, !
Shift operators      >>, >>>, <<

Language
A language is a set of strings over some finite alphabet. Computer languages are sets of strings, and mathematical set operations can be performed on them. Regular languages, in particular, can be described by means of regular expressions.

Regular Expressions
The lexical analyzer needs to scan and identify only the valid strings/tokens/lexemes that belong to the language at hand. It searches for the patterns defined by the language rules.
Regular expressions express regular languages by defining patterns for finite strings of symbols. The grammar defined by regular expressions is known as regular grammar, and the language defined by a regular grammar is known as a regular language.
Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings. Programming language tokens can be described by regular languages. The specification of regular expressions is an example of a recursive definition. Regular languages are easy to understand and have efficient implementations.
There are a number of algebraic laws that are obeyed by regular expressions, which can
be used to manipulate regular expressions into equivalent forms.
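As a small illustration using Java's built-in regex engine, the patterns below are simplified stand-ins for real token specifications (the identifier pattern, for example, ignores Unicode letters):

```java
import java.util.regex.Pattern;

public class TokenPatterns {
    // Simplified regular expressions for two common token classes.
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
    static final Pattern INTEGER    = Pattern.compile("[0-9]+");

    // matches() requires the whole string to fit the pattern,
    // mirroring how a lexeme must match a token pattern exactly.
    public static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static boolean isInteger(String s) {
        return INTEGER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("value"));  // true
        System.out.println(isIdentifier("100"));    // false
        System.out.println(isInteger("100"));       // true
    }
}
```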
Syntax Analysis
Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall
learn the basic concepts used in the construction of a parser.
We have seen that a lexical analyzer can identify tokens with the help of regular expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence, due to the limitations of regular expressions: regular expressions cannot check balanced tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata.
CFG, on the other hand, is a superset of regular grammar: every regular grammar is also context-free, but there exist constructs, such as arbitrarily nested parentheses, that are beyond the scope of regular grammar. CFG is a helpful tool for describing the syntax of programming languages.

Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce
terminologies used in parsing technology.
A context-free grammar has four components:
• A set of non-terminals (V). Non-terminals are syntactic variables that denote sets
of strings. The non-terminals define sets of strings that help define the language
generated by the grammar.
• A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols
from which strings are formed.
• A set of productions (P). The productions of a grammar specify the manner in
which the terminals and non-terminals can be combined to form strings. Each
production consists of a non-terminal called the left side of the production, an
arrow, and a sequence of tokens and/or non-terminals, called the right side of the
production.
• One of the non-terminals is designated as the start symbol (S); from where the
production begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal
(initially the start symbol) by the right side of a production, for that non-terminal.
Example
We take the palindrome language, which cannot be described by means of a regular expression. That is, L = { w | w = w^R } is not a regular language. But it can be described by means of a CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | N | 0 | 1 | ε, Z → 0Q0, N → 1Q1 }
S = Q
(The single-symbol productions Q → 0 | 1 are needed so that odd-length palindromes can be derived.) This grammar describes the palindrome language over {0, 1}, containing strings such as 1001, 11100111, 00100, 1010101, 11111, etc.
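A membership test for this language can be sketched as a short recursive check that mirrors the productions Z → 0Q0 and N → 1Q1: matching outer symbols are stripped in pairs until the empty string or a single symbol remains. This is a sketch of the language's structure, not a general CFG parser:

```java
public class Palindrome {
    // A string over {0,1} is in the language if it is empty, a single
    // symbol, or has equal first and last symbols surrounding a smaller
    // string that is itself in the language.
    public static boolean inLanguage(String w) {
        if (w.length() <= 1) return true;                  // Q -> epsilon | 0 | 1
        if (w.charAt(0) != w.charAt(w.length() - 1)) return false;
        return inLanguage(w.substring(1, w.length() - 1)); // Z -> 0Q0, N -> 1Q1
    }

    public static void main(String[] args) {
        System.out.println(inLanguage("1001"));   // true
        System.out.println(inLanguage("00100"));  // true
        System.out.println(inLanguage("10"));     // false
    }
}
```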

Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token
streams. The parser analyzes the source code (token stream) against the production
rules to detect any errors in the code. The output of this phase is a parse tree.

In this way, the parser accomplishes two tasks: parsing the code to look for errors, and generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error-recovery strategies, which we will learn later in this chapter.

Derivation
A derivation is basically a sequence of applications of production rules that produces the input string. During parsing, we make two decisions for some sentential form of the input:

• Deciding the non-terminal which is to be replaced.


• Deciding the production rule, by which, the non-terminal will be replaced.
To decide which non-terminal to replace, we have two options.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-
most derivation. The sentential form derived by the left-most derivation is called the left-
sentential form.
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as
right-most derivation. The sentential form derived from the right-most derivation is called
the right-sentential form.
Example
Production rules:
E→E+E
E→E*E
E → id
Input string: id + id * id
The left-most derivation is:
E→E*E
E→E+E*E
E → id + E * E
E → id + id * E
E → id + id * id
Notice that the left-most non-terminal is always replaced first.
The right-most derivation is:
E→E+E
E→E+E*E
E → E + E * id
E → E + id * id
E → id + id * id

Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings
are derived from the start symbol. The start symbol of the derivation becomes the root of
the parse tree. Let us see this by an example from the last topic.
We take the left-most derivation of id + id * id.
The left-most derivation is:
E→E*E
E→E+E*E
E → id + E * E
E → id + id * E
E → id + id * id
Step 1: E → E * E
Step 2: E → E + E * E
Step 3: E → id + E * E
Step 4: E → id + id * E
Step 5: E → id + id * id
Each step grows the tree by one level: the replaced non-terminal becomes an interior node whose children are the symbols on the right side of the production used.
In a parse tree:

• All leaf nodes are terminals.
• All interior nodes are non-terminals.
• In-order traversal (reading the leaves from left to right) gives the original input string.
A parse tree depicts the associativity and precedence of operators. The deepest sub-tree is traversed first, so the operator in that sub-tree takes precedence over the operators in its parent nodes.
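These properties can be sketched with a small tree structure in Java (the Node class and yield method are illustrative names): the parse tree for id + id * id is built by hand, and reading its leaves from left to right reproduces the input.

```java
import java.util.ArrayList;
import java.util.List;

public class ParseTree {
    // A parse-tree node: interior nodes carry a non-terminal,
    // leaves carry a terminal (token).
    static class Node {
        final String symbol;
        final List<Node> children = new ArrayList<>();
        Node(String symbol, Node... kids) {
            this.symbol = symbol;
            for (Node k : kids) children.add(k);
        }
    }

    // Collects the leaves from left to right (the tree's frontier).
    static void frontier(Node n, List<String> out) {
        if (n.children.isEmpty()) { out.add(n.symbol); return; }
        for (Node child : n.children) frontier(child, out);
    }

    public static String yield(Node root) {
        List<String> leaves = new ArrayList<>();
        frontier(root, leaves);
        return String.join(" ", leaves);
    }

    public static void main(String[] args) {
        // Tree for the derivation E -> E * E -> E + E * E -> ... -> id + id * id
        Node tree = new Node("E",
            new Node("E", new Node("E", new Node("id")),
                          new Node("+"),
                          new Node("E", new Node("id"))),
            new Node("*"),
            new Node("E", new Node("id")));
        System.out.println(yield(tree));  // id + id * id
    }
}
```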

Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (equivalently, more than one left-most or right-most derivation) for at least one string.
Example
E→E+E
E→E–E
E → id
For the string id + id – id, the above grammar generates two parse trees: one that groups the string as (id + id) – id and one that groups it as id + (id – id).

A language for which every grammar is ambiguous is said to be inherently ambiguous.


Ambiguity in a grammar is not good for compiler construction. No method can detect and remove ambiguity automatically, but it can be removed either by re-writing the whole grammar without ambiguity, or by setting and following associativity and precedence constraints.
LISP

car, cdr, cons: Fundamental Functions

In Lisp, car, cdr, and cons are fundamental functions. The cons function is used to
construct lists, and the car and cdr functions are used to take them apart.
In the walk through of the copy-region-as-kill function, we will see cons as well as
two variants on cdr, namely, setcdr and nthcdr. (See section copy-region-as-kill.)

car & cdr: Functions for extracting part of a list.

cons: Constructing a list.

nthcdr: Calling cdr repeatedly.

setcar: Changing the first element of a list.

setcdr: Changing the rest of a list.

The name of the cons function is not unreasonable: it is an abbreviation of the word
`construct'. The origins of the names for car and cdr, on the other hand, are esoteric:
car is an acronym from the phrase `Contents of the Address part of the Register'; and
cdr (pronounced `could-er') is an acronym from the phrase `Contents of the
Decrement part of the Register'. These phrases refer to specific pieces of hardware
on the very early computer on which the original Lisp was developed. Besides being
obsolete, the phrases have been completely irrelevant for more than 25 years to
anyone thinking about Lisp. Nonetheless, although a few brave scholars have begun
to use more reasonable names for these functions, the old terms are still in use. In
particular, since the terms are used in the Emacs Lisp source code, we will use them
in this introduction.
car and cdr
The car of a list is, quite simply, the first item in the list. Thus the car of the list (rose
violet daisy buttercup) is rose.

If you are reading this in Info in GNU Emacs, you can see this by evaluating the
following:
(car '(rose violet daisy buttercup))

After evaluating the expression, rose will appear in the echo area.
Clearly, a more reasonable name for the car function would be first and this is
often suggested.
car does not remove the first item from the list; it only reports what it is. After car
has been applied to a list, the list is still the same as it was. In the jargon, car is `non-
destructive'. This feature turns out to be important.
The cdr of a list is the rest of the list, that is, the cdr function returns the part of the
list that follows the first item. Thus, while the car of the list '(rose violet daisy
buttercup) is rose, the rest of the list, the value returned by cdr, is (violet daisy
buttercup).

You can see this by evaluating the following in the usual way:
(cdr '(rose violet daisy buttercup))

When you evaluate this, (violet daisy buttercup) will appear in the echo area.
Like car, cdr does not remove any elements from the list--it just returns a report of what the second and subsequent elements are.
Incidentally, in the example, the list of flowers is quoted. If it were not, the Lisp
interpreter would try to evaluate the list by calling rose as a function. In this
example, we do not want to do that.
Clearly, a more reasonable name for cdr would be rest.
(There is a lesson here: when you name new functions, think very carefully about what you are doing, since you may be stuck with the names for far longer than you
expect. The reason this document perpetuates these names is that the Emacs Lisp
source code uses them, and if I did not use them, you would have a hard time
reading the code; but do please try to avoid using these terms yourself. The people
who come after you will be grateful to you.)
When car and cdr are applied to a list made up of symbols, such as the list (pine fir
oak maple), the element of the list returned by the function car is the symbol pine
without any parentheses around it. pine is the first element in the list. However, the
cdr of the list is a list itself, (fir oak maple), as you can see by evaluating the
following expressions in the usual way:
(car '(pine fir oak maple))

(cdr '(pine fir oak maple))

On the other hand, in a list of lists, the first element is itself a list. car returns this
first element as a list. For example, the following list contains three sub-lists, a list of
carnivores, a list of herbivores and a list of sea mammals:
(car '((lion tiger cheetah)

(gazelle antelope zebra)

(whale dolphin seal)))

In this case, the first element or car of the list is the list of carnivores, (lion tiger
cheetah), and the rest of the list is ((gazelle antelope zebra) (whale dolphin
seal)).

(cdr '((lion tiger cheetah)

(gazelle antelope zebra)

(whale dolphin seal)))

It is worth saying again that car and cdr are non-destructive--that is, they do not
modify or change lists to which they are applied. This is very important for how they
are used.
Also, in the first chapter, in the discussion about atoms, I said that in Lisp, "certain
kinds of atom, such as an array, can be separated into parts; but the mechanism for
doing this is different from the mechanism for splitting a list. As far as Lisp is
concerned, the atoms of a list are unsplittable." (See section Lisp Atoms.) The car
and cdr functions are used for splitting lists and are considered fundamental to Lisp.
Since they cannot split or gain access to the parts of an array, an array is considered
an atom. Conversely, the other fundamental function, cons, can put together or
construct a list, but not an array. (Arrays are handled by array-specific functions. See
section `Arrays' in The GNU Emacs Lisp Reference Manual.)
cons

The cons function constructs lists; it is the inverse of car and cdr. For example, cons
can be used to make a four element list from the three element list, (fir oak
maple):

(cons 'pine '(fir oak maple))

After evaluating this list, you will see


(pine fir oak maple)

appear in the echo area. cons puts a new element at the beginning of a list; it
attaches or pushes elements onto the list.
cons must have a list to attach to. You cannot start from absolutely nothing. If you
are building a list, you need to provide at least an empty list at the beginning. Here is
a series of cons's that build up a list of flowers. If you are reading this in Info in GNU
Emacs, you can evaluate each of the expressions in the usual way; the value is
printed in this text after `=>', which you may read as `evaluates to'.
(cons 'buttercup ())

=> (buttercup)

(cons 'daisy '(buttercup))

=> (daisy buttercup)

(cons 'violet '(daisy buttercup))

=> (violet daisy buttercup)

(cons 'rose '(violet daisy buttercup))

=> (rose violet daisy buttercup)

In the first example, the empty list is shown as () and a list made up of buttercup
followed by the empty list is constructed. As you can see, the empty list is not shown
in the list that was constructed. All that you see is (buttercup). The empty list is not
counted as an element of a list because there is nothing in an empty list. Generally
speaking, an empty list is invisible.
The second example, (cons 'daisy '(buttercup)) constructs a new, two element list
by putting daisy in front of buttercup; and the third example constructs a three
element list by putting violet in front of daisy and buttercup.

length: How to find the length of a list.

Find the Length of a List: length

You can find out how many elements there are in a list by using the Lisp function
length, as in the following examples:

(length '(buttercup))

=> 1

(length '(daisy buttercup))

=> 2

(length (cons 'violet '(daisy buttercup)))

=> 3

In the third example, the cons function is used to construct a three element list which
is then passed to the length function as its argument.
We can also use length to count the number of elements in an empty list:
(length ())

=> 0

As you would expect, the number of elements in an empty list is zero.


An interesting experiment is to find out what happens if you try to find the length of
no list at all; that is, if you try to call length without giving it an argument, not even
an empty list:
(length )

What you see, if you evaluate this, is the error message


Wrong number of arguments: #<subr length>, 0
This means that the function received the wrong number of arguments (zero) when it expected some other number. In this case, one argument is expected: the list whose length the function is measuring. (Note that one list is one argument, even if the list has many elements inside it.)

Scope and lifetime of variables in Java

Instance Variables
A variable which is declared inside a class and outside all the methods and blocks is an
instance variable. The general scope of an instance variable is throughout the class
except in static methods. The lifetime of an instance variable lasts as long as the object stays in memory.

Class Variables
A variable which is declared inside a class, outside all the blocks and is marked static is
known as a class variable. The general scope of a class variable is throughout the class
and the lifetime of a class variable is until the end of the program or as long as the class
is loaded in memory.

Local Variables
All other variables which are not instance and class variables are treated as local
variables including the parameters in a method. Scope of a local variable is within the
block in which it is declared and the lifetime of a local variable is until the control leaves
the block in which it is declared.
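A minimal sketch showing all three kinds of variables and their lifetimes (the class and member names are illustrative):

```java
public class ScopeDemo {
    static int classVar = 0;   // class variable: one shared copy, lives while the class is loaded
    int instanceVar = 0;       // instance variable: one copy per object

    void increment() {
        int localVar = 1;      // local variable: exists only during this method call
        instanceVar += localVar;
        classVar += localVar;
    }

    public static void main(String[] args) {
        ScopeDemo a = new ScopeDemo();
        ScopeDemo b = new ScopeDemo();
        a.increment();
        a.increment();
        b.increment();
        // Each object has its own instanceVar; classVar is shared by all.
        System.out.println(a.instanceVar);  // 2
        System.out.println(b.instanceVar);  // 1
    }
}
```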


public means that the method is visible and can be called from other objects of other types. Other alternatives are private, protected, and package-private (the default, when no modifier is written).
static means that the method is associated with the class, not a specific instance (object) of that class. This means that you can call a static method without creating an object of the class.
void means that the method has no return value. If the method returned an int, you would write int instead of void.
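A minimal sketch tying the three modifiers together (the class HelloStatic and the method square are illustrative):

```java
public class HelloStatic {
    // public: callable from any other class.
    // static: belongs to the class, so no object is needed.
    // int (rather than void): the method returns a value.
    public static int square(int n) {
        return n * n;
    }

    public static void main(String[] args) {
        // Called directly on the class -- no "new HelloStatic()" required.
        System.out.println(HelloStatic.square(7));  // 49
    }
}
```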

https://stackoverflow.com/questions/2390063/what-does-public-static-void-mean-in-java

Controlling Access to Members of a Class


Access level modifiers determine whether other classes can use a particular field or invoke a particular method.
There are two levels of access control:

At the top level—public, or package-private (no explicit modifier).

At the member level—public, private, protected, or package-private (no explicit modifier).

A class may be declared with the modifier public, in which case that class is visible to all classes
everywhere. If a class has no modifier (the default, also known as package-private), it is visible only within its
own package (packages are named groups of related classes — you will learn about them in a later lesson.)

At the member level, you can also use the public modifier or no modifier (package-private) just as with top-
level classes, and with the same meaning. For members, there are two additional access modifiers: private
and protected. The private modifier specifies that the member can only be accessed in its own class.
The protected modifier specifies that the member can only be accessed within its own package (as with
package-private) and, in addition, by a subclass of its class in another package.

The following table shows the access to members permitted by each modifier.

Access Levels

Modifier     Class  Package  Subclass  World
public       Y      Y        Y         Y
protected    Y      Y        Y         N
no modifier  Y      Y        N         N
private      Y      N        N         N

The first data column indicates whether the class itself has access to the member defined by the access level.
As you can see, a class always has access to its own members. The second column indicates whether classes
in the same package as the class (regardless of their parentage) have access to the member. The third column
indicates whether subclasses of the class declared outside this package have access to the member. The
fourth column indicates whether all classes have access to the member.
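A single-file sketch of the member-level modifiers (the class names Alpha and Beta follow the example below; the Subclass and World columns involve package boundaries, which a one-file sketch cannot show):

```java
class Alpha {
    public    int pub  = 1;  // visible everywhere
    protected int prot = 2;  // visible in the package and in subclasses
    int       pkg  = 3;      // no modifier: visible in the package only
    private   int priv = 4;  // visible only inside Alpha itself

    int sum() {
        // Alpha always has access to its own members, including the private one.
        return pub + prot + pkg + priv;
    }
}

public class Beta {
    public static int visibleFromBeta(Alpha a) {
        // Beta is in the same package as Alpha, so public, protected,
        // and package-private members are all accessible here...
        return a.pub + a.prot + a.pkg;
        // ...but "a.priv" would be a compile-time error.
    }

    public static void main(String[] args) {
        Alpha a = new Alpha();
        System.out.println(a.sum());             // 10
        System.out.println(visibleFromBeta(a));  // 6
    }
}
```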

Access levels affect you in two ways. First, when you use classes that come from another source, such as the
classes in the Java platform, access levels determine which members of those classes your own classes can
use. Second, when you write a class, you need to decide what access level every member variable and every
method in your class should have.

Let's look at a collection of classes and see how access levels affect visibility. The example involves four classes in two packages: Alpha and Beta in one package, and Alphasub (a subclass of Alpha) and Gamma in a second package.

[Figure: Classes and Packages of the Example Used to Illustrate Access Levels]

The following table shows where the members of the Alpha class are visible for each of the access modifiers
that can be applied to them.

Visibility

Modifier     Alpha  Beta  Alphasub  Gamma
public       Y      Y     Y         Y
protected    Y      Y     Y         N
no modifier  Y      Y     N         N
private      Y      N     N         N

Tips on Choosing an Access Level:

If other programmers use your class, you want to ensure that errors from misuse cannot happen. Access levels
can help you do this.

Use the most restrictive access level that makes sense for a particular member. Use private unless you
have a good reason not to.

Avoid public fields except for constants. (Many of the examples in the tutorial use public fields. This may help
to illustrate some points concisely, but is not recommended for production code.) Public fields tend to link you
to a particular implementation and limit your flexibility in changing your code.
