Lexical Analysis and Tokenization of Source Code Written in "C" Language
CERTIFICATE

This is to certify that the work embodied in this synopsis entitled "Lexical Analysis and Tokenization of Source Code Written in 'C' Language", being submitted by "Jayati Naik" (Roll No.: 0111IT071045), "Kuldeep Kumar Mishra" (Roll No.: 0111IT071048) & "Meghendra Singh" (Roll No.: 0111IT071053) in partial fulfillment of the requirements for the degree of "Bachelor of Engineering" in the "Information Technology" discipline of "Rajiv Gandhi Praudyogiki Vishwavidyalaya, Bhopal (M.P.)" during the academic year 2009-10, is a record of a bona fide piece of work carried out by them under my supervision and guidance in the "Department of Information Technology", Technocrats Institute of Technology, Bhopal (M.P.).
FORWARDED BY:
Technocrats Institute of Technology, Bhopal (M.P.)
Department of Information Technology
DECLARATION

Jayati Naik
Date:                Enrollment No.: 0111IT071045

Kuldeep Kumar Mishra
Date:                Enrollment No.: 0111IT071048

Meghendra Singh
Date:                Enrollment No.: 0111IT071053
CONTENTS

Certificate
Declaration
Abstract
ABSTRACT

The lexical analyzer is responsible for scanning the source input file and translating lexemes (strings) into small objects that the compiler for a high-level language can easily process. These small values are often called "tokens". The lexical analyzer is also responsible for converting sequences of digits into their numeric form, processing other literal constants, removing comments and whitespace from the source file, and taking care of many other mechanical details. In short, the lexical analyzer converts a stream of input characters into a stream of tokens. To tokenize identifiers and keywords, we incorporate a symbol table that initially contains the predefined keywords. The tokens are read from an input file, and the output file consists of all the tokens present in the input file along with their respective token values.
1.0 INTRODUCTION
An intermediate representation (IR) of the source program serves as input to code generators for different machines and promotes abstraction and portability away from specific machine types and languages (perhaps the most famous example is Java's bytecode and the JVM). Semantic Analysis finds more meaningful errors, such as undeclared variables, type incompatibilities, and scope-resolution problems.
Code Optimization makes the IR more efficient and is usually done in a sequence of steps. Some optimizations include code hoisting (moving constant computations to better places within the code), redundant code discovery, and removal of useless (dead) code.
Code Generation is the final step in the compilation process. The input to the Code Generator is the IR, and the output is machine language code.
Tokens are frequently defined by regular expressions, which are understood by a
lexical analyzer generator such as lex. The lexical analyzer (either generated
automatically by a tool like lex, or hand-crafted) reads in a stream of characters,
identifies the lexemes in the stream, and categorizes them into tokens. This is called
"tokenizing." If the lexer finds an invalid token, it will report an error.
Following tokenizing is parsing. From there, the interpreted data may be loaded
into data structures for general use, interpretation, or compiling.
The first stage, the scanner, is usually based on a finite state machine. It has
encoded within it information on the possible sequences of characters that can be
contained within any of the tokens it handles (individual instances of these character
sequences are known as lexemes). For instance, an integer token may contain any
sequence of numerical digit characters. In many cases, the first non-white space
character can be used to deduce the kind of token that follows and subsequent input
characters are then processed one at a time until reaching a character that is not in the
set of characters acceptable for that token (this is known as the maximal munch rule,
or longest match rule). In some languages the lexeme creation rules are more
complicated and may involve backtracking over previously read characters.
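As a small illustration of the maximal munch rule, the following C sketch consumes the longest possible run of digit characters for an integer token, converting the lexeme into its numeric value along the way, which also illustrates the digit-to-number conversion mentioned in the abstract (the function name scan_integer and the cursor convention are assumptions made for this example):

#include <ctype.h>
#include <stdio.h>

/* Scan the longest run of digits starting at *cursor (maximal munch)
 * and return its numeric value; the cursor is advanced past the lexeme. */
static long scan_integer(const char **cursor)
{
    long value = 0;
    while (isdigit((unsigned char)**cursor)) {
        value = value * 10 + (**cursor - '0');  /* accumulate decimal value */
        (*cursor)++;
    }
    return value;
}

int main(void)
{
    const char *input = "1234+56";
    const char *p = input;
    long n = scan_integer(&p);   /* consumes "1234", stops at '+' */
    printf("token NUMBER %ld, next char '%c'\n", n, *p);
    return 0;
}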
Tokenization is the process of demarcating and possibly classifying sections of a
string of input characters. The resulting tokens are then passed on to some other form
of processing. The process can be considered a sub-task of parsing input.
Take, for example, the following string.
The quick brown fox jumps over the lazy dog
Unlike humans, a computer cannot intuitively 'see' that there are 9 words. To a
computer this is only a series of 43 characters.
A process of tokenization could be used to split the sentence into word tokens.
Although the following example is given as XML, there are many ways to represent tokenized input:
<sentence>
<word>The</word>
<word>quick</word>
<word>brown</word>
<word>fox</word>
<word>jumps</word>
<word>over</word>
<word>the</word>
<word>lazy</word>
<word>dog</word>
</sentence>
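Such a whitespace-based word tokenizer takes only a few lines of C. The sketch below uses the standard strtok function to split the sentence into word tokens and prints them in the XML-style form shown above:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* strtok modifies its argument, so use a writable copy. */
    char sentence[] = "The quick brown fox jumps over the lazy dog";

    printf("<sentence>\n");
    for (char *word = strtok(sentence, " ");
         word != NULL;
         word = strtok(NULL, " ")) {
        printf("  <word>%s</word>\n", word);  /* one token per word */
    }
    printf("</sentence>\n");
    return 0;
}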
NAME "net_worth_future"
EQUALS
OPEN_PARENTHESIS
NAME "assets"
MINUS
NAME "liabilities"
CLOSE_PARENTHESIS
SEMICOLON
Though it is possible and sometimes necessary to write a lexer by hand, lexers are
often generated by automated tools. These tools generally accept regular expressions
that describe the tokens allowed in the input stream. Each regular expression is
associated with a production in the lexical grammar of the programming language that
evaluates the lexemes matching the regular expression. These tools may generate
source code that can be compiled and executed or construct a state table for a finite
state machine (which is plugged into template code for compilation and execution).
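A table-driven scanner of this kind can be sketched in C as follows. The two-state machine below accepts identifiers (a letter or underscore followed by letters, digits, or underscores); the state_table array stands in for the kind of table a generator would emit, and all names are illustrative:

#include <ctype.h>
#include <stdio.h>

enum state  { START, IN_NAME, REJECT, NUM_STATES };
enum cclass { LETTER, DIGIT, OTHER, NUM_CLASSES };

/* Transition table: rows are states, columns are character classes. */
static const enum state state_table[NUM_STATES][NUM_CLASSES] = {
    /* START   */ { IN_NAME, REJECT,  REJECT },
    /* IN_NAME */ { IN_NAME, IN_NAME, REJECT },
    /* REJECT  */ { REJECT,  REJECT,  REJECT },
};

static enum cclass classify(int c)
{
    if (isalpha(c) || c == '_') return LETTER;
    if (isdigit(c))             return DIGIT;
    return OTHER;
}

/* Returns 1 if s is a valid identifier, 0 otherwise. */
static int accepts(const char *s)
{
    enum state st = START;
    for (; *s; s++)
        st = state_table[st][classify((unsigned char)*s)];
    return st == IN_NAME;
}

int main(void)
{
    printf("%d %d\n", accepts("net_worth"), accepts("9lives"));  /* 1 0 */
    return 0;
}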
Regular expressions compactly represent patterns that the characters in lexemes
might follow. For example, for an English-based language, a NAME token might be
any English alphabetical character or an underscore, followed by any number of
instances of any ASCII alphanumeric character or an underscore. This could be
represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means "any
character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".
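Checked by hand rather than with a regex engine, the same pattern reduces to a simple loop. A minimal C sketch (the function name is_name is hypothetical):

#include <ctype.h>
#include <stdio.h>

/* Does s match [a-zA-Z_][a-zA-Z_0-9]* in full? */
static int is_name(const char *s)
{
    if (!isalpha((unsigned char)*s) && *s != '_')
        return 0;                      /* first char: letter or underscore */
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s) && *s != '_')
            return 0;                  /* rest: letters, digits, underscore */
    return 1;
}

int main(void)
{
    printf("%d %d %d\n", is_name("_tmp1"), is_name("x"), is_name("1x"));  /* 1 1 0 */
    return 0;
}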
Regular expressions and the finite state machines they generate are not powerful
enough to handle recursive patterns, such as "n opening parentheses, followed by a
statement, followed by n closing parentheses." They are not capable of keeping count,
and verifying that n is the same on both sides — unless you have a finite set of
permissible values for n. It takes a full-fledged parser to recognize such patterns in
their full generality. A parser can push parentheses on a stack and then try to pop
them off and see if the stack is empty at the end.
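With only a single kind of bracket, a depth counter behaves exactly like that stack: an opening parenthesis is a push, a closing one is a pop, and the stack must be empty (zero) at the end. A minimal C sketch of this check:

#include <stdio.h>

/* Returns 1 if the parentheses in s are balanced, 0 otherwise. */
static int balanced(const char *s)
{
    int depth = 0;                     /* stack height for one symbol kind */
    for (; *s; s++) {
        if (*s == '(') depth++;        /* push */
        if (*s == ')') {
            if (depth == 0) return 0;  /* pop from an empty stack */
            depth--;                   /* pop */
        }
    }
    return depth == 0;                 /* stack empty at the end */
}

int main(void)
{
    printf("%d %d\n", balanced("((x))"), balanced("(()"));  /* 1 0 */
    return 0;
}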
The Lex programming tool and its compiler are designed to generate code for fast lexical analysers based on a formal description of the lexical syntax. Lex alone is not generally considered sufficient for applications with a complicated set of lexical rules and severe performance requirements; for instance, the GNU Compiler Collection (GCC) uses hand-written lexers.
i. Characteristics
Like most imperative languages in the ALGOL tradition, C has facilities for
structured programming and allows lexical variable scope and recursion, while a static
type system prevents many unintended operations. In C, all executable code is
contained within functions. Function parameters are always passed by value. Pass-by-
reference is achieved in C by explicitly passing pointer values. Heterogeneous
aggregate data types (struct) allow related data elements to be combined and
manipulated as a unit. C program source text is free-format, using the semicolon as a
statement terminator (not a delimiter).
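Because parameters are copies, a called function can modify the caller's variable only through an explicitly passed pointer, as the following short example shows (the function names are illustrative):

#include <stdio.h>

/* Receives a copy; the caller's variable is untouched. */
static void set_by_value(int x)    { x = 42; }

/* Receives an address; writing through it changes the caller's variable. */
static void set_by_pointer(int *x) { *x = 42; }

int main(void)
{
    int a = 0, b = 0;
    set_by_value(a);
    set_by_pointer(&b);
    printf("a=%d b=%d\n", a, b);   /* prints: a=0 b=42 */
    return 0;
}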
C also exhibits the following more specific characteristics:
• non-nestable function definitions
• variables may be hidden in nested blocks
• partially weak typing; for instance, characters can be used as integers
• low-level access to computer memory by converting machine addresses to
typed pointers
• function and data pointers supporting ad hoc run-time polymorphism
• array indexing as a secondary notion, defined in terms of pointer arithmetic
• a preprocessor for macro definition, source code file inclusion, and conditional
compilation
• complex functionality such as I/O, string manipulation, and mathematical
functions consistently delegated to library routines
• a relatively small set of reserved keywords (originally 32, now 37 in C99)
• a large number of compound operators, such as +=, ++
ii. Features
The relatively low-level nature of the language affords the programmer close
control over what the computer does, while allowing special tailoring and aggressive
optimization for a particular platform. This allows the code to run efficiently on very
limited hardware, such as embedded systems.
download and distribute, while the Professional edition is a commercial product. The Professional edition is no longer available for purchase from Borland.
Turbo C++ 3.0 was released in 1991 (shipping on November 20), amidst expectations of the coming release of Turbo C++ for Microsoft Windows. Initially released as an MS-DOS compiler, 3.0 supported C++ templates, Borland's inline assembler, and generation of MS-DOS executables for both 8086 real mode and 286 protected mode (as well as the Intel 80186). Version 3.0 implemented AT&T C++ 2.1, the most recent specification at the time. The separate Turbo Assembler product was no longer included, but the inline assembler could stand in as a reduced-functionality version.
The aim of the project is to develop a lexical analyzer that can generate tokens for further processing by the compiler. The job of the lexical analyzer is to read the source program one character at a time and produce as output a stream of tokens. The tokens produced by the lexical analyzer serve as input to the next phase, the parser. Thus, the lexical analyzer's job is to translate the source program into a form more conducive to recognition by the parser.
The goal of this program is to create tokens from the given input stream.
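A rough sketch of this character-at-a-time design is given below. The keyword table, its linear lookup, and the tokenize routine are simplifying assumptions made for illustration; they are not the project's actual source code:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Symbol table seeded with predefined keywords, as described above. */
static const char *keywords[] = { "int", "char", "if", "else", "while",
                                  "for", "return", NULL };

static int is_keyword(const char *lexeme)
{
    for (int i = 0; keywords[i]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 1;
    return 0;
}

/* Read the input one character at a time and print a token stream. */
static void tokenize(FILE *in)
{
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (isspace(c)) continue;                 /* skip whitespace */
        if (isalpha(c) || c == '_') {             /* identifier or keyword */
            char lexeme[64]; int n = 0;
            do { lexeme[n++] = (char)c; c = fgetc(in); }
            while ((isalnum(c) || c == '_') && n < 63);
            lexeme[n] = '\0';
            if (c != EOF) ungetc(c, in);          /* maximal munch: put back */
            printf("%s \"%s\"\n",
                   is_keyword(lexeme) ? "KEYWORD" : "IDENTIFIER", lexeme);
        } else if (isdigit(c)) {                  /* numeric constant */
            long v = 0;
            do { v = v * 10 + (c - '0'); c = fgetc(in); } while (isdigit(c));
            if (c != EOF) ungetc(c, in);
            printf("NUMBER %ld\n", v);
        } else {
            printf("SYMBOL '%c'\n", c);           /* operators, separators */
        }
    }
}

int main(void)
{
    tokenize(stdin);   /* e.g. echo "int x = 42;" | ./lexer */
    return 0;
}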
[Fig. 4: Second Level Data Flow Diagram for Lexical Analyzer, including the SYMBOL TABLE]
Key: START/END, DECISION, PROCESS, DISPLAY/OUTPUT, and MANUAL INPUT flowchart symbols.

[Fig. 5: Flow Chart for Lexical Analyzer]
3.0 EXPECTED RESULTS AND CONCLUSION
• Simple implementation.
4.0 APPLICATIONS AND FUTURE WORK
This lexical analyzer can be used as a standalone string-analysis tool, which can analyze a given set of strings and check their lexical correctness. It can also be used to analyze the string sequences delimited by whitespace in a C / C++ source code (*.c / *.cpp) file and output all the results to a text file, provided proper file-handling functionality is added to the source code of the lexical analyzer; this functionality will not be a part of the present project but will be available in an upgraded version, if time permits its development. Furthermore, the applications of a lexical analyzer include:
1. Text Editing
2. Text Processing
3. Pattern Matching
4. File Searching
An enhanced version of this lexical analyzer can be combined with a parser providing syntax-directed translation, to make a complete compiler in the future. The lexical assembly of the keywords and special characters can be appropriately modified in the source code to create a new high-level language like C++.