LEXICAL ANALYZER

Table of Contents

Abstract
    Lexical analyzer

Introduction
    Phases of compiler
    Interpreter

Methods and materials
    Passes of compiler
    Proposed methodology
    Stages of lexical
    Lex

Result
Conclusion
Implications for future research

References
Screen outputs

ABSTRACT
Compilers are software systems that translate a program into a form in which it can be executed by a computer. A compiler is a program that reads a program in one language (the source language) and translates it into an equivalent program in another language (the target language), typically from a high-level programming language to a lower-level language such as assembly language or machine code. The compiler also reports any errors in the source program that it detects during the translation process. The compilation process operates as a sequence of phases, each of which transforms one representation of the source program into another. There are basically seven phases of the compiler:


• Lexical analysis
• Syntax analysis
• Semantic analysis
• Intermediate Representation
• Storage Allocation
• Code generation
• Code optimization

[Figure: PHASES OF THE COMPILER. Character stream -> Lexical analyzer -> token stream -> Syntax analyzer -> syntax tree -> Semantic analyzer -> syntax tree -> Intermediate code generator -> intermediate representation -> Machine-independent code optimizer -> intermediate representation -> Code generator -> target-machine code -> Machine-dependent code optimizer -> target-machine code. The symbol table is shown alongside the phases.]

Lexical Analyzer
Our main focus is on the first phase of the compiler, lexical analysis or scanning. Lexical analysis is the process in which the stream of characters making up the source program is read from left to right and grouped into tokens. Programs that perform lexical analysis are called lexical analyzers or lexers. Tokens are sequences of characters with a collective meaning. A programming language usually has only a small number of token classes: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words.

The first task of the lexical analyzer is to parse the input character string into tokens. The second is to make the appropriate entries in the tables. A token is a substring of the input string that represents a basic element of the language. It may contain only simple characters and may not include another token. To the rest of the compiler, the token is the smallest unit of currency. Only lexical analysis and the output processor of the assembly phase concern themselves with such elements as characters. Uniform symbols are the terminal symbols for syntax analysis.

There are many ways of implementing this phase; one is described here. The figure depicts the results of the lexical phase for our example PL/I program. The input string is separated into tokens by break characters. Break characters are denoted by the contents of a special field in the terminal table. Source characters are read in, checked for legality, and tested to see if they are break characters. Consecutive non-break characters are accumulated into tokens. Strings between break characters are tokens, as are non-blank break characters. Blanks may serve as break characters but are otherwise ignored.

Lexical analysis recognizes three types of tokens: terminal symbols, possible identifiers, and literals. Each token is first compared with the entries in the terminal table. Once a match is found, the token is classified as a terminal symbol and lexical analysis creates a uniform symbol of type ‘TRM’ and inserts it in the uniform symbol table. If a token is not a terminal symbol, lexical analysis proceeds to classify it as a possible identifier or literal. Tokens that satisfy the lexical rules for forming identifiers are classified as “possible identifiers”. In PL/I these are strings that begin with an alphabetic character and contain up to 30 more alphanumeric characters or underscores. If a token does not fit into one of these categories, it is an error and is flagged as such.

After a token is classified as a “possible identifier”, the identifier table is examined. If this particular token is not in the table, a new entry is made. Since the only attribute that we know about an identifier is its name, that is all that goes into the table. The remaining information is discovered and inserted by later phases. Regardless of whether or not an entry had to be created, lexical analysis creates a uniform symbol of type ‘IDN’ and inserts it into the uniform symbol table. Readers should refer to Fig. 8.15 for an example.

Numbers, quoted character strings, and other self-defining data are classified as “literals”. After a token has been classified as such, the literal table is examined. If the literal is not yet there, a new entry is made. In contrast to the case of identifiers, lexical analysis can determine the attributes and the internal representation of a literal by looking at the characters that represent it. Thus each new entry made in the literal table consists of the literal and all its attributes. Regardless of whether or not a new entry is made, a new uniform symbol of type ‘LIT’ is created and put into the uniform symbol table.

The lexical analyzer may make one complete pass over the source code and produce the entire uniform symbol table. Another scheme is to have the lexical analyzer called only when the syntax phase needs the next token.
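The table-driven scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the terminal table contents, the break-character set, the integer-only literal rule, and the names `split_tokens` and `lexical_pass` are all hypothetical and are not taken from the PL/I example.

```python
import re

# Hypothetical terminal table: a small illustrative subset of keywords and symbols.
TERMINALS = [";", "=", "+", "(", ")", "DECLARE", "END"]
# Hypothetical break characters; the blank separates tokens but is otherwise ignored.
BREAK_CHARS = set(";=+() ")

# A possible identifier: a letter followed by up to 30 letters, digits or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,30}")
# Simplifying assumption: the only literals are unsigned integers.
LITERAL = re.compile(r"[0-9]+")

def split_tokens(source):
    """Accumulate non-break characters; non-blank break characters are tokens too."""
    tokens, current = [], ""
    for ch in source:
        if ch in BREAK_CHARS:
            if current:
                tokens.append(current)
                current = ""
            if ch != " ":
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

def lexical_pass(source):
    """Return (uniform symbol table, identifier table, literal table)."""
    identifiers, literals, uniform = [], [], []
    for tok in split_tokens(source):
        if tok in TERMINALS:                      # checked first, as in the text
            uniform.append(("TRM", TERMINALS.index(tok)))
        elif IDENTIFIER.fullmatch(tok):
            if tok not in identifiers:            # new entry only if not yet present
                identifiers.append(tok)
            uniform.append(("IDN", identifiers.index(tok)))
        elif LITERAL.fullmatch(tok):
            if tok not in literals:
                literals.append(tok)
            uniform.append(("LIT", literals.index(tok)))
        else:
            raise ValueError(f"illegal token: {tok!r}")
    return uniform, identifiers, literals
```

Each uniform symbol here is a (table, index) pair, so later phases never need to look at individual characters again.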
The lexical analyzer might recognize particular instances of tokens such as:
• 3 or 255 for an integer constant token
• “SAHIL” or “DELHI” for a string constant token
• numTickets or queue for a variable token
Such specific instances are called lexemes. A lexeme is the actual character sequence forming a token; the token is the general class that a lexeme belongs to. Some tokens have exactly one lexeme (e.g., the > character); for others, there are many lexemes (e.g., integer constants). For each lexeme, the lexical analyzer produces as output a token of the form

(token-name, attribute-value)

that it passes on to the subsequent phase, syntax analysis. The token name is an abstract symbol that is used during syntax analysis, and the attribute value points to an entry in the symbol table for this token. The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name. The lexical analyzer allows numbers, identifiers, and “white space” (blanks, tabs, and newlines) to appear within expressions. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. Suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery is “panic mode” recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left.
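As a rough illustration of (token-name, attribute-value) pairs and of panic-mode recovery, consider the following Python sketch. The token patterns, the function name `tokenize`, and the decision to record each deleted character in an error list are assumptions made for this example only.

```python
import re

# Hypothetical token patterns, tried in order at the current input position.
PATTERNS = [
    ("number", re.compile(r"[0-9]+")),
    ("id", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("op", re.compile(r"[-+*/=<>]")),
]

def tokenize(text):
    """Return (tokens, symbol_table, errors) for the given input string."""
    symbol_table, tokens, errors = [], [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():          # white space separates tokens and is dropped
            pos += 1
            continue
        for name, pattern in PATTERNS:
            m = pattern.match(text, pos)
            if m:
                lexeme = m.group()
                if name == "id":
                    # Identifiers are entered into the symbol table; the attribute
                    # value is the index of the entry.
                    if lexeme not in symbol_table:
                        symbol_table.append(lexeme)
                    tokens.append((name, symbol_table.index(lexeme)))
                else:
                    tokens.append((name, lexeme))
                pos = m.end()
                break
        else:
            # Panic mode: no pattern matches here, so delete one character and retry.
            errors.append((pos, text[pos]))
            pos += 1
    return tokens, symbol_table, errors
```

Deleting one character at a time is the simplest possible policy; a production lexer would usually also report the position of each skipped character to the user.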

INTRODUCTION
What is a compiler?
In order to reduce the complexity of designing and building computers, nearly all of these are made to execute relatively simple commands (but do so very quickly). A program for a computer must be built by combining these very simple commands into a program in what is called machine language. Since this is a tedious and error-prone process, most programming is, instead, done using a high-level programming language. This language can be very different from the machine language that the computer can execute, so some means of bridging the gap is required. This is where the compiler comes in.

A compiler translates (or compiles) a program written in a high-level programming language that is suitable for human programmers into the low-level machine language that is required by computers. During this process, the compiler will also attempt to spot and report obvious programmer mistakes. Using a high-level language for programming has a large impact on how fast programs can be developed. The main reasons for this are:
• Compared to machine language, the notation used by programming languages is closer to the way humans think about problems.
• The compiler can spot some obvious programming mistakes.
• Programs written in a high-level language tend to be shorter than equivalent programs written in machine language.

Another advantage of using a high-level language is that the same program can be compiled to many different machine languages and, hence, be brought to run on many different machines. On the other hand, programs that are written in a high-level language and automatically translated to machine language may run somewhat slower than programs that are hand-coded in machine language. Hence, some time-critical programs are still written partly in machine language. A good compiler will, however, be able to get very close to the speed of hand-written machine code when translating well-structured programs.

The phases of a compiler
Since writing a compiler is a nontrivial task, it is a good idea to structure the work. A typical way of doing this is to split the compilation into several phases with well-defined interfaces. Conceptually, these phases operate in sequence (though in practice, they are often interleaved), each phase (except the first) taking the output from the previous phase as its input. It is common to let each phase be handled by a separate module. Some of these modules are written by hand, while others may be generated from specifications. Often, some of the modules can be shared between several compilers. A common division into phases is described below. In some compilers, the ordering of phases may differ slightly, some phases may be combined or split into several phases, or some extra phases may be inserted between those mentioned below.

1. Lexical analysis: This is the initial part of reading and analysing the program text: the text is read and divided into tokens, each of which corresponds to a symbol in the programming language, e.g., a variable name, keyword or number. The word “lexical” in the traditional sense means “pertaining to words”. In terms of programming languages, words are objects like variable names, numbers, keywords etc. Such words are traditionally called tokens. A lexical analyser, or lexer for short, takes as its input a string of individual letters and divides this string into tokens. Additionally, it filters out whatever separates the tokens (the so-called white-space), i.e., lay-out characters (spaces, newlines etc.) and comments. The main purpose of lexical analysis is to make life easier for the subsequent syntax analysis phase. In theory, the work that is done during lexical analysis can be made an integral part of syntax analysis, and in simple systems this is indeed often done. However, there are reasons for keeping the phases separate:

• Efficiency: A lexer may do the simple parts of the work faster than the more general parser can. Furthermore, the size of a system that is split in two may be smaller than a combined system. This may seem paradoxical but, as we shall see, there is a non-linear factor involved which may make a separated system smaller than a combined system.
• Modularity: The syntactical description of the language need not be cluttered with small lexical details such as white-space and comments.
• Tradition: Languages are often designed with separate lexical and syntactical phases in mind, and the standard documents of such languages typically separate lexical and syntactical elements of the languages.

It is usually not terribly difficult to write a lexer by hand: you first read past initial white-space, then you, in sequence, test to see if the next token is a keyword, a number, a variable or whatnot. However, this is not a very good way of handling the problem: you may read the same part of the input repeatedly while testing each possible token, and in some cases it may not be clear where the next token ends. Furthermore, a handwritten lexer may be complex and difficult to maintain. Hence, lexers are normally constructed by lexer generators, which transform human-readable specifications of tokens and white-space into programs. We will see the same general strategy in the chapter about syntax analysis: specifications in a well-defined human-readable notation are transformed into efficient programs. For lexical analysis, specifications are traditionally written using regular expressions, an algebraic notation for describing sets of strings. The generated lexers are in a class of extremely simple programs called finite automata.

Regular expressions
The set of all integer constants or the set of all variable names are sets of strings, where the individual letters are taken from a particular alphabet. Such a set of strings is called a language.
For integers, the alphabet consists of the digits 0-9, and for variable names the alphabet contains both letters and digits (and perhaps a few other characters, such as underscore). Given an alphabet, we will describe sets of strings by regular expressions, an algebraic notation that is compact and easy for humans to use and understand. The idea is that regular expressions that describe simple sets of strings can be combined to form regular expressions that describe more complex sets of strings.
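To make the idea of combining simple regular expressions concrete, here is a small Python sketch using the standard `re` module. The names `DIGIT`, `LETTER`, `INTEGER`, `NAME` and `in_language` are chosen for this illustration, and the exact alphabets are an assumption.

```python
import re

# Simple building blocks: each describes a set of one-character strings.
DIGIT = r"[0-9]"
LETTER = r"[A-Za-z_]"

# More complex languages built from the simple ones.
INTEGER = rf"{DIGIT}+"                      # one or more digits
NAME = rf"{LETTER}({LETTER}|{DIGIT})*"      # a letter, then letters or digits

def in_language(regex, s):
    """True if the whole string s belongs to the language described by regex."""
    return re.fullmatch(regex, s) is not None
```

For instance, `in_language(INTEGER, "255")` holds, while `in_language(NAME, "1x")` does not, because a name may not start with a digit.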

2. Syntax analysis: In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens (for example, words), to determine its grammatical structure with respect to a given (more or less) formal grammar. This phase takes the list of tokens produced by the lexical analysis and arranges these in a tree structure (called the syntax tree) that reflects the structure of the program. This phase is often called parsing.

Parsing can be done by guessing derivations until the right one is found, but random guessing is hardly an effective method. Even so, some parsing techniques are based on “guessing” derivations. However, these make sure, by looking at the string, that they will always guess right. These are called predictive parsing methods. Predictive parsers always build the syntax tree from the root down to the leaves and are hence also called (deterministic) top-down parsers. Other parsers go the other way: they search for parts of the input string that match right-hand sides of productions and rewrite these to the left-hand nonterminals, at the same time building pieces of the syntax tree. The syntax tree is eventually completed when the string has been rewritten (by inverse derivation) to the start symbol. Here too, we wish to make sure that we always pick the “right” rewrites, so we get deterministic parsing. Such methods are called bottom-up parsing methods.

Types of parsers
The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways:

• Top-down parsing - Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules. LL parsers and recursive-descent parsers are examples of top-down parsers which cannot accommodate left-recursive productions. Although it has been believed that simple implementations of top-down parsing cannot accommodate direct and indirect left recursion and may require exponential time and space complexity while parsing ambiguous context-free grammars, more sophisticated algorithms for top-down parsing have been created by Frost, Hafiz, and Callaghan which accommodate ambiguity and left recursion in polynomial time and which generate polynomial-size representations of the potentially exponential number of parse trees. Their algorithm is able to produce both left-most and right-most derivations of an input with regard to a given CFG.

• Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is shift-reduce parsing.

Another important distinction is whether the parser generates a leftmost derivation or a rightmost derivation (see context-free grammar). LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation (although usually in reverse).
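A minimal sketch of a predictive, recursive-descent (top-down) parser is given below. The tiny expression grammar, the tuple-based tree shape and the function name `parse` are assumptions made for this illustration; the repetition in `expr` is written as a loop precisely because this kind of parser cannot handle a left-recursive production directly.

```python
def parse(tokens):
    """Predictive (recursive-descent) parser for the hypothetical grammar
         expr -> term (("+" | "-") term)*
         term -> NUMBER | "(" expr ")"
    Tokens are strings; the result is a nested tuple (operator, left, right).
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def term():
        tok = peek()
        if tok == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok is not None and tok.isdigit():
            return int(advance())
        raise SyntaxError(f"unexpected token: {tok!r}")

    def expr():
        node = term()
        # One token of lookahead decides whether another operand follows.
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token: {peek()!r}")
    return tree
```

The tree is built from the root rule `expr` downwards, and each decision is made by looking only at the next token, which is what makes the parser predictive.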

3. Type checking: This phase analyses the syntax tree to determine if the program violates certain consistency requirements, e.g., if a variable is used but not declared or if it is used in a context that does not make sense given the type of the variable, such as trying to use a boolean value as a function pointer.

4. Intermediate code generation: The program is translated to a simple machine-independent intermediate language.

5. Register allocation: The symbolic variable names used in the intermediate code are translated to numbers, each of which corresponds to a register in the target machine code.

6. Machine code generation: The intermediate language is translated to assembly language (a textual representation of machine code) for a specific machine architecture.

7. Assembly and linking: The assembly-language code is translated into binary representation and addresses of variables, functions, etc., are determined.

The first three phases are collectively called the front-end of the compiler and the last three phases are collectively called the back-end. The middle part of the compiler is in this context only the intermediate code generation, but this often includes various optimizations and transformations on the intermediate code. Each phase, through checking and transformation, establishes stronger invariants on the things it passes on to the next, so that writing each subsequent phase is easier than if these have to take all the preceding into account. For example, the type checker can assume absence of syntax errors and the code generation can assume absence of type errors. Assembly and linking are typically done by programs supplied by the machine or operating system vendor, and are hence not part of the compiler itself, so we will not discuss these phases further.

Interpreters
An interpreter is another way of implementing a programming language. Interpretation shares many aspects with compiling. Lexing, parsing and type checking are done in an interpreter just as in a compiler. But instead of generating code from the syntax tree, the syntax tree is processed directly to evaluate expressions, execute statements, and so on. An interpreter may need to process the same piece of the syntax tree (for example, the body of a loop) many times and, hence, interpretation is typically slower than executing a compiled program. But writing an interpreter is often simpler than writing a compiler, and the interpreter is easier to move to a different machine, so for applications where speed is not of the essence, interpreters are often used.

Compilation and interpretation may be combined to implement a programming language: the compiler may produce intermediate-level code which is then interpreted rather than compiled to machine code. In some systems, there may even be parts of a program that are compiled to machine code, some parts that are compiled to intermediate code, which is interpreted at runtime, while other parts may be kept as a syntax tree and interpreted directly. Each choice is a compromise between speed and space: compiled code tends to be bigger than intermediate code, which tends to be bigger than syntax, but each step of translation improves running speed. Using an interpreter is also useful during program development, where it is more important to be able to test a program modification quickly than to run the program efficiently.
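The idea of processing the syntax tree directly can be sketched as follows. The tuple-based tree shape, the operator set, and the names `evaluate` and `run_loop` are assumptions for this illustration and do not correspond to any particular interpreter.

```python
import operator

# Binary operators supported by this sketch.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, env):
    """Evaluate a syntax tree directly instead of generating code from it."""
    if isinstance(tree, int):
        return tree                      # numeric literal
    if isinstance(tree, str):
        return env[tree]                 # variable lookup in the environment
    op, left, right = tree               # interior node: (operator, left, right)
    return OPS[op](evaluate(left, env), evaluate(right, env))

def run_loop(body, env, var, times):
    """Re-walk the same tree on every iteration, as an interpreter does for a loop body."""
    for _ in range(times):
        env[var] = evaluate(body, env)
    return env[var]
```

`run_loop` traverses the same tree once per iteration, which is the repeated work that makes interpretation typically slower than running compiled code.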

Operation of a typical Multi-language, Multi-target compiler

Why learn about compilers?
Few people will ever be required to write a compiler for a general-purpose language like C, Pascal or SML. So why do most computer science institutions offer compiler courses and often make these mandatory? Some typical reasons are:
a) It is considered a topic that you should know in order to be “well cultured” in computer science.
b) A good craftsman should know his tools, and compilers are important tools for programmers and computer scientists.
c) The techniques used for constructing a compiler are useful for other purposes as well.
d) There is a good chance that a programmer or computer scientist will need to write a compiler or interpreter for a domain-specific language.

The first of these reasons is somewhat dubious, though something can be said for “knowing your roots”, even in such a rapidly changing field as computer science. Reason “b” is more convincing: understanding how a compiler is built will allow programmers to get an intuition about what their high-level programs will look like when compiled and use this intuition to tune programs for better efficiency. Furthermore, the error reports that compilers provide are often easier to understand when one knows about and understands the different phases of compilation, such as knowing the difference between lexical errors, syntax errors, type errors and so on.

The third reason is also quite valid. In particular, the techniques used for reading (lexing and parsing) the text of a program and converting this into a form (abstract syntax) that is easily manipulated by a computer can be used to read and manipulate any kind of structured text such as XML documents, address lists, etc.

Reason “d” is becoming more and more important as domain-specific languages (DSLs) are gaining in popularity. A DSL is a (typically small) language designed for a narrow class of problems. Examples are data-base query languages, text-formatting languages, scene description languages for ray-tracers and languages for setting up economic simulations. The target language for a compiler for a DSL may be traditional machine code, but it can also be another high-level language for which compilers already exist, a sequence of control signals for a machine, or formatted text and graphics in some printer-control language (e.g. PostScript). Even so, all DSL compilers will share similar front-ends for reading and analyzing the program text. Hence, the methods needed to make a compiler front-end are more widely applicable than the methods needed to make a compiler back-end, but the latter is more important for understanding how a program is executed on a machine.

PASSES OF COMPILER
Instead of viewing the compiler in terms of its seven logical phases, we could look at it in terms of the N physical passes that it must make over its data bases. The figure gives an overview flowchart of a compiler, depicting its passes.

Pass 1 corresponds to the lexical analysis phase. It scans the source program and creates the identifier, literal, and uniform symbol tables.

Pass 2 corresponds to the syntactic and interpretation phases. Pass 2 scans the uniform symbol table, produces the matrix, and places information about identifiers into the identifier table. Passes 1 and 2 could be combined into one by treating lexical analysis as an action routine that would parse the source program and transfer the tokens directly to the stack as they were needed.

Passes 3 through N-3 correspond to the optimization phase. Each separate type of optimization may require several passes over the matrix. The optimization techniques implemented for a particular compiler vary and may even be, as in PL/I, controlled by the programmer.

Pass N-2 corresponds to the storage assignment phase. This is a pass over the identifier and literal tables rather than the program itself.

Pass N-1 corresponds to the code generation phase. It scans the matrix and creates the first version of the object deck.

Pass N corresponds to the assembly phase. It resolves symbolic addresses and creates the information for the loader. Notice that pass N corresponds roughly to pass 2 of the assembler. The addition of a macro facility or other advanced features to our compiler may require additional passes.

PROPOSED METHODOLOGY
The tasks of the lexical analysis phase involve manipulation of five data bases (A to E in the list below); the remaining data bases are used by the later phases of the compiler.

A. SOURCE CODE: original form of the program; appears to the compiler as a string of characters.

B. UNIFORM SYMBOL TABLE: consists of a full or partial list of the tokens as they appear in the program. Created by lexical analysis and used for syntax analysis and interpretation.

C. TERMINAL TABLE: a permanent table which lists all keywords and special symbols of the language in symbolic form.

D. IDENTIFIER TABLE: contains all variables in the program and temporary storage and any information needed to reference or allocate storage for them; created by lexical analysis, modified by interpretation and storage allocation, and referenced by code generation and assembly. The table may also contain information on all temporary locations that the compiler creates for use during execution of the source program (e.g. temporary matrix entries).

E. LITERAL TABLE: contains all constants in the program.

F. REDUCTIONS: permanent table of decision rules in the form of patterns for matching with the uniform symbol table to discover syntactic structure.

G. MATRIX: intermediate form of the program which is created by the action routines, optimized, and then used for code generation.

H. CODE PRODUCTIONS: permanent table of definitions; there is one entry defining code for each possible matrix operator.

I. ASSEMBLY CODE: assembly language version of the program which is created by the code generation phase and is input to the assembly phase.

J. RELOCATABLE OBJECT CODE: final output of the assembly phase, ready to be used as input to the loader.

LEXICAL ANALYZER
This is the initial part of reading and analysing the program text: the text is read and divided into tokens, each of which corresponds to a symbol in the programming language, e.g., a variable name, keyword or number. The word “lexical” in the traditional sense means “pertaining to words”. In terms of programming languages, words are objects like variable names, numbers, keywords etc. Such words are traditionally called tokens. A lexical analyser, or lexer for short, takes as its input a string of individual letters and divides this string into tokens. Additionally, it filters out whatever separates the tokens (the so-called white-space), i.e., lay-out characters (spaces, newlines etc.) and comments.

A. SOURCE PROGRAM: original form of the program; appears to the compiler as a string of characters.

B. UNIFORM SYMBOL TABLE: consists of a full or partial list of the tokens as they appear in the program. Created by lexical analysis and used for syntax analysis and interpretation.
   Uniform symbol table entry: Table | Index

C. TERMINAL TABLE: a permanent data base that has an entry for each terminal symbol (e.g. arithmetic operators, keywords, non-alphanumeric symbols). Each entry consists of the terminal symbol, an indication of its classification (operator, break character), and its precedence.
   Terminal table entry: Symbol | Indicator | Precedence

D. IDENTIFIER TABLE: contains all variables in the program and temporary storage and any information needed to reference or allocate storage for them; created by lexical analysis, modified by interpretation and storage allocation, and referenced by code generation and assembly. The table may also contain information on all temporary locations that the compiler creates for use during execution of the source program (e.g. temporary matrix entries).
   Identifier table entry: Name | Data attributes | Address

E. LITERAL TABLE: contains all constants in the program.
   Literal table entry: Literal | Base | Scale | Precision | Other information | Address

First the code fragment at the top of the figure is written, using the compiler specification’s structure and the language specification’s terminology. From method below it is apparent that the symbol table entries need type tests such as isNumeric.

Stages in Lexical Scanner
The first stage, the scanner, is usually based on a finite state machine. It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexemes). For instance, an integer token may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows, and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the maximal munch rule, or longest match rule). In some languages the lexeme creation rules are more complicated and may involve backtracking over previously read characters.
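The maximal munch rule can be illustrated with a small hand-written scanner in Python. The operator set, the token names and the function name `scan` are assumptions for this sketch; a generated scanner would encode the same decisions in a state table.

```python
# Hypothetical operator set; "<=" and "==" share a first character with "<" and "=".
OPERATORS = {"<", "<=", "=", "==", "+"}

def scan(text):
    """Split text into (kind, lexeme) pairs using the longest-match rule."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            # The first character says "integer"; keep consuming while digits follow.
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(("INT", text[i:j]))
            i = j
        elif ch.isalpha() or ch == "_":
            j = i
            while j < len(text) and (text[j].isalnum() or text[j] == "_"):
                j += 1
            tokens.append(("NAME", text[i:j]))
            i = j
        elif text[i:i + 2] in OPERATORS:
            # Prefer the two-character operator over its one-character prefix.
            tokens.append(("OP", text[i:i + 2]))
            i += 2
        elif ch in OPERATORS:
            tokens.append(("OP", ch))
            i += 1
        else:
            raise ValueError(f"illegal character {ch!r} at position {i}")
    return tokens
```

With this rule, the input `x<=10` yields a single `<=` operator rather than `<` followed by `=`.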

Tokenizer
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

For example, consider the following string. Unlike humans, a computer cannot intuitively "see" that it contains 9 words; to a computer it is only a series of 43 characters.

    The quick brown fox jumps over the lazy dog

A process of tokenization could be used to split the sentence into word tokens. Although the following example is given as XML, there are many ways to represent tokenized input:

    <sentence>
      <word>The</word>
      <word>quick</word>
      <word>brown</word>
      <word>fox</word>
      <word>jumps</word>
      <word>over</word>
      <word>the</word>
      <word>lazy</word>
      <word>dog</word>
    </sentence>

A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal or a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. (Some tokens, such as parentheses, do not really have values, and so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.)

For example, in the source code of a computer program the string

    net_worth_future = (assets - liabilities);

might be converted (with whitespace suppressed) into the lexical token stream:

    NAME "net_worth_future"
    EQUALS
    OPEN_PARENTHESIS
    NAME "assets"
    MINUS
    NAME "liabilities"
    CLOSE_PARENTHESIS
    SEMICOLON

Though it is possible and sometimes necessary to write a lexer by hand, lexers are often generated by automated tools. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed, or construct a state table for a finite state machine (which is plugged into template code for compilation and execution).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, a NAME token might be any English alphabetical character or an underscore, followed by any number of instances of any ASCII alphanumeric character or an underscore. This could be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".

Regular expressions and the finite state machines they generate are not powerful enough to handle recursive patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." They are not capable of keeping count and verifying that n is the same on both sides, unless there is a finite set of permissible values for n. It takes a full-fledged parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end (see the example in the SICP book).

The Lex programming tool and its compiler are designed to generate code for fast lexical analysers based on a formal description of the lexical syntax. Lex is not generally considered sufficient for applications with a complicated set of lexical rules and severe performance requirements; for instance, the GNU Compiler Collection uses hand-written lexers.
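The stack technique mentioned above for matching parentheses can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the project's program; the function name parentheses_balanced is an assumption made for this example.

```python
def parentheses_balanced(text):
    """Return True if every '(' in text is closed by a later ')'."""
    stack = []
    for ch in text:
        if ch == "(":
            stack.append(ch)      # remember an opener that is still unmatched
        elif ch == ")":
            if not stack:         # a closer with no opener left to match
                return False
            stack.pop()           # this closer matches the most recent opener
    return not stack              # anything left on the stack was never closed


if __name__ == "__main__":
    for sample in ["((assets - liabilities))", "((assets - liabilities)", ")("]:
        print(repr(sample), parentheses_balanced(sample))
```

Because the stack can grow without limit, the check works for any nesting depth n, which is exactly what a finite state machine cannot do.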

Lex

The first phase in a compiler reads the input source and converts strings in the source to tokens. Using regular expressions, we can specify patterns to lex so that it can generate code that scans for and matches strings in the input. Each pattern specified in the input to lex has an associated action. Typically an action returns a token that represents the matched string for subsequent use by the parser. Initially, only the matched string is printed rather than a token value being returned.

The following represents a simple pattern, composed of a regular expression, that scans for identifiers. Lex will read this pattern and produce C code for a lexical analyzer that scans for identifiers.

    letter(letter|digit)*

This pattern matches a string of characters that begins with a single letter followed by zero or more letters or digits. The example nicely illustrates the operations allowed in regular expressions:

• Repetition, expressed by the "*" operator
• Alternation, expressed by the "|" operator
• Concatenation
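The pairing of patterns with actions can be imitated in Python with the standard re module. The sketch below is not lex itself and not the project's code; it is an assumed, minimal analogue that reproduces the token stream shown earlier for the statement net_worth_future = (assets - liabilities);. The names TOKEN_SPEC, MASTER and tokenize are hypothetical.

```python
import re

# (token type, pattern) pairs; the type names follow the token stream in the text.
TOKEN_SPEC = [
    ("NAME", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("EQUALS", r"="),
    ("OPEN_PARENTHESIS", r"\("),
    ("CLOSE_PARENTHESIS", r"\)"),
    ("MINUS", r"-"),
    ("SEMICOLON", r";"),
    ("WHITESPACE", r"\s+"),
]

# One combined expression: alternation of named groups, one per token type.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (type, value) tokens; value is None for punctuation."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind = match.lastgroup            # name of the group that matched
        if kind == "NAME":
            tokens.append((kind, match.group()))   # evaluator keeps the text
        elif kind != "WHITESPACE":
            tokens.append((kind, None))            # no value needed
        pos = match.end()                 # whitespace is silently suppressed
    return tokens


if __name__ == "__main__":
    for token in tokenize("net_worth_future = (assets - liabilities);"):
        print(token)
```

Here the "action" for each pattern is the small branch inside the loop: names keep their text as a value, punctuation carries no value, and whitespace is dropped before the parser would ever see it.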

Any regular expression may be expressed as a finite state automaton (FSA). We can represent an FSA using states and transitions between states. There is one start state, and there are one or more final, or accepting, states.

In this figure, state 0 is the start state and state 2 is the accepting state. As the characters are read, we make transitions from one state to another. When the first letter is read, we transition to state 1. We remain in state 1 as more letters or digits are read. When we read a character other than a letter or digit, we transition to state 2, the accepting state. Any FSA may be expressed as a computer program.
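As an illustration of that last statement, the three-state automaton for identifiers described above might be written as follows. This is a sketch under assumptions: letters and digits are taken to be ASCII only, end of input is treated like "any other character", and the names is_letter, is_digit and scan_identifier are invented for the example.

```python
def is_letter(ch):
    """True for a single ASCII letter (end of input, "", is not a letter)."""
    return len(ch) == 1 and ch.isascii() and ch.isalpha()


def is_digit(ch):
    """True for a single ASCII digit."""
    return len(ch) == 1 and ch.isascii() and ch.isdigit()


def scan_identifier(text, start=0):
    """Run the automaton from position start; return the identifier or None."""
    state = 0                      # state 0: start state
    pos = start
    while state != 2:              # state 2: accepting state
        ch = text[pos] if pos < len(text) else ""   # "" marks end of input
        if state == 0:
            if is_letter(ch):
                state = 1          # first letter read: move to state 1
                pos += 1
            else:
                return None        # no transition from state 0: reject
        else:                      # state 1
            if is_letter(ch) or is_digit(ch):
                pos += 1           # stay in state 1
            else:
                state = 2          # any other character: accept
    return text[start:pos]


if __name__ == "__main__":
    print(scan_identifier("x1<=10"))
    print(scan_identifier("10"))
```

The character that causes the move to state 2 is not consumed, so a surrounding scanner could continue with it as the start of the next token.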

Lexical analyzer generators

• ANTLR - ANTLR generates predicated-LL(k) lexers.
• Flex - Alternative variant of the classic 'lex' (C/C++).
• JFlex - A rewrite of JLex.
• JLex - A lexical analyzer generator for Java.
• Quex - (or 'Queχ') A fast universal lexical analyzer generator for C++.
• Ragel - A state machine and lexical scanner generator with output support for C, C++, Objective-C, D, Java and Ruby source code.

RESULT
We developed a program whose main objective was to perform the function of a lexical analyser. The lexical analyser, or lexer for short, takes a string of individual characters as input and divides this string into tokens. Additionally, it filters out whatever separates the tokens (the so-called whitespace), i.e., layout characters (spaces, newlines, etc.) and comments. The lexical analyser is mainly the interface between the source program and the compiler.

The lexical analyser reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens. Each token represents a sequence of characters that can be treated as a single logical entity. Identifiers, keywords, constants, operators, and punctuation symbols such as commas and parentheses are typical tokens. The end result is that the expression the user enters as a source program is divided into tokens, each of which is then identified as a keyword, identifier, constant, etc.

Each pattern specified in the input to the lexical analyser has an associated action. Typically an action returns a token that represents the matched string for subsequent use by the parser. Initially we simply print the matched string rather than return a token value. For example, if we enter a string in our program:

    Enter the string: for(x1=0; x1<=10; x1++);

    Analysis:
    for  : Keyword
    (    : Special character
    x1   : Identifier
    =    : Assignment operator
    0    : Constant
    ;    : Special character
    x1   : Identifier
    <=   : Relational operator
    10   : Constant
    ;    : Special character
    x1   : Identifier
    +    : Operator
    +    : Operator
    )    : Special character
    ;    : Special character

    End of program

The result is that the entered string is divided into tokens (identifiers, operators, special characters, keywords). This makes it easy to generate a symbol table for further use.
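The behaviour shown in this output can be approximated by a short hand-written scanner. The sketch below is not the project's actual source; it is an assumed Python reconstruction that reads the input one character at a time and reproduces the same categories, including the treatment of "++" as two separate operators. The keyword list, the sets SPECIAL and RELATIONAL, and the function name analyse are all assumptions.

```python
KEYWORDS = {"for", "while", "if", "else", "int", "return"}  # assumed subset
SPECIAL = set("();,{}")
RELATIONAL = {"<", ">", "<=", ">=", "==", "!="}
ARITHMETIC = set("+-*/%")


def analyse(source):
    """Return (lexeme, category) pairs, scanning one character at a time."""
    result = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():                      # whitespace is filtered out
            i += 1
        elif ch.isalpha() or ch == "_":       # keyword or identifier
            j = i
            while j < n and (source[j].isalnum() or source[j] == "_"):
                j += 1
            word = source[i:j]
            result.append((word, "Keyword" if word in KEYWORDS else "Identifier"))
            i = j
        elif ch.isdigit():                    # integer constant
            j = i
            while j < n and source[j].isdigit():
                j += 1
            result.append((source[i:j], "Constant"))
            i = j
        elif source[i:i + 2] in RELATIONAL:   # two-character relational operator
            result.append((source[i:i + 2], "Relational operator"))
            i += 2
        elif ch in RELATIONAL:                # single '<' or '>'
            result.append((ch, "Relational operator"))
            i += 1
        elif ch == "=":
            result.append((ch, "Assignment operator"))
            i += 1
        elif ch in SPECIAL:
            result.append((ch, "Special character"))
            i += 1
        elif ch in ARITHMETIC:                # '+' '+' stay separate, as in the output
            result.append((ch, "Operator"))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return result


if __name__ == "__main__":
    pairs = analyse("for(x1=0; x1<=10; x1++);")
    for lexeme, category in pairs:
        print(f"{lexeme:<4} : {category}")
    # A very small symbol table: the distinct identifiers that were seen.
    print(sorted({lexeme for lexeme, category in pairs if category == "Identifier"}))
```

Collecting the distinct identifiers at the end hints at how the token list feeds a symbol table; a fuller table would also record attributes such as type and scope.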

CONCLUSION
The first quarter covers front-end issues such as lexical analysis, parsing, and scope-checking of identifiers. The second quarter covers static semantic checking (primarily type checking) and code generation. We have developed a compiler which converts a high-level language to a low-level language. The key concept in compiler design is syntax-directed translation (from one language into another), which has a rule-based flavor: as the compiler reads each fragment of the program, it recognizes its syntax (if-statement, addition expression, etc.) and performs a translation of the fragment according to the prescribed semantics of that syntactic category.

So far we have written a lexical analyser, or lexer for short, which takes a string of individual characters as its input and divides this string into tokens. Additionally, it filters out whatever separates the tokens (the so-called whitespace), i.e., layout characters (spaces, newlines, etc.) and comments. This approach not only eases construction of the compiler, but also reinforces the syntax-directed concept itself. The main purpose of lexical analysis is to make life easier for the subsequent syntax analysis phase. In theory, the work that is done during lexical analysis can be made an integral part of syntax analysis, and in simple systems this is indeed often done.

Moreover, compiler construction provides numerous opportunities for reinforcing engineering lessons such as the importance of conceptual integrity in design. The design approach also helps the implementation: given reasonable abstractions that match the specification, a simple implementation that quickly covers an important subset of the functionality is possible. Producing highly optimized assembly code, however, should prove almost impossible, since the core of highly optimized assembly code is intimate knowledge of the target architecture. To evaluate the implementation at a more detailed level, the output snapshots should be of interest; there a more rigorous and in-depth analysis of the actual implementation is given, including instructions on compiling and running the current compiler.

IMPLICATIONS FOR FUTURE RESEARCH
Future of Compilers: Saving Millions of Developer Hours

Our project is about how to design compilers. If we examine the compilation process in detail, we see that it operates as a sequence of phases. Our study is concerned only with the first phase of the compiler; the remaining six phases are currently out of our scope and could be studied further as future work. Compilers were once considered almost impossible programs to write. The very first FORTRAN compiler took about 18 man-years to implement. Today, however, compilers can be built with much less effort. In the future, much better compilers can be built more easily, reducing the burden on the user, as the development of software tools and systematic techniques continues to facilitate the implementation of compilers.

