Professional Documents
Culture Documents
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 1 / 22
1 Lexical Analyzer
What is a Lexical Analyzer?
Why do we need a lexical analyzer?
Interaction of Lexical Analyzer and Parser
Tokens, Patterns, and Lexemes
Application of tokens
Lexical Errors
2 Specification of Tokens
Strings and Languages
Operations on Language
4 Conclusion
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 2 / 22
What is a Lexical Analyzer?
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 3 / 22
Why do we need a lexical analyzer?
Role of a Lexical Analyzer
As the first phase of a compiler, the main task of the lexical analyzer is to:
read the input characters of the source program
group them into lexemes
produce as output a sequence of tokens for each lexeme in the source
program
During and after the lexical analysis, :
It is common for the lexical analyzer to interact with the symbol table
If the lexical analyzer discovers a lexeme constituting an identifier, it
needs to enter that lexeme into the symbol table
In some cases, information regarding the kind of identifier may be
read from the symbol table by the lexical analyzer to assist it in
determining the proper token it must pass to the parser
The stream of tokens is sent to the parser for syntax analysis
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 4 / 22
Lexical Analyzers and Parsers
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 5 / 22
Lexical Analysis - Tokens, Patterns, and Lexemes
When discussing lexical analysis, we use three related but distinct terms:
Token
Pattern
Lexeme
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 6 / 22
Lexical Analysis - Token
Token
A token is a pair consisting of a token name and an optional attribute
value. The token name is an abstract symbol representing a kind of lexical
unit, e.g., a particular keyword, or a sequence of input characters denoting
an identifier. The token names are the input symbols that the parser
processes. In what follows, we shall generally write the name of a token in
boldface. We will often refer to a token by its token name.
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 7 / 22
Lexical Analysis - Pattern
Pattern
A pattern is a description of the form that the lexemes of a token may
take. In the case of a keyword as a token, the pattern is just the sequence
of characters that form the keyword. For identifiers and some other
tokens, the pattern is a more complex structure that is matched by many
strings.
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 8 / 22
Lexical Analysis - Lexeme
Lexeme
A lexeme is a sequence of characters in the source program that matches
the pattern for a token and is identified by the lexical analyzer as an
instance of that token.
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 9 / 22
Example of tokens
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 10 / 22
Application of tokens
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 11 / 22
Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components,
that there is a source-code error.
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 12 / 22
Specification of Tokens - Alphabet
Alphabet
An alphabet is any finite set of symbols. Typical examples of symbols are
letters, digits, and punctuation. The set {0, 1} is the binary alphabet. ASCII
is an important example of an alphabet; it is used in many software systems.
Unicode, which includes approximately 100,000 characters from alphabets
around the world, is another important example of an alphabet.
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 13 / 22
Specification of Tokens - Strings
String
A string over an alphabet is a finite sequence of symbols drawn from that
alphabet. In language theory, the terms ”sentence” and ”word” are often
used as synonyms for ”string.” The length of a string s , usually written
|s|, is the number of occurrences of symbols in s. For example, banana is
a string of length six. The empty string, denoted , is the string of length
zero.
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 14 / 22
Terms for Parts of Strings
The following string-related terms are commonly used:
A prefix of string s is any string obtained by removing zero or more
symbols from the end of s. For example, ban, banana, and are
prefixes of banana.
A suffix of string s is any string obtained by removing zero or more
symbols from the beginning of s. For example, nana, banana, and
are suffixes of banana.
A substring of s is obtained by deleting any prefix and any suffix from
s. For instance, banana, nan, and are substrings of banana.
The proper prefixes, suffixes, and substrings of a string s are those,
prefixes, suffixes, and substrings, respectively, of s that are not or
not equal to s itself.
A subsequence of s is any string formed by deleting zero or more not
necessarily consecutive positions of s. For example, baan is a
subsequence of banana.
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 15 / 22
Specification of Tokens - Language
Language
A language is any countable set of strings over some fixed alphabet. This
definition is very broad. Abstract languages like 0, the empty set, or {}, the
set containing only the empty string, are languages under this definition. So
too are the set of all syntactically well-formed C programs and the set of all
grammatically correct English sentences, although the latter two languages
are difficult to specify exactly. Note that the definition of ”language” does
not require that any meaning be ascribed to the strings in the language.
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 16 / 22
Operations on Language
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 17 / 22
Regular Expressions (REGEX)
Regular expressions are used for describing all the languages that can be
built from these operators applied to the symbols of some alphabet.
The regular expressions are built recursively out of smaller regular expres-
sions, using the rules described below. Each regular expression r denotes a
language L(r ), which is also defined recursively from the languages denoted
by r ’s subexpressions.
PHere are the rules that define the regular expressions
over some alphabet and the languages that those expressions denote.
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 18 / 22
Rules of Regular Expressions
Basic Rules
1 is a regular expression, and L () is {} , that is, the language
whose sole member is the empty string.
2 If a is a symbol in C, then a is a regular expression, and L(a)= {a},
that is, the language with one string, of length one, with a in its one
position. Note that by convention, we use italics for symbols, and
boldface for their corresponding regular expression.
Inductive Rules
S
1 (r )|(s) is a regular expression denoting the language L(r ) L(s).
2 (r )(s) is a regular expression denoting the language L(r )L(s) .
3 (r )∗ is a regular expression denoting (L(r ))∗ .
4 (r ) is a regular expression denoting L(r ). This last rule says that we
can add additional pairs of parentheses around expressions without
changing the language they denote.
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 19 / 22
Example of Regular Expressions
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 20 / 22
More examples of Regular Expressions
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 21 / 22
Thank you.
Any Questions?
Adnan Ferdous Ashrafi Lecture 2 - Introduction to Lexical Analyzer September 22, 2020 22 / 22