1.4 Lexical analysis

Now, let us see "What is lexical analysis?"

Definition: The process of reading the source program and converting it into tokens is called lexical analysis.

1.4.1 The role of lexical analyzer

Now, let us see "What is the role of lexical analyzer?" The lexical analyzer is the first phase of the compiler. The various tasks that are performed by the lexical analyzer are:
♦ Read a sequence of characters from the source program and produce the tokens.
♦ The tokens thus generated are sent to the parser for syntax analysis. The parser is also called syntax analyzer.
♦ During this process, the lexical analyzer interacts with the symbol table to insert identifiers and constants. Sometimes, information about identifiers is read from the symbol table to assist in determining the proper token to send to the parser.

The interaction between the lexical analyzer and the parser is pictorially represented as shown below:

[Figure: Source program -> Scanner (Lexical analyzer) -> Token -> Parser -> Semantic analysis. The parser requests each token by calling getNextToken().]

♦ The parser program calls the function getNextToken(), which is the function defined in the lexical analyzer (see the calling sequence below):

    Parser program                    Lexical analyzer
        ...                           getNextToken()
        token = getNextToken();       {
        ...                               ...
                                          return token;
                                      }

♦ The function getNextToken() of the lexical analyzer returns the token back to the parser for parsing.
♦ If the token obtained is an identifier, it is entered into the symbol table along with various attribute values, and the lexical analyzer returns a token as a pair consisting of an integer code denoted by ID and a pointer to the symbol table entry for that identifier.
♦ The other actions that are performed by the lexical analyzer are:
  ♦ Remove comments from the program.
  ♦ Remove white spaces such as blanks, tabs and newline characters from the source program, after which the tokens are obtained.
  ♦ Keep track of line numbers so as to associate line numbers with error messages.
  ♦ If any errors are encountered, the lexical analyzer displays appropriate error messages along with line numbers.
  ♦ Preprocessing may be done during the lexical analysis phase.

Example 26: Obtain a grammar to generate the following language:
$L = \{ww^R \mid w \in \{a, b\}^*\}$

Solution: The language can be written as:
$L = \{\varepsilon, aa, bb, abba, baab, aaaa, bbbb, \ldots\}$

Observe that each string in the language is a palindrome of even length. This is achieved by deleting the productions $S \to a \mid b$. So, the final grammar, where S is the start symbol, is given by:
$S \to \varepsilon$
$S \to aSa \mid bSb$

Note: In the above grammar, if the production $S \to \varepsilon$ is replaced by $S \to c$, the resulting grammar will generate the language
$L = \{wcw^R \mid w \in \{a, b\}^*\}$

1.5 Input buffering

Now, let us see "Why input buffering is required?" Input buffering is very essential for the following reasons:
♦ Since the lexical analyzer is the first phase of the compiler, it is the only phase of the compiler that reads the source program character by character, and it spends considerable time in reading the source program. Thus, the speed of lexical analysis is a concern while designing the compiler.
♦ Lexical analyzers may have to look one or more characters beyond the next lexeme before we have the right lexeme.

For this reason, we use the concept of input buffering, where a block of 1024 or 4096 or more characters is read in one read operation and stored in an array to speed up the process.
Now, let us see "What is input buffering?"

Definition: The method of reading a block of characters (1K or 4K or more bytes) from the disk in one read operation and storing it in memory (normally in the form of an array) for further processing and faster accessing is called input buffering. The memory (an array) where a block of characters read from the disk is stored is called buffer.

[Figure: buffer pair consisting of Buffer 1 and Buffer 2]

♦ The size of each buffer is N, where N is usually the size of the disk block. If the size of the disk block is 4K, in one read operation 4096 characters can be read into the buffer using one system call, rather than using one system call per character, which consumes a lot of time.
♦ Irrespective of the number of characters stored in the buffer, the last character of each buffer is eof.
♦ Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than at the end of a buffer means that the input is at an end.
♦ The use of the two pointers lexemeBeginning and inputPointer and the method of accessing the lexeme remain the same as in buffer pairs.
♦ The algorithm consisting of lookahead code with sentinels is shown below:

    switch (*inputPointer++)
    {
        case eof:
            if (inputPointer is at the end of first buffer)
            {
                reload the second buffer;
                inputPointer = beginning of second buffer;
                break;
            }
            if (inputPointer is at the end of second buffer)
            {
                reload the first buffer;
                inputPointer = beginning of first buffer;
                break;
            }
            /* eof within a buffer indicates the end of the input */
            /* So, terminate lexical analysis */
            break;

        /* Cases for other characters */
    }

♦ Observe from the above algorithm that instead of having two tests as in the buffer pair technique, there is only one test, i.e., testing for the eof marker.
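As an illustration, the lookahead code with sentinels might be written in C roughly as follows. This is only a sketch under stated assumptions: the names Scanner, scanner_init, reload and next_char are hypothetical and do not come from the text, the NUL character '\0' stands in for the eof sentinel (so the source program is assumed to contain no NUL bytes), and N is taken as 4096. The lexemeBeginning pointer is left out to keep the sketch short.

```c
#include <stdio.h>
#include <stddef.h>

#define N 4096          /* assumed buffer size: one 4K disk block          */
#define SENTINEL '\0'   /* assumption: stands in for the eof marker        */

/* Hypothetical scanner state: two buffers, each followed by one sentinel slot. */
typedef struct {
    FILE *src;
    char  buf[2 * (N + 1)];
    char *inputPointer;
} Scanner;

/* Fill one half (0 = first buffer, 1 = second buffer) with a single read
   and place the sentinel right after the last character actually read.   */
static void reload(Scanner *s, int half)
{
    char  *start = s->buf + half * (N + 1);
    size_t got   = fread(start, 1, N, s->src);
    start[got] = SENTINEL;
}

void scanner_init(Scanner *s, FILE *src)
{
    s->src = src;
    reload(s, 0);
    s->inputPointer = s->buf;
}

/* Return the next character, or EOF at the end of the input.
   The common path performs only one test: is the character the sentinel? */
int next_char(Scanner *s)
{
    for (;;) {
        char c = *s->inputPointer++;
        if (c != SENTINEL)
            return (unsigned char)c;

        char *sentinelPos = s->inputPointer - 1;
        if (sentinelPos == s->buf + N) {                /* end of first buffer  */
            reload(s, 1);
            s->inputPointer = s->buf + N + 1;
        } else if (sentinelPos == s->buf + 2 * N + 1) { /* end of second buffer */
            reload(s, 0);
            s->inputPointer = s->buf;
        } else {
            /* Sentinel inside a buffer: the input is at an end.
               Stay on the sentinel so later calls also return EOF. */
            s->inputPointer = sentinelPos;
            return EOF;
        }
    }
}
```

A caller would open the source file, call scanner_init once, and then call next_char repeatedly until it returns EOF. Placing the sentinel after the last character read means a short final block is detected by the same single test as a full one.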
1.6 Specifications of tokens

Problems

Prove that $L = \{0^n 1^n \mid n \ge 1\}$ is not regular.
$L = \{01, 0011, 000111, 00001111, \ldots\}$
♦ Assume L is regular (method of contradiction) and let n be a constant.
♦ Split $w = xyz$ such that $|y| \ge 1$, $|xy| \le n$, and $xy^kz \in L$ for all $k \ge 0$.
♦ For n = 2, take $w = 0011$ with $x = 0$, $y = 0$, $z = 11$.
♦ Assume k = 0. Then $xy^kz = xz = 011 \notin L$, which is a contradiction.
Hence $L = \{0^n 1^n\}$ is not regular.

Prove that $L = \{a^p \mid p \text{ is a prime}\}$ is not a regular language.
♦ Let n be an integer constant.
♦ Select a string w from L (such as aa, aaa, aaaaa, ...) and split it as $w = xyz$ with $|y| \ge 1$ and $|xy| \le n$.
♦ As in the previous proof, a value of k is chosen for which $xy^kz \notin L$, which is a contradiction.

In the above statement, the patterns, lexemes and respective tokens are shown below. The token names are symbolic names defined using the #define keyword.

    Lexemes     Patterns         Tokens
    char        keyword          CHAR
    str         identifier       <ID, 1>
    [           left bracket     LEFT_BRACKET
    ]           right bracket    RIGHT_BRACKET
    =           operator =       ASSIGN
    "hel..."    string           <LITERAL, 2>
    ;           symbol ;         SEMI_COLON

1.4.3.1 Tokens

Now, let us see "What is a token?"

Definition: A token is a pair consisting of a token name and an optional attribute value. The token names are typically integer codes represented using symbolic names such as INT, FLOAT, SEMI_COLON etc., defined in the file token.h. The attribute values are optional and will not be present for keywords, operators and symbols. The attribute values are present for all identifiers and constants.
♦ For every keyword there is a unique token name. For example, INT, FLOAT, CHAR.
♦ For every symbol there is a unique token name. For example, SEMI_COLON, COLON, COMMA, LEFT_PARANTHESIS, RIGHT_PARANTHESIS, ASSIGN etc.
♦ For an identifier sum, the token is <ID, 1>, where ID is the token name and 1 is the position of the identifier in the symbol table.

Whenever there is a request from the parser, the lexical analyzer sends the token. So, tokens are the output of the lexical analyzer and the input to the syntax analyzer. The syntax analyzer uses these tokens to check whether the program is syntactically correct or not by deriving the tokens from the grammar. All the tokens are represented using symbolic constants defined using the #define directive as shown below:

    /* TOKENS with corresponding integer codes for keywords */
    #define ...
    #define ...
    #define ...
    #define ...
    #define ...
    #define ...

1.4.3.2 Lexeme

Now, let us see "What is a lexeme?"

Definition: A sequence of characters in the source program that matches a pattern, such as an identifier, a number, a relational operator, an arithmetic operator, or a symbol such as #, [, ], (, ) and so on, is called a lexeme. In other words, a lexeme is a string of characters read from the source file that corresponds to a token.

1.4.3.3 Patterns

Now, let us see "What is a pattern?"

Definition: The description of a lexeme is called pattern. More formally, a pattern is a rule describing a set of lexemes. The various patterns are shown below:
♦ Keyword: The pattern keyword is described as a sequence of characters that forms a reserved word of a language. For example, int, if, else, while, do, switch etc. are all reserved words. They are also called keywords.
♦ Identifier: The pattern identifier is described as a letter or an underscore followed by any number of letters, digits or underscores. For example, sum, i, pos, first, rate_of_interest, which represent variables in a program or names of functions, structures etc., are all treated as identifiers.
♦ Relational Operator: The pattern relational operator is described as the symbols that represent the various relational operators of a language. For example, <, <=, >, >=, == and != are patterns identifying the relational operators.
♦ Symbols: The pattern symbols is described as a set of symbols such as #, $, (, ), {, }, : and so on.

Example 1.2: Identify lexemes and tokens in the following statement:
    printf("Simple Interest = %f\n", si);

Solution: The lexemes, patterns and tokens for the given printf statement are shown below:
♦ printf is a lexeme matching the pattern identifier and returning the token <ID, 1>, where ID is the token name and 1 is the position of the identifier printf in the symbol table.
♦ The character '(' is a lexeme matching the pattern symbol and returning the token LEFT_PARANTHESIS.
♦ The sequence of characters "Simple Interest = %f\n" is a lexeme matching the pattern string and returning the token <LITERAL, 2>, where LITERAL is the token name and 2 is the position of the literal in the symbol table.
♦ The character ',' is a lexeme matching the pattern symbol and returning the token COMMA.
♦ si is a lexeme matching the pattern identifier and returning the token <ID, 3>, where ID is the token name and 3 is the position of the identifier si in the symbol table.
♦ The character ')' is a lexeme matching the pattern symbol and returning the token RIGHT_PARANTHESIS.
♦ The character ';' is a lexeme matching the pattern symbol and returning the token SEMI_COLON.

Example: Obtain the grammar to generate the language
$L = \{0^m 1^m 2^n \mid m \ge 1 \text{ and } n \ge 0\}$

Solution: Given the language $L = \{0^m 1^m 2^n \mid m \ge 1 \text{ and } n \ge 0\}$, the productions can be generated as shown below:
$S \to AB$ ..... (1)
♦ The variable A should produce m number of 0's followed by m number of 1's, with a minimum string 01 (since $m \ge 1$). This is achieved using the following production:
$A \to 01 \mid 0A1$ [Similar to example 20, page 4.17]
♦ B should produce any number of 2's. Any number of 2's can be generated using the production:
$B \to \varepsilon \mid 2B$ [Similar to example 1, page 4.7]

So, the final grammar to generate the given language $L = \{0^m 1^m 2^n \mid m \ge 1 \text{ and } n \ge 0\}$ is:
$S \to AB$
$A \to 01 \mid 0A1$
$B \to \varepsilon \mid 2B$

The following grammar also generates the same language. The reader is required to verify the answer.
$S \to A \mid S2$
$A \to 01 \mid 0A1$
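One way to check strings against the grammar S → AB, A → 01 | 0A1, B → ε | 2B is a small recursive-descent recognizer in C, with one function per variable. The following is only a sketch: the function names parse_A, parse_B and in_language are hypothetical, and the input is assumed to be a NUL-terminated C string.

```c
#include <stddef.h>

/* A -> 01 | 0A1
   Consume one 0, optionally a nested A, then one 1.
   Returns a pointer just past the part derived from A, or NULL on failure. */
static const char *parse_A(const char *p)
{
    if (*p != '0')
        return NULL;
    p++;
    if (*p == '0') {            /* another 0 follows: use A -> 0A1 */
        p = parse_A(p);
        if (p == NULL)
            return NULL;
    }
    if (*p != '1')
        return NULL;
    return p + 1;
}

/* B -> epsilon | 2B : consume any number of 2's (possibly none). */
static const char *parse_B(const char *p)
{
    while (*p == '2')
        p++;
    return p;
}

/* S -> AB : returns 1 if the whole string s is derived from S, else 0. */
int in_language(const char *s)
{
    const char *p = parse_A(s);
    if (p == NULL)
        return 0;
    p = parse_B(p);
    return *p == '\0';
}
```

For example, in_language("01") and in_language("0011222") return 1, while in_language("001") and in_language("0112") return 0, because the number of 0's and 1's must match and the 2's may only appear at the end.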
