CS143 Summer 2007

Handout 6 June 28

Section 1 Handout

Regular Expressions and DFAs

To understand the innards of lex/flex and use it effectively, you should be comfortable with regular expressions and finite automata. Here are some exercises to help you practice.

1. Note which strings are in the language denoted by each regular expression.

   (a) ((ab)|b)*c matches which of these strings?
       ababbc   abab   c   babc   aaabc

   (b) ab*c*(a|b)c matches which of these strings?
       acac   acbbc   abbcac   abc   acc

   (c) (a|b)a+(ba)* matches which of these strings?
       ba   bba   ababa   aa   baa

2. Write regular expressions for each description. The alphabet Σ is the binary digits {0, 1}.

   (a) All strings which end in 01
   (b) All strings which contain exactly one 0
   (c) All strings which contain an even number of 1s and no 0s
   (d) All strings which contain an even number of 1s and any number of 0s
   (e) All strings which contain the substring 01
   (f) All strings which do not contain the substring 01

3. Describe the language denoted by the following regular expressions. The alphabet Σ is {x, y}.

   (a) x(x|y)*y
   (b) ((x|y)(x|y))+
   (c) x*(yx+)*x*
   (d) (x|y)*((xx)|(yy))*y*

One easy way to practice with regular expressions is to use the unix utility grep to search files for lines matching a given regular expression (see man page for usage).

4. The DFA below accepts which of these strings?
       xy   xyxxy   yyyx   xyyxyxyxxy

   [Figure: DFA diagram, not reproduced]


5. Construct the following automata.

   (a) Construct a DFA for the language in problem 2a.
   (b) Construct a DFA for the language in problem 2f.

Practice with (f)lex

Remember that lex matches the longest token it can. If a token matches more than one pattern, the one listed first takes precedence. Also, assume that yytext is the global character buffer storing the matched token (as a null-terminated C string).

6. Given this lex specification:

    (a|b)+a        { printf("1 %s\n", yytext); }
    ab*c+          { printf("2 %s\n", yytext); }
    (ab)+(a|c)*    { printf("3 %s\n", yytext); }
    (aa)+          { printf("4 %s\n", yytext); }
    a?b*(ac)?      { printf("5 %s\n", yytext); }
    [ \t\n]        { /* discard whitespace */ }

Show the output printed from the scanner when reading this input:

    aaaa acabca bababbc

7. Suppose you already have a working scanner for Decaf or some similar language. Now, you want to add a simple pre-processor-like feature that allows large chunks of code to be switched on and off with #if ... #endif blocks. So, for instance, here:

    #if 0
    ...         |
    ...         | region A
    #endif
    ...
    #if 1
    ...         |
    ...         | region B
    #endif

Everything in “region A” would be ignored completely, and everything in “region B” would be processed as if the enclosing #if / #endif pair weren’t there. Only 0 or 1 is allowed as the value in the #if directive. Show what would need to be added to the corresponding scanner specification file in order to implement this.

Answers

1. (a) ababbc, c, and babc match; abab and aaabc do not.
   (b) acac, abbcac, and abc match; acbbc and acc do not.
   (c) ba, aa, and baa match; bba and ababa do not.
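The answers to problem 1 can be checked mechanically, in the same spirit as the grep suggestion. The sketch below assumes a POSIX system that provides <regex.h>; the helper name in_language and the 256-byte pattern buffer are choices made for this illustration and are not part of the handout.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Returns 1 if the whole string s is in the language of the POSIX
 * extended regular expression re, 0 if it is not, and -1 if the
 * pattern is too long for the buffer or fails to compile.
 * The pattern is wrapped as ^(re)$ so that regexec() has to match
 * the entire string rather than some substring of it. */
int in_language(const char *re, const char *s)
{
    char anchored[256];   /* illustrative size limit */
    regex_t compiled;
    int n, status;

    assert(re != NULL && s != NULL);

    n = snprintf(anchored, sizeof anchored, "^(%s)$", re);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&compiled, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    status = regexec(&compiled, s, 0, NULL, 0);
    regfree(&compiled);
    return status == 0;
}
```

For example, in_language("((ab)|b)*c", "babc") returns 1 and in_language("((ab)|b)*c", "abab") returns 0. The explicit anchoring matters: without it, regexec() reports success as soon as any substring matches.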

2. (Note: Many different regular expressions are possible.)

   (a) (0|1)*01 (any number of binary digits can precede the ending 01)
   (b) 1*01* (must be a zero somewhere, can be preceded or followed by 1s)
   (c) (11)* (ones must be added in pairs)
   (d) 0*(10*10*)* (take the answer to (c) and allow optional zeros before, between, and after the ones; the leading 0* is needed so that strings with no 1s at all, such as 0, are included)
   (e) (0|1)*01(0|1)* (must be 01 somewhere, anything can come before or after)
   (f) 1*0* (0 can only be followed by another 0)

3. (a) string must start with x and end in y
   (b) string must be of even length ≥ 2
   (c) every y is followed by at least one x (can’t contain substring yy and can’t end with y)
   (d) any string (i.e. this expression matches Σ*)

4. xy xyxxy yyyx xyyxyxyxxy

5. Note there can be many equivalent automata. We tried to simplify to a fairly tidy version.

   (a) [Figure: DFA diagram, not reproduced]

   (b) [Figure: DFA diagram, not reproduced]
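As a textual stand-in for the diagrams, here is one possible pair of DFAs for problems 2a and 2f, written as transition tables with a small simulator. This is a sketch of one valid answer, not necessarily the automata drawn in the handout; the table names, the state numbering, and the function dfa_accepts are all made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Transition tables: row = current state, column = input bit (0 or 1).
 * State 0 is the start state in both automata. */

/* Problem 2a: strings that end in 01.
 *   state 0: no useful suffix seen yet
 *   state 1: the last symbol was 0
 *   state 2: the last two symbols were 01 (accepting) */
static const int ENDS_IN_01[3][2] = { {1, 0}, {1, 2}, {1, 0} };
static const int ENDS_IN_01_ACCEPT[3] = { 0, 0, 1 };

/* Problem 2f: strings that do not contain the substring 01.
 *   state 0: only 1s so far (accepting)
 *   state 1: a 0 has been seen (accepting)
 *   state 2: dead state, a 1 followed a 0 */
static const int NO_01[3][2] = { {1, 0}, {1, 2}, {2, 2} };
static const int NO_01_ACCEPT[3] = { 1, 1, 0 };

/* Runs the DFA on s and returns 1 if it ends in an accepting state.
 * Any character other than '0' or '1' causes rejection. */
int dfa_accepts(const int (*delta)[2], const int *accepting, const char *s)
{
    int state = 0;

    assert(delta != NULL && accepting != NULL && s != NULL);
    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1')
            return 0;                 /* symbol outside the alphabet */
        state = delta[state][*s - '0'];
    }
    return accepting[state];
}
```

For example, dfa_accepts(ENDS_IN_01, ENDS_IN_01_ACCEPT, "1101") returns 1, while dfa_accepts(NO_01, NO_01_ACCEPT, "1101") returns 0 because the string contains 01. The second automaton mirrors the regular expression 1*0*: once a 0 has been read, a later 1 leads to the dead state.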

6. Since lex will try to match the longest lexeme it can, even if it manages to match a pattern, it keeps pulling characters if it thinks a longer pattern might be matched. Only when it realizes that a longer lexeme can’t be matched will it give up and officially match a pattern.

If something doesn’t match any of the patterns, it matches the default pattern, where the default action is to just ECHO the lexeme to standard out.

Output:

    1 aaaa
    2 ac
    3 abca
    1 baba
    5 bb
    c
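The longest-match and first-rule-wins behavior in problem 6 can be imitated in plain C. The sketch below is a model of the observable behavior only, not of how lex works internally (lex compiles all patterns into a single automaton). It assumes a POSIX <regex.h>; the function name scan and the output-buffer interface are inventions for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* The five printing rules from problem 6, each anchored with ^ so that a
 * match has to begin at the current scan position. */
static const char *const RULES[] = {
    "^((a|b)+a)", "^(ab*c+)", "^((ab)+(a|c)*)", "^((aa)+)", "^(a?b*(ac)?)"
};
enum { NRULES = sizeof RULES / sizeof RULES[0] };

/* Writes into out what the scanner would print for input.
 * Returns 0 on success, -1 on a compile error or if out is too small. */
int scan(const char *input, char *out, size_t outsz)
{
    regex_t re[NRULES];
    size_t used = 0;
    int i, rc = 0;

    assert(input != NULL && out != NULL && outsz > 0);
    for (i = 0; i < NRULES; i++) {
        if (regcomp(&re[i], RULES[i], REG_EXTENDED) != 0) {
            while (i-- > 0)
                regfree(&re[i]);
            return -1;
        }
    }
    out[0] = '\0';

    while (*input != '\0' && rc == 0) {
        regoff_t best_len = 0;
        int best_rule = -1;
        regmatch_t m;
        int n;

        for (i = 0; i < NRULES; i++) {
            /* Strict > keeps the earliest rule when lengths tie and
             * also skips empty matches, which lex never accepts. */
            if (regexec(&re[i], input, 1, &m, 0) == 0 && m.rm_eo > best_len) {
                best_len = m.rm_eo;
                best_rule = i;
            }
        }
        if (best_rule >= 0) {
            n = snprintf(out + used, outsz - used, "%d %.*s\n",
                         best_rule + 1, (int)best_len, input);
        } else if (strchr(" \t\n", *input) != NULL) {
            n = 0;                    /* rule 6: discard whitespace */
            best_len = 1;
        } else {
            /* default rule: ECHO the unmatched character */
            n = snprintf(out + used, outsz - used, "%c", *input);
            best_len = 1;
        }
        if (n < 0 || (size_t)n >= outsz - used) {
            rc = -1;
        } else {
            used += (size_t)n;
            input += best_len;
        }
    }
    for (i = 0; i < NRULES; i++)
        regfree(&re[i]);
    return rc;
}
```

Calling scan("aaaa acabca bababbc", buf, sizeof buf) fills buf with the six output lines listed in the answer. The tie on aaaa between rules 1 and 4, and the tie on ac between rules 2 and 5, are both resolved in favor of the rule listed first.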

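7. Because other lexical features that use start states may be nested inside a #if/#endif block, we need to use both exclusive and inclusive start states. We also need to use the state stack.

    ...
    %s INIF
    %x IGNORE
    %option stack

    /* Definition Section */
    ...
    BEGINIF   (#if" "(0|1))
    ENDIF     (#endif)

    %%
    /* Rules Section */
    ...
    {BEGINIF}          { /* yytext is "#if 0" or "#if 1", so yytext[4] is the digit */
                         if (yytext[4] == '0')
                             yy_push_state(IGNORE);
                         else
                             yy_push_state(INIF);
                       }
    <IGNORE>.|\n       { /* ignore anything, newlines included, until we hit endif */ }
    <IGNORE>{ENDIF}    { yy_pop_state(); }
    <INIF>{ENDIF}      { yy_pop_state(); }

Note that . does not match a newline in lex, so the IGNORE catch-all rule needs the explicit \n alternative; otherwise newlines inside an ignored region would fall through to the default ECHO rule.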
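To see what the state stack contributes, the start-state logic can be modeled in plain C without flex. This is only a sketch of the intended behavior: the function strip_disabled, the fixed MAX_DEPTH limit, and the character-by-character copying that stands in for the scanner's ordinary rules are all assumptions made for this illustration. Like the rules above, it does not track #if directives that appear inside an ignored region, so the first #endif ends the region.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum state { INITIAL, INIF, IGNORE };
enum { MAX_DEPTH = 32 };              /* illustrative nesting limit */

/* Copies src to out, dropping every "#if 0" ... "#endif" region and the
 * directives themselves. Returns 0 on success, -1 if out is too small
 * or the #if nesting exceeds MAX_DEPTH. */
int strip_disabled(const char *src, char *out, size_t outsz)
{
    enum state stack[MAX_DEPTH];
    size_t depth = 0, used = 0;

    assert(src != NULL && out != NULL && outsz > 0);
    stack[depth] = INITIAL;           /* bottom of the state stack */

    while (*src != '\0') {
        enum state top = stack[depth];
        int is_if = strncmp(src, "#if 0", 5) == 0 ||
                    strncmp(src, "#if 1", 5) == 0;

        if (top != IGNORE && is_if) {
            /* {BEGINIF} is active in INITIAL and in the inclusive state
             * INIF, but not in the exclusive state IGNORE. */
            if (depth + 1 >= MAX_DEPTH)
                return -1;
            stack[++depth] = (src[4] == '0') ? IGNORE : INIF;
            src += 5;
        } else if (top != INITIAL && strncmp(src, "#endif", 6) == 0) {
            depth--;                  /* pop back to the enclosing state */
            src += 6;
        } else if (top == IGNORE) {
            src++;                    /* discard, like the IGNORE catch-all */
        } else {
            if (used + 1 >= outsz)
                return -1;
            out[used++] = *src++;     /* stand-in for the ordinary rules */
        }
    }
    out[used] = '\0';
    return 0;
}
```

With the input "#if 1x#if 0y#endifz#endifw" the result is "xzw": the inner #endif pops IGNORE and returns to INIF rather than to INITIAL, which is why a stack is used instead of a plain BEGIN(INITIAL).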
