# CS143 Autumn 2007

Handout 08

Section: All Things Lexical

October 5, 2007

Problem 1: Finite Automata and Regular Grammars

a.) Draw a DFA that accepts all strings in (a + b)* that do not contain bababb as a substring.

b.) Present a context-free grammar that generates the language accepted by your DFA from part a. [Make sure you understand how to do this for any DFA, not just this one.]

Problem 2: flex Basics

Given the following flex rules:
    \n                     { printf("1"); }
    ...                    { printf("2"); }
    [a-c]+                 { printf("3"); }
    [A-C]+                 { printf("4"); }
    [1-3]+                 { printf("5"); }
    [acB13]+               { printf("6"); }
    [^a-c]+[^A-C]+[1-9]    { printf("7"); }

What will be the output on the following input? Assume there is nothing to the right of the last visible character on each line except a newline character and that the following input is to be scanned all at once:
    abcABC123
    321CBAcba
    BaB1cAb
    AAAAacac1231

Problem 3: Manually Building a Parse Tree As a Client of flex

Let’s deal with a subset of Scheme so small that we can build a lexer in half the time it takes to teach a section. There are no keywords whatsoever. Just lists and primitives. Scheme primitives include integer constants, string constants, and symbols.

• Integer constants are decimal strings with as many leading zeroes as you want. There are no negative numbers in our subset. Legitimate integer constants include 15, 0117, 00000000040, and 0.
• String constants are double quote-delimited strings of zero or more characters. The characters themselves can be anything whatsoever, except for ". Schemers just need to do without.
• Symbols are text strings without double quotes. They always start with a letter, but can otherwise contain letters, numbers, and a few other characters: '_', '-', '>', '?', and '!'.
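The three primitive kinds can be told apart by looking at the first character and then checking the rest. Here is a minimal hand-coded sketch of the rules listed above; the names `PrimitiveKind` and `classify` are made up for illustration and are not part of the section's files.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical names, for illustration only.
enum class PrimitiveKind { Integer, String, Symbol, Invalid };

// Classify one complete lexeme according to the prose rules above.
PrimitiveKind classify(const std::string& s) {
    if (s.empty()) return PrimitiveKind::Invalid;
    unsigned char first = static_cast<unsigned char>(s[0]);

    if (std::isdigit(first)) {
        // Integer constant: decimal digits only, leading zeroes allowed.
        for (char c : s) {
            if (!std::isdigit(static_cast<unsigned char>(c))) return PrimitiveKind::Invalid;
        }
        return PrimitiveKind::Integer;
    }
    if (first == '"') {
        // String constant: the only other double quote is the closing one.
        if (s.size() < 2 || s.find('"', 1) != s.size() - 1) return PrimitiveKind::Invalid;
        return PrimitiveKind::String;
    }
    if (std::isalpha(first)) {
        // Symbol: a letter, then letters, digits, or one of _ - > ? !
        const std::string extras = "_->?!";
        for (std::size_t i = 1; i < s.size(); ++i) {
            char c = s[i];
            bool ok = std::isalnum(static_cast<unsigned char>(c)) ||
                      extras.find(c) != std::string::npos;
            if (!ok) return PrimitiveKind::Invalid;
        }
        return PrimitiveKind::Symbol;
    }
    return PrimitiveKind::Invalid;
}
```

For instance, `classify("0117")` yields `PrimitiveKind::Integer` and `classify("list->vector?")` yields `PrimitiveKind::Symbol`. In practice flex does this work for you; the sketch only spells out what the patterns have to encode.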

A list is an ordered sequence of elements, where each element can be a primitive or a sublist. The elements are separated by whitespace. List boundaries are marked by an open parenthesis at the front, and a close parenthesis at the tail. Here are some lists:

• (1 2 3 4 5)
• ("hi" there "hey" (there "ho" ((there))))
• (())

Your job is to leverage off the abbreviated scheme-lexer.h/.l files and some C++ classes representing the various Scheme components (you’ll hear the buzzword S-Expression if you overhear Scheme programmers talking), and to figure out how to build a parse tree for an arbitrary Scheme S-Expression. This isn’t so much about flex as it is about ad hoc parsing. The challenge here is to wire up a tree representation of a Scheme expression. Ultimately you’ll have bison do the same thing for you, and after you do this you’ll appreciate the wiring capabilities of bison quite a bit. You’ll also be prompted to start thinking about C++ virtual inheritance a bit. Here’re the relevant parts of the flex files:
scheme-scanner.h

    typedef union {
      int intValue;
      char *textValue;
    } YYSTYPE;

    extern YYSTYPE yylval;

    typedef enum {
      T_IntegerConstant = 256,
      T_StringConstant,
      T_Symbol
    } TokenType;

    int yylex();

scheme-scanner.l

    %}

    Whitespace         ([ \t\n\r]+)
    IntegerConstant    ([0-9]+)
    StringConstant     (\"[^\"]*\")
    Symbol             ([a-zA-Z][a-zA-Z0-9_\-\>?!]*)

    %%

    {Whitespace}       { ; }
    {IntegerConstant}  { yylval.intValue = strtol(yytext, NULL, 10);
                         return T_IntegerConstant; }
    {StringConstant}   { yylval.textValue = strdup(yytext);
                         return T_StringConstant; }
    {Symbol}           { yylval.textValue = strdup(yytext);
                         return T_Symbol; }
    [()]               { return yytext[0]; }
    .                  { cerr << "Don't like this character: \'"
                              << yytext[0] << "\'" << endl; }

    %%

Here’re the C++ class definitions that correspond to text, integers, and lists. Note that symbols and string constants can be handled by the same class.
    class SExpression {
     public:
      virtual ~SExpression() {}
     protected:
      SExpression() {}
    };

    class Integer : public SExpression {
     public:
      Integer(int n) : n(n) {}
      virtual ~Integer() {}
     private:
      int n;
    };

    class String : public SExpression {
     public:
      String(const char *text) : text(text) {}
      virtual ~String() {}
     private:
      string text; // std::string
    };

    class SExpressionList : public SExpression {
     public:
      SExpressionList();
      virtual ~SExpressionList();
      virtual void append(SExpression *expr);
     private:
      vector<SExpression *> elements;
    };
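To get a feel for the structure you are aiming at, here is a sketch that wires up the tree for `(1 "hi" (there))` by hand, with no lexer involved. The classes are repeated so the example stands alone. The bodies of the `SExpressionList` constructor, destructor, and `append` are assumptions (the handout only declares them), and `size()` and `buildExample` are hypothetical additions used only to check the result.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class SExpression {
 public:
  virtual ~SExpression() {}
 protected:
  SExpression() {}
};

class Integer : public SExpression {
 public:
  Integer(int n) : n(n) {}
 private:
  int n;
};

class String : public SExpression {
 public:
  String(const char *text) : text(text) {}
 private:
  std::string text;
};

class SExpressionList : public SExpression {
 public:
  SExpressionList() {}
  // Assumption: the list owns its elements and frees them on destruction.
  virtual ~SExpressionList() {
    for (SExpression *e : elements) delete e;
  }
  virtual void append(SExpression *expr) { elements.push_back(expr); }
  // Hypothetical accessor, not in the handout; added so the tree can be inspected.
  std::size_t size() const { return elements.size(); }
 private:
  std::vector<SExpression *> elements;
};

// Hand-build the tree for: (1 "hi" (there))
SExpressionList *buildExample() {
  SExpressionList *inner = new SExpressionList;
  inner->append(new String("there"));   // symbols share the String class

  SExpressionList *outer = new SExpressionList;
  outer->append(new Integer(1));
  outer->append(new String("\"hi\""));  // text as the lexer delivers it, quotes included
  outer->append(inner);                 // a sublist is just another element
  return outer;
}
```

The outer list ends up with three elements, the last of which is itself a list. Because `append` takes an `SExpression *`, primitives and sublists are stored the same way, which is what makes a recursive reader natural.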

Write a function called readList which repeatedly calls yylex to read in all of the tokens making up a Scheme list (possibly empty, possibly containing just primitives, or maybe containing sublists and subsublists). You can assume that stdin feeds in exactly one perfectly formatted Scheme list. Assume the following prototype (where lookahead refers to a master TokenType that’s been initialized to the '(' returned by the first call to yylex).


First, we’ll draw the data structure representation of an arbitrary Scheme list, just to illustrate what you’re working toward. You’ll then all work together to write the code.

Solution 1: Finite Automata and Regular Grammars

a.) Draw a DFA that accepts all strings in (a + b)* that do not contain bababb as a substring.

[Figure: DFA with start state A. States A through F are accepting; Z is a dead state.]

    State   on a   on b
    A       A      B
    B       C      B
    C       A      D
    D       E      B
    E       A      F
    F       E      Z
    Z       Z      Z
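One way to convince yourself the DFA is right is to run it. The sketch below encodes the transitions of the solution DFA (the same ones the productions in part b imitate) as a table and walks a string through it. The names `kNext` and `accepts` are made up for illustration.

```cpp
#include <string>

// Transition table: rows are states A..F and Z (numbered 0..6),
// columns are the inputs 'a' (0) and 'b' (1).
const int kNext[7][2] = {
    {0, 1},  // A
    {2, 1},  // B
    {0, 3},  // C
    {4, 1},  // D
    {0, 5},  // E
    {4, 6},  // F
    {6, 6},  // Z (dead state)
};

// Returns true when the DFA accepts s, i.e. the run does not end in Z.
// Assumes s contains only the characters 'a' and 'b'.
bool accepts(const std::string& s) {
    int state = 0;  // start in A
    for (char c : s) {
        state = kNext[state][c == 'b' ? 1 : 0];
    }
    return state != 6;
}
```

For example, `accepts("bababa")` is true, while `accepts("abababbab")` is false because the run enters Z after reading the embedded bababb and can never leave.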

b.) Present a context-free grammar that generates the language accepted by your DFA from part a. [Make sure you understand how to do this for any DFA, not just this one.] The most obvious approach is to set each production to imitate some transition in the DFA. Here’s what I was thinking:
    A → aA | bB | ε
    B → aC | bB | ε
    C → aA | bD | ε
    D → aE | bB | ε
    E → aA | bF | ε
    F → aE | ε

Since Z is a dead state, we don’t need to mention it in the CFG (though including it isn’t a mistake, just unnecessary).
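The same recipe works for any DFA: one nonterminal per state, one production per transition, and an empty production for each accepting state. Here is a sketch of that conversion, assuming a hypothetical `Dfa` struct in which the caller marks dead states so they can be left out; `toGrammar` and `handoutDfa` are likewise illustrative names, and `eps` stands in for the empty string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical DFA representation, for illustration only.
struct Dfa {
    std::vector<std::string> names;      // one name per state
    std::string alphabet;                // input symbols, e.g. "ab"
    std::vector<std::vector<int>> next;  // next[state][symbol index]
    std::vector<bool> accepting;
    std::vector<bool> dead;              // states to leave out of the grammar
};

// Emit one production line per live state: a transition q --x--> r becomes
// the alternative "xR", and an accepting state also gets "eps".
std::vector<std::string> toGrammar(const Dfa& d) {
    std::vector<std::string> lines;
    for (std::size_t q = 0; q < d.names.size(); ++q) {
        if (d.dead[q]) continue;
        std::string rhs;
        for (std::size_t i = 0; i < d.alphabet.size(); ++i) {
            int target = d.next[q][i];
            if (d.dead[target]) continue;  // skip transitions into dead states
            if (!rhs.empty()) rhs += " | ";
            rhs += d.alphabet[i];
            rhs += d.names[target];
        }
        if (d.accepting[q]) {
            if (!rhs.empty()) rhs += " | ";
            rhs += "eps";
        }
        lines.push_back(d.names[q] + " -> " + rhs);
    }
    return lines;
}

// The DFA from part a, with Z marked as dead.
Dfa handoutDfa() {
    Dfa d;
    d.names = {"A", "B", "C", "D", "E", "F", "Z"};
    d.alphabet = "ab";
    d.next = {{0, 1}, {2, 1}, {0, 3}, {4, 1}, {0, 5}, {4, 6}, {6, 6}};
    d.accepting = {true, true, true, true, true, true, false};
    d.dead = {false, false, false, false, false, false, true};
    return d;
}
```

Calling `toGrammar(handoutDfa())` gives six lines, starting with `A -> aA | bB | eps` and ending with `F -> aE | eps`, which matches the grammar above. The first state's nonterminal is the start symbol, provided the start state is listed first.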

Solution 2: flex Basics

Given the following flex rules:
    \n                     { printf("1"); }
    ...                    { printf("2"); }
    [a-c]+                 { printf("3"); }
    [A-C]+                 { printf("4"); }
    [1-3]+                 { printf("5"); }
    [acB13]+               { printf("6"); }
    [^a-c]+[^A-C]+[1-9]    { printf("7"); }

What will be the output on the following input? Assume there is nothing to the right of the last visible character on each line except a newline character and that the following input is to be scanned all at once:

    abcABC123
    321CBAcba
    BaB1cAb
    AAAAacac1231

The output is:

    2722164371

    2 matches abc
    7 matches ABC123\n321
    2 matches CBA
    2 matches cba
    1 matches \n
    6 matches BaB1c
    4 matches A
    3 matches b
    7 matches \nAAAAacac1231
    1 matches \n   (we didn’t require this one)
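The answer follows from two rules flex applies at every position: take the longest match, and on a tie take the rule listed first. The sketch below imitates that with `std::regex`, trying every prefix length from longest to shortest for each rule. It is a slow stand-in for illustration, not how flex works internally, and `scan` is a made-up name. It relies on `.` not matching a newline in the default ECMAScript grammar, which agrees with flex here.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Imitate flex on the seven rules above: at each position pick the rule with
// the longest match; ties go to the rule listed first. Returns the digits
// that the printf actions would have printed.
std::string scan(const std::string& input) {
    const std::vector<std::regex> rules = {
        std::regex("\n"),                   // rule 1
        std::regex("..."),                  // rule 2 ('.' never matches '\n')
        std::regex("[a-c]+"),               // rule 3
        std::regex("[A-C]+"),               // rule 4
        std::regex("[1-3]+"),               // rule 5
        std::regex("[acB13]+"),             // rule 6
        std::regex("[^a-c]+[^A-C]+[1-9]"),  // rule 7
    };
    std::string output;
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::size_t bestLen = 0;
        std::size_t bestRule = 0;
        for (std::size_t r = 0; r < rules.size(); ++r) {
            // Only lengths strictly greater than bestLen can displace an
            // earlier rule, which is what breaks ties in its favor.
            for (std::size_t len = input.size() - pos; len > bestLen; --len) {
                if (std::regex_match(input.begin() + pos,
                                     input.begin() + pos + len, rules[r])) {
                    bestLen = len;
                    bestRule = r;
                    break;
                }
            }
        }
        if (bestLen == 0) {  // no rule matched; flex's default rule would echo the character
            ++pos;
            continue;
        }
        output += static_cast<char>('1' + bestRule);
        pos += bestLen;
    }
    return output;
}
```

Feeding it the four input lines, each followed by a newline, returns `2722164371`. Note how rule 2 beats rule 3 on abc only because of the tie-break, and how rule 7 wins by swallowing the newline.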

Solution 3: Manually Building a Parse Tree As a Client of flex