
CS143 Handout 08

Autumn 2007 October 5, 2007


Section: All Things Lexical
Problem 1: Finite Automata and Regular Grammars
a.) Draw a DFA that accepts all strings in (a + b)* that do not contain bababb as a
substring.
b.) Present a context-free grammar that generates the language accepted by your DFA
from part a. [Make sure you understand how to do this for any DFA, not just this
one.]

Problem 2: flex Basics


Given the following flex rules:

\n { printf("1"); }
... { printf("2"); }
[a-c]+ { printf("3"); }
[A-C]+ { printf("4"); }
[1-3]+ { printf("5"); }
[acB13]+ { printf("6"); }
[^a-c]+[^A-C]+[1-9] { printf("7"); }

What will be the output on the following input? Assume there is nothing to the right of
the last visible character on each line except a newline character and that the following
input is to be scanned all at once:

abcABC123
321CBAcba
BaB1cAb
AAAAacac1231

Problem 3: Manually Building a Parse Tree As a Client of flex


Let’s deal with a subset of Scheme so small that we can build a lexer in half the time it
takes to teach a section. There are no keywords whatsoever. Just lists and primitives.
Scheme primitives include integer constants, string constants, and symbols.

• Integer constants are decimal strings with as many leading zeroes as you want.
There are no negative numbers in our subset. Legitimate integer constants
include 15, 0117, 00000000040, and 0.
• String constants are double quote-delimited strings of zero or more characters.
The characters themselves can be anything whatsoever, except for ". Schemers
just need to do without.
• Symbols are text strings without double quotes. They always start with a letter,
but can otherwise contain letters, numbers, and a few other characters: '_', '-',
'>', '?', and '!'.
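
The three definitions above can be tried out by hand before any flex rules are written. What follows is a minimal C++ sketch, not part of the handout's code: the names Kind and classify are hypothetical, and the function assumes it is handed one complete lexeme with no surrounding whitespace.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical names for illustration; the handout does not define them.
enum Kind { K_Integer, K_String, K_Symbol, K_Invalid };

// Classify one complete lexeme according to the three definitions above.
Kind classify(const std::string& s) {
    if (s.empty()) return K_Invalid;
    const char first = s[0];

    if (std::isdigit((unsigned char)first)) {        // integer: decimal digits only
        for (std::size_t i = 0; i < s.size(); ++i)
            if (!std::isdigit((unsigned char)s[i])) return K_Invalid;
        return K_Integer;
    }
    if (first == '"') {                              // string: "..." with no inner quote
        if (s.size() < 2) return K_Invalid;
        // The first quote after the opening one must be the final character.
        return s.find('"', 1) == s.size() - 1 ? K_String : K_Invalid;
    }
    if (std::isalpha((unsigned char)first)) {        // symbol: a letter, then allowed characters
        const std::string extras = "_->?!";
        for (std::size_t i = 1; i < s.size(); ++i) {
            const char c = s[i];
            if (!std::isalnum((unsigned char)c) && extras.find(c) == std::string::npos)
                return K_Invalid;
        }
        return K_Symbol;
    }
    return K_Invalid;
}
```

Under these assumptions, classify("0117") yields K_Integer, classify("there") yields K_Symbol, and classify("-5") yields K_Invalid because our subset has no negative numbers.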

A list is an ordered sequence of elements, where each element can be a primitive or a
sublist. The elements are separated by whitespace. List boundaries are marked by an
open parenthesis at the front, and a close parenthesis at the tail. Here are some lists:

• (1 2 3 4 5)
• ("hi" there "hey" (there "ho" ((there))))
• (())

Your job is to build on the abbreviated scheme-scanner.h/.l files and some C++ classes
representing the various Scheme components (you’ll hear the buzzword
S-Expression if you overhear Scheme programmers talking), and to figure out how to
build a parse tree for an arbitrary Scheme S-Expression. This isn’t so much about flex as
it is about ad hoc parsing. The challenge here is to wire up a tree representation of a
Scheme expression. Ultimately you’ll have bison do the same thing for you, and after
you do this you’ll appreciate the wiring capabilities of bison quite a bit. You’ll also be
prompted to start thinking about C++ virtual inheritance a bit.

Here’re the relevant parts of the flex files:


scheme-scanner.h

typedef union {
    int intValue;
    char *textValue;
} YYSTYPE;

extern YYSTYPE yylval;

typedef enum {
    T_IntegerConstant = 256, T_StringConstant, T_Symbol
} TokenType;

int yylex();

scheme-scanner.l

%}

Whitespace ([ \t\n\r]+)
IntegerConstant ([0-9]+)
StringConstant (\"[^\"]*\")
Symbol ([a-zA-Z][a-zA-Z0-9_\-\>?!]*)

%%
{Whitespace} { ; }
{IntegerConstant} { yylval.intValue = strtol(yytext, NULL, 10);
return T_IntegerConstant; }
{StringConstant} { yylval.textValue = strdup(yytext);
return T_StringConstant; }
{Symbol} { yylval.textValue = strdup(yytext);
return T_Symbol; }
[()] { return yytext[0]; }


. { cerr << "Don't like this character: \'"
<< yytext[0] << "\'" << endl; }
%%

Here’re the C++ class definitions that correspond to text, integers, and lists. Note that
symbols and string constants can be handled by the same class.

class SExpression {

  public:
    virtual ~SExpression() {}

  protected:
    SExpression() {}
};

class Integer : public SExpression {

  public:
    Integer(int n) : n(n) {}
    virtual ~Integer() {}

  private:
    int n;
};

class String : public SExpression {

  public:
    String(const char *text) : text(text) {}
    virtual ~String() {}

  private:
    string text; // std::string
};

class SExpressionList : public SExpression {

  public:
    SExpressionList();
    virtual ~SExpressionList();
    virtual void append(SExpression *expr);

  private:
    vector<SExpression *> elements;
};
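
The handout declares the SExpressionList constructor, destructor, and append without defining them. One plausible set of definitions is sketched below. It assumes the list owns its elements and deletes them on destruction, which the handout does not state. The size() accessor is a hypothetical helper added only so the sketch can be checked, and String is left out to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class SExpression {
  public:
    virtual ~SExpression() {}
  protected:
    SExpression() {}
};

class Integer : public SExpression {
  public:
    Integer(int n) : n(n) {}
  private:
    int n;
};

class SExpressionList : public SExpression {
  public:
    SExpressionList();
    virtual ~SExpressionList();
    virtual void append(SExpression *expr);
    std::size_t size() const { return elements.size(); }  // hypothetical, not in the handout
  private:
    std::vector<SExpression *> elements;
};

SExpressionList::SExpressionList() {}

// Assumption: the list owns its elements. Because ~SExpression is virtual,
// deleting through the base pointer also tears down nested sublists.
SExpressionList::~SExpressionList() {
    for (std::size_t i = 0; i < elements.size(); ++i)
        delete elements[i];
}

void SExpressionList::append(SExpression *expr) {
    elements.push_back(expr);
}
```

With this ownership rule, deleting the outermost list is enough to release the whole tree, which is why the virtual destructor on SExpression matters.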

Write a function called readList which repeatedly calls yylex to read in all of the tokens
making up a Scheme list (possibly empty, possibly containing just primitives, or maybe
containing sublists and subsublists). You can assume that stdin feeds in exactly one
perfectly formatted Scheme list. Assume the following prototype (where lookahead
refers to a master TokenType that’s been initialized to the '(' returned by the first call to
yylex):

SExpressionList *readList(TokenType& lookahead);

First, we’ll draw the data structure representation of an arbitrary Scheme list, just to
illustrate what you’re working toward. You’ll then all work together to write the code.

Solution 1: Finite Automata and Regular Grammars


a.) Draw a DFA that accepts all strings in (a + b)* that do not contain bababb as a
substring.

State   on a   on b
A       A      B      (start)
B       C      B
C       A      D
D       E      B
E       A      F
F       E      Z
Z       Z      Z      (dead)

States A through F are accepting; Z is not.
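
One way to check an automaton like this is to run it on a few strings. Below is a minimal C++ sketch whose transitions follow the productions given in part b, plus the dead state Z. The table name NEXT and the function accepts are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// States A..F are accepting; Z is the dead state entered once bababb has been seen.
enum State { A, B, C, D, E, F, Z };

// NEXT[state][0] is the move on 'a'; NEXT[state][1] is the move on 'b'.
static const State NEXT[7][2] = {
    /* A */ { A, B },
    /* B */ { C, B },
    /* C */ { A, D },
    /* D */ { E, B },
    /* E */ { A, F },
    /* F */ { E, Z },
    /* Z */ { Z, Z },
};

// True if s is a string over {a, b} that does not contain bababb.
bool accepts(const std::string& s) {
    State state = A;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] != 'a' && s[i] != 'b') return false;   // not in (a + b)*
        state = NEXT[state][s[i] == 'b'];
    }
    return state != Z;
}
```

Each state records how much of bababb has just been matched, so a mismatch falls back to the longest prefix of the pattern that is still a suffix of the input read so far (for example, F on a returns to E because baba is still matched).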

b.) Present a context-free grammar that generates the language accepted by your DFA
from part a. [Make sure you understand how to do this for any DFA, not just this
one.]

The most obvious approach is to set each production to imitate some transition in
the DFA. Here’s what I was thinking:

A → aA | bB | ε
B → aC | bB | ε
C → aA | bD | ε
D → aE | bB | ε
E → aA | bF | ε
F → aE | ε

Since Z is a dead state, we don’t need to include any mention of it in the CFG
(though it’s not a mistake to, just unnecessary).
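
The same recipe works for any DFA: one nonterminal per state, a production X → cY for each transition from X to Y on c, and X → ε for each accepting state X. The C++ sketch below applies it mechanically. The Transition struct, the toGrammar function, and the convention of passing the dead states in explicitly are assumptions made for this example, not something the handout prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical DFA description: each state is named by a single letter.
struct Transition { char from; char symbol; char to; };

// The DFA from part a, including the dead state Z.
static const Transition BABABB[] = {
    {'A','a','A'}, {'A','b','B'}, {'B','a','C'}, {'B','b','B'},
    {'C','a','A'}, {'C','b','D'}, {'D','a','E'}, {'D','b','B'},
    {'E','a','A'}, {'E','b','F'}, {'F','a','E'}, {'F','b','Z'},
    {'Z','a','Z'}, {'Z','b','Z'},
};

// Emit one line per live state X: "X -> cY|...|ε". The ε alternative is added
// only when X is accepting. Dead states and transitions into them are skipped.
std::vector<std::string> toGrammar(const std::string& states,
                                   const std::string& accepting,
                                   const std::string& dead,
                                   const Transition *delta, std::size_t count) {
    std::vector<std::string> productions;
    for (std::size_t i = 0; i < states.size(); ++i) {
        const char x = states[i];
        if (dead.find(x) != std::string::npos) continue;
        std::string rhs;
        for (std::size_t j = 0; j < count; ++j) {
            const Transition& t = delta[j];
            if (t.from != x || dead.find(t.to) != std::string::npos) continue;
            if (!rhs.empty()) rhs += "|";
            rhs += t.symbol;
            rhs += t.to;
        }
        if (accepting.find(x) != std::string::npos) {
            if (!rhs.empty()) rhs += "|";
            rhs += "ε";
        }
        productions.push_back(std::string(1, x) + " -> " + rhs);
    }
    return productions;
}
```

Calling toGrammar("ABCDEFZ", "ABCDEF", "Z", BABABB, 14) produces six productions that match the grammar above. Passing an empty dead set instead also yields a correct grammar, just with the unnecessary Z productions included.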

Solution 2: flex Basics


Given the following flex rules:

\n { printf("1"); }
... { printf("2"); }
[a-c]+ { printf("3"); }
[A-C]+ { printf("4"); }
[1-3]+ { printf("5"); }
[acB13]+ { printf("6"); }
[^a-c]+[^A-C]+[1-9] { printf("7"); }

What will be the output on the following input? Assume there is nothing to the right of
the last visible character on each line except a newline character and that the following
input is to be scanned all at once:

abcABC123
321CBAcba
BaB1cAb
AAAAacac1231

Output: 2722164371

2 matches abc
7 matches ABC123\n321
2 matches CBA
2 matches cba
1 matches \n
6 matches BaB1c
4 matches A
3 matches b
7 matches \nAAAAacac1231
1 matches \n (we didn’t require this one)
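
The trace above follows from two rules flex applies at every position: take the longest match, and on a tie prefer the rule listed first. The C++ sketch below hand-codes the seven patterns and applies those two rules. It is not how flex works internally (flex compiles the rules into a single DFA), and every function name here is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

typedef std::size_t (*Rule)(const std::string&, std::size_t);

static bool between(char c, char lo, char hi) { return lo <= c && c <= hi; }

// Length of the longest run starting at s[i] whose characters all come from set.
static std::size_t run(const std::string& s, std::size_t i, const std::string& set) {
    std::size_t j = i;
    while (j < s.size() && set.find(s[j]) != std::string::npos) ++j;
    return j - i;
}

static std::size_t rule1(const std::string& s, std::size_t i) { return s[i] == '\n' ? 1 : 0; }
static std::size_t rule2(const std::string& s, std::size_t i) {   // ... (dot never matches \n)
    if (i + 3 > s.size()) return 0;
    for (std::size_t k = i; k < i + 3; ++k) if (s[k] == '\n') return 0;
    return 3;
}
static std::size_t rule3(const std::string& s, std::size_t i) { return run(s, i, "abc"); }
static std::size_t rule4(const std::string& s, std::size_t i) { return run(s, i, "ABC"); }
static std::size_t rule5(const std::string& s, std::size_t i) { return run(s, i, "123"); }
static std::size_t rule6(const std::string& s, std::size_t i) { return run(s, i, "acB13"); }
static std::size_t rule7(const std::string& s, std::size_t i) {   // [^a-c]+[^A-C]+[1-9]
    std::size_t best = 0;
    // p and q are the exclusive ends of the first and second parts; try every split.
    for (std::size_t p = i + 1; p <= s.size() && !between(s[p - 1], 'a', 'c'); ++p)
        for (std::size_t q = p + 1; q <= s.size() && !between(s[q - 1], 'A', 'C'); ++q)
            if (q < s.size() && between(s[q], '1', '9'))
                best = std::max(best, q + 1 - i);
    return best;
}

// Scan the input the way the flex rules would and return what they print.
std::string scan(const std::string& input) {
    static const Rule rules[] = { rule1, rule2, rule3, rule4, rule5, rule6, rule7 };
    std::string out;
    std::size_t i = 0;
    while (i < input.size()) {
        std::size_t bestLen = 0;
        int bestRule = 0;
        for (int r = 0; r < 7; ++r) {
            const std::size_t len = rules[r](input, i);
            if (len > bestLen) { bestLen = len; bestRule = r + 1; }  // strict > keeps the earlier rule on ties
        }
        if (bestLen == 0) { out += input[i++]; continue; }  // flex's default rule echoes the character
        out += char('0' + bestRule);
        i += bestLen;
    }
    return out;
}
```

Feeding scan the four input lines, each followed by a newline, reproduces the output 2722164371. The first token shows the tie-break at work: rules 2 and 3 both match abc with length 3, and rule 2 wins because it is listed first.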

Solution 3: Manually Building a Parse Tree As a Client of flex


SExpression *readNextExpression(TokenType& lookahead); // defined below

SExpressionList *readList(TokenType& lookahead)
{
    assert(lookahead == '(');
    lookahead = TokenType(yylex()); // consume '('
    SExpressionList *exprList = new SExpressionList();
    while (lookahead != ')') {
        exprList->append(readNextExpression(lookahead));
    }
    lookahead = TokenType(yylex()); // consume ')'
    return exprList;
}

SExpression *readNextExpression(TokenType& lookahead)
{
    SExpression *primitive = NULL;
    switch (lookahead) {
        case '(': return readList(lookahead);
        case T_IntegerConstant: primitive = new Integer(yylval.intValue);
                                break;
        case T_Symbol:
        case T_StringConstant: primitive = new String(yylval.textValue);
                               break;
        case ')': // readList never passes us a ')'
        default: assert(false);
    }
    lookahead = TokenType(yylex()); // consume the primitive
    return primitive;
}
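
To watch the tree get wired up without running flex, the scanner can be replaced by a stub that replays a canned token stream. The following is a self-contained C++ sketch under several assumptions: StubToken, stubNext, parseStub, and the toString methods are hypothetical additions, textValue is declared const so the stub can point at string literals, and readNextExpression is condensed from the switch above.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Stand-ins for the scanner interface; flex normally supplies these.
typedef union { int intValue; const char *textValue; } YYSTYPE;  // const added for the stub
YYSTYPE yylval;
typedef enum { T_IntegerConstant = 256, T_StringConstant, T_Symbol } TokenType;

struct StubToken { int type; int intValue; const char *text; };
static const StubToken *stubNext = NULL;  // canned stream; type 0 marks the end

int yylex() {
    if (stubNext->type == 0) return 0;    // stay at end of input, as flex returns 0 at EOF
    const StubToken& t = *stubNext++;
    if (t.type == T_IntegerConstant) yylval.intValue = t.intValue;
    else if (t.type == T_StringConstant || t.type == T_Symbol) yylval.textValue = t.text;
    return t.type;
}

class SExpression {
  public:
    virtual ~SExpression() {}
    virtual std::string toString() const = 0;  // hypothetical, for inspecting the tree
  protected:
    SExpression() {}
};
class Integer : public SExpression {
  public:
    Integer(int n) : n(n) {}
    virtual std::string toString() const { std::ostringstream os; os << n; return os.str(); }
  private:
    int n;
};
class String : public SExpression {
  public:
    String(const char *text) : text(text) {}
    virtual std::string toString() const { return text; }
  private:
    std::string text;
};
class SExpressionList : public SExpression {
  public:
    virtual ~SExpressionList() {              // assumption: the list owns its elements
        for (std::size_t i = 0; i < elements.size(); ++i) delete elements[i];
    }
    virtual void append(SExpression *expr) { elements.push_back(expr); }
    virtual std::string toString() const {
        std::string s = "(";
        for (std::size_t i = 0; i < elements.size(); ++i)
            s += (i > 0 ? " " : "") + elements[i]->toString();
        return s + ")";
    }
  private:
    std::vector<SExpression *> elements;
};

SExpression *readNextExpression(TokenType& lookahead);

SExpressionList *readList(TokenType& lookahead) {
    assert(lookahead == '(');
    lookahead = TokenType(yylex());           // consume '('
    SExpressionList *exprList = new SExpressionList();
    while (lookahead != ')') exprList->append(readNextExpression(lookahead));
    lookahead = TokenType(yylex());           // consume ')'
    return exprList;
}

SExpression *readNextExpression(TokenType& lookahead) {
    if (lookahead == '(') return readList(lookahead);
    SExpression *primitive;
    if (lookahead == T_IntegerConstant) primitive = new Integer(yylval.intValue);
    else primitive = new String(yylval.textValue);
    lookahead = TokenType(yylex());           // consume the primitive
    return primitive;
}

// Parse a canned token stream and return the tree printed back as text.
std::string parseStub(const StubToken *tokens) {
    stubNext = tokens;
    TokenType lookahead = TokenType(yylex());
    SExpressionList *list = readList(lookahead);
    const std::string printed = list->toString();
    delete list;
    return printed;
}
```

Replaying the tokens for (1 (there "ho") ()) through parseStub should give back the same text. The mutual recursion between readList and readNextExpression is the part to notice: each nested '(' becomes one more SExpressionList hanging off its parent's elements vector.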
