COMP 3512 Assignment 2

When a program is compiled, typically the first step is to group the characters in the program text into tokens. This process is called scanning or lexing. In this assignment, we’ll write a simple scanner.


We’ll use C as an example language. When the following program is scanned:

    int main() { /* hello world program */
      printf("hello, world\n");
      return 0;
    }

the scanner groups the characters into the following tokens:

    int  main  (  )  {  printf  (  "hello, world\n"  )  ;  return  0  ;  }

Note that comments are skipped. The scanner may also tag each token with information about its location (its line & character numbers) & its type. (The location information may be useful for diagnostic messages.) For example, int is a keyword & it starts at line 1, character 1, main is an identifier & it starts at line 1, character 5, etc.


Types of Tokens
For simplicity, we’ll only handle the following 6 types of tokens (with examples from C):

1. string constants, e.g., "hello"
2. integer constants, e.g., 123
3. identifiers, e.g., sum, n
4. keywords, e.g., int, if
5. operators, e.g., +, ++
6. separators, e.g., [ ] ;
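As a rough illustration, the six token types and the location tag could be represented with an enumeration and a small record. This is only a sketch: the names TokenType, Token and typeName are hypothetical, and the assignment does not prescribe this layout. The words returned by typeName are the type words that appear in the program's output.

```cpp
#include <string>

// Hypothetical names; the assignment only asks for "a Token class".
enum class TokenType { String, Int, Identifier, Keyword, Operator, Separator };

struct Token {
    TokenType type;
    std::string text;  // characters of the token (quotes stripped for strings)
    int line;          // 1-based line number of the first character
    int column;        // 1-based character number within that line
};

// Maps a token type to the word used in the program's output.
inline const char* typeName(TokenType t) {
    switch (t) {
        case TokenType::String:     return "STRING";
        case TokenType::Int:        return "INT";
        case TokenType::Identifier: return "IDENTIFIER";
        case TokenType::Keyword:    return "KEYWORD";
        case TokenType::Operator:   return "OPERATOR";
        case TokenType::Separator:  return "SEPARATOR";
    }
    return "UNKNOWN";  // not reached for valid enumerators
}
```

Keeping the type as an enumeration rather than a string means the scanner can compare types cheaply and convert to text only when printing.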

The language we are going to scan is C-like except that some things can be configured:

• String constants: these are enclosed in double quotes just as in C.
• Integer constants: they are the same as those in C.
• Identifiers: same as those in C. An identifier is a sequence of letters, digits & underscores & must not begin with a digit. Some valid examples are n2, _ & is_valid. An invalid identifier is 2upper.
• Keywords: these are identifiers reserved by the language. To make the scanner more flexible, instead of hard-coding the keywords, they are specified in a configuration file. (See §3.)
• Separators: we’ll use the following as separators: ( ) [ ] { } , ; Note that this is a subset of those in C.
• Operators: as the number of operators may be quite large, they are specified in the configuration file. (See §3.)
• Comments: they are not tokens & are skipped. We allow the same 2 styles of comments as C++:
  – comments that start with /* (not within a string constant) & are terminated by the next */
  – comments that start with // (not within a string constant) & continue to the end of the line

Note that for simplicity, we only deal with integer constants. We don’t handle floating-point constants at all.
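The identifier, separator and comment rules above can be sketched as a few small helper functions. The function names are hypothetical, and the handling of an unterminated /* comment (treated as running to the end of the input) is an assumption, since the assignment does not say what should happen in that case.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True for characters that may appear anywhere in an identifier.
inline bool isIdentifierChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// An identifier is non-empty, uses only letters, digits and underscores,
// and does not begin with a digit.
inline bool isValidIdentifier(const std::string& s) {
    if (s.empty() || std::isdigit(static_cast<unsigned char>(s[0]))) return false;
    for (char c : s)
        if (!isIdentifierChar(c)) return false;
    return true;
}

// The fixed separator set: ( ) [ ] { } , ;
inline bool isSeparator(char c) {
    return std::string("()[]{},;").find(c) != std::string::npos;
}

// If a comment starts at text[pos], returns the index just past it;
// otherwise returns pos unchanged. Requires pos <= text.size() and that
// pos is not inside a string constant.
// Assumption: an unterminated /* comment runs to the end of the input.
inline std::size_t skipComment(const std::string& text, std::size_t pos) {
    if (text.compare(pos, 2, "//") == 0) {
        std::size_t eol = text.find('\n', pos);
        return eol == std::string::npos ? text.size() : eol;  // newline is left in place
    }
    if (text.compare(pos, 2, "/*") == 0) {
        std::size_t end = text.find("*/", pos + 2);
        return end == std::string::npos ? text.size() : end + 2;
    }
    return pos;
}
```

Note that skipComment only computes where the comment ends. A real scanner would still need to count the newlines inside a /* ... */ comment so that the line numbers of later tokens stay correct.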


Configuration File

The configuration file basically lists the keywords & the operators. Hence it has 2 kinds of sections: keyword sections & operator sections. Note that there can be multiple keyword & operator sections.

• A keyword section lists keywords. It is started by the word KEYWORDS: All words that follow are regarded as keywords until the start of another section or until the end of file.
• An operator section lists operators. It is started by the word OPERATORS: and, similar to a keyword section, lasts until the start of another section or until the end of file.

The following is an example configuration file:

    KEYWORDS:
      int static const
    OPERATORS:
      + - * / %
      += -= *= /= %=
      ++ --
    KEYWORDS:
      if else while for

Technically, everything can be on one line. The above example uses a more readable format.

Note that there are restrictions on keywords & operators. Keywords must satisfy the requirements for identifiers (they are basically “reserved identifiers”), i.e., each is a sequence of letters, digits & underscores & must not begin with a digit. An invalid keyword should be rejected & a warning message indicating the invalid keyword should be printed (to standard error).

An operator cannot contain whitespace or characters that are separators or that can be used in an identifier. For example, a+, .=, +2 are not valid operators. As with most languages, control & other non-printable characters are not allowed anywhere in the program text & hence can’t be used in operators.


Additional Information
Tokens are typically delimited by whitespace, but that is not always the case. This is evident from the line

    int main(){

In the above, the 2 separators ( and ) are not surrounded by whitespace.

Consider the line:

    n+++m;

Assuming that only + and ++ are valid operators, how should the line be tokenized? We have the following possibilities:

    n + + + m ;
    n ++ + m ;
    n + ++ m ;

Just as in C, our scanner is “greedy”: it will try to “consume” as many characters as possible & still come up with a valid token. This means that the scanner will come up with the second possibility. Similarly, n+++++m; will be tokenized as:

    n ++ ++ + m ;

Note that although it can be tokenized, it is not a valid C statement (for other reasons). As another example, +2 will be tokenized as:

    + 2

You’ll need to implement a Token class & a Scanner class that has a getToken method that returns the next token.
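The greedy rule for operators can be sketched as a longest-match lookup against the configured operator set. The function name longestOperatorAt is hypothetical, and the sketch assumes the operators have already been loaded into a std::set.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Returns the longest operator in `ops` that starts at text[pos],
// or an empty string if no operator matches there.
// Requires pos <= text.size().
inline std::string longestOperatorAt(const std::string& text, std::size_t pos,
                                     const std::set<std::string>& ops) {
    std::string best;
    for (const std::string& op : ops) {
        // compare() looks at no more than op.size() characters starting at pos
        if (op.size() > best.size() && text.compare(pos, op.size(), op) == 0)
            best = op;
    }
    return best;
}
```

With the operators + and ++, calling this at the first + of n+++m; yields ++, and calling it again two characters later yields +. That reproduces the second tokenization listed above. After each match the scanner would advance by the length of the returned operator.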


The Program

The program must be invoked with the name of a configuration file as a command-line argument. It reads the program text from standard input & outputs the tokens together with location & type information. For the hello world program in §1 & with the sample configuration file in §3, the output is:

    int     KEYWORD (1,1)
    main    IDENTIFIER      (1,5)
    (       SEPARATOR       (1,9)
    )       SEPARATOR       (1,10)
    {       SEPARATOR       (1,12)
    printf  IDENTIFIER      (2,3)
    (       SEPARATOR       (2,9)
    hello, world\n  STRING  (2,11)
    )       SEPARATOR       (2,26)
    ;       SEPARATOR       (2,27)
    return  KEYWORD (3,3)
    0       INT     (3,10)
    ;       SEPARATOR       (3,11)
    }       SEPARATOR       (4,1)

Note that the string token shown above is different from the one in §1: the double quotes around the string have been stripped. This is because the token is already tagged with the information that it is a string. Hence the double quotes are really not necessary.

The 2 numbers separated by a comma within brackets are the line number & character number. Both are counted from 1. The 3 parts of each line are separated by tabs. (In the above output, a tab has a width of 8 characters.)

The above output doesn’t show an operator. For an operator, the word OPERATOR would be printed.

Note that if an invalid character or token is encountered, the program should print an error message that includes the character or token & its location before exiting.


Additional Requirements

Do not use external variables. (This includes global variables.) You’ll need to implement any class or function that you use that is not in the standard C/C++ library.

We’ll be comparing the output of your program with expected output. Make sure your output adheres to the specification. Sample input, configuration & output files may be provided.


Submission & Grading

This assignment is due at noon, Wednesday, March 26, 2008. You’ll need to submit your assignment via Subversion. Further information will be provided.

If your program does not compile, you may receive zero for the assignment. Otherwise, the grade breakdown is approximately as follows:

    Code clarity                          10%
    Handling configuration file           10%
    Tokens (excluding type & location)    40%
    Type information                      20%
    Location information                  20%

