
Introduction to Computer Science

A Course Reader

Computer Science Department, Van Lang University

VAN LANG UNIVERSITY 2002 (For in-class use only)


Handout #0.

Course Information

0.1 Objectives and Prerequisites
0.2 Textbook and Handouts
0.3 References
0.4 Course Syllabus
0.5 Glossary

Handout #1.


1.1 Introduction to Computer Science
1.2 Data Models, Data Structures, and Algorithms
1.3 C Essentials
1.4 Basic Data Structures
1.5 Glossary

Handout #2.

Set Theory


2.1 The Set Data Model
2.2 Set Algebra
2.3 Implementation of Sets
2.4 Glossary

Handout #3.

The Relational Data Model


3.1 Binary Relations
3.2 Relations
3.3 Relational Algebra
3.4 Glossary

Handout #4.

The List Data Model


4.1 Basic Concepts
4.2 Stacks
4.3 Queues
4.4 Glossary

Handout #5.

The Tree Data Model


5.1 Basic Terminology
5.2 Implementation of Trees
5.3 Binary Trees and Binary Search Trees
5.4 Glossary

Handout #6.

The Graph Data Model


6.1 Basic Concepts
6.2 Implementation of Graphs
6.3 Connected Components of an Undirected Graph
6.4 Glossary

Handout #7.



7.1 Patterns and Pattern Matching
7.2 State Machines and Automata
7.3 Deterministic and Nondeterministic Automata
7.4 Glossary

Handout #8.

Regular Expressions


8.1 Introduction
8.2 Algebraic Laws for Regular Expressions
8.3 Glossary

Handout #9.



9.1 Context-Free Grammars
9.2 Languages from Grammars
9.3 Glossary

Handout #10. Parsing


10.1 Parse Trees
10.2 Constructing a Parse Tree
10.3 Glossary


K5 & K6, Computer Science Department, Van Lang University
Second semester -- Feb, 2002
Instructor: Trần Đức Quang

0.1 OBJECTIVES AND PREREQUISITES

This course is intended to provide a broad introduction (in English) to the theoretical and mathematical foundations of computer science. It will cover the principal themes of the Discrete Mathematics and the Data Structures and Algorithms courses you may have taken in previous semesters. Our goal is not to re-teach those themes; rather, we use them to create a good English environment in which you can study computer science in English as you would in your mother tongue.

I know that many of you are not yet comfortable with English, so for the first four weeks I will alternate between English and Vietnamese. After that, we must work in English during the lectures. You can still ask questions in Vietnamese after class, of course.

A working knowledge of C is the only background assumed in this course, and even that is helpful rather than strictly required. Do not be afraid of asking questions in English (in class) or in Vietnamese (out of class). These are opportunities to build working English skills and a better knowledge of computer science.

0.2 TEXTBOOK AND HANDOUTS

There is one required text for this course: Foundations of Computer Science (C Edition) by Alfred V. Aho and Jeffrey D. Ullman, W.H. Freeman, 1995. You may also need my translation of this text, Cơ sở của Khoa học Máy tính (Tập 1, 2, 3), for reference. From time to time, I will distribute handouts to guide you through the text.


0.3 REFERENCES

Here is a list of books, all in English, that you should read for this course and for others in the term.

1. Aho, Hopcroft, and Ullman [1983]. Data Structures and Algorithms, Addison-Wesley, Reading, Mass.
2. Cormen, Leiserson, and Rivest [1990]. Introduction to Algorithms, MIT Press.
3. Harbison, S. P., and Steele, G. L. [1995]. C: A Reference Manual, Fourth Edition, Prentice-Hall, Englewood Cliffs, NJ.
4. Knuth, D. E. [1997, 1998]. The Art of Computer Programming, Volumes 1, 2, 3, Addison-Wesley, Reading, Mass.
5. Microsoft Studio 6.0 Books Online [1998]. C Reference.
6. Rosen, K. H. [1999]. Discrete Mathematics and Its Applications, Fourth Edition, McGraw-Hill, Boston.

0.4 COURSE SYLLABUS

1. REVIEWS
   C Essentials: Data Objects & Statements
   Arrays & Linked Lists
   Reading: Sections 1.4 and 6.4.

2. SET THEORY
   The Set Data Model
   Set Algebra
   Reading: Sections 7.2 and 7.3.

3. THE RELATIONAL DATA MODEL
   Binary Relations
   Relations
   Relational Algebra
   Reading: Sections 7.7, 7.10, 8.2, and 8.7.

4. LISTS
   Basic Concepts
   Stacks
   Queues
   Reading: Sections 6.2, 6.6, and 6.8.


5. TREES
   Basic Terminology
   Implementation of Trees
   Binary Trees and Binary Search Trees
   Reading: Sections 5.2, 5.3, 5.6, and 5.7.

6. GRAPHS
   Basic Concepts
   Implementations of Graphs
   Connected Components of Undirected Graphs
   Reading: Sections 9.2, 9.3, and 9.4.

7. AUTOMATA
   Patterns and Pattern Matching
   Finite State Machines and Automata
   Deterministic and Nondeterministic Automata
   Reading: Sections 10.2 and 10.3.

8. REGULAR EXPRESSIONS
   Introduction
   Algebraic Laws for Regular Expressions
   Reading: Sections 10.5 and 10.7.

9. GRAMMARS
   Context-Free Grammars
   Languages from Grammars
   Reading: Sections 11.2 and 11.3.

10. PARSING
   Parse Trees
   Constructing a Parse Tree
   Reading: Section 11.4.

0.5 GLOSSARY

Computer Science: Tin học, Khoa học Máy tính. A field of knowledge studying all aspects of the design and use of computers.
Information Technology: Công nghệ Thông tin. An applied science, the study of the practical or industrial arts, in particular the merging of computing and high-speed communications links carrying data, sound, and video.
Textbook: Sách giáo khoa. A book giving instructions in the principles of a subject of study and used officially for a course.
Reference: Tài liệu tham khảo. A book or paper consulted for more information.
Handout: Dan e. A leaflet consisting of brief instructions.
Applications, application program: Chương trình ứng dụng, ứng dụng.
System program: Chương trình hệ thống.
Relational Algebra: Đại số quan hệ.

Fundamental courses in computer science:

Set Theory: Lý thuyết Tập hợp.
Discrete Mathematics: Toán rời rạc.
Probability and Statistics: Xác suất Thống kê.
Graph Theory: Lý thuyết Đồ thị.
Computer Architecture: Kiến trúc máy tính.
The Programming Language C: Ngôn ngữ lập trình C.
Object-Oriented Programming: Lập trình hướng đối tượng.
Data Structures and Algorithms: Cấu trúc dữ liệu và Thuật toán.
Programming Languages: Ngôn ngữ lập trình.
Automata and Formal Languages: Ngôn ngữ hình thức.
Compilers: Trình biên dịch, Chương trình dịch.
Operating Systems: Hệ điều hành.
Computer Networks: Mạng Máy tính.
Distributed Systems: Hệ phân tán (phân bố).
Databases: Cơ sở dữ liệu.
System Analysis and Design: Phân tích và Thiết kế hệ thống.
Software Engineering: Kỹ nghệ phần mềm (Công nghệ phần mềm).
Artificial Intelligence: Trí tuệ nhân tạo.



Major themes:
1. Introduction to Computer Science
2. Three Columns: Data Models, Data Structures, and Algorithms
3. C Essentials
4. Arrays and Linked Lists

Reading: Sections 1.1, 1.3, 1.4, and 6.4 (textbook); C Reference (Microsoft Studio 6.0 Books Online).

1.1 INTRODUCTION TO COMPUTER SCIENCE

Computer science is a science of abstraction: creating the right model of the real world.

Example: In Windows, we can use a small program called Paint for drawing simple pictures. Paint has tools such as Pencil, Brush, Airbrush, Eraser, etc. They are not real tools; rather, they are models of real tools.

How do we use a computer to solve a problem?
1. Choose the important features of the problem ==> analysis
2. Build a model to reflect those features ==> abstraction
3. Write a program to solve it ==> mechanization

How do we write a good computer program?
1. Design data structures
2. Design good algorithms
3. Choose an appropriate programming language to implement the program

Example: You are hired to implement a computer system for a travel agency. Your system must determine the best route for a traveler to get from location A to location B ("best" means shortest distance traveled). How would you do it?


1. Specify the locations that the system can handle.
2. Create a database of distances between all the locations. Such a database must have a structure that allows for easy updates and quick searches.
3. Create algorithms that will operate on the database.
4. Create an algorithm that finds the shortest route between two locations.
5. Implement a program that takes locations A and B as input and outputs the shortest route between the two.

This example contains several instances of abstraction:
- representing the locations and distances as names and numbers
- placing this data in some kind of abstract data structure in a computer
- creating abstract operations on this data structure, such as "Find" and "Find Shortest Distance"

1.2 DATA MODELS, DATA STRUCTURES, AND ALGORITHMS

There are three basic concepts in CS: data models, data structures, and algorithms.

1. Data models: The picture of data in an abstract form. In the previous example, we could use a graph as a model to represent locations and distances: a node with a name for each location, and a line with a number for the distance between two locations.

[Figure: two nodes, A and B, joined by a line labeled 15 (km)]

2. Data structures: Organizations in memory to hold data. For example, to represent locations in computer memory, we need small regions to hold names and some other information about the locations. The distance between two locations can be stored in a table structure with three fields: the first two for the names and the last for the distance.

3. Algorithms: A sequence of steps to solve a problem, often written as a guide for a computer to follow. Here is Euclid's algorithm from Knuth's book (TAOCP, Volume 1, p. 2):

Given two positive integers m and n, find their greatest common divisor, that is, the largest positive integer that evenly divides both m and n.

E1. [Find remainder.] Divide m by n and let r be the remainder. (We will have 0 ≤ r < n.)
E2. [Is it zero?] If r = 0, the algorithm terminates; n is the answer.
E3. [Reduce.] Set m ← n, n ← r, and go back to step E1.
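Steps E1 to E3 translate almost line by line into C. The following is a minimal sketch, not a listing from the textbook; the function name gcd and the use of unsigned int are assumptions made for illustration.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm (steps E1-E3).
   The name gcd is an assumption of this sketch.
   Both arguments must be positive, as the algorithm requires. */
unsigned int gcd(unsigned int m, unsigned int n)
{
    unsigned int r;

    assert(m > 0 && n > 0);
    for (;;) {
        r = m % n;      /* E1: find remainder, so 0 <= r < n */
        if (r == 0)     /* E2: is it zero? then n is the answer */
            return n;
        m = n;          /* E3: reduce, then go back to E1 */
        n = r;
    }
}
```

For instance, the call gcd(119, 544) goes through the remainders 119, 68, 51, 17, 0 and returns 17. Note that when m < n, the first pass simply exchanges the two values.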


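[Figure: two data objects in memory, with addresses shown on the left: the int object i at address 102 and the pointer object pt at address 68]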





1.3 C ESSENTIALS

The programming language C has two main parts: data and statements.

1. Data. During the execution of a program, data must be stored in a region of memory known as a data object or data item. Data objects have the following attributes:

Data types: A type is a set of values and a set of operations on those values. For example, the values of an integer type consist of integers in some specified range, and the operations on those values consist of addition, subtraction, inequality tests, and so forth. In C, two broad categories of types form the type system: basic or built-in types (char, int, long, float, etc.) and user-defined types (pointers, arrays, structures, unions).

Sizes: The number of bits used to represent the object. It depends on the type. For example, an object of type char is represented in 8 bits; an object of type int can be represented in either 16 or 32 bits, depending on the implementation.

Values: An object holds a value from a set of possible values called its domain. A value is either changeable or unchangeable. If the value is changeable, the object is a variable; otherwise, it is a constant.

Addresses: In a byte-addressable computer, an object has its own address, which is the address of the first of the bytes allocated to the object.

Names: An object may have a name that can be used to refer to it. Some data without a name can be written directly in the source code. They are called literals.

The above figure represents two data objects in memory. The numbers on the left are addresses. The object named i is of type int and holds the value 25. It is located at address 102. This address is the value of the object pt, an object of type "pointer to int". The address of the pointer pt is 68.

2. Statements. We use statements to direct a computer to do some task. A statement may indicate the flow of control or an operation to be performed in a program. In C, we have:

Flow-of-control statements: for, while, do-while (loops); if-else, switch-case (selections); and others (break, continue, return).

Operators: assignments, additions, multiplications, etc.

1.4 BASIC DATA STRUCTURES

In memory, there are only two physical data structures: arrays and linked lists.

1. Arrays: To hold a set of objects of the same type, we need a contiguous region large enough to accommodate those objects. From an intuitive point of view, it is best to think of an array as a sequence of boxes, one box for each data value in the array. Each value in an array is called an element. Each element of the array is identified by a numeric value called its index. Array indices begin at 0, so the nth element of an array has an index of n - 1.

Arrays have two properties that vary depending on the specific array:

element type: also referred to as the base type, this specifies the data type of the elements.
size: the maximum number of elements the array can hold. The size must be determined at declaration time.

In the figure, num is an array of type int with size 4. In general, the element type can be any type in the type system of C.


num:      5    8    3    10
index:    0    1    2    3
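The array num from the figure, and the objects i and pt from Section 1.3, can be written down directly in C. The sketch below is only an illustration: sumArray and valueAt are hypothetical helper names, not functions from the textbook or the C library.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the first `size` elements of an int array.
   sumArray is a hypothetical helper introduced for this sketch. */
int sumArray(const int a[], size_t size)
{
    int total = 0;
    size_t k;

    assert(a != NULL);
    for (k = 0; k < size; k++)
        total += a[k];      /* a[k] is the element with index k */
    return total;
}

/* Follow a pointer to int and return the value of the object
   it points to. valueAt is also a hypothetical helper. */
int valueAt(const int *pt)
{
    assert(pt != NULL);
    return *pt;
}
```

With the declaration int num[4] = {5, 8, 3, 10}; the first element is num[0] and the fourth is num[3], and sumArray(num, 4) adds all four. With int i = 25; int *pt = &i; the call valueAt(pt) yields the value held by i. The actual addresses (102 and 68 in the earlier figure) are chosen by the implementation, not by the programmer.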


2. Linked Lists. An array allocates memory for all its elements lumped together as one block of memory. In contrast, a linked list allocates space for each element separately, when needed, in its own block of memory called a "linked list element" or "node". The list gets its overall structure by using pointers to connect all its nodes together like the links in a chain. Each node contains two fields: a "data" field (info) to store whatever element type the list holds for its client, and a "next" field, which is a pointer used to link one node to the next node. This kind of linked list is called a singly linked list.

[Figure: a singly linked list of four nodes: Info1 (Element 1) -> Info2 (Element 2) -> Info3 (Element 3) -> Info4 (Element 4)]
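A singly linked list like the one in the figure can be sketched in C as follows. The type and function names (NODE, LINK, pushFront, firstInfo, length, freeList) are assumptions chosen for this illustration, and the info field is taken to be an int for simplicity.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct NODE *LINK;
struct NODE {
    int info;       /* the "data" field */
    LINK next;      /* pointer to the next node; NULL at the end */
};

/* Insert a new node in front of the list and return the new
   header. If malloc fails, the old header is returned unchanged. */
LINK pushFront(LINK header, int info)
{
    LINK node = (LINK) malloc(sizeof(struct NODE));

    if (node == NULL)
        return header;
    node->info = info;
    node->next = header;
    return node;
}

/* Return the info field of the first node of a nonempty list. */
int firstInfo(LINK header)
{
    assert(header != NULL);
    return header->info;
}

/* Count the nodes by following the next pointers. */
int length(LINK header)
{
    int n = 0;

    for (; header != NULL; header = header->next)
        n++;
    return n;
}

/* Release every node of the list. */
void freeList(LINK header)
{
    while (header != NULL) {
        LINK next = header->next;
        free(header);
        header = next;
    }
}
```

Starting from an empty list (a NULL header), three calls to pushFront with the values 3, 2, 1 build the list 1 -> 2 -> 3. Each node is allocated separately, which is the contrast with arrays drawn in the text.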

Each node may also contain other fields, such as a "previous" field holding a pointer to the previous node, which forms a so-called doubly linked list. We only need to keep track of the first node of the list, using a pointer called a header.

All other data structures are implemented with either arrays or linked lists. I will explain in class how operations on individual structures can be used when we reach those topics.

1.5 GLOSSARY

Abstraction: Trừu tượng hóa. The process of creating a description of a real-world problem by extracting important characteristics to be represented in a computer. to abstract: trừu xuất.
Problem: Bài toán, vấn đề. A question proposed for solution or consideration.
Solution: Lời giải (nghiệm), giải pháp. An answer to a problem. to solve: giải (một bài toán, vấn đề).
Model: Mô hình. An abstract description used in capturing the important characteristics of a real-world problem to be represented and manipulated in a computer. to model: mô hình hóa.
Data Model: Mô hình dữ liệu. A way of describing and representing data.



Implementation: Sự cài đặt, lắp đặt; bản cài đặt. (1) The process of installing a computer system; (2) the process of building a software product from its design; (3) a piece of software running on a computer. to implement, to install: cài đặt, hiện thực.
Operation: Phép toán, thao tác; Sự hoạt động, điều hành. (1) An action specified by a single computer instruction or high-level statement; (2) any defined action. to operate: thao tác, hoạt động. operator: người điều hành, toán tử. operand: toán hạng.
Execution: Thực thi, chạy. The running of a program on a computer. to execute: to run.
Data Type: Kiểu dữ liệu. See the definition in text. built-in type: kiểu cài sẵn. user-defined type: kiểu do người dùng định nghĩa, kiểu tự tạo.
Variable: Biến. See the definition in text.
Constant: Hằng. See the definition in text.
RAM: Bộ nhớ chính. Random Access Memory, main memory.
Pointer: Con trỏ. A data object (variable or constant) holding the address of another object.
Statement: Câu lệnh. A sentence that makes a computer perform an action, usually in a high-level language.
Instruction: Chỉ thị. A coded operation to be performed in a computer, usually in binary form (machine language) or mnemonic form (assembly).
Index: Bản chỉ dẫn; Chỉ mục. (1) A structure for quickly finding a subject or name in a book or the like; (2) a number indicating the location of an element in an array.
Array: Mảng. See the definition in text.
List: Danh sách. See the definition in text.
Linked List: Danh sách liên kết. See the definition in text.
Header: Con trỏ đầu. See the definition in text.



Major themes:
1. The Set Data Model
2. Set Algebra
3. Implementation of Sets

Reading: Sections 7.2, 7.3, 7.4, 7.7, and 7.10.

2.1 THE SET DATA MODEL

The set is the most fundamental concept in mathematics, and the same is true in computer science. The term "set" is not defined explicitly; at a basic level, we can think of a set as a collection of objects.

Basic notation

x ∈ S                       "element x is a member of set S"
S = {x1, x2, . . . , xn}    "elements x1, x2, . . . , xn are members of set S"; each xi must be distinct, since sets contain unique objects in any order
∅                           "the empty set", the set with no members

Defining Sets

S = {1, 4, 5, {3, 4}, 9}        definition by enumeration
T = { x | x ∈ S, x is odd }     definition by abstraction

The latter means the set of elements x such that x is an odd element of S.

"Set former" notation (general form): We write { x | x ∈ X and P(x) } and read "the set of elements x in X such that x has property P."



Equality of Sets

Two sets are equal if they have exactly the same members. This doesn't necessarily mean their representations must be identical. For example, the set {1, 2} is equal to the set {2, 1} because they both have exactly the elements 1 and 2.

2.2 SET ALGEBRA

We introduce the three basic operations on sets.

Union          S ∪ T    the set containing the elements that are in S or T (or both)
Intersection   S ∩ T    the set containing the elements that are in both S and T
Difference     S - T    the set containing the elements that are in S but not in T

A Venn diagram illustrating these relationships:

[Figure: Venn diagram of the sets S and T, with its areas labeled Region 1, Region 2, Region 3, and Region 4]

See the algebraic laws for those operations in the text (page 343).

Subsets

1. S ⊆ T means:
   S is a subset of T
   T is a superset of S
   S is contained in T
   T contains S

2. S ⊂ T means:
   S is a proper subset of T
   T is a proper superset of S
   S is properly contained in T
   T properly contains S
   and is true if S ⊆ T and there is at least one element of T that is not also a member of S.

Power Sets

Let S be any set. The power set of S is the set of all subsets of S. We denote the power set of S by P(S) or 2^S.

Example: S = {a, b, c, d}
P(S) = {∅, {a}, {b}, {c}, {d}, {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}, {a, b, c}, {a, b, d}, {a, c, d}, {b, c, d}, {a, b, c, d}}

Theorem: If a set S has n members, then P(S) has 2^n members.

2.3 IMPLEMENTATION OF SETS

I will briefly present the list implementation of sets. You can find some other implementations in Section 7.5. Note that the set is a general concept, so we only show the basic set operations. When the set is used as a more specific model, for example a dictionary or a list, we can implement those operations efficiently.

For simplicity, sets are implemented as sorted linked lists; each element is a cell.

[Figure: the sorted linked list M, whose cells hold the elements 2, 4, and 9]



The listing below is the code for the function setUnion(L,M). It makes use of an auxiliary function assemble(x,L,M) that produces a list whose head element is x and whose tail is the union of lists L and M. I will explain it in class.

    #include <stdlib.h>   /* malloc, NULL */

    typedef struct CELL *LIST;
    struct CELL {
        int element;
        LIST next;
    };

    /* declared first because assemble and setUnion call each other */
    LIST setUnion(LIST L, LIST M);

    /* assemble produces a list whose head element is x and
       whose tail is the union of lists L and M */
    LIST assemble(int x, LIST L, LIST M)
    {
        LIST first;

        first = (LIST) malloc(sizeof(struct CELL));
        first->element = x;
        first->next = setUnion(L, M);
        return first;
    }

    /* setUnion returns a list that is the union of L and M */
    LIST setUnion(LIST L, LIST M)
    {
        if (L == NULL && M == NULL)
            return NULL;
        else if (L == NULL) /* M cannot be NULL here */
            return assemble(M->element, NULL, M->next);
        else if (M == NULL) /* L cannot be NULL here */
            return assemble(L->element, L->next, NULL);
        /* if we reach here, neither L nor M can be NULL */
        else if (L->element == M->element)
            return assemble(L->element, L->next, M->next);
        else if (L->element < M->element)
            return assemble(L->element, L->next, M);
        else /* here, M->element < L->element */
            return assemble(M->element, L, M->next);
    }
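The same walk over two sorted lists also yields the intersection S ∩ T defined in Section 2.2. The following is a sketch in the style of setUnion, not a listing from the textbook; makeCell and setIntersection are names assumed for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct CELL *LIST;
struct CELL {
    int element;
    LIST next;
};

/* makeCell (a hypothetical helper) allocates one cell holding x
   and links it in front of the list `next`. */
LIST makeCell(int x, LIST next)
{
    LIST cell = (LIST) malloc(sizeof(struct CELL));

    assert(cell != NULL);   /* this sketch simply stops if memory runs out */
    cell->element = x;
    cell->next = next;
    return cell;
}

/* setIntersection returns a new sorted list holding the elements
   that appear in both sorted lists L and M. */
LIST setIntersection(LIST L, LIST M)
{
    if (L == NULL || M == NULL)
        return NULL;        /* nothing can be common to both */
    else if (L->element == M->element)
        return makeCell(L->element,
                        setIntersection(L->next, M->next));
    else if (L->element < M->element)
        return setIntersection(L->next, M);  /* L's head is not in M */
    else
        return setIntersection(L, M->next);  /* M's head is not in L */
}
```

Because both lists are sorted, the smaller of the two head elements cannot occur later in the other list, so it can be skipped safely. For L = (1, 3, 9) and M = (2, 3, 4, 9), the result is the list (3, 9).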



2.4 GLOSSARY

Concept: Khái niệm, ý niệm. An original idea. Conceptual design: Thiết kế mức khái niệm.
Set: Tập hợp (tập). See the definition in text.
Collection: commonly used (informally) as a synonym of set.
Set former: Lập tử tập hợp. An operator to form a set.
Empty set: Tập trống, tập rỗng.
Element, member: Phần tử (của tập hợp).
Subset: Tập con.
Superset: Tập cha, tập bao hàm.
Term: Various meanings. (1) Thuật ngữ, thuật từ; (2) Học kỳ, dùng chung cho cả quarter (khoảng 10 đến 15 tuần) và semester (khoảng 20 đến 25 tuần); (3) Hạng thức.
Notation: Ký pháp. A system of signs or symbols to represent words, phrases, numbers, quantities, etc., as in mathematics, chemistry, music, etc.
Equality: Tính bằng nhau; đẳng thức.
Set Algebra: Đại số tập hợp.
Union: Phép hợp; phần hợp. See the definition in text.
Intersection: Phép giao; phần giao. See the definition in text.
Difference: Phép hiệu; phần hiệu. See the definition in text.
Algebraic law: Luật đại số. A rule to transform an algebraic expression into another equivalent expression.
Power set: Tập lũy thừa. See the definition in text.
Theorem: Định lý.
Lemma: Bổ đề.
Corollary: Hệ quả.
Proof: Chứng minh.
Function: Hàm; Chức năng. (1) A subprogram to perform some task, usually as a synonym of procedure; (2) the natural, required, or expected activity of a person or thing.



Code: Bản mã chương trình. Any written program (in any programming language, including machine languages).
Listing: Chương trình minh họa. A piece of code used in programming texts for illustration.
Coding, Programming: Lập trình, viết chương trình, thảo chương.
NULL: Giá trị rỗng; con trỏ hư ảo. The value of a variable or pointer indicating that the variable or pointer currently has no value; for a pointer in particular, it indicates that the pointer points to nowhere. In Pascal, this symbol is nil.



Major themes:
1. Binary Relations
2. Relations
3. Relational Algebra

Reading: Sections 7.7, 7.10, 8.2, and 8.7.

3.1 BINARY RELATIONS

Consider two sets of students: MS = {Tien, Hung, Khoa, Viet} and FS = {Lan, Cuc, Hue}. Suppose that we have two pairs of lovers: Tien and Cuc, Viet and Lan. The others may love someone who is not in the sets. We say that the binary relation love on the sets MS and FS has two tuples: (Tien, Cuc) and (Viet, Lan). Other tuples, say (Khoa, Hue) or (Hung, Lan), are not in love.

We can now define binary relations formally. First, let A and B be two sets. The product or Cartesian product of A and B, denoted A × B, is defined as the set of pairs in which the first component is chosen from A and the second component is chosen from B. That is,

A × B = { (a, b) | a ∈ A and b ∈ B }

A binary relation R on A and B is a subset of the product A × B. In the previous example, we have

love = { (Tien, Cuc), (Viet, Lan) }

and may write it in predicate form:

love(Tien, Cuc) = true
love(Viet, Lan) = true
love(Hung, Hue) = false



The last must be false since the pair (Hung, Hue) is not in the relation.

We can use an infix notation for binary relations, so that a binary relation like < can be written between the components of the pairs in the relation. For example, we may write 1 < 2 instead of "(1,2) ∈ <" or "<(1,2) = true".

Some special properties

Binary relations may have some useful properties: reflexivity, symmetry, antisymmetry, and transitivity.

Reflexivity: A relation R on a set A is called reflexive if (a, a) ∈ R for every element a ∈ A.

Symmetry: A relation R on a set A is called symmetric if (b, a) ∈ R whenever (a, b) ∈ R, for a, b ∈ A.

Antisymmetry: A relation R on a set A such that (a, b) ∈ R and (b, a) ∈ R only if a = b, for a, b ∈ A, is called antisymmetric.

Transitivity: A relation R on a set A is called transitive if whenever (a, b) ∈ R and (b, c) ∈ R, then (a, c) ∈ R, for a, b, c ∈ A.

See more about binary relations in the textbook (partial orders, equivalence relations, equivalence classes, closures of relations).

3.2 RELATIONS

In this section, we shall extend binary relations to relations of arbitrary arity. In the relational model, relations are viewed as tables; that is, we represent information in a table whose columns are given names, called attributes, and whose rows are called tuples and represent basic facts. Thus, a table has two aspects:

1. The set of column names, and
2. The rows containing the information.

The set of column names (attributes) is called the scheme of the relation. For example, the table below has the scheme (Course, StudentId, Grade). If we give it a name, say CSG, we may write CSG(Course, StudentId, Grade).



Course    StudentId    Grade
CS101     12345        A
CS101     67890        B
EE200     12345        C
EE200     22222        B+
CS101     33333        A
PH100     67890        C+

The table CSG with three attributes.

A collection of relations (tables) is called a database. In an enterprise, databases may contain all of the information vital to its operations. Designing a database and the applications that access its data is a big problem and is beyond the scope of this course.

The set of schemes of the various relations in a database is called the database scheme. Notice the difference between the scheme for the database, which tells us something about how information is organized in the database, and the set of tuples in each relation, which is the actual information stored in the database.

3.3 RELATIONAL ALGEBRA

We introduce some basic operations on relations by examples. You can read more about these operations and others in many database textbooks.

The Selection Operation

If we want the tuples from the table CSG that have the Course component "CS101", we can perform a selection on this table and write:

σ_{Course="CS101"}(CSG)

where σ is the selection operator, and Course="CS101" is a boolean expression, which in general may also contain logical operators such as AND, OR, and NOT. The result of this operation is as follows:
Course    StudentId    Grade
CS101     12345        A
CS101     67890        B
CS101     33333        A
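If the table CSG is held in an array of structures, the selection above amounts to a single filtering loop. This is only a sketch under that assumption; the structure CSG_TUPLE and the function selectByCourse are hypothetical names, and real database systems store and search relations quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One tuple of the relation CSG(Course, StudentId, Grade).
   The structure layout is an assumption of this sketch. */
struct CSG_TUPLE {
    const char *course;
    int studentId;
    const char *grade;
};

/* Copy into out[] every tuple of r[0..n-1] whose Course component
   equals `course`, and return the number of tuples copied.
   out[] must have room for n tuples. */
size_t selectByCourse(const struct CSG_TUPLE r[], size_t n,
                      const char *course, struct CSG_TUPLE out[])
{
    size_t i, count = 0;

    assert(course != NULL);
    for (i = 0; i < n; i++)
        if (strcmp(r[i].course, course) == 0)  /* the condition Course = course */
            out[count++] = r[i];
    return count;
}
```

Applied to the six tuples of CSG with the course "CS101", the function keeps the three tuples shown in the result table, in their original order.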



The Projection Operation

Another important operation is the projection. Suppose we want the identifiers of the students in the table CSG. That is, we must eliminate all the columns but StudentId. This can be done using the projection operator, represented by the symbol π. We write

π_{StudentId}(CSG)

and get a one-column table:
StudentId
12345
67890
22222
33333
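A projection onto StudentId can be sketched in C in the same spirit. Because a relation is a set, the sketch stores each identifier only once. CSG_TUPLE and projectStudentId are hypothetical names, and the quadratic duplicate check is chosen for brevity, not efficiency.

```c
#include <assert.h>
#include <stddef.h>

/* One tuple of the relation CSG(Course, StudentId, Grade).
   The structure layout is an assumption of this sketch. */
struct CSG_TUPLE {
    const char *course;
    int studentId;
    const char *grade;
};

/* Project r[0..n-1] onto the StudentId attribute. Each distinct
   identifier is stored once in out[], in order of first appearance.
   out[] must have room for n integers. Returns the number stored. */
size_t projectStudentId(const struct CSG_TUPLE r[], size_t n, int out[])
{
    size_t i, j, count = 0;

    assert(out != NULL);
    for (i = 0; i < n; i++) {
        for (j = 0; j < count; j++)     /* is this identifier already there? */
            if (out[j] == r[i].studentId)
                break;
        if (j == count)                 /* not found, so add it */
            out[count++] = r[i].studentId;
    }
    return count;
}
```

On the six tuples of CSG, the function stores the four identifiers of the one-column table, skipping the second occurrences of 12345 and 67890.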

Note that this table has only four tuples, not six as in the original one. The reason is that relations are sets, and the 1-tuples "12345" and "67890" each occur twice; duplicates are the same element and need not be represented twice.

The Join Operation

Unlike the previous operations, which are unary, the join operation is a binary operation; that is, it has two operands.

Let Ai be an attribute of R and Bj an attribute of S. The join of R and S on these attributes, written R ⋈_{Ai=Bj} S, is formed by taking each tuple r from R and each tuple s from S and comparing them. If the component of r for Ai equals the component of s for Bj, then we form one tuple from r and s; otherwise, no tuple is created from the pairing of r and s. We form a tuple from r and s by taking the components of r and following them by all the components of s, but omitting the component for Bj, which is the same as the Ai component of r anyway.

Suppose we have two tables CDH and CR as follows:
Course    Day    Hour
CS101     M      9AM
CS101     W      9AM
CS101     F      9AM
EE200     Tu     10AM
EE200     W      1PM
EE200     Th     10AM

The table CDH.



Course    Room
CS101     Turing Aud.
EE200     25 Ohm Hall
PH100     Newton Lab.

The table CR.

Suppose we want to know at what times each room is occupied by some course. We can find out by taking a join on the Course components. The expression defining the resulting relation is:

CR ⋈_{Course=Course} CDH

and the value of the relation produced by this expression is shown in the following table:
Course    Room           Day    Hour
CS101     Turing Aud.    M      9AM
CS101     Turing Aud.    W      9AM
CS101     Turing Aud.    F      9AM
EE200     25 Ohm Hall    Tu     10AM
EE200     25 Ohm Hall    W      1PM
EE200     25 Ohm Hall    Th     10AM
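The definition of the join suggests a direct implementation with two nested loops: compare every tuple of CR with every tuple of CDH and keep the pairs whose Course components agree. The sketch below assumes the tables are held in arrays of structures; all type and function names are hypothetical, and practical database systems use faster join methods.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tuple layouts for CR, CDH, and their join; assumptions of this sketch. */
struct CR_TUPLE  { const char *course; const char *room; };
struct CDH_TUPLE { const char *course; const char *day; const char *hour; };
struct JOINED    { const char *course; const char *room;
                   const char *day;    const char *hour; };

/* Join cr[0..ncr-1] with cdh[0..ncdh-1] on equal Course components.
   out[] must have room for ncr * ncdh tuples.
   Returns the number of tuples produced. */
size_t joinOnCourse(const struct CR_TUPLE cr[], size_t ncr,
                    const struct CDH_TUPLE cdh[], size_t ncdh,
                    struct JOINED out[])
{
    size_t i, j, count = 0;

    assert(out != NULL);
    for (i = 0; i < ncr; i++)
        for (j = 0; j < ncdh; j++)
            if (strcmp(cr[i].course, cdh[j].course) == 0) {
                /* components of the CR tuple first, then those of the
                   CDH tuple, omitting its duplicate Course component */
                out[count].course = cr[i].course;
                out[count].room   = cr[i].room;
                out[count].day    = cdh[j].day;
                out[count].hour   = cdh[j].hour;
                count++;
            }
    return count;
}
```

With the tables CR and CDH above, the function produces the six tuples of the result table. The course PH100 contributes nothing because no tuple of CDH has that Course component.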

We usually perform a kind of join that equates the attributes with the same names. Such a join is called a natural join and is simply written ⋈. The join in the example is a natural join and can be rewritten CR ⋈ CDH.

3.4 GLOSSARY

Relation: Quan hệ. See definition in text.
Arity: Ngôi (của quan hệ).
Binary relation: Quan hệ hai ngôi. See definition in text.
n-ary relation: Quan hệ n-ngôi.
Tuple: Bộ (dữ liệu). An ordered list of values, each corresponding to one component in the relation scheme. Also called a fact.
Attribute: Thuộc tính. A component of a relation that is given a name.
Cartesian Product: Tích, tích Descartes. Also called cross product (tích chéo).



Predicate: Vị từ. See Chapter 14 in the textbook for Predicate Logic.
Infix notation: Ký pháp trung vị. An expression notation in which operators are between their operands. Related terms: prefix and postfix notation (tiền vị và hậu vị).
Reflexivity: Tính phản thân.
Symmetry: Tính đối xứng.
Antisymmetry: Tính phản xứng.
Transitivity: Tính bắc cầu.
Partial order: Thứ tự bộ phận.
Equivalence relation: Quan hệ tương đương.
Equivalence class: Lớp tương đương.
Closure: Bao đóng.
Scheme: Lược đồ.
Database: Cơ sở dữ liệu.
Relational algebra: Đại số quan hệ.
Selection: Phép chọn.
Projection: Phép chiếu.
Join: Phép nối.
Natural Join: Nối tự nhiên.


K5 & K6, Computer Science Department, Van Lang University Second semester -- Feb, 2002 Instructor: Tran c Quang

Major themes:
1. Basic Concepts
2. Stacks
3. Queues

Reading: Sections 6.2, 6.6, and 6.8.

4.1 BASIC CONCEPTS

Computer programs usually operate on tables of information. In most cases these tables are not simply amorphous masses of numerical values; they involve important structural relationships between the data elements. In its simplest form, a table might be a linear list of elements, whose relevant structural properties might include the answers to such questions as: Which element is first in the list? Which is last? Which elements precede and follow a given one? How many elements are in the list?

In this handout, we only consider linear lists. Conceptually, a linear list (or simply list) is a finite sequence of zero or more elements. We generally expect the elements of a list to be of one type. For example, if the elements of a list are all integers, we may call it a list of integers.

Character strings are special lists. In C, a string is stored in an array ending with the null character '\0'. The string functions of the C library test for this marker to find the end of a string.

Linked Lists

A list may be implemented as a linked list of elements, each of which is allocated a cell or node consisting of an area for data and a pointer to the next element in the list, as shown in the following figure. See more in Section 6.4 (textbook) and Handout #1.
[Figure: a linked list of four cells, Element 1 to Element 4; each cell holds its data (Info1 to Info4) and a pointer to the next cell.]
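The cell structure in the figure can be sketched in C as follows. The type names CELL and LIST follow the convention used for queue cells later in this handout; the helper functions prepend, length, and first are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct CELL *LIST;
struct CELL {
    int element;   /* area for data */
    LIST next;     /* pointer to the next cell; NULL at the end of the list */
};

/* Put x in a new cell in front of list L.  Returns the new list,
   or L unchanged if no memory is available. */
LIST prepend(int x, LIST L)
{
    LIST cell = malloc(sizeof(struct CELL));
    if (cell == NULL)
        return L;
    cell->element = x;
    cell->next = L;
    return cell;
}

/* Count the elements by following the next pointers. */
int length(LIST L)
{
    int n = 0;
    for (; L != NULL; L = L->next)
        n++;
    return n;
}

/* Return the first element; the list must not be empty. */
int first(LIST L)
{
    assert(L != NULL);
    return L->element;
}
```

Here the empty list is simply the NULL pointer, so length(NULL) is 0.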



4.2 STACKS

A stack is a special kind of list in which all operations are performed at one end of the list, which is called the top of the stack. The term "LIFO (for last-in first-out) list" is a synonym for stack. Because of this restriction, stacks have two specialized operations, push and pop: push(x) puts the element x on the top of the stack, and pop removes the topmost element from the stack.

The figure illustrates how to use a stack with push and pop operations to compute the value of the postfix expression 3 4 + 2 ×. The details will be explained in class.

Symbol      Stack     Actions
(initial)   (empty)
3           3         push 3
4           3, 4      push 4
+           7         pop 4; pop 3; compute 7 = 3 + 4; push 7
2           7, 2      push 2
×           14        pop 2; pop 7; compute 14 = 7 × 2; push 14

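Stacks can be implemented by an array or a linked list. We only discuss an array-based implementation with elements of type int.

[Figure: an array with indices 0, 1, ..., n − 1 holding the elements a0, a1, ..., an−1; the cells after index n − 1 are unused.]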




A convenient way to implement a stack by an array is to create a structure consisting of:
1. An array to hold the elements, and
2. A variable top to keep track of the top of the stack.

In the listing, we present the declaration for an array-based stack of integers and its operations. The stack is empty when top is less than 0. You should read the listing carefully.

    typedef struct {
        int A[MAX];   /* the elements, bottom of the stack at A[0] */
        int top;      /* index of the topmost element */
    } STACK;

    BOOLEAN isEmpty(STACK *pS)
    {
        return (pS->top < 0);
    }

    BOOLEAN isFull(STACK *pS)
    {
        return (pS->top >= MAX - 1);
    }

    BOOLEAN pop(STACK *pS, int *px)
    {
        if (isEmpty(pS))
            return FALSE;
        else {
            (*px) = pS->A[(pS->top)--];   /* copy the top element, then lower top */
            return TRUE;
        }
    }

    BOOLEAN push(int x, STACK *pS)
    {
        if (isFull(pS))
            return FALSE;
        else {
            pS->A[++(pS->top)] = x;       /* raise top, then store x there */
            return TRUE;
        }
    }

A very important application of stacks is to implement function calls. I recommend that you read about this in Section 6.7 (textbook).
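Returning to the postfix example of this section, the following sketch shows one way the array-based stack could be used to evaluate a postfix expression. It rests on several assumptions that are not in the listing above: MAX is set to 100, operands are single decimal digits, the operator × is typed as *, and the function name evalPostfix is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX 100            /* assumed capacity */
#define TRUE 1
#define FALSE 0
typedef int BOOLEAN;

typedef struct {
    int A[MAX];
    int top;
} STACK;

BOOLEAN push(int x, STACK *pS)
{
    if (pS->top >= MAX - 1) return FALSE;
    pS->A[++(pS->top)] = x;
    return TRUE;
}

BOOLEAN pop(STACK *pS, int *px)
{
    if (pS->top < 0) return FALSE;
    *px = pS->A[(pS->top)--];
    return TRUE;
}

/* Evaluate a postfix expression of single-digit operands and the
   operators + - * /.  Stores the value in *result and returns TRUE,
   or returns FALSE if the expression is malformed. */
BOOLEAN evalPostfix(const char *expr, int *result)
{
    STACK S;
    int a, b;

    assert(expr != NULL && result != NULL);
    S.top = -1;                               /* start with an empty stack */
    for (; *expr != '\0'; expr++) {
        char c = *expr;
        if (isspace((unsigned char) c))
            continue;
        if (isdigit((unsigned char) c)) {
            if (!push(c - '0', &S)) return FALSE;
        } else {
            /* the right operand is on top, the left operand below it */
            if (!pop(&S, &b) || !pop(&S, &a)) return FALSE;
            switch (c) {
            case '+': a = a + b; break;
            case '-': a = a - b; break;
            case '*': a = a * b; break;
            case '/': if (b == 0) return FALSE; a = a / b; break;
            default:  return FALSE;
            }
            push(a, &S);                      /* cannot fail: two cells were just freed */
        }
    }
    /* a well-formed expression leaves exactly one value on the stack */
    return pop(&S, result) && S.top < 0;
}
```

For the expression 3 4 + 2 * the stack goes through the same contents as in the figure: 3; then 3, 4; then 7; then 7, 2; and finally 14.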



4.3 QUEUES

A queue is a restricted form of list in which elements are inserted at one end, the rear, and removed from the other end, the front. The term "FIFO (first-in first-out) list" is a synonym for queue.

The intuitive idea behind a queue is a line at a cashier's window. People enter the line at the rear and receive service once they reach the front. Unlike a stack, there is fairness to a queue; people are served in the order in which they enter the line. Thus the person who has waited the longest is the one who is served next.

Like a stack, a queue has two specialized operations, enqueue and dequeue: enqueue(x) adds x to the rear of a queue, and dequeue removes the element from the front of the queue.

As with stacks, an array or a linked list can be used to implement queues. For our purpose, we describe a linked-list implementation with the following structure for cells. As usual, elements are of type int.

    typedef struct CELL *LIST;
    struct CELL {
        int element;
        LIST next;
    };

A queue itself is a structure with two pointers, one to the front cell and another to the rear cell.

    typedef struct {
        LIST front, rear;
    } QUEUE;

We also present the listing of operations on queues below. You should read it carefully to understand why those operations work properly. The listing uses malloc and free, which are declared in <stdlib.h>.

    BOOLEAN isEmpty(QUEUE *pQ)
    {
        return (pQ->front == NULL);
    }

    BOOLEAN isFull(QUEUE *pQ)
    {
        return FALSE;              /* a linked queue has no fixed capacity */
    }

    BOOLEAN dequeue(QUEUE *pQ, int *px)
    {
        LIST old;

        if (isEmpty(pQ))
            return FALSE;
        else {
            old = pQ->front;
            (*px) = old->element;
            pQ->front = old->next;
            free(old);             /* release the removed cell */
            return TRUE;
        }
    }

    BOOLEAN enqueue(int x, QUEUE *pQ)
    {
        LIST cell = (LIST) malloc(sizeof(struct CELL));

        if (cell == NULL)
            return FALSE;          /* no memory for a new cell */
        cell->element = x;
        cell->next = NULL;
        if (isEmpty(pQ))
            pQ->front = cell;      /* the new cell is both front and rear */
        else
            pQ->rear->next = cell; /* link the new cell after the old rear */
        pQ->rear = cell;
        return TRUE;
    }

4.4 GLOSSARY

List: Danh sách. See the text.
Linear list: Danh sách tuyến tính.
Linked List: Danh sách liên kết. A physical data structure used to implement high-level data structures such as lists, trees, graphs, etc. The counterpart of linked lists is arrays.
Relationship: Mối liên hệ, quan hệ. In computer terminology, relationship is used informally for relation, with a slight difference in meaning.
Character string, string: Chuỗi ký tự, xâu ký tự. See the text.
Stack: Chồng xếp, ngăn xếp.
top: Đỉnh (chồng xếp).



push: Đẩy (vào chồng xếp).
pop: Nhặt, lấy (ra khỏi chồng xếp).
Postfix expression: Biểu thức hậu vị. An expression in which each operator follows its operands.
Declaration: Khai báo. A statement introducing a name or identifier into a program. In C, every identifier must be declared before use.
Call: Lời gọi. A statement with the name of a function, and perhaps a list of actual parameters, to transfer control to that function. Also function call or procedure call.
Queue: Hàng đợi. See the text.
front: đầu hàng.
rear: cuối hàng.
enqueue: vào hàng.
dequeue: rời hàng.



Major themes:
1. Basic Terminology
2. Implementation of Trees
3. Binary Trees and Binary Search Trees

Reading: Sections 5.2, 5.3, 5.6, and 5.7.

5.1 BASIC TERMINOLOGY

A list, as discussed in the last handout, is a linear structure, whereas a tree is a non-linear structure representing hierarchical relationships of information, such as that of directories and files stored in a computer. We can formally define a tree as a finite set of nodes and edges such that:
1. There is one specially designated node called the root of the tree. The root is generally drawn at the top.
2. Every node c other than the root is connected by an edge to exactly one other node p, called the parent of c. We also call c a child of p. We draw the parent of a node above that node.
3. A tree is connected in the sense that if we start at any node n other than the root, move to the parent of n, to the parent of the parent of n, and so on, we eventually reach the root of the tree.

[Figure: a tree with root r; r has the children n1, n2, and n3, and n1 has the children n4 and n5.]



In the figure, r is the root and has three children: n1, n2, and n3. We can define important concepts from the figure.
1. The node n1 has two children, n4 and n5, but the nodes n2 and n3 both have no children. A node with no children is called a leaf; a node with one or more children is called an interior node.
2. n4 is a descendant of r and n1; conversely, r and n1 are ancestors of n4.
3. Nodes n1, n2, and n3 are siblings; so are n4 and n5.
4. The height of r is 2; this is also the height of the tree. The height of n1 is 1 and that of n4 is 0. The depth or level of r is 0, of n1 is 1, and of n4 is 2.

5.2 IMPLEMENTATION OF TREES

Many data structures can be used to represent trees. Which one we should use depends on the particular operations we want to perform. In this very short handout, we use a common representation for a tree called leftmost-child-right-sibling, as suggested in the following figure.

[Figure: the leftmost-child-right-sibling representation of a tree with the nodes r, n1, n2, n3, n4, n5, and n6.]

In the sketch, the downward arrows are the leftmost-child links; the sideways arrows are the right-sibling links. We can define a structure for a node as follows:

    typedef struct NODE *pNODE;
    struct NODE {
        int info;
        pNODE leftmostChild, rightSibling;
    };

In this representation, a node has the field info to hold any information associated with the node; the fields leftmostChild and rightSibling are pointers to the leftmost child and the right sibling of the node in question, respectively. Thus a node with a NULL leftmost-child pointer is a leaf; a node with a NULL right-sibling pointer is a rightmost node.

We can keep track of a tree using a header pointer to the root. From this pointer, we can traverse the tree in several ways, but that is beyond the scope of this course.
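As a small illustration of how the two pointers are used, the sketch below tests whether a node is a leaf and counts the children of a node by following the right-sibling chain. The function names isLeaf and countChildren are hypothetical; only the node structure comes from the text above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct NODE *pNODE;
struct NODE {
    int info;
    pNODE leftmostChild, rightSibling;
};

/* A node is a leaf when it has no leftmost child. */
int isLeaf(pNODE n)
{
    assert(n != NULL);
    return n->leftmostChild == NULL;
}

/* Count the children of n: start at the leftmost child and
   follow right-sibling links until the rightmost child. */
int countChildren(pNODE n)
{
    int count = 0;
    pNODE c;

    assert(n != NULL);
    for (c = n->leftmostChild; c != NULL; c = c->rightSibling)
        count++;
    return count;
}
```

For the tree of Section 5.1, countChildren gives 3 for the root r and 2 for n1, while n2 and n3 are leaves.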










5.3 BINARY TREES AND BINARY SEARCH TREES

In a binary tree, a node can have at most two children, and rather than counting children from the left, we call them a left child and a right child.

A similar data structure can be used for a binary tree. In this case, we also use two pointers, one to the left child and the other to the right child. Either or both pointers may be NULL. A structure for a node can be declared as follows:

    typedef struct NODE *TREE;
    struct NODE {
        int info;
        TREE leftChild, rightChild;
    };

Here we call the type "pointer to node" by the name TREE, since the most common use for this type will be to represent trees and subtrees. We can interpret the leftChild and rightChild fields either as pointers to children or as the left and right subtrees themselves. The other issues for binary trees are the same as those for general trees.

Binary Search Trees

A binary search tree is a kind of binary tree that is useful for implementing a set of data elements on which we frequently perform a lookup for a specified element. The field used for lookup is called a search key or just key.

In a binary search tree, the following property must hold at every node x: all nodes in the left subtree of x have keys less than the key of x, and all nodes in the right subtree have keys greater than the key of x. This property is called the binary search tree property (BST property).

Trees are very important in computer algorithms and are discussed in greater detail in many textbooks. Our textbook is one of the most fundamental.

5.4 GLOSSARY

Tree: Cây.
Binary tree: Cây nhị phân.
Binary search tree: Cây nhị phân tìm kiếm, cây tìm kiếm nhị phân.
Subtree: Cây con.
Node: Nút.
Edge: Cạnh.
Leaf: Nút lá.
Interior node: Nút trong, nút nội.
Hierarchical relationship: Mối liên hệ phân cấp.
Parent: Cha, nút cha.
Child, children: Con, nút con.
Sibling: Nút anh em.
Ancestor: Tổ tiên, nút tổ tiên.
Descendant: Hậu duệ, nút hậu duệ.
Connected: liên thông.
Leftmost, rightmost: Tận trái, tận phải.
Topmost, bottommost: Trên cùng, dưới cùng.
Lookup, Insert, Delete, Update: Tìm kiếm, Chèn, Xóa, Cập nhật.
Traverse: Duyệt (cây).
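As a supplement to Section 5.3, the following sketch suggests how the BST property guides a lookup: at each node we compare keys and continue in only one subtree. The functions lookup and insertNode are hypothetical names, the field info is assumed to serve as the key, and insertNode expects a node supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

typedef struct NODE *TREE;
struct NODE {
    int info;                    /* used as the search key in this sketch */
    TREE leftChild, rightChild;
};

/* Return 1 if key is in the binary search tree T, 0 otherwise. */
int lookup(int key, TREE T)
{
    while (T != NULL) {
        if (key == T->info)
            return 1;
        /* BST property: smaller keys are on the left, larger on the right */
        T = (key < T->info) ? T->leftChild : T->rightChild;
    }
    return 0;
}

/* Insert the node n into T so that the BST property still holds.
   Returns the root of the resulting tree.  A node whose key is
   already present is not linked in. */
TREE insertNode(TREE T, TREE n)
{
    assert(n != NULL);
    if (T == NULL) {
        n->leftChild = n->rightChild = NULL;
        return n;
    }
    if (n->info < T->info)
        T->leftChild = insertNode(T->leftChild, n);
    else if (n->info > T->info)
        T->rightChild = insertNode(T->rightChild, n);
    return T;
}
```

Each comparison discards one subtree, so a lookup visits at most one node per level of the tree.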



Major themes:
1. Basic Concepts
2. Implementation of Graphs
3. Connected Components of an Undirected Graph

Reading: Sections 9.2, 9.3, and 9.4.

6.1 BASIC CONCEPTS

The graph is a generalization of the tree that was studied in the previous week. Rather than parent-child relationships, an edge in a graph may represent any binary relationship between two objects, each represented by a node. Sometimes we need to indicate explicitly the direction of a relationship by using arrows rather than edges. In this case, the graph is directed and its edges are called arcs. Formally, we can define a directed graph as a set N of nodes and a set A of arcs representing a binary relation on N. Graphs can be drawn as suggested in the figure.

[Figure: a directed graph with the nodes 1 to 7.]



1. An arrow from node a to node b is written (a, b) or a → b. We call b the head of the arc and a the tail, since b is at the head of the arrow. We also say that a is a predecessor of b, and conversely, b is a successor of a. In the above figure, the arc 1 → 1 tells us that node 1 is both a predecessor and a successor of itself. The arc 1 → 1 is also called a loop.
2. A path in a directed graph is a list of nodes (n1, n2, . . . , nk) such that there is an arc from each node to the next, that is, ni → ni+1 for i = 1, 2, . . . , k − 1. The length of the path is k − 1, the number of arcs along the path. In the figure, two of the paths from node 1 to node 4 are (1, 2, 3, 4) with length 3 and (1, 3, 4) with length 2.
3. A cycle in a directed graph is a path of length 1 or more that begins and ends at the same node. In the figure, the path (4, 5, 7, 4) is a cycle of length 3; the path (1, 1) is a cycle of length 1. If a graph has one or more cycles, we say the graph is cyclic; otherwise, it is acyclic.

In an undirected graph, an edge between two nodes a and b is denoted by {a, b}. Those nodes are neighbors, not a predecessor and a successor.

6.2 IMPLEMENTATION OF GRAPHS

There are two standard ways to represent a graph: adjacency lists and adjacency matrices. We shall consider these representations for directed graphs.

Adjacency Lists

For simplicity, let nodes be named by the integers 0, 1, . . . , MAX − 1. We also use NODE as the type of nodes, but we may suppose that NODE is a synonym for int. A structure for a cell of an adjacency list can be defined as:

    typedef struct CELL *LIST;
    struct CELL {
        NODE nodeName;
        LIST next;
    };

The successors of a node form a linked list of cells. To hold the headers of the linked lists, we create an array successors[MAX] of type LIST.

    LIST successors[MAX];

That is, the entry successors[u] contains a pointer to a linked list of all the successors of node u. The adjacency lists for the graph of the previous figure are suggested in the figure on the next page.



[Figure: the array successors, indexed 0 to 7, with the adjacency list of each node of the graph. Node 1: 1, 2, 3. Node 2: 3. Node 3: 4. Node 4: 5, 6. Node 5: 7. Node 6: 7. Node 7: 4.]

Adjacency Matrices

An adjacency matrix is a two-dimensional array arcs[MAX][MAX] of type BOOLEAN. If there is an arc u → v, the value of the entry arcs[u][v] is TRUE; otherwise, it is FALSE. Note that we use BOOLEAN as a synonym for int.

    typedef int BOOLEAN;
    BOOLEAN arcs[MAX][MAX];

The adjacency matrix for our graph is shown below. The row is the tail u of an arc and the column is its head v. We use 1 for TRUE and 0 for FALSE.

      0  1  2  3  4  5  6  7
  0   1  0  0  0  0  0  0  0
  1   0  1  1  1  0  0  0  0
  2   0  0  0  1  0  0  0  0
  3   0  0  0  0  1  0  0  0
  4   0  0  0  0  0  1  1  0
  5   0  0  0  0  0  0  0  1
  6   0  0  0  0  0  0  0  1
  7   0  0  0  0  1  0  0  0

For an undirected graph, an edge can be viewed as an arc in both directions, and the graph can be represented as for directed graphs.
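The two representations can be filled in side by side, as in the sketch below. It assumes MAX is 8 and introduces the hypothetical functions addArc, addEdge, and outDegree; addEdge follows the remark above by recording an undirected edge as an arc in both directions.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX 8                    /* assumed number of nodes, named 0 to 7 */
typedef int BOOLEAN;
typedef int NODE;

typedef struct CELL *LIST;
struct CELL {
    NODE nodeName;
    LIST next;
};

LIST successors[MAX];            /* adjacency lists, initially all NULL */
BOOLEAN arcs[MAX][MAX];          /* adjacency matrix, initially all 0 */

/* Record the arc u -> v in both representations.
   Returns 1 on success, 0 if no memory is available. */
int addArc(NODE u, NODE v)
{
    LIST cell;

    assert(0 <= u && u < MAX && 0 <= v && v < MAX);
    if (arcs[u][v])
        return 1;                /* already present; keep the list free of duplicates */
    cell = malloc(sizeof(struct CELL));
    if (cell == NULL)
        return 0;
    cell->nodeName = v;
    cell->next = successors[u];  /* put v at the front of u's list */
    successors[u] = cell;
    arcs[u][v] = 1;
    return 1;
}

/* An undirected edge {u, v} is an arc in both directions. */
int addEdge(NODE u, NODE v)
{
    return addArc(u, v) && addArc(v, u);
}

/* Number of successors of u, counted along its adjacency list. */
int outDegree(NODE u)
{
    int n = 0;
    LIST p;

    assert(0 <= u && u < MAX);
    for (p = successors[u]; p != NULL; p = p->next)
        n++;
    return n;
}
```

The matrix answers "is there an arc u → v?" with one array access, while the lists let us visit only the successors of u without scanning a whole row.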



6.3 CONNECTED COMPONENTS OF AN UNDIRECTED GRAPH

We can divide any undirected graph into one or more connected components. Each connected component is a set of nodes with paths from any member of the component to any other. Moreover, the connected components are maximal; that is, no node in a component has a path to any node outside the component. If a graph consists of a single connected component, then we say that the graph is connected. By this definition, our graph is connected if we replace arcs by edges.

[Figure: our graph with the nodes 1 to 7, with arcs replaced by edges.]

We now present a way to construct the connected components of a graph G. We begin with a graph G0 consisting of the nodes of G with none of the edges. Then we consider the edges of G, one at a time, to construct a sequence of graphs G0, G1, . . . , where Gi consists of the nodes of G and the first i edges of G.

BASIS. G0 consists of only the nodes of G with none of the edges. Every node is in a component by itself.

INDUCTION. Suppose we have the connected components for the graph Gi after considering the first i edges, and we now consider the (i + 1)st edge, {u, v}.
1. If nodes u and v are in the same component of Gi, then Gi+1 has the same set of connected components as Gi, because the new edge does not connect any nodes that were not already connected.
2. If nodes u and v are in different components, we merge the components containing u and v to get the connected components for Gi+1. The figure suggests why there is a path from any node x in the component of u to any node y in the component of v. We follow the path in the first component from x to u, then the edge {u, v}, and finally the path from v to y that we know exists in the second component.

[Figure: two components joined by the new edge {u, v}; node x lies in the component of u and node y in the component of v.]

When we have considered all edges in this manner, we have the connected components of the full graph.

6.4 GLOSSARY

Graph: Đồ thị.
Directed graph: Đồ thị có hướng.
Undirected graph: Đồ thị vô hướng.
Arc: Cung.
Edge: Cạnh.
Head: Đầu.
Tail: Đuôi.
Predecessor: Tiền nhiệm.
Successor: Kế vị.
Loop: Vòng khuyên.
Cycle: Chu trình.
Cyclic: có vòng.
Acyclic: không vòng.
Path: Đường đi.
Neighbor: Lân cận.
Adjacency List: Danh sách kề.
Adjacency Matrix: Ma trận kề.
Connected Component: Thành phần liên thông.
Basis: Cơ sở, bước cơ sở.
Induction: Quy nạp, bước quy nạp.
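As a supplement to Section 6.3, the basis and induction steps can be sketched in C with one array that records the component of each node. This is a simple relabeling scheme chosen for brevity, not for efficiency; the array component and the functions initComponents and considerEdge are hypothetical names, and MAX is assumed to be 8.

```c
#include <assert.h>

#define MAX 8                    /* assumed number of nodes, named 0 to 7 */

int component[MAX];              /* component[u] identifies the component of node u */

/* BASIS: with no edges, every node is in a component by itself. */
void initComponents(void)
{
    int u;
    for (u = 0; u < MAX; u++)
        component[u] = u;
}

/* INDUCTION: consider the edge {u, v}.  If u and v are already in the
   same component, nothing changes and 0 is returned.  Otherwise the two
   components are merged by relabeling, and 1 is returned. */
int considerEdge(int u, int v)
{
    int w, keep, old;

    assert(0 <= u && u < MAX && 0 <= v && v < MAX);
    if (component[u] == component[v])
        return 0;
    keep = component[u];
    old = component[v];
    for (w = 0; w < MAX; w++)
        if (component[w] == old)
            component[w] = keep; /* every node of v's component joins u's */
    return 1;
}
```

After initComponents and one call of considerEdge per edge, two nodes are in the same connected component exactly when their entries in component are equal.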



Major themes:
1. Patterns and Pattern Matching
2. Finite State Machines and Automata
3. Deterministic and Nondeterministic Automata

Reading: Sections 10.2 and 10.3.

7.1 PATTERNS AND PATTERN MATCHING

A pattern is a set of objects with some recognizable property. One type of pattern is a set of character strings, such as the set of legal C identifiers, each of which is a string of letters, digits, and underscores, beginning with a letter or underscore.

Given a pattern and an input, the process of determining whether the input matches the pattern is called pattern matching, a problem also known as pattern recognition. In compiling, for example, one of the essential parts is to recognize construct patterns in programs before translating the programs into a desired code. Let's see an illustration for the first phase of this process. Consider an if-statement in C,

    if (a==b) x = 1;

A C compiler will read input characters from the left, one at a time, and collect them into small groups of characters (lexemes or tokens), each matching some lexical pattern. This phase is called lexical analysis. Our statement, for example, may be grouped into the following tokens, each of which has its own pattern:
1. The keyword if
2. The left parenthesis (
3. The identifier a
4. The comparison operator ==
5. The identifier b
6. The right parenthesis )
7. The identifier x
8. The assignment operator =
9. The integer 1
10. The statement-terminator ;

White space characters (blanks, tabs, and newlines) would also be eliminated.

7.2 STATE MACHINES AND AUTOMATA

Programs that search for patterns often have a special structure. We can identify certain positions in the code at which we know something particular about the program's progress toward its goal of finding an instance of a pattern. We call these positions states. The overall behavior of the program can be viewed as moving from state to state as it reads its input.

To see the behavior of such a program, we can draw a graph with a node for each state and an arc for each move from state to state (called a transition). A graph for a program recognizing English words with five vowels in order is shown below:

[Figure: an automaton with the states 0 to 5 for recognizing words with the five vowels in order; its arcs are labeled with the vowels a, e, i, o, and u.]

There are two important states in this graph, one with an incoming arc labeled start (state 0), and the other with a double circle (state 5). The former, the start state, is the state in which we begin to recognize the pattern; the latter, the accepting state, is the state we reach after having found our pattern, at which point we "accept". There may be several accepting states but only one start state.

Such a graph is called a finite automaton or just automaton. We can design a pattern-matching program by first designing the automaton, then mechanically translating it into a program. I will give an example in the next section.

An automaton can be viewed as a state machine consisting of a finite control, an input tape, and a head to read a sequence of symbols written on the tape. At any time during its operation, the machine reads a symbol on the tape, changes its state, and moves the head one symbol to the right. A picture of such a machine is shown in the figure on the next page.



[Figure: a state machine with a finite control and a head reading symbols from an input tape.]
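The five-vowel automaton of this section can be sketched as a loop that keeps the current state in a variable and reads the word one symbol at a time, as the head reads the tape. The function name hasVowelsInOrder is hypothetical, and the sketch assumes the word is written in lowercase letters and that the automaton stays in its current state on any symbol other than the vowel it is waiting for.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if word contains the vowels a, e, i, o, u in that order
   (not necessarily adjacent), 0 otherwise. */
int hasVowelsInOrder(const char *word)
{
    static const char wanted[] = "aeiou";  /* wanted[s] moves state s to state s + 1 */
    int state = 0;                          /* state 0 is the start state */

    assert(word != NULL);
    for (; *word != '\0' && state < 5; word++)
        if (*word == wanted[state])
            state++;                        /* transition to the next state */
    return state == 5;                      /* state 5 is the accepting state */
}
```

For example, the word "facetious" drives the automaton from state 0 to state 5, while "sequoia" leaves it in state 1 because its only a comes last.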

7.3 DETERMINISTIC AND NONDETERMINISTIC AUTOMATA

The automaton discussed in the previous section has an important property. For any state s and any input character x, there is at most one transition out of state s whose label includes x. Such an automaton is said to be deterministic.

It is straightforward to convert deterministic finite automata (DFA) into programs. We create a piece of code for each state. The code for state s examines its input and decides which of the transitions out of s, if any, should be followed. If a transition from state s to state t is selected, then the code for state s must arrange for the code of state t to be executed next, perhaps by using a goto-statement.

Suppose we have a DFA for a bounce filter.

[Figure: a DFA for a bounce filter with the states a, b, c, and d and transitions labeled 0 and 1; an arc labeled start enters state a.]

You need not understand its meaning. Just observe that the DFA has the start state a and the two accepting states c and d, and that it examines the input characters 1 and 0. From this DFA, we can mechanically produce a simple program following the guidelines above. The resulting program is given on the next page.



    #include <stdio.h>

    void bounce()
    {
        int x;   /* getchar returns an int */

        /* state a */
    a:  putchar('0');
        x = getchar();
        if (x == '0') goto a; /* transition to state a */
        if (x == '1') goto b; /* transition to state b */
        goto finis;

        /* state b */
    b:  putchar('0');
        x = getchar();
        if (x == '0') goto a; /* transition to state a */
        if (x == '1') goto c; /* transition to state c */
        goto finis;

        /* state c (accepting) */
    c:  putchar('1');
        x = getchar();
        if (x == '0') goto d; /* transition to state d */
        if (x == '1') goto c; /* transition to state c */
        goto finis;

        /* state d (accepting) */
    d:  putchar('1');
        x = getchar();
        if (x == '0') goto a; /* transition to state a */
        if (x == '1') goto c; /* transition to state c */
        goto finis;

    finis: ;
    }

Although it is easy to convert a DFA into a program, designing the DFA itself is more difficult. In fact, there is a generalization of DFAs that is conceptually more natural. This kind of automaton, called a nondeterministic finite automaton (NFA for short), may have two or more transitions containing the same symbol out of one state. Note that a DFA is technically an NFA as well, one that happens not to have multiple transitions on one symbol.



NFAs are not directly implementable by programs, but they are useful conceptual tools for a number of applications. Moreover, by using the "subset construction", it is possible to convert any NFA to a DFA that accepts the same set of character strings, but this topic is beyond our discussion. As an illustration, I show only an NFA in the following figure.

[Figure: an NFA with a start state and arcs labeled m, a, and n.]

Note that the figure uses a special symbol to indicate any legal symbol.

7.4 GLOSSARY

Pattern: Mẫu. See the definition in text.
Pattern Matching: Đối sánh mẫu, so mẫu.
Recognition: Nhận dạng.
Identifier: Định danh. A name of a data object in a program.
Character: Ký tự. Any symbol that we may input from the keyboard, including letters, digits, special symbols such as +, ^, and some nonprintable symbols.
Letter: Chữ cái.
Digit: Ký số, chữ số.
Underscore: Dấu gạch thấp _.
Input: Nguyên liệu, dữ liệu nhập.
Output: Thành phẩm, dữ liệu xuất.
Code: Mã lệnh, mã chương trình. A full program or program segment in any form, such as a high-level language or machine language.
Compilation: Quá trình biên dịch. Sometimes also translation.
Compiler: Trình biên dịch.
Interpreter: Trình thông dịch.
Translator: Chương trình dịch (nói chung).
Lexeme: Từ tố.
Token: Thẻ từ.
Assignment operator: Toán tử gán.
Statement-terminator: Dấu kết thúc câu lệnh.
Instance: Thể hiện.
Automaton, automata (pl.): Automat, Otomat.
Deterministic finite automata: Automat hữu hạn đơn định (tất định).
Nondeterministic finite automata: Automat hữu hạn đa định (không đơn định, không tất định).
State: Trạng thái.
Transition: Chuyển vị.
Start state: Khởi trạng.
Accepting state, final state: Trạng thái kiểm nhận, kết trạng.
Finite control: Bộ điều khiển hữu hạn.
Input tape: Băng nguyên liệu.
Head: Đầu đọc.



Major themes:
1. Introduction
2. Algebraic Laws for Regular Expressions

Reading: Sections 10.5 and 10.7.

8.1 INTRODUCTION

In the previous handout, we studied the finite automaton, which is, in a sense, a machine recognizing character strings with some pattern. In this handout, we meet regular expressions, which are an algebraic way to describe a string pattern. Regular expressions are analogous to the algebra of arithmetic expressions, with which we are all familiar, and to the relational algebra that we met in Handout #3.

Interestingly, the set of patterns that can be expressed in the regular-expression algebra is exactly the same as the set of patterns that can be described by automata. For example, the regular expression a | bc* can express the patterns described by the following automaton.





A set of strings denoted by a regular expression E is called a regular language and can be referred to as L(E). Given two regular languages L and M, we can construct new languages by applying the operators below one or more times to these languages:
1. The union of two languages L and M, denoted L ∪ M, is the set of strings that are in either L or M. For example, if L = {00, 01, 11} and M = {aa, ab}, the union L ∪ M = {00, 01, 11, aa, ab}.
2. The concatenation of languages L and M, denoted LM, is the set of strings that can be formed by taking any string in L and concatenating it with any string in M. For the above example, we have LM = {00aa, 01aa, 11aa, 00ab, 01ab, 11ab}.
3. The closure (star, or Kleene closure) of a language L is denoted L* and represents the set of strings that can be formed by taking any number of strings from L (including none, which gives the empty string) and concatenating all of them.

The algebra of regular expressions, like all kinds of algebras, starts with some elementary expressions, usually constants and/or variables denoting languages. We then construct more expressions by applying the set of three operators (union, concatenation, star) to these elementary expressions and to previously constructed expressions. In particular:
1. The empty string is denoted by the regular expression ε.
2. A symbol a is denoted by the regular expression a.
3. Suppose R and S are two regular expressions denoting the languages L(R) and L(S), respectively.
   a) R + S (or R | S) is a regular expression for the union of the languages L(R) and L(S).
   b) RS is a regular expression for the concatenation of the languages L(R) and L(S).
   c) R* is a regular expression for the Kleene closure of the language L(R).

For a simple example, suppose letter is any character in the alphabet, digit is any decimal digit, and under is the symbol _. We can define a regular expression as a pattern of legal identifiers in C as follows:

    identifiers = (letter|under)(letter|digit|under)*

As in arithmetic expressions, we can use parentheses to group regular expressions.

8.2 ALGEBRAIC LAWS FOR REGULAR EXPRESSIONS

It is possible for two regular expressions to denote the same language, just as two arithmetic expressions can denote the same function of their operands. As an example, the arithmetic expressions x + y and y + x each denote the same function of x and y, because addition is commutative. Similarly, the regular expressions R + S and S + R denote the same language.

We now list some common algebraic laws for regular expressions without proofs. More useful laws can be found in the textbook.
1. Commutativity of union. (R | S) ≡ (S | R)
2. Associativity of union. ((R | S) | T) ≡ (R | (S | T))
3. Associativity of concatenation. ((RS)T) ≡ (R(ST))
4. Left distributivity of concatenation over union. (R(S | T)) ≡ (RS | RT)
5. Right distributivity of concatenation over union. ((S | T)R) ≡ (SR | TR)
6. Idempotence of union. (R | R) ≡ R
7. RR* ≡ R*R

8.3 GLOSSARY

Regular Expression: Biểu thức chính quy.
Arithmetic Expression: Biểu thức số học.
Algebra: Đại số.
algebra of regular expressions: Đại số biểu thức chính quy.
algebra of arithmetic expressions: Đại số biểu thức số học.
set algebra: Đại số tập hợp.
relational algebra: Đại số quan hệ.
Algebraic law: Luật đại số (tính chất đại số).
Commutative law, commutativity: Luật giao hoán, tính giao hoán.
Associative law, associativity: Luật kết hợp, tính kết hợp.
Distributive law, distributivity: Luật phân phối, tính phân phối.
Idempotence: Tính lũy đẳng.
Regular language, regular set: Ngôn ngữ chính quy, tập chính quy.
Union: Phép hợp.
Concatenation: Phép ghép nối.
Star, Kleene closure: Phép toán sao, phép lấy bao đóng Kleene.
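As a supplement to Section 8.1, the identifier pattern (letter|under)(letter|digit|under)* can be checked by a short C function whose two steps mirror the two factors of the regular expression. The function name isIdentifier is hypothetical, letters are taken to be whatever the library function isalpha accepts, and the sketch does not exclude C keywords.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return 1 if s matches (letter|under)(letter|digit|under)*, else 0. */
int isIdentifier(const char *s)
{
    assert(s != NULL);

    /* (letter|under): exactly one leading letter or underscore */
    if (!(isalpha((unsigned char) *s) || *s == '_'))
        return 0;

    /* (letter|digit|under)*: zero or more further characters */
    for (s++; *s != '\0'; s++)
        if (!(isalnum((unsigned char) *s) || *s == '_'))
            return 0;
    return 1;
}
```

The star corresponds to the loop, which may run zero times, so a single letter such as x is already a legal identifier.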



Major themes:
1. Context-Free Grammars
2. Languages from Grammars

Reading: Sections 11.2 and 11.3.

9.1 CONTEXT-FREE GRAMMARS

In the last two handouts, we met two equivalent ways to describe patterns. In this handout, we shall see another way, called context-free grammars (or "grammars"), which is even more powerful in the sense that grammars can describe more languages than the other two.

Suppose we want to define arithmetic expressions that involve
1. The four binary operators, +, −, *, and /,
2. Parentheses for grouping, and
3. Operands that are numbers.

The usual definition is of the form:

BASIS. A number is an expression.

INDUCTION. If E is an expression, then each of the following is also an expression.
1. ( E ). That is, we may place parentheses around an expression to get a new expression.
2. E + E. That is, two expressions connected by a plus sign form an expression.
3. E − E. This and the next two rules are analogous to (2) with the other operators.
4. E * E.
5. E / E.



To be more succinct, we can use a grammar to define our expressions:

(1) <Expression> → number
(2) <Expression> → ( <Expression> )
(3) <Expression> → <Expression> + <Expression>
(4) <Expression> → <Expression> − <Expression>
(5) <Expression> → <Expression> * <Expression>
(6) <Expression> → <Expression> / <Expression>

The symbol <Expression> is called a syntactic category or a variable, which stands for any arithmetic expression. The symbol → means "can be composed of". For example, rule (2) states that an expression can be composed of a left parenthesis followed by any string that is an expression followed by a right parenthesis.

There are three kinds of symbols that appear in grammars.
1. The first are "metasymbols," symbols that play special roles and do not stand for themselves. The only example we have seen so far is the symbol →, which is used to separate the syntactic category being defined from a way in which strings of that syntactic category may be composed.
2. The second kind of symbol is a syntactic category, which as we mentioned represents a set of strings being defined.
3. The third kind of symbol is called a terminal. Terminals can be characters such as + or (, or they can be any abstract symbol that is known or need not be defined in the grammar. The symbol number in our grammar is of this kind.

A context-free grammar consists of one or more productions. Each line in our grammar is a production. In general, a production has three parts:
1. A head, which is the syntactic category on the left side of the arrow,
2. The metasymbol →, and
3. A body, consisting of zero or more syntactic categories and/or terminals on the right side of the arrow.

Our grammar for simple expressions has six productions, numbered 1 to 6. We can augment the grammar for expressions by providing productions for number, a symbol that has so far been viewed as a terminal, and productions for a new syntactic category <Digit>. Three more productions can be added to our working grammar.

(7) <Digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
(8) <Number> → <Digit>
(9) <Number> → <Number> <Digit>



In fact, the production for <Digit> is shorthand for ten productions, one for each of the ten decimal digits. The vertical bar | is a metasymbol that separates the alternative bodies.

    <Digit> → 0
    <Digit> → 1
    ...
    <Digit> → 9

A more complex grammar for expressions is then:

    (1) <Expression> → <Number>
    (2) <Expression> → ( <Expression> )
    (3) <Expression> → <Expression> + <Expression>
    (4) <Expression> → <Expression> - <Expression>
    (5) <Expression> → <Expression> * <Expression>
    (6) <Expression> → <Expression> / <Expression>
    (7) <Digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    (8) <Number> → <Digit>
    (9) <Number> → <Number> <Digit>
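Productions (7) to (9) translate almost directly into a recursive test of whether a string is a <Number>. The following is a minimal sketch in C and is not part of the original handout. The function names is_digit_char and is_number are illustrative choices, and the string is passed together with its length so that the recursion can peel off the last character.

```c
#include <stdbool.h>
#include <stddef.h>

/* Production (7): <Digit> -> 0 | 1 | ... | 9 */
bool is_digit_char(char c)
{
    return c >= '0' && c <= '9';
}

/* Productions (8) and (9): a <Number> is a single <Digit>,
 * or a shorter <Number> followed by a <Digit>.
 * s points at the characters and len is how many of them to examine. */
bool is_number(const char *s, size_t len)
{
    if (len == 0)
        return false;                 /* no production derives the empty string */
    if (!is_digit_char(s[len - 1]))
        return false;                 /* both bodies end in a <Digit> */
    if (len == 1)
        return true;                  /* production (8): <Number> -> <Digit> */
    return is_number(s, len - 1);     /* production (9): <Number> -> <Number> <Digit> */
}
```

For example, is_number("14", 2) returns true: 1 is a <Number> by production (8), and 14 is that <Number> followed by the <Digit> 4, by production (9).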

We can also describe the structure of control flow in a language like C grammatically. For a simple example, it helps to imagine that there are abstract terminals condition and simpleStat. The former stands for a conditional expression. We could replace this terminal by a syntactic category, say <Condition>. The productions for <Condition> would resemble those of our expression grammar above, but with logical operators like &&, comparison operators like <, and the arithmetic operators. The terminal simpleStat stands for a statement that does not involve nested control structure, such as an assignment, a function call, break, continue, or return. Again, we could replace this terminal by a syntactic category and the productions to expand it. In the grammar for statements below, we use keywords like if, else, and while, and punctuators like { and ;, as terminals. The symbol ε denotes the empty string, so a <StatList> may contain no statements at all.

    <Statement> → while ( condition ) <Statement>
    <Statement> → if ( condition ) <Statement>
    <Statement> → if ( condition ) <Statement> else <Statement>
    <Statement> → { <StatList> }
    <Statement> → simpleStat ;

    <StatList> → ε
    <StatList> → <StatList> <Statement>
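As an illustration, the small C function below is annotated with the production that each of its statements matches. It is only a sketch: the function collatz_steps is a hypothetical example chosen for this note, it assumes n is at least 1, and its opening declaration falls outside the grammar, which covers statements only.

```c
/* Counts how many halving or tripling steps it takes to bring n down to 1. */
int collatz_steps(unsigned long n)
{
    int steps = 0;            /* a declaration, not covered by the <Statement> grammar */

    while (n > 1) {           /* <Statement> -> while ( condition ) <Statement>        */
                              /* the loop body is <Statement> -> { <StatList> }        */
        if (n % 2 == 0)       /* <Statement> -> if ( condition ) <Statement>           */
            n = n / 2;        /*                else <Statement>                       */
        else                  /* each branch is <Statement> -> simpleStat ;            */
            n = 3 * n + 1;
        steps = steps + 1;    /* second <Statement> of the <StatList>                  */
    }
    return steps;             /* <Statement> -> simpleStat ;                           */
}
```

The body of the while loop is a <StatList> built by applying <StatList> → <StatList> <Statement> twice, starting from <StatList> → ε.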



9.2 LANGUAGES FROM GRAMMARS

A grammar is essentially an inductive definition involving sets of strings. Thus, from a grammar for a syntactic category, we can produce the set of strings that belong to this syntactic category by repeatedly applying the productions to get more and more strings. If a grammar has more than one syntactic category, then by convention the syntactic category whose strings we want to generate is written first. In some compiler textbooks, this syntactic category is called the start symbol. For example, in our first grammar, the start symbol is <Expression>, whereas in the second it is <Statement>.

9.3 GLOSSARY

Grammar: Văn phạm.
Context-free grammar: Văn phạm phi ngữ cảnh.
Syntax: Cú pháp.
Syntactic Category: Phạm trù cú pháp.
Plus sign: Dấu cộng.
Minus sign: Dấu trừ.
Metasymbol: Meta ký hiệu.
Terminal: Ký hiệu tận, tận.
Nonterminal: Ký hiệu chưa tận, chưa tận.
Production: Luật sinh.
Head: Đầu (luật sinh).
Body: Thân (luật sinh).
Decimal Digit: Ký số thập phân.
Control Structure: Cấu trúc điều khiển.
Start Symbol: Ký hiệu khởi đầu, khi t.


K5 & K6, Computer Science Department, Van Lang University
Second semester -- Feb, 2002
Instructor: Tran Duc Quang

Major themes:
1. Parse Trees
2. Constructing a Parse Tree
Reading: Section 11.4.

10.1 PARSE TREES

As we briefly discussed in the previous handout, we can discover that a string s belongs to the language L(<S>), for some syntactic category <S>, by the repeated application of productions:
1. Start with some strings derived from basis productions, those that have no syntactic category in the body.
2. Then "apply" productions to strings already derived for the various syntactic categories. Each application involves substituting strings for occurrences of the various syntactic categories in the body of the production, and thereby constructing a string that belongs to the syntactic category of the head.
3. Eventually, construct the string s by applying a production with <S> at the head.

It is often useful to draw the "proof" that s is in L(<S>) as a tree, which we call a parse tree. The nodes of a parse tree are labeled by terminals, by syntactic categories, or by the symbol ε.
1. The leaves are labeled only by terminals or ε, and
2. The interior nodes are labeled only by syntactic categories.
3. Every interior node v represents the application of a production. That is, there must be some production such that:
   a) The syntactic category labeling v is the head of the production, and
   b) The labels of the children of v, from the left, form the body of the production.



Here is the parse tree for the string 3*(2+14) using the grammar in Handout #9. We have abbreviated the syntactic categories <Expression>, <Number>, and <Digit> to <E>, <N>, and <D>, respectively.

                 <E>
              /   |       \
         <E>      *              <E>
          |                 /     |     \
         <N>           (         <E>         )
          |                    /  |  \
         <D>                <E>   +   <E>
          |                  |         |
          3                 <N>       <N>
                             |       /   \
                            <D>   <N>     <D>
                             |     |       |
                             2    <D>      4
                                   |
                                   1

The string 3*(2+14) is called the yield of the above parse tree.

10.2 CONSTRUCTING A PARSE TREE

To see how a parse tree can be built, let us follow the construction of the parse tree shown in the figure. The grammar is reproduced for easy reference.

    (1) <E> → <N>
    (2) <E> → ( <E> )
    (3) <E> → <E> + <E>
    (4) <E> → <E> - <E>
    (5) <E> → <E> * <E>
    (6) <E> → <E> / <E>
    (7) <D> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    (8) <N> → <D>
    (9) <N> → <N> <D>



1. First, construct a one-node tree for each terminal in the string:

    3   *   (   2   +   1   4   )

2. For the terminals 1, 2, 3, and 4, apply production (7) to get four two-node trees.

    <D>     <D>     <D>     <D>
     |       |       |       |
     1       2       3       4
    (a)     (b)     (c)     (d)

3. Apply production (8), or <N> → <D>, to the trees (a), (b), and (c) to obtain the following three trees:

    <N>     <N>     <N>
     |       |       |
    <D>     <D>     <D>
     |       |       |
     1       2       3
    (a)     (b)     (c)

4. Now, apply production (9), or <N> → <N> <D>, to the tree (a) in step 3 and the tree (d) in step 2 to get the tree for 14.

      <N>
      / \
    <N> <D>
     |   |
    <D>  4
     |
     1



5. The three parse trees below are constructed by production (1), or <E> → <N>.

    <E>     <E>       <E>
     |       |         |
    <N>     <N>       <N>
     |       |        / \
    <D>     <D>     <N> <D>
     |       |       |   |
     3       2      <D>  4
                     |
                     1
    (a)     (b)       (c)

6. Next, use production (3), or <E> → <E> + <E>, for (b) and (c) in step 5, and + in step 1, to construct a new parse tree with the yield 2+14.

         <E>
       /  |  \
    <E>   +   <E>
     |         |
    <N>       <N>
     |       /   \
    <D>   <N>     <D>
     |     |       |
     2    <D>      4
           |
           1

7. Applying production (2), or <E> → ( <E> ), to the resulting tree in step 6, together with ( and ) in step 1, we have the parse tree with the yield (2+14), shown in the figure below.

8. The overall parse tree for the string 3*(2+14), shown in Section 10.1, is produced by applying production (5), or <E> → <E> * <E>, to the parse tree (a) in step 5, * in step 1, and the parse tree of step 7.
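The eight steps can be mirrored in a short piece of C that builds the tree bottom-up and then reads off its yield. This is a sketch under assumptions rather than the handout's own code: the struct node layout and the names mk, leaf, unary, build_example, and tree_yield are invented for illustration, labels are stored as C strings, and nodes are never freed.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHILDREN 3   /* no production body in this grammar has more than three symbols */

struct node {
    const char *label;                  /* a terminal such as "3" or a category such as "<E>" */
    int nchildren;                      /* 0 for a leaf */
    struct node *child[MAX_CHILDREN];   /* children, from the left */
};

/* Builds a node whose children are c0, c1, c2 (NULL marks an unused trailing slot). */
struct node *mk(const char *label, struct node *c0, struct node *c1, struct node *c2)
{
    struct node *v = malloc(sizeof *v);
    if (v == NULL) { perror("malloc"); exit(EXIT_FAILURE); }
    v->label = label;
    v->child[0] = c0; v->child[1] = c1; v->child[2] = c2;
    v->nchildren = (c0 != NULL) + (c1 != NULL) + (c2 != NULL);
    return v;
}

static struct node *leaf(const char *s) { return mk(s, NULL, NULL, NULL); }
static struct node *unary(const char *cat, struct node *c) { return mk(cat, c, NULL, NULL); }

/* Follows the construction of the parse tree for 3*(2+14) step by step. */
struct node *build_example(void)
{
    /* Steps 1 and 2: a leaf for each digit, with <D> -> digit above it. */
    struct node *d1 = unary("<D>", leaf("1"));
    struct node *d2 = unary("<D>", leaf("2"));
    struct node *d3 = unary("<D>", leaf("3"));
    struct node *d4 = unary("<D>", leaf("4"));

    /* Step 3: <N> -> <D> for 1, 2, and 3. */
    struct node *n1 = unary("<N>", d1);
    struct node *n2 = unary("<N>", d2);
    struct node *n3 = unary("<N>", d3);

    /* Step 4: <N> -> <N> <D> gives the tree for 14. */
    struct node *n14 = mk("<N>", n1, d4, NULL);

    /* Step 5: <E> -> <N> for 3, 2, and 14. */
    struct node *e3 = unary("<E>", n3);
    struct node *e2 = unary("<E>", n2);
    struct node *e14 = unary("<E>", n14);

    /* Step 6: <E> -> <E> + <E>, with yield 2+14. */
    struct node *sum = mk("<E>", e2, leaf("+"), e14);

    /* Step 7: <E> -> ( <E> ), with yield (2+14). */
    struct node *paren = mk("<E>", leaf("("), sum, leaf(")"));

    /* Step 8: <E> -> <E> * <E>, with yield 3*(2+14). */
    return mk("<E>", e3, leaf("*"), paren);
}

/* Appends the leaf labels of v, from the left, to the string in out (capacity cap). */
void tree_yield(const struct node *v, char *out, size_t cap)
{
    if (v->nchildren == 0) {
        strncat(out, v->label, cap - strlen(out) - 1);
        return;
    }
    for (int i = 0; i < v->nchildren; i++)
        tree_yield(v->child[i], out, cap);
}
```

Calling tree_yield on the result of build_example(), with a buffer that starts out as the empty string, leaves 3*(2+14) in the buffer, which is the yield of the tree.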



The parse tree with the yield (2+14), constructed in step 7:

                 <E>
            /     |     \
       (         <E>         )
               /  |  \
            <E>   +   <E>
             |         |
            <N>       <N>
             |       /   \
            <D>   <N>     <D>
             |     |       |
             2    <D>      4
                   |
                   1

10.3 GLOSSARY

Parsing: Phân tích cú pháp. Also syntax analysis.
Parse tree: Cây phân tích cú pháp.
Syntax tree: Cây cú pháp. A compacted parse tree; also expression tree or operator tree.
Yield: Hoa lợi (của cây phân tích cú pháp).