Data Structures for Language Processing

• The tradeoff between the memory requirements and the search efficiency of a data structure is a fundamental principle of systems programming.
• A language processor makes frequent use of the search operation over its data structures.

Classification of data structures
• Data structures can be classified according to the following criteria:
a) Nature of the data structure: whether it is a linear or a nonlinear data structure.
b) Purpose of the data structure: whether it is a search data structure or an allocation data structure.
c) Lifetime of the data structure: whether it is used during language processing or during target program execution.
Linear and nonlinear data structures
• A linear data structure consists of a linear arrangement of elements in memory.
• It facilitates efficient search.
• It requires a contiguous area of memory for its elements.
• Its size must therefore be fixed in advance, which forces an overestimate of the memory requirements and leads to memory wastage.
Nonlinear data structures
• The elements of a nonlinear data structure are accessed using pointers.
• The elements need not occupy contiguous areas of memory.
• Pointer-based access leads to lower search efficiency.
[Figure: a linear data structure and a nonlinear data structure, with elements labelled A to H]
Search and allocation data structures
• Search data structures are used during language processing to maintain attribute information concerning different entities in the source program.
• The attribute entry for an entity is created only once but may be searched a large number of times, so search efficiency is important.
• Allocation data structures are characterized by the fact that the address of the memory area allocated to an entity is known to the users of that entity.
• No search operations are conducted on them. Speed of allocation and deallocation and efficiency of memory utilization are the important criteria.
Lifetime of data structures
• A language processor uses both search and allocation data structures during its operation.
• Search data structures are used to constitute the different tables.
• Allocation data structures are used to handle programs with nested structures of some kind.
• A target program rarely uses search data structures, but it does use allocation data structures.

program Sample(input, output);
var
  x, y : real;
  i : integer;
procedure calc(var a, b : real);
var
  sum : real;
begin
  sum := a + b;
  ---------
end calc;
• Symbol tables are created for both the main program and calc, say Symtabsample and Symtabcalc.
• They are constructed as search data structures.
• Their attributes are bound during compilation and searched many times.
var p : ^integer;
begin
  new(p);
• The call new(p) allocates memory and stores its address in p.
• No search is involved.
Search data structures
• A search data structure (or search structure) is a set of entries, each entry accommodating the information concerning one entity.
• Each entry is assumed to contain a key field, which forms the basis for a search.
• The key field is the symbol field containing the name of an entity.
Entry formats
• Each entry in a search data structure is a set of fields.
• It consists of two parts: a fixed part and a variant part. Each part consists of a set of fields.
• Fields of the fixed part exist in each entry of the search structure.
• The value in a tag field of the fixed part determines the information to be stored in the variant part of the entry.
• For each value v_i in the tag field, the variant part of the entry consists of the set of fields SF_vi.
Entries in the symbol table of a compiler have the following fields:
Fixed part: symbol and class; class is the tag field.

Tag value         Variant part
---------         ------------
Variable          type, length, dimension information
Procedure name    address of parameter list, number of parameters
Function name     type of return value, length of return value, address of parameter list, number of parameters
Label             statement number
Fixed and variable length entries
• An entry may be declared as a record or a structure of the language in which the language processor is implemented.
• In the fixed length entry format, each record is defined to consist of the following fields:
1) Fields in the fixed part of the entry.
2) ∪_vi SF_vi, i.e. the set of fields in all variant parts of the entry.
• All records in the fixed length entry format have an identical format.
• This enables an efficient search procedure, but it wastes memory because the variant fields that do not apply to an entry remain unused.
• In the variable length entry format, a record consists of the following fields:
1) Fields in the fixed part of the entry, including the tag field.
2) { f_j | f_j ∈ SF_vj if tag = v_j }
• In this entry format no memory wastage occurs.
a) Fixed length entry format
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
1. symbol             6. parameter list address
2. class              7. number of parameters
3. type               8. type of return value
4. dimension info     9. length of return value
5. length             10. statement number
b) Variable length entry for a label
| 1 | 2 | 3 |
1. name   2. class   3. statement number

• When a variable length entry format is used, the search method may require knowledge of the length of the entry. In that case each entry consists of:
1) A length field.
2) Fields in the fixed part of the entry, including the tag field.
3) { f_j | f_j ∈ SF_vj if tag = v_j }
| length | entry |
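The difference between the two entry formats can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names FIXED_FIELDS, VARIANT_FIELDS, fixed_length_fields and variable_length_fields are invented for this example, and the field names are taken from the symbol table fields listed earlier.

```python
# Fields of the fixed part; "class" is the tag field.
FIXED_FIELDS = ["symbol", "class"]

# SF_v for each tag value v (hypothetical spelling of the field names).
VARIANT_FIELDS = {
    "variable": ["type", "length", "dimension info"],
    "procedure name": ["parameter list address", "number of parameters"],
    "function name": [
        "type of return value",
        "length of return value",
        "parameter list address",
        "number of parameters",
    ],
    "label": ["statement number"],
}


def fixed_length_fields():
    """Fixed part plus the union of all variant parts (same for every entry)."""
    fields = list(FIXED_FIELDS)
    for variant in VARIANT_FIELDS.values():
        for name in variant:
            if name not in fields:
                fields.append(name)
    return fields


def variable_length_fields(tag):
    """Fixed part plus only the variant fields selected by the tag value."""
    return FIXED_FIELDS + VARIANT_FIELDS[tag]
```

Under these assumptions the fixed length format carries all ten fields in every entry, while a label entry in the variable length format carries only three.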
Hybrid entry format
• The hybrid entry format combines the access efficiency of the fixed length entry format with the memory efficiency of the variable length entry format.
• In this format, each entry is split into two parts, the fixed part and the variable part.
• A pointer field is added to the fixed part. It points to the variable part of the entry.
• The fixed and variable parts are accommodated in two different data structures.
• The fixed parts of all entries are organized into a search (linear) data structure, while the variable parts are put into an allocation data structure.
[Figure: hybrid entry format. Fixed part | pointer --> length | variable part]
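A minimal sketch of the hybrid format follows, assuming two Python lists stand in for the two data structures and a list index stands in for the pointer. The names FixedPart, fixed_table, variable_store, add and lookup, and the sample attribute values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FixedPart:
    symbol: str
    cls: str       # tag field
    pointer: int   # position of the variable part in variable_store


fixed_table = []      # stands in for the linear search data structure
variable_store = []   # stands in for the allocation data structure


def add(symbol, cls, variant_fields):
    """Store the variable part separately and keep a pointer to it."""
    variable_store.append(variant_fields)
    fixed_table.append(FixedPart(symbol, cls, len(variable_store) - 1))


def lookup(symbol):
    """Search only the fixed parts, then follow the pointer."""
    for entry in fixed_table:
        if entry.symbol == symbol:
            return entry.cls, variable_store[entry.pointer]
    return None


add("x", "variable", {"type": "real", "length": 8})
add("calc", "procedure name",
    {"parameter list address": 200, "number of parameters": 2})
```

The search touches only the uniform fixed parts, and the variable part is reached directly through the pointer without any search.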

Operations on search structures
1. Operation add: add the entry of a symbol.
2. Operation search: search for and locate the entry of a symbol.
3. Operation delete: delete the entry of a symbol.
• An entry is created once and searched many times. Deletion is not common.
Algorithm: Generic search procedure
1. Make a prediction concerning the entry of the search data structure that symbol s may be occupying. Let it be entry e.
2. Let s_e be the symbol occupying the e-th entry. Compare s and s_e. Exit with success if they match.
3. Repeat steps 1 and 2 until it can be concluded that the symbol does not exist in the search data structure.
• The nature of the prediction varies with the organization of the search data structure.
• Each comparison of step 2 is called a probe.
• The efficiency of a search procedure is determined by the number of probes it performs.
• We use the following notation:
Ps : number of probes in a successful search
Pu : number of probes in an unsuccessful search
Table organizations
• A table is a linear data structure. Its entries occupy adjoining memory locations.
• If the location of an entry is given, it is meaningful to talk of the next entry and the previous entry.
• Tables using the fixed length entry organization possess the property of positional determinacy.
• This property states that the address of an entry in a table can be determined from its entry number. For example, the address of the e-th entry is
a + (e - 1) × l
where a is the address of the first entry and l is the length of an entry.
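As a small illustration of positional determinacy, the sketch below computes entry addresses from entry numbers. The function name entry_address and the sample values (a table at address 1000 with 16-byte entries) are assumptions made for this example.

```python
def entry_address(a: int, e: int, l: int) -> int:
    """Address of the e-th entry (numbered from 1) of a fixed length table.

    a is the address of the first entry and l is the length of one entry.
    """
    if e < 1:
        raise ValueError("entry numbers start at 1")
    return a + (e - 1) * l


# Hypothetical table starting at address 1000 with 16-byte entries.
first = entry_address(1000, 1, 16)
fourth = entry_address(1000, 4, 16)
```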
Sequential search organization
[Figure: a table with entries #1, #2, #3, ..., #f, ..., #n]
n : number of entries in the table
f : number of occupied entries
Search for a symbol in sequential search
• The prediction is that symbol s occupies the next entry of the table, where next = 1 to start with.
• Assuming that all active entries in the table have the same probability of being accessed:
Ps = f/2 for a successful search
Pu = f for an unsuccessful search
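The sequential search can be sketched as follows, counting one probe per comparison. The function name sequential_search, the returned pair, and the sample table are assumptions for illustration only.

```python
def sequential_search(table, f, s):
    """Search the first f entries of table for symbol s.

    Returns (entry_number, probes); entry_number is counted from 1
    and is None when the search is unsuccessful.
    """
    probes = 0
    for e in range(1, f + 1):
        probes += 1                  # each comparison of s with s_e is a probe
        if table[e - 1] == s:
            return e, probes
    return None, probes


# Hypothetical table with n = 6 entries, of which f = 4 are occupied.
table = ["alpha", "beta", "gamma", "delta", None, None]
found = sequential_search(table, 4, "gamma")
missing = sequential_search(table, 4, "omega")
```

In this sketch an unsuccessful search always makes f probes, in line with Pu = f.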
Add a symbol
• Following an unsuccessful search, a symbol may be entered in the table using an add operation.
• The symbol is added to the first free entry in the table.
• The value of f is updated accordingly.
Delete a symbol
• Deletion of an entry can be implemented in two ways: physical deletion and logical deletion.
• In physical deletion, an entry is deleted by erasing it or by overwriting it.
• If the d-th entry is to be deleted, entries d+1 to f can each be shifted up by one entry.
• This would require (f - d) shift operations.
• An efficient alternative is to move the f-th entry to the d-th position.
• This requires only one shift operation.
• Physical deletion changes the entry numbers of symbols, which interferes with the representation of a symbol in the intermediate code (IC).
Logical deletion
• Logical deletion of an entry is performed by adding some information to the entry to indicate its deletion.
• A new field is therefore introduced to indicate whether an entry is active or deleted:
| Active/Deleted | Symbol | Other info |
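A minimal sketch of logical deletion, assuming each entry carries a boolean active field. The names Entry, logical_delete and search are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    symbol: str
    info: str = ""
    active: bool = True   # plays the role of the Active/Deleted field


def search(table, s):
    """Return the entry number (from 1) of the active entry for s, or None."""
    for e, entry in enumerate(table, start=1):
        if entry.active and entry.symbol == s:
            return e
    return None


def logical_delete(table, s):
    """Mark the entry for s as deleted without moving any other entry."""
    e = search(table, s)
    if e is not None:
        table[e - 1].active = False
    return e


table = [Entry("x"), Entry("y"), Entry("i")]
deleted_at = logical_delete(table, "y")
```

Because no entry is moved, the entry numbers of the remaining symbols stay unchanged, which is what physical deletion cannot guarantee.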

Binary search
1. start := 1; end := f;
2. While start <= end
a) e := (start + end) div 2; exit with success if s = s_e.
b) If s < s_e then end := e - 1; else start := e + 1;
3. Exit with failure.
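The steps above can be sketched in Python as follows. The sketch assumes the table is sorted on the symbol field and rounds the midpoint down; the function name binary_search and the sample keyword table are assumptions for illustration.

```python
def binary_search(table, f, s):
    """Binary search over the first f entries of a sorted table.

    Returns (entry_number, probes); entry_number is counted from 1
    and is None when the search is unsuccessful.
    """
    start, end = 1, f
    probes = 0
    while start <= end:
        e = (start + end) // 2       # midpoint, rounded down
        probes += 1
        s_e = table[e - 1]
        if s == s_e:
            return e, probes
        if s < s_e:
            end = e - 1
        else:
            start = e + 1
    return None, probes


# Hypothetical keyword table in alphabetical order, f = 8.
keywords = ["begin", "do", "else", "end", "if", "then", "var", "while"]
hit = binary_search(keywords, 8, "if")
miss = binary_search(keywords, 8, "for")
```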
• For a table containing f entries, Ps <= log2 f and Pu = log2 f.
• Search performance is logarithmic in the size of the table.
• The entries must be kept in order of their symbols, so this organization forbids both additions and deletions during language processing.
• The binary search organization is therefore suitable only for a table containing a fixed set of symbols, e.g. the table of keywords.
Hash table organization
• The search prediction depends on the value of s, i.e. e is a function of s.
• Three possibilities exist concerning the predicted entry:
a) The entry may be occupied by s.
b) The entry may be occupied by some other symbol.
c) The entry may be empty.
• Case (b) is called a collision.
Algorithm: Hash table management
1. e := h(s)
2. Exit with success if s = s_e, and with failure if entry e is unoccupied.
3. Repeat steps 1 and 2 with different functions h' and h''.
• The function h is called a hashing function.
Notations
n : number of entries in the table
f : number of occupied entries in the table
ρ : occupation density of the table, i.e. f/n
k : number of distinct symbols in the source language
kp : number of symbols used in some source program
Sp : set of symbols used in some source program
N : address space of the table, i.e. the space formed by the entries 1 ... n
K : key space of the system, i.e. the space formed by enumerating all the symbols of the source language. We will denote it as 1 ... k
Kp : key space of the program, i.e. 1 ... kp
Hashing
• A hashing function has the property that 1 <= h(symb) <= n, where symb is any valid symbol of the source language.
• If k <= n, we can select a one-to-one function as the hashing function h. This eliminates collisions.
• We refer to this organization as the direct entry organization.
Hashing functions
• While hashing, the representation of s is treated as a binary number.
• The hashing function performs a numerical transformation on this number to obtain e.
• Let the representation of s have b bits, and let the computer use m-bit arithmetic.
• To apply numerical transformations we need an m-bit representation of s. We call it r_s.
• If b <= m, the representation of s can be padded with zeros to obtain r_s.
• If b > m, the representation of s is split into pieces of m bits each, and bitwise exclusive OR operations are performed on these pieces to obtain r_s. This method is called folding.
• The hashing function h is now applied to r_s.
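Folding can be sketched as below. The sketch assumes that the representation of s is its ASCII encoding, that m = 16 by default, and that the m-bit pieces are taken starting from the low-order end; the function name fold is hypothetical.

```python
def fold(symbol: str, m: int = 16) -> int:
    """Return r_s, an m-bit value derived from the representation of symbol.

    If the representation has at most m bits it is returned unchanged
    (that is, padded with leading zeros); otherwise its m-bit pieces
    are combined with bitwise exclusive OR.
    """
    value = int.from_bytes(symbol.encode("ascii"), "big")
    mask = (1 << m) - 1
    r_s = 0
    while value:
        r_s ^= value & mask   # fold in the lowest m bits
        value >>= m           # move on to the next piece
    return r_s
```

For example, under these assumptions a four-character symbol with m = 16 is split into two 16-bit pieces that are combined with a single exclusive OR.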
Properties of a hashing function
1. The hashing function should not be sensitive to the symbols in Sp, that is, it should perform equally well for different source programs. Thus the value of Ps should depend only on kp.
2. The hashing function h should execute reasonably fast.
Popular classes of hashing functions
• Multiplication functions: these functions are analogous to the functions used in random number generation, e.g. h(s) = (a × r_s + b) mod 2^m, where a and b are constants and fixed point arithmetic is used to compute h(s). Here the table size should be a power of 2.
• Division functions: a typical division hashing function is
h(s) = (remainder of r_s / n) + 1
where n is the size of the table. If n is prime, the method is called prime division hashing.
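Both classes of function translate directly into code. In the sketch below the function names and the sample constants are assumptions chosen only to make the arithmetic easy to follow.

```python
def multiplication_hash(r_s: int, a: int, b: int, m: int) -> int:
    """h(s) = (a * r_s + b) mod 2^m, for constants a and b."""
    return (a * r_s + b) % (2 ** m)


def division_hash(r_s: int, n: int) -> int:
    """h(s) = (remainder of r_s / n) + 1, giving an entry number in 1..n."""
    return (r_s % n) + 1


# Hypothetical values: r_s = 100, a = 5, b = 3, m = 4, and a prime table size n = 7.
mult = multiplication_hash(100, 5, 3, 4)
div = division_hash(100, 7)
```

The division function always yields an entry number between 1 and n, which matches the requirement 1 <= h(symb) <= n.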
Collision handling methods
Two approaches to collision handling are:
a) To accommodate a colliding entry elsewhere in the hash table using a rehashing technique.
b) To accommodate the colliding entry in a separate table using an overflow chaining technique.
Rehashing
• Rehashing uses a sequence of hashing functions h_1, h_2, ... to resolve collisions.
• Let a collision occur while probing the table entry whose number is provided by h_i(s). We use h_(i+1)(s) to obtain a new entry number.
• A popular technique called sequential rehashing uses the recurrence relation
h_(i+1)(s) = (h_i(s) mod n) + 1
• A drawback of the rehashing technique is that a colliding entry accommodated elsewhere in the table may contribute to more collisions.
• This may lead to clustering of entries in the table.
Clustering example
• Let h(a) = h(b) = h(c) = 5 and h(d) = 6. If the symbols are entered in a table in the sequence a, b, c, d using sequential rehashing, they occupy the entries shown below:
Symbol    Entry number
------    ------------
a         5
b         6
c         7
d         8
• Entries 5 to 8 form a cluster of size 4.
• Any symbol x such that h(x) = 5 now suffers 4 collisions due to the cluster.
• The average number of collisions for a colliding entry = 1/2 × (average size of cluster + 1)
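The clustering example can be reproduced with a short sketch of sequential rehashing, using the recurrence h_(i+1)(s) = (h_i(s) mod n) + 1. The table size n = 10, the function name sequential_rehash_insert, and the dictionary of hash values are assumptions for illustration.

```python
def sequential_rehash_insert(table, n, symbol, h):
    """Enter symbol into table (entries 1..n; index 0 is unused).

    Returns (entry_number, collisions).
    """
    e = h(symbol)
    collisions = 0
    while table[e] is not None and table[e] != symbol:
        collisions += 1
        if collisions == n:
            raise OverflowError("table is full")
        e = (e % n) + 1              # next entry, wrapping from n back to 1
    table[e] = symbol
    return e, collisions


n = 10
table = [None] * (n + 1)
# Assumed hash values, matching the example: a, b, c and x hash to 5, d to 6.
home = {"a": 5, "b": 5, "c": 5, "d": 6, "x": 5}
placed = {s: sequential_rehash_insert(table, n, s, home.get) for s in "abcd"}
x_result = sequential_rehash_insert(table, n, "x", home.get)
```

Here d lands in entry 8 even though h(d) = 6, and x meets four occupied entries before a free one is found.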
