
Organizing Files for Performance
Outline
• We will be looking at four different issues:
– Data Compression: how to make files smaller
– Reclaiming space in files that have undergone deletions and updates
– Sorting files in order to support binary searching ==> Internal Sorting
– A better sorting method: Keysorting
Data Compression I: An Overview
• Question: Why do we want to make files smaller?
• Answer:
– To use less storage, i.e., to save costs
– To transmit these files faster, decreasing access time, or to use the same access time with a lower and cheaper bandwidth
– To process the file sequentially faster.
Data Compression II: Using a Different Notation ==> Redundancy Compression
• Deals mainly with pure binary encoding.
• E.g.: when referring to the state field, we used 2 ASCII bytes = 16 bits. Was that really necessary?
• Answer: If there are only 50 states, we could encode them all with only 6 bits (which fit in a single byte), thus saving 1 byte per state field.
• This means we are reducing the number of bits in each field.
• Hence the name redundancy reduction.
Data Compression II: Using a Different Notation ==> Redundancy Compression
• Disadvantages:
– Not human-readable
– Cost of encoding/decoding time
– Increased software complexity (an encoding/decoding module)
• Is this compression technique worth it?
– It depends on the application. The method is a poor choice when:
• the file is small;
• the file is accessed by many different pieces of software;
• some of that software cannot deal with binary data.
Data Compression II: Suppressing Repeating Sequences ==> Redundancy Compression
• When the data forms a sparse array (the same value repeated many times in succession), we can use a type of compression called run-length encoding.
• Procedure:
– Read through the array in sequence, copying values to the output, except where the same value occurs more than once in succession.
– When the same value occurs more than once, substitute the following 3 bytes, in order:
• the special run-length code indicator;
• the value that is repeated; and
• the number of times that the value is repeated.
Data Compression II: Suppressing Repeating Sequences ==> Redundancy Compression
E.g., the sequence of data is:
22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
The compressed data is as follows (ff is the run-length code indicator):
22 23 ff 24 07 25 ff 26 06 25 24
• There is no guarantee that space will be saved!
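A minimal sketch of this procedure, assuming the indicator byte 0xff never occurs as an ordinary data value. One deviation from the slide's rule: runs shorter than 3 are copied verbatim here, since a 3-byte code cannot save space on a run of two (the output for the example above is the same either way). Names are illustrative:

#include <cstdint>
#include <iostream>
#include <vector>

std::vector<uint8_t> rleEncode(const std::vector<uint8_t>& in) {
    const uint8_t RUN = 0xff;               // run-length code indicator
    std::vector<uint8_t> out;
    for (size_t i = 0; i < in.size(); ) {
        size_t j = i;                       // find the extent of the run
        while (j < in.size() && in[j] == in[i] && j - i < 255) ++j;
        size_t len = j - i;
        if (len >= 3) {                     // long run: emit the 3-byte code
            out.push_back(RUN);
            out.push_back(in[i]);           // the value that is repeated
            out.push_back(static_cast<uint8_t>(len));
        } else {                            // short run: copy verbatim
            for (size_t k = 0; k < len; ++k) out.push_back(in[i]);
        }
        i = j;
    }
    return out;
}

int main() {
    std::vector<uint8_t> data = {22, 23, 24, 24, 24, 24, 24, 24, 24,
                                 25, 26, 26, 26, 26, 26, 26, 25, 24};
    for (uint8_t b : rleEncode(data)) std::cout << int(b) << ' ';
    std::cout << '\n';   // prints 22 23 255 24 7 25 255 26 6 25 24 (ff = 255)
}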
Data Compression III: Assigning Variable-Length Codes
Here we assign a variable-length code to each value occurring in the data.
E.g.:
• If the letters e and t occur frequently,
– then we can assign short codes: . => e and / => t
– For other letters we can assign 2 or 3 symbols per letter, e.g., *& => a
Data Compression III: Assigning Variable-Length Codes
• Principle: assign short codes to the most frequently occurring values and long ones to the least frequent ones.
• The code size cannot be fully optimized, as one wants codes to occur in succession, without delimiters between them, and still be recognizable.
• This is the principle used in Morse code.
• It is also used in Huffman coding ==> used for compression in Unix.
Data Compression III: Assigning Variable-Length Codes
• It is also used in Huffman coding ==> used for compression.
• Given the codes a => 1, b => 010, d => 0000, e => 0001 (more frequent letters get shorter codes):
• If the string to be encoded is 'abde',
– then the code is: 1 010 0000 0001
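A small sketch of why no delimiters are needed: the decoder below reads one bit at a time and emits a letter as soon as the accumulated bits match a code word. The code table is the one inferred from the example above:

#include <iostream>
#include <map>
#include <string>

int main() {
    // Prefix-free code table implied by the slide's example.
    std::map<std::string, char> code = {
        {"1", 'a'}, {"010", 'b'}, {"0000", 'd'}, {"0001", 'e'}};

    std::string bits = "1010" "0000" "0001";   // "abde", no delimiters
    std::string buffer, decoded;
    for (char bit : bits) {
        buffer += bit;                 // accumulate bits
        auto it = code.find(buffer);
        if (it != code.end()) {        // a complete code word was recognized
            decoded += it->second;
            buffer.clear();
        }
    }
    std::cout << decoded << '\n';      // prints: abde
}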
Data Compression IV: Irreversible Compression Techniques
• Irreversible compression is based on the assumption that some information can be sacrificed. [Irreversible compression is also called entropy reduction.]
• Example: shrinking a raster image from 400-by-400 pixels to 100-by-100 pixels. The new image contains 1 pixel for every 16 pixels in the original image.
• There is usually no way to determine what the original pixels were from the one new pixel.
• In data files, irreversible compression is rarely used.
Reclaiming Space in Files
• What happens if a variable-length record is modified?
– The new record can be smaller than the old one
• In this case some of the space goes unused, which leads to internal fragmentation.
– The new record can be bigger than the old one
• In this case we can use 2 methods:
– Append the extra data at the end of the file and link it with a pointer => processing the record is slower
– Append the complete record at the end of the file => the old slot must be marked deleted and its space reclaimed
Reclaiming Space in Files
• Discussion of file organization based on:
– Record addition
– Record updating
– Record deletion
• During record addition, the data is appended at the end.
• The problems arise with updating and deletion.
Reclaiming Space in Files I: Record Deletion and Storage Compaction
• Recognizing deleted records
• Reusing the space from deleted records ==> Storage Compaction
• Storage compaction:
– makes the file smaller
• by looking for places where there is no data
• and recovering this space.
Reclaiming Space in Files I: Record Deletion and Storage Compaction
Record deletion strategy:
– We need some way to mark deleted records,
– e.g., an asterisk (*) at the start of a deleted record.
Reclaiming Space in Files I: Record Deletion and Storage Compaction
• Storage compaction:
– should not be applied after each record deletion => not cost-effective
– after deleted records have accumulated for some time, a special program is used to reconstruct the file with all the deleted records squeezed out.
• Storage compaction can be used with both fixed- and variable-length records.
Reclaiming Space in Files II: Deleting Fixed-Length Records for Reclaiming Space Dynamically
• In some applications, it is necessary to reclaim space immediately.
• To do so, we can:
– Mark deleted records in some special way
– Find the space that deleted records once occupied so that we can reuse that space when we add records
– Come up with a way to know immediately if there are empty slots in the file and jump directly to them.
Reclaiming Space in Files II: Deleting Fixed-Length Records for Reclaiming Space Dynamically
Solution:
• Use an avail linked list:
– pointers link all the records marked deleted;
– a header record points to the first deleted record in the file.
• Relative Record Numbers (RRNs) play the role of pointers.
Reclaiming Space in Files II: Deleting Fixed-Length Records for Reclaiming Space Dynamically
• We can implement the avail linked list as a stack,
• where insertions and deletions happen at one end.
• Since the records are fixed-length, an RRN can serve as a pointer.
• E.g., the list before and after pushing newly deleted record 3 is as follows:
Reclaiming Space in Files II: Deleting Fixed-Length Records for Reclaiming Space Dynamically
• Linking and stacking deleted records gives us:
– a way to know about empty slots, and
– a way to jump directly to one of those slots.
– Using a stack, both requirements can be met.
– E.g., how this is implemented is shown as follows:
Reclaiming Space in Files II: Deleting Fixed-Length Records for Reclaiming Space Dynamically
• Implementation for fixed-length records using RRNs:
– Write a simple function that returns:
• the RRN of a reusable record slot, or
• the RRN of the next record at the end of the file when no reusable slots are available (a sketch follows).
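A minimal sketch of such a pair of functions, assuming a file layout in which a long header holds the RRN of the first deleted slot (-1 if none) and fixed-length records follow; each deleted slot stores '*' plus the RRN of the next deleted slot. RECLEN, the layout, and the function names are illustrative, not the textbook's class:

#include <cstring>
#include <fstream>

const int RECLEN = 64;                  // fixed record length (assumed)

// Pop: return the RRN of a reusable slot, or the RRN of the next new
// record at the end of the file if the avail stack is empty.
long popAvail(std::fstream& f) {
    long head;
    f.seekg(0);
    f.read(reinterpret_cast<char*>(&head), sizeof head);
    if (head == -1) {                   // no reusable slot: append at end
        f.seekg(0, std::ios::end);
        return (static_cast<long>(f.tellg()) -
                static_cast<long>(sizeof head)) / RECLEN;
    }
    char rec[RECLEN];                   // reusable slot: unlink it
    f.seekg(sizeof head + head * RECLEN);
    f.read(rec, RECLEN);
    long next;
    std::memcpy(&next, rec + 1, sizeof next);   // skip the '*' marker
    f.seekp(0);
    f.write(reinterpret_cast<char*>(&next), sizeof next);
    return head;
}

// Push: mark slot 'rrn' deleted and make it the new top of the stack.
void pushAvail(std::fstream& f, long rrn) {
    long head;
    f.seekg(0);
    f.read(reinterpret_cast<char*>(&head), sizeof head);
    char rec[RECLEN] = {'*'};           // '*' marks the slot as deleted
    std::memcpy(rec + 1, &head, sizeof head);   // link to the old head
    f.seekp(sizeof head + rrn * RECLEN);
    f.write(rec, RECLEN);
    f.seekp(0);                         // new head of the avail stack
    f.write(reinterpret_cast<char*>(&rrn), sizeof rrn);
}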
Reclaiming Space in Files III: Deleting Variable-Length Records for Reclaiming Space Dynamically
To reclaim space for variable-length records we need the following:
– an avail list;
– an algorithm for adding deleted records to the avail list;
– an algorithm for finding and removing records from the avail list when we are ready to reuse their space.
Reclaiming Space in Files III: Deleting Variable-Length Records for Reclaiming Space Dynamically
An avail list for variable-length records:
– With variable-length records we need to know each record's length, so a length indicator precedes each record.
– An asterisk marks a record as deleted.
Reclaiming Space in Files III: Deleting Variable-Length Records for Reclaiming Space Dynamically
• Adding and removing records:
– It is not as simple as with fixed-length records,
– because the sizes of the slots differ.
– Hence we must add an extra condition:
• the slot must be of the right size,
• where "right size" means big enough to hold the new record.
• With a linked-list structure, we may need to traverse the full list to check whether there is a slot that fits the new record.
Reclaiming Space in Files III: Deleting Variable-Length Records for Reclaiming Space Dynamically
• Traverse the full list, checking each slot for a position where the record can be added (see the sketch below).
• E.g., adding a new record of size 55 bytes.
• If no suitable slot is found, add the record at the end of the file.
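A minimal sketch of that traversal, modeling the avail list in memory as (offset, size) pairs; the names and the in-memory model are illustrative:

#include <iostream>
#include <list>

struct Slot { long offset; int size; };

// Return the offset of the first slot big enough for 'need' bytes and
// remove it from the avail list; return -1 if no slot fits.
long firstFit(std::list<Slot>& avail, int need) {
    for (auto it = avail.begin(); it != avail.end(); ++it) {
        if (it->size >= need) {
            long off = it->offset;
            avail.erase(it);
            return off;
        }
    }
    return -1;      // caller appends the record at the end of the file
}

int main() {
    std::list<Slot> avail = {{100, 47}, {300, 72}, {500, 68}};
    std::cout << firstFit(avail, 55) << '\n';   // 300: first slot >= 55 bytes
}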
Reclaiming Space in Files IV: Storage Fragmentation
• Fixed-length records: wasted space within a record is called internal fragmentation.
• Variable-length records do not suffer from this internal fragmentation.
Reclaiming Space in Files IV: Storage Fragmentation
• With variable-length records, the only non-data field is the 2-byte count (length) field.
• A fixed-length record, by contrast, wastes space inside every record, as shown in the previous diagram.
Reclaiming Space in Files IV: Storage Fragmentation
Problem with variable-length records:
• Suppose we need to add a new record:
Reclaiming Space in Files IV: Storage Fragmentation
• After the addition it looks as follows:
• Extra space is left over, which leads to internal fragmentation.
• This can be overcome as follows:
Reclaiming Space in Files IV: Storage Fragmentation
• Place the new record in the last bytes of the slot, so that the unused space remains available to new records.
Reclaiming Space in Files IV: Storage Fragmentation
• This is external fragmentation because:
– it is not locked inside any record;
– it is on the avail list.
• However, external fragmentation is not thereby avoided.
• 3 ways to deal with external fragmentation:
– Storage compaction
– Coalescing the holes:
• if 2 record slots on the avail list are adjacent,
• combine them to form one larger slot
– Use a clever placement strategy
Reclaiming Space in Files V: Placement Strategies I
• First-fit strategy: accept the first available record slot that can accommodate the new record.
• Best-fit strategy: choose the smallest available record slot that can accommodate the new record.
• Worst-fit strategy: choose the largest available record slot.
(A sketch of all three follows.)
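A minimal sketch contrasting the three strategies on an in-memory list of slot sizes (offsets omitted; purely illustrative, not the textbook's code):

#include <algorithm>
#include <iostream>
#include <vector>

// Index of the first slot that fits, or -1.
int firstFit(const std::vector<int>& a, int need) {
    for (size_t i = 0; i < a.size(); ++i)
        if (a[i] >= need) return static_cast<int>(i);
    return -1;
}

// Index of the smallest slot that still fits, or -1.
int bestFit(const std::vector<int>& a, int need) {
    int best = -1;
    for (size_t i = 0; i < a.size(); ++i)
        if (a[i] >= need && (best == -1 || a[i] < a[best]))
            best = static_cast<int>(i);
    return best;
}

// Index of the largest slot overall, or -1 if even it is too small.
int worstFit(const std::vector<int>& a, int need) {
    auto it = std::max_element(a.begin(), a.end());
    if (it == a.end() || *it < need) return -1;
    return static_cast<int>(it - a.begin());
}

int main() {
    std::vector<int> avail = {60, 90, 55, 200};
    std::cout << firstFit(avail, 55) << ' '    // 0: the 60-byte slot
              << bestFit(avail, 55)  << ' '    // 2: the 55-byte slot, exact
              << worstFit(avail, 55) << '\n';  // 3: the 200-byte slot
}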
Reclaiming Space in Files V: Placement Strategies II
• Some general remarks about placement strategies:
– Placement strategies only apply to variable-length records.
– If space is lost due to internal fragmentation, the choice is between first fit and best fit. A worst-fit strategy truly makes internal fragmentation worse.
– If space is lost due to external fragmentation, one should give careful consideration to a worst-fit strategy.
Finding Things Quickly I: Overview I
• Key things to note when accessing files:
– The cost of seeking is very high.
– This cost has to be taken into consideration when determining a strategy for searching a file for a particular piece of information.
Finding Things Quickly I: Overview I
• The same consideration also arises with respect to sorting, which is often the first step toward searching efficiently.
• Rather than simply trying to sort and search, we concentrate on doing both in a way that minimizes the number of seeks.
Finding things Quickly II: Overview II
• So far, the only way we have to retrieve or find records quickly is by using their RRN (in case the records are of fixed length).
• Without an RRN, or in the case of variable-length records, the only way, so far, to look for a record is by doing a sequential search. This is a very inefficient method.
• We are interested in more efficient ways of retrieving records.
Finding things Quickly III: Binary Search
• Let's assume that the file is sorted and that we are looking for the record whose key is KELLY in a file of 1000 fixed-length records.
[Figure: comparison 1 probes record 500 (Johnson); comparison 2 probes record 750 (Monroe); the records are numbered 1 … 1000, and an arrow marks the next comparison.]
Binary search algorithm

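The algorithm figure is omitted in these notes; below is a minimal sketch of binary search on a sorted file of fixed-length records, assuming each record begins with a fixed-width key and the search key is padded to the same width. RECLEN and KEYLEN are illustrative:

#include <cstring>
#include <fstream>
#include <string>

const int RECLEN = 64;    // record length (assumed)
const int KEYLEN = 12;    // key width at the start of each record (assumed)

// Return the RRN of the record whose key equals 'key', or -1 if absent.
long binarySearch(std::fstream& f, const std::string& key) {
    f.seekg(0, std::ios::end);
    long lo = 0, hi = static_cast<long>(f.tellg()) / RECLEN - 1;
    char buf[KEYLEN];
    while (lo <= hi) {
        long mid = (lo + hi) / 2;       // probe the middle record:
        f.seekg(mid * RECLEN);          // one seek per comparison
        f.read(buf, KEYLEN);
        int cmp = std::memcmp(key.data(), buf, KEYLEN);
        if (cmp == 0) return mid;       // found the key
        if (cmp < 0) hi = mid - 1;      // key is in the lower half
        else         lo = mid + 1;      // key is in the upper half
    }
    return -1;
}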
Class definition for binary search implementation
Finding things Quickly IV: Binary Search versus Sequential Search
• In general, binary search of a file with n records takes at most ⌈log2 n⌉ + 1 comparisons,
• and on average approximately (⌈log2 n⌉ + 1)/2 comparisons.
• Sequential search takes O(n) comparisons:
• n/2 comparisons in the average case.
Finding things Quickly IV: Binary Search versus Sequential Search
• Binary search of a file with n records takes O(log2 n) comparisons.
• Sequential search takes O(n) comparisons.
• When sequential search is used, doubling the number of records in the file doubles the number of comparisons required.
• When binary search is used, doubling the number of records in the file only adds one more guess to our worst case.
• In order to use binary search, though, the file first has to be sorted. This can be very expensive.
Finding things Quickly V: Sorting a Disk File in Memory
• If the entire content of a file can be held in memory, then we can perform an internal sort. Sorting in memory is very efficient.
• However, if the file does not fit entirely in memory, any sorting algorithm will require a large number of seeks. Sorting would thus be extremely slow. Unfortunately, this is often the case.
Finding things Quickly VI: The Limitations of Binary Search and Internal Sorting
• Problem 1: binary search requires more than one or two accesses.
– Accessing a record using its RRN can be done with a single access ==> we would like to achieve RRN retrieval performance while keeping the advantage of key access (for variable-length records).
• Problem 2: keeping a file sorted is very expensive: in addition to searching for the right location for the insert, once this location is found, we have to shift records to open up space for the insertion.
• Problem 3: internal sorting only works on small files ==> hence, for large files, we develop keysorting.
Finding things Quickly VII: Key Sorting (Tag Sort)
• Overview: when sorting a file in memory, the only things that really need sorting are the record keys.
• Instead of reading the entire records into memory:
– read the keys,
– sort the keys,
– rearrange the records according to the new order.
Keysort uses this method, and hence can be applied to larger files.
Class for keysort implementation
Key Sorting (Tag Sort ) :
Algorithm

48
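The textbook's class and algorithm appear in figures omitted here; the following is a minimal sketch of the two-pass keysort just described, assuming fixed-length records whose first KEYLEN bytes are the key (RECLEN, KEYLEN, and the names are illustrative):

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

const int RECLEN = 64, KEYLEN = 12;     // assumed record and key widths

struct KeyNode { std::string key; long rrn; };   // (key, RRN) pair

void keysort(const char* inName, const char* outName) {
    std::ifstream in(inName, std::ios::binary);
    std::vector<KeyNode> keys;
    char rec[RECLEN];
    // Pass 1: read each record into a buffer, keep only its key and RRN.
    for (long rrn = 0; in.read(rec, RECLEN); ++rrn)
        keys.push_back({std::string(rec, KEYLEN), rrn});
    // Sort the (key, RRN) pairs in memory.
    std::sort(keys.begin(), keys.end(),
              [](const KeyNode& a, const KeyNode& b) { return a.key < b.key; });
    // Pass 2: read the records a second time, in key order, writing them
    // out sorted. Each read needs a random seek: the method's weakness.
    in.clear();
    std::ofstream out(outName, std::ios::binary);
    for (const KeyNode& k : keys) {
        in.seekg(k.rrn * RECLEN);
        in.read(rec, RECLEN);
        out.write(rec, RECLEN);
    }
}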
Finding things Quickly VII: Key Sorting (Tag Sort)
• Before sorting the keys:
Finding things Quickly VII: Key Sorting (Tag Sort)
• After sorting the keys, but before the records are sorted in secondary memory:
Finding things Quickly VII: Key Sorting (Tag Sort)
• Keysort algorithms work like internal sorts, but with 2 important differences:
– Rather than read an entire record into a memory array, we simply read each record into a temporary buffer, extract the key, and then discard the rest.
– If we want to write the records in sorted order, we have to read them a second time.
Finding things Quickly VIII: Limitations of the Keysort Method
• Writing the records in sorted order requires as many random seeks as there are records.
• Since writing is interspersed with reading, reading also requires as many seeks as there are records.
• Solution: why bother to write the file of records in key order? Simply write back the sorted index.
Solution to Keysort
• Store the index file, through which we can access the original file.
Finding things Quickly IX: Pinned Records
• Indexes are also useful with regard to deleted records.
• The avail list indicating the location of unused record slots consists of pinned records, in the sense that these unused records cannot be moved, since moving them would create dangling pointers.
• Pinned records make sorting very difficult. One solution is to use an ordered index and not move the records.
Indexing

Overview
• An index is a table containing a list of keys, each associated with a reference field pointing to the record where the information referenced by the key can be found.
• An index lets you impose order on a file without rearranging the file.
• A simple index is simply an array of (key, reference) pairs.
• You can have different indexes for the same data (library analogy: a book can be accessed by author, title, or subject area) ==> multiple access paths.
• Indexing gives us keyed access to variable-length record files.
A Simple Index for Entry-Sequenced Files I
• Suppose that you are looking at a collection of musical recordings, with the following information about each of them:
– Identification number
– Title
– Composer or composers
– Artist or artists
– Label (publisher)
A Simple Index for Entry-Sequenced Files II
• We choose to organize the file as a series of variable-length records with a size field preceding each record. The fields within each record are also of variable length but are separated by delimiters.
A Simple Index for Entry-Sequenced Files II
• We form a primary key by concatenating the record company's label code and the recording's ID number. This should form a unique identifier.
• In order to provide rapid keyed access, we build a simple index with a key field associated with a reference field, which gives the address of the first byte of the corresponding data record.
A Simple Index for Entry-Sequenced Files III
• The index may be sorted while the data file is not. This means that the data file may be entry-sequenced: the records occur in the order they are entered into the file.
A Simple Index for Entry-Sequenced Files IV
A few comments about our index organization:
• The index is easier to use than the data file because
1) it uses fixed-length records, and
2) it is likely to be much smaller than the data file.
• By requiring fixed-length records in the index file, we impose a limit on the size of the primary key field. This could cause problems.
• The index could carry more information than the key and reference fields (e.g., we could keep the length of each data file record in the index as well).
Class to create an index
Function to retrieve a record in a file through the index
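The figure with the textbook's function is omitted here; a minimal sketch of the retrieval idea follows, assuming an in-memory index of (key, byte offset) pairs sorted by key, and data records preceded by a 2-byte size field (names are illustrative):

#include <fstream>
#include <string>
#include <vector>

struct IndexEntry { std::string key; long offset; };

// Binary-search the sorted index; on a hit, seek to the referenced byte
// offset and read the length-prefixed record. Returns false if absent.
bool retrieve(const std::vector<IndexEntry>& index, std::ifstream& data,
              const std::string& key, std::string& record) {
    long lo = 0, hi = static_cast<long>(index.size()) - 1;
    while (lo <= hi) {
        long mid = (lo + hi) / 2;
        if (index[mid].key == key) {
            data.seekg(index[mid].offset);
            unsigned short len;         // the record's 2-byte size field
            data.read(reinterpret_cast<char*>(&len), sizeof len);
            record.resize(len);
            data.read(&record[0], len);
            return true;
        }
        if (key < index[mid].key) hi = mid - 1;
        else                      lo = mid + 1;
    }
    return false;                       // key not in index
}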
Basic Operations on an Indexed Entry-Sequenced File
• Assumption: the index is small enough to be held in memory. Later on, we will see what can be done when this is not the case.
– Create the original empty index and data files.
– Load the index into memory before using it.
– Rewrite the index file from memory after using it.
– Add records to the data file and index.
– Delete records from the data file.
– Update records in the data file.
Creating the files
• Two files need to be created:
– the data file (to hold the records), and
– the index file (to hold the keys and references to the records).
• Both files are created empty
• and updated later as records are added.
• This can be done with the create() function in the BufferFile class.
Loading the index into memory
• The operations w.r.t. memory are:
– loading (reading), and
– storing (writing).
• Since loading/storing a record is defined in the IOBuffer classes,
• we can use the same classes to load the index file.
• Consider the index file as a single object,
– hence it is loaded all at once.
Re-writing the index file from memory
• What happens if the index has changed but is not properly rewritten?
– Either:
• rewriting does not take place at all, even though the contents of the index changed, or
• rewriting takes place incompletely.
• Use a mechanism for indicating whether or not the index is out of date.
• Have a procedure that reconstructs the index from the data file in case it is out of date.
Record Addition
• When we add a record, both the data file and the index should be updated.
• In the data file, the record can be added anywhere. However, the byte offset of the new record must be saved.
• Since the index is sorted, the location of the new index entry does matter: we have to shift all the entries that belong after the one we are inserting, to open up space for the new entry. However, this operation is not too costly, as it is performed in memory (the index file is small). A sketch follows.
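A small sketch of that in-memory insertion (IndexEntry repeats the illustrative type from the retrieval sketch; std::vector::insert performs the shifting):

#include <algorithm>
#include <string>
#include <vector>

struct IndexEntry { std::string key; long offset; };

void addEntry(std::vector<IndexEntry>& index,
              const std::string& key, long offset) {
    // Binary-search for the first entry whose key is >= the new key.
    auto pos = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry& e, const std::string& k) { return e.key < k; });
    // Inserting here shifts every later entry one slot to the right;
    // cheap, because it happens in memory rather than on disk.
    index.insert(pos, IndexEntry{key, offset});
}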
Record Deletion
• Record deletion can be done using the methods of the previous chapter.
• In addition, however, the index entry corresponding to the data record being deleted must also be deleted.
• Once again, since this deletion takes place in memory, the record shifting is not too costly.
Record Updating
• Record updating falls into two categories:
– the update changes the value of the key field;
– the update does not affect the key field.
• In the first case:
– both the index and the data file may need to be reordered.
– The update is easiest to handle conceptually as a delete followed by an insert (the user need not know about this).
• In the second case:
– the index does not need reordering, but the data file may.
– If the updated record is smaller than the original one, it can be rewritten at the same location.
– If, however, it is larger, then a new spot has to be found for it. Again the delete/insert approach can be used.
Indexes that are too large to hold in memory I
 So far we have discussed the index file under the assumption that it is small enough to be loaded into memory.
 What happens when the index file is large?
 The index file must then be maintained in secondary memory.
 Problems when accessing the index on disk:
 Binary searching requires several seeks rather than being performed at memory speed.
 Index rearrangement requires shifting or sorting records on secondary storage ==> extremely time-consuming.
 Solutions:
 Use a hashed organization.
 Use a tree-structured index (e.g., a B-Tree).
Indexes that are too large to hold in memory II
• Nonetheless, simple indexes should not be completely discarded:
– They allow the use of binary search in a variable-length record file.
– If the index entries are significantly smaller than the data file records, sorting and file maintenance are faster.
– If there are pinned records in the data file, the keys can be rearranged without moving the data records.
– They can provide access by multiple keys.
Indexing to provide access by multiple keys
• In the table below we have a primary key with a reference field.
• Using the primary key we can access the record.
Indexing to provide access by multiple keys
• Multiple keys:
– We can include another key that accesses the same record.
Indexing to provide access by multiple keys
So far, our index only allows primary-key access. I.e., you can retrieve record DG188807, but you cannot retrieve a recording of Beethoven's Symphony No. 9 ==> not that useful!
We need to use secondary key fields consisting of album titles, composers, and artists.
Although it would be possible to relate a secondary key directly to an actual byte offset, this is usually not done. Instead:
secondary key => primary key => the actual byte offset.
Record Addition in multiple key access settings
 When a secondary index is used, adding a record involves updating:
 the data file,
 the primary index, and
 the secondary index.
The secondary index update is similar to the primary index update.
 Secondary keys are entered in canonical form (all capitals). The upper- and lower-case form must be obtained from the data file.
 As well, because of the length restriction on keys, secondary keys may sometimes be truncated.
 The secondary index may contain duplicate keys (e.g., BEETHOVEN may appear many times).
Record Deletion in multiple key access settings
 Removing a record from the data file means:
 removing its corresponding entry in the primary index, and
 removing all of the entries in the secondary indexes that refer to this primary index entry.
 If we delete a record's entries from the secondary index:
 the secondary index must then be rearranged to stay sorted.
 The above is strictly necessary only if the secondary key refers to the actual offset of a record:
 if the offset of a record changes, then the reference must be updated in both the primary and the secondary index.
Record Deletion in multiple key access settings
• However, it is also possible not to touch the secondary index at all (since, as we mentioned before, secondary keys point at primary keys, not at byte offsets) ==> savings associated with not rearranging the secondary index.
• There is, however, a cost associated with not cleaning up the secondary index: it retains entries for records that no longer exist.
Record Updating in multiple key access settings
Three possible situations:
 The update changes the secondary key: we may have to rearrange the secondary index.
 The update changes the primary key: changes to the primary index are required, but very few changes are needed in the secondary index.
 The update is confined to other fields: no changes are necessary to either the primary or the secondary index.
Retrieval using combinations of secondary keys
• Retrieval of data can be done in any of the following ways:
• Along with the above types, retrieval can also happen:
– in combinations of the above.
Retrieval using combinations of secondary keys
• E.g.:
• The result of the composer query:
• Afterwards, we can apply the title query: SYMPHONY NO. 9
Retrieval using combinations of secondary keys
• Now, performing a Boolean AND of the two result lists:
• Without the use of secondary indexes,
– this request would require a very expensive sequential search through the entire file.
– Using secondary indexes, responding to this query is simple and quick (see the sketch below).
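A minimal sketch of the AND step, intersecting two sorted lists of primary keys as returned by the two secondary-index lookups (the key values shown are illustrative):

#include <iostream>
#include <string>
#include <vector>

// Merge-style intersection of two sorted key lists: O(m + n).
std::vector<std::string> booleanAnd(const std::vector<std::string>& a,
                                    const std::vector<std::string>& b) {
    std::vector<std::string> out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }  // key appears in both lists
    }
    return out;
}

int main() {
    // Primary keys matching the composer query and the title query:
    std::vector<std::string> byComposer = {"ANG3795", "DG139201", "DG18807"};
    std::vector<std::string> byTitle    = {"ANG3795", "COL31809", "DG18807"};
    for (const auto& k : booleanAnd(byComposer, byTitle))
        std::cout << k << '\n';   // prints ANG3795 and DG18807
}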
Improving the secondary index structure I: The problem
• Secondary indexes lead to two difficulties:
– The index file has to be rearranged every time a new record is added to the file.
– If there are duplicate secondary keys, the secondary key field is repeated for each entry ==> space is wasted.
Improving the secondary index structure II: Solution 1
• Solution 1: change the secondary index structure so that it associates an array of references with each secondary key.
• If we add a new record:
Improving the secondary index structure II: Solution 1
 Then we need to add only a reference entry:
 Advantage: helps avoid the need to rearrange the secondary index file too often.
 Disadvantages:
It may restrict the number of references that can be associated with each secondary key.
It may cause internal fragmentation, i.e., wasted space.
Improving the secondary index structure III: Solution 2
Method: each secondary key points to a different list of primary key references.
Each of these lists can grow to be as long as it needs to be,
and no space is lost to internal fragmentation.
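A minimal sketch of this structure (names and types are illustrative): each secondary entry holds the head of a linked list of primary-key reference nodes, kept in a separate file so each list can grow freely:

#include <string>
#include <vector>

struct RefNode {
    std::string primaryKey;    // e.g. "LON2312"
    long next;                 // index of the next node in the list, -1 = end
};

struct SecondaryEntry {
    std::string secondaryKey;  // e.g. "BEETHOVEN"
    long head;                 // index of the first RefNode for this key
};

// Adding a reference: push a new node onto the front of the key's list,
// so the secondary index itself never needs rearranging.
void addReference(SecondaryEntry& entry, std::vector<RefNode>& nodeFile,
                  const std::string& primaryKey) {
    nodeFile.push_back({primaryKey, entry.head});
    entry.head = static_cast<long>(nodeFile.size()) - 1;
}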
Improving the secondary index structure III: Solution 2
• If we add a new record:
• then the difference is:
Improving the secondary index structure III: Solution 2
Advantages:
The secondary index file needs to be rearranged only upon record addition.
The rearranging is faster.
It is not that costly to keep the secondary index on disk.
The primary index never needs to be sorted.
Space from deleted primary index records can easily be reused.
Disadvantage:
Locality (in the secondary index) has been lost ==> more seeking may be necessary.
Selective Indexes
• Secondary indexes can be used to divide a file into parts and provide a selective view.
 Example: build a selective index that contains only the titles of certain recordings in the record collection.
• Suppose we want to extract the recordings released on a particular date ==> it is possible to form a selective index.
 E.g., a selective index for "recordings released prior to 1970".
Improving the secondary index structure III: Solution 2
• If we have fixed-length records with RRNs, then the following can also be done:
Binding I
 Question: at what point is the key bound to the physical address of its associated record?
 Answer so far:
 The binding of our primary keys takes place at construction time.
 The binding of our secondary keys takes place at the time they are used.
 Advantage of construction-time binding:
 faster access.
 Disadvantage of construction-time binding:
 reorganization of the data file must result in modifications to all bound index files.
 Advantage of retrieval-time binding:
 safer.
Binding II
• Tradeoff in binding decisions:
– Tight, construction-time binding is preferable when:
• the data file is static or nearly static, requiring little or no adding, deleting, or updating;
• rapid performance during actual retrieval is a high priority.
– Postponing binding as long as possible is simpler and safer when the data file requires a lot of adding, deleting, and updating.
