Outline
• We will be looking at four different issues:
– Data Compression: how to make files smaller
– Reclaiming space in files that have undergone deletions and updates
– Sorting files in order to support binary searching ==> Internal Sorting
– A better sorting method: Keysorting
Data Compression I:
An Overview
• Question: Why do we want to make files smaller?
• Answer:
– To use less storage, i.e., saving costs
– To transmit these files faster, decreasing access time, or using the same access time but with a lower and cheaper bandwidth
– To process the file sequentially faster.
Data Compression II: Using a
Different Notation => Redundancy
Compression
• Deals mainly with pure binary encoding.
• E.g.: when referring to the state field, we used 2 ASCII bytes = 16 bits. Was that really necessary?
• Answer: If there are only 50 states, we could encode them all with only 6 bits, thus saving 1 byte per state field.
• This means we are reducing the number of bits in each field.
• Hence the name redundancy reduction.
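The 6-bit idea above can be sketched in Python. The state list and the packing scheme here are illustrative assumptions, not the book's actual layout:

```python
# Redundancy compression sketch: instead of 2 ASCII bytes (16 bits) per state,
# assign each state a 6-bit code (2**6 = 64 >= 50 possible values).

STATES = ["AL", "AK", "AZ", "AR", "CA"]  # illustrative; 50 entries in practice

ENCODE = {s: i for i, s in enumerate(STATES)}
DECODE = {i: s for i, s in enumerate(STATES)}

def encode(states):
    """Pack a list of state abbreviations into a string of 6-bit codes."""
    return "".join(format(ENCODE[s], "06b") for s in states)

def decode(bits):
    """Recover the state abbreviations from the packed bit string."""
    return [DECODE[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]

packed = encode(["CA", "AZ", "AK"])
assert len(packed) == 18                  # 3 states x 6 bits vs 3 x 16 ASCII bits
assert decode(packed) == ["CA", "AZ", "AK"]
```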
Data Compression II: Using a
Different Notation => Redundancy
Compression
• Disadvantages:
– Not human-readable
– Cost of encoding/decoding time
– Increased software complexity (an encoding/decoding module)
• Is this compression technique worthwhile?
– It depends on the application. It is a bad choice when:
• the file is small,
• the file is accessed by many different pieces of software, or
• some of that software cannot deal with binary data.
Data Compression II: Suppressing
Repeating Sequences ==> Redundancy
Compression
• When the data is represented in a sparse array (similar data repeated), we can use a type of compression called run-length encoding.
• Procedure:
– Read through the array in sequence, except where the same value occurs more than once in succession.
– When the same value occurs more than once, substitute the following 3 bytes, in order:
• the special run-length code indicator;
• the value that is repeated; and
• the number of times that the value is repeated.
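The 3-byte substitution above can be sketched as follows. The choice of `0xFF` as the run-length indicator is an assumption, and this sketch assumes that byte never occurs as literal data:

```python
RUN_CODE = 0xFF  # assumed indicator byte; must not occur as a literal data value

def rle_compress(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run > 1:
            # indicator, repeated value, repeat count
            out += bytes([RUN_CODE, data[i], run])
        else:
            out.append(data[i])
        i += run
    return bytes(out)

def rle_decompress(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == RUN_CODE:
            out += bytes([data[i + 1]]) * data[i + 2]
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

sparse = bytes([7]) + bytes([0]) * 40 + bytes([9, 9, 9])
packed = rle_compress(sparse)
assert len(packed) < len(sparse)
assert rle_decompress(packed) == sparse
```

Note that a run of length 1 is left as-is: substituting 3 bytes for a single value would expand the data.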
Data Compression III:
Assigning Variable-Length
Code
Here, variable-length codes are assigned to the values occurring in the encoding scheme.
e.g.:
• If the letters e and t occur frequently,
– then we can assign them short codes: . => e and / => t
– For other letters we can assign 2 or 3 symbols per letter, like *& => a
Data Compression III:
Assigning Variable-Length
Code
• Principle: Assign short codes to the most
frequent occurring values and long ones to
the least frequent ones.
• The code-size cannot be fully optimized as
one wants codes to occur in succession,
without delimiters between them, and still be
recognized.
• This is the principle used in the Morse code.
• As well, it is used in Huffman coding ==> used for compression in Unix.
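A minimal Huffman-style sketch of this principle: frequent symbols get short codes, and the resulting code is prefix-free, so codes can follow one another without delimiters. This is a toy construction, not the Unix compression utility itself:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free code: frequent symbols get short codes."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)                 # tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # prefix the two subtrees' codes with 0 and 1, then merge them
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("tee tree teeter")
# 'e' is the most frequent symbol, so no code is shorter than its code
assert all(len(codes["e"]) <= len(c) for c in codes.values())
# prefix-free: no code is a prefix of another, so no delimiters are needed
cs = list(codes.values())
assert not any(a != b and b.startswith(a) for a in cs for b in cs)
```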
Data Compression IV:
Irreversible Compression
Techniques
• Irreversible compression is based on the assumption that some information can be sacrificed. [Irreversible compression is also called entropy reduction.]
• Example: shrinking a raster image from 400-by-400 pixels to 100-by-100 pixels. The new image contains 1 pixel for every 16 pixels in the original image.
• There is usually no way to determine what the original 16 pixels were from the one new pixel.
• In data files, irreversible compression is rarely used.
Reclaiming Space in Files
• What happens if a variable-length record is modified?
– The new record can be smaller than the old one.
• In this case some space will go unused, which leads to internal fragmentation.
– The new record can be bigger than the old one.
• In this case we can use 2 methods:
– Append only the extra data at the end of the file and link it to the record with a pointer => processing the record is slower.
– Append the complete new record at the end of the file => the old record's space must be reclaimed.
Reclaiming Space in Files
• Discussion of file organization based
on:
– Record addition
– Record updating
– Record deletion
• During record addition the data is
appended at the end.
• The problems arise in updating and deletion.
Reclaiming Space in Files I:
Record Deletion and Storage
Compaction
• Recognizing deleted records
• Reusing the space from deleted records ==> storage compaction.
• Storage compaction:
– makes the file smaller
• by looking for places where there is no data
• and recovering this space.
Reclaiming Space in Files I:
Record Deletion and Storage
Compaction
Record deletion strategy:
– Use some way to mark the deleted records,
– e.g., an * mark for each deleted record.
Reclaiming Space in Files I:
Record Deletion and Storage
Compaction
• Storage compaction:
– should not be applied after each record deletion => not effective.
– After deleted records have accumulated for some time, a special program is used to reconstruct the file with the deleted records squeezed out.
• Storage compaction can be used with both fixed- and variable-length records.
Reclaiming Space in Files II: Deleting
Fixed-Length Records for Reclaiming
Space Dynamically
• In some applications, it is necessary to reclaim space immediately.
• To do so, we can:
– mark deleted records in some special way;
– find the space that deleted records once occupied, so that we can reuse that space when we add records;
– come up with a way to know immediately whether there are empty slots in the file, and jump directly to them.
Reclaiming Space in Files II: Deleting
Fixed-Length Records for Reclaiming
Space Dynamically
Solution:
• Use an avail linked list:
– It links together all the records marked deleted.
– A header record points to the first deleted record in the file.
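The avail list for fixed-length records can be sketched as follows. The slot layout (deleted slots storing `*` followed by the RRN of the next deleted slot) and the class name are illustrative assumptions; a Python list stands in for the data file:

```python
RECLEN = 16
NO_LINK = -1

class FixedFile:
    """Fixed-length slots with an avail list threaded through deleted slots."""
    def __init__(self):
        self.slots = []              # in-memory stand-in for the data file
        self.avail_head = NO_LINK    # header field: first deleted slot, or -1

    def add(self, record: str) -> int:
        if self.avail_head != NO_LINK:            # reuse a deleted slot
            rrn = self.avail_head
            # the deleted slot stores "*<rrn of next deleted slot>"
            self.avail_head = int(self.slots[rrn][1:])
            self.slots[rrn] = record.ljust(RECLEN)
        else:                                      # no free slot: append
            rrn = len(self.slots)
            self.slots.append(record.ljust(RECLEN))
        return rrn

    def delete(self, rrn: int):
        # mark with '*' and push the slot onto the avail list
        self.slots[rrn] = ("*" + str(self.avail_head)).ljust(RECLEN)
        self.avail_head = rrn

f = FixedFile()
f.add("Ames"); b = f.add("Mason"); f.add("Brown")
f.delete(b)
assert f.avail_head == 1
assert f.add("Jones") == 1           # the deleted slot is reused
assert f.avail_head == NO_LINK
```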
Reclaiming Space in Files III: Deleting
Variable-Length Records for Reclaiming
Space Dynamically
An avail list for variable-length records:
– With variable-length records, we need to know the length of each record slot.
– For this we can use a length indicator at the start of each record.
– An asterisk mark shows that a record is deleted.
Reclaiming Space in Files IV:
Storage Fragmentation
• Fixed-length records: wasted space within a record is called internal fragmentation.
Reclaiming Space in Files IV:
Storage Fragmentation
• With variable-length records, the only field that is not data is the 2-byte count field.
• A fixed-length record organization, by contrast, wastes memory in every record, as shown in the previous diagram.
Reclaiming Space in Files IV:
Storage Fragmentation
• Problems with variable-length records: (figure)
• After an addition, the file looks as follows: (figure)
Reclaiming Space in Files IV:
Storage Fragmentation
• It is external fragmentation because:
– it is not locked inside any record;
– it is on the avail list.
• However, external fragmentation is not avoided.
• 3 ways to deal with external fragmentation:
– storage compaction;
– coalescing the holes:
• if 2 record slots on the avail list are adjacent,
• combine those 2 slots to form a larger slot;
– use a clever placement strategy.
Reclaiming Space in Files V:
Placement Strategies I
• First fit strategy: accept the first available record slot that can accommodate the new record.
• Best fit strategy: choose the smallest available record slot that can accommodate the new record.
• Worst fit strategy: choose the largest available record slot.
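The three strategies can be sketched over an in-memory avail list. The `(slot_size, offset)` representation is an illustrative assumption:

```python
# Placement strategies over a variable-length avail list.
# Each entry is (slot_size, offset); the data below is invented.

def first_fit(avail, need):
    """First slot large enough, in avail-list order."""
    for i, (size, _) in enumerate(avail):
        if size >= need:
            return i
    return None

def best_fit(avail, need):
    """Smallest slot that still fits (minimizes leftover space)."""
    fits = [i for i, (size, _) in enumerate(avail) if size >= need]
    return min(fits, key=lambda i: avail[i][0]) if fits else None

def worst_fit(avail, need):
    """Largest slot (leftover space stays big enough to reuse)."""
    if not avail:
        return None
    i = max(range(len(avail)), key=lambda j: avail[j][0])
    return i if avail[i][0] >= need else None

avail = [(47, 100), (38, 200), (72, 300), (68, 400)]
assert first_fit(avail, 40) == 0   # 47 is the first slot large enough
assert best_fit(avail, 40) == 0    # 47 is also the tightest fit
assert worst_fit(avail, 40) == 2   # 72 is the largest slot
assert best_fit(avail, 70) == 2
assert first_fit(avail, 80) is None
```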
Reclaiming Space in Files V:
Placement Strategies II
• Some general remarks about placement strategies:
– Placement strategies only apply to variable-length records.
– If space is lost due to internal fragmentation, the choice is between first fit and best fit. A worst fit strategy truly makes internal fragmentation worse.
– If space is lost due to external fragmentation, one should give careful consideration to a worst-fit strategy.
Finding Things Quickly I:
Overview I
Finding things Quickly II:
Overview II
• So far, the only way we have to retrieve or find records quickly is by using their RRN (in case the records are of fixed length).
• Without an RRN, or in the case of variable-length records, the only way, so far, to look for a record is by doing a sequential search. This is a very inefficient method.
• We are interested in more efficient ways of retrieving records by key.
Finding things Quickly III:
Binary Search
• Let’s assume that the file is sorted and that we are looking for the record whose key is Kelly in a file of 1000 fixed-length records.
• (In the figure, comparison 1 is with Johnson, comparison 2 with Monroe; each comparison halves the remaining search interval.)
Binary search algorithm
Class definition for Binary search
implementation
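The binary search over a file of fixed-length records can be sketched as follows. The 12-byte, space-padded record layout is an illustrative assumption; each probe seeks directly to `RRN * RECLEN`:

```python
import io

RECLEN = 12  # assumed fixed record length: a 12-byte key padded with spaces

def binary_search_file(f, key, nrecs):
    """Return the RRN of the record whose key matches, or -1."""
    lo, hi = 0, nrecs - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECLEN)                 # direct access via RRN * RECLEN
        rec_key = f.read(RECLEN).decode().rstrip()
        if rec_key == key:
            return mid
        elif rec_key < key:
            lo = mid + 1                     # search the upper half
        else:
            hi = mid - 1                     # search the lower half
    return -1

names = ["Ames", "Brown", "Johnson", "Kelly", "Monroe", "Smith"]
f = io.BytesIO(b"".join(n.ljust(RECLEN).encode() for n in names))
assert binary_search_file(f, "Kelly", len(names)) == 3
assert binary_search_file(f, "Zed", len(names)) == -1
```

`io.BytesIO` stands in for an open binary file; the same function works on a real file opened with `open(path, "rb")`.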
Finding things Quickly IV: Binary
Search versus Sequential Search
• In general, binary search of a file with n records takes at most ⌊log2 n⌋ + 1 comparisons,
• and on average approximately (⌊log2 n⌋ + 1)/2 comparisons.
• Sequential search takes O(n) comparisons;
• on average, n/2 comparisons.
Finding things Quickly IV: Binary
Search versus Sequential Search
• Binary search of a file with n records takes O(log2 n) comparisons.
• Sequential search takes O(n) comparisons.
• When sequential search is used, doubling the number of records in the file doubles the number of comparisons required.
• When binary search is used, doubling the number of records in the file only adds one more guess to our worst case.
• In order to use binary search, though, the file first has to be sorted. This can be very expensive.
Finding things Quickly V:
Sorting a Disk File in Memory
• If the entire content of a file can be held in memory, then we can perform an internal sort. Sorting in memory is very efficient.
• However, if the file does not fit entirely in memory, any sorting algorithm will require a large number of seeks. Sorting would, thus, be extremely slow. Unfortunately, this is often the case.
Finding things Quickly VI: The
limitations of Binary Search and
Internal Sorting
• Prob 1: Binary search requires more than one or two accesses.
– Accessing a record using the RRN can be done with a single access ==> we would like to achieve RRN retrieval performance while keeping the advantage of key access (variable-length records).
• Prob 2: Keeping a file sorted is very expensive: in addition to searching for the right location for the insert, once this location is found, we have to shift records to open up the space for insertion.
• Prob 3: Internal sorting only works on small files ==> hence, for large files, we develop keysorting.
Finding things Quickly VII:
Key Sorting (Tag Sort)

Class for key sort
implementation

Key Sorting (Tag Sort):
Algorithm

Finding things Quickly VII:
Key Sorting (Tag Sort)
• Before sorting keys: (figure)

Finding things Quickly VII:
Key Sorting (Tag Sort)
• After sorting keys, before the records are sorted in secondary memory: (figure)
Finding things Quickly VII:
Key Sorting (Tag Sort )
• Keysort algorithms work like internal sorts, but with 2 important differences:
– Rather than read an entire record into a memory array, we simply read each record into a temporary buffer, extract the key, and then discard the record.
– If we want to write the records in sorted order, we have to read them a second time.
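The two differences above can be sketched as follows. The fixed-length, key-first record layout is an illustrative assumption; `io.BytesIO` stands in for the disk file:

```python
import io

RECLEN = 16  # assumed record length; the key occupies the start of each record

def keysort(f, nrecs):
    # pass 1: build the small (key, rrn) array; the records are discarded
    pairs = []
    for rrn in range(nrecs):
        f.seek(rrn * RECLEN)
        key = f.read(RECLEN).decode().rstrip()
        pairs.append((key, rrn))
    pairs.sort()                       # internal sort of keys only
    # pass 2: re-read each record by RRN to write the file in sorted order
    out = io.BytesIO()
    for _, rrn in pairs:
        f.seek(rrn * RECLEN)
        out.write(f.read(RECLEN))
    return out

names = ["Monroe", "Ames", "Kelly", "Brown"]
f = io.BytesIO(b"".join(n.ljust(RECLEN).encode() for n in names))
result = keysort(f, len(names)).getvalue()
assert result == b"".join(n.ljust(RECLEN).encode()
                          for n in ["Ames", "Brown", "Kelly", "Monroe"])
```

Only the (key, RRN) pairs are held in memory, so files far larger than memory can be key-sorted; the second pass, however, costs one seek per record.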
Finding things Quickly VIII:
Limitation of the KeySort Method
• Writing the records out in sorted order requires reading them a second time, one random seek per record, so keysort does not remove the dominant I/O cost of sorting the file.
Finding things Quickly IX:
Pinned Records
• A record is pinned if other records or file structures refer to it by its physical location; pinned records cannot be moved without invalidating those references.
Overview
• An index is a table containing a list of keys, each associated with a reference field pointing to the record where the information referenced by the key can be found.
• An index lets you impose order on a file without rearranging the file.
• A simple index is simply an array of (key, reference) pairs.
• You can have different indexes for the same data (library analogy: a book can be accessed by author, title, or subject area) ==> multiple access paths.
• Indexing gives us keyed access to variable-length record files.
A Simple Index for Entry-
Sequenced Files I
A Simple Index for Entry-
Sequenced Files II
• We choose to organize the file as a series of variable-length records with a size field preceding each record. The fields within each record are also of variable length but are separated by delimiters.
A Simple Index for Entry-
Sequenced Files IV
A few comments about our index organization:
• The index is easier to use than the data file because:
1) it uses fixed-length records;
2) it is likely to be much smaller than the data file.
• By requiring fixed-length records in the index file, we impose a limitation on the size of the primary key field. This could cause problems.
• The index could carry more information than the key and reference fields (e.g., we could keep the length of each data file record in the index as well).
Class to create Index
Function to retrieve a record in a
file through index
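Index creation and retrieval through the index can be sketched as follows. The length-prefixed record layout, the `|` field delimiter, and the function names are illustrative assumptions (a dict stands in for the sorted index array):

```python
import io

def add_record(data, index, key, fields):
    offset = data.seek(0, io.SEEK_END)        # append anywhere; remember offset
    body = "|".join(fields).encode()          # '|' as the field delimiter
    data.write(len(body).to_bytes(2, "big") + body)   # 2-byte size field first
    index[key] = offset                       # the (key, reference) pair

def retrieve(data, index, key):
    offset = index.get(key)                   # look the key up in the index
    if offset is None:
        return None
    data.seek(offset)                         # jump straight to the record
    size = int.from_bytes(data.read(2), "big")
    return data.read(size).decode().split("|")

data, index = io.BytesIO(), {}
add_record(data, index, "DG188807", ["Beethoven", "Symphony No. 9"])
add_record(data, index, "LON2312", ["Corea", "Light as a Feather"])
assert retrieve(data, index, "LON2312") == ["Corea", "Light as a Feather"]
assert retrieve(data, index, "XXX") is None
```

Note how the data file stays entry-sequenced while the index alone provides keyed access, which is the point of the organization above.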
Basic Operations on an Indexed
Entry-Sequenced File
• Assumption: the index is small enough to be held in memory. Later on, we will see what can be done when this is not the case.
– Create the original empty index and data files.
– Load the index into memory before using it.
– Rewrite the index file from memory after using it.
– Add records to the data file and index.
– Delete records from the data file.
– Update records in the data file.
Creating the files
• Two files to be created:
– the data file (to hold the records);
– the index file (to hold the keys and references to the records).
• Both files are created empty.
• Later, they are updated as records are added.
• The above can be done by the create() function in the BufferFile class.
Loading the index into
memory
• The operations with respect to memory are:
– loading (reading)
– storing (writing)
• Since loading/storing a record is defined in the IOBuffer classes, we can use the same classes to load the index file.
• Consider the index file as a single object:
– hence, it is loaded only once.
Re-writing the index file from memory
• What happens if the index changed, but:
– the rewriting does not take place (e.g., the program ends before the changed index is written back), or
– the rewriting takes place incompletely?
• Use a mechanism for indicating whether or not the index is out of date.
• Have a procedure that reconstructs the index from the data file in case it is out of date.
Record Addition
• When we add a record:
– both the data file and the index should be updated.
• In the data file, the record can be added anywhere. However, the byte offset of the new record should be saved.
• Since the index is sorted, the location of the new entry does matter: we have to shift all the entries that belong after the one we are inserting, to open up space for the new entry. However, this operation is not too costly, as it is performed in memory (the index file is small).
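The in-memory shift described above can be sketched with a sorted array of (key, offset) pairs; the keys and offsets below are invented:

```python
import bisect

# The index as a sorted array of (key, byte-offset) pairs.
index = [("ANG3795", 152), ("DG139201", 338), ("FF245", 607)]

def add_to_index(index, key, offset):
    # insort finds the right slot and shifts the later entries in memory
    bisect.insort(index, (key, offset))

add_to_index(index, "COL38358", 475)
assert [k for k, _ in index] == ["ANG3795", "COL38358", "DG139201", "FF245"]
```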
Record Deletion
• Record deletion can be done using the methods of the previous chapter.
• In addition, however, the index entry corresponding to the data record being deleted must also be deleted.
• Once again, since this deletion takes place in memory, the record shifting is not too costly.
Record Updating
• Record updating falls into two categories:
– the update changes the value of the key field;
– the update does not affect the key field.
• In the first case:
– both the index and the data file may need to be reordered;
– the update is easiest to deal with as a delete followed by an insert (the user need not know about this).
• In the second case:
– the index does not need reordering, but the data file may;
– if the updated record is smaller than the original one, it can be rewritten at the same location;
– if, however, it is larger, then a new spot has to be found for it. Again, the delete/insert solution can be used.
Indexes that are too large to hold
in memory I
Till now, we discussed the index file under the assumption that it is small enough to be loaded into memory.
What happens when the index file is large?
– The index must be maintained in secondary memory.
Problems when accessing the index on disk:
– Binary searching requires several seeks rather than being performed at memory speed.
– Index rearrangement requires shifting or sorting records on secondary storage ==> extremely time consuming.
Solutions:
– Use a hashed organization.
– Use a tree-structured index (e.g., a B-tree).
Indexes that are too large to hold
in memory II
• Nonetheless, simple indexes should not be completely discarded:
– They allow the use of a binary search in a variable-length record file.
– If the index entries are significantly smaller than the data file records, sorting and file maintenance are faster.
– If there are pinned records in the data file, rearrangements of the keys are possible without moving the data records.
– They can provide access by multiple keys.
Indexing to provide access by
multiple keys
• In the table below we have a primary key with a reference field.
• Using the primary key we can access the record.
February 8 & 10
Indexing to provide access by
multiple keys
• Multiple keys:
– We can include another key that accesses the same record.
Indexing to provide access by
multiple keys
So far, our index only allows primary key access, i.e., you can retrieve record DG188807, but you cannot retrieve a recording of Beethoven’s Symphony no. 9 ==> not that useful!
We need to use secondary key fields consisting of album titles, composers, and artists.
Although it would be possible to relate a secondary key directly to an actual byte offset, this is usually not done. Instead:
secondary key => primary key => the actual byte offset.
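The two-step path above (secondary key => primary key => byte offset) can be sketched as follows; the keys and offsets are invented:

```python
# Primary index: primary key -> byte offset of the record.
primary = {"DG188807": 0, "ANG3795": 352}

# Secondary index: sorted (secondary key, primary key) pairs; duplicates allowed.
secondary = [("BEETHOVEN", "ANG3795"), ("BEETHOVEN", "DG188807")]

def lookup(sec_key):
    # secondary key => primary key => the actual byte offset
    return [primary[pk] for sk, pk in secondary if sk == sec_key]

assert lookup("BEETHOVEN") == [352, 0]
assert lookup("MOZART") == []
```

Because the secondary index stores primary keys rather than offsets, moving a record in the data file requires updating only the primary index.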
Record Addition in multiple key
access settings
When a secondary index is used, adding a record involves updating:
– the data file;
– the primary index;
– the secondary index.
• The secondary index update is similar to the primary index update.
• Secondary keys are entered in canonical form (all capitals). The upper- and lower-case form must be obtained from the data file.
• As well, because of the length restriction on keys, secondary keys may sometimes be truncated.
• The secondary index may contain duplicate keys (e.g., BEETHOVEN).
Record Deletion in multiple key
access settings
Removing a record from the data file means:
– removing its corresponding entry in the primary index;
– removing all of the entries in the secondary indexes that refer to this primary index entry.
• Deleting entries from a secondary index forces the secondary index to be rearranged (kept sorted).
• Doing so is required only if the secondary key refers to the actual byte offset of a record: in that case, whenever a record’s offset changes, both the primary and the secondary index references must be updated.
Record Deletion in multiple key
access settings
• However, it is also possible not to worry about the secondary index (since, as we mentioned before, secondary keys were made to point at primary keys) ==> savings associated with the lack of rearrangement of the secondary index.
• The cost associated with not removing the entries: the secondary index retains references to records that no longer exist.
Record Updating in multiple key
access settings
Retrieval using combinations of
secondary keys
• As a result of searching by composer: (figure)
Improving the secondary index
structure II: Solution 1
• Solution 1: change the secondary index structure so that it associates an array of references with each secondary key.
Improving the secondary index
structure III: Solution 2
Method: each secondary key points to a different list of primary key references.
– Each of these lists can grow to be as long as it needs to be.
– No space is lost to internal fragmentation.
Improving the secondary index
structure III: Solution 2
Advantages:
– The secondary index file needs to be rearranged only upon record addition.
– The rearranging is faster.
– It is not that costly to keep the secondary index on disk.
– The file of primary key reference lists never needs to be sorted.
– Space from deleted reference-list records can easily be reused.
Disadvantage:
– Locality (in the reference lists) has been lost ==> more seeking may be necessary.
Selective Indexes
• Secondary indexes can be used to divide a file into parts and provide a selective view.
• Example: build a selective index that contains only the titles of certain recordings in the record collection.
• Suppose we want to extract the recordings released on a particular date: it is possible to form a selective index, e.g., a selective index for “recordings released prior to 1970”.
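A selective index can be sketched as an ordinary secondary index built only over records that satisfy a condition; the recording data below is invented:

```python
# (primary key, title, release year) tuples standing in for the data file.
recordings = [
    ("LON2312", "Light as a Feather", 1973),
    ("RCA2626", "Symphony No. 9", 1955),
    ("DG18807", "Symphony No. 9", 1962),
]

# Selective index for "recordings released prior to 1970":
# only matching records get (title, primary key) entries.
pre_1970 = sorted((title, key)
                  for key, title, year in recordings if year < 1970)

assert pre_1970 == [("Symphony No. 9", "DG18807"),
                    ("Symphony No. 9", "RCA2626")]
```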
Improving the secondary index
structure III: Solution 2
• If we have fixed-length records with RRNs, then the following can be done: (figure)
Binding I
Question: at what point is the key bound to the physical address of its associated record?
Answer so far:
– The binding of our primary keys takes place at construction time.
– The binding of our secondary keys takes place at the time they are used.
Advantage of construction-time binding:
– faster access.
Disadvantage of construction-time binding:
– reorganization of the data file must result in modifications to all bound index files.
Advantage of retrieval-time binding:
– safer.
Binding II
• Tradeoff in binding decisions:
– Tight, construction-time binding is preferable when:
• the data file is static or nearly static, requiring little or no adding, deleting, or updating;
• rapid performance during actual retrieval is a high priority.
– Postponing binding as long as possible is simpler and safer when the data file requires a lot of adding, deleting, and updating.