
Chapter-3

Fundamental File Structure Concepts


Introduction to Record and Field Structure

 A record is a collection of fields.


 A field is used to store information about
some attribute.
 The question: when we write records, how do
we organize the fields in the records:
 so that the information can be recovered
 so that we save space
 so that we can process efficiently
 to maximize record structure flexibility
Field Delineation methods
 Fixed length fields
 Include length with field

 Separate fields with a delimiter

 Include keyword expression to identify each field
Fixed length fields
 Easy to implement - use language record
structures (no parsing)
 Fields must be declared at maximum length
needed
10         10         15              15              2     9
last       first      address         city            state zip

“Yeakus    Bill      123 Pine       Utica          OH43050    ”
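The layout above maps naturally onto a language record structure. What follows is a minimal sketch, not the course's own code: the names `FixedPerson`, `setField`, and `packPerson` are hypothetical, and the choice to blank-pad on the right and truncate over-long values is an assumption.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Field widths taken from the slide: 10 10 15 15 2 9 (61 bytes in total).
struct FixedPerson {
    char last[10];
    char first[10];
    char address[15];
    char city[15];
    char state[2];
    char zip[9];
};

// Copy a value into a fixed-width field, padding with blanks on the right.
// Values longer than the field are truncated (an assumption of this sketch).
template <std::size_t N>
void setField(char (&field)[N], const std::string& value) {
    std::memset(field, ' ', N);
    std::memcpy(field, value.data(), value.size() < N ? value.size() : N);
}

// Build the 61-byte record image for one person.
std::string packPerson(const std::string& last, const std::string& first,
                       const std::string& address, const std::string& city,
                       const std::string& state, const std::string& zip) {
    assert(sizeof(FixedPerson) == 61);  // only char arrays, so no padding is expected
    FixedPerson p;
    setField(p.last, last);
    setField(p.first, first);
    setField(p.address, address);
    setField(p.city, city);
    setField(p.state, state);
    setField(p.zip, zip);
    return std::string(reinterpret_cast<const char*>(&p), sizeof p);
}
```

Because every field sits at a known offset, reading the record back requires no parsing: the bytes can be copied straight into a `FixedPerson`.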


Include length with field
 Begin field with length indicator
 If maximum field length <256, a byte can be
used for length
last first address city state zip

Length bytes (06, 04, 08) precede each field:

Yeakus Bill 123 Pine

06 59 65 61 6B 75 73 04 42 69 6C 6C 08 31 32 33 20 50 69 6E 65
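A short sketch of the length-byte scheme, assuming a single unsigned byte per length as described above. The function names `appendField` and `parseFields` are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Append one field to the buffer as a single length byte followed by the data.
void appendField(std::string& buffer, const std::string& field) {
    assert(field.size() < 256);  // one byte can only hold lengths 0..255
    buffer.push_back(static_cast<char>(field.size()));
    buffer += field;
}

// Split a buffer of length-prefixed fields back into separate strings.
std::vector<std::string> parseFields(const std::string& buffer) {
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (pos < buffer.size()) {
        std::size_t len = static_cast<unsigned char>(buffer[pos]);
        fields.push_back(buffer.substr(pos + 1, len));
        pos += 1 + len;  // skip the length byte and the field data
    }
    return fields;
}
```

Appending "Yeakus", "Bill", and "123 Pine" in that order produces a 21-byte buffer whose length bytes are 06, 04, and 08. A zero-length field is stored as a lone 00 byte, which is how an optional field is left empty in this scheme.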
Separate fields with a delimiter
 Use a special character not used in data
 space, comma, tab
 Also special ASCII characters: Field Separator (FS), hex 1C
 Here we use “|”
 Also need an end-of-record delimiter: “#”

“Yeakus|Bill|123 Pine|Utica|OH|43050#“
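A possible parse routine for the delimited form, assuming "|" between fields and "#" at the end of the record as in the example. The name `splitRecord` is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split one record of the form "f1|f2|...|fn#" into its fields.
// Adjacent delimiters produce an empty string, i.e. an omitted optional field.
std::vector<std::string> splitRecord(const std::string& record,
                                     char fieldDelim = '|',
                                     char recordDelim = '#') {
    assert(fieldDelim != recordDelim);
    std::vector<std::string> fields;
    std::string current;
    for (char c : record) {
        if (c == recordDelim) break;  // end of this record
        if (c == fieldDelim) {
            fields.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    fields.push_back(current);  // the last field has no trailing '|'
    return fields;
}
```

The sketch only works if neither delimiter can occur inside the data, which is the condition the slide states.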
Include keyword expression
 Keywords label each field
 A self-describing structure

 Allows LOTS of flexibility

 Uses lots of space

“LAST=Yeakus|FIRST=Bill|ADDRESS=123 Pine|
CITY=Utica|STATE=OH|ZIP=43050#“
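The self-describing form can be read into a keyword-to-value map. This is a sketch under the same delimiter assumptions as the example ("|" between fields, "=" after the keyword, "#" ending the record); `parseKeywordRecord` is a hypothetical name.

```cpp
#include <cassert>
#include <map>
#include <string>

// Parse "KEY=value|KEY=value|...#" into a map from keyword to value.
// Items without '=' are skipped (an assumption of this sketch).
std::map<std::string, std::string> parseKeywordRecord(const std::string& record) {
    std::map<std::string, std::string> fields;
    const std::string body = record.substr(0, record.find('#'));
    std::size_t pos = 0;
    while (pos <= body.size()) {
        std::size_t bar = body.find('|', pos);
        if (bar == std::string::npos) bar = body.size();
        assert(bar >= pos);  // find() never returns a position before pos
        const std::string item = body.substr(pos, bar - pos);
        const std::size_t eq = item.find('=');
        if (eq != std::string::npos)
            fields[item.substr(0, eq)] = item.substr(eq + 1);
        pos = bar + 1;
    }
    return fields;
}
```

Field order no longer matters, and an optional field is simply absent from the map, which is the flexibility the slide refers to. The cost is that every keyword is stored in every record.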
Optional Fields
 Fixed length
 Leave blank
 Field length
 zero length field
 Delimiter
 Adjacent delimiters
 Keywords
 Just leave out
Reading a stream of fields
 Need to break record into fields
 Fixed length can simply be read into record
structure
 Others must be “parsed” with a parse
algorithm
Record Structures
 How do we organize records in a file?
 Records can be fixed length or variable
length
 Fixed length allows simple direct access lookup
 Fixed may waste space
 Variable - how do we find a record’s position?
Record Structures
 Fixed Length Records
 Fixed number of fields in records

 Variable length
 prefix each record with a length
 Use a second file to keep track of record start
positions
 Place delimiter between records
Fixed Length Records
 All records same length
 Record positions can be calculated for direct
access reads.
 Does not imply that the sizes or number
of fields are fixed.
 Variable length data stored in fixed length
records leaves unused space.
Fixed number of fields in records

 Field size could be fixed or variable


 Fixed
 results in fixed size records
 simply read directly into “struct”
 Variable sized fields
 delimited or field lengths
 Simply count fields while parsing
Variable length Records
 prefix each record with a length
 Use a second file to keep track of record start
positions
 Place delimiter between records
Prefix records with a length
 Allows true variable length records
 Form of prefix:
 Character number (fixed length)
 Binary number (write integer without conversion)
 Must consider Maximum length
 No direct access (great for sequential
access)
Index of record start addresses
 A second file is simply a list of offsets to
successive records
 Since the offsets are fixed length, this file
allows direct access, thereby allowing direct
access to the main file.
 Problem
 Maintaining file (adding and deleting records)
 Cost of index
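A sketch of the offset-index idea. For brevity the index is kept in a `std::vector` here; in the scheme described above it would be written to the second file as fixed-length entries. The names `appendRecord` and `fetchRecord` are hypothetical, and the records are assumed to be newline-delimited.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Append a newline-delimited record and remember the offset where it starts.
void appendRecord(std::ostream& data, std::vector<std::streamoff>& index,
                  const std::string& record) {
    index.push_back(data.tellp());
    data << record << '\n';
}

// Fetch record number n (counting from 0) directly, using the index.
std::string fetchRecord(std::istream& data,
                        const std::vector<std::streamoff>& index,
                        std::size_t n) {
    assert(n < index.size());
    data.clear();          // reset any earlier end-of-file state
    data.seekg(index[n]);  // jump straight to the record's start offset
    std::string record;
    std::getline(data, record);
    return record;
}
```

The functions take generic streams, so they can be tried with a `std::stringstream` or used with a `std::fstream` opened in binary mode. The maintenance problem listed above is visible here: deleting or resizing a record in the main file invalidates every later offset.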
Place delimiter between records
 Special character not used in record
 Allows efficient variable size

 No direct access

 Bible files - use ‘\n’ as delimiter


Binary data in files
 Binary reals and integers can be written, and
read, from a file:
 Need to know byte size of variables used.
 The “sizeof” operator returns data size
Binary data in files
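#include <cstring>
#include <fstream>
using namespace std;

const int MAX = 256; // buffer size (assumed; not given on the slide)

int main() {
    int rsize;
    char rec_buf[MAX];
    fstream mf;
    mf.open("myfile.bin", ios::binary | ios::out);

    strcpy(rec_buf, "this is a test record");
    rsize = strlen(rec_buf);
    mf.write((char*)&rsize, sizeof(int)); // write the size
    mf.write(rec_buf, rsize);             // write the record
    mf.close();

    mf.open("myfile.bin", ios::binary | ios::in);
    mf.read((char*)&rsize, sizeof(int));  // read the size
    mf.read(rec_buf, rsize);              // read the record
    rec_buf[rsize] = '\0';                // read() does not add a terminator
    mf.close();
}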
Viewing Binary file data
 Use the file dump utility (od - octal dump)
 od -xc <filename>
 x - hex output
 c - character output
 Useful for viewing what is actually in file
Using Classes to Manipulate Buffers
 Three Classes
 delimited fields
 Length-based fields
 Fixed length fields
Record Access - Keys
 Attribute used to identify records
 Often used to find records

 Standard or canonical form


 rules which keys must conform to
 prevents missing record because key in different
form
 Example:
 all capitals
 Phone in form (nnn) nnn-nnnn
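The two example rules can be sketched as small normalizing functions. The names `canonicalName` and `canonicalPhone` are hypothetical, and the handling of input that does not contain exactly ten digits (an assertion failure) is an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Canonical form for name keys: all capitals, no leading or trailing blanks.
std::string canonicalName(const std::string& key) {
    const std::size_t first = key.find_first_not_of(' ');
    if (first == std::string::npos) return "";
    const std::size_t last = key.find_last_not_of(' ');
    std::string out = key.substr(first, last - first + 1);
    for (char& c : out)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return out;
}

// Canonical form for phone keys: (nnn) nnn-nnnn, built from the ten digits
// found in the input; all other characters are ignored.
std::string canonicalPhone(const std::string& raw) {
    std::string digits;
    for (unsigned char c : raw)
        if (std::isdigit(c)) digits.push_back(static_cast<char>(c));
    assert(digits.size() == 10);
    return "(" + digits.substr(0, 3) + ") " + digits.substr(3, 3) + "-" +
           digits.substr(6);
}
```

Applying the same function to a key before it is stored and before it is searched for means that "yeakus" and "YEAKUS " find the same record.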
Record Access - Keys
 Keys can be distinct - uniquely identify records
 Primary keys
 one-to-one relationship between key value and
possible entities represented
 SSN, Student ID
 Keys can identify a collection of records
 Secondary keys
 one-to-many relationship
 City, position, department
Record Access - Keys
 Primary key desired characteristics
 unique among collection of entities
 dataless - what if some entities have no value of
this type (e.g. SSN)
 unchanging
Record access
 Performance of access method
 how do we compare techniques?
 Must be careful what events we count.
 “big-oh” notation gives us a way to factor out all
but the most significant factors
Record Access - timing
 Sequential searching
 Consider file of 4000 records
 What if no blocking done, and one record per
block? (500 bytes records, 512 byte blocks)
 What if cluster size set to 8?
 always requires O(n), but search is faster by a
constant factor
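The two questions above can be worked as simple arithmetic. The sketch assumes that one read fetches one block when no blocking is done, and that a cluster size of 8 means eight 512-byte blocks (hence eight 500-byte records) arrive per read; `readsToScan` is a hypothetical helper.

```cpp
#include <cassert>

// Number of read operations needed to scan the whole file sequentially,
// given how many records arrive with each read.
long readsToScan(long recordCount, long recordsPerRead) {
    assert(recordCount > 0 && recordsPerRead > 0);
    return (recordCount + recordsPerRead - 1) / recordsPerRead;  // ceiling division
}
```

Under these assumptions `readsToScan(4000, 1)` is 4000 and `readsToScan(4000, 8)` is 500. A successful search examines about half the file on average, so roughly 2000 reads drop to roughly 250. Both cases grow linearly with the number of records, which is why the improvement is only a constant factor.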
Sequential searching
 Usually NOT the best method
 Sometimes it is best:
 Searching for some ASCII pattern (grep)
 Small files
 Files rarely searched
 Searching on secondary key, and a large
percentage of records match (say 25%)
Unix Tools for sequential file processing

 cat - display a file


 wc - count lines, words, and characters

 grep - find lines in file(s) which match regular


expression.
Direct Access
 Move “directly” to record without scanning preceding
data
 Different languages/OS’s support different models:
 Byte offset model
 Programmer must specify offset to record, and record size
to read.
 Supports variable size records, skip sequential processing

 Relative Record Number (RRN) model


 File has a fixed record size (declared at creation time)

 Records are specified by a record number

 File modeled as a collection of components

 Higher level of abstraction
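The RRN model can be layered on top of the byte offset model with one multiplication. A minimal sketch, assuming records are numbered from 0 and an optional fixed-size header precedes them; `recordOffset` and `readByRRN` are hypothetical names.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Byte offset of record number rrn in a file of fixed-size records.
std::streamoff recordOffset(long rrn, std::size_t recordSize,
                            std::size_t headerSize = 0) {
    assert(rrn >= 0 && recordSize > 0);
    return static_cast<std::streamoff>(headerSize) +
           static_cast<std::streamoff>(rrn) *
               static_cast<std::streamoff>(recordSize);
}

// Read one record by its relative record number, without scanning
// the records that precede it.
std::string readByRRN(std::istream& file, long rrn, std::size_t recordSize,
                      std::size_t headerSize = 0) {
    std::string record(recordSize, '\0');
    file.clear();
    file.seekg(recordOffset(rrn, recordSize, headerSize));
    file.read(&record[0], static_cast<std::streamsize>(recordSize));
    return record;
}
```

With a 32-byte header and 67-byte records, record 3 starts at byte 32 + 3 * 67 = 233.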


Direct Access
 Different language support
 RRN support
 PL/I
 COBOL
 Pascal (files are modeled as a collection of
components (records))
 FORTRAN
 Byte offset
 C
Choosing Record Sizes for Direct Access

 Fixed Length Fields


 Very easy to parse records - just read into record
structure!
 Each field must be maximum length needed!
 Thus the record must be as long as the sum of
all the maximum field lengths

10         10         15              15              2     9
last       first      address         city            state zip

“Yeakus    Bill      123 Pine       Utica          OH43050    ”


Choosing Record Sizes for Direct Access

 Variable length fields


 Each field can be any length
 since some can be long, others short, overall
record size may be shorter.
 This gives more flexibility to field lengths
 Records must be parsed, space wasted for
delimiter or length bytes.
Yeakus|Bill|123 Pine|Utica|OH|43050
Snivenloppinsky|Helmut|12232 Galmentary Avenue|Spotsdale|NY|11232
Header Records
 The first record in a direct file may be used to
store special information
 Number of records used.
 Location of first record in key order sequence.
 Location of first empty record
 File record structure (meta-data)
 In languages with the RRN model (e.g. Pascal),
the variant record facility must be used
 In C++, the header record can be of different
size from the rest of the file records.
Header Records
 Consider a file of persons
 Header record contains a 2-byte
record count.
 Header size is 32, record size is 67
class head {
public:
    short rec_count;
    char fill[30];
};

class Person {
public:
    char LastName [11];
    char FirstName [11];
    char Address [16];
    char City [16];
    char State [3];
    char ZipCode [10];
};
Header Records
 Must be written when file created
 Must be rewritten when file changed

 Must be read when file is opened
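The three rules can be sketched with the `head` and `Person` layouts from the previous slide. The helper names `writeHeader`, `readHeader`, and `appendPerson` are hypothetical, and the sketch assumes a compiler that lays out `head` in 32 bytes and `Person` in 67 bytes, as the slide states.

```cpp
#include <cassert>
#include <cstring>
#include <iostream>
#include <sstream>

class head {
public:
    short rec_count;
    char fill[30];
};

class Person {
public:
    char LastName[11];
    char FirstName[11];
    char Address[16];
    char City[16];
    char State[3];
    char ZipCode[10];
};

static_assert(sizeof(head) == 32, "header is expected to occupy 32 bytes");
static_assert(sizeof(Person) == 67, "record is expected to occupy 67 bytes");

// Write the header at the very start of the file.
void writeHeader(std::ostream& file, const head& h) {
    file.seekp(0);
    file.write(reinterpret_cast<const char*>(&h), sizeof h);
}

// Read the header back; done whenever the file is opened.
head readHeader(std::istream& file) {
    head h;
    file.seekg(0);
    file.read(reinterpret_cast<char*>(&h), sizeof h);
    return h;
}

// Append one record after the existing ones, then rewrite the header
// so that the stored count matches the file contents.
void appendPerson(std::iostream& file, const Person& p) {
    head h = readHeader(file);
    assert(h.rec_count >= 0);
    file.seekp(static_cast<std::streamoff>(sizeof(head) +
                                           h.rec_count * sizeof(Person)));
    file.write(reinterpret_cast<const char*>(&p), sizeof p);
    ++h.rec_count;
    writeHeader(file, h);
}
```

A new file starts with `head h = {}; writeHeader(file, h);`, after which each `appendPerson` call keeps the count current. The functions accept any `std::iostream`, so a `std::fstream` opened with `ios::in | ios::out | ios::binary` or a `std::stringstream` both work.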


IOS - I/O streams in C++
 #include <iostream.h> (pre-standard header; standard C++ uses <iostream>)
 As the iostream class hierarchy diagram
shows, ios is the base class for all the
input/output stream classes.
 You will not use ios directly, rather you
will be using many of the inherited
member functions and data members.
IOS - I/O streams in C++
 Data Members (static) — Public Members
 basefield
 Mask for obtaining the conversion base
flags (dec, oct, or hex).
 adjustfield
 Mask for obtaining the field padding
flags (left, right, or internal).
 floatfield
 Mask for obtaining the numeric format
(scientific or fixed).
IOS - I/O streams in C++
 Flag and Format Access Functions — Public
Members
 flags
 Sets or reads the stream’s format flags.
 setf
 Manipulates the stream’s format flags.
 unsetf
 Clears the stream’s format flags.
 fill
 Sets or reads the stream’s fill character.
 precision
 Sets or reads the stream’s floating-point format display
precision.
 width
 Sets or reads the stream’s output field width.
IOS - I/O streams in C++
 Status-Testing Functions — Public Members
 good
 Indicates good stream status.
 bad
 Indicates a serious I/O error.
 eof
 Indicates end of file.
 fail
 Indicates a serious I/O error or a possibly recoverable
I/O formatting error.
 rdstate
 Returns the stream’s error flags.
 clear
 Sets or clears the stream’s error flags.
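A small sketch of how the status-testing functions are typically combined after a read loop. It uses the standard `<istream>` header rather than the older `<iostream.h>`; `sumIntegers` and its return strings are hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Sum whitespace-separated integers from a stream and report why reading
// stopped: "bad", "eof", or "format error".
std::string sumIntegers(std::istream& in, long& sum) {
    sum = 0;
    long value;
    while (in >> value)  // the loop ends as soon as fail() becomes true
        sum += value;
    if (in.bad()) return "bad";  // serious I/O error
    if (in.eof()) return "eof";  // ran out of input normally
    in.clear();                  // recoverable formatting error: reset the flags
    return "format error";
}
```

After a formatting error the call to `clear()` puts the stream back into the good state, so the caller can skip the offending text and continue reading.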
IOS - I/O streams in C++
 ios Manipulators
 dec
 Causes the interpretation of subsequent fields in
decimal format (the default mode).
 hex
 Causes the interpretation of subsequent fields in
hexadecimal format.
 oct
 Causes the interpretation of subsequent fields in
octal format.
 binary
 Sets the stream’s mode to binary (stream must have
an associated filebuf buffer).
 text
 Sets the stream’s mode to text, the default mode
(stream must have an associated filebuf buffer).
IOS - I/O streams in C++
 Parameterized Manipulators
(#include <iomanip.h> required; <iomanip> in standard C++)
 setiosflags
 Sets the stream’s format flags.
 resetiosflags
 Resets the stream’s format flags.
 setfill
 Sets the stream’s fill character.
 setprecision
 Sets the stream’s floating-point display
precision.
 setw
 Sets the stream’s field width (for the next field
only).
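A brief sketch showing several of these manipulators together, written against the standard `<iomanip>` header. The function names `hexByte` and `formatAmount` are hypothetical.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Format one byte as two uppercase hex digits, padding with a leading zero.
std::string hexByte(unsigned char byte) {
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setw(2) << std::setfill('0')
        << static_cast<int>(byte);
    return out.str();
}

// Format an amount with two decimal places, right-justified in a field of
// the given width and padded on the left with '*'.
std::string formatAmount(double amount, int width) {
    assert(width >= 0);
    std::ostringstream out;
    out << std::setiosflags(std::ios::fixed) << std::setprecision(2)
        << std::setw(width) << std::setfill('*') << amount;
    return out.str();
}
```

Note that `setw` applies to the next field only, while `hex`, `setfill`, and `setprecision` stay in effect until changed.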
File Access and Organization
 File Organization
 Variable Length Records
 Fixed Length Records
 Field Structures (size bytes, delimiters, fixed)
 File Access
 Sequential access
 Direct access
 Indexed access
File Access and Organization
 Interaction between organization and access
 Can the file be divided into fields?
 Is there a higher level of organization to the file
(meta data)?
 Do all records have to have the same number of
fields, bytes?
 How do we distinguish one record from the next?
 How do we recognize if a fixed length record
holds real data or not?
File Access and Organization
 There is often a trade-off between space
and time
 Fixed length records - allow direct access, waste
space
 Variable length records - save space, but require sequential search
 We also must consider the typical use of the
file - what are the desired access patterns
 Selection of a particular organization has
implications on the allowable types of access
Portability and Standardization
 Differences among Languages
 Fixed sized records versus byte addressable
access
 Differences among Machine Architectures
 Byte order of binary data
 May be high order or low order byte first
Byte order of binary data
 High order first: (Big Endian)
 A long int: say 45 is stored in memory.
 It is stored as: 00 00 00 2D
 Suns, network protocols
 Low order first (Little Endian)
 A long int: say 45 is stored in memory.
 It is stored as: 2D 00 00 00
 PCs, VAXes
Byte order of binary data
 If binary data is written to a file, it is written in
the order stored in memory
 If the data is later read by a system with a
different ordering, the number will be
incorrect!
 For the sake of portability, files should be
written in an agreed upon format (probably
Big Endian)
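One way to follow this advice is to build the bytes explicitly instead of writing the in-memory representation. A minimal sketch for 32-bit values, assuming Big Endian as the agreed file format; `toBigEndian` and `fromBigEndian` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode a 32-bit value as four bytes, high-order byte first.
std::string toBigEndian(std::uint32_t value) {
    std::string bytes(4, '\0');
    bytes[0] = static_cast<char>((value >> 24) & 0xFF);
    bytes[1] = static_cast<char>((value >> 16) & 0xFF);
    bytes[2] = static_cast<char>((value >> 8) & 0xFF);
    bytes[3] = static_cast<char>(value & 0xFF);
    return bytes;
}

// Decode four Big Endian bytes back into a 32-bit value.
std::uint32_t fromBigEndian(const std::string& bytes) {
    assert(bytes.size() == 4);
    std::uint32_t value = 0;
    for (unsigned char b : bytes)
        value = (value << 8) | b;
    return value;
}
```

Because the shifts operate on the value rather than on its memory layout, the same four bytes (00 00 00 2D for 45) are produced on both Big Endian and Little Endian machines.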
