
File Organization and Processing

Lecture 1
Textbook and References
• Michael J. Folk, Bill Zoellick and Greg Riccardi, "File Structures: An Object-Oriented Approach with C++," Addison-Wesley, 1998.
• Any reference covering the basic concepts of C++.

09/11/21 2
RAM vs. Secondary Storage
Secondary storage                    RAM
• Slow                               • Fast
• Large                              • Small
• Cheap                              • Expensive
• Stable                             • Volatile
• Ex: Hard disk, floppy, tape, CD
• Data are stored here               • Data are manipulated here

Data is transferred between secondary storage and RAM.

File Structures vs. Data Structures

File Structures (secondary storage)    Data Structures (RAM)
• Involves representation of data      • Involves representation of data
• Involves operations for              • Involves operations for
  accessing data                         accessing data

Goal of File Structures
• To retrieve information from the disk with as few disk accesses as possible.
• To group related information so that a request can be satisfied with a single trip to the disk.

Physical File vs. Logical File
• What is a physical file?
  - It is a collection of bytes stored on a disk.
  - It has a physical name; e.g., A.txt.
  - The physical name is used only once, when opening the file.
• What is a logical file?
  - It is a channel that connects the program to the corresponding physical file.
  - It has a logical name, which is a variable inside the program; e.g., infile.
  - The logical name is used as many times as needed when opening, reading, writing, appending to, or closing the file.

Physical File vs. Logical File (cont.)
• The operating system establishes the link between the chosen logical name and the physical file.
• The program deals with the bytes in the physical file through the logical file.
• From the program's point of view, devices such as the keyboard and the screen are treated as files.
• There may be a very large number of physical files, but only a limited number of logical files can be open at the same time.

Sample Program
A program that displays the contents of a file on the screen:
• Open the file for input
• While there are characters to read:
  - Read a character from the file
  - Write the character to the screen
• Close the file

Sample Program in C++
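#include <fstream>
#include <iostream>
using namespace std;

int main() {
    char ch;
    fstream infile;

    infile.open("A.txt", ios::in);
    // clear the skipws flag so that white space is not skipped
    infile.unsetf(ios::skipws);

    infile >> ch;               // try to read the first character
    while (!infile.fail()) {    // stop when a read fails (end of file)
        cout << ch;
        infile >> ch;
    }

    infile.close();
    return 0;
}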

Opening Files
• Opening a file makes it ready for use. There are two options:
  - Open an existing file.
  - Create a new file.
• When a file is opened, we are positioned at the beginning of the file.

Opening Files in C++
• “open” opens the file identified by its physical file name and associates a logical name with it.
• Example:
    infile.open("A.txt", ios::in);
• The first argument is the physical file name.
• The second argument is the mode.

Opening Files in C++
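• The mode:
    ios::in          open for input
    ios::out         open for output
    ios::app         seek to the end of the file before each write
    ios::trunc       create a new file, discarding any existing contents
    ios::nocreate    fail if the file does not exist (pre-standard <fstream.h> only)
    ios::noreplace   create a new file, but fail if it already exists (pre-standard <fstream.h>; in standard C++ only since C++23)
    ios::binary      open in binary mode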

Closing Files
• After closing, the logical name may be used again with another physical file.
• Bytes are not sent one by one to the physical file. Instead, they are stored in a buffer and sent as a block. When the file is closed, whatever is left in the buffer is flushed to the file.
• If the file is not closed by the program, the operating system closes it at the end of program execution. However, if the program terminates abnormally, data may be lost.

Closing Files
• In C++:
    infile.close();

Q1: Show how to change the permissions on a file named myfile so that the owner has read and write permissions, group members have execute permission, and others have no permissions.
    chmod 0610 myfile
Q2: What does pmode=521 mean?

Reading Files in C++
• Example:
    infile >> ch;
• Same action (when the skipws flag is cleared):
    infile.read(&ch, 1);

Writing Files in C++
• Example:
    outfile << ch;
• Same action:
    outfile.write(&ch, 1);
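
The unformatted read and write calls can be combined into a byte-by-byte copy. The following is a minimal sketch rather than code from the lecture: the names copyBytes and copyText are made up for illustration, and in-memory string streams stand in for files so that it runs without A.txt on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copies every byte from `in` to `out` with the unformatted calls
// read() and write(), so white space is never skipped.
std::size_t copyBytes(std::istream& in, std::ostream& out) {
    std::size_t count = 0;
    char ch;
    while (in.read(&ch, 1)) {   // the loop ends when no byte could be read
        out.write(&ch, 1);
        ++count;
    }
    return count;
}

// Hypothetical helper: runs copyBytes on in-memory streams.
std::string copyText(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream out;
    copyBytes(in, out);
    assert(in.eof());           // the copy stopped because input ran out
    return out.str();
}
```

Because copyBytes takes std::istream& and std::ostream&, the same function should also accept an ifstream and an ofstream opened as shown on the earlier slides.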

Seeking in C++
• An object of class fstream has two file pointers, a get pointer and a put pointer:
  - seekg moves the get pointer
  - seekp moves the put pointer
• file.seekg(offset, origin)
• file.seekp(offset, origin)
    origin      File location
    ios::beg    Beginning of the file
    ios::cur    Current file position
    ios::end    End of the file

Seeking in C++: Examples
file.seekg(0, ios::beg);
// moves to the beginning of the file

file.seekg(0, ios::end);
// moves to the end of the file

file.seekg(-1, ios::end);
// moves back 1 byte from the end of the file

• Example:
  The following moves both the get and put pointers to byte 373:
    file.seekg(373, ios::beg);
    file.seekp(373, ios::beg);
• Example:
  The following moves the put pointer to the end of the file:
    file.seekp(0, ios::end);
• Example:
  The following moves the get pointer to the beginning of the file:
    file.seekg(0, ios::beg);
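
Two common uses of seekg are finding the size of a file and fetching the byte at a known offset. The sketch below is illustrative only: streamSize and byteAt are hypothetical names, and the functions take a generic std::istream so they can be tried on an in-memory std::istringstream as well as on a file stream.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>   // std::istringstream, for trying this without a file

// Returns the number of bytes in the stream by seeking to its end
// and asking where the get pointer is.
std::streamoff streamSize(std::istream& in) {
    in.clear();                        // drop any stale eof/fail state
    in.seekg(0, std::ios::end);
    return static_cast<std::streamoff>(in.tellg());
}

// Returns the byte stored `offset` bytes from the beginning.
char byteAt(std::istream& in, std::streamoff offset) {
    in.clear();
    in.seekg(offset, std::ios::beg);   // move the get pointer
    char ch = '\0';
    in.read(&ch, 1);
    assert(!in.fail());                // the offset must lie inside the stream
    return ch;
}
```

For a stream holding "Hello, world.", streamSize should report 13 and byteAt(in, 7) should return 'w'.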

Example: C++ Implementation
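#include <fstream>
using namespace std;

int main() {
    fstream myfile;

    myfile.open("test.txt", ios::in | ios::out | ios::trunc | ios::binary);

    myfile << "Hello, world. Hello, again.";

    myfile.seekp(12, ios::beg);   // byte 12 is the first '.'
    myfile << 'X' << 'X';         // overwrites bytes 12 and 13
    myfile.seekp(3, ios::cur);    // skips ahead 3 bytes, to byte 17
    myfile << 'Y';
    myfile.seekp(-2, ios::end);   // 2 bytes before the end, byte 25
    myfile << 'Z';

    myfile.close();
    return 0;
}
Resulting contents of test.txt:
Hello, worldXXHelYo, agaiZ.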
File as a Stream of Bytes
File: A.txt
St#   First  Last
123456David  McDonald
213456MichaelDouglas
312456George Bush
321456Paul   Martin

• As indicated last lecture, either
    infile.open("A.txt", ios::in);      (C++ streams)
    infile = fopen("A.txt", "r");       (C stdio)
  positions us at the beginning of the file A.txt.

Sample File as a Stream of Bytes
• As indicated last lecture, either
    infile >> ch;                   (C++ streams)
    fread(&ch, 1, 1, infile);       (C stdio)
  will read a character from the logical file infile (i.e., A.txt).
• The character read is the first character in the file; i.e., 1.
• The file position is then advanced to point to the next character.
• The next
    infile >> ch;
    fread(&ch, 1, 1, infile);
  will read 2.
• Here we looked at the file as a stream of consecutive bytes.

File as a Collection of Records
File: A.txt
St#   First  Last
123456David  McDonald
213456MichaelDouglas
312456George Bush
321456Paul   Martin

[Figure labels: "Record" marks one line of the file; "Field" marks one item within a line.]

File as a Collection of Records
• A record is a collection of fields.
• A field is the smallest logically meaningful unit of information in a file.
• A key is a subset of the fields in a record used to identify the record.
• A key corresponds to a field or combination of fields that may be used in a search. In other words, not every field is a key.
• A primary key is a key that uniquely identifies the record; e.g., the student number.

Field Structures
1. Fix the length of fields

123456123456712345678
123456David  McDonald
213456MichaelDouglas
312456George Bush
321456Paul   Martin

• In this example, the field lengths are 6, 7 and 8.
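
Reading fixed-length fields amounts to cutting the record at known positions. Below is a minimal sketch, assuming that short values are padded on the right with blanks; the function name splitFixed is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits one fixed-length record into fields of the given widths and
// strips the blank padding on the right of each field.
std::vector<std::string> splitFixed(const std::string& record,
                                    const std::vector<std::size_t>& widths) {
    std::vector<std::string> fields;
    std::size_t pos = 0;
    for (std::size_t width : widths) {
        assert(pos + width <= record.size());            // record must match the layout
        std::string field = record.substr(pos, width);
        field.erase(field.find_last_not_of(' ') + 1);    // drop trailing blanks
        fields.push_back(field);
        pos += width;
    }
    return fields;
}
```

With widths {6, 7, 8}, the 21-byte record for David McDonald should split into "123456", "David" and "McDonald".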

Field Structures
2. Begin each field with a length indicator

0612345605David08McDonald
0621345607Michael07Douglas
0631245606George04Bush
0632145604Paul06Martin
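
A record in this format can be decoded by reading a length, then that many bytes, and repeating. The sketch below assumes, as in the example records, that each length is written as exactly two ASCII digits; the function name splitLengthPrefixed is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits a record in which every field is preceded by a two-digit
// ASCII length indicator (an assumption taken from the example data).
std::vector<std::string> splitLengthPrefixed(const std::string& record) {
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (pos + 2 <= record.size()) {
        std::size_t length = std::stoul(record.substr(pos, 2));
        pos += 2;                                  // skip the length indicator
        assert(pos + length <= record.size());     // the field must fit in the record
        fields.push_back(record.substr(pos, length));
        pos += length;                             // jump to the next indicator
    }
    return fields;
}
```

Note that the parser never inspects the field contents; it jumps straight from one length indicator to the next.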

Field Structures
3. Separate the fields with delimiters

123456|David|McDonald|
213456|Michael|Douglas|
312456|George|Bush|
321456|Paul|Martin|
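
Delimited fields can be recovered by scanning for the delimiter. The following is a minimal sketch using std::getline with a custom delimiter; the function name splitDelimited is hypothetical, and it assumes every field, including the last, is terminated by the delimiter as in the records above.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits a record whose fields are each terminated by `delimiter`.
std::vector<std::string> splitDelimited(const std::string& record,
                                        char delimiter = '|') {
    // Assumption from the example data: the last field is terminated too.
    assert(!record.empty() && record.back() == delimiter);
    std::vector<std::string> fields;
    std::istringstream in(record);
    std::string field;
    while (std::getline(in, field, delimiter)) {   // reads up to the next delimiter
        fields.push_back(field);
    }
    return fields;
}
```

An empty field, written as two delimiters in a row, comes back as an empty string, so field positions are preserved.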

Field Structures
4. Use a “keyword = value” expression to identify
fields

NUM=123456|First=David|Last=McDonald|
NUM=213456|First=Michael|Last=Douglas|
NUM=312456|First=George|Last=Bush|
NUM=321456|First=Paul|Last=Martin|
• Notice that “|” may be omitted.
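
Because each field carries its own name, a keyword record maps naturally onto a dictionary. Below is a minimal sketch, assuming the "|" delimiters are kept and every field has the form keyword=value; the function name parseKeywordRecord is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Parses "KEY=value|KEY=value|..." into a keyword -> value map.
std::map<std::string, std::string> parseKeywordRecord(const std::string& record) {
    std::map<std::string, std::string> fields;
    std::istringstream in(record);
    std::string item;
    while (std::getline(in, item, '|')) {          // one "keyword=value" item at a time
        std::size_t eq = item.find('=');
        assert(eq != std::string::npos);           // every item needs an '='
        fields[item.substr(0, eq)] = item.substr(eq + 1);
    }
    return fields;
}
```

Fields can then be looked up by name in any order, which is the self-describing property this format offers.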

Field Structures -- Comparison
Type              Advantages                  Disadvantages
Fixed             Easy to read/store          Wastes space with padding
With length       Easy to jump to the end     Long fields (>255 bytes) require
indicator         of the field                more than 1 byte to store the length
Delimited fields  May waste less space than   Have to check every byte against
                  the length-based methods    the delimiter
Keyword           Fields are self-describing  Wastes space with keywords
Record Structures
1. Make records a predictable number of bytes
(fixed-length records)
1.1 Fixed length records combined with fixed length fields

123456David  McDonald
213456MichaelDouglas
312456George Bush
321456Paul   Martin

Record Structures
1. Make records a predictable number of bytes
(fixed-length records)
1.2 Fixed length records combined with length indicators

0612345605David08McDonald
0621345607Michael07Douglas
0631245606George04Bush
0632145604Paul06Martin
(Space is wasted wherever a record is shorter than the fixed record length.)
Record Structures
1. Make records a predictable number of bytes
(fixed-length records)
1.3 Fixed length records combined with field delimiters

123456|David|McDonald|
213456|Michael|Douglas|
312456|George|Bush|
321456|Paul|Martin|
(Space is wasted wherever a record is shorter than the fixed record length.)
Record Structures
1. Make records a predictable number of bytes (fixed-
length records)
1.4 Fixed length records combined with using keywords

NUM=123456|First=David|Last=McDonald|
NUM=213456|First=Michael|Last=Douglas|
NUM=312456|First=George|Last=Bush|
NUM=321456|First=Paul|Last=Martin|

Record Structures
2. Make records a predictable number of fields (of
variable-length)
Ex: The number of fields in our example is 3

0612345605David08McDonald0621345607Michael07.

123456|David|McDonald|213456|Michael|..

NUM=123456|First=David|Last=McDonald|NUM=213.

Record Structures
3. Begin each record with a length indicator
• Begin each record with a field containing an integer that indicates the length of the record

22123456|David|McDonald|23213456|Michael|.
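
Reading such a file means alternating between reading a length and reading that many bytes. The sketch below assumes the length is stored as exactly two ASCII digits, as in the example line; readRecords is a hypothetical name, and an std::istringstream can stand in for the data file when trying it out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>   // std::istringstream, for trying this without a file
#include <string>
#include <vector>

// Reads records that are each preceded by a two-digit ASCII length.
std::vector<std::string> readRecords(std::istream& in) {
    std::vector<std::string> records;
    char digits[2];
    while (in.read(digits, 2)) {                   // read the length indicator
        assert(std::isdigit(static_cast<unsigned char>(digits[0])) &&
               std::isdigit(static_cast<unsigned char>(digits[1])));
        std::size_t length =
            static_cast<std::size_t>((digits[0] - '0') * 10 + (digits[1] - '0'));
        std::string record(length, '\0');
        in.read(&record[0], static_cast<std::streamsize>(length));
        assert(static_cast<std::size_t>(in.gcount()) == length);  // no truncated record
        records.push_back(record);
    }
    return records;
}
```

A two-digit indicator limits records to 99 bytes; a real file would more likely store the length in binary or use more digits.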

Record Structures
4. Use an index to keep track of addresses
Data file:   123456|David|McDonald|213456|Michael|..
Index file:  00 22 ..

• We use an index to keep a byte offset for each record in the original file.
• It allows us to find the beginning of each record and compute the length of the record.
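
Given the offsets, a record's length is the distance to the next offset (or to the end of the file for the last record). Here is a minimal sketch in which a std::string stands in for the data file's contents; the name recordAt is hypothetical, and records are counted from 0.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns record i of `data`, using `index` as the list of byte offsets
// at which the records begin.
std::string recordAt(const std::string& data,
                     const std::vector<std::size_t>& index,
                     std::size_t i) {
    assert(i < index.size());                      // record must be in the index
    std::size_t begin = index[i];
    // The record ends where the next one begins, or at the end of the data.
    std::size_t end = (i + 1 < index.size()) ? index[i + 1] : data.size();
    return data.substr(begin, end - begin);
}
```

With a real file the same arithmetic would be followed by a seekg to the offset and a read of end - begin bytes.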

Record Structures -- Comparison

Type             Advantages            Disadvantages
Fixed-length     Easy to jump to the   Waste space with padding
records          i-th record
Variable-length  Save space where      Cannot jump to the i-th record
records          record sizes are      except through an index
                 diverse

Sequential Search and Direct Access
Sequential Search
Look at the records one after another until a match is found. The worst case occurs when the match is at the last record or there is no match at all. If we have n records, the search time is O(n).
Direct Access
Being able to seek directly to the beginning of the record. In this case the search time is constant regardless of the number of records: O(1) for n records.
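
The contrast can be sketched on a file of fixed-length records. The code below is illustrative only: a std::string stands in for the file contents, the key is assumed to sit at the start of each record, records are counted from 0, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Direct access, O(1): record i starts at byte i * recordSize.
std::string directAccess(const std::string& file, std::size_t recordSize,
                         std::size_t i) {
    assert((i + 1) * recordSize <= file.size());   // record must exist
    return file.substr(i * recordSize, recordSize);
}

// Sequential search, O(n): compare the key of each record in turn.
// Returns the record's position, or -1 if no record matches.
long sequentialSearch(const std::string& file, std::size_t recordSize,
                      const std::string& key) {
    std::size_t count = file.size() / recordSize;
    for (std::size_t i = 0; i < count; ++i) {
        if (file.compare(i * recordSize, key.size(), key) == 0) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

sequentialSearch may touch every record before it answers, while directAccess does one multiplication and one read no matter how many records the file holds.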

Q1:
In an air travel planning application, information is kept about each flight. This information consists of:
• Flight Number
• Flight Date
• List of Cities
• Number of Seats
• Crew Names
Suggest a way in which the fields could be organized.

Q2:
• The birth date field could be stored as eight bytes (MM/DD/YY) or as six bytes (MMDDYY or YYMMDD). What are some advantages and disadvantages of each of these formats?
• (Hint: Consider how the field might be used. One use is for displaying on reports; another may be for selecting certain records for processing, e.g., listing all students born before January 1, 1990.)

Assume the above data is organized into a file with fixed-length, positional fields within fixed-length records. Each record in the file has the four-character student number, followed by six characters for birth date, followed by twelve characters for last name, followed by eight characters for first name.

1- What is the record size for this file?
record size = sum of field sizes = 4 + 6 + 12 + 8 = 30
2- Using the above data, what is the actual file size in bytes?
Since every record is the same size,
file size = record size * # of records
For this file 30 * 5 = 150 bytes
3- Assume that you want to randomly access the fourth record in this file, in this example #3678. What byte offset would you supply to the seek operation to position the file at that record?
byte offset = record size * (record # - 1) = 30 * (4-1) = 90
4- If the year is changed from 2 bytes to 4 bytes, how will the record and field layouts for the file change? How would the total file size be affected?
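
The two formulas used in answers 2 and 3 can be written as small helper functions. This is a sketch with hypothetical names (fileSize, recordOffset); record numbers count from 1, matching the formula above.

```cpp
#include <cassert>
#include <cstddef>

// file size = record size * number of records
std::size_t fileSize(std::size_t recordSize, std::size_t recordCount) {
    return recordSize * recordCount;
}

// byte offset = record size * (record # - 1), with records numbered from 1
std::size_t recordOffset(std::size_t recordSize, std::size_t recordNumber) {
    assert(recordNumber >= 1);   // there is no record 0 in this numbering
    return recordSize * (recordNumber - 1);
}
```

fileSize(30, 5) gives 150 and recordOffset(30, 4) gives 90, as computed above. The same helpers can be used to check an answer to question 4: if the birth date grows to eight characters, the record size becomes 4 + 8 + 12 + 8 = 32 bytes.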

Assume the original data is organized into a file with variable length,
positional fields within fixed length records. The fields are in the same
order as above, but terminated by a colon character (:). Every record is
28 bytes long.
1- What is the actual file size in bytes?
Since the records are fixed length,
file size = record size * # of records = 28 * 5 = 140 bytes
2- How much record fragmentation is there?
If "*" represents blank data at the end of the record, this file will look like:
2789::Ming:Chi:*************
6345:043077:Del Matteo:Mike:
1456:032876:Liu:Mary Ann:***
3678:032876:Ray:Bethany:****
7123::Clark-Tomson:Anne:****
The amount of fragmentation is
13 + 0 + 3 + 4 + 4 = 24 bytes or 17% of the file
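
The fragmentation figure can be checked mechanically by summing the unused bytes in each record. A minimal sketch, assuming each record is given without its padding; the function name fragmentation is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sums the unused bytes when each record is stored in a fixed-size slot.
std::size_t fragmentation(const std::vector<std::string>& records,
                          std::size_t recordSize) {
    std::size_t wasted = 0;
    for (const std::string& record : records) {
        assert(record.size() <= recordSize);   // the record must fit in its slot
        wasted += recordSize - record.size();
    }
    return wasted;
}
```

For the five records above and a record size of 28, the result is 24 bytes out of 5 * 28 = 140, which is about 17%.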

Assume the above data is organized into a file with variable length, positional
fields within variable length records. The fields are in the same order as
above, separated by a colon character (:). The record is also variable
length, and is preceded by a byte count separated from the first data field
by a colon.
1- Using the original data described in the table, what is the actual file size in
bytes? The file will look like:
14:2789::Ming:Chi
27:6345:043077:Del Matteo:Mike
24:1456:032876:Liu:Mary Ann
23:3678:032876:Ray:Bethany
23:7123::Clark-Tomson:Anne
The total number of bytes is 17+30+27+26+26 = 126
2- How much record fragmentation is there? None
3- Assume that you want to be able to do random access on this data file by
using a seek to move the file position to the desired record. You will do
this by keeping a separate index file as described on page 55 of the text.
For this file the index would look like: 00 17 47 74 100
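
An index like this can be built in one pass over the file by reading each byte count and skipping that many bytes. The sketch below assumes the records are stored back to back with no line breaks, which is what the byte totals above imply; a std::string stands in for the file contents, and buildIndex is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the byte offset at which each "<count>:<data>" record begins.
std::vector<std::size_t> buildIndex(const std::string& file) {
    std::vector<std::size_t> index;
    std::size_t pos = 0;
    while (pos < file.size()) {
        std::size_t colon = file.find(':', pos);   // end of the byte count
        assert(colon != std::string::npos);        // every record needs a count
        std::size_t count = std::stoul(file.substr(pos, colon - pos));
        index.push_back(pos);
        pos = colon + 1 + count;                   // skip the colon and the data
    }
    return index;
}
```

Applied to the five records above, this should reproduce the offsets 0, 17, 47, 74 and 100.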

• Implementation: for the class person
  - Write a new person as a variable-length record in the file.
  - Read a variable-length record from the file.
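
One possible shape for these two operations is sketched below. Everything in it is an assumption made for illustration: the slides do not define the class, so the Person fields, the function names, and the on-disk layout (a byte count, a colon, then colon-separated fields, as in the earlier exercise) are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

struct Person {   // hypothetical layout, mirroring the exercise data
    std::string number, birthDate, lastName, firstName;
};

// Writes "<byte count>:<number>:<birthDate>:<lastName>:<firstName>".
void writePerson(std::ostream& out, const Person& p) {
    std::string body = p.number + ':' + p.birthDate + ':' +
                       p.lastName + ':' + p.firstName;
    out << body.size() << ':' << body;
}

// Reads one record into p; returns false when no complete record is left.
bool readPerson(std::istream& in, Person& p) {
    std::size_t length = 0;
    char colon = '\0';
    if (!(in >> length >> colon)) return false;    // byte count and separator
    assert(colon == ':');
    std::string body(length, '\0');
    if (!in.read(&body[0], static_cast<std::streamsize>(length))) return false;
    std::istringstream fields(body);               // unpack the fields
    std::getline(fields, p.number, ':');
    std::getline(fields, p.birthDate, ':');
    std::getline(fields, p.lastName, ':');
    std::getline(fields, p.firstName);             // last field runs to the end
    return true;
}
```

Both functions take generic streams, so they should work on an fstream opened in binary mode as well as on the std::stringstream that is convenient for experiments.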
