

141404 OPERATING SYSTEMS 3 0 0 3


(Common to CSE & IT)

Aim: To learn the various aspects of operating systems such as process management,
memory management, and I/O management

UNIT I PROCESSES AND THREADS 9
Introduction to operating systems - review of computer organization - operating system
structures - system calls - system programs - system structure - virtual machines.
Processes: Process concept - Process scheduling - Operations on processes -
Cooperating processes - Interprocess communication - Communication in client-server
systems. Case study: IPC in Linux. Threads: Multi-threading models - Threading issues.
Case Study: Pthreads library

UNIT II PROCESS SCHEDULING AND SYNCHRONIZATION 10
CPU Scheduling: Scheduling criteria - Scheduling algorithms - Multiple-processor
scheduling - Real time scheduling - Algorithm Evaluation. Case study: Process
scheduling in Linux. Process Synchronization: The critical-section problem -
Synchronization hardware - Semaphores - Classic problems of synchronization - Critical
regions - Monitors. Deadlock: System model - Deadlock characterization - Methods for
handling deadlocks - Deadlock prevention - Deadlock avoidance - Deadlock detection -
Recovery from deadlock.

UNIT III STORAGE MANAGEMENT 9
Memory Management: Background - Swapping - Contiguous memory allocation -
Paging - Segmentation - Segmentation with paging. Virtual Memory: Background -
Demand paging - Process creation - Page replacement - Allocation of frames -
Thrashing. Case Study: Memory management in Linux

UNIT IV FILE SYSTEMS 9
File-System Interface: File concept - Access methods - Directory structure - File-system
mounting - Protection. File-System Implementation: Directory implementation -
Allocation methods - Free-space management - Efficiency and performance - Recovery -
Log-structured file systems. Case studies: File system in Linux - File system in Windows
XP

UNIT V I/O SYSTEMS 8
I/O Systems - I/O Hardware - Application I/O interface - Kernel I/O subsystem - Streams -
Performance. Mass-Storage Structure: Disk scheduling - Disk management - Swap-
space management - RAID - Disk attachment - Stable storage - Tertiary storage. Case
study: I/O in Linux
Total: 45

TEXT BOOK:
1. Silberschatz, Galvin, and Gagne, Operating System Concepts, Sixth Edition,
Wiley India Pvt Ltd, 2003.


REFERENCES:
1. Andrew S. Tanenbaum, Modern Operating Systems, Second Edition, Pearson
Education, 2004.
2. Gary Nutt, Operating Systems, Third Edition, Pearson Education, 2004.
3. Harvey M. Deitel, Operating Systems, Third Edition, Pearson Education, 2004.

UNIT I

PROCESSES AND THREADS

1. INTRODUCTION TO OPERATING SYSTEM

1.1 What is an Operating System?
An operating system is a program that acts as an intermediary between a user of a
computer and the computer hardware.
Operating system goals:
Execute user programs and make solving user problems easier.
Make the computer system convenient to use.
Use the computer hardware in an efficient manner.

1.2 Computer System Components
1. Hardware - provides basic computing resources (CPU, memory, I/O devices).
2. Operating system - controls and coordinates the use of the hardware among the
various application programs for the various users.
3. Application programs - define the ways in which the system resources are used
to solve the computing problems of the users (compilers, database systems, video games,
business programs).
4. Users (people, machines, other computers).

Abstract View of System Components
1.3 Operating System Definitions
Resource allocator - manages and allocates resources.
Control program - controls the execution of user programs and operations of I/O
devices.
Kernel - the one program running at all times (all else being application programs).

2. REVIEW OF COMPUTER ORGANISATION
2.1 Introduction
In the previous lesson we discussed the evolution of computers. In this lesson we
will provide you with an overview of the basic design of a computer. You will learn how
different parts of a computer are organised and how various operations are performed
between different parts to do a specific task. As you know from the previous lesson, the
internal architecture of a computer may differ from system to system, but the basic
organisation remains the same for all computer systems.
2.2 Objectives
At the end of the lesson you will be able to:
- understand basic organisation of computer system
- understand the meaning of Arithmetic Logical Unit, Control Unit and Central
Processing Unit
- differentiate between a bit, a byte and a word
- define computer memory
- differentiate between primary memory and secondary memory
- differentiate between primary storage and secondary storage units
- differentiate between input devices and output devices

2.3 Basic Computer Operations
A computer as shown in Fig. 2.1 performs basically five major operations or functions
irrespective of its size and make. These are 1) it accepts data or instructions by way of
input, 2) it stores data, 3) it can process data as required by the user, 4) it gives results in
the form of output, and 5) it controls all operations inside a computer. We discuss below
each of these operations.

1. Input: This is the process of entering data and programs into the computer system.
You should know that a computer is an electronic machine like any other machine which
takes raw data as input and performs some processing to give out processed data.
Therefore, the input unit takes data from us to the computer in an organized manner for
processing.



Fig. 2.1 Basic computer Operations


2. Storage: The process of saving data and instructions permanently is known as storage.
Data has to be fed into the system before the actual processing starts, because the
processing speed of the Central Processing Unit (CPU) is so fast that data has to be
provided to the CPU at a matching speed. Therefore the data is first stored in the storage unit
for faster access and processing. This storage unit, or the primary storage of the computer
system is designed to do the above functionality. It provides space for storing data and
instructions.
The storage unit performs the following major functions:
- All data and instructions are stored here before and after processing.
- Intermediate results of processing are also stored here.
3. Processing: The task of performing operations like arithmetic and logical operations is
called processing. The Central Processing Unit (CPU) takes data and instructions from
the storage unit and makes all sorts of calculations based on the instructions given and the
type of data provided. The results are then sent back to the storage unit.

4. Output: This is the process of producing results from the data to obtain useful
information. The output produced by the computer after processing must be
kept somewhere inside the computer before being given to you in human-readable form,
and it may also be stored inside the computer for further processing.

5. Control: This refers to the manner in which instructions are executed and the above
operations are performed. Controlling of all operations like input, processing and output
is performed by the control unit. It takes care of step-by-step processing of all operations
inside the computer.

2.4 Functional Units: In order to carry out the operations mentioned in the previous
section the computer allocates the task between its various functional units. The computer
system is divided into three separate units for its operation. They are 1) arithmetic logical
unit, 2) control unit, and 3) central processing unit.



2.4.1 Arithmetic Logical Unit (ALU)
After you enter data through the input device it is stored in the primary storage unit. The
actual processing of the data and instructions is performed by the Arithmetic Logical Unit.
The major operations performed by the ALU are addition, subtraction, multiplication,
division, logic and comparison. Data is transferred to the ALU from the storage unit when
required. After processing, the output is returned to the storage unit for further
processing or storage.

2.4.2 Control Unit (CU)
The next component of the computer is the Control Unit, which acts like a supervisor
seeing that things are done in proper fashion. The control unit determines the sequence in
which computer programs and instructions are executed. Its tasks include processing of
programs stored in the main memory, interpretation of the instructions and issuing of
signals for other units of the computer to execute them. It also acts as a switchboard
operator when several users access the computer simultaneously, coordinating
the activities of the computer's peripheral equipment as they perform input and output.
It is therefore the manager of all operations mentioned in the previous section.

2.4.3 Central Processing Unit (CPU)
The ALU and the CU of a computer system are jointly known as the central processing
unit. You may call the CPU the brain of any computer system. Just like a brain, it
takes all major decisions, makes all sorts of calculations and directs the different parts of the
computer by activating and controlling their operations.

3. OPERATING-SYSTEM STRUCTURES

3.1 Common System Components
Process Management
Main Memory Management
File Management
I/O System Management
Secondary-Storage Management
Networking
Protection System
Command-Interpreter System

3.1.1 Process Management
A process is a program in execution. A process needs certain resources, including
CPU time, memory, files, and I/O devices, to accomplish its task.
The operating system is responsible for the following activities in connection with
process management.
Process creation and deletion.
Process suspension and resumption.
Provision of mechanisms for:
process synchronization
process communication

3.1.2 Main-Memory Management
Memory is a large array of words or bytes, each with its own address. It is a repository
of quickly accessible data shared by the CPU and I/O devices.
Main memory is a volatile storage device. It loses its contents in the case of system
failure.
The operating system is responsible for the following activities in connections with
memory management:
Keep track of which parts of memory are currently being used and by whom.
Decide which processes to load when memory space becomes available.
Allocate and deallocate memory space as needed.

3.1.3 File Management
A file is a collection of related information defined by its creator. Commonly, files
represent programs (both source and object forms) and data.
The operating system is responsible for the following activities in connections with file
management:
File creation and deletion.
Directory creation and deletion.
Support of primitives for manipulating files and directories.
Mapping files onto secondary storage.
File backup on stable (nonvolatile) storage media.

3.1.4 I/O System Management
The I/O system consists of:
A buffer-caching system
A general device-driver interface
Drivers for specific hardware devices

3.1.5 Secondary-Storage Management
Since main memory (primary storage) is volatile and too small to accommodate all
data and programs permanently, the computer system must provide secondary storage
to back up main memory.
Most modern computer systems use disks as the principal on-line storage medium, for
both programs and data.
The operating system is responsible for the following activities in connection with disk
management:
Free space management
Storage allocation
Disk scheduling

3.1.6 Networking (Distributed Systems)
A distributed system is a collection of processors that do not share memory or a clock.
Each processor has its own local memory.
The processors in the system are connected through a communication network.
Communication takes place using a protocol.
A distributed system provides user access to various system resources.
Access to a shared resource allows:
Computation speed-up
Increased data availability
Enhanced reliability

3.1.7 Protection System
Protection refers to a mechanism for controlling access by programs, processes, or
users to both system and user resources.
The protection mechanism must:
distinguish between authorized and unauthorized usage.
specify the controls to be imposed.
provide a means of enforcement.

3.1.8 Command-Interpreter System
Many commands are given to the operating system by control statements which deal
with:
process creation and management
I/O handling
secondary-storage management
main-memory management
file-system access
protection
networking
The program that reads and interprets control statements is called variously:
command-line interpreter
shell (in UNIX)
Its function is to get and execute the next command statement.

4. SYSTEM CALLS

4.1 Introduction
System calls provide the interface between a running program and the operating
system.
Generally available as assembly-language instructions.
Languages defined to replace assembly language for systems programming allow
system calls to be made directly (e.g., C, C++)
Three general methods are used to pass parameters between a running program and the
operating system (a short sketch follows this list):
Pass parameters in registers.
Store the parameters in a table in memory, and pass the table address as a
parameter in a register.
Push (store) the parameters onto the stack by the program, and pop them off the stack
in the operating system.

Passing of Parameters As A Table

4.2 Types of System Calls
Process control
File management
Device management
Information maintenance
Communications

MS-DOS Execution
MS-DOS Execution: At System Startup At Execution


UNIX Running Multiple Programs
Communication Models
Communication may take place using either message passing or shared memory.
Message Passing Shared Memory


5. SYSTEM PROGRAMS

System programs provide a convenient environment for program development and
execution. They can be divided into:
File manipulation
Status information
File modification
Programming language support
Program loading and execution
Communications
Application programs
Most users' view of the operating system is defined by system programs, not the actual
system calls.

6. SYSTEM STRUCTURE

Simple structure
This has generally resulted from large systems being built up from small monitor
programs for single-user systems. MS-DOS grew like topsy from 4,000 assembler source
lines in 1981 to over 40,000 lines eventually. The lack of initial structure led to a big
mess. There is some layering, but the user was allowed to bypass it because the old 8086
had no hardware protection mechanism to enforce the layering. Tanenbaum (Modern OS)
identifies the layers as BIOS, kernel and shell. Windows 3 was effectively another layer
trying to hide.

UNIX: Simple multi-user structure
UNIX has also grown, but was written in C almost from the start and has always
supported multiple users and protected processes. The system call interface is well
defined and the kernel below that interface is protected. So UNIX could be regarded as two
layers, above and below the system call interface. (SGG Fig. 2.11) The kernel is relatively
large and informally structured, with no universally defined internal interfaces. So the
device drivers can be regarded as a separate layer within the kernel and there is informal
separation between the file system and process management. Some reasons for the growth
of the kernel are provision of paged memory management, network communication
software, and the virtual file system (which actually runs as kernel code).

The user interface to shells, editors and libraries forms an unprotected layer above
the system call interface. This gives flexibility but potential risk to the naive user. In older
proprietary commercial operating systems much of this unprotected layer would be
protected and regarded as part of the operating system.
Note that in UNIX a system call switches context to kernel mode but logically the
same process is being executed. Also, only one user-originated process can be active (as
opposed to sleeping) in kernel mode in conventional UNIX implementations, alongside a
very few special kernel-only processes such as the swapper.

Linux is a good modern example of conventional UNIX structure (originally for a
single processor but revised for multiprocessor support in version 2.2). Although the
kernel appears very large and monolithic, it comprises clearly identifiable modules.
Device drivers, the bulk of the source code, are independent loadable kernel modules with
well defined interfaces. In old UNIX, introducing a new device driver meant recompiling
and relinking the whole kernel, but now a device driver is loaded dynamically as
required. Networking (supporting sockets, TCP/IP, UDP, SLIP and PPP) is 24,000 lines.
The Virtual File System provides a well defined file store interface to the 13 different
possible underlying file systems, which are also each loadable kernel modules. Note that
such modules need kernel-mode privilege so they have to be trusted. Protection at a language
level is not an absolute guarantee that a rogue device driver will not bring down the
operating system. Incompatibilities arose when different versions of UNIX developed
variants of the system call interface. The POSIX (Portable Operating System Interface)
standards sponsored by the IEEE have brought standardisation to this interface, and most
systems aim for POSIX compliance. This makes it much easier to switch both between
different variants of UNIX and to other systems offering POSIX-compliant interfaces.
However the POSIX 1003.1 definition is somewhat minimal and does not cover all the
more elaborate system calls of, say, System V. The POSIX standard also includes
provision of a conformant C compiler.

Layered or Onion Structure
The first serious attempts at structuring, now looking rather dated. Strict layering
hides inner layers, encapsulating them behind defined interfaces. (SGG Fig. 2.12)
Proposed by Dijkstra in 1968 in THE operating system. Note careful ordering of layers
(and possible need for compromises). Can be associated with hardware levels of
protection (MULTICS, ICL VME). ICL 2900/3900 has 16 access levels built into the
memory management system hardware.
These are used by the VME operating system software as follows:

9-15 User levels
8 System Control Language (shell)
7 Record Management
6 User code error management
5 Upper Director
4 Lower Director (Block data management, disc, tape, comms)
3 Director error management
2 Kernel (Process, virtual store, comms and processor management)
1 Kernel error management
0 Unused

Moving to a more privileged access level is a system call (Trap instruction), for which the
parameters are validated.
Advantages of strict layering: Each layer is coded and tested independently, depending
only on lower layers. It forms the basis for hardware-supported layers
of protection and privilege (see later lectures).
Disadvantages: Cost of going through several layers. Difficulty of avoiding circular
dependencies between layers. Cost of getting information from another layer. Other
hardware manufacturers have not provided hardware support.

Virtual Machines
Each layer of software presents a virtual machine (VM) to outer layers. This VM
may hide the original machine or just present more facilities. The operating system may
provide multiple virtual machines, each separated and protected from the others. A
(heavyweight) user process may run in such a virtual machine. The memory management
system will provide a separate protected address space for each virtual machine. Thus
ICL's VME provides a VM with a full protected address space for each user process. 386
and 486 PCs have a hardware mode in which the machine emulates multiple 8086 virtual
machines. The extreme is IBM's VM system running on 370 and subsequent mainframes.
In this the VM provided is a logically identical copy of the complete machine, including
peripheral devices. It is so complete you can run VM under VM (to test a new version).
Each CMS (say) user has a private VM running CMS. So in VM/CMS, VM is a multiuser
operating system with no user facilities, which makes one machine appear to be many
machines, and CMS a single-user operating system for a 370 machine.

This extreme requires some clever tricks in privileged instruction and virtual peripheral
handling. System calls inside a virtual machine enter the VM system which changes the
virtual machine to virtual monitor mode and trampolines the call back. All code inside
the VM actually runs in physical user mode. Don't bother about the details; SGG gives
them if you are interested.

7. PROCESSES

7.1 PROCESS CONCEPT
7.1.1 Introduction
An operating system executes a variety of programs:
Batch system - jobs
Time-shared systems - user programs or tasks
The textbook uses the terms job and process almost interchangeably.
Process - a program in execution; process execution must progress in sequential
fashion.
A process includes:
program counter
stack
data section


7.1.2 Process State
Diagram of Process State

As a process executes, it changes state
new: The process is being created.
running: Instructions are being executed.
waiting: The process is waiting for some event to occur.
ready: The process is waiting to be assigned to a processor.
terminated: The process has finished execution.

7.1.3 Process Control Block (PCB)
Information associated with each process (a rough sketch of such a record follows this list).
Process state
Program counter
CPU registers
CPU scheduling information
Memory-management information
Accounting information
I/O status information
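
The PCB can be pictured as a record holding the fields listed above. The structure below is purely an illustrative sketch (the field names and sizes are invented for this example); a real kernel's structure, such as Linux's task_struct, is far larger.

/* Illustrative only: a toy PCB mirroring the fields listed above.
 * Names and sizes are invented; real kernels carry many more fields. */
enum proc_state { NEW, READY, RUNNING, WAITING, TERMINATED };

struct pcb {
    int             pid;              /* process identifier            */
    enum proc_state state;            /* process state                 */
    unsigned long   program_counter;  /* saved program counter         */
    unsigned long   registers[16];    /* saved CPU registers           */
    int             priority;         /* CPU-scheduling information    */
    unsigned long   page_table_base;  /* memory-management information */
    unsigned long   cpu_time_used;    /* accounting information        */
    int             open_files[16];   /* I/O status information        */
};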

Process Control Block (PCB)

CPU Switch From Process to Process

7.2 PROCESS SCHEDULING
7.2.1. Types Of Queue
Job queue - set of all processes in the system.
Ready queue - set of all processes residing in main memory, ready and waiting to
execute.
Device queues - set of processes waiting for an I/O device.
Process migration between the various queues.

Ready Queue And Various I/O Device Queues
Representation of Process Scheduling

7.2.2 Schedulers
Long-term scheduler (or job scheduler) - selects which processes should be brought
into the ready queue.
Short-term scheduler (or CPU scheduler) - selects which process should be
executed next and allocates the CPU.
Addition of Medium Term Scheduling
Short-term scheduler is invoked very frequently (milliseconds) (must be fast).
Long-term scheduler is invoked very infrequently (seconds, minutes) (may be
slow).
The long-term scheduler controls the degree of multiprogramming.
Processes can be described as either:
I/O-bound process - spends more time doing I/O than computations; many short
CPU bursts.
CPU-bound process - spends more time doing computations; few very long CPU
bursts.

7.2.3 Context Switch
When CPU switches to another process, the system must save the state of the old
process and load the saved state for the new process.
Context-switch time is overhead; the system does no useful work while switching.
Time dependent on hardware support.

7.3. OPERATIONS ON PROCESSES
7.3.1 Process Creation
A parent process creates children processes, which, in turn, create other processes, forming
a tree of processes.
Resource sharing
Parent and children share all resources.
Children share a subset of the parent's resources.
Parent and child share no resources.
Execution
Parent and children execute concurrently.
Parent waits until children terminate.
Address space
Child is a duplicate of the parent.
Child has a program loaded into it.
UNIX examples (a short sketch follows this list)
fork system call creates a new process
exec system call used after a fork to replace the process memory space with a new
program.
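
A minimal sketch of the fork/exec pattern is given below. It uses only standard POSIX calls; the program being exec'ed (/bin/ls) is chosen purely as an example.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();               /* create a child: duplicate of the parent */

    if (pid < 0) {                    /* fork failed */
        perror("fork");
        exit(1);
    } else if (pid == 0) {            /* child: replace its memory image */
        execlp("/bin/ls", "ls", (char *)NULL);
        perror("execlp");             /* reached only if exec fails */
        exit(1);
    } else {                          /* parent: wait for the child to terminate */
        wait(NULL);
        printf("child %d finished\n", (int)pid);
    }
    return 0;
}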

7.3.2 Process Termination
Process executes last statement and asks the operating system to delete it (exit).
Output data from child to parent (via wait).
Process resources are deallocated by operating system.
Parent may terminate execution of children processes (abort).
Child has exceeded allocated resources.
Task assigned to child is no longer required.
Parent is exiting.
Operating system does not allow child to continue if its parent terminates.
Cascading termination.

7.4 COOPERATING PROCESSES
7.4.1 Definition
Independent process - cannot affect or be affected by the execution of another process.
Cooperating process - can affect or be affected by the execution of another process.
Advantages of process cooperation
Information sharing
Computation speed-up
Modularity
Convenience

7.4.2 Producer-Consumer Problem
Paradigm for cooperating processes: a producer process produces information that is
consumed by a consumer process.
unbounded-buffer - places no practical limit on the size of the buffer.
bounded-buffer - assumes that there is a fixed buffer size.

Bounded-Buffer Shared-Memory Solution
Shared data
#define BUFFER_SIZE 10
typedef struct {
    . . .
} item;
item buffer[BUFFER_SIZE];
int in = 0;
int out = 0;

The solution is correct, but can only use BUFFER_SIZE-1 elements.

Bounded-Buffer Producer Process

item nextProduced;

while (1) {
    while (((in + 1) % BUFFER_SIZE) == out)
        ; /* do nothing: buffer is full */
    buffer[in] = nextProduced;
    in = (in + 1) % BUFFER_SIZE;
}

Bounded-Buffer Consumer Process

item nextConsumed;

while (1) {
    while (in == out)
        ; /* do nothing: buffer is empty */
    nextConsumed = buffer[out];
    out = (out + 1) % BUFFER_SIZE;
}

8. INTERPROCESS COMMUNICATION (IPC)

8.1 Definition
Mechanism for processes to communicate and to synchronize their actions.

8.2 Message Passing System
Message system - processes communicate with each other without resorting to shared
variables.
IPC facility provides two operations:
send(message) - message size fixed or variable
receive(message)
If P and Q wish to communicate, they need to:
establish a communication link between them
exchange messages via send/receive
Implementation of communication link
physical (e.g., shared memory, hardware bus)
logical (e.g., logical properties)

8.3 Naming
8.3.1 Direct Communication
Processes must name each other explicitly:
send(P, message) - send a message to process P
receive(Q, message) - receive a message from process Q
Properties of communication link
Links are established automatically.
A link is associated with exactly one pair of communicating processes.
Between each pair there exists exactly one link.
The link may be unidirectional, but is usually bi-directional.

8.3.2 Indirect Communication
Messages are sent to and received from mailboxes (also referred to as ports).
Each mailbox has a unique id.
Processes can communicate only if they share a mailbox.
Properties of communication link
Link established only if processes share a common mailbox
A link may be associated with many processes.
Each pair of processes may share several communication links.
Link may be unidirectional or bi-directional.
Operations
create a new mailbox
send and receive messages through mailbox
destroy a mailbox
Primitives are defined as:
send(A, message) - send a message to mailbox A
receive(A, message) - receive a message from mailbox A
Mailbox sharing
P1, P2, and P3 share mailbox A.
P1 sends; P2 and P3 receive.
Who gets the message?

Solutions
Allow a link to be associated with at most two processes.
Allow only one process at a time to execute a receive operation.
Allow the system to select arbitrarily the receiver. Sender is notified who the
receiver was.

8.3.3 Synchronization
Message passing may be either blocking or non-blocking.
Blocking is considered synchronous
Non-blocking is considered asynchronous
send and receive primitives may be either blocking or non-blocking.

8.3.4 Buffering
Queue of messages attached to the link; implemented in one of three ways.
1. Zero capacity - 0 messages
Sender must wait for receiver (rendezvous).
2. Bounded capacity - finite length of n messages
Sender must wait if link is full.
3. Unbounded capacity - infinite length
Sender never waits.


9. CLIENT-SERVER COMMUNICATION

Sockets
Remote Procedure Calls
Remote Method Invocation (Java)

Sockets

A socket is defined as an endpoint for communication.
Concatenation of IP address and port
The socket 161.25.19.8:1625 refers to port 1625 on host 161.25.19.8
Communication takes place between a pair of sockets.
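
As a brief illustration, here is a minimal TCP client sketch using the BSD socket API. The address 161.25.19.8 and port 1625 are reused from the example above purely for illustration; a real client would use the address of an actual server.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    /* create one endpoint (a socket) for a TCP connection */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port   = htons(1625);              /* port from the example */
    inet_pton(AF_INET, "161.25.19.8", &server.sin_addr);

    /* connecting pairs this socket with one on the server host */
    if (connect(fd, (struct sockaddr *)&server, sizeof(server)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    const char msg[] = "hello\n";
    write(fd, msg, sizeof(msg) - 1);              /* data flows between the pair */
    close(fd);
    return 0;
}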

Socket Communication
Remote Procedure Calls

Remote procedure call (RPC) abstracts procedure calls between processes on
networked systems.
Stubs client-side proxy for the actual procedure on the server.
The client-side stub locates the server and marshals the parameters.
The server-side stub receives this message, unpacks the marshalled parameters, and
performs the procedure on the server.

Execution of RPC





Remote Method Invocation

Remote Method Invocation (RMI) is a Java mechanism similar to RPCs. RMI allows a
Java program on one machine to invoke a method on a remote object.



Marshalling Parameters



10. CASE STUDY: IPC IN LINUX.
The types of interprocess communication are:
1. Signals - Sent by other processes or the kernel to a specific process to indicate
various conditions.
2. Pipes - Unnamed pipes set up by the shell normally with the "|" character to route
output from one program to the input of another.
3. FIFOS - Named pipes operating on the basis of first data in, first data out.
4. Message queues - Message queues are a mechanism set up to allow one or more
processes to write messages that can be read by one or more other processes.
5. Semaphores - Counters that are used to control access to shared resources. These
counters are used as a locking mechanism to prevent more than one process from
using the resource at a time.
6. Shared memory - The mapping of a memory area to be shared by multiple
processes.
Message queues, semaphores, and shared memory can be accessed by the processes if
they have access permission to the resource as set up by the object's creator. The process
must pass an identifier to the kernel to be able to get the access.


10.1 Signals
Signals are one of the oldest inter-process communication methods used by Unix
systems. They are used to signal asynchronous events to one or more processes. A
signal could be generated by a keyboard interrupt or an error condition such as the
process attempting to access a non-existent location in its virtual memory. Signals are
also used by the shells to signal job control commands to their child processes.
There are a set of defined signals that the kernel can generate or that can be generated by
other processes in the system, provided that they have the correct privileges. You can list
a system's set of signals using the kill command (kill -l), on my Intel Linux box this
gives:

1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL
5) SIGTRAP 6) SIGIOT 7) SIGBUS 8) SIGFPE
9) SIGKILL 10) SIGUSR1 11) SIGSEGV 12) SIGUSR2
13) SIGPIPE 14) SIGALRM 15) SIGTERM 17) SIGCHLD
18) SIGCONT 19) SIGSTOP 20) SIGTSTP 21) SIGTTIN
22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO
30) SIGPWR

The numbers are different for an Alpha AXP Linux box. Processes can choose to ignore
most of the signals that are generated, with two notable exceptions: neither the SIGSTOP
signal which causes a process to halt its execution nor the SIGKILL signal which causes a
process to exit can be ignored. Otherwise though, a process can choose just how it wants
to handle the various signals. Processes can block the signals and, if they do not block
them, they can either choose to handle them themselves or allow the kernel to handle
them. If the kernel handles the signals, it will do the default actions required for this
signal. For example, the default action when a process receives the SIGFPE (floating
point exception) signal is to core dump and then exit. Signals have no inherent relative
priorities. If two signals are generated for a process at the same time then they may be
presented to the process or handled in any order. Also there is no mechanism for handling
multiple signals of the same kind. There is no way that a process can tell if it received 1
or 42 SIGCONT signals.

Linux implements signals using information stored in the task_struct for the process. The
number of supported signals is limited to the word size of the processor. Processes with a
word size of 32 bits can have 32 signals whereas 64 bit processors like the Alpha
AXP may have up to 64 signals. The currently pending signals are kept in the signal field
with a mask of blocked signals held in blocked. With the exception of SIGSTOP and
SIGKILL, all signals can be blocked. If a blocked signal is generated, it remains pending
until it is unblocked. Linux also holds information about how each process handles every
possible signal and this is held in an array of sigaction data structures pointed at by the
task_struct for each process. Amongst other things it contains either the address of a
routine that will handle the signal or a flag which tells Linux that the process either
wishes to ignore this signal or let the kernel handle the signal for it. The process modifies
the default signal handling by making system calls and these calls alter the sigaction for
the appropriate signal as well as the blocked mask.

Not every process in the system can send signals to every other process; the kernel and
superusers can. Normal processes can only send signals to processes with the same
uid and gid or to processes in the same process group. Signals are generated by setting
the appropriate bit in the task_struct's signal field. If the process has not blocked the
signal and is waiting but interruptible (in state Interruptible) then it is woken up by
changing its state to Running and making sure that it is in the run queue. That way the
scheduler will consider it a candidate for running when the system next schedules. If the
default handling is needed, then Linux can optimize the handling of the signal. For
example, if the signal is SIGWINCH (the X window changed focus) and the default handler
is being used, then there is nothing to be done.

Signals are not presented to the process immediately after they are generated; they must wait
until the process is running again. Every time a process exits from a system call its signal
and blocked fields are checked and, if there are any unblocked signals, they can now be
delivered. This might seem a very unreliable method but every process in the system is
making system calls, for example to write a character to the terminal, all of the time.
Processes can elect to wait for signals if they wish; they are suspended in state
Interruptible until a signal is presented. The Linux signal processing code looks at the
sigaction structure for each of the current unblocked signals.

If a signal's handler is set to the default action then the kernel will handle it. The
SIGSTOP signal's default handler will change the current process's state to Stopped and
then run the scheduler to select a new process to run. The default action for the SIGFPE
signal will core dump the process and then cause it to exit. Alternatively, the process may
have specified its own signal handler. This is a routine which will be called whenever the
signal is generated and the sigaction structure holds the address of this routine. The kernel
must call the process's signal handling routine and how this happens is processor specific
but all CPUs must cope with the fact that the current process is running in kernel mode
and is just about to return to the process that called the kernel or system routine in user
mode. The problem is solved by manipulating the stack and registers of the process. The
process's program counter is set to the address of its signal handling routine and the
parameters to the routine are added to the call frame or passed in registers. When the
process resumes operation it appears as if the signal handling routine were called
normally.

Linux is POSIX compatible and so the process can specify which signals are blocked
when a particular signal handling routine is called. This means changing the blocked
mask during the call to the process's signal handler. The blocked mask must be returned
to its original value when the signal handling routine has finished. Therefore Linux adds a
call to a tidy up routine which will restore the original blocked mask onto the call stack of
the signalled process. Linux also optimizes the case where several signal handling
routines need to be called by stacking them so that each time one handling routine exits,
the next one is called until the tidy up routine is called.
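
The behaviour described above can be exercised from user space through the standard POSIX sigaction() interface. The sketch below is illustrative only (it is not taken from the kernel source): it installs a handler for SIGINT, blocks SIGTERM while the handler runs, and then waits for signals.

#include <signal.h>
#include <unistd.h>

/* handler installed instead of the default action for SIGINT */
static void on_sigint(int signo)
{
    (void)signo;
    write(STDOUT_FILENO, "caught SIGINT\n", 14);   /* async-signal-safe */
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = on_sigint;        /* address of the handling routine      */
    sigemptyset(&sa.sa_mask);         /* signals blocked while it runs ...    */
    sigaddset(&sa.sa_mask, SIGTERM);  /* ... here SIGTERM, as an example      */
    sa.sa_flags = 0;

    sigaction(SIGINT, &sa, NULL);     /* alters the sigaction for this signal */

    for (;;)
        pause();                      /* sleep until a signal is delivered    */
}
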
10.2 Pipes
The common Linux shells all allow redirection. For example

$ ls | pr | lpr

pipes the output from the ls command listing the directory's files into the standard input of
the pr command which paginates them. Finally the standard output from the pr command
is piped into the standard input of the lpr command which prints the results on the default
printer. Pipes then are unidirectional byte streams which connect the standard output from
one process into the standard input of another process. Neither process is aware of this
redirection and behaves just as it would normally. It is the shell which sets up these
temporary pipes between the processes.
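
The same plumbing that the shell sets up for ls | pr | lpr can be created directly with the pipe() and fork() system calls. The sketch below is a minimal illustration of a one-way byte stream between a parent (the writer) and its child (the reader); it is not how the shell itself is implemented.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];                        /* fd[0] = read end, fd[1] = write end */
    char buf[64];

    if (pipe(fd) < 0) { perror("pipe"); return 1; }

    if (fork() == 0) {                /* child: the reader */
        close(fd[1]);                 /* not writing */
        ssize_t n = read(fd[0], buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("child read: %s", buf);
        }
        close(fd[0]);
        return 0;
    }

    close(fd[0]);                     /* parent: the writer */
    const char msg[] = "bytes flow one way through the pipe\n";
    write(fd[1], msg, sizeof(msg) - 1);
    close(fd[1]);                     /* closing gives the reader end-of-file */
    wait(NULL);
    return 0;
}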

Pipes

In Linux, a pipe is implemented using two file data structures which both point at the
same temporary VFS inode which itself points at a physical page within memory. Figure
5.1 shows that each file data structure contains pointers to different file operation routine
vectors; one for writing to the pipe, the other for reading from the pipe.
This hides the underlying differences from the generic system calls which read and write
to ordinary files. As the writing process writes to the pipe, bytes are copied into the
shared data page and when the reading process reads from the pipe, bytes are copied from
the shared data page. Linux must synchronize access to the pipe. It must make sure that
the reader and the writer of the pipe are in step and to do this it uses locks, wait queues
and signals.

When the writer wants to write to the pipe it uses the standard write library functions.
These all pass file descriptors that are indices into the process's set of file data structures,
each one representing an open file or, as in this case, an open pipe. The Linux system call
uses the write routine pointed at by the file data structure describing this pipe. That write
routine uses information held in the VFS inode representing the pipe to manage the write
request.

If there is enough room to write all of the bytes into the pipe and, so long as the pipe is
not locked by its reader, Linux locks it for the writer and copies the bytes to be written
from the process's address space into the shared data page. If the pipe is locked by the
reader or if there is not enough room for the data then the current process is made to sleep
on the pipe inode's wait queue and the scheduler is called so that another process can run.
It is interruptible, so it can receive signals and it will be woken by the reader when there
is enough room for the write data or when the pipe is unlocked. When the data has been
written, the pipe's VFS inode is unlocked and any waiting readers sleeping on the inode's
wait queue will themselves be woken up.

Reading data from the pipe is a very similar process to writing to it. Processes are
allowed to do non-blocking reads (it depends on the mode in which they opened the file
or pipe) and, in this case, if there is no data to be read or if the pipe is locked, an error will
be returned. This means that the process can continue to run. The alternative is to wait on
the pipe inode's wait queue until the write process has finished. When both processes
have finished with the pipe, the pipe inode is discarded along with the shared data page.

Linux also supports named pipes, also known as FIFOs because pipes operate on a First
In, First Out principle. The first data written into the pipe is the first data read from the
pipe. Unlike pipes, FIFOs are not temporary objects, they are entities in the file system
and can be created using the mkfifo command. Processes are free to use a FIFO so long as
they have appropriate access rights to it. The way that FIFOs are opened is a little
different from pipes. A pipe (its two file data structures, its VFS inode and the shared data
page) is created in one go whereas a FIFO already exists and is opened and closed by its
users. Linux must handle readers opening the FIFO before writers open it as well as
readers reading before any writers have written to it. That aside, FIFOs are handled
almost exactly the same way as pipes and they use the same data structures and
operations.
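
A FIFO can also be created programmatically rather than with the mkfifo command. The writer-side sketch below is illustrative only; the path /tmp/myfifo is an arbitrary example, and a separate reader process would open the same path for reading.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/myfifo";         /* example path only */

    mkfifo(path, 0666);                       /* create the FIFO in the file system */

    /* open() blocks until another process opens the FIFO for reading */
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    const char msg[] = "first in, first out\n";
    write(fd, msg, sizeof(msg) - 1);
    close(fd);

    unlink(path);                             /* a FIFO persists until it is removed */
    return 0;
}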

Linux supports three types of interprocess communication mechanisms that first appeared
in Unix System V (1983). These are message queues, semaphores and shared memory.
These System V IPC mechanisms all share common authentication methods. Processes
may access these resources only by passing a unique reference identifier to the kernel via
system calls. Access to these System V IPC objects is checked using access permissions,
much like accesses to files are checked. The access rights to the System V IPC object are
set by the creator of the object via system calls. The object's reference identifier is used
by each mechanism as an index into a table of resources. It is not a straightforward index
but requires some manipulation to generate it.

All Linux data structures representing System V IPC objects in the system include an
ipc_perm structure which contains the owner and creator process's user and group
identifiers, the access mode for this object (owner, group and other) and the IPC object's
key. The key is used as a way of locating the System V IPC object's reference identifier.
Two sets of keys are supported: public and private. If the key is public then any process in
the system, subject to rights checking, can find the reference identifier for the System V
IPC object. System V IPC objects can never be referenced with a key, only by their
reference identifier.
System V IPC Message Queues

Each message queue is described by a msqid_ds data structure, which
contains an ipc_perm data structure and pointers to the messages entered
onto this queue. In addition, Linux keeps queue modification times such as the last time
that this queue was written to and so on. The msqid_ds also contains two wait queues; one
for the writers to the queue and one for the readers of the message queue.
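
The calls involved are msgget(), msgsnd(), msgrcv() and msgctl(). The following is a minimal single-process sketch of the API (IPC_PRIVATE is used so the key never has to be shared; a real application would normally use a public key so that separate processes can find the queue).

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* user-defined message layout: a type followed by the payload */
struct message {
    long mtype;
    char mtext[64];
};

int main(void)
{
    /* create a queue; the returned identifier indexes the kernel's table */
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid < 0) { perror("msgget"); return 1; }

    struct message out = { 1, "hello queue" };
    msgsnd(qid, &out, strlen(out.mtext) + 1, 0);   /* writer side */

    struct message in;
    msgrcv(qid, &in, sizeof(in.mtext), 1, 0);      /* reader side, type 1 */
    printf("received: %s\n", in.mtext);

    msgctl(qid, IPC_RMID, NULL);                   /* remove the queue */
    return 0;
}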

In its simplest form a semaphore is a location in memory whose value can be tested and
set by more than one process. The test and set operation is, so far as each process is
concerned, uninterruptible or atomic; once started nothing can stop it. The result of the
test and set operation is the addition of the current value of the semaphore and the set
value, which can be positive or negative. Depending on the result of the test and set
operation one process may have to sleep until the semaphore's value is changed by another
process. Semaphores can be used to implement critical regions, areas of critical code that
only one process at a time should be executing.

Say you had many cooperating processes reading records from and writing records to a
single data file. You would want that file access to be strictly coordinated. You could use
a semaphore with an initial value of 1 and, around the file operating code, put two
semaphore operations, the first to test and decrement the semaphore's value and the
second to test and increment it. The first process to access the file would try to decrement
the semaphore's value and it would succeed, the semaphore's value now being 0. This
process can now go ahead and use the data file but if another process wishing to use it
now tries to decrement the semaphore's value it would fail as the result would be -1. That
process will be suspended until the first process has finished with the data file. When the
first process has finished with the data file it will increment the semaphore's value,
making it 1 again. Now the waiting process can be woken and this time its attempt to
decrement the semaphore will succeed.
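
The file-coordination scheme just described maps directly onto the System V semaphore calls. A hedged sketch with a single semaphore initialised to 1 follows (a single process, to keep it short; in practice the two semop() calls would bracket the file accesses of each cooperating process).

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* on Linux/glibc the caller must define this union for semctl() */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void)
{
    /* one private semaphore, initial value 1 (a binary semaphore) */
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0) { perror("semget"); return 1; }

    union semun arg;
    arg.val = 1;
    semctl(semid, 0, SETVAL, arg);

    struct sembuf down = { 0, -1, 0 };   /* test and decrement            */
    struct sembuf up   = { 0, +1, 0 };   /* increment, waking any sleeper */

    semop(semid, &down, 1);              /* enter the critical region     */
    /* ... operate on the shared data file here ... */
    semop(semid, &up, 1);                /* leave the critical region     */

    semctl(semid, 0, IPC_RMID);          /* remove the semaphore set      */
    return 0;
}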

System V IPC Semaphores
System V IPC semaphore objects each describe a semaphore array and Linux uses the
semid_ds data structure to represent this. All of the semid_ds data structures in the system
are pointed at by the semary, a vector of pointers. There are sem_nsems in each semaphore
array, each one described by a sem data structure pointed at by sem_base. All of the
processes that are allowed to manipulate the semaphore array of a System V IPC
semaphore object may make system calls that perform operations on them. The system
call can specify many operations and each operation is described by three inputs; the
semaphore index, the operation value and a set of flags. The semaphore index is an index
into the semaphore array and the operation value is a numerical value that will be added
to the current value of the semaphore. First Linux tests whether or not all of the
operations would succeed. An operation will succeed if the operation value added to the
semaphore's current value would be greater than zero or if both the operation value and
the semaphore's current value are zero. If any of the semaphore operations would fail
Linux may suspend the process but only if the operation flags have not requested that the
system call is non-blocking. If the process is to be suspended then Linux must save the
state of the semaphore operations to be performed and put the current process onto a wait
queue. It does this by building a sem_queue data structure on the stack and filling it out.
The new sem_queue data structure is put at the end of this semaphore object's wait queue
(using the sem_pending and sem_pending_last pointers). The current process is put on the wait
queue in the sem_queue data structure (sleeper) and the scheduler called to choose another
process to run.

If all of the semaphore operations would have succeeded and the current process does not
need to be suspended, Linux goes ahead and applies the operations to the appropriate
members of the semaphore array. Now Linux must check that any waiting, suspended,
processes may now apply their semaphore operations. It looks at each member of the
operations pending queue (sem_pending) in turn, testing to see if the semaphore operations
will succeed this time. If they will then it removes the sem_queue data structure from the
operations pending list and applies the semaphore operations to the semaphore array. It
wakes up the sleeping process making it available to be restarted the next time the
scheduler runs. Linux keeps looking through the pending list from the start until there is a
pass where no semaphore operations can be applied and so no more processes can be
woken.

There is a problem with semaphores: deadlocks. These occur when one process has
altered the semaphore's value as it enters a critical region but then fails to leave the critical
region because it crashed or was killed. Linux protects against this by maintaining lists of
adjustments to the semaphore arrays. The idea is that when these adjustments are applied,
the semaphores will be put back to the state that they were in before the process's set of
semaphore operations was applied. These adjustments are kept in sem_undo data
structures queued both on the semid_ds data structure and on the task_struct data structure
for the processes using these semaphore arrays.

Each individual semaphore operation may request that an adjustment be maintained.
Linux will maintain at most one sem_undo data structure per process for each semaphore
array. If the requesting process does not have one, then one is created when it is needed.
The new sem_undo data structure is queued both onto this process's task_struct data structure
and onto the semaphore array's semid_ds data structure. As operations are applied to the
semaphores in the semaphore array, the negation of the operation value is added to this
semaphore's entry in the adjustment array of this process's sem_undo data structure. So, if
the operation value is 2, then -2 is added to the adjustment entry for this semaphore.
When processes are deleted, as they exit Linux works through their set of sem_undo data
structures applying the adjustments to the semaphore arrays. If a semaphore set is deleted,
the sem_undo data structures are left queued on the process's task_struct but the semaphore
array identifier is made invalid. In this case the semaphore clean up code simply discards
the sem_undo data structure.
10.3 Shared Memory
Shared memory allows one or more processes to communicate via memory that appears
in all of their virtual address spaces. The pages of the virtual memory are referenced by
page table entries in each of the sharing processes' page tables. It does not have to be at
the same address in all of the processes' virtual memory. As with all System V IPC
objects, access to shared memory areas is controlled via keys and access rights checking.
Once the memory is being shared, there are no checks on how the processes are using it.
They must rely on other mechanisms, for example System V semaphores, to synchronize
access to the memory.

System V IPC Shared Memory
Each newly created shared memory area is represented by a shmid_ds data structure. These
are kept in the shm_segs vector. The shmid_ds data structure describes how big the area of
shared memory is, how many processes are using it and information about how that
shared memory is mapped into their address spaces. It is the creator of the shared memory
that controls the access permissions to that memory and whether its key is public or
private. If it has enough access rights it may also lock the shared memory into physical
memory. Each process that wishes to share the memory must attach to that virtual
memory via a system call. This creates a new vm_area_struct data structure describing the
shared memory for this process. The process can choose where in its virtual address space
the shared memory goes or it can let Linux choose a free area large enough. The new
vm_area_struct structure is put into the list of vm_area_struct pointed at by the shmid_ds. The
vm_next_shared and vm_prev_shared pointers are used to link them together. The virtual
memory is not actually created during the attach; it happens when the first process
attempts to access it.

The first time that a process accesses one of the pages of the shared virtual memory, a
page fault will occur. When Linux fixes up that page fault it finds the vm_area_struct data
structure describing it. This contains pointers to handler routines for this type of shared
virtual memory. The shared memory page fault handling code looks in the list of page
table entries for this shmid_ds to see if one exists for this page of the shared virtual
memory. If it does not exist, it will allocate a physical page and create a page table entry
for it. As well as going into the current process's page tables, this entry is saved in the
shmid_ds. This means that when the next process that attempts to access this memory gets
a page fault, the shared memory fault handling code will use this newly created physical
page for that process too. So, the first process that accesses a page of the shared memory
causes it to be created and thereafter access by the other processes cause that page to be
added into their virtual address spaces.

When processes no longer wish to share the virtual memory, they detach from it. So long
as other processes are still using the memory the detach only affects the current process.
Its vm_area_struct is removed from the shmid_ds data structure and deallocated. The current
process's page tables are updated to invalidate the area of virtual memory that it used to
share. When the last process sharing the memory detaches from it, the pages of the shared
memory currently in physical memory are freed, as is the shmid_ds data structure for this
shared memory.
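
A hedged sketch of the shmget()/shmat() sequence is given below. A single process attaches its own segment purely to show the API; normally a second process would attach the same segment using a shared key, and semaphores would guard the concurrent accesses.

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* create a 4 KB private segment; the id indexes the kernel's table */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    /* attach: Linux chooses a free area in this process's address space */
    char *mem = shmat(shmid, NULL, 0);
    if (mem == (char *) -1) { perror("shmat"); return 1; }

    strcpy(mem, "visible to every process attached to this segment");
    printf("%s\n", mem);

    shmdt(mem);                        /* detach from this address space */
    shmctl(shmid, IPC_RMID, NULL);     /* mark the segment for removal   */
    return 0;
}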

11. THREADS

11.1 Overview
11.1.1 Definition
A thread, sometimes called a lightweight process (LWP), is a basic unit of CPU
utilization.
It comprises a thread id, a program counter, a register set and a stack.
It shares with other threads belonging to the same process its code section, data
section and other operating system resources, such as open files and signals.

Single and Multithreaded Processes

11.1.2 Benefits
Responsiveness
Resource Sharing
Economy
Utilization of MP Architecture

11.1.3 User Threads
Thread management done by user-level threads library
Three primary thread libraries:
POSIX Pthreads
Win32 threads
Java threads

11.1.4 Kernel Threads
- Supported by the Kernel
- Examples
o Windows XP/2000
o Solaris
o Linux
o Tru64 UNIX
o Mac OS X


11.2 Multithreading Models
Many-to-One
One-to-One
Many-to-Many

11.2.1 Many-to-One
Many user-level threads mapped to single kernel thread
Examples:
Solaris Green Threads
GNU Portable Threads










Many-to-One Model


11.2.2 One-to-One
Each user-level thread maps to a kernel thread
Examples
Windows NT/XP/2000
Linux
Solaris 9 and later








One-to-one Model

11.2.3 Many-to-Many Model
Allows many user level threads to be mapped to many kernel threads
Allows the operating system to create a sufficient number of kernel threads
Solaris prior to version 9
Windows NT/2000 with the ThreadFiber package











Many-to-Many Model


11.2.4 Two-level Model
Similar to M:M, except that it allows a user thread to be bound to a kernel thread
Examples
IRIX
HP-UX
Tru64 UNIX
Solaris 8 and earlier

Two-level Model











11.3 Threading Issues
11.3.1 Thread Cancellation

Terminating a thread before it has finished
Two general approaches:
Asynchronous cancellation - terminates the target thread immediately
Deferred cancellation - allows the target thread to periodically check if it should be
cancelled

11.3.2 Signal Handling
Signals are used in UNIX systems to notify a process that a particular event has
occurred
A signal handler is used to process signals
Signal is generated by particular event
Signal is delivered to a process
Signal is handled
Options:
Deliver the signal to the thread to which the signal applies
Deliver the signal to every thread in the process
Deliver the signal to certain threads in the process
Assign a specific thread to receive all signals for the process

11.3.3 Thread Pools
Create a number of threads in a pool where they await work
Advantages:
Usually slightly faster to service a request with an existing thread than to create a new
thread
Allows the number of threads in the application(s) to be bound to the size of the
pool

11.3.4 Thread Specific Data
Allows each thread to have its own copy of data
Useful when you do not have control over the thread creation process (i.e., when using
a thread pool)

11.3.5 Scheduler Activations
Both M:M and Two-level models require communication to maintain the appropriate
number of kernel threads allocated to the application
Scheduler activations provide upcalls - a communication mechanism from the kernel to
the thread library
This communication allows an application to maintain the correct number of kernel
threads

12. CASE STUDY: PTHREADS LIBRARY
The POSIX thread libraries are a standards based thread API for C/C++. It allows one to
spawn a new concurrent process flow. It is most effective on multi-processor or multi-
core systems where the process flow can be scheduled to run on another processor thus
gaining speed through parallel or distributed processing. Threads require less overhead
than "forking" or spawning a new process because the system does not initialize a new
system virtual memory space and environment for the process. While most effective on a
multiprocessor system, gains are also found on uniprocessor systems which exploit
latency in I/O and other system functions which may halt process execution. (One thread
may execute while another is waiting for I/O or some other system latency.) Parallel
programming technologies such as MPI and PVM are used in a distributed computing
environment while threads are limited to a single computer system. All threads within a
process share the same address space. A thread is spawned by defining a function and its
arguments which will be processed in the thread. The purpose of using the POSIX thread
library in your software is to execute software faster.
Thread Basics:
- Thread operations include thread creation, termination, synchronization
(joins, blocking), scheduling, data management and process interaction.
- A thread does not maintain a list of created threads, nor does it know the thread
that created it.
- All threads within a process share the same address space.
- Threads in the same process share:
o Process instructions
o Most data
o open files (descriptors)
o signals and signal handlers
o current working directory
o User and group id
- Each thread has a unique:
o Thread ID
o set of registers, stack pointer
o stack for local variables, return addresses
o signal mask
o priority
o Return value: errno
- pthread functions return "0" if OK.

Thread Creation and Termination:
Example: pthread1.c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *print_message_function( void *ptr );

int main()
{
pthread_t thread1, thread2;
char *message1 = "Thread 1";
char *message2 = "Thread 2";
int iret1, iret2;

/* Create independent threads each of which will execute function */

iret1 = pthread_create( &thread1, NULL, print_message_function, (void*) message1);
iret2 = pthread_create( &thread2, NULL, print_message_function, (void*) message2);

/* Wait till threads are complete before main continues. Unless we */
/* wait we run the risk of executing an exit which will terminate */
/* the process and all threads before the threads have completed. */

pthread_join( thread1, NULL);
pthread_join( thread2, NULL);

printf("Thread 1 returns: %d\n",iret1);
printf("Thread 2 returns: %d\n",iret2);
exit(0);
}

void *print_message_function( void *ptr )
{
char *message;
message = (char *) ptr;
printf("%s \n", message);
return NULL;
}
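On a typical Linux system the example above can be compiled and run with something like gcc pthread1.c -lpthread -o pthread1 (or gcc -pthread pthread1.c -o pthread1); the exact flag depends on the compiler and platform.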

Thread Synchronization:
The threads library provides three synchronization mechanisms:
- mutexes - Mutual exclusion lock: Block access to variables by other threads. This
enforces exclusive access by a thread to a variable or set of variables.
- joins - Make a thread wait till others are complete (terminated).
- condition variables - data type pthread_cond_t
Thread Scheduling:
When the POSIX thread-scheduling option is supported, each thread may have its own scheduling properties.
Scheduling attributes may be specified:
- during thread creation
- by dynamically changing the attributes of a thread already created
- by defining the effect of a mutex on the thread's scheduling when creating a mutex
- by dynamically changing the scheduling of a thread during synchronization
operations.
The threads library provides default values that are sufficient for most cases.

Thread Pitfalls:
- Race conditions: While the code may appear on the screen in the order you wish the
code to execute, threads are scheduled by the operating system and are executed at
random. It cannot be assumed that threads are executed in the order they are created.
They may also execute at different speeds. When threads are executing (racing to
complete) they may give unexpected results (race condition). Mutexes and joins must
be utilized to achieve a predictable execution order and outcome.
- Thread safe code: The threaded routines must call functions which are "thread safe".
This means that there are no static or global variables which other threads may clobber
or read assuming single threaded operation. If static or global variables are used then
mutexes must be applied or the functions must be re-written to avoid the use of these
variables. In C, local variables are dynamically allocated on the stack. Many non-
reentrant functions return a pointer to static data. This can be avoided by returning
dynamically allocated data or using caller-provided storage. An example of a non-
thread safe function is strtok which is also not re-entrant. The "thread safe" version is the
re-entrant version strtok_r.
- Mutex Deadlock: This condition occurs when a mutex is applied but then not
"unlocked". This causes program execution to halt indefinitely. It can also be caused by
poor application of mutexes or joins. Be careful when applying two or more mutexes to
a section of code. If the first pthread_mutex_lock is applied and the second
pthread_mutex_lock fails due to another thread applying a mutex, the first mutex may
eventually lock all other threads from accessing data including the thread which holds
the second mutex. The threads may wait indefinitely for the resource to become free
causing a deadlock. It is best to test and if failure occurs, free the resources and stall
before retrying.
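As an illustration of the thread-safe-code pitfall above: strtok keeps its parsing position in hidden static data, while strtok_r makes the caller supply that state, so each thread can tokenize independently. A minimal sketch:

#include <stdio.h>
#include <string.h>

void parse(char *line)
{
    char *saveptr;                                 /* per-call (and so per-thread) state */
    char *tok = strtok_r(line, " ", &saveptr);     /* re-entrant, safe in threaded code */
    while (tok != NULL) {
        printf("token: %s\n", tok);
        tok = strtok_r(NULL, " ", &saveptr);
    }
}

int main()
{
    char line[] = "one two three";
    parse(line);
    return 0;
}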
*******************************
UNIT II

PROCESS SCHEDULING AND SYNCHRONIZATION

1.CPU SCHEDULING

1.1 BASIC CONCEPTS
1.1.1 CPU I/O burst cycle
Maximum CPU utilization obtained with multiprogramming
CPU-I/O Burst Cycle: process execution consists of a cycle of CPU execution and
I/O wait.
CPU burst distribution

1.1.2 Scheduling Queues
Processes entering the system are put into a job queue.
Processes in memory waiting to be executed are kept in a list called the ready queue.
The ready queue is generally stored as a linked list.
Processes waiting for a particular device are placed in a I/O queue.

1.1.3 Schedulers
A process migrates between various scheduling queues throughout its lifetime.
The process of selecting processes from these queues is carried out by a scheduler.
Types of scheduler
o Long-term scheduler (job scheduler)
o Short-term scheduler (CPU scheduler)
1.1.3.1 The short-term scheduler:
selects from among the processes that are ready to execute and allocates the CPU to
one of them
must select a new process for the CPU frequently
must be very fast.
1.1.3.2 The long-term scheduler :
selects processes from a batch system and loads them into memory for execution
executes less frequently

1.1.4 Types of Scheduling
1.1.4.1 Scheduling Decision
Selects from among processes in memory that are ready to execute and allocates the
CPU to one of them. CPU scheduling decisions take place when a process:
(i) switches from running to waiting state
(ii) switches from running to ready state
(iii) switches from waiting to ready
(iv) terminates.
(v) Scheduling under (i) and (iv) is non-preemptive, otherwise the
scheduling scheme is preemptive.
1.1.4.2 Preemptive vs Non-Preemptive Scheduling
Scheduling is non-preemptive if once the CPU has been allocated to a process, the
process can keep the CPU until it releases it, either by terminating or switching to the
waiting state.
Scheduling is preemptive if the CPU can be taken away from a process during
execution.

1.1.5 Dispatcher
The module that gives control of the CPU to the process selected by the short-term
scheduler. This involves:
switching context
switching to user mode
jumping to the proper location in the user program to restart that program.

1.2 SCHEDULING CRITERIA
Different CPU scheduling algorithms have different properties. The criteria used for
comparing these algorithms include:
CPU Utilization: keep the CPU as busy as possible. Should range from 50% (lightly
loaded system) to 90% for a heavily used system.
Throughput : no. of processes that complete their execution per unit time. This may be
1 process/hour for long jobs; or 10 processes/second, for short transactions.
Turnaround time: the interval from submission of a process to its completion, i.e. the
sum of the periods spent waiting to get into memory, waiting in the ready queue,
executing on the CPU, and doing I/O.
Waiting time : amount of time a process spends waiting in the ready queue.
Response time : amount of time it takes from when a request was submitted until the
first response is produced.
Fairness: each process should have a fair share of the CPU.

1.3 SCHEDULING ALGORITHM
1.3.1 Optimization Criteria
Max CPU utilization
Max throughput
Min turnaround time
Min waiting time
Min response time

1.3.2 First-Come, First-Served (FCFS) Scheduling
Process    Burst Time
P1         24
P2         3
P3         3

Suppose that the processes arrive in the order: P1, P2, P3
The Gantt chart for the schedule is: P1 (0-24), P2 (24-27), P3 (27-30)

Waiting time for P1 = 0; P2 = 24; P3 = 27
Average waiting time: (0 + 24 + 27)/3 = 17
Suppose instead that the processes arrive in the order P2, P3, P1.
The Gantt chart for the schedule is: P2 (0-3), P3 (3-6), P1 (6-30)
Waiting time for P1 = 6; P2 = 0; P3 = 3
Average waiting time: (6 + 0 + 3)/3 = 3
Much better than the previous case.
Convoy effect: short processes wait behind a long process.


1.3.3 Shortest-Job-First (SJF) Scheduling
Associate with each process the length of its next CPU burst. Use these lengths to
schedule the process with the shortest time.
Two schemes:
nonpreemptive once CPU given to the process it cannot be preempted until
completes its CPU burst.
preemptive if a new process arrives with a CPU burst length less than the remaining
time of the currently executing process, preempt. This scheme is known as
Shortest-Remaining-Time-First (SRTF).
SJF is optimal gives minimum average waiting time for a given set of processes.
Example of Non-Preemptive SJF
Process    Arrival Time    Burst Time
P1         0.0             7
P2         2.0             4
P3         4.0             1
P4         5.0             4

SJF (non-preemptive) schedule: P1 (0-7), P3 (7-8), P2 (8-12), P4 (12-16)
Average waiting time = (0 + 6 + 3 + 7)/4 = 4

Example of Preemptive SJF
Process    Arrival Time    Burst Time
P1         0.0             7
P2         2.0             4
P3         4.0             1
P4         5.0             4

SJF (preemptive) schedule: P1 (0-2), P2 (2-4), P3 (4-5), P2 (5-7), P4 (7-11), P1 (11-16)
Average waiting time = (9 + 1 + 0 + 2)/4 = 3



1.3.4 Priority Scheduling
A priority number (integer) is associated with each process
The CPU is allocated to the process with the highest priority (smallest integer highest
priority).
Preemptive
nonpreemptive
SJF is a priority scheduling where priority is the predicted next CPU burst time.
Problem Starvation low priority processes may never execute.
Solution Aging as time progresses increase the priority of the process.

1.3.5 Round Robin (RR)
Each process gets a small unit of CPU time (time quantum), usually 10-100
milliseconds. After this time has elapsed, the process is preempted and added to the
end of the ready queue.
If there are n processes in the ready queue and the time quantum is q, then each process
gets 1/n of the CPU time in chunks of at most q time units at once. No process waits
more than (n-1)q time units.
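For example (an illustrative calculation): with n = 5 ready processes and q = 20 milliseconds, each process receives up to 20 ms out of every 100 ms, and no process waits more than (5 - 1) x 20 = 80 ms for its next quantum.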
Performance:
q large: RR behaves the same as FCFS (FIFO).
q small: q must be large with respect to the context-switch time, otherwise the overhead is too
high.
Example of RR with Time Quantum = 20
Process    Burst Time
P1         53
P2         17
P3         68
P4         24

The Gantt chart is:
P1 (0-20), P2 (20-37), P3 (37-57), P4 (57-77), P1 (77-97), P3 (97-117), P4 (117-121), P1 (121-134), P3 (134-154), P3 (154-162)
Typically, higher average turnaround than SJF, but better response.

1.3.6 Multilevel Queue
Ready queue is partitioned into separate queues:
foreground (interactive)
background (batch)
Each queue has its own scheduling algorithm,
foreground RR
background FCFS
Scheduling must be done between the queues.
Fixed priority scheduling; (i.e., serve all from foreground then from background).
Possibility of starvation.
Time slice each queue gets a certain amount of CPU time which it can schedule
amongst its processes; i.e., 80% to foreground in RR
20% to background in FCFS

Multilevel Queue Scheduling












1.3.7 Multilevel Feedback Queue
A process can move between the various queues; aging can be implemented this way.
Multilevel-feedback-queue scheduler defined by the following parameters:
number of queues
scheduling algorithms for each queue
method used to determine when to upgrade a process
method used to determine when to demote a process
method used to determine which queue a process will enter when that process needs
service
Example of Multilevel Feedback Queue
Three queues:
Q0 - time quantum 8 milliseconds
Q1 - time quantum 16 milliseconds
Q2 - FCFS
Scheduling:
A new job enters queue Q0, which is served FCFS. When it gains the CPU, the job receives
8 milliseconds. If it does not finish in 8 milliseconds, the job is moved to queue Q1.
At Q1 the job is again served FCFS and receives 16 additional milliseconds. If it still
does not complete, it is preempted and moved to queue Q2.

Multilevel Feedback Queues











1.4 Multiple-Processor Scheduling
CPU scheduling more complex when multiple CPUs are available.
Homogeneous processors within a multiprocessor.
Load sharing
Asymmetric multiprocessing only one processor accesses the system data structures,
alleviating the need for data sharing.

1.5 Real-Time Scheduling
Hard real-time systems required to complete a critical task within a guaranteed
amount of time.
Soft real-time computing requires that critical processes receive priority over less
fortunate ones.

Dispatch Latency















1.6 ALGORITHM EVALUATION

How do we select a CPU scheduling algorithm for a particular system?
First, define the criteria (e.g. minimize average waiting time)

Deterministic modeling
Takes a particular predetermined workload as input
Computes the performance of each algorithm (e.g. SJF, FCFS, RR) for that workload
Useful only if the same behavior or workload repeats

Queueing models
Deterministic model is not realistic

Simulation
Use random number generator to generate processes
Use traces

Implementation


Evaluation of CPU Schedulers by Simulation


2.CASE STUDY: PROCESS SCHEDULING IN LINUX.

2.1.Introduction to Linux process scheduling
Policy versus algorithm
Linux overall process scheduling objectives
Timesharing
Dynamic priority
Favor I/O-bound process
Linux scheduling algorithm
Dividing time into epochs
Remaining quantum as process priority
When scheduling occurs

2.2.Linux Process Scheduling Policy
First we examine Linux scheduling policy
A scheduling policy is the set of decisions you make
regarding scheduling priorities, goals, and objectives
A scheduling algorithm is the instructions or code that
implements a given scheduling policy
Linux has several, conflicting objectives
Fast process response time
Good throughput for background jobs
Avoidance of process starvation
Linux uses a timesharing technique
We know that this means that each process is assigned a small quantum or time slice
that it is allowed to execute
This relies on hardware timer interrupts and is completely transparent to the processes
Linux schedules processes according to a priority ranking; this is a goodness ranking.
Linux uses dynamic priorities, i.e., priorities are adjusted over time to eliminate
starvation
Processes that have not received the CPU for a long time get their priorities increased,
processes that have received the CPU often get their priorities decreased.
We can classify processes using two schemes:
CPU-bound versus I/O-bound
We learned about this in previous lectures
Interactive versus batch versus real-time
We have talked about these concepts in previous lectures, so they should be relatively
self-explanatory
These classifications are somewhat independent, e.g., a batch process can be either I/O-
bound or CPU-bound
Linux recognizes real-time programs and assigns them high priority, but this is only soft
real-time, like streaming audio
Linux does not recognize batch or interactive processes; instead it implicitly favors I/O-bound processes.
Linux uses process preemption; a process is preempted when:
Its time quantum has expired
A new process enters TASK_RUNNING state and its priority is greater than the priority
of the currently running process
The preempted process is not suspended, it is still in the ready queue, it simply no
longer has the CPU
Consider a text editor and a compiler
Since the text editor is an interactive program, its dynamic priority is higher than the
compiler
The text editor will block often since it is waiting for I/O.
When the I/O interrupt delivers a key-press for the editor, the editor is put on the ready
queue and the scheduler is called, since the editor's priority is higher than the compiler's.
The editor gets the input and quickly blocks for more I/O.

2.3. Determining the length of the quantum

Should be neither too long nor too short.
If too short, the overhead caused by process switching becomes excessively high.
If too long, processes no longer appear to be executing concurrently.
For Linux, long quanta do not necessarily degrade response time for interactive
processes because their dynamic priority remains high, thus they get the CPU as soon as
they need it
For long quanta, responsiveness can degrade in instances where the scheduler does not
know if a process is interactive
or not, such as when a process is newly created
The choice for Linux is the longest possible quantum that does not affect responsiveness;
this turns out to be about 20 clock ticks, or roughly 210 milliseconds.
The Linux scheduling algorithm is not based on a continuous CPU time axis, instead it
divides the CPU time into epochs
An epoch is a division of time or a period of time
In a single epoch, every process has a specified time quantum that is computed at the
beginning of each epoch
This is the maximum CPU time that the process can use during the current epoch
A process only uses its quantum when it is executing on the CPU, when the process is
waiting for I/O its quantum is not used
As a result, a process can get the CPU many times in one epoch, until its quantum is
fully used
An epoch ends when all runnable processes have used all of their quantum
Then a new epoch starts and all processes get a new quantum.

2.4 Linux Process Scheduling Algorithm
When does an epoch end?
Important!
An epoch ends when all processes in the ready queue have used their quantum
This does not include processes that are blocking on some wait queue, they will still
have quantum remaining
The end of an epoch is only concerned with processes on the ready queue.

Calculating process quanta for an epoch:
Each process is initially assigned a base time quantum, as mentioned previously it is
about 20 clock ticks
If a process uses its entire quantum in the current epoch,then in the next epoch it will
get the base time quantum again
If a process does not use its entire quantum, then the unused quantum carries over into
the next epoch (the unused quantum is not directly used, but a bonus is calculated)
Why? Processes that block often will not use their full quantum; the carried-over bonus
favors I/O-bound processes, because this value is used to calculate priority.
When forking a new child process, the parent process's remaining quantum is divided in
half: half for the parent and half for the child.
Selecting a process to run next
The scheduler considers the priority of each process
There are two kinds of priorities
Static priorities - these are assigned to real-time processes and range from 1 to 99; they
never change
Dynamic priorities - these apply to all other processes and it is the sum of the base time
quantum (also called the base priority) and the number of clock ticks left in the current
epoch
The static priority of real-time process is always higher than the dynamic priority of
conventional processes
Conventional processes will only execute when there are no real-time processes to
execute.
Scheduling data in the process descriptor
The process descriptor (task_struct in Linux) holds essentially all of the information for a
process, including scheduling information.
Recall that Linux keeps a list of all process task_structs and a list of all ready process
task_structs
The following are the relevant scheduling fields in the process descriptor.
Each process descriptor (task_struct) contains the following fields
need_resched - this flag is checked every time an interrupt handler completes to decide
if rescheduling is necessary
policy - the scheduling class for the process
For real-time processes this can have the value of
SCHED_FIFO - first-in, first-out with unlimited time quantum
SCHED_RR - round-robin with time quantum, fair CPU usage
For all other processes the value is
SCHED_OTHER
For processes that have yielded the CPU, the value is
SCHED_YIELD
Process descriptor fields (cont)
rt_priority - the static priority of a real-time process,not used for other processes
priority - the base time quantum (or base priority) of the process
counter - the number of CPU ticks left in its quantum for the current epoch
This field is updated every clock tick by update_process_times()
The priority and counter fields are used for timesharing and dynamic priorities in
conventional processes, for timesharing only in SCHED_RR real-time processes, and
are not used at all for SCHED_FIFO real-time processes.
Process descriptor fields (cont)
Scheduling actually occurs in schedule()
Its objective is to find a process in the ready queue then assign the CPU to it
It is invoked in two ways
Direct invocation
Lazy invocation
Direct invocation of schedule()
Occurs when the current process is going to block because it needs to wait for a
necessary resource
The current process is taken off of the ready queue and is placed on the appropriate wait
queue; its state is changed to
TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE
Once the needed resource becomes available, the process is immediately woken up and
removed from the wait queue.
Lazy invocation of schedule()


3. PROCESS SYNCHRONIZATION

3.1 Background
Concurrent access to shared data may result in data inconsistency.
Maintaining data consistency requires mechanisms to ensure the orderly execution of
cooperating processes.
A shared-memory solution to the bounded-buffer problem allows at most n - 1 items in
the buffer at the same time. A solution where all n buffers are used is not simple.
Suppose that we modify the producer-consumer code by adding a variable counter,
initialized to 0 and incremented each time a new item is added to the buffer

Bounded-Buffer
Shared data

#define BUFFER_SIZE 10
typedef struct {
. . .
} item;
item buffer[BUFFER_SIZE];
int in = 0;
int out = 0;
int counter = 0;

Bounded-Buffer
Producer process

item nextProduced;

while (1) {
while (counter == BUFFER_SIZE)
; /* do nothing */
buffer[in] = nextProduced;
in = (in + 1) % BUFFER_SIZE;
counter++;
}


Bounded-Buffer
Consumer process

item nextConsumed;

while (1) {
while (counter == 0)
; /* do nothing */
nextConsumed = buffer[out];
out = (out + 1) % BUFFER_SIZE;
counter--;
}


Atomic operation means an operation that completes in its entirety without
interruption.

The statement count++ may be implemented in machine language as:

register1 = counter
register1 = register1 + 1
counter = register1

The statement count-- may be implemented as:

register2 = counter
register2 = register2 - 1
counter = register2
If both the producer and consumer attempt to update the buffer concurrently, the
assembly language statements may get interleaved.

Interleaving depends upon how the producer and consumer processes are scheduled.
Assume counter is initially 5. One interleaving of statements is:

producer: register1 = counter (register1 = 5)
producer: register1 = register1 + 1 (register1 = 6)
consumer: register2 = counter (register2 = 5)
consumer: register2 = register2 - 1 (register2 = 4)
producer: counter = register1 (counter = 6)
consumer: counter = register2 (counter = 4)

The value of count may be either 4 or 6, where the correct result should be 5.
Race condition: The situation where several processes access and manipulate shared
data concurrently. The final value of the shared data depends upon which process
finishes last.
To prevent race conditions, concurrent processes must be synchronized.

3.2 THE CRITICAL-SECTION PROBLEM

3.2.1 Introduction
n processes all competing to use some shared data
Each process has a code segment, called critical section, in which the shared data is
accessed.
Problem ensure that when one process is executing in its critical section, no other
process is allowed to execute in its critical section.


3.2.2 Solution to Critical-Section Problem

1. Mutual Exclusion. If process Pi is executing in its critical section, then no other
processes can be executing in their critical sections.
2. Progress. If no process is executing in its critical section and there exist some
processes that wish to enter their critical section, then the selection of the processes that
will enter the critical section next cannot be postponed indefinitely.
3. Bounded Waiting. A bound must exist on the number of times that other
processes are allowed to enter their critical sections after a process has made a request to
enter its critical section and before that request is granted.
Assume that each process executes at a nonzero speed
No assumption concerning relative speed of the n processes.

3.2.3 Two-Process Solution
Only 2 processes, P0 and P1.
General structure of process Pi (the other process is Pj):
do {
   entry section
      critical section
   exit section
      remainder section
} while (1);
Processes may share some common variables to synchronize their actions.

Algorithm 1
Shared variables:
   int turn;   // initially turn = 0
   turn == i means Pi can enter its critical section
Process Pi:
do {
   while (turn != i) ;
      critical section
   turn = j;
      remainder section
} while (1);
Satisfies mutual exclusion, but not progress.

Algorithm 2
Shared variables:
   boolean flag[2];   // initially flag[0] = flag[1] = false
   flag[i] = true means Pi is ready to enter its critical section
Process Pi:
do {
   flag[i] = true;
   while (flag[j]) ;
      critical section
   flag[i] = false;
      remainder section
} while (1);
Satisfies mutual exclusion, but not the progress requirement.

Algorithm 3 (Peterson's algorithm)
Combines the shared variables of algorithms 1 and 2.
Process Pi:
do {
   flag[i] = true;
   turn = j;
   while (flag[j] && turn == j) ;
      critical section
   flag[i] = false;
      remainder section
} while (1);
Meets all three requirements; solves the critical-section problem for two processes.

3.2.4 Bakery Algorithm or Multiple process Solution
Before entering its critical section, process receives a number. Holder of the smallest
number enters the critical section.
If processes Pi and Pj receive the same number, then if i < j, Pi is served first; else Pj is
served first.
The numbering scheme always generates numbers in increasing order of enumeration;
i.e., 1,2,3,3,3,3,4,5...

Notation: < denotes lexicographical order on (ticket #, process id #):
   (a,b) < (c,d) if a < c, or if a == c and b < d
   max(a0, ..., an-1) is a number k such that k >= ai for i = 0, ..., n-1
Shared data:
boolean choosing[n];
int number[n];
Data structures are initialized to false and 0 respectively
Bakery Algorithm
do {
   choosing[i] = true;
   number[i] = max(number[0], number[1], ..., number[n-1]) + 1;
   choosing[i] = false;
   for (j = 0; j < n; j++) {
      while (choosing[j]) ;
      while ((number[j] != 0) && ((number[j], j) < (number[i], i))) ;
   }
      critical section
   number[i] = 0;
      remainder section
} while (1);

3.3 SYNCHRONIZATION HARDWARE
Test and modify the content of a word atomically
.
boolean TestAndSet(boolean &target) {
   boolean rv = target;
   target = true;
   return rv;
}
Mutual Exclusion with Test-and-Set
Shared data:
boolean lock = false;

Process Pi:
do {
   while (TestAndSet(lock)) ;
      critical section
   lock = false;
      remainder section
} while (1);
Synchronization Hardware
Atomically swap two variables.

void Swap(boolean &a, boolean &b) {
boolean temp = a;
a = b;
b = temp;
}
Mutual Exclusion with Swap
Shared data (initialized to false):
boolean lock;
boolean waiting[n];

Process Pi:
do {
   key = true;
   while (key == true)
      Swap(lock, key);
      critical section
   lock = false;
      remainder section
} while (1);

3.4 SEMAPHORES

3.4.1 Definition
Synchronization tool that does not require busy waiting.
Semaphore S: an integer variable that can only be accessed via two indivisible (atomic) operations:
wait(S):
   while (S <= 0)
      ;   // no-op (busy wait)
   S--;

signal(S):
   S++;




3.4.2 Critical Section of n Processes
Shared data:
semaphore mutex; //initially mutex = 1

Process Pi:

do {
wait(mutex);
critical section
signal(mutex);
remainder section
} while (1);

3.4.3 Semaphore Implementation
Define a semaphore as a record
typedef struct {
int value;
struct process *L;
} semaphore;

Assume two simple operations:
block suspends the process that invokes it.
wakeup(P) resumes the execution of a blocked process P.

Semaphore operations now defined as
wait(S):
S.value--;
if (S.value < 0) {
add this process to S.L;
block;
}

signal(S):
S.value++;
if (S.value <= 0) {
remove a process P from S.L;
wakeup(P);
}
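A user-level sketch of such a blocking semaphore can be written with a pthread mutex and condition variable standing in for the kernel's block/wakeup; this sketch uses the non-negative-count variant of the definition rather than the negative-count bookkeeping above:

#include <pthread.h>

typedef struct {
    int value;
    pthread_mutex_t lock;
    pthread_cond_t  cond;        /* plays the role of the waiting list S.L */
} sema_t;

void sema_init(sema_t *s, int value)
{
    s->value = value;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
}

void sema_wait(sema_t *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->value == 0)                      /* block instead of busy waiting */
        pthread_cond_wait(&s->cond, &s->lock);
    s->value--;
    pthread_mutex_unlock(&s->lock);
}

void sema_signal(sema_t *s)
{
    pthread_mutex_lock(&s->lock);
    s->value++;
    pthread_cond_signal(&s->cond);             /* wakeup: resume one waiter, if any */
    pthread_mutex_unlock(&s->lock);
}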
3.4.4 Semaphore as a General Synchronization Tool
Execute statement B in Pj only after statement A has executed in Pi.
Use a semaphore flag initialized to 0.
Code:
   Pi:               Pj:
   A                 wait(flag);
   signal(flag);     B


3.4.5 Deadlock and Starvation
Deadlock: two or more processes are waiting indefinitely for an event that can be
caused by only one of the waiting processes.
Let S and Q be two semaphores initialized to 1:
   P0:            P1:
   wait(S);       wait(Q);
   wait(Q);       wait(S);
   ...            ...
   signal(S);     signal(Q);
   signal(Q);     signal(S);
Starvation: indefinite blocking. A process may never be removed from the
semaphore queue in which it is suspended.

3.4.6 Two Types of Semaphores
Counting semaphore integer value can range over an unrestricted domain.
Binary semaphore integer value can range only between 0 and 1; can be simpler to
implement.
Can implement a counting semaphore S as a binary semaphore.

3.4.6.1 Implementing S as a Binary Semaphore
Data structures:
binary-semaphore S1, S2;
int C:
Initialization:
S1 = 1
S2 = 0
C = initial value of semaphore S
Implementing S
wait operation
wait(S1);
C--;
if (C < 0) {
signal(S1);
wait(S2);
}
signal(S1);

signal operation
wait(S1);
C ++;
if (C <= 0)
signal(S2);
else
signal(S1);




3.5 CLASSICAL PROBLEMS OF SYNCHRONIZATION
3.5.1 Classical Problems

Bounded-Buffer Problem
Readers and Writers Problem
Dining-Philosophers Problem

3.5.2 Bounded-Buffer Problem

Shared data
semaphore full, empty, mutex;

Initially:

full = 0, empty = n, mutex = 1
Bounded-Buffer Problem Producer Process

do {

produce an item in nextp

wait(empty);
wait(mutex);

add nextp to buffer

signal(mutex);
signal(full);
} while (1);

Bounded-Buffer Problem Consumer Process

do {
wait(full)
wait(mutex);

remove an item from buffer to nextc

signal(mutex);
signal(empty);

consume the item in nextc

} while (1);
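A runnable sketch of the same producer/consumer structure using POSIX semaphores and a pthread mutex in place of the mutex semaphore (the buffer size, item type, and item count are illustrative):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 5
int buffer[N];
int in = 0, out = 0;
sem_t empty_slots, full_slots;                 /* counting semaphores: empty = N, full = 0 */
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *producer(void *arg)
{
    for (int item = 0; item < 20; item++) {
        sem_wait(&empty_slots);                /* wait(empty) */
        pthread_mutex_lock(&mutex);            /* wait(mutex) */
        buffer[in] = item;                     /* add item to buffer */
        in = (in + 1) % N;
        pthread_mutex_unlock(&mutex);          /* signal(mutex) */
        sem_post(&full_slots);                 /* signal(full) */
    }
    return NULL;
}

void *consumer(void *arg)
{
    for (int i = 0; i < 20; i++) {
        sem_wait(&full_slots);                 /* wait(full) */
        pthread_mutex_lock(&mutex);            /* wait(mutex) */
        int item = buffer[out];                /* remove item from buffer */
        out = (out + 1) % N;
        pthread_mutex_unlock(&mutex);          /* signal(mutex) */
        sem_post(&empty_slots);                /* signal(empty) */
        printf("consumed %d\n", item);
    }
    return NULL;
}

int main()
{
    pthread_t p, c;
    sem_init(&empty_slots, 0, N);
    sem_init(&full_slots, 0, 0);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}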

3.5.3 Readers-Writers Problem
Shared data

semaphore mutex, wrt;

Initially


mutex = 1, wrt = 1, readcount = 0


Readers-Writers Problem Writer Process
wait(wrt);

writing is performed

signal(wrt);
Readers-Writers Problem Reader Process

wait(mutex);
readcount++;
if (readcount == 1)
   wait(wrt);
signal(mutex);

   reading is performed

wait(mutex);
readcount--;
if (readcount == 0)
   signal(wrt);
signal(mutex);
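Pthreads also provides a ready-made reader-writer lock that implements this pattern; a minimal sketch (the shared integer is only illustrative, and whether readers or writers are preferred is implementation-defined):

#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int shared_value = 0;

void *reader(void *arg)
{
    pthread_rwlock_rdlock(&rw);    /* many readers may hold the lock at once */
    int v = shared_value;          /* reading is performed */
    (void) v;
    pthread_rwlock_unlock(&rw);
    return NULL;
}

void *writer(void *arg)
{
    pthread_rwlock_wrlock(&rw);    /* a writer gets exclusive access */
    shared_value++;                /* writing is performed */
    pthread_rwlock_unlock(&rw);
    return NULL;
}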


3.5.4 Dining-Philosophers Problem












Shared data
semaphore chopstick[5];
Initially all values are 1
Dining-Philosophers Problem
Philosopher i:
do {
wait(chopstick[i])
wait(chopstick[(i+1) % 5])

eat

signal(chopstick[i]);
signal(chopstick[(i+1) % 5]);

think

} while (1);
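Note that this semaphore solution can deadlock: if all five philosophers pick up their left chopstick at the same time, each waits forever for the right one. One standard remedy, sketched below, is to make one philosopher pick up the chopsticks in the opposite order (other remedies include allowing at most four philosophers at the table, or picking up both chopsticks only when both are available):

if (i == 4) {
   wait(chopstick[(i+1) % 5]);   /* the last philosopher picks up the right chopstick first */
   wait(chopstick[i]);
} else {
   wait(chopstick[i]);
   wait(chopstick[(i+1) % 5]);
}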

3.6 CRITICAL REGIONS
High-level synchronization construct
A shared variable v of type T, is declared as:
v: shared T
Variable v accessed only inside statement
region v when B do S

where B is a boolean expression.

While statement S is being executed, no other process can access variable v.

Regions referring to the same shared variable exclude each other in time.

When a process tries to execute the region statement, the Boolean expression B is
evaluated. If B is true, statement S is executed. If it is false, the process is delayed until
B becomes true and no other process is in the region associated with v.
Example Bounded Buffer
Shared data:

struct buffer {
int pool[n];
int count, in, out;
}

Bounded Buffer Producer Process
Producer process inserts nextp into the shared buffer

region buffer when (count < n) {
   pool[in] = nextp;
   in = (in+1) % n;
   count++;
}

Bounded Buffer Consumer Process
Consumer process removes an item from the shared buffer and puts it in nextc

region buffer when (count > 0) { nextc =
pool[out];
out = (out+1) % n;
count--;
}
Implementation region x when B do S
Associate with the shared variable x, the following variables:
semaphore mutex, first-delay, second-delay;
int first-count, second-count;


Mutually exclusive access to the critical section is provided by mutex.

If a process cannot enter the critical section because the Boolean expression B is false, it
initially waits on the first-delay semaphore; moved to the second-delay semaphore before
it is allowed to reevaluate B.
Implementation
Keep track of the number of processes waiting on first-delay and second-delay, with first-
count and second-count respectively.

The algorithm assumes a FIFO ordering in the queuing of processes for a semaphore.

For an arbitrary queuing discipline, a more complicated implementation is required.

3.7 MONITORS
3.7.1 Introduction
High-level synchronization construct that allows the safe sharing of an abstract data
type among concurrent processes.


monitor monitor-name
{
shared variable declarations
procedure body P1 () {
. . .
}
procedure body P2 () {
. . .
}
procedure body Pn () {
. . .
}
{
initialization code
}
}
To allow a process to wait within the monitor, a condition variable must be declared, as
condition x, y;
Condition variable can only be used with the operations wait and signal.
The operation
x.wait();
means that the process invoking this operation is suspended until another process invokes
x.signal();
The x.signal operation resumes exactly one suspended process. If no process is
suspended, then the signal operation has no effect.
















Schematic View of a Monitor
























Monitor With Condition Variables


3.7.2 Dining Philosophers Example
monitor dp
{
enum {thinking, hungry, eating} state[5];
condition self[5];
void pickup(int i) // following slides
void putdown(int i) // following slides
void test(int i) // following slides
void init() {
for (int i = 0; i < 5; i++)
state[i] = thinking;
}
}
void pickup(int i) {
   state[i] = hungry;
   test(i);
   if (state[i] != eating)
      self[i].wait();
}

void putdown(int i) {
state[i] = thinking;
// test left and right neighbors
test((i+4) % 5);
test((i+1) % 5);
}
void test(int i) {
   if ((state[(i + 4) % 5] != eating) &&
       (state[i] == hungry) &&
       (state[(i + 1) % 5] != eating)) {
      state[i] = eating;
      self[i].signal();
   }
}

3.7.3 Monitor Implementation Using Semaphores
Variables
semaphore mutex; // (initially = 1)
semaphore next; // (initially = 0)
int next-count = 0;

Each external procedure F will be replaced by
wait(mutex);

body of F;

if (next-count > 0)
signal(next)
else
signal(mutex);

Mutual exclusion within a monitor is ensured.

For each condition variable x, we have:
semaphore x-sem; // (initially = 0)
int x-count = 0;

The operation x.wait can be implemented as:

x-count++;
if (next-count > 0)
signal(next);
else
signal(mutex);
wait(x-sem);
x-count--;

The operation x.signal can be implemented as:

if (x-count > 0) {
next-count++;
signal(x-sem);
wait(next);
next-count--;
}

Conditional-wait construct: x.wait(c);
c integer expression evaluated when the wait operation is executed.
value of c (a priority number) stored with the name of the process that is suspended.
when x.signal is executed, process with smallest associated priority number is
resumed next.
Check two conditions to establish correctness of system:
User processes must always make their calls on the monitor in a correct sequence.
Must ensure that an uncooperative process does not ignore the mutual-exclusion
gateway provided by the monitor, and try to access the shared resource directly,
without using the access protocols.

4. DEADLOCKS

4.1 The Deadlock Problem
A set of blocked processes each holding a resource and waiting to acquire a resource
held by another process in the set.
Example
System has 2 tape drives.
P1 and P2 each hold one tape drive and each needs the other one.
Example:
semaphores A and B, initialized to 1
   P0:            P1:
   wait(A);       wait(B);
   wait(B);       wait(A);

Bridge Crossing Example
Traffic only in one direction.
Each section of a bridge can be viewed as a resource.
If a deadlock occurs, it can be resolved if one car backs up (preempt resources and
rollback).
Several cars may have to be backed up if a deadlock occurs.
Starvation is possible.

4.2 SYSTEM MODEL
Resource types R1, R2, ..., Rm (e.g. CPU cycles, memory space, I/O devices).
Each resource type Ri has Wi instances.
Each process utilizes a resource as follows:
request
use
release



4.3 DEADLOCK CHARACTERIZATION
4.3.1 Necessary Conditions
Mutual exclusion: only one process at a time can use a resource.
Hold and wait: a process holding at least one resource is waiting to acquire additional
resources held by other processes.
No preemption: a resource can be released only voluntarily by the process holding it,
after that process has completed its task.
Circular wait: there exists a set {P0, P1, ..., Pn} of waiting processes such that P0 is
waiting for a resource that is held by P1, P1 is waiting for a resource that is held by
P2, ..., Pn-1 is waiting for a resource that is held by Pn, and Pn is waiting for a
resource that is held by P0.

4.3.2 Resource-Allocation Graph
The resource-allocation graph has a set of vertices V and a set of edges E. V is partitioned into two types:
P = {P1, P2, ..., Pn}, the set consisting of all the processes in the system.
R = {R1, R2, ..., Rm}, the set consisting of all resource types in the system.
request edge: directed edge Pi -> Rj
assignment edge: directed edge Rj -> Pi
In the graph, a process Pi is drawn as a circle, and a resource type Rj as a rectangle with one dot per instance (e.g. a resource type with 4 instances has 4 dots).
Pi requests an instance of Rj: request edge Pi -> Rj.
Pi is holding an instance of Rj: assignment edge Rj -> Pi.




















Example of a Resource Allocation Graph















Resource Allocation Graph With A Deadlock















Resource Allocation Graph With A Cycle But No Deadlock




Basic Facts
If the graph contains no cycles: no deadlock.
If the graph contains a cycle:
if only one instance per resource type, then deadlock.
if several instances per resource type, possibility of deadlock.

4.4 METHODS FOR HANDLING DEADLOCKS
4.4.1 Dealing with deadlock
Ensure that the system will never enter a deadlock state.

Allow the system to enter a deadlock state and then recover.

Ignore the problem and pretend that deadlocks never occur in the system; used by most
operating systems, including UNIX.

4.4.2 Deadlock Prevention
Mutual Exclusion not required for sharable resources; must hold for nonsharable
resources.

Hold and Wait must guarantee that whenever a process requests a resource, it does
not hold any other resources.
Require process to request and be allocated all its resources before it begins
execution, or allow process to request resources only when the process has none.
Low resource utilization; starvation possible.
No Preemption
If a process that is holding some resources requests another resource that cannot be
immediately allocated to it, then all resources currently being held are released.
Preempted resources are added to the list of resources for which the process is
waiting.
Process will be restarted only when it can regain its old resources, as well as the
new ones that it is requesting.

Circular Wait impose a total ordering of all resource types, and require that each
process requests resources in an increasing order of enumeration.
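In application code this prevention technique appears as a lock-ordering rule; a minimal pthread sketch (the two mutexes and the function are illustrative, the fixed acquisition order is the point):

#include <pthread.h>

pthread_mutex_t lockA = PTHREAD_MUTEX_INITIALIZER;   /* ordering rule: A before B, always */
pthread_mutex_t lockB = PTHREAD_MUTEX_INITIALIZER;

void use_both(void)
{
    pthread_mutex_lock(&lockA);    /* every thread acquires A first ...          */
    pthread_mutex_lock(&lockB);    /* ... then B, so a circular wait cannot form */
    /* ... use both resources ... */
    pthread_mutex_unlock(&lockB);
    pthread_mutex_unlock(&lockA);
}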

4.4.3 Deadlock Avoidance
4.4.3.1 Algorithm
Simplest and most useful model requires that each process declare the maximum
number of resources of each type that it may need.

The deadlock-avoidance algorithm dynamically examines the resource-allocation state
to ensure that there can never be a circular-wait condition.

Resource-allocation state is defined by the number of available and allocated
resources, and the maximum demands of the processes.

4.4.3.2 Safe State
When a process requests an available resource, system must decide if immediate
allocation leaves the system in a safe state.

System is in safe state if there exists a safe sequence of all processes.

Sequence <P1, P2, ..., Pn> is safe if, for each Pi, the resources that Pi can still request
can be satisfied by the currently available resources plus the resources held by all Pj, with j < i.
If Pi's resource needs are not immediately available, then Pi can wait until all Pj have finished.
When Pj is finished, Pi can obtain the needed resources, execute, return its allocated
resources, and terminate.
When Pi terminates, Pi+1 can obtain its needed resources, and so on.

4.4.3.4 Basic Facts
If a system is in safe state no deadlocks.

If a system is in unsafe state possibility of deadlock.

Avoidance ensure that a system will never enter an unsafe state.
Safe, Unsafe , Deadlock State











4.4.3.5 Resource-Allocation Graph Algorithm
Claim edge Pi -> Rj indicates that process Pi may request resource Rj; represented by a
dashed line.

A claim edge converts to a request edge when the process actually requests the resource.

When a resource is released by a process, the assignment edge reconverts to a claim edge.

Resources must be claimed a priori in the system.
Resource-Allocation Graph For Deadlock Avoidance









Unsafe State In Resource-Allocation Graph




















4.4.3.6 Bankers Algorithm
Multiple instances.

Each process must a priori claim maximum use.

When a process requests a resource it may have to wait.

When a process gets all its resources it must return them in a finite amount of time.
Data Structures for the Banker's Algorithm
Let n = number of processes and m = number of resource types.
Available: vector of length m. If Available[j] = k, there are k instances of resource type Rj available.
Max: n x m matrix. If Max[i,j] = k, then process Pi may request at most k instances of resource type Rj.
Allocation: n x m matrix. If Allocation[i,j] = k, then Pi is currently allocated k instances of Rj.
Need: n x m matrix. If Need[i,j] = k, then Pi may need k more instances of Rj to complete its task.

Need[i,j] = Max[i,j] - Allocation[i,j]
Safety Algorithm
1. Let Work and Finish be vectors of length m and n, respectively. Initialize:
   Work = Available
   Finish[i] = false for i = 1, 2, ..., n.
2. Find an i such that both:
   (a) Finish[i] == false
   (b) Need_i <= Work
   If no such i exists, go to step 4.
3. Work = Work + Allocation_i
   Finish[i] = true
   Go to step 2.
4. If Finish[i] == true for all i, then the system is in a safe state.
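A compact C sketch of this safety check (the number of processes and resource types, and the matrices passed in, are illustrative; the function returns 1 if a safe sequence exists):

#define NPROC 5
#define NRES  3

int is_safe(int available[NRES], int alloc[NPROC][NRES], int need[NPROC][NRES])
{
    int work[NRES], finish[NPROC] = {0};
    for (int j = 0; j < NRES; j++)                       /* step 1: Work = Available */
        work[j] = available[j];
    for (;;) {
        int found = 0;
        for (int i = 0; i < NPROC; i++) {                /* step 2: find i with Need_i <= Work */
            if (finish[i]) continue;
            int fits = 1;
            for (int j = 0; j < NRES; j++)
                if (need[i][j] > work[j]) { fits = 0; break; }
            if (fits) {                                  /* step 3: release its allocation */
                for (int j = 0; j < NRES; j++)
                    work[j] += alloc[i][j];
                finish[i] = 1;
                found = 1;
            }
        }
        if (!found) break;                               /* no candidate left: go to step 4 */
    }
    for (int i = 0; i < NPROC; i++)                      /* step 4: safe iff all finished */
        if (!finish[i]) return 0;
    return 1;
}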
Resource-Request Algorithm for Process Pi
Request_i = request vector for process Pi. If Request_i[j] = k, then process Pi wants k
instances of resource type Rj.
1. If Request_i <= Need_i, go to step 2. Otherwise, raise an error condition, since the
   process has exceeded its maximum claim.
2. If Request_i <= Available, go to step 3. Otherwise Pi must wait, since the resources
   are not available.
3. Pretend to allocate the requested resources to Pi by modifying the state as follows:
   Available = Available - Request_i;
   Allocation_i = Allocation_i + Request_i;
   Need_i = Need_i - Request_i;
If the resulting state is safe, the resources are allocated to Pi.
If unsafe, Pi must wait, and the old resource-allocation state is restored.

Example of Banker's Algorithm
5 processes P0 through P4; 3 resource types: A (10 instances), B (5 instances), and C (7 instances).
Snapshot at time T0:

         Allocation    Max        Available
         A  B  C       A  B  C    A  B  C
   P0    0  1  0       7  5  3    3  3  2
   P1    2  0  0       3  2  2
   P2    3  0  2       9  0  2
   P3    2  1  1       2  2  2
   P4    0  0  2       4  3  3

The content of the matrix Need is defined to be Max - Allocation:

         Need
         A  B  C
   P0    7  4  3
   P1    1  2  2
   P2    6  0  0
   P3    0  1  1
   P4    4  3  1

The system is in a safe state since the sequence <P1, P3, P4, P2, P0> satisfies the safety
criteria.

Example: P1 requests (1,0,2).
Check that Request <= Available: (1,0,2) <= (3,3,2) is true, so pretend the allocation is made. The new state is:

         Allocation    Need       Available
         A  B  C       A  B  C    A  B  C
   P0    0  1  0       7  4  3    2  3  0
   P1    3  0  2       0  2  0
   P2    3  0  2       6  0  0
   P3    2  1  1       0  1  1
   P4    0  0  2       4  3  1

Executing the safety algorithm shows that the sequence <P1, P3, P4, P0, P2> satisfies the safety
requirement, so the request can be granted.
Can a request for (3,3,0) by P4 be granted? No: (3,3,0) is not <= Available = (2,3,0), so P4 must wait.
Can a request for (0,2,0) by P0 be granted? No: the request is <= Available, but pretending to grant it leaves the system in an unsafe state.


4.5 DEADLOCK DETECTION
4.5.1 Methods
Allow system to enter deadlock state
Detection algorithm
Recovery scheme



4.5.2 Single Instance of Each Resource Type
Maintain a wait-for graph:
Nodes are processes.
Pi -> Pj if Pi is waiting for Pj.

Periodically invoke an algorithm that searches for a cycle in the graph.

An algorithm to detect a cycle in a graph requires on the order of n^2 operations, where n is
the number of vertices in the graph.
Resource-Allocation Graph and Wait-for Graph












4.5.3 Several Instances of a Resource Type
Available: A vector of length m indicates the number of available resources of each
type.

Allocation: An n x m matrix defines the number of resources of each type currently
allocated to each process.

Request: An n x m matrix indicates the current request of each process. If Request[i,j] = k,
then process Pi is requesting k more instances of resource type Rj.
Detection Algorithm
1. Let Work and Finish be vectors of length m and n, respectively. Initialize:
   (a) Work = Available
   (b) For i = 1, 2, ..., n, if Allocation_i != 0, then Finish[i] = false; otherwise, Finish[i] = true.
2. Find an index i such that both:
   (a) Finish[i] == false
   (b) Request_i <= Work
   If no such i exists, go to step 4.
3. Work = Work + Allocation_i
   Finish[i] = true
   Go to step 2.
4. If Finish[i] == false for some i, 1 <= i <= n, then the system is in a deadlock state.
   Moreover, if Finish[i] == false, then Pi is deadlocked.

Example of Detection Algorithm
Five processes P0 through P4; three resource types: A (7 instances), B (2 instances), and C (6 instances).
Snapshot at time T0:

         Allocation    Request    Available
         A  B  C       A  B  C    A  B  C
   P0    0  1  0       0  0  0    0  0  0
   P1    2  0  0       2  0  2
   P2    3  0  3       0  0  0
   P3    2  1  1       1  0  0
   P4    0  0  2       0  0  2

Sequence <P0, P2, P3, P1, P4> will result in Finish[i] = true for all i.

Now suppose P2 requests an additional instance of type C. The Request matrix becomes:

         Request
         A  B  C
   P0    0  0  0
   P1    2  0  1
   P2    0  0  1
   P3    1  0  0
   P4    0  0  2

State of the system?
We can reclaim the resources held by process P0, but there are insufficient resources to fulfill
the other processes' requests.
A deadlock exists, consisting of processes P1, P2, P3, and P4.

4.5.4 Detection-Algorithm Usage
When, and how often, to invoke depends on:
How often a deadlock is likely to occur?
How many processes will need to be rolled back?
one for each disjoint cycle

If detection algorithm is invoked arbitrarily, there may be many cycles in the resource
graph and so we would not be able to tell which of the many deadlocked processes
caused the deadlock.

4.6 RECOVERY FROM DEADLOCK
4.6.1 Process Termination
Abort all deadlocked processes.

Abort one process at a time until the deadlock cycle is eliminated.

In which order should we choose to abort?
Priority of the process.
How long process has computed, and how much longer to completion.
Resources the process has used.
Resources process needs to complete.
How many processes will need to be terminated.
Is process interactive or batch?


4.6.2 Recovery from Deadlock: Resource Preemption
Selecting a victim minimize cost.

Rollback return to some safe state, restart process for that state.

Starvation same process may always be picked as victim, include number of
rollback in cost factor.

4.7 COMBINED APPROACH TO DEADLOCK HANDLING
Combine the three basic approaches
prevention
avoidance
detection
allowing the use of the optimal approach for each class of resources in the system.

Partition resources into hierarchically ordered classes.

Use most appropriate technique for handling deadlocks within each class.

*****************************************









UNIT - III

STORAGE MANAGEMENT

1.MEMORY MANAGEMENT

1.1 BACKGROUND
1.1.1 Introduction
Program must be brought into memory and placed within a process for it to be run.

Input queue collection of processes on the disk that are waiting to be brought into
memory to run the program.

User programs go through several steps before being run.

1.1.2 Binding of Instructions and Data to Memory
Compile time: If memory location known a priori, absolute code can be generated;
must recompile code if starting location changes.
Load time: Must generate relocatable code if memory location is not known at
compile time.
Execution time: Binding delayed until run time if the process can be moved during its
execution from one memory segment to another. Need hardware support for address
maps (e.g., base and limit registers).
Multistep Processing of a User Program


















1.1.3 Logical vs. Physical Address Space
The concept of a logical address space that is bound to a separate physical address
space is central to proper memory management.
Logical address generated by the CPU; also referred to as virtual address.
Physical address address seen by the memory unit.

Logical and physical addresses are the same in compile-time and load-time address-
binding schemes; logical (virtual) and physical addresses differ in the execution-time
address-binding scheme.

1.1.4 Memory-Management Unit (MMU)
Hardware device that maps virtual to physical address.
In MMU scheme, the value in the relocation register is added to every address
generated by a user process at the time it is sent to memory.
The user program deals with logical addresses; it never sees the real physical
addresses.

Dynamic relocation using a relocation register













1.1.5 Dynamic Loading
Routine is not loaded until it is called
Better memory-space utilization; unused routine is never loaded.
Useful when large amounts of code are needed to handle infrequently occurring cases.
No special support from the operating system is required implemented through
program design.

1.1.6 Dynamic Linking
Linking postponed until execution time.
Small piece of code, stub, used to locate the appropriate memory-resident library
routine.
Stub replaces itself with the address of the routine, and executes the routine.
Operating-system support is needed to check whether the routine is in another process's memory address space.
Dynamic linking is particularly useful for libraries.


1.2 SWAPPING
A process can be swapped temporarily out of memory to a backing store, and then
brought back into memory for continued execution.

Backing store fast disk large enough to accommodate copies of all memory images
for all users; must provide direct access to these memory images.

Roll out, roll in swapping variant used for priority-based scheduling algorithms;
lower-priority process is swapped out so higher-priority process can be loaded and
executed.

Major part of swap time is transfer time; total transfer time is directly proportional to
the amount of memory swapped.

Modified versions of swapping are found on many systems, i.e., UNIX, Linux, and
Windows.
Schematic View of Swapping
















1.3 CONTIGUOUS ALLOCATION
1.3.1 Introduction
Main memory is usually divided into two partitions:
Resident operating system, usually held in low memory with interrupt vector.
User processes then held in high memory.

1.3.2 Memory Protection
Single-partition allocation
Relocation-register scheme used to protect user processes from each other, and
from changing operating-system code and data.
Relocation register contains value of smallest physical address; limit register
contains range of logical addresses each logical address must be less than the
limit register.

Hardware Support for Relocation and Limit Registers










Multiple-partition allocation
Hole block of available memory; holes of various size are scattered throughout
memory.
When a process arrives, it is allocated memory from a hole large enough to
accommodate it.
Operating system maintains information about:
a) allocated partitions b) free partitions (hole)

1.3.3 Dynamic Storage-Allocation Problem
First-fit: Allocate the first hole that is big enough.
Best-fit: Allocate the smallest hole that is big enough; must search entire list, unless
ordered by size. Produces the smallest leftover hole.
Worst-fit: Allocate the largest hole; must also search entire list. Produces the largest
leftover hole.
First-fit and best-fit better than worst-fit in terms of speed and storage utilization.

1.3.4 Fragmentation
External Fragmentation total memory space exists to satisfy a request, but it is not
contiguous.
Internal Fragmentation allocated memory may be slightly larger than requested
memory; this size difference is memory internal to a partition, but not being used.
Reduce external fragmentation by compaction
Shuffle memory contents to place all free memory together in one large block.
Compaction is possible only if relocation is dynamic, and is done at execution time.
I/O problem
Latch job in memory while it is involved in I/O.
Do I/O only into OS buffers.

1.4 PAGING
1.4.1 Basic method
Logical address space of a process can be noncontiguous; process is allocated physical
memory whenever the latter is available.
Divide physical memory into fixed-sized blocks called frames (size is power of 2,
between 512 bytes and 8192 bytes).
Divide logical memory into blocks of same size called pages.
Keep track of all free frames.
To run a program of size n pages, need to find n free frames and load program.
Set up a page table to translate logical to physical addresses.
Internal fragmentation.

1.4.1.1 Address Translation Scheme
Address generated by CPU is divided into:
Page number (p) used as an index into a page table which contains base address
of each page in physical memory.

Page offset (d) combined with base address to define the physical memory
address that is sent to the memory unit.
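A small C sketch of this split, assuming 32-bit logical addresses and a 4 KB (2^12 byte) page size, with the page table represented as a simple array of frame numbers:

#include <stdint.h>

#define PAGE_SHIFT 12                            /* 4 KB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

uint32_t translate(uint32_t logical, const uint32_t *page_table)
{
    uint32_t p = logical >> PAGE_SHIFT;          /* page number: index into the page table */
    uint32_t d = logical & (PAGE_SIZE - 1);      /* page offset */
    uint32_t frame = page_table[p];              /* base (frame number) for that page */
    return (frame << PAGE_SHIFT) | d;            /* physical address = frame base + offset */
}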






Address Translation Architecture















Paging Example












Paging Example



















1.4.1.2 Implementation of Page Table
Page table is kept in main memory.
Page-table base register (PTBR) points to the page table.
Page-table length register (PTLR) indicates the size of the page table.
In this scheme every data/instruction access requires two memory accesses. One for
the page table and one for the data/instruction.
The two memory access problem can be solved by the use of a special fast-lookup
hardware cache called associative memory or translation look-aside buffers (TLBs)

1.4.1.3 Associative Memory
Associative memory supports a parallel search.
Address translation for (A, A'):
If A (the page number) is in an associative register, get the frame number out.
Otherwise, get the frame number from the page table in memory.


1.4.2 Paging Hardware With TLB














1.4.2.1. Effective Access Time
Associative lookup = ε time units
Assume the memory cycle time is 1 microsecond.
Hit ratio: percentage of times that a page number is found in the associative registers;
the ratio is related to the number of associative registers.
Hit ratio = α
Effective Access Time (EAT):
EAT = (1 + ε)α + (2 + ε)(1 - α) = 2 + ε - α
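For example (an illustrative calculation): with ε = 0.2 microseconds and a hit ratio α = 0.8, EAT = (1.2)(0.8) + (2.2)(0.2) = 0.96 + 0.44 = 1.4 microseconds, versus 2 microseconds when every reference must go through the in-memory page table.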
1.4.3 Memory Protection
Memory protection implemented by associating protection bit with each frame.

Valid-invalid bit attached to each entry in the page table:
valid indicates that the associated page is in the process's logical address space,
and is thus a legal page.
invalid indicates that the page is not in the process's logical address space.

Valid (v) or Invalid (i) Bit In A Page Table












1.4.4 Page Table Structure
Hierarchical Paging
Hashed Page Tables
Inverted Page Tables

1.4.4.1 Hierarchical Page Tables
Break up the logical address space into multiple page tables.
A simple technique is a two-level page table.
Two-Level Paging Example
A logical address (on 32-bit machine with 4K page size) is divided into:
a page number consisting of 20 bits.
a page offset consisting of 12 bits.
Since the page table is paged, the page number is further divided into:
a 10-bit page number.
a 10-bit page offset.
Thus, a logical address is divided as: page number p1 (10 bits), page number p2 (10 bits),
page offset d (12 bits), where p1 is an index into the outer page table, and p2 is the
displacement within the page of the outer page table.
Two-Level Page-Table Scheme














Address-Translation Scheme













Address-translation scheme for a two-level 32-bit paging architecture

1.4.4.2 Hashed Page Tables
Common in address spaces > 32 bits.
The virtual page number is hashed into a page table. This page table contains a chain
of elements hashing to the same location.
Virtual page numbers are compared in this chain searching for a match. If a match is
found, the corresponding physical frame is extracted.



Hashed Page Table















Inverted Page Table
One entry for each real page of memory.
Entry consists of the virtual address of the page stored in that real memory location,
with information about the process that owns that page.
Decreases memory needed to store each page table, but increases time needed to search
the table when a page reference occurs.
Use hash table to limit the search to one or at most a few page-table entries.

1.4.4.3 Inverted Page Table Architecture














1.4.5 Shared Pages
Shared code
One copy of read-only (reentrant) code shared among processes (i.e., text editors,
compilers, window systems).
Shared code must appear in same location in the logical address space of all
processes.

Private code and data
Each process keeps a separate copy of the code and data.
The pages for the private code and data can appear anywhere in the logical address
space.

Shared Pages Example





















1.5 SEGMENTATION
1.5.1 Basic method
Memory-management scheme that supports user view of memory.
A program is a collection of segments. A segment is a logical unit such as:
main program,
procedure,
function,
method,
object,
local variables, global variables,
common block,
stack,
symbol table, arrays
User's View of a Program














1.5.2 Segmentation Architecture
Logical address consists of a two-tuple:
<segment-number, offset>
Segment table maps two-dimensional user-defined addresses into one-dimensional physical addresses; each table entry has:
base - contains the starting physical address where the segment resides in memory.
limit - specifies the length of the segment.
Segment-table base register (STBR) points to the segment table's location in memory.
Segment-table length register (STLR) indicates the number of segments used by a program;
segment number s is legal if s < STLR. (A C sketch of this translation appears after the list below.)
Relocation.
dynamic
by segment table

Sharing.
shared segments
same segment number

Allocation.
first fit/best fit
external fragmentation
Protection. With each entry in the segment table associate:
validation bit = 0 => illegal segment
read/write/execute privileges
Protection bits associated with segments; code sharing occurs at segment level.
Since segments vary in length, memory allocation is a dynamic storage-allocation
problem.
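A hedged C sketch of the segment-table translation and checks described above (mapping <s, d> to a physical address); the structure and function names are illustrative only:

#include <stdio.h>

struct segment { unsigned long base, limit; };

/* Returns 0 on success, -1 on an addressing error (would trap). */
int seg_translate(const struct segment *table, unsigned long stlr,
                  unsigned long s, unsigned long d, unsigned long *phys)
{
    if (s >= stlr)              /* illegal segment number          */
        return -1;
    if (d >= table[s].limit)    /* offset beyond the segment length */
        return -1;
    *phys = table[s].base + d;  /* relocation by the segment base   */
    return 0;
}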
A segmentation example is shown in the following diagram

Segmentation Hardware












Example of Segmentation

















1.6 SEGMENTATION WITH PAGING
1.6.1 Multics
The MULTICS system solved problems of external fragmentation and lengthy search
times by paging the segments.
Solution differs from pure segmentation in that the segment-table entry contains not
the base address of the segment, but rather the base address of a page table for this
segment.


MULTICS Address Translation Scheme
















1.6.2 Intel 386
As shown in the following diagram, the Intel 386 uses segmentation with paging for
memory management with a two-level paging scheme.
Intel 80386 Address Translation



















2.VIRTUAL MEMORY

2.1.BACKGROUND
Virtual memory separation of user logical memory from physical memory.
Only part of the program needs to be in memory for execution.
Logical address space can therefore be much larger than physical address
space.
Allows address spaces to be shared by several processes.
Allows for more efficient process creation.

Virtual memory can be implemented via:
Demand paging
Demand segmentation
Virtual Memory That is Larger Than Physical Memory












2.2 DEMAND PAGING
2.2.1 Basic concepts
Bring a page into memory only when it is needed.
Less I/O needed
Less memory needed
Faster response
More users

Page is needed => reference to it
invalid reference => abort
not-in-memory => bring to memory

Transfer of a Paged Memory to Contiguous Disk Space










Valid-Invalid Bit
With each page table entry a valid-invalid bit is associated
(1 => in-memory, 0 => not-in-memory)
Initially, the valid-invalid bit is set to 0 on all entries.
Example of a page table snapshot.

During address translation, if the valid-invalid bit in the page table entry is 0 => page fault.

Page Table When Some Pages Are Not in Main Memory











Page Fault
If there is a reference to a page, the first reference will trap to the OS => page fault.
The OS looks at another table to decide:
Invalid reference => abort.
Just not in memory => bring it in.
Get an empty frame.
Swap the page into the frame.
Reset tables, set validation bit = 1.
Restart the instruction that caused the fault. Tricky cases:
block move
auto increment/decrement location
Steps in Handling a Page Fault
















What happens if there is no free frame?
Page replacement - find some page in memory that is not really in use and swap it out.
Need a page-replacement algorithm.
Performance - want an algorithm that will result in the minimum number of page faults.

The same page may be brought into memory several times.

2.2.2 Performance of Demand Paging
Page Fault Rate p, 0 <= p <= 1.0
if p = 0, no page faults
if p = 1, every reference is a fault

Effective Access Time (EAT)
EAT = (1 - p) x memory access
      + p x (page fault overhead
             + [swap page out]
             + swap page in
             + restart overhead)

2.2.3 Demand Paging Example
Memory access time = 1 microsecond

50% of the time the page that is being replaced has been modified and therefore needs to be swapped out.

Swap Page Time = 10 msec = 10,000 microseconds
Average page-fault service time = 0.5 x 10,000 + 0.5 x 20,000 = 15,000 microseconds
EAT = (1 - p) x 1 + p x 15,000
    = 1 + 14,999 p, roughly 1 + 15,000 p (in microseconds)

2.3 PROCESS CREATION
Virtual memory allows other benefits during process creation:
- Copy-on-Write
- Memory-Mapped Files

2.3.1 Copy-on-Write
Copy-on-Write (COW) allows both parent and child processes to initially share the
same pages in memory.
If either process modifies a shared page, only then is the page copied.
COW allows more efficient process creation as only modified pages are copied.
Free pages are allocated from a pool of zeroed-out pages.
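A small user-level illustration of why sharing pages after fork() is safe: a write by the child obtains a private copy, so the parent's value is unchanged. Standard POSIX calls only; the values printed are just for demonstration:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int value = 42;
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); exit(1); }
    if (pid == 0) {              /* child: its write triggers a private copy */
        value = 99;
        printf("child  sees %d\n", value);   /* 99 */
        _exit(0);
    }
    wait(NULL);
    printf("parent sees %d\n", value);       /* still 42 */
    return 0;
}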

2.3.2 Memory-Mapped Files
Memory-mapped file I/O allows file I/O to be treated as routine memory access by
mapping a disk block to a page in memory.

A file is initially read using demand paging. A page-sized portion of the file is read
from the file system into a physical page. Subsequent reads/writes to/from the file are
treated as ordinary memory accesses.

Simplifies file access by treating file I/O through memory rather than read() write()
system calls.
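A minimal user-level example of memory-mapped file I/O using the POSIX mmap call; the file path is only an example, and error handling is kept short:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);       /* any readable file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    fwrite(data, 1, st.st_size, stdout);            /* ordinary memory access */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}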

Memory Mapped Files















2.4 PAGE REPLACEMENT
2.4.1 Need for Page replacement
Prevent over-allocation of memory by modifying page-fault service routine to include
page replacement.
Use modify (dirty) bit to reduce overhead of page transfers only modified pages are
written to disk.
Page replacement completes separation between logical memory and physical memory
large virtual memory can be provided on a smaller physical memory.
Need For Page Replacement











2.4.2 Basic Page Replacement
Find the location of the desired page on disk.
Find a free frame:
- If there is a free frame, use it.
- If there is no free frame, use a page replacement algorithm to select a victim
frame.
Read the desired page into the (newly) free frame. Update the page and frame tables.
Restart the process.


Page Replacement













2.4.3 Page Replacement Algorithms
Want lowest page-fault rate.
Evaluate algorithm by running it on a particular string of memory references (reference
string) and computing the number of page faults on that string.
In all our examples, the reference string is
1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5.

2.4.3.1 First-In-First-Out (FIFO) Algorithm
Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
3 frames (3 pages can be in memory at a time per process)









4 frames

FIFO Replacement - Belady's Anomaly: more frames can result in more page faults.
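The anomaly can be reproduced with a short, self-contained simulation of FIFO replacement on the reference string used in these notes (3 frames give 9 faults, 4 frames give 10):

#include <stdio.h>
#include <string.h>

static int fifo_faults(const int *ref, int n, int frames)
{
    int mem[16];
    int used = 0, next = 0, faults = 0;
    memset(mem, -1, sizeof mem);

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (mem[j] == ref[i]) { hit = 1; break; }
        if (hit)
            continue;
        faults++;
        if (used < frames) {
            mem[used++] = ref[i];        /* free frame available */
        } else {
            mem[next] = ref[i];          /* evict the oldest page */
            next = (next + 1) % frames;
        }
    }
    return faults;
}

int main(void)
{
    int ref[] = {1,2,3,4,1,2,5,1,2,3,4,5};
    int n = sizeof ref / sizeof ref[0];
    printf("3 frames: %d faults\n", fifo_faults(ref, n, 3));   /* 9  */
    printf("4 frames: %d faults\n", fifo_faults(ref, n, 4));   /* 10 */
    return 0;
}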

FIFO Page Replacement











FIFO Illustrating Belady's Anomaly












2.4.3.2 Optimal Algorithm
Replace page that will not be used for longest period of time.
4 frames example
1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5



How do you know this? (It requires knowledge of future references, so it cannot be implemented directly.)
Used for measuring how well your algorithm performs.


Optimal Page Replacement











2.4.3.3 Least Recently Used (LRU) Algorithm
Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5




Counter implementation
Every page entry has a counter; every time the page is referenced through this entry, copy the clock into the counter.
When a page needs to be replaced, look at the counters to determine which page to replace (the one with the oldest time).





LRU Page Replacement











Stack implementation - keep a stack of page numbers in a doubly linked form:
Page referenced:
move it to the top
requires 6 pointers to be changed
No search for replacement


Use Of A Stack to Record The Most Recent Page References














2.4.3.4 LRU Approximation Algorithms
2.4.3.4.1 Additional Reference bit algorithm
Reference bit
With each page associate a bit, initially = 0.
When the page is referenced, the bit is set to 1.
Replace a page whose bit is 0 (if one exists). We do not know the order, however.
Second chance
Need reference bit.
Clock replacement.
If the page to be replaced (in clock order) has reference bit = 1, then:
set the reference bit to 0.
leave page in memory.
replace next page (in clock order), subject to same rules.
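A runnable sketch of this clock scan; the resident pages and reference bits are made-up example values:

#include <stdio.h>

#define NFRAMES 4

static int page[NFRAMES]   = {7, 3, 5, 9};   /* resident pages (example)  */
static int refbit[NFRAMES] = {1, 0, 1, 1};   /* reference bits (example)  */
static int hand = 0;                         /* clock hand                */

static int select_victim(void)
{
    for (;;) {
        if (refbit[hand] == 0) {             /* second chance already used */
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        refbit[hand] = 0;                    /* give a second chance       */
        hand = (hand + 1) % NFRAMES;
    }
}

int main(void)
{
    int v = select_victim();
    printf("evict frame %d (page %d)\n", v, page[v]);   /* frame 1, page 3 */
    return 0;
}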



2.4.3.4.2 Second-Chance (clock) Page-Replacement Algorithm
















2.4.3.5 Counting Algorithms
Keep a counter of the number of references that have been made to each page.
LFU Algorithm: replaces page with smallest count.
MFU Algorithm: based on the argument that the page with the smallest count was
probably just brought in and has yet to be used.

2.5 ALLOCATION OF FRAMES
Each process needs minimum number of pages.
Example: IBM 370 6 pages to handle SS MOVE instruction:
instruction is 6 bytes, might span 2 pages.
2 pages to handle from.
2 pages to handle to.
Two major allocation schemes.
fixed allocation
priority allocation


2.5.1 Fixed Allocation
Equal allocation - e.g., if 100 frames and 5 processes, give each process 20 frames.
Proportional allocation - allocate according to the size of the process.

2.5.2 Priority Allocation
Use a proportional allocation scheme using priorities rather than size.
If process Pi generates a page fault,
select for replacement one of its frames.
select for replacement a frame from a process with a lower priority number.

2.5.3 Global vs. Local Allocation
Global replacement - a process selects a replacement frame from the set of all frames; one process can take a frame from another.
Local replacement - each process selects from only its own set of allocated frames.

2.6 THRASHING
2.6.1 Causes of Thrashing
If a process does not have enough pages, the page-fault rate is very high. This leads
to:
low CPU utilization.
operating system thinks that it needs to increase the degree of multiprogramming.
another process added to the system.
Thrashing: a process is busy swapping pages in and out.

Thrashing











Why does paging work?
Locality model
Process migrates from one locality to another.
Localities may overlap.
Why does thrashing occur?
Σ (size of locality) > total memory size

2.6.2 Working-Set Model
Δ = working-set window = a fixed number of page references
Example: 10,000 instructions
WSSi (working set of process Pi) = total number of pages referenced in the most recent Δ (varies in time)
if Δ is too small, it will not encompass the entire locality.
if Δ is too large, it will encompass several localities.
if Δ = ∞, it will encompass the entire program.
D = Σ WSSi = total demand for frames
if D > m => thrashing
Policy: if D > m, then suspend one of the processes.
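A toy illustration of this admission rule (the working-set sizes and frame count are invented for the example):

#include <stdio.h>

int main(void)
{
    int wss[] = {12, 30, 25, 18};        /* per-process working-set sizes */
    int m = 64, D = 0;                   /* m = total frames available    */
    for (int i = 0; i < (int)(sizeof wss / sizeof wss[0]); i++)
        D += wss[i];                     /* D = sum of WSS_i              */
    printf("D = %d, m = %d -> %s\n", D, m,
           D > m ? "thrashing risk: suspend a process" : "ok");
    return 0;
}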


Working-set model









Keeping Track of the Working Set
Approximate with an interval timer + a reference bit
Example: Δ = 10,000
Timer interrupts after every 5,000 time units.
Keep 2 history bits in memory for each page.
Whenever the timer interrupts, copy the reference bits into the history bits and then set all reference bits to 0.
If one of the bits in memory = 1 => the page is in the working set.
Why is this not completely accurate? (A reference could have occurred anywhere within the 5,000-unit interval.)
Improvement: 10 history bits and an interrupt every 1,000 time units.

Page-Fault Frequency Scheme















Establish acceptable page-fault rate.
If actual rate too low, process loses frame.
If actual rate too high, process gains frame.


3.CASE STUDY:
3.1MEMORY MANAGEMENT IN LINUX
Rather than describing the theory of memory management in operating systems,
this section tries to pinpoint the main features of the Linux implementation. Although you
do not need to be a Linux virtual memory guru to implement mmap, a basic overview of
how things work is useful. What follows is a fairly lengthy description of the data
structures used by the kernel to manage memory. Once the necessary background has
been covered, we can get into working with these structures.
3.1.1. Address Types
Linux is, of course, a virtual memory system, meaning that the addresses seen by user
programs do not directly correspond to the physical addresses used by the hardware.
Virtual memory introduces a layer of indirection that allows a number of nice things.
With virtual memory, programs running on the system can allocate far more memory than
is physically available; indeed, even a single process can have a virtual address space
larger than the system's physical memory. Virtual memory also allows the program to
play a number of tricks with the process's address space, including mapping the program's
memory to device memory.
Thus far, we have talked about virtual and physical addresses, but a number of the details
have been glossed over. The Linux system deals with several types of addresses, each
with its own semantics. Unfortunately, the kernel code is not always very clear on exactly
which type of address is being used in each situation, so the programmer must be careful.
The following is a list of address types used in Linux. Figure 15-1 shows how these
address types relate to physical memory.
User virtual addresses
These are the regular addresses seen by user-space programs. User addresses are
either 32 or 64 bits in length, depending on the underlying hardware architecture,
and each process has its own virtual address space.
Physical addresses
The addresses used between the processor and the system's memory. Physical
addresses are 32- or 64-bit quantities; even 32-bit systems can use larger physical
addresses in some situations.
Bus addresses
The addresses used between peripheral buses and memory. Often, they are the
same as the physical addresses used by the processor, but that is not necessarily
the case. Some architectures can provide an I/O memory management unit
(IOMMU) that remaps addresses between a bus and main memory. An IOMMU
can make life easier in a number of ways (making a buffer scattered in memory
appear contiguous to the device, for example), but programming the IOMMU is
an extra step that must be performed when setting up DMA operations. Bus
addresses are highly architecture dependent, of course.
Kernel logical addresses
These make up the normal address space of the kernel. These addresses map some
portion (perhaps all) of main memory and are often treated as if they were
physical addresses. On most architectures, logical addresses and their associated
physical addresses differ only by a constant offset. Logical addresses use the
hardware's native pointer size and, therefore, may be unable to address all of
physical memory on heavily equipped 32-bit systems. Logical addresses are
usually stored in variables of type unsigned long or void *. Memory returned from
kmalloc has a kernel logical address.
Kernel virtual addresses
Kernel virtual addresses are similar to logical addresses in that they are a mapping
from a kernel-space address to a physical address. Kernel virtual addresses do not
necessarily have the linear, one-to-one mapping to physical addresses that
characterize the logical address space, however. All logical addresses are kernel
virtual addresses, but many kernel virtual addresses are not logical addresses. For
example, memory allocated by vmalloc has a virtual address (but no direct
physical mapping). The kmap function (described later in this chapter) also returns
virtual addresses. Virtual addresses are usually stored in pointer variables.


Figure 15-1. Address types used in Linux


If you have a logical address, the macro _ _pa( ) (defined in <asm/page.h>) returns its
associated physical address. Physical addresses can be mapped back to logical addresses
with _ _va( ), but only for low-memory pages.
Different kernel functions require different types of addresses. It would be nice if there
were different C types defined, so that the required address types were explicit, but we
have no such luck. In this chapter, we try to be clear on which types of addresses are used
where.

3.1.2. Physical Addresses and Pages
Physical memory is divided into discrete units called pages. Much of the system's internal
handling of memory is done on a per-page basis. Page size varies from one architecture to
the next, although most systems currently use 4096-byte pages. The constant PAGE_SIZE
(defined in <asm/page.h>) gives the page size on any given architecture.
If you look at a memory address (virtual or physical), it is divisible into a page number
and an offset within the page. If 4096-byte pages are being used, for example, the 12
least-significant bits are the offset, and the remaining, higher bits indicate the page
number. If you discard the offset and shift the rest of the address to the right, the result is
called a page frame number (PFN). Shifting bits to convert between page frame numbers
and addresses is a fairly common operation; the macro PAGE_SHIFT tells how many bits
must be shifted to make this conversion.
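For example, the split can be done with plain shifts and masks; the sketch below assumes 4096-byte pages (PAGE_SHIFT = 12) and an arbitrary sample address, and is ordinary user-space arithmetic rather than kernel code:

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void)
{
    unsigned long addr   = 0x12345678UL;              /* arbitrary example  */
    unsigned long pfn    = addr >> PAGE_SHIFT;        /* page frame number  */
    unsigned long offset = addr & (PAGE_SIZE - 1);    /* offset in the page */
    printf("pfn=0x%lx offset=0x%lx\n", pfn, offset);
    return 0;
}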

3.1.3. High and Low Memory
The difference between logical and kernel virtual addresses is highlighted on 32-bit
systems that are equipped with large amounts of memory. With 32 bits, it is possible to
address 4 GB of memory. Linux on 32-bit systems has, until recently, been limited to
substantially less memory than that, however, because of the way it sets up the virtual
address space.
The kernel (on the x86 architecture, in the default configuration) splits the 4-GB virtual
address space between user-space and the kernel; the same set of mappings is used in
both contexts. A typical split dedicates 3 GB to user space, and 1 GB for kernel space.
[1]

The kernel's code and data structures must fit into that space, but the biggest consumer of
kernel address space is virtual mappings for physical memory. The kernel cannot directly
manipulate memory that is not mapped into the kernel's address space. The kernel, in
other words, needs its own virtual address for any memory it must touch directly. Thus,
for many years, the maximum amount of physical memory that could be handled by the
kernel was the amount that could be mapped into the kernel's portion of the virtual
address space, minus the space needed for the kernel code itself. As a result, x86-based
Linux systems could work with a maximum of a little under 1 GB of physical memory.
[1]
Many non-x86 architectures are able to efficiently do without the
kernel/user-space split described here, so they can work with up to a 4-GB
kernel address space on 32-bit systems. The constraints described in this
section still apply to such systems when more than 4 GB of memory are
installed, however.
In response to commercial pressure to support more memory while not breaking 32-bit
application and the system's compatibility, the processor manufacturers have added
"address extension" features to their products. The result is that, in many cases, even 32-
bit processors can address more than 4 GB of physical memory. The limitation on how
much memory can be directly mapped with logical addresses remains, however. Only the
lowest portion of memory (up to 1 or 2 GB, depending on the hardware and the kernel
configuration) has logical addresses;
[2]
the rest (high memory) does not. Before accessing
a specific high-memory page, the kernel must set up an explicit virtual mapping to make
that page available in the kernel's address space. Thus, many kernel data structures must
be placed in low memory; high memory tends to be reserved for user-space process
pages.
[2]
The 2.6 kernel (with an added patch) can support a "4G/4G" mode on
x86 hardware, which enables larger kernel and user virtual address spaces
at a mild performance cost.
The term "high memory" can be confusing to some, especially since it has other meanings
in the PC world. So, to make things clear, we'll define the terms here:
Low memory
Memory for which logical addresses exist in kernel space. On almost every
system you will likely encounter, all memory is low memory.
High memory
Memory for which logical addresses do not exist, because it is beyond the address
range set aside for kernel virtual addresses.
On i386 systems, the boundary between low and high memory is usually set at just under
1 GB, although that boundary can be changed at kernel configuration time. This boundary
is not related in any way to the old 640 KB limit found on the original PC, and its
placement is not dictated by the hardware. It is, instead, a limit set by the kernel itself as it
splits the 32-bit address space between kernel and user space.
We will point out limitations on the use of high memory as we come to them in this
chapter.
3.1.4. The Memory Map and Struct Page
Historically, the kernel has used logical addresses to refer to pages of physical memory.
The addition of high-memory support, however, has exposed an obvious problem with
that approach: logical addresses are not available for high memory. Therefore, kernel
functions that deal with memory are increasingly using pointers to struct page (defined in
<linux/mm.h>) instead. This data structure is used to keep track of just about everything
the kernel needs to know about physical memory; there is one struct page for each physical
page on the system. Some of the fields of this structure include the following:
atomic_t count;
The number of references there are to this page. When the count drops to 0, the
page is returned to the free list.
void *virtual;
The kernel virtual address of the page, if it is mapped; NULL, otherwise. Low-
memory pages are always mapped; high-memory pages usually are not. This field
does not appear on all architectures; it generally is compiled only where the kernel
virtual address of a page cannot be easily calculated. If you want to look at this
field, the proper method is to use the page_address macro, described below.
unsigned long flags;
A set of bit flags describing the status of the page. These include PG_locked, which
indicates that the page has been locked in memory, and PG_reserved, which
prevents the memory management system from working with the page at all.
There is much more information within struct page, but it is part of the deeper black magic
of memory management and is not of concern to driver writers.
The kernel maintains one or more arrays of struct page entries that track all of the physical
memory on the system. On some systems, there is a single array called mem_map. On
some systems, however, the situation is more complicated. Nonuniform memory access
(NUMA) systems and those with widely discontiguous physical memory may have more
than one memory map array, so code that is meant to be portable should avoid direct
access to the array whenever possible. Fortunately, it is usually quite easy to just work
with struct page pointers without worrying about where they come from.
Some functions and macros are defined for translating between struct page pointers and
virtual addresses:
struct page *virt_to_page(void *kaddr);
This macro, defined in <asm/page.h>, takes a kernel logical address and returns its
associated struct page pointer. Since it requires a logical address, it does not work
with memory from vmalloc or high memory.
struct page *pfn_to_page(int pfn);
Returns the struct page pointer for the given page frame number. If necessary, it
checks a page frame number for validity with pfn_valid before passing it to
pfn_to_page.
void *page_address(struct page *page);
Returns the kernel virtual address of this page, if such an address exists. For high
memory, that address exists only if the page has been mapped. This function is
defined in <linux/mm.h>. In most situations, you want to use a version of kmap
rather than page_address.
#include <linux/highmem.h>
void *kmap(struct page *page);
void kunmap(struct page *page);
kmap returns a kernel virtual address for any page in the system. For low-memory
pages, it just returns the logical address of the page; for high-memory pages, kmap
creates a special mapping in a dedicated part of the kernel address space.
Mappings created with kmap should always be freed with kunmap; a limited
number of such mappings is available, so it is better not to hold on to them for too
long. kmap calls maintain a counter, so if two or more functions both call kmap on
the same page, the right thing happens. Note also that kmap can sleep if no
mappings are available.
#include <linux/highmem.h>
#include <asm/kmap_types.h>
void *kmap_atomic(struct page *page, enum km_type type);
void kunmap_atomic(void *addr, enum km_type type);
kmap_atomic is a high-performance form of kmap. Each architecture maintains a
small list of slots (dedicated page table entries) for atomic kmaps; a caller of
kmap_atomic must tell the system which of those slots to use in the type argument.
The only slots that make sense for drivers are KM_USER0 and KM_USER1 (for code
running directly from a call from user space), and KM_IRQ0 and KM_IRQ1 (for
interrupt handlers). Note that atomic kmaps must be handled atomically; your
code cannot sleep while holding one. Note also that nothing in the kernel keeps
two functions from trying to use the same slot and interfering with each other
(although there is a unique set of slots for each CPU). In practice, contention for
atomic kmap slots seems to not be a problem.
We see some uses of these functions when we get into the example code, later in this
chapter and in subsequent chapters.
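As a hedged illustration (kernel-module context, 2.6-era interfaces as in this text, not user-space code), copying data out of a page that may live in high memory might look like this:

#include <linux/types.h>
#include <linux/string.h>
#include <linux/highmem.h>

static void copy_from_page(struct page *page, void *dst, size_t len)
{
    void *vaddr = kmap(page);     /* map the page (may sleep)           */
    memcpy(dst, vaddr, len);      /* use it like ordinary kernel memory */
    kunmap(page);                 /* always release the mapping         */
}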

3.1.5. Page Tables
On any modern system, the processor must have a mechanism for translating virtual
addresses into its corresponding physical addresses. This mechanism is called a page
table; it is essentially a multilevel tree-structured array containing virtual-to-physical
mappings and a few associated flags. The Linux kernel maintains a set of page tables
even on architectures that do not use such tables directly.
A number of operations commonly performed by device drivers can involve manipulating
page tables. Fortunately for the driver author, the 2.6 kernel has eliminated any need to
work with page tables directly. As a result, we do not describe them in any detail; curious
readers may want to have a look at Understanding The Linux Kernel by Daniel P. Bovet
and Marco Cesati (O'Reilly) for the full story.
3.1.6. Virtual Memory Areas
The virtual memory area (VMA) is the kernel data structure used to manage distinct
regions of a process's address space. A VMA represents a homogeneous region in the
virtual memory of a process: a contiguous range of virtual addresses that have the same
permission flags and are backed up by the same object (a file, say, or swap space). It
corresponds loosely to the concept of a "segment," although it is better described as "a
memory object with its own properties." The memory map of a process is made up of (at
least) the following areas:
- An area for the program's executable code (often called text)
- Multiple areas for data, including initialized data (that which has an explicitly
assigned value at the beginning of execution), uninitialized data (BSS),
[3]
and the
program stack
[3]
The name BSS is a historical relic from an old assembly operator
meaning "block started by symbol." The BSS segment of
executable files isn't stored on disk, and the kernel maps the zero
page to the BSS address range.
- One area for each active memory mapping
The memory areas of a process can be seen by looking in /proc/<pid>/maps (in which pid,
of course, is replaced by a process ID). /proc/self is a special case of /proc/pid, because it
always refers to the current process. As an example, here are a couple of memory maps
(to which we have added short comments in italics):
The fields in each line are:
start-end perm offset major:minor inode image

Each field in /proc/*/maps (except the image name) corresponds to a field in struct
vm_area_struct:
start
end
The beginning and ending virtual addresses for this memory area.
perm
A bit mask with the memory area's read, write, and execute permissions. This field
describes what the process is allowed to do with pages belonging to the area. The
last character in the field is either p for "private" or s for "shared."
offset
Where the memory area begins in the file that it is mapped to. An offset of 0
means that the beginning of the memory area corresponds to the beginning of the
file.
major
minor
The major and minor numbers of the device holding the file that has been mapped.
Confusingly, for device mappings, the major and minor numbers refer to the disk
partition holding the device special file that was opened by the user, and not the
device itself.
inode
The inode number of the mapped file.
image
The name of the file (usually an executable image) that has been mapped.
3.1.6.1 The vm_area_struct structure
When a user-space process calls mmap to map device memory into its address space, the
system responds by creating a new VMA to represent that mapping. A driver that
supports mmap (and, thus, that implements the mmap method) needs to help that process
by completing the initialization of that VMA. The driver writer should, therefore, have at
least a minimal understanding of VMAs in order to support mmap.
Let's look at the most important fields in struct vm_area_struct (defined in <linux/mm.h>).
These fields may be used by device drivers in their mmap implementation. Note that the
kernel maintains lists and trees of VMAs to optimize area lookup, and several fields of
vm_area_struct are used to maintain this organization. Therefore, VMAs can't be created at
will by a driver, or the structures break. The main fields of VMAs are as follows (note the
similarity between these fields and the /proc output we just saw):
unsigned long vm_start; unsigned long vm_end;
The virtual address range covered by this VMA. These fields are the first two
fields shown in /proc/*/maps.
struct file *vm_file;
A pointer to the struct file structure associated with this area (if any).
unsigned long vm_pgoff;
The offset of the area in the file, in pages. When a file or device is mapped, this is
the file position of the first page mapped in this area.
unsigned long vm_flags;
A set of flags describing this area. The flags of the most interest to device driver
writers are VM_IO and VM_RESERVED. VM_IO marks a VMA as being a memory-
mapped I/O region. Among other things, the VM_IO flag prevents the region from
being included in process core dumps. VM_RESERVED tells the memory
management system not to attempt to swap out this VMA; it should be set in most
device mappings.
struct vm_operations_struct *vm_ops;
A set of functions that the kernel may invoke to operate on this memory area. Its
presence indicates that the memory area is a kernel "object," like the struct file we
have been using throughout the book.
void *vm_private_data;
A field that may be used by the driver to store its own information.
Like struct vm_area_struct, the vm_operations_struct is defined in <linux/mm.h>; it includes the
operations listed below. These operations are the only ones needed to handle the process's
memory needs, and they are listed in the order they are declared. Later in this chapter,
some of these functions are implemented.
void (*open)(struct vm_area_struct *vma);
The open method is called by the kernel to allow the subsystem implementing the
VMA to initialize the area. This method is invoked any time a new reference to
the VMA is made (when a process forks, for example). The one exception
happens when the VMA is first created by mmap; in this case, the driver's mmap
method is called instead.
void (*close)(struct vm_area_struct *vma);
When an area is destroyed, the kernel calls its close operation. Note that there's no
usage count associated with VMAs; the area is opened and closed exactly once by
each process that uses it.
struct page *(*nopage)(struct vm_area_struct *vma, unsigned
long address, int
*type);
When a process tries to access a page that belongs to a valid VMA, but that is
currently not in memory, the nopage method is called (if it is defined) for the
related area. The method returns the struct page pointer for the physical page after,
perhaps, having read it in from secondary storage. If the nopage method isn't
defined for the area, an empty page is allocated by the kernel.
int (*populate)(struct vm_area_struct *vm, unsigned long
address, unsigned
long len, pgprot_t prot, unsigned long pgoff, int nonblock);
This method allows the kernel to "prefault" pages into memory before they are
accessed by user space. There is generally no need for drivers to implement the
populate method.
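As a hedged sketch (modeled on the prototypes above, 2.6-era interfaces), a driver might install trivial open/close methods like these from its own mmap method:

#include <linux/kernel.h>
#include <linux/mm.h>

static void simple_vma_open(struct vm_area_struct *vma)
{
    printk(KERN_NOTICE "VMA open, virt %lx, phys off %lx\n",
           vma->vm_start, vma->vm_pgoff << PAGE_SHIFT);
}

static void simple_vma_close(struct vm_area_struct *vma)
{
    printk(KERN_NOTICE "VMA close\n");
}

/* The driver's mmap method would point vma->vm_ops at this table. */
static struct vm_operations_struct simple_vm_ops = {
    .open  = simple_vma_open,
    .close = simple_vma_close,
};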

UNIT IV

FILE SYSTEMS

1. FILE-SYSTEM INTERFACE

1.1 FILE CONCEPT
1.1.1 Definition
Contiguous logical address space
Types:
Data
numeric
character
binary
Program

1.1.2 File Structure
None - sequence of words, bytes
Simple record structure
Lines
Fixed length
Variable length
Complex Structures
Formatted document
Relocatable load file
Can simulate last two with first method by inserting appropriate control characters.
Who decides:
Operating system
Program

1.1.3 File Attributes
Name only information kept in human-readable form.
Type needed for systems that support different types.
Location pointer to file location on device.
Size current file size.
Protection controls who can do reading, writing, executing.
Time, date, and user identification data for protection, security, and usage
monitoring.
Information about files are kept in the directory structure, which is maintained on
the disk.

1.1.4 File Operations
Create
Write
Read
Reposition within file file seek
Delete
Truncate
Open(Fi) - search the directory structure on disk for entry Fi, and move the content of the entry to memory.
Close(Fi) - move the content of entry Fi in memory to the directory structure on disk.
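At the user level these operations map onto the POSIX calls; a minimal sketch (the file name is hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    int fd = open("notes.txt", O_RDONLY);      /* hypothetical file   */
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = read(fd, buf, sizeof buf);      /* sequential read     */
    lseek(fd, 0, SEEK_SET);                     /* reposition (seek)   */
    printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}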



1.1.5 File Types Name, Extension

















1.2 ACCESS METHODS

Sequential Access
read next
write next
reset
no read after last write
(rewrite)
Direct Access
read n
write n
position to n
read next
write next
rewrite n
n = relative block number
Sequential-access File






Simulation of Sequential Access on a Direct-access File










Example of Index and Relative Files














1.3 DIRECTORY STRUCTURE

1.3.1 Introduction
A collection of nodes containing information about all files.

A Typical File-system Organization









1.3.2 Information in a Device Directory
Name
Type
Address
Current length
Maximum length
Date last accessed (for archival)
Date last updated (for dump)
Owner ID (who pays)
Protection information (discuss later)

1.3.3 Operations Performed on Directory
Search for a file
Create a file
Delete a file
List a directory
Rename a file
Traverse the file system




1.3.4 Organize the Directory (Logically) to Obtain
Efficiency locating a file quickly.
Naming convenient to users.
Two users can have same name for different files.
The same file can have several different names.
Grouping - logical grouping of files by properties (e.g., all Java programs, all games, ...)

1.3.5 Single-Level Directory
A single directory for all users.









1.3.6 Two-Level Directory
Separate directory for each user.








1.3.7 Tree-Structured Directories










Efficient searching
Grouping Capability
Current directory (working directory)
cd /spell/mail/prog
type list
Absolute or relative path name
Creating a new file is done in current directory.
Delete a file
rm <file-name>
Creating a new subdirectory is done in current directory.
mkdir <dir-name>
Example: if in current directory /mail
mkdir count


1.3.8 Acyclic-Graph Directories
Have shared subdirectories and files.

Acyclic-Graph Directories











Two different names (aliasing)
If dict deletes list => dangling pointer.
Solutions:
Backpointers, so we can delete all pointers.
Variable size records a problem.
Backpointers using a daisy chain organization.
Entry-hold-count solution.

1.3.9 General Graph Directory













How do we guarantee no cycles?
Allow only links to file not subdirectories.
Garbage collection.
Every time a new link is added use a cycle detection
algorithm to determine whether it is OK.

1.4 FILE SYSTEM MOUNTING

A file system must be mounted before it can be accessed.
An unmounted file system (e.g., Fig. 11-11(b)) is mounted at a mount point.




(a) Existing. (b) Unmounted Partition













Mount Point














1.5 FILE SHARING

1.5.1 Introduction
Sharing of files on multi-user systems is desirable.
Sharing may be done through a protection scheme.
On distributed systems, files may be shared across a network.
Network File System (NFS) is a common distributed file-sharing method.

1.5.2 File Sharing Multiple Users
User IDs identify users, allowing permissions and protections to be per-user
Group IDs allow users to be in groups, permitting group access rights

1.5.3 File Sharing Remote File Systems
Uses networking to allow file system access between systems
Manually via programs like FTP
Automatically, seamlessly using distributed file systems
Semi automatically via the world wide web
Client-server model allows clients to mount remote file systems from servers
Server can serve multiple clients
Client and user-on-client identification is insecure or complicated
NFS is standard UNIX client-server file sharing protocol


CIFS is standard Windows protocol
Standard operating system file calls are translated into remote calls
Distributed Information Systems (distributed naming services) such as LDAP, DNS,
NIS, Active Directory implement unified access to information needed for remote
computing

1.5.4 File Sharing Failure Modes
Remote file systems add new failure modes, due to network failure, server failure
Recovery from failure can involve state information about status of each remote
request
Stateless protocols such as NFS include all information in each request, allowing
easy recovery but less security

1.5.5 File Sharing Consistency Semantics
Consistency semantics specify how multiple users are to access a shared file
simultaneously
Similar to Ch 7 process synchronization algorithms
Tend to be less complex due to disk I/O and network latency (for remote file
systems)
Andrew File System (AFS) implemented complex remote file sharing
semantics
Unix file system (UFS) implements:
Writes to an open file visible immediately to other users of the same open file
Sharing file pointer to allow multiple users to read and write concurrently
AFS has session semantics
Writes only visible to sessions starting after the file is closed

1.6 PROTECTION

1.6.1 What is Protection?
File owner/creator should be able to control:
what can be done
by whom

1.6.2 Types of Access
Types of access
Read
Write
Execute
Append
Delete
List

1.6.3 Access Lists and Groups
Mode of access: read, write, execute
Three classes of users:
a) owner access: 7 => RWX = 111
b) group access: 6 => RWX = 110
c) public access: 1 => RWX = 001
Ask manager to create a group (unique name), say G, and add some users to the
group.
For a particular file (say game) or subdirectory, define an appropriate access for owner, group, and public, e.g.:
chmod 761 game

Attach a group to a file:
chgrp G game






2. FILE SYSTEM IMPLEMENTATION

File System Structure
File System Implementation
Directory Implementation
Allocation Methods
Free-Space Management
Efficiency and Performance
Recovery
Log-Structured File Systems
NFS

2.1. FILE-SYSTEM STRUCTURE

File structure
Logical storage unit
Collection of related information
File system resides on secondary storage (disks).
File system organized into layers.
File control block storage structure consisting of information about a file.


Layered File System



















A Typical File Control Block











In-Memory File System Structures
The following figure illustrates the necessary file system structures provided by the
operating systems.

Figure 12-3(a) refers to opening a file.

Figure 12-3(b) refers to reading a file.

In-Memory File System Structures

Virtual File Systems
Virtual File Systems (VFS) provide an object-oriented way of implementing file
systems.

VFS allows the same system call interface (the API) to be used for different types of
file systems.

The API is to the VFS interface, rather than any specific type of file system.

2.2.DIRECTORY IMPLEMENTATION

Linear list of file names with pointer to the data blocks.
simple to program
time-consuming to execute

Hash Table linear list with hash data structure.
decreases directory search time
collisions situations where two file names hash to the same location
fixed size

2.3.ALLOCATION METHODS

An allocation method refers to how disk blocks are allocated for files:

Contiguous allocation

Linked allocation

Indexed allocation



2.3.1.Contiguous Allocation

Each file occupies a set of contiguous blocks on the disk.

Simple only starting location (block #) and length (number of blocks) are required.

Random access.

Wasteful of space (dynamic storage-allocation problem).

Files cannot grow.

Contiguous Allocation of Disk Space
Extent-Based Systems

Many newer file systems (e.g., the Veritas File System) use a modified contiguous
allocation scheme.

Extent-based file systems allocate disk blocks in extents.

An extent is a contiguous block of disks. Extents are allocated for file allocation. A
file consists of one or more extents.

2.3.2.Linked Allocation

Each file is a linked list of disk blocks: blocks may be scattered anywhere on the
disk.
Linked Allocation (Cont.)
Simple need only starting address
Free-space management system no waste of space
No random access
Mapping
Linked Allocation
File-Allocation Table

2.3.3.Indexed Allocation

Need index table
Random access
Dynamic access without external fragmentation, but have overhead of index block.
Mapping from logical to physical in a file of maximum size of 256K words and block
size of 512 words. We need only 1 block for index table.

Indexed Allocation Mapping
Mapping from logical to physical in a file of unbounded length (block size of 512
words).
Linked scheme Link blocks of index table (no limit on size).

Two-level index (maximum file size is 512^3)





2.4.FREE-SPACE MANAGEMENT

Bit vector (n blocks)
Free-Space Management (Cont.)
Bit map requires extra space. Example:
block size = 2^12 bytes
disk size = 2^30 bytes (1 gigabyte)
n = 2^30 / 2^12 = 2^18 bits (or 32K bytes)
Easy to get contiguous files
Linked list (free list)
Cannot get contiguous space easily
No waste of space
Grouping
Counting
Need to protect:
Pointer to free list
Bit map
Must be kept on disk
Copy in memory and disk may differ.
Cannot allow for block[i] to have a situation where bit[i] = 1 in memory and
bit[i] = 0 on disk.
Solution:
Set bit[i] = 1 in disk.
Allocate block[i]
Set bit[i] = 1 in memory
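A minimal user-space sketch of such a bit vector, assuming the convention above that bit[i] = 1 marks block i free; the sizes and reserved blocks are illustrative:

#include <stdio.h>
#include <string.h>

#define NBLOCKS 1024
static unsigned char bitmap[NBLOCKS / 8];

static void set_free(int i)  { bitmap[i / 8] |=  (1u << (i % 8)); }
static void set_used(int i)  { bitmap[i / 8] &= ~(1u << (i % 8)); }

static int find_free(void)                  /* returns a block # or -1 */
{
    for (int i = 0; i < NBLOCKS; i++)
        if (bitmap[i / 8] & (1u << (i % 8)))
            return i;
    return -1;
}

int main(void)
{
    memset(bitmap, 0xff, sizeof bitmap);    /* all blocks start out free  */
    set_used(0); set_used(1);               /* e.g. boot and super blocks */
    int b = find_free();                    /* 2 */
    set_used(b);                            /* allocate it                */
    printf("allocated block %d, next free is %d\n", b, find_free());
    set_free(b);                            /* release it again           */
    printf("after freeing, first free is %d\n", find_free());
    return 0;
}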


2.5.EFFICIENCY AND PERFORMANCE

Efficiency dependent on:
disk allocation and directory algorithms
types of data kept in files directory entry

Performance
disk cache separate section of main memory for frequently used blocks
free-behind and read-ahead techniques to optimize sequential access
improve PC performance by dedicating section of memory as virtual disk, or
RAM disk.
Various Disk-Caching Locations
Page Cache
A page cache caches pages rather than disk blocks using virtual memory techniques.

Memory-mapped I/O uses a page cache.

Routine I/O through the file system uses the buffer (disk) cache.

This leads to the following figure.






2.6.RECOVERY

Consistency checking compares data in directory structure with data blocks on
disk, and tries to fix inconsistencies.

Use system programs to back up data from disk to another storage device (floppy
disk, magnetic tape).

Recover lost file or disk by restoring data from backup.

2.7.LOG STRUCTURED FILE SYSTEMS

Log structured (or journaling) file systems record each update to the file system as a
transaction.

All transactions are written to a log. A transaction is considered committed once it is
written to the log. However, the file system may not yet be updated.

The transactions in the log are asynchronously written to the file system. When the
file system is modified, the transaction is removed from the log.

If the file system crashes, all remaining transactions in the log must still be
performed.


The Sun Network File System (NFS)

An implementation and a specification of a software system for accessing remote
files across LANs (or WANs).

The implementation is part of the Solaris and SunOS operating systems running on
Sun workstations, using an unreliable datagram protocol (UDP/IP) and Ethernet.

NFS (Cont.)

Interconnected workstations viewed as a set of independent machines with
independent file systems, which allows sharing among these file systems in a
transparent manner.
A remote directory is mounted over a local file system directory. The mounted
directory looks like an integral subtree of the local file system, replacing the
subtree descending from the local directory.
Specification of the remote directory for the mount operation is
nontransparent; the host name of the remote directory has to be provided.
Files in the remote directory can then be accessed in a transparent manner.
Subject to access-rights accreditation, potentially any file system (or directory
within a file system), can be mounted remotely on top of any local directory.







Three Independent File Systems
Mounting in NFS
NFS Mount Protocol
Establishes initial logical connection between server and client.
Mount operation includes name of remote directory to be mounted and name of
server machine storing it.
Mount request is mapped to corresponding RPC and forwarded to mount server
running on server machine.
Export list specifies local file systems that server exports for mounting, along
with names of machines that are permitted to mount them.
Following a mount request that conforms to its export list, the server returns a file
handle, a key for further accesses.
File handle: a file-system identifier and an inode number to identify the mounted
directory within the exported file system.
The mount operation changes only the user's view and does not affect the server
side.
NFS Protocol
Provides a set of remote procedure calls for remote file operations. The procedures
support the following operations:
searching for a file within a directory
reading a set of directory entries
manipulating links and directories
accessing file attributes
reading and writing files
NFS servers are stateless; each request has to provide a full set of arguments.
Modified data must be committed to the server's disk before results are returned to
the client (lose advantages of caching).
The NFS protocol does not provide concurrency-control mechanisms.
Three Major Layers of NFS Architecture
UNIX file-system interface (based on the open, read, write, and close calls, and file
descriptors).

Virtual File System (VFS) layer distinguishes local files from remote ones, and
local files are further distinguished according to their file-system types.
The VFS activates file-system-specific operations to handle local requests
according to their file-system types.
Calls the NFS protocol procedures for remote requests.

NFS service layer bottom layer of the architecture; implements the NFS protocol.
Schematic View of NFS Architecture
NFS Path-Name Translation
Performed by breaking the path into component names and performing a separate
NFS lookup call for every pair of component name and directory vnode.

To make lookup faster, a directory name lookup cache on the client's side holds the
vnodes for remote directory names.
NFS Remote Operations
Nearly one-to-one correspondence between regular UNIX system calls and the NFS
protocol RPCs (except opening and closing files).
NFS adheres to the remote-service paradigm, but employs buffering and caching
techniques for the sake of performance.


3. CASE STUDIES

3.1. THE LINUX SYSTEM
History
Design Principles
Kernel Modules
Process Management
Scheduling
Memory Management
File Systems
Input and Output
Interprocess Communication
Network Structure
Security
History
Linux is a modern, free operating system based on UNIX standards.
First developed as a small but self-contained kernel in 1991 by Linus Torvalds, with
the major design goal of UNIX compatibility.
Its history has been one of collaboration by many users from all around the world,
corresponding almost exclusively over the Internet.
It has been designed to run efficiently and reliably on common PC hardware, but
also runs on a variety of other platforms.
The core Linux operating system kernel is entirely original, but it can run much
existing free UNIX software, resulting in an entire UNIX-compatible operating
system free from proprietary code.
The Linux Kernel
Version 0.01 (May 1991) had no networking, ran only on 80386-compatible Intel
processors and on PC hardware, had extremely limited device-drive support, and
supported only the Minix file system.
Linux 1.0 (March 1994) included these new features:
Support for UNIX's standard TCP/IP networking protocols
BSD-compatible socket interface for networking programming
Device-driver support for running IP over an Ethernet
Enhanced file system
Support for a range of SCSI controllers for
high-performance disk access
Extra hardware support
Version 1.2 (March 1995) was the final PC-only Linux kernel.
Linux 2.0
Released in June 1996, 2.0 added two major new capabilities:
Support for multiple architectures, including a fully 64-bit native Alpha port.
Support for multiprocessor architectures
Other new features included:
Improved memory-management code
Improved TCP/IP performance
Support for internal kernel threads, for handling dependencies between
loadable modules, and for automatic loading of modules on demand.
Standardized configuration interface
Available for Motorola 68000-series processors, Sun Sparc systems, and for PC and
PowerMac systems.



The Linux System
Linux uses many tools developed as part of Berkeley's BSD operating system, MIT's
X Window System, and the Free Software Foundation's GNU project.
The main system libraries were started by the GNU project, with improvements
provided by the Linux community.
Linux networking-administration tools were derived from 4.3BSD code; recent BSD
derivatives such as Free BSD have borrowed code from Linux in return.
The Linux system is maintained by a loose network of developers collaborating over
the Internet, with a small number of public ftp sites acting as de facto standard
repositories.
Linux Distributions
Standard, precompiled sets of packages, or distributions, include the basic Linux
system, system installation and management utilities, and ready-to-install packages
of common UNIX tools.
The first distributions managed these packages by simply providing a means of
unpacking all the files into the appropriate places; modern distributions include
advanced package management.
Early distributions included SLS and Slackware. Red Hat and Debian are popular
distributions from commercial and noncommercial sources, respectively.
The RPM Package file format permits compatibility among the various Linux
distributions.
Linux Licensing
The Linux kernel is distributed under the GNU General Public License (GPL), the
terms of which are set out by the Free Software Foundation.

Anyone using Linux, or creating their own derivative of Linux, may not make the
derived product proprietary; software released under the GPL may not be
redistributed as a binary-only product.
Design Principles
Linux is a multiuser, multitasking system with a full set of UNIX-compatible tools.
Its file system adheres to traditional UNIX semantics, and it fully implements the
standard UNIX networking model.
Main design goals are speed, efficiency, and standardization.
Linux is designed to be compliant with the relevant POSIX documents; at least two
Linux distributions have achieved official POSIX certification.
The Linux programming interface adheres to the SVR4 UNIX semantics, rather
than to BSD behavior.
Components of a Linux System
Components of a Linux System (Cont.)
Like most UNIX implementations, Linux is composed of three main bodies of code;
the most important distinction is between the kernel and all other components.

The kernel is responsible for maintaining the important abstractions of the operating
system.
Kernel code executes in kernel mode with full access to all the physical
resources of the computer.
All kernel code and data structures are kept in the same single address space.
Components of a Linux System (Cont.)
The system libraries define a standard set of functions through which applications
interact with the kernel, and which implement much of the operating-system
functionality that does not need the full privileges of kernel code.

The system utilities perform individual specialized management tasks.

Kernel Modules
Sections of kernel code that can be compiled, loaded, and unloaded independent of
the rest of the kernel.
A kernel module may typically implement a device driver, a file system, or a
networking protocol.
The module interface allows third parties to write and distribute, on their own terms,
device drivers or file systems that could not be distributed under the GPL.
Kernel modules allow a Linux system to be set up with a standard, minimal kernel,
without any extra device drivers built in.
Three components to Linux module support:
module management
driver registration
conflict resolution
Module Management
Supports loading modules into memory and letting them talk to the rest of the kernel.
Module loading is split into two separate sections:
Managing sections of module code in kernel memory
Handling symbols that modules are allowed to reference
The module requestor manages loading requested, but currently unloaded, modules;
it also regularly queries the kernel to see whether a dynamically loaded module is
still in use, and will unload it when it is no longer actively needed.
Driver Registration
Allows modules to tell the rest of the kernel that a new driver has become available.
The kernel maintains dynamic tables of all known drivers, and provides a set of
routines to allow drivers to be added to or removed from these tables at any time.
Registration tables include the following items:
Device drivers
File systems
Network protocols
Binary format
Conflict Resolution
A mechanism that allows different device drivers to reserve hardware resources and
to protect those resources from accidental use by another driver

The conflict resolution module aims to:
Prevent modules from clashing over access to hardware resources
Prevent autoprobes from interfering with existing device drivers
Resolve conflicts with multiple drivers trying to access the same hardware
Process Management
UNIX process management separates the creation of processes and the running of a
new program into two distinct operations.
The fork system call creates a new process.
A new program is run after a call to execve.
Under UNIX, a process encompasses all the information that the operating system
must maintain to track the context of a single execution of a single program.
Under Linux, process properties fall into three groups: the process's identity,
environment, and context.
Process Identity
Process ID (PID). The unique identifier for the process; used to specify processes to
the operating system when an application makes a system call to signal, modify, or
wait for another process.

Credentials. Each process must have an associated user ID and one or more group
IDs that determine the process's rights to access system resources and files.
Personality. Not traditionally found on UNIX systems, but under Linux each
process has an associated personality identifier that can slightly modify the
semantics of certain system calls.
Used primarily by emulation libraries to request that system calls be compatible with
certain specific flavors of UNIX.

Process Environment
The process's environment is inherited from its parent, and is composed of two null-
terminated vectors:
The argument vector lists the command-line arguments used to invoke the
running program; conventionally starts with the name of the program itself
The environment vector is a list of NAME=VALUE pairs that associates
named environment variables with arbitrary textual values.
Passing environment variables among processes and inheriting variables by a
processs children are flexible means of passing information to components of the
user-mode system software.
The environment-variable mechanism provides a customization of the operating
system that can be set on a per-process basis, rather than being configured for the
system as a whole.
Process Context
The (constantly changing) state of a running program at any point in time.
The scheduling context is the most important part of the process context; it is the
information that the scheduler needs to suspend and restart the process.
The kernel maintains accounting information about the resources currently being
consumed by each process, and the total resources consumed by the process in its
lifetime so far.
The file table is an array of pointers to kernel file structures. When making file I/O
system calls, processes refer to files by their index into this table.
Process Context (Cont.)
Whereas the file table lists the existing open files, the
file-system context applies to requests to open new files. The current root and default
directories to be used for new file searches are stored here.
The signal-handler table defines the routine in the process's address space to be
called when specific signals arrive.
The virtual-memory context of a process describes the full contents of its private
address space.
Processes and Threads
Linux uses the same internal representation for processes and threads; a thread is
simply a new process that happens to share the same address space as its parent.
A distinction is only made when a new thread is created by the clone system call.
fork creates a new process with its own entirely new process context
clone creates a new process with its own identity, but that is allowed to share
the data structures of its parent
Using clone gives an application fine-grained control over exactly what is shared
between two threads.
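A minimal sketch of the distinction using the glibc clone() wrapper; the flag set shown (share memory, filesystem information, open files, and signal handlers) is one typical thread-like combination, not the only possible one:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    static int shared_value = 0;

    static int worker(void *arg)
    {
        shared_value = 42;                      /* visible to the parent: the VM is shared */
        return 0;
    }

    int main(void)
    {
        enum { STACK_SIZE = 64 * 1024 };
        char *stack = malloc(STACK_SIZE);
        int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;

        /* The new task runs worker() on its own stack but shares our data. */
        pid_t pid = clone(worker, stack + STACK_SIZE, flags, NULL);
        if (pid < 0) { perror("clone"); exit(1); }

        waitpid(pid, NULL, 0);
        printf("shared_value = %d\n", shared_value);   /* prints 42 */
        free(stack);
        return 0;
    }

A plain fork() of the same program would leave shared_value unchanged in the parent, since the child would write only to its own copy of the address space.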
Scheduling
The job of allocating CPU time to different tasks within an operating system.

While scheduling is normally thought of as the running and interrupting of
processes, in Linux, scheduling also includes the running of the various kernel
tasks.

Running kernel tasks encompasses both tasks that are requested by a running
process and tasks that execute internally on behalf of a device driver.
Kernel Synchronization
A request for kernel-mode execution can occur in two ways:
A running program may request an operating system service, either explicitly
via a system call, or implicitly, for example, when a page fault occurs.
A device driver may deliver a hardware interrupt that causes the CPU to start
executing a kernel-defined handler for that interrupt.
Kernel synchronization requires a framework that will allow the kernel's critical
sections to run without interruption by another critical section.
Kernel Synchronization (Cont.)
Linux uses two techniques to protect critical sections:
1. Normal kernel code is nonpreemptible: when a timer interrupt is received while a
process is executing a kernel system-service routine, the kernel's need_resched flag
is set so that the scheduler will run once the system call has completed and control
is about to be returned to user mode.
2. The second technique applies to critical sections that occur in interrupt
service routines.
By using the processor's interrupt-control hardware to disable interrupts
during a critical section, the kernel guarantees that it can proceed without the risk
of concurrent access to shared data structures.

Kernel Synchronization (Cont.)
To avoid performance penalties, Linux's kernel uses a synchronization architecture
that allows long critical sections to run without having interrupts disabled for the
critical section's entire duration.
Interrupt service routines are separated into a top half and a bottom half.
The top half is a normal interrupt service routine, and runs with recursive
interrupts disabled.
The bottom half is run, with all interrupts enabled, by a miniature scheduler
that ensures that bottom halves never interrupt themselves.
This architecture is completed by a mechanism for disabling selected bottom
halves while executing normal, foreground kernel code.
Interrupt Protection Levels
Each level may be interrupted by code running at a higher level, but will never be
interrupted by code running at the same or a lower level.
User processes can always be preempted by another process when a time-sharing
scheduling interrupt occurs.
Process Scheduling
Linux uses two process-scheduling algorithms:
A time-sharing algorithm for fair preemptive scheduling between multiple
processes
A real-time algorithm for tasks where absolute priorities are more important
than fairness
A process's scheduling class defines which algorithm to apply.
For time-sharing processes, Linux uses a prioritized, credit-based algorithm.
The crediting rule factors in both the process's history and its priority.
This crediting system automatically prioritizes interactive or I/O-bound
processes.
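The text does not spell the rule out, but one commonly described form of it in Linux 2.x kernels recredits every process, when no runnable process has credits left, as credits = credits/2 + priority. A purely illustrative sketch (the structure and names are assumptions, not kernel code):

    struct task { int credits; int priority; };

    /* Illustrative recrediting pass: half of the unused credits are carried
       over (rewarding processes that slept, i.e. interactive or I/O-bound
       ones) and the static priority is added in. */
    void recredit_all(struct task *tasks, int n)
    {
        for (int i = 0; i < n; i++)
            tasks[i].credits = tasks[i].credits / 2 + tasks[i].priority;
    }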
Process Scheduling (Cont.)
Linux implements the FIFO and round-robin real-time scheduling classes; in both
cases, each process has a priority in addition to its scheduling class.
The scheduler runs the process with the highest priority; for equal-priority
processes, it runs the process that has been waiting the longest.
FIFO processes continue to run until they either exit or block.
A round-robin process will be preempted after a while and moved to the end of
the scheduling queue, so round-robin processes of equal priority
automatically time-share between themselves.
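These classes are visible to applications through the standard POSIX scheduling calls; a minimal sketch (requires appropriate privileges, and the priority value 50 is arbitrary):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param param = { .sched_priority = 50 };

        /* SCHED_FIFO: run until exit or block; SCHED_RR adds round-robin
           time slicing among tasks of equal priority. */
        if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
            perror("sched_setscheduler");
            return 1;
        }
        printf("now scheduled as SCHED_FIFO, priority %d\n", param.sched_priority);
        return 0;
    }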
Symmetric Multiprocessing
Linux 2.0 was the first Linux kernel to support SMP hardware; separate processes
or threads can execute in parallel on separate processors.

To preserve the kernel's nonpreemptible synchronization requirements, SMP
imposes the restriction, via a single kernel spinlock, that only one processor at a time
may execute kernel-mode code.
Memory Management
Linux's physical memory-management system deals with allocating and freeing
pages, groups of pages, and small blocks of memory.

It has additional mechanisms for handling virtual memory, memory mapped into the
address space of running processes.
Splitting of Memory in a Buddy Heap
Managing Physical Memory
The page allocator allocates and frees all physical pages; it can allocate ranges of
physically-contiguous pages on request.
The allocator uses a buddy-heap algorithm to keep track of available physical
pages.
Each allocatable memory region is paired with an adjacent partner.
Whenever two allocated partner regions are both freed up they are combined
to form a larger region.
If a small memory request cannot be satisfied by allocating an existing small
free region, then a larger free region will be subdivided into two partners to
satisfy the request.
Memory allocations in the Linux kernel occur either statically (drivers reserve a
contiguous area of memory during system boot time) or dynamically (via the page
allocator).
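A toy illustration of the pairing idea (this is not the kernel's allocator): block sizes are powers of two, and a block's partner ("buddy") is found by flipping the single address bit corresponding to the block size.

    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    /* Offset, within the managed region, of the buddy of the block that
       starts at 'offset' and has size PAGE_SIZE << order. */
    unsigned long buddy_of(unsigned long offset, int order)
    {
        return offset ^ (PAGE_SIZE << order);
    }

    int main(void)
    {
        /* A free 16 KB region at offset 0 can be split into two 8 KB partners
           at offsets 0x0000 and 0x2000; when both are free they recombine. */
        printf("buddy of 0x0000 (order 1) = 0x%lx\n", buddy_of(0x0000, 1));
        printf("buddy of 0x2000 (order 1) = 0x%lx\n", buddy_of(0x2000, 1));
        return 0;
    }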
Virtual Memory
The VM system maintains the address space visible to each process: It creates pages
of virtual memory on demand, and manages the loading of those pages from disk or
their swapping back out to disk as required.
The VM manager maintains two separate views of a process's address space:
A logical view describing instructions concerning the layout of the address
space.
The address space consists of a set of nonoverlapping regions, each
representing a contiguous, page-aligned subset of the address space.
A physical view of each address space, which is stored in the hardware page
tables for the process.
Virtual Memory (Cont.)
Virtual memory regions are characterized by:
The backing store, which describes from where the pages for a region come;
regions are usually backed by a file or by nothing (demand-zero memory)
The region's reaction to writes (page sharing or copy-on-write).
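Both kinds of backing, and the copy-on-write reaction to writes, are visible to user programs through mmap; a minimal sketch (the file /bin/sh is just a convenient existing file, and error handling is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Demand-zero memory: backed by nothing, pages appear zero-filled. */
        char *anon = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* File-backed, private mapping: reads come from the file; the first
           write to a page gives this process its own copy (copy-on-write). */
        int fd = open("/bin/sh", O_RDONLY);
        char *filemap = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE, fd, 0);

        if (fd < 0 || anon == MAP_FAILED || filemap == MAP_FAILED) {
            perror("open/mmap");
            return 1;
        }
        strcpy(anon, "demand-zero page");
        filemap[0] = 0;                          /* COW: modifies our private copy only */
        printf("%s\n", anon);

        munmap(anon, 4096);
        munmap(filemap, 4096);
        close(fd);
        return 0;
    }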
The kernel creates a new virtual address space
1. When a process runs a new program with the exec system call
2. Upon creation of a new process by the fork system call
The Linux kernel reserves a constant, architecture-dependent region of the virtual
address space of every process for its own internal use.

This kernel virtual-memory area contains two regions:
A static area that contains page table references to every available physical
page of memory in the system, so that there is a simple translation from
physical to virtual addresses when running kernel code.
The remainder of the reserved section is not reserved for any specific purpose;
its page-table entries can be modified to point to any other areas of memory.
Executing and Loading User Programs
Linux maintains a table of functions for loading programs; it gives each function
the opportunity to try loading the given file when an exec system call is made.
The registration of multiple loader routines allows Linux to support both the ELF
and a.out binary formats.
Initially, binary-file pages are mapped into virtual memory; only when a program
tries to access a given page will a page fault result in that page being loaded into
physical memory.
An ELF-format binary file consists of a header followed by several page-aligned
sections; the ELF loader works by reading the header and mapping the sections of
the file into separate regions of virtual memory.
Memory Layout for ELF Programs
Static and Dynamic Linking
A program whose necessary library functions are embedded directly in the
program's executable binary file is statically linked to its libraries.

The main disadvantage of static linkage is that every program generated must
contain copies of exactly the same common system library functions.

Dynamic linking is more efficient in terms of both physical memory and disk-space
usage because it loads the system libraries into memory only once.

FILE SYSTEMS
To the user, Linux's file system appears as a hierarchical directory tree obeying
UNIX semantics.
Internally, the kernel hides implementation details and manages the multiple
different file systems via an abstraction layer, the virtual file system (VFS).
The Linux VFS is designed around object-oriented principles and is composed of
two components:
A set of definitions that define what a file object is allowed to look like
The inode-object and the file-object structures represent individual files
the file system object represents an entire file system
A layer of software to manipulate those objects.
The Linux Ext2fs File System
Ext2fs uses a mechanism similar to that of BSD Fast File System (ffs) for locating
data blocks belonging to a specific file.
The main differences between ext2fs and ffs concern their disk allocation policies.
In ffs, the disk is allocated to files in blocks of 8 KB, with blocks being
subdivided into fragments of 1 KB to store small files or partially filled blocks
at the end of a file.
Ext2fs does not use fragments; it performs its allocations in smaller units. The
default block size on ext2fs is 1 KB, although 2 KB and 4 KB blocks are also
supported.
Ext2fs uses allocation policies designed to place logically adjacent blocks of a
file into physically adjacent blocks on disk, so that it can submit an I/O request
for several disk blocks as a single operation.
Ext2fs Block-Allocation Policies
The Linux Proc File System
The proc file system does not store data; rather, its contents are computed on
demand according to user file I/O requests.
proc must implement a directory structure and the file contents within it; it must then
define a unique and persistent inode number for each of the directories and files it
contains.
It uses this inode number to identify just what operation is required when a
user tries to read from a particular file inode or perform a lookup in a
particular directory inode.
When data is read from one of these files, proc collects the appropriate
information, formats it into text form, and places it into the requesting
process's read buffer.
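A short sketch reading one such file; nothing is read from disk, the kernel generates the text at the moment the read occurs:

    #include <stdio.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");   /* contents computed on demand */
        if (!f) { perror("fopen"); return 1; }
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }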
Input and Output
The Linux device-oriented file system accesses disk storage through two caches:
Data is cached in the page cache, which is unified with the virtual memory
system
Metadata is cached in the buffer cache, a separate cache indexed by the
physical disk block.

Linux splits all devices into three classes:
block devices allow random access to completely independent, fixed-size blocks
of data
character devices include most other devices; they don't need to support the
functionality of regular files.
network devices are interfaced via the kernels networking subsystem
Device-Driver Block Structure
Block Devices
Provide the main interface to all disk devices in a system.

The block buffer cache serves two main purposes:
it acts as a pool of buffers for active I/O
it serves as a cache for completed I/O

The request manager manages the reading and writing of buffer contents to and
from a block device driver.
Character Devices
A device driver which does not offer random access to fixed blocks of data.
A character device driver must register a set of functions which implement the
driver's various file I/O operations.
The kernel performs almost no preprocessing of a file read or write request to a
character device, but simply passes on the request to the device.
The main exception to this rule is the special subset of character device drivers
which implement terminal devices, for which the kernel maintains a standard
interface.
Interprocess Communication
Like UNIX, Linux informs processes that an event has occurred via signals.
There is a limited number of signals, and they cannot carry information: Only the
fact that a signal occurred is available to a process.
The Linux kernel does not use signals to communicate with processes that are
running in kernel mode; rather, communication within the kernel is accomplished
via scheduling states and wait_queue structures.
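A minimal user-level sketch of signal delivery: the handler learns only that SIGINT occurred, with no accompanying data.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_sigint = 0;

    static void handler(int signo)
    {
        got_sigint = 1;                 /* only the fact of arrival is recorded */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = handler;
        sigaction(SIGINT, &sa, NULL);   /* install the handler for SIGINT */

        while (!got_sigint)
            pause();                    /* sleep until a signal arrives */
        printf("caught SIGINT\n");
        return 0;
    }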
Passing Data Between Processes
The pipe mechanism allows a child process to inherit a communication channel from
its parent; data written to one end of the pipe can be read at the other.
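A minimal sketch of that inheritance: the parent creates the pipe, forks, and the two halves of the program use opposite ends of it.

    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char buf[32];

        pipe(fds);                       /* fds[0] = read end, fds[1] = write end */
        if (fork() == 0) {               /* child inherits both descriptors */
            close(fds[1]);
            ssize_t n = read(fds[0], buf, sizeof buf - 1);
            if (n > 0) {
                buf[n] = '\0';
                printf("child read: %s\n", buf);
            }
            _exit(0);
        }
        close(fds[0]);                   /* parent writes into the pipe */
        write(fds[1], "hello", 5);
        close(fds[1]);
        wait(NULL);
        return 0;
    }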

Shared memory offers an extremely fast way of communicating; any data written by
one process to a shared memory region can be read immediately by any other
process that has mapped that region into its address space.

To obtain synchronization, however, shared memory must be used in conjunction
with another interprocess-communication mechanism.
Shared Memory Object
The shared-memory object acts as a backing store for shared-memory regions in the
same way as a file can act as backing store for a memory-mapped memory region.

Shared-memory mappings direct page faults to map in pages from a persistent
shared-memory object.

Shared-memory objects remember their contents even if no processes are currently
mapping them into virtual memory.
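A minimal System V shared-memory sketch; here wait() stands in for the real synchronization that, as noted above, would normally come from another IPC mechanism such as semaphores (error handling omitted):

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Create a 4 KB shared-memory object and map it into this process. */
        int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        char *region = shmat(shmid, NULL, 0);

        if (fork() == 0) {               /* the child shares the attached region */
            strcpy(region, "written by the child");
            _exit(0);
        }
        wait(NULL);                      /* crude ordering for this sketch only */
        printf("parent sees: %s\n", region);

        shmdt(region);
        shmctl(shmid, IPC_RMID, NULL);   /* the object would otherwise persist */
        return 0;
    }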
Network Structure
Networking is a key area of functionality for Linux.
It supports the standard Internet protocols for UNIX to UNIX
communications.
It also implements protocols native to non-UNIX operating systems, in
particular protocols used on PC networks, such as AppleTalk and IPX.

Internally, networking in the Linux kernel is implemented by three layers of
software:
The socket interface
Protocol drivers
Network device drivers
Network Structure (Cont.)
The most important set of protocols in the Linux networking system is the internet
protocol suite.
It implements routing between different hosts anywhere on the network.
On top of the routing protocol are built the UDP, TCP and ICMP protocols.
Security
The pluggable authentication modules (PAM) system is available under Linux.
PAM is based on a shared library that can be used by any system component that
needs to authenticate users.
Access control under UNIX systems, including Linux, is performed through the use
of unique numeric identifiers (uid and gid).
Access control is performed by assigning objects a protection mask, which specifies
which access modes (read, write, or execute) are to be granted to processes with
owner, group, or world access.
Security
Linux augments the standard UNIX setuid mechanism in two ways:
It implements the POSIX specification's saved user-id mechanism, which
allows a process to repeatedly drop and reacquire its effective uid.
It has added a process characteristic that grants just a subset of the rights of
the effective uid.
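A sketch of how a setuid program uses the saved user-id: it can drop its effective uid to that of the invoking user and later restore it (the calls are standard; whether restoring succeeds depends on the program actually being installed setuid):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        uid_t privileged = geteuid();    /* effective uid at startup */
        uid_t invoker    = getuid();     /* real uid of the invoking user */

        seteuid(invoker);                /* drop privileges for risky work */
        printf("working as euid %d\n", (int)geteuid());

        seteuid(privileged);             /* reacquire via the saved user-id */
        printf("restored euid %d\n", (int)geteuid());
        return 0;
    }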

Linux provides another mechanism that allows a client to selectively pass access to a
single file to some server process without granting it any other privileges.
3.2.WINDOWS 2000

History
Design Principles
System Components
Environmental Subsystems
File system
Networking
Programmer Interface

Windows 2000
32-bit preemptive multitasking operating system for Intel microprocessors.
Key goals for the system:
portability
security
POSIX compliance
multiprocessor support
extensibility
international support
compatibility with MS-DOS and MS-Windows applications.
Uses a micro-kernel architecture.
Available in four versions: Professional, Server, Advanced Server, and Datacenter Server.
In 1996, more NT server licenses were sold than UNIX licenses
History
In 1988, Microsoft decided to develop a new technology (NT) portable operating
system that supported both the OS/2 and POSIX APIs.

Originally, NT was supposed to use the OS/2 API as its native environment but
during development NT was changed to use the Win32 API, reflecting the popularity
of Windows 3.0.
Design Principles
Extensibility layered architecture.
Executive, which runs in protected mode, provides the basic system services.
On top of the executive, several server subsystems operate in user mode.
Modular structure allows additional environmental subsystems to be added
without affecting the executive.
Portability: 2000 can be moved from one hardware architecture to another with
relatively few changes.
Written in C and C++.
Processor-dependent code is isolated in a dynamic link library (DLL) called
the hardware abstraction layer (HAL).
Design Principles (Cont.)
Reliability 2000 uses hardware protection for virtual memory, and software
protection mechanisms for operating system resources.
Compatibility: applications that follow the IEEE 1003.1 (POSIX) standard can be
compiled to run on 2000 without changing the source code.
Performance 2000 subsystems can communicate with one another via high-
performance message passing.
Preemption of low priority threads enables the system to respond quickly to
external events.
Designed for symmetrical multiprocessing

2000 Architecture
Layered system of modules.

Protected mode HAL, kernel, executive.

User mode collection of subsystems
Environmental subsystems emulate different operating systems.
Protection subsystems provide security functions.
Depiction of 2000 Architecture
System Components Kernel
Foundation for the executive and the subsystems.
Never paged out of memory; execution is never preempted.
Four main responsibilities:
thread scheduling
interrupt and exception handling
low-level processor synchronization
recovery after a power failure
Kernel is object-oriented, uses two sets of objects.
dispatcher objects control dispatching and synchronization (events, mutants,
mutexes, semaphores, threads and timers).
control objects (asynchronous procedure calls, interrupts, power notify, power
status, process and profile objects.)
Kernel Process and Threads
The process has a virtual memory address space, information (such as a base
priority), and an affinity for one or more processors.
Threads are the unit of execution scheduled by the kernels dispatcher.
Each thread has its own state, including a priority, processor affinity, and
accounting information.
A thread can be one of six states: ready, standby, running, waiting, transition, and
terminated.
Kernel Scheduling
The dispatcher uses a 32-level priority scheme to determine the order of thread
execution. Priorities are divided into two classes.
The real-time class contains threads with priorities ranging from 16 to 31.
The variable class contains threads having priorities from 0 to 15.
Characteristics of 2000's priority strategy:
Tends to give very good response times to interactive threads that are using
the mouse and windows.
Enables I/O-bound threads to keep the I/O devices busy.
Compute-bound threads soak up the spare CPU cycles in the background.
Kernel Scheduling (Cont.)
Scheduling can occur when a thread enters the ready or wait state, when a thread
terminates, or when an application changes a threads priority or processor affinity.

Real-time threads are given preferential access to the CPU; but 2000 does not
guarantee that a real-time thread will start to execute within any particular time
limit. (This is known as soft realtime.)
Windows 2000 Interrupt Request Levels
Kernel Trap Handling
The kernel provides trap handling when exceptions and interrupts are generated by
hardware or software.
Exceptions that cannot be handled by the trap handler are handled by the kernel's
exception dispatcher.
The interrupt dispatcher in the kernel handles interrupts by calling either an
interrupt service routine (such as in a device driver) or an internal kernel routine.
The kernel uses spin locks that reside in global memory to achieve multiprocessor
mutual exclusion.
Executive Object Manager
2000 uses objects for all its services and entities; the object manager supervises the
use of all the objects.
Generates an object handle
Checks security.
Keeps track of which processes are using each object.
Objects are manipulated by a standard set of methods, namely create, open, close,
delete, query name, parse and security.
Executive Naming Objects
The 2000 executive allows any object to be given a name, which may be either
permanent or temporary.
Object names are structured like file path names in MS-DOS and UNIX.
2000 implements a symbolic link object, which is similar to symbolic links in UNIX
that allow multiple nicknames or aliases to refer to the same file.
A process gets an object handle by creating an object by opening an existing one, by
receiving a duplicated handle from another process, or by inheriting a handle from a
parent process.
Each object is protected by an access control list.
Executive Virtual Memory Manager
The design of the VM manager assumes that the underlying hardware supports
virtual-to-physical mapping, a paging mechanism, transparent cache coherence on
multiprocessor systems, and virtual address aliasing.
The VM manager in 2000 uses a page-based management scheme with a page size
of 4 KB.
The 2000 VM manager uses a two step process to allocate memory.
The first step reserves a portion of the processs address space.
The second step commits the allocation by assigning space in the 2000 paging
file.
Virtual-Memory Layout
Virtual Memory Manager (Cont.)
The virtual address translation in 2000 uses several data structures.
Each process has a page directory that contains 1024 page directory entries of
size 4 bytes.
Each page directory entry points to a page table which contains 1024 page table
entries (PTEs) of size 4 bytes.
Each PTE points to a 4 KB page frame in physical memory.
A 10-bit integer can represent all the values from 0 to 1023; it can therefore select any
entry in the page directory, or in a page table.
This property is used when translating a virtual address pointer to a byte address in
physical memory.
A page can be in one of six states: valid, zeroed, free, standby, modified, and bad.
Virtual-to-Physical Address Translation
10 bits for the page directory entry, 10 bits for the page table entry, and 12 bits for
the byte offset in the page.
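As a worked illustration of this 10/10/12 split, the three fields can be pulled out of a 32-bit virtual address with shifts and masks (the address value is arbitrary):

    #include <stdio.h>

    int main(void)
    {
        unsigned int va = 0x00403204u;                 /* example virtual address */
        unsigned int dir_index   = (va >> 22) & 0x3FF; /* top 10 bits: page directory */
        unsigned int table_index = (va >> 12) & 0x3FF; /* next 10 bits: page table   */
        unsigned int offset      =  va        & 0xFFF; /* low 12 bits: byte in page  */

        printf("directory entry %u, page-table entry %u, byte offset 0x%03X\n",
               dir_index, table_index, offset);
        return 0;
    }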
Page File Page-Table Entry
Executive Process Manager
Provides services for creating, deleting, and using threads and processes.

Issues such as parent/child relationships or process hierarchies are left to the
particular environmental subsystem that owns the process.
Executive Local Procedure Call Facility
The LPC passes requests and results between client and server processes within a
single machine.
In particular, it is used to request services from the various 2000 subsystems.
When a LPC channel is created, one of three types of message passing techniques
must be specified.
First type is suitable for small messages, up to 256 bytes; port's message queue
is used as intermediate storage, and the messages are copied from one process
to the other.
Second type avoids copying large messages by pointing to a shared memory
section object created for the channel.
Third method, called quick LPC was used by graphical display portions of the
Win32 subsystem.
Executive I/O Manager
The I/O manager is responsible for
file systems
cache management
device drivers
network drivers
Keeps track of which installable file systems are loaded, and manages buffers for I/O
requests.
Works with VM Manager to provide memory-mapped file I/O.
Controls the 2000 cache manager, which handles caching for the entire I/O system.
Supports both synchronous and asynchronous operations, provides time outs for
drivers, and has mechanisms for one driver to call another.
File I/O
Executive Security Reference Monitor
The object-oriented nature of 2000 enables the use of a uniform mechanism to
perform runtime access validation and audit checks for every entity in the system.

Whenever a process opens a handle to an object, the security reference monitor
checks the process's security token and the object's access control list to see whether
the process has the necessary rights.
Executive Plug-and-Play Manager
Plug-and-Play (PnP) manager is used to recognize and adapt to changes in the
hardware configuration.

When new devices are added (for example, PCI or USB), the PnP manager loads the
appropriate driver.

The manager also keeps track of the resources used by each device.
Environmental Subsystems
User-mode processes layered over the native 2000 executive services to enable 2000
to run programs developed for other operating systems.

2000 uses the Win32 subsystem as the main operating environment; Win32 is used
to start all processes. It also provides all the keyboard, mouse and graphical display
capabilities.

The MS-DOS environment is provided by a Win32 application called the virtual DOS
machine (VDM), a user-mode process that is paged and dispatched like any other
2000 thread.
Environmental Subsystems (Cont.)
16-Bit Windows Environment:
Provided by a VDM that incorporates Windows on Windows.
Provides the Windows 3.1 kernel routines and subroutines for window-manager
and GDI functions.
The POSIX subsystem is designed to run POSIX applications following the POSIX.1
standard which is based on the UNIX model.
Environmental Subsystems (Cont.)
OS/2 subsystems runs OS/2 applications.

Logon and Security Subsystems authenticate users logging on to Windows 2000
systems. Users are required to have account names and passwords.
- The authentication package authenticates users whenever they attempt to
access an object in the system. Windows 2000 uses Kerberos as the default
authentication package.

FILE SYSTEM
The fundamental structure of the 2000 file system (NTFS) is a volume.
Created by the 2000 disk administrator utility.
Based on a logical disk partition.
May occupy a portion of a disk, an entire disk, or span across several disks.
All metadata, such as information about the volume, is stored in a regular file.
NTFS uses clusters as the underlying unit of disk allocation.
A cluster is a number of disk sectors that is a power of two.
Because the cluster size is smaller than for the 16-bit FAT file system, the
amount of internal fragmentation is reduced.
File System Internal Layout
NTFS uses logical cluster numbers (LCNs) as disk addresses.
A file in NTFS is not a simple byte stream, as in MS-DOS or UNIX, rather, it is a
structured object consisting of attributes.
Every file in NTFS is described by one or more records in an array stored in a
special file called the Master File Table (MFT).
Each file on an NTFS volume has a unique ID called a file reference.
64-bit quantity that consists of a 48-bit file number and a 16-bit sequence
number.
Can be used to perform internal consistency checks.
The NTFS name space is organized by a hierarchy of directories; the index root
contains the top level of the B+ tree.
File System Recovery
All file system data structure updates are performed inside transactions that are
logged.
Before a data structure is altered, the transaction writes a log record that
contains redo and undo information.
After the data structure has been changed, a commit record is written to the
log to signify that the transaction succeeded.
After a crash, the file system data structures can be restored to a consistent
state by processing the log records.
File System Recovery (Cont.)
This scheme does not guarantee that all the user file data can be recovered after a
crash, just that the file system data structures (the metadata files) are undamaged
and reflect some consistent state prior to the crash.

The log is stored in the third metadata file at the beginning of the volume.

The logging functionality is provided by the 2000 log file service.


File System Security
Security of an NTFS volume is derived from the 2000 object model.

Each file object has a security descriptor attribute stored in its MFT record.

This attribute contains the access token of the owner of the file, and an access
control list that states the access privileges that are granted to each user that has
access to the file.
Volume Management and Fault Tolerance
FtDisk, the fault tolerant disk driver for 2000, provides several ways to combine
multiple SCSI disk drives into one logical volume.
Logically concatenate multiple disks to form a large logical volume, a volume set.
Interleave multiple physical partitions in round-robin fashion to form a stripe set
(also called RAID level 0, or disk striping).
Variation: stripe set with parity, or RAID level 5.
Disk mirroring, or RAID level 1, is a robust scheme that uses a mirror set: two
equally sized partitions on two disks with identical data contents.
To deal with disk sectors that go bad, FtDisk, uses a hardware technique called
sector sparing and NTFS uses a software technique called cluster remapping.
Volume Set On Two Drives
Stripe Set on Two Drives
Stripe Set With Parity on Three Drives
Mirror Set on Two Drives
File System Compression
To compress a file, NTFS divides the file's data into compression units, which are
blocks of 16 contiguous clusters.

For sparse files, NTFS uses another technique to save space.
Clusters that contain all zeros are not actually allocated or stored on disk.
Instead, gaps are left in the sequence of virtual cluster numbers stored in the
MFT entry for the file.
When reading a file, if a gap in the virtual cluster numbers is found, NTFS just
zero-fills that portion of the caller's buffer.
File System Reparse Points
A reparse point returns an error code when accessed. The reparse data tells the I/O
manager what to do next.

Reparse points can be used to provide the functionality of UNIX mounts

Reparse points can also be used to access files that have been moved to offline
storage.
Networking
2000 supports both peer-to-peer and client/server networking; it also has facilities
for network management.
To describe networking in 2000, we refer to two of the internal networking
interfaces:
NDIS (Network Device Interface Specification) Separates network adapters
from the transport protocols so that either can be changed without affecting
the other.
TDI (Transport Driver Interface) Enables any session layer component to
use any available transport mechanism.
2000 implements transport protocols as drivers that can be loaded and unloaded
from the system dynamically.



Networking Protocols
The server message block (SMB) protocol is used to send I/O requests over the
network. It has four message types:
- Session control
- File
- Printer
- Message
The network basic Input/Output system (NetBIOS) is a hardware abstraction
interface for networks. Used to:
Establish logical names on the network.
Establish logical connections of sessions between two logical names on the
network.
Support reliable data transfer for a session via NetBIOS requests or SMBs
Networking Protocols (Cont.)
NetBEUI (NetBIOS Extended User Interface): default protocol for Windows 95
peer networking and Windows for Workgroups; used when 2000 wants to share
resources with these networks.
2000 uses the TCP/IP Internet protocol to connect to a wide variety of operating
systems and hardware platforms.
PPTP (Point-to-Point Tunneling Protocol) is used to communicate between Remote
Access Server modules running on 2000 machines that are connected over the
Internet.
The 2000 NWLink protocol connects the NetBIOS to Novell NetWare networks.
Networking Protocols (Cont.)
The Data Link Control protocol (DLC) is used to access IBM mainframes and HP
printers that are directly connected to the network.
2000 systems can communicate with Macintosh computers via the AppleTalk
protocol if a 2000 Server on the network is running the Windows 2000 Services for
Macintosh package.
Networking Dist. Processing Mechanisms
2000 supports distributed applications via NetBIOS, named pipes and
mailslots, Windows Sockets, Remote Procedure Calls (RPC), and Network Dynamic
Data Exchange (NetDDE).
NetBIOS applications can communicate over the network using NetBEUI, NWLink,
or TCP/IP.
Named pipes are a connection-oriented messaging mechanism, named via the
uniform naming convention (UNC).
Mailslots are a connectionless messaging mechanism used for broadcast
applications, such as finding components on the network.
Winsock, the windows sockets API, is a session-layer interface that provides a
standardized interface to many transport protocols that may have different
addressing schemes.
Distributed Processing Mechanisms (Cont.)
The 2000 RPC mechanism follows the widely-used Distributed Computing
Environment standard for RPC messages, so programs written to use 2000 RPCs are
very portable.
RPC messages are sent using NetBIOS, or Winsock on TCP/IP networks, or
named pipes on LAN Manager networks.
2000 provides the Microsoft Interface Definition Language to describe the
remote procedure names, arguments, and results.
Networking Redirectors and Servers
In 2000, an application can use the 2000 I/O API to access files from a remote
computer as if they were local, provided that the remote computer is running an MS-
NET server.
A redirector is the client-side object that forwards I/O requests to remote files,
where they are satisfied by a server.
For performance and security, the redirectors and servers run in kernel mode.

Access to a Remote File
The application calls the I/O manager to request that a file be opened (we assume
that the file name is in the standard UNC format).
The I/O manager builds an I/O request packet.
The I/O manager recognizes that the access is for a remote file, and calls a driver
called a Multiple Universal Naming Convention Provider (MUP).
The MUP sends the I/O request packet asynchronously to all registered redirectors.
A redirector that can satisfy the request responds to the MUP.
To avoid asking all the redirectors the same question in the future, the MUP
uses a cache to remember which redirector can handle this file.
Access to a Remote File (Cont.)
The redirector sends the network request to the remote system.
The remote system network drivers receive the request and pass it to the server
driver.
The server driver hands the request to the proper local file system driver.
The proper device driver is called to access the data.
The results are returned to the server driver, which sends the data back to the
requesting redirector.
Networking Domains
NT uses the concept of a domain to manage global access rights within groups.
A domain is a group of machines running NT server that share a common security
policy and user database.
2000 provides three models of setting up trust relationships.
One way, A trusts B
Two way, transitive, A trusts B, B trusts C so A, B, C trust each other
Crosslink allows authentication to bypass hierarchy to cut down on
authentication traffic.
Name Resolution in TCP/IP Networks
On an IP network, name resolution is the process of converting a computer name to
an IP address.

e.g., www.bell-labs.com resolves to 135.104.1.14

2000 provides several methods of name resolution:
Windows Internet Name Service (WINS)
broadcast name resolution
domain name system (DNS)
a host file
an LMHOSTS file
Name Resolution (Cont.)
WINS consists of two or more WINS servers that maintain a dynamic database of
name to IP address bindings, and client software to query the servers.
WINS uses the Dynamic Host Configuration Protocol (DHCP), which automatically
updates address configurations in the WINS database, without user or administrator
intervention.
Programmer Interface Access to Kernel Obj.
A process gains access to a kernel object named XXX by calling the CreateXXX
function to open a handle to XXX; the handle is unique to that process.
A handle can be closed by calling the CloseHandle function; the system may delete
the object if the count of processes using the object drops to 0.
Given a handle to a process and the handle's value, a second process can get a
handle to the same object, and thus share it.


Programmer Interface Process Management
A process is started via the CreateProcess routine, which loads any dynamic link
libraries used by the process and creates a primary thread.
Additional threads can be created by the CreateThread function.
Every dynamic link library or executable file that is loaded into the address space of
a process is identified by an instance handle.
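A minimal Win32 sketch of creating an additional thread with CreateThread and waiting for it (error handling abbreviated):

    #include <windows.h>
    #include <stdio.h>

    DWORD WINAPI worker(LPVOID arg)
    {
        printf("hello from thread %lu\n", (unsigned long)GetCurrentThreadId());
        return 0;
    }

    int main(void)
    {
        HANDLE h = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        if (h == NULL) {
            printf("CreateThread failed: %lu\n", (unsigned long)GetLastError());
            return 1;
        }
        WaitForSingleObject(h, INFINITE);   /* wait for the thread to finish */
        CloseHandle(h);                     /* release our handle to it */
        return 0;
    }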
Process Management (Cont.)
Scheduling in Win32 utilizes four priority classes:
- IDLE_PRIORITY_CLASS (priority level 4)
- NORMAL_PRIORITY_CLASS (priority level 8) - typical for most processes
- HIGH_PRIORITY_CLASS (priority level 13)
- REALTIME_PRIORITY_CLASS (priority level 24)
To provide performance levels needed for interactive programs, 2000 has a special
scheduling rule for processes in the NORMAL_PRIORITY_CLASS.
2000 distinguishes between the foreground process that is currently selected on
the screen, and the background processes that are not currently selected.
When a process moves into the foreground, 2000 increases the scheduling
quantum by some factor, typically 3.
Process Management (Cont.)
The kernel dynamically adjusts the priority of a thread depending on whether it is
I/O-bound or CPU-bound.
To synchronize the concurrent access to shared objects by threads, the kernel
provides synchronization objects, such as semaphores and mutexes.
In addition, threads can synchronize by using the WaitForSingleObject or
WaitForMultipleObjects functions.
Another method of synchronization in the Win32 API is the critical section.
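A small sketch of the second approach: two threads serialize access to a shared counter through a kernel mutex object and WaitForSingleObject (iteration counts are arbitrary):

    #include <windows.h>
    #include <stdio.h>

    static HANDLE mutex;
    static int counter = 0;

    DWORD WINAPI worker(LPVOID arg)
    {
        for (int i = 0; i < 100000; i++) {
            WaitForSingleObject(mutex, INFINITE);   /* acquire */
            counter++;                              /* critical section */
            ReleaseMutex(mutex);                    /* release */
        }
        return 0;
    }

    int main(void)
    {
        HANDLE threads[2];

        mutex = CreateMutex(NULL, FALSE, NULL);
        threads[0] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        threads[1] = CreateThread(NULL, 0, worker, NULL, 0, NULL);

        WaitForMultipleObjects(2, threads, TRUE, INFINITE);
        printf("counter = %d\n", counter);          /* 200000 with the mutex held */
        CloseHandle(threads[0]);
        CloseHandle(threads[1]);
        CloseHandle(mutex);
        return 0;
    }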
Programmer Interface Interprocess Comm.
Win32 applications can have interprocess communication by sharing kernel objects.
An alternate means of interprocess communications is message passing, which is
particularly popular for Windows GUI applications.
One thread sends a message to another thread or to a window.
A thread can also send data with the message.
Every Win32 thread has its own input queue from which the thread receives
messages.
This is more reliable than the shared input queue of 16-bit Windows, because with
separate queues, one stuck application cannot block input to the other applications.
Programmer Interface Memory Management
Virtual memory:
- VirtualAlloc reserves or commits virtual memory.
- VirtualFree decommits or releases the memory.
These functions enable the application to determine the virtual address at
which the memory is allocated.
An application can use memory by memory mapping a file into its address space.
Multistage process.
Two processes share memory by mapping the same file into their virtual
memory.
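A minimal sketch of the reserve-then-commit pattern with VirtualAlloc and VirtualFree (the sizes chosen are arbitrary):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SIZE_T size = 1 << 20;                        /* a 1 MB region */

        /* Step 1: reserve address space only; no storage is assigned yet. */
        void *base = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);

        /* Step 2: commit one page, backing it with space in the paging file. */
        void *page = VirtualAlloc(base, 4096, MEM_COMMIT, PAGE_READWRITE);

        if (page != NULL) {
            ((char *)page)[0] = 'x';                  /* committed memory is usable */
            printf("committed one page at %p\n", page);
        }
        VirtualFree(base, 0, MEM_RELEASE);            /* release the whole reservation */
        return 0;
    }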




**********************************
UNIT-V

I/O SYSTEMS

1. I/O SYSTEMS

The basic model of the UNIX I/O system is a sequence of bytes that can be
accessed either randomly or sequentially. There are no access methods and no
control blocks in a typical UNIX user process.Different programs expect various
levels of structure, but the kernel does not impose structure on I/O. For instance,
the convention for text files is lines of ASCII characters separated by a single
newline character (the ASCII line-feed character), but the kernel knows nothing
about this convention. For the purposes of most programs, the model is further
simplified to being a stream of data bytes, or an I/O stream. It is this single
common data form that makes the characteristic UNIX tool-based approach work
(Kernighan & Pike, 1984). An I/O stream from one program can be fed as input to
almost any other program. (This kind of traditional UNIX I/O stream should not
be confused with the Eighth Edition stream I/O system or with the System V,
Release 3 STREAMS, both of which can be accessed as traditional I/O streams.)
Descriptors and I/O
UNIX processes use descriptors to reference I/O streams. Descriptors are small
unsigned integers obtained from the open and socket system calls. A read or
write system call can be applied to a descriptor to transfer data. The close
system call can be used to deallocate any descriptor. Descriptors represent
underlying objects supported by the kernel, and are created by system calls
specific to the type of object. In 4.4BSD, three kinds of objects can be
represented by descriptors: files, pipes, and sockets.
- A file is a linear array of bytes with at least one name. A file exists until
all its names are deleted explicitly and no process holds a descriptor for
it. A process acquires a descriptor for a file by opening that file's name
with the open system call. I/O devices are accessed as files.
- A pipe is a linear array of bytes, as is a file, but it is used solely as an
I/O stream, and it is unidirectional. It also has no name, and thus cannot
be opened with open. Instead, it is created by the pipe system call,
which returns two descriptors, one of which accepts input that is sent to
the other descriptor reliably, without duplication, and in order. The
system also supports a named pipe or FIFO. A FIFO has properties
identical to a pipe, except that it appears in the filesystem; thus, it can
be opened using the open system call. Two processes that wish to
communicate each open the FIFO: One opens it for reading, the other
for writing.
- A socket is a transient object that is used for interprocess
communication; it exists only as long as some process holds a descriptor
referring to it. A socket is created by the socket system call, which
returns a descriptor for it. There are different kinds of sockets that
support various communication semantics, such as reliable delivery of
data, preservation of message ordering, and preservation of message
boundaries.
In systems before 4.2BSD, pipes were implemented using the filesystem; when
sockets were introduced in 4.2BSD, pipes were reimplemented as sockets.
The kernel keeps for each process a descriptor table, which is a table that the
kernel uses to translate the external representation of a descriptor into an
internal representation. (The descriptor is merely an index into this table.) The
descriptor table of a process is inherited from that process's parent, and thus
access to the objects to which the descriptors refer also is inherited. The main
ways that a process can obtain a descriptor are by opening or creation of an
object, and by inheritance from the parent process. In addition, socket IPC
allows passing of descriptors in messages between unrelated processes on the
same machine.
Every valid descriptor has an associated file offset in bytes from the beginning of the
object. Read and write operations start at this offset, which is updated after each data
transfer. For objects that permit random access, the file offset also may be set with the
lseek system call. Ordinary files permit random access, and some devices do, as well.
Pipes and sockets do not.
When a process terminates, the kernel reclaims all the descriptors that were in
use by that process. If the process was holding the final reference to an object,
the object's manager is notified so that it can do any necessary cleanup actions,
such as final deletion of a file or deallocation of a socket.
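A minimal sketch tying these calls together on an ordinary file (the file name notes.txt is hypothetical):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[64];
        int fd = open("notes.txt", O_RDWR | O_CREAT, 0644);  /* obtain a descriptor */
        if (fd < 0) { perror("open"); return 1; }

        write(fd, "hello, descriptor\n", 18);   /* write advances the file offset */
        lseek(fd, 0, SEEK_SET);                 /* reposition to the start        */
        ssize_t n = read(fd, buf, sizeof buf);  /* read back from the new offset  */
        if (n > 0)
            write(STDOUT_FILENO, buf, n);       /* descriptor 1 is standard output */

        close(fd);                              /* deallocate the descriptor */
        return 0;
    }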

2. I/O Hardware

There are 3 basic hardware operations: read from a hardware module, write to a
module, and write to the output data stream. The read from a module also has a variation
that allows reading into a variable or directly into the output data stream. In addition,
there is a clear crate statement which performs the appropriate operation for that crate.
Output, either explicitly done with the output statement, or implicitly done by a hardware
read operation, is assumed to be in units of 4-byte integers. Each time a code section is
called, it produces a single bank of 4-byte integers. The bank header (including bank
length) is inserted automatically.
3. A KERNEL I/O SUBSYSTEM

3.1 KERNEL MODULES

3.1.1 Introduction
Sections of kernel code that can be compiled, loaded, and unloaded independently of
the rest of the kernel.
A kernel module may typically implement a device driver, a file system, or a
networking protocol
The module interface allows third parties to write and distribute, on their own terms,
device drivers or file systems that could not be distributed under the GPL
Kernel modules allow a Linux system to be set up with a standard, minimal kernel,
without any extra device drivers built in
Three components to Linux module support:
module management
driver registration
conflict resolution

3.1.2 Module Management
Supports loading modules into memory and letting them talk to the rest of the kernel
Module loading is split into two separate sections:
Managing sections of module code in kernel memory
Handling symbols that modules are allowed to reference
The module requestor manages loading requested, but currently unloaded, modules;
it also regularly queries the kernel to see whether a dynamically loaded module is
still in use, and will unload it when it is no longer actively needed

3.1.3 Driver Registration
Allows modules to tell the rest of the kernel that a new driver has become available
The kernel maintains dynamic tables of all known drivers, and provides a set of
routines to allow drivers to be added to or removed from these tables at any time
Registration tables include the following items:
Device drivers
File systems
Network protocols
Binary format

3.1.4 Conflict Resolution
A mechanism that allows different device drivers to reserve hardware resources and
to protect those resources from accidental use by another driver
The conflict resolution module aims to:
Prevent modules from clashing over access to hardware resources
Prevent autoprobes from interfering with existing device drivers
Resolve conflicts with multiple drivers trying to access the same hardware






4. STREAMS
A STREAM is a full-duplex communication channel between a user-level process
and a device.

A STREAM consists of:
- STREAM head interfaces with the user process
- driver end interfaces with the device
- zero or more STREAM modules between them.

Each module contains a read queue and a write queue

Message passing is used to communicate between queues


4.1 The STREAMS Structure









5 PERFORMANCE

I/O is a major factor in system performance:
Demands CPU to execute device driver, kernel I/O code
Context switches due to interrupts
Data copying
Network traffic especially stressful


5.1 Improving Performance
Reduce number of context switches
Reduce data copying
Reduce interrupts by using large transfers, smart controllers, polling
Use DMA
Balance CPU, memory, bus, and I/O performance for highest throughput



6. MASS-STORAGE SYSTEMS

6.1 Overview of Mass Storage Structure

Magnetic disks provide bulk of secondary storage of modern computers
Drives rotate at 60 to 200 times per second
Transfer rate is rate at which data flow between drive and computer
Positioning time (random-access time) is time to move disk arm to desired
cylinder (seek time) and time for desired sector to rotate under the disk head
(rotational latency)

A head crash results from the disk head making contact with the disk surface.
That's bad.
Disks can be removable
Drive attached to computer via I/O bus
Busses vary, including EIDE, ATA, SATA, USB, Fibre Channel, SCSI
Host controller in computer uses bus to talk to disk controller built into drive or
storage array

Moving-head Disk Mechanism














Magnetic tape
Was early secondary-storage medium
Relatively permanent and holds large quantities of data
Access time slow
Random access ~1000 times slower than disk
Mainly used for backup, storage of infrequently-used data, transfer medium
between systems
Kept in spool and wound or rewound past read-write head
Once data under head, transfer rates comparable to disk
20-200GB typical storage
Common technologies are 4mm, 8mm, 19mm, LTO-2 and SDLT

6.2 DISK STRUCTURE

6.2.1 Introduction
Disk drives are addressed as large 1-dimensional arrays of logical blocks, where the
logical block is the smallest unit of transfer.
The 1-dimensional array of logical blocks is mapped into the sectors of the disk
sequentially.
Sector 0 is the first sector of the first track on the outermost cylinder.
Mapping proceeds in order through that track, then the rest of the tracks in
that cylinder, and then through the rest of the cylinders from outermost to
innermost.

6.2.2 Disk Attachment
Host-attached storage accessed through I/O ports talking to I/O busses
SCSI itself is a bus, up to 16 devices on one cable, SCSI initiator requests operation
and SCSI targets perform tasks
Each target can have up to 8 logical units (disks attached to the device controller).
FC is high-speed serial architecture
Can be a switched fabric with a 24-bit address space; this is the basis of storage area
networks (SANs), in which many hosts attach to many storage units.
Can be arbitrated loop (FC-AL) of 126 devices

6.2.3 Network-Attached Storage
Network-attached storage (NAS) is storage made available over a network rather
than over a local connection (such as a bus)
NFS and CIFS are common protocols
Implemented via remote procedure calls (RPCs) between host and storage
New iSCSI protocol uses IP network to carry the SCSI protocol












6.2.4 Storage Area Network
Common in large storage environments (and becoming more common)
Multiple hosts attached to multiple storage arrays flexible















6.3 DISK SCHEDULING

6.3.1 Introduction
The operating system is responsible for using the hardware efficiently; for the disk
drives, this means having fast access time and high disk bandwidth.
Access time has two major components:
Seek time is the time for the disk arm to move the heads to the cylinder
containing the desired sector.
Rotational latency is the additional time waiting for the disk to rotate the
desired sector under the disk head.
Minimize seek time
Seek time ~ seek distance
Disk bandwidth is the total number of bytes transferred, divided by the total time
between the first request for service and the completion of the last transfer.
Several algorithms exist to schedule the servicing of disk I/O requests.
We illustrate them with a request queue (0-199).

98, 183, 37, 122, 14, 124, 65, 67

Head pointer 53
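The head-movement totals quoted for FCFS and SSTF in the subsections below (640 and 236 cylinders) can be checked with a short program; a minimal sketch for this particular queue:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 8
    static const int queue[N] = { 98, 183, 37, 122, 14, 124, 65, 67 };
    #define START 53

    static int fcfs(void)
    {
        int head = START, total = 0;
        for (int i = 0; i < N; i++) {           /* serve strictly in arrival order */
            total += abs(queue[i] - head);
            head = queue[i];
        }
        return total;
    }

    static int sstf(void)
    {
        int pending[N], head = START, total = 0;
        for (int i = 0; i < N; i++) pending[i] = queue[i];
        for (int served = 0; served < N; served++) {
            int best = -1;
            for (int i = 0; i < N; i++)         /* pick the closest pending request */
                if (pending[i] >= 0 &&
                    (best < 0 || abs(pending[i] - head) < abs(pending[best] - head)))
                    best = i;
            total += abs(pending[best] - head);
            head = pending[best];
            pending[best] = -1;                 /* mark as served */
        }
        return total;
    }

    int main(void)
    {
        printf("FCFS total head movement: %d cylinders\n", fcfs());   /* 640 */
        printf("SSTF total head movement: %d cylinders\n", sstf());   /* 236 */
        return 0;
    }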











6.3.2 FCFS
FCFS














Illustration shows total head movement of 640 cylinders.

6.3.3 SSTF
Selects the request with the minimum seek time from the current head position.
SSTF scheduling is a form of SJF scheduling; may cause starvation of some
requests.
Illustration shows total head movement of 236 cylinders.

SSTF















6.3.4 SCAN
SCAN
The disk arm starts at one end of the disk, and moves toward the other end, servicing
requests until it gets to the other end of the disk, where the head movement is
reversed and servicing continues.
Sometimes called the elevator algorithm.
Illustration shows total head movement of 208 cylinders.






















6.3.5 C-SCAN
C-SCAN
Provides a more uniform wait time than SCAN.
The head moves from one end of the disk to the other, servicing requests as it
goes. When it reaches the other end, however, it immediately returns to the
beginning of the disk, without servicing any requests on the return trip.
Treats the cylinders as a circular list that wraps around from the last cylinder to
the first one.























6.3.6 C-LOOK

C-LOOK
Version of C-SCAN
Arm only goes as far as the last request in each direction, then reverses direction
immediately, without first going all the way to the end of the disk.



















6.3.7 Selecting a Disk-Scheduling Algorithm

SSTF is common and has a natural appeal
SCAN and C-SCAN perform better for systems that place a heavy load on the disk.
Performance depends on the number and types of requests.
Requests for disk service can be influenced by the file-allocation method.
The disk-scheduling algorithm should be written as a separate module of the
operating system, allowing it to be replaced with a different algorithm if necessary.
Either SSTF or LOOK is a reasonable choice for the default algorithm.

6.4. DISK MANAGEMENT

Low-level formatting, or physical formatting: dividing a disk into sectors that
the disk controller can read and write.
To use a disk to hold files, the operating system still needs to record its own data
structures on the disk.
Partition the disk into one or more groups of cylinders.
Logical formatting or making a file system.
Boot block initializes system.
The bootstrap is stored in ROM.
Bootstrap loader program.
Methods such as sector sparing used to handle bad blocks.




Booting from a Disk in Windows 2000















6.5. SWAP-SPACE MANAGEMENT

Swap space: virtual memory uses disk space as an extension of main memory.
Swap space can be carved out of the normal file system, or, more commonly, it can
be in a separate disk partition.
Swap-space management
4.3BSD allocates swap space when process starts; holds text segment (the
program) and data segment.
Kernel uses swap maps to track swap-space use.
Solaris 2 allocates swap space only when a page is forced out of physical
memory, not when the virtual memory page is first created.
Data Structures for Swapping on Linux Systems



















7. RAID
RAID is an acronym first defined by David A. Patterson, Garth A. Gibson, and Randy
Katz at the University of California, Berkeley in 1987 to describe a redundant array of
inexpensive disks [1], a technology that allowed computer users to achieve high levels of
storage reliability from low-cost and less reliable PC-class disk-drive components, via the
technique of arranging the devices into arrays for redundancy.
Marketers representing industry RAID manufacturers later reinvented the term to describe
a redundant array of independent disks as a means of dissociating a "low cost"
expectation from RAID technology [2].

"RAID" is now used as an umbrella term for computer data storage schemes that can
divide and replicate data among multiple hard disk drives. The different
schemes/architectures are named by the word RAID followed by a number, as in RAID 0,
RAID 1, etc. RAID's various designs involve two key design goals: increase data
reliability and/or increase input/output performance. When multiple physical disks are set
up to use RAID technology, they are said to be in a RAID array. This array distributes
data across multiple disks, but the array is seen by the computer user and operating
system as one single disk. RAID can be set up to serve several different purposes.

The common RAID levels are summarized below, giving for each level the minimum number of disks and the space efficiency (usable capacity, in units of one disk, for an array of n disks).

RAID 0 - "Striped set without parity" or "Striping". Minimum disks: 2. Space efficiency: n.
Provides improved performance and additional storage but no redundancy or fault tolerance. Because there is no redundancy, this level is not actually a Redundant Array of Inexpensive Disks, i.e. not true RAID. However, because of the similarities to RAID (especially the need for a controller to distribute data across multiple disks), simple stripe sets are normally referred to as RAID 0. Any disk failure destroys the array, and the consequences grow with the number of disks in the array (at a minimum, catastrophic data loss is twice as severe compared to single drives without RAID). A single disk failure destroys the entire array because when data is written to a RAID 0 array, the data is broken into fragments; the number of fragments is dictated by the number of disks in the array. The fragments are written to their respective disks simultaneously on the same sector. This allows smaller sections of the entire chunk of data to be read off the drives in parallel, increasing bandwidth. RAID 0 does not implement error checking, so any error is unrecoverable. More disks in the array means higher bandwidth, but greater risk of data loss.

RAID 1 - "Mirrored set without parity" or "Mirroring". Minimum disks: 2. Space efficiency: 1 (size of the smallest disk).
Provides fault tolerance from disk errors and failure of all but one of the drives. Increased read performance occurs when using a multi-threaded operating system that supports split seeks, at the cost of a very small performance reduction when writing. The array continues to operate as long as at least one drive is functioning. Using RAID 1 with a separate controller for each disk is sometimes called duplexing.

RAID 2 - Hamming code parity. Minimum disks: 3.
Disks are synchronized and striped in very small stripes, often in single bytes/words. Hamming-code error correction is calculated across corresponding bits on the disks and is stored on multiple parity disks.

RAID 3 - Striped set with dedicated (bit-interleaved or byte-level) parity. Minimum disks: 3. Space efficiency: n-1.
This mechanism provides fault tolerance similar to RAID 5. However, because the stripe across the disks is much smaller than a filesystem block, reads and writes to the array perform like a single drive with high linear write performance. For this to work properly, the drives must have synchronized rotation. If one drive fails, performance does not change.

RAID 4 - Block-level parity. Minimum disks: 3. Space efficiency: n-1.
Identical to RAID 3, but does block-level striping instead of byte-level striping. In this setup, files can be distributed between multiple disks. Each disk operates independently, which allows I/O requests to be performed in parallel, though data transfer speeds can suffer due to the type of parity. Error detection is achieved through dedicated parity stored on a separate, single disk unit.

RAID 5 - Striped set with distributed (interleave) parity. Minimum disks: 3. Space efficiency: n-1.
Distributed parity requires all drives but one to be present to operate; a failed drive requires replacement, but the array is not destroyed by a single drive failure. Upon drive failure, any subsequent reads can be calculated from the distributed parity such that the drive failure is masked from the end user. The array will lose data in the event of a second drive failure and is vulnerable until the data that was on the failed drive has been rebuilt onto a replacement drive. A single drive failure in the set results in reduced performance of the entire set until the failed drive has been replaced and rebuilt.

RAID 6 - Striped set with dual distributed parity. Minimum disks: 4. Space efficiency: n-2.
Provides fault tolerance from two drive failures; the array continues to operate with up to two failed drives. This makes larger RAID groups more practical, especially for high-availability systems, and becomes increasingly important because large-capacity drives lengthen the time needed to recover from the failure of a single drive. Single-parity RAID levels are vulnerable to data loss until the failed drive is rebuilt: the larger the drive, the longer the rebuild will take. Dual parity gives time to rebuild the array without the data being at risk if a (single) additional drive fails before the rebuild is complete.
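To make the striping and parity ideas in the table concrete, the following minimal C sketch illustrates two of them: the RAID 0 style mapping of a logical block number to a (disk, stripe) position, and the RAID 5 style XOR parity that lets the contents of a failed disk be reconstructed from the surviving disks. The array size, block size, and all function names are assumptions made up for this illustration, not part of any real RAID implementation.

#include <stdio.h>
#include <string.h>

#define NUM_DISKS   4          /* disks in the hypothetical array     */
#define BLOCK_SIZE  16         /* bytes per block, kept tiny for demo */

/* RAID 0 style mapping: logical block -> (disk index, stripe number). */
static void map_block(int logical, int *disk, int *stripe)
{
    *disk   = logical % NUM_DISKS;   /* round-robin across the disks   */
    *stripe = logical / NUM_DISKS;   /* row of the stripe on that disk */
}

/* RAID 5 style parity: XOR of all data blocks in one stripe. */
static void compute_parity(unsigned char data[][BLOCK_SIZE], int ndata,
                           unsigned char parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int d = 0; d < ndata; d++)
        for (int b = 0; b < BLOCK_SIZE; b++)
            parity[b] ^= data[d][b];
}

/* Rebuild the block of a failed disk: XOR of parity and the survivors. */
static void rebuild_block(unsigned char data[][BLOCK_SIZE], int ndata,
                          const unsigned char parity[BLOCK_SIZE],
                          int failed, unsigned char out[BLOCK_SIZE])
{
    memcpy(out, parity, BLOCK_SIZE);
    for (int d = 0; d < ndata; d++)
        if (d != failed)
            for (int b = 0; b < BLOCK_SIZE; b++)
                out[b] ^= data[d][b];
}

int main(void)
{
    int disk, stripe;
    map_block(10, &disk, &stripe);
    printf("logical block 10 -> disk %d, stripe %d\n", disk, stripe);

    unsigned char data[3][BLOCK_SIZE] = { "AAAAAAAAAAAAAAA",
                                          "BBBBBBBBBBBBBBB",
                                          "CCCCCCCCCCCCCCC" };
    unsigned char parity[BLOCK_SIZE], recovered[BLOCK_SIZE];

    compute_parity(data, 3, parity);
    rebuild_block(data, 3, parity, 1, recovered);  /* pretend disk 1 failed */
    printf("recovered block: %.15s\n", recovered);
    return 0;
}

RAID 6 extends the same idea with a second, independently computed parity block per stripe (commonly a Reed-Solomon code), which is why it can survive two simultaneous drive failures.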



7.1 STABLE STORAGE
Stable storage is a classification of computer data storage technology that guarantees
atomicity for any given write operation and allows software to be written that is robust
against some hardware and power failures. To be considered atomic, upon reading back a
just-written portion of the disk, the storage subsystem must return either the newly written
data or the data that was on that portion of the disk before the write operation. Most
computer disk drives are not considered stable storage because they do not guarantee atomic
writes: a subsequent read of the area that was just written to may return an error rather
than either the new or the prior data.
Multiple techniques have been developed to achieve the atomic property from weakly-
atomic devices such as disks. Writing data to a disk in two places in a specific way is one
technique and can be done by application software. Most often though, stable storage
functionality is achieved by mirroring data on separate disks via RAID technology (level
1 or greater). The RAID controller implements the disk writing algorithms that enable
separate disks to act as stable storage. The RAID technique is robust against a single
disk failure in an array of disks, whereas the software technique of writing to two separate
areas of the same disk only protects against some kinds of internal disk media failures,
such as bad sectors, in single-disk arrangements.
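The "write the data to the disk in two places" technique mentioned above can be sketched in user-space C. This is only an illustrative sketch under simplifying assumptions: two ordinary files (copyA.dat and copyB.dat, names invented for the example) stand in for the two disk areas, fsync() is taken to flush data durably, and the checksum is deliberately trivial.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct record {
    uint32_t checksum;          /* simple additive checksum of the payload */
    char     payload[60];
};

static uint32_t checksum(const char *p, size_t n)
{
    uint32_t s = 0;
    while (n--) s += (unsigned char)*p++;
    return s;
}

static int write_copy(const char *path, const struct record *r)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (write(fd, r, sizeof *r) != sizeof *r || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}

int stable_write(const struct record *r)
{
    /* Complete (write + flush) copy A before touching copy B, so that at
     * any instant at least one copy on disk is internally consistent.   */
    if (write_copy("copyA.dat", r) != 0) return -1;
    return write_copy("copyB.dat", r);
}

int main(void)
{
    struct record r;
    memset(&r, 0, sizeof r);
    strcpy(r.payload, "balance=1000");
    r.checksum = checksum(r.payload, sizeof r.payload);
    return stable_write(&r) == 0 ? 0 : 1;
}

Recovery would read both copies, discard any copy whose checksum does not verify, and, if both verify but differ, finish the interrupted update by copying the first-written copy over the other; because at least one internally consistent copy always exists, the write as a whole appears atomic.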
7.2 TERTIARY STORAGE
Tertiary storage or tertiary memory, provides a third level of storage. Typically it
involves a robotic mechanism which will mount (insert) and dismount removable mass
storage media into a storage device according to the system's demands; this data is often
copied to secondary storage before use. It is primarily used for archival of rarely accessed
information since it is much slower than secondary storage (e.g. 5-60 seconds vs. 1-10
milliseconds). This is primarily useful for extraordinarily large data stores, accessed
without human operators. Typical examples include tape libraries and optical jukeboxes.
When a computer needs to read information from the tertiary storage, it will first consult a
catalog database to determine which tape or disc contains the information. Next, the
computer will instruct a robotic arm to fetch the medium and place it in a drive. When the
computer has finished reading the information, the robotic arm will return the medium to
its place in the library.
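The access sequence just described can be condensed into a short C sketch. Every function here (catalog_lookup, robot_load, drive_read, robot_unload) is a made-up stub standing in for whatever interface a real tape library or optical jukebox exposes; only the order of the steps reflects the text above.

#include <stdio.h>
#include <string.h>

typedef struct { int slot; int drive; } medium_t;

static medium_t catalog_lookup(const char *file)   /* which tape/disc holds the data? */
{
    (void)file;
    medium_t m = { 42, 0 };                         /* pretend the catalog said slot 42 */
    return m;
}

static int robot_load(medium_t m)                   /* robotic arm mounts the medium */
{
    printf("loading medium from slot %d into drive %d\n", m.slot, m.drive);
    return 0;
}

static int drive_read(medium_t m, const char *file, char *buf, long len)
{
    (void)m;
    printf("staging %s to secondary storage\n", file);
    strncpy(buf, "archived data", (size_t)len - 1);
    buf[len - 1] = '\0';
    return 0;
}

static int robot_unload(medium_t m)                 /* arm returns the medium to its slot */
{
    printf("returning medium to slot %d\n", m.slot);
    return 0;
}

int tertiary_read(const char *file, char *buf, long len)
{
    medium_t m = catalog_lookup(file);   /* 1. consult the catalog database */
    if (robot_load(m) != 0)              /* 2. mount the medium in a drive  */
        return -1;
    int rc = drive_read(m, file, buf, len); /* 3. copy the data for use     */
    robot_unload(m);                     /* 4. return the medium afterwards */
    return rc;
}

int main(void)
{
    char buf[64];
    return tertiary_read("/archive/2003/logs.tar", buf, sizeof buf);
}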


8. CASE STUDY
8.1 I/O in Linux
Input and Output
The Linux device-oriented file system accesses disk storage through two caches:
Data is cached in the page cache, which is unified with the virtual memory system
Metadata is cached in the buffer cache, a separate cache indexed by the physical
disk block.

Linux splits all devices into three classes:
block devices allow random access to completely independent, fixed size blocks of
data
character devices include most other devices; they don't need to support the
functionality of regular files.
network devices are interfaced via the kernel's networking subsystem.


Device-Driver Block Structure (figure omitted)

Block Devices
Provide the main interface to all disk devices in a system.

The block buffer cache serves two main purposes:
it acts as a pool of buffers for active I/O
it serves as a cache for completed I/O

The request manager manages the reading and writing of buffer contents to and from a
block device driver.
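As a rough illustration of what the request manager does, the sketch below keeps a small list of pending block requests and merges a new request with an existing one when their block ranges are adjacent and of the same kind, which is the essential idea behind sorting and merging in a request queue. The structures and function names are invented for this illustration and are not the actual Linux kernel interfaces.

#include <stdbool.h>
#include <stdio.h>

/* A toy pending-request list: real kernels keep per-device queues and
 * sort/merge requests before handing them to the device driver.       */
struct request {
    long start;     /* first block of the transfer  */
    long count;     /* number of consecutive blocks */
    bool write;     /* read or write                */
};

#define MAX_PENDING 32
static struct request pending[MAX_PENDING];
static int npending;

/* Queue a request, merging it with an adjacent pending one if possible. */
void queue_request(long start, long count, bool write)
{
    for (int i = 0; i < npending; i++) {
        struct request *r = &pending[i];
        if (r->write == write && r->start + r->count == start) {
            r->count += count;            /* extend an adjacent request */
            return;
        }
    }
    if (npending < MAX_PENDING)
        pending[npending++] = (struct request){ start, count, write };
}

int main(void)
{
    queue_request(100, 8, false);
    queue_request(108, 8, false);         /* merges with the previous one */
    printf("pending requests: %d (first covers %ld blocks)\n",
           npending, pending[0].count);
    return 0;
}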

Character Devices
A device driver which does not offer random access to fixed blocks of data.
A character device driver must register a set of functions which implement the driver's
various file I/O operations.
The kernel performs almost no preprocessing of a file read or write request to a
character device, but simply passes on the request to the device.
The main exception to this rule is the special subset of character device drivers which
implement terminal devices, for which the kernel maintains a standard interface.
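As a concrete (and deliberately minimal) sketch of that registration, the loadable-module skeleton below registers a file_operations table for a hypothetical character device named "demo", assuming a reasonably recent Linux kernel and the older register_chrdev() helper rather than the full cdev interface. The kernel passes each read() on the device file straight to the registered handler, with essentially no preprocessing, as described above.

#include <linux/init.h>
#include <linux/module.h>
#include <linux/fs.h>

static const char msg[] = "hello from a character device\n";
static int major;                          /* major number assigned at load time */

/* read() on the device node is handed directly to this handler. */
static ssize_t demo_read(struct file *filp, char __user *buf,
                         size_t count, loff_t *ppos)
{
    return simple_read_from_buffer(buf, count, ppos, msg, sizeof(msg) - 1);
}

static const struct file_operations demo_fops = {
    .owner = THIS_MODULE,
    .read  = demo_read,
};

static int __init demo_init(void)
{
    major = register_chrdev(0, "demo", &demo_fops);  /* 0 = allocate a major */
    return major < 0 ? major : 0;
}

static void __exit demo_exit(void)
{
    unregister_chrdev(major, "demo");
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

If built as an out-of-tree module, loading it with insmod and creating a node with mknod (using the major number the kernel assigned) would let the message be read with cat; error handling and the cdev-based registration used by modern drivers are omitted to keep the sketch short.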


***************************
