
This is the collection of lecture slides* of the lecture Computer Architecture taught in the winter semester 06/07 at the University of Duisburg-Essen.

I slightly revised the overviews of the subjects and added slide numbers.
Stefan Freinatis, March 2007

Computer Architecture

* Actually, this is the internet version of the lecture slides. Compared to the slides used in the lectures, animations have been removed (errors hopefully as well) and additional text has been added.

Slide 1

Computer Architecture

WS 06/07

Dr.-Ing. Stefan Freinatis


Computer Architecture (slide 3)
Lecture: Dr.-Ing. Stefan Freinatis
Fachgebiet Verteilte Systeme (Prof. Geisselhardt), Room BB 1017
Exercises: Dipl.-Math. Kerstin Luck
Fachgebiet Verteilte Systeme, Room BB 910

Times & Dates (slide 4)
Lecture: 08:15 to 09:45
Exercise: 10:00 to 10:45

1. 25.10.06
   01.11.06 All Saints Day (public holiday in NRW, no lectures)
2. 08.11.06
3. 15.11.06
4. 22.11.06
5. 29.11.06
6. 06.12.06
7. 13.12.06
8. 20.12.06
9. 10.01.07
10. 17.01.07
11. 24.01.07
12. 31.01.07
13. 07.02.07

Resources (slide 5)
Homepage Verteilte Systeme: http://www.fb9dv.uni-duisburg.de/vs/de/index.htm
Select English, then Lectures, then Winter semester 2006/2007, then Computer Architecture.
Direct link to homepage of lecture Computer Architecture: http://www.fb9dv.uni-duisburg.de/vs/en/education/dv3/index2006.htm

Topics (slide 6)
Introduction & History
1. Operating Systems (slide 34)
   System layers, batching, multi-programming, time sharing
2. File Systems (slide 65)
   Storage media, files & directories, disk scheduling
3. Process Management (slide 151)
   Processes, threads, IPC, scheduling, deadlocks
4. Memory Management (slide 351)
   Memory, paging, segmentation, virtual memory, caches

Literature (slide 7)
[HP03] J. Hennessy, D. Patterson: Computer Architecture: A Quantitative Approach, 3rd ed., Elsevier Science, 2003, ISBN 1-55860-724-2.
[HP06] J. Hennessy, D. Patterson: Computer Architecture: A Quantitative Approach, 4th ed., Elsevier Science, 2006, ISBN 0-12-370490-1.
[Sil00] A. Silberschatz: Applied Operating System Concepts, 1st ed., John Wiley & Sons, 2000, ISBN 0-471-36508-4.
[Ta01] A. Tanenbaum: Modern Operating Systems, 2nd ed., Prentice Hall, 2001, ISBN 0-13-092641-8.


Introduction (slide 9)
Computer Architecture is the conceptual design and fundamental operational structure of a computer system [Wikipedia].
Computer Architecture encompasses [HP03 p.9]:
Instruction set architecture
  stack or accumulator or general purpose register architecture
Organization
  memory system, bus structure, CPU design
Hardware
  machine specifics, logic design, technology

Computer Application Areas (slide 10)
General purpose desktops
  balanced performance for a range of tasks, graphics, video, audio
Scientific desktops and servers
  high-performance floating point and graphics
Commercial servers
  databases, transaction processing, highly reliable
Embedded computing
  low power, small size, safety critical

Computer (slide 11)
A computer is a person or an apparatus that is capable of processing information by applying calculation rules.
  Generalized, technology-independent definition.
A computer is a machine for manipulating data according to a list of instructions known as a program [Wikipedia].

History (slide 12)
~ 5000 BC  Basis of calculating is counting. 10 fingers: decimal system
~ 1000 BC  Abacus (Suan Pan, Soroban)
[Figures: Chinese Suan Pan; Roman abacus]

History (slide 13)
[Figures: Book from 1958; finger technique (from Japanese book 1954)]
See also: http://www.ee.ryerson.ca/~elf/abacus/leeabacus/

History (slide 14)
300 BC to 1000 AD  Roman numeral system: addition system, no zero

Numeral  M     D    C    L   X   V  I
Value    1000  500  100  50  10  5  1

Value 19: XVIIII or XIX
Not suitable for performing multiplications.

History (slide 15)
~ 500 AD  Hindu-Arabic numeral system: place value system, introduction of 0
[Figure: Development of the numerals: Indian (3rd century BC), Indian (8th century), West-Arabic (11th century), European (15th century), European (16th century), today]
Forms the basis for the development of calculation on machines.

History (slide 16)
1623  Wilhelm Schickard: Calculation machine
1641  Blaise Pascal: Adding machine
1679  G.W. Leibniz: Dyadic system (binary system)
1808  J. M. Jacquard: Punch card controlled loom

History (slides 17-18)
1833  Charles Babbage: Difference Engine
      Data memory, program memory, instruction-based operation, conditional jumps, I/O unit
1847  George Boole: Logic on mathematical statements
1890  H. Hollerith: Punch card based tabulating machine
      Digital data logging on punch cards. First electromechanical data processing.

History (slide 19)
1936  Alan Turing: Philosophy of information, Turing machine. Founder of Computer Science
1941  Konrad Zuse: First electromechanical computer Z3. Binary arithmetic, floating point
[Figure: Z3, rebuilt in 1961]

History (slide 20)
Characteristics of the first 5 operative digital computers

Computer                  Nation   Shown working  Digital  Binary  Electronic  Programmable
Zuse Z3                   Germany  May 1941       Yes      Yes     No          By punched film stock
Atanasoff-Berry Computer  USA      Summer 1941    Yes      Yes     Yes         No
Colossus                  UK       1943           Yes      Yes     Yes         Partially, by rewiring
Harvard Mark I            USA      1944           Yes      No      No          By punched paper tape
ENIAC                     USA      1944           Yes      No      Yes         Partially, by rewiring
ENIAC                     USA      1948           Yes      No      Yes         By function table ROM

Information source: Wikipedia on Z3 or on ENIAC, English

History (slide 21)
1945  John v. Neumann: Concept of universal computer systems. Founder of Computer Architecture
[Figure: The von Neumann model of a universal computer (stored program computer). Units: Input, Memory, Output, ALU, Control; the connections carry data, instructions and control signals.]

v. Neumann Model (slide 22)
A computer consists of 5 units
Input Unit, Output Unit
  Communication with the environment
Memory
  Storage for program and data. Addressable storage locations. Read / Write.
Control Unit
  Interpretation of the program. Timing control of units.
ALU (Arithmetic Logic Unit)
  Performs calculations.

v. Neumann Model (slide 23)
Today:
Input unit and output unit are combined (not necessarily physically!) to form the Input/Output unit (short: I/O unit). The control unit and the ALU are combined to form the microprocessor.
[Figure: Microcomputer consisting of Input / Output (keyboard, monitor, ...), Memory and Microprocessor (CPU), connected by addresses, data and control lines]
The v. Neumann model (or architecture) basically still applies to the majority of modern computer systems.

Characteristics of the von Neumann Model (slide 24)
Architecture is independent of the problem to be processed
  Universal stored program computer, not tailored to a specific problem.
Random accessible memory locations
  Selection of location by means of an address. All locations have the same capacity.
Both program and data reside in memory
  The state of the machine (control unit) decides whether the content of a memory location is considered data or code.
Computer is centrally controlled
  CPU has the master role.
Sequential processing
  Execution of a program is done instruction by instruction.

v. Neumann Model (slide 25)
Steps in executing an instruction
1. Fetch instruction from memory and put it into the instruction register (in the CPU).
2. Evaluate instruction (decode instruction).
3. When needed for this particular instruction, address the data (the operands) in memory.
4. Fetch the data (usually into CPU internal registers).
5. Perform the operation on the data (usually this is carried out by the ALU) and write back the results.
6. Adjust the address counter to point to the next instruction.
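The six steps can be sketched as a toy stored-program machine. This is only an illustration: the instruction set (LOAD, ADD, STORE, HALT), the single accumulator register and the memory layout are assumptions made for this sketch and are not taken from the lecture.

```python
# Toy stored-program machine: program and data share one memory.
# The opcodes and the accumulator design are hypothetical.
def run(memory):
    pc = 0    # address counter (program counter)
    acc = 0   # single CPU-internal register
    while True:
        instruction = memory[pc]        # step 1: fetch into "instruction register"
        opcode, address = instruction   # step 2: decode
        if opcode == "HALT":
            return memory
        if opcode == "LOAD":            # steps 3 and 4: address and fetch operand
            acc = memory[address]
        elif opcode == "ADD":           # step 5: ALU operation
            acc = acc + memory[address]
        elif opcode == "STORE":         # step 5: write back the result
            memory[address] = acc
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        pc += 1                         # step 6: point to next instruction

# C = A + B; cells 0-3 hold code, cells 4-6 hold the data A, B, C
memory = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", 0), 2, 3, 0]
result = run(memory)
print(result[6])  # 5
```

Code and data sit in the same list, so only the state of the machine (the value of `pc`) decides whether a cell is treated as an instruction or as an operand.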

v. Neumann Bottleneck (slide 26)
Memory accesses in executing C = A + B
[Figure: Traffic between CPU side and memory side over the bus system (address bus, data bus), plotted over time. Instruction phase: address of instruction, instruction. Data phase: address of A, A; address of B, B; address of C, C. A, B, C: data in memory.]

v. Neumann Bottleneck (slide 27)
The data is processed faster by the CPU than it can be taken from or stored in memory.
The processor-memory interface is crucial for the overall computation performance.
Reduction of the bottleneck effect through introduction of a hierarchical memory organization: Register, Cache, Main memory.

Computer Performance (slide 28)
Performance: the work done in a certain amount of time.
Performance is like power:

$P = \frac{W}{t}$

Work can mean processing an instruction, carrying out a floating-point or an integer operation, or processing a standardized program (benchmark).

Computer Performance (slides 29-30)
Popular performance measures
Clock rate [Hz]
  The frequency at which the CPU is clocked.
MIPS
  Million instructions per second
FLOPS
  Floating point operations per second

Many performance measures are not very expressive, as they do not
  consider the number of instructions being carried out per cycle (parallel execution),
  cover the effective throughput between CPU and memory,
  distinguish between complex instruction set computer (CISC) and reduced instruction set computer (RISC).
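A small numeric sketch of P = W / t with instructions as the unit of work. The two machines, their clock rate and their instructions-per-cycle values are made-up numbers, chosen only to show why the clock rate alone says little.

```python
def performance(work, seconds):
    """P = W / t: units of work completed per second."""
    return work / seconds

def mips(instructions, seconds):
    """Million instructions per second."""
    return performance(instructions, seconds) / 1e6

# Hypothetical machines: same 1 GHz clock, different instructions per cycle.
clock_hz = 1_000_000_000
seconds = 2.0
cycles = clock_hz * seconds
scalar_mips = mips(cycles * 1, seconds)       # 1 instruction per cycle
superscalar_mips = mips(cycles * 2, seconds)  # 2 instructions per cycle
print(scalar_mips, superscalar_mips)  # 1000.0 2000.0
```

Both machines report the same clock rate, yet one completes twice the work in the same time.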

Computer Performance (slide 31)
[Figure: Computer performance compared to a VAX-11/780 from 1978. Figure from [HP06 p.3]]

Moore's Law (slide 32)

Gordon E. Moore empirically observed in 1965 that the number of transistors on a chip doubles approximately every 12 months.
In 1975 he revised his prediction to the number of transistors on a chip doubling every two years.

Moore's Law:

$N(t) = N_0 \cdot 10^{0.15\,t}$

where t is in years.

See also: www.thocp.net/biographies/papers/moores_law.htm
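As a quick check, the exponent 0.15 per year in the formula corresponds to a doubling time of about two years. The starting value of 2300 transistors below is an arbitrary assumption for illustration.

```python
import math

def transistors(n0, years, exponent_per_year=0.15):
    """N(t) = N0 * 10**(0.15 * t), with t in years."""
    return n0 * 10 ** (exponent_per_year * years)

# Doubling time implied by the formula: solve 10**(0.15 * t) = 2 for t.
doubling_time = math.log10(2) / 0.15
print(round(doubling_time, 2))  # 2.01

# Hypothetical starting point: a chip with 2300 transistors at t = 0.
n_after_10_years = transistors(2300, 10)
print(round(n_after_10_years))
```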

Moore's Law (slide 33)
[Figure: Moore's Law. Image source: Wikipedia on Moore's Law, English]

Operating Systems (slide 34)
System layers (36)
Early computer systems (42)
Batch systems (46)
Multi-program systems (50)
Time sharing systems (54)
Modern systems (57)

Operating Systems (slide 35)
An operating system is a program that acts as an intermediary between a user of a computer and the computer hardware [Sil00 p.3].
Purpose: provision of an environment in which a user can execute programs.
Objectives:
  to make the system convenient to use
    Usability, extending the machine beyond low-level hardware programming
  to use the hardware in an efficient manner
    Resource management, managing the hardware allocation among different programs

System Layers (slide 36)
[Figure: Computer system layers. Figure from [Sil00 p.4]]

System Layers (slide 37)
[Figure: Computer system layers. Figure from lecture CA WS 05/06, original source unknown]

System Layers (slide 38)
1. Hardware: provides basic computing resources (CPU, memory, I/O, and devices connected to I/O).
2. Operating system: coordinates the use of the hardware among the various application programs for the various users.
3. Application programs: the programs used to solve the computing problems of the users.
4. User: people, machines, or other computers using the computer system.

Operating Systems (slide 39)
Usability: the operating system as an Extended Machine
The architecture of most computers at the machine language level is awkward to program, especially for I/O. The operating system shields the programmer from the hardware details, provides simple(r) interfaces, offers high-level abstractions and, in this view, presents the user with the equivalent of an extended machine.
See also [Ta01 p.4]

Operating Systems (slide 40)
Resource Management: the operating system as a Resource Manager
Computer resources: processor(s), memory, timer, disks, network interfaces, printer, graphics card, ...
The operating system keeps track of who is using which resource, grants or denies resource requests, and accounts for the usage of resources.
See also [Ta01 p.5]
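The three duties of a resource manager (tracking, granting or denying, accounting) can be sketched in a few lines. The class, its method names and the resource and user names are all hypothetical; a real operating system does this with far more elaborate data structures.

```python
class ResourceManager:
    """Tracks ownership of exclusive resources and counts grants per user."""

    def __init__(self, resources):
        self.owner = {name: None for name in resources}  # who is using which resource
        self.grants = {}                                  # accounting: grants per user

    def request(self, user, resource):
        # Deny if the resource is unknown or already in use.
        if resource not in self.owner or self.owner[resource] is not None:
            return False
        self.owner[resource] = user
        self.grants[user] = self.grants.get(user, 0) + 1
        return True

    def release(self, user, resource):
        # Only the current owner can give a resource back.
        if self.owner.get(resource) == user:
            self.owner[resource] = None

manager = ResourceManager(["printer", "disk"])
first = manager.request("alice", "printer")   # granted
second = manager.request("bob", "printer")    # denied, printer is busy
manager.release("alice", "printer")
third = manager.request("bob", "printer")     # granted after release
print(first, second, third, manager.grants)
```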

Resource Management (slide 41)
Resource management may be divided into time management (e.g. CPU time, printer time) and space management (e.g. memory or disk space).
Resource management incorporates process management, memory management, file system management, and device management.
Before going into these subjects, let's have a look at the computer development since 1945.

Early Computer Systems (slide 42)
First computer generation (1945-55)
Vacuum tubes
A single group of people did all the work
  design, construction, programming, operating, maintenance
Programming in machine language
  plugboard, no programming languages
Users directly interact with computer system
Programs directly interact with hardware
No operating system

Early Computer Systems (slide 43)
First computer generation (1945-55)
[Figure: IBM 407 Accounting Machine, an electromechanical tabulator. Source: http://www.columbia.edu/acis/history/407.html]

Early Computer Systems (slide 44)
First computer generation (1945-55)
[Figure: Wiring panel (plugboard), IBM 402 plugboard. Source: http://www.columbia.edu/acis/history/plugboard.html]

Batch Systems (slide 46)
Second computer generation (1955-65)
Transistors, mainframe computers
First high-level programming languages
  Fortran (Formula translation), Algol (Algorithmic language), Lisp (List processing)
No direct user interaction with computer
  Everything went via the computer operators.
Users submit job to operator
  job = program + data + control information
Operator batched jobs
  Composition of jobs with similar needs

Batch Systems (slide 47)
Second computer generation (1955-65)
[Figure: Batch job processing scene [Tanenbaum]. Figure from [Ta01 p.8]]

Batch Systems (slide 48)
Second computer generation (1955-65)
[Figure: Structure of a typical FMS (Fortran Monitor System) batch job. Figure from [Ta01 p.9]]

Batch Systems (slide 49)
Second computer generation (1955-65)
Resident monitor program in memory
  Monitor program loading one job after another (from tape).
Sequenced job input
  Jobs from tape or from card reader. Monitor program cannot select jobs on its own.
One job in memory at a time
CPU often idle
  waiting for slow I/O devices
[Figure: Memory holding the monitor program and one job]

Multiprogram Systems (slide 50)
Third computer generation (1965-80)
Integrated circuits
Disks
  Direct access to several jobs on disk. Now the operating system can select jobs (job scheduling).
Multiprogrammed batch systems
  Several jobs in memory at the same time. Operating system shares CPU time among the jobs (CPU scheduling). Better CPU utilization.
[Figure: Memory holding the operating system and job 1, job 2, job 3, job 4]

Multiprogram Systems (slide 51)
Assume program A being executed on a single-program computer. The program needs two I/O operations.
  CPU usage over time: A1, I/O A, A2, I/O A, A3
Assume program B being executed on the same computer at some other time. The program needs no I/O.
  CPU usage over time: B1, B2, B3

Multiprogram Systems (slide 52)
Total execution time on a single-program computer:
  A1, I/O A, A2, I/O A, A3, followed by B1, B2, B3
Now assume program A and B being executed on a multi-program computer.
  CPU usage over time: A1, B1 (during I/O A), A2, B2 (during I/O A), A3, B3
Total execution time on a multi-program computer:
  shorter, because B uses the CPU while A waits for I/O
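The comparison of the two total execution times can be reproduced with a minimal model. The burst lengths are invented time units, and the model assumes that A resumes immediately when its I/O completes and that switching between programs costs nothing.

```python
def single_program_time(a_cpu, a_io, b_cpu):
    """Run A to completion (CPU idles during A's I/O), then run B."""
    return sum(a_cpu) + sum(a_io) + sum(b_cpu)

def multi_program_time(a_cpu, a_io, b_cpu):
    """B gets the CPU while A waits for I/O; leftover B work runs after A ends."""
    b_remaining = sum(b_cpu)
    elapsed = 0
    for i, burst in enumerate(a_cpu):
        elapsed += burst                              # A computes
        if i < len(a_io):
            elapsed += a_io[i]                        # A waits for I/O ...
            b_remaining -= min(b_remaining, a_io[i])  # ... B uses that time
    return elapsed + b_remaining

# Hypothetical durations: A1, A2, A3 = 2 each; two I/O waits of 3; B1, B2, B3 = 3 each
single = single_program_time([2, 2, 2], [3, 3], [3, 3, 3])
multi = multi_program_time([2, 2, 2], [3, 3], [3, 3, 3])
print(single, multi)  # 21 15
```

With these numbers the CPU is never idle in the multi-program case, which is the better CPU utilization the slides refer to.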

Multiprogram Systems (slide 53)
Third computer generation (1965-80)
Multiprogram computers were still batch systems
Desire for quicker response time
  It took hours/days until output was ready. A single misplaced comma could cause a compilation to fail, and the programmer wasted half a day [Ta01 p.11].
Desire for interactivity
  Users wanted to have the machine for themselves, working online.
Requests paved the way for time sharing systems (still in third computer generation).

Time Sharing Systems (slide 54)
Third computer generation (1965-80)
Direct user interaction
  Many users share a computer simultaneously. Terminals and host.
Multiple job execution with high-frequency switching
  Operating system must provide more sophisticated CPU scheduling.
Disk as backing store for memory
  Virtual memory
  Operating system: swapping, address translation, protecting memory (memory management)
Many jobs awaiting execution
Disk as input / output storage
  Need for the OS to manage user data (file system management)

Time Sharing Systems (slide 55)
Assume program A and B as previously. Execution on a time sharing system:
[Figure: CPU usage over time. A and B alternate in small time slices; after program B has finished, the CPU is idle during I/O A.]
Small time slices allow for interactivity (quasi-parallel execution).
Time sharing is not necessarily faster. Compare to the multiprogramming example:
  CPU usage over time: A1, B1 (during I/O), A2, B2 (during I/O), A3, B3

Memory Layout (slide 56)
Batch system: monitor program, job
Multi-program system: operating system, job 1, job 2, job 3, job 4
Time sharing system: operating system, program 1, program 2, program 3, ..., program n (working memory)
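Switching between programs in small time slices can be sketched with a round-robin queue. Round robin is used here as one simple scheduling policy, not necessarily the one the lecture has in mind; the CPU demands and the slice length are invented, and I/O is ignored.

```python
from collections import deque

def round_robin(cpu_demand, time_slice):
    """Return the CPU hand-outs as (program, start, end) tuples."""
    queue = deque(cpu_demand.items())
    schedule, clock = [], 0
    while queue:
        name, remaining = queue.popleft()
        run = min(time_slice, remaining)
        schedule.append((name, clock, clock + run))
        clock += run
        if remaining > run:
            queue.append((name, remaining - run))  # not finished: back of the queue
    return schedule

# Hypothetical CPU demands in time units, slice length 1
schedule = round_robin({"A": 3, "B": 2}, time_slice=1)
print(schedule)
```

The last slice ends at time 5, the same total as running A and B one after the other: the slices buy responsiveness, not speed.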

Modern Systems (slide 57)
Fourth computer generation (1980 to present)
Single-chip CPUs
Personal computers
  CP/M
  MS-DOS, DR-DOS
  Windows 1.0 ... Windows 98 / ME
  Windows NT 4.0 ... 2003, XP
  XENIX, MINIX, Linux, FreeBSD
Real-time systems
Multiprocessor systems
Distributed systems
Embedded systems

Real Time Systems (slide 58)
Rigid time requirements
Hard real time
  Industrial control & robotics
  Guaranteed response times
  Slimmed OS features (no virtual memory)
Soft real time
  Multimedia, virtual reality
  Less restrictive time requirements

Multiprocessor Systems (slide 59)
n processors in system (n > 1), tightly coupled
  Resource sharing
Symmetric multiprocessing
  Each CPU runs an identical copy of the OS. All CPUs are peers (no master-slave).
Asymmetric multiprocessing
  Each CPU is assigned a specific task. Task assignment by master CPU.
[Figure: Several users on one operating system running on several CPUs]

Distributed Systems (slide 60)
n computers/processors (n > 1), loosely coupled
  Individual computers
  Autonomous operation
  Communication via network
Network operating system
  File sharing
  Message exchange

Embedded Systems (slide 61)
Dedicated to specific tasks
Encapsulated in host device
  invisible, usually not repaired when defective
Small in size, low energy
Sometimes safety-critical
  automotive drive-by-wire, medical apparatus
Custom(ized) operating system
  Little or no file I/O, sometimes multitasking, no fancy OSs.

Resource Management (slide 62)
File system management
  Creation and organization of a logical storage location where data (user data, system data, programs) can be persistently stored in terms of files. Assigning rights and managing accesses. Maintenance.
Process management
  Creation of processes (programs in execution) and sharing the CPU among them. Control of execution time. Enabling communication between processes.
Memory management
  Assigning memory areas to processes. Organizing virtual memory.
Device management
  Low-level administrative work related to the specifics of the I/O devices. Translations, low-level stream processing. Usually done by device drivers.

Operating Systems (slide 63)
An operating system in the wide sense is the software package for making a computer operable.
The operating system in the narrow sense is the one program running all the time on the computer (the kernel). It consists of several tasks and is asked for services through system calls.
[Figure: Image source: Wikipedia on kernel, English]

Operating Systems (slide 64)
Operating system categories
Single User - Single Tasking
Single User - Multi Tasking
Multi User - Single Tasking
Multi User - Multi Tasking
Examples: MS-DOS, Windows, MacOS, CP/M, Unix, VMS

File System Management (slides 65-66)
Storage Media (67)
Magnetic Disks (71)
Files and Directories (81, 90)
File Implementation (98)
Directory Implementation (114)
Free Block Management (124)
File System Layout (129)
Cylinder skew, disk scheduling (135)
Floppy Disks (145)

Storage Media (slide 67)
[Figure: Storage hierarchy with primary storage and secondary storage. Figure from [Sil00 p.31]]

Storage Media (slide 68)
[Figure: Cost versus access time for DRAM and magnetic disks [HP06 p.359]. Access time from low to high, with marks at 1 ms and 10 ms; Flash is marked as well.]

Storage Media (slide 69)
Requirements for secondary storage
Store large amounts of data
  Much more data than fits into (virtual) memory
Persistent store
  The information must survive the termination of the process creating or using it.
Concurrent access to data
  Multiple processes should be able to access the data simultaneously.
Storage of data on secondary storage media in terms of files.

Magnetic Disks (slide 71)
[Figure: Magnetic disk drive principle. Computer with host controller, disk drive with disk controller. Figure from [Sil00 p.29]]

Magnetic Disks (slide 72)
Sector: Smallest addressable unit on a magnetic disk.
  Data size between 32 and 4096 bytes (standard 512 bytes).
[Figure: A disk sector with 512 bytes of data. Figure from [Ta01 p.315]]
Several sectors may be combined to form a logical block. The composition is usually performed by a device driver. In this way the higher software layers only deal with abstract devices that all have the same block size, independent of the physical sector size.
Such a block is also termed a cluster.

Magnetic Disks (slide 73)
Formatted disk capacity
  = bytes per sector
  x sectors per track                            (capacity of a track)
  x number of tracks on a platter (cylinders)    (capacity of one platter side)
  x tracks per cylinder (heads)                  (capacity of all platter sides = disk capacity)
C = cylinders, H = heads = tracks per cylinder, S = sectors per track
Example: CHS = (7, 2, 9), sector size 512 bytes: Capacity = 7 x 2 x 9 x 512 bytes = 64512 bytes = 63 kB

Magnetic Disks (slide 74)
[Table: Disk parameters for the original IBM PC floppy disk and a Western Digital WD 18300 hard disk [Ta01 p.301].]
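The capacity formula is easy to check in code. The function name and parameter names are chosen for this sketch; the numbers are the CHS = (7, 2, 9) example from the slide.

```python
def disk_capacity_bytes(cylinders, heads, sectors_per_track, bytes_per_sector=512):
    """Formatted capacity = cylinders x heads x sectors per track x bytes per sector."""
    return cylinders * heads * sectors_per_track * bytes_per_sector

capacity = disk_capacity_bytes(cylinders=7, heads=2, sectors_per_track=9)
print(capacity, capacity / 1024)  # 64512 63.0
```

The 63 kB on the slide is therefore 63 x 1024 bytes.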

Magnetic Disks (slide 75)
On older disks the number of sectors per track was the same for all cylinders.
The physics of the inner track sectors defined the maximum number of bytes per sector. Physically, the outer sectors could have stored more bytes than defined, as their areas are bigger.
Waste of space / capacity
[Figure: Physical disk geometry]

Magnetic Disks (slide 76)
Modern disks are divided into zones with more sectors in the outer zones than in the inner zones (zone bit recording).
[Figure: Physical geometry (left) and corresponding virtual geometry example (right). One marked area must be seen as two sectors. Figure from [Ta01 p.302], modified]

Magnetic Disks (slide 77)
Physical geometry: The true physical disk layout. With modern disks only the internal electronics know about it.
  CHS (for old disks) or not published any more
Virtual geometry: The published disk layout to the external world (device driver, operating system, user).
  CHS (e.g. WD 18300 example)
  LBA (logical block addressing): Disk sectors are just numbered consecutively without regard to the physical geometry.
A disk is a random access storage device.

Magnetic Disks (slide 78)
Low-level formatting: Creation of the physical geometry on the disk platters. Defective disk areas are masked out and replaced by spare areas. Done by disk drive internal software.
Partitioning: The disk is divided into independent partitions, each logically acting as a separate disk. Definition of a master boot record in the first sector of the disk. Done by application program.
High-level formatting: A partition receives a boot block and an empty file system (free storage administration, root directory). Done by application program or by operating system administration tool.
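Consecutive numbering of sectors can be illustrated by converting between a CHS triple and a block number. The conversion below follows the common convention that cylinders and heads count from 0 and sectors from 1; the lecture does not state a formula, so treat this as an assumption. The geometry is the (7, 2, 9) example used earlier.

```python
def chs_to_lba(c, h, s, heads, sectors_per_track):
    """Consecutive block number for cylinder c, head h (from 0) and sector s (from 1)."""
    return (c * heads + h) * sectors_per_track + (s - 1)

def lba_to_chs(lba, heads, sectors_per_track):
    """Inverse mapping under the same numbering convention."""
    c = lba // (heads * sectors_per_track)
    h = (lba // sectors_per_track) % heads
    s = lba % sectors_per_track + 1
    return c, h, s

HEADS, SPT = 2, 9  # virtual geometry with 7 cylinders, 2 heads, 9 sectors per track
first_block = chs_to_lba(0, 0, 1, HEADS, SPT)
last_block = chs_to_lba(6, 1, 9, HEADS, SPT)
print(first_block, last_block, lba_to_chs(last_block, HEADS, SPT))  # 0 125 (6, 1, 9)
```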

Logical Disk Layout (slide 79)
Magnetic disks
[Figure: Logical disk layout, showing the file system. Figure from [Ta01 p.400], modified]

Files (slide 81)
A file is a named collection of related information recorded on secondary storage. [Sil00 p.346]
A file is a logical storage unit. It is an abstract data type. [Sil00 p.345, 347]
Files are an abstraction mechanism for storing information and retrieving it later. [after Ta01 p.380]

File Structure (slide 82)
[Figure: Logical file structure examples [Ta01 p.382]]

File Structure (slide 83)
a) Byte sequence
  Unstructured. The OS does not know or care what is in the file. Meaning is imposed by the application program. Maximum flexibility. Approach used by Unix and Windows.
b) Sequence of records (fixed-length)
  Each record has some internal structure. Background idea: read / write operations from secondary storage have record size.
c) Tree of records
  Highly structured. Records may be of variable size. Access to a record through a key (e.g. Pony). Lookup / read / write / append are performed by the OS, not by the application program. Approach used in large mainframe computers (commercial data processing systems).

File Access (slide 84)
Sequential Access
  Simple and most common. Based on the tape model of a file. Data is processed in order (byte after byte or record after record). Operations: read, write, rewind. Records need not be of the same length (e.g. text files with each line being a record; remember Pascal readln, writeln).
[Figure from [Sil00 p.355], modified]

File Access (slide 85)
Direct Access
  Bytes or fixed-length logical records. Records are numbered. Access can be in no particular order. Access by record number. Based on the disk model of a file. Useful for immediate access to large data records (e.g. database). Operations: read, write, seek.
[Figure: Bytes or records numbered 1 to 12, with a file pointer positioned by seek. Figure from [Sil00 p.355], modified]

File Access (slide 86)
Indexed Access
  Index file holds keys. Keys point to records within the relative file. Suited for tree structures.
[Figure: Example of index file and relative file, figure from [Sil00 p.358]]
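Sequential and direct access differ only in how the file pointer moves. The sketch below uses an in-memory file from Python's standard library as a stand-in for a file on disk; the record length of 8 bytes and the record contents are invented for the example.

```python
import io

RECORD_SIZE = 8  # hypothetical fixed record length in bytes

def read_record(file, number):
    """Direct access: seek to record `number` (counted from 0) and read it."""
    file.seek(number * RECORD_SIZE)
    return file.read(RECORD_SIZE)

# An in-memory stand-in for a file holding 12 fixed-length records.
file = io.BytesIO(b"".join(f"rec{n:05d}".encode() for n in range(12)))

# Sequential access: rewind, then read record after record in order.
file.seek(0)
first_two = [file.read(RECORD_SIZE), file.read(RECORD_SIZE)]

# Direct access: records can be read in any order.
ninth = read_record(file, 8)
third = read_record(file, 2)
print(first_two, ninth, third)
```

Direct access relies on the fixed record length: the byte offset of a record is simply its number times the record size.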

File Names (slide 87)
Name assigned by creation process
  andrew  2day  urgent!  fig_14
Case sensitivity
  Andrew  andrew  ANDREW
  Unix: case sensitive. MS-DOS: not case sensitive.
Two-part file names: basename.extension
  readme.txt  prog.c.Z  lecture.doc
  Extensions are often just conventions, not mandated by the operating system (although convenient when the OS knows about them).

File Attributes (slide 88)
Additional information about a file. Assigned by the operating system. Stored in the file system.
Which attributes exist depends on the operating system and the file system.
Some possible file attributes
  Access rights: Who can access the file and in what way?
  Creation date: Date of file creation
  Text / binary flag: Whether the file content is text or binary
  Temp flag: If set, the file is a temporary file and is deleted on process exit
  Hidden flag: Whether or not the file name is displayed in listings
  File type: Regular file or directory file or ...

File Types

Directories (Windows, Unix)
Files for maintaining the logical structure of the file system

Block special files, character special files (Unix)

Regular files (Windows, Unix)

Text files (also termed ASCII files)
Contain bytes (words in Unicode) according to a standardized character set, such as EBCDIC, ASCII or Unicode. The content is directly printable (screen, printer). Data.

Binary files
Contents not intended to be printed (at least directly). Content has meaning only to those programs using the files. Program (binary executable) or data.
Slide 89

Directories
A directory is a named logical place to put files in.

Single-level directory
[Figure from [Sil00 p.360]: one directory holding one entry per file. The directory entry for the file called records points to the file content of records on the storage media.]
Early operating systems (CP/M, MS-DOS 1.0)
Still used in tiny embedded systems
File names are unique
Slide 90

Directories
Two-level directory
[Figure: root directory with the sub directories user1, user2, user3, user4. Figure from [Sil00 p.361], modified]
Hierarchical structure (tree of depth 1)
Absolute file names, relative file names, path names
Absolute file names: /user1/test, /user3/test
Relative file names: test, ../user4/data
Path name: /user3
Absolute file names are unique
Slide 91

Directories
Multi-level directory
Figure from [Sil00 p.363]
Slide 92

Multi-Level Directory
Generalization of two-level directory
Hierarchical structure of arbitrary depth
Tree structure, graph structure. Logical organization structure.

One root directory
Arbitrary number of sub (sub sub ...) directories

Efficient file search
Tree / graph traversing routines. Much faster than sequential search.

Logical grouping
System files, user files, shared files, ...

Most common structure
Slide 93

Multi-Level Directory
Acyclic graph directory structure
Additional directory entries (links)
Shared directories
Shared files
More than one absolute name for a file (or a directory)
Dangling link problem
[Figure: acyclic graph with a shared directory and shared files. Figure from [Sil00 p.365]]
Slide 94

Multi-Level Directory
General graph directory structure
Allowing links to point to directories creates the possibility of cycles.

Avoiding cycles:
Forbid any links to directories (no more shared directories then)
Use a cycle detection algorithm
Figure from [Sil00 p.365]
Slide 95

File System Management
Now turning from the user's view to the implementor's view. Users are concerned with how files are named, what operations are allowed, and what the directories look like. Implementors are interested in how files and directories are stored on the disk, how the disk space is managed, and how to make everything work efficiently.
Slide 96

File System Management


Storage Media (67)
Magnetic Disks (71)
Files and Directories (81, 90)
File Implementation (98)
Directory Implementation (114)
Free Block Management (124)
File System Layout (129)
Cylinder skew, disk scheduling (135)
Floppy Disks (145)
Slide 97

File Implementation
The most important issue in implementing files is how the available disk space is allocated to a file.

Contiguous Allocation
Linked Allocation
  Chained Blocks
  Chained Pointers
Indexed Allocation
Slide 98

Contiguous Allocation
Each file occupies a set of contiguous blocks on the disk. A file is defined by its disk address (first block) and by its length in block units.

Advantage
Simple implementation
For each file we just need to know its start block and its length.
Fast access
Access in one continuous operation. Minimum head seeks.

Disadvantage
Disk fragmentation
Problem of finding space for a new file. The final file size must be known in advance!
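The two points above (a file is just start block plus length; finding space is the hard part) can be illustrated with a small sketch. The helper names first_fit and physical_block, the byte-per-block occupancy array and the first-fit strategy are assumptions chosen for illustration; the slides do not prescribe a particular search strategy.

```c
#include <stddef.h>

/* Find the first run of `need` contiguous free blocks (first fit).
 * used[i] != 0 means block i is occupied.
 * Returns the start block of the hole, or -1 if no hole is large enough. */
long first_fit(const unsigned char *used, size_t nblocks, size_t need)
{
    size_t run = 0;                     /* length of the current free run */
    if (need == 0)
        return -1;
    for (size_t i = 0; i < nblocks; i++) {
        run = used[i] ? 0 : run + 1;
        if (run == need)
            return (long)(i + 1 - need);
    }
    return -1;
}

/* With contiguous allocation, logical block k of a file that starts at
 * `start` and is `length` blocks long is simply block start + k. */
long physical_block(long start, long length, long k)
{
    return (k >= 0 && k < length) ? start + k : -1;
}
```

Note that first_fit can fail even when the total number of free blocks exceeds the request, because the free blocks are not contiguous.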
Slide 99 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

(a) Contiguous allocation of disk space for 7 files (b) State of the disk after files D and E have been removed
Figure from [Ta01 p.401]
Slide 100 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Contiguous Allocation
External Fragmentation
Free disk space is broken into chunks (holes) which are spread all over the disk. New files are put into available holes, often not filling them up entirely and thus leaving smaller holes. A big problem arises when the largest available hole is too small for a new file.

Internal Fragmentation
A file usually does not fill up its last block entirely, so the remaining space in the block is left unused.
Slide 101

Linked Allocation
Chained blocks
Each file is a linked list of disk blocks. The blocks may be scattered anywhere on the disk. Besides its data, each block holds a pointer to the next block. The pointer is a number (a block number).
[Figure: blocks on disk, each holding data and a next pointer; the pointer of the last block is nil]
Slide 102

Linked Allocation
The file jeep starts with block 9. It consists of the blocks 9, 16, 1, 10, and 25 in this order.
Figure from [Sil00 p.380]

Advantage
Simple implementation
Only first block number needed.
No external fragmentation
Files consist of blocks scattered on the disk. No more useless blocks on disk.
Slide 103

Linked Allocation
Disadvantage
Free space management
Somehow all the free blocks must be recorded in some free-block pool.
Higher access time
More seeks to access the whole file owing to block scattering.
Space reduction
Some bytes of each block are needed for the pointer.
Reliability
If a pointer is broken, the remainder of the file is inaccessible.
Not efficient for random access
To get to block k we must walk along the chain.
Slide 104

Linked Allocation
In particular the last disadvantage of the chained blocks allocation method, its unsuitability for random access to files, led to the chained pointers allocation method.

A table contains as many entries as there are disk blocks. The entries are numbered by block number. The block numbers of a file are linked in this table in a chain (as with chained blocks). This table is called the file allocation table (FAT).
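A rough sketch of how such a table is walked: to reach logical block k of a file, the chain is followed k times inside the in-memory table, without touching the data blocks themselves. The function name fat_block, the use of plain int entries and the end-of-chain marker FAT_EOF are assumptions of this sketch, not the on-disk FAT format.

```c
#define FAT_EOF (-1)    /* end-of-chain marker; the value is an assumption */

/* Return the disk block that holds logical block k of a file whose chain
 * starts at block `first`. Returns -1 if the file has fewer than k + 1
 * blocks or if a table entry points outside the table. */
int fat_block(const int *fat, int nblocks, int first, int k)
{
    int b = first;
    if (k < 0)
        return -1;
    for (int i = 0; i < k; i++) {
        if (b < 0 || b >= nblocks)
            return -1;                  /* chain ended or broken pointer */
        b = fat[b];                     /* follow the chain in the table */
    }
    return (b >= 0 && b < nblocks) ? b : -1;
}
```

With a table that chains 9, 16, 1, 10, 25 (the block sequence of the file jeep used in the chained blocks example), fat_block(fat, n, 9, 3) yields block 10.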
Figure from [Sil00 p.382]
Slide 105

Linked Allocation
[Figures: chained block allocation compared with chained pointer allocation (FAT). Figures from [Ta01 p.403,404], modified]
The FAT is stored on disk and is loaded into memory when the operating system starts up.
Slide 106

Linked Allocation
Chained pointers

Advantage
Simple implementation
One simple table for both file allocation and free-block pool.
Whole block available for data
No more pointers taking away data space.
Suitable for random access
Although the principle of getting to block k did not change, the search (counting) is now done on the block numbers, not on the blocks themselves.

Disadvantage
FAT takes up disk space and memory (when cached)
One table entry for each disk block. Table size proportional to disk size.
Higher access time (compared to contiguous allocation)
It still needs many seeks to collect all the scattered blocks.
Slide 107

Indexed Allocation
Each file is assigned an index block. The index block is an array of block numbers, listing in order the blocks belonging to the file. To get to block k of a file, one reads the kth entry of the index block.
[Figure: an index block on disk whose entries point to the data blocks of the file; unused entries are nil]
Slide 108

Indexed Allocation
The file jeep is described by index block 19. The index block has 8 entries of which 5 are used.
Index blocks are also called index nodes, or i-nodes or inodes for short.
Figure from [Sil00 p.383]
Slide 109

Indexed Allocation
Advantage
Good for random access
Fast determination of block k of a file.
Less memory occupied
Index blocks are loaded into memory only for those files currently in use (open files).
Less disk space occupied
Only as many index blocks are needed as there are files in the file system.

Disadvantage
Free block management
A separate free-block pool must be available.
Index block utilization
Unused entries in an index block waste space.
Slide 110

Indexed Allocation
What if a file needs more blocks than there are entries available in an index block?

Linked index blocks
The last entry in an index block points to another index block (chaining).

Multilevel index blocks
An entry does not point to the data, but to a first-level index block (single indirect block) which then points to the data. Optionally, additional levels are available through second-level and third-level index blocks.

Combined scheme
Most entries point to the data directly. The remaining entries point to first-level, second-level and third-level index blocks. Used by Unix.
Slide 111

Indexed Allocation
Combined scheme example (Unix V7)
Figure from [Ta01 p.447], modified
Note: The inodes are not disk blocks, but records stored in disk blocks. The single / double / triple indirect blocks are disk blocks.
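To make the combined scheme more concrete, the sketch below computes how many indirect blocks have to be read before the data block of logical block k can be located. The function name indirection_level and both parameters (number of direct entries, number of block numbers that fit into one indirect block) are assumptions of this sketch; concrete values depend on the file system and are not taken from the slides.

```c
/* Number of indirect blocks that must be read to locate logical block k:
 * 0 = direct entry, 1 = single, 2 = double, 3 = triple indirect,
 * -1 = block number too large for the scheme.
 * ndirect:   direct entries in the inode (assumed parameter)
 * per_block: block numbers per indirect block (assumed parameter) */
int indirection_level(unsigned long long k, unsigned ndirect, unsigned per_block)
{
    unsigned long long n = per_block;

    if (k < ndirect)
        return 0;
    k -= ndirect;
    if (k < n)
        return 1;
    k -= n;
    if (k < n * n)
        return 2;
    k -= n * n;
    if (k < n * n * n)
        return 3;
    return -1;
}
```

For example, with 10 direct entries and 256 block numbers per indirect block, blocks 0 to 9 are reached directly, blocks 10 to 265 need one extra disk read, and the largest file has 10 + 256 + 256^2 + 256^3 blocks.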
Slide 112 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis


Directory Implementation
Before a file can be accessed, it must first be opened by the operating system. For that, the OS uses the path name supplied by the user to locate the directory entry. A directory entry provides the name of the file, the information needed to find the blocks of the file, and information about the file's attributes.
Slide 114 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Entry
Attribute placement
The attributes may be stored
a) together with the file name in the directory entry (MS-DOS, VMS)
b) or off the directory entry (Unix)
Figure from [Ta01 p.406]
Slide 115

Directory Entry
MS-DOS directory entry
Figure from [Ta01 p.440]
Directory entry size: 32 bytes
File attributes are stored in the entry. The first block number identifies the first block of the file and, with it, the corresponding entry in the FAT (DOS uses chained pointers).
Slide 116

Directory Entry
Unix directory entry (Unix V7)
Entry size: 16 bytes.
Modern Unix versions allow for longer file names.
File attributes are stored in the inode. The rest of the inode points to the file blocks.
Slide 117

Directory Entry
MS-DOS file attributes
Figure from [Ta01 p.440]
Attributes in the directory entry: A D V S H R
A: Archive flag
D: Directory flag
V: Volume label flag
S: System file flag
H: Hidden flag
R: Read-only flag
Slide 118

Directory Implementation
An MS-DOS directory (not the entry) is itself a file (a binary file) with the file type attribute set to directory. The disk blocks pointed to contain other directory entries (each again 32 bytes in size) which describe either files or further directories (sub directories). When an MS-DOS file system is installed, a root directory is created automatically. The same applies to Unix: when the file type attribute is set to directory, the file blocks contain directory entries. Windows 2000 and its descendants (NTFS) treat directories as entities different from files.
Slide 119 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Implementation
MS-DOS directory
[Figure: a disk block holding directory entries. Entries with the directory attribute set point to disk blocks containing further directory entries. Entries with the directory attribute not set are regular files and point to disk blocks containing file data.]
Slide 120

File Lookup
How to find a file name in a directory

Linear Search
Each directory entry has to be compared against the search name (string compare). Slow for large directories.

Binary Search
Needs a sorted directory (by name). Entering and deleting files requires moving directory entries around in order to keep them sorted (insertion sort).

Hash Table
In addition to each file name, a hash value (a number) is created and stored. The search is then done on the hash value, not on the name.

B-tree
File names are nodes and leaves in a balanced tree. Used by NTFS.
Slide 121

File Lookup
The steps in looking up the file /usr/ast/mbox in classical Unix
Figure from [Ta01 p.447]
Slide 122
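Linear search, the simplest of the lookup methods listed above, might look as follows for one directory of 16-byte entries in the style of Unix V7. The struct layout (2-byte inode number, 14-byte name), the convention that inode number 0 marks an unused entry, and the function name dir_lookup are assumptions of this sketch.

```c
#include <stdint.h>
#include <string.h>

#define NAME_LEN 14

/* 16-byte directory entry: inode number plus file name. */
struct dir_entry {
    uint16_t inode;             /* 0 marks an unused entry (assumption) */
    char     name[NAME_LEN];    /* padded with '\0'; a 14-character name
                                   has no terminating '\0' */
};

/* Linear search: compare every used entry against the search name.
 * Returns the inode number, or 0 if the name is not in the directory. */
uint16_t dir_lookup(const struct dir_entry *dir, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (dir[i].inode != 0 &&
            strncmp(dir[i].name, name, NAME_LEN) == 0)
            return dir[i].inode;
    }
    return 0;
}
```

Looking up a path such as /usr/ast/mbox repeats this search once per path component, each time in the directory found by the previous step.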


Free Block Management


To keep track of the blocks available for allocation (free blocks), the operating system must somehow maintain a free block pool. When a file is created, the pool is searched for free blocks. When a file is deleted, the freed blocks are added to the pool. File systems using a FAT do not need a separate free block pool. Free blocks are simply marked in the table by a 0.

Free Block Pool Implementations:
Linked List
Free List
Bit Map
Slide 124

Free Block Management
Linked List
The free blocks form a linked list where each block points to the next one (chained blocks).
Figure from [Sil00 p.388]

Simple Implementation
Only first block number needed.
Quick Access
New blocks are prepended (LIFO principle).
Disk I/O
Updating the pointers involves I/O.
Block Modification
Modified content hinders undelete of the block.
Slide 125

Free Block Management
Free List
The free block numbers are listed in a table. The table is stored in disk blocks. The table blocks may be linked together.
Figure from [Ta01 p.413]

Space
Each free block requires 4 bytes in the table.
Management
Adding and deleting block numbers needs time, in particular when a table block is almost full (additional disk I/O required).
Slide 126

Free Block Management
Bit Map
To each existing block on disk a bit is assigned. When a block is free, the bit is set. When the block is occupied, the bit is reset (or vice versa). All bits form a bit map.
Figure from [Ta01 p.413]

Compact
Each block is represented by a single bit. Fixed size.

Logical order
Neighboring bits represent neighboring blocks (logical order). Quite easy to find contiguous blocks, or blocks located close together.

Conversion block number <-> bit position
From the block number the corresponding bit position must be calculated and vice versa.
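The conversion between block number and bit position is a division and a remainder. The sketch below assumes 8-bit bytes, the convention "bit set = block free", and bit (b mod 8) of byte (b div 8) for block b; the function names are made up for this example.

```c
#include <stdint.h>

/* Block number -> bit position: byte b / 8, bit b % 8. */
void mark_free(uint8_t *map, uint32_t b)
{
    map[b / 8] |= (uint8_t)(1u << (b % 8));
}

void mark_used(uint8_t *map, uint32_t b)
{
    map[b / 8] &= (uint8_t)~(1u << (b % 8));
}

int is_free(const uint8_t *map, uint32_t b)
{
    return (map[b / 8] >> (b % 8)) & 1;
}

/* Bit position -> block number (the inverse conversion). */
uint32_t bit_to_block(uint32_t byte_index, unsigned bit)
{
    return byte_index * 8 + bit;
}

/* Lowest-numbered free block, or -1 if every block is occupied. */
long find_free(const uint8_t *map, uint32_t nblocks)
{
    for (uint32_t b = 0; b < nblocks; b++)
        if (is_free(map, b))
            return (long)b;
    return -1;
}
```

Because neighboring bits stand for neighboring blocks, a search for several adjacent set bits finds contiguous free blocks.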

Slide 127

File System Layout
Each partition starts with a boot block (first block) which is followed by the file system. The boot block may be modified by the file system.
Figure from [Ta01 p.400], modified
Slide 129

File System Layout
Layout of FAT file system
File system areas: FAT | FAT copy | Root dir | Files and directories
Information about the filesystem location is stored in the boot block.
FAT copy: a copy of the FAT for reliability reasons.
Root dir: the number of entries in the root directory is limited, except for FAT-32 where the root directory is a cluster chain.

Microsoft FAT-32 specification at
http://www.microsoft.com/whdc/system/platform/firmware/fatgen.mspx
Slide 130

File System Layout
Possible file system layouts for a UNIX file system
Layout 1: Super block | Inodes | Root dir | Files and directories
Layout 2: Super block | Free block pool | Inodes | Root dir | Files and directories
Super block: information about the file system (block size, volume label, size of inode list, next free inode, next free block, ...)
Free block pool: bit map free block management
The inode for the root directory is located at a fixed place.
Slide 131

File System Layout
Layout of NTFS file system
File system areas: MFT | System files | File area
Information about the filesystem location is stored in the boot block.
MFT: Master File Table. Linear sequence of 1 kB records. Each record describes one file or directory. The MFT is a file and may be located anywhere on disk.
System files: files for storing metadata about the file system. Actually, the MFT itself is a system file.

More about NTFS: http://www.ntfs.com/ntfs_basics.htm
Slide 132


Cylinder Skew
Cylinder skew example
Assumption: Reading from inner tracks towards outer tracks. Here: skew = 3 sectors. After the head has moved to the next track, sector 0 arrives just in time. Reading can continue right away. Performance improvement when reading multiple tracks.
Physical disk geometry, figure from [Ta01 p.316]
Slide 134

Disk Scheduling
Modern disk drives are addressed as large one-dimensional arrays of logical blocks, where the logical block is the smallest unit of transfer. The array of logical blocks is mapped onto the sectors of the disk sequentially. Sector 0 is the first sector of the first track on the outermost cylinder. Mapping proceeds in order through that track, then through the rest of the tracks in that cylinder, and then through the rest of the cylinders from outermost to innermost.

However, it is difficult to convert a logical block number into CHS:
The disk may have defective sectors which are replaced by spare sectors from elsewhere on the disk.
Owing to zone bit recording, the number of sectors per track is not the same for all cylinders.
After [Sil00 p.436]
Slide 135

Disk Scheduling
Fast access desired (high disk bandwidth)
Disk bandwidth is the total number of bytes transferred, divided by the total time from the first request for service until completion of the last transfer.

Bandwidth depends on
Seek time, the time for the disk to move the heads to the cylinder containing the desired sector.
Rotational latency, the additional time waiting for the disk to rotate the desired sector to the disk head.

Seek time ≈ seek distance. Scheduling goal: minimizing seek time.

In earlier days scheduling was done by the OS. Nowadays it is done either by the OS (which then has to guess the physical disk geometry) or by the integrated disk drive controller.
Slide 136

Disk Scheduling
Scheduling Algorithms
First-Come First-Served (FCFS)
Shortest Seek Time First (SSTF)
SCAN
C-SCAN
C-LOOK

For the following examples: Assume that there are 200 tracks on a single-sided disk. Read requests are held in a queue. The queue currently holds the requests for tracks 98, 183, 37, 122, 14, 124, 65 and 67. The head is at track 53.
Slide 137

FCFS
The requests are serviced in the order of their entry (first entry is served first).
[Figure: head position (track) over time. Figure from [Sil00 p.437]]
Slide 138

SSTF
The next request served is the one that is closest to the current position (shortest seek time).
[Figure: head position (track) over time. Figure from [Sil00 p.438]]
Slide 139

SCAN
The disk arm starts at one end of the disk and sweeps over to the other end, thereby servicing the requests. At the other end the head reverses direction and servicing continues on the return trip.
[Figure: head position (track) over time. Figure from [Sil00 p.439]]
Slide 140
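To compare FCFS and SSTF numerically, the total head movement can be computed for a request queue. The sketch below is illustrative only: the function names, the fixed limit MAX_REQ and the tie-breaking rule (the earlier queue entry wins) are assumptions. For the example queue 98, 183, 37, 122, 14, 124, 65, 67 with the head at track 53 it gives 640 tracks for FCFS and 236 tracks for SSTF.

```c
#include <stdlib.h>

#define MAX_REQ 64      /* sketch limit on the queue length (assumption) */

/* FCFS: total head movement when requests are served in queue order. */
int fcfs_distance(int head, const int *req, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += abs(req[i] - head);
        head = req[i];
    }
    return total;
}

/* SSTF: always serve the pending request closest to the current head
 * position. Returns the total head movement, or -1 if n > MAX_REQ. */
int sstf_distance(int head, const int *req, int n)
{
    int done[MAX_REQ] = {0};
    int total = 0;

    if (n > MAX_REQ)
        return -1;
    for (int served = 0; served < n; served++) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (!done[i] &&
                (best < 0 || abs(req[i] - head) < abs(req[best] - head)))
                best = i;
        }
        done[best] = 1;
        total += abs(req[best] - head);
        head = req[best];
    }
    return total;
}
```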

C-SCAN
The disk arm starts at one end of the disk and sweeps over to the other end, thereby servicing the requests. At the other end the head returns to the beginning of the disk without servicing on the return trip.
[Figure: head position (track) over time. Figure from [Sil00 p.440]]
Slide 141

C-LOOK
Like SCAN or C-SCAN, but the head moves only as far as the final request in each direction.
[Figure: head position (track) over time. Figure from [Sil00 p.441]]
Slide 142
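The head movement of the LOOK variants can be derived from the lowest and highest requested track alone, without simulating each step. The following sketch makes several assumptions: the function names are invented, track numbers are non-negative, C-LOOK services requests while moving towards higher track numbers, and the return seek of C-LOOK is counted as head movement. For the example queue with the head at track 53, LOOK gives 208 tracks when it first sweeps towards lower tracks, and C-LOOK gives 322 tracks.

```c
/* LOOK: sweep in one direction as far as the last request, then reverse.
 * upward != 0: first sweep towards higher track numbers. */
int look_distance(int head, const int *req, int n, int upward)
{
    int lo = head, hi = head;
    for (int i = 0; i < n; i++) {
        if (req[i] < lo) lo = req[i];
        if (req[i] > hi) hi = req[i];
    }
    if (upward)
        return (hi - head) + (lo < head ? hi - lo : 0);
    return (head - lo) + (hi > head ? hi - lo : 0);
}

/* C-LOOK: service while moving upward, jump back to the lowest pending
 * request, then continue upward to the last request below the start. */
int clook_distance(int head, const int *req, int n)
{
    int lo = head, hi = head, below_max = -1;
    for (int i = 0; i < n; i++) {
        if (req[i] < lo) lo = req[i];
        if (req[i] > hi) hi = req[i];
        if (req[i] < head && req[i] > below_max) below_max = req[i];
    }
    if (below_max < 0)
        return hi - head;               /* nothing behind the head */
    return (hi - head) + (hi - lo) + (below_max - lo);
}
```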

Disk Scheduling
SSTF is common and has a natural appeal.
SCAN and C-SCAN perform better for systems that place a heavy load on the disk.
Performance depends on the number and types of requests.
Requests for disk service are influenced by the file allocation method.
Either SSTF or LOOK is a reasonable choice as default algorithm.
Slide 143

Floppy Disks
Portable storage media
8" floppy in 1969
5.25" floppy in 1978
3.5" floppy in 1987
Figure from www.computermuseum.li

Capacity:
8" disk: 80K ... 1.2M
5.25" disk: 360K ... 1.2M
3.5" disk: 720K, 1.44 MB

Floppy disks have by now been almost displaced by flash memory (e.g. USB stick), except for the purpose of booting computers (bootable floppies).
Slide 145

Floppy Disks
Two-sided floppy disk
[Figure: sector numbering on side 0 (front) and side 1 (back) of a two-sided floppy disk, showing track start, rotation direction and track index (0, 1, 2, ...). Sector identification: the BIOS uses (side, track, sector), e.g. 0,2,6; the BDOS uses a sector number, e.g. 42. The first sector is BIOS 0,0,1.]
Slide 146

Floppy Disks
BIOS = Basic Input Output System
Stored in (EP)ROM.
Sector access through invoking a software interrupt and addressing a sector by means of CHS.

BDOS = Basic Disk Operating System
Originates from the CP/M operating system.
Higher abstraction level than BIOS.
Sector access through invoking a software interrupt and addressing a sector by means of a logical consecutive sector number (1, 2, ...).
Slide 147

Floppy Disks
Sector Structure
[Figure: a track with its sectors and index marks. Labels in the figure: Sync, IAM, index; address field with track, head, sector, sector length, CRC; data field with DAM, data bytes (128-1024), ECC/CRC; inter-record gap.]
CRC: Cyclic Redundancy Check
ECC: Error Checking/Correction
IAM: Index Address Mark
DAM: Data Address Mark
Slide 148
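The relation between the two addressing schemes of the floppy disk (BIOS: side, track, sector; BDOS: consecutive sector number) can be sketched as a pair of conversion functions. The geometry used here, 2 sides with 9 sectors per track as on a 360 K disk, as well as the type and function names, are assumptions of this sketch. With them, BIOS address 0,2,6 maps to sector number 42 and BIOS address 0,0,1 maps to sector number 1.

```c
/* Assumed geometry: two-sided disk, 9 sectors per track (360 K format).
 * Sector numbers within a track and logical sector numbers start at 1. */
enum { SIDES = 2, SECTORS_PER_TRACK = 9 };

struct chs {
    int side;       /* 0 = front, 1 = back */
    int track;      /* 0, 1, 2, ... */
    int sector;     /* 1 .. SECTORS_PER_TRACK */
};

/* (side, track, sector) -> consecutive sector number */
int chs_to_logical(struct chs a)
{
    return (a.track * SIDES + a.side) * SECTORS_PER_TRACK + a.sector;
}

/* consecutive sector number -> (side, track, sector) */
struct chs logical_to_chs(int n)
{
    struct chs a;
    int idx = n - 1;                            /* 0-based sector index */
    a.sector = idx % SECTORS_PER_TRACK + 1;
    a.side   = (idx / SECTORS_PER_TRACK) % SIDES;
    a.track  = idx / (SECTORS_PER_TRACK * SIDES);
    return a;
}
```

The numbering fills track 0 on side 0 first (sectors 1 to 9), then track 0 on side 1 (sectors 10 to 18), then track 1 on side 0, and so on.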

Floppy Disks
Starting sector numbers for system and data areas (FAT file system). All numbers are in decimal notation.

Disk     Boot sector  FAT 1  FAT 2  Root dir  Data
360 K    1            2      4      6         13
720 K    1            2      5      8         15
1.2 M    1            2      9      16        30
1.44 M   1            2      11     20        34
Slide 149

Floppy Disks
Track 0 of a 360 kB floppy disk
[Figure: Track 0, side 0 holds sectors 1 to 9: bootstrap loader (1), FAT (2 to 5), root directory (6 to 9). Track 0, side 1 holds sectors 10 to 18: root directory (10 to 12), data (13 to 18).]
Dir. = allocated space for root directory
Slide 150

Computer Architecture
Process Management
Slide 151

Process Management
Processes (153)
Threads (178)
Interprocess Communication (IPC) (195)
Scheduling (247)
Real-Time Scheduling (278)
Deadlocks (318)
Slide 152

Processes
A process is a set of identifiable, repeatable actions which are ordered in some way and contribute to the fulfillment of an objective.
(General definition)

A process is a program in execution.
(Computer oriented definition)

Program: static, passive
A cooking recipe is a program.

Process: dynamic, active
Acting according to the recipe (cooking) is a process.
Slide 153

Process Model
Several processes are working quasi-parallel. A process is a unit of work. Conceptually, each process has its own virtual CPU.
In reality, the real CPU switches back and forth from process to process.
[Figure: sequential view and process model view; processes make progress over time. Figure from [Ta01 p.72]]
Slide 154

Processes
A process may be described as either (more or less)

a) CPU-bound
spends more time doing computations than I/O: few, very long CPU bursts.

b) I/O-bound
spends more time doing I/O than computations: many short CPU bursts.
Figure from [Ta01 p.134]
Slide 155

Address Space
A process is an executing program, and encompasses the current values of the program counter, of the registers, of the variables and of the stack.

code section (text section or segment)
This is the actual program code (the machine instructions).

data section (data segment)
This segment contains global variables (global to the process, not global to the computer system).

stack section (stack segment)
The stack contains temporary data (local variables, return addresses, function parameters).
[Figure: the CPU registers PC and SP point into memory, which holds the code segment (CS), data segment (DS) and stack segment (SS)]
Slide 156

Process States
[Figure: process state diagram. Figure from [Sil00 p.89]]
Note: Only in the running state does the process need CPU cycles; in all other states it is actually frozen (or no longer exists).
Slide 157

Process States
New
The process is created. Resources are allocated.

Ready
The process is ready to be (re)scheduled.

Running
The CPU is allocated to the process, that is, the program instructions are being executed.

Waiting
The process is waiting for some event to occur. Without this event the process cannot continue even if the CPU were free.

Terminated
Work is done. The process is taken off the system (off the queues) and its resources are freed.
Slide 158
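The five states can be written down as a small state machine. The transitions encoded below follow the usual five-state diagram (admit, dispatch, interrupt, wait, event completion, exit); they are stated here as an assumption, since the slide text itself only lists the states. The enum and function names are made up for this sketch.

```c
enum pstate { P_NEW, P_READY, P_RUNNING, P_WAITING, P_TERMINATED };

/* Returns 1 if the transition from -> to appears in the usual five-state
 * diagram, 0 otherwise. */
int transition_allowed(enum pstate from, enum pstate to)
{
    switch (from) {
    case P_NEW:
        return to == P_READY;           /* process is admitted */
    case P_READY:
        return to == P_RUNNING;         /* dispatched by the scheduler */
    case P_RUNNING:
        return to == P_READY            /* interrupted / preempted */
            || to == P_WAITING          /* waits for I/O or an event */
            || to == P_TERMINATED;      /* exits */
    case P_WAITING:
        return to == P_READY;           /* I/O or event completed */
    default:
        return 0;                       /* terminated: no way back */
    }
}
```

Note that a waiting process never goes directly to running: once its event has occurred it becomes ready and has to be scheduled again.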

Processes
Events at which processes are created

Operating System Start-Up
Most of the system processes are created here. A large portion of them are background processes (daemons).

Interactive User Request
A user requests an application to start.

Batch job
Jobs that are scheduled to be carried out when the system has the resources available (e.g. calendar-driven events, low priority jobs).

Existing process gives birth
An existing process (e.g. a user application or a system process) creates a process to carry out some related (sub)tasks.
Slide 159

Process Creation
Parent process creates a child process, which in turn may create other processes, forming a tree of processes.

Resource sharing
Parent and child share all resources.
Child shares a subset of the parent's resources.
Parent and child share no resources.

Execution
Parent and child execute concurrently.
Parent waits until child terminates.

Address Space
Child is a copy of the parent.
Child has a program loaded into it.

System calls for creating a child process:
Unix: fork()
Windows: CreateProcess()
Slide 160

fork() example
Process Creation

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int result;
    printf("Parent, my pid is: %d\n", getpid());
    result = fork();               /* from here on think parallel */
    if (result == 0) {             /* child only */
        printf("Child, my pid is: %d\n", getpid());
        ...
    } else {                       /* parent only */
        printf("Parent, my pid is: %d\n", getpid());
        ...
    }
}

getpid() is a system call that tells a process its pid (process identifier), which is a unique process number within the system.

Slide 161

fork() example
Process Creation

Terminal output:
Parent, my pid is: 189
Child, my pid is: 190
Parent, my pid is: 189

The order of the last two lines depends on whether parent or child is scheduled first after fork().

Before fork(): one process (pid = 189) whose PC (program counter) has reached fork().
After fork(): two processes, each with its own PC just behind fork(). One copy is executed by the parent (pid = 189), the other by the child (pid = 190).

Slide 162
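The fork() example can be extended by the "parent waits until child terminates" option from the process creation slide. The following is only a minimal sketch for a POSIX system; the helper name spawn_and_wait and its parameters are assumptions made for illustration and do not appear in the lecture. It also shows that the child works on a copy of the parent's address space.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from the slides): fork a child that modifies
 * its copy of a variable and exits with child_code; the parent waits.
 * Returns the child's exit code, or -1 on error. */
int spawn_and_wait(int child_code, int *parent_view)
{
    int value = 1;                  /* exists in parent AND child after fork() */
    pid_t pid = fork();

    if (pid < 0)
        return -1;                  /* fork failed, no child was created */

    if (pid == 0) {                 /* child only */
        value = 99;                 /* changes the child's copy only */
        _exit(child_code);          /* child asks the OS to delete it */
    }

    /* parent only */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)   /* parent waits until child terminates */
        return -1;
    *parent_view = value;           /* still 1: the address spaces are separate */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling spawn_and_wait(7, &v) should return 7 and leave v at 1, since the assignment in the child never reaches the parent's memory.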

Process Creation
Processes

A Unix process tree
Figure from [Sil00 p.96]

Slide 163

Process Termination
Processes
Events at which processes are terminated

Process asks the OS to delete it
Work is done. Resources are deallocated (memory is freed, open files are closed, used I/O buffers are flushed).

Parent terminates child
- Child may have exceeded allocated resources.
- Task assigned to child is no longer required.
- Parent is exiting. Some OSs do not allow a child to continue when its parent terminates. Cascading termination (a subtree is deleted).

System calls for self-termination of a process:
- Unix: exit()
- Windows: ExitProcess()

Slide 164

Process Control Block
Processes

- The operating system maintains a process table.
- Each entry represents a process.
- An entry is often termed PCB (process control block).

A PCB contains all information about a process that must be saved when the process is switched from running into waiting or ready, such that it can later be restarted as if it had never been stopped. This covers information regarding process management, memory occupation and open files.

PCB example
Figure from [Sil00 p.89]

Slide 165

Process Control Block
Processes

Typical fields of a PCB
Table from [Ta01 p.80]

Slide 166

Context Switch
Processes

The task of switching the CPU from one process to another is termed context switch (sometimes also process switch):

Saving the state of the old process
Saving the current context of the process in its PCB.

Loading the state of the new process
Restoring the former context of the process from its PCB.

Context switching is pure administrative overhead. The duration of a switch lies in the range of 1 ... 1000 µs. The switch time depends on the hardware. Processors with multiple sets of registers are faster in switching. Context switching poses a certain bottleneck, which is one reason for the introduction of threads.

Slide 167

Context Switch
Processes

Figure labels: context switch, context switch time.
Figure from [Sil00 p.90]

Slide 168
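The two steps of a context switch (save the old context into its PCB, load the new context from its PCB) can be sketched in C. This is a toy model only: the struct layouts, the four-register CPU and the function name context_switch are assumptions for illustration; a real kernel does this in architecture-specific assembler and saves many more fields.

```c
#include <string.h>

#define NUM_REGS 4   /* assumption: a toy CPU with four registers */

/* Hypothetical, heavily reduced PCB; real PCBs hold many more fields. */
struct pcb {
    int pid;
    unsigned long pc;                /* saved program counter */
    unsigned long regs[NUM_REGS];    /* saved register contents */
};

/* Toy model of the CPU state that must survive a switch. */
struct cpu {
    unsigned long pc;
    unsigned long regs[NUM_REGS];
};

/* Save the context of the old process, then load that of the new one. */
void context_switch(struct cpu *cpu, struct pcb *old_proc,
                    const struct pcb *new_proc)
{
    old_proc->pc = cpu->pc;                              /* save old state */
    memcpy(old_proc->regs, cpu->regs, sizeof old_proc->regs);

    cpu->pc = new_proc->pc;                              /* load new state */
    memcpy(cpu->regs, new_proc->regs, sizeof cpu->regs);
}
```

Switching from process a to b and later back from b to a restores exactly the register values a had when it was taken off the CPU, which is what "restarted as if it had never been stopped" means.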

Scheduling
Processes

On a uniprocessor system there is only one process running; all others have to wait until they are scheduled. They are waiting in some scheduling queue:

Job Queue
Holds the future processes of the system.

Ready Queue (also called CPU queue)
Holds all processes that reside in memory and are ready to execute.

Device Queue (also called I/O queue)
Each device has a queue holding the processes waiting for I/O completion.

IPC Queue
Holds the processes that wait for some IPC (inter process communication) event to occur.

Slide 169

Scheduling
Processes

The ready queue and some device queues
Figure labels: ready queue; device queues for tape, Ethernet, disk and terminal; each queued PCB holds registers. Annotation: "These queues are empty".
Figure from [Sil00 p.92], modified

Slide 170

Scheduling
Processes

From the job queue a new process is initially put into the ready queue. It waits until it is dispatched (selected for execution). Once the process is allocated the CPU, one of these events may occur.

Interrupt
The time slice may have expired or some higher-priority process is ready. Hardware error signals (exceptions) may also cause a process to be interrupted.

I/O request
The process requests I/O and is moved to a device queue. After the I/O device has finished, the process is put into the ready queue to continue.

IPC request
The process wants to communicate with another process through some blocking IPC feature. Like I/O, but here the "I/O device" is another process.

A note on the terminology: Strictly speaking, a process (in the sense of an active entity) only exists when it is allocated the CPU. In all other cases it is a "dead body".

Slide 171

Scheduling
Processes

Queueing diagram of process scheduling
Figure labels: job queue (processes are new), ready queue (processes are ready), CPU (process is running), device queue (processes are waiting), IPC queue; events: interrupt, I/O request, IPC request.

Slide 172
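The movement of processes between the ready queue and a device queue can be illustrated with a small simulation. This is a sketch under simplifying assumptions: queues hold only pids instead of PCBs, there is a single device queue, and all names (struct queue, run_next, io_complete, the event constants) are invented for this example.

```c
#define MAX_PROCS 8

/* Fixed-size FIFO of pids; a stand-in for a queue of PCBs. */
struct queue {
    int pid[MAX_PROCS];
    int head, count;
};

int enqueue(struct queue *q, int pid)
{
    if (q->count == MAX_PROCS)
        return -1;                                   /* queue is full */
    q->pid[(q->head + q->count) % MAX_PROCS] = pid;
    q->count++;
    return 0;
}

int dequeue(struct queue *q)                         /* pid, or -1 if empty */
{
    if (q->count == 0)
        return -1;
    int pid = q->pid[q->head];
    q->head = (q->head + 1) % MAX_PROCS;
    q->count--;
    return pid;
}

/* What ends the CPU burst of the dispatched process. */
enum event { EV_INTERRUPT, EV_IO_REQUEST, EV_EXIT };

/* Dispatch the head of the ready queue and react to the event. */
int run_next(struct queue *ready, struct queue *device, enum event ev)
{
    int pid = dequeue(ready);
    if (pid < 0)
        return -1;                                   /* nothing is ready */
    if (ev == EV_INTERRUPT)
        enqueue(ready, pid);                         /* back to ready queue */
    else if (ev == EV_IO_REQUEST)
        enqueue(device, pid);                        /* wait for I/O */
    return pid;                                      /* EV_EXIT: pid leaves */
}

/* The device has finished: its first waiter becomes ready again. */
int io_complete(struct queue *ready, struct queue *device)
{
    int pid = dequeue(device);
    if (pid >= 0)
        enqueue(ready, pid);
    return pid;
}
```

An IPC queue would behave like the device queue here, with another process taking the role of the device.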

Scheduling
Processes

The OS selects processes from queues and puts them into other queues. This selection task is done by schedulers.

Long-term Scheduler
Originates from batch systems. Selects jobs (programs) from the pool and loads them into memory. Invoked rather infrequently (seconds ... minutes). Can be slow. Has influence on the degree of multiprogramming (number of processes in memory). Some modern OSs do not have a long-term scheduler any more.

Short-term Scheduler
Selects one process from among the processes that are ready to execute, and allocates the CPU to it. Initiates the context switches. Invoked very frequently (in the range of milliseconds). Must be fast, that is, must not consume much CPU time compared to the processes.

Slide 173

Scheduling
Processes

Schedulers and their queues
Figure labels: job queue, long-term scheduler, ready queue, short-term scheduler, CPU.

Slide 174

Scheduling
Processes

Sometimes it may be advantageous to remove processes temporarily from memory in order to reduce the degree of multiprogramming. At some later time the process is reintroduced into memory and can be continued. This scheme is called swapping, performed by a medium-term scheduler.

Medium-term scheduler
Figure labels: job queue, ready queue, CPU, swap queue, swap in, swap out.

Slide 175

Process Concept
Processes

Program in execution
Several processes may be carried out in parallel.

Resource grouping
Each process is related to a certain task and groups together the required resources (address space, PCB).

Traditional multi-processing systems:

Each process is executed sequentially
No parallelism inside a process.

Blocked operations: blocked process
Any blocking operation (e.g. I/O, IPC) blocks the process. The process must wait until the operation finishes.

In traditional systems each process has a single thread of control.

Slide 176

Process Management
- Processes (153)
- Threads (178)
- Interprocess Communication (IPC) (195)
- Scheduling (247)
- Real-Time Scheduling (278)
- Deadlocks (318)

Slide 177

Threads
A thread is a piece of yarn, a screw thread, a line of thought.
Here: a sequence of instructions that may execute in parallel with others.

A thread is a line of execution within the scope of a process. A single-threaded process has a single line of execution (sequential execution of program code); the process and the thread are the same. In particular, a thread is a basic unit of CPU utilization.

Slide 178

Threads

Three single-threaded processes in parallel. A process with three parallel threads.
Figure from [Ta01 p.82]

Slide 179

Threads
As an example, consider a word processing application.

- Reading from keyboard
- Formatting and displaying pages
- Periodically saving to disk
- ... and lots of other tasks

A single-threaded process would quite quickly result in an unhappy user since (s)he always has to wait until the current operation is finished.

Multiple processes?
Each process would have its own isolated address space.

Multiple threads!
The threads operate in the same address space and thus have access to the data.

Slide 180

Threads
Three-threaded word processing application
Figure labels: reading keyboard, formatting and displaying, saving to disk.
Figure from [Ta01 p.86]

Slide 181

Threads

Multiple executions in same environment
All threads have exactly the same address space (the process address space).

Each thread has its own registers, stack and state
Figure from [Sil00 p.116]

Slide 182

Threads

Items shared by all threads in a process. Items private to each thread.
Table from [Ta01 p.83]

Slide 183

User Level Threads
Threads

Take place in user space
The operating system does not know about the application's internal multi-threading.

Can be used on OSs not supporting threads
It only needs some thread library (like pthreads) linked to the application.

Each process has its own thread table
The table is maintained by the routines of the thread library.

Customized thread scheduling
The processes use their own thread scheduling algorithm. However, no timer-controlled scheduling is possible since there are no clock interrupts inside a process.

Blocking system calls do block the process
All threads are stopped because the process is temporarily removed from the CPU.

Slide 184

User Level Threads
Threads

Thread management is performed by the application.

Examples
- POSIX Pthreads
- Mach C-threads
- Solaris threads

Figure from [Ta01 p.91]

Slide 185

Kernel Threads
Threads

Take place in kernel
The operating system manages the threads of each process.

Available only on multi-threaded OSs
The operating system must support multi-threaded application programs.

No thread administration inside process
since this is done by the kernel. Thread creation and management, however, is generally somewhat slower than with user level threads [Sil00 p.118].

No customized scheduling
The user process cannot use its own customized scheduling algorithm.

No problem with blocking system calls
A blocking system call causes a thread to pause. The OS activates another thread, either from the same process or from another process.

Slide 186

Kernel Threads
Threads

Thread management is performed by the operating system.

Examples
- Windows 95/98/NT/2000
- Solaris
- Tru64 UNIX
- BeOS
- Linux

Figure from [Ta01 p.91]

Slide 187

Multithreading Models
Threads

Many-to-One Model
Many user level threads are mapped to a single kernel thread. Used on systems that do not support kernel threads.
Figure from [Sil00 p.118]

Slide 188

Multithreading Models
Threads

One-to-One Model
Each user level thread is mapped to one kernel thread.
Figure from [Sil00 p.119]

Slide 189

Multithreading Models
Threads

Many-to-Many Model
Many user level threads are mapped to many kernel threads.
Figure from [Sil00 p.119]

Slide 190

Multithreading
Threads

Solaris 2 multi-threading example
Figure from [Sil00 p.121]

Slide 191

Threads

Windows 2000: Implements one-to-one mapping
Each thread contains
- a thread id
- a register set
- separate user and kernel stacks
- a private data storage area

Linux:
One-to-one model (pthreads), many-to-many (NGPT)
Thread creation is done through the clone() system call. clone() allows a child to share the address space of the parent. This system call is unique to Linux; source code using it is not portable to other UNIX systems.

Slide 192

Threads
Java: Provides support at language level.
Thread scheduling in JVM

Example: Creation of a thread by inheriting from Thread class

class Worker extends Thread {
    public void run() {
        System.out.println("I am a worker thread");
    }
}

public class MainThread {
    public static void main(String args[]) {
        Worker worker1 = new Worker();
        worker1.start();    // thread creation and automatic call of run() method
        System.out.println("I am the main thread");
    }
}

Slide 193

Process Management
- Processes (153)
- Threads (178)
- Interprocess Communication (IPC) (195)
- Scheduling (247)
- Real-Time Scheduling (278)
- Deadlocks (318)

Slide 194

IPC
Purpose of Inter Process Communication

Managing critical activities
Making sure that two (or more) processes do not get into each other's way when engaging in critical activities.

Sequencing
Making sure that proper sequencing is assured in case of dependencies among processes.

These two purposes: Process Synchronization, Thread Synchronization

Passing information
Processes are independent of each other and have private address spaces. How can a process pass information (or data) to another process?

This purpose: Data exchange. Less important for threads since they operate in the same environment.

Slide 195

Race Conditions
IPC

Print spooling example
Figure labels: shared variables, next empty slot.
Figure from [Ta01 p.101]

Situations where two or more processes access some shared resource, and the final result depends on who runs precisely when, are called race conditions.

Slide 196

Race Conditions
IPC

- Processes A and B want to print a file.
- Both have to enter the file name into a spooler directory.
- out points to the next file to be printed. This variable is accessed only by the printer daemon. The daemon currently is busy with slot 4.
- in points to the next empty slot. Each process entering a file name in the empty slot must increment in.

Now consider this situation:

Process A reads in (value = 7) into some local variable. Before it can continue, the CPU is switched over to B. Process B reads in (value = 7) and stores its value locally. Then the file name is entered into slot 7 and the local variable is incremented by 1. Finally the local variable is copied to in (value = 8). Process A is running again. According to its local variable, it enters its file name into slot 7, erasing the file name put there by B. Finally in is incremented. User B is waiting in the printer room for years ...

Slide 197

Race Conditions
IPC

Another example at machine instruction level
Shared variable x (initially 0)

Scenario 1
Process 1         Process 2         x
                                    x=0
R1 ← x
R1 = R1+1
R1 → x                              x=1
                  R3 ← x
                  R3 = R3+1
                  R3 → x            x=2

Scenario 2
Process 1         Process 2         x
                                    x=0
R1 ← x
                  R3 ← x
                  R3 = R3+1
R1 = R1+1
R1 → x                              x=1
                  R3 → x            x=1

Slide 198
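The two scenarios of the machine-level example can be replayed deterministically by writing each machine instruction as its own step and choosing the interleaving by hand. This is a sketch, not real concurrency: the step functions and the names scenario1 and scenario2 are made up for illustration, and r1 and r3 stand for the private registers R1 and R3.

```c
/* Shared variable and the private registers of the two processes. */
static int x, r1, r3;

/* The three machine-level steps of each process, one function per step. */
static void p1_load(void)  { r1 = x; }        /* R1 <- x     */
static void p1_inc(void)   { r1 = r1 + 1; }   /* R1 = R1 + 1 */
static void p1_store(void) { x = r1; }        /* R1 -> x     */
static void p2_load(void)  { r3 = x; }        /* R3 <- x     */
static void p2_inc(void)   { r3 = r3 + 1; }   /* R3 = R3 + 1 */
static void p2_store(void) { x = r3; }        /* R3 -> x     */

/* Scenario 1: process 1 finishes before process 2 starts. */
int scenario1(void)
{
    x = 0;
    p1_load(); p1_inc(); p1_store();
    p2_load(); p2_inc(); p2_store();
    return x;                                  /* both increments count */
}

/* Scenario 2: process 1 is preempted right after loading x. */
int scenario2(void)
{
    x = 0;
    p1_load();
    p2_load(); p2_inc();
    p1_inc(); p1_store();
    p2_store();                                /* overwrites with a stale value */
    return x;                                  /* one increment is lost */
}
```

scenario1() yields 2 and scenario2() yields 1, although both run exactly the same six instructions. Only the order differs, which is the defining property of a race condition.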

Critical Regions
IPC

How to avoid race conditions? Find some way to prohibit more than one process from manipulating the shared data at the same time.

Mutual exclusion
Part of the time a process is doing some internal computations and other things that do not lead to race conditions. Sometimes, however, a process needs to access shared resources or does other critical things that may lead to race conditions. These parts of a program are called critical regions (or critical sections).

Figure labels: Process A, critical region, time axis t.

Slide 199

Critical Regions
IPC

Four conditions to provide correctly working mutual exclusion:

1. No two processes simultaneously in critical region
which would otherwise contradict the concept of mutual exclusion.

2. No assumptions about process speeds
No predictions on process timings or priorities. Must work with all processes.

3. No process outside its critical region may block other processes
simply because there is no reason to hinder others from entering their critical region.

4. No process should have to wait forever to enter a critical region
For reasons of fairness and to avoid deadlocks.

Slide 200

Critical Regions
IPC

Mutual exclusion using critical regions
Figure from [Ta01 p.103]

Slide 201

Mutual Exclusion
IPC
Proposals for achieving mutual exclusion

Disabling interrupts
The process disables all interrupts and thus cannot be taken away from the CPU.
Not appropriate. It is unwise to give a user process full control over the computer.

Lock variables
A process reads a shared lock variable. If the lock is not set, the process sets the variable (locking) and uses the resource.
In the period between evaluating and setting the variable the process may be interrupted. Same problem as with the printer spooling example.

Slide 202

Mutual Exclusion
IPC
Proposals for achieving mutual exclusion (continued)

Strict Alternation
The shared variable turn keeps track of whose turn it is. Both processes alternate in accessing their critical regions.

Process 0:
while (1) {
    while (turn != 0);
    critical_region();
    turn = 1;
    noncritical_region();
}

Process 1:
while (1) {
    while (turn != 1);
    critical_region();
    turn = 0;
    noncritical_region();
}

Slide 203

Mutual Exclusion
IPC
Proposals for achieving mutual exclusion (continued)

Strict Alternation (continued)
Busy waiting wastes CPU time. Not a good idea when one process is much slower than the other. Violation of condition 3.

Figure labels: Process 0 and Process 1 over time t, with turn = 0, turn = 1, turn = 0; busy waiting for turn = 0.

Slide 204

Mutual Exclusion
IPC
Proposals for achieving mutual exclusion (continued)

Peterson Algorithm

int turn;                   /* shared variables */
bool interested[2];

/* Two processes, number is either 0 or 1 */
void enter_region(int process)
{
    int other = 1 - process;
    interested[process] = TRUE;
    turn = process;
    while (turn == process && interested[other] == TRUE);
}

void leave_region(int process)
{
    interested[process] = FALSE;
}

Slide 205

Mutual Exclusion
IPC
Peterson Algorithm (continued)

Assume processes 0 and 1 both try to enter their critical region simultaneously, i.e. both call enter_region().

Process 0                     Process 1
other = 1                     other = 0
interested[0] = true          interested[1] = true
turn = 0                      turn = 1

Both are manipulating turn at the same time. Whichever store is last is the one that counts. Assume process 1 was slightly later, thus turn = 1.

Process 0: while (turn == 0 && interested[1] == TRUE);
Process 1: while (turn == 1 && interested[0] == TRUE);

Process 0 passes its while statement, whereas process 1 keeps busy waiting therein. Later, when process 0 calls leave_region(), process 1 is released from the loop.

Good working algorithm, but uses busy waiting

Slide 206
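On current compilers and multi-core CPUs, the Peterson algorithm written with plain variables may fail because loads and stores can be reordered. A minimal sketch that keeps the structure of the slide but uses C11 atomics (sequentially consistent by default) could look as follows; the use of <stdatomic.h> is an assumption of this example, not something the lecture prescribes.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Shared variables; objects with static storage start out as 0 / false. */
static atomic_int turn;
static atomic_bool interested[2];

/* Two processes (or threads), number is either 0 or 1. */
void enter_region(int process)
{
    int other = 1 - process;                    /* number of the other process */
    atomic_store(&interested[process], true);   /* show interest */
    atomic_store(&turn, process);               /* last writer has to wait */
    while (atomic_load(&turn) == process &&
           atomic_load(&interested[other]))
        ;                                       /* busy waiting */
}

void leave_region(int process)
{
    atomic_store(&interested[process], false);  /* departure from critical region */
}
```

A caller brackets its critical region with enter_region(n) and leave_region(n). If the other side is not interested, enter_region() returns immediately.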

Mutual Exclusion
IPC
Proposals for achieving mutual exclusion (continued)

Test and Set Lock (TSL)
Atomic operation at machine level. Cannot be interrupted. TSL reads the content of the memory word lock into register R and then stores a nonzero value at the memory address lock. The memory bus is locked; no other process(or) can access lock.

enter_region:
    TSL R, lock          ; copy lock to R and set lock to nonzero (indivisible operation)
    CMP R, #0            ; was lock zero?
    JNZ enter_region     ; if it was nonzero, the lock was set, so loop
    RET                  ; return to caller, critical region entered

leave_region:
    MOV lock, #0         ; store 0 in lock
    RET                  ; return to caller

Pseudo assembler listing providing the functions enter_region() and leave_region().

- CPU must support TSL
- Busy waiting

Slide 207

Mutual Exclusion
IPC
Intermediate Summary

- Disabling Interrupts: Not recommended for multi-user systems.
- Lock Variables: Problem remains the same.
- Strict Alternation: Violation of condition 3. Busy waiting.
- Peterson Algorithm: Busy waiting.
- TSL instruction: Solves the problem through atomic operation. Should be used without busy waiting.

In essence, what the last three solutions do is this: A process checks whether the entry to its critical region is allowed. If it is not, the process just sits in a tight loop waiting until it is. This can have unexpected side effects, such as the priority inversion problem.

Slide 208
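In portable C, the effect of a TSL instruction is available through the C11 atomic_flag type. The following spin lock is a sketch that mirrors the pseudo assembler listing; the function names tsl_enter, tsl_try_enter and tsl_leave are chosen for this example only, and whether the compiler actually emits a test-and-set style instruction depends on the target CPU.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag tsl_lock = ATOMIC_FLAG_INIT;    /* clear = unlocked */

/* One TSL step: set the flag and report whether it was clear before. */
bool tsl_try_enter(void)
{
    return !atomic_flag_test_and_set(&tsl_lock);   /* indivisible read + set */
}

/* Counterpart of enter_region: loop until the flag was found clear. */
void tsl_enter(void)
{
    while (atomic_flag_test_and_set(&tsl_lock))
        ;                                          /* busy waiting */
}

/* Counterpart of leave_region: store "unlocked". */
void tsl_leave(void)
{
    atomic_flag_clear(&tsl_lock);
}
```

As on the slide, the lock is correct but still relies on busy waiting. Replacing the empty loop body by a yield or sleep call leads to the mutex idea.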

Priority Inversion Problem
IPC

Consider a computer with two processes:
- Process H with high priority
- Process L with low priority

The scheduling rules are such that H is run whenever it is in ready state. At a certain moment, with L in its critical region, H becomes ready and is scheduled. H now begins busy waiting, but since L is never scheduled while H is running, L never has the chance to leave its critical region. H loops forever. This is sometimes referred to as the priority inversion problem.

Solution: blocking a process instead of wasting CPU time.

Slide 209

Sleep and wake up
IPC

sleep()
A system call that causes the caller to block, that is, the process voluntarily goes from the running state into the waiting state. The scheduler switches over to another process.

wakeup(process)
A system call that causes the process process to awake from its sleep() and to continue execution. If the process process is not asleep at that moment, the wakeup signal is lost.

Note: these two calls are fictitious representatives of real system calls whose names and parameters depend on the particular operating system.

Slide 210

Producer Consumer Problem
IPC

Shared buffer with limited size
The buffer allows for a maximum of N entries (it is bounded). The problem is also known as the bounded buffer problem.

Producer puts information into buffer
When the buffer is full, the producer must wait until at least one item has been consumed.

Consumer removes information from buffer
When the buffer is empty, the consumer must wait until at least one new item has been entered.

Figure labels: Producer, shared buffer, Consumer.

Slide 211

Producer Consumer Implementation Example
This implementation suffers from race conditions

const int N = 100;
int count = 0;

void producer() {
    while (TRUE) {                          /* constantly producing */
        int item = produce_item();          /* produce item */
        if (count == N) sleep();            /* sleep when buffer is full */
        insert_item(item);                  /* enter item to buffer */
        count++;                            /* adjust item counter */
        if (count == 1) wakeup(consumer);   /* when the buffer was empty beforehand (and thus now
                                               has 1 item), wake up any consumer(s) that may be waiting */
    }
}

void consumer() {
    while (TRUE) {                          /* constantly consuming */
        if (count == 0) /* A */ sleep();    /* sleep when buffer is empty */
        int item = remove_item();           /* remove one item */
        count--;                            /* adjust item counter */
        if (count == N-1) wakeup(producer); /* when the buffer was full beforehand (and thus now
                                               has N-1 items), wake up producer(s) that may be waiting */
        consume_item(item);
    }
}

Slide 212

Producer Consumer Problem
Mutual Exclusion

A race condition may occur in this case:

The buffer is empty and the consumer has just read count to see if it is 0. At that instant (see A in listing) the scheduler decides to switch over to the producer. The producer inserts an item in the buffer, increments count and notices that count is now 1. Reasoning that count was just 0 and thus the consumer must be sleeping, the producer calls wakeup() to wake the consumer up. However, the consumer was not yet asleep; the CPU was taken away from it shortly before it could enter sleep(). The wakeup signal is lost. When the consumer is rescheduled and resumes at A, it will go to sleep. Sooner or later the producer has filled up the buffer and goes to sleep as well. Both processes will sleep forever.

Slide 213

Producer Consumer Problem
Mutual Exclusion

Reasons for race condition

The variable count is unconstrained
Any process has access at any time.

Evaluating count and going to sleep is a non-atomic operation
The prerequisite(s) that lead to sleep() may have changed when sleep() is reached.

Workaround:

Add a wakeup waiting bit
When the bit is set, sleep() will reset that bit and the process stays awake.

Each process must have a wakeup bit assigned
Although this is possible, the underlying problem is not solved.

What is needed is something that tests a variable and, depending on that variable, goes to sleep in a single non-interruptible operation.

Slide 214

Semaphores
Mutual Exclusion

Introduced by Dijkstra (1965)

Counting the number of wakeups
An integer variable counts the number of wakeups for future use.

Two operations: down and up
down is a generalization of sleep. up is a generalization of wakeup. Both operations are carried out in a single, indivisible operation (usually in the kernel). Once a semaphore operation is started, no other process can access the semaphore.

down(int* sem) {
    if (*sem < 1) sleep();
    (*sem)--;
}
principle of down-operation

up(int* sem) {
    (*sem)++;
    if (*sem == 1) wakeup a process
}
principle of up-operation

Slide 215

Semaphores
Mutual Exclusion

Up and down are system calls
in order to make sure that the operating system briefly disables all interrupts while carrying out the few machine instructions implementing up and down.

Semaphores should be lock-protected
This is recommended at least in multi-processor systems to prevent another CPU from simultaneously accessing a semaphore. The TSL instruction helps out.

Producer Consumer problem using semaphores (next page)
Definition of variables:

const int N = 10;
typedef int semaphore;      /* a semaphore is an integer */
semaphore empty = N;        /* counting empty slots */
semaphore full = 0;         /* counting full slots */
semaphore mutex = 1;        /* mutual exclusion on buffer access */

Slide 216

Producer Consumer Implementation Example
This implementation does not suffer from race conditions

void producer() {
    while (TRUE) {
        int item = produce_item();
        down(&empty);       /* possibly sleep, decrement empty counter */
        down(&mutex);       /* possibly sleep, claim mutex (set it to 0) thereafter */
        insert_item(item);
        up(&mutex);         /* release mutex, wake up other process */
        up(&full);          /* increment full counter, possibly wake up other ... */
    }
}

void consumer() {
    while (TRUE) {
        down(&full);        /* possibly sleep, decrement full counter */
        down(&mutex);       /* possibly sleep, claim mutex (set it to 0) thereafter */
        int item = remove_item();
        up(&mutex);         /* release mutex, wake up other process */
        up(&empty);         /* increment empty counter, possibly wake up other ... */
        consume_item(item);
    }
}

Slide 217

Semaphores
Mutual Exclusion

Scenario: producer is working, no consumer present
Assume N = 5. Initial condition: empty = 5, full = 0.

producer runs:   0   1   2   3   4   5   6
empty:           5   4   3   2   1   0   sleep
full:            0   1   2   3   4   5

Scenario: consumer is working, no producer present
Initial condition: empty = 0, full = 5.

consumer runs:   0   1   2   3   4   5   6
full:            5   4   3   2   1   0   sleep
empty:           0   1   2   3   4   5

Slide 218
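The counter values of the two scenarios can be reproduced with a single-threaded simulation. It is a sketch under strong simplifications: instead of sleeping, the stand-in try_down() returns false at the point where a real down would block, the mutex and the buffer itself are left out, and all function names are invented for this example.

```c
#include <stdbool.h>

#define BUF_SLOTS 5                 /* N = 5 as in the scenarios */

typedef int semaphore;
semaphore empty = BUF_SLOTS;        /* counting empty slots */
semaphore full = 0;                 /* counting full slots */

/* Non-blocking stand-in for down: false means "the caller would sleep". */
static bool try_down(semaphore *sem)
{
    if (*sem < 1)
        return false;
    (*sem)--;
    return true;
}

static void up(semaphore *sem)
{
    (*sem)++;                       /* a real up may also wake a sleeper */
}

/* One producer round: claim an empty slot, then announce a full one. */
bool produce_one(void)
{
    if (!try_down(&empty))
        return false;               /* buffer full: producer would sleep */
    /* insert_item() under mutex protection would go here */
    up(&full);
    return true;
}

/* One consumer round: claim a full slot, then announce an empty one. */
bool consume_one(void)
{
    if (!try_down(&full))
        return false;               /* buffer empty: consumer would sleep */
    /* remove_item() under mutex protection would go here */
    up(&empty);
    return true;
}
```

Five calls of produce_one() drive empty from 5 to 0 and full from 0 to 5; the sixth call reports that the producer would sleep. A single consume_one() then frees a slot again.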

Semaphores
Mutual Exclusion

Scenario: Consumer waking up producer
Assume N = 5. Initial condition: empty = 1, full = 4.
Figure: timing diagram of producer and consumer with the values of empty and full over time t. The producer goes to sleep at empty = 0 and is woken up by the consumer.

Slide 219

Semaphores
Mutual Exclusion

Scenario: Producer waking up consumer
Assume N = 5. Initial condition: empty = 4, full = 1.
Figure: timing diagram of consumer and producer with the values of full and empty over time t. The consumer goes to sleep at full = 0 and is woken up by the producer.

Slide 220

Semaphores
Mutual Exclusion

Assume N = 5. Initial condition: empty = 3, full = 2.
Figure: timing diagram of an overlapping producer and consumer, showing their down and up operations on empty and full over time t.

If processes overlap, then temporarily it may be that empty + full ≠ N.
Note that consumer and producer may legally change the same semaphore almost concurrently.

Slide 221

Mutex
Mutual Exclusion

Simplified semaphore
when counting is not needed.

Two states
Locked or unlocked. Used for managing mutual exclusion (hence the name).

mutex_lock:
    TSL R, mutex         ; get and set mutex
    CMP R, #0            ; was it unlocked?
    JZ ok                ; if yes: jump to ok
    CALL thread_yield    ; if no: sleep
    JMP mutex_lock       ; try again acquiring mutex
ok: RET

mutex_unlock:
    MOV mutex, #0        ; unlock mutex
    RET

Pseudo assembler listing implementing mutex_lock() and mutex_unlock().

Slide 222

Monitors
Mutual Exclusion

High level synchronization primitive
at programming language level. Direct support by some programming languages.

A collection of procedures, variables and data structures grouped together in a module
- A monitor has multiple entry points.
- Only one process can be in the monitor at a time.
- Enforces mutual exclusion; fewer chances for programming errors.

Monitor implementation
- Compiler handles implementation
- Library functions using semaphores

Slide 223

Monitors
Mutual Exclusion

monitor example;
    integer i;
    condition c;

    procedure producer()
    ...
    end;

    procedure consumer()
    ...
    end;
end monitor;

A monitor in Pidgin Pascal, from [Ta01 p.115]

Variables are not accessible from outside the monitor's own methods (encapsulation).

Functions (methods) are publicly accessible to all processes; however, only one process at a time may call a monitor function.

If the buffer is full, the producer must wait. If the buffer is empty, the consumer must wait.

Slide 224

Monitors
Mutual Exclusion

How can a process wait inside a monitor?
It cannot simply be put to sleep, because then no other process could enter the monitor in the meantime.

Use a condition variable!
A condition variable supports two operations.

- wait(): suspend this process until it is signaled. The suspended process is not considered inside the monitor any more. Another process is allowed to enter the monitor.
- signal(): wake up one process waiting on the condition variable. No effect if nobody is waiting. The signaling process automatically leaves the monitor (Hoare monitor).

Condition variables are usable only inside a monitor.

Slide 225

Monitors
Mutual Exclusion

Producer-Consumer problem with monitors, from [Ta01 p.117]

Slide 226

Barriers
IPC

Group synchronization
Intended for groups of processes rather than for two processes.

Processes wait at a barrier for the others
according to the all-or-none principle.

After all have arrived, all can proceed

Figure panels: processes approaching barrier; waiting for C to arrive; all processes continuing.
Figure from [Ta01 p.124]

Slide 227

Barriers
IPC

Application example
Figure labels: Process 0; Process 1 working on these elements; Process 2 working on these elements; Process 3 working on these elements; ... and so on for the remaining elements.

An array (e.g. an image) is updated frequently by some process 0 (producer). Many processes are working in parallel on certain array elements (consumers). All consumers must wait until the array has been updated and can then start working again on the updated input.

Slide 228

IPC
Intermediate Summary (II)

Semaphores
Counting variable, used in non-interruptible manner. Down may put the caller to sleep, up may wake up another process.

Mutexes
Simplified semaphore with two states. Used for mutual exclusion.

Monitors
High level construct for achieving mutual exclusion at programming language level.

Barriers
Used for synchronizing a group of processes.

These mechanisms all serve for process synchronization. For data exchange among processes something else is needed: Messages.
Slide 229

Messages
IPC

Kernel supported mechanism for data exchange
Eliminates the need for self-made (user-programmed) communication via shared resources such as shared files or shared memory.

Two basic operations, provided by the kernel (system calls):
- send(): send data
- receive(): receive data

[Figure: Process 1 sends some data (a message), which is copied from user space to kernel space into system buffers of the OS; on receive it is copied from kernel space to the user space of Process 2]
Slide 230

Direct Communication
Messages

Both processes must exist
As the name "direct" implies, you cannot send a message to a future process.

Processes must name each other explicitly
- send(P, message): send data to process P
- receive(Q, message): receive data from process Q
Symmetry in addressing. Both processes need to know each other by some identifier. This is no problem if both were fork()ed off the same parent beforehand, but is a problem when they are strangers to each other.

Communication link properties
- One process pair has exactly one link
- The link may be unidirectional or bidirectional
Slide 231

Indirect Communication
Messages

Messages are sent to / received from mailboxes
The mailbox must exist, but not necessarily the receiving process yet.
- Each mailbox has a unique identifier
- Processes communicate when they access the same mailbox

Primitives
- send(A, message): send message to mailbox A
- receive(A, message): receive message from mailbox A

Communication link properties
- Link is established when processes share a mailbox
- A link may be associated with many processes (broadcast)
- Unidirectional or bidirectional communication
Slide 232

Synchronous Communication
Messages

Also called blocking send / receive

Sender waits for receiver to receive the data
The send() system call blocks until the receiver has received the message.

Receiver waits for sender to send data
The receive() system call blocks until a message arrives.

[Figure: Process 1 sends via the OS (kernel space) to Process 2, which receives; acknowledgement from receiver]

A single buffer (for the pair) is sufficient
Slide 233

Asynchronous Communication
Messages

Also called non-blocking send / receive

Sender drops message and passes on
The send() system call returns to the caller when the kernel has the message.

Receiver peeks for messages
The receive() system call does not block, but rather returns some error code telling whether there is a message or not. The receiver must poll to check for messages.

[Figure: Process 1 sends via the OS (kernel space) to Process 2, which receives]

Multiple buffers (for each pair) needed
Slide 234

Messages
IPC

Send by copy
The message is copied to kernel buffer at send time. At receive time the message is copied to the receiver. Copying takes time.

Send by reference
A reference (a memory address or a handle) is copied to the receiver which uses the reference to access the data. The data usually resides in a kernel buffer (is copied there beforehand). Fast read access.

Fixed sized messages
The kernel buffers are of fixed size as are the messages. Straightforward system level implementation. Big messages must be constructed from many small messages which makes user level programming somewhat more difficult.

Variable sized messages
Sender and receiver must communicate about the message size. Best use of kernel buffer space, however, buffers must not grow indefinitely.
Slide 235

UNIX IPC Mechanisms
IPC

Pipes
Simple(st) communication link between two processes. Applies first-in first-out principle. Works like an invisible file, but is no file. Operations: read(), write().

FIFOs
Also called named pipe. Works like a file. May exist in modern Unices just in the kernel (and not in the file system). There can be more than one writer or reader on a FIFO. Operations: open(), close(), read(), write().

Messages
Allow for message transfer. Messages can have types. A process may read all messages or only those of a particular type. Message communication works according to the first-in first-out principle. Operations: msgget(), msgsnd(), msgrcv(), msgctl().
Slide 236

UNIX IPC Mechanisms
IPC

Shared memory
A selectable part of the address space of process P1 is mapped into the address space of another process P2 (or others). The processes have simultaneous access. Operations: shmget(), shmat(), shmdt(), shmctl().

Semaphores
Creation and manipulation of sets of semaphores. Operations: semget(), semop(), semctl().

For an introduction into the UNIX IPC mechanisms (with examples) see Stefan Freinatis: Interprozeßkommunikation unter Unix - eine Einführung, Technischer Bericht, Fachgebiet Datenverarbeitung, Universität Duisburg, 1994. http://www.fb9dv.uni-duisburg.de/vs/members/fr/ipc.pdf
Slide 237

Simple pipe example. Parent is writing, child is reading.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define FIXSIZE 80                       // fixed message size

int main(void) {
    int fd[2];                           // file descriptors for pipe
    char msg[FIXSIZE] = "Hallo!";        // message, zero-padded to FIXSIZE bytes
    pipe(fd);                            // create pipe
    int result = fork();                 // duplicate process
    if (result == 0) {                   // start child's code
        printf("This is the child, my pid is: %d\n", getpid());
        close(fd[1]);                    // we do not need writing
        char buf[256];                   // a buffer
        read(fd[0], buf, FIXSIZE);       // wait for message from parent
        printf("Child: received message was: %s\n", buf);
        exit(0);                         // good bye
    }                                    // end child, start parent
    printf("This is the parent, my pid is: %d\n", getpid());
    close(fd[0]);                        // we do not need reading
    write(fd[1], msg, FIXSIZE);          // write message to child
    return 0;
}

Classical IPC Problems
The dining philosophers
An artificial synchronization problem posed and solved by Edsger Dijkstra, 1965.

Five philosophers sitting at a table
The problem can be generalized to more than five philosophers, of course.

Each either eats or thinks
Five forks available
Eating needs 2 forks
Slippery spaghetti, one needs two forks!

Pick one fork at a time
Either first the right fork and then the left one, or vice versa.

Figure from [Ta01 p.125]
Slide 239

Dining philosophers
Classical IPC problems

The life of these philosophers consists of alternate periods of eating and thinking. When a philosopher becomes hungry, she tries to acquire her left and right fork, one at a time, in either order. If successful in acquiring two forks, she eats for a while, then puts down the forks and continues to think.
Text from [Ta01 p.125]

Can you write a program that
- lets the philosophers eat and think (thus creation of 5 threads or processes, one for each philosopher),
- allows maximum utilization (parallelism), that is, two philosophers may eat at a time (no simple solution with just one philosopher eating at a time),
- is not centrally controlled by somebody instructing the philosophers,
- and never gets stuck?
Slide 240

Dining philosophers
Classical IPC problems

const int N = 5;                    // number of philosophers

void philosopher(int i) {           // N philosophers in parallel
    while (TRUE) {                  // for the whole life
        think();
        take_fork(i);               // take left fork
        take_fork((i + 1) % N);     // take right fork
        eat();
        put_fork(i);                // put left fork
        put_fork((i + 1) % N);      // put right fork
    }
}

A nonsolution to the dining philosophers problem

If all philosophers take their left fork simultaneously, none will be able to take the right fork. All philosophers get stuck. Deadlock situation.
Slide 241

Classical IPC Problems
The Readers and Writers Problem
An artificial shared database access problem by Courtois et al., 1971

Database system
such as an airline reservation system.

Many competing processes wish to read and write
Many reading processes are not the problem, but if one process wants to write, no other process may have access, not even readers.

How to program the readers and writers?

Writer waits until all readers are gone
Not good. Usually there are always readers present. Indefinite wait.

Writer blocks new readers
A solution. Writer waits until old readers are gone and meanwhile blocks new readers.
Slide 242
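The "writer blocks new readers" policy can be illustrated by its bookkeeping alone. The sketch below is an assumption-laden illustration, not code from the lecture: the type rw_state and all function names are invented, the functions return false where a real implementation would put the caller to sleep, and the state would have to be protected by a mutex before several processes could use it.

```c
#include <assert.h>
#include <stdbool.h>

/* Bookkeeping for the "writer blocks new readers" policy. */
typedef struct {
    int  active_readers;    /* readers currently reading */
    int  waiting_writers;   /* writers that have announced themselves */
    bool writer_active;     /* a writer currently has exclusive access */
} rw_state;

/* A new reader may start only if no writer is active or waiting. */
bool reader_try_enter(rw_state *s) {
    if (s->writer_active || s->waiting_writers > 0)
        return false;
    s->active_readers++;
    return true;
}

void reader_leave(rw_state *s) {
    assert(s->active_readers > 0);
    s->active_readers--;
}

/* A writer announces itself; from now on new readers are refused. */
void writer_announce(rw_state *s) {
    s->waiting_writers++;
}

/* An announced writer may start once the old readers are gone. */
bool writer_try_enter(rw_state *s) {
    assert(s->waiting_writers > 0);
    if (s->writer_active || s->active_readers > 0)
        return false;
    s->waiting_writers--;
    s->writer_active = true;
    return true;
}

void writer_leave(rw_state *s) {
    s->writer_active = false;
}
```

With two readers active, an announced writer is refused until both have left, and any reader arriving in between is refused as well, so the writer cannot wait indefinitely.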

Classical IPC Problems
The sleeping barber problem
An artificial queuing situation problem

[Figure from [Ta01 p.130]: barber shop with the customer chairs; the barber sleeps when no customers are present]
Slide 243

Sleeping Barber
IPC

The barber shop has one barber, one barber chair, and n chairs for customers, if any, to sit on. If there are no customers present, the barber sits down in the barber chair and falls asleep. When a customer arrives, he has to wake up the sleeping barber. If additional customers arrive while the barber is cutting a customer's hair, they either sit down (if there are empty chairs) or leave the shop (if all chairs are full).
Text from [Ta01 p.129]

How to program the barber and the customers without getting into race conditions?

const int CHAIRS = 5;          // number of chairs
typedef int semaphore;
semaphore customers = 0;       // number of customers waiting
semaphore barbers = 0;         // number of barbers waiting
semaphore mutex = 1;           // for mutual exclusion
int waiting = 0;               // customers waiting (not being cut)
Slide 244

void barber() {                // barber process
    while (TRUE) {             // for the whole life
        down(&customers);      // sleep if no customers
        down(&mutex);          // acquire access to waiting
        waiting--;             // one waiting customer less
        up(&barbers);          // one barber ready to cut
        up(&mutex);            // release waiting
        cut_hair();            // cut hair (non critical)
    }
}

void customer() {              // customer process
    down(&mutex);              // enter critical region
    if (waiting < CHAIRS) {    // when seats available
        waiting++;             // one more waiting
        up(&customers);        // tell barber if first customer
        up(&mutex);            // release waiting
        down(&barbers);        // sleep if no barber available
        get_haircut();         // get serviced
    } else
        up(&mutex);            // shop is full, leave
}

A solution to the sleeping barber problem [Ta01 p.131]
Slide 245

Scheduling

Processes (153)
Threads (178)
Interprocess Communication (IPC) (195)
Scheduling (247)
Real-Time Scheduling (278)
Deadlocks (318)
Slide 246

Scheduling

Better CPU utilization through multiprogramming
Scheduling: switching CPU among processes
Productivity depends on CPU bursts

Figure from [Ta01 p.134]
Slide 247

Short-Term Scheduler
Scheduling

Also called CPU scheduler. Selects one process from among the ready processes in memory and dispatches it. The dispatcher is a module that finally gives CPU control to the selected process (switching context, switching from kernel mode to user mode, loading the PC).

[Figure: job queue, ready queue, short-term scheduler, CPU]
Slide 248

Scheduling decisions
Scheduling

CPU scheduling decisions may take place when a process

1. switches from running to waiting,
2. switches from running to ready,
3. switches from waiting to ready, or
4. terminates.

[Figure from [Sil00 p.89], with the transitions 1 to 4 marked]
Slide 249

Preemptive(ness)
Scheduling

Preemptiveness determines the way of multitasking.

With non-preemptive scheduling (cooperative scheduling), a running process loses the CPU only because it became blocked, it completed, or it voluntarily gave up the CPU. With preemptive scheduling the operating system can additionally force a context switch at any time to satisfy the priority policies. This allows the system to more reliably guarantee each process a regular "slice" of operating time.
Slide 250

Preemptive(ness)
Scheduling

Preemptive scheduling:
- Scheduler can interrupt
- Special timer hardware required for the timer-controlled interrupts of the scheduler.
- Synchronization of shared resources: an interrupted process may leave shared data inconsistent.

Cooperative (non-preemptive) scheduling:
- CPU occupation depends on process, in particular on the CPU burst distribution.
- Applicable on any hardware platform
- Fewer problems with shared resources: at least the elementary parts of shared data structures are not inconsistent.
Slide 251

Scheduling Criteria
Scheduling

The scheduling policy depends on what criteria are emphasized [Sil00 p.140]

CPU Utilization
Keeping the CPU as busy as possible. The utilization usually ranges from 40% (lightly loaded system) to 90% (heavily loaded).

Throughput
The number of processes that are completed per time unit. For long processes the throughput rate may be one process per hour, for short ones it may be 10 per second.

Turnaround time
The interval from the time of submission to the time of completion of a process. Includes the time to get into memory, times spent in the ready queue, execution time on CPU and I/O time.
[ With real-time scheduling this time-period is called reaction time ]
Slide 252

Scheduling Criteria
Scheduling

Waiting time
The scheduling algorithm does not affect the time a process executes or spends doing I/O. It only affects the amount of time a process spends waiting in the ready queue. The waiting time is the sum of time spent waiting in the ready queue.

Response time
Irrespective of the turnaround time, some processes produce an output fairly early and continue computing new results while previous results are output to the user. The response time is the time from the submission of a request until the first response is produced.
[ Remark: In the exercises the response time is defined as the time from submission until the process starts (that is, until the first machine instruction is executing). ]

Different systems (batch systems, interactive computers, control systems) may put focus on different scheduling criteria. See next slide.
Slide 253

Scheduling

Criteria importance by system [Ta01 p.137]
Slide 254

Optimization
Scheduling

Common criteria:
- Maximize(average(CPU utilization))
- Maximize(average(throughput))
- Minimize(average(turnaround time))
- Minimize(average(waiting time))
- Minimize(average(response time))

Sometimes it is desirable to optimize the minimum or maximum values rather than the average. For example, to guarantee that all users receive a good service in terms of responsiveness, we may want to minimize the maximum response time. [Note: we do not delve into optimization any further].
Slide 255

Static / Dynamic Scheduling
Scheduling

With static scheduling all decisions are made before the system starts running. This only works when there is perfect information available in advance about the work needed to be done and the deadlines that have to be met. Static scheduling - if applied - is used in real-time systems that operate in a deterministic environment.

With dynamic scheduling all decisions are made at run time. Little needs to be known in advance. Dynamic scheduling is required when the number and type of requests is not known beforehand (non deterministic environment). Interactive computer systems like personal computers use dynamic scheduling. The scheduling algorithm is carried out as a (hopefully short) system process in-between the other processes.
Slide 256

Scheduling Algorithms
Scheduling

- First Come First Served
- Shortest Job First
- Priority Scheduling
- Round Robin
- Multilevel Queueing

These algorithms typically are dynamic scheduling algorithms.
Slide 257

First Come - First Served
Scheduling

The process that entered the ready queue first will be the first one scheduled. The ready queue is a FIFO queue. Cooperative scheduling (no preemption).

Process   Burst time
P1        24 ms
P2         3 ms
P3         3 ms

Let the processes arrive in the order P1, P2, P3. The Gantt chart for the schedule is:

| P1 | P2 | P3 |
0    24   27   30   t [ms]

Waiting time for P1 = 0 ms, for P2 = 24 ms, for P3 = 27 ms. Average waiting time: (0 ms + 24 ms + 27 ms) / 3 = 17 ms.
Slide 258
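The waiting-time arithmetic of FCFS is simple enough to put into a few lines. This is a minimal sketch assuming that all processes arrive at t = 0 and are served in array order; the function name fcfs_average_wait is made up for illustration.

```c
#include <assert.h>

/* Average waiting time under FCFS for processes that all arrive at t = 0
 * and are served in array order. burst[] holds the CPU burst times in ms. */
double fcfs_average_wait(const int burst[], int n) {
    assert(n > 0);
    int clock = 0;         /* time at which the next process starts */
    int total_wait = 0;    /* sum of all waiting times */
    for (int i = 0; i < n; i++) {
        total_wait += clock;    /* process i has waited until now */
        clock += burst[i];      /* and then runs to completion */
    }
    return (double)total_wait / n;
}
```

Called with the bursts {24, 3, 3} it reproduces the 17 ms from the example. Called with the same bursts in the order {3, 3, 24} it returns 3 ms, which shows how strongly the FCFS result depends on the arrival order.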

First Come - First Served
Scheduling

Let the processes now arrive in the order P2, P3, P1. The Gantt chart for the schedule is:

| P2 | P3 | P1 |
0    3    6    30   t [ms]

Waiting time for P1 = 6 ms, for P2 = 0 ms, for P3 = 3 ms. Average waiting time: (6 ms + 0 ms + 3 ms) / 3 = 3 ms.

Much better average waiting time than previous case. With FCFS the waiting time generally is not minimal. No preemption.
Slide 259

Shortest Job First (SJF)
Scheduling

Associate with each process the length of its next CPU burst. Use these lengths to schedule the process with the shortest time. Two schemes:

Non-preemptive SJF
Once the CPU is given to the process, it cannot be preempted until the CPU burst is completed.

Preemptive SJF
When a new process arrives with a CPU burst length less than the remaining burst time of the current process, the CPU is given to the new process. This scheme is known as Shortest Remaining Time First (SRTF).

With respect to the waiting time, SJF is provably optimal. It gives the minimum average waiting time for a given set of processes. Processes with long bursts may suffer from starvation.
Slide 260

Shortest Job First
Scheduling

Process   Arrival time   Burst time
P1        0 ms           7 ms
P2        2 ms           4 ms
P3        4 ms           1 ms
P4        5 ms           4 ms

For non-preemptive scheduling the Gantt chart is:

| P1 | P3 | P2 | P4 |
0    7    8    12   16   t [ms]

Waiting time for P1 = 0 ms, for P2 = 6 ms, for P3 = 3 ms, for P4 = 7 ms. Average waiting time: (0 ms + 6 ms + 3 ms + 7 ms) / 4 = 4 ms.
Slide 261

Shortest Job First
Scheduling

Process   Arrival time   Burst time
P1        0 ms           7 ms
P2        2 ms           4 ms
P3        4 ms           1 ms
P4        5 ms           4 ms

For preemptive scheduling (SRTF) the Gantt chart is:

| P1 | P2 | P3 | P2 | P4 | P1 |
0    2    4    5    7    11   16   t [ms]

Waiting time for P1 = 9 ms, for P2 = 1 ms, for P3 = 0 ms, for P4 = 2 ms. Average waiting time: (9 ms + 1 ms + 0 ms + 2 ms) / 4 = 3 ms.
Slide 262
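Both SJF variants can be reproduced with a small simulation in 1 ms steps. The sketch below is an illustration only: the function name sjf_simulate, the limit MAXP and the tie-breaking rule (lower index wins) are assumptions, and a real scheduler would react to events instead of stepping through time.

```c
#include <assert.h>

#define MAXP 16   /* upper bound on the number of processes (assumed) */

/* Simulate SJF in 1 ms steps. If preemptive is non-zero the scheduler
 * re-decides every millisecond (SRTF); otherwise a started burst runs to
 * completion. Writes each waiting time to wait[] and returns their sum. */
int sjf_simulate(const int arrival[], const int burst[], int n,
                 int preemptive, int wait[]) {
    assert(n > 0 && n <= MAXP);
    int remaining[MAXP];
    int done = 0, total = 0, current = -1;
    for (int i = 0; i < n; i++) remaining[i] = burst[i];

    for (int t = 0; done < n; t++) {
        if (preemptive || current < 0) {
            /* pick the arrived, unfinished process with the least
             * remaining time; ties go to the lower index */
            current = -1;
            for (int i = 0; i < n; i++)
                if (arrival[i] <= t && remaining[i] > 0 &&
                    (current < 0 || remaining[i] < remaining[current]))
                    current = i;
            if (current < 0) continue;      /* CPU idle during this ms */
        }
        if (--remaining[current] == 0) {
            /* waiting time = completion - arrival - burst */
            wait[current] = (t + 1) - arrival[current] - burst[current];
            total += wait[current];
            done++;
            current = -1;
        }
    }
    return total;
}
```

For the four processes of the example the sum of waiting times is 16 ms without preemption and 12 ms with preemption, that is, the averages of 4 ms and 3 ms given above.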

Shortest Job First
Scheduling

Predicting the CPU burst time

The next CPU burst is predicted as the exponential average of the measured lengths of previous bursts:

$\tau_{n+1} = \alpha\, t_n + (1 - \alpha)\, \tau_n$

$\tau_{n+1}$ = predicted length of next burst
$t_n$ = actual length of nth burst
$\alpha$, $0 \le \alpha \le 1$: controls the relative contributions of the recent and the past history
Slide 263

Shortest Job First
Scheduling

Exponential average for $\alpha = 1/2$ and $\tau_0 = 10$

[Figure from [Sil00 p.144]: measured bursts $t_i$ and predictions $\tau_i$ over the burst index i]
Slide 264
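The prediction rule translates directly into code. This is a minimal sketch; the function names predict_next_burst and predict_series are made up for illustration.

```c
#include <assert.h>

/* One step of the exponential average:
 * tau_next = alpha * t + (1 - alpha) * tau */
double predict_next_burst(double alpha, double t, double tau) {
    assert(alpha >= 0.0 && alpha <= 1.0);
    return alpha * t + (1.0 - alpha) * tau;
}

/* Fill tau[1..n] from the measured bursts t[0..n-1], starting at tau[0]. */
void predict_series(double alpha, const double t[], int n, double tau[]) {
    for (int i = 0; i < n; i++)
        tau[i + 1] = predict_next_burst(alpha, t[i], tau[i]);
}
```

With alpha = 0.5, tau[0] = 10 and the measured bursts 6, 4, 6, 4, 13, 13, 13 the predictions are 8, 6, 6, 5, 9, 11 and 12. With alpha = 0 the history alone decides and the prediction never changes; with alpha = 1 only the most recent burst counts.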

Shortest Job First
Worked example of the exponential average $\tau_{n+1} = \alpha\, t_n + (1 - \alpha)\, \tau_n$ for $\alpha = 1/2$ and $\tau_0 = 10$:

$\tau_1 = \tfrac{1}{2} \cdot 6 + \tfrac{1}{2} \cdot 10 = 8$
$\tau_2 = \tfrac{1}{2} \cdot 4 + \tfrac{1}{2} \cdot 8 = 6$
$\tau_3 = \tfrac{1}{2} \cdot 6 + \tfrac{1}{2} \cdot 6 = 6$
$\tau_4 = \tfrac{1}{2} \cdot 4 + \tfrac{1}{2} \cdot 6 = 5$
$\tau_5 = \tfrac{1}{2} \cdot 13 + \tfrac{1}{2} \cdot 5 = 9$
$\tau_6 = \tfrac{1}{2} \cdot 13 + \tfrac{1}{2} \cdot 9 = 11$
$\tau_7 = \tfrac{1}{2} \cdot 13 + \tfrac{1}{2} \cdot 11 = 12$
Slide 265

Priority Scheduling
Scheduling

Each process is assigned a priority. The process with highest priority is allocated the CPU. Two schemes:

Non-preemptive

Preemptive
When a new process arrives with a priority higher than a running process, the CPU is given to the new process.

SJF scheduling is a special case of priority scheduling in which the priority is the inverse of the CPU burst length. Solution to starvation problem: The priority of a process increases as the waiting time increases (aging technique).
Slide 266

Priority Scheduling
Scheduling

Assume low numbers representing high priorities

Process   Burst time   Priority
P1        10 ms        3
P2         1 ms        1
P3         2 ms        4
P4         1 ms        5
P5         5 ms        2

All processes arrive at time 0. For non-preemptive scheduling the Gantt chart is:

| P2 | P5 | P1 | P3 | P4 |
0    1    6    16   18   19   t [ms]
Slide 267

Priority Scheduling
Scheduling

Process   Burst time   Arrival time   Priority
P1        10 ms         0 ms          3
P2         1 ms         2 ms          1
P3         2 ms         2 ms          4
P4         1 ms         6 ms          5
P5         5 ms        12 ms          2

Here: preemptive scheduling.

[Timing diagram: processes sorted by priority (P2, P5, P1, P3, P4) over t [ms], axis 0 to 20, marking for each process when it is running and when it is ready]
Slide 268
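Preemptive priority scheduling with arrival times can be traced with a simulation in 1 ms steps. The following is a sketch under assumptions: the function name priority_simulate and the limit MAXP are invented, ties go to the lower index, and context switches cost no time.

```c
#include <assert.h>

#define MAXP 16   /* upper bound on the number of processes (assumed) */

/* Preemptive priority scheduling in 1 ms steps. Low numbers mean high
 * priority. Writes each process's completion time (ms) to finish[]. */
void priority_simulate(const int arrival[], const int burst[],
                       const int priority[], int n, int finish[]) {
    assert(n > 0 && n <= MAXP);
    int remaining[MAXP];
    int done = 0;
    for (int i = 0; i < n; i++) remaining[i] = burst[i];

    for (int t = 0; done < n; t++) {
        /* every millisecond: run the arrived, unfinished process
         * with the highest priority (lowest number) */
        int current = -1;
        for (int i = 0; i < n; i++)
            if (arrival[i] <= t && remaining[i] > 0 &&
                (current < 0 || priority[i] < priority[current]))
                current = i;
        if (current < 0) continue;          /* CPU idle during this ms */
        if (--remaining[current] == 0) {
            finish[current] = t + 1;
            done++;
        }
    }
}
```

If all processes arrive at time 0, nothing is ever preempted and the completion times 16, 1, 18, 19 and 6 ms for P1 to P5 match the non-preemptive Gantt chart. For the table with arrival times, P1 is preempted by P2 at 2 ms and P3 by P5 at 12 ms; the completion times become 11, 3, 18, 19 and 17 ms.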

Round Robin
Scheduling

Each process gets a small unit of CPU time (time quantum), usually 10-100 milliseconds. After the quantum has elapsed, the process is preempted and added to the end of the ready queue.

Burst < quantum
When the current CPU burst is smaller than the time quantum, the process itself will release the CPU (changing state into waiting).

Burst > quantum
The process is interrupted and another process is dispatched.

If the time quantum is very large compared to the processes' burst times, the scheduling policy is the same as FCFS. If the time quantum is very small, the round robin policy turns into processor sharing (it seems as if each process has its own processor).
Slide 269

Round Robin
Scheduling

Process   Burst time
P1        53 ms
P2        17 ms
P3        68 ms
P4        24 ms

Suppose a time quantum of 20 ms. The Gantt chart for the schedule is:

| P1 | P2 | P3 | P4 | P1 | P3  | P4  | P1  | P3  | P3  |
0    20   37   57   77   97    117   121   134   154   162   t [ms]

Waiting time for P1 = 0 + 57 + 24 = 81 ms, for P2 = 20 ms, for P3 = 37 + 40 + 17 = 94 ms, for P4 = 57 + 40 = 97 ms. Average waiting time: (81 + 20 + 94 + 97) / 4 = 73 ms.
Slide 270
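The round robin schedule can be reproduced with a ready queue that is served in quantum-sized slices. This is a minimal sketch assuming that all processes arrive at t = 0 in array order and that context switches cost no time; the function name rr_simulate and the limit MAXP are made up for illustration.

```c
#include <assert.h>

#define MAXP 16   /* upper bound on the number of processes (assumed) */

/* Round robin for processes that all arrive at t = 0 in array order.
 * Writes the completion times (ms) to finish[]. */
void rr_simulate(const int burst[], int n, int quantum, int finish[]) {
    assert(n > 0 && n <= MAXP && quantum > 0);
    int remaining[MAXP];
    int queue[MAXP];                /* circular ready queue of process ids */
    int head = 0, count = n, t = 0;
    for (int i = 0; i < n; i++) { remaining[i] = burst[i]; queue[i] = i; }

    while (count > 0) {
        int p = queue[head];                    /* dispatch head of queue */
        head = (head + 1) % MAXP;
        count--;
        int slice = remaining[p] < quantum ? remaining[p] : quantum;
        t += slice;
        remaining[p] -= slice;
        if (remaining[p] > 0) {                 /* preempted: back to the tail */
            queue[(head + count) % MAXP] = p;
            count++;
        } else {
            finish[p] = t;                      /* burst completed */
        }
    }
}
```

For the bursts 53, 17, 68 and 24 ms with a quantum of 20 ms the completion times are 134, 37, 162 and 121 ms. Subtracting each burst time gives the waiting times 81, 20, 94 and 97 ms from the example.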

Round Robin
Scheduling

Round Robin typically has higher average turnarounds than SJF, but has better response.

Context switch and performance
The smaller the time quanta, the more the context switches do affect performance. Following is shown a process with a 10 ms burst, and time quanta of 12, 6 and 1 ms.

Context switches cause overhead

Figure from [Sil00 p.148]
Slide 271

Round Robin
Scheduling

Turnaround time depends on time quantum

Turnaround time as function of time quantum
All processes arrive at same time. Ready queue order: P1, P2, P3, P4

Figure from [Sil00 p.149]
Slide 272

Round Robin
Scheduling

Average turnaround time for time quantum = 1 ms

[Timing diagram: P1 to P4 over t [ms], axis 0 to 20]

Turnaround (P1) = 15 ms
Turnaround (P2) = 9 ms
Turnaround (P3) = 3 ms
Turnaround (P4) = 17 ms

Average turnaround: (15 + 9 + 3 + 17) ms / 4 = 11 ms
Slide 273

Round Robin
Scheduling

Average turnaround time for time quantum = 2 ms

[Timing diagram: P1 to P4 over t [ms], axis 0 to 20]

Turnaround (P1) = 14 ms
Turnaround (P2) = 10 ms
Turnaround (P3) = 5 ms
Turnaround (P4) = 17 ms

Average turnaround: (14 + 10 + 5 + 17) ms / 4 = 11.5 ms
Slide 274

Round Robin
Scheduling

Average turnaround time for time quantum = 6 ms

[Timing diagram: P1 to P4 over t [ms], axis 0 to 20]

Side note: policy now is like FCFS

Turnaround (P1) = 6 ms
Turnaround (P2) = 9 ms
Turnaround (P3) = 10 ms
Turnaround (P4) = 17 ms

Average turnaround: (6 + 9 + 10 + 17) ms / 4 = 10.5 ms
Slide 275

Multilevel Queue
Scheduling

The ready queue is partitioned into separate queues. Each queue has its own CPU scheduling algorithm. There is also scheduling between the queues (inter queue).

Interqueue scheduling
- Fixed priority
- Time slicing

Figure from [Sil00 p.150]
Slide 276

Process Management

Processes (153)
Threads (178)
Interprocess Communication (IPC) (195)
Scheduling (247)
Real-Time Scheduling (278)
Deadlocks (318)
Slide 277

Real-Time Scheduling
Scheduling

[Figure: timeline of the technical process with the event distance Tdist and the maximum response time TRmax, above the timeline of the RT system with request r, waiting (ready) time Tw, context switch Tcs, start s, execution e (inclusive output), completion c, deadline d and the reaction time TR]

Realtime condition: TR ≤ TRmax, otherwise realtime-violation.
Slide 278

Real-Time Scheduling
Scheduling

A technical process generates events (periodically or not). A real-time computing system is requested to respond to the events. The response must be delivered within the period TRmax. The technical system requests computation by raising an interrupt at time r at the real-time system. The time from the occurrence of the request (interrupt) until the context switch of the corresponding computer process is the waiting time Tw. Switching the context takes the time TCS. The point in time at which execution starts is the start time s. The execution time e is the net CPU time needed for execution (even if the process is interrupted). The process finishes at completion time c.
Slide 279

Real-Time Scheduling
Scheduling

The reaction time (also called response time) TR is the time interval between the request (the interrupt) and the end of the process: TR = Tw + TCS + e. This is the time interval the technical system has to wait until the response is received. Starting from the request, the maximum response time TRmax defines the deadline d (a point in time) at which the real-time system must have responded. A hard real-time system must not violate the real-time conditions.

Note: For all following considerations, the context switch time TCS is neglected, that is, we assume TCS = 0 s.
In accordance with D. Zöbel, W. Albrecht: Echtzeitsysteme, page 24, ISBN 3-8266-0150-5
Slide 280

Real-Time Violation
RT-Scheduling

Example RT.1

Two technical processes TP1 and TP2 on some machine require response from a real-time system. The corresponding computer processes are P1 and P2. The technical processes generate events as follows:

[Figure: event timelines of TP1 and TP2 over t [ms] (events a, b, c, ...), with TRmax1 and TRmax2 marked]

Response must be given at the latest just before the next event (thus within Tdist).

The execution time of P1 is 1 ms, the execution time of P2 is 4 ms, and the scheduling algorithm is preemptive priority scheduling. The context switch time is considered negligible (0 s).
Slide 281

Real-Time Violation
Case 1: P1 low priority, P2 high priority

Machine   TRmax   Process   response time TR   Priority
TP1       4 ms    P1        1 ms               LOW
TP2       6 ms    P2        4 ms               HIGH

[Figure: event timelines of TP1 and TP2 (events a, b, c) above the resulting schedule of P1 and P2 over t [ms]]

Real-time violation, response to TP1 is too late!
Slide 282
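The effect of the priority assignment in example RT.1 can be checked with a few lines of arithmetic. The sketch below assumes that both requests arrive at the same instant (as for event a at t = 0) and that TCS = 0; the function name both_meet_deadline is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Two processes requested at the same instant under preemptive priority
 * scheduling: the high-priority one runs first, the other one afterwards.
 * Returns true if both reaction times stay within their TRmax (all in ms). */
bool both_meet_deadline(int e_high, int trmax_high, int e_low, int trmax_low) {
    assert(e_high >= 0 && e_low >= 0);
    int tr_high = e_high;             /* Tw = 0 for the high-priority process */
    int tr_low  = e_high + e_low;     /* Tw = e_high for the low-priority one */
    return tr_high <= trmax_high && tr_low <= trmax_low;
}
```

With P2 (e = 4 ms, TRmax = 6 ms) at high priority and P1 (e = 1 ms, TRmax = 4 ms) at low priority, P1 responds after 5 ms and the function returns false. With the priorities swapped, P1 responds after 1 ms and P2 after 5 ms, so both limits are met.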

Real-Time Violation
Case 2: P1 high priority, P2 low priority

Machine   TRmax   Process   response time TR   Priority
TP1       4 ms    P1        1 ms               HIGH
TP2       6 ms    P2        4 ms               LOW

[Figure: event timelines of TP1 and TP2 (events a, b, c) above the resulting schedule of P1 and P2 over t [ms]]

No real-time violation. Fine!
Slide 283

Real-Time Scheduling
Scheduling

Theorem

For a system with n processors (n ≥ 2) there is no optimal scheduling algorithm for a set of processes P1 ... Pm unless
- all starting times s1, ... sm,
- all execution times e1, ... em,
- all completion times c1, ... cm
are known (deterministic systems).

Often, technical processes (or natural processes) are non-deterministic, at least in part.

An algorithm is optimal when it finds an effective solution if one exists.
Slide 284

Branch-and-Bound Scheduling
RT-Scheduling

Find a schedule by searching all combinations of processes.

For each process (non-preemptive!) the following must be known in advance:
- the request time r (interrupt arrival time): known in case of periodical technical processes
- the response time TR: known from analysis or worst-case measurements
- the deadline d: given by the technical system

Example:

Process   request time ri   execution time e   deadline di
P1        0 ms              20 ms              30 ms
P2        0 ms              50 ms              90 ms
P3        0 ms              30 ms              100 ms
Slide 285

Branch-and-Bound Scheduling
RT-Scheduling

Search tree for the example

Level 1: P1 | P2 | P3
Level 2: P1, P2 | P1, P3 | P2, P1 | P2, P3 | P3, P1 | P3, P2
Level 3: P1, P2, P3 | P1, P3, P2 | P2, P1, P3 | P2, P3, P1 | P3, P1, P2 | P3, P2, P1

For n processes: tree depth (number of levels) = n, number of combinations = n!
Slide 286

Branch-and-Bound Scheduling
RT-Scheduling

[Gantt charts of P1, P2, P3 over t [ms], axis 0 to 110, with the deadlines d1, d2, d3 marked]

Sequence P1, P2, P3
Sequence P1, P3, P2: Real-time violation
Slide 287

Branch-and-Bound Scheduling
RT-Scheduling

[Gantt charts of P1, P2, P3 over t [ms], axis 0 to 110, with the deadlines d1, d2, d3 marked]

Sequence P2, P1, P3: Real-time violation
Sequence P2, P3, P1: Real-time violation
Slide 288

Branch-and-Bound Scheduling
RT-Scheduling

[Gantt charts of P1, P2, P3 over t [ms], axis 0 to 110, with the deadlines d1, d2, d3 marked]

Sequence P3, P1, P2: Real-time violation
Sequence P3, P2, P1: Real-time violation
Slide 289

Branch-and-Bound Scheduling
RT-Scheduling

Search tree for the example

Level 1: P1 | P2 | P3
Level 2: P1, P2 | P1, P3 | P2, P1 | P2, P3 | P3, P1 | P3, P2
Level 3: P1, P2, P3 | P1, P3, P2 | P2, P1, P3 | P2, P3, P1 | P3, P1, P2 | P3, P2, P1

The only solution: P1 must be first, P2 must be second.
Slide 290
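The search over the tree, including the cutting of subtrees whose last process already misses its deadline, can be sketched as a recursive function. This is only an illustration under assumptions: it walks the tree depth-first rather than level by level, all request times are taken as 0 as in the example, and the names search, find_schedule and MAXP are invented.

```c
#include <assert.h>
#include <stdbool.h>

#define MAXP 8   /* upper bound on the number of processes (assumed) */

/* Depth-first branch-and-bound over all orders of n non-preemptive
 * processes requested at t = 0. A branch is cut as soon as the process
 * just appended would finish after its deadline. */
static bool search(const int e[], const int d[], int n,
                   bool used[], int order[], int depth, int clock) {
    if (depth == n) return true;              /* leaf: complete schedule */
    for (int i = 0; i < n; i++) {
        if (used[i]) continue;
        int finish = clock + e[i];
        if (finish > d[i]) continue;          /* bound: disregard this subtree */
        used[i] = true;
        order[depth] = i;
        if (search(e, d, n, used, order, depth + 1, finish)) return true;
        used[i] = false;                      /* backtrack */
    }
    return false;
}

/* Returns true and fills order[0..n-1] with process indices if a schedule
 * without real-time violation exists; returns false otherwise. */
bool find_schedule(const int e[], const int d[], int n, int order[]) {
    assert(n > 0 && n <= MAXP);
    bool used[MAXP] = { false };
    return search(e, d, n, used, order, 0, 0);
}
```

For the execution times 20, 50, 30 ms and the deadlines 30, 90, 100 ms it finds the order P1, P2, P3 (indices 0, 1, 2). If no order satisfies all deadlines, the function returns false after the whole tree has been cut away.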

Branch-and-Bound Scheduling
For small n one may directly investigate the n! combinations at the leafs. For bigger n it is recommended to start from the root and investigate all nodes (level by level). When a node violates the real-time condition the corresponding sub tree can be disregarded.
RT-Scheduling

Deadline Scheduling
RT-Scheduling

Priority Scheduling. The process with the closest deadline has highest priority. When processes have the same deadline, selection is done arbitrarily or according to FCFS.

Non-preemptive
The algorithm is carried out after a running process finishes. Intermediate requests are saved (interrupt flip-flops) meanwhile.

P1 P1, P2 P1, P3 P2, P1

P2 P2, P3 P3, P1

P3 P3, P2

Preemptive
The algorithm is carried out when a request arrives (interrupt routine) or after a process finishes.


The deadline scheduling algorithm is also known as earliest deadline first (EDF). The algorithm is optimal for the one-processor case.
If there is a solution, it is found. If none is found then there is no solution.

Deadline Scheduling
Example RT.2: Non-preemptive scheduling of the processes P1, P2, P3, P4

Deadline Scheduling
RT-Scheduling

Example RT.3: Preemptive scheduling

Process   request time ri   execution time e   deadline di
P1        0 ms              2 ms               4 ms
P2        3 ms              3 ms               14 ms
P3        6 ms              3 ms               12 ms
P4        5 ms              4 ms               10 ms

RT-Scheduling
Remember, context switch time is neglected.

Process table for example RT.2:

Process   request time ri   execution time e   deadline di
P1        0 ms              4 ms               5 ms
P2        0 ms              1 ms               7 ms
P3        0 ms              2 ms               7 ms
P4        0 ms              5 ms               13 ms

Deadline is the same, choice is arbitrary. Could be sequence P3, P2 as well.
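As a rough illustration, non-preemptive deadline scheduling for example RT.2 can be sketched as follows. The function name and the tuple layout are assumptions made for this sketch; ties between equal deadlines are resolved FCFS (by request time, then by list order), which is one of the two policies the slides allow.

```python
from typing import List, Tuple

# (name, request time, execution time, deadline), all in ms (example RT.2)
RT2: List[Tuple[str, int, int, int]] = [
    ("P1", 0, 4, 5),
    ("P2", 0, 1, 7),
    ("P3", 0, 2, 7),
    ("P4", 0, 5, 13),
]


def edf_nonpreemptive(procs: List[Tuple[str, int, int, int]]) -> List[Tuple[str, int, int]]:
    """Return the schedule as a list of (name, start, end)."""
    pending = sorted(procs, key=lambda p: p[1])  # stable sort: FCFS order
    now = 0
    schedule = []
    while pending:
        ready = [p for p in pending if p[1] <= now]
        if not ready:
            now = min(p[1] for p in pending)  # idle until the next request
            continue
        # The scheduler runs only when the processor is free (non-preemptive).
        chosen = min(ready, key=lambda p: (p[3], p[1]))  # earliest deadline first
        pending.remove(chosen)
        start, now = now, now + chosen[2]
        schedule.append((chosen[0], start, now))
    return schedule


if __name__ == "__main__":
    for name, start, end in edf_nonpreemptive(RT2):
        print(f"{name}: {start} ms .. {end} ms")
```

With this tie-breaking rule P2 runs before P3; the opposite order would meet the deadlines as well.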

[Figures: Gantt charts of the resulting schedules for examples RT.2 and RT.3 (t = 0 ... 20 ms, deadlines d1 ... d4 marked).]

Deadline Scheduling
Continuation of example RT.3
t = 0 ms: Request for P1 arrives. Since there is no other process, P1 is scheduled.
t = 2 ms: P1 finishes. Since there are no requests, the scheduler has nothing to do.
t = 3 ms: Request for P2 arrives. Since there is no other process, P2 is scheduled.
t = 5 ms: Request for P4 arrives. The deadline d4 is closer than the deadline of the running process P2. P4 has higher priority and is scheduled.
t = 6 ms: Request for P3 arrives. Deadline d3 is more distant than d4, the deadline of the running process, so nothing changes. P4 continues.
t = 9 ms: P4 finishes. The closest deadline now is d3, so P3 is scheduled.
t = 12 ms: P3 finishes. The closest deadline now is d2, so P2 is scheduled again.
t = 13 ms: P2 finishes. There are no processes ready. Nothing to schedule.
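The walk-through of example RT.3 can be reproduced with a small simulation. This is a sketch under simplifying assumptions: time advances in steps of 1 ms (all values of the example are whole milliseconds), the scheduler is re-evaluated at every step, and context switch time is neglected. The function name and data layout are chosen for this illustration only.

```python
from typing import Dict, List, Optional, Tuple

# name -> (request time, execution time, deadline), all in ms (example RT.3)
RT3: Dict[str, Tuple[int, int, int]] = {
    "P1": (0, 2, 4),
    "P2": (3, 3, 14),
    "P3": (6, 3, 12),
    "P4": (5, 4, 10),
}


def edf_preemptive(procs: Dict[str, Tuple[int, int, int]], horizon: int) -> List[Optional[str]]:
    """Return, for every 1 ms slot, the running process (None means idle)."""
    remaining = {name: e for name, (_, e, _) in procs.items()}
    timeline: List[Optional[str]] = []
    for t in range(horizon):
        ready = [n for n, (r, _, _) in procs.items() if r <= t and remaining[n] > 0]
        if not ready:
            timeline.append(None)
            continue
        running = min(ready, key=lambda n: procs[n][2])  # closest deadline wins
        remaining[running] -= 1
        timeline.append(running)
    return timeline


if __name__ == "__main__":
    for t, name in enumerate(edf_preemptive(RT3, 13)):
        print(f"t = {t:2d} ms: {name or 'idle'}")
```

The printed timeline shows P2 being preempted by P4 at t = 5 ms and resumed at t = 12 ms.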

Deadline Scheduling
RT-Scheduling

For multi-processor systems, the algorithm is not optimal.


Example RT.4: Three processes and two processors. Non-preemptive scheduling.

Process   request time ri   execution time e   deadline di
P1        0 ms              8 ms               10 ms
P2        0 ms              5 ms               9 ms
P3        0 ms              4 ms               9 ms

[Figure: Gantt chart for processors 1 and 2 (t = 0 ... 10 ms). P2 and P3 are scheduled first; P1 starts afterwards and causes a real-time violation.]


Real-Time Scheduling
Scheduling

Real-Time Scheduling
When there are n processes that are
- periodic,
- independent of each other,
- preemptable,
and the response is to be delivered at the latest at the end of each period (that is, TRmax = Tdist), then the processes can be scheduled on a single processor without real-time violation if

$$\sum_{i=1}^{n} \frac{e_i}{T_{dist,i}} \le 1$$

(schedulability test).

Real-Time Scheduling
Example RT.5:

Process   execution time e   deadline di
P1        15 ms              k · 30 ms
P2        25 ms              k · 70 ms
P3        15 ms              k · 200 ms

$$\sum_{i=1}^{3} \frac{e_i}{T_{dist,i}} = \frac{15}{30} + \frac{25}{70} + \frac{15}{200} \approx 0.5 + 0.357 + 0.075 = 0.932 \le 1$$

The processes can be scheduled. Deadline scheduling would yield:

[Figure: Gantt chart of P1, P2, P3 under deadline scheduling for t = 0 ... 200 ms.]

Real-Time Scheduling
Continuation of example RT.5
t = 0 ms: Requests for P1, P2, P3 arrive. P1 has the closest deadline and is scheduled.
t = 15 ms: P1 finishes. The deadline of P2 is closer than the deadline of P3. P2 is scheduled.
t = 30 ms: Request for P1 arrives. Reevaluation of the deadlines yields that P1 has highest priority. P1 is scheduled.
t = 45 ms: P1 finishes. The deadline of P2 still is closer than the deadline of P3. P2 is scheduled.
t = 55 ms: P2 finishes. The only waiting process is P3. P3 thus is scheduled.
t = 60 ms: Request for P1 arrives. Reevaluation of the deadlines yields that P1 has highest priority. P1 is scheduled.
t = 70 ms: Request for P2 arrives. Deadline of P1 is closest, P1 continues.

Real-Time Scheduling
Example RT.6:

Process   execution time e   deadlines di
P1        2 ms               k · 4 ms
P2        3 ms               k · 14 ms
P3        5 ms               k · 12 ms

$$\sum_{i=1}^{3} \frac{e_i}{T_{dist,i}} = \frac{2}{4} + \frac{3}{14} + \frac{5}{12} \approx 0.5 + 0.214 + 0.417 = 1.131 > 1$$

...

This means an overutilization of the microprocessor. The processor would have to execute more than one process at a time (which is impossible). Therefore there is no schedule that would not violate the real-time condition sooner or later (on a single-processor system). The schedulability test failed.
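The schedulability test is a one-line computation. The following sketch applies it to the values of examples RT.5 and RT.6; the function names are chosen for this illustration.

```python
from typing import List, Tuple

# (execution time e, period Tdist) in ms
RT5: List[Tuple[float, float]] = [(15, 30), (25, 70), (15, 200)]
RT6: List[Tuple[float, float]] = [(2, 4), (3, 14), (5, 12)]


def utilization(tasks: List[Tuple[float, float]]) -> float:
    """Sum of e_i / Tdist_i over all periodic processes."""
    return sum(e / period for e, period in tasks)


def is_schedulable(tasks: List[Tuple[float, float]]) -> bool:
    """Single-processor test for periodic, independent, preemptable processes."""
    return utilization(tasks) <= 1.0


if __name__ == "__main__":
    for label, tasks in (("RT.5", RT5), ("RT.6", RT6)):
        print(f"{label}: U = {utilization(tasks):.3f}, schedulable: {is_schedulable(tasks)}")
```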

Laxity Scheduling
Priority scheduling. The process with the least laxity has the highest priority. For equal laxities the selection policy is arbitrary or FCFS.

The laxity is the period of time left in which a process can be started without violating its deadline. The process must be started at the latest when its laxity reaches 0, otherwise it will not finish in time. The execution time e of the process must be known, of course.

Laxity: $lax = (d - now) - e$

now is the point in time at which the laxity is (re)calculated. Usually this is the point in time at which a new request arrives (preemptive scheduling) or at which a process finishes.

Laxity Scheduling
Deadline scheduling focuses on the deadline but does not take the execution time e of a process into account. Laxity scheduling does, so it sometimes finds a solution that deadline scheduling does not find.

Example RT.7: Three processes and two processors. Non-preemptive scheduling. Same as in example RT.4.

Process   request time ri   execution time e   deadline di
P1        0 ms              8 ms               10 ms
P2        0 ms              5 ms               9 ms
P3        0 ms              4 ms               9 ms

The processes now undergo laxity scheduling (see the continuation of the example).

Laxity Scheduling
Continuation of example RT.7
t = 0 ms: Requests for P1, P2, P3 arrive. The laxities are: lax1 = 2 ms, lax2 = 4 ms, lax3 = 5 ms. Least laxity is lax1, so P1 is scheduled on processor 1. Processor 2 is not yet assigned, so P2 is chosen (lax2 < lax3).
t = 5 ms: P2 finishes. The only process waiting is P3, so it is scheduled.
t = 8 ms: P1 finishes. No new processes to schedule.

[Figure: Gantt chart (t = 0 ... 10 ms). Processor 1: P1. Processor 2: P2, P3.]

No real-time violation, as opposed to the deadline scheduling example RT.4.

Laxity Scheduling
Laxity scheduling, like deadline scheduling, is generally not optimal for multi-processors. That is, it does not always find a solution.

Example RT.8: Four processes and two processors. Non-preemptive scheduling.

Process   request time ri   execution time e   deadline di
P1        0 ms              1 ms               1 ms
P2        0 ms              5 ms               6 ms
P3        0 ms              3 ms               5 ms
P4        0 ms              5 ms               8 ms

Laxity Scheduling
Continuation of example RT.8
t = 0 ms: Requests for P1, P2, P3, P4 arrive. The laxities are: lax1 = 0 ms, lax2 = 1 ms, lax3 = 2 ms, lax4 = 3 ms. Least laxity is lax1, so P1 is scheduled on processor 1. Second least laxity is lax2, so P2 is chosen for processor 2.
t = 1 ms: P1 finishes. Least laxity is lax3 (now 1 ms), so P3 is scheduled on processor 1.
t = 4 ms: P3 finishes. Least laxity is lax4 (now -1 ms), so P4 is scheduled on processor 1 ... but it is already too late (negative laxity).

[Figure: Gantt chart (t = 0 ... 10 ms, d4 marked). Processor 1: P1, P3, P4. Processor 2: P2. P4 causes a real-time violation.]

Laxity Scheduling
Continuation of example RT.8
However, there exists a schedule that works well:

[Figure: non-violating schedule found through deadline scheduling (t = 0 ... 10 ms, d4 marked). Processor 1: P1, P2. Processor 2: P3, P4.]

Scheduling non-preemptive processes in a multi-processor system is a complex problem.


This is even the case in a two-processor system when all request times ri are the same and all deadlines di are the same.
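Examples RT.4, RT.7 and RT.8 can be replayed with one small list scheduler that takes the priority rule as a parameter. This is a sketch, assuming all requests arrive at t = 0 (as in the examples) and that ties are resolved in list order; the function names are made up for this illustration.

```python
from typing import Callable, Dict, List, Tuple

Procs = Dict[str, Tuple[int, int]]  # name -> (execution time e, deadline d) in ms

RT4: Procs = {"P1": (8, 10), "P2": (5, 9), "P3": (4, 9)}             # also RT.7
RT8: Procs = {"P1": (1, 1), "P2": (5, 6), "P3": (3, 5), "P4": (5, 8)}


def by_deadline(procs: Procs) -> Callable[[str, int], int]:
    return lambda name, now: procs[name][1]


def by_laxity(procs: Procs) -> Callable[[str, int], int]:
    # lax = (d - now) - e
    return lambda name, now: (procs[name][1] - now) - procs[name][0]


def schedule(procs: Procs, n_cpus: int, key: Callable[[str, int], int]) -> Dict[str, Tuple[int, int, int]]:
    """Non-preemptive scheduling; returns name -> (cpu, start, end)."""
    free_at = [0] * n_cpus
    waiting = list(procs)
    result = {}
    while waiting:
        cpu = min(range(n_cpus), key=lambda c: free_at[c])  # next processor to become free
        now = free_at[cpu]
        name = min(waiting, key=lambda n: key(n, now))       # smallest key has highest priority
        waiting.remove(name)
        end = now + procs[name][0]
        result[name] = (cpu, now, end)
        free_at[cpu] = end
    return result


def violations(procs: Procs, result: Dict[str, Tuple[int, int, int]]) -> List[str]:
    return sorted(n for n, (_, _, end) in result.items() if end > procs[n][1])


if __name__ == "__main__":
    for label, procs in (("RT.4 / RT.7", RT4), ("RT.8", RT8)):
        print(label, "deadline:", violations(procs, schedule(procs, 2, by_deadline(procs))))
        print(label, "laxity:  ", violations(procs, schedule(procs, 2, by_laxity(procs))))
```

Each rule fails on one of the two process sets, which illustrates that neither is optimal for multi-processor systems.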
Rate Monotonic Scheduling


Priority scheduling for periodic preemptive processes where the deadlines are equal to the periods. The process with the highest frequency (repetition rate) has the highest priority. Static scheduling.

[Figure: request timelines of technical process 1 and technical process 2, each with its period Tdist.]

Computer process P2 has higher priority than process P1 since its rate is higher.

Although the algorithm is not optimal, it is often used in real-time applications because it is fast and simple (at run time!). Note: static scheduling!

Rate Monotonic Scheduling
A more thorough explanation from [Ta01 p.472]

The classic static real-time scheduling algorithm for preemptable, periodic processes is RMS (Rate Monotonic Scheduling). It can be used for processes that meet the following conditions:
1. Each periodic process must complete within its period.
2. No process is dependent on any other process.
3. Each process needs the same amount of CPU time on each burst.
4. Any non-periodic processes have no deadlines.
5. Process preemption occurs instantaneously and with no overhead.

RMS works by assigning each process a fixed priority equal to the frequency of occurrence of its triggering event. For example, a process that must run every 30 ms (= 33 Hz) receives priority 33, a process that must run every 40 ms (= 25 Hz) receives priority 25. The priorities are linear with the rate, which is why it is called rate monotonic.

Rate Monotonic Scheduling


Example RT.9

Process   request time ri   execution time e   deadline di
A         k · 30 ms         10 ms              (k+1) · 30 ms
B         k · 40 ms         15 ms              (k+1) · 40 ms
C         k · 50 ms         5 ms               (k+1) · 50 ms

[Figure from [Ta01 p.471]: Three periodic processes.]

Rate Monotonic Scheduling
Continuation Example RT.9
The processes A, B, C scheduled with Rate Monotonic Scheduling (RMS) and with Deadline scheduling (EDF).

[Figure from [Ta01 p.473]]

Rate Monotonic Scheduling


Continuation Example RT.9

Up to t = 90 the choices of EDF and RMS are the same. At t = 90 process A is requested again. The RMS scheduler votes for A (process A4 in the figure) since its priority is higher than the priority of B, thus B is interrupted. The deadline scheduler in contrast has a choice because the deadline of A is the same as the deadline of B (dA = dB = 120). In practice, preempting B has some nonzero cost associated, therefore it is better to let B continue.

See the next example (Example RT.10) to dispel the idea that RMS and EDF would always give the same results.

Rate Monotonic Scheduling
Example RT.10: Like RT.9, but process A now has 15 ms execution time.

Process   request time ri   execution time e   deadline di
A         k · 30 ms         15 ms              (k+1) · 30 ms
B         k · 40 ms         15 ms              (k+1) · 40 ms
C         k · 50 ms         5 ms               (k+1) · 50 ms

The schedulability test yields that the processes are schedulable:

$$\sum_{i=1}^{3} \frac{e_i}{T_{dist,i}} = \frac{15}{30} + \frac{15}{40} + \frac{5}{50} = 0.5 + 0.375 + 0.1 = 0.975 \le 1$$

Nevertheless, RMS fails in this example while EDF does not.

Rate Monotonic Scheduling


Continuation Example RT.10

[Figure from [Ta01 p.474]]

RMS leads to a real-time violation. Process C is missing its deadline dC = 50.

Rate Monotonic Scheduling
Why did RMS fail? Using static priorities only works if the CPU utilization is not too high. It was proved* that RMS is guaranteed to work for any system of periodic processes if

$$\sum_{i=1}^{n} \frac{e_i}{T_{dist,i}} \le n\,(2^{1/n} - 1).$$

For n = 2 processes, RMS will work for sure if the CPU utilization is below 0.828.
For n = 3 processes, RMS will work for sure if the CPU utilization is below 0.780.
For n → ∞ processes, RMS will ... if the CPU utilization is below ln 2 (≈ 0.693).

* C.L. Liu, James Layland: Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment, Journal of the ACM, 1973, http://citeseer.ist.psu.edu/liu73scheduling.html

Rate Monotonic Scheduling


In example RT.9 the utilization was 0.808 (thus higher than 0.780), why did it work? We were just lucky. With different periods and execution times, a utilization of 0.808 might fail. In example RT.10 the utilization was so high that there was little hope RMS could work.

In contrast to RMS, deadline scheduling always works for any schedulable set of processes (single-processor system). Deadline scheduling can achieve 100% CPU utilization. The price paid is a more complex algorithm [Ta01 p.475].

Because RMS is static, all priorities are known at run time. Selecting the next process is a matter of just a few machine instructions.
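The utilization bound of RMS and the figures quoted for examples RT.9 and RT.10 can be checked numerically. This is a small sketch; the function names and the three verdict strings are chosen for this illustration.

```python
import math
from typing import List, Tuple

# (execution time e, period Tdist) in ms
RT9: List[Tuple[float, float]] = [(10, 30), (15, 40), (5, 50)]
RT10: List[Tuple[float, float]] = [(15, 30), (15, 40), (5, 50)]


def utilization(tasks: List[Tuple[float, float]]) -> float:
    return sum(e / period for e, period in tasks)


def rms_bound(n: int) -> float:
    """Utilization up to which RMS is guaranteed to work for n processes."""
    return n * (2 ** (1 / n) - 1)


def rms_verdict(tasks: List[Tuple[float, float]]) -> str:
    u = utilization(tasks)
    if u > 1:
        return "not schedulable"   # fails even the general schedulability test
    if u <= rms_bound(len(tasks)):
        return "RMS guaranteed"
    return "no guarantee"          # RMS may or may not work


if __name__ == "__main__":
    for n in (2, 3, 10, 1000):
        print(f"n = {n:4d}: bound = {rms_bound(n):.3f}")
    print(f"ln 2 = {math.log(2):.3f}")
    for label, tasks in (("RT.9", RT9), ("RT.10", RT10)):
        print(f"{label}: U = {utilization(tasks):.3f} -> {rms_verdict(tasks)}")
```

Both examples fall into the region without a guarantee, which matches the observation that RT.9 happens to work while RT.10 does not.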
Slide 315 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Scheduling
Overview of real-time scheduling algorithms

Branch and Bound (German: Planen durch Suchen)
    Description: Try all permutations of processes.
    Preferably used in: static scheduling

Deadline (EDF) (German: Planen nach Fristen)
    Description: Earliest deadline has highest priority. Execution time is not taken into account.
    Preferably used in: dynamic scheduling

Laxity (German: Planen nach Spielräumen)
    Description: Least laxity has highest priority. Execution time is taken into account.
    Preferably used in: dynamic scheduling

RMS (German: Planen nach monotonen Raten)
    Description: Highest repetition rate (frequency) has highest priority. Execution time is not taken into account.
    Preferably used in: static scheduling

Computer Architecture
Processes (153) Threads (178) Interprocess Communication (IPC) (195) Scheduling (247) Real-Time Scheduling (278) Deadlocks (318)

Deadlocks
Consider two processes requiring exclusive access to some shared resources (e.g. file, tape drive, printer, CD writer).

Process 1:
    {
        request(resource1);
        request(resource2);
        ...
        release(resource1);
        release(resource2);
    }

Process 2:
    {
        request(resource2);
        request(resource1);
        ...
        release(resource2);
        release(resource1);
    }

request() is a fictitious system call for requesting exclusive access to a resource. When access cannot be granted, the call blocks until the resource is available.

Deadlocks
[Figure: timeline in which process 1 runs its complete request/release sequence before process 2 starts its own.]
When the two processes are executed sequentially (one after the other), no problem arises.

Deadlocks
[Figure: timeline in which process 2 calls request(resource2) while process 1 still holds the resources; process 2 is blocked until process 1 releases them.]
When process 1 has acquired the resources before process 2 starts trying the same, no problem arises. Process 2 just has to wait.

Deadlocks
[Figure: timeline in which process 1 holds resource1 and is blocked in request(resource2), while process 2 holds resource2 and is blocked in request(resource1).]
Occasionally, when both processes are carried out in parallel as depicted above, both their attempts to acquire the missing resource will cause the processes to block. Since each process holds a resource that the other one needs, and since neither process can release its resource, both processes wait forever (deadlock).

Deadlocks
A set of processes is deadlocked when each process in the set is waiting for an event that only another process in the set can cause. Waiting for an event:
- Waiting for the availability of a resource
- Waiting for some input
- Waiting for a message (IPC) or a signal
- or any other type of event that a process is waiting for in order to continue

Deadlocks
Classical deadlock problem from the non-computer world

[Figure from lecture slides Computer Architecture WS 05/06 (Basermann / Jungmaier): four cars at an intersection, each yielding to the car at its right.]
Every car ought to give way to the car on its right. None will proceed.

Resources
Anything a process / thread needs to continue. Examples: I/O devices like printer, tape, CD-ROM, files, but also internal resources such as process table, thread table, file allocation table or semaphores / mutexes.

Exclusive access
Only one process at a time can use the resource (e.g. printer or writing to a shared file).

Non-exclusive access
More than one process can use the resource at the same time (e.g. reading from a shared file).

Preemptable resources
The resource can (with some non-zero cost) be temporarily taken away from a process and given to another process (e.g. memory swapping).

Non-preemptable resources
The resource cannot be temporarily assigned to another process (e.g. printer, CD writer) without leading to garbage.

Deadlocks
The following four conditions must be present for a deadlock to occur.

Mutual Exclusion
Each resource is either currently assigned to exactly one process or is available.

Hold and Wait
A process currently holding a resource can request new resources.

Non-preemptable resources
Resources previously granted cannot be forcibly taken away from a process.

Circular Wait
There must be a circular chain of processes, each of which is waiting for a resource held by another process in the chain.

If one of these conditions is absent, no deadlock is possible.

Deadlock Modeling
Resource allocation graphs

[Figure from [Ta01 p.165]: resource allocation graphs made of process nodes and resource nodes.]
a) Holding a resource (Process A holds resource R)
b) Requesting a resource (Process B requests resource S)
c) Deadlock situation: Process D requests U which is held by process C. Process C requests T which is held by D.

Deadlock Modeling
[Figures from [Ta01 p.166]: a resource allocation order for the processes A, B, C leading to a deadlock, and an example of resource allocation not resulting in a deadlock (panels up to (q)).]

Deadlocks
Strategies for dealing with deadlocks:

1. Ignore the problem
   Sounds silly, but in fact many operating systems do exactly this, assuming that deadlocks occur rarely.
2. Detection & Recovery
   The OS tries to detect deadlocks and then takes some recovery action.
3. Avoidance
   Resources are granted in such a way that deadlocks cannot occur.
4. Prevention
   Trying to break at least one of the four conditions such that no deadlock can happen.

Deadlocks
Strategy 1 (Ignoring the problem)
Most operating systems, including UNIX and Windows, just ignore the problem on the assumption that most users would prefer an occasional deadlock to a rule restricting all users to one process, one open file, and one of everything. If deadlocks could be eliminated for free, there would not be much discussion. But the price is high. If deadlocks occur on the average once a year, but system crashes owing to hardware failures and software errors occur once a week, nobody would be willing to pay a large penalty in performance or convenience to eliminate deadlocks (after [Ta01 p.167]). For that reason, the deadlock problem is often disregarded.

Deadlocks
Strategy 2 (Detection & Recovery)
The operating system tries to detect deadlocks and to recover.
Example DL.1: Consider the following system state:
- Process A holds R and wants S
- Process B holds nothing and wants T
- Process C holds nothing and wants S
- Process D holds U and wants S and T
- Process E holds T and wants V
- Process F holds W and wants S
- Process G holds V and wants U
Is the system deadlocked, and if so, which processes are involved?

Strategy 2
Continuation of example DL.1 (deadlock detection)
Constructing the resource allocation graph (a):

[Figure from [Ta01 p.169]: (a) resource allocation graph of the system state, (b) the cycle extracted from it, marked as deadlock.]

The extracted cycle (b) shows the processes and resources involved in a deadlock.
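Finding the cycle for example DL.1 can be sketched as a depth-first search over the resource allocation graph. In this sketch an edge from a resource to a process means "is held by" and an edge from a process to a resource means "wants"; the function names and data layout are assumptions made for the illustration.

```python
from typing import Dict, List, Optional

# System state of example DL.1
HOLDS: Dict[str, List[str]] = {"A": ["R"], "D": ["U"], "E": ["T"], "F": ["W"], "G": ["V"]}
WANTS: Dict[str, List[str]] = {
    "A": ["S"], "B": ["T"], "C": ["S"], "D": ["S", "T"],
    "E": ["V"], "F": ["S"], "G": ["U"],
}


def build_graph(holds: Dict[str, List[str]], wants: Dict[str, List[str]]) -> Dict[str, List[str]]:
    graph: Dict[str, List[str]] = {}
    for proc, resources in holds.items():
        for res in resources:
            graph.setdefault(res, []).append(proc)  # resource -> holding process
            graph.setdefault(proc, [])
    for proc, resources in wants.items():
        for res in resources:
            graph.setdefault(proc, []).append(res)  # process -> wanted resource
            graph.setdefault(res, [])
    return graph


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return the nodes of one cycle, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for succ in graph[node]:
            if color[succ] == GREY:            # back edge: cycle found
                return path[path.index(succ):]
            if color[succ] == WHITE:
                found = visit(succ)
                if found:
                    return found
        color[node] = BLACK
        path.pop()
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    cycle = find_cycle(build_graph(HOLDS, WANTS))
    print("cycle:", cycle)
    print("deadlocked processes:", sorted(set(cycle or []) & set(WANTS)))
```

Processes A, C and F are not part of the cycle: they only wait for S, which is not held by any process.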

Strategy 2
Deadlock detection with multiple instances of a resource type
We have (respectively we define):
- n processes: P1 ... Pn
- m resource classes
- Ei = the number of existing resource instances of resource class i, 1 ≤ i ≤ m. E is the existing resource vector, E = (E1 ... Em).
- A is the available resource vector. Each Ai in A gives the number of currently available resource instances. A = (A1 ... Am).
- Relation X ≤ Y is defined to be true if each Xi ≤ Yi.

Strategy 2
Deadlock detection with multiple instances of a resource type
Definition of current allocation matrix and request matrix:

[Figure from [Ta01 p.171]: current allocation matrix C and request matrix R, one row per process P1 ... Pn.]

Strategy 2
Deadlock detection with multiple instances of a resource type
Deadlock detection algorithm:
1. All processes are initially unmarked.
2. Look for an unmarked process Pi for which row Ri ≤ A.
   Here the algorithm is looking for a process that can be run to completion (the resource demands of the process can be satisfied immediately).
3. If such a Pi is found, add row Ci to A and mark Pi. Go to step 2.
   After Pi has (or would have) finished, its resources are given back to the pool. The process is marked (in the sense of successful completion).
4. If no such process exists, terminate. All unmarked processes, if any, are deadlocked!

Strategy 2
Example DL.2 (deadlock detection algorithm): Consider the following system state:

[Figure from [Ta01 p.173]]

Is there (or will there be) a deadlock in the system?

Strategy 2
Continuation of example DL.2 (deadlock detection algorithm)
- Checking P1: R1 is not ≤ A (CD-ROM is missing). P1 cannot run and is not marked.
- Checking P2: R2 is not ≤ A (scanner is missing). P2 cannot run and is not marked.
- Checking P3: R3 is ≤ A, thus P3 can run and is marked. The resources are given back to the pool. A = (2 2 2 0).
- Checking P1: R1 still is not ≤ A (CD-ROM still not available).
- Checking P2: R2 now is ≤ A, thus P2 can run and is marked. The resources are given back to the pool. A = (4 2 2 1).
- Checking P1: R1 now is ≤ A. P1 can run and is marked. The resources are given back to the pool. A = (4 2 3 1) = E.
- No more unmarked processes: termination.
No deadlocks.

Strategy 2
Example DL.3 (deadlock detection algorithm):
Same as DL.2 but now C2 = (2 1 0 1) and thus A = (2 0 0 0).
- Checking P1: R1 is not ≤ A (CD-ROM is missing). P1 cannot run and is not marked.
- Checking P2: R2 is not ≤ A (scanner is missing). P2 cannot run and is not marked.
- Checking P3: R3 is not ≤ A (plotter is missing). P3 cannot run and is not marked.
- All processes checked. Nothing will change: termination.
The entire system is deadlocked!
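The detection algorithm for examples DL.2 and DL.3 can be sketched as follows. Note the assumption: the matrices of the examples were given in a figure that is not reproduced in this text, so the values of C and R below (resource order: tape drives, plotters, scanners, CD-ROMs) are chosen to be consistent with the walk-through above and should be checked against [Ta01 p.173].

```python
from typing import List, Sequence, Tuple

# Assumed system state for DL.2 (consistent with the walk-through, not taken from the text)
C_DL2 = [(0, 0, 1, 0), (2, 0, 0, 1), (0, 1, 2, 0)]   # current allocation matrix
R_DL2 = [(2, 0, 0, 1), (1, 0, 1, 0), (2, 1, 0, 0)]   # request matrix
A_DL2 = (2, 1, 0, 0)                                 # available resource vector

# DL.3: same as DL.2 but C2 = (2 1 0 1) and thus A = (2 0 0 0)
C_DL3 = [(0, 0, 1, 0), (2, 1, 0, 1), (0, 1, 2, 0)]
A_DL3 = (2, 0, 0, 0)


def detect_deadlock(
    available: Sequence[int],
    allocation: Sequence[Sequence[int]],
    request: Sequence[Sequence[int]],
) -> Tuple[List[int], List[int]]:
    """Return (indices of deadlocked processes, final available vector)."""
    pool = list(available)
    marked = [False] * len(allocation)           # step 1: all unmarked
    progress = True
    while progress:
        progress = False
        for i, (c_row, r_row) in enumerate(zip(allocation, request)):
            # step 2: unmarked process with Ri <= A (component-wise)
            if not marked[i] and all(r <= a for r, a in zip(r_row, pool)):
                pool = [a + c for a, c in zip(pool, c_row)]   # step 3: add Ci to A
                marked[i] = True
                progress = True
    # step 4: every process still unmarked is deadlocked
    return [i for i, m in enumerate(marked) if not m], pool


if __name__ == "__main__":
    print("DL.2:", detect_deadlock(A_DL2, C_DL2, R_DL2))
    print("DL.3:", detect_deadlock(A_DL3, C_DL3, R_DL2))
```

Process indices are zero-based in the sketch, so index 0 stands for P1.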

Strategy 2
Detection & Recovery

Resource Preemption
Forcibly taking away a resource from a process. May have ill side effects. Difficult or even impossible in many cases.

Process Rollback
A process periodically writes its complete state to a file (checkpointing). In case of a deadlock, the process is rolled back to an earlier state in which it occupied fewer resources. Program(ming) overhead!

Killing Processes
Crudest but simplest method. One or more processes from the chain are terminated and must be started all over again at some later point in time. May also cause ill effects: consider a process updating a database twice instead of once.

Deadlocks
Strategy 3 (Avoidance)
Do not allow system states that may result in a deadlock.

A state is said to be safe when it is not deadlocked and there exists some scheduling order in which every process can run to completion even if all of them request their maximum number of resources. An unsafe state may result in a deadlock, but does not have to.

[Figure: table listing, per process, the number of resource instances currently held (allocation) and the maximum number of resource instances needed (requests).]
Assume there is a total number of 10 instances available. Then the state is a safe state since there is a way to run all processes.

Strategy 3
Deadlocks

Strategy 3
Deadlocks

[Figure from [Ta01 p.177], panels (a) to (e)]
a) Starting situation (question: is this a safe state?). There are 3 resources left in the pool.
b) B is granted 2 additional resources.
c) B has finished. Now 5 resources are free.
d) C is granted another 5 resources.
e) C has finished. Now 7 resources are free. Process A can be run without problems.
Thus (a) is a safe state.

[Figure from [Ta01 p.177], panels (a) to (d)]
a) Starting situation as before (this is a safe state).
b) A is granted one additional resource.
c) B is granted the remaining 2 resources.
d) B has finished. A and C cannot run because each of them needs 5 resources to complete. Deadlock.
Any other sequence starting from (b) also ends up in a deadlock. Therefore state (b) is an unsafe state. The move from (a) to (b) brought the system from a safe state to an unsafe state.

Strategy 3
Banker's Algorithm (Dijkstra 1965)

Think of a small-town banker who deals with a group of customers to whom he has granted lines of credit. Knowing that not all customers need their credit line immediately, the banker has reserved 10 money units instead of 22 to service them.

Initial state: There are four customers (processes) demanding a total of 22 money units (resources). The banker (operating system) has provided 10 money units in total.

If granting a request leads to an unsafe state, the request is denied. If a request leads to a safe state, the request is granted.

Strategy 3
Continuation Banker's Algorithm

The banker's algorithm considers each request as it occurs. A request is granted when the state remains safe, otherwise the request is postponed until later.

[Figure from [Ta01 p.178], panels (a) to (c)]
a) Initial state (safe)
b) Safe state: C's maximum request can be satisfied. When C has paid back the 4 money units, B's request (or D's) can be satisfied. ...
c) Unsafe state: If any of the customers requests the maximum, the banker would be stuck (deadlock).

Strategy 3
Banker's Algorithm for multiple resource instances

1. Look for a row Ri whose unmet requirements are smaller than or equal to A. If no such row exists, the system will deadlock sooner or later since no process can run to completion.
2. Assume the process of the row chosen requests its maximum resources (which is guaranteed to be possible) and finishes. Mark the process as terminated and add its resources to the pool A.
3. Repeat steps 1 and 2 until either all processes are marked (in which case the initial state was safe), or until a deadlock occurs (in which case the initial state was unsafe).

Strategy 3
Banker's Algorithm for multiple resource instances

[Figure from [Ta01 p.179]: existing, possessed (allocated) and available resource vectors, current allocation matrix C, request matrix R.]

Strategy 3
Banker's Algorithm for multiple resource instances
The pool is A = (1 0 2 0).
- Process D can be scheduled next because (0 0 1 0) ≤ (1 0 2 0). When finished, the pool is A = (1 0 1 0) + (1 1 1 1) = (2 1 2 1).
- Process A can be scheduled because (1 1 0 0) ≤ (2 1 2 1). When finished, the pool is A = (1 0 2 1) + (4 1 1 1) = (5 1 3 2).
- Process B can be scheduled because (0 1 1 2) ≤ (5 1 3 2). When finished, the pool is A = (5 0 2 0) + (0 2 1 2) = (5 2 3 2).
- Process C can be scheduled because (3 1 0 0) ≤ (5 2 3 2). When finished, the pool is A = (2 1 3 2) + (4 2 1 0) = (6 3 4 2).
- Process E can be scheduled because (2 1 1 0) ≤ (6 3 4 2). When finished, the pool is A = (4 2 3 2) + (2 1 1 0) = (6 3 4 2).

Strategy 3
Banker's Algorithm for multiple resource instances
No more processes. All processes have successfully completed.

The state shown is a safe state since we have found at least one way to complete all processes. Other sequences are possible.
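The safety check can be sketched in Python. The matrices below are derived from the walk-through: each "still needed" row is the vector compared against the pool, and each "held" row is the difference between a process's total holdings when it finishes and that request (for D: (1 1 1 1) - (0 0 1 0) = (1 1 0 1)). The function name and the data layout are assumptions made for this sketch.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Tuple[int, int, int, int]

# Derived from the walk-through: resources currently held and still needed
HELD: Dict[str, Vector] = {
    "A": (3, 0, 1, 1), "B": (0, 1, 0, 0), "C": (1, 1, 1, 0),
    "D": (1, 1, 0, 1), "E": (0, 0, 0, 0),
}
NEED: Dict[str, Vector] = {
    "A": (1, 1, 0, 0), "B": (0, 1, 1, 2), "C": (3, 1, 0, 0),
    "D": (0, 0, 1, 0), "E": (2, 1, 1, 0),
}
AVAILABLE: Vector = (1, 0, 2, 0)


def safe_sequence(
    available: Sequence[int],
    held: Dict[str, Sequence[int]],
    need: Dict[str, Sequence[int]],
) -> Optional[List[str]]:
    """Return one order in which all processes can finish, or None if the state is unsafe."""
    pool = list(available)
    pending = list(held)
    order: List[str] = []
    while pending:
        # step 1: look for a process whose unmet requirements fit into the pool
        runnable = next(
            (p for p in pending if all(n <= a for n, a in zip(need[p], pool))),
            None,
        )
        if runnable is None:
            return None                      # no process can run to completion
        # step 2: the process finishes and returns everything it held
        pool = [a + h for a, h in zip(pool, held[runnable])]
        pending.remove(runnable)
        order.append(runnable)
    return order


if __name__ == "__main__":
    print(safe_sequence(AVAILABLE, HELD, NEED))
```

To decide on a new request, one would tentatively grant it (move the requested instances from the pool to the process) and run the same check on the resulting state.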

In practice the banker's algorithm is of minor use, because
- processes rarely know in advance the maximum number of resources needed,
- the number of processes is not constant over time as users log in and out (or other events require computational attention).

Deadlocks
Strategy 4 (Deadlock Prevention)
Break (at least) one of the four conditions for a deadlock.

Avoiding mutual exclusion
Sometimes possible. Instead of using a printer exclusively, the processes write into a print spooler directory. This way several processes can use the printer at the same time. However, an internal system table (e.g. process table) cannot be spooled. The same applies to a CD writer.

Breaking the hold and wait condition
Processes request all their resources at once (either all or none). However, not all processes know their demand from the beginning. Moreover, the resources are not optimally used then (degradation in multi-programming). Variation: each time an additional resource is needed, the process releases all its resources first and then tries to acquire all of them at once. This way a process does not occupy resources while waiting for a new one.

Strategy 4
Attacking the no preemption condition
Forcibly removing a resource from a process is barely possible.

Breaking circular wait
Provide a global numbering of all resources (ranking). Resource requests must be made in ascending order. This way a resource allocation graph can have no cycles. In the figure, B cannot request the scanner even if it were available.

1. Imagesetter
2. Scanner
3. Plotter
4. Tape drive
5. CD-ROM drive

[Figure: processes A and B with the resources scanner and plotter.]

However, not all resources allow for a reasonable order. How to order table slots, disk spooling space, locked database records?
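The ascending-order rule can be sketched as a check performed on every request. The ranking is the one listed above; the class and exception names are hypothetical and only serve the illustration.

```python
from typing import List

# Global ranking of the resources, as in the numbered list
RANK = {
    "imagesetter": 1,
    "scanner": 2,
    "plotter": 3,
    "tape drive": 4,
    "cd-rom drive": 5,
}


class OrderViolation(Exception):
    """Raised when a request does not respect the ascending order (hypothetical)."""


class Process:
    def __init__(self, name: str) -> None:
        self.name = name
        self.held: List[str] = []

    def request(self, resource: str) -> None:
        highest = max((RANK[r] for r in self.held), default=0)
        if RANK[resource] <= highest:
            # Requests must be made in ascending rank order.
            raise OrderViolation(
                f"{self.name} holds rank {highest} and may not request "
                f"{resource} (rank {RANK[resource]})"
            )
        self.held.append(resource)


if __name__ == "__main__":
    a, b = Process("A"), Process("B")
    a.request("scanner")
    a.request("plotter")          # allowed: 2 < 3
    b.request("plotter")
    try:
        b.request("scanner")      # refused: 2 is not greater than 3
    except OrderViolation as err:
        print(err)
```

Because every process climbs the ranking in the same direction, no process can wait for a resource ranked below one it already holds, so a circular chain cannot form.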


Computer Architecture

Memory Management
Memory (353)

Memory Management

Paging (381) Segmentation (400) Paged Segments (412) Virtual Memory (419) Caches (471)

Slide 351

Slide 352

Memory
Core Memory
Period: 1950 ... 1975
Non-volatile
Matrix of magnetic cores
Storing a bit by changing the magnetic polarity of a core
Access time 3 µs ... 300 ns
Destructive read
After reading a core, the content is lost. A read cycle must be followed by a write cycle in order to restore it.
Image source: http://www.psych.usyd.edu.au/pdp-11/core.html
Slide 353

Memory
Semiconductor Memory (1970 ...)
Dynamic memory (DRAM)
Storing a bit by charging a capacitor
(sometimes just the self-capacitance of a transistor)
One transistor per bit
High density / capacity per area unit
Volatile
Destructive read
Self-discharging
Periodic refresh needed
Image source: http://www.research.ibm.com/journal/rd/391/adler.html
Slide 354

Memory
Semiconductor Memory (1970 ...)
Static memory (SRAM)
Storing a bit in a flip-flop
Setting / Resetting the flip-flop
6 transistors per bit
More chip area than with DRAM
Volatile
Non-destructive read
No self-discharge
Fast!
Image source: Wikipedia on SRAM (English)
Slide 355

Memory Hierarchy
Program(mer)s want unlimited amounts of fast memory. Economical solution: Memory hierarchy.
Memory hierarchy levels in typical desktop / server computers, figure from [HP06 p.288]
Slide 356

Main Memory
Central to computer system
Large array of words / bytes
Many programs at a time
for multi-programming / tasking to be effective

[Figure: Working memory. Memory layout of a time sharing system: operating system, program 1, program 2, ..., program n]
Slide 357

Address Binding
Program = binary executable file
Code/data accessible via addresses

... i = i + 1; check(i); ...

Addresses in the source code are symbolic, here: i (a variable) and check (a function). The compiler typically binds the symbolic addresses to relocatable addresses, such as i is 14 bytes from the beginning of the module. The compiler may also be instructed to produce absolute addresses (non-relocatable code).

The loader finally binds the relocatable addresses to absolute addresses, such as i is at 74014 when loading the code into memory.
Slide 358

Address Binding Schemes
The binding of code and data to logical memory addresses can be done at three stages:

Compile time (Program creation)
The resulting code is absolute code. All addresses are absolute. The program must be loaded exactly to a particular logical address in memory.

Load time
The code must be relocatable, that is, all addresses are given as an offset from some starting address (relative addresses). The loader calculates and fills in the resulting absolute addresses at load time (before execution starts).

Execution time
The relocatable code is executed. Address translation from relative to absolute addresses takes place at execution time (for every single memory access). Special hardware needed (MMU).
Slide 359

Logical / Physical Addresses

Logical Address
The address generated by the CPU, also termed virtual address. All logical addresses form the logical (virtual) address space.

Physical Address
The address seen by the memory. All physical addresses form the physical address space. In compile-time and load-time address-binding schemes the logical and the physical addresses are the same. In execution-time address-binding the logical and physical addresses differ.
Slide 360

Memory Management Unit
Hardware device that maps logical addresses to physical addresses (MMU).
A program (a process) deals with logical addresses; it never sees the real physical addresses.
Figure from [Sil00 p.258]
Slide 361

Protection

Protecting the kernel against user processes
No user process may read, modify or even destroy kernel data (or kernel code). Access to kernel data (system tables) only through system calls.

Protecting user processes from one another
No user process may read or modify other processes' data or code. Any data exchange between processes only via IPC.

MMU equipped with limit register
Loaded with the highest allowed logical address
This is done by the dispatcher as part of the context switch.

Any address beyond the limit causes an error

Assumption: contiguous physical memory per process
Slide 362
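The limit check performed by the MMU can be sketched as follows. The limit value is the highest allowed logical address, as stated on the slide. The base value (a relocation offset for a contiguous process image) and the names translate_with_limit and ProtectionError are assumptions of this sketch.

```python
class ProtectionError(Exception):
    """Raised when a logical address lies beyond the limit register."""


def translate_with_limit(logical, base, limit):
    """Map a logical address to a physical one, or trap.

    limit: highest allowed logical address (loaded by the dispatcher).
    base:  assumed start of the process image in physical memory.
    """
    if logical < 0 or logical > limit:
        raise ProtectionError(f"address {logical} is beyond limit {limit}")
    return base + logical
```

With an assumed base of 74000, the relocatable address 14 of variable i from the address binding example ends up at physical address 74014.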

Protection
Limit register for protecting process spaces against each other
Figure from [Sil00 p.266]
Slide 363

Memory Occupation
Obtaining better memory-space utilization
Initially the entire program plus its data (variables) needed to be in memory

Dynamic Loading
Load what is needed when it is needed.

Overlays
Replace code by other code.

Dynamic Linking (Shared Libraries)
Use shared code rather than back-pack everything.

Swapping
Temporarily kick out a process from memory.
Slide 364

Dynamic Loading

Routines are kept on disk
Main program is loaded into memory.

Routine loaded when needed
Upon each call it is checked whether the routine is in memory. If not, the routine is loaded into memory.

Unused routines are never loaded
Although the total program size may be large, the portion that is actually executed can be much smaller.

No special OS support required
Dynamic loading is implemented by the user. System libraries (and corresponding system calls) may help the programmer.
Slide 365

Overlays

Existing code is replaced by new code
Similar to dynamic loading, but instead of adding new routines to the memory, existing code is replaced by the loaded code.

No special OS support required
Overlay technique implemented by the user.

Example: Consider a two-pass assembler
Pass 1            70 kB
Pass 2            80 kB
Symbol table      20 kB
Common routines   30 kB
Loading everything at once would require 200 kB.

Pass 1 and pass 2 do not need to be in memory at the same time: Overlay
Slide 366

Overlays
Pass 1, when finished, is overlaid by pass 2. An additional overlay driver is needed (10 kB), but the total memory requirement now is 140 kB instead of 200 kB.
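The memory figures of the two-pass assembler example can be checked with a short calculation. The sizes are those given on the slides; the variable names are chosen for this sketch only.

```python
# Component sizes in kB, as given in the two-pass assembler example.
sizes_kb = {"pass 1": 70, "pass 2": 80, "symbol table": 20, "common routines": 30}
overlay_driver_kb = 10

# Without overlays everything is resident at once.
everything = sum(sizes_kb.values())

# With overlays the symbol table, the common routines and the overlay driver
# stay resident, and only the larger of the two passes needs room.
resident = sizes_kb["symbol table"] + sizes_kb["common routines"] + overlay_driver_kb
with_overlays = resident + max(sizes_kb["pass 1"], sizes_kb["pass 2"])

print(everything, with_overlays)  # 200 140
```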

Figure from [Sil00 p.262]
Slide 367

Dynamic Linking

Different processes use same code
This is especially true for shared system libraries (e.g. reading from keyboard, graphical output on screen, networking, printing, disk access).

Single copy of shared code in memory
Rather than linking the libraries statically to each program (which increases the size of each binary executable), the libraries (or individual routines) are linked dynamically during execution time. Each library only resides once in physical memory.

Stub
is a piece of program code initially located at the library references in the program. When first called it loads the library (if not yet loaded) and replaces itself with the address of the library routine.

OS support required
since a user process cannot look beyond its own address space to find out whether (and where) the library code is located in physical memory (protection!).
Slide 368

Swapping
A process can be swapped temporarily out of memory to a backing store, and then brought back into memory for continued execution.
Backing store: fast disk large enough to accommodate copies of all memory images for all users; must provide direct access to these memory images.
Roll out, roll in: swapping variant used for priority-based scheduling algorithms; lower-priority process is swapped out so higher-priority process can be loaded and executed.
Major part of swap time is transfer time; total transfer time is directly proportional to the amount of memory swapped.
Slide 369

Swapping
Figure: Process P1 is swapped out, and process P2 is swapped in.
Figure from [Sil00 p.263]
Slide 370

Memory Allocation
Allocation of physical memory to a process

Contiguous
The physical memory space is contiguous (linear) for each process.
Fixed-sized partitions
Variable-sized partitions
Placement schemes: first fit, best fit, worst fit

Non-Contiguous
The physical memory space per process is fragmented (has holes).
Paging
Segmentation
Combination of Paging and Segmentation
Slide 371

Contiguous Memory Allocation
The physical memory allocated to a process is contiguous (no holes).

Fixed-sized partitions
Memory is divided into fixed-sized partitions. Originally used by IBM OS/360, no longer in use today.
Simple to implement
Degree of multiprogramming is bound by the number of partitions
Internal fragmentation

[Figure: memory layout with operating system, process 1, free partition, process 2, process 3, process 4]
Slide 372

Contiguous Memory Allocation
The physical memory allocated to a process is contiguous (no holes).

Variable-sized partitions
Partitions are of variable size.
OS must keep a free list
listing free memory (holes)
OS must provide placement scheme
Degree of multiprogramming only limited by available memory
No (or very little) internal fragmentation
External fragmentation
The holes may be too small for a new process

[Figure: memory layout with operating system, process 1, process 2, process 3, process 4]
Slide 373

Compaction
Reducing external fragmentation (for variable-sized partitions)
Copy operation is expensive

[Figure: memory layouts with operating system, process 1 to process 4, and free memory]
Slide 374

Placement Schemes
Satisfying a request of size n from a list of free holes.
General to the following schemes: find a large enough hole, allocate the portion needed, and return the remainder (leftover hole) to the free list.

First fit
Find the first hole that is large enough. Fastest method.

Best fit
Find the smallest hole that is large enough. The entire list must be searched (unless it is sorted by hole size). This strategy produces the smallest leftover hole.

Worst fit
Find the largest hole. Search entire list (unless sorted). This strategy produces the largest leftover hole, which may be more useful than the smallest leftover hole from the best-fit approach.
Slide 375

First Fit
Example: we need a certain amount of memory (shown in the figure). Search starts at the bottom.
The first hole encountered is large enough.
[Figure: memory layout with operating system, process 1 to process 4, and the leftover hole]
Slide 376

Best Fit
Example: we need a certain amount of memory (shown in the figure). Search starts at the bottom.
We have to search all holes. The top hole fits best. This scheme creates the smallest leftover hole among the three schemes.
[Figure: memory layout with operating system, process 1 to process 4, and the leftover hole]
Slide 377

Worst Fit
Example: we need a certain amount of memory (shown in the figure). Search starts at the bottom.
We have to search all holes. The bottom hole is found to be the largest. This scheme creates the largest leftover hole among the three schemes.
[Figure: memory layout with operating system, process 1 to process 4, and the leftover hole]
Slide 378
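The three placement schemes (first fit, best fit, worst fit) can be sketched over a simple free list. The representation of a hole as a (start, size) tuple and the function names choose_hole and allocate are assumptions of this sketch, not part of the lecture material.

```python
def choose_hole(holes, n, scheme):
    """Return the index of the hole chosen for a request of size n, or None.

    holes:  list of (start, size) tuples in search order (assumed layout).
    scheme: "first", "best" or "worst".
    """
    fits = [(i, size) for i, (_, size) in enumerate(holes) if size >= n]
    if not fits:
        return None
    if scheme == "first":
        return fits[0][0]                        # first hole that is large enough
    if scheme == "best":
        return min(fits, key=lambda t: t[1])[0]  # smallest hole that fits
    if scheme == "worst":
        return max(fits, key=lambda t: t[1])[0]  # largest hole
    raise ValueError(f"unknown scheme: {scheme}")


def allocate(holes, n, scheme):
    """Allocate n units and return the start address, or None if nothing fits.

    The remainder of the chosen hole stays in the free list as leftover hole.
    """
    i = choose_hole(holes, n, scheme)
    if i is None:
        return None
    start, size = holes[i]
    if size == n:
        del holes[i]
    else:
        holes[i] = (start + n, size - n)
    return start
```

For the holes (100, 50), (300, 20), (500, 80) and a request of size 15, first fit picks the hole at 100, best fit the hole at 300 (leftover 5), and worst fit the hole at 500 (leftover 65).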

Memory Allocation
Allocation of physical memory to a process

Contiguous
The physical memory space is contiguous (linear) for each process.
Fixed-sized partitions
Variable-sized partitions
Placement schemes: first fit, best fit, worst fit

Non-Contiguous
The physical memory space of a process is fragmented (has holes).
Paging
Segmentation
Combination of Paging and Segmentation
Slide 379

Memory Management
Memory (353)
Paging (381)
Segmentation (400)
Paged Segments (412)
Virtual Memory (419)
Caches (471)
Slide 380

Paging
Physical address space of a process can be non-contiguous
Physical memory divided into fixed-sized frames
Frame size is power of 2, between 512 bytes and 8192 bytes
Logical memory divided into pages
Page size is identical to frame size.
OS keeps track of all free frames (free-frame list)
Running a program of size n pages requires finding n free frames
Page table translates logical to physical addresses.
Internal fragmentation, no external fragmentation.
Slide 381

Address Translation
Address generated by CPU is divided into:
Page number p: used as an index into a page table which contains the base address f of the corresponding frame in physical memory.
Page offset d: the offset from the frame start, physical memory address = f + d.

logical address: | page number p (m - n bits) | page offset d (n bits) |

Logical address is m bits wide. Page size = frame size = 2^n.
Slide 382

Paging
Physical address = f + d
f = PageTable[p]
p = the m - n most significant bits of the logical address
d = the n least significant bits
Figure from [Sil00 p.270]
Slide 383

Paging
Paging model: logical address space is contiguous, whereas the corresponding physical address space is not.
Figure from [Sil00 p.271]
Slide 384
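The translation rule (p from the upper bits, d from the lower n bits, f = PageTable[p], physical address = f + d) can be sketched as below. The function names are assumptions of this sketch; the table values 20, 24, 4, 8 are the frame addresses of the small worked example in these slides (n = 2, m = 4).

```python
def split_logical(address, n):
    """Split a logical address into (page number p, page offset d).

    n is the number of offset bits, i.e. the page size is 2**n.
    """
    return address >> n, address & ((1 << n) - 1)


def paging_translate(address, page_table, n):
    """Return f + d, where page_table[p] holds the frame base address f."""
    p, d = split_logical(address, n)
    return page_table[p] + d


# Frame addresses for pages 0..3 in the worked example (page size 4 byte).
example_table = [20, 24, 4, 8]
```

For logical address 10 this gives p = 2 and d = 2, so paging_translate(10, example_table, 2) yields 4 + 2 = 6.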

Paging
What is the physical address of k?
n = 2 (page size is 4 byte)
m = 4 (logical address space is 16 byte)
k is located at logical address 10 (decimal)

10 (decimal) = 1010 (binary), so p = 10 (binary) = 2 and d = 10 (binary) = 2.

PageTable (page number: frame address)
0: 20
1: 24
2: 4
3: 8

f = PageTable[2] = 4
Physical address = f + d = 4 + 2 = 6
Figure from [Sil00 p.272]
Slide 385

Free-Frame List
The OS must maintain a table of free frames (free-frame list)
[Figure: free frames before and after allocation. Before: free-frame list 14, 13, 18, 20, 15; the new process has pages 0 to 3. After: free-frame list 15; page table of new process: page 0 in frame 14, page 1 in frame 13, page 2 in frame 18, page 3 in frame 20.]
Slide 386

Page-Table
Where to locate the page table?

Dedicated registers within CPU
Only suitable for small memory. Used e.g. in PDP-11 (8 page registers, each page 8 kB, 64 kB main memory total). Fast access (high speed registers).

Table in main memory
A dedicated CPU register, the page-table base register (PTBR), points to the table in memory (the table currently in use). With each context switch the PTBR is reloaded (then pointing to another page table in memory). The actual size of the page table is given by a second register, the page table length register (PTLR).

With the latter scheme we need two memory accesses, one for the page table, and one for accessing the memory location itself. Slowdown! Solution: Special hardware cache: translation look-aside buffer (TLB)
Slide 387

Translation Look-Aside Buffer
A translation look-aside buffer (TLB) is a small fast-lookup associative memory.

key (page number)   value (frame address or frame number)
5                   12
0                   14
1                   13
4                   4
2                   18
6                   15
9                   17
3                   20

[Figure: a page number is presented as key; the output shown is 18]

The associative registers contain page frame entries (key | value). When a page number is presented to the TLB, all keys are checked simultaneously. If the desired page number is not in the TLB, it must be fetched from memory.
Slide 388

Translation Look-Aside Buffer
Paging hardware with TLB. Figure from [Sil00 p.276]
Slide 389

Memory Access Time
Assume: Memory access time = 100 ns. TLB access time = 20 ns
When page number is in TLB (hit): total access time = 20 ns + 100 ns = 120 ns
When page number is not in TLB (miss): total access time = 20 ns + 100 ns + 100 ns = 220 ns
With 80% hit ratio: average access time = 0.8 × 120 ns + 0.2 × 220 ns = 140 ns
With 98% hit ratio: average access time = 0.98 × 120 ns + 0.02 × 220 ns = 122 ns
Slide 390
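The access-time calculation can be written as a small function. The default values are the ones assumed on the slide (100 ns memory access, 20 ns TLB access). The parameter table_accesses is an assumption of this sketch; it counts the memory accesses needed to read the page table on a TLB miss, so that the same function also covers page tables that need more than one access.

```python
def effective_access_time(hit_ratio, mem_ns=100.0, tlb_ns=20.0, table_accesses=1):
    """Average memory access time in ns for a given TLB hit ratio."""
    hit = tlb_ns + mem_ns                             # TLB lookup + memory access
    miss = tlb_ns + table_accesses * mem_ns + mem_ns  # plus page table access(es)
    return hit_ratio * hit + (1.0 - hit_ratio) * miss
```

With the defaults, a hit ratio of 0.8 gives about 140 ns and a hit ratio of 0.98 about 122 ns, matching the figures above up to floating-point rounding.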

Protection
With paging the processes' memory spaces are automatically protected against each other since each process is assigned its own set of frames. If a process tries to access a page that is not in its page table (or is marked invalid, see next slide), the process is trapped by the OS.

Page table (page number: frame address)
0: 20
1: 24
2: 4
3: 8

Valid physical addresses:
20 ... 23
24 ... 27
04 ... 07
08 ... 11
Figure from [Sil00 p.272]
Slide 391

Frame Attributes
Each frame may be characterized by additional bits in the page table.

Valid / invalid
Whether the frame is currently allocated to the process

Read-Only
Frame is read-only

Execute-Only
Frame contains code

Shared
Frame is accessible to other processes as well.
Figure from [Sil00 p.277]
Slide 392

Shared Pages
Implementation of shared memory through paging is rather easy.
A shared page is a page whose frame is allocated to other processes as well. Many processes share a page in that each of the shared pages is mapped to the same frame in physical memory. Shared code must be non-self-modifying code (reentrant code).
Figure on the next slide: Three processes are using an editor. The editor needs 3 pages for its code. Rather than loading the code three times into memory, the code is shared. It is loaded only once into memory, but is visible to each process as if it is their private code. The data (the text edited), of course, is private to each process. Each process thus has its own data frame.
Slide 393

Shared Pages
Pages 0,1,2 of each process are mapped to physical frames 3,4,6.
Note: Free memory is shown in gray, occupied memory is in white.
Figure from [Sil00 p.283]
Slide 394

Paging
Logical address space of modern CPUs: 2^32 ... 2^64
Assume: 32-bit CPU, frame size = 4K
2^32 / 2^12 = 2^20 page table entries (per process)
Each entry size = 20 bit + 20 bit = 5 byte
20 bit for page number. 20 bit for frame number (less than requiring 32 bit for the frame address).

page table entry: | page number (20 bit) | frame number (20 bit) |

2^20 × 5 byte = 5 MB per page table!
Slide 395

Two-Level Paging
Often, a process will not use all of its logical address space. Rather than allocating the page table contiguously in main memory (for the worst case), the page table is divided into small pieces and is paged itself.

outer page table: output points to a frame containing page table entries (inner page table entries)
inner page table: output points to final destination frame
Slide 396
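The page-table size estimate can be reproduced with a short calculation. The defaults are the slide's assumptions (32-bit addresses, 4K frames, 5 byte per entry); the function name page_table_bytes is chosen for this sketch.

```python
def page_table_bytes(address_bits=32, offset_bits=12, entry_bytes=5):
    """Return (number of entries, table size in bytes) for a flat page table."""
    entries = 2 ** (address_bits - offset_bits)  # one entry per page
    return entries, entries * entry_bytes


entries, size = page_table_bytes()
print(entries, size // 2**20)  # 1048576 5
```

The result is 2^20 entries and 5 MB (counting 1 MB as 2^20 byte) per process, which motivates paging the page table itself.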

Two-Level Paging

logical address: | p1 (10 bit) | p2 (10 bit) | d (12 bit) |
p1 and p2 together form the page number, d is the page offset.
Numbers are for the 32-bit, 4 kB frame, example

Outer page table: max 2^10 entries
Each page of inner table has 2^10 entries
Final destination frame in memory
Figure from [Sil00 p.279]
Slide 397

Multi-Level Paging

Tree-Structure principle
Each outer page entry defines a root node of a tree.

Two / three / four level paging
SPARC (32 bit): three-level paging. Motorola 68030 (32 bit): four-level paging.

Better memory utilization
than using a contiguous (and possibly maximum-sized) page table.

Increase in access time
since we hop several times until final memory location is reached. Caching (TLB) however helps out a lot. Four-level paging with 98% hit rate:
Effective access time = 0.98 × 120 ns + 0.02 × 520 ns = 128 ns
Slide 398
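A two-level lookup for the 32-bit example (10 bit + 10 bit + 12 bit) can be sketched as below. Modelling the outer and inner tables as Python dictionaries, and the names split_two_level and walk, are assumptions of this sketch; real hardware indexes arrays held in frames.

```python
def split_two_level(address):
    """Split a 32-bit logical address into (p1, p2, d) with 10/10/12 bits."""
    d = address & 0xFFF            # low 12 bits: page offset
    p2 = (address >> 12) & 0x3FF   # next 10 bits: index into inner table
    p1 = (address >> 22) & 0x3FF   # top 10 bits: index into outer table
    return p1, p2, d


def walk(address, outer):
    """Resolve an address through the outer and inner page table.

    outer: dict mapping p1 to an inner table;
    inner table: dict mapping p2 to a frame base address.
    A missing key (KeyError) stands for an unmapped part of the address space.
    """
    p1, p2, d = split_two_level(address)
    inner = outer[p1]
    return inner[p2] + d
```

Only the inner tables that are actually used need to exist, which is where the better memory utilization comes from.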

Computer Architecture
Memory (353)
Paging (381)
Segmentation (400)
Paged Segments (412)
Virtual Memory (419)
Caches (471)
Slide 399

Segmentation
User views of logical memory:

Linear array of bytes
Reflected by the Paging memory scheme

A collection of variable-sized entities
User thinks in terms of subroutines, stack, symbol table, main program which are somehow located somewhere in memory.

Segmentation supports this user view. The logical address space is a collection of segments.
Figure from [Sil00 p.285]
Slide 400

Segmentation
Segmentation model: The user space (logical address space) consists of a collection of segments which are mapped through the segmentation architecture onto the physical memory.
[Figure: segments 1 to 4 in user space mapped onto physical memory]
Slide 401

Segmentation
Physical address space of a process can be non-contiguous as with paging

Logical address consists of a tuple
<segment number, offset>

Segment table maps logical address onto physical address
base: physical address of segment
limit: length of segment

Segment table can hold additional segment attributes
Like with frame attributes (see paging).

Shared Segments
Shared segments are mapped to the same segment in physical memory.
Slide 402

Segmentation
s selects the entry from the table. Offset d is checked against the maximum size of the segment (limit). Final physical address = base + d.
Figure from [Sil00 p.286]
Slide 403

Segmentation

Segments are variable-sized
Dynamic memory allocation required (first fit, best fit, worst fit).

External fragmentation
In the worst case the largest hole may not be large enough to fit in a new segment. Note that paging has no external fragmentation problem.

Each process has its own segment table
like with paging where each process has its own page table. The size of the segment table is determined by the number of segments, whereas the size of the page table depends on the total amount of memory occupied.

Segment table located in main memory
as is the page table with paging

Segment table base register (STBR)
points to current segment table in memory

Segment table length register (STLR)
indicates number of segments
Slide 404

Segmentation
Example:
A program is being assembled. The compiler determines the sizes of the individual components (segments) as follows:

Segment           Size
main program      400 byte
symbol table      1000 byte
function sqrt()   400 byte
subroutine        1000 byte
stack             1100 byte
Slide 405

Segmentation
Example (continued):
The process is assigned 5 segments in memory as well as a segment table.
Figure from [Sil00 p.287]
Slide 406
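Address translation with a segment table can be sketched as follows. The limits are the segment sizes from the example; the base addresses are hypothetical values invented for this sketch (they are not taken from the figure), and the names seg_translate and SegmentationError are assumptions as well.

```python
class SegmentationError(Exception):
    """Raised when the segment number or the offset is out of range."""


# (base, limit) per segment. Limits are the sizes from the example;
# the base addresses are hypothetical and chosen not to overlap.
segment_table = [
    (1000, 400),   # 0: main program
    (2000, 1000),  # 1: symbol table
    (3200, 400),   # 2: function sqrt()
    (4000, 1000),  # 3: subroutine
    (5200, 1100),  # 4: stack
]


def seg_translate(s, d, table):
    """Translate the logical address <s, d> into base + d."""
    if not 0 <= s < len(table):   # number of segments, cf. the STLR
        raise SegmentationError(f"no segment {s}")
    base, limit = table[s]
    if not 0 <= d < limit:        # offset checked against the limit
        raise SegmentationError(f"offset {d} outside segment {s}")
    return base + d
```

With these assumed bases, the address <1, 53> maps to 2053, while <0, 400> is rejected because the main program segment is only 400 byte long.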

Shared Segments
Process P1 and P2 share the editor code. Segment 0 of each process is mapped onto the same physical segment at address 43062.

The data segments are private to each process, so segment 1 of each process is mapped to its own segment in physical memory.
Figure from [Sil00 p.288]
Slide 407

Paging versus Segmentation
With paging physical memory is divided into fixed-size frames. When memory space is needed, as many free frames are occupied as necessary. These frames can be located anywhere in memory; the user process always sees a logically contiguous address space.

With segmentation the memory is not systematically divided. When a program needs k segments (usually these have different sizes), the OS tries to place these segments in the available memory holes. The segments can be scattered around memory. The user process does not see a contiguous address space, but sees a collection of segments (of course each individual segment is contiguous as is each page or frame).
Slide 408

Paging versus Segmentation
Paging is based on fixed-size units of memory (frames)
Segmentation is based on variable-size units of memory (segments)
[Figure: frames 13 to 20 under paging, with unused memory (internal fragmentation); segments seg1 to seg4 under segmentation, with free memory that can be allocated]
Slide 409

Paging versus Segmentation

Paging
Each process is assigned its page table.
Page table size proportional to allocated memory
Often large page tables and/or multi-level paging
Internal fragmentation
Free memory is quickly allocated to a process
Motorola 68000 line is based on a flat address space

Segmentation
Each process is assigned a segment table
Segment table size proportional to number of segments
Usually small segment tables
External fragmentation.
Lengthy search times when allocating memory to a process.
Intel 80X86 family is based on segmentation
Slide 410

Memory Management
Memory (353)
Paging (381)
Segmentation (400)
Paged Segments (412)
Virtual Memory (419)
Caches (471)
Slide 411

Paged Segments
Combining segmentation with paging yields paged segments
With segmentation, each segment is a contiguous space in physical memory.
With paged segments, each segment is sliced into pages. The pages can be scattered in memory.
[Figure: seg1 to seg4 placed in frames 13 to 20, once with segmentation and once with paged segments]
Slide 412

Paged Segments
Each segment has its own page table
[Figure: logical process space with seg1 to seg4, one page table of frame numbers per segment, and physical memory with frames 13 to 20, including unused memory (internal fragmentation)]
Slide 413

Paged Segments
The MULTICS (predecessor of UNIX) operating system solved the problems of external fragmentation and lengthy search times by paging the segments. This solution differs from pure segmentation in that each segment table entry does not contain the base address of the segment, but rather contains the base address of a page table for this segment.

In contrast to pure paging where each process is assigned a page table, here each segment is assigned a page table. The processes still see just segments, not knowing that the segments themselves are paged. With paged segments no more time is spent on optimal segment placement; however, some internal fragmentation is introduced.
Slide 414

Paged Segments
Explanation of next slide (principle of paged segments)

The logical address is a tuple <segment number s, offset d>. The segment number is added to the STBR (segment table base register) and thus points to a segment table entry. The segment table is located in main memory. From the entry the page table base is derived which points to the beginning of the corresponding page table in memory. The first part p of the offset d determines the entry in the page table. The output of the page table is the frame address f (or alternatively a frame number). Finally, f plus the remaining part d' of the offset is the physical memory address.

Steps in resolving the final physical address:
PageTable = SegmentTable[s].base
f = PageTable[p]
final address = f + d' (d' = the part of the offset d that remains after p)
Slide 415

Paged Segments
Principle of paged segments
Slide 416
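The resolution steps for paged segments can be sketched as below. Several things are assumptions of this sketch: the page size of 4 kB (PAGE_BITS = 12), the representation of the segment table as a list whose entry s directly is the page table of segment s (standing in for SegmentTable[s].base), and the function name.

```python
PAGE_BITS = 12  # assumed page size of 4 kB


def paged_segment_translate(s, d, segment_table):
    """Resolve the logical address <s, d> under paged segments.

    segment_table[s] is the page table of segment s, given as a list of
    frame base addresses.
    """
    page_table = segment_table[s]           # PageTable = SegmentTable[s].base
    p = d >> PAGE_BITS                      # first part of the offset
    d_rest = d & ((1 << PAGE_BITS) - 1)     # remaining part of the offset
    f = page_table[p]                       # f = PageTable[p]
    return f + d_rest                       # final physical address
```

For a segment whose two pages sit in frames at 0x8000 and 0x3000, the offset 0x1004 selects page 1 and resolves to 0x3004.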

Paged Segments
Combination of segmentation and paging
User view is segmentation, memory allocation scheme is paging

Used by modern processors / architectures

Example: Intel 80386

CPU has 6 segment registers
which act as a quick 6-entry segment table

Up to 16384 segments per process possible
in which case the segment table resides in main memory.

Maximum segment size is 4 GB
Within each segment we have a flat address scheme of 2^32 byte addresses

Page size is 4 kB
A two-level paging scheme is used
Slide 417

Computer Architecture
Memory (353)
Paging (381)
Segmentation (400)
Paged Segments (412)
Virtual Memory (419)
Caches (471)
Slide 418

Virtual Memory
What if the physical memory is smaller than required by a process?

Dynamic Loading
Overlays
Require special precautions and extra work by the programmer.

It would be much easier if we did not have to worry about the memory size and could leave the problem of fitting a larger program into smaller memory to the operating system.

Virtual Memory
Memory is abstracted into an extremely large uniform array of storage, independent of the amount of physical memory available.
Slide 419

Virtual Memory

Based on locality assumption
No process can access all its code and data at the same time; therefore the entire process space does not need to be in memory at all times.

Only parts of the process space are in memory
The remaining ones are on disk and are loaded when demanded

Logical address space can be much larger than physical address space
A program larger than physical memory can be executed
More programs can (partially) reside in memory, which increases the degree of multiprogramming!
Slide 420

Virtual Memory
Virtual memory concept (one program)
[Figure labels: OS, program, size, physical memory, backing store (usually a disk)]
Slide 421

Virtual Memory
Virtual memory concept (three programs)
[Figure labels: OS, program A, program B, program C, size, free memory, physical memory, backing store]
Slide 422

Virtual Memory
Virtual memory can be implemented by means of

Demand Segmentation
Used in early Burroughs computer systems and in IBM OS/2. Complex segment-replacement algorithms.

Demand Paging
Commonly used today. Physical memory is divided into frames (paging principle). Demand paging applies to both paging systems and paged segment systems.

Figure next slide: Virtual memory usually is much larger than physical memory (e.g. modern 64-bit processors). The pages currently needed by a process are in memory, the other pages reside on disk. The page table shows whether a page is in memory or on disk.
Slide 423

Virtual Memory
Virtual memory consists of more pages than there are frames in physical memory
[Figure labels: page table, disk]
Figure from [Sil00 p.299]
Slide 424

Demand Paging
A page is brought from disk into memory when it is needed (when it is demanded by the process).

Less I/O
than loading the entire program (at least for the moment)

Less memory needed
since a (hopefully) great part of the program remains on disk

Faster response
The process can start earlier since loading is quicker

More processes in memory
The memory saved can be given to other processes

Loading a page on demand is done by the pager (a part of the operating system, usually a daemon process).
Slide 425

Demand Paging
Q: How does the OS know that a page is demanded by a process?
A: When the process tries to access a page that is not in memory!
A process does not know whether or not a page is in memory, only the OS knows.

Each page table entry has a validity bit (v)
If v = 1: page is in memory
If v = 0: page is on disk
The validity bit is also termed valid-invalid bit.

During address translation, when the validity bit is found to be 0, the hardware causes a page fault trap to the operating system.
Slide 426

Page Fault
A page fault occurs when a process tries to access a page that is not resident in memory. Steps in demand paging:
1. A reference to some page is made.
2. The page is not listed in the table (or is marked invalid), which causes a page fault trap (a hardware interrupt) to the operating system.
3. An internal table (usually kept with the process control block) is checked to determine whether the reference was a valid or an invalid memory access. If the reference was valid, a free frame is to be found.
Slide 427

Page Fault
4. A disk operation is scheduled to read the desired page into the free frame.
5. When the disk read is complete, the internal tables are updated to reflect that the page now is in memory.
6. The process is restarted at the instruction that caused the page fault trap. The process can now access the page.

These steps are symbolized in the next figure.
Slide 428
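The six steps can be mimicked in a few lines of Python. This is only a toy sketch: the class `DemandPager`, its fields, and the idea of representing v = 0 by `None` are assumptions made for illustration, and the disk read is not modelled at all.

```python
class DemandPager:
    """Toy model of demand paging; all names are hypothetical."""

    def __init__(self, num_pages, num_frames):
        # Page table: page number -> frame number, or None when v = 0.
        self.page_table = {page: None for page in range(num_pages)}
        self.free_frames = list(range(num_frames))
        self.page_faults = 0

    def access(self, page):
        # Step 1: a reference to some page is made.
        if page not in self.page_table:
            # Step 3: the reference itself is invalid.
            raise MemoryError(f"invalid reference to page {page}")
        if self.page_table[page] is None:
            # Step 2: validity bit is 0, so a page fault trap occurs.
            self.page_faults += 1
            if not self.free_frames:
                raise RuntimeError("no free frame (page replacement needed)")
            frame = self.free_frames.pop(0)
            # Steps 4 and 5: "read" the page, then update the table.
            self.page_table[page] = frame
        # Step 6: the access is restarted and now succeeds.
        return self.page_table[page]


pager = DemandPager(num_pages=8, num_frames=3)
for page in [0, 2, 0, 5, 2]:
    pager.access(page)
print(pager.page_faults)  # 3: first touch of pages 0, 2 and 5
```

Running out of free frames raises an error here; finding a victim frame is the subject of page replacement.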

Virtual Memory
[Figure from [Sil00 p.301]: Page table indicating that pages 0, 2 and 5 are currently in memory, while pages 1, 3, 4, 6, 7 are not.]
Slide 429

Page Fault
[Figure from [Sil00 p.302]: Steps in handling a page fault]
Slide 430

Performance of Demand Paging

Page fault rate p, 0 ≤ p ≤ 1
Average probability that a memory reference will cause a page fault.
If p = 0: no page faults at all.
If p = 1: every reference causes a page fault.

Memory access time $t_{ma}$
Time to access physical memory (usually in the range of 10 ns ... 150 ns).

Effective access time $t_{eff}$
Average effective memory access time. This is the time that finally counts for system performance.

$t_{eff} = (1 - p) \cdot t_{ma} + p \cdot \text{page fault time}$
Slide 431

Performance of Demand Paging

Page fault time
The time from the failed memory reference until the machine instruction continues:
Trap to the OS
Context switch
Check validity
Find a free frame
Schedule disk read
Context switch to another process (optional)
Place page in frame
Adjust tables
Context switch and restart process

Assuming a disk system with an average latency of 8 ms, average seek time of 15 ms and a transfer time of 1 ms (and neglecting that the disk queue may hold other processes waiting for disk I/O), and assuming the execution time of the page fault handling instructions to be 1 ms, the page fault time is 25 ms.
Slide 432

Performance of Demand Paging

Effective access time (with $t_{ma}$ = 100 ns and a page fault time of 25 ms)

$t_{eff} = (1 - p) \cdot 100\,\text{ns} + p \cdot 25\,\text{ms} = 100\,\text{ns} + p \cdot 24\,999\,900\,\text{ns}$, where $24\,999\,900\,\text{ns} \approx 25\,\text{ms}$

When each memory reference causes a page fault (p = 1), the system is slowed down by a factor of 250000. When one out of 1000 references causes a page fault (p = 0.001), the system is slowed down by a factor of about 250. For less than a 10 % degradation, the page fault rate p must be less than 0.0000004 (1 page fault in 2.5 million references).
Slide 433

Performance of Demand Paging
Some possibilities for lowering the page fault rate

Increase page size
With larger pages the likelihood of crossing page boundaries is lower.

Use a good page replacement scheme
Preferably one that minimizes page faults.

Assign sufficient frames
The system constantly monitors memory accesses, creates page-usage statistics and adjusts the number of allocated frames on the fly. Costly, but used in some systems (so-called working set model).

Enforce program locality
Programs can contribute to locality by minimizing cross-page accesses. This applies to the implemented algorithms as well as to the addressing modes of the individual machine instructions.
Slide 434
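The effective access time formula is easy to evaluate numerically. The following sketch uses the values from the slides (100 ns memory access, 25 ms page fault time) as defaults; the function names are made up for this example.

```python
def effective_access_time_ns(p, t_ma_ns=100.0, t_fault_ns=25e6):
    """t_eff = (1 - p) * t_ma + p * page fault time, in nanoseconds."""
    return (1.0 - p) * t_ma_ns + p * t_fault_ns


def max_fault_rate(max_slowdown, t_ma_ns=100.0, t_fault_ns=25e6):
    """Largest p for which t_eff <= max_slowdown * t_ma."""
    return (max_slowdown - 1.0) * t_ma_ns / (t_fault_ns - t_ma_ns)


for p in (1.0, 0.001):
    # Slowdown factor relative to a plain memory access.
    print(p, effective_access_time_ns(p) / 100.0)

# Page fault rate allowed for at most 10 % degradation.
print(max_fault_rate(1.10))
```

The last value comes out at roughly 4e-7, that is, about one page fault in 2.5 million references.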

Page Size
What should be the page (= frame) size?

Large Pages                  Small Pages
internal fragmentation       little internal fragmentation
smaller page tables          large page tables
faster disk I/O              slower disk I/O
fewer page faults            more page faults

The trend goes toward larger pages. Page faults are more costly today because the gap between CPU speed and disk speed has increased.

Intel 80386: 4 kB
Intel Pentium II: 4 kB or 4 MB
Sun UltraSparc: 8 kB, 64 kB, 512 kB, 4 MB
Slide 435

Page Attributes
Next to the validity bit v, each page may in addition be equipped with the following attribute bits in the page table entry:

Reference bit r
Upon any reference to the page (read / write) the bit is set. Once the bit is set it remains set until cleared by the OS.

Modify bit m
Each time the page is modified (write access), the bit is set. The bit remains set until cleared by the OS. A page that is modified is also called dirty. The modify bit is also termed dirty bit. When the page is not modified it is clean.
Slide 436

Finding Free Frames
What options does the OS have when needing free frames?

Terminate another process
Not acceptable. The process may already have done some work (e.g. changed a data base) which may mistakenly be repeated when the process is started again.

Swap out a process
An option only in rare circumstances (e.g. thrashing).

Hold some frames in spare
Sooner or later the spare frames are used up. Memory utilization is lower since the spare memory is not used productively.

Borrow frames
Yes! Take an allocated frame, use it, and give it (or another one) back to the owner later. This is page replacement.
Slide 437

Page Replacement
Page replacement scheme:
If there is a free frame, use it; otherwise use a page-replacement algorithm to select a victim frame.
Save the victim page to disk and adjust the tables of the owner process.
Read in the desired page and adjust the tables.
(Two page transfers)

Improvement
Preferably use a victim page that is clean (not modified, m = 0). Clean pages do not need to be saved to disk.
Slide 438

Page Replacement
[Figure from [Sil00 p.309]]
Need for page replacement: User process 1 wants to access module M (page 3). All memory, however, is occupied. Now a victim frame needs to be determined.
Slide 439

Page Replacement
[Figure from [Sil00 p.310]]
Page replacement: The victim is saved to disk (1) and the page table is adjusted (2). The desired page is read in (3) and the table is adjusted again. In this figure the victim is a page from the same process (or the same segment in case of paged segments).
Slide 440

Page Replacement

Global Page Replacement
The victim frame can be from the set of all frames, that is, one process can take a frame from another. Processes can affect each other's page fault rate, though.

Local Page Replacement
The victim frame may only be from the process's own set of frames, that is, the number of allocated frames per process does not change. No impact on other processes.

The figure on the previous slide shows a local page replacement strategy.
Slide 441

Page Replacement
Page replacement algorithms

First-in first-out (FIFO)
and its variations second-chance and clock.

Optimal page replacement (OPT)
Least Recently Used (LRU)
LRU Approximations

Desired: Lowest page-fault rate!
The algorithms are evaluated by applying them to memory reference strings.
Slide 442

Memory Reference Strings
Assume the following address sequence (e.g. recorded by tracing the memory accesses of a process):

0100, 0432, 0101, 0612, 0102, 0103, 0104, 0101, 0611, 0102, 0103, 0104, 0101, 0610, 0102, 0103, 0104, 0101, 0609, 0102, 0105

Assuming a page size of 100 bytes, the sequence can be reduced to

1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1

This memory reference string lists the pages accessed over time (at the time steps at which page access changes).
If there is only 1 frame available, the sequence would cause 11 page faults. If there are 3 frames available, the sequence would cause 3 page faults.
Slide 443

Memory Reference Strings
In general, the more frames are available, the lower the expected number of page faults.
[Figure from [Sil00 p.312]: Page faults versus number of frames]
Slide 444
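The reduction of an address trace to a memory reference string can be written down directly. A minimal sketch, assuming a helper named `reference_string` (not from the lecture) and the page size of 100 bytes used in the example:

```python
def reference_string(addresses, page_size):
    """Map addresses to page numbers and drop immediate repetitions."""
    pages = []
    for address in addresses:
        page = address // page_size
        if not pages or pages[-1] != page:
            pages.append(page)
    return pages


trace = [100, 432, 101, 612, 102, 103, 104, 101, 611, 102, 103,
         104, 101, 610, 102, 103, 104, 101, 609, 102, 105]
print(reference_string(trace, page_size=100))
# [1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1]
```

The string has 11 entries and 3 distinct pages, which matches the 11 page faults with one frame and 3 page faults with three frames.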

FIFO Page Replacement
Principle: Replace the oldest page (old = swap-in time).

Example VM.1
Memory reference string: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
Number of frames: 3
[Figure from [Sil00 p.313]: frame contents over time]
Total: 15 page faults.
Slide 445

FIFO Page Replacement

Example VM.2
Memory reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Number of frames: 3

Reference   1  2  3  4  1  2  5  1  2  3  4  5
Frame 0     1  1  1  4  4  4  5  5  5  5  5  5
Frame 1        2  2  2  1  1  1  1  1  3  3  3
Frame 2           3  3  3  2  2  2  2  2  4  4
Page fault  x  x  x  x  x  x  x        x  x

9 page faults
Slide 446

FIFO Page Replacement

Example VM.3
Memory reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 (as in VM.2)
Number of frames: 4

Reference   1  2  3  4  1  2  5  1  2  3  4  5
Frame 0     1  1  1  1  1  1  5  5  5  5  4  4
Frame 1        2  2  2  2  2  2  1  1  1  1  5
Frame 2           3  3  3  3  3  3  2  2  2  2
Frame 3              4  4  4  4  4  4  3  3  3
Page fault  x  x  x  x        x  x  x  x  x  x

10 page faults

Although we have more frames available than previously, the page fault rate did not decrease!
Slide 447

FIFO Page Replacement
From the examples VM.2 and VM.3 it can be seen that the number of page faults for 4 frames is greater than for 3 frames. This unexpected result is known as Belady's Anomaly [1]:

For some page-replacement algorithms the page-fault rate may increase as the number of allocated frames increases.

[1] Laszlo Belady, R. Nelson, G. Shedler: An anomaly in space-time characteristics of certain programs running in a paging machine, Communications of the ACM, Volume 12, Issue 6, June 1969, Pages 349-353, ISSN 0001-0782, also available online as pdf from the ACM.
Slide 448
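The FIFO examples can be reproduced with a short simulation. This is a sketch only; the function name `fifo_page_faults` is an assumption, and the frames are modelled as a simple queue of page numbers.

```python
from collections import deque


def fifo_page_faults(reference_string, num_frames):
    """Count page faults under FIFO replacement."""
    frames = deque()          # oldest page sits at the left end
    faults = 0
    for page in reference_string:
        if page in frames:
            continue          # hit: FIFO order is not changed
        faults += 1
        if len(frames) == num_frames:
            frames.popleft()  # evict the page that was loaded first
        frames.append(page)
    return faults


belady = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_page_faults(belady, 3))  # 9
print(fifo_page_faults(belady, 4))  # 10
```

With the reference string of example VM.1 and 3 frames, the same function counts 15 page faults.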

Belady's Anomaly
[Figure from [Sil00 p.314]]
Page faults versus number of frames for the string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5.
Slide 449

Second-Chance Algorithm
This algorithm is a derivative of the FIFO algorithm.
Start with the oldest page.
Inspect the page:
If r = 0: replace the page. Done.
If r = 1: give the page a second chance by clearing r and moving the page to the top of the FIFO. Proceed to the next oldest page.

When a page is used often enough to keep the r bit set, it will never be replaced. This avoids the problem of throwing out a heavily used page (as may happen with strict FIFO). If all pages have r = 1, however, the algorithm degenerates to FIFO.
Slide 450

Second-Chance Algorithm
Example: page A is the oldest in the FIFO (see a). With pure FIFO it would have been replaced. However, as r = 1 it is given a second chance and is moved to the top of the FIFO (see b). The algorithm continues with page B.
[Figure from [Ta01 p.218]]
Slide 451

Clock Algorithm
Second chance constantly moves pages within the FIFO (overhead)!
When the FIFO is arranged as a circular list the overhead is less.

Initially the hand (a pointer) points to the oldest page.
The algorithm then applied is second chance.
[Figure from [Ta01 p.219]]
Slide 452
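A possible sketch of the clock variant is given below. It assumes that the r bit is set when a page is loaded as well as on every later access, and that the OS never clears r bits periodically; both are simplifications chosen for this example, and the function name is hypothetical.

```python
def clock_page_faults(reference_string, num_frames):
    """Count page faults for the clock (second-chance) algorithm."""
    pages = [None] * num_frames   # circular list of frames
    r_bit = [0] * num_frames      # reference bit per frame
    hand = 0                      # points to the oldest page
    faults = 0
    for page in reference_string:
        if page in pages:
            r_bit[pages.index(page)] = 1   # hardware sets r on access
            continue
        faults += 1
        # Advance the hand, clearing r bits, until r = 0 is found.
        while pages[hand] is not None and r_bit[hand] == 1:
            r_bit[hand] = 0
            hand = (hand + 1) % num_frames
        pages[hand] = page
        r_bit[hand] = 1
        hand = (hand + 1) % num_frames
    return faults


print(clock_page_faults([1, 2, 3, 4, 2, 5, 2], 3))  # 5
```

In this small string page 2 is referenced again before page 5 arrives, so it survives and the last access is a hit; strict FIFO would have evicted page 2 and needs 6 page faults.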

Optimal Page Replacement
Principle: Replace the page that will not be used for the longest time.

Example VM.4
Memory reference string: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
Number of frames: 3
[Figure from [Sil00 p.315]: frame contents over time]
Total: 9 page faults.
Slide 453

LRU Page Replacement
Principle: Replace the page that has not been used for the longest time.

Example VM.5
Memory reference string: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
Number of frames: 3
[Figure from [Sil00 p.315]: frame contents over time]
Slide 454
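Both principles can be simulated on the reference string of the examples. The sketch below is one straightforward way to do it (function names are assumptions); OPT looks ahead in the string, which is possible in a simulation but not in a real system.

```python
def opt_page_faults(reference_string, num_frames):
    """Count page faults for optimal replacement (needs the future)."""
    frames = []
    faults = 0
    for i, page in enumerate(reference_string):
        if page in frames:
            continue
        faults += 1
        if len(frames) < num_frames:
            frames.append(page)
            continue
        future = reference_string[i + 1:]

        def next_use(resident):
            # Pages never used again count as infinitely far away.
            if resident in future:
                return future.index(resident)
            return float("inf")

        victim = max(frames, key=next_use)
        frames[frames.index(victim)] = page
    return faults


def lru_page_faults(reference_string, num_frames):
    """Count page faults for LRU replacement."""
    frames = []   # ordered from least to most recently used
    faults = 0
    for page in reference_string:
        if page in frames:
            frames.remove(page)
        else:
            faults += 1
            if len(frames) == num_frames:
                frames.pop(0)   # evict the least recently used page
        frames.append(page)
    return faults


refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
print(opt_page_faults(refs, 3))  # 9
print(lru_page_faults(refs, 3))  # 12
```

For this string LRU (12 page faults) lies between OPT (9) and FIFO (15).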

LRU Page Replacement
Possible LRU implementations:

Counter Implementation
Every page table entry has a counter field. The system hardware must have a logical counter. With each page access the counter value is copied to the entry.
Update on each page access required
Searching the table for finding the LRU page
Account for clock overflow

Stack Implementation
Keep a stack containing all page numbers. Each time a page is referenced, its number is searched and moved to the top. The top holds the MRU pages, the bottom holds the LRU pages.
Update on page access required
Searching the stack for the current page number
Slide 455

LRU Page Replacement
Example for the stack implementation principle
[Figure from [Sil00 p.317]; the LRU page is at the bottom of the stack]
Slide 456

LRU Approximation
Not many systems provide sufficient hardware support for true LRU page replacement. Approximate LRU!

Use reference bit
When looking for the LRU page, take a page with r = 0.
No ordering among the pages (only used and unused).

History Field
Each page table entry has a history field h (e.g. a byte).
When the page is accessed, set the most significant bit (e.g. bit 7).
Periodically (e.g. every 100 ms) shift the bits right.
When looking for the LRU page, take the page with the smallest unsigned int(h).
Better ordering among the pages (256 history values).
Slide 457

LRU Approximation
History field examples

00000000 = Not used for the last 8 time periods
11111111 = Used in each of the past 8 periods
01001000 = Used in last period and in the fifth last period

history field   value (unsigned int)
0101            5    (this page will be chosen as victim)
0111            7
0110            6
1011            11
Slide 458
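The history field mechanism fits into three small functions. This is an illustrative sketch with an 8-bit field per page kept in a dictionary; the names `access`, `tick` and `choose_victim` are made up for this example.

```python
def access(history, page):
    """On a page access, set the most significant bit (bit 7)."""
    history[page] |= 0x80


def tick(history):
    """Periodic step: shift every history field right by one bit."""
    for page in history:
        history[page] >>= 1


def choose_victim(history):
    """The approximated LRU page has the smallest history value."""
    return min(history, key=history.get)


history = {"A": 0, "B": 0, "C": 0}
access(history, "A")
tick(history)
access(history, "B")
tick(history)
access(history, "C")
print(history)                 # {'A': 32, 'B': 64, 'C': 128}
print(choose_victim(history))  # A
```

Applied to the four history values 0101, 0111, 0110 and 1011 from the slide, `choose_victim` selects the page with value 5.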

Page Replacement
Exemplary page fault rates
[Figure from lecture slides WS 05/06: page faults per 1000 references (0 to 40) versus number of frames allocated (6 to 14) for FIFO, Clock, LRU and OPT]
Differences are noticeable only for smaller numbers of frames.
Slide 459

Page Replacement
Algorithms Summary

First-in first-out (FIFO)
Simplest algorithm, easy to implement, but has the worst performance. The clock version is somewhat better as it does not replace busy pages.

Optimal page replacement (OPT)
Not of practical use as one must know the future! Used for comparisons only. Lowest page fault rate of all algorithms.

Least Recently Used (LRU)
The best usable algorithm, but requires much execution time or highly sophisticated hardware.

LRU Approximations
Slightly worse than LRU, but faster. Applicable in practice.
Slide 460

Thrashing
When the number of allocated frames falls below the number of pages actively used by a process, the process will cause page fault after page fault. This high paging activity is called thrashing.
[Figure from [Sil00 p.326]]
Too high a degree of multiprogramming results in thrashing because the individual processes do not have enough frames.
Slide 461

Thrashing
Countermeasures

Switching to local page replacement
A thrashing process cannot steal frames from others. The page device queue (used by all), however, is still full of requests, lowering overall system performance.

Swap out
The thrashing process or some other process can be swapped out for a while. The choice depends on process priorities.

Assign sufficient frames
How many frames are sufficient?
Working-set model: All page references are monitored (online memory reference string creation). The pages recently accessed form the working-set. Its size is used as the number of sufficient frames.
Slide 462
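The working-set idea can be sketched as follows. The sketch assumes that "recently accessed" means "within the last `window` references" and that time is counted in references; the function name and the sample reference string are illustrative only.

```python
def working_set(reference_string, t, window):
    """Pages referenced in the last `window` references up to time t."""
    start = max(0, t - window + 1)
    return set(reference_string[start:t + 1])


refs = [1, 2, 1, 3, 4, 4, 4, 3, 4, 4]
print(sorted(working_set(refs, t=4, window=5)))  # [1, 2, 3, 4]
print(sorted(working_set(refs, t=9, window=5)))  # [3, 4]
```

In this example the process would need 4 frames around time 4 but only 2 frames around time 9.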

Working-Set
[Figure from [Sil00 p.328]]
The working-set model uses a parameter Δ to define the working-set window. The set of pages in Δ defines the working-set WS. The OS allocates to the process enough frames to maintain the size of the working-set. Keeping track of the working set requires the observation of memory accesses (constantly or in time intervals).
Slide 463

Program Locality
Demand paging is transparent to the user program. A program, however, can enforce locality (at least for data).
Assume a page size of 128 words and consider the following program which clears the elements of a 128 x 128 matrix (i = row index, j = column index).

Program A
int A[][] = new int[128][128];
for (int j = 0; j < 128; j++)
    for (int i = 0; i < 128; i++)
        A[i][j] = 0;

Program clearing the matrix elements column-wise.
Slide 464

Program Locality
The array is stored in memory row major.

In row major storage, a multidimensional array is laid out in linear memory such that the rows are stored one after the other. It is the approach used by C, Java, and many other languages, with the notable exception of Fortran. For example, the matrix

1 2 3
4 5 6

is defined in C as

int A[2][3] = { {1,2,3}, {4,5,6} };

and is stored in memory row-wise.
[Figure: memory words from low to high addresses: 1, 2, 3 (row 1), 4, 5, 6 (row 2)]
Slide 465

Program Locality
Thus, each row of the 128 x 128 matrix occupies one page. If the operating system allocates only one frame (for the data) to process A, the process will cause 128 x 128 = 16384 page faults!

This is because the process clears one word in each page (word j), then the next word, ..., thus jumping from page to page in the inner loop.

for (int j = 0; j < 128; j++)
    for (int i = 0; i < 128; i++)
        A[i][j] = 0;

[Figure: the inner loop touches word j in pages i+1, i+2, i+3, ...]
Slide 466

Program Locality
By changing the loop order, the process first finishes one page before going to the next.

Program B
int A[][] = new int[128][128];
for (int i = 0; i < 128; i++)
    for (int j = 0; j < 128; j++)
        A[i][j] = 0;

Now, if the operating system allocates only one frame (for the data) to process B, the process will cause only 128 page faults!
Slide 467

Program Locality
Locality is also influenced by the addressing modes at machine instruction level.

Consider a three-address instruction, such as ADD A,B,C which performs C:=A+B. In the worst case the operands A, B, C are located in 3 different pages.

Another example is the PDP-11 instruction MOV @(R1)+,@(R2)+ which in the worst case straddles across 6 pages.
[Figure: PDP-11 addressing mode 3 for the source operand (R1)]
Slide 468
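The page fault counts for the two loop orders of the matrix example can be checked with a small simulation. It is a sketch under the stated assumptions of the example (row-major storage, 128 words per page, exactly one data frame); the function name and the `order` strings are made up for illustration.

```python
def page_faults_one_frame(order, n=128, page_size=128):
    """Count data page faults when clearing an n x n matrix."""
    if order == "column-wise":
        # Outer loop over columns j, inner loop over rows i.
        indices = ((i, j) for j in range(n) for i in range(n))
    else:
        # Outer loop over rows i, inner loop over columns j.
        indices = ((i, j) for i in range(n) for j in range(n))
    resident = None   # the single page currently held in the frame
    faults = 0
    for i, j in indices:
        page = (i * n + j) // page_size   # row-major word address
        if page != resident:
            faults += 1
            resident = page
    return faults


print(page_faults_one_frame("column-wise"))  # 16384
print(page_faults_one_frame("row-wise"))     # 128
```

Only the traversal order differs between the two calls, which is exactly the difference between clearing the matrix column by column and row by row.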

Virtual Memory

Separation of logical and physical memory
The user / programmer can think of an extremely large virtual address space.

Pure Paging / Paged Segments
Virtual memory can be implemented upon both memory allocation schemes.

Execution of large programs
which do not fit into physical memory in their entirety.

Better multiprogramming
as there can be more programs in memory.

Not suitable for hard real-time systems!
Virtual memory is the antithesis of hard real-time computing. This is because the response times cannot be guaranteed owing to the fact that processes may influence each other (page device queue, thrashing, ...).
Slide 469

Computer Architecture
Memory (353)
Paging (381)
Segmentation (400)
Paged Segments (412)
Virtual Memory (419)
Caches (471)
Slide 470

Memory Hierarchy
The farther away from the CPU, the larger and slower the memory. The hierarchy is the consequence of locality.
[Figure from [HP06 p.288]: Memory hierarchy levels in typical desktop / server computers]
Slide 471

Locality Principle
Programs tend to reuse data and instructions. Rule of thumb:

A program spends 90% of its execution time in only 10% of the code. [HP06 p.38]

Temporal locality: recently accessed items are likely to be accessed in the near future.
Spatial locality: items whose addresses are near one another tend to be referenced close together in time.
Slide 472

Locality Principle
Example of a memory-access trace of a process
[Figure from [Sil00 p.327]]
Slide 473

Caches
Cache: a safe place for hiding or storing things
Webster's Dictionary [HP06 p. C-1]

Here: Fast memory that stores copies of data from the most frequently used main memory locations. Used by the CPU to reduce the average time to access memory locations.

Effect: instructions (in execution) can proceed quicker from the CPU's point of view.
Instruction fetch is quicker
Memory operands are accessed quicker

Result: faster program execution, improved system performance
Slide 474

Cached Memory Access
Steps in accessing memory (here: reading from memory), simplified.

CPU requests content from a memory location
Cache is checked for this datum
When present, deliver datum from cache
When not, transfer datum from main memory to cache
Then deliver from cache to CPU

[Figure labels: word transfer, block transfer]
Slide 475

Caches
To take advantage of spatial locality a cache contains blocks of data rather than individual bytes. A block is a contiguous line of processor words. It is also called a cache line. Common block sizes: 8 ... 128 bytes

Cache components
Data Area
Tag Area
Attribute Area
Slide 476

Data Area
All blocks in the cache make up the data area.
[Figure: data area with blocks 0, 1, 2, 3, 4, ..., B−1, N bytes per block]
Cache capacity = B × N bytes
Slide 477

Tag Area
The block addresses of the cached blocks make up the tags [1] of the cache lines. All tags form the tag area.
[Figure: tag area next to the data area, blocks 0, 1, 2, 3, 4, ..., B−1, N bytes per block]

[1] The statement is slightly simplified. In real caches, often just a fraction of the block address is used as tag.
Slide 478

Attribute Area
The attribute area contains attribute bits for each cache line.

Validity bit V
indicates whether the cache line holds valid data
V = 1: data is valid
V = 0: data is invalid

Dirty bit D
indicates whether the cache line data is modified with respect to main memory
D = 1: data is modified
D = 0: data is not modified

[Figure: attributes (V, D), tag area and data area for blocks 0, 1, 2, 3, 4, ..., B−1, N bytes per block]
Slide 479

Caches
Each cache line plus its tag plus its attributes forms a slot.
[Figure: cache slot = attributes (V, D) + tag + block of N bytes; slots 0, 1, 2, 3, 4, ..., B−1]
Slide 480

Caches
How to find a certain byte in the cache?

The address generated by the CPU is divided into two fields.
High order bits make up the block address
Low order bits determine the offset within that block

Address (m bits) = block address (m−n bits) | offset (n bits)

Block address is compared against all tags simultaneously
In case of a match (cache hit), the offset selects the byte
Remark: CPU address space = $2^m$, cache line size (block size) = $2^n$
Slide 481

Block Address
Memory can be considered as an array of blocks.

Example with 4 bytes per block (the lower 2 bits of the binary memory address are the offset, the upper bits are the block address):

block address   memory address (decimal)   memory address (binary)
0               0                          000000
1               4                          000100
2               8                          001000
3               12                         001100
4               16                         010000
5               20                         010100
6               24                         011000
7               28                         011100
8               32                         100000
9               36                         100100

The block address should not be confused with the memory address at which the block starts. The block address is a block number.
block address = memory address DIV block size
Slide 482

Caches
Cache mechanism
[Figure: the CPU memory address is split into block address and offset; a comparator checks the block address against the tags of the cache (columns V, D, Tags, Data) and signals hit / miss; on a hit the offset selects the data out]
Slide 483

Hit Rate
Cache capacity is smaller than the capacity of main memory. Consequently, not all memory locations can be mirrored in the cache. When a required datum is found in the cache, we have a cache hit, otherwise a cache miss.

The hit rate is the fraction of cache accesses that result in a hit.

$\text{Hit rate} = \frac{\text{number of hits}}{\text{number of memory accesses}}$

The miss rate is the fraction of cache accesses that result in a miss.
Slide 484

Amdahl's Law
The law is a general law, not restricted to caches or computers.
Used to find the maximum expected improvement to an overall system when a part of the system is improved.

$I = \frac{1}{(1 - P) + \frac{P}{S}}$

I: maximum expected improvement, I > 0 (usually I > 1)
P: proportion of the system improved, 0 ≤ P ≤ 1
S: speedup of that proportion, S > 0, usually S > 1
Slide 485

Amdahl's Law
Example: 30% of the computations can be made twice as fast. P = 0.3, S = 2.

$I = \frac{1}{(1 - 0.3) + \frac{0.3}{2}} = \frac{1}{0.7 + 0.15} \approx 1.176$

Amdahl's Law in the special case of parallelization (see lecture Advanced Computer Architecture):

$I = \frac{1}{F + \frac{1 - F}{N}}$

F: proportion of sequential calculations (no speedup possible), 0 ≤ F ≤ 1
N: grade of parallelism (e.g. N processors), N > 0
Slide 486

Caches

               CPU (registers)   Cache (SRAM)   Main memory (DRAM)   I/O devices (disks)
Memory space   500 Byte          64 kB          1 GB                 1 TB
Access time    250 ps            1 ns           100 ns               10 ms

Example: Assume: Cache = 1 ns, main memory = 100 ns, 90% hit rate. What is the overall improvement?

P = 0.9, S = 100 ns / 1 ns = 100

$I = \frac{1}{(1 - 0.9) + \frac{0.9}{100}} \approx 9.174$

Memory accesses (as seen by the CPU) are now more than 9 times as fast as without a cache.
Slide 487

Read Access
Reading from memory (improvement)

CPU requests datum
Search cache while fetching block from memory
Cache hit: deliver datum, discard fetched block
Cache miss: put block in cache and deliver datum

In case of a hit, the datum is available quickly. In case of a miss there is no benefit from the cache, but also no harm.
Things are not that easy when writing into memory. Let's look at the cases of a write hit and a write miss.
Slide 488
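Amdahl's Law is a one-line function. The sketch below simply evaluates the formula for the two examples (30% of the work made twice as fast, and a cache with 90% hit rate that is 100 times faster than main memory); the function name is an assumption.

```python
def amdahl_improvement(p, s):
    """Maximum expected improvement I = 1 / ((1 - P) + P / S)."""
    return 1.0 / ((1.0 - p) + p / s)


print(round(amdahl_improvement(0.3, 2), 3))    # 1.176
print(round(amdahl_improvement(0.9, 100), 3))  # 9.174
```

Even with an arbitrarily fast cache (S very large), the second example is bounded by 1 / (1 − 0.9) = 10, because 10% of the accesses still go to main memory.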

Write Hit Policy
Assume a write hit. How to keep cache and main memory consistent on write-accesses?

Write through
The datum is written to both the block in the cache and the block in memory.
[Figure: CPU, Cache, Write Buffer, Memory]
Cache always clean (no dirty bit required)
CPU write stall (problem reduced through write buffer)
Main memory always has the most current copy (cache coherency in multi-processor systems)

Write back
The datum is only written to the cache (dirty bit is set). The modified block is written to main memory once it is evicted from cache.
Write speed = cache speed
Multiple writes to the same block still result in only one write to memory
Less memory bandwidth needed
Slide 489

Write Miss Policy
Assume a write miss. What to do?

Write allocate
The block containing the referenced datum is transferred from main memory to the cache. Then one of the write hit policies is applied. Normally used with write back caches.

No-write allocate
Write misses do not affect the cache. Instead the datum is modified only in main memory. Write hits however do affect the cache. Normally used with write through caches.
Slide 490

Write Miss Policy
Assume an empty cache and the following sequence of memory operations.

WriteMem[100]
WriteMem[100]
ReadMem[200]
WriteMem[200]
WriteMem[100]

What are the number of hits and misses when using no-write allocate versus write allocate?

Operation       No-write allocate   Write allocate
WriteMem[100]   miss                miss
WriteMem[100]   miss                hit
ReadMem[200]    miss                miss
WriteMem[200]   hit                 hit
WriteMem[100]   miss                hit
Slide 491

Caches
[Figure: Cache and Memory]
Where exactly are the blocks placed in the cache? Cache Organization
What if the cache is full? Replacement Strategies
...
Slide 492
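The hit/miss sequence of the write miss example can be replayed in code. The sketch assumes that addresses 100 and 200 lie in different blocks and that the cache never has to evict anything; the function `simulate` and the operation tuples are made up for illustration.

```python
def simulate(operations, write_allocate):
    """Return 'hit' or 'miss' for each operation on an empty cache."""
    cached = set()   # addresses whose blocks are currently in the cache
    results = []
    for op, address in operations:
        hit = address in cached
        results.append("hit" if hit else "miss")
        # Read misses always load the block; write misses load it
        # only under the write allocate policy.
        if not hit and (op == "read" or write_allocate):
            cached.add(address)
    return results


ops = [("write", 100), ("write", 100), ("read", 200),
       ("write", 200), ("write", 100)]
print(simulate(ops, write_allocate=False))
# ['miss', 'miss', 'miss', 'hit', 'miss']
print(simulate(ops, write_allocate=True))
# ['miss', 'hit', 'miss', 'hit', 'hit']
```

With no-write allocate the block at address 100 never enters the cache, because it is only ever written.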

Cache Organization
Where can a block be placed in the cache?

Direct Mapped
With this mapping scheme a memory block can be placed in only one particular slot. The slot number is calculated from
((memory address) DIV (block size)) MOD (slots in cache).

Fully Associative
The block can be placed in any slot.

Set Associative
The block can be placed in a restricted set of slots. A set is a group of slots. The block is first mapped onto the set and can then be placed anywhere within the set. The set number is calculated from
((memory address) DIV (block size)) MOD (number of sets in cache).

Direct Mapped
Each memory block is mapped to exactly one slot in the cache (many-to-one mapping).

[Figure: memory blocks at memory addresses (decimal) 0, 4, 8, 12, 16, 20, 24, 28, ... mapped onto cache slots 0, 1, 2, 3]

Block size = 4 byte
Cache capacity = 4 x 4 = 16 byte
If slot occupied (V = 1) evict cache line
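The DIV/MOD rule for direct mapping translates directly into code. The following minimal sketch uses the parameters of the small example cache (4-byte blocks, 4 slots). The function name `place` is an assumption made for this illustration.

```python
BLOCK_SIZE = 4   # bytes per block, as in the example cache
NUM_SLOTS = 4    # slots in the cache, as in the example cache


def place(address, block_size=BLOCK_SIZE, num_slots=NUM_SLOTS):
    """Return (slot, offset) of a byte address in a direct mapped cache."""
    block_address, offset = divmod(address, block_size)   # DIV and MOD in one step
    slot = block_address % num_slots
    return slot, offset


for address in (4, 20, 28):
    print(address, place(address))
```

Addresses 4 and 20 both yield slot 1, so in a direct mapped cache these two blocks cannot be cached at the same time.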

Direct Mapped
Slot = ((memory address) DIV (block size)) MOD (slots in cache).
Offset within slot = (memory address) MOD (block size).

Examples
Which slot does the block located at address 12D go into?
12 DIV 4 = 3, 3 MOD 4 = 3 (slot 3)
Which slot does the block located at address 20D go into?
20 DIV 4 = 5, 5 MOD 4 = 1 (slot 1)
Where does the byte located at address 23D go?
23 DIV 4 = 5, 5 MOD 4 = 1 (slot 1), 23 MOD 4 = 3 (offset 3)

Direct Mapped
Extracting slot number and offset directly from memory address

memory address (m bits):    tag bits | slot | offset
block address (m-n bits):   tag bits | slot
offset (n bits)

The lower bits of the block address select the slot. The size of the slot field depends on the number of slots (size = ld(number of slots)).
ld = logarithmus dualis (base 2)

Example
Where does the byte located at address 23D go?
23D = 1 0 1 1 1B: tag = 1, slot = 01, offset = 11
The byte goes in cache line (slot) 1 at offset 3.
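Because block size and number of slots are powers of two, DIV and MOD reduce to shifting and masking. The sketch below extracts the three address fields this way. `split_address` is a name chosen for this illustration, not something from the lecture.

```python
def split_address(address, offset_bits, slot_bits):
    """Split a memory address into (tag, slot, offset) with shifts and masks."""
    offset = address & ((1 << offset_bits) - 1)       # lowest n bits
    block_address = address >> offset_bits            # remaining m-n bits
    slot = block_address & ((1 << slot_bits) - 1)     # low bits of the block address
    tag = block_address >> slot_bits                  # whatever is left
    return tag, slot, offset


# 4-byte blocks -> 2 offset bits; 4 slots -> ld(4) = 2 slot bits
tag, slot, offset = split_address(0b10111, offset_bits=2, slot_bits=2)
print(tag, slot, offset)   # 1 1 3
```

The result agrees with the arithmetic version: address 23D lands in slot 1 at offset 3.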

Direct Mapped

[Figure from lecture CA WS05/06: 64 kByte cache using four-word (16 byte) blocks. The 32-bit address (bit positions 31 ... 0) is split into a 16-bit tag (bits 31 to 16), a 12-bit slot index (bits 15 to 4), a word offset (bits 3, 2) and a byte offset (bits 1, 0). The cache has 4K entries (lines), each with a valid bit, a 16-bit tag and 128 bits of data. A multiplexer (MUX) selects one of the four 32-bit words. Outputs: Hit, Data.]

Direct Mapped
Explanations for previous slide

Logical address space of CPU: 2^32 byte
Number of cache slots: 64 kB / 16 byte = 4K = 4096 slots.
Bits 0, 1 determine the position of the selected byte in a word. However, as the CPU uses 4-byte words as smallest entity, the byte offset is not used.
Bits 2, 3 determine the position of the word within a cache line.
Bits 4 to 15 (12 bits) determine the slot. 2^12 = 4K = number of slots.
Bits 16 to 31 are compared against the tags to see whether or not the block is in the cache.
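How valid bit, tag and slot index work together on an access can be sketched as a toy model of the 64 kByte direct mapped cache. The class `DirectMappedCache` and the example addresses are assumptions made for illustration. Only hits and misses are modelled, no data is stored.

```python
OFFSET_BITS = 4    # 16-byte blocks: bits 0 to 3
SLOT_BITS = 12     # 4096 slots: bits 4 to 15


class DirectMappedCache:
    def __init__(self):
        self.valid = [False] * (1 << SLOT_BITS)
        self.tags = [0] * (1 << SLOT_BITS)

    def access(self, address):
        """Return True on a hit. On a miss, load the block and return False."""
        slot = (address >> OFFSET_BITS) & ((1 << SLOT_BITS) - 1)
        tag = address >> (OFFSET_BITS + SLOT_BITS)     # bits 16 to 31
        if self.valid[slot] and self.tags[slot] == tag:
            return True
        self.valid[slot] = True    # whatever was in the slot is evicted
        self.tags[slot] = tag
        return False


cache = DirectMappedCache()
addresses = (0x00012340, 0x0001234C, 0x00022340, 0x00012340)
results = [cache.access(a) for a in addresses]
print(results)   # [False, True, False, False]
```

The second access hits because it falls into the same 16-byte block. The third address has the same slot index but a different tag, so it evicts the block and the fourth access misses again.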

Fully Associative
A memory block can go in any cache slot (many-to-many).

[Figure: memory blocks at memory addresses (decimal) 0, 4, 8, ..., 36; each block has 4 choices among cache slots 0, 1, 2, 3]

Slot selection
check all tags (preferably simultaneously)
take a slot with V = 0 (a free slot)
otherwise select a slot according to some replacement strategy

Set Associative
A memory block goes into a set, and can be placed anywhere within the set (many-to-some).

[Figure: memory blocks at memory addresses (decimal) 0, 4, 8, ..., 36 mapped onto a 2-way set associative cache with sets 0 and 1, each containing slots 0 and 1]

Slot selection
Determine set from block address
In this set, take a free slot ...
... or evict a slot according to some replacement strategy
Set Associative
Set = ((memory address) DIV (block size)) MOD (sets in cache).

Example
Which set does the block located at address 12D go into?
12 DIV 4 = 3 (block address), 3 MOD 2 = 1 (set 1)
Which slot of the set the block finally goes into depends on occupation and replacement strategy.

Similar to direct mapping, the low order bits of the block address determine the destination set.

memory address (m bits):    tag bits | set | offset
block address (m-n bits):   tag bits | set
offset (n bits)

Set Associative
N-way set associative cache
N = number of slots per set, not the number of sets.
N is a power of 2, common values are 2, 4, 8.

Extremes
N = 1: Direct Mapped. There is only one slot per set, that is, each slot is a set. The set number (thus the slot) is drawn from the block address.
N = B: Fully Associative. There is only one set containing all slots (B = number of blocks in cache = number of slots).
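The two extremes can be made visible by listing, for a given block address, all slots the block may occupy. In this sketch the slot numbering (set s owns the consecutive slots s*N ... s*N + N - 1) is a convention chosen for illustration, and `candidate_slots` is a made-up name.

```python
def candidate_slots(block_address, num_slots, ways):
    """Return the slot numbers in which a block may be placed.

    Assumed numbering: set s owns slots s*ways ... s*ways + ways - 1.
    """
    num_sets = num_slots // ways
    set_number = block_address % num_sets
    first = set_number * ways
    return list(range(first, first + ways))


B = 4   # number of slots in the cache
print(candidate_slots(3, B, ways=1))   # N = 1, direct mapped: [3]
print(candidate_slots(3, B, ways=2))   # 2-way, set 1: [2, 3]
print(candidate_slots(3, B, ways=B))   # N = B, fully associative: [0, 1, 2, 3]
```

Block address 3 corresponds to memory address 12D with 4-byte blocks.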

Set Associative
Opteron cache
Cache capacity: 64 kB in 64-byte blocks (1024 slots)
Cache is two-way set associative: 512 sets x 2 cache lines
Hardware: Two arrays with 512 cache lines each, that is, each set has one cache line in array 1 and one in array 2.

Set Associative
AMD Opteron Cache
Two-way set associative

[Figure from [HP06 p. C-13]]

Physical address is 40 bits. Address is divided into 34 bit block address (subdivided into 25 tag bits and 9 index bits) and 6 bits byte offset.
Figure: The index selects the set (2^9 = 512). The two tags of the set are compared against the tag bits. The valid bit must be set for a hit. On a hit, the corresponding data is delivered using the winning input from a 2:1 multiplexer. The data goes to Data in of the CPU. The victim buffer is needed when a cache line has to be written back to main memory (replacement).
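The field widths of the Opteron address follow from capacity, block size, associativity and address width. A small sketch of that calculation, assuming all sizes are powers of two (`field_widths` is a name chosen here, not an AMD or textbook term):

```python
def field_widths(capacity, block_size, ways, address_bits):
    """Return (tag_bits, index_bits, offset_bits) of a set associative cache.

    Assumes capacity, block_size and ways are powers of two.
    """
    num_sets = capacity // (block_size * ways)
    offset_bits = block_size.bit_length() - 1    # ld(block size)
    index_bits = num_sets.bit_length() - 1       # ld(number of sets)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


opteron = field_widths(64 * 1024, 64, 2, 40)
print(opteron)   # (25, 9, 6)
```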

Cache Organization
Where can block 12 go?

[Figure from [HP06 p.C-7]: placement of the memory block with block address 12 in the cache]

Cache Organization
For the previous figure, assume block 12 and block 20 being used very often. What is the problem?

Fully associative: No special problem. Both blocks can be stored in the cache at the same time.
Direct mapped: Problem! Only one of them can be stored at the same time since both map to the same slot. 12 mod 8 = 20 mod 8 = 4
Set associative: No special problem. Both blocks can be stored in the same set at the same time.
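The effect of blocks 12 and 20 competing for one slot can be reproduced with a small simulation. The sketch assumes an 8-slot cache (as implied by 12 mod 8), LRU replacement within a set, and a 2-way organization for the set associative case. These choices and the name `count_misses` are assumptions for illustration.

```python
from collections import OrderedDict


def count_misses(blocks, num_slots, ways):
    """Count misses for a sequence of block addresses (LRU within each set)."""
    num_sets = num_slots // ways
    sets = [OrderedDict() for _ in range(num_sets)]   # per set: blocks in LRU order
    misses = 0
    for block in blocks:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)      # now the most recently used block
            continue
        misses += 1
        if len(lines) == ways:
            lines.popitem(last=False)     # evict the least recently used block
        lines[block] = True
    return misses


trace = [12, 20] * 5   # blocks 12 and 20 used alternately, 10 accesses
direct = count_misses(trace, num_slots=8, ways=1)
two_way = count_misses(trace, num_slots=8, ways=2)
fully = count_misses(trace, num_slots=8, ways=8)
print(direct, two_way, fully)   # 10 2 2
```

In the direct mapped case every access evicts the other block, so all 10 accesses miss. With two or more slots per set only the two initial (compulsory) misses remain.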

Cache Organization

Direct Mapped
Hard allocation (no choice)
Simple & inexpensive
No replacement strategy required
If a process uses 2 blocks mapping to the same slot, cache misses are high.

Fully Associative
Full choice
Expensive searching (hardware) for free slot
Replacement strategy required

Set Associative
Compromise between direct mapped and fully associative
Some choice
Replacement strategy required

Replacement Strategies
Strategies for selecting a slot to evict (when necessary)

Random
Victim cache lines are selected randomly. Hardware pseudo-random number generator generates slot numbers.

Least-Recently Used (LRU)
Relies on the temporal locality principle. The least recently used block is hoped to have smallest likelihood of (re)usage. Expensive hardware.

First-in, First-out (FIFO)
Approximation of LRU by selecting the oldest block (oldest = load time).
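The difference between LRU and FIFO shows up as soon as a block is reused after it was loaded. The sketch below models a single set (or a fully associative cache) with two slots. The access trace is made up for illustration, and the random strategy is left out because its result is not deterministic.

```python
def simulate(trace, num_slots, policy):
    """Return the number of misses for one set using "LRU" or "FIFO"."""
    slots = []      # front of the list = next victim
    misses = 0
    for block in trace:
        if block in slots:
            if policy == "LRU":
                slots.remove(block)    # re-insert as most recently used
                slots.append(block)
            continue                   # FIFO ignores hits: order = load time
        misses += 1
        if len(slots) == num_slots:
            slots.pop(0)               # evict least recently used / oldest block
        slots.append(block)
    return misses


trace = [1, 2, 1, 3, 1, 2]
lru_misses = simulate(trace, 2, "LRU")
fifo_misses = simulate(trace, 2, "FIFO")
print(lru_misses, fifo_misses)   # 4 5
```

FIFO evicts block 1 when block 3 arrives although block 1 was used just before, which costs one additional miss on this trace.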

Replacement Strategies

Table: Data cache misses per 1000 instructions

             Two-way                 Four-way                Eight-way
Capacity     LRU     Rand    FIFO    LRU     Rand    FIFO    LRU     Rand    FIFO
16 kB        114.1   117.3   115.5   111.7   115.1   113.3   109.0   111.8   110.4
64 kB        103.4   104.3   103.9   102.4   102.3   103.1    99.7   100.5   100.3
256 kB        92.2    92.1    92.5    92.1    92.1    92.5    92.1    92.1    92.5

Data collected for Alpha architecture, block size = 64 byte.
Data from [HP06 p.C-10]

LRU is best for small caches
little difference between all strategies for large caches

Miss Categories

Compulsory Misses
The very first access to a block cannot be in the cache, so the block must be loaded. Also called cold-start misses.

Capacity Misses
Owing to the limited capacity of the cache, capacity misses will occur in addition to compulsory misses.

Conflict Misses
In set associative or direct mapped caches too many blocks may map to the same set (or slot). Also called collision misses.

Coherency Misses
are owing to cache flushes to keep multiple caches coherent in a multiprocessor. Not considered in this lecture (see lecture Advanced Computer Architecture).

Cache Optimization
Average memory access time = hit time + miss rate × miss penalty

reducing miss rate
Larger block size
Larger cache capacity
Higher associativity

reducing miss penalty
Multilevel caches
Read over write

reducing hit time
Avoiding address translation

Block Size
The data area gets larger cache lines (but fewer lines), the overall cache capacity remains the same.

reduced miss rate, taking advantage of spatial locality
  more accesses will likely go to the same block

increased miss penalty
  more bytes have to be fetched from main memory

increased conflict misses
  cache has fewer slots (per set)

increased capacity misses
  only for small caches. In case of high locality (e.g. repeated access to only one byte in a block) the remaining bytes are unused and waste cache capacity.

Common block sizes are 32 ... 128 bytes

Block Size

[Figure: Miss rate versus block size for different cache capacities, from [HP06 p.C-26]]

Block Size
For the previous figure, assume the memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Assume the hit time to be 1 clock cycle independent of block size. Which block size has the smallest average memory access time?

Average memory access time = hit time + miss rate × miss penalty

4K cache, 16 byte block:
Average memory access time = 1 + (8.57 % × 82) = 8.027 clock cycles
4K cache, 32 byte block:
Average memory access time = 1 + (7.24 % × 84) = 7.082 clock cycles
... and so on for all cache sizes and block sizes.
Block Size

Average memory access time (in clock cycles) versus block size for 4 different cache capacities

                              Cache capacity
Block size   Miss penalty     4K        16K       64K       256K
16            82               8.027    4.231     2.673     1.894
32            84               7.082*   3.411     2.134     1.588
64            88               7.160    3.323*    1.933*    1.449*
128           96               8.469    3.659     1.979     1.470
256          112              11.651    4.685     2.288     1.549

* = best (smallest access time) per column (thus per cache)

Cache Capacity
The cache is enlarged by adding more cache slots.

reduced miss rate
  owing to fewer capacity misses

potentially increased hit time
  owing to increased complexity

increased hardware & power consumption

Miss rates for block size 64 bytes

Cache capacity                       4K       16K      64K      256K
Miss rate                            7.00 %   2.64 %   1.06 %   0.51 %
Relative to next smaller capacity             38 %     40 %     48 %

Associativity
The higher the associativity the more slots per set.

reduced miss rate
  primarily owing to fewer conflict misses

increased hit time
  time needed for finding a free slot in the set

Rules of Thumb
Eight-way set associative is almost as effective as fully associative.
A direct mapped cache with capacity N has about the same miss rate as a two-way set associative cache of capacity N/2.

Common associativities are 1 (direct mapped), 2, 4, 8

Associativity

Miss rate [%] versus degree of associativity

Degree   4 K cache   8 K cache   16 K cache   64 K cache   128 K cache   512 K cache
1-way    9.8         6.8         4.9          3.7          2.1           0.8
2-way    7.6         4.9         4.1          3.1          1.9           0.7
4-way    7.1         4.4         4.1          3.0          1.9           0.6
8-way    7.1         4.4         4.1          2.9          1.9           0.6

Data from [HP06 p.C-23]

Associativity

[Figure: only the label "(hit time)" survives]

Multi-Level Caches
Building a cache hierarchy.

First-Level Cache (L1)
  small high speed cache usually located in the CPU

Second-Level Cache (L2)
  fast and bigger cache located close to CPU (chip set)

Third-Level Cache (L3), optional
  separate memory chip between L2 and main memory

CPU | L1 | L2 | L3 | Main Memory

Multi-Level Caches
Multi-level caches reduce the average miss penalty because on a miss the block can be fetched from the next cache level instead of from main memory.

Distinction between local and global cache considerations:

Local miss rate = (number of cache misses) / (number of cache accesses)
  local to a cache (e.g. L1, L2, ...)

Global miss rate = (number of cache misses) / (number of memory references by CPU)
  local misses versus global references

Multi-Level Caches
Example CH.1: Suppose that in 1000 memory references there are 40 misses in L1 and 20 misses in L2. What are the various miss rates?

Local miss rate L1 = global miss rate L1 = 40 / 1000 = 4 %    (these 4 % go from L1 to L2)
Local miss rate L2 = 20 / 40 = 50 %
Global miss rate L2 = 20 / 1000 = 2 %    (these 2 % go from L2 to main memory)

CPU | L1 | L2 | Main Memory

Local miss rate L2 is large because L1 skims the cream of memory accesses.
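The miss rates of example CH.1 can be recomputed in a few lines. The sketch assumes that every CPU reference accesses L1 and that only L1 misses access L2. The function name `miss_rates` is made up for this illustration.

```python
def miss_rates(references, l1_misses, l2_misses):
    """Return (local L1, local L2, global L2) miss rates as fractions."""
    local_l1 = l1_misses / references    # every CPU reference accesses L1
    local_l2 = l2_misses / l1_misses     # only L1 misses access L2
    global_l2 = l2_misses / references
    return local_l1, local_l2, global_l2


local_l1, local_l2, global_l2 = miss_rates(1000, 40, 20)
print(local_l1, local_l2, global_l2)   # 0.04 0.5 0.02
```

The global L2 miss rate equals the product of the two local miss rates (4 % of 50 % = 2 %).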

Multi-Level Caches

Average memory access time = hit time_L1 + miss rate_L1 × miss penalty_L1

miss penalty_L1 = hit time_L2 + local miss rate_L2 × miss penalty_L2

Average memory access time = hit time_L1 + miss rate_L1 × (hit time_L2 + local miss rate_L2 × miss penalty_L2)

CPU | L1 | L2 | Main Memory

Multi-Level Caches
Using the miss rates from example CH.1 (miss rate_L1 = 4 %, local miss rate_L2 = 50 %), and the following data

hit time_L1 = 1 clock cycle,
hit time_L2 = 10 clock cycles,
miss penalty_L2 = 200 clock cycles,

the average memory access time is
= 1 + 0.04 × (10 + 0.5 × 200) = 1 + 0.04 × 110 = 5.4 clock cycles
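The two-level formula can be evaluated step by step in code. The sketch below plugs in the miss rates from example CH.1 (4 % for L1, 50 % local for L2). The function name `amat_two_level` is chosen for this illustration.

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, penalty_l2):
    """Average memory access time (clock cycles) for an L1/L2 hierarchy."""
    # An L1 miss costs an L2 access plus, for L2 misses, a main memory access.
    miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1


result = round(amat_two_level(1, 0.04, 10, 0.5, 200), 3)
print(result)   # 5.4
```

The inner term is 10 + 0.5 × 200 = 110 clock cycles, so the result is 1 + 0.04 × 110 clock cycles.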

Read over Write

Assume a direct mapped write-through cache with 512 slots, and a four-word write buffer that is not checked on a read miss.

SW R3, 512(R0)    ; store word: mem[512] := R3    (cache slot 0)
LW R1, 1024(R0)   ; load word:  R1 := mem[1024]   (cache slot 0)
LW R2, 512(R0)    ; load word:  R2 := mem[512]    (cache slot 0)

Read-after-write hazard: The data in R3 is placed in the write buffer. The first load causes a read miss. The cache line is discarded. The second load again causes a read miss. If the write buffer has not completed writing R3 into memory, the second load will read an incorrect value from mem[512].

[Figure: CPU, Cache, Write Buffer, Memory]

Read over Write
Solutions to previous problem

Read misses wait until write buffer is empty,
  and thereafter the required memory block is fetched into cache.

Check contents of write buffer,
  if referenced data not in buffer let the read-access continue fetching the block into the cache. Write buffer is flushed later when memory system is available.
  (giving reads priority over writes)

Also applicable to write-back caches. The dirty block is put into a write buffer that allows inspection in case of a read miss. Read misses check the buffer before directly going to memory.
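The hazard and its two solutions can be illustrated with a toy model of main memory behind a write buffer. Everything in the sketch (the class `BufferedMemory`, the stored value 42, the default of 0 for unwritten memory) is a made-up assumption for illustration. It only models where a read miss gets its data from.

```python
class BufferedMemory:
    """Main memory behind a write buffer (toy model)."""

    def __init__(self, check_buffer_on_read_miss):
        self.memory = {}          # address -> value
        self.write_buffer = []    # pending (address, value) pairs, oldest first
        self.check = check_buffer_on_read_miss

    def store(self, address, value):
        self.write_buffer.append((address, value))   # memory is updated later

    def drain(self):
        for address, value in self.write_buffer:
            self.memory[address] = value
        self.write_buffer.clear()

    def read_miss(self, address):
        if self.check:
            # The newest pending write to this address wins.
            for pending_address, value in reversed(self.write_buffer):
                if pending_address == address:
                    return value
        return self.memory.get(address, 0)           # possibly stale


def run(check_buffer, wait_for_empty_buffer):
    mem = BufferedMemory(check_buffer)
    mem.store(512, 42)           # SW R3, 512(R0) with R3 = 42
    mem.read_miss(1024)          # LW R1, 1024(R0): the cache line is discarded
    if wait_for_empty_buffer:
        mem.drain()              # solution 1: read miss waits for the buffer
    return mem.read_miss(512)    # LW R2, 512(R0)


print(run(False, False))   # hazard: stale value 0
print(run(False, True))    # solution 1: 42
print(run(True, False))    # solution 2: 42
```

Solution 2 returns the correct value without waiting for the buffer to drain, which is why it gives reads priority over writes.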

Address Translation
What addresses are cached, virtual or physical addresses?

A fully virtual cache uses logical addresses only, a fully physical cache uses physical addresses only.

[Figure: CPU | virtual address | virtual cache | virtual address | translation (segment tables / page tables / TLB) | physical address | memory]

[Figure: CPU | virtual address | translation | physical address | physical cache | physical address | memory]

Address Translation

Fully virtual cache:
No address translation time on a hit
Cache must have copies of protection information
  Protection info must be fetched from page/segment tables.
Cache flush on process switch
  Identical virtual addresses of different processes usually refer to different physical addresses.
Shared Memory
  Different virt. addresses refer to same phys. address. Copies of same data in cache.

Fully physical cache:
Works very well for shared memory accesses
Always address translation (time)
  Hits are of no advantage regarding address translation

Address Translation
Solution: get the best from both virtual and physical caches

Two issues in accessing a cache:

Indexing the cache
  that is, calculating the target set (or slot with direct mapping)

Comparing tags
  comparing the tag field with (parts of) the block address

The page offset (the part that is identical in both virtual and physical address space) is used to index the cache. In parallel, the virtual part of the address is translated into the physical address and used for tag comparison. Improved hit time.
→ virtually indexed, physically tagged cache

Address Translation
Virtually indexed, physically tagged cache

[Figure: The CPU issues a virtual address consisting of page number and page offset (including the word offset). The page offset indexes the cache (tags, data). The page number goes through translation (TLB, page table) and yields the frame address of the physical address (frame address, offset), which is used for the tag comparison. Cache and translation connect to the next memory level.]
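A minimal sketch of how index and tag are obtained in a virtually indexed, physically tagged cache. The page size, block size, number of sets and the contents of `PAGE_TABLE` are hypothetical values chosen so that index and block offset together fit into the page offset. None of them come from the lecture.

```python
PAGE_OFFSET_BITS = 12    # 4 KiB pages (assumed)
BLOCK_OFFSET_BITS = 6    # 64-byte blocks (assumed)
INDEX_BITS = 6           # 64 sets (assumed): index + block offset = page offset

# Hypothetical page table: virtual page number -> physical frame number
PAGE_TABLE = {0x10: 0x7, 0x20: 0x7, 0x30: 0x9}


def cache_fields(virtual_address):
    """Return (set index, physical tag) for a virtual address."""
    page_offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    # The index is taken from the untranslated page offset,
    # so the cache lookup can start immediately.
    index = (page_offset >> BLOCK_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    # The tag is the translated frame number (hardware does this in parallel).
    frame = PAGE_TABLE[virtual_address >> PAGE_OFFSET_BITS]
    return index, frame


shared_a = cache_fields(0x10ABC)
shared_b = cache_fields(0x20ABC)
other = cache_fields(0x30ABC)
print(shared_a, shared_b, other)   # (42, 7) (42, 7) (42, 9)
```

Pages 0x10 and 0x20 map to the same frame (shared memory), so both virtual addresses select the same set and carry the same physical tag. Only one copy of the data ends up in the cache.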

Cache Optimization

Summary of basic cache optimizations

Technique               Hit time   Miss penalty   Miss rate   Complexity   Comment
Larger block size                  -              +           0            Trivial
Larger cache capacity   -                         +           1            Widely used for L2
Higher associativity    -                         +           1            Widely used
Multi-level cache                  +                          2            Costly hardware; harder if L1 block size ≠ L2 block size
Read over write                    +                          1            Widely used
Address translation     +                                     1            Widely used

+ = improves a factor, - = hurts a factor, blank = no impact

Data from [HP06 p.C-39]

Exam Computer Architecture
Date: March 9th, 2007 (09.03.2007)
Time: 8:30 hrs
Location: ST 025/118 (Duisburg - Ruhrort!)