
Computer Architecture: a qualitative overview of Hennessy and Patterson

Philip Machanick

September 1998; corrected December 1998; March 2000; April 2001


contents

Chapter 1 Introduction ...................................................................1


1.1 Introduction ........................................................................... 1
1.2 Major Concepts ...................................................................... 2
1.2.1 Latency vs. Bandwidth .............................................................2
1.2.2 What Computer Architecture Is ...............................................3
1.2.3 The Quantitative Approach ......................................................4
1.2.4 How Performance is Measured ................................................4
1.3 Components of the Course .................................................... 6
1.4 The Prescribed Book ............................................................. 6
1.5 Structure of the Notes ............................................................ 6
1.6 Further Reading ..................................................................... 7

Chapter 2 Performance Measurement and Quantification ............9


2.1 Introduction ........................................................................... 9
2.2 Why Performance is Important .............................................. 9
2.3 Issues which Impact on Performance .................................. 10
2.4 Change over Time: Learning Curves and Paradigm Shifts . 12
2.4.1 Learning Curves ....................................................................12
2.4.2 Paradigm Shifts ......................................................................13
2.4.3 Relationship Between Learning Curves and Paradigm Shifts 14
2.4.3.1 Exponential ...............................................................15
2.4.3.2 Merced (EPIC, IA-64, Itanium) ................................16
2.4.3.3 Why Paradigm Shifts Fail .........................................17
2.5 Measuring and Reporting Performance ............................... 18
2.5.1 Important Principles ..............................................................20
2.6 Quantitative Principles of Design ........................................ 20
2.7 Examples: Memory Hierarchy and CPU Speed Trends ...... 21
2.8 Further Reading ................................................................... 22
2.9 Exercises .............................................................................. 23

Chapter 3 Instruction Set Architecture and Implementation .......25
3.1 Introduction ......................................................................... 25
3.2 Instruction Set Principles: RISC vs. CISC .......................... 26
3.2.1 Broad Classification ................................................................28
3.3 Challenges for Pipeline Designers ....................................... 29
3.3.1 Causes of Pipeline Stalls .......................................................30
3.3.2 Non-Uniform Instructions ......................................................31
3.4 Techniques for Instruction-Level Parallelism ..................... 32
3.5 Limits on Instruction-Level Parallelism .............................. 35
3.6 Trends: Learning Curves and Paradigm Shifts .................... 35
3.7 Further Reading ................................................................... 37
3.8 Exercises .............................................................................. 37

Chapter 4 Memory-Hierarchy Design .........................................39


4.1 Introduction ......................................................................... 39
4.2 Hit and Miss ........................................................................ 40
4.3 Caches .................................................................................. 41
4.4 Main Memory ...................................................................... 44
4.5 Trends: Learning Curves and Paradigm Shifts .................... 45
4.6 Alternative Schemes ............................................................ 47
4.6.1 Introduction ...........................................................................47
4.6.2 Direct Rambus .......................................................................48
4.6.3 RAMpage ...............................................................................49
4.7 Further Reading ................................................................... 51
4.8 Exercises .............................................................................. 51

Chapter 5 Storage Systems and Networks ...................................53


5.1 Introduction ......................................................................... 53
5.2 The Internal Interconnect: Buses ......................................... 54
5.3 RAID ................................................................................... 54
5.4 I/O Performance Measures .................................................. 55
5.5 Operating System Issues for I/O and Networks .................. 55
5.6 A Simple Network ............................................................... 56
5.7 The Interconnect: Media and Switches ............................... 56
5.8 Wireless Networking and Practical Issues for Networks .... 57
5.9 Bandwidth vs. Latency ........................................................ 57
5.10 Trends: Learning Curves and Paradigm Shifts .................... 59

5.11 Alternative Schemes ............................................................ 60
5.11.1 Introduction ...........................................................................60
5.11.2 Scalable Architecture for Video on Demand .........................61
5.11.3 Disk Delay Lines for Scalable Transaction-Based Systems ..62
5.12 Further Reading ................................................................... 63
5.13 Exercises .............................................................................. 64

Chapter 6 Interconnects and Multiprocessor Systems .................67


6.1 Introduction ......................................................................... 68
6.2 Types of Multiprocessor ...................................................... 68
6.3 Workload Types .................................................................. 69
6.4 Shared-Memory Multiprocessors (Usually SMP) ............... 69
6.5 Distributed Shared-Memory (DSM) .................................... 71
6.6 Synchronization and Memory Consistency ......................... 72
6.7 Crosscutting Issues .............................................................. 73
6.8 Trends: Learning Curves and Paradigm Shifts .................... 74
6.9 Alternative Schemes ............................................................ 75
6.10 Further Reading ................................................................... 75
6.11 Exercises .............................................................................. 76

Chapter 7 References ...................................................................77

Chapter 1 Introduction

Computer Architecture is a wide-ranging subject, so it is useful to find a focus to make it


interesting and to make sense of the detail.

The modern approach to computer architecture research is to quantify as much as possible,
so much of the material covered is about measurement, including how to report
results and compare them. These notes however aim to provide a more qualitative overview
of the subject, to place everything in context. The quantitative aspect is however important
and is covered in the exercises.

1.1 Introduction

The prescribed book contains a lot of detail. The aim of these notes is to provide a specific
spin on the content as well as to provide a more abstract view of the subject, to help to make
it easier to understand the detail.

The main focus of this course is understanding the following two related issues:

• the conflict between achieving latency and bandwidth goals

• long-term problems caused by differences in learning curves

This introductory chapter explains these concepts, and provides a starting point for under-
standing how the prescribed book will be used to support the course.

First, the concepts are defined and explained. Next, this chapter goes on to break the
course down into components, and then the components are related to contents of the book.
Finally, the structure of the remainder of the notes is presented.

1.2 Major Concepts

This section looks at the main focus of the course, latency vs. bandwidth, and also intro-
duces some of the major concepts in the course: what architecture is, the qualitative princi-
ple of design, and how performance is estimated or measured.

1.2.1 Latency vs. Bandwidth

Loosely, latency is efficiency to the user, bandwidth is overall efficiency of the system.

More accurately, latency is defined as time to complete a specific operation. Bandwidth


(also called throughput) is the number of units of work that can be completed over a specific
time unit. These two measures are very different. Completing a specific piece of work
quickly to ensure minimum latency can be at the cost of overall efficiency, as measured by
bandwidth.

For example, a disk typically takes around 10ms to perform an access, most of which
is time to move the head to the right place (seek time), and to wait for rotation of the disk
(rotational delay). If a small amount of data is needed, the minimum latency is achieved if
only that piece of data is read off the disk. However, if many small pieces of data are needed
which happen to be close together on the disk, it is more efficient to fetch them all at once
than to do a separate disk transaction for each, since less seek and rotational delay time is
required in total. An individual transaction is slower (since the time to transfer a larger
amount of data is added onto it), but the overall effect is more efficient use of the available
bandwidth, or higher overall throughput.
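To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python (the 10ms access time, 20MB/s transfer rate, block size and read count are illustrative assumptions, not figures from the book or from any particular disk), comparing many small reads done one transaction at a time with a single batched read of the same data:

    # Illustrative figures only: ~10 ms positioning overhead (seek + rotational
    # delay) per transaction, and a 20 MB/s transfer rate once positioned.
    ACCESS_TIME = 0.010      # seconds of seek + rotational delay per transaction
    TRANSFER_RATE = 20e6     # bytes per second during the transfer itself

    def transaction_time(nbytes):
        """Time for one disk transaction: fixed positioning cost plus transfer."""
        return ACCESS_TIME + nbytes / TRANSFER_RATE

    block = 4096             # size of each small piece of data (bytes)
    n = 64                   # number of nearby pieces needed

    separate = n * transaction_time(block)   # one transaction per piece
    batched = transaction_time(n * block)    # all pieces in one transaction

    print(f"latency of one small read  : {transaction_time(block)*1e3:6.2f} ms")
    print(f"latency of the batched read: {batched*1e3:6.2f} ms (worse for the first piece)")
    print(f"total time, separate reads : {separate*1e3:6.1f} ms")
    print(f"total time, batched read   : {batched*1e3:6.1f} ms")
    print(f"throughput, separate reads : {n*block/separate/1e6:.2f} MB/s")
    print(f"throughput, batched read   : {n*block/batched/1e6:.2f} MB/s")

The batched transaction has higher latency than a single small read, but delivers far higher effective bandwidth, which is exactly the tension described above.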

Balancing latency and bandwidth requirements in general is a hard problem. Issues


which make it harder include:

• latency has to be designed in from the start since latency reduction is limited by the
worst bottleneck in the system

• bandwidth can in principle be added more easily (e.g. add more components working
in parallel; simple example: make a wider sewer pipe, or add more pipes)

• improving one is often at the expense of the other (as in the disk access example)

• different technologies have different learning curves (the rate at which improvements
are made) which tends to invalidate design compromises after several generations of
improvements

1.2.2 What Computer Architecture Is

Computer Architecture, broadly speaking, is about defining one or more layers of abstrac-
tion in defining a virtual machine.

At a coarse level, architecture is divided into software and hardware architecture. Soft-
ware architecture defines a number of layers of abstraction, including the operating system,
the application programming interface (API), user interface and possibly higher-level soft-
ware architectures, like object-oriented application frameworks, or component software
models. This course is concerned primarily with hardware architecture, though software is
occasionally mentioned.

Hardware architecture again can be divided into many layers, including:

• system architecture—interaction between major components, including disks, networks,
memory, processor and interconnects between the components

• I/O architecture—how devices such as disks interact with the rest of the system (often
including software issues)

• instruction set architecture (ISA)—how a programmer (or more correctly in today’s
world, a compiler) sees the machine: what the instructions do, rather than how they are
implemented

• computer organization—how low-level components interact, particularly (but not
only) within the processor

• memory hierarchy—interaction between components and other parts of the system
(including software aspects, particularly relating to the operating system)

This course covers most of these areas, but the strongest focus is on the interaction between
the ISA, processor organization and the memory hierarchy. Note that many people really
mean “ISA” when they say “architecture”, so take care to be clear on what is being dis-
cussed.

1.2.3 The Quantitative Approach

Since the mid-1980s, computer designers have increasingly accepted the quantitative
approach to computer architecture, in which measurement is made to establish impacts of
design decisions on performance.

This approach was popularized largely as a result of the RISC movement.

RISC was essentially about designing an ISA which made it easier to achieve high per-
formance. To persuade computer companies to buy the RISC argument, researchers had to
produce convincing measurements of systems that had not yet been constructed to show
that their approach was in fact better. This is not to say that there was no attempt at measur-
ing performance prior to the RISC movement, but the need to sell a new, as yet untested,
idea helped to push the quantitative approach into the mainstream.

Up to that time, performance had often been measured on real systems using fairly arbi-
trary measures that did not allow meaningful performance comparisons across rival
designs.

1.2.4 How Performance is Measured

If performance of a real system is being measured, the most important thing to measure is
run time as seen by the user. Many other measures have been used, like MIPS (millions of
instructions per second, otherwise known as “meaningless indicator of performance of sys-
tems”).

Although the most important thing to the user is the program that slows down their
work, everyone wants standard performance measures. As a result, benchmarks—programs
whose run times are used to characterize performance of a system—of various forms are in
wide use. Many of these are not very useful, as they are too small to exercise major com-
ponents of the system (for example, they don’t use the disk, or they completely fit into the
fastest level of the memory hierarchy).

In the 1980s, the Standard Performance Evaluation Corporation (SPEC) was formed in
an attempt at providing a standard set of benchmarks, consisting of real programs covering
a range of different styles of computation, including text processing and numeric computa-
tion. At first, a single combined score was available, called the SPECmark, but in recent
versions of the SPEC benchmarks, the floating point and integer scores have been separated,
to allow for the fact that some systems are much stronger in one of these areas than their
competitors. Even so, the SPEC benchmarks have had problems as indicators of perfor-
mance. Some vendors have gone to great lengths to develop specialized compilation tech-
niques that work on specific benchmarks but are not very general. As a result, SPEC
numbers now include a SPECbase number, a combined score for which all
the SPEC programs were compiled using the same compiler options. Another problem with
SPEC is that it doesn’t scale up well with increases in speed. A faster CPU will not just be
used to solve the same problem faster, but also to solve bigger problems, whereas the SPEC
benchmarks use fixed-size datasets. As a result, the effect of bigger data is not captured.
A very fast processor with too little cache, for example, may score well on SPEC, but dis-
appoint when a real-world program with a large dataset is run on it.

These problems with SPEC are addressed by regular updates to the benchmarks. How-
ever, the problem then becomes one of having no basis for historical comparisons.

In the area of transaction processing, another set of benchmarks, called TPC (Transaction
Processing Performance Council), tries to address the scaling problem. The TPC benchmarks mea-
sure performance in transactions per second, but to claim a given level of TPS, the size of
the data has to be scaled up. Section 5.4 revisits the idea of scalable benchmarks which
applies not only to I/O but to the more general case as well.

If a real system does not exist, there are several approaches that can be taken to estimate
performance:

• calculation—though it’s hard to take all factors into account, it’s possible to do simple
calculations to quantify the effects of a design change; such calculations at least help
to justify more detailed investigation

• partial simulations—a variety of techniques make it possible to measure the effects of
a design change, without having to simulate a complete system

• complete simulations—possibly even including operating system code; such simulations
are slow, but may be necessary to evaluate major design changes (e.g., adoption
of a radical new instruction set)

Since performance measurement is complex, we will return to this issue in more detail in
Chapter 2.

1.3 Components of the Course

The course is broken down into five components, and a practical workshop:

1. performance measurement and quantification, which expands on the description of the
previous section
2. instruction set architecture and implementation, including pipelining and instruction-
level parallelism
3. memory-hierarchy design
4. storage systems and networks
5. interconnects and multiprocessor systems, which ties together several earlier sections

The practical workshop will focus on memory hierarchy issues.

1.4 The Prescribed Book

The prescribed book for the course is

John L Hennessy and David A Patterson. Computer Architecture: A Quantitative Approach
(3rd edition), Morgan Kaufmann, San Francisco, to be published.

Content is generally derived fairly directly from the book, though topics are sometimes
grouped slightly differently.

It is assumed that you know something of computer organization from previous courses
(e.g. the basics of a simple pipeline, what registers are, what an assembly language program
is, etc.), so sections of the book relating to these topics are handled superficially.

Testing in this course is open book, so it is important to have a copy of the book (and
not an earlier edition, which is significantly different).

1.5 Structure of the Notes

The remainder of the notes contains one chapter for each of the major components of the
course—performance measurement and quantification, instruction set architecture and
implementation, memory-hierarchy design, storage systems and networks, and, finally,
interconnects and multiprocessor systems.

Each chapter ends with a Further Reading section, and each chapter after this one con-
tains exercises, some of which are pointers to exercises in the prescribed book.

1.6 Further Reading

For further information on standard benchmarks, the following web sites are useful:

• Transaction Processing Performance Council <http://www.tpc.org/>

• Standard Performance Evaluation Corporation <http://www.spec.org/>

To find out more about a popular architecture simulation tool set, see SimpleScalar at

• <http://www.cs.wisc.edu/~mscalar/simplescalar.html>

Please also review Chapters 1 and 2 of the prescribed book.

Chapter 2 Performance
Measurement and
Quantification

Performance measurement is a complex area, often reduced to bragging rights and mislead-
ing marketing.

It’s important to understand not only what to measure, but how to present and interpret
results. Otherwise, it’s easy to mislead or be misled.

2.1 Introduction

This chapter is largely based on Chapter 1 of the Hennessy and Patterson, “Fundamentals
of Computer Design”, but with some material drawn from elsewhere to add to the discus-
sion. The material presented here aims not only to illustrate how to measure and to report
results, but also how to evaluate the significance of results in terms of changes in technol-
ogy. For this reason, issues which affect technology change are also considered.

The remainder of the chapter is broken down as follows. First, why performance is
important is considered. Then, issues which impact on performance are examined, followed
by a discussion of changes over time. Then, the sections from the book on measuring and
reporting performance (1.5) and quantitative principles of design (1.6) are summarized.
Finally, memory hierarchy and changes in CPU design are combined to illustrate the issues
and problems in predicting future performance.

2.2 Why Performance is Important

I have often heard it said that “no one will ever need a PC faster than model X, but you
shouldn’t buy model Y, it’s obsolete”. A year later, model X has moved to the obsolete slot
in the same piece of sage advice.

More speed for the same money is always easy to sell. But do people really need it?

Some might say no, it’s just like buying a faster car and then keeping to the speed limit,
but there are legitimate reasons for wanting more speed:

• you can solve bigger problems in the same time

• things that used to hold up your work cease to be bottlenecks

• you can do things that were previously not possible

Many people argue that we are reaching limits on what a consumer PC really needs to do,
but such arguments are not new. As speeds improve, expectations grow: high-speed 3D
graphics, for example, is becoming a commodity (even if the most exotic requirements are
still the province of expensive specialized equipment).
At the higher end, there is always demand for more performance, because, by defini-
tion, those wanting the best, fastest equipment are pushing the envelope, and would like
more performance to achieve their goals, and to do bigger, better things.

Always, though, the important thing to remember is that the issue of most importance
to the user is whatever slows down achieving the desired result. Never lose sight of this
when looking at performance measurement: any measures that do not ultimately tell you
whether the response time to the user is improved are at best indirect measures and at worst
useless.

2.3 Issues which Impact on Performance

Obviously, if every component of a computer can be equally sped up by a given factor, a
given task can run that many times faster. However, algorithm analysis tells us that unless
the program runs in linear time or better, we will not see an equivalent increase in the size
of problem that can be solved in the same amount of elapsed time. For example, if the algo-
rithm is O(n²), you need a computer four times as fast to solve a problem twice as big in the
same elapsed time.
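As a quick check of that claim, the short Python sketch below (purely illustrative) computes how much bigger a problem can get for a given machine speedup when running time grows as n^k:

    # For running time T(n) ~ c * n**k, a machine s times faster can solve,
    # in the same elapsed time, a problem of size n * s**(1/k).
    def size_growth(speedup, k):
        """Factor by which problem size can grow for an O(n**k) algorithm."""
        return speedup ** (1.0 / k)

    for k in (1, 2, 3):
        print(f"O(n^{k}): a 4x faster machine handles a problem "
              f"{size_growth(4, k):.2f}x as big in the same time")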

Thus, as computers become faster, it becomes paradoxically more important than
before to find good algorithms, so the performance increase is not lost. Sadly, many soft-
ware designers do not realize this: as computers get faster, time wasted on inefficiency
grows; the modern software designer’s motto is “waste transistors”.

What are the factors that make it possible to improve the speed of a computer, aside
from improving the software?

Essentially, all components of the computer are susceptible to improvement, and if any
are neglected, they will eventually dominate performance (see Section 2.5).

Here are some areas of an overall computer system that can impact on performance, and
issues a designer has to consider for each:

• disk subsystem—access time (time to find the right part of the disk, often thought of as
latency, since this is the dominant factor in time for one disk transaction) and transfer
rate are both important issues; which of the two is more important depends on how the
disk is used but access time is the harder one to improve and is generally the reason
that disks are a potential performance bottleneck; there are a number of software
issues that need to be considered as well, such as the disk driver, the operating system
and efficiency of algorithms that make extensive use of a disk; a disk also relates to the
virtual memory system (see memory below)

• network—latency again is a problem and is composed of many components, much of


which is software, including assembling packets, queueing delays and other operating
system overhead, and how the specific network protocol controls access to the net-
work; bandwidth of a given network may be limited by capacity concerns (e.g., colli-
sions may reduce the total available network traffic), physical speed of the medium,
and speed of the computers, switches and other network components

• memory—designers have to explore trade-offs in cost and speed; the traditional main
memory technology, DRAM (dynamic random access memory) is becoming increas-
ingly slow compared with processors, so designers have to use a hierarchy of different-
speed memories: the faster, smaller ones (caches) nearer the CPU help to hide the
lower speed of slower, larger ones, with disk at the bottom of the hierarchy as a paging
device in a virtual memory system

• processor (or CPU: central processing unit)—designers here focus on two key issues:
improving clock speed and increasing the amount of instruction-level parallelism
(ILP): the amount of work done on each clock tick; this is not as easy as it sounds as
programming languages are inherently sequential, and other components, particularly
the memory system, interact with the CPU in ways that make it hard to scale up perfor-
mance

• power consumption—although not strictly a component but rather an overall design


consideration, power consumption is an important factor in speed-cost trade-offs;
some design teams, such as those behind Compaq’s (formerly Digital’s) Alpha and Intel’s
Pentium and its successors, have gone for speed over low power consumption, with the result that
their designs add cost to an overall system and are not highly adaptable to mobile com-
puting, or embedded designs (e.g. a computer as part of a car, toy or appliance)

2.4 Change over Time: Learning Curves and Paradigm Shifts

What makes the issues a designer must deal with even more challenging is rapid advances
in knowledge and more challenging still the fact that different areas are advancing at differ-
ent rates. This section examines two styles of change in technology, learning curves (incre-
mental improvement) and paradigm shifts (changes to new models).

To put the two areas into context, the section concludes with a discussion of the way the
two approaches to achieving change interact.

2.4.1 Learning Curves

The general theory of learning curves goes back to the early days of the aircraft industry in
the 1930s. The idea is that a regular percentage improvement over time leads to exponential
advance.

The general form of the formula is

performance = P × B^t

where B is the rate of “learning” of improvements over a given time interval (usually a
year, but in any case the units in which elapsed time, t, is expressed), and P is a curve fit
parameter. For example, if B is 1.1, there is a 10% improvement per year.

Possibly the best-known learning curve in the computer industry is Moore’s Law,
named for one of the founders of Intel, Gordon Moore, who predicted the rate of increase
in the number of transistors on a chip. In practice, the combined effect of the increased
number of transistors (which makes it possible for more to be done at once—more ILP) and
smaller components (or reduced feature size—which makes it possible to run the chip
faster) in recent times has led to a doubling of the speed of CPUs approximately every 18
months.

By contrast, in DRAM, the focus of designers is on improving density rather than on speed.
You can buy four times as much DRAM for the same money approximately every 3
years (though the curve isn’t smooth: there are short-term fluctuations for issues like the
launch of a new operating system that uses more memory). The basic cycle time of DRAM
on the other hand only improves at the rate of about 7% per year, leading to an increasing
speed gap between CPUs and main memory.
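The learning-curve formula can be used to see how fast this gap opens up. The Python sketch below plugs in the rates quoted here (CPU speed doubling every 18 months, DRAM cycle time improving about 7% per year); treat these as rough planning figures rather than measurements:

    # performance = P * B**t, with B the improvement factor per year.
    CPU_B = 2 ** (1 / 1.5)   # doubling every 18 months: about 1.59x per year
    DRAM_B = 1.07            # about 7% per year

    def gap(years):
        """How far CPU speed pulls ahead of DRAM speed after the given time."""
        return (CPU_B ** years) / (DRAM_B ** years)

    for years in (1, 2, 5, 10):
        print(f"after {years:2d} year(s) the CPU-DRAM speed gap has grown {gap(years):5.1f}x")

On these assumed rates the gap widens by nearly 50% a year, which is the trend behind the memory-hierarchy discussion in Chapter 4.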

2.4.2 Paradigm Shifts

What happens when a learning curve hits a hard limit (like a transistor starts to become too
small in relation to the size of an electron to make sense physically)? Or what happens if
someone comes up with a fundamentally better new idea?

A learning curve is essentially an incremental development model. Although at a


detailed level, a lot of imagination may be applied to the problem of speed improvement,
reducing feature size, etc., the overall design framework remains the same.

A different development model is radical change: the theory or model being applied
changes completely. If the change is at a very deep level, it is called a paradigm shift. Essen-
tially, a paradigm shift resets the learning curve. It may start from a new base, and may have
a different rate of growth. Paradigm shifts are hard to sell for several reasons:

• they require a new mindset, and those schooled in the old way may often perversely
expend considerable energy persuading themselves that the idea is not new, or even if
it is, it can’t work

• they require a complete new set of work experience to be built up

• they may in extreme cases require that the entire infrastructure they plug into be
changed (for example, if the execution model of the CPU is radically different, it may
require new programming languages, compilers and operating systems, a new memory
hierarchy, etc.)

Even so, paradigm shifts do occur, if usually in less radical forms. For example, the RISC
movement was quite successful in reforming the previous trend towards increasing com-
plexity of instruction set design.
Here are some examples of paradigm shifts and their status:

• high-level language computing (HLL)—making the instruction set closer to high-level


languages, to make compiler writing easier and potential execution faster: died of
complexity; simpler instruction sets turned out to be easier to execute fast, and were
less tied to specifics of a given language

• microcode—the earliest computers had hard-wired logic but in the 1970s, microcode,
a very simple hardware-oriented machine language, became popular as a way of
implementing instruction sets; the idea essentially was killed when it no longer was
possible to make a ROM much faster than other kinds of memory and, as with HLL
instruction sets, simplicity won the day; today microcode is all but dead, though we
still refer to details of the architecture at a level below machine code as the microarchi-
tecture

• dataflow computing—the idea was that the order of computation should depend on
which data was ready, and not on the order instructions were written: mostly dead,
except some internal details of the Intel Pentium Pro and successors (the P6 family)
appear to be based on the dataflow model

• asynchronous execution—a typical CPU wastes a lot of time waiting for the slowest
part of the CPU (let alone other components of the system like DRAM), and the asyn-
chronous model aims to eliminate this wastage by having the CPU operate at the fast-
est possible speed, rather than working on fixed timing of a clock signal: the jury is
still out on this one, and research continues

2.4.3 Relationship Between Learning Curves and Paradigm Shifts

Why then if a paradigm shift can jump ahead of everything else, is most research incremen-
tal? Why does industry prefer small steps to a big breakthrough?

First, at a conceptual level, a paradigm shift, as noted before, can be a hard sell: it
requires a change in one’s internal mental model to understand it fully. Secondly, if it really
involves major change, it involves changing not only that particular technology but every-
thing else that goes to make the overall system (for example, the dataflow model, to work
properly, needed not only a different style of CPU but also a different style of memory).

At a practical level, a big problem, especially if B for the competing learning curve is
large, is that the paradigm shift is aiming at a moving target. The time and effort to get a
completely new approach not only accepted but into production has to be measured against
the gains the conventional approach will make. If there is any slip in the schedule, the par-
adigm shift may look unattractive, at least in its initial form. Perhaps, if its B is much larger
than that of the conventional model, it could catch up and still beat the conventional model,
but the opportunity to pick up all-important momentum and early design wins (selling the
technology to system designers) is lost.

Here are two examples, one that failed, and one that might—both relatively modest as
paradigm shifts go—followed by a wrap-up discussion of how learning curves often defeat
paradigm shifts.

2.4.3.1 Exponential

Exponential, to much fanfare, announced that they had a promising new approach to high-
speed processors. In 1995, they announced that they would soon produce a 500MHz ver-
sion of the PowerPC, which would make it one of the fastest clock speeds of any shipping
processor. Their big breakthrough was in a new strategy for combining Bipolar and CMOS
logic (two alternative strategies for creating components on a chip).

Historically, Bipolar has been used for high-speed circuits and CMOS where density or
power consumption is a concern. Traditional microprocessors use CMOS, though a few
core components may use Bipolar to increase speed. Exponential claimed they could do
Bipolar as the main technology, with CMOS in particular areas where less speed (or lower
power) was required, in a new form of BiCMOS—combined Bipolar and CMOS. This, they
claimed, allowed them to achieve much higher clock speeds without the power consump-
tion and cost penalty usually associated with Bipolar logic.

Exponential’s big problem was time-to-market. Their 500MHz goal sounded impres-
sive even 3 to 4 years later, but their processor was very simple in other areas, like a low
number of instructions per clock cycle (IPC). Moreover, their timing was defeated by
Moore’s Law. While their claims looked good in 1995, by mid-1997 when they were finally
ready to ship, the IBM-Motorola consortium almost had the PowerPC 750 ready. While the
PPC 750 started at much lower clock speeds (initial models were shipped with 233MHz and
266MHz), other details of the microarchitecture gave them unexpectedly good perfor-
mance. Not only that, the PPC 750 was a very low-power-consumption design, which gave
it strong cost advantages over Exponential, not to mention making it usable in both note-
books and desktop systems.

Exponential did deliver in the end, but slightly too late: Moore’s Law, the learning-
curve law, defeated their relatively modest paradigm shift. If they had been in volume pro-
duction 6 months earlier, it may have been different: they would then have been on a learn-
ing curve as well, and likely one with the same learning rate, B, as the rest of the industry.

2.4.3.2 Merced (EPIC, IA-64, Itanium)

Intel’s new architecture, a 64-bit processor running a new instruction set called IA-64, is
another attempt at breaking out of the mold. The first version of the IA-64 family, code-
named Merced (and now being marketed as Itanium), appears to be in trouble, since its
delivery schedule has slipped at least 5 years.

Intel claims that the new design addresses the problem of obtaining more instruction-
level parallelism (an issue we’ll cover in more detail later when we do pipelines in the next
chapter of these notes). The approaches they are using include:

• predication—an instruction can be flagged as only to be executed if a particular condition
is true, reducing the need to have branch instructions around short pieces of code

• very long instruction word (VLIW)—several instructions are grouped together in one
big unit, to be executed at once: this is an approach which has been used before but
which Intel claims to have done better

• explicitly parallel instruction computing (EPIC)—extra information generated by the


compiler tells the processor which instructions may be executed in parallel

• very large number of registers—most current RISC designs have 32 general-purpose


(i.e., integer) registers and 32 floating-point registers, while Merced has many more

• speculative loads—a load instruction can be set up so that if it fails (e.g. because of a
page fault or invalid address), it is ignored, so a load can be set up long before it’s actu-
ally needed; if the data is actually needed, it will have been fetched long enough before
it’s needed that cache misses can be hidden

Intel (and HP, which has also worked on the design and may have thought of most of the new
ideas) claims that Itanium will be able to execute many more instructions in parallel than
existing designs. However there are doubters. Most of the things introduced at best are sim-
plifications of things that can be done anyway. VLIW was not a big success before, and the
new model, while better in some ways, may not address enough of the problems (the big-
gest one: consistently finding enough work to parcel together in big multi-operation
instructions to be worth the effort). Predication, while a good idea, turns out essentially to
require pretty much the same hardware complexity as speculative execution (see Chapter 3,
where we look at complex pipelines). A higher number of registers is good, but existing
architectures get the same effect with register renaming which, while extra complexity,
means registers can be encoded using fewer bits in an instruction. Speculative loads are a
good idea, but it’s not clear how big a difference they will make. The value of the EPIC idea
depends on still-to-be-tested compilers, so it’s unclear how big a win it will be.
The deep underlying problem about all this is that there is little evidence to support the
view that there is a lot of untapped parallelism left to exploit at the local level (i.e. a few
instructions apart). A combination of existing methods with new compiler-based
approaches to identify parallelism further apart (e.g. in different procedures) might be more
profitable. Or it might be that most current programs are simply not written in a style that
exposes parallelism. Or maybe the problems we currently solve on computers have been
chosen for their essentially sequential nature because that’s what we think computers can
do best. A final “or maybe”: maybe our existing programming languages aren’t really
designed to expose parallelism. In summary, there are a lot of “or maybes” in deciding
whether the IA-64 approach will be a win. The fact that Intel has shifted the shipping date
several times suggests that even if it is a good idea, they will have a problem with making
the great leap ahead of the competition that they’d hoped for.

Since this is a widely-reported example, it is followed up in more detail in Section 3.5.

2.4.3.3 Why Paradigm Shifts Fail

In general, why are paradigm shifts so hard? As suggested before, there are perception
problems, but it goes deeper than this.

Even if the paradigm shift increases the base from which the technology is judged,
increases the rate of improvement B faster than older technologies, or both, there is usually
an extra cost associated with getting a new technology working. Existing design, design
verification and manufacturing techniques may not work, resulting in delays in initial
launch. Other required new technologies may delay the launch too (or even fail, making the
promise of the new technology hollow). In the Merced case, for example, heavy reliance is
placed on new compiler techniques, which some doubt because the original VLIW move-
ment also ran into problems in producing good enough compilers. In the case of Exponen-
tial, although the basic design appeared to deliver, the rival PowerPC 750, although
ostensibly a simpler design, delivered better performance, because the details were fine-
tuned to the workloads typically run on a Power Macintosh system.

Another example of the required infrastructure defeating a cool new idea is Charles
Babbage’s 19th-century mechanical computers, which pushed the envelope of machine-
tool technology too hard. Working examples of his machines have been built in recent
times, showing that the designs were fundamentally sound.

The kinds of paradigm shifts described here are relatively shallow paradigm shifts:
changes at the level seen by the system designer in the Exponential case, by the compiler
writer as well in the Merced case. Deeper paradigm shifts of the kind that change the way
we work or use computers in some ways seem harder to justify, yet they also can be easier
to sell. All that’s required is that a large enough segment of the market adopt the idea to give
it critical mass. Winning on the price-performance treadmill against Moore’s Law is less
important in such cases, as a qualitative change may be the selling point, or an improvement
in usability. Yet even here, most people prefer the paradigm shift to be disguised as incre-
mental change (hence the greater success of Microsoft’s small step at a time approach to
converting to window-based programs, versus their major competitors’ all at once
approaches—even if today the standard by which Windows remains judged is how well it
has imitated its predecessors).

Another issue though which can save a paradigm shift is if different parts of a system
have very different learning curves. This is one of the reasons that hardwired logic made a
comeback in the 1980s. Prior to that time, it was possible to make a much faster (if small)
ROM than any practical RAM, which favoured microcode as an instruction set implemen-
tation strategy. However, memory technology changed, and faster RAMs made it hard to
achieve competitive performance with microcode. Whatever speed of microstore (as the
memory used for microcode was called) could be used, an equally fast if not faster memory
could be used for caches.

Mismatches in learning curves in different parts of a system therefore present potential


opportunities for paradigm shifts at the implementation level, particularly performance-ori-
ented changes in design models.

2.5 Measuring and Reporting Performance

Read and understand Section 1.5 of the prescribed book. This section contains a few high-
lights to guide you.

First, the most important measure to the user is elapsed time, or response time. Look at
the example given of output from a UNIX time command and note that CPU time is but one
component of the elapsed (“wall clock”) time, and can be a relatively small fraction at that.

The discussion on benchmarks is interesting: make sure you understand in particular


why real programs make the best benchmarks, and how the SPEC benchmarks are reported
on (including the use of geometric means).

Note the definitions of latency and throughput (also called bandwidth, particularly in
input-output, I/O). Note also the discussion on comparing times, and what it means to say
“X is n times the speed of Y”. Page ahead to other definitions of speed comparison, includ-
ing speedup, and what we mean when we say something is n% faster than something else.
Write definitions here (in all cases, for X and Y, run times are respectively t_X and t_Y):

X is n times the speed of Y

speedup of X with respect to Y

X is n% faster than Y

Important: “how much faster” is a difference measure, whereas “speedup” is a multiplicative
measure (or a ratio). Make sure that you have this distinction clear. To be sure there
is no confusion, I recommend sticking with exactly two measures:

• speedup—original time / improved time — a ratio

• percent faster—(original time – improved time) / (original time) × 100 — a difference measure

A very common error is forgetting to subtract 100% from a difference measure. For exam-
ple, if one machine’s run time is half that of another, it is 50% faster—not 100% faster. If
it was 100% faster, it would take no time at all to run.
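To keep the two measures straight, here is a minimal Python sketch of the two recommended definitions (the run times are made-up numbers for illustration):

    def speedup(t_original, t_improved):
        """Ratio measure: original time / improved time."""
        return t_original / t_improved

    def percent_faster(t_original, t_improved):
        """Difference measure: fraction of the original time saved, as a percentage."""
        return (t_original - t_improved) / t_original * 100

    t_y, t_x = 10.0, 5.0   # machine X runs the program in half machine Y's time
    print(f"speedup of X with respect to Y: {speedup(t_y, t_x):.1f}")   # 2.0, a ratio
    print(f"X is {percent_faster(t_y, t_x):.0f}% faster than Y")        # 50%, a difference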

2.5.1 Important Principles

There are two important guiding principles in computer system design:

• make the common case fast—it’s hard to make everything fast (and expensive) so
when making design choices, favour the things that happen most often; not only is this
where the biggest gains are to be had, but the frequent cases are often simpler

• measure everything when calculating the effect of an improvement—note that the
speedup formula, also called Amdahl’s Law, requires that you take into account the
entire system when measuring the effect of speeding up part of it: work through the
given examples (see Exercise 2.3 for an example not involving computers of how mis-
leading calculating speedup without taking all factors into account can be)

These two principles illustrate how hard the system designer’s job can be. Focussing only
on the obvious bottlenecks and common operations can only take you so far before the less
common things start to dominate execution time. This was a lesson learnt the hard way in
the Illiac-IV supercomputer design at the University of Illinois, where I/O was neglected in
the quest for maximizing CPU speed, with the result that the system spent a large fraction
of its time waiting for peripherals instead of computing. The Illiac fiasco was one of
Amdahl’s inspirations in devising his famous law.
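As a minimal sketch of the speedup formula (Amdahl’s Law) with invented numbers: if an enhancement speeds up only a fraction of the total time, the rest of the system limits the overall gain.

    def overall_speedup(fraction_enhanced, speedup_enhanced):
        """Amdahl's Law: overall speedup when only part of the time is sped up."""
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    # Illustrative case: an enhancement makes 40% of the run time 10x faster.
    print(f"overall speedup: {overall_speedup(0.4, 10):.2f}")   # about 1.56, not 10
    # Even an infinite speedup of that 40% cannot do better than 1 / 0.6:
    print(f"upper bound    : {1 / 0.6:.2f}")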

2.6 Quantitative Principles of Design

Some fundamentals out of the way, what are the issues that contribute to run time? Let’s
start with a simple model of CPU performance; more detail will follow in later chapters.

The CPU performance equation in its simplest form is given on p 38 of Hennessy and
Patterson:

time_CPU = CPU cycles for a program × clock cycle time

or

time_CPU = CPU cycles for a program / clock rate
Note also how, for a given program, the CPU time can further be decomposed into clock
cycle time, number of instructions executed (instruction count: IC) and clock cycles per
instruction (CPI)—and what factors impact on these three measures.
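As a worked sketch of the equation (the instruction count, CPI and clock rate below are invented for illustration), CPU time can be computed from the IC × CPI decomposition:

    def cpu_time(instruction_count, cpi, clock_rate_hz):
        """CPU time = IC * CPI / clock rate (equivalently IC * CPI * cycle time)."""
        return instruction_count * cpi / clock_rate_hz

    ic = 2_000_000_000     # 2 billion instructions executed
    cpi = 1.5              # average clock cycles per instruction
    clock = 1e9            # 1 GHz clock rate

    print(f"CPU time              : {cpu_time(ic, cpi, clock):.2f} s")   # 3.00 s
    # Halving CPI (more ILP) has the same effect as doubling the clock rate:
    print(f"CPU time with CPI 0.75: {cpu_time(ic, 0.75, clock):.2f} s")
    print(f"CPU time at 2 GHz     : {cpu_time(ic, cpi, 2e9):.2f} s")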

The other forms of formula are given for cases where it’s useful to distinguish time used
by different types of instruction: a given program will use an instruction mix, frequencies
of execution of the available instructions. Note that the frequency can be measured in three
different ways:

• static frequency—the frequency of each instruction in a program, as measured from


the code without running it

• dynamic frequency—the fraction of all instructions executed that a given instruction


executes

• fraction of run-time—fraction of execution time accounted for by a given instruction,


which is essentially the dynamic frequency weighted for the relative time a given
instruction takes to execute

It is important to distinguish these three measures and avoid confusing them.
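The Python sketch below shows how a dynamic instruction mix, combined with a CPI for each instruction class, gives both the average CPI and the fraction of run time per class; the mix and CPI values are invented purely for illustration:

    # Hypothetical dynamic frequencies (fraction of instructions executed) and CPIs.
    mix = {
        "ALU":    {"freq": 0.50, "cpi": 1.0},
        "load":   {"freq": 0.20, "cpi": 2.0},
        "store":  {"freq": 0.10, "cpi": 2.0},
        "branch": {"freq": 0.20, "cpi": 1.5},
    }

    average_cpi = sum(c["freq"] * c["cpi"] for c in mix.values())
    print(f"average CPI: {average_cpi:.2f}")

    # Fraction of run time is the dynamic frequency weighted by relative cost.
    for name, c in mix.items():
        share = c["freq"] * c["cpi"] / average_cpi
        print(f"{name:6s}: {c['freq']:.0%} of instructions, {share:.0%} of run time")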


Work through the given examples, including the examples of instruction design alter-
natives. Also note the discussion on measuring CPU performance. Although the book refers
to CPI, most currently shipping processors can (at least at peak) complete more than one
instruction per clock cycle, which leads to a new measure, instructions per clock (IPC),
which is just the reciprocal of CPI (1/CPI).

Section 1.7 of the book is a good introduction to pipelining issues; if you want to be
prepared for coverage of these concepts later in the course, work through this section
now.

2.7 Examples: Memory Hierarchy and CPU Speed Trends

To put everything in context, here is a discussion of memory hierarchy, and why memory
hierarchy is becoming an increasingly important issue, given the current speed (learning
curve) trends.

Generally, it is easier to make a small memory fast. More components can be used per
bit, switching speeds can be made faster, and transmission lines kept short. All of this adds
up to more cost and more power consumption as well. Fortunately a principle called locality
of reference allows a designer to build memory hierarchically, with a small, fast memory at
the top of the hierarchy designed to have the highest rate of usage, with larger, slower mem-
ories lower down designed to be used less often. Ideally, the effect should be as close as
possible to the cost of the cheapest memory, and the speed of the fastest. A cache today is
typically built in at least two levels (32Kbytes or more at the first level or L1; 256Kbytes or
more at the second level, L2).

A cache is a small, fast memory located close to the CPU (even in many cases, on the
same chip for L1). Caches are usually made with static RAM, which uses more components
per bit than DRAM, but is faster.

A cache is organized into blocks (often also called lines), and a block is the smallest unit
fetched into a cache. When a memory reference is found in the cache, it is called a hit, oth-
erwise a miss occurs. In the event of a miss, the pipeline may have to stall for a miss penalty,
an extra amount of time relative to a hit (note that some current designs have non-blocking
misses, i.e., the processor carries on executing other instructions as long as possible while
waiting for a miss but for now, assume a miss has a fixed penalty). When we cover the memory
hierarchy, we will go into more detail on caches. For now it is sufficient to think through how to
apply a simple variation of Amdahl’s Law to caches, and to predict the impact of cache
misses on overall CPU performance.
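As a sketch of that simple Amdahl-style calculation (the figures below are invented, chosen to be different from the exercise at the end of the chapter), the effective CPI is just the base CPI plus the average number of memory-stall cycles per instruction:

    def effective_cpi(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty):
        """Base CPI plus the average stall cycles added per instruction."""
        return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty

    # Hypothetical machine: base CPI 1.0; one instruction fetch plus 0.3 data
    # references per instruction (1.3 memory references in total); 5% miss rate;
    # 40-cycle miss penalty.
    cpi = effective_cpi(1.0, 1.3, 0.05, 40)
    print(f"effective CPI: {cpi:.2f}")                        # 3.60
    print(f"slowdown versus a perfect cache: {cpi / 1.0:.2f}x")

Doubling the miss penalty in this sketch pushes the effective CPI from 3.6 to 6.2, which gives a feel for why the widening CPU-DRAM gap matters.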

To return to the theme of the course, let’s consider the impact of increasing the CPU
clock speed (equivalently, reducing the cycle time) while keeping memory speed constant.
Although this is not in practice what is happening (CPU speed doubles roughly every 18
months, while memory speed improves at 7% per year), the effect of the differences in
growth rates is that the cost in lost instructions for a DRAM reference doubles roughly
every two years. Exercise 2.4 requires that you explore the effect of doubling a miss penalty.
The kind of numbers you should see in the exercise should be an indication of why caches
are becoming increasingly important—as well as why you do not see a linear speed gain
with MHz, when the memory hierarchy is not also sped up.

2.8 Further Reading

For further information on standard benchmarks, the following web sites as noted in the
previous chapter are useful:

• Transaction Processing Performance Council <http://www.tpc.org/>

• Standard Performance Evaluation Corporation <http://www.spec.org/>

Another good site to look for information is Microprocessor Report’s web site (Micropro-
cessor Report is a trade magazine, widely read by system designers and architecture
researchers; mostly considered unbiased but MPR appears to have rather oversold the IA-
64, considering its schedule slips):

• <http://www.chipanalyst.com/>

Dulong [1998] discusses some features of the Intel IA-64 architecture, and some perfor-
mance results for limited aspects of the architecture have been published [August et al.
1998]. VLIW is briefly discussed in Section 4.4 of the book, starting on page 278.
Some researchers [Wulf and McKee 1995, Johnson 1995] have claimed that we are
approaching a memory wall, in which all further CPU speed improvements will be limited
by DRAM speed. RAMpage is a new strategy for memory hierarchy, in which the lowest
level of cache becomes the main memory and DRAM becomes a paging device, to improve
effectiveness of the lowest-level SRAM [Machanick 1996, Machanick and Salverda 1998,
Machanick et al. 1998, RAMpage 1997].

Ted Lewis [1996a] discusses learning curves in more detail.

2.9 Exercises

2.1. In a car magazine, separate times are given for acceleration from a standing start to
various speeds, time to reach a quarter mile or kilometre from a standing start, braking
time from various speeds and overtaking acceleration (e.g. time for 80 to 100 km/h). Since
all these numbers are smaller for better performance, it might be tempting for someone
from a computer benchmarking background to add them all for a composite score.
a. Think of some reasons for doing this, and for not doing this.
b. What do these arguments tell you about the trend in computer benchmarks to come up
with a single composite number (MIPS, SPECmark, SPECint, SPECfp, Bytemark) as
opposed to reporting all the individual scores?

2.2. Explain why a geometric mean is used for combining SPEC scores. Give an example
where an arithmetic mean would give unexpected results.

2.3. Here is an example of Amdahl’s law in real life. The trip by plane from Johannesburg
to Cape Town takes 2 hours. British Airways, hoping to muscle in on the local market,
brings a Concorde in, and cuts the flying time to 1 hour. Calculate the speedup of the
overall journey given the following additional data, which remains unchanged across the
variation in flying time:
• time to get to airport: 30 min
• time to check in: 30 min
• time to board plane: 20 min
• time to taxi to takeoff: 10 min
• time to taxi after landing: 10 min
• time to leave plane: 30 min
• time to collect luggage: 20 min
• time from airport to final destination: 30 min
If a revolutionary new transport technology was invented which could pick you up from
your home, but where the actual travel time from Johannesburg to Cape Town was 4 hours,
what would the speedup be:
a. with respect to both the original travel time, and
b. the Concorde travel time?
c. Now restate both answers as a percentage faster.

2.4. Redo the following example with double the miss penalty (50 clock cycles):

Base CPI 2.0; the only data accesses are loads and stores, which make up 40% of all instructions;
miss penalty is 25 clock cycles, miss rate is 2%.
• the CPU time relative to instruction count (IC) and clock cycle time is:
(CPU cycles + memory stall cycles) × clock cycle time
where memory stall cycles are
IC × memory references per instruction × miss rate × miss penalty =
IC × (1 + 0.4) × 0.02 × 25

• so total execution time is


(IC × 2.0 + IC × 0.7) × clock cycle time =
2.7 × IC × clock cycle time

a. How much slower is the machine now compared with before (choose an appropriate
difference or ratio measure), and
b. compared with the case where there are no misses (leave out the miss cycles from the
given calculation)?
c. How much must the miss rate be reduced to, to achieve the same performance as when
the miss penalty was 25 cycles?

Chapter 3 Instruction Set
Architecture and
Implementation

There’s much confusion about the difference between RISC and CISC architectures, and
whether there’s such a thing as a CISC architecture with RISC-like features.

It’s important to understand what these various concepts mean from the perspective of
the hardware designer, which provides greater clarity about the technical issues from the
perspective of the user.

3.1 Introduction

This chapter covers principles of instruction set architecture, i.e., design choices in creating
instruction sets and the architecture seen in programs (registers, memory addressing, etc.),
and covers implementation issues, especially as regards how the ISA relates to perfor-
mance. Issues covered include pipelining and instruction-level parallelism.

It is assumed here that you have a basic understanding of the major components of the
microarchitecture: registers, pipelines, fetch, decode and execution units. If you don’t,
please review this material; Section 1.7 of the book is a nice summary of the basics of pipe-
lining. The major focus here is on issues which impact on improving performance: those
factors which are influential in pushing the learning curve, inhibiting progress, and limiting
eventual speed improvements. Material for this chapter of the notes is drawn from Chapters
2, 3 and 4 of the book.

This chapter aims to lead you to understanding new development in the field, including
benefits and risks of potential new ideas. You should aim by the end of the chapter to be in
a position to evaluate designers’ claims for future-generation technologies critically, and to
be able to assess which claims are most likely to be delivered.

The remainder of the chapter is structured as follows. Section 3.2 covers instruction set
design principles, and clarifies the differences between RISC and CISC. Section 3.3 dis-
cusses issues arising out of the previous section as relate to pipeline design, with some for-
ward references to issues in Section 3.4, which covers techniques for instruction-level
parallelism. Section 3.5 discusses factors which limit instruction-level parallelism, with
discussion both of current designs and potential future designs. To tie everything together,
Section 3.6 relates the chapter to the focus of the course, longer-term trends.

The chapter concludes with the usual section on further reading.

3.2 Instruction Set Principles: RISC vs. CISC

This section summarizes Chapter 2 of the book. You should make sure you are familiar with
the content as this is essential background. However, the general trend towards RISC-style
designs reduces the need to be aware of a wide range of different architectural styles.

What is RISC? The term reduced instruction set computer originated from an influen-
tial paper in 1980 [Patterson and Ditzel 1980], which made a case for less complex instruc-
tion sets. However, the term has become so abused that some (including the authors of the
book) argue instead for the term “load-store instruction set”. The reason for this is that the
RISC movement was fundamentally about simpler instruction set architectures (ISA) to
make it easier to implement performance-oriented features like pipelines. Since the idea
originated, some have perverted it to imply that implementation features like pipelines are
“RISC-like”. By this definition, almost every processor designed with performance in mind
since the 1970s would be “RISC-like”. For example, some argue that the Pentium and its
successors are “RISC-like” because they make aggressive use of pipelines. Yet the com-
plexity of implementation of pipelines on these processors exactly illustrates the RISC
argument: if Intel’s instruction set had been designed for ease of implementing pipelines (which
would have made it more “RISC-like”), they would have been able to achieve the performance
of the Pentium, Pentium Pro, Pentium II, etc. much more easily. The fact that these proces-
sors are implemented using techniques common to RISC processors doesn’t make them
“RISC-like”; in fact, the complexity of the implementation is exactly why they are not
“RISC-like”.

What then are features that the RISC movement has identified as critical to ease of
achieving performance goals? Here are some:

• load-store instruction set—memory is only accessed by load instructions (move data
from memory to a register) and store instructions (move data from a register to memory)

• minimal addressing modes—have as few as possible, ideally only one, way of con-
structing a memory address in a load or store instruction

• large set of registers—have a large number of registers (32 or more)

• general-purpose registers—don’t dedicate registers to a specific purpose (e.g. tie a
specific register or subset of registers to a specific instruction or instruction type)

• fixed instruction length—instructions are all the same length even if this sometimes
wastes a few bits

• similar instruction execution time—as far as possible, instructions take the same
amount of time in the execute stage of the pipeline

There are some variations between RISC processors, but all are pretty similar in these
respects. Others have gone further in simplicity. Most pre-RISC processors had condition
codes: special flags that could be set by many different instructions, and tested in condi-
tional branches. Some RISC designs instead store the result of a test instruction in a register.
This way, a boolean operation is no different from other arithmetic, and the order in which
instructions appear is less important. This brings us to design goals behind the list of fea-
tures of RISC machines:

• no special cases—if all instructions can be handled the same way as far as possible
except for things which have to be different (e.g. some reference memory, whereas
others use the floating-point unit, etc.), a pipeline’s logic can be kept simple and easy
to implement

• minimize constraints on order of execution—again an aid to the pipeline designer: if
ordering constraints are minimized, instructions can be reordered (either statically, by
the compiler, or dynamically, by the hardware) for most efficient processing by the
pipeline

• minimize interactions between different phases of instruction execution—again to aid
pipelining: for example, if an instruction can both address memory and do arithmetic,
some aspect of the ALU (arithmetic and logic unit) may be needed in the address
computation, making processing of two successive instructions complicated

• keep instruction timing as predictable as possible—complex addressing modes or
instructions that can execute a variable number of iterations (e.g. string
operations) make it hard for the pipeline to predict in advance how long an instruction
will take to execute

With these issues in mind, let us now review common ISA features, contrasting the RISC
approach with the Intel IA-32 (since the new 64-bit Itanium or Merced design was
announced, Intel has differentiated the old 32-bit processor line as IA-32, as opposed to the
Merced IA-64) approach.

3.2.1 broad classification

Since the way registers are used is central to the organization of work in the CPU, the broad-
est classification of an ISA is often in terms of the way registers are used:

• accumulator—an accumulator is a register used both as a source and a destination; the
display of a pocket calculator is an example of this strategy. An accumulator architec-
ture can have low memory traffic, since the instructions can be compact (often the
accumulator is not specified as it is the only register for that instruction type), but a
larger fraction of the memory traffic is data moves to and from memory, since the
accumulator cannot store results non-destructively

• stack—not strictly a register machine at all (though some stack machines used a few
registers to represent the topmost elements of the stack for enhanced speed): a stack
machine purely does arithmetic using the topmost item or items on the stack, and data
is moved to or from ordinary memory by pop and push instructions, respectively; a
stack machine can have very compact instructions since operands are not explicitly
specified, but can have a lot of memory traffic since the right things have to be on top
of the stack for each operation

• register-memory—instructions can combine a memory access and use of a register to
perform an operation

• general-purpose register (GPR)—registers can be used for any operation without arbi-
trary limit; in its purest form this is a load-store architecture, in which data can only be
moved to or from memory by stores and loads (respectively), and no registers (except
possibly the stack pointer, used for the procedure-call stack, and the program counter,
used to keep track of the next instruction to execute) are dedicated to a specific pur-
pose

Figure 2.2 on p 102 of the book contains examples of variations on these definitions,
and Figure 2.4 on p 104 summarizes the advantages and disadvantages of three
variations.

A RISC architecture falls into the last category, the GPR machine, with a load-store
instruction set.

Which category does the Intel IA-32 fall into?

Curiously enough, several. Although it looks a bit like a GPR instruction set in that
some registers can be used for most operations, some registers are dedicated to specific
operations, which makes it look more like an accumulator machine. The floating-point unit on
the other hand uses a stack architecture, and there are also some register-memory instruc-
tions. Since the IA-32 fails some of the other critical tests for a RISC architecture, such as
having multiple memory addressing modes and variable-length instructions, it is clearly not
a RISC design.

Why is it so complex? When its earliest versions, the 8086 and 8088, came out, processor
design was not as well understood as today, and complex instruction sets were common.
Also, memory was expensive, and an instruction set that reduced memory requirements by
doing multiple things in one instruction and by making it possible to have shorter instruc-
tions for common operations was seen as desirable. In addition, the IA-32 evolved over
time, and earlier design decisions could not easily be undone, since older programs relied
on them.

The rest of Chapter 2 of the book covers addressing modes, type and size of operands,
encoding an instruction set, and compiler issues (all of which you should understand). The
example of the MIPS architecture is instructive because it is one of the simplest RISC archi-
tectures, and makes it relatively easy to see the principles without extraneous detail.

3.3 Challenges for Pipeline Designers

Given that RISC is meant to make life easier for the pipeline designer, what are the issues
that make life harder in designing a pipeline?

There are two major concerns:

• issues that stall the pipeline

• issues that make it complicated to treat instructions uniformly

Let’s take each of these in turn. If you have forgotten the basics of pipelining, please review
Sections 1.7 and 3.1 of the book. Note that Hennessy and Patterson tend to use a specific
five-stage pipeline in their books but other organizations are possible (e.g., some PowerPC
models use a 4-stage pipeline, and both the MIPS R4000 line and the latest designs from
AMD have a deeper pipeline).

3.3.1 Causes of Pipeline Stalls

In the ideal case, a pipeline with n stages (assuming for now a simple scalar model, in which
at most one instruction is issued per clock) can complete one instruction per clock cycle,
with a clock cycle which (excluding the overheads of pipelining) is 1/n of the cycle time of
a non-pipelined machine. In other words, an ideal pipeline can give approximately n times
speedup over a non-pipelined implementation, all else being equal. In practice, there is some
overhead in the pipeline, and keeping the clock consistent across multiple stages becomes a
problem (clock skew across the pipeline results in timing problems between events at either
end of the pipeline). These issues limit the practical scalability of the pipeline (i.e., the
maximum useful size of n) even in the ideal case where the pipeline can be kept busy constantly.

Unfortunately a number of problems can arise in attempting to keep the pipeline going
at full speed. The general term for a cycle on which the pipeline does nothing useful is a
stall; specifically, a stall refers to the case where the pipeline is idled. There are also
cases in recent designs, like speculative execution, where the pipeline appears to be busy but
the work it does is discarded, not necessarily with any stalls involved; even in a simple
pipeline, a piece of work may be started and then abandoned, which results in wasted cycles
that are not called stalls. The best way of thinking of a stall is as a bubble in the pipeline.
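
To see how stalls erode the ideal n-times speedup, here is a minimal sketch in Python of the simplest form of the speedup calculation (it ignores latch overhead and clock skew; the function and parameter names are mine):

    # Speedup of an n-stage pipeline over an unpipelined machine, assuming the
    # unpipelined machine takes n cycles per instruction and the pipeline ideally
    # takes 1 cycle per instruction plus any stall cycles.
    def pipeline_speedup(depth, stall_cycles_per_instruction):
        return depth / (1.0 + stall_cycles_per_instruction)

    print(pipeline_speedup(5, 0.0))   # 5.0: the ideal case
    print(pipeline_speedup(5, 0.5))   # ~3.33: half a stall cycle per instruction

The modified speedup formula referred to below (pp 53-57 of the book) refines this simple picture.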

A stall is most commonly caused by a hazard, an event that prevents execution from
continuing unhindered. See Section 1.7 of the book for a description of the three categories
of hazards:

• structural—limits on hardware resources prevent all currently competing requests
from being granted

• data—a data dependency prevents completion of an instruction before another in the
pipeline completes

• control—a change in order of execution (anything that changes the PC)

Make sure you understand the modified speedup formula and examples on pp 53-57. Read
through and understand the material up to p 69, but note that modern designs are moving
increasingly to dynamic scheduling, i.e., the hardware can reorder instructions to reduce
hazards, as is covered in Section 3.4. Compiler-based approaches are limited in that it’s
hard to generalize them (i.e. to go beyond specific cases), and some cases are hard to detect
statically (e.g., they depend on whether a given branch is taken or not). Make sure you
understand the diagrams used to illustrate pipeline behaviour, and how forwarding works.
The description of the implementation of DLX is useful for later examples, but concen-
trate on the examples: read through and understand the rest of the chapter.

3.3.2 Non-Uniform Instructions

Variations in instructions—whether in the length of time an instruction takes to execute, how
it addresses memory, or the length or format of the instruction—make life difficult for the
pipeline designer. Let’s look at each of these issues in turn.

From the pipeline designer’s point of view, if each pipeline stage takes exactly the same
amount of time, this is the ideal case. Everything can flow smoothly through the pipeline.
As one instruction moves on to decoding, the next can be fetched, and similarly for the execute
and later stages. Not only is the design easy, but time isn’t wasted. If, however, one instruction takes four clock
cycles in its execute stage, others have to wait for it to complete (unless more sophisticated
techniques are used, as in Section 3.4). For example, some floating point instructions are
hard to execute in one clock cycle, and there’s little that can be done about this, except pos-
sibly converting each instruction to a sequence of simpler ones. However, other kinds of
instruction can have artificially variable execution times (for example, some
processors have instructions that can move a variable number of bytes, where the byte count
is also an operand). The pipeline designer in this case has to use complex logic to spot some
cases and vary the behaviour to suit them, possibly at the expense of increasing cycle time
(because of the extra logic that has to spot these special cases early enough to handle them).

Allowing many variations in memory addressing modes causes similar problems. The
pipeline designer has to find out early enough that an addressing mode that may cause the
pipeline to stall is involved, and stop other instructions. Again, the extra complexity may
make it hard to keep the cycle time down.

Variations in instruction length or format are also a serious problem for the pipeline
designer, especially for superscalar architectures (again, see Section 3.4), where multiple
instructions must be loaded at once. Just finding instruction boundaries becomes a problem,
and inhibits the clean separation between pipeline stages for loading an instruction and
decoding it (the instruction fetch unit has to have some idea what kind of instruction was
just fetched to know if it has fetched the right number of bytes for the current instruction,
and to know if it has started fetching the next instruction at the right place in memory).

3.4 Techniques for Instruction-Level Parallelism

There are two major approaches for increasing ILP in a pipelined machine:

• deeper pipeline—for example, the MIPS R4000 has an 8-stage pipeline, and is called
superpipelined: unlike a standard pipeline, some conceptually atomic operations are
split into two stages, particularly memory references (including instruction fetches)

• superscalar execution—also called multiple issue—“issue” is the transition from
decode to execute in the pipeline

The two alternative strategies introduce different kinds of problem. A superpipelined archi-
tecture doesn’t overlap exactly similar phases of execution of different instructions, but a
long pipeline introduces more overhead (the overhead between stages) and makes it diffi-
cult to maintain a consistent clock across stages (clock skew occurs when timing across the
stages is significantly different as a result of propagation delays in the clock signal).
A multiple-issue architecture requires multiple functional units, the units that carry out
a specific aspect of executing an instruction. A functional unit may
be an integer unit or a floating-point unit (sometimes also a more specialized component).
Since integer and floating-point operations are in any case done by separate functional
units, it is a relatively cheap step to allow the two to be dispatched at once. Another common
split is to allow a floating point multiply and add to be executed simultaneously. This kind
of split, a functional ILP, is relatively cheap because it doesn’t require a duplicated func-
tional unit, though more bandwidth from the cache and more sophisticated internal CPU
buses are required. A more sophisticated form of superscalar architecture is one which
allows similar ILP: in this case, functional units have to be duplicated.

In any case, when more than one instruction can be issued, the opportunities for
resource conflicts, and hence hazards, increase. A superscalar architecture therefore pre-
sents more challenges for pipeline scheduling, choosing the best possible ordering for
instructions. Early superscalar architectures required that the compiler do the scheduling
(i.e., static scheduling). However, static scheduling has two big drawbacks: some decisions
require runtime knowledge (e.g. whether a branch is taken, or whether two different
memory-referencing instructions actually reference the same location), and the ideal sched-
ule for a given superscalar pipeline may not be ideal for a different implementation. For
example, a lot was said about optimizing code specifically for the Pentium when it first
appeared; any work put into this area was not particularly relevant for the Pentium Pro and
its successors.

Dynamic scheduling uses runtime information to determine the order of issue, accord-
ing to available resources and hazards. A dynamic scheduling unit is of approximately the
same complexity as a functional unit, which makes it possible to weigh the trade-offs
in deciding when to go to dynamic scheduling: when dynamic scheduling would
increase the parallelism actually achieved above the level achievable with static scheduling,
it becomes worth sacrificing the silicon area that an additional functional unit would have needed.

As with many ideas in RISC microprocessors, dynamic scheduling owes much to work
done by Seymour Cray in the 1960s, on the Control Data 6600. His dynamic scheduling
unit was called a scoreboard. Make sure you understand the presentation of the scoreboard
in Section 4.2 of the book, as well as the alternative, Tomasulo’s algorithm. Interestingly,
some of the ideas have started to reappear as novel solutions in the microprocessor world,
including register renaming.

Also read and understand Section 3.3 (branch prediction) and Section 3.5, which deals with
speculative execution.

The Intel IA-64 design (code-named Merced; first implementation Itanium; see 4.7)
represents an attempt at improved exploitation of ILP. Some of the techniques used
include:

• VLIW—several instructions are packaged together as one extra-long instruction (ini-
tially 3 instructions of 40 bits in a 128-bit bundle; the remaining 8 bits are used by the
compiler to identify parallelism)

• explicitly parallel instruction computing (EPIC)—instructions are tagged with infor-
mation as to which can be executed at once, both within a VLIW package and in
surrounding instructions

• predicated instructions—if the predicate with which an instruction is tagged is false, the
instruction is treated as a NOP (see the sketch after this list)

• high number of registers—there are 64 predicate, 128 general-purpose (integer) and
128 floating-point registers, to reduce the need for renaming

• speculative loads—a load instruction can appear before it’s clear that it’s needed (e.g.
before a branch), followed later by an instruction that commits the load. If the load
would have resulted in a page fault or error, the interrupt is deferred until the commit
instruction, which may not in fact be executed (depending on the branch outcome):
this is a form of non-blocking prefetch
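
To make the idea of predication concrete (see the bullet on predicated instructions above), here is a toy model in Python. It is purely illustrative: the register and predicate names are invented for the sketch, and nothing here reflects actual IA-64 syntax or encodings.

    # Toy model of predicated execution: both arms of an if/else are issued,
    # but an operation whose predicate is false behaves as a NOP.
    regs = {"r1": 10, "r2": 3, "r3": 0}
    preds = {}

    def cmp_gt(p_true, p_false, a, b):   # compare, setting two complementary predicates
        preds[p_true] = regs[a] > regs[b]
        preds[p_false] = not preds[p_true]

    def add(p, dst, a, b):               # executes only if predicate p is true
        if preds[p]:
            regs[dst] = regs[a] + regs[b]

    def sub(p, dst, a, b):
        if preds[p]:
            regs[dst] = regs[a] - regs[b]

    # if (r1 > r2) r3 = r1 + r2; else r3 = r1 - r2;  with no branch at all
    cmp_gt("p1", "p2", "r1", "r2")
    add("p1", "r3", "r1", "r2")          # NOP if p1 is false
    sub("p2", "r3", "r1", "r2")          # NOP if p2 is false
    print(regs["r3"])                    # 13, since r1 > r2

Note that both the add and the sub are fetched and issued regardless of the outcome of the comparison, which is the cost referred to under “wins from predication” below.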

While the IA-64 contains some interesting ideas, it is not clear until it actually ships in
volume whether they will be a win. Each time the predicted shipping date slips, the proba-
bility of its being faster than competing designs is reduced (it became available in limited
quantities in 2000).
Some potential problems with the design include:

• VLIW—VLIW is not a new idea and most architecture researchers consider it to have
flopped; whether Intel can produce a compiler that can do what is implied by the EPIC
model is an open question

• IA-32 support—a design goal is to support the existing Intel architecture as well;
whether this can be done on the same chip without compromising performance
remains untested

• wins from predication—work published to date is not convincing. Given the short typ-
ical basic block in integer code, most predicated instructions will have to be executed
speculatively, which is not much different from a conventional branch-based imple-
mentation of conditional code

• value of increased register count—while eliminating the need for register renaming is
a win, increasing the bits needed from 5 (for 32 registers) to 7 (for 128 registers) is a
significant cost, especially as ALU operations use three registers (i.e., the bits to
encode registers are increased by 6 in an ALU instruction)

This brings us to the question of what limitations designers are in fact up against.

3.5 Limits on Instruction-Level Parallelism

Early studies on ILP showed that the typical available ILP is around 7 [Wall 1991], reflect-
ing the fact that integer programs typically have small basic blocks (a sequence of code with
only one entry and one exit point), often only about 5 instructions.

Floating-point programs typically have much larger basic blocks and can in principle
have higher ILP but, even so, there can be low limits unless extra work is done to extract par-
allelism.

One of the hardest problems is dealing with branches. Straight-line sequences of code
(as we will see in Chapter 4) can be fetched from the memory system quickly but any
change in ordering makes it hard for the memory system to keep up. The combination of
branch prediction and speculative execution helps to reduce the times the CPU stalls
(although it may sometimes waste instruction execution when it misspeculates—as is also
of course the case with IA-64’s predicated instructions). Gains from speculation and pred-
ication can occur in conditional code, and for speculation, also in loops. If branch predic-
tion is done right, the combined effect is that of unrolling a loop, an optimization a compiler
can also sometimes perform.

One of the issues Merced is aimed at addressing is increasing the window size—the
number of instructions examined as a unit—over which ILP is sought. The Intel view is that
a compiler can work with a larger window than the hardware can. Figure 3.45 in the book
illustrates the effect of window size. Notice how the first three benchmarks, integer pro-
grams, have relatively little ILP for any realistic window size. The floating-point programs,
while better, range from 16 to 49 for a window size of 512, which is probably unrealistically
large.

Make sure you understand the issues in Section 3.6 of the book.

3.6 Trends: Learning Curves and Paradigm Shifts

Some architecture researchers believe that Intel is on the right track in looking for compiler-
based approaches to increase parallelism, but also that Intel is on the wrong track in looking
relatively locally for parallelism (in a bundle of instructions or its neighbours).

For example, in a limit study (a study which drops constraints, without necessarily
being practical) carried out at the University of Michigan in 1998, it was found that many
data hazards are an artifact of the way the stack is used for stack frames in procedure calls.

After the return, the stack frame is popped, and the next call uses the same memory.
What appears to be a data dependency across the calls is in fact a false dependency: one
which could be removed by changing the way memory is used. In this study, a very much
higher degree of ILP was found than in previous studies. Although the study does not reveal
how best to implement this high degree of ILP, it does suggest that moving from a sequen-
tial procedural execution model to a multithreaded model may be a win. This reopens an
old debate in programming languages: at what level should parallelism be expressed? Here
are some examples:

• procedural—implement threads or another similar mechanism as a special case of a
procedure or function (e.g. Java)

• instruction-level—find ILP in conventional languages or better still in a language
designed to expose fine-grained parallelism, but leave it to the compiler rather than the
programmer to find the parallelism

• package—add parallel constructs as part of a library (typically the way threads etc. are
implemented in object-oriented languages like C++ or Smalltalk, if they are not a
native feature as in Java)

• statement-level—a variant on the ILP model, in which the programmer is more con-
scious of the potential for parallelism, but any statement (or expression in a functional
language) may potentially be executed in parallel

In practice, the biggest market is for code which is already compiled. Recompilation or,
worse still, rewriting in a new language, puts the architect into paradigm-shift mode, which
remains a hard sell for now. If limits on ILP become a real problem, it’s possible that models
that require new languages may start building a following, but this is a prediction that has
been made often in the past and it didn’t happen (unless you call switching from FOR-
TRAN to C++ a paradigm shift, which it is at a certain level).
An intermediate model, a new architecture that requires a recompile, is easier to sell if
there is a transition model. The Alpha proved that it could be done: it came with a very effi-
cient translator of VAX code, which allowed VAX users to transition to the Alpha without
having to recompile (sometimes hard: the source could be lost, or could be in an obsolete
version of the language).

Rather than incorporate an IA-32 unit in the IA-64, it may be a better strategy for Intel
to write a software translator from IA-32 to IA-64. But the IA-64 project would be a lot
more convincing if something shipped. Lateness is never good for paradigm shifts, espe-
cially when others have a learning curve they are sticking to and are not late.

In this respect, Intel should try to learn from Exponential (see Subsection 2.4.3.1).

3.7 Further Reading

The first paper to use the RISC name appeared in an unrefereed journal [Patterson and
Ditzel 1980], indicating that influential ideas sometimes appear in relatively lowly places.
An early study of ILP [Wall 1991] put the limit relatively low, reflecting the small basic
blocks in integer programs. Look for more recent papers in Computer Architecture News
and the two big conferences, ISCA and ASPLOS, for work which raises the limits by relax-
ing assumptions or improved software or architectural features.

On the Intel IA-64 relatively little has been published but you can find some preliminary
results on aspects of the EPIC idea [August et al. 1998], and an overview of limited aspects
of the architecture [Dulong 1998].

3.8 Exercises

3.1. Redo the example of pp 260-261 of the book, but this time with
a. window size of the speculative design of 32
b. window size of 64 again, but with double the miss penalty
c. What do the results tell you about the importance of memory in scaling up the CPU?
d. What do the results tell you about the effect of window size?

3.2. Do the following questions from the book:
a. 1.19, 3.7, 3.8, 3.11, 3.17, 3.13.

3.3. Find at least one paper on superscalar design (particularly one featuring speculation
and branch prediction), and compare it against the reported results for the EPIC idea
[August et al. 1998]. Do the results as reported so far look promising compared with other
approaches?

Chapter 4 Memory-Hierarchy
Design

Although speeding up the CPU is the high-profile area of computer architecture (some even
think of the ISA as what is covered by the term “architecture”), memory is increasingly
becoming the area that limits speed improvement.

Consider for example the case where a processor running at 1GHz is capable of issuing
8 instructions per clock (IPC). That means one instruction for every 0.125ns. An SDRAM
main memory takes about 50ns to get started, then delivers subsequent references about
every 10ns. A rough ballpark figure for moving data into the cache is about 100ns (the exact
figure depends on the block size and any buffering and tag setup delays). If 1% of instruc-
tions result in a cache miss, and the pipeline otherwise never stalls, the processor actually
executes at the rate of 1.125ns per instruction—9 times slower than its theoretical peak rate.
Or to put it another way, instead of an IPC of 8, it has a CPI of 1.125.
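
The arithmetic of this example is worth being able to reproduce; a minimal sketch in Python (the variable names are mine):

    # Effective time per instruction when cache misses stall an otherwise ideal CPU.
    clock_ns        = 1.0     # 1GHz clock
    peak_ipc        = 8       # instructions issued per clock at peak
    miss_rate       = 0.01    # fraction of instructions that miss in the cache
    miss_penalty_ns = 100.0   # time to bring the data into the cache from DRAM

    ideal_ns  = clock_ns / peak_ipc                      # 0.125ns per instruction
    actual_ns = ideal_ns + miss_rate * miss_penalty_ns   # 1.125ns per instruction
    print(actual_ns / ideal_ns)                          # 9.0 times slower than peak
    print(actual_ns / clock_ns)                          # effective CPI of 1.125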

4.1 Introduction

This chapter covers the major levels of memory hierarchy, with emphasis on problem areas
that have to be addressed to deal with the growing CPU-DRAM speed gap. The book takes
a more general view; please read Chapter 5 of the book to fill in gaps left in these notes.

This chapter starts with a general view of concepts involved in finding a given reference
in any hierarchy of the memory system: what happens when it’s there, and when it’s not.
Next, these principles are specialized to two areas of the hierarchy, caches and main mem-
ory. Issues in virtual addressing are dealt with in the same section as main memory,
although there are references back to the discussion on caches (as well as in the cache sec-
tion). The Trends section outlines the issues raised by the growing CPU-DRAM speed gap,
which leads to a section on alternative schemes: different ways of looking at the hierarchy,
as well as improvements at specific levels.

39
4.2 Hit and Miss

Any level in the memory system except the lowest level (furthest from the CPU: biggest,
slowest) has the property that a reference to that level (which can be either a read or a write)
may either hit or miss. A hit is when the memory location referenced is present at that level;
a miss is when it is not.

Memory systems are generally organized into blocks, in the sense that a unit bigger than
that addressed by the CPU (a byte or word) is moved between levels. A block is the mini-
mum unit that the memory hierarchy manages (in some cases, more than one block may be
grouped together). At different levels of the hierarchy, a block may have different names.
In a cache, a block is sometimes also called a line (the names are used interchangeably, but
watch out for slightly different definitions). In a virtual memory system, a block is called a
page (though some schemes have a different unit called a segment—often with variations
on what this term exactly means). There are also differences in terms for activities: in a vir-
tual memory (VM) situation, a miss is usually called a page fault. There are however some
common questions that can be asked about any level of the hierarchy (this list adds to the
book’s list on p 379):

1. block placement—where can a block be placed when moved up a level?

2. block identification—how is it decided whether there is a hit?

3. block replacement—which block should be replaced on a miss?

4. write strategy—how should consistency be maintained between levels when something
changes?

5. hardware-software trade-off—are the contents of the level (particularly the replacement
strategy) managed in hardware or in software?

Here are some other common concepts that make a memory hierarchy work.
The principle of locality makes it possible to arrive at the happy compromise of a large,
cheap memory with a small, fast memory closer to the CPU, with the overall effect of a cost
closer to the cheaper memory with speed closer to the expensive memory. Locality is
divided into spatial locality: something close to the current reference is likely to be needed
soon, and temporal locality: a given reference is likely to be repeated in the near future. A
combined effect of the two kinds of locality is the idea of a working set: a subset of the
entire address space that it is sufficient to hold in fast memory. The size of the work-
ing set depends on over how long a time window you measure. Where the speed gap
between the fast level and the next level down is small, the working set is measured over a
smaller time than if the gap is large. For example, in a first-level (L1) cache with a reason-
ably fast L2 cache below it, the working set could be measured over a few million instruc-
tions, whereas with DRAM, where the cost of a page fault runs to millions of instructions,
a working set may be measured over billions of instructions.
To conclude this section, here is a little more basic terminology. Since performance is
a function of hits and misses between levels, designers are interested in the miss rate, which
is the fraction of references that miss at a given level. The miss penalty is the extra time a
miss takes relative to a hit. Read 5.1 in the book and make sure you understand everything.
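
Miss rate and miss penalty combine with the hit time into the standard average memory access time formula the book uses; a minimal sketch in Python (the example numbers are illustrative):

    # Average memory access time for one level of the hierarchy.
    def amat(hit_time, miss_rate, miss_penalty):
        return hit_time + miss_rate * miss_penalty

    # e.g. a 1-cycle hit, 2% miss rate and 25-cycle miss penalty:
    print(amat(1.0, 0.02, 25.0))   # 1.5 cycles per reference on average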

4.3 Caches

Make sure you understand the four questions as presented in the book. Here are a few addi-
tional points to note.

Most caches until fairly recently were physically addressed, i.e., the virtual address
translation had to be completed before the tag lookup could be completed. Some caches did
use virtual addresses, so the physical address was only needed on a miss to DRAM. The
advantage of virtually addressing the cache is that the address translation is no longer on
the critical path for a hit, which cache designers aim to make as fast as possible, to justify
the expensive static RAM they use for the cache (and of course to make the CPU as fast as
possible). However, virtually addressing a cache causes a problem with aliases, different
virtual addresses that relate to the same physical page. One use of aliases is copy on write
pages, pages that are known to be initially the same, but which can be modified. A common
use of copy on write is initializing all memory to zeros. Only one page actually needs to be
set to zeros initially; the copy-on-write mechanism is then set up so that as soon as
another page is modified, it is given its own copy, set to all zeros except for the bytes that
the write modifies.
The problem for a virtually addressed cache is that the alias has to be recognized in case
there is a write. The MIPS R4000 has an interesting compromise in which the cache is vir-
tually indexed but physically tagged. In this approach, the virtual address translation can be
done in parallel with the tag lookup. In general, virtual address translation, when it is cou-
pled with cache hits, has to be fast. The standard mechanism is a specialized cache of page
translations, called the translation lookaside buffer, or TLB (from the weird name, we can
conclude that IBM invented the concept)—of which more in the next section.

Read through Section 5.2 in the book, and make sure you understand the issues. In par-
ticular, work through the performance examples.

Also go on to 5.3, and understand the issues in choosing between the various strategies
for reducing misses. Make sure you understand the causes and differences between com-
pulsory (or cold-start) misses, capacity and conflict misses. Another issue not mentioned in
this section of the book is invalidations. If a cached block can be modified at another level
of the system (for example in a multiprocessor system, where another processor shares the
same address space, or if a peripheral can write directly to memory), the copy in a given
level of cache may no longer be correct. In this situation, the other system component that
modified the block should issue an invalidation signal, so the cache marks the block as no
longer valid. Invalidations arise out of the problem of maintaining cache coherency, a topic
which is briefly discussed in Section 5.9 of the book, and is discussed in more detail in Sec-
tion 8.3. Another issue not covered in this part of the book (but see p 698 for further justi-
fication of the idea) is maintaining inclusion: if a block is removed from a lower level, it
should also be removed from higher levels. Inclusion is not essential (some VM systems
don’t work this way), but ensuring that smaller levels are always a subset of larger levels
makes for significant simplifications.

Two key aspects of caches are associativity and block size (also called line size). A
cache is organized into blocks, which are the smallest unit which can be transferred
between the cache and the next level down. Deciding where to place a given block when it
is brought into the cache can be simple: a direct-mapped cache has only one location for a
given cache block. In this case, replacement policy is trivial: if the previous contents of that
block are valid, it must be replaced. However, a more associative cache, one in which a
given block can go in more than one place, has a lower miss rate, resulting from more
choice in which block to replace. An n-way associative cache has n different locations in
which a given block can be placed. While associativity reduces misses, it increases the com-
plexity of handling hits, as each potential location for a given block has to be checked in
parallel to see if the block is present. It is therefore harder to achieve fast hit times as asso-
ciativity increases (not impossible: the PowerPC 750 has an 8-way associative first-level
cache). Block size influences miss rate as well. As the size goes up, misses tend to reduce
because of spatial locality. However, as the size goes up, temporal locality misses also
increase because the amount displaced from the cache on a replacement goes up. Also, as
block size goes up, the cost of a miss increases, as more must be transferred on a miss.
There is therefore an optimal block size for a given cache size and organization (possibly
to some extent dependent on the workload, since the balance between temporal and spatial
locality may vary).
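
To see how block placement and identification interact, here is a minimal sketch in Python of how a byte address is split up for a direct-mapped cache (the sizes chosen are illustrative, not from the book):

    # Split an address for a direct-mapped cache with 2**index_bits blocks
    # of 2**offset_bits bytes; the tag is what must match the stored tag for a hit.
    def split_address(addr, offset_bits=5, index_bits=7):   # 32-byte blocks, 128 blocks = 4KB
        offset = addr & ((1 << offset_bits) - 1)
        index  = (addr >> offset_bits) & ((1 << index_bits) - 1)
        tag    = addr >> (offset_bits + index_bits)
        return tag, index, offset

    print(split_address(0x12345))   # (18, 26, 5)

An n-way set-associative cache uses the same split, but the index now selects a set of n blocks, and all n stored tags in that set must be compared in parallel, which is where the extra hit-time cost of associativity comes from.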

Note also the variations on write policy. A write through cache never has dirty blocks,
blocks which are inconsistent with the level below, because all writes result in the modifi-
cation being made through to the next level. This makes a write through cache easier to
implement, especially for multiprocessor systems. However, writing through increases traf-
fic to memory, so the trend in recent designs is to implement write back caches, which only
write dirty blocks back on a replacement (and hence need a dirty bit in the cache tags). Note
also variations on policy for allocating a block on a write miss.

Most modern systems have at least two levels of cache. In such cases, cache miss rates
can be expressed as a global miss rate (the fraction of all references), or a local miss rate
(the fraction of references that reach that level). The local miss rate for an L2 cache may be
quite high, since it only sees misses from L1, which will usually be a low fraction of all
memory traffic. It can also be useful (especially since most recent designs split the instruc-
tion and data caches in L1, the I- and D-caches), to give separate miss rates for data and
instructions. It is also sometimes useful to split the read and write miss rates. Read misses
may or may not include instructions, depending on the issue under investigation. Note that
L2 or lower caches are usually not split between data and instructions. The reason for the
split at L1 level is so that an instruction fetch and a load or store’s memory reference can
happen on the same clock. With superscalar designs, doubling the bandwidth available to
the caches by splitting between I and D caches becomes even more important at L1. Since
L2 sees relatively little traffic, there is less need to use ploys like split I- and D- caches.
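
The relationship between local and global miss rates is simple but easy to misread; a minimal sketch (the numbers are illustrative):

    # Local miss rate  = misses at a level / references that reach that level.
    # Global miss rate = misses at a level / all references made by the CPU.
    l1_local = 0.04    # 4% of all references miss in L1
    l2_local = 0.50    # half of the references that reach L2 miss there

    l2_global = l1_local * l2_local
    print(l2_global)   # 0.02: only 2% of all references go past L2,
                       # even though the L2 local miss rate looks alarming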

Emphasizing the importance of caches, Section 5.4 of the book contains even more
detail on reduction of miss penalty, and 5.5 on reduction of hit time. Read through all this and
make sure you can do all the quantitative examples.

What is the answer to my question 5? Why? Can you answer all the other questions?
Make sure you understand the way cache tags are used, including the bits for the state of a
block, and to identify the address of the contents of a block.

4.4 Main Memory

Main memory has traditionally (if since the 1970s can be called “tradition”) been made of dynamic RAM (DRAM), a kind of memory
that is slower than the SRAM used for caches. Unlike SRAM, it has a refresh cycle which
slows it down, and its access method is slower.

Read the description of memory technology, noting the way a typical DRAM cycle is
organized (row and column access). Understand the discussion of improvements to DRAM
organization—of which more in Section 4.5—including the calculation of miss penalty
from the cache, wider memory and interleaved memory. Be sure you can do the quantitative
examples.

Note that the problem with any scheme that widens memory (including interleaving and
multiple banks) is that it increases the cost of an upgrade. If the full width has to be
upgraded at once, more chips are required in the minimal upgrade. Multiple independent
banks are commonly used in traditional supercomputer designs, where cost is less impor-
tant than performance. Some have hundreds, even thousands, of banks. When we do Direct
Rambus in Subsection 4.6.2, you will see that, as with the RISC movement’s borrowing of
supercomputer CPU ideas, the mass market is attempting to find RAM strategies that mimic
supercomputer ideas in cheaper packaging.

Since the 1980s, there have been various attempts at improving DRAM, taking advan-
tage of a cache’s approach to spatial locality, relatively large blocks (which are in effect a
prefetch mechanism: bringing data or code into the cache before it’s specifically refer-
enced). Since the slow part of accessing DRAM is setting up the row access, it makes sense
to use as much of the row as possible. Fast page mode (FPM) allows the whole row to be
streamed out. Extended Data Out (EDO) is a more recent refinement, by which the data
remains available when the next address is being set up. Synchronous DRAM (SDRAM)
followed EDO: the DRAM is clocked to the bus, and is designed to stream data at bus speed
after an initial setup time. All these variations do not improve the underlying cycle time, but
they make DRAM appear faster, particularly if more than one sequential access is required.
If the bus is 64 bits wide and a cache block is 32 bytes, 4 sequential accesses are required,
so making those after the initial one faster is a gain. Some caches have blocks as big as 128
bytes, in which case a streaming mode of the main memory is an even bigger win.
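
As a rough sketch of why the streaming rate after the initial setup matters, here is the block transfer time implied by the SDRAM figures quoted earlier in the chapter (about 50ns to get started, about 10ns per subsequent bus transfer); this simplification assumes the first transfer takes the full setup time and each later one takes a single bus cycle:

    # Time to move one cache block from SDRAM over the bus.
    def block_transfer_ns(block_bytes, bus_bytes=8, first_ns=50.0, next_ns=10.0):
        transfers = block_bytes // bus_bytes
        return first_ns + (transfers - 1) * next_ns

    print(block_transfer_ns(32))    # 80ns for a 32-byte block on a 64-bit bus
    print(block_transfer_ns(128))   # 200ns: 4 times the data in only 2.5 times the time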

Again note the standard 4 questions, and my 5th question.

In this instance, the memory is managed using a virtual memory (VM) model. The dis-
cussion of pages versus segments is largely of historical interest as almost all currently
shipping systems use pages (including the Intel IA-32: the segmented mode is for back-
wards compatibility with the 286). Note though the idea of multiple page sizes, available in
the MIPS and Alpha ranges. The TLB section is more important. Make sure you understand
it. The TLB is a much-neglected component of the memory hierarchy when it comes to per-
formance tuning. A program with memory references scattered at random, while having
few cache misses because the references fit in the L2 cache, may still thrash the TLB (in
one case, I found that improving TLB behaviour reduced run time by 25%).
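
A useful back-of-the-envelope number here is the TLB reach, the amount of memory the TLB can map at one time; a minimal sketch (the TLB size and page size are illustrative):

    # TLB reach = number of TLB entries x page size.
    tlb_entries = 64
    page_bytes  = 4 * 1024

    print(tlb_entries * page_bytes)   # 262144: only 256KB of mappings, so a working
                                      # set that fits a 1MB L2 can still thrash the TLB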

Work through the Alpha example, and make sure you understand the general principles.

4.5 Trends: Learning Curves and Paradigm Shifts

One of the most important issues in memory hierarchy is the growing CPU-DRAM speed
gap. Figure 5.2 in the book illustrates the growing gap. Note that this is a log scale; on a
linear scale, the growth in the gap is even more dramatic. What’s missing is a curve for
static RAM, which is improving at about 40% per year, about the same rate as CPU clock
speed (CPU speed overall is increasing faster because of architectural improvements, like
more ILP).

Given this growing speed gap, caches are becoming increasingly important. As a result,
cache designers are being forced to be more sophisticated, and larger off-chip caches are
becoming common.

Given that it’s hard (and expensive) to make a large, fast cache, it seems likely that there
will be an increasing trend towards more levels of cache. L1 will keep up with the CPU, L2
will attempt to match the CPU cycle time (at worst, a factor of 2 to 4 times slower), and L3
can be several times slower, as long as it’s significantly faster than DRAM.

If DRAM isn’t keeping up on speed, what keeps it alive? The dominant trend in the
DRAM world is improving density, the number of bits you can buy for your money. Every
3 years, you can buy 4 times the DRAM for your money—a consistent long-term trend,
even if there are occasional short-term spikes in the graph (e.g. when a new RAM-hungry
OS is released, like the initial release of OS/2 or Windows 95). For this reason, DRAM
remains the memory of choice for expandability, and for bridging the gap between cache
and the next much slower level, disk.

In Section 4.6, some alternative RAM and memory hierarchy organizations are consid-
ered. Here, a few improvements to caches are discussed.

The SRAM speed trend more or less tracks CPU clock speed (about 40% faster per
year), but what of architectural improvements, particularly increased ILP? How can a cache
keep up? The L2 (or even lower levels) in general can’t: the best that can be done is to keep
up with CPU clock speed. The fastest L2 caches in use today are clocked at the CPU clock
speed (for example, some PowerPC versions, particularly upgrade kits, have caches
clocked at CPU speed—if CPU speed is somewhat slower on the PPC range than its Intel
competitors), and L1 cache is generally clocked at the same speed as the CPU clock. Sev-
eral techniques can be used to handle ILP. Where contiguous instructions are fetched (the
common case, even if branches are frequent), a wider bus from the cache is a sufficient
mechanism. The fact that loads and stores do not occur on every instruction puts less pres-
sure on the data cache (recall that there are usually separate I and D caches at L1). However,
the fact that memory referencing instructions may at times refer to very different parts of
memory (e.g. when accessing simple variables or following pointers, as opposed to multi-
ple contiguous array references, or accessing contiguously-stored parts of an object) makes
a wider path from D-cache less of a win than is the case for the I-cache. For the I-cache,
there is also the problem of instruction fetches on branches. To handle both problems, there
is a variety of possible options:

• multiported caches—a multiported memory can handle more than one unrelated refer-
ence to different parts of the memory at once

• multibanked caches—like a main memory system with multiple banks, a multibanked
cache can support unrelated memory access to different banks, staggered to avoid
doing more than one transaction on the bus simultaneously

• multilateral caches—an idea under investigation at the University of Michigan: the
idea is to partition data into multiple caches, allowing the possibility that a cache hit
can happen simultaneously in more than one cache, without the complexities of multi-
porting or multiple banks

All of these approaches have different advantages and disadvantages. A multiported cache
is the best approach if the cost is justified, because it is the most general approach, but the
other approaches are cheaper. Multiple banks are only a win if unrelated references happen
to fall within different banks (the probability of this depends on the number of banks—
more is better—and the distribution of the references). A multilateral cache also relies on
being able to partition references, making reasonable predictions of which references are
likely to occur in instructions which are close together in the instruction stream.

4.6 Alternative Schemes

4.6.1 Introduction

There are many variations on DRAM being investigated. In general, the variations do not
address the underlying cycle time of DRAM, but attempt to improve the behaviour of
DRAM by exploiting its ability to stream data fast once a reference has been started. A key
idea derives from the fact that the traditional RAM access cycle starts with a RAS signal,
which makes a whole row of a 2-dimensional array of bits available. Once the row is avail-
able, it’s a relatively quick operation, as in traditional RAM access, to select a specific bit
(CAS). Meanwhile, the remainder of the row is wasted. The improved DRAM models
attempt to use the remainder of the row creatively (in effect caching it for future references).

This section only reviews the one most likely to become commonly accepted, Direct
Rambus, or Rambus-D. To show how else the problem of slow DRAM may be addressed,
an experimental strategy called RAMpage is also briefly reviewed.

4.6.2 Direct Rambus

Direct Rambus, or Rambus-D, is a development of the original Rambus idea.

Rambus, as originally specified, was a complete memory solution including the board-
level design of the bus, a chip set and a particular DRAM organization. The original idea
was relatively simple: a 1 byte-wide (8 bit) bus was clocked at relatively high speed, to
achieve the kind of bandwidth that a conventional memory system achieved. A typical
memory system today has a 64-bit bus, 8 times wider than the original Rambus design. The
rationale behind Rambus was that it was easier to scale up the speed of a narrow design even
if it had to be several times faster for the same effective bandwidth, and it allowed upgrades
in smaller increments. A byte-wide bus only needs byte-wide DRAM upgrades, whereas a
64-bit bus requires 8 byte-wide upgrades. The original Rambus design scored one signifi-
cant design win, the Nintendo 64 games machine, based on Silicon Graphics technology (a
cheapened version of the MIPS R4x00 processor is used). However, it was not adopted in
any general-purpose computer system.

Rambus, like other forms of DRAM, has the same underlying RAS-CAS cycle mecha-
nism, but aimed to stream data out rapidly once the RAS cycle was completed (in effect
treating the accessed row bits as a cache).

Rambus-D has a number of design enhancements to make it more attractive. It is 2 bytes
wide, doubling the available bandwidth for a given bus speed, and has a pipelined mode in
which multiple independent references can be overlapped.

Compared with a 100MHz 128-bit bus SDRAM memory, the timing is very similar. To
achieve the same bandwidth on a bus 8 times narrower though, instead of a 10ns cycle time,
the Rambus-D design of 1999 had a cycle time of 2.5ns, and could deliver data on both the
rising and falling clock edge, delivering data at a rate of once every 1.25ns (or 800MHz).
This gave Rambus-D a peak bandwidth of 1.5Gbyte/s, competitive with high-end DRAM
designs of its time. However, as with SDRAM, there was an initial (comparable) startup
delay, of 50ns. If this is compared against the time for data transfers of 1.25ns, this initial
delay is substantial. This is the problem the pipelined mode was meant to address: provided
sufficient independent references could be found, up to 95% of available bandwidth could
be utilized on units as small as 32 bytes. However, it has yet to be demonstrated that such
a memory reference pattern is a common case.
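
To quantify how much the 50ns startup costs for a single, unpipelined transfer, here is a minimal sketch (it assumes the data transfer simply follows the startup time):

    # Fraction of Rambus-D peak bandwidth delivered by one isolated transfer.
    def rambus_utilization(block_bytes, startup_ns=50.0, ns_per_2_bytes=1.25):
        transfer_ns = (block_bytes / 2) * ns_per_2_bytes
        return transfer_ns / (startup_ns + transfer_ns)

    print(round(rambus_utilization(32), 2))   # 0.29: under a third of peak for a 32-byte unit

This is why the pipelined mode, which overlaps transfers with the startup of other references, is needed to approach the 95% figure quoted above.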

It’s possible to make a memory system with multiple Rambus channels. However, the
initial time to start an access is not improved by this kind of parallel design (any more than
is the case with a wide conventional bus). What is improved is the overall available band-
width, which may be a useful gain in some cases.

4.6.3 RAMpage

The RAMpage architecture is an attempt at exploiting the growing CPU-DRAM speed gap
to do more sophisticated management of memory. The idea is that the characteristics of
DRAM access are starting to look increasingly like the characteristics of a page fault, at
least as page faults looked on very early VM systems, when disks and drums were a lot faster in relation to the
rest of the system than they are today. For example, the first commercially available VM
machine, the British Ferranti Atlas, had a page fault penalty of (depending how it was mea-
sured) a few hundred to over 1000 instructions. By contrast, a modern system’s page fault
penalty is in the millions of instructions.

What are the potential gains from a more sophisticated strategy?

A hardware-managed cache presents the designer with a number of design issues which
require compromises:

• ease of achieving fast hits—less associativity makes it easier to make hits fast

• reducing misses—increasing associativity reduces misses by reducing conflict misses

• cost—less associativity is cheaper since the controller is less complex, particularly
where speed is an issue (more associativity means more logic to detect a hit, which
makes for more expense in achieving a given speed goal as well as requiring more sil-
icon to implement the cache)

Some recent designs have illustrated the difficulty in dealing with these conflicting goals.
The Intel Pentium II series packages the L2 cache and its tags and logic in one package with
the CPU. While this makes it possible to have a fast interconnect between the components
without relying on the board-level design, it limits the size of the L2 cache. The PowerPC
750 has its L2 tags and logic on the CPU chip, which again constrains the L2 design (in this
case, it is limited to 1Mbyte, though, unlike the Pentium II, the PPC 750 design did allow for
more than one level of off-chip cache).
Another issue of concern is finding a suitable block size for the L2 cache. The design
of DRAM favours large blocks to amortize the cost of setting up a reference (see Questions
4.3, 4.4 and 4.5). On the other hand, the block size to minimize misses is dependent on the
trade-off between spatial and temporal locality. A large block size favours spatial locality,
as it has a large prefetch effect, but can impact on temporal locality, as it displaces a larger
fraction of the cache when it is brought in (assuming it caused a replacement, and wasn’t
allocated a previously unused block). Increasing the size of a cache tends to favour larger
blocks, since the effect of replacing wanted blocks is reduced as the size goes up. See p 394
of the book for data on the effect of block size. Also, minimizing misses is not the sole con-
cern: a very large block size may reduce misses relative to a smaller size, but not sufficiently to
recoup the extra miss penalty incurred from moving a large block.

The premise of the RAMpage model is that as cache sizes reach multiple Mbytes and
miss penalties to DRAM start to increase to hundreds or even over 1000 instructions, the
interface between SRAM and DRAM starts to look increasingly like a page fault. The RAM-
page model adjusts the status of the layers of the memory system so the lowest-level SRAM
becomes the main memory, and DRAM becomes a paging device. Since a large SRAM
main memory favours a large block (now called a page) size, the interface becomes very
similar to paging. The major differences over managing the SRAM as a cache are:

• easy hits—a hit can easily be made fast since it simply requires a physical address; the
TLB in the RAMpage model contains translations to SRAM main memory addresses
and not to DRAM addresses

• full associativity—a paged memory allows a virtual page to be stored anywhere in
physical memory, which is in effect fully associative. Unlike a cache, the associativity
does not cause complexity on hits

• slower misses—misses on the other hand are inherently slower, as they are handled in
software; the intent is that RAMpage should have fewer misses to compensate for this
extra penalty

• variable page size—the page size, unlike the block size with a conventional cache
design, can relatively easily be changed, since it is managed in software (some support
in the TLB is needed for variable page sizes, but this is a solved problem: several cur-

rent architectures including MIPS and Alpha already have this feature); this allows
potential to fine-tune the page size depending on the workload, or even a specific pro-
gram

• context switches or multithreading on misses—it becomes possible to do other work to
keep the CPU busy on a miss to DRAM, which is especially viable with large SRAM
page sizes; since something else is being done while waiting for a miss, using a large
page size to minimize misses works, whereas with a cache, the miss penalty reduces
the effectiveness of large blocks

The RAMpage model is an experimental design which has been simulated with useful
results, suggesting that with 1998-1999 parameters, it is competitive with a 2-way associa-
tive L2 cache of similar size, despite the higher miss (page fault) penalty resulting from
software management. Later work has shown that errors in the simulation favoured the con-
ventional hierarchy, and RAMpage, as simulated, is a significant win at today’s design
parameters.
The growing CPU-DRAM speed gap allows scope for interesting new ideas like
RAMpage, and it is likely that results from a number of new projects addressing the same
issues will be published in the near future.

4.7 Further Reading

The impact of the TLB on performance is an area in which some work has been published,
but more could be done [Cheriton et al. 1993, Nagle et al. 1993]. Direct Rambus [Crisp
1997] is starting to move into the mainstream now that Intel has endorsed it, and is likely
to appear in mass-market designs. The RAMpage project has its own web site [RAMpage
1997], and several papers on the subject have been published [Machanick 1996, Machanick
and Salverda 1998a,b, Machanick et al. 1998, Machanick 2000]. There has been a fair
amount of work as well on improving L1 caches to keep up with processor design [Rivers
et al. 1997]. A very useful overview of VM issues has appeared recently [Jacob and Mudge
1998], showing that even though this is an old area, good work is still being done. Look for
further material in the usual journals and conferences.

4.8 Exercises

4.1. Add in to Figure 5.2 of the book a curve for SRAM speed improvement, at 40% per
year.
a. Which is growing faster: the CPU-DRAM speed gap, the CPU-SRAM speed gap, or
the SRAM-DRAM speed gap?
b. What does your previous answer say about where memory-hierarchy designers should
be focussing their efforts?

4.2. Caches are (mostly) managed in hardware, whereas the replacement and placement
policies of main memory are managed in software.
a. Explain the trade-offs involved.
b. What could change a designer’s strategy in either case?

4.3. Assume it takes 50ns for the first reference, and each subsequent reference for an
SDRAM takes 10ns (i.e., a 100MHz bus is used). Redo the calculation on p 430 of the
book (assume the cycles in the example are bus cycles) under these assumptions for the
following cases:
a. a 32-byte block on a 64-bit bus
b. a 128-byte block on a 64-bit bus
c. a 128-byte block on a 128-bit bus

4.4. Assume Rambus-D takes 50ns before any data is shipped, and 2 bytes are shipped
every 1.25ns after that. What fraction of peak bandwidth is actually used for each of the
following (assuming no pipelining, i.e., each reference is the only one active at any given
time):
a. a 32-byte block
b. a 64-byte block
c. a 128-byte block
d. a 1024-byte block
e. a 4Kbyte block

4.5. Redo Question 4.4, assuming now that you have a 4-channel Rambus-D, i.e., the
initial latency is unaffected, but you can transfer 8 bytes every 1.25ns. What does this tell
you about memory system design?

Chapter 5 Storage Systems
and Networks

If memory hierarchy presents a challenge in terms of keeping up with the CPU, I/O and
interconnects (internal and external) are even more of a challenge.

A memory system at least operates with cycle times in tens of nanoseconds; peripherals
typically operate in tens of milliseconds. As with DRAM where the initial startup time is a
major fraction of overall time and it’s hard to achieve the rated bandwidth with small trans-
fer units, I/O and interconnects have a hard time achieving their rated bandwidths with
small units, but the problem is even greater. Amdahl’s famous Law was derived because of
experience with the Illiac supercomputer project, where the I/O subsystem wasn’t fast
enough, resulting in poor overall performance. A balanced design requires that all aspects
be fast, as excessive latency at any level can’t be made up for at other levels.

5.1 Introduction

This chapter contains a brief overview of Chapters 6 and 7 of the book. Since Networks are
covered separately in a full course, Chapter 7 is only reviewed as it relates to general I/O
systems, and as background for multiprocessors (see Chapter 6 of these notes).

The major issues covered here are the interconnect within a single system, particularly
buses, I/O performance measures, RAID as it relates to the theme of the course, operating
system issues, a simple network, the interconnect including media, practical issues and a
summary of bandwidth and latency issues. By way of example, in the Trends section (5.10),
the Information Mass Transit idea is introduced as an example of balancing bandwidth and
latency.

For background, read through and understand Sections 6.1 and 6.2 of the book; we will
focus on the areas where performance issues are covered.

5.2 The Internal Interconnect: Buses

A bus in general is a set of wires used to carry signals and data; the common characteristic
of any bus is a shared medium. Other kinds of interconnect include point-to-point—each
node connects to every node it needs to communicate with directly—and switched, where
each node connects to another it needs to communicate with as needed. Most buses inside
a computer are parallel (multiple wires), but there are also common examples of serial
buses (e.g. ethernet). Like any shared medium, a bus can be a bottleneck if multiple devices
contend for it. Generally, off the processor, a system has at least two buses: one for I/O, and
one connecting the CPU and DRAM. The latter is often split further, with a separate bus for
the L2 cache.

Review the example in Figure 6.7 in the book. (A signal where we don’t care if it’s a 1
or 0, but want to record a transition, or a bus, where each line could have a different value,
is represented as a pair of lines, crossing where a transition could occur.) Make sure you
understand the design issues (pp 509-512).

Interfacing to the CPU is also an interesting issue: the general trend is towards off-load-
ing control of I/O from the CPU. Note that memory-mapped I/O and DMA (direct memory
access) introduce some challenges for cache designers. Should the portion
of the address space which belongs to the device be cached? If so, how should the cache
ensure that the CPU will see a consistent version of the data, if the device writes to DRAM?
This is an example of the cache coherency problem, which is addressed in more detail in
Section 6.4 of these notes.

5.3 RAID

Make sure you understand the definitions of reliability and availability, and how RAID
(redundant array of inexpensive disks) addresses both.

Although reliability, availability and dependability are important topics, we will leave
them out in this course for lack of time, and only look at RAID as an example of bandwidth
versus latency trade-offs.

Of relevance to the focus of the course is the relationship between bandwidth and
latency in RAID. It is important to understand that putting multiple disks in parallel
doesn’t shorten the basic access time, so latency is not improved by RAID (in fact the
transaction time could be increased by the complexity of the controller). The exception,
noted where performance measures are outlined in Section 5.4, is that faster throughput can
improve latency by reducing time spent in queues. However, if the device is not highly
loaded, this effect is not significant.

Note various mixes of styles of parallelism permitted by variations in RAID.

5.4 I/O Performance Measures

The section on I/O performance measures (6.5 of the book) is where we get to the heart of
the matter. Make sure you understand throughput and response time issues (or think of them
as bandwidth and latency). Note the knee in the curve in Figure 6.23 in the book, where
there is a sharp change in slope. This is a typical latency vs. bandwidth curve, and results
from loss of efficiency as requests increase. In a network like ethernet which has collisions,
the effect is caused by an increase in collisions. In some I/O situations like an operating
system sending requests to a disk, the losses occur from longer queueing delays (same
effect, but the delay occurs in a different place).

Note the discussion on scaling up bandwidth by adding more devices in parallel and
how it does not improve latency (except to the extent that queueing delays may be reduced).
You should know how to apply Little’s Law (queuing theory discussion) and understand the
derived equations. Work through the given examples.
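
To see where the knee comes from, here is a minimal sketch under the simplest possible
assumptions (an M/M/1 queue with an assumed 10ms mean service time; real devices and
workloads are messier): response time grows as 1/(1 - utilization), so it is nearly flat at
low load and explodes as the device approaches saturation, and Little’s Law then relates
throughput, response time and the number of requests waiting.

    /* Sketch: response time vs. utilization for a single server, assuming an
     * M/M/1 queue and an assumed 10 ms mean service time. */
    #include <stdio.h>

    int main(void) {
        double service_ms = 10.0;                            /* assumed mean service time */
        for (int i = 1; i <= 9; i++) {
            double util = i / 10.0;
            double response_ms = service_ms / (1.0 - util);  /* M/M/1: mean time in system */
            double arrival_per_ms = util / service_ms;       /* offered load */
            double in_system = arrival_per_ms * response_ms; /* Little's Law: N = X * R */
            printf("utilization %.1f: response %6.1f ms, %5.1f requests in system\n",
                   util, response_ms, in_system);
        }
        return 0;
    }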

Of particular interest by comparison with CPU performance measures is the fact that
common I/O performance measures are designed to scale up: the dataset or other measure of the prob-
lem size has to be scaled up as the rated performance is increased. Look for example at how
TPS benchmarks are scaled up in Figure 6.30 of the book. The idea of a self-scaling bench-
mark (pp 545-547) is also interesting. Note that this idea should not only apply to I/O. Make
sure you understand the issues and how details of the system configuration (caches, file
buffers—also called caches—etc.) can impact scalability of performance.

5.5 Operating System Issues for I/O and Networks

The remainder of Chapter 6 of the book contains some interesting material; make sure
you understand the general issues. Work through the examples in Section 6.7, and under-
stand the issues raised in 6.8 and 6.9. There are also some operating system issues in the
networking chapter, in Section 7.10. Find these issues, and relate them to more general I/O
issues.

5.6 A Simple Network

Let’s move on now to the networking chapter, and review Section 7.2 quickly. Make sure
you understand the definitions and can work through the examples. Note particularly the
impact of overhead on the achievable bandwidth, and relate the delivered bandwidth in
Figure 7.7 to the data in Figure 7.8.
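
The overhead effect can be captured with a simple model, in the spirit of the book’s
discussion rather than reproducing its figures: delivered bandwidth is the message size
divided by (per-message overhead plus message size over raw bandwidth). A minimal
sketch with assumed numbers:

    /* Sketch: delivered bandwidth when every message pays a fixed overhead.
     * The overhead and raw link rate are assumptions, not the book's measurements. */
    #include <stdio.h>

    int main(void) {
        double overhead_us = 100.0;     /* assumed per-message software/hardware overhead */
        double raw_MB_per_s = 12.5;     /* assumed raw link rate, about 100 Mbit/s */
        int sizes[] = {64, 256, 1024, 4096, 65536};

        for (int i = 0; i < 5; i++) {
            double transfer_us = sizes[i] / raw_MB_per_s;     /* 1 Mbyte/s == 1 byte/us */
            double delivered = sizes[i] / (overhead_us + transfer_us);
            printf("%6d-byte messages: %5.2f of %.1f Mbyte/s delivered\n",
                   sizes[i], delivered, raw_MB_per_s);
        }
        return 0;
    }

Small messages see only a tiny fraction of the raw bandwidth, which is the effect to look
for when comparing the two figures.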

5.7 The Interconnect: Media and Switches

Section 7.3 of the book is a useful overview of physical media, though not very important
for the focus of the course. Of more interest is Section 7.4 which covers switched vs. shared
media.

Do not spend too much time with some of the more exotic topologies. For practical pur-
poses, it’s useful to understand the differences between ATM and ethernet, and the potential
for adding switches into ethernet. The more complex topologies were classically used to
make exotic supercomputers which have since died. However, they may still have uses in
specialized interconnects (e.g., a crossbar switch to implement a more scalable interconnect
than a bus in a medium-scale shared-memory multiprocessor, or to implement the internals
of an ATM switch).

Switches have two major purposes:

• eliminating collisions—a loaded network with a single medium using collision detec-
tion, as does ethernet, loses an increasing fraction of its available bandwidth to colli-
sions and retries as the workload scales up; switches by partitioning the medium and
queueing traffic prevent collisions

• partitioning the medium—if the traffic can be partitioned, the available bandwidth
becomes higher (depending on how partitionable the traffic is): if n different routes can
be simultaneously active, there is n times the bandwidth of a shared medium

Switches have two major drawbacks:

• they add cost—a single medium like coax cabling is cheap, whereas a switch, if it is
not to introduce significant additional latency, is generally expensive

• workloads don’t always partition—for example, if a computer lab boots off a single
server and most network traffic is between the server and lab machines, the only useful
way to partition the network is to put a much faster segment between the server and a
switch, which splits the lab

Make sure you understand the definitions and issues raised in “Shared versus Switched
Media” on pp 610-613. Be aware of the range of topologies available, but do not spend too
much time on “Switch Topology”. Note that crossbar switches are starting to appear in
commercial designs, as a multiprocessor interconnect (previously they were only used in
exotic supercomputers, but SGI now uses them in relatively low-end systems).
Note the discussion of routing and congestion control; these are more issues for a net-
works course. However, cut-through and wormhole routing (p 615) are used in some paral-
lel systems: you should have some idea of how they work: work through the CM-5 example
on p 616.

5.8 Wireless Networking and Practical Issues for Networks

Read through the section on wireless networking (7.5). It is interesting but beyond the
scope of this course. Also read the section on practical issues (7.6); there are some interest-
ing points but they do not relate closely to the focus of the course.

5.9 Bandwidth vs. Latency

It is useful at this stage to revisit all areas of I/O and networks where latency and bandwidth
interact.

Generally, bandwidth is easy to scale up (subject to issues like cost and physical limi-
tations, like space in a single box): you just add more components in parallel. Latency how-
ever is hard to fix after the event if your basic design is wrong.

For example, if you want a transaction-based system to support 10,000 transactions per
second with a maximum response time of 1s, and you design your database so it can’t be
partitioned, you run into a hard limit of the access time of a disk (typically 10ms; some are
faster but 7ms is about the fastest in common use at time of writing). The average time per
transaction to fit the required number into 1s is 0.1ms. Clearly, something different has to
be done to achieve the desired latency: a solution like RAID, for example, will not help.

This is one of the reasons that IBM was able to maintain a market for their mainframes
when microprocessors exceeded mainframes’ raw CPU speed in the early 1990s. While
IBM made very large losses, they were able to recover, because no one had a disk system
competitive with theirs. They lost the multiuser system and scientific computation markets,
but were able to hold onto large-scale transaction processing. Here’s some historical infor-
mation, summarized from the first edition of Hennessy and Patterson, to illustrate the prin-
ciples.

The development of a fast disk subsystem was not part of the original IBM 360 design;
this followed as a result of IBM’s early recognition that disk performance was vital for
large-scale commercial applications. The dominant philosophy was to choose latency over
throughput wherever a trade-off had to be made. The reason for this is the view (often found
to be valid in practice) that it’s much easier to improve throughput by adding hardware than
to reduce latency (see the discussion of disk arrays and the effect of adding a disk for an
example of adding throughput in this way). Put in another way: you can buy bandwidth, but
you have to design for latency.

The subsystem is divided into the following hierarchies:

• control—the hierarchy of controllers provides a range of alternative paths between
memory and I/O devices, and controls the timing of the transfer

• data—the hierarchy of connections is the path over which data flows between memory
and the I/O device

The hierarchy is designed to be highly scalable; each section of the hierarchy can contain
up to 64 drives, and the IBM 3090/600 CPU can have up to 6 such sections, for a total of
384 drives, or a total capacity of over 6 trillion bytes (terabyte: TB = 1024GB) using IBM
3390 disks. Compare this with a SCSI controller that can support up to 7 devices (with cur-
rent technology, maybe 63GB).
The channel is the best-known part of the hierarchy. It consists of 50 wires connecting
2 levels of the hierarchy. 18 are used for data (8 + 1 for parity in each direction), and the
rest are for control information. Current machines’ channels transfer at 4.5MB/s. (Contrast
this with the maximum rate of SCSI disks: about 4MB/s, though this is in only one direction
at a time). Each channel can connect to more than one disk, and there are redundant paths
to increase availability (if a channel goes down, performance drops, but the machine is still
usable).

Goals of I/O systems include supporting the following:

• low cost

• a variety of I/O devices

• a large number of I/O devices at a time

• low latency

High expandability and low latency are hard to achieve simultaneously; IBM achieves this
by using hierarchical paths to connect many devices. This, plus all the parallelism inherent
in the hierarchy, allows simultaneous transfers, so high bandwidth is supported. At the same
time, using many paths instead of large buffers to accommodate a high load minimizes
latency. Channels and rotational position sensing help to ensure low latency as well.
The key to good performance in the IBM system is low rotational position misses and
low congestion on channel paths.

This architecture was very successful, judging both from the performance it delivered
and the duration of its dominance of the industry.

If we cross now to networks, ATM is having a hard time making headway against estab-
lished technologies, like ethernet, because the major gain is from switching, something that
can be added on top of a classically-shared-medium standard. ATM’s fixed-size packet is
meant to make routing easier, but the potential wins have to be measured against the fact
that ATM cells are a poor fit to existing designs, which all assume ethernet-style variable-
length packets. Possibly a system designed around ATM from scratch would perform better,
but ATM starts from a poor position of having extra latency added up front: splitting larger
logical packets into ATM cells, and recombining them at the other end.

5.10 Trends: Learning Curves and Paradigm Shifts

The internet provides an interesting exercise in mismatches in learning curves.

Bandwidth is expanding at a rapid rate globally, at a rate of about 1.78 times per year,
which amounts to a doubling roughly every 14 months. At the same time, no one is paying
particular attention to latency. A fast
personal computer today can draw a moderately complex web page too fast to see the
redraw, but the end-to-end latency to the server may be in minutes.

As a result, quality of service is a major inhibiting factor to serious growth of internet
commerce. Using the internet to shop for air tickets, for example, is viable, given that the
latency is competing with a trip to a travel agent (or being put on hold because the agent is
busy). However, using the internet to deliver highly interactive services is problematic.

Another area which is problematic is video on demand (VoD). The general VoD prob-
lem is to deliver any of hundreds of movies with minimal latency (at least as good as a
VCR), with the same facilities as one would expect on a VCR: fast forward, rewind, pause,
switch to another movie. Currently shipping designs for VoD are very complex, and require
supercomputer-like sophistication to scale up. The key bottleneck in VoD designs unfortu-
nately is latency, rather than bandwidth. Although faster computers can reduce overheads,
the biggest latency bottlenecks are in the network and disk subsystems, the very areas
where learning curves are not driving significant improvements. Disks for example, have
barely improved in access time in 5 years, while network latency as noted before is not seri-
ously addressed in new designs.

The next section illustrates some alternative approaches that can be used to hide the
effects of high latency.

5.11 Alternative Schemes

5.11.1 Introduction

This section introduces a novel idea, Information Mass Transit (IMT), and applies it to two
examples: VoD and transaction-based systems. The basic idea is that in many high-volume
situations, it is relatively easy to provide not only the required bandwidth but potentially
several times the required bandwidth, whereas latency goals are intractable if each transaction
is handled separately. Handling each transaction separately is analogous to every commuter
driving their own car. While this seems to be the most efficient method in terms of reducing
delays waiting for fixed-schedule transport or fitting in with non-scheduled shared options
like lift clubs, the overall effect when the medium becomes saturated is slower than a mass
transit-based approach. To take the analogy further, the IMT idea proposes attempting to
exploit the relative ease of scaling up bandwidth to fake latency requirements, by grouping
requests together along with more data than is actually requested. An individual request has
to wait longer than the theoretical minimum latency of the architecture, but the aggregated
effect is shorter delays, i.e., lower latency.

To illustrate the principles, an approach to video on demand is presented, followed by
an alternative model for disks in a transaction-based system.

5.11.2 Scalable Architecture for Video on Demand

A proposed scalable architecture for video on demand (SAVoD) is based on the IMT prin-
ciple.

Conventional VoD requires that every request be handled separately, which makes it
hard to scale up. A full-scale VoD system has the complexity (and expense) of a high-end
supercomputer. Various compromises have been proposed, like pre-booking a movie to
reduce latency requirements, or clustering requests so that actions like fast forward and
rewind have a latency based on skipping forward or backward to another cluster of requests.

All of these approaches still require significant complexity in the server (and as the
number of users scales up, the network).

The SAVoD approach is to stream multiple copies of each available movie a fixed time
interval apart. For example, if the desired latency for VCR operations is at worst 1 minute,
and the typical movie is 120 minutes long, 120 copies of the movie are simultaneously
active, spaced 1 minute apart. VCR-like operations consist of skipping forward or back-
ward between copies of a stream, or to the start of a new stream. There are a few complica-
tions in implementation, but the essential idea is very simple. A server consists of little more
than an array of disks streaming into a network multiplexor, which has to time the alterna-
tive streams to their required separation, before encoding them on the network. The main
interconnect would be a high-capacity fibre-optic cable. At the user end, a local station
would connect multiple users to the high-speed interconnect, with a low-speed link to the
individual subscriber. This low-speed link would only have to carry the bandwidth required
for one channel, as well as spare capacity for control signals.

A SAVoD implementation with 1,000 120-minute movies at HDTV broadcast standard
and 1-minute separation would require a 2Tbit/s bandwidth, but smaller configurations with
fewer movies and lower-standard video could be implemented to get the standard started.
A particular win for the SAVoD approach is that the bandwidth required depends on the
video standard and number of movies, not the number of users. In fact as the number of
users scales up, an increasingly improved service can be offered, as the service is spread
over a larger client base.
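
As a rough consistency check on the 2Tbit/s figure (the per-stream bit rate below is an
assumption about the HDTV broadcast standard, not a number from these notes):

    /* Sketch: aggregate bandwidth for the SAVoD example. */
    #include <stdio.h>

    int main(void) {
        int movies = 1000;
        int minutes_per_movie = 120;
        int separation_minutes = 1;
        double mbit_per_stream = 19.4;   /* assumed HDTV broadcast bit rate in Mbit/s */

        int streams_per_movie = minutes_per_movie / separation_minutes;
        double total_tbit = movies * streams_per_movie * mbit_per_stream / 1e6;
        printf("%d movies x %d staggered copies x %.1f Mbit/s = about %.1f Tbit/s\n",
               movies, streams_per_movie, mbit_per_stream, total_tbit);
        /* The total is independent of the number of subscribers. */
        return 0;
    }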

Latency is not an issue at the server and network levels; if a higher-quality VCR service
is required, the current movie can be buffered in local disk either at the local access point,
or at the subscriber.

5.11.3 Disk Delay Lines for Scalable Transaction-Based Systems

Transaction-based systems are another example of an application area where meeting
latency goals the traditional way is hard as the system scales up. Examples include airline
reservation systems, large
web sites (which increasingly overlap other kinds of transaction-based systems) and bank-
ing systems.

Although there are many other factors impacting on latency, it’s easy to see that disk
performance is a big factor. If the basic access time is 10ms, the fastest transaction rate a
disk can handle with one access per transaction, for a worst-case response time of 1s is only
100 transactions per second (TPS). If disk time (probably more realistically) is taken as at
worst half the transaction time, only 50 TPS can be supported. This is why systems with a
very high TPS rating typically have 50-100 disks.

How can the IMT idea help here?

Let’s consider where a disk’s strength lies: streaming. Assume we have a high-end disk
capable of streaming at 40Mbyte/s and a relatively small database of 1Gbyte. Can we per-
suade such a device to allow us to reach 100,000 TPS, with only 0.5s for disk access?
Assume each transaction needs to access 128 bytes. Then at 40Mbyte/s, one transaction
takes 3µs. This looks promising. But if we simply sweep all the data out at 40Mbyte/s, the
worst-case time before any one part of the database is seen is 25.6s, about 50 times too long.
Simple solution: use 50 disks, synchronized to stream the database out at equally-spaced
time intervals. Now the worst-case delay is approximately 0.5s, where we want it to be. To
be able to handle 100,000 TPS, we need to be able to buffer requests for the worst-case sce-
nario where every request waits the maximum delay. Since the maximum delay is approx-
imately 0.5s, we need to buffer 50,000 requests.
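
The arithmetic above can be packaged as a quick sketch (treating 1 Gbyte as 1024 Mbyte
with 1 Mbyte = 10^6 bytes, which is what the 25.6s figure implies):

    /* Sketch: the disk delay line arithmetic from the example above. */
    #include <stdio.h>

    int main(void) {
        double db_bytes = 1024e6;            /* 1 Gbyte database */
        double stream_bytes_per_s = 40e6;    /* 40 Mbyte/s streaming rate */
        int disks = 50;                      /* as chosen in the example */
        double target_tps = 100000;

        double sweep_s = db_bytes / stream_bytes_per_s;    /* one disk: ~25.6 s */
        double worst_delay_s = sweep_s / disks;            /* phased disks: ~0.5 s */
        double buffered = target_tps * worst_delay_s;      /* worst-case queued requests */

        printf("one disk sweeps the database in %.1f s\n", sweep_s);
        printf("%d phased disks give a worst-case delay of %.2f s\n", disks, worst_delay_s);
        printf("at %.0f TPS, about %.0f requests must be buffered\n", target_tps, buffered);
        return 0;
    }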

Writing presents a few extra problems, but should not be too hard to add to the model.

What this idea suggests is that disk manufacturers should give up the futile attempt at
improving disk latency and concentrate instead on the relatively easy problem of improving
bandwidth, for example, by increasing the number of heads and the speed of the intercon-
nect to the drive.

The general idea of a storage medium which streams continuously is not new. Some of
the oldest computer systems used an ultrasonic mercury delay line, which was a tube filled
with mercury. An ultrasonic sound, encoding the data, was inserted in one end of the tube.
Since the speed of sound is relatively low by even the primitive standards of the computer
world in the 1950s, several hundred bits could be stored in a relatively small tube, utilizing
the delay between putting the data in one end and retrieving it as it came out the other end.
The idea proposed here is a disk delay line. The only real innovation relative to the original
idea is having multiple copies of the delay line in parallel to reduce latency (i.e., the delay
before the required data is seen).

The TPS in the calculation is not necessarily achievable, as the rest of the system would
have to keep up: a high-end TPS rating in 1998 is only in the order of a few thousand, on a
large-scale multiprocessor system with up to 100 disks. How much of the overall system is
simply dedicated to the problem of hiding the high latency of disks in the conventional
model is an important factor in assessing further whether the disk delay line model would
work.

5.12 Further Reading

The internet learning curve is described by Lewis [1996b]. More can be found on the Infor-
mation Mass-Transit ideas mentioned here on the Information Mass-Transit web pages
[IMT 1997], including papers which have been submitted for publication. Consult the usual
journals and conferences for issues like advances in RAID and network technology.

5.13 Exercises

5.1. Do questions 6.1, 6.2, 6.4, 6.5, 6.10, 7.1, 7.2, 7.3, 7.10.

5.2. Derive a general formula for the achievable TPS in the disk delay line approach,
taking into account the following (assume one access per transaction):
• size of database

• transfer rate of the disk

• number of disks in parallel

• size of individual transaction

• size of buffer

• maximum disk time allowed per transaction

5.3. Discuss the use of network switches in the network and server configuration we have
in our School (including equipment shared with other departments). What is the role of
switches and routers here, in terms of bandwidth issues?

5.4. Note the impact of overhead on the achievable bandwidth, and relate the delivered
bandwidth in Figure 7.7 to the data in Figure 7.8. Also consider the following figure
(derived from one in an earlier edition of the book), which compares ethernet and ATM
throughput (think about your answer to question 7.10):

[Figure: throughput of 155Mbit/s ATM compared with 10Mbit/s ethernet]

a. What does this tell you about network design?
b. If a new network design has a very high bandwidth but packets are complex to create,
what kind of performance would you expect with a typical NFS workload?
c. Given the data in the figure above, if you had the choice between deploying ATM and
partitioning traffic on a university network using switches in ethernet, which would
you choose? Why?

5.5. Another possible application of Information Mass Transit is a variation on multicast,
in which several alternative multicast streams are maintained simultaneously. If a receiver
drops a packet from one stream, they drop back to a later stream. Discuss this alternative
versus:
a. individual connections for every receiver
b. conventional multicast

Chapter 6 Interconnects and
Multiprocessor Systems

Scaling up a single-processor system is a challenge: does adding more processors make
things easier or harder?

Certainly, it has long been predicted that the CPU improvement learning curve will hit
a limit and that the only way forward will be multiprocessor systems. Each time, however,
before the predicted limit has been reached, a paradigm shift has saved the day (for exam-
ple, the shift to microprocessors in the 1980s). Still, multiprocessor systems fill an impor-
tant niche. For some applications which can be parallelized, they provide the means to
execute faster—before a new generation of uniprocessor arrives, which is just as fast. In
other cases, a multiprogramming job mix can be executed faster. The traditional multiuser
system—historically a mainframe or minicomputer—is a classic example of a multipro-
gramming requirement, but a modern window-based personal computer also has many pro-
cesses running at once, and even multithreaded applications, which can potentially benefit
from multiple processors.

One factor though is pushing multiprocessor systems into the mainstream: the fact that
Intel is running out of ideas to speed up the IA-32 microarchitecture. For this reason, mul-
tiprocessor systems are becoming increasingly common. On the other hand, slow progress
with IA-64 and competition from AMD is forcing Intel to squeeze more speed out of IA-
32—so the long-predicted death of the uniprocessor remains exaggerated.

Be that as it may, there is still growing interest in affordable multiprocessor systems.

The interconnect is an important part of scalability of multiprocessor systems. Scalabil-
ity has historically meant very big designs are supported. I argue that scalability should
include very small designs; a system that only makes economic sense as a very large-scale
design does not have a mass-market “tail” to drive issues like development of programming
tools.

6.1 Introduction

This chapter contains a brief overview of Chapter 8 of the book, with some backward ref-
erences to Chapter 7, since networks provide some background for multiprocessors. The
authors of the book clearly agree, as they have added a section (7.9) on clusters. The major
focus is on shared-memory multiprocessors, since this is the common model in relatively
wide use. A variant on shared memory, distributed shared memory (DSM) is also covered,
along with issues in implementation and efficiency.

The order of topics in the remainder of the chapter is as follows. Section 6.2 covers
classification of multiprocessor systems, while Section 6.3 classifies workloads.
Section 6.4 introduces the shared-memory model, with the DSM model in Section 6.5.
Memory consistency and synchronization, important issues in the efficiency of shared-
memory and DSM computers, are handled together in Section 6.6. In Section 6.7, issues
relating to a range of architectures including uniprocessor systems are covered. To conclude
the chapter, there are sections on trends (6.8) and alternative models (6.9).

6.2 Types of Multiprocessor

Read through Section 8.1 of the book. Note the arguments for considering multiprocessor
systems, and the taxonomy (a classification based on objective, as opposed to arbitrary,
criteria) of parallel architectures, into SISD, SIMD, MISD and MIMD.

There is another argument for MIMD shared-memory machines not given with the tax-
onomy, but which appears later (p 679). Such a machine can be programmed in the style of
a distributed-memory machine by implementing message-passing in shared memory. This
means that whichever model of programming is more natural can be adopted on a shared-
memory machine. On a message-passing machine on the other hand, implementing shared
memory is complex and likely to result in poor performance. The DSM model covered later
is a hybrid, with operating system support for faking a logical (programmer-level) shared
memory on a distributed system.

Another common term which is introduced is symmetric multiprocessor (SMP). An
SMP is a multiprocessor system on which all processors are conceptually equal, as opposed
to some models where specific functionality (e.g. the OS, or work distribution) may be han-
dled on specialized (or specific) processors. Most shared-memory systems are SMP.

Make sure you understand the differences between the models, the communication
strategies as relate to memory architecture, performance metrics, advantages of the various
communication models and the challenges imposed by Amdahl’s Law. In particular, in
terms of the focus of the course, think about the effect of buying a machine with a fixed bus
bandwidth and memory design, with the intention of buying faster processors over the next
3 to 5 years. Work through the example on pp 680-681, and on pp 681-683.

6.3 Workload Types

Section 8.2 of the book lists several parallel applications. Make sure you understand how
the different applications are characterized, and how they differ with respect to communi-
cation and computation behaviour.

Note also the multiprogramming workload, and how it differs. Note that a key differ-
ence is that the parallelism here is between separate programs, with only limited commu-
nication in the form of UNIX pipes. In this scenario, the operating system is likely to
generate most synchronization activity, as it has to maintain internal data structures like the
scheduler’s queues across processors.

6.4 Shared-Memory Multiprocessors (Usually SMP)

Understand the basic definition of a shared-memory system at the start of Section 8.3 of the
book. Note that data in a shared-memory system is usually maintained on the granularity
of cache blocks, and a block is typically tagged as being in one of the following states:

• shared—in more than one processor’s cache and can be read; as soon as it is written, it
must be invalidated in other caches

• modified—in only one cache; if another cache reads or writes it, it is written back; if
the other cache writes it, it becomes invalid in the original cache, otherwise shared

• exclusive—only in this cache: can become modified, shared or invalid depending on
whether its processor writes it, another reads it, or another modifies it (after a write
miss)

• invalid—the block contains no valid memory contents (either as a result of invalida-
tion, or because nothing has yet been placed in that block)
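
The four states above are essentially the well-known MESI protocol. A compact way to
see the transitions just described, from the point of view of one cache, is as a small state
machine (a simplified sketch: writebacks, bus transactions and the snooping hardware
itself are not modelled):

    /* Sketch: the block states described above, for one block in one cache. */
    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } State;
    typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } Event;

    State next_state(State s, Event e) {
        switch (s) {
        case MODIFIED:
            if (e == REMOTE_READ)  return SHARED;    /* written back, then shared */
            if (e == REMOTE_WRITE) return INVALID;   /* written back, then invalidated */
            return MODIFIED;                         /* local reads and writes hit */
        case EXCLUSIVE:
            if (e == LOCAL_WRITE)  return MODIFIED;
            if (e == REMOTE_READ)  return SHARED;
            if (e == REMOTE_WRITE) return INVALID;
            return EXCLUSIVE;                        /* local reads hit */
        case SHARED:
            if (e == LOCAL_WRITE)  return MODIFIED;  /* other copies invalidated */
            if (e == REMOTE_WRITE) return INVALID;
            return SHARED;                           /* reads, local or remote */
        case INVALID:
        default:
            if (e == LOCAL_READ)   return SHARED;    /* EXCLUSIVE if no other copy exists */
            if (e == LOCAL_WRITE)  return MODIFIED;  /* after a write miss */
            return INVALID;                          /* remote traffic doesn't concern us */
        }
    }

    int main(void) {
        const char *name[] = {"invalid", "shared", "exclusive", "modified"};
        State s = INVALID;
        s = next_state(s, LOCAL_READ);   printf("after local read:   %s\n", name[s]);
        s = next_state(s, LOCAL_WRITE);  printf("after local write:  %s\n", name[s]);
        s = next_state(s, REMOTE_READ);  printf("after remote read:  %s\n", name[s]);
        s = next_state(s, REMOTE_WRITE); printf("after remote write: %s\n", name[s]);
        return 0;
    }
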

There are variations on this model, as we will see in Section 6.6. Also, the DSM model has
some variations on management of sharing.
The key issue to understand is how communication happens, and what some of the pit-
falls are. If one process of a multiprocessor application communicates with another, it
writes to a shared variable. There are several issues in ensuring such programs are correct,
involving synchronization, which we won’t go into in much detail (see Section 6.6 again).
Here, we will concern ourselves more with performance.

Here are a few issues before we get to performance.

Understand what cache coherence is, and the difference between private and shared
data. Go through the descriptions of approaches to coherence, including the performance
issues. Note the differences in issues raised for performance of multiprocessor applications
and multiprocessor workloads.

A few issues are important for achieving good performance in shared-memory multi-
processing applications:

• minimal sharing—although the shared-memory model makes it look cheap to share
data, it is not. Invalidations are a substantial performance bottleneck in shared-mem-
ory applications, so writing shared variables extensively is not scalable; for example, if
a shared counter logically has to be updated for every operation performed but is only
tested infrequently, keep a local count in each process, and update the global count as
infrequently as possible from the local count

• avoid false sharing—because memory consistency (or coherence) is managed on a
cache-block basis, false sharing can occur: read-only data coincidentally in the same
cache block as data which is modified can cause unnecessary invalidations and misses,
since accessing non-modified data is not distinguished from accessing modified data in
the same cache block; the solution is to waste memory by aligning data to cache block
boundaries and padding it if necessary so only truly shared data is in a given block (see the sketch after this list)

• minimal synchronization—related to minimal sharing in that synchronization is a form
of communication, but also, the less often a process has to wait for others, the less
likely you are to run into a load imbalance (where one or more processors has to wait
an extended time for others) as well
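
Here is a sketch of the padding idea mentioned under false sharing above (the 64-byte
block size is an assumption; real code would use the target machine’s actual block size):

    /* Sketch: avoiding false sharing by padding per-processor counters so that
     * each one sits in its own cache block. */
    #include <stdio.h>

    #define CACHE_BLOCK 64        /* assumed cache block size */
    #define NPROCS      8

    /* Bad: the eight counters share a couple of cache blocks, so every update by
     * one processor invalidates the block in the other processors' caches. */
    long naive_counters[NPROCS];

    /* Better: each counter padded out to a full block, wasting memory to avoid
     * false sharing. */
    struct padded_counter {
        long count;
        char pad[CACHE_BLOCK - sizeof(long)];
    };
    struct padded_counter padded_counters[NPROCS];

    int main(void) {
        printf("naive:  %zu bytes for %d counters (several per block)\n",
               sizeof(naive_counters), NPROCS);
        printf("padded: %zu bytes for %d counters (one block each)\n",
               sizeof(padded_counters), NPROCS);
        return 0;
    }
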

All these factors tend to combine to produce a less-than-linear speedup for most practical
applications. See for example Figure 8.24, which illustrates how a program whose data
is not allocated to cache blocks in a way that avoids false sharing has increased coherence
misses as the block size increases. Note also how high a fraction of data traffic can be due
to coherence (i.e. communication).
However, on the positive side, a multiprocessor application can in some circumstances
achieve a better speedup than one might expect. If the dataset is much larger than the cache
of any available uniprocessor system, partitioning the workload across multiple processors
whose total cache is bigger than the dataset can give better than linear speedup—provided
communication is low.

6.5 Distributed Shared-Memory (DSM)

A DSM architecture attempts to get the best of both worlds. A distributed-memory archi-
tecture generally has a more scalable interconnect than a shared-memory architecture,
because the interconnect is often partitioned (see the arguments for switches in networks in
Section 5.7 of these notes). However, the basic interconnect speed is slower, as it is usually
a network like ethernet. The essential model looks like a network of uniprocessor (some-
times shared-memory multiprocessor) machines, in which a global shared memory is faked
by the operating system—often using the paging mechanism.

Since a page fault is a relatively expensive operation (network latency is typically of the
same order as disk latency, and, except for higher-end networks, transfer rate tends to be
slower: 100baseT for example can at peak achieve about 11 Mbyte/s), communication is an
even more difficult issue in DSM than ordinary shared memory. Typically, DSM models
attempt to work around the problem of issues like false sharing in the large transfer units
(page-sized, 4 Kbytes or more) by doing consistency on smaller units. Some caches also do
this (for example, a 128-byte block may have consistency maintained at the 32-byte level).

Read through the description of DSM, and directory-based protocols. Directory-based
protocols have also been used for large-scale shared-memory systems (snooping doesn’t
scale well). Note also the performance issues, and make sure you can do the examples. The
key issues though are handled in the next section, where memory consistency is handled in
more detail.

6.6 Synchronization and Memory Consistency

There is some link between synchronization and consistency. Often, synchronization is
needed to ensure communication is done correctly, which implies writes and hence that
memory consistency should be maintained. To address performance issues, it is useful to
consider the two issues together.

First, let’s consider issues in synchronization. The primitives depend on the program-
ming style of the parallel application. The simplest primitive is a lock, which is generally
needed even if higher-level constructs are built out of locks. A lock controls access to a crit-
ical region, a section of code where sequentialization is important to ensure that a write
operation is atomic (a condition where the result of a write depends on the order of process-
ing is called a race condition). Locks can be implemented in various ways. The simplest is
a spinlock: the code attempts to test a variable and set it in an atomic operation (it is usual
to have an instruction designed to do this atomically); if the test fails, the spinlock loops. A
spinlock is very expensive in memory operations, as every time it is set, it generates an
invalidation to every process that is trying to read it, followed by a flurry of attempts at writ-
ing it: the one process that wins the race to write it succeeds in the test-and-set operation,
while the others fail and loop again, and have to reload the value in their caches (at the same
time converting it from exclusive to shared in the other cache, usually also resulting in a
writeback to DRAM). Since it’s important for the cache coherence mechanism to handle
locks correctly, a natural next step is to incorporate the locking mechanism in the coherence
mechanism. While various approaches are covered, none go as far as some implemented by
researchers. Work through the various variations on locks, and make sure you can compute
how much bus traffic each generates. Also note the alternatives of a queueing lock, and
exponential backoff.
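
As a minimal sketch, here is a test-and-test-and-set spinlock built on C11 atomics; the
read-only inner loop spins on a locally cached copy, precisely to cut down the invalidation
traffic described above (a production lock would add backoff and fairness):

    /* Sketch: a test-and-test-and-set spinlock. Compile with -pthread. */
    #include <stdatomic.h>
    #include <stdio.h>
    #include <pthread.h>

    static atomic_int lock_word = 0;          /* 0 = free, 1 = held */
    static long shared_counter = 0;

    static void spin_lock(void) {
        for (;;) {
            while (atomic_load(&lock_word) != 0)
                ;                                        /* spin on a cached copy */
            if (atomic_exchange(&lock_word, 1) == 0)     /* atomic test-and-set */
                return;                                  /* we won the race */
        }
    }

    static void spin_unlock(void) {
        atomic_store(&lock_word, 0);
    }

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            spin_lock();
            shared_counter++;                /* critical region */
            spin_unlock();
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld (expected %d)\n", shared_counter, 4 * 100000);
        return 0;
    }
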

Two other primitives are semaphores and barriers.

A semaphore is in essence a counter, which is decremented when gaining access to a
resource, and incremented when giving it up. A semaphore can be used to control the
number of simultaneous users of the resource (e.g. by setting the initial value to some n >
1, n processes can access the resource before accesses are blocked). Semaphores usually
have a queue associated with them to keep sleeping processes in order.

A barrier is used to make one or more processes wait for completion of an event by
another process. For example, in a time-stepped simulation, it is common to force all pro-
cesses to synchronize at the end of a timestep to communicate. A barrier in its simplest form
can be implemented using a semaphore or equivalently a lock and a counter, but does not
scale up well. Work through the rest of the section on synchronization, and make sure you
can do the examples.
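
A sketch of the simple lock-and-counter form of barrier just mentioned, here built on a
pthread mutex and condition variable; as noted, every arrival updates shared state, which
is why this centralized form does not scale well:

    /* Sketch: a simple centralized (lock + counter) barrier. Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    struct barrier {
        pthread_mutex_t lock;
        pthread_cond_t  all_here;
        int waiting;
        int nthreads;
        int generation;                   /* lets the barrier be reused */
    };

    void barrier_init(struct barrier *b, int nthreads) {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->all_here, NULL);
        b->waiting = 0;
        b->nthreads = nthreads;
        b->generation = 0;
    }

    void barrier_wait(struct barrier *b) {
        pthread_mutex_lock(&b->lock);
        int my_generation = b->generation;
        if (++b->waiting == b->nthreads) {         /* last one in releases everybody */
            b->waiting = 0;
            b->generation++;
            pthread_cond_broadcast(&b->all_here);
        } else {
            while (my_generation == b->generation)
                pthread_cond_wait(&b->all_here, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }

    static struct barrier b;

    static void *stepper(void *arg) {
        long id = (long)arg;
        for (int step = 0; step < 3; step++) {
            /* ... compute this thread's share of the timestep ... */
            barrier_wait(&b);                      /* wait for everyone before moving on */
            if (id == 0) printf("all threads completed step %d\n", step);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        barrier_init(&b, 4);
        for (long i = 0; i < 4; i++) pthread_create(&t[i], NULL, stepper, (void *)i);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        return 0;
    }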

Section 8.6 of the book goes on to memory consistency, and note how the issues interact
with synchronization. The most important principle exploited by designers of relaxed
models of consistency is that programmers don’t actually want race conditions, so any
writes to shared variables are likely to be protected by synchronization primitives. Under-
stand the general ideas, but spend more time on the performance issues than on specifics of
each model.

6.7 Crosscutting Issues

It is interesting to note that the increased level of ILP and the growing CPU-DRAM speed
gap are pushing more and more multiprocessor issues into the uniprocessor world.

Some of these ideas include nonblocking caches and latency hiding. Other issues
common to the two areas are inclusion and coherence, both useful in uniprocessor systems
to deal with non-CPU memory accesses (typically DMA).

A nonblocking cache allows the CPU to continue across a miss until the data or instruc-
tion reference is actually needed. In its original form, it was used with nonblocking prefetch
instructions, instructions that did a reference purely to force a cache miss (if needed) ahead
of time. In uniprocessor designs, the same idea has surfaced in the Intel IA-64 (though of
course the processor could be used in multiprocessor systems if it ships), but it is also
becoming increasingly common for ordinary misses to be nonblocking, especially in L1. If
the miss penalty to L2 is not too high, there is some chance that a pipeline stall can be
avoided by the time the processor is forced to execute the instruction that caused the miss
(or complete the fetch, if it was an I-miss). Misses to DRAM are becoming too slow to
make nonblocking L2 access much of a win.
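
Nonblocking prefetches are visible to programmers through compiler built-ins; for example,
GCC and compatible compilers provide __builtin_prefetch, which issues a prefetch without
stalling (the prefetch distance of 16 elements below is an assumed tuning parameter):

    /* Sketch: software prefetching with a nonblocking prefetch instruction.
     * __builtin_prefetch is specific to GCC/Clang. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)

    int main(void) {
        double *a = malloc(N * sizeof *a);
        if (a == NULL) return 1;
        for (int i = 0; i < N; i++) a[i] = i;

        double sum = 0.0;
        for (int i = 0; i < N; i++) {
            if (i + 16 < N)
                __builtin_prefetch(&a[i + 16]);   /* start the miss early; does not stall */
            sum += a[i];                          /* with luck, the line has arrived by now */
        }
        printf("sum = %.0f\n", sum);
        free(a);
        return 0;
    }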

The rest of the chapter contains interesting material but we will not cover it in detail.

6.8 Trends: Learning Curves and Paradigm Shifts

The biggest paradigm shift in recent years has been the recognition that the biggest market
is for relatively low-cost high-performance systems. The rate of advance of CPU speed
makes a highly upgradable system largely a waste, as the cost of allowing for many proces-
sors, a large amount of memory, etc., is high. If a mostly-empty box is bought, a year or two
later the expensive scalable infrastructure is too slow for a new CPU to make sense, as
opposed to replacing the whole thing by a lower-end new system that’s faster than the old
components.

Instead, the trend is towards smaller-scale parallel systems, with options of clustering
multiple small units to get the effect of a larger one. SGI has been effective at this strategy,
though they have lost ground in recent years through poor strategy in other areas. The tail-
ing off in microarchitecture improvement in the IA-32 line is creating new impetus in the
small-scale SMP market at the lower end, and it seems likely that the clustered hybrid
shared-memory-DSM model will gain ground as a result.

The clear loser in recent years has been a range of alternative models: MPP (massively
parallel processor) systems, distributed-memory systems, and SIMD systems. All of these
models have the problem that the focus of the designers was scaling up, not down. If a mass-
market version of a system cannot be made, it has a high risk of failure for the following
reasons:

• poor programming tools—there’s nothing like a mass-market application base to stim-
ulate development tool innovation (compare a Mac or PC-based environment with
most UNIX-based tools: the UNIX-based tools only really score in being on a more
robust platform, an advantage slowly eroding over time as mass-market operating sys-
tems discover the benefits of industrial-strength protection and multitasking)

• high marginal profit or loss—if a PC manufacturer loses a sale, it’s one sale in thou-
sands to millions; if a specialized supercomputer manufacturer loses a sale, it could be
a large fraction of their year’s revenues (Cray Research, when SGI bought the com-
pany, had annual sales of around $2-billion; at the typical cost of a supercomputer,
that’s about 200 machines per year)

• limited market—the more specialized designs not only were unsuited to wider markets
like multiprogramming workloads, but also to a wide range of parallel applications:
they were based on specialized programming models that did not suit all large-scale
computation applications

• high cost of scalability—if conventional multiprocessor systems suffer from the prob-
lem that large-scale configurations are expensive, this is even more true of specialized
designs, which often employed exotic interconnects; to make matters worse, some
designs (like the Hypercube) could only be upgraded in increasingly large steps, to
maintain their topology

6.9 Alternative Schemes

There are various alternative implementations of barriers, which reduce the amount of
global communication. Another alternative is to have parts of the computation only do local
communication, which adds up to global communication over the totality of the parts. For
example, in a space-based simulation such as a wind tunnel simulation, local regions of
space could communicate with nearest neighbours by spinning on the nearest neighbour’s
copy of the global clock. When the neighbour’s clock matches the current unit of space’s
clock, it can go on to the next timestep.

While distributed shared memory has problems with communication speed, there is
increasing interest in hybrid schemes, which allow larger systems to be built up out of clusters
of smaller multiprocessor systems. The basic building blocks are similar to a conventional
shared-memory system, with a fast interconnect which may nonetheless be slower than a
traditional bus. SGI, for example, builds such systems, based on the Stanford DASH
project. Look for more examples in the book.

6.10 Further Reading

There has been some work on program structuring approaches which improve cache behav-
iour of multiprocessor systems, and which attempt to avoid the need for communication
primitives like barriers which scale poorly [Cheriton et al. 1993]. The ParaDiGM architec-
ture [Cheriton et al. 1991] contains some interesting ideas about coherency-based locks as
well as the notion of scaling up shared memory with a hierarchy of buses and caches, while
the DASH project [Lenoski et al. 1990] was one of the earliest to introduce latency-hiding
strategies (an issue now with uniprocessor systems). Tree-based barriers attempt to distrib-
ute the synchronization overhead, so a barrier does not become a hot-spot for global con-
tention for locks [Mellor-Crumney and Scott 1991].

6.11 Exercises

6.1. You have two alternatives of similar price for buying a computer:
• a 4-processor system with 1 Gbyte of RAM and 20Gbytes of disk, but which cannot be
upgraded further

• a 2-processor system with 256 Mbytes of RAM and 10 Gbytes of disk, which can be
expanded to 20 processors, 100 Gbytes of disk and 16Gbytes of RAM

• adding one extra processor to the 2-processor system costs approximately the same as
the 4-processor system, if you figure in a trade-in of last year’s model

a. Discuss the economics of upgrading the 2-processor system over the next 3 years,
versus replacing the 4-processor system every year by a faster one.
b. Discuss the impact of CPU learning curves on the usefulness of the remainder of the
hardware on the 2-processor system, as new upgrades are bought in the future

6.2. Do questions 7.6, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.10, 8.11, 8.13, 8.14, 8.16, 8.17,
8.20.

References
[August et al. 1998] D August, D Connors, S Mahlke, J Sias, K Crozier, B Cheng, P Eaton, Q Olaniran and
W Hwu. Integrated Predication and Speculative Execution in the IMPACT EPIC Architecture, Proc.
ISCA ’25: 25th Int. Symp. on Computer Architecture, Barcelona, June-July 1998, pp 227-237.

[Cheriton et al. 1991] DR Cheriton, HA Goosen and PD Boyle. ParaDiGM: A Highly Scalable Shared-
Memory Architecture, Computer, vol. 24 no. 2 February 1991, pp 33–46.

[Cheriton et al. 1993] DR Cheriton, HA Goosen, H Holbrook and P Machanick. Restructuring a Parallel Sim-
ulation to Improve Cache Behavior in a Shared-Memory Multiprocessor: The Value of Distributed Syn-
chronization, Proc. 7th Workshop on Parallel and Distributed Simulation, San Diego, May 1993, pp 159–
162.

[Crisp 1997] R Crisp. Direct Rambus Technology: The New Main Memory Standard, IEEE Micro, vol. 17
no. 6, November/December 1997, pp 18-28.

[Dulong 1998] Carole Dulong. The IA-64 Architecture at Work, Computer vol. 31, no. 7, July 1998, pp 24-32.

[IMT 1997] Information Mass-Transit web site, continuously updated.
<http://www.cs.wits.ac.za/~philip/mass-transit/transit.html>.

[Jacob and Mudge 1998] B Jacob and T Mudge. Virtual Memory: Issues of Implementation, Computer, vol.
31 no. 6 June 1998, pp 33-43.

[Johnson 1995] EE Johnson. Graffiti on the Memory Wall, Computer Architecture News, vol. 23, no. 4, Sep-
tember 1995, pp 7-8.

[Lenoski et al. 1990] D Lenoski, J Laudon, K Gharachorloo, A Gupta and J Hennessy. The Directory-Based
Cache Coherence Protocol for the DASH Multiprocessor, Proc. 17th Int. Symp. on Computer Architec-
ture, Seattle, WA, May 1990, pp 148–159.

[Lewis 1996a] T Lewis. The Next 10,000² Years: Part I, Computer vol. 29 no. 4 April 1996 pp 64-70.

[Lewis 1996b] T Lewis. The Next 10,000² Years: Part II, Computer vol. 29 no. 5 May 1996 pp 78-86.

[Machanick 1996] P Machanick. The Case for SRAM Main Memory, Computer Architecture News, vol. 24,
no. 5, December 1996, pp 23-30.
<ftp://ftp.cs.wits.ac.za//pub/users/philip/research/arch/SRAM-main.ps.gz>.

[Machanick 2000] P Machanick. Scalability of the RAMpage Memory Hierarchy, South African Computer Journal, no. 25
August 2000, pp 68-73 (longer version: Technical Report TR-Wits-CS-1999-3 May 1999).

[Machanick and Salverda 1998a] P Machanick and P Salverda. Preliminary Investigation of the RAMpage
Memory Hierarchy, South African Computer Journal, no. 21 August 1998, pp 16–25.
<http://www.cs.wits.ac.za/~philip/papers/rampage.html>.

[Machanick and Salverda 1998b] P Machanick and P Salverda. Implications of Emerging DRAM Technolo-
gies for the RAMpage Memory Hierarchy, Proc. SAICSIT ’98, Gordon’s Bay, November 1998, pp 27–40.
<http://www.cs.wits.ac.za/~philip/papers/rampage-sdram.html>

[Machanick et al. 1998] P Machanick, P Salverda and L Pompe. Hardware-Software Trade-Offs in a Direct
Rambus Implementation of the RAMpage Memory Hierarchy, Proc. ASPLOS-VIII Eighth Int. Conf. on
Architectural Support for Programming Languages and Operating Systems, San Jose, October 1998, pp
105–114. <http://www.cs.wits.ac.za/~philip/papers/acm_published/rampage-cx-switch.html>.

[Mellor-Crumney and Scott 1991] JM Mellor-Crumney and ML Scott. Algorithms for Scalable Synchroniza-
tion on Shared-Memory Multiprocessors, ACM Trans. on Computer Systems, vol. 9 no. 1 February 1991,
pp 21–65.

[Nagle et al. 1993] D Nagle, R Uhlig, T Stanley, S Sechrest, T Mudge and R Brown. Design Trade-Offs for
Software Managed TLBs, Proc. Int. Symp. on Computer Architecture, May 1993, pp 27–38.

[Patterson and Ditzel 1980] DA Patterson and DR Ditzel. The case for the reduced instruction set computer,
Computer Architecture News, vol. 8, no. 6, October 1980, pp 25-33.

[RAMpage 1997] RAMpage web site, continuously updated.
<http://www.cs.wits.ac.za/~philip/architecture/rampage.html>.

[Rivers et al. 1997] JA Rivers, GS Tyson, TM Austin and ES Davidson. On High-Bandwidth Data Cache
Design for Multi-Issue Processors. Proc. of 30th IEEE/ACM Int. Symp. on Microarchitecture, December
1997. <http://www.eecs.umich.edu/~jrivers/MICRO-30.ps.gz>.

[Wall 1991] David W. Wall. Limits of Instruction-Level Parallelism, Proc. 4th Int. Conf. on Architectural Sup-
port for Programming Languages and Operating Systems, April 1991, pp 176–188.

[Wulf and McKee 1995] WA Wulf and SA McKee. Hitting the Memory Wall: Implications of the Obvious,
Computer Architecture News, vol. 23 no. 1, March 1995, pp 20-24.

