
1MCS1 Computer System Design

Chapters selected from 'Computer Architecture: From Microcomputers to Supercomputers' by Behrooz Parhami


Table of Contents

UNIT 1
COMPUTER SYSTEMS TECHNOLOGY
COMPUTER PERFORMANCE

UNIT 2
INSTRUCTIONS AND ADDRESSING
PROCEDURES AND DATA
ASSEMBLY LANGUAGE PROGRAMS

UNIT 3
NUMBER REPRESENTATION
ADDERS AND SIMPLE ALUS
MULTIPLIERS AND DIVISORS

UNIT 4
INSTRUCTION EXECUTION STEPS
CONTROL UNIT SYNTHESIS

UNIT 5
MAIN MEMORY CONCEPTS
CACHE MEMORY CONCEPTS
MASS MEMORY CONCEPTS
VIRTUAL MEMORY AND PAGING

UNIT 6
INPUT/OUTPUT DEVICES
INPUT/OUTPUT PROGRAMMING
BUSES, LINKS AND INTERFACES
CONTEXT SWITCHING AND INTERRUPTS


Unit 1
COMPUTER SYSTEMS TECHNOLOGY
CHAPTER TOPICS

1. From components to applications
2. Computer systems and their parts
3. Generations of progress
4. Processor and memory technologies
5. Peripherals, I/O, and communications
6. Software and application systems

Computer architecture is driven by developments in computing technology and, in turn, motivates and influences such developments. This chapter provides some background on past progress and current trends in computer technology, to the extent needed to properly understand the topics in the rest of the book. After observing the development of computer systems over time, the technology of integrated digital circuits is briefly reviewed. Among other things, it is shown how the experimentally derived Moore's law has accurately predicted improvements in integrated circuit performance and density over the past decades, and what this trend means for computer architecture. This presentation is followed by a discussion of processor, memory, mass storage, input/output, and communication technologies. The chapter concludes with a review of software systems and applications.

1. From components to applications

Electronic, mechanical, and optical engineering artifacts are found in all modern computer systems. This text is mainly interested in the electronic data manipulation and control aspects of computer design. The mechanical (keyboard, disk, printer) and optical (display) parts are taken for granted, not because they are less important than the electronic parts, but because the latter constitute the focus of computer architecture as a discipline. The chain linking the capabilities of electronic components on one side of Figure 3.1 to the application domains sought by end users on the other side involves many subdivisions of computer engineering and associated specialties. In this chain, the computer designer or architect occupies a central position between hardware designers, who work at the logic and circuit levels, and software designers, who deal with system programming and applications. Of course, this should not be surprising: had this been a book about floral arrangement, the florist would have been shown to be the center of the universe! Moving from right to left in Figure 3.1, you find increasing levels of abstraction. The circuit designer has the lowest-level view and deals with the physical phenomena that cause computer hardware to perform its tasks. The logic designer works mainly with models such as gates and flip-flops, discussed in Chapters 1 and 2, and relies on design tools to accommodate any circuit considerations that may show through the imperfect abstraction. The computer architect requires some familiarity with the logic-level view, although he or she deals primarily with higher-level digital components such as adders and register files. The architect must also be aware of the concerns of system design, whose goal is to provide a layer of software that facilitates the task of designing and developing applications. In other words, the system designer augments the raw hardware with key software components that shield the user from the details of hardware operation, file storage formats, protection mechanisms, communication protocols, and the like, offering instead an easy-to-use interface to the machine. Finally, the application designer, who has the highest-level view, uses the facilities offered by the lower-level hardware and software to devise solutions to application problems that interest a particular user or class of users.
Computer architecture, whose name attempts to reflect its similarity to building architecture (Figure 3.2), has been aptly described as the interface between hardware and software. Traditionally, the software side is associated with "computer science" and the hardware side with "computer engineering". This dichotomy is somewhat misleading, as there are many scientific and engineering considerations on both sides. Software development is also an engineering activity; the process of designing a modern operating system or database management system is not very different from designing an aircraft or a suspension bridge. Conversely, much of the daily routine of hardware designers involves software and programming (as in programmable logic devices, hardware description languages, and circuit simulation tools). Meeting time, cost, and performance goals, required in both software and hardware projects, is a hallmark of engineering activity, as is adherence to compatibility and interoperability standards in the resulting products.


Computer architecture isn't just for machine designers and builders; informed and effective users at every level benefit from a firm grasp of the fundamental ideas and an awareness of the more advanced concepts in this field. Certain key insights, such as the fact that a 2x GHz processor is not necessarily twice as fast as an x GHz model (see Chapter 4), require basic training in computer architecture. In a way, using a computer is similar to driving a car. You can do a passable job just by learning the driver's interfaces in the car and the rules of the road. However, to be a truly good driver, you must know something about how a car works and how the parameters that affect performance, comfort, and safety are interrelated. Because both extremely large and extremely small quantities (multi-gigabyte disks, sub-nanometer circuit elements, etc.) arise in describing computer systems and their parts, Table 3.1 is included; it lists the prefixes of the metric system of units and clarifies the convention followed in this book with respect to the prefixes used to describe memory capacity (kilobytes, gigabits, etc.). A few very large multiples and extremely small fractions that have not yet been used in connection with computer technology are also included. As you read the following sections and note the pace of progress in this area, you will understand why the latter may become relevant in the not-so-distant future.

TABLE 3.1 Symbols and prefixes for multiples and fractions of units.

Multiple  Symbol  Prefix    Multiple  Symbol*  Prefix*        Fraction  Symbol  Prefix
10^3      k       kilo      2^10      K or Kb  binary-kilo    10^-3     m       milli
10^6      M       mega      2^20      Mb       binary-mega    10^-6     µ or u  micro
10^9      G       giga      2^30      Gb       binary-giga    10^-9     n       nano
10^12     T       tera      2^40      Tb       binary-tera    10^-12    p       pico
10^15     P       peta      2^50      Pb       binary-peta    10^-15    f       femto
10^18     E       exa       2^60      Eb       binary-exa     10^-18    a       atto
10^21     Z       zetta     2^70      Zb       binary-zetta   10^-21    z       zepto
10^24     Y       yotta     2^80      Yb       binary-yotta   10^-24    y       yocto

* Note: The symbol K is often used to mean 2^10 = 1 024. Because the same convention cannot be applied to other multiples, whose symbols are already uppercase letters, a subscript b is used to denote the comparable powers of 2. When memory capacity is specified, the subscript b is always understood and may be dropped; that is, 32 MB and 32 MbB represent the same amount of memory. Prefixes for powers of 2 are read as binary-kilo, binary-mega, and so on.
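To make the decimal/binary prefix convention concrete, the following short Python sketch tabulates the ratio between each binary multiple and its decimal counterpart; the gap between the two is the relative error that Problem 5 at the end of this chapter asks about.

    # Decimal (power-of-10) prefixes from Table 3.1 versus their
    # binary (power-of-2) counterparts, and the relative error of
    # confusing one with the other.
    DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
               "P": 10**15, "E": 10**18, "Z": 10**21, "Y": 10**24}
    BINARY = {"k": 2**10, "M": 2**20, "G": 2**30, "T": 2**40,
              "P": 2**50, "E": 2**60, "Z": 2**70, "Y": 2**80}
    for sym in DECIMAL:
        ratio = BINARY[sym] / DECIMAL[sym]
        print(f"{sym}: 2-based/10-based = {ratio:.4f} "
              f"(relative error {100 * (ratio - 1):.1f}%)")
    # The error grows from 2.4% for kilo to about 20.9% for yotta.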

2. Computer systems and their parts


Computers can be classified according to size, power, price, application, technology, and other aspects of their implementation or use. When we talk about computers today, we usually refer to a type of machine whose full name is "electronic, general-purpose, stored-program digital computer" (Figure 3.3). Such computers used to be designed in versions optimized either for number crunching, needed in numerically intensive calculations, or for data manipulation, typical of business applications. The differences between these two categories of machines have all but disappeared in recent years. Analog, fixed-function, non-electronic, and special-purpose computers are also widely used, but the focus of this book for the study of computer architecture is on the most common type of computer, which corresponds to the path highlighted in Figure 3.3.


A useful categorization of computers is based on computational power or price range: from small embedded microcontrollers (basically simple, low-cost microprocessors) at the base of the scale, through personal computers, workstations, and servers in the middle, to mainframes and supercomputers at the top. The pyramid shape of Figure 3.4 is meant to convey that the computers closer to the base are more numerous and also represent a greater investment in aggregate cost. As computers have achieved better performance/cost ratios, user demand for computing power has grown at an even faster rate, so that supercomputers at the top tend to cost between $10 million and $30 million. Similarly, the cost of a personal computer of modest power ranges from one thousand to three thousand dollars. These observations, together with the highly competitive nature of the computer market, imply that the cost of a computer system is a surprisingly accurate indicator of its capabilities and computing power; more accurate, at least, than any other single numerical indicator. Note, however, that the costs shown in Figure 3.4 are expressed in easy-to-remember round numbers. Embedded computers are used in household appliances, automobiles, telephones, cameras, entertainment systems, and many modern gadgets. The type of computer varies with the intended processing tasks. Home appliances tend to contain very simple microcontrollers capable of tracking status information from sensors, converting between analog and digital signals, measuring time intervals, and turning actuators on or off by asserting enable and inhibit signals. Most current automotive applications are similar, except that they require more rugged types of microcontrollers. Increasingly, however, computers installed in cars are expected to perform more than simple control functions (Figure 3.5), given the trend toward greater use of information and entertainment features. Telephones, digital cameras, and audio/video entertainment systems require many signal processing functions over multimedia data. For this reason, they tend to use a digital signal processor (DSP) chip instead of a microcontroller. Personal computers are equally varied in their computational capabilities and intended uses. They fall into two main categories: portable and desktop. The former are known as laptops or notebooks; subnotebooks and pocket PCs are smaller, more limited models. Tablet versions, intended to replace paper notepads and pens, are also available. A desktop computer may have its CPU and peripherals housed in a unit on which the monitor sits, or in a standing "tower" (Figure 3.6) that offers more room for expansion and can also be hidden out of the way.


Since flat-panel displays are now quite affordable, desktop versions increasingly come with flat screens that offer the benefits of clearer images, lower power consumption, and less heat dissipation (hence lower air conditioning cost). A workstation is essentially a desktop computer with more memory, higher I/O and communication capabilities, a larger display screen, and more advanced system and application software. While a personal computer or workstation is usually associated with a single user, servers and mainframes are workgroup, departmental, or enterprise computers. At the low end, a server is very similar to a workstation and is distinguished from it by more extensive software support, larger main memory, larger secondary storage, higher I/O capacity, faster communication links, and greater reliability. Capacity and reliability are particularly important factors, given the severe and recurring technical and financial consequences of unforeseen downtime. It is common to use multiple servers to meet capacity, bandwidth, or availability requirements; taken to the extreme, this approach leads to server farms. A high-end server is basically a smaller or less expensive version of a mainframe computer; in decades past it would have been called a minicomputer. A mainframe, with all its peripheral devices and support equipment, can fill a room or a large area. Supercomputers are the glamorous products of the computer industry. They represent a small fraction of the total monetary value of computer goods shipped and an even smaller fraction of the number of computer installations. However, these machines, and the enormously challenging computational problems that motivate their design and development, always grab the headlines. Part of the fascination with supercomputers is the knowledge that many of the advances introduced in such machines often find their way, a few years later, into workstations and desktops; so, in a sense, they offer a window into the future. A supercomputer has been defined, half-jokingly, as "any machine still on the drawing board" and "any machine costing $30 million." The computational power of the largest available computers has grown from millions of instructions or floating-point operations per second (MIPS/MFLOPS) in the 1970s, to GIPS/GFLOPS in the mid-1980s, and to TIPS/TFLOPS in the late twentieth century. The machines now on the drawing board aim at PIPS/PFLOPS performance. In addition to the numerically intensive calculations traditionally associated with supercomputers, such machines are increasingly used for data storage and high-volume transaction processing.

Regardless of size, price range, or application domain, a digital computer is made up of certain key parts, shown in Figure 3.7. This diagram is meant to accommodate the different views that have been advanced: from the three-part view (CPU, memory, I/O), through the four-part (processor, memory, input, output) and five-part (datapath, control, memory, input, output) views found in most textbooks, to the six-part view preferred in this text, which includes an explicit mention of the linking component (link). Note that any of these versions, including that of this text, represents a simplified view that aggregates functionality which may in fact be distributed through the system. This is certainly true for control: the memory subsystem often has its own controller, as do the input/output subsystem, individual I/O devices, the network interface, and so on. Similarly, processing capabilities exist everywhere: in the I/O subsystem, within various devices (e.g., printers), at the network interface, and so on. However, keep in mind that we don't learn about airplanes by starting with a Concorde or a sophisticated military jet!


3. Generations of progress

Most computer technology chronologies distinguish a number of generations of computers, each beginning with a breakthrough in component technology. Four generations are usually identified, coinciding with the use of vacuum tubes, transistors, small- to medium-scale integrated circuits (SSI/MSI), and large- to very-large-scale integration (LSI/VLSI). To these can be added the entire prior history of computing as generation 0 and the dramatic advances of the late 1990s as generation 5 (Table 3.2). Note that the entries in Table 3.2 are somewhat oversimplified; the intention is to picture advances and trends in broad strokes, rather than to present a cluttered table mentioning every technological advance and innovation. Interest in computing, and in mechanical aids to facilitate it, began very early in the history of civilization. The introduction of the abacus by the Chinese is, of course, well known. The first sophisticated aids to computing appear to have been developed in ancient Greece for astronomical calculations. For practical purposes, the roots of modern digital computing can be traced back to the mechanical calculators designed and built by seventeenth-century mathematicians. With his difference engine of the 1820s and his programmable analytical engine a decade or so later, Charles Babbage established himself as the father of digital computing as it is known today. Although nineteenth-century technology did not allow a full implementation of Babbage's ideas, historians agree that he had working formulations of many key concepts in computing, including programming, instruction sets, operation codes, and program loops. The emergence of reliable electromechanical relays in the 1940s allowed many of these ideas to be put into practice.

The ENIAC, built in 1945 under the supervision of John Mauchly and J. Presper Eckert at the University of Pennsylvania, is often cited as the first electronic computer, although many concurrent or earlier efforts are now known; for example, those of John Atanasoff at Iowa State University and Konrad Zuse in Germany. The ENIAC weighed 30 tons, occupied almost 1 500 m^2 of floor space, used 18 000 vacuum tubes, and consumed 140 kW of electricity. It could perform about five thousand additions per second. The notion of stored-program computing was developed and refined in groups headed by John von Neumann in the United States and Maurice Wilkes in England, leading to working machines in the late 1940s and commercial digital computers for scientific and business applications in the early 1950s. These first-generation, general-purpose, stored-program digital computers were tailored to either scientific or business applications, a distinction that has since virtually disappeared. Examples of first-generation machines include the UNIVAC 1100 series and IBM's 700 series.


The emergence of the second generation is associated with the shift from vacuum tubes to much smaller, cheaper, and more reliable transistors. Equally important, however, if not more so, were developments in storage technology, along with the introduction of high-level programming languages and system software. NCR and RCA pioneered second-generation computing products, followed by IBM's 7000 series of machines and, later, Digital Equipment Corporation's (DEC) PDP-1. These computers began to look more like office machines than factory equipment. This fact, together with the ease of use afforded by more sophisticated software, led to the proliferation of scientific computing and enterprise data processing applications. The ability to integrate many transistors and other elements, until then built as discrete components, into a single circuit solved many problems that were becoming quite serious as the complexity of computers grew to tens of thousands of transistors and beyond. Integrated circuits not only ushered in the third generation of computers but also fueled a sweeping microelectronics revolution that continues to shape the information-based society. Perhaps the most successful and influential third-generation computing product, which also helped bring attention to computer architecture as something distinct from particular machines or implementation technologies, was the IBM System/360 family of compatible machines. This series began with relatively inexpensive low-end machines for small businesses and extended to very large multimillion-dollar supercomputers that used the latest technological and algorithmic innovations to achieve maximum performance, all based on the same global architecture and instruction set. Another influential machine of this generation, DEC's PDP-11, brought with it the era of affordable minicomputers capable of operating in the corner of an office or laboratory, rather than requiring a large air-conditioned computer room. As integrated circuits became larger and denser, it finally became possible, in the early 1970s, to place a complete, though very simple, processor on a single IC chip. This, together with phenomenal increases in memory density that allowed an entire main memory to reside on a handful of chips, led to the popularization of low-cost microcomputers. Apple Computer was the undisputed early leader in this area, but it wasn't until IBM introduced its open-architecture PC (meaning that components, peripherals, and software from different companies could coexist on a single machine) that the PC revolution took off in earnest. To this day, the term PC is synonymous with IBM and IBM-compatible microcomputers. Larger computers also benefited from advances in IC technology; with fourth-generation desktop machines offering the capabilities of earlier-generation supercomputers, new high-performance machines continue to push the frontiers of scientific computing and data-intensive commercial applications.

There is no general agreement on whether the fourth generation has already ended and, if so, when the transition to the fifth began. The 1990s witnessed dramatic improvements not only in IC technology but also in communications. If the advent of pocket PCs, GFLOPS desktop computers, wireless Internet access, gigabit memory chips, and multi-gigabyte disks barely larger than a wristwatch is inadequate to signal the dawn of a new era, it is hard to imagine what would be. The IC part of these advances is characterized as ultra-large-scale integration (ULSI). It is now possible to place an entire system on a single IC chip, leading to unprecedented speed, compactness, and power economy. Increasingly, computers are viewed as components within other systems rather than as expensive systems in their own right. A brief look at the integrated circuit manufacturing process (Figure 3.8) helps in understanding some of the current difficulties, as well as the challenges to be overcome, for the continued growth of computer capabilities in future generations. In essence, integrated circuits are printed on silicon wafers through a complex multistep chemical process that deposits a layer (insulator, conductor, etc.), removes the unneeded parts of that layer, and proceeds to the next layer. Hundreds of dice can be formed on a single blank wafer sliced from a cylindrical silicon crystal (ingot). Because the crystal and the deposition process (which involves tiny features in extremely thin layers) are imperfect, some of the dice thus formed will not work as expected and must be discarded. Other imperfections arise during the assembly process, leading to more discarded parts. The ratio of the number of usable parts obtained at the end of this process to the number of dice started with is called the yield of the manufacturing process. Many factors affect yield. A key factor is the complexity of the die, in terms of both the area it occupies and the intricacy of its design. Another is the density of defects on the wafer. Figure 3.9 shows that a distribution of 11 small defects on the wafer surface can lead to 11 defective dice out of 120 (yield 109/120 ≅ 91%), while the same defects would render 11 of 26 larger dice unusable (yield 15/26 ≅ 58%). And this accounts only for the contribution of wafer defects; the situation is worsened by defects arising from the more complex structure and interconnection patterns of the larger dice.

The cost of a die is obtained by dividing the cost of the wafer by the number of good dice it yields:

die cost = wafer cost / (dice per wafer × die yield)

The only variables in this equation are die area and yield. Because yield is a decreasing function of die area and defect density, the cost per die is a superlinear function of die area; that is, doubling the die area to accommodate more functionality on a chip more than doubles the cost of the finished part. Specifically, experimental studies show that die yield is well approximated by

die yield ≅ wafer yield × [1 + (defect density × die area)/a]^(−a)

where wafer yield accounts for wafers that are completely unusable and the parameter a is estimated to be in the range 3 to 4 for modern CMOS processes.


Example 3.1: Effect of die size on cost

Assume that the dice in Figure 3.9 measure 1 × 1 and 2 × 2 cm^2, ignoring the specific defect pattern shown. If a defect density of 0.8/cm^2 is assumed, how much more expensive will the 2 × 2 die be than the 1 × 1 die?

Solution: Let w be the wafer yield. From the die yield formula, yields of 0.492w and 0.113w are obtained for the 1 × 1 and 2 × 2 dice, respectively, assuming a = 3. Plugging these values into the formula for die cost, the 2 × 2 die is found to cost (120/26) × (0.492/0.113) = 20.1 times as much as the 1 × 1 die; this represents a factor of 120/26 = 4.62 attributable to the smaller number of dice per wafer and a factor of 0.492/0.113 = 4.35 due to the lower yield. With a = 4, the ratio assumes the slightly larger value (120/26) × (0.482/0.095) ≅ 23.4.
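The yield and cost calculation of Example 3.1 is easy to mechanize. The following Python sketch reproduces the numbers above; the wafer yield w cancels out of the cost ratio, so it is simply set to 1 here.

    def die_yield(defect_density, die_area, a, wafer_yield=1.0):
        """Empirical model: wafer yield x [1 + (defect density x area)/a]**(-a)."""
        return wafer_yield * (1 + defect_density * die_area / a) ** (-a)

    def relative_die_cost(dice_per_wafer, yield_):
        """Die cost is proportional to 1 / (dice per wafer x die yield)."""
        return 1 / (dice_per_wafer * yield_)

    for a in (3, 4):
        y1 = die_yield(0.8, 1.0, a)   # 1 x 1 cm die, 120 per wafer
        y2 = die_yield(0.8, 4.0, a)   # 2 x 2 cm die, 26 per wafer
        ratio = relative_die_cost(26, y2) / relative_die_cost(120, y1)
        print(f"a = {a}: yields {y1:.3f} and {y2:.3f}, cost ratio {ratio:.1f}")
    # Prints cost ratios of about 20.1 (a = 3) and 23.4 (a = 4).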

4. Processor and memory technologies

As a result of advances in electronics, processors and memories have improved dramatically. The need for faster processors and larger memories fuels the phenomenal growth of the semiconductor industry, which has produced relentless growth in the number of devices that can be put on a single chip. Part of this growth has resulted from the ability to economically design and manufacture larger dice, but the main factor is the increasing density of devices per unit of die area. The exponential increase in the number of devices on a chip over the years is known as Moore's law, which predicts an annual increase of about 60% (× 1.6 per year ≅ × 2 every 18 months ≅ × 10 every five years). This prediction has proved so accurate that a long-term plan, known as the semiconductor industry roadmap, is based on it. For example, according to this roadmap it is known with some certainty that by 2010 chips with billions of transistors will be technically and economically feasible (memory chips of that size are already available). Today, the best-known processors are Intel's Pentium family of chips and the compatible products offered by AMD and other manufacturers. The 32-bit Pentium processor has its roots in the 16-bit 8086 chip and its companion 8087 floating-point coprocessor, introduced by Intel in the late 1970s. As the need for 32-bit machines, which among other things can handle wider memory addresses, became apparent, Intel introduced the 80386 and 80486 processors before moving to the Pentium and its improved models, identified as Pentium II, III, and 4. Newer versions of these chips not only contain the floating-point unit on the same chip but also have on-chip caches. Each model in this product sequence introduces improvements and extensions over the previous model, but the core instruction set architecture remains unchanged. At the time of this writing, the 64-bit Itanium architecture had been introduced to further increase the addressable space and computational power of the Pentium series. Although Itanium is not a simple extension of the Pentium, it is designed to run programs written for its predecessors. PowerPC, made by IBM and Motorola, which derives its fame from its incorporation into Apple computers, is an example of a modern RISC processor (see Chapter 8). Other examples in this category include MIPS products and the DEC/Compaq Alpha processor. As a result of denser, and therefore faster, circuits, along with architectural improvements, processor performance has grown exponentially.
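As a quick sanity check on the Moore's law figures quoted earlier in this section, the stated equivalences of the × 1.6-per-year rate can be verified in a few lines of Python:

    from math import log

    annual = 1.6
    print(annual ** 1.5)         # 18 months: about 2.02, i.e., roughly x2
    print(annual ** 5)           # 5 years: about 10.5, i.e., roughly x10
    # The doubling time implied by a factor f per year is log(2)/log(f):
    print(log(2) / log(annual))  # about 1.47 years, i.e., roughly 18 months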
Moore's law, which predicts an improvement factor of 1.6 per year in component density, is also applicable to the trend in processor performance measured in instructions executed per second (IPS). Figure 3.10 shows this trend, along with data points for some key processors (Intel's Pentium and its 80x86 predecessors, Motorola's 6800 series, and the MIPS R10000). Moore's law, when applied to memory chips, predicts a capacity improvement by a factor of 4 every three years (memory chip capacity is almost always an even power of 2; hence the appropriateness of the factor-of-4 formulation). Figure 3.10 also shows the memory chip capacity trend since 1980 and its projection for the following years. With the gigabit chip (128 MB), the memory needs of a typical PC fit into one chip or a handful of chips. The next short-term challenge is putting processor and memory on the same device, thus enabling a single-chip personal computer. The processor and memory chips must be connected to each other and to the other parts of the machine shown in Figure 3.7. Various packaging schemes are used for this purpose, depending on the type of computer and its cost/performance objectives. The most common form of packaging is sketched in Figure 3.11a. Memory chips are mounted on small printed circuit boards known as daughter cards. One or more of these, usually with a single row of connector pins (single in-line memory modules, or SIMMs), are then mounted on the motherboard, which holds the processor, the system bus, various interfaces, and a variety of connectors. All the circuitry of a small computer can fit on a single motherboard, while larger machines may require many such boards mounted in a chassis or card cage and interconnected through a backplane. The single motherboard, or the chassis holding multiple boards, is packaged with peripheral devices, power supply, cooling fans, and other required components in a box or cabinet, leaving enough internal space for future expansion.


Virtually all of the estimated 10^18 or so transistors incorporated into integrated circuits to date were built at "ground level," directly on the surface of silicon crystals. Just as one can increase the population density of a city by using high-rise, multistory buildings, it is possible to fit more transistors into a given volume by using 3D packaging [Read 02]. Research in 3D packaging technology is ongoing, and products incorporating these technologies have begun to emerge. One promising scheme under consideration (Figure 3.11b) allows components to be linked directly through connectors deposited on the outside of a 3D cube formed by stacking 2D integrated circuits, eliminating much of the volume and cost of current methods. Dramatic improvements in processor performance and memory chip capacity have been accompanied by equally significant cost reductions. Both computing power and memory capacity per unit cost (MIPS/$ or MFLOPS/$, MB/$) have increased exponentially over the past two decades and are expected to continue on this course for the foreseeable future. Until about a decade ago, it used to be jokingly asserted that we would soon have zero- or negative-cost components. At the turn of the twenty-first century, however, profitability was indeed observed to be shifting away from computer hardware toward software products and services. To put the significance of these improvements in computer performance, and the accompanying cost reductions, in perspective, an interesting comparison is sometimes used. It is stated that if the aviation industry had advanced at the same rate as the computer industry, travel from the United States to Europe would now take a few seconds and cost a few cents. A similar analogy, applied to the auto industry, would lead to the expectation of being able to buy a luxury car that traveled as fast as a jet and ran forever on a single tank of gas, for the price of a cup of coffee. Unfortunately, the reliability of application and system software has not improved at the same rate, which has led to the counter-assertion that if the computer industry had developed in the same way as the transportation industry, the Windows operating system would crash no more than once a century!

5. Peripherals, I/O and communications


The current state of computing would not have been possible with improvements in processor performance and memory density alone. Progress in input/output technologies, from printers and scanners to mass storage units and communication interfaces, has been just as phenomenal. Today's $100 hard drives, which easily fit into the thinnest notebook computers, can store as much data as would have filled a room full of cabinets packed with dozens of very expensive drives in the 1970s. Input/output (I/O) devices are discussed in more detail in Chapter 21. Here, a brief overview of the types of I/O devices and their capabilities is offered, with the goal of completing this broad-brush picture of modern computer technology. Table 3.3 lists the main categories of I/O devices. Note that punched card and paper tape readers, printing terminals, magnetic drums, and other devices no longer in use are not mentioned. Input devices can be categorized in many ways. Table 3.3 uses the type of input data as the primary feature. Input data types include alphanumeric symbols, positional information, identity verification, sensory information, audio signals, images, and video. For each class, prime examples and additional examples are given, along with typical data rates and primary application domains. Most input devices produce little data and therefore do not need much computing power to be serviced. For example, the peak input rate from a keyboard is limited by how fast humans can type. Assuming 100 words (500 bytes) per minute results in a data rate of about 67 b/s. A modern processor could handle input data from millions of keyboards, should that be required. At the other extreme, high-quality video input may require capturing millions of pixels per frame, each coded in 24 bits (eight bits per primary color), at a rate of, say, 100 frames per second. This translates to a data rate of billions of bits per second and challenges the power of the fastest computers available today.
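The two data-rate estimates just given can be checked with a few lines of Python; the two-megapixel frame size used below is an assumption chosen to stand in for "millions of pixels per frame".

    # Keyboard: 100 words/minute, taken as 500 bytes/minute
    keyboard_bps = 500 * 8 / 60
    print(f"keyboard input: {keyboard_bps:.0f} b/s")    # about 67 b/s

    # Video: pixels/frame x bits/pixel x frames/second
    video_bps = 2_000_000 * 24 * 100                    # 2 Mpixels assumed
    print(f"video input: {video_bps / 1e9:.1f} Gb/s")   # about 4.8 Gb/s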

TABLE 3.3 Some input, output, and two-way I/O devices.

Input type    Prime examples        Other examples       Data rate (b/s)  Main uses
Symbol        Keyboard, keypad      Musical note, OCR    10               Ubiquitous
Position      Mouse, touchpad       Joystick, wheel,     100              Ubiquitous
                                    glove
Identity      Barcode reader        Badge, fingerprint   100              Sales, security
Sensory       Touch, motion,        Scent, brain         100              Control, security
              light sensor          signal
Audio         Microphone            Telephone, radio,    1 000            Ubiquitous
                                    tape
Image         Scanner, camera       Graphic tablet       Millions         Photography,
                                                                          advertising
Video         Video camera, DVD     VCR, cable TV        Billions         Entertainment

Output type   Prime examples        Other examples       Data rate (b/s)  Main uses
Symbol        LCD line segments     LED, status light    10               Ubiquitous
Position      Stepper motor         Robotic motion       100              Ubiquitous
Warning       Buzzer, bell, siren   Flashing light       A few            Safety, security
Sensory       Braille text          Scent, brain         100              Personal
                                    stimulus                              assistance
Audio         Speaker, audiotape    Speech               1 000            Ubiquitous
                                    synthesizer
Image         Monitor, printer      Plotter, microfilm   Millions         Ubiquitous
Video         Monitor, TV screen    Film/video           Billions         Entertainment
                                    recorder

Two-way I/O   Prime examples        Other examples       Data rate (b/s)  Main uses
Mass storage  Hard drive, CD drive  Floppy, tape,        Millions         Ubiquitous
                                    archive
Network       Modem, fax, LAN       Cable, DSL, ATM      Billions         Ubiquitous

Output devices are categorized similarly in Table 3.3, except that the "identity" row is replaced with "warning". In theory, triggering an alarm requires a single bit and represents the slowest possible I/O rate. Again, at the high end, real-time video output can require a data rate of billions of bits per second. A high-speed printer, printing dozens of color pages per minute, is only slightly less demanding. For both video and still images, the data rate can be reduced by image compression. However, compression affects only the data transmission and buffering rates between the computer and the printer; at some point, on the computer side or inside the print engine controller, the full data rate implied by the number of pixels to be transferred to paper must be handled. Note that a single-megapixel color image needs almost 3 MB of storage. Many modern computers have a dedicated video memory of this size or larger, which allows a full-screen image to be stored, limiting CPU data transfers to only the items that change from one image to the next. Most of the I/O options listed in Table 3.3 are in common use today; a few are considered exotic or are at the experimental stage (gloves for 3D position/motion input, scent input or output, detection or generation of brain signals). Even these, however, are expected to achieve mainstream status in the future.
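The frame-buffer arithmetic mentioned above is summarized in the short sketch below; the 10:1 compression ratio is an assumed figure for illustration only, since the achievable ratio depends on the image and the method used.

    pixels = 1_000_000                 # one-megapixel image
    image_bytes = pixels * 3           # 3 bytes/pixel (8 bits per color)
    print(f"raw image: {image_bytes / 2**20:.2f} MB")   # about 2.86 MB

    compression_ratio = 10             # assumed, for illustration
    transmitted = image_bytes / compression_ratio
    print(f"transmitted: {transmitted / 2**20:.2f} MB")
    # The print engine must still process the full 3 MB of pixels.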

Certain two-way devices can be used for both input and output. Chief among these are mass storage units and network interfaces. Magnetic hard and floppy disks, as well as optical discs (CD-ROM, CD-RW, DVD), work on similar principles. These devices (Figure 3.12) read or record data along concentric tracks packed onto the surface of a rotating disk platter. On hard disks, a read/write head, which can be moved radially to any desired track, records or senses the data as the appropriate part of the track passes under it. These simple principles have been in use for decades. The marvel of modern mass storage technology is its ability to correctly detect and manipulate bits of data packed so tightly that a single dust particle under the head can wipe out thousands of bits. Over the years, the diameter of magnetic disk memories has shrunk more than tenfold, from tens of centimeters to a few centimeters. Given this hundredfold shrinkage in recording area, the growth in capacity from mere megabytes to many gigabytes is all the more remarkable. Speed improvements have been less impressive; they were achieved through faster actuators to move the heads (which now travel shorter distances) and faster rotation. Floppy and other removable disks are similar, except that, owing to lower precision (leading to lower recording density) and slower rotation, they are both slower and smaller in capacity. Increasingly, inputs arrive via communication lines rather than conventional input devices, and outputs are written to data files accessed over a network. It is not uncommon for a printer sitting next to a computer in an office to be connected to it not directly by a printer cable but through a local area network (LAN). Over a computer network, machines can communicate with each other as well as with a variety of peripherals; for example, file servers, appliances, control devices, and entertainment equipment. Figure 3.13 shows the two key characteristics of bandwidth and latency for a variety of communication systems, from high-bandwidth buses less than 1 m long to wide-area networks spanning the globe. Computers and other devices communicate over a network by means of network interface units. Special protocols are followed to ensure that the various hardware devices can correctly and consistently interpret the data being transmitted. Modems of various types (telephone line, DSL, cable), network interface cards, switches, and routers take part in transmitting data from a source to the desired destination over connections of many different types (copper wires, optical fibers, wireless channels) and through network gateways.


6. Software and application systems

The instructions that computer hardware executes are encoded as strings of 0s and 1s and are therefore indistinguishable from the numbers on which they may operate. A set of such instructions constitutes a machine-language program specifying a step-by-step computational process (Figure 3.14, far right). The first digital computers were programmed in machine language, a tedious process that was acceptable only because the programs of those days were quite simple. Subsequent developments led to the invention of assembly language, which allows symbolic representation of machine-language programs, and of high-level procedural languages reminiscent of mathematical notation. These more abstract representations, with translation software (assemblers and compilers) for automatic conversion of programs to machine language, simplified program development and increased programmer productivity. Much user-level computing is now done through very high-level notations that offer a great deal of expressive power for specific domains of interest. Examples include word processing, image editing, drawing of logic diagrams, and graphing. These levels of abstraction in programming, and the process of going from each level to the next lower one, are shown in Figure 3.14.
As shown in Figure 3.15, computer software can be divided into the classes of application software and system software. The former covers word processors, spreadsheet programs, circuit simulators, and many other programs designed to address specific tasks of interest to users. The latter is divided into programs that translate or interpret instructions written in various notational systems (such as the MIPS assembly language or the C programming language) and those that offer administrative, enabling, or coordinating functions for programs and system resources. The latter are functions that the vast majority of computer users require and that are therefore incorporated into the operating system. Details of operating system functions can be found in many textbooks on the subject (see the references at the end of the chapter). Subsequent chapters of this book briefly discuss some of these topics: virtual memory and some security aspects in Chapter 20, file systems and disk controllers in Chapter 19, I/O device drivers in Chapters 21 and 22, and certain aspects of coordination in Chapters 23 and 24.


PROBLEMS
1. Definition of computer architecture.

The year is 2010 and you are asked to write a one-page article about computer architecture for a children's encyclopedia of science and technology. How would you describe computer architecture to elementary school children? You may use a diagram if necessary.

2. The importance of an architect.

Add one more word before each of the four ellipses of Figure 3.2.

3. The Central Place of Computer Architecture.

In Figure 3.1, the computer designer, or architect, occupies a central position. This is not unexpected: we like to view ourselves, or our profession, as central. Witness the map of the world as drawn by Americans (America in the center, flanked by Europe, Africa, and parts of Asia on the right, with the rest of Asia plus Oceania on the left) and by Europeans (America on the left and all of Asia on the right). (a) Draw the equivalent of Figure 3.1 with the logic designer in the middle. Describe what you would put in the two bubbles on either side. (b) Repeat part (a) for a book about electronic circuits.

4. Complex systems made by humanity.

Computer systems are among the most complex systems made by humankind. (a) Name some systems that you consider to be more complex than a modern digital computer. Elaborate on your criteria for complexity. (b) Which of these systems, if any, has led to a separate field of study in science or engineering?

5. Multiples of units.

Assume that X is the symbol for an arbitrary power-of-10 multiple in Table 3.1. What is the maximum relative error if one mistakenly uses Xb instead of X, or vice versa?

6. Embedded computers.

A frequently needed function in embedded control applications is analog-to-digital (A/D) conversion. Study this problem and prepare a two-page report on your findings. The report must include: (a) At least one way to perform the conversion, including a hardware diagram. (b) A description of an application for which A/D conversion is required. (c) A discussion of the accuracy of the conversion process and its implications.

7. Personal computers.


Name all the reasons you can think of why a laptop or notebook computer can be made much smaller than a desktop computer of comparable computational power.

8. Supercomputers.

Find as much information as you can about the most powerful computer you can identify and write a two-page report about it. Start your search with [Top500] and include the following. (a) The criteria that led to the supercomputer being classified as the most powerful. (b) The company or organization that built the machine and its motivation/customer. (c) The identity and relative computational power of its nearest competitor.

9. Parts of a computer.

Name one or more human organs having functions similar to those associated with each part of Fig.
3.7.

10. History of digital computing.

Charles Babbage, who lived two centuries ago, is considered the "grandfather" of digital computing.
However, many believe that digital computing has much older origins. (a) Study the article [deSo84] and prepare a two-page essay on the topic. (b) Using a modern electronic calculator and a few of the machines described in [deSo84], graph trends in the size, cost, and speed of calculating machines over the centuries.

11. The Future of Pocket/Desktop Calculators.

The abacus, still in use in certain remote areas, is anachronistic in much of the world. The slide rule
became obsolete in a much shorter time. What do you think today's pocket and desk calculators will
become? Do you think they will still be used in 2020? Discuss.

12. Babbage's planned machine.

Charles Babbage planned to build a machine that would multiply 50-digit numbers in less than a minute. (a) Compare the speed of Babbage's machine with that of a human calculator. Can you compare the two in terms of reliability? (b) Repeat the comparisons of part (a) with a modern digital computer.

13. Cost trends.

By plotting the cost of a computer per unit of computational power over time, you get a sharply declining curve. Zero-cost computers are already here. Can you envision negative-cost machines in the future? Discuss.


14. Yield variation with die size.

Figure 3.9 and Example 3.1 show the effect of increasing die size from 1 × 1 cm^2 to 2 × 2 cm^2. (a) Using the same assumptions as in Example 3.1, calculate the relative die yield and cost for 3 × 3 cm^2 dice. (b) Repeat part (a) for rectangular 2 × 4 cm^2 dice.

15. Yield effects on the cost of dice.

A wafer containing 100 copies of a complex processor die has a high production cost. The area occupied by each processor is 2 cm^2 and the defect density is 2/cm^2. What is the manufacturing cost per die?

16. Number of dice on a wafer.

Consider a circular wafer of diameter d. The number of u × u square dice on the wafer is bounded above by πd^2/(4u^2). The actual number will be smaller because of incomplete dice at the edge. (a) Argue that πd^2/(4u^2) − πd/(1.414u) is a fairly accurate estimate of the number of dice. (b) Apply the formula of part (a) to the wafers shown in Figure 3.9 to estimate the number of dice and determine the error in each case; the dice are 1 × 1 and 2 × 2 cm^2, and d = 14 cm. (c) Suggest and justify a formula that would work for non-square u × v dice (e.g., 1 × 2 cm^2).
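As a starting point for part (b), the estimate of part (a) is sketched below in Python; the counts of 120 and 26 dice shown in Figure 3.9 can be used to judge its accuracy.

    from math import pi, sqrt

    def dice_per_wafer(d, u):
        """Estimated number of u x u dice on a wafer of diameter d."""
        return pi * d**2 / (4 * u**2) - pi * d / (sqrt(2) * u)

    for u in (1, 2):
        print(f"{u} x {u} cm dice, d = 14 cm: {dice_per_wafer(14, u):.1f}")
    # Prints about 122.8 and 22.9, versus the actual counts of 120 and 26.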

17. Processor and memory technologies.

Find performance and capacity data on the latest processor and DRAM memory chips. Plot the corresponding points on Figure 3.10. How well do the extrapolated lines fit? Do these new points indicate any slowdown in the rate of progress? Discuss.

18. Computer packaging.

It is planned to build a parallel computer with 4,096 nodes, organized as a 16 × 16 × 16 3D mesh, with each node connected to six neighbors (two along each dimension). Eight nodes fit on a custom VLSI chip, and 16 chips can be placed on a printed circuit board. (a) Devise a partitioning scheme for the parallel computer that minimizes the number of off-chip, off-board, and off-chassis links. (b) Considering the packaging scheme of Figure 3.11a and the partition suggested in part (a), would it be possible to accommodate eight-bit-wide channels? (c) Does the 3D packaging scheme of Figure 3.11b offer any benefit for this design?


COMPUTER PERFORMANCE

CHAPTER TOPICS

1. Cost, Performance and Cost/Performance
2. Definition of Computer Performance
3. Performance Enhancement and Amdahl's Law
4. Performance Measurement vs. Modeling
5. Reporting Computer Performance
6. The Quest for Higher Performance

Chapters 1 to 3 provided the background needed to study computer architecture. The last aspect to consider before going into the details of the subject is computer performance. There is a tendency to equate "performance" with "speed", but that is, at best, a simplistic view. Performance has many facets, and understanding them will help you appreciate the various design decisions encountered in the following chapters. For many years, performance was the key driving force behind advances in computer architecture. It is still very important, but as modern processors have performance to spare for most run-of-the-mill applications, other parameters such as cost, compactness, and power economy are rapidly gaining in significance.

1. Cost, performance and cost/performance


Much of the work in computer architecture deals with methods of improving machine performance. Examples of such methods appear throughout the following chapters. Indeed, virtually all design decisions made in building computers, from instruction set design to the use of implementation techniques such as pipelining, branch prediction, cache memory, and parallelism, are motivated at least in part, if not primarily, by performance improvement. Accordingly, it is important to have a precise operational definition of the concept of performance, to know its relationship to other aspects of computer quality and utility, and to learn how it can be quantified for use in comparisons and purchase decisions.

A second key attribute of a computer system is its cost. In any given year, a computer can probably be designed and built that is faster than the fastest computer then available on the market. However, its cost could be so prohibitive that the machine is never built, or is built in very small quantities for agencies that are interested in advancing the state of the art and do not mind spending exorbitant amounts toward that goal. Thus, the highest-performing machine that is technologically feasible may never materialize because it is not cost-effective (it has an unacceptable cost/performance ratio due to its high cost). It would be simplistic to equate the cost of a computer with its purchase price. Instead, one should try to evaluate its life-cycle cost, which includes upgrades, maintenance, usage, and other recurring costs. Note that a computer bought for two thousand dollars has several distinct costs. It may have cost the manufacturer $1,500 (hardware components, software licenses, labor, shipping, advertising), with the remaining $500 covering sales commissions and profit. It might, in turn, cost its owner four thousand dollars over its lifetime, once service, insurance, additional software, hardware upgrades, and so on are added. To appreciate that computer performance is multifaceted and that any single indicator provides at best an approximate picture, an analogy with passenger aircraft is used. In Table 4.1, six commercial aircraft are characterized by their passenger capacity, cruising range, cruising speed, and purchase price. Based on the data in Table 4.1, which of these aircraft has the highest performance? It would be fair to answer this question with another: performance from whose point of view?

A passenger interested in reducing travel time may equate performance with cruising speed. The Concorde clearly wins in this regard (ignoring the fact that the aircraft is no longer in service). Note that because of the time it takes the aircraft to prepare, take off, and land, the travel-time advantage is smaller than the speed ratio. Now suppose that a passenger's destination city is 8,750 km away. Ignoring pre- and post-flight delays for simplicity, we find that the DC-8-50 would get there in ten hours. The flight time of the Concorde would be only four hours, but some of its advantage disappears once the mandatory refueling stop is factored in. For the same reason, the DC-8-50 is probably better than the faster Boeing 747 or 777 for flights whose distances exceed those aircraft's ranges.

TABLE 4.1 Key characteristics of six passenger aircraft: all figures are approximate; some relate to a
specific model/configuration of the aircraft or are averages of the quoted range of values.

Aircraft Passengers Range (km) Speed (km/h) Price*($M)

Airbus A310 250 8 300 895 120


Boeing 747 470 6 700 980 200
Boeing 767 250 12 300 885 120
Boeing 777 375 7 450 980 180
Concorde 130 6 400 2 200 350
DC-8-50 145 14 000 875 80

* Prices are based on extrapolation and some guesswork. Passenger aircraft are often sold at large discounts from list prices.
Some models, such as the now-retired Concorde, are no longer produced or were never sold on the open market.

And this was just the passenger's point of view. An airline may be more interested in total throughput, defined as the product of passenger capacity and speed (cost issues will be dealt with shortly). If airfares were proportional to the distances flown, which is not the case in the real world, total throughput would determine the airline's ticketing revenue. The six aircraft have throughputs of 0.224, 0.461, 0.221, 0.368, 0.286 and 0.127 million passenger-kilometers per hour, respectively, with the Boeing 747 exhibiting the best overall throughput. Finally, performance from the Federal Aviation Administration's point of view relates primarily to an aircraft's safety record.

Of course, performance is never considered in isolation. Very few people felt that the Concorde's travel-time advantage was worth its much higher airfare. Similarly, very few airlines were willing to pay the higher purchase price of the Concorde. For this reason, combined performance/cost indicators are of interest. Suppose performance is specified by a numerical indicator such that larger values are preferable (aircraft speed is an example). Cost/performance, defined as the cost of one unit of performance, or its inverse, the performance achieved per unit of cost, can be used to compare the cost-effectiveness of various systems. Thus, equating performance with total throughput and cost with purchase price, the cost/performance figures of merit for the six aircraft in Table 4.1 are 536, 434, 543, 489, 1224 and 630, respectively, with smaller values being better.
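
As a quick check of this arithmetic, here is a minimal Python sketch that recomputes the throughput and cost/performance values from the Table 4.1 figures (small differences from the quoted numbers are rounding artifacts):

    # Throughput and cost/performance for the six aircraft of Table 4.1.
    aircraft = [
        ("Airbus A310", 250, 895, 120),   # name, passengers, speed (km/h), price ($M)
        ("Boeing 747", 470, 980, 200),
        ("Boeing 767", 250, 885, 120),
        ("Boeing 777", 375, 980, 180),
        ("Concorde", 130, 2200, 350),
        ("DC-8-50", 145, 875, 80),
    ]
    for name, passengers, speed, price in aircraft:
        throughput = passengers * speed / 1e6      # million passenger-km per hour
        cost_perf = price / throughput             # smaller is better
        print(f"{name:12s} throughput = {throughput:.3f}  cost/performance = {cost_perf:.0f}")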

Of course, the above comparison is rather simplistic; the cost of an aircraft to an airline involves not only its purchase price but also its fuel economy, the availability/price of spare parts, the frequency/ease of maintenance, safety-related costs (e.g., insurance), and so on. For all practical purposes, the cost factor for a passenger is simply the airfare. Note that such a composite measure is even less accurate than performance or cost alone, because it incorporates two factors that are each difficult to quantify.

A final observation on the aircraft analogy: the greater performance of a system is of interest only if one can truly benefit from it. For example, if you travel from Boston to London and the Concorde only takes off from New York, its faster speed may be of no value to you for this particular trip. In computing terms, this is akin to a new 64-bit architecture offering no benefit when the bulk of one's applications were designed for older 32-bit machines. Similarly, if you want to get from Boston to New York, the flight time is such a small fraction of the total time spent that it makes little difference whether your plane is a Boeing 777 or a DC-8-50. For the computational analogue of this situation, consider that replacing a computer's processor with a faster one sometimes has no significant impact on performance, because performance is limited by memory or I/O bandwidth.

Plotting performance against cost (Figure 4.1) reveals three types of trend. Superlinear growth of performance with cost indicates an economy of scale: if you pay twice as much, you get more than twice the performance. This was the case in the early days of digital computers, when supercomputers could perform significantly more calculations per dollar spent than smaller machines. At present, however, a sublinear trend is frequently observed. As more hardware is added to a system, some of the theoretically possible performance is lost to the management and coordination mechanisms needed to run the more complex system. Similarly, an advanced processor that costs twice as much as an older model may not offer twice the performance. With this type of trend, a point of diminishing returns is soon reached, beyond which further investment does not buy much additional performance. In this context, linear improvement of performance with cost can be considered an ideal to strive for.

2. Definition of computer performance

As users, we expect a higher-performance computer to run our application programs faster. In fact, the execution time of programs, whether long processing runs or simple commands to which the computer must react immediately, is a universally accepted performance indicator. Because a longer running time means lower performance, we can write:

Performance = 1/(Execution time)

Thus, the fact that a computer runs a program in half the time taken by another machine means that it has twice the performance. All other indicators are approximations to performance, used because one cannot always measure or predict execution times for real programs. Possible reasons include lack of knowledge about exactly which programs will run on the machine, the cost of porting programs to a new system for the sole purpose of benchmarking, and the need to evaluate a computer that has not yet been built or is otherwise unavailable for experimentation.

As with the aircraft analogy of Section 4.1, there are other views of performance. For example, a computer center that sells machine time to a variety of users may consider throughput, the total number of jobs completed per unit of time, as most relevant, because it directly affects the center's revenue. In fact, the execution of certain programs may be intentionally delayed if doing so leads to better overall throughput. The strategy of a shuttle driver who chooses the order in which passengers are dropped off so as to minimize total service time is a good analogy for this situation. Execution time and throughput are not completely independent, in that improving one usually (but not always) leads to improvement in the other. In the rest of the text, the focus will be on the user's perception of performance, which is the inverse of execution time.

In fact, a user may be concerned not with program execution time per se, but with response time or turnaround time, which includes additional delays attributable to scheduling decisions, task preemptions, I/O queuing, and so on. This is sometimes referred to as wall clock time, because it can be measured by glancing at the wall clock when starting and when finishing a task. To factor out such highly variable and difficult-to-quantify delays, CPU execution time is sometimes used to define user-perceived performance:

Performance = 1/(CPU execution time)

This choice makes evaluation much more manageable in cases involving analytical, rather than experimental, evaluation methods (see Section 4.4). It does not introduce any inaccuracy for computation-intensive tasks that do not involve much I/O; for such CPU-bound tasks, processing power is the bottleneck. I/O-bound tasks, on the other hand, are poorly served if only CPU time is taken into account (Figure 4.2). For a well-balanced system, not much is missed by considering only CPU time in performance evaluation. Note that system balance is essential here: if you replace a machine's processor with a model that delivers twice the performance, the overall performance of the system will not double unless corresponding improvements are made to other parts of the system (memory, system bus, I/O, etc.). Incidentally, doubling the performance of the processor does not simply mean replacing an x GHz processor with a 2x GHz one; soon you will see that clock frequency is just one of many factors that affect performance.

When comparing two machines M1 and M2, the notion of relative performance comes into play:

Speedup of M1 over M2 = (M1 performance)/(M2 performance)

= (M2 execution time)/(M1 execution time)

Note that performance and running time vary in opposite directions: to improve execution time, one must reduce it, whereas improving performance means raising it. The term "improve" is favored over more specific terms because it applies uniformly to many different strategies affecting performance indicators, regardless of whether a given indicator should go up or down for better performance.

The just-defined relative performance measure is a dimensionless ratio such as 1.5 or 0.8. A ratio of x indicates that machine M1 delivers x times the performance of, or is x times as fast as, machine M2. When x > 1, the relative performance can be expressed in one of two equivalent ways:

M1 is x times as fast as M2 (e.g., 1.5 times as fast)

M1 is 100(x − 1)% faster than M2 (e.g., 50% faster)

Failing to distinguish these two forms of presentation is a fairly common mistake. So remember: a machine that is 200% faster is not twice as fast but three times as fast. More generally, y% faster means 1 + y/100 times as fast.
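
The conversion rule is trivial to encode; a short Python helper makes it explicit:

    def speedup_from_percent_faster(y):
        # "y% faster" means 1 + y/100 times as fast.
        return 1 + y / 100

    print(speedup_from_percent_faster(50))    # 1.5
    print(speedup_from_percent_faster(200))   # 3.0: three times as fast, not twice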

Each time a specific program runs, a certain number of machine instructions are executed. Often this number differs from run to run, but assume that we know the average number of instructions executed over many runs of the program. Note that this number may bear little relation to the number of instructions in the program's code. The latter is the static instruction count, whereas here we are interested in the dynamic instruction count, which is usually much larger than the static count owing to loops and repeated procedure calls. The execution of each instruction takes a certain number of clock cycles. Again, this number differs across instructions and may in fact depend not only on the instruction itself but also on its context (the instructions executed before and after it). Assume that we also have an average value for this parameter. Finally, each clock cycle represents a fixed amount of time; for example, the cycle time of a 2 GHz clock is 0.5 ns. The product of these three factors yields an estimate of the CPU execution time for the program:

CPU execution time = instructions × (cycles per instruction) × (seconds per cycle)

= instructions × CPI/(clock rate)

where CPI means "cycles per instruction" and the clock rate, expressed in cycles per second, is the
inverse of "seconds per cycle".
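
As a minimal illustration of this equation, the following Python sketch computes the CPU time of a hypothetical program; the instruction count, CPI, and clock rate below are made-up values, not taken from any example in the text:

    instructions = 50_000_000      # dynamic instruction count (assumed)
    cpi = 2.2                      # average cycles per instruction (assumed)
    clock_rate = 2e9               # 2 GHz, i.e., a 0.5 ns cycle time

    cpu_time = instructions * cpi / clock_rate
    print(f"CPU time = {cpu_time * 1e3:.1f} ms")   # 55.0 ms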

The three parameters, instruction count, CPI, and clock rate, are not completely independent, so improving one of them by a given factor may not improve overall execution time by the same factor.

The instruction count depends on the instruction set architecture (which instructions are available) and on how effectively the programmer or compiler uses them. Instruction set issues are discussed in Part 2 of the book.

CPI depends on the instruction set architecture and on the hardware organization. Most of the organizational issues that directly influence CPI are introduced in Part 4 of the book, although Part 3 covers concepts that are also relevant.

The clock rate depends on the hardware organization and on the implementation technology. Some aspects of technology affecting the clock rate are covered in Chapters 1 to 3; other issues appear in Parts 3 and 4.

To give just one example of these interdependencies, consider the effect on performance of raising the clock rate. If the clock rate of an Intel Pentium processor is improved by a factor of 2, performance may improve, but not necessarily by the same factor. Because different Pentium models use essentially the same instruction set (occasional extensions introduce a complicating factor, but preexisting programs continue to run), performance would double only if the CPI remained the same. Unfortunately, raising the clock rate is frequently accompanied by an increase in CPI. The reasons will become clear in Part 4; here it is only pointed out that one technique for accommodating higher clock rates is to divide the process of instruction execution into a larger number of steps, forming a deeper pipeline. The penalty, in clock cycles, for stalling or flushing such a pipeline, which becomes necessary in cases of data and control dependencies or cache misses, is proportional to its depth. Therefore, CPI is an increasing function of pipeline depth.

The effect of clock rate on performance is even harder to judge when comparing machines with different instruction set architectures. Figure 4.3 uses a walking analogy for this difficulty. Doubling the clock rate is the analogue of one person taking steps twice as fast as another. However, if the person taking faster steps needs five times as many of them to get from point A to point B, his or her travel time will be 2.5 times as long. With this analogy in mind, it is no surprise that the vendor of an x GHz processor may claim a performance advantage over another company's 2x GHz processor. Whether the claim is true is another story; examples of how performance claims can be misleading or outright false will be seen later.

3. Performance Enhancement and Amdahl's law


Gene Amdahl, an architect of early IBM mainframe computers who later founded a company bearing his name, formulated his famous law (Figure 4.4) to point out some limitations of parallel processing. He argued that programs contain certain computations that are inherently sequential and therefore cannot be sped up by parallel processing. If f represents the fraction of a program's running time due to such unparallelizable computations, then, even assuming that the rest of the program enjoys a perfect speedup of p when running on p processors, the overall speedup would be:

s = 1/[f + (1 − f)/p]

Here, the 1 in the numerator represents the original running time of the program, and f + (1 − f)/p is the improved running time of the program on p processors. The latter is the sum of the execution time of the unparallelizable fraction f and that of the remaining fraction 1 − f, which now runs p times as fast. Note that the speedup s can never exceed p (linear speedup, achieved with f = 0) or 1/f (the maximum speedup, reached for p = ∞). Consequently, for f = 0.05, one can never achieve a speedup greater than 20, no matter how many processors are used.
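
The formula is easy to experiment with; the minimal Python sketch below encodes it and reproduces the f = 0.05 bound just mentioned:

    import math

    def amdahl_speedup(f, p):
        # f: unimproved (sequential) fraction of the running time
        # p: speedup factor applied to the remaining fraction 1 - f
        return 1 / (f + (1 - f) / p)

    for p in (10, 100, 1000, math.inf):
        print(p, round(amdahl_speedup(0.05, p), 2))   # 6.9, 16.81, 19.63, 20.0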

Despite its original formulation in terms of the speedup possible with p processors, Amdahl's law is much more general and applies to any situation in which the execution time of a fraction f of a program is left unchanged while the remaining part is improved by a factor p (not necessarily an integer). This general interpretation implies that if a part of a program representing a fraction f of its execution time is left unchanged, no amount of improvement to the remaining fraction 1 − f can produce a speedup greater than 1/f. For example, if floating-point arithmetic accounts for 1/3 of a program's execution time and only the floating-point unit is improved (i.e., f = 2/3), the overall speedup cannot exceed 1.5, no matter how much faster the floating-point arithmetic becomes.

Example 1: Using Amdahl's Law in Design

A processor chip is used for applications in which 30% of the running time is spent on floating-point addition, 25% on floating-point multiplication, and 10% on floating-point division. For the new model of the processor, the design team has come up with three possible improvements, each costing about the same in design and manufacturing effort. Which of these improvements should be chosen?

a) Redesign the floating-point adder to make it twice as fast.

b) Redesign the floating-point multiplier to make it three times as fast.

c) Redesign the floating-point divider to make it ten times as fast.

Solution: Amdahl's law can be applied to the three options with f = 0.7, f = 0.75, and f = 0.9, respectively, f being the unmodified fraction in each case.

a) Speedup for the adder redesign = 1/[0.7 + 0.3/2] = 1.18

b) Speedup for the multiplier redesign = 1/[0.75 + 0.25/3] = 1.20

c) Speedup for the divider redesign = 1/[0.9 + 0.1/10] = 1.10

Therefore, redesigning the floating-point multiplier offers the greatest performance benefit, although its advantage over the adder redesign is slight. One lesson to draw from this example is that the impressive tenfold speedup of the divider is not worth the effort, given the relative rarity of floating-point divisions. In fact, even with an infinitely fast divider, the overall speedup would still be only 1.11.
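
The three options can be replayed with the amdahl_speedup sketch given earlier; the arguments are the unmodified fraction f and the local improvement factor p:

    print(round(amdahl_speedup(0.70, 2), 2))    # 1.18  (adder)
    print(round(amdahl_speedup(0.75, 3), 2))    # 1.2   (multiplier)
    print(round(amdahl_speedup(0.90, 10), 2))   # 1.1   (divider)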

Example 2: Using Amdahl's Law in Management

Members of a university research group frequently go to the campus library to read or copy articles published in technical journals. Each trip to the library takes 20 minutes. To reduce this time, an administrator orders subscriptions to the journals that account for 90% of the trips to the library. For these journals, which are now kept in the group's private library, the average access time is reduced to 2 minutes.

a) What is the average speedup in accessing technical articles due to the subscriptions?

b) If the group has 20 members, each making an average of two trips per week to the campus library, determine the annual expenditure on subscriptions that is financially justified. Assume 50 work weeks per year and an average cost of $25/h for a researcher's time.

Solution: Amdahl's law can be applied to this situation, with 10% of the accesses left unchanged (f = 0.1) and the remaining 90% sped up by a factor of p = 20/2 = 10.

a) Speedup in article access time = 1/[0.1 + 0.9/10] = 5.26

b) The time saved by the subscriptions is 20 × 2 × 50 × 0.9 × (20 − 2) = 32 400 min = 540 h, representing a cost recovery of 540 × 25 = $13,500; this is the annual amount that can be financially justified as the cost of the subscriptions.
Note: This example is analogous to placing a fast cache near the processor for quicker access to the most frequently used data. Details are covered in Chapter 18.

4. Performance Measurement vs. Modeling


The safest and most reliable method of performance evaluation is to run the actual programs of interest on the candidate machines and measure the execution or CPU times. Figure 4.5 shows an example in which three different machines are evaluated on six programs. Based on the evaluation data shown, machine 3 clearly comes out on top, because it has the shortest running time for all six programs. The outcome is not always this clear-cut, however, as is the case between machines 1 and 2 in Figure 4.5. Machine 1 is faster than machine 2 for two of the programs and slower for the other four. If you had to choose between machines 1 and 2 (for example, because machine 3 is much more expensive or fails to meet some other important requirement), you could assign weights to the programs and choose the machine with the smaller weighted sum of execution times. The weight of a program could be the number of times it runs per month (based on collected data or on a prediction). If, for example, all six programs run the same number of times, and therefore carry equal weights, machine 2 would have a slight edge over machine 1. If, on the other hand, programs B or E constitute the bulk of the workload, machine 1 will probably prevail. Section 4.5 develops methods for summarizing or reporting such performance results, along with the associated pitfalls.

Experimentation with real programs on real machines is not always feasible or cost-effective. If you plan to buy hardware, the intended machines may not be accessible for extensive experimentation. Similarly, the programs you will run may be unavailable or even unknown. You may only have an idea that you will run programs of a certain type (company payroll, linear equation solving, graphic design, etc.); some of the required programs may be developed in-house and others outsourced. Remember to consider not only current but also future needs. In such cases, the evaluation must be based on benchmark programs or on analytical modeling.

Benchmarking
Benchmarks are real or synthetic programs that are selected or designed for the comparative evaluation of machine performance. A benchmark suite is a collection of such programs, intended to represent all kinds of applications and to foil any attempt to design hardware that performs well only on a narrow set of benchmark programs (the latter practice is known as designing to the benchmark). Of course, benchmarking results are relevant to a user only if the programs in the suite resemble the programs the user will actually run. Benchmarks facilitate comparison across different platforms and classes of computers. They also make it possible for computer vendors and independent organizations to evaluate many machines, even before they reach the market, and to publish the benchmark results for the benefit of users. In this way, individual users need not do their own benchmarking.

Benchmarks are primarily intended for use when the hardware to be evaluated, and the compilers needed to run the programs in the suite, are already available. Most compilers have optimization capabilities that can be turned on or off. Usually, to avoid tuning the compiler differently for each program in the suite, the entire benchmark suite is required to run with a single set of optimization flags. It is also possible to use a benchmark suite, especially one with shorter programs, for the evaluation of machines or compilers not yet available. For example, the programs in the suite can be compiled by hand and the results fed to a software simulator of the hardware under development. In this context, instruction counts can be extracted from the hand-compiled code and used to estimate performance in the manner described under performance estimation below.

A popular benchmark suite for evaluating workstations and servers, comprising integer and floating-point programs, is developed by the Standard Performance Evaluation Corporation (SPEC). Version 2000 of the SPEC CPU benchmark suite, known as SPECint2000 for its collection of 12 integer programs and SPECfp2000 for its set of 12 floating-point programs, is characterized in Table 4.2. Instead of reporting absolute running times, it is common to report how much faster a machine ran a program compared with some base machine; the larger this SPEC ratio, the higher the machine's performance. The ratios can then be plotted separately for SPECint and SPECfp to visualize the differences between machines, or for the same machine with different compilers or clock rates. Figure 4.6 shows an example of results for six different systems, where each system may be the combination of a processor with a specific clock rate and cache/memory size, particular C and Fortran compilers with specific optimization settings, and known I/O capabilities.

TABLE 4.2 Summary of features of the SPEC CPU2000 benchmark suite.

Category      Program types             Program examples                           Lines of code
SPECint2000   C programs (11)           Data compression, C language compiler      0.7k to 193k
              C++ program (1)           Computer visualization (ray tracing)       34.2k
SPECfp2000    C programs (4)            3D graphics, computational chemistry       1.2k to 81.8k
              Fortran 77 programs (6)   Shallow-water modeling, multigrid solver   0.4k to 47.1k
              Fortran 90 programs (4)   Image processing, finite element method    2.4k to 59.8k

Example 3: Performance benchmarks

You are an engineer at Outtell, a new company aspiring to compete with Intel via a new processor technology that outperforms the latest Intel processor by a factor of 2.5 on floating-point instructions. To achieve this level of floating-point performance, the design team made some trade-offs that led to an average 20% increase in the execution times of all other instructions. You are in charge of choosing the benchmarks that would showcase Outtell's performance edge.

a) What is the minimum required fraction f of time spent on floating-point instructions in a program running on the Intel processor such that Outtell shows a speedup of 2 or better?

b) If, on the Intel processor, the execution time of a floating-point instruction is on average three times that of other instructions, what does the fraction f of your answer to part (a) mean in terms of the instruction mix in the benchmark?

c) What type of benchmark would Intel choose to counter your company's claims?

Solution: Use a generalized form of Amdahl's formula in which a fraction f of the time is sped up by a given factor while the remaining time is slowed down by another factor.

a) Speedup factor = 2.5, slowdown factor = 1.2 ⇒ 1/[1.2(1 − f) + f/2.5] ≥ 2 ⇒ f ≥ 0.875

b) Let the instruction mix consist of a fraction x of floating-point instructions and 1 − x of other instructions. The total execution time is then proportional to 3x + (1 − x) = 2x + 1, so the fraction of execution time due to floating-point operations is 3x/(2x + 1). Requiring this fraction to be at least 0.875 leads to x ≥ 70% (the floating-point fraction of the instruction mix).

c) Intel would try to exhibit a slowdown for Outtell: 1/[1.2(1 − f) + f/2.5] < 1 ⇒ f < 0.25. In terms of the instruction mix, this means 3x/(2x + 1) < 0.25, or x < 10%.
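
The thresholds of parts (a) and (c) are easy to verify numerically; the sketch below encodes the generalized formula with this example's speedup and slowdown factors:

    def relative_speed(f, speedup=2.5, slowdown=1.2):
        # Fraction f is sped up; the remaining 1 - f is slowed down.
        return 1 / (slowdown * (1 - f) + f / speedup)

    print(round(relative_speed(0.875), 3))   # 2.0: threshold for a 2x Outtell advantage
    print(round(relative_speed(0.25), 3))    # 1.0: below f = 0.25, Outtell comes out slower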

Performance Estimation
It is possible to estimate machine performance without directly observing the behavior of real programs or benchmarks running on real hardware. Methods range from back-of-the-envelope estimates to highly detailed models that capture the effects of hardware and software design features. Performance models are of two types: analytical models and simulation models. The former use mathematical formulations to relate performance to key observable and quantifiable parameters of the system or application. The latter mimic the behavior of the system, often at a higher level of abstraction so as to keep the size of the model, and its running time, manageable. The results obtained from any model are only as good as the model's fidelity in representing real-world capabilities, limitations, and interactions. It is a mistake to think that a more detailed model necessarily offers a more accurate performance estimate. In fact, the complexity of a model sometimes obscures understanding, leading to an inability to see how imprecision in estimating the model's parameters can affect the final results.

The simplest performance estimation model is one that yields the system's peak performance, so named because it represents the absolute highest performance level that can be extracted from the system. The peak performance of a computer is like the top speed of a car, and can be just as insignificant as a figure of merit for comparison purposes. Peak performance is often expressed in instructions per second (IPS), with MIPS and GIPS preferred to keep the numbers small. The advantage of peak performance is that it is easy to determine and report. For scientific and engineering applications that primarily involve floating-point calculations, floating-point operations per second (FLOPS) is used as the unit, again with megaflops (MFLOPS) and gigaflops (GFLOPS) as favorites. A machine achieves its peak performance only on an artificially constructed program consisting solely of instructions of the fastest type. For example, on a machine whose instructions take one or two clock cycles to execute, peak performance is achieved by a program that uses only one-cycle instructions, perhaps with a few two-cycle instructions thrown in, as needed, to form loops and other program constructs.

A little more detailed, and also more realistic, is an analysis based on the average CPI (CPI was defined in Section 4.2). The average CPI can be calculated from an instruction mix obtained through experimental studies. Such studies examine a large number of common programs to determine the proportions of the various instruction classes, expressed as fractions that add up to 1. Table 4.3, for example, provides typical instruction mixes. If the instruction classes are chosen such that all instructions in the same class have the same CPI, the average CPI can be calculated from the corresponding fractions:
Average CPI = Ʃ (over all instruction classes i) (Class i fraction) × (Class i CPI)
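
In Python, this weighted sum is a one-line helper; the mix used below for illustration is that of machine M1 in Example 4, which follows:

    def average_cpi(mix):
        # mix: list of (fraction, cpi) pairs; the fractions must sum to 1.
        return sum(fraction * cpi for fraction, cpi in mix)

    m1_mix = [(0.25, 5.0), (0.25, 2.0), (0.50, 2.4)]   # classes F, I, N
    cpi = average_cpi(m1_mix)
    print(round(cpi, 2))         # 2.95
    print(round(600 / cpi))      # about 203 MIPS at a 600 MHz clock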

Example 4: CPI and IPS calculations

Consider two different hardware implementations, M1 and M2, of the same instruction set. There are three classes of instructions, F, I, and N, in the instruction set. The clock rate of M1 is 600 MHz; the clock cycle of M2 is 2 ns. The average CPIs of the three instruction classes on M1 and M2 are as follows:

Class   CPI for M1   CPI for M2   Comments

F       5.0          4.0          Floating-point
I       2.0          3.8          Integer arithmetic
N       2.4          2.0          Nonarithmetic

a) What are the peak performances of M1 and M2 in MIPS?

b) If 50% of all instructions executed in a given program are of class N and the rest are divided equally between F and I, which machine is faster, and by what factor?

c) M1's designers plan to redesign the machine for better performance. With the assumptions of part (b), which of the following redesign options has the greatest performance impact, and why?

1. Use a floating-point unit that is twice as fast (class F CPI = 2.5).

2. Add a second integer ALU to reduce the class I CPI to 1.2.

3. Use faster logic that allows a clock rate of 750 MHz with the same CPIs.

d) The CPIs given include the effect of instruction cache misses at an average rate of 5%, each of which imposes a ten-cycle penalty (that is, it adds 10 to the effective CPI of the instruction causing the miss, or 0.5 cycles per instruction on average). A fourth redesign option is to use a larger instruction cache that would reduce the miss rate from 5% to 3%. How does this option compare with the three options of part (c)?

e) Characterize the application programs that would run faster on M1 than on M2; that is, say as much as you can about the instruction mix of such applications. Hint: let x, y, and 1 − x − y be the fractions of instructions belonging to classes F, I, and N, respectively.

Solution:

a) Peak MIPS for M1 = 600/2.0 = 300 (assuming all instructions are of class I)

Peak MIPS for M2 = 500/2.0 = 250 (assuming all instructions are of class N)

b) Average CPI for M1 = 5.0/4 + 2.0/4 + 2.4/2 = 2.95; average CPI for M2 = 4.0/4 + 3.8/4 + 2.0/2 = 2.95. The average CPIs are equal, so M1 is 1.2 times as fast as M2 (the ratio of their clock rates).

c) 1. Average CPI = 2.5/4 + 2.0/4 + 2.4/2 = 2.325; MIPS for option 1 = 600/2.325 = 258
2. Average CPI = 5.0/4 + 1.2/4 + 2.4/2 = 2.75; MIPS for option 2 = 600/2.75 = 218
3. MIPS for option 3 = 750/2.95 = 254 ⇒ option 1 has the greatest impact.

d) With the larger cache, all CPIs are reduced by 0.2 owing to the lower cache miss rate:
Average CPI = 4.8/4 + 1.8/4 + 2.2/2 = 2.75 ⇒ option 4 is comparable to option 2.

e) Average CPI for M1 = 5.0x + 2.0y + 2.4(1 − x − y) = 2.6x − 0.4y + 2.4; average CPI for M2 = 4.0x + 3.8y + 2.0(1 − x − y) = 2x + 1.8y + 2. We seek the conditions under which 600/(2.6x − 0.4y + 2.4) > 500/(2x + 1.8y + 2). Working through the inequality, M1 performs better than M2 for x/y < 12.8. Roughly speaking, M1 does better unless the program contains an excessive amount of floating-point arithmetic (on which M1 is slower) relative to integer arithmetic (on which M2 is slower); class N instructions are immaterial to the comparison because they take the same absolute time on both machines.
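
The option comparisons of parts (c) and (d) can be replayed with the average_cpi helper sketched earlier:

    mixes = {
        "base":     [(0.25, 5.0), (0.25, 2.0), (0.50, 2.4)],
        "option 1": [(0.25, 2.5), (0.25, 2.0), (0.50, 2.4)],   # faster FP unit
        "option 2": [(0.25, 5.0), (0.25, 1.2), (0.50, 2.4)],   # second integer ALU
        "option 4": [(0.25, 4.8), (0.25, 1.8), (0.50, 2.2)],   # larger I-cache
    }
    for name, mix in mixes.items():
        print(name, round(600 / average_cpi(mix)))             # 203, 258, 218, 218 MIPS
    print("option 3", round(750 / average_cpi(mixes["base"]))) # 254 MIPS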

Example 5: MIPS ratings can be misleading

5. Reporting Computer Performance


Even with the best method for measuring or modeling performance, care must be taken in
interpreting and reporting the results. This section reviews some of the difficulties of
condensing performance data into a single numerical indicator.
Consider the running times of three programs A, B, and C on two different machines X and Y,
shown in Table 4.4. The data indicate that, for program A, machine X is ten times as fast as
machine Y, while for both B and C the opposite is true. A first attempt at summarizing this
performance data is to take the average of the three speedups and assert that machine Y is,
on average, (0.1 + 10 + 10)/3 = 6.7 times as fast as machine X. This, however, is incorrect.
The last row of Table 4.4 shows the total execution times of the three programs and an
overall speedup of 5.6, which is the correct speedup to report if these programs run the
same number of times within the normal workload. If the latter condition does not hold, the
overall speedup must be calculated using weighted, rather than simple, sums. This situation
is similar to finding the average speed of a car that drives to a city 100 km away at
100 km/h and makes the return trip at 50 km/h; the average speed is not (100 + 50)/2 = 75 km/h,
but should be obtained from the fact that the car travels 200 km in three hours.
Since SPEC benchmark execution times are normalized to a reference machine rather than
expressed in absolute terms, the question of summarizing performance data arises in this
case as well. For example, if in Table 4.4 machine X is taken as the reference, the normalized
performance of Y is given by the speedup values in the rightmost column. It is known from the
preceding discussion that the arithmetic mean of the speedups (normalized execution times)
is not the correct measure to use. What is needed is a way of summarizing normalized running
times that is consistent, in the sense that the result does not depend on the choice of
reference machine. With Y as the reference, X would exhibit speedups of 10, 0.1, and 0.1 for
programs A, B, and C, respectively. The arithmetic mean of these values is 3.4; not only is
this not the inverse of 6.7 (the alleged speedup of Y over X), but it leads to the
contradictory conclusion that each machine is faster than the other!

A solution to the above problem is to use the geometric mean instead of the arithmetic mean.
The geometric mean of n values is the nth root of their product. Applying this method to the
speedup values in Table 4.4 yields the single relative performance indicator
(0.1 × 10 × 10)^(1/3) = 2.15 for Y relative to X. Note that this is not called the overall
speedup of Y over X, because it is not; it is just an indicator that moves in the right
direction, in the sense that larger values correspond to higher performance. Had Y been used
as the reference machine, the relative performance indicator for X would have been
(10 × 0.1 × 0.1)^(1/3) = 0.46. This is consistent, because 0.46 is approximately the inverse
of 2.15. The consistency arises from the fact that the ratio of geometric means equals the
geometric mean of ratios.
Using the geometric mean solves the consistency problem but creates another one: the derived
numbers have no direct relation to execution times and can in fact be quite misleading.
Consider, for example, only programs A and B in Table 4.4. Based on these two programs,
machines X and Y have identical performance indicators, since (0.1 × 10)^(1/2) = 1, although
this is clearly not the case if programs A and B run the same number of times in the
workload. The execution times on the two machines would be equal only if the fraction a of
executions corresponding to program A (and hence 1 − a for program B) satisfies the equality:

a × 20 + (1 − a) × 1000 = a × 200 + (1 − a) × 100

This requires a = 5/6 and 1 − a = 1/6, implying that program A must run five times as often
as program B; this may or may not be the case for the actual workload.
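
The whole discussion can be replayed in a few lines of Python. The running times of programs A and B below are those appearing in the equality above; the times assumed for program C (1500 s on X, 150 s on Y) are hypothetical values chosen to be consistent with the stated per-program speedups and the 5.6 overall figure:

    times_x = [20, 1000, 1500]        # programs A, B, C on machine X
    times_y = [200, 100, 150]         # programs A, B, C on machine Y

    speedups = [x / y for x, y in zip(times_x, times_y)]        # [0.1, 10.0, 10.0]
    print(round(sum(speedups) / len(speedups), 1))              # 6.7: misleading arithmetic mean
    print(sum(times_x) / sum(times_y))                          # 5.6: true equal-weight speedup

    product = 1.0
    for s in speedups:
        product *= s
    print(round(product ** (1 / len(speedups)), 2))             # 2.15: geometric-mean indicator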

Example 6: Effect of Instruction Mixing on Performance

6. The Quest for Higher Performance

The state of computing power available at the turn of the twenty-first century can be
summarized as follows:
Gigaflops on the desktop.
Teraflops in the supercomputer center.
Petaflops on the drawing board.
Given the exponential growth in computer performance, the coming 10 to 15 years should see a
shift from G, T, and P to T, P, and E (Table 3.1). Over the years, reaching performance
milestones, such as teraflops and petaflops, has been one of the major driving forces behind
advances in computer architecture and technology.
Certain U.S. government agencies and other advanced users support research and
development projects on supercomputers to help solve larger, or hitherto intractable,
problems within their domains of interest. Over time, performance-enhancing methods
introduced in high-end supercomputers find their way into smaller systems, eventually
showing up in personal computers. Thus, major computer companies remain active in the
design and construction of high-performance computer systems, even though the market for
such expensive high-end machines is very limited (Figure 3.4). As seen in Figure 4.7, the
performance of supercomputers continues to grow exponentially. This trend applies both to
vector supercomputers and to massively parallel processors (MPPs).
To ensure that progress in supercomputer performance is not slowed by the high cost of
researching and developing such machines, the U.S. Department of Energy sponsors the
Accelerated Strategic Computing Initiative (ASCI) program, also referred to as the Advanced
Simulation and Computing initiative, which aims to develop supercomputers with successively
higher performance targets: from 1 TFLOPS in 1997 to 100 TFLOPS in 2003 (Figure 4.8). Even
though these numbers correspond to peak performance, the hope is that increases in peak
computing power will translate into impressive advances in sustained performance on real
applications.

In the future, the performance of both microprocessors and supercomputers is expected to
keep growing at the current rate. Thus, extrapolating the trends of Figures 4.7 and 4.8
leads to fairly confident predictions of the performance levels to be achieved in the coming
decade. Beyond that, however, the picture is much less clear. The problem is that certain
fundamental physical limits are being approached that may prove difficult, or even
impossible, to overcome. One concern is that the feature size of integrated circuits, whose
steady reduction has been a major contributor to speed improvements, is getting ever closer
to atomic dimensions. Another issue is that the speed of signal propagation on the wires
connecting circuit elements is inherently limited: today it is a fraction of the speed of
light and can never exceed it (approximately 30 cm/ns). For example, if a memory chip is
3 cm from the processor chip, data can never travel from one to the other in less than
0.1 ns. Consequently, more work is needed on architectural techniques that obviate the need
for frequent long-distance communication between circuit elements. Note that parallel
processing by itself does not solve the "speed of light" dilemma: multiple processors still
need to communicate with one another.

PROBLEMS

1. Amdahl's law
A program runs for 1000 seconds on a particular machine, with 75% of the time spent
performing multiply/divide operations. The machine is to be redesigned with faster
multiply/divide hardware.
a) How much faster must the multiplier/divider be for the program to run three times
as fast?
b) What if the program is to run four times as fast?

2. Performance benchmarks
A benchmark suite B1 consists of equal proportions of class X and class Y instructions.
Machines M1 and M2, with identical 1 GHz clocks, both achieve 500 MIPS on B1. If half of the
class X instructions in B1 are replaced with class Y instructions to derive another
benchmark suite B2, the running time of M1 becomes 70% that of M2. If instead half of the
class Y instructions are replaced with class X instructions to transform B1 into B3, M2
becomes 1.5 times as fast as M1.
a. What can you say about the average CPIs of M1 and M2 for the two classes of
instructions?
b. What is the IPS performance of M1 on benchmark suites B2 and B3?
c. What is the maximum possible speedup of M1 over M2, and for what instruction
mix is it achieved?
d. Repeat part (c) for the performance of M2 relative to M1.

3. Cycles per instruction

In the examples of this chapter, the CPI values given or calculated are all at least 1.
Can the CPI be less than 1? Explain.

4. Clock frequency and speedup

A company sells two versions of its processor. The "pro" version runs at 1.5 times the
clock frequency but has an average CPI of 1.50, versus 1.25 for the "deluxe" version.
What speedup, relative to the deluxe version, would you expect when running a
program on the pro version?

5. Performance comparison and speedup

Consider two different implementations, M1 and M2, of the same instruction set. M1
has a clock frequency of 500 MHz; the clock cycle of M2 is 1.5 ns. There are three
classes of instructions with the following CPIs:

CLASS CPI FOR M1 CPI FOR M2

A 2 2
B 1 2
C 3 4

a) What are the peak performances of M1 and M2, expressed in MIPS?

b) If the instructions executed in a given program are divided equally among the
three classes, which machine is faster, and by what factor?
c) M2 can be redesigned so that, with negligible cost increase, the CPI of class B
instructions improves from 2 to 1 (the CPIs of classes A and C remain unchanged).
However, this change would increase the clock cycle from 1.5 ns to 2 ns. What is the
minimum percentage of class B instructions in an instruction mix for this redesign to
result in improved performance?
7. Amdahl's law
A program spends 60% of its execution time performing floating-point arithmetic. Of the
floating-point operations in this program, 90% occur in parallelizable loops.
a) Find the improvement in running time if the floating-point hardware is made twice as fast.
b) Find the improvement in running time if two processors are used to run the program's
parallelizable loops twice as fast.
c) Find the improvement in running time resulting from both modifications (a) and (b) together.

8. Instruction mix and performance

Redo Example 4.6, but instead of considering only data compression and nuclear reactor
simulation, consider all four applications listed in Table 4.3.
a) Tabulate the results for the individual applications on each machine.
b) Find composite results for the two integer applications (data compression and C language
compiler) and the two floating-point applications (nuclear reactor simulation and atomic
motion modeling) by averaging the instruction mixes.
c) Using the geometric mean and the results of part (b), quantify the overall performance
advantage of M2 over M1.

9. Performance gain with vector processing

A vector supercomputer has special instructions that perform arithmetic operations on
vectors. For example, vector multiplication of two vectors A and B of length 64 is
equivalent to the 64 independent multiplications A[i] × B[i]. Suppose the machine has a CPI
of 2 for all scalar arithmetic instructions. Vector arithmetic on vectors of length m takes
8 + m cycles, where 8 is the start-up/wind-down overhead of the pipelining that allows one
arithmetic operation to begin in each clock cycle; thus, vector multiplication takes 72
clock cycles for vectors of length 64. Consider a program containing only arithmetic
instructions (i.e., ignore everything else), with half of these instructions involving
scalars and half involving vector operands.
a) What speedup is achieved if the average vector length is 16?
b) What is the break-even vector length for this machine (the average vector length above
which vector processing results in equal or better performance)?
c) In general, what average vector length is required to achieve a given speedup s?

10. Amdahl's law

A new version of machine M, called Mfp++, executes all floating-point instructions four
times as fast as M.
a) Plot the speedup achieved by Mfp++ relative to M as a function of the fraction
x of time that M spends on floating-point arithmetic.
b) Find the speedup of Mfp++ over M for each of the applications listed in Table 4.3,
assuming that the average CPI of floating-point instructions on M is five times that of all
other instructions.

11. MIPS rating

A machine your company uses has an average CPI of 4 for floating-point arithmetic
instructions and 1 for all other instructions. The applications you run spend half of their
time performing floating-point arithmetic.
a) What is the instruction mix of your applications? That is, find the fraction x of
executed instructions that perform floating-point arithmetic.
b) How much higher would the machine's MIPS rating be if it used a compiler that
emulated floating-point arithmetic with sequences of integer instructions instead of using
the machine's floating-point instructions?

12. Performance comparison and speedup

Consider two different implementations, M1 (1 GHz) and M2 (1.5 GHz), of the same instruction
set. There are three classes of instructions with the following CPIs:

Class CPI for M1 CPI for M2

A 2 2
B 1 2
C 3 5

a) What are the peak performances of M1 and M2, expressed in MIPS?

b) Show that if half of the instructions executed in a given program are of class A and the
rest are divided equally between classes B and C, then M2 is faster than M1.
c) Show that the second assumption of part (b) is redundant; that is, if half of the
instructions executed in a given program are of class A, then M2 is always faster than
M1, regardless of how the remaining instructions are distributed between classes B and C.

13. Supercomputer Performance Trends

Data about the most powerful supercomputers of the day is published regularly [Top500].
Draw the following scatter plots based on the data for the top five supercomputers in each
of the last ten years, and discuss the observed trends:
a) Performance vs. number of processors.
b) Performance vs. total memory size and memory per processor (use two types of
marker on the same scatter plot).
c) Number of processors vs. year of introduction.

14. Relative machine performance

Referring to Figure 4.5, note that the relative performance of machines 1 and 2
fluctuates widely across the six programs shown. For example, machine 1 is twice as fast for
program B, while the opposite is true for program A. Speculate about what differences in the
two machines and the six programs may have contributed to the observed execution times.
Then compare machine 3 with each of the other two machines.

15. Amdahl's law

You live in an apartment from which it is a seven-minute drive to the nearby supermarket you
visit twice a week, and a 20-minute drive to the department store where you shop once every
four weeks. You plan to move to a new apartment. Compare the following candidate locations
with respect to the speedup they offer in your driving time for shopping trips:
a) An apartment that is ten minutes from both the supermarket and the department store.
b) An apartment that is five minutes from the supermarket and 30 minutes from the
department store.

16. Amdahl's law

Suppose that, based on instruction counts (not the time spent executing them), a numerical
application consists of 20% floating-point operations and 80% integer/control operations.
The execution time of a floating-point operation is, on average, three times that of the
other operations. A redesign of the floating-point unit is being considered to make it faster.
a) What speedup factor for the floating-point unit would lead to a 25% overall improvement
in speed?
b) What is the maximum possible speedup achievable by modifying only the
floating-point unit?

17. Performance benchmarks and MIPS

A particular benchmark is intended to be run repeatedly, with computational throughput
specified as the number of times the benchmark can be run per second. Thus, a repetition
rate of 100/s represents higher throughput than 80/s. Machines M1 and M2 achieve
throughputs of R1 and R2 repetitions per second, respectively. During these runs, M1
achieves a MIPS rating of P1. Would it be correct to conclude that the MIPS rating of M2 on
this benchmark is P1 × R2/R1? Justify your answer fully.

18. Analogy with aircraft performance

In Section 4.1 and Table 4.1, several passenger aircraft were compared with respect to their
performance, and it was observed that performance is judged differently depending on the
evaluator's point of view. Information about U.S. military aircraft is available on the
websites of Periscope (periscopeone.com), NASA, and the American Institute of Aeronautics
and Astronautics.
a) Construct a table similar to Table 4.1, listing relevant performance parameters of U.S.
fighter jets (F-14, F-15/15E, F-16, and F/A-18), and discuss how these compare from
different points of view.
b) Repeat part (a) for U.S. bombers, such as the B-52H, B-1B, B-2, and F-117.
c) Repeat part (a) for U.S. unmanned aerial vehicles such as the Black Widow, Hunter,
Predator, and Global Hawk.

19. Supercomputer trends

The Science and Technology Review published a chronology of supercomputers at Lawrence
Livermore National Laboratory, from UNIVAC 1, in 1953, to the ten-teraOPS ASCI White, in
2000 [Park02].
a) From this data, derive a growth-rate figure for supercomputer performance over the last
five decades of the twentieth century and compare it with the microprocessor performance
growth shown in Figure 3.10.
b) Is the growth rate of part (a) consistent with Figure 4.7? Discuss.
c) Try to obtain data about the main memory sizes of these computers. Derive a growth rate
for main memory size and compare it with the performance growth rate of part (a).
d) Repeat part (c) for mass memory.

20. MIPS rating

A computer has two classes of instructions. Class S instructions have a CPI of 1.5 and
class C instructions have a CPI of 5.0. The clock rate is 2 GHz. Let x be the fraction of
instructions belonging to class S in a set of programs of interest.
a) Plot the machine's MIPS rating as x varies from 0 to 1.
b) Determine the range of x values for which a speedup of 1.2 for class S instructions
leads to better performance than a speedup of 2.5 for class C instructions.
c) What would be a fair average MIPS rating for this machine if nothing is known about x?
(That is, the value of x is uniformly distributed over [0, 1] across the applications of
potential interest.)

UNIT 2
Instructions and Addressing
CHAPTER TOPICS

1. Abstract view of hardware


2. Instruction formats
3. Simple arithmetic and logical instructions
4. Load and store instructions
5. Jump and branch instructions
6. Addressing modes

This chapter begins the study of a simple instruction set that will help you understand the elements of
a modern instruction set architecture, the choices involved in its design, and aspects of its
execution in hardware. The chosen instruction set is called MiniMIPS; it defines a hypothetical
machine that is very close to the real MIPS processors, a family of processors offered by a company
of the same name. This instruction set is very well documented, has been used in many other
textbooks, and has a free simulator that can be downloaded for practice and programming exercises.
By the end of this chapter you will be able to compose sequences of instructions to perform
nontrivial computational tasks.

1. Abstract view of hardware


To drive a car, you should familiarize yourself with some key elements such as the accelerator and
brake pedals, the steering wheel, and some dashboard instruments. Collectively, these devices allow
you to monitor the car and its various parts, as well as observe the status of certain crucial subsystems.
The corresponding interface for computer hardware is its instruction set architecture. It is necessary
to learn this interface to be able to instruct the computer to perform computational tasks of interest.
In the case of cars, the user interface is so standardized that one can easily operate a rented car of a brand one has never driven before. The same cannot be said of digital computers, although many common instruction set features have developed over time. Once you become familiar with one machine's instruction set, you can learn others with little effort; the process is more like improving your vocabulary or studying a new dialect of your own language than learning a whole new language.

The MiniMIPS instruction set (minimal MIPS) is quite similar to what one finds in many modern processors. MiniMIPS is a load/store instruction set, which means that data elements must be copied, or loaded, into registers before processing; operation results also go to registers and must be explicitly copied back to memory via separate store operations. Therefore, to understand and be able to use MiniMIPS, it is necessary to know about the in-memory data storage scheme, the functions of load and store instructions, the types of operations allowed on data elements held in registers, and a few other miscellaneous aspects that enable efficient programming.

To drive a car, you do not need to know where the engine is located or how it drives the wheels; however, most instructors begin teaching driving by showing their students a diagram of a car with its various parts. At this juncture, Figure 5.1 is presented in the same spirit, although computer hardware


will be examined in much more detail in later parts of the book. The assembly language programmer (or compiler) deals with:

Registers.

Memory locations where data can be stored.

Machine instructions that operate on data and store it in registers or memory.

Figure 5.1 shows the MiniMIPS memory unit, with up to 2^30 words (2^32 bytes), an integer execution unit (EIU), a floating-point unit (FPU), and a trap and memory unit (TMU). The FPU and TMU are shown only in outline; the instructions given to these units will not be examined in this part of the book. Instructions relating to the FPU will be discussed in Chapter 12, while Part Six will reveal some of the uses of the TMU. The EIU interprets and executes the basic MiniMIPS instructions covered in this part of the book, and it has 32 general-purpose registers, each 32 bits wide; each register can therefore hold the contents of one memory word. The arithmetic/logic unit (ALU) executes the add, subtract, and logical instructions. A separate arithmetic unit is devoted to multiply and divide instructions, whose results are placed in two special registers, called "Hi" and "Lo", from which they can be moved to general-purpose registers.

A view of the MiniMIPS registers is shown in Figure 5.2. All registers except register 0 ($0, or $zero), which permanently holds the constant 0, are general-purpose and can be used to store arbitrary data words.


However, to facilitate efficient program development and compilation, certain restrictions on the use of registers are customary. A useful analogy for these restrictions is the way a person may decide to put change, keys, wallet, and so on in specific pockets to make it easier to remember where items can be found. The conventions used in this book are also shown in Figure 5.2. Most of these conventions will not make sense to you right now; therefore, the examples will use only registers $8 through $25 (with symbolic names $s0-$s7 and $t0-$t9) until the conventions for the other registers have been discussed.

A 32-bit data item stored in a register or memory location (with an address divisible by 4) is known as a word. For the time being, assume that a word holds an instruction or a signed integer, although it will be seen later that it can also hold an unsigned integer, a floating-point number, or a string of ASCII characters. Because MiniMIPS words are stored in a byte-addressable memory, a convention is needed to establish which end of the word appears in the first byte (the one with the lowest memory address). Of the two possible conventions in this regard, MiniMIPS uses the big-endian scheme, in which the most significant end appears first.

For certain values that do not need the full range of a 32-bit word, eight-bit bytes can be used. When a byte-sized data item is placed in a register, it appears at the far right of the register (the lowest-order byte position). A doubleword occupies two consecutive registers or memory locations. Again, the convention on the ordering of the two words is big-endian. When a pair of registers holds a doubleword, the smaller of


the two register indices is always even and is used to refer to the doubleword location (for example, one says "the doubleword is in register $16" to imply that its high end is in $16 and its low end in $17). Thus, only even-numbered registers can hold doublewords. A final point before engaging in the discussion of specific MiniMIPS instructions: even though a programmer (or compiler) does not need to know more about the hardware than what is covered in this part of the book, the implementation details presented in the rest of the book are still quite important. Returning to the car analogy, you can drive legally with virtually no knowledge of how a car is built or how its engine operates. However, to get the most out of the machine, or to be a very safe driver, you need to know a lot more.

2. Instruction formats
A typical MiniMIPS machine instruction is 'add $t8, $s2, $s1', which causes the contents of registers $s2 and $s1 to be added together and the result to be stored in register $t8. This may correspond to the compiled form of the high-level language statement a = b + c, and it in turn represents a machine word that uses 0s and 1s to encode the operation and the register specifications involved (Figure 5.3). As with high-level language programs, sequences of machine or assembly language instructions are executed from the top down unless a different execution order is explicitly specified by means of jump or branch instructions. Some of these instructions will be covered in Section 5.5. For now, focus on the meaning of simple instructions by themselves or on short sequences corresponding to compound operations. For example, the following sequence of assembly instructions performs the calculation g = (b + c) - (e + f) as a sequence of single-operation instructions, written one per line. The portion of each line that follows the "#" symbol is a comment that is inserted to assist the reader of the instruction sequence and is ignored during machine execution.
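Assuming, for illustration, that b, c, e, and f are in registers $s2, $s3, $s5, and $s6, and that g is to be placed in $s7, the sequence might be:

add $t8, $s2, $s3 # put the sum b + c in $t8

add $t9, $s5, $s6 # put the sum e + f in $t9

sub $s7, $t8, $t9 # set g to ($t8) - ($t9)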

The preceding sequence of instructions calculates g by performing the operations a = b + c, d = e + f, and g = a - d, with the temporary values a and d held in registers $t8 and $t9, respectively. When writing programs in machine language, you must supply integers that specify memory addresses or numeric constants. Such numbers are specified in decimal (base 10) or hexadecimal (base 16, or hexa for short) format and are automatically converted to binary format for machine processing. Here are some examples:

Decimal 25, 123456, –2873

Hexadecimal 0x59, 0x12b4c6, 0xffff0000

Hexadecimal numbers will be explained in Section 9.2. For now, just view them as a shorthand notation for bit patterns. The four-bit patterns 0000 through 1111 correspond to the 16 hexadecimal digits 0-9, a, b, c, d, e, and f. So the hexa number 0x59 represents the byte 0101 1001, and 0xffff0000 is short for the word 1111 1111 1111 1111 0000 0000 0000 0000. The reason for using the prefix "0x" is to distinguish these numbers both from decimal numbers (e.g., 59 or 059) and from variable names (e.g., x59).


A machine instruction for an arithmetic/logical operation specifies an opcode, one or more source operands, and usually a destination operand. The opcode is a binary code (bit pattern) that defines an operation. The operands of an arithmetic or logical instruction can come from a variety of sources. The method used to specify where the operands are located and where the result should go is referred to as the addressing mode (or scheme). For the time being, assume that all operands are in registers $s0-$s7 and $t0-$t9; other addressing modes will be discussed in Section 5.6.

The three MiniMIPS instruction formats are shown in Figure 5.4. The simplicity and uniformity of the instruction formats are typical of modern RISC (reduced instruction-set computer) machines, whose goal is to execute the most commonly used operations as quickly as possible, perhaps at the expense of less common ones. Chapter 8 will discuss other approaches to instruction set design, as well as the pros and cons of the RISC approach. The opcode field, which is common to all three formats, is used to distinguish instructions in the three classes. Non-overlapping subsets of the 64 possible opcodes are assigned to the instruction classes in such a way that the hardware can easily recognize the class of a particular instruction and, consequently, the appropriate interpretation of the remaining fields.

Register, or R-type, instructions operate on the two registers identified in the rs and rt fields and store the result in the rd register. For such instructions, the function field (fn) serves as an extension of the opcode to allow more operations to be defined, and the shift amount field (sh) is used in instructions that specify a constant shift amount. Simple (logical) shift instructions are covered in Chapter 6; arithmetic shifts will be discussed in Chapter 10.


Immediate, or I-type, instructions are really of two different varieties. In immediate instructions proper, the 16-bit operand field in bits 0-15 holds an integer that plays the same role as rt in R-type instructions; in other words, the specified operation is performed on the contents of register rs and the immediate operand, with the result stored in register rt. In load, store, or branch instructions, the 16-bit field is interpreted as an offset, or relative address, to be added to the base value in the rs register (or to the program counter) to obtain a memory address for reading or writing (or for control transfer). For data accesses, the offset is interpreted as the number of bytes forward (positive) or backward (negative) relative to the base address. For branch instructions, the offset is in words, since instructions always occupy complete 32-bit memory words.

Jump, or J-type, instructions cause an unconditional transfer of control to the instruction at the specified address. Because MiniMIPS addresses are 32 bits wide, while only 26 bits are available in the address field of a J-type instruction, two conventions are used. First, the 26-bit field is assumed to carry a word address as opposed to a byte address; the hardware therefore appends two 0s to the right end of the 26-bit address field to derive a 28-bit byte address. The four bits that are still missing are attached to the left end of the 28-bit address in a way that will be described in Section 5.5.

3. Simple arithmetic and logical instructions


In Section 5.2, the add and subtract instructions were introduced. These instructions operate on registers that contain full words (32 bits). For example:

add $t0, $s0, $s1 # set $t0 to ($s0) + ($s1)

sub $t0, $s0, $s1 # set $t0 to ($s0) - ($s1)


Figure 5.5 shows the machine representations of the preceding instructions. Logical instructions operate on a pair of operands on a bit-by-bit basis. Logical instructions in MiniMIPS include the following:

and $t0, $s0, $s1 # set $t0 to ($s0) ∧ ($s1)

or $t0, $s0, $s1 # set $t0 to ($s0) ∨ ($s1)

xor $t0, $s0, $s1 # set $t0 to ($s0) ⊕ ($s1)

nor $t0, $s0, $s1 # set $t0 to (($s0) ∨ ($s1))′

The machine representations of these logical instructions are similar to those in Figure 5.5, except that the function field holds 36 (100100) for and, 37 (100101) for or, 38 (100110) for xor, and 39 (100111) for nor.

Often, one operand of an arithmetic or logical operation is a constant. Although it is possible to place this constant in a register and then perform the desired operation on two registers, it is more efficient to use instructions that directly specify the desired constant in an I-format instruction. Of course, the constant must be small enough to fit into the 16-bit field that is available for this purpose. Hence, for signed integer values, the valid range is -32,768 to 32,767, while for hexa constants any four-digit number in the range [0x0000, 0xffff] is acceptable.

addi $t0, $s0, 61 # set $t0 to ($s0) + 61 (decimal)

The machine representation of addi is shown in Figure 5.6. There is no corresponding subtract immediate instruction, since its effect can be achieved by adding a negative value. Because MiniMIPS has a 32-bit adder, the 16-bit immediate operand must be converted to an equivalent 32-bit value before being applied as an input to the adder. Later, in Chapter 9, you will see that a signed number represented in 2's-complement format must be sign-extended if it is to yield the same numeric value in a wider format. Sign extension simply means that the most significant bit of the 16-bit signed value is repeated 16 times to fill the upper half of the corresponding 32-bit version. Thus, a positive number is extended with 0s and a negative number is extended with 1s.

Three of the logical instructions just introduced also have immediate, or I-format, versions:

andi $t0, $s0, 61 # set $t0 to ($s0) ∧ 61

ori $t0, $s0, 61 # set $t0 to ($s0) ∨ 61

xori $t0, $s0, 0x00ff # set $t0 to ($s0) ⊕ 0x00ff

The machine representations of these logical instructions are similar to those in Figure 5.6, except that the opcode field holds 12 (001100) for andi, 13 (001101) for ori, and 14 (001110) for xori.

A key difference between the andi, ori, and xori instructions and the addi instruction is that the 16-bit operand of a logical instruction is zero-extended from the left to convert it into 32-bit format for processing. In other words, the upper half of the 32-bit version of the immediate operand for logical instructions consists of 16 zeros.
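For instance, if $s0 holds 0, the following two instructions illustrate the difference: the 16-bit pattern 0xffff is sign-extended by addi (written here as the decimal value -1) but zero-extended by ori.

addi $t0, $s0, -1 # $t0 ← 0xffffffff (immediate sign-extended)

ori $t1, $s0, 0xffff # $t1 ← 0x0000ffff (immediate zero-extended)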


Example 1: Extracting fields from a word

A 32-bit word in $s0 holds a byte of data in bit positions 0-7 and a status flag in bit position 10. Other bits in the word have arbitrary (unpredictable) values. Unpack the information in this word, placing the data byte in register $t0 and the status flag in register $t1. At the end, register $t0 should hold an integer in [0, 255] corresponding to the data byte, and register $t1 should hold a nonzero value if the status flag is 1.


Solution: The fields of a word can be extracted by ANDing the word with a predefined mask, a word that has 1s in the bit positions of interest and 0s elsewhere. For example, ANDing the word with 0x000000ff (which in binary has 1s only in its eight least significant bit positions) has the effect of extracting its rightmost byte. Any desired flag bit can be similarly extracted by ANDing with a mask that has a single 1 in the desired position. Therefore, the following two instructions accomplish what is required:

andi $t0, $s0, 0x00ff

andi $t1, $s0, 0x0400

Note that the nonzero value left in $t1 when the flag bit is 1 is equal to 2^10 = 1024. It will be seen later that a shift instruction can be used to leave 0 or 1 in $t1 according to the value of the flag bit.

4. Load and store instructions


The load and store instructions transfer complete words (32 bits) between memory and registers. Each of these instructions specifies a register and a memory address. The register, which is the data destination for load and the data source for store, is specified in the rt field of an I-format instruction. The memory address is specified in two parts that are added together to produce the address: the 16-bit signed integer in the instruction is a constant offset that is added to the base value in the rs register. In assembly language format, the source/destination register rt is specified first, followed by the constant offset and the base register rs in parentheses. Placing rs in parentheses serves as a reminder of the A(i) indexing notation of high-level languages. Here are two examples:
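Assuming a base address in register $s3 and a symbolic offset A defined elsewhere, the two instructions might take the following form:

lw $t0, 40($s3) # load mem[40 + ($s3)] into $t0

sw $t0, A($s3) # store ($t0) into mem[A + ($s3)]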

The machine instruction format of lw and sw is shown in Figure 5.7 along with the memory addressing convention. Note that the offset can be specified as an absolute integer or via a symbolic name defined previously. Chapter 7 will discuss the definition of symbolic names.

Section 6.4 covers load and store instructions that deal with data types other than words. Here, only one other load instruction is introduced, one that allows an arbitrary constant to be placed in a desired register. A small constant, representable in 16 bits or fewer, can be loaded into a register with a single addi instruction whose other operand is register $zero (which always holds 0). A larger constant must be placed in a register in two steps: the upper 16 bits are loaded into the upper half of the register via the load upper immediate (lui) instruction, and then the lower 16 bits are inserted using an or immediate (ori) instruction. This works because lui fills the lower half of the destination register with 0s, so that when these bits are ORed with the immediate operand, the operand is copied into the lower half of the register.
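For example, an instruction of the following form (the constant 61 is arbitrary) loads a value into the upper half of a register:

lui $s0, 61 # load 61 into the upper half of $s0; the lower half is set to 0s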


Figure 5.8 shows the machine representation of the preceding lui instruction and its effect on the destination register $s0. Note that although the instruction pair lui and addi can sometimes achieve the same effect as lui followed by ori, this may not always be the case, because of the sign extension applied to the 16-bit immediate operand before it is added to the 32-bit register value.


Example 2: Loading arbitrary bit patterns into registers

Consider the bit pattern shown in Figure 5.6, corresponding to an addi instruction. Show how this particular bit pattern can be loaded into register $s0. Also, how can the bit pattern consisting of all 1s be placed in $s0?

Solution: The upper half of the bit pattern in Figure 5.6 has the hexa representation 0x2110, while the lower half is 0x003d. Therefore, the following two instructions accomplish what is required:

lui $s0, 0x2110 # put upper half of pattern in $s0

ori $s0, $s0, 0x003d # put lower half of pattern in $s0

These same two instructions, with their immediate operands changed to 0xffff, could place the all-1s pattern in $s0. A simpler and faster way is through the nor instruction:

nor $s0, $zero, $zero # because (0 ∨ 0)′ = 1

In 2's-complement representation, the all-1s bit pattern corresponds to the integer -1. Consequently, after learning in Chapter 9 that MiniMIPS represents integers in 2's-complement format, alternative solutions for this last part will become apparent.


5. Jump and branch instructions


The following two unconditional jump instructions are available in MiniMIPS:
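In assembly form they might be written as follows, with an illustrative target label:

j dest # go to the instruction at the address labeled "dest"

jr $ra # go to the address held in register $ra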

The first instruction represents a simple jump, which causes program execution to proceed from the location whose numeric or symbolic address is supplied, instead of continuing with the next instruction in sequence. Jump register, or jr, specifies a register that holds the jump target address. This register is often $ra, and the instruction jr $ra is used to return from a procedure to the point from which the procedure was called. Procedures and their use will be discussed in Chapter 6. Figure 5.9 shows the machine representations of the MiniMIPS jump instructions.

For the j instruction, the 26-bit address field in the instruction is augmented with 00 on the right and with the four high-order bits of the program counter (PC) on the left to form a complete 32-bit address. This is called pseudodirect addressing (see Section 5.6). Because, in practice, the PC is usually incremented quite early in the instruction execution cycle (before instruction decoding and execution), it is the four high-order bits of the incremented PC that are assumed to be used in augmenting the 26-bit address field.

Conditional branch instructions allow control to be transferred to a specified address when a condition of interest is satisfied. MiniMIPS has three basic comparison-based branch instructions, for which the conditions are a negative register content and the equality or inequality of the contents of two registers. Instead of providing separate instructions for other comparisons, MiniMIPS offers a set less than R-type comparison instruction that sets the contents of a specified destination register to 1 if the "less than" relationship holds between the contents of two given registers, and to 0 otherwise. For comparison with a constant, the immediate version of this instruction, namely slti, is also provided.


Figure 5.10 shows the machine representations of the three branch instructions and the two comparison instructions just discussed. For the three instructions bltz, beq, and bne, PC-relative addressing is used, whereby the 16-bit signed offset in the instruction is multiplied by 4 and the result is added to the PC contents to obtain a 32-bit branch target address. If the specified label is too far away to be reached with a 16-bit offset (a very rare occurrence), the assembler replaces beq $s0, $s1, L1 with a pair of MiniMIPS instructions:
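bne $s0, $s1, L2 # branch over the jump if ($s0) ≠ ($s1)

j L1 # unconditional jump to the distant label L1

L2: ...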


Here, the notation "L2:" assigns the symbolic address L2 to the instruction that appears on that line. Therefore, when writing branch instructions, symbolic names can be used freely, without worrying about whether the corresponding address is reachable from the current instruction with a 16-bit offset.

Branch and jump instructions can be used to form conditional and repetitive computational structures comparable to the if-then and while constructs of high-level languages. For example, the if-then statement

if (i == j) x = x + y;

can be translated into the following MiniMIPS assembly language program fragment:

bne $s1, $s2, endif # branch on i ≠ j

add $t1, $t1, $t2 # execute the "then" part
endif: ...

If the condition of the if statement were i < j, then the first instruction in the preceding sequence would be replaced by the following two instructions (the rest would not change):


slt $t0, $s1, $s2 # set $t0 to 1 if i < j

beq $t0, $0, endif # branch if ($t0) = 0;
# i.e., i not< j, or i ≥ j

Many other conditional branches can be synthesized in the same way.

Example 3: Compiling if-then-else statements

High-level languages contain an if-then-else construct that allows one of two calculations to be performed, depending on whether a condition is satisfied. Show a sequence of MiniMIPS instructions corresponding to the following if-then-else statement:

if (i <= j) x = x + 1; z = 1; else y = y - 1; z = 2 * z;

Solution: This is very similar to the if-then statement, except that instructions corresponding to the else part are needed, along with a way to skip the else part after the then part has been executed.

slt $t0, $s2, $s1 # j < i? (inverse condition)

bne $t0, $zero, else # if j < i go to else part
addi $t1, $t1, 1 # begin then part: x = x + 1
addi $t3, $zero, 1 # z = 1
j endif # skip the else part
else: addi $t2, $t2, -1 # begin else part: y = y - 1
add $t3, $t3, $t3 # z = z + z
endif: ...
Note that each of the two instruction sequences corresponding to the then and else parts of the conditional statement can be arbitrarily long.

The simple while loop

while (A[i] == k) i = i + 1;

can be translated into the following MiniMIPS assembly language program fragment, assuming that the index i, the starting address of array A, and the comparison constant k are in registers $s1, $s2, and $s3, respectively:

loop: add $t1, $s1, $s1 # compute 2i in $t1

add $t1, $t1, $t1 # compute 4i in $t1

add $t1, $t1, $s2 # form the address of A[i] in $t1

lw $t0, 0($t1) # load the value of A[i] into $t0

bne $t0, $s3, endwhl # exit the loop if A[i] ≠ k


addi $s1, $s1, 1 # i = i + 1

j loop # stay in the loop

endwhl: ...

Note that testing the while-loop condition requires five instructions: two to compute the offset 4i, one to add the offset to the base address in $s2, one to fetch the value of A[i] from memory, and one to test for equality.

Example 4: Loop with explicit array index and goto

Besides while loops, high-level language programs may contain loops in which a (conditional or unconditional) goto statement initiates the next iteration. Show a sequence of MiniMIPS instructions that performs the function of the following loop of this type:

loop: i = i + step;

sum = sum + A[i];

if (i ≠ n) goto loop;

Solution: This is quite similar to a while loop and could in fact be written as one. Borrowing from the while-loop example, the following instruction sequence is written, assuming that the index i, the starting address of array A, the comparison constant n, step, and sum are in registers $s1, $s2, $s3, $s4, and $s5, respectively:

loop: add $s1, $s1, $s4 # i = i + step

add $t1, $s1, $s1 # compute 2i in $t1

add $t1, $t1, $t1 # compute 4i in $t1

add $t1, $t1, $s2 # form the address of A[i] in $t1

lw $t0, 0($t1) # load the value of A[i] into $t0

add $s5, $s5, $t0 # sum = sum + A[i]

bne $s1, $s3, loop # if (i ≠ n) goto loop

It is assumed that repeatedly adding the step increment to the index variable i will eventually make it equal to n; otherwise, the loop will never terminate.


6. Addressing modes
The addressing mode is the method by which the location of an operand is specified within an instruction. MiniMIPS uses the six addressing modes shown schematically in Figure 5.11 and described as follows.

1. Implicit addressing: The operand comes from, or the result goes to, a predefined place that is not explicitly specified in the instruction. An example is found in the jal instruction (to be introduced in Section 6.1), which always saves the address of the next instruction in sequence in register $ra.

2. Immediate addressing: The operand is given in the instruction itself. Examples include the addi, andi, ori, and xori instructions, in which the second operand (or, rather, its lower half) is supplied as part of the instruction.

3. Register addressing: The operand is taken from, or the result placed in, a specified register. R-type instructions in MiniMIPS specify up to three registers as the locations of their operands and result. Registers are specified by their five-bit indices.

4. Base addressing: The operand is in memory, and its location is computed by adding an offset (a 16-bit signed integer) to the contents of a specified base register. This is the addressing mode of the lw and sw instructions.

5. PC-relative addressing: The same as base addressing, except that the register is always the program counter and the offset is appended with two 0s at its right end (word addressing is always used). This is

the addressing mode used in branch instructions; it allows branching to other instructions within ±2^15 words of the current instruction.

6. Pseudodirect addressing: In direct addressing, the operand address is supplied as part of the instruction. For MiniMIPS this is impossible, since 32-bit instructions do not have enough room to carry complete 32-bit addresses. The j instruction comes close to direct addressing in that it contains 26 bits of the jump target address, which is filled out with 00 at the right end and four bits of the program counter at the left to form a complete 32-bit address; hence the name "pseudodirect".

MiniMIPS has a load/store architecture: operands must be in registers before they can be processed. The only MiniMIPS instructions that refer to memory addresses are the load, store, and jump/branch instructions. This scheme, combined with the limited set of addressing modes, allows efficient hardware execution of MiniMIPS instructions. Other addressing modes exist; some of these alternative modes will be discussed in Section 8.2. However, the addressing modes used in MiniMIPS are more than adequate for convenience in programming and efficiency in the resulting programs.

In all, 20 MiniMIPS instructions were introduced in this chapter. Table 5.1 summarizes these instructions for review and easy reference. This instruction set is quite adequate for composing fairly complex programs. For such programs to be modular and efficient, the additional mechanisms discussed in Chapter 6 are needed. To make assembly programs complete, the assembler directives discussed in Chapter 7 are required.

TABLE 5.1 The 20 MiniMIPS instructions seen in Chapter 5.

Class        Instruction               Usage            Meaning                            op  fn

Copy         Load upper immediate      lui rt,imm       rt ← (imm, 0x0000)                 15

Arithmetic   Add                       add rd,rs,rt     rd ← (rs) + (rt)                    0  32
             Subtract                  sub rd,rs,rt     rd ← (rs) − (rt)                    0  34
             Set less than             slt rd,rs,rt     rd ← if (rs) < (rt) then 1 else 0   0  42
             Add immediate             addi rt,rs,imm   rt ← (rs) + imm                     8
             Set less than immediate   slti rt,rs,imm   rt ← if (rs) < imm then 1 else 0   10

Logic        AND                       and rd,rs,rt     rd ← (rs) ∧ (rt)                    0  36
             OR                        or rd,rs,rt      rd ← (rs) ∨ (rt)                    0  37
             XOR                       xor rd,rs,rt     rd ← (rs) ⊕ (rt)                    0  38
             NOR                       nor rd,rs,rt     rd ← ((rs) ∨ (rt))′                 0  39
             AND immediate             andi rt,rs,imm   rt ← (rs) ∧ imm                    12
             OR immediate              ori rt,rs,imm    rt ← (rs) ∨ imm                    13
             XOR immediate             xori rt,rs,imm   rt ← (rs) ⊕ imm                    14

Memory       Load word                 lw rt,imm(rs)    rt ← mem[(rs) + imm]               35
access       Store word                sw rt,imm(rs)    mem[(rs) + imm] ← (rt)             43

Transfer of  Jump                      j L              goto L                              2
control      Jump register             jr rs            goto (rs)                           0   8
             Branch less than 0        bltz rs,L        if (rs) < 0 then goto L             1
             Branch equal              beq rs,rt,L      if (rs) = (rt) then goto L          4
             Branch not equal          bne rs,rt,L      if (rs) ≠ (rt) then goto L          5

Example 5: Finding the maximum value in a list of numbers

A list of integers stored in memory begins at the address given in register $s1. The length of the list is provided in register $s2. Write a sequence of MiniMIPS instructions from Table 5.1 to find the largest integer in the list and copy it into register $t0.

Solution: All items in list A are examined, and $t0 is used to hold the largest integer identified so far (initially, A[0]). Each step compares a new list item against the value in $t0 and updates the latter if necessary.

lw $t0, 0($s1) # initialize the maximum to A[0]

addi $t1, $zero, 0 # initialize the index i to 0

loop: addi $t1, $t1, 1 # increment the index i by 1

beq $t1, $s2, done # if all items examined, exit

add $t2, $t1, $t1 # compute 2i in $t2

add $t2, $t2, $t2 # compute 4i in $t2

add $t2, $t2, $s1 # form the address of A[i] in $t2

lw $t3, 0($t2) # load the value of A[i] into $t3

slt $t4, $t0, $t3 # max < A[i]?

beq $t4, $zero, loop # if not, repeat with no change

addi $t0, $t3, 0 # if so, A[i] is the new maximum

j loop # change is complete; now repeat

done: ... # continuation of the program

Note that the list is assumed to be nonempty, containing at least the element A[0].


Problems
1) Instruction formats

In the MIPS instruction formats (Figure 5.4), the opcode field can be reduced from six to five bits and the function field from six to four bits. Given the number of MIPS instructions and distinct functions required, these changes would not limit the design; that is, there would still be enough opcodes and function codes. The three bits thus gained could then be used to extend the rs, rt, and rd fields from five to six bits each.

a) Name two positive effects of these changes. Justify your answers.

b) Name two negative effects of these changes. Justify your answers.

2) Other logical instructions

For certain applications, some other bitwise logical operations may be useful. Show how each of the following logical operations can be synthesized using the MiniMIPS instructions discussed in this chapter. Try to use as few instructions as possible.

a) NOT

b) NAND

c) XNOR

d) Immediate NOR

3) Overflow in addition

The sum of two 32-bit integers may not be representable in 32 bits; in this case, an overflow is said to have occurred. So far, there has been no discussion of how an overflow is detected or dealt with. Write a sequence of MiniMIPS instructions that adds two numbers stored in registers $s1 and $s2, stores the sum (modulo 2^32) in register $s3, and sets register $t0 to 1 if an overflow occurs and to 0 otherwise. Hint: Overflow is possible only with operands of the same sign; for two nonnegative (negative) operands, overflow has occurred if the computed sum is less than (greater than) either operand.

4) Multiplication by a small power of 2

Write a sequence of MiniMIPS instructions (using only those in Table 5.1) to multiply the integer x stored in register $s0 by 2^n, where n is a small nonnegative integer stored in $s1. The result should be placed in $s2. Hint: Use repeated doubling.

5) Compiling a switch/case statement


A switch/case statement allows multiway branching based on the value of an integer variable. In the following example, the switch variable s can be assumed to take one of the three values in [0, 2], and a different action is associated with each case.

switch (s) {

case 0: a = a + 1; break;

case 1: a = a - 1; break;

case 2: b = 2 * b; break;

}

Show how such a statement can be compiled into MiniMIPS assembly instructions.

6) Absolute value calculation

Write a sequence of MiniMIPS instructions to place in register $s0 the absolute value of a parameter that is stored in register $t0.

7) Swapping without intermediate values

Write a sequence of MiniMIPS instructions to swap the contents of registers $s0 and $s1 without disturbing the contents of any other register. Hint: (x ⊕ y) ⊕ y = x.

8) Instruction set size

Suppose that in MiniMIPS all instructions had to be encoded using only the opcode and function fields; that is, no other part of the instruction could carry information about the instruction type, even when unused for other purposes. Let nR, nI, and nJ be the allowed numbers of instructions of types R, I, and J, respectively. Write an equation from which the maximum possible value of nR, nI, or nJ can be obtained when the other two counts are given.

9) Conditional branching

Modify the solution of Example 3 so that the condition tested is:

a) i < j
b) i >= j
c) i + j <= 0
d) i + j > m + n


10) Conditional branching

Modify the solution of Example 4 so that the condition tested at the end of the loop is:

a) i < n
b) i <= n
c) sum >= 0
d) A[i] == 0

11) Mysterious program fragment

The following program fragment computes a result f(n) in register $s1 when the nonnegative integer n is provided in register $s0. Add appropriate comments to the instructions and characterize f(n).

add $t2, $s0, $s0

addi $t2, $t2, 1

addi $t0, $zero, 0

addi $t1, $zero, 1

loop: add $t0, $t0, $t1

beq $t1, $t2, done

addi $t1, $t1, 2

j loop

done: add $s1, $zero, $t0

12) Other conditional branches

Using only the instructions in Table 5.1, show how an effect equivalent to each of the following conditional branch instructions can be obtained:

a) bgtz (branch on greater than 0)

b) beqz (branch on equal to 0)
c) bnez (branch on not equal to 0)
d) blez (branch on less than or equal to 0)
e) bgez (branch on greater than or equal to 0)

13) Identifying extreme values

Modify the solution of Example 5 so that it identifies:

a) The smallest integer in the list.

b) The list item with the largest absolute value.


c) The list element whose least significant byte is the largest.

14) Summing a set of values stored in registers

Write the shortest possible sequence of MiniMIPS instructions from Table 5.1 to sum the contents of the eight registers $s0-$s7 and store the result in $t0. The original contents of the eight registers must not be modified.

a) Assume that registers $t0 through $t9 can be freely used to hold temporary values.

b) Assume that no register other than $t0 may be modified in the process.

15) Modular reduction

The nonnegative integers x and y are stored in registers $s0 and $s1, respectively.

a) Write a sequence of MiniMIPS instructions that computes x mod y and stores the result in $t0. Hint: Use repeated subtraction.

b) Derive the number of instructions executed in the computation of part a) in the best and worst cases.

c) Does your answer to part a) yield a reasonable result in the special case y = 0?

d) Augment the instruction sequence of part a) so that it also produces ⌊x/y⌋ in $t1.

e) Derive the number of instructions executed in the computation of part d) in the best and worst cases.

16) Circular buffer

A 64-entry circular buffer is stored in memory locations 5000 through 5252. The storage locations are referred to as B[0] through B[63], with B[0] viewed as following B[63] circularly. Register $s0 holds the address of the first buffer entry B[0]. Register $s1 points to the entry just beyond the last one (i.e., where the next element B[l] should be placed, assuming the current buffer length is l). A fullness flag in $s2 holds 1 if l = 64 and 0 otherwise.

a) Write a sequence of MiniMIPS instructions to copy the contents of $t0 into the buffer if it is not full, adjusting the fullness flag if the buffer becomes full as a result.

b) Write a sequence of MiniMIPS instructions to copy the first buffer element into $t1, provided the buffer is not empty, and to adjust $s0 accordingly.

c) Write a sequence of MiniMIPS instructions to search the buffer for an element x, whose value is provided in $t2, setting $t3 to 1 if such an element is found and to 0 otherwise.

17) Optimization of instruction sequences

Consider the following sequence of three high-level language statements: X = Y + Z; Y = X + Z; W = X - Y. Write an equivalent sequence of MiniMIPS instructions, assuming that W must be formed in $t0 and that X, Y, and Z are in the memory locations whose addresses are given in registers $s0, $s1, and $s2, respectively. Also, assume the following:


a) The computation must be performed exactly as specified, and each statement is compiled independently of the others.

b) Any sequence of instructions is acceptable as long as the final result is the same.

c) Same as part b), but the compiler has access to temporary registers reserved for its use.

18) Initializing a register

a) Identify all possible single MiniMIPS instructions from Table 5.1 that can be used to initialize a desired register to 0 (the all-0s bit pattern).

b) Repeat part a) for the all-1s bit pattern.

c) Characterize the set of all bit patterns that can be placed in a desired register using a single instruction from Table 5.1.


PROCEDURES AND DATA


CHAPTER TOPICS

1. Simple procedure calls


2. Using the stack for data storage
3. Parameters and results
4. Data types
5. Arrays and pointers
6. Additional instructions

This chapter continues the study of the MiniMIPS instruction set by examining the instructions needed for procedure calls and the mechanisms used to pass data between calling and called routines. In the process, you will learn other details of the instruction set architecture and become familiar with some important ideas about data types, nested procedure calls, the utility of the stack, access to array elements, and applications of pointers. By the end of the chapter, you will know enough instructions to write nontrivial, useful programs.

1. Simple procedure calls


A procedure is a small program that, when called (initiated, invoked), performs a specific task, perhaps leading to one or more results, based on the input parameters (arguments) supplied to it, and then returns to the point of call. In assembly language, a procedure is associated with a symbolic name that denotes its starting address. The jal instruction in MiniMIPS is intended specifically for procedure calls: it performs a transfer of control (unconditional jump) to the starting address of the procedure, while at the same time saving the return address in register $ra.

jal proc # jump to loc “proc” and link;


# “link” means “save the return
# address” (PC) + 4 in $ra ($31)

Note that while a jal instruction is being executed, the program counter holds its address; therefore, (PC) + 4 designates the address of the instruction right after jal. The machine instruction format of jal is identical to that of the jump instruction shown at the top of Figure 5.9, except that the opcode field contains 3. Using a procedure involves the following sequence of actions:

1. Put the arguments in places known to the procedure (registers $a0 - $a3).

2. Transfer control to the procedure, saving the return address (jal).

3. Acquire storage space, if required, for use by the procedure.

4. Perform the desired task.

5. Put the results in places known to the calling program (registers $v0 - $v1).

6. Return control to the point of call (jr).
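As a minimal sketch of this sequence (the argument value, register choices, and the trivial task of doubling are all made up for illustration):

main: addi $a0, $zero, 9 # step 1: put the argument in $a0

jal proc # step 2: jump to proc, saving (PC) + 4 in $ra

add $s0, $v0, $zero # after the return: copy the result out of $v0

...

proc: add $v0, $a0, $a0 # steps 3-5: compute the result (here, doubling) in $v0

jr $ra # step 6: return control to the point of call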


The last step is performed via the jump register instruction (Figure 5.9):

jr rs # go to loc addressed by rs

Later we will see how procedures with more than four words of arguments and two words of results can be accommodated.

As a procedure executes, it uses registers to hold operands and partial results. Upon returning from a procedure, the calling program reasonably expects to find its own operands and partial results where they were before the procedure call. If this requirement were enforced for all registers, each procedure would need to save the original contents of any register it uses and restore them before termination. Figure 6.1 shows this saving and restoring within the called procedure, as well as the preparation for calling the procedure and continuing in the main program.

To avoid the overhead associated with a large number of register save and restore operations during procedure calls, the following convention is used. A procedure may freely use registers $v0 - $v1 and $t0 - $t9 without having to save their original contents; in other words, a calling program should not expect any values placed in these 12 registers to remain unchanged after a procedure call. If the calling program has useful values in these registers, it must save them to other registers or to memory before calling a procedure. This division of responsibility between calling and called programs is quite sensible. It allows, for example, a simple procedure to be called with no saving and restoring at all. As shown in Figure 5.2, the calling program (caller) can expect the values in the following registers to be undisturbed after a procedure call: $a0 - $a3, $s0 - $s7, $gp, $sp, $fp, $ra. A procedure (the called program, or callee) that modifies any of these registers must restore them to their original values before terminating. To summarize, the register-saving convention in MiniMIPS is as follows:

Registers saved by the caller: $v0 - $v1, $t0 - $t9

Registers saved by the callee: $a0 - $a3, $s0 - $s7, $gp, $sp, $fp, $ra


Section 6.2 will discuss the uses of the $gp, $sp, and $fp registers. In all cases, saving the contents of a register is required only if that content will be used in the future by the caller/callee. For example, saving $ra is required only if the callee will itself call another procedure, which would then overwrite the current return address with its own return address. In Figure 6.2, the procedure abc in turn calls the procedure xyz. Note that before calling xyz, the procedure abc performs some preparatory actions, including putting the arguments in registers $a0 - $a3 and saving any of the registers $v0 - $v1 and $t0 - $t9 that contain useful data. After the return from xyz, the procedure abc may transfer any results in $v0 - $v1 to other registers. The latter is needed, for example, before calling yet another procedure, so that the results of the previous procedure are not overwritten by the next one. Note that, with proper care, registers $v0 and $v1 can also be used to pass parameters to a procedure, which allows up to six parameters to be passed to a procedure without using the stack.

The design of a procedure is illustrated by two simple examples. More interesting and realistic
procedures will be given later in this chapter.

Example 1: Finding the absolute value of an integer


Write a MiniMIPS procedure that accepts an integer parameter in register $a0 and returns its absolute value in $v0.

Solution: The absolute value of x is -x if x < 0, and x otherwise.

abs: sub $v0, $zero, $a0 # put -($a0) in $v0, in case ($a0) < 0

bltz $a0, done # if ($a0) < 0 then done

add $v0, $a0, $zero # else put ($a0) in $v0
done: jr $ra # return to calling program

In practice, such short procedures are seldom used because of the overhead they entail. This example involves three or four instructions of overhead for three instructions of useful computation. The overhead instructions are jal, jr, an instruction to place the parameter in $a0 before the call, and perhaps an instruction to move the returned result (we say "perhaps" because the result might be used directly from register $v0, if this happens before the next procedure call).

Example 2: Finding the largest of three integers

Write a MiniMIPS procedure that accepts three integer parameters in registers $a0, $a1, and $a2 and returns the maximum of the three values in $v0.

Solution: Begin by assuming that $a0 holds the largest value, compare this value against each of the other two values, and replace it whenever a larger value is found.

max: add $v0, $a0, $zero # copy ($a0) into $v0; largest so far

sub $t0, $a1, $v0 # compute ($a1) - ($v0)

bltz $t0, okay # if ($a1) - ($v0) < 0, no change

add $v0, $a1, $zero # else $a1 is the largest so far

okay: sub $t0, $a2, $v0 # compute ($a2) - ($v0)

bltz $t0, done # if ($a2) - ($v0) < 0, no change

add $v0, $a2, $zero # else $a2 is the largest of all

done: jr $ra # return to the calling program

The comments at the end of the solution to Example 1 apply here as well; see if you can work out the overhead in this case.


2. Using the stack for data storage


The mechanisms and conventions discussed in Section 6.1 are adequate for procedures that accept up to four arguments, return up to two results, and use no more than a dozen or so intermediate values in the course of their computations. Beyond these limits, or when the procedure must itself call another procedure (and thus needs to save some values), additional storage space is needed. A common mechanism for saving things, or for making room for temporary data that a procedure requires, is the use of a dynamic data structure known as a stack.

Before discussing the stack and how it solves the data storage problem for procedures, consider the conventions for the use of the memory address space in MiniMIPS. Figure 6.3 shows a map of the MiniMIPS memory and the use of the three pointer registers $gp, $sp, and $fp. The second half of the MiniMIPS memory (beginning at hexa address 0x80000000) is used for memory-mapped I/O and is therefore unavailable for storing programs or data. Chapter 22 will discuss memory-mapped I/O (input/output addressing). The first half, which extends from address 0 to address 0x7fffffff, is divided into four segments as follows:

The first 1M words (4 MB) are reserved for system use.

The next 63M words (252 MB) hold the text of the program to be executed.

Beginning at hexa address 0x10000000, program data are stored.


Beginning at hexa address 0x7ffffffc and growing backward is the stack.

Dynamic program data and the stack can grow in size up to the maximum available memory. If the global pointer register ($gp) is set to hold the address 0x10008000, then the first 2^16 bytes of program data are easily accessible via base addressing of the form imm($gp), where imm is a 16-bit signed integer.

This brings us to the use of the stack segment and its stack-top pointer register $sp (the stack pointer, for short). The stack is a dynamic data structure in which data can be placed and retrieved in last-in, first-out order. It can be likened to the stack of trays in a cafeteria. As trays are cleaned and made ready for use, they are placed on top of the stack of trays; similarly, customers take their trays from the top of the stack. The last tray placed on the stack is the first one taken.

Data items are added to the stack by pushing them onto the stack and retrieved by popping them. The push and pop operations are illustrated in Figure 6.4 for a stack that has the data elements b and a as its top two elements. The stack pointer points to the top element of the stack, which currently holds b. This means that the instruction lw $t0, 0($sp) causes the value b to be copied into $t0, and sw $t1, 0($sp) causes b to be overwritten with the contents of $t1. Thus, a new element c, currently in register $t4, can be pushed onto the stack using the following two MiniMIPS instructions:

push: addi $sp, $sp, -4 # move the stack pointer down one word

sw $t4, 0($sp) # store ($t4) in the new top element

Note that the order of the two instructions can be reversed if the address -4($sp) is used instead of 0($sp). To pop the element b off the stack of Figure 6.4, an lw instruction is used to copy b into a desired register, and the stack pointer is then incremented by 4 to point to the next element of the stack:

pop: lw $t5, 0($sp) # copy the top element into $t5
addi $sp, $sp, 4 # move the stack pointer up one word

Again, the stack pointer could be adjusted before the top element is copied. Note that a pop operation does not erase the old top element b from memory; b is still where it was before the pop operation and


would in fact be accessible via the address -4($sp). However, the location holding b is no longer considered part of the stack, whose top element is now a.

Although technically a stack is a last-in, first-out data structure, any element within the stack can in fact be accessed if its relative order from the top is known. For example, the element just below the top of the stack (its second element) can be accessed using the memory address 4($sp), and the fifteenth element of the stack is at address 56($sp). Similarly, elements need not be removed one at a time: if the top ten elements of a stack are no longer needed, they can all be removed at once by incrementing the stack pointer by 40, as shown below.
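For instance (offsets as in the preceding discussion):

lw $t0, 4($sp) # copy the stack's second element into $t0

addi $sp, $sp, 40 # discard the top ten elements all at once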

3. Parameters and results


This section answers the following as-yet-unresolved questions related to procedures:

1. How can more than four input parameters be passed to a procedure, or more than two results received from it?

2. Where does a procedure save its own parameters and intermediate results when it calls another procedure (a nested call)?

In both cases the stack is used. Before a procedure call, the calling program pushes the contents of any registers it needs to save onto the top of the stack, following them with any additional arguments for the procedure. The procedure can then access these arguments on the stack. When the procedure terminates, the calling program expects to find the stack pointer undisturbed; this allows it to restore the saved registers to their original states and proceed with its own computations. Consequently, a procedure that uses the stack, thereby modifying the stack pointer, must save the contents of the stack pointer at the outset and, at the end, restore $sp to its original state. The saving is done by copying the stack pointer into the frame pointer register $fp. However, before this is done, the old contents of the frame pointer must themselves be saved. Thus, while a procedure is executing, $fp holds the contents that $sp had just before the procedure was called; $fp and $sp together "bracket" the area of the stack that is in use by the current procedure (Figure 6.5).

In the example shown in Figure 6.5, the three parameters a, b, and c are passed to the procedure by placing them on top of the stack just before the procedure call. The procedure first pushes the contents of $fp onto the stack, copies the stack pointer into $fp, pushes onto the stack the contents of any registers that need to be saved, uses the stack to hold those temporary local variables that cannot be kept in registers, and so on. It may later call another procedure, placing arguments such as y and z on top of the stack. Each procedure in a nested sequence leaves undisturbed the part of the stack belonging to previous callers and uses the part of the stack beginning with the empty slot just above the entry pointed to by $sp. Upon completion of a procedure, the process is reversed: local variables are removed from the stack, the register contents are restored, the frame pointer is copied into $sp, and finally $fp is restored to its original state.


Throughout this process, $fp provides a stable reference point for addressing memory words in the
portion of the stack corresponding to the current procedure. It also offers a convenient way to return
$sp to its original value upon completion of the procedure. Words in the current frame have addresses
-4($fp), -8($fp), -12($fp), and so on. Whereas the stack pointer may change in the course of the
procedure's execution, the frame pointer holds a fixed address throughout. Note that the use of $fp is
optional. A procedure that does not itself call another procedure can use the stack for any purpose
without ever changing the stack pointer. It simply stores data in, and accesses, stack entries beyond
the current stack top with the memory addresses -4($sp), -8($sp), -12($sp), and so on. In this
way, the stack pointer does not need to be saved because it is never modified by the procedure.

For example, the following procedure saves the contents of $fp, $ra, and $s0 in the stack at the
beginning and restores them just before termination:
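(The code figure is not reproduced in this copy; the following is a minimal sketch of what such a procedure might look like, consistent with the conventions of Figure 6.5. The label proc and the ordering of the saves are illustrative assumptions.)

proc: addi $sp, $sp, -4 # push: make room for one word
 sw $fp, 0($sp) # save the old frame pointer
 addi $fp, $sp, 0 # copy stack pointer into $fp
 addi $sp, $sp, -8 # make room for two more words
 sw $ra, 4($sp) # save the return address
 sw $s0, 0($sp) # save ($s0)
 ... # procedure body
 lw $s0, 0($sp) # restore ($s0)
 lw $ra, 4($sp) # restore the return address
 addi $sp, $fp, 0 # copy frame pointer back into $sp
 lw $fp, 0($sp) # restore the old frame pointer
 addi $sp, $sp, 4 # pop the saved frame pointer
 jr $ra # return from the procedure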


If it is known that this procedure neither calls another procedure nor needs the stack for any purpose
other than saving the contents of the $s0 register, the same end could be achieved without saving
$fp or $ra, or even adjusting the stack pointer:

proc: sw $s0, -4($sp) # save ($s0) just above the stack top

lw $s0, -4($sp) # restore ($s0) from above the stack top

jr $ra # return from the procedure

This substantially reduces the overhead of the procedure call.

4. Data types


Section 5.1 and Figure 5.2 made reference to various data sizes (byte, word, doubleword) in
MiniMIPS. Subsequently, however, only instructions dealing with word-size operands were
considered. For example, lw and sw transfer words between registers and memory locations, logical
instructions operate on 32-bit operands that are viewed as bit strings, and arithmetic
instructions operate on word-size signed integers. Other commonly used data sizes are halfword
(16 bits) and quadword (128 bits). Neither is recognized in MiniMIPS. Although the immediate
operands in MiniMIPS are halfwords, they are sign-extended (in the case of signed integers) or
zero-extended (in the case of bit strings) to 32 bits before being operated on. The only exception
is the lui instruction, which loads a 16-bit immediate operand directly into the upper half of a register.

While data size refers to the number of bits in a particular piece of data, the data type reflects the
meaning assigned to a data item. MiniMIPS has the following data types, each provided in the sizes
that are available for it:

Note that there is no doubleword integer or byte-size floating-point number: the former would be
quite feasible but is excluded to simplify the instruction set, whereas the latter is not feasible.
Floating-point numbers will be discussed in Chapter 9; until then, nothing more will be said about this
type of data. Thus, the rest of this section discusses the differences between signed and unsigned
integers and some notions related to bit strings.

In MiniMIPS, signed integers are represented in 2's-complement format. Integers can assume values
in the range [-2^7, 2^7 - 1] = [-128, 127] in 8-bit format and [-2^31, 2^31 - 1] = [-2,147,483,648,
2,147,483,647] in 32-bit format. In machine representation, the sign of a signed integer is evident from
its most significant bit: 0 for "+" and 1 for "-". This representation will be discussed in Section 9.4.
Unsigned integers are just ordinary binary numbers, with values in [0, 2^8 - 1] = [0, 255] in 8-bit
format and [0, 2^32 - 1] = [0, 4,294,967,295] in 32-bit format. The rules of arithmetic with signed and
unsigned numbers are different; therefore, MiniMIPS provides a number of instructions for unsigned
arithmetic. The latter will be presented in Section 6.6. Changing the data size is also done differently
for signed and unsigned numbers. Going from a narrower format to a wider one is done by sign
extension (repeating the sign bit in the additional positions) for signed numbers and by zero extension
for unsigned numbers. The following examples illustrate the difference:
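(The original examples are not reproduced in this copy; the following pair, using the 8-bit patterns 0x25 and 0xA5 as illustrative values, shows the idea.)

8-bit source 0x25 = 00100101: sign-extended → 0x00000025 (37); zero-extended → 0x00000025 (37)
8-bit source 0xA5 = 10100101: sign-extended → 0xFFFFFFA5 (-91); zero-extended → 0x000000A5 (165)

For 0x25, the sign bit is 0, so both extensions agree. For 0xA5, the sign bit is 1, so sign extension fills the added positions with 1s and preserves the signed value -91, whereas zero extension yields the unsigned value 165.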

Sometimes you want to process strings of bytes (for example, small integers or symbols of an
alphabet) as opposed to words. Load and store instructions for byte-size data will be presented
shortly. From the preceding discussion it follows that a signed byte and an 8-bit unsigned integer will
be loaded differently into a 32-bit register. As a result, the load byte instruction, like many other
instructions with similar properties, has two versions, for signed and for unsigned values.

An important use for byte-size data elements is to represent the symbols of an alphabet consisting
of letters (uppercase and lowercase), digits, punctuation marks, and other necessary symbols. In hexa
notation, an eight-bit byte becomes a two-digit number. Table 6.1 shows how such two-digit hexa
numbers (eight-bit bytes) are assigned to represent the symbols of the American Standard Code for
Information Interchange (ASCII). In reality, Table 6.1 defines a seven-bit code and leaves the right half
of the code table unspecified. The unspecified half can be used for other symbols needed in various
applications. Even though "8-bit/7-bit ASCII" is the correct usage, one sometimes speaks of the
"8-bit/7-bit ASCII code" (i.e., "code" is repeated, for clarity, even though the "C" in ASCII already stands for it).

When an alphabet with more than 256 symbols is involved, each symbol requires more than one byte
for its representation. The next larger data size, a halfword, can be used to encode an alphabet with up
to 2^16 = 65,536 symbols. This is more than adequate for almost all applications of interest. Unicode,
an international standard, uses 16-bit halfwords to represent each symbol.


TABLE 6.1 ASCII (American Standard Code for Information Interchange)

1: The eight-bit ASCII code is formed as (column# row#)hex; for example, "+" is (2b)hex.

2: Columns 0–7 define the 7-bit ASCII code; for example, "+" is (010 1011)two.

3: The ISO code differs from ASCII only in the monetary symbol.

4: Columns 0–1 and 8–9 contain control characters, which are listed below in alphabetical order:

ACK Acknowledge BEL Bell BS Backspace

CAN Cancel CR Carriage return DCi Device control i

DLE Data link escape EM End of medium ENQ Enquiry

EOT End of transmission ETB End of trans. block ETX End of text

FF Form feed FS File separator GS Group separator

HT Horizontal tab LF Line feed NAK Negative acknowledge

NUL Null RS Record separator SI Shift in

SO Shift out SOH Start of heading SP Space

STX Start of text SUB Substitute SYN Synchronous idle

US Unit separator VT Vertical tab


In MiniMIPS, the load byte (lb), load byte unsigned (lbu), and store byte (sb) instructions allow bytes
to be transferred between memory and registers:
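(The instruction listing figure is not reproduced here; the following lines, with illustrative register and offset choices, show how these instructions are typically written.)

lb $t0, 0($s1) # load byte from mem[($s1)] into $t0, sign-extended
lbu $t0, 0($s1) # load byte from mem[($s1)] into $t0, zero-extended
sb $t0, 4($s1) # store the low byte of $t0 into mem[($s1) + 4]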

Figure 6.6 shows the machine instruction format for the lb, lbu, and sb instructions. As always, the base
address can be specified as an absolute value or by a symbolic name.

This section concludes with a very important observation. A string of bits, stored in memory or in a
register, has no inherent meaning. A 32-bit word can mean different things, depending on how it
is interpreted or processed. For example, Figure 6.7 shows that the same word has three different
meanings, depending on how it is interpreted or used. The word represented by the hexa pattern
0x02114020 becomes an add instruction if fetched from memory and executed. The same word, when
used as an operand in an arithmetic instruction, represents a positive integer. Finally, as a
string of bits, the word can represent a sequence of four ASCII symbols. None of these interpretations
is more natural, or more valid, than the others. Still other interpretations are possible. For example, you
can view the bit string as a pair of Unicode symbols or as the attendance record for 32 students in a
particular class, where 0 designates "absent" and 1 means "present".

5. Arrays and Pointers


In a wide variety of programming tasks, it becomes necessary to step through an array or list,
examining each of its elements in turn. For example, to determine the largest value in a list of
integers, you must examine every item in the list. There are two basic ways to achieve this:

1. Index: Use a register that holds the index i, incrementing the register at each step to move from
element i of the list to element i + 1.

2. Pointer: Use a register that points to (holds the address of) the list element being examined,
updating it at each step to point to the next element.

Either approach is valid, but the second is a bit more efficient for MiniMIPS, given its lack of an indexed
addressing mode that would allow the value of index i, held in a register, to be used in the address
calculation. To implement the first scheme, the address must be computed by several instructions that
essentially form 4i and then add the result to the register holding the base address of the array or
list. In the second scheme, a single instruction that adds 4 to the pointer register advances it to the next
element of the array. Figure 6.8 graphically depicts the two methods.
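As a concrete illustration (the register assignments here are assumptions, not taken from the figure), the two methods of accessing A[i] inside a loop might look as follows, with the base address of A in $s0, the index i in $t0, and a traversal pointer in $s1:

# index method: recompute the address of A[i] on every iteration
add $t2, $t0, $t0 # 2i
add $t2, $t2, $t2 # 4i
add $t2, $t2, $s0 # address of A[i]
lw $t3, 0($t2) # load A[i]
addi $t0, $t0, 1 # i = i + 1

# pointer method: just advance the pointer
lw $t3, 0($s1) # load the current element
addi $s1, $s1, 4 # advance pointer to the next element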

However, dealing with pointers is conceptually more difficult and prone to error, whereas the use of
array indices appears more natural and is therefore easier to understand. Thus, the use of pointers just
to improve program efficiency is usually not recommended. Fortunately, modern compilers are smart
enough to replace such indexed code with the equivalent (more efficient) code that uses pointers.
Thus, programmers can relax and use whichever way is most natural for them.

The use of these methods is illustrated by two examples. Example 6.3 uses array indexes, while
example 6.4 takes advantage of pointers.


Example 6.3: Maximum-sum prefix in a list of integers

Consider a list of integers of length n. A prefix of length i for the given list consists of the first i integers
in the list, where 0 ≤ i ≤ n. A maximum-sum prefix, as the name implies, is a prefix for which
the sum of elements is the largest among all prefixes. For example, if the list is (2, -3, 2, 5, -4), its
maximum-sum prefix consists of the first four elements and the associated sum is 2 - 3 + 2 + 5 =
6; no other prefix of this particular list has a larger sum. Write a MiniMIPS program to find the
length of the maximum-sum prefix and the sum of its elements for a given list.

Solution: The strategy for solving this problem is rather obvious. Start by initializing the max-sum prefix
to length 0 with a sum of 0. Then gradually increase the length of the prefix, each time computing the
new sum and comparing it with the maximum sum so far. When a larger sum is found, the length
and sum of the max-sum prefix are updated. Write the program in the form of a procedure that accepts
the base address of array A in $a0 and its length n in $a1, returning the length of the max-sum prefix in
$v0 and the associated sum in $v1.

# base address of A in $a0, length n in $a1

mspfx: addi $v0, $zero, 0 # initialize length in $v0 to 0

addi $v1, $zero, 0 # initialize max sum in $v1 to 0

addi $t0, $zero, 0 # initialize index i in $t0 to 0

addi $t1, $zero, 0 # initialize running sum in $t1 to 0

loop: add $t2, $t0, $t0 # set $t2 to 2i

add $t2, $t2, $t2 # set $t2 to 4i

add $t3, $t2, $a0 # set $t3 to 4i + A (address of A[i])

lw $t4, 0($t3) # load A[i] from mem[($t3)] into $t4

add $t1, $t1, $t4 # add A[i] to the running sum in $t1

slt $t5, $v1, $t1 # set $t5 to 1 if max sum < new sum

bne $t5, $zero, mdfy # if max sum is smaller, modify results

j test # done?

mdfy: addi $v0, $t0, 1 # new max-sum prefix has length i + 1

addi $v1, $t1, 0 # new max sum is the running sum

test: addi $t0, $t0, 1 # advance index i

slt $t5, $t0, $a1 # set $t5 to 1 if i < n

bne $t5, $zero, loop # repeat if i < n

done: jr $ra # return length = ($v0), max sum = ($v1)


Because the prefix data are updated only when a strictly larger sum is found, the procedure
identifies the shortest prefix in the case that several prefixes have the same maximum
sum.

6. Additional instructions
Chapter 5 introduced 20 instructions for MiniMIPS (Table 5.1), and four additional instructions for
procedure call (jal) and byte-oriented data (lb, lbu, sb) have appeared so far in this chapter. As
examples 6.3 and 6.4 indicate, useful, nontrivial programs can be written using just these 24
instructions. However, MiniMIPS has additional instructions that allow more complex computations
to be performed and programs to be expressed more efficiently. This section presents some of these
instructions, bringing the total to 40. These complete the core part of the MiniMIPS instruction set
architecture. The instructions that deal with floating-point arithmetic and exceptions (that is, with the
coprocessors shown in Figure 5.1) have yet to be covered. Floating-point instructions will be
introduced in Chapter 12, while instructions related to exception handling will be introduced as
needed, starting with Section 14.6.

We begin by presenting several additional arithmetic/logic instructions. The multiply (mult)
instruction is an R-format instruction that places the doubleword product of the contents of
two registers in the special registers Hi (upper half) and Lo (lower half). The div instruction computes
the remainder and quotient of the contents of two source registers, and places them in the special
registers Hi and Lo, respectively.

mult $s0, $s1 # set Hi, Lo to ($s0) × ($s1)

div $s0, $s1 # set Hi to ($s0) mod ($s1)

# and Lo to ($s0) ÷ ($s1)

Figure 6.10 shows the machine representations of the mult and div instructions. To be able to use the
results of mult and div, two special instructions, "move from Hi" (mfhi) and "move from Lo"
(mflo), are provided to copy the contents of these two special registers into one of the general registers:

mfhi $t0 # set $t0 to (Hi)

mflo $t0 # set $t0 to (Lo)
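For instance, a full 64-bit product can be retrieved into general registers as follows (a simple sketch; the register choices are arbitrary):

mult $s0, $s1 # Hi, Lo ← ($s0) × ($s1)
mflo $t0 # $t0 ← lower 32 bits of the product
mfhi $t1 # $t1 ← upper 32 bits of the product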

Figure 6.11 shows the machine representations of mfhi and mflo. Now that you have learned about
the MiniMIPS multiplication instruction, it may be tempting to use mult to compute the offset 4i, which
is needed when you want to access the i-th element of an array in memory. However, as will be
discussed in more detail in Part Three of this book, multiplication and division are more complex and
slower operations than addition and subtraction. Therefore, it is still advisable to compute 4i by two
additions (i + i = 2i and 2i + 2i = 4i) instead of a single multiplication.


Of course, an even more efficient way to compute 4i is through a two-bit left shift of i, provided i is an
unsigned integer (as is usually the case in array indexing). Section 10.5 will discuss shifting of signed
values. The MiniMIPS instruction that shifts the contents of a register to the left by a known
amount is called "shift left logical" (sll). Similarly, "shift right logical" (srl) shifts the contents of a
register to the right. The need for the "logical" qualification will become clear when arithmetic shifts are
discussed in Section 10.5. For now, remember that logical shifts move the bits of a word left or right,
fill the vacated positions with 0s, and discard any bits that move out the other end of the word. The sll
and srl instructions deal with a constant shift amount that is given in the "sh" field of the instruction. For
a variable shift amount specified in a register, the relevant instructions are sllv and srlv.

A left shift of h bits multiplies the unsigned value x by 2^h, provided x is small enough that 2^h × x is
still representable in 32 bits. A right shift of h bits divides an unsigned value by 2^h, discarding the
fractional part and preserving the integer quotient. Figure 6.12 shows the machine representations of
the preceding four shift instructions.
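For example, assuming the unsigned value in $s0 is small enough that no bits are lost:

sll $t0, $s0, 3 # $t0 ← 8 × ($s0), since 2^3 = 8
srl $t1, $s0, 2 # $t1 ← ($s0) ÷ 4, fractional part discarded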

Finally, because unsigned integers are important in many calculations, some MiniMIPS instructions
have variants in which one or both operands are treated as unsigned numbers. These instructions
have the same symbolic names as the corresponding signed instructions, but with "u" appended at the
end.

The machine representations of these instructions are identical to those of the signed versions, except
that the value of the fn field is 1 more for the R-type instructions and the value of the op field is 1 more
for addiu (Figures 5.5, 6.10, and 5.6).

Note that the immediate operand in addiu is actually a 16-bit signed value that is sign-extended to
32 bits before being added to the unsigned operand in the specified register. This is one of the
irregularities that must be remembered. Intuitively, one might expect both addiu operands to be
unsigned values, and this expectation is quite logical. However, because MiniMIPS has no
"subtract immediate unsigned" instruction and there are times when something must be subtracted
from an unsigned value, this seemingly illogical design choice was made.
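For example, the following instruction, a typical use of this design choice, decrements an unsigned counter by adding the sign-extended immediate -1:

addiu $t0, $t0, -1 # $t0 ← ($t0) - 1; no overflow exception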

Chapters 5 and 6 have introduced 37 MiniMIPS instructions. Table 6.2 summarizes these
instructions for review and ease of reference. At this point, you should fully understand everything in
Table 6.2, except for the phrases "with overflow" and "no overflow" in the "Meaning" column and the
following three instructions:

Arithmetic right shifts, and how they differ from logical right shifts, will be discussed in section 10.5.
The system call, including its use to perform input/output operations, will be discussed in section
7.6, and its details will be presented in Table 7.2.

TABLE 6.2 The 40 MiniMIPS instructions discussed in chapters 5-7.


Class Instruction Usage Meaning op fn

Copy Move from Hi mfhi rd rd ← (Hi) 0 16
Move from Lo mflo rd rd ← (Lo) 0 18
Load upper immediate lui rt, imm rt ← (imm, 0x0000) 15

Arithmetic Add add rd, rs, rt rd ← (rs) + (rt); with overflow 0 32
Add unsigned addu rd, rs, rt rd ← (rs) + (rt); no overflow 0 33
Subtract sub rd, rs, rt rd ← (rs) - (rt); with overflow 0 34
Subtract unsigned subu rd, rs, rt rd ← (rs) - (rt); no overflow 0 35
Set less than slt rd, rs, rt rd ← if (rs) < (rt) then 1 else 0 0 42
Multiply mult rs, rt Hi, Lo ← (rs) × (rt) 0 24
Multiply unsigned multu rs, rt Hi, Lo ← (rs) × (rt) 0 25
Divide div rs, rt Hi ← (rs) mod (rt); Lo ← (rs) ÷ (rt) 0 26
Divide unsigned divu rs, rt Hi ← (rs) mod (rt); Lo ← (rs) ÷ (rt) 0 27
Add immediate addi rt, rs, imm rt ← (rs) + imm; with overflow 8
Add immediate unsigned addiu rt, rs, imm rt ← (rs) + imm; no overflow 9
Set less than immediate slti rt, rs, imm rt ← if (rs) < imm then 1 else 0 10

Shift Shift left logical sll rd, rt, sh rd ← (rt) left-shifted by sh 0 0
Shift right logical srl rd, rt, sh rd ← (rt) right-shifted by sh 0 2
Shift right arithmetic sra rd, rt, sh like srl, but sign-extended 0 3
Shift left logical variable sllv rd, rt, rs rd ← (rt) left-shifted by (rs) 0 4
Shift right logical variable srlv rd, rt, rs rd ← (rt) right-shifted by (rs) 0 6
Shift right arithmetic variable srav rd, rt, rs like srlv, but sign-extended 0 7

Logic AND and rd, rs, rt rd ← (rs) ∧ (rt) 0 36
OR or rd, rs, rt rd ← (rs) ∨ (rt) 0 37
XOR xor rd, rs, rt rd ← (rs) ⊕ (rt) 0 38
NOR nor rd, rs, rt rd ← ((rs) ∨ (rt))' 0 39
AND immediate andi rt, rs, imm rt ← (rs) ∧ imm 12
OR immediate ori rt, rs, imm rt ← (rs) ∨ imm 13
XOR immediate xori rt, rs, imm rt ← (rs) ⊕ imm 14

Memory access Load word lw rt, imm(rs) rt ← mem[(rs) + imm] 35
Load byte lb rt, imm(rs) loads byte into rt, sign-extended 32
Load byte unsigned lbu rt, imm(rs) loads byte into rt, zero-extended 36
Store word sw rt, imm(rs) mem[(rs) + imm] ← (rt) 43
Store byte sb rt, imm(rs) stores the low byte of rt 40

Transfer of control Jump j L go to L 2
Jump and link jal L go to L; $31 ← (PC) + 4 3
Jump register jr rs go to (rs) 0 8
Branch on less than 0 bltz rs, L if (rs) < 0 then go to L 1
Branch on equal beq rs, rt, L if (rs) = (rt) then go to L 4
Branch on not equal bne rs, rt, L if (rs) ≠ (rt) then go to L 5
System call syscall see Section 7.6 (Table 7.2) 0 12


*Also included are the arithmetic right shift instructions (sra, srav), which will be discussed in
Section 10.5, and syscall.

PROBLEMS
1) Nested procedure calls

Modify the nested procedure call diagram in Figure 6.2 to correspond to the following changes:

a) The xyz procedure calls a new procedure, def.

b) After calling xyz, the abc procedure calls another procedure, uvw.

c) The abc procedure calls xyz two different times.

d) After calling abc, the main program calls xyz.

2) Choosing one of three integers

Modify the procedure in Example 6.2 (finding the largest of three integers) so that it:

a) Also returns the index (0, 1, or 2) of the largest value in $v1.

b) Finds the smallest of the three values instead of the largest.

c) Finds the middle one of the three values instead of the largest.

3) Divisibility by powers of 2

A binary integer that ends in h consecutive zeros at its right end is divisible by 2^h. Write a MiniMIPS
procedure that accepts a single unsigned integer in register $a0 and returns, in register $v0, the
exponent of the largest power of 2 by which it is divisible (an integer in [0, 32]).

4) Hamming distance between two words

The Hamming distance between two bit strings of the same length is the number of positions
at which the strings have different bit values. For example, the Hamming distance between 1110 and
0101 is 3. Write a procedure that accepts two words in registers $a0 and $a1 and returns their Hamming
distance (an integer in [0, 32]) in register $v0.

5) Meaning of a bit string

a) Decode the instruction, positive integer, and string of four ASCII characters in Figure 6.7.

b) Repeat part a) for the case when the instruction is andi $t1, $s2, 13108.

c) Repeat part a) for the case when the integer is 825 240 373.

d) Repeat part a) for the case when the string is '<US>'.


6) Determining the length of a string

Write a procedure, howlong, that takes a pointer to a null-terminated ASCII string in register $a0 and
returns the length of the string, excluding its null terminator, in register $v0. For example, if the
argument of howlong points to the string 'not very long', the value returned is the integer 13.

7) Unpacking a string of characters

Write a sequence of MiniMIPS instructions (with comments) that separates the four bytes of a
character-string operand held in register $s0 and places them in registers $t0 (least significant byte),
$t1, $t2, and $t3 (most significant byte). If necessary, you can use $t4 and $t5 as temporaries.

8) Recursive procedure to compute n!

a) The function f(n) = n! can be defined recursively as f(n) = n × f(n − 1). Write a recursive procedure,
that is, a procedure that calls itself, to accept n in $a0 and return n! in $v0. Assume n is sufficiently
small that the value of n! fits in the result register.

b) Repeat part a), but assume that the result is a doubleword and is returned in registers $v0 and
$v1.

9) Prefix, postfix, and subsequence sums

Modify the procedure in Example 6.3 (which finds the maximum-sum prefix in a list of integers) so
that it:

a) Identifies the longest prefix with the maximum sum.

b) Identifies the minimum-sum prefix.

c) Identifies the maximum-sum postfix (elements at the end of the list).

d) Identifies the maximum subsequence sum. Hint: Any subsequence sum is the difference
between two prefix sums.

10) Selection sort

Modify the procedure in Example 6.4 (selection sort) so that it:

a) Sorts in descending order.

b) Sorts in ascending order keys that are doublewords.

c) Sorts the key list in ascending order and reorders a list of values associated with the keys so that, if a
key occupies position j when the key list is sorted, its associated value is also at position j in the
reordered list of values.

11) Computing an XOR checksum


Write a procedure that accepts a byte string (base address in $a0, length in $a1) and returns its XOR
checksum, defined as the exclusive OR of all its bytes, in $v0.

12) Modifying the MiniMIPS ISA

Consider the following 32-bit instruction formats, which are very similar to those of MiniMIPS. The six
fields differ in their widths and names, but have meanings similar to those of the corresponding
MiniMIPS fields. Answer each of the following questions under two different assumptions:

1) the register width remains 32 bits, and 2) the registers are 64 bits wide. Assume that the
action field holds 00xx for R-format instructions, 1xxx for the I format, and 01xx for the J format.

a) What is the maximum possible number of different instructions in each of the three classes R, I,
and J?

b) Can all the instructions shown in Table 6.2 be encoded with this new format?

c) Can you think of any desirable new instructions to add as a result of the changes?

13) Calling a procedure

Using the procedure in Example 6.1, write a sequence of MiniMIPS instructions to find the element
with the largest absolute value in a list of integers. Assume that the starting address and the number
of integers in the list are provided in registers $s0 and $s1, respectively.

14) Stack and frame pointers

Draw a schematic diagram of the stack (similar to Figure 6.5) corresponding to the call relationships
shown in Figure 6.2, for each of the following cases, showing the stack and frame pointers as arrows:

a) Within the main program, before calling the abc procedure

b) Within the abc procedure, before calling the xyz procedure

c) Within the xyz procedure

d) Within the abc procedure, after returning from the xyz procedure

e) Within the main program, after returning from the abc procedure


15) ASCII/integer conversion

a) A decimal number is stored in memory in the form of a null-terminated ASCII string. Write a MiniMIPS
procedure that accepts the starting address of the ASCII string in register $a0 and returns the
equivalent integer value in register $v0. Ignore the possibility of overflow and assume that the
number may begin with a digit or a sign (+ or -).

b) Write a MiniMIPS procedure to perform the reverse of the conversion in part a).

16) Emulating shift instructions

Show how the effect of each of the following instructions can be achieved by using only
instructions introduced before Section 6.5:

a) Shift left logical sll.

b) Shift left logical variable sllv.

c) Shift right logical srl. Hint: Use a left rotation (circular shift) by an appropriate amount,
followed by zeroing out some of the bits.

d) Shift right logical variable srlv.

17) Emulating other instructions

a) Identify all instructions in Table 6.2 whose effect can be achieved with no more than two other
instructions. Use at most one register for intermediate values ($at, if necessary).

b) Are there instructions in Table 6.2 whose effects cannot be achieved through other instructions, no
matter how many instructions or registers are allowed to be used?


ASSEMBLY LANGUAGE PROGRAMS


CHAPTER TOPICS

1. Machine and assembler languages

2. Assembler directives

3. Pseudo-instructions

4. Macroinstructions

5. Linking and loading

6. Running assembly programs

To build and run efficient and useful assembly language programs, more than a knowledge of machine
instruction types and formats is needed. On the one hand, the assembler needs certain types of
information about the program and its data that are not evident from the instructions themselves.
On the other hand, convenience dictates the use of certain pseudo-instructions and macros which,
although not in one-to-one correspondence with machine instructions, are easily converted to the
latter by the assembler. To round out these topics, this chapter presents brief descriptions of the
linking and loading processes and of the running of assembly language programs.

1. Machine and assembler languages


Machine instructions are represented as binary bit strings. In the case of MiniMIPS, all instructions
have a uniform width of 32 bits and thus fit in a single word. Chapter 8 will show that
certain machines use variable-width instructions, which leads to a more efficient encoding of the
intended actions. Whether of fixed or variable width, binary machine code, or its equivalent hexa
representation, is inconvenient for humans to deal with. It is for this reason that assembly languages,
which allow the use of symbolic names for instructions and their operands, were developed. In addition,
assemblers accept numbers in a variety of simple, natural representations and convert them to the
required machine formats. Finally, assemblers allow the use of pseudo-instructions and macros, which
serve the same purpose as abbreviations and acronyms in natural languages; that is, they allow more
compact representations that are easier to write, read, and understand.

Chapters 5 and 6 introduced parts of the assembly language for the hypothetical MiniMIPS computer.
Sequences of instructions written in symbolic assembly language are translated into machine language
by a program called an assembler. Multiple program modules that are independently assembled are
linked together, and combined with (previously developed) library routines, to form a complete
executable program that is then loaded into a computer's memory (Figure 7.1). Assembling program
modules separately and then linking them together is beneficial because when one module is modified,
the other parts of the program do not have to be reassembled. Section 7.5 will discuss the work of the
linker and loader in more detail.


Instead of being loaded into a computer's memory, the instructions of a machine language program
can be interpreted by a simulator: a program that examines each instruction and performs its
function, updating variables and data structures that correspond to the registers and other
information-holding parts of the machine. Such simulators are used in the course of designing new
machines for writing and debugging programs before any functional hardware is available.

Most assembly language instructions are in one-to-one correspondence with machine instructions.
The process of translating such an assembly language instruction into the corresponding machine
instruction involves assigning appropriate binary codes to the symbolically specified operations,
registers, memory locations, immediate operands, and so on. Some pseudo-instructions and all macro
instructions correspond to multiple machine instructions. An assembler reads a source file containing
the assembly language program and accompanying information (assembler directives and certain
bookkeeping details) and, in the process of producing the corresponding machine language program:

Maintains a symbol table containing name-address correspondences.

Constructs the text and data segments of the program.

Forms an object file that contains header, text, data, and relocation information.

The assembly process is carried out in two passes. The main function of the first pass is the construction
of the symbol table. A symbol is a string of characters used as an instruction label or as a variable
name. As the instructions are read, the assembler maintains an instruction location counter that
determines the relative position of each instruction in the machine language program, and therefore
the numerical equivalent of the instruction's label, if any. This is done by assuming that the program
will be loaded beginning at address 0 in memory. Since the eventual location of the program
is usually different, the assembler also compiles information about what needs to be done if the
program is loaded beginning at another (arbitrary) address. This relocation information is part of the
file produced by the assembler and will be used by the loader to modify the program according to its
eventual location in memory.


Consider, for example, the assembly language instruction:

test: bne $t0, $s0, done

As the assembler reads this line, it detects the operation symbol "bne", the two register symbols "$t0"
and "$s0", and the instruction labels "test" and "done". The appropriate code for "bne" is read from an
opcode table (which also contains information about the number and types of operands expected),
while the meanings of "$t0" and "$s0" are obtained from another reference table. The "test" symbol
is entered into the symbol table, along with the current contents of the instruction location counter
as its value. The "done" symbol may or may not already be in the symbol table, depending on whether
the branch is backward (toward a previous instruction) or forward. If it is a backward branch, then
the numerical equivalent of "done" is already known and can be substituted. In the case of a
forward branch, an entry is created in the symbol table for "done" with its value left blank; the value
will be filled in during the second pass, after all instruction labels have been assigned numerical
equivalents.

This process of resolving forward references is the main reason why a
second pass is used. It is possible to perform the entire process in a single pass, but this
would require leaving adequate markers for unresolved forward references, so that their values
can be filled in at the end.

Figure 7.2 shows a short assembly language program, its assembled version, and the symbol table that
was created during the assembly process. The program is incomplete and meaningless, but it does
illustrate some of the concepts just discussed.


2. Assembler directives
Assembler directives provide the assembler with information about how to translate the program but
do not themselves lead to the generation of machine instructions. For example, assembler
directives can specify the layout of data in the program's data segment, or define variables
with desired symbolic names and initial values. The assembler reads these directives and takes them
into account when processing the rest of the lines of the program.

By MiniMIPS convention, assembler directives begin with a period to distinguish them from
instructions. The following is a list of the MiniMIPS assembler directives:

The "macro" and "end_macro" directives mark the beginning and end of the definition for a macro
instruction; They will be described in section 7.4 and appear here for listing purposes only. The "text"
and ".data" directives signal the beginning of a program's text and data segments. Therefore, "text"
indicates to the assembler that subsequent lines should be interpreted as instructions. Within the text
segment, instructions and pseudo-instructions (see section 7.3) can be used at will.

In the data segment, data values can be allocated memory space, given symbolic names, and
initialized to desired values. The assembler directive ".byte" is used to define one or more byte-size
data elements with the desired initial values and to assign a symbolic name to the first one. In the example
"tiny: .byte 156, 0x7a", a byte holding 156 is defined, followed by a second byte containing the
hexa value 0x7a (122 decimal), and the first is given the symbolic name "tiny". In the text of the
program, an address of the form tiny($s0) will refer to the first or the second of these bytes when $s0
holds 0 or 1, respectively. The ".word", ".float", and ".double" directives are similar to ".byte",
except that they define words (for example, unsigned or signed integers), short floating-point
numbers, and long floating-point numbers, respectively.

The ".align h" directive is used to force the address of the next data object to be a multiple
of 2h (aligned with a 2h byte boundary). For example, ".align 2" forces the next address to be a multiple
of 22 = 4; therefore, to align a data element with word size with the so-called word boundary. The
".space" directive is used to reserve a specified number of bytes in the data segment, usually to
provide storage space for arrays. For example, "array: .space 600" reserves 600 bytes of space, which
can then be used to store a byte string of length 600, a vector or word length of 150, or a vector of 75
elements of long floating-point values. The type of usage is not specified in the policy itself, but it will
be a function of which values are placed there and which instructions apply to them. Preceding this

102
1MCS1 Computer System Design
Chapters selected from ‘Computer Architecture from Microcomputers to Supercomputers’ by Behrooz Parhami)

directive with ".align 2" or ".align 3" allows you to use the array for words or double words,
respectively. Note that fixes are usually not initialized by directives in the data segment, but rather by
explicit store statements, as part of program execution.

ASCII character strings could be defined using the ".byte" directive. However, to avoid
having to look up and write the numeric ASCII code for each character, and to improve readability, two
special directives are offered. The ".ascii" directive defines a sequence of bytes that hold the ASCII
codes for the characters given in quotation marks. For example, the ASCII codes for "a", "*", and "b"
can be placed in three consecutive bytes, and the string given the symbolic name "str1", through the
directive "str1: .ascii "a*b"". It is common practice to terminate an ASCII string with the special null
symbol (which has the ASCII code 0x00); in this way, the end of the string is easily
recognized. The ".asciiz" directive therefore has the same effect as ".ascii", except that it appends the null
symbol to the string, thus creating a string with one additional symbol at the end.

Finally, the ".global" directive declares one or more names as global, in the sense
that they can be referenced from other files.

Example 7.1 shows the use of some of these directives in the context of a complete assembly language
program. All subsequent programming examples in this chapter will use assembler directives.

Example 7.1: Writing simple assembler directives

Write an assembler directive to accomplish each of the following goals:

a) Store the error message "Warning: The printer has no paper!" in memory.

b) Set up a constant called "size" with the value 4.

c) Set up an integer variable called "width" and initialize it to 4.

d) Set up a constant called "mill" with the value 1 000 000 (one million).

e) Reserve space for an integer vector "vect" of length 250.

Solution:

a) noppr: .asciiz "Warning: The printer has no paper!"

b) size: .byte 4 # small constant fits in one byte

c) width: .word 4 # a byte might be enough, but...

d) mill: .word 1000000 # constant too large for a byte

e) vect: .space 1000 # 250 words = 1000 bytes

For part a), a null-terminated ASCII string is specified because, with the null terminator, the length of
the string need not be known during use; if it is later decided to change the string, only this directive
has to be modified. For part c), in the absence of information about the range of width, it is assigned
a word of storage.


3. Pseudo-instructions
Although any computation or decision can be cast in terms of the simple MiniMIPS instructions
covered thus far, it is sometimes simpler and more natural to use alternative formulations. Pseudo-
instructions allow computations and decisions to be formulated in alternative forms not directly
supported by the hardware. The MiniMIPS assembler takes care of translating these into basic
hardware-supported instructions. For example, MiniMIPS lacks a logical NOT instruction that would
invert all the bits in a word. Although the same effect can be achieved by

nor $s0, $s0, $zero # complement ($s0)

it would be much more natural, and more easily understandable, if one could write:

not $s0 # complement ($s0)

Therefore, "not" is defined as a MiniMIPS pseudo-instruction; it is recognized like an ordinary


instruction by the MiniMIPS assembler and replaced by the equivalent "nor" instruction before
conversion to machine language.

The just-defined "not" pseudo-instruction translates into a single MiniMIPS machine language
instruction. Some pseudo-instructions must be replaced with more than one instruction and may
involve intermediate values that must be stored somewhere. For this reason, register $1 is dedicated to
the assembler's use and is given the symbolic name $at (assembler temporary; see Figure 5.2). Clearly,
the assembler cannot assume the availability of other registers; consequently, it must limit its
intermediate values to $at and, perhaps, the destination register of the pseudo-instruction itself. An
example is the "abs" pseudo-instruction, which places the absolute value of the contents of a source
register in a destination register:

abs $t0, $s0 # put |($s0)| into $t0

The MiniMIPS assembler can translate this pseudo-instruction into the following sequence of four
MiniMIPS instructions:
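(The four-instruction sequence is not reproduced in this copy; the following is a plausible expansion, consistent with the constraints just described, that uses only $at and the destination register:)

add $t0, $s0, $zero # copy ($s0) into $t0
slt $at, $t0, $zero # is the value negative?
beq $at, $zero, +4 # if not, skip the next instruction
sub $t0, $zero, $s0 # otherwise $t0 ← 0 - ($s0)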

Table 7.1 contains a complete list of the pseudo-instructions for the MiniMIPS assembler. The
arithmetic and logic pseudo-instructions are easy to understand. The rotate instructions are like shifts,
except that bits leaving one end of the word are reinserted at the other end. For example, the bit pattern
01000110, when left-rotated by two bits, becomes 00011001; that is, the two bits 01 shifted out from the
far left come to occupy the two least significant positions of the rotated result. The load immediate
pseudo-instruction allows any immediate value to be placed in a register, even if the value does not
fit in 16 bits. The load address pseudo-instruction is quite useful for initializing pointers in registers.
Suppose you want a pointer with which to step through an array. This pointer must be
initialized to point to the first element of the array, which cannot be done with regular MiniMIPS
instructions. The instruction lw $s0, array would place the first element of the array, not its address,
in register $s0, whereas the pseudo-instruction la $s0, array places the numerical equivalent of the
"array" label in $s0. The remaining load/store and branch-related pseudo-instructions are also easily
understood.
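The contrast can be seen side by side (an illustrative pair, assuming "array" is a label in the data segment):

lw $s0, array # $s0 ← contents of the first element of array
la $s0, array # $s0 ← address of the first element of array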


Because defining pseudo-instructions is a good way to practice assembly language programming, the
problems at the end of this chapter postulate other pseudo-instructions and ask you to write
the corresponding MiniMIPS instructions that would be produced by an assembler. Example 7.2
presents, as a model, the machine language equivalents of some of the pseudo-instructions in Table 7.1.
Example 7.3 shows how assembler directives and pseudo-instructions are useful for creating complete
programs.

TABLE 7.1 Pseudo-instructions accepted by the MiniMIPS assembler.

Class Pseudo-instruction Usage Meaning

Copy Move move regd, regs regd ← (regs)
Load address la regd, address regd ← computed address, not its content
Load immediate li regd, anyimm regd ← arbitrary immediate value

Arithmetic Absolute value abs regd, regs regd ← |(regs)|
Negate neg regd, regs regd ← -(regs); with overflow
Multiply (into register) mul regd, reg1, reg2 regd ← (reg1) × (reg2); no overflow
Divide (into register) div regd, reg1, reg2 regd ← (reg1) ÷ (reg2); with overflow
Remainder rem regd, reg1, reg2 regd ← (reg1) mod (reg2)
Set greater than sgt regd, reg1, reg2 regd ← if (reg1) > (reg2) then 1 else 0
Set less or equal sle regd, reg1, reg2 regd ← if (reg1) ≤ (reg2) then 1 else 0
Set greater or equal sge regd, reg1, reg2 regd ← if (reg1) ≥ (reg2) then 1 else 0

Shift Rotate left rol regd, reg1, reg2 regd ← (reg1) rotated left by (reg2)
Rotate right ror regd, reg1, reg2 regd ← (reg1) rotated right by (reg2)

Logic NOT not reg reg ← (reg)'

Memory access Load doubleword ld regd, address loads regd and the next register
Store doubleword sd regd, address stores regd and the next register

Transfer of control Branch less than blt reg1, reg2, L if (reg1) < (reg2) then go to L
Branch greater than bgt reg1, reg2, L if (reg1) > (reg2) then go to L
Branch less or equal ble reg1, reg2, L if (reg1) ≤ (reg2) then go to L
Branch greater or equal bge reg1, reg2, L if (reg1) ≥ (reg2) then go to L

Example 7.2: MiniMIPS pseudo-instructions

For each of the following pseudo-instructions defined in Table 7.1, give the corresponding
instructions produced by the MiniMIPS assembler.

parta: neg $t0, $s0 # $t0 = -($s0); with overflow

partb: rem $t0, $s0, $s1 # $t0 = ($s0) mod ($s1)

partc: li $t0, imm # $t0 = arbitrary immediate value

partd: blt $s0, $s1, label # if ($s0) < ($s1), go to label


Solution:

parta: sub $t0, $zero, $s0 # -($s0) = 0 - ($s0)

partb: div $s0, $s1 # divide; the remainder is in Hi

mfhi $t0 # copy (Hi) into $t0

partc: addi $t0, $zero, imm # if imm fits in 16 bits

partc: lui $t0, upperhalf # if imm requires 32 bits, the assembler

ori $t0, $t0, lowerhalf # must extract its upper and lower halves

partd: slt $at, $s0, $s1 # ($s0) < ($s1)?

bne $at, $zero, label # if so, go to label

Note that, for part c), the machine language equivalent of the pseudo-instruction differs
depending on the size of the immediate operand supplied.

Example 7.3: Forming complete assembly language programs

Add assembler directives, and make other necessary modifications, to the partial program in Example
6.4 (selection sort using a max-finding procedure) to turn it into a complete program,
excluding data input and output.

Solution: The partial program in Example 6.4 assumes a list of input words, with pointers to its first
and last elements available in registers $a0 and $a1. The sorted list will occupy the same memory
locations as the original list (in-place sorting).

.global main # consider "main" a global name

.data # start the program's data segment

size: .word 0 # space for the list size in words

list: .space 4000 # assumes the list has up to 1000 elements

.text # start the program's text segment

main: ... # input size and list

la $a0, list # initialize pointer to the first element

lw $a1, size # put size in $a1

addi $a1, $a1, -1 # offset in words to the last element

sll $a1, $a1, 2 # offset in bytes to the last element

add $a1, $a0, $a1 # initialize pointer to the last element

sort: ... # rest of the program from Example 6.4


done: … # output the sorted list

Note that the complete program is named "main" and includes the "sort" program from Example
6.4, along with its "max" procedure.
6.4, including its "max" procedure.

4. Macro-instructions
A macro instruction (macro for short) is a mechanism for giving a name to a frequently used sequence of
instructions, to avoid having to write out the sequence in full each time. As an analogy, you might write
or type "ECE" in a draft document, instructing the typist or word processing software to replace every
occurrence of "ECE" in the final version with the full name "Department of Electrical and Computer
Engineering". Because the same sequence of instructions may involve different operand specifications
(e.g., registers) each time it is used, a macro has a set of formal parameters that are replaced with
actual parameters during the assembly process. Macros are delimited by special assembler directives:

At this point two natural questions may arise:

1. How is a macro different from a pseudo-instruction?

2. How is a macro different from a procedure?

Pseudo-instructions are incorporated into the design of an assembler and are therefore fixed for the
user, whereas macros are defined by their users. Additionally, a pseudo-instruction looks exactly
like an instruction, while a macro looks more like a procedure in a high-level language; for example, if
you did not know that "move" is not a MiniMIPS machine language instruction, you could not
tell from its appearance. As for the differences from a procedure, at least two jump instructions are
executed to call and return from a procedure, whereas a macro is only an abbreviated notation for a
number of assembly language instructions; after the macro is replaced by its equivalent instructions in
the program, no trace of it remains.

Example 7.4 provides the definition and use of a complete macro. It is instructive to compare Examples
7.4 and 6.2 (the procedure for finding the largest of three integers), both to see the differences between
a macro and a procedure, and to note the effect of pseudo-instructions in making programs easier to
write, read, and understand.

Example 7.4: Macro to find the largest of three values

Suppose you frequently need to determine the largest of three values held in registers and to put the
result in a fourth register. Write a macro mx3r for this purpose, with its parameters being the result
register and the three operand registers.

Solution: The following macro definition uses only pseudo-instructions:


.macro mx3r (m, a1, a2, a3) # macro name and arguments

move m, a1 # assume (a1) is the largest

bge m, a2, +4 # if (a2) is not larger, skip it

move m, a2 # otherwise set m = (a2)

bge m, a3, +4 # if (a3) is not larger, skip it

move m, a3 # otherwise set m = (a3)

.end_macro # macro terminator

If the macro is used as mx3r($t0, $s0, $s4, $s3), the assembler replaces the arguments m, a1, a2,
and a3 in the macro text with $t0, $s0, $s4, and $s3, respectively, resulting in the following seven
instructions in place of the macro (note that the pseudo-instructions in the macro definition have been
replaced with regular instructions):

add $t0, $s0, $zero # assume ($s0) is the largest; $t0 = ($s0)

slt $at, $t0, $s4 # ($t0) < ($s4)?

beq $at, $zero, +4 # if not, skip ($s4)

add $t0, $s4, $zero # otherwise set $t0 = ($s4)

slt $at, $t0, $s3 # ($t0) < ($s3)?

beq $at, $zero, +4 # if not, skip ($s3)

add $t0, $s3, $zero # otherwise set $t0 = ($s3)

Elsewhere in the program, the macro might be used as mx3r($t5, $s2, $v0, $v1), which leads to the same
seven instructions being inserted into the program, but with different register specifications. This is
similar to a procedure call; hence, macro parameters are sometimes called "formal parameters,"
as is common for subroutines or procedures.

5. Linking and loading


A program, whether in assembly language or in a high-level language, typically consists of multiple
modules, often designed by different groups at different times. For example, a software company
designing a new application program may reuse many previously developed procedures as parts of the
new program. Similarly, library routines (for example, those for computing common mathematical
functions or performing useful services such as input/output) are incorporated into programs by
referencing them where they are needed. To avoid reassembling (or recompiling, in the case of high-
level languages) all modules after each small modification of one component, a mechanism is
provided that allows the modules to be built, assembled, and tested separately. Just before these
modules are placed in memory to form an executable program, the parts are linked and the references
between them resolved, using a special linker program.


Each of the modules to be linked has, in a special header section, information about the sizes
of its text (instructions) and data segments. It also has sections containing relocation
information and the symbol table. These additional pieces of information will not be part of the executable
program but are used to enable the linker to perform its task correctly and efficiently. Among other
functions, the linker determines which memory locations each of the modules will occupy and
adjusts (relocates) all addresses within the program to correspond to the assigned locations of the
modules. The combined linked program also carries size and other relevant information for use
by the loader, described shortly.

One of the most important functions of the linker is to ensure that the labels in all modules are
interpreted (resolved) appropriately. If, for example, a jal instruction specifies a symbolic target that is
not defined in the module itself, the other modules being linked are searched to determine whether any
of them has an external symbol that matches the unresolved symbolic name. If none of the modules
resolves the undefined reference, the system's program libraries are consulted to see whether the
programmer intended to use a library routine. If any symbol remains unresolved at the end of this
process, the linking fails and an error is reported to the user. Successful resolution of all such references
by the linker is followed by:

Determination of the locations of the text and data segments in memory.

Evaluation of all data addresses and instruction labels.

Formation of an executable program with no unresolved references.

The output of the linker does not go to main memory, where it could be executed. Rather, it takes
the form of a file that is stored in secondary memory. The loader transfers this file into main memory
for execution. Since the same program may be placed in different memory areas on different runs,
depending on where space is available, the loader is responsible for adjusting all addresses in the
program to correspond to the program's actual location in memory. The linker assumes that the
program will be loaded beginning at location 0 in memory. If, instead, the starting location is L, then all
absolute addresses must be adjusted by L. This process, called relocation, involves adding a constant
amount to all absolute addresses within the instructions as they are loaded into memory. Relative
addresses, on the other hand, do not change; a branch to 25 locations forward, or 12 locations back,
is unaffected by relocation.
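As a hypothetical illustration, if a program linked to start at location 0 contains a jump to the absolute address 0x000000a0 and is then loaded starting at L = 0x00010000, the loader changes the jump target to 0x000000a0 + L = 0x000100a0, whereas a branch in the same program specified as "8 instructions forward" is left untouched.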

The loader, therefore, is responsible for the following:

Determining the program's memory needs from its header.

Copying the text and data of the executable program file into memory.

Modifying (adjusting) addresses, where needed, during copying.

Placing the program's parameters on the stack (as in a procedure call).

Initializing all machine registers, including the stack pointer.

Jumping to a startup routine that calls the program's main routine.

Note that the program is treated as a procedure that is called from an operating system routine. Upon
completion, the program returns control to that same routine, which then executes an exit system call.
This arrangement allows the passing of data and other parameters to the program through the same
stack mechanism used by procedure calls. Initializing the machine registers means setting them to 0, except


for the stack pointer, which is set to point to the top of the stack after all program parameters have
been pushed onto it.

6. Running assembly programs


In modern computing practice, assembly or machine language programs are generated by
compilers. Programmers rarely write directly in assembly language because this is a time-consuming
and error-prone process. However, in certain practical situations (for example, for a program module
that must be optimized for performance), one may still program in assembly language. Of course,
there is also a pedagogical reason: to learn enough about machine language to be able to
understand and appreciate the trade-offs in computer design, you must write at least some simple assembly
language programs. To this end, the SPIM simulator is introduced, which lets you test MiniMIPS
assembly language programs by running them during development, observe the results
obtained, and trace execution when problems arise, thereby facilitating debugging.

Readers who do not intend to design and run MiniMIPS assembly language
programs can ignore the rest of this section.

SPIM, which gets its name from reversing "MIPS", is a simulator for the MIPS R2000/R3000
assembly language. The instruction set accepted by SPIM is larger than the MiniMIPS subset covered
in this book, but this causes no problems in running our programs. The situation is comparable to
that of someone with a limited English vocabulary speaking with someone who is fluent in the
language; the latter will understand what is said, provided the small subset of words is used
correctly. There are three versions of SPIM available for free download:

PCSpim for Windows machines
xspim for Macintosh OS X
spim for Unix or Linux systems

You can download SPIM by visiting http://www.cs.wisc.edu/~larus/spim.html and following the
instructions given there. Please read the copyright and terms-of-use notice at the bottom of the
website before you start using the program.

The following description is based on PCSpim, which was adapted by David A. Carley from
the other two versions developed by Dr. James R. Larus (formerly at the University of Wisconsin,
Madison, and now with Microsoft Research).
Figure 7.3 shows the PCSpim user interface. Besides the menu and toolbars at the top and the status
bar at the bottom of the PCSpim window, there are four panes that display information about the
assembly program and its execution.
The register contents are displayed in the top pane. The text segment of the program is displayed in the
second pane, with each line containing:
[hex memory address] [hex instruction content] [opcode and parameters]


The program's data segment is displayed in the third pane; each line contains a hex memory
address and the contents of four words (16 bytes) in hex format. Messages produced by SPIM
are displayed in the fourth pane.

The simulator has certain default settings for its adjustable features. These are adequate for the
purposes of this text, so how to modify them will not be discussed. Documentation for such
modifications can be downloaded together with the simulator.

To load and run a MiniMIPS assembly language program on PCSpim, you must first prepare the
program in a text file. You can then open the file via the PCSpim File menu, which causes the
text and data segments of the program to appear in the two panes in the middle of the PCSpim window. You
can then invoke various simulator functions, such as Go to start running the program or Break to stop
it, using the PCSpim Simulator menu.

Input/output and interrupts will be studied in detail in Part Four of the book. A brief overview of
these concepts is provided here to allow the introduction of PCSpim mechanisms for data input and
output and the meaning of the "overflow" and "no overflow" designations for some instructions
in Tables 5.1 and 6.2. The instructions for data input and output will allow you to write complete
programs that include acquisition of the necessary parameters (e.g., input from a keyboard)
and presentation of the computed results (e.g., output to a monitor screen).

Like many modern computers, MiniMIPS has memory-mapped I/O. This means that certain data
buffers and status registers within I/O devices, which are required to initiate and control the I/O
process, appear to the program as memory locations. Therefore, if you know the specific memory locations


allocated to an input device such as the keyboard, you can query the status of the device and
transfer data from its data buffer to one of the general registers with the use of the load word
instruction. However, this level of detail in I/O operations is only of interest in I/O routines known as
device handlers. Some aspects of such I/O operations will be discussed in Chapter 22. At the assembly
language programming level, I/O is usually handled through system calls, meaning that the program
requests operating system services to perform the desired input or output operation.

The default I/O mode of PCSpim uses a system call instruction with the symbolic opcode
syscall. The machine language instruction for syscall is of type R, with all its fields set to 0, except the
function field, which contains 12. A system call is characterized by an integer code in [1, 10] that is
placed in register $v0 before executing the syscall instruction. Table 7.2 contains the functions associated with
these ten system calls.

Input and output operations cause PCSpim to open a new window called Console. Any output from
the program appears in the Console; all input to the program must likewise be entered in this
window. The allocate-memory system call (syscall with 9 in $v0) returns a pointer to a block of
memory containing n additional bytes, where the desired number n is provided in register $a0. The
exit-from-program system call (syscall with 10 in $v0) causes program execution to terminate.

An arithmetic operation, such as addition, can produce a result that is too large to fit in the specified
destination register. This event is called overflow and leads to an invalid sum. Obviously, if the
computation is continued with such a value, there is a high probability that the program will produce
meaningless results. To avoid this, the MiniMIPS hardware recognizes that an overflow has occurred and
calls a special operating system routine known as an interrupt handler. This interruption of the
normal flow of the program, and the consequent transfer of control to the operating system, is referred
to as an interrupt or exception. The interrupt handler can follow several courses of action,
depending on the nature or cause of the interruption; these will be discussed in Chapter 24. Until
then, it is enough to know that MiniMIPS assumes that the interrupt handler starts at address
0x80000080 in memory and will transfer control to it if an interrupt occurs. The transfer of control
occurs through an unconditional jump (much like jal), and the return address, that is, the value in the program
counter at the time of the interruption, is automatically saved in a special register called the exception
program counter, or EPC, in coprocessor 0 (Figure 5.1).

Some instructions are given the designation "no overflow" because they are typically used when
overflow is impossible or undesirable. For example, one of the main uses of arithmetic with unsigned
numbers is address calculation. Adding a base address to an offset in bytes (four times the
offset in words) produces the in-memory address corresponding to a new array element. Under
normal conditions, this calculated address will not exceed the limits of the computer's address space.
If it does, the hardware includes other mechanisms for detecting an invalid address, and an overflow
indication from the instruction execution unit is not required.

From the standpoint of writing programs in assembly language, the possibility of overflow, and the
associated interrupt, can be ignored, provided care is taken that results and intermediate
values never become too large in magnitude. This is easy to do for simple calculations and for the programs
discussed here.


TABLE 7.2 Input/output and control functions of syscall in PCSpim.

PROBLEMS

7.1 Assembler directives

Write assembler directives to accomplish each of the following objectives:

a) Define four error messages, each exactly 32 characters long, so that the program can
print the ith error message (0 ≤ i ≤ 3) based on a numerical index held in a register. Tip: the index in the
register can be shifted left by five bits before being used as an offset.

b) Define the integer constants least and most, with values 25 and 570, against which the
validity of the input data within the program can be checked.

c) Set aside space for a string of characters of length not exceeding 256 symbols, including a null
terminator, if any.

d) Set aside space for a 20 × 20 integer array, to be stored in memory in row-major
order.

e) Set aside space for an image with one million pixels, each specified by an eight-bit color code and
an eight-bit brightness code.

7.2 Definition of pseudo-instructions


Of the 20 pseudo-instructions in Table 7.1, two (not, abs) were discussed at the beginning of Section
7.3 and four more (neg, rem, li, blt) were defined in Example 7.2. Provide equivalent MiniMIPS
instructions or sequences of instructions for the remaining 14 pseudo-instructions; these constitute
parts a) through n) of the problem, in order of appearance in Table 7.1.

7.3 Additional pseudo-instructions

The following are some additional pseudo-instructions one might define for MiniMIPS. In each case,
provide an equivalent MiniMIPS instruction or sequence of instructions with the desired effect. In the
last part, mulacc is short for "multiply-accumulate".

parta: beqz reg, L # if (reg) = 0, go to L

partb: bnez reg, L # if (reg) ≠ 0, go to L

partc: bgtz reg, L # if (reg) > 0, go to L

partd: blez reg, L # if (reg) ≤ 0, go to L

parte: bgez reg, L # if (reg) ≥ 0, go to L

partf: double regd, regs # regd = 2 × (regs)

partg: triple regd, regs # regd = 3 × (regs)

parth: mulacc regd, reg1, reg2 # regd = (regd) + (reg1) × (reg2)

7.4 Complex pseudo-instructions

The pseudo-instructions of problems 7.2 and 7.3 resemble regular MiniMIPS assembly instructions. This
problem proposes a number of pseudo-instructions that perform more complex functions, do not
resemble ordinary instructions, and might not be implementable as single instructions within the format constraints
of MiniMIPS instructions. The following pseudo-instructions perform the functions "increment memory
word", "fetch and add", and "fetch and add immediate". In each case, provide an equivalent
MiniMIPS sequence of instructions with the desired effect.


parta: incmem reg, imm # mem[(reg) + imm] = mem[(reg) + imm] + 1

partb: ftcha reg1, reg2, imm # mem[(reg1) + imm] = mem[(reg1) + imm] + (reg2)

partc: ftchai reg, imm1, imm2 # mem[(reg) + imm1] = mem[(reg) + imm1] + imm2

7.5 Input and output

a) Convert the selection sort program of Example 7.3 into a program that receives the list of
integers to be sorted as input and displays the sorted list as output.

b) Run the program of part a) on the SPIM simulator, using an input list of at least 20 numbers.

7.6 New pseudo-instructions

Propose at least two useful pseudo-instructions for MiniMIPS that are not listed in Table 7.1. Why do
you think your pseudo-instructions are useful? How would you convert your pseudo-instructions into
equivalent MiniMIPS instructions or sequences of instructions?

7.7 How to find the nth Fibonacci number

Write a complete MiniMIPS program that accepts n as input and produces the nth Fibonacci number
as output. Fibonacci numbers are defined recursively by the formula Fn = Fn−1 + Fn−2, with F0 = 0
and F1 = 1.

a) Write your program with a recursive procedure.

b) Write your program without recursion.

c) Compare the programs of parts a) and b) and discuss.

7.8 Square root program

Write a complete MiniMIPS program that accepts an integer x as input and produces ⌊x^(1/2)⌋ as output.
The program should print an appropriate message if the input is negative. Since the square root of a
31-bit binary number is no greater than 46 340, a binary search strategy over the interval [0, 65 536]
can identify the square root in 16 or fewer iterations. In binary search, the midpoint
(a + b)/2 of the search interval [a, b] is determined and the search is restricted to [a, (a + b)/2] or [(a +
b)/2, b], depending on the result of a comparison.

7.9 How to find max and min

Write a complete MiniMIPS program that accepts a sequence of integers as input and, after
receiving each new input value, displays the largest and smallest integers seen so far. An input of 0
indicates the end of the input values and is not an input value itself. Note that you do not need to keep
all the integers in memory.


7.10 Symbol table

a) Show the symbol table constructed by the MiniMIPS assembler for the program of Example 7.3.
Ignore the input and output parts.

b) Repeat part a), this time assuming that adequate input and output operations have been included, in
accordance with problem 7.5.

7.11 Check Writing Program

A company's checks must be issued with the amount written using words substituted for decimal digits
(for example, "five two seven dollars and five zero cents") to make counterfeiting more difficult. Write
a complete MiniMIPS program that reads a non-negative integer, representing an amount in cents, as
input and produces the equivalent amount in words as output, as suggested by the $527.50 example
for the integer input 52 750.

7.12 Triangle formation

Write a complete MiniMIPS program to read three non-negative integers, presented in arbitrary order,
as inputs and determine if they can form the sides of:

a) A triangle

b) A right triangle

c) An acute triangle

d) An obtuse triangle

7.13 Integer sequences

Write a complete MiniMIPS program to read three non-negative integers, presented in arbitrary order,
as inputs and determine if they can form consecutive terms of:

a) A Fibonacci series (in which each term is the sum of the two preceding terms)

b) An arithmetic progression

c) A geometric progression

7.14 Factorization and primality test

Write a complete MiniMIPS program to accept a non-negative integer x in the range [0, 1 000 000] as
input and display its prime factors, from smallest to largest. For example, the program output for x = 89
should be "89" (indicating that x is a prime number) and its output for x = 126 should be "2, 3, 3, 7".


7.15 Decimal to hexadecimal conversion

Write a complete MiniMIPS program to read an unsigned integer with up to six decimal digits as
input and render it in hexadecimal form as output. Note that because the input integer
fits into a single machine word, decimal-to-binary conversion happens as part of the input process.
All you need to do is derive the hexadecimal representation of the binary word.

7.16 Substitution cipher

A substitution cipher changes a plaintext message into ciphertext by replacing each of the 26 letters
of the English alphabet with another letter according to a table of 26 entries. For simplicity, plaintext
and ciphertext are assumed to contain no spaces, numerals, punctuation marks, or special symbols. Write
a complete MiniMIPS program that reads a string of 26 symbols defining the substitution table (the
first symbol is the substitute for a, the second for b, etc.) and then repeatedly requests plaintext
inputs, producing the equivalent ciphertext as output. Each plaintext entry ends with a period.
Entering any symbol other than a lowercase letter or a period terminates program execution.


UNIT 3
NUMBER REPRESENTATION
CHAPTER TOPICS

1 Positional number systems

2 Digit Sets and Encodings

3 Number-Radix conversion

4 Signed integers

5 Fixed point numbers

6 Floating-point numbers

The representation of numbers is perhaps the most important issue in computer arithmetic. How
numbers are represented affects the compatibility between machines and their computational results
and influences the implementation cost and latency of arithmetic circuits. This chapter reviews
the most important methods for representing integers (signed magnitude and 2's complement) and
real numbers (the standard ANSI/IEEE floating-point format). You will also learn about other methods of
number representation, number-radix conversion, and binary encoding of arbitrary digit sets.

1. Positional number systems


When thinking of numbers, it is usually the natural numbers that come to mind first: the numbers that
sequence book or calendar pages, mark watch faces, flash on stadium scoreboards, and guide home
deliveries. The set {0, 1, 2, 3, ...} of natural numbers, also called unsigned whole numbers or unsigned
integers, forms the basis of arithmetic. Four thousand years ago, the Babylonians knew about natural
numbers and were fluent in arithmetic. Since then, representations of natural numbers have advanced
in parallel with the evolution of language. Ancient civilizations used rods and beads to record inventories
or accounts. When the need for larger numbers arose, the idea of grouping rods or beads simplified
counting and comparisons. Over time, objects of different shapes or colors were used to denote such
groups, leading to more compact representations.

Numbers must be differentiated from their representations, sometimes called numerals. For example,
the number "twenty-seven" can be represented in different ways using various numerals or numbering
systems (for example, tally marks, the Roman numeral XXVII, or the radix-2 numeral 11011).
However, the difference between numbers and numerals is not always observed, and "decimal numbers"
is often used instead of "decimal numerals" to refer to a representation in radix 10.


Radices other than 10 have also appeared over time. The Babylonians used radix-60 numbers, which
make dealing with time easy. Radices 12 (duodecimal) and 5 (quinary) have also been used. Radix-2
(binary) numbers became popular with the emergence of computers because of their use of
binary digits, or bits, which have only two possible values, 0 and 1, compatible with the behavior of
electronic signals. Radix-3 (ternary) numbers were given serious consideration in the early stages of digital
computer development, but binary numbers eventually won out. Radix-8 (octal) and radix-16
(hexadecimal) numbers are used as shorthand notations for binary numbers. For example, a 24-bit binary
number can be written as an eight-digit octal or a six-digit hexadecimal number by taking the bits
in groups of three and four, respectively.

In a general positional number system of radix r, with a fixed word width of k, a number x is represented
by a string of k digits xi, with 0 ≤ xi ≤ r − 1, and has the value

x = (xk−1 xk−2 ··· x1 x0)r = Ʃ xi r^i (with the sum taken over i from 0 to k − 1)

For example, when using radix 2:

x = (xk−1 xk−2 ··· x1 x0)two = xk−1 × 2^(k−1) + ··· + x1 × 2 + x0

In a radix-r number system with k digits, the natural numbers from 0 to r^k − 1 can be represented.
Conversely, given a desired representation range [0, P], the required number k of digits in radix r is
obtained from the equation

k = ⌈logr(P + 1)⌉

For example, the representation of the decimal number 3125 requires 12 bits in radix 2, six digits in
radix 5, and four digits in radix 8.

The finite nature of numerical encodings in computers has important implications. Figure 9.1 shows
the four-bit binary encodings of the numbers 0 through 15, arranged on a cogwheel. This encoding cannot
represent the numbers 16 and above (or −1 and below). It is convenient to view the arithmetic operations
of addition and subtraction as rotations of the cogwheel of Figure 9.1, counterclockwise and clockwise,
respectively. For example, if the wheel is set so that the vertical arrow points to the number 3 and is then
turned counterclockwise by four notches to add the number 4, the arrow points to the correct sum 7. If
the same procedure is tried with the numbers 9 and 12, the wheel wraps around past 0 and the
arrow points to 5, which is 16 units smaller than the correct sum 21. The result obtained is therefore
the modulo-16 sum of 9 and 12, and an overflow has occurred. Occasional overflow is
inevitable in any finite number representation system.
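
The cogwheel's wraparound is just arithmetic modulo 16, as the following small Python check
illustrates:

    # Four-bit results wrap around modulo 16, like the cogwheel.
    print((3 + 4) % 16)     # 7: correct sum, no wraparound
    print((9 + 12) % 16)    # 5: the wheel passes 0; the true sum 21 is lost
    print((9 - 12) % 16)    # 13: the pattern representing the true result -3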

A similar situation arises in the subtraction 9 − 12, because the correct result, −3, is not representable in
the encoding of Figure 9.1. The cogwheel interpretation in this case is to set the wheel so that the
arrow points to 9 and then turn it clockwise by 12 notches, which puts the arrow at the number 13. This
result, which is 16 units too large, again represents the modulo-16 difference of 9 and 12. One might
say that an underflow has occurred here; however, in computer arithmetic this term is
reserved for numbers too small in magnitude to be distinguishable from 0 (Section 9.6). When a result
falls beyond the lower end of the representable range, it is still said that an overflow has occurred. Figure 9.2
shows the overflow regions of a finite number representation system.


Example 9.1: Fixed-base positional numeral systems

For each of the following conventional radix-r number representation systems, using digit values
from 0 to r − 1, determine the largest number max that is representable with the indicated number k
of digits, and the minimum number K of digits required to represent all natural numbers less than one
million.

a) r = 2, k = 16
b) r = 3, k = 10
c) r = 8, k = 6
d) r = 10, k = 8

Solution: The largest natural number to represent for the second part of the question is P = 999 999.


a) max = r^k − 1 = 2^16 − 1 = 65 535; K = ⌈logr(P + 1)⌉ = ⌈log2(10^6)⌉ = 20 (since 2^19 < 10^6)
b) max = 3^10 − 1 = 59 048; K = ⌈log3(10^6)⌉ = 13 (since 3^12 < 10^6)
c) max = 8^6 − 1 = 262 143; K = ⌈log8(10^6)⌉ = 7 (since 8^6 < 10^6)
d) max = 10^8 − 1 = 99 999 999; K = ⌈log10(10^6)⌉ = 6
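
These values are easy to verify mechanically. The following Python sketch uses exact integer
arithmetic (rather than floating-point logarithms, which can be off by one near exact powers) to
compute max and K:

    # max is r**k - 1; K is the smallest digit count with r**K >= P + 1.
    def digits_needed(r, P):
        K, reach = 0, 1              # reach holds r**K
        while reach < P + 1:
            K, reach = K + 1, reach * r
        return K

    for r, k in [(2, 16), (3, 10), (8, 6), (10, 8)]:
        print(r, k, r**k - 1, digits_needed(r, 999999))
    # max: 65535, 59048, 262143, 99999999; K: 20, 13, 7, 6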

Example 9.2: Overflow in integer arithmetic

For each of the following conventional systems of numerical representation in base r, determine
whether evaluating the arithmetic expression shown leads to overflow. All operands are given in base
10.

a) r = 2, k = 16; 10^2 × 10^3
b) r = 3, k = 10; 15 000 + 20 000 + 25 000
c) r = 8, k = 6; 555 × 444
d) r = 10, k = 8; 3^17 − 3^16

Solution: It is not necessary to represent the operands in the specified number systems;
rather, the values found in the course of evaluating each expression are compared against the maximum
representable values derived in Example 9.1.

a) The result 10^5 is greater than max = 65 535, so overflow will occur.

b) The result 60 000 is greater than max = 59 048, so overflow will occur.

c) The result 246 420 is no greater than max = 262 143, so no overflow will occur.

d) The final result 86 093 442 is not greater than max = 99 999 999. However, if the expression is
evaluated by first computing 3^17 = 129 140 163 and then subtracting 3^16 from it, overflow is encountered
and the computation cannot be completed correctly. Rewriting the expression as 3^16 × (3 − 1) eliminates
the possibility of this unnecessary overflow.

2. Sets of digits and encodings


The digits of a binary number can be represented directly as binary signals, which can then be
manipulated by logic circuits. Radices greater than 2 require digit sets with more than two
values. Such digit sets must be encoded as binary strings to allow their storage and processing
within conventional two-valued logic circuits. A digit set with r values, as needed in radix r, requires at
least b bits per digit for encoding, where

b = ⌈log2 r⌉

Decimal digits in [0, 9] therefore require an encoding that is at least four bits wide. The binary-coded
decimal (BCD) representation is based on the four-bit binary representation of the radix-10 digits. Two
BCD digits can be packed into an eight-bit byte. Such packed decimal representations are common
in calculators whose circuits are designed for decimal arithmetic. Many computers provide
special instructions to facilitate decimal arithmetic on packed BCD numbers; MiniMIPS,
however, has no such instruction.
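
Packing and unpacking BCD digits requires only shifts and masks, as this small Python sketch
(function names are illustrative) shows:

    # Pack two BCD digits (each 0-9) into one byte and unpack them again.
    def pack_bcd(high, low):
        return (high << 4) | low            # e.g., digits 2 and 7 -> 0x27

    def unpack_bcd(byte):
        return (byte >> 4) & 0xF, byte & 0xF

    b = pack_bcd(2, 7)
    print(hex(b), unpack_bcd(b))            # 0x27 (2, 7)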

Of course, binary signals can be used to encode any finite set of symbols, not just digits. Such binary
encodings are commonly used for input/output and data transfers. The American Standard Code for


Information Interchange (ASCII), shown in Table 6.1, is one such convention; it represents uppercase
and lowercase letters, numerals, punctuation marks, and other symbols in one byte. For example, the
eight-bit ASCII codes for the ten decimal digits are of the form 0011xxxx, where the "xxxx" part is
identical to the BCD code discussed earlier. ASCII digits take twice as much space as BCD digits and
are not used in arithmetic units. Even less compact than ASCII is 16-bit Unicode, which can
accommodate symbols from many different languages.

Hexadecimal numbers, with r = 16, use digits in [0, 15], written as 0–9, "a"
for 10, "b" for 11, ..., and "f" for 15. It has already been seen that the MiniMIPS assembly notation for
hex numbers begins with "0x" followed by the hex digits. Such numbers can be viewed as
a shorthand binary notation in which each hex digit stands for a block of four bits.

The use of the digit values 0 to r − 1 in radix r is only a convention. One could use more
than r digit values (for example, the digit values −2 to 2 in radix 4) or r digit values that do not
begin with 0 (for example, the digit set {−1, 0, 1} in radix 3). In the first case, the resulting number
system possesses redundancy, in the sense that some numbers have multiple representations. The
following examples illustrate some of the possibilities.

Example 9.3: Symmetric ternary numbers

If ternary computers had won out over binary ones (with flip-flops replaced by flip-flap-flops), radix-3
arithmetic would be in common use today. The conventional digit set in radix 3 is {0, 1, 2}. However,
{−1, 0, 1} can also be used, which is an example of an unconventional digit set. Consider such a five-
digit symmetric ternary number system.

a) What is the range of representable values? (That is, find the bounds max− and max+.)

b) Represent 35 and −74 as five-digit symmetric ternary numbers.

c) Formulate an algorithm for adding symmetric ternary numbers.

d) Apply the addition algorithm in part c) to find the sum of 35 and -74.

Solution: The digit value −1 is written below as −1; note that this denotes a single digit, not a subtraction.

a) max+ = (1 1 1 1 1)three = 3^4 + 3^3 + 3^2 + 3 + 1 = 121; max− = −max+ = −121

b) To express an integer as a symmetric ternary number, it must be decomposed into a sum of positive
and negative powers of 3. Therefore: 35 = 27 + 9 − 1 = (0 1 1 0 −1)three and −74 = −81 + 9 − 3
+ 1 = (−1 0 1 −1 1)three

c) The sum of two digits in {−1, 0, 1} ranges from −2 to 2. Because 2 = 3 − 1 and −2 = −3 + 1, these values
can be rewritten as valid digits accompanied by a carry of 1 or −1 (a borrow), worth 3 or −3 units,
respectively, sent to the next higher position. When the incoming carry is included, the position sum
ranges from −3 to 3, which can still be handled in the same way.

d) 35 + (−74) = (0 1 1 0 −1)three + (−1 0 1 −1 1)three = (0 −1 −1 −1 0)three = −39. In the addition process, positions
0 and 1 produce no carry, while positions 2 and 3 each produce a carry of 1.
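
The decomposition of part b) can be mechanized by repeated division, rewriting a remainder of 2
as the digit −1 plus a carry of 1 into the next position. The following Python sketch (function name
and digit ordering are illustrative choices) reproduces both representations:

    # Convert an integer to five symmetric ternary digits, using the
    # rewrite 2 = 3 - 1: remainder 2 becomes digit -1 plus a carry of 1.
    def to_symmetric_ternary(n, width=5):
        digits = []
        while n != 0:
            r = n % 3               # Python yields r in {0, 1, 2}
            n //= 3
            if r == 2:
                r = -1
                n += 1              # carry into the next higher position
            digits.append(r)
        digits += [0] * (width - len(digits))
        return digits[::-1]         # most significant digit first

    print(to_symmetric_ternary(35))     # [0, 1, 1, 0, -1]
    print(to_symmetric_ternary(-74))    # [-1, 0, 1, -1, 1]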


Example 9.4: Carry-save numbers

Instead of the conventional digit set {0, 1} for binary numbers, the redundant digit set
{0, 1, 2} can be used. The result is called a carry-save number. Consider a five-digit carry-save
number system in which each digit is encoded in two bits: both bits are 1 for the digit value 2, one bit
is 1 for the digit value 1, and neither is 1 for the digit value 0. Ignore overflow out of the leftmost
position and assume that all results are representable in five digits.

a) What is the range of representable values? (That is, find the bounds max− and max+.)

b) Represent 22 and 45 as five-digit carry-save numbers.

c) Show that a five-bit binary number can be added to a five-digit carry-save number using only
bitwise operations. That is, because there is no carry propagation, the addition would take the same
time even if the operands were 64 digits wide instead of 5.

d) Show how, using two carry-free addition steps of the type derived in part c), two five-digit
carry-save numbers can be added.

Solution:

a) max+ = (2 2 2 2 2)two = 2 × (2^4 + 2^3 + 2^2 + 2 + 1) = 62; max− = 0

b) Due to redundancy, multiple representations may exist. Only one representation is given for each
integer:

22 = (0 2 1 1 0)two and 45 = (2 1 1 0 1)two

c) The process is best represented graphically, as in Figure 9.3a. In each position there are three bits: two
from the carry-save number and one from the binary number. The sum of these bits is a two-bit number
consisting of a sum bit and a carry bit, shown schematically connected to each other by
a line. Adding three bits x, y, and z to get the sum bit s and carry bit c is quite simple and leads to a simple
logic circuit (a binary full adder): s = x ⊕ y ⊕ z and c = xy ∨ yz ∨ zx.

d) View the second carry-save number as two binary numbers. Add one of these to the first
carry-save number using the procedure of part c) to get a new carry-save number. Then add

the second binary number to the carry-save result just obtained. Again, the latency of this two-step
process is independent of the width of the numbers involved. Figure 9.3b contains a graphical
representation of this process.
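
The carry-free step of part c) amounts to applying the full-adder equations to every bit position
independently, which Python's bitwise operators do all at once. A sketch (word width and overflow
out of the top position are ignored here):

    # One carry-save step: compress x, y, z into s and c with
    # x + y + z = s + c, using the full-adder equations bitwise.
    def carry_save(x, y, z):
        s = x ^ y ^ z                            # sum bits
        c = ((x & y) | (y & z) | (z & x)) << 1   # carry bits, one place left
        return s, c

    s, c = carry_save(22, 13, 9)
    print(s + c, 22 + 13 + 9)                    # both print 44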

In addition to the use of unconventional and redundant digit sets, other variations of positional
number representation have been proposed, and occasional applications have been found for them. For
example, the radix r need not be positive, an integer, or even a real number. Numbers with a negative radix
(known as negabinary numbers for radix −2), representations with irrational radices (such as √2), and number
systems with a complex radix (for example, r = 2j, where j = √−1) are all feasible [Parh00]. However, the
discussion of such number systems falls outside the scope of this book.

3. Number-Radix conversion
Given a number x represented in radix r, its representation in radix R can be obtained in one of two ways. If
arithmetic is to be performed in the new radix R, we evaluate a polynomial in r whose coefficients
are the digits xi. This corresponds to the first equation of Section 9.1 and can be performed more
efficiently with the use of Horner's rule, which involves alternating steps of multiplication by r and
addition:

x = (···((xk−1 × r + xk−2) × r + xk−3) × r + ··· ) × r + x0

This method is suitable for the manual conversion of numbers from an arbitrary radix r to radix 10, given
the relative ease with which we can perform radix-10 arithmetic.

To perform the radix conversion using arithmetic in the old radix r, the
number x is repeatedly divided by the new radix R, keeping track of the remainder at each
step. These remainders correspond to the digits Xi in radix R, starting from X0. For example, the decimal
number 19 is converted to radix 3 as follows:

19 divided by 3 produces 6 with remainder 1

6 divided by 3 produces 2 with remainder 0

2 divided by 3 produces 0 with remainder 2

Reading the computed remainders from bottom to top, we find 19 = (201)three. Using the same process,
19 can be converted to radix 5 to get 19 = (34)five.
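
Both conversion directions can be captured in a few lines of Python (digit lists are written most
significant digit first; the function names are illustrative):

    # Horner's rule: digit string in radix r -> value.
    def from_digits(digits, r):
        value = 0
        for d in digits:
            value = value * r + d     # multiply by r, then add the next digit
        return value

    # Repeated division: value -> digit string in radix R
    # (remainders are produced least significant digit first).
    def to_digits(x, R):
        digits = []
        while x > 0:
            digits.append(x % R)
            x //= R
        return digits[::-1] or [0]

    print(from_digits([2, 0, 1], 3))  # 19
    print(to_digits(19, 3))           # [2, 0, 1]
    print(to_digits(19, 5))           # [3, 4]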

Example 9.5: Conversion from binary (or hex) to decimal

Find the decimal equivalent of the binary number (1 0 1 1 0 1 0)two and outline how hexadecimal
numbers can be converted to decimal.

Solution: Each 1 in the binary number corresponds to a power of 2, and the sum of these powers
produces the equivalent decimal value:

(1 0 1 1 0 1 0)two = 2^6 + 2^4 + 2^3 + 2^1 = 64 + 16 + 8 + 2 = 90


The same answer can be arrived at with the use of Horner's rule, reading the digits from left to right:
0 × 2 + 1 = 1, 1 × 2 + 0 = 2, 2 × 2 + 1 = 5, 5 × 2 + 1 = 11, 11 × 2 + 0 = 22, 22 × 2 + 1 = 45, 45 × 2 + 0 = 90.

Hexadecimal-to-binary conversion is simple, because it involves replacing each hex digit with its four-
bit binary equivalent; a K-digit hex number becomes a 4K-bit binary number. Therefore, a number can be
converted from hex to decimal in two steps: 1) hex to binary and 2) binary to decimal.

Example 9.6: Conversion from decimal to binary (or hex)

Find the eight-bit binary equivalent of (157)ten and outline how the hexadecimal equivalent of a
decimal number is derived.

Solution: To convert (157)ten to binary, divide by 2 repeatedly, noting the remainders.
Figure 9.4 shows the justification for this process.

157 divided by 2 produces 78 with remainder 1

78 divided by 2 produces 39 with remainder 0

39 divided by 2 produces 19 with remainder 1

19 divided by 2 produces 9 with remainder 1

9 divided by 2 produces 4 with remainder 1

4 divided by 2 produces 2 with remainder 0

2 divided by 2 produces 1 with remainder 0

1 divided by 2 produces 0 with remainder 1

By reading the remainders from bottom to top, the desired binary equivalent (1 0 0 1 1 1 0 1)two is
obtained. Figure 9.4 shows why the steps of the radix-2 conversion process work. Once the binary
equivalent of a number has been obtained, its hex equivalent is formed by taking the bits in groups of
four, starting with the least significant bit, and rewriting each group as a hex digit. In our example,
the four-bit groups are 1001 and 1101, leading to the hex equivalent (9D)hex.


4. Signed integers
The set {..., −3, −2, −1, 0, 1, 2, 3, ...} of integers is also referred to as signed or directed whole
numbers. The most direct representation of signed integers consists of attaching a sign bit to any desired
encoding of natural numbers, leading to the signed-magnitude representation. The standard
convention is to use 0 for positive and 1 for negative and to attach the sign bit to the left end of the
magnitude. Here are some examples:

+27 in eight-bit signed-magnitude binary code: 0 0011011

−27 in eight-bit signed-magnitude binary code: 1 0011011

−27 in two-digit decimal code with BCD digits: 1 0010 0111

Another option for encoding signed integers in the range [−N, P] is the biased representation. If the
fixed positive value N (the bias) is added to all numbers in the desired range, unsigned integers
in the range [0, P + N] result. Any method for representing natural numbers in [0, P + N] can then be used
to represent the original signed integers in [−N, P]. This type of biased representation finds only limited
application, in encoding the exponents of floating-point numbers (Section 9.6).

The most common machine encoding of signed integers, however, is the 2's-complement representation.
In the k-bit 2's-complement format, a negative value −x, with x > 0, is encoded as the
unsigned number 2^k − x. Figure 9.5 shows the encodings of positive and negative integers in the four-bit
2's-complement format. Note that the positive integers 0 to 7 (or 2^(k−1) − 1, in general) have their
standard binary encodings, whereas the negative values −1 to −8 (or −2^(k−1), in general) are transformed to
unsigned values by adding 16 (or 2^k, in general) to them. The range of representable values in the k-bit
2's-complement format is therefore [−2^(k−1), 2^(k−1) − 1]. Two important properties of 2's-complement
numbers should be highlighted. First, the leftmost bit of the representation acts as the sign bit (0 for positive values, 1 for


negative ones). Second, the value represented by a particular bit pattern can be derived without
following different procedures for negative and positive numbers. Polynomial evaluation or Horner's
rule is used, just as for unsigned integers, except that the sign bit is taken to have negative
weight. Here are two examples:

(01011)2's complement = (−0 × 2^4) + (1 × 2^3) + (0 × 2^2) + (1 × 2^1) + (1 × 2^0) = +11

(11011)2's complement = (−1 × 2^4) + (1 × 2^3) + (0 × 2^2) + (1 × 2^1) + (1 × 2^0) = −5

The reason for the popularity of the 2's-complement representation can be understood from Figure 9.5.
Whereas in signed-magnitude representation adding numbers of like and unlike signs
requires two different operations (addition versus subtraction), both are performed in the same way
with the 2's-complement representation. In other words, the addition of −y, which would normally be
performed as the subtraction of y, that is, clockwise rotation by y notches, is achieved instead by
counterclockwise rotation by 2^k − y notches. Since 2^k − y is the k-bit 2's-complement representation
of −y, the common rule for both negative and positive numbers is to rotate counterclockwise by an amount
equal to the representation of the number. Therefore, a simple binary adder can add numbers of either sign.

Example 9.7: Converting from 2's complement to decimal

Find the decimal equivalent of the 2's-complement number (1 0 1 1 0 1 0 1)2's complement.

Solution: Each 1 in the binary number corresponds to a power of 2 (with the sign bit taken as
negative) and the sum of these powers produces the equivalent decimal value:

(1 0 1 1 0 1 0 1)2's complement = −2^7 + 2^5 + 2^4 + 2^2 + 2^0 = −128 + 32 + 16 + 4 + 1 = −75

The same answer can be reached with the use of Horner's rule, starting with the negatively weighted
sign digit: −1 × 2 + 0 = −2, −2 × 2 + 1 = −3, −3 × 2 + 1 = −5, −5 × 2 + 0 = −10, −10 × 2 + 1 = −19,
−19 × 2 + 0 = −38, −38 × 2 + 1 = −75.

If the 2's-complement number were given in its hexadecimal encoding, that is, as (B5)hex, its
decimal equivalent would be obtained by first converting to binary and then using the foregoing
procedure.

Note that since 2^k − (2^k − y) = y, changing the sign of a 2's-complement number is done by
subtracting it from 2^k in all cases, regardless of whether the number is negative or positive. In practice, 2^k − y
is computed as ((2^k − 1) − y) + 1. Since 2^k − 1 has the all-1s binary representation, subtracting y from (2^k − 1) is
simple and amounts to inverting all the bits of y. Thus, as a rule, the sign of a 2's-complement number
is changed by inverting all of its bits and then adding 1 to the result. The only exception is −2^(k−1), whose negation
is not representable in k bits.

Example 9.8: Sign change for a 2's-complement number

Given y = (1 0 1 1 0 1 0 1)2's complement, find the 2's-complement representation of −y.

Solution: We need to invert all the bits in the representation of y and then add 1:

−y = (0 1 0 0 1 0 1 0) + 1 = (0 1 0 0 1 0 1 1)2's complement


Check the result by converting it to decimal:

(0 1 0 0 1 0 1 1)two = 2^6 + 2^3 + 2^1 + 2^0 = 64 + 8 + 2 + 1 = 75

This agrees with y = −75, obtained in Example 9.7.

The above rule for sign change forms the basis of the 2's-complement adder/subtractor circuit
shown in Figure 9.6. Note that the addition of 1, needed to complete the sign-change process, is done by setting the
adder's carry-in to 1 when the operation to be performed is subtraction; during addition, cin is set
to 0. The simplicity of the adder/subtractor circuit of Figure 9.6 is a further advantage of the
2's-complement representation.
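
The behavior of the circuit in Figure 9.6 can be mimicked in Python: for subtraction, the second
operand is inverted bitwise and the carry-in is set to 1. A sketch for k-bit words, with the carry-out
simply discarded:

    # k-bit 2's-complement add/subtract, as in the adder/subtractor circuit:
    # subtraction inverts y and injects a carry-in of 1.
    def add_sub(x, y, k, subtract=False):
        mask = (1 << k) - 1
        cin = 0
        if subtract:
            y, cin = ~y & mask, 1      # bitwise inversion plus cin = 1
        return (x + y + cin) & mask    # keep k bits; drop the carry-out

    # 9 - 12 in four bits yields 13, the 2's-complement pattern for -3.
    print(add_sub(9, 12, 4, subtract=True))    # 13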

Other complement representation systems can also be devised, but none is in wide use. Choosing
any complementation constant M that is at least as large as N + P + 1 allows the signed integers
in the range [−N, P] to be represented, with the positive numbers in [0, +P] corresponding to the unsigned values in [0,
P] and the negative numbers in [−N, −1] represented as the unsigned values in [M − N, M − 1]; that is, negative
values have M added to them. Sometimes M also serves as an alternate code for 0 (actually, −0). For
example, the k-bit 1's-complement system is based on M = 2^k − 1 and includes numbers in the
symmetric range [−(2^(k−1) − 1), 2^(k−1) − 1], where 0 has two representations: the all-0s string and the
all-1s string.


5. Fixed point numbers


A fixed-point number consists of a whole or integral part and a fractional part, the two parts being
separated by a radix point (decimal point in radix 10, binary point in radix 2, and so on). The position of the
radix point is almost always implied, so the point is not explicitly shown. If a fixed-point number
has k whole digits and l fractional digits, its value is obtained from the formula

x = (xk−1 ··· x1 x0 . x−1 x−2 ··· x−l)r = Ʃ xi r^i (with the sum taken over i from −l to k − 1)

In other words, the digits to the right of the radix point are assigned negative indices and their weights
are negative powers of the radix. For example:

(1 0 . 0 1 1)two = 2^1 + 2^−2 + 2^−3 = 2.375

In a radix-r fixed-point number system with k whole digits and l fractional digits, the numbers from 0 to
r^k − r^−l can be represented in increments of r^−l. The step size or resolution r^−l is often referred to as ulp,
for unit in the least position. For example, in a (2 + 3)-bit binary fixed-point number system, we have
ulp = 2^−3, and the values 0 = (00.000)two through 2^2 − 2^−3 = 3.875 = (11.111)two are representable. For a given total
number k + l of digits in a fixed-point number system, increasing k will widen the range
of representable numbers, whereas increasing l leads to greater precision. Therefore, there is a trade-off between
range and precision.

Signed fixed-point numbers are represented by the same methods discussed for signed integers:
signed magnitude, biased format, and complement methods. In particular, in the
2's-complement format, a negative value −x is represented as the unsigned value 2^k − x. Figure 9.7 shows
the encodings of positive and negative values in the (1 + 3)-bit fixed-point 2's-complement format.
Note that the positive values 0 to 7/8 (or 2^(k−1) − 2^−l, in general) have their standard binary encodings, whereas the
negative values −1/8 to −1 (or −2^−l to −2^(k−1), in general) are transformed into unsigned values by adding 2
(or 2^k, in general) to them.

The two important properties of 2's-complement numbers, already mentioned in connection with
integers, are valid here as well; namely, the leftmost bit of the number acts as the sign bit, and the
value represented by a particular bit pattern can be derived by considering the sign bit as having negative
weight. Here are two examples:

(01.011)2's complement = 2^0 + 2^−2 + 2^−3 = +1.375

(11.011)2's complement = −2^1 + 2^0 + 2^−2 + 2^−3 = −0.625

The process of changing the sign of a number is also the same: invert all bits and add 1 in the least
significant position (that is, add ulp). For example, given (11.011)2's complement = −0.625, its sign is changed
by inverting all the bits to get 00.100 and adding ulp to obtain (00.101)2's complement = +0.625.

The conversion of fixed-point numbers from radix r to radix R is done separately for the
whole and fractional parts. To convert the fractional part, arithmetic may again be
used in either the new radix R or the old radix r, whichever is more convenient. With radix-R arithmetic,
we evaluate a polynomial in r^−1 whose coefficients are the digits xi. The simplest way is to view the
fractional part as an l-digit integer, convert this integer to radix R, and divide the result by r^l.


To perform the radix conversion using arithmetic in the old radix r, the fraction y is repeatedly multiplied
by the new radix R, with the whole part noted and then removed at each step. These whole parts correspond
to the digits X−i in radix R, starting from X−1. For example, 0.175 is converted to radix 2 as follows:

0.175 multiplied by 2 yields 0.350 with integer part 0

0.350 multiplied by 2 yields 0.700 with integer part 0

0.700 multiplied by 2 yields 0.400 with integer part 1

0.400 multiplied by 2 yields 0.800 with integer part 0

0.800 multiplied by 2 yields 0.600 with integer part 1

0.600 multiplied by 2 yields 0.200 with integer part 1

0.200 multiplied by 2 yields 0.400 with integer part 0

0.400 multiplied by 2 yields 0.800 with integer part 0

Reading the recorded whole parts from top to bottom, we find 0.175 ≅ (.00101100)two. This
equality is approximate because the process does not terminate with a fraction of 0. In general, a fraction in one radix
may not have an exact representation in another radix. In any case, the process is carried out until the
required number of digits in the new radix has been obtained. The process can also be continued beyond
the last digit of interest, so that the result can be rounded.
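
The repeated-multiplication procedure is captured by the Python sketch below; exact rational
arithmetic (the standard fractions module) keeps the recorded whole parts identical to the hand
computation:

    from fractions import Fraction

    # Convert a fractional value to l binary digits by repeated doubling;
    # the whole parts, read in the order produced, are the fractional digits.
    def fraction_to_binary(y, l):
        y = Fraction(y)                # exact value, no rounding drift
        bits = []
        for _ in range(l):
            y *= 2
            bits.append(int(y))        # record the whole part
            y -= int(y)                # keep only the fraction
        return bits

    print(fraction_to_binary("0.175", 8))   # [0, 0, 1, 0, 1, 1, 0, 0]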

Example 9.9: Converting from binary fixed-point to decimal

Find the decimal equivalent of the unsigned binary number (1101.0101)two. What if the given number
were in 2's-complement format instead of unsigned format?

Solution: The whole part represents 8 + 4 + 1 = 13. The fractional part, when viewed as a four-bit integer
(that is, multiplied by 16), represents 4 + 1 = 5; the fractional part is therefore 5/16 = 0.3125. Consequently,
(1101.0101)two = (13.3125)ten. For the 2's-complement case, we can proceed as above, obtaining the whole
part −8 + 4 + 1 = −3 and the fractional part 0.3125, and conclude that (1101.0101)2's complement =
(−3 + 0.3125)ten = (−2.6875)ten. Alternatively, recognizing that the number is negative, we can change its sign and then
convert the resulting number:

(1101.0101)2's complement = −(0010.1011)two = −(2 + 11/16) = (−2.6875)ten

Example 9.10: Converting from decimal fixed-point to binary

Find the (4 + 4)-bit binary fixed-point equivalents of (3.72)ten and −(3.72)ten.

Solution: The four-bit binary equivalent of 3 is (0011)two.

0.72 multiplied by 2 yields 0.44 with integer part 1

0.44 multiplied by 2 yields 0.88 with integer part 0

0.88 multiplied by 2 yields 0.76 with integer part 1


0.76 multiplied by 2 yields 0.52 with integer part 1

0.52 multiplied by 2 yields 0.04 with integer part 1

Reading the recorded whole parts from top to bottom, we find (3.72)ten ≅ (0011.1100)two. Note
that a fifth fractional bit was obtained in order to round the representation to the nearest (4 + 4)-bit fixed-point
number. The same goal could have been achieved by noting that the remaining fraction after the fourth step,
0.52, is greater than 0.5 and therefore rounding up. Had we kept only four fractional bits without rounding,
the binary representation would have been (0011.1011)two, whose fractional part 11/16 = 0.6875 is
not as good an approximation to 0.72 as 12/16 = 0.75. Applying sign change to the result, we find
−(3.72)ten ≅ (1100.0100)2's complement.

6. Floating-point numbers
Integers in a prescribed range can be represented exactly for automatic processing, but most real
numbers must be approximated within the machine's finite word width. Some real numbers can be
represented as, or approximated by, (k + l)-bit fixed-point numbers, as seen in Section 9.5. One
problem with fixed-point representations is that they are not very good at dealing with very large and very
small numbers at the same time. Consider the two (8 + 8)-bit fixed-point numbers shown below:

x = (0000 0000 . 0000 1001)two small number

y = (1001 0000 . 0000 0000)two large number

The relative representation error due to truncation or rounding of digits beyond the −8th position
is quite significant for x, but much less severe for y. On the other hand, neither y^2 nor y/x is representable
in this number format.

Fixed-point representation is therefore inadequate for applications that must handle numerical
values over a wide range, from very small to extremely large, since accommodating both the range and
the precision requirements would demand a very wide word. Floating-point numbers constitute the primary
mode of arithmetic in such situations. A floating-point value consists of a signed, fixed-point
magnitude and an accompanying scale factor. After many years of experimentation with
different floating-point formats, and having to cope with the resulting inconsistencies and
incompatibilities, the computing community settled on the standard format proposed by the Institute of
Electrical and Electronics Engineers (IEEE) and adopted by similar national and international
organizations, including the American National Standards Institute (ANSI). The discussion
of floating-point numbers and arithmetic is therefore formulated only in terms of this format. Other formats
differ in their parameters and representation details, but the basic trade-offs and algorithms remain
the same.

A floating-point number in the ANSI/IEEE standard format has three components: a sign ±, an exponent e, and a
significand s, together representing the value ±2^e × s. The exponent is a signed integer represented in a
biased format (a fixed bias is added to it to make it into an unsigned number). The significand is a fixed-
point number in the range [1, 2). Because the binary representation of the
significand always begins with "1.", this fixed 1 is hidden and only the fractional part of the
significand is explicitly represented.


Table 9.1 and Figure 9.8 show the details of the short (32-bit) and long (64-bit) ANSI/IEEE floating-point
formats. The short format has adequate range and precision for most common applications
(magnitudes ranging from 1.2 × 10^−38 to 3.4 × 10^38). The long format is used for highly precise calculations or
those involving extreme variations in magnitude (from 2.2 × 10^−308 to 1.8 × 10^308). In these
formats, as described thus far, zero would have no representation, because the significand is
always nonzero. To remedy this problem, and to be able to represent other special values, the
smallest and largest exponent codes (all 0s and all 1s in the biased exponent field) are not used for
ordinary numbers. An all-0s word (0s in the sign, exponent, and significand fields) represents +0; similarly,
−0 and ±∞ have special representations, as does any indeterminate value, known as "not a number" (NaN).
Other details of this standard are not covered in this book; in particular, the denormalized
numbers (denormals, for short), which appear in Table 9.1, are not explained here.
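
The three fields of a short-format number can be examined with Python's standard struct module.
The sketch below decodes a normalized value (zeros, denormals, infinities, and NaNs would need the
special cases described above):

    import struct

    # Extract sign, unbiased exponent, and significand of a 32-bit number.
    def decode_float32(x):
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        sign = bits >> 31
        exponent = ((bits >> 23) & 0xFF) - 127        # remove the bias of 127
        significand = 1 + (bits & 0x7FFFFF) / 2**23   # restore the hidden 1
        return sign, exponent, significand

    print(decode_float32(6.5))    # (0, 2, 1.625), since 6.5 = +2^2 x 1.625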

When an arithmetic operation produces a result that is not representable in the format being used, the
result must be rounded to some representable value. The ANSI/IEEE standard
prescribes four rounding options. The default rounding mode is round to nearest even: choose
the nearest representable value and, in the case of a tie, choose the value whose least significant bit is
0. There are also three directed rounding modes: round toward +∞ (choose the next higher value),
round toward −∞ (choose the next lower value), and round toward 0 (choose the closest value that is
smaller in magnitude than the exact result). With round to nearest, the maximum
rounding error is 0.5 ulp, whereas with the directed rounding schemes the error can approach 1 ulp. Section
12.1 discusses the rounding of floating-point values in more detail.
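
Round to nearest even is also the rule followed by Python's built-in round function, which makes the
tie-breaking behavior easy to observe:

    # Ties go to the neighbor whose least significant digit is even:
    # 0.5 and 2.5 round down, while 1.5 rounds up.
    print(round(0.5), round(1.5), round(2.5))    # 0 2 2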


PROBLEMS
9.1 Fixed-radix positional number systems

For each of the following conventional radix-r number systems, using digit values from 0 to r − 1,
determine: max−, max+, the number b of bits needed to encode each digit, the number K of digits needed
to represent the equivalents of all 32-bit unsigned binary integers, and the representation efficiency in the
latter case.

a) r = 2, k = 24
b) r = 5, k = 10
c) r = 10, k = 7
d) r = 12, k = 7
e) r = 16, k = 6

9.2 Fixed-radix positional number systems

Prove the statements in parts a) through c):

a) An unsigned binary integer x is a power of 2 iff the bitwise logical AND of x and x − 1 is 0.

b) An unsigned radix-3 integer is even iff the sum of all its digits is even.


c) An unsigned binary integer (xk−1 xk−2 ··· x1 x0)two is divisible by 3 iff Ʃeven i xi − Ʃodd i xi is a multiple of
3.

d) Generalize the statements of parts b) and c) to obtain rules for the divisibility of radix-r integers by
r − 1 and r + 1.

9.3 Overflow in arithmetic operations

Determine whether overflow occurs when each of the following arithmetic expressions is evaluated within each of
the five number systems defined in problem 9.1 (20 cases in all). All operands are
given in radix 10.

a) 3000 × 4000
b) 2^24 − 2^22
c) 9 000 000 + 500 000 + 400 000
d) 1^2 + 2^2 + 3^2 + ··· + 2000^2

9.4 Unconventional digit sets

Consider a symmetric radix-3 fixed-point number system with k whole and l fractional digits, using
the digit set [−1, 1]. The integer version of this representation was discussed in Example 9.3.

a) Determine the range of numbers represented as a function of k and l.

b) What is the representation efficiency relative to the binary representation, given that each radix-3
digit requires a two-bit encoding?

c) Devise a simple and fast hardware procedure for converting from the above representation to a radix-3
representation using the redundant digit set [0, 3].

d) What is the representation efficiency of the redundant representation of part c)?

9.5 Carry-save numbers

In Example 9.4:

a) Provide all possible alternative representations for part b).

b) Identify the numbers, if any, that have unique representations.

c) Which integer has the largest possible number of different representations?

9.6 Carry-save numbers

Figure 9.3a can be interpreted as a way of compressing three binary numbers into two binary numbers
that have the same sum, and the same width, ignoring the possibility of overflow. Similarly, Figure
9.3b shows the compression of four binary numbers, first to three and then to two numbers. A

hardware circuit that converts three binary numbers into two with the same sum is called a carry-save
adder.

a) Draw a diagram similar to Figure 9.3b representing the compression of six binary numbers into two
that have the same sum and width.

b) Convert your solution of part a) into a hardware diagram, using blocks with three inputs and two outputs
to represent the carry-save adders.

c) Show that the hardware diagram of part b) has a latency equal to that of three carry-save
adders; otherwise, present a different hardware implementation with such a latency for compressing six
binary numbers into two.

9.7 Negabinary numbers

Negabinary numbers use the digit set {0, 1} in radix r = −2. The value of a negabinary number is evaluated in the same way as that of a binary number, except that terms containing odd powers of the radix are negative. Therefore, positive and negative numbers can be represented without the need for a separate sign bit or a complementation scheme. (An evaluation sketch in Python follows part (c) below.)

a) What is the range of representable values in nine- and ten-bit negabinary representations?

b) Given a negabinary number, how is its sign determined?

c) Design a procedure for converting a positive negabinary number to a conventional unsigned binary
number.
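As background for this problem (an illustrative Python sketch under our own naming, not a solution to the parts above), the evaluation rule for negabinary strings can be expressed with Horner's method in radix −2:

def negabinary_value(bits):
    # Evaluate a radix -2 digit string, most significant digit first:
    # each step multiplies the running value by -2 and adds the next bit.
    value = 0
    for b in bits:
        value = value * (-2) + int(b)
    return value

# "110" evaluates to 4 - 2 + 0 = 2; "11" evaluates to -2 + 1 = -1
assert negabinary_value("110") == 2
assert negabinary_value("11") == -1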

9.8 Compressed decimal numbers

One way to represent decimal numbers in memory is to pack two BCD digits into one byte. This representation is wasteful, since a byte that can encode 256 values is used to represent only the digit pairs 00 through 99. One way to improve efficiency is to compress three BCD digits into ten bits.

a) Design a suitable encoding for this compression. Hint: Let x3 x2 x1 x0, y3 y2 y1 y0, and z3 z2 z1 z0 be the three BCD digits, and let W X2 X1 x0 Y2 Y1 y0 Z2 Z1 z0 be the ten-bit encoding. In other words, the least significant bit (LSB) of each of the three digits is used directly, and the remaining nine bits (three from each digit) are encoded in seven bits. Let W = 0 encode the case x3 = y3 = z3 = 0; in this case, the remaining digit bits are copied into the new representation. Use X2 X1 = 00, 01, 10 to encode the cases where exactly one of x3, y3, or z3 is 1. Note that when the most significant bit of a BCD digit is 1, the digit is fully specified by its LSB and no other information is needed. Finally, use X2 X1 = 11 for all other cases.

b) Design a circuit to compress three BCD digits into the ten-bit representation.

c) Design a circuit to decompress the ten-bit code, recovering the original three BCD digits.

d) Suggest a similar encoding to compress two BCD digits into seven bits.

e) Design the required compression and decompression circuits for the encoding of part (d).


9.9 Numeric base conversion

Convert each of the following numbers from its indicated base to a representation in base 10.

a) Numbers in base 2: 1011, 1011 0010, 1011 0010 1111 0001

b) Numbers in base 3: 1021, 1021 2210, 1021 2210 2100 1020

c) Numbers in base 8: 534, 534 607, 534 607 126 470

d) Numbers in base 12: 7a4, 7a4 539, 7a4 593 1b0

e) Numbers in base 16: 8e, 8e3a, 8e3a 51c0

9.10 Fixed point numbers

A fixed-point binary numeral system has 1 integer bit and 15 fractional bits.

a) What is the range of representable numbers, assuming an unsigned format?

b) What is the range of representable numbers, assuming a 2's-complement format?

c) Represent the decimal fractions 0.75 and 0.3 in the format of part (a).

d) Represent the decimal fractions −0.75 and −0.3 in the format of part (b).

9.11 Numeric base conversion

Convert each of the following numbers from its indicated base to radix-2 and radix-16 representations.

a) Numbers in radix 3: 1021, 1021 2210, 1021 2210 2100 1020

b) Numbers in radix 5: 302, 302 423, 302 423 140

c) Numbers in radix 8: 534, 534 607, 534 607 126 470

d) Numbers in radix 10: 12, 5 655, 2 550 276, 76 545 336, 3 726 755

e) Numbers in radix 12: 9a5, b0a, ba95, a55a1, baabaa

9.12 Numeric base conversion

Convert each of the following fixed-point numbers from the indicated radix to a radix-10 representation.

a) Numbers in base 2: 10.11, 1011.0010, 1011 0010.1111 001

b) Numbers in base 3: 10.21, 1021.2210, 1021 2210.2100 1020

c) Numbers in base 8: 53.4, 534.607, 534 607.126 470

d) Numbers in base 12: 7a.4, 7a4.539, 7a4 593.1b0

e) Numbers in base 16: 8.e, 8e.3a, 8e3a.51c0


9.13 Numeric radix conversion

Convert each of the following fixed-point numbers from the indicated radix to radix-2 and radix-16 representations.

a) Numbers in base 3: 10.21, 1021.2210, 1021 2210.2100 1020

b) Numbers in base 5: 30.2, 302.423, 302 423.140

c) Numbers in base 8: 53.4, 534.607, 534 607.126 470

d) Numbers in base 10: 1.2, 56.55, 2 550.276, 76 545.336, 3 726.755

e) Numbers in base 12: 9a.5, b.0a, ba.95, a55a1, baa.baa

9.14 Fixed-point 2's-complement format

a) Encode each of the following decimal numbers in 16-bit 2's-complement format with 0, 2, 4, or 8 fractional bits: 6.4, 33.675, 123.45.

b) Check the correctness of your conversions in part (a) by applying Horner's rule to the resulting 2's-complement numbers to derive their decimal equivalents (an evaluation sketch in Python follows part (d) below).

c) Repeat part (a) for the following decimal numbers: −6.4, 33.675, −123.45

d) Repeat part (b) for the results of part (c).
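The following minimal Python sketch (our own illustration; the function name is an assumption) shows the kind of Horner-style evaluation part (b) asks for, with the sign bit given negative weight:

def fixed_point_2c_value(bits, frac_bits):
    # Horner evaluation of a 2's-complement bit string: the leftmost
    # (sign) bit enters with negative weight, the rest as usual.
    value = -int(bits[0])
    for b in bits[1:]:
        value = value * 2 + int(b)
    return value / (2 ** frac_bits)    # place the radix point

# A 16-bit pattern with 8 fractional bits that encodes roughly -6.4:
print(fixed_point_2c_value("1111100110011010", 8))   # -6.3984375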

9.15 Floating-point numbers

Consider the min and max entries in Table 9.1. Show how these values are derived, and explain why the max values in the two columns are approximately equal to a power of 2.

9.16 Floating-point numbers

Show the representation of the following decimal numbers in ANSI/IEEE short and long floating-point formats. When a number is not exactly representable, use the round-to-nearest-even rule. (A bit-pattern checking sketch follows the list below.)

a) 12.125

b) 555.5

c) 333.3

d) -6.25

e) -1024.0

f) -33.2
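Answers for the short (32-bit) format can be checked mechanically; for example, this Python sketch (ours, using the standard struct module, which also rounds to nearest even) prints the raw bit pattern of a number:

import struct

def ieee_short_bits(x):
    # Pack x as a 32-bit IEEE float, then unpack the raw bit pattern.
    return format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")

# 12.125 = 1.100001 * 2^3: sign 0, biased exponent 130 (10000010),
# fraction 10000100...0
print(ieee_short_bits(12.125))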


ADDERS AND SIMPLE ALUS


CHAPTER TOPICS

1 Simple adders

2 Carry propagation networks

3 Counting and increment

4 Designing fast adders

5 Logical and shift operations

6 Multifunction ALU

Addition is the most important arithmetic operation in digital computers. Even the simplest embedded computers have an adder, whereas multipliers and dividers are found only in higher-performance microprocessors. This chapter begins by considering the design of single-bit adders (half adders and full adders) and shows how such building blocks can be cascaded to build ripple-carry adders. It then proceeds to the design of faster adders built with carry lookahead, the most widely used carry-prediction method. Other topics covered include counters, shift and logical operations, and multifunction ALUs.

1. Simple adders
In this chapter, only binary integer addition and subtraction are covered. Fixed-point numbers in the same format can be added or subtracted as integers by ignoring the implied radix point. Floating-point addition is discussed in Chapter 12.

When two bits are added, the sum is a value in the range [0, 2] that can be represented by a sum bit and a carry bit. The circuit that computes the sum and carry bits is called a binary half adder (HA); its truth table and symbolic representation appear in Figure 10.1. The carry output is the logical AND of the two inputs, while the sum output is the exclusive OR (XOR) of the inputs. Adding a carry input to a half adder yields a binary full adder (FA), whose truth table and schematic diagram appear in Figure 10.2. Figure 10.3 shows several implementations of a full adder.

A full adder, connected to a flip-flop that retains the carry bit from one cycle to the next, functions as a bit-serial adder. The inputs of a bit-serial adder are supplied in synchrony with a clock signal, one bit of each operand per clock cycle, beginning with the least significant bits (LSBs). One bit of the output is produced per clock cycle, and the carry from one cycle is retained and used as input in the next cycle. A ripple-carry adder unfolds this sequential behavior in space, using a cascade of k full adders to add two k-bit numbers (Figure 10.4).
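The behavior of these building blocks is easy to capture in a few lines of Python (a simulation sketch of ours, not part of the text):

def full_adder(x, y, cin):
    # Sum bit: XOR of the three inputs; carry-out: majority of the three.
    s = x ^ y ^ cin
    cout = (x & y) | (x & cin) | (y & cin)
    return s, cout

def ripple_carry_add(xs, ys, cin=0):
    # Cascade k full adders; bit lists are indexed with position 0 = LSB.
    ss, c = [], cin
    for x, y in zip(xs, ys):
        s, c = full_adder(x, y, c)
        ss.append(s)
    return ss, c                      # sum bits and final carry-out

# 0101 + 0110 = 1011 (5 + 6 = 11); bits listed LSB first
assert ripple_carry_add([1, 0, 1, 0], [0, 1, 1, 0]) == ([1, 1, 0, 1], 0)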


The ripple-carry design of Figure 10.4 becomes a radix-r adder if each binary full adder is replaced by a radix-r full adder that accepts two radix-r digits (each encoded in binary) and a carry-in signal, producing a radix-r sum digit and a carry-out signal. Once all the intermediate carry signals are known, the sum bits/digits are easily computed. For this reason, discussions of adder design focus on how all the intermediate carries can be derived from the operand inputs and cin. Because carry signals are always binary and their propagation can be treated independently of the radix r, as discussed in section 10.2, radices other than 2 will seldom be dealt with from this point on.

Note that any binary adder design can be converted to a 2's-complement adder/subtractor via the scheme shown in Figure 9.6. For this reason, subtraction will not be discussed as a separate operation.


2. Carry propagation networks


When two numbers are added, carries are generated at certain digit positions. For decimal addition, these are the positions where the sum of the operand digits is 10 or more. In binary addition, carry generation requires that both operand bits be 1. The auxiliary binary signal gi is defined to be 1 for positions where a carry is generated and 0 otherwise. For binary addition, gi = xi yi; that is, gi is the logical AND of the operand bits xi and yi. Similarly, there are digit positions through which an incoming carry propagates. For decimal addition, these are the positions where the sum of the operand digits is exactly 9; an incoming carry makes the position sum 10, which leads to an outgoing carry from that position. Of course, if there is no incoming carry, there will be no outgoing carry from such a position. In


binary addition, carry propagation requires that one operand bit be 0 and the other 1. The auxiliary binary signal pi, derived as pi = xi ⊕ yi for binary addition, is defined to be 1 if digit position i propagates an incoming carry.

Now, given the incoming carry c0 of an adder and the auxiliary signals gi and pi for all k digit positions, the intermediate carries ci and the outgoing carry ck can be derived independently of the operand digit values. In a binary adder, the sum bit at position i is then derived as

si = pi ⊕ ci

Figure 10.5 shows the general structure of a binary adder. Variations in the carry network lead to many designs that differ in implementation cost, operating speed, power consumption, and so on.

The ripple-carry adder of Figure 10.4 is very simple and easily expandable to any desired width. However, it is rather slow, because carries may propagate across the full width of the adder. This occurs, for example, when the two eight-bit numbers 10101011 and 01010101 are added. Each full adder requires some time to generate its carry output from its carry input and the operand bits at that position. Cascading k such units implies k times as much signal delay in the worst case. This linear amount of time becomes unacceptable for wide words (say, 32 or 64 bits) or in high-performance computers, although it may be acceptable in an embedded system that is dedicated to a single task and is not expected to be fast.

To view a ripple-carry adder within the general framework of Figure 10.5, note that the carry network of a ripple-carry adder is based on the recurrence:

ci+1 = gi ∨ pi ci


This recurrence states that a carry enters position i + 1 if it is generated at position i or if a carry entering position i propagates through that position. This observation leads to Figure 10.6 as the carry network of a ripple-carry adder. The linear latency of a ripple-carry adder (2k gate levels for the carry network, plus a few more to derive the auxiliary signals and produce the sum bits) is evident from Figure 10.6.
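The recurrence is easy to trace in software; the following Python sketch (ours) derives all the carries from the g and p signals and exhibits the worst-case rippling mentioned above:

def ripple_carries(g, p, c0=0):
    # c[i+1] = g[i] OR (p[i] AND c[i]); index 0 is the LSB position.
    c = [c0]
    for gi, pi in zip(g, p):
        c.append(gi | (pi & c[-1]))
    return c                          # c[k] is the carry-out

# x = 10101011, y = 01010101 (the example above), bits listed LSB first
x = [1, 1, 0, 1, 0, 1, 0, 1]
y = [1, 0, 1, 0, 1, 0, 1, 0]
g = [a & b for a, b in zip(x, y)]     # generate: both bits 1
p = [a ^ b for a, b in zip(x, y)]     # propagate: exactly one bit 1
print(ripple_carries(g, p))           # the carry generated at position 0
                                      # ripples through every position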

Example 10.1: Variations in adder design

We say that a carry transfer occurs at digit position i if a carry is generated or propagated there. For binary adders, the auxiliary transfer signal ti = gi ∨ pi can be derived by an OR gate, since gi ∨ pi = xi yi ∨ (xi ⊕ yi) = xi ∨ yi. An OR gate is usually faster than an XOR gate, so ti can be produced faster than pi.

a) Show that the carry recurrence ci+1 = gi ∨ pi ci remains valid if pi is replaced by ti.

b) How does the change of part (a) affect the design of the carry network?

c) How else does the change of part (a) affect the design of a binary adder?

Solution:

a) The two expressions gi ∨ pi ci and gi ∨ ti ci are shown to be equivalent by converting one into the other: gi ∨ pi ci = gi ∨ gi ci ∨ pi ci = gi ∨ (gi ∨ pi) ci = gi ∨ ti ci. Note that, in the first step of the conversion, the inclusion of the extra term gi ci is justified by gi ∨ gi ci = gi (1 ∨ ci) = gi; in other words, the term gi ci is redundant.

b) Because changing pi to ti does not affect the relationship between ci+1 and ci, nothing changes in the carry network if it is supplied with ti instead of pi. The small speed difference between ti and pi leads to faster production of the carry signals.

c) It is necessary to include k additional two-input OR gates to produce the ti signals. The pi signals are still required to produce the sum bits si = pi ⊕ ci once all the carries are known. The adder will nevertheless be faster overall, because the pi signals are derived in parallel with the operation of the carry network, which is bound to take longer.

There are several ways to speed up carry propagation, and thus addition. One method, conceptually quite simple, involves providing skip paths in a ripple-carry network. For example, a 32-bit carry network can be divided into eight four-bit sections, with a five-input AND

gate allowing an incoming carry at position 4j to go directly to the end of the section if p4j = p4j+1 = p4j+2 = p4j+3 = 1. A four-bit section of the carry network covering bit positions 4j through 4j + 3 is sketched in Figure 10.7. Note that the latency of such an adder with four-bit skip paths is still linear in k, although it is much lower than that of a simple ripple-carry adder. The faster propagation of carries along skip paths resembles drivers who reduce their travel times by using a nearby freeway whenever the desired destination is sufficiently far away (Figure 10.8).

Figure 10.7: Four-bit section of a ripple-carry network with skip paths.

Figure 10.8: Highway analogy for carry propagation in adders with skip paths. Taking the highway benefits a driver who wants to travel a long distance, avoiding excessive delays at many traffic lights.

Example 10.2: Carry equation for carry-skip adders

It was seen that ripple-carry adders implement the carry recurrence ci+1 = gi ∨ pi ci. What is the corresponding equation for the carry-skip adder with four-bit skip paths shown in Figure 10.7?

Solution: It is evident from Figure 10.7 that the carry equation remains the same for any position whose index i is not a multiple of 4. The equation for the incoming carry at position i = 4j + 4 becomes c4j+4 = g4j+3 ∨ p4j+3 c4j+3 ∨ p4j+3 p4j+2 p4j+1 p4j c4j. This equation shows that there are three (not mutually exclusive) ways a carry can enter position 4j + 4: by being generated at position 4j + 3, by propagation of a carry entering position 4j + 3, or by the carry at position 4j passing along the skip path.


3. Counting and increment


Before discussing some of the common ways in which the addition process is accelerated in modern computers, consider a special case of addition in which one of the two operands is a constant. If a register is initialized to a value x and a constant a is repeatedly added to it, the sequence of values x, x + a, x + 2a, x + 3a, ... is obtained. This process is known as counting by a. Figure 10.9 shows a hardware implementation of this process using an adder whose lower input is permanently connected to the constant a. This counter can be updated in two different ways: incrementing causes the counter to advance to the next value in the above sequence, and initialization causes an input data value to be stored in the register.

The special case a = 1 corresponds to the standard up counter, whose sequence is x, x + 1, x + 2, x + 3, ..., while a = −1 yields a down counter, which proceeds in the order x, x − 1, x − 2, x − 3, ...;


an up/down counter can count up or down, depending on the value of a direction control signal. If the count passes through 0 while counting down, the counter holds the appropriate negative 2's-complement value. Up and down counters can overflow when the count becomes too large or too small to represent in the number format used. In the case of unsigned up counters, overflow is indicated by the carry-out output of the adder. In other cases, counter overflow is detected in the same way as in adders (section 10.6).

Attention will now be focused on an up counter with a = 1. In this case, instead of connecting the constant 1 to the lower input of the adder, as in Figure 10.9, one can set cin = 1 and use 0 as the lower adder input. The adder then becomes an incrementer, whose design is simpler than that of an ordinary adder. To see why, note that adding y = 0 to x gives the generate signals gi = xi yi = 0 and the propagate signals pi = xi ⊕ yi = xi. Therefore, referring to the carry propagation network of Figure 10.6, all the OR gates, as well as the AND gate at the far right, can be eliminated, leading to the simplified carry network of Figure 10.10. It will be seen later that the methods used to accelerate carry propagation in adders can also be adopted to design faster incrementers.

A machine's program counter is an example of an up counter; it is incremented to point to the next instruction as the current instruction is executed. PC incrementation occurs early in the instruction execution cycle, so there is ample time for its completion, which means that a superfast incrementer may not be required. In the case of MiniMIPS, the PC is incremented not by 1 but by 4; however, incrementing by any power of 2, such as 2^h, is the same as ignoring the h least significant bits and adding 1 to the remaining part.
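The simplification described above is compact enough to state in code (a Python sketch of ours): with y = 0, the generate signals vanish and the carry recurrence collapses to ci+1 = xi ci.

def increment(x_bits, c0=1):
    # Incrementer: with y = 0, g[i] = 0 and p[i] = x[i], so the carry
    # recurrence reduces to c[i+1] = x[i] AND c[i]; sum bit is x[i] XOR c[i].
    c, out = c0, []
    for xi in x_bits:                 # index 0 = LSB
        out.append(xi ^ c)
        c = xi & c
    return out                        # overflow (all-1s input) wraps to 0

assert increment([1, 1, 0, 1]) == [0, 0, 1, 1]    # 11 + 1 = 12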


4. Design of fast adders


It is possible to design a variety of fast adders that require logarithmic, rather than linear, time. In other words, the delay of such fast adders grows as the logarithm of k. The best known and most widely used of these are carry-lookahead adders, whose design is discussed in this section.

The basic idea of carry-lookahead addition is to form the required intermediate carries directly from the inputs gi, pi, and cin of the carry network, rather than from the preceding carries, as is done in ripple-carry adders. For example, the carry c3 of the adder in Figure 10.4, previously expressed in terms of c2 using the carry recurrence

c3 = g2 ∨ p2 c2

can be derived directly from the inputs via the logical expression:

c3 = g2 ∨ p2 g1 ∨ p2 p1 g0 ∨ p2 p1 p0 c0

This expression is obtained by unrolling the original recurrence; that is, by substituting for c2 its equivalent expression in terms of c1 and then expressing c1 in terms of c0. In fact, this expression could be written down directly from the following intuitive explanation: a carry into position 3 must have been generated somewhere to the right of bit position 3 and propagated from there to position 3. Each term on the right-hand side of the equation covers one of the four possibilities.

In theory, one can fully expand the carry equations and obtain each carry as a two-level AND-OR expression. However, the fully expanded expression becomes too long for a wide adder in which, say, c31 or c52 must be derived. Several carry-lookahead networks exist that systematize the preceding derivation for all intermediate carries in parallel and make the computation efficient by sharing parts of the required circuits whenever possible. Different designs offer trade-offs in speed, cost, VLSI chip area, and power consumption. Information on the design of carry-lookahead networks and other types of fast adders can be found in books on computer arithmetic [Parh00].
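Written out for a four-bit block, the fully expanded two-level expressions look as follows (a Python transcription of ours, cross-checked against the ripple recurrence):

from itertools import product

def lookahead_carries(g, p, c0):
    # Each carry written directly as a two-level AND-OR expression.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return c1, c2, c3, c4

# Agreement with the ripple recurrence for all 512 input combinations:
for bits in product([0, 1], repeat=9):
    g, p, c0 = bits[0:4], bits[4:8], bits[8]
    c, ripple = c0, []
    for gi, pi in zip(g, p):
        c = gi | (pi & c)
        ripple.append(c)
    assert lookahead_carries(g, p, c0) == tuple(ripple)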

Here, only one example of a carry-lookahead network is presented. The building blocks of this network consist of the carry operator, which combines the generate and propagate signals of two adjacent blocks [i + 1, j] and [h, i] of digit positions into the corresponding signals of the wider combined block [h, j]. In other words,

[i + 1, j] ¢ [h, i] = [h, j]

where ¢ denotes the carry operator and [a, b] stands for (g[a, b], p[a, b]), the pair of generate and propagate signals of the block extending from digit position a to digit position b. Since the problem of determining all the carries ci+1 is the same as computing the cumulative generate signals g[0, i], a network built of ¢ operator blocks, such as the one shown in Figure 10.11, can be used to derive all the carries in parallel. If a cin signal is required for the adder, it can be accommodated as the generate signal g−1 of an extra position on the right; in this case, a (k + 1)-bit carry network would be needed for a k-bit adder.


To better understand the carry network of Figure 10.11, and to be able to analyze its cost and latency in general, note its recursive structure [Bren82]. The top row of carry operators combines the bit-level g and p signals into g and p signals for two-bit blocks. These latter signals correspond to the generate and propagate signals of a radix-4 addition, where each radix-4 digit consists of two bits of the original binary numbers. The remaining rows of carry operators essentially form a carry network for this radix-4 addition, yielding all the even-numbered carries of the original radix-2 addition. All that is left is for the bottom row of carry operators to supply the missing odd-numbered carries.

Figure 10.12 accentuates this recursive structure by showing how the eight-bit Brent-Kung network of Figure 10.11 consists of a four-input network of the same type plus two rows of carry operators, one row at the top and one at the bottom. This leads to a latency of roughly 2 log2 k carry operators (two rows per level, over log2 k levels of recursion) and an approximate cost of 2k operator blocks (almost k blocks in the two outer rows, then k/2 blocks, k/4 blocks, etc.). The exact values are slightly lower: 2 log2 k − 2 levels for latency and 2k − log2 k − 2 blocks for cost.
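The carry operator is associative, which is what allows prefix networks such as Brent-Kung to restructure the computation. The following Python sketch (ours) defines the operator and computes the cumulative (g, p) pairs by a plain left-to-right scan; a prefix network produces the same values in logarithmic depth:

def carry_op(right, left):
    # (g, p) of the combined block; 'right' is the less significant block:
    # g[h,j] = g-left OR (p-left AND g-right),  p[h,j] = p-left AND p-right
    gr, pr = right
    gl, pl = left
    return (gl | (pl & gr), pl & pr)

def cumulative_gp(g, p):
    # All (g[0,i], p[0,i]) pairs; the carry c[i+1] equals g[0,i]
    # (fold cin in as an extra generate signal g(-1) if one is needed).
    acc = (g[0], p[0])
    out = [acc]
    for gi, pi in zip(g[1:], p[1:]):
        acc = carry_op(acc, (gi, pi))
        out.append(acc)
    return out

# With g = [1,0,0,0] and p = [0,1,1,1], every cumulative g[0,i] is 1:
print(cumulative_gp([1, 0, 0, 0], [0, 1, 1, 1]))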

Instead of combining the auxiliary carry signals two at a time, which leads to 2 log2 k levels, one can make four-way combinations for faster operation. Within a four-bit group spanning bit positions 0 through 3, the group generate and propagate signals are derived as:

g[0,3] = g3 ∨ p3 g2 ∨ p3 p2 g1 ∨ p3 p2 p1 g0
p[0,3] = p3 p2 p1 p0


When the g and p signals are known for all four-bit groups, the same process is repeated for the resulting k/4 signal pairs. This leads to the derivation of the intermediate carries c4, c8, c12, c16, and so on; that is, one in every four positions. The remaining problem is to determine the intermediate carries within each four-bit group, which can be done in full lookahead fashion. For example, the carries c1, c2, and c3 in the rightmost group are derived as:

c1 = g0 ∨ p0 c0
c2 = g1 ∨ p1 g0 ∨ p1 p0 c0
c3 = g2 ∨ p2 g1 ∨ p2 p1 g0 ∨ p2 p1 p0 c0

The resulting circuit is similar in structure to the Brent-Kung carry network, except that the upper half of the rows consists of blocks producing the group g and p signals and the lower half consists of the intermediate-carry production blocks just defined. Figure 10.13 shows the designs of these two types of blocks.

An important method for designing fast adders, which often complements the carry-lookahead scheme, is carry selection. In the simplest application of the carry-select method, a k-bit adder is built from a (k/2)-bit adder in the lower half and two (k/2)-bit adders in the upper half (forming two versions of the upper k/2 sum bits, one with ck/2 = 0 and one with ck/2 = 1), plus a multiplexer to choose the correct set of values once ck/2 becomes known. A hybrid design, in which some of the carries (say, c8, c16, and c24 in a 32-bit adder) are derived through carry-lookahead blocks and used to select one of two versions of the sum, produced for eight-bit blocks concurrently with the operation of the carry network, is quite popular in modern arithmetic units. Figure 10.14 shows part of such a carry-select adder, which prepares both versions of the sum for bit positions a through b before the intermediate carry ca is known.
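The essence of carry selection is captured by the following Python sketch (ours, with illustrative widths): both versions of the upper sum are formed while the lower half computes the real carry, and the late-arriving carry merely drives the selection:

def carry_select_add(x_lo, y_lo, x_hi, y_hi, w):
    # w-bit lower and upper halves; the upper sum is computed twice.
    mask = (1 << w) - 1
    lo = x_lo + y_lo                     # produces the real carry c_w
    hi0 = (x_hi + y_hi) & mask           # version assuming c_w = 0
    hi1 = (x_hi + y_hi + 1) & mask       # version assuming c_w = 1
    hi = hi1 if (lo >> w) else hi0       # the multiplexer
    return (hi << w) | (lo & mask)

# 16-bit addition from two 8-bit halves: 0x12AB + 0x3455 = 0x4700
assert carry_select_add(0xAB, 0x55, 0x12, 0x34, 8) == 0x12AB + 0x3455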


5. Logical and shift operations


Implementing logical operations within an ALU is very simple, since the ith bits of the two operands combine to produce the ith bit of the result. Such bitwise operations can be implemented using an array of gates, as shown in Figure 1.5. For example, the and, or, nor, xor, and other MiniMIPS logical instructions are easily implemented in this way. Controlling which of these operations the ALU performs is discussed in section 10.6.

Shifting involves a rearrangement of the bits within a word. For example, when a word is logically shifted right by four bits, bit i of the input word becomes bit i − 4 of the output word, and the four most significant bits of the output word are filled with 0s. Suppose we want to shift a 32-bit word right or left (right′left control signal) by an amount given as a five-bit binary number (0 to 31 bits). Conceptually, this can be achieved with a 64-to-1 multiplexer with 32-bit inputs (Figure 10.15). In practice, however, the resulting circuit is too complex, particularly when other types of shifts (arithmetic, cyclic) are also included. Practical shifters use a multistage implementation, which will be described after the notion of arithmetic shifts has been discussed.

Shifting a k-bit unsigned number x left by h bits multiplies its value by 2^h, provided 2^h x is representable in k bits. This is because each 0 appended to the right of a binary number doubles its value (just as decimal numbers are multiplied by 10 with each 0 appended at the far right). Perhaps surprisingly, this observation also applies to k-bit 2's-complement numbers. Since the sign of a number must not change when it is multiplied by 2^h, a 2's-complement number must have h + 1 identical bits at its left end if it is to be multiplied by 2^h in this way. This ensures that, after h bits have been discarded as a result of the left shift, the bit value in the sign position does not change.

Logical right shift affects 2's-complement and unsigned numbers differently. An unsigned number x is divided by 2^h when shifted right by h bits. This is comparable to moving the decimal point


to the left by one position to divide a decimal number by 10. However, this method of dividing by 2^h does not work for negative 2's-complement numbers; such numbers turn positive when 0s are inserted from the left in the course of a logical right shift. Proper division of a 2's-complement number by 2^h requires that the bits entering from the left be copies of the sign bit (0 for positive numbers, 1 for negative). This process is known as arithmetic right shift. Because dividing numbers by powers of 2 is very useful, and very efficient when done by shifting, most computers provide two types of right shift: logical right shift, which views the number as a bit string whose bits are to be repositioned by the shift, and arithmetic right shift, whose purpose is to divide the numerical value of the operand by a power of 2.
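The difference between the two right shifts can be made concrete in Python (a sketch of ours; Python's own >> on negative integers is already arithmetic, so the bit patterns are handled explicitly here):

def logical_shift_right(u, h, k=32):
    # Insert 0s from the left: divides an unsigned k-bit value by 2**h.
    return (u & ((1 << k) - 1)) >> h

def arithmetic_shift_right(u, h, k=32):
    # Replicate the sign bit from the left: divides the k-bit
    # 2's-complement value by 2**h (rounding toward minus infinity).
    if u >> (k - 1) & 1:                     # negative: extend with 1s
        u |= ((1 << h) - 1) << k
    return (u >> h) & ((1 << k) - 1)

neg8 = (-8) & 0xFFFFFFFF                     # 32-bit pattern of -8
assert arithmetic_shift_right(neg8, 2) == (-2) & 0xFFFFFFFF
assert logical_shift_right(neg8, 2) == 0x3FFFFFFE   # large positive value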

MiniMIPS has two arithmetic shift instructions: shift right arithmetic (sra) and shift right arithmetic variable (srav). These are defined like the logical right shifts, except that sign extension occurs during the shift, as discussed above.

sra $t0,$s1,2 # set $t0 to ($s1) right-shifted by 2
srav $t0,$s1,$s0 # set $t0 to ($s1) right-shifted by ($s0)

Figure 10.16 shows the machine representations of these two instructions.

We are now ready to discuss the design of a shifter for both logical and arithmetic shifts. First, consider the case of single-bit shifts. We want to design a circuit that can perform a 1-bit logical left shift, a 1-bit logical right shift, or a 1-bit arithmetic right shift, based on the control signals provided. It is convenient to treat the no-shift case as a fourth possibility and to use the encoding shown in Figure 10.17a to distinguish the four cases. If the input operand is x[31, 0], the output must be x[31, 0] for no shift, x[30, 0] followed by 0 for logical left shift, x[31, 1] preceded by 0 for logical right shift, and x[31, 1] preceded by x[31] (i.e., a copy of the sign bit) for arithmetic right shift. Therefore, a four-input multiplexer can be used to perform any of these shifts, as shown in Figure 10.17a.


Multibit shifts can be performed in several stages, using replicas of the circuit of Figure 10.17a in each stage. For example, suppose logical and arithmetic shifts by amounts between 0 and 7, supplied as a three-bit binary number, are to be performed. The three stages shown in Figure 10.17b perform the desired shift by first performing a four-bit shift, if required, based on the most significant bit of the shift amount. This converts the input x[31, 0] to y[31, 0]. The intermediate value y is then subjected to a two-bit shift if the middle bit of the shift amount is 1, yielding the result z. Finally, z is shifted by one bit if the least significant bit of the shift amount is 1. The multistage design of Figure 10.17b is called a barrel shifter. It is simple to add cyclic shifts, or rotations, to the designs of Figure 10.17.
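A barrel shifter is equally natural to express in stages; in this Python sketch (ours, for a 5-bit shift amount), stage j shifts by 2^j exactly when the corresponding bit of the shift amount is 1:

def barrel_shift_left(x, amount, k=32):
    # Five stages for a 5-bit shift amount: 16, 8, 4, 2, 1.
    mask = (1 << k) - 1
    for s in (16, 8, 4, 2, 1):
        if amount & s:                # bit of the shift amount
            x = (x << s) & mask       # this stage shifts by s positions
    return x

assert barrel_shift_left(1, 31) == 1 << 31    # all five stages active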

The logical and shift instructions have many applications. For example, they can be used to identify and manipulate individual fields or bits within a word. Suppose we want to isolate bits 10 through 15 of a 32-bit word. One way to do this is to AND the word with the mask

0000 0000 0000 0000 1111 1100 0000 0000

which has 1s in the bit positions of interest and 0s elsewhere, and then shift the result right (logically) by ten bits to bring the bits of interest to the right end of the word. At that point, the resulting word has a numerical value in the range [0, 63], depending on the contents of the original word in bit positions 10 through 15.
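In a high-level language, the same mask-and-shift idiom reads as follows (a Python sketch of ours):

def extract_field(word, lo, hi):
    # AND with a mask having 1s in positions lo..hi, then shift the
    # field to the right end of the word.
    mask = ((1 << (hi - lo + 1)) - 1) << lo
    return (word & mask) >> lo

# Bits 10 through 15, as in the text: the result lies in [0, 63]
assert extract_field(0xFFFFFFFF, 10, 15) == 63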

As a second example, consider a 32-bit word as representing a 4 × 8 block of a black-and-white image, with 1 representing a dark pixel and 0 a white pixel (Figure 10.18). The pixel values can be identified individually by alternating 1-bit left shifts with tests of the sign of the number. An initial test


identifies the first pixel (a negative number means 1). After a left shift, the second pixel is identified by a sign test. This process continues until all the pixel values are known.

6. Multifunction ALU
We can now put together everything seen in this chapter and present the design of a multifunction ALU that performs addition/subtraction, logical, and shift operations. Only the arithmetic/logical operations needed to execute the instructions of Table 5.1 are considered; that is, addition, subtraction, AND, OR, XOR, and NOR. The overall structure of the ALU is shown in Figure 10.19. It consists of three subunits for shifting, addition/subtraction, and logical operations. The output of one of these subunits, or the MSB (most significant bit) of the adder output (the sign bit) zero-extended to a full 32-bit word, can be chosen as the ALU output by asserting the "function class" control signals of the four-input multiplexer. The remainder of this section describes each of the three subunits of the ALU.

Consider first the adder. This is the 2's-complement adder/subtractor described in Figure 9.6. What has been added here is a 32-input NOR circuit whose output is 1 if the adder output is 0 (an all-0s detector) and an XOR gate that serves to detect overflow. Overflow in 2's-complement arithmetic occurs only when the input operands have the same sign and the adder output has the opposite sign. Therefore, denoting the signs of the inputs and output by x31, y31, and s31, the following expression for the overflow signal is easily derived:

Ovfl = x31 y31 s’31 ∨ x’31 y’31 s31


An equivalent formulation, left as an exercise, is that overflow occurs when the next-to-last carry of the adder (c31) differs from its carry-out (c32). This is how overflow detection is implemented in Figure 10.19.
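The two formulations can be checked against each other in Python (a sketch of ours; operands are given as 32-bit patterns):

def add_detect_overflow(x, y, k=32):
    mask = (1 << k) - 1
    s = (x + y) & mask
    xs, ys, ss = x >> (k - 1) & 1, y >> (k - 1) & 1, s >> (k - 1) & 1
    # Sign-based rule: operands alike in sign, sum of the opposite sign.
    ovfl_sign = (xs & ys & (1 - ss)) | ((1 - xs) & (1 - ys) & ss)
    # Carry-based rule: c31 differs from c32.
    c_out = (x + y) >> k & 1                                        # c32
    c_in_msb = ((x & mask >> 1) + (y & mask >> 1)) >> (k - 1) & 1   # c31
    assert ovfl_sign == (c_out ^ c_in_msb)
    return s, ovfl_sign

print(add_detect_overflow(0x7FFFFFFF, 1))    # (0x80000000, 1): overflow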

Next, consider the barrel shifter again (Figure 10.17). Because MiniMIPS has two classes of shift instructions, with the shift amount given either in a field of the instruction ("sh") or in a designated register (variable shift amount), a multiplexer is used to route the appropriate amount to the shifter. Note that, when the shift amount is given in a register, only the five least significant bits of that quantity are relevant (why?).

The logic unit performs one of the four bitwise logical operations AND, OR, XOR, and NOR. Its design thus consists of four gate arrays that compute these logic functions and a four-input multiplexer that chooses which set of results advances to the unit's output.

Finally, providing the MSB of the adder output as one of the inputs to the four-input multiplexer requires some explanation. The sign bit is zero-extended to form a full 32-bit word whose value is 0 if the sign bit is 0 and 1 if the sign bit is 1. This allows the ALU to be used to implement the MiniMIPS slt instruction, which requires 1 or 0 to be stored in a register depending on whether or not x < y. This condition can be tested by computing x − y and examining the sign bit of the result: if the sign bit is 1, then x < y holds and 1 should be stored in the designated register; otherwise, 0 must be stored. This is exactly what the ALU provides as its output when the "function class" signals controlling the output multiplexer are set to 01.

PROBLEMS
10.1 Design of half adders

A half adder can be realized with an AND gate and an XOR gate. Show that it can also be realized with:

a) Four two-input NAND gates and one inverter (or no inverter if the cout signal is produced in inverted form).

b) Three two-input NOR gates and two inverters (or no inverters if the x and y inputs are available in both true and complemented forms).

10.2 Ripple-carry adders

a) Assume that each FA block of Figure 10.4 is implemented as in Figure 10.3c and that three-input gates are slower than two-input gates. Draw a line representing the critical path in Figure 10.4.

b) Now suppose the FAs are implemented as in Figure 10.3b, and derive the critical path delay of Figure 10.4 in terms of mux and inverter delays.

10.3 Unit-decrement counter

Section 10.3 showed that introducing the increment value 1 via the adder's carry input simplifies the design. Show that similar simplifications are possible for unit decrement (adding −1).


10.4 Adders as versatile building blocks

A four-bit binary adder can be used to perform many logical functions besides its intended addition function. For example, the adder implementing (x3 x2 x1 x0)two + (y3 y2 y1 y0)two + c0 = (c4 s3 s2 s1 s0)two can be used as a five-input XOR to compute a ⊕ b ⊕ c ⊕ d ⊕ e by setting x0 = a, y0 = b, c0 = c (which leads to s0 = a ⊕ b ⊕ c), x1 = y1 = d (which leads to c2 = d), x2 = e, y2 = s0, and using s2 as the output. Show how a four-bit binary adder can be used as:

a) A five-input AND circuit

b) A five-input OR circuit

c) A circuit to perform the logical function of four variables ab ∨ cd

d) Two independent single-bit full adders, each with its own carry input and output.

e) A circuit that multiplies a two-bit unsigned binary number (u1 u0)two by 15

f) A five-input "parallel counter" that produces the three-bit sum of five 1-bit numbers

10.5 2's-complement numbers

Prove the following for k-bit 2's-complement numbers x and y.

a) A 1-bit arithmetic right shift of x always produces ⌊x/2⌋, regardless of the sign of x.

b) In the addition s = x + y, overflow occurs iff ck−1 ≠ ck.

c) In an adder that computes s = x + y but does not produce a carry-out signal ck, the latter can be derived externally as ck = xk−1 yk−1 ∨ s'k−1 (xk−1 ∨ yk−1).

10.6 Brent-Kung carry network

Draw a diagram similar to Figure 10.11 corresponding to the carry network of a 16-digit adder. Hint: Use the recursive construction shown in Figure 10.12.

10.7 Borrow-lookahead subtractor

Any carry network that produces the carries ci from the gi and pi signals can be used, without modification, as a borrow-propagation circuit to find the borrows bi.

a) Define the borrow-generate signal γi and the borrow-propagate signal πi for binary input operands.

b) Design a circuit to compute the difference digit di from γi, πi, and the incoming borrow bi.

10.8 Carry-lookahead incrementer

a) Section 10.3 noted that an incrementer, which computes x + 1, is much simpler than an adder, and presented the design of a ripple-carry incrementer. Design a carry-lookahead incrementer, taking advantage of any simplifications due to one operand being 0.


b) Repeat part (a) for a borrow-lookahead decrementer.

10.9 Fixed-priority arbiter

A fixed-priority arbiter has k request input lines Rk−1, . . ., R1, R0, and k grant output lines Gi. In each arbitration cycle, at most one of the grant signals is 1, and it corresponds to the highest-priority pending request; that is, Gi = 1 iff Ri = 1 and Rj = 0 for j < i.

a) Design a synchronous arbiter using ripple-carry techniques. Hint: Consider c0 = 1, along with carry propagation and annihilation rules; there is no carry generation.

b) Discuss the design of a faster arbiter using carry-lookahead techniques. Present a complete fixed-priority arbiter design for k = 64.

10.10 Lookahead with overlapping blocks

The equation [i + 1, j] ¢ [h, i] = [h, j], presented in section 10.4, symbolizes the function of the carry operator, which produces the g and p signals of the block [h, j] from those of two smaller blocks [h, i] and [i + 1, j]. Show that the carry operator still produces the correct g and p values for a block [h, j] when applied to two overlapping subblocks [h, i] and [i − a, j], where a ≥ 0.

10.11 Alternative carry-lookahead networks

A carry network for k = 2^a can be defined recursively as follows. A width-2 carry network is built from a single carry operator. To build a width-2^(h+1) carry network from two width-2^h carry networks, the last output of the lower (least significant) network is combined with each output of the upper network using a carry operator. Thus, given two width-2^h carry networks, a width-2^(h+1) carry network can be synthesized using 2^h additional carry-operator blocks.

a) Draw a width-16 carry network based on the recursive construction just defined.

b) Discuss any problems that may arise in implementing the design of part (a).

10.12 Carry-skip adders with variable-width blocks

Consider a 16-bit carry network built from four cascaded copies of the four-bit skip block of Figure 10.7.

a) Show the critical path of the resulting carry network and derive its latency in terms of two-input gate delays.

b) Show that using block widths 3, 5, 5, 3, instead of 4, 4, 4, 4, leads to a faster carry network.

c) Develop a general principle about variable-block carry-skip adders based on the result of part (b).

10.13 Wide adders built from narrower ones


Suppose you have a supply of eight-bit adders that also provide outputs for the block g and p signals. Use a minimum number of additional components to construct each of the following types of adder for a width of k = 24 bits. Compare the resulting 24-bit adders with respect to latency and cost, taking the cost and latency of a two-input gate as one unit each, and the cost and latency of an eight-bit adder as 50 and 10 units, respectively.

a) Ripple-carry adder.

b) Carry-skip adder with eight-bit blocks.

c) Carry-select adder with eight-bit blocks. Hint: The select signal for the multiplexer of the upper eight bits can be derived from c8 and the two carry outputs of the middle eight bits, obtained assuming c8 = 0 and c8 = 1.

10.14 Adders for media processing

Many media signal processing applications are characterized by narrow operands (say, eight or 16 bits wide). In such applications, two or four data items can be packed into a single 32-bit word. It would be more efficient if adders were designed to perform parallel operations on several such narrow operands at once.

a) Suppose we want to design a 32-bit adder so that, when a special Half add signal is asserted, it treats each of its two input operands as a pair of independent 16-bit unsigned integers and adds the corresponding subwords of the two operands, producing two unsigned 16-bit sums. For each of the two types of adders introduced in this chapter, show how the design can be modified to achieve this goal. Ignore overflow.

b) How can the adders of part (a) be modified to accommodate four independent eight-bit additions, in addition to the two 16-bit additions or the one 32-bit addition already allowed? Hint: Consider using a Quarter add signal that is asserted when independent eight-bit additions are desired.

c) Some applications require saturating addition, in which the output is clamped to the largest unsigned integer when overflow occurs. Discuss how the adder designs of parts (a) and (b) can be modified to perform saturating addition whenever a Saturadd control signal is asserted.

10.15 Carry completion detection

Using two carry networks, one for propagating 1 carries and one for propagating 0 carries, the completion of the carry propagation process can be detected in an asynchronous ripple-carry adder. Not having to wait out the worst-case carry propagation delay in every addition yields a simple ripple-carry adder whose average latency is competitive with that of more complex carry-lookahead adders. In a position where both operand bits are 0, a 0 carry is "generated" and is propagated in the same way as a 1 carry. The carry at position i is known when either its 1-carry or its 0-carry signal is asserted. Carry propagation is complete when the carries at all positions are known. Design a carry-completion detection adder based on the preceding discussion.


10.16 Sum of three operands

a) Address calculation in some machines may require the addition of three components: a base value, an index value, and an offset. Assuming 32-bit unsigned base and index values and a 16-bit 2's-complement offset, design a fast address-calculation circuit. Hint: See Figure 9.3a.

b) If the sum of the three address components cannot be represented as a 32-bit unsigned number, an "invalid address" exception signal must be asserted. Augment the design of part (a) to produce the required exception signal.

10.17 Unpacking bits of a number

a) Write a MiniMIPS procedure that takes a 32-bit word x as input and produces a 32-element array, starting at a given address in memory, whose elements are 0 or 1 according to the values of the corresponding bits of x. The most significant bit of x must appear first.

b) Modify the procedure of part (a) so that it also returns the number of 1s in x. This operation, called population count, is quite useful and is sometimes provided as a machine instruction.

10.18 Rotations and shifts

a) Draw the counterpart of Figure 10.15 for right and left rotations, as opposed to logical shifts.

b) Repeat part (a) for arithmetic shifts.

10.19 ALU for MiniMIPS-64

Problem 8.11 defined a backward-compatible 64-bit version of MiniMIPS. Discuss how the ALU design of Figure 10.19 should be modified for use in this new 64-bit machine.

10.20 Decimal adder

Consider the design of a 15-digit decimal adder for unsigned numbers, where each decimal digit is encoded in four bits (for a total width of 60 bits).

a) Design the circuits required to produce the carry-generate and carry-propagate signals, assuming BCD encoding of the input operands.

b) Complete the decimal adder design of part (a) by proposing a carry-lookahead circuit and a sum-computation circuit.

10.21 Multifunction ALU

For the multifunction ALU of Figure 10.19, specify the values of all the control signals for each ALU-type instruction of Table 6.2. Present your answer in tabular form, using "x" for don't-care entries.


MULTIPLIERS AND DIVIDERS


CHAPTER TOPICS

1 Shift-addition multiplication

2 Hardware multipliers

3 Programmed multiplication

4 Shift-subtraction division

5 Hardware Dividers

6 Programmed division

Multiplication and division are widely used, even in applications not commonly associated with numerical computation. Prime examples include data encryption for privacy/security, certain image compression methods, and graphic rendering. Hardware multipliers and dividers have become virtual requirements for all but the most limited processors. Here we begin with the binary shift-add multiplication algorithm, show how the algorithm is mapped onto hardware, and show how the speed of the resulting unit can be improved further. We also show how the shift-add multiplication algorithm can be programmed on a machine that has no multiply instruction. The discussion of division follows the same pattern: binary shift-subtract algorithm, hardware realization, speedup methods, and programmed division.

1. Shift-addition multiplication
The simplest machine multipliers are designed to follow a variant of the pen-and-paper multiplication algorithm shown in Figure 11.1, where each row of dots in the partial-products bit matrix is either all 0s (if the corresponding yi = 0) or the same as x (if yi = 1). When multiplying k × k numbers by hand, all k partial products are formed and the resulting k numbers are added to obtain the product p.


For machine execution, it is easier if a cumulative partial product is initialized to z(0) = 0, each row of the bit matrix is added to it as the corresponding term is generated, and the result of the addition is shifted right by one bit to achieve proper alignment with the next term, as shown in Figure 11.1. In fact, this is exactly how programmed multiplication is done on a machine that lacks a hardware multiply unit.

The recurrence equation that describes the process is:

z(j+1) = (z(j) + yj x 2^k) 2^−1, with z(0) = 0

Because, by the end, the right shifts will have caused the first partial product to be multiplied by 2^−k, x is premultiplied by 2^k to offset the effect of these right shifts. This is not a real multiplication; it is done simply to align x with the upper half of the 2k-bit cumulative partial product in the addition steps. This shift-add multiplication algorithm maps directly onto hardware, as will be seen in section 11.2. The shifting of the partial product need not be done as a separate step; it can be incorporated into the connecting wires that run from the adder output to the double-width register holding the cumulative partial product (Figure 11.5).

After k iterations, the multiplication recurrence leads to:

z(k) = xy + 2^−k z(0)

Thus, if z(0) is initialized to 2^k s (s padded with k zeros) instead of 0, the expression xy + s will be evaluated. This multiply-add operation is very useful in many applications and is obtained at no extra cost compared with plain shift-add multiplication.

Multiplication in radix r is similar, except that all occurrences of the number 2 in the above equations are replaced by r.
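A direct Python transcription of the recurrence (our sketch) shows both the basic algorithm and the multiply-add variant; the right shift never discards product bits here, because each intermediate value is exactly divisible by 2 at the point of the shift:

def shift_add_multiply(x, y, k, s=0):
    # Unsigned shift-add multiplication; initializing z to s * 2^k
    # evaluates x*y + s at no extra cost (the multiply-add above).
    z = s << k
    for j in range(k):
        yj = (y >> j) & 1                 # next multiplier bit, LSB first
        z = (z + yj * (x << k)) >> 1      # add y_j * x * 2^k, then shift
    return z

assert shift_add_multiply(10, 3, 4) == 30        # the example of Figure 11.2a
assert shift_add_multiply(10, 3, 4, s=7) == 37   # multiply-add: 10*3 + 7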

Example 11.1: Binary and decimal multiplication

a) Assuming unsigned operands, perform the binary multiplication 1010 × 0011, showing all the steps of the algorithm and the intermediate cumulative partial products.

b) Repeat part (a) for the decimal multiplication 3528 × 4067.

Solution: In what follows, the term 2z(j+1) = z(j) + yj x 2^k is computed with an extra digit at the far left to avoid losing the carry-out. This extra digit moves into the eight-digit cumulative partial product upon the right shift.

a) See Figure 11.2a. The result checks out, since it represents 30 = 10 × 3.

b) See Figure 11.2b. The only difference from the binary multiplication of part (a) is that the partial product term yj x 10^4 needs an extra digit. Thus, if the dot-notation diagram of Figure 11.1 were drawn for this example, each row of the partial-products matrix would contain five dots.


Signed-magnitude numbers can be multiplied by using the shift-add algorithm just discussed to multiply their magnitudes, with the sign of the product derived separately as the XOR of the two operand sign bits.

For 2's-complement inputs, a simple modification of the shift-add algorithm is required. Note first that, if the multiplier y is positive, the shift-add algorithm works even for a negative multiplicand x. The reason can be understood by examining Figure 11.2a, where 3x is computed by evaluating x + 2x. Now, if x were negative (in 2's-complement format), the negative values x and 2x would be added, leading to the correct result 3x. All that remains is to show how a negative 2's-complement multiplier should be handled.

Consider the multiplication of x by y = (1011)2's complement. Recall from the discussion in section 9.4 that the sign bit can be viewed as negatively weighted. Consequently, the 2's-complement number y has the value −8 + 2 + 1 = −5, whereas, as an unsigned number, it would be 8 + 2 + 1 = 11. Just as one can multiply x by y = 11 by computing 8x + 2x + x, one can multiply by y = −5 by computing −8x + 2x + x. We see that the only major difference between an unsigned multiplier and a 2's-complement multiplier is that, in the last step of the multiplication, the partial product yk−1 x must be subtracted instead of added.
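In software terms, the required change affects only the final multiplier bit, as in this Python sketch (ours; it computes the same weighted sum the algorithm accumulates, with bit patterns interpreted as 2's-complement):

def multiply_2c(x, y, k):
    # 2's-complement multiplication: identical to the unsigned scheme,
    # except that the last partial product y[k-1] * x is subtracted.
    def as_signed(u):                     # interpret a k-bit pattern
        return u - (1 << k) if u >> (k - 1) else u
    xv = as_signed(x)
    z = 0
    for j in range(k - 1):
        z += ((y >> j) & 1) * (xv << j)   # add y_j * x * 2^j
    z -= ((y >> (k - 1)) & 1) * (xv << (k - 1))   # negative sign-bit weight
    return z

assert multiply_2c(0b1010, 0b0011, 4) == -18     # (-6) * 3, Figure 11.3a
assert multiply_2c(0b1010, 0b1011, 4) == 30      # (-6) * (-5), Figure 11.3b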

Example 11.2: 2's-complement multiplication

a) Assuming 2's-complement operands, perform the binary multiplication 1010 × 0011, showing all the steps of the algorithm and the intermediate cumulative partial products.

b) Repeat part (a) for the 2's-complement multiplication 1010 × 1011.

Solution: In what follows, all partial products are written with an extra bit on the left to ensure that the sign information is preserved and handled correctly. In other words, the four-bit multiplicand x must be written as (11010)2's complement or its value would change. Also, sign extension (arithmetic shift) must be used when 2z(j+1) is shifted right to obtain z(j+1). a) See Figure 11.3a. The


result checks out, since it represents −18 = (−6) × 3. b) See Figure 11.3b. The result checks out, since it represents 30 = (−6) × (−5).

a) Positive multiplier b) Negative multiplier

Figure 11.3: Step-by-step multiplication examples for four-digit 2's-complement numbers.

2. Hardware multipliers

The shift-add algorithm of section 11.1 can be converted directly into the hardware multiplier shown in Figure 11.4. A k × k multiplication takes k cycles. In the jth cycle, x or 0 is added to the upper half of the double-width partial product, depending on bit yj of the multiplier, and the new partial product is shifted right before the start of the next cycle. The multiplexer in Figure 11.4 allows subtraction to be performed in the last cycle, as required for 2's-complement multiplication. In a multiplier used only with unsigned operands, the multiplexer can be replaced by an array of AND gates that multiply bit yj by the multiplicand x.

Instead of storing y in a separate register, as shown in Figure 11.4, it can be kept in the lower half of the partial product register. This is justified because, as the partial product expands to occupy more bits of the register's lower half, the multiplier bits are retired as they are shifted out to the right. The binary multiplication examples of Figures 11.2a and 11.3 show how the lower half of the partial product register is initially unused and then receives bits at the rate of one per cycle.

Instead of treating the shift as a task performed after the adder output has been stored in the partial product register, it can be incorporated into how the data is stored in the register. Recall from Figure 2.1 that, to store a word in a register, the bits of the word must be supplied to

the individual flip-flops that make up the register. It is easy to move the bits one position to the right before supplying them to the register and asserting the load signal. Figure 11.5 shows how the connections to the register's input data lines are made to achieve this shifted load.

Radix-2 multiplication can be implemented very directly and entails little additional hardware complexity in an ALU that already has an adder/subtractor. However, the k clock cycles required to perform a k × k multiplication make this operation much slower than addition and logical/shift operations, which take only a single clock cycle. One way to speed up multiplication hardware is to perform the multiplication in radix 4 or higher.

In base 4, each cycle consumes two bits of the multiplier and thus cuts the number of cycles in half.
Considering unsigned multiplication for simplicity, the two multiplier bits examined in a given cycle
are used to select one of four values to be added to the accumulated partial product: 0, x, 2x (x
shifted), or 3x (precomputed as the sum 2x + x and stored in a register before the iterations begin).
In real implementations, the precomputation of 3x is avoided by special recoding methods, such as
Booth recoding. In addition, the accumulated partial product is kept in carry-save form, which makes
the addition of the next partial product to it very fast (Figure 9.3a). In this way only one ordinary
addition is required in the last step, which leads to a significant speedup. The implementation details
of high-base, recoded multipliers are beyond the scope of this text, but can be found in books on
computer arithmetic [Parh00].
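
To make the base-4 idea concrete, the following Python sketch simulates unsigned shift-add
multiplication that consumes two multiplier bits per cycle; it precomputes 3x instead of using Booth
recoding, and the function and variable names are illustrative, not from the text.

def radix4_multiply(x, y, k=32):
    # Unsigned base-4 shift-add multiplication of two k-bit numbers (k even).
    # Each of the k/2 cycles consumes two multiplier bits, adds one of
    # {0, x, 2x, 3x} at position k, and shifts the partial product right twice.
    multiples = [0, x, 2 * x, 3 * x]   # 3x precomputed as 2x + x
    z = 0                              # cumulative partial product
    for _ in range(k // 2):
        digit = y & 3                  # next base-4 digit of the multiplier
        y >>= 2
        z = (z + (multiples[digit] << k)) >> 2   # add, then double right shift
    return z

# A quick check against ordinary multiplication of 4-bit operands:
assert radix4_multiply(0b1010, 0b0011, k=4) == 0b1010 * 0b0011

In a hardware realization the right shifts discard nothing, because the shifted-out bits land in the
lower half of the double-width partial product register; in this integer sketch the same effect follows
from the fact that every intermediate sum is exactly divisible by 4.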


Fast hardware multiplication units in high-performance processors are based on multiplier tree
designs. Instead of forming the partial products one at a time in base 2, or h at a time in base 2^h,
they can all be formed simultaneously, reducing the multiplication problem to an n-operand addition,
where n = k in base 2, n = k/2 in base 4, and so on. For example, a 32 × 32 multiplication becomes
either a 32-operand addition problem in base 2 or a 16-operand addition problem in base 4. In
multiplier trees, the n operands thus formed are added in two stages. In stage 1, a tree built of
carry-save adders is used to reduce the n operands to two operands that have the same sum as the
original n numbers. A carry-save adder (Figure 9.3a) reduces three values to two values, a reduction
factor of 1.5, leading to a circuit with ⌈log1.5(n/2)⌉ levels for reducing n numbers to 2. The two
numbers thus derived are added by a fast logarithmic-time adder, which leads to overall logarithmic
latency for the multiplier (Figure 11.6a).

The full-tree multiplier of Figure 11.6a is complex, and its speed may not be needed for all
applications. In such cases, cheaper partial-tree multipliers can be implemented. For example, if
about half of the partial products are accommodated by the tree part, then two passes through the
tree can be used to form the two numbers representing the desired product, with the results of the
first pass fed back to the inputs and combined with the second set of partial products (Figure 11.6b).
A partial-tree multiplier can be viewed as a (very) high-base multiplier. For example, if 12 partial
products are combined in each pass, then a base-2^12 multiplication is effectively performed.

A multiplier array uses the same two-stage computation scheme as a multiplier tree, with the
difference that the carry-save adder tree is one-sided (it has the maximum possible depth of k for a
k × k multiplication) and the final adder is of the ripple-carry type (very slow). Figure 11.7 shows
a 4 × 4 multiplier array, where the HA and FA cells are the half adders and full adders defined in
Figures 10.1 and 10.2, respectively; MA cells are modified full adders, one of whose inputs is formed
internally as the logical AND of xi and yj.


One might wonder why such a slow kind of multiplier is of any interest. The answer is that multiplier
arrays are quite suitable for VLSI realization, given their highly regular design and efficient wiring
pattern. They can also be easily pipelined by inserting latches between some of the rows of cells,
thereby allowing many multiplications to be in progress in the same hardware structure.

3. Programmed multiplication
MiniMIPS has two multiply instructions that perform two's-complement and unsigned multiplication,
respectively. They are:

mult  rs, rt   # set Hi, Lo to (rs) × (rt); signed multiply
multu rs, rt   # set Hi, Lo to (rs) × (rt); unsigned multiply

These instructions leave the double-width product in the special registers Hi and Lo, with Hi holding
the upper half and Lo the lower half of the 64-bit product. Why the product is not placed in a pair of
registers in the register file, as for other MiniMIPS R-type instructions, will become clear when
instruction execution and control sequencing are discussed in part four of this book. For now, it
suffices to say that the reason is related to the much longer latency of multiplication relative to
addition and logic/shift operations. Since the product of two 32-bit numbers is always representable
in 64 bits, no overflow can occur in either kind of multiplication.

To allow access to the contents of Hi and Lo, the following two instructions are provided for copying
their contents into desired registers in the general register file:

mfhi rd   # set rd to (Hi)
mflo rd   # set rd to (Lo)


For machine representation of these four instructions, see section 6.6.

Note that if a 32-bit product is desired, it can be retrieved from the Lo register. However, the
contents of Hi must be examined to make sure that the correct product is representable in 32 bits;
that is, that it did not exceed the range of 32-bit numbers.

Example 11.3: Using multiplication in MiniMIPS programs

Show how to compute the 32-bit product of the 32-bit signed integers stored in registers $s3 and $s7,
placing the product in register $t3. If the product is not representable in 32 bits, control must be
transferred to movfl. Assume that registers $t1 and $t2 are available for intermediate results if
necessary.

Solution: If the signed integer product is to fit in a 32-bit word, the special register Hi is expected
to hold 32 identical bits equal to the sign bit of Lo (for an unsigned multiplication, Hi would be
checked against 0). Therefore, if the value in Lo is positive, Hi must hold 0, and if the value in Lo
is negative, Hi must hold -1, whose representation is all 1s.

mult $s3, $s7          # product formed in Hi, Lo
mfhi $t2               # copy upper half of product into $t2
mflo $t3               # copy lower half of product into $t3
slt  $t1, $t3, $zero   # set (LSB of) $t1 to the sign of Lo
sll  $t1, $t1, 31      # set sign bit of $t1 to the sign of Lo
sra  $t1, $t1, 31      # set $t1 to all 0s/1s via arithmetic shift
bne  $t1, $t2, movfl   # if (Hi) ≠ ($t1), overflow has occurred

On machines that do not have a hardware-supported multiply instruction, multiplication can be
performed in software using the shift-add algorithm of section 11.1. It is instructive to develop such
a program for MiniMIPS because it helps to better understand the algorithm. In what follows, unsigned
multiplication is considered; the reader is left to develop the signed version.

Example 11.4: Shift-add multiplication of unsigned numbers

Using the shift-add algorithm, define a MiniMIPS procedure that multiplies two unsigned numbers
(passed to it in registers $a0 and $a1) and leaves the double-width product in $v0 (upper half) and
$v1 (lower half).

Solution: The following procedure, called shamu for "shift-addition multiplication", uses both
instructions and pseudoinstructions. The registers Hi and Lo, which hold the upper and lower halves
of the cumulative partial product z, are represented by $v0 and $v1, respectively. The multiplicand x
is in $a0 and the multiplier y is in $a1. The $t2 register is used as a counter that is initialized to
32 and decremented by 1 in each iteration until it reaches 0. The j-th bit of y is isolated in $t1 by
repeatedly shifting y to the right and looking at its LSB after each shift. The $t1 register is also
used to isolate the LSB of Hi so that it can be shifted into Lo during the right shift. Finally, $t0
is used to derive the carry-out of the addition, to be shifted into Hi during right shifts. The
register usage in this example is shown in Figure 11.8 for easy reference.

shamu: move $v0, $zero        # initialize Hi to 0
       move $v1, $zero        # initialize Lo to 0
       addi $t2, $zero, 32    # init repetition counter to 32
mloop: move $t0, $zero        # set carry-out to 0 in case of no add
       move $t1, $a1          # copy ($a1) into $t1
       srl  $a1, 1            # halve the unsigned value in $a1
       subu $t1, $t1, $a1     # subtract ($a1) from ($t1) twice to
       subu $t1, $t1, $a1     # obtain LSB of ($a1), or y[j], in $t1
       beqz $t1, noadd        # no addition needed if y[j] = 0
       addu $v0, $v0, $a0     # add x to upper part of z
       sltu $t0, $v0, $a0     # form carry-out of addition in $t0
noadd: move $t1, $v0          # copy ($v0) into $t1
       srl  $v0, 1            # halve the unsigned value in $v0
       subu $t1, $t1, $v0     # subtract ($v0) from ($t1) twice to
       subu $t1, $t1, $v0     # obtain LSB of Hi in $t1
       sll  $t0, $t0, 31      # carry-out converted to 1 in MSB of $t0
       addu $v0, $v0, $t0     # right-shifted $v0 corrected
       srl  $v1, 1            # halve the unsigned value in $v1
       sll  $t1, $t1, 31      # LSB of Hi converted to 1 in MSB of $t1
       addu $v1, $v1, $t1     # right-shifted $v1 corrected
       addi $t2, $t2, -1      # decrement repetition counter by 1
       bne  $t2, $zero, mloop # if counter > 0, repeat multiply loop
       jr   $ra               # return to the calling program

Note that since the carry-out of addition is not recorded in MiniMIPS, a scheme was devised to derive
it in $t0, based on the observation that the carry-out is 1 (unsigned addition overflow) if and only
if the sum is less than either operand.
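
The following Python sketch (ours, for illustration) verifies this carry-out rule for modular 32-bit
addition, mirroring what the addu/sltu pair computes:

def add_with_carry_out(a, b, bits=32):
    # Return (sum mod 2**bits, carry_out) for unsigned operands a and b.
    mask = (1 << bits) - 1
    s = (a + b) & mask          # what addu leaves in the register
    carry = 1 if s < a else 0   # the sltu trick: wraparound <=> sum < operand
    return s, carry

assert add_with_carry_out(0xFFFFFFFF, 1) == (0, 1)   # wraps around: carry = 1
assert add_with_carry_out(5, 7) == (12, 0)           # no wraparound: carry = 0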

When a multiplicand x is to be multiplied by a constant multiplier a, using the multiply instruction,
or its software version of example 11.4, may not be the best choice. Multiplication by a power of 2
can be accomplished by shifting, which is faster and more efficient than a full-blown multiply
instruction. In the same vein, multiplication by other small constants can be performed using shift
and add instructions. For example, to compute 5x, 4x can be formed by a left-shift instruction and x
then added to the result. On many machines, a shift and an add instruction together take less time
than a multiply instruction. As another example, 7x can be computed as 8x - x, although in this case
there is a danger of encountering overflow in computing 8x even though 7x itself is within the
representation range of the machine. Most modern compilers avoid emitting multiply instructions when
a short sequence of add/subtract/shift instructions can achieve the same goal.
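
As a minimal sketch of this technique (in Python, with illustrative names), the constants 5 and 7 of
the preceding paragraph can be handled as follows:

def times5(x):
    return (x << 2) + x   # 5x = 4x + x: one shift and one add

def times7(x):
    return (x << 3) - x   # 7x = 8x - x: in a k-bit register, 8x may
                          # overflow even when 7x is representable

assert times5(9) == 45 and times7(9) == 63

Python integers do not overflow, so the danger noted for 8x does not show up here; on a real machine
the intermediate 8x would have to fit in the register width.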

4. Shift-subtraction division
Like multipliers, the simplest machine dividers are designed to follow a variant of the pen-and-paper
algorithm shown in Figure 11.9. Each row of dots in the subtracted bit matrix of Figure 11.9 is
either all 0s (if the corresponding yi = 0) or equal to x (if yi = 1). When a 2k/k division is performed
manually, the subtracted terms are formed one at a time by "guessing" the value of the next quotient
digit, subtracting the appropriate term (0 or a suitably shifted version of x) from the partial
remainder, which is initialized to the value of the dividend z, and proceeding until all k bits of the
quotient y have been determined. At that point, the partial remainder becomes the final remainder s.


For hardware or software implementation, a recurrence equation that describes the above process is
used:

z^(j) = 2 z^(j-1) - y_(k-j) (2^k x),  with z^(0) = z

Since, by the time the process is done, the left shifts will have caused the partial remainder to be
multiplied by 2^k, the true remainder is obtained by multiplying the final partial remainder by 2^(-k)
(shifting it to the right by k bits). This is justified by unwinding the recurrence:

z^(k) = 2^k z - (2^k x)(y_(k-1) 2^(k-1) + ... + y_1 · 2 + y_0) = 2^k (z - x y) = 2^k s
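
A minimal Python sketch of this shift-subtract recurrence for unsigned restoring division follows
(the function and names are ours; the decision to keep or discard the trial difference anticipates
the restoring divider of section 11.5):

def restoring_divide(z, x, k=32):
    # Unsigned restoring division of a 2k-bit dividend z by a k-bit divisor x.
    # Returns (quotient, remainder); assumes no overflow, i.e. z < x * 2**k.
    assert 0 < x and z < (x << k), "quotient would not fit in k bits"
    y = 0                         # quotient bits, developed MSB first
    for _ in range(k):
        z <<= 1                   # shift the partial remainder left
        trial = z - (x << k)      # trial difference: subtract 2**k * x
        if trial >= 0:
            z = trial             # keep the difference: quotient bit is 1
            y = (y << 1) | 1
        else:
            y <<= 1               # restore (leave z unchanged): bit is 0
    return y, z >> k              # final remainder is scaled by 2**(-k)

# 117/10 yields quotient 11 and remainder 7, as in Figure 11.10a:
assert restoring_divide(117, 10, k=4) == (11, 7)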

As in the case of partial products in multiplication, the partial remainder shift does not need to be
performed as a separate step; it can be incorporated into the connecting wires that run from the adder
output to the double-width register holding the partial remainder (section 11.5).

Division in base r is similar to binary division, except that all occurrences of the number 2 in the
equations are replaced by r.

Example 11.5: Integer and fractional division

a) Assuming unsigned operands, perform the binary division 0111 0101 / 1010, showing all steps of the
algorithm and the partial remainders.

b) Repeat part (a) for the fractional decimal division .1435 1502 / .4067.

Solution: In what follows, the term 2z^(j) is formed with an additional digit at the far left to
prevent the extra digit created by the doubling from being lost.

(a) See Figure 11.10a. The results are verified by observing that they represent 117 = 10 × 11 + 7.

(b) See Figure 11.10b. The only differences from the binary division of part (a) are that the divisor
x is not pre-multiplied by a power of the base and that the subtracted terms yjx need an additional
digit. Therefore, if one wanted to draw the dot-notation diagram of Figure 11.9 for this example, each
row of the subtracted bit matrix would contain five dots.


a) Integer binary b) Fractional decimal

Figure 11.10 Step-by-step examples of division for unsigned binary integers and 8/4-digit decimal
fractional numbers.

Example 11.6: Division with operands of the same width

Often, z/x division is performed with operands of the same width, instead of z being twice as wide as
x. The algorithm, however, is the same.

a) Assuming unsigned operands, perform the binary division 1101/0101, showing all steps of the
algorithm and the partial remainders.

b) Repeat part (a) for the fractional binary division .0101/.1101.

Solution:

a) See Figure 11.11a. Note that because z is only four bits wide, the left shift can never produce
a nonzero bit in position 8; so, unlike the example of Figure 11.10, an extra bit is not needed
in position 8. The results are verified by noting that they represent 13 = 5 × 2 + 3.

b) See Figure 11.11b. The only differences from the binary division of part (a) are that the
divisor x is not pre-multiplied by a power of the base and that both the shifted partial remainders
and the subtracted terms y_(-j) x need an additional bit on the left (position 0). The results
are verified by observing that they represent 5/16 = (13/16) × (6/16) + 2/256.

Comparing Figures 11.11b and 11.10b, one notes that in Figure 11.11b the positions from -5 to -8 are
not used, except in the final remainder s. Because in fractional division the remainder is usually not
of interest, the positions to the right of the LSBs of the operands can be ignored altogether, both in
manual calculation and in hardware implementation.


a) Integer binary b) Fractional binary

Figure 11.11 Step-by-step examples of division for unsigned binary integers and 4/4-digit fractions.

Since the quotient of dividing a 2k-digit number by a k-digit number may not fit in k digits, division
can lead to overflow. Fortunately, there are simple tests for detecting overflow before division. In
integer division, overflow will not occur if z < 2^k x, as this guarantees y < 2^k; that is, x is
required to be strictly greater than the upper half of z. In fractional division, avoiding overflow
requires z < x. So, in either case, the upper half of the double-width dividend z must be less than
the divisor x.

When signed numbers are divided, the remainder s is defined to have the same sign as the dividend z
and a magnitude that is less than |x|. Consider the following examples of integer division with all
possible combinations of signs for z and x:

z = 5,  x = 3   ⇒  y = 1,  s = 2
z = 5,  x = -3  ⇒  y = -1, s = 2
z = -5, x = 3   ⇒  y = -1, s = -2
z = -5, x = -3  ⇒  y = 1,  s = -2

From these examples it is seen that the magnitudes of the quotient y and the remainder s are
unaffected by the input signs and that the signs of y and s are easily derived from those of z and x.
Therefore, one way to do signed division is through an indirect algorithm that converts the operands
into unsigned values and, at the end, accounts for the signs by adjusting the sign bits or via
complementation. This is the method of choice with the restoring division algorithm, discussed in this
section. Direct division of signed numbers is possible using the nonrestoring division algorithm, the
discussion of which is not covered in this book. Interested readers can refer to any text on computer
arithmetic [Parh00].
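
A sketch of this indirect method, building on the restoring_divide() function given earlier (again,
illustrative Python rather than anything from the text):

def signed_divide(z, x, k=32):
    # Signed division via the indirect method: divide the magnitudes,
    # then fix the signs; the remainder takes the sign of the dividend z.
    y_mag, s_mag = restoring_divide(abs(z), abs(x), k)
    y = y_mag if (z < 0) == (x < 0) else -y_mag   # quotient sign: XOR of signs
    s = s_mag if z >= 0 else -s_mag               # remainder sign: sign of z
    return y, s

assert signed_divide(5, -3, k=4) == (-1, 2)    # (-3) * (-1) + 2 = 5
assert signed_divide(-5, 3, k=4) == (-1, -2)   # 3 * (-1) + (-2) = -5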

5. Hardware dividers
The shift-subtraction algorithm of section 11.4 can be converted into the hardware divider shown in
Figure 11.12. There are k cycles for a 2k/k or k/k division. In the j-th cycle, x is subtracted from
the upper half of the double-width partial remainder. The result, known as the trial difference, is
loaded into the partial remainder register only if it is positive. The sign of the trial difference
indicates whether 1 is the correct choice for the next quotient digit y_(k-j) or is too large (in
which case 0 is the correct choice). Note that the trial difference is positive if the MSB of the
previous partial remainder is 1, which means that the shifted partial remainder is large enough to
guarantee a positive difference, or otherwise if the adder's carry-out is 1. This is the reason for
supplying these two bits to the quotient digit selector block.

The multiplexer in Figure 11.12 allows either addition or subtraction to be performed in each cycle.
Addition of x is never required by the restoring division algorithm presented in section 11.4. The
adjective "restoring" means that whenever the trial difference becomes negative, this is taken as an
indication that the next quotient digit is 0, and the trial difference is not loaded into the partial
remainder register, so the original value remains intact at the end of the cycle. In nonrestoring
division, which is not covered in this book, the computed difference, positive or negative, is stored
as the partial remainder, which therefore is not restored to its correct value when it becomes
negative. However, appropriate actions in subsequent steps lead to the correct result. These
corrective actions require x to be added to, rather than subtracted from, the partial remainder. The
nonrestoring approach makes the control circuit simpler and has the side benefit of allowing direct
signed division.


Instead of inserting the bits of y into a separate register, as shown in Figure 11.12, they can be
inserted into the lower half of the partial remainder register. This is possible because, as the
partial remainder is shifted to the left, the bit positions at the far right of the register are
vacated and can be used to store the quotient bits as they are developed. The division examples in
Figure 11.10 clearly show how the lower half of the partial remainder register, fully occupied at the
beginning, is vacated at the rate of one bit/digit per cycle.

Instead of treating the shift as a task to be performed after the adder output is stored in the
partial remainder register, the shift can be incorporated into how the data is stored in the register.
Remember from Figure 2.1 that, to store a word in a register, the bits of the word must be supplied to
the set of flip-flops that comprise the register. It is easy to move the bits one position to the left
before supplying them to the register and asserting the load signal. Figure 11.13 shows how
connections are made to the register's data input lines to achieve this left-shifted loading.

A comparison of Figures 11.4 and 11.12 reveals that multipliers and dividers are quite similar and can
be implemented with shared hardware within an ALU that executes different operations based on an
externally supplied function code. In fact, because square-rooting is very similar to division, it is
common to implement a single unit that performs multiplication, division, and square-root operations.

As in the case of multipliers, high-base dividers speed up the division process by producing several
bits of the quotient in each cycle. For example, in base 4 each operation cycle generates two bits of
the quotient, thereby cutting the number of cycles in half. However, a complication makes the process
somewhat harder than base-4 multiplication: since digits in base 4 have four possible values, one
cannot use a selection scheme like that of base 2 (i.e., try 1 and choose 0 if it does not work) for
selecting the quotient digit in base 4. The details of how this problem is solved, and of how, as for
partial products in multiplication, the partial remainder in division can be kept in carry-save form,
are not covered in this book.

While there is no counterpart to tree multipliers for performing division, divider arrays do exist and
are structurally similar to multiplier arrays. Figure 11.14 shows a divider array that divides an
eight-bit dividend z by a four-bit divisor x, producing a four-bit quotient y and a four-bit
remainder s.


The MS cells in Figure 11.14 are modified full subtractors. Each receives a borrow input bit from the
right and two operand bits from above, subtracts the vertical operand from the diagonal operand, and
produces a difference bit that goes out diagonally downward and a borrow output bit that goes to the
left. A multiplexer at the diagonal output passes the difference along if yi, presented to the cells
as a horizontal control line coming from the left, is 1, and passes the diagonal input unchanged if
yi = 0. The latter serves the function of either using the trial difference as the next partial
remainder or restoring the partial remainder to its previous value.

It is also possible to perform division with a sequence of multiplications instead of additions. Even
though multiplications are slower and more complex than additions, an advantage can be gained over
additive schemes because only a few multiplications are needed to perform a division. Thus, in such
convergence division algorithms, greater complexity per iteration is traded for fewer iterations. In
convergence division, the quotient y = z/x is derived by first computing 1/x and then multiplying it
by z. To compute 1/x, one starts with a rough approximation obtained from a few high-order bits of x,
using a specially designed logic circuit or a small lookup table. The initial approximation u^(0) will
have a small relative error of, say, ε, meaning that u^(0) = (1 + ε)(1/x). Then successively better
approximations to 1/x are obtained by using the current approximation u^(i) in the formula

u^(i+1) = u^(i) × (2 - u^(i) × x)

Thus, each iteration to refine the value of 1/x involves two multiplications and one subtraction. If
the initial approximation is accurate to, say, eight bits, the next one will be accurate to 16 bits
and the one after that to 32 bits. Therefore, two to four iterations suffice in practice. Details of
implementing such convergence division schemes, and of analyzing the number of iterations needed, can
be found in books on computer arithmetic [Parh00].
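
The following Python sketch illustrates convergence division under the assumption that the divisor has
been normalized to the range [1, 2), so that the fixed initial approximation 0.75 is good enough to
converge; floating point stands in for the fixed-point arithmetic a real divider would use:

def convergence_divide(z, x, iterations=4):
    # Compute z/x by refining u toward 1/x, then multiplying z * u.
    # Each iteration costs two multiplications and one subtraction and
    # roughly doubles the number of correct bits (quadratic convergence).
    u = 0.75                      # rough initial approximation to 1/x,
                                  # as if read from a small lookup table
    for _ in range(iterations):
        u = u * (2 - u * x)       # u(i+1) = u(i) * (2 - u(i) * x)
    return z * u

print(convergence_divide(2.0, 1.5))   # approximately 1.3333... (= 2/1.5)

For x = 1.5, the successive values of u are 0.75, 0.65625, 0.6665..., closing in on 1/1.5 = 0.6667
with the relative error squared at each step.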

6. Programmed division
MiniMIPS has two divide instructions that perform two's-complement and unsigned division,
respectively. They are:

div  rs, rt   # Lo ← quotient of (rs)/(rt); Hi ← remainder; signed divide
divu rs, rt   # Lo ← quotient of (rs)/(rt); Hi ← remainder; unsigned divide

For the machine representation of the preceding instructions, see section 6.6. These instructions
leave the quotient and the remainder in the special registers Lo and Hi, respectively. Note that,
unlike much of the discussion in sections 11.4 and 11.5, the dividend for the MiniMIPS divide
instructions is single-width. For this reason, the quotient, which is never larger in magnitude than
the dividend, is guaranteed to fit in one word. As in the case of multiplication, once the results are
in Hi and Lo, they can be copied into general registers, using the mfhi and mflo instructions, for
further processing.

Why the quotient and the remainder are not placed in the general register file, as for other MiniMIPS
R-type instructions, will become clear when instruction execution and control sequencing are discussed
in part four of the book. For now, it suffices to say that the reason is related to the much longer
latency of division compared with addition and logic/shift operations.

Example 11.7: Using division in MiniMIPS programs


Show how to obtain the residue of z modulo x (z mod x), where z and x are 32-bit signed integers in
registers $s3 and $s7, respectively. The result must be placed in register $t2. Note that, for signed
numbers, the residue operation is different from the remainder of division: whereas the sign of the
remainder is defined to equal that of the dividend z, residues by definition are always positive.
Assume that register $t1 is available for intermediate results if necessary.

Solution: The result of z mod x is a positive integer s_pos satisfying s_pos < |x| and
z = x × y + s_pos for some integer y. It is easy to see that s_pos is the same as the remainder s of
the division z/x when s ≥ 0, and can be obtained as s + |x| when s < 0.

       div  $s3, $s7          # remainder formed in Hi
       mfhi $t2               # copy remainder into $t2
       bgez $t2, done         # a positive remainder is the residue
       move $t1, $s7          # copy x into $t1; this is |x| if x ≥ 0
       bgez $s7, noneg        # |x| is in $t1; no negation needed
       sub  $t1, $zero, $s7   # put -x in $t1; this is |x| if x < 0
noneg: add  $t2, $t2, $t1     # residue computed as s + |x|
done:  ...

On machines that do not have a hardware-supported divide instruction, division can be performed in
software using the shift-subtraction algorithm discussed in section 11.4. It is instructive to develop
such a program for MiniMIPS because it helps to better understand the algorithm. In what follows,
unsigned division is considered; the reader is left to develop the signed version.

Example 11.8: Shift-subtraction division of unsigned numbers

Use the shift-subtraction algorithm to define a MiniMIPS procedure that performs the unsigned division
z/x, with z and x passed to it in registers $a2-$a3 (upper and lower halves of the double-width
integer z) and $a0, respectively. Results are returned in $v0 (remainder) and $v1 (quotient).

Solution: The following procedure, called shsdi for "shift-subtraction division", uses both
instructions and pseudoinstructions. The registers Hi and Lo, which hold the upper and lower halves
of the partial remainder z, are represented by $v0 and $v1, respectively. The divisor x is in $a0 and
the quotient y is formed in $a1. The (k - j)-th bit of y is formed in $t1 and then added to a
left-shifted $a1, effectively inserting the next quotient bit as the LSB of $a1. The $t1 register is
also used to isolate the MSB of Lo so that it can be shifted into Hi during the left shift. Register
$t0 holds the MSB of Hi during the left shift, so that it can be used to choose the next quotient
digit. Register $t2 is used as a counter that is initialized to 32 and decremented by 1 in each
iteration until it reaches 0. The register usage in this example is shown in Figure 11.15 for easy
reference.


shsdi: move $v0, $a2          # initialize Hi to ($a2)
       move $v1, $a3          # initialize Lo to ($a3)
       addi $t2, $zero, 32    # initialize repetition counter to 32
dloop: slt  $t0, $v0, $zero   # copy MSB of Hi into $t0
       sll  $v0, $v0, 1       # left-shift the Hi part of z
       slt  $t1, $v1, $zero   # copy MSB of Lo into $t1
       or   $v0, $v0, $t1     # move MSB of Lo into LSB of Hi
       sll  $v1, $v1, 1       # left-shift the Lo part of z
       sge  $t1, $v0, $a0     # quotient digit is 1 if (Hi) ≥ x,
       or   $t1, $t1, $t0     # or if MSB of Hi was 1 before shift
       sll  $a1, $a1, 1       # shift y to make room for new digit
       or   $a1, $a1, $t1     # copy y[k-j] into LSB of $a1
       beq  $t1, $zero, nosub # if y[k-j] = 0, do not subtract
       subu $v0, $v0, $a0     # subtract divisor x from Hi part of z
nosub: addi $t2, $t2, -1      # decrement repetition counter by 1
       bne  $t2, $zero, dloop # if counter > 0, repeat division loop
       move $v1, $a1          # copy the quotient y into $v1
       jr   $ra               # return to the calling program


Note that in the subtraction of x from the Hi part of z, the most significant bit of the partial
remainder, which is in $t0, is ignored. This does not create a problem because the subtraction is
performed only if the partial remainder z (including its hidden MSB) is not less than the divisor x.

Division by powers of 2 can be accomplished by shifting, which tends to be a much faster operation
than division on any computer. Just as one can avoid using a multiply instruction or software routine
for certain small constant multipliers (last paragraph of section 11.3), one can divide numbers by
small constant divisors with the use of a few shift/add/subtract instructions. The theory for deriving
the required steps is a little more complicated than that used for multiplication and is not discussed
in this book. Most modern compilers are smart enough to recognize opportunities to avoid using a slow
divide instruction. Optimizations that avoid division operations are even more important than those
that eliminate multiplications, because on modern microprocessors division is four to ten times slower
than multiplication [Etie02].
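
A small Python sketch of shift-based division by powers of 2 (illustrative; the bias trick shown for
negative dividends is the kind of correction such compilers emit instead of a divide instruction):

def divide_by_pow2_unsigned(z, k):
    return z >> k                 # exact for unsigned z: z // 2**k

def divide_by_pow2_signed(z, k):
    # Truncating signed division by 2**k. A plain arithmetic right shift
    # rounds toward minus infinity, so negative dividends must first be
    # biased by 2**k - 1 to obtain rounding toward zero, as div would.
    if z < 0:
        z += (1 << k) - 1
    return z >> k

assert divide_by_pow2_signed(-7, 2) == -1   # -7/4 truncates to -1, not -2
assert divide_by_pow2_signed(8, 2) == 2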

PROBLEMS
1) Multiplication algorithm

Draw dot diagrams similar to Figure 11.1 for the following variations:

(a) Unsigned decimal integer multiplication of 8 × 4 digits. Hint: the number of dots will change.

(b) Unsigned binary fractional multiplication of 8 × 4 bits. Hint: the alignment of the dots will
change.

(c) Unsigned decimal fractional multiplication of 8 × 4 digits.

2) Unsigned multiplication

Multiply the following four-bit unsigned binary numbers. Present your work in the format of Figure
11.2a.

a) x = 1001 and y = 0101
b) x = 1101 and y = 1011
c) x = .1001 and y = .0101

3) Unsigned multiplication

Multiply the following four-digit unsigned decimal numbers. Present your work in the format of Figure
11.2b.

a) x = 8765 and y = 4321
b) x = 1234 and y = 5678
c) x = .8765 and y = .4321


4) Two's-complement multiplication

Represent the following signed-magnitude binary numbers in five-bit two's-complement format and then
multiply them. Present your work in the format of Figure 11.3.

a) x = +.1001 and y = +.0101
b) x = +.1001 and y = -.0101
c) x = -.1001 and y = +.0101
d) x = -.1001 and y = -.0101

5) Multiplication algorithm

a) Redo the multiplication steps of Figure 11.2 and verify that if the cumulative partial product is
initialized to 1011 instead of 0000, a multiply-add operation is effectively performed.

b) Show that, regardless of the initial value of z(0), the multiply-add result with k-digit operands
is always representable in 2k digits.

6) Multiplier array

a) In the multiplier array in Figure 11.7, label the input lines with bit values corresponding to the
multiplication example in Figure 11.2a. Then determine in the diagram all intermediate and output
signal values and verify that the correct product is obtained.

b) Repeat part (a) for x = 1001 and y = 0101.

c) Repeat part (a) for x = 1101 and y = 1011.

d) Will the multiplier produce the correct product for fractional inputs (e.g., x = 0.1001 and y = 0.0101)?

e) Show how the multiply-add operation can be performed on the multiplier array.

f) Check your method from part e) using the specific input values in problem 11.5a.

7) Multiplication with left shifts

The shift-add multiplication algorithm presented in section 11.1 and exemplified in Figure 11.2
corresponds to processing the rows of the partial product bit matrix (Figure 11.1) from top to bottom.
An alternative multiplication algorithm goes in the opposite direction (bottom-up) and requires a left
shift of the partial product before each addition step.

a) Formulate this new shift-add multiplication algorithm in the form of a recurrence equation.

b) Apply your algorithm to the examples in Figure 11.2 and verify that it produces the correct
answers.

c) Compare the new left-shift-based algorithm to the original right-shift algorithm with regard to
hardware implementation.


8) Multiplication in MiniMIPS

Can you accomplish the task performed in example 11.3 without using any register other than $t3 and
without changing the operand registers?

9) Programmed multiplication

Section 11.2 stated that the registers holding the multiplier y and the lower half of the cumulative
partial product z can be combined in Figure 11.4. Is a similar combination of registers $a1 and $v1
(Figure 11.8) beneficial in example 11.4? Justify your answer by presenting an improved procedure or
by demonstrating that the combination complicates the multiplication process.

10) Multiplication by constants

Using only shift and add/subtract instructions, devise efficient procedures for multiplication by each
of the following constants. Assume 32-bit unsigned operands and make sure that intermediate results do
not exceed the range of 32-bit signed numbers. Do not modify the operand registers and use only one
other register for partial results.

a) 13
b) 43
c) 63
d) 135

11) Multiplication by constants


a) Intel's 64-bit architecture (IA-64) has a special shladd (shift left and add) instruction that can
shift one operand to the left by one to four bits before adding it to another operand. This allows
multiplication by, say, 5 to be performed with one shladd instruction. What other constant
multiplications can be performed using a single shladd instruction?

b) Repeat part (a) for shlsub (shift left and subtract), assuming that the first operand is the
shifted one.

c) What is the smallest positive constant multiplier for which at least three of the instructions
defined in parts (a) and (b) are required?

12) Division algorithm

Draw dot diagrams similar to that of Figure 11.9 for the following variations:

a) 8/4-digit unsigned decimal integer division. Hint: the number of dots will change.

b) 8/4-bit unsigned binary fractional division. Hint: the alignment of the dots will change.


c) 8/4-digit unsigned decimal fractional division.

13) Unsigned division

Perform the division z/x for the following unsigned binary dividend/divisor pairs, obtaining the
quotient y and the remainder s. Present your work in the format of Figures 11.10 and 11.11.

a) z = 0101 and x = 1001
b) z = .0101 and x = .1001
c) z = 1001 0100 and x = 1101
d) z = .1001 0100 and x = .1101

14) Unsigned division

Perform the division z/x for the following unsigned decimal dividend/divisor pairs, obtaining the
quotient y and the remainder s. Present your work in the format of Figures 11.10 and 11.11.

a) z = 5678 and x = 0103
b) z = .5678 and x = .0103
c) z = 1234 5678 and x = 4321
d) z = .1234 5678 and x = .4321

15) Signed division

Perform the division z/x for the following signed binary dividend/divisor pairs, obtaining the
quotient y and the remainder s. Present your work in the format of Figures 11.10 and 11.11.

a) z = +0101 and x = -1001
b) z = -.0101 and x = -.1001
c) z = +1001 0100 and x = -1101
d) z = -.1001 0100 and x = +.1101

16) Divider array

a) In the divider array of Figure 11.14, label the input lines with the bit values corresponding to
the division example of Figure 11.10a. Then determine all intermediate and output signal values in the
diagram and verify that the correct quotient and remainder are obtained.

b) Show that the OR gates on the left edge of the divider array in Figure 11.14 can be replaced by MS
cells, which leads to a more uniform structure.

c) Rotate the divider array 90 degrees counterclockwise, note that the structure is similar to the
multiplier array, and suggest how the two circuits can be combined to obtain a circuit that multiplies
and divides according to the state of the Mul'Div signal.


17) Convergence division analysis

At the end of section 11.5, a convergence scheme for division was presented, based on iteratively
refining an approximation to 1/x until a good approximation u ≅ 1/x is obtained, and then computing
q = z/x by multiplying z × u. Show that the refinement method for the approximation u using the
recurrence u(i+1) = u(i) × (2 - u(i) × x) has quadratic convergence in the sense that, if
u(i) = (1 + ε)(1/x), then u(i+1) = (1 - ε²)(1/x). Then start with the approximation 0.75 for the
reciprocal of 1.5, derive successive approximations based on the recurrence just given (use decimal
arithmetic and a calculator), and verify that the error is indeed reduced quadratically in each step.

18) Programmed division

a) Describe how the unsigned programmed division procedure of example 11.8 can be used to perform a
division in which both operands are 32 bits wide.

b) Would any simplification result in the procedure from the knowledge that it will always be used
with a single-width (32-bit) dividend z?

c) Modify the procedure in example 11.8 to perform signed division.

19) Programmed division

Section 11.5 stated that the registers holding the quotient y and the lower half of the partial
remainder z can be combined in Figure 11.12. Is a similar combination of registers $a1 and $v1
beneficial in example 11.8? Justify your answer by presenting an improved procedure or by
demonstrating that the combination complicates the division process.

20) Division by 255

If z/255 = y, then y = 256y - z. This observation leads to a procedure for dividing a number z by 255
with the use of one subtraction per byte of the quotient. Since 256y ends in eight 0s, the least
significant byte of y is obtained by a byte-wide subtraction, whose borrow is saved. The lowest byte
of y thus obtained is also the second lowest byte of 256y; this leads to the determination of the
second lowest byte of y by another subtraction, in which the saved borrow is used as the borrow-in.
The process continues until all bytes of y have been found. Write a MiniMIPS procedure to implement
this division-by-255 algorithm without any multiply or divide instructions.


UNIT 4
INSTRUCTION EXECUTION STEPS
CHAPTER TOPICS

1 A small set of instructions

2 The instruction execution unit

3 A single-cycle data path

4 Branching and jumping

5 Deriving the control signals

6 Performance of the single-cycle design

Simpler digital computers execute instructions one after another, following a flow of control from one
completed instruction to the next in sequence, unless explicitly directed to alter this flow (by a
branch or jump instruction) or to terminate instruction execution. Thus, such a computer can be viewed
as being in a loop, where each iteration leads to the completion of one instruction. Part of the
instruction execution process, which begins with fetching the instruction from memory at the address
specified by the program counter (PC), is determining where the next instruction is located and
updating the program counter accordingly. This chapter examines the steps of executing instructions on
a simple computer. Later it will be seen that, to achieve greater performance, this approach must be
substantially modified.

1. A small set of instructions


The CPU data path and control unit designs presented in Chapters 13 through 16 are based on the
MiniMIPS instruction set architecture introduced in Part Two. To make the design problem manageable
and the diagrams and tables less confusing, the hardware implementations are based on a 22-instruction
version of MiniMIPS, called "MicroMIPS". The MicroMIPS instruction set is identical to the one shown
in Table 5.1 at the end of Chapter 5, with the only difference being the inclusion of the jal and
syscall instructions from Table 6.2. This last pair of instructions turns MicroMIPS into a full-fledged
computer that can run simple, yet useful, programs. To make this chapter self-contained and easy to
reference, Table 13.1 presents the complete MicroMIPS instruction set. The handling of the other
MiniMIPS instructions, given in Table 6.2 at the end of Chapter 6, forms the subject of many
end-of-chapter problems.

Also for reference, Figure 13.1 reproduces a more compact version of Figure 5.4, showing the R, I, and
J instruction formats and their various fields. Remember that in arithmetic and logical instructions
with two register source operands, rd specifies the destination register. For ALU instructions with an
immediate operand, and for the load word instruction, the rd field becomes part of the 16-bit
immediate operand; in this case, rt designates the destination register. Note that, because MicroMIPS
has no shift instruction, the sh field is not used.


TABLE 13.1 MicroMIPS instruction set.*

Class          Instruction               Usage               Meaning                               op  fn

Copy           Load upper immediate      lui rt,imm          rt ← (imm, 0x0000)                    15

Arithmetic     Add                       add rd,rs,rt        rd ← (rs) + (rt); with overflow        0  32
               Subtract                  sub rd,rs,rt        rd ← (rs) - (rt); with overflow        0  34
               Set less than             slt rd,rs,rt        rd ← if (rs) < (rt) then 1 else 0      0  42
               Set less than immediate   slti rt,rs,imm      rt ← if (rs) < imm then 1 else 0      10
               Add immediate             addi rt,rs,imm      rt ← (rs) + imm; with overflow         8

Logic          AND                       and rd,rs,rt        rd ← (rs) ∧ (rt)                       0  36
               OR                        or rd,rs,rt         rd ← (rs) ∨ (rt)                       0  37
               XOR                       xor rd,rs,rt        rd ← (rs) ⊕ (rt)                       0  38
               NOR                       nor rd,rs,rt        rd ← ((rs) ∨ (rt))'                    0  39
               AND immediate             andi rt,rs,imm      rt ← (rs) ∧ imm                       12
               OR immediate              ori rt,rs,imm       rt ← (rs) ∨ imm                       13
               XOR immediate             xori rt,rs,imm      rt ← (rs) ⊕ imm                       14

Memory         Load word                 lw rt,imm(rs)       rt ← mem[(rs) + imm]                  35
access         Store word                sw rt,imm(rs)       mem[(rs) + imm] ← (rt)                43

Transfer of    Jump                      j L                 go to L                                2
control        Jump and link             jal L               go to L; $31 ← (PC) + 4                3
               Jump register             jr rs               go to (rs)                             0   8
               Branch less than 0        bltz rs,L           if (rs) < 0 then go to L               1
               Branch equal              beq rs,rt,L         if (rs) = (rt) then go to L            4
               Branch not equal          bne rs,rt,L         if (rs) ≠ (rt) then go to L            5
               System call               syscall             See section 7.6 (Table 7.2)            0  12

*Note: Except for the jal and syscall instructions, which make MicroMIPS a complete computer, this
table is the same as Table 5.1.

The instructions in Table 13.1 can be divided into five categories with respect to the steps that are
necessary for their execution:

Seven R-format ALU instructions (add, sub, slt, and, or, xor, nor)

Six ALU instructions format I (lui, addi, slti, andi, ori, xori)

Two memory access instructions format I (lw, sw)

Three format I conditional branch instructions (bltz, beq, bne)

Four unconditional jump instructions (j, jr, jal, syscall)


The seven R-format ALU instructions have the following common execution sequence:

1. Read the contents of the source registers rs and rt, and forward them as inputs to the ALU.

2. Tell the ALU what kind of operation to perform.

3. Write the ALU output into the destination register rd.

Five of the six I-format ALU instructions require steps similar to the three above, except that the
contents of rs and the immediate value in the instruction are forwarded as the ALU inputs and the
result is stored in rt (rather than rd). The only exception is lui, which requires only the immediate
operand; but even in this case, reading the contents of rs does no harm, as the ALU can ignore it. The
preceding discussion covers 13 of the 22 MicroMIPS instructions.

The execution sequence for the two I-format memory access instructions is as follows:

1. Read the contents of rs.

2. Add the number read from rs to the immediate value in the instruction to form a memory address.

3. Read from, or write to, memory at the specified address.

4. In the case of lw, place the word read from memory in rt.

Note that the first two steps of this sequence are identical to those of the I-format ALU instructions,
as is the last step of lw, which involves writing a value into rt (only, the data read from memory is
written instead of a result computed by the ALU).

The final set of instructions in Table 13.1 deals with transferring control, conditionally or
unconditionally, to an instruction other than the next one in sequence. Remember that the branch
destination address is specified by an offset relative to the incremented program counter value, or
(PC) + 4. Therefore, if a branch is to conditionally skip over the next instruction, the offset value
+1 appears in the immediate field of the branch instruction; this is because the offset is specified
in words relative to the memory address (PC) + 4. On the other hand, for a backward branch to the
preceding instruction, the offset value provided in the immediate field of the instruction will be -2,
resulting in the branch target address (PC) + 4 - 2 × 4 = (PC) - 4.


For two of the three branch instructions (beq, bne), the contents of rs and rt are compared to
determine whether the branch condition is satisfied. If the condition holds, the immediate field is
added to (PC) + 4 and the result is written back to PC; otherwise, (PC) + 4 is written back to PC. The
remaining branch instruction, bltz, is similar, except that the branch decision is based on the sign
bit of the contents of rs rather than on a comparison of the contents of two registers. For all four
jump instructions, PC is unconditionally modified so that the next instruction is read from the jump
target address. The jump target address comes from the instruction itself (j, jal), is read from the
register rs (jr), or is a known constant associated with the location of an operating system routine
(syscall). Note that although syscall is in effect a jump instruction, it has the R format.
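
The five execution patterns above can be summarized in a small interpreter sketch (Python, with a
word-addressed PC and only one or two representatives per category; the encoding of instructions as
tuples is ours, not MicroMIPS machine code):

def step(pc, reg, mem, instr):
    # Execute one instruction; return the PC of the next instruction.
    op = instr[0]
    if op in ("add", "sub", "slt"):              # R-format ALU
        _, rd, rs, rt = instr
        reg[rd] = {"add": reg[rs] + reg[rt],
                   "sub": reg[rs] - reg[rt],
                   "slt": int(reg[rs] < reg[rt])}[op]
    elif op == "addi":                           # I-format ALU
        _, rt, rs, imm = instr
        reg[rt] = reg[rs] + imm
    elif op == "lw":                             # memory access (load)
        _, rt, rs, imm = instr
        reg[rt] = mem[reg[rs] + imm]
    elif op == "sw":                             # memory access (store)
        _, rt, rs, imm = instr
        mem[reg[rs] + imm] = reg[rt]
    elif op == "beq":                            # conditional branch
        _, rs, rt, offset = instr
        return pc + 1 + (offset if reg[rs] == reg[rt] else 0)
    elif op == "j":                              # unconditional jump
        return instr[1]
    return pc + 1                                # default: next in sequence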

2. The instruction execution unit

MicroMIPS instructions can be executed on a hardware unit such as the one whose structure and
components are shown in Figure 13.2. Starting at the far left, the contents of the program counter
(PC) are supplied to the instruction cache and an instruction word is read from the specified
location. Chapter 18 will discuss cache memories; for now, view the instruction and data caches as
extremely fast, small SRAM memory units that can keep pace with the other components in Figure 13.2.
As shown in Figure 2.10, an SRAM memory unit has address and data input ports and a data output port.
The data input port of the instruction cache is not used within the data path; therefore, it is not
shown in Figure 13.2.

When an instruction is read from the instruction cache, its fields are separated and each is
dispatched to the appropriate place. For example, the op and fn fields go to the control unit, while
rs, rt, and rd are sent to the register file. The upper input of the ALU always comes from the
register rs, while its lower input can be either the rt content or the immediate field of the
instruction. For many instructions, the ALU output is stored in a register; in such cases, the data
cache is bypassed. For the lw and sw instructions, the data cache is accessed: the rt content is
written to it for sw, and its output is sent to the register file for lw.

The register file in Figure 13.2 is similar in design to that of Figure 2.9b, with h = 5 and k = 32.
Figure 2.9a shows the details of the register file implementation. In one clock cycle, the contents of
any two (rs and rt) of the 32 registers can be read via the read ports, while at the same time a third
register, not necessarily distinct from rs or rt, is modified via the write port. The flip-flops that
constitute the registers are edge-triggered, so reading from and writing to the same register in a
single clock cycle is not a problem (Figure 2.3). Remember that $0 is a special register that always
contains 0 and cannot be modified.

The ALU used for the MicroMIPS implementation has the same design shown in Figure 10.19, except that
the shifter and its associated logic, as well as the zero-detection block (32-input NOR circuit), are
not required. The shifter output that goes to the final mux in Figure 10.19 can be replaced by the
upper half of y (the lower ALU input), padded with 16 zeros on the right, to allow the implementation
of the lui (load upper immediate) instruction. Treating lui as an ALU instruction leads to a
simplification in the execution unit, because a separate path is not needed for writing the immediate
value, extended with zeros on the right, into the register file. Since the effect of lui can be viewed
as a logical left shift of the immediate value by 16 bit positions, it is not inappropriate to use the
ALU's shift option for this instruction.

To conclude the preliminary discussion of the instruction execution unit, the data flow associated
with branch and jump instructions must be specified. For the beq and bne instructions, the contents of
rs and rt are compared to determine whether the branch condition is satisfied. This comparison is done
within the Next addr box in Figure 13.2. In the case of bltz, the branch decision is based on the sign
bit of the contents of rs rather than on a comparison of the contents of two registers. Again, this is
done inside the Next addr box, which is also responsible for choosing the jump target address under
the guidance of the control unit. Remember that the jump target address comes from the instruction
itself (j, jal), is read from the register rs (jr), or is a known constant associated with the
location of an operating system routine (syscall).

Note that some details are missing from Figure 13.2. For example, it does not show how the register
file is told which register, if any, is to be written (rd or rt). These details will be provided in
section 13.3.

3. A single-cycle data path

Now begins the process of refining the abstract execution unit of Figure 13.2 into a concrete logic
circuit capable of executing the 22 MicroMIPS instructions. This section ignores the Next addr and
Control blocks of Figure 13.2 and focuses on the middle part, composed of the program counter,
instruction cache, register file, ALU, and data cache. This part is known as the data path. The
next-address logic will be covered in section 13.4 and the control circuit in section 13.5.

Figure 13.3 shows some details of the data path that are missing from the abstract version of Figure
13.2. Everything that follows in this section refers to the data path of Figure 13.3, in which the
instruction execution steps proceed from left to right. The four main blocks appearing in this data
path have already been described. The function of the three multiplexers, used at the register file
input, at the lower ALU input, and at the outputs of the ALU and the data cache, will now be
explained. Understanding the role of these multiplexers is a key to understanding not only the data
path of Figure 13.3 but any data path in general.


The multiplexer at the register file input allows rt, rd, or $31 to be used as the index of the
destination register into which a result will be written. A pair of RegDst control signals, provided
by the control unit, directs the selection: RegDst is set to 00 to select rt, 01 for rd, and 10 for
$31; this last option is required to execute jal. Of course, not every instruction writes a value into
a register. Writing to a register requires the control unit to assert the RegWrite control signal;
otherwise, regardless of the state of RegDst, nothing is written into the register file. Note that the
registers rs and rt are read for every instruction, even though they may not be needed in all cases;
hence, no read control signal is provided for the register file. Similarly, the instruction cache
receives no control signal, because an instruction is read in every cycle (that is, the clock signal
serves as the read control for the instruction cache).

The multiplexer at the lower ALU input allows the control unit to choose either the rt content or the
sign-extended 32-bit version of the 16-bit immediate operand as the second ALU input (the first, or
upper, input always comes from rs). This is controlled by asserting or deasserting the ALUSrc control
signal. If this signal is not asserted (has the value 0), the rt content is used as the lower ALU
input; otherwise, the immediate operand, sign-extended to 32 bits, is used. The sign extension of the
immediate operand is done by the round block labeled "SE" in Figure 13.3. (See Problem 13.17.)

Finally, the multiplexer at the far right of Figure 13.3 allows the word supplied by the data cache,
the ALU output, or the incremented PC value to be sent to the register file for writing (the last
option is needed for jal). The choice is made by a pair of RegInSrc control signals, which are set to
00 to choose the data cache output, 01 for the ALU output, and 10 for the incremented PC value coming
from the next-address block.

The data path of Figure 13.3 is capable of executing one instruction per clock cycle; hence the name
"single-cycle data path". With each clock tick, a new address is loaded into the program counter,
causing a new instruction to appear at the instruction cache output after a short access delay. The
contents of the various instruction fields are sent to the relevant blocks, including the control
unit, which decides (based on the op and fn fields) what operation each block is to perform.


As the data from rs and rt, or from rs and the sign-extended imm, pass through the ALU, the operation
specified by the ALUFunc control signals is performed and the result appears at the ALU output. If the
shift-related control signals (Const'Var, the Shift function) of Figure 10.19, which are not needed
for MicroMIPS, are ignored, the ALUFunc control signal bundle contains five bits: one bit for Add'Sub
control, two bits for controlling the logic unit (Logic function), and two bits for controlling the
far-right multiplexer in Figure 10.19 (Function class). Figure 13.3 shows the ALU output signal
indicating addition or subtraction overflow, although it is not used in this chapter.

For arithmetic and logical instructions, the ALU result must be stored in the destination register and
is therefore forwarded to the register file via the feedback path near the bottom of Figure 13.3. For
memory access instructions, the ALU output represents the address of the data to be written to the
data cache (DataWrite asserted) or read from it (DataRead asserted). In the latter case, the data
cache output, which appears after a short latency, is sent via the lower feedback path to the register
file for writing. Finally, when the executed instruction is jal, the incremented program counter
value, (PC) + 4, is stored in register $31.

4. Branching and jumping


This section is dedicated to the design of the next-address block shown in the upper left of Figure 13.3.

The next address to load into the program counter is derived in one of five ways, depending on the instruction being executed and the contents of the registers on which the branch condition is based. Because instructions are words stored in memory locations whose addresses are multiples of 4, the two LSBs of the program counter are always 0. Therefore, in the discussion below, (PC)31:2 refers to the upper 30 bits of the program counter, which is the part of the PC that is modified by the next-address logic. With this convention, adding 4 to the contents of the program counter amounts to computing (PC)31:2 + 1. In addition, the immediate value, which by definition must be multiplied by 4 before being added to the incremented program counter, can be added to this upper part without modification. The five options for the (PC)31:2 contents are:

(PC)31:2 + 1 Default: the incremented PC (no branch or jump)
(PC)31:2 + 1 + imm Taken branch (beq, bne, bltz)
(PC)31:28 | jta Jump (j, jal)
(rs)31:2 Register jump (jr)
SysCallAddr System call (syscall)


The first two options are realized with an adder whose lower input is tied to (PC)31:2 and whose upper input is connected to imm (sign-extended to 30 bits) or to 0, depending on whether or not a branch is to be taken, and whose carry input is permanently asserted (Figure 13.4). This address adder therefore computes (PC)31:2 + 1 or (PC)31:2 + 1 + imm, which is written to the PC unless a jump instruction is being executed. Note that when the instruction is not a branch, the output of this adder is (PC)31:2 + 1. Hence, this output, with two 0s appended on its right, can be used as the incremented program counter value to be stored in $31 as part of the execution of the jal instruction.

With reference to Figure 13.4, the branch condition checker examines the branch predicate, which is (rs) = (rt), (rs) ≠ (rt), or (rs) < 0, and asserts its BrTrue output if the condition is satisfied and the corresponding branch instruction is being executed. The latter information is conveyed by the BrType signal pair from the control unit. Finally, the pair of PCSrc signals, also supplied by the control unit, directs the multiplexer at the far left of Figure 13.4 to send one of its four inputs to be written into the upper 30 bits of the PC. These signals are set to 00 most of the time (for all instructions other than the four jumps); they are set to 01 for both j and jal, 10 for jr, and 11 for syscall.
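A short sketch of this next-address behavior follows. The PCSrc encodings are those just listed; the BrType code assignments to the three branch instructions are assumptions for illustration, since the text only states that BrType is a signal pair.

```python
# Hedged behavioral sketch of the next-address block of Figure 13.4.
# BrType codes for beq/bne/bltz are illustrative assumptions.

def br_true(BrType, rs_val, rt_val):
    """Branch condition checker: returns the BrTrue signal."""
    if BrType == 0b01:      # assumed code for beq: (rs) == (rt)
        return rs_val == rt_val
    if BrType == 0b10:      # assumed code for bne: (rs) != (rt)
        return rs_val != rt_val
    if BrType == 0b11:      # assumed code for bltz: (rs) < 0
        return rs_val < 0
    return False            # 00: no branch instruction

def next_pc_upper30(PCSrc, pc30, imm, jump_addr, rs_upper30,
                    syscall_addr, BrType, rs_val, rt_val):
    """Select the upper 30 bits of the next PC (two LSBs are always 00)."""
    # Address adder: (PC)31:2 + 1, plus imm only on a taken branch.
    incr_or_branch = pc30 + 1 + (imm if br_true(BrType, rs_val, rt_val) else 0)
    return {0b00: incr_or_branch,   # sequential flow or taken branch
            0b01: jump_addr,        # j, jal
            0b10: rs_upper30,       # jr
            0b11: syscall_addr}[PCSrc] & ((1 << 30) - 1)
```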

5. Derivation of control signals


Proper execution of MicroMIPS instructions requires the control circuit to assign appropriate values to the control signals shown in Figure 13.3 and in Figure 13.4. The value of each signal is a function of the instruction being executed; therefore, it is uniquely determined by the op and fn fields. Table 13.2 contains a list of all the control signals and their definitions. The far-left column of Table 13.2 names the block in Figure 13.3 to which the signals relate. Column 2 lists the signals as shown in Figure 13.3 and, where applicable, names the signal components. For example, ALUFunc has three components (Figure 10.19): Add′Sub, LogicFn, and FnClass. The last two are two-bit signals, with their bits indexed 1 (MSB) and 0 (LSB). The remaining columns of Table 13.2 specify the meaning associated with each control signal setting. For example, the next-to-last row of the table assigns different bit patterns or codes to the three different branch types (beq, bne, bltz) and to all other cases, where no branch is to occur.


TABLE 13.3 Control signal settings for the single-cycle MicroMIPS instruction execution unit.*

*Note: Blank entries are don't-cares.

Based on the definitions in Table 13.2 and the understanding of what needs to be done to execute
each instruction, Table 13.3 is constructed, in which the values of the 17 control signals for each of
the 22 MicroMIPS instructions are specified.

Table 13.2 defines each of the 17 control signals as a logical function of 12 input bits (op and fn). It is easy to derive logic expressions for these signals in terms of the 12 input bits op5, op4, ..., op0, fn5, fn4, ..., fn0. The drawback of such an ad hoc approach is that, if it is later decided to modify the machine's instruction set or to add new instructions to it, the entire design must be redone.

In this context, a two-step approach to the synthesis of control circuits is often preferred. In the first step, the instruction set is decoded and a distinct logic signal is asserted for each instruction. Figure 13.5 shows a decoder for the MicroMIPS instruction set. The six-bit op field is supplied to a 6-to-64 decoder that asserts one of its outputs depending on the op value. For example, output 1 of the op decoder corresponds to the bltz instruction; it is therefore given the symbolic name "bltzInst". Such names make it easy to remember when each signal is asserted. Output 0 of the op decoder enables a second decoder. Again, the outputs of this fn decoder are labeled with the names of the instructions they represent.


Figure 13.5 Instruction decoder for MicroMIPS, built from two 6-to-64 decoders.

Now, each of the 17 required control signals can be formed as the logical OR of a subset of the decoder outputs in Figure 13.5. Because some of the signals in Table 13.3 have many 1 entries in their truth tables, which would in any case require multilevel OR operations, it is convenient to define a few auxiliary signals that are then used in forming the main control signals. Three auxiliary signals are defined as follows:

The derivation of the logical expressions for the remaining control signals of Figures 13.3 and 13.4 is
left as an exercise.
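The following minimal sketch illustrates the two-step approach: decode op/fn into one-hot per-instruction signals, then OR subsets of those signals to form the control signals. The subset of instructions decoded, the auxiliary signal shown, and the partial control expressions are plausible examples for illustration only, not the exact definitions used in the text.

```python
# Two-step control synthesis sketch (illustrative; not the book's exact
# signal definitions). Opcode/function values are the standard MIPS codes
# referenced in the text (e.g., op 35 = lw, op 43 = sw).

def decode(op, fn):
    """One-hot instruction signals, mimicking the two 6-to-64 decoders."""
    sig = {}
    sig['bltzInst'] = (op == 1)
    sig['jInst'], sig['jalInst'] = (op == 2), (op == 3)
    sig['beqInst'], sig['bneInst'] = (op == 4), (op == 5)
    sig['lwInst'], sig['swInst'] = (op == 35), (op == 43)
    sig['jrInst'] = (op == 0) and (fn == 8)     # fn decoder enabled by op = 0
    sig['addInst'] = (op == 0) and (fn == 32)
    return sig

def control(sig):
    """Each control signal is the OR of a subset of decoder outputs."""
    # Example auxiliary signal shared by several control expressions:
    branchInst = sig['beqInst'] or sig['bneInst'] or sig['bltzInst']
    ctl = {}
    ctl['RegWrite'] = sig['lwInst'] or sig['jalInst'] or sig['addInst']  # partial list
    ctl['DataRead'], ctl['DataWrite'] = sig['lwInst'], sig['swInst']
    return ctl, branchInst
```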


6. Performance of the Single-cycle design


The single-cycle MicroMIPS implementation discussed in the preceding sections can also be called a single-state or stateless implementation, in which the control circuit is purely combinational and need not remember anything from one clock cycle to the next. All the information needed for proper operation in the next clock cycle is carried in the program counter. Since a new instruction is executed in every clock cycle, we have CPI = 1. All instructions share the fetch and register access steps. Instruction decoding overlaps completely with register access, so it involves no additional latency.

The clock cycle is determined by the longest instruction execution time, which in turn depends on the signal propagation latency through the data path of Figure 13.3. With this simple instruction set, the lw instruction, which requires an ALU operation, a data cache access, and a register write, is likely to be the slowest instruction; no other instruction needs all three of these steps. Assume the following worst-case latencies for the blocks in the data path, whose sum yields the execution latency of the lw instruction:

Instruction cache access 2 ns
Register read 1 ns
ALU operation 2 ns
Data cache access 2 ns
Register write 1 ns
Total 8 ns

This corresponds to a clock rate of 125 MHz and a performance of 125 MIPS. If the cycle time could be made variable, adjusting to the time actually needed by each instruction, a somewhat better average instruction execution time would result. Figure 13.6 shows the critical signal propagation paths for the various instruction classes in MicroMIPS. Here it is assumed that all jump instructions are of the fastest variety, needing no register access (j or syscall). This leads to a best-case estimate of the average instruction execution time, but the true average will not be much different, given the small fraction of jump instructions. An average instruction execution time can be derived from the following typical instruction mix: R-type 44% (6 ns), load 24% (8 ns), store 12% (7 ns), branch 18% (5 ns), jump 2% (3 ns), for an average of about 6.36 ns per instruction.
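The short sketch below works through this average. The class frequencies and per-class latencies are the commonly assumed mix for this example (they are assumptions here, but they reproduce the 157 MIPS figure quoted below).

```python
# Average execution time under an ideal variable-cycle-time design.
# Mix and latencies are the assumed figures stated above.

mix = {          # fraction of executed instructions, worst-case latency (ns)
    'R-type': (0.44, 6.0),   # fetch + reg read + ALU + reg write
    'load':   (0.24, 8.0),   # adds data cache access: the slowest class
    'store':  (0.12, 7.0),   # no register write-back
    'branch': (0.18, 5.0),   # fetch + reg read + next-address logic
    'jump':   (0.02, 3.0),   # fastest variety (j, syscall): no reg access
}
avg_ns = sum(f * t for f, t in mix.values())
print(f"average execution time = {avg_ns:.2f} ns")                  # 6.36 ns
print(f"ideal variable-cycle throughput = {1000/avg_ns:.0f} MIPS")  # ~157 MIPS
```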


Figure 13.6 The MicroMIPS data path unfolded (with register write drawn as a separate block) to allow better visualization of the critical-path latencies for the various instruction classes.

Thus, a single-cycle implementation with a fixed cycle time is slower than an ideal implementation with variable cycle time, which could achieve an average throughput of 157 MIPS. However, the latter is impractical and is discussed here only to show that no higher-performance MicroMIPS hardware implementation can be obtained under the single-cycle control philosophy. The disadvantages of the single-cycle design are discussed next.

As already noted, the clock cycle time of a machine with single-cycle control is dictated by the most complex or slowest instruction. Such an implementation therefore stretches the execution of simple operations so that complex operations fit within one clock cycle. In the case of MicroMIPS, the clock period had to be set to 8 ns, corresponding to the execution time of lw, the slowest MicroMIPS instruction. This caused the faster instructions, which could in theory be executed in 3, 5, 6, or 7 ns, to also take 8 ns.

In fact, it is rather fortunate that, except for the jump instructions, which represent a tiny fraction of those encountered, the execution times of MicroMIPS instructions do not deviate significantly from the average; this makes single-cycle control with a fixed cycle time quite competitive with an ideal variable-cycle-time implementation. If the instruction mix given in this section is typical of the applications to be run on the machine, more than one-third of single-cycle MicroMIPS instructions run close to their minimum latencies with a clock cycle time of 8 ns, while 80% suffer a slowdown of no more than 33% as a result of the single-cycle implementation.


If MicroMIPS had included more complex instructions, such as multiplication, division, or floating-point arithmetic, the disadvantage of single-cycle control would have been even more pronounced. For example, if the latency of a more complex instruction such as division were four times that of addition, single-cycle control would imply that the very common addition operation be slowed down by a factor of about 4, severely affecting performance. As we know from the discussion in section 8.5, this is one motivation for relegating the multiplication and division instructions to a unit other than the main ALU (Figure 5.1). The multiply/divide unit can perform these more complex operations while the ALU continues to fetch and execute simpler instructions. As long as adequate time is allowed before the multiplication or division results are copied from the Hi and Lo registers into the general register file, no conflict arises between these two independent arithmetic units.

The following analogy helps point the way toward the multicycle control implementation of Chapter 14. Consider a dentist's schedule. Instead of booking all appointments in one-hour slots because some patients require a full hour of work, appointments can be assigned in standard 15-minute increments. A patient arriving for a routine checkup would then use only 15 minutes of the dentist's time, while people requiring more attention or complex procedures can be given multiple time increments. This way, the dentist can see more patients and has less idle time between patients.

Problems
1) Data path details

a) Label each line in Figure 13.2 with the number of binary signals it represents.

b) Repeat part (a) for Figure 13.3.

2) Data path selection options

Let M1, M2 and M3, from left to right, be the three multiplexers in Figure 13.3.

a) M1 has three settings, while M2 has two; therefore, there are six combinations when the M1 and M2 settings are taken together. For each of these combinations, indicate whether it is ever used and, if so, in executing which MicroMIPS instruction(s).

b) Repeat part (a) for M2 and M3 (six combinations).

c) Repeat part (a) for M1 and M3 (nine combinations).

3) Branch condition checker

Based on the description in section 13.4 and the signal encodings defined in Table 13.2, present a complete logic design for the branch condition checker of Figure 13.4.


4) Next address logic

Consider the following alternative design for the MicroMIPS next-address logic. The array of 30 AND gates shown near the top of Figure 13.4 is removed and the immediate value, sign-extended to 30 bits, is connected directly to the upper input of the adder. A separate incrementer is introduced to compute (PC)31:2 + 1. The BrTrue signal is then used to control a multiplexer that allows the IncrPC output to be taken from the adder or from the newly introduced incrementer. Compare this alternative design with the original design of Figure 13.4 in terms of potential advantages and disadvantages.

5) MicroMIPS Instruction Format

Assume that you are given a clean slate to redesign the instruction format for MicroMIPS. The instruction set is defined in Table 13.1. The only requirement is that each instruction contain the appropriate number of five-bit register specification fields and a 16-bit immediate/offset field where needed. In particular, the width of the jta field may change. What is the minimum number of bits required to encode all MicroMIPS instructions in a fixed-width format? Hint: Because there are seven different instructions that need two register fields and an immediate field, you can establish a lower bound on the width.

6) Control signal values

a) On examining Table 13.3, one observes that for any pair of rows, there is at least one column in which the control signal settings differ (one is set to 0, the other to 1). Why is this not surprising?

b) Identify three pairs of rows where each pair differs in exactly one control bit value. Explain how this difference in control bit values leads to different execution behaviors.

7) Derivation of control signals

The logical expressions for five of the 17 control signals cited in Tables 13.2 and 13.3 are provided at
the end of section 13.5. Provide logical expressions that define the other 12 control signals.

8) Single-cycle design performance

Discuss the effects of the following changes on the performance results obtained in section 13.6:

a) Reducing the register file access time from 1 ns to 0.5 ns.

b) Improving the ALU latency from 2 ns to 1.5 ns.

c) Using caches with an access time of 3 ns instead of 2 ns.

9) Exception handling


a) Assume that an ALU overflow is to cause a special operating system routine to be invoked at the memory location ExceptHandler. Discuss the changes to the MicroMIPS instruction execution unit needed to accommodate this feature.

b) Discuss how a similar provision can be made for overflow in address calculation (note that addresses are unsigned numbers).

10) Instruction decoding

Assume that the instruction decoder of the single-cycle MicroMIPS implementation is designed directly from Table 13.3, rather than via the two-stage process of full decoding followed by OR operations (Figure 13.5). Write the simplest possible logic expression for each of the 17 control signals listed in Table 13.3 (parts a–q of the problem, in order from left to right). Blank and missing entries in the table can be treated as don't-cares.

11) Instruction decoding

a) Show how the op decoder on the left side of Figure 13.5 can be replaced by a much smaller 4-to-16 decoder plus a single additional three-input AND gate. Hint: Subtract 32 from 35 and from 43.

b) Show how the fn decoder on the right side of Figure 13.5 can be replaced by a 4-to-16 decoder.

12) Control signals for shift instructions

Extend Table 13.3 with rows corresponding to the following new instructions from Table 6.2 that might be added to the single-cycle MicroMIPS implementation. Justify your answers.

a) Logical shift left (sll).
b) Logical shift right (srl).
c) Arithmetic shift right (sra).
d) Variable logical shift left (sllv).
e) Variable logical shift right (srlv).
f) Variable arithmetic shift right (srav).

13) Handling other instructions

Explain the changes to the single-cycle data path and the associated control signal settings needed to add the following new instructions from Table 6.2 to the MicroMIPS implementation. Justify your answers.

a) Load byte (lb).


b) Load unsigned byte (lbu).
c) Arithmetic right shift (sra).


14) Handling multiplication/divide instructions

Suppose you want to augment MicroMIPS with a multiply/divide unit capable of executing the multiplication and division instructions of Table 6.2 along with the associated mfhi and mflo instructions. Besides Hi and Lo, the multiply/divide unit has an Md register, which holds the multiplicand/divisor during the execution of the instruction. The multiplier/dividend is stored in Lo, where the quotient also develops. The multiply/divide unit must be supplied with its operands and some control bits indicating which operation is to be performed. The unit then operates independently of the main data path for several clock cycles, eventually producing the required results. Do not worry about the operation of the unit beyond its initial setup. Propose changes to the single-cycle execution unit to accommodate this extension.

15) Adding other instructions

Other instructions can be added to the MicroMIPS instruction set. Consider the pseudoinstructions in Table 7.1 and assume that you want to include them in MicroMIPS as regular instructions. In each case, choose an appropriate encoding for the instruction and specify all required modifications to the single-cycle data path and the associated control circuits. Ensure that the encodings chosen do not conflict with the other MicroMIPS instructions listed in Tables 6.2 and 12.1.

a) Move (move).
b) Load immediate (li).
c) Absolute value (abs).
d) Negate (neg).
e) Not (not).
f) Branch if less than (blt).

16) URISC

Consider the URISC processor described in section 8.6 [Mava88]. How many clock cycles does URISC need to execute one instruction, assuming that memory can be accessed in one clock cycle?

17) ALU for MicroMIPS

The description of the MicroMIPS ALU at the end of section 13.2 omits one detail: the logic unit must undo the effect of sign extension on an immediate operand.

a) Present a design for the logic unit within the ALU that includes this provision.

b) Speculate about possible reasons for this design choice (i.e., sign extension in all cases, followed by a possible undo).


CONTROL UNIT SYNTHESIS

CHAPTER TOPICS

1 A Multi-Cycle Implementation

2 Clock Cycle and Control Signals

3 The Control State Machine

4 Performance of the Multicycle Design

5 Microprogramming

6 Dealing with Exceptions

The memoryless control circuit of the single-cycle implementation in Chapter 13 forms all control signals as functions of certain bits within the instruction. This approach works well for a limited set of instructions, most of which execute in about the same time. When instructions are more varied in complexity, or when some resources must be used more than once during the same instruction, a multicycle implementation is called for. The control circuit of such a multicycle implementation is a state machine, with a number of states for normal execution and additional states for exception handling. This chapter derives a multicycle control implementation for MicroMIPS and shows that the execution of each instruction now becomes a "hardware program" (or microprogram) that, like an ordinary program, has sequential execution, branching, and perhaps even loops and procedure calls.

1. A Multi-Cycle Implementation
As you learned in Chapter 13, the execution of each MicroMIPS instruction encompasses a set of actions, such as memory access, register readout, and ALU operation. According to the assumptions of section 13.5, each of these actions takes 1-2 ns to complete. Single-cycle operation requires that the sum of these latencies be taken as the clock period. With a multicycle design, each clock cycle performs a subset of the actions required by an instruction. Consequently, the clock cycle can be made much shorter, with several cycles needed to execute a single instruction. This is analogous to a dental office that allocates time to patients in multiples of 15 minutes, depending on the amount of work anticipated. To allow multicycle operation, intermediate values produced in one cycle must be kept in registers so that they are available for use in any subsequent clock cycle where they are needed.

A multicycle implementation may be chosen for greater speed or for economy. Faster operation results from using a shorter clock period and a variable number of clock cycles per instruction; each instruction then takes only as long as it needs for its various execution steps, rather than a fixed time dictated by the slowest instruction (Figure 14.1). The lower implementation cost results from being able to use some resources more than once in the course of executing an instruction; for example, the same adder that is used to execute the add instruction can be used to compute the branch target address or to increment the program counter.


Figure 14.2 shows an abstract view of a multicycle data path. Several features of this data path are noteworthy. First, the two memory blocks (instruction cache and data cache) of Figure 13.2 have been merged into a single cache block. When a word is read from the cache, it must be kept in a register for use in subsequent cycles. The reason for having two registers (instruction and data) between the cache and the register file is that, once the instruction is read, it must be maintained for all the remaining cycles of its execution in order to properly generate the control signals; thus, a second register is needed for the data read associated with lw. Three other registers (x, y, and z) also serve the purpose of holding information between cycles. Note that except for the PC and the instruction register, all registers are loaded in every clock cycle; hence the absence of explicit load controls for these registers. Because the contents of each of these registers are needed only in the immediately following clock cycle, redundant loading of these registers does no harm.


The data path of Figure 14.2 is capable of executing one instruction every 3-5 clock cycles; hence the name "multicycle data path". The execution of every instruction begins in the same way in the first cycle: the PC content is used to access the cache, and the retrieved word is placed in the instruction register. This is called the instruction fetch cycle. The second cycle is devoted to decoding the instruction and also to accessing the rs and rt registers. Note that not every instruction requires two register operands, and at this point it is not yet known which instruction has been fetched. However, reading rs and rt does no harm, even if it later turns out that the contents of one or both registers are not needed. If the instruction at hand is one of the four jump instructions (j, jr, jal, syscall), its execution ends in the third cycle, when the appropriate address is written into the PC. For a branch instruction (beq, bne, bltz), the branch condition is checked and the appropriate value is written into the PC in the third cycle. All other instructions proceed to a fourth cycle, in which they are completed. There is a single exception: lw requires a fifth cycle to write the data retrieved from the cache into a register.

Figure 14.3 shows some of the data path details that are missing from the abstract version of Figure 14.2. The multiplexers serve the same functions as those used in Figure 13.2. The three-input multiplexer at the register file input allows rt, rd, or $31 to be used as the index of the destination register into which a result will be written. The two-input multiplexer at the bottom of the register file allows the data read from the cache or the ALU output to be written into the selected register. This serves the same function as the three-input mux at the right edge of Figure 13.3; the reason for having one less input here is that the incremented PC is now formed by the ALU rather than by a separate unit. The two- and four-input multiplexers near the right edge of Figure 14.3 correspond to the mux in the next-address block of Figure 13.3 (Figure 13.4). The address of the next instruction to be written into the PC can come from five possible sources. The first two are the appropriately modified jta and SysCallAddr, corresponding to the j and syscall instructions. One of these two values is selected and sent to the top input of the four-input "PC source" multiplexer. The other three inputs of this multiplexer are (rs) from the x register, the ALU output of the preceding cycle held in the z register, and the ALU output of the current cycle.


Compared with Figure 13.3, a mux has been added for the upper ALU input (mux x) and the lower ALU input multiplexer (mux y) has been expanded from two to four inputs. The latter is because the ALU must now also compute (PC) + 4 and (PC) + 4 + imm. To compute (PC) + 4, mux x is supplied with the control signal 0 and mux y with 00; this is done in the first execution cycle of every instruction. Then, in a subsequent cycle, the incremented PC value is added to 4 × imm using the control settings 0 and 11 for muxes x and y, respectively. Note that two versions of the sign-extended immediate can be used as the lower ALU input: the regular version, needed for instructions such as addi, and the left-shifted version (multiplied by 4), needed to handle the offset in branch instructions.

2. Clock Cycle and Control Signals


In multicycle control, the actions required to execute each instruction are divided up, with a subset of these actions assigned to each clock cycle. The clock period should be chosen so as to balance the work done in each cycle, with the goal of minimizing the amount of idle time; excessive idle time leads to lost performance. As in section 13.5, the following latencies are assumed for the basic steps of instruction execution:

Memory access (read or write) 2 ns

Register access (read or write) 1 ns

ALU operation 2 ns

TABLE 14.1 Control signals for multi-cycle MicroMIPS implementation.


Therefore, a clock cycle of 2 ns allows each of the basic steps of an instruction to be performed in one clock cycle, leading to a clock frequency of 500 MHz. If these numbers include no safety margin, a slightly longer clock period (say, 2.5 ns, for a clock frequency of 400 MHz) might be needed to accommodate the added overhead of storing values into registers in each cycle. Here we proceed under the assumption that a clock period of 2 ns is sufficient.

Table 14.1 contains a list and definitions of all the control signals shown in Figure 14.3, grouped by the block in the diagram that each signal affects. The first three entries of Table 14.1 relate to the program counter. PCSrc selects one of four alternative values to load into the PC, and PCWrite specifies when the PC is to be modified. The four sources for the new PC content are:

00 Jump address (PC31:28|jta|00 or SysCallAddr, selected by JumpAddr)

01 Contents of register x, which holds the value read from rs

10 Contents of register z, which holds the ALU output of the preceding cycle

11 ALU output in the current cycle

Note that in Figure 14.3, the right extension of PC31:28|jta with 00 is shown as multiplication by 4.

Four binary signals are associated with the cache. The MemRead and MemWrite signals are self-explanatory. The Inst′Data signal indicates whether the memory is being accessed to read (fetch) an instruction (address coming from the PC) or to read/write data (address computed by the ALU in the preceding cycle). The IRWrite signal, which is asserted in the instruction fetch cycle, indicates that the memory output is to be written into the instruction register. Note that the data register has no write control signal. This is because its write control input is tied to the clock, causing it to load the memory output in every cycle, even in the instruction fetch cycle. As noted earlier, redundant loads cause no problem, as long as the data loaded into this register is used in the next cycle, before being overwritten by something else.

The control signals affecting the register file are identical to those of the single-cycle design (Figure 13.3), except that RegInSrc is two-valued rather than three-valued, because the incremented PC content now emerges from the ALU.

Finally, the control signals associated with the ALU specify the sources of its two operands and the function to be performed. The upper ALU operand can come from the PC or from the x register, under the control of the ALUSrcX signal. The lower ALU operand has four possible sources, namely:

00 The constant 4, for incrementing the program counter

01 Contents of register y, which holds (rt), read in the previous cycle

10 The immediate field of the instruction, sign-extended to 32 bits

11 The offset field of the instruction, sign-extended and shifted left by two bits


Again, appending 00 to the right of the sign-extended offset is shown as multiplication by 4. The signals comprising ALUFunc have the same meanings as in the single-cycle design (Table 13.2).
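The sketch below models this operand selection, using the ALUSrcX and ALUSrcY encodings just listed; the function names are illustrative, and the usage example shows the mux settings for the PC increment performed in cycle 1.

```python
# Hedged sketch of ALU operand selection in the multicycle data path:
# mux x picks the upper operand (PC or register x), mux y one of four
# lower operands, per the encodings listed above.

def sign_ext(imm16):
    """Sign-extend a 16-bit immediate to a Python integer."""
    return imm16 - (1 << 16) if imm16 & 0x8000 else imm16

def alu_inputs(ALUSrcX, ALUSrcY, pc, x, y, imm):
    top = pc if ALUSrcX == 0 else x
    bottom = {0b00: 4,                           # constant 4, to increment PC
              0b01: y,                           # (rt), read in the previous cycle
              0b10: sign_ext(imm),               # sign-extended immediate
              0b11: sign_ext(imm) << 2}[ALUSrcY] # offset times 4, for branches
    return top, bottom

# Cycle 1 of every instruction: (PC) + 4, using settings 0 and 00.
pc = 0x0040_0000
a, b = alu_inputs(0, 0b00, pc, 0, 0, 0)
assert a + b == pc + 4
# In a later cycle, settings 0 and 11 form the branch target
# (pc then holds the incremented value written back in cycle 1).
```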

Table 14.2 shows the control signal settings in each clock cycle during the execution of an instruction. The first two cycles are common to all instructions. In cycle 1, the instruction is fetched from the cache and placed in the instruction register. Because the ALU is free in this cycle, it is used to increment the PC. In this way, the incremented PC value will also be available for adding the branch offset. In cycle 2, the rs and rt registers are read and their contents are written into the x and y registers, respectively. In addition, the instruction is decoded, and the branch target address is formed in the z register. Although in most cases the instruction will turn out to be something other than a branch, using the otherwise idle ALU cycle to precompute the branch address does no harm. Note that the 2 ns clock cycle is assumed here to provide sufficient time for register readout followed by the addition latency through the ALU.

Beginning with cycle 3, the instruction at hand is known, because decoding was completed in cycle 2. The control signal values in cycle 3 depend on the instruction category: ALU-type, load/store, branch, or jump. The last two categories of instructions are completed in cycle 3, while the others move on to cycle 4, where ALU-type instructions finish and different actions are prescribed for load and store. Finally, lw is the only instruction that needs cycle 5 for its completion. During this cycle, the contents of the data register are written into the rt register.


3. The Control State Machine


The control unit must distinguish among the five cycles of the multicycle design and must be able to perform different operations depending on the instruction. Note that the setting of every control signal is uniquely determined by knowing which instruction is being executed and which of its cycles is in progress. The control state machine carries the required information from state to state, with each state associated with a particular set of values for the control signals.

Figure 14.4 shows the control states and state transitions. The control state machine is set to state 0 when program execution begins. It then moves from state to state until one instruction is completed, at which time it returns to state 0 to begin the execution of another instruction. This cycling through the states of Figure 14.4 continues until a syscall instruction is executed with the value 10 in register $v0 (see Table 7.2). This instruction terminates the execution of the program.


The control state sequences for the various classes of MicroMIPS instructions are as follows:

ALU-type 0, 1, 7, 8

Load word 0, 1, 2, 3, 4

Store word 0, 1, 2, 6

Jump/branch 0, 1, 5
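These sequences are easy to encode, as the following small sketch shows; the per-class cycle counts it reads off are used in the CPI calculation of section 14.4.

```python
# Control state sequences listed above, encoded as tuples; the cycle
# count of each class is simply the length of its sequence.

state_seq = {
    'ALU-type':    (0, 1, 7, 8),
    'load word':   (0, 1, 2, 3, 4),
    'store word':  (0, 1, 2, 6),
    'jump/branch': (0, 1, 5),
}
cycles = {cls: len(seq) for cls, seq in state_seq.items()}
assert cycles['load word'] == 5 and cycles['jump/branch'] == 3
```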

In each state, except for states 5 and 7, the control signal settings are uniquely determined. Information about the current control state and the instruction being executed is provided by the decoders shown in Figure 14.5. Note that the instruction decoder portion of Figure 14.5 (composed of the op and fn decoders) is identical to that of Figure 13.5; the outputs of this decoder are used to determine the control signals in states 5 and 7, as well as to control the state transitions according to Figure 14.4. What has been added is a state decoder that asserts the ControlSti signal whenever the control state machine is in state i.

The logic expressions for the control signals of the multicycle MicroMIPS implementation can easily be derived based on Figures 14.4 and 14.5. Examples of control signals that are uniquely determined by the control state information include:

The settings of the ALUFunc signals depend not only on the control state but also on the specific instruction being executed. Define a pair of auxiliary control signals:


Consequently, the ALU control signals can be set as follows:

Control state 5 is similar to state 7 in that its control signal settings depend on the instruction being executed and on some other conditions. For example:

The control circuits thus derived would be somewhat simpler if states 5 and 7 of the control state machine were expanded into multiple states, each corresponding to identical or similar signal settings. For example, state 5 could be split into states 5b (for branches) and 5j (for jumps), or into several states, one for each distinct instruction. However, this would enlarge the control state machine and the associated state decoder, so it might not be cost-effective.

4. Performance of the Multicycle Design


The multicycle MicroMIPS implementation discussed in the preceding sections may also be called a multistate implementation, in contrast to the single-state or memoryless control implementation of Chapter 13. Single-cycle control is memoryless in the sense that execution begins anew in each cycle, and the signals and events during a cycle are based only on the current instruction being executed; they are not affected by what transpired in previous cycles.

Section 13.6 showed that the single-cycle MicroMIPS has a CPI of 1, with a new instruction executed in every clock cycle. To evaluate the performance of the multicycle MicroMIPS, the average CPI is computed using the same instruction mix as in section 13.6. The contribution of each instruction class to the average CPI is obtained by multiplying its frequency by the number of cycles required to execute instructions of that class:

R-type 0.44 × 4 = 1.76, load 0.24 × 5 = 1.20, store 0.12 × 4 = 0.48, branch 0.18 × 3 = 0.54, jump 0.02 × 3 = 0.06; average CPI ≈ 4.04


Note that the number of cycles for each instruction class is determined by the steps required for its
execution (Table 14.2).
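The sketch below reproduces this calculation. The class frequencies are the same assumed mix used in the single-cycle analysis of section 13.6, and the cycle counts come from the state sequences listed earlier; the result matches the CPI of 4.04 used below.

```python
# Average CPI of the multicycle design, under the assumed instruction mix.

mix_cycles = {            # (frequency, cycles per instruction)
    'R-type': (0.44, 4),  # states 0, 1, 7, 8
    'load':   (0.24, 5),  # states 0, 1, 2, 3, 4
    'store':  (0.12, 4),  # states 0, 1, 2, 6
    'branch': (0.18, 3),  # states 0, 1, 5
    'jump':   (0.02, 3),  # states 0, 1, 5
}
cpi = sum(f * c for f, c in mix_cycles.values())
print(f"average CPI = {cpi:.2f}")          # 4.04
print(f"throughput = {500/cpi:.1f} MIPS")  # ~123.8 MIPS at 500 MHz
```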

With the clock rate of 500 MHz derived at the beginning of section 14.2, the CPI of 4.04 corresponds to a performance of 500/4.04 ≅ 123.8 MIPS. This is essentially the same as the 125 MIPS throughput of the single-cycle implementation derived in section 13.6. The two implementations of MicroMIPS in Chapters 13 and 14 thus have comparable performance, in part because the instruction latencies are not very different from one another; the slowest instruction has a latency that is 8/5 = 1.6 times that of the fastest. Had the instruction latencies been more varied, the multicycle design would have yielded a performance gain over the single-cycle implementation.

Example 14.1: Increased variability in instruction execution times

Consider a multicycle implementation of MicroMIPS++, a machine similar to MicroMIPS except that its R-type instructions fall into three categories:

a) Type Ra instructions, which constitute half of all R-type instructions executed, take four cycles.

b) Type Rb instructions, which constitute a quarter of all R-type instructions executed, take six cycles.

c) Type Rc instructions, which constitute a quarter of all R-type instructions executed, take ten cycles.

With the instruction mix given at the beginning of section 14.4, and assuming that the slowest R-type instructions would take 16 ns to execute in a single-cycle implementation, derive the performance advantage of the multicycle implementation over the single-cycle one.

Solution: The assumed worst-case execution time of 16 ns leads to a clock rate of 62.5 MHz and a throughput of 62.5 MIPS for the single-cycle design. For the multicycle design, the only change in the average CPI calculation is that the contribution of R-type instructions to the average CPI increases from 1.76 to 0.22 × 4 + 0.11 × 6 + 0.11 × 10 = 2.64. This raises the average CPI from 4.04 to 4.92. Therefore, the performance of the multicycle MicroMIPS++ becomes 500/4.92 ≅ 101.6 MIPS, and the performance improvement factor of the multicycle design over the single-cycle implementation is 101.6/62.5 = 1.63. Note that including more complex R-type instructions taking six and ten cycles to execute has a relatively small effect on the multicycle design's performance (from 123.8 MIPS to 101.6 MIPS, a reduction of about 18%), while it cuts the performance of the single-cycle design in half.
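A quick numerical check of the example, using the figures stated in the solution:

```python
# Verify the numbers in Example 14.1: the 0.44 R-type share is split
# 1/2, 1/4, 1/4 among Ra (4 cycles), Rb (6 cycles), and Rc (10 cycles).

r_contrib = 0.22 * 4 + 0.11 * 6 + 0.11 * 10               # 2.64
cpi = r_contrib + 0.24 * 5 + 0.12 * 4 + 0.18 * 3 + 0.02 * 3  # other classes unchanged
multi_mips = 500 / cpi                                     # ~101.6 MIPS at 500 MHz
single_mips = 1000 / 16                                    # 62.5 MIPS at 16 ns/instr
print(f"CPI = {cpi:.2f}, speedup = {multi_mips/single_mips:.2f}")  # 4.92, 1.63
```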

5. Microprogramming
The control state machine of Figure 14.4 resembles a program that has instructions (the states), branches, and loops. Such a program in hardware is called a microprogram, and its basic steps are microinstructions. Within each microinstruction, various actions are prescribed, such as asserting the MemRead control signal or setting ALUFunc to "+". Each such action constitutes a microorder. Instead of implementing the control state machine in custom hardware, the microinstructions can be stored in the locations of a control ROM, with a sequence of microinstructions fetched and executed for each machine-language instruction. Thus, in the same way that a program or procedure is broken down into machine instructions, a machine instruction is in turn broken down


into a sequence of microinstructions. Therefore, each microinstruction defines a step in the execution
of a machine language instruction.

ROM-based control implementation has several advantages. It makes the hardware simpler, more regular, and less dependent on the details of the instruction set architecture, so that the same hardware can be used for different purposes merely by modifying the ROM contents. In addition, as the hardware design progresses from planning to implementation, errors and omissions can be corrected by changing the microprogram, as opposed to costly redesign and remanufacture of integrated circuits. Finally, instruction sets can be fine-tuned, and new instructions added, by changing and expanding the microprogram. A machine with this type of control is said to be microprogrammed. Designing an appropriate sequence of microinstructions to implement a particular instruction set architecture constitutes microprogramming. If the microprogram can be easily modified, even by the user, the machine is microprogrammable.

Are there drawbacks to microprogramming? In other words, does the flexibility offered by microprogrammed control come at a cost? The main drawback is lower speed than is achievable with a hardwired control implementation. With a microprogrammed implementation, executing one MicroMIPS instruction requires 3-5 ROM accesses to fetch the microinstructions corresponding to the states of Figure 14.4. After each microinstruction is read and placed in a microinstruction register, sufficient time must be allowed for all signals to stabilize and for actions (such as reading from or writing to memory) to take place. Therefore, all the latencies associated with register readout, memory access, and ALU operation are still in effect; on top of these, the few gate delays needed for control signal generation are replaced by a much longer ROM access delay.

The design of a microprogrammed control unit begins with the design of a suitable microinstruction format. For MicroMIPS, the 22-bit microinstruction format shown in Figure 14.6 can be used. Except for the two bits at the far right, called sequence control, the bits of the microinstruction correspond one-to-one with the control signals of the multicycle data path in Figure 14.3. Consequently, each microinstruction explicitly defines the setting of every control signal.

The two-bit sequence control field allows microinstruction sequencing to be controlled in the same way that the PC controls machine-language instruction sequencing. Figure 14.7 shows that there are four options for choosing the next microinstruction. Option 0 is to advance to the next microinstruction in sequence by incrementing the microprogram counter. Options 1 and 2 allow branching that depends on the opcode field of the machine instruction being executed. Option 3 is to go to microinstruction 0, corresponding to state 0 in Figure 14.4; this initiates the fetch phase for the next machine instruction. Each of the two dispatch tables translates the opcode into a microinstruction address. Dispatch table 1 corresponds to the multiway branch from cycle 2 to cycle 3 in Figure 14.4; dispatch table 2 implements the branching between cycles 3 and 4. Collectively, the two dispatch tables allow two instruction-dependent multiway branches in the course of instruction execution; they thereby enable the sharing of microinstructions among classes of instructions that follow more or less similar steps in their execution.


We can now proceed to write the microprogram for the multicycle MicroMIPS implementation. To make the microprogram more readable, the symbolic names shown in Table 14.3 are used to designate combinations of bit values in the various fields of the microinstruction.


As an example, the microinstruction

x111 0101 0000 000 0xx10 00

is written in the much more readable symbolic form:

PCnext, CacheFetch, PC + 4

Note that two of the fields (register control and sequence control), which have the all-0s default settings, do not appear in this symbolic representation. These default settings simply specify that no register write occurs and that the next microinstruction in sequence is executed next. In addition, the settings of the two fields ALU inputs and ALU function are combined into an expression that identifies the inputs applied to the ALU and the operation it performs much more clearly.
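The sketch below assembles this example microinstruction from its fields. The field names and widths (4 + 4 + 4 + 3 + 5 + 2 = 22 bits) are inferred from the bit grouping of the example above and may not match Figure 14.6 exactly; the "don't care" (x) bits are taken as 0.

```python
# Hedged sketch: packing the 22-bit horizontal microinstruction from its
# fields. Field names/widths are assumptions based on the example's grouping.

FIELDS = [('pc_control', 4), ('cache_control', 4), ('register_control', 4),
          ('alu_inputs', 3), ('alu_function', 5), ('sequence_control', 2)]

def assemble(**values):
    """Concatenate field values, MSB first, into one microinstruction word."""
    word = 0
    for name, width in FIELDS:
        word = (word << width) | (values.get(name, 0) & ((1 << width) - 1))
    return word

# PCnext, CacheFetch, PC + 4 -- unspecified fields default to all 0s:
mi = assemble(pc_control=0b0111, cache_control=0b0101,
              alu_inputs=0b000, alu_function=0b00010)
print(f"{mi:022b}")   # 0111 0101 0000 000 00010 00, with x bits as 0
```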

Based on the notation of Table 14.3, Figure 14.8 shows the complete microprogram for the multicycle MicroMIPS implementation. Note that each line represents one microinstruction, and that microinstructions whose labels end in 1 (2) are reached through dispatch table 1 (2) of Figure 14.7.

The microprogram of Figure 14.8, consisting of 37 microinstructions, fully defines the MicroMIPS hardware operation for instruction execution. If the top microinstruction, bearing the fetch label, is stored at ROM address 0, then starting the machine with the μPC cleared to 0 will cause program execution to begin at the instruction specified by the PC. Thus, part of the boot process for MicroMIPS is to clear the μPC to 0 and to set the PC to the address of the system routine that initializes the machine.


Figure 14.8 Complete microprogram for MicroMIPS. The comments on the right show that each microinstruction corresponds to a state or substate of the control state machine in Figure 14.4.

Figure 14.9 Alternate MicroMIPS microinstructions representing states 7 and 8 of the control state machine in Figure 14.4.

Note that there is a great deal of repetition in the microprogram of Figure 14.8; it corresponds to the substates of states 7 and 8 of the control state machine in Figure 14.4. If the five-bit ALU function code, which is now part of the microinstruction, were instead supplied directly to the ALU through a separate decoder, only two microinstructions would be needed for each of states 7 and 8 of the control state machine. The required changes are shown in Figure 14.9, which implies that the complete microprogram now consists of 15 microinstructions, with each microinstruction 17 bits wide (the five-bit ALU function field having been removed from the 22-bit microinstruction of Figure 14.6).


Further reduction in the number of microinstructions is still possible (see the problems at the end of
this chapter for some examples).

It is also possible to further reduce the width of the microinstructions by encoding the control signal values more efficiently. Note that the microinstruction format of Figure 14.6 devotes one bit to each of the 20 control signals of the data path in Figure 14.3, plus two bits for controlling microinstruction sequencing. Such an approach leads to horizontal microinstructions. Referring to Table 14.3, we note that the four-bit cache control field can hold only one of four possible bit patterns: the default pattern 0000 and the three patterns listed in Table 14.3. These four possibilities can be encoded in two bits, thereby reducing the microinstruction width by two bits. All other fields, except sequence control, can be compacted in a similar way. Such compact encodings require the use of decoders to derive the actual control signal values from their encoded forms in the microinstructions. Microinstructions in which combinations of signal values are compactly encoded are known as vertical microinstructions. In the extreme case of vertical encoding, each microinstruction specifies a single microoperation using a format quite similar to that of a machine-language instruction. The designer thus has a range of options, from the purely horizontal to the extreme vertical format. Microinstruction formats close to the purely horizontal are faster (because they need little decoding) and allow concurrent operation of the data path components.
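The compaction idea for the cache control field can be sketched as follows; the four bit patterns shown are illustrative placeholders, not the actual Table 14.3 encodings.

```python
# Vertical-encoding sketch: pack the four possible 4-bit cache-control
# patterns into 2 bits, with a decoder restoring the full signal values
# at execution time. Pattern values are placeholders.

CACHE_PATTERNS = [0b0000, 0b0101, 0b1010, 0b0011]   # default + three used patterns

def encode_cache(pattern):        # compaction: 4 bits -> 2 bits in the ROM
    return CACHE_PATTERNS.index(pattern)

def decode_cache(code):           # decoder in front of the data path
    return CACHE_PATTERNS[code]

for p in CACHE_PATTERNS:
    assert decode_cache(encode_cache(p)) == p   # encoding round-trips exactly
```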

6. Dealing with exceptions
The control state diagram of Figure 14.4 and its associated hardwired or microprogrammed control implementation specify the behavior of the MicroMIPS hardware in the normal course of instruction execution. As long as nothing unusual occurs, instructions are executed one after another, and the intended effect of the program is observed in the updated register and memory contents. However, things can and do go wrong inside the machine. The following are some of the problems, or exceptions, that must be dealt with:

• The ALU operation leads to overflow (an incorrect result is obtained).
• The opcode field holds a pattern that does not represent a legal operation.
• The cache's error-detecting code checker deems an incoming word invalid.
• A sensor or monitor signals a hazardous condition (e.g., overheating).

A common way of dealing with such exceptions is to force an immediate transfer of control to an operating system routine known as an exception handler. This routine initiates remedial action to correct or circumvent the problem or, failing that, terminates the execution of the program to prevent further data corruption or damage to system components.

In addition to exceptions, which are caused by events internal to the machine's CPU, the control unit must deal with interrupts, defined as external events that must be attended to promptly. Whereas exceptions are usually undesirable events, interrupts may signal the early completion of tasks by input/output devices, notify the availability of sensor data, announce the arrival of a message over a network, and so on. Nevertheless, a processor's reaction to an interrupt is quite similar to its handling of exceptions. Accordingly, in what follows, the focus is on the overflow and illegal-operation exceptions to demonstrate the required procedures and mechanisms. Interrupts are discussed in greater depth in Chapter 24. Figure 14.10 shows how the overflow and illegal-operation exceptions can be incorporated into the control state machine of the multicycle MicroMIPS implementation.


Figure 14.10 Exception states 9 and 10 are added to the control state machine.

An arithmetic overflow is detected in state 8, following the ALU operation in state 7. Assertion of the overflow output signal of the ALU forces the state machine into the special state 9. So that the cause of the exception can later be determined by the exception-handling routine, the Cause register is set to 1 (a code for overflow), the current value of the program counter, minus 4 (to nullify its advance to the next instruction), is saved in the exception program counter (EPC) register, and control is transferred to the SysCallAddr address, the entry point of an operating system routine. For the placement of the Cause and EPC registers in the machine hardware, refer to Figure 5.1.

An illegal operation in the opcode field is detected in state 1, where the instruction is decoded. A corresponding signal is asserted by the detection circuit, which then forces the control state machine into the special state 10. Again, a code identifying the cause of the exception is saved, along with the address of the instruction that led to it, before control is transferred to the exception handler.

The two examples just discussed show the general procedure for dealing with an exception or interrupt. A code indicating the cause of the exception or interrupt is saved in the Cause register for the benefit of the operating system. The address of the instruction being executed is saved in the EPC register so that program execution can resume from that point once the problem has been resolved. An mfc0 instruction (similar to mfc1 in Table 12.1) allows the contents of the registers in Coprocessor 0 to be examined. Returning from the exception-handling routine to the interrupted program is similar to returning from a procedure, as discussed in section 6.1.


PROBLEMS
1) Multi-cycle data path details

a) Label each line in Figure 14.2 with the number of binary signals it represents.

b) Repeat part (a) for Figure 14.3.

2) Data path selection options

There are three pairs of multiplexers shown in Figure 14.3. In left-to-right order, the first pair feeds the register file, the second supplies the operands to the ALU, and the third, located near the right edge of the diagram, selects the PC input.

a) The multiplexers that feed the register file have three and two settings, respectively; therefore, there are six combinations when the settings are taken together. For each of these combinations, indicate whether it is ever used and, if so, in executing which MicroMIPS instruction(s).

b) Repeat part a) for the pair of multiplexers that supply the ALU operands (eight combinations).

c) Repeat part a) for the pair of multiplexers that select the PC input (eight combinations).

3) Multi-cycle data path extension

Suggest some simple changes (the simpler the better) to the multicycle data path in Figure 14.3, so
that the following instructions can be included in the machine's instruction set:

a) Load byte (lb).


b) Load unsigned byte (lbu).
c) Store byte (sb).

4) Adding other instructions

If desired, certain instructions can be added to the MicroMIPS instruction set. Consider the following pseudoinstructions from Table 7.1 and assume that you want to include them in MicroMIPS as regular instructions. For each case, choose an appropriate encoding for the instruction and specify all required modifications to the multicycle data path and the associated control circuits. Ensure that the encodings chosen do not conflict with the other MicroMIPS instructions listed in Tables 6.2 and 12.1.

a) Move (move).
b) Load immediate (li).
c) Absolute value (abs).
d) Negate (neg).
e) Not (not).
f) Branch if less than (blt).


5) Control State Machine

In the control state machine of Figure 14.4, state 5 does not fully specify the values to be taken by the control signals; instead, external circuitry is assumed to derive the control signal values in accordance with the notes in the upper left corner of Figure 14.4. Divide state 5 into a minimum number of substates, called 5a, 5b, and so on, such that within each substate all control signal values are fully determined (as is the case for all other states except states 7 and 8).

6) Control state machine

In the control state machine of Figure 14.4, states 7 and 8 do not fully specify the values to be taken by all control signals. The incomplete specifications, to be resolved by external control circuits, consist of the values assigned to the ALUSrcY and ALUFunc control signals in state 7 and to RegDst in state 8. Show that if ALUFunc, which still must be determined externally, is ignored, the ALUSrcY and RegDst specifications can be accommodated entirely through the use of two pairs of states: 7a/8a and 7b/8b.

7) Instruction Decoding in MicroMIPS

The use of two 6-to-64 decoders in Figure 14.5 seems wasteful, since only a small subset of the 64 outputs of each decoder is useful. Show how each of the 6-to-64 decoders in Figure 14.5 can be replaced by a 4-to-16 decoder and a small number of additional logic circuits (the smaller the better). Hint: In the op decoder of Figure 14.5, most of the first 16 outputs are useful, but only two of the remaining 48 are used.

8) Control signals for multicycle MicroMIPS

The logical expressions for a number of the 20 control signals shown in Figure 14.3 were derived in Section 14.3 from the outputs of various decoder circuits. Derive logical expressions for all the remaining signals, with the goal of simplifying and sharing circuit components as much as possible.

9) Multi-cycle control performance

In Example 14.1, let fa, fb, and fc be the relative frequencies of the Ra-, Rb-, and Rc-type instructions, respectively, with fa + fb + fc = 1.

a) Find a relationship among the three relative frequencies if the multicycle design is to be 1.8 times faster than the single-cycle design.

b) Calculate the actual frequencies resulting from part a), assuming that fb = fc.

c) What is the maximum speedup of the multicycle design relative to the single-cycle design over all possible values of the three relative frequencies?

10) Multi-cycle control performance


An instruction set is composed of h different classes of instructions, where the execution time of class i instructions is 3 + i ns, for 1 ≤ i ≤ h. Single-cycle control thus implies a clock cycle of 3 + h ns. Consider a multicycle control implementation with a clock cycle of 1 ns, under which class i instructions execute in 3 + i clock cycles; that is, ignore any overhead associated with multicycle control.

a) Derive the performance advantage (speedup) of multicycle control relative to single-cycle control, assuming that the various instruction classes occur with equal frequency.

b) Show that the performance benefit derived in part (a) is an increasing function of h; hence, any speedup is achievable for a suitably large h.

c) Repeat part (a) for the case where the relative frequency of class i instructions is proportional to 1/i.

d) How does the performance benefit of part c) vary with h, and does the speedup grow indefinitely with increasing h, as it did in part b)?

11) Multi-cycle control performance

Repeat problem 10, but this time assume a clock cycle of 2 ns for the multicycle design, meaning that class i instructions require (3 + i)/2 clock cycles.

12) Single vs. multicycle control

a) Discuss the conditions under which a multicycle control implementation would be inferior to a single-cycle implementation.

b) How does variation in memory access latency affect the relative performance of the single-cycle and multicycle control implementations, assuming that the number of cycles and the actions performed within each cycle do not change?

c) Discuss the implications of your answers to parts a) and b) for the specific case of MicroMIPS.

13) Microinstructions

a) Construct a table with six columns and 37 rows containing the binary field contents for the
microprogram shown in Figure 14.8.

b) Repeat part a), but assume that the microprogram in Figure 14.8 was modified according to Figure
14.9 and the associated discussion.

14) Microprogrammed control

a) How would the microprogram in Figure 14.8 change if the controller hardware in Figure 14.7 contained only one dispatch table instead of two?

b) What changes would be needed in Figure 14.7 if a third dispatch table were included?

c) Argue that the change in part (b) offers no advantage over the microprogram in Figure 14.8.


d) Under what circumstances would the change in part (b) be disadvantageous?

15) Microprogramming

a) In connection with Figure 14.8, it was observed that a simple modification allows two substates for each of states 7 and 8, namely 7regreg, 7regimm, 8regreg, and 8regimm. Is it possible to merge these substates and have only one microinstruction for each of states 7 and 8?

b) What allows the merging of all the substates of state 5, corresponding to the last five microinstructions of the microprogram in Figure 14.8, into a single microinstruction?

16) Microinstruction formats

The discussion at the end of Section 14.5 indicates that the width of a microinstruction can be reduced if a more compact encoding is used for the valid combinations of signal values in each field.

a) Assuming that the same six fields of Figure 14.6 are retained in the microinstruction, and that the most compact encoding is used for each field, determine the width of the microinstruction.

b) Show that the microinstruction width can be reduced further by combining two or more of its fields into a single field. Argue that the savings in microinstruction width do not justify the added decoding complexity and the associated delay.

17) Exception Handling

When an exception is detected, the current contents of the PC could be saved in the exception program counter, instead of (PC) - 4, leaving it to the operating system to derive the address of the offending instruction. Would this modification simplify the multicycle MicroMIPS data path or its associated control state machine? Fully justify your answer.

18) Exception Handling

Assume that instructions and data are stored in the multicycle MicroMIPS caches using an error-detecting code. Each time a cache is accessed, a special error-detection circuit checks the validity of the encoded word and asserts the CacheError signal if the word read is an invalid codeword. Add exception states to Figure 14.10 to deal with an error detected in:

a) An instruction word.
b) A data word from memory.


19) URISC

Consider the URISC processor described in section 8.6 [Mava88].

a) Design a hardwired control unit for URISC.

b) Describe a microprogrammed implementation of URISC and provide the complete microprogram, using appropriate notation.


UNIT 5
MAIN MEMORY CONCEPTS
CHAPTER TOPICS

1 Memory structure and SRAM

2 DRAM and Refresh Cycles

3 Hitting the memory wall

4 Pipelined and Interleaved Memories

5 Non-volatile memories

6 Need for a Memory Hierarchy

The main memory technology now in use is very different from the magnetic drums and core memories of early digital computers. Today's semiconductor memories are at once faster, denser, and cheaper than their predecessors. This chapter reviews memory organization, including the SRAM and DRAM technologies typically used for fast caches and for slower main memory, respectively. It shows that memory has become a serious limiting factor in computer performance and discusses how certain organizational techniques, such as interleaving and pipelining, can mitigate some of the problems. It concludes by justifying the need for a memory hierarchy, in preparation for the discussion of cache and virtual memories.

1. Memory structure and SRAM


Static random access memory (SRAM) is a large array of storage cells that are accessed like registers. An SRAM cell usually requires four to six transistors per bit and retains stored data for as long as power is applied. This is in contrast to dynamic random access memory (DRAM), discussed in Section 17.2, which uses only one transistor per bit and must be refreshed periodically to avoid loss of the stored data. Both SRAM and DRAM lose their data when disconnected from power; hence their designation as volatile memories. As shown in Figure 2.10, a 2^h × g SRAM chip has an h-bit input that carries the address. This address is decoded and used to select one of 2^h locations, indexed from 0 to 2^h - 1, each of which is g bits wide. During a write operation, g bits of data are supplied to the chip to be copied into the addressed location. During a read operation, g bits of data are read out of the addressed location. The Chip select (CS) signal must be asserted for both read and write operations, while Write enable (WE) is asserted for writing and Output enable (OE) for reading (Figure 17.1). An unselected chip performs no operation at all, regardless of the values of its other input signals.

For ease of understanding, the storage cells in Figure 17.1 are drawn as edge-triggered D flip-flops. In practice, D latches are used, because flip-flops would add complexity to the cells and thus allow fewer cells on a chip. The use of latches causes no major problems during read operations; however, for write operations, more stringent timing requirements must be met to ensure that data is written properly to the desired location, and only to that location. Detailed discussion of such timing considerations is beyond the scope of this book [Wake01].


A synchronous SRAM (SSRAM) presents a cleaner interface to the designer while still using D latches internally. Synchronous behavior is achieved by having internal registers hold the address, data, and control inputs (and perhaps the data output). Because the input signals are supplied in one clock cycle, while the memory is accessed in the next cycle, there is no danger of the inputs to the memory array changing at inopportune times.

If the size of a 2^h × g SRAM chip is inadequate for our storage needs, multiple chips can be used: k/g chips are used in parallel to obtain k-bit data words, and 2^m/2^h = 2^(m-h) rows of chips are used to obtain a capacity of 2^m words.

Example 17.1: Multi-chip SRAM

Show how 128K × 8 SRAM chips can be used to build a 256K × 32 memory unit.

Solution: Because the desired word width is four times that of the SRAM chip, four chips must be used in parallel. Two rows of chips are needed to double the number of words from 128K to 256K. The required structure is shown in Figure 17.2. The most significant bit of the 18-bit address is used to select row 0 or row 1 of SRAM chips; all other chip control signals are tied externally to the memory unit's common control inputs.
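The chip-count arithmetic in this example generalizes readily. Here is a minimal Python sketch (the function name and interface are illustrative, not from the text) that computes the parallel-chip count and row count for building a larger memory out of smaller SRAM chips:

```python
def sram_array(chip_words, chip_width, mem_words, mem_width):
    """Return (chips per row, rows of chips, row-select bits) needed to
    build a mem_words x mem_width memory from chip_words x chip_width
    SRAM chips, assuming all quantities are powers of two."""
    chips_per_row = mem_width // chip_width  # chips in parallel for word width
    rows = mem_words // chip_words           # rows of chips for capacity
    select_bits = rows.bit_length() - 1      # address bits that pick a row
    return chips_per_row, rows, select_bits

# Example 17.1: 128K x 8 chips forming a 256K x 32 memory unit
print(sram_array(128 * 1024, 8, 256 * 1024, 32))  # -> (4, 2, 1)
```

With eight chips in all, a single address bit selects the row of chips, matching the structure of Figure 17.2.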

Frequently, the data input and output pins of an SRAM chip are shared; alternatively, separate input and output pins may be tied to the same bidirectional data bus. The latter makes sense because, unlike a register file, which can be read and written during the same clock cycle, a memory unit performs either a read or a write operation, but not both at once. To ensure that the SRAM output data does not interfere with the input data supplied during a write operation, the control circuit shown at the top of Figure 17.1 must be slightly modified. This modification, which involves disabling the data output when Write enable is asserted, is shown in Figure 17.3.


Early memory chips were organized as 2^h × 1 memories. This was because chip capacities were very limited and any memory of reasonable size required many RAM chips. Having one bit of every data word on a chip offers the advantage of error isolation: a failing chip affects no more than one bit in each word, allowing error-detecting and error-correcting codes to flag or mask the failure effectively. Nowadays it is not uncommon for all the memory an application needs to fit on one chip or a handful of chips; chips with byte-wide words have therefore become very common. However, beyond eight-bit data input and output, pin limitations make it unattractive to widen memory words further. Note that, as shown in Figure 2.10, a random-access memory array is built with very wide words that are read into an internal buffer, from which the portion corresponding to an external memory word is selected for output. So, apart from pin limitations, there is no real impediment to having wider memory words.


2. DRAM and Refresh Cycles

It is impossible to build a bistable element with a single transistor. To enable one-transistor memory cells, which lead to the highest possible storage density on a chip at a very low cost per bit, dynamic random access memory (DRAM) stores data as electrical charge on a small capacitor, accessed through a MOS pass transistor. Figure 17.4 shows such a cell in schematic form. When the word line is asserted, a low (high) voltage on the bit line causes the capacitor to discharge (charge), thereby storing 0 (1). To read a DRAM cell, the bit line is first precharged to half voltage. Upon assertion of the word line, this voltage is then pulled slightly down or up, depending on whether the cell holds 0 or 1. The voltage change is detected by a sense amplifier, which recovers a 0 or 1 accordingly. Because the act of reading destroys the contents of the cell, such a destructive read must be followed immediately by a write operation to restore the original value.

Charge leakage from the small capacitor shown in Figure 17.4 causes the stored data to fade after a fraction of a second. Consequently, DRAMs must be equipped with special circuits that periodically refresh the memory contents. Refreshing is performed for all memory rows by reading each row and rewriting it to restore the charge to its original value. Figure 17.5 shows how leakage leads to charge decay in a DRAM cell that stores 1, and how periodic refreshing restores the charge just before the voltage across the capacitor drops below a critical threshold. DRAMs are much cheaper, but also slower, than SRAMs. In addition, some of a DRAM's potential memory bandwidth is lost to the rewrite operations that restore data destroyed during reads and to the refresh cycles themselves.

Example 17.2: Bandwidth loss to refresh cycles

A 256 Mb DRAM chip is organized externally as a 32M × 8 memory and internally as a 16K × 16K square array. Each row must be refreshed at least once every 50 ms to prevent data loss; refreshing a row takes 100 ns. What fraction of the total memory bandwidth is lost to refresh cycles?

Solution: Refreshing all 16K rows takes 16 × 1024 × 100 ns = 1.64 ms. Therefore, out of every 50 ms period, 1.64 ms is lost to refresh cycles. This represents 1.64/50 ≅ 3.3% of the total bandwidth.
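The refresh-overhead calculation is a one-liner; the following sketch simply transcribes the numbers of Example 17.2:

```python
# Fraction of DRAM bandwidth lost to refresh (numbers from Example 17.2)
rows = 16 * 1024        # 16K rows in the internal array
t_refresh_row = 100e-9  # 100 ns to refresh one row
period = 50e-3          # every row must be refreshed once per 50 ms

lost = rows * t_refresh_row    # about 1.64 ms of refreshing per period
print(f"{lost / period:.1%}")  # -> 3.3% of the total bandwidth
```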

Partly because the large capacity of DRAM chips would require a large number of address pins, and partly because, inside the chip, access to any word occurs in two steps (row access, then column selection), a DRAM usually has half as many address pins as its capacity would dictate. For example, a 2^18 × 4 DRAM chip may have only nine address pins. To access a word, its row address is first supplied to the memory array, along with a row address strobe (RAS) signal that indicates the availability of the row address to the memory unit, which copies it into an internal register. The column address is then supplied on the same address lines, concurrently with the column address strobe (CAS) signal. Figure 17.6 shows a typical 24-pin DRAM package with 11 address pins and four data pins.
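The address multiplexing amounts to sending the address in two halves over the same pins. A minimal sketch for the 2^18-word chip mentioned above (the 9-bit split follows from that example; the function name is our own):

```python
def split_dram_address(addr, col_bits=9):
    """Split a full DRAM address into the row half (latched on RAS)
    and the column half (latched on CAS), sent over the same pins."""
    row = addr >> col_bits               # upper bits, supplied first
    col = addr & ((1 << col_bits) - 1)   # lower bits, supplied second
    return row, col

print(split_dram_address(0x25A7B))  # an 18-bit address -> (row, column)
```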

The following describes the timing of events on a DRAM chip during the three operations: refresh, read, and write. A refresh can be performed by supplying only a row address to the chip. In this RAS-only refresh mode, the falling edge of RAS causes the corresponding row to be read into an internal row buffer, and the rising edge causes the contents of the row to be written back. Operations are asynchronous and no clock is involved; for this reason, the timing of the RAS signal is quite critical. A read cycle begins like a refresh cycle, but the CAS signal is used to enable the chip's output bits, which are selected from the contents of the row buffer using the column address supplied concurrently with CAS. A write cycle also begins like a refresh cycle; however, before the rising edge of RAS triggers the write-back operation, the data in the row buffer is modified by asserting the Write enable and CAS signals along with the application of the column address. This causes the appropriate part of the row buffer to be modified before the entire buffer is written back.

Typical DRAMs have other modes of operation besides RAS-only refresh, read, and write. One of the most useful variations is known as page mode. In this mode, once a row has been selected and read into the internal row buffer, subsequent accesses to words in the same row do not require a full memory cycle. Each such access is accomplished by supplying only the corresponding column address and is thus significantly faster.

It is evident from the preceding discussion that signal timing is quite critical to DRAM operation. This not only makes the design process difficult, it also leads to lower performance, owing to the need for adequate safety margins. For this reason, synchronous DRAM (SDRAM) variations have become very popular. As was the case for SSRAM, the internal operation of an SDRAM remains asynchronous, but the external interface is synchronous. Usually, the row address is supplied in one clock cycle and the column address in the next. A command word specifying the operation to be performed is also supplied in the first cycle. The actual access to the memory contents takes several clock cycles. However, the memory is organized internally as multiple banks, so when an access to one bank is initiated, the waiting time can be used to process further input commands that may involve access to other banks. In this way, a much higher aggregate throughput can be supported. This is particularly valuable because modern DRAMs are accessed in bursts, to read instruction streams or multiword cache blocks. A burst length is often supplied as part of the input command to the DRAM controller, which then causes the specified number of words to be transferred, one per clock cycle, once the row has been read into the row buffer.
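A rough illustration of why bursts help: the sketch below compares per-word throughput for different burst lengths, under an assumed 4-cycle initial latency (the numbers are hypothetical, not taken from any particular datasheet):

```python
def words_per_cycle(burst_length, initial_latency=4):
    """Average SDRAM throughput in words per clock cycle:
    initial_latency cycles to reach the row, then one word per cycle."""
    return burst_length / (initial_latency + burst_length)

for bl in (1, 2, 4, 8):
    print(bl, round(words_per_cycle(bl), 2))  # longer bursts amortize latency
```

The single-word access achieves only 0.2 words per cycle, while an eight-word burst reaches about 0.67, which is why burst transfers suit cache-line fills so well.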

Two improved implementations of SDRAM are currently in wide use. The first is known as double-data-rate DRAM (DDR DRAM); it doubles the transfer rate of the DRAM by using both edges of the clock signal to trigger actions. The other, Rambus DRAM (RDRAM), combines dual-edge operation with a very fast, but relatively narrow, channel between the DRAM controller and the memory array. The narrowness of the channel is intended to facilitate the production of a high-quality, skew-free link that can be driven at very high clock rates. RDRAMs are expected to adhere to the Rambus channel specifications for timing and pinout [Shri98], [Cupp01].

Figure 17.7 presents an interesting view of the progress in DRAM densities and chip capacities since 1980. It shows, for example, that a 512 MB main memory, which would have required hundreds of DRAM chips in 1990, could be built with a handful of chips in 2000 and now requires only one DRAM chip. Beyond this, such a memory unit can be integrated on a VLSI chip together with a CPU and the other components required to form a powerful single-chip computer.

3. Hitting the memory wall


While both SRAM and DRAM densities and chip capacities have grown steadily since the invention of the integrated circuit, the improvements in speed have not been as impressive. Thus, ever faster processors have led to a widening gap between CPU performance and memory bandwidth. In other words, processor performance and density, as well as memory density, increase exponentially, as predicted by Moore's law; this is represented by the upper line in the semilogarithmic graph of Figure 17.8. Memory performance, however, has improved along a much more modest slope. This CPU-memory performance gap is particularly problematic for DRAM. The speed of SRAMs has shown greater improvement, not only within memory chips but also because caches, a primary application of SRAMs, are now routinely integrated with the processor on the same chip, reducing the signal propagation times and other access overheads incurred when an off-chip unit must be accessed.

For these reasons, memory latency has become a serious problem in modern computers. It has been argued that, even now, further improvements in processor speed and clock rate yield no noticeable gain in performance at the application level, because the performance perceived by the user is largely limited by memory access time. This condition is called "hitting the memory wall," which casts memory as the barrier to further progress [Wulf95]. For example, note that if the trends shown in Figure 17.8 continue, by 2010 each memory access will have an average latency equivalent to hundreds of processor cycles. Therefore, even if only 1% of instructions require access to DRAM main memory, it is the DRAM latency, not the speed of the processor, that dictates a program's execution time.

These arguments are discouraging, but not all hope is lost. Even if no single breakthrough solves the performance-gap problem, many existing and proposed methods can be used to push back the memory wall.

One way to bridge the gap between processor and memory speed is to use very wide words, so that each slow memory access retrieves a large amount of data; a multiplexer is then used to select the appropriate word to pass on to the processor (Figure 17.9), as the sketch following this paragraph illustrates. The remaining words that are fetched may prove useful for subsequent instructions. We will see later that data is transferred between main memory and cache in multiword units known as cache lines; such wide-word fetches are therefore ideal for modern caching systems. Two other approaches, namely interleaving and pipelining, have special significance and merit more detailed description; accordingly, they are discussed separately in Section 17.4.
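A minimal sketch of the wide-word idea: each slow access reads an entire row, and the low-order address bits drive the multiplexer that picks out the requested word (all names and the 4-word row width are illustrative assumptions):

```python
WORDS_PER_ROW = 4  # width of each wide fetch, an assumed figure

def read_word(memory_rows, addr):
    """Fetch a wide row, then mux out the requested word. The other
    words of the row are candidates for reuse, e.g., by a cache."""
    row = memory_rows[addr // WORDS_PER_ROW]  # one slow, wide access
    return row[addr % WORDS_PER_ROW]          # fast multiplexer selection

rows = [[10, 11, 12, 13], [20, 21, 22, 23]]
print(read_word(rows, 6))  # -> 22 (row 1, word 2)
```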

Even if we assume that the growing performance gap cannot be overcome through the architectural methods just mentioned, the only conclusion to be drawn is that the performance of a single processor will be limited by memory latency. In principle, the power of tens, perhaps even hundreds, of processors can be applied to the solution of a problem. Although each of these processors would still be limited by the memory issue, collectively they can provide a level of performance that may be viewed as breaking right through the memory wall. Parallel processing methods are studied in Part Seven.


4. Pipelined and Interleaved Memories


Simple memory units allow access to their locations only one at a time. Although register files are provided with multiple ports to allow several simultaneous accesses, this approach to increasing bandwidth would be very costly for main memory. So how can total memory throughput be increased, so that more data can be accessed per unit of time? The two main schemes for this purpose are pipelined and parallel (interleaved) data access.

Memory pipelining allows the memory access latency to be spread over several pipeline stages, thereby increasing total memory throughput. Without pipelining of the instruction and data accesses in a processor's data path, performance would be limited by the pipeline stages containing memory references. Memory pipelining is possible because access to a memory bank consists of several sequential events or steps: row address decoding, reading of the row from the memory array, writing of a word or forwarding it to the output (based on the column address), and writing the modified row buffer back to the memory array, if necessary. By separating some of these steps and viewing them as different stages of a pipeline, a new memory access can be initiated as soon as the previous one has moved on to the second pipeline stage. As always, pipelining may increase latency a little, but it improves overall throughput by allowing multiple memory accesses to be in flight through the memory pipeline stages.

In addition to the physical memory access just discussed, a memory access involves other supporting operations that form additional stages of the memory pipeline. These include possible address translation before memory (cache or main) is accessed, and tag comparison after a cache is accessed, to ensure that the data word read from the cache is the requested one (at this point it is not yet clear why the two might differ). Chapter 18 discusses the concepts related to cache access, while Chapter 20 covers address translation for virtual memory.


Figure 17.10 shows a pipelined cache based on the notions discussed above. There are four stages in the pipeline: an address translation stage, two stages for the actual memory access, and a tag comparison stage to ensure the validity of the fetched data. The address translation stage is not always necessary: for one thing, caches do not always require it; for another, it can sometimes overlap with the physical memory access (these points are discussed in Chapter 18). For an instruction cache, the tag comparison stage can overlap with instruction decoding and register access, so that its latency becomes transparent; for a data cache, which is usually followed by register writeback, such overlap is not possible. Therefore, if pipelined memory were used in the data path of Figure 15.11, an additional pipeline stage would have to be inserted between data cache access and register writeback.

Interleaved memory allows multiple accesses to proceed in parallel, provided they are to addresses located in different memory banks. Figure 17.11 shows the block diagram of a four-way interleaved memory unit. Each of the four memory banks holds the addresses that have the same residue when divided by 4; thus, bank 0 holds addresses of the form 4j, bank 1 holds addresses 4j + 1, and so on. As each memory access request arrives, it is routed to the appropriate memory bank based on the two LSBs of the address. A new request can be accepted on every clock cycle, even though a memory bank takes several clock cycles to supply the requested data or to write a word. Because all banks have the same latency, the requested data words emerge on the output side in first-come, first-served order.

An interleaved memory unit can form the core of a pipelined memory implementation. Consider the four-way interleaved memory of Figure 17.11 as a four-stage pipelined memory unit. Figure 17.11 shows that, as long as accesses to the same memory bank are separated by four clock cycles, the memory pipeline operates smoothly. Within a processor's data path, however, such access spacing cannot be guaranteed, and a special conflict detection mechanism is needed to ensure correct operation, stalling the pipeline when necessary. Such a mechanism is quite easy to implement. An incoming access request is characterized by a two-bit memory bank number. This bank number must be checked against the bank numbers of up to three accesses that may still be in progress (those that may have started in each of the three preceding clock cycles). If any of these three comparisons yields a match, the pipeline must be stalled and the new access held back. Otherwise, the request is accepted and routed to the appropriate bank, while its bank number enters a four-position shift register to be compared against incoming requests over the next three cycles.
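The stall check just described can be modeled in a few lines. The sketch below is an illustrative software model, not a hardware description (class and method names are our own); it keeps the bank numbers of the accesses begun in the last three cycles and rejects a request that matches any of them:

```python
from collections import deque

class InterleavedMemory:
    """Stall logic for a 4-way interleaved, 4-stage pipelined memory."""
    def __init__(self, banks=4, in_flight_slots=3):
        self.banks = banks
        # bank numbers of accesses begun in the last three cycles
        self.in_flight = deque([None] * in_flight_slots,
                               maxlen=in_flight_slots)

    def try_access(self, addr):
        bank = addr % self.banks         # the two LSBs select the bank
        if bank in self.in_flight:       # conflict: pipeline must stall
            self.in_flight.append(None)  # the cycle passes with no access
            return False
        self.in_flight.append(bank)      # remembered for three more cycles
        return True

mem = InterleavedMemory()
print([mem.try_access(a) for a in (0, 1, 2, 4)])  # [True, True, True, False]
```

The fourth request targets bank 0 again only three cycles after the first, so it is stalled; one cycle later it would be accepted.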

The pipelining scheme discussed at the start of this section can accommodate only a limited number of stages. Interleaving-based pipelined memory can, conceptually, be used with an arbitrary number of stages. However, except in applications with fairly regular memory access patterns, the complexity of the control circuits and the stall penalty place a practical upper bound on the number of stages. For this reason, large-scale interleaving is used only in vector supercomputers, which are optimized to operate on long vectors whose layout in memory ensures that memory accesses that are close together in time are routed to different banks (Chapter 26).

5. Non-volatile memory
Both SRAM and DRAM require power to keep the stored data intact. This kind of volatile memory must be supplemented with non-volatile, or stable, memory if data and programs are not to be lost when power is interrupted. For most computers, this stable memory consists of two parts: a relatively small read-only memory (ROM) holding the crucial system programs needed to boot the machine, and a hard disk that retains stored data virtually indefinitely without requiring power. The low cost and high storage density of modern disk memories make them ideal stable storage devices. Owing to their large capacities, disk memories play the dual roles of stable storage and mass storage (Chapter 19).

In small portable devices with limited space and battery capacity, the size and power requirements of disk memories are serious disadvantages, as are their moving mechanical parts. For these reasons, such systems prefer non-volatile semiconductor memory devices. The available options cover a wide range, from older, well-established read-only memory devices to the latest read-write devices exemplified by flash memory.

Read-only memories are built in a variety of ways, but all of them share the property of having a specific pattern of 0s and 1s wired in at the time of manufacture. For example, Figure 17.12 shows a four-word segment of a ROM in which a normally high bit line assumes a low voltage when a selected word line is pulled down and a diode exists at the intersection of the bit line and the selected word line. Diodes are placed in the bit cells that are to store 1; the absence of a diode then corresponds to 0. If diodes are placed at all intersections and a mechanism is provided to selectively disconnect each unneeded diode, by blowing a fuse, a programmable ROM (PROM) results. A PROM is programmed by placing it in a special device (a PROM programmer) and applying currents to blow the selected fuses.

An erasable PROM (EPROM) uses a transistor in each cell that acts as a programmable switch. The contents of an EPROM can be erased (set to all 1s) by exposing the device to ultraviolet light for a few minutes. Since ROMs and PROMs are simpler and cheaper than EPROMs, EPROMs are used during system development and debugging, with the contents moved to ROMs once the data or programs are finalized. Erasing data in an electrically erasable PROM (EEPROM) is both more convenient and selective: any given bit in the memory array can be erased by applying an appropriate voltage to the corresponding transistor. There is often a limit (usually many hundreds) on how many times an EEPROM cell can be erased before it loses the ability to retain information. However, unlike in DRAM, the stored charge is trapped so that it cannot escape; it therefore requires no refreshing or power to stay in place for many years. Since erasure occurs in blocks and is relatively slow, flash memory is not a substitute for SRAM or DRAM; rather, it is used primarily to store default configuration or setup information for digital systems, information that changes rather infrequently and must be preserved when the system is powered down.

Other promising technologies for non-volatile random access memories include ferroelectric RAM, magnetoresistive RAM, and ovonic unified memory [Gepp03].

6. Need for a memory hierarchy

To match processor speed, the memory holding program instructions and data would have to be accessible in 1 ns or less. Such memory can be built, but only in small sizes (for example, a register file). The bulk of a program and its data must be kept in slower, larger memory. The challenge is to design the overall memory system so that it appears to have the speed of its fastest component and the per-unit cost of its cheapest. While increasing the bandwidth of slow memory can help bridge the speed gap, this approach requires increasingly sophisticated methods for hiding memory latency and eventually breaks down when the gap grows wide enough. Example 17.3 shows that increasing bandwidth causes problems even in the absence of limits on latency-hiding methods.

Example 17.3: Memory bandwidth for a given throughput

Estimate the minimum main memory bandwidth required to sustain an instruction execution rate of 10 GIPS. Assume there is no cache or other fast buffer for instructions or data. Would this bandwidth be feasible with a memory latency of 100 ns and a bus frequency of 200 MHz?


Solution: The assumption of no cache means that every executed instruction must be fetched from main memory. Thus, an execution rate of 10 GIPS implies at least 10 billion instruction accesses per second. With four-byte instructions, this translates into a minimum memory bandwidth of 40 GB/s, even if no data word is ever fetched or stored. Pulling 40 B/ns out of a memory with 100 ns access latency would require 4,000 B to be read out on each access. This access width, while not completely infeasible, is rather impractical. Note that the analysis rests on three very optimistic assumptions: a total absence of data references, the ability to spread memory accesses evenly over time, and perfect latency hiding. Transferring 40 GB/s from memory to the processor over a 200 MHz bus would require a bus width of 200 B. Again, the assumption that no bus cycle goes to waste is optimistic.
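The arithmetic of this example is easily transcribed; the sketch below just recomputes its figures:

```python
# Back-of-the-envelope check of Example 17.3's numbers
ips = 10e9          # 10 GIPS execution rate
instr_bytes = 4     # four-byte instructions
bandwidth = ips * instr_bytes  # 40 GB/s, instruction fetches alone

latency = 100e-9    # 100 ns memory access latency
print(bandwidth * latency)     # -> 4000.0 bytes needed per access

bus_hz = 200e6      # 200 MHz bus, one transfer per cycle assumed
print(bandwidth / bus_hz)      # -> 200.0 byte bus width required
```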

The fact that small memories can be built with access times of no more than a few nanoseconds leads to the idea of a buffer memory that holds the currently most useful program segments and data (those that are frequently referenced or known to be needed very soon), much as registers hold some of the currently active data elements for quick access by the processor. This buffer, or cache, memory may be too small to hold all the required programs and data, so objects must be brought into it from main memory. Such a cache does not completely bridge the speed gap between the processor and main memory, so a second-level cache is often used. These two cache levels, along with the registers, main memory, secondary memory (usually disk), and tertiary or archival memory, are shown in Figure 17.14.

The interacting memories of various capacities, speeds, and costs shown in Figure 17.14 are said to constitute a memory hierarchy, within which the most useful data and program components are somehow moved to the higher levels, where they are more easily accessible. In practice, every modern computer system has a hierarchical memory, although not all the levels shown in Figure 17.14 are present in every case.

The pyramid shape of Figure 17.14 is intended to convey the small size of the memory units at the top and the large capacities at the bottom. Fast register files have capacities measured in hundreds, or at most thousands, of bytes. Currently, multiterabyte tertiary memories are practical, and the introduction of petabyte units is under consideration. Of course, cost and speed vary in the opposite direction, with the smaller memories near the top being the fastest and the most expensive. Examining the access latencies given in Figure 17.14 shows that the latency ratios between successive levels are quite small near the top (ten or less). Suddenly, the ratio grows to almost 10^5 between primary and secondary memories, and it is also large, though not as bad, between secondary and tertiary memories. This speed gap between semiconductor and magnetic/optical memories is a major impediment to high performance in data-intensive applications.

The levels of the memory hierarchy in Figure 17.14 are numbered as follows: 0 for the registers, 1 and 2 for the caches (known as the L1 and L2 caches), 3 for main memory, and so on. Sometimes the registers are excluded from the memory hierarchy, in part because the mechanisms for moving data into and out of registers (explicit load/store instructions) differ from those used for data transfers between the other levels. Also, in some architectures, certain data items are never moved into registers and are processed directly from the cache. However, because the focus of this book is on load/store architectures, in which data must be loaded into registers before it can be manipulated, it is appropriate to include the registers as level 0 of the memory hierarchy. Note that the register file is used as a high-speed buffer for data only; however, many modern processors use an instruction buffer that plays the same role for instructions that the register file plays for data.

The register level, or the level 1 cache, is sometimes considered the highest level of the memory hierarchy. Although this nomenclature is consistent with the pyramid view of Figure 17.14, it can cause confusion when, for example, level 1 is described as higher than level 3. Therefore, this book avoids characterizing the levels of the memory hierarchy as "high" or "low," using the names or numbers of the levels instead. When relative positions must be specified, one level is characterized as faster or closer to the processor (or, conversely, slower or farther from the processor). For example, when a required object is not found at a given level of the memory hierarchy, the next slower level is consulted. This process of moving to the next slower level continues until the object is found. A located object is then copied through successively faster levels on its journey toward the processor, until it reaches level 0, where it is accessible to the processor.

This chapter has dealt with the technology and basic organization of the memory units that serve as main memory and cache in Figure 17.14. Registers and register files were discussed in Chapter 2. What remains to be discussed is how information moves transparently between cache and main memory (Chapter 18), the mass memory technologies and data storage schemes employed as the secondary and tertiary components of the memory hierarchy (Chapter 19), and the management of information transfers between mass and main memories (Chapter 20).

PROBLEMS
1) SRAM memory organization

Two SRAM chips are provided, each forming a memory of w words that are b bits wide, where b and w are even. Show how to use a minimum amount of external logic to form a memory with each of the following properties, or argue that it is impossible to do so.

a) w words that are 2b bits wide

b) 2w words that are b bits wide

c) w/2 words that are 4b bits wide


d) 4w words that are b/2 bits wide

e) Store duplicate copies of w words that are b bits wide and compare the two copies during each
access for error detection

2) SRAM memory organization

Consider 4 Mb SRAM chips with three different internal organizations, offering one-, four-, and eight-bit data widths. How many of each type of chip would be needed to build a 16 MB memory unit with each of the following word widths, and how should they be interconnected?

a) 8-bit words
b) 16-bit words
c) 32-bit words

3) DRAM Memory Organization

A survey of the desktop and laptop computers in one organization revealed that they have DRAM main memories ranging from 256 MB to 1 GB, in 128 MB increments. It is known that these machines use DRAM chips with capacities of 256 Mb or 1 Gb and that the two types are never mixed in the same machine. What do these pieces of information reveal about the number of memory chips in the PCs? In each case, specify the possible data widths of the chips used, given that the DRAM memory word width is 64 bits.

4) Trends in DRAM technology

Assume that the trends in Figure 17.7 continue for the foreseeable future. What would be the
expected range of values for the number of memory chips used and the overall memory capacity for:

a) Workstations in 2008?
b) Servers in 2015?
c) Supercomputers in 2020?
d) High-end PCs in 2010?

5) Designing a Wide Word DRAM

An SDRAM unit supplies 64-bit data words; it takes four clock cycles for the first word and one cycle for each subsequent word (up to seven more) within the same memory row. How could you use this unit as a component to build a wide-word SDRAM with a fixed access time, and what would be the resulting memory latency for word widths of 256, 512, and 1,024 bits?

6) Pipelined DRAM operation

In the description of DRAM it was noted that a DRAM chip needs half as many address pins as an SRAM of the same size, because the row and column addresses do not have to be supplied simultaneously. How does the pipelined operation of Figure 17.10 change this situation?


7) Processor-in-memory architectures

It has been argued that, as external memory bandwidth becomes a serious bottleneck, the much higher internal bandwidth (due to an entire row of the memory array being read at once) might be exploited by implementing many simple processors on a DRAM memory chip. Multiple processors could then manipulate segments of the long internal memory word that is accessed in each memory cycle. Briefly discuss the advantages, and the disadvantages or implementation issues, of this approach to tearing down the memory wall.

8) Pipelined memory via interleaving

Provide the detailed design of the conflict detection and stall mechanism required to convert a four-way interleaved memory unit into a four-stage pipelined memory, as discussed in Section 17.4.

9) Data layout in interleaved memories

Assume that the elements of two 1,000 × 1,000 matrices, A and B, are stored in the four-way interleaved memory unit of Figure 17.11 in row-major order. Thus, the thousand elements of a row of A or B are spread evenly among the banks, while the elements of a column all fall in the same bank.

a) Assuming that computation time is negligible compared with memory access time, derive the total memory throughput when the elements of A are added to the corresponding elements of B in row-major order.

b) Repeat part (a), but this time assume that the additions are performed in column-major order.

c) Show that for some m × m matrices, the computations of parts a) and b) lead to comparable total memory throughputs.

d) Repeat part (a), but this time assume that B is stored in column-major order.

10) Access stride in interleaved memories

A convenient way to analyze the effect of data layout in interleaved memories on total memory throughput is through the notion of access stride. Consider an m × m matrix stored in an h-way interleaved memory in row-major order. When the elements of a particular row of such a matrix are accessed, memory locations x, x + 1, x + 2, . . . are addressed; the access stride is said to be 1. The elements of a column fall in locations y, y + m, y + 2m, . . ., for an access stride of m. Accessing the elements on the main diagonal leads to an access stride of m + 1, while the elements of the antidiagonal yield a stride of m - 1.

a) Show that total memory throughput is maximized whenever the access stride s is relatively prime to h.

b) One way to guarantee maximum total memory throughput for all strides is to choose h to be a prime number. Why is this not a particularly good idea?


c) Instead of storing each row of the matrix starting with its element 0, one can begin the storage of row i with its ith element, wrapping around to the beginning of the row once the last element has been placed. Show that this kind of skewed storage improves total memory throughput in some cases.

11) Read-only memory

One application of read-only memories is in function evaluation. Suppose you want to evaluate a function f(x), where x is an eight-bit fraction and the result is to be obtained with 16 bits of precision.

a) What size ROM is needed, and how should it be organized?

b) Given that memories are slower than logic circuits, would this method ever be faster than conventional function evaluation using ALU-type arithmetic circuits?

12) Choice of memory technology

a) The specifications of a digital camera indicate that its photo storage subsystem contains an SRAM memory unit just large enough to hold a couple of photographs and a flash memory module that can store about 100 photographs. Discuss the reasons behind the choices of technology and capacity for the two memory components.

b) Repeat part (a) for an electronic organizer that stores several thousand names and their associated contact information using the same combination of memories: a small SRAM unit and a much larger flash memory module.

13) Organizing flash memory

Flash memory is random-access as far as reading is concerned. For writing, however, an arbitrary data element cannot be modified in place, because erasure of individual words is not possible. Study alternative data organizations in flash-like memories that allow selective write operations.

14) Memory hierarchy features

Augment Figure 17.14 with a column that specifies the approximate peak data rates at the various levels of the hierarchy. For example, a 32-bit, three-port register file (two reads and one write per cycle) with an access time of about 1 ns can support a data rate of 3 × 32 × 10^9 b/s = 12 GB/s ≅ 0.1 Tb/s. This would rise to about 1 Tb/s if the registers were twice as fast and twice as wide, and could be accessed through a few more read/write ports.

15) Memory hierarchy features

Based on Figure 17.14, which level of the memory hierarchy is likely to have the highest cost? Note that the question asks about total cost, not cost per byte.


16) Analogy for the hierarchical memory of a computer

Consider the hierarchical way a person deals with phone numbers. She has memorized a few important numbers. Numbers for other key contacts are kept in a pocket phone book or electronic organizer. Going further down the equivalent of Figure 17.14 in this case, we arrive at the city's telephone directory and, finally, at the collection of directories from all over the country available at the local library. Draw and properly label a pyramid diagram to represent this hierarchical system, and mention the important characteristics of each level.

17) History of Core Memory Technology

Fast, large-capacity random access memories are now taken for granted. Modern desktop and laptop computers have RAM capacities that surpass the secondary memories of the most powerful early digital computers. Those machines did not have random access memories; rather, they used serial memories in which data objects moved along a circular path, and access to a particular object required a long wait until the object appeared in the correct position for reading. Even when random access capability appeared with magnetic core memories, the manufacturing process was complicated and expensive, leading to main memories that were far too small by today's standards. Other examples of now-abandoned memory technologies include plated-wire memories, acoustic delay-line memories, and magnetic bubble memories. Choose one of these, or some other pre-1970 memory technology, and write a report describing the technology, the associated memory organizations, and the range of practical applications in digital computers and elsewhere.


CACHE MEMORY ORGANIZATION

CHAPTER TOPICS

1 The Need for a cache

2 What makes a cache work?

3 Direct-Mapped Cache

4 Set-Associative Cache

5 Cache and Main Memory

6 Improving Cache Performance

Maurice Wilkes' quote shows that the idea of using a fast memory (slave or cache) to bridge the speed gap between a processor and a slower, but larger, main memory has been around for some time. Even though memories have become faster in the intervening years, the processor-memory speed gap has widened so much that the use of a cache is now almost mandatory. The relationship of cache to main memory is like that of a desk drawer to a filing cabinet: a more easily accessible place to keep the data of current interest for the period of time during which it is likely to be accessed. This chapter reviews strategies for moving data between main and cache memories, as well as ways to quantify the resulting performance improvement.

1. The Need for a cache


Memory access latency is a major performance hurdle in modern computers. Improvements in memory access time (currently tens of nanoseconds for large off-chip memories) have not kept pace with processor speed (< 1 ns to a few nanoseconds per operation). Because larger memories tend to be slower, the problem is made worse by growing memory sizes. In this context, most processors use a relatively small fast memory to hold instructions and data, so that most accesses to slow main memory are avoided. Data still needs to be moved from the slower/larger memory to the smaller/faster memory and back, but these transfers can overlap with computation. The term "cache" (a safe hiding place) is used for this small/fast memory because it is usually invisible to the user, except through its effect on performance.

It is now very common to use a second-level cache to reduce accesses to main memory. Figure 18.1 shows the possible relationships between the caches and main memory. The level 2 cache can be added to a system with a level 1 cache and main memory in two ways. The first option (Figure 18.1a) inserts the level 2 cache between the level 1 cache and main memory. The second option connects the level 1 cache to both the level 2 cache and main memory directly, via two different memory buses (sometimes called the back-side and front-side buses).


Consider a single-level cache. To access a required data word, the cache is queried first. Finding the data there is called a cache hit; not finding it constitutes a cache miss. An important parameter in evaluating cache effectiveness is the hit rate h, defined as the fraction of data accesses that can be satisfied from the cache, as opposed to the slower memory beyond it. For example, a hit rate of 95% means that only one access in 20, on average, will not find the required data in the cache. With a hit rate h, a cache access cycle Cfast, and a slower memory access cycle Cslow, the effective memory cycle time is

Ceffective = h Cfast + (1 - h)(Cslow + Cfast) = Cfast + (1 - h) Cslow

This equation is derived under the assumption that, when the data is not in the cache, it must first be brought into the cache (in time Cslow) and then accessed from the cache (in time Cfast). Forwarding the data from slow memory to the processor and the cache simultaneously reduces the effective delay somewhat, but the simple formula for Ceffective is adequate for our purposes, especially as it does not account for the overhead of determining whether the required data is in the cache. We see that when the hit rate h is close to 1, an effective memory cycle time close to Cfast is achieved; the cache thus offers the illusion that the entire memory space consists of fast memory.
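The formula is easy to exercise numerically. A minimal sketch (the cycle counts chosen here are illustrative):

```python
def c_effective(h, c_fast, c_slow):
    """Effective memory cycle time for a single-level cache:
    Ceffective = Cfast + (1 - h) * Cslow."""
    return c_fast + (1 - h) * c_slow

# With a 1-cycle cache and a 10-cycle main memory, a high hit rate
# pushes the effective cycle time toward the cache's own speed:
for h in (0.80, 0.95, 0.99):
    print(h, round(c_effective(h, 1, 10), 2))  # -> 3.0, 1.5, 1.1 cycles
```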

In a typical microprocessor, cache access is part of the instruction execution cycle. As long as the required data is found in the cache, instruction execution continues at full speed. When a cache miss occurs and slower memory must be accessed, instruction execution is interrupted. The cache miss penalty is usually specified as the number of clock cycles wasted because the processor must stall until the data becomes available. For a microprocessor that executes, on average, one instruction per clock cycle when there are no cache misses, an eight-cycle cache miss penalty means that eight cycles are added to the execution time of each instruction that misses. If 5% of instructions encounter a cache miss, corresponding to a cache hit rate of 95%, and an average of one memory access per executed instruction is assumed, the effective CPI will be 1 + 0.05 × 8 = 1.4.

When there is a second-level, or L2, cache, a local hit or miss rate can be associated with it. For example, a local hit rate of 75% for the L2 cache means that 75% of the accesses referred to the L2 cache (because of L1 cache misses) can be satisfied there; the remaining 25% require a reference to main memory.

Example 18.1: Performance of a two-level cache system

A computer system has L1 and L2 caches. The local hit rates for L1 and L2 are 95% and 80%, respectively. The miss penalties are 8 and 60 cycles, respectively. Assuming a CPI of 1.2 with no cache misses and an average of 1.1 memory accesses per instruction, what will be the effective CPI after cache misses are factored in? If the two cache levels are viewed as a single cache, what are its miss rate and miss penalty?

Solution: We can use the formula Ceffective = Cfast + (1 - h1)[Cmiddle + (1 - h2)Cslow]. Since Cfast is already included in the CPI of 1.2, the remaining terms must be accounted for. This leads to an effective CPI of 1.2 + 1.1 × (1 - 0.95) × [8 + (1 - 0.8) × 60] = 1.2 + 1.1 × 0.05 × 20 = 2.3. When the two caches are viewed as one, the hit rate is 99% (95% at level 1, plus 80% of the 5% of accesses that miss in level 1 being found in level 2), for a miss rate of 1%. The effective access time of this imaginary single-level cache is 1 + 0.05 × 8 = 1.4 cycles, and its miss penalty is 60 cycles.
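A quick transcription of this calculation (the function and parameter names are our own):

```python
def effective_cpi(base_cpi, accesses_per_instr, h1, h2, p1, p2):
    """Effective CPI with L1/L2 caches: accesses that miss in L1 pay
    p1 cycles; those that also miss in L2 pay a further p2 cycles."""
    return base_cpi + accesses_per_instr * (1 - h1) * (p1 + (1 - h2) * p2)

# Example 18.1: hit rates 95%/80%, miss penalties 8/60 cycles
print(round(effective_cpi(1.2, 1.1, 0.95, 0.80, 8, 60), 2))  # -> 2.3
```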


A cache is characterized by several design parameters that influence its implementation cost and
performance (hit rate). The following discussion assumes a single cache level;
that is, there is no level-2 cache. The most important cache parameters are:

a) Cache size in bytes or words. A larger cache can retain more useful program data, but it is more
expensive and perhaps slower.

b) Block size or cache line width, defined as the unit of data transfer between the cache and
main memory. With a larger cache line, more data is brought into the cache on each miss. This can
improve the hit rate, but it also tends to tie up parts of the cache with less useful data.

c) Placement policy. This determines where an incoming cache line can be stored. More flexible policies
imply higher hardware cost and may or may not have performance benefits, given their more
complex and time-consuming processes for locating the required data in the cache.

d) Replacement policy. This determines which of the several candidate cache lines (to which a new cache line can be
mapped) should be overwritten. Typical policies include choosing a random line and choosing the least
recently used line.

e) Write policy. This determines whether cache word updates are immediately propagated to main memory
(write-through policy) or modified cache lines are copied back to main memory in their entirety, and if
so when (write-back or copy-back policy).

These parameters are closely related; changing one frequently means that the others
also need changes to ensure optimal memory performance. The impact of these parameters on
memory system performance will be discussed in section 18.6.

2. What Makes a Cache Work?


Caches are so successful in improving the performance of modern processors because of two locality
properties of memory access patterns in typical programs. The spatial locality of memory accesses
results from consecutive accesses referring to nearby memory locations. For example, a nine-
instruction program loop that runs a thousand times causes instruction accesses to be concentrated
in a nine-word region of the address space for a long period of time (Figure 18.2). Similarly, as a
program is being assembled, the symbol table, which occupies a small area of the address space, is
frequently consulted. Temporal locality indicates that when an instruction or data item is accessed,
future accesses to the same object tend to occur chiefly in the near future. In other words, programs
tend to focus on one region of memory for instructions or data and then move to other regions
as phases of the computation are completed.

The two locality properties of memory access patterns cause the instructions and data
items that are most useful at any given point in the execution of a program (sometimes called the program's working
set) to reside in the cache. This leads to high cache hit rates, typically in the 90 to
98% range, or, equivalently, low miss rates, in the range of 2 to 10%.


The following analogy is very useful (Figure 18.3). You work on the documents you place on your desk.
The desktop corresponds to CPU registers. Among the documents or files you don't need right now, the
most useful are kept in a desk drawer (the analog of a cache) and the rest in filing cabinets
(main memory) or in a warehouse (secondary memory). Most of the time you find what you need in
your drawer. From time to time you have to fetch the documents or files you use less often; on rare
occasions you may have to go to the warehouse to fetch a document or file that is rarely consulted. With
the latencies given in Figure 18.3, if the drawer hit rate is 90%, the average document access time
is 5 + (1 − 0.90)30 = 8 s.

Example 18.2: Analogy between drawer and filing cabinet

Assuming a hit rate h for the drawer, formulate the situation shown in Figure 18.3 in terms
of Amdahl's law and the speedup resulting from the use of a drawer.

Solution: Without the drawer, accessing a document takes 30 s; therefore, fetching a thousand
documents would take 30,000 s. The drawer makes a fraction h of accesses six times
faster, while the access time for the remaining fraction 1 − h is unchanged. Therefore, the speedup is
1/(1 − h + h/6) = 6/(6 − 5h). Improving the drawer access time can increase the speedup factor but, as
long as the miss rate remains at 1 − h, the speedup can never exceed 1/(1 − h). For h = 0.9, for
example, the speedup achieved is 4, with an upper limit of 10 for a very short drawer access
time. If the entire contents of the drawer could be placed on the desk, then a speedup factor of 30/2 = 15 would
be achieved in 90% of cases, and the overall speedup would be 1/(0.1 + 0.9/15) = 6.25. Note, however,
that this is a hypothetical analysis; stacking documents and files on your desk is not a good way to
improve access speed!
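The speedup figures of this example follow directly from Amdahl's law; a minimal Python sketch (using the 5 s, 2 s, and 30 s latencies of the analogy as inputs) is:

def speedup(h, t_fast, t_slow):
    # Fraction h of accesses takes t_fast; the rest still take t_slow.
    return t_slow / (h * t_fast + (1 - h) * t_slow)

print(speedup(0.9, 5, 30))   # 4.0  (drawer, h = 0.9)
print(speedup(0.9, 2, 30))   # 6.25 (documents moved onto the desk)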

It is useful at this point to briefly review the types of cache misses, classified as compulsory, capacity,
and conflict misses.

Compulsory misses: The first access to any cache line results in a miss. Some "compulsory" misses
can be avoided by predicting future accesses to objects and prefetching them into the cache. Thus, such
misses are truly compulsory only if a fetch-on-demand policy is used. These misses are
sometimes referred to as cold-start misses.

Capacity misses: Because the size of the cache is finite, some of its lines have to be
discarded to make room for others; this leads to misses in the future that would not have
been incurred with an infinitely large cache.

Conflict misses: Occasionally there is free space, or space occupied by useless data, in the cache, but
the mapping scheme used to place items in the cache forces useful data out to make room for other
required data. This can lead to misses in the future. These are also called collision misses.

Compulsory misses are very easy to understand. If a program accesses three different cache lines,
it will consequently encounter three compulsory misses. To see the difference between capacity and
conflict misses, consider a two-line cache and the access pattern A B C A C A B, where each letter
represents a line of data. First, A and B are loaded into the cache. Then C is loaded (the third and last
compulsory miss), and it must replace A or B. If the cache mapping is such that C replaces A, and vice
versa, then the next three misses are conflict misses. In this case there are six misses in total (three
compulsory, three conflict). On the other hand, if C replaces B, there will be four misses in total (three
compulsory, one capacity). It could be argued that one of the three additional misses in the first
case should be viewed as a capacity miss, since a two-line cache cannot possibly retain three different
lines of data through their second accesses. So don't take this categorization too seriously!

Example 18.3: Compulsory, capacity, and conflict misses

A program accesses each element of a 1,000 × 1,000 matrix 10 times during its execution.
Only 1% of this matrix fits in the data cache at any given time. A cache line holds four matrix
elements. How many compulsory data-cache misses will the execution of this program cause? If each of the
10^6 elements of the matrix is accessed once in round i before access round i + 1 begins, how
many capacity misses will there be? Is it possible to have no conflict misses at all, regardless of the
mapping scheme used?

Solution: Bringing all elements of the matrix into the cache requires loading 10^6/4 cache lines;
there will therefore be 250,000 compulsory misses. It is of course possible to have a cache miss
on the first access to each of the 10^6 matrix elements. This would occur if, whenever a line is
brought in, only one of its four elements were accessed before the line is replaced. However, not all
of those misses would be compulsory. As for capacity misses, each of the nine rounds of accesses after
the first generates at least 250,000 misses. Consequently, there will be 250,000 × 9 = 2.25 × 10^6 capacity misses.
Although rather unlikely, it is possible for all accesses to each matrix element to occur while it resides
in the cache after it is first brought in. In that case there are no conflict (or even capacity) misses.

Given a fixed cache size, dictated by cost factors or by the availability of space on the processor chip,
compulsory and capacity misses are fairly fixed. Conflict misses, on the other hand, are influenced by the data
mapping scheme, which is under the designer's control. The next two sections discuss the two most popular
mapping schemes.

3. Direct-Mapped Cache

For simplicity, memory is assumed to be word-addressable; for byte-addressable memory of the type
used in MiniMIPS and MicroMIPS, "word(s)" must be replaced by "byte(s)".

In the simplest mapping scheme, each line of main memory has a unique place in the cache where it
can reside. Suppose the cache contains 2^L lines, each with 2^W words. Then the W least significant
bits of each address specify the word index within a line. Taking the next L bits of the address as
the line number in the cache causes successive memory lines to be mapped to successive cache lines. All
words whose addresses are identical modulo 2^(L+W) map to the same cache word. In the example
of Figure 18.4, we have L = 3 and W = 2, so the five least significant bits of the address identify a cache
line and a word on that line (3 + 2 bits) from which the desired word should be read. Because many
main memory lines map to the same cache line, the cache stores the tag portion of the address to
indicate which of the many possible lines is actually present in the cache. Of course, a particular cache
line may contain useless data; this is indicated by resetting the "valid bit" associated with the cache
line. When a word is read from the cache, its tag is also fetched and compared with the tag of the
desired word. If they match, and the valid bit of the cache line is set (this is equivalent to matching
⟨1, Tag⟩ against the stored valid bit and tag), a cache hit is signaled and the word just read
is used. Otherwise, there is a cache miss and the word read out is ignored. Note that this reading and
matching of tags is also needed for a write operation. In a write operation, if the tags match,
the word is modified and written as usual. In the case of a mismatch, referred to as a "write
miss", a new cache line containing the desired word must first be brought into the cache (as for read misses),
after which the write is carried out on the new line.

The process of deriving the cache address from the supplied word address is known as address
translation. With direct mapping, this translation process is essentially trivial: it consists of taking the
L + W least significant bits of the word address and using them as the cache address. Accordingly, in Figure
18.4 the five least significant bits of the word address are used as the cache address.

Conflict misses can be a problem for direct-mapped caches. For example, if memory is accessed with
a stride that is a multiple of 2^(L+W), every access leads to a cache miss. This would occur, for example, if
an m-column matrix is stored in row-major order and m is a multiple of 2^(L+W) (32 in the example
of Figure 18.4). Since strides that are powers of 2 are very common, such occurrences are not
rare.

Example 18.4: Direct-mapped cache access

Assume that memory is byte-addressable, that memory addresses are 32 bits wide, that a
cache line contains 2^W = 16 bytes, and that the cache size is 2^L = 4,096 lines (64 KB). Show the
different parts of the address and identify which segment of the address is used to access the cache.

Solution: The byte offset within a line is log2 16 = 4 bits wide, and the cache line index is log2 4,096 =
12 bits wide. This leaves 32 − 12 − 4 = 16 bits for the tag. Figure 18.5 shows the result.
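The field widths of this example (and of Example 18.5 below) can be computed with a few lines of Python; the helper name is ours:

import math

def field_widths(addr_bits, line_bytes, num_lines, ways=1):
    offset = int(math.log2(line_bytes))        # byte offset within a line
    index = int(math.log2(num_lines // ways))  # line (or set) index
    return addr_bits - index - offset, index, offset   # (tag, index, offset)

print(field_widths(32, 16, 4096))          # (16, 12, 4), as in Figure 18.5
print(field_widths(32, 16, 4096, ways=2))  # (17, 11, 4), as in Figure 18.7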


Figure 18.5: Components of a 32-bit address in an example direct-mapped cache with byte
addressing.

The first quotation at the beginning of this chapter is from a short paper that described a
direct-mapped cache for the first time, although the term "cache" and actual implementations of the idea did not
appear until some years later. Direct-mapped caches were common in early implementations; most
modern caches use the set-associative mapping scheme, described next.

4. Set-Associative Cache
An associative cache (sometimes referred to as fully associative) is one in which a cache line can be
placed in any cache location; it therefore eliminates conflict misses. Such a cache is very difficult to
implement because thousands of tags would have to be compared against the desired tag on every access.
In practice, a compromise between a fully associative cache and direct mapping works very well. It is called
a set-associative cache.

Set-associative mapping is the most commonly used mapping scheme in processor caches.
At the extreme of single-line sets, a set-associative cache degenerates into a direct-mapped cache.
Figure 18.6 shows a read operation in a set-associative cache with set size 2^S = 2. The memory
address supplied by the processor is composed of the tag and index parts. The latter, which itself
consists of a line address and a word offset within the line, identifies a set of 2^S cache lines that
can potentially contain the required data, while the tag specifies which of the many lines in the
address space that map to the same set of 2^S cache lines is intended, under the set-associative placement
policy. For each memory access, the 2^S candidate words are read out, along with the tags associated with
their respective lines. The 2^S tags are then compared simultaneously with the desired tag, leading
to two possible outcomes:

1. None of the stored tags matches the desired tag: the data parts are ignored and a cache miss signal
is asserted to initiate a cache line transfer from main memory.

2. The ith stored tag, corresponding to placement option i (0 ≤ i < 2^S), matches the desired
tag: the word read from the line corresponding to the ith placement option is chosen as the data
output.


As in the direct-mapped cache, each cache line has a valid bit indicating whether it holds valid data.
This bit is read out together with the tag and used in the comparison to ensure that a match
occurs only with a valid tag. In a write-back cache, each line may also have a dirty bit that is set to 1 on each
write update to the line and used when the line is replaced to decide whether the line being overwritten
must be copied back to main memory. Because of the multiple placement options for each cache line,
conflict misses are less problematic here than in direct-mapped caches.

One point that remains open is the choice of which of the 2^S cache lines in a set to replace with an
incoming line. In practice, random selection and selection based on which line was least recently used (LRU)
both work well. The effects of the replacement policy on cache performance are discussed in section 18.6.

Example 18.5: Set-associative cache access

Assume that memory is byte-addressable, that memory addresses are 32 bits wide, that a
cache line contains 2^W = 16 bytes, that sets contain 2^S = 2 lines, and that the cache size is 2^L = 4,096
lines (64 KB). Show the various parts of the address and identify which segment of the address is
used to access the cache.

Solution: The byte offset within a line is log2 16 = 4 bits wide and the cache set index is
log2(4,096/2) = 11 bits wide. This leaves 32 − 11 − 4 = 17 bits for the tag. Figure 18.7 shows the result.

Figure 18.7: Components of a 32-bit address in an example two-way set-associative cache.


Example 18.6: Cache Address Mapping

A 64 KB four-way set-associative cache is byte-addressable and has 32 B lines. Memory
addresses are 32 bits wide.

a) What is the width of the tags in this cache?

b) Which main memory addresses are mapped to set number 5 in the cache?

Solution: The number of sets in the cache is 64 KB/(4 × 32 B) = 512.

a) A 32-bit address is divided into a five-bit byte offset, a nine-bit set index, and an 18-bit tag.

b) Addresses whose nine-bit set index equals 5 are mapped to set number 5. These
addresses have the form 2^14 a + 2^5 × 5 + b, where a is a tag value (0 ≤ a ≤ 2^18 − 1) and b is a byte offset
(0 ≤ b ≤ 31). Thus, the addresses mapped to set number 5 include 160 to 191, 16,544 to 16,575, 32,928
to 32,959, and so on.
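As a check of part (b), the following sketch (our own, under the 32 B line and 512-set assumptions of the example) enumerates the first few byte addresses whose set index is 5:

LINE_BYTES, SETS = 32, 512

def set_index(addr):
    # Discard the 5 offset bits, then keep the 9 set-index bits.
    return (addr // LINE_BYTES) % SETS

matches = [a for a in range(40000) if set_index(a) == 5]
print(matches[0], matches[31])    # 160 191
print(matches[32], matches[63])   # 16544 16575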

As is evident from Figure 18.6, two-way set-associative caches are easily implemented. The LRU
replacement policy requires a single bit per set, designating which of the two slots
or ways was less recently used. Increasing the degree of associativity improves cache performance
by reducing conflict misses, but it also complicates the design and potentially lengthens the access
cycle. In practice, therefore, the degree of associativity almost never exceeds 16 and is often kept at four
or eight. More on this topic can be found in section 18.6.

5. Cache and Main Memory


Caches are organized in various ways to meet the memory performance requirements of specific
machines. Reference has already been made to single-level and multilevel caches (L1, L2, and perhaps
L3), the latter sketched in different ways in Figure 18.1. A unified cache is one that contains both
instructions and data, while a split cache consists of separate instruction and data caches. Because the
first machines designed at Harvard University had separate instruction and data memories, a machine
with a split cache is said to follow the Harvard architecture. Unified caches represent incarnations
of the von Neumann architecture. Each cache level can be unified or split, but the most common
combination in high-performance processors consists of a split L1 cache and a unified L2 cache.

Next, the relationship between a fast memory (cache) and a slow memory (main) is
explored. The same issues apply to the data transfer methods and requirements between L1 and L2, or L2 and
L3, if additional cache levels exist. Note that the contents of one level of
the memory hierarchy usually constitute a subset of the contents of the next slower level. The
search for a data element therefore proceeds in the direction of slower memories, away from the processor,
because anything that is not available at a given level will not exist at any of the faster levels. If a
cache level, say L1, is split, then each of its two parts has a similar relationship to the next slower cache
level or to main memory, with the added complication that the two parts compete for the bandwidth
of the slower memory.

Since main memory is much slower than the cache, many methods have been developed to speed up data
transfers between main memory and cache. Note that DRAMs have very high bandwidth internally
(Figure 18.8). However, this bandwidth is lost as a result of the I/O pin limitations of the memory chips
and, to a lesser extent, the relatively narrow buses that connect main memory to the CPU cache.

Consider a 128 MB main memory built from four chips of the type shown in Figure 18.8. This
memory can be accessed in 32-bit words, which are transmitted over a bus of the same width to the cache.

Suppose the cache line is four words wide and is contained in the same row of all four chips.
Then, once the first word has been transferred, subsequent words can be read out faster, because
they come from the row buffer rather than from the memory array. Because a cache miss
is triggered by access to one particular word, the data transfer mechanism can be designed so that
main memory reads out the requested word first; the cache can then supply that word to the CPU while
the remaining three words are being transferred. Optimizations like this are common in high-performance
designs.

Higher-performance processors use larger main memories composed of many DRAM chips. In this
case, the chips can supply many words at once (superwords), making it feasible to use a single cycle, or a small
number of cycles, on a wide bus to transfer an entire cache line to the CPU.

Writing modified cache entries back to main memory presents a special design challenge in high-
performance systems. The write-through policy (see the end of section 18.1) is particularly problematic
because, unless write operations are infrequent, they slow down the cache while main memory is
brought up to date. With write-back (copy-back), the problem is less severe, since writing to main memory only
occurs when a modified cache line must be replaced. In that case, at least two main memory
accesses are needed: one to copy back the old cache line and one to read the new line. A commonly used
method of avoiding this performance penalty due to write operations is to provide the cache with write
buffers. Data to be written to main memory is placed in one of these buffers and the cache
operation continues regardless of whether the write to main memory can occur immediately. As long as
there are no further cache misses, main memory is not required to be up to date immediately; write
operations can be performed using idle bus cycles until all write buffers have been emptied. In the
event of another cache miss before all write buffers have been emptied, either the buffers can be flushed
before the search for data in main memory is attempted, or a special mechanism can be provided to search the
write buffers for the desired data, proceeding to main memory only if the required location is not in
one of the write buffers.


6. Improving Cache Performance


We have seen in this chapter that the cache bridges the speed gap between the CPU and main memory. Cache
hit rates in the 90-98% range, and even higher, are common, leading to high performance by
eliminating most main memory accesses. Because faster caches are very expensive, some computers
use two or three levels of cache, each additional cache level being larger, slower, and cheaper, per
byte, than the previous one. An important design decision when introducing a new machine is the number,
capacities, and types of caches to use. For example, split caches produce higher performance by
allowing concurrency between data and instruction accesses. However, for a given total cache capacity, split
caches imply a smaller capacity for each unit, leading to a higher miss rate than a unified cache
when running a large program with little data, or, conversely, a very small program operating on
a very large data set.

A common way to evaluate the relative merits of various cache design alternatives is simulation, using
publicly available data sets that characterize memory accesses in typical applications of interest. A
memory address trace contains information about the sequence and timing of the memory addresses
generated in the course of running particular sets of programs. Supplying an address trace of interest
to a cache-system simulator, which is also given the design parameters of a particular cache
system, produces a log of hits and misses at the various cache levels, as well as a fairly accurate
indication of performance. To give an idea of the interplay of the various parameters, the results of
some empirical studies based on this approach are presented next.

Generally speaking, larger caches produce better performance. However, there is a
limit to the validity of this statement: larger caches are slower, so if the larger capacity is not needed, you may be
better off with a smaller cache. In addition, a cache that fits on the processor chip is faster, owing to
shorter wire and communication distances and the absence of off-chip signal transmission, which
is rather slow. The requirement that the cache fit on the same chip as the processor limits its size,
although this is less problematic with the growth in the number of transistors that can be placed
on an IC chip. It is not uncommon for 90% or more of the transistors on a CPU chip to be allocated to
cache memories.

Apart from the cache size, usually given in bytes or words, the other important cache parameters are:

1. Line width 2^W

2. Set size (associativity) 2^S

3. Line replacement policy

4. Write policy

Issues relevant to the choice of these parameters, and the associated trade-offs, are highlighted
in the remainder of this section.

Line width 2^W: Wider lines cause more data to be brought into the cache with each miss. If the transfer
of a larger block of data to the cache can be achieved at a higher rate, wider cache lines have the positive effect
of making words that are likely to be accessed in the near future (due to spatial locality) readily available without
further misses. On the other hand, the processor may never access a significant portion of a very wide cache
line, leading to wasted cache space (which might otherwise be allocated to more useful objects) and wasted
data transfer time. Because of these opposing effects, there is often an optimal cache line width that
leads to the best performance.

Set size (associativity) 2^S: The trade-off here is between the simplicity and speed of direct mapping
and the fewer conflict misses of set-associative caches. The greater the associativity, the
smaller the effect of conflict misses. However, beyond four- or eight-way associativity, the
performance effect of greater associativity is negligible and easily nullified by the more complex and
slower addressing mechanisms. Figure 18.9 shows experimental results on the effect of associativity on
performance. Since higher degrees of associativity entail hardware overhead that slows down the cache
and also uses chip area for the required comparison and selection mechanisms, a lower
degree of associativity with larger capacity (enabled by the space freed up by the removal of
the more complex control mechanisms for the additional cache ways) often offers better overall performance.

Line replacement policy: LRU (least recently used) or some similar approach is usually employed to simplify
the process of keeping track of which line in each set should be replaced next. For two-way
associativity, LRU implementation is quite simple: a single status bit is maintained for each set to
indicate which of the two lines in the set was accessed last. On each access to a line, the hardware
automatically updates the set's LRU status bit. As associativity increases, it becomes harder
to keep track of usage order. The problems at the end of the chapter provide some ideas for
implementing LRU in caches with greater associativity. It is perhaps surprising that random selection of the line
to be replaced works so well in practice.
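For the two-way case, the single LRU status bit per set can be modeled as follows (a minimal sketch, not a hardware description; the class and method names are ours):

class TwoWaySet:
    """One set of a two-way set-associative cache with a 1-bit LRU."""
    def __init__(self):
        self.tags = [None, None]
        self.lru = 0                    # index of the least recently used way

    def access(self, tag):
        if tag in self.tags:            # hit
            way = self.tags.index(tag)
        else:                           # miss: evict the LRU way
            way = self.lru
            self.tags[way] = tag
        self.lru = 1 - way              # the other way is now the LRU one
        return way

s = TwoWaySet()
for t in [7, 9, 7, 4]:                  # the final miss evicts tag 9, not 7
    s.access(t)
print(s.tags)                           # [7, 4]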

Write policy: With the write-through scheme, main memory is always consistent with the cache data, so
cache lines can be freely replaced without loss of information. A write-back cache is often more
efficient because it minimizes the number of main memory accesses. In write-back caches, a "dirty bit"
is associated with each cache line to indicate whether it has been modified since it was brought in from
main memory. If a line is not dirty, replacing it creates no problem. Otherwise, the cache line
must be copied to main memory before it can be replaced. As already mentioned, the actual write to
main memory does not have to happen right away; instead, the lines to be written back are placed in
write buffers and copied gradually, as bus availability and memory bandwidth allow.

Note that the choice of cache parameters is dictated by the dominant pattern and timing of memory
accesses. Because these characteristics differ significantly between instruction streams and data
streams, the optimal parameter choices for instruction, data, and unified caches are often not the
same. Consequently, it is not uncommon to have varying degrees of associativity, or different
replacement policies, in the multiple cache units of the same system.

PROBLEMS
1) Spatial and temporal locality

Describe a real application program whose data accesses exhibit each of the following patterns, or argue
that the postulated combination is impossible.

a) Almost no spatial or temporal locality

b) Good spatial locality but virtually no temporal locality

c) Good temporal locality but very little or no spatial locality


d) Good spatial and temporal locality

2) Two-level cache performance

A processor with two cache levels has a CPI of 1 when there is no level-1 cache miss. At level 1, the
hit rate is 95% and a miss incurs a penalty of 10 cycles. For the two-level cache as a whole, the
hit rate is 98% (meaning that 2% of the time main memory must be accessed) and the miss
penalty is 60 cycles.

a) What is the effective CPI after cache misses are factored in?

b) If a single-level cache were used instead of this two-level cache, what hit rate and miss
penalty would it need to provide the same performance?

3) Two-level cache performance

A computer system uses two cache levels, L1 and L2. L1 is accessed in one clock cycle and supplies
the data in the case of an L1 hit. On an L1 miss, which occurs 3% of the time, L2 is consulted.
An L2 hit incurs a penalty of 10 clock cycles, while an L2 miss incurs a penalty of 100 cycles.

a) Assuming a pipelined implementation with a CPI of 1 when there is no cache miss (that is, ignoring
data and control dependencies), calculate the effective CPI if the local miss rate of L2 is
25%.

b) If you were to model the two-level cache system as a single cache, what miss rate and miss
penalty should be used?

c) Changing the L2 mapping scheme from direct to two-way set-associative can improve its local
miss rate to 22% while increasing its hit penalty to 11 clock cycles, due to the more complex
access scheme. Ignoring cost issues, would the switch be a good idea?

4) Cache hits and misses

The following sequence of numbers represents memory addresses in a 64-word main memory: 0,
1, 2, 3, 4, 15, 14, 13, 12, 11, 10, 9, 0, 1, 2, 3, 4, 56, 28, 32, 15, 14, 13, 12, 0, 1, 2, 3. Classify each of the
accesses as a cache hit or as a compulsory, capacity, or conflict miss, given the following cache
parameters. In each case, show the final cache contents.

a) Direct-mapped, four-word lines, capacity of four lines.

b) Direct-mapped, two-word lines, capacity of four lines.

c) Direct-mapped, four-word lines, capacity of two lines.

d) Two-way set-associative, two-word lines, capacity of four sets, LRU replacement.

e) Two-way set-associative, four-word lines, capacity of two sets, LRU replacement.

f) Four-way set-associative, two-word lines, capacity of two sets, LRU replacement.


5) Cache hits and misses

A program has a loop of nine instructions that is executed many times. Only the last instruction in the
loop is a branch, whose destination is the first instruction of the loop at memory address 5678. The
first instruction of the loop reads the contents of a different memory location each time, starting at
memory address 8760 and advancing by one word on each new iteration. Determine the cache hit
rate, overall and separately for instructions and data, for each of the following cache configurations.

a) Unified, direct-mapped, four-word lines, capacity of four lines

b) Unified, two-way set-associative, two-word lines, capacity of four sets, LRU replacement

c) Split direct-mapped caches, each with four-word lines and a capacity of two lines

d) Split two-way set-associative caches, each with two-word lines, a capacity of two sets, and LRU
replacement

6) Cache design

Characterize the address trace produced by running the loop defined in problem 5, assuming
an infinite number of loop iterations and cache sizes that are powers of 2. Find the smallest possible
cache (or the smallest total cache capacity, for split caches) that achieves a miss rate of no more than
5% under each of the following cache organization restrictions. You are free to choose any parameter
that is not explicitly specified.

a) Unified direct-mapped cache.

b) Unified two-way set-associative cache.

c) Split direct-mapped caches.

d) Split two-way set-associative caches.

7) Cache design

a) Consider the cache sketched in Figure 18.4. For a main memory capacity of 2^x words, determine the
total number of bits in the cache array and express the contribution of the non-data bits (valid bits and
tags) as a percentage overhead relative to the actual data bits.

b) Repeat part (a) for the cache shown in Figure 18.6.

8) Cache design

A computer system has 4 GB of byte-addressable main memory and a unified 256 KB cache with
32-byte lines.

a) Draw a diagram showing each component of a main memory address (that is, how many bits for the
tag, set index, and byte offset) for a four-way set-associative cache.

b) Draw a diagram showing the tag comparison circuitry, the generation of the cache miss signal, and the data
output for this cache.


c) The performance of the computer system with the four-way set-associative cache of part (a) is
unsatisfactory. Two redesign options, involving almost the same additional design and production
costs, are being considered. Option A is to increase the cache size to 512 KB. Option B is to increase
the associativity of the 256 KB cache to 16 ways. In your opinion, which option is more likely to
result in higher overall performance, and why?

9) Cache address in byte-addressable memory

An address for a byte-addressable memory, as presented to the cache unit, is divided as follows: 13-bit
tag, 14-bit line index, five-bit byte offset.

a) What is the cache size in bytes?

b) What is the cache mapping scheme?

c) For a given byte location in the cache, how many different bytes of the 2^32-byte main memory can occupy it?

10) Relationships between cache parameters

For each of the caches partially specified in the following table, find values for the missing parameters where
possible, or explain why they cannot be deduced uniquely. Assume word-addressable main memory and
cache.

11) Relationships between cache parameters

Using a for address bits, cw for cache words, lw for words per line, sl for lines per set, cs for cache
sets, t for tag bits, i for set index bits, and o for offset bits, write equations relating the parameters
mentioned in problem 10, such that assigning values to a subset of the parameters yields unique values, or a
feasible set of values, for the other parameters.

12) Cache design trade-offs

Three design options are considered for a 16-word direct-mapped cache. The options C1, C2, and
C4 correspond to using one-word, two-word, and four-word cache lines, respectively. The miss
penalties for options C1, C2, and C4 are six, seven, and nine clock cycles, respectively. Provide relatively
short memory address traces that lead to each of the following outcomes, or show that the outcome is
impossible.

a) C2 has more misses than C1.

b) C4 has more misses than C2.

c) C1 has more misses than C4.

d) C2 has fewer misses than C1 but wastes more cycles on cache misses.

e) C4 has fewer misses than C2 but wastes more cycles on cache misses.


13) Cache-aware programming

One problem with direct-mapped, and even set-associative, caches is thrashing, defined as excessive
conflict misses resulting from the nature of particular memory address patterns. Consider a direct-
mapped cache consisting of two four-word lines, used in the course of computing the inner product of
two vectors A and B of length 8, in a way that involves reading Ai and Bi, multiplying them, and adding the
product to a running total kept in a register.

a) Show that thrashing occurs when the start addresses of A and B are multiples of 4.

b) Suggest ways to avoid thrashing in this example.

c) For this particular example, would thrashing be possible if the cache were two-way set-associative?

14) Cache behavior during matrix transposition

Matrix transposition is an important operation in many signal processing and scientific calculations.
Consider the transposition of a 4 × 4 matrix A, stored in row-major order starting at
address 0, whose result is placed in B, stored in row-major order starting at address 16. Two
nested loops (indices i and j) are used, and Ai,j is copied into Bj,i in the inner loop on j. Assume
word-addressable memory, with one matrix element per word. Note that each matrix element is accessed
exactly once. Draw two 4 × 4 tables representing the matrix elements and place an "H" (for hit) or "M"
(for miss) in each entry, for the following organizations of an eight-word cache.

a) Direct-mapped, two-word lines.

b) Direct-mapped, four-word lines.

c) Two-way set-associative, one-word lines, LRU replacement policy.

d) Two-way set-associative, two-word lines, LRU replacement policy.

15) LRU Policy Implementation

One way to implement the LRU replacement policy for a four-way set-associative cache is to store six
status bits for each set: the bit bi,j, 0 ≤ i < j ≤ 3, is set to 1 if line i was accessed more recently than line
j in the set. Provide implementation details for this scheme, including the logic circuit that updates the
six status bits and selects the line to be replaced.

16) Caching to avoid repeated calculations

In some applications (notably multimedia), certain instructions are executed repeatedly with the same
arguments, producing the same results. Most instructions, such as addition or multiplication, execute
so fast that simply repeating them may be wiser than trying to detect the repetition in order to reuse
previously calculated results. Other instructions, such as load or store, may produce different results even when
the memory and register addresses involved are the same. However, in the case of division, which
takes 20-40 clock cycles on many modern microprocessors, significant savings can be achieved
if the results of previously performed divisions are stored in a buffer, which is consulted in one clock cycle to
see whether the required result is available there. If a hit occurs, the previously computed quotient is sent to the
appropriate register. Only in the case of a miss, say in 10-20% of cases, are the operands
sent to the divider unit for computation. This method is called memoing, and the buffer that holds the
arguments and results of recent divide instructions is called a division memo table (DMT).

a) Draw a block diagram for a 64-entry direct-mapped DMT and show how it works; that is, show
what is stored in it, how it is consulted, and how the hit/miss signal is obtained and used.

b) Repeat part (a) for a four-way set-associative DMT and highlight the advantages and disadvantages
of the modified scheme, if any.

c) Assuming a 22-cycle divide instruction (one preparation/decoding cycle, one DMT lookup cycle,
and 20 execution cycles in the event of a DMT miss), express the average number of clock cycles taken by the
divide operation, and the speedup factor s of the division assist, as a function of the DMT hit rate h.
What DMT hit rate h is needed for a division speedup of at least 3?

d) Draw and explain another hardware diagram for the memoing scheme that does not involve
a one-cycle penalty in the case of DMT misses (i.e., division takes 21 cycles in the worst case).

e) Repeat part (c) for the implementation of part (d). What fraction of the execution time must be
spent on division if speeding up division by an average factor of 3 is to produce an overall
speedup of 5%?

17) Trace cache for instructions

A trace cache holds instructions not in address order but in the order of dynamic execution. This
increases the instruction fetch bandwidth, because instructions that execute consecutively can be fetched
together. Study the design issues and benefits of a trace cache and write a two-page report about it [Rote99].


MASS MEMORY CONCEPTS

CHAPTER TOPICS

1 Disk memory fundamentals

2 Organizing Data on Disk

3 Disk Performance

4 Disk Caching

5 Arrays and RAID

6 Other types of mass memory

The size of main memory in a modern computer exceeds the imagination of early computer
programmers, yet it is still considered too small to hold all the data and programs of interest
to a typical user. In addition, the volatility of semiconductor memory, that is, the loss of data in
the event of a power failure, would require some form of backup storage even if size were not an issue. Over
the years, disk memory has taken on this dual role of extended storage and backup. Improvements in the
storage capacity and density of disk memory, and reductions in its cost, have been as phenomenal as
the advances in integrated circuits. This chapter studies the organization and performance of disk
memory and shows how disk caching and redundant arrays of disks lead to improvements in
speed and reliability. Some other mass storage technologies are also reviewed.

1. Disk Memory Fundamentals


Most of this chapter is devoted to a discussion of magnetic-recording hard disk memories, which will be
referred to simply as "disk memory". Floppy disks, optical discs (CD-ROM, CD-RW, DVD-ROM, DVD-
RW), and other disc variants have similar organizations and operating principles but differ in their
recording and access technologies, application areas, and performance criteria. These variants are
discussed in section 19.6, along with some other mass storage options.

Modern hard drives are marvels of electrical and mechanical design. It is rather curious
that such disks are used at all: it would seem that electronic memories should have replaced the relatively slow
disk memory long ago, as predicted more than once in past decades. However, the phenomenal
improvement in data recording density, the rapid decline in costs, and the very high reliability of
modern disk systems have led to their continued use in virtually all computers, from laptops and
desktops up.


Figure 19.1 shows a typical disk memory configuration and the terminology associated with its design
and use. There are 1-12 platters mounted on a spindle that rotates rapidly, at 3,600 to 10,000 or more
revolutions per minute. Data is recorded on both surfaces of each platter along circular tracks.
Each track is divided into many sectors, a sector being the unit of data transfer to and
from the disk. Recording density is a function of track density (tracks per centimeter or inch) and linear bit
density along the track (bits per centimeter or inch). In 2000, the areal recording density of inexpensive
commercial disks was in the range of 1-3 Gb/cm^2. Early computer disks had diameters of up to 50 cm, but
modern disks rarely fall outside the range of 1.0-3.5 inches (2.5-9 cm) in diameter.

The recording area on each surface does not extend all the way to the center of the platter, because the very
short tracks near the middle cannot be used efficiently. Even so, the inner tracks are shorter than the outer
tracks by a factor of 2 or so. Having the same number of sectors on all tracks would limit each track
(and thus the disk capacity) to what can be recorded on the short inner tracks. For this reason, modern disks place
more sectors on the outer tracks. The bits recorded in each sector include a sector number at the
beginning, followed by a gap that allows the sector number to be processed and acted upon by the read/write
head logic, the sector data, and error detection/correction information. There is also a gap between
adjacent sectors.

These gaps, the sector-number and error-coding overhead, plus spare tracks, frequently used to allow
the "repair" of bad tracks discovered in the course of disk memory operation, mean that the formatted
capacity of a disk is less than its raw capacity based on data recording density.

An actuator moves the arms that hold the read/write heads, of which there are as many as there are
recording surfaces, to align them with a desired cylinder, consisting of the tracks with the same diameter on the
different recording surfaces. Reading the very closely spaced data on the disk requires the head to travel
very close to the surface of the disk (within a fraction of a micrometer). A thin cushion of air prevents the
head from colliding with the surface. Note that even the smallest dust particle can cause the head to crash onto
the surface. Such head crashes damage the mechanical parts and destroy large amounts of data on the disk.
To avoid these undesirable events, hard drives are sealed in airtight packaging.

The time to access data in a desired sector on the disk consists of three components:

1. Seek time, or the time to align the heads with the cylinder containing the track on which the sector
resides.

2. Rotational latency, or the time for the disk to rotate until the start of the sector's data arrives under the
read/write head.

3. Data transfer time, the time for the sector to pass under the head, which reads the
bits on the fly.

Table 19.1 lists the key characteristics of three modern disks with different physical
parameters and intended application domains. The capacity of a disk memory is related to its
parameters as follows:

Disk capacity = Surfaces × Tracks/surface × Sectors/track × Bytes/sector

The number of recording surfaces is usually twice the number of platters. For this reason, the disks with
the highest capacity are of the multiplatter variety. Both disk diameter and track density affect the
number of tracks per surface. The bytes per track, the product of the last two terms, are
proportional to the track length (hence the track diameter) and the linear bit density. When the tracks do not
all contain the same number of sectors, the average number of sectors per track is used in the disk
capacity equation.

Example 19.1: Disk memory parameters

Calculate the capacity of a two-platter disk drive with 18,000 cylinders, an average of 520 sectors per
track, and a sector size of 512 B.

Solution: With two platters, there are four recording surfaces. Therefore, the maximum raw disk
capacity is 4 × 18,000 × 520 × 512 B = 1.917 × 10^10 B = (1.917 × 10^10)/2^30 GB ≅ 17.85 GB. If 10% is
allowed for the overhead of gaps, sector numbers, and cyclic redundancy check (CRC) coding, we arrive
at a formatted capacity of almost 16 GB.
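The capacity formula above, and the numbers of this example, can be checked with a short Python sketch (the function name is ours):

def disk_capacity_bytes(surfaces, tracks_per_surface, sectors_per_track,
                        bytes_per_sector):
    return surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector

raw = disk_capacity_bytes(4, 18_000, 520, 512)
print(raw / 2**30)          # ~17.85 GB raw capacity
print(0.9 * raw / 2**30)    # ~16 GB formatted, assuming 10% overhead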

From the above observations, it follows that hard disks hold more data and are faster than
floppy disks because they have:

Larger diameters (although very wide disks are no longer manufactured).

Faster rotation.

Recording at higher densities (due to more precise control).

Multiple platters.


TABLE 19.1 Key attributes of three representative magnetic disks, from the highest capacity to the
smallest physical size (circa early 2003).

Optical discs use a laser beam and its reflection from a single recording surface, rather than
magnetization and magnetic field detection, to write and read data. More precise control of the optical
mechanism allows denser recording and hence greater capacity. Section 19.6 discusses this type of
disc in more detail.

2. Organizing data on disk


The data bits in a track appear as small regions of the magnetic coating on the surface of the disk
that are magnetized in opposite directions to record 0 or 1 (Figure 19.2). As the areas associated with
different bits pass under the read mechanism, the direction of magnetization is sensed and the
bit value deduced. Recording occurs when the write mechanism forces the magnetization into a desired
direction by passing a current through a coil attached to the head. This scheme is called horizontal
recording. It is also possible to use vertical recording, in which the direction of magnetization is
perpendicular to the recording surface. Vertical recording allows bits to be placed closer together, but it
requires specially designed magnetic recording media along with more complex read and write
mechanisms, and is not commonly used.

The 0s and 1s corresponding to the magnetization directions within small cells on the disk surface,
as shown in Figure 19.2, usually do not directly represent the data bit values, because special encoding
techniques are used to maximize storage density and to make proper operation of the read and write
mechanisms more likely. For example, instead of letting the data bit values dictate the direction of
magnetization, one can set the magnetization according to whether there is a change in bit value from one cell to
the next. This type of non-return-to-zero (NRZ) encoding, which allows the data recording
capacity to be doubled relative to return-to-zero (RTZ) encoding, was the first widely used encoding
technique. A discussion of such codes is beyond the scope of this book. The simplified view
shown by the string "0 1 0" in Figure 19.2, which represents the values of three consecutive data bits
on the disk track, is sufficient for the purposes of the text.

Beyond the complexities of the codes used for magnetic recording, each data sector is preceded
by a recorded sector number and followed by a cyclic redundancy check, which allows the detection
or correction of certain recording anomalies and read/write errors. This is essential because, at very
high recording densities, recording flaws and read/write errors are inevitable. There are also gaps
between sectors, and gaps within sectors, to separate the various sector components so that the processing
of one part can be completed before the next arrives. Again, the simplified view of sectors following one
another, without the recording overhead or gaps, is adequate for the purposes of the text.

The unit of data transfer to/from the disk is a sector, which usually contains 512 to a few thousand bytes.
The address of a sector on the disk consists of three components:

Disk address = Cylinder # (10-16 bits) | Track # (1-5 bits) | Sector # (6-10 bits)

for a total of 17-31 bits.

With a sector size of 512 = 2^9 B, this represents a disk capacity of:

Disk capacity = 2^(13±3) × 2^(3±2) × 2^(8±2) × 2^9 = 2^(33±7) B = 0.06-1,024 GB

Of course, actual disks do not allow these parameters to take their maximum values simultaneously.
The cylinder number is supplied to the actuator mechanism to align the read/write heads with the
desired cylinder. The track number selects one of the read/write heads (or the associated recording
surface). Finally, the sector number is compared against the recorded sector indices as they pass under
the selected read/write head, and a match indicates that the desired sector has arrived under the head.
The head then reads the sector's data and its associated error detection/correction
information.

The sectors on a disk are independent; any collection of sectors can be used to hold different pieces
of a file or other data structure. For example, Figure 19.2 shows a file composed of five sectors that
are scattered across different tracks. The addresses of these sectors will be placed in a file directory
so that any piece of the file can be retrieved when needed; alternatively, only the first sector of the file may
be placed in the directory, with a link provided within each sector to the next sector. Of course, since
pieces of a file are likely to be accessed close together in time, it makes sense to try to assign
sectors in a way that reduces the amount of head movement (seek time) and rotational latency in the
course of file access.

Example 19.2: Making a backup copy of an entire disk

A disk with t = 100,000 tracks (the total over all recording surfaces) rotates at 7,200 rpm; its
track-to-track seek time is 1 ms. What is the minimum amount of time required to back up the entire
contents of the disk? Assume that the disk is full and that the backup device accepts data at the rate
supplied by the disk.

Solution: The disk turns at 120 revolutions per second, or one revolution in 1/120 s. We ignore the rotational
latency at the start, which averages 4.17 ms. To copy the entire contents of the disk, t − 1 = 99,999
track-to-track seeks and 100,000 full revolutions for data transfer are needed. Therefore, the total
time is (t − 1)/10^3 + t/120 ≅ 933 s ≅ 15.5 min. This calculation assumes that the sectors can be read
in the order of their physical appearance on the track; in other words, that reading the data of one sector is
followed immediately by reading the next sector, without delay. We shall see shortly that this
may not be the case.
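The arithmetic of this example can be reproduced with the following sketch (our own), which adds the track-to-track seeks to the pure data-transfer revolutions:

def backup_time_s(tracks, rpm, track_seek_s):
    rev_s = 60 / rpm                                 # time for one revolution
    return (tracks - 1) * track_seek_s + tracks * rev_s

t = backup_time_s(100_000, 7200, 1e-3)
print(t, t / 60)             # ~933 s, ~15.5 min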

Because the seek time is much shorter when moving from one cylinder to an adjacent one (as
opposed to a distant one), data elements that are usually accessed together should be placed in the
same or adjacent cylinders. However, data items accessed consecutively should not be assigned to successive
sectors on the same track. The reason is that some amount of processing is often required on one
sector before the next is accessed. Placing the next piece of data in the immediately following sector
therefore risks missing the opportunity to read it and having to wait for almost a complete rotation
of the disk. For this reason, consecutive logical sector numbers are assigned to physical sectors that
are separated by some intermediate sectors. In this way, placing the pieces of a file in consecutive
logical sectors avoids the problem just described. Additionally, logical sector numbers on
adjacent tracks can be skewed. Suppose the reading of the last logical sector on a track has just
been completed and the head must now move to the first logical sector on the next track to get the next
piece of the file. It may take a couple of milliseconds for the head to complete the movement from
one track to the next, during which time the disk will have rotated through a fairly large angle. Figure 19.3
shows a possible numbering scheme for the sectors on several adjacent tracks based on these
observations.
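A simple interleaved numbering of this kind can be modeled as below; the 16 sectors per track and the interleave factor of 2 are illustrative assumptions (the formula as written is valid only for an interleave factor of 2), not parameters taken from Figure 19.3:

def physical_position(logical, sectors_per_track=16, interleave=2):
    # Consecutive logical sectors land 'interleave' physical positions
    # apart; the second pass around the track fills the odd positions.
    pos = logical * interleave
    return pos % sectors_per_track + pos // sectors_per_track

print([physical_position(s) for s in range(16)])
# [0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15]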

Fortunately, the user does not need to worry about the actual scheme for placing data on the disk
surface. The disk controller maintains the mapping information required to translate a simple sequential
sector number, presented by the operating system, into a physical disk address that identifies the
surface, track, and sector. The controller then plans and performs all the activities required to collect
the sector data in a local buffer before transferring it to a designated area in main memory.


3. Disk performance
Disk performance is characterized by access latency and data transfer rate. Access latency is the sum of the cylinder seek time and the rotational latency, the time needed for the sector of interest to arrive under the read/write head. Therefore:

Disk access latency = Seek time + Rotational latency

The seek time depends on how far the head has to travel from its current position to the destination cylinder. Since this involves mechanical motion, consisting of an acceleration phase, a phase of uniform motion, and a deceleration or braking phase, the seek time for moving across c cylinders can be modeled as follows, where α, β, and γ are constants:

Seek time = α + β(c − 1) + γ√(c − 1)

The linear term β(c − 1), corresponding to the phase of uniform motion, is a rather recent addition to the seek time equation; older disks did not have enough tracks or high enough acceleration to reach uniform motion.

Rotational latency is a function of where the desired sector is located on the track. In the best case, the head is aligned with the track just as the desired sector arrives. In the worst case, the head has just missed the sector and must wait for almost a complete revolution. Thus, on average, the rotational latency equals the time for half a revolution:

Average rotational latency = (30/rpm) s = (30,000/rpm) ms

Consequently, for a rotational speed of 10,000 rpm, the average rotational latency is 3 ms and its range is 0-6 ms.

The data transfer rate is related to the rotational speed of the disk and the linear bit density along the track. For example, suppose a track holds about 2 Mb of data and the disk rotates at 10,000 rpm. Then, every minute, 2 × 10^10 bits pass under the head. Because bits are read as they pass under the head, this implies an average data transfer rate of almost 333 Mb/s ≅ 42 MB/s. The overhead due to gaps, sector numbers, and CRC encoding makes the peak transfer rate slightly higher than the calculated average transfer rate.
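The three quantities above can be packaged into a small calculator; the seek-time coefficients passed in the last line are made-up placeholders, not the parameters of any disk in Table 19.1:

from math import sqrt

def seek_time(c, alpha, beta, gamma):
    """Seek time across c cylinders: alpha + beta*(c-1) + gamma*sqrt(c-1)."""
    return 0.0 if c == 0 else alpha + beta * (c - 1) + gamma * sqrt(c - 1)

def avg_rotational_latency(rpm):
    """Half a revolution, in seconds: 30/rpm."""
    return 30.0 / rpm

def avg_transfer_rate(bits_per_track, rpm):
    """Average data transfer rate, in bits per second."""
    return bits_per_track * rpm / 60.0

print(avg_rotational_latency(10_000) * 1e3)      # 3.0 ms
print(avg_transfer_rate(2e6, 10_000) / 8e6)      # ~41.7 MB/s
print(seek_time(100, alpha=2e-3, beta=2e-5, gamma=1e-3))  # placeholder values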

Attempts to eliminate the seek time, which, according to Table 19.1, constitutes a significant fraction of disk access latency, have been unsuccessful. Head-per-track disks, which have a head permanently aligned with each track on each recording surface (usually 1 or 2), failed owing to their complexity and limited market, which made design and manufacturing costs unacceptably high. Because rotational speed affects both rotational latency and data transfer rate, disk manufacturers are motivated to increase the rotational speed as much as possible. Beyond that, efforts to cut the average rotational latency by a factor of 2-4, by placing multiple sets of read/write heads around the periphery of the platters, failed for the same reasons as head-per-track disks.

Therefore, one must learn to live with seek and rotational latencies and design around them. Paying attention to how data is organized on the disk, as discussed at the end of section 19.2, reduces both seek time and rotational latency for commonly encountered access patterns. Queuing disk access requests and performing them out of order is another useful method. The latter method is applicable when multiple independent programs share the disk memory (multiprogramming) or when some applications issue independent disk access requests whose results can be returned out of order (transaction processing).


In Figure 19.4 the sectors are requested in the order A, B, C, D, E, F. If it happens that the head is close to sector C, the sectors can be read in the order C, F (on the same track), D (on the neighboring track), E, B, A, with the optimal order computed by considering the head movement times and rotational latencies based on the physical locations of the various sectors.
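A minimal sketch of such out-of-order scheduling follows, greedily picking the request nearest the current head position. It weighs only track distance, whereas a real controller, as noted above, also accounts for rotational position; the track numbers assigned to requests A-F here are invented:

def schedule(head, requests):
    """Greedy shortest-seek-first ordering of pending track requests."""
    pending = dict(requests)               # request name -> track number
    order = []
    while pending:
        name = min(pending, key=lambda n: abs(pending[n] - head))
        head = pending.pop(name)           # move head to the chosen track
        order.append(name)
    return order

reqs = [("A", 500), ("B", 150), ("C", 310), ("D", 305), ("E", 180), ("F", 312)]
print(schedule(head=308, requests=reqs))   # ['C', 'F', 'D', 'E', 'B', 'A']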

4. Disk Caching
The disk cache is a unit of random-access memory used to reduce the number of disk accesses, much as the CPU cache reduces accesses to the slower main memory. For example, when a particular disk sector is needed, multiple adjacent sectors, or perhaps an entire track, can be read into the disk cache. If subsequent disk accesses happen to be to the additional sectors thus read, they can be satisfied from the much faster cache, obviating the need for many additional disk accesses. Prior to the emergence of dedicated hardware caching in disk controllers, a form of software buffering prevailed, in which the operating system performed a similar caching or read-ahead function using a portion of main memory set aside to hold disk sectors. The three disk memories listed in Table 19.1 have dedicated buffers/caches ranging in size from 0.125 to 16 MB.

Because the disk is orders of magnitude slower than the main memory, the operation of a disk cache can be controlled entirely by software. When the disk controller receives a disk access request, the disk cache directory is consulted to determine whether the requested sector is in the disk cache. If it is, the sector is supplied from the disk cache (for reads) or modified there (for writes); if not, a regular disk access is initiated. Because the time to search the disk cache directory is much shorter than the average time for a disk access, the overhead is very small, while the payoff when the sector is found in the disk cache is very significant. Hit/miss rates for disk caches are hard to find, as manufacturers use proprietary designs for their disk caches and are reluctant to share detailed performance data. For initial design purposes, a reasonable estimate for the miss rate of a disk cache is 0.1 [Crag96].

Note that when an entire track is read into the disk cache, rotational latency is almost completely eliminated. This is because, as soon as the head aligns with the desired track, reading can begin, no matter which sector arrives next under the head. The sectors thus accessed can be stored in the random-access cache at their corresponding positions within the space allocated to the track. Similarly, when a track must be written back to disk to make room for a new track, the sectors can be copied in the order in which the head encounters them. Note that using a write-back policy for disk caches, which is a natural choice to minimize the number of disk accesses, requires a form of backup power source to ensure that updates are not lost in the event of power failure or other forms of outage.

Example 19.3: Disk Caching Performance Benefit

A 40 GB disk with 100,000 tracks has 512 B sectors, rotates at 7,200 rpm, and has an average seek time of 10 ms. If you assume that whole tracks are brought into the disk cache, what is the performance benefit, or speedup, due to disk caching in an application whose run time is completely dominated by disk accesses? In other words, assume that processing time and other aspects of the application are negligible relative to data access times from disk. In your calculations, use a disk cache miss rate of 0.1.

Solution: An average track on this disk holds 40 × 2^30/(100,000 × 512) ≅ 839 sectors. Therefore, a sector access requires an average time of 10 ms plus the time needed for 1/2 + 1/839 rotations; this comes to 18.35 ms. Reading out an entire track implies the average seek time plus full revolution time, or 26.67 ms. Disk caching with a miss rate of 0.1 thus allows us to replace ten sector accesses of 18.35 ms with one track access of 26.67 ms, yielding a speedup of 10 × 18.35/26.67 ≅ 6.88. The actual speedup is a little lower, not only because processing time and other aspects of the application were ignored, but also because a disk access that is satisfied from the disk cache must still pass through various intermediaries, such as channels, storage controllers, and device drivers, each of which adds some latency.
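The speedup computation of example 19.3 can be captured in a tiny function; the two access times are those stated in the solution, and, like the example, the model ignores the time for cache hits:

def caching_speedup(t_sector, t_track, miss_rate):
    # Without caching: every access costs a full sector access.
    # With caching: only a miss (fraction miss_rate) costs a full-track read.
    return t_sector / (miss_rate * t_track)

print(caching_speedup(t_sector=18.35, t_track=26.67, miss_rate=0.1))  # ~6.88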

In the previous discussion it was assumed that the disk cache is incorporated into the disk controller. There are several drawbacks to this approach:

1. The cache is linked to the CPU through intermediaries such as buses, the I/O channel, the storage controller, and the device driver. Thus, even in the case of a disk cache hit, some nontrivial latency is incurred.

2. On systems with many disks, there will be multiple caches, with some caches underutilized when the associated disks are not very active, and others not having enough space to keep all their useful tracks.

One approach to solving these problems of remoteness and imbalance is to use a single large cache that is placed closer to the CPU and shares its space among data from many disks. For a given total disk cache capacity, the latter approach is more efficient. Such a shared cache can be included in addition to those of the disk controllers, leading to a multilevel caching structure.


5. Arrays and RAID


Despite significant improvements in hard drive capacity and performance over the past few decades, there are still applications that require more capacity or bandwidth than even the largest and fastest available disk drives can provide. Making larger and faster disks, although technically possible, is not economically viable because they are not needed in large enough quantities to generate economies of scale. It would be preferable to find a solution based on arrays of inexpensive disks that are built in large volumes, and thus economically, for the personal computer market, to offer expanded capacity and improved performance. This is known as the "disk array" approach.

Of course, it is easy to solve the capacity problem with the use of an array. But overcoming the performance limit is not so simple and requires more effort. Moreover, applications that use large amounts of data usually also require assurances of data availability, integrity, and security. Consider a large bank or e-commerce web site that uses 500 disks to store its databases and other information. Each of the 500 disks may fail once every 100,000 hours (11+ years). As a whole, however, the 500-disk system has a failure rate that is nearly 500 times higher: 100,000/500 = 200 hours, or roughly one failure every eight days. Therefore, it is imperative to incorporate redundancy for fault tolerance into any array solution.

The term RAID, for redundant array of inexpensive (or independent) disks, was coined in the late 1980s to refer to a class of solutions to the capacity and reliability problems of applications requiring large data sets [Patt88]. Prior to the formulation of the RAID concept, it had been recognized that disk arrays, with their many independent read/write mechanisms, could offer the enhanced data bandwidth needed by vector or parallel computers, whose performance was limited by the rate at which data could be fed to their heavily pipelined or multiple processing units. This use of arrays to take advantage of the aggregate bandwidth of a large number of disks, across which the data is striped, is known as "RAID level 0", although it was not included in the original taxonomy [Patt88] and has no redundancy to justify the RAID designation.

RAID0 arrays can be designed to operate in synchronous or asynchronous mode. In a synchronous array, the rotation of the multiple disks and their head movements are synchronized so that all disks access the same logical sector on the same track at the same time. Externally, such an array of d drives appears as a single disk that has a bandwidth equal to d times that of a single disk. Designing such arrays requires costly modifications to off-the-shelf disk drives and is only cost-effective for use with high-performance supercomputers. Asynchronous arrays are accessed much like interleaved memory banks. Since multiple disk accesses can be in progress at the same time, the effective bandwidth increases. Some higher-level RAID systems can be configured to operate as RAID0 drives when the user does not require the added reliability offered by redundancy.

The use of mirroring, the storage of a backup copy of the contents of one disk on another (mirror) disk, allows tolerance of any single disk failure with a very simple recovery procedure: detect the failure of the disk through its built-in error-detection coding and switch to the corresponding mirror disk while the damaged disk is replaced. RAID1 mirroring involves 100% redundancy, which represents a high price to pay for single-disk fault tolerance when tens or hundreds of disks are used. RAID2 represents an attempt to reduce the degree of redundancy through error-correcting codes instead of duplication. For example, a simple single-bit-error-correcting Hamming code requires four check bits for 11 bits of data. Therefore, if the data is striped across 11 disks, four disks will be sufficient to hold the check information (against 11 in the case of mirroring). With more data disks, the relative redundancy of RAID2 is further reduced.


It is evident that even the redundancy of the Hamming code is too much, and the same effect can be achieved with just two additional disks: a parity disk that stores the XOR of the data bits of all the other disks and a spare disk that can take the place of a damaged disk when required (RAID3). This is possible owing to the built-in error-detection capability at the sector level, based on the routinely included cyclic redundancy check. If there are d data disks containing the bit values bi, 0 ≤ i ≤ d − 1, at a particular position on the disk surface, the parity disk will hold the following bit value at the same position:

p = b0 ⊕ b1 ⊕ ⋯ ⊕ bd−1

Upon failure of data disk j, its corresponding bit value bj can be reconstructed from the contents of the remaining data disks and the parity disk, as follows:

bj = p ⊕ b0 ⊕ ⋯ ⊕ bj−1 ⊕ bj+1 ⊕ ⋯ ⊕ bd−1
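Both equations can be demonstrated at the bit level in a few lines of Python; modeling each disk's contents as a small integer is purely illustrative:

from functools import reduce
from operator import xor

data = [0b1011, 0b0110, 0b1100, 0b0001]     # d = 4 data disks (4-bit slices)
p = reduce(xor, data)                        # parity disk: XOR of all data bits

j = 2                                        # suppose data disk j fails
rebuilt = reduce(xor, [b for i, b in enumerate(data) if i != j]) ^ p
assert rebuilt == data[j]                    # bj recovered from survivors and p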

Each of RAID levels 1-3 (Figure 19.5) has anomalies that limit its usefulness or domain of applicability.

1. Besides its 100% redundancy, which may not be a major concern with inexpensive drives, RAID level 1 involves a performance penalty regardless of whether striping is used. Each update operation on a file must also update the backup copy. If the backup update is performed promptly, the larger of the two disk access times will dictate the delay. If backup updates are queued and performed when convenient, the probability of data loss due to disk failure increases.

2. RAID level 2 improves on the level 1 redundancy factor by using striping along with a simple error-correcting Hamming code. However, because all disks are accessed to read or write a block of data, severe performance penalties are incurred for reading or writing small amounts of data.

3. RAID level 3 also has the drawback that all disks must participate in every read and write operation, reducing the number of disk accesses that can be performed per unit time and hurting performance for small reads and writes. With a special controller that calculates the parity required for write operations and a nonvolatile write buffer to hold blocks while parity is calculated, RAID3 can achieve performance close to that of RAID0.

Because disk data is usually accessed in units of sectors rather than bits or bytes, RAID level 4 applies parity or checksumming to disk sectors. In this case, striping does not affect small files that fit completely into one sector. Read operations are performed from the data disks without touching the parity disk. When a disk fails, its data is reconstructed from the contents of the other disks and the parity disk, much as in RAID3, except that the XOR operation is performed on sectors rather than bits. A write operation requires that the associated parity sector also be updated with each sector update, which involves two parity disk accesses. With capital letters denoting sector-level data and parity contents, we have:

Pnew = Dnew ⊕ Dold ⊕ Pold

This equation shows an advantage of RAID4 over RAID3, but it also reveals one of its key drawbacks: unless write operations are quite rare, parity disk accesses create a severe performance bottleneck.
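This read-modify-write parity update translates directly into code; note that only the old data sector and the old parity sector need to be read, not the other data disks:

def update_parity(p_old, d_old, d_new):
    """RAID4/RAID5 small-write parity update: Pnew = Dnew ^ Dold ^ Pold."""
    return d_new ^ d_old ^ p_old

# Consistency check against recomputing the parity from scratch.
d = [0b1010, 0b0111, 0b0001]
p = d[0] ^ d[1] ^ d[2]
d_new = 0b1111
assert update_parity(p, d[1], d_new) == d[0] ^ d_new ^ d[2]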

RAID level 5 removes this problem by distributing the parity sectors over all disks instead of placing them on a dedicated disk. As seen in the example sketched in Figure 19.5, the parity for the four data sectors of stripe 0 is stored on the far-right disk, while the parity sectors for stripes 1 and 2 are placed on different disks. This scheme spreads out the parity accesses and resolves the performance problem that results from parity updates. Read operations are independent, as before. Moreover, multiple read and write operations can be in progress in RAID5 at the same time. For example, the sectors labeled Data0 and Data1' can be updated at the same time, because the sectors and their associated parities are on different disks.
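Such a rotating parity placement can be sketched as follows; the particular rotation rule used here is one common choice and is not necessarily the exact layout drawn in Figure 19.5:

def raid5_parity_disk(stripe, n_disks):
    """Disk that holds the parity sector of a given stripe (rotating)."""
    return (n_disks - 1 - stripe) % n_disks

for stripe in range(5):
    print(stripe, "-> parity on disk", raid5_parity_disk(stripe, n_disks=5))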

Note that RAID levels 3-4, and the most commonly used RAID5, offer adequate protection against single disk failures under the assumption that a second disk does not fail while recovery from the first failure is in progress. To protect against double failures, RAID level 6 adds a second form of error checking to the level 5 parity/checksum scheme, using the same distribution scheme but with the two check sectors placed on different disks. In this context, RAID6 is said to use "P + Q redundancy", where P represents the parity scheme and Q the second scheme. The Q scheme can be more complex, because it is rarely, if ever, invoked for data reconstruction. With RAID6, there is a danger of data loss only if a third disk fails before the first failed disk has been replaced.

RAID efficiency can be improved by combining the idea with that of a log-structured file system. In a log-structured file system, data is not modified in place; instead, a fresh copy of any modified record is created in a different location, and the file directory is updated appropriately to reflect the new location. Suppose a RAID controller has a write cache large enough to hold an entire cylinder's worth of data. Then, as a sector is modified, it is marked for deletion from its current location. The updated version is not written back immediately but is kept in the nonvolatile cache until the cached cylinder is full. At this point, the cylinder is written out to any available disk cylinder. Since the contents of cylinders become sparse as their sectors are marked for deletion, a kind of automatic garbage collection is required that reads two or more such cylinders during times when the disks are idle and compacts them into a new cylinder. An additional benefit of this approach is that cylinder data can be written in compressed format, with decompression occurring during reads, which involve copying the entire cylinder to the disk cache.

Commercial RAID products are offered by all major mass storage vendors, including IBM, Compaq-HP,
Data General, EMC, and Storage Tek. Most RAID products offer some flexibility to select the RAID level
based on application needs.

6. Other types of Mass Memory

Although magnetic recording dominates as the technology of choice for secondary memory in modern computers, other options are available. An important feature of disk memory, which makes it a necessary component even when the main memory is large enough for all programs and data, is its nonvolatility (see also the discussion of nonvolatile memory in section 17.5). Magnetic disks can retain data almost indefinitely, whereas main memory loses its data when the computer shuts down or power is interrupted for any other reason. Of course, secondary memory is not limited to online data storage, where its contents are continuously available to a computer; some mass memory technologies are used as backup storage or as a recording medium for the distribution of software, documents, music, pictures, movies, and other types of content.

Before proceeding with the review of optical discs and other secondary and tertiary memory technologies, it should be noted that floppy disks and many other rotating magnetic memories use fundamentally the same principles as hard disks for data recording and access. In other words, they have tracks and sectors, actuators to perform the radial motion of a head assembly to align it with a desired track, and read/write heads that magnetize small areas of the surface or sense the direction of magnetization. What distinguishes 1.2 MB floppy disks (Figure 3.12b), or higher-capacity (100/250 MB) zip disks, from hard disks is the lower recording density and less precise read/write mechanisms of the former, arising from the flexibility of, and wider tolerances in, the recording media and from the lack of the airtight packaging that keeps dust and other fine particles away from the recording surface.

Magnetic tapes record data along straight tracks, rather than the circular tracks of a rotating disk memory. A key advantage of magnetic tapes is that they can be removed and placed in permanent archives, although in the past many types of tape devices became obsolete, leaving many archives inaccessible even before the recordings showed any deterioration. Usually nine bits are written across the tape (one byte of data plus one parity bit); the linear bit density along the tape is much lower than that of hard disks (for the same reasons cited for floppy disks). As the tape winds off one reel and onto the other, the read/write head (and perhaps a separate erase head) passes over the recorded data, using read and write techniques very similar to those of disk memory. Magnetic tapes come in many different varieties: from small cassettes or cartridges (Figure 3.12b) to large reels with expensive transport mechanisms that can move the tape at very high speed and also have near-instantaneous start and stop capabilities. The latter tape drives use a vacuum buffer containing a loose segment of tape to prevent damage to the tape material and recorded data during high-speed mechanical movement or acceleration/deceleration.

As floppy and zip disks, along with magnetic tapes, gradually become less important and are replaced by other media such as writable CDs, hard drives may become the only magnetic recording memory widely used in computers. There is some speculation that even magnetic hard disks will be displaced by optical and semiconductor memories. However, given past and ongoing advances in disk memory technology, this will not happen anytime soon.

Optical discs (CD-ROM, DVD-ROM, and their recordable and rewritable varieties) have become increasingly popular since the 1990s. In some ways, these devices are similar to magnetic disk drives in that they hold blocks of data on the surface of a round platter that must rotate to make the data available to an access mechanism. The compact disc (CD) format was developed first and quickly became the preferred method for distributing music and software products that required hundreds of megabytes of capacity. Subsequent advances in substrate materials, semiconductor lasers, head positioning methods, and optical components led to the creation of the higher-capacity digital versatile disc (DVD) format, whose multigigabyte capacity is sufficient to store a movie. CD and DVD data are protected through sophisticated error-correcting codes, so the obliteration of many recorded bits (due to material defects or scratches) will not affect the proper recovery of the data.

Read-only optical discs (CDs, DVDs) are descendants of early optical storage systems that recorded data by marking or punching holes in cards, paper tapes, and similar media. Optical discs record data as micron-sized pits (depressions) in a thin layer of optical material (placed on a plastic substrate) along a continuous spiral track that begins somewhere near the inside of the disc and ends near the outer edge; for those old enough to remember, this is the opposite of the spiral direction used on vinyl phonograph records. Data is recorded by creating pits of various sizes during the production of a CD or DVD. A laser beam is focused on the surface of the platter, and the differences in reflectivity between the unaffected surface and the pits are detected by the read mechanism. Note that this explanation is highly simplified and does not cover the operation of the laser diode, beam splitter, lenses, and detector shown in Figure 19.6 [Clem00].


Writable CDs and DVDs can be write-once or rewritable. In the write-once versions, pits are created by blistering a special coating placed on a reflective surface (this is different from the protective coating in Figure 19.6), by burning holes into a thin layer of special material with a more powerful write laser, or by a variety of other methods. The rewritable versions rely on reversible modification of the optical or magnetic properties of the material through selective exposure to laser light [Clem00].

Microelectromechanical (MEM) devices have been developed to provide mass memory on IC chips. One technology, pursued by IBM, operates somewhat like a phonograph and uses sharp tips on thin silicon cantilevers to record data on, and read it back from, a polymer medium [Vett02]. The maturing of such technologies will bring us closer to building true single-chip computing systems.

As already mentioned, online mass memory units, such as hard drives, are directly accessible to the computer, while offline storage media, such as CDs and tapes, must be placed in, or mounted on, an online device to make them accessible. Intermediate between the two are near-line mass memories, which offer a large amount of storage without the need for human intervention in mounting the medium. Such devices are often used as tertiary storage to hold rarely used databases and other information that must nevertheless be available when the need arises. For example, automated tape libraries offer many terabytes of storage capacity with access times of less than a minute. Typically, thousands of tape cartridges are stored in a large cabinet, with robotic arms able to pick a cartridge, load it into a read/write assembly, and later return it to its original slot.

No matter what type of mass storage is used, the problem of digital data permanence remains. Magnetic tapes last only a few decades. CDs are better, because their life span is measured in centuries. Of course, a different challenge is finding devices that can read centuries-old CDs, given the rapid pace of technological progress. Programming languages and database formats also face the problem of extinction, making the preservation of data even more difficult [Tris02].


PROBLEMS
1) Disk capacity maximization

Consider the following optimization problem in the design of a disk memory system. The diameter D of the outermost track is a given constant. You are to choose the diameter xD of the innermost track, where x < 1. The trade-off is that the capacity of all tracks must be the same, so the shorter the innermost track, the less data can be recorded on each track, but there will be more tracks. How would you derive the optimal value of x, and which disk or data recording parameters influence this optimal value? State all your assumptions clearly.

2) Retail Disk Drive Parameters

For each of the three disks mentioned in Table 19.1:

a) Derive the approximate average number of complete disk tracks that can be held in its cache. Ignore the overhead resulting from the storage of identification information, such as track or sector numbers, in the buffer.

b) Calculate the time required to read a 10 MB file stored on adjacent tracks according to the scheme of Figure 19.3, so that no time is wasted on rotational latency between readings of successive logical sectors of the file.

3) Trends in Disk Cost

Price data for disk memory are available at www.cs.utexas.edu/users/dahlin/techTrends/data/diskprices.

a) Plot the price variations for personal computer disks, with the time range 1980-2010 on the horizontal axis and the price range $0-$3,000 on the vertical axis. Plot a separate curve for each of the capacities 10 MB, 20 MB, 80 MB, 210 MB, 420 MB, 1.05 GB, 2.1 GB, 4.2 GB, 9.1 GB, 18.2 GB, 36.4 GB, and 72.8 GB, where each curve extends over the time from when the product was first commercially available until it was discontinued. Where multiple prices are cited for a given disk capacity, choose the lowest.

b) Develop a scatter plot based on the data of part (a), where the horizontal axis again covers the 1980-2010 time frame and the vertical axis shows the price per gigabyte of capacity, ranging from $1 to $1M on a logarithmic scale (each grid line is a factor of 10 larger than the previous one).

c) Discuss the general trends observed in parts (a) and (b) and attempt to extrapolate beyond the available data into the future.

4) Average seek distance

The average seek distance is the average number of cylinders that must be crossed to reach the next cylinder of interest. Consider a c-cylinder disk, which has a maximum seek distance of c − 1.


a) If the head retracts to the home cylinder 0 (the outermost cylinder) after each access, what is the average seek distance per access?

b) If you could choose a home cylinder other than cylinder 0, which one would you choose, and why?

c) Show that if the head is allowed to stay at the last cylinder accessed, rather than retracting to a home cylinder, then the average seek distance with completely random accesses is approximately c/3.

5) Disk seek time formula

Derive a simplified form of the disk seek time formula (section 19.3) for the case β = 0, assuming that the head accelerates at a constant rate for the first half of the seek distance and then decelerates at the same rate for the second half. Hint: at a constant acceleration a, the distance traveled is related to the time t by d = (1/2)at^2.

6) Disk seek time

For each of the disk memories listed in Table 19.1, determine the coefficients of the seek time formula (section 19.3), assuming that the quoted average seek time corresponds to half the maximum cylinder distance.

7) Disk access scheduling

A disk with a single recording surface and 10,000 tracks has a controller that optimizes performance through out-of-order processing of access requests. The access request queue currently holds the following track numbers (in order of arrival): 173, 2939, 1827, 3550, 1895, 3019, 2043, 3502, 259, 260. The head has just completed reading track 250 and is on its way to track 285 for a read access. Determine the total travel distance of the head (in tracks) for each of the following scheduling alternatives.

a) First come, first served (FCFS).

b) Shortest seek time first (SSTF).

c) Shortest seek time among the three oldest requests.

d) Maintaining the scanning direction (if, say, the head is moving inward, the movement continues in the same direction until the innermost requested track is accessed; then the direction of head movement is reversed).

8) Disk caching and access scheduling

In many applications, the tracks on a disk are not accessed uniformly; instead, a small number of frequently used tracks account for a large fraction of all disk accesses. Suppose it is known that 50% of all disk accesses go to 5% of the tracks.


a) Can this information be used to make the disk caching scheme more effective?

b) Which of the disk scheduling algorithms given in problem 7 is likely to perform best in optimizing disk accesses, and why?

c) Can you devise a different scheduling algorithm that performs better than that of part b)?

9) Disk performance parameters

A single-platter disk spins at 3,600 rpm and has 256 cylinders, each storing 256 MB. The seek time is given by the formula s + tc, where s = 0.5 ms is the start-up time, t = 0.05 ms is the travel time of the head from one cylinder to an adjacent one, and c is the cylinder distance; for example, if the head must travel five cylinders inward, it takes 0.75 ms to get there.

a) Ignoring gaps between cylinders and other overheads, calculate the peak data transfer rate.

b) Assuming an average cylinder distance of 85 for an access to a randomly chosen cylinder, calculate the average access time for a 4 KB sector.

c) Based on your answer to part b), how much faster, on average, is it to access a random sector on the same track than one anywhere on the disk?

d) What is the minimum time needed to back up the entire contents of the disk to tertiary storage?

10) Disk memory technology

An intermediate solution can be envisioned between disks with a single moving head per platter and head-per-track disks, which have a dedicated head for each track to completely eliminate seek time. Imagine a disk that has eight heads mounted on one assembly. These move together and align with different tracks so that the data rate during reads and writes can be eight times that of a single-head disk.

a) What would be the best arrangement of the heads with respect to the tracks? In other words, should the heads read/write adjacent tracks or tracks that are far apart?

b) List the performance benefits, if any, of your disk design of part a).

c) Speculate on why such disks are not used in ordinary PCs.

11) RAID Systems

Assume that each of the disks listed in Table 19.1 is to be used to build a RAID level 5 system that fits into a large enclosure with a volume of 1 m³. Nearly half of the cabinet space will be occupied by ventilation gaps, fans, and power supply equipment. For each disk type:

a) Estimate the maximum total storage capacity that can be achieved.

b) Derive the maximum total read throughput of the RAID system; state all your assumptions.

c) Repeat part b) for total write throughput.


12) RAID Systems

Consider the RAID4 and RAID5 data organizations shown in Figure 19.5.

a) If each disk access for reading data involves a single sector, which organization leads to higher read bandwidth? Quantify the difference for a random distribution of read addresses.

b) Repeat part a) for write accesses.

c) Suppose one of the disks fails and its data must be rebuilt on the spare disk. Which organization offers better performance for reads during the rebuilding process?

d) Repeat part c) for write accesses.

13) Floppy disk performance and parameters

Consider a 3.5-inch floppy disk device that uses double-sided disks with 80 tracks per side. Each track contains nine 512 B sectors. The disk spins at 360 rpm, and the seek time is given by s + tc, where s = 200 ms represents the start/stop overhead and t = 10 ms is the track-to-track travel time.

a) Calculate the capacity of the floppy disk.

b) Determine the average access time to a random sector if the head's home position is on the outermost track.

c) What is the data transfer rate once the desired sector has been reached?

14) Floppy disk parameters

Obtain a floppy disk, note its capacity, and see whether it is single-sided or double-sided. Then dismantle it to measure the parameters (inner radius, outer radius) of its recording area. Assuming that the floppy has 80 tracks per side, all of equal capacity, determine its parameters, such as track density, linear recording density, and average recording density per unit area.

15) Magnetic tape parameters

Data is written to a magnetic tape using contiguously recorded blocks, with a linear recording density of 1,000 b/cm and a gap of 1.2 cm between blocks. Recall that a tape usually has eight data recording tracks and one parity track.

a) What fraction of the total raw capacity of the tape holds useful data if each block contains one 512 B record?

b) Repeat part a), but this time assume that four records are packed into each tape block.


16) CD Parameters

A CD has a recording area with an outer (inner) radius of approximately 6 cm (2 cm). It holds 600 MB of data in a spiral track with fixed bit density and a track separation of 1.6 μm. Ignoring the effects of housekeeping and error-checking overhead, derive the linear bit density and the recording density per unit area for the CD.

17) DVD Parameters

A DVD has a recording area with an outer (inner) radius of approximately 6 cm (2 cm). It holds 3.6 GB of data in a spiral track with fixed bit density and a track separation of 1.6 μm. Ignore the effects of housekeeping and error-checking overhead.

a) Derive the linear bit density and the recording density per unit area for such a DVD.

b) Assuming no data compression, how much video can be stored on this DVD if each image is 500 × 800 pixels, each pixel requires 2 B of storage, and images must be displayed 30 times per second?

c) Based on the result of part b), what can you say about the average compression ratio of common DVD movie formats?

d) How fast should the DVD spin if the data rate required in part c) is to be achieved?

18) Jukebox-style tertiary memory

Consider a jukebox that stores up to 1,024 optical cartridges and has a mechanism for picking and "playing" any cartridge. The selection and setup time is 15 s. A high-speed scan to read the data on a cartridge takes 120 s. Each cartridge has a useful recording area of 100 cm².

a) Derive the recording density per unit area required if this system is to offer 1 TB of capacity.

b) What data transfer rate is implied by the information provided for the system of part a)?

c) What is the effective average data rate when many cartridges are read in succession?

d) Suggest an application for this jukebox-style tertiary memory and discuss what other components you would need to make it work for your suggested application.

e) Speculate on the feasibility of a petabyte storage system with the same jukebox-style operation.


VIRTUAL MEMORY AND PAGING

CHAPTER TOPICS

1. Need for Virtual memory

2. Address translation in Virtual memory

3. Translation lookaside buffer

4. Page Placement and Replacement

5. Main and Mass memories

6. Improving Virtual memory performance

When main memory does not have the capacity to hold a very large program along with its associated data, or many smaller programs that need to coexist on the machine, one is forced to break these objects into pieces and move them in and out of main memory according to their access patterns. With virtual memory, the process of moving data in and out of main memory is automated, so the programmer or compiler does not need to worry about it. What the programmer/compiler observes is a large address space, even though, at any given time, only a small part of this space is mapped to main memory. This chapter reviews strategies for moving data between secondary and primary memory, discusses the problem of virtual-to-physical address translation and ways to speed it up, and examines virtual memory performance.

1. Need for Virtual memory


While caches are used to improve performance in a transparent way, virtual memory is used primarily for convenience. In some cases, disabling virtual memory (if allowed) can lead to better performance. Virtual memory gives the illusion of a main memory that is much larger than the available physical memory. Programs written for a large virtual address space can run on systems with varying amounts of real (physical) memory. For example, MiniMIPS uses 32-bit addresses, but a system almost never has 4 GB of main memory. Even so, the virtual memory management hardware gives the illusion of a 4 GB address space with, say, only a 64 MB main memory.

In the absence of virtual memory, the programmer or system software must explicitly manage the available memory space, moving, as required, the pieces of a large program and its data set in and out of main memory. This is a complicated and error-prone process, and it creates nontrivial storage allocation problems because the program and data units are not all the same size. With virtual memory, the programmer or compiler generates code as if the main memory were as large as the address space suggests. Depending on which parts of the program or data are referenced, the virtual memory system brings those parts from secondary memory (disk) into the smaller main memory and, perhaps, moves other parts out. Figure 20.1 shows a large program occupying many tracks on a disk, with three pieces (designated by thick arcs on the surface of the disk) residing in main memory. As was the case for processor caches, data transfer between primary and secondary memory is fully automatic and transparent.


In virtual memory, data transfer between the disk and main memory is performed in fixed-size units known as pages. A page is usually in the range of 4-64 KB. On a disk with 512 B sectors, a page thus corresponds to 8-128 sectors. Virtual memory can be, and has been, implemented using variable-size data units known as segments. However, because the implementation of virtual memory through paging is both simpler and more prevalent than segmentation, the discussion here is limited to the former.

Figure 20.2 presents the ideas of virtual memory and cache together, sketching the data movements between the various levels of the memory hierarchy and the names associated with the data transfer units. From Figure 20.2, the rationale behind the choice of the terms "line" and "page" becomes apparent.

Virtual memory is a good idea virtually (no pun intended) for the same reasons that made caches successful; that is, the spatial and temporal locality of memory references (a point to which we will return in section 20.4). Owing to the much larger units (pages) brought into main memory and the larger size of main memory, the main memory hit rate (99.9% or more) is usually much better than the cache hit rate (typically 90-98%). When a page is not in main memory, a page fault is said to have occurred. The term "page fault" is used for historical reasons; "main memory miss" (analogous to "cache miss") would have been more appropriate.

How does virtual memory relate to the cache memory discussed in Chapter 18? The cache is used to make main memory appear faster. Virtual memory is used to make main memory appear larger. Together, the hierarchical combination of cache, primary, and secondary memories, with automatic data transfer between them, creates the illusion of a very fast main memory (comparable in speed to SRAM) whose cost per gigabyte is a small multiple of the corresponding cost for a disk.


2. Address translation in Virtual memory


Under virtual memory, a running program generates virtual addresses for its memory accesses. These virtual addresses must be translated into physical memory addresses before the instructions or data are retrieved and sent to the processor. If pages contain 2^P bytes and the memory is byte-addressable, a virtual address of V bits can be divided into a P-bit byte offset within the page and a (V − P)-bit virtual page number. Because pages are brought into main memory in their entirety, the P-bit byte offset never changes. Therefore, it is sufficient to translate the virtual page number into a physical page number corresponding to the location of the page in main memory. Figure 20.3 outlines the virtual-to-physical address translation parameters.

Example 20.1: Virtual memory parameters

A virtual memory system with 32-bit virtual addresses uses 4 KB pages and 128 MB of byte-addressable main memory. What are the address parameters of Figure 20.3?

Solution: Physical memory addresses for a 128 MB memory are 27 bits wide (log2 128M = 27), and the byte offset within a page is 12 bits wide (log2 4K = 12). This leaves 32 − 12 = 20 bits for the virtual page number and 27 − 12 = 15 bits for the physical page number. The 128 MB main memory has room for 2^15 = 32K pages, while the virtual address space contains 2^20 = 1M pages.
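The arithmetic of example 20.1 in Python form; the three sizes are the only inputs:

from math import log2

V = 32                          # virtual address width, in bits
P = int(log2(4 * 2**10))        # page offset: log2(4 KB) = 12 bits
M = int(log2(128 * 2**20))      # physical address width: log2(128 MB) = 27 bits

print("virtual page number: ", V - P, "bits")   # 20 bits -> 2^20 = 1M pages
print("physical page number:", M - P, "bits")   # 15 bits -> 2^15 = 32K pages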


The mapping scheme used to place pages in main memory is fully associative, meaning that each page can be mapped to any location in main memory. This eliminates conflict misses and minimizes page faults, which are extremely costly in terms of time penalty (a few milliseconds can span millions of clock cycles). However, there are still compulsory misses and capacity misses, as discussed for caches at the end of section 18.2.

Since the parallel comparison of a page tag against many thousands of tags is quite complex, a two-stage process is used for memory access: a page table is consulted to find out whether the required page is in memory and, if so, where it is located; the actual memory access is then performed or, in the event of a page fault, an access to the disk is initiated. Because disk accesses are very slow, the operating system saves the entire context of the program and switches to a different task (context switching) instead of forcing the processor to wait idly for millions of cycles until the disk access is complete and the required page has been brought into main memory. By contrast, on a cache miss, the processor simply stalls for several clock cycles until the data is brought into the cache.

Figure 20.4 shows a page table and how it is used in the address translation process. While a program is running, the start address of its page table is kept in a special page table register. The virtual page number is used as an index into the page table, and the corresponding entry is read out. The memory address of the desired page table entry is obtained by adding the virtual page number (as an offset) to the base address contained in the page table register. As shown in Figure 20.4, each page table entry has a valid bit, a number of other flags, and a pointer to the "whereabouts" of the page.

The page table entry thus obtained can be of one of two types. If the valid bit is set, the address portion of the entry represents a physical page number that identifies the location of the page in main memory (page hit). If the valid bit is reset, the address part points to the location of the virtual page in disk memory (page miss or page fault). In addition to valid and invalid pages, which are distinguished by the valid bit, a page within the virtual memory address space can be unallocated. This happens because the virtual address space is usually large and most programs do not use all of their address space. An unallocated page is completely devoid of information; it does not even take up space on disk.

Each page table entry occupies one word (4 bytes), which is adequate for specifying a physical page number or a sector address on disk. Each time a page is copied into main memory or sent back to disk to make room for another page, its associated page table entry is updated to reflect the new location of the page.

In addition to the valid bit, a page table entry may have a "dirty bit" (to indicate that the page has been modified since it was brought into main memory and thus must be written back to disk if it is to be replaced) and a "use bit", which is set to 1 whenever the page is accessed. The use bit finds application in implementing an approximate version of the least recently used (LRU) replacement policy (section 20.4).
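The roles of the valid, dirty, and use bits can be made concrete with a toy page table in Python; the entry layout and the page fault signaling are hypothetical simplifications, chosen only to mirror the fields described above:

PAGE_BITS = 12                      # 4 KB pages

class PageFault(Exception):
    """Raised on a miss; the OS would bring the page in and retry."""

class Entry:
    def __init__(self, frame=None, disk_addr=None):
        self.valid = frame is not None   # page currently in main memory?
        self.dirty = False               # modified since being brought in?
        self.used = False                # referenced recently (approx. LRU)
        self.frame = frame               # physical page number, if valid
        self.disk_addr = disk_addr       # page's location on disk otherwise

def translate(page_table, vaddr, write=False):
    vpn = vaddr >> PAGE_BITS                 # virtual page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)  # byte offset never changes
    entry = page_table[vpn]
    if not entry.valid:
        raise PageFault(vpn)
    entry.used = True                        # feeds the approximate-LRU policy
    if write:
        entry.dirty = True                   # must be written back when evicted
    return (entry.frame << PAGE_BITS) | offset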


From the preceding discussion, it is apparent that the convenience of virtual memory comes with a certain overhead, consisting of storage space for the page tables and a page table access (address translation) for each memory reference. Section 20.3 shows that, in most cases, the extra memory reference can be avoided, making the time overhead negligible. The storage overhead for page tables can also be reduced substantially. Note that the page table can be large: in a virtual address space of 2^32 B with 4 KB pages, there will be 2^20 pages and the same number of page table entries. Instead of allocating 4 MB of memory to the page table, it can be organized as a variable-size structure that grows to its maximum size only if necessary. For example, instead of a directly indexed page table, where the entry's address is determined by adding the page table base address to the virtual page number, one can use a hash table, which needs far fewer entries for most programs.

In addition to simplifying memory management (including allocation, relocation, and loading), virtual memory offers benefits that can make it worthwhile even when an expanded memory address space is not required. Two of the most important additional benefits are memory sharing and memory protection (Figure 20.5).

1. The sharing of a page by two or more processes is straightforward: it only requires that the entries in the associated page tables hold the same address for the page in main memory or on disk. The shared page can, and often does, occupy different places in the virtual address spaces of the sharing processes.

2. A page table entry can be augmented with permission bits specifying what types of access to the page are allowed and what other privileges are granted. Because all memory accesses pass through the page table, pages that are off-limits to a process automatically become inaccessible to it.


Thus, even though the virtual address space is usually much larger than the physical address space (main memory), based on the benefits just discussed, it can make sense to have a virtual address that is narrower than, or the same width as, the physical address.

3. Translation lookaside buffer


Page table access for virtual-to-physical address translation essentially doubles the main memory access delay. This is because reading or writing a memory word requires two accesses: one to the page table and one to the word itself. To avoid this time penalty, which would negate the efficiency of virtual memory, one takes advantage of the spatial and temporal locality properties of programs. Since consecutive addresses generated by a program often fall on the same page, the address translation process is likely to consult the same entry, or a small number of entries, in the page table during a given period of time. Therefore, a cache-like structure, known as a translation lookaside buffer (TLB), can be used to keep track of the most recent address translations.

When a virtual address is to be translated into a physical address, the TLB is consulted first. As in the processor cache, a portion of the virtual page number specifies where to look in the TLB, and the rest is compared against the tag(s) of the stored entry or entries. If the tags match and the associated entry is valid, the physical page number is read out and used. Otherwise, we have a TLB miss, which is handled in a manner similar to a processor cache miss. In other words, the page table is consulted, the physical address is obtained, and the result of the translation is recorded in the TLB for future use, possibly replacing an existing TLB entry.

Typically, a TLB has tens to thousands of entries; the smaller ones are fully associative, and the larger ones have lower degrees of associativity. A TLB is, in effect, a cache dedicated to page table entries. When a TLB entry is to be replaced, its flags must be copied back into the page table. Fortunately, however, this does not imply much of a time penalty, because TLB misses are quite rare. Figure 20.6 shows a TLB and its use in address translation.

Example 20.2: Address translation over TLB

A particular address translation process converts 32-bit virtual addresses to physical addresses of the same width. The memory is byte-addressable and uses 4 KB pages. A direct-mapped, 16-entry TLB is used. Specify the components of the physical and virtual addresses and the widths of the various fields in the TLB.

Solution: The byte offset takes up 12 bits in both the virtual and physical addresses. The remaining 20 bits of the virtual (physical) address form the virtual (physical) page number. The 20-bit virtual page number is divided into a 4-bit index, used to address the 16-entry TLB, and a 16-bit tag; the latter is compared against the tag read out of the TLB to determine whether there is a TLB hit. In addition to the 16-bit tag, each TLB entry must contain a 20-bit physical page number, a valid bit, and other flags copied from the corresponding page table entry.
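A direct-mapped, 16-entry TLB with exactly the field widths of example 20.2 can be modeled by a small Python class; this sketch covers only lookup and refill, leaving page table consultation (and the valid/flag bits) to the caller:

INDEX_BITS, OFFSET_BITS = 4, 12          # 16 entries, 4 KB pages

class TLB:
    def __init__(self):
        # each entry: (tag, physical page number), or None if invalid
        self.entries = [None] * (1 << INDEX_BITS)

    def lookup(self, vaddr):
        offset = vaddr & ((1 << OFFSET_BITS) - 1)
        vpn = vaddr >> OFFSET_BITS                 # 20-bit virtual page number
        index = vpn & ((1 << INDEX_BITS) - 1)      # low 4 bits select the entry
        tag = vpn >> INDEX_BITS                    # remaining 16 bits: the tag
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:  # TLB hit
            return (entry[1] << OFFSET_BITS) | offset
        return None                                # TLB miss: consult page table

    def insert(self, vpn, ppn):
        self.entries[vpn & ((1 << INDEX_BITS) - 1)] = (vpn >> INDEX_BITS, ppn)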

The TLB discussion began with a virtual address to be presented to main memory and showed how the extra memory access for consulting the page table can usually be avoided. It remains to describe how the TLB, or virtual-to-physical address translation in general, works in the presence of a processor cache. There are two approaches to the combined use of virtual memory and a processor cache.

In a virtual address cache, the cache is organized according to virtual addresses. When a virtual address is presented to the cache handler, it is divided into several fields (offset, line/set index, virtual tag) and the cache is accessed as usual. If the result is a hit, nothing else is needed. On a cache miss, address translation and main memory access are required.


Therefore, the TLB access time and other address translation overheads are incurred only in the event of a cache miss, which is a rare event.

In a physical address cache, the TLB is accessed before the cache. The physical address obtained from the TLB, or from the page table in the event of a TLB miss, is presented to the cache, leading to a cache hit or miss. On a miss, main memory is accessed with the physical address obtained prior to cache access. Since the TLB is itself much like a cache, this approach doubles the time needed to get data from the cache, even ignoring TLB misses (which are rare). However, TLB access can be relegated to a separate pipeline stage, so that the increased latency results in little or no reduction in total throughput (Figure 17.10). The overall performance is reduced only to the extent that deeper pipelining may require more frequent bubbles.

Because a virtual address cache eliminates the address translation overhead in most cases, you might wonder why physical address caches are used at all. The answer is that even though virtual address caches reduce the address translation overhead, they lead to complications elsewhere. In particular, when data must be shared among multiple processes (including application and I/O processes), that data can have different virtual addresses in the address spaces of the various processes, leading to aliasing, or the existence of multiple (potentially inconsistent) copies of the same data in the cache.

A hybrid address cache can be used to gain the speed of a virtual address cache while avoiding its aliasing problems. A commonly used hybrid approach is to make cached data placement a function only of page offset bits, which do not change during address translation. For example, if the page size is 4 KB (12-bit offset) and the cache line size is 32 B (5-bit offset), there are 12 - 5 = 7 bits in the virtual address that are invariant under address translation and can be used to determine the cache line or cache set in which the accessed word resides.
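
As a quick check of this arithmetic, the following hypothetical C fragment extracts the translation-invariant set index from a virtual address under the stated parameters (4 KB pages, 32 B lines):

    #include <stdint.h>

    #define PAGE_OFFSET_BITS 12                  /* 4 KB pages */
    #define LINE_OFFSET_BITS 5                   /* 32 B lines */
    #define INDEX_BITS (PAGE_OFFSET_BITS - LINE_OFFSET_BITS)   /* 7 bits */

    /* The set index lies entirely within the page offset, so it is identical
       in the virtual and physical addresses; the cache can therefore be
       indexed before, or in parallel with, the TLB lookup.                  */
    uint32_t cache_set_index(uint32_t vaddr) {
        return (vaddr >> LINE_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    }

With 7 index bits and 32 B lines, each way of such a cache holds 128 lines (4 KB); a larger cache must therefore grow by adding ways rather than sets.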

The preceding method, combined with the use of physical tags in the cache, leads to a hybrid scheme that can be described as virtually indexed, physically tagged. When a virtual address is presented, the processor cache and the TLB are accessed in parallel. If the physical tag retrieved from the processor cache matches the tag read from the TLB, the cached data can be used. Otherwise, there is a cache miss. However, even in this case the result of the address translation is already available, unless, of course, there was a TLB miss. Figure 20.7 graphically shows these hybrid and pure addressing options.

Note that with a processor cache, a TLB, and virtual memory, a memory access can lead to misses of three types: cache miss, TLB miss, and page fault. However, not all eight possible hit/miss combinations in these three units make sense.


4. Page Placement and Replacement


Based on the discussion so far, the designer of a virtual memory system must answer the following key questions, which are fundamentally the same as those encountered for caches:

1. Fetch policy: when to bring a particular page into main memory.

2. Placement policy: where to place a newly brought-in page (and how to find it when needed).

3. Replacement policy: which of the pages currently occupying the candidate locations to overwrite.

The most common fetch policy is demand paging: bringing in a page upon the first access to it. An alternative is prepaging, which means that pages may be brought in before they are needed by a running program. Prepaging can be beneficial, for example, if, when a particular page is accessed, nearby pages on the same disk track are also brought in, because doing so incurs very little extra time penalty, at least as far as disk access is concerned. Since most modern disk systems employ special caches to retain nearby sectors, or even entire tracks, after each disk access (Section 19.4), prepaging need not be considered further.

The fully associative nature of the main memory page mapping scheme means that page placement is not a problem (a page can be placed anywhere in main memory). It is this unrestricted placement that necessitates the use of indirect access through a page table to locate a page, and the use of a TLB to avoid this extra access in most cases. In contrast, with the direct and set-associative mappings used in caches, the single slot, or handful of possible locations, for a cache line allows it to be found quickly using simple hardware aids. Consequently, the placement problem for virtual memory will be ignored here. Note, however, that in advanced computer systems that have a multitude of local (fast-access) and remote (high-latency) memory units, placement does become an issue.


The remainder of this section is devoted to page replacement policies for choosing a page to overwrite when no page frame is available in main memory to accommodate an incoming page. Because fetching a page from disk takes millions of clock cycles, the operating system can afford a sophisticated page replacement policy that takes history and other factors into account. The running time of even a fairly complex decision process is likely to be small compared with the many milliseconds spent on a single disk access. This is in contrast to replacement policies for a processor cache or TLB, which must be extremely simple.

Ideally, a page replacement policy minimizes the number of page faults over the course of running a program. Such an ideal policy cannot be implemented, because it would require perfect knowledge of all future page accesses. Practical replacement policies are designed to approach the ideal policy by using knowledge of a program's past behavior to predict its future course. The least recently used (LRU) algorithm is one such policy. Implementing the LRU policy exactly is difficult: it requires that each page be time-stamped on each access and that the timestamps be compared to find the smallest among them at replacement time. For this reason, an approximate version of the LRU policy is often implemented, which works quite well in practice.

One way to implement LRU approximately is described next. A use bit is associated with each page; this bit specifies whether the page has been accessed in the recent past (since the last resetting of the use bit by the operating system). The use bit essentially divides pages into two groups: those with and those without recent accesses. Only pages in the latter category are candidates for replacement. Many methods can be used to select among the candidates and to periodically reset the use bits. Figure 20.8 shows one such scheme, sometimes referred to as the clock policy, where the 0s and 1s are the use bits of the various pages and the arrow is a pointer marking one of the pages. When a page must be selected for replacement, the pointer moves clockwise, resetting to 0 every 1 it encounters, and stops at the first 0, which marks where the new page will be placed, with its use bit set to 1. A page whose use bit was changed from 1 to 0 will be replaced when the pointer comes around again, unless, of course, it is used before then and its use bit is set back to 1. Note that frequently accessed pages will never be replaced under this policy.
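
A minimal sketch of the clock policy follows; it is an illustration written for this text (the frame count, the use_bit array, and the function names are hypothetical):

    #include <stdint.h>

    #define NUM_FRAMES 8

    static uint8_t use_bit[NUM_FRAMES];          /* 1 = referenced recently */
    static int     hand = 0;                     /* the clock pointer       */

    /* Select the frame to overwrite: sweep clockwise, clearing use bits,
       and stop at the first frame whose use bit is already 0.             */
    int choose_victim(void) {
        for (;;) {
            if (use_bit[hand] == 0) {
                int victim = hand;
                use_bit[victim] = 1;             /* new page starts "used"  */
                hand = (hand + 1) % NUM_FRAMES;
                return victim;
            }
            use_bit[hand] = 0;                   /* give a second chance    */
            hand = (hand + 1) % NUM_FRAMES;
        }
    }

    /* Called on every access to a resident page. */
    void touch(int frame) { use_bit[frame] = 1; }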

Virtual memory works because it takes full advantage of both the spatial and the temporal locality of memory accesses in typical programs. Spatial locality justifies bringing thousands of bytes into main memory when a page fault is encountered during an attempt to access a single byte or word. Owing to spatial locality, much of the rest of the page is likely to be used in the future, possibly more than once, which amortizes the latency of a single disk access over many memory references. Temporal locality is what makes the LRU replacement policy successful in keeping the most useful instructions and data (from the viewpoint of future memory references) in main memory and overwriting the less useful parts. If an instruction or data item has not been used for some time, chances are that the program has moved on to other things and will not need that item in the near future. Of course, it is not difficult to write a program that exhibits no spatial or temporal locality. Such a program will cause virtual memory to perform poorly. But most of the time virtual memory will do very well.

Example 20.3: LRU is not always the best policy

Consider the following program snippet, which deals with a table T having 17 rows and 1,024 columns, computes an average for each column of the table, and prints it to the screen (i is the row index and j is the column index):
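
The program fragment itself did not survive in this copy of the text; the following C code is a plausible reconstruction, consistent with the description above, and should be read only as such:

    #include <stdio.h>

    float T[17][1024];                     /* 17 rows, 1,024 columns        */

    void column_averages(void) {
        float temp;                        /* held in a processor register  */
        for (int j = 0; j < 1024; j++) {   /* for each column j             */
            temp = 0.0f;
            for (int i = 0; i < 17; i++)   /* sum down the 17 rows          */
                temp += T[i][j];
            printf("%f\n", temp / 17.0f);  /* print the column average      */
        }
    }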

T[i][j] and temp are 32-bit floating-point values, and memory is word-addressable. The temporary variable temp is kept in a processor register, so accessing temp does not involve a memory reference. Main memory is paged and holds 16 pages of 1,024 words each. The page replacement policy is least recently used.

a) Assuming that T is stored in the virtual address space in row-major order, how many page faults will be encountered, and what is the main memory hit rate?

b) What fraction of the page faults in part (a) are compulsory? Capacity? Conflict?

c) Repeat part (a), but this time assume that T is stored in column-major order.

d) What fraction of the page faults in part (c) are compulsory? Capacity? Conflict?

Solution: Figure 20.9 provides a visualization of how the elements of T map onto 17 pages for the row-major and column-major storage cases.

a) After row 0 is brought into main memory, 16 other rows are accessed before row 0 is needed again. The LRU policy forces every page out before its next use! Therefore, each access causes a page fault. There are a total of 1,024 × 17 = 17,408 page faults, and the hit rate is 0.

b) Since there are only 17 pages, 17 faults, or about 0.1%, are compulsory. There are no conflict misses with a fully associative mapping scheme. The remaining 17,391 faults, or 99.9%, are capacity misses.

c) Each page holds 1,024/17 ≅ 60 columns of T. A page is never used again once its columns have been processed. Therefore, there are only 17 page faults, and the hit rate is 99.9%.

d) All 17 faults are compulsory; there are no capacity or conflict misses.
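
The fault counts in parts (a) and (c) can be verified with a small simulation. The following sketch, a hypothetical illustration rather than anything from the book, replays the row-major access pattern of part (a) against a 16-frame LRU memory:

    #include <stdio.h>

    #define FRAMES 16
    #define PAGES  17

    int main(void) {
        int  frame_page[FRAMES];                 /* page held by each frame  */
        long last_use[FRAMES] = {0};             /* timestamps for LRU       */
        long faults = 0, t = 0;
        for (int f = 0; f < FRAMES; f++) frame_page[f] = -1;

        /* Row-major storage: page i holds row i, so the column-by-column
           traversal touches pages 0, 1, ..., 16, 0, 1, ... in turn.        */
        for (int j = 0; j < 1024; j++) {
            for (int i = 0; i < PAGES; i++) {
                int hit_frame = -1, lru_frame = 0;
                for (int f = 0; f < FRAMES; f++) {
                    if (frame_page[f] == i) { hit_frame = f; break; }
                    if (last_use[f] < last_use[lru_frame]) lru_frame = f;
                }
                if (hit_frame >= 0) {
                    last_use[hit_frame] = ++t;
                } else {
                    faults++;                    /* evict the LRU page       */
                    frame_page[lru_frame] = i;
                    last_use[lru_frame] = ++t;
                }
            }
        }
        printf("page faults = %ld\n", faults);   /* prints 17408             */
        return 0;
    }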


Figure 20.9 Paging of a 17 × 1,024 table stored in row-major or column-major order.

A page may have been modified since it was brought into main memory. Therefore, to ensure that no information is lost, such pages must be copied back to disk before being overwritten by new incoming pages. Because this writing back takes time, a "dirty bit" is associated with each page and is set whenever the page is modified. If a page is not dirty, no writing back is needed and time is saved. Note that, owing to the extreme penalty of disk access time, a write-through scheme is not an option here. With write-back, however, the number of disk accesses is minimized and the overhead becomes acceptable, as it was between cache and main memory. To avoid adding the write-back penalty, and thus a second long disk access, to a page fault, the operating system can preselect pages to replace and start the write-back process before a real need arises. Additionally, the operating system may lock certain useful or frequently needed pages in memory so that they are never replaced.

The following analogy may be helpful. You have bookshelves of limited capacity for your technical books and decide to buy or keep a new technical book only if you can discard an old book that is no longer useful to you. Certain books are very useful in your work, and you mark them as non-replaceable. If you discard books in advance, you will likely have room available for new books when they arrive; otherwise, you may be forced to stack books on your desk until you find time to decide which books to discard. Often, that time never comes, and the pile on your desk grows taller and taller, until it collapses onto you at the slightest disturbance.

5. Main and Mass Memories


Virtual memory not only makes it possible for a process to grow beyond the size of the available main memory, it also allows multiple processes to share main memory with simplified memory management. When a page fault is encountered, the faulting process is put into a dormant state and the processor is assigned to some other activity. In other words, a page fault is treated as an exception that triggers a context switch. The dormant process is reactivated later, once its required page is in main memory. In this way, control can be transferred continually among many processes, each of which goes dormant when it encounters a page fault or some other situation that requires waiting for the availability of a resource. When many processes are available for execution, the processor can always find something useful to do during the long waiting periods created by page faults. On the other hand, putting more processes in main memory means that each one gets fewer pages, which leads to more frequent page faults and lower performance. Consequently, a balance must be struck between reducing processor idle time by keeping more processes active and having enough pages of each process in main memory to ensure a low page fault rate.

A useful notion in this respect is that of the working set of a process. As a program runs, its memory accesses tend to be concentrated in a few pages for some time. Then some of these are abandoned and the focus of interest shifts to others. The working set of duration x for a process is the set of all pages referenced in the course of executing the last x instructions. The working set changes over time, so it is denoted W(t, x). The locality principle ensures that W(t, x) remains rather small for large values of x. Figure 20.10 shows typical variations in the size of a program's working set over time, where the narrow peaks correspond to shifts in the program's memory references from one locality to another (during such transitions, both the old locality and the new one have been referenced in the recent past).
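
Computing |W(t, x)| from a page reference trace is straightforward. The sketch below is a hypothetical illustration (the function name and the MAX_PAGES bound are assumptions):

    #include <stdbool.h>

    #define MAX_PAGES 4096                /* assumed bound on page numbers */

    /* trace[] holds the page number of each reference. Returns |W(t, x)|:
       the number of distinct pages referenced in the window of the last
       x references ending at index t.                                     */
    int working_set_size(const int *trace, int t, int x) {
        static bool seen[MAX_PAGES];
        int count = 0;
        for (int p = 0; p < MAX_PAGES; p++) seen[p] = false;
        int start = (t - x + 1 > 0) ? t - x + 1 : 0;
        for (int k = start; k <= t; k++)
            if (!seen[trace[k]]) { seen[trace[k]] = true; count++; }
        return count;
    }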

If the working set of a process resides in main memory, few page faults occur. Beyond this point, allocating more memory to the process may not have a significant effect on performance. If the dotted line in Figure 20.10 represents the amount of main memory allocated to the program, the program will run without page faults for long stretches.

If we knew exactly what the working set of a process would be at any given time, we could preload its members into main memory and prevent most page faults. Unfortunately, since the past behavior of a program does not always accurately reflect what will happen in the future, this is impossible to do. Thus, the notion of a working set is mostly used as a conceptual tool in memory allocation. Keeping an eye on the page faults generated by a process gives a fairly accurate indication of whether the amount of memory allocated to the process is sufficient to accommodate its working set. Frequent page faults indicate that the allocated memory is inadequate. If a process must be suspended to improve memory performance, the one generating the most page faults is a good candidate. In the same vein, if all active processes produce low levels of page faults, then perhaps a new process can be activated and brought into main memory.

Typically, processes are created by placing their pages in secondary memory. When the execution of a process is about to begin, it is allocated a certain number of main memory page frames, and its pages are brought in gradually (demand paging). Some of these pages, expected to be accessed frequently, may be locked in main memory. Once the initially allocated page frames have been filled, several options are available when the next page fault occurs. With a local replacement policy, one of the process's own unlocked pages is chosen for replacement. This means that the number of main memory pages allocated to the process remains constant over time. A global replacement policy chooses the page to replace regardless of which process it belongs to. This causes more active processes, or those with larger working sets, to gradually occupy more page frames at the expense of other processes; it therefore leads to a form of automatic memory allocation and balancing: as a process stops using certain pages, it yields space to other, more active processes, and it can later reclaim some space as it generates more activity of its own.


6. Improving Virtual Memory Performance


The implementation and performance of virtual memory are affected by several parameters whose impact and interrelationships must be well understood, not only by designers but also by users. These include:

Main memory size.

Page size.

Replacement policy.

Address translation scheme.

The first three parameters and their interrelationships are very similar to those of processor caches. Accordingly, Table 20.1 lists parameters related to hierarchical memory schemes (of which caches and virtual memory are examples) and their performance effects. You are already familiar with the statements in Table 20.1 as they relate to caches. Their implications for virtual memory are discussed below, along with additional issues related to implementing page tables for address translation.

The performance impact of the main memory size is obvious. In the extreme, when main memory is large enough to hold an entire program and its data, only compulsory page faults occur. However, a very large main memory is not only more expensive but also slower. So there is a hidden speed penalty to consider, along with cost-effectiveness trade-offs, when choosing the size of a main memory.

The choice of a page size is a function of typical program behavior, as well as of the difference between the latencies and data rates of main and mass memories. When program accesses exhibit a large amount of spatial locality, a larger page size tends to reduce the number of compulsory faults, owing to the prefetching effect. However, the same larger pages waste space and memory bandwidth when there is little or no spatial locality. Thus, there is often an optimal page size that leads to the best average performance for the programs that typically run on a specific computer. The higher the ratio of disk access time to main memory latency, the larger the optimal page size. Figure 20.11 shows how disk seek time (which varies almost in proportion to total disk latency) and main memory access time have changed over the years. The ratio of the two parameters has remained roughly constant. This fact, along with faster caches that effectively hide some of the DRAM latency, explains why page sizes have hovered around 4 KB.

It is interesting to note that in the early 1980s, as the speed gap between disk memory and DRAM seemed destined to grow (while their storage densities grew closer), predictions that disks would gradually give way to semiconductor memories were commonplace. Partly as a result of this challenge, disk manufacturers intensified their research and development efforts and actually succeeded in narrowing the gap over the following decade (Figure 20.11).

The impact of the replacement policy on performance has been studied extensively, and many novel policies, as well as refinements of older methods, have been offered in an effort to reduce the frequency of page faults. Figure 20.12 shows experimental results on the page fault rates of approximate LRU and LRU, compared with the ideal algorithm and the first-in, first-out (FIFO) algorithm; these results come from a very old study conducted by running a Fortran program with a fixed page size of 256 words [Baer80]. However, other researchers have reported similar results in different contexts. The number of pages allocated to a process appears along the horizontal axis and can be viewed as representing the size of main memory in a single-process environment. Based on the trends shown in Figure 20.12, it is concluded that the relative effect of the replacement policy on the page fault rate is confined to the vicinity of a small constant factor (about 2). The difference is very significant in absolute terms when a small set of pages is allocated to a process (or, equivalently, when main memory is rather limited), and it becomes less pronounced when memory is plentiful.

Conceptually, address translation is implemented through a page table, and its impact on performance is softened by the use of a TLB. As noted earlier, page tables can be huge. For example, 32-bit virtual addresses imply page tables with 1M entries when the page size is 4 KB. Of course, the page table must reside in main memory if the required address translation is to be performed quickly in the event of a TLB miss. This can be achieved without setting aside large areas of little-used main memory for page tables. The trick is to let the pages of the page table be swapped in and out of memory like all other pages. The locality principle ensures that the most useful "chunks" of the page table remain in main memory and can be accessed quickly.


An alternative to a large monolithic page table is a two-level structure. For example, with 20-bit virtual page numbers, the top 10 bits could be used as an index into a page directory (level-1 page table) with 1,024 entries. Each entry in this page directory contains a pointer to a level-2 table of 1,024 entries. With 32-bit entries, each such table fits into a 4,096 B page. The page directory can be locked in main memory, and the level-2 tables swapped in and out as required. An advantage of this scheme is that for typical programs, which use a few contiguous pages at the beginning of their virtual address space and one or more pages at the end to hold a stack, only a few level-2 tables will ever be needed, making it feasible for them to reside in main memory. A drawback of two-level page tables is the need for two table accesses per address translation. However, the use of a TLB mitigates this to a large extent.
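
The two-level walk described above can be sketched as follows; the 10 + 10 + 12 bit split matches the example in the text, but the structure and helper names are illustrative only and do not reflect any particular processor's entry format:

    #include <stdint.h>

    typedef struct { uint32_t present : 1, ppn : 20; } Pte;   /* simplified */

    extern Pte *page_directory;              /* 1,024 entries, locked in memory */
    extern Pte *level2_table(uint32_t ppn);  /* locate a level-2 table by its
                                                physical page number            */

    /* Translate a 32-bit virtual address; returns 0 to signal a page fault
       (a simplification: 0 is treated here as an invalid physical address). */
    uint32_t translate(uint32_t vaddr) {
        uint32_t dir_idx = vaddr >> 22;            /* top 10 bits    */
        uint32_t tbl_idx = (vaddr >> 12) & 0x3FF;  /* middle 10 bits */
        uint32_t offset  = vaddr & 0xFFF;          /* low 12 bits    */

        Pte dir = page_directory[dir_idx];
        if (!dir.present) return 0;                /* level-2 table absent */
        Pte pte = level2_table(dir.ppn)[tbl_idx];
        if (!pte.present) return 0;                /* page fault           */
        return (pte.ppn << 12) | offset;
    }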

PROBLEMS
1) Paging analogies in everyday life

Describe what each of the following has in common with virtual memory and paging; elaborate on the counterparts of the fetch, placement, and replacement policies, and of address translation, where applicable.

a) Supermarket manager who assigns shelf space to products.

b) Films that are chosen for screening in cinemas around the world.

c) Cars parked on a street where there are parking meters.

2) Parts of a memory hierarchy

Consider the following diagram of a hierarchical memory system:


In the diagram, 'Ejec' stands for execution and 'Disco' for disk.

a) Assign an appropriate label to each unlabeled box and explain its function.

b) Name useful hardware technologies for each level.

c) Provide an approximate value for the latency ratios between successive levels.

d) Provide an approximate figure for the hit rate at each level.

e) Mention typical placement and replacement schemes at each level.

f) Describe the address translation mechanism, if any, at each level.

3) Virtual memory concepts

Answer true or false, with full justification.

a) The numerical difference between a virtual address and the corresponding physical address is
always non-zero.

b) The size of a virtual address space can exceed the total disk capacity in bytes.

c) It is not possible to run the address translation mechanism backward and obtain a corresponding virtual address from a given physical address.

d) Pages should be made as large as possible to minimize the size of page tables.

e) When many processes share a page, their corresponding page table entries must be identical.

f) An optimal replacement policy, one that always leads to the fewest possible page faults, is not realizable.

4) Virtual and physical addresses

A byte-addressable virtual memory system with 4,096 B pages has eight virtual pages and four physical page frames in main memory. Virtual pages 1, 3, 4, and 7 reside in page frames 3, 0, 2, and 1, respectively.

a) Specify the virtual address ranges that would lead to page faults.

b) Find the physical addresses corresponding to the virtual addresses 5,000; 16,200; 17,891; and 32,679.

5) Data movement in memory hierarchy

a) Draw and label appropriately all components of a diagram similar to Figure 20.2 showing two levels
of caching.


b) Repeat part (a), but this time involve single-level caching with split caches.

c) Draw a diagram combining parts (a) and (b); that is, with split L1 caches and a unified L2 cache.

6) Address translation via TLB


a) Show how each of the translation options in Figure 20.7 would work with the TLB specified in Example 20.2.
b) Redo Example 20.2 with a 64-entry, two-way set-associative TLB.
c) Repeat part (a) for the TLB specified in part (b).

7) Approximate LRU replacement algorithm

How would the program fragment in Example 20.3 behave under the approximate LRU replacement algorithm? Clearly state all your assumptions.

8) MRU replacement algorithm

A friend suggests that replacing the most recently used (MRU) page might be a viable alternative. You try to convince him that it is not a good idea, given the spatial and temporal locality properties of programs. He insists that MRU can outperform LRU in some cases.

a) Does your friend's statement have any merit?

b) Is MRU easier to implement than LRU?

9) Approximate LRU replacement algorithm

Figure 20.8 exemplifies the use of "use bits" in the implementation of an approximate version of the LRU algorithm.

a) Under what conditions does this approximate version show exactly the same behavior as LRU?

b) A variation of this method also uses "dirty bits" (also known as "modified bits") in replacement decisions. Describe how this more sophisticated form of the algorithm might work.

c) What are the benefits of using a two-bit "use counter" instead of a single "use bit" for this algorithm?

10) Working sets

a) Ignoring the print statement, convert the program fragment in Example 20.3 into a MiniMIPS instruction sequence.

b) Plot the variations of the working set for the program in part (a), based on the last 20, 40, 60, . . . instructions, assuming that the table is stored in row-major order.

c) Repeat part (b), but this time assume that the table is stored in column-major order.


11) Page fault frequency

Consider the execution of the following loop, where the arrays A, B, and C are stored on separate pages of 1,024 words:

Indicate the sequence of page references and discuss the frequency of page faults with various amounts of main memory allocated to the program containing the loop. Clearly state all your assumptions.

12) Page replacement policy

A process accesses its five pages in the following order: A B C D A B E A B C D E.

a) Determine which page accesses cause a page fault if there are three page frames in main memory and the replacement policy is FIFO, LRU, or approximate LRU (three cases).

b) Repeat part (a) for four page frames.

13) Optimal page replacement policy

Show that an optimal replacement policy is obtained by replacing the page whose next reference is furthest in the future [Matt70].

14) Average access time to main memory

In a demand-paging system, the main memory cycle time is 150 ns and the average disk access time is 15 ms. Address translation requires an extra memory access in the event of a TLB miss and takes negligible time otherwise. The TLB hit rate is 90%. Of the TLB misses, 20% lead to a page fault (2% of all accesses). Calculate the average access time for this virtual memory system. In what way is the calculated average access time misleading?

15) Reduction of page faults through blocking

Example 20.3 showed that, depending on whether a table is stored in row-major or column-major order, either excessive or very few page faults are generated. Consider the problem of multiplying square 256 × 256 matrices. Assume a total main memory allocation of 64 pages, each containing 1,024 words, for the two operand matrices and the result matrix.

a) Assuming row-major storage of the three matrices involved, derive the number of page faults when matrix multiplication is performed via the simple algorithm with three nested loops, computing cij as the sum of the product terms aik × bkj over all values of k, and the LRU replacement algorithm is used.


b) Repeat part (a) with column-major storage of the three matrices.

c) Show that if the matrices are viewed as 8 × 8 arrays of 32 × 32 blocks and matrix multiplication is performed block by block, the number of page faults is significantly reduced.

16) Two-level page tables

As mentioned at the end of Section 20.6, two-level page tables are used to reduce the memory requirements of a large monolithic page table, which usually has large blocks of unused entries.

a) Draw a diagram similar to Figure 20.4, but with a two-level page table.

b) Repeat part (a) for Figure 20.5.

c) How is the TLB organized when two-level page tables are used?

d) Explain why it makes sense for the second-level page tables to occupy one page each.

e) Study the two-level page table structure of Intel's Pentium and prepare a two-page summary of its address translation scheme, including how it uses the TLB.

17) Virtual memory performance

Given a sequence of numbers representing virtual page numbers in the order in which they are accessed, you can evaluate the performance of a virtual memory system with known parameters by determining which accesses lead to page faults. Now consider reversing the given sequence, so that accesses to the virtual pages occur in exactly the opposite order (i.e., CBBAACBA instead of ABCAABBC). Can you deduce anything about the number of page faults encountered with this reversed order relative to the original access sequence? Fully justify any assumptions you make and your conclusion.


UNIT 6
INPUT/OUTPUT DEVICES

CHAPTER TOPICS
1. Input/output Devices and Controllers
2. Keyboard and Mouse
3. Visual Display Units
4. Hard-Copy Input/output Devices
5. Other Input/output Devices
6. Networking of Input/output Devices

This chapter reviews the structure and operating principles of some common input/output devices, both those that present data and those that capture data for archiving or further processing. In addition to the type of data presented or recorded, I/O devices can be categorized by their data rates, from very slow data entry devices (keyboards) to high-bandwidth storage subsystems. It will be seen that the required data rate can dictate how the CPU, memory, and I/O devices interact. Increasingly, the input source, or output destination, of a computer is another computer linked to it over a network. For this reason, the networking of I/O devices is also discussed in this chapter.

1. Input/output Devices and Controllers

The processor and memory are much faster than most I/O devices. In general, the closer a unit is to the CPU, the faster it can be accessed. For example, if main memory is 15 cm away from the CPU (Figure 3.11a), signal propagation alone takes 1 ns in each direction, since electronic signals travel at almost half the speed of light; to this round-trip propagation delay of 2 ns must be added various logic delays, buffering times, bus arbitration delay, and the memory access latency. Besides the additional delays due to their greater distance from the CPU, many I/O devices are also slower by virtue of their electromechanical nature.

Table 3.3 lists some I/O devices along with their approximate data rates and application domains. Besides the type of data input or output, which was used in Table 3.3 as the primary classification criterion, I/O devices can be categorized according to other technological or application characteristics. The medium used for data presentation (hard copy versus electronic or soft copy), the degree of human involvement (manual versus automatic), and the output quality (draft versus publication or photo quality) are some of the relevant characteristics.

Modern I/O devices are highly intelligent and often contain their own CPU (usually a low-end microcontroller or general-purpose processor), memory, and communication capabilities, which increasingly include network connectivity. Interfacing with such devices, which may have many megabytes of buffer memory, is quite different from controlling early I/O devices, which had to be fed information, or directed to send data, one byte at a time.


Devices communicate with the CPU and memory through an I/O controller connected to a shared bus. Simple low-end systems may use the same bus that carries data back and forth between the CPU and memory (Figure 21.1). Because the common bus in Figure 21.1 contains address lines used for memory accesses, it is easy to let the same lines select devices for I/O operations by allocating a portion of the memory address space to them. This type of memory-mapped I/O, along with other strategies for controlling I/O through programming, is covered in Chapter 22. More elaborate or higher-performance systems may have a separate I/O bus to ensure that bandwidth is not taken away from CPU-memory transfers (Figure 21.2). When multiple buses exist in a system, they are interfaced to each other through bus adapters. Chapter 23 covers buses, relevant standards, and interfacing methods.

The I/O controllers in Figures 21.1 and 21.2 perform the following roles:

1. Isolate the CPU and memory from the details of I/O device operation and from specific interface requirements.

2. Facilitate expansion in capabilities and innovation in I/O device technology without affecting CPU design and operation.

3. Manage the (potentially wide) speed mismatch between processor/memory on one side and I/O devices on the other, through buffering.

4. Convert data formats or encodings and enforce data transfer protocols.

I/O controllers are computers in their own right. When activated, they run special programs that guide them through the required data transfer operations and associated protocols. For this reason, they introduce nontrivial latencies into I/O operations, which must be taken into account when determining I/O throughput. Depending on the system and device types, I/O controller latencies of a few milliseconds are not uncommon.


2. Keyboard and Mouse

A keyboard consists of an array of keys used to enter information into a computer or other digital device. Smaller keyboards, such as those found on phones or in a separate area of many desktop keyboards, are known as keypads. Almost all alphanumeric keyboards follow the QWERTY layout, a name derived from the labels of the first keys in the top row of letters on the standard typewriter keyboard. Over the years, serious attempts have been made to introduce other key layouts that would make the most frequently used letters of the alphabet easier to reach and thus increase the speed of data entry. It is said that the QWERTY layout was chosen deliberately to reduce typing speed, with the intention of preventing the mechanical hammers of early typewriters from jamming. Unfortunately, familiarity with the QWERTY layout and the vast experience invested in using it have prevented the adoption of these more efficient layouts. However, some progress has been made in the area of unconventional key placements that give the user greater comfort while typing. Keyboards that follow such designs are called ergonomic keyboards.

Figure 21.3 shows two specific designs for a key that closes an electrical circuit when the key cap, attached to a plunger assembly or to a membrane mounted over shallow pistons, is pressed. Many other designs are in use. Membrane switches are inexpensive and commonly used in embedded applications, such as the control panel of a microwave oven. Mechanical switches, such as the one shown in Figure 21.3a, are used in standard desktop keyboards in view of their tactile feedback and sturdier construction. Since contacts can become dirty over time, other types of switches rely on magnetically induced force, rather than mechanical force, to close a pair of contacts. In this way, the contacts can be placed in a sealed enclosure to increase reliability. No matter what type of contact is used, the circuit may be reopened and reclosed many times with each key press. This effect, known as contact bounce, can be avoided by not using the key signal directly but letting it set a flip-flop, whose output goes high with a key press and remains high regardless of the length or severity of the contact bounce. It is also possible to deal with the effects of contact bounce in properly designed software that handles keyboard data acquisition.
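
One common software approach is sketched below under assumed names (a periodic scan routine calls debounce with the raw key state): a key state change is accepted only after it has remained stable for several consecutive scans.

    #include <stdbool.h>

    #define STABLE_SCANS 5      /* e.g., 5 scans spaced 2 ms apart = 10 ms */

    /* Called periodically, e.g., from a timer. Returns the debounced key
       state; raw bounces shorter than STABLE_SCANS scans are ignored.     */
    bool debounce(bool raw_pressed) {
        static bool stable_state = false;
        static int  count = 0;

        if (raw_pressed != stable_state) {
            if (++count >= STABLE_SCANS) {   /* change held long enough  */
                stable_state = raw_pressed;
                count = 0;
            }
        } else {
            count = 0;                       /* bounce: restart counting */
        }
        return stable_state;
    }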


Regardless of the physical layout, the keys of a keyboard or numeric keypad are frequently arranged, from a logical point of view, in a 2D square or rectangular array. Each key sits at the intersection of a row circuit and a column circuit. Pressing a key causes electrical changes in the row and column circuitry associated with that key, allowing identification of the key that was pressed (Figure 21.3c). An encoder then converts the row and column identities into a single symbol code (usually ASCII), or a sequence of codes, to be transmitted to the computer. The physical keyboard layout and key ordering are not relevant to the detection and encoding processes.

To obtain the four-bit output corresponding to the key pressed on the hexadecimal keypad of Figure 21.3c, the row signals are asserted in turn, perhaps by means of a four-bit ring counter that follows the counting sequence 0001, 0010, 0100, 1000. As the signal for a particular row is asserted, the column signals are observed. The two-bit encoding of the row number, joined to the two-bit encoding of the column number, provides the four-bit output. A 64-key keyboard can be logically arranged into an 8 × 8 array (even though the physical arrangement of the keys may not be square). A six-bit output code is then produced in a manner similar to the procedure for the hex keypad. If the ASCII representation of the symbol is desired, a lookup table with 64 entries can be used.
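
The row-scan procedure can be expressed in a few lines of C. In the sketch below, drive_rows and read_columns stand in for the actual port operations, which depend on the hardware; the code is an illustration, not a driver for any specific device.

    #include <stdint.h>

    extern void    drive_rows(uint8_t pattern);   /* assert one row line  */
    extern uint8_t read_columns(void);            /* one-hot column sense */

    /* Scan the 4 x 4 keypad; return the 4-bit key code (2-bit row number
       joined to 2-bit column number), or -1 if no key is pressed.         */
    int scan_keypad(void) {
        for (int row = 0; row < 4; row++) {
            drive_rows((uint8_t)(1u << row));     /* ring counter: 0001, 0010, ... */
            uint8_t cols = read_columns();
            for (int col = 0; col < 4; col++)
                if (cols & (1u << col))
                    return (row << 2) | col;
        }
        return -1;
    }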

In addition to a keyboard, most desktop and laptop computers have pointing devices that allow the user to choose menu options, select text and graphic objects for editing, and perform a variety of other operations. In fact, some keyboardless devices rely exclusively on pointing for their control and data entry functions. The most commonly used pointing device is the mouse, named for its shape in some early designs and for the tail-like wire that connected it to the computer. Modern mice come in various forms, and many of them use wireless links to the computer.


In a mechanical mouse, two counters are associated with the x and y rollers (Figure 21.4a). Lifting the mouse resets the counters to zero. As the mouse is dragged, the counters increment or decrement according to the direction of movement. The computer uses the counter values to determine the direction and extent of the motion to be applied to the cursor. Optical mice are more accurate and less prone to failure due to the accumulation of dust and lint in their parts. Simpler optical mice detect grid lines on a special mouse pad (Figure 21.4b). Newer, more advanced versions use small digital cameras that allow them to detect motion on almost any surface.

Often, a touchpad replaces a mouse in laptop computers. The location of a finger touching the pad is sensed (through row and column circuitry), and the movement of the cursor is determined by how far, how fast, and in which direction the finger moves. A touchscreen is similar, except that instead of a separate pad, the surface of the display screen itself is used for pointing. Other pointing devices include the trackball (a moving sphere, essentially an upside-down mechanical mouse whose large ball is manipulated by hand) and the joystick, which finds applications in many computer games as well as in industrial control scenarios. Some laptops use a small joystick embedded in the keyboard for pointing.

Today's keyboards and pointing devices are quite smart and often have dedicated processors (microcontrollers) built in.

3. Visual Display Units

Visual presentation of symbols and images is the primary output method for most computers. The available options range from small monochrome screens (of the types used in cell phones and cheap calculators) to large, high-resolution, and rather expensive display devices aimed at professional graphic artists and advertisers. Until recently, the cathode ray tube (CRT) was the primary type of visual display device for desktop computers, while flat-panel displays were reserved for laptops and other portable digital devices. It is predictable that the bulky, heavy, power-hungry CRTs will gradually be replaced by flat-panel display units, whose costs are becoming affordable and whose small footprint and lower heat generation are important advantages for home and office use.


CRT displays work by means of an electron beam that sweeps the surface of a sensitized screen, creating a dark or light pixel at each point it passes (Figure 21.5a). Early CRTs were monochromatic. In such CRTs, the electron beam hits a phosphor layer on the back of the screen glass. This phosphor layer emits light when struck by a stream of electrons traveling at very high speed, and the intensity of the electron beam as it passes a specific point determines that point's level of brightness. To protect the phosphor layer from the direct bombardment of electrons, a layer of aluminum is placed behind it; this layer also acts as an electrical contact. Color CRT technology went through several stages. Early attempts used a monochrome CRT viewed through rotating color filters, while the pixels associated with the various colors were displayed in synchrony. Other designs replaced the mechanical filters with electronically controlled ones or used multiple layers of phosphor, each emitting light of a different color.
Modern CRTs in common use today are based on the tricolor "shadow mask" method introduced in RCA television sets in the 1950s. Design details vary, but the scheme used in Sony's Trinitron tubes is representative. As shown in Figure 21.6, narrow strips of phosphors producing three different colors of light are placed on the back of the glass face of the tube. The three colors are red, green, and blue, which gives the name "RGB" to the resulting scheme. Three separate electron beams scan the surface of the screen, each arriving at a slightly different angle. A shadow mask, a metal plate with openings or holes, forces each beam to strike only the phosphor strips of one particular color. The separation between two consecutive strips of the same color, known as the pitch, dictates the resolution of the resulting image.


The three electron beams are controlled based on a representation of the image to be displayed. Each pixel is associated with four (for simple 16-color images) to 32 bits of data in a frame buffer. If the screen resolution is 1K × 1K pixels and each pixel can take on 64K possible colors (16 bits), the frame buffer needs 2 MB of space. This space can be provided in a dedicated video memory or as part of the main memory address space (shared memory). Often, dedicated video memories are dual-ported to allow simultaneous access by the CPU, to modify the image, and by the display controller, to present it. Such video memories are sometimes referred to as VRAM. Figure 21.5 shows the scanning motion of the electron beam, which covers the entire surface of the screen 30-75 times per second (a figure known as the refresh rate), and a frame buffer that stores four bits of data per pixel. In practice, up to 32 bits of data are needed per pixel:

32 bits: eight each for R, G, B, and A ("true color").

16 bits: five each for R and B, six for G ("high color").

Eight bits, allowing 256 different colors, in VGA format.

In these descriptions, "A" stands for "alpha" (a fourth component used to control color mixing), and VGA refers to the old video graphics array standard, which is still supported for compatibility.

Example 21.1: Total video throughput

Consider a visual display unit with a resolution of 1,024 × 768 pixels and a refresh rate of 70 Hz. Calculate the total throughput required of the video memory to support this display, assuming eight, 16, or 32 bits of data per pixel.

Solution: The number of pixels accessed per second is 1,024 × 768 × 70 ≅ 55M. This implies a total throughput of 55 MB/s, 110 MB/s, or 220 MB/s, depending on the pixel data width. With 55M pixels accessed per second, about 18 ns are available for reading each pixel, even if the time lost to the left and right margins and to beam retrace is ignored. Consequently, if the desired data rates are to be achieved with typical VRAMs, multiple pixels must be read in each access.
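
The arithmetic of this example can be reproduced directly; the sketch below merely restates the example's numbers in code form:

    #include <stdio.h>

    int main(void) {
        const long pixels_per_s = 1024L * 768 * 70;   /* ~55M pixels/s     */
        const int  bytes_per_pixel[] = {1, 2, 4};     /* 8, 16, or 32 bits */

        printf("time per pixel: %.1f ns\n", 1e9 / pixels_per_s);
        for (int i = 0; i < 3; i++)
            printf("%d B/pixel -> %.0f MB/s\n", bytes_per_pixel[i],
                   pixels_per_s * (double)bytes_per_pixel[i] / 1e6);
        return 0;
    }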

Physically, CRT displays are bulky, consume a lot of power, and generate a lot of heat. In addition, the image presented by a CRT suffers from distortion as well as nonuniform focus and color. For these reasons, liquid crystal display (LCD) flat panels have become popular, especially in light of their continuous improvement in image quality and cost. Figure 21.7a shows a passive-matrix display. Each point of intersection between a row and a column acts as a small optical shutter that is controlled by the difference between the row and column voltages. Since the row voltage is applied to every cell in a row, and the cells also receive crosstalk voltages from other rows, contrast and resolution tend to be low. Active-matrix displays are similar, except that a thin-film transistor is planted at each intersection. A transistor that is turned on allows the associated liquid crystal cell to be charged to the voltage on the column line. In this way, an image that is "written" row by row is preserved until the next refresh cycle.


Other flat-panel display technologies include those based on light-emitting diodes (LEDs) and on the plasma phenomenon. LED displays consist of arrays of LEDs of one or more colors, with each LED "addressed" through a row index and a column index. LED panels are suitable for displays that must be viewed outdoors, which require a great degree of brightness. A recent development in LED displays is the rise of organic LEDs (OLEDs), which require less power than ordinary LEDs and offer a variety of colors, including white. The operation of plasma displays is similar to that of a neon lamp. They exploit the property of certain gas mixtures that break down into a plasma when subjected to a very strong electric field. The plasma conducts electricity and converts some of the electrical energy into visible light. Currently, plasma displays are very expensive and therefore have limited applications.

Many computers also offer output compatible with the television presentation format, allowing any TV set to act as a display unit. The same TV-compatible signal can be fed to a video projector, producing a wall-sized replica of the image on the computer's CRT or flat-panel display. These presentation technologies are useful for computer-based presentations everywhere from small boardrooms and classrooms to large auditoriums.

4. Hard-Copy Input/output Devices


Despite numerous predictions that paperless offices, or perhaps a paperless society, would make hard-copy I/O devices obsolete, there is still a need for such equipment. Paper documents, far from falling into disuse, have proliferated with the increasing use of computers in business and domestic settings.

Hard-copy input is accepted through scanners (digitizers). A scanner works like a CRT display unit, but in the opposite direction. Whereas a CRT display converts electronic signals into color and brightness levels with a scanning motion, a scanner senses color and intensity and converts them into electronic signals as it traverses an image line by line, and point by point within each line (Figure 21.8). This information is stored in memory in one of many standard formats. Scanners are characterized by their spatial resolution, expressed in number of pixels or dots per inch (e.g., 1,200 dpi), and their color fidelity (number of colors). With regard to the type of document input to the scanner, such devices are divided into sheet-fed, flatbed, overhead, and handheld types. Flatbed scanners have the advantage of not damaging the original document by folding it, and they allow the digitization of book and newspaper pages. Overhead and handheld scanners can scan large documents without being bulky or expensive. Cheap scanners are making fax machines obsolete.


In the case of scanned text, the image can be converted into symbolic form via optical character recognition (OCR) software and stored as a text file. This type of image-to-text conversion, using scanning followed by OCR, is commonly employed to put old books and other documents online. Handwriting recognition has also become increasingly important as applications for personal digital assistants (PDAs) and keyboardless laptops proliferate.

The development of modern printers is one of the awe-inspiring success stories of computing. Early computer printers worked much like old mechanical typewriters: they used character-forming devices that struck a cloth or plastic ribbon with a hammer-like mechanism to print characters on paper one after another (character printers). To increase the speed of such printers, special mechanisms were developed that allowed entire lines to be printed at once, leading to a wide variety of line printers. Gradually, line printers evolved from noisy, bulky (refrigerator-sized) machines to smaller, quieter units. It was soon realized that forming characters by selecting a subset of dots in a 2D matrix of dots (Figure 21.9) would provide greater flexibility in supporting arbitrary character sets as well as in forming images. Dot-matrix printers therefore gradually replaced many impact-type printing technologies.

Modern printers basically print a large dot-matrix image of the page, composed from text or graphics files, using PostScript or other intermediate printer file formats. Each dot of the page image is characterized by an intensity and a color. Relatively slow and inexpensive inkjet printers (Figure 21.10a) print the dots one or a few at a time. The ink droplets are ejected from the print head by various mechanisms, such as heating (which leads to the expansion of an air bubble inside the head) or the generation of shock waves by a piezoelectric crystal transducer next to the ink reservoir. The larger and faster laser printers (Figure 21.10b) form a copy of the image to be printed as electric charge patterns on a rotating drum and print the entire page at high speed.

Like a scanner, a dot-matrix printer is characterized by its spatial resolution, expressed in number of pixels or dots per inch (for example, 1,200 dpi), and its color fidelity (number of colors). Total print throughput, another key performance characteristic of computer printers, ranges from a few to many hundreds of pages per minute (ppm), sometimes varying even for the same printer depending on resolution and color requirements. Many older printers required special paper coated with chemicals or packaged in rolls to simplify the paper-feeding mechanism. Modern printers use ordinary paper and are called plain-paper printers.


Black-and-white printers deposit ink droplets, or fuse toner particles, on paper according to the document's requirements. Gray levels are created by letting some of the white of the underlying paper show through. Color printers work similarly to color CRTs in that they create multiple colors from three basic colors. However, the three colors used in printers are cyan (blue-green), magenta, and yellow, which together form the CMY color scheme, different from the RGB scheme of CRTs. The reasons for the difference have to do with the way the human eye perceives color. The CMY colors represent the absence of the RGB colors (cyan is the absence of red, and so on). Thus, cyan absorbs red, magenta absorbs green, and yellow absorbs blue. The RGB color scheme is additive, meaning that a desired color is created by adding the appropriate amount of each of the three primary colors. The CMY scheme is subtractive and forms a desired color by removing the appropriate components from white light. Mixing these three colors in equal amounts should absorb all three primary colors and leave black. However, the black thus produced is rather unsatisfactory, especially in view of the extreme sensitivity of the human eye to any color cast in black. For this reason, most color printers use the CMYK scheme, where K stands for black.
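
The subtractive relationship can be made concrete with the simplest textbook RGB-to-CMYK conversion, shown below; real printer drivers use more elaborate, device-specific color transformations:

    /* Convert additive RGB (components in 0.0-1.0) to subtractive CMYK.
       Each CMY component is the absence of the corresponding RGB color;
       K extracts the common black component so that pure black ink can
       replace an unsatisfactory mix of the three colored inks.          */
    void rgb_to_cmyk(double r, double g, double b,
                     double *c, double *m, double *y, double *k) {
        double max = r > g ? (r > b ? r : b) : (g > b ? g : b);
        *k = 1.0 - max;
        if (*k >= 1.0) {                       /* pure black */
            *c = *m = *y = 0.0;
            return;
        }
        *c = (1.0 - r - *k) / (1.0 - *k);
        *m = (1.0 - g - *k) / (1.0 - *k);
        *y = (1.0 - b - *k) / (1.0 - *k);
    }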

Color printing is much more complicated than color display. The reasons include the difficulty of precisely controlling the size and alignment of the dots of the various colors and the possibility of the colors running if placed too close together. Greater problems arise from the reduction in resolution when different intensity levels (grayscale values between black and white) must be supported. For example, one way to create the illusion of five levels of gray (0, 25, 50, 75, and 100%) is to divide the print area into blocks of 2 × 2 pixels and place 0-4 black pixels in each block to create the five levels. However, this reduces the resolution by a factor of 2 in each direction (a factor of 4 overall).
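
The five-level scheme just described amounts to a small lookup table; the particular dot patterns below are illustrative, since real halftoning uses carefully designed arrangements:

    /* Map a gray level 0..4 (0, 25, 50, 75, 100% black) to a 2 x 2 block
       of pixels; 1 = black dot, 0 = paper left white.                    */
    static const int halftone[5][2][2] = {
        {{0, 0}, {0, 0}},   /*   0% */
        {{1, 0}, {0, 0}},   /*  25% */
        {{1, 0}, {0, 1}},   /*  50% */
        {{1, 1}, {0, 1}},   /*  75% */
        {{1, 1}, {1, 1}},   /* 100% */
    };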

Although ordinary modern printers can produce high-quality hard-copy output, specialized hard-copy output devices are also available for particular requirements. Examples include plotters, which are used to produce technical and architectural drawings, usually on very large sheets of paper, and photo printers, which differ from regular printers only in the quality of their printing mechanisms and the types of paper they accept.

The use of office machines that combine a scanner and a printer is becoming more common. Such machines can offer fax transmission and photocopying capabilities with little additional hardware. Fax machines scan documents into image files before transmission, so when scanning capability is present, the rest is simple. In fact, documents are increasingly sent by scanning followed by transmission as an e-mail attachment, rather than via fax machines. Copying is, in essence, scanning followed by printing.

Example 21.2: Total data throughput of a digital copier

Assuming a resolution of 1,200 dpi, a copy area of 8.5 × 11 inches, and 32-bit color in a digital copier composed of a scanner followed by a laser printer, derive the total data throughput required to support a total copying rate of 20 ppm (pages per minute).

Solution: The data rate needed to print one page every 3 seconds is 8.5 × 11 × 1,200² × 4 / 3 ≅ 180 MB/s. If the data from the scanner is stored in memory and then retrieved for printing, the data rate to be supported by the memory is about 360 MB/s. This is well within the capabilities of modern DRAMs. However, it is possible to substantially reduce this rate by taking advantage of the white space in most documents (data compression).

5. Other input/output devices


In addition to the secondary and tertiary memories that constitute the I/O devices commonly used for
stable data storage and archiving (Chapter 19), many other types of input and/or output units are
available. This section reviews the most important of these options to complete the study of I/O
devices.

A still digital camera or video camera captures images for input to a computer much as a scanner does.
The incoming light from outside the camera is converted into pixels and stored in flash memory or
some other type of nonvolatile memory unit. Still digital cameras are characterized by their resolution
in terms of the number of pixels in each captured image. For example, a one-megapixel camera may
have a resolution of 1,280 × 960. Cameras with resolutions of five or more megapixels are very
expensive and usually needed only in sophisticated applications, as a two-megapixel camera can
deliver photo-quality prints of 20 × 25 cm on inkjet printers. Digital cameras can use traditional optical
zoom to zoom in on objects; they may also offer digital zoom, which uses software algorithms to
magnify a particular part of the digital image, reducing image quality in the process. Some still digital
cameras can take short movies of rather low resolution. Digital camcorders are capable of capturing
high-quality video, while webcams are used to capture moving images for tracking, videoconferencing,
and similar applications where image quality and smoothness of motion are not crucial. Photographs
and movies are stored on computers in a variety of standard formats, usually in compressed form to
reduce storage requirements. Common examples include JPEG (Joint Photographic Experts Group)
and GIF (Graphics Interchange Format) for images, and MPEG and QuickTime for movies.

Three-dimensional images can be captured for computer processing using volume scanners. One
method is to project a laser line onto the 3D object being digitized and to capture the image of the
line where it intersects the object using high-resolution cameras. From the captured images and
information about the position and orientation of the scanning head, the surface coordinates of the
object are computed.

Audio input is captured by microphones. Most desktop and laptop computers come with a
microphone, or a stereo pair, and contain a sound card that can capture sounds through a microphone
port. For storage on a computer, sound is digitized by sampling the waveform at regular intervals.
Sound fidelity is improved by increasing the sampling rate and by using more bits to represent each
sample (e.g., 24 bits instead of 16 for professional results). Because these measures increase storage
requirements, sound is usually compressed before storage. Examples of standard formats for
compressed audio include MP3 (short for MPEG-1, layer 3), RealAudio, Shockwave Audio, and the
Microsoft Windows WAV format. MP3 compresses audio files down to about 1 MB per minute while
preserving CD-quality sound, because in the course of compression it removes only sound components
that lie outside the range of human hearing. In addition to these formats, the MPEG and QuickTime
movie formats can be used to store sound by ignoring the video component.
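
As a point of reference, uncompressed CD-quality stereo sound, sampled at 44,100 samples per second
with 16 bits (2 bytes) per sample per channel, requires 44,100 × 2 × 2 × 60 ≅ 10.6 MB of storage per
minute; MP3's roughly 1 MB per minute thus corresponds to a compression ratio of about 10 to 1.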

Sensors are now commonly used to provide computers with information about the environment and
other conditions of interest. For example, a car contains numerous sensors, and a modern industrial
plant many thousands of them. Below is a partial list of commonly used sensors and their applications.

• Photocells are the simplest light sensors, used for controlling night lights, traffic lights, security
systems, cameras, and toys. A photocell incorporates a variable resistance that changes with light
(from many thousands of ohms in the dark to about one kilohm in bright light), making it easy to detect
the amount of light using an analog-to-digital converter.

• Temperature sensors are of two types. A contact sensor measures its own temperature and infers
the temperature of an object with which it is in contact, assuming thermal equilibrium. The
measurement is based on the variation of material properties (such as electrical resistance) with
temperature. Noncontact sensors measure the infrared or optical radiation they receive from an
object.

• Pressure sensors convert deformation or elongation of materials into changes in some measurable
electrical property. For example, a strain gauge consists of a zigzag wire embedded in a plastic
substrate; deformation stretches the wire, changing its electrical resistance slightly.
Microelectromechanical pressure sensors offer greater sensitivity and accuracy. Pressure sensors are
used in piping, engine control, aircraft wings, and so on.

Innovations in sensor design are of immense interest as a result of growing demand in embedded
control, security systems, and military applications. New technologies being pursued include
microelectromechanical (MEM) sensors, particularly for pressure, and biosensors, which incorporate
biological components, either in the detection mechanism or in the phenomenon being detected.
MEM sensors offer greater sensitivity and accuracy, as well as small size. The advantages of biosensors
include low cost, higher sensitivity, and energy economy. They are used in pollution control, bacterial
analysis, medical diagnosis, and mining, among other applications.

Computer image output takes the form of rendering on various types of visual display devices (section
21.3) or printing/plotting on paper (section 21.4). Additionally, images can be transferred directly to
microfilm for archiving or projected onto a screen during audiovisual presentations. The latter is
usually done by connecting a video projector to the display output port of a desktop or laptop
computer or to a special port that provides output in TV format. More exotic graphic outputs include
stereographic images for virtual-reality applications and holographic images.

Audio output is produced from audio files: the sound card converts the stored, possibly compressed,
files into electrical waveforms suitable for driving one or more speakers. In addition to the audio file
formats already mentioned in connection with audio input, sound output can be produced from much
more compact MIDI (Musical Instrument Digital Interface) files. MIDI files do not contain recorded
audio but rather instructions that direct the sound card to produce particular notes for particular
musical instruments. For this reason, MIDI files are usually 100 or more times smaller than comparable
MP3 files, but have much lower fidelity. Speech synthesizers are also an area of great interest, as they
allow a more natural user interface in many contexts. Speech synthesis can be performed by joining
prerecorded fragments or by algorithmic text-to-speech conversion.

It is now common for computers to be used for direct control of a variety of mechanical devices. This
is done through actuators that convert electrical signals to force and motion. Stepper motors are the
most commonly used such components. A stepper motor performs a small rotation upon receiving
one or more control pulses. The rotation can then be converted to linear motion, if desired, or used
directly to move cameras, robotic arms, and many other types of devices. Figure 21.11 shows the
operation of a typical stepper motor. The rotor consists of magnets spaced 60° apart. The stator is
divided into eight sections with 45° spacing. Assume that half of the stator segments are magnetized
as shown in Figure 21.11a. Now, if the remaining half of the stator segments are magnetized as in
Figure 21.11b and the magnetization current is removed from the first set, the rotor rotates by 15°
(the difference between the 60° spacing of the rotor and the 45° spacing of the stator).
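
Thus, with 15° per step, a full revolution of this motor requires 360°/15° = 24 control pulses, and the
rotor position can be set under program control to within one such step.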


There is some hope that, eventually, devices without mechanical parts will allow the construction of
smaller, lighter, sturdier, and lower-power actuators. Muscle-like electroactive polymers, which
expand and contract in response to electrical stimulation, represent one promising technology
currently being pursued [BarC01].

6. Input/output device networks


Increasingly, input and output involve file transfers over a network. For example, sending a document
to a printer that has a large data buffer and a CPU is not much different from sending a file to another
computer. Network-based peripheral devices allow the sharing of specialized or rarely used resources
and also provide backup options in the event of failure or overload of a local resource (Figure 21.12).
For example, in an office setting, a workgroup may share an expensive color laser printer, with another
printer located in a different department designated as a backup in case of breakdown. Networked
I/O also offers flexibility in placing I/O devices according to dynamically changing needs and allows for
upgrade or replacement, while reducing cabling and connectivity issues. Moreover, owing to the
universality of network communication standards, such as IP and Ethernet (section 23.1), networked
I/O improves compatibility and facilitates interoperability.

Nowhere are the benefits of networked I/O more evident than in industrial process control. In such
systems, arrays of hundreds or thousands of sensors, actuators, and controllers must interact with a
central computer or with the nodes of a distributed platform. Using a network with a tree or ring
topology, rather than point-to-point connections, leads to a significant reduction in cabling, with an
accompanying reduction in operation and maintenance costs. The same network can be used to
exchange data, diagnostics, configuration, and calibration information, saving further costs. Even this
reduced level of wiring can be avoided by using wireless networking.

In this context, input/output is thoroughly intermixed with processing and control functions, and the
boundaries between the various functions become blurred. A sensor with a built-in processor and
network adapter, integrated on a single chip using MEM and other technologies, can be viewed either
as an input device or as a node in a distributed processing system (section 28.4). It is an input device
in the sense that its main function and ultimate goal is to provide information about its environment
to a control process or algorithm. It is a node in the sense that it can cooperate with other nodes, run
self-diagnostics, perform self-calibration or mutual calibration, and carry out various data collection
functions traditionally relegated to a central processor. The latter include tasks related to reliable and
accurate operation: filtering noise, retrying transmissions, adapting to changes in the environment,
and so on.


Figure 21.13 shows the structure of a computer-based closed-loop control system with specialized
interfaces for handling sensors and actuators of various types. With network-enabled devices, most
of these interfaces disappear or are incorporated into the sensor/actuator subsystems. Everything
reaches the control computer through its network interface, which simplifies its design. The software
designer for such a computer-based control system can then focus on the relevant algorithmic and
performance requirements rather than on the specific interface needs of each type of device.


PROBLEMS
1) Keyboards and their switches

a) The mechanical switches shown in Figure 21.3a appear to require a large amount of vertical motion
to close the electrical contact. If you examine the keyboard on a modern ultra-thin notebook
computer, you will notice that the keys move very little. Research the design of such small motion
switches.

b) Some keyboards are advertised as "spill-proof". What are the implications of this property on the
design of switches and other parts of a keyboard?

c) Name two other physical properties of a keyboard that you consider important to a user.

d) Mention two negative properties of a keyboard that would lead to user dissatisfaction or rejection.

2) Keypad encoding

Design an encoder for the numeric keypad in Figure 21.3c, assuming that the output should be:

a) A 4-bit hex digit.

b) An 8-bit ASCII character representing the hex digit.

3) Keyboard encoding

a) The keyboard encoding discussion in section 21.2 tacitly assumed that at most one key is pressed
at any given time. Most keyboards support double key presses (for example, an ordinary key plus one
of the special shift, alt, or control keys) to allow a wider set of symbols. Speculate about how encoding
is achieved in this case.

b) On Windows-based machines, at least one triple key press (ctrl + alt + delete) is supported. Can
you think of a good reason to include this feature, which certainly complicates the encoding process?

c) In the scenarios of parts (a) and (b), multiple key presses lead to the transmission of a single symbol
from the keyboard to the computer. The opposite may also hold for some keyboards: multiple symbols
sent as a result of a single keystroke. Why is this feature useful? Describe one way to implement the
required encoding.

4) Pointing devices

Perform the following experiment on the 5 × 6 cm² touchpad of a laptop computer. Move the cursor
to the upper-left corner of the 21 × 28 cm² screen. Place a finger in the upper-left corner of the
touchpad and quickly drag it to the lower-right corner of the pad. The new cursor position is near the
lower-right corner of the screen. Now, move your finger in the opposite direction along the same
diagonal path, but more slowly. When your finger reaches the upper-left corner of the touchpad, the
cursor is near the center of the screen. What does this experiment tell you about how touchpad
position data is processed? Justify your conclusions and discuss their practical implications.


5) Total video memory performance

In example 21.1 the data rate was derived with the assumption that all pixels in the frame buffer
should be updated in each frame.

a) Name three applications where only a very small fraction of pixels change from one frame to the
next.

b) Describe a way to send data to the frame buffer that allows selective updating of pixels or small
windows within the presentation area.

c) Discuss the overhead of the selective transmission scheme of part b), both in terms of the additional
bits sent (beyond the actual pixel data) and the additional processing time needed to extract the pixel
data.

6) Scoreboard

Some monochrome scoreboard displays are built from 2D arrays of light bulbs that can be turned on
and off under programmed control.

a) Describe how such a board of 100 × 250 light bulbs can be controlled by 16 data signals from a microcontroller.

b) Discuss the pros and cons of two ways of arranging the bulbs on the board: the j-th bulb of row
2i + 1 aligned vertically with the j-th bulb of row 2i, versus aligned with the midpoint between the j-th
and (j + 1)-th bulbs of row 2i.

c) Is it feasible to build such a board in color?

7) Adjustable resolution on monitors

The designs and methods described for CRT and flat panel monitors in section 21.3 suggest fixed pixel
spacings. However, on a typical personal computer, the user can set the resolution (number of pixels
on the screen) using a system utility. How are these variations actually implemented? In other words,
the shadow mask openings in Figure 21.6b or the row and column lines in Figure 21.7 do not change.
So, what changes when the resolution is varied?

8) Scanner

The specifications for a scanner indicate that it has a resolution of 600 dpi in the x direction and 1,200
dpi in the y direction.

a) Why do you think the resolutions are different in the two directions?

b) What kind of software processing would allow an image captured by this scanner to be printed on
a 1200 dpi printer?

c) Repeat part (b) for a 600 dpi printer.

9) Total data throughput in a copier


A typical office memo, book page, or other document being copied contains mostly white space.

a) Discuss how this information can be used to reduce the data rate required in a copier, as derived
in example 21.2.

b) Can similar savings be achieved in data rate or storage space when a page (for example, a magazine
page with a black background and white lettering) includes no white space?

10) Printer Technologies

Based on your research on inkjet and laser printers, compare and contrast the two technologies
with respect to the following attributes:

a) Output quality in terms of resolution and contrast.

b) Output quality in terms of durability (lack of erasure over time).

c) Latency and total print throughput.

d) Cost of ink or toner per page printed.

e) Total cost of ownership per page printed.

f) Ease of use, including physical dimensions, noise and heat generation.

11) Drum printers

A type of line printer used many years ago was the drum printer. A large metal drum rotated at high
speed along the paper path. The drum had as many bands as there were character positions on a line;
for example, 132. On each band, the letters and symbols of interest appeared in raised form. There
were also 132 hammers lined up along the length of the drum. Each hammer was individually
controlled and struck the paper, through an inked ribbon, just as the desired character for that position
passed underneath.

a) Why did such a printer tend to smudge characters?

b) How does the printing speed of a drum printer relate to the access latency of disk memory? In
particular, compare it with head-per-track and multihead disks.

c) Analyze the worst-case and average-case latencies for printing a line.

12) Detection and measurement via photocells

One application of photocells in factories is in sorting objects moving on a conveyor belt. Assume that
objects on the conveyor belt do not stack and do not overlap along its length:

a) Specify how a single photocell can be used to classify objects by length. Clearly state all your
assumptions.

b) Propose a scheme for classifying objects whose height is uniform along their length by their heights.

c) Repeat part (b), but this time assume that the height may vary along the object.


d) Propose a scheme for classifying objects into cubes, spheres and pyramids.

13) Gradual speed motors

A stepper motor rotates 15° for each control pulse received. Specify the types of mechanisms you
would need in order to use this motor to control the lateral movement of the print head of a 600 dpi
inkjet printer. Note that the required mechanisms must convert the rotational motion of the motor
into much finer-grained linear motion.

14) Primitive input devices

Rocker switches and jumpers are primitive input mechanisms for devices that lack a keyboard or other
input devices. A rocker switch can set an input bit to 0 or 1. A jumper does the same by connecting
the input to a constant voltage representing 0 or 1. Both are typically used when few bits of input data
are needed and the input changes infrequently.

a) Identify two applications in which such input devices are used.

b) Compare rocker switches and jumpers for ease of use and flexibility.

c) Describe how eight rocker switches and one button can offer a way to input more than eight bits of
data.

15) Special input devices

Research the following topics regarding special input devices and write a two-page report about each.

a) Why the numbers printed at the bottom of most bank checks (including the bank ID number
and account number) have unusual shapes.
b) How the Universal Product Code (UPC), found on most products, encodes data and how it
is read by a scanner at the checkout counter.
c) How 80-column punched cards (known as Hollerith cards) encoded data and how their
storage density, in bits per cubic centimeter, compares with disks and other types of memory today.
d) How the credit card readers found in many stores, ATMs, and gasoline pumps work.
e) How handwritten text, entered by means of a stylus, is recognized and accepted on PDAs
or tablet PCs.
f) How wireless keyboards, mice, and remote controls (e.g., for switching slides in a
presentation) communicate with the computer.

16) Special output devices

Research the following topics regarding special output devices and write a two-page report about
each.

a) How Braille output is produced for blind users.

b) How the small printers built into printing calculators differ from ordinary inkjet or laser printers.


c) How rear-projection presentation devices work.

d) How line-segment displays (e.g., the seven-segment LED displays used for numbers) work and why
they are no longer widely used.


Input/Output Programming
CHAPTER TOPICS

1) I/O Performance and Benchmarks
2) Input/Output Addressing
3) Scheduled I/O: Polling
4) Demand-Based I/O: Interrupts
5) I/O Data Transfer and DMA
6) Improving I/O Performance

Input and output, like other activities within the computer, are controlled by the execution of certain
instructions within a program. While early computers had special I/O instructions, modern machines
treat I/O devices as occupying part of the memory address space (memory-mapped I/O). Therefore,
data input is done by reading from the memory locations that are assigned to specific input devices,
and data output by writing to such locations. In this chapter, after presenting the details of I/O
addressing through examples, we review the complementary polling and interrupt schemes for
synchronizing I/O with the rest of the actions in a program. I/O performance and how it can be
improved are also discussed.

1. I/O Performance and Benchmarks


In addition to the obvious purposes of loading program text and data into memory and recording the
results of computations in hard copy and other forms, input/output may be performed for a variety of
reasons. Examples include the following:

Data collection from sensor networks.

Actuation of robotic arms or other mechanisms in factories.

Querying or updating of databases.

Backup of data sets or documents.

Creation of checkpoints to avoid a complete restart in the event of a crash.

With the rapid improvement in CPU performance over the past two decades, projected to continue
at the same pace in the future, I/O performance has taken center stage. As with the "memory
wall" discussed in section 17.3, modern computers face an "I/O wall" that diminishes, and may
completely nullify, the benefits of improved CPU performance.

Example 22.1: The input/output wall

An industrial control application spent 90% of its time on CPU operations and 10% on I/O when it was
originally developed in the early 1980s. Since then, the CPU component of the system has been
replaced with a newer model every five years, but the I/O components have remained the same.
Assuming that CPU performance increased by a factor of 10 with each upgrade, derive the fraction of
time spent on I/O over the life of the system.

Solution: This requires the application of Amdahl's law, with 90% of the task accelerated by factors of
10, 100, 1,000, and 10,000 over 20 years. The CPU upgrades successively reduced the original running
time from 1 to 0.1 + 0.9/10 = 0.19, then 0.109, 0.1009, and 0.10009, making the fraction of time spent
on input/output operations 100 × (0.1/0.19) = 52.6%, then 91.7, 99.1, and 99.9%, respectively. Note
that the last pair of CPU upgrades does not accomplish much in terms of improved performance.
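
In general, if a fraction f of the original running time goes to I/O and the CPU portion is sped up by a
factor s, the fraction of time spent on I/O becomes

f/(f + (1 − f)/s)

With f = 0.1, this expression approaches 1 as s grows, which is precisely the I/O wall: no amount of
CPU improvement can reduce the time taken by the I/O itself.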

Unlike CPU performance, which can be modeled or estimated based on application and system
characteristics (e.g., instruction mix, cache miss rate), I/O performance is quite difficult to predict.
This is largely due to the many different elements involved in I/O operations: from the operating
system and device drivers, through the various buses and I/O controllers, to the I/O devices
themselves. The interactions of these elements, and their contention in using shared resources such
as memory, add to the difficulty of modeling, even when each individual element (bus, memory, disk)
can be modeled accurately.

I/O performance is measured in a variety of ways, depending on application requirements. I/O access
latency is the time overhead for a single I/O operation, which is important for small I/O transactions
of the type found in e-banking or e-commerce. The I/O data rate, often expressed in megabytes per
second, is relevant to the large data transfers usually of interest in supercomputer applications. The
I/O data transfer time is related to the size of the block to be transferred and the data rate. The
response time is the sum of access latency and data transfer time; its inverse yields the number of I/O
operations per unit of time (total I/O throughput). Response time and total I/O throughput are often
traded off against each other. For example, out-of-order processing of disk accesses makes latency
highly variable, and increases latency in many cases, but can improve overall throughput significantly.

In accordance with the differing I/O characteristics of various application areas, there are three types
of I/O benchmarks:

1. Supercomputer I/O benchmarks focus on reading large volumes of input data, writing many
snapshots as checkpoints during very long computations, and saving a relatively small set of output
results at the end. Here the key parameter is the total I/O data throughput, expressed in megabytes
per second. I/O latency is less important, as long as a high total throughput can be maintained.

2. Transaction processing I/O benchmarks usually deal with a huge database, but each transaction is
quite small, involving a few disk accesses (say, 2 to 10), with a few thousand instructions executed per
disk access. Accordingly, the important parameter in such benchmarks is the I/O rate, expressed in
number of disk accesses per second.

3. File system I/O benchmarks focus on file access (read and write operations), file creation, directory
management, indexing, and restructuring. Given the varied characteristics of files and file accesses in
different applications, such benchmarks are often specialized to a particular application domain (e.g.,
scientific calculations).

Each of these categories can be further refined. For example, transaction processing spans a spectrum
of application domains, from simple web interactions to complex requests involving a large amount
of processing. As with CPU benchmarks, discussed in section 4.4, a mix of features can be incorporated
into a synthetic or real I/O benchmark to assess a system for multidomain or general-purpose use.

2. Input/Output Addressing
In memory-mapped I/O, each input or output device has one or more hardware registers that the
processor can read from or write to as if they were memory locations. Therefore, no special I/O
instructions are required; ordinary load and store instructions serve for I/O. Examples in MiniMIPS
include keyboard input and on-screen output.

As shown in Figure 22.1, the keyboard unit in MiniMIPS has a pair of associated memory locations: a
32-bit control register (address 0xffff0000) and a data register (address 0xffff0004). The control
register holds several status bits that carry information about the device's status and modes of data
transmission. Two bits in the control register are relevant to this discussion: the "device ready" flag R
and the "interrupt enable" flag IE, at bit positions 0 and 1, respectively (the use of R will be discussed
shortly; IE will be needed in Chapter 24). The data register may contain an ASCII symbol in its rightmost
byte. When a key is pressed on the keyboard, the keyboard's logic determines which symbol is to be
sent to the computer, places it in the low byte of the keyboard's data register, and asserts the R bit to
indicate that a symbol is ready for transmission. If the program loads the keyboard control word from
memory location 0xffff0000 into some register, it learns, by examining bit 0, that the keyboard has a
symbol to transmit. It can then load the contents of location 0xffff0004 into a register to obtain the
symbol held in the keyboard data register. Reading the keyboard data register causes R to be
deasserted automatically. Note that R is a read-only bit in the sense that if the processor writes
something to the keyboard control register, the state of R does not change.

Example 22.2: Data input from the keyboard

Write a sequence of instructions in MiniMIPS assembly language to make the program wait until the
keyboard has a symbol to transmit and then read the symbol into register $v0.


Solution: The program must continuously examine the keyboard control register to determine
whether R has been asserted. Eventually, when the R bit is asserted, the idle loop (known as busy
waiting) ends and the symbol in the keyboard data register is loaded into $v0.
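
A minimal sketch of such a busy-wait input loop follows, using the control and data register addresses
of Figure 22.1; the label and the choice of temporary register are ours.

      lui   $t0,0xffff        # put 0xffff0000 (control register address) in $t0
idle: lw    $t1,0($t0)        # read the keyboard control register
      andi  $t1,$t1,0x0001    # isolate the R (device ready) bit, bit 0
      beq   $t1,$zero,idle    # R not asserted: keep checking (busy waiting)
      lw    $v0,4($t0)        # R asserted: load symbol from data register (0xffff0004)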

This type of input is suitable only if the computer expects some critical input and cannot do anything
useful in the absence of this input.

Similarly, the display unit in MiniMIPS has a pair of memory locations associated with its 32-bit control
register (location 0xffff0008) and data register (location 0xffff000c), as shown in Figure 22.1. The
display control register is similar to that of the keyboard, except that its R bit indicates that the display
unit is ready to accept a new symbol in its data register. When a symbol is copied into this data register,
R is deasserted automatically. If the program loads the display control word from memory location
0xffff0008 into some register, it learns, by examining bit 0, whether the display unit is ready to accept
a symbol. The program can then store the contents of a register into location 0xffff000c, thereby
sending a symbol to the display unit.

Example 22.3: Data output to the display unit

Write a sequence of instructions in MiniMIPS assembly language to make the program wait until the
display unit is ready to accept a new symbol, then write the symbol in register $a0 to the display unit's
data register.

Solution: The program must continuously examine the control register of the display unit to determine
whether the R bit has been asserted. Eventually, when the R bit is asserted, the symbol in register $a0
is copied into the data register of the display unit.
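
A minimal sketch of the corresponding busy-wait output loop follows, again using the register
addresses of Figure 22.1; register and label choices are ours.

      lui   $t0,0xffff        # put 0xffff0000 (base of the I/O register area) in $t0
wait: lw    $t1,8($t0)        # read the display control register (0xffff0008)
      andi  $t1,$t1,0x0001    # isolate the R (display ready) bit, bit 0
      beq   $t1,$zero,wait    # R not asserted: keep checking
      sw    $a0,12($t0)       # R asserted: send symbol in $a0 to data register (0xffff000c)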

This type of output is suitable only if it is acceptable to have the CPU dedicated to transmitting data
to the display unit.

The hardware provided in each I/O device controller to enable this type of addressing is quite simple.
As shown in Figure 22.2, the device controller connects to the memory bus. The memory address
observed on the bus is continuously compared against the device address held in an internal register.
If this device address is held in a read-only register, the device has a hard (nonmodifiable) address;
otherwise, the device has a soft (modifiable) address. When an address match is detected, the
contents of the status or data register are placed on (read), or loaded from (write), the bus data lines.


Note that the details discussed in this section matter only if one needs to write device drivers for
general-purpose computers or to perform I/O operations in an application-specific embedded system
environment. Most users are shielded from such details because their I/O operations are performed
at a fairly high level through requests to the operating system. For example, in the SPIM simulator for
MiniMIPS, the user or compiler places an appropriate code in register $v0 (and one or more
parameters in designated registers) and then executes a syscall instruction. For details, see section
7.6, in particular Table 7.2.

3. Scheduled I/O: Polling


Examples 22.2 and 22.3 represent instances of polled I/O: the processor initiates the I/O by asking the
device whether it is ready to send/receive data. With each interaction, a single unit of data is
transferred. If the device is not ready, the processor need not wait in a busy loop; rather, it can perform
other tasks and check back later with the device. No problem arises as long as the processor checks
the I/O device frequently enough to ensure that no data is lost.

Example 22.4: Input via polling of a keyboard

A keyboard must be polled at least ten times per second to ensure that no keystroke is lost. Suppose
that each such poll and associated data transfer takes 800 clock cycles on a processor clocked at 1 GHz.
What fraction of CPU time is spent polling the keyboard?

Solution: The fraction of CPU time spent polling the keyboard is obtained by dividing the number of
cycles required for ten polls by the total number of cycles available in one second:

(10 × 800)/10⁹ ≅ 0.001%

Note that the polling rate of ten times per second, or 600 times per minute, was chosen to exceed
even the speed of the fastest typist, so that there is no chance of losing a character entered into, and
then overwritten in, the keyboard data buffer.
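
In general, the fraction of CPU time consumed by polling a device is

(polling rate × cycles per poll)/(clock rate)

Examples 22.5 and 22.6 apply this same expression to progressively faster devices.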


From example 22.4 it is seen that the keyboard is a slow input device that the processor can easily
keep up with. After each keyboard poll, the processor has about 10⁹/10 − 800 ≅ 10⁸ clock cycles
(0.1 s) to attend to other tasks before it must poll the keyboard again. The following two examples
show that other I/O devices are more demanding in this regard.

Example 22.5: Input via polling of a floppy disk drive

A floppy disk drive sends or receives data four bytes at a time and has a data rate of 50 KB/s. To ensure
that old data is not overwritten by new data in the device buffer during an input operation, the
processor must sample the buffer at a rate of 50K/4 = 12.5K times per second. Assume that each poll
and associated data transfer takes 800 clock cycles on a processor clocked at 1 GHz. What fraction of
CPU time is spent polling the floppy disk drive?

Solution: The fraction of CPU time spent polling the floppy disk drive is obtained by dividing the
number of cycles required for 12.5K polls by the total number of cycles available in one second:

(12.5K × 800)/10⁹ = 1%

Note that the CPU-time overhead for polling the floppy disk drive is a thousand times that of the
keyboard polling in example 22.4.

Example 22.6: Input via polling of a hard disk drive

A hard disk drive transfers data four bytes at a time and has a peak data rate of 3 MB/s. To ensure
that old data is not overwritten by new data in the device buffer during an input operation, the
processor must sample the buffer at a rate of 3M/4 = 750K times per second. Assume that each poll
and associated data transfer takes 800 clock cycles on a processor clocked at 1 GHz. What fraction of
CPU time is spent polling the hard disk drive?

Solution: The fraction of CPU time spent polling the hard disk drive is obtained by dividing the number
of cycles required for 750K polls by the total number of cycles available in one second:

(750K × 800)/10⁹ = 60%

A single I/O device that takes up 60% of the CPU's time is unacceptable.

Note that the hard disk in example 22.6 keeps the CPU almost completely occupied: between two
consecutive polls, the CPU has only 10⁹/750,000 − 800 ≅ 533 clock cycles (about 0.5 μs) to handle
other tasks. This may not be enough to do much useful work. The floppy disk drive of example 22.5
is intermediate between the very slow keyboard and the very fast hard disk: it leaves the CPU
10⁹/12,500 − 800 = 79,200 clock cycles (about 79 μs) of available time between consecutive polls.


4. Demand-Based I/O: Interrupts


A lot of CPU time is wasted on polling when the device is completely idle or not yet ready to send or
receive data. Even slow devices can cause unacceptable overhead when many of them must be polled
continuously. Polling hundreds or thousands of sensors and actuators in an industrial control setting
can be very wasteful. For example, a temperature sensor need not send data to the computer unless
a change exceeding a predefined threshold occurs or a certain critical temperature is reached. In
interrupt-driven I/O, the device initiates the I/O by sending an interrupt signal to the CPU. Upon
receiving an interrupt signal, the CPU transfers control to a special interrupt routine that determines
the cause of the interrupt and initiates appropriate action.

Example 22.7: Interrupt-driven input from a hard disk drive

Consider the same disk as in example 22.6 (transferring 4 B "chunks" of data at 3 MB/s when active)
and assume that the disk is active about 5% of the time. The overhead of interrupting the CPU and
performing the transfer is about 1,200 clock cycles of a 1 GHz processor. What fraction of CPU time
is spent serving the hard disk drive?

Solution: From example 22.6 it is known that 750K interrupts per second will occur when the disk drive
is active, given the disk data rate and data transmission in 4 B "chunks." The fraction of CPU time spent
on hard disk interrupts is obtained by dividing the number of cycles required to serve the expected
number of interrupts per second by the total number of cycles available in one second:

0.05 × (750K × 1,200)/10⁹ = 4.5%

Thus, even though the per-transfer overhead of interrupt-driven I/O is higher while the disk is active,
because the disk is usually idle, the overall CPU time spent on I/O is much lower.
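
In general, the fraction of CPU time consumed by interrupt-driven I/O is

(fraction of time the device is active) × (interrupt rate × cycles per interrupt)/(clock rate)

Unlike polling, the device's idle periods cost the CPU nothing.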

Interrupts are controlled by status bits on the device (requester) side and on the CPU (service provider)
side. The "interrupt enable" (IE) bit in the control registers of Figure 22.1 tells the device to send an
interrupt signal to the CPU whenever the "device ready" bit R is asserted. The CPU side has a
corresponding flag indicating whether interrupts from the keyboard, display unit, or other devices will
be accepted. If interrupts are enabled, the interrupt signal is recognized and a protocol is entered that
leads to data being accepted from, or sent to, the requesting device. Interrupts that are not enabled
are said to be masked. Usually, interrupts are masked or disabled only for short periods when the CPU
must perform critical functions without interruption.

Upon detecting an interrupt signal, provided that the particular interrupt or interrupt class is not
masked, the CPU acknowledges the interrupt (so that the device can deassert its request signal) and
begins executing an interrupt service routine. To serve an interrupt request, the following procedure
is performed:

1. Save the CPU state and call the interrupt service routine.

2. Disable all interrupts.

3. Save minimal information about the interruption in the stack.


4. Enable interrupts (or at least the highest priority ones).

5. Identify the cause of the interrupt and attend to the underlying request.

6. Restore the state of the CPU to what existed before the last interrupt.

7. Return from the interrupt service routine.

The overhead of each interrupt request is larger than that of polling, because the steps required for
polling correspond to only a part of step 5; the other steps, including identifying the cause of the
interrupt in step 5, are not needed with polling. Note that interrupts are re-enabled (step 4) before
the handling of the current interrupt in step 5 is complete. This is because disabling interrupts for
extended periods can lead to data loss or inefficiencies in input/output. For example, a disk read/write
head may pass the desired sector while interrupts are disabled, forcing at least one additional
revolution, with its accompanying latency. The ability to handle nested interrupts is important when
dealing with multiple high-speed I/O devices or time-critical control applications. Interrupts will be
discussed in more detail in Chapter 24.

5. I/O Data Transfer and DMA


Note that if data were transferred between main memory and the I/O device in larger "chunks," rather
than the 1-4 byte units assumed in examples 22.4 through 22.7, the interrupt overhead would be
reduced and less CPU time would be wasted. The ultimate step in this direction is to let the CPU merely
initiate an I/O operation (on its own initiative or upon receiving an interrupt) and have an intelligent
I/O controller copy the data from an input device to memory or from memory to an output device.
This approach is called DMA (direct memory access) input/output.

A DMA controller is a simple processor that can acquire control of the CPU's memory bus and then
act as the CPU would to control data transfers between I/O devices and memory. The CPU interacts
with the DMA controller largely as it does with an I/O device, except that only control information is
exchanged. Like the device controller in Figure 22.2, the DMA controller has a status register that is
used to communicate control information with the CPU. Three other registers, which the CPU loads
to set up a transfer (as sketched after the list below), replace the device data register of Figure 22.2:


1. Source-address register, where the CPU places the address of the data source.

2. Destination-address register, which holds the destination address for the transfer.

3. Length register, where the CPU places the number of data words to be transferred.
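
As an illustration, the following MiniMIPS-style sketch shows how a CPU might program such a
controller. MiniMIPS does not define a DMA controller, so the register addresses (0xffff0020 through
0xffff002c), the buffer labels, and the convention of starting the transfer by writing a "go" command
into the status register are all hypothetical.

      # hypothetical memory-mapped DMA registers: status at 0xffff0020, source
      # at 0xffff0024, destination at 0xffff0028, length at 0xffff002c
      lui   $t0,0xffff        # base address of the assumed DMA register block
      la    $t1,srcbuf        # address of the (hypothetical) source buffer
      sw    $t1,36($t0)       # load the source-address register (0xffff0024)
      la    $t2,dstbuf        # address of the (hypothetical) destination buffer
      sw    $t2,40($t0)       # load the destination-address register (0xffff0028)
      addi  $t3,$zero,128     # transfer length: 128 words
      sw    $t3,44($t0)       # load the length register (0xffff002c)
      addi  $t4,$zero,1       # assumed encoding of the "go" command
      sw    $t4,32($t0)       # write the status register (0xffff0020) to start the transfer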

Figure 22.3 shows the relationship of the DMA controller to the CPU, memory, and device controllers,
assuming the use of a single system bus for both memory and I/O access. In practice, most I/O devices
connect to separate buses that are linked to the memory bus through adapters or bridges (Figure
21.2). However, in this discussion of DMA, we proceed under the assumption of a single system bus,
as in Figure 21.1.

Based on the information placed by the CPU in the various registers of a DMA controller, the latter
can take control of the bus and perform the steps of the prescribed data transfer, from an input device
to memory or from memory to an output device, much as the CPU would. Note that the use of DMA
is compatible with both memory-mapped I/O and special I/O instructions. In fact, memory and I/O
devices cannot tell whether the CPU or the DMA controller is handling bus transactions, because the
same bus protocol is followed in either case.

Normally the CPU is in charge of the memory bus and assumes that the bus is available for its use at
any time. Therefore, before the DMA controller uses the bus, it asserts a bus request signal that
informs the CPU of its intent. The bus control circuits in the CPU observe this request and, when
convenient for the CPU, assert a bus grant signal. The DMA controller can then take over the bus and
perform the requested data transfer. When the data transfer is complete, the DMA controller
deasserts the bus request signal and the CPU, in turn, deasserts the bus grant signal, putting the bus
back under CPU control (Figure 22.4a). Thus, both the bus request and bus grant signals remain
asserted throughout the DMA controller's use of the bus.

During a DMA transfer, the CPU can continue to execute instructions, as long as doing so does not
involve data transfers over the memory bus. This is quite feasible with modern CPUs, given the high
hit rate of the caches (L1 and L2) usually built onto the same chip as the CPU. If, because of cache
misses or some other event, the CPU needs to use the memory bus before the DMA transfer is
complete, it must wait until the DMA controller releases the bus (Figure 22.4a).

To prevent the CPU from sitting idle for a long time, with the attendant performance degradation,
most DMA controllers are designed to break a long I/O data transfer into many shorter transfers
involving a few words each. When the DMA controller acquires the bus, it transfers a preset number
of words before relinquishing the bus. After a short delay, the DMA controller requests the bus again
for the next stage of its data transfer. Meanwhile, the CPU regains control of the bus and can use it to
fetch from memory the words that allow its operation to continue. This block-transfer, or
cycle-stealing, mode of DMA data transfer is shown in Figure 22.4b.

Example 22.8: DMA-based input from a hard disk drive

Consider the hard disk drive of examples 22.6 and 22.7 (with a peak data rate of 3 MB/s, active 5% of
the time). The disk has 512 B sectors and rotates at 3,600 rpm. Ignore gaps and other overheads in
data storage on the disk tracks. How does DMA transfer, in terms of time overhead for the 1 GHz CPU,
compare with the polling and interrupt-based I/O of examples 22.6 and 22.7? Assume that 800 clock
cycles are needed to set up a DMA data transfer and 1,200 clock cycles to handle the interrupt
generated at its conclusion, and that each data transfer involves (a) one sector or (b) an entire track.

Solution: The disk's track capacity can be obtained from the disk data rate and its rotation speed:
(3 MB/s)/(60 revolutions/s) = 0.05 MB/revolution, or a track capacity of 50 KB = 100 sectors. The CPU
spends 800 + 1,200 = 2,000 clock cycles, or 2 μs, to set up each data transfer and to process its
completion. Consider single-sector data transfers first. If the disk actively transfers data 5% of the
time, it must read 0.05 × 3 MB/s = 150 KB/s = 300 sectors/s. The time overhead for the CPU is
therefore 2 μs × 300 = 0.6 ms per second (0.06%). This compares very favorably with the CPU
overheads of polling (60%) and interrupt-driven word-by-word transfers (4.5%) derived in examples
22.6 and 22.7. If entire tracks are transferred, and disk activity remains at 5%, there will be a factor
of 100 fewer data transfers, leading to a reduction by the same factor in CPU overhead. Note that the
repeated handing over of bus control between the CPU and the DMA controller involves some
overhead that was ignored in this example. This overhead is most significant when each data transfer
is performed in the cycle-stealing manner shown in Figure 22.4b. Even so, it is not large enough to
seriously diminish the benefits of DMA-based I/O.

Note that it is possible to have more than one DMA controller sharing the memory bus with the CPU.
In such a case, as well as when there are multiple CPUs, a bus arbitration mechanism takes the place
of the request-grant signaling between a single DMA controller and a single CPU. Bus arbitration will
be discussed in section 23.4.

Despite the clear performance advantage of DMA-based I/O, the method is not free of problems and
pitfalls. For example, implementation problems arise when DMA interactions involve locations in the
memory address space: problems related to using physical or virtual addresses to specify the source
or destination of transfers, and to the part of the memory hierarchy from (to) which data is taken
(copied). Here is a list of issues that must be considered.


1. DMA using physical addresses: Because consecutive virtual pages are not necessarily contiguous in
main memory, a multipage transfer cannot be specified with a base address and a length; therefore,
longer data transfers must be broken into multiple transfers, each within a single page.

2. DMA using virtual addresses: The DMA controller needs address translation capability.

3. DMA writing to, or reading from, main memory: The CPU may later access stale data in its cache,
or the DMA controller may read memory data that is out of date relative to the cache.

4. DMA through the cache: I/O data can displace data actively used by the CPU, leading to
performance degradation.

In any case, the cooperation of the operating system is needed to ensure that, once an address has
been supplied to the DMA controller, it remains valid throughout the data transfer (i.e., pages are not
moved in main memory).

It is prudent to reiterate that all these I/O details are usually not visible to user programs, because
they initiate I/O through system requests.

6. Improving I/O Performance


As with most books on computer architecture, much of this book is devoted to methods for designing
faster CPUs and for bridging the CPU-memory speed gap. However, improving CPU and memory speed
without also addressing I/O issues would be of limited value. It is not uncommon for an application to
spend more time on I/O than on computation. Even when I/O does not dominate the running time of
a particular application, Amdahl's law reminds us that the expected performance improvement, if I/O
is ignored, is rather limited. Therefore, it is important to recognize I/O problems and to be aware of
methods for dealing with them. As discussed in section 22.1, input/output performance is measured
by several parameters that are not independent of one another and can often be traded off against
each other. They are:

Access latency (time overhead of a single I/O operation).

Data transfer rate (measured in megabytes per second).

Response time (time elapsed from request to completion).

Total throughput (number of I/O operations per second).

Accordingly, this section discusses methods for improving each of these facets of I/O performance.
The methods cover a wide range of options, from tuning application programs for reduced I/O activity
to providing architectural enhancements or hardware aids for low-overhead I/O handling.

Access latency represents the total time overhead for an I/O operation. It includes obvious and
well-understood delays, such as disk seek time or rotational latency, as well as other overheads that
are less obvious but can be very significant. Such overheads accumulate. An example of how the
combined overhead of many layers of hardware and software can dominate I/O is a remote login
session in which characters typed at a local site must be transmitted to a remote system. The number
of events and layers involved in transmitting a character or command from the local site to the remote
computer is stunning [Silb02]:

Sending side (local): character typed → interrupt generated → context switch → interrupt handler →
keyboard device driver → operating system kernel → context switch → user process → system call
for output → context switch → operating system kernel → network device driver → network adapter.

Receiving side (remote): packet received → network adapter → interrupt generated → context switch
→ interrupt handler → device driver → character extraction → operating system kernel → context
switch → network daemon → context switch → operating system kernel → context switch → network
subdaemon → interrupt generated → context switch → interrupt handler → context switch → system
call completion.

On the receiving side, a system program called the network daemon handles the identification and
routing of incoming network data, and another program, the subdaemon, is dedicated to handling
network I/O for a specific login session. Now, if the receiver must echo the character back to the
sender, the whole process is repeated in the reverse direction! Improvement in any of the steps or
agents just listed can lead to better I/O performance. For example, hardware aids for low-overhead
context switching (discussed in section 24.5) can have a significant effect, given that "context switch"
appears eight times in the lists above.

Even if the access latency itself cannot be decreased, an equivalent effect can be achieved by reducing
the number of I/O operations through increasing the size of each operation. Reading the same amount
of data through a smaller number of I/O requests has the effect of reducing the effective access latency
per megabyte of data read.


Example 22.9: Effective disk I/O bandwidth

Consider a hard disk drive with 512 B sectors, an average access latency of 10 ms (including software
overhead), and a peak total throughput of 10 MB/s. Plot the variation in effective I/O bandwidth as
the unit of data transfer (block) ranges in size from 1 sector (0.5 KB) to 1,024 sectors (512 KB). Ignore
all data recording overheads and intersector gaps.

Solution: When one sector is transferred per disk access, the data transfer time of 0.5/10 = 0.05 ms
(transfer amount in kilobytes divided by transfer rate in kilobytes per millisecond), added to the access
latency of 10 ms, yields a total I/O time of 10.05 ms. This corresponds to an effective bandwidth of
0.5/10.05 ≅ 0.05 MB/s. Similar calculations for block sizes of 10, 20, 50, 100, 200, 300, 400, and 500
KB produce the trend shown in Figure 22.5. Note that the effective bandwidth improves rather rapidly
as the block size increases, reaching half the theoretical peak value at a block size of 100 KB. Beyond
that point, the improvement is less significant, suggesting that it may not be worthwhile to increase
the block size further, given the risk of transferring data that is not actually needed.
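
In general, for block size b, access latency L, and peak rate R, the effective bandwidth is

b/(L + b/R)

which reaches half of R exactly when the two terms of the denominator are equal, that is, when
b = L × R; for L = 10 ms and R = 10 MB/s, this gives the 100 KB block size noted above.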

The I/O data transfer rate represents the amount of data that can be transferred between a device
and memory per unit of time. The peak transfer rates quoted for various I/O devices are usually not
achieved in practice. This is due, in part, to the fact that devices can have variable rates; for example,
the transfer rates of the innermost and outermost tracks of a disk may differ. More importantly, the
actual transfer rate is a function not only of the device's ability to send or receive the data but also of
the ability of a host of intermediate resources (controllers, buses, bridges or bus adapters, etc.) to
relay the data. Note that even though a bus may be fast enough to handle the peak data rate of a disk
when disk-to-memory traffic is the only kind present, sharing the bus with other ongoing activities can
reduce its ability to cope with disk I/O.

Response time is defined as the time elapsed from the issuance of an I/O request to its conclusion. In
the simplest case, when an I/O device is dedicated to a single application or user, the response time is
the sum of I/O access latency and data transfer time (amount of data transferred divided by transfer
rate); in this case, the response time is not an independent measure but is derivable from the two
previous parameters. However, a non-negligible queuing delay may be added, whose value depends
on the current mix of pending I/O operations at various points in the I/O transfer path. Queuing delays
are difficult to assess but form the focus of an area of study known as queuing theory, whose discussion
is beyond the scope of this book. Suffice it to say that queues can cause unexpected, and sometimes
counterintuitive, I/O behavior. A good analogy is provided by the queue of customers at a bank. If a
queue of ten customers accumulates during the lunch hour, even if the bank's total service throughput
is comparable to the rate of arrival of new customers, the long queue can persist throughout the
afternoon. In other words, even though total throughput is not an issue in this case, the response time
is very bad. This type of queuing effect can be experienced, for example, in network communication
during periods of peak traffic.

Another factor that adversely affects I/O latency is the repeated copying of I/O data from one
memory area to another as a result of the multiple layers of software involved in handling an I/O
request. For example, operating systems usually copy output data from user space into kernel space
before sending it to the device. Among other things, this approach allows the user to continue
modifying the data while the output operation is in progress. This is what happens, for example, when
you issue a print command from a word processor and then continue to make modifications to the
page being printed, even before the print request has completed. Something similar happens in the
input direction, for other reasons such as data integrity checking. In systems or applications where
I/O performance is critically important, some of this copying can be avoided; for instance, the
operating system might copy the data to be printed only if the user actually tries to modify it before
printing is complete.

Total I/O throughput, or the number of I/O operations performed per unit of time, is of interest when
each I/O operation is fairly simple but many such operations must be performed in quick succession.
In most such applications, latency is also a concern (for example, in a network of bank ATMs linked to
a central computing facility), leading to challenging scheduling problems. This situation arises because
high total throughput dictates the use of queues for optimal use of resources, while low latency
demands that each request be serviced as soon as possible.

Note that many of the problems and challenges described here arise from the desire to hide the
complexities of I/O from the user and to make application and system programs more independent
of I/O device technology. These goals have been achieved by introducing software layers that shield
users from hardware device peculiarities. Removing some of these layers of complexity is one way to
improve latency and total I/O throughput. This is done routinely, for example, in gaming consoles,
where the high total throughput required for sophisticated graphics output demands direct
interaction between the graphics hardware and the display unit, with minimal overhead. In fact, the
designers of such gaming consoles have been so successful in fine-tuning their graphics processing for
maximum performance that the idea of using a large number of such consoles to build a
supercomputer has become the goal of more than one research team.

To achieve similarly low-overhead I/O in general-purpose systems, both the number of layers
involved in performing I/O and the latency of each layer must be reduced. The ultimate in layer
reduction is to give users direct control of I/O devices for certain critical I/O activities. To reduce the
latency within each layer, routines can be specialized to avoid operations that do not apply to a
particular I/O device, and the amount of data copying can be reduced as much as possible. At the
hardware level, providing devices with more intelligent interfaces and the ability to communicate
over a switch-based network, as opposed to multiple bus layers and associated adapters, is the
driving force behind the new InfiniBand I/O standard. As shown in Figure 22.6, this scheme allows
I/O devices to communicate with one another directly, without placing a load on the CPU or even on
its buses.


PROBLEMS
1) Input/output via busy waiting

(a) Will the program fragment in Example 22.2 work correctly if beq is a delayed branch with one
delay slot? If not, modify the instruction sequence to make it work. If it does work, explain why.

(b) Repeat part (a) for Example 22.3.

2) Command-line data input and echo

Convert the instruction sequences in Examples 22.2 and 22.3 into MiniMIPS procedures. Then write a
MiniMIPS instruction sequence that causes the characters typed on the keyboard to be stored in
memory and also displayed on the screen, until the Enter key is pressed. At that point, a null character
is appended to the end of the stored string, and the null-terminated string is passed to a procedure
for appropriate analysis and action.

3) Auto-repeat on a keyboard

On most computers, if you press a key on the keyboard and hold it down, the character repeats.
Thus, holding down the hyphen or underscore key creates a dotted or a solid line, respectively.
The repetition rate is determined by a user-adjustable parameter. Discuss how this aspect of keyboard
data input can be handled in the context of Example 22.2.

4) Device status records


In the device control (status) registers in Figure 22.1:

a) Suggest possible uses for at least two other bits in the keyboard control register (in addition to R
and IE).

b) Repeat part (a) for the display control register.

c) Repeat part (a) for a third I/O device that you choose and describe.

5) Logical addressing for I/O devices

In the I/O scheme shown in Figure 22.2:

a) What happens if two different input devices are assigned to the same device address?

b) Repeat part (a) for output devices.

c) Is it possible to connect two or more different keyboards to the bus? Can you think of a situation
where this ability might be useful?

d) Repeat part (c) for display units.

6) Polling a buffered keyboard

Under the assumption that two successive key presses on a keyboard can never be less than 0.1
s apart, a polling rate of 10/s, as in Example 22.4, is adequate. Suppose it is also known
that no more than ten key presses can occur in any 2 s interval. By providing a buffer on the
keyboard, the polling rate can be reduced while still capturing all key presses.

a) Determine the proper size and organization of the keyboard buffer.

b) Propose a hardware realization of the keyboard buffer.

c) What is the proper polling rate with the buffer of part a)?

d) How can the CPU ensure that the polling rate in part c) is met?

e) Under what conditions would a larger buffer help reduce the polling rate even further?

7) Polling a hard drive

Consider polling a disk drive, with the disk and CPU characteristics given in Example 22.6.

a) Determine the minimum CPU clock frequency if the CPU is to keep up with the hard drive.

b) Would providing a buffer for the disk drive, as was done for the keyboard in Problem 6, allow the
CPU to poll less frequently? Justify your answer.

c) What changes to the assumptions would allow polling to be achieved with a 500 MHz CPU?

d) Would a 2 GHz CPU be capable of polling two hard drives? What about four?

8) Disk I/O through interrupts


According to Example 22.7, and the assumptions mentioned therein, a disk drive uses less than 5%
of CPU time when interrupts are used for input and output.

a) Explain why this result does not imply that one CPU can handle I/O to or from 20 disks.

b) Explain why even two hard drives may not be handled properly unless special provisions, which
you should describe, are made.

9) DMA Data Transfers

Section 22.5 noted that DMA controllers with the ability to transfer data in short bursts, returning
bus control to the CPU after each partial transfer, offer the benefit that the CPU never has
to wait long to use the bus.

a) Can you think of any disadvantages to this scheme?

b) What factors influence the choice of burst length for partial transfers? What are the trade-offs?

c) Repeat part (b) for the case in which two DMA controllers are attached to the same bus.

10) Effects of DMA on CPU performance

a) The operation of a DMA channel, and its effects on CPU performance, can be modeled as follows.
Time is measured in bus cycles, and all overheads, including arbitration delay, are ignored. When
the DMA channel is not using the bus during a given bus cycle, it can become active with probability
0.1 in the following bus cycle. Upon activation, it acquires and uses the bus for 12 cycles before
releasing it. What is the long-term probability that the CPU can use the bus in any given cycle? What
is the expected waiting time for the CPU?

b) Repeat part (a), this time assuming that the DMA channel releases the bus for one cycle after every
four cycles of use; thus, the 12 cycles of use occur in three four-cycle periods, separated by two free
cycles.

11) Effective I/O bandwidth

This problem is a continuation of Example 22.9.

a) Plot the variation in throughput versus block size (Figure 22.5), using a logarithmic scale for block
size on the horizontal axis.

b) Repeat part (a), this time with logarithmic scales on both axes.

c) Do the graphs in parts a) and b) provide any additional information about the choice of block size?
Discuss.

12) Effective I/O bandwidth

Example 22.9 is a bit unrealistic in that it associates an average access latency of 10 ms with every
disk access. In reality, if reading larger blocks of data is to make any sense, accesses to smaller blocks
must involve more locality. Also, a truly random access may have a latency higher than the 10 ms
average measured over all accesses. Discuss how these considerations can be accommodated in a
more realistic analysis. State all your assumptions clearly and quantify the effect of block size on
effective I/O throughput, as in Figure 22.5.

13) Distributed I/O via InfiniBand

Prepare a five-page report on the InfiniBand distributed I/O scheme, with particular focus on the
structure of the host channel adapter and the switching elements (shown in Figure 22.6). Pay special
attention to how the two elements differ in structure and function.

14) I/O Performance

A banking application is an example of a transaction processing workload. Focus only on ATM
transactions made by bank customers, and ignore all other transactions initiated by bank staff (e.g.,
check processing, account opening, miscellaneous updates). In a simplified view, each ATM
transaction involves three disk accesses (authentication, reading account data, and updating the
account).

a) For each of the disk memories listed in Table 19.1, estimate the maximum number of
transactions that can be supported per second (ignore the unsuitability of very small disks for this
type of application).

b) If 50% of total throughput is to be held in reserve for future expansion, and 40% of disk bandwidth
is wasted on administrative overhead, how many of each disk drive considered in part (a) would be
needed to support a total throughput of 200 transactions per second?

c) What are some reasons why a real ATM banking transaction might require more than three
disk accesses?

d) How do the additional accesses of part (c) affect total transaction processing throughput?

15) I/O Performance

A computer uses a 1 GB/s I/O bus that can accommodate up to ten disk controllers, each capable of
connecting to ten disk drives. Assume that interleaved main memory is not a bottleneck, in the sense
that it can supply or accept data at the peak rate of the I/O bus. Each disk controller adds 0.5 ms of
latency to an I/O operation and has a maximum data rate of 200 MB/s. Consider the disk memories
cited in Table 19.1 and ignore the improbability that very small disks would be used for large
storage capacities. Calculate the maximum I/O rate achievable on this system with each disk
type. Assume that each I/O operation involves a single randomly chosen sector. State all your
assumptions clearly.

16) Queuing theory

A disk controller that receives requests for disk accesses can be modeled by a queue that holds the
requests and a server that satisfies them in FIFO order. Each incoming request spends some
time in the queue and then requires some service time for the actual access. Assume that the disk
can perform an average of one access every 10 ms; accordingly, the service time is taken to be
Tservice = 10 ms. Let u denote the disk utilization; for example, u = 0.5 if the disk performs 50 accesses
per second (50 × 10 ms = 500 ms of busy time per second). Then, assuming random request arrivals
with a Poisson distribution, queuing theory tells us that the time an access request spends waiting in
the queue is Tqueue = 10u/(1 − u) ms, where the factor 10 is the service time in milliseconds. For
example, if u = 0.5, each request spends an average of 10 ms in the queue and 10 ms in actual access,
for a total average latency of 20 ms. (A short computational sketch of this model follows the problem.)

(a) What is the disk utilization if, on average, 80 I/O requests are issued to the disk per second?

b) Based on the answer to part (a), calculate the average total latency for each disk access request.

c) Repeat part (a), but assume that the disk is replaced with a faster one having an average access
(service) time of 5 ms.

d) Repeat part (b) based on the answer to part (c).

e) Calculate the speedup of part (d) over part (b) (i.e., the performance impact of the faster disk).
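The following Python sketch (our illustration, not part of the original problem set) implements the queuing model described in this problem; the worked example from the statement, u = 0.5 giving 20 ms total latency, serves as a check.

# M/M/1-style disk latency model from the problem statement:
# Tqueue = Tservice * u / (1 - u), with u = arrival_rate * Tservice.
def disk_latency(arrivals_per_sec, service_ms=10.0):
    """Return (utilization, queue time, total latency); times in ms."""
    u = arrivals_per_sec * service_ms / 1000.0   # utilization (must be < 1)
    assert u < 1.0, "disk saturated: queue grows without bound"
    t_queue = service_ms * u / (1.0 - u)         # average wait in the queue
    return u, t_queue, t_queue + service_ms      # total = queue + service

print(disk_latency(50))   # (0.5, 10.0, 20.0): matches the worked example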

17) Queuing theory

Little's law in queuing theory states that, under equilibrium conditions, the average number of tasks
in a system is the product of the arrival rate and the average response time. For example, if customers
of a bank arrive, on average, one every two minutes (an arrival rate of 0.5/min) and a customer spends
an average of ten minutes at the bank, then on average there will be 0.5 × 10 = 5 customers inside
the bank. Based on the explanation in Problem 16, and assuming a single teller at the bank, a
service time s for each customer implies Tqueue = su/(1 − u), where u, the teller's utilization, is 0.5s
(i.e., the product of the arrival rate and the customer service time).

a) Based on this information, what is the service time for each customer?

b) What is the expected queue length (number of customers waiting) at the bank?

c) Relate this bank example to disk input/output.

d) Discuss the impact of having one teller who works twice as fast as the original single teller versus
having two tellers of the original speed. Exact calculation of the impact of the second option is
complicated; instead, present an intuitive argument for the relative merits of each option.

e) What would be the disk I/O counterpart of having more than one teller at the bank?


BUSES, LINKS AND INTERFACES


CHAPTER TOPICS

1) Intra- and intersystem links


2) Buses and their appeal
3) Bus communication protocols
4) Bus arbitration and performance
5) Basics of interfacing
6) Interfacing standards

It is common to use shared links, or buses, to interconnect the many subsystems of a computer. This
not only leads to fewer wires and pins, improving cost-effectiveness and reliability, but also
contributes to flexibility and expandability. This chapter reviews such shared connections and two key
issues in their design and use: arbitration, to ensure that only one sender puts data on the bus at any
given time; and synchronization, to help verify that the receiver sees the same data values transmitted
by the sender. Interfacing peripheral devices to computers involves complex issues of signal detection
and reconstruction. For this reason, many standard interfaces have been developed that fully define
signal parameters and exchange protocols. A peripheral device that produces or receives signals in
accordance with one of the many standard bus/link protocols can be connected to a computer and
used with minimal effort.

1. Intra- and intersystem links


Intra-system connectivity is established primarily by thin metal layers deposited on, and separated by,
insulating layers on microchips or printed circuit boards. Such links are either point-to-point,
connecting one terminal of a particular subsystem to one terminal of another, or shared buses that
allow multiple units to communicate through them. Given the importance of buses in system
implementation and performance, they are discussed separately in sections 23.2 through 23.4.
Current electronic manufacturing processes allow multiple metal layers at the chip and board levels,
permitting fairly complex connectivity (Figure 23.1). The wires are "deposited" layer by layer, along
with the rest of the processing steps. For example, to deposit two parallel wires in one layer, two
trenches are etched where the wires should appear (Figure 23.1a). After the surface is coated with a
thin layer of insulator, a layer of copper is deposited, filling the trenches and covering the rest of the
surface. Polishing the surface then removes the excess copper, except in the trenches, where the
wires have been formed.

When components need to be removable and interchangeable, they are linked to the rest
of the system through sockets into which they can be plugged, or through special cables with standard
connectors at the two ends. Some of the relevant standards in this regard are discussed in section 23.6.

Beyond the circuit board boundary, wires, cables, and optical fibers are used for communication,
depending on cost and bandwidth requirements. Figure 23.2 shows some of these options.
Intersystem connectivity takes a variety of forms, ranging from short wires linking nearby units to local
and wide area networks.


The required interconnection structures are characterized by:


Distance/span: a few meters to thousands of kilometers

Data rate: a few kilobits per second to many gigabits per second

Topology: bus, star, ring, tree, etc.

Line sharing: point-to-point (dedicated) or multidrop (shared)

To get an idea about the characteristics of interconnections for different distances and data rates, the
remainder of this section (Figure 23.2) describes three examples of intersystem interconnectivity
standards. Table 23.1 contains a summary of the most important characteristics.

Since its inception in the early 1960s as an EIA (Electronic Industries Association) standard,
RS-232 has been a widely used scheme for connecting external equipment, such as terminals, to
computers. It is a serial method in which transmission occurs bit by bit over a single data line,
supported by several other lines for various control functions, including handshaking. There are
actually three wires to support full-duplex serial data transmission: a data wire for each
direction, plus a shared ground. The full RS-232 connector has 25 pins, but most PCs use a reduced
nine-pin version, shown in Figure 23.3. The four control signals (CTS, RTS, DTR, DSR) allow
handshaking between the transmitter and receiver of data (section 23.3). With RS-232, eight-bit plain
or parity-encoded characters are transmitted, along with three additional bits, as the string
0dddddddp11, where the leftmost 0 is the start bit, d represents a data bit, p is the parity bit or an
eighth data bit, and the two 1s on the right are the stop bits. Transmitted characters can be separated
by 1-valued idle bits that follow the two stop bits. As a consequence of its short range (tens of meters)
and relatively low data rate (0.3-19.2 Kb/s), RS-232 has assumed the status of a legacy standard in
recent years, meaning that it is supported only for compatibility with older devices and systems.
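As a concrete illustration of this framing, the following Python sketch (our own, assuming even parity and the 0dddddddp11 format just described, with the least significant data bit transmitted first) builds the bit sequence for one character:

# Frame a 7-bit character RS-232-style: start bit (0), seven data bits,
# even-parity bit, and two stop bits (1) -- the string 0dddddddp11.
def frame_char(ch):
    code = ord(ch) & 0x7F                          # 7-bit character code
    data = [(code >> i) & 1 for i in range(7)]     # LSB transmitted first
    parity = sum(data) % 2                         # even parity over data bits
    return [0] + data + [parity] + [1, 1]

print(frame_char('A'))   # 'A' = 1000001 -> [0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1]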

Ethernet is the most widely used LAN (local area network) standard. In its various forms,
Ethernet supports transmission rates of 10 Mb/s, 100 Mb/s, and, more recently, 1 Gb/s. The
transmission medium may be a twisted pair of wires, coaxial cable, or optical fiber (Figure 23.4).
Multiple devices or nodes can be connected to an Ethernet, which supports transmission by a single
sender at any given time, with a latency on the order of 1 ms, using packets or frames whose sizes
are in the hundreds of bytes. Each node connected to Ethernet has a transceiver
(transmitter/receiver) capable of sending messages and monitoring network traffic. When no node is
transmitting over the shared "ether," a fixed voltage of +0.7 V appears on the line. This is known as
the carrier sense signal, indicating that the network is active and available for use.


To transmit over Ethernet, a node "listens" for a short period before starting. If two nodes start
transmitting at the same time, a collision results. When a collision is detected, the participating nodes
produce a "jam" signal and wait for random amounts of time, on the order of tens of milliseconds,
before trying again. This type of distributed bus arbitration based on collision detection is discussed
further in section 23.4. A node initiates its transmission by sending a string of 0s and 1s known as a
preamble, which causes the line voltage to fluctuate between +0.7 and -0.7 V. If no collision is
detected by the end of the preamble, the node sends the rest of the frame. A common means of
tapping into Ethernet is the RJ-45 connector, whose eight pins constitute four pairs of lines.

ATM (asynchronous transfer mode) is an interconnection technology used in the construction of long-
haul networks of the type shown in Figure 23.2c. Composed of switches connected in an arbitrary
topology, an ATM network can transmit data at rates of 155 Mb/s to 2.5 Gb/s. Transmission occurs
not as a continuous bit stream but through store-and-forward routing of 53 B packets (a 5 B
header followed by 48 B of data). The difference from packet routing on the Internet is that a specific
route, or virtual connection, is established, and all packets comprising a particular transmission are
forwarded along that route. This allows certain QoS (quality of service) parameters, such as
guaranteed data rate and packet loss probability, to be negotiated when the virtual connection is
established. The 48 B data payload of a packet was chosen as a compromise to balance the control
overhead of each packet against the latency requirements of voice communication, where each
millisecond of speech is usually represented in eight bytes, leading to intervals of 6 ms per ATM
packet; longer intervals would have led to audible delays, as well as unpleasant pauses in the event
of packet loss. The development of ATM is managed by the ATM Forum [WWW], whose members
include many telecommunications companies and several universities.

Intra- and intersystem links have assumed a central role in the design of digital systems. One
reason is that the rapidly increasing density and performance of electronic components has made
providing the required connectivity at reasonable speed a very difficult problem. Signal propagation
through wires occurs at one-third to one-half the speed of light. This means that electronic signals
travel (a small computational check follows the list):

Through a memory bus (10 cm) in 1 ns

Through a large computer (10 m) in 100 ns


Through a university campus (1 km) in 10 μs

Through a city (100 km) in 1 ms

Through a country (2 000 km) in 20 ms

Across the globe (20 000 km) in 200 ms
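These figures follow directly from dividing distance by signal speed. A small Python check (ours), assuming propagation at one-third the speed of light:

# Signal propagation time = distance / (c/3), with c = 3e8 m/s.
C_THIRD = 1.0e8   # one-third the speed of light, in m/s

for name, meters in [("memory bus", 0.1), ("large computer", 10),
                     ("campus", 1_000), ("city", 100_000),
                     ("country", 2_000_000), ("globe", 20_000_000)]:
    print(f"{name:15s}: {meters / C_THIRD * 1e9:12.0f} ns")
# 10 cm -> 1 ns; 10 m -> 100 ns; ...; 20,000 km -> 200,000,000 ns (200 ms)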

Even on-chip signal propagation delay has become non-negligible and must be taken into
account in the design of high-performance digital circuits.

2. Buses and their appeal


A bus is a shared connection that links multiple data sources and sinks. The main advantage of bus-
based connectivity is that, in theory, it allows any number of units to communicate with one
another, with each unit requiring only one input port and one output port. This minimizes wiring
and provides the flexibility to add new units with minimal change or disruption. In practice, however,
there is an upper limit to the number of units that can share a single bus; this limit depends on the
technology and speed of the bus. Many low-end computers benefit from the simplicity and
economy of a single shared bus that connects all system components (Figure 21.1). Higher-
performance systems may include multiple buses or use point-to-point connections for speed and
bandwidth.

A bus usually consists of a bundle of control lines, a set of address lines, and a number of data lines,
as shown in Figure 23.5. Control lines carry the various signals needed to manage bus transfers, the
devices connected to the bus, and the associated handshaking protocols. Address lines carry
information about the source or destination of the data (a memory location or an I/O device). The
most common address width in current use is 32 bits, but narrower and wider addresses exist on
many systems. Data lines carry the data values to be sent from one unit to another. Serial buses have
a single data line, while parallel buses carry 1-16 bytes (8-128 bits) of data simultaneously. Address
and data lines can be shared (multiplexed), but this leads to lower performance.


A typical computer may use a dozen or so different types of buses. There are three main reasons
for this multiplicity:

1. Legacy buses are buses that existed in the past and must be supported for users to continue using
their old peripheral devices. Examples of legacy buses include PC Bus, ISA, RS-232, and the parallel
port.

2. Standard buses follow popular international standards and are required to connect the most current
peripheral devices to the computer. Examples of these, discussed further in section 23.6, include PCI,
SCSI, USB, and Ethernet.

3. Proprietary buses are custom designed to maximize the data transfer rate between critical system
components, and therefore the overall system performance. These buses are designed for specific
devices known at the time of design and are not expected to be compatible with other devices or
standards.

What follows reviews certain bus characteristics that are common among the three previous
categories. The most important bus specifications include:

Data rate: Excluding old slow buses, the data rate varies from a few megabytes per second,
through tens or hundreds of megabytes per second (for example, on Ethernet), to several gigabytes
per second on fast memory buses that typically use a 100-500 MHz clock with an 8 B wide data path.

Maximum number of devices: More devices increase the bus load, as well as the arbitration overhead,
slowing down the bus. This parameter varies from a few (for example, 7 for standard SCSI) to many
thousands (for example, on Ethernet).

Connection method: This characteristic has to do with the types of connector and cable that can be
used to attach devices to the bus, as well as the ease of adding or removing devices. For example,
USB is hot-pluggable, which means that devices can be added or removed even while the bus is in
operation.

Robustness and reliability: It is important to know how well a bus tolerates the malfunction of a device
connected to it, or various electrical irregularities such as short circuits, which can occur before
or during data transmission.

Electrical parameters: These include voltage levels, current ranges, overvoltage and short-circuit
protection, intermodulation, and power requirements.

Communication overhead: Error-checking and control bits are added to the transmitted data for
control, coordination, or enhanced reliability. This overhead tends to be small for short, fast buses and
quite significant for long-haul transmissions.

Buses can also be classified according to their control, or arbitration, scheme. In the simplest control
scheme, there is a single bus master that initiates all bus transfers (the master sends or receives data).
For example, the CPU can be the master that sends data to, or receives data from, any unit connected
to the bus. This is very simple to implement, but it has the double disadvantage of wasting CPU time
and using two bus transfers to send a word of data from disk to memory.


Slightly more flexible is a scheme in which the bus master can delegate control of the bus to other
devices for a specified number of cycles, or until completion of the transfer is signaled. The DMA
scheme of section 22.5 is a prime example of this method.

Buses with multiple masters may have an arbiter that decides which master gets to use the bus.
Alternatively, a distributed control scheme can be used whereby "collisions" are detected and dealt
with through retransmission at a later time (as in Ethernet). The method used for bus arbitration has
a profound effect on performance. Arbitration methods and bus performance are discussed in
section 23.4.

Example 23.1: Accounting for bus arbitration overhead

A 100 MHz bus is 8 B wide. Suppose that 40% of the bus bandwidth is wasted on arbitration and other
control overhead. What is the net bus bandwidth available to the CPU and other I/O devices after
accounting for the needs of a 1,024 × 768 pixel video display unit with 4 B/pixel and a 70 Hz refresh
rate?

Solution: The data rate of the display unit is 1,024 × 768 × 4 × 70 ≅ 220 MB/s. The total effective bus
bandwidth is 100 × 8 × 0.6 = 480 MB/s. The net bandwidth is therefore 480 − 220 = 260 MB/s, which
can serve the CPU and the remaining I/O devices. Note that memory was assumed capable of
supplying data at the effective bus rate of 480 MB/s. This is not unreasonable; it represents 48 bytes
per 100 ns.

3. Bus communication protocols


Buses are divided into two broad classes: synchronous and asynchronous. Within each class, buses
can be custom designed or based on a prevailing industry standard. On a synchronous bus, a clock
signal is part of the bus, and events occur in specific clock cycles according to an agreed-upon
schedule (the bus protocol). For example, a memory address might be placed on the bus in one cycle,
with the memory required to respond with the data in the fifth clock cycle. Figure 23.6 contains a
representation of this protocol. Synchronous buses are suitable for a small number of devices of equal
or comparable speed communicating over short distances (otherwise, bus loading and clock skew
become problematic).

Asynchronous buses can accommodate devices of various speeds communicating over longer
distances, as the fixed timing of the synchronous bus is replaced with a handshaking protocol. The
typical sequence of events on an asynchronous bus, for the case of data input or a memory read
access (Figure 23.7), is as follows. The unit requesting input asserts the Request signal while placing
the address information on the address lines. The receiver notes (1) this request, acknowledges it by
asserting the Ack signal, and reads the address information from the bus. The requester in turn notes
(2) the acknowledgment, deasserts the Request signal, and removes the address information from
the bus. Next, the receiver notes (3) the deassertion of the Request signal and deasserts its
acknowledgment. When the requested data becomes available (4), the Ready signal is asserted and
the data is placed on the bus. The requester then notes (5) the Ready signal, asserts the Ack signal,
and reads the data from the bus. The final acknowledgment (6) leads the sender to deassert the Ready
signal and remove the data from the bus. Finally, the requester, upon noting (7) the deassertion of
the Ready signal, deasserts the Ack signal to terminate the bus transaction.
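The following Python sketch (a simplified model of ours, not an actual bus implementation) walks through the seven numbered events, using plain boolean signals and assertions to express each unit's dependence on the previous event:

# Simplified model of the asynchronous handshake (events 1-7 above).
bus = {"Request": False, "Ack": False, "Ready": False, "addr": None, "data": None}

bus["Request"], bus["addr"] = True, 0x2000        # requester asserts Request + address
assert bus["Request"]; bus["Ack"] = True          # (1) receiver sees Request, asserts Ack
assert bus["Ack"]; bus["Request"], bus["addr"] = False, None   # (2) requester drops Request
assert not bus["Request"]; bus["Ack"] = False     # (3) receiver deasserts its Ack
bus["Ready"], bus["data"] = True, 0xCAFE          # (4) data available: sender asserts Ready
assert bus["Ready"]; received = bus["data"]; bus["Ack"] = True  # (5) requester reads data
assert bus["Ack"]; bus["Ready"], bus["data"] = False, None      # (6) sender drops Ready
assert not bus["Ready"]; bus["Ack"] = False       # (7) requester ends the transaction
print(hex(received))                              # 0xcafe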


To interface asynchronous signals with synchronous (clocked) circuits, it is necessary to use
synchronizers: circuits that enforce certain timing constraints on signal transitions, as required for the
correct operation of synchronous circuits (section 2.6). Synchronizers not only add complexity to
digital circuits, they also introduce delays that nullify some of the advantages of asynchronous
operation.

Within a computer system, the processor-memory bus is usually custom designed and
synchronous. The motherboard (backplane) bus can be standard or custom. Finally, the I/O bus is
usually standard and asynchronous. The PCI bus is reviewed as an example in the remainder of this
section; section 23.6 describes some other commonly used standard interfaces.

The standard PCI (Peripheral Component Interconnect) bus, developed by Intel in the early 1990s, is
now managed by PCI-SIG [WWW], a special interest group with hundreds of member companies.
Although PCI was developed as an I/O bus, it shares some of the characteristics of a high-performance
bus, allowing it to be used as a focal point for all intrasystem communications. For example, PCI
combines synchronous (clocked) operation with a provision for wait-cycle insertion by slower devices
that cannot keep up with the bus speed. The following description is based on the widely used original
standard, which has since been augmented in several ways, including PCI-X, mini-PCI, and hot-
pluggable PCI, for improved performance, increased convenience, and extended application domains.
The description also reflects the minimum requirements in terms of data width (32 bits) and control
signals; the data width doubles and the number of control signals increases in the optional 64-bit
version.

To reduce the number of pins, PCI has 32 AD lines that carry either addresses or data. There are four
active-low command lines, C/BE', which specify the function to be performed and also double as
byte-enable signals during actual data transmission. Typical commands include memory read,
memory write, I/O read, I/O write, and interrupt acknowledge. The byte-enable function indicates
which of the four bytes carry valid data. There are six basic interface control signals, including:


FRAME' Remains asserted for the duration of a transaction, thus delimiting it

IRDY' Signals that the initiator is ready

TRDY' Signals that the target is ready

STOP' Requests that the master stop the current transaction

Additionally, three pins are dedicated to CLK (clock), RST' (reset), and PAR (parity), two bits are used
to report errors, and two bits are dedicated to arbitration (REQ' and GNT', applicable only to a master
unit).

Figure 23.8 shows a typical data transfer over PCI. The bus transaction begins in clock cycle
1, when the bus master places an address and command (I/O read) on the bus; it also asserts
FRAME' to signal the beginning of the transaction. The actual address transfer from the initiating unit
to the target unit occurs in clock cycle 2, when all units attached to the bus examine the address and
command lines to determine whether the transaction is directed at them; the selected
unit then prepares to participate in the remainder of the bus transaction. The next clock cycle 3 is a
"turnaround" cycle, during which the initiator releases the AD lines and places a byte-enable command
on the C/BE' lines in preparation for data transmission in each successive cycle. Starting with clock
cycle 4, a 32-bit data word can be transmitted on every clock tick, with the actual transfer rate dictated
by the sender and receiver via the IRDY' and TRDY' lines. For example, Figure 23.8 shows two wait
cycles inserted into the data stream: in cycle 5, the sender (target) is not ready to transmit, while
in cycle 8 a wait cycle is inserted by the initiator because, for whatever reason, it is not ready to
receive.


Example 23.2: Effective PCI bus bandwidth

If the bus transaction shown in Figure 23.8 is typical, what can be said about the effective bandwidth
of a PCI bus operating at 66 MHz?

Solution: Allowing for another turnaround cycle (from data back to address) at the end of
the transaction shown in Figure 23.8, ten clock cycles are used for four data transfers (16 B). This
represents an effective data rate of 66 × 16/10 = 105.6 MB/s. The theoretical maximum transfer rate
is 66 × 4 = 264 MB/s. Therefore, the overhead due to handshaking and wait cycles is 60%.

4. Bus arbitration and performance

Any shared resource requires a protocol governing fair access by the entities that need to use it;
buses are no exception. Deciding which device gets to use a shared bus, and when, is known as bus
arbitration. One approach is to use a centralized arbiter that receives bus request signals from a
number of devices and responds by asserting a grant signal to the single device that may
use the bus. If the requesting devices are not synchronized by a common clock signal, inputs to the
arbiter must pass through synchronizers (Figure 23.9). The bus grant may apply for a fixed number
of cycles (after which the bus is assumed to be free), or it may hold for an indefinite period, in which
case the device must explicitly release the bus before another grant can be issued.

For devices with fixed priority assignments, the bus arbiter is simply a
priority encoder that asserts the grant signal associated with the highest-priority requester, provided
the bus is available. With fixed priorities, the bus arbiter quickly grants the bus to a high-priority
requesting unit, but it has the disadvantage of potentially "starving" the lower-priority units when the
high-priority units use the bus intensively. At the other extreme, a rotating priority scheme provides
uniform treatment of all devices. Intermediate schemes can be devised that give higher priority to
devices requiring faster attention while avoiding starvation of the lower-priority devices. Some such
alternatives are explored in the end-of-chapter problems.

A central arbiter can become quite complex when a large number of units is involved; it also limits
expandability. A simple arbitration scheme that can be used alone, or in combination with a priority-
based arbiter that has few inputs and outputs, is daisy chaining. A daisy chain is an
ordered set of devices that share a single request and a single grant signal. The request signals from
all devices in the chain are ORed together to form a single request input to a centralized arbiter
(Figure 23.10). The chain as a whole can thus be granted permission to use the bus. The
grant signal goes to the first device in the chain, which can either use the bus or forward the grant
signal to the next device in the chain. In this way, units located near the end of the chain are less likely
to obtain permission to use the bus; consequently, they should not be devices that require fast or
high-throughput service. The sketch below illustrates the grant propagation.
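A minimal sketch of this grant propagation in Python (our illustration; real daisy chains are, of course, combinational hardware):

# Daisy-chain grant propagation: the grant enters at device 0 and is
# forwarded down the chain until a requesting device consumes it.
def daisy_chain_grant(requesting):
    """requesting: list of booleans, index 0 closest to the arbiter.
    Returns the index of the device granted the bus, or None."""
    for position, wants_bus in enumerate(requesting):
        if wants_bus:
            return position      # this device keeps the grant
    return None                  # grant falls off the end; bus stays free

print(daisy_chain_grant([False, True, True]))   # 1: the earlier device wins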


Distributed arbitration schemes are attractive because they avoid a centralized arbiter that can
become a performance bottleneck. Distributed arbitration can be implemented through self-selection
or collision detection. In a self-selection scheme, each device requesting the bus places a code
representing its identity on the bus and immediately examines the bus to determine whether it has
the highest priority among all the units requesting the bus. Of course, the codes assigned to the
various units must be such that the composition of all codes (their logical OR) suffices to tell whether
a given requesting unit has the highest priority. This tends to limit the number of units that can
participate in arbitration. For example, with an eight-bit bus, the codes 00000001, 00000011,
00000111, ..., 11111111 can be assigned to eight devices. When several of these codes are ORed
together, the code with the largest number of 1s results; the device that placed this particular code
on the bus then selects itself to use the bus, as the following sketch illustrates.
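A small Python sketch (ours) of this self-selection logic, using the eight-device code assignment just described:

# Self-selection with codes 00000001, 00000011, ..., 11111111:
# device i (0 = lowest priority) drives code (2**(i+1) - 1) onto the bus.
def self_select(requesting_ids):
    """Return the id of the highest-priority requester (7 = highest)."""
    bus = 0
    for i in requesting_ids:
        bus |= (1 << (i + 1)) - 1        # wired-OR of all driven codes
    return bus.bit_length() - 1          # index of the most significant 1

print(self_select([2, 5, 0]))   # 5: its code has the most 1s, so it wins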

Collision detection relies on units freely transmitting on the bus while monitoring it to see whether
multiple transmissions have mixed, or collided, with each other. If a collision occurs, the senders
consider their transmissions to have failed and retry later, waiting a random length of time to
minimize the chance of further collisions. The intended receivers also detect the collision, which
causes them to ignore the garbled transmission. To ensure that a transmission is recognized as
garbled in the event of a collision, the colliding devices can "jam" the bus, as is done on Ethernet. The
randomly chosen waiting period makes it very unlikely that the same devices will collide repeatedly,
wasting bus bandwidth through repeated collisions. As long as the bus is not loaded near saturation
and data is transmitted in large blocks, the added latency due to collisions remains acceptable.

Bus throughput is usually measured in terms of the peak transmission rate over the bus. For a given
bus width, and assuming that no bus cycles are lost to arbitration and other overheads, the bus data
transmission rate is proportional to its clock frequency. This is why one hears of 166 or 500 MHz
buses. Bus throughput (bandwidth) can be increased by using one or more of the following:

Higher clock frequency.

Wider data transfer path (for example, 64 bits of data instead of 32).

Separate (non-multiplexed) address and data lines.

Block transfers to reduce arbitration overhead.

Fine-grained bus control; for example, releasing the bus while waiting for a response (split-
transaction protocol).

Faster bus arbitration.

Note that a bus of, say, 1 GB/s can be implemented in many ways. Options range from a very fast
serial bus to a much slower bus with 256-bit data width.

Example 23.3: Bus performance

A particular bus must support transactions that require anywhere from 5 to 55 ns to complete.
Assuming a bus data width of 8 B and a uniform distribution of bus service times within the 5-55 ns
range (averaging 30 ns), compare the following bus design options with respect to performance.

a) Synchronous bus with a cycle time of 60 ns, including allowance for arbitration overhead

b) Synchronous bus with a cycle time of 15 ns, where 1-4 cycles are used per transfer, as needed.
Assume that completion signaling and arbitration overhead use up one additional bus cycle.

c) Asynchronous bus that requires a 25 ns overhead for arbitration and handshaking.

Solution: The following calculations show that the option in part (b) offers the best performance.

a) Transferring 8 B every 60 ns leads to a data transfer rate of 8/(60 × 10^-9) B/s = 133.3 MB/s.

b) Within the 5-55 ns range, 20% of transactions need one bus cycle, 30% need two, 30% need three,
and the remaining 20% need four. When the one-cycle overhead is factored in, an average of 3.5 bus
cycles is used per transaction, leading to a transfer rate of 8/(3.5 × 15 × 10^-9) B/s = 152.4 MB/s.

c) In this case, the average transfer rate is 8 B every 30 + 25 = 55 ns, which leads to a data transfer
rate of 8/(55 × 10^-9) B/s = 145.5 MB/s.


Performance is not the only design criterion for buses. The robustness of a bus is also important for
system reliability and availability. Factors that influence robustness include:

Parity or other encoding for error detection.

Ability to reject or isolate malfunctioning units.

Hot-plug capability for continuous operation during unit (dis)connection.

For systems where reliability or availability is very important, multiple buses are often used to allow
continuous operation in the event of a bus failure.

5. Basics of interfacing

Interfacing means connecting an external device to a computer. In addition to performing input and
output through more or less standard devices such as keyboards, visual display units, and printers,
other reasons for such a connection include:

Sensing and data collection (temperature, movement, robotic vision).

Computer control (robotics, speech synthesis, lighting).

Extending computer capabilities (storage, coprocessors, networking).

Creating interfaces can be as simple as plugging a compatible device into one of the sockets provided
on a standard bus and configuring the system software to allow the computer to communicate with
the device over that bus. Sometimes, though rather uncommon these days, creating interfaces may
involve doing something a little more elaborate to tailor a particular port of a computer to specific
features of a peripheral device.

Microcontrollers (processors specifically designed for control applications involving
everything from appliances to automobiles) come with special interfacing capabilities, such as analog-
to-digital (A/D) and digital-to-analog (D/A) converters, communication ports, and built-in timers. With
ordinary processors, devices are usually interfaced by plugging them into the I/O bus and writing
special programs to handle the added device, often by adapting the interrupt handling routine. These
device-specific programs are known as device drivers.

Consider, as an example, the sensing and recording of wind direction by means of a weather vane
that produces a voltage in the range 0-5 V, depending on the direction. The weather vane may have
a built-in potentiometer whose wiper moves with the wind direction (Figure 23.11). An output voltage
close to zero can indicate wind from the north, with the voltage increasing to one-quarter of its
maximum for east, one-half for south, and three-quarters for west. A microcontroller with a built-in
analog-to-digital converter can accept the vane's output voltage on a specific pin of one of its ports.
The rest is left to the software running on the microcontroller. For example, the software can read
the contents of a built-in incrementing timer every millisecond, check whether the timer value
indicates that a predetermined period has elapsed since the last reading of the vane direction, sample
the corresponding pin if appropriate, and record the data in memory.
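A sketch of this conversion in Python (our hypothetical illustration; a real microcontroller would do the equivalent in C or assembly against its own register map):

# Map an A/D converter reading of the vane voltage (0-5 V) to a compass
# direction: 0 V = north, 1.25 V = east, 2.5 V = south, 3.75 V = west.
def wind_direction(adc_value, adc_bits=8, vref=5.0):
    volts = adc_value / ((1 << adc_bits) - 1) * vref   # reading -> voltage
    fraction = (volts / vref) % 1.0                    # 0.0 .. 1.0 of a full turn
    points = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    return points[round(fraction * 8) % 8]             # nearest of 8 points

print(wind_direction(64))   # ~1.25 V -> "E"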

To record the wind direction with MiniMIPS or another processor that lacks built-in A/D conversion
capability, it is necessary to acquire a suitable component to perform the required analog-to-digital
conversion with the desired precision and to store the result in a register of, say, eight bits. The
converter's status and data registers are assigned specific addresses, and a decoder is designed to
recognize these addresses. The A/D converter can then be treated as an I/O device in the manner
discussed in section 22.2.

6. Interfacing standards


There are many interfacing standards that allow devices of different speeds and capabilities, from
different companies, to be connected to computers or to one another. Section 23.3 discussed the PCI
bus and provided an example of a bus communication protocol. This section reviews some other
commonly used interfaces. For comparison, Table 23.2 summarizes the characteristics of three of
these standards, along with those of PCI.


SCSI (Small Computer System Interface; pronounced "skuzzy") is a standard bus intended primarily
for high-speed I/O devices, such as magnetic and optical disk memories. It grew out of the Shugart
Associates System Interface (SASI) in the mid-1980s. Over the years, several versions of SCSI have
been introduced. The description below is based on SCSI-2, introduced as ANSI Standard X3.131 in
1994, with some later improvements and extensions. The latest version, SCSI-3, is not a single
standard but rather a family of standards. On most PCs, the IDE/ATA bus, discussed at the end of this
section, is used as the interface for magnetic and optical disk memories. The main reason is the
latter's low cost, which outweighs the higher capacities and higher performance of SCSI.

Devices on the SCSI bus communicate with the computer through one of the units connected to the
bus. This unit, known as the controller, has access to a primary bus of the computer, such as PCI, and
from there to memory. SCSI can operate in synchronous and asynchronous modes, with the data rate
being much higher in synchronous operation. SCSI uses 50- or 68-pin connectors. Addresses and data
are multiplexed, and bus arbitration is performed by self-selection (section 23.4). As a result, the
number of devices attachable to the SCSI bus is one less than its width in bits. More information about
this bus can be found on the SCSI Trade Association website, scsita.org.

USB (Universal Serial Bus) is a standard serial bus intended for use with I/O devices of low to medium
bandwidth. USB connectors use small four-pin plugs attached to a four-wire cable (Figure 23.12).
These connectors carry both data signals, in half-duplex mode, and 5 V power.

Certain low-power peripherals can draw power through the connector, obviating
the need for a separate power source. One USB port can accommodate up to 127 devices through
the use of hubs or repeaters; up to six tiers, or hub levels, are allowed, forming a daisy chain. The host
computer manages the bus using a fairly complex protocol. When a device is connected to a USB bus
(which can be done even during operation), the host detects the presence of the device and its
properties, then notifies the operating system so that the required device driver can be activated.
Disconnection of a device is detected and handled in the same way. This hot-plug capability,
and the ability to support many devices through a single port, are USB's main attractions. For
example, a computer with two USB ports can be linked to its keyboard and mouse via one port, and
to a variety of other devices through a hub connected to the second port (a five-port hub, for example,
can host a printer, joystick, camera, and scanner).

USB 2.0 offers three data rates: 1.5, 12, and 480 Mb/s (0.2, 1.5, and 60 MB/s, respectively). The
highest of these is specific to Hi-Speed USB, a relatively new standard. The other two rates are
intended to provide backward compatibility with earlier versions of the standard, including USB 1.1,
which operated at 12 Mb/s, and to allow low-cost implementations for devices that do not need high
data rates. Today, virtually every desktop and laptop computer has one or more USB ports. USB 1.1
ports are still seen on older products, but Hi-Speed USB is quickly taking over. More information about
USB can be found in the literature [Ande97] or at usb.org.

The IEEE 1394 serial I/O interface is based on Apple Computer's FireWire specification, introduced in
the mid-1980s. Sony uses the name i.Link for this interface. IEEE 1394 is a general-purpose standard
I/O interface, although its relatively high cost compared with USB has limited its use to applications
that require higher bandwidth to deal with high-quality audio and video I/O. There are some
similarities to USB in connector shape and size, maximum cable length, and hot-plug capability. A key
difference is the use of peer-to-peer bus arbitration, which obviates the need for a single master but
also complicates each device interface. IEEE 1394 supports both synchronous and asynchronous
transmissions. Any individual device can request up to 65% of the available bus capacity, and up to
85% of the capacity can be allocated to such devices. The remaining 15% ensures that no device is
completely shut out of bus use.

Physically, the IEEE 1394 connector has six contacts: two for power plus two twisted pairs. The twisted
pairs are transposed at the two ends of the cable to allow the same connector to be used at both ends
(Figure 23.13). Each twisted pair is separately shielded inside the cable, and the entire cable is also
shielded on the outside. A smaller four-pin connector, without the power lines, has also been used in
some products. More information about IEEE 1394 can be found on the 1394 Trade Association
(1394ta.org) website.

Other standards one encounters frequently are UART, AGP, and IDE/ATA. These are briefly described
in the following paragraphs.

UART (Universal Asynchronous Receiver/Transmitter), a convenient serial interface standard, is
another name frequently heard in connection with interfacing. Virtually every computer has a UART
unit to manage its serial port(s). Characters are transmitted much as in the RS-232 standard, discussed
in section 23.1. However, the bit rate, character width (5-8 bits), parity type (none, odd, even), and
stop sequence (1, 1.5, or 2 bits) are programmable and can be selected at setup time. The 1-to-0
transition that begins the start bit is detected and forms the reference point for timing events. If the
transmission rate is 19,200 b/s, then a new bit must be sampled every 1/19,200 s ≅ 52.08 μs. At a
clock rate of many megahertz, this can be achieved quite accurately. For example, with a 20 MHz
clock, the input is sampled every 20 × 52.08 ≅ 1,041 clock ticks, with the first sample taken after about
520 ticks (half of the first bit time). Some inaccuracy in clock frequency or sampling rate is tolerable
as long as the cumulative effect, by the time the stop bit appears, is less than half the bit time. For
example, if a sample is taken every 1,024 ticks instead of every 1,041 ticks, the cumulative drift after
9.5 bit times is 9.5 × 17 = 161.5 ticks ≅ 0.155 bit time. This still leaves ample room for variations in
clock frequency without leading to sampling errors. Note that the start bit has a synchronizing
function, in that it limits the accumulation of the aforementioned inaccuracies beyond the time frame
of a single character transmission. The original UART had a single-character buffer that would fill in
about 1 ms at 9,600 b/s; so, as long as the CPU responded to an interrupt within 1 ms, no problem
would arise. At today's high transmission rates, a one-byte buffer is inadequate, so UARTs are usually
provided with a FIFO buffer of up to 128 B to accumulate data and interact with the CPU through
fewer interrupts.
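The tick arithmetic above is easy to verify. A short Python sketch (ours), using the 20 MHz clock and 19,200 b/s rate from the text:

# UART bit-sampling arithmetic: clock ticks per bit and cumulative drift.
CLOCK_HZ = 20_000_000
BAUD = 19_200

ticks_per_bit = CLOCK_HZ / BAUD        # ideal: ~1041.7 ticks
print(round(ticks_per_bit))            # 1042 (the text truncates to 1,041)

# Drift if we sample every 1,024 ticks instead of ~1,041:
drift_ticks = 9.5 * (1041 - 1024)      # accumulated over 9.5 bit times
print(drift_ticks / 1041)              # ~0.155 bit time: still < 0.5, so OK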

AGP (accelerated graphics port) was developed by Intel to take the high data rate needed for rendering
graphics off the shared PCI bus. AGP is a dedicated port rather than a shared bus. As shown in Figure
21.2, it has a separate connection to the memory bus that provides it with fast access to memory,
where certain large data sets associated with graphics rendering can be maintained. AGP is usually
assigned a contiguous address block in the part of the address space that lies beyond the available
physical memory. A graphics address remapping table (GART) translates AGP virtual addresses into
physical memory addresses. In this way, the graphics controller can use a portion of main memory as
an extension of its dedicated video memory when needed. The alternative of providing a larger
dedicated video memory is both more expensive and wasteful, because the dedicated memory would
go unused in some applications.
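
As a rough illustration of the GART idea, the following C sketch translates an address in an assumed
AGP aperture to a physical address through a page table; the aperture base, page size, and table
contents are all invented for illustration and do not reflect any particular chipset.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE     4096u
#define APERTURE_BASE 0xF0000000u          /* assumed AGP aperture start      */
#define GART_ENTRIES  4u

/* Each entry maps one aperture page to a physical page frame (invented). */
static uint32_t gart[GART_ENTRIES] = { 0x12345, 0x0A001, 0x0A002, 0x00FF0 };

uint32_t agp_to_physical(uint32_t agp_addr) {       /* no bounds check, for brevity */
    uint32_t page = (agp_addr - APERTURE_BASE) / PAGE_SIZE;  /* aperture page index */
    uint32_t off  = agp_addr % PAGE_SIZE;                    /* offset within page  */
    return gart[page] * PAGE_SIZE + off;             /* physical frame + offset     */
}

int main(void) {
    uint32_t a = APERTURE_BASE + 0x1234;
    printf("0x%08X -> 0x%08X\n", (unsigned)a, (unsigned)agp_to_physical(a));
    return 0;
}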

AGP connectors have 124 or 132 pins, for the single-voltage (1.5 or 3.3 V) and universal varieties,
respectively. Peak bandwidth ranges from 0.25 to 1 GB/s. The high bandwidth is achieved, in part,
with the help of eight additional sideband address bits that allow AGP to send new addresses and
requests while data is being transferred on the main 32-bit address/data lines. Of course, the high AGP
bandwidth is useful only if the main system bus and memory can support a slightly higher bandwidth,
so that both AGP and the CPU can access memory.

ATA, or AT attachment, is so named because it was developed as an attachment for the IBM PC AT
personal computer. ATA, and its descendants such as Fast ATA and Ultra ATA, are commonly used to
interface with PC storage drives. IDE (integrated drive electronics) is a registered trademark for
Western Digital's ATA products and is therefore virtually equivalent to ATA. The latter uses a 40-pin
connector, usually attached to a flexible 40-wire flat cable with two connectors. As already mentioned,
this rather limited interface has been surpassed in bandwidth and other capabilities by other standards.
However, ATA survives and remains in wide use owing to its simplicity and low cost of implementation.

Problems
1) RS-232 serial interface

Answer the following questions about the RS-232 serial interface based on research in books and on
the Internet.

a) How are the various control signals shown in Figure 23.3 used in the handshaking protocol?

b) How is a device addressed or selected in Figure 23.2a?

c) What types of input and output devices can this interface support?

2) Ethernet Collisions


Consider the following highly simplified analysis of collisions in Ethernet. An Ethernet cable
interconnects n computers. Time is divided into 1 ms slices. In an interval of 0.5 s, each of the n
computers attempts to transmit one message, with the timing of the transmission chosen at random
from the 500 available time slices. Whether or not the transmission is successful, the transmitter
gives up and makes no further transmission attempt in the current 0.5 s interval.

a) What is the probability of having one or more conflicts in an interval of 0.5 s with n = 60?

b) How does the result of part (a) relate to the "birthday problem" described in section 17.4 in
connection with interleaved memories?

c) On average, within each 0.5 s interval, what fraction of the attempted transmissions will be
successful?

d) Repeat part (a), this time assuming n = 150.

e) Repeat part (c) with the assumption of part (d).
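
As a numerical aid for this problem (not a substitute for the analysis it asks for), the following C
sketch evaluates the birthday-style product P(no collision) = (500)(499)...(500 − n + 1)/500^n for the
values of n in parts (a) and (d):

#include <stdio.h>

int main(void) {
    int slots = 500;                 /* 0.5 s divided into 1 ms slices */
    int ns[]  = { 60, 150 };         /* n for parts (a) and (d)        */
    for (int k = 0; k < 2; k++) {
        double p_clear = 1.0;        /* P(all n choices are distinct)  */
        for (int i = 0; i < ns[k]; i++)
            p_clear *= (double)(slots - i) / slots;
        printf("n = %3d: P(one or more collisions) = %.4f\n",
               ns[k], 1.0 - p_clear);
    }
    return 0;
}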

3) Asynchronous transfer mode

Answer the following questions about ATM interconnect technology based on research in books and
on the Internet.

a) How is virtual connection established through the exchange of messages and acknowledgments?

b) What categories of quality of service are supported?

c) How many audio and video channels can be supported within the ATM data rates mentioned near
the end of section 23.1?

d) Under what conditions are ATM cells dropped?

4) Bus protocols

On a bus, the address and data lines can be separate or shared (multiplexed). Which of these two
options is likely to be assumed in the following buses, and why?

a) The synchronous bus whose operation is shown in Figure 23.6

b) The asynchronous bus whose operation is shown in Figure 23.7

c) The bus in example 23.1

5) Synchronous buses

Consider the synchronous bus whose operation is shown in Figure 23.6. Assume that all bus
transactions are of the same type and ignore the arbitration overhead.

a) What fraction of the bus's peak bandwidth is actually used for data transfers?

b) How can the usable bandwidth be improved through a split-transaction protocol?


6) Ethernet bridges

Large Ethernet networks are usually formed from smaller segments connected by bridges. A
transparent bridge (also known as a spanning tree bridge) can be used to expand a network without
modification to existing users' hardware or software. Study how these transparent bridges operate
and how they use a backward learning algorithm to build their routing tables, which are blank at
boot time. Describe your findings in a five-page report.

7) I/O write operation on PCI bus

a) Draw a diagram similar to Figure 23.8 to show an I/O write operation via PCI bus.

b) Explain the elements of your diagram, as was done for the I/O read operation at the end of section 23.3.

8) Fixed priority arbitrators

A fixed-priority arbiter can be designed much like a ripple-carry adder, with a "grant" signal
propagating from the highest-priority side to the lower-priority side and stopping at the first request,
if any.

a) Present the complete design of a ripple-type priority arbiter with four request signals and four
grant signals.

b) Show how the arbiter can be made faster through lookahead techniques.

c) How can an eight-input priority arbiter be constructed from two arbiters of the type defined in
part (a) or (b)?
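
For orientation only, here is a behavioral C sketch of the ripple grant chain described in part (a);
the required answer is a gate-level design, and the signal names used here are invented.

#include <stdio.h>

/* Behavioral model of the ripple grant chain: the grant token enters at
   the highest-priority end and stops at the first asserted request.    */
void arbitrate(const int req[4], int grant[4]) {
    int token = 1;                        /* device 0: highest priority    */
    for (int i = 0; i < 4; i++) {
        grant[i] = token & req[i];        /* grant stops at first request  */
        token    = token & !req[i];       /* token ripples past idle units */
    }
}

int main(void) {
    int req[4] = { 0, 1, 0, 1 }, grant[4];
    arbitrate(req, grant);
    for (int i = 0; i < 4; i++)
        printf("R%d=%d  G%d=%d\n", i, req[i], i, grant[i]);
    return 0;
}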

9) Starvation-free priority schemes

One problem with a fixed-priority arbitration scheme is that low-priority devices may never get a
chance to use the bus if higher-priority units place heavy demand on it. This condition is known as
starvation.

a) Design a rotating-priority arbiter with four request signals and four grant signals. The highest-
priority device changes each time the bus is granted to a device, so that each device has a chance
to become the highest-priority device, thereby preventing starvation.

b) The rotating-priority scheme of part (a) is not a genuine priority scheme in that all devices are
treated equally. Implement a four-input priority scheme in which higher-priority devices have more
opportunities to use the bus, yet low-priority devices do not starve. Hint: what if each device is
allowed at least one opportunity to use the bus after the next-higher-priority device has been granted
bus use x times, where x is an adjustable design parameter?

c) Suggest and implement a different arbiter with the property sought in part (b), but now with the
grant decision based on the number of times a device's bus request has been denied.

10) Distributed Bus Arbitration


Consider the following distributed bus arbitration scheme. Each device X has an X_in input control
signal and an X_out output control signal, and can assert and observe the bus's busy signal. The
control input of the leftmost device is always asserted, giving it the highest priority. At the beginning
of each clock cycle, a device can deassert its control output signal, which serves as a bus request
indicator. If X_in is asserted, device X must assert X_out if it does not intend to use the bus. When
signal propagation is complete, the only device with its control input asserted and its control output
deasserted gains the right to use the bus at the beginning of the next bus cycle. Discuss design
considerations for this distributed priority scheme, including the relationship between the clock cycle
and the signal propagation time through the device chain.

11) Distributed Bus Arbitration

The VAX series of computers used distributed bus arbitration with 16 bus request lines Ri, 0 ≤ i ≤ 15,
where the lower-index lines had higher priority. When device i wanted to use the bus, it reserved a
future time slot by asserting Ri during the current time slot. At the end of the current time slot, all
requesting devices examined the request lines to determine whether a higher-priority device had
requested the bus; if not, the device could use the bus during the requested slot.

a) Show how 17 devices can share the bus even though there are only 16 request lines. Hint: the
extra device would have the lowest priority.

b) Argue that, somewhat paradoxically, the lowest-priority device of part (a) usually has the shortest
average wait time to use the bus. For this reason, in VAX, this lowest-priority spot was assigned to
the CPU.

c) Discuss conditions under which the claim of part (b) would not hold.

12) Bus performance

a) Redo example 23.3, but this time assume that bus transactions take 5 ns (10% of cases), 25 ns (20%),
35 ns (25%), and 55 ns (45%). Provide intuitive justification for any differences between your results
and those obtained in Example 23.3.

b) Repeat part a) with the following transaction times and percentages: 5 ns (30%), 30 ns (50%), 55 ns
(20%).

13) Buses for sensing and instrumentation

Buses used to communicate data between sensors, instruments, and a control computer are known
as fieldbuses. A commonly used fieldbus standard is Foundation Fieldbus. Answer the following
questions about the Foundation Fieldbus standard based on research in books and on the Internet.

a) What are the main design criteria that make a fieldbus different from other buses?

b) How are the devices connected to the bus and what is the arbitration method, if any?

c) What is the data width of the bus and how is the data about it transmitted?

d) What are the most important application domains for the bus?


14) A/D and D/A conversion

a) One way to convert a four-bit digital signal to an analog voltage level in the range of, say, 0-3 V
is to connect a resistor ladder network to the four outputs of a register containing the digital
sample. Present a diagram of such a D/A converter and explain why it produces the desired output.

b) Conceptually, the simplest way to convert an analog input voltage in the range of, say, 0-3 V is
to use a so-called flash circuit, which consists of resistors, voltage comparators, and a priority
encoder. Study the design of flash A/D converters and prepare a two-page report on how they work.

c) When the input to the A/D converter of part (b) is a rapidly varying voltage, sampling must
capture the frequency of the signal in addition to its amplitude. Briefly discuss how a proper
sampling frequency is determined. For this part, refer to a digital signal processing text or manual.
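
As a numerical companion to part (a): under ideal assumptions, a 4-bit R-2R ladder driven from a
reference voltage Vref produces Vout = Vref × code/16, so the largest code falls one step short of
Vref. The C fragment below merely tabulates this relationship; the 3 V reference is taken from the
problem's stated range.

#include <stdio.h>

int main(void) {
    double vref = 3.0;                       /* reference voltage (assumed) */
    for (int code = 0; code < 16; code++)    /* all 4-bit register values   */
        printf("code %2d -> %.4f V\n", code, vref * code / 16.0);
    return 0;
}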

15) Interfacing basics

Suppose that the A/D converter used in the weathervane application of Figure 23.11 is rather
inaccurate but has no systematic error or bias. In other words, the error is truly random and can be
in either direction. Outline a procedure that allows the microcontroller to compensate for the
conversion error so that more accurate wind direction readings can be obtained.

16) Other interfacing standards

Many older, less well-known, or domain-specific interfacing standards, as well as finer details of the
most widely used standards, were not discussed in section 23.6. Some of these are mentioned below.
Prepare a two-page report on the application domains and features of one of the following interface
standards.

a) ISA bus (industry standard architecture).

b) VESA local bus, or VL-bus (an ISA extension).

c) Plug and play technology for PCI.

d) Parallel port.

e) PCMCIA for notebook computers.

17) Universal bay on notebook computers

Some notebook computers have a universal bay (sometimes called multibay) into which a variety of
I/O devices can be inserted, or even a second battery. Speculate about how the inserted device can
be connected to the computer (pins, connectors, etc.), how the computer can recognize what type of
device is present in the bay, and why such a device change does not affect I/O programming.


CONTEXT SWITCHING AND INTERRUPTS


CHAPTER TOPICS

1) System calls for I/O


2) Interrupts, Exceptions, and Traps
3) Simple Interrupt Handling
4) Nested Interrupts
5) Types of Context Switching
6) Threads and Multithreading

Ordinary users and programmers usually need not worry about the details of how I/O devices operate
and how to control them. These tasks are relegated to the operating system, which user programs
call whenever I/O is needed. In turn, the operating system activates device drivers that set up
asynchronous I/O processes, which perform their tasks and report back to the CPU through interrupts.
When an interrupt is detected, the CPU can switch context, setting aside the active program to run
an interrupt service routine. This type of context switching is also feasible between different user
programs, or between threads of the same program, as a mechanism to avoid long waits due to
unavailability of data or system resources.

1. System calls for I/O


As mentioned in section 7.6, and again at the end of section 22.2, most users do not engage in the
details of I/O transfers between devices and memory but instead use system calls to achieve I/O data
transfers. One reason for indirect I/O through the operating system is the need to protect data and
programs from accidental or deliberate damage. Another is that it is more convenient for the user, as
well as for the system, to rely on operating system routines that provide a clean and uniform interface
to such devices in a form that is, as far as possible, device-independent. A system routine initiates
the I/O, which is usually done via DMA, and then relinquishes the CPU. Later, the CPU receives an
interrupt that signals the completion of the I/O and allows the I/O routine to perform its cleanup
phase.

For example, when reading data from a disk, the user must be prevented from accessing or modifying
system data or files that belong to another user. Additionally, a dozen or more types of errors may be
encountered during a read operation. Usually each error has an associated bit flag in the device status
register. Examples include invalid cylinder (track, sector) number, checksum violation, abnormal
timing, and invalid memory address. These status bits must be checked on each I/O operation and
appropriate actions initiated to deal with any anomalies. It makes perfect sense to free the user from
such complexities by providing I/O service routines whose user interfaces focus on the required data
transfer and not on the mechanics of the transfer itself.

Device convenience and independence are achieved by providing multiple layers of control between
the user program and the hardware I/O devices. Typical layers include:

OS kernel → I/O subsystem → Device driver → Device controller → Device


The device driver is a software link between the two hardware components on the right and the
operating system and its I/O subsystem on the left. As such, it is dependent on both the OS and the
device. A new I/O device that does not follow an already established hardware/software interface
should be introduced to the market with device drivers for several popular operating systems. The
idea is to capture, as much as possible, the characteristics of an I/O device in its associated driver,
so that the I/O subsystem needs to deal only with a few general categories, each of which spans
many I/O devices.

The operating system (OS) is the system software that controls everything on a computer. Its functions
include resource management, task scheduling, protection, and exception handling. The I/O subsystem
within the operating system handles interactions with I/O devices and manages the actual data
transfers. It frequently consists of a basic layer that is closely involved with hardware features and
other supporting modules that may not be required in all systems. The basic I/O subsystem in the
case of the Windows operating system is the BIOS (basic input/output system) which, in addition to
basic I/O, takes care of bootstrapping (booting) the system at power-up. BIOS routines are usually
stored in ROM or flash memory to prevent loss in the event of a power outage.

The aforementioned grouping of I/O devices into a small number of generic types is known as I/O
abstraction. The following paragraphs review four of the most useful I/O abstractions commonly found
in modern operating systems:

Character-stream I/O.

Block I/O.

Network sockets.

Clocks or timers.

The character-stream I/O abstraction treats input or output as a stream of characters. A new input
character is appended to the end of the input stream, while a new output character is sent to the
output device after the previous character. This type of behavior is captured by the system calls
get(•) and put(•), where "get" obtains an input character and "put" sends a character to the output.
With these primitives, it is relatively simple to build I/O routines that read or write strings of many
characters. In fact, an examination of the first eight lines of Table 7.2 shows that character-oriented
I/O is the only type provided in MiniMIPS (and its simulator) at the assembly language level.
Character-stream I/O is particularly suitable for devices such as keyboards, mice, modems, printers,
and audio cards.
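
A minimal sketch of this abstraction in C is shown below; get_ch and put_ch are hypothetical
stand-ins for the system's get(•) and put(•) calls (emulated here with stdio), and copy_line shows
how a many-character routine is built from the single-character primitives.

#include <stdio.h>

int  get_ch(void)  { return getchar(); }     /* stand-in for get(.) */
void put_ch(int c) { putchar(c); }           /* stand-in for put(.) */

/* A many-character routine built from single-character primitives:
   copy one input line to the output device.                        */
void copy_line(void) {
    int c;
    while ((c = get_ch()) != EOF && c != '\n')
        put_ch(c);
    put_ch('\n');
}

int main(void) { copy_line(); return 0; }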

The block I/O abstraction is suitable for disks and other block-oriented devices. Here, three basic
system calls can capture the essence of block I/O: seek(•), read(•), and write(•). The first of these,
seek, is needed to specify which block is to be transferred, while read and write perform the actual
I/O data transfers to and from memory, respectively. It is possible to build a memory-mapped file
system on top of such block-oriented I/O primitives. The idea is for the I/O system call to return the
virtual address of the required data instead of initiating a data transfer. The data is then automatically
loaded into memory upon first access to it via the provided virtual address. In this way, access to the
file system becomes very efficient, because its data transfers are handled by the virtual memory
system.
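
The following C sketch mimics the three primitives with an in-memory "disk"; the block size, device
array, and function names are invented stand-ins for a real driver interface.

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 512
#define NUM_BLOCKS 16

static char disk[NUM_BLOCKS][BLOCK_SIZE];    /* stand-in for the device */
static int  current_block = 0;               /* state set by seek(.)    */

void seek_blk(int b)            { current_block = b; }
void read_blk(char *buf)        { memcpy(buf, disk[current_block], BLOCK_SIZE); }
void write_blk(const char *buf) { memcpy(disk[current_block], buf, BLOCK_SIZE); }

int main(void) {
    char buf[BLOCK_SIZE] = "hello, block 3";
    seek_blk(3);                 /* specify which block to transfer */
    write_blk(buf);              /* memory -> device                */
    seek_blk(3);
    read_blk(buf);               /* device -> memory                */
    printf("%s\n", buf);
    return 0;
}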

Network sockets are special I/O devices that support data communication over computer networks.
A socket-related system call can create a socket, cause a local socket to be connected to a remote
address (a socket created by another application), detect remote applications requesting to be
connected to a local socket, or send/receive packets. Servers, which usually support many sockets,
may additionally require facilities for efficiently detecting which sockets have packets waiting to be
received and which have room to accept a new packet for transmission.

Clocks and timers, which are essential components in the implementation of real-time control
applications with microcontrollers, are also found in general-purpose processors in view of their
usefulness for determining the current time of day, measuring elapsed time, and triggering events at
preset times. A programmable interval timer is a hardware mechanism that can be preset to a desired
time interval and generates an interrupt when that amount of time has elapsed. The operating system
uses interval timers to schedule periodic tasks or to assign a time slice to a task in a multitasking
environment. It can also allow user processes to use this facility. In the latter case, user requests for
interval timers beyond the number available in hardware can be met by setting up virtual timers. The
overhead of maintaining these virtual timers is small compared with the intervals measured in typical
applications. For example, a time interval of 1 ms corresponds to 10^6 clock ticks with a 1 GHz clock.
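
One plausible way to multiplex virtual timers onto a single hardware interval timer is sketched below
in C: on each hardware timer interrupt, all virtual deadlines are decremented and the hardware timer
is reprogrammed for the nearest remaining deadline. All names and the fixed-size table are
assumptions for illustration, not a description of any particular OS.

#include <stdio.h>

#define MAX_TIMERS 8

static long vtimer[MAX_TIMERS];   /* remaining ticks; 0 = unused */

/* Called when the single hardware timer expires after 'elapsed' ticks. */
void hw_timer_interrupt(long elapsed) {
    long next = -1;
    for (int i = 0; i < MAX_TIMERS; i++) {
        if (vtimer[i] == 0) continue;
        vtimer[i] -= elapsed;
        if (vtimer[i] <= 0) {            /* this virtual timer fired    */
            printf("virtual timer %d expired\n", i);
            vtimer[i] = 0;
        } else if (next < 0 || vtimer[i] < next) {
            next = vtimer[i];            /* track the nearest deadline  */
        }
    }
    if (next > 0)
        printf("reprogram hardware timer for %ld ticks\n", next);
}

int main(void) {
    vtimer[0] = 1000; vtimer[1] = 250; vtimer[2] = 250;
    hw_timer_interrupt(250);   /* timers 1 and 2 fire; 750 ticks remain on 0 */
    return 0;
}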

2. Interrupts, Exceptions, and Traps

It has already been discussed (sections 22.4 and 22.5) how interrupts are used to free the CPU from
micromanaging the I/O data transfer process. This and the next two sections discuss how such
interrupts, as well as exceptions and traps, are implemented and handled.

Computer interrupts closely resemble telephone and e-mail interruptions during a person's work or
study time (Figure 24.1). When the phone rings, you set aside what you are doing (bookmark the page
of the book you are reading, hit the save button on the computer, press the mute button on the TV)
and talk to the caller. After hanging up, you may write some notes about what needs to be done as a
result of the phone call and then return to the main job. Studies show that it takes between one and
two minutes to "recover" from the interruption and resume working at the same rate as before the
phone rang. Something similar applies to an e-mail alert, although, for reasons that may have to do
with the relative difficulty of reading and writing versus speaking, the recovery time is a little longer
in this case. Such interruptions can be limited or completely disabled by disconnecting the phone cord
or by checking e-mail every two hours rather than continuously. This situation is similar to what the
CPU does when it disables interrupts or restricts them to high-priority ones.


Interrupt is both the general term for the CPU diverting its attention from the task currently being
executed to some unusual or unpredictable event that demands its involvement (for whatever reason)
and the name of the specific type of diversion caused by input/output and other hardware units. In
the latter context, one sometimes speaks of hardware interrupts. The CPU may be diverted from its
current task for reasons in two other categories:

Exceptions, which are caused by illegal operations, such as dividing by 0, attempting to access a
nonexistent memory location, or executing privileged instructions in user mode. Exceptions are
unpredictable in nature and rather rare.

Traps, or software interrupts, which are deliberate requests to the operating system for particular
services. Unlike hardware interrupts and exceptions, traps are preplanned in a program's design and
are not rare.

Much of the discussion in this chapter deals with the handling of hardware interrupts. However,
exceptions and traps are handled similarly, in that they too require setting aside the program being
executed and calling a special software routine to deal with the cause.

In the simplest scheme there is only one interrupt request line going into the CPU. Multiple devices
that need to interrupt the CPU share this line. Handling interrupts in this minimal configuration is
discussed in section 24.3. More generally, there are multiple interrupt request lines of various
priorities. The following example shows why it can be useful to have multiple priority levels for
interrupts. Nested interrupts and the complications in their handling are discussed in section 24.4.

Example 24.1: Interrupt priority levels on a car's computer

Suppose a microcontroller that has four interrupt priority levels is used in a car. The units that
interrupt the controller include an impact detection subsystem, which needs attention within 0.1 ms,
along with four other subsystems, namely:

Subsystem            Interrupt rate    Max service time (ms)
Fuel/ignition        500/s             1
Engine temperature   1/s               100
Dashboard display    800/s             0.2
Air conditioner      1/s               100

Discuss how priorities can be assigned to ensure a 0.1 ms response time for the impact detector and
to service each of the other critical units before its next interrupt arrives.

Solution: The impact detector must be given the highest priority, because its required response time
is less than the service time of every other type of interrupt; no other subsystem can share this high
priority level. The air conditioner can be given the lowest priority, because its function is not crucial
to safety. This leaves two priority levels for three subsystems. Because the engine temperature
monitor has a maximum service time of 100 ms, it cannot share a priority level with the dashboard
display and fuel/ignition subsystems, which require service many hundreds of times per second; it is
therefore assigned the next-to-lowest priority. Finally, the latter two subsystems can share the
remaining priority level, since each can be serviced after the other without exceeding the time
available before the next interrupt. For example, if fuel/ignition service begins just before a display
interrupt arrives, it will take 1.2 ms before both interrupt service routines are completed, while
1/800 s = 1.25 ms is available before the next dashboard display interrupt arrives. Note that the
service time of interrupts from the top-priority impact detection subsystem was ignored, because
they occur very rarely; when they do, the rest hardly matters.
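
The two arithmetic checks in this solution can be stated compactly in C; the rates and service times
are those given in the example.

#include <stdio.h>

int main(void) {
    /* Can fuel/ignition (1 ms) and the dashboard display (800/s, 0.2 ms)
       share one priority level? Worst case: one of each back to back.   */
    double back_to_back_ms = 1.0 + 0.2;
    double dash_period_ms  = 1000.0 / 800;          /* 1.25 ms            */
    printf("back-to-back service %.2f ms vs. display period %.2f ms: %s\n",
           back_to_back_ms, dash_period_ms,
           back_to_back_ms <= dash_period_ms ? "can share" : "cannot share");

    /* Total service demand of the four periodic subsystems, ms per second */
    double demand = 500 * 1.0 + 1 * 100.0 + 800 * 0.2 + 1 * 100.0;
    printf("total service demand: %.0f ms per second\n", demand);  /* 860 */
    return 0;
}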

3. Simple Interrupt Handling

This section assumes a single interrupt line going into, and an interrupt acknowledge signal produced
by, the CPU. Section 24.4 will discuss multiple interrupt lines with hierarchical priority levels and the
associated nesting of interrupt service activities.

The interrupt request signals from all participating units are connected to the IntReq line of the CPU
(Figure 24.2). An interrupt mask flip-flop, which the CPU can set, serves to disable interrupts. If
IntReq is asserted and interrupts are not masked, a flip-flop is set inside the CPU to record the
presence of an interrupt request and to:

Acknowledge the interrupt by asserting the IntAck signal.

Notify the CPU's next-address logic that an interrupt is pending.

Set the interrupt mask so that no new interrupt is accepted.


After the devices notice the assertion of the IntAck signal, they deassert their request signals, which
leads to the resetting of the interrupt acknowledge flip-flop. This sequence of events is shown in
Figure 24.3. Meanwhile, upon interrupt acknowledgment and before the interrupt mask is set, an
interrupt alert signal (IntAlert) is produced that instructs the CPU to begin executing the interrupt
service routine. This is done by loading the start address of the interrupt handler into the PC and
saving the previous contents of the PC to be used as the return address once the interrupt has been
serviced.

Because there is only one interrupt signal going into the CPU, the first order of business for the
interrupt handler is to determine the cause of the interrupt so that appropriate action can be taken.
One way to do this is for the interrupt handler to poll all I/O devices to see which one requested
attention. The procedure is similar to the I/O polling discussed in section 22.3. Polling can occur, and
service can be provided, in the order of device priorities. An alternative to polling is to supply the
interrupt acknowledge signal only to the highest-priority device and use a daisy chain to pass the
signal to other devices in descending order of priority. This is similar to the daisy-chained passing of
the bus grant signal in Figure 23.10. A device requesting attention can send its identity over the
address or data bus upon receiving the acknowledge signal from the CPU. Any lower-priority device
will not see the interrupt acknowledge signal and will continue to assert its interrupt request signal
until it receives an acknowledgment. Note that polling is more flexible than daisy chaining in that it
allows priorities to be changed easily. The disadvantage of polling is that it wastes a lot of time
interrogating many I/O devices whenever any one of them generates an interrupt.

After the CPU learns the identity of the (highest-priority) device that requested the interrupt, it jumps
to the appropriate segment of the interrupt service routine that handles that particular request type.
This process can be made more efficient by allowing the device to supply the start address of its
desired interrupt handler; in this way, the transition to that handler can occur immediately. One
benefit of this method, known as vectored interrupt, is that a device can provide different addresses
depending on its type of request, thus avoiding polling or additional levels of indirection. In either
case, the interrupt service routine must save, and upon completion restore, any registers that it needs
to use in the course of its execution. This is very similar to what happens in a procedure call, except
that here all registers have the same status with regard to saving and restoring (an ordinary procedure
is allowed to use some registers without saving them; see the MiniMIPS register usage conventions
in Figure 5.2).

The preceding arrangement can be extended to multiple interrupt levels with different priorities by
using a priority encoder (Figure 24.5). Each interrupt request line on the left represents a set of
devices with identical priorities. If one or more interrupt requests are received, the IntReq signal is
asserted to notify the CPU. Additionally, the identity of the highest-priority requesting line is supplied
to the CPU. In this way, the CPU can either uniquely identify the interrupting device (when a request
line represents a single device) or poll a subset of all devices. This is more efficient than having to
poll all devices on every interrupt. If the CPU must be able to enable or disable each interrupt type
separately, the scheme shown in Figure 24.5 must be modified to move the masking capability to the
input side of the priority encoder. Supplying the details of the required modification is left to the
reader.

The single-cycle MicroMIPS implementation considered thus far can simply monitor the interrupt
condition continuously and begin executing the interrupt service routine in the clock cycle following
the assertion of IntAlert. Adding interrupt capability to the multicycle MicroMIPS implementation of
Chapter 14 is straightforward, requiring little beyond identifying the proper control state in which
interrupt requests are to be checked. The details are left as an exercise.

However, adding interrupt capability to the pipelined MicroMIPS designs of Chapters 15 and 16
requires more care. The complications involved are identical to those of exceptions, as discussed in
section 16.6. In the case of a simple linear pipeline, like those in Figures 15.9 and 15.11, the pipeline
can be drained by allowing all instructions currently in its various stages to run to completion; this
may require q − 1 bubbles to be inserted into a q-stage pipeline. If, for example, one of the
instructions in the pipeline encounters a data cache miss, a long pipeline stall will occur while main
memory is accessed. With virtual memory, there is the possibility of a page fault with its even higher
latency; this may not be admissible for interrupts that require a short response time. An alternative
is to annul all partially completed instructions that have not yet affected data memory or register
contents, while taking care that no inconsistencies are introduced as a result. For example, in the
pipelined data path of Figure 15.11, instructions affect the data cache and register contents at
different stages of the pipeline. If an instruction has already had an irreversible effect, that
instruction, and all instructions ahead of it in the pipeline, must be run to completion to avoid
inconsistencies.


4. Nested Interrupts

An obvious idea for handling multiple high-speed I/O devices or time-critical control applications is
to allow interrupt nesting, so that a higher-priority interrupt request can preempt the handler of a
lower-priority interrupt. Handling of the latter would be resumed once the higher-priority interrupt
has been serviced. However, nested interrupts are more difficult to organize than nested procedure
calls; the latter are handled correctly with the help of a stack for storing local variables and saving
return addresses. The difficulties in dealing with nested interrupts arise from their unpredictable
timing. A programmer who writes a procedure that calls another one ensures that the call is not made
until all prior steps have been completed (Figure 6.2). No such guarantee holds for interrupts: a
high-priority interrupt may arrive right after a lower-priority one, perhaps when only a single
instruction of the latter's handler has been executed. It is therefore important that further interrupts
be disabled until the interrupt handler has had a chance to save enough information to allow the
handling of the lower-priority interrupt to be resumed after the higher-priority one has been serviced.

The need for nested interrupts is evident from Example 24.1, as it relates to the CPU response time
guarantees required for certain types of interrupts. If the four-level interrupt scheme shown in Figure
24.5 were used for the control application of Example 24.1 without nesting, an interrupt at any
priority level could face a response time of 100 ms, because servicing of an air conditioner interrupt
might have just begun when the highest-priority interrupt request arrives. This is unacceptable. As
another example, a disk drive that reads at 3 MB/s and transfers data to the CPU in 4 B units (Example
22.7) requires its interrupt requests to be serviced within 0.75 μs. If there were some type of interrupt
whose service takes longer than 0.75 μs, and if no other interrupt could be accepted while it is being
serviced, then giving the disk controller the highest priority would not solve the problem; some form
of preemption capability is required.

Now consider the steps the CPU takes in servicing an interrupt, as listed in section 22.4. Assuming an
unpipelined implementation, the following actions occur on the way to, and during, interrupt servicing.

1. Disable (mask) all interrupts.

2. Call the interrupt service routine; save the PC as in a procedure call.

3. Save the CPU state.

4. Save minimal information about the interrupt on the stack.

5. Enable interrupts (or at least the higher-priority ones).

6. Identify the cause of the interrupt and service the underlying request.

7. Restore the CPU state to what existed before the last interrupt.

8. Return from the interrupt service routine.

Another interrupt may be accepted after step 5. Consequently, the faster the first five steps are
executed, the lower the worst-case latency in accepting a top-priority interrupt. The highest-priority
interrupt need wait, in the worst case, only for these five steps to be executed. A lower-priority
interrupt may additionally have to wait for the worst-case service time of all possible higher-priority
interrupts, including new ones that may arrive while any such interrupt is being serviced.
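
The eight steps can be pictured as the following schematic C skeleton. It targets no particular CPU:
every routine named here is a hypothetical stub, and steps 1 and 2 would in reality be performed by
hardware.

#include <stdio.h>

/* All routine names are hypothetical stubs that merely trace the order. */
static void disable_all_interrupts(void)  { puts("1. mask all interrupts"); }
static void save_cpu_state(void)          { puts("3. save CPU state"); }
static void push_interrupt_info(void)     { puts("4. push minimal info"); }
static void enable_high_priority(void)    { puts("5. unmask high-priority"); }
static void service_cause(void)           { puts("6. identify and service"); }
static void restore_cpu_state(void)       { puts("7. restore CPU state"); }
static void return_from_interrupt(void)   { puts("8. return from interrupt"); }

void interrupt_entry(void) {
    disable_all_interrupts();              /* step 1: done by hardware      */
    puts("2. PC saved, handler entered");  /* step 2: also done by hardware */
    save_cpu_state();                      /* step 3                        */
    push_interrupt_info();                 /* step 4                        */
    enable_high_priority();                /* step 5: nesting now possible  */
    service_cause();                       /* step 6: may itself be
                                              preempted by a higher level   */
    disable_all_interrupts();              /* make the exit sequence atomic */
    restore_cpu_state();                   /* step 7                        */
    return_from_interrupt();               /* step 8                        */
}

int main(void) { interrupt_entry(); return 0; }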


Figure 24.6 shows the relationship of an application program to two nested interrupts.

The first (lower-priority) interrupt is detected after the completion of instruction a) and before
instruction b) begins its execution within the application program. At this point, the PC holds the
address of instruction b); saving this content will allow the hardware to return to the point of
interruption and continue executing the application as if nothing had happened between instructions
a) and b). All interrupts are disabled (masked) immediately, and control is automatically transferred
to an interrupt handler. After the interrupt handler has saved the CPU state and some minimal
information about the interrupt, it enables all higher-priority interrupts before proceeding to handle
the current interrupt. If shortly after this point a second (higher-priority) interrupt is detected
between instructions c) and d), the process is repeated and servicing of the lower-priority interrupt
is set aside in favor of the newcomer.

5. Types of Context Switching

The transfer of control from a running program to an interrupt handler, or from the handler of a
low-priority interrupt to that of a higher-priority interrupt, is known as context switching. The context
of an executing process consists of the CPU state, which includes the contents of all registers, along
with certain elements of memory management information that allow the process to access its
address space. Note that a procedure call is not considered a context switch, because the procedure
usually operates within the same context as the calling program; for example, it can access the
caller's global variables, and parameters can be passed to it through the register file or the stack. In
addition to the mandatory context switches due to interrupts, the operating system may perform
voluntary context switching in the interest of improving overall system performance. This type of
context switching is done within the framework of multiprogramming or multitasking, concepts that
are explained in this section. A related term, multithreading, is explained in section 24.6.

The execution of an instruction sequence within a program can be delayed for several reasons. These
include short stalls for events such as cache misses or TLB misses (Chapters 18 and 20), as well as
much longer waits for page faults and I/O requests. Thus far in this book, such stalls and waits have
been viewed as contributors to reduced performance. This is the correct view for the execution of a
single sequential task. However, just as a person's time is not really wasted if they flip through the
newspaper or read e-mail while on hold on the phone, computer or CPU resources are not wasted if
another program or task is run while one task encounters a long delay.

The use of multiprogramming to overlap the waiting periods of one program with the execution of
other programs began with the time-shared computer systems of the early 1960s. In a time-sharing
system, multiple user programs reside in the computer's main memory, allowing the CPU to switch
between them with relative ease. To ensure fairness and avoid bias, each program can be assigned a
fixed slice of CPU time, with the task set aside in favor of another at the end of its allotted time, even
if it has encountered no waits. The number of simultaneously active tasks in main memory represents
the degree of multiprogramming. It is beneficial to have a high degree of multiprogramming, because
this makes it more likely that the CPU will find useful work even when many tasks encounter long
waits. On the other hand, putting many tasks in main memory means that each gets a smaller memory
allocation, which is detrimental to performance. Multitasking is essentially the same idea as
multiprogramming, except that the active tasks may belong to the same user or may even represent
different parts (subcomputations) of a single program.

Figure 24.7 shows multitasking in humans and computers. As depicted in Figure 24.7b), a CPU can
execute many tasks by turning its attention from one to another through context switching. The CPU
switches back and forth between the different tasks at a fast pace, creating the illusion of parallel
processing, even though at any given time only one task is being executed. To switch context, the
state of the running process is saved in a process control block, and a new process is launched from
its beginning or from a previously saved point. Saving and restoring dozens of registers and other
control information on each context switch takes many microseconds, which is in a sense wasted
time. However, the hundreds or even thousands of clock cycles spent on context switching are still
significantly fewer than the many millions of clock cycles needed to handle a page fault or an I/O
operation.

In the preceding context, a specific invocation of a program or task is often referred to as a process.
Note that two different processes can be different invocations of the same program or task. In this
sense, multiprocessing could be used as a broad term for multiprogramming or multitasking. However,
in the computer architecture literature, multiprocessing has a technical meaning that relates to the
truly concurrent execution of multiple processes, as opposed to the apparent concurrency just
discussed.

Because the context switching overhead is nontrivial, simple context switching is practical only for
covering long waiting periods; it cannot be used to hide the shorter waits due to cache misses and
the like. Besides the tangible overhead of saving and restoring process states, context switching
involves other types of overhead that are much more difficult to quantify. For example, sharing the
instruction and data caches among many processes can lead to conflicts and the attendant cache
misses for each process. Similarly, when the processes are not completely independent of one
another, synchronization (to ensure that data dependencies are satisfied) wastes some time and other
resources. The following example captures the tangible overhead effects in context switching.

Example 24.2: Nested interrupt overhead

Assume that whenever a higher-priority interrupt preempts a lower-priority one, a 20 μs overhead is
incurred to save and restore the CPU state and other pertinent information. Is this overhead acceptable
for the control system application presented in Example 24.1?

Solution: If we ignore the priority level assigned to the impact detector, because it causes no
interrupts during normal system operation, there are three levels of interrupts, listed here with their
repetition rates and associated service times: fuel/ignition (500/s, 1 ms) and dashboard display
(800/s, 0.2 ms) have the highest priority, engine temperature (1/s, 100 ms) is at the middle level, and
the air conditioner (1/s, 100 ms) has the lowest priority. Even assuming that the execution of the
interrupt handlers for engine temperature and air conditioning stretches over an entire second owing
to repeated preemptions, each is preempted no more than 1,300 times, for a total overhead of
2 × 1,300 × 20 μs = 52 ms. Considering this preemption overhead together with the total interrupt
service time of 500 × 1 + 1 × 100 + 800 × 0.2 + 1 × 100 = 860 ms, we see that there is still a safety
margin of 1,000 − 860 − 52 = 88 ms in each second of real time. Thus, the context switching overhead
is acceptable in this application example.
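
The overhead bookkeeping of this example is summarized by the following C fragment, using only
the numbers given above.

#include <stdio.h>

int main(void) {
    double overhead_us  = 20.0;        /* per preemption, as given       */
    double preempt_rate = 500 + 800;   /* fuel/ignition + display, per s */
    /* The two long handlers (engine temp., A/C) can each stretch over a
       full second, so each may be preempted up to 1,300 times:          */
    double preempt_ms = 2 * preempt_rate * overhead_us / 1000.0;  /* 52 ms */
    double service_ms = 500 * 1.0 + 1 * 100.0 + 800 * 0.2 + 1 * 100.0;
    printf("preemption overhead: %.0f ms/s\n", preempt_ms);
    printf("service time       : %.0f ms/s\n", service_ms);
    printf("safety margin      : %.0f ms/s\n", 1000 - service_ms - preempt_ms);
    return 0;
}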

Because of the importance of context switching in efficient interrupt handling and multitasking, some
architectures provide hardware resources to assist with it. For example, suppose a CPU contains four
identical register files, any of which can be chosen as the active register file by setting a two-bit
control tag. Context switching among up to four active processes could then be performed without
saving or restoring any register contents, simply by adjusting the two-bit control tag. This is a good
example of low-overhead context switching. A less drastic alternative is to provide special machine
instructions that save or restore the entire register file in one operation. Such instructions are likely
to be faster than a sequence of instructions that save or restore the registers one at a time.


6. Threads and Multithreading

As shown in Figure 24.8, threads are instruction streams (small segments of programs) within the
same task or program that can be executed concurrently with, and for the most part independently
of, other threads. As such, they provide an ideal source of independent instructions for filling the
CPU's execution pipeline and making good use of multiple function units, such as those shown in
Figure 16.8. Threads are sometimes called lightweight processes because they share a great deal of
context, so that switching between them does not involve as much overhead as switching between
conventional (heavyweight) processes.

Architectures that can handle many threads are known as superthreaded, where "super" characterizes
the thread width, much as superpipelining is named for the depth of the pipeline. The term
hyperthreading is also used for a more flexible (simultaneous) form of multithreading, but the subtle
differences between the two schemes will not be discussed here. Multithreading effectively divides
the resources of a single processor into two or more logical processors, with each logical processor
executing instructions from a single thread. Figure 24.9 shows how instructions from three different
threads can be intermixed at various stages of the execution pipeline in a dual-issue processor.


Note that even with multithreading, some pipeline bubbles may appear. However, the number of
bubbles is likely to be much smaller when a large number of threads are available for execution.
Structuring a program as multiple threads largely frees the CPU from the burden of discovering
instruction-level parallelism, allowing multiple instructions to be issued from different parts of a
program rather than relying on nearby instructions in the program's control flow.
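
For concreteness, here is a minimal POSIX-threads sketch in C of two threads of one process sharing
the same address space; this is ordinary pthreads usage, not a construct from the text, and it
illustrates both the shared data and the synchronization overhead mentioned in section 24.5.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                  /* shared: one address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);        /* synchronization between threads */
        counter++;                        /* shared data: no copying needed  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* 200000 */
    return 0;
}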

PROBLEMS
1) Using interval timers

A computer has an interval timer that can be set to a desired length of time (e.g., 0.1 s), at the end
of which an interrupt is generated.

a) Describe how you would use such a timer to build an analog clock, using a stepper motor for
output. The motor needs to drive only the second hand of the clock; the minute and hour hands
move appropriately as a result.

b) What are the sources of imprecision in the time shown by the clock of part (a)?

2) Notions of interrupt handling

Consider the interruptions during a meal due to a phone call and an e-mail alert, as shown in Figure
24.1.

a) Write (in plain language) an interrupt handling routine for these events. Pay special attention to
the possibility of nested interruptions. State your assumptions.

b) Add two more levels of interruption to the two already present in the figure. Describe the new
interruptions in detail and explain why they are of higher priority than the existing ones.

c) Briefly discuss how these interruptions and their handling differ from those in a computer.

3) Procedure calls versus interrupts

Explain the differences between a procedure call and the transfer of control to an interrupt handler.
For example, why is it acceptable in the case of a procedure call to change the contents of the PC
while the instructions ahead of the call in the pipeline are allowed to run to normal completion,
whereas for an interrupt the pipeline must be drained or partially annulled?

4) Multiple interrupt request lines


a) Show how, in Figure 24.5, each type of interrupt can be masked separately.
b) Add interrupt acknowledgment logic to Figure 24.5. Ensure that, when the highest-priority
interrupt is acknowledged, the continued assertion of a lower-priority interrupt request signal
does not lead to a false acknowledgment.


5) Interrupts in multicycle MicroMIPS

The addition of interrupt capability to the single-cycle MicroMIPS implementation of Chapter 13 was
discussed in section 24.3. Discuss how the same capability can be added to the multicycle MicroMIPS
of Chapter 14. Hint: think in terms of the control state machine of Figure 14.4.

6) Nested interrupts

Three devices D1, D2, and D3 are connected to a bus and use interrupt-based I/O. Discuss each of
the following scenarios in two ways: first assuming the use of a single interrupt request line, and then
assuming the use of two interrupt request lines with high and low priorities. In each case, specify
when and how interrupts are enabled or disabled.

a) Interrupts are not to be nested.

b) Interrupts from D1 and D2 are not nested relative to each other, but interrupts from D3 must be
accepted at any time.

c) Three nesting levels are allowed, with D1 at the low end and D3 at the high end of the priority
spectrum.

7) Multitasking concepts

Humans perform a lot of multitasking in daily life. An example is shown in Figure 24.7a.

a) What characteristics of tasks make them amenable to human multitasking?

b) What types of tasks are not suitable for multitasking by humans?

c) Relate your answers to parts a) and b) to multitasking on a computer.

d) Construct a realistic scenario in which a human performs five or more tasks concurrently.

e) Is there a limit to the maximum number of concurrent tasks that can be performed by a human?

f) Relate your answers to parts d) and e) to multitasking on a computer.

8) Time slice granularity in multiprogramming

Consider the following simplifying assumptions for a multiprogramming system. Task execution times
are uniformly distributed in the range [0, Tmax]. When a time slice of duration τ is allocated to a task,
it is used fully and without waste (this implies that computation and I/O are overlapped), except for
the last time slice, during which the task is completed; on average, half of this last slice is wasted.
The context switching time overhead is a constant c, and there is no other overhead.

a) Determine the optimal value of τ that minimizes waste, assuming that Tmax is much larger than τ.

b) How does the answer to part a) change if the task execution times are uniformly distributed in
[Tmin, Tmax]? State your assumptions.
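
One plausible way to explore part (a) numerically, without asserting it as the intended solution: if a
task of length T consumes about T/τ slices at a cost of c each, plus half a slice wasted at the end,
then the average waste per task under the uniform assumption is roughly c(Tmax/2)/τ + τ/2. The C
sketch below scans this candidate model for its minimum; the values of c and Tmax are invented.

#include <stdio.h>
#include <math.h>

int main(void) {
    double c = 0.001, tmax = 10.0;            /* assumed values (seconds) */
    double best_tau = 0.0, best_waste = 1e30;
    for (double tau = 0.001; tau <= 1.0; tau += 0.001) {
        double waste = c * (tmax / 2) / tau + tau / 2;   /* per-task waste */
        if (waste < best_waste) { best_waste = waste; best_tau = tau; }
    }
    printf("numeric minimum near tau = %.3f s\n", best_tau);     /* ~0.100 */
    printf("compare sqrt(c * Tmax)   = %.3f s\n", sqrt(c * tmax));
    return 0;
}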


9) Multiple levels of interrupt priority

Following the lead of Example 24.1, define a real-time system with multiple interrupt priority levels
in which the response time requirements can be met without the need for nested interrupts. Then
characterize such systems in general.

10) Hardware aids for context switching

a) Which of the "complex" instructions mentioned in Table 8.1 facilitates the implementation of
nested interrupts, and why?

b) Consider an instruction set architecture of your choice and list all of its machine instructions that
have been provided to make context switching more efficient.

11) Context switch overhead

In Example 24.2, what is the maximum preemption overhead that would be acceptable?

12) Polling frequency control

Example 22.4 stated that a keyboard needs to be polled at least ten times per second to ensure that
no user keystroke is lost. Based on what you have learned in this chapter, how can you ensure that
polling is done often enough, but not so often as to waste time?

13) Interrupt mechanisms on real processors

For a microprocessor of your choice, describe the interrupt handling mechanism and its impact on
instruction execution. Pay special attention to the following aspects of interrupts:

a) Identification and recognition.

b) Enabling and disabling.

c) Priority and nesting.

d) Special instructions.

e) Hardware versus software.

14) Interrupt priority levels

There are many similarities between the handling of multiple interrupt priority levels and bus
arbitration, as discussed in section 23.4. For each of the following items in interrupt handling or bus
arbitration, indicate whether or not there is a counterpart in the other domain. Justify your answer
in each case.

a) Rotating priority.

b) Daisy chaining.


c) Distributed arbitration.

d) Interrupt masking.

e) Interrupt nesting.

15) Software interrupts

The MiniMIPS syscall instruction (section 7.6) allows a running program to interrupt itself and pass
control to the operating system. The ARM microprocessor has a similar SWI (software interrupt)
instruction, in which the requested service is specified in the eight low-order bits of the instruction
itself rather than through the contents of a register, as in MiniMIPS. Each of the services provided by
the operating system has an associated routine in ARM, and the start addresses of these routines are
stored in a table.

a) Describe the ARM interrupt mechanism and indicate how SWI fits into it.

b) Provide an overview of the ARM instruction set, with a focus on any unusual or unique features.

c) Identify the ARM instructions that the operating system can use to perform the task of transferring
control to the appropriate system routine.

16) Multi-level interrupt circuit

In this problem, you are to design a multilevel priority interrupt circuit external to the processor. The
processor has a single interrupt request line and a single interrupt acknowledge line. The circuit to
be designed has an internal three-bit register containing the current priority level: a number from 0
to 7. Eight interrupt request lines R0 - R7 enter the circuit, with R0 having the highest priority. When
a request of priority higher than the stored three-bit level is asserted, an interrupt request signal is
sent to the processor. When the interrupt is acknowledged by the processor, the interrupt level is
immediately updated and a corresponding acknowledge signal is sent out (the acknowledge signal Ai
corresponds to the request signal Ri).

a) Present the design of the external priority interrupt circuit as described.

b) Explain how your circuit works, and pay special attention to how the three-bit level register is
updated to contain a smaller or larger value.
