Figure 9.2: Architecture of CRAY-1

Self Assessment Questions


3. ______________ is a modern shared-memory multiprocessor version of
the CDC Cyber 205.
4. The memory-memory vector processors can prove to be much more efficient
in case the vectors are sufficiently long. (True/False)
5. The scalar registers are linked to the functional units with the help of a
pair of ___________________ .
6. ___________ correspond to vector load or vector store.
7. Functional hazards are the conflicts in register accesses. (True/False)

Activity 1:
Find out more about a recent vector thread processor which comes in two
parts: the control processor, known as Rocket, and the vector unit, known as
Hwacha.

9.4 Vector Length and Stride Issues


This section will discuss two issues that occur in real programs. The first is the
case when the vector length in a program is not precisely 64. The second is
the way non-adjacent elements of vectors that reside in memory are dealt with.
First, let us study the issue of vector length.
9.4.1 Vector length
In our study till now, we have not said anything about the real vector size; we
simply assumed that the size of the vector register is the same as the size of
the vector at hand. But this may not always turn out to be true. In particular,
we have two cases on our hands:
• one in which the vector size is less than the vector register size, and
• a second in which the vector size is larger than the vector register size.
To be more concrete, we assume 64-element vector registers, as offered by
the Cray systems. Let us begin with the easier of the two problems.
Handling smaller vectors: In case the vector size is less than 64, we have to
let the system know that it should not operate on all 64 elements in the vector
registers. This can be done simply by using the vector length register. The
Vector Length (VL) register holds the current vector length; all vector
operations are performed on the first VL elements (in other words, elements
in the range 0 to VL - 1). The following two instructions are provided to load
values into the VL register:
VL 1 (VL = 1)
VL Ak (VL = Ak, where k ≠ 0)
For instance, if the vector length is equal to 40, the code given below can be
used to add the two vectors in registers V3 and V4:
A1 40 (A1 = 40)
VL A1 (VL = 40)
V2 V3+FV4 (V2 = V3 + V4)

As we cannot write
VL 40,
we must use the two-instruction sequence shown above to load 40 into the VL
register. The last instruction specifies floating-point addition of vectors V3 and
V4. As VL is 40, only the first 40 elements are added. Table 9.1 below depicts
a sample of Cray X-MP instructions.
Table 9.1: Sample Cray X-MP Instructions
Vi Vj+Vk (Vi = Vj + Vk, integer add): Add corresponding elements (in the
range 0 to VL-1) from the Vj and Vk vectors and place the result in vector Vi.
Vi Sj+Vk (Vi = Sj + Vk, integer add): Add the scalar Sj to each element (in
the range 0 to VL-1) of the Vk vector and place the result in vector Vi.
Vi Vj+FVk (Vi = Vj + Vk, floating-point add): Add corresponding elements (in
the range 0 to VL-1) from the Vj and Vk vectors and place the floating-point
result in vector Vi.
Vi Sj+FVk (Vi = Sj + Vk, floating-point add): Add the scalar Sj to each element
(in the range 0 to VL-1) of the Vk vector and place the floating-point result in
vector Vi.
Vi ,A0,Ak (vector load with stride Ak): Load into elements 0 to VL-1 of vector
register Vi from memory, starting at address A0 and incrementing addresses
by Ak.
Vi ,A0,1 (vector load with stride 1): Load into elements 0 to VL-1 of vector
register Vi from memory, starting at address A0 and incrementing addresses
by 1.
,A0,Ak Vi (vector store with stride Ak): Store elements 0 to VL-1 of vector
register Vi in memory, starting at address A0 and incrementing addresses
by Ak.
,A0,1 Vi (vector store with stride 1): Store elements 0 to VL-1 of vector register
Vi in memory, starting at address A0 and incrementing addresses by 1.
Vi Vj&Vk (Vi = Vj & Vk, logical AND): Perform a bitwise-AND operation on
corresponding elements (in the range 0 to VL-1) from the Vj and Vk vectors
and place the result in vector Vi.
Vi Sj&Vk (Vi = Sj & Vk, logical AND): Perform a bitwise-AND operation on
elements 0 to VL-1 of the Vk vector and the scalar Sj and place the result in
vector Vi.
Vi Vj>Ak (Vi = Vj >> Ak, right-shift by Ak): Right-shift elements 0 to VL-1 of
Vj by Ak and place the result in vector Vi.
Vi Vj<Ak (Vi = Vj << Ak, left-shift by Ak): Left-shift elements 0 to VL-1 of Vj
by Ak and place the result in vector Vi.

Handling larger vectors: Smaller vector sizes can be handled by the VL
register, but this does not apply to vectors of larger sizes. For instance, if we
have 200-element vectors (i.e., N = 200), how can the vector instructions be
used to add two such vectors? The case of larger vectors is handled by a
method called strip mining.
In strip mining, the vector is partitioned into strips of 64 elements. This leaves
a single odd-size piece that may be smaller than 64 elements; its size is given
by N mod 64. Each strip is then loaded into a vector register, and the vector
addition instruction is executed on it. When N is not a multiple of 64, the
number of strips is (N/64) + 1, using integer division. In our case, the 200
elements are partitioned into four pieces:
• Three pieces contain 64 elements each.
• One odd piece contains 8 elements.

A loop is then used which iterates four times: VL is set to 8 in one of the
iterations, and the other three iterations set the VL register to 64.
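To make the idea concrete, here is a minimal C sketch of strip mining for the
addition C = A + B. It is a sketch only: set_vl() and vector_add() are
hypothetical stand-ins for the hardware VL register and a vector add
instruction, not real Cray operations.

#include <stddef.h>

#define MVL 64                        /* maximum vector length (register size) */

static size_t vl;                     /* models the VL register */
static void set_vl(size_t n) { vl = n; }

/* Models one vector add instruction: operates on the first VL elements. */
static void vector_add(double *c, const double *a, const double *b)
{
    for (size_t i = 0; i < vl; i++)
        c[i] = a[i] + b[i];
}

/* Strip-mined addition of two N-element vectors: the odd-size piece
 * (N mod 64 elements) is processed first, then full 64-element strips. */
void strip_mined_add(double *c, const double *a, const double *b, size_t n)
{
    size_t start = 0;
    size_t len = n % MVL;             /* size of the odd piece (may be 0) */
    while (start < n) {
        if (len == 0)
            len = MVL;                /* all remaining strips are full */
        set_vl(len);
        vector_add(c + start, a + start, b + start);
        start += len;
        len = 0;
    }
}

For N = 200 this performs four vector operations, with VL set to 8 once and to
64 three times, exactly as described above.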
9.4.2 Vector stride
To understand vector stride, we have to know how elements are stored in
memory. Let us first look at vectors. Because vectors are one-dimensional
arrays, storing a vector in memory is quite easy: vector elements are stored
as sequential words in memory. If we wish to fetch 40 elements, 40 contiguous
words have to be read from memory. Such elements are said to have a stride
of 1, i.e., to reach the next element, we add 1 to the current index. It is
important to note that the distance between consecutive elements is
measured in number of elements, not in bytes.
We will require non-unit vector strides for multidimensional arrays. To see
why, consider two-dimensional matrices. If we want to store a two-dimensional
matrix in memory, we must linearise it. We can do this in one of two ways:
column-major or row-major order. The majority of languages, with the
exception of FORTRAN, use the row-major order. In this ordering, elements
are stored row by row: row 0, row 1, row 2, and so on. In the column-major
order, which is used by FORTRAN, elements are stored column by column:
column 0, column 1, and so on. For instance, consider the 4 x 4 matrix below:

Figure 9.3: Memory Layout of Vector A.


Such a matrix is stored in memory as depicted in figure 9.3. Assuming
row-major order of storage, suppose we want to access all the elements of
column 0. Clearly, these elements are not stored next to each other: we must
access elements 0, 4, 8, and 12 of the memory array.
Since successive elements of the column are separated by 4 elements, we
say that the stride is 4. Vector machines provide load and store instructions
that take the stride into account. As Table 9.1 shows, the Cray X-MP supports
both unit and non-unit stride access. For instance, the instruction
Vi ,A0,Ak
loads vector register Vi with stride Ak. As unit stride is very common, a special
instruction
Vi ,A0,1
is provided. Similar instructions exist for storing vectors in memory.
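As an illustration in C rather than Cray assembly (the function and names are
ours, for exposition), the following loop gathers column j of a row-major n x n
matrix, stepping through memory with a stride of n elements — which is
precisely what a strided vector load does in hardware:

/* Gather column j of a row-major n-by-n matrix into a dense vector.
 * Consecutive column elements are n words apart, so the stride is n. */
void load_column(double *dst, const double *matrix, int n, int j)
{
    for (int i = 0; i < n; i++)
        dst[i] = matrix[i * n + j];   /* addresses advance by n elements */
}

For the 4 x 4 example above (n = 4, j = 0), this touches elements 0, 4, 8 and
12, matching the stride-4 access pattern just described.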
Self Assessment Questions
8. The instance of larger vectors is dealt with by a method called

9. Vector elements are saved in the form of ____________ in memory.

9.5 Compiler Effectiveness in Vector Processors


Two factors determine whether a program can be run successfully in vector
mode. The first is the structure of the program itself: do the loops contain true
data dependences, or can they be restructured so that they have no such
dependences? This factor is influenced by the algorithms chosen and, to
some degree, by the manner in which they are coded. The second factor is
the capability of the compiler. No compiler can vectorise a loop that contains
no parallelism among its iterations, but there is enormous variation in the
ability of compilers to determine whether a loop can be vectorised.
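As a simple illustration (our own example, not one of the benchmarks
discussed below), the first loop here has no dependence between iterations
and can be vectorised, while the second carries a true data dependence from
one iteration to the next and cannot:

/* Vectorisable: every iteration is independent of the others. */
for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];

/* Not vectorisable: iteration i needs the result of iteration i-1
 * (a true, loop-carried data dependence). */
for (i = 1; i < n; i++)
    x[i] = x[i] + x[i - 1];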
The techniques used for vectorising programs are the same as those for
revealing ILP; here we simply review how well such techniques perform in
practice. As an indication of the vectorisation level that can be achieved in
scientific programs, let us look at the vectorisation levels observed for the
Perfect Club benchmarks. These benchmarks are large, real scientific
applications. Figure 9.4 shows the percentage of operations executed in
vector mode for two versions of the code running on the Cray Y-MP.

Benchmark    Operations executed in     Operations executed in     Speedup from
name         vector mode,               vector mode,               hand
             compiler-optimized         hand-optimized             optimization
BDNA         96.1%                      97.2%                      1.52
MG3D         95.1%                      94.5%                      1.00
FLO52        91.5%                      88.7%                      N/A
ARC3D        91.1%                      92.0%                      1.01
SPEC77       90.3%                      90.4%                      1.07
MDG          87.7%                      94.2%                      1.49
TRFD         69.8%                      73.7%                      1.67
DYFESM       68.8%                      65.6%                      N/A
ADM          42.9%                      59.6%                      3.60
OCEAN        42.8%                      91.2%                      3.92
TRACK        14.4%                      54.6%                      2.52
SPICE        11.5%                      79.9%                      4.06
QCD          4.2%                       75.1%                      2.15

Figure 9.4: Level of Vectorisation among the Perfect Club Benchmarks when
executed on the Cray Y-MP


The first version is the one obtained with compiler optimisation alone on the
original code, whereas the second version was extensively hand-optimised by
a team of Cray Research programmers. The wide variation in compiler
vectorisation level has been observed in several studies of the behaviour of
applications on vector processors. The hand-optimised versions generally
show significant gains in vectorisation level for codes that the compiler could
not vectorise well by itself, so that all codes ended up above 50%
vectorisation. Interestingly, in some cases the faster code created by the Cray
programmers had a lower vectorisation level: the vectorisation level by itself
is not sufficient to determine performance.
Alternative vectorisation techniques might execute fewer instructions, keep
more values in vector registers, or permit greater chaining and overlap among
vector operations, and thereby enhance performance even if the vectorisation
level stays the same or decreases.
For instance, BDNA has approximately the same vectorisation level in the two
versions, yet the hand-optimised code is over 50% faster. There is also great
variation in how well different compilers vectorise programs. To summarise
the state of vectorising compilers, consider the data in figure 9.5, which
depicts the degree of vectorisation for various processors using a test suite of
100 handwritten FORTRAN kernels.

Processor           Compiler                Completely    Partially     Not
                                            vectorized    vectorized    vectorized
CDC CYBER 205       VAST-2 V2.21            62            5             33
Convex C-series     FC5.0                   69            5             26
Cray X-MP           CFT77 V3.0              69            3             28
Cray X-MP           CFT V1.15               50            1             49
Cray-2              CFT2 V3.1a              27            1             72
ETA-10              FTN 77 V1.0             62            7             31
Hitachi S810/820    FORT77/HAP V20-2B       67            4             29
IBM 3090/VF         VS FORTRAN V2.4         52            4             44
NEC SX/2            FORTRAN77/SX V.040      66            5             29

Figure 9.5: Result of Applying Vectorising Compilers to the 100 FORTRAN Test
Kernels
The kernels were designed to test vectorisation capability, and all of them can
be vectorised by hand.
Self Assessment Questions
10. List two factors which enable a program to run successfully in vector
mode.
11. There does not exist any variation in the capability of compilers to decide
if a loop can be vectorised. (True/False)

Activity 2:
Visit your local computer vendor and get an expert opinion about vector
processors and their working.

9.6 Summary
There are several representative application areas where vector processing is
of the utmost importance. Depending upon the way the operands are fetched,
vector processors can be divided into two groups:
• Memory-memory architecture: operands are streamed directly from
memory to the functional units, and results are written back to memory as
the vector operation proceeds.
• Vector-register architecture: operands are read into vector registers, from
which they are fed to the functional units; the results of operations are
written to vector registers.
• Vector register architectures have several advantages over vector
memory-memory architectures.
• The vector unit of a register-register vector machine has several major
components.
• The various types of vector instructions for a register-register vector
processor are:
■ Vector-scalar Instructions
■ Vector-vector Instructions
■ Vector-memory Instructions
■ Gather and Scatter Instructions
■ Masking Instructions
■ Vector Reduction Instructions
• CRAY-1 is one of the oldest processors that implemented vector
processing.
• Two issues arise in real programs: (i) the vector length in a program is not
exactly 64; (ii) non-adjacent elements of vectors residing in memory must
be accessed.
• The structure of the program and the capability of the compiler are the two
factors that affect the success with which a program can be run in vector
mode.

9.7 Glossary
• ASC: Advanced Scientific Computer
• Data hazards: the conflicts in register accesses

• ETA-10: a later shared-memory multiprocessor version of the CDC Cyber
205.
• Functional hazards: the conflicts in functional units.
• Gather: an operation that fetches the non-zero elements of a sparse vector
from memory.
• Masking instructions: These instructions use a mask vector to expand or
compress a vector
• Scatter: an operation that stores a vector into a sparse vector in memory.
• SECDED: single-error correction, double-error detection.
• Small-scale integration: packs 10 to 20 transistors in a single chip.
• Strip mining: the vector is partitioned into strips of 64 elements.
• Vector reduction instructions: These instructions accept one or two
vectors as input and produce a scalar as output.

9.8 Terminal Questions


1. Explain the importance of Vector Processors.
2. What are the different types of Vector Processing?
3. How is vector-register architecture more advantageous than memory-
memory vector architecture?
4. Write short notes on:
a) CDC Cyber 200 model 205 computer overview
b) CRAY-1
c) Vector Length
d) Vector Stride
5. List the various functional units of Vector Processor and explain each one
in brief.
6. Explain the various types of vector instructions in detail.
7. How effective is the compiler in vector processors?
9.9 Answers
Self Assessment Questions
1. Vector processors
2. Data parallelism
3. ETA-10
4. True
5. Crossbars
6. Vector-memory instructions
7. False

8. Strip mining
9. Sequential words
10. Structure of the program & capability of the compiler
11. False

Terminal Questions
1. There are various application areas of vector processors which are of
considerable importance. Refer Section 9.2.
2. Depending upon the way the operands are fetched, vector processors can
be segregated into two groups: Memory-memory vector architecture and
Vector-register architecture. Refer Section 9.3.
3. Due to the capability to overlap memory accesses, as well as the possible
reuse of vector registers, vector-register processors are normally more
efficient than memory-memory vector processors. Refer Section 9.3.
4. a. The CDC Cyber 205 is based on the concepts initiated for the CDC Star
100; the first commercial model was produced in 1981. Refer Section
9.3.
b. CRAY-1 is one of the oldest processors that implemented vector
processing. Refer Section 9.3.
c. The vector size may be less than the vector register size, or it may be
larger than the vector register size. Refer Section 9.4.
d. As vectors are one-dimensional arrays, saving a vector in memory is
straightforward: vector elements are stored as sequential words in
memory. Refer Section 9.4.
5. The major components of the vector unit of a register-register vector
machine are Vector Registers, Vector Functional Units, Scalar Registers,
etc. Refer Section 9.3.
6. The various types of vector instructions for a register-register vector
processor are (Refer Section 9.3):
a. Vector-scalar Instructions
b. Vector-vector Instructions
c. Vector-memory Instructions
d. Gather and Scatter Instructions
e. Masking Instructions
f. Vector Reduction Instructions
7. As an indication of the vectorisation level that can be achieved in scientific
programs, we can observe the vectorisation levels noted for the Perfect
Club benchmarks. Refer Section 9.5.

References:
• Hwang, K. (1993). Advanced Computer Architecture. McGraw-Hill.
• Godse, D. A. & Godse, A. P. (2010). Computer Organisation. Technical
Publications.
• Hennessy, John L., Patterson, David A. & Goldberg David (2011).
Computer Architecture: A Quantitative Approach, Morgan Kaufmann; 5th
edition.
• Sima, Dezso, Fountain, Terry J. & Kacsuk, Peter (1997). Advanced
Computer Architectures: A Design Space Approach. Addison-Wesley-
Longman.
E-references:
• https://csel.cs.colorado.edu/~csci4576/VectorArch/VectorArch.html
• http://www.cs.clemson.edu/~mark/464/appG.pdf
Unit 10 SIMD Architecture
Structure:
10.1 Introduction
Objectives
10.2 Parallel Processing: An Introduction
10.3 Classification of Parallel Processing
10.4 Fine-Grained SIMD Architecture
An example: the massively parallel processor
Programming and applications
10.5 Coarse-Grained SIMD Architecture
An example: the CM5
Programming and applications
10.6 Summary
10.7 Glossary
10.8 Terminal Questions
10.9 Answers

10.1 Introduction

In the previous unit, you studied the use and effectiveness of vector
processors. You also studied vector register architecture and vector length
and stride issues, and learnt about compiler effectiveness in vector
processors. In this unit, we will throw light on data parallel architecture and the
SIMD design space. We will also study the types of SIMD architecture.
Instruction execution in conventional computers is sequential, so there is a
time constraint involved: while one program is being executed, other tasks
have to wait until the first one finishes. In parallel processing, execution
proceeds in parallel: the program can be divided into segments so that, at the
same time, one segment is being processed while another is fetched from
memory and other segments (provided they are already processed) are
printed. The purpose of parallel processing is to bring down the execution
time, and hence to speed up data processing.
Parallel processing can be established by dividing the data among different
units, each unit processing its data simultaneously, with the timing and
sequencing governed by the control unit so as to obtain the result in the
minimum amount of time.
Objectives:
After studying this unit, you should be able to:
• discuss the concept of data parallel architecture
• describe the SIMD design space
• identify the types of SIMD architecture
• recognise fine grained SIMD architecture
• explain coarse grained SIMD architecture

10.2 Parallel Processing: An Introduction


Parallel processing is a basic part of our everyday life. The concept of parallel
processing is so natural in our life that we use it without even realising. When
we face some crisis, we take help from others and involve them to solve it
more easily. This cooperation of using two or more helpers to ease the
solution of some problem may be termed parallel processing. The aim of
parallel processing is therefore to solve a particular problem more rapidly, or
to enable the solution of a problem that would otherwise not be solvable by
one person. The principles of parallel processing are, however, not recent;
there is evidence that computational devices used over 2000 years ago
employed it.

However, the early computer developers rapidly identified two obstacles
restricting the widespread acceptance of parallel machines: the complexity of
construction and the seemingly high programming effort required. As a result
of these early setbacks, the developmental thrust shifted to computers with a
single computing unit. Additionally, the availability of sequential machines
resulted in the development of algorithms and techniques optimised for these
particular architectures. The evolution of serial computers may finally be
reaching its peak due to the limitations imposed on the design by its physical
implementation and inherent bottlenecks. As consumers and end-users
demand better performance, computer designers have started considering
parallel approaches to get the better of these limitations.
All contemporary computer architectures include some degree of parallelism
in their designs. Better hardware design and assembly, together with an
increasing understanding of how to deal with the difficulties of parallel
programming, has confirmed that parallel processing is at the front line of
computer technology. Customarily, software has been developed to be
executed on a single computer machine having only one CPU (Central
Processing Unit) for serial computing. In this approach, a single problem is
divided into smaller, easier-to-handle instructions. Instructions are performed
in series, one after another, and only one instruction is executed at a given
time, as shown in figure 10.1.

Figure 10.1: An Example of Serial Computing

In parallel computing, simultaneous use of multiple compute resources is
made to work out a computational problem, possibly using multiple CPUs. A
problem is broken into discrete parts that can be solved concurrently. Each
part is further broken down into a series of instructions, and instructions from
each part execute simultaneously on different CPUs, as shown in figure 10.2.

Thus, we can say that a computer system is a Parallel Processing System, or
Parallel Computer, if it provides facilities for simultaneous processing of
various sets of data or simultaneous execution of multiple instructions.
On a computer with more than one processor, each of several processes can
be assigned to its own processor, allowing the processes to progress
simultaneously. If only one processor is available, the effect of parallel
processing can be simulated by having the processor run each process in turn
for a short time.
Parallel processing in a multiprocessor computer is said to be true parallel
processing, and parallel processing in a uni-processor computer is said to be
simulated or virtual parallel processing. The difference between true and
virtual parallel processing can be easily understood from figure 10.3.

Figure 10.3: (a) Serial Processing (b) True Parallel Processing with Multiple
Processors (c) Parallel Processing Simulated by Switching one Processor
among Three Processes

Figure 10.3 (a) represents serial processing: the next process starts only when
the previous process has completed. In figure 10.3 (b), all three processes
run in one clock cycle on three processors. In figure 10.3 (c), all three
processes are also running, but each process gets only 1/3 of the actual clock
cycle, and the CPU switches from one process to another within its clock
cycle. While one process is running, all other processes must wait for their
turn. So, looking at figure 10.3 (c), at any one clock time only one process is
running and the others are waiting, whereas in figure 10.3 (b) all three
processes run at the same clock time. Hence, in a uni-processor system, the
parallel processing shown in figure 10.3 (c) is called virtual parallel
processing.

Self Assessment Questions


1. A problem is broken into a discrete series of _____________ .
2. ______________ provides facilities for simultaneous processing of
various set of data or simultaneous execution of multiple instructions.
3. Parallel processing in a multiprocessor computer is said to be true parallel
processing. (True/False)
4. Parallel processing in a uni-processor computer is said to be
______________ parallel processing.

10.3 Classification of Parallel Processing


The core elements of parallel processing are CPUs. The essential computing
process is the execution of a sequence of instructions on a set of data. The
term stream is used here to denote a sequence of items as executed by a
single processor or multiprocessor. Based on the number of instruction and
data streams that can be processed simultaneously, Flynn classified
computer systems into four categories. The matrix in figure 10.4 defines the
four possible classifications according to Flynn.

SISD                                   SIMD
Single Instruction, Single Data        Single Instruction, Multiple Data
MISD                                   MIMD
Multiple Instruction, Single Data      Multiple Instruction, Multiple Data
Figure 10.4: Flynn’s Classification of Computer System

In this chapter, our main focus will be Single Instruction Multiple Data (SIMD).
Single Instruction Multiple Data (SIMD)
The term single instruction implies that all processing units execute the same
instruction at any given clock cycle. On the other hand, the term multiple data
implies that each and every processing unit could work on a different data
element. Generally, this type of machine has one instruction dispatcher, a very
large array of very small-capacity instruction units and a very high-bandwidth
network. This type is suitable for specialised problems characterised by a high
degree of regularity, for example, image processing. Figure 10.5 shows a
case of SIMD processing.

Today, modern microprocessors can execute the same instruction on multiple
data; this is called Single Instruction Multiple Data (SIMD). SIMD instructions
handle floating-point real numbers and also provide important speedups in
algorithms. As the execution units for SIMD instructions typically belong to a
physical core, as many SIMD instructions can run in parallel as there are
physical cores. As mentioned, using these vector-processing capabilities in
parallel can give significant speedups in certain specific algorithms.
Adding SIMD instructions and hardware to a multi-core CPU is a rather more
drastic step than adding floating-point capability. Since its inception, the
microprocessor has been a SISD device. SIMD is also referred to as vector
processing because its fundamental unit of organisation is the vector. This is
shown in figure 10.6:

Figure 10.6: Scalars and Vectors


A normal CPU operates on scalars, one at a time. A superscalar CPU
operates on multiple scalars at a given moment, but it performs a different
operation on each of them. A vector processor, by contrast, lines up an entire
row of scalars of the same type and operates on them as a single unit. Figure
10.7 shows the difference between SISD and SIMD.

Figure 10.7: SISD vs. SIMD


Modern, superscalar SISD machines exploit the property ‘instruction-level
parallelism’ of the instruction stream: multiple instructions can be executed at
a single instance on the same data stream. A SIMD machine exploits a
property of the data stream called ‘data parallelism’: you get data parallelism
when you have a large mass of uniform data that requires the same
instruction performed on it. A SIMD machine is therefore a completely
different class of machine from the normal microprocessor.
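As a concrete illustration, here is a minimal C sketch using the x86 SSE
intrinsics, one widely available SIMD instruction set (the choice of SSE is
ours, for exposition). A single _mm_add_ps instruction adds four pairs of
floats at once, where a scalar loop would need four separate additions:

#include <xmmintrin.h>                /* SSE: 128-bit registers, 4 floats each */

void simd_add4(float *c, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);      /* load 4 floats from a */
    __m128 vb = _mm_loadu_ps(b);      /* load 4 floats from b */
    __m128 vc = _mm_add_ps(va, vb);   /* one instruction, 4 additions */
    _mm_storeu_ps(c, vc);             /* store the 4 results into c */
}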
Self Assessment Questions
5. SIMD stands for _______________ .
6. Flynn classified computing architectures into SISD, MISD, SIMD and

7. SIMD is known as ________________ because its basic unit of


organisation is the vector.
8. Superscalar SISD machines use one property of the instruction stream
by the name of __________ .

Activity 1:
Explore the components of a parallel architecture that are used by an
organisation. Also, find out the type of memory used in that architecture.
10.4 Fine-Grained SIMD Architecture
The Steven Unger design scheme is the initial base for the Fine-grained SIMD
architectures. These are generally designed for low-level image processing

applications. The following are the features of fine-grained architecture:


• Complexity is minimal and the degree of autonomy is lowest feasible in
each Processing Element (PE).
• Economic constraints are applicable on the maximum number of PEs
provided.
• The programming model assumes equivalence between the number of
PEs and the number of data items, and hides any mismatch as far as
possible.
• The 4-connected nearest neighbour mesh is used as the basic
interconnection method.
• A simple extension of a sequential language with parallel-data additions is
the usual programming language.
Although, in practice, no system embodies this concept absolutely, certain
systems come close to it. They include CLIP4, the DAP, the MPP (all
first-generation systems), and the CM1 and the MasPar1 amongst later
embodiments. Other categories deviate somewhat from the classical model.
They are explained as follows:
• Processing element complexity is increased, either so as to operate on
multi-bit numbers directly or by the addition of dedicated arithmetic units.
• Enhanced connectivity arrangements are superimposed over the standard
mesh. Such arrangements include hypercube and crossbar switches.
One of the most important architectural developments which have occurred in
this class of system over time is the incorporation of ever-increasing amounts
of local memory. This reflects the experience of all users that insufficient
memory can have a catastrophic effect on performance, outweighing, in the
worst cases, the advantages of a parallel configuration. Perhaps the most
modern design to retain the simplicity of the fine-grained approach has been
the Massively Parallel Processor (MPP) system, and this is examined in detail
in the next section.
10.4.1 An example: The massively parallel processor
MPP is the acronym for Massively Parallel Processor. Though it is not the
most recent example of a fine-grained SIMD system, the MPP illustrates the
principles of this group in the best possible way. The overall system design is
illustrated in figure 10.8.

Figure 10.8: The MPP Systems

A square array of 128 x 128 active processing elements was chosen in MPP
to match the configuration of the anticipated data sets on which the system
was intended to work. The MPP was constructed for (and used by) NASA,
with the obvious intention of processing mainly image data. The size of the
array was simply the biggest that could be achieved at the time, given the
constraints of the then-current technology and the intended processor design.
The result was a system constructed from 88 array cards, each of which
supported 24 processor chips (192 processors) together with their associated
local memory.
The array incorporates four additional columns of spare (inactive) processing
elements to provide some fault-tolerance. One of the major system design
considerations in highly parallel systems such as MPP is how to handle the
unavoidable device failures. The number of these is inevitably increased by
the use of custom integrated circuits, in which economic constraints lead to
poor characterisation. The problem is compounded in a data-parallel mesh-
connected array, since failure at some point of the array disrupts the very data
structures on which efficient computations are predicated. The MPP deals with
this problem by allowing a column of processors which contains one faulty
element to be switched out of use, while one of the four spare columns is
added to the edge of the array to maintain its size and shape. Naturally, if a

fault occurs during computation, the sequence of instructions following the last
dump to external memory must be repeated after replacement of the fault-
containing column.
The processing elements are linked by a two-dimensional near-neighbour
mesh. This choice gives a number of important advantages over other likely
alternatives, such as straightforward maintenance of data structures during
shifting, engineering ease, high bandwidth, and a close conceptual match to
the formulation of many image-processing calculations.
The principal disadvantage of this arrangement is the sluggish transmission
of data between remote processors in the array; however, this matters only if
comparatively small amounts of data are to be transmitted (rather than whole
images).
The choice of 4- rather than 8-connectedness is perhaps surprising in view of
the minimal increase in complexity which the latter involves, compared with a
twofold improvement in performance on some operations. There is one
special-purpose staging memory meant for conversion of data formats. All
highly parallel computers have problems with data input and output, and in
those parallel computers built from single-bit processors the problems are
compounded. The problem is that external source data is usually formatted
as a single string of integers, so if such data is fed to a two-dimensional array
in any simple manner, a considerable amount of time is wasted before useful
processing can start, essentially because of the mismatched format of the
data.
The MPP included two solutions to this problem. The first was a separate data
input/output register. The second was the staging memory, which allowed
conversion between bit-plane and integer-string formats. Used together,
these two solutions allowed the processor array to function continuously,
giving maximum throughput.
10.4.2 Programming and applications
The MPP system was commissioned by NASA principally for the analysis of
Landsat images (satellite imagery of Earth). This meant that, initially, most
applications on the system were in the area of image processing, although the
machine eventually proved to be of wider applicability. NASA also utilised the
MPP system for various other applications, such as the one described below
(see figure 10.9).

Figure 10.9: MPP integrated Circuits


Stereo image analysis: The stereo analysis algorithm on the MPP was
designed to compute elevations from synthetic aperture images obtained at
different viewing angles during a Shuttle mission. By means of an appropriate
geometric model, elevations can be worked out from the differing locations of
corresponding pixels in a pair of images acquired at different incidence
angles, which form a pseudo stereo pair. The main difficulties in the matching
algorithm are:
• The brightness levels differ in corresponding areas of the two images.
• The images have areas of low contrast and high noise.
• There are local distortions which differ from image to image.
We can overcome the first two difficulties by the use of normalised correlation

functions (a standard image-processing technique), but the third arises from
the different viewing angles. The MPP algorithm operates as follows:
• For each pixel in one of the images (the reference image), a local
neighbourhood area is defined. This is correlated with the similar area
surrounding each of the candidate match pixels in the second image.
• The measure applied is the normalised mean and variance cross-
correlation function. The candidate yielding the highest correlation is
considered to be the best match, and the locations of the pixels in the two
images are compared to produce the disparity value at that point of the
reference image.
• The algorithm is iterative. It begins at low resolution, that is, with large
areas of correlation around each of a few pixels. When the first pass is
complete, the test image is geometrically warped according to the disparity
map.
• The process is then repeated at a higher resolution (usually reducing the
correlation area, and increasing the number of computed matches, by a
factor of two); a new disparity map is calculated, a new warping applied,
and so on.
• The procedure is continued either for a predetermined number of passes
or until some quality criterion is exceeded. A sketch of the correlation
measure is given after this list.
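The following C sketch shows the core of such a matcher for a single
candidate: a normalised cross-correlation over a small window. The window
size W and the row-major image layout are our illustrative assumptions, not
details given in the text.

#include <math.h>

#define W 8   /* correlation window size: an illustrative choice */

/* Normalised cross-correlation of two W x W windows from images `ref`
 * and `test` (row-major, row length `stride`), with top-left corners
 * (rx, ry) and (tx, ty). Returns a value in [-1, 1]; higher = better match. */
double ncc(const float *ref, const float *test, int stride,
           int rx, int ry, int tx, int ty)
{
    double sr = 0, st = 0, srr = 0, stt = 0, srt = 0;
    int n = W * W;
    for (int y = 0; y < W; y++) {
        for (int x = 0; x < W; x++) {
            double r = ref[(ry + y) * stride + rx + x];
            double t = test[(ty + y) * stride + tx + x];
            sr += r;  st += t;
            srr += r * r;  stt += t * t;  srt += r * t;
        }
    }
    /* Subtracting the window means and normalising by the variances makes
     * the measure insensitive to brightness and contrast differences. */
    double cov = srt - sr * st / n;
    double vr  = srr - sr * sr / n;
    double vt  = stt - st * st / n;
    if (vr <= 0 || vt <= 0)
        return 0.0;                   /* flat window: no usable signal */
    return cov / sqrt(vr * vt);
}

On the MPP, one such correlation would be evaluated for many candidate
pixels in parallel across the processor array.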
Self Assessment Questions
9. MPP is the acronym for ___________ .
10. All highly parallel computers have problems concerned with
10.5 Coarse-Grained SIMD Architecture
There are several technical difficulties in fully realising the fine-grained SIMD
ideal of one processor per data element. It can therefore be better to begin
with the coarse-grained approach and develop a more rational architecture
from it. Currently, a number of parallel computer manufacturers, including
nCUBE and Thinking Machines Inc., have adopted this outlook.
Coarse-grained data-parallel architectures are often developed by
manufacturers more familiar with the mainstream of computer design than
with the application-specific architecture field, and they build on MIMD
programmes that have helped uncover the complexities of this approach and
seek to mitigate them. The consequence of these roots is systems that can
employ a number of different paradigms, including MIMD, multiple-SIMD and
what is often called single program multiple data (SPMD), in which each
processor executes its own program, but all the programs are the same and
so remain in lock-step. Such systems are frequently used in this data-parallel
mode, and it is therefore reasonable to include them within the SIMD
paradigm. Naturally, when they are used in a different mode, their operation
has to be analysed in a different way. Coarse-grained SIMD systems of this
type embody the following concepts:
• Each PE is of high complexity, comparable to that of a typical
microprocessor.
• The PE is usually constructed from commercial devices rather than
incorporating a custom circuit.
• There is a (relatively) small number of PEs, on the order of a few
hundreds or thousands.
• Every PE is provided with ample local memory.
• The interconnection method is likely to be one of lower diameter and lower
bandwidth than the simple two-dimensional mesh. Networks such as the
tree, the crossbar switch and the hypercube can be utilised.
• Provision is often made for huge amounts of relatively high-speed, high-
bandwidth backup storage, often using an array of hard disks.
• The programming model assumes that some form of data mapping and
remapping will be necessary, whatever the application.
• The application field is likely to be high-speed scientific computing.
Systems of this type have a number of advantages over fine-grained SIMD,
such as the capability to take maximum advantage of the latest processor
technology, the ability to perform highly precise computations with no
performance penalty, and the easier mapping to a selection of different data
types which the smaller number of processors and improved connectivity
permit.
The software required for such systems offers an advantage and a
disadvantage at the same time. The advantage lies in its closer similarity to
normal programming; the disadvantage lies in the less natural programming
for some applications. Coarse-grained systems also offer greater variety in
their designs, because each component of the design is less constrained than
in a fine-grained system. The example given below is, therefore, less
specifically representative of its class than was the MPP machine considered
earlier.

10.5.1 An example: the CM5


The Connection Machine family marketed by Thinking Machines Inc. has
been one of the most commercially successful examples of niche marketing
in the computing field in recent years (one other which springs to mind is the
CRAY family). Although the first of the family, the CM1, was fine-grained in
concept, the latest, the CM5, is definitely coarse-grained. The first element in
its design is the processing element, illustrated in figure 10.10. The important
components of the design include:
• internal 64-bit organisation;
• a 40MHz SPARC microprocessor;
• separate data and control network interfaces;
• up to four floating-point vector processors with separate data paths to
memory;
• 32 Mbytes of memory.

Figure 10.10: CM5 Processing Element

Taken together, these components give a peak double-precision floating-point
rate of 128 Million Floating-Point Operations Per Second (MFLOPS) and a
memory bandwidth of 512 Mbyte/s. Achieved performance rates for some
specific operations are given in Table 10.1.
Table 10.1: Performance of the CM5 Node
Function                    Performance (MFLOPS)
Matrix multiply             64
Matrix-vector multiply      100
Linpack benchmark           50
8k-point FFT                90
Peak rate                   128

There are three aspects of the system design which are of major importance.
The first is the data interconnection network, shown in figure 10.11, which the
designers designate a fat-tree network. It is based upon the quadtree,
augmented to reduce the likelihood of blocking within the network.

Figure 10.11: CM5 Fat-Tree Connectivity Structure

Thus, at the lowest level, within what is designated an interior node of four
processors, there are at least two independent direct routes between any pair
of processors. Utilising the next level of tree, there are at least four partly
independent routes between a pair of processors. This increase in the number
of potential routes is maintained for increasing numbers of processors by
utilising higher levels of the tree structure.
Although this structure provides a potentially much higher bandwidth than the
ordinary quadtree, like any complex system, achieving the highest
performance depends critically on effective management of the resource. The
second component of system design of major importance is the method of
control of the processing elements. Since each of these incorporates a
complete microprocessor, the system can be used in fully asynchronous
MIMD mode. Similarly, if all processors execute the same program, the
system can operate in SPMD mode. In fact, the designers suggest that an
intermediate method is possible, in which processing elements act
independently for part of the time but are frequently resynchronised globally.
This technique corresponds to an implementation of SIMD with algorithmic
processor autonomy.
The final system design aspect of major importance is the data I/O method.
The design of the CM5 system seeks to overcome the problem of improving
(and therefore variable) disk access speeds by allowing any desired number
of system nodes to be allocated as disk nodes with attached backup storage.
This permits the amount and bandwidth of I/O arrangements to be tailored to
the specific system requirements. Overall, one of the main design aims, which

was pursued for the CM5 system, was scalability. This not only means that the
number of nodes in the system can vary between (in the limits) one and 16384
processors, but also that system parameters such as peak computing rate,
memory bandwidth and I/O bandwidth all automatically increase in proportion
to the number of processing elements. This is shown in Table 10.2.
Table 10.2: CM5 System Parameters
Number of processors            32       1024     16384
Number of data paths            128      4096     65536
Peak speed (GFLOPS)             4        128      2048
Memory (Gbyte)                  1        32       512
Memory bandwidth (Gbyte/s)      16       512      8192
I/O bandwidth (Gbyte/s)         0.32     10       160
Synchronisation time (µs)       1.5      3.0      4.0

10.5.2 Programming and applications


Purchasing a substantial CM5 machine involves a huge investment. It is
therefore usually operated as a multi-user system, in which resources (that is,
partial systems), once allocated, are fully protected and independent. This is
ensured by a UNIX-based time-sharing operating system and priority-based
job queuing. Under this operating system, data-parallel versions of C and
Fortran are provided, together with a variety of packages for data visualisation
and scientific computation.
Self Assessment Questions
11. The two parallel computers manufacturers of coarse-grained architecture
are ____________________________ .
12. SPMD is the acronym for ___________________ .
13. ____________________ is the first of the Connection Machine family.
14. The latest system in the Connection Machine family is ________________ .

Activity 2
Visit an organisation and find out the difficulties that are faced by the
computer designers in implementing and operating the fine-grained and
coarse-grained SIMD architectures.
10.6 Summary
Let us recapitulate the important concepts discussed in this unit:

• Parallel processing can be established by dividing the data among different
units, each unit processing its data simultaneously, with the timing and
sequencing governed by the control unit so as to obtain the result in the
minimum amount of time.
• Parallel processing is an integral part of everyday life.
• The evolution of serial computers may be finally reaching its peak due to
the limitations imposed on the design by its physical implementation and
inherent bottlenecks.
• The core elements of parallel processing are CPUs. The essential
computing process is the execution of a sequence of instructions on a set
of data.
• The term single instruction implies that all processing units execute the
same instruction at any given clock cycle. On the other hand, the term
multiple data implies that each processing unit can operate on a different
data element.
• SIMD instructions handle floating-point real numbers and also provide
important speedups in algorithms. A vector processor lines up a whole row
of the scalars, all of the same type, and operates on them as a unit.
• Modern, superscalar SISD machines exploit a property of the instruction
stream called instruction-level parallelism (ILP).
• The Steven Unger design scheme is the initial base for the Fine-grained
SIMD architectures. These are generally designed for low-level image
processing applications.
• MPP is the acronym for Massively Parallel Processor. MPP exemplifies the
ideology of this group in the best feasible way, though it is not the most
recent example of a fine-grained SIMD system.
• The array incorporates four additional columns of spare (inactive)
processing elements to provide some fault-tolerance. One of the major
system design considerations in highly parallel systems such as MPP is
how to handle the unavoidable device failures.
• There are several technical difficulties that arise in completely fulfilling the
fine-grained SIMD ideal of one processor per data element.
10.7 Glossary
• CM1: Connection Machine 1
• CM5: Connection Machine 5
• ILP: Instruction-level Parallelism
• MIMD: Multiple Instruction Multiple Data
• MISD: Multiple Instruction Single Data

• MPP: Massively Parallel Processor


• SIMD: Single Instruction Multiple Data
• SISD: Single Instruction Single Data
• SPMD: Single Program Multiple Data

10.8 Terminal Questions


1. What do you understand by Parallel Processing? Also, explain Serial
Processing and True Parallel Processing.
2. Explain the hardware architecture of Parallel Processing.
3. Describe the Fine-Grained SIMD Architecture. Give a suitable example.
4. Illustrate with an example the concept of Coarse-Grained SIMD
Architecture.
5. Explain the Connection Machine and describe the CRAY family.

10.9 Answers
Self Assessment Questions
1. Instructions
2. Parallel Computer
3. True
4. Simulated or virtual
5. Single Instruction Multiple Data
6. SISD, SIMD, MISD, and MIMD
7. Vector processing
8. Instruction-level parallelism
9. Massively Parallel Processor
10. Input and output of data
11. nCUBE and Thinking Machines Inc.
12. Single Program Multiple Data
13. CM1
14. CM5
Terminal Questions
1. Parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem. Refer Section 10.2.
2. The core elements of parallel processing are CPUs. The essential
computing process is the execution of a sequence of instructions on a set
of data. Refer Section 10.3.
3. The Steven Unger design scheme is the initial base for the fine-grained
SIMD architectures. These are generally designed for low-level image
processing applications. Refer Section 10.4.
4. There are several technical difficulties that arise in completely fulfilling the
fine-grained SIMD ideal of one processor per data element. Thus, it is
better to begin with the coarse-grained approach and thereby develop a
more rational architecture. Refer Section 10.5.
5. The Connection Machine family marketed by Thinking Machines Inc. has
been one of the most commercially successful examples of niche
marketing in the computing field in recent years (one other which springs
to mind is the CRAY family). Refer Section 10.5.
References
• Sima, D., Fountain, T. & Kacsuk, P. (1997). Advanced Computer
Architectures: A Design Space Approach. Addison-Wesley-Longman.
• Hwang, K. (1993). Advanced Computer Architecture: Parallelism,
Scalability, Programmability. McGraw-Hill.
• Flynn, M. J. (1995). Computer Architecture: Pipelined & Parallel Processor
Design. Narosa.
• Hayes, J. P. (1998). Computer Architecture & Organisation, 3rd edition.
McGraw-Hill.
• Carter, N. P. (2002). Schaum's Outline of Computer Architecture.
McGraw-Hill Professional.
E-references:
• http://www.lc3help.com/
• http://www.scribd.com/
Unit 11 Vector Architecture and MIMD Architecture
Structure:
11.1 Introduction
Objectives
11.2 Vectorisation
11.3 Pipelining
11.4 MIMD Architectural Concepts
Multiprocessor
Multi-computer
11.5 Problems of Scalable Computers
11.6 Main Design Issues of Scalable MIMD Architecture
11.7 Summary
11.8 Glossary
11.9 Terminal Questions
11.10 Answers

11.1 Introduction
In the previous unit, you were introduced to data parallel architecture, in which
you studied the SIMD part. You learned about SIMD architecture and its
various aspects, such as the SIMD design space, fine-grained SIMD
architecture and coarse-grained SIMD architecture. In this unit we will
progress a step further and explain the MIMD architecture. Although we
covered vector architecture in the prior unit, we will throw some light on it as
well, so that the concept of MIMD can be understood in a better way.
According to the famous computer architect Jim Smith, the most efficient way
to execute a vectorisable application is a vector processor. Vector
architectures collect groups of data elements distributed in memory, place
them in linear sequential register files, operate on the data in those register
files, and then disperse the results back to memory. MIMD architectures, on
the other hand, are of great importance and may be used in numerous
application areas such as CAD (Computer Aided Design), CAM (Computer
Aided Manufacturing), modelling, simulation, etc.
In this unit, we are going to study different features of vector architecture and
MIMD architecture, such as pipelining, MIMD architectural concepts, problems
of scalable computers and the main design issues of scalable MIMD
architecture.
Objectives:
After studying this unit, you should be able to:
• recall the concept of vector architecture
• discuss the concept of pipelining
• describe MIMD Architectural concepts
• differentiate between multiprocessor and multicomputer
• interpret the problems of scalable computers
• solve the problems of scalable computers
• recognise the main design issues of scalable MIMD architecture

11.2 Vectorisation
Vector machines are planned and designed to operate at the level of vectors.
Now, you will study the operation of vectors. Suppose there are two vectors,
A and B, both having 64 components. The number of components in a vector
is the vector size, so our vector size is 64. Vectors A and B are shown below:
A = a0, a1, ..., an-1
B = b0, b1, ..., bn-1
Now we want to add these two vectors and keep the result in another vector
C, as shown in the equation below. The rule for adding vectors is to add the
corresponding components:
C = A + B = a0 + b0, a1 + b1, ..., an-1 + bn-1


We can also perform the addition by using a loop, as in high-level languages;
the loop iterates n times, where n is the size of the vector. In C, this code can
be written as:
for (i = 0; i < n; i++)
    C[i] = A[i] + B[i];
This for loop iterates n times, and this element-by-element adding of the
vectors is referred to as a scalar operation. Vector operations are specified by
vector processor instructions: for example, only one vector instruction is
required to add vectors A and B. Basically, a vector instruction has four fields:
three are registers and one is the operation.

VOP Vd Vs1 Vs2

which performs
Vd = Vs1 VOP Vs2
Here, VOP is the vector operation, which is performed on registers Vs1 and
Vs2, and the result is stored in Vd.
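For instance, using this format with a hypothetical mnemonic VADD for the
vector add operation (the mnemonic is ours, for illustration; the Cray notation
of unit 9 would write V2 V3+FV4), the scalar loop above collapses into a short
sequence once VL has been set:

A1 n            (A1 = n, the vector size, assuming n <= 64)
VL A1           (VL = n)
VADD V3 V1 V2   (V3[i] = V1[i] + V2[i], for i = 0 to VL-1)

A single vector instruction thus replaces all n iterations of the scalar loop.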
Architecture
As discussed in unit 9, both a scalar unit and a vector unit are present in a
vector machine. The scalar unit has the same structural design as a
conventional processor and works on scalars, while the vector unit carries out
the vector operations. Vector architecture has seen advancements such as
moving from CISC to RISC designs, and moving from the memory-memory
architecture to the vector-register architecture.
Manipal University Jaipur B1648 Page No. 232
Computer Architecture Unit 1

In the beginning, vector machines employed the memory-memory
architecture, in which all vector operations receive their input operands from
memory and store the result back to memory. The first vector machine, the
CDC Star 100, used this architecture.
The vector-register architecture is like the RISC architecture and is used in
numerous vector machines, such as those from Hitachi and NEC. The vector
registers contain the vectors on which the vector operations are performed,
and the results of the operations are also stored in vector registers. As in a
RISC processor, special load and store instructions move vectors between
memory and the vector registers.
Figure 11.1 shows a typical vector processor architecture, which is based on
the Cray 1 system. There are five major components in this architecture:
• vector registers
• vector load/store unit
• vector functional units
• scalar unit
• main memory

Figure 11.1: A Typical Vector Processor Architecture (Based on Cray 1)

Vector Registers: Vector registers hold the input and result vectors. The Cray 1 and many other vector processors have 8 vector registers, each holding 64 elements of 64 bits. Some machines make this configurable: the Fujitsu VP-200, for example, provides 8K elements of vector register space in a programmable set that can be organised as anywhere from 8 to 256 registers, e.g. 8 registers of 1024 elements each, or 256 registers of 32 elements each.
In figure 11.1 each vector register has one write port and two read ports, so that vector operations can overlap on different vector registers.
Scalar Registers: Scalar registers supply the scalar inputs to vector operations. For example, a scalar register holds the constant when every element of a vector is to be multiplied by it, as in
B = 5*X + Y
Here, 5 is a constant stored in a scalar register, and X and Y are vectors held in two different vector registers. Scalar registers also hold the addresses computed for the vector load/store unit. A sketch of this computation in C follows.
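A minimal C sketch (ours, not from the text) of the computation B = 5*X + Y above; on a vector machine the loop body would be a single scalar-vector multiply-add, with the constant held in a scalar register:

#include <stdio.h>

#define VLEN 64   /* one full vector register */

int main(void)
{
    double X[VLEN], Y[VLEN], B[VLEN];
    double s = 5.0;   /* scalar operand, kept in a scalar register */

    for (int i = 0; i < VLEN; i++) {   /* sample inputs */
        X[i] = i;
        Y[i] = 1.0;
    }

    /* On a vector machine this whole loop is one vector operation:
       the scalar s is applied to every element of X. */
    for (int i = 0; i < VLEN; i++)
        B[i] = s * X[i] + Y[i];

    printf("B[0] = %g, B[63] = %g\n", B[0], B[VLEN - 1]);  /* 1 and 316 */
    return 0;
}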


Vector Load/Store Unit: The vector load/store unit moves data between memory and the vector registers. It is responsible for overlapping read and write operations to memory and for hiding the high latency associated with main memory access.
1. Load vector operation: moves a vector from memory to a vector register.
2. Store vector operation: moves a vector from a vector register to memory.
Vector Functional Units: A vector machine provides several vector functional units, typically for:
• integer operations
• floating-point operations
• logical operations
As shown in figure 11.1, the Cray 1 has six functional units. The NEC SX/2 has sixteen functional units: four shift units, four integer add/logical units, four FP add units and four FP multiply/divide units.
Memory: The memory unit of a vector processor differs from that of a conventional processor in that it permits pipelined data transfer to and from memory. Interleaved memory is used to support such pipelined transfers.
Self Assessment Questions
1. The first vector machine was ____________ .
2. _______ operations get the scalar inputs present in scalar registers.

11.3 Pipelining
We have discussed this concept in Units 4 and 5, but we recap it here in order to get a better idea of the next sections.
What is Pipelining?
An implementation technique by which the execution of multiple instructions
can be overlapped is called pipelining. In other words, it is a method which breaks down a sequential process into numerous sub-operations; every sub-operation is then executed concurrently in its own dedicated segment. The main advantage of pipelining is that it increases the instruction throughput, which is the count of instructions completed per unit
time. Thus, a program runs faster. In pipelining, several computations can run
in distinct segments simultaneously.
A register is connected with every segment in the pipeline to provide isolation
between each segment. Thus, each segment can operate on distinct data
Manipal University Jaipur B1648 Page No. 235
Computer Architecture Unit 1

simultaneously. Pipelining is also called virtual parallelism, as it provides a semblance of parallelism only at the instruction level.
In pipelining, the CPU executes each instruction in a series of following small
common steps:
1. Instruction fetching
2. Instruction decoding
3. Operand address calculation and loading
4. Instruction execution
5. Storing the result of the execution
6. Write back
While executing a sequence of instructions, the CPU can pipeline these common steps. In a non-pipelined CPU, by contrast, instructions are executed in strict sequence through the steps mentioned above.
To understand pipelining, let us discuss how an instruction flows through the
data path in a five-segment pipeline. Consider a pipeline with five processing
units, where each unit is assumed to take 1 cycle to finish its execution as
described in the following steps:
a) Instruction fetch cycle: In the first step, the instruction whose address is held in the PC register is fetched from memory into the Instruction Register (IR).
b) Instruction decode/register fetch cycle: The fetched instruction is decoded and the source registers are read into two temporary registers. Decoding and the reading of registers are done in parallel.
c) Effective address calculation cycle: In this cycle, the addresses of the operands are calculated and the effective addresses are placed into the ALU output register.
d) Memory access completion cycle: In this cycle, the address of the
operand calculated during the prior cycle is used to access memory. In
case of load and store instructions, either data returns from memory and
is placed in the Load Memory Data (LMD) register or is written into
memory. In case of a branch instruction, the PC is replaced with the branch destination address from the ALU output register.
e) Instruction execution cycle: In the last cycle, the result is written into
the register file.
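As a back-of-the-envelope illustration (ours, not from the text): if every stage takes one cycle, a k-stage pipeline completes n instructions in k + n - 1 cycles, against n * k cycles without pipelining. The tiny C program below computes the resulting speedup for the five-segment pipeline described above:

#include <stdio.h>

int main(void)
{
    int  k = 5;       /* pipeline stages, as in the walkthrough above */
    long n = 1000;    /* number of instructions executed */

    long nonpipelined = n * k;       /* every instruction takes all k cycles */
    long pipelined    = k + n - 1;   /* fill the pipe once, then one result per cycle */

    printf("speedup = %.2f\n", (double)nonpipelined / (double)pipelined);
    /* prints about 4.98, approaching k = 5 as n grows */
    return 0;
}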
Pipelines are of two types - linear and non-linear. A linear pipeline performs only one pre-defined, fixed function at specific times, with data flowing in a forward direction from one stage to the next. On the other hand, a dynamic pipeline, which allows feed-forward and feedback connections in addition to the streamline connections, is called a non-linear pipeline.
An Instruction pipeline operates on a stream of instructions by overlapping and
decomposing the three phases of the instruction cycle. Super pipeline design
is an approach that makes use of more and more fine-grained pipeline stages
in order to have more instructions in the pipeline. As RISC instructions are
simpler than those used in CISC processors, they are more conducive to
pipelining.
Self Assessment Questions
3. ___________ specifies the count of instructions completed per unit
time.
4. Pipelining is also called _______________ as it provides a semblance of parallelism only at the instruction level.
5. Linear pipelines perform only one pre-defined fixed function at specific times in a forward direction. (True/False)

11.4 MIMD Architectural Concepts


Computers with multiple processors, each capable of executing its own instruction stream on its own data stream, are called Multiple Instruction streams Multiple Data streams (MIMD) computers. All multiprocessing computers are MIMD computers. The framework of an MIMD computer is shown in figure 11.2.


Figure 11.2: The Framework of an MIMD Computer

An MIMD machine consists of multiple independent processors operating as components of a larger system, for example parallel processors, multiprocessors and multi-computers. There are two forms of MIMD machines:
• multiprocessors (shared-memory machines)
• multi-computers (message-passing machines)

11.4.1 Multiprocessor
Multiprocessors are systems with multiple CPUs, capable of independently executing different tasks in parallel. They have the following main features:
• They have either a shared common memory or unshared distributed memories.
• They share resources, for example I/O devices, system utilities, program libraries and databases.
• They run under an integrated operating system that provides interaction between processors and their programs at the job, task, file and data element levels.
Types of multiprocessors
There are three types of multiprocessors, distinguished by the way in which shared memory is implemented (see figure 11.3). They are:
• UMA (Uniform Memory Access)


• NUMA (Non-Uniform Memory Access)
• COMA (Cache Only Memory Access)

Figure 11.3: Shared Memory Multiprocessors

Basically, the memory is divided into several modules; the way these modules are accessed divides large multiprocessors into the different categories. Let's discuss them in detail.
UMA (Uniform Memory Access): In this category every processor has the same access time to every memory module. Hence each memory word can be read as quickly as any other memory word; if not, fast references are slowed down to match the slow ones so that programmers cannot observe the difference, which is what uniformity means here. Uniformity makes performance predictable, a significant aspect of code writing. Figure 11.4 shows uniform memory access from the CPU on the left.

Figure 11.4: Uniform and Non-Uniform Memory Access


Modern UMA machines are small, single-bus multiprocessors. In the early designs of scalable shared memory systems, large UMA machines with a switching network and hundreds of processors were common. Well-known examples of those multiprocessors are the NYU Ultracomputer and the Denelcor HEP. Their designs introduced numerous features that proved to be important achievements in the architecture of today's parallel computers. Nevertheless, these early systems had no local main memory or cache memory, which has since shown its importance for attaining high performance in scalable shared memory systems. The UMA approach is not appropriate for building scalable parallel computers, but it is very good for constructing small single-bus multiprocessors, such as the Encore Multimax of Encore Computer Corporation, introduced in the late 80s, and the Silicon Graphics Computing Systems, introduced in the late 90s.
NUMA (Non-Uniform Memory Access): These machines are intended to avoid the memory access disadvantage of UMA machines. The logically shared memory is spread among all the processing nodes of a NUMA machine, giving rise to distributed shared memory architectures. Figure 11.4 shows non-uniform memory access between the left and right nodes.
Although these parallel computers became highly scalable, they are extremely sensitive to the allocation of data in the local memories: accessing a remote memory segment of a node is much slower than accessing a local memory segment. These machines are similar in architecture to distributed memory multi-computers; the major difference lies in the organisation of the address space. In multiprocessors, a global address space is applied that is equally visible from each processor; in other words, every CPU can transparently access all memory locations. In multi-computers, by contrast, the address space is replicated in the local memories of the processing elements. This difference in the address space also shows up at the software level: NUMA machines are programmed on the global address space (shared memory) principle, while distributed memory multi-computers are programmed on the message-passing paradigm.
COMA (Cache Only Memory Access): A COMA machine is also a non-uniform machine, but in a different way. It avoids the effects of the static memory allocation of NUMA and Cache-Coherent Non-Uniform Memory Architecture (CC-NUMA) machines. This is done through two measures:


• including large caches as node memories
• excluding main memory blocks from the local memory of nodes
In this architecture only cache memory is present; main memory exists neither in the distributed form of NUMA and CC-NUMA machines nor as the central shared memory of UMA machines. Just as virtual memory removes the need to handle addresses explicitly, COMA removes static data allocation: the allocation of data to local memories is driven by demand. Whenever data is needed, it is attracted to the local (cache) memory under the cache coherence scheme. In COMA machines the same cache coherence schemes can be employed as in other shared memory systems; the difference is that these schemes must also manage the replacement of data. COMA machines are scalable parallel architectures, so only coherence schemes that can support large-scale parallel systems can be applied, for example hierarchical cache coherence schemes and directory schemes. Two representative COMA architectures are the KSR1 (Kendall Square Research high performance computer) and the DDM (Data Diffusion Machine).
11.4.2 Multi-computer
A multi-computer consists of numerous von Neumann computers connected by an interconnection network. Each computer on the network executes its own programs, accessing its local memory and sending/receiving messages over the network. Typically, both memory and I/O are distributed among the processors, so each individual processor-memory-I/O module in a multi-computer forms a node and is essentially a separate, stand-alone, autonomous computer. A multi-computer is actually a group of MIMD computers with physically distributed memory, as shown in figure 11.5.


Figure 11.5: The Framework of a Multi-computer

To support a larger number of processors, the memory is spread among the CPUs. This yields cost-effective higher bandwidth, as most of the accesses made by each processor are to its local memory. Because remote memory cannot be directly accessed, this arrangement is also known as NORMA (No Remote Memory Access). There are two types of multi-computers:
• MPPs (Massively Parallel Processors), which contain many processors connected by a high-speed interconnection network. MPPs are very costly supercomputers, such as the Cray T3E and IBM SP/2.
• Regular PCs or workstations, possibly rack mounted, connected by a commercial off-the-shelf interconnection method.
Basically there is not much difference between the two categories, but the network used in an MPP is far more expensive than the network used in regular PCs or workstations. These self-assembled machines go by several names, for example NOW (Network of Workstations) and COW (Cluster of Workstations).
Self Assessment Questions
6. All multiprocessing computers are _____________ computers.
7. In UMA machines, each memory word cannot be read as quickly as other
memory word. (True/ False).

8. NUMA stands for ____________________ .

Activity 1:
Prepare a collage with two columns, one for multiprocessors and the other for multi-computers, and paste diagrams, notes, pictures etc. of the various machines found under each heading.

11.5 Problems of Scalable Computers


There are two fundamental problems to be solved in any scalable computer
system:
1. Tolerate and hide latency of remote loads.
2. Tolerate and hide idling due to synchronisation among parallel processes.
Remote loads are unavoidable in scalable parallel systems which use some
form of distributed memory. Accessing a local memory usually requires only
one clock cycle while access to a remote memory cell can take two orders of
magnitude longer time.
If a processor issuing such a remote load operation had to wait for the completion of the operation without doing any useful work, the remote load would significantly slow down the computation. Since the rate of load instructions is high in typical programs, the latency problem would eliminate all the potential benefits of parallel execution. A typical case is shown in figure 11.6, where P0 has to load two values, A and B, from two remote memory blocks, M1 and Mn, in order to evaluate the expression A+B.


Figure 11.6: The Remote Load Problem

The pointers to A and B, rA and rB, are stored in the local memory of P0. Accesses to A and B are realised by the "rload rA" and "rload rB" instructions, which must travel through the interconnection network in order to fetch A and B.
The situation is even worse if the values of A and B are not yet available in M1 and Mn because they are to be generated by some other process that will execute later. In this case, where idling occurs due to synchronisation among parallel processes, the original process on P0 must wait an unpredictable time, resulting in unpredictable latency.
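A simple cost model (our illustration, not from the text) shows why remote loads dominate: if a local access costs one cycle and a remote access two orders of magnitude more, even a small fraction of remote loads blows up the average access time:

#include <stdio.h>

int main(void)
{
    double local  = 1.0;     /* cycles for a local memory access      */
    double remote = 100.0;   /* roughly two orders of magnitude more  */

    for (int pct = 0; pct <= 20; pct += 5) {
        double f   = pct / 100.0;                     /* fraction of remote loads   */
        double avg = (1.0 - f) * local + f * remote;  /* average cycles per access  */
        printf("%2d%% remote -> %5.1f cycles/access\n", pct, avg);
    }
    return 0;
}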
Solutions to the problems
In order to solve the above-mentioned problems several possible
hardware/software solutions were proposed and applied in various parallel
computers. They are as follows:
1. Application of cache memory
2. Pre-fetching
3. Introduction of threads and a fast context switching mechanism among threads
4. Using non-blocking writes
Various methods along these lines are used to reduce, or at least hide, the latency. We will now discuss them.
1. Application of cache memory: Data replication is the first latency reduction or hiding method. If we keep several copies of data in different locations, then accesses from those locations can be faster. One replication method is caching, in which copies of data are kept close to the location where they are used and to which they belong. Another strategy is to keep peer copies of equal status, unlike the asymmetric primary/secondary relationship used in caching. Whenever several copies of data are maintained, in whatever form, the major issues are when, where and by whom the data blocks are placed. Solutions range from placement on demand by hardware to deliberate placement at load time following compiler directives.
2. Pre-fetching: This is the next method for reducing or hiding latency. Here data is fetched before it is required, so that the fetch overlaps with regular execution and the data is already in place when it is needed. Pre-fetching may be automatic or under program control. A cache loads not just the requested word but the entire line, because the following words are likely to be required soon.
Pre-fetching can also be controlled explicitly: when the compiler knows data will be needed, it can insert an explicit instruction to fetch it, placed early enough that the data arrives on time. To use this technique, the compiler must have complete knowledge of the machine and its timing, as well as control over where the data is placed. A sketch of explicit pre-fetching follows.
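A sketch of explicit pre-fetching using GCC's __builtin_prefetch intrinsic (the intrinsic is our choice of example; the text describes only the general technique). The distance of 16 elements is an assumption that would in practice be tuned to the machine's memory latency:

#include <stdio.h>

#define N     1024
#define AHEAD 16     /* how far ahead of use we fetch; machine dependent */

int main(void)
{
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = i;

    for (int i = 0; i < N; i++) {
        if (i + AHEAD < N)
            __builtin_prefetch(&a[i + AHEAD]);  /* issued early, so the data
                                                   is in the cache when used */
        sum += a[i];
    }
    printf("sum = %g\n", sum);
    return 0;
}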
3. Multithreading: Multithreading is another method for reducing or hiding latency, and it is very common in modern computers. It is essentially multiprogramming, in which several processes run at the same time. For the switching between processes to be fast, each process must be given its own memory map and hardware registers. When one process blocks while waiting for remote data, the hardware can switch to a process that is able to continue. In a fine-grained design the processor executes the first instruction from thread 1, the second instruction from thread 2, and so on; in this way the processor is kept busy even during long latencies, as long as the threads are independent. Indeed, to reduce latencies, some systems automatically switch between processes after every instruction, as sketched below.
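A toy sketch (ours) of the fine-grained switching just described: the processor issues one instruction per thread per cycle in round-robin order, so a thread waiting on remote data never leaves the processor idle:

#include <stdio.h>

#define THREADS 4
#define CYCLES  8

int main(void)
{
    int pc[THREADS] = {0};   /* one program counter per hardware thread */

    for (int cycle = 0; cycle < CYCLES; cycle++) {
        int t = cycle % THREADS;   /* switch thread every cycle */
        printf("cycle %d: thread %d issues instruction %d\n",
               cycle, t, pc[t]++);
    }
    return 0;
}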
4. Non-blocking writes: The last method for reducing or hiding latency is non-blocking writes. Normally, when a STORE instruction is executed, the CPU waits for it to complete before continuing. With non-blocking writes, the memory operation is started and the program continues executing immediately.

Activity 2:
Visit a library and read books on computer architecture to find out more
ways of resolving the problems of scalable computers.

Self Assessment Questions


9. ________ are unavoidable in scalable parallel systems which use some form of distributed memory.
10. Pre-fetching can never be controlled explicitly. (True/False)

11.6 Main Design Issues of Scalable MIMD Architecture


The main design issues in scalable parallel computers are as follows:
1. Processor design
2. Interconnection network design
3. Memory system design
4. I/O system design
Let’s discuss them in detail:
1. Processor design: The current generation of commodity processors contains several built-in parallel architecture features, such as pipelining and parallel instruction issue logic. They also directly support the building of small and mid-size multiple processor systems by providing atomic storage access, pre-fetching, cache coherency, message passing, etc. However, they cannot tolerate remote memory loads or idling due to synchronisation, which are the fundamental problems of scalable parallel systems. To solve these problems, a new approach is needed in processor design; multithreaded architectures offer a promising solution.

2. Interconnection network design: Interconnection network design was a key problem in the data-parallel architectures, since they too aimed at massively parallel systems. Here, those design issues are reconsidered that are relevant when commodity microprocessors are applied in the network. The central design issue in distributed memory multi-computers is the selection of the interconnection network and the hardware support for message passing through the network.
3. Memory system design: Memory design is the crucial topic in shared memory multiprocessors, where the maintenance of a logically shared memory plays a central role. Early multiprocessors applied physically shared memory, which became a bottleneck in scalable parallel computers; the recent generation of multiprocessors employs a distributed shared memory supported by a distributed cache system. Maintaining cache coherency is a non-trivial problem that requires careful hardware/software design.
4. I/O system design: One of the main problems in scalable parallel computers is handling I/O devices efficiently. The problem is particularly serious when large data volumes must be moved between I/O devices and remote processors. The main question is how to avoid disturbing the work of the internal computational processors. The problem of I/O system design appears in every class of MIMD systems.
Self Assessment Questions
11. _____________ was a key problem in the data-parallel architectures since they aimed at massively parallel systems.
12. Early multiprocessors applied _____________ , which became a bottleneck in scalable parallel computers.
11.7 Summary
• A vector machine consists of a scalar unit and a vector unit. The scalar
unit works on scalars and has architecture similar to that in the traditional
processors. The vector unit is responsible for performing vector
operations.
• An implementation technique by which the execution of multiple
instructions can be overlapped is called pipelining. This pipeline technique
splits up the sequential process of an instruction cycle into sub-processes that operate concurrently in separate segments.


• Computers with multiple processors, each capable of executing its own instruction stream on its own data stream, are called Multiple Instruction streams Multiple Data streams (MIMD) computers.
• MIMD machines are divided into two categories: multiprocessors (shared-memory machines) and multi-computers (message-passing machines).
• There are two fundamental problems to be solved in any scalable
computer system: 1. tolerate and hide latency of remote loads. 2. tolerate
and hide idling due to synchronisation among parallel processes.
• The main design issues in scalable parallel computers are:
1. Processor design
2. Interconnection network design
3. Memory system design
4. I/O system design

11.8 Glossary
• CC-NUMA machine: Cache-Coherent Non-Uniform Memory Access
machine.
• COMA machine: Cache Only Memory Access machine.
• COW: Cluster of Workstations.
• DDM: Data Diffusion Machine.
• LMD: Load Memory Data.
• MPPs: Massively Parallel Processors.
• Multi-computer: A machine containing numerous von Neumann computers connected by an interconnection network.
• Multiprocessor: Systems with multiple CPUs, which are capable of
independently executing different tasks in parallel.
• NORMA: No Remote Memory Access.
• NOW: Network of Workstations.
• NUMA machine: Non Uniform Memory Access machine.
• Register: It is associated with each segment in the pipeline to provide
isolation between each segment.
• UMA machine: Uniform Memory Access machine.


11.9 Terminal Questions


1. Define vectorisation.
2. List the five components of vector-register machine architecture and write
a brief note on each.
3. What do you understand by pipelining? State steps in which instructions
are executed in pipelining.
4. Differentiate between multiprocessor and multi-computer.
5. Write short notes on:
A) UMA
B) NUMA
C) COMA
6. Describe the various problems of scalable computers. Explain the ways to
resolve them.
7. What are the main design issues of scalable MIMD architecture?

11.10 Answers
Self Assessment Questions
1. CDC Star 100
2. Vector
3. Instruction throughput
4. Virtual parallelism
5. True
6. MIMD
7. False
8. Non Uniform Memory Access
9. Remote loads
10. False
11. Interconnection network design
12. Physically shared memory

Terminal Questions
1. A vector is a type of one-dimensional array. Vectorisation is the process of gathering data elements distributed in memory and placing them in linear, sequential register files. Refer Section 11.2.
2. The five components are vector registers, scalar registers, vector
functional units, vector load/store unit, and main memory. Refer Section
11.2.


3. An implementation technique by which the execution of multiple instructions can be overlapped is called pipelining. In pipelining, the CPU executes each instruction in a series of small common steps. Refer Section 11.3.
4. Multiprocessors are systems with multiple CPUs, capable of independently executing different tasks in parallel. A multi-computer contains numerous von Neumann computers connected by an interconnection network. Refer Section 11.4 for more details.
5. In UMA machines, every processor has the same access time to every memory module. Refer Section 11.4.1.
NUMA machines are intended to avoid the memory access disadvantage of Uniform Memory Access machines. Refer Section 11.4.1.
A COMA machine is also non-uniform, but in a different way; it avoids the effects of the static memory allocation of NUMA and Cache-Coherent Non-Uniform Memory Architecture (CC-NUMA) machines. Refer Section 11.4.1.
6. There are two fundamental problems to be solved in any scalable computer
system:
1. Tolerate and hide latency of remote loads.
2. Tolerate and hide idling due to synchronisation among parallel
processes.
In order to solve the above-mentioned problems several possible
hardware/software solutions were proposed and applied in various parallel
computers. Refer Section 11.5.
7. The main design issues in scalable parallel computers are:
1. Processor design
2. Interconnection network design
3. Memory system design
4. I/O system design. Refer Section 11.6

References:
• Hwang, K. Advanced Computer Architecture. McGraw-Hill, 1993.
• Godse, D. A. & Godse, A. P. Computer Organization. Technical
Publications.
• Hennessy, John L., Patterson, David A. & Goldberg, David, Computer Architecture: A Quantitative Approach, 5th edition. Morgan Kaufmann, 2011.
• Sima, Dezso, Fountain, Terry J. & Kacsuk, Peter, Advanced Computer Architectures: A Design Space Approach. Addison-Wesley-Longman.
E-references:
• http://www.cs.umd.edu/class/fall2001/cmsc411/projects/MIMD/mimd.
html.
• http://www.docstoc.com/docs/2685241/Computer-Architecture-
Introduction-to-MIMD-architectures.


Unit 12 Storage Systems


Structure:
12.1 Introduction
Objectives
12.2 Storage System
12.3 Types of Storage Devices
Magnetic storage
Optical storage
12.4 Connecting I/O devices to CPU/Memory
Input-output vs. memory bus
Isolated versus memory-mapped I/O
Example of I/O interface
12.5 Reliability, Availability and Dependability of Storage System
12.6 RAID
Mirroring (RAID 1)
Bit-Interleaved parity (RAID 3)
Block-Interleaved distributed parity (RAID 5)
12.7 I/O Performance Measures
12.8 Summary
12.9 Glossary
12.10 Terminal Questions
12.11 Answers

12.1 Introduction
In the previous unit, you studied the concept of vectorisation and pipelining.
Also, you studied the MIMD architectural concepts, problems of scalable
computers. We also learnt the main design issues of scalable MIMD
architecture.
A computer must have a system to get information from the outside world and must be able to communicate results to the external world. Programs and data must be entered into computer memory in order to be processed, and the results of calculations must be recorded or displayed for the user. To use a computer efficiently, it is necessary to prepare numerous programs and data beforehand; these are transferred onto a storage medium, from which the information can be moved rapidly into the computer's memory.


Results produced by programs are transferred to high-speed storage, for example disks, from which they can later be sent to an output device to present the outcomes.
In this unit, you will study storage systems and various topics such as different
types of storage systems, connecting I/O devices to CPU/Memory, availability,
dependability and reliability of the storage system. We will also explain the
concept of RAID and the I/O measures.
Objectives:
After studying this unit, you should be able to:
• define storage system
• describe various types of storage devices
• describe the process of connecting I/O device to CPU/memory
• discuss the reliability, availability and dependability of storage system
• explain the concept of RAID
• discuss I/O performance measures.

12.2 Storage System


To describe a computer system, a distinction is frequently made between computer organisation and computer architecture. By computer architecture, we mean those system attributes which are visible to a developer. By computer organisation, we mean the operational units, and the connections between them, that realise the architectural specification.
Each time a PC is shut down, the contents of the PC's random-access memory (RAM) are lost, because RAM is electronic and requires a constant source of power to retain its contents. Likewise, whenever a program ends, the operating system discards the information that program had placed in RAM in order to make room for other programs. To keep information from one computer session to another, one must store the information in a file that is ultimately stored on disk.
Nowadays, storage systems are essential to computing. Every computing platform, ranging from handheld devices to huge supercomputers, makes use of storage systems to store data either for the short term or permanently. The early punch card stored only a small number of bytes; today's storage systems can store vast amounts of data in comparatively little space and with low power consumption.


We give below some definitions of storage in reference to computers.
• Storage is a device with the capability to store data; disk and tape drives are storage devices.
• Storage is the place in a computer where data is held, in an electromagnetic or optical form, for access by a processor.
• Storage, also called computer data storage or memory, refers to the components of a computer, recording media, and devices that preserve the digital data used in computing for some period of time.
Self Assessment Questions
1. __________ signifies those systems attributes which are visible to a
developer.
2. ______________ signifies operational units in addition to the
connection between them that recognise the specification of
architecture.
3. RAM stands for _______________ .

12.3 Types of Storage Devices


Data is stored on physical components or devices known as storage media. Storage devices, in turn, are the hardware components used to read from or write to the storage media. The different types of storage devices used to store data and information are:
• Magnetic storage
• Optical storage
Let us discuss them in detail below:
12.3.1 Magnetic storage
The most common and stable form of removable-storage technology is magnetic storage. A magnetic storage device has a coating of some magnetic substance on a rigid or flexible surface. The drive is equipped with a read/write head assembly that converts the data and instructions, represented as 0s and 1s, into magnetic signals that can be stored on the medium. Storage devices such as hard drives, diskette drives and tape drives all use this kind of medium, and they use similar methods for reading and writing data.


The surfaces of diskettes and magnetic tapes are coated with a magnetic material such as iron oxide. Data is stored by polarisation: each particle of the magnetic material aligns itself in a particular direction. A magnet has one important advantage: its state is maintained without a constant supply of electricity.
Disk surfaces are likewise coated with numerous small iron particles, each of which acts as a magnet, so data can be stored on such disks. The read/write heads of a disk drive contain electromagnets; as the head passes over the disk, the electromagnets produce magnetic fields in the iron particles. A chain of 1s and 0s is recorded by reversing the direction of the current in the electromagnet.
Let’s discuss three types of magnetic storage viz., disks, hard drives and tape
drives as below:
Disks: Disks let the user store information from one computer session to the next. The floppy disk was introduced by IBM; the first floppy disks were 8 inches in diameter, and as they gradually became smaller they came to be called diskettes. The next smaller diskette was 5.25 inches in diameter.
Later, 3.5-inch diskettes with 1.44 MB of storage space became the most popular medium on microcomputers for storing data and programs; as many as 400 pages of a printed book can be stored on a single such floppy disk. Zip disks look similar to floppy disks but are slightly bigger and thicker.
Hard drive: A PC's hard drive is a fast drive normally capable of storing several hundred megabytes of data. To reduce the chance of a disk-head crash or disk damage, never move a PC while it is on. Each disk drive within a PC has a unique one-letter name: the A: drive normally corresponds to the floppy drive, drive B: to a second floppy drive if there is one, and the C: drive to the hard disk. If a CD-ROM drive exists, it may be the D: or E: drive, depending on the system's configuration. When storing a file on disk, use the drive name to select the drive onto which you want to record the file's contents. Unlike a PC's RAM, which stores information electrically, disks record information magnetically, much like recording a television program on a VHS tape or a song on a cassette tape. Within a disk drive there is a small device called a read/write head that can record (magnetise) or read information on the disk's

surface. Within the drive, the disk spins rapidly past the read/write head, as shown in figure 12.1. A floppy disk, for example, may spin past the read/write head at 300 revolutions per minute (RPM), whereas a hard drive may spin at 3,600 to 10,000 RPM.

Figure 12.1: Disk Head and Magnetic Surface

To better understand how the drive records information on disk, examine a floppy disk. In the middle of the back of the floppy disk is a small metal spindle opening; when someone inserts a floppy disk into a drive, the drive uses this spindle opening to spin the disk.
The drive then opens the disk's metal shutter, as shown in figure 12.2, to access the disk's surface. By gently sliding the shutter open, one can see the disk media. Do not, however, touch the surface; doing so may damage the disk and the information it contains.

Figure 12.2: Cross Section of Floppy Disk


Because disks magnetise information onto their surface, the information does not require constant electricity, as RAM does. However, keep disks away from devices such as your phone or television, and away from static electricity, since the resulting magnetic flux may change the information recorded on the disk. Within a PC, one will normally have at least two disk drives: a high-capacity fast


hard drive and a floppy-disk drive that lets you insert and remove disks. Normally, the PC's hard drive resides within the PC's system unit.
Tape drive: The function of a tape drive is to read and write data on a tape surface. An audio cassette deck functions in much the same way; the only difference is that a tape drive records digital data. Tape storage generally holds data that is not needed frequently, for example backup copies of the hard disk. A tape drive must write data serially, because a tape is one long strip of magnetic material. The direct access offered by media like disks is much faster than the serial access of a tape drive.
When particular data on a tape must be accessed, the drive has to scan through all the preceding data, including data that is not required, which results in slow access times. Access time varies with the drive speed, the position of the data on the tape, and the length of the tape.
12.3.2 Optical storage
The most widely used kind of optical storage is the CD (compact disk). Compact disks are used in CD-R, CD-RW, CD-ROM, DVD-ROM and Photo CD systems. Nowadays, systems with DVD-ROM drives are preferred over standard CD-ROM units. Optical storage devices store data on a reflective surface, and a laser beam is used to read the data: a thin ray of light is directed and focused by means of lenses, mirrors and prisms, and because all the light has the same wavelength, the laser beam can be focused precisely.
CD-ROM: This stands for compact disk read only memory. To read data from a CD-ROM, a laser beam is directed onto the surface of the spinning disk. The areas that reflect the light back are read as 1s, and the ones that scatter the light and do not reflect it back are read as 0s. This is shown in figure 12.3.


Data on a CD is stored in one long spiral track that starts at the centre of the disk and ends at the outer edge.

Figure 12.3: Working of Optical Storage

A standard CD can store 650 MB of data, or almost 70 minutes of audio.
DVD-ROM: This stands for digital video (or versatile) disk read only memory. It is a high-density medium that can store a complete movie on a single disk. The high storage capacity is achieved by storing data on both sides of the disk, and the latest DVDs comprise multiple layers of data tracks: the laser beam first reads the first layer, then moves on to the second layer, and so on.
Photo CD, CD-R, CD-RW: With CD-Recordable (CD-R), you can create your own CD-ROM disks, which any CD-ROM drive can read; the information cannot be changed once it is written to the CD. With CD-RW (CD-Rewritable) drives, data can be both written and overwritten, so, much like a floppy disk, a CD-RW lets you revise the data. Photo CD is a popular form of CD-R; it is considered


a standard formulated by Kodak, used to store digitised photographic images on CD.
Self Assessment Questions
4. Physical components on which data is stored are called ________ .
5. RPM is the acronym for _____________ .
6. To read data from _____________ , a laser beam is directed on the
surface of a spinning disk.

12.4 Connection of I/O Devices to CPU/Memory


A computer must have a system to get information from the outside world and to communicate results to the external world. As we know, programs and data must be entered into the computer's memory for processing, and the results of calculations must be recorded or displayed for the user. The I/O interface provides the mechanism for transferring information between internal memory and the input-output devices. Special communication links are required to interface computer peripherals with the CPU. A peripheral is an external device that provides input or output for the computer, for instance a mouse, keyboard or printer. The connection is made through the I/O bus, which links the peripheral devices to the CPU.
Figure 12.4 shows the communication link between various peripherals and the processor.

Figure 12.4: Connection of I/O Bus to Input-Output Devices


The input-output bus consists of the following:
• data lines
• address lines
• control lines
A general-purpose computer uses peripherals such as printers and magnetic disks; magnetic tape is used in computers for backup storage. All peripheral devices are connected to the bus by means of interface units. Each interface decodes the address and control information received from the I/O bus, interprets them for its peripheral, and provides signals for the peripheral controller. It also synchronises the data flow and supervises the transfer between the processor and the peripheral. Every peripheral has its own controller, which operates a particular electromechanical device; for instance, a printer controller controls the paper movement, the print timing and the selection of printing characters. A controller may be housed separately or physically integrated with its peripheral. The I/O bus from the processor is attached to all peripheral interfaces.
To communicate with a device, the processor places the address of the device on the address lines. Every interface connected to the I/O bus includes an address decoder that monitors the address lines. When an interface identifies its own address, it activates the path between the bus lines and the device it controls; the interfaces whose address does not match the address on the bus keep their peripherals disabled. While the address lines carry the address, the processor also provides a function code on the control lines.
The selected interface responds to the function code and proceeds to execute it. You can consider the function code an input-output command: basically, an instruction that is executed in the interface and its attached peripheral unit.
The interface may receive several kinds of commands. The different kinds of commands are:
• Control: A control command is issued to activate the peripheral. The particular control command given depends on the peripheral; each peripheral receives its own distinct set of commands according to its mode of operation.
• Status: This command is used to test various status conditions in the peripheral and the interface. For instance, before initiating a transfer, the computer may want to verify the peripheral's status, and while the transfer is


in progress, some errors may take place; these errors are observed by the interface.
• Data output: The interface responds to this command by transferring data from the bus into one of its registers. As an example, consider a tape unit: with a control command the computer starts the tape moving, and the processor then monitors the status of the tape by means of a status command.
• Data input: With this command, the interface obtains a data item from the peripheral and places it in its buffer register. The processor checks the availability of the data using a status command and then issues the data input command, whereupon the interface puts the data on the data lines and the processor accepts it.
12.4.1 Input-Output vs. memory bus
Besides communicating with I/O, the processor must also communicate with the memory unit. The memory bus consists of the following:
• data lines
• address lines
• read/write control lines

Computer buses can communicate with I/O and memory using the following techniques:
• Use two separate buses, one for memory and one for I/O.
• Use one common bus for both memory and I/O, but with separate control lines for each.
• Use one common bus for memory and I/O with common control lines.
In the first technique, the computer has independent sets of data, address and control buses, one set for accessing memory and one for I/O.
This is done in computers that provide a separate input-output processor (IOP) in addition to the central processing unit (CPU). The memory communicates with both the CPU and the IOP through a memory bus, while the IOP communicates with the input and output devices through a separate I/O bus with its own data, address


and control lines. The IOP thus provides an independent path for transferring information between internal memory and the external devices.
12.4.2 Isolated versus memory-mapped I/O
Information transfer between the CPU and either memory or I/O can be carried over one common bus. In the isolated I/O method, memory transfers and I/O transfers are distinguished by separate control lines for their read and write operations. The CPU indicates whether the address on the address lines refers to a memory word or to an interface register by enabling the appropriate lines: during an I/O transfer, the I/O read and I/O write control lines are enabled, and during a memory transfer, the memory read and memory write lines are enabled. In this configuration, the I/O interface addresses are isolated from the addresses assigned to memory; this arrangement, used with a common bus, is known as the isolated I/O method. In memory-mapped I/O, by contrast, all peripheral devices are treated as memory locations.
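A C sketch of memory-mapped I/O (ours; the register addresses, macro names and bit layout are hypothetical, chosen only to illustrate the idea): because the device's registers occupy memory addresses, ordinary loads and stores drive the peripheral:

#include <stdint.h>

/* Hypothetical device registers mapped into the address space. */
#define DEV_STATUS (*(volatile uint8_t *)0x10000000u)  /* status register */
#define DEV_DATA   (*(volatile uint8_t *)0x10000004u)  /* data register   */
#define READY_BIT  0x01u

void mmio_putc(char c)
{
    while ((DEV_STATUS & READY_BIT) == 0)  /* poll: wait until device ready */
        ;
    DEV_DATA = (uint8_t)c;                 /* a plain store reaches the device */
}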
12.4.3 Example of I/O interface
Figure 12.5 shows an example of an I/O interface. It has two data registers
called ports, a control register, a status register, bus buffers, and timing and
control circuits. The interface communicates with the CPU through the data
bus.


CS  RS1  RS0   Register selected
0   X    X     None: data bus in high-impedance
1   0    0     Port A register
1   0    1     Port B register
1   1    0     Control register
1   1    1     Status register

Figure 12.5: Example of I/O Interface Unit

The chip select and register select inputs determine the address assigned to the interface. The I/O read and I/O write inputs are two control lines that specify an input or an output transfer, respectively. The four registers (the port A register, port B register, control register and status register) communicate directly with the I/O device attached to the interface. The input-output data to and from the device can be transferred into either port A or port B.
The interface may operate with an output device or with an input device, or
with a device that requires both input and output. If the interface is connected


to a printer, it will only output data, and if it services a character reader, it will
only input data. A magnetic disk unit is used to transfer data in both directions
but not at the same time, so the interface can use bidirectional lines. A
command is passed to the I/O device by sending a word to the appropriate
interface register.
In a system like this, the function code in the I/O bus is not needed because
control is sent to the control register, status information is received from the
status register, and data are transferred to and from ports A and B registers.
Thus the transfer of data, control, and status information is always via the
common data bus.
The distinction between data, control and status information is determined by the particular interface register with which the CPU communicates. The control register receives control information from the CPU; by loading appropriate bits into the control register, the interface and the I/O device attached to it can be placed in a variety of operating modes. For example, port A may be defined as an input port and port B as an output port, or a magnetic tape unit may be instructed to rewind the tape or to start the tape moving in the forward direction. The bits in the status register are used for status conditions and for recording errors that may occur during the data transfer. For example, a status bit may indicate that port A has received a new data item from the I/O device.
The interface registers communicate with the CPU over the bidirectional data bus. The address bus selects the interface unit through the chip select and the two register select inputs. A circuit, usually a decoder, must be provided externally to detect the address assigned to the interface registers; this circuit enables the chip select (CS) input when the interface is addressed on the address bus. The two register select inputs, RS1 and RS0, are usually connected to the two least significant lines of the address bus; they select one of the four registers in the interface, as specified in the table accompanying the diagram. The content of the selected register is transferred to the CPU via the data bus when the I/O read signal is enabled, and the CPU transfers binary information into the selected register via the data bus when the I/O write input is enabled.
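A small C sketch (ours) of the selection logic in the table above: the chip select gates the whole interface, and RS1/RS0 pick one of its four registers:

#include <stdio.h>

/* Decode CS, RS1, RS0 as in the table accompanying figure 12.5. */
const char *select_register(int cs, int rs1, int rs0)
{
    if (!cs)
        return "none (data bus in high impedance)";
    switch ((rs1 << 1) | rs0) {
        case 0:  return "port A register";
        case 1:  return "port B register";
        case 2:  return "control register";
        default: return "status register";
    }
}

int main(void)
{
    printf("%s\n", select_register(1, 0, 1));  /* port B register */
    printf("%s\n", select_register(0, 1, 1));  /* none            */
    return 0;
}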
Self Assessment Questions
7. ________ is used in computers for backup storage.
8. ________ from the processor is attached to all peripheral interfaces.
9. A ____________ is issued to test various status conditions in the


interface and the peripheral.

Activity 1:
Visit an IT organisation and observe the functioning of the I/O interface and
the data lines, control lines, and I/O bus architecture. Also, check whether
the I/O system used is isolated or memory-mapped.

12.5 Reliability, Availability and Dependability of Storage System


Response time and throughput receive considerable attention in processor design, whereas reliability receives more attention in storage systems than in processors. The terms reliability, availability and dependability are often confused with each other. Here is a clearer distinction:
Reliability - Is anything broken?
Availability - Is the system still available to the users?
Dependability - Is the system worth trusting?
Adding hardware can therefore improve availability (for example, Error
Correcting Code (ECC) on memory), but it cannot improve reliability (the
DRAM is still broken). Reliability can only be improved by bettering
environmental conditions, by building from more reliable components, or by
building with fewer components. Another term, data integrity, refers to
consistent reporting when information is lost because of failure; this is very
important in some applications.
Disk array is one innovation that improves both availability and performance
of storage systems. Since price per megabyte is independent of disk size,
potential throughput can be increased by having many disk drives and, hence,
many disk arms.
Simply spreading data over multiple disks, called striping, automatically forces accesses to several disks. (Although arrays improve throughput, latency is not necessarily improved.) The drawback to arrays is that with more devices reliability drops: N devices generally have 1/N the reliability of a single device. With redundancy, however, if a single disk fails, the lost information can be reconstructed from the redundant information. The only danger is a second disk failure between the time a disk fails and the time it is replaced (termed the mean time to repair, or MTTR). Since the mean time to failure (MTTF) of disks is five or more years, and the MTTR is measured in hours, redundancy can make the

availability of 100 disks much higher than that of a single disk. These systems
have become known by the acronym RAID, which stands for redundant array
of inexpensive disks. We will study this topic in the next section.
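A rough numerical model (ours, not from the text) of the figures just quoted, using the standard relation availability = MTTF / (MTTF + MTTR); the one-day repair time is an assumption:

#include <stdio.h>

int main(void)
{
    double mttf_disk = 5.0 * 365 * 24;   /* five years, in hours */
    double mttr      = 24.0;             /* assume one day to replace a disk */
    int    n         = 100;

    double mttf_array = mttf_disk / n;   /* N devices: roughly 1/N the MTTF */

    /* availability = MTTF / (MTTF + MTTR) */
    double avail_disk  = mttf_disk  / (mttf_disk  + mttr);
    double avail_array = mttf_array / (mttf_array + mttr);

    printf("single disk availability:      %.6f\n", avail_disk);
    printf("100-disk array, no redundancy: %.6f\n", avail_array);
    return 0;
}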
Self Assessment Questions
10. ____________ refers to consistent reporting when information is lost
because of failure.
11. ____________ is an innovation that improves both availability and
performance of storage systems

12.6 RAID
RAID is the acronym for 'redundant array of inexpensive disks'. There are several approaches to redundancy, with different overheads and performance. The Patterson, Gibson, and Katz 1987 paper introduced the term RAID and used a numerical classification for these schemes that has become popular; in fact, the non-redundant disk array is sometimes called RAID 0. One challenge is discovering when a disk fails. Magnetic disks help by providing information about their correct operation: information recorded in each sector helps detect errors in that sector, and transferring sectors lets the attached electronics discover disk failures or loss of information.
The levels of RAID are as follows:
12.6.1 Mirroring (RAID 1)
Mirroring or shadowing is the traditional solution to disk failure. It uses twice as many disks: data is simultaneously written to two disks, one non-redundant and one redundant, so that there are always two copies of the data. If one disk fails, the system goes to the mirror disk to get the required information. This technique is the most expensive solution.
12.6.2 Bit-Interleaved parity (RAID 3)
Bit-interleaved parity is an error detection technique in which character bit patterns are forced into parity so that the total number of 1 bits is always odd or even. This is done by adding a "1" or "0" bit to each byte as the character/byte is transmitted; at the other end of the transmission, the parity is checked for accuracy. BIP is also a method used at the physical layer (high-speed transmission of binary data) to monitor errors.
The cost of higher availability can be reduced to 1/N, where N is the number
Manipal University Jaipur B1648 Page No. 266
Computer Architecture Unit 1

of disks in a group. In this case, we need only enough redundant information


required to restore the lost information, instead of having the complete original
copy. Reads or writes go to all disks in the group, with one extra disk to hold
the check information in case there is a failure.
RAID 3 is popular in applications with large data sets, for example multimedia
and several scientific codes. Parity is one such scheme: one redundant disk
holds the parity, in effect the sum of all the data on the other disks. When a
disk fails, the data on all the good disks is subtracted from the parity disk; the
remaining information is the missing information. The assumption here is that
failures are rare enough that taking longer to recover from a failure, in
exchange for reduced redundant storage, is a good trade-off. Mirroring can be
seen as the special case of one data disk and one parity disk (N = 1): parity
then amounts to simply duplicating the data, so mirrored disks have the
advantage of simpler parity calculations. The redundancy of N = 1, however,
has the highest overhead for increasing disk availability.
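To make the parity idea concrete, here is a hedged sketch in Python; the disk
contents and group size are illustrative only. In practice the “sum” is computed
with bitwise XOR, which is its own inverse, so the same function both computes
the parity and reconstructs a lost disk:

    from functools import reduce

    def parity(blocks):
        # Parity block = bitwise XOR of the corresponding bytes of all blocks.
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    # Three illustrative data "disks" (equal-sized blocks).
    data = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
    parity_disk = parity(data)

    # Disk 1 fails: XOR-ing the survivors with the parity disk recovers it.
    recovered = parity([data[0], data[2], parity_disk])
    assert recovered == data[1]

Because XOR “subtracts” as well as adds, reconstructing the failed disk is the
same operation as computing the parity in the first place.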
12.6.3 Block-interleaved distributed parity (RAID 5)
This level uses the same ratio of disks (data disks and check disks) as RAID
3, but data is accessed differently. In the prior organisation every access went
to all disks. Some applications would prefer to do smaller accesses, allowing
independent accesses to occur in parallel. That is the purpose of this next
RAID level. Since error-detection information in each sector is checked on
reads to see if data is correct, such “small reads” to each disk can occur
independently as long as the minimum access is one sector. Writes are
another matter. It would seem that each small write would demand that all
other disks be accessed to read the rest of the information needed to
recalculate the new parity. With four data disks, for example, a “small write”
would require reading the other three data disks, adding the new information,
and then writing the new parity to the parity disk and the new data to the
data disk.
The key to reducing this overhead is that parity is simply a sum of
information: by watching which bits change when we write the new
information, we need only change the corresponding bits on the parity disk.
We must read the old data, compare old and new data to see which bits
change, read the old parity, change the corresponding bits, and then write the
new data and the new parity. Thus, the small write involves four accesses to
two disks instead of accessing all disks. This organisation is RAID 4. RAID 4
supports mixtures of large reads, large writes, small reads and small writes.
One drawback to the scheme is that the parity disk must be updated on every
write, so it is the bottleneck for sequential writes.
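The read-modify-write update is one line of XOR arithmetic. The following
hedged sketch (illustrative block values, not a real disk driver) shows the
four accesses of a RAID 4/5 small write:

    def small_write(old_data, new_data, old_parity):
        # Two reads (old data, old parity) and two writes (new data, new parity):
        # the new parity flips exactly the bits that changed in the data block.
        delta = bytes(o ^ n for o, n in zip(old_data, new_data))
        new_parity = bytes(p ^ d for p, d in zip(old_parity, delta))
        return new_data, new_parity

    old_data, old_parity = bytes([0b1010]), bytes([0b0110])
    _, new_parity = small_write(old_data, bytes([0b1001]), old_parity)
    assert new_parity == bytes([0b0101])   # bits 0 and 1 flipped in data and parity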
To fix the parity-write bottleneck, the parity information is spread throughout
all the disks so that there is no single bottleneck for writes. This distributed
parity organisation is RAID 5. Figure 12.6 shows how data is distributed in
RAID 4 and RAID 5.

Figure 12.6: Block-interleaved Parity (RAID 4) versus Distributed Block-interleaved Parity (RAID 5)
As the organisation on the right shows, in RAID 5 the parity associated with
each row of data blocks is no longer restricted to a single disk. This
organisation allows for multiple writes to occur simultaneously as long as the
parity blocks are not located in the same disks. For example, a write to block
8 on the right must also access its parity block P2, thereby occupying the first
and third disks. A second write to block 5 on the right, implying an update to
its parity block P1, accesses the second and fourth disks and thus could occur
at the same time as the prior write to block 8. Thus, RAIDs are playing an
increasing role in storage systems.
Self Assessment Questions
12. RAID is the acronym for ____________ .
13. ____________ uses twice as many disks.

12.7 I/O Performance Measures


The two most common measures of I/O performance are diversity and
capacity.
• Diversity: Which I/O devices can connect to the computer system?
• Capacity: How many I/O devices can connect to a computer system?
Other traditional measures of performance are throughput (sometimes called
bandwidth) and response time (sometimes called latency). Figure 12.7 shows
the simple producer-server model. The producer creates tasks to be
performed and places them in a buffer; the server takes tasks from the
first-in-first-out buffer and performs them.

Figure 12.7: Producer-Server Model of Response Time and Throughput

Response time is defined as the time a task takes from the moment it is placed
in the buffer until the server completes it. Throughput, in simple words, is the
average number of tasks completed by the server over a period of time. To
reach the maximum level of throughput, the server should never be idle, so
the buffer should never be empty. Response time, in contrast, includes the
time spent waiting in the buffer and is minimised when the buffer is empty.
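These two goals pull in opposite directions, which a few lines of simulation can
show. This is a hedged, deterministic sketch with made-up arrival and service
times, not a measurement of any real system:

    def fifo_server(arrival_times, service_time):
        # Single FIFO server: return (throughput, mean response time).
        server_free, responses = 0.0, []
        for t in arrival_times:
            start = max(t, server_free)          # wait if the server is busy
            server_free = start + service_time   # server occupied until done
            responses.append(server_free - t)    # buffer wait + service time
        return len(arrival_times) / server_free, sum(responses) / len(responses)

    # Tasks arrive faster than the server can drain them: throughput stays at
    # its maximum (0.5 tasks per time unit), but mean response time grows.
    print(fifo_server([0, 1, 2, 3, 4], service_time=2.0))   # (0.5, 4.0)

Keeping the buffer full maximises throughput, yet every queued task waits
longer, which is exactly the response-time penalty described above.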
Improving performance does not always mean improving both response time
and throughput. Throughput can be increased by adding more servers, as
shown in figure 12.8: spreading data across two disks instead of one enables
tasks to be performed in parallel. Unfortunately, this does not help response
time, unless the workload is held constant and the time spent in the buffers is
reduced because of the extra resources.

Figure 12.8: Single-Producer Model Extended with another Server and Buffer
How does the architect balance these conflicting demands? If the computer
is interacting with human beings, figure 12.9 suggests an answer.

Figure 12.9: An Interactive Computer Divided into Entry Time, System Response Time, and User Think Time
(The figure compares a conventional interactive workload and a high-function graphics workload, each measured at 1.0 sec. and 0.3 sec. system response time; each bar is divided into entry time, system response time and think time.)

This figure presents the results of two studies of interactive environments: one
keyboard oriented and one graphical. An interaction, or transaction, with a
computer is divided into three parts:
1. Entry time - The time for the user to enter the command. The graphics
system in figure 12.9 took 0.25 seconds on average to enter a command
versus 4.0 seconds for the keyboard system.
2. System response time - The time from when the user enters the
command until the complete response is displayed.
3. Think time - The time from the reception of the response until the user
begins to enter the next command.
The sum of these three parts is called the transaction time. Several studies
report that user productivity is inversely proportional to transaction time;
transactions per hour are a measure of the work completed per hour by the
user.
The results in figure 12.9 show that a reduction in response time decreases
total transaction time by more than just the response-time reduction:
cutting system response time by 0.7 seconds saves 4.9 seconds (34%) from
the conventional transaction and 2.0 seconds (70%) from the graphics
transaction. This surprising result is explained by human nature: people need
less time to think when given a faster response.
Self Assessment Questions
14. _______ is also known as bandwidth.
15. ________________ is sometimes called latency.

Activity 2:
Visit an organisation. Find the level of reliability, availability and dependability
of the system used. Also, measure the I/O performance.

12.8 Summary
Let us recapitulate the important concepts discussed in this unit:
• A computer must have a system to get information from the outside world
and must be able to communicate results to the external world.
• Each time a PC is shut down, the contents of the PC’s random-access
memory (RAM) are lost.


• There are two main categories of the storage devices: Magnetic Storage
and Optical Storage.
• A tape drive reads and writes data on the surface of a tape the same way
as an audiocassette recorder; the difference is that a computer tape drive
writes digital data.
• The most widely used type of optical storage medium is the compact disk
(CD), which is used in CD-ROM, DVD-ROM, CD-R, CD-RW and Photo CD
systems.
• Response time and throughput are given considerable attention in
processor design, whereas reliability is given more attention in storage
systems than in processors.
• RAID is the acronym for redundant array of inexpensive disks. There are
several approaches to redundancy that have different overhead and
performance.
• Response time is defined as the time a task takes from the moment it is
placed in the buffer until the server completes it. Throughput is the
average number of tasks completed by the server over a period of time.

12.9 Glossary
• Bus Interface: Communication link between the processor and several
peripherals.
• CD-R: Compact disk recordable
• CD-ROM: Compact disk read only memory
• CD-RW: Compact disk rewritable
• DVD-ROM: Digital video (or versatile) disk read only memory
• Input devices: Computer peripherals used to enter data into the computer.
• Input-Output Interface: This gives a method for transferring information
between internal memory and I/O devices.
• Input-Output Processor (IOP): An external processor that
communicates directly with all I/O devices and has direct memory access
capabilities.
• Output devices: Computer peripherals used to get output from the
computer.
• RAM: Random Access Memory
• RPM: Revolutions per Minute
12.10 Terminal Questions


1. What do you understand by system storage?


2. Explain briefly the various types of storage devices available.
3. Describe the communication link between the processor and several
peripherals.
4. What is the difference between isolated I/O and memory mapped I/O?
What are the advantages and disadvantages of each?
5. Give an example of an I/O interface unit.
6. Define RAID. Also explain the levels of RAID.

12.11 Answers
Self Assessment Questions
1. Computer architecture
2. Computer organisation
3. Random access memory
4. Storage media
5. Revolutions per minute
6. CD-ROM
7. Magnetic tape
8. I/O bus
9. Status command
10. Data integrity
11. Disk array
12. Redundant array of inexpensive disks
13. Mirroring or shadowing
14. Throughput
15. Response time
Terminal Questions
1. To keep information from one computer session to another, one must store
the information within a file that is ultimately stored on disk. This is called a
storage system. Refer Section 12.2.
2. There are two main categories of the storage devices: magnetic storage
and optical storage. Refer section 12.3.
3. Peripherals connected to a computer require special communication links
for interfacing them with the CPU. This is done through the I/O bus, which
connects the peripheral devices to the CPU. Refer section 12.4.
4. Memory transfer and I/O transfer differ in that they use separate read and
write lines. Refer section 12.4.2.


5. An example of an I/O interface has two data registers called ports, a control
register, a status register, bus buffers, and timing and control circuits.
Refer section 12.4.3.
6. RAID is the acronym for redundant array of inexpensive disks. There are
several approaches to redundancy that have different overhead and
performance. Refer section 12.6.

References:
• Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability,
Programmability, McGraw-Hill.
• Michael J. Flynn, Computer Architecture: Pipelined & Parallel Processor
Design, Narosa.
• J. P. Hayes, Computer Architecture & Organisation, McGraw-Hill.
• Nicholas P. Carter, Schaum’s Outline of Computer Architecture, McGraw-
Hill Professional.

E-references:
• www.es.ele.tue.nl
• www.stanford.edu
• ece.eng.wayne.edu


Unit 13 Scalable, Multithreaded and Data Flow Architecture
Structure:
13.1 Introduction
Objectives
13.2 Multithreading
What is a thread?
Need of multithreading
Benefits of multithreading system
13.3 Principles of Multithreading
13.4 Scalable and Multithreaded Architecture
13.5 Computational Models
13.6 Von Neumann-based Multithreaded Architectures
Organisation and operation of the Von Neumann architecture
Key features
13.7 Dataflow architecture
Dataflow programming
Dataflow graph
13.8 Hybrid Multithreaded Architecture
13.9 Summary
13.10 Glossary
13.11 Terminal Questions
13.12 Answers

13.1 Introduction
In the previous unit, you studied about storage systems. You covered various
aspects such as types of storage devices, connecting I/O devices to
CPU/memory, reliability, availability and dependability, RAID, I/O performance
measures. Multithreading is a type of multitasking. Prior to Win32, the only
type of multitasking available was cooperative multitasking, which had no
concept of priorities. The multithreading system has a concept of priorities
and is therefore also called background processing or pre-emptive
multitasking.
Dataflow architecture is in direct contrast to the traditional Von Neumann
architecture or control flow architecture. Although dataflow architecture has
not been used in any commercially successful computer hardware, it is very
relevant in many software architectures, such as database engine designs and
parallel computing frameworks. A system whose performance improves after
adding hardware, proportionally to the capacity added, is said to be a scalable
system.
In this unit, you will learn about multithreading, dataflow and scalable
architecture and their various aspects such as principles of multithreading,
scalable and multithreaded architecture, computational models, Von
Neumann - based multithreaded architectures, dataflow architecture and
Hybrid multithreaded architecture.
Objectives:
After studying this unit, you should be able to:
• define the term multithreading
• recognise the need and benefits of multithreading
• describe the principles of multithreading
• identify scalable and multithreaded architecture
• discuss about computational models
• describe von Neumann- based multithreaded architectures
• explain dataflow architecture
• create hybrid multithreaded architecture

13.2 Multithreading
Multithreading is the capability of a processor to do multiple things at one time.
The Windows operating system uses the API (Application Programming
Interface) calls to manipulate threads in multithreaded applications. Before
discussing the concepts of multithreading, let us first understand what a thread
is.
13.2.1 What is a thread?
Each process, created when an application is run, consists of at least one
thread that contains code. All the code within a thread, when it is active, is
executed consecutively, one line after another. In a multithreading system,
many threads belonging to that particular process run concurrently. A thread
can be viewed as an independent program counter within a process, indicating
the location of the instruction that the thread is currently executing. A
thread has the following features:
• A state of thread execution


• The saved thread context when not running.


• A stack tracing the execution path.
• Some space for local variables
Multithreading is supported by almost all modern operating systems, such as
Windows XP, Solaris, Linux, and OS/2, while traditional operating systems
such as DOS and UNIX support the concept of single threading. A Java Virtual
Machine (JVM) is also an example of a multithreading system.
13.2.2 Need of multithreading
Both single-threaded and multithreaded process models, shown in figure 13.1,
have their own importance; here we will discuss the need for multithreading.

Figure 13.1: Single Threaded and Multithreaded Process Models

Multithreading is needed to create an application that is able to perform more


than one task at once. For example, all GUI (Graphical User Interface)
programs can perform more than one task (such as editing a document as well
as printing it) at a time. A multithreading system can perform the following
tasks:
• Manage inputs from many windows and devices
• Distinguish tasks of varying priority.
• Allow the user interface to remain responsive all the time
• Allocate time to the background tasks


Although these tasks can be performed using more than one process, it is
generally more efficient to use a single multithreaded application because the
system can perform a context switch more quickly for threads than processes.
Moreover, all threads of a process share the same address space and
resources, such as files and pipes.
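As a hedged illustration, the sketch below uses Python’s threading module to
stand in for any OS threading API. One process runs two tasks concurrently
(the editing and printing example above); both threads share the same
address space, so both see the same variable:

    import threading

    document = "report.txt"    # shared state visible to every thread in the process

    def edit():
        print("editing", document)

    def print_job():
        print("printing", document)

    # One process, two concurrently executing threads.
    threads = [threading.Thread(target=edit), threading.Thread(target=print_job)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()               # wait for both threads to terminate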
13.2.3 Benefits of multithreading system
A multithreading system provides the following benefits over a multiprocessing
system:
• Threads improve communication between different execution traces,
since the same user address space is shared.
• In an existing process, creating a new thread is much less time-consuming
than creating a brand-new process.
• Terminating a thread also takes less time.
• Also, switching control between two threads within the same process takes
less time than switching between two processes.
Self Assessment Questions
1. The multithreading system has a concept of priorities and therefore, is
also called __________ or __________ .
2. It takes much more time to create a new thread in an existing process
than to create a brand-new process. (True/False)

13.3 Principles of Multithreading


There are important parameters which characterise the performance of
multithreading. They include:
• The number of threads in each processor
• The number of remote reads in each thread
• Context switch mechanism
• Remote memory latency
• Remote memory servicing mechanism
• The number of instructions in a thread
Let’s briefly explain the implications of these issues below.
• The number of active threads: This specifies the level of parallelism. The
level of parallelism can be categorised into computation parallelism and
communication parallelism. Computation parallelism is the 'conventional'
parallelism, while communication parallelism is the means by which
threads communicate with threads residing in other processors.


• The number of remote reads in each thread determines the number of
occurrences of thread switching and consequently the run length. There is
a thread switch for every remote read, so the number of switches is
proportional to the number of remote reads. Hence, it is desirable that the
remote reads be distributed evenly over the life of a thread. This
distribution determines the thread run length: the number of uninterrupted
instructions carried out between two consecutive remote reads. The
performance of multithreading is strongly affected by this factor. With a
small run length, it is hard to tolerate the latency, because there are not
enough instructions to execute while the remote read is outstanding.
• Thread switch describes how control is transferred from one thread to
another. There are two types of context switches: explicit switching and
implicit switching. Implicit switching shares registers among the multiple
threads, while explicit switching does not. Implicit switching is literally
implicit: from the viewpoint of the registers, there is essentially no visible
switch. This method thus needs little or no switching overhead; however,
the scheduling of registers and threads can be a difficult job. In explicit
switching, threads do not share registers: a single thread uses all the
registers, so there is no issue of how registers and threads are scheduled.
• Hiding communication latency is the aim of multithreading. The latency
is influenced by the technology used to build the machine and by the
interconnection network. The network bandwidth should be comparable
with the processor clock speed; a large disparity between the machine
clock speed and the network bandwidth can be problematic when using
multithreading.
• The number of instructions in a thread is known as thread granularity.
Thread granularity can be classified into three categories: fine-grain,
medium-grain, and coarse-grain. Fine-grain threading usually means a
thread of a few to tens of instructions and is used for instruction-level
multithreading. Medium-grain threading is loop-level or function-level
threading and consists of hundreds of instructions. Coarse-grain threading
is task-level threading, where each thread consists of thousands of
instructions. A simple analytical model tying these parameters together is
sketched below.
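The following hedged sketch implements a simple analytical model of
multithreaded processor utilisation of the kind often used in the literature;
the parameter values are invented for illustration. With run length R,
context-switch cost C and remote latency L, utilisation grows linearly with
the number of threads until there are enough threads to hide the latency,
then flattens at R / (R + C):

    def utilization(n_threads, run_length, switch_cost, latency):
        # Saturation point: enough threads to overlap the whole remote latency.
        saturation = (run_length + switch_cost + latency) / (run_length + switch_cost)
        if n_threads >= saturation:
            return run_length / (run_length + switch_cost)
        # Linear region: the processor still idles waiting on remote reads.
        return n_threads * run_length / (run_length + switch_cost + latency)

    for n in (1, 2, 4, 8):
        print(n, round(utilization(n, run_length=20, switch_cost=2, latency=100), 2))

With these illustrative numbers, utilisation climbs from 0.16 with one thread
to about 0.91 once roughly six threads are available, which is the intuition
behind balancing run length, switch cost and thread count.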


Self Assessment Questions


3. The number of switches is proportional to the number of remote reads.
(True/ False)
4. ____________ typically refers to a thread of a few to tens of instructions.

Activity 1:
Visit a library and find out more details about the various models of
multithreading like Blocked model, Forking model, Process-pool model and
Multiplexing model.

13.4 Scalable and Multithreaded Architecture


Scalability is the ability to increase the amount of processing that can be done
by adding further resources to a system. It differs from performance in that it
does not improve performance but rather sustains performance by providing
higher throughput. In other words, performance is the system's response time
under a typical load, whereas scalability is the ability of a system to increase
that load without degrading response time.
A computer architecture that is designed to scale across more than one
processor is called a scalable architecture. Almost all business applications
are scalable. Scalability can be achieved in several ways, such as using more
powerful CPUs or adding extra CPUs.
There are two different modes of scaling: Scale up and scale out. Scaling up
is achieved by adding extra resources to a single machine to allow an
application to service more requests. The most common ways to do this are
by adding memory (RAM) or to use a faster CPU. Scaling out is achieved by
adding servers to a server group to make applications scale by scattering the
processing among multiple computers. An understanding of the bottlenecks
and the applications of each scaling method is required before a particular
method can be productively utilised.
Multithreading is the capability of a processor to utilise multiple threads of
execution simultaneously in one application. In simple words, it allows a
program to do two things at once. When an application is run, each of the
processes contains at least one thread. However, many concurrent threads
may belong to one process in a multithreading system. For example, a Java
Virtual Machine (JVM) is a system of one process with multiple threads. Most


recent operating systems, such as Solaris, Linux, Windows 2000 and OS/2,
support multiple processes with multiple threads per process. However, the
traditional operating system MS-DOS supports a single user process and a
single thread. Some traditional UNIX systems are multiprogramming systems
as they maintain multiple user processes but only one execution path is
allowed for each process.
Self Assessment Questions
5. Almost all business applications are _________________ .
6. When an application is run, each of the processes contains at least one
____________ .

13.5 Computational Models


A mathematical model in computational science that requires extensive
computational resources to examine the performance of a complex system by
computer simulation is known as a computational model. The system under
study is often a complex non-linear system for which simple, intuitive
analytical solutions are not readily available. Instead of deriving a
mathematical analytical solution to the problem, the model is tested by
changing the parameters of the system in the computer and examining the
differences in the outcomes of the experiments. These computational
experiments help derive or deduce theories of the operation of the model.
Mathematical language is used to describe such a system. Mathematical
models are used most extensively not only in the natural sciences and
engineering disciplines but also in the social sciences, by physicists,
engineers, computer scientists, and economists. Thus, the term 'mathematical
modelling' is given to the process of developing a mathematical model.
The combination of languages and the computer architecture in a common
foundation or paradigm is called Computational Model. The concept of
computational model represents a higher level of abstraction than either the
computer architecture or the programming language. There are basically three
types of computational models as follows:
• Von Neumann model
• Dataflow model
• Hybrid multithreaded model
In the following sections we will discuss each one of these models in detail.


Self Assessment Questions


7. The combination of languages and the computer architecture in a
common foundation or paradigm is called ___________ .
8. Computational model uses mathematical language to describe a system
(True/ False)

13.6 Von Neumann-based Multithreaded Architectures


Any discussion of the foundation of computer architectures (how computers
and computer systems are structured, designed, and put into operation)
invariably uses the "Von Neumann architecture" as a basis for comparison,
because virtually every electronic computer ever built is rooted in this
architecture.
The Central Processing Unit (CPU), consisting of the control unit and the ALU
(Arithmetic and Logic Unit), is the core of the Von Neumann computer
architecture. The CPU relates with a memory and an input/output (I/O)
subsystem and a stream of instructions are executed. As per this architecture
both data and instructions, are stored in the memory system in the same way.
Therefore, it defines the memory content completely in the way it is interpreted.
This is vital, suppose, for a program compiler that converts a user-
understandable programming language into the instruction stream
understandable by machine. The compiler gives ordinary data as the output.
However, the CPU then executes these data as instructions. It can carry out
various instructions for moving and altering data, and for deciding upon the
instructions to be executed next. The assortment of instructions is referred to
as the instruction set, and collectively with the resources required for their
implementation, it is called the Instruction Set Architecture (ISA).
Instruction execution is driven by a cyclic clock signal. Even though numerous
sub-steps have to be carried out for the implementation of each instruction,
improved CPU implementation technologies can overlap these steps such
that, ideally, one instruction completes each clock cycle.
Clock rates of today's processors are in the range of 2 to 3.4 GHz, allowing
billions of basic operations (such as adding two numbers or
copying a data item to a storage location) to be performed each second. With
the recent advancements in technology, CPU speeds have increased rapidly.
Consequently, factors such as slower I/O operations and the memory system
limit the overall speed of a computer system, since the speed of these
components has improved at a slower rate than CPU technology.
Caches improve the average speed of memory systems by keeping the most
commonly used data in a fast memory close to the processor. Another factor
limiting CPU speed increases is the inherently sequential character of Von
Neumann instruction execution. Parallel processing architectures are now
being developed that provide methods of executing several instructions
concurrently.
13.6.1 Organisation and operation of the Von Neumann architecture
As mentioned in section 13.6, the core of a computer system with the Von
Neumann architecture is the CPU. This element obtains (i.e., reads)
instructions and data from the main memory and coordinates the complete
execution of every instruction. It is usually structured into two subunits:
the Arithmetic and Logic Unit (ALU) and the control unit. Figure 13.2
shows the basic components of a Von Neumann model.

Figure 13.2: The Basic Components of a Computer with Von Neumann Architecture

The ALU combines and transforms data using arithmetic operations, such as
addition, subtraction, multiplication, and division, and logical operations,
such as bit-wise negation, AND, and OR.
The control unit interprets the instructions retrieved from the memory and
manages the operation of the whole system. It establishes the sequence in
which instructions are carried out and offers all of the electrical signals
essential to manage the operation of the ALU and the interfaces to the other


system components.
The memory is a set of storage cells, each of which can be in one of two
different states. One state signifies a value of “0”, and the other signifies a
value of “1”. By distinguishing these two logical states, each cell can store a
single binary digit, or bit, of information. These bit storage cells are logically
arranged into words, each of which is b bits wide. Every word is allotted a
unique address in the range [0, ..., N - 1].
The CPU identifies the word that it needs to read or write by storing its
address in a special memory address register (MAR); a register temporarily
stores a value within the CPU. The memory responds to a read request by
reading the value stored at the requested address and transferring it to the
CPU via the CPU-memory data bus. The value is then temporarily stored in
the memory buffer register (MBR) (also sometimes called the memory data
register) before it is used by the control unit or ALU. For a write operation,
the CPU stores the value it wishes to write into the MBR and the
corresponding address in the MAR. The memory then copies the value from
the MBR into the address pointed to by the MAR.
Finally, the input/output (I/O) devices connect the computer system with the
external world. These devices let programs and data be entered into the
system and give the system a way to control output devices. Each I/O port
has a distinctive address to which the CPU can either read or write a value.
From the CPU's point of view, an I/O device is accessed in much the same
way as memory. In fact, in a number of systems the hardware makes it
appear to the CPU that the I/O devices are memory locations. This
configuration, in which the CPU sees no difference between memory and I/O
devices, is referred to as memory-mapped I/O. In this case, no distinct I/O
instructions are necessary.
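The read path through the PC, MAR and MBR can be made concrete with a toy
simulator. This is a hedged sketch built around an invented three-instruction
ISA (LOAD, ADD, HALT); it illustrates the fetch-decode-execute cycle only
and does not model any real machine:

    def run(memory):
        # Minimal Von Neumann fetch-decode-execute loop. The same memory
        # holds instructions (opcode, operand pairs) and plain data words.
        pc, acc = 0, 0                     # program counter and accumulator
        while True:
            mar = pc                       # address of the next instruction
            mbr = memory[mar]              # fetch through the buffer register
            pc += 1                        # sequential execution by default
            opcode, operand = mbr
            if opcode == "LOAD":
                acc = memory[operand]      # acc = memory[operand]
            elif opcode == "ADD":
                acc += memory[operand]     # acc = acc + memory[operand]
            elif opcode == "HALT":
                return acc

    # Code and data share one memory: words 0-2 are instructions, 3-4 are data.
    program = [("LOAD", 3), ("ADD", 4), ("HALT", 0), 10, 32]
    print(run(program))                    # 42

A branch instruction would simply store a new value into pc, which is exactly
the mechanism for altering the sequential execution order described in the
next subsection.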
13.6.2 Key features
In a basic organisation, processors having the Von Neumann architecture are
differentiated from simple pre-programmed (or hardwired) controllers by
several key features. First, the same main memory stores both instructions
and data; consequently, instructions and data are not distinguished. Likewise,
different types of data, such as a floating-point value, a character code, or an
integer value, are not distinguished. The interpretation of a particular bit
pattern depends completely on how the CPU uses it. The same value stored
at a particular memory location can be interpreted as an instruction
or as data at different times. For example, when a compiler executes, it reads the
source code of a program written in a high-level language, such as FORTRAN
or COBOL, and converts it into a series of instructions that can be executed
by the CPU. The memory stores the output of the compiler like any other type
of data. That same compiler output can then be executed by the CPU simply
by interpreting it as instructions. Thus, the same values stored in memory are
treated as data by the compiler but are then taken as executable instructions
by the CPU. Another consequence of this principle is that every instruction
must specify how it interprets the data on which it operates. For example, a
Von Neumann architecture will have one set of arithmetic instructions for
operating on integer values and another set for operating on floating-point
values.
The second key feature is that memory is accessed by name (i.e., address)
irrespective of the bit pattern stored at each address. Because of this feature,
values stored in memory can be interpreted as addresses, data, or
instructions. Therefore, programs can alter addresses via the same set of
instructions that the CPU uses to alter data. This flexibility in how values in
memory are read permits very complex, dynamically changing access
patterns to be produced by the CPU for any kind of data structure, regardless
of the type of value being read or written. Finally, an additional key feature of
the Von Neumann scheme is that the sequence in which a program performs
its instructions is sequential, unless that order is explicitly altered. The
program counter (PC), a special register in the CPU, carries the address of the
next instruction in memory to be performed. After each instruction is carried
out, the value in the PC is incremented to point to the next instruction in the
series. This sequential execution order can be changed by the program with
the help of branch instructions, which store a new value into the PC register.
Alternatively, special hardware can sense some external event, such as an
interrupt, and load a new value into the PC to cause the CPU to begin
executing a new series of instructions. Although this concept of executing one
operation at a time simplifies the writing of programs and the design and
operation of the CPU, it also limits the potential performance of this
architecture.
Self Assessment Questions
9. The instruction set together with the resources needed for their execution


is called the _______________.


10. The memory is a collection of storage cells, each of which can be in one
of two different states (True/False).

Activity 2:
Surf the internet to find out details about the Harvard
architecture and compare it with the Von Neumann architecture.

13.7 Dataflow Architecture


In a traditional computer design, the processor executes instructions, which
are stored in memory in a particular sequence. Within each processor the
instructions execute in serial order and are therefore slow. There are four
possible ways of executing instructions:
1. Control-flow Method: In this mechanism, an instruction is executed when
the previous one in a defined sequence has been executed. This is the
traditional way.
2. Demand-driven Method: In this mechanism, an instruction is executed
when its results are required by another instruction.
3. Pattern-driven Method: In this mechanism, an instruction is executed
when particular data patterns appear.
4. Dataflow Method: In the dataflow method, an instruction is executed
when the required operands become available.
Dataflow architecture is a computer architecture that directly contrasts the
traditional control flow architecture (Von Neumann architecture). It does not
have a program counter and the execution of instructions is solely determined
based on the availability of input arguments to the instructions. The dataflow
architecture is very relevant in many software architectures today including
parallel computing frameworks. This architecture was proposed in the 1970s
and early 1980s by Jack Dennis of Massachusetts Institute of Technology
(MIT).
13.7.1 Dataflow programming
Software written using dataflow architecture consists of a collection of
independent components running in parallel that communicate via data
channels. In a dataflow model, a node is a computational component, and an
arrow is a buffered data channel. A control algorithm is divided into nodes first.
Each concurrently executing node is a self-contained software component
with well-defined functionality.
Data channels provide the sole mechanism by which nodes interact and
communicate with each other, which ensures low coupling and greater
reusability. Data channels can also be implemented transparently between
processors to carry messages between components that are physically
distributed. In the dataflow architecture, a control application is composed of
function bodies and data channels, and the connections between function
bodies and data channels are described in a dataflow graph. Consequently,
designing a control application mainly involves constructing such a dataflow
graph by selecting function bodies from the design library and connecting them
together.
Additional user-defined or application-specific function bodies are also easily
supported. A model of dataflow programming is shown in figure 13.3.
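A hedged sketch of this style of programming follows, using Python queues as
the buffered data channels and threads as the concurrently executing nodes
(the adder node is invented purely for illustration):

    import queue
    import threading

    def node(func, inputs, output):
        # A dataflow node: read one token from each input channel,
        # apply func, and write the result token to the output channel.
        while True:
            args = [ch.get() for ch in inputs]   # block until tokens arrive
            output.put(func(*args))

    a, b, result = queue.Queue(), queue.Queue(), queue.Queue()
    threading.Thread(target=node, args=(lambda x, y: x + y, [a, b], result),
                     daemon=True).start()

    a.put(3); b.put(4)                           # tokens flow down the channels
    print(result.get())                          # 7

Because the node touches the outside world only through its channels, it can
be reused or moved to another processor without changing its code, which is
the low-coupling property described above.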

13.7.2 Dataflow graph


The dataflow computational model uses a directed graph to describe a
computation. This graph is called a dataflow graph or data dependency graph.
It consists of nodes and edges (arcs): nodes represent operations and edges
represent data paths. Dataflow is a distributed model of computation, as there
is no single locus of control. A dataflow graph is asynchronous: execution of a
node starts when matching data is available at the node's input ports. In the
original dataflow models, data tokens are consumed when the node executes.


Some models were extended with "sticky tokens": tokens that stay in place,
much like a constant input, and match with tokens arriving on the other
inputs. Nodes can have varying granularity, from single instructions to whole
functions. Once a node is activated and the nodal operation is performed, the
node is said to have fired; the "fired results" are passed along the arcs to the
waiting nodes. This process is repeated until all of the nodes have fired and
the final result is produced. More than one node can fire simultaneously.
Arithmetic operators and conditional operators act as nodes.
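The firing rule (a node executes as soon as tokens are present on all of its
input arcs) can be sketched in a few lines. This is a hedged toy evaluator, not
a model of any production dataflow machine; for simplicity each node fires
once and input tokens are left in place rather than consumed. The example
graph computes (a + b) * c:

    def evaluate(nodes, tokens):
        # nodes:  name -> (operation, input arcs, output arc)
        # tokens: arc name -> value
        fired = set()
        while True:
            ready = [n for n, (op, ins, out) in nodes.items()
                     if n not in fired and all(i in tokens for i in ins)]
            if not ready:
                return tokens
            for n in ready:                   # ready nodes may fire in parallel
                op, ins, out = nodes[n]
                tokens[out] = op(*(tokens[i] for i in ins))
                fired.add(n)

    graph = {
        "add": (lambda x, y: x + y, ["a", "b"], "sum"),
        "mul": (lambda x, y: x * y, ["sum", "c"], "result"),
    }
    print(evaluate(graph, {"a": 1, "b": 2, "c": 10})["result"])   # 30

Note that there is no program counter anywhere: the order of execution falls
out of the availability of tokens, exactly as the text describes.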


The Dynamic Critical Path: The dynamic critical path of a dataflow graph is
simultaneously a function of the program dependences, the runtime execution
path, the hardware resources and the dynamic scheduling. All critical events
must be last-arrival events; such an event is the last one that enables data to
be latched. Events correspond to signal transitions on the edges of the
dataflow graph. Most often, the last-arrival event is the last input to reach an
operation; however, for some operations the last-arrival event is the input
that enables the computation of the output. In lenient execution, all forward
branches are executed simultaneously.


In a typical execution, multiple critical events may correspond to the same
hardware structure. In strict execution, the multiplier is on the critical path,
while in lenient execution the multiplier is critical only when its result is used
by later computations.
Self Assessment Questions
11. In a dataflow model, a ____________ is a computational component,
and an _____________ is a buffered data channel.
12. Once a node is activated and the nodal operation is performed, this is
called __________ .

13.8 Hybrid Multithreaded Architecture


The dataflow model and the Von Neumann control-flow model can be seen as
the two extremes of a spectrum of execution models that serves as the
foundation for a variety of architecture models. However, it has been argued
that the two models are in reality not orthogonal. Starting with the
operational model of a pure dataflow graph, one can easily extend the model
to support Von Neumann style program execution: a region of a dataflow
graph can be grouped together as a thread to be executed sequentially under
its own private program counter control, while the activation and
synchronisation of threads remain data-driven.
It has also been argued that there are machines along this spectrum that
trade instruction-scheduling simplicity for improved low-level
synchronisation, and that there exists some optimal point between the two
extremes: a new hybrid model that synergistically combines features of both
Von Neumann and dataflow models and exposes parallelism at the desired
level. Such hybrid multithreaded architecture models have been proposed by
a number of research groups, with their origins in either static dataflow or
dynamic dataflow. We will now study the fundamentals of some of these
research projects.
• McGill Dataflow Architecture: This architecture is motivated by the static
dataflow model and has been proposed on the basis of the argument-fetching
principle. The design departs from a direct execution of dataflow graphs by
having instructions fetch data from memory or registers, rather than having
instructions deposit operands (tokens) in the operand receivers of successor
instructions. On completion, an instruction posts an event (called a signal) to
notify the instructions that depend on its outcome. This implements a
modified model of dataflow computation called dataflow signal graphs. The
design includes features to support efficient loop execution through dataflow
software pipelining, as well as support for threaded function activations.
• Iannucci's Model: Iannucci combined dataflow ideas, based on his study of
the MIT Dynamic Tagged Token Dataflow Architecture (TTDA) and the
experience gained from it, with sequential thread execution to define a hybrid
computation model, described in his Ph.D. thesis. His ideas later took the
form of a multithreaded architecture project at the IBM Yorktown Research
Centre. The design includes features such as a cache memory with
synchronisation controls, prioritised processor ready queues, and support for
efficient process migration to facilitate load balancing.

• P-RISC: This hybrid model explores the possibility of constructing a
multithreaded architecture around a RISC processor. The P-RISC model
divides the compound dataflow instructions into separate synchronisation,
arithmetic and fork/control instructions. This eliminates the need for
presence bits on the token store (or frame memory) as proposed in the
Monsoon machine. P-RISC also allows the compiler to assemble instructions
into longer threads, replacing some of the dataflow synchronisation with
conventional program-counter-based synchronisation.

• *T (Star-T): The Monsoon project at MIT was followed by the Star-T
project, which used extensions of an off-the-shelf processor architecture to
define a multiprocessor architecture supporting fine-grain communication
and user micro-threads. The architecture is intended to retain the latency-
hiding feature of Monsoon's split-phase global memory operations while
remaining compatible with conventional Massively Parallel Architectures
(MPAs) based on the Von Neumann model. Inter-node traffic carries a tag
(called a continuation: a pair comprising a context and an instruction
pointer). All inter-node communications are performed using split-phase
transactions (request and response messages); processors never block when
issuing a remote request, and the network interface for message handling is
well integrated into the processor pipeline. A separate co-processor handles
all the responses to remote requests.
Self Assessment Questions
13. _________ combined dataflow ideas with sequential thread
execution to define a hybrid computation model.
14. P-RISC explores the possibility of constructing a multithreaded
architecture around a CISC processor. (True/ False)

13.9 Summary
• Multithreading is needed to create an application that is able to perform
more than one task at once.
• There are various needs, benefits and principles of multithreading.
• Scalability is defined as the ability to increase the amount of processing
that can be done by adding more resources to a system.
• The combination of languages and the computer architecture in a common
foundation or paradigm is called Computational Model.
• There are basically three types of computational models as follows: Von
Neumann model, Dataflow model and Hybrid multithreaded model.
• The heart of the Von Neumann computer architecture is the Central
Processing Unit (CPU), consisting of the control unit and the ALU
(Arithmetic and Logic Unit).
• Dataflow architecture does not have a program counter and the execution
of instructions is solely determined based on the availability of input
arguments to the instructions.
• A hybrid model synergistically combines features of both Von Neumann
and Data-flow, as well as exposes parallelism at a desired level.

13.10 Glossary
• Background processing: Another name for the multithreading system
which has a concept of priorities.
• Communication parallelism: refers to the way threads can
communicate with other threads residing in other processors.
• Computation parallelism: refers to the 'conventional' parallelism
• GUI: Graphical User Interface, these programs can perform more than one
task (such as editing a document as well as printing it) at a time.
• JVM: Java Virtual Machine, it is an example of multithreading system.


• Pre-emptive multitasking: Another name for the multithreading system


which has a concept of priorities.
• Thread granularity: refers to the number of instructions in a thread.
• Thread switch: refers to how the control of a thread is transferred to
another thread.

13.11 Terminal Questions


1. Define multithreading. What is the need of multithreading? Enumerate its
benefits.
2. Briefly explain the principles of multithreading.
3. Discuss scalable and multithreaded architectures.
4. What is meant by Computational models?
5. Write short notes on:
a) Von Neumann- based multithreaded architectures
b) Dataflow architecture
c) Hybrid multithreaded architecture

13.12 Answers
Self Assessment Questions
1. Background processing, pre-emptive multitasking
2. False
3. True
4. Fine-grain threading
5. Scalable
6. Thread
7. Computational Model
8. True
9. Instruction set architecture (ISA)
10. True
11. Node, arrow
12. Fired Results
13. Iannucci
14. False
Terminal Questions
1. Multithreading is a type of multitasking. The multithreading system has a
concept of priorities and therefore, is also called background processing
or pre-emptive multitasking. Refer Section 13.2.


2. There are important parameters which characterise the performance of


multithreading. Refer Section 13.3.
3. Scalability is defined as the ability to increase the amount of processing
that can be done by adding more resources to a system. Refer Section
13.4.
4. The combination of languages and the computer architecture in a common
foundation or paradigm is called Computational Model. Refer Section 13.5.
5. a) The heart of the Von Neumann computer architecture is the Central
Processing Unit (CPU), consisting of the control unit and the ALU
(Arithmetic and Logic Unit). Refer Section 13.6.
b) Dataflow architecture is a computer architecture that directly contrasts
the traditional control flow architecture (Von Neumann architecture).
Refer Section 13.7.
c) Hybrid model synergistically combines features of both Von Neumann
and Data-flow, as well as exposes parallelism at a desired level. Refer
Section 13.8.
References:
• Hwang, K. (1993) Advanced Computer Architecture. McGraw-Hill.
• Godse, D. A. & Godse, A. P. (2010) Computer Organisation. Technical
Publications.
• Hennessy, J. L., Patterson, D. A. & Goldberg, D. (2011). Computer
Architecture: A Quantitative Approach. Morgan Kaufmann.
• Sima, D., Fountain, T. J. & Kacsuk, P. (1997). Advanced Computer
Architectures: A Design Space Approach. Addison-Wesley-Longman.
E-references:
• http://users.ece.utexas.edu/~bevans/courses/ee382c/projects/spring02/
mishra-oney/LitSurveyReport.pdf
• http://www.google.co.in/search?hl=en&biw=1080&bih=619&q=von+neu
mann+architecture+pdf&revid=1463745648&sa=X&ei=mgCMT-
rmD4rQrQeuur2yCw&ved=0CBwQ1QIoADgK
