Activity 1:
Find out more about a recent vector thread processor which comes in two
parts: the control processor, known as Rocket, and the vector unit, known as
Hwacha.
As we cannot write
VL 40
directly, we must use a two-instruction sequence to load 40 into the VL register.
The last instruction performs a floating-point addition of vectors V3 and V4. As
VL is 40, only the first 40 elements are included. Table 9.1 below depicts a
sample of Cray X-MP instructions.
Table 9.1: Sample Cray X-MP Instructions

Instruction   Meaning              Description
Vi Vj+Vk      Vi = Vj + Vk         Integer add: add corresponding elements (in the
                                   range 0 to VL-1) of the Vj and Vk vectors and
                                   place the result in vector Vi
Vi Sj+Vk      Vi = Sj + Vk         Integer add: add the scalar Sj to each element
                                   (in the range 0 to VL-1) of the Vk vector and
                                   place the result in vector Vi
Vi Vj+FVk     Vi = Vj + Vk         Floating-point add: add corresponding elements
                                   (in the range 0 to VL-1) of the Vj and Vk
                                   vectors and place the floating-point result in
                                   vector Vi
Vi Sj+FVk     Vi = Sj + Vk         Floating-point add: add the scalar Sj to each
                                   element (in the range 0 to VL-1) of the Vk
                                   vector and place the floating-point result in
                                   vector Vi
Vi ,A0,Ak     Vi[i] = M[A0+i*Ak]   Vector load with stride Ak: load into elements
                                   0 to VL-1 of vector register Vi from memory,
                                   starting at address A0 and incrementing
                                   addresses by Ak
Vi ,A0,1      Vi[i] = M[A0+i]      Vector load with stride 1: load into elements
                                   0 to VL-1 of vector register Vi from memory,
                                   starting at address A0 and incrementing
                                   addresses by 1
,A0,Ak Vi     M[A0+i*Ak] = Vi[i]   Vector store with stride Ak: store elements 0
                                   to VL-1 of vector register Vi in memory,
                                   starting at address A0 and incrementing
                                   addresses by Ak
,A0,1 Vi      M[A0+i] = Vi[i]      Vector store with stride 1: store elements 0 to
                                   VL-1 of vector register Vi in memory, starting
                                   at address A0 and incrementing addresses by 1
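The effect of the VL register on an instruction such as `Vi Vj+Vk` can be modelled in Python. This is an illustrative software model, not Cray assembler; the function name `vector_add` and the register layout are ours, and the 64-element register size follows the Cray X-MP.

```python
def vector_add(vj, vk, vl):
    """Model of the Cray 'Vi Vj+Vk' instruction: add elements 0..VL-1.

    vj, vk: 64-element vector registers (modelled as Python lists).
    vl: contents of the vector length (VL) register.
    Elements beyond VL-1 are left untouched, mirroring the fact that
    the hardware operates only on the first VL elements.
    """
    vi = [0] * 64                 # destination vector register
    for i in range(vl):           # only elements 0 .. VL-1 participate
        vi[i] = vj[i] + vk[i]
    return vi

V3 = list(range(64))
V4 = [1] * 64
V5 = vector_add(V3, V4, 40)      # VL = 40, as in the text's example
# V5[0..39] hold V3+V4; V5[40..63] stay 0
```

Setting `vl` to anything below 64 shows why the two-instruction sequence to load VL matters: the same instruction processes a different number of elements depending on the register's contents.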
                                        Completely   Partially   Not
Processor          Compiler             vectorized   vectorized  vectorized
CDC CYBER 205      VAST-2 V2.21             62           5           33
Convex C-series    FC5.0                    69           5           26
Cray X-MP          CFT77 V3.0               69           3           28
Cray X-MP          CFT V1.15                50           1           49
Cray-2             CFT2 V3.1a               27           1           72
ETA-10             FTN 77 V1.0              62           7           31
Hitachi S810/820   FORT77/HAP V20-2B        67           4           29
IBM 3090/VF        VS FORTRAN V2.4          52           4           44
NEC SX/2           FORTRAN77/SX V.040       66           5           29
Figure 9.5: Result of applying Vectorising Compilers to the 100 FORTRAN Test
Kernels
The kernels were designed to test vectorisation capability, and all of them can
be vectorised by hand.
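What typically distinguishes a kernel a compiler can vectorise from one it cannot is the absence of loop-carried dependences. A small Python illustration — the function names are ours, and Python itself is standing in for FORTRAN loops here:

```python
# A loop a compiler can vectorise: every iteration is independent,
# so all element operations may proceed at once.
def independent(a, b):
    return [x + y for x, y in zip(a, b)]   # c[i] = a[i] + b[i]

# A loop a compiler cannot vectorise directly: c[i] depends on c[i-1]
# (a loop-carried dependence), so iterations must run in order.
def recurrence(a):
    c = [0.0] * len(a)
    c[0] = a[0]
    for i in range(1, len(a)):
        c[i] = c[i - 1] + a[i]             # prefix sum: order matters
    return c
```

Compilers differ precisely in how well they recognise that loops of the first kind are safe to vectorise, which is what the figures above measure.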
Self Assessment Questions
10. List two factors which enable a program to run successfully in vector
mode.
11. Compilers do not vary in their ability to decide whether a loop can be
vectorised. (True/False)
Activity 2:
Visit your local computer vendor and get an expert opinion about vector
processors and their working.
9.6 Summary
There are several representative application areas where vector processing is
of the utmost importance. Depending on how operands are fetched, vector
processors can be segregated into two groups:
• Memory-memory architecture: operands are streamed directly from
memory to the functional units, and results are written back to memory as
the vector operation proceeds.
• Vector-register architecture: operands are read into vector registers, from
which they are fed to the functional units, and results of operations are
written to vector registers.
• Vector register architectures have several advantages over vector
memory-memory architectures.
• There are several major components of the vector unit of a register-register
vector machine.
• The various types of vector instructions for a register-register vector
processor are:
■ Vector-scalar Instructions
■ Vector-vector Instructions
■ Vector-memory Instructions
■ Gather and Scatter Instructions
■ Masking Instructions
■ Vector Reduction Instructions
• CRAY-1 is one of the oldest processors that implemented vector
processing.
• Two issues arise in real programs: (i) the vector length in a program is often
not exactly 64; (ii) vectors may have non-adjacent elements residing in
memory.
• The structure of the program and the capability of the compiler are the two
factors that affect the success with which a program can be run in vector
mode.
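The first of these issues is handled by strip mining (see answer 8 below): a vector of arbitrary length is processed in strips no longer than the machine's maximum vector length. A minimal Python sketch, with illustrative names and a Cray-style maximum vector length of 64 assumed:

```python
MVL = 64  # maximum vector length of the machine (vector register size)

def strip_mined_add(a, b):
    """Process an arbitrary-length vector in strips of at most MVL elements."""
    n = len(a)
    c = [0] * n
    for start in range(0, n, MVL):        # one strip per pass
        vl = min(MVL, n - start)          # set VL for this strip
        for i in range(start, start + vl):
            c[i] = a[i] + b[i]            # 'vector operation' on the strip
    return c
```

For n = 150, the loop runs three strips with VL = 64, 64 and 22, which is exactly what the generated strip-mined code does on a real vector machine.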
9.7 Glossary
• ASC: Advanced Scientific Computer
• Data hazards: the conflicts in register accesses
8. Strip mining
9. Sequential words
10. Structure of the program & capability of the compiler
11. False
Terminal Questions
1. There are various application areas of vector processors which are of
considerable importance. Refer Section 9.2.
2. Depending upon the way the operands are fetched, vector processors can
be segregated into two groups: Memory-memory vector architecture and
Vector-register architecture. Refer Section 9.3.
3. Owing to the ability to overlap memory accesses, and the possible reuse
of vector registers, vector-register processors are normally more efficient
than memory-memory vector processors. Refer Section 9.3.
4. a. The CDC Cyber 205 is based on the concepts initiated for the CDC Star
100; the first commercial model was produced in 1981. Refer Section
9.4.
b. CRAY-1 is one of the oldest processors that implemented vector
processing. Refer Section 9.5.
c. The vector size may be less than the vector register size, and the
vector size may be larger than the vector register size. Refer Section
9.6.
d. As vectors are one-dimensional series, saving a vector in memory is
direct: vector elements are stored as sequential words in memory.
Refer Section 9.6.
5. The major components of the vector unit of a register-register vector
machine are Vector Registers, Vector Functional Units, Scalar Registers
etc. Refer Section 9.5.
6. The various types of vector instructions for a register-register vector
processor are: (Refer Section 9.5.) a. Vector-scalar Instructions
b. Vector-vector Instructions
c. Vector-memory Instructions
d. Gather and Scatter Instructions
e. Masking Instructions
f. Vector Reduction Instructions
7. As an indication of the level of vectorisation that can be achieved in
scientific programs, we can observe the vectorisation levels recorded for
the Perfect Club benchmarks. Refer Section 9.7.
References:
• Hwang, K. (1993). Advanced Computer Architecture. McGraw-Hill.
• Godse, D. A. & Godse, A. P. (2010). Computer Organisation. Technical
Publications.
• Hennessy, J. L., Patterson, D. A. & Goldberg, D. (2011). Computer
Architecture: A Quantitative Approach (5th ed.). Morgan Kaufmann.
• Sima, D., Fountain, T. J. & Kacsuk, P. (1997). Advanced Computer
Architectures: A Design Space Approach. Addison-Wesley-Longman.
E-references:
• https://csel.cs.colorado.edu/~csci4576/VectorArch/VectorArch.html
• http://www.cs.clemson.edu/~mark/464/appG.pdf
Unit 10 SIMD Architecture
Structure:
10.1 Introduction
Objectives
10.2 Parallel Processing: An Introduction
10.3 Classification of Parallel Processing
10.4 Fine-Grained SIMD Architecture
An example: the massively parallel processor
Programming and applications
10.5 Coarse-Grained SIMD Architecture
An example: the CM5
Programming and applications
10.6 Summary
10.7 Glossary
10.8 Terminal Questions
10.9 Answers
10.1 Introduction
In the previous unit, you studied the use and effectiveness of vector
processors, along with the vector-register architecture and the issues of
vector length and stride. We also learnt about compiler effectiveness for
vector processors. In this unit, we will throw light on data-parallel
architecture and the SIMD design space, and study the types of SIMD
architecture. Instruction execution in conventional computers is sequential,
so there is a time constraint involved: while one program is being executed,
the other task has to wait until the first one finishes. In parallel processing,
execution proceeds in parallel; the program can be divided into segments so
that, at the same time, one segment is being processed, another is being
fetched from memory, and some segments can be printed (provided they are
already processed). The purpose of parallel processing is to bring down the
execution time and hence to speed up data processing.
Parallel processing can be achieved by dividing the data among different
units, each unit processing its share simultaneously, with timing and
sequencing governed by the control unit so as to obtain the result in the
minimum amount of time.
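The idea of dividing data among units and combining the partial results can be sketched with Python's standard thread pool. This is a conceptual sketch only — the names `process_chunk` and `parallel_sum_of_squares` are ours, and the thread pool stands in for the independent processing units:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each processing unit works on its own portion of the data.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, units=4):
    # Divide the data among the units.
    size = (len(data) + units - 1) // units
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Each unit processes its chunk simultaneously.
    with ThreadPoolExecutor(max_workers=units) as pool:
        partials = pool.map(process_chunk, chunks)
    # Combine the partial results.
    return sum(partials)
```

The control unit's role of timing and sequencing is played here by the executor, which hands out chunks and collects the partial results.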
Objectives:
After studying this unit, you should be able to:
• discuss the concept of data parallel architecture
• describe the SIMD design space
• identify the types of SIMD architecture
• recognise fine grained SIMD architecture
• explain coarse grained SIMD architecture
made to work out a computational problem. It may make use of multiple
CPUs. A problem is broken into discrete parts that can be solved concurrently.
Each part is further broken down into a series of instructions, and instructions
from each part execute simultaneously on different CPUs, as shown in figure
10.2.
Figure 10.3: (a) Serial Processing (b) True Parallel Processing with Multiple
Processors (c) Parallel Processing Simulated by Switching one Processor
among Three Processes
Figure 10.3 (a) represents serial processing: the next process is started only
when the previous process has completed. In figure 10.3 (b) all three
processes run in one clock cycle on three processors. In figure 10.3 (c) all
three processes also appear to run, but each process gets only 1/3 of the
actual clock time, as the CPU switches from one process to another; while one
process is running, all the other processes must wait for their turn. So in figure
10.3 (c), at any one clock time only one process is running and the others are
waiting, whereas in figure 10.3 (b) all three processes are running at the same
clock time. Hence, in a uniprocessor system the kind of parallel processing
shown in figure 10.3 (c) is called virtual parallel processing.
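The time-sliced switching of figure 10.3 (c) can be sketched as a round-robin scheduler. This is a toy model — the process "steps" are just labels, and `virtual_parallel` is an illustrative name:

```python
def virtual_parallel(processes):
    """Simulate figure 10.3 (c): one processor switched among processes.

    Each process is a list of steps; the scheduler runs one step of each
    process per 'clock cycle', so each process gets 1/n of the processor.
    Returns the order in which (process id, step) pairs were executed.
    """
    trace = []
    cursors = [0] * len(processes)
    while any(c < len(p) for c, p in zip(cursors, processes)):
        for pid, p in enumerate(processes):
            if cursors[pid] < len(p):
                trace.append((pid, p[cursors[pid]]))  # run one step
                cursors[pid] += 1                     # the others wait
    return trace
```

At any moment exactly one step is executing — the parallelism is only apparent, which is why it is called virtual.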
SISD                                  SIMD
Single Instruction, Single Data       Single Instruction, Multiple Data
MISD                                  MIMD
Multiple Instruction, Single Data     Multiple Instruction, Multiple Data
Figure 10.4: Flynn’s Classification of Computer System
In this chapter, our main focus will be Single Instruction Multiple Data (SIMD).
Single Instruction Multiple Data (SIMD)
The term single instruction implies that all processing units execute the same
instruction at any given clock cycle. On the other hand, the term multiple data
implies that each and every processing unit could work on a different data
element. Generally, this type of machine has one instruction dispatcher, a very
large array of very small-capacity processing units and a network of very high
bandwidth. This type is suitable for specialised problems which are
characterised by a high regularity, for example, image processing. Figure 10.5
shows a case of SIMD processing.
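The lock-step idea can be sketched in a few lines of Python. This is a conceptual model only — real SIMD hardware applies the instruction to all elements in the same clock cycle rather than looping, and the names here are ours:

```python
def simd_execute(instruction, data):
    """One SIMD step: every processing element applies the *same*
    instruction to its *own* data element, conceptually in lock-step."""
    return [instruction(x) for x in data]

# Example from image processing: brighten a row of 8-bit pixels,
# with each processing element holding one pixel.
brighten = lambda pixel: min(pixel + 10, 255)
row = [0, 100, 250]
print(simd_execute(brighten, row))   # prints [10, 110, 255]
```

The single instruction (`brighten`) is dispatched once; only the data differs per element — exactly the regularity that makes image processing a natural fit.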
Activity 1:
Explore the components of a parallel architecture that are used by an
organisation. Also, find out the type of memory used in that architecture.
10.4 Fine-Grained SIMD Architecture
The Steven Unger design scheme is the initial base for fine-grained SIMD
architectures. These are generally designed for low-level image processing.
If a fault occurs during computation, the sequence of instructions following the
last dump to external memory must be repeated after replacement of the
fault-containing column.
The processing elements are linked by a two-dimensional near-neighbour
mesh. This solution gives a number of important advantages over other likely
alternatives: easy maintenance of data structures during shifting, engineering
simplicity, high bandwidth, and a close conceptual match to the formulation of
many image-processing calculations.
The principal disadvantage of this arrangement is the slow transmission of
data between remote processors in the array. However, this matters only when
a comparatively small amount of data is to be transmitted (rather than whole
images).
The choice of 4- rather than 8-connectedness is perhaps surprising in view of
the minimal increase in complexity which the latter involves, compared with a
twofold improvement in performance on some operations. There is one
special-purpose staging memory meant for conversion of data formats. All
massively parallel computers have problems with data input and output, and
in those parallel computers built from single-bit processors the problems are
compounded. The problem is that external source data is usually formatted as
a single string of integers, so if such data is fed into a two-dimensional array
in any simple manner, a considerable amount of time is wasted before useful
processing can start, essentially because of the mismatched format of the
data.
The MPP included two solutions to this problem. The first was a separate data
input/output register. The second was the staging memory, which allowed
conversion between bit-plane and integer-string formats. Used jointly, these
two solutions allowed the processor array to operate continuously, giving
maximum throughput.
10.4.2 Programming and applications
The MPP system was commissioned by NASA principally for the analysis of
Landsat images (satellite imagery of Earth). This meant that, initially, most
applications on the system were in the area of image processing, although the
machine eventually proved to be of wider applicability. At the same time, NASA
also utilised the MPP system for various other applications, listed below (see
figure 10.9).
functions (a standard image processing technique), but the third arises due to
the different viewing angles. The MPP algorithm operates as follows:
• For each pixel in one of the images (the reference image) a local
neighbourhood area is defined. This is correlated with the corresponding
area surrounding each of the candidate match pixels in the second image.
• The measure applied is the normalised mean and variance cross
correlation function. The candidate yielding the highest correlation is
considered to be the best match, and the locations of the pixels in the two
images are compared to produce the disparity value at that point of the
reference image.
• The algorithm is iterative. It begins at low resolution, that is, with large
areas of correlation around each of a few pixels. When the first pass is
complete, the test image is geometrically warped according to the disparity
map.
• The process is then repeated at a higher resolution (usually reducing the
correlation area, and increasing the number of computed matches, by a
factor of two); a new disparity map is calculated and a new warping
applied, and so on.
• The procedure is continued either for a predetermined number of passes
or until some quality criterion is exceeded.
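The correlation measure in the second step can be sketched as follows. This is a 1-D toy model of the neighbourhood comparison — the real MPP code operated on 2-D image windows in parallel across the array, and the function names are ours:

```python
import math

def normalised_cross_correlation(a, b):
    """Zero-mean, unit-variance cross-correlation of two equally sized
    neighbourhoods; +1 means a perfect (linear) match."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db)

def best_match(reference, candidates):
    """Pick the candidate neighbourhood with the highest correlation,
    as the algorithm does for each reference pixel."""
    scores = [normalised_cross_correlation(reference, c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])
```

The winning candidate's position, compared with the reference pixel's position, yields the disparity value at that point.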
Self Assessment Questions
9. MPP is the acronym for ___________ .
10. All highly parallel computers have problems concerned with ___________.
10.5 Coarse-Grained SIMD Architecture
There are several technical difficulties in fully achieving the fine-grained SIMD
ideal of one processor per data element. Thus, it is better to begin with the
coarse-grained approach and thereby develop a more practical architecture.
A number of parallel computer manufacturers, including nCUBE and Thinking
Machines Inc., have adopted this outlook.
Coarse-grained data-parallel architectures are often developed by
manufacturers more familiar with the mainstream of computer design than
with the application-specific architecture field. Their roots in MIMD
programmes have helped to uncover the complexities of this approach and to
mitigate them. The consequence of these roots is systems which can employ
a number of different paradigms, including MIMD, multiple-SIMD and what is
often called single program multiple data (SPMD), in which each
Manipal University Jaipur B1648 Page No. 222
Computer Architecture Unit 1
processor executes its own program, but all the programs are the same, and
so remain in lock-step. Such systems are frequently used in this data-parallel
mode, and it is therefore reasonable to include them within the SIMD
paradigm. Naturally, when they are used in a different mode, their operation
has to be analysed in a different way. Coarse-grained SIMD systems of this
type embody the following concepts:
• Each PE is of high complexity, comparable to that of a typical
microprocessor.
• The PE is usually constructed from commercial devices rather than
incorporating a custom circuit.
• There is a (relatively) small number of PEs, on the order of a few
hundreds or thousands.
• Every PE is provided with ample local memory.
• The interconnection method is likely to be one of lower diameter and lower
bandwidth than the simple two-dimensional mesh. Networks such as the
tree, the crossbar switch and the hypercube can be utilised.
• Provision is often made for huge amounts of relatively high-speed, high-
bandwidth backup storage, often using an array of hard disks.
• The programming model assumes that some form of data mapping and
remapping will be necessary, whatever the application.
• The application field is likely to be high-speed scientific computing.
Systems of this type have a number of advantages over fine-grained SIMD,
such as the ability to take maximum advantage of the latest processor
technology, the ability to perform highly precise computations with no
performance penalty, and the easier mapping to a variety of different data
types which the smaller number of processors and improved connectivity
permit.
The software required for such systems is both an advantage and a
disadvantage. The advantage lies in its closer similarity to normal
programming; the disadvantage lies in the less natural programming for some
applications. Coarse-grained systems also offer greater variety in their
designs, because each component of the design is less constrained than in a
fine-grained system. The example given below is, therefore, less specifically
representative of its class than was the MPP machine considered earlier.
There are three aspects of the system design which are of major importance.
The first is the data interconnection network, shown in figure 10.11, which is
designated by the designers as a fat-tree network. It is based upon the
quadtree, augmented to reduce the likelihood of blocking within the network.
Thus, at the lowest level, within what is designated an interior node of four
processors, there are at least two independent direct routes between any pair
of processors. Utilising the next level of tree, there are at least four partly
independent routes between a pair of processors. This increase in the number
of potential routes is maintained for increasing numbers of processors by
utilising higher levels of the tree structure.
Although this structure provides a potentially much higher bandwidth than the
ordinary quadtree, like any complex system, achieving the highest
performance depends critically on effective management of the resource. The
second component of system design which is of major importance is the
method of control of the processing elements. Since each of these
incorporates a complete microprocessor, the system can be used in fully
asynchronous MIMD mode. Similarly, if all processors execute the same
program, the system can operate in the SPMD mode. In fact, the designers
suggest that an intermediate method is possible, in which processing elements
act independently for part of the time, but are frequently resynchronised
globally. This technique corresponds to the implementation of SIMD with
algorithmic processor autonomy.
The final system design aspect of major importance is the data I/O method.
The design of the CM5 system seeks to overcome the problem of improving
(and therefore variable) disk access speeds by allowing any desired number
of system nodes to be allocated as disk nodes with attached backup storage.
This permits the amount and bandwidth of I/O arrangements to be tailored to
the specific system requirements. Overall, one of the main design aims
pursued for the CM5 system was scalability. This means not only that the
number of nodes in the system can vary between (in the limit) one and 16384
processors, but also that system parameters such as peak computing rate,
memory bandwidth and I/O bandwidth all increase in proportion to the number
of processing elements. This is shown in Table 10.2.
Table 10.2: CM5 System Parameters
Number of processors            32      1024     16384
Number of data paths            128     4096     65536
Peak speed (GFLOPS)             4       128      2048
Memory (Gbyte)                  1       32       512
Memory bandwidth (Gbyte/s)      16      512      8192
I/O bandwidth (Gbyte/s)         0.32    10       160
Synchronisation time (µs)       1.5     3.0      4.0
Activity 2
Visit an organisation and find out the difficulties that are faced by the
computer designers in implementing and operating the fine-grained and
coarse-grained SIMD architectures.
10.6 Summary
Let us recapitulate the important concepts discussed in this unit:
10.9 Answers
Self Assessment Questions
1. Instructions
2. Parallel Computer
3. True
4. Simulated Or Virtual
5. Single-Instruction, Multiple Data
6. SISD, SIMD, MISD, and MIMD
7. Vector Processing
8. Instruction-Level Parallelism
9. Massively Parallel Processor
10. Input and output of data
11. nCUBE and Thinking Machines Inc.
12. Single Program Multiple Data
13. CM1
14. CM5
Terminal Questions
1. Parallel Computing is the simultaneous use of multiple compute
resources to solve a computational problem. Refer Section 10.2.
2. The core element of parallel processing is the CPU. The essential
computing process is the execution of a sequence of instructions on a set
of data. Refer Section 10.3.
3. The Steven Unger design scheme is the initial base for the fine-grained
11.8 Glossary
11.9 Terminal Questions
11.10 Answers
11.1 Introduction
In the previous unit, you were introduced to data-parallel architecture, in which
you studied the SIMD part. You learned about SIMD architecture and its
various aspects, such as the SIMD design space, fine-grained SIMD
architecture and coarse-grained SIMD architecture. In this unit we will
progress a step further to explain the MIMD architecture. Although we covered
vector architecture in the prior unit, we will throw some light on it as well so
that the concept of MIMD can be understood in a better way.
According to the famous computer architect Jim Smith, the most efficient way
to execute a vectorisable application is a vector processor. Vector
architectures collect groups of data elements distributed in memory and place
them in linear sequential register files. Operations then proceed on the data
in the register files, and the results are dispersed back to memory. MIMD
architectures, on the other hand, are of great importance and may be used in
numerous application areas such as CAD (Computer Aided Design), CAM
(Computer Aided Manufacturing), modelling, simulation, etc.
In this unit, we are going to study different features of Vector architecture and
MIMD architecture such as pipelining, MIMD architectural concepts, problems
of scalable computers, Main design issues of scalable MIMD architecture.
Objectives:
After studying this unit, you should be able to:
• recall the concept of vector architecture
• discuss the concept of pipelining
• describe MIMD Architectural concepts
• differentiate between multiprocessor and multicomputer
• interpret the problems of scalable computers
• solve the problems of scalable computers
• recognise the main design issues of scalable MIMD architecture
11.2 Vectorisation
Vector machines are designed to operate at the level of vectors.
Now, you will study the operation of vectors. Suppose there are two vectors,
A and B, each having 64 components. The number of components in a vector
is the vector size, so our vector size is 64. Vectors A and B are shown below:

A = (a0, a1, ..., a63)
B = (b0, b1, ..., b63)

Now we want to add these two vectors and keep the result in another vector
C, as shown in the equation below:

C = A + B = (a0 + b0, a1 + b1, ..., a63 + b63)

The rule for adding vectors is to add the corresponding components.
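The component-wise rule can be written out directly. This is a plain Python sketch of the semantics, not vector-machine code:

```python
# Vector addition C = A + B, component by component.
A = [float(i) for i in range(64)]     # vector A: 64 components a0..a63
B = [1.0] * 64                        # vector B: 64 components b0..b63

C = [a + b for a, b in zip(A, B)]     # c[i] = a[i] + b[i] for every i
```

On a vector machine, a single instruction performs all 64 of these component additions instead of a 64-iteration loop.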
which performs
Vd = Vs1 VOP Vs2
Here, VOP is the vector operation performed on registers Vs1 and Vs2, and
the result is stored in Vd.
Architecture
As discussed in Unit 9, both a scalar and a vector unit are present in a vector
machine. The scalar unit has the same structural design as a conventional
processor and works on scalars, while the vector unit works on vector
operations. Vector architecture has seen advances such as the move from
CISC to RISC designs, and the move from the memory-memory architecture
to the vector-register architecture.
Vector Registers: Vector registers carry the input and result vectors. Eight
vector registers are present in the Cray-1 and many other vector processors,
each containing 64 elements of 64 bits. Some processors make this split
configurable: the Fujitsu VP 200, for example, lets its 8K elements of
vector-register space be organised as a programmable set of between 8 and
256 registers, so 8 registers each carry 1024 elements, while 256 registers
each carry 32 elements.
The register file of figure 11.1 has one write port and two read ports so that
vector operations can overlap on different vector registers.
Scalar Registers: Vector operations take their scalar inputs from the scalar
registers, for example the constant by which the elements of a vector are
multiplied.
B = 5*X + Y
In the equation above, 5 is a constant stored in a scalar register, and X and Y
are vectors in two different vector registers. Address calculations for the
vector load/store unit also use the scalar registers.
Vector Load/Store Unit: Data moves
11.3 Pipelining
We have discussed this concept in Units 4 and 5, but we need to recap it in
order to get a better idea of the next sections.
What is Pipelining?
Pipelining is an implementation technique by which the execution of multiple
instructions can be overlapped. In other words, it is a method which breaks a
sequential process down into numerous sub-operations, each of which is then
executed concurrently in its own dedicated segment. The main advantage of
pipelining is that it increases instruction throughput, that is, the number of
instructions completed per unit time; thus a program runs faster. In pipelining,
several computations can run in distinct segments simultaneously.
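The throughput claim can be made concrete with a back-of-the-envelope model. This is idealised — no stalls or hazards are modelled, and the function names are ours:

```python
def sequential_cycles(instructions, stages):
    """Without pipelining, each instruction occupies all stages in turn."""
    return instructions * stages

def pipelined_cycles(instructions, stages):
    """In an ideal pipeline, after 'stages' cycles to fill the pipe,
    one instruction completes every cycle."""
    return stages + (instructions - 1)

# 100 instructions through a 5-stage pipeline:
# 500 cycles sequentially versus 104 pipelined, roughly a 4.8x gain.
```

As the instruction count grows, the speed-up approaches the number of stages, which is why throughput rather than single-instruction latency is the figure of merit.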
A register is associated with every segment in the pipeline to provide isolation
between segments. Thus, each segment can operate on distinct data
11.4.1 Multiprocessor
Multiprocessors are systems with multiple CPUs, which are capable of
independently executing different tasks in parallel. They have the following
main features:
• They have either shared common memory or unshared distributed
memories.
• They also share resources such as I/O devices, system utilities, program
libraries, and databases.
• They operate under an integrated operating system that provides
interaction among processors and their programs at the task, file and job
levels, and also at the data-element level.
Types of multiprocessors
There are three types of multiprocessors, distinguished by the way in which
shared memory is implemented (see figure 11.3). They are:
• UMA (Uniform Memory Access),
• NUMA (Non-Uniform Memory Access), and
• COMA (Cache Only Memory Access).
Basically, the memory is divided into several modules; the way these modules
are accessed divides large multiprocessors into the different categories. Let
us discuss them in detail.
UMA (Uniform Memory Access): In this category every processor has the
same access time to every memory module, so each memory word can be
read as quickly as any other. If not, fast references are slowed down to match
the slow ones, so that programmers cannot tell the difference; this is what
uniformity means here. Uniformity makes performance predictable, which is a
significant aspect for code writing. Figure 11.4 shows uniform memory access
from the CPU on the left.
Modern UMA machines are small, single-bus multiprocessors. In the early
designs of scalable shared-memory systems, large UMA machines with a
switching network and hundreds of processors were common. Well-known
examples of those multiprocessors are the NYU Ultracomputer and the
Denelcor HEP. Their designs introduced numerous features which count as
important achievements in today's parallel computer architecture.
Nevertheless, these early systems did not have local main memory or cache
memory, which has since shown its importance for attaining high performance
in scalable shared-memory systems. The UMA approach is not appropriate for
building scalable parallel computers, but it is very good for constructing
small-sized single-bus multiprocessors, such as the Encore Multimax of
Encore Computer Corporation, introduced in the late 80s, and machines from
Silicon Graphics Computing Systems introduced in the late 90s.
NUMA (Non-Uniform Memory Access): These machines are intended to
avoid the memory-access disadvantage of Uniform Memory Access
machines. The logically shared memory is distributed among all the
processing nodes of NUMA machines, giving rise to distributed shared-
memory architectures.
Figure 11.4 shows non-uniform memory between the left and right disks.
Although these parallel computers are highly scalable, they are extremely
sensitive to data allocation in local memories: accessing a remote memory
segment of a node is slower than accessing a local memory segment.
Distributed-memory multicomputers have a similar architecture; the major
dissimilarity lies in the organisation of the address space. In multiprocessors,
a global address space, equally visible from each processor, is applied.
In other words, all memory locations can be accessed by any CPU directly. In
multicomputers, by contrast, the address space is replicated in the local
memories of the processing elements. This dissimilarity in the address space
shows up at the software level: NUMA machines are programmed on the
global address space (shared memory) principle, while distributed-memory
multicomputers are programmed with the message-passing paradigm.
COMA (Cache Only Memory Access): A COMA machine is also non-uniform,
but in a different way. It avoids the effects of the static memory allocation of
NUMA and Cache-Coherent Non-Uniform Memory Architecture (CC-NUMA)
machines. This is done by two activities:
Activity 1:
Prepare a collage depicting two columns, one for multiprocessor while the
other for multi-computer and paste diagrams, notes, pictures etc of the
various machines found under each heading.
The pointers to A and B, rA and rB, are stored in the local memory of P0.
Accesses to A and B are realised by the "rload rA" and "rload rB" instructions,
which must travel through the interconnection network in order to fetch A
and B.
The situation is even worse if the values A and B are currently not available
in M1 and Mn; they will be generated by some other process that will be
executed later. In this case, where idling occurs due to synchronisation among
parallel processes, the original process on P0 must wait an unpredictable time,
resulting in unpredictable latency.
Solutions to the problems
In order to solve the above-mentioned problems several possible
hardware/software solutions were proposed and applied in various parallel
computers. They are as follows:
1. Application of cache memory
2. Pre-fetching
instruction from thread 2 and so on. This is the way through which
processor can be kept busy even in lengthy latencies for independent
threads. Actually for switching processes after each instruction to reduce
latencies, some systems automatically switch between the processes.
4. Non-blocking writes: The last method for reducing or hiding latency is non-blocking writes. In this method the memory operation is started, but the program continues executing. Normally, when a STORE instruction is carried out, the CPU waits until the instruction completes before continuing.
Activity 2:
Visit a library and read books on computer architecture to find out more
ways of resolving the problems of scalable computers.
11.8 Glossary
• CC-NUMA machine: Cache-Coherent Non-Uniform Memory Access
machine.
• COMA machine: Cache Only Memory Access machine.
• COW: Cluster of Workstations.
• DDM: Data Diffusion Machine.
• LMD: Load Memory Data.
• MPPs: Massively Parallel Processors.
• Multi-computer: It contains numerous von Neumann computers that are connected by an interconnection network.
• Multiprocessor: Systems with multiple CPUs, which are capable of
independently executing different tasks in parallel.
• NORMA: NO Remote Memory Access.
• NOW: Network of Workstations.
• NUMA machine: Non Uniform Memory Access machine.
• Register: It is associated with each segment in the pipeline to provide
isolation between each segment.
• UMA machine: Uniform Memory Access machine.
11.10 Answers
Self Assessment Questions
1. CDC Star 100
2. Vector
3. Instruction throughput
4. Virtual parallelism
5. True
6. MIMD
7. False
8. Non Uniform Memory Access
9. Remote loads
10. False
11. Interconnection network design
12. Physically shared memory
Terminal Questions
1. A vector refers to a type of one-dimensional array. Vectorisation is the process of collecting data elements scattered in memory and placing them in linear, sequential register files. Refer Section 11.2.
2. The five components are vector registers, scalar registers, vector
functional units, vector load/store unit, and main memory. Refer Section
11.2.
References:
• Hwang, K. Advanced Computer Architecture. McGraw-Hill, 1993.
• Godse, D. A. & Godse, A. P. Computer Organization. Technical
Publications.
• Hennessy, John L., Patterson, David A. & Goldberg David, Computer
Architecture: A Quantitative Approach, Morgan Kaufmann; 5th edition,
2011.
• Sima, Dezso, Fountain, Terry J. & Kacsuk, Peter, Advanced Computer Architectures: A Design Space Approach. Addison-Wesley-Longman.
E-references:
• http://www.cs.umd.edu/class/fall2001/cmsc411/projects/MIMD/mimd.
html.
• http://www.docstoc.com/docs/2685241/Computer-Architecture-
Introduction-to-MIMD-architectures.
12.1 Introduction
In the previous unit, you studied the concept of vectorisation and pipelining.
Also, you studied the MIMD architectural concepts, problems of scalable
computers. We also learnt the main design issues of scalable MIMD
architecture.
A computer must have a system to get information from the outside world and must be able to communicate results to the external world. Programs and data must be entered into computer memory in order to be processed, and the results of calculations must be recorded or displayed for the user. To use a computer efficiently, it is necessary to prepare numerous programs and data beforehand. These programs and data are first transferred onto a storage medium; the information on the disk is then transmitted into the computer's memory at high speed.
surface. Within the drive, a disk spins rapidly past the read/write head, as shown in figure 12.1. A floppy disk, for example, may spin past the read/write head at 300 revolutions per minute (RPM), whereas a hard drive may spin at 3,600 to 10,000 RPM.
a hard drive and a floppy-disk drive that lets you insert and remove disks. Normally, a PC's hard drive resides within the PC's system unit.
Tape drive: The function of a tape drive is to read and write data on a tape surface. An audio cassette deck functions in the same way; the only difference is that a tape drive records digital data. Tape storage generally holds data that is not needed frequently, such as backup copies of the hard disk. A tape drive must write data serially, because a tape is a long strip of magnetic material. The direct access offered by media such as disks is faster than the serial access of a tape drive.
When particular data on a tape must be accessed, the drive scans through all the data on the tape, including the data that is not required. This results in slow access times. Access time varies with the speed of the drive, the position of the data on the tape, and the length of the tape.
12.3.2 Optical storage
The most extensively used kind of optical storage is the CD (compact disk). Compact disks are used in CD-R, CD-RW, CD-ROM, DVD-ROM, and Photo CD systems. Nowadays, systems with DVD-ROM drives are preferred over standard CD-ROM units. Optical storage devices store data on a reflective surface, and a ray of laser light is used to read the data. A thin ray of light is directed and focused by means of lenses, mirrors, and prisms; because all of the light has the same wavelength, the beam can be sharply focused.
CD-ROM: The name stands for compact disk read-only memory. To read data from a CD-ROM, a laser beam is directed onto the surface of a spinning disk. The areas that reflect the light back are read as 1s, and the ones that scatter the light and do not reflect it back are read as 0s. This is shown in figure 12.3. Data on this device is stored in one long spiral that begins near the centre of the disk and ends at the edge.
A general-purpose computer uses peripherals such as a printer and magnetic disks; in computers, magnetic tape is used for backup storage. All peripheral devices are connected to the processor by means of interface units. Each interface decodes the address and control information received from the I/O bus, interprets them for its peripheral, and provides signals for the peripheral controller. It also synchronises the data flow and supervises the transfer between processor and peripheral. Each peripheral has its own controller, which operates a particular electromechanical device. For instance, the printer controller controls paper movement, print timing, and the selection of printing characters. A controller may be housed separately or physically integrated with the peripheral. The I/O bus from the processor is connected to all peripheral interfaces.
To communicate with a device, the processor places the address of the device on the address lines. Every interface connected to the I/O bus includes an address decoder that monitors the address lines. When an interface detects its own address, it activates the path between the bus lines and the device that it controls; peripherals whose address does not match the address on the bus are disabled by their interfaces. While the address is on the address lines, the processor also provides a function code on the control lines.
The interface selected responds to the function code and proceeds to execute it. The function code can be considered an I/O command; it is, in effect, an instruction that is executed in the interface and its attached peripheral unit.
Interface may obtain different kinds of commands. The different kinds of
commands are:
• Control: We give a control command to activate the peripheral. The particular control command issued depends on the peripheral; each peripheral receives its own distinguished sequence of commands, according to its mode of operation.
• Status: This command is used to test various status conditions in the interface and the peripheral. For instance, the computer may wish to verify the peripheral's status before initiating a transfer; errors that occur while the transfer is in progress are detected by the interface.
• Data output: On receiving this command, the interface responds by transferring data from the bus into one of its registers. Consider a tape unit as an example: the computer starts the tape moving by means of a control command, and the processor then monitors the tape's status by means of a status command.
• Data input: On receiving this command, the interface obtains a data item from the peripheral and places it in its buffer register. The processor checks whether data is available by using a status command and then issues the data input command; the interface places the data on the data lines, where it is accepted by the processor.
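The four command kinds above can be modelled as a toy dispatch routine. The class name, command strings, and status fields are invented purely for illustration; real interfaces encode these as bit patterns on the control lines.

```python
# Hypothetical peripheral interface dispatching the four command kinds
# described above: control, status, data input, and data output.
class PeripheralInterface:
    def __init__(self):
        self.active = False
        self.buffer = None                       # buffer register for data
        self.status = {"ready": False, "error": False}

    def command(self, kind, payload=None):
        if kind == "control":                    # activate the peripheral
            self.active = True
            self.status["ready"] = True
            return None
        if kind == "status":                     # test status conditions
            return dict(self.status)
        if kind == "data_input":                 # latch a data item from the device
            self.buffer = payload
            return self.buffer
        if kind == "data_output":                # put data on the data lines
            return self.buffer
        raise ValueError("unknown function code")

iface = PeripheralInterface()
iface.command("control")
st = iface.command("status")
iface.command("data_input", 0x41)
out = iface.command("data_output")
```

The processor-side protocol mirrors the prose: first a control command, then a status check, then the data transfer.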
12.4.1 Input-Output vs. memory bus
In addition to communicating with I/O, the processor must also communicate with the memory unit. The memory bus consists of the following:
• data
• address
• read/write control lines
Computer buses can communicate with I/O and memory by using the following
techniques:
• Make use of two different buses, the first bus for memory and the
second bus for I/O.
• Make use of a common bus for memory and I/O, but with separate control lines for each.
• Make use of a common bus for memory and I/O, with common control lines.
In case of first technique, the computer comprises the following:
• data
• address
• control buses, one bus for I/O and other for accessing memory
This is done in computers that provide a separate input-output processor (IOP) in addition to the CPU (central processing unit). By means of the memory bus, the memory communicates with both the central processing unit and the input-output processor. The IOP also communicates with the input and output devices through a separate I/O bus with its own data, address
CS RS1 RS0  Register selected
1   0   0   Port A register
1   0   1   Port B register
1   1   0   Control register
1   1   1   Status register
The chip select and register select inputs determine the address assigned to the interface. The I/O read and I/O write are two control lines that specify an input or output transfer, respectively. The four registers (port A register, port B register, control register, and status register) communicate directly with the I/O device attached to the interface. Input-output data to and from the device can be transferred through either port A or port B.
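Assuming the three bits in the table above are the chip select (CS) and the two register select lines (RS1, RS0), the selection logic can be sketched as a small lookup. This is an illustrative model only, not the circuit of any particular chip.

```python
# Decode the chip-select (CS) and register-select (RS1, RS0) lines
# into one of the four interface registers, following the table above.
REGISTERS = {
    (0, 0): "Port A register",
    (0, 1): "Port B register",
    (1, 0): "Control register",
    (1, 1): "Status register",
}

def select_register(cs, rs1, rs0):
    if cs != 1:            # interface not addressed: its lines stay disabled
        return None
    return REGISTERS[(rs1, rs0)]
```

When CS is 0 the interface ignores the bus entirely, which matches the earlier statement that non-addressed peripherals are disabled.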
The interface may operate with an output device or with an input device, or
with a device that requires both input and output. If the interface is connected
to a printer, it will only output data, and if it services a character reader, it will
only input data. A magnetic disk unit is used to transfer data in both directions
but not at the same time, so the interface can use bidirectional lines. A
command is passed to the I/O device by sending a word to the appropriate
interface register.
In a system like this, the function code in the I/O bus is not needed because
control is sent to the control register, status information is received from the
status register, and data are transferred to and from ports A and B registers.
Thus the transfer of data, control, and status information is always via the
common data bus.
The distinction between data, control, or status information is determined from
the particular interface register with which the CPU communicates. The control
register gets control information from the CPU. By loading appropriate bits into
the control register, the interface and the I/O device attached to it can be
placed in a variety of operating modes. For example, port A may be defined
as an input port and port B as an output port. A magnetic tape unit may be
instructed to rewind the tape or to start the tape moving in the forward
direction. The bits in the status register are used for status conditions and for
recording errors that may occur during the data transfer. For example, a status
bit may indicate that port-A has received a new data item from the I/O device.
The interface registers communicate with the CPU through a bidirectional data bus. The address bus selects the interface unit through the chip select and the two register select inputs. A circuit must be provided externally (usually a decoder) to detect the address assigned to the interface registers. This circuit enables the chip select (CS) input when the interface is selected by the address bus. The two register select inputs, RS1 and RS0, are usually connected to the two least significant lines of the address bus; these two inputs select one of the four registers in the interface, as specified in the table accompanying the diagram. The content of the selected register is transferred into the CPU via the data bus when the I/O read signal is enabled, and the CPU transfers binary information into the selected register via the data bus when the I/O write input is enabled.
Self Assessment Questions
7. ________ is used in computers for backup storage.
8. ________ from the processor is attached to all peripheral interfaces.
9. A ____________ is issued to test various status conditions in the interface and peripheral.
Activity 1:
Visit an IT organisation and observe the functioning of the I/O interface and
the data lines, control lines, and I/O bus architecture. Also, check whether
the I/O system used is isolated or memory-mapped.
availability of 100 disks much higher than that of a single disk. These systems
have become known by the acronym RAID, which stands for redundant array
of inexpensive disks. We will study this topic in the next section.
Self Assessment Questions
10. ____________ refers to consistent reporting when information is lost
because of failure.
11. ____________ is an innovation that improves both availability and
performance of storage systems
12.6 RAID
RAID is an acronym for 'redundant array of inexpensive disks'. There are several approaches to redundancy, with different overheads and performance. The term RAID was introduced in the 1987 paper by Patterson, Gibson, and Katz, which used a numerical classification for these schemes that has become popular; in fact, the non-redundant disk array is sometimes called RAID 0. One difficulty is discovering when a disk fails. Magnetic disks provide information about their correct operation: information recorded in each sector helps detect errors within that sector, and transferring sectors allows the attached electronics to discover disk failure or loss of information.
The levels of RAID are as follows:
12.6.1 Mirroring (RAID 1)
Mirroring or shadowing is the traditional solution to disk failure. It uses twice as many disks: data is simultaneously written to two disks, one non-redundant and one redundant, so that there are always two copies of the data. If one disk fails, the system goes to the mirror disk to obtain the required information. This technique is the most expensive solution.
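A minimal sketch of the mirroring idea, with an invented two-disk class and a simulated primary-disk failure; real RAID 1 implementations operate on raw blocks, not Python dictionaries.

```python
# Minimal RAID 1 sketch: every write goes to both disks; a read
# falls back to the mirror when the primary has failed.
class Raid1:
    def __init__(self):
        self.disks = [{}, {}]            # primary and mirror
        self.failed = [False, False]

    def write(self, block, data):
        for d in self.disks:
            d[block] = data              # two copies of the data

    def read(self, block):
        for i, d in enumerate(self.disks):
            if not self.failed[i] and block in d:
                return d[block]
        raise IOError("both copies lost")

r = Raid1()
r.write(7, b"payload")
r.failed[0] = True                       # simulate primary-disk failure
recovered = r.read(7)
```

The cost is visible in the sketch itself: every write is performed twice, which is why mirroring doubles the number of disks.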
12.6.2 Bit-Interleaved parity (RAID 3)
Bit-interleaved parity is an error-detection technique in which the bit pattern of each character is forced into parity, so that the total number of 1 bits is always odd (or always even). This is done by appending a "1" or "0" parity bit to each byte as the character/byte is transmitted; at the other end of the transmission the parity is checked for accuracy. BIP is also used at the physical layer (high-speed transmission of binary data) to monitor errors.
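The per-byte parity scheme described above can be sketched as follows. Even parity is assumed here; an odd-parity scheme would simply invert the bit.

```python
# Even-parity sketch: append a parity bit so every transmitted byte
# carries an even number of 1 bits overall; the receiver re-checks
# the parity to detect a single-bit error.
def parity_bit(byte):
    return bin(byte).count("1") % 2      # 1 if the count of 1s is odd

def send(byte):
    return (byte, parity_bit(byte))      # transmit data plus parity

def check(byte, parity):
    return parity_bit(byte) == parity    # False means an error was detected

data, p = send(0b1011001)
ok = check(data, p)
corrupted = data ^ 0b0000100             # flip a single bit "in transit"
bad = check(corrupted, p)
```

Flipping any single bit changes the count of 1s by one, so the parity check always catches a one-bit error (though not a two-bit error).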
The cost of higher availability can be reduced to 1/N, where N is the number of disks.
Manipal University Jaipur B1648 Page No. 266
Computer Architecture Unit 1
RAID 4 supports mixtures of large reads, large writes, small reads and small writes.
One drawback to the system is that the parity disk must be updated on every
write, so it is the bottleneck for sequential writes.
To fix the parity-write bottleneck, the parity information is spread throughout
all the disks so that there is no single bottleneck for writes. This distributed
parity organisation is RAID 5. Figure 12.6 shows how data is distributed in
RAID 4 and RAID 5.
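The parity idea behind RAID 4 and RAID 5 can be sketched with XOR: the parity block is the XOR of the data blocks in a stripe, and a lost block is rebuilt by XOR-ing all survivors. The `stripe % n_disks` placement rule for RAID 5's rotating parity is one simple possibility, used here for illustration; actual arrays use various layouts.

```python
from functools import reduce

# Block parity and a rotated parity placement in the RAID 5 style.
def parity_block(data_blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  data_blocks)

def parity_disk_raid5(stripe, n_disks):
    # Spread parity across the disks so no single disk takes every
    # parity write (the RAID 4 bottleneck described above).
    return stripe % n_disks

def rebuild(surviving_blocks):
    # XOR of the survivors (data + parity) reproduces the lost block.
    return parity_block(surviving_blocks)

d0, d1, d2 = b"\x0f", b"\xf0", b"\x55"
p = parity_block([d0, d1, d2])          # parity for the stripe
lost_d1 = rebuild([d0, d2, p])          # recover d1 after a disk failure
```

Because XOR is its own inverse, the same `parity_block` routine serves both for writing parity and for reconstruction.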
The two most common measures of I/O performance are diversity and
capacity.
• Diversity: Which I/O devices can connect to the computer system?
• Capacity: How many I/O devices can connect to a computer system?
Other traditional measures to performance are throughput (sometimes called
bandwidth) and response time (sometimes called latency). Figure 12.7 shows
the simple producer-server model. The producer creates tasks to be
performed and places them in a buffer; the server takes tasks from the first-
in-first-out buffer and performs them.
Response time is defined as the time from when a task is placed in the buffer until the server completes it. Throughput, in simple words, is the average number of tasks completed by the server over a period of time. To reach the maximum throughput, the server should never be idle, so the buffer should never be empty. Response time, in contrast, includes the time spent in the buffer and is minimised when the buffer is empty.
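The producer-server model can be made concrete with a small deterministic simulation; the arrival times and the two-unit service time below are invented numbers, chosen only to expose the throughput/response-time tension.

```python
# Deterministic producer-server sketch: tasks enter a FIFO buffer;
# one server takes `service` time units per task.
# Response time = waiting in the buffer + service time.
def simulate(arrivals, service=2):
    server_free = 0
    responses = []
    for t in arrivals:
        start = max(t, server_free)        # wait if the server is busy
        server_free = start + service
        responses.append(server_free - t)  # completion minus arrival
    makespan = server_free - arrivals[0]
    throughput = len(arrivals) / makespan
    mean_response = sum(responses) / len(responses)
    return throughput, mean_response

tp, rt = simulate([0, 0, 0, 0])      # burst: server never idle, long waits
tp2, rt2 = simulate([0, 4, 8, 12])   # spaced: empty buffer, minimal waits
```

The burst keeps the server busy (maximum throughput) at the cost of queueing delay, while the spaced arrivals keep the buffer empty (minimum response time) at the cost of idle server time, exactly the trade-off stated above.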
Improving performance does not always mean improvements in both response time and throughput. Throughput is increased by adding more servers, as shown in figure 12.8: spreading data across two disks instead of one enables tasks to be performed in parallel. Unfortunately, this does not help response time unless the workload is held constant and the time in the buffers is reduced because of the extra resources.
Figure 12.8: Single-Producer Model Extended with another Server and Buffer
How does the architect balance these conflicting demands? If the computer
is interacting with human beings, figure 12.9 suggests an answer.
Figure 12.9 plots three workloads: a conventional interactive workload (1.0 sec system response time), a high-function graphics workload (1.0 sec system response time), and a high-function graphics workload (0.3 sec system response time).
This figure presents the results of two studies of interactive environments: one
keyboard oriented and one graphical. An interaction, or transaction, with a
computer is divided into three parts:
1. Entry time - The time for the user to enter the command. The graphics
system in figure 12.9 took 0.25 seconds on average to enter a command
versus 4.0 seconds for the keyboard system.
2. System response time - The time between when the user enters the
command and the complete response is displayed.
3. Think time - The time from the reception of the response until the user
begins to enter the next command.
The sum of these three parts is called the transaction time. Several studies
report that user productivity is inversely proportional to transaction time;
transactions per hour are a measure of the work completed per hour by the
user.
The results in figure 12.9 show that reducing response time actually decreases transaction time by more than just the response-time reduction: cutting system response time by 0.7 seconds saves 4.9 seconds (34%) from the conventional transaction and 2.0 seconds (70%) from the graphics transaction. This surprising result is explained by human nature: people need less time to think when given a faster response.
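A quick calculation using only the figures quoted above shows how much of each saving must come from reduced think time rather than from the response-time cut itself.

```python
# Re-deriving the claim above: the response time drops from 1.0 s to
# 0.3 s (a 0.7 s cut), yet total transaction time shrinks by more
# than 0.7 s, so the remainder must be reduced think time.
response_cut = 1.0 - 0.3           # 0.7 s faster system response
conventional_saving = 4.9          # total saving, conventional workload
graphics_saving = 2.0              # total saving, graphics workload

think_saving_conventional = conventional_saving - response_cut   # 4.2 s
think_saving_graphics = graphics_saving - response_cut           # 1.3 s
```

So users of the conventional system think 4.2 seconds faster, and graphics users 1.3 seconds faster, once responses arrive more quickly.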
Self Assessment Questions
14. _______ is also known as bandwidth.
15. ________________ is sometimes called latency.
Activity 2:
Visit an organisation. Find the level of reliability, availability and dependability
of the system used. Also, measure the I/O performance.
12.8 Summary
Let us recapitulate the important concepts discussed in this unit:
• A computer must have a system to get information from the outside world
and must be able to communicate results to the external world.
• Each time a PC is shut down, the contents of the PC's random-access memory (RAM) are lost.
12.9 Glossary
• Bus Interface: Communication link between the processor and several
peripherals.
• CD-R: Compact disk recordable
• CD-ROM: Compact disk read only memory
• CD-RW: Compact disk rewritable
• DVD-ROM: Digital video (or versatile) disk read only memory
• Input devices: Computer peripherals used to enter data into the computer.
• Input-Output Interface: This gives a method for transferring information
between internal memory and I/O devices.
• Input-Output Processor (IOP): An external processor that
communicates directly with all I/O devices and has direct memory access
capabilities.
• Output devices: Computer peripherals used to get output from the
computer.
• RAM: Random Access Memory
• RPM: Revolutions per Minute
12.10 Terminal Questions
12.11 Answers
Self Assessment Questions
1. Computer architecture
2. Computer organisation
3. Random access memory
4. Storage media
5. Revolutions per minute
6. CD-ROM
7. Magnetic tape
8. I/O bus
9. Status command
10. Data integrity
11. Disk array
12. Redundant array of inexpensive disks
13. Mirroring or shadowing
14. Throughput
15. Response time
Terminal Questions
1. To keep information from one computer session to another, one must store
the information within a file that ultimately stores on disk. This is called
storage system. Refer Section 12.2.
2. There are two main categories of the storage devices: magnetic storage
and optical storage. Refer section 12.3.
3. Peripherals connected to a computer require special communication links
for interfacing them with the CPU. This is done through i/o bus which
connects the peripheral devices to the CPU. Refer section 12.4.
4. Memory transfer and I/O transfer differ in that they use separate read and write control lines. Refer section 12.4.
References:
• Hwang, Kai, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill.
• Flynn, Michael J., Computer Architecture: Pipelined and Parallel Processor Design, Narosa.
• Hayes, J. P., Computer Architecture and Organisation, McGraw-Hill.
• Carter, Nicholas P., Schaum's Outline of Computer Architecture, McGraw-Hill Professional.
E-references:
• www.es.ele.tue.nl
• www.stanford.edu
• ece.eng.wayne.edu
13.1 Introduction
In the previous unit, you studied storage systems, covering topics such as types of storage devices, connecting I/O devices to the CPU/memory, reliability, availability and dependability, RAID, and I/O performance measures. Multithreading is a type of multitasking. Prior to Win32, the only type of multitasking available was cooperative multitasking, which had no concept of priorities. A multithreading system does have a concept of priorities and is therefore also called background processing or pre-emptive multitasking.
Dataflow architecture is in direct contrast to the traditional Von Neumann
architecture or control flow architecture. Although dataflow architecture has
not been used in any commercially successful computer hardware, it is very
13.2 Multithreading
Multithreading is the capability of a processor to do multiple things at one time.
The Windows operating system uses the API (Application Programming
Interface) calls to manipulate threads in multithreaded applications. Before
discussing the concepts of multithreading, let us first understand what a thread
is.
13.2.1 What is a thread?
Each process, created when an application is run, consists of at least one thread that contains code. All the code within a thread, while it is active, is executed sequentially, one line after another. In a multithreading system, many threads belonging to that particular process run concurrently. A thread can be viewed as an independent program counter within a process, indicating the location of the instruction on which the thread is operating. A
thread has the following features:
• A state of thread execution
Although these tasks can be performed using more than one process, it is
generally more efficient to use a single multithreaded application because the
system can perform a context switch more quickly for threads than processes.
Moreover, all threads of a process share the same address space and
resources, such as files and pipes.
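The shared address space can be illustrated directly in Python: two threads of one process operate on the very same queue object, with no copying or message passing. The producer/consumer roles here are invented for the example.

```python
import threading
import queue

# Two threads of one process sharing the same address space: both
# operate on the same Queue object and the same results list.
shared = queue.Queue()
results = []

def producer():
    for i in range(5):
        shared.put(i)
    shared.put(None)                 # sentinel: no more items

def consumer():
    while True:
        item = shared.get()
        if item is None:
            break
        results.append(item * item)  # writes into memory the producer sees too

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

If producer and consumer were separate processes instead, they could not share `shared` and `results` directly; they would need pipes or other inter-process channels.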
13.2.3 Benefits of multithreading system
A multithreading system provides the following benefits over a multiprocessing
system:
• Threads improve communication between different execution traces, as the same user address space is shared.
• Creating a new thread in an existing process is much less time-consuming than creating a brand-new process.
• Termination of a thread also takes less time.
• Switching control between two threads within the same process takes less time than switching between two processes.
Self Assessment Questions
1. The multithreading system has a concept of priorities and therefore, is
also called __________ or __________ .
2. It takes much more time to create a new thread in an existing process
than to create a brand-new process. (True/False)
Activity 1:
Visit a library and find out more details about the various models of
multithreading like Blocked model, Forking model, Process-pool model and
Multiplexing model.
recent operating systems, such as Solaris, Linux, Windows 2000 and OS/2,
support multiple processes with multiple threads per process. However, the
traditional operating system, MS-DOS supports a single user process and a
single thread. Some traditional UNIX systems are multiprogramming systems: they maintain multiple user processes, but only one execution path is allowed for each process.
Self Assessment Questions
5. Almost all business applications are _________________ .
6. When an application is run, each of the processes contains at least one
system limits the overall speed of a computer system, since the speed of these components has improved at a slower rate than CPU technology. Caches can improve the average speed of memory systems by keeping the most commonly used data in a fast memory close to the processor. Another factor limiting CPU speed increases is the inherently sequential character of Von Neumann instruction execution. Methods of executing several instructions concurrently are now being developed through parallel processing architectures.
13.6.1 Organisation and operation of the Von Neumann architecture
As mentioned in section 13.6, the core of a computer system with Von
Neumann architecture is the CPU. This element obtains (i.e., reads)
instructions and data from the main memory and coordinates the entire
carrying out of every instruction. It is usually structured into two different
subunits: the Arithmetic and Logic Unit (ALU), and the control unit. Figure 13.2
shows the basic components of a Von Neumann model.
The ALU merges and converts data using arithmetic operations, such as
addition, subtraction, multiplication, and division, and logical operations, such
as bit-wise negation, AND, and OR.
The control unit interprets the instructions retrieved from the memory and
manages the operation of the whole system. It establishes the sequence in
which instructions are carried out and offers all of the electrical signals
essential to manage the operation of the ALU and the interfaces to the other
system components.
The memory is a set of storage cells, each of which can be in one of two different states. One state represents a value of "0", and the other represents a value of "1". By distinguishing between these two logical states, each cell is capable of storing a single binary digit, or bit, of information. These bit storage cells are logically organised into words, each of which is b bits wide. Every word is assigned a unique address in the range [0, ..., N - 1].
The CPU identifies the word that it needs to read or write by storing its address in a special memory address register (MAR); a register temporarily stores a value within the CPU. The memory responds to a read request by reading the value stored at the requested address and transferring it to the CPU via the CPU-memory data bus. The value is then temporarily stored in the memory buffer register (MBR), also sometimes called the memory data register, before it is used by the control unit or ALU.
For a write operation, the CPU stores the value it wishes to write into the MBR and the corresponding address in the MAR; the memory then copies the value from the MBR into the address pointed to by the MAR.
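The MAR/MBR protocol just described can be sketched as a toy memory model; the class layout is illustrative only, not the design of any particular machine.

```python
# Toy Von Neumann memory interface: the CPU places an address in the
# MAR; a read delivers the addressed word into the MBR, and a write
# copies the MBR into the addressed cell.
class Memory:
    def __init__(self, n_words):
        self.cells = [0] * n_words
        self.mar = 0     # memory address register
        self.mbr = 0     # memory buffer (data) register

    def read(self, address):
        self.mar = address
        self.mbr = self.cells[self.mar]   # cell -> MBR
        return self.mbr

    def write(self, address, value):
        self.mar = address
        self.mbr = value
        self.cells[self.mar] = self.mbr   # MBR -> cell

mem = Memory(16)
mem.write(3, 99)
loaded = mem.read(3)
```

Both directions of transfer pass through the same two registers, which is exactly the symmetry the paragraph above describes.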
Finally, the input/output (I/O) devices connect the computer system with the external world. These devices allow programs and data to be entered into the system and give the system a way to control an output device. Each I/O port has a unique address to which the CPU can read or write a value. From the CPU's point of view, an I/O device is accessed in much the same way as memory; in fact, in a number of systems the hardware makes the I/O devices appear to the CPU as memory locations. This configuration, in which the CPU sees no difference between memory and I/O devices, is referred to as memory-mapped I/O. In this case, no distinct I/O instructions are necessary.
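Memory-mapped I/O can be sketched by routing one address range to a device register instead of RAM, so that ordinary loads and stores serve both. The base address and register layout here are invented for illustration.

```python
# Memory-mapped I/O sketch: addresses below DEVICE_BASE hit RAM;
# addresses at or above it are routed to a device register, so the
# CPU uses the same load/store operations for both.
DEVICE_BASE = 0x100                       # invented device address

ram = {}
device_register = {"value": 0}

def load(address):
    if address >= DEVICE_BASE:
        return device_register["value"]   # read the device, not memory
    return ram.get(address, 0)

def store(address, value):
    if address >= DEVICE_BASE:
        device_register["value"] = value  # an ordinary store drives the device
    else:
        ram[address] = value

store(0x10, 7)        # plain memory write
store(0x100, 0xAB)    # same instruction form, but it lands in the device
```

Note that the device write never touches `ram`: the address decoding, not a special instruction, decides where the value goes.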
13.6.2 Key features
In their basic organisation, processors with the Von Neumann architecture are distinguished from simple pre-programmed (or hardwired) controllers by several key features. First, the same main memory stores both instructions and data. Consequently, instructions and data are not distinguished, and neither are the different types of data, such as a floating-point value, a character code, or an integer value. The interpretation of a particular bit pattern depends entirely on how the CPU interprets it. The same value stored at a particular memory location can be interpreted as an instruction at one time and as data at another. For example, when a compiler executes, it reads the source code of a program written in a high-level language, such as FORTRAN or COBOL, and converts it into a series of instructions that can be executed by the CPU. The memory stores the output of the compiler like any other type of data; that output can then be executed by the CPU simply by interpreting it as instructions. Thus, the same values stored in memory are treated as data by the compiler, but are later treated as executable instructions by the CPU. Another consequence of this principle is that every instruction must specify how it interprets the data upon which it operates. Thus, for example, a Von Neumann architecture will have one set of arithmetic instructions for operating on integer values and another set for operating on floating-point values.
The second key feature is that memory is accessed by name (i.e., address), irrespective of the bit pattern stored at each address. Because of this feature, values stored in memory can be interpreted as addresses, as data, or as instructions. Therefore, programs can manipulate addresses using the same set of instructions that the CPU uses to manipulate data. This flexibility in how values in memory are interpreted allows very complex, dynamically changing access patterns to be generated by the CPU for any variety of data structure, regardless of the kind of value being read or written. Finally, another key feature of the Von Neumann scheme is that the sequence in which a program executes its instructions is sequential, unless that order is explicitly altered. The program counter (PC), a special register in the CPU, holds the address of the next instruction in memory to be executed. After each instruction is executed, the value in the PC is incremented to point to the next instruction in the sequence. This sequential execution order can be changed by the program itself by means of branch instructions, which store a new value into the PC register.
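The PC behaviour just described, a default increment that a branch can overwrite, can be sketched as a tiny fetch-execute loop. The three-field instruction format is invented purely for this illustration.

```python
# Minimal fetch-execute loop: the PC advances sequentially unless a
# branch instruction loads a new value into it.
def run(program):
    acc, pc = 0, 0
    while pc < len(program):
        op, arg = program[pc]            # fetch the instruction at the PC
        pc += 1                          # default: next instruction in sequence
        if op == "ADD":
            acc += arg
        elif op == "BRANCH_IF_LT":       # arg packs a (limit, target) pair
            limit, target = arg
            if acc < limit:
                pc = target              # branch: overwrite the PC
        elif op == "HALT":
            break
    return acc

# Loop: keep adding 2 until the accumulator reaches 10, then halt.
result = run([("ADD", 2), ("BRANCH_IF_LT", (10, 0)), ("HALT", None)])
```

The `pc += 1` before the dispatch is the sequential default; only the taken branch replaces it, mirroring how real branch instructions store a new value into the PC.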
Alternatively, special hardware can sense some external event, such as an interrupt, and load a new value into the PC to cause the CPU to begin executing a new sequence of instructions. Although this concept of executing one instruction at a time greatly simplifies the writing of programs and the design and operation of the CPU, it also limits the potential performance of this architecture.
Self Assessment Questions
9. The instruction set together with the resources needed for their execution
Activity 2:
Surf the internet to find out details about the Harvard architecture and
compare it with the Von Neumann architecture.
defined functionality.
Data channels provide the sole mechanism by which nodes can interact and
communicate with one another, which ensures low coupling and greater
reusability. Data channels can also be implemented transparently between
processors to carry messages between components that are physically
distributed. In the dataflow architecture, a control application is composed of
function bodies and data channels, and the connections between function
bodies and data channels are described in a dataflow graph. Consequently,
designing a control application mainly involves constructing such a dataflow
graph by selecting function bodies from the design library and connecting them
together.
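As a minimal sketch of this composition style, the example below models a data channel as a FIFO queue and a function body as a reusable routine that reads from an input channel and writes to an output channel. The names (`scale`, the `None` end-of-stream sentinel) are our own illustrative choices, not part of any specific dataflow library; the point is that function bodies interact only through channels, which is what gives the low coupling described above.

```python
# Function bodies connected by data channels (modelled as FIFO queues).
import threading
from queue import Queue

def scale(inp, out, factor):
    """A reusable function body: multiply each incoming token by `factor`."""
    while True:
        x = inp.get()
        if x is None:            # sentinel token: end of stream
            out.put(None)
            return
        out.put(x * factor)

# Build the dataflow graph: source channel a -> scale -> sink channel b
a, b = Queue(), Queue()
t = threading.Thread(target=scale, args=(a, b, 10))
t.start()
for v in [1, 2, 3, None]:        # feed tokens into the input channel
    a.put(v)
t.join()

results = []
while True:                      # drain the output channel
    v = b.get()
    if v is None:
        break
    results.append(v)
print(results)                   # [10, 20, 30]
```

Because `scale` touches only its two channel endpoints, it could be moved to another processor and the channels replaced by message passing without changing its code, which is the transparency property noted above.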
Additional user-defined or application-specific function bodies are also easily
supported. A model of dataflow programming is shown in figure 13.3.
Some models were extended with "sticky tokens": tokens that remain at an input,
acting much like a constant, and match with tokens arriving on the other
inputs. Nodes can have varying granularity, from single instructions to whole
functions. Once a node is activated and its operation is performed, the node's
outputs, called "Fired Results", are passed along the arcs to the waiting
nodes. This process is repeated until all of the nodes have fired and the
final result is produced. More than one node can fire simultaneously.
Arithmetic operators and conditional operators act as nodes.
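The firing rule can be sketched in a few lines. In this toy model (our illustration, not the text's own code), a node fires only once a token is present on every input arc; its result token is then delivered to the waiting successor node.

```python
# Toy dataflow firing rule: a node fires when all its input tokens arrive,
# and its result token is passed along the arc to the waiting node.

class Node:
    def __init__(self, op, n_inputs, successor=None):
        self.op = op
        self.n_inputs = n_inputs
        self.tokens = {}            # input slot -> waiting token
        self.successor = successor  # (node, slot) that receives the result
        self.result = None

    def receive(self, slot, value):
        self.tokens[slot] = value
        if len(self.tokens) == self.n_inputs:      # all inputs present: fire
            args = (self.tokens[i] for i in range(self.n_inputs))
            self.result = self.op(*args)
            if self.successor:
                nxt, nslot = self.successor
                nxt.receive(nslot, self.result)    # pass token along the arc

# Graph for (2 + 3) * 4: an add node feeding a multiply node
mul = Node(lambda x, y: x * y, 2)
add = Node(lambda x, y: x + y, 2, successor=(mul, 0))
add.receive(0, 2)
add.receive(1, 3)    # add now fires; its result waits at mul's input 0
mul.receive(1, 4)    # mul's second token arrives, so mul fires
print(mul.result)    # 20
```

A sticky token would simply be a slot whose value is never consumed, so the node can re-fire each time fresh tokens arrive on the other inputs.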
The Dynamic Critical Path: The dynamic critical path of a dataflow graph is
simultaneously a function of program dependences, the runtime execution path,
hardware resources and dynamic scheduling. All critical events must be
last-arrival events; such an event is the last one that enables data to be
latched. Events correspond to signal transitions on the edges of the dataflow
graph. Most often, the last-arrival event is the last input to reach an
operation. However, for lenient operations the last-arrival event is the input
that enables the computation of the output. In lenient execution, all forward
branches are executed simultaneously.
• *T (Star T): The Monsoon project at MIT was followed by the Star-T
project. It used extensions of an off-the-shelf processor architecture to
define a multiprocessor architecture that supports fine-grain
communication and user microthreads. The architecture is intended to
retain the latency-hiding feature of Monsoon's split-phase global memory
operations while remaining compatible with conventional Massively Parallel
Architectures (MPAs) based on the Von Neumann model. Inter-node traffic
consists of a tag (called a continuation, a pair comprising a context and
an instruction pointer).
13.9 Summary
• Multithreading is needed to create an application that is able to perform
more than one task at once.
• There are various needs, benefits and principles of multithreading.
• Scalability is defined as the ability to increase the amount of processing
that can be done by adding more resources to a system.
• The combination of languages and the computer architecture in a common
foundation or paradigm is called Computational Model.
• There are basically three types of computational models as follows: Von
Neumann model, Dataflow model and Hybrid multithreaded model.
• The heart of the Von Neumann computer architecture is the Central
Processing Unit (CPU), consisting of the control unit and the ALU
(Arithmetic and Logic Unit).
• Dataflow architecture does not have a program counter and the execution
of instructions is solely determined based on the availability of input
arguments to the instructions.
• A hybrid model synergistically combines features of both Von Neumann
and Data-flow, as well as exposes parallelism at a desired level.
13.10 Glossary
• Background processing: another name for a multithreading system that has
a concept of priorities.
• Communication parallelism: refers to the way threads can
communicate with other threads residing in other processors.
• Computation parallelism: refers to the 'conventional' parallelism, in
which the computation itself is divided among threads that run in parallel.
• GUI: Graphical User Interface, these programs can perform more than one
task (such as editing a document as well as printing it) at a time.
• JVM: Java Virtual Machine, it is an example of multithreading system.
13.12 Answers
Self Assessment Questions
1. Background processing, pre-emptive multitasking
2. False
3. True
4. Fine-grain threading
5. Scalable
6. Thread
7. Computational Model
8. True
9. Instruction set architecture (ISA)
10. True
11. Node, arrow
12. Fired Results
13. Iannucci
14. False
Terminal Questions
1. Multithreading is a type of multitasking. The multithreading system has a
concept of priorities and therefore, is also called background processing
or pre-emptive multitasking. Refer Section 13.2.