
Advanced Computer Architecture

Kai Hwang & Naresh Jotwani

Chapter One
Parallel Computer Models
Computer Generations
First Generation (1945-1954):
• Single central processing unit (CPU).
• Performed fixed-point arithmetic, with the CPU driven by a program counter and accumulator
• Used machine or assembly languages
• Subroutine linkage was not implemented
• Vacuum-tube and relay-memory technology
• Representative systems: IBM 701, ENIAC, Princeton IAS
Second Generation (1955-1964):
• Floating-point arithmetic, multiplexed memory, and index registers were introduced
• Subroutine libraries and compilers were implemented
• High-level languages (Fortran, Cobol) were established
• Register-transfer language (RTL) was developed
• Representative systems: IBM 7030, Univac LARC, CDC 1604
Third Generation (1965-1974):
• Pipelining and cache memory were introduced
• Integrated circuits (ICs) and microprogrammed control were used to coordinate activities between the CPU and I/O for multiple users
• Time-sharing operating systems using virtual memory were developed for better utilization of resources
• Representative systems: IBM 360/370 series, CDC 6600/7600, ASC, PDP-8 series
Computer Generations cont…

Fourth Generation (1975-1990):


• Parallel computing was introduced, with both shared-memory and distributed-memory machines
• Multiprocessor operating systems, special languages, and compilers were developed for parallelism
• Software tools and environments were designed for parallel processing
• LSI/VLSI devices and semiconductor memories were used
• Representative system: IBM/3090 VF, VAX 9000, Cray X-MP, BBN TC 2000
Fifth Generation (1991-):
• Superscalar processors, cluster computers, and massively parallel processing (MPP) were emphasized
• Advances in VLSI, high-density packaging, and optical technologies were achieved
• Heterogeneous processing emerged for solving large-scale problems
• Representative systems: Convex C3800, Cray Y-MP and C-90, Digital VAX 9000
Elements of Modern Computer

Computing problem
Algorithms and data structures
Hardware resources
Operating system
System software support
Compiler Support
Computing problem:
• A modern computer is an integrated system consisting of machine hardware, an instruction set, system software, application programs, and user interfaces.
• Different problems demand different computing resources: numerical problems call for complex mathematical formulations; alphanumerical problems call for efficient transaction processing and large database management; artificial-intelligence problems call for logic inference and symbolic manipulation. Some problems require a combination of all of these.
Algorithms and data structures:
• To specify the computations involved in the solution, particular data structures and
algorithms are needed.
• Numerical algorithms are mostly deterministic, while symbolic processing may require nondeterministic approaches.
Hardware resources:
• Processors, memory, and peripheral devices form the hardware core of a computer system.
• Special hardware interfaces are often built into I/O devices such as network adapters, modems, workstations, display terminals, printers, and scanners.
Operating System:
• Manages the allocation and deallocation of the resources during the execution of any user
program.
• Application software and standard benchmark programs must be available for performance evaluation.
• Supports efficient mapping of programs onto hardware: compilation, processor scheduling, memory management, and exploitation of parallelism (at both compile time and run time).
System software support:
• Programs written in high-level languages must be translated into machine language with the help of good system software support.
• Resource binding uses the compiler, assembler, loader, and OS kernel to map the program onto the physical machine for execution.
Compiler Support:
• There are three compiler support approaches: (i) preprocessor, (ii) precompiler, and (iii) parallelizing compiler.
• A preprocessor uses a sequential compiler together with a low-level library of the target computer to implement high-level parallel constructs.
• A precompiler performs limited program-flow analysis and optimization to detect parallelism.
• A parallelizing compiler is a fully developed parallelizing/vectorizing compiler that can automatically transform sequential code into parallel constructs, as sketched below.
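As an illustrative Python sketch (not from the text) of the transformation a parallelizing compiler aims to perform automatically, the fragment below rewrites a sequential element-wise loop as an explicitly parallel construct; the chunking scheme and the use of threads are arbitrary choices for the example:

    # A sequential loop, as the programmer writes it:
    def add_sequential(a, b):
        c = [0.0] * len(a)
        for i in range(len(a)):
            c[i] = a[i] + b[i]
        return c

    # Roughly the parallel construct a parallelizing compiler aims to produce:
    # the independent iterations are split into chunks that run concurrently
    # (threads are used here only for illustration).
    from concurrent.futures import ThreadPoolExecutor

    def add_parallel(a, b, workers=4):
        n = len(a)
        def chunk(k):
            lo, hi = k * n // workers, (k + 1) * n // workers
            return [a[i] + b[i] for i in range(lo, hi)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            parts = list(pool.map(chunk, range(workers)))
        return [x for part in parts for x in part]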
Flynn’s Classification of Computer Architectures (Michael Flynn, 1972):
SISD (Single Instruction stream over a Single Data stream):

Fig. 1.2(a) SISD uniprocessor architecture

SIMD (Single Instruction stream over Multiple Data streams):

Fig. 1.2(b) SIMD architecture (with distributed memory)


Legends: CU = Control Unit, PU = Processing Unit, MU = Memory Unit, IS = Instruction Stream, DS = Data Stream
PE = Processing Element, LM = Local Memory
MIMD (Multiple Instruction streams over Multiple Data streams):

Fig. 1.2(c) MIMD architecture (with shared memory)

MISD (Multiple Instruction streams over a Single Data stream):

Fig. 1.2(d) MISD architecture

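As a rough Python sketch of the difference between SISD and SIMD execution (not part of the original figures; NumPy is used only as a stand-in for lock-step processing elements):

    import numpy as np

    a = list(range(8))
    b = list(range(8, 16))

    # SISD style: a single instruction stream works through a single
    # data stream, one element per step.
    c_sisd = []
    for i in range(len(a)):
        c_sisd.append(a[i] + b[i])

    # SIMD style: one "add" operation is broadcast over multiple data
    # elements and applied to all of them in lock step.
    c_simd = np.asarray(a) + np.asarray(b)

    assert c_sisd == c_simd.tolist()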
System Performance Attributes
Clock Time/Clock Cycle:
• The clock cycle time, denoted by τ, is a constant for a given machine but varies with the architecture used. (Usually measured in ns)
Clock Rate/Cycle Frequency:
• The clock rate, denoted by f, is the inverse of the clock time: f = 1/τ. (Usually measured in Hz)
Instruction Count (Ic):
• The instruction count (Ic) is the size of a program, i.e., the number of machine instructions that must be executed to run the program.
CPI/Cycles per Instruction:
• An instruction is carried out in several phases (fetch, decode, operand fetch, execute), and different machine instructions may require different numbers of clock cycles.
• CPI is the average number of clock cycles needed to execute a single instruction.
• If CPU clock cycles for a single program is C and Instruction count is Ic, then:
      CPI = C / Ic
Execution Time/CPU Time:
• If Ic is the total number of instructions, CPI is the average number of cycles per instruction, and τ is the clock cycle time, the total execution time T for a program is:
      T = Ic × CPI × τ   or   T = Ic × CPI / f
• Carrying out an instruction requires some phases:
 Instruction fetch
 Decode
 Operands fetch
 Execution
 Storing results back to the memory
• The decode and execute phases are carried out in the CPU, hence the term processor cycle.
• The remaining three phases require memory access, hence the term memory cycle.
• The memory cycle is usually k times the processor cycle. If an instruction needs p processor cycles and m memory references, the CPI can be written as:
      CPI = p + m × k
• Therefore, the total execution time T is:
      T = Ic × (p + m × k) × τ
  where p = number of processor cycles needed for instruction decode and execution,
        m = number of memory references needed,
        k = ratio of the memory cycle time to the processor cycle time.
MIPS rate (Million Instructions Per Second):
• Evaluated as:
      MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6)
  [derived from the total execution time T = Ic × CPI × τ]

Math Problems
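As an illustrative worked example in Python (the instruction count, CPI, and clock rate below are made-up values, not taken from the lecture), the formulas above combine as follows:

    # Hypothetical program and machine parameters (illustrative only).
    Ic  = 200_000_000      # instruction count
    CPI = 2.5              # average clock cycles per instruction
    f   = 500_000_000      # clock rate in Hz (500 MHz)

    tau  = 1 / f                    # clock cycle time in seconds
    T    = Ic * CPI * tau           # execution time: T = Ic * CPI * tau
    mips = Ic / (T * 1e6)           # MIPS rate = Ic / (T * 10^6) = f / (CPI * 10^6)

    print(f"T = {T:.3f} s, MIPS = {mips:.1f}")   # prints: T = 1.000 s, MIPS = 200.0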
Multiprocessors and Multicomputers
Depending on how memory is accessed, two categories are presented here.
Shared-Memory Multiprocessors
• Uniform Memory Access (UMA) model
• Non-uniform Memory Access (NUMA) model
• Cache-Only Memory Architecture (COMA) model
Distributed-Memory Multicomputers
Shared-Memory Multiprocessors
• The Uniform Memory Access (UMA) model:
 The physical memory is uniformly shared by all the
processors.
 All processors have equal access time to all the memory
words, thus called uniform memory access.
 Processors may have their own private caches.
 The tightly coupled processors and the memory are interconnected by a common bus, a crossbar switch, or a multistage network.
 When all the processors have equal access to all the peripherals, the machine is called a symmetric multiprocessor.
 When only one or a few processors have such access, it is called an asymmetric multiprocessor.
 The UMA model is well suited to time-sharing applications by multiple users and can also be used to speed up the execution of a single large program.
• The Nonuniform Memory Access (NUMA) model:
 In the NUMA model, the access time varies with the location of the memory word.
 The shared memory is physically distributed among all the processors as local memories.
 The processors are divided into several clusters; each cluster is itself a UMA or a NUMA multiprocessor.
 All processors belonging to a given cluster have uniform access to that cluster's shared memory.
 All clusters have equal access to the global shared memory, but the access time to cluster memory is shorter than the access time to global memory.

Legends: GSM = Global-Shared Memory, CSM = Cluster-Shared Memory


CIN = Cluster Interconnection Network, P = Processor
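As a rough numeric sketch of this model (the cycle counts and reference mix below are invented for illustration, not values from the text), the cost of a memory reference can be modeled as depending on where the word resides:

    # Hypothetical per-reference latencies, in processor cycles, chosen only
    # to reflect the ordering local < cluster < global implied by the model.
    LATENCY = {"local": 2, "cluster": 20, "global": 100}

    # Hypothetical fraction of references served at each level.
    mix = {"local": 0.80, "cluster": 0.15, "global": 0.05}

    # Average cost of one reference under this (made-up) mix; in a UMA
    # machine the three latencies would all be equal instead.
    effective = sum(mix[level] * LATENCY[level] for level in mix)
    print(f"effective access time ~ {effective:.1f} cycles")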
• The Cache-Only Memory Architecture (COMA) model:
 This model is a special case of the NUMA model in which the distributed main memories are converted to caches.
 All the caches together form the global address space; there is no memory hierarchy at each processor node.
 Access to remote caches is assisted by the distributed cache directories.

Legends: P = Processor, C = Cache, D = Directories


Distributed-Memory Multicomputers
• This model consists of multiple computers, also known as nodes
• These nodes are autonomous, each with a processor and a local memory of its own.
• The message-passing network provides point-to-point static connections among the nodes.
• All local memories are private and accessible only by the local processor; for this reason, these machines are called no-remote-memory-access (NORMA) machines.
• Internode communication is accomplished by passing messages through the static connection network.

Legends: P = Processor, M = Memory
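As a hedged Python sketch of NORMA-style operation (the node names, values, and the use of operating-system processes are illustrative assumptions), two processes with private state exchange data only by passing messages over a point-to-point link:

    from multiprocessing import Process, Pipe

    def node(name, local_value, conn):
        # Each node has a private local memory (here just a variable);
        # the only way to see another node's data is to receive a message.
        conn.send((name, local_value))
        peer_name, peer_value = conn.recv()
        print(f"{name}: received {peer_value} from {peer_name}")

    if __name__ == "__main__":
        end0, end1 = Pipe()   # a static point-to-point link between two nodes
        p0 = Process(target=node, args=("node0", 11, end0))
        p1 = Process(target=node, args=("node1", 22, end1))
        p0.start(); p1.start()
        p0.join(); p1.join()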


Multivector and SIMD Computers
• Vector Supercomputers
 A vector computer is built as an extension of a scalar processor; the vector capability is essentially an optional feature, as the figure here shows.
 Through the host computer, data
and program are first loaded into
the main memory.
 The scalar control unit decodes all the instructions.
 If the decoded instruction is a scalar operation or a program-control operation, it is executed directly by the scalar processor using the scalar functional pipelines.
 If the decoded instruction is a
vector operation, it will be sent to
the vector control unit.
 The vector control unit then
supervises the flow of vector data between main memory and vector functional pipelines directly.
 A vector processor can have multiple vector functional pipelines.
• SIMD Supercomputer
 As introduced under Flynn's classification above, an operational SIMD computer is specified by a 5-tuple (a data-structure sketch follows this list):
M = (N, C, I, M, R)
 N is the number of processing elements (PEs).
 C is the set of instructions directly executed by the control unit (CU), including scalar and program-flow-control instructions.
 I is the set of instructions broadcast by the CU to all PEs for parallel execution; it includes all the local operations executed by each active PE over the data within that PE.
 M is the masking scheme that partitions the set of PEs into enabled and disabled subsets.
 R is the set of data-routing functions specifying the various patterns to be set up in the interconnection network for inter-PE communications.
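A minimal Python sketch of how this 5-tuple might be recorded as a data structure (the field types, example instruction names, and the even-numbered masking scheme are illustrative assumptions, not the book's notation):

    from dataclasses import dataclass
    from typing import Callable, Set

    @dataclass
    class SIMDMachine:
        N: int                      # number of processing elements (PEs)
        C: Set[str]                 # instructions executed by the CU itself
        I: Set[str]                 # instructions broadcast to all PEs
        M: Callable[[int], bool]    # masking scheme: is PE i enabled this step?
        R: Set[str]                 # data-routing functions for the network

    # Example machine: 64 PEs, with only even-numbered PEs enabled this step.
    m = SIMDMachine(
        N=64,
        C={"branch", "scalar_add"},
        I={"vector_add", "vector_load"},
        M=lambda i: i % 2 == 0,
        R={"shift", "shuffle"},
    )
    enabled = [i for i in range(m.N) if m.M(i)]
    print(len(enabled), "PEs enabled")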
Parallel Random Access Machines(PRAM)
• Unlike conventional computers, which are modeled as random access machines, parallel random access machines (PRAMs) model idealized parallel computers with zero synchronization or memory access overhead.
 An n-processor PRAM has a globally addressable memory, which may be centralized in one place or distributed among the processors.
 The processors operate on a synchronized read-memory, compute, write-memory cycle.
 Depending on how the concurrent operations are
handled, four options are possible:
 Exclusive Read (ER): allows at most one processor to read from any memory location in each cycle.
 Exclusive Write (EW): allows at most one processor to write into a memory location at a time.
 Concurrent Read (CR): allows multiple processors to read the same information from the same memory location in the same cycle.
 Concurrent Write (CW): allows simultaneous writes to the same memory location.
• PRAM Variants
Depending on how the memory reads and writes are handled, four variants are defined (a small conflict-checking sketch follows this list):
 EREW-PRAM: Exclusive Read or Exclusive Write.
 CREW-PRAM: Concurrent Read or Exclusive Write.
 ERCW-PRAM: Exclusive Read or Concurrent Write.
 CRCW-PRAM: Concurrent Read or Concurrent Write.
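To make the four variants concrete, the small Python sketch below inspects one synchronous memory cycle and reports which variant would be needed to permit it (the representation of a cycle as two address lists is an illustrative assumption):

    from collections import Counter

    def classify_step(reads, writes):
        # reads/writes: lists of memory addresses touched by the processors
        # in one synchronous cycle. Returns the weakest read/write policy
        # under which the cycle is legal, in EREW/CREW/ERCW/CRCW naming.
        read_mode  = "CR" if any(c > 1 for c in Counter(reads).values())  else "ER"
        write_mode = "CW" if any(c > 1 for c in Counter(writes).values()) else "EW"
        return read_mode + write_mode + "-PRAM"

    # Four processors read location 7 simultaneously but write distinct
    # locations: legal on a CREW-PRAM, illegal on an EREW-PRAM.
    print(classify_step(reads=[7, 7, 7, 7], writes=[0, 1, 2, 3]))   # CREW-PRAM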
