Module 1
Prerequisite
What is Computer Architecture?
Basic Architecture of a Computer?
What is Computer Organization?
Difference between CO and CA?
Computer Architecture
Computer architecture is a functional description of requirements and design, aiming to maximize performance and programmability within the limits set by technology and cost at any instance of time.
Computer architecture deals with high-level design issues, whereas computer organization deals with low-level design issues.
Why Parallel Architecture?
Application Trends
Scientific and Engineering Computing
Commercial Computing
Five Generations of Electronic Computers
Computing problems
• Numerical computing - For numerical problems in science and technology, the
solutions demand complex mathematical formulations and intensive integer or
floating-point computations.
• Transaction processing - For alphanumerical problems in business and government,
the solutions demand accurate transactions, large database management, and
information retrieval operations.
• Logical reasoning - For artificial intelligence (AI) problems, the solutions demand logic
inferences and symbolic manipulations.
Algorithms and Data Structures
• Special algorithms and data structures are needed to specify the computations and
communications involved in computing problems.
• Most numerical algorithms are deterministic, using regularly structured data.
Hardware Resources
• Solving a computing problem requires coordinated efforts by hardware resources, an
operating system, and application software.
• The hardware core of a computer system consists of processors, memory, and peripheral devices.
• Special hardware interfaces are often built into I/O devices such as terminals,
workstations, optical page scanners, magnetic-ink character recognizers, modems,
file servers, voice data entry, printers, and plotters. These peripherals are connected
to mainframe computers directly or through local- or wide-area networks.
Operating System
• An effective operating system manages the allocation and deallocation of resources
during the execution of user programs.
• Beyond the OS, application software must be developed to benefit the users.
Mapping
• The mapping of algorithms and data structures onto the machine architecture is a
bidirectional process: algorithmic structure is matched with hardware architecture, and vice versa.
• Efficient mapping benefits the programmer and produces better source code.
• It includes processor scheduling, memory maps, interprocessor communications, etc.
System Software Support
• The source code written in a high-level language (HLL) must first be translated into
object code by an optimizing compiler.
• The compiler assigns variables to registers or to memory words and reserves
functional units for operators.
• An assembler is used to translate the compiled object code into machine code which
can be recognized by the machine hardware.
• A loader is used to initiate the program execution through the OS kernel.
Flynn's Classical Taxonomy
This taxonomy distinguishes multiprocessor computer architectures
according to two independent dimensions: the instruction stream and the
data stream.
An instruction stream is a sequence of instructions executed by the
machine.
A data stream is a sequence of data, including input and partial or
temporary results, used by the instruction stream.
Each of these dimensions can have only one of two possible states:
Single or Multiple.
Flynn’s taxonomy
Classification based on memory arrangement
Classification based on communication
Classification based on the kind of parallelism
• Data-parallel
• Function-parallel
Flynn’s Taxonomy
– The most universally accepted method of classifying computer
systems
– Published in the Proceedings of the IEEE in 1966
Any computer can be placed in one of 4 broad categories:
» SISD: Single instruction stream, single data stream
» SIMD: Single instruction stream, multiple data streams
» MIMD: Multiple instruction streams, multiple data streams
» MISD: Multiple instruction streams, single data stream
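Since the four categories are just the cross product of the two binary dimensions, they can be enumerated mechanically. A small illustrative sketch:

```python
from itertools import product

# The two independent dimensions, each with two possible states.
streams = ["Single", "Multiple"]

# Every (instruction stream, data stream) combination yields one class.
classes = [f"{i[0]}I{d[0]}D" for i, d in product(streams, streams)]
print(classes)  # ['SISD', 'SIMD', 'MISD', 'MIMD']
```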
Flynn’s Taxonomy….
• Two types of information flow into a processor:
- instructions and data
• Instruction stream is defined as the sequence of instructions
performed by the processing unit.
• Data stream is defined as the data traffic exchanged between the
memory and the processing unit.
• According to Flynn’s classification, either of the instruction or data
streams can be single or multiple.
SISD - Single Instruction stream, Single Data stream
[Figure: SISD organization — the Control Unit sends a single instruction stream (IS) to one Processing Element (PE), which exchanges a single data stream (DS) with main memory (M).]
SIMD - Single Instruction stream, Multiple Data streams
Applications:
• Image processing
• Matrix manipulations
• Sorting
• E.g., vector computers
A type of parallel computer
Single instruction: all processing units execute the same instruction, issued by
the control unit, at any given clock cycle; multiple processors execute
instructions given by one control unit.
Multiple data: each processing unit can operate on a different data element; as
shown in the figure below, the processors are connected to a shared memory or
interconnection network that supplies multiple data streams to the processing units.
Thus a single instruction is executed by different processing units on different sets of
data.
SIMD Architectures
Fine-grained
• Image-processing applications
• Large number of PEs
• Minimum-complexity PEs
• Programming language is a simple extension of a sequential language
Coarse-grained
• Each PE is of higher complexity and is usually built with commercial devices
• Each PE has local memory
MIMD - Multiple Instruction streams, Multiple Data streams
Applications:
• Parallel computers
• Shared Memory
Multiple instruction: every processor may be executing a different
instruction stream.
Multiple data: every processor may be working with a different data
stream; as shown in the figure, the multiple data streams are provided by
shared memory.
MIMD systems can be categorized as loosely coupled or tightly coupled,
depending on the sharing of data and control.
Execution can be synchronous or asynchronous, deterministic or
nondeterministic.
Different processors each work on a different task.
MISD -Multiple Instruction streams, Single Data stream
Applications:
• Classification
• Robot vision
• Systolic arrays for pipelined execution of specific algorithms
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via an
independent instruction stream; as shown in the figure, the single data
stream is forwarded to different processing units, each connected to its
own control unit and executing the instructions that control unit issues.
Thus in these computers the same data flow through a linear array of
processors executing different instruction streams.
This architecture is also known as a systolic array, used for pipelined
execution of specific algorithms.
Flynn’s taxonomy
Advantages of Flynn
» Universally accepted
» Compact Notation
» Easy to classify a system
Disadvantages of Flynn
» Very coarse-grain differentiation among machine systems
» Comparison of different systems is limited
» Interconnections, I/O, memory not considered in the scheme
High Performance Computing Applications
PERFORMANCE FACTORS
Processor cycle time (t, in nanoseconds) - the CPU is driven by a clock with
a constant cycle time (usually measured in nanoseconds), which controls the
rate of internal operations in the CPU.
Clock rate (f = 1/t, f in megahertz) - the inverse of the cycle time. A shorter
clock cycle time, or equivalently a larger number of cycles per second, implies that
more operations can be performed per unit time.
Instruction count (Ic) - the number of machine instructions to be executed
by the program; it determines the size of the program. Different machine
instructions require different numbers of clock cycles to execute.
Average CPI (Cycles Per Instruction) - CPI is important for measuring the
execution time of instructions. The average CPI can be determined for a
particular processor if we know the frequency of occurrence of each
instruction type.
The term CPI is used with respect to a particular instruction set and a given
program mix.
PERFORMANCE FACTORS
CPU time (T = Ic x CPI x t) – the CPU time required to execute a program
containing Ic instructions. Each instruction must be fetched from memory,
decoded, its operands fetched from memory, the instruction executed, and the
results stored.
Decomposing CPI into processor and memory cycles gives T = Ic x (p + m x k) x t,
where p is the number of processor cycles needed for instruction decode and
execution, m is the number of memory references needed, k is the ratio between the
memory cycle time and the processor cycle time, Ic is the instruction count, and t is
the processor cycle time.
System Attributes
The five performance factors (Ic, p, m, k, t) are influenced by four system
attributes:

System Attribute                  Ic   p   m   k   t
Instruction-set architecture       X   X
Compiler technology                X   X   X
CPU implementation & control           X           X
Cache & memory hierarchy                       X   X
• The instruction-set architecture affects the program length (Ic) and the processor cycles per instruction (p).
• Compiler design affects the values of Ic, p, and m.
• The CPU implementation and control determine the total processor time = p x t.
• The memory technology and hierarchy design affect the memory access time = k x t.
SYSTEM ATTRIBUTES
MIPS Rate - Let C be the total number of clock cycles needed to execute a
given program. Then the total CPU time can be estimated as T = C x t = C/f.
Furthermore, CPI = C/Ic, so T = Ic x CPI x t = Ic x CPI / f.
The processor speed is measured in millions of instructions per second
(MIPS). The MIPS rate varies with a number of factors, including the
clock rate, the instruction count (Ic), and the CPI of a given machine.
MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = (f x Ic) / (C x 10^6)
The MIPS rate is directly proportional to the clock rate and inversely proportional to the CPI.
CPU time, T = Ic / (MIPS x 10^6)
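These relationships can be checked numerically. The 500 MHz clock, CPI of 2, and instruction count below are hypothetical values used only to exercise the formulas:

```python
def mips_rate(f_hz, cpi):
    # MIPS = f / (CPI x 10^6)
    return f_hz / (cpi * 1e6)

def cpu_time(ic, mips):
    # T = Ic / (MIPS x 10^6)
    return ic / (mips * 1e6)

f = 500e6     # 500 MHz clock (hypothetical)
cpi = 2.0
print(mips_rate(f, cpi))       # 250.0 MIPS
print(cpu_time(100_000, 250))  # 0.0004 seconds
```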
Throughput Rate
System throughput, Ws (in programs/second) - how many programs a
system can execute per unit time, measured across a large number of
programs over a long observation period.
CPU throughput, Wp (in programs/second) - in a multiprogrammed system,
how many programs can be executed per unit time, based on the MIPS rate and
the average program length, Ic:
Wp = f / (Ic x CPI) = (MIPS x 10^6) / Ic
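A quick numeric check of the CPU-throughput formula, using hypothetical values (500 MHz clock, CPI = 2, programs averaging 100,000 instructions):

```python
def cpu_throughput(f_hz, ic, cpi):
    # Wp = f / (Ic x CPI), in programs per second
    return f_hz / (ic * cpi)

print(cpu_throughput(500e6, 100_000, 2.0))  # 2500.0 programs/second
```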
Example:
Now the task given in the previous example is executed on a FOUR-processor
system with shared memory. Due to the need for synchronization among the
FOUR program parts, 2000 extra instructions are added to each part.
– Calculate the average CPI?
– Determine the corresponding MIPS rate?
– Calculate the speedup factor of the FOUR-processor system?
– Calculate the efficiency of the FOUR-processor system?
– Show the interconnection network of this system?
Solution:
Average CPI = 2 cycles/instruction.
MIPS rate = (4 x 500 MHz) / 2 = 1000
Speedup = T1 / T4
T1 = Ic / (MIPS x 10^6) = 100000 / (250 x 10^6) = 0.400 ms
T4 = Ic4 / (MIPS x 10^6) = (100000 + 4 x 2000) / (1000 x 10^6) = 0.108 ms
Speedup = 0.400 / 0.108 = 3.704
Efficiency = Speedup / #Processors = 3.704 / 4 = 92.59%
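The solution can be verified step by step, using the example's figures (500 MHz clock, average CPI of 2, Ic = 100,000, 2000 synchronization instructions per part):

```python
f = 500e6       # clock rate
cpi = 2.0       # average CPI
ic = 100_000    # single-processor instruction count

mips_1 = f / (cpi * 1e6)         # 250 MIPS on one processor
mips_4 = 4 * f / (cpi * 1e6)     # 1000 MIPS aggregate on four processors

t1 = ic / (mips_1 * 1e6)         # 0.400 ms serial time
ic4 = ic + 4 * 2000              # 2000 extra sync instructions per part
t4 = ic4 / (mips_4 * 1e6)        # 0.108 ms parallel time

speedup = t1 / t4
efficiency = speedup / 4
print(round(speedup, 3), round(efficiency, 4))  # 3.704 0.9259
```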
• For CPU design:
CPI = Σi (CPIi x Ici) / Ic
where
CPIi represents the average number of clock cycles for instruction type (i), and
Ici represents the number of times instruction type (i) is executed in a program.
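The weighted-CPI computation can be sketched directly; the instruction mix below is hypothetical, chosen only to illustrate the formula:

```python
def average_cpi(mix):
    """mix: list of (CPI_i, Ic_i) pairs.
    Average CPI = sum(CPI_i * Ic_i) / sum(Ic_i)."""
    total_instr = sum(ic for _, ic in mix)
    total_cycles = sum(cpi * ic for cpi, ic in mix)
    return total_cycles / total_instr

# Hypothetical mix: 60,000 ALU ops at 1 cycle, 30,000 loads at 2, 10,000 branches at 3.
print(average_cpi([(1, 60_000), (2, 30_000), (3, 10_000)]))  # 1.5
```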
Example
Suppose you have made the following measurements:
– Frequency of FP operations (other than FPSQR) = 25%
– Average CPI of FP operations = 4
– Average CPI of other operations = 1.33
– Frequency of FPSQR = 2%
– CPI of FPSQR = 20
• Assume that the TWO design alternatives are to decrease the CPI of FPSQR to 2, or to
decrease the average CPI of all FP operations to 2.5. Compare these two design
alternatives.
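Taking the stated frequencies literally (25% FP operations other than FPSQR, 2% FPSQR, and the remaining 73% other operations), and assuming the second alternative's "all FP operations" includes FPSQR, the comparison can be sketched as:

```python
def avg_cpi(mix):
    # mix: (frequency, CPI) pairs; frequencies sum to 1.0
    return sum(f * c for f, c in mix)

base = avg_cpi([(0.25, 4.0), (0.02, 20.0), (0.73, 1.33)])
alt1 = avg_cpi([(0.25, 4.0), (0.02, 2.0), (0.73, 1.33)])  # FPSQR CPI 20 -> 2
alt2 = avg_cpi([(0.27, 2.5), (0.73, 1.33)])               # all FP CPI -> 2.5

print(round(base, 4), round(alt1, 4), round(alt2, 4))  # 2.3709 2.0109 1.6459
print(round(base / alt1, 3), round(base / alt2, 3))    # speedups vs. the base design
```

Under these assumptions the second alternative (all FP operations at CPI 2.5) yields the lower average CPI and hence the larger speedup.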
Amdahl’s Law
A program (or algorithm) which can be parallelized can be split up into
two parts:
A part which cannot be parallelized and
A part which can be parallelized
E.g.:
Imagine a program that processes files from disk. A small part of that
program may scan the directory and create a list of files internally in
memory. After that, each file is passed to a separate thread for
processing. The part that scans the directory and creates the file list
cannot be parallelized, but processing the files can be done in parallel.
Total time taken to execute the program only serially is called T.
The time T includes the time of both the non-parallelizable and
parallelizable parts.
T = Total time of serial execution
B = Total time of non- parallelizable part
T - B = Total time of parallelizable part (when executed serially, not in
parallel)
First of all, a program can be broken up into a non-parallelizable part B and a
parallelizable part T - B (as a fraction of the total time, 1 - B/T), as illustrated in the
diagram: the line with the delimiters at the top is the total execution time T(1).
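The quantities above give Amdahl's law directly: with N processors the execution time becomes T(N) = B + (T - B)/N, so the speedup is T / T(N). A minimal sketch, assuming a hypothetical program whose serial part is 10% of the total:

```python
def amdahl_time(t_total, b, n):
    """Execution time on n processors: the serial part b plus the
    parallelizable part (t_total - b) divided across n processors."""
    return b + (t_total - b) / n

t, b = 1.0, 0.1   # hypothetical: 10% of the work is serial
for n in (1, 2, 4, 1000):
    print(n, round(t / amdahl_time(t, b, n), 3))
# 1 1.0
# 2 1.818
# 4 3.077
# 1000 9.911
```

However many processors are added, the speedup is bounded by T/B (here 10), which is the essence of Amdahl's law.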
[Figure: shared-memory multiprocessor organizations — processors P1…Pn connect through an interconnection network to shared memory and I/O devices I/O1…I/On; in the distributed variant, each processing element PE1…PEn has a local memory M1…Mn.]
Shared-memory multiprocessors
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
Cache-only Memory Architecture (COMA)
Uniform Memory Access (UMA) Model
Most commonly represented today by Symmetric
Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA. Cache
coherent means if one processor updates a location in shared
memory, all the other processors know about the update.
Cache coherency is accomplished at the hardware level.
Non-Uniform Memory Access (NUMA) Model
The access time varies with the location of the memory word; memory access
across the link is slower.
All local memories form a global address space accessible by all processors.
COMA - Cache-only Memory Architecture
The COMA model is a special case of NUMA machine in which the distributed main
memories are converted to caches.
All caches form a global address space and there is no memory hierarchy at each
processor node.
Advantages:
• The global address space provides a user-friendly programming perspective on
memory.
• Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs.
Disadvantages:
• Lack of scalability between memory and CPUs, due to geometrically increasing
traffic.
• Synchronization constructs are needed to ensure "correct" access to global memory,
and they become increasingly difficult and expensive.
Distributed memory multicomputers
The system consists of multiple computers (nodes) connected by a message-passing network.
Local memories are private, each holding its own program and data, and
are accessible only by the local processor; for this reason traditional
multicomputers have been called
NO-Remote-Memory-Access (NORMA) machines.
Changes a processor makes to its local memory have no effect on the
memory of other processors, so the concept of cache coherency does
not apply.
Memory addresses in one processor do not map to another
processor, so there is no concept of a global address space across
all processors.
Because there is no memory contention, the number of processors can be very
large.
The processors are connected by communication lines, and the
precise way in which the lines are connected is called the interconnection topology.
Modern multicomputers use hardware routers to pass
messages.
Based on the interconnection, routers, and channels used,
multicomputers are divided into generations:
1st generation: based on board technology, using a hypercube
architecture and software-controlled message switching.
E.g., the Caltech Cosmic Cube.
2nd generation: implemented with a mesh-connected
architecture, hardware message routing, and a software
environment for medium-grained distributed computing.
E.g., the Intel Paragon.