
Introduction

MOD-1
Prerequisite
What is Computer Architecture?
Basic Architecture of a Computer?
What is Computer Organization?
Difference between CO and CA?
Computer Architecture
Computer architecture is a functional description of the requirements and the design implementation of the various parts of a computer.
It deals with the functional behaviour of the computer system.
Architecture describes what the computer does.

Computer Organization:
Computer organization is how the operational attributes are linked together and contribute to realising the architectural specification.
Organization describes how the computer does it.
Sl.no | COMPUTER ARCHITECTURE | COMPUTER ORGANIZATION
1 | Architecture describes what the computer does. | Organization describes how it does it.
2 | Deals with the functional behaviour of the computer system. | Deals with the structural relationships.
3 | Deals with high-level design issues. | Deals with low-level design issues.
4 | Architecture indicates its hardware. | Organization indicates its performance.
5 | For designing a computer, its architecture is fixed first. | For designing a computer, the organization is decided after its architecture.
6 | Computer architecture is also called instruction set architecture. | Computer organization is frequently called microarchitecture.
Parallel Architecture
 Parallel computer architecture is the method of organizing all the resources to maximize performance and programmability within the limits given by technology and cost at any instant of time.
Why Parallel Architecture?

 Parallel computer architecture adds a new dimension to the development of computer systems by using more and more processors.

Application Trends
 Scientific and Engineering Computing

 Commercial Computing
Five Generations of Electronic Computers
Computing problems
• Numerical computing - For numerical problems in science and technology, the
solutions demand complex mathematical formulations and intensive integer or
floating-point computations.
• Transaction processing - For alphanumerical problems in business and government,
the solutions demand accurate transactions, large database management, and
information retrieval operations.
• Logical reasoning - For artificial intelligence (AI) problems, the solutions demand logic
inferences and symbolic manipulations.
 Algorithms and Data Structures
• Special algorithms and data structures are needed to specify the computations and
communications involved in computing problems.
• Most numerical algorithms are deterministic, using regularly structured data.
 Hardware Resources
• Solving a computing problem requires coordinated efforts by hardware resources, an operating system, and application software.
•Hardware core of a computer system - Processors, memory, and peripheral devices
•Special hardware interfaces are often built into I/O devices, such as terminals,
workstations, optical page scanners, magnetic ink character recognizers, modems,
file servers, voice data entry, printers, and plotters. These peripherals are connected
to mainframe computers directly or through local or wide-area networks.
 Operating System
• An effective operating system manages the allocation and deallocation of resources
during the execution of user programs.
• Beyond the OS, application software must be developed to benefit the users.
 Mapping
• Mapping of algorithmic and data structures onto the machine architecture is a bidirectional process, matching algorithmic structure with hardware architecture and vice versa.
• Efficient mapping will benefit the programmer and produce better source codes.
• It includes processor scheduling, memory maps, interprocessor communications, etc
 System Software Support
• The source code written in a HLL must be first translated into object code by an
optimizing compiler.
• The compiler assigns variables to registers or to memory words and reserves
functional units for operators.
• An assembler is used to translate the compiled object code into machine code which
can be recognized by the machine hardware.
• A loader is used to initiate the program execution through the OS kernel.
Flynn's Classical Taxonomy
 This taxonomy distinguishes multi-processor computer architectures according to two independent dimensions: the instruction stream and the data stream.
 An instruction stream is a sequence of instructions executed by the machine.
 A data stream is a sequence of data, including input and partial or temporary results, used by the instruction stream.
 Each of these dimensions can have only one of two possible states: Single or Multiple.
 Flynn’s taxonomy
Classification based on memory arrangement
Classification based on communication
Classification based on the kind of parallelism
• Data-parallel
• Function-parallel

Flynn’s Taxonomy
– The most universally accepted method of classifying computer systems
– Published in the Proceedings of the IEEE in 1966
 Any computer can be placed in one of 4 broad categories:
» SISD: Single instruction stream, single data stream
» SIMD: Single instruction stream, multiple data streams
» MIMD: Multiple instruction streams, multiple data streams
» MISD: Multiple instruction streams, single data stream

Flynn’s Taxonomy….
• Two types of information flow into a processor:
- instructions and data
• Instruction stream is defined as the sequence of instructions
performed by the processing unit.
• Data stream is defined as the data traffic exchanged between the
memory and the processing unit.
• According to Flynn’s classification, either of the instruction or data
streams can be single or multiple.

SISD - Single Instruction stream, Single Data stream

[Figure: SISD organization — a single control unit (CU) issues one instruction stream (IS) to a single processing element (PE), which exchanges a single data stream (DS) with main memory (M).]
SIMD - Single Instruction stream, Multiple Data streams

Applications:
• Image processing
• Matrix manipulations
• Sorting
• Eg: vector computers
 A type of parallel computer.
 Single instruction: all processing units execute the same instruction, issued by one control unit, at any given clock cycle.
 Multiple data: each processing unit can operate on a different data element; as shown in the figure below, the processors are connected to a shared memory or an interconnection network that provides multiple data streams to the processing units.
 A single instruction is thus executed by different processing units on different sets of data.
SIMD Architectures
Fine-grained
• Image processing applications
• Large number of PEs
• Minimum-complexity PEs
• Programming language is a simple extension of a sequential language
Coarse-grained
• Each PE is of higher complexity and is usually built with commercial devices
• Each PE has local memory
MIMD - Multiple Instruction streams, Multiple Data streams
Applications:
• Parallel computers
• Shared memory
 Multiple instruction: every processor may be executing a different instruction stream.
 Multiple data: every processor may be working with a different data stream; as shown in the figure, the multiple data streams are provided by shared memory.
 Can be categorized as loosely coupled or tightly coupled, depending on the sharing of data and control.
 Execution can be synchronous or asynchronous, deterministic or non-deterministic.
 There are different processors, each processing a different task.
MISD -Multiple Instruction streams, Single Data stream
Applications:
• Classification
• Robot vision
• Systolic arrays for pipelined execution of specific algorithms
 A single data stream is fed into multiple processing units.
 Each processing unit operates on the data independently via an independent instruction stream; as shown in the figure, a single data stream is forwarded to the different processing units, each connected to its own control unit and executing the instructions given by that control unit.
 Thus in these computers the same data flows through a linear array of processors executing different instruction streams, as shown in the figure.
 This architecture is also known as a systolic array, used for pipelined execution of specific algorithms.
Flynn’s taxonomy
Advantages of Flynn
» Universally accepted
» Compact Notation
» Easy to classify a system.

Disadvantages of Flynn
» Very coarse-grain differentiation among machine systems
» Comparison of different systems is limited
» Interconnections, I/O, memory not considered in the scheme

High Performance Computing Applications
PERFORMANCE FACTORS
 Processor cycle time (t, in nanoseconds) - the CPU is driven by a clock with a constant cycle time (usually measured in nanoseconds), which controls the rate of internal operations in the CPU.
 Clock rate (f = 1/t, in megahertz) - the inverse of the cycle time. A shorter clock cycle time, or equivalently a larger number of cycles per second, implies that more operations can be performed per unit time.
 Instruction count (Ic)- the number of machine instructions to be executed
by the program. Determines the size of the program. Different machine
instructions require different numbers of clock cycles to execute.
 Average CPI (Cycles Per Instruction)- CPI is important to measure the
execution time of an instruction. Average CPI can be determined for a
particular processor if we know the frequency of occurrence of each
instruction type.
The term CPI is used with respect to a particular instruction set and a given
program mix.
PERFORMANCE FACTORS
CPU time (T = Ic x CPI x t) - the CPU time required to execute a program containing Ic instructions. Each instruction must be fetched from memory and decoded, then operands are fetched from memory, the instruction is executed, and the results are stored.

 Memory cycle time (k x t) - the time required to access memory, usually k times the processor cycle time t. The value of k depends on the memory technology and the processor-memory interconnection scheme. The instruction cycle may involve k memory references (e.g., k = 4: one for instruction fetch, two for operand fetches, and one for storing the result).
 CPI (= p + m x k) - the processor cycles required for each instruction can be attributed to cycles needed for instruction decode and execution (processor cycles, p) and cycles needed for memory references (memory cycles = m x k).
 Total CPU time - the effective CPU time needed to execute a program can be rewritten as

T = Ic x (p + m x k) x t

where p is the number of processor cycles needed for instruction decode and execution, m is the number of memory references needed per instruction, k is the ratio between the memory cycle time and the processor cycle time, Ic is the instruction count, and t is the processor cycle time.
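As a rough sanity check of these relations, here is a minimal Python sketch; the instruction count, cycle counts and clock values below are invented for illustration and are not taken from these notes.

```python
# Minimal sketch of the relations T = Ic x CPI x t and T = Ic x (p + m x k) x t.
# All of the numbers below are illustrative assumptions.

def cpu_time(ic, p, m, k, t_ns):
    """CPU time in seconds for a program of ic instructions.

    p    : processor cycles per instruction (decode + execute)
    m    : memory references per instruction
    k    : ratio of memory-cycle time to processor-cycle time
    t_ns : processor cycle time in nanoseconds
    """
    cpi = p + m * k                   # effective cycles per instruction
    return ic * cpi * t_ns * 1e-9

# Example: 1 million instructions, p = 2, one memory reference per instruction
# on average, memory four times slower than the CPU, 2 ns cycle (500 MHz clock).
print(cpu_time(ic=1_000_000, p=2, m=1, k=4, t_ns=2))   # -> 0.012 seconds
```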
System Attributes
The five performance factors (Ic, p, m, k, t) are influenced by four system
attributes
System attribute                 Performance factors affected
Instruction set architecture     Ic, p
Compiler technology              Ic, p, m
CPU implementation & control     p, t
Cache & memory hierarchy         k, t

• The instruction set architecture affects the program length (Ic) and the processor cycles per instruction (p).
• Compiler technology affects the values of Ic, p and m.
• The CPU implementation and control determine the total processor time = p x t.
• The memory technology and hierarchy design affect the memory access time = k x t.
SYSTEM ATTRIBUTES

 MIPS rate - Let C be the total number of clock cycles needed to execute a given program. Then the total CPU time can be estimated as T = C x t = C/f.
Furthermore, CPI = C/Ic, so T = Ic x CPI x t = (Ic x CPI)/f.
The processor speed is measured in terms of million instructions per second (MIPS). The MIPS rate varies with respect to a number of factors, including the clock rate, the instruction count (Ic), and the CPI of a given machine.
MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = (f x Ic) / (C x 10^6)
The MIPS rate is directly proportional to the clock rate and inversely proportional to the CPI.
CPU time, T = Ic / (MIPS rate x 10^6)
 Throughput rate
System throughput, Ws (in programs/second) - how many programs a system can execute per unit time. It is measured across a large number of programs over a long observation period.
CPU throughput, Wp (in programs/second) - in a multiprogrammed system, how many programs can be executed per unit time, based on the MIPS rate and the average program length Ic.
Wp = f / (Ic x CPI) = (MIPS rate x 10^6) / Ic

In a multiprogrammed system, Ws < Wp, due to the additional system overheads caused by the I/O, compiler and OS when multiple programs are interleaved for CPU execution by multiprogramming or time-sharing operation. If the CPU is kept busy in a perfect program-interleaving fashion, Ws = Wp.
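A small Python sketch of the MIPS-rate and CPU-throughput relations above; the clock rate, CPI and program length are illustrative assumptions, not values from these notes.

```python
# Sketch of the relations above: MIPS rate = f / (CPI x 10^6) and
# Wp = f / (Ic x CPI). Clock rate, CPI and program length are assumptions.

def mips_rate(f_hz, cpi):
    """Million instructions per second for clock rate f_hz and average CPI."""
    return f_hz / (cpi * 1e6)

def cpu_throughput(f_hz, ic, cpi):
    """CPU throughput Wp in programs/second for average program length ic."""
    return f_hz / (ic * cpi)

f, cpi, ic = 500e6, 2.0, 100_000      # 500 MHz, CPI = 2, 100,000 instructions
print(mips_rate(f, cpi))              # -> 250.0 MIPS
print(cpu_throughput(f, ic, cpi))     # -> 2500.0 programs/second
```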
 Floating-point operations per second (FLOPS) - computation-intensive applications in science and engineering measure performance in flops: megaflops (10^6), gigaflops (10^9), teraflops (10^12), petaflops (10^15), etc.
 Speed or throughput (W/Tn) - the execution rate on an n-processor system, measured in FLOPs per unit time or instructions per unit time.
 Speedup (Sn = T1/Tn) - how much faster an actual machine with n processors performs the workload compared to one processor. The ratio T1/T∞ is the asymptotic speedup.
 Efficiency (En = Sn/n) - fraction of the theoretical maximum speedup
achieved by n processors. Efficiency is a measure of the fraction of time for
which a PE is usefully employed. In an ideal parallel system efficiency is
equal to one. In practice, efficiency is between zero and one.
 Degree of Parallelism (DOP) - for a given piece of the workload, the
number of processors that can be kept busy sharing that piece of
computation equally. Neglecting overhead, we assume that if k processors
work together on any workload, the workload gets done k times as fast as a
sequential execution.
Performance

 For some program running on machine A,


Performance of A, Perf(A) = 1 / ExecTime(A)
 “A is n times faster than B" iff

Perf(A) / Perf(B) = ExecTime(B) / ExecTime(A) = n

 “A is X% faster than B“ iff

Perf(A) / Perf(B) = ExecTime(B) / ExecTime(A) = 1 + X/100


CPU Performance Equation:

CPU time, T = Ic x CPI x t = (Ic x CPI) / f
Example:
Now, when the task given in the previous example is executed on a FOUR-
processor system with shared memory. Due to the need for synchronization
among the FOUR program parts, 2000 extra instructions are added to each part.
– Calculate the average CPI?
– Determine the corresponding MIPS rate?
– Calculate the speedup factor of the FOUR-processor system?
– Calculate the efficiency of the FOUR-processor system?
– Show the interconnection network of this system?
Solution:
Average CPI = 2 cycles/instruction.
MIPS rate = (4 x 500 MHz) / 2 = 1000
Speedup = T1/T4
T1 = Ic / (MIPS x 10^6) = 100,000 / (250 x 10^6) = 0.400 msec
T4 = Ic / (MIPS x 10^6) = (100,000 + 4 x 2,000) / (1000 x 10^6) = 0.108 msec
Speedup = 0.400 / 0.108 = 3.70
Efficiency = Speedup / #Processors = 3.70 / 4 = 92.6%
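The arithmetic of this solution can be re-checked with a short Python sketch, using the figures the solution above assumes (Ic = 100,000, f = 500 MHz, CPI = 2, 2,000 extra instructions per part):

```python
# Re-checking the four-processor example with the figures assumed above.
f = 500e6                      # clock rate per processor (Hz)
cpi = 2                        # average cycles per instruction
ic = 100_000                   # instruction count of the original program

mips_1 = f / (cpi * 1e6)               # single processor: 250 MIPS
mips_4 = 4 * f / (cpi * 1e6)           # four processors: 1000 MIPS
ic_4 = ic + 4 * 2000                   # 2000 synchronization instructions per part

t1 = ic / (mips_1 * 1e6)               # 0.400 ms
t4 = ic_4 / (mips_4 * 1e6)             # 0.108 ms
speedup = t1 / t4                      # ~3.70
efficiency = speedup / 4               # ~0.926

print(t1 * 1e3, t4 * 1e3, speedup, efficiency)
```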
• For CPU design, the overall CPI is given by the weighted average over the instruction mix:

CPI = [ Σ (CPIi x Ici) ] / Ic, where Ic = Σ Ici

where
CPIi: the average number of clock cycles for instruction type (i).
Ici: the number of times instruction type (i) is executed in the program.
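A minimal sketch of this weighted-average CPI calculation; the instruction mix below is purely an invented example, not data from these notes.

```python
# Overall CPI as a weighted average: CPI = sum(CPI_i x Ic_i) / sum(Ic_i).
# The instruction mix below is an invented example.
mix = {
    "ALU":    {"cpi": 1, "count": 50_000},
    "load":   {"cpi": 2, "count": 25_000},
    "store":  {"cpi": 2, "count": 10_000},
    "branch": {"cpi": 3, "count": 15_000},
}

ic = sum(v["count"] for v in mix.values())
overall_cpi = sum(v["cpi"] * v["count"] for v in mix.values()) / ic
print(ic, overall_cpi)    # -> 100000 instructions, overall CPI = 1.65
```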
Example
Suppose you have made the following measurements:
– Frequency of FP operations (other than FPSQR) = 25%
– Average CPI of FP operations = 4
– Average CPI of other operations = 1.33
– Frequency of FPSQR = 2%
– CPI of FPSQR = 20
• Assume that the TWO design alternatives are to decrease the CPI of FPSQR to 2, or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives.
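One way to compare the two alternatives is to compute the overall CPI in each case. The sketch below follows the classic textbook treatment of this example and assumes the 25% FP frequency includes the 2% FPSQR instructions:

```python
# Comparing the two alternatives by overall CPI. Assumption (as in the classic
# textbook version of this example): the 25% FP frequency includes FPSQR.
freq_fp, cpi_fp       = 0.25, 4.0
freq_fpsqr, cpi_fpsqr = 0.02, 20.0
cpi_other             = 1.33

cpi_original = freq_fp * cpi_fp + (1 - freq_fp) * cpi_other       # ~2.00

# Alternative 1: decrease the CPI of FPSQR from 20 to 2.
cpi_alt1 = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)          # ~1.64

# Alternative 2: decrease the average CPI of all FP operations to 2.5.
cpi_alt2 = (1 - freq_fp) * cpi_other + freq_fp * 2.5              # ~1.62

print(cpi_original, cpi_alt1, cpi_alt2)
```

Under these assumptions the lower overall CPI comes from decreasing the CPI of all FP operations, consistent with the remark later in these notes that improving the FP operations overall is slightly better because of their higher frequency.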
Amdahl’s Law
A program (or algorithm) which can be parallelized can be split up into
two parts:
A part which cannot be parallelized and
 A part which can be parallelized
Eg:
Imagine a program that processes files from disk. A small part of that
program may scan the directory and create a list of files internally in
memory. After that, each file is passed to a separate thread for
processing. The part that scans the directory and creates the file list
cannot be parallelized, but processing the files can be done in parallel.
Total time taken to execute the program only serially is called T.
The time T includes the time of both the non-parallelizable and
parallelizable parts.
T = Total time of serial execution
B = Total time of the non-parallelizable part
T - B = Total time of the parallelizable part (when executed serially, not in parallel)
First of all, a program can be broken up into a non-parallelizable part B and a parallelizable part T - B, as illustrated by this diagram:
The line with the delimiters at the top is the total execution time T(1).

With a parallelization factor of n, the execution time becomes T(n) = B + (T - B)/n. The diagrams show this for parallelization factors of 2 and 3:
T(2) = B + (T - B)/2
T(3) = B + (T - B)/3
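A minimal Python sketch of this relation; the total and serial times below are invented for illustration.

```python
# Execution time under Amdahl's law: T(n) = B + (T - B) / n,
# where B is the non-parallelizable time. The timings are illustrative.
def parallel_time(total, serial, n):
    """Execution time on n processors for a program whose serial run takes
    `total` time units, of which `serial` units cannot be parallelized."""
    return serial + (total - serial) / n

T, B = 100.0, 10.0                     # e.g. 100 s total, 10 s serial-only
for n in (1, 2, 3, 16):
    tn = parallel_time(T, B, n)
    print(n, tn, T / tn)               # processors, time, speedup
```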
Example: 1
Suppose that a calculation has a 4% serial portion,
a) What is the limit of speedup on 16 processors?
b) What is the maximum speedup?
Ans:
a) Limit of speedup on 16 processors = 16 / (1 + (16 – 1) x 0.04) = 10
b) The maximum speedup = 1/α = 1/0.04 = 25
Example: 2
If 90% of a calculation can be parallelized, then what is the maximum
speed-up which can be achieved on 5 processors?
Ans: S(n) = n / (1 + (n – 1) x α), where α = 1 - 0.9 = 0.1 is the sequential fraction
= 5 / (1 + (5 – 1) x 0.10) = 3.57
(the program can theoretically run 3.57 times faster on five processors than on one)
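Both answers can be re-derived with the same formula, S(n) = n / (1 + (n – 1) x α):

```python
# Amdahl speedup with serial fraction alpha: S(n) = n / (1 + (n - 1) * alpha).
def speedup(n, alpha):
    return n / (1 + (n - 1) * alpha)

print(speedup(16, 0.04))   # Example 1(a): 10.0
print(1 / 0.04)            # Example 1(b): maximum speedup = 25.0
print(speedup(5, 0.10))    # Example 2:    ~3.57
```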
Amdahl’s Law
The performance gain that can be obtained by improving
some portion of a computer can be calculated using Amdahl’s
law.
Amdahl’s law states that the performance improvement to be
gained from using some faster mode of execution is limited
by the fraction of the time the faster mode can be used.
The law defines the term ‘speedup’.
Amdahl’s Law
 The speedup or improvement gained by the enhanced
execution mode, that is, how much faster the task would run
if the enhanced mode were used for the entire program –
this value is the time of the original mode over the time of
the enhanced mode.
For example, if the enhanced mode takes 2 sec for a portion of the program, while the original mode takes 5 sec, the speedup of the enhancement is 5/2 = 2.5.
 This is called Speedup_enhanced, which is always greater than 1.
Corollary of Amdahl’s law:
If an enhancement is only usable for a fraction of a task then we
can’t speed up the task by more than the reciprocal of 1 minus
that fraction.
Amdahl’s Law
The execution time using the original computer with the enhanced mode will be the time spent using the unenhanced portion of the computer plus the time spent using the enhancement:

ExecTime_new = ExecTime_old x [ (1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Speedup_overall = ExecTime_old / ExecTime_new = 1 / [ (1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
Amdahl’s Law
If three different enhancements use fractions of time f1, f2 and f3 respectively, and have individual speedups S1, S2 and S3 respectively, the overall speedup is

Speedup_overall = 1 / [ (1 – f1 – f2 – f3) + f1/S1 + f2/S2 + f3/S3 ]
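A small sketch of this multi-enhancement form; the usage line plugs in the values from the worked example further below (S1 = 30, S2 = 20, S3 = 15, f1 = f2 = 0.25, f3 ≈ 0.45), so it should reproduce an overall speedup of about 10.

```python
# Overall speedup with several enhancements, each usable for a fraction f_i of
# the time and giving an individual speedup s_i:
#   S = 1 / ((1 - sum(f_i)) + sum(f_i / s_i))
def overall_speedup(fractions, speedups):
    unenhanced = 1 - sum(fractions)
    enhanced = sum(f / s for f, s in zip(fractions, speedups))
    return 1 / (unenhanced + enhanced)

# Values from the worked example below (f3 = 25.25/56, i.e. about 0.45):
print(overall_speedup([0.25, 0.25, 25.25 / 56], [30, 20, 15]))   # -> ~10.0
```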
Applications of Amdahl’s Law
Amdahl’s law gives us a quick way to find the speedup from
some enhancement.
Amdahl’s law is particularly useful for comparing the overall
system performance of two different systems
It can also be applied to compare two processor design
alternatives, based on enhancement on the same system
(In the earlier FP example: improving the performance of all FP operations overall is slightly better than improving only FPSQR, because of the higher frequency of FP operations.)
Three enhancements with the following speedup are proposed
for a new architecture. S1=30, S2=20, S3 = 15. If enhancements
1 and 2 are each usable for 25% of the time, what fraction of the
time must enhancement 3 be used to achieve an overall speed
up of 10?

10 = 1 / [ 1 – (0.25 + 0.25 + f3) + (0.25/30 + 0.25/20 + f3/15) ]
10 = 1 / [ (0.5 – f3) + (0.5 + 0.75 + 4 f3) / 60 ]
10 = 60 / [ 30 – 60 f3 + 1.25 + 4 f3 ]
31.25 – 56 f3 = 6
– 56 f3 = 6 – 31.25
f3 = –25.25 / –56 = 0.45 = 45%
Classification based on memory arrangement

[Figure: two memory arrangements — shared-memory multiprocessors, in which processing elements PE1 … PEn and I/O units access a common shared memory through an interconnection network, and message-passing multicomputers, in which each processor P1 … Pn has its own local memory M1 … Mn and the nodes communicate through an interconnection network.]
Symmetric and Asymmetric Multiprocessors
 Symmetric:
- all processors have equal access to all peripheral devices.
- all processors are identical.
 Asymmetric:
- one processor (executive or master) executes the operating system and handles I/O
- other processors (attached) may be of different types and may be
dedicated to special tasks.

Shared-memory multiprocessors
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
Cache-only Memory Architecture (COMA)

 Memory is common to all the processors.
 Processors easily communicate by means of shared variables.
Uniform Memory Access (UMA) Model
Most commonly represented today by Symmetric
Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA. Cache
coherent means if one processor updates a location in shared
memory, all the other processors know about the update.
Cache coherency is accomplished at the hardware level.

 Tightly-coupled systems (high degree of resource sharing)
 Suitable for general-purpose and time-sharing applications by multiple users.
Non-Uniform Memory Access (NUMA) Model
 Often made by physically linking two or more SMPs

 Shared memory is distributed to local memories

 One SMP can directly access memory of another SMP

 Not all processors have equal access time to all memories

 The access time varies with the location of the memory word. Memory access
across link is slower
 All local memories form a global address space accessible by all processors

 If cache coherency is maintained, then it may also be called CC-NUMA (Cache-Coherent NUMA).

 Memory access time increases in the order: cache, local memory, global memory, remote memory.
COMA - Cache-Only Memory Architecture
The COMA model is a special case of NUMA machine in which the distributed main
memories are converted to caches.
All caches form a global address space and there is no memory hierarchy at each
processor node.
Advantages:
• Global address space provides a user-friendly programming perspective to
memory
• Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs
Disadvantages:
• Lack of scalability between memory and CPUs, due to geometrically increasing traffic on the shared memory-CPU path.
• Synchronization constructs are needed to ensure "correct" access to global memory.
• Designing machines with ever more processors becomes increasingly difficult and expensive.
Distributed memory multicomputers
 Multiple computers (nodes) connected by a message-passing network.
 Local memories are private, each with its own program and data, and are accessible only by the local processor. Traditional multicomputers have therefore been called NO-Remote-Memory-Access (NORMA) machines.
 Changes a processor makes to its local memory have no effect on the memory of other processors, so the concept of cache coherency does not apply.
 Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors.
 There is no memory contention, so the number of processors can be very large.
 The processors are connected by communication lines, and the precise way in which the lines are connected is called the interconnection topology.
 Modern multicomputers use hardware routers to pass messages.
Based on the interconnection network, routers and channels used, multicomputers are divided into generations:
1st generation: based on board technology, using hypercube architecture and software-controlled message switching. Eg: Caltech Cosmic Cube.
2nd generation: implemented with mesh-connected architecture, hardware message routing, and a software environment for medium-grained distributed computing. Eg: Intel Paragon.
3rd generation: fine-grained multicomputers like the MIT J-Machine.
• The network "fabric" used for data transfer varies widely,
though it can be as simple as Ethernet
Vector Processor
 A vector operand contains an ordered set of n elements, where n is called the length
of the vector. Each element in a vector is a scalar quantity, which may be a floating
point number, an integer, a logical value or a character.
 A vector processor consists of a scalar processor and a vector unit, which could be
thought of as an independent functional unit capable of efficient vector operations.
 Register-to-register architecture: Vector registers are used to hold the vector
operands, intermediate and final vector results . There are fixed numbers of vector
registers and functional pipelines in a vector processor
 Memory to memory architecture: uses a vector stream unit to replace the vector
registers. Vector operands and results are directly retrieved from memory in
superwords (512 bits).
 Vector hardware has the special ability to overlap or pipeline operand processing.
 Vector functional units are pipelined and fully segmented - each stage of the pipeline performs a step of the function on different operands. Once the pipeline is full, a new result is produced each clock period (cp).
 Applications -
Long Range Weather forecasting, Petroleum explorations,
Medical diagnosis, Space flight simulations
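As a software analogy (not the hardware itself), the NumPy sketch below contrasts a scalar, element-by-element loop with a single vector operation that applies the same arithmetic to all elements, much as one vector instruction operates on an entire vector register; the array length is an arbitrary assumption.

```python
import numpy as np

# Software analogy of a vector operation: one expression applies the same
# arithmetic to every element, in place of a scalar element-by-element loop.
n = 64                                  # vector length (arbitrary assumption)
a = np.arange(n, dtype=np.float64)
b = np.ones(n, dtype=np.float64)

# Scalar-processor style: one element per "instruction".
c_scalar = np.empty(n)
for i in range(n):
    c_scalar[i] = a[i] * 2.0 + b[i]

# Vector-processor style: a single vector operation over all n elements.
c_vector = a * 2.0 + b

assert np.allclose(c_scalar, c_vector)
```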
SIMD has two basic architectural organizations
a. Array processor using random access memory
b. Associative processors using content addressable memory.
 An array processor is a synchronous array of parallel processors that coordinate concurrent operations in lockstep through global clocks, central control units, or vector unit controllers.
 These processors are composed of N identical processing elements (PEs) under the supervision of one control unit (CU).
 The control unit is a computer with high-speed registers, local memory and an arithmetic logic unit.
 There are N data streams, one per processor, so different data can be used in each processor.
 There are two categories of array processors, depending on how the memory units are organized:
a. Dedicated memory organization
b. Global memory organization
PRAM and VLSI Models
VLSI chips are used to fabricate components such as processor arrays, memory arrays and switching networks.

A - chip area (a measure of the chip's complexity)
T - latency for a given computation
s - problem size
The three-dimensional solid of base A and height T represents the history of the computation performed by the chip.
PRAM and VLSI Models
Memory bound on chip area A - computations with large data sets are memory bound: they are limited by how densely information can be placed on the chip, and the chip area A bounds the amount of memory stored on the chip.
I/O bound on volume AT - the volume AT represents the amount of information flowing through the chip during the computation.
Bisection communication bound - the bisection area represents the maximum amount of information exchanged between the two halves of the chip circuit during the time period T.
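Restating the three bounds compactly (A, T and s as defined above): the symbols M(s), I(s) and B(s) and the combined AT² form follow the standard VLSI complexity model and are an added summary, not something derived in these slides.

```latex
% A = chip area, T = latency, s = problem size.
% M(s): memory required; I(s): information flowing through the chip;
% B(s): information crossing the bisection of the chip.
\[
  A \;\ge\; O\big(M(s)\big), \qquad
  A\,T \;\ge\; O\big(I(s)\big), \qquad
  \sqrt{A}\,T \;\ge\; O\big(B(s)\big)
  \;\Longrightarrow\;
  A\,T^{2} \;\ge\; O\big(B(s)^{2}\big)
\]
```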
Architectural Development Tracks
System architectures evolve along development tracks. These tracks are distinguished by similarities in computational model and technological bases.
There are mainly 3 tracks:
1. Multiple-Processor track
A multiple-processor system can have shared memory or distributed memory.
(a) Shared Memory track: single address space
(b) Message Passing track
2. Multivector and SIMD tracks
(a) Multivector track
(b) SIMD track

3. Multithreaded and Dataflow tracks
(a) Multithreaded track - executes multiple contexts at the same time
(b) Dataflow track
Thank you……….
