Von Neumann Architecture
A von Neumann machine is composed of four main components:
Input unit: accepts (or reads) the list of instructions to solve a problem (a program) and the data relevant to that problem.
Memory: stores the program, data, and intermediate results.
Processing Element (PE) or CPU: interprets and executes instructions.
Output unit: displays or prints the results.
Serial Computing
Traditionally, software has been written for serial
computation:
To be run on a single computer having a single CPU
A problem is broken into a discrete series of instructions.
Instructions are executed one after another.
Only one instruction may execute at any moment in time.
The Parallel Computing Problem
• Computational problems usually have
characteristics such as the ability to be:
– broken apart into discrete pieces of work that
can be solved simultaneously;
– executed as multiple program instructions at any
moment in time;
– solved in less time with multiple compute resources
than with a single compute resource.
Parallel Computing
In the simplest sense, parallel computing is the simultaneous use of
multiple compute resources to solve a computational problem:
To be run using multiple CPUs
A problem is broken into discrete parts that can be solved concurrently
Each part is further broken down into a series of instructions
Instructions from each part execute simultaneously on different CPUs, as the sketch below illustrates.
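As a minimal illustration (ours, not from the slides), the sketch below breaks a summation into discrete parts and solves them concurrently with a pool of worker processes; the chunking scheme and function names are our own assumptions.

```python
# Minimal sketch (ours): split a problem into discrete parts and solve
# the parts simultaneously on multiple CPUs via a process pool.
from multiprocessing import Pool

def part_sum(chunk):
    # each worker executes its own series of instructions on its part
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_parts = 4                                  # one part per CPU
    step = len(data) // n_parts
    chunks = [data[i * step:(i + 1) * step] for i in range(n_parts)]
    with Pool(n_parts) as pool:
        partials = pool.map(part_sum, chunks)    # parts run concurrently
    print(sum(partials) == sum(data))            # -> True
```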
Elements of Parallel Computer Architecture
• Each component of a conventional architecture presents
significant performance bottlenecks.
• It is important to understand each of these performance
bottlenecks.
• Parallelism addresses each of these components in significant
ways.
• We start with the processor-level parallel architectures,
which are categorized as implicit and explicit.
Implicit Parallelism
Parallelism is built into the processor chip and is invisible
to the programmer.
The hardware exploits parallelism in the program
automatically.
Implicit parallelism can be achieved through instruction-level
parallelism (ILP): pipelined, superscalar, and VLIW processors,
discussed below.
Fine-Grained Parallelism: ILP
Pipelining
• The earliest use of parallelism in designing PEs
to enhance processing speed was pipelining.
• Processors have long relied on pipelines to
improve execution rates.
• A pipeline is a series of stages, where some
work is done at each stage in parallel.
• The stages are connected one to the next to
form a pipe: instructions enter at one end,
progress through the stages, and exit at the
other end.
PIPELINE CASE – LAUNDRY
• Assume that we have 4 loads of laundry that
need to be washed, dried, and folded.
– 30 minutes to wash
– 40 minutes to dry
– 20 minutes to fold
– We have 1 washer, 1 dryer, and 1 folding station.
– What’s the most efficient way to get the 4 loads of
laundry done?
PIPELINE CASE – [NON-PIPELINED LAUNDRY]
[Figure: the loads are processed one after another; each load takes 30 + 40 + 20 = 90 minutes, so 4 loads take 360 minutes (6 hours).]
PIPELINE CASE – [PIPELINED LAUNDRY]
[Figure: each load enters the washer as soon as it is free; the stages overlap, and all 4 loads finish after 210 minutes (3.5 hours), as the sketch below confirms.]
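To make the two schedules concrete, here is a minimal simulation (ours, not from the slides); the variable names and the simple "a stage starts when both the load and the station are ready" rule are our own assumptions.

```python
# Sketch (ours): simulate non-pipelined vs. pipelined laundry.
stage_times = [30, 40, 20]      # wash, dry, fold (minutes)
n_loads = 4

# Non-pipelined: every load runs all three stages before the next begins.
sequential = n_loads * sum(stage_times)

# Pipelined: a stage starts once (a) the same load has left the previous
# stage and (b) the station is free from the previous load.
finish = [[0] * len(stage_times) for _ in range(n_loads)]
for load in range(n_loads):
    for s, t in enumerate(stage_times):
        prev_stage = finish[load][s - 1] if s > 0 else 0
        prev_load = finish[load - 1][s] if load > 0 else 0
        finish[load][s] = max(prev_stage, prev_load) + t

print(sequential, finish[-1][-1])   # -> 360 210
```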
PIPELINE
• The instruction execution process lends itself
naturally to pipelining:
• we overlap the subtasks of instruction fetch,
decode, and execute.
• The instruction pipeline has six operations:
– Fetch instruction (FI)
– Decode instruction (DI)
– Calculate operands (CO)
– Fetch operands (FO)
– Execute instruction (EI)
– Write result (WR)
INSTRUCTION PIPELINE (CONT.)
Operations briefly explained:
• Fetch instruction (FI): The FI stage is responsible for obtaining the
requested instruction from memory. The instruction and the program
counter are stored in registers as temporary storage.
• Decode instruction (DI): The DI stage is responsible for decoding
the instruction and sending out the various control lines to the other
parts of the processor.
• Calculate operands (CO): The CO stage is where any calculations
are performed. The main component in this stage is the ALU, which
provides the arithmetic and logic capabilities.
• Fetch operands (FO) & Execute instruction (EI): The FO and EI
stages are responsible for loading and storing values from and to
memory. They are also responsible for input to and output from the
processor, respectively.
• Write result (WR): The WR stage is responsible for writing the result
of a calculation, memory access, or input into the register file.
TIMING DIAGRAM FOR INSTRUCTION PIPELINE OPERATION
[Figure: timing diagram in which the six stages of successive instructions overlap, one new instruction entering the pipe each cycle.]
The speed-up of a pipeline is ultimately limited by the number of stages and the time of
the slowest stage.
For this reason, some conventional processors tried very deep pipelines (20 stages vs.
the normal 3–6 stages).
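A small sketch (ours, not from the slides) reproduces the ideal timing diagram and the speed-up bound it implies: with k equal stages and n instructions, a pipeline needs k + (n − 1) cycles instead of n·k, so the speed-up approaches k for large n.

```python
# Sketch (ours): ideal timing diagram for the six-stage pipeline, plus the
# speed-up over non-pipelined execution (assumes equal stages, no stalls).
STAGES = ["FI", "DI", "CO", "FO", "EI", "WR"]

def diagram(n_instructions):
    for i in range(n_instructions):
        # instruction i enters the pipe one cycle after instruction i-1
        print(f"I{i + 1}: " + "   " * i + " ".join(STAGES))

diagram(4)
n, k = 100, len(STAGES)
print("speed-up:", round(n * k / (k + n - 1), 2))   # -> 5.71, close to k = 6
```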
Pipeline Performance Bottlenecks
• A pipeline has the following performance bottlenecks:
Data dependency: a data dependency occurs when an instruction depends
on the results of a previous instruction.
Branch prediction: branch instructions are those that tell the processor to
make a decision about what the next instruction to be executed should be, based
on the results of another instruction. Branch instructions can be troublesome in a
pipeline if a branch is conditional on the result of an instruction that has not yet
finished its path through the pipeline.
Approximately every 5th–6th instruction is a conditional jump! This requires very
accurate branch prediction. The penalty of a prediction error grows with the
depth of the pipeline, as more instructions have to be flushed, as the sketch
below illustrates.
Resource constraints: multiple instructions may need the same hardware
resource in order to execute at the same time.
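The sketch below (ours, not from the slides) quantifies the branch-prediction point; the branch frequency of 1 in 5 follows the slide, while the assumption that a mispredict flushes roughly one instruction per pipeline stage is our own simplification.

```python
# Sketch (ours): deeper pipelines amplify the branch-misprediction penalty.
def effective_cpi(depth, accuracy, branch_freq=0.2, base_cpi=1.0):
    # assume a mispredict flushes ~depth partially executed instructions
    return base_cpi + branch_freq * (1 - accuracy) * depth

for depth in (6, 20):
    print(depth, round(effective_cpi(depth, accuracy=0.95), 2))
# -> 6 1.06 / 20 1.2: the same predictor costs far more in a deep pipeline
```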
To Enhance Processor Speed
• Pipelined Processors
• Superscalar Processor
• VLIW Processor
Superscalar Processor
• An obvious way to improve instruction execution rate
beyond this level is to use multiple pipelines.
• During each clock cycle, multiple instructions are
piped into the processor in parallel. These instructions
are executed on multiple functional units.
• Conventional microprocessors typically support four-way
superscalar execution.
• This requires issuing multiple independent instructions
simultaneously.
• The problem here is one of selecting, or scheduling, such
instructions for simultaneous issue.
Superscalar Scheduler
• The superscalar scheduler is in-chip hardware that looks at a number
of instructions in an instruction queue at runtime and selects an
appropriate set of instructions to execute concurrently (see the
sketch below).
• Scheduling instructions concurrently is constrained by a
number of factors:
– Resource constraint issues
– Data dependency issues
– Branch prediction issues
• Due to hardware cost, the scheduler has limited complexity (e.g., in
its branch prediction algorithm), which impacts its performance.
• Due to runtime operation, the scheduler has limited time and scope
to extract parallelism, which impacts its performance.
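Here is a heavily simplified sketch (ours, not from the slides) of that selection step: each cycle it issues up to `width` instructions from a small window whose source registers are not produced by still-pending instructions, and it assumes results become available one cycle after issue.

```python
# Sketch (ours): a superscalar-style scheduler. Each instruction is
# (name, destination register, source registers).
program = [
    ("I1", "r1", ("r2", "r3")),
    ("I2", "r4", ("r1", "r5")),   # data-dependent on I1
    ("I3", "r6", ("r7", "r8")),   # independent of I1/I2
    ("I4", "r9", ("r4", "r6")),   # data-dependent on I2 and I3
]

width, window, cycle = 4, list(program), 0
while window:
    cycle += 1
    pending = {dst for _, dst, _ in window}        # results not yet ready
    issue = [ins for ins in window if not (set(ins[2]) & pending)][:width]
    for ins in issue:
        window.remove(ins)
    print(f"cycle {cycle}: issue {[name for name, _, _ in issue]}")
# -> cycle 1: ['I1', 'I3'] / cycle 2: ['I2'] / cycle 3: ['I4']
```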
Very Long Instruction Word (VLIW) Processors
• Hardware cost and complexity, and the time/scope constraints
of runtime scheduling, are the major issues in superscalar
design.
• To address these issues, VLIW processors rely on
compile-time analysis to identify and bundle together
instructions that can be executed concurrently.
• These instructions are packed and dispatched together,
hence the name very long instruction word (see the sketch below).
• Typical VLIW processors are limited to 4- to 8-way
parallelism. Variants of this concept are employed
in Intel IA-64 processors and TI TMS320C6xxx DSPs.
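A compile-time analogue of the superscalar sketch above (again ours, under the same simplifying assumptions): the "compiler" packs independent instructions into fixed-width bundles ahead of time, padding unused slots with NOPs, so the hardware needs no runtime scheduler.

```python
# Sketch (ours): compile-time VLIW bundling. Each instruction is
# (name, destination register, source registers).
program = [
    ("I1", "r1", ("r2", "r3")),
    ("I2", "r4", ("r1", "r5")),   # depends on I1
    ("I3", "r6", ("r7", "r8")),   # independent
    ("I4", "r9", ("r4", "r6")),   # depends on I2 and I3
]

def bundle(prog, width=4):
    remaining, bundles = list(prog), []
    while remaining:
        pending = {dst for _, dst, _ in remaining}   # results not yet produced
        slot = [ins for ins in remaining if not (set(ins[2]) & pending)][:width]
        for ins in slot:
            remaining.remove(ins)
        bundles.append([ins[0] for ins in slot] + ["NOP"] * (width - len(slot)))
    return bundles

for b in bundle(program):
    print(b)
# -> ['I1', 'I3', 'NOP', 'NOP'] / ['I2', 'NOP', ...] / ['I4', 'NOP', ...]
```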
Comparison: Superscalar vs. VLIW
• Superscalar implements the scheduler as in-chip hardware,
while VLIW implements it in compiler software.
• Superscalar schedules concurrent instructions at runtime,
while VLIW does it at compile time.
• The superscalar scheduler's scope is limited to a few instructions
from the instruction queue, while the VLIW scheduler has a bigger
context (possibly the full program) to process.
• With more time and a bigger context, the VLIW scheduler can use
more powerful algorithms (loop unrolling, branch prediction, etc.),
giving better results that a superscalar scheduler cannot afford.
• Compilers, however, do not have runtime information
(cache misses, branch variable state, etc.), so VLIW scheduling
is inherently more conservative than superscalar.
Summary: Superscalar vs. VLIW

                  Superscalar                    VLIW
Implementation    hardware (in chip)             software (compiler)
Time              runtime                        compile time
Scope/context     few instructions               full program
Algorithms        simple (time/context limits)   complex & powerful
Runtime info      yes (less conservative)        no (more conservative)