Von Neumann Architecture
A von Neumann machine is composed of four main components:
Input unit: accepts (or reads) the list of instructions to solve a problem (a program) and the data relevant to that problem.
Memory: stores the program, data, and intermediate results.
Processing Element (PE) or CPU: interprets and executes instructions.
Output unit: displays or prints the results.
Serial Computing
Traditionally, software has been written for serial
computation:
To be run on a single computer having a single CPU
A problem is broken into a discrete series of instructions.
Instructions are executed one after another.
Only one instruction may execute at any moment in time.
The Parallel Computing Problem
• Computational problems usually have
characteristics such as the ability to be:
– broken apart into discrete pieces of work that
can be solved simultaneously;
– executed as multiple program instructions at any
moment in time;
– solved in less time with multiple compute resources
than with a single compute resource.
Parallel Computing
In the simplest sense, parallel computing is the simultaneous use of
multiple compute resources to solve a computational problem:
To be run using multiple CPUs
A problem is broken into discrete parts that can be solved concurrently
Each part is further broken down into a series of instructions
Instructions from each part execute simultaneously on different CPUs, as the sketch below illustrates.
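As a minimal illustration (ours, not from the slides), the sketch below breaks a summation into discrete parts and solves them concurrently with a pool of worker processes; the chunking scheme and function names are our own assumptions.

```python
# Minimal sketch (ours): split a problem into discrete parts and solve
# the parts simultaneously on multiple CPUs via a process pool.
from multiprocessing import Pool

def part_sum(chunk):
    # each worker executes its own series of instructions on its part
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_parts = 4                                  # one part per CPU
    step = len(data) // n_parts
    chunks = [data[i * step:(i + 1) * step] for i in range(n_parts)]
    with Pool(n_parts) as pool:
        partials = pool.map(part_sum, chunks)    # parts run concurrently
    print(sum(partials) == sum(data))            # -> True
```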
Elements of Parallel Computer Architecture
• Each component of a conventional architecture presents
significant performance bottlenecks.
• It is important to understand each of these performance
bottlenecks.
• Parallelism addresses each of these components in significant
ways.
• We start with the processor-level parallel architectures,
which are categorized as implicit and explicit.
Implicit Parallelism
Parallelism is built into the processor chip and is invisible
to the programmer.
The hardware exploits parallelism in the program
automatically.
Implicit parallelism can be achieved through instruction-level
parallelism (ILP): pipelined, superscalar, and VLIW processors,
discussed below.
Fine-Grained Parallelism: ILP
Pipelining
• The earliest use of parallelism in designing PEs
to enhance processing speed was pipelining.
• Processors have long relied on pipelines to
improve execution rates.
• A pipeline is a series of stages, where some
work is done at each stage in parallel.
• The stages are connected one to the next to
form a pipe: instructions enter at one end,
progress through the stages, and exit at the
other end.
PIPELINE CASE – LAUNDRY
• Assume that we have 4 loads of laundry that
need to be washed, dried, and folded.
– 30 minutes to wash
– 40 minutes to dry
– 20 minutes to fold
– We have 1 washer, 1 dryer, and 1 folding station.
– What’s the most efficient way to get the 4 loads of
laundry done?
PIPELINE CASE – [NON-PIPELINED LAUNDRY]
[Figure: the loads are processed one after another; each load takes 30 + 40 + 20 = 90 minutes, so 4 loads take 360 minutes (6 hours).]
PIPELINE CASE – [PIPELINED LAUNDRY]
[Figure: each load enters the washer as soon as it is free; the stages overlap, and all 4 loads finish after 210 minutes (3.5 hours), as the sketch below confirms.]
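To make the two schedules concrete, here is a minimal simulation (ours, not from the slides); the variable names and the simple "a stage starts when both the load and the station are ready" rule are our own assumptions.

```python
# Sketch (ours): simulate non-pipelined vs. pipelined laundry.
stage_times = [30, 40, 20]      # wash, dry, fold (minutes)
n_loads = 4

# Non-pipelined: every load runs all three stages before the next begins.
sequential = n_loads * sum(stage_times)

# Pipelined: a stage starts once (a) the same load has left the previous
# stage and (b) the station is free from the previous load.
finish = [[0] * len(stage_times) for _ in range(n_loads)]
for load in range(n_loads):
    for s, t in enumerate(stage_times):
        prev_stage = finish[load][s - 1] if s > 0 else 0
        prev_load = finish[load - 1][s] if load > 0 else 0
        finish[load][s] = max(prev_stage, prev_load) + t

print(sequential, finish[-1][-1])   # -> 360 210
```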
PIPELINE
• The instruction execution process lends itself
naturally to pipelining:
• we overlap the subtasks of instruction fetch,
decode, and execute.
• The instruction pipeline has six operations:
– Fetch instruction (FI)
– Decode instruction (DI)
– Calculate operands (CO)
– Fetch operands (FO)
– Execute instruction (EI)
– Write result (WR)
INSTRUCTION PIPELINE (CONT.)
Operations briefly explained:
• Fetch instruction (FI): The FI stage is responsible for obtaining the
requested instruction from memory. The instruction and the program
counter are stored in registers as temporary storage.
• Decode instruction (DI): The DI stage is responsible for decoding
the instruction and sending out the various control lines to the other
parts of the processor.
• Calculate operands (CO): The CO stage is where any calculations
are performed. The main component in this stage is the ALU, which
provides the arithmetic and logic capabilities.
• Fetch operands (FO) & Execute instruction (EI): The FO and EI
stages are responsible for loading and storing values from and to
memory. They are also responsible for input to and output from the
processor, respectively.
• Write result (WR): The WR stage is responsible for writing the result
of a calculation, memory access, or input into the register file.
TIMING DIAGRAM FOR INSTRUCTION PIPELINE OPERATION
[Figure: timing diagram in which the six stages of successive instructions overlap, one new instruction entering the pipe each cycle.]
The speed-up of a pipeline is ultimately limited by the number of stages and the time of
the slowest stage.
For this reason, some conventional processors tried very deep pipelines (20 stages vs.
the normal 3–6 stages).
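A small sketch (ours, not from the slides) reproduces the ideal timing diagram and the speed-up bound it implies: with k equal stages and n instructions, a pipeline needs k + (n − 1) cycles instead of n·k, so the speed-up approaches k for large n.

```python
# Sketch (ours): ideal timing diagram for the six-stage pipeline, plus the
# speed-up over non-pipelined execution (assumes equal stages, no stalls).
STAGES = ["FI", "DI", "CO", "FO", "EI", "WR"]

def diagram(n_instructions):
    for i in range(n_instructions):
        # instruction i enters the pipe one cycle after instruction i-1
        print(f"I{i + 1}: " + "   " * i + " ".join(STAGES))

diagram(4)
n, k = 100, len(STAGES)
print("speed-up:", round(n * k / (k + n - 1), 2))   # -> 5.71, close to k = 6
```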
Pipeline Performance Bottlenecks
• A pipeline has the following performance bottlenecks:
Data dependency: a data dependency occurs when an instruction depends
on the results of a previous instruction.
Branch prediction: branch instructions are those that tell the processor to
make a decision about what the next instruction to be executed should be, based
on the results of another instruction. Branch instructions can be troublesome in a
pipeline if a branch is conditional on the result of an instruction that has not yet
finished its path through the pipeline.
Approximately every 5th–6th instruction is a conditional jump! This requires very
accurate branch prediction. The penalty of a prediction error grows with the
depth of the pipeline, as more instructions have to be flushed, as the sketch
below illustrates.
Resource constraints: multiple instructions may need the same hardware
resource in order to execute at the same time.
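The sketch below (ours, not from the slides) quantifies the branch-prediction point; the branch frequency of 1 in 5 follows the slide, while the assumption that a mispredict flushes roughly one instruction per pipeline stage is our own simplification.

```python
# Sketch (ours): deeper pipelines amplify the branch-misprediction penalty.
def effective_cpi(depth, accuracy, branch_freq=0.2, base_cpi=1.0):
    # assume a mispredict flushes ~depth partially executed instructions
    return base_cpi + branch_freq * (1 - accuracy) * depth

for depth in (6, 20):
    print(depth, round(effective_cpi(depth, accuracy=0.95), 2))
# -> 6 1.06 / 20 1.2: the same predictor costs far more in a deep pipeline
```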
To Enhance Processor Speed
• Pipelined Processors
• Superscalar Processor
• VLIW Processor
Superscalar Processor
• An obvious way to improve instruction execution rate
beyond this level is to use multiple pipelines.
• During each clock cycle, multiple instructions are
piped into the processor in parallel. These instructions
are executed on multiple functional units.
• Conventional microprocessors typically support four-way
superscalar execution.
• This requires issuing multiple independent instructions
simultaneously.
• The problem here is one of selecting, or scheduling, such
instructions for simultaneous issue.
Superscalar Scheduler
• The superscalar scheduler is in-chip hardware that looks at a number
of instructions in an instruction queue at runtime and selects an
appropriate set of instructions to execute concurrently (see the
sketch below).
• Scheduling instructions concurrently is constrained by a
number of factors:
– Resource constraint issues
– Data dependency issues
– Branch prediction issues
• Due to hardware cost, the scheduler has limited complexity (e.g., in
its branch prediction algorithm), which impacts its performance.
• Due to runtime operation, the scheduler has limited time and scope
to extract parallelism, which impacts its performance.
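Here is a heavily simplified sketch (ours, not from the slides) of that selection step: each cycle it issues up to `width` instructions from a small window whose source registers are not produced by still-pending instructions, and it assumes results become available one cycle after issue.

```python
# Sketch (ours): a superscalar-style scheduler. Each instruction is
# (name, destination register, source registers).
program = [
    ("I1", "r1", ("r2", "r3")),
    ("I2", "r4", ("r1", "r5")),   # data-dependent on I1
    ("I3", "r6", ("r7", "r8")),   # independent of I1/I2
    ("I4", "r9", ("r4", "r6")),   # data-dependent on I2 and I3
]

width, window, cycle = 4, list(program), 0
while window:
    cycle += 1
    pending = {dst for _, dst, _ in window}        # results not yet ready
    issue = [ins for ins in window if not (set(ins[2]) & pending)][:width]
    for ins in issue:
        window.remove(ins)
    print(f"cycle {cycle}: issue {[name for name, _, _ in issue]}")
# -> cycle 1: ['I1', 'I3'] / cycle 2: ['I2'] / cycle 3: ['I4']
```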
Very Long Instruction Word (VLIW) Processors
• Hardware cost and complexity, and the time/scope constraints
of runtime scheduling, are the major issues in superscalar
design.
• To address these issues, VLIW processors rely on
compile-time analysis to identify and bundle together
instructions that can be executed concurrently.
• These instructions are packed and dispatched together,
hence the name very long instruction word (see the sketch below).
• Typical VLIW processors are limited to 4- to 8-way
parallelism. Variants of this concept are employed
in Intel IA-64 processors and TI TMS320C6xxx DSPs.
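A compile-time analogue of the superscalar sketch above (again ours, under the same simplifying assumptions): the "compiler" packs independent instructions into fixed-width bundles ahead of time, padding unused slots with NOPs, so the hardware needs no runtime scheduler.

```python
# Sketch (ours): compile-time VLIW bundling. Each instruction is
# (name, destination register, source registers).
program = [
    ("I1", "r1", ("r2", "r3")),
    ("I2", "r4", ("r1", "r5")),   # depends on I1
    ("I3", "r6", ("r7", "r8")),   # independent
    ("I4", "r9", ("r4", "r6")),   # depends on I2 and I3
]

def bundle(prog, width=4):
    remaining, bundles = list(prog), []
    while remaining:
        pending = {dst for _, dst, _ in remaining}   # results not yet produced
        slot = [ins for ins in remaining if not (set(ins[2]) & pending)][:width]
        for ins in slot:
            remaining.remove(ins)
        bundles.append([ins[0] for ins in slot] + ["NOP"] * (width - len(slot)))
    return bundles

for b in bundle(program):
    print(b)
# -> ['I1', 'I3', 'NOP', 'NOP'] / ['I2', 'NOP', ...] / ['I4', 'NOP', ...]
```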
Comparison: Superscalar vs. VLIW
• Superscalar implements the scheduler as in-chip hardware,
while VLIW implements it in compiler software.
• Superscalar schedules concurrent instructions at runtime,
while VLIW does it at compile time.
• The superscalar scheduler's scope is limited to a few instructions
from the instruction queue, while the VLIW scheduler has a bigger
context (possibly the full program) to process.
• With more time and a bigger context, the VLIW scheduler can use
more powerful algorithms (loop unrolling, branch prediction, etc.),
giving better results that a superscalar scheduler cannot afford.
• Compilers, however, do not have runtime information
(cache misses, branch variable state, etc.), so VLIW scheduling
is inherently more conservative than superscalar.
Summary: Superscalar vs. VLIW

                  Superscalar                    VLIW
Implementation    hardware (in chip)             software (compiler)
Time              runtime                        compile time
Scope/context     few instructions               full program
Algorithms        simple (time/context limits)   complex & powerful
Runtime info      yes (less conservative)        no (more conservative)