CpE 440
Computer Architecture
Dr. Haithem Al-Mefleh
Computer Engineering Department
Yarmouk University, Second 2020-2021

Multicores, Multiprocessors, and Clusters

• Multiprocessor – a computer system with at least 2 processors
  • If 1 processor breaks, the others continue
  • Goals: performance, reliability, availability
• Job-level parallelism (or process-level parallelism)
  • Different programs run on different processors
• Parallel processing program
  • 1 program runs on different processors
• Cluster – a number of computers connected over a LAN that work together as one large multiprocessor
• Multicore microprocessor – a microprocessor that contains multiple processors (cores) in a single integrated circuit
• Parallel programming
  • Programs must execute efficiently in both performance and power


Hardware & Software

• Challenge – effective use of parallel hardware
• Parallel processing program (parallel software) = either sequential or concurrent software running on parallel hardware

Difficulty of writing Parallel Processing Programs
• Difficult to write software that uses multiple processors to complete 1 task faster
• The problem gets worse as the number of processors increases
• A parallel program must deliver better performance and efficiency; otherwise, just use a sequential program on a uniprocessor – it is easier to write, and superscalar/out-of-order/... hardware speeds it up without the programmer's involvement


• Parallel processing programs – much harder to write than sequential programs?!
  • Communication overhead
  • Load balancing – divide the work equally
  • Time for synchronization
  • Scheduling

• Even small sequential parts limit how well a program can use many cores: to get a speedup of about 90 with 100 processors, the sequential part can be at most 0.1% (0.001) of the original computation (see the worked equation below)
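As a worked check of that 0.1% figure (Amdahl's law, with parallel fraction f and P processors; the equation itself is implied rather than shown on the slide):

$$\text{Speedup} = \frac{1}{(1 - f) + f/P} = \frac{1}{0.001 + 0.999/100} \approx 91 \qquad (f = 0.999,\ P = 100)$$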


Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the problem size.

Example (from the textbook): sum 10 scalar variables, then sum a pair of square matrices. Only the matrix sum can be divided among the processors. Time for an addition = t.

10 × 10 matrices (10 + 100 additions):
• Single processor: 10t + 100t = 110t
• 10 processors: 10t + 100t/10 = 20t → Speedup = 110t/20t = 5.5, i.e., (5.5/10) × 100% = 55% of the potential speedup
• 100 processors: 10t + 100t/100 = 11t → Speedup = 10, i.e., (10/100) × 100% = 10% of the potential speedup

100 × 100 matrices (10 + 10,000 additions):
• Single processor: 10t + 10,000t = 10,010t
• 10 processors: 10t + 10,000t/10 = 1010t → Speedup = 9.9, i.e., 99% of the potential speedup
• 100 processors: 10t + 10,000t/100 = 110t → Speedup = 91, i.e., 91% of the potential speedup
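In general form for the 100 × 100 case (restating the numbers above as one formula):

$$\text{Speedup}(P) = \frac{10t + 10{,}000t}{10t + 10{,}000t/P}, \qquad \text{Speedup}(10) \approx 9.9, \quad \text{Speedup}(100) = 91$$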


The previous example assumed perfect load balance: each of the 100 processors does 1% of the load, giving speedup 91. What happens if the load is unbalanced?

If 1 processor gets 2% of the load (2% × 10,000 = 200 additions), the other 99 processors share the remaining 9800 additions:
• Time = max(200t, 9800t/99) + 10t = 210t → Speedup = 10,010t/210t ≈ 48

If 1 processor gets 5% of the load (500 additions), the other 99 share 9500 additions:
• Time = max(500t, 9500t/99) + 10t = 510t → Speedup ≈ 20

The same arithmetic appears in the code sketch below.
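A small C sketch of that arithmetic (illustrative only; the function name and structure are mine, not the slides'):

#include <stdio.h>

/* Speedup for the 100 x 100 example: 10 sequential additions plus
   10,000 parallelizable additions, where the busiest processor does
   hot_fraction of the parallel work. */
double speedup(int procs, double hot_fraction) {
    double seq  = 10.0;                       /* sequential additions (in units of t) */
    double par  = 10000.0;                    /* parallelizable additions */
    double hot  = par * hot_fraction;         /* busiest processor's share */
    double rest = (par - hot) / (procs - 1);  /* the others split the remainder */
    double t_par = hot > rest ? hot : rest;   /* done when the slowest finishes */
    return (seq + par) / (t_par + seq);
}

int main(void) {
    printf("balanced 1%%: %.1f\n", speedup(100, 0.01));  /* ~91 */
    printf("one at 2%%:   %.1f\n", speedup(100, 0.02));  /* ~48 */
    printf("one at 5%%:   %.1f\n", speedup(100, 0.05));  /* ~20 */
    return 0;
}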


Shared Memory Multiprocessors


How to simplify the task of parallel programming... one option: SMP (shared memory multiprocessor)
• A single physical address space shared by all processors
• Variables are available at any time to any processor
• Independent jobs can still run in their own virtual address spaces
• Processors communicate through shared variables


2 styles of SMP:
• UMA – Uniform Memory Access
  • Accessing main memory takes about the same time, no matter which processor requests it and no matter which word is requested
• NUMA – Non-Uniform Memory Access
  • Access time depends on which processor requests which word
  • Harder programming challenges
  • Can scale to larger sizes
  • Can have lower latency to nearby memory


Synchronization
• Processors must coordinate when sharing data
• A lock is one mechanism – only one processor accesses the shared data at a time (see the sketch below)
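To make the idea concrete, a minimal pthreads sketch (my example, not from the slides), where a mutex lock serializes updates to a shared counter:

#include <pthread.h>
#include <stdio.h>

static long sum = 0;                            /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);              /* one thread in here at a time */
        sum++;                                  /* safe read-modify-write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("sum = %ld\n", sum);                 /* always 400000 with the lock */
    return 0;
}

Without the lock, concurrent sum++ updates can be lost and the final value varies from run to run.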


Example – parallel sum on a shared memory multiprocessor:
• Step 1 – divide the data into equal subsets; each processor sums its own subset


• Step 2 – Reduction; divide to conquer: half the processors add pairs of partial sums, then half of those, and so on (see the sketch below)
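A runnable pthreads sketch of both steps (my assumptions: PROCS is a power of two, and a POSIX barrier stands in for the synchronization the reduction needs):

#include <pthread.h>
#include <stdio.h>

#define PROCS 4                    /* power of two, to keep the reduction simple */
#define N     100000

static double A[N], sum[PROCS];    /* sum[Pn] = partial sum of processor Pn */
static pthread_barrier_t bar;

static void *summer(void *arg) {
    int Pn = (int)(long)arg;       /* this thread's "processor number" */
    /* Step 1: sum an equal subset */
    sum[Pn] = 0;
    for (int i = Pn * (N / PROCS); i < (Pn + 1) * (N / PROCS); i++)
        sum[Pn] += A[i];
    /* Step 2: reduction by divide and conquer */
    for (int half = PROCS / 2; half >= 1; half /= 2) {
        pthread_barrier_wait(&bar);              /* wait for the partial sums */
        if (Pn < half) sum[Pn] += sum[Pn + half];
    }
    return NULL;
}

int main(void) {
    pthread_t t[PROCS];
    for (int i = 0; i < N; i++) A[i] = 1.0;
    pthread_barrier_init(&bar, NULL, PROCS);
    for (long i = 0; i < PROCS; i++) pthread_create(&t[i], NULL, summer, (void *)i);
    for (int i = 0; i < PROCS; i++) pthread_join(t[i], NULL);
    printf("total = %.0f\n", sum[0]);            /* 100000 */
    return 0;
}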


Clusters and Other Message-Passing Multiprocessors


• Each processor has its own private physical address space
• Processors communicate with message passing – explicit sends and receives
  • An acknowledgment (ACK) of delivery is possible
• Some applications run well with either shared or private address spaces


• Disadvantages:
  • The cost of administering a cluster of n machines ≈ the cost of administering n independent machines
  • The cost of administering a shared-memory multiprocessor with n processors ≈ the cost of administering 1 machine
  • The processors are interconnected using the I/O interconnect of each computer, which is slower than a memory interconnect
  • Overhead of dividing the memory: a cluster of n machines has n independent memories and n copies of the OS


Example – the same sum on a cluster of 100 computers:
• There are 100 subsets → send one subset to the memory of each computer
• Each computer finds the sum of its own subset


• Reduction – add the partial sums, passing them between computers as messages (see the sketch below)
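A minimal message-passing version of the sum using MPI (illustrative; the slides do not prescribe MPI, and the subset size of 1000 elements per process is an assumption):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, procs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    /* Each process sums its own subset in its private memory. */
    double local[1000], partial = 0.0, total = 0.0;
    for (int i = 0; i < 1000; i++) local[i] = 1.0;
    for (int i = 0; i < 1000; i++) partial += local[i];

    /* Reduction: the partial sums travel as messages to process 0. */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total over %d processes = %.0f\n", procs, total);
    MPI_Finalize();
    return 0;
}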


• Better availability
  • Much easier to disconnect a machine and reinstall or replace it, ...
• Whole computers and independent, scalable networks → easier to expand the cluster without bringing down the application running on top of it
• Lower cost, high availability, improved power efficiency, and rapid, incremental expandability →
  • clusters are attractive to service providers for the World Wide Web


Hardware Multithreading


• Multiple threads share the functional units of a single processor in an overlapping way
  • When one thread stalls, the processor switches to another one quickly
  • A copy of each thread's state (e.g., its registers and PC) is kept
  • Memory can be shared through virtual memory mechanisms, which already support multiprogramming

• 2 approaches
  - Fine-grained: interleave the threads, switching on every instruction
  - Coarse-grained: run an individual thread until it hits a costly stall, then switch; switching carries a pipeline start-up overhead


Simultaneous Multithreading – SMT
• A variation on hardware multithreading
• Uses the resources of a multiple-issue, dynamically scheduled processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism
• Multiple instructions from independent threads can be issued without regard to the dependencies among them; register renaming plus dynamic scheduling resolve those dependencies
• Execute instructions from multiple threads, and leave it to the hardware to associate instruction slots and renamed registers with their threads

Example (figure not reproduced here)


SISD, MIMD, SIMD, SPMD, and Vector
A characterization of parallel hardware based on the number of instruction streams and the number of data streams


SISD (Single Instruction stream, Single Data stream) – a uniprocessor

MIMD (Multiple Instruction streams, Multiple Data streams) – a multiprocessor
• The processors can run different programs, or
• 1 program whose conditional statements decide which parts each processor runs – SPMD (Single Program Multiple Data)


SIMD – a single instruction applied to many data streams
• Vector and array processors
• 1 add instruction → send 64 data streams to 64 ALUs → 64 sums in 1 clock cycle
• All units are synchronized and share 1 PC
• Reduces the cost of the control unit over dozens of execution units
• Reduces the size of program memory – only 1 copy of the code
• Works best on arrays inside for loops – identically structured data
• Data-level parallelism


SIMD in x86: Multimedia Extensions
• MMX and SSE instructions
• Improve the performance of multimedia programs
• The instructions allow the hardware to have many simultaneous ALUs, or to split one wide ALU into many simultaneous narrower ALUs
  • a 64-bit ALU = two 32-bit ALUs = four 16-bit ALUs = eight 8-bit ALUs
• Loads/stores are as wide as the widest ALU
• The width of the operation and of the registers is encoded in the opcode
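For a feel of the style, a small C sketch using SSE intrinsics (my example; it uses the single-precision form, where one 128-bit instruction performs four 32-bit additions at once):

#include <stdio.h>
#include <xmmintrin.h>                   /* SSE intrinsics */

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    __m128 va = _mm_loadu_ps(a);         /* wide load: 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);      /* 4 additions in 1 instruction */
    _mm_storeu_ps(c, vc);                /* wide store */
    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}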


Vector
• A pipelined ALU
• Gather data into vector registers, operate on the elements in a pipelined fashion, store the results back to memory
• Vector registers
• One vector instruction is like an entire loop (see the sketch below)
• The hardware doesn't have to check for data hazards within the same vector instruction
• Control hazards from loop branches are nonexistent
• The number of elements is kept in a separate register
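What a single vector add replaces, written as the scalar C loop it stands for (a conceptual sketch, not real vector code):

/* One vector add of length 64 does the work of this whole loop,
   with no per-iteration hazard checks and no loop branch. */
void vector_add(const double *a, const double *b, double *c) {
    for (int i = 0; i < 64; i++)         /* 64 = the vector length */
        c[i] = a[i] + b[i];
}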


Introduction to Graphics Processing Units (GPUs)


• From early on, processors were connected to the graphics displays
• The processing time required for graphics kept increasing, pushing hardware to improve
• Graphics controllers – accelerate 2D and 3D graphics
• A rapidly growing game market fueled the development
• The result: Graphics Processing Units (GPUs)


• A GPU supplements a CPU – it does not need to perform all tasks; some it may do poorly or not at all
• Heterogeneous combination – CPU plus GPU, not all identical processors
• Programming interface – high-level application programming interfaces (APIs) plus high-level graphics shading languages
  • OpenGL, DirectX
  • NVIDIA's C for Graphics (Cg), Microsoft's High Level Shader Language (HLSL)
• The work is drawing the vertices of 3D geometry primitives such as lines, and shading or rendering pixel fragments
• Each vertex or pixel is drawn/rendered independently → many threads
• The data types are vertices, consisting of (x, y, z, w) coordinates, and pixels, consisting of (red, green, blue, alpha) color components

• The working set can be hundreds of megabytes, and it does not show the same temporal locality as data does in mainstream applications
• There is a great deal of data-level parallelism
• GPUs do not rely on multilevel caches to overcome memory latency
  • They rely on having enough threads in flight to hide it
• They rely on extensive parallelism for high performance – many parallel processors and many concurrent threads
  • Each GPU processor is highly multithreaded


• Main memory is oriented toward bandwidth rather than latency
• Heterogeneous (CPU plus GPU) rather than identical processors
• Historically, SIMD instructions; more recently, a focus on scalar instructions to improve programmability and efficiency
• There was no support for double-precision floating-point arithmetic – it was not needed in graphics applications


General-Purpose GPUs (GPGPUs)
• Use the GPU for general applications – for performance
• C-inspired programming languages let programmers write directly for GPUs
  • Brook – a streaming language for GPUs
  • NVIDIA's CUDA – write C programs that execute on GPUs, with some restrictions
• GPUs thereby also become a platform for parallel programming


Introduction to Multiprocessor Network Topologies
Multicore chips → networks on chips to connect the cores


• Network cost depends on
  • The number of switches
  • The number of links on a switch
  • The width (number of bits) per link
  • The length of the links

• Network performance includes
  • Throughput – the maximum number of messages that can be transmitted in a given time period
  • Latency to send and receive a message
  • Contention
  • ...
• Fault tolerance
• Power efficiency

• Links are bidirectional, ...
• Each node is a processor-memory node

• Bus
  • Total network BW = BW of the bus = 2 × BWlink
  • Bisection bandwidth = BWlink


• Ring
  • Total BW = P × BWlink
  • Bisection bandwidth = 2 × BWlink


• Fully Connected
  • Each processor has a bidirectional link to every other processor
  • Total BW = P × (P − 1)/2 × BWlink
  • Bisection bandwidth = (P/2)² × BWlink
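A quick numeric check of the ring and fully connected formulas (my example; P = 64 is an arbitrary size, and bandwidths are in units of one link's bandwidth):

#include <stdio.h>

int main(void) {
    int P = 64;                          /* processor-memory nodes */
    /* Ring: P links in total; cutting it in half severs 2 links. */
    printf("ring:            total = %d, bisection = %d\n", P, 2);
    /* Fully connected: one link between every pair of nodes. */
    printf("fully connected: total = %d, bisection = %d\n",
           P * (P - 1) / 2, (P / 2) * (P / 2));
    return 0;
}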


Fallacies and Pitfalls


• Do not forget to try the "Check Yourself" sections
  • The answers are given at the end of the chapter


Any questions/comments?

Thank you