CpE 440
Computer Architecture
Dr. Haithem Al-Mefleh
Computer Engineering Department
Yarmouk University, Second 2020-2021

Multicores, Multiprocessors, and Clusters

• Multiprocessor – a computer system with at least 2 processors
  • If 1 processor breaks, the others continue
  • Goals: performance, reliability, availability
• Job-level parallelism (or process-level parallelism)
  • Different programs run on different processors
• Parallel processing program
  • 1 program runs on different processors
• Cluster – a number of computers connected over a LAN that work together as one large multiprocessor
• Multicore microprocessor – a microprocessor that contains multiple processors (cores) in a single integrated circuit
• Parallel programming
  • Programs must execute efficiently in both performance and power


Hardware & Software

• Challenge – effective use of parallel hardware
• Parallel processing program (parallel software) = either sequential or concurrent software running on parallel hardware

Difficulty of writing Parallel Processing Programs
• Difficult to write software that uses multiple processors to complete 1 task faster
• The problem gets worse as the number of processors increases
• A parallel program must deliver better performance and efficiency; otherwise, just use a sequential program on a uniprocessor – it is easier to write, and superscalar/out-of-order/... hardware speeds it up without the programmer's involvement


• Parallel processing programs – much harder to write than sequential programs?!
  • Communication overhead
  • Load balancing – divide the work equally
  • Time for synchronization
  • Scheduling

• Even small sequential parts limit how well a program can use many cores: to get a speedup of about 90 with 100 processors, the sequential part can be at most 0.1% (0.001) of the original computation (see the worked equation below)
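As a worked check of that 0.1% figure (Amdahl's law, with parallel fraction f and P processors; the equation itself is implied rather than shown on the slide):

$$\text{Speedup} = \frac{1}{(1 - f) + f/P} = \frac{1}{0.001 + 0.999/100} \approx 91 \qquad (f = 0.999,\ P = 100)$$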


Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the problem size.

Example (from the textbook): sum 10 scalar variables, then sum a pair of square matrices. Only the matrix sum can be divided among the processors. Time for an addition = t.

10 × 10 matrices (10 + 100 additions):
• Single processor: 10t + 100t = 110t
• 10 processors: 10t + 100t/10 = 20t → Speedup = 110t/20t = 5.5, i.e., (5.5/10) × 100% = 55% of the potential speedup
• 100 processors: 10t + 100t/100 = 11t → Speedup = 10, i.e., (10/100) × 100% = 10% of the potential speedup

100 × 100 matrices (10 + 10,000 additions):
• Single processor: 10t + 10,000t = 10,010t
• 10 processors: 10t + 10,000t/10 = 1010t → Speedup = 9.9, i.e., 99% of the potential speedup
• 100 processors: 10t + 10,000t/100 = 110t → Speedup = 91, i.e., 91% of the potential speedup
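In general form for the 100 × 100 case (restating the numbers above as one formula):

$$\text{Speedup}(P) = \frac{10t + 10{,}000t}{10t + 10{,}000t/P}, \qquad \text{Speedup}(10) \approx 9.9, \quad \text{Speedup}(100) = 91$$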


The previous example assumed perfect load balance: each of the 100 processors does 1% of the load, giving speedup 91. What happens if the load is unbalanced?

If 1 processor gets 2% of the load (2% × 10,000 = 200 additions), the other 99 processors share the remaining 9800 additions:
• Time = max(200t, 9800t/99) + 10t = 210t → Speedup = 10,010t/210t ≈ 48

If 1 processor gets 5% of the load (500 additions), the other 99 share 9500 additions:
• Time = max(500t, 9500t/99) + 10t = 510t → Speedup ≈ 20

The same arithmetic appears in the code sketch below.
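A small C sketch of that arithmetic (illustrative only; the function name and structure are mine, not the slides'):

#include <stdio.h>

/* Speedup for the 100 x 100 example: 10 sequential additions plus
   10,000 parallelizable additions, where the busiest processor does
   hot_fraction of the parallel work. */
double speedup(int procs, double hot_fraction) {
    double seq  = 10.0;                       /* sequential additions (in units of t) */
    double par  = 10000.0;                    /* parallelizable additions */
    double hot  = par * hot_fraction;         /* busiest processor's share */
    double rest = (par - hot) / (procs - 1);  /* the others split the remainder */
    double t_par = hot > rest ? hot : rest;   /* done when the slowest finishes */
    return (seq + par) / (t_par + seq);
}

int main(void) {
    printf("balanced 1%%: %.1f\n", speedup(100, 0.01));  /* ~91 */
    printf("one at 2%%:   %.1f\n", speedup(100, 0.02));  /* ~48 */
    printf("one at 5%%:   %.1f\n", speedup(100, 0.05));  /* ~20 */
    return 0;
}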


Shared Memory Multiprocessors


How to simplify the task of parallel programming... one option: SMP (shared memory multiprocessor)
• A single physical address space shared by all processors
• Variables are available at any time to any processor
• Independent jobs can still run in their own virtual address spaces
• Processors communicate through shared variables


2 styles of SMP:
• UMA – Uniform Memory Access
  • Accessing main memory takes about the same time, no matter which processor requests it and no matter which word is requested
• NUMA – Non-Uniform Memory Access
  • Access time depends on which processor requests which word
  • Harder programming challenges
  • Can scale to larger sizes
  • Can have lower latency to nearby memory


Synchronization
• Processors must coordinate when sharing data
• A lock is one mechanism – only one processor accesses the shared data at a time (see the sketch below)
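To make the idea concrete, a minimal pthreads sketch (my example, not from the slides), where a mutex lock serializes updates to a shared counter:

#include <pthread.h>
#include <stdio.h>

static long sum = 0;                            /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);              /* one thread in here at a time */
        sum++;                                  /* safe read-modify-write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("sum = %ld\n", sum);                 /* always 400000 with the lock */
    return 0;
}

Without the lock, concurrent sum++ updates can be lost and the final value varies from run to run.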


Example – parallel sum on a shared memory multiprocessor:
• Step 1 – divide the data into equal subsets; each processor sums its own subset


• Step 2 – Reduction; divide to conquer: half the processors add pairs of partial sums, then half of those, and so on (see the sketch below)
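A runnable pthreads sketch of both steps (my assumptions: PROCS is a power of two, and a POSIX barrier stands in for the synchronization the reduction needs):

#include <pthread.h>
#include <stdio.h>

#define PROCS 4                    /* power of two, to keep the reduction simple */
#define N     100000

static double A[N], sum[PROCS];    /* sum[Pn] = partial sum of processor Pn */
static pthread_barrier_t bar;

static void *summer(void *arg) {
    int Pn = (int)(long)arg;       /* this thread's "processor number" */
    /* Step 1: sum an equal subset */
    sum[Pn] = 0;
    for (int i = Pn * (N / PROCS); i < (Pn + 1) * (N / PROCS); i++)
        sum[Pn] += A[i];
    /* Step 2: reduction by divide and conquer */
    for (int half = PROCS / 2; half >= 1; half /= 2) {
        pthread_barrier_wait(&bar);              /* wait for the partial sums */
        if (Pn < half) sum[Pn] += sum[Pn + half];
    }
    return NULL;
}

int main(void) {
    pthread_t t[PROCS];
    for (int i = 0; i < N; i++) A[i] = 1.0;
    pthread_barrier_init(&bar, NULL, PROCS);
    for (long i = 0; i < PROCS; i++) pthread_create(&t[i], NULL, summer, (void *)i);
    for (int i = 0; i < PROCS; i++) pthread_join(t[i], NULL);
    printf("total = %.0f\n", sum[0]);            /* 100000 */
    return 0;
}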


Clusters and Other Message-Passing Multiprocessors


• Each processor has its own private physical address space
• Processors communicate with message passing – explicit sends and receives
  • An acknowledgment (ACK) of delivery is possible
• Some applications run well with either shared or private address spaces


• Disadvantages:
  • The cost of administering a cluster of n machines ≈ the cost of administering n independent machines
  • The cost of administering a shared-memory multiprocessor with n processors ≈ the cost of administering 1 machine
  • The processors are interconnected using the I/O interconnect of each computer, which is slower than a memory interconnect
  • Overhead of dividing the memory: a cluster of n machines has n independent memories and n copies of the OS


Example – the same sum on a cluster of 100 computers:
• There are 100 subsets → send one subset to the memory of each computer
• Each computer finds the sum of its own subset


• Reduction – add the partial sums, passing them between computers as messages (see the sketch below)
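A minimal message-passing version of the sum using MPI (illustrative; the slides do not prescribe MPI, and the subset size of 1000 elements per process is an assumption):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, procs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    /* Each process sums its own subset in its private memory. */
    double local[1000], partial = 0.0, total = 0.0;
    for (int i = 0; i < 1000; i++) local[i] = 1.0;
    for (int i = 0; i < 1000; i++) partial += local[i];

    /* Reduction: the partial sums travel as messages to process 0. */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total over %d processes = %.0f\n", procs, total);
    MPI_Finalize();
    return 0;
}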


• Better availability
  • Much easier to disconnect a machine and reinstall or replace it, ...
• Whole computers and independent, scalable networks → easier to expand the cluster without bringing down the application running on top of it
• Lower cost, high availability, improved power efficiency, and rapid, incremental expandability →
  • clusters are attractive to service providers for the World Wide Web


Hardware Multithreading


• Multiple threads share the functional units of a single processor in an overlapping way
  • When one thread stalls, the processor switches to another one quickly
  • A copy of each thread's state (e.g., its registers and PC) is kept
  • Memory can be shared through virtual memory mechanisms, which already support multiprogramming

• 2 approaches
  - Fine-grained: interleave the threads, switching on every instruction
  - Coarse-grained: run an individual thread until it hits a costly stall, then switch; switching carries a pipeline start-up overhead


Simultaneous Multithreading – SMT
• A variation on hardware multithreading
• Uses the resources of a multiple-issue, dynamically scheduled processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism
• Multiple instructions from independent threads can be issued without regard to the dependencies among them; register renaming plus dynamic scheduling resolve those dependencies
• Execute instructions from multiple threads, and leave it to the hardware to associate instruction slots and renamed registers with their threads

Example (figure not reproduced here)


SISD, MIMD, SIMD, SPMD, and Vector
A characterization of parallel hardware based on the number of instruction streams and the number of data streams


SISD (Single Instruction stream, Single Data stream) – a uniprocessor

MIMD (Multiple Instruction streams, Multiple Data streams) – a multiprocessor
• The processors can run different programs, or
• 1 program whose conditional statements decide which parts each processor runs – SPMD (Single Program Multiple Data)


SIMD – a single instruction applied to many data streams
• Vector and array processors
• 1 add instruction → send 64 data streams to 64 ALUs → 64 sums in 1 clock cycle
• All units are synchronized and share 1 PC
• Reduces the cost of the control unit over dozens of execution units
• Reduces the size of program memory – only 1 copy of the code
• Works best on arrays inside for loops – identically structured data
• Data-level parallelism


SIMD in x86: Multimedia Extensions
• MMX and SSE instructions
• Improve the performance of multimedia programs
• The instructions allow the hardware to have many simultaneous ALUs, or to split one wide ALU into many simultaneous narrower ALUs
  • a 64-bit ALU = two 32-bit ALUs = four 16-bit ALUs = eight 8-bit ALUs
• Loads/stores are as wide as the widest ALU
• The width of the operation and of the registers is encoded in the opcode
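For a feel of the style, a small C sketch using SSE intrinsics (my example; it uses the single-precision form, where one 128-bit instruction performs four 32-bit additions at once):

#include <stdio.h>
#include <xmmintrin.h>                   /* SSE intrinsics */

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    __m128 va = _mm_loadu_ps(a);         /* wide load: 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);      /* 4 additions in 1 instruction */
    _mm_storeu_ps(c, vc);                /* wide store */
    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}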


Vector
• A pipelined ALU
• Gather data into vector registers, operate on the elements in a pipelined fashion, store the results back to memory
• Vector registers
• One vector instruction is like an entire loop (see the sketch below)
• The hardware doesn't have to check for data hazards within the same vector instruction
• Control hazards from loop branches are nonexistent
• The number of elements is kept in a separate register
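What a single vector add replaces, written as the scalar C loop it stands for (a conceptual sketch, not real vector code):

/* One vector add of length 64 does the work of this whole loop,
   with no per-iteration hazard checks and no loop branch. */
void vector_add(const double *a, const double *b, double *c) {
    for (int i = 0; i < 64; i++)         /* 64 = the vector length */
        c[i] = a[i] + b[i];
}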


Introduction to Graphics Processing Units (GPUs)


• From early on, processors were connected to the graphics displays
• The processing time required for graphics kept increasing, pushing hardware to improve
• Graphics controllers – accelerate 2D and 3D graphics
• A rapidly growing game market fueled the development
• The result: Graphics Processing Units (GPUs)


• A GPU supplements a CPU – it does not need to perform all tasks; some it may do poorly or not at all
• Heterogeneous combination – CPU plus GPU, not all identical processors
• Programming interface – high-level application programming interfaces (APIs) plus high-level graphics shading languages
  • OpenGL, DirectX
  • NVIDIA's C for Graphics (Cg), Microsoft's High Level Shader Language (HLSL)
• The work is drawing the vertices of 3D geometry primitives such as lines, and shading or rendering pixel fragments
• Each vertex or pixel is drawn/rendered independently → many threads
• The data types are vertices, consisting of (x, y, z, w) coordinates, and pixels, consisting of (red, green, blue, alpha) color components

• The working set can be hundreds of megabytes, and it does not show the same temporal locality as data does in mainstream applications
• There is a great deal of data-level parallelism
• GPUs do not rely on multilevel caches to overcome memory latency
  • They rely on having enough threads in flight to hide it
• They rely on extensive parallelism for high performance – many parallel processors and many concurrent threads
  • Each GPU processor is highly multithreaded


• Main memory is oriented toward bandwidth rather than latency
• Heterogeneous (CPU plus GPU) rather than identical processors
• Historically, SIMD instructions; more recently, a focus on scalar instructions to improve programmability and efficiency
• There was no support for double-precision floating-point arithmetic – it was not needed in graphics applications


General-Purpose GPUs (GPGPUs)
• Use the GPU for general applications – for performance
• C-inspired programming languages let programmers write directly for GPUs
  • Brook – a streaming language for GPUs
  • NVIDIA's CUDA – write C programs that execute on GPUs, with some restrictions
• GPUs thereby also become a platform for parallel programming


Introduction to Multiprocessor Network Topologies
Multicore chips → networks on chips to connect the cores


• Network cost depends on
  • The number of switches
  • The number of links on a switch
  • The width (number of bits) per link
  • The length of the links

• Network performance includes
  • Throughput – the maximum number of messages that can be transmitted in a given time period
  • Latency to send and receive a message
  • Contention
  • ...
• Fault tolerance
• Power efficiency

• Links are bidirectional, ...
• Each node is a processor-memory node

• Bus
  • Total network BW = BW of the bus = 2 × BWlink
  • Bisection bandwidth = BWlink


• Ring
  • Total BW = P × BWlink
  • Bisection bandwidth = 2 × BWlink


• Fully Connected
  • Each processor has a bidirectional link to every other processor
  • Total BW = P × (P − 1)/2 × BWlink
  • Bisection bandwidth = (P/2)² × BWlink
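A quick numeric check of the ring and fully connected formulas (my example; P = 64 is an arbitrary size, and bandwidths are in units of one link's bandwidth):

#include <stdio.h>

int main(void) {
    int P = 64;                          /* processor-memory nodes */
    /* Ring: P links in total; cutting it in half severs 2 links. */
    printf("ring:            total = %d, bisection = %d\n", P, 2);
    /* Fully connected: one link between every pair of nodes. */
    printf("fully connected: total = %d, bisection = %d\n",
           P * (P - 1) / 2, (P / 2) * (P / 2));
    return 0;
}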


Fallacies and Pitfalls


• Do not forget to try the "Check Yourself" sections
  • The answers are given at the end of the chapter


Any questions/comments?

Thank you