
Multicores, Multiprocessors, and Clusters

Introduction
Goal: connecting multiple computers to get higher performance
Multiprocessors
Scalability, availability, power efficiency

Job-level (process-level) parallelism


High throughput for independent jobs

Parallel processing program


Single program run on multiple processors

Multicore microprocessors
Chips with multiple processors (cores)

Hardware and Software


Hardware
Serial: e.g., Pentium 4
Parallel: e.g., quad-core Xeon e5345

Software
Sequential: e.g., matrix multiplication
Concurrent: e.g., operating system

Sequential/concurrent software can run on serial/parallel hardware


Challenge: making effective use of parallel hardware

Parallel Programming
Parallel software is the problem
Need to get significant performance improvement
Otherwise, just use a faster uniprocessor, since it's easier!

Difficulties
Partitioning
Coordination
Communications overhead

Amdahl's Law

Sequential part can limit speedup
Example: 100 processors, 90× speedup?

Tnew = Tparallelizable/100 + Tsequential

Speedup = 1 / ((1 - Fparallelizable) + Fparallelizable/100) = 90

Solving: Fparallelizable = 0.999

Need sequential part to be 0.1% of original time
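
As a quick check of the arithmetic, here is a minimal C sketch of the formula above (the function name and the test fractions are illustrative, not from the slides):

  #include <stdio.h>

  /* Amdahl's Law: speedup with p processors when a fraction f of the
     original (single-processor) time is parallelizable. */
  static double speedup(double f, int p) {
      return 1.0 / ((1.0 - f) + f / p);
  }

  int main(void) {
      printf("f = 0.999 -> speedup = %.1f\n", speedup(0.999, 100));  /* ~91x, near the 90x target */
      printf("f = 0.990 -> speedup = %.1f\n", speedup(0.990, 100));  /* only ~50x */
      return 0;
  }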

Scaling Example
Workload: sum of 10 scalars, and 10 × 10 matrix sum
Speed up from 10 to 100 processors

Single processor: Time = (10 + 100) tadd

10 processors
Time = 10 tadd + 100/10 tadd = 20 tadd
Speedup = 110/20 = 5.5 (55% of potential)

100 processors
Time = 10 tadd + 100/100 tadd = 11 tadd
Speedup = 110/11 = 10 (10% of potential)

Assumes load can be balanced across processors

Scaling Example (cont)


What if matrix size is 100 × 100?
Single processor: Time = (10 + 10000) tadd

10 processors
Time = 10 tadd + 10000/10 tadd = 1010 tadd
Speedup = 10010/1010 = 9.9 (99% of potential)

100 processors
Time = 10 tadd + 10000/100 tadd = 110 tadd
Speedup = 10010/110 = 91 (91% of potential)

Assuming load balanced

Strong vs Weak Scaling


Strong scaling: problem size fixed
As in example

Weak scaling: problem size proportional to number of processors


10 processors, 10 × 10 matrix
Time = 20 tadd

100 processors, 32 × 32 matrix


Time = 10 tadd + 1000/100 tadd = 20 tadd

Constant performance in this example
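
The times and speedups in both scaling examples follow from one expression; a small C sketch (the helper name is illustrative) reproduces them under the stated load-balancing assumption:

  #include <stdio.h>

  /* Time, in units of t_add, to sum 10 scalars plus an n x n matrix on
     p processors: the 10 scalar adds stay sequential, the matrix adds
     divide evenly across processors. */
  static double time_tadd(int n, int p) {
      return 10.0 + (double)(n * n) / p;
  }

  int main(void) {
      int sizes[] = {10, 100};
      int procs[] = {1, 10, 100};
      for (int s = 0; s < 2; s++)
          for (int k = 0; k < 3; k++) {
              int n = sizes[s], p = procs[k];
              printf("n = %3d, p = %3d: time = %6.0f t_add, speedup = %5.1f\n",
                     n, p, time_tadd(n, p), time_tadd(n, 1) / time_tadd(n, p));
          }
      return 0;
  }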

Shared Memory
SMP: shared memory multiprocessor
Hardware provides single physical address space for all processors
Synchronize shared variables using locks
Memory access time:
UMA (uniform) vs. NUMA (nonuniform)

Example: Sum Reduction


Sum 100,000 numbers on 100-processor UMA
Each processor has ID: 0 ≤ Pn ≤ 99
Partition 1000 numbers per processor
Initial summation on each processor:

  sum[Pn] = 0;
  for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];

Now need to add these partial sums


Reduction: divide and conquer
Half the processors add pairs, then a quarter, ...
Need to synchronize between reduction steps

Example: Sum Reduction

half = 100;
repeat
  synch();
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
    /* Conditional sum needed when half is odd;
       Processor0 gets missing element */
  half = half/2;  /* dividing line on who sums */
  if (Pn < half)
    sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
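
For comparison, a runnable shared-memory version in C with OpenMP (a sketch, not the slide's pseudocode: the explicit reduction tree and synch() calls are replaced by OpenMP's reduction clause, and the data values are made up):

  #include <stdio.h>
  #include <omp.h>

  #define N 100000

  int main(void) {
      static double A[N];
      for (int i = 0; i < N; i++) A[i] = 1.0;     /* sample data */

      double sum = 0.0;
      /* Each thread sums its own partition of A; OpenMP then combines
         the per-thread partial sums, playing the role of the
         divide-and-conquer reduction above. */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < N; i++)
          sum += A[i];

      printf("sum = %.0f (max threads = %d)\n", sum, omp_get_max_threads());
      return 0;
  }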

Message Passing
Each processor has private physical address space
Hardware sends/receives messages between processors

Loosely Coupled Clusters


Network of independent computers
Each has private memory and OS
Connected using I/O system
E.g., Ethernet/switch, Internet

Suitable for applications with independent tasks


Web servers, databases, simulations, ...

High availability, scalable, affordable
Problems


Administration cost (prefer virtual machines)
Low interconnect bandwidth
c.f. processor/memory bandwidth on an SMP

Sum Reduction (Again)


Sum 100,000 on 100 processors
First distribute 1000 numbers to each
Then do partial sums:

  sum = 0;
  for (i = 0; i < 1000; i = i + 1)
    sum = sum + AN[i];

Reduction
Half the processors send, other half receive and add
Then a quarter send, a quarter receive and add, ...

Sum Reduction (Again)


Given send() and receive() operations
limit = 100; half = 100;  /* 100 processors */
repeat
  half = (half+1)/2;      /* send vs. receive dividing line */
  if (Pn >= half && Pn < limit)
    send(Pn - half, sum);
  if (Pn < (limit/2))
    sum = sum + receive();
  limit = half;           /* upper limit of senders */
until (half == 1);        /* exit with final sum */

Send/receive also provide synchronization
Assumes send/receive take similar time to addition
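
The same computation expressed with MPI, as one possible concrete form of send()/receive() message passing (a sketch: it assumes 100,000 divides evenly among the ranks, uses dummy data, and lets MPI_Reduce perform the halving tree instead of the explicit loop above):

  #include <stdio.h>
  #include <mpi.h>

  #define N 100000

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      int chunk = N / nprocs;              /* e.g., 1000 numbers per processor */
      double sum = 0.0;
      for (int i = 0; i < chunk; i++)      /* partial sum over the private chunk */
          sum += 1.0;                      /* stand-in for AN[i] */

      double total = 0.0;
      /* Library-provided reduction tree; also acts as a synchronization point. */
      MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0) printf("sum = %.0f\n", total);
      MPI_Finalize();
      return 0;
  }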

Grid Computing
Separate computers interconnected by long-haul networks
E.g., Internet connections
Work units farmed out, results sent back

Can make use of idle time on PCs

Multithreading
Performing multiple threads of execution in parallel
Replicate registers, PC, etc.
Fast switching between threads

In multiple-issue dynamically scheduled processor


Schedule instructions from multiple threads
Instructions from independent threads execute when function units are available
Within threads, dependencies handled by scheduling and register renaming

Example: Intel Pentium-4 HT


Two threads: duplicated registers, shared function units and caches

Future of Multithreading
Will it survive? In what form?
Power considerations lead to simplified microarchitectures
Simpler forms of multithreading

Tolerating cache-miss latency
Multiple simple cores might share resources more effectively

Instruction and Data Streams


An alternate classification
                             Data Streams
                             Single                      Multiple
Instruction     Single       SISD: Intel Pentium 4       SIMD: SSE instructions of x86
Streams         Multiple     MISD: No examples today     MIMD: Intel Xeon e5345

SPMD: Single Program Multiple Data


A parallel program on a MIMD computer

SIMD
Operate elementwise on vectors of data
E.g., MMX and SSE instructions in x86
Multiple data elements in 128-bit wide registers

All processors execute the same instruction at the same time


Each with different data address, etc.

Simplifies synchronization
Reduced instruction control hardware
Works best for highly data-parallel applications

Vector Processors
Highly pipelined function units
Stream data from/to vector registers to units
Data collected from memory into registers
Results stored from registers to memory

Example: Vector extension


32 × 64-element registers (64-bit elements)
Vector instructions:
lv, sv: load/store vector
addv.d: add vectors of double
addvs.d: add scalar to each element of vector of double

Significantly reduces instruction-fetch bandwidth

Example: Y = a × X + Y

Conventional MIPS code:

        l.d    $f0,a($sp)      ;load scalar a
        addiu  r4,$s0,#512     ;upper bound of what to load
  loop: l.d    $f2,0($s0)      ;load x(i)
        mul.d  $f2,$f2,$f0     ;a × x(i)
        l.d    $f4,0($s1)      ;load y(i)
        add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
        s.d    $f4,0($s1)      ;store into y(i)
        addiu  $s0,$s0,#8      ;increment index to x
        addiu  $s1,$s1,#8      ;increment index to y
        subu   $t0,r4,$s0      ;compute bound
        bne    $t0,$zero,loop  ;check if done

Vector MIPS code:

        l.d     $f0,a($sp)     ;load scalar a
        lv      $v1,0($s0)     ;load vector x
        mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
        lv      $v3,0($s1)     ;load vector y
        addv.d  $v4,$v2,$v3    ;add y to product
        sv      $v4,0($s1)     ;store the result
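
The same Y = a × X + Y computation can also use the x86 SSE extensions mentioned earlier; a sketch with SSE2 intrinsics (two 64-bit doubles per 128-bit register; the function name is illustrative):

  #include <emmintrin.h>    /* SSE2 intrinsics */

  /* y[i] = a * x[i] + y[i], two doubles per instruction. */
  void daxpy_sse2(int n, double a, const double *x, double *y) {
      __m128d va = _mm_set1_pd(a);              /* broadcast scalar a */
      int i = 0;
      for (; i + 2 <= n; i += 2) {
          __m128d vx = _mm_loadu_pd(&x[i]);     /* load x(i), x(i+1)  */
          __m128d vy = _mm_loadu_pd(&y[i]);     /* load y(i), y(i+1)  */
          vy = _mm_add_pd(_mm_mul_pd(va, vx), vy);
          _mm_storeu_pd(&y[i], vy);             /* store into y       */
      }
      for (; i < n; i++)                        /* scalar cleanup if n is odd */
          y[i] = a * x[i] + y[i];
  }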

Four Example Systems


2 quad-core Intel Xeon e5345 (Clovertown)

2 quad-core AMD Opteron X4 2356 (Barcelona)

Four Example Systems

2 oct-core Sun UltraSPARC T2 5140 (Niagara 2)

2 oct-core IBM Cell QS20

IBM Cell Broadband Engine


128 × 128-bit registers
128-bit vector per cycle

Abbreviations
PPE: PowerPC Processor Element
SPE: Synergistic Processing Element
MFC: Memory Flow Controller
LS: Local Store
SIMD: Single Instruction Multiple Data

PPE caches: 32 KB L1 I- and D-caches, 512 KB unified L2
Main memory access: 300-600 cycles

CELL BE Programming Model

No direct access to DRAM from the LS of an SPE; data is moved by DMA transfers
Buffer size: 16 KB

History of GPUs
Early video cards
Frame buffer memory with address generation for video output

3D graphics processing
Originally high-end computers (e.g., SGI)
3D graphics cards for PCs and game consoles

Graphics Processing Units


Processors oriented to 3D graphics tasks
Vertex/pixel processing, shading, texture mapping, ray tracing

Graphics in the System

GPU Architectures
Processing is highly data-parallel
GPUs are highly multithreaded
Use thread switching to hide memory latency
Less reliance on multi-level caches

Graphics memory is wide and high-bandwidth

Trend toward general purpose GPUs


Heterogeneous CPU/GPU systems
CPU for sequential code, GPU for parallel code

Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language (HLSL)
Compute Unified Device Architecture (CUDA)

Classifying GPUs
Don't fit nicely into SIMD/MIMD model
Conditional execution in a thread allows an illusion of MIMD
But with performance degradation
Need to write general-purpose code with care

                                        Instruction-Level Parallelism    Data-Level Parallelism
Static: Discovered at Compile Time      VLIW                              SIMD or Vector
Dynamic: Discovered at Runtime          Superscalar                       Tesla Multiprocessor

Programming Model
CUDAs design goals
extend a standard sequential programming language, specifically C/C++,
focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language.

minimalist set of abstractions for expressing parallelism


highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores.

Tesla C870 GPU

Global memory: 1.5 GB per GPU
Constant cache: 8 KB per multiprocessor
Texture cache: 8 KB per multiprocessor
Shared memory: 16 KB per multiprocessor
Registers: 8192 per multiprocessor
Up to 768 threads per multiprocessor (about 21 bytes of shared memory and 10 registers per thread)

CUDA Program Organization


A host program plus one or more parallel kernels that run on the GPU
Kernel
scalar sequential program on a set of parallel threads.

Threads
The programmer organizes threads into a grid of thread blocks.

Threads of a Block
allowed to synchronize via barriers
have access to a high-speed, per-block shared on-chip memory for interthread communication

Threads from different blocks in the same grid


coordinate only via shared global memory space visible to all threads.

CUDA requires that thread blocks be independent, meaning:


a kernel must execute correctly no matter the order in which blocks are run; this independence provides scalability

Main consideration in decomposing workload into kernels


the need for global communication or synchronization amongst threads

Tesla C870 Threads

Tesla C870 Memory Hierarchy

texture and constant caches: utilized by all the threads within the GPU

shared memory: individual and protected region available to threads of a multiprocessor

From Software Perspective


A kernel is an SPMD-style program
Programmer organizes threads into thread blocks; a kernel consists of a grid of one or more blocks
A thread block is a group of concurrent threads that can cooperate amongst themselves through:
barrier synchronization
a per-block private shared memory space

Programmer specifies
number of blocks
number of threads per block

Each thread block = virtual multiprocessor


each thread has a fixed allocation of registers
each block has a fixed allocation of per-block shared memory

SIMT Architecture
Threads of a block grouped into warps
containing 32 threads each.

A warp's threads
free to follow arbitrary and independent execution paths (DIVERGE)
collectively execute only a single instruction at any particular instant

Divergence and reconvergence


threads wait in turn while other threads of the warp execute the instructions
managed in hardware

Ideal match for typical data-parallel programs
SIMT execution of threads is transparent to the programmer: a scalar computational model

Example
Which element am I processing?

B blocks of T threads

Data-parallel vs. task-parallel kernels
Expose more fine-grained parallelism

OpenMP vs CUDA
#pragma omp parallel for shared(A) private(i,j)
for (i = 0; i < 32; i++) {
    for (j = 0; j < 32; j++)
        value = some_function(A[i][j]);
}
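
A CUDA counterpart of the OpenMP fragment above, as a sketch (some_function, the flattened value array, and the launch configuration are illustrative): each of the B × T threads computes its own element index from its block and thread IDs, which also answers the "which element am I processing?" question from the earlier slide.

  __device__ float some_function(float x) { return x * x; }   /* placeholder */

  /* One thread per element of the 32 x 32 matrix, stored row-major. */
  __global__ void process(const float *A, float *value) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;   /* which element am I? */
      if (idx < 32 * 32)
          value[idx] = some_function(A[idx]);
  }

  /* Host-side launch: B = 8 blocks of T = 128 threads covers 1024 elements. */
  /*   process<<<8, 128>>>(d_A, d_value);                                    */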

Efficiency Considerations
Avoid execution divergence
divergence occurs when threads within a warp follow different execution paths
divergence between warps is OK (see the sketch after this list)

Allow loading a block of data into SM


process it there, and then write the final result back out to external memory.

Coalesce memory accesses


Access consecutive words instead of gather-scatter

Create enough parallel work


5K to 10K threads
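
A sketch of the divergence point mentioned above (hypothetical kernels, shown only to contrast the two cases): branching on the thread's parity splits every 32-thread warp into two serialized paths, while branching at whole-warp granularity keeps each warp on a single path. In both kernels consecutive threads touch consecutive words of out, so the stores coalesce.

  __global__ void divergent(float *out) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (threadIdx.x % 2 == 0)     /* even/odd lanes of the SAME warp          */
          out[i] = 0.0f;            /* -> the warp executes both paths in turn  */
      else
          out[i] = 1.0f;
  }

  __global__ void uniform(float *out) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (threadIdx.x < 32)         /* whole warps take the same branch         */
          out[i] = 0.0f;            /* -> no divergence within any warp         */
      else
          out[i] = 1.0f;
  }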

Potential Tricks
Large register file
primary scratch space for computation.

Assign small blocks of elements to each thread, rather than a single element per thread
The nonblocking nature of loads allows
software prefetching for hiding memory latency.

White Spaces After Digital TV


In telecommunications, white spaces refer to vacant frequency bands between licensed broadcast channels or services like wireless microphones. After the transition to digital TV in the U.S. in June 2009, the amount of white space exceeded the amount of occupied spectrum even in major cities. Utilization of white spaces for digital communications requires propagation loss models to detect occupied frequencies in near real-time for operation without causing harmful interference to a DTV signal, or other wireless systems operating on a previously vacant channel.

Challenge
Irregular Terrain Model (ITM), also known as the Longley-Rice model, is used to make predictions of radio field strength based on the elevation profile of terrains between the transmitter and the receiver.
Due to constant changes in terrain topography and variations in radio propagation, there is a pressing need for computational resources capable of running hundreds of thousands of transmission loss calculations per second.

ITM
Given the path length (d) for a radio transmitter T, a circle is drawn around T with radius d. Along the contour line, 64 hypothetical receivers (Ri) are placed with equal distance from each other. Vector lines from T to each Ri are further partitioned into 0.5 km sectors (Sj ). Atmospheric and geographic conditions along each sector form the profile of that terrain (used 256K profiles). For each profile, ITM involves independent computations based on atmospheric and geographic conditions followed by transmission loss calculations.

GPU Strategies for ITM


ITM requires 45 registers per thread; register usage can be reduced to 37
Each profile is 1 KB (radio frequency, path length, antenna heights, surface transfer impedance, plus 157 elevation points)

Recall the Tesla C870 limits: 1.5 GB global memory per GPU; 8 KB constant cache, 8 KB texture cache, and 16 KB shared memory per multiprocessor

How many threads / MP?

GPU Strategies for ITM

Thread configurations examined: 16*16, 128*16, and 192*16 threads


IBM CELL BE

Workload: 256k profiles. Strategies:


Message Queue (MQ)
DMA and double buffering with various buffer sizes (DDB-n)
SIMD with buffer size of 16 KB (DDB-16+SIMD-FG, DDB-16+SIMD-CG)
FG: fine grained; CG: coarse grained.

Profile level SIMDization (CG) improves performance by 7.5x over MQ

Configurations

Comparison

Productivity from a code-development perspective
Based on the personal experience of a Ph.D. student
with C/C++ knowledge and the serial version of the ITM code in hand, without prior background on the Cell BE and GPU programming environments.

Data logged for the learning curve and design and debugging times individually.

Fermi: Next Generation GPU


32 cores per multiprocessor, up to 1536 threads per multiprocessor
No complex mechanisms:
speculation, out of order execution, superscalar, instruction level parallelism, branch predictors

Column access to memory


faster search, read

First GPU with error correcting code


registers, shared memory, caches, DRAM

Languages supported
C, C++, FORTRAN, Java, Matlab, and Python

IEEE 754-2008 double-precision floating point


fused multiply-add operations,

Streaming and task switching (25 microseconds)


Launching multiple kernels (up to 16) simultaneously
visualization and computing
GeForce GTX 580 (512 streaming processors); it costs exactly $499 on Newegg

Easy to Learn

Takes time to master

Concluding Remarks
Fallacy: "Amdahl's Law doesn't apply to parallel computers"
Goal: higher performance by using multiple processors
Difficulties
Developing parallel software
Devising appropriate architectures

Many reasons for optimism


Changing software and application environment
Chip-level multiprocessors with lower latency, higher-bandwidth interconnect

An ongoing challenge for computer architects!
