
On-Chip Optical Communication for Multicore Processors

Jason Miller
Carbon Research Group
MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LAB

“Moore’s Gap”
[Figure: Performance (GOPS) vs. time. Transistor counts keep rising while per-chip performance from pipelining, OOO superscalar, and SMT/FGMT/CGMT levels off, opening a "GOPS gap" (Moore's Gap); multicore and tiled multicore designs aim to close it]
 Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
 Wire delays
 Power envelopes


Multicore Scaling Trends
Today
 A few large cores on each chip
 Diminishing returns prevent cores from getting more complex
 Only option for future scaling is to add more cores
 Still some shared global structures: bus, L2 caches
[Diagram: a few large cores (processor + cache) sharing a single bus and an L2 cache]
Tomorrow
 100’s to 1000’s of simpler cores [S. Borkar, Intel, 2007]
 Simple cores are more power- and area-efficient
 Global structures do not scale; all resources must be distributed
[Diagram: a large grid of tiles, each containing a processor (p), memory (m), and switch, connected to neighboring tiles in a mesh]

The Future of Multicore
Number of cores doubles every 18 months

Parallelism replaces clock-frequency scaling and core complexity
Resulting challenges: Scalability, Programming, Power

Examples: MIT Raw, Sun UltraSPARC T2, IBM XCell 8i, Tilera TILE64


Multicore Challenges
 Scalability
 How do we turn additional cores into additional performance?
 Must accelerate single apps, not just run more apps in parallel
 Efficient core-to-core communication is crucial

 Architectures that grow easily with each new technology generation

 Programming
 Traditional parallel programming techniques are hard
 Parallel machines were rare and used only by rocket scientists
 Multicores are ubiquitous and must be programmable by anyone

 Power
 Already a first-order design constraint
 More cores and more communication → more power
 Previous tricks (e.g. lowering Vdd) are running out of steam

Multicore Communication Today
Bus-based Interconnect
[Diagram: a few cores (processor + cache) sharing a single bus, connected to an L2 cache and off-chip DRAM]
 Single shared resource
 Uniform communication cost
 Communication through memory
 Contention and long wires limit scaling to roughly 8 cores


Multicore Communication Tomorrow
Point-to-Point Mesh Network
[Diagram: a mesh of tiles, each containing a processor (p), memory (m), and switch, connected point-to-point to neighboring tiles, with DRAM interfaces at the chip edges]
 Examples: MIT Raw, Tilera TILEPro64, Intel Terascale Prototype
 Neighboring tiles are connected
 Distributed communication resources
 Non-uniform costs: latency depends on distance (see the sketch below)
 Encourages direct communication
 More energy efficient than a bus
 Scalable to hundreds of cores
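To make the non-uniform cost concrete, here is a minimal Python sketch (not from the slides) that estimates the latency between two tiles under XY routing; the one-cycle-per-hop and one-cycle-per-router figures are assumptions for illustration only.

```python
# Illustrative only: mesh latency grows with the Manhattan distance between tiles.
# The per-hop and per-router cycle counts below are assumptions, not measured values.

def mesh_latency(src, dst, wire_cycles=1, router_cycles=1):
    """Estimated cycles for a word to travel from tile src to tile dst (XY routing)."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return hops * (wire_cycles + router_cycles)

print(mesh_latency((0, 0), (0, 1)))   # neighboring tiles: 1 hop
print(mesh_latency((0, 0), (7, 7)))   # opposite corners of an 8x8 mesh: 14 hops
```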


Multicore Programming Trends
Meshes and small cores solve the physical scaling challenge, but programming remains a barrier
Parallelizing applications to thousands of cores is hard
 Task and data partitioning
 Communication becomes critical as latencies increase
   Increasing contention for distant communication
   Degraded performance, higher energy
 Inefficient broadcast-style communication
   Major source of contention
   Expensive to distribute a signal electrically


Multicore Programming Trends
For high performance, communication and locality must be managed
 Tasks and data must be both partitioned and placed
 Analyze communication patterns to minimize latencies
 Place data near the code that needs it most
 Place certain code near critical resources (e.g. DRAM, I/O)

 Dynamic, unpredictable communication is effectively impossible to optimize
 Orchestrating communication and locality dramatically increases programming difficulty

Improving Programmability
Observations:
 A cheap broadcast communication mechanism can make programming easier
 Enables convenient programming models (e.g., shared memory)
 Reduces the need to carefully manage locality

 On-chip optical components enable cheap, energy-efficient broadcast

ATAC Architecture
Electrical Mesh Interconnect
[Diagram: a grid of tiles (processor, memory, switch) connected by the electrical mesh, all coupled to a shared optical waveguide]
Optical Broadcast WDM Interconnect

Optical Broadcast Network
 Waveguide passes through every core
 Multiple wavelengths (WDM) eliminate contention
 Signal reaches all cores in <2 ns
 The same signal can be received by all cores
[Diagram: optical waveguide passing through every core on the chip]

Optical Broadcast Network
 Electronic-photonic integration using a standard CMOS process
 Cores communicate via an optical WDM broadcast-and-select network
 Each core sends on its own dedicated wavelength using modulators
 Each core can receive from some set of the N cores

Optical bit transmission
 Each core sends data using a different wavelength → no contention
 Data is sent once; any or all cores can receive it → efficient broadcast (see the sketch below)
[Diagram: at the sending core, a data bit drives a modulator driver and modulator on a dedicated wavelength from a multi-wavelength source; at the receiving core, a filter, photodetector, transimpedance amplifier, and flip-flop recover the bit from the waveguide]
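As a rough illustration of the broadcast-and-select idea sketched above, the toy model below gives every sender its own wavelength on a shared waveguide; receivers simply tune a filter to the sender they want. It deliberately ignores the analog details (modulator drive, photodetection, the transimpedance amplifier) called out in the diagram.

```python
# Toy WDM broadcast-and-select model (illustrative; ignores optical/analog details).

class BroadcastWaveguide:
    def __init__(self):
        self.channels = {}                 # wavelength -> bit currently modulated onto it

    def modulate(self, wavelength, bit):   # sending core: one send on its dedicated wavelength
        self.channels[wavelength] = bit

    def detect(self, wavelength):          # receiving core: filter + photodetector select one wavelength
        return self.channels.get(wavelength)

wg = BroadcastWaveguide()
wg.modulate(wavelength=5, bit=1)                 # core 5 transmits once...
received = [wg.detect(5) for _ in range(63)]     # ...and all 63 other cores can read it
assert all(bit == 1 for bit in received)         # no contention: each sender owns a wavelength
```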

Core-to-core communication
 32-bit data words are transmitted across several parallel waveguides
 Each core contains receive filters and a FIFO buffer for every sender
 Data is buffered at the receiver until needed by the processing core
 Receiver can screen data by sender (i.e. wavelength) or message type (see the sketch below)
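The following sketch shows how the receive side described above might look in software terms; the class and method names are hypothetical, not part of the ATAC design, and the screening behavior is simplified.

```python
from collections import deque

# Hypothetical sketch: one FIFO per sender (i.e. per wavelength), with the processor
# able to screen incoming 32-bit words by sender or by message type.

class AtacReceiver:
    def __init__(self, num_senders):
        self.fifos = [deque() for _ in range(num_senders)]

    def deliver(self, sender_id, word, msg_type):
        # Hardware side: a filtered wavelength drops a word into that sender's FIFO.
        self.fifos[sender_id].append((msg_type, word))

    def receive(self, sender_id, want_type=None):
        # Processor side: pop the next word from a chosen sender, optionally by type
        # (in this toy model, non-matching words are simply discarded).
        fifo = self.fifos[sender_id]
        while fifo:
            msg_type, word = fifo.popleft()
            if want_type is None or msg_type == want_type:
                return word
        return None

rx = AtacReceiver(num_senders=64)
rx.deliver(sender_id=3, word=0xDEADBEEF, msg_type="update")
print(hex(rx.receive(sender_id=3, want_type="update")))   # 0xdeadbeef
```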

[Diagram: sending cores A and B each transmit 32-bit words that land in per-sender FIFOs in front of the receiving core's processor]

ATAC Bandwidth
64 cores, 32 lines, 1 Gb/s per line
 Transmit BW: 64 cores × 1 Gb/s × 32 lines = 2 Tb/s
 Receive-weighted BW: 2 Tb/s × 63 receivers = 126 Tb/s
 Good metric for broadcast networks – reflects WDM
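The arithmetic behind these figures can be reproduced directly; this is a small sketch using the slide's parameters (note the slide rounds 2.048 Tb/s down to 2 Tb/s before scaling by the 63 receivers).

```python
# Reproducing the slide's bandwidth arithmetic: 64 cores, 32 lines, 1 Gb/s per line.
cores, lines, gbps_per_line = 64, 32, 1

transmit_bw = cores * lines * gbps_per_line   # 2048 Gb/s, i.e. ~2 Tb/s
receive_weighted_bw = 2 * (cores - 1)         # 2 Tb/s seen by 63 potential receivers

print(transmit_bw / 1000, "Tb/s")             # 2.048
print(receive_weighted_bw, "Tb/s")            # 126
```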

ATAC allows better utilization of computational resources because less time is spent performing communication


System Capabilities and Performance
Baseline: Raw Multicore Chip (a leading-edge tiled multicore)
ATAC Multicore Chip (a future optical-interconnect multicore)

64-core system (65 nm process):
  Metric                     Raw (baseline)   ATAC
  Peak performance           64 GOPS          64 GOPS
  Chip power                 24 W             25.5 W
  Theoretical power eff.     2.7 GOPS/W       2.5 GOPS/W
  Effective performance      7.3 GOPS         38.0 GOPS
  Effective power eff.       0.3 GOPS/W       1.5 GOPS/W
  Total system power         150 W            153 W

Optical communications require a small amount of additional system power but allow for much better utilization of computational resources.
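As a quick check on the table, a two-line sketch; the efficiency definition used here (effective performance divided by chip power) is an assumption, chosen because it reproduces the quoted values.

```python
# Sanity check on the table above; "effective performance / chip power" is an assumption
# that matches the quoted efficiency figures.
raw_eff  = 7.3  / 24.0      # ~0.30 GOPS/W (Raw baseline)
atac_eff = 38.0 / 25.5      # ~1.49 GOPS/W (ATAC)
print(round(raw_eff, 2), round(atac_eff, 2), round(atac_eff / raw_eff, 1))   # ~0.3, ~1.5, ~4.9x
```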


Programming ATAC
 Cores can directly communicate with any other core in one hop (<2 ns)
 Broadcasts require just one send
 No complicated routing on the network is required
 Cheap broadcast enables frequent global communications
 Broadcast-based cache-update / remote-store protocol (see the sketch after this list)
 All “subscribers” are notified when a writing core issues a store (“publish”)

 Uniform communication latency simplifies scheduling
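A minimal sketch of the publish/subscribe remote-store idea follows; the classes and API below are illustrative assumptions, not ATAC's actual protocol.

```python
# Illustrative broadcast-based cache-update / remote-store sketch (not the real protocol).

class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.local_copy = {}          # locally cached shared values
        self.subscriptions = set()    # addresses this core wants to be notified about

    def on_broadcast(self, addr, value):
        if addr in self.subscriptions:
            self.local_copy[addr] = value       # remote store updates the local copy

class OpticalBroadcastNet:
    def __init__(self, cores):
        self.cores = cores

    def publish(self, writer, addr, value):
        writer.local_copy[addr] = value
        for core in self.cores:                 # one optical send reaches every core
            if core is not writer:
                core.on_broadcast(addr, value)

cores = [Core(i) for i in range(4)]
net = OpticalBroadcastNet(cores)
for c in cores[1:]:
    c.subscriptions.add(0x100)                  # cores 1-3 subscribe to address 0x100
net.publish(cores[0], 0x100, 42)                # core 0 publishes with a single broadcast
assert all(c.local_copy[0x100] == 42 for c in cores)
```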

Communication-centric Computing
 ATAC reduces off-chip memory accesses, and hence energy and latency
 A view of extended global memory can be provided cheaply by on-chip distributed cache memory and the ATAC network
Operation                 Energy    Latency
  ALU add operation       2 pJ      1 cycle
  Network transfer        3 pJ      1 cycle
  32 KB cache read        50 pJ     3 cycles
  Off-chip memory read    500 pJ    250 cycles

[Diagram: a bus-based multicore repeatedly paying ~500 pJ off-chip memory reads vs. ATAC keeping data on chip with ~3 pJ network transfers]
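Using the per-operation costs above, here is a rough, illustrative comparison of fetching one word; the specific access patterns (one off-chip read versus one remote cache read plus one network transfer) are assumptions made for the sake of the comparison.

```python
# Rough illustration using the per-operation energies above (all values in pJ).
OFFCHIP_READ, CACHE_READ, NET_XFER = 500, 50, 3

bus_based_fetch = OFFCHIP_READ              # data must come from off-chip DRAM
atac_fetch      = CACHE_READ + NET_XFER     # data pulled from a remote core's cache via ATAC

print(bus_based_fetch, "pJ vs", atac_fetch, "pJ")                        # 500 pJ vs 53 pJ
print(round(bus_based_fetch / atac_fetch, 1), "x less energy on chip")   # ~9.4x
```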


Summary
ATAC uses optical networks to enable multicore programming and performance scaling
 ATAC encourages a communication-centric architecture, which helps multicore performance and power scalability
 ATAC simplifies programming with a contention-free all-to-all broadcast network
 ATAC is enabled by recent advances in CMOS integration of optical components


Backup Slides

What Does the Future Look Like?
Corollary of Moore’s law: Number of cores will double every 18 months
             ‘02    ‘05    ‘08    ‘11    ‘14
Research      16     64    256   1024   4096
Industry       4     16     64    256   1024
(Cores minimally big enough to run a self-respecting OS)

1K cores by 2014! Are we ready?
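The table follows directly from the 18-month doubling rule; a short sketch (anchored at the Research row's 16 cores in 2002) reproduces it.

```python
# Projecting core counts under the "doubles every 18 months" corollary (Research row).
def cores(year, base_year=2002, base_cores=16):
    return base_cores * 2 ** ((year - base_year) / 1.5)

print([int(cores(y)) for y in (2002, 2005, 2008, 2011, 2014)])   # [16, 64, 256, 1024, 4096]
```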

Scaling to 1000 Cores
[Diagram: 64 optically-connected clusters; within each cluster, dedicated electrical networks (ENet, BNet) connect 16 cores, each with processor, cache, directory, and memory, to a shared optical hub on the ONet]
 A purely optical design scales to about 64 cores
 Beyond that, clusters of cores share optical hubs (see the sketch below)
   ENet and BNet move data to/from the optical hub
   Dedicated, special-purpose electrical networks
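A small sketch of the clustering scheme described above; the 16-cores-per-hub grouping comes from the slide, while the mapping function and the exact division of labor between ENet and BNet are simplifying assumptions.

```python
# Illustrative mapping of cores onto shared optical hubs (16 cores per cluster).
CORES_PER_CLUSTER = 16

def hub_for(core_id):
    return core_id // CORES_PER_CLUSTER      # which optical hub serves this core

def route(src, dst):
    # Electrical networks (ENet/BNet) link a cluster's cores to its hub;
    # the hubs exchange data over the optical broadcast ONet.
    return [f"core{src} -> hub{hub_for(src)} (electrical)",
            f"hub{hub_for(src)} -> hub{hub_for(dst)} (optical ONet)",
            f"hub{hub_for(dst)} -> core{dst} (electrical)"]

print(route(3, 1000))   # core 3 (hub 0) to core 1000 (hub 62)
```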

ATAC is an Efficient Network
• Modulators are the primary source of power consumption
  – Receive power: requires only ~2 fJ/bit, even with -5 dB link loss
  – Modulator power: Ge-Si EA design, ~75 fJ/bit (assuming 50 fJ/bit for the modulator driver)

• Example: 64-Core Communication
  (i.e. N = 64 cores = 64 wavelengths; for a 32-bit word: 2048 drops/core and 32 adds/core)

  – Receive power: 2 fJ/bit × 1 Gbit/s × 32 bits × N² ≈ 262 mW
  – Modulator power: 75 fJ/bit × 1 Gbit/s × 32 bits × N ≈ 154 mW
  – Total energy/bit = 75 fJ/bit + 2 fJ/bit × (N-1) = 201 fJ/bit

• Comparison: electrical broadcast across 64 cores
  – Requires 64 × 150 fJ/bit ≈ 10 pJ/bit (~50× more energy than optical)
    (assumes 150 fJ/mm/bit and 1-mm-spaced tiles)
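The slide's arithmetic can be reproduced directly; note that the power totals come out at milliwatt scale, which is how they are quoted above (the fJ/bit inputs and the electrical comparison are taken from the slide).

```python
# Reproducing the slide's energy arithmetic for N = 64 cores, 32-bit words at 1 Gb/s.
N, BITS, RATE = 64, 32, 1e9            # cores (= wavelengths), bits per word, bits/s per line
RX_FJ, MOD_FJ = 2e-15, 75e-15          # receive and modulator energy per bit (joules)

receive_power   = RX_FJ * RATE * BITS * N**2      # every core listens to every sender: ~0.262 W
modulator_power = MOD_FJ * RATE * BITS * N        # each core drives 32 modulators:     ~0.154 W
energy_per_bit  = MOD_FJ + RX_FJ * (N - 1)        # one send, up to N-1 receivers:      ~201 fJ/bit

electrical_bcast = 64 * 150e-15                   # 150 fJ/mm/bit, 1-mm tiles, 64 cores: ~9.6 pJ/bit
print(receive_power, modulator_power, energy_per_bit * 1e15)
print(round(electrical_bcast / energy_per_bit))   # ~48x more energy for electrical broadcast
```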
