
www.studymafia.org

Seminar
On
Smart Memories

Submitted To: Submitted By:


www.studymafia.org www.studymafia.org
CONTENT
 Packet Processing Workload Challenges
 Solution –Smart Memory
 Introduction to Smart Memory
 Smart Memory Architecture
 Packet Processing Bottlenecks
 Smart Memory
 Advantages
 Reference
OVERVIEW

1. High Performance Packet Processing Challenges

2. Solution – Smart Memory

3. Smart Memory Architecture


PACKET PROCESSING WORKLOAD CHALLENGES

• Sequential memory references
 For lookups (L2, L3, L4, and L7)
 Finite automata traversal
• Read-modify-write
 Statistics, counters, token-bucket, mutex, etc.
• Pointer and linked-list management
 Buffer management, packet queues, etc.

Common trait: tons of memory references and minimal compute

• Traditional implementations use
 Commodity memory to store data
 NPs and ASICs to process data in memory

Performance Barriers:
1. Memory and chip I/O bandwidth
2. Memory latency
3. Lock for atomic access
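
To make the "read-modify-write with minimal compute" pattern concrete, here is a minimal sketch of a token-bucket policer in Python. The class name, field names, and units are assumptions chosen for illustration and do not come from the slides. The arithmetic is tiny compared with the memory traffic around it.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float    # tokens (bytes) added per second
    burst: float   # bucket depth in bytes
    tokens: float  # current fill level (state held in memory)
    last: float    # time of the last update (state held in memory)


def police(bucket: TokenBucket, now: float, packet_len: int) -> bool:
    """Return True if the packet conforms; update the bucket state in place."""
    # READ: fetch the per-flow state.
    tokens, last = bucket.tokens, bucket.last
    # MODIFY: one multiply, one add, one min, one compare.
    tokens = min(bucket.burst, tokens + (now - last) * bucket.rate)
    conforms = tokens >= packet_len
    if conforms:
        tokens -= packet_len
    # WRITE: store the updated state back.
    bucket.tokens, bucket.last = tokens, now
    return conforms
```

In a multi-core system the read and the write must not interleave with another core's update of the same bucket, which is where the "lock for atomic access" barrier comes from.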
ILLUSTRATION OF PERFORMANCE BARRIER I
[Figure: memory banks and processors (P) joined by an interconnection network, next to an IP lookup tree with nodes 1-9, edges labeled 0/1, and entries P1-P5]

• Requires several transactions between memory and processors
 Need more processors
 More latency in interconnect
 Low IPC
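
The cost of a processor-driven tree walk can be sketched as follows. This is a toy model, not the structure used in the slides: the `Memory` class, the binary trie layout, and the rule that each `read` equals one transaction over the interconnect are all assumptions made for counting purposes.

```python
class Memory:
    """Toy memory that counts every read issued over the interconnect."""

    def __init__(self):
        self.cells = {}
        self.reads = 0

    def read(self, addr):
        self.reads += 1
        return self.cells[addr]


def build_trie(mem, routes):
    """Store a binary trie; each node is (next_hop, child0_addr, child1_addr)."""
    nodes = [[None, None, None]]  # node 0 is the root
    for prefix, hop in routes.items():
        cur = 0
        for bit in prefix:
            idx = 1 + int(bit)  # slot 1 holds child 0, slot 2 holds child 1
            if nodes[cur][idx] is None:
                nodes.append([None, None, None])
                nodes[cur][idx] = len(nodes) - 1
            cur = nodes[cur][idx]
        nodes[cur][0] = hop
    for addr, node in enumerate(nodes):
        mem.cells[addr] = tuple(node)


def lookup(mem, addr_bits):
    """Longest-prefix match driven by the processor: one read per tree level."""
    best, cur = None, 0
    for bit in addr_bits:
        hop, child0, child1 = mem.read(cur)
        if hop is not None:
            best = hop
        cur = child1 if bit == "1" else child0
        if cur is None:
            return best
    hop, _, _ = mem.read(cur)
    return hop if hop is not None else best
```

With the hypothetical routes `{"1": "A", "101": "B", "1011": "C"}`, looking up `"10110000"` visits five nodes, so the processor issues five dependent reads before it has an answer. Each read pays the full memory and interconnect latency.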
ILLUSTRATION OF PERFORMANCE BARRIER II
[Figure: memory banks and processors (P) joined by an interconnection network]

• Lookups are read-only so relatively easy
• Linked lists, counters, policers, etc. are read-modify-write
• Requires per memory address lock in multi-core systems

Enqueue / Dequeue:
1. Lock free-list
2. Get free node
3. Unlock free-list
4. Lock list tail
5. Read list tail
6. Link free node
7. Update list tail
8. Unlock list tail

Counters:
1. Lock counter
2. Read counter
3. Write counter
4. Unlock counter

• Locks often kept in memory
 Requires another transaction
 Adds significant latency
• Single queue or single counter operations are extremely slow
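
The four counter steps listed above can be sketched as a toy model. `RemoteMemory` and `increment_counter` are hypothetical names, and treating every method call as one transaction across the interconnect is an assumption used only to count traffic.

```python
class RemoteMemory:
    """Toy shared memory; each method call stands for one transaction."""

    def __init__(self):
        self.data = {}
        self.locks = set()
        self.transactions = 0

    def lock(self, addr):
        self.transactions += 1
        if addr in self.locks:
            return False  # already held: the caller must retry
        self.locks.add(addr)
        return True

    def unlock(self, addr):
        self.transactions += 1
        self.locks.discard(addr)

    def read(self, addr):
        self.transactions += 1
        return self.data.get(addr, 0)

    def write(self, addr, value):
        self.transactions += 1
        self.data[addr] = value


def increment_counter(mem, addr, delta=1):
    while not mem.lock(addr):       # 1. Lock counter (retries cost extra)
        pass
    value = mem.read(addr)          # 2. Read counter
    mem.write(addr, value + delta)  # 3. Write counter
    mem.unlock(addr)                # 4. Unlock counter
```

Even without contention, a single increment costs four serialized transactions in this model. Because updates to the same counter cannot overlap, one hot counter or queue is limited by four round trips per update.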
SOLUTION – SMART MEMORY

 Attach simple compute with data

 Attach lock with data

 Enable local memory communication


INTRODUCTION TO SMART MEMORY
• What is the real problem?
 Compute occurs far away from data
 Lock acquire/release occurs far from data

Fortunately, the compute needed for packet processing jobs is very modest!

• Solution: Make memory smarter by:
 Keeping compute close to data
 Managing lock close to data
 Enabling local communication

[Figure: before, memory banks are reached by the processors (P) only through the interconnection network; after, each memory bank has its own compute block]
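
One way to picture "compute and lock close to data" is a memory bank that accepts an operation instead of raw reads and writes. The sketch below is a software analogy only: `SmartMemoryBank` and `atomic_add` are hypothetical names, and a Python thread lock stands in for whatever the hardware engine does.

```python
import threading


class SmartMemoryBank:
    """Toy model of a memory bank with a small compute engine attached."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # the lock lives next to the data
        self.requests = 0              # messages received from processors

    def atomic_add(self, addr, delta):
        """One request: lock, read, modify and write all happen locally."""
        with self._lock:
            self.requests += 1
            value = self._data.get(addr, 0) + delta
            self._data[addr] = value
            return value
```

The processor sends one request per update and gets the new value back, so the lock, read, write, and unlock steps never cross the interconnect.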
SMART MEMORY ADVANTAGES
(Get more out of fewer transactions!)
1. Lower I/O bandwidth
2. Lower processing latency
3. Higher IPC
4. Significantly higher single counter/queue performance
SMART MEMORY ARCHITECTURE
 Hybrid memory –eDRAM + DDR3-DRAM

 Serial chip I/O


SMART MEMORY CAPACITY AND BANDWIDTH @100G

[Figure: chart of memory bandwidth against memory capacity. Y-axis: Memory bandwidth (Billion accesses / packet), ticks from .15 to 40. X-axis: Memory Capacity (MB), ticks from 2 to 512+. Workloads plotted: DPI (string), DPI (regex), ACL (algorithm), FIB (algorithm), Queuing/Scheduling, Layer 2 forwarding, Statistics/Counter, Basic Layer 2, Packet Buffer, Video Buffer]
• Smart Memory uses intelligent algorithms to split the data structures across:
 64 banks of eDRAM
 8 channels of DDR3 DRAM
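
The slides do not say which algorithms perform the split. Purely as an illustration, the sketch below uses a simple greedy rule: structures with the most accesses per MB go to eDRAM until it is full, and the rest go to DDR3. The function name, the rule itself, and all numbers in the usage note are assumptions.

```python
def split_structures(structures, edram_capacity_mb):
    """Greedy placement: densest (accesses per MB) structures go to eDRAM first.

    structures: list of (name, size_mb, accesses_per_packet) tuples.
    Returns (edram_names, dram_names).
    """
    ranked = sorted(structures, key=lambda s: s[2] / s[1], reverse=True)
    edram, dram, free_mb = [], [], edram_capacity_mb
    for name, size_mb, _ in ranked:
        if size_mb <= free_mb:
            edram.append(name)
            free_mb -= size_mb
        else:
            dram.append(name)
    return edram, dram
```

For example, with hypothetical inputs `[("counters", 4, 2.0), ("fib", 32, 4.0), ("packet_buffer", 512, 0.5)]` and 64 MB of eDRAM, the counters and the FIB land in eDRAM and the packet buffer lands in DDR3. This matches the general shape of the chart, where small, heavily accessed structures sit at the high-bandwidth end and large buffers sit at the high-capacity end.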
SMART MEMORY HIGH LEVEL ARCHITECTURE
• Packet processor complex: array of processors (P)
• Global interconnect: provides fair communication between processors and smart memory
• Smart Memory complex: a 4 x 4 grid of eDRAM blocks, each paired with an SM engine, plus DDR3 DRAM with its own SM engine
• Local interconnect: provides local communication between smart memory blocks
• Computation occurs close to memory, reducing latency
• Requires fewer memory transactions
• Split tables into eDRAM and DRAM

[Figure: the same block diagram, annotated with Read and Result labels]


I/O TECHNOLOGY CHOICE IN SMART MEMORY

 Smart Memory reduces the chip I/O bandwidth significantly
 How to further optimize it?
 Bandwidth, latency and I/O bandwidth gap is growing
 On-chip bandwidth is much higher than memory I/O

• Smart Memory uses serial I/O (based on MoSys data)
 4X the throughput of RLDRAM and QDR
 3X fewer pins than DDR3 and DDR4
 2.5X lower I/O power
HIGH SPEED LINE CARD WITH SMART MEMORY

          Traditional Line Card    Line Card with SM
Power     540+ W                   212 W
Area      472+ cm^2                148 cm^2
Cost      5600+ $                  2520 $

• Traditional line card: DDR3 memory, 10+ DIMMs, 900+ pins
• Line card with SM: 2-3 times lower power, area, and cost

[Figure: block diagrams of the two line cards. Recoverable labels: PHY, NP, TM, SM, CPU, TCAM, SRAM, DDR3 DRAM, Control Plane Memory, To Switch Fabric]
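
The "2-3 times" figure can be checked against the numbers on the slide with a few lines of Python. The dictionary keys are names chosen here. The traditional values are lower bounds (the slide marks them with "+"), so the real ratios may be somewhat higher.

```python
traditional = {"power_w": 540, "area_cm2": 472, "cost_usd": 5600}
with_sm = {"power_w": 212, "area_cm2": 148, "cost_usd": 2520}

# Ratio of the traditional line card to the Smart Memory line card.
ratios = {key: traditional[key] / with_sm[key] for key in traditional}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x")
# power_w: 2.55x
# area_cm2: 3.19x
# cost_usd: 2.22x
```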
CONCLUDING REMARKS

• Packet Processing Bottlenecks


 Data away from compute
 I/O and memory bandwidth

• Smart Memory
 Keep compute close to data
 Keep locking close to data
 Provide inter-memory connect

• Advantages
 Reduced chip I/O bandwidth
 High performance and low latency
 Feature rich, flexible and programmable
 Lower cost
 One chip for several functions
Thank You