
Embedded Memory

Embedded memory is integrated on-chip memory that supports the logic core in accomplishing its intended functions. High-performance embedded memory is a key component in VLSI because of its high-speed and wide bus-width capability, which eliminates external inter-chip communication. As the performance gap between microprocessors and DRAMs has widened, chip designers have placed greater emphasis on the development of embedded memory devices.

MPU Versus DRAM Performance

Advantages of using embedded memories: reduced number of chips, reduced pin count, multi-port memories, less board space, faster response with memory embedded on-chip in a dedicated architecture, memory capacity tailored to the application, reduced power consumption, and greater cost effectiveness at the system level. Disadvantages of embedded memories: the chip is generally larger and more complex to design and manufacture, processing becomes more complex, and the designer must integrate different types of memory on the same chip.

Enabling SoC:
In the modern SoC era, memory is an essential IP requirement for SoC design. High-quality embedded non-volatile memory (eFuse, OTP, MTP, EEPROM and eFlash) can be used for trimming, redundancy, data encryption, ID, coding and programming.
TI's 1-V DSP for Wireless Communications

Embedded Memory Design


On-chip memory integration: Why bother studying memories?
Virtually any CMOS chip manufactured nowadays includes some form of memory. Memories often consume a large part of the chip's total area and power. To improve the chip's area and energy efficiency, new on-chip memory implementations must be found.


Major options for on-chip memory implementations: 1. DRAM macrocells, 2. SRAM macrocells, 3. Flip-flop or latch arrays. Limitations of conventional on-chip memory implementations:
Dynamic random access memory (DRAM): Conventional embedded DRAM uses 1T-1C storage cells. Special technology options are required to build high-density stacked or trench capacitors. The storage cell is small, but the process is not compatible with standard logic CMOS technologies.

Static random access memory (SRAM)


Conventional SRAM uses 6T storage cells. It is compatible with standard logic CMOS technologies, but the storage cell is large.

New approaches to area- and energy-efficient on-chip memory design


Flip-Flop and latch arrays:
Flip-flop and latch arrays bring advantages over macrocells for small storage capacities, but become excessively large for storage capacities beyond a few kbit.

Multilevel DRAM
Principle: To increase the per-area storage density of conventional 1T-1C DRAM, more than one bit can be stored in a single memory cell. Digital-to-analog and analog-to-digital converters are used for the write and read operations, respectively. The more (fractional) bits are stored per cell, the shorter the data retention time.

Challenges: Invent circuit and system level techniques to cope with the degradation of the data retention time due to multilevel operation
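The level-per-cell trade-off can be made concrete with a small numeric sketch. The C fragment below is purely illustrative (VDD, BITS_PER_CELL, write_cell and read_cell are invented names, not from the text): it maps a level index to a stored voltage for the write DAC, quantizes a sensed voltage back to a level for the read ADC, and shows how the margin between adjacent levels shrinks as more bits are packed into a cell, which is why leakage erodes the retention time faster.

```c
/* Hypothetical sketch of multilevel DRAM storage, assuming an idealized cell
 * that holds one of 2^BITS_PER_CELL voltage levels between 0 and VDD.       */
#include <math.h>
#include <stdio.h>

#define VDD           1.0   /* assumed full-scale cell voltage     */
#define BITS_PER_CELL 2     /* 2 bits per cell -> 4 storage levels */
#define LEVELS        (1 << BITS_PER_CELL)

/* "DAC" used for the write operation: level index -> stored voltage. */
static double write_cell(int level)
{
    return VDD * level / (LEVELS - 1);
}

/* "ADC" used for the read operation: sensed voltage -> nearest level. */
static int read_cell(double v)
{
    int level = (int)lround(v * (LEVELS - 1) / VDD);
    if (level < 0)       level = 0;
    if (level >= LEVELS) level = LEVELS - 1;
    return level;
}

int main(void)
{
    printf("levels = %d, margin = %.3f V\n", LEVELS, VDD / (LEVELS - 1));

    double v = write_cell(2);
    v -= 0.20;   /* crude model of charge leakage, enough to cross a level */
    printf("wrote level 2, read back level %d\n", read_cell(v));
    return 0;
}
```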

Gain cells
Principle: Amplify the stored charge to obtain a higher sensing current. Gain-cell-based DRAMs can be smaller than SRAM and at the same time logic compatible. Various gain-cell-based DRAMs have been proposed to date, with different inherent gain cells. Challenges: Inhibit leakage mechanisms and boost the data retention time at advanced technology nodes (65 nm, 45 nm, etc.).

Fault-tolerant memories: Principle: Allow a well-controlled amount of bit errors in the memory if the memory is to be used in error-correcting systems, thereby saving area and power in the memory. Applications: turbo decoder, low-density parity check (LDPC) decoder, and many more. Challenges: Express the bit-error probability in terms of statistics on circuit parameters such as Vth, W, and L for different PVT corners.


Scratchpad Memories
Designers of embedded systems strive to improve performance and reduce energy consumption. Low-energy systems extend battery life, reduce cooling costs, and decrease weight. Shutting down parts of the processor, voltage scaling, specialized instructions, feature size reduction, and additional cache levels are some of the techniques used to reduce energy in embedded systems. To reduce power in embedded systems, applying profile knowledge to a system with a scratchpad memory (SPM) instead of an instruction cache has been found to be useful.

An SPM, made of SRAM cells, is a memory array consisting of only decoding and column circuitry logic. The typical instruction memory hierarchy consists of a main memory and an SPM. The content of the SPM is updated through the bus connecting the DRAM to the SPM.

SPM Configuration

In the SPM configuration, it is possible for the CPU to access the DRAM directly, unlike in the cache architecture, where the CPU has access to the DRAM only through the cache. In the case of the SPM, the compiled code is initially loaded into the DRAM; when a repeatedly utilized segment is encountered, that segment is copied to the SPM and the program is executed from the SPM. Portions which are rarely used are executed from the DRAM, reducing copying costs.
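A hypothetical sketch of such profile-guided placement is shown below: code segments are ranked by their profiled execution count per byte and greedily assigned to the scratchpad until it is full, with the remainder left to execute from DRAM. The structure, names and the benefit metric are assumptions for illustration, not the method of any specific compiler.

```c
/* Greedy, profile-guided SPM allocation sketch (illustrative names only). */
#include <stdio.h>

struct segment {
    const char *name;
    unsigned    size;        /* bytes                         */
    unsigned    exec_count;  /* from profiling                */
    int         in_spm;      /* 1 if placed in the scratchpad */
};

#define SPM_SIZE 4096

static void allocate_spm(struct segment *seg, int n)
{
    unsigned used = 0;
    /* Sort descending by execution count per byte (simple exchange sort). */
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if ((double)seg[j].exec_count / seg[j].size >
                (double)seg[i].exec_count / seg[i].size) {
                struct segment t = seg[i]; seg[i] = seg[j]; seg[j] = t;
            }
    for (int i = 0; i < n; i++)
        if (used + seg[i].size <= SPM_SIZE) {
            seg[i].in_spm = 1;           /* would be copied DRAM -> SPM */
            used += seg[i].size;
        }
}

int main(void)
{
    struct segment segs[] = {
        { "inner_loop", 1024, 100000, 0 },
        { "init_code",  2048,      1, 0 },
        { "fir_filter", 3072,  50000, 0 },
    };
    allocate_spm(segs, 3);
    for (int i = 0; i < 3; i++)
        printf("%-10s -> %s\n", segs[i].name, segs[i].in_spm ? "SPM" : "DRAM");
    return 0;
}
```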

Cache Configuration

In the cache configuration, by contrast, the CPU has access to the DRAM only through the cache; data is brought into the cache at run time by a hardware cache controller.

Cache-Based Embedded SoC


All programs exhibit the property of locality of reference, and cache memories exploit this property to give improved performance. In cache-based architectures, data is placed in an off-chip RAM and copied at run time to the cache by a hardware cache controller. The mapping of data from off-chip RAM to the L1 cache is dictated by the cache associativity scheme and can create potential side effects such as thrashing. Direct-mapped caches incur much more off-chip memory traffic, which, when not handled properly, can lead to very high power consumption and lower performance.

Coherent Caches


Bus-based Shared Memory Organization

Organization
The bus is usually a simple physical connection (wires). Bus bandwidth limits the number of CPUs. There could be multiple memory elements. For now, assume that each CPU has only a single level of cache.

Problem of Memory Coherence


Assume just single-level caches and main memory. A processor writes to a location in its cache. Other caches may hold shared copies - these will be out of date. Updating main memory alone is not enough.

Bus Snooping

A scheme where every CPU knows who has a copy of its cached data is far too complex. So each CPU (cache system) snoops (i.e. watches continually) for write activity concerned with data addresses which it has cached. This assumes a bus structure which is global, i.e. all communication can be seen by all. A more scalable solution: directory-based coherence schemes.

Snooping Protocols

Write Invalidate: A CPU wanting to write to an address grabs a bus cycle and sends a write invalidate message. All snooping caches invalidate their copy of the appropriate cache line. The CPU writes to its cached copy (assume for now that it also writes through to memory). Any shared read in other CPUs will now miss in the cache and re-fetch the new data.
Write Update: A CPU wanting to write grabs a bus cycle and broadcasts the new data as it updates its own copy. All snooping caches update their copy.
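A minimal, hypothetical sketch of the write-invalidate idea just described (the cache geometry, structure and function names are invented for illustration): every other cache snoops the write address on the bus and invalidates a matching line, while the writer updates its own write-through copy.

```c
#include <stdbool.h>
#include <stdio.h>

#define LINES 4

struct line  { unsigned tag; bool valid; };
struct cache { struct line l[LINES]; };

/* Invoked in every other cache when a write address appears on the bus. */
static void snoop_write(struct cache *c, unsigned addr)
{
    unsigned idx = addr % LINES, tag = addr / LINES;
    if (c->l[idx].valid && c->l[idx].tag == tag)
        c->l[idx].valid = false;            /* copy is now stale */
}

static void cpu_write(struct cache *self, struct cache *other, unsigned addr)
{
    snoop_write(other, addr);               /* write-invalidate bus message   */
    self->l[addr % LINES] =                 /* update own (write-through) copy */
        (struct line){ .tag = addr / LINES, .valid = true };
}

int main(void)
{
    struct cache c0 = {0}, c1 = {0};
    c1.l[2] = (struct line){ .tag = 5, .valid = true };      /* c1 caches addr 22 */
    cpu_write(&c0, &c1, 22);                                  /* c0 writes addr 22 */
    printf("c1 line valid after snoop: %d\n", c1.l[2].valid); /* prints 0          */
    return 0;
}
```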


Note that in both schemes, the problem of simultaneous writes is taken care of by bus arbitration - only one CPU can use the bus at any one time.

Update or Invalidate?
Due to both spatial and temporal locality, the previous cases occur often. Bus bandwidth is a precious commodity in shared-memory multiprocessors. Experience has shown that invalidate protocols use significantly less bandwidth. We will consider implementation details only of invalidate.

Update or Invalidate?
Update looks the simplest, most obvious, and fastest, but the invalidate scheme is usually implemented with write-back caches, and in that case: multiple writes to the same word (with no intervening read) need only one invalidate message but would require an update for each; writes to different words in the same (usually multi-word) cache block require only one invalidate but would require multiple updates.

Implementation Issues
In both schemes, knowing whether a cached value is not shared (i.e. no copy exists in another cache) can avoid sending any messages. The invalidate description assumed that a cache value update was written through to memory. If we used a copy-back scheme, other processors could re-fetch the old value on a cache miss. We need a protocol to handle all this.
Snooping Cache Variations

Basic Protocol:    Exclusive, Shared, Invalid
Berkeley Protocol: Owned Exclusive, Owned Shared, Shared, Invalid
Illinois Protocol: Private Dirty, Private Clean, Shared, Invalid
MESI Protocol:     Modified (private, != Memory), eXclusive (private, = Memory), Shared (shared, = Memory), Invalid

Illinois Protocol notes: if a read is sourced from memory, the line becomes Private Clean; if the read is sourced from another cache, it becomes Shared. A write can complete in the cache if the line is held Private Clean or Private Dirty.

MESI
Assumptions:
Data ownership model: only one cache can have dirty data. Writes are typically not broadcast (i.e. data is not transmitted); they cause data invalidation in other caches.

Bus activity
Master: intend to cache (inquiry cycle). Slave: "I have a copy" (Hit signal); "I have a modified copy" (HitM signal).


MESI Protocol (1)


A practical multiprocessor invalidate protocol which attempts to minimize bus usage. It allows the use of a write-back scheme - i.e. main memory is not updated until a dirty cache line is displaced. It is an extension of the usual cache tags, i.e. the invalid tag and dirty tag in a normal write-back cache.

MESI Protocol (2)


Any cache line can be in one of 4 states (2 bits):
Modified - the cache line has been modified and is different from main memory; it is the only cached copy (multiprocessor "dirty").
Exclusive - the cache line is the same as main memory and is the only cached copy.
Shared - same as main memory, but copies may exist in other caches.
Invalid - the line data is not valid (as in a simple cache).
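For concreteness, a minimal sketch of how these four states might be encoded per cache line is given below; the names and the 32-byte line size are assumptions for illustration, not taken from the text.

```c
/* Two state bits per line replace the single valid/dirty bits of a
 * uniprocessor write-back cache. */
enum mesi { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

struct cache_line {
    unsigned  tag;
    enum mesi state;    /* M, E, S or I                  */
    unsigned  data[8];  /* e.g. a 32-byte cache line     */
};
```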

MESI Protocol (3)


A cache line changes state as a function of memory access events. An event may be either due to local processor activity (i.e. a cache access) or due to bus activity, as a result of snooping. A cache line has its own state affected only if the address matches.

MESI Protocol (4)


Operation can be described informally by looking at the actions in the local processor: Read Hit, Read Miss, Write Hit, Write Miss. It can be described more formally by a state transition diagram.

MESI Local Read Hit / Read Miss


Read hit: The line must be in one of M, E, or S. This must be the correct local value (if M, it must have been modified locally). Simply return the value; there is no state change.
Read miss, no other copy in any cache: The processor makes a bus request to memory; the value is read into the local cache and marked E.
Read miss, one cache has an E copy: The processor makes a bus request to memory; the snooping cache puts the copied value on the bus; the memory access is abandoned; the local processor caches the value; both lines are set to S.

Read miss, several caches have an S copy: The processor makes a bus request to memory; one cache puts the copied value on the bus (arbitrated); the memory access is abandoned; the local processor caches the value; the local copy is set to S; the other copies remain S.
Read miss, one cache has an M copy: The processor makes a bus request to memory; the snooping cache puts the copied value on the bus; the memory access is abandoned; the local processor caches the value; the local copy is tagged S; the source (M) value is copied back to memory; the source line goes M -> S.
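The read-hit and read-miss actions above can be summarised as a small transition function. The sketch below is illustrative only: bus operations appear as comments and the function and argument names are invented.

```c
enum mesi { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

/* Returns the next state of the local line after a processor read. */
enum mesi local_read(enum mesi state, int other_copy_exists)
{
    switch (state) {
    case MESI_MODIFIED:
    case MESI_EXCLUSIVE:
    case MESI_SHARED:
        /* Read hit: value is correct locally, simply return it, no change. */
        return state;
    case MESI_INVALID:
    default:
        /* Read miss: issue a bus read. If a snooping cache holds the line
         * (E, S or M) it supplies the data and both copies end up S; an M
         * owner also writes the line back to memory. If no other cache has
         * it, the data comes from memory and the line is loaded as E.      */
        return other_copy_exists ? MESI_SHARED : MESI_EXCLUSIVE;
    }
}
```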


MESI Local Write Hit / Write Miss


Write hit: The line must be in one of M, E, or S.
M: the line is exclusive and already dirty; update the local cache value; no state change.
E: update the local cache value; state E -> M.
S: the processor broadcasts an invalidate on the bus; snooping processors with an S copy change S -> I; the local cache value is updated; local state change S -> M.

Write miss: the detailed action depends on copies in other processors.
No other copies: the value is read from memory to the local cache, the value is updated, and the local copy state is set to M.
Other copies, either one in state E or more in state S: the value is read from memory to the local cache - the bus transaction is marked RWITM (Read With Intent To Modify); snooping processors see this and set their copy state to I; the local copy is updated and its state set to M.

Another copy in state M: the processor issues a bus transaction marked RWITM; the snooping processor sees this, blocks the RWITM request, takes control of the bus, writes its copy back to memory, and sets its copy state to I. The original local processor then re-issues the RWITM request, which is now the simple no-copy case: the value is read from memory to the local cache, the local copy is updated, and its state is set to M.
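A companion sketch for the write actions, under the same assumptions as before: the bus side effects (invalidate broadcast, RWITM, the M-owner write-back) appear only as comments, and the names are invented.

```c
enum mesi { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

/* Returns the next state of the local line after a processor write. */
enum mesi local_write(enum mesi state)
{
    switch (state) {
    case MESI_MODIFIED:
        /* Write hit on an already-dirty exclusive line: just update it.   */
        return MESI_MODIFIED;
    case MESI_EXCLUSIVE:
        /* Write hit on a clean exclusive line: update locally, E -> M.    */
        return MESI_MODIFIED;
    case MESI_SHARED:
        /* Write hit on a shared line: broadcast an invalidate so every
         * snooping cache with an S copy goes S -> I, then update, S -> M. */
        return MESI_MODIFIED;
    case MESI_INVALID:
    default:
        /* Write miss: issue RWITM. Snoopers with S/E copies invalidate; an
         * M owner blocks the request, writes back and invalidates, and the
         * RWITM is re-issued. The line is then loaded, updated, set to M.  */
        return MESI_MODIFIED;
    }
}
```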

MESI locally initiated accesses

MESI remotely initiated accesses

MESI notes


There are minor variations (particularly to do with write miss). A normal write-back when a cache line is evicted is done if the line state is M. Multi-level caches: if the caches are inclusive, only the lowest-level cache needs to snoop on the bus.


Cache Line State   This cache line is valid   The memory copy is   Copies exist in other caches?   A write to this line
M Modified         Yes                        out of date          No                              does not go to bus
E Exclusive        Yes                        valid                No                              does not go to bus
S Shared           Yes                        valid                Maybe                           goes to bus and updates cache
I Invalid          No                         -                    Maybe                           goes directly to bus

L1/L2 organization
Flow of SNOOP cycles

L1/L2 communication

Directory Schemes
Snoopy schemes do not scale because they rely on broadcast. Directory-based schemes allow scaling. They avoid broadcasts by keeping track of all PEs caching a memory block and then using point-to-point messages to maintain coherence. They allow the flexibility to use any scalable point-to-point network.

Basic Scheme (Censier & Feautrier)


Larger MPs
Separate memory per processor. Local or remote access goes via the memory controller. One cache coherency solution: non-cached pages. Alternative: a directory per cache that tracks the state of every block in every cache - which caches have copies of the block, dirty vs. clean, etc. Info per memory block vs. per cache block? PLUS: in memory => simpler protocol (centralized/one location). MINUS: in memory => the directory scales with memory size rather than cache size. To prevent the directory becoming a bottleneck: distribute directory entries with memory, each keeping track of which processors have copies of their blocks.

Distributed Directory MPs

Interconnection Network

Directory Protocol
Similar to the snoopy protocol: three states. Shared: >= 1 processors have the data, memory is up to date. Uncached: no processor has it; not valid in any cache. Exclusive: 1 processor (the owner) has the data; memory is out of date. In addition to the cache state, we must track which processors have the data when in the shared state (usually a bit vector, 1 if the processor has a copy). Keep it simple(r): writes to non-exclusive data => write miss; the processor blocks until the access completes; assume messages are received and acted upon in the order sent.
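A minimal sketch of what one directory entry per memory block could look like, assuming up to 64 processors tracked with a presence bit vector; the field and helper names are illustrative, not from the text.

```c
#include <stdint.h>

enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

struct dir_entry {
    enum dir_state state;
    uint64_t       sharers;  /* bit i set => processor i has a copy; for
                                DIR_EXCLUSIVE exactly one bit (the owner) */
};

/* Example helpers for maintaining the sharing set. */
static inline void add_sharer(struct dir_entry *e, int p)      { e->sharers |= (1ULL << p); }
static inline void clear_sharers(struct dir_entry *e)          { e->sharers = 0; }
static inline int  is_sharer(const struct dir_entry *e, int p) { return (e->sharers >> p) & 1; }
```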

Directory Protocol
With no bus, and not wanting to broadcast: the interconnect is no longer a single arbitration point, and all messages have explicit responses. Terms: the local node is the node where a request originates; the home node is the node where the memory location of an address resides; a remote node is a node that has a copy of a cache block, whether exclusive or shared. Example messages are listed below (P = processor number, A = address).

Directory Protocol Messages


Read miss (local cache -> home directory; msg: P, A): Processor P reads data at address A; send data and make P a read sharer.
Write miss (local cache -> home directory; msg: P, A): Processor P writes data at address A; send data and make P the exclusive owner.
Invalidate (home directory -> remote caches; msg: A): Invalidate a shared copy at address A.
Fetch (home directory -> remote cache; msg: A): Fetch the block at address A and send it to its home directory.
Fetch/Invalidate (home directory -> remote cache; msg: A): Fetch the block at address A, send it to its home directory, and invalidate the block in the cache.
Data value reply (home directory -> local cache; msg: Data): Return a data value from the home memory.
Data write-back (remote cache -> home directory; msg: A, Data): Write back a data value for address A.

State Transition Diagram for an Individual Cache Block in a Directory Based System
States are identical to the snoopy case; transactions are very similar. Transitions are caused by read misses, write misses, invalidates, and data fetch requests. The cache generates read miss and write miss messages to the home directory. Write misses that were broadcast on the bus become explicit invalidate and data fetch requests.



State Transition Diagram for the Directory


Same states and structure as the transition diagram for an individual cache. Two actions: update of the directory state, and sending messages to satisfy requests. It tracks all copies of each memory block. It also indicates an action that updates the sharing set, Sharers, as opposed to sending a message.

Example Directory Protocol


A message sent to the directory causes two actions: update the directory, and send more messages to satisfy the request.
Block is in the Uncached state: the copy in memory is the current value; the only possible requests for that block are:
Read miss: the requesting processor is sent the data from memory and the requestor is made the only sharing node; the state of the block is made Shared.
Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
Block is Shared, so the memory value is up to date:
Read miss: the requesting processor is sent the data from memory and is added to the sharing set.
Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.

Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner), so there are three possible directory requests:
Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
Data write-back: the owner processor is replacing the block and hence must write it back. This makes the memory copy up to date (the home directory essentially becomes the owner), the block is now uncached, and the Sharers set is empty.
Write miss: the block has a new owner. A message is sent to the old owner causing the cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
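As a worked illustration of the read-miss handling described above, the sketch below updates a directory entry for each of the three directory states; message sends appear only as comments, and it reuses the hypothetical dir_entry layout from the earlier sketch.

```c
#include <stdint.h>

enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };
struct dir_entry { enum dir_state state; uint64_t sharers; };

/* Directory action when processor `requester` suffers a read miss. */
static void directory_read_miss(struct dir_entry *e, int requester)
{
    switch (e->state) {
    case DIR_UNCACHED:
        /* Memory copy is current: send a data value reply to the requester. */
        e->sharers = 1ULL << requester;   /* requester is the only sharer */
        e->state   = DIR_SHARED;
        break;
    case DIR_SHARED:
        /* Memory is up to date: send a data value reply, add the requester. */
        e->sharers |= 1ULL << requester;
        break;
    case DIR_EXCLUSIVE:
        /* Send a fetch to the owner; the owner goes M -> S and writes the
         * block back, then memory replies to the requester. The old owner
         * stays in Sharers since it keeps a readable copy.                */
        e->sharers |= 1ULL << requester;
        e->state    = DIR_SHARED;
        break;
    }
}
```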

Example

Miss Rates for Snooping Protocol


4th C: Conflict, Capacity, Compulsory, and Coherency misses. More processors increase coherency misses while decreasing capacity misses, since there is more cache memory (for a fixed problem size). Cache behavior of five parallel programs: FFT (Fast Fourier Transform): matrix transposition plus computation. LU: factorization of a dense 2D matrix (linear algebra). Barnes-Hut: n-body algorithm solving a galaxy evolution problem. Ocean: simulates the influence of eddy and boundary currents on large-scale flow in the ocean; dynamic arrays per grid.

Miss Rates for Snooping Protocol

Cache size is 64 KB, 2-way set-associative, with 32 B blocks. Misses in these applications are generated by accesses to data that is potentially shared. Except for Ocean, data is heavily shared; in Ocean only the boundaries of the subgrids are shared, though the entire grid is treated as a shared data object. Since the boundaries change as we increase the processor count (for a fixed-size problem), different amounts of the grid become shared. The anomalous increase in miss rate for Ocean in moving from 1 to 2 processors arises because of conflict misses in accessing the subgrids.



% Misses Caused by Coherency Traffic vs. # of Processors


The percentage of cache misses caused by coherency transactions typically rises when a fixed-size problem is run on more processors. The absolute number of coherency misses increases in all these benchmarks, including Ocean. In Ocean, however, it is difficult to separate out these misses from others, since the amount of sharing of the grid varies with processor count. Invalidations increase significantly; in FFT, the miss rate arising from coherency misses increases from nothing to almost 7%.

Miss Rates as Increase Cache Size/Processor

The miss rate drops as the cache size is increased, unless the miss rate is dominated by coherency misses. The block size is 32 B and the cache is 2-way set-associative. The processor count is fixed at 16 processors.

Miss Rate vs. Block Size


Since a cache block holds multiple words, we may get coherency traffic for unrelated variables in the same block. False sharing arises from the use of an invalidation-based coherency algorithm. It occurs when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into.
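A small, illustrative C program showing false sharing and the usual padding fix; the 32-byte block size follows the configurations above, and the layout assumes the array happens to be block-aligned. Two threads each increment their own counter: if both counters sat in one block, every write would invalidate the other CPU's copy even though the threads never touch the same word.

```c
#include <pthread.h>
#include <stdio.h>

#define BLOCK 32   /* cache block size assumed above */

/* One counter per block: padding removes the false sharing. Declaring
 * `volatile long counters[2]` instead would place both counters in one
 * block and make the line ping-pong between the two caches.            */
struct padded { volatile long v; char pad[BLOCK - sizeof(long)]; };
static struct padded counters[2];

static void *worker(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < 10000000; i++)
        counters[id].v++;          /* each thread writes only its own word */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counters[0].v, counters[1].v);
    return 0;
}
```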

% Misses Caused by Coherency Traffic vs. Block Size


FFT communicates data in large blocks and its communication adapts to the block size (it is a parameter to the code); it makes effective use of large blocks. Ocean has competing effects that favor different block sizes. For accesses to the boundary of each subgrid, in one direction the accesses match the array layout, taking advantage of large blocks, while in the other dimension they do not match. These two effects largely cancel each other out, leading to an overall decrease in the coherency misses as well as the capacity misses.

Bus Traffic as Increase Block Size


Bus traffic climbs steadily as the block size is increased. The factor of 3 increase in traffic for Ocean is the best argument against larger block sizes. Remember that our protocol treats ownership misses the same as other misses, slightly increasing the penalty for large cache blocks: in both Ocean and FFT this effect accounts for less than 10% of the traffic.

Miss Rates for Directory


Cache size is 128 KB, 2-way set-associative, with 64 B blocks (to cover longer latency). Ocean: only the boundaries of the subgrids are shared. Since the boundaries change as we increase the processor count (for a fixed-size problem), different amounts of the grid become shared. The increase in miss rate for Ocean in moving from 32 to 64 processors arises because of conflict misses in accessing small subgrids and because of coherency misses for 64 processors.




Miss Rates as Increase Cache Size/Processor for Directory


Miss rate drops as the cache size is increased, unless the miss rate is dominated by coherency misses. The block size is 64B and the cache is 2-way set associative. The processor count is fixed at 16 processors.

Block Size for Directory


Assumes a 128 KB cache and 64 processors. A large cache size is used to combat higher memory latencies than with snoop caches.

Implementing a Directory
We assume operations are atomic, but they are not; reality is much harder, and we must avoid deadlock when we run out of buffers in the network. Optimizations: on a read miss or write miss to a block in the Exclusive state, send the data directly to the requestor from the owner rather than first to memory and then from memory to the requestor.

