Embedded Memory: MPU Versus DRAM Performance
Embedded Memory
Embedded memory is integrated on-chip memory that supports the logic core in accomplishing its intended functions. High-performance embedded memory is a key component in VLSI because of its high speed and wide bus width, which eliminate external inter-chip communication. As the performance gap between microprocessors and DRAMs has widened, chip designers have placed greater emphasis on the development of embedded memory devices.
Advantages of using embedded memories: reduced number of chips, reduced pin count, multi-port memories, smaller board space requirements, faster response with the memory embedded on-chip in a dedicated architecture, memory capacity tailored to the application, reduced power consumption, and greater cost effectiveness at the system level. Disadvantages of embedded memories: the chip is generally larger, the design is more complex to create and manufacture, and processing becomes more complex because the designer integrates different types of memory on the same chip.
Enabling SoC:
In the modern SoC era, memory has become an essential IP requirement for SoC design. High-quality embedded non-volatile memory (eFuse, OTP, MTP, EEPROM and eFlash) can be used for trimming, redundancy, data encryption, ID, coding and programming.
TI's 1-V DSP for Wireless Communications
Major options for on-chip memory implementations:
1. DRAM macrocells
2. SRAM macrocells
3. Flip-flop or latch arrays
Limitations of conventional on-chip memory implementations:
Dynamic random access memory (DRAM): Conventional embedded DRAM uses 1T-1C storage cells. Special technology options are required to build the high-density stacked or trench capacitors. The storage cell size is small, but the process is not compatible with standard logic CMOS technologies.
Multilevel DRAM
Principle: To increase the per-area storage density of conventional 1T-1C DRAM, more than one bit can be stored in a single memory cell. Digital-to-analog and analog-to-digital converters are used for the write and read operations, respectively. The more (fractional) bits are stored per cell, the shorter the data retention time.
Challenges: Invent circuit- and system-level techniques to cope with the degradation of the data retention time due to multilevel operation.
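To make the principle concrete, here is a minimal C sketch of 2 bits per cell: the "DAC" maps a 2-bit value to one of four voltage levels and the "ADC" picks the nearest level on readout. VDD, the level count, and the leakage figure are illustrative assumptions, not device data.

#include <stdio.h>

#define VDD    1.8 /* assumed supply voltage */
#define LEVELS 4   /* 2 bits per cell -> 4 voltage levels */

/* "DAC": map a 2-bit value to one of 4 evenly spaced cell voltages. */
static double write_cell(unsigned bits) {
    return (VDD / (LEVELS - 1)) * (bits & 0x3);
}

/* "ADC": recover the 2-bit value by picking the nearest level. The
 * decision window per level is VDD/3 instead of VDD, which is why
 * leakage erodes the retention time faster than in 1-bit cells. */
static unsigned read_cell(double v) {
    unsigned best = 0;
    double best_err = 2 * VDD;
    for (unsigned b = 0; b < LEVELS; b++) {
        double err = v - write_cell(b);
        if (err < 0) err = -err;
        if (err < best_err) { best_err = err; best = b; }
    }
    return best;
}

int main(void) {
    double leak = 0.25; /* assumed charge lost before refresh */
    for (unsigned b = 0; b < LEVELS; b++)
        printf("wrote %u, read back %u after leakage\n",
               b, read_cell(write_cell(b) - leak));
    return 0;
}

Here 0.25 V of droop still reads back correctly because it stays under half the 0.6 V level spacing; a 1-bit cell would tolerate three times as much, which is the retention trade-off in a nutshell.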
Gain cells
Principle: Amplify the stored charge to obtain a higher sensing current. Gain-cell-based DRAMs can be smaller than SRAM and at the same time logic compatible. Various gain-cell-based DRAMs, with different inherent gain cells, have been proposed to date. Challenges: Inhibit leakage mechanisms and boost data retention time at advanced technology nodes (65 nm, 45 nm, etc.).
Fault-tolerant memories: Principle: Allow a well-controlled amount of bit errors in the memory if it is to be used in error-correcting systems, thereby saving area and power in the memory. Applications: turbo decoder, low-density parity check (LDPC) decoder, and many more. Challenges: Express the bit-error probability in terms of statistics on circuit parameters such as Vth, W, and L for different PVT corners.
Scratchpad Memories
Designers of embedded systems strive to improve performance and reduce energy consumption. Low-energy systems extend battery life, reduce cooling costs, and decrease weight. Shutting down parts of the processor, voltage scaling, specialized instructions, feature-size reduction, and additional cache levels are some of the techniques used to reduce energy in embedded systems. Applying profile knowledge to a system with a scratchpad memory (SPM) instead of an instruction cache has been found useful for reducing power.
An SPM, made of SRAM cells, is a memory array consisting of only decoding and column circuitry logic. The typical instruction memory hierarchy consists of a main memory and an SPM. The content of the SPM is updated through the bus connecting the DRAM to the SPM.
SPM Configuration
In the SPM configuration, it is possible for the CPU to access the DRAM directly, unlike in the cache architecture, where the CPU has access to DRAM only through the cache. In the case of an SPM, the compiled code is initially loaded into the DRAM; when a repeatedly utilized segment is encountered, that segment is copied to the SPM and the program is executed from the SPM. Portions which are rarely used are executed from the DRAM, reducing copying costs.
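A minimal sketch of that placement decision in C. The SPM size, the spm_place helper, and the profiling flag are hypothetical; real systems do this with linker sections or DMA, but the copy-hot/leave-cold logic is the same.

#include <stdint.h>
#include <string.h>

#define SPM_SIZE 4096u

/* Stands in for the on-chip SPM array; real hardware would map this
 * at a fixed address (e.g. a hypothetical 0x20000000). */
static uint8_t spm[SPM_SIZE];
static size_t spm_used;

/* Copy a profiled-hot object from DRAM into the scratchpad and return
 * the pointer the program should use from now on. Rarely used objects
 * keep their DRAM pointer, avoiding the copy cost. */
static void *spm_place(void *dram_obj, size_t len, int is_hot) {
    if (!is_hot || spm_used + len > SPM_SIZE)
        return dram_obj;            /* keep using it from DRAM */
    void *p = &spm[spm_used];
    memcpy(p, dram_obj, len);       /* bus transfer DRAM -> SPM */
    spm_used += len;
    return p;
}

int main(void) {
    static int table[256];          /* imagine the profiler marked this hot */
    int *fast = spm_place(table, sizeof table, 1);
    (void)fast;                     /* subsequent accesses hit the SPM */
    return 0;
}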
Cache Configuration
In the cache configuration, by contrast, the CPU has access to DRAM only through the cache; it cannot bypass the cache to reach DRAM directly.
Coherent Caches
Organization
The bus is usually a simple physical connection (wires). Bus bandwidth limits the number of CPUs. There could be multiple memory elements. For now, assume that each CPU has only a single level of cache.
Bus Snooping Scheme
Snooping Protocols
Write Invalidate: A CPU wanting to write to an address grabs a bus cycle and sends a write-invalidate message. All snooping caches invalidate their copy of the appropriate cache line. The CPU writes to its cached copy (assume for now that it also writes through to memory). Any shared read in other CPUs will now miss in the cache and re-fetch the new data.
Write Update: A CPU wanting to write grabs a bus cycle and broadcasts the new data as it updates its own copy. All snooping caches update their copy.
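A toy C model of the invalidate variant, assuming write-through as in the text (the array sizes and function names are illustrative).

#include <stdio.h>

#define NCPUS  4
#define NLINES 8

/* 1 = this CPU holds a valid copy of the line, 0 = invalid. */
static int valid[NCPUS][NLINES];
static int memory[NLINES];

/* Write-invalidate: the writer grabs the bus, every snooping cache
 * invalidates its copy, then the writer updates its copy and (here)
 * writes through to memory. */
static void cpu_write(int cpu, int line, int value) {
    for (int c = 0; c < NCPUS; c++)      /* broadcast invalidate */
        if (c != cpu) valid[c][line] = 0;
    valid[cpu][line] = 1;
    memory[line] = value;                /* write-through */
}

/* A later read on another CPU misses and re-fetches the new data. */
static int cpu_read(int cpu, int line) {
    if (!valid[cpu][line]) valid[cpu][line] = 1; /* miss: re-fetch */
    return memory[line];
}

int main(void) {
    cpu_write(0, 3, 42);
    printf("CPU1 reads %d\n", cpu_read(1, 3)); /* misses, gets 42 */
    return 0;
}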
A scheme where every CPU knows who has a copy of its cached data is far too complex. So each CPU (cache system) snoops (i.e., watches continually) for write activity concerned with data addresses which it has cached. This assumes a bus structure which is global, i.e., all communication can be seen by all. A more scalable solution: directory-based coherence schemes.
Note that in both schemes, the problem of simultaneous writes is taken care of by bus arbitration: only one CPU can use the bus at any one time.
Update or Invalidate?
Due to both spatial and temporal locality, the cases below occur often. Bus bandwidth is a precious commodity in shared-memory multiprocessors, and experience has shown that invalidate protocols use significantly less bandwidth. We will consider implementation details only of invalidate.
Update or Invalidate?
Update looks the simplest, most obvious, and fastest, but the invalidate scheme is usually implemented with write-back caches, and in that case: multiple writes to the same word (with no intervening read) need only one invalidate message but would require an update for each; writes to the same (usual) multi-word cache block require only one invalidate but would require multiple updates.
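The bandwidth argument in miniature, as a sketch (the counts ignore message sizes and assume no intervening reads).

#include <stdio.h>

int main(void) {
    int writes = 10;      /* repeated writes to one word             */
    int block_words = 8;  /* words per cache block, one write each   */

    /* Invalidate: one message per newly written block, however many
     * writes follow. Update: one broadcast per write. */
    printf("same word : %d invalidate vs %d updates\n", 1, writes);
    printf("same block: %d invalidate vs %d updates\n", 1, block_words);
    return 0;
}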
Implementation Issues
In both schemes, knowing that a cached value is not shared (no copy in another cache) can avoid sending any messages. The invalidate description assumed that a cache value update was written through to memory. If we used a copy-back (write-back) scheme, other processors could re-fetch the old value on a cache miss. We need a protocol to handle all this.
Basic Protocol
Illinois Protocol
Private Dirty, Private Clean, Shared, Invalid
MESI Protocol
Modified (private, != memory), eXclusive (private, = memory), Shared (shared, = memory), Invalid
If a read is sourced from memory, the line becomes Private Clean; if a read is sourced from another cache, it becomes Shared. A write can complete in the cache if the line is held private, clean or dirty.
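Expressed as code, the states and the two rules above might look like this C sketch (names are illustrative).

#include <stdbool.h>

typedef enum {
    M, /* Modified : private, differs from memory */
    E, /* Exclusive: private, same as memory      */
    S, /* Shared   : shared,  same as memory      */
    I  /* Invalid                                 */
} mesi_t;

/* State chosen when a read miss is filled. */
static mesi_t fill_state(bool sourced_from_other_cache) {
    return sourced_from_other_cache ? S : E; /* E = "private clean" */
}

/* A write may proceed without a bus transaction only if the line is
 * held private (clean or dirty). */
static bool can_write_silently(mesi_t s) {
    return s == M || s == E;
}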
MESI
Assumptions:
Data ownership model: only one cache can have dirty data. Writes are typically not broadcast (i.e., data is not transmitted); they cause data invalidation in other caches.
Bus activity
Master: intend to cache (inquiry cycle). Slave: I have a copy (Hit signal); I have a modified copy (HitM signal).
Several caches have an S copy: the processor makes a bus request to memory; one cache puts its copy's value on the bus (arbitrated); the memory access is abandoned; the local processor caches the value; the local copy is set to S; the other copies remain S.
One cache has an M copy: the processor makes a bus request to memory; the snooping cache puts its copy's value on the bus; the memory access is abandoned; the local processor caches the value; the local copy is tagged S; the source (M) value is copied back to memory; the source value goes M -> S.
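A C sketch of the snoop side of these two read-miss cases; the function shape and the Hit/HitM signals as flags are assumptions.

typedef enum { M, E, S, I } mesi_t;

/* Snooper side: another processor's read request appears on the bus.
 * Returns 1 if this cache supplies the data (the memory access is
 * then abandoned). */
static int snoop_read(mesi_t *state, int line_value,
                      int *hitm, int *bus_data) {
    switch (*state) {
    case S:                     /* clean copy: arbitration picks one  */
        *bus_data = line_value; /* such cache to drive the bus;       */
        return 1;               /* all S copies simply remain S       */
    case M:
        *hitm = 1;              /* assert HitM                        */
        *bus_data = line_value; /* supply the value; it is also       */
        *state = S;             /* copied back to memory, and M -> S  */
        return 1;
    default:                    /* I: nothing to do (an E copy would  */
        return 0;               /* analogously go E -> S)             */
    }
}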
The detailed action depends on copies in other processors. No other copies: the value is read from memory to the local cache, the value is updated, and the local copy state is set to M. Other copies, either one in state E or more in state S: the value is read from memory to the local cache with the bus transaction marked RWITM (Read With Intent To Modify); snooping processors see this and set their copy state to I; the local copy is updated and its state set to M.
Another copy in state M: the processor issues a bus transaction marked RWITM; the snooping processor sees this, blocks the RWITM request, takes control of the bus, writes back its copy to memory, and sets its copy state to I. The original local processor then re-issues the RWITM request, which is now the simple no-copy case: the value is read from memory to the local cache, the local copy value is updated, and the local copy state is set to M.
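The write-miss path in the same sketch style; snoop_rwitm and finish_write_miss are hypothetical names, and the blocking/retry is compressed into a return code.

typedef enum { M, E, S, I } mesi_t;

/* Snooper side: react to a Read-With-Intent-To-Modify on the bus.
 * Returns 1 if the request is blocked and must be retried. */
static int snoop_rwitm(mesi_t *state, int line_value, int *mem_word) {
    switch (*state) {
    case M:                     /* the only up-to-date copy is here:  */
        *mem_word = line_value; /* take the bus, write it back,       */
        *state = I;             /* invalidate, and force a retry      */
        return 1;
    case E:
    case S:
        *state = I;             /* just invalidate the local copy     */
        return 0;
    default:
        return 0;
    }
}

/* Requester side, once no snooper blocks the RWITM any more; the two
 * stores mirror the slide's fill-then-update steps. */
static void finish_write_miss(mesi_t *state, int *line,
                              int mem_word, int new_value) {
    *line = mem_word;   /* value read from memory to local cache */
    *line = new_value;  /* local copy updated...                 */
    *state = M;         /* ...and its state set to M             */
}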
Cache line state | This line is valid? | The memory copy is... | Copies exist in caches of other processors? | A write to this line...
M (Modified)     | Yes                 | out of date           | No                                          | does not go to bus
E (Exclusive)    | Yes                 | valid                 | No                                          | does not go to bus
S (Shared)       | Yes                 | valid                 | Maybe                                       | goes to bus and updates cache
I (Invalid)      | No                  | -                     | Maybe                                       | goes directly to bus
L1/L2 communication
Directory Schemes
Snoopy schemes do not scale because they rely on broadcast. Directory-based schemes allow scaling: they avoid broadcasts by keeping track of all PEs caching a memory block and then using point-to-point messages to maintain coherence, and they allow the flexibility to use any scalable point-to-point network.
Larger MPs
Separate memory per processor; local or remote access via the memory controller. One cache-coherency solution: non-cached pages. Alternative: a directory per cache that tracks the state of every block in every cache (which caches have copies of the block, dirty vs. clean, and so on). Info per memory block vs. per cache block? PLUS: in memory => simpler protocol (centralized, one location). MINUS: in memory => directory size scales with memory size rather than cache size. To prevent the directory becoming a bottleneck, distribute directory entries with the memory, each keeping track of which processors have copies of their blocks.
Interconnection Network
Directory Protocol
Similar to the snoopy protocol: three states. Shared: >= 1 processors have the data, memory up-to-date. Uncached: no processor has it; not valid in any cache. Exclusive: 1 processor (the owner) has the data; memory out-of-date. In addition to the cache state, we must track which processors have data when in the shared state (usually a bit vector, 1 if the processor has a copy). Keep it simple(r): writes to non-exclusive data => write miss; the processor blocks until the access completes; assume messages are received and acted upon in the order sent.
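A sketch of a directory entry and the two simple miss handlers in C, with the sharer set as the usual bit vector (field and function names are illustrative).

#include <stdint.h>

typedef enum { UNCACHED, SHARED, EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t sharers; /* bit i set => processor i has a copy */
} dir_entry_t;

/* Home node: a read miss from processor p, block not Exclusive.
 * (If EXCLUSIVE, the owner must first be asked to supply and write
 * back the data; see the Exclusive cases later.) */
static void dir_read_miss(dir_entry_t *d, int p) {
    d->state = SHARED;
    d->sharers |= 1ull << p;
}

/* Home node: a write miss from processor p. All other sharers are
 * sent invalidates (point-to-point, no broadcast). */
static void dir_write_miss(dir_entry_t *d, int p) {
    d->state = EXCLUSIVE;
    d->sharers = 1ull << p; /* p is now the sole owner */
}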
Directory Protocol
No bus and we don't want to broadcast: the interconnect is no longer a single arbitration point, and all messages have explicit responses. Terms: the local node is the node where a request originates; the home node is the node where the memory location of an address resides; the remote node is a node that has a copy of a cache block, whether exclusive or shared. In the example messages, P = processor number, A = address.
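The message table itself did not survive in these notes; the set below is the customary one for this style of directory protocol, written as a hypothetical C encoding.

typedef enum {
    READ_MISS,        /* (P, A) local -> home: P wants to read A       */
    WRITE_MISS,       /* (P, A) local -> home: P wants to write A      */
    INVALIDATE,       /* (A)    home -> remote: drop your shared copy  */
    FETCH,            /* (A)    home -> remote owner: send data back   */
    FETCH_INVALIDATE, /* (A)    home -> remote owner: send it and drop */
    DATA_REPLY,       /* (Data) home -> local: here is the block       */
    DATA_WRITE_BACK   /* (A, Data) remote -> home: dirty block back    */
} msg_type_t;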
State Transition Diagram for an Individual Cache Block in a Directory Based System
The states are identical to the snoopy case; the transactions are very similar. Transitions are caused by read misses, write misses, invalidates, and data fetch requests. A cache generates read-miss and write-miss messages to the home directory. Write misses that were broadcast on the bus become explicit invalidate and data-fetch requests.
Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:
Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
Data write-back: the owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block is now uncached, and the Sharers set is empty.
Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
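The same three Exclusive-state cases as a C sketch, with the message sends abstracted into comments (names are illustrative).

#include <stdint.h>

typedef enum { UNCACHED, SHARED, EXCLUSIVE } dir_state_t;
typedef struct { dir_state_t state; uint64_t sharers; } dir_entry_t;

/* Read miss while EXCLUSIVE: fetch from the owner, memory is updated,
 * the requester joins the sharer set, the owner keeps a readable copy. */
static void excl_read_miss(dir_entry_t *d, int requester) {
    /* send FETCH to the owner; owner goes M -> Shared, returns data */
    d->state = SHARED;
    d->sharers |= 1ull << requester; /* owner's bit stays set */
}

/* Data write-back: the owner evicts the dirty block; memory becomes
 * the owner and nobody caches the block. */
static void excl_write_back(dir_entry_t *d) {
    d->state = UNCACHED;
    d->sharers = 0;
}

/* Write miss while EXCLUSIVE: ownership moves to the requester. */
static void excl_write_miss(dir_entry_t *d, int requester) {
    /* send a fetch/invalidate to the old owner; the data is relayed
     * through the directory to the requester */
    d->state = EXCLUSIVE;
    d->sharers = 1ull << requester;
}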
Example
The cache size is 64KB, 2-way set associative, with 32B blocks. Misses in these applications are generated by accesses to data that is potentially shared. Except for Ocean, data is heavily shared; in Ocean only the boundaries of the subgrids are shared, though the entire grid is treated as a shared data object. Since the boundaries change as we increase the processor count (for a fixed-size problem), different amounts of the grid become shared. The anomalous increase in miss rate for Ocean in moving from 1 to 2 processors arises because of conflict misses in accessing the subgrids.
The miss rate drops as the cache size is increased, unless the miss rate is dominated by coherency misses. The block size is 32B and the cache is 2-way set-associative. The processor count is fixed at 16 processors.
Bus traffic climbs as the block size is increased. The factor-of-3 increase in traffic for Ocean is the best argument against larger block sizes. Remember that our protocol treats ownership misses the same as other misses, slightly increasing the penalty for large cache blocks: in both Ocean and FFT this effect accounts for less than 10% of the traffic.
Implementing a Directory
We have assumed operations are atomic, but they are not; reality is much harder, and we must avoid deadlock when the network runs out of buffers. Optimization: on a read miss or write miss to a block in Exclusive, send the data directly to the requestor from the owner, rather than first to memory and then from memory to the requestor.
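A sketch contrasting the two data paths for that optimization; send() merely prints the hops it stands in for.

#include <stdio.h>

static void send(const char *from, const char *to, const char *what) {
    printf("%s -> %s : %s\n", from, to, what);
}

int main(void) {
    /* Baseline path for a read miss on an Exclusive block: */
    send("home", "owner", "fetch");
    send("owner", "home", "data (write back)");
    send("home", "requester", "data");   /* data crosses the net twice */

    /* Optimized: home forwards the request, owner replies directly: */
    send("home", "owner", "fetch, forward to requester");
    send("owner", "requester", "data");  /* one hop on the critical path */
    send("owner", "home", "data (write back, off critical path)");
    return 0;
}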