
HIGH PERFORMANCE AND LOW POWER HYBRID CACHE ARCHITECTURES FOR CMPs

Seminar Guide: Sunith C K

Presented by: Anoop Thomas (Reg No: 98911037), VLSI & Embedded Systems

MTech VLSI & ES

Introduction
A multi-core processor is a single component with two or more independent processor cores. When the cores are integrated onto a single integrated-circuit die, the result is a Chip Multiprocessor (CMP).


Need For Cache Memory


Cache memories minimize the performance gap between high-speed processors and slow off-chip memory. On-chip cache subsystems with multiple levels of large caches are common in CMPs.


The processor-memory performance gap. [web]


Current Schemes
Performance can be improved through Non-Uniform Cache Architecture (NUCA): a large cache is divided into multiple banks with different access latencies determined by their physical distance from the source of the request.
Static NUCA: a cache line is statically mapped into banks, with low-order bits of the index determining the bank.
Dynamic NUCA: any given line can be mapped to one of several banks based on the mapping policy.
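To make the static mapping concrete, here is a minimal sketch (bank count, line size, and address layout are illustrative assumptions, not taken from the slides) of how low-order index bits select a bank:

```python
def snuca_bank(addr: int, line_bytes: int = 64, num_banks: int = 16) -> int:
    """Static NUCA: a line's bank is fixed by low-order bits of its index."""
    line_index = addr // line_bytes   # drop the block-offset bits
    return line_index % num_banks     # low-order index bits select the bank

# Consecutive cache lines stripe across the banks.
```

Because the mapping is fixed, a hot line can never move to a closer bank; that is exactly what Dynamic NUCA relaxes.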


NUCA falls short for large caches


NUCA only exploits the varied access latency of cache banks, caused by their physical location, to improve performance. All cache banks use the same size, process, and circuit technology. The overall cache capacity available is therefore fixed for a given memory technology.


Comparison of different memory technologies. [4]
No single memory technology by itself is efficient.


Memory Hierarchy generally used. [web]


Hybrid Cache Memory Architectures


A cache designed using differing memory technologies performs better than one built from a single technology. Hybrid Cache Architectures (HCAs) allow the levels of a cache hierarchy to be constructed from different memory technologies.


Inter-cache HCA (LHCA): the levels in a cache hierarchy can be made of disparate memory technologies.
Region-based intra-cache HCA (RHCA): a single level of cache can be partitioned into multiple regions, each using a different memory technology.
STT-RAM combined with SRAM can form a hybrid cache architecture for chip multiprocessors with low power consumption and high performance.


Overview of LHCA and RHCA [4]


STT-RAM based HCA


STT-RAM is non-volatile. Its read speed is comparable to that of SRAM (depending on the design), and its density is higher than SRAM's. The disadvantages are long write latency and high dynamic write power consumption.


Background of STT-RAM

A conceptual view of MTJ. [4]

The information carrier inside MRAM is the Magnetic Tunnel Junction (MTJ).



An illustration of an MRAM cell. [4]


The MTJ is the storage element and an NMOS transistor acts as the access controller; the two are connected in series.

Write operation
>A positive voltage difference between the bit line (BL) and source line (SL) writes a 0.
>A negative voltage difference writes a 1.

Read operation
The NMOS is enabled and a voltage (Vbl - Vsl), usually negative and small, is applied between BL and SL.
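A toy encoding of the write polarity described above (the function name and string labels are ours, not from the paper):

```python
def mtj_write_polarity(bit: int) -> str:
    """Voltage relation between bit line (BL) and source line (SL) that
    writes the given bit: a positive BL-SL difference writes 0, a
    negative difference writes 1."""
    if bit not in (0, 1):
        raise ValueError("an MTJ cell stores a single bit")
    return "V_BL > V_SL" if bit == 0 else "V_BL < V_SL"
```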

STT-RAM alone is not used as cache memory


A large number of writes to the last-level cache (LLC) occurs in most CMP applications. Due to STT-RAM's long write latency and very high dynamic write power consumption, using it alone is not advisable.


Hybrid Cache Architecture using STT-RAM


STT-RAM and SRAM can be used together to form an HCA. STT-RAM offers low leakage power and high density. With smart cache management policies, low power consumption and high performance can be obtained simultaneously.

Basic Architecture
Each core is configured with private L1 instruction and data caches. The L2 is the LLC, consisting of multiple cache banks connected through an interconnection network.


Each bank is either an STT-RAM or an SRAM bank. SRAM banks are shared by all cores, while STT-RAM banks are logically divided into groups that are privately allocated to cores. The shared SRAM banks are organized as a DNUCA.


Hybrid Cache Architecture. [2]


The 24 STT-RAM banks are logically divided into 8 groups, each consisting of 3 STT-RAM banks. Each core is privately allocated one logical STT-RAM group, and most of that core's requests will be served by this group. SRAM is included to make write operations more efficient; the SRAM banks are shared by all on-chip cores.
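The bank-to-group allocation above can be sketched as follows (the identifiers are ours; only the counts come from the slides):

```python
NUM_CORES = 8
BANKS_PER_GROUP = 3
NUM_STT_BANKS = NUM_CORES * BANKS_PER_GROUP   # 24 STT-RAM banks in total

def private_group(core_id: int) -> list:
    """Bank IDs of the logical STT-RAM group privately allocated to a core."""
    start = core_id * BANKS_PER_GROUP
    return list(range(start, start + BANKS_PER_GROUP))

# e.g. core 0 owns banks [0, 1, 2]; core 7 owns banks [21, 22, 23].
```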


Cache bank structure of hybrid cache. [2]

Each hybrid LLC bank is implemented with 4 sub-banks. Each STT-RAM sub-bank is configured with a sub-bank write buffer to speed up long-latency write operations.

Micro-Architectural Mechanisms


Private STT-RAM groups are used to reduce high-power remote block accesses. However, for a core running memory-intensive workloads, the private STT-RAM group may not accommodate the large working set.


Neighborhood Group Caching


Neighboring cores share their private STT-RAM groups with each other on top of the HCA. For example, core 1 can share its STT-RAM banks with its one-hop neighbors, cores 0, 2, and 5. Neighborhood sharing achieves a more balanced trade-off in capacity and access latency between purely private and purely shared schemes.
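The one-hop neighborhood in the example can be reproduced by assuming the 8 cores are numbered row-major on a 4x2 mesh (the mesh shape is our assumption; it matches the core 1 example in the text):

```python
COLS, ROWS = 4, 2   # 8 cores on a 4x2 mesh, numbered row-major

def one_hop_neighbors(core_id: int) -> list:
    """Cores whose private STT-RAM groups a core may also use under NGC."""
    r, c = divmod(core_id, COLS)
    neighbors = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # up, down, left, right
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS:
            neighbors.append(nr * COLS + nc)
    return sorted(neighbors)

# Core 1 shares with its one-hop neighbors: cores 0, 2 and 5.
```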


NGC (Neighborhood Group Caching) is scalable to future CMPs by carefully defining the neighborhood. The energy-aware read and write policies help the HCA optimize power consumption without sacrificing performance. A flow graph of the whole micro-architectural mechanism is shown on the next slide.


Flow graph of proposed micro-architecture mechanisms. [2]


Energy-Aware Write
On a write miss, the target block is loaded from lower-level memory and placed in an SRAM bank. Write hits to SRAM are served directly by the corresponding SRAM bank. Write hits to STT-RAM banks are served by the block-swapping mechanism.

Energy-Aware Read
On a read miss, the target block is fetched from lower-level memory and placed in the local STT-RAM group. Read hits on STT-RAM are served directly by the local group or by a neighboring group. On read hits to an SRAM bank, active block migration is used to serve the request.
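The write policy on the previous slide and the read policy above amount to a six-entry dispatch table; as a sketch (the action strings are paraphrases of the policy, not identifiers from the paper):

```python
def llc_action(op: str, location: str) -> str:
    """Energy-aware LLC dispatch: op is 'read' or 'write'; location is
    'miss', 'sram', or 'sttram' (where the block currently hits)."""
    policy = {
        ('write', 'miss'):   'load from memory into an SRAM bank',
        ('write', 'sram'):   'serve directly from the SRAM bank',
        ('write', 'sttram'): 'serve via the block-swapping mechanism',
        ('read',  'miss'):   'load from memory into the local STT-RAM group',
        ('read',  'sttram'): 'serve from the local or a neighboring group',
        ('read',  'sram'):   'serve via active block migration',
    }
    return policy[(op, location)]
```

Note how writes are steered toward SRAM and reads toward STT-RAM, matching each technology's strength.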

Block Swapping
Cache lines with intensive write operations are migrated from STT-RAM to SRAM. The migration causes an original line in SRAM to be replaced. If the replaced SRAM line is valid, the two lines in SRAM and STT-RAM are swapped; future accesses to this line will then hit in STT-RAM, which reduces long-latency accesses to lower-level memory. An invalid line is directly written back to memory.


State transitions of block swapping [2]

Swapping is activated when a block in STT-RAM is accessed by two consecutive writes or is accumulatively accessed by three writes. Each cache line is extended with a 2-bit swapping counter and a 1-bit cross-access counter to control data swapping between STT-RAM and SRAM.

Once a block is loaded into STT-RAM, both counters are set to zero. A block swap occurs when the cross-access counter is 0 and the swapping counter reaches 10 (binary), or when the cross-access counter is 1 and the swapping counter reaches 11. When a read occurs while the swapping counter is 01, the cross-access counter is set to 1 to indicate that the block is accessed by both read and write operations.
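The two counters can be modelled as a small state machine (a behavioural sketch, not RTL; the class and method names are ours):

```python
class SwapCounters:
    """Per-line 2-bit swapping counter plus 1-bit cross-access bit,
    following the transitions described above."""

    def __init__(self):
        self.swap = 0    # 2-bit counter: writes seen on this STT-RAM block
        self.cross = 0   # 1-bit flag: block has seen mixed read/write traffic

    def on_write(self) -> bool:
        """Count a write; return True when the block should swap to SRAM."""
        self.swap += 1
        if self.cross == 0 and self.swap == 2:   # two consecutive writes
            return True
        if self.cross == 1 and self.swap == 3:   # three writes overall
            return True
        return False

    def on_read(self) -> None:
        if self.swap == 1:   # swapping counter at 01: mark mixed access
            self.cross = 1
```

Two back-to-back writes trigger a swap immediately; a read in between raises the threshold to three writes overall.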


Active Block Migration


On a read hit in SRAM, the cache line is migrated from SRAM to STT-RAM. Blocks in SRAM are of two types:
>Blocks fetched from lower-level memory.
>Blocks swapped in from STT-RAM banks.
The cross-access counter is used to differentiate these blocks:
>0 for blocks fetched from lower-level memory.
>1 for blocks swapped in from STT-RAM.


State transitions of Active Block Migration [4]

ACTIVE MIGRATION: a block fetched from lower-level memory is migrated into STT-RAM when a read request hits on it. LAZY ACTIVE MIGRATION: a block swapped in from STT-RAM is migrated back into STT-RAM when it has accumulatively been read twice more than it has been written.
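A sketch of the migration decision (reading "read twice more than written" as reads exceeding writes by at least two — an assumption, since the slide's wording is ambiguous):

```python
def should_migrate(cross_bit: int, reads: int, writes: int) -> bool:
    """Decide SRAM -> STT-RAM migration on a read hit.
    cross_bit 0: block was fetched from lower-level memory (active migration).
    cross_bit 1: block was swapped out of STT-RAM (lazy active migration)."""
    if cross_bit == 0:
        return reads >= 1              # active: migrate on the first read hit
    return reads >= writes + 2         # lazy: migrate only read-dominated blocks
```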

Results and Analysis

Main simulation parameters considered. [4]


POWER ANALYSIS
The main power components in STT-RAM are dynamic power and the leakage power of the peripheral circuits. Using STT-RAM together with the low-power block-swapping mechanism, the proposed scheme consumes less power than conventional SNUCA and DNUCA.


Power Comparison normalized by SNUCA. [2]


PERFORMANCE ANALYSIS
The performance of the hybrid scheme is better than conventional SNUCA and DNUCA. Block replication produces large numbers of low-latency local hits in the private STT-RAM groups, and hence IPC is improved. Due to the high density of STT-RAM and the capacity efficiency of the NGC scheme, the hybrid scheme eliminates many long-latency on-chip remote accesses and off-chip accesses during execution.


Average IPC comparison normalized by SNUCA. [2]


Conclusion
HCA greatly reduces power and increases performance compared to conventional SRAM-only on-chip cache technology. By combining various memory technologies, it is possible to construct a cache system with better performance. With the help of the proposed micro-architectural mechanisms, the hybrid scheme adapts to variations in workloads.


References
[1] Frank Vahid and Tony D. Givargis, Embedded System Design: A Unified Hardware/Software Introduction.
[2] Jianhua Li, Chun Jason Xue, and Yinlong Xu, "STT-RAM based energy-efficiency hybrid cache for CMPs," in VLSI and System-on-Chip (VLSI-SoC), 2011 IEEE/IFIP 19th International Conference, pages 31-36, 2011.
[3] M. Hosomi, H. Yamagishi, T. Yamamoto, K. Bessho, Y. Higo, K. Yamane, H. Yamada, M. Shoji, H. Hachino, C. Fukumoto, H. Nagao, and H. Kano, "A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM," in IEEE International Electron Devices Meeting (IEDM), pages 459-462, 2005.
[4] Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ramakrishnan Rajamony, and Yuan Xie, "Hybrid cache architecture with disparate memory technologies," in ISCA 2009: Proceedings of the 36th Annual International Symposium on Computer Architecture. Online. Available: isca09.cs.columbia.edu/pres/04.pdf


References contd.


[5] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked MRAM L2 cache for CMPs," in IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 239-249, 2009.
[6] Video lecture on Digital Computer Organization, Lec-18: Cache Memory Architecture. Online. Available: http://nptel.iitm.ac.in/video.php?subjectId=117105078


THANK YOU
