Abstract— Due to the evolution of the issues and demands that have developed with intelligent systems, higher performance has become a crucial need as application size and complexity grow daily. Currently, most high-performance computing systems are multi-core, and a great deal of research has been devoted to minimizing execution time efficiently. Cache performance has a strong impact on program execution, especially on modern intelligent processing systems; therefore, cache replacement policies in set-associative caches have been investigated in great depth. We propose an approach that considers information derived from the coherence state of the cache block in a time-interleaved manner. Our simulations on high-performance applications from the PARSEC2.1 and SPLASH-2 benchmark suites show that our approach achieves a 10% lower miss rate and up to 10% more instructions per cycle than the LRU, MRU and Random replacement policies, without additional power consumption.

Keywords— data-intensive application, cache replacement policy, coherence state, memory management.

I. INTRODUCTION

Intelligent systems have made revolutionary scientific discoveries, fed game-changing innovations, and improved the quality of life for billions of people around the globe. Through the progression of needs and demands, High-Performance Computing (HPC) has become not only the foundation for scientific, industrial, and societal advancements but also a vital service [1]. As technologies like the Internet of Things (IoT) [2], artificial intelligence (AI) [3], 3-D imaging [4] and traffic management [5] evolve, the size and amount of data that systems have to work with are growing exponentially. For many purposes, the ability to process this huge amount of data in an intelligent, high-performance, real-time manner is crucial, and one of the basic building blocks of such systems is the multi-core processor.

Toward high-performance computers, a range of parallel structures has been investigated over the past years. In [6], a general structure was proposed for massively parallel processing systems. The proposed structure is expandable both vertically and horizontally and incorporates many previous computer designs. Queuing theory and the Jackson queuing network are applied to build an analytical model of the proposed structure. Computing systems now execute tasks in a truly multitasking fashion. Such systems share the reconfigurable device and the processing unit as computing resources, which leads to highly dynamic allocation situations. [7] and [8] propose heuristic approaches that concentrate on developing a strong link between partitioning, scheduling and placement, and considerable improvements in the overall execution time of the tasks have been achieved. Even at their best configuration, high-performance systems endure overheads and penalties that deteriorate overall performance and execution time; hence, an approach toward managing and mitigating them is highly required [9]. Cache performance plays an essential role not only in processing systems used for a wide range of computationally intensive tasks in various domains, but also in reducing the delay of memory accesses by providing a fast-access storage unit for the data currently being accessed by the CPU. On a cache miss, data is fetched from memory and placed in its corresponding location in the cache. An effective cache replacement policy can substantially reduce the cache miss rate and thus decrease the chance of paying a penalty of hundreds of cycles on memory accesses. The goal is to reduce the number of requests to the slower memory and hence the memory access latency. A cache hierarchy is used in various applications such as reconfigurable systems, data processing units, computing servers, databases and CPUs, to mention but a few.

Current computing systems are parallel in nature, comprising several processors or processing units that share a single memory address space. Thus, smart designs involve intelligent algorithms aiming at better execution time, performance and power consumption [10-11]. For performance and power efficiency, these multi-core systems include caches, raising a new concern: ensuring cache coherence. This leads to a new category of conflicts, coherence misses, due to a block having been evicted as a consequence of a coherence action; moreover, this eviction may interfere with the local way-replacement policy.

Cache misses can be broadly categorised into three classes [12]:

• Compulsory miss: when the block has to be read from main memory whatever the cache design is; also known as cold start misses or first-reference misses.
2020 6th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS)
• Capacity miss: when it is impossible to hold all the concurrently used data in the cache, because the cache is small compared to the working-set size.

• Conflict miss: when multiple addresses map to the same set and evict blocks that are still in use; even though not all cache blocks may be occupied, the mapping function cannot redirect a memory address to an empty block in a different set.

Trade-offs can be accomplished by adjusting the cache parameters. Increasing cache capacity, for instance, can reduce conflict and capacity misses; increasing the associativity decreases conflict misses; and increasing the block size can reduce compulsory misses thanks to spatial locality, while increasing conflict misses (for a fixed-size cache).

Recent trends in modern processor design have made adequate management of cache resources more indispensable than ever before. As the core count and memory demands of applications rise, discerning cache architectures and management policies have become vital to meet the challenges of area and power limitations. Researchers have recently investigated cache designs that offer distinct advantages and disadvantages.

Set-associative caches were introduced half a century ago [13], and way-replacement policies, i.e. choosing which way should be evicted, have been studied ever since. The earliest studies were conducted in [14] and [15], which published surveys and analyses of replacement algorithms. Since then, new replacement policies have been devised and many optimizations proposed to fit different hardware-related constraints, such as area, power and number of ways.

When a conflict occurs in a direct-mapped cache, as each address maps to a unique block and set, the only way to deal with a full cache set is to replace the block in the set with the new data. In an N-way set-associative or fully associative cache, a block must be chosen for eviction when such a conflict occurs and the set is full. The temporal locality principle expressed earlier suggests that the most appropriate block to evict is the Least Recently Used (LRU) one, as it is the least likely to be used again in the near future.

This replacement scheme can be implemented by adding a use bit next to the validity bit, indicating whether the block in the corresponding way was the most recently accessed. When one of the ways is used, this bit is updated, and a way whose bit is low is the one replaced when needed. For a two-way cache, the least recently used block is thus exactly the one replaced. In caches with a higher degree of associativity, a random block having the use bit low is replaced. Such a policy is called pseudo-LRU and is often good enough in practice.

In this paper, our aim is not to introduce a new way-replacement algorithm, but to examine whether information relative to cache coherence can be used to enhance the performance of existing way-replacement policies. To that purpose, we propose an approach that takes the coherence state into account, and we evaluate our proposal on high-performance parallel applications from the SPLASH-2 [16] and PARSEC2.1 [17] benchmarks using the Sniper simulator [18].

The contributions are the following:

• Elements of our replacement strategy based on the current coherence state of cache blocks,

• Analysis of our proposal when combined with prominent replacement policies (LRU, MRU and Random),

• Experimental evidence that such an approach should be considered in the next generation of architectures that rely on cache coherence.

The remainder of the paper is organized as follows. Section II presents related work. Our proposal for considering the cache coherence state in way-replacement is described in Section III. Section IV presents simulation results obtained on a state-of-the-art simulator. Finally, Section V discusses the results and concludes the paper.

II. RELATED WORKS

In multi-core systems, accesses from multiple cores compete for shared cache capacity. Inadequate management of the shared cache can deteriorate system performance and result in an unfair allocation of resources, because one badly behaved application can degrade the performance of all other applications sharing the cache. For example, a workload with streaming memory accesses can evict useful data belonging to other recency-friendly applications. Shared caches in multi-core processors also introduce serious difficulties in providing guarantees on the real-time properties of embedded software, due to the interaction and the resulting contention in the shared caches. Hence, recency-based and frequency-based policies are not proficient on multicore processors [19].

Although recency- and frequency-based policies were proposed a long time ago, new way-replacement policies are still under investigation. Cache access patterns appear in three types: when the working set of an application is small enough to fit in the cache, the accesses are called recency-friendly; by contrast, when the working set of an application exceeds the size of the cache, the accesses are termed thrashing; and finally, a sequence of streaming accesses that never recur is called a scan [20]. To design a thrash- and scan-resistant policy, an adaptive policy that identifies the application behaviour has been developed [21]. A cache replacement policy for the shared LLC called Application-aware Cache Replacement (ACR) was proposed, which prevents a high-access-rate application from victimizing a low-access-rate application. It dynamically keeps track of the maximum lifetime of cache lines in the shared LLC for each concurrent application and helps in efficient utilization of the cache space.

Another cache replacement policy that respects application behaviour was proposed in [22]. State-of-the-art replacement policies like Static Re-reference Interval Prediction (SRRIP) and Application Aware Behaviour Re-reference Interval Prediction (ABRIP) [23] evict a cache block based on its predicted re-usability in the near future. SRRIP makes replacement decisions on a per-block basis, whereas ABRIP also considers the cache behaviour of an application to minimize conflicting data demands. Hence, ABRIP outperforms SRRIP for workload mixes where one application is cache-friendly and the other is streaming. However, ABRIP does not perform well when the workload mix is cache-friendly. Their policy effectively utilizes the shared LLC and outperforms both SRRIP and ABRIP, with performance gains of up to 10.12% and 9.36% respectively.
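To illustrate the re-reference interval prediction idea behind SRRIP [23], the following is a minimal sketch (not the implementation from the cited work) of a 2-bit RRPV variant for a single cache set: blocks are inserted with a long predicted re-reference interval, promoted on a hit, and the victim is the first block predicted to be re-referenced in the distant future.

```python
# Minimal SRRIP-style sketch for one cache set (illustrative only).
# 2-bit RRPV: 0 = near-immediate re-reference, 3 = distant re-reference.
MAX_RRPV = 3

class SRRIPSet:
    def __init__(self, ways):
        self.tags = [None] * ways        # block tag per way
        self.rrpv = [MAX_RRPV] * ways    # re-reference prediction value

    def access(self, tag):
        if tag in self.tags:             # hit: promote to near-immediate
            self.rrpv[self.tags.index(tag)] = 0
            return "hit"
        way = self._victim()             # miss: evict a predicted-distant block
        self.tags[way] = tag
        self.rrpv[way] = MAX_RRPV - 1    # insert with a "long" interval
        return "miss"

    def _victim(self):
        while True:                      # age all blocks until one reaches MAX_RRPV
            for way, r in enumerate(self.rrpv):
                if r == MAX_RRPV:
                    return way
            self.rrpv = [r + 1 for r in self.rrpv]
```

A scan inserted at RRPV 2 is evicted before recency-friendly blocks that keep being promoted to 0, which is what makes the policy scan-resistant.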
We encountered only one recent competent study that addresses a replacement policy considering the coherence state of a cache block on a multicore system by today's standards. A cache replacement policy that takes the coherence state into account is [24]. The authors proposed an algorithm that uses the coherence state and the sharing of cache blocks to choose the best among the k candidates for eviction in k-way set-associative caches. They consider two main approaches to prioritize the eviction of the block contained in one way over another. The first consists of counting how many sharers a single cache block has and taking this information into account when an eviction is needed. In the second, the decision of whether or not to evict the cache block depends on its coherence state. The main idea is to let the way-replacement algorithm select the candidate for eviction, and then check whether that candidate should be evicted. In essence, by the authors' claim, they implemented a second-chance approach. This research provides further investigation, including miss rate, power consumption and Instructions per Cycle (IPC) evaluations, in comparison with more well-known replacement policies, using the 'Modified', 'Shared', 'Invalid' (MSI) cache coherence protocol.

We design and implement a time-interleaved, coherence-state-aware approach for the High-Performance Computing, Graphics, Animation and Data Mining application domains, which differs from the related work. Considering the largest realizable inputs, the proposed method should be thrash-resistant and gains more information about past states of blocks. This requires supplementary bits per set representing the history of the blocks' behaviour.

III. WAY-REPLACEMENT POLICY PROPOSAL

Our main approach to prioritizing the eviction of a block is, essentially, to let the decision of whether or not to evict the cache block depend on its coherence state in time-interleaved windows.

For the purpose of this work, we use the 'Modified', 'Exclusive', 'Shared', 'Invalid' (MESI) cache coherence protocol. Cache coherence is one of the main challenges to tackle while designing a shared-memory multicore system. Incoherence may happen when multiple cooperating cores in a system are working on the same blocks of data without any coordination. Coordination must be conducted by the coherence protocol: a set of finite-state machines maintaining the caches and memory and keeping the coherence invariants true.

Our proposal is based on coherence-state eviction in a windowed time aperture, deciding whether or not to evict a cache block based on its current coherence state. Fig. 1 shows an overview of the strategy. Each cache block has its own coherence state, which is updated when there is a coherence transition in the time-interleaved aperture. Each cache set knows which state its most recently used block was in during the aperture of the recent time window. Supplementary bits are required per set, updated when a cache access occurs on that set.

Fig. 1. Overview of the proposed coherence state-based eviction strategy

The replacement algorithm is performed after the cache set selection, according to the set-associative mapping strategy. A candidate way is chosen and, if the block is valid, its coherence state is compared to the MRU-State, which is updated within the time aperture. The replacement algorithm favours for eviction blocks that are not in the same state as the current MRU-State, within a time-interleaved window of a given number of cache accesses. If the states are equal, the way selection is performed again. The rationale for not considering every cache access in our strategy is that instructions adjacent in time or order tend to have more in common with respect to the data they fetch from the cache hierarchy. The time aperture is designed to reset the MRU-State and thereby exploit temporal locality even further.

IV. SIMULATIONS AND RESULT EVALUATION

In this section we explain the details of the simulation environment, and then present the results obtained by evaluating our proposal in simulation.

A. Simulation environment and benchmarks

To conduct our experiments, we used Sniper [18], a multicore simulator. Sniper integrates the McPAT framework [22], which we used to measure power consumption. We adjusted Sniper to set up our test architectures and way-replacement algorithms. McPAT uses the design parameters from Sniper and the simulation output to assess the results; hence, we did not modify the power model in the framework. A 32-core processor based on a Gainestown variant of the x86 Nehalem micro-architecture at 2.66 GHz was set up. Some specific characteristics of this micro-architecture cannot be reproduced in Sniper, though (e.g., Sniper only simulates an inclusive cache model). Furthermore, data prefetching is disabled in our simulations. The memory hierarchy has three levels. L1 has 32 kB for data and 32 kB for instructions, each 4-way associative, private, with a 4-cycle data access time, using the LRU replacement policy. L2 is a 2 MB, 8-way associative, private cache with an access time of 8 cycles. We restricted our policy to the last-level cache during the experiments. The last level is L3, 8 MB in size, 16-way, shared by 8 cores, with a 35-cycle access time.
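The eviction flow described above can be sketched as follows. This is an illustrative sketch only, not the simulated implementation: it assumes a random base policy supplying the candidate way and a fixed-length aperture measured in cache accesses, and all names are hypothetical.

```python
import random

WINDOW = 64  # hypothetical aperture length, in cache accesses per set

class CoherenceAwareSet:
    """One cache set: a base way-replacement policy filtered by coherence state."""
    def __init__(self, ways):
        self.state = ["I"] * ways   # MESI state per way: M, E, S or I
        self.mru_state = None       # state of the MRU block in this aperture
        self.accesses = 0

    def on_access(self, way):
        # Track the coherence state of the most recently used block;
        # reset the MRU-State when the time aperture elapses.
        self.accesses += 1
        if self.accesses % WINDOW == 0:
            self.mru_state = None
        else:
            self.mru_state = self.state[way]

    def pick_victim(self, tries=8):
        # Candidate from the base policy (Random here); prefer a block whose
        # state differs from the current MRU-State, re-running the way
        # selection when they are equal.
        way = 0
        for _ in range(tries):
            way = random.randrange(len(self.state))
            if self.state[way] == "I":           # invalid block: evict at once
                return way
            if self.state[way] != self.mru_state:
                return way
        return way                               # give up after a few retries
```

The retry loop mirrors the "perform the way selection another time" step: blocks sharing the MRU-State are given a second chance, while blocks in other states (or invalid ones) are preferred victims.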
TABLE I. WORKLOADS DOMAIN AND INPUT SIZE (columns: Benchmark Suite, Application Domain, Program, Problem Size)
For ease of visual comparison, the actual values of cholesky and raytrace in Fig. 5 are a quarter of the depicted values. Likewise, the actual values for freqmine in Fig. 6 are 10 times the depicted values.
Fig. 6. Energy consumption for PARSEC2.1 applications

V. CONCLUSIONS

In this paper, we applied a time-interleaved way-replacement approach to the eviction of blocks in cache memories on a multicore system. We chose to evaluate whether knowledge of a cache block's coherence state within a time aperture, considering access patterns, can be used to improve overall performance and energy consumption. To perform this evaluation, we considered the cache block state in a time-interleaved aperture when deciding whether or not to evict the block. Using the Sniper simulator and applications from the PARSEC2.1 and SPLASH-2 benchmarks, our results confirmed that our proposal improves the cache miss rate and IPC without additional power consumption on high-performance computing, graphics, computer vision, animation and data mining applications. We conclude that time-interleaved coherence state information of a cache block can be employed to enhance the performance of cache way-replacement in high-performance multicore systems.
References

[1] Schifano, S.F., et al., "High performance and distributed computing," in Toward an Open Resource Using Services: Cloud Computing for Environmental Data, pp. 155-162, 2020.
[2] Tamilselvan, K. and Thangaraj, P., "Pods – a novel intelligent energy efficient and dynamic frequency scalings for multi-core embedded architectures in an IoT environment," Microprocessors and Microsystems, vol. 72, p. 102907, 2020.

Fig. 4. Instruction per Cycle (IPC)

[3] Han, J., Choi, M. and Kwon, Y., "40-TFLOPS artificial intelligence processor with function-safe programmable many-cores for ISO26262 ASIL-D," ETRI Journal, vol. 42, no. 4, pp. 468-479, 2020.
[4] Aali, S.N., et al., "Divisible load scheduling of image processing applications on the heterogeneous star network using a new genetic algorithm," in 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2018), pp. 77-84, 2018.
[5] Jafari, A.H., et al., "A reinforcement routing algorithm with access selection in the multi-hop multi-interface networks," Journal of Electrical Engineering, vol. 66, no. 2, pp. 70-78, 2015.
[6] Shahhoseini, H., et al., "Shared memory multistage clustering structure, an efficient structure for massively parallel processing systems," in Proceedings Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, IEEE, 2000.
[7] Bassiri, M.M. and Shahhoseini, H.S., "A new approach in on-line task scheduling for reconfigurable computing systems," in ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors, IEEE, 2010.

Fig. 5. Energy consumption for SPLASH-2 applications
[8] Mohtavipour, S.M., et al., "A link-elimination partitioning approach for application graph mapping in reconfigurable computing systems," The Journal of Supercomputing, vol. 76, no. 1, pp. 726-754, 2020.
[9] Bassiri, M.M. and Shahhoseini, H.S., "Mitigating reconfiguration overhead in on-line task scheduling for reconfigurable computing systems," in 2010 2nd International Conference on Computer Engineering and Technology, IEEE, 2010.
[10] Naderi, H., et al., "Evaluation MCDM multi-disjoint paths selection algorithms using Fuzzy-Copeland ranking method," International Journal of Communication Networks and Information Security, vol. 5, no. 1, pp. 59-67, 2013.
[11] Rad, H.J., et al., "A new adaptive power optimization scheme for target tracking wireless sensor networks," in 2009 IEEE Symposium on Industrial Electronics & Applications, IEEE, 2009.
[12] Hennessy, J.L. and Patterson, D.A., Computer Architecture: A Quantitative Approach. Elsevier, 2011.
[13] Conti, C.J., Gibson, D.H. and Pitkowsky, S.H., "Structural aspects of the System/360 Model 85, I: general organization," IBM Systems Journal, vol. 7, no. 1, pp. 2-14, 1968.
[14] Smith, A.J., "A comparative study of set associative memory mapping algorithms and their use for cache and main memory," IEEE Transactions on Software Engineering, no. 2, pp. 121-130, 1978.
[15] Rao, G.S., "Performance analysis of cache memories," Journal of the ACM, vol. 25, no. 3, pp. 378-395, 1978.
[16] Woo, S.C., et al., "The SPLASH-2 programs: characterization and methodological considerations," ACM SIGARCH Computer Architecture News, vol. 23, no. 2, pp. 24-36, 1995.
[17] Bienia, C., et al., "The PARSEC benchmark suite: characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, Association for Computing Machinery, Toronto, Ontario, Canada, pp. 72-81, 2008.
[18] Carlson, T.E., et al., "An evaluation of high-level mechanistic core models," ACM Transactions on Architecture and Code Optimization (TACO), vol. 11, no. 3, pp. 1-25, 2014.
[19] Jain, A. and Lin, C., "Cache replacement policies," Synthesis Lectures on Computer Architecture, vol. 14, no. 1, pp. 1-87, 2019, doi: 10.2200/s00922ed1v01y201905cac047.
[20] Denning, P.J., "Thrashing: its causes and prevention," in Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I, pp. 915-922, 1968, doi: 10.1145/1476589.1476705.
[21] Warrier, T.S., Anupama, B. and Mutyam, M., "An application-aware cache replacement policy for last-level caches," in Architecture of Computing Systems - ARCS 2013, Springer, Berlin, Heidelberg, 2013.
[22] Beckmann, N. and Sanchez, D., "Maximizing cache performance under uncertainty," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2017.
[23] Jaleel, A., et al., "High performance cache replacement using re-reference interval prediction (RRIP)," ACM SIGARCH Computer Architecture News, vol. 38, no. 3, pp. 60-71, 2010.
[24] Souza, M., Freitas, H.C. and Pétrot, F., "Coherence state awareness in way-replacement algorithms for multicore processors," in Anais Principais do XX Simpósio em Sistemas Computacionais de Alto Desempenho, SBC, 2019.