Abstract— Due to the evolution of the issues and demands that have developed with intelligent systems, higher performance has become a crucial need as application size and complexity grow daily. Currently, most high-performance computing systems are multi-core, and a great deal of research has been devoted to minimizing execution time efficiently. Cache performance has a strong impact on program execution, especially on modern intelligent processing systems; therefore, cache replacement policies in set-associative caches have been investigated in great depth. We propose an approach that considers information derived from the coherence state of the cache block in a time-interleaved manner. Our simulations on high-performance applications from the PARSEC2.1 and SPLASH-2 benchmark suites show that our approach achieves a 10% lower miss rate and up to 10% more instructions per cycle than the LRU, MRU and Random replacement policies, without additional power consumption.

Keywords— data-intensive application, cache replacement policy, coherence state, memory management.

I. INTRODUCTION

Intelligent systems have made revolutionary scientific discoveries, fed game-changing innovations, and improved the quality of life for billions of people around the globe. Through the progression of needs and demands, High-Performance Computing (HPC) has become not only the foundation for scientific, industrial, and societal advancements but also a vital service [1]. As technologies like the Internet of Things (IoT) [2], artificial intelligence (AI) [3], 3-D imaging [4] and traffic management [5] evolve, the size and amount of data that systems have to work with are growing exponentially. For many purposes, the ability to process this huge amount of data in an intelligent, high-performance, real-time manner is crucial, and one of the basic building blocks of such systems is the multi-core processor.

Toward high-performance computers, a range of parallel structures has been investigated over the past years. In [6], a general structure was proposed for massively parallel processing systems. The proposed structure is expandable both vertically and horizontally and incorporates many previous computer designs. Queuing theory and the Jackson queuing network are applied to build an analytical model of the proposed structure. Computing systems now execute tasks in a truly multitasking fashion. Such systems share the reconfigurable device and the processing unit as computing resources, which leads to highly dynamic allocation situations. [7] and [8] propose heuristic approaches that concentrate on developing a strong link between partitioning, scheduling and placement, and considerable improvements in the overall execution time of the tasks have been achieved. Even at their best configuration, high-performance systems endure overheads and penalties that deteriorate overall performance and execution time; hence, an approach toward managing and mitigating them is highly required [9]. Cache performance plays an essential role not only in processing systems used for a wide range of computationally intensive tasks in various domains, but also in reducing the delay of memory accesses by providing a fast-access storage unit for the data currently being accessed by the CPU. On a cache miss, data is fetched from memory and placed in its corresponding location in the cache. An effective cache replacement policy can substantially reduce the cache miss rate and thus decrease the chance of paying a penalty of hundreds of cycles on memory accesses. The goal is to reduce the number of requests to the slower memory and hence the memory access latency. A cache hierarchy is used in various applications such as reconfigurable systems, data processing units, computing servers, databases and CPUs, to mention but a few.

Current computing systems are parallel in nature, comprising several processors or processing units that share a single memory address space. Thus, smart designs involve intelligent algorithms aiming at better execution time, performance and power consumption [10-11]. For performance and power efficiency, these multi-core systems include caches, raising a new concern: ensuring cache coherence. This leads to a new category of conflicts, coherence misses, due to a block having been evicted as a consequence of a coherence action; moreover, this eviction may interfere with the local way-replacement policy.

Cache misses can be broadly categorised into three classes [12]:

• Compulsory miss: when the block has to be read from main memory whatever the cache design is; also known as cold start misses or first-reference misses.
2020 6th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS)
• Capacity miss: when it is impossible to hold all the concurrently used data in the cache, because the cache is small compared to the working-set size.

• Conflict miss: when multiple addresses map to the same set and evict blocks that are still in use; even though not all cache blocks may be occupied, the mapping function cannot redirect a memory address to an empty block in a different set.

Trade-offs can be accomplished by adjusting the cache parameters. Increasing cache capacity, for instance, can reduce conflict and capacity misses; increasing the associativity decreases conflict misses; and increasing the block size can reduce compulsory misses thanks to spatial locality, while increasing conflict misses (for a fixed-size cache).

Recent trends in modern processor design have made adequate management of cache resources more indispensable than ever before. As the core count and memory demands of applications rise, discerning cache architectures and management policies have become vital to meet the challenges of area and power limitations. Researchers have recently investigated cache designs that offer distinct advantages and disadvantages.

Set-associative caches were introduced half a century ago [13], and way-replacement policies, i.e. choosing which way should be evicted, have been studied ever since. The earliest studies were conducted in [14] and [15], which published surveys and analyses of replacement algorithms. Since then, new replacement policies have been devised and many optimizations proposed to fit different hardware-related constraints, such as area, power and number of ways.

When a conflict occurs in a direct-mapped cache, as each address maps to a unique block and set, the only way to deal with a full cache set is to replace the block in the set with the new data. In an N-way set-associative or fully associative cache, a block must be chosen for eviction when such a conflict occurs and the set is full. The temporal locality principle expressed earlier suggests that the most appropriate block to evict is the Least Recently Used (LRU) one, as it is the least likely to be used again in the near future.

This replacement scheme can be implemented by adding a use bit next to the validity bit, indicating whether the block in the corresponding way was the most recently accessed. When one of the ways is used, this bit is updated, and a way whose bit is low is the one replaced when needed. For a two-way cache, the least recently used block is thus exactly the one replaced. In caches with a higher degree of associativity, a random block having the use bit low is replaced. Such a policy is called pseudo-LRU and is often good enough in practice.

In this paper, our aim is not to introduce a new way-replacement algorithm, but to examine whether information relative to cache coherence can be used to enhance the performance of existing way-replacement policies. To that purpose, we propose an approach that takes the coherence state into account, and we evaluate our proposal on high-performance parallel applications from the SPLASH-2 [16] and PARSEC2.1 [17] benchmarks using the Sniper simulator [18].

The contributions are the following:

• Elements of our replacement strategy based on the current coherence state of cache blocks,

• Analysis of our proposal when combined with prominent replacement policies (LRU, MRU and Random),

• Experimental evidence that such an approach should be considered in the next generation of architectures that rely on cache coherence.

The remainder of the paper is organized as follows. Section II presents related work. Our proposal for considering the cache coherence state in way-replacement is described in Section III. Section IV presents simulation results obtained on a state-of-the-art simulator. Finally, Section V discusses the results and concludes the paper.

II. RELATED WORKS

In multi-core systems, accesses from multiple cores compete for shared cache capacity. Inadequate management of the shared cache can deteriorate system performance and result in an unfair allocation of resources, because one badly behaved application can degrade the performance of all other applications sharing the cache. For example, a workload with streaming memory accesses can evict useful data belonging to other recency-friendly applications. Shared caches in multi-core processors also introduce serious difficulties in providing guarantees on the real-time properties of embedded software, due to the interaction and the resulting contention in the shared caches. Hence, recency-based and frequency-based policies are not proficient on multicore processors [19].

Although recency- and frequency-based policies were proposed a long time ago, new way-replacement policies are still under investigation. Cache access patterns appear in three types: when the working set of an application is small enough to fit in the cache, the accesses are called recency-friendly; by contrast, when the working set of an application exceeds the size of the cache, the accesses are termed thrashing; and finally, a sequence of streaming accesses that never recur is called a scan [20]. To design a thrash- and scan-resistant policy, an adaptive policy that identifies the application behaviour has been developed [21]. A cache replacement policy for the shared LLC called Application-aware Cache Replacement (ACR) was proposed, which prevents a high-access-rate application from victimizing a low-access-rate application. It dynamically keeps track of the maximum lifetime of cache lines in the shared LLC for each concurrent application and helps in efficient utilization of the cache space.

Another cache replacement policy that respects application behaviour was proposed in [22]. State-of-the-art replacement policies like Static Re-reference Interval Prediction (SRRIP) and Application Aware Behaviour Re-reference Interval Prediction (ABRIP) [23] evict a cache block based on its predicted re-usability in the near future. SRRIP makes replacement decisions on a per-block basis, whereas ABRIP also considers the cache behaviour of an application to minimize conflicting data demands. Hence, ABRIP outperforms SRRIP for workload mixes where one application is cache-friendly and the other is streaming. However, ABRIP does not perform well when the workload mix is cache-friendly. Their policy effectively utilizes the shared LLC and outperforms both SRRIP and ABRIP, with performance gains of up to 10.12% and 9.36% respectively.
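To illustrate the re-reference interval prediction idea behind SRRIP [23], the following is a minimal sketch (not the implementation from the cited work) of a 2-bit RRPV variant for a single cache set: blocks are inserted with a long predicted re-reference interval, promoted on a hit, and the victim is the first block predicted to be re-referenced in the distant future.

```python
# Minimal SRRIP-style sketch for one cache set (illustrative only).
# 2-bit RRPV: 0 = near-immediate re-reference, 3 = distant re-reference.
MAX_RRPV = 3

class SRRIPSet:
    def __init__(self, ways):
        self.tags = [None] * ways        # block tag per way
        self.rrpv = [MAX_RRPV] * ways    # re-reference prediction value

    def access(self, tag):
        if tag in self.tags:             # hit: promote to near-immediate
            self.rrpv[self.tags.index(tag)] = 0
            return "hit"
        way = self._victim()             # miss: evict a predicted-distant block
        self.tags[way] = tag
        self.rrpv[way] = MAX_RRPV - 1    # insert with a "long" interval
        return "miss"

    def _victim(self):
        while True:                      # age all blocks until one reaches MAX_RRPV
            for way, r in enumerate(self.rrpv):
                if r == MAX_RRPV:
                    return way
            self.rrpv = [r + 1 for r in self.rrpv]
```

A scan inserted at RRPV 2 is evicted before recency-friendly blocks that keep being promoted to 0, which is what makes the policy scan-resistant.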
We encountered only one recent competent study that addresses a replacement policy considering the coherence state of a cache block on a multicore system by today's standards. A cache replacement policy that takes the coherence state into account is [24]. The authors proposed an algorithm that uses the coherence state and the sharing of cache blocks to choose the best among the k candidates for eviction in k-way set-associative caches. They consider two main approaches to prioritize the eviction of the block contained in one way over another. The first consists of counting how many sharers a single cache block has and taking this information into account when an eviction is needed. In the second, the decision of whether or not to evict the cache block depends on its coherence state. The main idea is to let the way-replacement algorithm select the candidate for eviction, and then check whether that candidate should be evicted. In essence, by the authors' claim, they implemented a second-chance approach. This research provides further investigation, including miss rate, power consumption and Instructions per Cycle (IPC) evaluations, in comparison with more well-known replacement policies, using the 'Modified', 'Shared', 'Invalid' (MSI) cache coherence protocol.

We design and implement a time-interleaved, coherence-state-aware approach for the High-Performance Computing, Graphics, Animation and Data Mining application domains, which differs from the related work. Considering the largest realizable inputs, the proposed method should be thrash-resistant and gains more information about past states of blocks. This requires supplementary bits per set representing the history of the blocks' behaviour.

III. WAY-REPLACEMENT POLICY PROPOSAL

Our main approach to prioritizing the eviction of a block is, essentially, to let the decision of whether or not to evict the cache block depend on its coherence state in time-interleaved windows.

For the purpose of this work, we use the 'Modified', 'Exclusive', 'Shared', 'Invalid' (MESI) cache coherence protocol. Cache coherence is one of the main challenges to tackle while designing a shared-memory multicore system. Incoherence may happen when multiple cooperating cores in a system are working on the same blocks of data without any coordination. Coordination must be conducted by the coherence protocol: a set of finite-state machines maintaining the caches and memory and keeping the coherence invariants true.

Our proposal is based on coherence-state eviction in a windowed time aperture, deciding whether or not to evict a cache block based on its current coherence state. Fig. 1 shows an overview of the strategy. Each cache block has its own coherence state, which is updated when there is a coherence transition in the time-interleaved aperture. Each cache set knows which state its most recently used block was in during the aperture of the recent time window. Supplementary bits are required per set, updated when a cache access occurs on that set.

Fig. 1. Overview of the proposed coherence state-based eviction strategy

The replacement algorithm is performed after the cache set selection, according to the set-associative mapping strategy. A candidate way is chosen and, if the block is valid, its coherence state is compared to the MRU-State, which is updated within the time aperture. The replacement algorithm favours for eviction blocks that are not in the same state as the current MRU-State, within a time-interleaved window of a given number of cache accesses. If the states are equal, the way selection is performed again. The rationale for not considering every cache access in our strategy is that instructions adjacent in time or order tend to have more in common with respect to the data they fetch from the cache hierarchy. The time aperture is designed to reset the MRU-State and thereby exploit temporal locality even further.

IV. SIMULATIONS AND RESULT EVALUATION

In this section we explain the details of the simulation environment, and then present the results obtained by evaluating our proposal in simulation.

A. Simulation environment and benchmarks

To conduct our experiments, we used Sniper [18], a multicore simulator. Sniper integrates the McPAT framework [22], which we used to measure power consumption. We adjusted Sniper to set up our test architectures and way-replacement algorithms. McPAT uses the design parameters from Sniper and the simulation output to assess the results; hence, we did not modify the power model in the framework. A 32-core processor based on a Gainestown variant of the x86 Nehalem micro-architecture at 2.66 GHz was set up. Some specific characteristics of this micro-architecture cannot be reproduced in Sniper, though (e.g., Sniper only simulates an inclusive cache model). Furthermore, data prefetching is disabled in our simulations. The memory hierarchy has three levels. L1 has 32 kB for data and 32 kB for instructions, each 4-way associative, private, with a 4-cycle data access time, using the LRU replacement policy. L2 is a 2 MB, 8-way associative, private cache with an access time of 8 cycles. We restricted our policy to the last-level cache during the experiments. The last level is L3, 8 MB in size, 16-way, shared by 8 cores, with a 35-cycle access time.
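The eviction flow described above can be sketched as follows. This is an illustrative sketch only, not the simulated implementation: it assumes a random base policy supplying the candidate way and a fixed-length aperture measured in cache accesses, and all names are hypothetical.

```python
import random

WINDOW = 64  # hypothetical aperture length, in cache accesses per set

class CoherenceAwareSet:
    """One cache set: a base way-replacement policy filtered by coherence state."""
    def __init__(self, ways):
        self.state = ["I"] * ways   # MESI state per way: M, E, S or I
        self.mru_state = None       # state of the MRU block in this aperture
        self.accesses = 0

    def on_access(self, way):
        # Track the coherence state of the most recently used block;
        # reset the MRU-State when the time aperture elapses.
        self.accesses += 1
        if self.accesses % WINDOW == 0:
            self.mru_state = None
        else:
            self.mru_state = self.state[way]

    def pick_victim(self, tries=8):
        # Candidate from the base policy (Random here); prefer a block whose
        # state differs from the current MRU-State, re-running the way
        # selection when they are equal.
        way = 0
        for _ in range(tries):
            way = random.randrange(len(self.state))
            if self.state[way] == "I":           # invalid block: evict at once
                return way
            if self.state[way] != self.mru_state:
                return way
        return way                               # give up after a few retries
```

The retry loop mirrors the "perform the way selection another time" step: blocks sharing the MRU-State are given a second chance, while blocks in other states (or invalid ones) are preferred victims.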
TABLE I. WORKLOADS DOMAIN AND INPUT SIZE (columns: Benchmark Suite, Application Domain, Program, Problem Size)
For ease of visual comparison, the actual values of cholesky and raytrace in Fig. 5 are a quarter of the depicted values. Likewise, the actual values for freqmine in Fig. 6 are 10 times the depicted values.
Fig. 6. Energy consumption for PARSEC2.1 applications

V. CONCLUSIONS

In this paper, we applied a time-interleaved way-replacement approach to the eviction of blocks in cache memories on a multicore system. We chose to evaluate whether knowledge of a cache block's coherence state within a time aperture, considering access patterns, can be used to improve overall performance and energy consumption. To perform this evaluation, we considered the cache block state in a time-interleaved aperture when deciding whether or not to evict the block. Using the Sniper simulator and applications from the PARSEC2.1 and SPLASH-2 benchmarks, our results confirmed that our proposal improves the cache miss rate and IPC without additional power consumption on high-performance computing, graphics, computer vision, animation and data mining applications. We conclude that time-interleaved coherence state information of a cache block can be employed to enhance the performance of cache way-replacement in high-performance multicore systems.
References

[1] Schifano, S.F., et al., "High performance and distributed computing," in Toward an Open Resource Using Services: Cloud Computing for Environmental Data, pp. 155-162, 2020.
[2] Tamilselvan, K. and Thangaraj, P., "Pods – a novel intelligent energy efficient and dynamic frequency scalings for multi-core embedded architectures in an IoT environment," Microprocessors and Microsystems, vol. 72, p. 102907, 2020.

Fig. 4. Instruction per Cycle (IPC)

[3] Han, J., Choi, M. and Kwon, Y., "40-TFLOPS artificial intelligence processor with function-safe programmable many-cores for ISO26262 ASIL-D," ETRI Journal, vol. 42, no. 4, pp. 468-479, 2020.
[4] Aali, S.N., et al., "Divisible load scheduling of image processing applications on the heterogeneous star network using a new genetic algorithm," in 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2018), pp. 77-84, 2018.
[5] Jafari, A.H., et al., "A reinforcement routing algorithm with access selection in the multi-hop multi-interface networks," Journal of Electrical Engineering, vol. 66, no. 2, pp. 70-78, 2015.
[6] Shahhoseini, H., et al., "Shared memory multistage clustering structure, an efficient structure for massively parallel processing systems," in Proceedings Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, IEEE, 2000.
[7] Bassiri, M.M. and Shahhoseini, H.S., "A new approach in on-line task scheduling for reconfigurable computing systems," in ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors, IEEE, 2010.

Fig. 5. Energy consumption for SPLASH-2 applications
[8] Mohtavipour, S.M., et al., "A link-elimination partitioning approach for application graph mapping in reconfigurable computing systems," The Journal of Supercomputing, vol. 76, no. 1, pp. 726-754, 2020.
[9] Bassiri, M.M. and Shahhoseini, H.S., "Mitigating reconfiguration overhead in on-line task scheduling for reconfigurable computing systems," in 2010 2nd International Conference on Computer Engineering and Technology, IEEE, 2010.
[10] Naderi, H., et al., "Evaluation MCDM multi-disjoint paths selection algorithms using Fuzzy-Copeland ranking method," International Journal of Communication Networks and Information Security, vol. 5, no. 1, pp. 59-67, 2013.
[11] Rad, H.J., et al., "A new adaptive power optimization scheme for target tracking wireless sensor networks," in 2009 IEEE Symposium on Industrial Electronics & Applications, IEEE, 2009.
[12] Hennessy, J.L. and Patterson, D.A., Computer Architecture: A Quantitative Approach. Elsevier, 2011.
[13] Conti, C.J., Gibson, D.H. and Pitkowsky, S.H., "Structural aspects of the System/360 Model 85, I: general organization," IBM Systems Journal, vol. 7, no. 1, pp. 2-14, 1968.
[14] Smith, A.J., "A comparative study of set associative memory mapping algorithms and their use for cache and main memory," IEEE Transactions on Software Engineering, no. 2, pp. 121-130, 1978.
[15] Rao, G.S., "Performance analysis of cache memories," Journal of the ACM, vol. 25, no. 3, pp. 378-395, 1978.
[16] Woo, S.C., et al., "The SPLASH-2 programs: characterization and methodological considerations," ACM SIGARCH Computer Architecture News, vol. 23, no. 2, pp. 24-36, 1995.
[17] Bienia, C., et al., "The PARSEC benchmark suite: characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, Association for Computing Machinery, Toronto, Ontario, Canada, pp. 72-81, 2008.
[18] Carlson, T.E., et al., "An evaluation of high-level mechanistic core models," ACM Transactions on Architecture and Code Optimization (TACO), vol. 11, no. 3, pp. 1-25, 2014.
[19] Jain, A. and Lin, C., "Cache replacement policies," Synthesis Lectures on Computer Architecture, vol. 14, no. 1, pp. 1-87, 2019, doi: 10.2200/s00922ed1v01y201905cac047.
[20] Denning, P.J., "Thrashing: its causes and prevention," in Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I, pp. 915-922, 1968, doi: 10.1145/1476589.1476705.
[21] Warrier, T.S., Anupama, B. and Mutyam, M., "An application-aware cache replacement policy for last-level caches," in Architecture of Computing Systems - ARCS 2013, Springer, Berlin, Heidelberg, 2013.
[22] Beckmann, N. and Sanchez, D., "Maximizing cache performance under uncertainty," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2017.
[23] Jaleel, A., et al., "High performance cache replacement using re-reference interval prediction (RRIP)," ACM SIGARCH Computer Architecture News, vol. 38, no. 3, pp. 60-71, 2010.
[24] Souza, M., Freitas, H.C. and Pétrot, F., "Coherence state awareness in way-replacement algorithms for multicore processors," in Anais Principais do XX Simpósio em Sistemas Computacionais de Alto Desempenho, SBC, 2019.