Ninth IEEE International Conference on Dependable, Autonomic and Secure Computing
Abstract-Data coherence in the cache systems of CMPs with thousands of processors must be accurate and reliable. This work proposes an effective solution to this issue by introducing a highly efficient test logic with the cache controller. The solution is based on the modular structure of Cellular Automata (CA). A special class of CA, referred to as the SACA (single length single cycle attractor CA), is introduced to identify inconsistencies in the cache line states of the processors' private caches. The hardware implementation of the proposed test logic can ensure quick verification of cache inconsistencies in CMPs. The proposed design eliminates the requirement of huge storage as well as the complex data structures commonly used to verify data coherency in a multiprocessor system.

Keywords-Cache coherence, Chip Multi-Processor, Fault detection, Coherence controller;

I. INTRODUCTION

Chip Multi-Processors (CMPs) with thousands of on-chip cores are more susceptible to faults due to the effects of technology scaling [1] as well as the noncompliance of schemes that are targeted at small systems. Technology scaling adds new forms of defects at the deep submicron level that pose serious threats in CMPs while ensuring data consistency in the processors' private caches. Therefore, the current need is to develop schemes for high-speed verification of inconsistencies in cache data without incurring a major cost.

In CMPs, the L1 cache is the private cache of a processor core. The L2 cache is shared among the cores and is kept coherent with all the L1s. The cache coherence controller (CC), responsible for ensuring consistency in shared data, is one of the most important hardware components [2]. An insignificant defect in the CC of a chip multiprocessor can lead to a major data inconsistency in the CMP's L1 caches. If the CC wrongly computes a cache line state as 'shared' (S) instead of 'modified' (M), it denies the issuance of an invalidation message. This can cause serious damage to the performance as well as the reliability of the system. On the other hand, setting an 'M' state instead of 'S' results in unnecessary message delivery and, in effect, a huge power loss. Therefore, maintaining coherency of shared data in CMPs is of utmost necessity for ensuring the correctness of computation as well as the power efficiency of a system.

Since the reliability of CMPs is an important issue, a number of works address it from different perspectives [8], [9], [10], [11]. The schemes ensuring coherency in CMPs with thousands of cores, through frequent communication along the global wires, are reported in [3], [4], [5], [6], [7]. This communication among the L1 caches seriously affects the system performance as well as the energy usage. However, power consumption has emerged as the first order design metric for future CMPs [1][2].

The above scenario motivates us to develop a scheme to determine the accuracy in data consistency of the CMP cache system [12]. The design should function at speed with the system and should be energy efficient. In this work, we propose the design of an efficient test logic entrusted with the verification of data inconsistencies in the CMP's private caches. The solution is based on the theory of a special class of cellular automata (CA) [13] referred to as the single length single cycle attractor CA (SACA) [14], [15]. At each stage of cache access, the proposed test logic checks the state of a cache line at all the processors' private caches. For any inconsistency recorded, due to a defect in the cache system, the SACA of the fault detection unit points to an attractor indicating inconsistency in the cache system.

The hardware implementation of the proposed test logic can quickly determine a violation of cache coherence in CMPs at any instant of time. This unconventional SACA based scheme demands minimum wire communication as well as interconnect access, which can effectively enable a reduction in power dissipation. Further, the modular structure of cellular automata (CA) [13][14] makes the solution suitable for a system with billions of cores, that is, highly scalable.

The relevant part of CA preliminaries and a brief on cache coherence are introduced in the next sections.

II. CA PRELIMINARIES

A Cellular Automaton (CA) consists of a number of cells organized in the form of a lattice. It evolves in discrete space and time, and can be viewed as an autonomous finite state machine (FSM). Each cell stores a discrete variable at time $t$ that refers to the present state (PS) of the cell. The next state (NS) of the cell at $(t+1)$ is affected by its state and the states of its neighbors at time $t$.
(Figure: an n-cell CA with cells Cell 1, ..., Cell i−1, Cell i, Cell i+1, ..., Cell n realized as flip-flops (FF) and next state functions f1, ..., fi, ..., fn)

For a 3-neighborhood CA, where each cell has two states, 0 or 1, the next state of the $i$-th cell is

$S_i^{t+1} = f_i(S_{i-1}^t, S_i^t, S_{i+1}^t)$

where $S_{i-1}^t$, $S_i^t$ and $S_{i+1}^t$ are the present states of the left neighbor, self and right neighbor of the $i$-th cell at time $t$, and $f_i$ is the next state function. The states of the cells $S^t = (S_1^t, S_2^t, \ldots, S_n^t)$ at $t$ is the present state of the CA.

Figure 2. A 4-cell reversible CA
The next state function can be expressed in the form of a truth table (Table I). The decimal equivalent of the 8 outputs is called Rule $R_i$. In a 2-state 3-neighborhood CA, there are 256 such rules; rules 15, 14, 192, 207, and 240 are illustrated in Table I.

The following terminologies are relevant for the current work.

Definition 1. The set $R = \langle R_1, R_2, \ldots, R_i, \ldots, R_n \rangle$ of [...]

[...] considered as the Min Term of a 3-variable switching function. Each column of the first row of Table I is referred to as the Rule Min Term (RMT). The column 011 of Table I is the 3rd RMT. The next states corresponding to this RMT are 1 for Rule 15, 14 and 207, and 0 for Rule 192 and 240.

[...] attractor. The attractors of single length cycle, that is, [...]

[...] coherence protocol connects all the L1 caches through a shared bus (Figure 4) and each L1 cache miss generates the [...]
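The rule and RMT conventions of Section II can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `next_state` and `truth_table` are hypothetical, and the bit ordering (RMT index = left, self, right read as a 3-bit number, with the rule number's bit k giving the next state for RMT k) is the usual convention assumed here.

```python
def next_state(rule: int, left: int, self_: int, right: int) -> int:
    """Next state of a cell under a 2-state, 3-neighborhood CA rule.

    The present states (left, self, right) form the RMT index 0..7;
    bit `rmt` of the rule number is the next state for that RMT.
    """
    rmt = (left << 2) | (self_ << 1) | right
    return (rule >> rmt) & 1


def truth_table(rule: int) -> list[int]:
    """Next states for RMTs 111, 110, ..., 000, in that column order."""
    return [(rule >> rmt) & 1 for rmt in range(7, -1, -1)]


if __name__ == "__main__":
    # Print the truth table rows for the five rules named in the text.
    print("RMT  " + " ".join(f"{rmt:03b}" for rmt in range(7, -1, -1)))
    for rule in (15, 14, 192, 207, 240):
        print(f"{rule:>3}  " + "   ".join(str(b) for b in truth_table(rule)))
```

Reading the column for RMT 3 (011) in the printed rows reproduces the statement in the text: the next state is 1 for rules 15, 14 and 207, and 0 for rules 192 and 240.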
Figure 4. CMPs with CC and test unit (processor cores P1, P2, ..., Pn; private L1 caches C1, C2, ..., Cn; shared L2 cache; CC; CA based test unit)

Figure 5. State transition diagram of 3-state MSI protocol (MSI snoopy protocol with states Invalid (I), Shared (S, read-only) and Modified (M); 'B' is the cache line (block); Pjs refers to other processors; a read miss follows the write-back)

[...] (summary of states shown in Figure 5) of the cache system. All these are coherent states; that is, when the system is in such states, the proposed cache inconsistency detection logic should respond as CH (coherent). The event shown in column 2 causes a transition of the states of a cache line (say, B) at different Cs (private caches of different processors) from the current state to the desired next state (column 3). During this transition, a faulty system may record incorrect states of B at different processors' caches (noted in column 4). For the current design, we assume the faulty recording is due to a communication failure or a design defect in the CC logic.

A faulty recording may not always lead to an incoherent state. The effect of a fault results in either CH or ICH (incoherent), as noted in the last column of Table II. The entry 'All Cs I[S]' in the table represents that the cache line B at all the caches is in the Invalid [Shared] state.

The consideration of columns 1, 4 and 5 of Table II indicates that the fault detection unit should respond as CH [...]
Case 4: One cache M and all others I.
On the other hand, for the following incoherent states
Case 5: One cache as M, at least one S and others I
Case 6: Two caches as M and others I
it should respond ICH.

In the proposed design, we introduce a CA-based model to realize the test logic (unit) so that it can correctly respond either CH or ICH following the above six cases in a cache system. For a chip multiprocessor with $n$ caches (C1, C2, [...] (local verification and global verification) technique for CC [...]
Table II
STATE TRANSITIONS

Current Cache States (1) | Event (2) | Desired Next States (3) | Faulty Next States (4)  | Effect of Fault (5)
All Cs I                 | Pi writes | Ci M and all others I   | Ci M and all others I   | Coherent state (CH)
All Cs S                 | Pi writes | Ci M and all others I   | Ci M and all others I   | Coherent state (CH)
                         |           |                         | Ci M and others S & I   | Incoherent state (ICH)
Cs are I & S             | Pi writes | Ci M and all others I   | Ci M and others S & I   | Incoherent state (ICH)
Cj M and all others I    | Pi reads  | Ci & Cj S, others I     | Cj M, Ci S and others I | Incoherent state (ICH)
                         | Pi writes | Ci M and all others I   | Ci & Cj M and others I  | Incoherent state (ICH)
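The CH/ICH verdicts in the last column of Table II can be reproduced with a small software reference model. The sketch below is not the CA-based test logic of the paper; it is a plain counting check, useful only as an oracle against which such a test unit could be compared. The function name `classify` and the one-character-per-cache encoding are assumptions made for illustration, and treating more than two caches in M (or M together with both M and S elsewhere) as ICH is a generalization of Cases 5 and 6.

```python
def classify(states: str) -> str:
    """Return 'CH' if the states of one cache line are coherent, else 'ICH'.

    `states` holds one character per private cache: 'M', 'S' or 'I'
    (hypothetical encoding chosen for this sketch).
    """
    if any(s not in ("M", "S", "I") for s in states):
        raise ValueError("states must contain only 'M', 'S' or 'I'")
    modified = states.count("M")
    shared = states.count("S")
    if modified == 0:
        return "CH"   # all I, all S, or a mix of S and I
    if modified == 1 and shared == 0:
        return "CH"   # Case 4: one cache M and all others I
    return "ICH"      # Case 5 (M together with S) or Case 6 (two or more M)


if __name__ == "__main__":
    for line in ("IIII", "SSSS", "SISI", "MIII", "MSII", "MMII"):
        print(line, classify(line))
```

For a four-cache system, the faulty next states of Table II map as follows: `MIII` is the 'Ci M and all others I' entry, `MSII` is 'Ci M and others S & I', and `MMII` is 'Ci & Cj M and others I'.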
[...] its state in the next time step. For example, the RMT xdx (x=0/1, d=0/1) of a rule is considered to find the next state of the cell when the current states of its left neighbor, self and right neighbor are x, d and x respectively. To get a single length cycle attractor, the next state for the RMT xdx of the rule is to be d (0/1). It implies that the state [...]

[...] single length cycle attractor(s) if at least one of the [...]

Based on Property 1, the 256 rules are classified [15] [...]

Observation 1. Most of the rules of group 6 form single [...] those - 2, 16, 32, 42, 56, 98, 112, 162, and 176 form SACA.

Observation 2. A number of rules from group 4 form both the single & multi-length cycles and single & multi-graphs. [...]

The state transition diagram of a 4-bit CA is noted in [...] that form only single length cycle uniform CA are 0, 10, 15, 20, 24, 30, 36, 40, 46, 66, 80, 85, 90, 96, 106, 120, 130, [...]

Figure 6. State transition diagram of [...] (a 4-cell CA with states 0 to 15)

Figure 7. State transition diagram of [...]

Observation 4. The uniform CA designed only with the rule 34/48 can form SACA.

It is observed that to form an SACA, the CA rules should follow Property 1. However, a rule (e.g. rule 204) that [...]

The rule 48 of group 2 denies Property 1 for 6 RMTs. On the other hand, rule 192 of group 6 denies Property 1 [...]
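The Property 1 test on RMTs of the form xdx can be checked mechanically. The following Python sketch lists, for a given rule, the RMTs whose next state differs from the present state d of the cell itself. The names `denied_rmts` and `group` are hypothetical; reading the group number as the count of RMTs that follow Property 1 is an inference from the statements that rule 48 of group 2 denies it for 6 RMTs and that rule 192 belongs to group 6.

```python
def denied_rmts(rule: int) -> set[int]:
    """RMTs of `rule` for which Property 1 is denied.

    For RMT (left, d, right), Property 1 holds when the next state is d,
    the present state of the cell itself (the middle bit of the RMT).
    """
    denied = set()
    for rmt in range(8):
        d = (rmt >> 1) & 1
        if (rule >> rmt) & 1 != d:
            denied.add(rmt)
    return denied


def group(rule: int) -> int:
    """Number of RMTs that follow Property 1 (assumed meaning of 'group')."""
    return 8 - len(denied_rmts(rule))


if __name__ == "__main__":
    for rule in (48, 192, 204, 207):
        print(rule, sorted(denied_rmts(rule)), "group", group(rule))
```

Under these assumptions the sketch agrees with the text: rule 48 denies Property 1 for 6 RMTs, rule 192 denies it for RMTs 2 and 3 only, and rule 204 denies it for none.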
Table III
CA RULES FOR UNIFORM SACA

(Figure: structure of Rule 192 and structure of Rule 207, each over the RMTs 111 110 101 100 011 010 001 000)

[...] arbitrary length. The SACA, synthesized following Property 2, is employed to develop the proposed scheme for detecting any violation of cache coherence.

[...] 5 & 6 are also hybrid. Let us assume these form attractors [...] therefore, can be such that [...]

The CA for Case 6 is resulted from the uniform CA for Case 1 through hybridization of R[...]. Now, to ensure [...] by a cell rule R[...], it can generate new attractors only if the set of RMTs of R[...], for which Property 1 is denied, is not a subset of the set of RMTs of R[...], for which also the Property 1 is denied.

For example, the RMTs of rule 192 (R[...]) for which Property 1 is denied are 2, 3 and the similar set for rule [...]

[...] $R_m$ = 240, $R_i$ = 15 and $R_s$ = 14.

Theorem 1: The null boundary n-cell CA, configured with rule 15 and 240 in any sequence, forms SACA. The depth of the CA is n.

B. SACA rule selection for $R_m$, $R_i$ & $R_s$

[...] seed (initial seed), for Case 1-6. The LSB '0' signifies Case 1, 2, 3 & 6 and '1' is for Case 4 & 5. However, Cond[...] set of Case 1, 2 & 3 and Case 4 attractor must be in the [...]

Stage 1: (a) To differentiate Case 6 and Case 1, 2 & 3 at [...] As the rule 192 & 207 follow Property 3, the attractor of the hybrid CA (Case 6) is different from the attractor of the (n+2)-cell uniform CA with rule 192 (Figure 8). The LSB of the attractors for Case 1, 2 & 3 is 0 (CH) but it is 1 for Case 6 (ICH) (here the terminal cells are set to 192).

(b) To separate out Case 4 and Case 5 at the completion [...]
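Theorem 1 can be checked by brute force for small n. The Python sketch below simulates a null boundary CA (the missing neighbors of the terminal cells are taken as 0), enumerates all 2^n states, and reports the attractor and depth when the CA is an SACA. The helper names `step` and `saca_depth` are hypothetical, and exhaustive enumeration is used here only for illustration; it is not how the paper synthesizes the CA.

```python
from itertools import product


def step(state: tuple[int, ...], rules: list[int]) -> tuple[int, ...]:
    """One synchronous update of a null boundary, 3-neighborhood CA."""
    n = len(state)
    nxt = []
    for i, rule in enumerate(rules):
        left = state[i - 1] if i > 0 else 0       # null boundary on the left
        right = state[i + 1] if i < n - 1 else 0  # null boundary on the right
        rmt = (left << 2) | (state[i] << 1) | right
        nxt.append((rule >> rmt) & 1)
    return tuple(nxt)


def saca_depth(rules: list[int]):
    """Return (attractor, depth) if the CA is an SACA, else None.

    SACA here means: exactly one state maps to itself, and every other
    state reaches it. Depth is the largest number of steps needed.
    """
    states = list(product((0, 1), repeat=len(rules)))
    fixed = [s for s in states if step(s, rules) == s]
    if len(fixed) != 1:
        return None
    attractor, depth = fixed[0], 0
    for s in states:
        steps = 0
        while s != attractor:
            s = step(s, rules)
            steps += 1
            if steps > len(states):
                return None  # the state is trapped in a multi-length cycle
        depth = max(depth, steps)
    return attractor, depth


if __name__ == "__main__":
    for rules in ([240, 15, 15, 240], [15] * 4, [192] * 4, [204] * 4):
        print(rules, saca_depth(rules))
```

For every sequence of rules 15 and 240 up to the sizes tried, the check returns a single attractor with depth n, as Theorem 1 states. A uniform rule 204 CA is rejected because every state is its own attractor.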
Figure 10. CA based two-stage verification of data inconsistencies (Stage 0: Rm=240, Ri=15, Rs=14; Stage 1: Rm=207, Ri=192, Rs=192 and Rm=192, Ri=192, Rs=207; next state logic shown: Si−1.Si, S'i−1 + Si, and S'i−1.Si + S'i−1.Si+1; attractor LSB 0 or 1; outputs CH and ICH)

[...] $R_m$ = 192, $R_i$ = 192 and $R_s$ = 207.

Here, the presence of $R_s$ in the (n+2)-cell (terminal cells are with rule 192) hybrid CA generates an attractor with LSB '1' (Figure 8(a)) that corresponds to Case 5 (ICH). Case 4 corresponds to the uniform CA with rule 192 and its attractor is '0'.

The two stage process is described in Figure 10. The hardware realization is reported in the next subsection.

C. Realization of the test unit

The states M, I and S of a cache line are represented as 00, 10 and 11 respectively (01 is a don't care state). At Stage 0 (encoded as '0') the cache line B's state (at Ci) '00' implies the rule 240 ($R_m$) is to be set, it is 15 ($R_i$) for state [...] 11) is 192 (Figure 10). In Figure 11, this sets the select lines of the MUX as 0011. It implies the output of the MUX is $S_{i-1} \cdot S_i$, that is, the rule 192. The part of the next state logic, [...]

Figure 11. Hardware realization of test unit (a MUX with data lines 0 to 15 and one Output; select lines formed by the 1-bit attractor, the 1-bit stage and the 2-bit state Pi1 Pi0 at Ci; states of cache line B at different caches (2n-bit))

[...] we choose all the six cases (Case 1 - Case 6 of section IV).

[...] This work proposes an efficient solution for detecting faults in the logic entrusted with cache coherence. The [...]
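The next state logic labelled in Figure 10 can be cross-checked against the rule numbers, and the Stage 0 rule assignment can be written as a lookup on the 2-bit state encoding. The Python sketch below is illustrative only. The Boolean forms for rules 192, 207 and 14 are read from Figure 10; the forms for rules 240 and 15 are derived here from their truth tables. The names `BOOLEAN_FORM`, `rule_number`, `STAGE0_RULE` and `stage0_rules` are hypothetical, and mapping the encodings 00, 10, 11 to $R_m$, $R_i$, $R_s$ assumes the subscripts stand for the M, I and S states.

```python
from itertools import product

# Next state logic per rule, with l = S(i-1), s = S(i), r = S(i+1);
# a prime in the figure (complement) is written as 1 - x.
BOOLEAN_FORM = {
    240: lambda l, s, r: l,                              # S(i-1)
    15:  lambda l, s, r: 1 - l,                          # S'(i-1)
    14:  lambda l, s, r: ((1 - l) & s) | ((1 - l) & r),  # S'(i-1).S(i) + S'(i-1).S(i+1)
    192: lambda l, s, r: l & s,                          # S(i-1).S(i)
    207: lambda l, s, r: (1 - l) | s,                    # S'(i-1) + S(i)
}


def rule_number(f) -> int:
    """Decimal rule number of a 3-input Boolean next state function."""
    return sum(
        f(l, s, r) << ((l << 2) | (s << 1) | r)
        for l, s, r in product((0, 1), repeat=3)
    )


# Stage 0 rule per encoded state of the cache line: M=00, I=10, S=11.
STAGE0_RULE = {0b00: 240, 0b10: 15, 0b11: 14}


def stage0_rules(encoded_states: list[int]) -> list[int]:
    """Stage 0 rule vector for the encoded states at caches C1..Cn."""
    return [STAGE0_RULE[e] for e in encoded_states]  # 01 (don't care) raises KeyError


if __name__ == "__main__":
    for rule, f in BOOLEAN_FORM.items():
        print(rule, rule_number(f))
    print(stage0_rules([0b00, 0b10, 0b10, 0b11]))
```

Each Boolean form evaluates back to its own rule number, which confirms that the expression $S_{i-1} \cdot S_i$ in the text is rule 192.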
Table IV
EXPERIMENTAL RESULTS

REFERENCES

[2] Hui Wang, Sandeep Baldawa, Rama Sangireddy. Dynamic Error Detection for Dependable Cache Coherency in Multicore Architecture. VLSI Design conference, January 2008.

[3] Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John B. Carter. Interconnect-Aware Coherence Protocols for Chip Multiprocessors. The 33rd IEEE International Symposium on Computer Architecture (ISCA'06), 2006.

[4] Akira Yamawaki, Masahiko Iwane. Coherence Maintenances to realize an efficient parallel processing for a Cache Memory with Synchronization on a Chip Multiprocessor. ISPAN, 2005.

[5] Jichuan Chang, Gurindar S. Sohi. Cooperative Caching for Chip Multiprocessors. ISCA, 2006.

[6] Rana Ejaz Ahmed. Energy-Aware Cache Coherence Protocol for Chip-Multiprocessors. IEEE CCECE/CCGEI, Ottawa, May 2006.

[7] Alberto Ros, Manuel E. Acacio, José M. García. A Direct Coherence Protocol for Many-Core Chip Multiprocessors. IEEE Trans on Parallel and Distributed Systems, VOL. 21, NO. 12, December 2010.

[8] Linzhi Ning, Wenbin Yao, Jun Ni, Nianmin Yao. Fault-Tolerant CMP Architecture Based on SMT Technology. IEEE International Multisymposium on Computer and Computational Sciences, 2007.

[9] Rui Gong, Kui Dai, Zhiying Wang. Transient Fault Recovery on Chip Multiprocessor based on Dual Core Redundancy and Context Saving. IEEE Int Conference for Young Computer Scientists, 2008.

[10] Ransford Hyman, Koustav Bhattacharya, Nagarajan Ranganathan. Redundancy Mining for Soft Error Detection in Multicore Processors. IEEE Trans on Computers, VOL. 60, NO. 8, August 2011.

[11] Pramod Subramanyan, Virendra Singh, Kewal K. Saluja, Erik Larsson. Energy-Efficient Fault Tolerance in Chip Multiprocessors Using Critical Value Forwarding. 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[12] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, 3rd Edition. Morgan Kaufmann, 2003.

[13] S. Wolfram. Cellular Automata and Complexity: Collected Papers. Addison Wesley, 1994.

[14] P Pal Chaudhuri, D Roy Chowdhury, S Nandi, and S Chatterjee. Additive Cellular Automata: Theory and Applications, volume 1. IEEE Computer Society Press, California, USA, ISBN 0-8186-7717-1, 1997.

[15] Sukanta Das, Nazma N Naskar, Sukanya Mukherjee, Mamata Dalui and Biplab K Sikdar. Characterization of CA Rules For SACA Targeting Detection of Faulty Nodes In WSN. ACRI-2010.