Error Control for Network-on-Chip Links

Bo Fu
Marvell Semiconductor, Inc.
5488 Marvell Lane
Santa Clara, CA 95054, USA
bofu.ur@gmail.com

Paul Ampadu
Department of Electrical and Computer Engineering
University of Rochester
Rochester, NY 14627, USA
ampadu@ece.rochester.edu

ISBN 978-1-4419-9312-0 e-ISBN 978-1-4419-9313-7


DOI 10.1007/978-1-4419-9313-7
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011936003

© Springer Science+Business Media, LLC 2012


All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they
are not identified as such, is not to be taken as an expression of opinion as to whether or not they are
subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To Luzviminda, Luzann, Majelia
and Paul Jr for Love, Patience,
Courage and Dedication
Preface

Traditional bus-based infrastructures can no longer handle the intensive
communication among various modules, as hundreds and even thousands of intellectual
property (IP) cores are integrated on a single chip. Network-on-Chip (NoC)
is emerging as an efficient solution to solve the aggravating scalability and conten-
tion issues of on-chip communication. With technology scaled into the nanometer
regime, the physical links in NoCs are facing important design challenges of delay,
power and reliability. The purpose of this book is to present current solutions
addressing reliability issues in on-chip communications.
Reliability is an important issue in NoC design. For example, errors in the header
of a packet may lead to loss of the packet. The reliability of on-chip communication
can be addressed at different NoC layers, such as the physical layer, data link layer
and network layer. This book focuses on techniques applied to the data link layer.
Error control coding is a common technique used in the data link layer to provide
reliable on-chip communication. With the shrinking link feature size, on-chip
interconnects are becoming susceptible to multiple random and burst errors, requir-
ing more powerful error control codes (ECCs) than those previously used. At the
same time, the energy consumption of on-chip interconnect is becoming an increas-
ingly large portion of on-chip power dissipation, motivating the need for more
energy efficient communication solutions. In this book, we present energy-efficient
error control approaches for on-chip interconnects. We introduce a method of
combining extended Hamming product codes with type-II hybrid automatic repeat
request (HARQ). This method provides a strong error correction capability against
multiple random and burst errors, while keeping the hardware overhead reasonable.
The combination of extended Hamming product codes with type-II HARQ has been
shown to meet the same reliability requirements as previous solutions while using a
lower link swing voltage to reduce energy consumption. The extended Hamming
product codes can also be integrated into a configurable error control scheme by
combining them with a traditional Hamming code. The different coding strengths
provided by this realization can achieve better energy performance in the presence
of varying noise conditions.

Capacitive crosstalk coupling increases greatly as the interconnect aspect
ratio grows with each scaled technology node. Capacitive crosstalk coupling can cause delay
uncertainty, which greatly decreases the system performance resulting in timing
errors. ECCs have been successfully applied to improve the reliability of on-chip
interconnect by correcting logic errors. Unfortunately, conventional ECCs are not as
efficient in addressing delay uncertainty caused by capacitive crosstalk coupling.
In this book, we also present methods that simultaneously address logic errors and
crosstalk-induced delay uncertainty. We introduce a method of combining ECCs with
conventional skewed transitions. Here, the inherent skew resulting from the ECC
parity generation is exploited to ensure that no two adjacent wires switch in opposite
directions simultaneously, thereby reducing worst-case on-chip capacitive coupling.
This method can reduce the overhead of conventional skewed transitions by hiding the
delay insertion overhead in parity calculations. Compared with other solutions that
simultaneously handle logic errors and delay uncertainty, this method requires fewer
wires, resulting in smaller link area and energy consumption.
This book is based on the first author’s Ph.D. dissertation completed at the
University of Rochester. The research work was supported in part by the U.S.
National Science Foundation under grant NSF-ECCS-0733450.
The authors would like to thank friends and colleagues Prof. Eby Friedman, Prof.
Chen Ding, and Prof. Thomas Tucker for their invaluable suggestions during the
writing of the dissertation that led to this version of the book. Dr. Bo Fu expresses
gratitude to his exceptional colleagues, Dr. David Wolpert (now at IBM) and
Dr. Qiaoyan Yu (now at UNH), for their productive, supportive and enjoyable
collaborations during his studies in the Embedded Integrated System-on-Chip
(EdISon) research group at the University of Rochester. Dr. Fu also thanks friends,
Lin Zhang, Chao Yu, Qiang Sun, Xin Li, Gaojie Lu, Xiaohua Zhang and Fan Yang
for their help and friendship. His deepest gratitude and immense appreciation go
to his parents and his wife for their constant encouragement and unwavering support.
Many thanks also to graduate student Meilin Zhang for his assistance in formatting
the book and to Charles B. Glaser from Springer for his support and assistance
throughout.

Santa Clara, CA, USA Bo Fu


Rochester, NY, USA Paul Ampadu
Contents

1 Introduction
  1.1 Impact of Scaling on Interconnect Parameters
  1.2 Reliability Issues for On-Chip Interconnect
  1.3 Types of Errors and Error Models
    1.3.1 Types of Errors
    1.3.2 Error Models
  1.4 Book Overview
  References
2 Solutions to Improve the Reliability of On-Chip Interconnects
  2.1 Wire Sizing and Spacing
  2.2 Shielding
  2.3 Repeater Insertion
  2.4 Crosstalk Avoidance Codes (CACs)
  2.5 Skewed Transitions
  2.6 Error Control Coding Schemes
    2.6.1 Automatic Repeat Request (ARQ)
    2.6.2 Forward-Error Correction (FEC)
    2.6.3 Hybrid ARQ (HARQ)
  2.7 Spare Wires
  References
3 Networks-on-Chip (NoC)
  3.1 Bus Based On-Chip Communication
  3.2 NoC Design
    3.2.1 NoC Architectures
    3.2.2 Routing and Switching Techniques
    3.2.3 Router Design
  3.3 Reliability in NoC Links
  References
4 Error Control Coding for On-Chip Interconnects
  4.1 Error Control Coding Basics
    4.1.1 Field
    4.1.2 Linear Block Codes
    4.1.3 Systematic Codes
    4.1.4 Hamming Distance
    4.1.5 Code Modification
  4.2 Error Control Codes for On-Chip Interconnect
    4.2.1 Single Parity Check (SPC) Codes
    4.2.2 Duplicate-Add-Parity (DAP) Code
    4.2.3 Hamming Codes
    4.2.4 Hsiao Codes
    4.2.5 SEC Codes with Interleaving
    4.2.6 Cyclic Codes
    4.2.7 Bose-Chaudhuri-Hocquenghem (BCH) Codes
    4.2.8 Reed-Solomon (RS) Codes
    4.2.9 Hamming Product Codes
  References
5 Energy Efficient Error Control Implementation
  5.1 Error Control Coding with Low Link Swing Voltage System
  5.2 Error Control Coding with Dynamic Voltage Swing Scaling System
  5.3 Product Codes with Type-II ARQ
    5.3.1 The Principle
    5.3.2 Extended Hamming Product Codes with Type-II HARQ
    5.3.3 Performance Evaluation
  5.4 Configurable Error Control System
    5.4.1 Principle
    5.4.2 Configurable Encoder Design
    5.4.3 Configurable Decoder Design
    5.4.4 Performance Evaluation
  5.5 Summary
  References
6 Combining Error Control Codes with Crosstalk Reduction
  6.1 Duplicate-Add-Parity (DAP) Codes
  6.2 Boundary Shift Code (BSC)
  6.3 Crosstalk Avoidance and Multiple Error Correction Code (CAMEC)
  6.4 Unified Coding Framework
    6.4.1 Forbidden Overlap Condition (FOC) Codes
    6.4.2 Forbidden Transition Condition (FTC) Codes
    6.4.3 Forbidden Pattern Condition (FPC) Codes
    6.4.4 Performance Evaluation
  6.5 Error Control Codes with Skewed Transitions
    6.5.1 The Principle
    6.5.2 Data Mapping Algorithm
    6.5.3 Performance Evaluation
  6.6 Summary
  References

List of Symbols

Index

Chapter 1
Introduction

On-chip interconnects play an important role in the performance of current VLSI
systems. As technology scales into the nanoscale regime, interconnects face several
design challenges in terms of delay, power and reliability [1–3].

1.1 Impact of Scaling on Interconnect Parameters

An interconnect can be characterized by three electrical properties – resistance R,
capacitance C and inductance L. The resistance R is calculated using (1.1),

$$R = \frac{\rho \, L_{int}}{T_{int} \, W_{int}} \qquad (1.1)$$

where ρ is the resistivity of the metal, and Lint, Tint, and Wint are the interconnect length,
thickness, and width, respectively, as shown in Fig. 1.1. H is the distance between the
interconnect and the ground plane.
From (1.1), the resistance R of a wire increases as Tint and Wint are reduced.
As technology scales, the resistivity ρ of a metal interconnect can also increase [4].
This phenomenon is caused by carrier collisions when the thickness of a wire
approaches the mean free path of electrons. In addition, the increased clock frequency
aggravates the skin effect [5], in which the current flows mainly near the surface of
the wire. The skin effect reduces the effective cross-sectional area that carries current,
further increasing the wire resistance. Figure 1.2 shows that the resistance
increases greatly with technology scaling. A large wire resistance greatly
increases the interconnect delay and also causes large signal attenuation.
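
As a rough illustration of (1.1), the sketch below evaluates the wire resistance before and after halving both Tint and Wint. The function name and all numerical values are assumptions chosen for illustration (a bulk-copper resistivity and round dimensions), not figures from the text, and the sketch ignores the resistivity increase and skin effect described above.

```python
def wire_resistance(rho, length, thickness, width):
    """Resistance in ohms of a rectangular wire, following Eq. (1.1)."""
    return rho * length / (thickness * width)


# Illustrative values only (assumptions, not taken from the book).
RHO_CU = 1.7e-8      # ohm*m, approximate bulk copper resistivity
LENGTH = 1e-3        # 1 mm wire

base = wire_resistance(RHO_CU, LENGTH, thickness=200e-9, width=100e-9)
scaled = wire_resistance(RHO_CU, LENGTH, thickness=100e-9, width=50e-9)

# Halving both cross-section dimensions quadruples the resistance.
print(f"base   = {base:.0f} ohm")
print(f"scaled = {scaled:.0f} ohm")
print(f"ratio  = {scaled / base:.1f}")
```

Because the cross-section appears in the denominator, the geometric effect alone already gives a quadratic growth in resistance for a fixed wire length.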
The interconnect capacitance C of a wire consists of the parallel plate capacitance
Cg1, the fringing capacitance Cg2 and the sidewall capacitance CC, as shown in Fig. 1.3.
The parallel plate capacitance Cg1 refers to the capacitance between the metal wire and
the substrate or ground, which is proportional to (Wint·Lint)/H. The sidewall capacitance
CC refers to the coupling capacitance between two adjacent wires on the same metal
layer, which is proportional to (Tint·Lint)/Sint.

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links,
DOI 10.1007/978-1-4419-9313-7_1, © Springer Science+Business Media, LLC 2012

Fig. 1.1 Dimensional parameters of single interconnect

Fig. 1.2 Resistance value with technology scaling [6]

Fig. 1.3 Parallel plate, fringing and coupling interconnect capacitances

As technology scales, the interconnect thickness Tint decreases at a slower rate
than the interconnect width Wint and spacing Sint [7]. Thus, the interconnect aspect
ratio, defined as the ratio of Tint to Wint, increases with each technology node.
The increased interconnect aspect ratio has caused an increase in capacitive coupling
effects, which can greatly affect the reliability of on-chip interconnects.
The inductance L of an interconnect is caused by the current loop formed by the
signal wire and its return path. As multi-GHz clock frequencies are widely used in
current VLSI systems, inductive effects become significant, especially for long
and wide global wires [8, 9]. Inductive effects increase design complexity, because it
is difficult to extract inductance accurately. Further, inductive effects can extend
over a long distance, which exacerbates crosstalk coupling [10].

Fig. 1.4 Interconnect delay with technology scaling [11]

Technology scaling has a significant impact on the performance of on-chip
interconnects [11, 12]. The rise in wire resistance increases the RC time constant,
resulting in a delay increase for a fixed length of interconnect. Moreover, the length of
global interconnects grows as the chip size increases to integrate more components.
When the chip scaling factor is taken into account, the delay of global interconnects
increases by a factor of $S^2 S_c^2$, where S is the technology scaling factor and Sc is the
chip size scaling factor [7]. Figure 1.4 shows the scaling trend of the gate delay, local
wire delay, and global wire delay with and without repeater insertion. Global interconnect
delay increases considerably relative to logic gate delay and becomes the performance
bottleneck with technology scaling.
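
One way to see where a factor of this form can come from is the first-order sketch below. It rests on assumptions that are ours rather than stated in the text: the cross-section dimensions Wint and Tint shrink by 1/S, the global wire length grows by Sc, and the capacitance per unit length stays roughly constant. Primed symbols denote the scaled wire.

```latex
\begin{align*}
R'    &= \frac{\rho\,(S_c L_{int})}{(T_{int}/S)\,(W_{int}/S)} = S^2 S_c\, R \\
C'    &\approx S_c\, C
        \quad \text{(capacitance per unit length assumed unchanged)} \\
R'C'  &\approx S^2 S_c^2\, RC
\end{align*}
```

Under these assumptions the RC time constant of a global wire grows with the square of both scaling factors, which matches the trend quoted from [7].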
Power consumption is another critical challenge facing on-chip interconnects
in nanoscale systems [13, 14]. A large portion of the total chip power
is consumed by on-chip interconnects because a large interconnect capacitance is
charged and discharged every time a transition occurs. The use of large
repeaters to reduce the delay of global interconnects further increases the power
consumption of on-chip interconnects. Figure 1.5 shows the dynamic power
breakdown of the UltraSPARC T3 SoC processor, which has 16 SPARC cores and uses a 40 nm
technology [15]. The power consumption of the interconnect is the same as the power
consumption of the logic gates. Figure 1.6 shows the total interconnect length integrated
into a chip with technology scaling. Because wire capacitance is linearly related
to wire length, the increased interconnect capacitance greatly increases the
power consumed by on-chip interconnects.

Fig. 1.5 UltraSPARC T3 and its dynamic power breakdown [15]

Fig. 1.6 Total interconnect length on a chip as technology scales [11]

1.2 Reliability Issues for On-Chip Interconnect

Interconnect reliability issues are caused by manufacturing defects [16, 17] or a variety
of noise sources, such as external radiation [18, 19], crosstalk coupling [8, 9, 20–22],
supply voltage fluctuations [23, 24], process variations [25–31], temperature variations
[4, 32], electromagnetic interference (EMI) [33] and combinations of these sources.
Fig. 1.7 Manufacturing defects in interconnects

An imperfect manufacturing process can cause defects in on-chip interconnects. Figure 1.7
shows a metal sliver and a crack in on-chip interconnects caused by manufacturing
defects. A metal sliver is a small piece of extra metal left between two metal wires
during the manufacturing process. When the metal temperature increases, the metal
sliver expands and can touch both wires, causing a short connection.
A crack is another common manufacturing defect, caused
by material stress. A crack can cause an open connection. The rising
probability of manufacturing defects in nanoscale technology
decreases the manufacturing yield of large-area chips. Techniques to improve
yield must therefore be considered.
Noise sources affect the reliability of on-chip interconnect in two ways – signal
integrity [2] and delay uncertainty [34]. Noise sources reduce the signal integrity
of on-chip interconnect by inducing voltage glitches. If a voltage glitch is greater
than the tolerable noise margin of the circuit and has a sufficient duration, it
can cause logic errors. Delay uncertainty refers to an unknown fluctuation in
the timing of a signal transition. Delay uncertainty decreases the system operating
frequency because a large design margin is required to guarantee correct operation.
In nanoscale technology, crosstalk-induced delay uncertainty can be a critical
bottleneck for the operation of high speed synchronous systems.
Crosstalk coupling is one of the most important factors affecting reliability
of on-chip interconnects. Crosstalk coupling is caused by the mutual capacitance
or mutual inductance between wires [10]. As technology scales, the increased
interconnect aspect ratio has caused an increase in capacitive coupling effects.
Inductive coupling occurs when signal switching causes a change in magnetic
field. The gigahertz clock frequencies in nanoscale technology result in a non-
negligible inductive effect in on-chip interconnects [8, 9]. Unlike capacitive
coupling, inductive coupling can be a long range phenomenon and is more
important in the presence of wide busses [10].

Fig. 1.8 Noise waveforms of crosstalk coupling between two coupled lines [9]

Fig. 1.9 Soft errors caused by particle strikes

Crosstalk coupling effects can induce significant voltage glitches on a victim
line. Figure 1.8 shows noise waveforms resulting from capacitive and inductive
coupling between two fully coupled lines [9]. The peak noise voltage can exceed
20% of the supply voltage, potentially inducing logic errors. Crosstalk-induced
delay uncertainty is mainly caused by the dependence of coupling capacitance on
signal switching patterns. Depending on the switching behavior of a wire and its
neighbors, the effective capacitance Ceff of the wire can range from Cg to Cg + 4Cc
[35] (where Cg = Cg1 + Cg2). The best-case Ceff occurs when all three adjacent wires
switch in the same direction. The worst-case Ceff occurs when there is a transition
010→101 (or 101→010) on three adjacent wires. The dependence of Ceff on signal
switching patterns can result in up to a 50% change in on-chip interconnect delay [36].
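
The pattern dependence of Ceff can be sketched with a common first-order switching-factor model, which is an assumption here rather than the model given in [35]: each neighbor contributes 0, 1 or 2 times Cc depending on whether it switches with the victim, stays quiet, or switches in the opposite direction. The function name and capacitance values are hypothetical.

```python
def effective_capacitance(prev, curr, c_g, c_c):
    """Effective capacitance seen by the middle (victim) wire of three.

    prev and curr are 3-character bit strings such as '010' and '101'.
    Assumed model: each neighbor adds |victim_delta - neighbor_delta| * c_c.
    """
    def delta(i):
        # +1 for a rising edge, -1 for a falling edge, 0 for no transition
        return int(curr[i]) - int(prev[i])

    victim = delta(1)
    if victim == 0:
        raise ValueError("the victim wire does not switch")

    c_eff = c_g
    for neighbor in (0, 2):
        c_eff += abs(victim - delta(neighbor)) * c_c
    return c_eff


C_G, C_C = 1.0, 0.5  # arbitrary units, illustration only

best = effective_capacitance("000", "111", C_G, C_C)    # all wires rise together
quiet = effective_capacitance("000", "010", C_G, C_C)   # neighbors stay quiet
worst = effective_capacitance("010", "101", C_G, C_C)   # neighbors switch opposite
print(best, quiet, worst)
```

With these inputs the best case evaluates to Cg and the worst case to Cg + 4Cc, consistent with the range stated above.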
Figure 1.9 shows an example of a soft error caused by a particle strike. As technology
scales, integrated circuits become more vulnerable to soft errors caused by particle
strikes [18], such as alpha particles and neutrons. As the node capacitance decreases in
nanoscale technology, a smaller injection of charge can induce errors (the amount of
charge required to induce an error is referred to as the critical charge Qcrit). This is
exacerbated by the scaling of supply voltage, which decreases circuit noise margins
and makes circuits more susceptible to particle strikes. Higher clock frequencies and deeply
pipelined designs also increase the probability that a fault resulting from a strike
will be latched by flip-flops, creating errors. When crosstalk coupling is considered,
particle strikes become even more problematic [37, 38]. In the presence of a transient
error caused by an alpha particle or neutron strike, crosstalk coupling may propagate
this error to other parts of the circuit by inducing large voltage glitches on neighboring
wires, as shown in Fig. 1.10. The coupling effect increases the probability that
multiple adjacent errors (also referred to as a burst error) are caused by a single
particle strike.

Fig. 1.10 A single event transient (SET) jumping from one interconnect to another in the presence
of crosstalk [37]

The probability of errors caused by electromigration also increases in nanoscale
technology [3, 39, 40]. Electromigration is the gradual displacement of metal atoms
caused by momentum transfer from the dense flow of electrons in interconnects.
Over time, this atomic displacement can result in opens or shorts. With the aggressive
scaling of interconnect dimensions, the current density within these interconnects
increases significantly. This rise in current density, combined with the use of low-k
dielectrics (which have lower thermal conductivities [4]), results in a significant
increase in the metal temperature. The large rise in metal temperature exacerbates
electromigration, degrading system lifetime [4].
The impact of process variations on interconnect is also expected to increase as
technology scales [25–27]. Process variations in interconnects are caused by the
imperfect processes of photolithography, planarization and metal etching. Variations
in the geometric parameters of interconnects lead to a change in interconnect resis-
tance and capacitance, which causes a variation in the delay of on-chip interconnects.
The variations in interconnect delay may lead to timing closure problems. The
variations in interconnect dimensions also increase the probability of opens caused
by electromigration in narrower sections of a wire. Moreover, device parameter
variations introduce delay variations in the drivers and repeaters, which can also
result in link delay errors.
Supply voltage fluctuations are another factor affecting the signal integrity of on-chip
interconnects [23, 24, 41]. Supply voltage fluctuations can affect the performance and
noise margins of drivers and repeaters, increasing the susceptibility to both logic
and timing errors. Power supply noise has two components – low
frequency and high frequency. The low frequency component is known as IR
drop, which is the reduction in voltage caused by current passing through a resistive
line. The high frequency component is known as $L\,\partial i/\partial t$ noise, which is caused by
the inductance seen by currents flowing through the chip power grid. A sudden
current demand caused by the simultaneous switching of a large number of logic gates
results in large $L\,\partial i/\partial t$ noise.
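
The two supply noise components can be put side by side with a back-of-the-envelope sketch. The function names and every numerical value below are assumptions chosen only to show the arithmetic, and the inductive term uses a finite-difference approximation of the current slope.

```python
def ir_drop(current_a, resistance_ohm):
    """Low frequency component: voltage lost across a resistive supply line."""
    return current_a * resistance_ohm


def ldidt_noise(inductance_h, delta_i_a, delta_t_s):
    """High frequency component: L * (change in current / change in time)."""
    return inductance_h * delta_i_a / delta_t_s


# Illustrative values only (assumptions, not taken from the book).
static_drop = ir_drop(current_a=2.0, resistance_ohm=0.01)
switching_noise = ldidt_noise(inductance_h=10e-12, delta_i_a=1.0, delta_t_s=100e-12)

print(f"IR drop     = {static_drop * 1e3:.1f} mV")
print(f"L di/dt     = {switching_noise * 1e3:.1f} mV")
```

The sketch shows why a sudden current demand matters: even a very small inductance produces a noticeable voltage when the current changes within a fraction of a nanosecond.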
On-chip interconnect reliability issues can also be caused by temperature variations
[32]. Interconnect resistance is linearly dependent on temperature. On-chip
temperature variations result in different wire resistances, which cause delay
uncertainty. It has been reported that thermal gradients can be as large as 50 °C across a high
performance microprocessor substrate [32]. Thus, it is very important to take into
account the impact of temperature variations on interconnect performance.
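
To give a feel for the size of this effect, the sketch below applies the usual linear temperature model R(T) = R0·(1 + α·ΔT). The temperature coefficient is an assumed, approximate bulk-copper value and the reference resistance is arbitrary; on-chip thin wires may behave differently.

```python
def resistance_at_temperature(r_ref, alpha, delta_t):
    """Linear temperature model: R = R_ref * (1 + alpha * delta_t)."""
    return r_ref * (1.0 + alpha * delta_t)


ALPHA_CU = 0.0039   # per degree C, approximate bulk copper value (assumption)
R_REF = 100.0       # ohms at the reference temperature (arbitrary)

r_hot = resistance_at_temperature(R_REF, ALPHA_CU, delta_t=50.0)
print(f"Resistance change across a 50 degree gradient: {100 * (r_hot / R_REF - 1):.1f}%")
```

Under these assumptions, two otherwise identical wires at opposite ends of a 50 °C gradient differ in resistance by roughly a fifth, which translates directly into an RC delay mismatch.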
All of these factors decrease the reliability of on-chip interconnects in
nanoscale technology. Design techniques, which can improve the reliability of
on-chip interconnects, should be considered.

1.3 Types of Errors and Error Models

1.3.1 Types of Errors

Depending on their duration, errors can be divided into three classes – transient,
permanent and intermittent. Transient errors, which are also called soft errors, are
short-term malfunctions temporarily induced by external radiation or electrical
noises rather than manufacturing defects [3, 42, 43].
Transient errors can be caused by neutron or alpha particle strikes. As the node
capacitance decreases with technology scaling, a smaller injection of charge can
induce errors. It has been shown that transient errors caused by particle strikes
increase by two orders of magnitude from a 180 nm technology to a 45 nm
technology [44]. Also, the probability of multiple errors caused by a single particle
strike increases in nanoscale technology [45].
Crosstalk coupling is another factor that causes transient errors. In nanoscale
technology, capacitive coupling effects increase with rising interconnect aspect
ratios. A high clock frequency results in a non-negligible inductive coupling.
Instead of inducing single errors, crosstalk coupling can cause spatial burst errors,
which occur in multiple adjacent wires. Crosstalk coupling can also propagate
transient errors caused by particle strikes from a victim wire to its neighbors,
further increasing the probability of multiple errors in on-chip interconnects.

Transient errors can also be caused by other noise sources such as process
variations, supply voltage fluctuation, electromagnetic interference (EMI), and
electrostatic discharge [42]. As technology scales, impacts of these noise sources
are expected to increase because of smaller feature sizes, lower supply voltage and
higher clock frequency.
Permanent errors are irreversible malfunctions caused by physical changes; once
permanent errors occur, they will not disappear. Permanent errors are usually a
result of manufacturing defects, which can be detected during manufacturing test.
However, permanent errors can also occur at run-time (e.g., caused by electro-
migration or aging) [3]. An efficient approach to fix permanent errors is to use spare
wires [17, 46].
Intermittent errors are long-duration errors (but not permanent [3]) occurring
in the same position. Intermittent errors are usually activated by voltage or
environmental (e.g., temperature) changes or specific input patterns. Intermittent
errors can lead to the occurrence of permanent errors. For example, electromigration
usually causes timing errors resulting from increased resistance, before it finally
breaks down the link and creates an open.
The occurrence of permanent and intermittent errors decreases the efficiency of
error control schemes. For example, error detection and retransmission (EDR) can
be used to address transient errors; but it can be ineffective against intermittent
errors (a system may stall while sending many retransmissions of a single piece of
data) and EDR fails to work in the presence of permanent errors. Intermittent
and permanent errors also reduce the capability of error control codes to tolerate
transient errors and require more powerful error control codes, which lead to large
power and area overheads.

1.3.2 Error Models

Modeling error rates of on-chip interconnects can be difficult, because it requires
knowledge of the various noise sources and their dependence upon the supply
voltage. In [47, 48], a simplified model is applied by assuming that all the noise
effects on a wire can be modeled as a normally distributed noise voltage $V_N$ with standard
deviation $\sigma_N$. The probability of an error occurring in this model (shown by the
shaded area in Fig. 1.11) is the sum of two components – the probability of noise
causing a logic low to exceed the gate switching threshold voltage ($V_{dd}/2$), and
the probability of noise causing a logic high to fall below the gate switching
threshold voltage. The probability $\varepsilon$ of a single wire being erroneous during a
transition can be expressed by a Gaussian pulse function [47],

$$
\varepsilon = Q\left(\frac{V_{swing}}{2\sigma_N}\right) = \int_{V_{swing}/(2\sigma_N)}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y^{2}/2}\, dy \tag{1.2}
$$

Fig. 1.11 Error probability of independent error model

Fig. 1.12 Multiple adjacent errors caused by a noise source

where $V_{swing}$ is the link swing voltage and $\sigma_N$ is the standard deviation of the noise
voltage. In this model, the error probability in each wire is assumed to be
independent.
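As a rough numerical illustration of (1.2), the sketch below evaluates the Q-function with the complementary error function from Python's standard library. The function names and the sample swing voltages and noise standard deviation are assumptions chosen for illustration, not values taken from [47, 48].

```python
import math

def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(Y > x) for Y ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def wire_error_probability(v_swing: float, sigma_n: float) -> float:
    """Per-wire error probability of Eq. (1.2): Q(V_swing / (2 * sigma_N))."""
    return q_function(v_swing / (2.0 * sigma_n))

# Hypothetical operating points: lowering the swing raises the error probability.
SIGMA_N = 0.1  # assumed noise standard deviation in volts
for v_swing in (1.2, 1.0, 0.8, 0.6):
    eps = wire_error_probability(v_swing, SIGMA_N)
    print(f"V_swing = {v_swing:.1f} V -> error probability {eps:.3e}")
```

Because the errors on different wires are assumed independent in this model, the same per-wire value applies to every wire of the link.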
As technology scales, the probability that a single noise source causes errors in
multiple neighboring wires increases [49–54]. As shown in Fig. 1.10, a single particle
strike can cause multiple errors because of crosstalk coupling effects [37, 38]. Thus, a
more realistic error model should include spatial burst errors, where multiple adjacent
wires are erroneous. Equation 1.2 can be extended to include burst errors. Instead of
only affecting one wire, the noise source is modeled to affect its neighbors.
The effect on neighboring wires can be described by a coupling probability Pn
[54], as shown in Fig. 1.12. The higher Pn, the more likely the noise source causes
errors in multiple neighboring wires. The probability of the noise source causing an
error in wire $l_n$ alone (a burst of length $b = 1$) can be expressed as (1.3) below,

$$
P(b = 1) = (1 - P_n)^{2}\,\varepsilon \tag{1.3}
$$

where $\varepsilon$ is the probability of a single wire being erroneous if no coupling
effects are considered.

Fig. 1.13 Residual flit error rate of Hamming code for different error models; the coupling
probability $P_n = 10^{-2}$ in the dependent error model

The probabilities of two- and three-wire errors caused by the same noise source,
$P(b = 2)$ and $P(b = 3)$, can be expressed as (1.4) and (1.5), respectively,

$$
P(b = 2) = 2\,P_n (1 - P_n)\,\varepsilon \tag{1.4}
$$

$$
P(b = 3) = P_n^{2}\,\varepsilon \tag{1.5}
$$

The probability of the same noise source at $l_n$ also affecting $l_{n+2}$ and $l_{n-2}$ is
usually much smaller than $P_n$, so we ignore the probability of a noise source
causing burst errors of four bits or more, i.e., $P(b \geq 4 \mid l_n) = 0$.
Equation 1.2 above can be considered a specific case of the extended model
when $P_n$ is 0. The value of $P_n$ depends on coupling effects and the amplitude of the
noise voltage. For simplicity, we use different $P_n$ values to describe the coupling
effects in the following analysis and simulation. Figure 1.13 shows the residual flit
error rate of Hamming codes using the independent and dependent error models. In the
dependent error model, the coupling probability $P_n$ is $10^{-2}$. The results show that the
residual flit error rate using the dependent error model increases greatly compared
to using the independent error model.
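The burst-length probabilities in (1.3) to (1.5) can be tabulated with a few lines of code. The sketch below is illustrative only: the function name and the sample values of the single-wire error probability and the coupling probability are assumptions. It also shows that the three terms add up to the single-wire error probability, and that a coupling probability of zero recovers the independent model.

```python
def burst_probabilities(eps: float, p_n: float) -> dict:
    """Return {b: P(b)} for burst lengths b = 1, 2, 3 per Eqs. (1.3)-(1.5)."""
    return {
        1: (1.0 - p_n) ** 2 * eps,          # only the struck wire is erroneous
        2: 2.0 * p_n * (1.0 - p_n) * eps,   # exactly one neighbor is also affected
        3: p_n ** 2 * eps,                  # both neighbors are also affected
    }

# Hypothetical values: eps = 1e-6 per wire, coupling probability 1e-2.
probs = burst_probabilities(eps=1e-6, p_n=1e-2)
for b, p in probs.items():
    print(f"P(b = {b}) = {p:.3e}")
print(f"sum = {sum(probs.values()):.3e}")
```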
A more complex error model is proposed in [49]. In this error model,
effects of a single noise source are described by a normalized matrix P with the
following format (1.6),

$$
P = \begin{bmatrix}
p(1, 1) & \cdots & p(1, t_{max}) \\
p(2, 1) & \cdots & p(2, t_{max}) \\
\vdots & \ddots & \vdots \\
p(\omega_{max}, 1) & \cdots & p(\omega_{max}, t_{max})
\end{bmatrix} \tag{1.6}
$$

The element $p(\omega, t)$ in the matrix $P$ represents the probability of a single noise
source affecting $\omega$ wires for $t$ cycles. $\omega_{max}$ and $t_{max}$ are the maximum numbers of
wires and cycles affected by this noise source. Compared to previous models, the
error model in [49] can be used to express the probability of multiple-wire and
multiple-cycle errors.
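To make the matrix notation concrete, the sketch below builds a small matrix of the form (1.6) and extracts the marginal distributions over affected wires and affected cycles. The matrix entries are invented for illustration, and the sketch assumes that "normalized" means the entries sum to one; neither the values nor the helper names come from [49].

```python
# Hypothetical 3 x 2 matrix: row index = wires affected (1..3),
# column index = cycles affected (1..2). Entries are assumed to sum to 1.
P = [
    [0.70, 0.10],
    [0.12, 0.04],
    [0.03, 0.01],
]

def wire_marginal(p):
    """Probability that a noise event affects 1, 2, ... wires (any duration)."""
    return [sum(row) for row in p]

def cycle_marginal(p):
    """Probability that a noise event lasts 1, 2, ... cycles (any width)."""
    return [sum(col) for col in zip(*p)]

def multi_wire_or_multi_cycle(p):
    """Probability that an event is not a single-wire, single-cycle error."""
    return 1.0 - p[0][0]

print("wires :", wire_marginal(P))
print("cycles:", cycle_marginal(P))
print("beyond single-wire/single-cycle:", multi_wire_or_multi_cycle(P))
```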

1.4 Book Overview

As technology scales into the nanoscale regime, it becomes impossible to guarantee perfect
hardware design. Moreover, if the requirement of 100% correctness in hardware
can be relaxed, the cost of manufacturing, verification, and testing will be signifi-
cantly reduced. Many approaches have been proposed to address the reliability
problem of on-chip communications. This book mainly focuses on the use of error
control codes (ECCs) to improve on-chip interconnect reliability.
In Chap. 2, we examine various techniques used to improve the reliability of
on-chip interconnects. These techniques can be separated into noise reduction
techniques and error control methods. Noise reduction techniques reduce
noise effects and lower the probability of errors occurring; examples include wider metal
wires, larger interconnect spacing, shielding, skewed transitions, repeater
insertion, and crosstalk avoidance codes (CACs). Error control methods are used
to detect or correct errors after they occur. Spatial redundancy, temporal
redundancy and information redundancy are the common techniques exploited in
error control methods.
An important application of error control coding for on-chip interconnects is to
improve the communication reliability in network-on-chip (NoC) architecture.
As technology scales, billions of transistors can be integrated into a single chip,
and traditional bus-based infrastructures are no longer sufficient to handle intensive
on-chip communication. NoC is emerging as an efficient solution to solve the
aggravating scalability and contention issues of on-chip communication. In Chap. 3,
we introduce different architectures and design components of a NoC. The techniques
used to improve the communication reliability in NoC are also discussed in this
chapter.
Error control codes (ECCs) have been widely applied in conventional
communication systems. As the area and energy costs of ECCs are relatively small
in nanoscale technology, ECCs become a promising solution to address the
reliability issue in on-chip interconnects. Simple ECCs such as single parity
check (SPC) codes, Hamming codes, and duplicate-add-parity (DAP) codes are
widely used in previous work. As the probability of multiple errors increases in
nanoscale technology, more complex error control codes, such as
Bose-Chaudhuri-Hocquenghem (BCH) codes, Reed-Solomon (RS) codes and product
codes, are applied to improve the reliability of on-chip interconnects. In Chap. 4,
we will discuss these ECCs and their hardware implementation.
On-chip interconnects have tight speed, area, and energy constraints. Thus, the
implementation of error control codes for on-chip interconnects needs to balance
reliability and performance. In Chap. 5, we introduce various design techniques to
trade off the reliability and energy consumption of on-chip interconnects. These
techniques include the implementation of low link swing voltage and dynamic voltage
scaling with error control codes, the combination of Hamming product codes with
type-II hybrid ARQ, and the configurable error control codes implementation.
Conventional error control codes, such as Hamming and BCH codes, have been
successfully applied to improve the reliability of on-chip interconnects by correcting
logic errors. Unfortunately, these codes are inefficient at addressing crosstalk-induced
delay uncertainty. As the effects of coupling capacitance increase with technology
scaling, the delay uncertainty caused by capacitive coupling greatly reduces
system performance because a large additional design margin is required. In Chap.
6, we will discuss solutions that address both logic errors and
capacitive-crosstalk-induced delay uncertainty simultaneously.

References

1. Davis AJ et al (2001) Interconnect limits on gigascale integration (GSI) in the 21st Century.
Proc IEEE 89:305–324
2. Caignet F, Bendhia DS, Sicard E (2001) The challenge of signal integrity in deep-
submicrometer CMOS technology. Proc IEEE 89:556–573
3. Constantinescu C (2003) Trends and challenges in VLSI circuit reliability. IEEE Micro
23:14–19
4. Im S, Srivastava N, Banerjee K, Goodson EK (2005) Scaling analysis of multilevel intercon-
nect temperatures for high performance ICs. IEEE Trans Electron Devices 52:2710–2719
5. Kleveland B, Qi X, Madden L et al (2002) High-frequency characterization of on-chip
digital interconnects. IEEE J Solid-State Circuits 37:716–725
6. Ho R, Mai WK, Horowitz AM (2001) The future of wires. Proc IEEE 89:490–504
7. Bakoglu BH, Meindl DJ (1985) Optimal interconnect circuits for VLSI. IEEE Trans Electron
Devices 32:903–909
8. Ismail IY, Friedman GE, Neves LJ (1999) Figures of merit to characterize the importance of
on-chip inductance. IEEE Trans Very Large Scale Integr (VLSI) Syst 7:442–449
9. Agarwal K, Sylvester D, Blaauw D (2006) Modeling and analysis of crosstalk noise in coupled
RLC interconnects. IEEE Trans Comput Aided Des Integr Circuits Syst 25:892–901
10. Ismail IY (2002) On-chip inductance cons and pros. IEEE Trans Very Large Scale Integr
(VLSI) Syst 10:685–694
11. International Technology Roadmap for Semiconductors (2005) http://public.itrs.net
12. Horowitz M, Dally B (2004) How scaling will change processor architecture. In: Proceedings
of the international solid state circuits conference (ISSCC), pp 132–133
13. Magen N, Kolodny A, Weiser U, Shamir N (2004) Interconnect-power dissipation in a
microprocessor. In: Proceedings of the international workshop on system-level interconnect
prediction (SLIP), pp 7–13
14. Soteriou V, Peh SL (2004) Design-space exploration of power-aware on/off interconnection
networks. In: Proceedings of the International conference on computer design (ICCD),
pp 510–517
15. Shin LJ et al (2011) A 40 nm 16-core 128-thread SPARC SoC processor. IEEE J Solid-State
Circuits 46:131–144
16. Zorian Y, Gizopoulos D, Vandenberg C, Magarshack P (2004) Guest editors’ introduction:
design for yield and reliability. IEEE Des Test Comput 21:177–182
17. Grecu C, Ivanov A, Saleh R, Pande PP (2006) NoC interconnect yield improvement using
crosspoint redundancy. In: Proceedings of the IEEE international symposium on defect and
fault tolerance in VLSI system (DFT), pp 457–465
18. Karnick T, Hazucha P, Patel J (2004) Characterization of soft errors caused by single event
upsets in CMOS processes. IEEE Trans Depend Secure Comput 1:128–143
19. Munteanu D, Autran LJ (2008) Modeling and simulation of single-event effects in digital
devices and ICs. IEEE Trans Nucl Sci 55:1854–1878
20. Tang TK, Friedman GE (2000) Delay and noise estimation of CMOS logic gates driving
coupled resistive-capacitive interconnections. Integr VLSI J 29:131–165
21. Vittal A, Chen HL, Marek MS et al (1999) Crosstalk in VLSI interconnections. IEEE Trans
Comput Aided Des Integr Circuits Syst 18:1817–1824
22. Sylvester D, Hu C (2001) Analytical modeling and characterization of deep submicron
interconnect. Proc IEEE 89:634–664
23. Larsson P (1999) Power supply noise in future IC’s: a crystal ball reading. In: Proceedings of
the IEEE custom integrated circuits conference, pp 467–474
24. Mezhiba VA, Friedman GE (2004) Scaling trends of on-chip power distribution noise. IEEE
Trans Very Large Scale Integr (VLSI) Syst 12:386–394
25. Scheffer L (2006) An overview of on-chip interconnect variation. In: Proceedings of the 2006
international workshop on system-level interconnect prediction, pp 27–28
26. Lin Z et al (1998) Circuit sensitivity to interconnect variations. IEEE Trans Semiconductor
Manuf 11:557–568
27. Lopez G et al (2007) The impact of size effects and copper interconnect process variations on
the maximum critical path delay of single and multi-core microprocessors. In: Proceedings of
the international interconnect technology conference, pp 40–42
28. Demircan E (2006) Effects of interconnect process variations on signal integrity.
In: Proceedings of the IEEE international SOC conference, pp 281–284
29. Mehrotra V, Nassif S, Boning D, Chung J (1998) Modeling the effects of manufacturing
variation on high-speed microprocessor interconnect performance. In: Proceedings of the
IEEE electron devices meetings (IEDM), pp 767–770
30. Mehrotra V, Sam LS, Boning D et al (2000) A methodology for modeling the effects of
systematic within-die interconnect and device variation on circuit performance.
In: Proceedings of the ACM/IEEE design automation conference (DAC), pp 172–175
31. Qi X, Lo S, Luo Y et al (2005) Simulation and analysis of inductive impact on VLSI
interconnects in the presence of process variations. In: IEEE custom integrated circuit
conference, pp 309–312
32. Ajami HA, Banerjee K, Pedram M (2005) Modeling and analysis of nonuniform substrate
temperature effects on global ULSI interconnects. IEEE Trans Comput Aided Des Integr
Circuits Syst 24:849–861
33. Khazaka R, Nakhla M (1998) Analysis of high-speed interconnects in the presence of
electromagnetic interference. IEEE Trans Microw Theory Tech 46:940–947
34. Nassif S (2000) Delay variability: sources, impacts and trends. In: Proceedings of the IEEE
international solid-state circuits conference digest of technical papers, pp 7–9
35. Sotiriadis P (2002) Interconnect modeling and optimization in deep submicron technologies.
Dissertation, Massachusetts Institute of Technology
36. Tamhankar R, Murali S, Stergiou S et al (2007) Timing-error-tolerant network-on-chip design
methodology. IEEE Trans Comput Aided Des Integr Circuits Syst 26:1297–1310
37. Balasubramanian A, Sternberg LA, Bhuva LB, Massengill WL (2006) Crosstalk effects
caused by single event hits in deep sub-micron CMOS technologies. IEEE Trans Nucl Sci
53:3306–3311
38. Balasubramanian A et al (2008) Measurement and analysis of interconnect crosstalk due to
single events in a 90 nm CMOS technology. IEEE Trans Nucl Sci 55:2079–2084
39. Srinivasan J, Adve V S, Bose P, Rivers AJ (2004) The case for lifetime reliability-
aware microprocessors. In: Proceedings of the 31st international symposium on computer
architecture (ISCA), pp 276–287
40. Xuan X, Singh A, Chatterjee A (2003) Reliability evaluation for integrated circuit with
defective interconnect under electromigration. In: Proceedings of the international symposium
on quality electronic design, pp 29–34
41. Heydari P, Pedram M (2003) Ground bounce in digital VLSI circuits. IEEE Trans Very Large
Scale Integr (VLSI) Syst 11:180–193
42. Zhao C, Bai X, Dey S (2007) Evaluating transient error effects in digital nanometer circuits.
IEEE Trans Reliab 56:381–391
43. Maheshwari A, Burleson W, Tessier R (2004) Trading off transient fault tolerance and power
consumption in deep submicron (DSM) VLSI circuits. IEEE Trans Very Large Scale Integr
(VLSI) Syst 12:299–311
44. Heidel FD et al (2008) Alpha-particle-induced upsets in advanced CMOS circuits and
technology. IBM J Res Dev 52:225–232
45. Tipton DA et al (2006) Multiple-bit upset in 130 nm CMOS technology. IEEE Trans Nucl Sci
53:3259–3264
46. Lehtonen T, Wolpert D, Liljeberg P, Plosila J, Ampadu P (2010) Self-adaptive system for
addressing permanent errors in on-chip interconnects. IEEE Trans Very Large Scale Integr
(VLSI) Syst 18:527–540
47. Hegde R, Shanbhag RN (2000) Toward achieving energy-efficiency in presence of deep
submicron noise. IEEE Trans Very Large Scale Integr (VLSI) Syst 8:379–391
48. Bertozzi D, Benini L, De Micheli G (2005) Error control schemes for on-chip communication
links: the energy-reliability tradeoff. IEEE Trans Comput Aided Des Integr Circuits Syst
24:818–831
49. Zimmer H, Jantsch A (2003) A fault model notation and error-control scheme for switch-to-
switch buses in a network-on-chip. In: Proceedings of the international conference on
hardware/software codesign and system synthesis (CODES-ISSS), pp 188–193
50. De Micheli G, Benini L (2006) Networks on chips: technology and tools. Elsevier, Amsterdam
51. Lehtonen T, Liljeberg P, Plosila J (2007) Online reconfigurable self-timed links for fault
tolerant NoC. VLSI Des. Article ID 94676:13
52. Fu B, Ampadu P (2008) A multi-wire error correction scheme for reliable and energy efficient
SoC links using Hamming product codes. In: Proceedings of the IEEE international SoC
conference (SoCC), pp 59–62
53. Fu B, Ampadu P (2008) An energy-efficient multi-wire error control scheme for reliable on-
chip interconnects using Hamming product codes. VLSI Des Article ID: 109490, 1–14,
doi:10.1155/2008/109490
54. Fu B, Ampadu P (2009) On hamming product codes with type-II hybrid ARQ for on-chip
interconnects. IEEE Trans Circuits Syst I Reg Papers 56:2042–2054
Chapter 2
Solutions to Improve the Reliability
of On-Chip Interconnects

Various noise reduction and error control techniques have been applied to improve
the reliability of on-chip interconnects. Noise reduction techniques include increas-
ing wire width and spacing [1, 2], shielding [3–7], repeater insertion [8–13],
crosstalk avoidance codes [14–18], skewed transition [19–25] and decoupling
capacitors [26–28]. Error control techniques improve the reliability of on-chip
interconnect by correcting errors using retransmission, error control codes and
spare wires. The use of these techniques relaxes the reliability requirements of
circuit components, reducing the cost of manufacturing, verification, and testing.
In this chapter, we will review both noise reduction and error control techniques and
their pros and cons.

2.1 Wire Sizing and Spacing

Increasing interconnect width has different effects on capacitive coupling and
inductive coupling. When the interconnect width is increased, inter-wire capacitive
coupling effects decrease because a wider wire has a larger ground capacitance.
Inductive coupling, in contrast, increases with the interconnect width. Thus, the total impact of
crosstalk coupling is only weakly dependent on interconnect width when both
capacitive and inductive effects are considered [1].
Capacitive coupling is linearly related to the spacing between two interconnect
lines. Increasing the interconnect spacing can effectively reduce the capacitive
coupling. Inductive coupling is logarithmically related to the spacing between
two wires [1, 2]. As the spacing between two interconnect wires increases, the
inductive coupling decreases at a much slower rate than capacitive coupling.
Above a certain threshold value of the spacing between two interconnect lines,
inductive coupling dominates the total coupling effect, as shown in Fig. 2.1.
This dominance of inductive coupling reduces the effectiveness of using wider spacing
to reduce crosstalk coupling. The drawback of wire sizing and spacing is the
increase in link routing area.

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links,
DOI 10.1007/978-1-4419-9313-7_2, © Springer Science+Business Media, LLC 2012

Fig. 2.1 Effects of wire spacing on the capacitive and inductive coupling noises [1]

2.2 Shielding

Shielding is the most common design technique to prevent crosstalk coupling.
There are two kinds of shielding methods – passive shielding [3, 4] and active
shielding [6, 7]. In passive shielding, the shield wires, which are statically
connected to power or ground, are placed on either side of the signal wire. The
effect of capacitive coupling is reduced by isolating the signal wire from its
neighboring signal wires. Passive shielding also reduces inductive coupling by
providing a closer return path for the operating currents.
In [6, 7], an active shielding approach is proposed that connects the shield wires
to the signal wire, as shown in Fig. 2.2. In the active shielding method, the shield
wires have the same switching behavior as the signal wire. Active shielding can
achieve a larger delay reduction than passive shielding by taking advantage of the
Miller effect (The Miller effect states that when two parallel wires switch in the
same direction, the effective coupling capacitance is zero, while when they switch
in opposite directions, the effective coupling capacitance is doubled). Active
shielding reduces the link power consumed by coupling capacitance; however,
the self-switching power consumption is increased in active shielding because of
the additional switching of the shield wires.
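The Miller effect described above can be captured by a simple multiplier on the coupling capacitance seen by a switching wire. The sketch below is a behavioral illustration under stated assumptions: transitions are encoded as +1 (rising), -1 (falling) and 0 (quiet), and a quiet neighbor is assumed to contribute the nominal coupling capacitance (factor 1), a case the text does not spell out. The function names are hypothetical.

```python
def miller_factor(victim_delta: int, neighbor_delta: int) -> int:
    """Multiplier on the coupling capacitance seen by a switching victim wire.

    0 if the neighbor switches in the same direction, 2 if it switches in the
    opposite direction, 1 if it stays quiet (assumed nominal case).
    """
    if victim_delta not in (-1, 1):
        raise ValueError("the victim wire must be switching (+1 or -1)")
    return 1 - victim_delta * neighbor_delta

def shield_coupling_load(victim_delta: int, shield_deltas) -> int:
    """Total coupling multiplier contributed by the two shield wires."""
    return sum(miller_factor(victim_delta, d) for d in shield_deltas)

# Passive shields are tied to power or ground and never switch.
print("passive:", shield_coupling_load(+1, (0, 0)))
# Active shields follow the signal wire.
print("active :", shield_coupling_load(+1, (+1, +1)))
```

Under these assumptions the active shields present no effective coupling load to the signal wire, which is the source of the delay advantage, while the shields themselves now switch and consume self-switching power.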
Both active and passive shielding require additional wires, greatly increasing
link routing area. Instead of adding a shield wire for each signal line, shield wires
can be inserted between every two to four signal wires to reduce the area cost while
giving up some of the coupling improvement [3].

Fig. 2.2 An example of active shielding

2.3 Repeater Insertion

In repeater insertion, a long interconnect line is separated into several segments,
each driven by an inverting or non-inverting buffer. Repeater insertion has been
successfully used to reduce the global interconnect delay. Without repeater insertion,
the delay of a global interconnect increases quadratically with the interconnect
length. By properly sizing and placing repeaters, the global interconnect delay
is reduced to a linear dependence on length. Repeater insertion can also be used
to reduce the capacitive coupling noise between two adjacent interconnect
lines. The coupling capacitance between two neighboring wires is proportional
to the interconnect length. By inserting repeaters, a long interconnect wire is divided
into several shorter segments. The coupling capacitance of each segment is smaller
than that of the overall link without repeater insertion, resulting in a reduction in
coupling noise.
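The change from quadratic to linear delay growth can be seen with a first-order model. The sketch below assumes a distributed RC wire whose delay is approximated as half the product of its total resistance and capacitance (an Elmore-style estimate), plus a fixed delay per repeater. The per-millimeter resistance and capacitance, the repeater delay, and the function names are all hypothetical values chosen for illustration.

```python
def unrepeated_delay(r_per_mm: float, c_per_mm: float, length_mm: float) -> float:
    """First-order distributed RC delay: 0.5 * (r * L) * (c * L)."""
    return 0.5 * r_per_mm * c_per_mm * length_mm ** 2

def repeated_delay(r_per_mm: float, c_per_mm: float, length_mm: float,
                   segment_mm: float, t_repeater: float) -> float:
    """Delay of the same wire split into equal segments, one repeater each."""
    n_segments = max(1, round(length_mm / segment_mm))
    segment = length_mm / n_segments
    return n_segments * (unrepeated_delay(r_per_mm, c_per_mm, segment) + t_repeater)

# Assumed technology numbers (not from the text).
R = 200.0       # ohm per mm
C = 0.2e-12     # farad per mm
T_REP = 20e-12  # seconds per repeater

for length in (5.0, 10.0, 20.0):
    plain = unrepeated_delay(R, C, length)
    buffered = repeated_delay(R, C, length, segment_mm=1.0, t_repeater=T_REP)
    print(f"{length:5.1f} mm: unrepeated {plain * 1e12:8.1f} ps, "
          f"repeated {buffered * 1e12:8.1f} ps")
```

With a fixed segment length, doubling the wire length doubles the number of segments and therefore the delay, whereas the unrepeated delay grows fourfold.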
In traditional repeater insertion, each segment has the same length and each repeater
has the same size, as shown in Fig. 2.3a. Traditional repeater insertion cannot
effectively handle delay uncertainty caused by capacitive coupling between adjacent
interconnect lines. In order to reduce the delay uncertainty caused by capacitive
coupling, several new repeater insertion methods have been proposed [10–13].
In [10], a staggered repeater insertion scheme is presented to reduce the
capacitive coupling effect by shifting the inverter locations on adjacent lines, as
shown in Fig. 2.3b. In the staggered repeater method, the worst case delay is
reduced because the transition with the worst case capacitive coupling is limited
to only half of each segment. For example, the transition 010→101 with the worst
case capacitive coupling in the first half of each segment becomes the transition
000→111 with the best case capacitive coupling in the second half of each segment.
The performance of staggered repeater insertion is sensitive to the repeater positions.
Thus, selecting the repeater positions for staggered repeater insertion is more
complex than for traditional repeater insertion. An optimum position for
staggered repeater insertion is presented in [11].
A hybrid polarity repeater insertion method is presented in [12], shown in
Fig. 2.3c. In this method, inverting repeaters (a single inverter) and non-inverting
repeaters (two inverters) are used alternately at the midpoint of the bus. Similar to
the staggered repeater method, a worst case delay transition in the first half of a line
becomes a best case delay transition in the second half. Thus the worst case delay is
reduced by averaging the coupling effects during the transition across the whole bus
line. Compared to the staggered repeater method, the hybrid polarity repeater method
does not need a shift in repeater positions, and the transition patterns are inverted only
once at the midpoint of the whole interconnect length.

Fig. 2.3 Repeater insertion: (a) traditional repeater insertion, (b) staggered repeater insertion,
(c) hybrid polarity repeater insertion, (d) alternate repeater insertion

Instead of using non-inverting repeaters only at the midpoint of the bus line, the
alternate repeater insertion method proposed in [13] uses inverting and
non-inverting repeaters alternately along the bus line, as shown in Fig. 2.3d. In alternate
repeater insertion, the placement of the non-inverting repeaters is shifted for two
adjacent interconnect lines. Alternate repeater insertion is suitable for a shared bus
line with multiple drivers and receivers. As long as the driver and receiver are
separated by more than one segment, the worst case delay caused by crosstalk
coupling can be reduced.

2.4 Crosstalk Avoidance Codes (CACs)

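The delay of a wire $l$ in a $k$-bit bus can be modeled as [29],

$$
\tau_l(d^{t}, d^{t+1}) = \begin{cases}
\tau_0 \left( (1 + \lambda)\Delta_1^{2} - \lambda \Delta_1 \Delta_2 \right), & l = 1 \\
\tau_0 \left( (1 + 2\lambda)\Delta_l^{2} - \lambda \Delta_l (\Delta_{l-1} + \Delta_{l+1}) \right), & 1 < l < k \\
\tau_0 \left( (1 + \lambda)\Delta_k^{2} - \lambda \Delta_k \Delta_{k-1} \right), & l = k
\end{cases} \tag{2.1}
$$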

where $\tau_0 = R_t C_{gt}$ is the wire delay when there is no crosstalk; $R_t$ and $C_{gt}$ represent
the total resistance and ground capacitance of a wire, respectively; $\lambda$ is the ratio of
total coupling capacitance $C_{ct}$ to total ground capacitance $C_{gt}$; and $\Delta_l = d_l^{t+1} - d_l^{t}$ is the
difference in value of wire $l$ at time $t + 1$ and $t$ (if there is a 0 → 1 transition on wire $l$,
$\Delta_l = 1$; if the transition is 1 → 0, $\Delta_l = -1$). From (2.1), the delay of a wire is
dependent on the switching behaviors of its neighboring wires. Table 2.1 shows the
middle wire delay of three adjacent wires with different switching patterns. The delay
is normalized to the delay $\tau_0$. The delay of the middle wire can be separated into six
classes (0, 1, 1+λ, 1+2λ, 1+3λ, 1+4λ) [29, 30]. The sixth class has the worst case
link delay, which is caused by a transition 010→101 (or 101→010) on three adjacent
wires.

Table 2.1 Normalized link delay for different switching patterns. Rows give the
initial state (d_{l-1}^t, d_l^t, d_{l+1}^t); columns give the final state
(d_{l-1}^{t+1}, d_l^{t+1}, d_{l+1}^{t+1}).

        000     001     010     011     100     101     110     111
000     0       0       1+2λ    1+λ     0       0       1+λ     1
001     0       0       1+3λ    1+2λ    0       0       1+2λ    1+λ
010     1+2λ    1+3λ    0       0       1+3λ    1+4λ    0       0
011     1+λ     1+2λ    0       0       1+2λ    1+3λ    0       0
100     0       0       1+3λ    1+2λ    0       0       1+2λ    1+λ
101     0       0       1+4λ    1+3λ    0       0       1+3λ    1+2λ
110     1+λ     1+2λ    0       0       1+2λ    1+3λ    0       0
111     1       1+λ     0       0       1+λ     1+2λ    0       0
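The delay classes of the middle wire can be reproduced by evaluating the delay model for every pair of three-wire states. The sketch below is one possible way to do this, assuming the interior-wire case weights the squared transition by 1 + 2λ, which is the form consistent with Table 2.1. Each delay is returned as a pair (a, b) meaning a + bλ, so no numeric value of λ is needed; the function names are hypothetical.

```python
from itertools import product

def delay_class(prev, nxt, l):
    """Normalized delay of wire l (0-based) as (a, b), meaning a + b * lambda."""
    delta = [n - p for p, n in zip(prev, nxt)]  # +1 rising, -1 falling, 0 quiet
    d = delta[l]
    # Edge wires have one neighbor, interior wires have two.
    neighbors = [delta[j] for j in (l - 1, l + 1) if 0 <= j < len(delta)]
    a = d * d
    b = len(neighbors) * d * d - d * sum(neighbors)
    return a, b

def middle_wire_classes():
    """All distinct delay classes of the middle wire of a three-wire bus."""
    states = list(product((0, 1), repeat=3))
    return sorted({delay_class(p, n, 1) for p in states for n in states})

def format_class(a, b):
    """Render (a, b) in the notation used by Table 2.1."""
    if a == 0:
        return "0"
    if b == 0:
        return "1"
    return "1+λ" if b == 1 else f"1+{b}λ"

print([format_class(a, b) for a, b in middle_wire_classes()])
print(format_class(*delay_class((0, 1, 0), (1, 0, 1), 1)))  # worst-case pattern
```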
From Table 2.1, if the transition patterns with high delay classes can be elimi-
nated from the data transmission, the crosstalk coupling can be reduced. Crosstalk
avoidance codes (CACs) use a coding approach to achieve this coupling reduction
by mapping the input data into CAC codewords, in which some specific switching
patterns are avoided. A number of CACs have been proposed in previous work.
There are two types of CACs [31] – memory-based and memory-less.
Memory-based CACs generate a codeword using not only the input data but also the previously
transmitted codeword. Memory-less CACs generate a codeword based only on
the input data.
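To illustrate how restricting the codebook removes high delay classes, the sketch below compares an uncoded four-wire bus with a codebook that excludes every word containing the bit patterns 010 or 101. This particular constraint is an illustrative choice, not a scheme described in the text, and the sketch only checks the codebook; it does not show an encoder that maps data words to codewords. Delays are reported as the coefficient of λ in the normalized delay, assuming the interior-wire case uses 1 + 2λ as in Table 2.1. All names are hypothetical.

```python
from itertools import product

def lambda_coeff(prev, nxt, l):
    """Coefficient of lambda in the normalized delay of wire l (0-based)."""
    delta = [n - p for p, n in zip(prev, nxt)]
    d = delta[l]
    neighbors = [delta[j] for j in (l - 1, l + 1) if 0 <= j < len(delta)]
    return len(neighbors) * d * d - d * sum(neighbors)

def worst_lambda_coeff(codebook):
    """Largest lambda coefficient over all codeword transitions and wires."""
    return max(lambda_coeff(p, n, l)
               for p in codebook for n in codebook for l in range(len(p)))

def has_forbidden_pattern(word):
    """True if the word contains 010 or 101 on three adjacent wires."""
    s = "".join(str(bit) for bit in word)
    return "010" in s or "101" in s

ALL_WORDS = list(product((0, 1), repeat=4))
FP_FREE = [w for w in ALL_WORDS if not has_forbidden_pattern(w)]

print("uncoded worst class    : 1 +", worst_lambda_coeff(ALL_WORDS), "λ")
print("restricted worst class : 1 +", worst_lambda_coeff(FP_FREE), "λ")
print("codewords available    :", len(FP_FREE))
```

Because the restricted codebook is used regardless of the previously transmitted word, a code built on it would be memory-less in the sense defined above; the price is that fewer codewords are available on the same number of wires.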
CACs are technology independent. Compared to the shielding method, CACs
have additional codec cost but require a smaller routing area overhead. CACs cannot
be used to correct logic errors. To achieve reliable on-chip communication, CACs
are usually combined with error control coding (ECC). More detail about combining
ECC with CACs will be discussed in Chap. 6.

2.5 Skewed Transitions

Another approach to reduce the delay uncertainty caused by crosstalk coupling
is skewed transitions. In skewed transitions, a relative delay $\Delta T$ is introduced to
avoid simultaneous opposite switching between adjacent bus wires. Skewed
transitions can reduce the crosstalk-induced worst case delay without increasing
the routing area.
The relative delay $\Delta T$ between adjacent bus lines can be generated statically or
dynamically. In the static approach, the relative delay always exists between adjacent
bus lines regardless of the switching patterns. In [19, 20], $\Delta T$ is generated by inserting
delay elements (e.g., an inverter chain) at the beginning of alternate bus lines, as shown
in Fig. 2.4. Figure 2.5 shows the relation between the relative delay $\Delta T$ and the total
link delay $T_d$. In this case, $T_d$ can increase if $\Delta T$ is too large. A careful selection of $\Delta T$
is needed to achieve a reduction of the overall delay $T_d$. In [19], the transitions
between adjacent bus lines can also be separated using different clocks. In this
method, the signals need to be aligned at the end of the interconnect bus. Instead of
inserting the delay element at the beginning of the alternate bus lines, repeaters and
bus driver transistors with low threshold voltage can be applied to generate the
skewed transition by speeding up the transition on the alternate bus lines [22, 23],
as shown in Fig. 2.6.

Fig. 2.4 An example of skewed transition by inserting delay elements at the beginning of alternate
bus lines

Fig. 2.5 Timing relation in traditional skewed transitions

Fig. 2.6 Skewed transitions using threshold voltage adjustment

In the dynamic approach, the relative delay is induced only when adjacent bus
lines are switching oppositely. In [24], a transition detection circuit is used to detect
0 → 1 transitions on a bus line. If a 0 → 1 transition is detected, this transition will
be delayed. In this method, if the transitions on two adjacent bus lines have the same
direction, they will both be delayed or both not be delayed, so no skew exists between these
two wires. If two adjacent wires switch in opposite directions, the 0 → 1 transition is
delayed while the 1 → 0 transition is completed immediately. Thus, a relative delay
$\Delta T$ exists between these two bus lines. The opposite switching transitions between
two adjacent bus lines can also be separated by using skewed repeaters [25], as
shown in Fig. 2.7. Using skewed repeaters, two neighboring inverters along the
bus line are skewed in opposite directions. The opposite transitions on two
adjacent lines no longer switch simultaneously, because one transition travels
faster than the other.

Fig. 2.7 Skewed repeater bus
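The rule used by the dynamic approach (delay rising transitions, pass falling transitions immediately) can be checked with a small behavioral model. The sketch below is not the detection circuit of [24]; it only computes, for each pair of adjacent wires that both switch, the skew produced by that rule. The function names and the example bus states are assumptions.

```python
def launch_delays(prev, nxt, delta_t):
    """Delay added to each wire: rising (0 -> 1) transitions wait delta_t."""
    return [delta_t if (p, n) == (0, 1) else 0.0 for p, n in zip(prev, nxt)]

def adjacent_skews(prev, nxt, delta_t):
    """Skew between each adjacent wire pair; None if either wire is quiet."""
    delays = launch_delays(prev, nxt, delta_t)
    switching = [p != n for p, n in zip(prev, nxt)]
    skews = []
    for i in range(len(delays) - 1):
        if switching[i] and switching[i + 1]:
            skews.append(abs(delays[i] - delays[i + 1]))
        else:
            skews.append(None)
    return skews

# Wires 0 and 2 and 3 rise, wire 1 falls: the opposite pairs (0,1) and (1,2)
# are separated by delta_t, the same-direction pair (2,3) is not.
print(adjacent_skews((0, 1, 0, 0), (1, 0, 1, 1), delta_t=0.1))
```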

2.6 Error Control Coding Schemes

Reliability of on-chip interconnects can be improved by introducing error control
coding schemes [32–48]. On-chip interconnects typically use one of three schemes
for error recovery – automatic repeat request (ARQ), forward error correction
(FEC), and hybrid ARQ (HARQ). In this section, we discuss these three types of
error control schemes and their advantages and disadvantages (Fig. 2.8).

2.6.1 Automatic Repeat Request (ARQ)

The basic concept of ARQ is to request a retransmission if errors are detected [49].
In ARQ, the input data is encoded using an error detection code. The encoded data
is transmitted through the link. In the receiver, the encoded data is decoded to detect
errors. When errors are detected in the received data, the receiver sends back a
negative acknowledge (NACK) signal to request a retransmission.
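The detect-and-retransmit loop can be sketched with the simplest error detection code, a single even-parity bit. The example below is a minimal illustration rather than an implementation from [49]: the function names are hypothetical, and the channel is modeled by explicitly listing which bit positions are corrupted in each transmission attempt.

```python
def encode(data_bits):
    """Append one even-parity bit (a single parity check code)."""
    return list(data_bits) + [sum(data_bits) % 2]

def receiver_response(received_bits):
    """Return 'ACK' if the parity check passes, otherwise 'NACK'."""
    return "ACK" if sum(received_bits) % 2 == 0 else "NACK"

def flip(bits, positions):
    """Invert the bits at the given positions to model link errors."""
    return [b ^ 1 if i in positions else b for i, b in enumerate(bits)]

def send_with_arq(data_bits, errors_per_attempt):
    """Retransmit until an ACK arrives; return the number of transmissions.

    errors_per_attempt[i] is the set of bit positions corrupted in attempt i.
    """
    codeword = encode(data_bits)
    for attempt, positions in enumerate(errors_per_attempt, start=1):
        if receiver_response(flip(codeword, positions)) == "ACK":
            return attempt
    raise RuntimeError("no ACK within the supplied attempts")

# One corrupted bit triggers a NACK; the clean retransmission is accepted.
print(send_with_arq([1, 0, 1, 1], [{2}, set()]))
# Two corrupted bits pass the parity check, so the error goes undetected.
print(send_with_arq([1, 0, 1, 1], [{0, 3}]))
```

The second call shows the limitation of any detection code: an error pattern outside its detection capability is acknowledged as if the data were correct.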

Fig. 2.8 Three types of ARQ scheme (a) Stop-and-wait (b) Go-back-N (c) Selective-repeat

There are three types of ARQ [49] – stop-and-wait, go-back-N, and selective-repeat.
In stop-and-wait ARQ, the transmitter sends data and waits until the ACK/NACK
signal is received. Stop-and-wait ARQ has the benefit of simplicity; however,
its throughput is low, making it unsuitable for high speed on-chip communication.
To improve the system throughput, go-back-N and selective-repeat ARQ schemes
are used. In go-back-N ARQ, the transmitter continuously sends data until a
NACK is received. The transmitter then resends the data flagged by the NACK
signal and also the succeeding $(N - 1)$ data transmitted during the round-trip delay.
In the receiver, if errors are detected in the received data, a retransmission is
requested. The receiver discards all the incoming data until the retransmitted data is
received. The average number of transmissions needed to successfully send a data word in
go-back-N ARQ can be represented by (2.2) below [49],

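$$
\begin{aligned}
N_{GBN} &= 1 \cdot (1 - P_d) + (N + 1)(1 - P_d) P_d + (2N + 1)(1 - P_d) P_d^{2} \\
&\quad + \cdots + (lN + 1)(1 - P_d) P_d^{l} + \cdots \\
&= 1 + \frac{N P_d}{1 - P_d}
\end{aligned} \tag{2.2}
$$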

where Pd is the probability that errors are detected in the received data and N is the
round-trip delay in clock cycles. NGBN depends on both the channel error rate and
the round-trip delay N. A transmitter buffer with length N is needed in go-back-N
ARQ to store the transmitted data until the ACK signal is received. Go-back-N ARQ
improves the throughput at a moderate implementation cost, and is
suitable for on-chip interconnects.
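
As a numerical check of (2.2), the short Python sketch below evaluates the closed form and compares it with a truncated version of the series. The function names and the truncation length are our own illustrative choices and do not come from [49].

```python
def gbn_expected_transmissions(p_d, n):
    """Closed form of (2.2): 1 + N * Pd / (1 - Pd)."""
    if not 0.0 <= p_d < 1.0:
        raise ValueError("p_d must satisfy 0 <= p_d < 1")
    return 1.0 + n * p_d / (1.0 - p_d)


def gbn_series(p_d, n, terms=2000):
    """Truncated series of (2.2): sum over l of (l*N + 1) * (1 - Pd) * Pd**l."""
    return sum((l * n + 1) * (1.0 - p_d) * p_d ** l for l in range(terms))


if __name__ == "__main__":
    # Hypothetical operating points: round-trip delay of 4 cycles.
    for p_d in (0.001, 0.01, 0.1):
        print(p_d, gbn_expected_transmissions(p_d, 4), gbn_series(p_d, 4))
```

The two functions agree to within floating-point error, and both show that the penalty of each detected error grows linearly with the round-trip delay N.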
Instead of resending N data, selective-repeat ARQ only requests the erroneous
data to be retransmitted. Consider an ideal selective-repeat ARQ system, in which
the receiver has an infinite buffer to store the error free data. The average number of
transmissions needed to successfully send a data in selective-repeat ARQ can be
represented by (2.3) below [49],

\[
\begin{aligned}
N_{SR} &= 1\cdot(1-P_d) + 2(1-P_d)P_d + 3(1-P_d)P_d^{2} \\
       &\quad + \cdots + l(1-P_d)P_d^{l-1} + \cdots \\
       &= \frac{1}{1-P_d}
\end{aligned}
\tag{2.3}
\]

NSR does not depend on the roundtrip delay, which makes it suitable for long-
distance applications such as satellite communication. In selective-repeat ARQ,
a receiver buffer is used to save all the incoming data. Also a complex mechanism is
needed to reorder the received data.
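
The closed form in (2.3) can be checked with a small Monte Carlo experiment. The sketch below assumes that each transmission fails independently with probability Pd and that only the failed word is resent; the function names, the number of simulated words and the random seed are illustrative assumptions.

```python
import random


def sr_expected_transmissions(p_d):
    """Closed form of (2.3): 1 / (1 - Pd)."""
    return 1.0 / (1.0 - p_d)


def simulate_sr(p_d, words=100_000, seed=1):
    """Average number of transmissions per word in ideal selective-repeat ARQ."""
    rng = random.Random(seed)
    total = 0
    for _ in range(words):
        total += 1                     # first transmission of this word
        while rng.random() < p_d:      # error detected: resend this word only
            total += 1
    return total / words
```

Unlike the go-back-N case, the round-trip delay does not appear anywhere in the simulation, which mirrors the statement that NSR is independent of N.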
In ARQ schemes, the error detection codes are easy to construct and incur only a minor
energy cost. When error patterns occur that cannot be detected (known as decoder failure),
errors are introduced into the system. The drawback of ARQ is the retransmission
latency. In persistent noise environments, a large number of retransmissions are
required, making ARQ unsuitable for high-performance applications.

2.6.2 Forward-Error Correction (FEC)

In FEC schemes, errors are corrected without any retransmission. Compared to
ARQ schemes, more complex error control codes are required in FEC schemes,
increasing the encoder and decoder overhead. However, FEC schemes allow for a
much simpler communication protocol with a fixed throughput.

The error control codes used for FEC can be divided into two classes [50]: (1)
block codes, in which the data is encoded or decoded block by block and each
data block is processed independently, and (2) convolutional codes, in which the
encoding process involves the current input data as well as previous input data.
In on-chip communication, the data is usually transmitted in parallel. To apply
convolutional codes, either an encoder/decoder is needed for each wire, or the data
is encoded serially before it is transmitted in parallel across the link. These two
approaches lead to a large area or latency overhead and are not suitable for on-chip
interconnects.
In this book, we focus on block codes, especially linear block codes, in which the
sum of two codewords in a given code is also a codeword. The implementation of FEC
in on-chip communication often has strict performance and cost requirements. Simple
codes, such as single-error-correcting (SEC) codes and single-error-correcting and
double-error-detecting (SEC-DED) codes, are widely used in previous work [34, 51].
An extended discussion of the various types of error control codes used for on-chip
interconnects is presented in Chap. 3.
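
To make the idea of a linear SEC block code concrete, the following Python sketch implements the textbook Hamming(7,4) code. It is only an illustration under our own assumptions: the bit ordering (parity bits at positions 1, 2 and 4) and the function names are not taken from [34, 51].

```python
from itertools import product


def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(r):
    """Correct at most one bit error; return (data bits, syndrome)."""
    r = list(r)
    syndrome = 0
    for pos, bit in enumerate(r, start=1):
        if bit:
            syndrome ^= pos    # XOR of the positions that hold a 1
    if syndrome:
        r[syndrome - 1] ^= 1   # the syndrome is the position of the flipped bit
    return [r[2], r[4], r[5], r[6]], syndrome


def all_codewords():
    """The 16 codewords, used here to check linearity."""
    return {tuple(hamming74_encode(d)) for d in product((0, 1), repeat=4)}
```

Because every parity bit is an XOR of data bits, the bitwise sum of any two codewords returned by `all_codewords()` is again in the set, which is the linearity property described above.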

2.6.3 Hybrid ARQ (HARQ)

HARQ schemes combine the advantages of FEC and ARQ together to increase
performance [49, 52]. In HARQ schemes, the receiver corrects errors within the
code’s error correction capability and requests retransmission when the errors are
detectable but not correctable. The use of FEC in HARQ reduces the frequency of
retransmission by correcting the error patterns which occur most frequently.
Unlike error control with FEC alone, a retransmission is requested if there is an
error pattern that is detected but cannot be corrected. Unlike error control with ARQ
alone, the FEC in HARQ can correct persistent noise errors. As a result, a proper
combination of FEC and ARQ can provide higher reliability than error control with
FEC alone and higher throughput than error control with ARQ alone.
The simplest way to implement HARQ is to use an error control code with
the capability of detecting and correcting errors simultaneously. When errors are
detected in the received codeword, the receiver first tries to correct these errors. If
the number of errors is within the error correcting capability of the code, the errors
will be corrected. If an uncorrectable error pattern is detected, the receiver discards
the received word and requests a retransmission. The retransmission is the same
codeword. This process continues until the codeword is successfully decoded. This
type of HARQ is referred to as type-I HARQ [49]. The error control code used in
type-I HARQ is able to correct and detect errors simultaneously. This requires more
parity check bits than a code used in an ARQ scheme purely for error detection.
Thus, the transmission overhead of type-I HARQ is larger than that of an error control
scheme using only ARQ. In type-I HARQ, the same number of parity check bits is
transmitted each time. This limits its performance when noise conditions vary with
different environmental factors.
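
The type-I HARQ receiver behavior described above can be sketched in a few lines of Python. As a stand-in for a practical SEC-DED code, the sketch uses a toy 4-bit repetition code (minimum distance 4), which corrects one error and detects two. The code choice, the `max_attempts` limit and all function names are our own assumptions for illustration.

```python
def encode(bit):
    """Toy SEC-DED code: repeat the data bit four times."""
    return [bit] * 4


def decode(word):
    """Return (status, bit); status is 'ok', 'corrected' or 'retransmit'."""
    ones = sum(word)
    if ones == 2:                       # double error: detectable, not correctable
        return "retransmit", None
    bit = 1 if ones >= 3 else 0
    status = "ok" if ones in (0, 4) else "corrected"
    return status, bit


def type1_harq(bit, channel, max_attempts=8):
    """Resend the same codeword until it decodes; return (bit, attempts)."""
    for attempt in range(1, max_attempts + 1):
        status, decoded = decode(channel(encode(bit)))
        if status != "retransmit":
            return decoded, attempt
    raise RuntimeError("retransmission limit reached")


def make_channel(error_patterns):
    """Hypothetical channel that XORs one fixed error pattern per use."""
    patterns = iter(error_patterns)

    def channel(word):
        return [b ^ e for b, e in zip(word, next(patterns))]

    return channel
```

With a channel that injects a double error followed by a single error, `type1_harq` requests one retransmission and then corrects the second copy, so the data is delivered on the second attempt.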

Type-II HARQ was proposed to improve the performance by sending the parity
check bits incrementally [49, 53]. In type-II HARQ, the codeword in the first
transmission is comprised of input data and a few parity bits for error correction.
When the receiver detects uncorrectable errors in the received data, it saves
the erroneous data in a buffer and requests a retransmission at the same time.
The retransmission is not the original data but a block of additional parity check
bits which is formed based on the input data. When this block of additional parity
check bits is received, it is used to correct the errors in the data stored in the receiver
buffer. Each retransmission in type-II HARQ is coded differently. Compared to
type-I HARQ, parity check bits in type-II HARQ are transmitted incrementally.
Because the minimal required redundancy is transmitted each time, type-II HARQ
achieves a better throughput when the noise condition varies with the environment.

2.7 Spare Wires

Most permanent errors can be detected during manufacturing test; however, some
of them can occur during run time (e.g., due to electromigration or aging). Permanent
errors can greatly reduce the correction capability of the commonly used error
control codes, making these codes unsuitable for handling permanent errors. An efficient
approach to solve this problem is to replace permanently erroneous wires with spare
wires.
In [54], the use of spare wires to improve on-chip interconnect manufacturing
yield is analyzed and a configurable scheme using crossbar switches to remap
erroneous wires is proposed. In [55], an in-line test (ILT) method combined with
a configurable remapping of erroneous links using spare wires is proposed to detect
and correct permanent errors. In the ILT method, each pair of adjacent wires in the
link is periodically tested for opens and shorts. A reconfiguration unit is used to
remap data from the pair of adjacent wires under test to a set of available spare
wires. The ILT method allows the link to be tested for permanent errors without
interrupting data transmission. Once the permanent errors are detected, the same
reconfiguration unit is used to bypass erroneous wires using spare wires. In order to
reduce the complexity of the reconfiguration unit, the remapping process ripples
through the bus instead of remapping erroneous wires directly to the spare wires.
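
A rippled remapping of this kind can be illustrated as follows. The sketch is not the circuit of [55]; it only assumes that the spare wires sit at the end of the link and that every logical bit is shifted past the faulty wires below it, so that each bit moves by at most the number of faulty wires. All names are hypothetical.

```python
def ripple_remap(data_width, total_wires, faulty):
    """Map logical bit i to the i-th healthy physical wire."""
    healthy = [w for w in range(total_wires) if w not in faulty]
    if len(healthy) < data_width:
        raise ValueError("not enough healthy wires for the data width")
    return healthy[:data_width]


def drive(bits, mapping, total_wires):
    """Transmitter side: place each logical bit on its assigned wire."""
    wires = [0] * total_wires
    for bit, wire in zip(bits, mapping):
        wires[wire] = bit
    return wires


def sample(wires, mapping):
    """Receiver side: read the logical bits back using the same mapping."""
    return [wires[wire] for wire in mapping]
```

For a 4-bit link with two spare wires and wire 1 marked faulty, the mapping becomes `[0, 2, 3, 4]`: bits 1 to 3 each shift by one position, and no bit is routed across the whole bus to reach a spare.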

References

1. Agarwal K, Sylvester D, Blaauw D (2006) Modeling and analysis of crosstalk noise in coupled
RLC interconnects. IEEE Trans Computer-Aided Des Integr Circuits Syst 5:892–901
2. Massoud Y (2002) Managing on-chip inductive effects. IEEE Trans Very Large Scale Integr
(VLSI) Syst 6:789–798
3. Huang X, Cao Y, Sylvester D et al (2000) RLC signal integrity analysis of high-speed global
interconnects. In: Proceedings of IEEE international electron devices meeting (IEDM),
pp 731–743
4. Zhang J, Friedman GE (2004) Effect of shield insertion on reducing crosstalk noise between
coupled interconnects. In: Proceedings of IEEE international symposium on circuits and
system (ISCAS), pp 529–532
5. Lepak MK, Xu M, Chen J, He L (2004) Simultaneous shielding insertion and net ordering for
capacitive and inductive coupling minimization. ACM Trans Des Autom Electron Syst 3:290–309
6. Kaul H, Sylvester D, Blaauw D (2002) Active shields: a new approach to shielding global wires.
In: Proceedings of IEEE/ACM great lakes symposium on VLSI (GLSVLSI), pp 112–117
7. Kaul H, Sylvester D, Blaauw D (2004) Performance optimization of critical nets through
active shielding. IEEE Trans Circuits Syst I Reg Papers 12:2417–2435
8. Adler V, Friedman GE (2000) Uniform repeater insertion in RC trees. IEEE Trans Circuits
Syst I Fund Theor Appl 10:1515–1523
9. Alpert JC, Devgan A, Quay TS (1999) Buffer insertion for noise and delay optimization. IEEE
Trans Computer-Aided Des Integr Circuits Syst 11:1633–1645
10. Kahng B A, Muddu S, Sarto E, Sharma R (1998) Interconnect tuning strategies for high-
performance ICs. In: Proceedings of the design, automation and test in Europe (DATE),
pp 471–478
11. Ghoneima M, Ismail Y (2005) Optimum positioning of interleaved repeaters in bidirectional
buses. IEEE Trans Comput-Aided Des Integr Circuits Syst 3:461–469
12. Akl JC, Bayoumi AM (2008) Reducing interconnect delay uncertainty via hybrid polarity
repeater insertion. IEEE Trans Very Large Scale Integr (VLSI) Syst 9:1230–1239
13. Kaul H, Seo J S, Anders M, Sylvester D, Krishnamurthy R (2008) A robust alternate repeater
technique for high performance busses in the multi-core era. In: Proceedings of the IEEE
international symposium on circuits and systems (ISCAS), pp 372–375
14. Sridhara RS, Ahmed A, Shanbhag RN (2004) Area and energy-efficient crosstalk avoidance
codes for on-chip busses. In: Proceedings of international conference on computer design
(ICCD), pp 12–17
15. Duan C, Tirumala A, Khatri PS (2001) Analysis and avoidance of crosstalk in on-chip buses.
In: Proceedings of hot interconnects, pp 133–138
16. Victor B, Keutzer K (2001) Bus encoding to prevent crosstalk delay. In: Proceedings of
IEEE/ACM international conference on computer-aided design (ICCAD), pp 57–63
17. Patel NK, Markov L (2004) Error-correction and crosstalk avoidance in DSM busses.
IEEE Trans Very Large Scale Integr (VLSI) Syst 10:1076–1080
18. Sridhara RS, Shanbhag RN (2007) Coding for reliable on-chip buses: a class of fundamental
bounds and practical codes. IEEE Trans Computer-Aided Des Integr Circuits Syst 5:977–982
19. Hirose K, Yassura H (2000) A bus delay reduction technique considering crosstalk.
In: Proceedings of the design, automation and test in Europe (DATE), pp 441–445
20. Nose K, Sakurai T (2001) Two schemes to reduce interconnect delay in bi-directional and uni-
directional buses. In: Proceedings of VLSI symposium, pp 193–194
21. Ghoneima M, Ismail Y (2004) Effect of relative delay on the dissipated energy in coupled
interconnects. In: Proceedings of IEEE international symposium on circuits and systems
(ISCAS), pp 525–528
22. Kim WK, Jung OS, Kim T (2003) Coupling delay optimization by temporal decorrelation
using dual threshold voltage technology. IEEE Trans Very Large Scale Integr (VLSI) Syst
5:879–887
23. Ghoneima M, Ismail IY, Khellah MM, Tschanz WJ, De V (2006) Reducing the effective
coupling capacitance in buses using threshold voltage adjustment techniques. IEEE Trans
Circuits Syst I Fund Theor Appl 9:1928–1933
24. Nieuwland KA, Katoch A, Meijer M (2004) Reducing cross-talk induced power consumption
and delay. In: Proceedings of international workshop on power and timing modeling
optimization and simulation (PATMOS), pp 179–188
25. Ghoneima M et al (2006) Skewed repeater bus: a low power scheme for on-chip bus.
IEEE Trans Circuits Syst I Fund Theor Appl 7:1904–19106
26. Zhao S, Roy K, Koh KC (2002) Decoupling capacitance allocation and its application to
power supply noise aware floorplanning. IEEE Trans Computer-Aided Des Integr Circuits
Syst 1:8–92
27. Su H, Sapatnekar SS, Nassif RS (2003) Optimal decoupling capacitor sizing and placement
for standard cell layout designs. IEEE Trans Computer-Aided Des Integr Circuits Syst
4:428–436
28. Popovich M, Sotman M, Kolodny A, Friedman GE (2008) Effective radii of on-chip
decoupling capacitors. IEEE Trans Very Large Scale Integr (VLSI) Syst 7:894–907
29. Sotiriadis P, Chandrakasan A (2000) Reducing bus delay in sub-micron technology using
coding. In: Proceedings of the IEEE Asia and South Pacific design automation conference
(ASPDAC), pp 109–114
30. Li L et al. (2004) A crosstalk aware interconnect with variable cycle transmission. In: Pro-
ceedings of the design, automation and test in Europe (DATE), pp 102–107
31. Duan C, Cordero Calle HV, Khatri PS (2009) Efficient on-chip crosstalk avoidance DODEC
design. IEEE Trans Very Large Scale Integr (VLSI) Syst 4:551–560
32. Li L, Vijaykrishnan N, Kandemir M, Irwin JM (2003) Adaptive error protection for energy
efficiency. In: Proceedings of IEEE/ACM international conference on computer-aided design
(ICCAD), pp 2–7
33. Bertozzi D, Benini L (2004) Xpipes: a network-on-chip architecture for gigascale systems-on-
chip. IEEE Circuits Syst Mag 4:18–31
34. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework.
IEEE Trans Very Large Scale Integr (VLSI) Syst 6:655–667
35. Murali S, Theocharides T, Vijaykrishnan N, Irwin JM, Benini L, De Micheli G (2005)
Analysis of error recovery schemes for networks-on-chips. IEEE Des Test Comput 5:434–442
36. Rossi D, Nieuwland KA, Katoch A, Metra C (2005) Exploiting ECC redundancy to minimize
crosstalk impact. IEEE Des Test Comput 1:59–70
37. Worm F, Ienne P, Thiran P, Micheli DG (2005) A robust self-calibrating transmission scheme
for on-chip networks. IEEE Trans Very Large Scale Integr (VLSI) Syst 1:126–139
38. Komatsu S, Fujita M (2005) Low power and fault tolerant encoding methods for on-chip data
transfer in practical applications. IEICE Trans Fund E88-A(12):3282–3289
39. Pande PP, et al. (2006) Design of low power & reliable networks on chip through joint
crosstalk avoidance and forward error correction coding. In: Proceedings of IEEE international
symposium on defect and fault tolerance in VLSI systems (DFT), pp 466–476
40. Ejlali A, Al-Hashimi MB, Rosinger P, Miremadi GS (2007) Joint consideration of fault-
tolerance, energy-efficiency and performance in on-chip networks. In: Proceedings of the
design, automation and test in Europe (DATE), pp 1–6
41. Rossi D, Angelini P, Metra C (2007) Configurable error control scheme for NoC signal
integrity. In: Proceedings of international on line testing symposium (IOLTS), pp 43–48
42. Yu Q, Ampadu P (2008) Adaptive error control for NoC switch-to-switch links in a variable
noise environment. In: Proceedings of IEEE international symposium on defect and fault
tolerance in VLSI system (DFT), pp 352–360
43. Ganguly A, Pande PP, Belzer B, Grecu C (2008) Design of low power & reliable networks on
chip through joint crosstalk avoidance and multiple error correction coding. J Electron Test
Theor Appl (JETTA), Special Issue on Defect and Fault Tolerance, 67–81
44. Lehtonen T, Liljeberg P, Plosila J (2007) Analysis of forward error correction methods for
nanoscale networks-on-chip. In: Proceedings of 2nd international conference on nano-networks
(Nano-Net), pp 1–5
45. Rossi D, Nieuwland KA, Dijk SVE, Kleihorst PR, Metra C (2008) Power consumption of fault
tolerant busses. IEEE Trans Very Large Scale Integr (VLSI) Syst 5:542–553
46. Yu Q, Ampadu P (2009) Adaptive error control for nanometer scale NoC links. IET Comput
Digit Tech 6:643–659, Special issue on advances in nanoelectronics circuits and systems
47. Fu B, Ampadu P (2010) Error control combining Hamming and product codes for energy
efficient nanoscale on-chip interconnects. IET Comput Digit Tech 3:251–261
48. Fu B, Ampadu P (2009) A dual-mode hybrid ARQ scheme for energy efficient on-chip
interconnects. In: Springer lecture notes of the institute for computer sciences, social-infor-
matics and telecommunications engineering – 3 rd international ICST conference NanoNet
2008, revised selected papers, pp 74–79
49. Lin S et al (1984) Automatic-repeat-request error-control schemes. IEEE Commun Mag
12:5–17
50. Lin S, Costello JD (2004) Error control coding, 2nd edn. Prentice Hall, Upper Saddle River
51. Bertozzi D, Benini L, De Micheli G (2005) Error control schemes for on-chip communication
links: the energy-reliability tradeoff. IEEE Trans Computer-Aided Des Integr Circuits Syst
6:818–831
52. Benice JR, Frey HA (1964) An analysis of retransmission systems. IEEE Trans Commun
Technol 4:135–145
53. Metzner JJ (1979) Improvements in block-retransmission schemes. IEEE Trans Commun
2:524–532
54. Grecu C, Ivanov A, Saleh R, Pande PP (2006) NoC interconnect yield improvement using
crosspoint redundancy. In: Proceedings of IEEE international symposium on defect and fault
tolerance in VLSI systems (DFT), pp 457–465
55. Lehtonen T, Wolpert D, Liljeberg P, Plosila J, Ampadu P (2010) Self-adaptive system for
addressing permanent errors in on-chip interconnects. IEEE Trans Very Large Scale Integr
(VLSI) Syst 4:527–540
Chapter 3
Networks-on-Chip (NoC)

The move to many-core systems is expected to become the dominant trend in
the near future. With technology scaling into the nanoscale regime, hundreds and
even thousands of intellectual property (IP) cores can be integrated into a single
chip. How to provide efficient and reliable communication between these IP cores
becomes a big problem. The conventional bus-based infrastructures are no longer
sufficient to handle intensive on-chip communication. Network-on-chip (NoC)
is emerging as an efficient solution to the aggravating scalability and
bandwidth issues of on-chip communication by replacing traditional bus structures
with a packet-switched network. This chapter introduces the common NoC
architectures and the reliability issues faced in NoC design.

3.1 Bus Based On-Chip Communication

Bus-based infrastructure is the most frequently used traditional on-chip communication
architecture. Figure 3.1 shows the architecture of the IBM Cell processor [1], in
which a bus-based on-chip communication architecture is used to provide data
communication between eight special-purpose processing units (SPU) and a single
general-purpose 64-bit Power processor. In a bus-based communication architecture,
all IPs are connected to the same transmission medium, the bus. The IPs connected to a
bus can be divided into masters, which can initiate read or write data transfers, and
slaves, which respond to the requests from master IPs. If multiple masters want
to access the shared bus at the same time, a bus arbiter is used to determine which
master has the right to access. The advantages of the bus architecture are its simplicity
and low area cost. The bandwidth of the bus architecture is low because only one master
can access the shared bus at any time. There are many bus-based SoC interconnect
specifications. Three of them are widely used in industry: ARM Advanced Microcontroller
Bus Architecture (AMBA) versions 2.0 [2] and 3.0 [3], IBM CoreConnect [4], and
OpenCores Wishbone [5].

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links, 33


DOI 10.1007/978-1-4419-9313-7_3, # Springer Science+Business Media, LLC 2012

Fig. 3.1 IBM cell processor architecture [1]

The bus architecture faces several design challenges as technology scales into the
nanoscale regime. The capacitive load of a shared bus greatly increases as more IPs
are connected to the same bus. Also, the bus length must increase as the
number of integrated IPs increases. These two factors increase the propagation delay
of the on-chip bus and limit the number of IPs that can be connected to the shared bus.
Hierarchical bus architecture has been proposed to solve this problem by splitting a long
bus into several segments. Figure 3.2 shows a hierarchical bus architecture using the
AMBA protocol. In a hierarchical bus architecture, bus bridges are used to connect
different bus segments. As hundreds of IPs are integrated into a single chip, the
hierarchical bus architecture becomes complex and still faces the intrinsic bandwidth
limitation caused by multiple IP cores sharing the same transmission medium.
The conventional bus architecture becomes a bottleneck in throughput and scalability.
A more efficient on-chip communication architecture needs to be developed to meet
the high throughput requirements of large-scale SoC designs.

3.2 NoC Design

NoC architectures are developed to address the complex on-chip communication of
large-scale SoCs. In this section, we introduce the common components and
different topologies of NoC systems. The switching techniques and router design
used in NoC systems are also discussed.

Fig. 3.2 A hierarchical bus architecture using AMBA protocol [6]

Fig. 3.3 A NoC based 48-core IA-32 processor from Intel [7]

3.2.1 NoC Architectures

In a NoC system, the traditional shared-bus structure is replaced with a
packet-switched communication network. Figure 3.3 shows the architecture of a
NoC-based processor with 48 IA-32 cores from Intel [7, 8]. A NoC system typically
consists of IP cores, network interfaces (NI), routers, and global links. IP cores
are connected to the on-chip network via NIs. The communication between different IP
cores takes place in the form of packets. The function of the NI is to packetize and
depacketize the data. In the NI, the input data injected by the IP core is separated into
small packets, and extra information used to identify and track these packets is added
to each packet.
The NI is also used to establish the connection between the source IP and the destination IP.
The router routes each packet to the correct destination according to a
specified routing algorithm. The router plays an important role in the performance of
a NoC architecture. Routers are connected using high-performance global links,
which include data and control wires for communication. In the processor shown in
Fig. 3.3, each tile consists of two IA-32 cores, and the tiles are connected by a 2D-mesh
on-chip network. A mesh interface unit (MIU) in each tile is used to
packetize/depacketize data into/from the mesh network.
NoC systems provide better performance and scalability than bus-based systems. The use
of multiple concurrent connections results in much higher bandwidth in NoC systems than
in conventional bus structures. The decoupling of computation IPs from the communication
network reduces the system design complexity by allowing each part to be
optimized separately. Component reuse is easy in NoC systems because of the
standardized network interface. NoCs are also very scalable: adding new routers
connects more resources to the on-chip network.
A NoC can be organized in different topologies, which define how the routers and
computation IP cores are connected [9]. Figure 3.4 shows the common NoC
topologies, including fat tree, butterfly fat tree (BFT), mesh, octagon, torus, and
folded torus. Among these topologies, the mesh topology has received the most
attention because of its regular architecture and simplicity. A wide variety of
NoC prototypes have been developed recently. Most of these systems use a mesh
topology, such as Nostrum [10], Tile64 [11], TRIPS [12],
Teraflops [13], and SCC [14].
In the mesh topology, except for the routers on the network boundaries, each router has
five active ports: one is connected to the local IP core, while the other four are
connected to the neighboring routers. In the mesh topology, the number of IP cores is
equal to the number of routers. The disadvantage of the mesh topology is that a
large-scale mesh-based NoC has long communication latency between the IP cores on
two opposite edges of the network. The torus topology has been proposed to reduce the
communication latency by directly connecting the routers on the edges to the
routers on the opposite edges of the network, as shown in Fig. 3.4e. In the torus
topology, every router has the same structure with five active ports. The long
wrap-around interconnects in the torus topology can still be a bottleneck compared to
other parts of the network. This issue can be avoided by using a folded torus, as shown in
Fig. 3.4f.
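
The latency argument for the torus can be illustrated by counting router-to-router hops. The sketch below assumes minimal routing and uses the hop count as a rough proxy for latency; wire length, which is the reason for the folded torus, is deliberately not modeled. The function names and coordinate convention are our own.

```python
def mesh_hops(src, dst):
    """Minimal hop count between (x, y) routers in a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, cols, rows):
    """Minimal hop count in a 2D torus, where each dimension wraps around."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, cols - dx) + min(dy, rows - dy)
```

In an 8 x 8 network, the corner routers (0, 0) and (7, 7) are 14 hops apart in the mesh but only 2 hops apart in the torus, because each dimension can be crossed through a wrap-around link.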
NoC protocols are typically organized in layers [15, 16]. Similar to the Open
Systems Interconnection (OSI) model used in the Internet, NoC implementations can be
separated into seven layers: the physical layer, data link layer, network layer,
transport layer, session layer, presentation layer, and application layer, as shown
in Fig. 3.5.
Among these layers, the physical layer handles the physical implementation of the
transmission medium (i.e., wires). Different signaling techniques, such
as differential signaling, low-swing voltage signaling, pulse-based signaling, and
current-mode signaling, are applied in this layer. The data link layer provides
reliable data transmission even when the physical links are unreliable. Error control
coding is a common technique applied in this layer. The network layer
establishes the data transmission path according to different switching and routing
algorithms. Congestion control is also employed in the network layer to balance the
traffic load over the network. In the transport layer, the data is segmented into packets
at the source and reassembled at the destination. The transport layer needs to guarantee
that these packets are transmitted and received in order. The selection of the packet
size is an important problem in the transport layer. The session layer, presentation
layer, and application layer provide an abstraction of the hardware implementation
to the system software and users. For example, the session layer is used to
synchronize the data communication of a multi-core system running parallel
programs. The layered structure simplifies NoC design by hiding the implementation
details of the lower layers from the higher layers.

Fig. 3.4 Common NoC topologies

Fig. 3.5 Layered structure in NoC design

3.2.2 Routing and Switching Techniques

Routing and switching are applied in the network layer of a NoC
architecture. Routing determines the data transfer path that is used to
move the data packets through the network to their destination. Switching
decides when and how the data packets are actually transferred through the routers.
Routing algorithms can be classified into static routing [17–19] and dynamic
routing [19–21]. Static routing, also known as deterministic routing, always
provides fixed paths between a particular source and destination pair.
In static routing algorithms, the routing paths are predetermined. The routing
scheme does not use information such as the current network traffic load and link
status to make routing decisions. XY routing is one commonly used static routing
algorithm. In the XY routing algorithm, the data is always routed first in the horizontal
direction (X direction) until it reaches the network node that has the same X
coordinate as the destination node. Then, the data is routed in the vertical direction
(Y direction) until it reaches the destination node. The advantage of static routing is
that it is easy to implement with a small area cost. If only a single routing path is used,
static routing can guarantee in-order data packet delivery. In-order data packet
delivery simplifies the NI design in the destination node.

Fig. 3.6 An example of deadlock in NoC routing
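
The XY routing rule described above can be sketched as a short Python function. Routers are identified by hypothetical (x, y) coordinates in a 2D mesh; the function name and the returned path format are our own choices.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited by XY routing in a 2D mesh."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # first travel along the X direction
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then travel along the Y direction
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

Because the path depends only on the source and destination coordinates, every packet of a given flow follows the same route, which is what allows in-order delivery.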
Dynamic routing is also known as adaptive routing. In dynamic routing, the
routing decision is made based on the current network conditions, such as traffic
load and error rate. The routing paths between a particular source and destination
pair can change over time as the network conditions change. In comparison to
static routing, dynamic routing is more effective at avoiding network congestion
and link errors because it continuously monitors the network conditions. The flexibility
of dynamic routing results in additional hardware costs, such as monitoring circuits
and more complex routing control circuits. A large number of dynamic routing
algorithms have been proposed recently, such as minimal adaptive [20], fully
adaptive [20], congestion look-ahead [22], odd–even [18], slack time aware [23],
west-first, north-last, and negative-first.
The purpose of routing algorithms is to ensure that all data packets correctly
reach their destinations no matter which routing algorithm is selected. To
achieve this goal, the routing algorithm needs to guarantee that no livelock or
deadlock occurs in the routing path. Livelock is a condition in which a data packet is
moved around between routers over and over again and never reaches its destination.
Cycles exist in the routing path when livelock happens. Livelock can
be avoided by monitoring the distance of the data packet from the destination at
each router. Only the routing paths that reduce this distance are selected.
This approach ensures that the packet will reach its destination after a finite number
of steps.
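
The distance-based livelock rule can be sketched for a 2D mesh as follows. The sketch assumes Manhattan distance as the distance measure and a hypothetical `load` dictionary that stands in for the monitored congestion of each router; neither is prescribed by the text above.

```python
def productive_neighbors(cur, dst, cols, rows):
    """Neighbors of cur inside the mesh that are strictly closer to dst."""
    def dist(node):
        return abs(node[0] - dst[0]) + abs(node[1] - dst[1])

    x, y = cur
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    in_mesh = [(cx, cy) for cx, cy in candidates
               if 0 <= cx < cols and 0 <= cy < rows]
    return [n for n in in_mesh if dist(n) < dist(cur)]


def adaptive_route(src, dst, cols, rows, load):
    """Pick the least-loaded productive neighbor at every hop."""
    path, cur = [src], src
    while cur != dst:
        cur = min(productive_neighbors(cur, dst, cols, rows),
                  key=lambda n: load.get(n, 0))
        path.append(cur)
    return path
```

Since every hop reduces the distance by exactly one, the path length always equals the Manhattan distance between source and destination, whatever the congestion values are, so the packet cannot circulate indefinitely.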
Deadlock is a condition in which one or more packets in the network cannot move
starting from a time t. These packets will be blocked for an infinite time regardless
of the routing algorithm selected. Deadlock happens when a circular dependence
between different packets exists. Figure 3.6 shows an example of deadlock.
In this example, each packet holds its own buffer resource but at the same time tries
to request the buffer resource held by another packet. For example, packet 1 is
transmitted from router A to B. At the same time, packet 1 tries to request the
resources held by packet 2. Packet 2 is transmitted from router B to C and tries to
request resources held by packet 3. A similar scenario happens to packet 3 and
packet 4. A dependency loop is formed in this example. Deadlock can be avoided by
adding restrictions to the routing algorithm (e.g., forbidding certain turns in the
routing algorithm).
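
The dependency loop in this example can be expressed as a cycle in a wait-for graph. The Python sketch below is only an analysis aid written under our own assumptions: the dictionary representation is hypothetical, and the edge from packet 4 back to packet 1 is our reading of the loop described above.

```python
def has_cycle(waits_for):
    """Depth-first search for a cycle in a graph {packet: [packets it waits on]}."""
    WHITE, GREY, BLACK = 0, 1, 2       # unvisited, on current path, finished
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:          # back edge: circular dependence found
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n)
               for n in list(waits_for))


# Packets 1 to 4 each wait for the buffer held by the next packet.
deadlocked = {1: [2], 2: [3], 3: [4], 4: [1]}
# Removing one dependency (e.g., by forbidding one turn) breaks the loop.
restricted = {1: [2], 2: [3], 3: [4], 4: []}
```

Here `has_cycle(deadlocked)` is true and `has_cycle(restricted)` is false, which illustrates why routing restrictions that remove a single dependency are enough to prevent this deadlock.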
There are two major switching techniques in NoCs, namely circuit switching [24]
and packet switching [21, 25]. In circuit switching, a physical path, composed of a
series of links and routers, is reserved from source to destination before the data
transmission starts. The physical path is held for the entire duration of the
data transmission. The hardware resource reserved for the physical path is released
when the last data is transmitted. The advantage of circuit switching is that the whole
link bandwidth is available once the physical path has been established. However, the
circuit switching has a large setup latency before the data can be transmitted. Also,
other routing paths cannot use these link and router resources during the process of
data transmission, which wastes valuable hardware resources. The virtual circuit
switching technique can be applied to improve the hardware efficiency. In virtual
circuit switching, several virtual links (also known as virtual channels) share a
single physical link. Extra buffer resources or a time-division multiplexing technique is
needed to support virtual circuit switching.
In packet switching, data is divided into fixed-length blocks called packets. Instead
of establishing a physical path before data transmission, the source sends the data
whenever it is available. There are three types of packet switching techniques:
store-and-forward, virtual cut-through, and wormhole. Store-and-forward and virtual
cut-through require the receiving switch to be able to store the whole packet before the
data starts to transmit. Therefore, these approaches require a large buffer size, which
increases the silicon area and the power consumption.
In wormhole switching, packets are divided into flow control units (flits). The
first flit of the packet is called as the header, which carries the routing information.
The header flit is used to reserve the resource. The following flits in the packet
simply follow this path in a pipelined fashion. For example, if a header flit of
a packet passes through a link, no other packet can use this link until the tail flit
of this packet passes through the link. In wormhole switching, if a flit from a given
packet is blocked in a buffer, it will decrease channel utilization. In order to achieve
a high throughput, a set of virtual channels can be used. If a flit belonging to a
particular packet is blocked in one of the virtual channels, then flits of alternate
packets can use the other virtual channel buffers. The header flit can also be used to
reserve a dedicated flit buffer in a router with multiple virtual channels (VCs).
In wormhole switching, only a few flits need to be stored at every switching element.
As a result, the buffer space requirement in the switches can be small compared to
that required for other packet switching schemes.
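The link reservation behavior of wormhole switching described above can be illustrated with a minimal sketch. The names `packet_to_flits` and `Link`, the flit labels, and the use of a coordinate pair as routing information are assumptions made only for this illustration; the sketch also assumes a non-empty payload and a single link without virtual channels.

```python
def packet_to_flits(dest, payload, flit_size):
    """Split a non-empty payload into flits; the header flit carries the routing information."""
    chunks = [payload[i:i + flit_size] for i in range(0, len(payload), flit_size)]
    flits = [("HEAD", dest)]
    for i, chunk in enumerate(chunks):
        kind = "TAIL" if i == len(chunks) - 1 else "BODY"
        flits.append((kind, chunk))
    return flits


class Link:
    """A link that a packet holds from its header flit until its tail flit."""

    def __init__(self):
        self.owner = None  # id of the packet currently holding the link

    def accept(self, packet_id, flit):
        kind, _ = flit
        if self.owner is None:
            if kind != "HEAD":
                return False      # only a header flit can reserve the link
            self.owner = packet_id
        elif self.owner != packet_id:
            return False          # blocked: the link is reserved by another packet
        if kind == "TAIL":
            self.owner = None     # the tail flit releases the link
        return True


# Hypothetical packet A heading to node (2, 1), with 3-symbol flits
flits_a = packet_to_flits(dest=(2, 1), payload="abcdefgh", flit_size=3)
```

In this sketch, a header flit of a second packet is rejected by `accept` until the tail flit of packet A has passed, which is the blocking that virtual channels are intended to relieve.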

Fig. 3.7 Block-level representation of a five-port router

3.2.3 Router Design

The router design plays an important role in determining the performance of NoC
systems. In order to achieve an efficient router design, the router type, size of the
FIFO buffer, and switching technique must be carefully selected. The structure of
routers in a NoC depends on the network topology. The size of the FIFO inside the
router is very important because it directly impacts the router delay, packet loss, and
power consumption. The router design also depends on the routing scheme adopted.
If deterministic routing schemes are adopted, the router can be designed to be fast
and compact. If dynamic routing schemes are applied, the circuit to define and
control the routing path will introduce an extra area overhead. Figure 3.7 shows a
block level architecture of a five-port router. It mainly consists of input/output FIFO
buffers, a crossbar connecting input ports to output ports, and a routing control
unit [26]. Input ports from each side feed data into a corresponding buffer. Each
output port selects data from the processor core and three input buffers based on the
routing control unit.
In order to reduce communication latency in NoCs, virtual channels are widely
applied in router design [27, 28]. By introducing virtual channels in the input and
output ports, the channel utilization is greatly increased. In a router with virtual channels,
if a flit belonging to a packet is using one of the virtual channels, then flits of other
packets can use the other virtual channel buffers and, ultimately, the physical
channel. The architecture of a router having virtual channels is shown in Fig. 3.8.

Fig. 3.8 Architecture of a router having virtual channels [28]

The router with virtual channels usually comprises a number of different
components, such as a header decoder, virtual channels, a crossbar switch, and
a routing unit. The header decoder receives the packets, determines their
destination addresses, and holds the packets until they are routed. A multiplexer and
a de-multiplexer are used to manage virtual channel operations. The virtual channel
arbiter selects the appropriate virtual channel. The crossbar switch can connect each input
channel to any unoccupied output channel. The routing computation unit
implements the routing algorithm and controls the crossbar switch. Virtual
channels increase the communication throughput, but they also increase the complexity
of the router design.

3.3 Reliability in NoC Links

Reliability is an important issue in NoC design. For example, errors in the header of a
packet may lead to deadlock and block communication of an NoC. Error control
techniques can be applied at different NoC layers [29]. In the physical layer, spare
wires [30] can be used to bypass defective wires, which can be detected during
manufacturing tests or self-test at start-up (or at run-time). Different wiring and
signaling techniques (e.g., shielding and differential signaling) can also be applied
at the physical layer to improve reliability of data transmission. The data link layer is
used to provide reliable data transmission across an unreliable physical link. Error
control coding is the most popular fault tolerant approach at the data link layer [29,
31, 32]. Adaptive routing algorithms [33] can improve NoC reliability at the network

Fig. 3.9 Implementation of error control coding in data link layer of a NoC

layer by rerouting data around defective links. Flooding algorithms [34] are another
fault tolerant technique used at the network layer, which involve redundant transmis-
sion of data over multiple paths to improve the reliability of network transmissions.
Error control coding can also be used at network layer by performing an end-to-end
error protection [32], in which the data is encoded at the source and decoded at the
destination, with no encoding/decoding of the data at intermediate hops.
In this book, we mainly focus on applying error control coding at the data link
layer. In the data link layer, encoder and decoder circuits are integrated into
the NoC routers, as shown in Fig. 3.9. The error control schemes are incorporated
into the output and input link controllers, respectively. The output link controller
includes the encoder, and the input link controller includes the corresponding
decoder. The incoming flit is encoded in the encoder and transmitted through the
link. The decoder at the receiver side detects or corrects errors. Registers are
inserted between encoder and link, and also between link and decoder to allow
pipelined operation. If ARQ or HARQ schemes are applied, retransmission is
invoked using the ACK/NACK signal.
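The interaction between error detection and retransmission can be sketched as follows. This is only an illustrative model: it assumes a single parity check code as the detection code, a stop-and-wait retransmission policy, and a hypothetical `error_patterns` argument that lists the bit positions flipped by the link on each attempt. None of these choices is prescribed by the text.

```python
def encode_parity(bits):
    """Append one even-parity bit to the flit (single parity check code)."""
    return bits + [sum(bits) % 2]


def check_parity(word):
    """Return True when the received word has even parity (no error detected)."""
    return sum(word) % 2 == 0


def send_with_arq(flit, error_patterns, max_tries=4):
    """Stop-and-wait ARQ over one link.

    error_patterns[i] lists the bit positions flipped on attempt i
    (a hypothetical, deterministic stand-in for link noise).
    Returns the delivered flit and the number of transmissions used.
    """
    codeword = encode_parity(flit)
    for attempt in range(max_tries):
        received = list(codeword)
        flips = error_patterns[attempt] if attempt < len(error_patterns) else []
        for pos in flips:
            received[pos] ^= 1
        if check_parity(received):
            return received[:-1], attempt + 1   # ACK: strip the parity bit
        # NACK: fall through and retransmit the stored codeword
    raise RuntimeError("retransmission limit reached")


# One single-bit error on the first attempt triggers one retransmission.
delivered, tries = send_with_arq([1, 0, 1, 1], error_patterns=[[2]])
```

A double-bit error passes the parity check undetected in this sketch, which is one reason more powerful codes are considered for links with higher error rates.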
A large amount of research has been done on applying error control coding at the data
link layer. In [32], the impact of various error control schemes on the energy
efficiency, error protection capability, and performance is investigated in an NoC
design environment. The error control schemes implemented in this work can be
applied to the network or data link layers. Two different error detection and
retransmission schemes are used. In the end-to-end error detection scheme,
a retransmission occurs between the sender and receiver (which may require
multiple switch-to-switch hops). In the switch-to-switch error detection scheme, a
retransmission occurs between adjacent switches. This work also implements type-I

HARQ scheme by correcting any single error on a flit and requesting a retransmis-
sion when the error can be detected but not corrected. Single parity check
codes, cyclic redundancy check (CRC) codes and Hamming codes are examined.
This work provides a large amount of useful information to select an appropriate
error control scheme for a given application.
In [35], a methodology using error control coding to trade off energy and
reliability of on-chip interconnects is proposed. In their method, error control
codes are used to meet predefined reliability requirements. Energy consumption
of interconnects is reduced by lowering the link swing voltage. This paper shows
that sufficient signal integrity of on-chip interconnects can still be achieved for low
voltage swing, once an appropriate error control scheme is applied to the on-chip
interconnects. Hamming codes and duplicate-add-parity (DAP) codes are used to
correct single errors. This work also provides a framework to combine error control
codes with crosstalk avoidance codes, which will be discussed in detail in Chap. 6.
In [36], the authors compare the energy behavior of different error detection and
correction schemes to find the most efficient error control technique from an
energy viewpoint. FEC and ARQ are considered in this work. The comparison is
performed in a realistic SoC system and includes Hamming codes, extended
Hamming codes, parity check codes, and CRC codes.
The impact of applying Hamming codes on bus power consumption in nanoscale
technology is analyzed in [37, 38]. Bus wire parameters (coupling capacitances,
drivers, repeaters and receivers) and encoding and decoding circuits are considered
in this work. The analysis shows that all different equivalent Hamming codes (with
the same redundancy) have identical energy consumption. This work also shows
that DAP codes with a proper bus layout can consume less power than Hamming
code with the same bus footprint for a small size bus.
A dynamic voltage swing approach with error detection has been proposed in [39].
In this work, the link swing voltage is dynamically scaled down to reduce the link
energy consumption. The signal integrity is ensured by using error detection codes
(CRC codes) and retransmission. The operational link swing voltage is determined by
sampling the retransmission rate. This research has not considered FEC or HARQ.
In [40], the authors analyze the impact of ARQ, FEC and HARQ on the trade-off
between performance, reliability and energy consumption in on-chip networks, for
various voltage swings, noise powers, wire lengths (wire capacitances) and timing
constraints. Unlike previous work, this work considers performance and reliability
jointly and uses a new metric, performability. The authors conclude that
HARQ consumes less energy than ARQ and FEC for a given performability
constraint, except when short wires are used.
In order to achieve energy efficiency, configurable error control schemes have
been proposed that dynamically provide appropriate error control based on noise
conditions or system requirements. In [41], an adaptive error detection method is
proposed. In this work, parity check codes, Hamming codes and extended Hamming
codes are dynamically selected as the error detection code based on the number
of errors found over a fixed time window. In [42], the authors describe a
configurable error control scheme, which can be configured into three different

operating modes – correction mode, detection mode and mixed mode depending on
the particular application. This configurable error control allows the system to meet
different Quality of Service (QoS) levels. Hamming codes, extended Hamming
codes (Hsiao codes), and two-bit symbol error correcting codes [43] are considered.
In [44, 45], an adaptive error control scheme using configurable single-error-
correcting (SEC) codes and interleaving is applied to address multiple adjacent
errors. In [46], different error control schemes are dynamically selected according
to the fault type. Hamming codes with interleaving combined with retransmission
are used to handle transient errors; and split retransmission and spare wires are used
to address intermittent and permanent errors.
The research work described above has focused on tolerating a very limited number
of simultaneous errors, commonly one or two. In nanoscale technology, the number of
simultaneous and burst-type errors is likely to increase. Little research has been
concentrated on addressing multiple errors, especially for a combination of multiple
random errors and burst errors. In [44, 47], interleaving was used to improve error
resilience against burst errors. In this method, a wide bus is split into smaller groups
and each group is encoded using a SEC or single-error-correcting and double-error-
detecting (SEC-DED) code. The outputs of these SEC or SEC-DED encoders are
further interleaved to reduce the probability of multiple errors occurring within the
same group. In [42], symbol correcting codes are applied to correct two-bit burst errors
in a symbol. These two approaches only focus on burst errors; for a combination of
random errors and burst errors, they lose their effectiveness. In [48, 49], a multiple-
error correction code is constructed by combining Hamming codes with DAP codes. In
this method, the outputs of a Hamming encoder are duplicated and an extra parity bit
generated from the original Hamming code is added. This method can correct up to
three-bit errors but requires a large number of additional interconnect wires (e.g., 143
for a 64-bit input message). The large number of wires greatly increases the link energy
consumption and link area overhead. In [50], Bose-Chaudhuri-Hocquenghem (BCH)
codes and Reed-Solomon (RS) codes are applied to network-on-chip (NoC) links. BCH
and RS codes can correct multiple errors, but the field operations greatly increase the
codec area, power consumption and decoding time.

References

1. Gschwind M, Hofstee H, Flachs B et al (2006) Synergistic processing in Cell’s multicore
architecture. IEEE Micro 26:10–24
2. ARM AMBA specification and multilayer AHB specification (rev2.0). http://www.arm.com
3. ARM AMBA 3.0 AXI specification. http://www.arm.com/armtech/AXI
4. IBM CoreConnect specification. http://www.ibm.com/chips/techlib/techlib.nsf/product families/
CoreConnect_Bus_Architecture
5. Wishbone specification. http://www.opencores.org/wishbone
6. Pasricha S, Dutt N, Ben-Romdhane M (2004) Fast exploration of bus-based on-chip
communication architectures. In: International conference on hardware/software codesign
and system synthesis (CODES_ISSS), pp 242–247
7. Mattson GT et al. (2010) The 48-core SCC processor: the programmer’s view. In: Proceedings
of ACM/IEEE conference on supercomputing (SC), pp 1–11

8. Howard J et al (2011) A 48-core IA-32 processor in 45 nm CMOS using on-die message-passing
and DVFS for performance and power scaling. IEEE J Solid-State Circuits 46:173–183
9. Pande PP, Grecu C, Ivanov A, Saleh R (2005) Performance evaluation and design trade-offs
for network-on-chip interconnect architectures. IEEE Trans Comput 54:1025–1040
10. Millberg M, Nilsson E, Thid R, Jantsch A (2004) Guaranteed bandwidth using looped containers
in temporally disjoint networks within the nostrum network on chip. In: Proceedings of the
design, automation and test in Europe conference and exhibition (DATE), pp 890–895
11. Wentzlaff D et al (2007) On-chip interconnection architecture of the tile processor.
IEEE Micro 27:15–31
12. Gratz P, Kim C, Sankaralingam K, Hanson H et al (2007) On-chip interconnection networks of
the TRIPS chip. IEEE Micro 27:41–50
13. Hoskote Y, Vangal S, Singh A, Borkar N, Borkar S (2007) A 5-GHz mesh interconnects for a
teraflops processor. IEEE Micro 27:51–61
14. Ilitzky AD, Hoffman DJ, Chun A, Esparza PB (2007) Architecture of the scalable
communications core’s network on chip. IEEE Micro 27:62–74
15. Sgroi M et al. (2001) Addressing the system-on-a-chip interconnect woes through communica-
tion-based design. In: Proceedings of 38th Design Automation Conference (DAC), pp 667–672
16. Bolotin E, Cidon I, Ginosar R, Kolodny A (2004) Cost considerations in network on chip.
Integr VLSI J 38:19–42
17. Dehyadgari M, Nickray M, Afzali-kusha A, Navabi Z (2005) Evaluation of pseudo adaptive
XY routing using an object oriented model for NOC. In: The 17th international conference on
microelectronics
18. Bobda C, Ahmadinia A, Majer M et al (2005) DyNoC: a dynamic infrastructure for
communication in dynamically reconfigurable devices. In: International conference on field
programmable logic and applications, pp 153–158
19. Kariniemi H, Nurmi J (2004) Arbitration and routing schemes for on-chip packet networks.
In: Interconnect-centric design for advanced SoC and NoC, pp 253–282
20. Dally WJ, Towles B (2004) Principles and practices of interconnection networks. Morgan
Kaufmann, San Francisco
21. Andriahantenaina A, Charlery H, Greiner A et al. (2003) SPIN: a scalable, packet switched,
on-chip micro–network. In: Design automation and test in Europe conference and exhibition
(DATE), pp 70–73
22. Kim J, Park D, Theocharides T et al. (2005) A low latency router supporting adaptivity for on-
chip interconnects. In: Proceedings of 42nd design automation conference (DAC), pp 59–564.
23. Andreasson D, Kumar S (2005) Slack-time aware routing in NoC systems. In: IEEE interna-
tional symposium on circuits and systems, pp 2353–2356
24. Wolkotte PT, Smit G, Rauwerda KG, Smit TL (2005) An energy-efficient reconfigurable
circuit switched network-on-chip. In: Proceedings of the 19th IEEE international parallel and
distributed processing symposium (IPDPS)
25. Kumar S, Jantsch A, Soininen PJ et al. (2002) A network on chip architecture and design
methodology. In: Proceedings of IEEE computer society annual symposium on VLSI,
pp 105–112
26. Wentzlaff D, Griffin P, Hoffmann H et al (2007) On-chip interconnection architecture of the tile
processor. IEEE Micro 27:15–31
27. Balfour J, Dally WJ (2006) Design tradeoffs for tiled CMP on-chip networks. In: 20th annual
international conference on supercomputing, pp 187–198
28. Nicopoulos CA, Dongkook P, Jongman K et al (2006) ViChar: a dynamic virtual channel
regulator for network-on-chip routers. In: 39th Annual IEEE/ACM international symposium
on microarchitecture (MICRO), pp 333–346
29. Bjerregaard T, Mahadevan S (2006) A survey of research and practices of network-on-chip.
Acm Comput Surv 38:1–51

30. Lehtonen T, Wolpert D, Liljeberg P, Plosila J, Ampadu P (2010) Self-adaptive system for
addressing permanent errors in on-chip interconnects. IEEE Trans Very Large Scale Integr
(VLSI) Syst 18:527–540
31. Bertozzi D, Benini L (2004) Xpipes: a network-on-chip architecture for gigascale systems-on-
chip. IEEE Circuits Syst Mag 4:18–31
32. Murali S, Theocharides T, Vijaykrishnan N, Irwin JM, Benini L, De Micheli G (2005)
Analysis of error recovery schemes for networks-on-chips. IEEE Des Test Comput
22:434–442
33. Pirretti M, Link MG, Brooks RR et al (2004) Fault tolerant algorithms for network-on-chip
interconnect. In: IEEE computer society annual symposium on VLSI, pp 46–51
34. Dumitras T, Kerner S, Marculescu R (2003) Towards on-chip fault tolerant communication.
In: Proceedings of ACM/IEEE design automation conference (DAC), pp 225–232
35. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework.
IEEE Trans Very Large Scale Integr (VLSI) Syst 13:655–667
36. Bertozzi D, Benini L, De Micheli G (2005) Error control schemes for on-chip communication
links: the energy-reliability tradeoff. IEEE Trans Computer-Aided Design Integr Circuits
Syst 24:818–831
37. Rossi D, Nieuwland KA, Katoch A, Metra C (2005) Exploiting ECC redundancy to minimize
crosstalk impact. IEEE Des Test Comput 22:59–70
38. Rossi D, Nieuwland KA, Dijk SVE, Kleihorst PR, Metra C (2008) Power consumption of fault
tolerant busses. IEEE Trans Very Large Scale Integr (VLSI) Syst 16:542–553
39. Worm F, Ienne P, Thiran P, Micheli DG (2005) A robust self-calibrating transmission scheme
for on-chip networks. IEEE Trans Very Large Scale Integr (VLSI) Syst 13:126–139
40. Ejlali A, Al-Hashimi MB, Rosinger P, Miremadi GS, Benini L (2010) Performability/energy
tradeoff in error-control schemes for on-chip networks. IEEE Trans Very Large Scale Integr
(VLSI) Syst 18:1–14
41. Li L, Vijaykrishnan N, Kandemir M, Irwin JM (2003) Adaptive error protection for energy
efficiency. In: Proceedings of IEEE/ACM international conference on computer-aided design
(ICCAD), pp 2–7
42. Rossi D, Angelini P, Metra C (2007) Configurable error control scheme for NoC signal
integrity. In: Proceedings of international on line testing symposium (IOLTS), pp 43–48
43. Fujiwara E (2006) Code design for dependable systems: theory and practical applications.
Wiley, Hoboken
44. Yu Q, Ampadu P (2008) Adaptive error control for NoC switch-to-switch links in a variable
noise environment. In: Proceedings of IEEE international symposium on defect and fault
tolerance in VLSI system (DFT), pp 352–360
45. Yu Q, Ampadu P (2009) Adaptive error control for nanometer scale NoC links. IET Comput
Digit Tech 3:643–659 (Special issue on advances in nanoelectronics circuits and systems)
46. Lehtonen T, Liljeberg P, Plosila J (2007) Online reconfigurable self-timed links for fault
tolerant NoC, VLSI Design. Article ID 94676:13
47. Zimmer H, Jantsch A (2003) A fault model notation and error-control scheme for switch-to-
switch buses in a network-on-chip. In: Proceedings of international conference hardware/
software codesign and systems synthesis (CODES-ISSS), pp 188–193
48. Ganguly A, Pande PP, Belzer B, Grecu C (2008) Design of low power & reliable networks on
chip through joint crosstalk avoidance and multiple error correction coding. J Electron
Test: Theory Appl (JETTA), 67–81 (Special issue on defect and fault tolerance)
49. Ganguly A, Pande PP, Belzer B (2009) Crosstalk-aware channel coding schemes for energy
efficient and reliable NOC interconnects. IEEE Trans VLSI Syst 17:1626–1639
50. Lehtonen T, Liljeberg P, Plosila J (2007) Analysis of forward error correction methods for
nanoscale networks-on-chip. In: Proceedings of 2nd international conference on nano-networks
(Nano-Net), pp 1–5
Chapter 4
Error Control Coding for On-Chip
Interconnects

Error control codes (ECCs) have been widely applied in communication systems [1].
In ECCs, parity check bits are calculated based on the input data. The input data and
parity check bits are transmitted across a noisy channel. In the receiver, an ECC
decoder is used to detect or correct the errors induced during the transmission.
A powerful ECC usually requires more redundant bits and more complex encoding
and decoding processes, which increases the codec overhead. To meet the tight
speed, area, and energy constraints imposed by on-chip interconnect links, ECCs
used for on-chip interconnects need to balance reliability and performance. In
this chapter, we will first introduce the basic concepts of error control coding.
Then, the error control codes used for on-chip interconnect and their hardware
implementations are discussed.

4.1 Error Control Coding Basics

4.1.1 Field

Error control coding is based on arithmetic operations in fields. A field F is a
non-empty set of elements on which two operations are defined, called addition ‘+’ and
multiplication ‘*’ respectively [1]. A field F must meet the following requirements:
1. Closure: ∀ a, b ∈ F

\[ c = a + b, \qquad d = a * b \tag{4.1} \]

where c, d ∈ F.

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links, 49


DOI 10.1007/978-1-4419-9313-7_4, # Springer Science+Business Media, LLC 2012

2. Associative: ∀ a, b, c ∈ F

\[ a + (b + c) = (a + b) + c, \qquad a * (b * c) = (a * b) * c \tag{4.2} \]

3. Identity: There exist an additive identity element ‘0’ and a multiplicative identity
element ‘1’ that satisfy

\[ 0 + a = a + 0 = a, \qquad a * 1 = 1 * a = a \tag{4.3} \]

where a ∈ F.
4. Inverse: For every a ∈ F, there exist elements b and c ∈ F such that

\[ a + b = 0, \qquad a * c = 1 \tag{4.4} \]

where $b = -a$ is called the additive inverse and $c = a^{-1}$ (defined for $a \neq 0$) is called the
multiplicative inverse.
5. Commutative: ∀ a, b ∈ F

\[ a + b = b + a, \qquad a * b = b * a \tag{4.5} \]

6. Distributive: ∀ a, b, c ∈ F

\[ (a + b) * c = a * c + b * c \tag{4.6} \]

If there are a finite number of elements in a field F, F is said to be a finite field.
A finite field is also known as a Galois field. GF(q) represents a Galois field with q
elements. A Galois field GF(q) exists only if q is a prime or a power of a prime
($q = p^m$, where p is a prime). If q is a prime, it has been shown that the set of
integers {0, 1, 2, ..., q − 1} together with modulo-q addition and multiplication forms
the Galois field GF(q); if q is a prime power with m > 1, GF(q) is instead constructed
as an extension field of GF(p). The simplest Galois field is GF(2), in which q is equal
to 2. There are two elements {0, 1} in GF(2). Modulo-2 addition can be realized as an
XOR operation and modulo-2 multiplication can be realized as an AND operation.
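As an illustration, the field requirements (4.1)–(4.6) can be checked by brute force for the integers modulo q. The function name `is_field_mod` is an assumption introduced here; the sketch only confirms the requirements for small values of q and is not a proof.

```python
from itertools import product


def is_field_mod(q):
    """Brute-force test of requirements (4.1)-(4.6) for the integers modulo q (q >= 2)."""
    F = range(q)
    add = lambda a, b: (a + b) % q   # closure holds because results stay in 0..q-1
    mul = lambda a, b: (a * b) % q
    for a, b, c in product(F, repeat=3):
        if add(a, add(b, c)) != add(add(a, b), c):           # associativity of +
            return False
        if mul(a, mul(b, c)) != mul(mul(a, b), c):           # associativity of *
            return False
        if add(a, b) != add(b, a) or mul(a, b) != mul(b, a):  # commutativity
            return False
        if mul(add(a, b), c) != add(mul(a, c), mul(b, c)):   # distributivity
            return False
    for a in F:
        if add(0, a) != a or mul(1, a) != a:                 # identity elements
            return False
        if not any(add(a, b) == 0 for b in F):               # additive inverse
            return False
        if a != 0 and not any(mul(a, c) == 1 for c in F):    # multiplicative inverse
            return False
    return True


# In GF(2), addition coincides with XOR and multiplication with AND.
gf2_matches_logic = all(
    (a ^ b) == (a + b) % 2 and (a & b) == (a * b) % 2
    for a, b in product((0, 1), repeat=2)
)
```

For example, `is_field_mod(2)` and `is_field_mod(5)` hold, while `is_field_mod(6)` fails because 2 has no multiplicative inverse modulo 6.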
A polynomial p(x) of degree m over GF(2) can be represented as follows,

\[ p(x) = p_0 + p_1 x + p_2 x^2 + \cdots + p_m x^m \tag{4.7} \]

where the coefficients $p_i$ belong to GF(2) = {0, 1}. Polynomials over GF(2) can be
added, subtracted, multiplied and divided. A polynomial p(x) of degree m over
GF(2) is irreducible if p(x) is not divisible by any polynomial over GF(2) of
degree less than m and greater than zero. An irreducible polynomial p(x) of degree m
over GF(2) can be used to generate the extension field GF($2^m$), in which the field
elements are the $2^m$ polynomials of degree less than m over GF(2).
An irreducible polynomial p(x) of degree m is a primitive polynomial if the smallest
positive integer n for which p(x) divides $x^n + 1$ is $n = 2^m - 1$. If α is a root of a
primitive polynomial p(x) that is used to generate the extension
field GF($2^m$), all elements in GF($2^m$) can be represented as $\{0, 1, \alpha, \alpha^2, \ldots, \alpha^{n-1}\}$
with $n = 2^m - 1$. In this case, α is called a primitive element and $\alpha^n = 1$. Any element
a in GF($2^m$) can be represented uniquely as a linear combination of the m linearly
independent elements $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ over GF(2), such as

\[ a = a_0 + a_1 \alpha + \cdots + a_{m-1} \alpha^{m-1}, \qquad a_i \in \mathrm{GF}(2) \tag{4.8} \]

GF(2) is used to construct binary block codes, in which the encoding and decoding
processes are based on direct bit operations. GF($2^m$) is used to construct nonbinary
codes, in which more complex field operations are involved.
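The generation of an extension field from a primitive polynomial can be sketched for a small case. The polynomial $p(x) = x^3 + x + 1$ and the integer encoding of field elements (bit i holds the coefficient of $\alpha^i$) are assumptions chosen for this example, as are the function names.

```python
def mul_alpha(elem, prim_poly, m):
    """Multiply a field element by alpha and reduce using p(alpha) = 0."""
    elem <<= 1                  # multiplying by alpha raises every power by one
    if elem & (1 << m):         # a term alpha^m appeared: replace it via p(alpha) = 0
        elem ^= prim_poly
    return elem


def alpha_powers(prim_poly, m):
    """Return [alpha^0, alpha^1, ..., alpha^(n-1)] with n = 2^m - 1."""
    powers, elem = [], 1
    for _ in range(2 ** m - 1):
        powers.append(elem)
        elem = mul_alpha(elem, prim_poly, m)
    return powers


PRIM_POLY = 0b1011   # p(x) = x^3 + x + 1 (example choice)
M = 3
powers = alpha_powers(PRIM_POLY, M)
# powers[i] is alpha^i written in the basis {1, alpha, alpha^2} of (4.8)
```

In this example the seven powers of α are distinct and cover every nonzero element of GF($2^3$), and multiplying $\alpha^6$ by α once more returns 1, consistent with $\alpha^n = 1$ for n = 7.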

4.1.2 Linear Block Codes

For a (n, k) block code over GF(q), the encoding process maps each of the $q^k$
message words, each composed of k symbols, to one of $q^k$ codewords, each
composed of n symbols. When the values of k and n are small, the mapping process
is simple. For example, a table can be used to list the mapping between
message words and codewords. The encoding process becomes too complex
when the values of k and n are large. Linear block codes simplify the
encoding process by requiring the codewords to be linear. In a linear block code,
the $q^k$ n-symbol codewords form a vector subspace [1]. The sum of any two
codewords is also a codeword. Almost all useful block codes are linear block codes.
Let C denote a (n, k) linear block code over GF(q). There exist k linearly
independent codewords $\{g_0, g_1, \ldots, g_{k-1}\}$ such that any codeword c ∈ C can
be represented as a linear combination of these codewords,

\[ c = m_0 g_0 + m_1 g_1 + \cdots + m_{k-1} g_{k-1} \tag{4.9} \]

where $m_i$ ∈ GF(q). The set $\{g_0, g_1, \ldots, g_{k-1}\}$ spans the vector subspace and is called
a basis of the code.
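The basis $\{g_0, g_1, \ldots, g_{k-1}\}$ can be arranged as the rows of a k × n matrix G,

\[ G_{k \times n} = \begin{bmatrix} g_0 \\ g_1 \\ \vdots \\ g_{k-1} \end{bmatrix} = \begin{bmatrix} g_{0,0} & g_{0,1} & \cdots & g_{0,n-1} \\ g_{1,0} & g_{1,1} & \cdots & g_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ g_{k-1,0} & g_{k-1,1} & \cdots & g_{k-1,n-1} \end{bmatrix} \tag{4.10} \]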

Let the k-symbol message m be

\[ m = [\, m_0 \;\; m_1 \;\; \cdots \;\; m_{k-1} \,] \tag{4.11} \]

The encoding process of a linear block (n, k) code can be represented in matrix
form by

\[ c = m \cdot G_{k \times n} \tag{4.12} \]

Because any codeword in C can be generated by multiplying a k-symbol message
by the matrix $G_{k \times n}$, $G_{k \times n}$ is called the generator matrix of the code.
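For a (n, k) linear block code C, there must be a (n, n − k) dual code C⊥ of C.
The dual code C⊥ can be constructed by a (n − k) × n matrix H with the following
format,

\[ H_{(n-k) \times n} = \begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_{n-k-1} \end{bmatrix} = \begin{bmatrix} h_{0,0} & h_{0,1} & \cdots & h_{0,n-1} \\ h_{1,0} & h_{1,1} & \cdots & h_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ h_{n-k-1,0} & h_{n-k-1,1} & \cdots & h_{n-k-1,n-1} \end{bmatrix} \tag{4.13} \]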

The matrix H is known as the parity check matrix of the code C. The inner product
between any row vector $g_i$ of generator matrix G and any row vector $h_j$ of parity
check matrix H is zero. This relationship can be expressed in matrix format as,

\[ G_{k \times n} \cdot H^{T}_{(n-k) \times n} = 0 \tag{4.14} \]

where $H^T$ is the transpose of the matrix H.


Since any codeword c can be constructed from generator matrix G, the following
relationship exists for the codeword c and the parity check matrix H,

\[ c \cdot H^{T}_{(n-k) \times n} = m \cdot G_{k \times n} \cdot H^{T}_{(n-k) \times n} = 0 \tag{4.15} \]

4.1.3 Systematic Codes

A (n, k) linear block code is called a systematic code if its codeword can be
separated into message bits and redundancy bits [1]. Figure 4.1 shows an example
of a (7, 4) systematic linear block code over GF(2). In this example, the message
bits are placed at the beginning of the codeword and are followed by redundancy
bits. The position of message bits and redundancy bits can be exchanged by placing
the redundancy bits before the message bits.

Fig. 4.1 An example of a (7, 4) systematic linear block code

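The generator matrix $G_{k \times n}$ of a systematic code is of the following form,

\[ G_{k \times n} = [\, I_k \mid P_{k \times (n-k)} \,] \tag{4.16} \]

where $I_k$ is the k-dimensional identity matrix and $P_{k \times (n-k)}$ is called the parity
matrix. The parity check matrix $H_{(n-k) \times n}$ of a systematic linear block code can be
expressed as,

\[ H_{(n-k) \times n} = [\, P^{T}_{k \times (n-k)} \mid I_{n-k} \,] \tag{4.17} \]

where $P^T$ is the transpose of the parity matrix and $I_{n-k}$ is the (n − k)-dimensional
identity matrix. The generator matrix and parity check matrix of the (7, 4) systematic
code shown in Fig. 4.1 are expressed as,

\[ G_{4 \times 7} = [\, I_4 \mid P_{4 \times 3} \,] = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 & 1 \end{bmatrix} \tag{4.18} \]

\[ H_{3 \times 7} = [\, P^{T}_{4 \times 3} \mid I_3 \,] = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix} \tag{4.19} \]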
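The encoding relation (4.12) and the parity check relation (4.15) can be verified numerically for the (7, 4) code of (4.18) and (4.19). The following is a minimal sketch over GF(2); the function names `encode` and `syndrome` and the chosen message are assumptions made for the illustration.

```python
# Generator and parity check matrices of the (7, 4) code, from (4.18) and (4.19)
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 0, 1, 1],
     [0, 0, 1, 0, 1, 1, 1],
     [0, 0, 0, 1, 1, 0, 1]]
H = [[1, 0, 1, 1, 1, 0, 0],
     [1, 1, 1, 0, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]


def encode(m, G):
    """c = m . G over GF(2): each codeword bit is a modulo-2 sum of message bits."""
    return [sum(m[i] * G[i][j] for i in range(len(G))) % 2 for j in range(len(G[0]))]


def syndrome(r, H):
    """r . H^T over GF(2); the all-zero result means r satisfies every parity check."""
    return [sum(r[j] * row[j] for j in range(len(r))) % 2 for row in H]


message = [1, 0, 1, 1]
codeword = encode(message, G)      # systematic: the first four bits equal the message
clean = syndrome(codeword, H)      # expected to be all zeros by (4.15)

corrupted = list(codeword)
corrupted[2] ^= 1                  # flip one bit to model a link error
dirty = syndrome(corrupted, H)     # nonzero: equals column 2 of H
```

Because the syndrome of a single-bit error equals the corresponding column of H, and the columns of this H are distinct and nonzero, the position of a single error can be identified from the syndrome.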

Fig. 4.2 Hamming sphere and minimum Hamming distance

4.1.4 Hamming Distance

The Hamming distance between two vectors $v_i$ and $v_j$ in GF(q)$^n$ is defined as
the number of symbols that are different between these two vectors [1]. It can
be expressed as,

\[ d_H(v_i, v_j) = \#\{\, k \mid v_{i,k} \neq v_{j,k},\ 0 \le k < n \,\} \tag{4.20} \]

where $v_{i,k}$ and $v_{j,k}$ ∈ GF(q), and #{A} is the number of elements in a set A. If these
vectors belong to GF(2)$^n$, the Hamming distance can be calculated as,

\[ d_H(v_i, v_j) = \sum_{k=0}^{n-1} v_{i,k} \oplus v_{j,k} \tag{4.21} \]

where ⊕ is the XOR operation.
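Both forms of the Hamming distance can be written directly in code. This is a minimal sketch; the function names and the example vectors are assumptions for illustration.

```python
def hamming_distance(u, v):
    """Number of positions in which two equal-length vectors differ, as in (4.20)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(u, v) if a != b)


def hamming_distance_gf2(u, v):
    """Binary case of (4.21): sum of the bitwise XOR of the two vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a ^ b for a, b in zip(u, v))


u = [1, 0, 1, 1, 1, 0, 0]
v = [1, 0, 0, 1, 1, 0, 1]
d_general = hamming_distance(u, v)
d_binary = hamming_distance_gf2(u, v)
d_quaternary = hamming_distance([2, 0, 3, 1], [2, 1, 3, 0])  # symbols from a 4-ary alphabet
```

For binary vectors the two functions agree; the first one also applies to nonbinary symbols.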


Let C be a (n, k) linear block code over GF(q) and $c_i$ ∈ C. All the vectors that
have a Hamming distance less than or equal to l from the codeword $c_i$ form a
sphere of radius l around the codeword $c_i$. This sphere is called a Hamming sphere
and is described as,

\[ S_l(c_i) = \{\, v_j \mid d_H(c_i, v_j) \le l \,\} \tag{4.22} \]

If the Hamming spheres of radius l around the codewords of code C do not
overlap one another, any received word with l or fewer erroneous symbols
can be corrected by code C. Figure 4.2 demonstrates the relationship
between the error correction capability and the Hamming sphere. A codeword
$c_i$ ∈ C of a linear block code is transmitted. Due to the errors caused by noise
during transmission, the codeword is received as $r_i$. If the number of errors is less
than or equal to l, the received word is located in the Hamming sphere of
codeword $c_i$. By selecting $c_i$ as the transmitted codeword, the received word $r_i$
is properly decoded.
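Decoding by selecting the codeword whose Hamming sphere contains the received word can be sketched as an exhaustive nearest-codeword search. The sketch reuses the generator matrix of (4.18); the exhaustive search and the names `CODEBOOK` and `decode_nearest` are assumptions made for illustration and are practical only for very small codes.

```python
from itertools import product

# Generator matrix of the (7, 4) code from (4.18)
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 0, 1, 1],
     [0, 0, 1, 0, 1, 1, 1],
     [0, 0, 0, 1, 1, 0, 1]]


def encode(m):
    """c = m . G over GF(2)."""
    return [sum(m[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]


# All 2^4 codewords of the code
CODEBOOK = [encode(m) for m in product((0, 1), repeat=4)]


def decode_nearest(r):
    """Select the codeword with the smallest Hamming distance to the received word."""
    return min(CODEBOOK, key=lambda c: sum(a != b for a, b in zip(c, r)))


sent = encode((1, 0, 1, 1))

one_error = list(sent)
one_error[5] ^= 1                 # one error: still inside the sphere of radius 1
two_errors = list(sent)
two_errors[0] ^= 1
two_errors[5] ^= 1                # two errors: outside the sphere of the sent codeword
```

In this example the word with one error decodes back to the transmitted codeword, whereas the word with two errors is closer to a different codeword and is decoded incorrectly.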
From Fig. 4.2, we can see that the radius of the nonoverlapping Hamming spheres
characterizes the error correction capability of a linear block code [1]. The error
correcting capability, t, of a linear block code C is defined as the maximum radius of
the Hamming spheres $S_t$ around all the codewords in C such that the Hamming
spheres of any two different codewords $c_i$ and $c_j$ ∈ C do not overlap.
It can be expressed as,

\[ t = \max_{c_i, c_j \in C} \{\, l \mid S_l(c_i) \cap S_l(c_j) = \emptyset,\ c_i \neq c_j \,\} \tag{4.23} \]

When the Hamming spheres around each codeword have the same radius, the
error correction capability t is determined by the minimum distance between
two codewords $c_i$ and $c_j$ in the code C. The minimum Hamming distance of a
code C is defined as,

\[ d_{\min} = \min_{c_i, c_j \in C} \{\, d_H(c_i, c_j),\ c_i \neq c_j \,\} \tag{4.24} \]

The relationship between the error correction capability t and the minimum
Hamming distance $d_{\min}$ can be expressed as,

\[ t = \left\lfloor \frac{d_{\min} - 1}{2} \right\rfloor \tag{4.25} \]

where ⌊x⌋ is the floor function and represents the largest integer less than or equal to x.
For a linear block code, the minimum distance is equal to the minimum weight of its
nonzero codewords, where the weight of a codeword is the number of nonzero symbols
it contains. The minimum Hamming distance is also used to characterize the error detection
capability of a linear block code. A linear block code with minimum distance $d_{\min}$ is
guaranteed to detect any pattern of $d_{\min} - 1$ or fewer errors.
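As a worked check of (4.24) and (4.25), the minimum distance of the (7, 4) code of Sect. 4.1.3 can be computed by enumerating its codewords. This is a brute-force sketch with assumed variable names, suitable only for small k.

```python
from itertools import combinations, product

# Generator matrix of the (7, 4) code from (4.18)
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 0, 1, 1],
     [0, 0, 1, 0, 1, 1, 1],
     [0, 0, 0, 1, 1, 0, 1]]


def encode(m):
    """c = m . G over GF(2)."""
    return [sum(m[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]


codewords = [encode(m) for m in product((0, 1), repeat=4)]

# Minimum Hamming distance over all pairs of distinct codewords, as in (4.24)
d_min = min(sum(a != b for a, b in zip(c1, c2))
            for c1, c2 in combinations(codewords, 2))

# For a linear code this equals the minimum weight of the nonzero codewords
min_weight = min(sum(c) for c in codewords if any(c))

t = (d_min - 1) // 2        # error correction capability, as in (4.25)
detectable = d_min - 1      # number of errors guaranteed to be detected
```

For this code the enumeration gives a minimum distance of 3, so a single error can be corrected or, alternatively, up to two errors can be detected.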
A linear block code C is usually described by three parameters: the codeword
length n, the message length k, and the minimum Hamming distance $d_{min}$.
A code with a large $d_{min}$ is more powerful because its error correction and
detection capability is greater. However, powerful codes require a large number of
redundancy bits, which increases the complexity of the encoding and decoding process.
It is important to balance the error correction capability against the complexity of
a code.
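
The relationship between $d_{min}$, the correction capability t and the detection capability can be checked numerically for a small code. The following is a minimal sketch, assuming an illustrative (7, 4) systematic generator matrix; the names `codewords`, `minimum_distance` and `G74` are hypothetical and not part of the original text. It relies on the property stated above that the minimum distance of a linear code equals its minimum nonzero codeword weight.

```python
from itertools import product

def codewords(G):
    """Enumerate all codewords m*G over GF(2) for a k x n generator matrix G."""
    k, n = len(G), len(G[0])
    for m in product([0, 1], repeat=k):
        yield tuple(sum(m[i] * G[i][j] for i in range(k)) % 2 for j in range(n))

def minimum_distance(G):
    """For a linear code, d_min is the minimum weight over nonzero codewords."""
    return min(sum(c) for c in codewords(G) if any(c))

# Illustrative (7, 4) generator matrix (an assumption for this sketch)
G74 = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1, 1],
]

d_min = minimum_distance(G74)
t = (d_min - 1) // 2   # error correction capability, Eq. (4.25)
detect = d_min - 1     # guaranteed error detection capability
```

Exhaustive enumeration is only practical for small k; it is used here because it mirrors the definition directly.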

4.1.5 Code Modification

Code modification is usually used to construct new codes from a linear block code
C [1]. In this section, we discuss these code construction methods and the
relationship between the modified codes and the original code.
Let C be an (n, k) linear block code. A shortened code is constructed by deleting l
message symbols from the codeword, and is represented as (n − l, k − l). Suppose
that code C is a systematic code. The generator matrix of the shortened
(n − l, k − l) code can be obtained from the original generator matrix
$G_{k \times n}$ by deleting l columns of the identity matrix $I_k$ and the l rows
that have a nonzero value in the deleted columns. The following example shows the
generator matrix of a shortened (6, 3) block code, which is modified from the (7, 4)
linear block code in Sect. 4.1.3.
$$G_{3 \times 6} = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 1 \end{bmatrix} \tag{4.26}$$

In this example, we assume that the first column of the original generator matrix is
removed. The first row of the original generator matrix is then also deleted, because
it has a nonzero value at its first position. The minimum Hamming distance $d_{min}$
of the shortened code is greater than or equal to that of the original code.
An (n, k) linear block code can be extended by adding l additional parity check
symbols. The extended code is represented as (n + l, k). The parity check matrix of
the extended code can be constructed by adding l rows and l columns to the original
parity check matrix. The minimum Hamming distance of the extended code is
greater than or equal to that of the original code. The most common approach to
extending a code is to add one additional parity check symbol, which is calculated
from all the inputs. The following example shows the parity check matrix of an
extended (8, 4) code, which is modified from the (7, 4) linear block code in Sect. 4.1.3.
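$$H_{4 \times 8} = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix} \tag{4.27}$$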
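
The extension by one overall parity check symbol can be sketched as a simple matrix operation. In the sketch below, the 3 × 7 matrix `H_3x7` is an assumption: it is read off from (4.27) by dropping the last row and the last column. The helper name `extend_parity_check` is hypothetical.

```python
def extend_parity_check(H):
    """Extend a parity check matrix by one overall parity symbol.

    A zero column is appended for the new symbol, then an all-ones row
    is added so that the new symbol checks every position of the codeword.
    """
    extended = [row + [0] for row in H]
    extended.append([1] * (len(H[0]) + 1))
    return extended

# Assumed parity check matrix of the original (7, 4) code
H_3x7 = [
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

H_4x8 = extend_parity_check(H_3x7)
```

Under this assumption, `H_4x8` reproduces the four rows of the extended (8, 4) parity check matrix shown in (4.27).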

A linear block code can also be punctured and lengthened. In a punctured code, l
parity check symbols are removed to construct a new (n − l, k) code. For a punctured
code, the code rate increases because there are fewer parity check symbols in the
codeword. A linear block code can be lengthened by adding l message symbols, which
gives an (n + l, k + l) code. The generator matrix of a lengthened code is constructed
by adding l columns and l rows to the generator matrix of the original code. The
minimum Hamming distance of the punctured code and of the lengthened code is less
than or equal to the minimum Hamming distance of the original code.

4.2 Error Control Codes for On-Chip Interconnect

On-chip interconnects have tight speed, area, and energy constraints. Thus, error
control codes used for on-chip interconnects need to balance reliability and perfor-
mance. In early research work, simple ECCs, such as single parity check (SPC)
codes, Hamming codes, and duplicate-add-parity (DAP) codes are widely used to
detect or correct single errors. As the probability of multiple errors increases in
nanoscale technology, more complex error control codes, such as Bose-Chaudhuri-
Hocquenghem (BCH) codes, Reed-Solomon (RS) codes and product codes are
applied to improve the reliability of on-chip interconnects.

4.2.1 Single Parity Check (SPC) Codes

The single parity check (SPC) code is one of the simplest codes. In SPC codes, an
additional parity bit is added to a k-bit data block such that the resulting (k + 1)-bit
codeword has an even number (for even parity) or an odd number (for odd parity) of
1s. SPC codes have a minimum Hamming distance $d_{min} = 2$ and can only be used
for error detection. SPC codes can detect all odd numbers of errors in a codeword.
The hardware circuit used to generate the parity check bit is composed of
a number of exclusive OR (XOR) gates as shown in Fig. 4.3. In the SPC decoder,
another parity generation circuit, identical to that employed in the encoder, is
employed to recalculate the parity check bit based on the received data.
The recalculated parity check bit is compared to the received parity check bit.

Fig. 4.3 An example of single parity check (SPC) codes



Fig. 4.4 Encoding and decoding process of DAP codes

If the recalculated parity check bit is different from the received parity check bit,
errors are detected. The bit comparison can be implemented using an XOR gate as
shown in Fig. 4.3.
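
The XOR-based parity generation and comparison described above can be modeled in a few lines. This is a minimal behavioral sketch, assuming even parity and a list-of-bits representation; the function names `spc_encode` and `spc_check` are hypothetical, and the code models the logic only, not the gate-level circuit of Fig. 4.3.

```python
from functools import reduce
from operator import xor

def spc_encode(data):
    """Append an even-parity bit so the codeword has an even number of 1s."""
    return data + [reduce(xor, data, 0)]

def spc_check(received):
    """Return True if an error is detected.

    The parity is recomputed from the received data bits and compared
    (by XOR) with the received parity bit.
    """
    *data, parity = received
    return (reduce(xor, data, 0) ^ parity) == 1
```

A single flipped bit changes the recomputed or received parity and is detected, while two flipped bits cancel each other, which is consistent with $d_{min} = 2$.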

4.2.2 Duplicate-Add-Parity (DAP) Code

In duplicate-add-parity (DAP) codes [2, 3], a k-bit input is duplicated and an extra
parity check bit, calculated from the original data, is added. For k-bit input data, the
codeword width of DAP codes is 2k + 1. DAP codes have a minimum Hamming
distance $d_{min} = 3$, because any two distinct codewords differ in at least three bit
positions. They can correct single errors.
The encoding and decoding processes of DAP codes are shown in Fig. 4.4.
XOR gates are used to calculate the parity bit from the input data. The input data,
its duplicated copy, and the parity bit make up the DAP codeword. In a DAP
decoder, the parity check bit is recalculated from the received original data and
compared to the transmitted parity check bit. If they have the same value, the
original data is selected as the decoder output. If the recalculated parity check bit
is different from the transmitted parity check bit, the duplicated copy of the original
data is selected as the decoder output. In the DAP code implementation, each
duplicated data bit is placed adjacent to its original data bit. Thus, DAP codes can
reduce the impact of crosstalk coupling. Crosstalk reduction using DAP codes will be
discussed in Chap. 6.
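
The DAP selection rule can be illustrated with a short behavioral model. This is a sketch under stated assumptions: the bit ordering (each original bit immediately followed by its copy, parity bit last) and the function names `dap_encode` and `dap_decode` are chosen for illustration and are not taken from Fig. 4.4.

```python
def dap_encode(data):
    """Duplicate each data bit (copy placed next to the original) and append parity."""
    parity = 0
    for bit in data:
        parity ^= bit
    codeword = []
    for bit in data:
        codeword += [bit, bit]
    return codeword + [parity]

def dap_decode(received):
    """Select the original copy if its parity matches, otherwise the duplicate."""
    original = received[0:-1:2]    # even positions: original data bits
    duplicate = received[1:-1:2]   # odd positions: duplicated data bits
    parity = 0
    for bit in original:
        parity ^= bit
    return original if parity == received[-1] else duplicate
```

With a single error, a mismatch means the error is either in the original copy or in the parity bit, so the duplicate is clean; a match means the original copy is clean.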

Fig. 4.5 An example of Hamming H(7,4) encoder

4.2.3 Hamming Codes

Hamming codes are a type of linear block codes with minimum Hamming distance
$d_{min} = 3$. Hamming codes can be used to either correct single errors or detect
double errors. For a positive integer $r \geq 3$, there exists an (n, k) Hamming code
with the following parameters:

$$n = 2^r - 1, \qquad k = 2^r - 1 - r \tag{4.28}$$

An n-bit Hamming codeword $c_{1 \times n}$ can be generated by multiplying the k-bit
input data $m_{1 \times k}$ by a generator matrix $G_{k \times n}$, that is

$$c_{1 \times n} = m_{1 \times k} \cdot G_{k \times n} \tag{4.29}$$

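An example of a generator matrix for the (7,4) Hamming code H(7,4) with r = 3
is shown below.

$$G_{4 \times 7} = [\, I_{4 \times 4} \mid P_{4 \times 3} \,] = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 \end{bmatrix} \tag{4.30}$$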

The multiplication of a generator matrix $G_{k \times n}$ by the input data
$m_{1 \times k}$ can be implemented using XOR trees. Figure 4.5 shows an example of
the encoding circuit of a Hamming H(7,4) code. The XOR tree depth determines the
worst case delay of the Hamming encoder. The largest XOR tree depth occurs when a
column of $G_{k \times n}$ is all-ones.

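The Hamming codes can be decoded using a syndrome decoding method.
In the syndrome decoding method, the syndrome $s_{1 \times r}$ is calculated by
multiplying the received codeword $r_{1 \times n}$ with the transpose of the parity
check matrix $H_{(n-k) \times n}$,

$$s_{1 \times r} = r_{1 \times n} \cdot H_{(n-k) \times n}^{T} = (c_{1 \times n} + e_{1 \times n}) \cdot H_{(n-k) \times n}^{T} = e_{1 \times n} \cdot H_{(n-k) \times n}^{T} \tag{4.31}$$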

where $e_{1 \times n}$ is an error vector representing any errors introduced during the
transmission. The columns of the parity check matrix $H_{(n-k) \times n}$ consist of all
the nonzero r-bit vectors, such that no two columns have the same value. The product
of any valid codeword $c_{1 \times n}$ with the transpose of $H_{(n-k) \times n}$ is equal
to zero. Thus, if there is no error in the received codeword, the syndrome
$s_{1 \times r}$ is zero. If the received codeword has a single error, the syndrome is
nonzero and equal to one column of the parity check matrix $H_{(n-k) \times n}$. The
nonzero syndrome vector can be used to determine the corresponding error vector
$e_{1 \times n}$ by looking it up in a predefined table. The error correction is performed
by adding the error vector $e_{1 \times n}$ back to the received codeword
$r_{1 \times n}$. The following is an example of syndrome decoding. Assume that a
(7, 4) Hamming codeword $c_1$ = [0 0 1 1 1 1 0] is generated using the generator
matrix G in (4.30). The fourth bit is corrupted during transmission, so the received
codeword becomes $r_1$ = [0 0 1 0 1 1 0]. The syndrome is s = [0 1 1], using the
following parity check matrix.
$$H_{3 \times 7} = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix} \tag{4.32}$$

The syndrome is equal to the fourth column of parity check matrix. Thus the
error vector is e ¼ [0 0 0 1 0 0 0].
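
The worked example can be reproduced with a short behavioral model. The sketch below uses the generator matrix of (4.30) and the parity check matrix of (4.32); the function names `encode`, `syndrome` and `correct` are hypothetical, and the column search stands in for the predefined syndrome table.

```python
G = [  # generator matrix of Eq. (4.30)
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1, 1],
]
H = [  # parity check matrix of Eq. (4.32)
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def encode(m):
    """c = m * G over GF(2)."""
    return [sum(m[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def syndrome(r):
    """s = r * H^T over GF(2)."""
    return [sum(r[j] * H[i][j] for j in range(7)) % 2 for i in range(3)]

def correct(r):
    """Flip the bit whose column of H equals the nonzero syndrome."""
    s = syndrome(r)
    if not any(s):
        return list(r)
    columns = [[H[i][j] for i in range(3)] for j in range(7)]
    position = columns.index(s)
    return [bit ^ int(j == position) for j, bit in enumerate(r)]
```

For the message [0 0 1 1], `encode` gives the codeword of the example, and `correct` applied to the received word with the fourth bit flipped restores it.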
Figure 4.6 shows an example of the Hamming(7,4) decoding circuits. The
syndrome is calculated from the received Hamming codeword. The syndrome
calculation circuit can be implemented as XOR trees. The calculated syndrome is
used to decide the error vector through the syndrome decoder circuit. The syndrome
decoder circuit can be realized using AND trees. The syndrome and its inverse are
the inputs of the AND trees. For a binary code, adding the error vector to the
received codeword reduces to XOR operations.
A Hamming code can be extended by adding one overall parity check bit.
An extended (n + 1, k) Hamming code meets the following requirements,

$$n = 2^{r-1} - 1, \qquad k = 2^{r-1} - r \tag{4.33}$$

Extended Hamming codes have a minimum Hamming distance $d_{min} = 4$ and
belong to single-error-correcting and double-error-detecting (SEC-DED) codes,
which can correct single errors and detect double errors at the same time.

Fig. 4.6 An example of Hamming H(7,4) decoder

Figure 4.7 shows an implementation example of the extended Hamming EH(8,4)
decoder. One of the syndrome bits is an even parity of the entire codeword. If this
bit is a zero and other syndrome bits are non-zero, this implies that there were
two errors – the zero even parity check bit indicates that there are zero (or an even
number of) errors, while the other non-zero syndrome bits indicate that there is at
least one error. Since the extended Hamming code can only guarantee detection of
up to two errors, we assume that this syndrome pattern represents double errors. The
implementation of double error detection is shown in the bottom of Fig. 4.7.
A Hamming code or an extended Hamming code can be shortened by eliminating
a certain number of information bits (e.g., s). A shortened (n − s, k − s) Hamming code
or extended Hamming code has the same number of redundant bits r as the
original code. Table 4.1 shows the codeword length of standard and shortened
Hamming codes and extended Hamming codes for different input data widths.
Shortening makes the code more flexible, able to support any input data width.

4.2.4 Hsiao Codes

Hsiao codes are a special case of extended Hamming codes with SEC-DED capability.
In Hsiao codes, the parity check matrix $H_{(n-k) \times n}$ satisfies the following four
constraints: (a) every column is different; (b) no all-zero column exists; (c) there
is an odd number of 1's in each column; (d) each row of the parity check matrix
contains the same number of 1's.

Fig. 4.7 An implementation example of the extended Hamming EH(8,4) decoder

Table 4.1 Codeword length for Hamming codes and extended Hamming codes for different input
data bits

Data bits k   Shortened/standard          Shortened/standard
              Hamming code (n, k)         Ex-Hamming code (n + 1, k)
4             (7,4)                       (8,4)
8             (12,8)                      (13,8)
16            (21,16)                     (22,16)
32            (38,32)                     (39,32)
64            (71,64)                     (72,64)
128           (136,128)                   (137,128)
256           (265,256)                   (266,256)
512           (522,512)                   (523,512)

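The following is an example of the parity check matrix of an (8, 4) Hsiao code,

$$H_{4 \times 8} = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \end{bmatrix} \tag{4.34}$$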

Fig. 4.8 An example of correcting spatial burst error using multiple SEC codes with interleaving

This parity check matrix has an odd number of 1's in each column, and the number
of 1's in each row is equal. A double error is detected when the syndrome is nonzero
and contains an even number of 1's. The hardware requirement of the Hsiao encoder
and decoder is less than that of the extended Hamming codes shown in Sect. 4.2.3,
because the parity check matrix of a Hsiao code contains fewer 1's than that of an
extended Hamming code. Further, having the same number of 1's in each row
of the parity check matrix reduces the calculation delay of the parity check bits.
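
The odd-weight-column property makes the error classification a matter of counting the 1's in the syndrome. The following is a minimal sketch using the parity check matrix of (4.34), assuming that at most two errors occur; the function name `classify` and the returned labels are hypothetical.

```python
H_HSIAO = [  # parity check matrix of Eq. (4.34)
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1],
]

def classify(received):
    """Classify the received word by the weight of its syndrome.

    Every column of H has odd weight, so a single error gives an
    odd-weight syndrome and a double error gives a nonzero
    even-weight syndrome (assuming no more than two errors).
    """
    s = [sum(h * b for h, b in zip(row, received)) % 2 for row in H_HSIAO]
    weight = sum(s)
    if weight == 0:
        return "no error"
    if weight % 2 == 1:
        return "single error"
    return "double error"
```

Because the all-zero word is a valid codeword of any linear code, it can be used as the transmitted word when trying out error patterns.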

4.2.5 SEC Codes with Interleaving

Interleaving is an efficient approach to achieve protection against spatial burst
errors. Figure 4.8 shows the principle of using multiple SEC codes with interleaving
to correct spatial burst errors [4, 5]. In this method, the input data is separated into
smaller groups. Each group is encoded separately using simple linear block
codes (e.g., SEC codes). The outputs of these small groups are interleaved.
The interleaved data is transmitted to the receiver. In the receiver, the interleaved
data is first deinterleaved. The deinterleaving process distributes spatial burst
errors into different groups. For the example in Fig. 4.8 of a two-bit burst error with
two groups, each group contains only a single error after deinterleaving,
which can then be corrected using SEC decoders.
Many interleaving algorithms have been proposed for communication systems.
The simplest interleaving algorithm is row-column interleaving. Figure 4.9
illustrates the relation between the inputs and outputs of a row-column interleaver.
Assume the input data is separated into m groups. After encoding, each group has
an n-bit output. The row-column interleaving algorithm assigns the wires in the bus
such that the first m wires carry the first bits of the m blocks, followed by the
m second bits, and so on. The interleaving distance, which is the distance between
two wires that belong to the same group, can be used to measure the spatial burst
error correction capability. A larger interleaving distance allows correction of
larger burst errors. For on-chip interconnects, the interleaving is usually
implemented as local hardwired connections, adding some design complexity with
negligible power and delay overhead.

Fig. 4.9 Input and output relation of row-column interleaver
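
The wire assignment of a row-column interleaver can be sketched as an index mapping. This is a minimal model, assuming m encoded groups of n bits each given as lists; the function names `interleave` and `deinterleave` are hypothetical.

```python
def interleave(groups):
    """Row-column interleaving: wire i*m + g carries bit i of group g."""
    m, n = len(groups), len(groups[0])
    return [groups[g][i] for i in range(n) for g in range(m)]

def deinterleave(bus, m):
    """Recover the m groups: group g owns wires g, g + m, g + 2m, ..."""
    return [bus[g::m] for g in range(m)]
```

With m groups, adjacent wires always belong to different groups, so a burst of up to m adjacent errors leaves at most one error per group, which a SEC decoder can then correct.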

4.2.6 Cyclic Codes

A linear block code C is a cyclic code if a cyclic shift of any codeword in C is still
a codeword in C [1]. In cyclic codes, the input data $m = (m_0, m_1, \ldots, m_{k-1})$ and
the codeword $c = (c_0, c_1, \ldots, c_{n-1})$ are usually represented as polynomials of
the following form,

$$m(x) = m_0 + m_1 x + m_2 x^2 + \cdots + m_{k-1} x^{k-1} \tag{4.35}$$

$$c(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_{n-1} x^{n-1} \tag{4.36}$$

An (n, k) cyclic code can be constructed by multiplying an input polynomial with a
generator polynomial g(x),

$$c(x) = m(x) \cdot g(x) \tag{4.37}$$

where $g(x) = 1 + g_1 x + g_2 x^2 + \cdots + g_{r-1} x^{r-1} + x^r$ is a factor of
$x^n - 1$ and has degree $r = n - k$. For example, a cyclic (7, 3) code can be
constructed with generator polynomial $g(x) = 1 + x^2 + x^3 + x^4$, which is a factor
of $x^7 - 1$ ($x^7 - 1 = (1 + x)(1 + x + x^3)(1 + x^2 + x^3)$).

Fig. 4.10 Encoding circuit for an (n, k) cyclic code

A systematic cyclic code can be constructed by appending the parity check bits
to the right-shifted input data,

$$c(x) = m(x) \cdot x^{n-k} + R_{g(x)}\left[ m(x) \cdot x^{n-k} \right] \tag{4.38}$$

The parity check bits are the remainder of the division of $m(x) \cdot x^{n-k}$ by g(x).
The encoding process of cyclic codes can be realized serially by using a simple
linear feedback shift register (LFSR). Figure 4.10 shows the encoding circuit for a
systematic cyclic code. The input data m(x) is shifted into the encoding circuit one
bit at a time from the right end with switches a and b at position 1. After all the input
data have entered the circuit, the switches a and b are set to position 2 and the parity
check bits in $b_0$ to $b_{r-1}$ are serially shifted out.
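
The systematic encoding rule of (4.38) can also be modeled in software as a polynomial division over GF(2). The sketch below is a behavioral model, not a cycle-accurate LFSR; it assumes polynomials are stored as Python integers in which bit i holds the coefficient of $x^i$, and the function names `gf2_mod` and `cyclic_encode_systematic` are hypothetical.

```python
def gf2_mod(dividend, divisor):
    """Remainder of polynomial division over GF(2) (bit i = coefficient of x^i)."""
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        # Cancel the current leading term of the dividend
        dividend ^= divisor << (dividend.bit_length() - divisor_len)
    return dividend

def cyclic_encode_systematic(m, g, n, k):
    """c(x) = m(x) * x^(n-k) + remainder of m(x) * x^(n-k) divided by g(x)."""
    shifted = m << (n - k)
    return shifted ^ gf2_mod(shifted, g)

# (7, 3) cyclic code with g(x) = 1 + x^2 + x^3 + x^4
G_POLY = 0b11101
codeword = cyclic_encode_systematic(0b001, G_POLY, 7, 3)
```

Every codeword produced this way is divisible by g(x), and the message appears unchanged in the high-order positions.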
The use of an LFSR circuit requires little hardware but introduces a large latency
when a large amount of data is processed. Cyclic codes can also be encoded by
multiplying the input data with a generator matrix. For an (n, k) cyclic code
with generator polynomial $g(x) = g_0 + g_1 x + g_2 x^2 + \cdots + g_{r-1} x^{r-1} + g_r x^r$,
the generator matrix can be constructed as follows:

$$G_{k \times n} = \begin{bmatrix} g_0 & g_1 & g_2 & \cdots & g_r & 0 & 0 & \cdots & 0 \\ 0 & g_0 & g_1 & g_2 & \cdots & g_r & 0 & \cdots & 0 \\ 0 & 0 & g_0 & g_1 & g_2 & \cdots & g_r & \cdots & 0 \\ \vdots & & & \ddots & & & & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & g_0 & g_1 & g_2 & \cdots & g_r \end{bmatrix} \tag{4.39}$$

Table 4.2 BCH codes generated by primitive elements for m ≤ 7 with different error correction
capabilities

         t = 1         t = 2         t = 3
m = 3    (7, 4)
m = 4    (15, 11)      (15, 7)       (15, 5)
m = 5    (31, 26)      (31, 21)      (31, 16)
m = 6    (63, 57)      (63, 51)      (63, 45)
m = 7    (127, 120)    (127, 113)    (127, 106)

A shortened (n − l, k − l) cyclic code can be constructed from an (n, k) cyclic code
by eliminating the l rightmost bits of the codewords. Shortened cyclic codes are
generally not cyclic. A class of shortened cyclic codes, usually generated
by either a primitive polynomial p(x) or a generator polynomial g(x) = (x + 1)p(x),
is also known as cyclic redundancy check (CRC) codes. CRC codes are effective
at detecting burst errors. CRC codes can be implemented using an LFSR with
the same circuit as the original cyclic codes. CRC codes can also be implemented
in a parallel approach to improve throughput [6, 7]. For on-chip interconnects,
parallel implementation is preferred. A few CRC codes have become international
standards (e.g., CRC-5 with generator polynomial $g(x) = 1 + x^2 + x^4 + x^5$ is used
in an International Telecommunication Union (ITU) standard).

4.2.7 Bose-Chaudhuri-Hocquenghem (BCH) Codes

BCH codes are an important class of linear block codes for multiple error correction [1].
For a positive integer $m \geq 3$, a t-error-correcting BCH (n, k) code can be constructed
over the Galois field GF($2^m$) with the following requirements for the codeword width n
and data width k,

$$n = 2^m - 1 \tag{4.40}$$

$$k \geq n - mt \tag{4.41}$$

where GF($2^m$) is the extension field constructed from GF(2) with elements
$\{0, 1, \alpha, \alpha^2, \ldots, \alpha^{2^m - 2}\}$; $\alpha$ is a primitive element in
GF($2^m$). Table 4.2 shows examples of BCH codes generated by primitive elements
for $m \leq 7$ with different error correction capabilities.
The encoding process of BCH codes can be realized with an approach similar to that of
cyclic codes. The generator polynomial g(x) of a t-error-correcting BCH code is
defined as the least common multiple (LCM) of $\phi_1, \phi_3, \ldots, \phi_{2t-1}$,

$$g(x) = \mathrm{LCM}\{\phi_1(x), \phi_3(x), \ldots, \phi_{2t-1}(x)\} \tag{4.42}$$

where $\phi_j$ is the minimal polynomial of $\alpha^j$ $(0 < j < 2t)$. For a t-error-correcting
BCH code, the generator polynomial g(x) has $\alpha, \alpha^2, \alpha^3, \ldots, \alpha^{2t}$
as its roots.

Fig. 4.11 Block diagram of BCH decoder

For example, the generator polynomial of the 3-error-correcting BCH (15, 5)
code is obtained by multiplying the following minimal polynomials,

$$\phi_1(x) = (x + \alpha)(x + \alpha^2)(x + \alpha^4)(x + \alpha^8) = 1 + x + x^4 \tag{4.43}$$

$$\phi_3(x) = (x + \alpha^3)(x + \alpha^6)(x + \alpha^{12})(x + \alpha^9) = 1 + x + x^2 + x^3 + x^4 \tag{4.44}$$

$$\phi_5(x) = (x + \alpha^5)(x + \alpha^{10}) = 1 + x + x^2 \tag{4.45}$$

Thus the generator polynomial g(x) is given by,

$$g(x) = \phi_1(x)\,\phi_3(x)\,\phi_5(x) = 1 + x + x^2 + x^4 + x^5 + x^8 + x^{10} \tag{4.46}$$
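
The product in (4.46) can be verified with carry-less multiplication over GF(2). In this sketch, polynomials are assumed to be stored as Python integers in which bit i holds the coefficient of $x^i$; the function name `gf2_mul` and the constant names are hypothetical.

```python
def gf2_mul(a, b):
    """Multiply two GF(2) polynomials (bit i = coefficient of x^i)."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # add the shifted copy of a (addition is XOR)
        a <<= 1
        b >>= 1
    return result

PHI_1 = 0b10011   # 1 + x + x^4
PHI_3 = 0b11111   # 1 + x + x^2 + x^3 + x^4
PHI_5 = 0b111     # 1 + x + x^2

g_dec = gf2_mul(PHI_1, PHI_3)   # product of the first two minimal polynomials
g_tec = gf2_mul(g_dec, PHI_5)   # generator polynomial of the BCH (15, 5) code
```

The degree of `g_tec` is 10, which matches n − k = 15 − 5 parity bits.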

The decoding process of BCH codes is more complicated than the encoding
process. Usually, the decoding process of BCH codes can be separated into four
steps – calculating the syndromes, calculating the error location polynomial, finding
the error locations and flipping the errors, as shown in Fig. 4.11.
Let

$$\begin{aligned} c(x) &= c_0 + c_1 x + c_2 x^2 + \cdots + c_{n-1} x^{n-1} \\ r(x) &= r_0 + r_1 x + r_2 x^2 + \cdots + r_{n-1} x^{n-1} \\ e(x) &= e_0 + e_1 x + e_2 x^2 + \cdots + e_{n-1} x^{n-1} \end{aligned} \tag{4.47}$$

be the transmitted polynomial, the received polynomial and the error polynomial,
respectively, so that

$$r(x) = c(x) + e(x) \tag{4.48}$$

For a t-error-correcting BCH code, 2t syndromes $S_j$ $(1 \leq j \leq 2t)$ can be
calculated as,

$$S_j = \sum_{i=0}^{n-1} r_i\, \alpha^{ij} \tag{4.49}$$

Fig. 4.12 An example of calculating the syndrome S3 of BCH code with m ¼ 4

Equation 4.49 can be written as,

$$S_j = (\cdots((r_{n-1}\, \alpha^j + r_{n-2})\, \alpha^j + r_{n-3})\, \alpha^j + \cdots)\, \alpha^j + r_0 \tag{4.50}$$

The calculation of syndrome $S_j$ requires (n − 1) multiplications by the constant
value $\alpha^j$ and (n − 1) additions. Because $r_i \in$ GF(2), $S_{2j}$ is equal to $S_j^2$.
Figure 4.12 shows an example of a circuit calculating $S_3$ for m = 4 with
$p(x) = x^4 + x + 1$. The registers $s_i$ $(0 \leq i \leq 3)$ are initialized to zero. Then the
received bits $r_i$ $(0 \leq i \leq 14)$ and the registers $s_0$ to $s_3$ are shifted every
clock cycle. After 15 clock cycles, $S_3$ is obtained in the $s_0$ to $s_3$ registers.
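
The nested evaluation of (4.50) can be modeled in software with GF($2^4$) arithmetic. This is a behavioral sketch, not a model of the register circuit in Fig. 4.12; it assumes field elements are stored as 4-bit integers in the polynomial basis defined by $p(x) = x^4 + x + 1$, and the names `gf16_mul`, `gf16_pow`, `ALPHA` and `syndrome` are hypothetical.

```python
PRIM = 0b10011   # p(x) = x^4 + x + 1
ALPHA = 0b0010   # the primitive element alpha, i.e. the polynomial x

def gf16_mul(a, b):
    """Multiply two elements of GF(2^4), reducing modulo p(x)."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:   # degree reached 4: subtract p(x)
            a ^= PRIM
    return result

def gf16_pow(a, e):
    """Raise a to the non-negative integer power e by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf16_mul(result, a)
    return result

def syndrome(r, j):
    """Evaluate S_j = r(alpha^j) with the nested form of Eq. (4.50).

    r[i] is the coefficient r_i, so the bits are consumed from r_(n-1) down to r_0.
    """
    alpha_j = gf16_pow(ALPHA, j)
    s = 0
    for bit in reversed(r):
        s = gf16_mul(s, alpha_j) ^ bit
    return s
```

For a valid codeword of a t-error-correcting code, all 2t syndromes are zero; with errors present, the property $S_{2j} = S_j^2$ can be checked directly.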
Syndromes $S_j$ can also be obtained from the remainder of the division of the
received polynomial r(x) by the minimal polynomial $\phi_j(x)$. That is,

$$r(x) = a_j(x) \cdot \phi_j(x) + b_j(x) \tag{4.51}$$

Thus,

$$S_j = b_j(\alpha^j) \tag{4.52}$$

The minimal polynomials for $\alpha, \alpha^2, \alpha^4, \ldots$ are the same, so the same
register architecture can be used to calculate the syndromes $S_1, S_2, S_4, \ldots$. The
same holds for $S_3, S_6, \ldots$, and so on.
Figure 4.13 shows an example of the circuit to calculate $S_3$ for m = 4. The minimal
polynomial of $\alpha^3$ is $\phi_3(x) = 1 + x + x^2 + x^3 + x^4$. Let
$b(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3$ be the remainder on dividing r(x) by $\phi_3(x)$. Then

$$\begin{aligned} S_3 &= b(\alpha^3) \\ &= b_0 + b_1 \alpha^3 + b_2 \alpha^6 + b_3 \alpha^9 \\ &= b_0 + b_3 \alpha + b_2 \alpha^2 + (b_1 + b_2 + b_3)\, \alpha^3 \end{aligned} \tag{4.53}$$

In Fig. 4.13, the received vector r(x) is first divided by $\phi_3(x)$ to generate b(x) and
then $S_3$ is calculated according to (4.53). The result is obtained from the registers
$b_0$ to $b_3$ after 15 clock cycles.

Fig. 4.13 Another method to calculate the syndrome S3 of BCH code with m ¼ 4

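The syndrome can also be calculated by multiplying the received codeword with the
transpose of the parity check matrix $H_{2t \times n}$,

$$s_{1 \times 2t} = v_{1 \times n} \cdot H_{2t \times n}^{T} \tag{4.54}$$

where $v = (v_0, v_1, \ldots, v_{n-1})$ is the received codeword. $H_{2t \times n}$ can be
described as follows,

$$H_{2t \times n} = \begin{bmatrix} 1 & \alpha & \alpha^2 & \alpha^3 & \cdots & \alpha^{n-1} \\ 1 & (\alpha^2) & (\alpha^2)^2 & (\alpha^2)^3 & \cdots & (\alpha^2)^{n-1} \\ 1 & (\alpha^3) & (\alpha^3)^2 & (\alpha^3)^3 & \cdots & (\alpha^3)^{n-1} \\ \vdots & & & & & \vdots \\ 1 & (\alpha^{2t}) & (\alpha^{2t})^2 & (\alpha^{2t})^3 & \cdots & (\alpha^{2t})^{n-1} \end{bmatrix} \tag{4.55}$$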

The second stage of the BCH decoding process is finding the coefficients of the
error location polynomial $\sigma(x) = \sigma_0 + \sigma_1 x + \cdots + \sigma_t x^t$ using the
syndromes $S_j$ $(1 \leq j \leq 2t)$. The relationship between the syndromes and the
coefficients $\sigma_j$ is given by,

$$\sum_{j=0}^{t} S_{t+i-j}\, \sigma_j = 0 \quad (i = 1, \ldots, t) \tag{4.56}$$

The roots of $\sigma(x)$ give the error positions. The coefficients of $\sigma(x)$ can be
calculated by methods [1] such as the Peterson-Gorenstein-Zierler algorithm,
Euclid's algorithm, and the Berlekamp–Massey algorithm (BMA). In this section, we
mainly introduce the BMA because it is the most efficient method in practice.
In the BMA, the error location polynomial $\sigma(x)$ is found by t − 1 recursive
iterations. During each iteration r, the degree of $\sigma(x)$ is usually incremented by
one. The discrepancy $d_r$ in each iteration is defined as,

$$d_r = \sum_{j=0}^{t} S_{2r-j+1} \cdot \sigma_j \tag{4.57}$$

If the iteration number r is greater than or equal to the number of errors $t_a$ that
have actually occurred, the discrepancy $d_r$ in (4.57) is equal to zero. If $r < t_a$, the
discrepancy $d_r$ calculated in (4.57) is usually nonzero. Then, the degree and
coefficients of $\sigma(x)$ are modified based on the value of $d_r$. The purpose of the
BMA is to compute the minimum-degree $\sigma(x)$ meeting the requirement of (4.56).
The BMA with inversion is given below.
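Initial values:

$$\begin{aligned} d_p &= \begin{cases} 1 & \text{if } S_1 = 0 \\ S_1 & \text{if } S_1 \neq 0 \end{cases} \\ \sigma^{(0)}(x) &= 1 + S_1 \cdot x \\ b^{(1)}(x) &= \begin{cases} x^3 & \text{if } S_1 = 0 \\ x^2 & \text{if } S_1 \neq 0 \end{cases} \\ l_1 &= \begin{cases} 0 & \text{if } S_1 = 0 \\ 1 & \text{if } S_1 \neq 0 \end{cases} \\ r &= 1 \end{aligned} \tag{4.58}$$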

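The error location polynomial $\sigma(x)$ is then calculated using the following set of
equations:

$$\begin{aligned} d_r &= \sum_{j=0}^{t} \sigma_j^{(r)} \cdot S_{2r-j+1} \\ \sigma^{(r)}(x) &= \begin{cases} \sigma^{(r-1)}(x) & \text{if } d_r = 0 \\ \sigma^{(r-1)}(x) - d_p^{-1} \cdot d_r \cdot b^{(r)}(x) & \text{if } d_r \neq 0 \end{cases} \\ b^{(r+1)}(x) &= \begin{cases} x^2 \cdot b^{(r)}(x) & \text{if } d_r = 0 \text{ or } r < l_r \\ x^2 \cdot \sigma^{(r-1)}(x) & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ l_{r+1} &= \begin{cases} l_r & \text{if } d_r = 0 \text{ or } r < l_r \\ 2 \cdot r - l_r + 1 & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ d_p &= \begin{cases} d_p & \text{if } d_r = 0 \text{ or } r < l_r \\ d_r & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ r &= r + 1 \end{aligned} \tag{4.59}$$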

Fig. 4.14 Berlekamp Massey Algorithm with inversion

These calculations are carried out for r = 1, . . ., t − 1. Figure 4.14 shows a circuit
implementation of the BMA. The error location polynomial $\sigma(x)$ is obtained in the
$\sigma$ registers after t − 1 iterations.
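In some applications it may be useful to implement the BMA without inversion [8].
For the inversionless BMA, the initial conditions can be the same as those for the
BMA with inversion given in (4.58). The error location polynomial is then calculated
using the following equations:

$$\begin{aligned} d_r &= \sum_{j=0}^{t} \sigma_j^{(r)} \cdot S_{2r-j+1} \\ \sigma^{(r)}(x) &= \begin{cases} d_p \cdot \sigma^{(r-1)}(x) & \text{if } d_r = 0 \\ d_p \cdot \sigma^{(r-1)}(x) - d_r \cdot b^{(r)}(x) & \text{if } d_r \neq 0 \end{cases} \\ b^{(r+1)}(x) &= \begin{cases} x^2 \cdot b^{(r)}(x) & \text{if } d_r = 0 \text{ or } r < l_r \\ x^2 \cdot \sigma^{(r-1)}(x) & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ l_{r+1} &= \begin{cases} l_r & \text{if } d_r = 0 \text{ or } r < l_r \\ 2 \cdot r - l_r + 1 & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ d_p &= \begin{cases} d_p & \text{if } d_r = 0 \text{ or } r < l_r \\ d_r & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ r &= r + 1 \end{aligned} \tag{4.60}$$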

Fig. 4.15 The implementation of Chien’s search algorithm

Inversionless BMA is more complicated and requires a greater number of
multiplications than the BMA with inversion. On the other hand, BMA with
inversion takes more clock cycles to complete the same calculation. Therefore the
inversionless algorithm can be implemented to meet high performance requirements.
For SEC and DEC BCH codes the coefficients of $\sigma(x)$ can be obtained directly
without using the BMA. This is because for SEC BCH codes

$$\sigma(x) = 1 + S_1 x \tag{4.61}$$

and for DEC BCH codes

$$\sigma(x) = 1 + \sigma_1 x + \sigma_2 x^2 = 1 + S_1 x + \left( S_1^2 + S_3 \cdot S_1^{-1} \right) x^2 \tag{4.62}$$

The calculation of $\sigma(x)$ directly from the syndromes can be extended to triple-
error-correcting (TEC) BCH codes. However, this method quickly becomes too
complex to implement in hardware as the error correction capability increases.
The third step in decoding BCH codes is to find the erroneous bit locations.
These values are the reciprocals of the roots of $\sigma(x)$ and can be found simply by
substituting $1, \alpha, \alpha^2, \ldots, \alpha^{n-1}$ into $\sigma(x)$. A method of finding the error
locations has been presented by Chien. In the Chien search algorithm, the sum

$$\sigma_0 + \sigma_1 \alpha^j + \sigma_2 \alpha^{2j} + \cdots + \sigma_t \alpha^{tj} \quad (j = 0, 1, \ldots, k - 1) \tag{4.63}$$

is evaluated every clock cycle. If the sum equals zero for clock cycle j, the received
bit $r_{n-j-1}$ is erroneous.
Figure 4.15 shows a hardware implementation of the Chien search algorithm.
The registers $c_0, c_1, \ldots, c_t$ are initialized with the coefficients of the error location
polynomial $\sigma_0, \sigma_1, \ldots, \sigma_t$. Then the sum $\sum_{i=0}^{t} c_i$ is calculated, and if this
value equals zero, an error has been detected. At the same time, each value in
the $c_i$ register is multiplied by $\alpha^i$ (using a constant multiplier). The sum
$\sum_{i=0}^{t} c_i$ is calculated again on the next clock cycle. The above operations are
carried out for every transmitted message bit.
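
The direct computation of $\sigma(x)$ in (4.62) followed by a root search can be sketched for a DEC BCH code of length 15. This is a behavioral model under stated assumptions: it works in GF($2^4$) with $p(x) = x^4 + x + 1$, evaluates $\sigma(x)$ exhaustively at all 15 nonzero field elements instead of modeling the registers of Fig. 4.15, and maps a root $\alpha^j$ to bit position (15 − j) mod 15 because the error locator is the reciprocal of the root. The hardware may index bits differently. All function names are hypothetical.

```python
PRIM = 0b10011   # p(x) = x^4 + x + 1
ALPHA = 0b0010   # primitive element alpha

def mul(a, b):
    """Multiply two elements of GF(2^4)."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= PRIM
    return result

def power(a, e):
    """a raised to a non-negative integer power."""
    result = 1
    for _ in range(e):
        result = mul(result, a)
    return result

def inverse(a):
    """a^(-1) = a^14, since a^15 = 1 for every nonzero a (returns 0 for a = 0)."""
    return power(a, 14)

def syndrome(r, j):
    """S_j = r(alpha^j), evaluated with Horner's rule."""
    alpha_j = power(ALPHA, j)
    s = 0
    for bit in reversed(r):
        s = mul(s, alpha_j) ^ bit
    return s

def locate_errors(r):
    """Return the sorted error positions for at most two errors in a length-15 word."""
    s1, s3 = syndrome(r, 1), syndrome(r, 3)
    # Eq. (4.62): sigma(x) = 1 + S1 x + (S1^2 + S3 * S1^-1) x^2
    sigma = [1, s1, mul(s1, s1) ^ mul(s3, inverse(s1))]
    positions = []
    for j in range(15):
        x = power(ALPHA, j)
        value = sigma[0] ^ mul(sigma[1], x) ^ mul(sigma[2], mul(x, x))
        if value == 0:
            positions.append((15 - j) % 15)
    return sorted(positions)
```

With a single error, the $x^2$ coefficient evaluates to zero and the polynomial reduces to the SEC form of (4.61). The final correction step is an XOR of the located bits.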

The last step of decoding BCH codes is to correct the erroneous bits once the error
positions have been found by the Chien search algorithm. The error correction can be
implemented using XOR gates.
A t-error-correcting (n, k, t) BCH code can be shortened by eliminating a certain
number of information bits to construct a shortened t-error-correcting BCH code
with the same number of redundant bits.

4.2.8 Reed-Solomon (RS) Codes

RS codes are a subclass of nonbinary BCH codes [1, 9] that are good at correcting
multiple symbol errors. For an RS code with symbols from GF(q), the codeword
width n in symbols and the number of symbols in the input data k are defined by the
following parameters,

$$n = q - 1 \tag{4.64}$$

$$k = n - 2t \tag{4.65}$$

There are n − k parity symbols and the code can correct t symbol errors. q is
generally set to $2^m$, so the code symbols are elements of GF($2^m$).
Let $\alpha$ be a primitive element in GF(q). The generator polynomial g(x) of a
t-symbol-correcting RS code with $\alpha, \alpha^2, \alpha^3, \ldots, \alpha^{2t}$ as its roots can be
expressed by,

$$g(x) = (x - \alpha)(x - \alpha^2) \cdots (x - \alpha^{2t}) = g_0 + g_1 x + g_2 x^2 + \cdots + g_{2t-1} x^{2t-1} + x^{2t} \tag{4.66}$$

where $g_i$ $(0 \leq i < 2t)$ is a symbol from GF(q). The encoding of RS codes can be
realized using an LFSR. The parity check symbols of RS codes are the remainder of
the division of the right-shifted input polynomial $m(x) \cdot x^{2t}$ by the generator
polynomial g(x).
The decoding process of RS codes is similar to that of BCH codes. After the
syndrome calculation, the Berlekamp–Massey algorithm can be used to calculate
the coefficients of the error locator polynomial $\sigma(x)$ and the error magnitude
polynomial $\Omega(x)$. The Chien search algorithm can be used to identify the error
positions and the Forney algorithm can be used to calculate the error values.
The error correction is done by adding the error values to the received codeword.

4.2.9 Hamming Product Codes

Product codes were first presented in 1954 [10]. The concept of product codes is
very simple. Long and powerful block codes can be constructed by serially
concatenating two or more simple component codes [1, 11, 12].

Fig. 4.16 Encoding process of product codes

Figure 4.16 shows the construction process of two-dimensional product codes.
Assume that two component codes $C_1(n_1, k_1, d_1)$ and $C_2(n_2, k_2, d_2)$ are used, where
$n_1$, $k_1$ and $d_1$ are the codeword width, input data width, and minimum Hamming
distance of the code $C_1$, respectively; $n_2$, $k_2$ and $d_2$ are the codeword width, input
data width, and minimum Hamming distance of the code $C_2$, respectively.
The product code $C_p(n_1 \times n_2, k_1 \times k_2, d_1 \times d_2)$ is constructed from $C_1$ and
$C_2$ as follows:
1. Arrange input data in a matrix of k2 rows and k1 columns.
2. Encode the k2 rows using component code C1. The result will be an array
of k2 rows and n1 columns.
3. Encode the n1 columns using component code C2.
Product codes have a larger Hamming distance compared to that of the
component codes. If the component codes $C_1$ and $C_2$ have minimum Hamming
distances $d_1$ and $d_2$ respectively, then the minimum Hamming distance of the
product code $C_p$ is the product $d_1 \times d_2$, which greatly increases the error
correction capability. Product codes can be constructed by a serial concatenation of
simple component codes and a row-column block interleaver, in which the input
sequence is written into the matrix row-wise and read out column-wise. Product codes
can efficiently correct both random and burst errors. For example, if the received
product codeword has errors located in a number of rows not exceeding
$\lfloor (d_2 - 1)/2 \rfloor$ and no errors in other rows, all the errors can be corrected
during column decoding.
The simplest two-dimensional product codes are single-parity check (SPC)
product codes [1]. SPC product codes only guarantee correction of one error.
Product codes whose component codes are Hamming or extended Hamming
codes are known as Hamming product codes.
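
The three construction steps and the single-error correction of an SPC product code can be sketched as follows. This is a minimal model, assuming even parity and a list-of-rows representation; the function names `spc_product_encode` and `spc_product_correct` are hypothetical. The single error is located at the intersection of the failing row and the failing column.

```python
def parity(bits):
    """Even parity of a sequence of bits."""
    return sum(bits) % 2

def spc_product_encode(data):
    """Encode a k2 x k1 data matrix with SPC row and column codes."""
    rows = [row + [parity(row)] for row in data]          # encode the k2 rows
    rows.append([parity(col) for col in zip(*rows)])      # encode the n1 columns
    return rows

def spc_product_correct(block):
    """Correct a single error located by one failing row and one failing column."""
    bad_rows = [i for i, row in enumerate(block) if parity(row)]
    bad_cols = [j for j, col in enumerate(zip(*block)) if parity(col)]
    fixed = [list(row) for row in block]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        fixed[bad_rows[0]][bad_cols[0]] ^= 1
    return fixed
```

Each SPC component code has minimum distance 2, so the product code has minimum distance 2 × 2 = 4, which is consistent with guaranteed correction of one error.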

Fig. 4.17 An example of row and column status vectors after first and second decoding stages [13]

The Hamming product codes can be decoded using two-step row-column (or
column-row) decoding algorithm [1]. Unfortunately, this decoding method fails
to correct certain error patterns (e.g. rectangular four-bit errors). A three-stage
pipelined Hamming product code decoding method is proposed in [13]. Compared
to the two-step row-column decoding method, the three-stage pipelined decoding
method uses a row status vector and a column status vector to record the
behaviors of the row and column decoders. Instead of passing only the coded
data between row and column decoder, these row and column status vectors are
passed between stages to help make decoding decisions [13].
The simplified row and column status vector implementation can be described
as follows:
The ith $(1 \leq i \leq n_2)$ position in the row status vector is set to “1” when there are
detectable errors (regardless of whether the errors can be corrected or not) in the ith
row; otherwise that position is set to “0”.
For the column status vector, there are two separate conditions that can cause
the jth $(1 \leq j \leq n_1)$ position in the column status vector to be set to “1”: (a) when an
error is detectable but not correctable, or (b) when an error is correctable, but the
row where the error occurs has a status value “0”. Otherwise, that position is set to
“0”. Figure 4.17 shows an example of the row and column status vectors after the
first and second stage decoding process. Extended Hamming codes are used as row
and column component codes.

Fig. 4.18 Block diagram of proposed three-stage pipelined decoding algorithm [13]

Figure 4.18 describes the three-stage pipelined Hamming product code decoding
process. After initializing all status vectors to zeros, the steps are described as
follows:
Step 1: Row decoding of the received encoded matrix. If the errors in a row are
correctable, the error bit indicated by the syndrome is flipped. The row status
vector is set to “0” if the syndrome is zero and “1” if the syndrome is nonzero.
Step 2: Column decoding of the updated matrix. The error correction process is
similar to Step 1. The column status vector is calculated using both the column
error vector and the row status vector from Step 1.
Step 3: Row decoding the matrix after changes from Step 2. The syndrome for each
row is recalculated. If any remaining errors in each row are correctable, the row
syndrome will be used to do the correction. If the errors in each row are still
detectable but uncorrectable, the column status vector from Step 2 is used to
indicate which columns need to be corrected.
To implement the three-stage decoding algorithm, the conventional
extended Hamming decoder must be modified. The modified extended Hamming
decoder needs to generate a row/column status value that is later used to
improve the overall error correction capability of the decoding process. The
generation of the row status value in the first row decoding process is simple. A row status
is set to “1” if the syndrome value of this row is nonzero. This can be implemented
using OR gates with all syndrome bits as inputs. Figure 4.19 shows a block diagram
of the modified extended Hamming decoder used in the column decoding process.
A column status is set to “1” if the output of double error detection is “1” or if there is
at least one bit position in which the error vector value is 1 and the row status is “0”.
Figure 4.20 shows a block diagram of the modified extended Hamming decoder
used in the second row decoding process. Unlike the normal extended Hamming
decoder, the error correction is decided by the error vector and also the value of the
column status vector. A bit in the codeword is considered erroneous if the error
vector is “1” in that position or the output of double error detection is “1” and the
column status vector in that position is also “1”.
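
The status rules above can be written as plain Boolean functions. The following is a minimal sketch, not the circuit of [13]: the function names are hypothetical, the inputs are assumed to be lists of 0/1 values already produced by an extended Hamming decoder, and the third-stage rule is read as "error vector bit, OR (double error detected AND column status bit)".

```python
from typing import List


def row_status(syndrome_bits: List[int]) -> int:
    """First row decoding: status is the OR of all syndrome bits."""
    return int(any(syndrome_bits))


def column_status(double_error: int,
                  error_vector: List[int],
                  row_status_vec: List[int]) -> int:
    """Column decoding: status is 1 if a double error is detected, or if the
    decoder points at a bit whose row reported no error (row status 0)."""
    mismatch = any(e == 1 and r == 0
                   for e, r in zip(error_vector, row_status_vec))
    return int(bool(double_error) or mismatch)


def third_stage_flips(double_error: int,
                      error_vector: List[int],
                      col_status_vec: List[int]) -> List[int]:
    """Second row decoding: flip a bit if the error vector marks it, or if the
    row has a detected double error and the column status marks that position
    (assumed reading of the rule in the text)."""
    return [int(e == 1 or (double_error == 1 and c == 1))
            for e, c in zip(error_vector, col_status_vec)]


# A row with a detected double error and column status 1 at positions 1 and 3.
flips = third_stage_flips(1, [0, 0, 0, 0], [0, 1, 0, 1])
```

In this reading, a row that still shows a double error in the third stage is repaired at exactly the positions flagged by the column status vector.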

Fig. 4.19 Implementation of the column decoder [13]

Fig. 4.20 Implementation of the row decoder [13]



Compared to a conventional two-step row-column decoding method, the
decoding method in [13] achieves a better error correction capability. For example,
in the Hamming product code Cp(8 × 8, 4 × 4), the proposed decoding method
can correct 100% of error patterns consisting of five errors or fewer.

References

1. Lin S, Costello DJ (2004) Error control coding, 2nd edn. Prentice Hall, Englewood Cliffs
2. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework.
IEEE Trans Very Large Scale Integr (VLSI) Syst 12:655–667
3. Rossi D, Metra C, Nieuwland KA, Atul K (2005) Exploiting ECC redundancy to minimize
crosstalk impact. IEEE Des Test Comput 22:59–70
4. Zimmer H, Jantsch A (2003) A fault model notation and error-control scheme for switch-to-
switch buses in a network-on-chip. In: Proceedings international conference hardware/software
codesign and system synthesis (CODES-ISSS), pp 188–193
5. Yu Q, Ampadu P (2008) Adaptive error control for NoC switch-to-switch links in a variable
noise environment. In: Proceedings IEEE international symposium on defect and fault
tolerance in VLSI system (DFT), pp 352–360
6. Pei BT, Zukowski C (1992) High speed parallel CRC circuits in VLSI. IEEE Trans Commun
40:653–657
7. Shieh DM, Sheu HM, Chen HC, Lo FH (2001) A systematic approach for parallel CRC
computations. J Inf Sci Eng 17:445–461
8. Burton OH (1971) Inversionless decoding of binary BCH codes. IEEE Trans Inf Theory
17:464–466
9. Reed SI, Solomon G (1960) Polynomial codes over certain finite fields. J Soc Ind Appl Math
8:300–304
10. Elias P (1954) Error-free coding. IEEE Trans Inf Theory 4:29–37
11. Fujiwara E (2006) Code design for dependable systems: theory and practical applications.
Wiley Interscience, Hoboken
12. Pyndiah R (1998) Near-optimum decoding of product codes: block turbo codes. IEEE Trans
Commun 46:1003–1010
13. Fu B, Ampadu P (2009) On Hamming product codes with type-II hybrid ARQ for on-chip
interconnects. IEEE Trans Circuits Syst I, Reg Papers 9:2042–2054
Chapter 5
Energy Efficient Error Control
Implementation

Error control is applied to improve the reliability of on-chip communication.
However, on-chip interconnects still face the challenge of increased energy
consumption. It is important to consider energy efficiency in error control realization.
In this chapter, we introduce design techniques that can efficiently balance the
energy efficiency and reliability of on-chip interconnects.

5.1 Error Control Coding with Low Link Swing Voltage System

A large portion of the total chip power can be consumed by global links [1, 2]. In
[3–6], error control schemes are combined with a low link swing voltage system to
trade off reliability and energy of on-chip interconnects. Figure 5.1 shows a
possible implementation of error control codes with a low link swing voltage
system. Level shifting circuits are needed to switch between different voltages
at the link and receiver. Figure 5.2 shows a simple implementation of these
level shifting circuits [7]. Triple modular redundancy is implemented to protect
the ACK/NACK and control signals against errors.
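
As a small illustration of the triple modular redundancy mentioned above, the sketch below sends a control bit on three wires and recovers it with a majority vote. The function names are hypothetical and the sketch says nothing about how the voter is realized in the hardware of Fig. 5.1.

```python
from typing import Tuple


def tmr_encode(bit: int) -> Tuple[int, int, int]:
    """Replicate one control bit (e.g. ACK/NACK) onto three wires."""
    return (bit, bit, bit)


def tmr_vote(a: int, b: int, c: int) -> int:
    """Majority vote: the output follows at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


# A single corrupted copy does not change the voted value.
sent = tmr_encode(1)
received = (sent[0], 0, sent[2])  # middle wire flipped by noise
recovered = tmr_vote(*received)
```

The voter tolerates any single wire error per triple; two simultaneous errors on the same triple would still be voted the wrong way.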
In the method of combining error control coding with a low link swing voltage
system, the energy consumption is reduced because the error control codes allow
the system to meet the same communication reliability using a lower link swing
voltage than the uncoded system. For a given reliability requirement Preq, the
uncoded system and the system with error control coding meet the required
reliability at raw bit error probabilities εunc and εecc, respectively. From Chap. 1, the raw bit
error probabilities εunc and εecc are functions of the link swing voltage Vswing and the

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links, 79


DOI 10.1007/978-1-4419-9313-7_5, # Springer Science+Business Media, LLC 2012

Fig. 5.1 Integration of error control codes with low link swing voltage system

Fig. 5.2 Hardware implementation of level shifter circuits

standard noise voltage deviation σN. If the standard noise voltage deviation σN is the
same, the link swing voltages of the coded and uncoded systems have the following
relation [3],

$$ V_{swing}^{ecc} = V_{swing}^{unc} \times \frac{Q^{-1}(\varepsilon_{ecc})}{Q^{-1}(\varepsilon_{unc})} \tag{5.1} $$

where Q^{-1}(ε) is the inverse Q function.
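
Relation (5.1) can be evaluated numerically. The following is a minimal sketch using the Python standard library; the function names and the example error probabilities (1e-20 for the uncoded link, 1e-10 for the coded link) are illustrative assumptions, not values from [3].

```python
from statistics import NormalDist


def q_inv(p: float) -> float:
    """Inverse of Q(x) = P(Z > x) for a standard normal variable Z.
    Uses Q^{-1}(p) = -Phi^{-1}(p), which stays accurate for very small p."""
    return -NormalDist().inv_cdf(p)


def coded_swing(v_unc: float, eps_unc: float, eps_ecc: float) -> float:
    """Link swing voltage of the coded system according to (5.1)."""
    return v_unc * q_inv(eps_ecc) / q_inv(eps_unc)


# Hypothetical example: the code tolerates a raw bit error probability of
# 1e-10 where the uncoded link would need 1e-20 for the same reliability.
v_ecc = coded_swing(v_unc=1.0, eps_unc=1e-20, eps_ecc=1e-10)
```

Because the coded system tolerates a larger raw bit error probability, Q^{-1}(εecc) is smaller than Q^{-1}(εunc) and the resulting swing is below the uncoded swing.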


Figure 5.3 compares the minimum link swing voltage of different coding
schemes to that of an uncoded link. In the error detection (ED) scheme, Hamming codes
are used to detect errors. Figure 5.3 shows that error control coding greatly reduces
the link swing voltage for the same reliability requirements. The uncoded link has
to use the highest link swing voltage.
Figure 5.4 shows the energy consumption of different coding schemes with
different link lengths. The energy consumption includes the codec energy and link
energy. The link is modeled for a 0.18 μm technology. The same reliability is required
for all the schemes. Figure 5.4a shows that the use of an error control scheme can
efficiently reduce the energy consumption for long links. In this case, the link energy
dominates the total energy consumption. The reduction of link swing voltage by using

Fig. 5.3 Minimum link swing voltage needed by each coding scheme to meet a predefined
communication reliability requirement [4]

error control coding can result in a total energy reduction. For short links, the
codec energy overhead is comparable to the link energy and reduces the benefits of
using error control coding. In Fig. 5.4b, we can see that the error control scheme still
achieves energy benefits for high reliability requirements. As the technology scales, the
codec energy consumption becomes smaller compared to the link energy consumption.
The combination of error control coding with a low link swing voltage system is therefore a
promising approach to balance reliability and energy consumption.

5.2 Error Control Coding with Dynamic Voltage Swing Scaling System

In [8], a self-calibrating transmission scheme for on-chip interconnect is proposed.
In this method, the dynamic voltage swing scaling (DVSS) technique is combined
with error control codes to trade off energy, throughput and reliability of on-chip
interconnects. The principle of this method is to run the system with a more
aggressive voltage scaling scheme, which takes advantage of the use of error
control codes.
Figure 5.5 shows a possible architecture for the self-calibrating transmission
system. In this architecture, the data transmission is pipelined into three stages –
encoder, synchronization and decoder. The input data is first encoded using error
control codes and then transmitted together with parity check bits. An operating
point controller can dynamically adjust the data transmission frequency and the

Fig. 5.4 Energy consumption of different coding schemes (a) Link lengths are several centimeters
(b) link lengths are several millimeters [4]

Fig. 5.5 The self-calibrating transmission system (a) The concept (b) A possible implementation [8]

link swing voltage according to traffic information and error detection rate. Error
control codes are used to detect on-chip communication errors. If errors are
detected, a go-back-N ARQ scheme is applied to ensure correct data
transmission. The error control decoder also sends the error information to the
operating point controller, which uses this feedback information to select a proper
link swing voltage. The self-calibrating transmission scheme reduces the energy
consumption by providing the minimum link swing voltage at which the on-chip
communication can achieve the targeted bandwidth for a given data transmission
error rate.
The self-calibrating transmission scheme uses error detection and ARQ to
guarantee reliable communication. It is important to select a proper error coding
method, which can efficiently detect both logic and timing errors. Figure 5.6 shows
the error detection method used. The error detection method is separated into two
parts. A CRC-8 code with generator polynomial x^8 + x^2 + x + 1 is used to detect
logic errors. CRC codes alone are not efficient at detecting timing errors. For example,
if the clock cycle time is much less than the data transition time on the link, the data

Fig. 5.6 Error detection method used in self-calibrating transmission system [8]

latched in two successive clock cycles may have the same value. These data are
both valid codewords, so the CRC code cannot recognize that the second copy is
incorrect. An additional bit, which alternates between 0 and 1 every clock
cycle, is used to detect this case. This additional bit ensures that the encoded data at
any two successive clock cycles are always different. The parity bits of the CRC-8 code
are transmitted with the data. The additional bit is not transmitted. It is generated
independently at the transmitter and receiver sides, as shown in Fig. 5.6.
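
The two-part detection method can be illustrated in software. The sketch below is an assumption-laden model, not the circuit of [8]: it computes a bitwise CRC-8 with generator x^8 + x^2 + x + 1 (0x07, zero initial value) and folds the locally generated alternating bit into the CRC input as one extra byte. The function names and the way the alternating bit is appended are hypothetical.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, generator x^8 + x^2 + x + 1, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def parity_for(data: bytes, toggle: int) -> int:
    """CRC over the data plus the alternating bit. Only the data and the
    CRC parity are sent on the link; the toggle is generated locally."""
    return crc8(data + bytes([toggle & 1]))


def receiver_check(data: bytes, parity: int, local_toggle: int) -> bool:
    """The receiver recomputes the CRC with its own copy of the toggle."""
    return parity_for(data, local_toggle) == parity


word = bytes([0x12, 0x34, 0x56, 0x78])
parity = parity_for(word, toggle=0)            # transmitter, cycle 0
ok_first = receiver_check(word, parity, 0)     # receiver, cycle 0
ok_stale = receiver_check(word, parity, 1)     # same wires latched again, cycle 1
```

In this model a word that is latched twice passes the check in the first cycle and fails in the second, because the receiver's toggle has advanced while the stale parity has not.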
Figure 5.7 shows the control algorithm used to select the best frequency and link
swing voltage point in self-calibrating transmission scheme. The controller
performs three tasks independently. First, the controller needs to find the lowest
swing voltage for a given frequency. The controller performs this by monitoring the
error rate of data transmission. For a given frequency, a swing voltage is selected
and applied to the system. If the system can operate correctly for a period of time
(e.g. 500 or 1,000 clock cycles), then the controller attempts to reduce the swing
voltage. If the error rate at the lower swing voltage is larger than a threshold value,
the system will return to more conservative mode by increasing the swing voltage.
The controller continues to do this until the lowest swing voltage, which can satisfy
both the frequency and reliability requirements, is found. The same process will be
performed for each possible frequency, and the controller records these best voltage
and frequency operating points. The second task of the controller is to choose a
proper frequency based on the delay constraint and buffer fill level. Once the proper
frequency is decided, the last task of the controller is to select the lowest swing
voltage according to the frequency. The control algorithm in Fig. 5.7 can minimize
the energy consumption while meeting the performance and reliability requirements
at the same time.
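
The three controller tasks can be sketched as follows. This is a simplified software model under stated assumptions, not the algorithm of Fig. 5.7: `error_rate` is a hypothetical callback standing in for an observation window of several hundred clock cycles, voltages are integers in millivolts, frequencies are integers in MHz, and the toy link model at the bottom is invented for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple


def lowest_swing(freq: int, voltages: Sequence[int],
                 error_rate: Callable[[int, int], float],
                 threshold: float) -> int:
    """Task 1: step the swing down from the highest level and keep the last
    level whose observed error rate stays within the threshold."""
    levels = sorted(voltages, reverse=True)
    best = levels[0]  # most conservative level is the fallback
    for v in levels:
        if error_rate(freq, v) > threshold:
            break     # too aggressive: return to the previous level
        best = v
    return best


def calibrate(freqs: Sequence[int], voltages: Sequence[int],
              error_rate: Callable[[int, int], float],
              threshold: float) -> Dict[int, int]:
    """Record the best (frequency, swing) operating point for every frequency."""
    return {f: lowest_swing(f, voltages, error_rate, threshold) for f in freqs}


def pick_operating_point(table: Dict[int, int],
                         required_freq: int) -> Tuple[int, int]:
    """Tasks 2 and 3: choose the slowest frequency that meets the demand,
    then look up its recorded lowest swing."""
    f = min(x for x in table if x >= required_freq)
    return f, table[f]


def toy_error_rate(freq_mhz: int, swing_mv: int) -> float:
    """Invented link model: faster links need a larger swing to be error free."""
    needed_mv = 550 + (freq_mhz * 6) // 10
    return 0.0 if swing_mv >= needed_mv else 1e-3


table = calibrate([50, 500, 1000], range(600, 1300, 100), toy_error_rate, 1e-6)
point = pick_operating_point(table, required_freq=300)
```

In a real controller the required frequency would come from the delay constraint and buffer fill level, which are abstracted here into the single `required_freq` argument.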
Figure 5.8 illustrates how the self-calibrating transmission scheme performs for
a realistic time-varying MPEG based workload. In this example, the self-calibrating
scheme can transmit the data with a frequency range from 50 MHz to 1 GHz by

Fig. 5.7 Control algorithm used to select the best frequency and link swing voltage point in self-
calibrating transmission scheme [8]

adjusting the link swing voltage from 0.6 to 1.2 V. By choosing the proper
frequency based on the workload, the self-calibrating system sends each MPEG
frame just below the delay constraint, shown as the dotted line in the bottom figure.
The classic system, in which no adaptive design is applied, needs to operate at a
higher frequency to meet the safety margin requirements for the worst-case
workload condition. The lower data transmission frequency of the self-calibrating scheme
results in a lower link swing voltage, which can greatly reduce the link energy
consumption.
Figure 5.9 shows the energy consumption of each component of the self-calibrating
transmission system, except the voltage converter, in a 90 nm CMOS technology.
The link length is assumed to be 1 cm. The results show that the energy overhead of the
operating point controller and the synchronizing registers accounts for 13% of total
energy consumption. The link energy consumption is the largest portion of the total

Fig. 5.8 An example to transmit a realistic time-varying MPEG based workload [8]

Fig. 5.9 Energy consumption of each component of the self-calibrating transmission system [8]

system energy consumption. By reducing the link energy consumption, the
self-calibrating transmission system can achieve a 42% energy reduction compared to the
classic system when a total of 400 MPEG frames are transmitted and each frame consists
of several kilobytes of data.

5.3 Product Codes with Type-II ARQ

5.3.1 The Principle

In Chap. 4, we introduced product codes, which provide good error
correction capability for both random and burst errors. However, the direct use of
product codes requires a large number of redundancy bits, resulting in low code rates.
The code rate of the direct use of a product code Cp = C1 × C2 can be described by,

$$ R_{direct} = \frac{K}{N} = \frac{k_2 \times k_1}{n_2 \times n_1} \tag{5.2} $$

where k1 and k2 are the numbers of columns and rows in the input data, respectively, and n1
and n2 are the row and column codeword lengths, respectively. For 64-bit input data,
Table 5.1 summarizes the Rdirect values for different component codes and k2 values.
The values show that the large number of redundancy bits in the product codes
results in low code rates, increasing link energy consumption.
In order to improve the code rate and the energy efficiency, a method
combining product codes with type-II HARQ is proposed in [9, 10]. In type-II
HARQ [11], redundancy bits are incrementally transmitted if they are requested. In
the method of combining product codes with type-II HARQ, the original input data
is encoded and transmitted with its row parity check bits first. If the errors in the
receiver are detectable but not correctable, a transmission of the column parity
check bits is requested. The effective code rate Reffective of this method can be
described by,

$$ R_{effective} = \frac{K}{k_2 \times n_1 + P_{d\_uc} \times [(n_2 - k_2) \times n_1]} \approx \frac{K}{k_2 \times n_1} \tag{5.3} $$

where Pd_uc is the probability of a detectable but uncorrectable error in the first
transmission. The error probability is usually on the order of 10^-9 to 10^-12 errors/
bit [12]; thus, the second term in the denominator is negligible in most cases.
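
The two code rates can be compared numerically. The sketch below assumes extended Hamming component codes whose number of parity bits is the smallest r with 2^r ≥ k + r + 1, plus one overall parity bit; the helper names and the value Pd_uc = 1e-9 are illustrative assumptions.

```python
from math import ceil
from typing import Tuple


def hamming_parity_bits(k: int) -> int:
    """Smallest r such that a Hamming code with r parity bits covers k data bits."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r


def eh_parity_bits(k: int) -> int:
    """Extended Hamming code: Hamming parity bits plus one overall parity bit."""
    return hamming_parity_bits(k) + 1


def code_rates(K: int, k2: int, p_d_uc: float) -> Tuple[float, float]:
    """Return (R_direct, R_effective) following (5.2) and (5.3)."""
    k1 = ceil(K / k2)
    n1 = k1 + eh_parity_bits(k1)   # row codeword length
    n2 = k2 + eh_parity_bits(k2)   # column codeword length
    r_direct = K / (n2 * n1)
    r_effective = K / (k2 * n1 + p_d_uc * (n2 - k2) * n1)
    return r_direct, r_effective


# K = 64 bits arranged in k2 = 4 rows: EH(22,16) rows and EH(8,4) columns.
r_direct, r_effective = code_rates(K=64, k2=4, p_d_uc=1e-9)
```

For this configuration the direct product code has N = 176 bits, matching the extended Hamming entry of Table 5.1 for k2 = 4, while the effective rate is essentially 64/88.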
The code rates of a direct product code implementation and the product code with
type-II HARQ method are compared in Fig. 5.10 for different input data widths K and
different numbers of data rows k2. The results show that the combination of product
code with type-II HARQ can greatly improve the effective code rate by sending the
column parity check bits and checks-on-checks only when necessary.
Table 5.1 Rdirect for different component codes and k2 values at K = 64 bits

Component codes   Parity check        Hamming             Extended Hamming
k2                2     4     8       2     4     8       2     4     8
N                 99    85    81      190   147   144     234   176   169
Rdirect           0.65  0.75  0.80    0.34  0.44  0.45    0.27  0.36  0.38

Fig. 5.10 Comparison of code rate Reffective and Rdirect for different input data width K and the
number of rows k2 in the message of product codes

Fig. 5.11 Encoding process of combining product codes with type-II HARQ

Figure 5.11 shows the encoding process of the method combining product codes
with type-II HARQ. The K-bit input data is first separated into k2 rows of
k1 bits each. Multiple row encoders are used to minimize the encoding latency.
Each row is encoded using a component code C1(n1, k1). All k2 × n1 outputs of the

Fig. 5.12 An example of combining four extended Hamming EH(22,16) encoders with row-column interleaver

row encoders are fed into a row-column block interleaver. The mapping relation of
the (nr × nc)-bit row-column interleaver can be described by,

$$ i_{output} = n_r \times \mathrm{mod}(i_{input}, n_c) + \left\lfloor \frac{i_{input}}{n_c} \right\rfloor \tag{5.4} $$

where 0 ≤ iinput, ioutput ≤ (nr × nc) − 1 and nr and nc are the number of rows and
columns of the row-column interleaver, respectively.
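
Mapping (5.4) is easy to check in software. The sketch below is a direct transcription with hypothetical function names; it shows that writing a sequence row-wise and applying (5.4) yields the column-wise read-out order.

```python
from typing import List, Sequence


def interleave_index(i_input: int, n_r: int, n_c: int) -> int:
    """Output position of input bit i_input according to (5.4)."""
    return n_r * (i_input % n_c) + i_input // n_c


def interleave(bits: Sequence[int], n_r: int, n_c: int) -> List[int]:
    """Apply the row-column interleaver to a row-wise ordered sequence."""
    out = [0] * (n_r * n_c)
    for i, b in enumerate(bits):
        out[interleave_index(i, n_r, n_c)] = b
    return out


# A 2 x 3 matrix [[0, 1, 2], [3, 4, 5]] written row-wise, read column-wise.
small = interleave([0, 1, 2, 3, 4, 5], n_r=2, n_c=3)

# Index map for four EH(22,16) row codewords (88 bits in total).
mapping = [interleave_index(i, 4, 22) for i in range(88)]
```

For the 4 × 22 case, the first bit of the second row (input index 22) lands at output index 1, directly after the first bit of the first row, so bits that are adjacent on the link belong to different row codewords.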
Figure 5.12 demonstrates an example of the row-column interleaver mapping
relation for 64-bit input data. The 64-bit input data is separated into four identical
rows. Each row is encoded with an extended Hamming EH(22, 16) code and the
outputs of these encoders are interleaved.
The interleaved row encoder outputs are both transmitted to the receiver and fed
into n1 column encoders. The column parity check bits of the n1 column encoders are
saved into a buffer. The total of (n2 − k2) × n1 additional parity check bits are kept in
the buffer until an acknowledgement/negative acknowledgement (ACK/NACK)
signal indicating the status of the previously transmitted data is received. If a
NACK signal is received, the stored parity check bits are sent to the receiver. The
flow chart of the encoding process is shown in Fig. 5.13.

Fig. 5.13 Flow chart of the encoding process [9]

Figure 5.14 shows the decoding process of the combination of product codes with
type-II HARQ. The received data is deinterleaved and then decoded using row
decoders. If the number of errors is within the error correction capability of the row
decoder, the errors are corrected. If the errors are detectable but not correctable, the
row decoded data and row parity check bits are saved into a buffer and the receiver
instructs the transmitter to send the column parity check bits and checks-on-checks
which are formed based on the original data. When the additional parity check bits are
received, they are used with the row decoded data and row parity check bits, which
have been stored in the decoder buffer, to complete the product code decoding
process. The flow chart of the proposed decoding process is shown in Fig. 5.15.
The transmission and retransmission process is shown in Fig. 5.16. In order to
simplify the hardware implementation, only one retransmission is allowed. The
buffer depth in the transmitter and receiver is determined by the round trip delay of
the transmission. The reliability of the proposed method depends on two factors:
the error detection capability in the first transmission and the error correction
capability when the full product code decoding process is applied. The reliability in
terms of residual flit error rate can be estimated by,

$$ P_{residual} = P_{ud} + P(e_{decoding}, e_{detect}) \tag{5.5} $$

where Pud is the undetectable error probability in the first transmission and
P(edecoding, edetect) is the error probability after the full product code decoding
process is performed.

Fig. 5.14 Decoding process of combining product codes with type-II HARQ

Fig. 5.15 Flow chart of proposed decoding process [9]



Fig. 5.16 Transmission and retransmission procedure

5.3.2 Extended Hamming Product Codes with Type-II HARQ

To balance complexity and error correction capability, an error control method
combining extended Hamming product codes with type-II HARQ is introduced
in [10]. The encoding process of the combination of extended Hamming product
codes with type-II HARQ is simple. The K-bit input data is arranged into a matrix with
k2 rows and k1 columns. Each row is encoded using an extended Hamming code
EH(n1, k1) and each column is encoded using an extended Hamming code EH(n2, k2).
The K-bit input data and row parity check bits are transmitted and the column parity
check bits are saved into an encoder buffer. The saved column parity check bits will
be transmitted once a NACK is received.
In the decoding process of extended Hamming product codes with type-II
HARQ, the received data is first decoded row by row using multiple extended
Hamming decoders. Extended Hamming codes can correct single errors and detect
double errors in each row. If all errors are correctable (no more than one error in
each row), the receiver indicates a successful transmission by sending back an ACK
signal to transmitter. If the receiver detects two errors in any row, it saves the row
decoded data and row parity check bits in the decoding buffer and requests a
transmission of column parity check bits and checks-on-checks by sending back a
NACK signal. When the extra parity check bits are received, they are used with the
saved data and row parity check bits to complete the column decoding process and
the second row decoding process in the three-stage pipelined decoding method
introduced in Chap. 4. Figure 5.17 shows the decoding process when the three-
stage decoding method is used. The first row decoding is always performed. The
column decoding in the second stage and the row decoding in the third stage are
performed only when a retransmission of column parity check bits and
checks-on-checks is requested.

Fig. 5.17 Implementation of three-stage pipelined decoding algorithm in the case of combining
extended Hamming product codes with type-II HARQ

Fig. 5.18 An example of decoding process for extended Hamming codes with type-II HARQ

Figure 5.18 shows an example of the decoding process when extended Hamming
product codes with type-II HARQ are applied. A rectangular four-error pattern
occurs in the transmission of the original data and row parity check bits. The
extended Hamming decoder detects these errors during the first row decoding
process and a transmission of column parity check bits and checks-on-checks is
requested.

Fig. 5.19 Number of redundancy bits for extended Hamming code with different input data widths

A single error occurs during retransmission of column parity check bits and
checks-on-checks. It can be directly corrected before these extra parity check bits
are combined with the saved data and row parity check bits to complete the three-
stage pipelined decoding process. In Step 2, because double errors are detectable
but uncorrectable, no correction is performed and “1”s are recorded in the
corresponding column status positions. In the second row decoding process (Step 3), the
extended Hamming decoder still detects two errors in a row, so the column status
vector is used to indicate which positions need to be flipped.
In the combination of extended Hamming product codes with type-II HARQ, the
K-bit input data is arranged as a matrix of k2 rows with length k1 = ⌈K/k2⌉.
Each row or column is encoded using an extended Hamming code EH(n1,k1) or
EH(n2,k2). The required interconnect width WL1 in the first transmission (original
data and row parity check bits) can be described by,

$$ W_{L1} = K + k_2 \times r_{eh}(\lceil K/k_2 \rceil) \tag{5.6} $$

where reh(⌈K/k2⌉) is the number of parity check bits added by the extended
Hamming code for the ⌈K/k2⌉-bit row input. The relationship between the number
of redundancy bits and the number of input data bits is shown in Fig. 5.19.
If a NACK is received, a retransmission is requested. The retransmission
includes parity check bits for n1 = ⌈K/k2⌉ + reh(⌈K/k2⌉) columns and requires
interconnect width WL2,

$$ W_{L2} = (\lceil K/k_2 \rceil + r_{eh}(\lceil K/k_2 \rceil)) \times r_{eh}(k_2) \tag{5.7} $$

where reh(k2) is the number of parity check bits added by the extended Hamming
code for the k2-bit column input.

Table 5.2 Required number of wires in the link for different input data widths K and row numbers k2 [10]

K     Width  k2=1  k2=2  k2=3  k2=4  k2=5  k2=6  k2=7  k2=8  k2=9
32    WL1    39    44    47    52    57    62    67    64    68
      WL2    156   88    64    52    60    55    50    40    40
      WL     156   88    64    52    60    62    67    64    68
64    WL1    72    78    82    88    94    94    99    104   109
      WL2    288   156   112   88    95    80    75    65    65
      WL     288   156   112   88    95    94    99    104   109
128   WL1    137   144   149   156   158   164   170   176   182
      WL2    548   288   200   156   160   140   125   110   105
      WL     548   288   200   156   160   164   170   176   182

The required link width WL to successfully complete two transmissions is the
maximum value of WL1 and WL2.

$$ W_L = \max(W_{L1}, W_{L2}) \tag{5.8} $$

From (5.6)–(5.8), the required link width is a function of the row number k2 for a
given input message width K. Table 5.2 shows the number of wires in the link for
different input data widths K and numbers of rows k2 in the message. It can be
seen that WL changes greatly for different k2 values. Because on-chip interconnects
can consume a large proportion of the total energy in nanoscale technology,
reducing the number of wires in the link can improve both energy efficiency and
wire area footprint. To achieve the minimum number of wires, WL1 and WL2 should
be balanced. The minimum number of wires WL is achieved when the difference
between WL1 and WL2 has the smallest value. For the link widths examined in
Table 5.2, a minimum value is achieved when k2 is equal to four. In this case
(k2 = 4), WL1 and WL2 have the same values and an extended Hamming code EH
(8,4) is used for column encoding.
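
The link widths in (5.6)–(5.8) can be recomputed with a few lines of code. The sketch below assumes that reh(k) is the smallest Hamming parity count r with 2^r ≥ k + r + 1, plus one overall parity bit; the function names are hypothetical. Under this assumption the k2 = 2 to 9 columns of Table 5.2 are reproduced, while the k2 = 1 column is not covered by this simple bound and is left out.

```python
from math import ceil
from typing import Tuple


def eh_parity_bits(k: int) -> int:
    """Parity bits of an extended Hamming code for k data bits (assumed bound)."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r + 1


def link_widths(K: int, k2: int) -> Tuple[int, int, int]:
    """Return (WL1, WL2, WL) following (5.6), (5.7) and (5.8)."""
    k1 = ceil(K / k2)
    wl1 = K + k2 * eh_parity_bits(k1)
    wl2 = (k1 + eh_parity_bits(k1)) * eh_parity_bits(k2)
    return wl1, wl2, max(wl1, wl2)


def best_row_count(K: int, candidates=range(2, 10)) -> int:
    """Row number k2 that minimizes the required link width WL."""
    return min(candidates, key=lambda k2: link_widths(K, k2)[2])


widths_64_4 = link_widths(64, 4)
best = {K: best_row_count(K) for K in (32, 64, 128)}
```

For all three data widths the search returns k2 = 4, the point where WL1 and WL2 coincide.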

5.3.3 Performance Evaluation

The combination of extended Hamming product codes with type-II HARQ is compared to
different coding solutions: FEC schemes using a Hamming code and a three-bit
error correction BCH code, ARQ, and HARQ using an extended Hamming code.
Standard CRC-5 with generator polynomial x^5 + x^2 + 1 is used for the ARQ scheme.
The input data width is 64 bits. The number of wires used to transmit encoded
information in these error control schemes is 88, 71, 85, 69 and 72, respectively.
In order to improve the throughput, the go-back-N retransmission approach [11] is
applied to the ARQ and HARQ schemes. For implementation simplicity, only one

Fig. 5.20 Residual flit error rate for different error control schemes as a function of noise voltage deviation at (a) Pn = 10^-2 and (b) Pn = 1 [10]

retransmission is allowed in the combination of extended Hamming product codes
with type-II HARQ. Thus, the ACK/NACK signal depends not only on the double
error detection from each row decoder, but also on whether the input data
is the first transmitted or the retransmitted flit. When the input data is the
retransmitted flit, an ACK signal (the value is 0) is always sent back to the
transmitter.

5.3.3.1 Reliability

The residual flit error rate Presidual is used to measure the system reliability.
Figure 5.20 shows the residual flit error rate of different error control schemes as
a function of noise voltage deviation. The dependent error model introduced in Chap. 1

is used in the simulation with two coupling probability values, Pn = 10^-2 and
Pn = 1. A link swing voltage of 1.0 V is used. The simulation results show that
Hamming product codes with type-II HARQ achieve a significant reduction in
residual flit error rate when multiple random and burst errors are considered. ARQ
CRC-5 has a good burst error detection capability but is inefficient at detecting
multiple random errors. The HARQ EH(72,64) scheme can correct single errors and
detect double errors, but as the burst error probability increases, its performance
decreases. Compared to the BCH(85,64) code, extended Hamming product codes
with type-II HARQ can effectively correct multiple random and burst errors, while
the BCH code is only good at correcting multiple random errors. The combination of
extended Hamming product codes with type-II HARQ can correct at least two
permanent errors, while ARQ CRC-5 will not work in this persistent noise
environment.

5.3.3.2 Throughput

Another main concern in on-chip communication is the throughput. In the
simulations, the go-back-N retransmission policy was applied to improve the
throughput of the ARQ and HARQ schemes. The average number of transmissions needed to
successfully send a flit in go-back-N ARQ is given in Chap. 2.
When the go-back-N retransmission policy is applied to the HARQ scheme, the average
number of transmissions needed to successfully send a flit can be described using
(5.9) by modifying (2.2),

$$
\begin{aligned}
N_{HARQ} &= 1 \times (P_{ne} + P_{ud} + P_{d\_c}) + (N + 1) \times (P_{ne} + P_{ud} + P_{d\_c}) \times P_{d\_uc} \\
&\quad + (2N + 1) \times (P_{ne} + P_{ud} + P_{d\_c}) \times P_{d\_uc}^{2} + \cdots \\
&\quad + (lN + 1) \times (P_{ne} + P_{ud} + P_{d\_c}) \times P_{d\_uc}^{l} + \cdots \\
&= 1 + \frac{N P_{d\_uc}}{1 - P_{d\_uc}}
\end{aligned}
\tag{5.9}
$$

where Pne is the probability of no errors, Pud is the undetectable error probability,
Pd_c is the probability of a correctable error, and Pd_uc is the probability that the errors are
detectable but uncorrectable. Pne + Pud + Pd_c + Pd_uc is equal to 1. Because Pd_uc
is smaller than Pd, the average number of transmissions needed to successfully send
a flit in HARQ is less than that in ARQ. In the method combining product codes
with type-II HARQ, the number of retransmissions is limited to one. The average
number of transmissions needed to successfully send a flit in the proposed method
can be described by,

$$ N_{prop} = 1 \times (P_{ne} + P_{ud} + P_{d\_c}) + 2 \times P_{d\_uc} \tag{5.10} $$
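
The closed form in (5.9) can be checked against a truncated version of the series, using Pne + Pud + Pd_c = 1 − Pd_uc. The sketch below uses hypothetical function names and illustrative values N = 4 and Pd_uc = 0.1.

```python
def n_harq_closed(N: int, p_d_uc: float) -> float:
    """Closed form of (5.9): 1 + N * P_d_uc / (1 - P_d_uc)."""
    return 1 + N * p_d_uc / (1 - p_d_uc)


def n_harq_series(N: int, p_d_uc: float, terms: int = 200) -> float:
    """Truncated series of (5.9); a flit accepted after l failed attempts
    has cost l*N + 1 transmissions under go-back-N."""
    p_accept = 1 - p_d_uc  # P_ne + P_ud + P_d_c
    return sum((l * N + 1) * p_accept * p_d_uc ** l for l in range(terms))


def n_prop(p_d_uc: float) -> float:
    """(5.10): at most one retransmission is allowed."""
    return 1 * (1 - p_d_uc) + 2 * p_d_uc


closed = n_harq_closed(N=4, p_d_uc=0.1)
series = n_harq_series(N=4, p_d_uc=0.1)
proposed = n_prop(0.1)
```

With these example values the proposed scheme needs fewer transmissions on average than go-back-N HARQ, since each failure costs one extra transmission instead of N.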

The throughput of the different error control schemes is compared in Fig. 5.21.
The throughput was normalized to the throughput in the case of no errors occurring.

Fig. 5.21 Throughput of different error control schemes with varying standard noise voltage deviation

The throughput comparison does not include the H(71,64) and BCH(85,64)
schemes, because no retransmission is needed in these schemes and their
normalized throughput is always equal to 1. As shown in Fig. 5.21, ARQ,
HARQ and the proposed method achieve nearly the same throughput in low noise
environments (small σN). As σN increases, the ARQ scheme achieves the lowest
throughput, because retransmission is the only way for it to correct errors. The
overhead for retransmission increases as the noise environment becomes worse.
Compared to the HARQ EH(72,64) scheme, the combination of extended Hamming product codes
with type-II HARQ achieves better throughput because more errors can be
corrected during the first transmission. The combination of extended Hamming
product codes with type-II HARQ can achieve 45% and 10% improvement in the
throughput under high noise conditions compared to the ARQ and HARQ schemes,
respectively.

5.3.3.3 Energy Consumption

The average energy per flit is used as the metric to measure energy consumption.
The average energy consumption includes the encoder energy Ee1, the link energy
El1, and the decoder energy Ed1 in the first transmission, as well as the encoder
energy Ee2, the link energy El2, and the decoder energy Ed2 in the retransmission,

$$
E_{avg} = (E_{e1} + E_{l1} + E_{d1}) + P_{d\_uc} (E_{e2} + E_{l2} + E_{d2}) \tag{5.11}
$$

where Pd_uc is the probability that the errors are detectable but uncorrectable. The link
energy using low swing voltage can be estimated as,

$$
E_{l} \approx S_{f} \times W_{L} \times C_{L} \times V_{DD} \times V_{swing} + E_{level} \tag{5.12}
$$

where CL is the interconnect capacitance. WL is the number of wires in the link, which
depends on the error control scheme. In the combination of extended Hamming product
codes with type-II HARQ, WL is greatly affected by the selection of k2. Sf is the wire
switching probability. VDD is the supply voltage. The link swing voltage Vswing is decided by
the reliability requirement. Elevel is the energy consumption of the level translation
circuit when low swing voltage is applied.

Fig. 5.22 Required link swing voltages of different error control schemes for given reliability requirements [10]
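Equations (5.11) and (5.12) can be evaluated directly once the component energies are known. The sketch below is a minimal Python illustration; the function names and all numerical values are hypothetical placeholders and are not taken from the simulations reported here.

```python
def link_energy(s_f, w_l, c_l, v_dd, v_swing, e_level=0.0):
    """Link energy following (5.12): Sf * WL * CL * VDD * Vswing + Elevel."""
    return s_f * w_l * c_l * v_dd * v_swing + e_level


def average_energy(first, retrans, p_d_uc):
    """Average energy per flit following (5.11).

    first and retrans are (encoder, link, decoder) energy tuples for the
    first transmission and the retransmission, respectively.
    """
    return sum(first) + p_d_uc * sum(retrans)


if __name__ == "__main__":
    # Hypothetical values in arbitrary units, for illustration only.
    e_link_full = link_energy(s_f=0.5, w_l=72, c_l=1.0, v_dd=1.0, v_swing=1.0)
    e_link_low = link_energy(s_f=0.5, w_l=72, c_l=1.0, v_dd=1.0, v_swing=0.6,
                             e_level=2.0)
    print(e_link_full, e_link_low)
    print(average_energy((1.0, e_link_low, 1.5), (1.0, e_link_low, 1.5), 0.01))
```

Because the first term of (5.12) is linear in Vswing, a lower swing voltage can compensate for a larger wire count WL as long as the level translation energy Elevel remains small.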

Figure 5.22 compares the link swing voltage of different error control schemes
for the same residual flit error rate requirement. σN is assumed to be 0.1 V, and the
coupling probability Pn is 10^-1. The results show that the stronger the error
correction capability of an error control scheme, the lower the swing voltage needed
for the interconnect links. To achieve the same residual flit error rate, the combination
of extended Hamming product codes with type-II HARQ requires the smallest link
swing voltage, about 60% and 80% of that required by H(71,64)
and ARQ CRC-5, respectively. The lower link swing voltage allows this method to
consume less link energy.
Figure 5.23 compares the link energy consumption of different error control
schemes given the same residual flit error rate requirement. In the simulation, the
required residual flit error rate is assumed to be Presidual ≤ 10^-20 [13]. The
simulation was performed for a noise environment of σN = 0.07 V. Different
technology nodes are considered using the Predictive Technology
Model (PTM) CMOS 65 and 45 nm technologies [14]. The effect of different link
lengths on energy consumption is also evaluated. In NoCs, the link length is the
distance between two routers, which is decided by the dimension of the tile block. In
mesh or torus topologies, the links between two routers are generally wires a few
millimeters long [15–17]. In the experiments, link lengths from 1 to 3 mm
are examined. The link energy is measured in Cadence Spectre. The input data is
generated using an H.264 video encoder with an average switching factor of about 0.5.

Fig. 5.23 Link energy consumption of different error control schemes for different link lengths (a) 45 nm technology (b) 65 nm technology [10]
Figure 5.23 shows that the combination of extended Hamming product codes
with type-II HARQ has the smallest link energy of the compared schemes, because
its lower link swing voltage more than offsets the larger number of wires in the link.
As link length increases, Hamming product codes with type-II HARQ benefit
more from the lower link swing voltage. The link energy consumption of Hamming
product codes with type-II HARQ is about 80% and 35% of the link
energy consumption of ARQ CRC-5 and H(71,64), respectively.

Fig. 5.24 Energy comparison of different error control schemes at residual flit error rate 10^-20 (a) Link length 1 mm (b) Link length 3 mm [10]

Figure 5.24 compares the average energy consumption per flit for different error
control schemes at the same reliability requirement (10^-20) [13]. The average energy
includes encoder, decoder, and link energy consumption. Two noise voltage
deviations, σN = 0.07 V and σN = 0.1 V, are considered. The raw bit error probability
ε is about 10^-12 and 10^-6 for these two cases, respectively. The results show that ARQ
CRC-5 achieves the lowest average energy consumption in the low noise environment
(σN = 0.07 V) for small link lengths, because of its smaller codec energy and link
energy consumption. As the noise voltage deviation increases, however, higher
link swing voltages are needed to achieve the same reliability. In high noise
conditions, the average energy consumption of ARQ CRC-5 increases more
than that of the combination of extended Hamming
product codes with type-II HARQ, because ARQ CRC-5 has larger link energy
consumption. The combination of extended Hamming product codes with type-II HARQ yields the
lowest average energy consumption in the higher noise environment (σN = 0.1 V).
The BCH(85,64) scheme has the largest average energy consumption for small
link lengths because it has the largest codec energy consumption. Hamming
product codes with type-II HARQ achieve the lowest energy consumption at large
link lengths or in high noise environments. When the link length is 3 mm, the energy
consumption of this approach is about 15% and 50% less than that of ARQ
CRC-5 and H(71,64), respectively, in the high noise environment. In addition to the
energy consumption improvement compared to ARQ in high noise environments,
the combination of extended Hamming product codes with type-II HARQ can
correct at least two permanent errors, while ARQ will not work in a persistent
noise environment. Thus, the approach combining forward error correction with
limited retransmission can achieve a better balance of energy,
performance, and error resilience.

Fig. 5.25 Delay of 3 mm link for different link swing voltages [10]
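The raw bit error probabilities quoted above for σN = 0.07 V and σN = 0.1 V can be roughly reproduced with a Gaussian noise model, ε = Q(Vswing / (2σN)), evaluated at a full swing of 1 V. This model and the 1 V swing are assumptions of the sketch below rather than statements from this section, and the function names are illustrative.

```python
import math


def q_function(x: float) -> float:
    """Tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def raw_bit_error_probability(v_swing: float, sigma_n: float) -> float:
    """Assumed Gaussian noise model: eps = Q(Vswing / (2 * sigma_N))."""
    return q_function(v_swing / (2.0 * sigma_n))


if __name__ == "__main__":
    for sigma_n in (0.07, 0.1):
        # Assumed full-swing signaling at 1 V.
        print(sigma_n, raw_bit_error_probability(1.0, sigma_n))
```

Under these assumptions the two results fall within an order of magnitude of 10^-12 and 10^-6, respectively. The same function shows how quickly ε grows when Vswing is reduced at a fixed σN, which is why stronger codes are needed to operate at lower swing voltages.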

5.3.3.4 Delay and Area

More powerful error control schemes enable reduced link swing voltages, which
result in significant energy reduction. However, the link delay increases as the link
swing voltage decreases. Figure 5.25 evaluates the effect of reduced swing voltage
on link delay (including the delay of the level translation circuit). The simulation
results show that the link delay at Vswing = 0.75 V is about 30% larger than
that at Vswing = 1 V. To address the increased delay at lower link swing
voltages, the link can be pipelined [18] if a higher frequency is required. Figure 5.25
also shows the delay overhead of the level translation circuit (the shadowed part on the
top) for different link swing voltages. The simulation results show that the delay
overhead of the level translation circuit increases as the swing voltage decreases.
At Vswing = 0.75 V, the delay overhead of the level translation circuit is about
13% of the total link delay.

Table 5.3 Delay and area of different coding schemes using 65 nm technology [10]
Error control scheme                               Encoder delay (ns)   Decoder delay (ns)   Area (µm²)
Hamming (71,64)                                    0.40                 0.62                 2,080.8
ARQ (CRC-5)                                        0.37                 0.41                 3,605.6
HARQ (Extended Hamming (72,64))                    0.42                 0.64                 4,283.7
BCH(85,64)                                         0.42                 0.72                 77,353.2
Extended Hamming product codes with type-II HARQ   0.41                 0.59                 9,792.5
Table 5.3 compares the codec delay and area of different error control schemes
using TSMC 65 nm libraries. The Hamming encoder is implemented as a simple
XOR tree. Instead of using linear feedback shift registers to generate check bits for
CRC codes, a parallel implementation method [19] is employed to reduce the large
latency of CRC codes at a minor cost in complexity. The decoder delay, which is
typically larger than the encoder delay, is reported in Table 5.3. As expected, the decoder
delay of ARQ CRC-5 is the smallest of the compared error control schemes,
because only the syndrome is calculated and no error correction is needed in this
scheme. The BCH(85,64) scheme has the largest delay because of the complexity of
its field arithmetic operations. The delay of extended Hamming product codes with
type-II HARQ is slightly smaller than that of the H(71,64) and HARQ EH(72,64)
schemes, because the component codes used in constructing product codes have a
smaller input data width.
Table 5.3 also shows the area of the different error control schemes. In the go-back-N
retransmission policy, N flits are retransmitted if a NACK signal is received.
Thus, a transmitter buffer is needed to store these N flits in the ARQ and HARQ
schemes. The number N depends on the round-trip transmission delay. In the
simulation, N is equal to four. In the Hamming product codes with type-II HARQ
scheme, the part of the decoder buffer that stores the original message can
be shared with the routing buffer used for routing and flow control in the
router. This greatly reduces the buffer size required. The results show that the FEC
scheme using H(71,64) has the smallest area, because the encoder and syndrome
calculation circuits are implemented as simple XOR trees and no buffers are
needed in this scheme. The area of extended Hamming product codes with type-II
HARQ is about twice that of the HARQ scheme, because of the
overhead associated with the three-stage decoding method. BCH(85,64) has the
largest area because of the high complexity of its decoding process. In nanoscale
technologies, the link energy is likely to greatly exceed the codec energy.
Thus, the proposed method becomes more promising for achieving energy benefits as
technology scales.

5.4 Configurable Error Control System

Each error control scheme has different area, power, throughput, and error correction
capability trade-offs. Configurable error control schemes achieve energy efficiency
by dynamically providing appropriate error control based on noise
conditions or system requirements. In this section, we introduce a method that
combines product codes with conventional Hamming codes to provide different
error correction capabilities in varied noise environments [20].

5.4.1 Principle

Hamming codes have been widely applied to on-chip interconnects because of their
low codec overhead. As noise environments worsen, Hamming codes become inefficient at
maintaining system reliability because of their low error correction capability. The core
idea of combining product codes with conventional Hamming codes is to construct a
system with adjustable code strength, which can be dynamically selected according to
noise environments or reliability requirements. In this method, the error control
scheme works in two operating modes: mode-(a) directly uses Hamming codes in
low noise environments; mode-(b) uses Hamming product codes in high noise
environments. This configurable error control scheme can improve energy efficiency
for a specified reliability requirement or varying noise environments by switching
between the two operating modes. Directly using Hamming codes in operating mode-(a)
requires less codec energy, and the smaller number of link wires leads to lower link
energy consumption. In operating mode-(b), using product codes provides higher reliability.
Figure 5.26a shows the concept of the configurable encoder design. In low noise
environments, encoder1 is configured as a Hamming encoder, which uses the whole
message as the input. The encoded message is sent to the receiver through an
interleaver. The interleaver is implemented as hardwired direct connections with
negligible overhead. In high noise environments, encoder1 is configured as a
Hamming product code component encoder (row encoder). The component
encoder consists of multiple Hamming encoders, each using a part of the message
as its input. The interleaved outputs of configurable encoder1 (original message and
row parity check bits) are sent to the receiver and simultaneously fed to component
encoder2 (column encoder). The outputs of component encoder2 are saved into a
buffer and transmitted when required.
Figure 5.26b shows the concept of the configurable decoder design. Decoder1 can
be configured as a Hamming decoder using the whole codeword as the input or as a
Hamming product code component decoder (row decoder). When the configurable
error control scheme is in operating mode-(b), the outputs of configurable decoder1
are sent to component decoder2, which is realized using an iterative decoding
algorithm. The configuration control signal can be generated by a link quality monitor
[21] or system software [22]. The link quality monitor is realized by counting the
detected errors (cases in which the syndrome value of decoder1 is non-zero). The number
of detected errors is compared to a preset threshold value to decide when to switch between
operating modes. If the number of detected errors is greater than the preset
value, operating mode-(b) is selected. When the configuration is performed by the
system software, an interface control register is needed. The system software can
select the operating mode by setting the control register based on the application
requirement (e.g., if the correctness of operation is the main concern, mode-(b)
is used).

Fig. 5.26 Concept of configurable error control using Hamming product codes (a) encoder (b) decoder [20]
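The threshold-based link quality monitor described above can be modeled in a few lines of Python. This is a behavioral sketch only: the class name, the fixed observation window, and the parameter values are assumptions introduced for illustration, since the text specifies only that the count of non-zero syndromes is compared to a preset threshold.

```python
class LinkQualityMonitor:
    """Selects the operating mode from the number of non-zero syndromes.

    The observation window is an assumption of this sketch; the text only
    states that the error count is compared to a preset threshold.
    """

    def __init__(self, threshold: int, window: int) -> None:
        self.threshold = threshold
        self.window = window
        self.mode = "a"  # start in the low-noise Hamming mode
        self._errors = 0
        self._flits = 0

    def observe(self, syndrome_nonzero: bool) -> str:
        """Record one decoded flit and return the mode to use next."""
        self._errors += int(syndrome_nonzero)
        self._flits += 1
        if self._flits == self.window:
            # More detected errors than the preset value selects mode-(b).
            self.mode = "b" if self._errors > self.threshold else "a"
            self._errors = 0
            self._flits = 0
        return self.mode


if __name__ == "__main__":
    monitor = LinkQualityMonitor(threshold=2, window=8)
    noisy = [True, True, True] + [False] * 5
    print([monitor.observe(flag) for flag in noisy])
```

A hardware realization would replace the two counters with small registers and a comparator, which corresponds to the error counter and comparison circuits mentioned in the area discussion.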

5.4.2 Configurable Encoder Design

Figure 5.27 shows the implementation of the configurable encoder design. In
operating mode-(a), the input message is directly encoded by a Hamming code.
In operating mode-(b), the K-bit input message is arranged into a 4 × (K/4) matrix
to construct the product code, and each row is encoded with an extended Hamming
code with a (K/4)-bit input. To reduce the encoder area overhead, the configurable
encoder1 is implemented using a hardware sharing method. In this method, the
Hamming encoder with a K-bit input is realized by combining the outputs of
four Hamming encoders, each with a (K/4)-bit input. The following
example demonstrates the hardware sharing method. Consider a 16-bit input message,
which is separated into four rows. Each row is encoded using an extended
Hamming code EH(8,4) with the generator matrix in (5.13). The parity check bits of
each row can be combined to generate the parity check bits of a Hamming
code H(21,16) with the generator matrix in (5.15), where P16×5 is the parity matrix. The
hardware implementation of the EH(8,4) and H(21,16) encoders is shown in Fig.
5.28. By using the hardware sharing method, the configurable encoder1 is
implemented in two stages, parity calculation and merge circuits, as shown in
Fig. 5.28. The parity calculation outputs can be directly used as the parity check bits
of four extended Hamming encoders with an input width of K/4 bits or merged together
to generate the parity check bits of a Hamming encoder with an input width of K bits.

Fig. 5.27 Implementation of configurable encoder [20]
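$$
G_1 = [I_{4\times4} \mid P_{4\times4}] =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 1 & 1
\end{bmatrix}
\tag{5.13}
$$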

Fig. 5.28 Hardware sharing between four extended Hamming EH(8,4) encoders and one Hamming H(21,16) encoder [20]

$$
P_{4\times4}^{T} =
\begin{bmatrix}
1 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 \\
0 & 1 & 1 & 1 \\
1 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
M \\
1 \;\; 0 \;\; 1 \;\; 1
\end{bmatrix}
\tag{5.14}
$$

$$
G_2 = [I_{16\times16} \mid P_{16\times5}] \tag{5.15}
$$

$$
P_{16\times5}^{T} =
\left[\begin{array}{cccc|cccc|cccc|cccc}
1&1&1&0&1&1&1&0&1&1&1&0&1&1&1&0\\
1&1&0&1&1&1&0&1&1&1&0&1&1&1&0&1\\
0&1&1&1&0&1&1&1&0&1&1&1&0&1&1&1\\
0&0&0&0&1&1&1&1&0&0&0&0&1&1&1&1\\
0&0&0&0&0&0&0&0&1&1&1&1&1&1&1&1
\end{array}\right]
=
\begin{bmatrix}
M & M & M & M\\
0\,0\,0\,0 & 1\,1\,1\,1 & 0\,0\,0\,0 & 1\,1\,1\,1\\
0\,0\,0\,0 & 0\,0\,0\,0 & 1\,1\,1\,1 & 1\,1\,1\,1
\end{bmatrix}
\tag{5.16}
$$

The four column groups in (5.16), from left to right, correspond to row a, row b, row c, and row d of the message.
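The hardware sharing idea behind (5.13) to (5.16) can be checked in software. The Python sketch below computes the four EH(8,4) parity vectors with the matrix in (5.14), merges them into five parity bits, and compares the result with a direct multiplication by the matrix in (5.16). The particular merge used here (XOR of the per-row parity outputs, with the last two bits derived from the per-row parity sums) is one possible realization inferred from the matrix structure; it is an assumption and not necessarily the circuit of Fig. 5.28.

```python
from itertools import product

# Rows of P^T for EH(8,4), taken from (5.14); the first three rows form M.
PT_4X4 = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
]

# Rows of P^T for H(21,16), taken from (5.16).
PT_16X5 = [
    [1, 1, 1, 0] * 4,
    [1, 1, 0, 1] * 4,
    [0, 1, 1, 1] * 4,
    [0, 0, 0, 0, 1, 1, 1, 1] * 2,
    [0] * 8 + [1] * 8,
]


def parity(pt_rows, bits):
    """Multiply a message by P over GF(2), given the rows of P^T."""
    return [sum(p & b for p, b in zip(row, bits)) % 2 for row in pt_rows]


def merged_parity(message):
    """Build the H(21,16) parity bits from the four EH(8,4) parity outputs."""
    rows = [message[i:i + 4] for i in range(0, 16, 4)]  # rows a, b, c, d
    pa, pb, pc, pd = (parity(PT_4X4, r) for r in rows)  # shared parity stage
    # The XOR of all four EH(8,4) parity bits equals the XOR of the four data
    # bits, because every column of P^T in (5.14) has odd weight.
    _, sb, sc, sd = (sum(p) % 2 for p in (pa, pb, pc, pd))
    first_three = [(pa[i] + pb[i] + pc[i] + pd[i]) % 2 for i in range(3)]
    return first_three + [(sb + sd) % 2, (sc + sd) % 2]


if __name__ == "__main__":
    ok = all(merged_parity(list(m)) == parity(PT_16X5, list(m))
             for m in product((0, 1), repeat=16))
    print("merged parity matches direct H(21,16) parity:", ok)
```

In operating mode-(b) the outputs of the shared parity stage are used unchanged as the EH(8,4) parity bits, so only the merge step is specific to operating mode-(a).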

Fig. 5.29 Implementation of configurable decoder [20]

5.4.3 Configurable Decoder Design

Figure 5.29 shows an implementation of the configurable decoder design. Decoder1
can be configured as a single Hamming decoder, which uses the whole codeword as
the input, or as a component decoder (row decoder) in Hamming product codes.
The component decoder consists of four extended Hamming decoders, and each
decoder uses a part of the codeword as its input. The realization of configurable
decoder1 is divided into three steps, as shown in Fig. 5.29. The hardware sharing
method used in the transmitter design is applied to the parity calculation circuits.
The syndrome calculation circuit is an XOR operation of the parity calculation
outputs and the parity check bits in the codeword. Syndrome calculation1 generates
the syndrome vector of operating mode-(a) and syndrome calculation2 generates the
syndrome vector of operating mode-(b). The syndrome vectors are fed into a syndrome
decoder and error correction circuit. The syndrome decoder is implemented as
an AND tree, whose inputs are the syndrome value or its inverse. The error correction
is an XOR operation.
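The three decoding steps (parity calculation, syndrome calculation, and syndrome decoding with XOR correction) can be illustrated for a single EH(8,4) row decoder. The following Python sketch is a behavioral model under the parity matrix of (5.14); the function names and status strings are illustrative assumptions, and the pattern comparison stands in for the AND tree.

```python
# Rows of P^T for EH(8,4), taken from (5.14).
PT_4X4 = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
]


def parity(data):
    """Step 1: parity calculation over GF(2)."""
    return [sum(p & d for p, d in zip(row, data)) % 2 for row in PT_4X4]


def encode(data):
    """Systematic EH(8,4) codeword: four data bits followed by four parity bits."""
    return list(data) + parity(data)


def decode(codeword):
    """Return (corrected data bits, status) for one received EH(8,4) word."""
    data, rx_parity = codeword[:4], codeword[4:]
    # Step 2: syndrome = recomputed parity XOR received parity.
    syndrome = [c ^ r for c, r in zip(parity(data), rx_parity)]
    # Column j of P^T is the syndrome produced by an error in data bit j.
    patterns = [[row[j] for row in PT_4X4] for j in range(4)]
    # Step 3: pattern match (the role of the AND tree), then XOR to correct.
    flips = [int(syndrome == pattern) for pattern in patterns]
    corrected = [d ^ f for d, f in zip(data, flips)]
    if not any(syndrome):
        status = "no_error"
    elif any(flips):
        status = "corrected"
    elif sum(syndrome) == 1:
        status = "parity_bit_error"
    else:
        status = "uncorrectable"
    return corrected, status


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    word[2] ^= 1  # inject a single error in a data bit
    print(decode(word))
```

Syndromes of even weight do not match any single-error pattern, so double errors are flagged rather than miscorrected; this is the detection capability that triggers the request for the column parity check bits in the type-II HARQ scheme.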
To save hardware resources, a hardware sharing method is introduced in [20] to
realize the syndrome decoder and error correction circuits. Figure 5.30
shows an example of the hardware sharing method. The 16-bit message is encoded
by a Hamming code H(21,16) (operating mode-(a)) or four extended Hamming
codes EH(8,4) (operating mode-(b)). A four-bit syndrome vector is used for each
EH(8,4) code and a five-bit vector is used for the H(21,16) code. By properly
selecting the syndrome value and its inverse from the different operating modes,
the syndrome decoder circuits and error correction circuit can be shared. “1” is
assigned as the extra syndrome bit for each EH(8,4) code. The outputs of
configurable decoder1 are saved into a receiver buffer. In operating mode-(a),
only the decoded message is saved. In operating mode-(b), the decoded message
and row parity check bits are both saved. The saved message and row parity check
bits are used to perform an iterative decoding procedure when the column parity
check bits and checks-on-checks are transmitted.

Fig. 5.30 Configurable syndrome decoder and error correction circuits design [20]

5.4.4 Performance Evaluation

The performance of the configurable error coding scheme combining product
codes with conventional Hamming codes is evaluated in terms of codec delay, area,
reliability, and energy consumption. The input data width K is assumed to be 64 bits.
The Hamming code H(71,64) is used in operating mode-(a). In operating mode-(b),
the 64-bit input message is arranged into a 4 × 16 matrix. Each row is encoded using
an extended Hamming code EH(22,16) and each column is encoded using an
extended Hamming code EH(8,4). The total number of wires in the link of the
proposed method is 88. In operating mode-(a), only 71 wires are used and the
remaining wires are connected to ground. The configurable error control scheme is
developed and verified in Verilog HDL. The encoder and decoder are synthesized
using TSMC 45 nm technology. The delay, area, and power of the encoder and
decoder are reported using Synopsys Design Compiler at a 1 GHz clock frequency.
The link power is measured in Cadence Spectre using a 45 nm global link interconnect
model [14]. Simulation results are compared to directly using the Hamming code
H(71,64), a three-bit error correction BCH(85,64) code, and a Reed-Solomon RS(85,65)
code. Zero padding is applied to meet the length requirement of the RS code. The
number of wires in the link for the different coding schemes is shown in Table 5.4.

Table 5.4 Numbers of wires in the link and codec delay and area of different error coding schemes [20]
Error control scheme                 Wires in the link (active/total)   Decoder delay (ns)   Codec area (µm²)
Hamming (71,64)                      71/71                              0.53                 1,550
BCH(85,64)                           85/85                              0.63                 53,547
RS(85,65)                            85/85                              0.68                 37,482
Product code                         88/88                              0.50                 7,671
Configurable error control method    (a) 71/88, (b) 88/88               0.58                 8,906
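The code lengths and wire counts quoted above follow from the Hamming bound for single-error-correcting codes, 2^r ≥ k + r + 1, with one extra overall parity bit for the extended codes. The short Python sketch below reproduces the numbers for K = 64; the helper names are illustrative.

```python
def hamming_parity_bits(k: int) -> int:
    """Smallest r with 2**r >= k + r + 1 (single-error-correcting code)."""
    r = 1
    while 2**r < k + r + 1:
        r += 1
    return r


def hamming_length(k: int, extended: bool = False) -> int:
    """Codeword length of a Hamming code, or of its extended version."""
    return k + hamming_parity_bits(k) + (1 if extended else 0)


K = 64    # input data width
ROWS = 4  # the message is arranged into a 4 x (K/4) matrix

mode_a_wires = hamming_length(K)                            # H(71,64)
row_code_length = hamming_length(K // ROWS, extended=True)  # EH(22,16)
mode_b_wires = ROWS * row_code_length                       # four rows in parallel
column_code_length = hamming_length(ROWS, extended=True)    # EH(8,4)

if __name__ == "__main__":
    print(mode_a_wires, row_code_length, mode_b_wires, column_code_length)
```

The 17 wires that separate the two modes are the ones tied to ground in operating mode-(a).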

5.4.4.1 Codec Delay and Area

Table 5.4 compares the synthesized codec delay of the configurable error control scheme to
that of directly using H(71,64), BCH(85,64), and RS(85,65). The decoder delay, which is
typically larger than the encoder delay, is reported here. The three-stage pipelined
decoding process is implemented in operating mode-(b) to decode product codes.
The decoding process for operating mode-(a), described in Fig. 5.29, is implemented
within one clock cycle. Compared to directly using the H(71,64) code, the decoder delay
of the configurable coding method increases by about 10% because of the overhead of
the extra MUX for mode switching. The BCH(85,64) and RS(85,65) decoders are
implemented in a seven-stage pipelined architecture. To improve the throughput
of the BCH(85,64) and RS(85,65) codes, the parallel method in [23] is used. Compared
to RS(85,65), the configurable coding method achieves a 15% delay reduction.
Table 5.4 also shows the synthesized codec area for the different error control
schemes. The codec area includes the encoder and decoder area. The encoder
area includes the retransmission buffer. The decoder buffer storing the original
message in the receiver is not included, because this buffer can be shared with the
routing buffer in the router. The area of the error counter and comparison circuits in
the configurable control logic is also included for the proposed method. The results
show that multiple error correction codes have a much larger area than simple
Hamming codes. The area overhead of the product codes is mainly due to the
retransmission buffer and the pipelined decoder architecture. Compared to
BCH(85,64) and RS(85,65), the product code has a smaller area, because each component
code is still a simple extended Hamming code. BCH(85,64) has the largest area
due to the complexity of the field operation and the decoding process. By using the
proposed hardware sharing method, the area overhead of the configuration circuit is
relatively small compared to the Hamming product code itself.

Fig. 5.31 Residual flit error rate of different error control schemes as a function of noise voltage deviation (a) Pn = 10^-2 (b) Pn = 1 [20]

5.4.4.2 Reliability

Figure 5.31 shows the residual flit error rate of different error control schemes as a
function of noise voltage deviation at Pn = 10^-2 and Pn = 1. A supply voltage of
1 V is assumed. The simulation results show that the H(71,64) code used in operating
mode-(a) has the worst residual flit error rate, because Hamming codes can only
correct one error at a time, so two or more simultaneous errors lead to
uncorrected errors. Compared to the BCH(85,64) code, the product code used in
operating mode-(b) achieves a better residual flit error rate, because the product
code can effectively correct multiple random and burst errors, while the BCH code
is only good at correcting multiple random errors. As the value of Pn increases, the
residual flit error rate of the H(71,64) code and the BCH(85,64) code degrades
because of the higher burst error probability at larger Pn. Compared to RS(85,65),
the Hamming product code used in operating mode-(b) has a better error correction
capability, because RS(85,65) can only correct multiple errors within two symbols.
In NoC links, burst errors caused by noise and crosstalk can begin at any bit position
of the links. A more powerful RS code can be constructed, but with a larger delay and
area overhead.

Fig. 5.32 Energy comparison of the configurable error control at different operating modes (a) Link length 1 mm (b) Link length 3 mm [20]
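The limitation of single-error correction can be quantified with a simple model. Assuming independent bit errors with raw probability ε (that is, ignoring the burst errors governed by Pn), a code of length n that corrects up to t errors leaves a residual error whenever more than t bits are corrupted. The Python sketch below evaluates this binomial tail; the independence assumption, the function name, and the value of ε are simplifications for illustration and do not reproduce the simulation model behind Fig. 5.31.

```python
import math


def residual_flit_error_rate(n: int, t: int, eps: float) -> float:
    """Probability that more than t of n bits are in error.

    Assumes independent bit errors; burst errors are not modeled.
    """
    return sum(math.comb(n, i) * eps**i * (1.0 - eps)**(n - i)
               for i in range(t + 1, n + 1))


if __name__ == "__main__":
    EPS = 1e-3  # hypothetical raw bit error probability
    print("H(71,64), t = 1:  ", residual_flit_error_rate(71, 1, EPS))
    print("BCH(85,64), t = 3:", residual_flit_error_rate(85, 3, EPS))
```

Under this model the three-error-correcting BCH(85,64) code is far more reliable than H(71,64) for random errors, which matches the ordering described above; the advantage of the product code comes from its additional ability to handle burst errors, which this sketch does not capture.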

5.4.4.3 Power and Energy Consumption

Figure 5.32 shows the energy consumption of the configurable error control method
in its two operating modes. The energy includes encoder, decoder, and link energy
consumption. The results show that operating mode-(a) consumes less codec
and link energy than operating mode-(b), if both operating modes
meet the reliability requirement. This is because gating techniques are applied and
fewer link wires are used in operating mode-(a). The results also show that the link energy
dominates the total energy consumption as the link length increases. For the 3 mm
link length, operating mode-(b) consumes about 28% more energy than mode-(a).

Fig. 5.33 Example of mode switching for a given reliability requirement [20]
The energy consumption of the configurable error control method combining
product codes with Hamming codes is also compared to the energy consumption of
directly using the H(71,64), BCH(85,64), and RS(85,65) codes. First, the comparison
is performed under a fixed residual flit error rate requirement of 10^-10, as shown in
Fig. 5.33. Two noise environments are considered. In the favorable environment
(σN = 0.06 V), the proposed method operates in mode-(a). In the noisy environment
(σN = 0.11 V), the proposed method switches to operating mode-(b). As the noise
environment worsens, the direct implementation of the H(71,64) code requires a
higher link swing voltage to meet the reliability requirement, while the proposed
method can switch to the more reliable operating mode-(b). In the noisy environment
(σN = 0.11 V), the conventional Hamming implementation requires a 39% increase in
the link swing voltage compared to the configurable method to achieve the required
residual flit error rate. The increased link swing voltage greatly increases the link
energy of the conventional Hamming implementation.
Figure 5.34 shows the energy consumption of the four error control schemes for link
lengths of 1 and 3 mm. The results show that the configurable method combining
product codes with Hamming codes consumes the least energy in the high noise
environment by switching to operating mode-(b). The BCH(85,64) code consumes
the most energy for a link length of 1 mm because its codec energy is larger than that of the
other error control schemes. As the link length increases, the direct implementation of
the H(71,64) code consumes the most energy in the high noise environment because
of the increased link swing voltage. For a 3 mm link in the noisy environment
(σN = 0.11 V), the configurable method achieves 30% and 25% improvement in energy
consumption compared to the direct implementation of the H(71,64) code and the
BCH(85,64) code, respectively. In the more favorable condition (σN = 0.06 V), direct implementation of
the H(71,64) code consumes the least energy of the compared schemes. By switching
to operating mode-(a) in low noise environments, the proposed method consumes 10%
more energy than the H(71,64) code because of the configurable system overhead.
Compared to BCH(85,64), mode-(a) of the configurable error control method achieves
a 40% improvement in energy consumption.

Fig. 5.34 Energy comparison for different noise environments and link lengths [20]

5.5 Summary

In this chapter, we have discussed techniques to achieve reliable and energy
efficient on-chip communication. To reduce the link energy, error control
codes can be combined with a low-swing-voltage link. In this method, the link
energy consumption is reduced because the error control codes allow the system to
run at a lower link swing voltage than an uncoded system. The link energy
reduction lowers the total energy consumption.
Instead of using the worst-case design method with additional safety margins, a
self-calibrating transmission scheme is applied to improve the energy efficiency of
on-chip interconnects. In this method, error control codes are used to detect or correct
errors. The error detection rate provided by an ECC decoder is used to control the
voltage scaling. The self-calibrating transmission achieves low energy consumption
by running the system with a more aggressive voltage scaling scheme.
The direct use of product codes requires a large number of wires, increasing the
link energy consumption. The combination of product codes with a type-II HARQ
scheme can efficiently solve this problem by transmitting the column parity check
bits only when they are requested. As an example of this method, the combination of
extended Hamming product codes with type-II HARQ achieves a significant reduction
in residual flit error rate when multiple random and burst errors are considered.
For a given residual flit error rate requirement, the combination of extended Hamming
product codes with type-II HARQ can operate at much lower swing voltages than other
methods, about 60% and 80% of the swing voltages required for the H(71,64) and
ARQ CRC-5 schemes, respectively. The lower link swing voltage makes the combination
of extended Hamming product codes with type-II HARQ more energy
efficient than other error control schemes. As technology scales, the link
energy is likely to further exceed the codec energy. Thus, the combination of product
codes with type-II HARQ is a more promising approach to provide energy efficient
and reliable communication for future system designs.
A configurable error control scheme combining extended Hamming product
codes with traditional Hamming codes is also presented in this chapter. By using
Hamming codes in low noise environments and extended Hamming product codes
in high noise environments, this configurable coding method improves the energy
efficiency for varied noise environments compared to a fixed error control
approach. In order to reduce the configurable system overhead, a hardware sharing
method is applied to optimize the parity check calculation circuit, syndrome
decoder, and error correction circuits. For a given system reliability requirement,
this configurable error control scheme can achieve a 25% energy reduction com-
pared to a multi-error correcting BCH code in a noisy environment. Compared
to conventional Hamming codes, this configurable error control scheme uses a
lower swing voltage for the same reliability in noisy environments, resulting in
a 30% energy reduction. In a low noise environment, this configurable error
control method can achieve a 40% reduction in energy consumption compared
to a BCH code, and has a 10% energy overhead penalty compared to directly using
Hamming codes.

References

1. Magen N, Kolodny A, Weiser U, Shamir N (2004) Interconnect-power dissipation in a
microprocessor. In: Proceedings international workshop on system-level interconnect
prediction (SLIP), pp 7–13
2. Soteriou V, Peh S L (2004) Design-space exploration of power-aware on/off interconnection
networks. In: Proceedings international conference on computer design (ICCD), pp 510–517
3. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework.
IEEE Trans Very Large Scale Integr (VLSI) Syst 12:655–667
4. Bertozzi D, Benini L, De Micheli G (2005) Error control schemes for on-chip communication
links: the energy-reliability tradeoff. IEEE Trans Comput-Aided Des Integr Circuits Syst
6:818–831
5. Murali S, Theocharides T, Vijaykrishnan N, Irwin JM, Benini L, De Micheli G (2005)
Analysis of error recovery schemes for networks-on-chips. IEEE Des Test Comput 5:434–442
6. Ejlali A, Al-Hashimi MB, Rosinger P, Miremadi GS, Benini L (2010) Performability/energy
tradeoff in error-control schemes for on-chip networks. IEEE Trans Very Large Scale Integr
(VLSI) Syst 18:1–14
7. Zhang H, Varghese G, Rabaey MJ (2000) Low-swing on-chip signaling techniques: effectiveness
and robustness. IEEE Trans Very Large Scale Integr (VLSI) Syst 3:264–272
8. Worm F, Ienne P, Thiran P, Micheli DG (2005) A robust self-calibrating transmission scheme
for on-chip networks. IEEE Trans Very Large Scale Integr (VLSI) Syst 1:126–139

9. Fu B, Ampadu P (2008) An energy-efficient multi-wire error control scheme for reliable
on-chip interconnects using Hamming product codes. VLSI Des 2008:1–14.
doi:10.1155/2008/109490
10. Fu B, Ampadu P (2009) On Hamming product codes with type-II hybrid ARQ for on-chip
interconnects. IEEE Trans Circuits Syst I, Reg Papers 9:2042–2054
11. Lin S, Costello D, Miller M (1984) Automatic-repeat-request error-control schemes. IEEE
Commun Mag 12:5–17
12. Srinivasan RG (1996) Modeling the cosmic-ray-induced soft-error rate in integrated circuits:
an overview. IBM J Res Dev 1:77–89
13. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework.
IEEE Trans Very Large Scale Integr (VLSI) Syst 12:655–667
14. Arizona State University. Predictive technology model [Online] http://ptm.asu.edu/
15. Pande PP, Grecu C, Ivanov A, Saleh R (2005) Performance evaluation and design trade-offs
for network-on-chip interconnect architectures. IEEE Trans Comput 54:1025–1040
16. Kim J S, Taylor M B, Miller J, Wentzlaff D (2003) Energy characterization of a tiled
architecture processor with on-chip networks. In: Proceedings international symposium on
low power electronics and design (ISLPED), pp 424–427
17. Vangal S et al (2008) An 80-tile sub-100-W teraFLOPS processor in 65-nm CMOS. IEEE
J Solid-State Circuits 43:29–41
18. Scheffer L (2002) Methodologies and tools for pipelined on-chip interconnect. In: Proceedings
international conference on computer design (ICCD), pp 152–157
19. Pei BT, Zukowski C (1992) High speed parallel CRC circuits in VLSI. IEEE Trans Commun
40:653–657
20. Fu B, Ampadu P (2010) Error control combining Hamming and product codes for energy
efficient nanoscale on-chip interconnects. IET Comput Digit Tech 4:251–261
21. Li L, Vijaykrishnan N, Kandemir M, Irwin J M (2003) Adaptive error protection for energy
efficiency. In: Proceedings IEEE/ACM international conference on computer-aided design
(ICCAD), pp 2–7
22. Rossi D, Angelini P, Metra C (2007) Configurable error control scheme for NoC signal
integrity. In: Proceedings international on line testing symposium (IOLTS), pp 43–48
23. Sun F, Devarajan S, Rose K, Zhang T (2007) Design of on-chip error correction systems for
multilevel NOR and NAND flash memories. IET Circuits Devices Syst 1:241–249
Chapter 6
Combining Error Control Codes
with Crosstalk Reduction

Conventional error control codes (ECCs) have been successfully applied to improve
the reliability of on-chip interconnects by correcting logic errors. Unfortunately, ECCs
are ineffective against crosstalk-induced delay uncertainty, which greatly degrades
system performance and can even cause timing errors. Crosstalk-induced delay
uncertainty results from the dependence of the coupling capacitance and inductance on
the wire switching patterns. In this chapter, we focus mainly on the delay
uncertainty caused by capacitive crosstalk coupling. This uncertainty can be
alleviated by techniques such as shielding, routing, wire sizing and spacing,
crosstalk avoidance codes (CACs), skewed transitions, and staggered repeaters.
Typically, these methods do not address logic errors. In this chapter, we discuss
solutions that address logic errors and capacitive crosstalk-induced delay
uncertainty simultaneously.

6.1 Duplicate-Add-Parity (DAP) Codes

The encoding and decoding process of the DAP code is introduced in Chap. 4. By
duplicating the input data and adding an extra parity check bit, DAP codes have a
minimum Hamming distance of three and can correct single errors.
DAP codes can also reduce capacitive coupling [1, 2]. Let di (i < k) be the k-bit
original data, d’i (i < k) be the k-bit duplicated data, and p0 be the parity check bit. In
DAP code implementation, the bus wire used to transmit d’i is always placed
adjacent to the bus wire used to transmit di. Because the values of di and d’i are
always the same, the coupling capacitance between these two adjacent wires does
not need to be charged. Thus, any bus wire transmitting DAP codewords only needs
to charge the coupling capacitance of one side. Moreover, an intelligent spacing
method can be used to further optimize the DAP code [2]. In the intelligent spacing
method, the spacing between two wires carrying identical data can be smaller than the
spacing between two wires carrying different data. Figure 6.1 shows a bus layout of a
DAP code with intelligent spacing. A grounded shielding wire is assumed to be placed
at each boundary of the bus.

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links,
DOI 10.1007/978-1-4419-9313-7_6, © Springer Science+Business Media, LLC 2012

Fig. 6.1 A bus layout of a DAP code with intelligent spacing

Fig. 6.2 Worst-case wire capacitances of a (9, 4) DAP code and a (7, 4) Hamming code [2]

From Fig. 6.1, a wire (except d0 and p0) in a DAP codeword has an effective
coupling capacitance of 2Cc(SDR), where Cc(SDR) is the physical coupling capacitance
between two wires with a spacing SDR. If the boundary condition is considered,
the worst-case coupling capacitance in the DAP code occurs at the wire p0. When the
wires d’k−1 and p0 switch in opposite directions, the worst-case coupling capacitance
of wire p0 is 3Cc(SDR). Figure 6.2 shows the worst-case wire capacitances of a (9, 4)
DAP code and a (7, 4) Hamming code. The delay factor is defined as the ratio of the
worst-case wire capacitance in each coding scheme to the ground capacitance of a wire
with the minimum spacing. The same wire routing area is used for both coding schemes.
Figure 6.2 shows that the (9, 4) DAP code has a smaller effective coupling
capacitance than the (7, 4) Hamming code in most cases.
The bus layout of the DAP code can be further optimized by increasing the spacing
between the wires d’k−1 and p0 to reduce the coupling capacitance of p0. A
modified DAP code (MDR) is also proposed in [3]. In the MDR code, the parity bit p0 is
also duplicated. The worst-case coupling capacitance in MDR codes is reduced to
2Cc(SDR).

6.2 Boundary Shift Code (BSC)

BSC [4] is a type of code in which no two adjacent bits switch simultaneously in
opposite directions (i.e., no 01 → 10 or 10 → 01 transition occurs at two adjacent
bit positions). The encoding process of BSC codes is similar to that of DAP codes. The
input data is first duplicated and an extra parity check bit is added. In order to avoid
adjacent bits switching in opposite directions, the encoded data in odd cycles are
right shifted before they are transmitted (cycles are numbered starting from 0). In
BSC codes, the parity check bit can therefore be the rightmost or the leftmost bit of
the codeword. An example of BSC codes is shown in Fig. 6.3.
The decoding process of BSC codes is similar to that of DAP codes. The
received codeword is first shifted back when the clock cycle is odd. A parity
check bit is recalculated using one copy of the data. The recalculated parity
bit is compared to the transmitted parity bit. If the recalculated parity bit is equal to
the transmitted parity bit, the data copy used to recalculate the parity bit is selected
as the decoder output. If the recalculated parity bit is different from the transmitted
parity bit, the other copy of the data is selected as the decoder output. Figure 6.4
shows the decoder of a BSC (9, 4) code.

Fig. 6.3 An example of BSC code

Fig. 6.4 Decoder design of a BSC (9,4) code



BSC codes have a minimum Hamming distance of three and can therefore
correct single errors. Because no two adjacent bits in a BSC codeword switch
simultaneously in opposite directions, the worst-case coupling capacitance of a bus
wire used to transmit BSC codewords is equal to 2Cc, which is smaller than the
worst-case coupling capacitance of a standard bus wire.
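
The BSC encoding and decoding steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [4]: the right shift is modeled as a rotation by one wire (so that the parity bit moves from the rightmost to the leftmost position in odd cycles), the two copies of each data bit are placed on adjacent wires, and all function names are hypothetical.

```python
from itertools import product

def dap_encode(data):
    """Duplicate every data bit and append one even-parity bit."""
    word = []
    for bit in data:
        word += [bit, bit]          # d_i and its copy sit on adjacent wires
    word.append(sum(data) % 2)      # parity bit p0
    return word

def bsc_encode(data, cycle):
    """DAP codeword, rotated right by one wire in odd cycles (assumed shift)."""
    word = dap_encode(data)
    if cycle % 2 == 1:
        word = [word[-1]] + word[:-1]
    return word

def bsc_decode(received, cycle):
    """Undo the shift, then select the copy whose parity matches p0."""
    word = list(received)
    if cycle % 2 == 1:
        word = word[1:] + [word[0]]
    copy_a, copy_b, p0 = word[0:-1:2], word[1:-1:2], word[-1]
    return copy_a if sum(copy_a) % 2 == p0 else copy_b

def has_opposite_transition(prev, curr):
    """True if two adjacent wires switch 01 -> 10 or 10 -> 01."""
    return any(
        prev[j] != prev[j + 1] and curr[j] == prev[j + 1] and curr[j + 1] == prev[j]
        for j in range(len(prev) - 1)
    )

def single_errors_corrected(k=4):
    """Flip each wire of each codeword once and check the decoder output."""
    for data in product([0, 1], repeat=k):
        for cycle in (0, 1):
            sent = bsc_encode(list(data), cycle)
            for pos in range(len(sent)):
                received = list(sent)
                received[pos] ^= 1
                if bsc_decode(received, cycle) != list(data):
                    return False
    return True
```

Under these assumptions, an exhaustive check over all 4-bit inputs can be run with `single_errors_corrected(4)`, and `has_opposite_transition` can be applied to codewords of consecutive cycles to compare the shifted (BSC) and unshifted (DAP) layouts.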

6.3 Crosstalk Avoidance and Multiple Error Correction Code (CAMEC)

The DAP code concept can be extended to construct CAMEC codes. Figure 6.5
shows an example of a CAMEC code proposed in [5]. In this example, the input data
is first encoded using a Hamming code. The outputs of the Hamming encoder are
duplicated and an overall parity check bit, calculated from the output of the
Hamming encoder, is added to the whole codeword. For k-bit input information,
if an (n, k) Hamming code is used, the codeword width of the CAMEC code is 2n + 1.
The minimum Hamming distance of the CAMEC code constructed as in Fig. 6.5 is seven,
so this code is guaranteed to correct up to three errors.
The decoding process of the CAMEC code is more complex than that of DAP
codes. Figure 6.6 shows a crosstalk avoiding double error correction (CADEC)
decoding algorithm proposed in [5]. In a CADEC decoder, the parity check bits
pa and pb are first recalculated from the original Hamming codeword and its copy,
respectively. Then, these recalculated parity check bits pa and pb are compared.
If they are the same, the duplicated Hamming codeword is sent to syndrome detection.

Fig. 6.5 An example of CAMEC codes



Fig. 6.6 Implementation of CADEC decoder

If no error is detected during syndrome detection, the duplicated Hamming codeword
will be used as the input of a conventional Hamming decoding process; otherwise, the
original Hamming codeword will be sent to the conventional Hamming decoder. If pa
and pb are different, pb is compared to the received parity check bit p0. If pb is equal to
p0, the duplicated Hamming codeword is used for further decoding. If pb is not equal to
p0, the original Hamming codeword is used to complete the conventional Hamming
decoding process.
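
The CADEC decision steps above can be sketched in Python for a small case. This is an illustrative sketch only: it assumes a (7, 4) Hamming code with parity bits at positions 1, 2 and 4, stores the two copies back to back rather than interleaved on adjacent wires, and uses hypothetical function names; it is not the decoder implementation of [5].

```python
from itertools import combinations, product

def hamming74_encode(d):
    """4 data bits -> 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    c = [0] * 8                      # index 0 is unused
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]

def syndrome(word):
    """XOR of the (1-based) positions holding a 1; zero for a valid codeword."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s

def hamming74_decode(word):
    """Conventional single-error-correcting Hamming decoding."""
    w = list(word)
    s = syndrome(w)
    if s:
        w[s - 1] ^= 1
    return [w[2], w[4], w[5], w[6]]

def cadec_encode(d):
    a = hamming74_encode(d)
    return a + a + [sum(a) % 2]      # original copy, duplicate, overall parity p0

def cadec_decode(received):
    a, b, p0 = received[0:7], received[7:14], received[14]
    pa, pb = sum(a) % 2, sum(b) % 2
    if pa == pb:
        chosen = b if syndrome(b) == 0 else a   # syndrome detection on the duplicate
    else:
        chosen = b if pb == p0 else a           # compare pb with the received p0
    return hamming74_decode(chosen)

def all_errors_corrected(num_errors):
    """Inject every error pattern of the given weight into every codeword."""
    for d in product([0, 1], repeat=4):
        sent = cadec_encode(list(d))
        for positions in combinations(range(15), num_errors):
            received = list(sent)
            for p in positions:
                received[p] ^= 1
            if cadec_decode(received) != list(d):
                return False
    return True
```

Calling `all_errors_corrected(2)` exercises every double-error pattern for this small code, which is a convenient way to experiment with the decision rules.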
The CADEC decoding algorithm is only guaranteed to correct double errors.
An updated joint crosstalk avoidance and triple error correction (JTEC) decoding
algorithm is proposed in [6]. The JTEC code is guaranteed to correct three errors. In
JTEC codes, the Hamming code together with the overall parity bit forms an
extended Hamming code, which can correct single errors and detect double errors at the
same time. Figure 6.7 shows the flowchart of the JTEC decoding algorithm. In a JTEC
decoder, the syndromes SA and SB are first calculated from the extended Hamming
copy (Hamming codeword and parity bit) and the Hamming copy, respectively. If
syndrome SA is zero, no error exists in the extended Hamming copy, and it
is used as the decoder output. If SA is not zero, there can be one, two or three errors
in the extended Hamming copy. If SA indicates that two errors exist in the extended
Hamming copy, the Hamming copy contains at most a single error; it is
decoded and selected as the decoder output. If SA indicates that one or three errors
exist in the extended Hamming copy, the syndrome of the Hamming copy, SB, is used to
make the decision. If SB is zero, all three errors are in the extended Hamming copy, and
the Hamming copy is selected as the decoder output. If SB is not zero, only a
single error exists in the extended Hamming copy, which is corrected and selected as
the decoder output. In order to reduce the calculation delay of the parity bit p0, the
extended Hamming code can be replaced by a Hsiao SEC–DED code.

Fig. 6.7 Flowchart of JTEC decoding algorithm

The parity bit p0 in Fig. 6.5 can be duplicated to construct a code with a
minimum Hamming distance of eight. This code is proposed in [6] and named the
joint crosstalk avoidance and triple error correction and simultaneous quadruple
error detection (JTEC-SQED) code.
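
The minimum distances quoted for the construction of Fig. 6.5 (seven) and for the variant with a duplicated parity bit (eight) can be checked by brute force on a small instance. The sketch below assumes a (7, 4) Hamming code as the component code and ignores the physical wire ordering; the function names are hypothetical.

```python
from itertools import product

def hamming74_encode(d):
    """4 data bits -> 7-bit Hamming codeword (parity at positions 1, 2, 4)."""
    c = [0] * 8                      # index 0 is unused
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]

def camec_codeword(d, duplicate_parity=False):
    """Hamming codeword, its duplicate, and one (or two) overall parity bits."""
    a = hamming74_encode(d)
    p0 = sum(a) % 2
    return a + a + [p0] * (2 if duplicate_parity else 1)

def min_distance(encode):
    """Smallest Hamming distance between any two distinct codewords."""
    words = [encode(list(d)) for d in product([0, 1], repeat=4)]
    return min(
        sum(x != y for x, y in zip(u, v))
        for i, u in enumerate(words)
        for v in words[i + 1:]
    )

d_single = min_distance(camec_codeword)
d_double = min_distance(lambda d: camec_codeword(d, duplicate_parity=True))
```

Here `d_single` corresponds to the code of Fig. 6.5 and `d_double` to the code with the duplicated parity bit, both built on the assumed (7, 4) component code.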
The worst case coupling capacitance of a bus wire transmitting CAMEC code-
word is 2Cc. Table 6.1 compares the wire delay in 64-node NoC architectures when
different error control coding schemes are applied to the links between router and
router. Three different NoC architectures, mesh, folded torus, and butterfly fat tree
(BFT), are considered. For 64 nodes, BFT-based architecture will have three levels of
routers. BFTa is the wire delay between the routers in level 2 and the routers in level
3. BFTb is the wire delay between the routers in level 1 and the routers in level 2.
Table 6.1 shows that the conventional Hamming codes do not have crosstalk avoid-
ance characteristics and have the largest wire delay.

Table 6.1 The comparison of wire delay in 64-node NoC architecture [6]
Coding scheme               NoC architecture   Length (mm)   Delay (ps)
Hamming code                Mesh               2.86          243
                            Folded torus       5.72          612
                            BFTa               10            1,620
                            BFTb               5             495
DAP/CADEC/JTEC/JTEC-SQED    Mesh               2.86          184
                            Folded torus       5.72          375
                            BFTa               10            900
                            BFTb               5             315

Table 6.2 The comparison of codec delay and area [6]
Coding scheme   Encoder delay (ps)   Decoder delay (ps)   Area (2-input NAND gates)
Hamming code    410                  520                  447
DAP             290                  475                  396
CADEC           525                  545                  1,145
JTEC            190                  440                  1,495
JTEC-SQED       190                  450                  1,675

Table 6.2 compares the codec delay and area of the Hamming code, the DAP code and
different CAMEC codes. The delay and area values are synthesis results in a 90-nm
technology. The input data width is 32 bits. The CADEC code has the largest codec
delay. By using the Hsiao code, JTEC and JTEC-SQED have the smallest codec delays.
Compared to the conventional Hamming code and the DAP code, the improved error
correction capability of CAMEC codes comes at the cost of a larger codec area.

6.4 Unified Coding Framework

A unified coding framework that combines error control coding with crosstalk
avoidance codes is proposed in [1, 7]. In this method, the input data is first encoded
using nonlinear crosstalk avoidance codes (CACs). The outputs of the CACs are
then encoded using an error control code. The parity bits generated by the error
control code are protected against crosstalk coupling using techniques such as
shielding and duplication. The encoding process combining ECC with CACs is shown in
Fig. 6.8.
There are three commonly used CACs: forbidden overlap condition (FOC)
codes [8], forbidden transition condition (FTC) codes [9], and forbidden pattern
condition (FPC) codes [10]. Each of these CACs has a different crosstalk reduction
capability. The encoding process of each CAC is described as follows.

Fig. 6.8 A unified coding framework combining error control with CACs

6.4.1 Forbidden Overlap Condition (FOC) Codes

In Table 2.1, the worst-case link delay (1 + 4λ)τ0 occurs when there is a 010 → 101
(or 101 → 010) transition on three adjacent wires. This worst-case link delay can be
avoided by prohibiting these switching patterns. The CACs satisfying the above
requirement are named forbidden overlap condition (FOC) codes [8]. Because no
010 → 101 (or 101 → 010) transition exists between two consecutive FOC codewords, the
worst-case delay of FOC codes is reduced from (1 + 4λ)τ0 to (1 + 3λ)τ0.
Table 6.3 shows the truth table of an FOC(5, 4) code. The encoding process can be
expressed by,

\[
\begin{aligned}
c_0 &= d_1 + d_2\,\overline{d_3} \\
c_1 &= d_2\,\overline{d_3} \\
c_2 &= d_0 \\
c_3 &= d_2\,d_3 \\
c_4 &= d_1 d_2 + d_3
\end{aligned}
\tag{6.1}
\]

where di (i = 0 to 3) is the input data bit and ci (i = 0 to 4) is the FOC(5, 4)
codeword bit. The complexity of the FOC code increases significantly with the
input data width, so it is impractical to encode a wide bus using a single FOC
code. A solution is to separate a wide bus into small groups and
encode each group using an FOC code with a small input width. For example, 32-bit
input data can be separated into eight groups, with each group encoded using an
FOC(5, 4) code. In this hierarchical encoding method, two groups of FOC(5, 4) code can
be placed next to each other without violating the requirement of FOC codes. For
32-bit input data, the total encoded output is 40 bits wide.
Half-shielding, in which a shield wire is inserted between every two signal wires,
can be regarded as the simplest approach satisfying the forbidden overlap condition.
For a wide bus, half-shielding has a larger area overhead than encoding the
whole bus using multiple FOC(5,4) codes.
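
The FOC(5, 4) encoder and the forbidden overlap condition can be checked with a short Python sketch. The Boolean expressions follow (6.1), with the complemented inputs read off the truth table in Table 6.3; the wire order c4 ... c0, the back-to-back placement of groups, and the function names are assumptions made for illustration.

```python
from itertools import product

def foc54_encode(d3, d2, d1, d0):
    """Return (c4, c3, c2, c1, c0) for one 4-bit group."""
    n3 = 1 - d3                      # complement of d3
    c0 = d1 | (d2 & n3)
    c1 = d2 & n3
    c2 = d0
    c3 = d2 & d3
    c4 = (d1 & d2) | d3
    return (c4, c3, c2, c1, c0)

def violates_foc(prev, curr):
    """True if three adjacent wires make a 010 -> 101 or 101 -> 010 transition."""
    forbidden = {((0, 1, 0), (1, 0, 1)), ((1, 0, 1), (0, 1, 0))}
    return any(
        (tuple(prev[j:j + 3]), tuple(curr[j:j + 3])) in forbidden
        for j in range(len(prev) - 2)
    )

def encode_bus(bits):
    """Hierarchical encoding: split into 4-bit groups (d3 d2 d1 d0) and encode each."""
    wires = []
    for g in range(0, len(bits), 4):
        wires += foc54_encode(*bits[g:g + 4])
    return wires

codebook = [foc54_encode(*d) for d in product([0, 1], repeat=4)]
single_group_ok = not any(violates_foc(a, b) for a in codebook for b in codebook)
```

The same `violates_foc` check can be applied to the output of `encode_bus` for two adjacent groups to examine the group boundary.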

Table 6.3 (5, 4) forbidden overlap condition (FOC) codes
Data bits      Codeword bits
d3 d2 d1 d0    c4 c3 c2 c1 c0
0  0  0  0     0  0  0  0  0
0  0  0  1     0  0  1  0  0
0  0  1  0     0  0  0  0  1
0  0  1  1     0  0  1  0  1
0  1  0  0     0  0  0  1  1
0  1  0  1     0  0  1  1  1
0  1  1  0     1  0  0  1  1
0  1  1  1     1  0  1  1  1
1  0  0  0     1  0  0  0  0
1  0  0  1     1  0  1  0  0
1  0  1  0     1  0  0  0  1
1  0  1  1     1  0  1  0  1
1  1  0  0     1  1  0  0  0
1  1  0  1     1  1  1  0  0
1  1  1  0     1  1  0  0  1
1  1  1  1     1  1  1  0  1

Table 6.4 (4, 3) forbidden transition condition (FTC) codes
Data bits    Codeword bits
d2 d1 d0     c3 c2 c1 c0
0  0  0      0  0  0  0
0  0  1      0  1  0  0
0  1  0      0  0  0  1
0  1  1      0  1  0  1
1  0  0      0  1  1  1
1  0  1      1  1  0  0
1  1  0      1  1  0  1
1  1  1      1  1  1  1

6.4.2 Forbidden Transition Condition (FTC) Codes

In FTC codes [9], any transition involving adjacent wires switching in opposite
directions is prohibited (i.e., a 01 → 10 or 10 → 01 transition); thus, the worst-case
link delay of FTC codes is reduced from (1 + 4λ)τ0 to (1 + 2λ)τ0. Inserting a
shielding wire between every pair of adjacent signal lines is the simplest approach
satisfying the forbidden transition condition. Table 6.4 shows the truth table of an
FTC(4, 3) code. The encoding process can be expressed by,

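\[
\begin{aligned}
c_0 &= d_1 + d_2\,\overline{d_0} \\
c_1 &= \overline{d_0}\,\overline{d_1}\,d_2 + d_0 d_1 d_2 \\
c_2 &= d_0 + d_2 \\
c_3 &= d_0 d_2 + d_1 d_2
\end{aligned}
\tag{6.2}
\]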

Table 6.5 (5, 4) forbidden pattern condition (FPC) codes
Data bits      Codeword bits
d3 d2 d1 d0    c4 c3 c2 c1 c0
0  0  0  0     0  0  0  0  0
0  0  0  1     0  0  0  0  1
0  0  1  0     0  0  1  1  0
0  0  1  1     0  0  0  1  1
0  1  0  0     0  1  1  0  0
0  1  0  1     0  0  1  1  1
0  1  1  0     0  1  1  1  0
0  1  1  1     0  1  1  1  1
1  0  0  0     1  0  0  0  0
1  0  0  1     1  0  0  0  1
1  0  1  0     1  1  0  0  0
1  0  1  1     1  0  0  1  1
1  1  0  0     1  1  1  0  0
1  1  0  1     1  1  0  0  1
1  1  1  0     1  1  1  1  0
1  1  1  1     1  1  1  1  1

A similar hierarchical encoding method can be applied to encode a wide bus using
FTC codes. For example, 32-bit input data can be separated into 11 groups, with each
group encoded using an FTC(4, 3) code. Because two groups of FTC(4, 3) code
cannot be placed next to each other without violating the FTC requirement at the
boundary, shielding wires are needed between adjacent groups to ensure that
the boundary wires do not switch in opposite directions. Using this
hierarchical method, an FTC(53, 32) code can be constructed for 32-bit input data.
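
The need for shield wires between FTC groups can be illustrated with a short Python sketch. The encoder follows (6.2), with the complemented inputs read off Table 6.4; the wire order c3 ... c0, the grounded (always 0) shield wire, and the function names are assumptions made for illustration.

```python
from itertools import product

def ftc43_encode(d2, d1, d0):
    """Return (c3, c2, c1, c0) for one 3-bit group."""
    n0, n1 = 1 - d0, 1 - d1          # complements of d0 and d1
    c0 = d1 | (d2 & n0)
    c1 = (n0 & n1 & d2) | (d0 & d1 & d2)
    c2 = d0 | d2
    c3 = (d0 & d2) | (d1 & d2)
    return (c3, c2, c1, c0)

def violates_ftc(prev, curr):
    """True if two adjacent wires switch 01 -> 10 or 10 -> 01."""
    return any(
        prev[j] != prev[j + 1] and curr[j] == prev[j + 1] and curr[j + 1] == prev[j]
        for j in range(len(prev) - 1)
    )

def two_groups(a, b, shield):
    """Wires of two adjacent groups, optionally separated by a grounded shield."""
    middle = (0,) if shield else ()
    return ftc43_encode(*a) + middle + ftc43_encode(*b)

inputs = list(product([0, 1], repeat=3))
codebook = [ftc43_encode(*d) for d in inputs]
single_group_ok = not any(violates_ftc(x, y) for x in codebook for y in codebook)

def boundary_violation(shield):
    """Search all pairs of consecutive two-group words for a forbidden transition."""
    words = [two_groups(a, b, shield) for a in inputs for b in inputs]
    return any(violates_ftc(x, y) for x in words for y in words)
```

With these definitions, `boundary_violation(False)` and `boundary_violation(True)` compare the unshielded and shielded group boundaries.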

6.4.3 Forbidden Pattern Condition (FPC) Codes

In FPC codes [10], the coupling effects are reduced by prohibiting the 010 and 101 bit
patterns in each codeword. The worst-case link delay of FPC codes is reduced from
(1 + 4λ)τ0 to (1 + 2λ)τ0.
Table 6.5 shows the truth table of an FPC(5,4) code. The encoding process is
expressed by,

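\[
\begin{aligned}
c_0 &= d_0 \\
c_1 &= d_0 d_1 + d_1 d_2 + d_1\,\overline{d_3} + d_0 d_2\,\overline{d_3} \\
c_2 &= d_2\,\overline{d_3} + d_1 d_2 + \overline{d_0}\,d_2 + \overline{d_0}\,d_1\,\overline{d_3} \\
c_3 &= d_2 d_3 + \overline{d_0}\,d_2 + d_1 d_2 + \overline{d_0}\,d_1 d_3 \\
c_4 &= d_3
\end{aligned}
\tag{6.3}
\]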

Fig. 6.9 Hierarchical encoding using two FPC(5,4) codes

Table 6.6 The comparison of coupling factor, minimum Hamming distance and codeword width [7]
Coding    Maximum    Minimum    Number of wires   Component
scheme    coupling   distance   for 32-bit bus    CAC                       ECC       Parity protection
FOC+HC    3          3          49                FOC(5,4)                  Hamming   Half-shielding
FTC+HC    2 (FT)     3          65                FTC(4,3)                  Hamming   Shielding
DAP       2 (FP)     3          65                Duplication               Parity    –
OLC+HC    1          3          106               OLC(8,4)                  Hamming   Duplication + shielding
DSAP      1          3          97                Duplication + shielding   Parity    Shielding
BSC       2          3          65                Duplication               Parity    –

In order to ensure that the bit patterns 101 and 010 do not occur at the boundary
between two groups of FPC(5, 4) code during hierarchical encoding, shielding wires can
be inserted between adjacent groups. Figure 6.9 shows another solution to
the boundary problem. In Fig. 6.9, the most significant input bit of an FPC(5,4) encoder
is fed into the least significant input bit of the next adjacent FPC(5,4) encoder. This
method is more efficient than simply placing shielding wires between adjacent
groups, resulting in fewer redundant wires. Using this method, an FPC(52,32) code
can be constructed for 32-bit input data.
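
The FPC(5, 4) codebook and the boundary problem can be examined with a short Python sketch. The encoder follows (6.3), with the complemented inputs read off Table 6.5; the wire order c4 ... c0, the plain concatenation of two groups, and the function names are assumptions made for illustration, and the sketch does not model the bit-sharing scheme of Fig. 6.9.

```python
from itertools import product

def fpc54_encode(d3, d2, d1, d0):
    """Return (c4, c3, c2, c1, c0) for one 4-bit group."""
    n0, n3 = 1 - d0, 1 - d3          # complements of d0 and d3
    c0 = d0
    c1 = (d0 & d1) | (d1 & d2) | (d1 & n3) | (d0 & d2 & n3)
    c2 = (d2 & n3) | (d1 & d2) | (n0 & d2) | (n0 & d1 & n3)
    c3 = (d2 & d3) | (n0 & d2) | (d1 & d2) | (n0 & d1 & d3)
    c4 = d3
    return (c4, c3, c2, c1, c0)

def has_forbidden_pattern(word):
    """True if the word contains 010 or 101 on three adjacent wires."""
    return any(
        tuple(word[j:j + 3]) in {(0, 1, 0), (1, 0, 1)}
        for j in range(len(word) - 2)
    )

codebook = [fpc54_encode(*d) for d in product([0, 1], repeat=4)]
codebook_ok = not any(has_forbidden_pattern(w) for w in codebook)

# Two groups placed side by side with no boundary handling.
unguarded = [a + b for a in codebook for b in codebook]
boundary_problem = any(has_forbidden_pattern(w) for w in unguarded)
```

Here `codebook_ok` checks the individual codewords and `boundary_problem` checks whether a forbidden pattern can appear across the boundary of two plainly concatenated groups, which is the situation the hierarchical schemes above are designed to avoid.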

6.4.4 Performance Evaluation

Table 6.6 lists the coupling factor, the minimum Hamming distance and the codeword
width of different coding schemes. FOC + HC, FTC + HC, and OLC + HC
are constructed using the unified coding framework. In these codes, a Hamming code
is combined with FOC(5, 4), FTC(4, 3), and OLC(8, 4) based CACs, respectively.
OLC stands for one lambda code. The simplest OLC can be constructed by
duplicating the input data bits and inserting shield wires between adjacent pairs
of duplicated bits. The OLC(8, 4) code proposed in [11] is shown in Table 6.7. To
encode a wide bus using OLC(8, 4), the bus is first separated into several 4-bit groups.
Each group is encoded separately using an OLC(8,4) code. Between each group, the
boundary bit is duplicated and a shield wire is inserted. Table 6.6 also includes the
DAP code, BSC, and duplicate shield add parity (DSAP) codes, which can also correct
single errors but are constructed using an alternative approach. In the DSAP code,
shield wires are inserted between adjacent pairs of duplicated bits and between the
duplicated bits and the parity bit.

Table 6.7 An OLC(8, 4) code
Data bits      Codeword bits
d3 d2 d1 d0    c7 c6 c5 c4 c3 c2 c1 c0
0  0  0  0     0  0  0  0  0  0  0  0
0  0  0  1     0  0  0  0  0  0  0  1
0  0  1  0     0  0  0  0  0  1  1  1
0  0  1  1     0  0  0  1  1  1  0  0
0  1  0  0     0  0  0  1  1  1  1  1
0  1  0  1     0  1  1  1  0  0  0  0
0  1  1  0     0  1  1  1  0  0  0  1
0  1  1  1     0  1  1  1  1  1  0  0
1  0  0  0     0  1  1  1  1  1  1  1
1  0  0  1     1  1  0  0  0  0  0  0
1  0  1  0     1  1  0  0  0  0  0  1
1  0  1  1     1  1  0  0  0  1  1  1
1  1  0  0     1  1  1  1  0  0  0  0
1  1  0  1     1  1  1  1  0  0  0  1
1  1  1  0     1  1  1  1  1  1  0  0
1  1  1  1     1  1  1  1  1  1  1  1
Figure 6.10 compares the codec overhead of different coding schemes for a 32-bit bus
[7]. The codec area, delay and energy are normalized to those of the Hamming (38, 32)
code. Figure 6.10 shows that the OLC + HC code has the largest codec overhead. The DAP
and DSAP codes have the smallest codec overhead.
Figure 6.11 compares the speedup of different coding schemes over the uncoded
bus as a function of λ, the ratio of the coupling capacitance Cc to the ground
capacitance Cg. The bus length L is 10 mm. The speedup of code 1 over code 2 is
defined in [1] as,

\[
\text{speedup} = \frac{T_{c2} + T_{b2}}{T_{c1} + T_{b1}} \tag{6.4}
\]

where Tci is the codec delay of code i, including the encoder and decoder delays, and
Tbi is the bus delay with code i. The Hamming code has a speedup of less than one
because it has the same link delay as the uncoded bus plus an extra codec delay. The
joint coding schemes, which simultaneously correct single errors and reduce crosstalk
coupling, achieve a speedup over the uncoded bus. DAP and DSAP codes achieve speedups
of 1.44 and 2.14, respectively, at L = 10 mm and λ = 2.8, because these two codes
reduce the worst-case coupling factor to two and one, respectively, and also have a
relatively small codec delay overhead. Speedup increases with the bus length and λ.
Therefore, as technology scaling leads to reduced codec delay, longer bus length L,
and larger λ, the joint coding schemes will achieve a larger speedup.

Fig. 6.10 Codec area, delay and energy comparison for different coding schemes

Fig. 6.11 Speedup comparison of different coding schemes [7]

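
The speedup definition (6.4) is simple enough to evaluate directly. The Python sketch below is only meant to exercise the formula: it pairs the 32-bit codec delays of Table 6.2 with the link delays of Table 6.1, a combination that is an assumption made for illustration and that does not reproduce the setup or the values of Fig. 6.11. The function and variable names are hypothetical.

```python
def speedup(codec1, bus1, codec2, bus2):
    """Speedup of code 1 over code 2 per (6.4); all delays in the same unit."""
    return (codec2 + bus2) / (codec1 + bus1)

# Codec delay = encoder delay + decoder delay (ps), taken from Table 6.2.
HAMMING_CODEC = 410 + 520
DAP_CODEC = 290 + 475

# Link delays (ps) for the mesh and BFTa links, taken from Table 6.1.
HAMMING_LINK = {"mesh": 243, "BFTa": 1620}
DAP_LINK = {"mesh": 184, "BFTa": 900}

dap_over_hamming = {
    arch: speedup(DAP_CODEC, DAP_LINK[arch], HAMMING_CODEC, HAMMING_LINK[arch])
    for arch in ("mesh", "BFTa")
}
```

With these inputs the longer BFTa link yields a larger ratio than the mesh link, in line with the observation that the speedup grows with the bus length.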

6.5 Error Control Codes with Skewed Transitions

The skewed transition method [12, 13] reduces crosstalk coupling by delaying
adjacent transitions by a finite time ΔT. In this section, we introduce
a method that combines error control coding with skewed transitions to simultaneously
address error correction and capacitive-coupling-induced delay uncertainty.

6.5.1 The Principle

In skewed transitions, simultaneous opposite switching on neighboring bus lines
is avoided by the introduced relative delay ΔT. The worst-case effective capacitance
Ceff of skewed transitions (Ceff of a middle wire when a 010 → 101 or 101 → 010
transition occurs on three adjacent wires) can be expressed by (6.5) below [12],

\[
C_{\mathrm{eff}} = C_{gt} + \left(4 - 2\,\frac{\lvert V_N(\Delta T) - V_N(0)\rvert}{V_{DD}}\right) C_{ct}
= C_{gt} + \bigl(4 - 2\,v(\Delta T)\bigr)\, C_{ct} \tag{6.5}
\]

where Cgt is the total capacitance between the wire and ground; Cct is the total
coupling capacitance between any two adjacent wires; VN(ΔT) and VN(0) are the
voltages of the neighboring wires at time ΔT and 0, respectively; v(ΔT) is the ratio of
the neighboring wire’s voltage difference between time ΔT and 0 to VDD (0 ≤ v(ΔT) ≤ 1).
When ΔT = 0, v is 0. As ΔT increases, v approaches 1.
In skewed transition methods, delay elements are inserted at the beginning of
alternate bus lines to generate the relative delay ΔT, as shown in Fig. 6.12a. For a
bus line with k − 1 repeaters, the worst-case link delay Td in skewed transitions can
be described by (6.6) below [12],

\[
T_d = \left(0.7R_r + 0.4\,\frac{R_t}{k}\right)\left(C_{gt} + 4C_{ct}\right) + 0.7\,(kR_r + R_t)\,C_r
+ \Delta T - \left(1.4R_r + 0.8\,\frac{R_t}{k}\right) C_{ct}\, v(\Delta T) \tag{6.6}
\]

Fig. 6.12 Conventional skewed transitions (a) Skewed transition by inserting delay elements
(b) The relation between the worst case link delay and skewed delay ΔT

where Rr and Cr are the on-resistance and output capacitance of the repeater, and Rt is
the total resistance of the wire. The first two terms in (6.6) are the worst-case delay
of the standard bus. From (6.6), the delay reduction achieved by the skewed
transition method depends on the difference between the last two terms. Thus, a
large ΔT increases the overall link delay Td, as shown in Fig. 6.12b.
In [14], a method combining ECCs and skewed transitions is proposed to
improve the reliability of on-chip interconnects. In this method, ECCs are used to
correct logic errors while skewed transitions are applied to reduce capacitive
crosstalk-induced delay uncertainty. By hiding the delay insertion overhead of
the skewed transition method, this method achieves a larger reduction in the
worst-case link delay than the conventional skewed transition method.
Figure 6.13 shows the method combining ECCs with skewed transitions. In an
error control encoder, the parity bits are generated from the original input data after
a finite delay. Instead of sending the input data and parity bits to the link at the
same time, part of the input data can be sent before the parity bits are available. Two
clocks (CLK1 and CLK2, with CLK1 arriving ahead of CLK2) are used alternately
to offset the transitions in each pair of adjacent interconnect lines. The input data
and parity bits are mapped to registers triggered by these two clocks, as shown in
Fig. 6.13.

Fig. 6.13 Block diagram of proposed method exploiting parity computation latency to reduce
crosstalk coupling [14]

Figure 6.14 illustrates the transmission procedure of the method combining ECCs
with skewed transitions. Assume that the clock cycle of CLK1 and CLK2 is Tcycle and
that k-bit input data are available at the rising edge of CLK1. The calculation of the
r-bit parity data is completed after a delay of Dparity. The wires l(i) (1 ≤ i ≤ k + r)
in the link with odd index i are triggered by CLK1, and those with even index i are
triggered by CLK2. In the proposed method, input data can be sent at the next rising
edge of CLK1 or at the rising edge of CLK2, which arrives after a delay ΔT1, as shown in
Fig. 6.14. Because the data bits are available before the parity bits are calculated,
the data can be sent earlier than the parity bits without affecting the overall
system performance. Parity-check bits are calculated from the input data after the
delay Dparity; thus, they can only be transmitted at the next rising edge of CLK1. The
relationship between Tcycle and the timing offsets ΔT1 and ΔT2 is described in
Fig. 6.14 and should meet the following constraint,

\[
T_{cycle} = \Delta T_1 + \Delta T_2 \ge D_{parity} \tag{6.7}
\]

For implementation simplicity, CLK1 and CLK2 can be the rising and falling
edges of the same clock.

Fig. 6.14 Transmission procedure of the method combining error control coding with
crosstalk reduction [14]

Fig. 6.15 Mapping algorithm for a systematic (n, k) linear block code

6.5.2 Data Mapping Algorithm

In the method combining ECCs with skewed transitions, the input data can be
transmitted using either CLK1 or CLK2, while the parity check bits can only be
transmitted using CLK1. In a systematic (n, k) linear block code, the codeword c can
be calculated by,

\[
c = m \cdot \left[\, I_k \mid P_{k \times (n-k)} \,\right] \tag{6.8}
\]

where m is the k-bit input data, Ik is the k × k identity matrix and Pk×(n−k) is the
parity matrix. The mapping of the n-bit codeword c to the proper wire positions in the
link can be realized by the algorithm in Fig. 6.15, where c(i) is the ith bit in c and
l(i) is the ith wire in the link.

Fig. 6.16 An example of the mapping algorithm when a Hamming H(12,8) code is used to correct
logic errors

Figure 6.16 shows an example of the mapping algorithm applied to a Hamming
H(12,8) code. The parity calculation unit is implemented as XOR trees. The link is
driven by two alternating clocks CLK1 and CLK2, ensuring that no two adjacent wires
use the same clock. As shown in this example, the four parity check bits are
assigned to CLK1 and are separated by the first three data bits. The rest of the data
bits are assigned to the remaining wires.
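
The wire assignment described for the H(12,8) example can be sketched in Python as follows. This is one mapping that is consistent with the description above (parity bits on the first odd-indexed wires, data bits on all remaining wires); it is an assumption made for illustration and is not a transcription of the algorithm in Fig. 6.15. The bit labels `p0 ...` and `d0 ...` and the function names are hypothetical.

```python
def clock_of(wire_index):
    """Wires with odd index are triggered by CLK1, even index by CLK2."""
    return "CLK1" if wire_index % 2 == 1 else "CLK2"

def map_codeword(n, k):
    """Return the labels carried by wires l(1) ... l(n) of a systematic (n, k) code."""
    r = n - k
    if r > (n + 1) // 2:
        raise ValueError("not enough CLK1 wires for all parity bits")
    parity = [f"p{j}" for j in range(r)]
    data = [f"d{j}" for j in range(k)]
    wires = []
    for i in range(1, n + 1):
        if clock_of(i) == "CLK1" and parity:
            wires.append(parity.pop(0))   # parity bits may only use CLK1
        else:
            wires.append(data.pop(0))     # data bits may use either clock
    return wires

mapping = map_codeword(12, 8)
for i, label in enumerate(mapping, start=1):
    print(f"l({i}) <- {label} on {clock_of(i)}")
```

For the (12, 8) case this places the four parity bits on l(1), l(3), l(5) and l(7), interleaved with the first three data bits, and the remaining data bits on l(8) to l(12).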
A more complex case arises when multiple SEC codes or SEC-DED
codes are interleaved to correct spatial burst errors. In this case, K-bit input data
is separated into several smaller groups, with each group encoded separately using an
SEC or SEC-DED code. The outputs of these groups are interleaved before
they are transmitted through the link. The burst error correction capability of this
method depends on the interleaving distance, which is defined as the distance
between two wires belonging to the same group. In this case, the mapping algorithm
should meet two conditions: (1) parity check bits can only be transmitted using
CLK1, and (2) the mapping algorithm should maintain the same interleaving distance.
Assume that K-bit input data are separated into g groups and each group Gj (1 ≤
j ≤ g) is encoded using an SEC (n, k) code with codeword cj. To maintain the
interleaving distance, the mapping must cycle through each group in sequence (e.g.,
G1 → G2 → G3 → G1 etc.). Mapping data and parity bits between CLK1 and CLK2
is straightforward when the number of groups is odd, as shown by the algorithm in
Fig. 6.17: in alternating rounds, each group is mapped to both CLK1 and
CLK2. For example, with three groups, the mapping would begin as follows:
G1 → CLK1, G2 → CLK2, G3 → CLK1. The next ‘loop’ through the groups would
then be G1 → CLK2, G2 → CLK1, G3 → CLK2; thus, each group has access to
both CLK1 and CLK2 and can route its data and parity bits appropriately.

Fig. 6.17 Proposed mapping algorithm for multiple SEC/SEC-DED codes with
interleaving when the number of groups is odd

The mapping of an even number of groups is slightly more complex. If the same
method of looping through the groups is used, each group is mapped to only one
clock. For example, with four groups, G1 and G3 are always mapped to CLK1, and
G2 and G4 are always mapped to CLK2. One solution to this issue is to insert an
extra wire in the link, shown as l(13) in Fig. 6.18 (an example with four (7,4)
Hamming encoded groups). This extra wire allows us to switch G1 and G3 to CLK2
and G2 and G4 to CLK1, ensuring that each group has access to both CLK1 and
CLK2. The resulting mapping algorithm for an even number of groups is shown in
Fig. 6.19.
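The effect of the extra wire can be illustrated with a short Python sketch. It
is not the algorithm of Fig. 6.19; it only shows how one additional wire flips
the clock phase seen by the remaining codeword bits. We assume that adjacent
wires alternate between CLK1 and CLK2 and that the extra wire carries no codeword
bit; the function name and the parameter `extra_after` are hypothetical.

```python
def map_with_extra_wire(num_groups, bits_per_group, extra_after):
    """Round-robin mapping with one extra wire inserted into the link.

    Assumptions: adjacent wires alternate CLK1/CLK2, and the extra wire,
    placed after `extra_after` codeword wires, carries no codeword bit
    (its group is recorded as None).
    """
    def clock_of(index):
        return "CLK1" if index % 2 == 0 else "CLK2"

    assignment = []
    wire = 0  # 0-based physical wire index
    for bit in range(num_groups * bits_per_group):
        if bit == extra_after:
            assignment.append((wire + 1, None, clock_of(wire)))
            wire += 1
        group = bit % num_groups + 1
        assignment.append((wire + 1, group, clock_of(wire)))
        wire += 1
    return assignment


# Four (7,4) Hamming encoded groups; the extra wire becomes wire 13.
mapping = map_with_extra_wire(num_groups=4, bits_per_group=7, extra_after=12)

clocks = {}
for _, group, clock in mapping:
    if group is not None:
        clocks.setdefault(group, set()).add(clock)
print(len(mapping), mapping[12])
print(clocks)
```

With `extra_after=12`, the added wire lands at position 13, matching the label
l(13) in the example, and every group is mapped to both clocks.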

Fig. 6.18 An example of applying the proposed mapping algorithm to four (7, 4) Hamming
encoded groups

6.5.3 Performance Evaluation

Unlike conventional skewed transition methods, the combination of error control
codes with skewed transitions hides the overhead induced by delay elements in
the ECC encoding stage. The worst-case link delay Td_comb of this method can be
expressed as

$$T_{d\_comb} = \left(0.7R_r + 0.4\frac{R_t}{k}\right)\left(C_{gt} + 4C_{ct}\right) + 0.7\left(kR_r + R_t\right)C_r - \left(1.4R_r + 0.8\frac{R_t}{k}\right)C_{ct}\,v(\Delta T_{comb}) \tag{6.9}$$

where ΔTcomb is equal to the minimum of ΔT1 and ΔT2. Moreover, the last term in
(6.9) is greater than that in (6.6) because ΔTcomb can be much larger than
ΔTconv. Thus, the combination of ECCs with skewed transitions can achieve an
extra delay reduction compared to a conventional skewed transition approach.
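The role of the last term in (6.9) can be seen by evaluating the expression
numerically. The Python sketch below transcribes (6.9) term by term; the value
of v(ΔTcomb) is passed in by the caller because its definition is given
elsewhere, and all parameter values are hypothetical placeholders chosen only to
make the example run.

```python
def worst_case_delay_comb(R_r, R_t, C_gt, C_ct, C_r, k, v):
    """Evaluate (6.9); `v` is the value of v(dT_comb) supplied by the caller."""
    base = (0.7 * R_r + 0.4 * R_t / k) * (C_gt + 4 * C_ct)
    repeater = 0.7 * (k * R_r + R_t) * C_r
    skew_gain = (1.4 * R_r + 0.8 * R_t / k) * C_ct * v
    return base + repeater - skew_gain


# Hypothetical values in ohms and farads, not taken from the text.
params = dict(R_r=1e3, R_t=500.0, C_gt=100e-15, C_ct=100e-15, C_r=2e-15, k=4)

for v in (0.0, 0.5, 1.0):
    delay = worst_case_delay_comb(v=v, **params)
    print(f"v = {v:.1f}: delay = {delay * 1e12:.1f} ps")
```

A larger value of v(ΔTcomb) enlarges the subtracted term, so the worst-case
delay decreases, which is the effect described above.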
Figure 6.20 compares the worst-case link delay of the combination of ECCs with
skewed transitions with that of the conventional skewed transition method. A
Hamming H(71,64) code is used to correct single logic errors. The Hamming
encoder is implemented as XOR trees, whose depth determines the worst-case delay
of the encoder. The H(71,64) encoder is synthesized using a TSMC 65 nm
technology with a worst-case delay Dparity = 400 ps. ΔTcomb = 200 ps is equal to
half of Dparity. A 65 nm link model [15] with lengths from 1 mm to 5 mm is used
in the simulations. The link delay is normalized to the delay of a standard bus
with minimum link width and spacing. Figure 6.20 shows that the combination of
ECCs with skewed transitions reduces the worst-case link delay by up to 46%.
Compared to the conventional skewed transition method, this method reduces the
worst-case link delay by 25% for a 5 mm link.

Fig. 6.19 Proposed mapping algorithm for multiple SEC/SEC-DED codes with
interleaving when the number of groups is even [14]
The performance of the combination of ECC with skewed transitions is compared
to DAP codes, BSC codes, and the combination of ECC with CAC codes, such as FOC
codes, FTC codes, and FPC codes. The specific crosstalk reduction techniques
used for data and parity bits for each scheme are shown in Table 6.8. The codecs
are synthesized using a TSMC 65 nm technology. Codec power, delay and area are
reported using Synopsys Design Compiler. The system frequency is 1 GHz. A global
65 nm link model [15] is used. The link power is measured using Cadence Spectre
with random input data and a switching activity factor of 0.5. The signal slew
rate is equal to 2.5 times the output slew rate of an FO4 inverter [15]. The
input data width is 64 bits. Registers are inserted between encoder and link,
and between link and decoder, to allow pipelined operation.

Fig. 6.20 Comparison of worst-case link delay for the combination of ECCs with
skewed transitions and conventional skewed transitions [14]

Table 6.8 Specific crosstalk reduction techniques used for data and parity bits
for each scheme [14]

| Scheme | Data crosstalk reduction | Parity crosstalk reduction | Number of wires |
|---|---|---|---|
| Hamming(71,64) + skewed transitions | Skewed transitions | Skewed transitions | 71 |
| DAP(129,64) | Duplication | Duplication | 129 |
| BSC(129,64) | Duplication | Duplication | 129 |
| Hamming(71,64) + CACs | FOC | Half-shielding | 91 |
| Hamming(71,64) + CACs | FTC | Shielding | 120 |
| Hamming(71,64) + CACs | FPC | Shielding | 119 |

Table 6.9 Codec delay and area comparison with previous solutions
simultaneously addressing logic errors and crosstalk-induced delay [14]

| Scheme | Encoder delay (ns) | Decoder delay (ns) | Codec area (µm²) |
|---|---|---|---|
| Hamming(71,64) + skewed transitions | 0.40 | 0.62 | 2,468.2 |
| DAP(129,64) | 0.43 | 0.52 | 2,338.4 |
| BSC(129,64) | 0.46 | 0.78 | 2,638.1 |
| Hamming(71,64) + FOC | 0.52 | 0.70 | 3,048.5 |
| Hamming(71,64) + FTC | 0.55 | 0.76 | 4,342.3 |
| Hamming(71,64) + FPC | 0.57 | 0.79 | 4,485.9 |
Table 6.9 shows the codec delay for each scheme. As can be seen, the decoder
delay is typically larger than the encoder delay. DAP(129,64) has the smallest
decoder delay because its decoding process is simpler than that of Hamming
codes. The combination of error control coding with CACs has a larger codec
delay because of the extra delay introduced by the CACs. The combination of ECC
with skewed transitions achieves a 21% reduction in decoder delay compared to
combining Hamming codes with FPC codes.

Fig. 6.21 Link area comparison for different schemes [14]
Table 6.9 also compares the codec area of each scheme. Codec area includes the
area of encoder, decoder and pipelined registers. The results show that the combi-
nation of error control coding with CACs has a large codec area overhead compared
to other schemes, because of the encoding and decoding circuits of CACs. The
codec area of combining Hamming code with skewed transitions is close to the
codec area of DAP codes and is 45% less than the codec area of combining
Hamming codes with FPC.
Figure 6.21 compares the link area of the different schemes. The area is
normalized to an uncoded bus with minimum link width and wire spacing. The
results show that DAP(129,64) has the largest link area because of the large
number of wires required. The combination of ECC with skewed transitions
requires the fewest wires, resulting in the smallest link area of the compared
schemes. It achieves about a 45% reduction in link area compared to DAP codes
for 64-bit data.
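The percentages quoted for decoder delay, codec area and link area can be
checked directly from the values in Tables 6.8 and 6.9. The short Python sketch
below does this arithmetic. For the link area we assume, as a simplification,
that area scales with the number of wires at equal width and spacing; the helper
name `reduction_percent` is ours.

```python
def reduction_percent(new, old):
    """Relative reduction of `new` with respect to `old`, in percent."""
    return 100.0 * (old - new) / old


# Values copied from Table 6.9 (decoder delay in ns, codec area).
decoder_delay = {"skewed": 0.62, "FPC": 0.79}
codec_area = {"skewed": 2468.2, "FPC": 4485.9}
# Wire counts copied from Table 6.8.
wires = {"skewed": 71, "DAP": 129}

delay_gain = reduction_percent(decoder_delay["skewed"], decoder_delay["FPC"])
area_gain = reduction_percent(codec_area["skewed"], codec_area["FPC"])
# Assumption: link area is proportional to the number of wires.
link_gain = reduction_percent(wires["skewed"], wires["DAP"])

print(f"decoder delay reduction vs. FPC: {delay_gain:.1f}%")
print(f"codec area reduction vs. FPC:    {area_gain:.1f}%")
print(f"wire count reduction vs. DAP:    {link_gain:.1f}%")
```

The three results are consistent with the 21%, 45% and 45% figures reported in
the text.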
The residual flit error rate is used to measure reliability. Presidual of
Hamming codes can be estimated by (6.10) below [1],

$$P_{Hamming}(\varepsilon) = \binom{k+r}{2}\,\varepsilon^{2} \tag{6.10}$$

where k is the input data width and r is the number of parity bits in the
Hamming codeword. The residual flit error rate of DAP and BSC codes can be
estimated by (6.11) below [1],

$$P_{DAP}(\varepsilon) = \frac{3k(k+1)}{2}\,\varepsilon^{2} \tag{6.11}$$

The residual flit error probability of each scheme is estimated by substituting
k = 64 and the corresponding r value into (6.10) and (6.11). For k = 64, the
combination of H(71,64) with skewed transitions achieves 1.5× and 2.5×
improvement in residual word error probability compared to combining H(71,64)
with FOC and DAP(129,64), respectively.
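The 2.5× figure relative to DAP(129,64) follows directly from (6.10) and (6.11),
as the Python sketch below illustrates. The wire error probability used here is
a hypothetical value; the ratio of the two estimates does not depend on it.

```python
from math import comb


def p_hamming(k, r, eps):
    """Residual flit error estimate of (6.10) for a Hamming code."""
    return comb(k + r, 2) * eps ** 2


def p_dap(k, eps):
    """Residual flit error estimate of (6.11) for DAP and BSC codes."""
    return 3 * k * (k + 1) / 2 * eps ** 2


eps = 1e-6  # hypothetical single-wire error probability
hamming = p_hamming(k=64, r=7, eps=eps)   # H(71,64)
dap = p_dap(k=64, eps=eps)                # DAP(129,64)

print(comb(71, 2), 3 * 64 * 65 // 2)
print(f"P_DAP / P_Hamming = {dap / hamming:.2f}")
```

The coefficients are 2485 for H(71,64) and 6240 for DAP(129,64), giving a ratio
of about 2.5.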
Fig. 6.22 Delay uncertainty comparison for different schemes handling both
error correction and crosstalk reduction [14]

Fig. 6.23 Link energy consumption for different schemes that simultaneously
address error correction with crosstalk reduction [14]

Delay uncertainty is used to measure the effects of crosstalk coupling on the
link delay. The delay uncertainty is defined as the ratio of the delay variation
to the worst-case delay [16],

$$U = \frac{t_{prop}(\max) - t_{prop}(\min)}{t_{prop}(\max)} \tag{6.12}$$

where tprop(max) is the worst-case link propagation delay and tprop(min) is the
minimum link propagation delay. Figure 6.22 shows the delay uncertainty of each
scheme. The link length is varied from 1 to 3 mm. Each scheme examined can
greatly reduce the delay uncertainty compared to an uncoded link. The delay
uncertainty of the combination of ECC with skewed transitions is up to 49% less
than that of uncoded links. Compared to the combination of ECC with FPC, it can
achieve up to 24% delay uncertainty reduction.

Figure 6.23 compares the link energy of each method. The comparison is
performed for the same reliability requirement (Preq < 10⁻²⁰, with 3σN noise
voltage equal to 20% of Vdd). The results show that the combination of ECC with
skewed transitions has the lowest link energy consumption of the compared
schemes because it requires the fewest wires.

Fig. 6.24 Total energy versus link length for different schemes simultaneously
addressing error correction with crosstalk reduction [14]
Figure 6.24 compares the total energy consumption Etotal of each method at link
lengths of 1 and 3 mm. Etotal includes encoder, link, and decoder energy. The
results show that combining H(71,64) with FPC consumes more total energy than
the other schemes because of its larger codec and link energy consumption. The
combination of H(71,64) with skewed transitions achieves the lowest total energy
consumption because of its relatively small codec overhead and the smallest
required number of wires. Compared to combining H(71,64) with FPC, it achieves a
32% improvement in energy consumption at a link length of 3 mm.

6.6 Summary

In this chapter, we have examined different techniques that can efficiently
address both logic errors and capacitive crosstalk induced delay uncertainty
simultaneously. By duplicating the input data and adding an extra parity check
bit, DAP codes have a minimum Hamming distance of three and can correct single
errors. In DAP codes, the bus wire used to transmit the duplicated bit is placed
adjacent to the bus wire used to transmit the original bit. Thus, DAP codes can
reduce the effect of capacitive coupling. An intelligent spacing method can be
used to further optimize DAP codes. In the intelligent spacing method, the
spacing between two wires carrying identical data can be smaller than the
spacing between two wires carrying different data.
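The DAP construction summarized above is small enough to check exhaustively. The
Python sketch below builds a DAP codeword by duplicating each data bit onto an
adjacent position and appending one parity bit, and then confirms the minimum
Hamming distance for a small data width. The bit ordering and the function names
are our assumptions, and the decoder is not shown.

```python
from itertools import product


def dap_encode(data_bits):
    """Duplicate each data bit on an adjacent wire and append a parity bit."""
    codeword = []
    parity = 0
    for bit in data_bits:
        codeword += [bit, bit]   # original bit and its duplicate
        parity ^= bit            # parity over the data bits
    codeword.append(parity)
    return codeword


def min_distance(encode, k):
    """Exhaustive minimum Hamming distance over all k-bit inputs."""
    words = [encode(list(d)) for d in product([0, 1], repeat=k)]
    return min(
        sum(a != b for a, b in zip(u, v))
        for i, u in enumerate(words)
        for v in words[i + 1:]
    )


print(dap_encode([1, 0, 1]))
print(len(dap_encode([0] * 64)))    # codeword length for 64-bit data
print(min_distance(dap_encode, 4))  # minimum distance for k = 4
```

For 64-bit data the codeword has 129 bits, which corresponds to DAP(129,64), and
the exhaustive check for k = 4 returns a minimum distance of three.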
BSC codes can correct single errors and reduce capacitive coupling effects
simultaneously. In a BSC code, no two adjacent bits switch in opposite
directions simultaneously. The worst-case coupling capacitance of a bus wire
used to transmit a BSC codeword is equal to 2Cc, which is smaller than the
worst-case coupling capacitance of a standard bus wire.
CAMEC codes are constructed based on an idea similar to that of DAP codes. An
example of CAMEC codes constructed using a Hamming code is presented in this
chapter. In this example, the input data is first encoded using a Hamming code.
The outputs of the Hamming encoder are duplicated. An overall parity check bit
calculated from the output of the Hamming encoder is added to the whole
codeword. The decoding process of CAMEC codes is more complex than that of
conventional Hamming codes. Two different decoding algorithms are discussed.
CAMEC codes can be used to correct multiple errors and reduce the capacitive
crosstalk induced delay uncertainty at the same time.
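The three construction steps above (Hamming encoding, duplication, overall
parity) can be illustrated on a toy scale. The Python sketch below uses a
systematic (7,4) Hamming code purely for brevity; the choice of this inner code,
the bit ordering, and the function names are our assumptions, and neither of the
two decoding algorithms is shown.

```python
from itertools import product


def hamming74_encode(d):
    """Systematic (7,4) Hamming encoder: 4 data bits followed by 3 parity bits."""
    d1, d2, d3, d4 = d
    return [d1, d2, d3, d4, d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4]


def camec_like_encode(d):
    """Hamming-encode, duplicate every bit, then append the overall parity."""
    inner = hamming74_encode(d)
    overall_parity = sum(inner) % 2   # parity of the Hamming codeword
    out = []
    for bit in inner:
        out += [bit, bit]             # duplicate placed on the adjacent wire
    out.append(overall_parity)
    return out


def min_distance(encode, k):
    """Exhaustive minimum Hamming distance over all k-bit inputs."""
    words = [encode(list(d)) for d in product([0, 1], repeat=k)]
    return min(
        sum(a != b for a, b in zip(u, v))
        for i, u in enumerate(words)
        for v in words[i + 1:]
    )


print(min_distance(hamming74_encode, 4))
print(min_distance(camec_like_encode, 4))
```

In this toy case the minimum distance grows from three for the inner Hamming
code to seven for the duplicated codeword with overall parity, which is what
allows multiple errors to be corrected.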
A general coding framework that combines ECCs with CACs is used to address both
logic errors and capacitive crosstalk induced delay uncertainty. In this joint
coding method, the input data is first encoded using nonlinear CACs. The outputs
of the nonlinear CACs are encoded using ECCs. The parity bits generated by the
ECC encoder are protected against crosstalk coupling using techniques such as
shielding and duplication. Three commonly used CACs (FOC codes, FTC codes, and
FPC codes) are discussed in this chapter. The combination of ECCs with CACs can
achieve a speedup over conventional Hamming codes, which can only correct logic
errors.
Another method of combining ECCs with skewed transitions is also discussed in
this chapter. In this method, the inherent skew resulting from the ECC parity
generation is exploited to ensure that no two adjacent wires switch in opposite
directions simultaneously, thereby reducing worst-case on-chip capacitive coupling.
Instead of waiting for the parity computation to send the original input data and parity
bits to the link at the same time, the original input data is sent before the parity bits are
available. A mapping algorithm is needed to properly map data and parity check bits
to link driver registers, which are triggered by alternating clock phases. Compared to
a conventional skewed transition approach, the combination of ECCs with skewed
transitions hides the delay element insertion overhead in the parity calculation
latency; thus, a large skewed delay is allowed in this method without affecting the
overall link delay. The larger delay offset in this method further reduces the effects of
capacitive coupling. Compared to other solutions that simultaneously handle logic
errors and delay uncertainty, the combination of ECCs with skewed transitions
requires fewer wires, resulting in smaller link area and energy consumption.

References

1. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework.
IEEE Trans Very Large Scale Integr (VLSI) Syst 12:655–667
2. Rossi D, Metra C, Nieuwland KA, Atul K (2005) Exploiting ECC redundancy to minimize
crosstalk impact. IEEE Des Test Comput 22:59–70
3. Rossi D, Metra C, Nieuwland KA, Atul K (2005) New ECC for crosstalk impact minimization.
IEEE Des Test Comput 22:340–348
4. Patel KN, Markov IL (2004) Error-correction and crosstalk avoidance in DSM busses. IEEE
Trans Very Large Scale Integr (VLSI) Syst 12:1076–1080
5. Ganguly A, Pande PP, Belzer B, Grecu C (2008) Design of low power & reliable networks on
chip through joint crosstalk avoidance and multiple error correction coding. J Electron Testing
Theory Appl (JETTA), 67–81, Special Issue on Defect and Fault Tolerance
6. Ganguly A, Pande PP, Belzer B (2009) Crosstalk-aware channel coding schemes for energy
efficient and reliable NOC interconnects. IEEE Trans Very Large Scale Integr (VLSI) Syst
17:1626–1639
7. Sridhara S, Shanbhag RN (2007) Coding for reliable on-chip buses: a class of fundamental
bounds and practical codes. IEEE Trans Comput-Aided Des Integr Circuits Syst 5:977–982
8. Sridhara S, Ahmed A, Shanbhag RN (2004) Area and energy-efficient crosstalk avoidance
codes for on-chip busses. In: Proceedings international conference on computer design
(ICCD), pp 12–17
9. Duan C, Tirumala A, Khatri SP (2001) Analysis and avoidance of crosstalk in on-chip
buses. In: Proceedings of the international conference on hot interconnects, pp 133–138
10. Victor B, Keutzer K (2001) Bus encoding to prevent crosstalk delay. In: Proceedings IEEE/
ACM international conference on computer-aided design (ICCAD), pp 57–63
11. Sridhara RS, Shanbhag RN (2005) Coding for reliable on-chip buses: fundamental limits and
practical codes. In: Proceedings VLSI design, pp 417–422
12. Hirose K, Yassura H (2000) A bus delay reduction technique considering crosstalk. In:
Proceedings of design, automation and test in Europe (DATE), pp 441–445
13. Nose K, Sakurai T (2001) Two schemes to reduce interconnect delay in bi-directional and
unidirectional buses. In: Proceedings of VLSI symposium, pp 193–194
14. Fu B, Ampadu P (2010) Exploiting parity computation latency for on-chip crosstalk reduction.
IEEE Trans Circuits Syst II: Express Briefs 57:399–403
15. Arizona State University Predictive Technology Model [Online] Available: http://ptm.asu.edu/
16. Akl CJ, Bayoumi MA (2008) Reducing interconnect delay uncertainty via hybrid polarity
repeater insertion. IEEE Trans Very Large Scale Integr (VLSI) Syst 16:1230–1239
List of Symbols

ΔT Delay inserted in the skewed transition method
Φj Minimal polynomial of α^j
Ω(x) Error magnitude polynomial
α Primitive element in a Galois field
ε Probability of a single wire being erroneous
λ Ratio of Cct to Cgt
σ(x) Error locator polynomial
σN Standard deviation of noise voltage
τ0 Wire delay when no crosstalk is considered
ACK/NACK Acknowledge/negative acknowledge
ARQ Automatic repeat request
BCH Bose-Chaudhuri-Hocquenghem
CAC Crosstalk avoidance codes
Cc Coupling capacitance between two adjacent wires
Cct Total coupling capacitance
Ceff Effective capacitance
Cg Ground capacitance
Cg1 Parallel plate capacitance between the parallel surface of a
wire and the substrate or ground
Cg2 Fringing capacitance between the sides of a wire and the
substrate or ground
Cgt Total ground capacitance of a wire
CRC Cyclic redundancy check
DAP Duplicate-add-parity
ECC Error control coding
EMI Electromagnetic interference


FEC Forward error correction


Gk×n Generator matrix for an (n, k) linear block code
HARQ Hybrid automatic repeat request
H(n−k)×n Parity check matrix for an (n, k) linear block code
Ik k-dimensional identity matrix
LCM Least common multiple
P(b), b = 1, 2, 3, ... Error probability of b-bit burst error
Pd Probability that errors can be detected
Pd_c Probability of correctable error
Pd_uc Probability of detectable but uncorrectable error
Pk×(n−k) Parity matrix for an (n, k) linear block code
Pn Probability of single error source causing errors in neighboring
wires
Pne Probability of no error
Presidual Residual flit error rate
Pud Probability of undetectable errors in the first transmission
Qcrit Amount of charge required to induce an error
R Code rate
RS Reed-Solomon
Reffective Effective code rate
Rt Total resistance of a wire
SEC Single-error-correcting
SEC-DED Single-error-correcting and double-error-detecting
SPC Single parity check
Sint Spacing between two wires
Tint Interconnect thickness
Type-I HARQ Type of HARQ in which the same information is retransmitted
Type-II HARQ Type of HARQ in which redundant bits are transmitted
incrementally
VN Normal distribution noise
Vdd Supply voltage
Vswing Link swing voltage
WL Number of wires in the link
Wint Interconnect width
c(x) Polynomial representation of codeword
c1×n n-bit codeword
dmin Minimum Hamming distance
e1×n n-bit error vector
g(x) Generator polynomial
m(x) Polynomial representation of input message
m1×k k-bit input data
reh(k) Number of parity check bits added by the extended Hamming
code for k-bit input data
s1×r r-bit syndrome vector
v1×n Received n-bit codeword
