
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: EXPRESS BRIEFS, VOL. 56, NO. 6, JUNE 2009

A High-Performance and Energy-Efficient TCAM Design for IP-Address Lookup
Yen-Jen Chang, Member, IEEE

Abstract: In this brief, we propose a two-level don't-care gating (DCG) scheme that aims to reduce the ternary content-addressable memory (TCAM) power dissipated in the search-line switching activity. By exploiting the vertically continuous don't-care feature, the two-level DCG scheme can largely reduce the average search-line power consumption during a switch pattern. In addition, we also use the search enable technique to eliminate the unnecessary search-line switching activity in the quiet pattern. By reducing both the search-line switching activity and average switching power, the proposed design can minimize the TCAM search-line power consumption. For a 128 × 32 TCAM, the best configuration we examined shows that when the first-level and second-level gating granularities are 16 and 8, respectively, with a 9% search performance improvement, the two-level DCG scheme can achieve 70% search-line energy reduction.

Index Terms: Forwarding table, low-power design, router, search line (SL), ternary content-addressable memory (TCAM).

I. INTRODUCTION

BECAUSE ternary content-addressable memory (TCAM) has an additional don't-care (or "X") state to perform the wild match, it is widely used in the forwarding table of the network router. However, the power consumption of TCAM is usually considerable due to the parallel comparison feature, in which a large number of transistors and wires are active on each lookup. As revealed in [1], there are three major power consumers in TCAM, including the clock and control, match lines (MLs), and search lines (SLs). The power dissipated by the former two components has effectively been reduced [2], [3], but the SLs still contribute 54%-82% to the total power consumption [1]. The SL power consumption can be reduced by using the segmented [1] or hierarchical SL scheme [4]-[6] and minimizing the switching activity [7], [8]. However, they suffer from either performance penalty or complex control circuitry. In contrast, this brief presents a low-power TCAM design that consists of the two-level don't-care gating (DCG) scheme and the search enable (SE) technique. Without performance penalty and complex control circuitry, our design can largely reduce the TCAM power dissipated in the SL switching activity.
The most pronounced features of the proposed low-power TCAM design are summarized as follows: 1) The SE technique can completely eliminate the unnecessary SL switching activity in the quiet pattern, but it will result in a performance penalty. 2) Based on the vertically continuous "X" feature, the two-level DCG scheme uses the additional gating nodes to conditionally prevent the search data from being broadcast over the entire SL. By decreasing the SL effective capacitance, the two-level DCG scheme can largely reduce the average power dissipated in the SL switches. 3) Instead of the simple transmission gates (TGs), the two-level DCG scheme uses the inverter chain circuitry to accelerate the transmission of search data. Because such speedup will compensate for the performance loss incurred by the SE technique, our design can achieve comparable or even better performance than the conventional TCAM design.

Manuscript received November 4, 2008; revised January 23, 2009. Current version published June 17, 2009. This paper was recommended by Associate Editor T. Zhang.
The author is with the Department of Computer Science and Engineering, National Chung Hsing University, Taichung 402, Taiwan (e-mail: ychang@cs.nchu.edu.tw).
Digital Object Identifier 10.1109/TCSII.2009.2020935

The proposed TCAM design with a size of 128 × 32 was implemented with the TSMC 0.18-μm technology, and all the results were measured from HSPICE simulation. By examining all possible configurations, the results show that our design can achieve the best energy efficiency when the first-level (L1) and second-level (L2) gating granularities are 16 and 8, respectively. Compared to the conventional NOR-type TCAM, our design not only reduces the average SL energy consumption by 63%-70% but also improves the search performance by 9%.
The rest of this brief is organized as follows. Section II reviews the conventional NOR-type TCAM design and the previous work on TCAM power reduction. Section III describes the circuitry developed for the two-level DCG scheme in detail. In addition to the discussion of the important issues, the comparison between our design and the related work is also provided. The measurement results and analysis are given in Section IV, and Section V offers some brief conclusions.
II. TCAM
As shown in Fig. 1, a typical TCAM cell consists of three major components: 1) an eight-transistor XOR-type content-addressable memory (CAM) cell that not only stores the prefix data but also compares the prefix data with the search data; 2) a six-transistor static random access memory (SRAM) cell that stores the mask bit to indicate whether this TCAM cell is "X" or not; and 3) a control logic that can conditionally pull down the ML. As shown in Fig. 1, this can simply be implemented with two n-type metal-oxide-semiconductor transistors placed in series, which are controlled by the mask bit and the XOR result of the CAM cell, respectively.
A. Search Operation
The search operation is initiated by discharging all SLs and then precharging all MLs to V_DD. In Fig. 1, the discharge of both S and S̄ is to ensure that no short-circuit path exists during the ML precharge. After precharging the ML, the search data and its complement are then applied to S and S̄ to perform the search operation. As shown in Fig. 1, M = 1 means that this TCAM cell is in the "X" state, where it is always a match regardless of the comparison result of the CAM cell. Such comparison is referred to as a wild match. In contrast, if the mask bit is 0, this TCAM cell is in either the "0" or "1" state. In this case, if the search data are equal to the stored data, it is also a match that we refer to as a normal match to distinguish it from the wild match. In Fig. 1, the ML is discharged to 0 only in case of mismatch, in which M = 0 and the XOR result of the CAM cell is 1. Note that the search operations account for almost all the TCAM operations, and the SLs are highly capacitive. Thus, the TCAM power dissipated in the high-frequency SL switching is substantial [1].

Fig. 1. Typical TCAM cell and the corresponding state table.
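The match rule of the search operation can be illustrated with a short behavioral sketch in Python. This is only a logic-level model under simple assumptions, not the transistor-level circuit of Fig. 1: each cell is represented by a mask bit and a stored bit, and the function names `cell_pulls_down` and `word_matches` are hypothetical.

```python
def cell_pulls_down(mask, stored, search):
    """Return True when a cell discharges the ML.

    Per the text, this happens only on a mismatch: the mask bit is 0
    (the cell is not "X") and the XOR of stored and search data is 1.
    """
    return mask == 0 and (stored ^ search) == 1


def word_matches(masks, stored_bits, search_bits):
    """A word matches when no cell pulls the precharged ML down."""
    return not any(
        cell_pulls_down(m, d, s)
        for m, d, s in zip(masks, stored_bits, search_bits)
    )


# Stored ternary word "10XX": the last two cells are masked (M = 1).
masks = [0, 0, 1, 1]
stored = [1, 0, 0, 0]

print(word_matches(masks, stored, [1, 0, 1, 1]))  # normal + wild match
print(word_matches(masks, stored, [0, 0, 1, 1]))  # first bit mismatches
```

In this sketch the first search word matches (two normal matches followed by two wild matches), while the second one mismatches in the first, unmasked cell.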
B. Related Work
Including our previous work [9]-[11], a large number of designs have been proposed to reduce TCAM power consumption, particularly ML power consumption [2], [3]. Because this brief aims at reducing the TCAM SL power, we only focus on the work related to the SL power reduction. In [4], Pagiamtzis and Sheikholeslami introduced a hierarchical SL design in which the SLs are broken into the global SLs (GSLs) and local SLs (LSLs). Instead of directly connecting to the CAM cells, the GSLs with low-swing signals are fed into the LSLs, and then, the local receivers would conditionally amplify the signals to drive a subset of CAM cells. Because only a few LSLs with reduced capacitance would be activated, the CAM power dissipated in the SLs can effectively be reduced.
Based on [4], an alternative don't-care-based hierarchical (DCBH) SL design [6] was proposed. The DCBH scheme uses the "X" bit stored in the bottom word of each block to control whether the search data on GSL should be broadcast to the corresponding LSL. The GSLs are active every cycle, but the LSLs are active only when the bottom word is not "X". The only difference between [4] and [6] is the LSL control strategy.
In [1], the segmented SL design was introduced, which uses the segmentation cell (SC) to segment the SL. The SC consists of a dummy cell and a path-control switch. In particular, the dummy cell is an extra SRAM cell to probe the continuous "X" information. When the TCAM cell above the SC stores the "X" bit, the SC dummy cell will keep 0 to turn off the path-control switch to block the search data from being further broadcast. Because the effective capacitance of SL is decreased, the SL power consumption can be reduced.

Fig. 2. (a) Search data example of five consecutive 0s. Case A/B is the TCAM design without/with SE scheme. (b) Traditional SE scheme.

III. TWO-LEVEL DCG SCHEME
A. SE Technique
If we only consider two consecutive search data on a single bit, there are four search patterns, i.e., "0 → 0", "1 → 1", "0 → 1", and "1 → 0". Because they have no data transition, the "0 → 0" and "1 → 1" patterns are classified as quiet patterns. In contrast, "0 → 1" and "1 → 0" are classified as switch patterns. In the conventional TCAM design, because the ML has to be charged to high during the precharge phase, both S and S̄ must be discharged to 0 to avoid a possible short circuit. However, such discharge introduces unnecessary SL switching activity. For example, Fig. 2(a) shows a search data pattern of five consecutive 0s. In case A, i.e., the conventional TCAM design, the gray block contains the values of S and S̄ during the precharge phase. Clearly, the number of energy-consuming transitions (N_{0→1}) on SL is four, but they are all unnecessary switching activities.
As illustrated in Fig. 2(b), a straightforward solution to this weakness is the introduction of an additional transistor, i.e., N3, that is used to disconnect the pull-down path during the ML precharge and then enable the search operation. It is referred to as the SE technique, whose effect can be observed in case B shown in Fig. 2(a), where the unnecessary SL switches are all eliminated. Compared to the traditional TCAM without SE, whose pull-down path is only N1 and N2, the use of SE will result in an increase of one transistor in the length of the pull-down path. Based on our simulation, the optimal N3 width is three times the N1 (or N2) width, where the performance penalty is about 4.2%. Fortunately, this performance penalty can be compensated by the two-level DCG scheme proposed in the following section.
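The transition count in the five-consecutive-0s example can be reproduced with a small Python sketch. It is an idealized waveform model under stated assumptions: the search bit drives S and its complement drives S̄, the conventional design inserts a precharge phase with both lines at 0 between consecutive searches, and the SE design simply holds the previous values. The function names are hypothetical.

```python
def sl_waveforms(bits, search_enable):
    """Build idealized S and S-bar waveforms for a sequence of search bits."""
    s, s_bar = [], []
    for i, b in enumerate(bits):
        if i > 0 and not search_enable:
            # Conventional design: both lines are discharged to 0
            # during the ML precharge between two searches.
            s.append(0)
            s_bar.append(0)
        s.append(b)
        s_bar.append(1 - b)
    return s, s_bar


def count_rising(wave):
    """Count energy-consuming 0 -> 1 transitions in one waveform."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a == 0 and b == 1)


def rising_transitions(bits, search_enable):
    """Total 0 -> 1 transitions on the S / S-bar pair."""
    s, s_bar = sl_waveforms(bits, search_enable)
    return count_rising(s) + count_rising(s_bar)


five_zeros = [0] * 5
print(rising_transitions(five_zeros, search_enable=False))  # case A
print(rising_transitions(five_zeros, search_enable=True))   # case B
```

Under these assumptions the model gives four 0 → 1 transitions for case A (all on S̄) and none for case B, which agrees with the count stated for Fig. 2(a).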
B. L1 DCG
Fig. 3 shows the general configuration of a TCAM array with N prefixes. The function of the routing table lookup is to find the longest one among all the prefixes that match the incoming packet's destination address. This is commonly known as the longest prefix match (LPM) problem [12]. To simply resolve the LPM problem, the stored prefixes are sorted by decreasing length. Clearly, if the ith cell is "X", then all the cells below the ith cell must be "X". It is referred to as the vertically continuous don't-care feature. For example, consider the gray column shown in Fig. 3. Because the sixth cell is "X", the sixth through Nth cells are all "X". This scenario implies that the comparisons between the search data and the prefix data stored in the sixth through Nth cells are redundant. If we can gate the search data at the sixth cell, only the capacitances before the sixth cell need to be charged/discharged. The capacitance that actually needs to be charged is referred to as the effective capacitance. In this example, due to the decrease of effective capacitance, the SL power consumption can be reduced.

Fig. 3. General configuration of a prefix array.

To gate the search data from being broadcast over the entire SL, our design inserts the gating nodes to break the entire SL into several segments. As shown in Fig. 4, the L1 gating node (GN_L1) is implemented as an inverter that is controlled by the corresponding mask bit. For example, a GN_L1 is located in the ith cell. When the ith cell is "X", M_i = 1 will cut off both the power and ground sources to disable the inverter from transmitting the search data. Otherwise, the inverter will take effect in the case of M_i = 0. Note that the search data are inverted after the gating node.

Fig. 4. L1 gating node (GN_L1) implementation.

The granularity of L1 DCG (G_L1) is defined as the number of cells between two L1 gating nodes. It is critical to both search performance and power saving. If G_L1 is too large, the probability that the first cell of a segment is "X" is low, such that the amount of effective capacitance reduction is insignificant. In contrast, a small G_L1 can largely reduce the effective capacitance but suffers from a serious propagation delay. Therefore, an adequate G_L1 that benefits both power saving and search performance is very critical to our design. The detailed analysis will be provided in the experimental results.
C. L2 DCG

Fig. 5. L2 DCG example, where the granularity (G_L2) is 4, and GN_L2 is the L2 gating node.

The L1 DCG is beneficial to reduce the SL power only when the first TCAM cell is "X" in a segment. If the first cell is not "X", this segment would be driven to perform the search operation even though the remaining cells are all "X". Because decreasing G_L1 will worsen the search performance, to improve the power efficiency of the L1 DCG, we propose an L2 gating scheme to further exploit the vertically continuous "X" property within an L1 segment.
Fig. 5 shows an L2 gating example, in which the L2 granularity (G_L2) is 4. For example, if G_L1 and G_L2 are 16 and 4, respectively, then each L1 segment is divided into four L2 segments, each containing four TCAM cells. Similar to GN_L1, the function of the L2 gating node (GN_L2) is to either gate or transmit the search data from the L1 segment, which depends on whether the first cell is "X" or not. This requirement can easily be realized by a TG, but it has a vital defect: when the input search data are gated, the Q and Q̄ nodes are floating, such that the N2 transistor (shown in Fig. 1) is likely to be turned on to cause an unexpected pull-down path between the ML and the ground, i.e., a false mismatch.
Instead of the TG, GN_L2 is implemented as in Fig. 5, which is controlled by the mask value of the first cell (M_0) in the corresponding L2 segment: 1) If the first cell is not "X", then M_0 = 0 will turn on the P1 transistor to activate GN_L2 as an inverter. Assume that the search data from the L1 segment are S. GN_L2 will drive the Q node to the complement of S to perform the search operation. 2) In the other case, where the first cell is "X", M_0 = 1 will cut off the V_DD path and turn on the N2 transistor to force the Q node to be 0 regardless of the input search data. Thus, all side effects incurred by the floating Q and Q̄ nodes can completely be eliminated.
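To make the combined effect of the two gating levels concrete, the following Python sketch models the logic behavior of GN_L2 and counts how many cells in one column still receive the search data for a given number of continuous "X" cells. It is a simplified model rather than the circuit of this brief: it assumes that a segment is driven exactly when its first cell is not "X", that G_L2 divides G_L1 and G_L1 divides the number of rows, and it ignores the capacitance added by the gating nodes. All function and parameter names are hypothetical.

```python
import math


def gn_l2_output(m0, s_in):
    """Logic-level view of the L2 gating node.

    m0 = 0: the node acts as an inverter of the incoming search data.
    m0 = 1: the output is forced to 0 regardless of the input, so the
            node behind it never floats.
    """
    return 0 if m0 == 1 else 1 - s_in


def driven_cells(num_x, n_rows=128, g_l1=16, g_l2=8):
    """Return (cells driven with L1 gating only, with L1 + L2 gating).

    Because the "X" cells are vertically continuous, the top
    (n_rows - num_x) cells are non-"X" and every segment whose first
    cell lies in that region is driven in full.
    """
    non_x = n_rows - num_x
    l1_only = min(n_rows, math.ceil(non_x / g_l1) * g_l1)
    two_level = min(n_rows, math.ceil(non_x / g_l2) * g_l2)
    return l1_only, two_level


# Column with 5 non-"X" cells on top and 123 continuous "X" cells below.
print(driven_cells(num_x=123))
# No "X" cell at all: nothing can be gated.
print(driven_cells(num_x=0))
```

For the first column, this model drives 16 cells with L1 gating alone and 8 cells with both levels, which illustrates why a finer second level recovers savings that a coarse G_L1 leaves on the table.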
D. Compared to Related Work
Similar to the related work [1], [6], our design also exploits the vertically continuous "X" feature to reduce the TCAM SL power, but the strategy and implementation are totally different. The major overheads in [6] come from the low-swing receivers that are responsible for translating a low-swing GSL signal to a full-swing LSL signal. In addition to the complex circuitry, the overheads also include the leakage power penalty and the increased propagation delay. Because these overheads increase with the number of low-swing receivers, the large block size is preferable, such that the effect of power saving is reduced. In contrast, the gating nodes used in the two-level DCG are very simple, so the above drawbacks are absent in our design.
Compared to the segmented SL design [1], instead of using an extra SRAM cell, our design uses the stored "X" bit to directly control the broadcast of search data. Note that the path-control switch used in SC [1] is implemented with the TG that has no power source. Thus, the propagation delay of search data is increased. In contrast, because the gating node used in the two-level DCG is implemented with the controllable inverter that has a power source, our design does not incur any performance loss. On the contrary, these gating nodes will realize an inverter chain that can facilitate the search data transmission, i.e., the search performance can be improved.
IV. EXPERIMENTAL RESULTS
In this brief, we use TSMC 0.18-μm technology with 1.8-V supply voltage to implement two IPv4 routing tables. Both of them have a size of 128 × 32, i.e., 128 entries by 32 bits. One is implemented with the proposed SE and two-level DCG schemes, and the other is a conventional NOR-type TCAM used for comparison.
A. Search Performance
The metric used to evaluate the TCAM search performance
is the match delay (MD). Because our design involves the
modifications of both the SL architecture and the pull-down
path of ML, the MD is defined as the elapsed time from the
search data applied to the SL to the ML discharged to 0 in case
of a mismatch. Throughout this brief, the MD of our design is
always the worst-case delay measured from the mismatch of a
TCAM word located in the last L1 segment.
Fig. 6 shows the MD for both the conventional and our TCAM designs, where X-Y means that the L1 and L2 granularities are X and Y, respectively. In our simulation, G_L1 is constrained within the range from 8 to 32, and a total of nine reasonable two-level DCG configurations are evaluated. They are 8-2, 8-4, 16-2, 16-4, 16-8, 32-2, 32-4, 32-8, and 32-16. Due to having no segmentation, the MD of the conventional TCAM (MD_Conv) is fixed at 1.608 ns.
In Fig. 6, there are two cases in which MD_DCG is always worse than MD_Conv. The first case is that G_L1 is less than or equal to 8; the second case is that G_L2 is less than or equal to 2. In contrast, MD_DCG is even shorter than MD_Conv in the cases of 16-4, 16-8, 32-4, 32-8, and 32-16. This is because our design decouples most TCAM cells from the SL, such that the SL capacitance can largely be reduced. According to the RC delay model, the SL propagation delay will decrease with the SL capacitance. In addition, the inverter chain introduced by the L1 DCG scheme can also reduce the SL propagation delay. Therefore, the performance gains from the two-level DCG scheme will cover the performance penalty incurred by the SE technique. Based on this result, only five configurations, i.e., 16-4, 16-8, 32-4, 32-8, and 32-16, are evaluated in the following discussion of energy reduction.

Fig. 6. Match delay for both the conventional and two-level DCG TCAM designs, where X-Y means that G_L1 = X and G_L2 = Y.

Fig. 7. Column power dissipated in the switch pattern for both the conventional and two-level DCG TCAM designs.
B. Average SL Energy Reduction
Because the power consumption of a quiet pattern is hardly noticeable in our design, Fig. 7 only shows the column power dissipated in a switch pattern, in which the continuous "X" number is varied from 0 to 128, e.g., X = 32 means that the continuous "X" number is 32. For a clear demonstration, only the 16-8, 32-4, and 32-8 configurations are displayed. Because the two-level DCG breaks the entire SL into several segments, its column power shows a step wave. The power data are summarized in Table I, where the worst and best values are for the X = 0 and X = 128 cases, respectively. In particular, the average value is obtained by averaging the results of all 129 cases, i.e., X = 0-128, assuming that every case has the same occurrence probability.

TABLE I
COLUMN POWER CONSUMPTION (IN WATTS) FOR BOTH THE CONVENTIONAL AND TWO-LEVEL DCG TCAM DESIGNS. CONFIGURATION 16-8 IS THE BEST TO REDUCE THE SWITCH POWER CONSUMPTION

TABLE II
COLUMN ENERGY CONSUMPTION FOR BOTH THE CONVENTIONAL AND TWO-LEVEL DCG TCAM DESIGNS

In Table I, the key observations are as follows: 1) There are two features about the conventional TCAM. First, due to the need for presetting both S and S̄ to 0, the power consumption of the quiet pattern is almost equal to that of the switch pattern. Second, its column power is independent of the continuous "X" number. As shown in Table I, they are always 3.753E-05 and 3.776E-05 W for quiet and switch patterns, respectively. 2) Due to having no SL switch, in the quiet pattern, our design consumes almost no power compared to the conventional TCAM, and the difference among the three configurations is hardly noticeable. 3) In the worst case of the switch pattern, because no "X" cell is available to help our design reduce the SL power, the additional L1 and L2 gating nodes will result in an absolute power penalty. Consequently, for all configurations, the worst switch power must be larger than that of the conventional TCAM. 4) Clearly, for the switch pattern, the best configuration is 16-8, in which our design incurs the least power penalty, i.e., 13%, in the worst case, while achieving the largest power reduction, i.e., 35%, in the average case.
For a fair comparison, the evaluation metric is the energy,
which is the product of the MD and the search power. Thus,
the column energy consumption is summarized in Table II, in
which only the results of the average case are presented. In
Table II, the best configuration is still 16-8, which can achieve
70% average SL energy reduction compared to the conventional
NOR-type TCAM design.
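As a rough plausibility check of the energy metric, the Python sketch below combines the figures quoted in the text for the 16-8 configuration: the conventional column powers, the 1.608-ns conventional MD, the 35% average switch-power reduction, and the 9% MD improvement. The weighting is an assumption made here, not stated in this brief: quiet and switch patterns are taken as equally likely and the quiet-pattern power of the DCG design is approximated as zero. The percentages are rounded values from the text, so the result is only indicative and does not reproduce the entries of Table II.

```python
# Values quoted in the text.
MD_CONV_NS = 1.608            # conventional match delay
P_CONV_QUIET_W = 3.753e-05    # conventional column power, quiet pattern
P_CONV_SWITCH_W = 3.776e-05   # conventional column power, switch pattern
SWITCH_POWER_REDUCTION = 0.35 # 16-8, average case (rounded)
MD_IMPROVEMENT = 0.09         # 16-8 (rounded)


def average_energy_reduction(p_quiet_dcg_w=0.0, switch_probability=0.5):
    """Estimate the relative energy saving, with energy = MD x power.

    p_quiet_dcg_w and switch_probability are assumptions of this sketch.
    """
    p_switch_dcg = P_CONV_SWITCH_W * (1 - SWITCH_POWER_REDUCTION)
    md_dcg = MD_CONV_NS * (1 - MD_IMPROVEMENT)

    quiet_probability = 1 - switch_probability
    p_conv = (switch_probability * P_CONV_SWITCH_W
              + quiet_probability * P_CONV_QUIET_W)
    p_dcg = (switch_probability * p_switch_dcg
             + quiet_probability * p_quiet_dcg_w)

    e_conv = p_conv * MD_CONV_NS
    e_dcg = p_dcg * md_dcg
    return 1 - e_dcg / e_conv


print(f"{average_energy_reduction():.1%}")
```

Under these assumptions the estimate comes out at roughly 70%, which is in line with the reported average SL energy reduction for the 16-8 configuration.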

C. Area Overhead and Leakage Power Penalty
Our design can effectively reduce the SL energy consumption, but the additional SE control and L1 and L2 gating nodes will increase the transistor count. Because both the area and the leakage power consumption are proportional to the transistor count, our design will result in a 5.8% area overhead and 1.6% leakage power penalty. In addition, the two-level DCG design also increases the mask node capacitance of the first cell of L1 and L2 segments, and the read/write performance loss is about 0.2%.

V. CONCLUSION
In this brief, we have proposed a low-power TCAM design that not only reduces the SL power consumption but also improves the search performance. We have first used the SE technique to eliminate all unnecessary SL switches in the quiet pattern and then used the proposed two-level DCG scheme to reduce the SL switch power in the switch pattern. By using the vertically continuous don't-care feature, our design can achieve at least 63% reduction in average SL energy consumption with a 5.8% area overhead.

REFERENCES
[1] J. S. Wang, C. C. Wang, and C. Yeh, "TCAM for IP-address lookup using tree-style AND-type match lines and segmented search lines," in Int. Solid-State Circuits Conf., 2006, pp. 577-586.
[2] N. Mohan and M. Sachdev, "Low-capacitance and charge-shared match-lines for low-energy high-performance TCAMs," IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 2054-2060, Sep. 2007.
[3] N. Mohan, W. Fung, D. Wright, and M. Sachdev, "Match line sense amplifiers with positive feedback for low-power content addressable memories," in IEEE Custom Integr. Circuits Conf., 2006, pp. 297-300.
[4] K. Pagiamtzis and A. Sheikholeslami, "A low power content-addressable memory (CAM) using pipelined hierarchical search scheme," IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1512-1519, Sep. 2004.
[5] H. Noda, K. Inoue, M. Kuroiwa, F. Igaue, K. Yamamoto, H. J. Mattausch, T. Koide, A. Amo, A. Hachisuka, S. Soeda, I. Hayashi, F. Morishita, K. Dosaka, K. Arimoto, K. Fujishima, K. Anami, and T. Yoshihara, "A cost-efficient high-performance dynamic TCAM with pipelined hierarchical searching and shift redundancy architecture," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 245-253, Jan. 2005.
[6] P. T. Huang, S. W. Chang, W. Y. Liu, and W. Hwang, "A 256×128 energy-efficient TCAM with novel low power schemes," in Proc. Int. Symp. VLSI-DAT, 2007, pp. 1-4.
[7] I. Arsovski, T. Chandler, and A. Sheikholeslami, "A ternary content addressable memory (TCAM) based on 4T static storage and including a current-race sensing scheme," IEEE J. Solid-State Circuits, vol. 38, no. 1, pp. 155-158, Jan. 2003.
[8] I. Arsovski and R. Nadkarni, "Low-noise embedded CAM with reduced slew-rate match-lines and asynchronous search-lines," in IEEE Custom Integr. Circuits Conf., 2005, pp. 447-450.
[9] Y.-J. Chang, "Two-layer hierarchical matching method for energy-efficient CAM design," Electron. Lett., vol. 43, no. 2, pp. 80-82, Jan. 2007.
[10] Y.-J. Chang, Y.-H. Liao, and S.-J. Ruan, "Improve CAM power efficiency using decoupled match line scheme," in IEEE/ACM DATE, Apr. 16-20, 2007, pp. 1-6.
[11] Y.-J. Chang and Y.-H. Liao, "Hybrid-type CAM design for both power and performance efficiency," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 965-974, Aug. 2008.
[12] D. Shah and P. Gupta, "Fast updating algorithms for TCAMs," IEEE Micro, vol. 21, no. 1, pp. 36-47, Jan./Feb. 2001.
