IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 40, NO. 8, AUGUST 2005

A Low-Power CAM Using Pulsed NAND-NOR Match-Line and Charge-Recycling Search-Line Driver
Byung-Do Yang and Lee-Sup Kim
Abstract: This paper proposes a low-power CAM using a pulsed NAND-NOR match-line and a charge-recycling search-line driver. The pulsed NAND-NOR match-line not only significantly reduces the match-line power by activating only a few match-lines, using NAND cells for several bits, but also achieves high speed by using NOR cells for most bits. The charge-recycling search-line driver reduces the search-line power by recycling the charge of the search-lines without precharging. The CAM chip with 128 × 32 bit is fabricated in a 0.25-µm CMOS process with 2.5 V. It dissipates 17.2 fJ/bit/search and consumes 31% of the power of the dynamic NOR-type CAM.

Index Terms: CAM, charge recycling, low power, match-line (ML), search-line (SL).

I. INTRODUCTION

CONTENT-ADDRESSABLE memory (CAM) provides a fast data search function. It compares search data with all stored data in parallel and then returns the address at which the matching data is found. CAM is used in a wide range of applications such as lookup tables, databases, associative computing, and data compression. However, it consumes considerable power due to its fully parallel comparison. In the search operation, a large amount of circuitry is active and consumes power. Moreover, a CAM consumes more power as its size increases because its power consumption is proportional to its memory size.

Fig. 1 shows the simplified CAM architecture, consisting of an array of memory cells, a search word register, a word match circuit, and an address encoder. Each row of the array stores a word and has a match-line (ML). A search word is supplied on search-lines (SLs). The CAM compares the search data with all memory cells and identifies a matching word. As a result of this parallel comparison, the voltages of the MLs change. The major portion of CAM power is consumed in the highly capacitive MLs and SLs, which are charged and discharged in every cycle [1]-[4].

To reduce the power consumption of CAMs, several techniques have been proposed. The NAND-type CAM consumes the least power but it is the slowest because only a high-precharged ML is discharged through many transistors in series [1], [3]. In contrast, the NOR-type CAM is the fastest but it dissipates the largest

Fig. 1. Simplified CAM architecture.

Manuscript received December 12, 2004; revised March 25, 2005. This work was supported by KOSEF through the MICROS at KAIST, Republic of Korea. The authors are with the Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 305-701, Korea (e-mail: bdyang@mvlsi.kaist.ac.kr; lskim@ee.kaist.ac.kr). Digital Object Identifier 10.1109/JSSC.2005.852028

power because all high-precharged MLs except one are discharged through many transistors in parallel [2], [3].

To achieve both low power and high speed, several techniques have been developed based on the NOR-type CAM. Miyatake's technique limits the swing voltage of MLs to reduce their power consumption [4]. Lin's technique reduces the number of activated MLs by using a precomputation-based CAM [5]. Arsovski's technique reduces the static power consumption of MLs by allocating less power to the mismatched MLs [6]. It also reduces the SL power by minimizing the switching activity of SLs. Choi's technique significantly saves ML power by using a hierarchical search composed of a small NOR-type main-bank for coarse search and several large NAND-type sub-banks for fine search [7]. However, it cannot fully use the memory cells because the data stored in the same sub-bank must have the same bits in the main-bank.

Without loss of memory utilization, the proposed pulsed NAND-NOR CAM (PNN-CAM) reduces the power consumed in both MLs and SLs by using the pulsed NAND-NOR match-line (PNN-ML) and the charge-recycling search-line driver (CRSLD), respectively. The PNN-ML not only significantly reduces the ML power by activating only a few MLs by using

0018-9200/$20.00 © 2005 IEEE

YANG AND KIM: LOW-POWER CAM USING PULSED NAND-NOR MATCH-LINE AND CHARGE-RECYCLING SEARCH-LINE DRIVER

Fig. 2. Pulsed NAND-NOR ML architecture.

Fig. 3. (a) NAND-ML. (b) NOR-ML.

the NAND cells for several bits but also achieves high speed by using the NOR cells for most bits. The CRSLD reduces the SL power by recycling the charge of SLs without the SL precharge.

The organization of this paper is as follows. In Section II, we propose the PNN-CAM using the PNN-ML and the CRSLD. In Section III, we present performance comparisons and show test results of the fabricated chip. This paper ends with the conclusion in Section IV.

II. ARCHITECTURE

A. Pulsed NAND-NOR Match-Line Scheme

Fig. 2 shows the pulsed NAND-NOR match-line (PNN-ML) architecture. The CAM has PNN-MLs and a replica PNN-ML. Each PNN-ML consists of NAND cells and - NOR cells. The PNN-ML utilizes the advantages of both NAND- and NOR-type

Fig. 4. Activations in NAND-MLs and NOR-MLs.

CAMs. The NAND-ML in the NAND-type CAM consumes the least power but it is the slowest. The NOR-ML in the NOR-type CAM is the fastest but it dissipates the largest power.

Fig. 3(a) shows the NAND-ML with NAND cells. The average capacitance of the NAND-ML is where the matching probability of each bit is and is drain capacitance of transistors [3]. Its swing voltage is due to the voltage drop of series NMOS transistors whose gate voltage is . The effective capacitance of NAND-ML is where the degradation ratio is .


Fig. 5. Power consumption in mismatched MLs (a) without the replica PNN-ML and (b) with the replica PNN-ML.

The capacitance of ML-precharge transistors is where is gate capacitance of transistors. Therefore, the NAND-ML consumes power . Although the NAND-ML consumes the least power, it is slow because the high-precharged ML is discharged through transistors in series.

Fig. 3(b) shows the NOR-ML with NOR cells [2]. During the precharge time, all SLs are discharged to ground. Each cell turns on an NMOS transistor connected to the ML. Four drain capacitances of transistors per bit are connected to the ML. The capacitance of the NOR-ML is . The capacitance of ML-precharge transistors is . Therefore, the NOR-ML consumes power . Although the NOR-ML is the fastest, it consumes a large amount of power because all high-precharged MLs except one are discharged through some of the transistors in parallel.

The PNN-ML utilizes the advantages of both NAND-MLs and NOR-MLs. In the PNN-ML, only several bits are used for NAND cells and most bits are used for NOR cells. Fig. 4 shows the activations in NAND-MLs and NOR-MLs. All NAND-MLs are activated but they consume a very small amount of power. The NAND cells reduce the number of activated MLs. When all NAND cells are matched, the NAND-ML activates its ML of NOR cells. If the matching probability of each bit is , the matching probability of NAND cells is . The number of activated MLs is reduced from n to , where n is the number of MLs in the CAM. When and , only two MLs are activated on the average. The PNN-ML saves the power by reducing the number of activated MLs with NOR cells.

At most, one of the activated MLs is matched. The matched ML consumes the dynamic power to charge the ML to . Most activated MLs are not matched. The mismatched MLs consume the static power during the time when their NAND cells are matched, as shown in Fig. 5(a). To minimize the static power, the PNN-ML uses the pulsed match-line enable (MLE) signal generated by a replica PNN-ML, as shown in Fig. 5(b) [8].
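The two-stage behavior described above can be illustrated with a small behavioral sketch. It is not the authors' circuit or simulation model: the function names, the choice of the leading bits as the NAND segment, and the parameter name `k` for the number of NAND cells are all assumptions made for illustration, and bit matches are assumed independent. The values n = 128, p = 0.5, and k = 6 in the usage note are likewise illustrative; they are chosen only because they reproduce the "two activated MLs on average" figure quoted in the text.

```python
from typing import List, Optional, Tuple


def expected_activated_mls(n: int, p: float, k: int) -> float:
    """Expected number of NOR segments enabled out of n MLs.

    Assumes each of the k NAND-cell bits matches independently with
    probability p (an assumption, not a statement from the paper).
    """
    return n * (p ** k)


def pnn_search(stored: List[str], key: str, k: int) -> Tuple[Optional[int], int]:
    """Behavioral two-stage match.

    The first k bits of each word play the role of the NAND cells
    (hypothetical bit assignment). Only words whose NAND segment matches
    have their NOR segment evaluated. Returns (match address or None,
    number of NOR segments that were activated).
    """
    activated = 0
    match_addr: Optional[int] = None
    for addr, word in enumerate(stored):
        if word[:k] != key[:k]:
            continue  # NAND segment mismatched: the NOR segment stays idle
        activated += 1  # NAND segment matched: the NOR segment is enabled
        if match_addr is None and word[k:] == key[k:]:
            match_addr = addr
    return match_addr, activated
```

For example, `pnn_search(["0000", "0011", "0110", "1100"], "0011", 2)` evaluates the NOR segment of only the two words that start with `00` and reports the match at address 1, while `expected_activated_mls(128, 0.5, 6)` evaluates to 2.0 under the stated assumptions.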
The replica PNN-ML consists of series NMOS transistors in the replica NAND cells and parallel NMOS transistors in the replica NOR cells, as shown in Fig. 2. When the NAND cell is matched, the gate voltage of series NMOS transistors is . To make the same delay as the ML, is supplied to the gates of NMOS transistors in the replica NAND cells. This voltage is generated by an NMOS transistor, an inverter, and the SL charge-recycling (SLCR) signal. When the NOR cell is matched, it disconnects the ML from ground. Both and ground are supplied to the gates of NMOS transistors in the replica NOR cells to disconnect the replica ML (RML) from ground, as shown in Fig. 2.

Fig. 6. Simulated waveforms of the PNN-ML.

Fig. 6 shows the simulated waveforms of the PNN-ML. At first, the SLs are driven to or ground. The match-line precharge (MLPC) signal is 1 and the MLE is 0. All MLs are discharged to ground by the MLPC. All NAND-MLs are precharged to by the MLE. Then, the MLPC becomes 0 and the MLE becomes 1. The replica NAND-ML is discharged to ground and it supplies current to the RML with a ML-charging PMOS transistor. At the same time, a few matched NAND-MLs are discharged and they supply currents to their MLs. When the RML is higher than , the MLE returns to 0. In practice, the RML rises to by adding the delay element for the reliable operation in MLs. All NAND-MLs are precharged and all ML-charging PMOS transistors are turned off. After that, the RML and matched ML are charged from to by the PMOS transistor keepers in the output drivers. The RML and matched ML consume the dynamic power. All mismatched MLs consume the static power only during the short pulse when the MLE is 1.


TABLE I EFFECTIVE CAPACITANCE COMPARISONS OF MLs

Fig. 7. Architectures of the PNN-ML when (a) m = 32 and (b) m = 144.


The PNN-ML significantly saves the ML power by activating a few MLs and by reducing the static power in the mismatched MLs with the pulse operation. Each PNN-ML consumes power in -bit NAND-ML, -bit ML, and three ML-precharge transistors. The capacitances of NAND-ML and ML-precharge transistors are and , respectively. The capacitances of both RML and matched ML are . The power consumption of the mismatched ML is smaller than that of the RML because the mismatched ML consumes the static power during the time when the RML is charged to instead of . To simplify the power consumption of the mismatched MLs, we approximate the effective capacitance of the mismatched ML as that of the RML. The total number of the RML and all activated MLs is and the capacitance of the MLs is . Therefore, the total capacitance of the RML and all activated MLs is . The effective capacitance per ML is . When , and , it is only . As a result, the PNN-ML consumes power . Table I tabulates the effective capacitances of MLs.

In summary, the PNN-ML reduces the number of activated MLs from to by using NAND cells, and then it saves the static power consumed in the activated MLs by using the pulse operation of the replica PNN-ML. The PNN-ML consumes power in all NAND-MLs, a few MLs, and three ML-precharge transistors per ML. The power of PNN-ML is a little larger than that of the NAND-ML. The delay of PNN-ML is the summation of delays of -bit NAND-ML and -bit NOR-ML. The PNN-ML is a little slower than the NOR-ML due to the delay of the small -bit NAND-ML. Although the NOR-ML is much faster than the NAND-ML, its delay increases proportional to . Therefore, as increases,


Fig. 8. Power and delay comparisons of MLs when (a) n = 128 and m = 32 and (b) n = 512 and m = 144.

Fig. 9. CAM architecture.

the PNN-ML becomes slow. To improve the speed, it utilizes the hierarchical ML in [7]. Fig. 7 shows the architectures of the PNN-ML. When m = 32, the PNN-ML is relatively fast. However, when m = 144, it becomes slow. To improve the speed, the PNN-ML is hierarchically divided into four sub-PNN-MLs with bits. The sub-PNN-ML consists of NAND cells and NOR cells. The search result is hierarchically generated by four sub-PNN-MLs with three AND gates. Its delay is equal to the summation of delays of -bit sub-PNN-ML, two AND gates, and the hierarchical ML wires. Although the hierarchical PNN-ML consumes more power due to four NAND-MLs, 12 ML-precharge transistors, three AND gates, and the hierarchical ML wires, it is much faster.

Fig. 8 shows the power and delay comparisons of MLs. All simulations in this paper are performed in a 0.25-µm CMOS process with V. When n = 128 and m = 32, it consumes 20% more power but it is three times faster than the NAND-ML. Also, it is 19% slower but it consumes 89% less power than the NOR-ML. When n = 512 and m = 144, it consumes 76% more power but it is 23 times faster than the NAND-ML. Also, it is 28% faster and it consumes 90% less power than the NOR-ML. This saves the power of MLs by reducing the number of activated MLs and by using the pulse operation. When , the PNN-ML with the hierarchical ML is faster than the NOR-ML.

Fig. 9 shows the CAM architecture. During search operation, the I/O circuits catch and send the search data to SLs. The SLs are connected to all memory cells. The CAM compares the search data in SLs to the stored data. The 128 × 32 bit memory is divided into four sub-blocks of 32 × 32 bit memory in order


Fig. 10. CRSLD.

to reduce the delay of the SL. When a ML is matched, the address of the matched ML is encoded in the ROM encoder. The encoded search address is sent to the output pad of the chip by the I/O circuits.

B. Charge-Recycling Search-Line Driver

In the PNN-ML, the SLs are not precharged because search data change while all MLs are discharged to ground. The SLs consume power only when the search data change. In general, the transition probability of SL is smaller than one half. The nonprecharged SLs consume less than half of the power of the precharged SLs [6]. The CRSLD further saves the SL power by recycling the charge of SLs.

Fig. 10 shows the CRSLD. Initially, the SLCR is 0. The transistor P1 turns on. The CR is 0. The transmission gates T1 and T2 turn off. Two tri-state drivers D1 and D2 drive the SL pairs. The latch holds the data of SLs. The SLCR becomes 1. The transistor N1 turns on. When the search data change from 0 to 1, the transistors N2 and N3 turn on. When the search data change from 1 to 0, the transistors N4 and N5 turn on. Therefore, the CR becomes 1. T1 turns on and two SLs share their charges. The SLs become . T2 turns on and the latch updates its data. If the search data do not change, the CR remains at 0. The SLCR returns to 0 and the CR becomes 0. T1 and T2 turn off. D1 and D2 drive the SL pairs from to or ground.

Table II shows the power comparison of SLs. The precharged SL driver consumes power where is the capacitance of SL, because the SLs are precharged in every clock cycle. The nonprecharged SL driver consumes power where is the transition probability of SL, because it drives the SLs only when search data change [6]. The nonprecharged SL driver consumes less than half of the power of the precharged SL driver.

Fig. 11 shows waveforms of the CRSLD. When search data change, the CRSLD recycles the charge in SLs and then drives

TABLE II POWER COMPARISONS OF SLs

the SLs by the SLCR signal during search cycles. Therefore, the CRSLD theoretically consumes power which is half of the power of the nonprecharged SLs. The CRSLD theoretically consumes only one-fourth and one-half of the power of the precharged SLs and the nonprecharged SLs, respectively, when the transition probability is one half.

Fig. 12 shows the power comparisons of SLs. A single SL is attached to n cells where n is the number of MLs. When n = 128 and the transition probability is 0.5, the CRSLD saves 60% and 21% power compared to the precharged SL (PC-SL) and the nonprecharged SL (NPC-SL), respectively. Although the SLs of the CRSLD theoretically consume only one-fourth and one-half of the power of the precharged SL and the nonprecharged SL, the power savings of the CRSLD are reduced due to the control overhead. When the transition probability is 0.2, the CRSLD saves 84% and 21% power compared to the precharged SL and the nonprecharged SL. As the transition probability decreases, the CRSLD saves more power.

Fig. 13 shows the power comparisons of SLs. When and , the CRSLD consumes less power than the nonprecharged SL. As n increases, the power of the CRSLD is close to one-fourth and one-half of the precharged SL and the nonprecharged SL, because the power of SLs is proportional to n but the power of drivers is independent of n. When , the power of the CRSLD becomes much smaller. When , the CRSLD consumes less power than the nonprecharged SL. As n increases, the power of the CRSLD is close to one-tenth and one-half of the precharged SL and the nonprecharged


Fig. 11. Waveforms of the CRSLD.

Fig. 12. Power comparisons of SLs when n = 128 and the SL transition probability is (a) 0.5 and (b) 0.2.

Fig. 13. Power comparisons of SLs according to n when the SL transition probability is (a) 0.5 and (b) 0.2.

SL. As n increases and the transition probability decreases, the CRSLD saves more power.
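The theoretical ratios quoted in this subsection can be checked with a short first-order sketch. It is not taken from the paper: the function names are hypothetical, the two search-lines of a pair are assumed to have equal capacitance, the shared voltage is assumed to be half the supply, and driver and control overhead are ignored (the text notes that this overhead reduces the real savings). The supply-energy expression is the usual relation for charging a capacitor from a fixed supply.

```python
def supply_energy(c: float, vdd: float, v_start: float) -> float:
    """Energy drawn from the supply to charge capacitance c from v_start to vdd."""
    return c * vdd * (vdd - v_start)


def shared_voltage(v_a: float, v_b: float) -> float:
    """Voltage after two equal capacitances at v_a and v_b are shorted together."""
    return (v_a + v_b) / 2.0


def normalized_sl_power(alpha: float) -> dict:
    """Theoretical SL power relative to the precharged driver.

    alpha is the SL transition probability. Assumed first-order model:
    the precharged driver recharges a line every cycle (1.0), the
    nonprecharged driver only on a transition (alpha), and the CRSLD
    starts each transition from the shared voltage (alpha / 2).
    """
    return {"precharged": 1.0, "nonprecharged": alpha, "crsld": alpha / 2.0}
```

With an illustrative 1 pF line and a 2.5 V supply, `supply_energy(1e-12, 2.5, 0.0)` is twice `supply_energy(1e-12, 2.5, shared_voltage(2.5, 0.0))`, which is the factor of two attributed to charge recycling. `normalized_sl_power(0.5)` and `normalized_sl_power(0.2)` give CRSLD ratios of one-fourth and one-tenth of the precharged driver, consistent with the limits stated in the text.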

III. PERFORMANCE COMPARISON AND TEST RESULTS

A. Performance Comparisons

Fig. 14 shows the energy and delay comparisons of various CAMs. For a fair comparison, the proposed PNN-CAM, the dynamic NAND-type CAM using the NAND-MLs (NAND-CAM) [3], the dynamic NOR-type CAM using the NOR-MLs (NOR-CAM) [2], the current saving CAM (CS-CAM) [6], and the Hybrid-type CAM (Hybrid-CAM) [7] are simulated in a 0.25-µm CMOS process with V.

The NAND-CAM consumes the least ML power but it is the slowest. The NOR-CAM is the fastest but it consumes the largest ML power because all MLs are precharged to and then discharged to ground. With the speed of the NOR-CAM, the CS-CAM reduces the ML power by supplying large current to the matched ML and small current to the mismatched MLs. Also, the SLs are not precharged. The Hybrid-CAM further reduces the ML and SL power by using the hierarchical search composed of a small NOR-type main-bank and several large NAND-type sub-banks. The result of the main-bank activates only a sub-bank. The main-bank is fast and consumes little power because it is a small NOR-CAM. The selected sub-bank consumes a small amount of power because it uses the NAND-MLs. To improve the speed of the NAND-MLs, we apply the hierarchical ML structures and it inserts many of the ML-repeaters into the NAND-MLs. The Hybrid-CAM consumes the least power in both MLs and SLs among the previous CAMs.


Fig. 15. Chip microphotograph of the PNN-CAM.

Fig. 14. Energy/bit/search and delay comparisons when CAM sizes are (a) 128 × 32 bit and (b) 512 × 144 bit.

Also, it achieves high speed. However, it has a drawback: it cannot fully use the memory cells because the data stored in the same sub-bank must have the same bits in the main-bank.

Fig. 14(a) shows the energy and delay comparisons of CAMs with 128 × 32 bit. The PNN-CAM with 128 × 32 bit consumes 17.2 fJ/bit/search with 3.8-ns search time. It consumes the least power. Without loss of memory utilization, the PNN-CAM reduces the power consumed in both MLs and SLs by using the PNN-ML and CRSLD, respectively. The power saving of the PNN-CAM is only 4% compared to the Hybrid-CAM. However, the PNN-CAM is faster than the Hybrid-CAM, because the PNN-CAM is based on the NOR-ML but the Hybrid-CAM is based on the NAND-ML. The PNN-CAM saves 69% and 50% of the power of the NOR-CAM and CS-CAM, respectively. It is the fastest CAM except the NOR-CAM. It is 5% and 21% faster than the CS-CAM and the Hybrid-CAM, respectively.

Fig. 14(b) shows the energy and delay comparisons of CAMs with 512 × 144 bit. The PNN-CAM with 512 × 144 bit consumes 9.0 fJ/bit/search with 6.1-ns search time. It consumes the least power and it is the fastest. In the large CAM, the energy of the control unit and encoder is negligible compared to that of MLs and SLs. The Hybrid-CAM is faster than the NOR-CAM and CS-CAM because it achieves high speed in a long ML with the hierarchical ML structure. The PNN-CAM further reduces the delay of ML by using both the NOR-ML and the hierarchical ML structure.

Fig. 16. Measured waveforms of the PNN-CAM at 200 MHz.

Therefore, the PNN-CAM is not only the fastest but also consumes the least power.

B. Test Results

The test chip is fabricated in a 0.25-µm CMOS process. Fig. 15 shows the chip microphotograph. The features of the chip are tabulated in Table III. The power is measured at 200 MHz with V. The chip core dissipates 17.2 fJ/bit/search. The chip core area is 0.32 mm². Fig. 16 shows the measured waveforms at 200 MHz. During search cycles, if input data is matched, the corresponding address is generated. In the measured waveforms, when input data is 1, the data is matched and the precharged address is discharged to ground during a half clock cycle. Its maximum operating frequency is 260 MHz at V.
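As a rough cross-check of the energy metric, the reported energy per bit per search can be converted into an average core power. The sketch below is only arithmetic on the quoted figures: the function name is hypothetical, and it assumes one search per clock cycle at the 200 MHz measurement frequency, which the text does not state explicitly. The resulting value is an implied estimate, not a number reported by the authors.

```python
def core_power_watts(energy_fj_per_bit_search: float,
                     words: int,
                     bits_per_word: int,
                     searches_per_second: float) -> float:
    """Average power implied by an energy/bit/search figure.

    Assumes every search compares all words * bits_per_word cells and
    that searches are issued at a constant rate (assumption: one search
    per clock cycle).
    """
    energy_per_search_j = energy_fj_per_bit_search * 1e-15 * words * bits_per_word
    return energy_per_search_j * searches_per_second
```

Under these assumptions, `core_power_watts(17.2, 128, 32, 200e6)` comes to roughly 14 mW for the 128 × 32 bit core.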


TABLE III FEATURES OF THE PNN-CAM CHIP

IV. CONCLUSION

The PNN-CAM is proposed to achieve low power and high speed. The PNN-CAM reduces the ML power by using the PNN-ML. The PNN-ML not only significantly reduces the ML power by activating only a few MLs, using the NAND cells for several bits, but also achieves high speed by using the NOR cells for most bits. To reduce the delay of long MLs, the hierarchical ML is utilized. The PNN-CAM reduces the SL power by using the CRSLD. The CRSLD reduces the SL power by recycling the charge of SLs without the SL precharge. The small PNN-CAM with 128 × 32 bit consumes only 31% of the power, with 19% speed degradation, compared to the dynamic NOR-type CAM. The large PNN-CAM with 512 × 144 bit consumes only 21% of the power with 39% speed improvement. The PNN-CAM chip with 128 × 32 bit is fabricated in a 0.25-µm CMOS process with 2.5 V. The chip core dissipates 17.2 fJ/bit/search. Its area is 0.32 mm². Its maximum operating frequency is 260 MHz.

REFERENCES
[1] F. Shafai et al., "Fully parallel 30-MHz 2.5-Mb CAM," IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1690-1998, Nov. 1998.
[2] P. Lin et al., "A 1-V 128-kb four-set-associative CMOS cache memory using wordline-oriented tag compare (WLOTC) structure with content-addressable memory (CAM) 10-transistor tag cell," IEEE J. Solid-State Circuits, vol. 36, no. 4, pp. 666-676, Apr. 2001.
[3] Y. L. Hsiao et al., "Power modeling and low-power design of content-addressable memories," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 4, 2001, pp. 926-929.
[4] H. Miyatake et al., "A design for high-speed low-power CMOS fully parallel content-addressable memory macros," IEEE J. Solid-State Circuits, vol. 36, no. 6, pp. 956-968, Jun. 2001.
[5] C.-S. Lin et al., "A low power precomputation-based fully parallel content-addressable memory," IEEE J. Solid-State Circuits, vol. 38, no. 4, pp. 654-662, Apr. 2003.
[6] I. Arsovski et al., "A mismatch-dependent power allocation technique for match-line sensing in content-addressable memories," IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 1958-1966, Nov. 2003.
[7] S. Choi et al., "A 0.7 fJ/bit/search, 2.2 ns search time hybrid type TCAM architecture," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2004, pp. 498-499.
[8] I. Arsovski et al., "A ternary content-addressable memory (TCAM) based on 4T static storage and including a current-race sensing scheme," IEEE J. Solid-State Circuits, vol. 38, no. 1, pp. 155-158, Jan. 2003.
[9] K. Pagiamtzis et al., "A low-power content-addressable memory (CAM) using pipelined hierarchical search scheme," IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1512-1519, Sep. 2004.

Byung-Do Yang received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1999, 2001, and 2005, respectively. He joined the Memory Division, Samsung Electronics, Kyungki-Do, Korea, in 2005, where he has been engaged in the design of DRAM. His research interests include low-power DRAM circuits.

Lee-Sup Kim received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, in 1982 and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, in 1986 and 1990, respectively. He was a Postdoctoral Fellow with the Toshiba Corporation, Kawasaki, Japan, during 1990-1993, where he was involved in the design of the high-performance DSP and single-chip MPEG2 decoder. Since March 1993, he has been with the Korea Advanced Institute of Science and Technology, Daejeon, Korea. In November 2002, he became a full Professor. His research interests are multimedia VLSI design, hardware implementation of signal processing algorithms, and low-power IC design.