II. ARCHITECTURE
The CPU uses a modified Harvard architecture as shown in Fig. 2. In order to maximize throughput and energy efficiency, the architecture relies heavily on both pipelining and parallelism [4]. For example, the CPU employs a six-stage instruction pipeline divided as follows: 1) program prefetch, 2) program fetch, 3) instruction decode, 4) operand address generation, 5) operand read, and 6) execute/write. Three separate data buses and one program bus coupled with two data address generators and one program address generator facilitate a high degree of parallelism. For instance, two reads and one write operation can be performed in a single cycle. This exploitation of concurrent processing techniques results in a highly efficient instruction set, which in turn increases the energy efficiency of the processor [5].

The CPU contains a 40-b arithmetic logic unit (ALU), which can selectively feed one of two 40-b accumulators. The accumulators are divided into three parts: a 16-b low-order word, a 16-b high-order word, and eight guard bits. By setting a status register bit, the ALU can function as a single unit or as two 16-b ALU's operating in parallel. This dual 16-b mode is especially useful for the Viterbi add/compare/select operation, as described in detail later in this section. Furthermore, one of the ALU inputs can be taken from a 40-b barrel shifter, allowing the processor to perform numerical scaling, bit extraction, extended arithmetic, and overflow prevention. Together with the exponent detector, the shifter enables single-cycle normalization of values in an accumulator.

The CPU also features a multiply-accumulate block capable of a 17 × 17-b two's-complement multiplication and a 40-b addition in a single instruction cycle. The multiplier and ALU can operate in parallel, allowing simultaneous execution of multiply-accumulate (MAC) and arithmetic operations.

The DSP has a rich instruction set including single-instruction repeat and block repeat operations, block memory move instructions, instructions with two- or three-operand reads, conditional store instructions, and arithmetic instructions with parallel load and store. To improve performance and power on typical DSP algorithms, several dedicated instructions are available including Euclidean distance calculation, as well as finite impulse response (FIR) and least mean square (LMS) filtering operations.

In addition to these built-in filtering instructions, the DSP also contains a compare, select, and store unit (CSSU) which accelerates the Viterbi computation required by many communication algorithms [6]. The hardware components that support the Viterbi operator are shown in Fig. 3. This diagram is best understood by examining the code for a Viterbi butterfly operation, which is given in Fig. 4 along with the corresponding state transition diagram. In this example, two old states transition to two new states. For each new state, the objective is to find which of two possible routes results in the maximum path metric. By configuring the ALU in dual 16-b mode using the C16 bit in Fig. 3, the four add/subtracts required to calculate the candidate path metrics can be performed in only two cycles. The results of the add/subtract operations are stored in the two 40-b accumulators, A and B, and are fed to the CSSU. In a single cycle, the CSSU is able to compare the two 16-b words stored in one of the two accumulators, as well as select and store the largest of these two words using the result bus EB. In the same cycle, the decision is recorded in the test/control (TC) flag bit of the status register and in the

Fig. 4. State transition diagram and corresponding DSP code for a Viterbi butterfly operation. The DADST and DSADT are the double 16-b add/subtract instructions. CMPS is the compare/select/store instruction. || in the comment section means two operations are done in parallel in the same clock cycle.
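As a rough illustration of the Viterbi butterfly described in this section, the following Python sketch models the two dual 16-b add/subtract operations and the two compare/select/store operations. It is not the DSP code of Fig. 4: the function names, the representation of an accumulator as a (high, low) pair of 16-b words, and the decision-bit convention (0 when the high word is larger) are assumptions made for illustration only.

```python
def to_s16(value):
    """Wrap an integer to a signed 16-b two's-complement value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def dual_add_sub(pm_high, pm_low, bm, add_to_high):
    """One dual 16-b ALU operation; returns the (high, low) accumulator halves.

    add_to_high=True  -> (pm_high + bm, pm_low - bm)
    add_to_high=False -> (pm_high - bm, pm_low + bm)
    """
    if add_to_high:
        return to_s16(pm_high + bm), to_s16(pm_low - bm)
    return to_s16(pm_high - bm), to_s16(pm_low + bm)


def compare_select(acc):
    """Model the CSSU: keep the larger 16-b half and report a decision bit.

    Assumed convention: decision 0 if the high half is larger, else 1.
    """
    high, low = acc
    return (high, 0) if high > low else (low, 1)


def viterbi_butterfly(pm_old0, pm_old1, bm):
    """Update two new path metrics from two old ones and one branch metric."""
    # Two dual 16-b operations produce all four candidate path metrics.
    acc_a = dual_add_sub(pm_old0, pm_old1, bm, add_to_high=True)
    acc_b = dual_add_sub(pm_old0, pm_old1, bm, add_to_high=False)
    # One compare/select per accumulator picks the survivor for each new state.
    new0, dec0 = compare_select(acc_a)
    new1, dec1 = compare_select(acc_b)
    return (new0, new1), (dec0, dec1)


if __name__ == "__main__":
    print(viterbi_butterfly(10, 14, 3))
```

For old path metrics 10 and 14 and a branch metric of 3, the candidate pairs are (13, 11) and (7, 17), so the sketch returns the new metrics (13, 17) with decisions (0, 1).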
1768 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 32, NO. 11, NOVEMBER 1997
Fig. 8. Histogram of typical phase jitter for digital PLL.

Fig. 9. Comparison of (a) conventional CMOS latch to (b) lower-power latch which reduces clock loading.

the conventional version [Fig. 9(a)]. When the data is high, it is driven differentially into the cross-coupled inverter to ensure good performance even at low supply voltages. A 15% reduction in power was obtained, while achieving a 15% speed improvement and a 50% reduction in area. The large area reduction is due to the use of primarily NMOS devices, which reduces isolation spacing requirements.

C. Multiply-Accumulate Unit
The multiply-accumulate (MAC) unit on this chip is composed of a 17 × 17-bit Wallace tree multiplier coupled with a 40-b carry-lookahead adder. This MAC unit is capable of performing a nonpipelined multiply-accumulate operation in a single clock cycle. The multiplier can operate on signed, unsigned, and signed/unsigned 16-b operands. Both integer and fractional modes, as well as two's-complement rounding and overflow detection, are supported in this MAC unit. The Wallace tree architecture was chosen not only because of its inherent speed advantage over the array multiplier, but also because of its lower dynamic power consumption [10]. The speed advantage of the Wallace tree comes from the reduction in the height of the carry-save adder tree, from nine to five stages in our case. The improvement in dynamic power consumption comes from a large reduction in spurious switching events at the internal nodes of the multiplier, resulting from balanced delay paths in the Wallace tree. The architecture of the Wallace multiplier used on this chip consists of a 3:2 counter stage, followed by two consecutive 4:2 counter stages, as shown in Fig. 10. The 4:2 counter is constructed by cascading two carry-save adders. A 23% power reduction and 25% speed improvement are achieved by the Wallace tree architecture over the array architecture using the same carry-save adder. The carry-save adder makes extensive use of transmission gate logic to minimize transistor count and power consumption [11]. The transistor sizes were carefully tuned so that the variations in delay for different input signal combinations are minimized. The tight output delay distribution helps to minimize the overall dynamic power consumption in the multiplier [10]. SPICE simulations at 1 V reveal a worst-case delay of under 5 ns.

D. Memory Subsystem
The on-chip memory subsystem includes a 6 K × 16 b SRAM and 48 K × 16 b ROM divided into separate program and data spaces.

1) Random-Access Memory: The memory subsystem allows two data operand reads or a long (32-b) read from any one of the three 2 K × 16 b on-chip SRAM blocks in a single machine cycle. If the program bus is used for coefficient access, three operands can be read in a single cycle. A divided word line architecture is used in the SRAM to reduce power [12]. Each 2 K × 16 b SRAM block is divided into two banks as shown in Fig. 11. The banks in turn are subdivided into four 256 × 16 b arrays, each with its own local word line driver. Consequently, each memory access only activates 1/8 of the entire array, thereby reducing the power consumption. The sense amplifier employs a latch-type design to reduce the static power dissipation and to ensure proper operation even at very low supply voltages.
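The 1/8 activation figure for the SRAM follows directly from the block organization given in the text: two banks of four 256 × 16 b arrays per 2 K × 16 b block. The Python sketch below decodes a word address into a bank, an array, and a word index, and derives the active fraction. The assignment of address bits to banks and arrays is an assumption made here for illustration; the text does not specify it.

```python
from fractions import Fraction

WORDS_PER_ARRAY = 256   # each array is 256 x 16 b
ARRAYS_PER_BANK = 4     # four arrays per bank
BANKS_PER_BLOCK = 2     # two banks per 2 K x 16 b block
WORDS_PER_BLOCK = WORDS_PER_ARRAY * ARRAYS_PER_BANK * BANKS_PER_BLOCK


def decode_address(addr):
    """Split a word address into (bank, array, word index within the array).

    Assumed mapping: the most significant bit selects the bank, the next
    two bits select the array, and the low eight bits select the word.
    """
    if not 0 <= addr < WORDS_PER_BLOCK:
        raise ValueError("address outside the 2 K x 16 b block")
    bank, rest = divmod(addr, ARRAYS_PER_BANK * WORDS_PER_ARRAY)
    array, word = divmod(rest, WORDS_PER_ARRAY)
    return bank, array, word


def active_fraction():
    """Fraction of the block whose local word line driver fires per access."""
    return Fraction(1, BANKS_PER_BLOCK * ARRAYS_PER_BANK)


if __name__ == "__main__":
    print(WORDS_PER_BLOCK, decode_address(300), active_fraction())
```

Only the array selected by the decoded bank and array indices has its local word line driven, which is where the power saving of the divided word line scheme comes from.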
LEE et al.: 1-V PROGRAMMABLE DSP FOR WIRELESS COMMUNICATIONS 1771
Fig. 12. Architecture of 8 K × 16 b building block used to construct data and program ROM's.

High-Vt transistors are used in the six-transistor memory cell to keep the standby current to a minimum, while low-Vt transistors are used in the sense amplifier and all peripheral circuitry to allow for high-speed operation at 1 V and below. The dependence of the access time as a function of the threshold voltage will be described in greater detail in Section IV.

2) Read-Only Memory: There are two large ROM's in the DSP: a 32 K × 16 b program store ROM and a 16 K × 16 b coefficient data ROM. Both ROM's use the same basic architecture and cell design. They are constructed using multiple occurrences of a basic building block, an 8 K × 16 b ROM, illustrated in Fig. 12.

These are diffusion-programmed ROM's incorporating a number of design features to reduce power dissipation. They utilize selective precharge of bit lines, for essentially zero standby power dissipation and low operating power consumption. A 16:1 column mux ratio minimizes bit line capacitance and, therefore, access time and power consumption. The diffusion-programming feature results in a 40% memory array area savings with respect to via- or contact-programmed cores of the same capacity, further lowering bit line and word line capacitance and, thus, power. The ROM's feature dynamic page selection, which allows either 16-b or 32-b access, to meet system architecture requirements. The cell bit line contact

IV. PROCESS
The chip was fabricated using a 0.35-µm, dual-Vt twin-well CMOS process with three layers of metal. The minimum length of patterned gate poly is 0.25 µm. The gate oxide thickness is 5 nm. To fabricate both low-Vt and high-Vt transistors on the same chip with minimal process overhead, a surface counter-doping implantation was performed on the channels of the low-Vt transistors in addition to the regular channel implant [13]. The nominal threshold voltages (defined as the gate voltage when the drain current reaches a current level of 0.1 µA × W/L, where W and L are the drawn width and length of the transistor) are 0.4 and 0.2 V, respectively. The threshold voltages quoted in an earlier publication [1] are different from the values used here. The discrepancy stems from a different definition of threshold voltage when […] A adopted in the earlier publication. The leakage currents of the high-Vt and low-Vt devices are below 1 nA/µm and 1 µA/µm, respectively. The drive current of the low-Vt (LVT) transistors is typically twice that of the high-Vt (HVT) devices. Key process characteristics are summarized in Table I.

TABLE I
KEY PROCESS TECHNOLOGY PARAMETERS

Technology     0.35 µm twin-well, dual-Vt CMOS with triple-metal
Gate Length    0.25 µm
Gate Oxide     5 nm
High Vt        0.4 V
Low Vt         0.2 V

To gain a better insight into the impact of threshold voltages on the performance of this DSP, SPICE simulation results of one of the critical paths, namely the on-chip SRAM access time, are presented in Table II. SPICE simulations reveal that the speed of the SRAM is not significantly degraded by using threshold voltages as high as 0.4 V in the memory cell. The speed of the SRAM is strongly dependent, however, on the threshold voltage used in the peripheral circuitry, which is therefore kept low. For a low Vt and high Vt of 0.2 and 0.4 V, respectively, SPICE simulations indicate an access time of about 6 ns.

V. RESULTS
The chip measures 5.65 × 5.50 mm and contains 1.6 million transistors. A photomicrograph of the die is shown in Fig. 13. The DSP was functional on first silicon and was
TABLE II
DEPENDENCE OF SRAM ACCESS TIME ON THRESHOLD VOLTAGE AT 1 V POWER SUPPLY AND NOMINAL CONDITIONS

Vt of LVT (V)   Vt of HVT (V)   Tacc (ns)   Relative Tacc
0.2             0.40            6.16        100%
0.2             0.35            5.87        95%
0.2             0.30            5.62        91%
0.05            0.40            5.00        81%
0.05            0.35            4.71        76%
0.05            0.30            4.46        72%
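The Relative Tacc column of Table II can be recomputed from the access times by normalizing each entry to the first row (low Vt of 0.2 V, high Vt of 0.40 V). The short Python check below copies the table values as data; the names TABLE_II and relative_access_times are chosen here for illustration only.

```python
# Rows of Table II: (Vt of LVT in V, Vt of HVT in V, access time in ns)
TABLE_II = [
    (0.2, 0.40, 6.16),
    (0.2, 0.35, 5.87),
    (0.2, 0.30, 5.62),
    (0.05, 0.40, 5.00),
    (0.05, 0.35, 4.71),
    (0.05, 0.30, 4.46),
]


def relative_access_times(rows, baseline_index=0):
    """Return each access time as a rounded percentage of the baseline row."""
    baseline = rows[baseline_index][2]
    return [round(100 * t_acc / baseline) for _, _, t_acc in rows]


if __name__ == "__main__":
    for (vt_low, vt_high, t_acc), rel in zip(
        TABLE_II, relative_access_times(TABLE_II)
    ):
        print(f"LVT {vt_low:.2f} V, HVT {vt_high:.2f} V: {t_acc:.2f} ns ({rel}%)")
```

The recomputed percentages match the published column: lowering the peripheral (LVT) threshold from 0.2 to 0.05 V shortens the access time by about 19%, while lowering the cell (HVT) threshold from 0.40 to 0.30 V gains only about 9%.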
Fig. 16. Power consumption as a function of frequency for the 0.35-µm LVT + HVT chip and the HVT-only chip. Two curves are shown for the HVT-only chip: one is for a 1.0-V supply and another for a 1.3-V supply. The curve for the LVT + HVT chip is obtained with a 1.0-V supply.

Fig. 17. Energy-delay product as a function of supply voltage.

TABLE III
APPROXIMATE POWER BREAKDOWN OF DSP

Component   Approximate Power Consumption (mW)   Fraction of Total Power
Clock       4.4                                  26%
Datapath    6.5                                  38%
Control     3.1                                  18%
Memory      3.1                                  18%
Total       17                                   100%

The power-delay product has been used for many years as a figure of merit in evaluating digital circuits. Recently, it has been proposed that the energy-delay product (or power-delay-delay product) is a better metric to analyze the tradeoff between energy efficiency and performance [14]. Fig. 17 shows the energy-delay product of the HVT + LVT and HVT-only chips as a function of supply voltage. The LVT + HVT chip achieves a minimum of 4.2 nJ-ns at 1.0 V. Thus, 1.0 V is indeed the optimal operating voltage for this chip. The HVT-only chip achieves a minimum of 5.7 nJ-ns at 1.20 V. The shift of the minimum to a higher voltage is due to the higher threshold voltage in the HVT-only chip. For comparison, the 0.6-µm DSP achieves its minimum at 1.8 V with a minimum value almost an order of magnitude higher than that of the 1 V chip.

Some of the more important chip statistics presented in this section are summarized in Table IV.

TABLE IV
CHIP STATISTICS AND PERFORMANCE DATA

Chip size                    5.65 × 5.50 mm² = 31.1 mm²
Transistor count             1.6 million
Package                      128-pin TQFP
Speed at room temperature    63 MHz at 1 V, 100 MHz at 1.35 V
Power                        17 mW at 1 V
Power–Performance Metric     0.21 mW/MHz at 1 V

VI. CONCLUSIONS
This paper has described a low-power, fixed-point programmable DSP optimized for wireless communication applications. From the outset, the DSP was architected for low power consumption, employing a bus and memory structure that facilitate a high degree of parallelism and a specialized instruction set that enables efficient execution of common communication algorithms such as the Viterbi butterfly. Furthermore, the DSP includes a rich set of peripherals and a large amount of on-chip memory (54 K × 16 b) that help to minimize chip count and off-chip I/O, thus saving power at the system level.

The chip utilizes a hybrid power management strategy with automatic local clock gating and user-controlled gating of global clocks. A flexible PLL-based clock generator/multiplier gives the user fine-grain control over the clock frequency as well. To facilitate efficient low-voltage operation, the traditional analog PLL has been replaced with an almost entirely digital implementation.

The design makes use of a 0.35-µm dual-Vt technology with 0.25-µm gate lengths to enable good performance at 1 V. At this voltage, the chip dissipates 17 mW and operates at 63 MHz with a power-performance metric of 0.21 mW/MHz. Moreover, the device is functional down to 0.6 V and reaches 100 MHz at 1.35 V. These results are indicative of the energy efficiency that can be achieved using low-voltage digital design techniques. As cellular phone manufacturers continue to drive down the system supply voltage, the trend toward migrating more functionality to the digital domain is likely to continue. This will fuel the demand for high-performance, low-power DSP's. The results presented here demonstrate that high performance and low power consumption can be achieved simultaneously through innovative technology and digital design techniques.

ACKNOWLEDGMENT
The authors would like to thank D. Nowlen for mask generation; E. Born, G. Stacey, and B. Garcia for assistance in testing; B. Strong and B. Fleck for SRAM design; A. Amerasekera and C. Duvvury for advice on ESD design; and B. Hewes, A. Shah, P. Yang, S. Shichijo, and G. Frantz for their encouragement and support.

REFERENCES
[1] W. Lee et al., “A 1 V DSP for wireless communications,” in ISSCC Dig. Tech. Papers, Feb. 1997, pp. 92–93.
[2] M. Izumikawa et al., “A 0.9 V 100 MHz 4 mW mm2 16 b DSP Core,” in ISSCC Dig. Tech. Papers, Feb. 1995, pp. 84–85.
[3] S. Mutoh et al., “A 1 V multi-threshold voltage CMOS DSP with an efficient power management technique for mobile phone application,” in ISSCC Dig. Tech. Papers, Feb. 1996, pp. 168–169.
[4] Texas Instruments Inc., TMS320C54x User’s Guide, 1995.
[5] A. Chandrakasan, S. Sheng, and R. Brodersen, “Low-power CMOS digital design,” IEEE J. Solid-State Circuits, vol. 27, pp. 473–483, Apr. 1992.
[6] E. Lee and D. Messerschmitt, Digital Communication. Norwell, MA: Kluwer, 1988.
[7] I. Young, J. Greason, and K. Wong, “A PLL clock generator with 5–100 MHz of lock range for microprocessors,” IEEE J. Solid-State Circuits, vol. 27, pp. 1599–1606, Nov. 1992.
[8] E. Vittoz, “Low power design: Ways to approach the limits,” in ISSCC Dig. Tech. Papers, Feb. 1994, pp. 14–18.
[9] R. Best, Phase-Locked Loops: Design, Simulation, and Applications. New York: McGraw-Hill, 1996.
[10] T. Sakuta, W. Lee, and P. Balsara, “Delay balanced multipliers for low power/low voltage DSP core,” in Symp. Low Power Electronics Dig. Tech. Papers, Oct. 1995, pp. 36–37.
[11] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective. Reading, MA: Addison-Wesley, 1985.
[12] M. Yoshimoto et al., “A divided word-line structure in the static RAM and its application to 64 K full CMOS RAM,” IEEE J. Solid-State Circuits, vol. 18, pp. 479–485, Oct. 1983.
[13] M. Nandakumar, A. Chatterjee, G. Stacey, and I.-C. Chen, “A 0.25 µm CMOS technology for 1 V low power applications: Device design and power/performance considerations,” in Symp. VLSI Technology Dig. Tech. Papers, June 1996, pp. 68–69.
[14] R. Gonzalez and M. Horowitz, “Energy dissipation in general purpose microprocessors,” IEEE J. Solid-State Circuits, vol. 31, pp. 1277–1284, Sept. 1996.

Brock Barton received the B.S. and B.A. degrees in physics and mathematics, respectively, from the University of Texas, Austin, in 1966. He received the Ph.D. degree in solid-state physics from the Massachusetts Institute of Technology, Cambridge, in 1971.
He is currently a Texas Instruments Fellow, in the DSPS R&D Center of TI’s Corporate R&D organization, Dallas. He joined TI in 1972 and has worked in TI’s Equipment Group, Central Research Laboratories, Semiconductor Group, and Semiconductor R&D, a part of TI’s Corporate R&D organization. He was responsible for the design of TI’s first standard cell library. He has worked on a number of different programs, including CCD imager and memory design and development, CMOS calculator chip product engineering, and MOS memory and microprocessor process development, prior to helping create TI’s ASIC technology. More recently, he was responsible for fuzzy logic IC development and 1V DSP chip design in the DSPS R&D Center. He is the author or co-author of more than 25 technical papers and publications and holds five patents.

Shigeshi Abiko received the B.S. degree in computer science from the Scientific University of Tokyo, Japan, in 1982.
He joined the Tokyo Design Center, Texas Instruments Japan Ltd. in 1982 and then worked on the development of programmable DSP devices. He is currently a Senior Member of the Technical Staff.
Kenichi Tashiro was born in Osaka prefecture, Japan, on January 30, 1965. He received the B.E. degree in electronics from the Osaka Institute of Technology, Osaka, Japan in 1987.
He joined Texas Instruments Japan Ltd., Tokyo, in 1987. Since then, he has been engaged in the development of DSP devices.

Frederic Boutaud (M’88) received the diplome d’ingenieur in electricity and electronics from Ecole Centrale de Lyon, France, in July 1978.
After two years working on computers for aircraft automatic pilot, he joined Texas Instruments, Villeneuve Loubet, France, in 1980, where he became a Senior Member of Technical Staff and a Design Manager. He has been working mainly on digital signal processor architecture and design, digital filters for delta–sigma converters, and low power design for telecommunication applications. He authored and co-authored several papers and holds more than 12 patents in the area of digital design, delta–sigma converters, video processor and DSP architecture. In 1996, he moved to Boston, MA, and joined the Communication Division of Analog Devices where he is currently a Design Manager.

Emmanuel Ego received the engineer degree in microelectronics from Institut Superieur d’Electronique du Nord, Lille, France, in 1992.
He joined Texas Instruments France, Villeneuve Loubet, in December 1994. He has been designing PLL’s for DSP’s circuits since then. He is now involved in design of digital blocks to be integrated in next generation of low-power DSP cores.

Hiep Tran, photograph and biography not available at the time of publication.

Albert Shih received the B.S. degree in electrical engineering from National Taiwan University, Taiwan, in 1987 and the M.S. and Ph.D. degrees in electrical engineering from the University of Texas at Austin in 1991 and 1994, respectively.
He spent the summer of 1991 at Texas Instruments, working on the modeling of crosstalk in multilevel interconnect VLSI systems. In 1994, he joined the Submicron Logic Process Development group of Texas Instruments, Dallas, where he was involved in the device design and yield enhancement of logic products. Since 1995, he has worked on low-power CMOS device and process integration and the prototyping of the first fully functional 1-V DSP chip presented in this paper. Currently, he is engaged in the design of high performance dynamic random access memory circuits.
Mahalingam Nandakumar was born in Madras, India, on December 10, 1964. He received the B. Tech degree in electrical engineering from the College of Engineering, Trivandrum, India, in 1987 and the M. Tech degree in electrical engineering with specialization in integrated circuits and systems from the Indian Institute of Technology, Madras, India. He received the Ph.D. degree in electrical engineering from North Carolina State University, Raleigh, in 1993.
In his doctoral thesis work, he experimentally demonstrated a new MOS gated power semiconductor device called the base resistance controlled thyristor (BRT) for the first time. Since 1993 he has been employed at the Semiconductor Process and Design Center of Texas Instruments, Dallas, where he has been responsible for developing a 0.25/0.18-µm CMOS technology for low power applications. His research interests include development of novel semiconductor device structures and submicrometer CMOS device design.

Robert H. Eklund (M’86) received the B.S. degree from Davidson College, Davidson, NC, in 1974 and the Ph.D. degree in chemistry from the University of Texas at Austin in 1979.
After graduation, he joined Texas Instruments, Dallas, as a Member of the Technical Staff. Since joining Texas Instruments, he has worked in the area of materials characterization and on process development and integration for various programs at TI including Schottky diode processes for the VHSIC STL program, submicrometer bipolar process development, BiCMOS on SIMOX process development, and submicrometer CMOS and BiCMOS process development. Currently, he is a TI Fellow and Manager of the Logic Process Technology branch responsible for the CMOS and BiCMOS process integration. He has authored and coauthored numerous papers and patents issued in the field of microelectronics.

Ih-Chin Chen (S’85–M’86–SM’96) received the B.S. degree in electrical engineering from National Taiwan University in 1980 and the M.S. and Ph.D. degrees in electrical engineering and computer sciences from the University of California, Berkeley, in 1984 and 1986, respectively.
From 1980 to 1982, he served as a Teaching Officer in the Electronic Training Center of the Chinese Army Missile Command. From 1983 to 1986, he was a Postgraduate Researcher at the University of California, Berkeley. He joined Semiconductor Process and Design Center, Texas Instruments, Dallas, in 1987. From 1987 to 1991, he had worked on transistor and isolation design, storage dielectrics development, and cell leakage reduction for 4, 16, and 64 Mb DRAM. Since 1991, he has been the Manager of Advanced Transistor/Isolation branch focusing on high-performance CMOS for advanced microprocessors, 1 V CMOS for low power applications, transistor and isolation design for 256 Mb DRAM, shallow trench isolation, and sub-0.1 µm device research. He was elected as a TI Fellow in 1996. He holds several patents, and has authored or co-authored 88 technical papers.
Dr. Chen has served on several international conferences, including IEDM (1993–1995 as a Program Committee Member and 1996 as the Subcommittee Chairman of the CMOS Devices and Reliability Subcommittee), Device Technology Conference of SPIE Microelectronic Manufacturing Symposium (1995–1996, as the Conference Chairman), IEEE Symposium on Low Power Electronics (1996–1997, as a Program Committee Member), and the International Symposium on VLSI Technology, Systems, and Applications (1997, as a Program Committee Member).