IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 32, NO. 11, NOVEMBER 1997

A 1-V Programmable DSP for Wireless Communications

Wai Lee, Member, IEEE, Paul E. Landman, Member, IEEE, Brock Barton, Shigeshi Abiko,
Hiroshi Takahashi, Associate Member, IEEE, Hiroyuki Mizuno, Shigetoshi Muramatsu, Kenichi Tashiro,
Masahiro Fusumada, Luat Pham, Frederic Boutaud, Member, IEEE, Emmanuel Ego,
Girolamo Gallo, Hiep Tran, Carl Lemonds, Albert Shih, Mahalingam Nandakumar,
Robert H. Eklund, Member, IEEE, and Ih-Chin Chen, Senior Member, IEEE

Abstract: In an effort to extend battery life, the manufacturers of portable consumer electronics are continually driving down the supply voltages of their systems. For example, next-generation cellular phones are expected to utilize a 1-V power supply for their digital components. To address this market, an energy-efficient, programmable digital signal processing (DSP) chip that operates from a 1-V supply has been designed, fabricated, and tested. The DSP features an instruction set and micro-architecture that are specifically targeted at wireless communication applications and that have been carefully optimized to minimize power consumption without sacrificing performance. The design utilizes a 0.35-µm dual-Vt technology with 0.25-µm minimum gate lengths that enables good performance at 1 V. Specifically, the chip dissipates 17 mW at 1 V, achieving 63-MHz operation with a power-performance metric of 0.21 mW/MHz.

Index Terms: CMOS digital integrated circuits, computer architecture, flip-flops, integrated circuit design, low-power circuits, low-power CMOS, phase-locked loops, semiconductor memories, Viterbi decoding, wireless communications.

Manuscript received April 1997; revised June 3, 1997.
W. Lee, P. E. Landman, B. Barton, H. Tran, C. Lemonds, A. Shih, M. Nandakumar, R. H. Eklund, and I.-C. Chen are with Corporate R&D, Texas Instruments Incorporated, Dallas, TX 75243 USA.
S. Abiko, H. Takahashi, H. Mizuno, S. Muramatsu, K. Tashiro, and M. Fusumada are with Texas Instruments Japan Ltd., Tokyo, 108 Japan.
L. Pham is with Texas Instruments Incorporated, Stafford, TX 77477 USA.
F. Boutaud was with Texas Instruments France, 06271 Villeneuve Loubet Cedex, France. He is now with Analog Devices, Wilmington, MA 01887 USA.
E. Ego is with Texas Instruments France, 06271 Villeneuve Loubet Cedex, France.
G. Gallo is with Texas Instruments Italia, 67051 Avezzano, Italy.
Publisher Item Identifier S 0018-9200(97)07855-4.

I. INTRODUCTION
THE portable consumer electronics market is currently undergoing a period of rapid growth. Wireless communication devices, such as cellular phones, represent an important segment of this market. One of the keys to success in this highly competitive arena is to keep products lightweight and small. This places tight constraints on the size of the batteries that can be used and makes it extremely important to employ energy-efficient circuits in the design of these products. As shown in Fig. 1, wireless systems rely heavily on embedded digital signal processors (DSP's) to perform critical communication functions such as speech coding, channel coding, demodulation, and equalization. These functions require a DSP that offers a high computational throughput. However, this requirement is often at odds with the goal of minimizing power consumption. This paper describes a fixed-point programmable DSP designed for wireless communication applications that addresses these dual constraints of lower power and higher throughput [1]. This is achieved by operating at low voltage (1 V) while using a dual-Vt process to maintain high performance. The use of multiple threshold voltages to maintain a balance between speed and standby current is not new. Several processors have been designed which incorporate both high and low threshold voltage transistors on a single chip [2], [3]. Relative to the DSP presented here, however, these earlier low-power DSP's have either incorporated much less functionality (70 K transistors) [2] or have had higher power per unit throughput, in the range of 2 mW/MHz [3]. The chip described in this paper has a power-performance metric of 0.21 mW/MHz and is a full-feature, 1.6-million transistor DSP with a large amount of on-chip memory and a rich set of peripherals for interfacing to external devices.

Fig. 1. Use of DSP in a cellular phone application.

The design of this DSP is described in several sections. First, Section II describes the architecture. Next, Section III presents the implementation of the design, describing the power management strategy and detailing the design of key circuits such as the clock generation and distribution network, the multiply-accumulate unit, and the memory subsystem. The multithreshold process that enables low voltage operation is then described in Section IV. Finally, Sections V and VI present measured results and concluding remarks.
0018–9200/97$10.00 © 1997 IEEE
LEE et al.: 1-V PROGRAMMABLE DSP FOR WIRELESS COMMUNICATIONS 1767

II. ARCHITECTURE
The CPU uses a modified Harvard architecture as shown in Fig. 2. In order to maximize throughput and energy efficiency, the architecture relies heavily on both pipelining and parallelism [4]. For example, the CPU employs a six-stage instruction pipeline divided as follows: 1) program prefetch, 2) program fetch, 3) instruction decode, 4) operand address generation, 5) operand read, and 6) execute/write. Three separate data buses and one program bus coupled with two data address generators and one program address generator facilitate a high degree of parallelism. For instance, two reads and one write operation can be performed in a single cycle. This exploitation of concurrent processing techniques results in a highly efficient instruction set, which in turn increases the energy efficiency of the processor [5].

Fig. 2. High-level block diagram of DSP architecture.

The CPU contains a 40-b arithmetic logic unit (ALU), which can selectively feed one of two 40-b accumulators. The accumulators are divided into three parts: a 16-b low-order word, a 16-b high-order word, and eight guard bits. By setting a status register bit, the ALU can function as a single unit or as two 16-b ALU's operating in parallel. This dual 16-b mode is especially useful for the Viterbi add/compare/select operation, as described in detail later in this section. Furthermore, one of the ALU inputs can be taken from a 40-b barrel shifter, allowing the processor to perform numerical scaling, bit extraction, extended arithmetic, and overflow prevention. Together with the exponent detector, the shifter enables single-cycle normalization of values in an accumulator.

The CPU also features a multiply-accumulate block capable of a 17 × 17-b two's-complement multiplication and a 40-b addition in a single instruction cycle. The multiplier and ALU can operate in parallel, allowing simultaneous execution of multiply-accumulate (MAC) and arithmetic operations.

The DSP has a rich instruction set including single-instruction repeat and block repeat operations, block memory move instructions, instructions with two- or three-operand reads, conditional store instructions, and arithmetic instructions with parallel load and store. To improve performance and power on typical DSP algorithms, several dedicated instructions are available including Euclidean distance calculation, as well as finite impulse response (FIR) and least mean square (LMS) filtering operations.

Fig. 3. Architectural components supporting compare, select, and store operation used in Viterbi algorithm.

Fig. 4. State transition diagram and corresponding DSP code for a Viterbi butterfly operation. The DADST and DSADT are the double 16-b add/subtract instructions. CMPS is the compare/select/store instruction. The symbol || in the comment section means two operations are done in parallel in the same clock cycle.

In addition to these built-in filtering instructions, the DSP also contains a compare, select, and store unit (CSSU) which accelerates the Viterbi computation required by many communication algorithms [6]. The hardware components that support the Viterbi operator are shown in Fig. 3. This diagram is best understood by examining the code for a Viterbi butterfly operation, which is given in Fig. 4 along with the corresponding state transition diagram. In this example, two old states transition to two new states. For each new state, the objective is to find which of two possible routes results in the maximum path metric. By configuring the ALU in dual 16-b mode using the C16 bit in Fig. 3, the four add/subtracts required to calculate the candidate path metrics can be performed in only two cycles. The results of the add/subtract operations are stored in the two 40-b accumulators, A and B, and are fed to the CSSU. In a single cycle, the CSSU is able to compare the two 16-b words stored in one of the two accumulators, as well as select and store the largest of these two words using the result bus EB. In the same cycle, the decision is recorded in the test/control (TC) flag bit of the status register and in the
16-b transition shift register (TRN). The information contained in TRN can later be used by a back-tracking routine to find the optimal path through the trellis. Therefore, it takes only two cycles to complete the two compare, select, and store operations. To summarize, the dual ALU mode and the CSSU allow the Viterbi butterfly to be completed in four cycles as opposed to the 20 required by a previous architecture. This represents a substantial savings of both time and energy.

In addition to the CPU, the DSP also contains a substantial amount of on-chip memory including a 6 K × 16 b SRAM and 48 K × 16 b ROM divided into separate program and data spaces. This large amount of on-chip memory is intended to minimize off-chip memory accesses and, consequently, minimize the overall power consumption of the system. Several on-chip peripherals are available as well, including a 16-b timer, a synchronous full-duplex serial port, a buffered full-duplex serial port, and an 8-b parallel host interface. This set of peripherals was designed to facilitate communication between the DSP and the other components in the system. For example, in a typical digital cellular phone, one serial port can communicate with the RF modem; the other serial port can interface with the audio subsystem; and the parallel host interface can communicate with the microcontroller. The DSP also contains emulation and test circuitry which implements the JTAG (IEEE 1149.1) standard and allows access to all on-chip resources.

III. CIRCUIT IMPLEMENTATION
In this section, we provide more details of the DSP design, beginning with the power management strategy in Section III-A. This leads into a description of the clock generation and distribution network in Section III-B. Finally, Sections III-C and III-D cover the MAC unit and the memory subsystem, respectively.

A. Power Management Strategy
During active operation, power savings are realized by extensive use of gated clocks (see Fig. 5). In particular, latches are only clocked when useful data is available at their inputs. This is achieved by locally gating the master clock with a data ready signal. This signal is generated automatically by local control logic using information from the instruction decoder. The gated slave logic registers when data has been clocked into the master latch and outputs an enable signal that gates the slave clock. Thus, functional blocks are only activated when they have valid data to process. Global clock gating is also available and is controlled by the user through three power-down instructions: IDLE1, IDLE2, and IDLE3. The IDLE1 instruction shuts down the CPU but the peripheral devices and the system clock remain active; the IDLE2 instruction shuts down the CPU and all the on-chip peripherals, though the PLL remains active; and the IDLE3 instruction shuts down the entire processor. In IDLE3 mode, the only source of power consumption is the standby power of the entire chip, which can be substantially lower than the power during active operation. However, the latency for the CPU to return to active operation from the IDLE3 mode is also the largest since the on-chip PLL has to acquire lock to the external clock source again, which typically takes about 50 µs. In summary, the chip utilizes a hybrid power management strategy that is partially automated and partially under user control.

Fig. 5. Example illustrating the use of gated slave logic to reduce unnecessary switching activity.

B. Clock Generation and Distribution
The DSP uses an on-chip PLL for clock generation and multiplication. Traditionally, the PLL's used for this purpose have been largely analog [7]. There are several reasons, however, to consider using a more digital solution. Digital logic, for example, exhibits much better noise immunity and is less sensitive to process variations than analog circuitry. Furthermore, unlike digital logic, the energy efficiency of analog circuitry is actually worsened by aggressive voltage scaling [8]. Moreover, the reduction in available headroom makes analog design much more difficult at low voltage levels. All of these factors led us to opt for a more digital approach.

1) PLL Architecture: The basic architecture of the PLL is shown in Fig. 6. The reference input can be driven by either an external clock source or a crystal resonator. In addition, it can be used directly or can be divided down in frequency by a factor of two or four. The output clock can be taken either from the PLL or from one of these divided down versions of the reference.

Fig. 6. High-level architecture of digital PLL.

The core loop of the PLL is structurally similar to that of an analog PLL, with the exception that a digital loop filter replaces the traditional analog filter, and a digitally controlled oscillator (DCO) replaces the voltage controlled oscillator (VCO). The programmable frequency divider in the feedback path enables the PLL to generate an output frequency that is a
factor of 1 to 15 higher than the reference. The combination of feedforward and feedback frequency dividers allows the PLL to generate 15 different integer-frequency multiples (1, ..., 15), eight different half-frequency multiples (0.5, 1.5, ..., 7.5), and eight different quarter-frequency multiples (0.25, 0.75, ..., 3.75).
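
The multiplier counts quoted above can be checked with a short enumeration. This is only a sketch under an assumed model that is not spelled out in the text: the output/reference ratio is taken to be N/d, where N is the feedback divider setting (1 to 15) and d is the reference divider (1, 2, or 4). The names `FEEDBACK_DIVIDERS`, `REFERENCE_DIVIDERS`, and `pll_multiples` are illustrative, not taken from the design.

```python
from fractions import Fraction

FEEDBACK_DIVIDERS = range(1, 16)   # programmable feedback divider settings 1..15
REFERENCE_DIVIDERS = (1, 2, 4)     # reference used directly, or divided by 2 or 4


def pll_multiples():
    """Return the sorted distinct output/reference ratios N/d (assumed model)."""
    return sorted({Fraction(n, d)
                   for n in FEEDBACK_DIVIDERS
                   for d in REFERENCE_DIVIDERS})


multiples = pll_multiples()
# Fractions are stored in lowest terms, so the denominator classifies each ratio.
integer_multiples = [m for m in multiples if m.denominator == 1]
half_multiples = [m for m in multiples if m.denominator == 2]
quarter_multiples = [m for m in multiples if m.denominator == 4]
```

Under this model the three lists contain 15, 8, and 8 entries respectively, matching the counts in the text, for 31 distinct ratios in total.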
Fig. 7. Schematic of digitally controlled oscillator (DCO) with switched-capacitor frequency control.

The operation of the PLL is as follows. The phase detector compares the position of the reference clock edge to that of the divided-down DCO edge. Depending on whether the DCO edge lags or leads the reference, the phase detector outputs a signal telling the DCO to increase or decrease its frequency. If the two edges are aligned within the tolerance of the phase detector, then no correction pulses are generated.
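
The phase comparison described above can be sketched behaviorally. This is a minimal illustration, not the circuit used on the chip: the function name `phase_detector`, the time units, and the default tolerance are hypothetical, and the polarity (a lagging DCO edge requests a higher frequency) is an assumption inferred from the description.

```python
def phase_detector(ref_edge_ns, dco_edge_ns, tolerance_ns=0.1):
    """Return +1 to speed the DCO up, -1 to slow it down, 0 for no correction.

    ref_edge_ns and dco_edge_ns are the arrival times of the reference edge
    and the divided-down DCO edge. tolerance_ns is a hypothetical dead zone.
    """
    error_ns = dco_edge_ns - ref_edge_ns
    if abs(error_ns) <= tolerance_ns:
        return 0                       # edges aligned: no correction pulse
    # Assumed polarity: a late (lagging) DCO edge means the DCO is too slow.
    return 1 if error_ns > 0 else -1
```

For example, `phase_detector(100.0, 100.5)` requests a frequency increase, while `phase_detector(100.0, 100.05)` falls inside the dead zone and produces no correction.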
Next, the loop filter examines the phase detector output and determines how to alter the DCO frequency. Since the loop filter is a synchronous digital circuit, a local clock signal is required. Therefore, the first stage of the circuit is a clock generation unit that creates a loop filter clock pulse whenever it sees a corresponding pulse on the phase detector output. This clock drives the main loop filter logic, which realizes a first-order low-pass filter designed to guarantee loop stability. The output of the loop filter is a 6-b digital control word that specifies at what frequency the DCO should operate. The final stage of the loop filter is a synchronizer. The purpose of this circuitry is to prevent the DCO frequency from changing midcycle, which could give rise to glitches in the output clock.

The synchronized output of the loop filter then passes to the DCO. The DCO produces a square wave whose period is linearly related to the DCO control word. This signal is then fed back to the phase detector through a programmable frequency divider. The frequency locking action of the feedback loop guarantees that the divided-down DCO frequency matches the reference frequency, resulting in a DCO frequency equal to the divider ratio times the reference.

2) Digitally Controlled Oscillator: Since the DCO period is controlled by a digital input, the PLL cannot generate a continuous range of frequencies, but rather produces a finite number of discrete frequencies. Because the quantization of the DCO period sets some fundamental limits on the PLL's achievable jitter, it is desirable for the DCO to have a small step size. Traditional DCO's generate their outputs by dividing down a fixed, high-frequency clock source [9]. This type of implementation makes it extremely difficult to achieve a small step size since it is limited by how fast the clock source and the associated divider can operate. For example, to achieve a 200-ps step between periods would require a 5-GHz oscillator and counter! The DCO used in this design overcomes this difficulty by synthesizing a signal with the desired frequency rather than deriving it from a high frequency source. The schematic for the DCO is shown in Fig. 7. The circuit uses a binarily weighted switched capacitor array to control the frequency of oscillation. More specifically, inverter U2 drives a variable load, the size of which is controlled by turning on or off the capacitors of the array, one for each bit of the loop filter output (six in our design). The capacitors are realized as NMOS transistors with their source and drain tied together. The gate areas of the transistors are binarily weighted, so that the on-capacitance of each transistor is roughly a power-of-two multiple of the on-capacitance of the smallest transistor. A capacitor is turned on through its corresponding digital control bit. This grounds the source/drain and creates an inversion layer in the NMOS channel, contributing to the capacitive load on the node driven by U2. The capacitor is turned off by placing the transistor in the depletion region, which significantly reduces its gate capacitance to a fraction of its on value. Therefore, the load capacitance on the node grows roughly linearly with the value of the digital control word, on top of a fixed component that includes the node's parasitic capacitance.

This causes the delay of inverter U2 to be linearly related to the digital control word. The step size between adjacent oscillation periods is controlled by sizing the capacitors and U2, and the range is controlled by the capacitor sizing, U2, and the number of bits in the control word. Schmitt trigger U3 is used to sense when the node has reached its high or low thresholds, at which point a sharp edge is generated. The signal then travels through a fixed delay stage (which contributes to the offset of the frequency range) and is fed back to NAND U1, completing the loop. Finally, reset devices (U4, U5, and two transistors) are included to ensure that the node is at the rails at the start of each charging cycle.

The result is a DCO capable of excellent frequency resolution whose period is a linear function of the control word. Measurements from silicon indicate a typical RMS phase jitter of about 45.5 ps as shown in Fig. 8.

3) Clock Distribution Network: The clock distribution network is carefully tuned to minimize power dissipation. To reduce the significant short-circuit currents typical of large clock drivers, a hierarchical clock buffering scheme is used. Local clock buffers are placed as close to the clock signal destination points as possible to minimize skew. After the required rise and fall times of the clock signal at the inputs to scannable register latches (SRL's) are specified, the size of the clock drivers and interconnect widths are tuned to satisfy the propagation delay requirements with minimal power dissipation.

Power is further reduced by careful design of the SRL's to minimize clock loading as shown in Fig. 9. In the new design [Fig. 9(b)] the clock signals drive two NMOS devices instead of feeding two NMOS and two PMOS devices as in
the conventional version [Fig. 9(a)]. When the data is high, it is driven differentially into the cross-coupled inverter to ensure good performance even at low supply voltages. A 15% reduction in power was obtained, while achieving a 15% speed improvement and a 50% reduction in area. The large area reduction is due to the use of primarily NMOS devices, which reduces isolation spacing requirements.

Fig. 8. Histogram of typical phase jitter for digital PLL.

Fig. 9. Comparison of (a) conventional CMOS latch to (b) lower-power latch which reduces clock loading.

C. Multiply-Accumulate Unit
The multiply-accumulate (MAC) unit on this chip is composed of a 17 × 17-b Wallace tree multiplier coupled with a 40-b carry-lookahead adder. This MAC unit is capable of performing a nonpipelined multiply-accumulate operation in a single clock cycle. The multiplier can operate on signed, unsigned, and signed/unsigned 16-b operands. Both integer and fractional modes, as well as two's-complement rounding and overflow detection, are supported in this MAC unit. The Wallace tree architecture was chosen not only because of its inherent speed advantage over the array multiplier, but also because of its lower dynamic power consumption [10]. The speed advantage of the Wallace tree comes from the reduction in the height of the carry-save adder tree, from nine to five stages in our case. The improvement in dynamic power consumption comes from a large reduction in spurious switching events at the internal nodes of the multiplier, resulting from balanced delay paths in the Wallace tree. The architecture of the Wallace multiplier used on this chip consists of a 3 : 2 counter stage, followed by two consecutive 4 : 2 counter stages, as shown in Fig. 10. The 4 : 2 counter is constructed by cascading two carry-save adders. A 23% power reduction and 25% speed improvement are achieved by the Wallace tree architecture over the array architecture using the same carry-save adder. The carry-save adder makes extensive use of transmission gate logic to minimize transistor count and power consumption [11]. The transistor sizes were carefully tuned so that the variations in delay for different input signal combinations are minimized. The tight output delay distribution helps to minimize the overall dynamic power consumption in the multiplier [10]. SPICE simulations at 1 V reveal a worst-case delay of under 5 ns.

Fig. 10. Block diagram of the MAC unit.

D. Memory Subsystem
The on-chip memory subsystem includes a 6 K × 16 b SRAM and 48 K × 16 b ROM divided into separate program and data spaces.

1) Random-Access Memory: The memory subsystem allows two data operand reads or a long (32-b) read from any one of the three 2 K × 16 b on-chip SRAM blocks in a single machine cycle. If the program bus is used for coefficient access, three operands can be read in a single cycle. A divided word line architecture is used in the SRAM to reduce power [12]. Each 2 K × 16 b SRAM block is divided into two banks as shown in Fig. 11. The banks in turn are subdivided into four 256 × 16 b arrays, each with its own local word line driver. Consequently, each memory access only activates 1/8 of the entire array, thereby reducing the power consumption. The sense amplifier employs a latch-type design to reduce the static power dissipation and to ensure proper operation even at very low supply voltages.
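
The divided word line organization can be illustrated with a small address-decoding sketch. The geometry (two banks, four arrays per bank, 256 words per array) comes from the text. The assignment of address bits to bank, array, and word is a hypothetical choice made only for illustration, as is the function name `decode_address`.

```python
from fractions import Fraction

BANKS = 2                # banks per 2 K x 16 b SRAM block
ARRAYS_PER_BANK = 4      # 256 x 16 b arrays per bank, each with a local word line driver
WORDS_PER_ARRAY = 256
WORDS_PER_BLOCK = BANKS * ARRAYS_PER_BANK * WORDS_PER_ARRAY   # 2048 words


def decode_address(addr):
    """Split a word address into (bank, array, word).

    Hypothetical bit assignment: the low 8 bits select the word, the next
    2 bits select the array, and the top bit selects the bank.
    """
    if not 0 <= addr < WORDS_PER_BLOCK:
        raise ValueError("address outside the 2 K-word block")
    word = addr % WORDS_PER_ARRAY
    array = (addr // WORDS_PER_ARRAY) % ARRAYS_PER_BANK
    bank = addr // (WORDS_PER_ARRAY * ARRAYS_PER_BANK)
    return bank, array, word


# Only one of the eight arrays is activated per access.
ACTIVE_FRACTION = Fraction(1, BANKS * ARRAYS_PER_BANK)
```

Whatever the actual bit assignment, the point is that a single access enables one local word line driver out of eight, which is where the 1/8 activation figure comes from.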
High-Vt transistors are used in the six-transistor memory cell to keep the standby current to a minimum, while low-Vt transistors are used in the sense amplifier and all peripheral circuitry to allow for high-speed operation at 1 V and below. The dependence of the access time as a function of the threshold voltage will be described in greater detail in Section IV.

Fig. 11. On-chip SRAM architecture illustrating divided word line strategy for power reduction.

2) Read-Only Memory: There are two large ROM's in the DSP: a 32 K × 16 b program store ROM and a 16 K × 16 b coefficient data ROM. Both ROM's use the same basic architecture and cell design. They are constructed using multiple occurrences of a basic building block, an 8 K × 16 b ROM, illustrated in Fig. 12.

Fig. 12. Architecture of 8 K × 16 b building block used to construct data and program ROM's.

These are diffusion-programmed ROM's incorporating a number of design features to reduce power dissipation. They utilize selective precharge of bit lines, for essentially zero standby power dissipation and low operating power consumption. A 16 : 1 column mux ratio minimizes bit line capacitance and, therefore, access time and power consumption. The diffusion-programming feature results in a 40% memory array area savings with respect to via- or contact-programmed cores of the same capacity, further lowering bit line and word line capacitance and, thus, power. The ROM's feature dynamic page selection, which allows either 16-b or 32-b access, to meet system architecture requirements. The cell bit line contact is shared by two adjacent cells, to optimize both bit line precharge/discharge time and power consumption.

To improve ROM performance at low voltage, additional steps were taken. For example, all peripheral circuitry employs low-Vt devices; however, as in the SRAM, high-Vt devices are used in the array to minimize standby current. In addition, horizontal metal-2 straps were added every 32 cells along the word lines. Vertical metal-1 ground lines were connected to each other through metal-2 jumper lines to provide a more solid ground to the array. And finally, the row deselection circuitry was optimized to reduce word line recovery time, permitting a shorter cycle time.

IV. PROCESS
The chip was fabricated using a 0.35-µm, dual-Vt, twin-well CMOS process with three layers of metal. The minimum length of patterned gate poly is 0.25 µm. The gate oxide thickness is 5 nm. To fabricate both low- and high-Vt transistors on the same chip with minimal process overhead, a surface counter-doping implantation was performed on the channels of the low-Vt transistors in addition to the regular channel implant [13]. The nominal threshold voltages (defined as the gate voltage when the drain current reaches a current level of 0.1 µA × W/L, where W and L are the drawn width and length of the transistor) are 0.4 and 0.2 V, respectively. The threshold voltages quoted in an earlier publication [1] are different from the values used here. The discrepancy stems from the different definition of threshold voltage (a different drain-current criterion) adopted in the earlier publication. The leakage currents of the high- and low-Vt devices are below 1 nA/µm and 1 µA/µm, respectively. The drive current of the low-Vt (LVT) transistors is typically twice that of the high-Vt (HVT) devices. Key process characteristics are summarized in Table I.

TABLE I
KEY PROCESS TECHNOLOGY PARAMETERS
Technology     0.35-µm twin-well, dual-Vt CMOS with triple-metal
Gate Length    0.25 µm
Gate Oxide     5 nm
High Vt        0.4 V
Low Vt         0.2 V

To gain a better insight into the impact of threshold voltages on the performance of this DSP, SPICE simulation results of one of the critical paths, namely the on-chip SRAM access time, are presented in Table II. SPICE simulations reveal that the speed of the SRAM is not significantly degraded by using threshold voltages as high as 0.4 V in the memory cell. The speed of the SRAM is strongly dependent, however, on the threshold voltage used in the peripheral circuitry, which is therefore kept low. For a low and high Vt of 0.2 and 0.4 V, respectively, SPICE simulations indicate an access time of about 6 ns.

V. RESULTS
The chip measures 5.65 × 5.50 mm and contains 1.6 million transistors. A photomicrograph of the die is shown in Fig. 13. The DSP was functional on first silicon and was
packaged in a 128-pin TQFP. Since the power consumption of the DSP depends on the instructions being executed, it is necessary to specify the benchmark used during testing. The results presented here were obtained while repeatedly executing a 16-tap FIR filtering application. This algorithm heavily exercises the MAC unit, resulting in a fair, if not slightly pessimistic, measurement of power consumption. Furthermore, the program was configured to execute from on-chip memory, ensuring that the critical (SRAM access) path was exercised.

TABLE II
DEPENDENCE OF SRAM ACCESS TIME ON THRESHOLD VOLTAGE AT 1 V POWER SUPPLY AND NOMINAL CONDITIONS
Vt of LVT (V)   Vt of HVT (V)   Tacc (ns)   Relative Tacc
0.2             0.40            6.16        100%
0.2             0.35            5.87        95%
0.2             0.30            5.62        91%
0.05            0.40            5.00        81%
0.05            0.35            4.71        76%
0.05            0.30            4.46        72%

Fig. 13. Die photo of 1-V DSP implemented in 0.35-µm, dual-Vt CMOS technology.

The performance of the DSP as a function of supply voltage is shown in Fig. 14. In order to highlight the improvement made by this chip, curves are given for both the 1-V (0.35-µm) part described in this paper and for an earlier chip having the same architecture and circuit implementation but fabricated using a 0.6-µm, 3.3-V process with a single standard threshold voltage. The 0.35-µm chip is functional down to 0.6 V, reaching 63 MHz at 1 V and 100 MHz at 1.35 V. This amounts to roughly a 10× performance improvement relative to the 0.6-µm part for the same 1-V supply.

Fig. 14. Maximum operating frequency as a function of supply voltage.

Fig. 15 gives the power versus operating frequency curves for the two devices. To a fairly high degree of accuracy, power is a linear function of frequency. The slope of this curve is a power-performance metric for active operation in units of mW/MHz. This metric accounts only for dynamic power. The 0.35-µm chip achieves 0.21 mW/MHz at 1.0 V, which is a 19 times improvement over the 4.0 mW/MHz of the 0.6-µm chip at 3.3 V. At 60 MHz, this chip consumes only 17 mW, which is 15 times less power than the 3.3 V chip. The power at 60 MHz is improved by 15 times, rather than 19 times, because of the 4-mW standby power, which produces a dc offset in the power versus frequency curve (see the LVT + HVT curve in Fig. 16). The relatively large standby power can be attributed to the extensive use of low threshold voltage transistors.

Fig. 15. Power consumption as a function of frequency for a fixed supply voltage.

One way to reduce the standby power would be to avoid the use of LVT transistors. In fact, a second chip fabricated with all HVT devices in the same 0.35-µm technology had a measured standby power of less than 50 µW at 1 V. Unfortunately, this HVT version of the chip can only reach 35 MHz at 1.0 V. In order to attain the 63 MHz achieved by the LVT + HVT chip, the HVT chip must be operated at 1.3 V. At 1.3 V and 63 MHz, the HVT-only chip consumes 22.6 mW, which is 32% more power than the LVT + HVT chip operating at 1.0 V. Fig. 16 compares the power-performance metric of the HVT-only chip operated at 1.0 and 1.3 V, respectively, with that of the LVT + HVT chip operated at 1.0 V. The figure shows that going from 1.0 to 1.3 V degrades the power-performance metric from 0.21 to 0.36 mW/MHz.

An even better solution is to continue to use both low- and high-Vt devices, but to also add HVT switches (controlled by the power-down modes) in series with the supply. The sizing of these devices would be dictated by the peak current requirements of the various modules in the chip. With properly sized switches, the DSP at 1 V should approach the 63 MHz performance of the current LVT + HVT design, but with a standby power closer to the 50 µW of the HVT-only design.

Table III shows the approximate distribution of the power consumption at 1.0 V and 63 MHz for the LVT + HVT chip.
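
The relationship between the 19× slope improvement, the 15× total-power improvement, and the voltage dependence of the metric can be reproduced with a simple linear power model. This is a rough sketch rather than the authors' analysis: it assumes total power is a standby offset plus slope times frequency, neglects the standby power of the 0.6-µm chip, and assumes that dynamic power per MHz scales with the square of the supply voltage. The function names are illustrative.

```python
# Figures quoted in the text.
STANDBY_MW = 4.0     # standby power of the LVT + HVT chip at 1.0 V
SLOPE_NEW = 0.21     # mW/MHz, 0.35-um chip at 1.0 V
SLOPE_OLD = 4.0      # mW/MHz, 0.6-um chip at 3.3 V


def total_power_mw(freq_mhz, slope_mw_per_mhz, standby_mw=0.0):
    """Linear model: frequency-independent offset plus a dynamic term."""
    return standby_mw + slope_mw_per_mhz * freq_mhz


def scale_slope_with_voltage(slope_mw_per_mhz, v_from, v_to):
    """Assumption: dynamic power per MHz scales as the supply voltage squared."""
    return slope_mw_per_mhz * (v_to / v_from) ** 2


p_new_63 = total_power_mw(63, SLOPE_NEW, STANDBY_MW)
p_new_60 = total_power_mw(60, SLOPE_NEW, STANDBY_MW)
p_old_60 = total_power_mw(60, SLOPE_OLD)   # old chip's standby power neglected (assumption)

slope_ratio = SLOPE_OLD / SLOPE_NEW        # improvement in the dynamic metric
power_ratio = p_old_60 / p_new_60          # improvement in total power at 60 MHz
slope_at_1v3 = scale_slope_with_voltage(SLOPE_NEW, 1.0, 1.3)
```

Under these assumptions the model gives about 17 mW at 63 MHz, a slope ratio of about 19, and a total-power ratio at 60 MHz of roughly 14 to 15, which shows how the 4-mW offset erodes the improvement. The quadratic scaling gives about 0.35 mW/MHz at 1.3 V, close to the reported 0.36 mW/MHz.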
Fig. 16. Power consumption as a function of frequency for the 0.35-µm
LVT + HVT chip and the HVT-only chip. Two curves are shown for the
HVT-only chip: one is for a 1.0-V supply and another for a 1.3-V supply.
The curve for the LVT + HVT chip is obtained with a 1.0-V supply.

Fig. 17. Energy-delay product as a function of supply voltage.

TABLE III
APPROXIMATE POWER BREAKDOWN OF DSP

Component    Approximate Power Consumption (mW)    Fraction of Total Power
Clock        4.4                                   26%
Datapath     6.5                                   38%
Control      3.1                                   18%
Memory       3.1                                   18%
Total        17                                    100%

TABLE IV
CHIP STATISTICS AND PERFORMANCE DATA

Chip size                    5.65 × 5.50 mm² = 31.1 mm²
Transistor count             1.6 million
Package                      128-pin TQFP
Speed at room temperature    63 MHz at 1 V, 100 MHz at 1.35 V
Power                        17 mW at 1 V
Power–Performance Metric     0.21 mW/MHz at 1 V

The power-delay product has been used for many years as a figure of merit
in evaluating digital circuits. Recently, it has been proposed that the
energy-delay product (or power-delay-delay product) is a better metric to
analyze the tradeoff between energy efficiency and performance [14].
Fig. 17 shows the energy-delay product of the HVT + LVT and HVT-only chips
as a function of supply voltage. The LVT + HVT chip achieves a minimum of
4.2 nJ-ns at 1.0 V. Thus, 1.0 V is indeed the optimal operating voltage
for this chip. The HVT-only chip achieves a minimum of 5.7 nJ-ns at
1.20 V. The shift of the minimum to a higher voltage is due to the higher
threshold voltage in the HVT-only chip. For comparison, the 0.6-µm DSP
achieves its minimum at 1.8 V with a minimum value almost an order of
magnitude higher than that of the 1 V chip.

Some of the more important chip statistics presented in this section are
summarized in Table IV.

VI. CONCLUSIONS

This paper has described a low-power, fixed-point programmable DSP
optimized for wireless communication applications. From the outset, the
DSP was architected for low power consumption, employing a bus and memory
structure that facilitate a high degree of parallelism and a specialized
instruction set that enables efficient execution of common communication
algorithms such as the Viterbi butterfly. Furthermore, the DSP includes a
rich set of peripherals and a large amount of on-chip memory
(54 K × 16 b) that help to minimize chip count and off-chip I/O, thus
saving power at the system level.

The chip utilizes a hybrid power management strategy with automatic local
clock gating and user-controlled gating of global clocks. A flexible
PLL-based clock generator/multiplier gives the user fine-grain control
over the clock frequency as well. To facilitate efficient low-voltage
operation, the traditional analog PLL has been replaced with an almost
entirely digital implementation.

The design makes use of a 0.35-µm dual-Vt technology with 0.25-µm gate
lengths to enable good performance at 1 V. At this voltage, the chip
dissipates 17 mW and operates at 63 MHz with a power-performance metric of
0.21 mW/MHz. Moreover, the device is functional down to 0.6 V and reaches
100 MHz at 1.35 V. These results are indicative of the energy efficiency
that can be achieved using low-voltage digital design techniques. As
cellular phone manufacturers continue to drive down the system supply
voltage, the trend toward migrating more functionality to the digital
domain is likely to continue. This will fuel the demand for
high-performance, low-power DSP’s. The results presented here demonstrate
that high performance and low power consumption can be achieved
simultaneously through innovative technology and digital design
techniques.

ACKNOWLEDGMENT

The authors would like to thank D. Nowlen for mask generation; E. Born,
G. Stacey, and B. Garcia for assistance in testing; B. Strong and B. Fleck
for SRAM design; A. Amerasekera and C. Duvvury for advice on ESD design;
and B. Hewes, A. Shah, P. Yang, S. Shichijo, and G. Frantz for their
encouragement and support.

REFERENCES

[1] W. Lee et al., “A 1 V DSP for wireless communications,” in ISSCC
Dig. Tech. Papers, Feb. 1997, pp. 92–93.
[2] M. Izumikawa et al., “A 0.9 V 100 MHz 4 mW mm2 16 b DSP Core,”
in ISSCC Dig. Tech. Papers, Feb. 1995, pp. 84–85.

[3] S. Mutoh et al., “A 1 V multi-threshold voltage CMOS DSP with an
efficient power management technique for mobile phone application,”
in ISSCC Dig. Tech. Papers, Feb. 1996, pp. 168–169.
[4] Texas Instruments Inc., TMS320C54x User’s Guide, 1995.
[5] A. Chandrakasan, S. Sheng, and R. Brodersen, “Low-power CMOS
digital design,” IEEE J. Solid-State Circuits, vol. 27, pp. 473–483, Apr.
1992.
[6] E. Lee and D. Messerschmitt, Digital Communication. Norwell, MA:
Kluwer, 1988.
[7] I. Young, J. Greason, and K. Wong, “A PLL clock generator with 5–100
MHz of lock range for microprocessors,” IEEE J. Solid-State Circuits,
vol. 27, pp. 1599–1606, Nov. 1992.
[8] E. Vittoz, “Low power design: Ways to approach the limits,” in ISSCC
Dig. Tech. Papers, Feb. 1994, pp. 14–18.
[9] R. Best, Phase-Locked Loops: Design, Simulation, and Applications.
New York: McGraw-Hill, 1996.
[10] T. Sakuta, W. Lee, and P. Balsara, “Delay balanced multipliers for low
power/low voltage DSP core,” in Symp. Low Power Electronics Dig.
Tech. Papers, Oct. 1995, pp. 36–37.
[11] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A
Systems Perspective. Reading, MA: Addison-Wesley, 1985.
[12] M. Yoshimoto et al., “A divided word-line structure in the static RAM
and its application to 64 K full CMOS RAM,” IEEE J. Solid-State
Circuits, vol. 18, pp. 479–485, Oct. 1983.
[13] M. Nandakumar, A. Chatterjee, G. Stacey, and I.-C. Chen, “A 0.25
µm CMOS technology for 1 V low power applications—Device design
and power/performance considerations,” in Symp. VLSI Technology Dig.
Tech. Papers, June 1996, pp. 68–69.
[14] R. Gonzalez and M. Horowitz, “Energy dissipation in general purpose
microprocessors,” IEEE J. Solid-State Circuits, vol. 31, pp. 1277–1284,
Sept. 1996.

Wai Lee (S’80–M’89) received the B.S., M.S., and Ph.D. degrees in
electrical engineering and computer science from the Massachusetts
Institute of Technology, Cambridge, in 1983, 1986, and 1988, respectively.
From 1988 to 1993, he was a Research Staff Member at IBM T. J. Watson
Research Center in Yorktown Heights, NY, where he was first involved with
the device design for 0.5-µm Si and SiGe bipolar transistors. He later
joined the VLSI design department and led a circuit design team working on
a high-performance CMOS RISC microprocessor. In 1993, he joined the low
power design branch of the Integrated System Laboratory of Texas
Instruments, Dallas, TX, as a Member of Technical Staff. In 1997, he
became the manager of the DSP Circuits branch in the DSP R&D Center. His
present interests include low-power and high-performance DSP design for
communication applications. He has authored or co-authored more than 30
technical papers.

Paul E. Landman (S’92–M’95) received the B.S., M.S., and Ph.D. degrees in
electrical engineering and computer science from the University of
California, Berkeley, in 1989, 1991, and 1994, respectively. His research
focused on low-power digital design techniques and tools with an emphasis
on DSP applications.
After completing his dissertation, he joined the low-power design branch
of the DSP R&D Center of Texas Instruments, Dallas. His initial research
there focused on low-power algorithms, architectures, and circuits for
portable video applications. Most recently, he participated in the
development of a low-power programmable DSP targeted at wireless
communication applications.
Dr. Landman is a National Science Foundation Fellowship recipient and is
a member of Tau Beta Pi and Eta Kappa Nu.

Brock Barton received the B.S. and B.A. degrees in physics and
mathematics, respectively, from the University of Texas, Austin, in 1966.
He received the Ph.D. degree in solid-state physics from the Massachusetts
Institute of Technology, Cambridge, in 1971.
He is currently a Texas Instruments Fellow, in the DSPS R&D Center of
TI’s Corporate R&D organization, Dallas. He joined TI in 1972 and has
worked in TI’s Equipment Group, Central Research Laboratories,
Semiconductor Group, and Semiconductor R&D, a part of TI’s Corporate R&D
organization. He was responsible for the design of TI’s first standard
cell library. He has worked on a number of different programs, including
CCD imager and memory design and development, CMOS calculator chip product
engineering, and MOS memory and microprocessor process development, prior
to helping create TI’s ASIC technology. More recently, he was responsible
for fuzzy logic IC development and 1V DSP chip design in the DSPS R&D
Center. He is the author or co-author of more than 25 technical papers and
publications and holds five patents.

Shigeshi Abiko received the B.S. degree in computer science from the
Scientific University of Tokyo, Japan, in 1982.
He joined the Tokyo Design Center, Texas Instruments Japan Ltd. in 1982
and then worked on the development of programmable DSP devices. He is
currently a Senior Member of the Technical Staff.

Hiroshi Takahashi (A’87) was born in Saitama prefecture, Japan, on
November 11, 1959. He graduated from Shohoku College, (Sony gakeun)
Kanagawa, in 1980. His major was electronics.
In 1980, he joined Dai-Nippon Printing Co. In 1984, he joined Texas
Instruments (TI) Japan Ltd. He has been engaged in DSP design since 1985
and developed several DSP devices including EPROM versions at TI France
and at TI locations at Houston, Dallas, and Lubbock, TX. He is currently a
Senior Member of the Technical Staff at TI.

Hiroyuki Mizuno was born in Aichi prefecture, Japan, on February 23, 1969.
He received the B.S. and M.S. degrees in engineering from Tokyo
University, Japan, in 1991 and 1994, respectively.
In April 1994, he joined Texas Instruments Japan Ltd., Tokyo. Since then
he has been engaged in DSP development work.

Shigetoshi Muramatsu was born in Nagano prefecture, Japan. He received the
B.E. degree in applied physics from the University of Nagoya, Aichi,
Japan, in 1987.
He joined Texas Instruments Japan Ltd., Tokyo, in 1987. Since then, he has
been engaged in the development of several DSP devices.

Kenichi Tashiro was born in Osaka prefecture, Japan, on January 30, 1965.
He received the B.E. degree in electronics from the Osaka Institute of
Technology, Osaka, Japan, in 1987.
He joined Texas Instruments Japan Ltd., Tokyo, in 1987. Since then, he has
been engaged in the development of DSP devices.

Masahiro Fusumada received the B.E. degree in applied physics from the
University of Tokai, Japan, in 1987.
He joined Texas Instruments Japan Ltd., Tokyo, in 1987. Since then, he has
been engaged in the development of several DSP devices, including the
design of embedded SRAM’s.

Luat Pham received the B.S. degree in electrical engineering from the
Oregon State University, Corvallis, in 1980.
He joined Texas Instruments, Stafford, TX, in 1980. From 1980 to 1989 he
worked as a VLSI Design Engineer developing mega-modules and cell
libraries. From 1989 to 1996 he worked as a VLSI Design Engineer
developing submicrometer embedded SRAM and ROM silicon compilers in
support of application-specific product designs. In 1996, he joined the
high-performance DSP program as a VLSI Design Engineer developing
floating-point datapath circuits. He holds one patent with three others
pending. His interests are in the areas of design for reuse,
high-performance datapath, and memory compiler design.

Frederic Boutaud (M’88) received the diplome d’ingenieur in electricity
and electronics from Ecole Centrale de Lyon, France, in July 1978.
After two years working on computers for aircraft automatic pilot, he
joined Texas Instruments, Villeneuve Loubet, France, in 1980, where he
became a Senior Member of Technical Staff and a Design Manager. He has
been working mainly on digital signal processor architecture and design,
digital filters for delta–sigma converters, and low power design for
telecommunication applications. He authored and co-authored several papers
and holds more than 12 patents in the area of digital design, delta–sigma
converters, video processor and DSP architecture. In 1996, he moved to
Boston, MA, and joined the Communication Division of Analog Devices where
he is currently a Design Manager.

Emmanuel Ego received the engineer degree in microelectronics from
Institut Superieur d’Electronique du Nord, Lille, France, in 1992.
He joined Texas Instruments France, Villeneuve Loubet, in December 1994.
He has been designing PLL’s for DSP circuits since then. He is now
involved in the design of digital blocks to be integrated in the next
generation of low-power DSP cores.

Girolamo Gallo received the master’s degree in physics magna cum laude
from Naples University, Italy, in 1989.
He joined TI Italy, Avezzano, as an IC Design Engineer in 1989. In
1989–1990 he participated in the 16-Mb DRAM design in Houston, TX. Back in
Italy, he was involved in the design of application specific nonvolatile
memories. In 1992–1994 he was Project Manager for the Hand Printed
Character Recognition (HPCR) program, aimed at devising and implementing
new handwriting recognition algorithms to be ported to pen-computing
systems. He is currently a Member of the TI Semiconductor Group Technical
Staff and the Team Leader for ROM and Flash EEPROM memories development at
the TI European MOS Memory Design Center in Avezzano, Italy. He holds nine
patents and has published or presented several papers in international
technical journals and conferences.

Hiep Tran, photograph and biography not available at the time of
publication.

Carl Lemonds received the B.S. and M.S. degrees in electrical engineering
from the University of Missouri at Columbia.
During graduate school he joined Texas Instruments in Lubbock, TX, working
in the consumer products division. Later, he joined Texas Instruments in
Dallas working as a Design Engineer on catalog CMOS products. In 1984 he
joined the Defense Systems group where he worked on various designs using
CMOS gate arrays and standard cells. He transferred to Corporate Research
as a Member of Technical Staff in 1989 and worked on TI’s first BiCMOS
gate array. Much of his recent work has been in the area of low power
multipliers. Currently, he is researching high-performance circuit
techniques for multipliers and other CMOS datapath circuits for DSP core
applications.

Albert Shih received the B.S. degree in electrical engineering from
National Taiwan University, Taiwan, in 1987 and the M.S. and Ph.D. degrees
in electrical engineering from the University of Texas at Austin in 1991
and 1994, respectively.
He spent the summer of 1991 at Texas Instruments, working on the modeling
of crosstalk in multilevel interconnect VLSI systems. In 1994, he joined
the Submicron Logic Process Development group of Texas Instruments,
Dallas, where he was involved in the device design and yield enhancement
of logic products. Since 1995, he has worked on low-power CMOS device and
process integration and the prototyping of the first fully functional 1-V
DSP chip presented in this paper. Currently, he is engaged in the design
of high performance dynamic random access memory circuits.

Mahalingam Nandakumar was born in Madras, India, on December 10, 1964. He
received the B. Tech degree in electrical engineering from the College of
Engineering, Trivandrum, India, in 1987 and the M. Tech degree in
electrical engineering with specialization in integrated circuits and
systems from the Indian Institute of Technology, Madras, India. He
received the Ph.D. degree in electrical engineering from North Carolina
State University, Raleigh, in 1993.
In his doctoral thesis work, he experimentally demonstrated a new MOS
gated power semiconductor device called the base resistance controlled
thyristor (BRT) for the first time. Since 1993 he has been employed at the
Semiconductor Process and Design Center of Texas Instruments, Dallas,
where he has been responsible for developing a 0.25/0.18-µm CMOS
technology for low power applications. His research interests include
development of novel semiconductor device structures and submicrometer
CMOS device design.

Robert H. Eklund (M’86) received the B.S. degree from Davidson College,
Davidson, NC, in 1974 and the Ph.D. degree in chemistry from the
University of Texas at Austin in 1979.
After graduation, he joined Texas Instruments, Dallas, as a Member of the
Technical Staff. Since joining Texas Instruments, he has worked in the
area of materials characterization and on process development and
integration for various programs at TI including Schottky diode processes
for the VHSIC STL program, submicrometer bipolar process development,
BiCMOS on SIMOX process development, and submicrometer CMOS and BiCMOS
process development. Currently, he is a TI Fellow and Manager of the Logic
Process Technology branch responsible for the CMOS and BiCMOS process
integration. He has authored and coauthored numerous papers and patents
issued in the field of microelectronics.

Ih-Chin Chen (S’85–M’86–SM’96) received the B.S. degree in electrical
engineering from National Taiwan University in 1980 and the M.S. and Ph.D.
degrees in electrical engineering and computer sciences from the
University of California, Berkeley, in 1984 and 1986, respectively.
From 1980 to 1982, he served as a Teaching Officer in the Electronic
Training Center of the Chinese Army Missile Command. From 1983 to 1986, he
was a Postgraduate Researcher at the University of California, Berkeley.
He joined Semiconductor Process and Design Center, Texas Instruments,
Dallas, in 1987. From 1987 to 1991, he had worked on transistor and
isolation design, storage dielectrics development, and cell leakage
reduction for 4, 16, and 64 Mb DRAM. Since 1991, he has been the Manager
of Advanced Transistor/Isolation branch focusing on high-performance CMOS
for advanced microprocessors, 1 V CMOS for low power applications,
transistor and isolation design for 256 Mb DRAM, shallow trench isolation,
and sub-0.1 µm device research. He was elected as a TI Fellow in 1996. He
holds several patents, and has authored or co-authored 88 technical
papers.
Dr. Chen has served on several international conferences, including IEDM
(1993–1995 as a Program Committee Member and 1996 as the Subcommittee
Chairman of the CMOS Devices and Reliability Subcommittee), Device
Technology Conference of SPIE Microelectronic Manufacturing Symposium
(1995–1996, as the Conference Chairman), IEEE Symposium on Low Power
Electronics (1996–1997, as a Program Committee Member), and the
International Symposium on VLSI Technology, Systems, and Applications
(1997, as a Program Committee Member).
