
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006 179

Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor
Dac C. Pham, Tony Aipperspach, David Boerstler, Mark Bolliger, Rajat Chaudhry, Dennis Cox,
Paul Harvey, Paul M. Harvey, Member, IEEE, H. Peter Hofstee, Member, IEEE, Charles Johns, Jim Kahle,
Atsushi Kameyama, Member, IEEE, John Keaty, Member, IEEE, Yoshio Masubuchi, Member, IEEE,
Mydung Pham, Jürgen Pille, Stephen Posluszny, Mack Riley, Daniel L. Stasiak, Masakazu Suzuoki,
Osamu Takahashi, Member, IEEE, James Warnock, Member, IEEE, Stephen Weitzel, Member, IEEE, Dieter Wendel,
and Kazuaki Yazawa, Member, IEEE

Manuscript received May 15, 2005; revised August 15, 2005.
D. C. Pham, D. Boerstler, M. Bolliger, R. Chaudhry, P. Harvey, P. M. Harvey, H. P. Hofstee, C. Johns, J. Kahle, J. Keaty, M. Pham, J. Pille, S. Posluszny, M. Riley, D. L. Stasiak, O. Takahashi, and S. Weitzel are with the IBM Systems and Technology Group, Austin, TX 78717 USA (e-mail: dacpham@us.ibm.com).
T. Aipperspach and D. Cox are with the IBM Systems and Technology Group, Rochester, MN 55901 USA.
A. Kameyama and Y. Masubuchi are with Toshiba America Electronic Components, Austin, TX 78717 USA.
M. Pham is with the IBM Systems and Technology Group, 11501 Burnet Road, Austin, TX 78758 USA.
M. Suzuoki and K. Yazawa are with Sony Computer Entertainment, Tokyo, Japan.
J. Warnock is with the IBM Systems and Technology Group, Yorktown Heights, NY 10598 USA.
D. Wendel is with IBM Entwicklung GmbH, Boeblingen 71032, Germany.
Digital Object Identifier 10.1109/JSSC.2005.859896

Abstract—This paper reviews the design challenges that current and future processors must face, with stringent power limits, high-frequency targets, and the continuing system integration trends. This paper then describes the architecture, circuit design, and physical implementation of a first-generation Cell processor and the design techniques used to overcome the above challenges. A Cell processor consists of a 64-bit Power Architecture processor coupled with multiple synergistic processors, a flexible IO interface, and a memory interface controller that supports multiple operating systems including Linux. This multi-core SoC, implemented in 90-nm SOI technology, achieved a high clock rate by maximizing custom circuit design while maintaining reasonable complexity through design modularity and reuse.

Index Terms—Bus interface controller (BIC), cell processor, element interconnect bus (EIB), flexible IO, hardware content protection, local store, media-centric computing, memory interface controller (MIC), modularity, multi-core, natural human interaction, Power Architecture, power processor element (PPE), real-time system, simultaneous multi-threading, SoC, synergistic processor, virtualization, 90-nm SOI.

I. INTRODUCTION

THE architectural vision of “bringing supercomputer power to everyday life” is the driving force behind the Cell processor design, setting a new performance standard by exploiting parallelism while achieving high frequency [1]. Cell is designed for natural human interactions: photorealistic, predictable real-time response, and virtualized resource for concurrent activities. Cell supports multiple operating systems including Linux, and is designed for flexibility with a wide variety of application domains. Other attributes include hardware content protection and extensive single-precision floating-point capability. By extending the Power Architecture with synergistic processor elements (SPE) having coherent direct memory access (DMA) to system storage and with multi-operating system resource management, Cell supports concurrent media-centric and conventional computing. With a dual-threaded power processor element (PPE) and eight SPEs, this implementation is capable of ten simultaneous threads and over 128 outstanding memory requests.

II. ARCHITECTURE OVERVIEW

The first-generation Cell processor consists of eight SPEs [2], each with its own local store (LS) [3], the PPE and its L2 cache, a high bandwidth internal element interconnect bus (EIB), a memory interface controller (MIC), a bus interface controller (BIC) providing two configurable noncoherent IO interfaces, a pervasive unit that supports extensive test, monitoring, and debug functions, a power management unit (PMU), and a thermal management unit (TMU). The high-level chip diagram is shown in Fig. 1, along with a die photograph in Fig. 2. Each of these major components is described in turn in the following discussion.
The SPEs share system memory with the PPE via coherent translated and protected DMA, but store data and instructions in a private real address space supported by a 256-kB local store, with one array dedicated to each SPE. The SPEs provide much of the computational performance of the Cell processor. Each of the eight processors contains a 128-bit wide dual-issue SIMD dataflow that is fully pipelined for all operations except double-precision floating-point. Operands are provided by a unified 128-entry by 128-bit-wide register file. The single-ported LS supports full-bandwidth concurrent read and write DMA to the EIB, as well as 16-byte SPE loads and stores and instruction (pre)fetch. The SPEs access main storage by issuing DMA commands with an effective address (EA) to the associated memory flow control (MFC). The MFC applies the standard Power Architecture address translation to the EA and
0018-9200/$20.00 © 2006 IEEE

Fig. 1. Processor high-level diagram.

Fig. 2. Processor die photograph.

then asynchronously transfers data between the local storage and main storage. This allows overlapping communication and computation, facilitating real-time operation. SPE access to shared memory via DMA, a large register file, and standard in-order execution semantics provide a general-purpose streaming programming environment. Each SPE can be dynamically configured to operate in a mode in which its resources can only be accessed by validated programs. The SPE is organized in a modular fashion, so that a single integrated SPE physical entity is placed eight times on the chip.
The PPE is a 64-bit processor compliant with the family of Power Architecture processors [4], [5]. It is implemented as a dual-threaded core with the integer, floating point, VMX, and MMU units of the Power Architecture. The processor contains 32-kB instruction and data caches, a 512-kB L2 cache, and on-chip bus interface logic. The PPE core is a completely new implementation, built from the ground up with an extended pipeline to achieve a low FO4 cycle time to match the SPE. The core is an enhanced in-order design with a moderate pipeline length to provide state-of-the-art performance capability. The PPE has been extended with resource management tables for the cache and translation tables to support real-time operations. Through memory mapped IO (MMIO) control registers, the PPE can also initiate DMA requests on behalf of an SPE and can support communication with SPE mailboxes. It also implements the Power Architecture hypervisor extension to allow multiple concurrent operating systems to be run on it at the same time through thread management support.
The EIB is central to the Cell processor. This coherent bus transfers up to 96 bytes per processor cycle. The bus is organized as four 16-byte-wide (half-rate) rings that each support up to three simultaneous data transfers. A separate address and
PHAM et al.: ARCHITECTURE, CIRCUIT DESIGN, AND PHYSICAL IMPLEMENTATION OF A FIRST-GENERATION Cell PROCESSOR 181

command network manages bus requests and the coherence protocol. Each of the 12 on-ramps to the bus and 12 off-ramps is 16 bytes wide and operates at half the base clock rate. Fig. 2 shows the location of the EIB relative to the other units within the processor die. Fig. 3 shows the physical interleaving of the bus, designed to minimize noise coupling issues.

Fig. 3. Element interconnect bus (EIB).

The MIC supports two Rambus XDR memory banks. Each bank is 36 bits wide with an independent control bus. The two banks are interleaved in the processor’s address space, but for increased system flexibility, the MIC can be configured to support just a single bank of XDR memory. For added reliability, the MIC supports ECC and a periodic ECC scrub of memory. The XDR interface operates asynchronously to the processor and IO interfaces; hence, the MIC contains speed-matching SRAM buffers and logic, and requires two clock domains. One clock domain operates at half the global processor clock rate, while the other domain operates at half the rate of the XDR interface. When the processor frequency is lower than that of the XDR, the MIC is configured to pace memory requests in order to avoid overrunning the SRAM buffers. The asynchronous interface not only provides for greater flexibility, but the separation of clocks also allows the processor to be stopped and its system state scanned out without affecting the training of the transceiver. This training is required as a result of the high speed of the XDR interface, with the training sequence performed by either the PPE or an SPE.
The BIC provides two flexible interfaces with varying protocols and bandwidth capabilities to address differing system requirements. The interfaces can be configured as either two IO interfaces (IOIF 0/1) or as an IO and a coherent SMP interface (IOIF and BIF). With seven transmit and five receive bytes of Rambus RRAC IOs, this design provides substantial bandwidth support. For increased flexibility, the RRAC transmitters and receivers operate asynchronously to the processors and memory, and the available bandwidth is configurable between the two interfaces. In addition, the RRACs for one of the interfaces can be run at half rate. The design thus supports multiple configurations without the need to redesign or repackage the chip. The BIC provides an asynchronous interface between the EIB and the two IO interfaces, and hence contains speed-matching SRAM buffers and logic with three clock domains. The processor side operates at half the global rate, while the IO side operates at one third the rate of the RRACs, with a small distribution network operating at half the rate of the RRACs. For BIC and RRAC calibration, required to support the high speed of the transmitters and receivers, an elastic buffer is used to eliminate the skew between the bytes comprising the interface.
The pervasive unit contains all of the global logic needed for basic functional operation of the chip, lab debug, and manufacturing test. For each of these areas, described in detail below, debug mechanisms have been incorporated into the design to assist with bring-up and characterization of the logic and circuits.
The pervasive unit provides several features needed to support basic functional operation of the chip. The unit contains a serial peripheral interface (SPI) that is used by an external controller to communicate with the chip during normal operation, and to receive status information from the pervasive unit. The SPI is also used by the external controller to train the high-speed IO. In addition, the pervasive unit provides the control for the clock generation and distribution. This logic enables the correct phase-locked loop (PLL) functions on the chip, determines which clocks are active on each of the different clock grids, enabling or disabling asynchronous clocks as required, and maintains the phase relationships between full-speed and half-speed clocks across the chip. Power-on reset (POR) is another base function that is provided here. The POR engine for the Cell processor is a 32-instruction state machine that systematically initializes all the units of the processor. The POR engine has four hard-coded initialization sequences that can be selected by configuration pins on the chip. The POR engine also has a debug mode that allows its instructions to be single-stepped, skipped, or performed out of order.
For lab debug and bring-up, the pervasive unit contains chip-level error checking and reporting mechanisms. Global fault isolation registers are provided to allow the operating system to quickly determine which unit generated an error condition. To assist with performance analysis, a performance monitor (PFM) is provided, which consists of a centralized unit connected to all functional units on the chip via the trace/debug bus. To assist with debug, an on-board trace logic analyzer (TLA) captures and stores internal signals while the chip is running at full speed. The TLA is programmable and allows complex trace/capture sequences to be created. An on-chip control processor, with an IEEE 1149.1 interface, is also available to help with debug. The pervasive logic interfaces with this processor to control and monitor the chip internal logic.
For manufacturing test, the pervasive unit supports 11 different test modes, which include debug mechanisms to assist with test pattern development and bring-up. Array built-in self-test (ABIST) is one such mode controlled by the pervasive unit. The ABIST engines are programmable to provide debug capability and the ability to add customized test patterns during manufacturing test. All ABIST engines are operated in parallel to reduce test time. Logic BIST (LBIST) is also provided, by means of a centralized controller housed in the pervasive unit and 15 LBIST satellites elsewhere in the design. Both ABIST and LBIST are capable of scanning and capturing data at full functional speed to improve AC test speed, and all internal scan chains are timed at the full functional speed to support this requirement. Finally, the pervasive unit provides the logic used for programming and testing electronic fuses (eFuses), which are used for array repair and chip customization during manufacturing test.
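The EIB peak transfer rate quoted earlier in this section (up to 96 bytes per processor cycle) follows directly from the stated ring organization. The short sketch below reproduces that arithmetic; the function and parameter names are illustrative and do not come from the paper, and the result is a peak figure that assumes every ring carries its maximum number of simultaneous transfers.

```python
def eib_peak_bytes_per_cycle(rings=4, ring_width_bytes=16,
                             transfers_per_ring=3, clock_divider=2):
    """Peak bytes moved per processor cycle across all EIB data rings.

    Defaults are the figures stated in the text: four rings, each
    16 bytes wide, up to three simultaneous transfers per ring, and
    rings clocked at half the processor rate (clock_divider=2).
    """
    bytes_per_ring_clock = rings * ring_width_bytes * transfers_per_ring
    return bytes_per_ring_clock // clock_divider


peak = eib_peak_bytes_per_cycle()
per_ring = eib_peak_bytes_per_cycle(rings=1)
print(f"peak: {peak} bytes/cycle, per ring: {per_ring} bytes/cycle")
```

Sustained throughput depends on traffic patterns and arbitration, which this calculation does not model.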

Fig. 4. Digital thermal sensor schematic.

Power dissipation and thermal design are two key issues which were considered early in the development of the first-generation Cell processor. For power reduction, a power management unit (PMU) provides a mechanism to allow software controls to reduce the chip power when the full processing capabilities are not needed. The PMU allows the operating system to throttle, pause, or stop single or multiple units, or the entire chip, in order to manage chip power. For thermal monitoring and control, the processor employs two types of thermal sensors and a hardware-controlled thermal management unit (TMU). A linear diode connected to two module pins allows an external device to monitor the temperature of the processor. This sensor is located in a relatively constant temperature location, giving a reading of the global temperature of the processor. This sensor is designed to be used for controlling external cooling mechanisms. Ten digital thermal sensors (DTS) are distributed on the chip to monitor temperatures in critical local regions. One DTS is located in each element and one is adjacent to the linear sensor. The TMU continuously monitors each digital thermal sensor and can be programmed to dynamically control the temperature of each processing element and interrupt the PPE when a temperature specified for each sensor is observed. The temperature of each element is controlled by adjusting the instruction issue rate (i.e., throttling) of a processing element based on the temperature of the associated DTS. Software controls the TMU by setting four temperature values and the amount of throttling for each sensor in the TMU. In increasing temperature, the first temperature specifies when the throttling of an element stops, the second when throttling starts, the third when the element is completely stopped, and the fourth when the chip's clocks are shut down. The first two temperature levels provide some level of hysteresis to avoid frequent transitions between throttling and normal operation. The third level stops the processing element when throttling alone is not enough to control the temperature. As long as the temperature is maintained below the third level, the processing element is guaranteed to achieve a certain performance level to meet the real-time needs of an application. When the temperature exceeds the safe operating range, specified by the fourth level, the TMU automatically shuts down the chip clocks to avoid permanent damage to the chip.
The DTS employed in the first-generation Cell processor is a diode-based design with a programmable temperature set point. Fig. 4 shows a schematic of the DTS. The diode reference voltage (Vref_diode) linearly decreases with increasing temperature while the reference voltage (Vref) linearly increases with an increase in temperature. By comparing Vref_diode to Vref, the temperature can be determined to be below or above the temperature associated with the reference voltage. The DTS provides many reference voltages and an analog multiplexer for the selection of which Vref to compare. The TMU constantly changes the selection to find the reference voltages where the comparison (Tx Detect) changes. Once these are determined, the temperature is known to lie within the range associated with those reference voltages. The voltage difference between each reference corresponds to a 2 °C temperature difference. While the relative difference is constant, the absolute temperature varies with process. To eliminate the process variance, each DTS is calibrated during manufacture of the chip.

III. CMOS TECHNOLOGY SOI AND PACKAGING TECHNOLOGY OVERVIEW

The Cell processor was designed in a 90-nm PD-SOI process. The choice of the technology characteristics was crucial in being able to meet the design goals of the chip. Previously, designers of high-speed systems were mainly concerned with performance and area, but current and future designs must meet the triple constraints of power, performance, and area. In addition to system-level controls such as clock gating, and circuit-level controls such as low-power latch designs, technology choices of device thresholds, voltages, oxide thicknesses, and channel lengths are keys to controlling the power. Table I shows the technology parameters.
The 90-nm device targets included a 10.5-Å gate oxide thickness, 46-nm channel length, and a nominal voltage of 1 V. This device had the requisite performance to achieve the frequency goal at the 11FO4 design point. The optimum point for the device definition was a balance between the various characteristics. A thinner gate oxide would result in potentially higher performance but with significantly higher gate tunneling current and reliability concerns, especially at the slower end of the process window. A shorter channel length or lower threshold voltage would result in potentially improved performance at the expense of significantly increased leakage current, especially at the faster end of the process window. A higher voltage would result in more performance but higher AC and DC power across the process window. A detailed study of these gradients was conducted to determine the optimum design point. Since not all devices are used in circuits with critical delay paths, an additional higher threshold voltage (HVT) was used to help reduce the power of the chip. The HVT device was used throughout the design in logic blocks, array cells, and analog designs.
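Returning to the thermal management scheme of Section II, the four software-programmed temperature levels of the TMU can be read as a small state machine with a hysteresis band between the first two levels. The following is a minimal sketch of that reading, not the hardware implementation: the state names, the function `next_state`, the use of greater-than-or-equal comparisons, the latching of the clock shutdown state, and the example threshold values are all assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    THROTTLED = "throttled"
    STOPPED = "stopped"
    CLOCKS_OFF = "clocks_off"


def next_state(state, temp, t_end, t_start, t_stop, t_shutdown):
    """One TMU decision for one sensor (t_end < t_start < t_stop < t_shutdown).

    t_end: throttling stops below this level (first level in the text)
    t_start: throttling starts at this level (second level)
    t_stop: the element is stopped at this level (third level)
    t_shutdown: chip clocks are shut down at this level (fourth level)
    """
    if state is State.CLOCKS_OFF or temp >= t_shutdown:
        return State.CLOCKS_OFF  # assumed to latch; recovery is not described
    if temp >= t_stop:
        return State.STOPPED
    if temp >= t_start:
        return State.THROTTLED
    if temp < t_end:
        return State.NORMAL
    # Hysteresis band between t_end and t_start: stay normal if already
    # normal, otherwise keep throttling until the temperature drops further.
    return State.NORMAL if state is State.NORMAL else State.THROTTLED


# Hypothetical thresholds in degrees Celsius, chosen only for the example.
levels = dict(t_end=75, t_start=80, t_stop=90, t_shutdown=100)
state = State.NORMAL
for temp in [70, 82, 78, 74, 92, 80, 101]:
    state = next_state(state, temp, **levels)
    print(temp, state.name)
```

In this sketch a reading of 78 keeps the element throttled after it has crossed 80, which is the hysteresis behavior the text attributes to the first two levels.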

TABLE I
TECHNOLOGY PARAMETERS

Analog designs required additional devices with significantly longer channel lengths and a thicker gate oxide. The thicker oxide device relied on a 22.5-Å gate oxide, and channel lengths up to 20 times the minimum value were used. In addition, in PD-SOI designs, body-contacted devices are used for better control and tracking of threshold voltages. Chip decoupling capacitors had the added option of using an intermediate oxide thickness of 15 Å.
The metal stack used in wiring and powering the chip was another major design consideration. It was important to match the performance of the device with a low-k high-performance metal approach. The optimization of the wiring scheme was carried out in a fashion similar to that of the device definition, subject to the triple constraints described above. Too few levels of wiring would have increased chip area, and too many levels would have added cost without reducing the chip size. An eight-layer metal stack was used with five layers of 1× wiring, one layer of 2× wiring, and two layers of 6× wiring, in addition to a local interconnect.
The electronic package design was driven by the high off-chip signal bandwidth and low-noise power delivery requirements of the Cell processor. A systematic and comprehensive study of various packaging options resulted in an optimized flip-chip plastic ball grid array (FC PBGA) package comprising a thin core, multi-layer build-up substrate with decoupling capacitors mounted on the opposite side of the package substrate directly underneath the chip, and a thick lid in thermal contact with the backside of the die via a thermal interface material, as illustrated in Fig. 5.

Fig. 5. Cross section of Cell processor package.

Careful planning and coordinated co-design of the chip, package, and prospective system afforded optimal performance at minimum cost. In this design, high-speed differential nets are routed off two sides of the chip into corresponding sides of the package. The high-speed nets are routed in the top build-up layers and then transitioned to the BGA balls through optimized signal via structures. Isolated analog supplies that service the IO partitions on the chip are routed in the bottom build-up layers from interior BGA locations in the same region of the package. Other sides of the chip and package are principally dedicated to core power distribution as well as to PLL, test, and various other miscellaneous signals. The Cell processor is designed for power-limited applications, and employs gating algorithms and other methods to intelligently reduce overall power usage. While these techniques contribute to the overall power reduction, they can also spawn unique noise events which must be taken into account in package design and analysis by ensuring a robust mid-frequency AC noise response. The package substrate facilitates fine-pitch mechanical drilling which minimizes the effective inductance loop between the chip and the module decoupling capacitors and enhances the mid-frequency response to on-chip current fluctuations. Continual advances in fine-pitch substrate drilling have enabled high-performance laminates at reasonable cost.
Statistically derived design rules which account for all manufacturing tolerances were used to route all high-speed differential nets. Thermomechanical reliability tradeoffs were also considered in the optimization of the via structures used in the package. Signal and power distribution integrity in the package

Fig. 6. PLL block diagram.


Fig. 7. Phase-frequency detector.

was verified by quasi-static 2-D and 3-D and full-wave 3-D electromagnetic extraction and simulation and validated by passive characterization of specifically designed test vehicles. Final verification was done by passive and active characterization of the initial prototype hardware.

IV. CIRCUIT DESIGN OVERVIEW

A. PLL
Fig. 6 shows a block diagram of the PLL [6] containing a differential receiver, a phase-frequency detector (PFD), charge-pump (CP), current reference (IREF), filter capacitor/low-pass filter (LPF), voltage-controlled oscillator (VCO), and frequency dividers in the forward (VCODIV) and feedback (FBDIV) paths for frequency-scaling capabilities. The PLL power is supplied by a separate analog supply (AVDD), while the logic circuits and PLL-to-logic level shifters are powered by the regular chip supply (VDD). The typical supply voltages for both AVDD and VDD range from 0.8 to 1.2 V.
Self-resetting CMOS PFD architectures have previously been described [7] which effectively eliminate any dead zone using a combination of simple static and dynamic circuits. This PFD design used similar concepts but with enhanced resetting techniques for extending the performance to higher frequencies. Fig. 7 shows the PFD topology consisting of symmetric paths containing enabling latches, dynamic triggers, and pulse stretchers for the reference clock (REF) and feedback clock (FB) paths, and a common reset generator path. When the REF signal arrives before FB, the dynamic trigger for REF responds to the rising edge of REF, causing Q to rise and issuing a fast localized disable of the trigger. The UP output is then asserted after a delay created by the two inverters. A subsequent rising transition of FB similarly causes the FB trigger to respond and issue its own local disable. After the delay of the three inverters, the UP output is stretched. The reset generator signal R1 now forces a reset to both enabling latches in preparation for the reset of the triggers. After some delay, the reset generator signal R pre-charges both dynamic triggers, which in turn force R and R high, completing the reset process. Finally, when REF and FB falling transitions occur, the triggers are enabled in preparation for the next cycle.
To reduce the large transient turn-on currents that can occur in the CP, a conventional CP circuit topology was augmented with complementary current paths, which prevent internal switching nodes from drifting toward the supply rails during the high-impedance OFF state [6]. Matched replica devices are also added to the CP and are connected to opposite polarities of the control signals from the PFD to counteract the error introduced by parasitic gate-source and gate-drain capacitances during switching. Direct feed-forwarding of the PFD signals into the VCO was used to avoid the problems normally associated with integrated PLL resistors, such as high capacitance and lack of adjustability.
The VCO [8] consists of a ring of five delay elements with interleaved control/feed-forward connections as shown in Fig. 8, where each delay element consists of a main ring inverter, a control section, and a feed-forward section (Fig. 9). The VCO uses mostly two levels of device stacking, contributing to large signal swings, high operating frequency, and low-voltage operation. Simulated VCO gain for the interleaved VCO design is 9.4 GHz/V.

Fig. 8. Interleaved VCO with five delay elements.

Body-contacted devices were used in areas of the PLL where the history effect [9] caused by SOI may be a concern, although this usually resulted in a larger area and increased parasitics. Body contacts were also used in the analog circuits to improve threshold voltage matching and to reduce current variation with drain-source voltage.
Fig. 10 shows the lock range of the PLL as a function of AVDD, bounded by the maximum and minimum lock frequencies. The measured maximum lock frequency

Fig. 9. VCO delay element.

increases roughly linearly with supply voltage from 6.6 GHz at 0.804 V to 16.4 GHz at 1.337 V and maintains an average ratio of maximum to minimum lock frequency of 2.6 over this range. Above 1.3 V, the slope of the maximum lock frequency decreases significantly, and a maximum lock frequency of 17.5 GHz was measured at 1.819 V for this part. 10.0 GHz operation was achieved at a supply voltage of less than 0.960 V. The maximum PLL lock frequency measured was 21.424 GHz.

Fig. 10. PLL lock range versus analog supply voltage Avdd.

Cycle-to-cycle (C-C) jitter measurements were taken using a proprietary time-interval analysis tool. Fig. 11 shows the C-C jitter histogram taken at 1.2 V for the PLL output divided by 32 with jitter of 12.7 ps peak-to-peak, 1.57 ps rms. The minimum C-C PLL jitter measured was 9.48 ps peak-to-peak, 1.09 ps rms.

Fig. 11. PLL jitter histogram at 16 GHz cycle-to-cycle jitter = 12.7 ps peak-to-peak, 1.57 ps rms.

B. Clock Distribution
The chip contains three distinct clock distribution systems, each sourced by an independent PLL, to support processor, bus interface, and memory interface requirements. A main high-frequency clock grid [10] covers over 85% of the chip, delivering the clock signal to processors and miscellaneous circuits. Second and third clock grids, each operating at fractions of the main clock signal frequency and covering areas of 8 mm² and 15 mm², respectively, are interleaved with the main clock grid structure, creating multiple clock frequency islands within the chip. In these interleaved regions, logic circuits are permitted to operate from either the main clock frequency or the auxiliary clock frequency, or from both.
The clock distribution is an extension of a previous Power clock distribution design [11], enhanced for significantly higher performance over the predecessor designs, and for lower power dissipation at a given clock frequency. All clock grids were constructed on the lowest impedance final two layers of metal, and supported by a matrix of over 850 individually tuned buffers. This enabled control of the clock signal arrival times and transition rates, especially for the main clock grid, which covers most of the area of the chip and encompasses regions of widely varying clock load densities. Collectively, the three clock grids required over 1100 clock buffers and 19.4 m of metal. Higher clock speeds were attained by reducing wire lengths between the grid and grid buffer, as well as by reducing distances between grid buffers.
One of the clock distribution design goals was to significantly reduce power dissipation over previous designs [10], [11]. Early studies showed that much of the clock distribution power dissipation was determined by three main categories of capacitances: clock load, lower-level twig and mesh wires, and grid clock buffers (see Fig. 12). Clock load, consisting of mostly macro gate capacitance and minor local clock wires, was monitored for excessive capacitance. Clock twig wires, connecting clock loads to the clock grid, had widths tuned within reserved physical regions, thereby decreasing both their area and perimeter wire capacitance components. For the chip, wire width tuning decreased clock twig wire capacitance by as much as 42% without sacrificing local clock skew. Less clock load and smaller capacitance in the clock twig wires and lower-level grid wires allowed a reduction in grid buffer drive strengths. Further reduction in grid buffer drive strength was achieved through matching one of seven buffer drive strengths to the local grid and clock load capacitances. Together, these techniques lowered clock distribution power dissipation by more than 20% over previous designs [10].
High-frequency clock signal distribution optimization and verification required wire simulation models that included frequency-sensitive inductance and resistance elements as well as extracted capacitances. A rapid simulation engine completed the chip grid analysis within four hours. Fig. 13 shows the simulated chip clock skew to be less than 12 ps.
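The link between the capacitance reductions and the power saving in the clock distribution can be illustrated with the usual first-order relation that dynamic power at a fixed supply voltage and frequency scales with switched capacitance. The sketch below computes the remaining fraction of total capacitance when individual categories are reduced. The category shares used here are hypothetical placeholders, since the actual composition is given only graphically in Fig. 12, and the helper name is likewise an assumption.

```python
def remaining_capacitance(share, reduction):
    """Fraction of the original total capacitance left after reductions.

    share: mapping of category name to its fraction of total capacitance
           (must sum to 1).
    reduction: mapping of category name to the fractional reduction
               applied to that category (missing categories are unchanged).
    """
    if abs(sum(share.values()) - 1.0) > 1e-9:
        raise ValueError("category shares must sum to 1")
    return sum(s * (1.0 - reduction.get(name, 0.0))
               for name, s in share.items())


# Hypothetical composition, for illustration only (not read from Fig. 12).
share = {"clock_load": 0.4, "twig_and_mesh_wires": 0.3, "grid_buffers": 0.3}

# Apply the 42% best-case twig wire reduction from the text to the whole
# wire category, which is itself a simplifying assumption.
ratio = remaining_capacitance(share, {"twig_and_mesh_wires": 0.42})
print(f"remaining capacitance (and first-order power): {ratio:.3f}")
```

Under these assumed shares, a 42% reduction in one category alone removes well under 42% of the total, which is consistent with the text combining several techniques to reach the overall saving.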

Fig. 12. Total effective clock distribution capacitance composition.

Fig. 13. Main clock distribution normalized arrival time.

The key clock components of a local clock splitter are shown in Fig. 14, where the base block provides a foundation for all locally generated clocks on the chip. To minimize the hold time of the local clock control signals, including test hold, functional activate and force scan, the base block behaves functionally as if it included a latch. Starting from the cycle boundary (falling global clock), if the “gate” signal is initially high, the “clk” node will not be pulled high, and the clock signal will not propagate. The delayed clock path into the OAI gate will hold “gate” high after the initial sampling period, thereby allowing the various control inputs to change without affecting the local clock. On the other hand, if “gate” is initially low, “clk” is pulled high, propagating a clock edge, and also causing “fb” to pull low. Once “fb” is low, “gate” can then be reset back high again by the clock delay path, without any impact on the local clock signal.
The base block accepts a local clock activate signal, which can be overridden by the scan control signal for scan testing, and
C. Local Clock Buffers, Latches, and Flip-Flops also a global priority input “testhold_b” for controlling the clock
In any microprocessor, as the frequency is increased by during test operations. Since there is only a small window near
decreasing the depth of logic in each pipeline stage, the number the cycle boundary where the control inputs are sampled, all
and density of latches and flip-flops will increase correspond- input setup and hold times are specified against only the falling
ingly. Therefore, these clocked elements will account for a global clock edge.
larger and larger fraction of the overall chip power as the cycle For a typical master–slave flip-flop (Fig. 15), clocks are de-
time is pushed downwards. Also, as logic depth is reduced, the rived off the base blocks as shown in Fig. 14. Overall clock
fraction of the overall cycle time devoted to the latch or flip-flop latency and absolute clock uncertainty are minimized by this
will increase, given a fixed overhead for these elements. In scheme since there are only three gate delays between the global
light of the conflicting requirements driven by these trends clock input and the data launch clock (lclk). Also, the common
(lower power, but smaller latency), it was essential to develop a point for both launch clock and capture clock is at the output
strategy aimed at minimizing the power consumption of these of the base block, minimizing the relative uncertainty between
clocked elements for the design as a whole, while still allowing launch and capture clocks. As sketched out above, when clocks
special high-speed constructs for speed-critical paths in order are in the gated state, launch clock (lclk) is held inactive, and the
to meet the aggressive cycle time targets. A rich set of latches capture clock (dclk) is held high. The system state is therefore
and flip-flops were developed to span this range of application stored in the slave latch.
requirements. A general scan design (GSD) approach was For timing critical paths, a high-performance latch (HPL) was
chosen, with a single global clock and local clock splitters at designed which combined a wide multiplexer (up to 10-way),
the register level. Fine grained clock gating was practiced to relying on a dynamic NOR gate, with a set-reset latch (Fig. 16)
minimize AC power. The most effective clock gating strategy [12], [13]. This circuit accepts static inputs and provides static
was to keep the data in the slave latch, gating the slave clock outputs, but leverages the performance advantage of an em-
off, with master clock fixed high. bedded dynamic NOR. The dynamic NOR starts evaluating with
rising lclk, sampling the data inputs at the start of the cycle. After a fixed amount of delay (giving the dynamic NOR enough time to discharge), all sel_b inputs are forced high, after which time d_b and sel inputs may change without affecting the value stored in the set-reset latch. For scan testing, one multiplexer input (and one select line) was devoted to scan data. The scan select signal is given priority over the functional selects.
For power reduction, the standard flip-flop (Fig. 15) can be run in pulsed mode, with a clock configuration shown in Fig. 17. In this case the slave clock is pulsed in normal operation, and master clock is held high. There is also a “chicken switch” which allows running in normal master–slave clocked mode if race problems are seen in the hardware. A nonscannable pulsed latch (consisting of only the slave latch portion of the flip-flop in Fig. 15) was also supported, minimizing area, power, and latency in situations where a longer hold time could be tolerated.

Fig. 14. Basic local clock components.

Fig. 15. Standard master–slave flip-flop.

Fig. 16. High-performance latch and clock details.

Fig. 17. Clocking for pulsed-mode operation of the flip-flop.

D. SRAM Arrays
The two different processing elements, as described in the chip overview, drive different memory requirements. Starting with the PPE, the cache system includes separated L1 caches for instruction and data, and a 512-kB unified L2.
The 32-kB L1 macro was organized as a two-way associative memory, with interleaving of the two ways to improve parity check performance. As shown in Fig. 18, the overall latency is three cycles, with the first cycle being occupied by the sum address decoder (SAD). The array access is performed in the second cycle, and finally parity checking, data formatting, and way select in the third cycle concludes the L1 access. The macro is organized internally with 512 word-lines, each accessing 32 bytes of 9-bit data for each way. Writing occurs one way at a time and can be 64 bytes (two consecutive word-lines) or 32 bytes with byte write capability. Up to 16 bytes of data can be read, with parity checking on each byte. In formatted mode, eight consecutive bytes are provided, starting with any of the 32 bytes in the word. The SRAM core is built up from local bit-lines with 16 cells per bit-line and a dot of 8 onto a global bit-line in a ripple domino read approach.
The L2 Cache was constructed as an eight-way associative, 512-kB cache, composed of four individual 1024 × 8 × 140 macros. Shown in Fig. 18, the L2 cache resides in a timing domain which is clocked at half the core rate. To conserve power, the way select is decoded prior to the array access, activating only 1/8 of the macro. Single-word 140-bit writes or double-word 280-bit reads are supported. The array is pipelined such that a write takes two cycles to complete. For reads, it
takes three cycles for the first 140-bit word to be read, with the second word in the fourth cycle. Thus, write/reads can be interleaved with data continuously being fetched. Bit-line redundancy is implemented using a bit-shifting approach. There is one repair action possible for each sub-array leading to a total of 16 repairs per macro.

Fig. 18. PPE L1/L2 complex.

The SPE main memory is the LS. The LS unit in the SPE is a local memory comprised of several macros performing load/stores, transactions for DMA, and instruction fetches into an instruction line buffer (ILB). Because the LS occupies one-third of the SPE floorplan, area, power, and yield are as important as performance.
As shown in Figs. 19 and 20, the LS consists of a sum addressed decoder (memdec), four 64-kB memory arrays (mem64k), write accumulation buffers (wacc and wtb), and read accumulation buffers (rdb1, rdb2, and rdb41), distributed throughout the SPE. It takes four cycles to complete a write and six cycles to perform a read. The numbered latch points in the pipeline diagram also appear in the physical image to indicate the location of the latches. A memory access starts by setting address operand registers in the memdec macro. The memdec effectively adds the operands while pre-decoding bit groups of the sum, with the results latched at the macro output at the end of the first cycle. In the second cycle, the pre-decoded indices are distributed to the four copies of the 64-kB mem64k memory block. The decoding of the address is completed within mem64k in the third cycle, which ends with the word-line selection latched at the word-line (WL) driver. The sub-array access (fourth cycle), starts with WL activation, and a write operation completes during this cycle. For a read, the sense amplifier (SA) senses the bit-line differential signal and holds the value until it is captured by the read latch (RL) at the beginning of the fifth cycle. The remainder of the fifth cycle transfers the data read from the arrays either to the read buffers or to a four-ported latch (rdb41). The fetched data are forwarded to the execution units in the sixth cycle. Macro placement, number of inversions, pin locations, wire length balance, bus interleaving, hostile/quiet neighbor wire assignment, choice of metal layer, metal width and spacing were engineered to achieve the six-cycle 11FO4 path.
Two basic read schemes were used for the arrays on this chip. The base domino read implementation used for both the L1 and L2 caches is shown in Fig. 21. The local bit-line has 16 cells connected. The global bit-line can have 4, 8, or 16 local evaluation circuits connected. The continuous bit-line has 64, 128, or 256 cells connected, respectively. The local/global bit-lines are
used to read the cell, the continuous bit-line and write data_b are used to write. The local evaluation shown is a minimum device solution to reduce the local overhead. The local/global bit-lines are used to read the cell in dynamic fashion when the cell is activated. The continuous bit-line and write data_b are used to write the cell. The continuous bit-line provides complement data directly to the cell while data_b provides true data to the local eval to drive onto the local bit-lines to the cell. When the cell is activated the true and complement data update the cell contents.
The sense amplifier used in the LS design is also shown. The sense amplifier is a traditional gate only input design with added cross coupled PFETs to enhance its signal sensing ability, while retaining the ability to pre-charge at a different time than the storage cells. One sense amplifier can service four sets of 64-cell/bit-line groups.

Fig. 19. SPE local store.

Fig. 20. Die photo of the PPE caches and SPE local store.

Fig. 21. SRAM arrays read schemes.

Special attention was directed to cell stability. Vt mismatches, due to random dopant fluctuations, increase as transistors get smaller. In addition, factors like ACLV, corner rounding, printability issues, poly roughness and SOI history effects add variability to the devices, representing a tremendous challenge for the cell design in 90-nm technology and beyond. Cell stability depends on the ratio of the pull-down device to the access devices and on the switching point of the pull-up/pull-down device pair. During the read operation the down level at the internal cell node (voltage divider between pull-down and access device) must be lower than the switch point of the cell’s inverter to allow a stable read operation (Fig. 22). The most important parameter for writability of the cell is the strength ratio of the access device to the pull-up device. A dynamic statistical analysis (DSA) approach is used to determine the probability of a cell fail due to stability or writability.
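The read-stability criterion described above (the divider voltage at the internal node must stay below the inverter switch point) lends itself to a small Monte Carlo illustration. The sketch below is not the DSA tool used by the authors. It assumes a simplified resistive-divider model with Gaussian device variation, and every parameter value (`vdd`, `r_access`, `r_pulldown`, `v_switch`, `sigma`) is a hypothetical number chosen only for illustration.

```python
import random


def read_node_voltage(vdd: float, r_access: float, r_pulldown: float) -> float:
    """Voltage at the internal '0' node during a read (simple resistive divider)."""
    return vdd * r_pulldown / (r_access + r_pulldown)


def estimate_read_fail_probability(
    trials: int = 100_000,
    vdd: float = 1.0,         # assumed supply voltage (V)
    r_access: float = 2.0,    # assumed nominal access-device resistance (arbitrary units)
    r_pulldown: float = 1.0,  # assumed nominal pull-down resistance (arbitrary units)
    v_switch: float = 0.45,   # assumed nominal inverter switch point (V)
    sigma: float = 0.08,      # assumed relative variation per device
    seed: int = 0,
) -> float:
    """Fraction of sampled cells whose read node reaches the switch point."""
    rng = random.Random(seed)
    fails = 0
    for _ in range(trials):
        # Perturb each device independently; clamp to keep resistances positive.
        ra = r_access * max(1e-6, rng.gauss(1.0, sigma))
        rp = r_pulldown * max(1e-6, rng.gauss(1.0, sigma))
        vs = v_switch * rng.gauss(1.0, sigma)
        if read_node_voltage(vdd, ra, rp) >= vs:
            fails += 1
    return fails / trials
```

A stronger pull-down relative to the access device lowers the divider voltage, so in this toy model the estimated fail probability falls as `r_pulldown` decreases, which mirrors the ratio dependence stated in the text.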
Fig. 22. Bit cell stability.

V. PHYSICAL IMPLEMENTATION OVERVIEW


A. Chip Integration
Fig. 2 shows the die photograph with approximately 241 M transistors from 17 major physical partitions and 8912 discrete chip-level floorplanned blocks (e.g., PLL, IO, thermal sensors, engineered bus transport). The chip total across all levels of hierarchy included 177 K floor-planned blocks, 580 K repeaters and 1.55 M unit, partition, and global signal nets. At the center of the chip is the EIB composed of four 128-bit data rings plus a 64-bit tag operated at half the processor clock rate. The wires were arranged in groups of four, interleaved with GND and VDD shields twisted at the center to reduce coupling noise on the two unshielded wires. To ensure signal integrity, many other nets were custom-tailored, and over 50% of the 57 K chip-level nets were engineered with 32 K repeaters.
The Cell SoC uses 3349 C4s in a 90-column by 61-row varied pitch matrix with five regions of different row–column pitches attached to the low-cost organic package described earlier. This structure supports 20 separate power domains on-chip, many of which overlap physically on the die. The entire chip
hierarchy, from processor elements to power and clock distributions, global routing, and the chip assembly techniques, are engineered to support a modular design in a building-block-like construction.

Fig. 23. Hierarchical noise analysis.

B. Noise Analysis
Given the aggressive frequency targets and the circuit design techniques described in the previous section, it was apparent that a detailed chip noise analysis strategy would be required to ensure stable operation of the chip. With the chip device counts described above, it is impractical to analyze the whole chip using a flat transistor-level simulator. Thus, in keeping with the hierarchical/modular design philosophy, a hierarchical noise analysis approach was applied, which relied on a macro-level static noise analysis [14], followed by a unit/chip-level noise analysis, as shown in Fig. 23.
Macro-level static noise analysis was carried out on all designs, including static/dynamic circuits, arrays, and synthesized random logic macros (RLMs), creating a noise abstract at the same time. The analysis was performed by a transistor-level simulator, running on a net list with parasitic extracted from the layout. This noise tool carried out a static noise analysis on each channel connected component, looking for noise failures throughout the design. A macro-level noise abstract was generated during the analysis, which included input noise tolerance level, input capacitance, and output resistance.
Unit/chip-level noise analysis was then performed using the macro noise abstracts and information from timing analysis. Macros were represented by their abstracts, which specified the input capacitance and output resistance. Extracted parasitic parameters are obtained from the global layout. Unit/chip-level noise was then analyzed with an equivalent circuit composed of resistors, capacitors, and voltage sources which were driven according to the timing information. A noise failure is reported if the noise level on a net exceeds the receiver pin’s input tolerance level, which was determined during the macro-level noise analysis.

C. Power Analysis
To calculate the total power, input switching factors (SF) and clock activity (CA) for each macro are monitored in an RTL-level simulation of a given workload. For every macro instance, the input SF is calculated by observing the percent of inputs that have changed state from the previous cycle. The CA is measured by observing the number of local clock buffers that are turned on in a given cycle. The ability to have a cycle-by-cycle power for each macro for a realistic workload made this methodology very useful for package analysis. Since power is dependent on both SF and CA, these factors need to be calculated for every cycle of the simulation and cannot be simply averaged over the entire simulation. In a given cycle the total power is calculated as

$$P_{\mathrm{total}} = \sum_{i \in \mathrm{macros}} P_i(SF_i, CA_i) + \tfrac{1}{2}\, C_{\mathrm{net}}\, V_{dd}^{2}\, f$$

where $C_{\mathrm{net}}$ is the amount of global net capacitance switched in the cycle.
To estimate the power consumption of the first-generation Cell processor and for refining and verifying the power management logic of the chip, each core or functional unit on the chip was required to run at least three different types of workloads: idle, typical, and high power. The idle workload was expected to be the lowest power state, since it shuts off as many local clock buffers as possible. The high power test case was used as a stimulus for power grid integrity analysis, thermal analysis, and as an input for the IR drop analysis.
Fig. 24 shows the power analysis flow. Fig. 25 shows the macro and net powers of the typical power workload, and also the power traces of the three workloads. Fig. 26 shows the number of local clock buffers active for the three different workloads. The runtime for 4000 clock cycles on a core with 20.9 M transistors was approximately 30 minutes.
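The cycle-by-cycle bookkeeping described in the power analysis can be sketched in a few lines. This is a minimal illustration and not the authors' tool: the linear per-macro power model, the `MacroSample` fields, the use of a 1/2 C V^2 f term for switched global nets, and all coefficients (`leak_w`, `sf_coeff_w`, `ca_coeff_w`, `vdd`, `freq_hz`) are hypothetical assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class MacroSample:
    """One macro's activity in one cycle (all fields are illustrative)."""
    sf: float          # input switching factor, 0..1
    ca: float          # clock activity: fraction of local clock buffers on, 0..1
    leak_w: float      # assumed static power (W)
    sf_coeff_w: float  # assumed power at SF = 1 (W)
    ca_coeff_w: float  # assumed power at CA = 1 (W)


def macro_power(m: MacroSample) -> float:
    """Assumed linear model: static + SF-driven + clock-driven terms."""
    return m.leak_w + m.sf_coeff_w * m.sf + m.ca_coeff_w * m.ca


def cycle_power(macros: Iterable[MacroSample], net_cap_f: float,
                vdd: float, freq_hz: float) -> float:
    """Total power in one cycle: macro powers plus 1/2 * C * V^2 * f for nets."""
    net = 0.5 * net_cap_f * vdd ** 2 * freq_hz
    return sum(macro_power(m) for m in macros) + net


def power_trace(cycles: List[dict], vdd: float, freq_hz: float) -> List[float]:
    """Evaluate each cycle separately; SF and CA are never averaged first."""
    return [cycle_power(c["macros"], c["net_cap_f"], vdd, freq_hz) for c in cycles]
```

Keeping one sample per macro per cycle, rather than averaged activity, is what lets a trace like this expose short power peaks of the kind that matter for package and power-grid analysis.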
Fig. 24. Power analysis flow.

Fig. 25. (a) Macro and net power for a typical work load. (b) Total power.

Fig. 26. Percent of local clock buffers active for idle, typical, and high-power workload.

D. Thermal Analysis
Due to local heating caused by independent operation of individual processing units, sophisticated local thermal sensing strategies and thermal control mechanisms were required to allow aggressively low-cost and quiet thermal solutions. This SoC presented new challenges in chip thermal design. Under high heat flux conditions, the silicon substrate behaves like a relatively low thermal conductance body. Given the definition of $Bi_{\mathrm{eff}}$ (effective Biot number) as defined by (1) it can be determined under which conditions small hot spots would create local heat spikes on top of the die global temperature distribution.

$$Bi_{\mathrm{eff}} = \frac{h\,d}{k}\cdot\frac{A_{\mathrm{sub}}}{A_{s}} \tag{1}$$

where $A_{\mathrm{sub}}$ is substrate area, $A_{s}$ is the source area, $h$ is the effective heat transfer coefficient at the opposite side of silicon substrate, $d$ is the substrate thickness, and $k$ is the thermal conductivity of silicon.
Then, peak temperature excesses can be estimated by analytic modeling techniques based on classical steady-state thermal diffusion theory [15], as well as on transient analysis [16]. These models approximate a rectangular source, with the general formula described as

$$\frac{\partial^{2} T}{\partial r^{2}} + \frac{1}{r}\frac{\partial T}{\partial r} + \frac{\partial^{2} T}{\partial z^{2}} + \frac{\dot{q}}{k} = \frac{1}{\alpha}\frac{\partial T}{\partial t} \tag{2}$$

where $\dot{q}$ is the energy generation rate, $r$ is the radius, $t$ is the time, $z$ is the height of the cylinder, $\alpha$ is the thermal diffusion of silicon, and $T$ is the temperature.
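As a rough numerical companion to the Biot-number argument, the sketch below evaluates the textbook Biot number Bi = h·d/k for a silicon slab. It does not attempt to reproduce the paper's effective definition in (1), and the heat transfer coefficient, die thickness, and conductivity values are illustrative assumptions rather than data from the design.

```python
def biot_number(h_w_per_m2k: float, thickness_m: float, k_w_per_mk: float) -> float:
    """Textbook Biot number: internal conduction resistance over surface resistance."""
    if k_w_per_mk <= 0 or thickness_m <= 0 or h_w_per_m2k < 0:
        raise ValueError("thickness and conductivity must be positive, h non-negative")
    return h_w_per_m2k * thickness_m / k_w_per_mk


# Illustrative values only (not from the paper):
K_SILICON = 150.0    # W/(m*K), approximate bulk silicon near room temperature
THICKNESS = 0.75e-3  # m, assumed die thickness
H_EFF = 2.0e4        # W/(m^2*K), assumed effective coefficient of the cooling path

bi = biot_number(H_EFF, THICKNESS, K_SILICON)
# Bi << 1 suggests a nearly isothermal slab; larger values mean internal
# temperature gradients, and hence local hot spots, become significant.
print(f"Bi = {bi:.3f}")
```

In this simplified picture, a more aggressive cooling path (larger h) or a smaller heat source relative to the substrate pushes the die toward the regime where local hot spots stand out from the global temperature profile.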
A 100-µm range length scale was found to be a high enough resolution for the hot spot analysis. Thus, it was not necessary to consider any quantum effects [17] in this analysis. The hot spot information was generated by the power maps constructed from the power analysis based on the various workloads described earlier. In addition, the external thermal spreading in the lateral direction, through the package, affects the die temperature global profile, causing a steeper slope from the center out toward the edges/corners.
Because of the high degree of complexity, extensive numerical thermal analysis was required. This analysis was carried out early in the design cycle to ensure that the maximum junction temperature, as well as the mean die temperature, including transient behavior, would end up within design specifications. These data were analyzed to improve the design and floorplan of the chip, and also to provide feedback for improved thermal sensor design and placement, following the thermal map shown in Fig. 27.

Fig. 27. Die thermal map.

VI. CONCLUSION
Special circuit techniques, rules for modularity and reuse, customized clocking structures, and unique power and thermal management concepts were applied to optimize the design of the Cell processor. Correct operation has been observed in the laboratory on first pass silicon at frequencies well over 4 GHz, as shown in Fig. 28.

Fig. 28. First pass hardware results in the laboratory.

ACKNOWLEDGMENT
The authors gratefully acknowledge the collaborations and contributions from the entire Sony-Toshiba-IBM team who worked tirelessly side-by-side on the design of this processor. They also gratefully acknowledge the executive and the management teams of the three companies who provided management oversight and created the right business conditions for this project. Special thanks go to M. Beattie, E. Behnen, C. Carter, S. Dhong, T. Himeno, B. Krauter, S. Lee, J. Petrovick, G. Plumb, P. Restle, D. Shippy, J. Silberman, K. Miki, J. Qi, E. Hailu, M. Yoshida, and D. Widiger for their contributions to this project.

REFERENCES
[1] D. Pham et al., “The design and implementation of a first-generation Cell processor,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2005, pp. 184–185.
[2] B. Flachs et al., “The microarchitecture of the streaming processor for a Cell processor,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2005, pp. 134–135.
[3] T. Asano et al., “A 4.8 GHz fully pipelined embedded SRAM in the streaming processor of a Cell processor,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2005, pp. 486–487.
[4] C. J. Anderson et al., “Physical design of a fourth-generation Power GHz microprocessor,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2001, pp. 232–233.
[5] N. Rohrer et al., “PowerPC 970 in 130 nm and 90 nm technologies,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2004, pp. 68–69.
[6] D. Boerstler, K. Miki, E. Hailu, H. Kihara, E. Lukes, J. Peter, S. Pettengill, J. Qi, J. Strom, and M. Yoshida, “A 10+ GHz low jitter wide band PLL in 90 nm PD SOI CMOS technology,” in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2004, pp. 228–231.
[7] D. W. Boerstler and K. A. Jenkins, “A phase-locked loop clock generator for a 1 GHz microprocessor,” in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 1998, pp. 212–213.
[8] D. W. Boerstler, “Interleaved VCO with balanced feedforward,” U.S. Patent 6,744,326, Jun. 1, 2004.
[9] S. K. H. Fung, N. Zamdmer, P. J. Oldiges, J. Sleight, A. Mocuta, M. Sherony, S. H. Lo, R. Joshi, C. T. Chuang, I. Yang, S. Crowder, T. C. Chen, F. Assaderaghi, and G. Shahidi, “Controlling floating-body effects for 0.13 µm and 0.10 µm SOI CMOS,” in IEDM Tech. Dig., Dec. 2000, pp. 231–234.
[10] P. J. Restle et al., “A clock distribution method for microprocessors,” IEEE J. Solid-State Circuits, vol. 36, no. 5, pp. 792–799, May 2001.
[11] P. J. Restle et al., “The clock distribution of the Power4 microprocessor,” in IEEE ISSCC Dig. Tech. Papers, vol. 45, 2002, pp. 144–145.
[12] L. Sigal, J. D. Warnock, B. W. Curran, Y. H. Chan, P. J. Camporese, M. D. Mayo, W. V. Huott, D. R. Knebel, C. T. Chuang, J. P. Eckhardt, and P. T. Wu, “Circuit design techniques for the high-performance CMOS IBM S/390 parallel enterprise server G4 microprocessor,” IBM J. Res. Dev., vol. 41, pp. 489–503, 1997.
[13] F. Klass, C. Amir, A. Das, K. Aingaran, C. Truong, R. Wang, A. Mehta, R. Heald, and G. Yee, “A new family of semidynamic and dynamic flip-flops with embedded logic for high-performance processors,” IEEE J. Solid-State Circuits, vol. 34, no. 5, pp. 712–716, May 1999.
[14] K. L. Shepard, V. Narayannan, and R. Rose, “Harmony: static noise analysis of deep submicron digital integrated circuits,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 18, no. 8, pp. 1132–1150, Aug. 1999.
[15] S. Lee, S. S. Van Au, and K. P. Moran, “Construction/spreading resistance model for electronics packaging,” in Proc. ASME/JSME Thermal Engineering Conf., vol. 4, 1995, pp. 199–206.
[16] K. Yazawa and M. Ishizuka, “Thermal modeling with transfer function for the transient chip-on-substrate problem,” Thermal Sci. Eng., Heat Transfer Soc. Jpn., vol. 13, no. 1, pp. 37–40, 2005.
[17] K. E. Goodson, Y. S. Ju, and M. Asheghi, “Thermal phenomenon in semiconductor devices and interconnects,” in Microscale Energy Transport, C. L. Tien, A. Majumdar, and F. M. Gerner, Eds. New York: Taylor & Francis, 1998, ch. 7, pp. 229–294.
Dac C. Pham received the B.S.E.E. degree from University of Rhode Island, Kingston, in 1986, the M.S.E.E. degree from Rensselaer Polytechnic Institute, Troy, NY, in 1987, and the Ph.D. degree in materials science from the University of Vermont, Burlington.
He joined IBM in 1987, where he worked on IBM’s 386 SX, 386SLP, and Blue Lightning family of processors that were used in IBM PS/2 Systems. In 1992, he joined the Apple-IBM-Motorola Somerset Design Center, where he worked on IBM’s PowerPC 603 and follow-on processors which led to PowerPC G3. He also worked on IBM’s PowerPC 602 and a Graphics Co-Processor that were used in Gaming Systems, Consumer Electronics, and PDA devices. In 1997, he joined the IBM Giga-Processor team where he worked on IBM’s Power 4 processor which was used in IBM servers and Apple’s G5. Between 1999 and 2003, he was with Intel’s Texas Design Center and managed the Chip Implementation team of a next-generation Pentium processor. He re-joined IBM in 2003, where he worked on the high-speed, multi-core, first-generation Cell processor.
Dr. Pham is currently the Sony-Toshiba-IBM Design Center Chief Engineer and Global Convergence Functional Manager.

Tony Aipperspach received the B.S. degree in electrical engineering from Montana State University, Bozeman, in 1979.
He joined IBM’s Rochester Development Laboratory, Rochester, MN, in July 1979. He was involved in the design of an on-chip DRAM in 3 µm NMOS. From 1981 to 1993, he was involved with various ASIC library designs, as well as integrated SRAM design. From 1994 to 1996, he was involved in BICMOS circuit design and a variety of memory configurations including SRAM, CAM, and register arrays. From 1996 to 2000, he was the team lead for the array design group and supported all memory elements of the Star series of processors for the AS400. Between 2000 and 2005, he was Array Lead for various processor designs in support of the gaming industry, including the Cell processor. His current position is Array and Circuit Lead/Consultant on various projects.

David Boerstler received the B.S. degree in electrical engineering from the University of Cincinnati, Cincinnati, OH, in 1978, the M.S. degree in electrical engineering from Syracuse University, Syracuse, NY, in 1981, and the M.S. degree in computer engineering from Syracuse University in 1985.
Since joining IBM Corporation in 1978, he has worked in various research and product development organizations with an emphasis on high-frequency and mixed-signal circuit design. In the last ten years at IBM, he has been involved primarily with the design of multi-GHz PLLs, clock distribution, and serial IO in advanced bulk and SOI CMOS technologies. He is currently an IBM Senior Technical Staff Member in the IBM Systems and Technology Group in Austin, TX, and leads the analog and mixed-signal design team for the STI Design Center.

Mark Bolliger received the B.S. degree in computer science from the University of Utah, Salt Lake City, in 1988 and the M.S. degree in computer science from Rensselaer Polytechnic Institute, Troy, NY, in 1994.
He joined IBM in 1988 where he worked on IBM’s 386SX, 386SLP and Blue Lightning processor family. In 1992, he joined the Apple-IBM-Motorola Somerset Design Center, where he worked on the IBM’s PowerPC processor family. From 1996 to 2001, he was Director of Information Systems for a group of privately held companies. He re-joined IBM in 2001, where he worked on Apple’s G5 processor. In 2002, he joined the Sony/Toshiba/IBM Design Center where he worked on the first-generation Cell processor in Chip Integration on the Global Convergence Team.

Rajat Chaudhry received the B.S. and M.S. degrees in electrical engineering from the University of Texas at Austin in 1996 and 1998, respectively.
He currently works for the Tools and Methodology group at the STI Design Center in Austin, TX. He works on power estimation and power grid verification tools.

Dennis Cox received the B.S. degree in electrical engineering from the University of Wisconsin at Madison in 1970 and the M.S. degree in electrical engineering from Syracuse University, Syracuse, NY, in 1975.
He is a Distinguished Engineer in Engineering and Technology Services (E&TS) and a member of the IBM Academy of Technology working in the area of technology delivery and circuit design. He joined IBM, Kingston, NY, in 1970 as a Circuit Design Engineer where he designed SRAMs, dynamic logic processors, and PLA. He transferred to Circuit Technology in Rochester, MN, in 1976 and worked in the development of ASIC design systems. Subsequently, as a part of Server Group, he worked on a number of different Power PC microprocessors. With the advent of E&TS, he has been working with a number of companies in different design and consulting contracts as well as leading various IBM Academy studies.

Paul Harvey received the B.S.E.E. degree from Texas A&M University, College Station, in 1981.
He is currently the Sony-Toshiba-IBM Design Center Timing Technical Lead. He joined Motorola’s MC68000 microprocessor design team in 1981 as a logic designer in the floating point group. He joined IBM in 1986 and micro-coded the transcendental functions for the FPA floating point accelerator card. In 1987, he worked on the Power1, Power2, and Power3 microprocessor designs as the FPU and FXU Technical Team Lead, and on Power4 and Power5 as the Timing Technical Lead. In 2002, he joined the Cell processor design team in his current role.

Paul M. Harvey (M’88) received the B.S. degree in chemical engineering from Brigham Young University, Provo, UT, and the M.S.E.E. degree from Syracuse University, Syracuse, NY.
He is a Senior Technical Staff Member at IBM, Austin, TX, and is the Technical Team Leader of the STI Packaging Design Team, responsible for the package design and analysis for the Cell microprocessor and derivative products. He has more than 20 years experience in all aspects of electronic packaging design, analysis and characterization. He has six issued patents with nine pending and has authored over 40 publications on various aspects of electronic packaging.
PHAM et al.: ARCHITECTURE, CIRCUIT DESIGN, AND PHYSICAL IMPLEMENTATION OF A FIRST-GENERATION Cell PROCESSOR 195

H. Peter Hofstee (M’96) received the “Doctorandus” degree in theoretical physics from Groningen University, The Netherlands, in 1989, and the Ph.D. degree in computer science from the California Institute of Technology (Caltech), Pasadena, in 1995.
He is a Senior Technical Staff Member in the Sony-Toshiba-IBM (STI) Design Center, IBM, Austin, TX. He is the chief scientist for Cell and the chief architect of the Cell Synergistic Processor Element. He joined the Caltech faculty in 1995 and 1996 to teach computer science and VLSI. In 1996, he joined the IBM Austin Research Laboratory where he helped to create the first GHz CMOS processor. Between 1997 and 2000, he worked on a number of other high-frequency server processor designs. In 2000, he helped create the concept for Cell and became one of the founding members of the STI Design Center in 2001. His current interest focuses on application of the Cell processor beyond the gaming space and on future Cell processor designs. He is a member of the ACM and a member of the IBM Academy of Technology.

John Keaty (M’86) received the B.S. degree in mathematics from the State University of New York, Plattsburgh, in 1974, and the M.A. degree in mathematics in 1977 and the M.S. degree in computer science, both from the University of Wisconsin-Madison, in 1980.
He is a Senior Technical Staff Member and Global Integration Leader in the STI Design Center, Austin, TX. He joined the IBM Microelectronics Division in Burlington, VT, in 1981 to work on automated diagnostic systems for semiconductor logic products. Since then, he has worked in ASIC product development on several CMOS logic families, and also on industry standard microprocessor development in the IBM Microelectronics Division before transferring to the IBM Server Group in Austin, TX, in 1996. He was responsible for timing of the Power4 processor in the RS/6000 Regatta system. He is currently responsible for the global integration and physical design of the Cell processor family.

Charles Johns received the B.S. degree in electrical engineering from the University of Texas at Austin in 1984.
He is a Senior Technical Staff Member in the Sony-Toshiba-IBM Design Center, Austin, TX. After joining IBM Austin in 1984, he worked on various disk, memory, voice communication, and graphics adapters for the IBM Personal Computer. From 1988 until he transferred to the STI project in 2000, he was part of the Graphics Organization and was responsible for the architecture and development of entry and midrange 3-D graphics adapters and raster engines. He is now responsible for the Cell Broadband Engine Architecture (BPA) and participates in the development of the first-generation Cell processor.

Yoshio Masubuchi (M’89) received the B.S. and M.S. degrees in electronics engineering from the University of Tokyo in 1981 and 1983, respectively.
He is an Assistant to the General Manager, Broadband System LSI Development Center, Toshiba Corporation Semiconductor Company, Austin, TX. Since joining Toshiba in 1983, he has been engaged in research and development of microprocessors and computer systems. From 1988 to 1990, he was a visiting Fellow at the University of California, Berkeley. He served as Toshiba Project Leader for the Cell project at the STI Design Center, Austin, TX, from 2002 to 2005. He is a member of the ACM, the Information Processing Society of Japan, and the Institute of Electronics, Information and Communication Engineers.

Jim Kahle received the B.S. degree from Rice University, Houston, TX, in 1983.
He is an IBM Fellow with over 20 years of research and development experience in chip design. He is a renowned expert in the microprocessor industry for delivering breakthrough designs. His current role is Chief Architect and Director of Technology for the Austin Design Center for Cell Technology, which is a partnership between IBM, Sony, and Toshiba. He has been working for IBM since the early 1980s on RISC-based microprocessors. His work started in Physical Design tools and now has concentrated on RISC Architecture. He was a key designer for the RIOS I processor that launched IBM into the RS/6000 line of workstations and servers. He was also one of the founding members of the Somerset Design Center, where he was the project manager for the PowerPC 603 and follow-on processors which led to PowerPC G3. These chips and cores have been used in many applications from embedded cores to Apple systems and Nintendo’s new GameCube. He was key to the definition of the Power Architecture and of the superscalar techniques used in IBM. He has been at the forefront of Dual Core designs, Asymmetric Multiprocessors, Superscalar design, and SMT micro-architectures. He was the Chief Architect for the Power4 core and continually assists in Power processor roadmap planning.

Mydung Pham received the B.S. degree in electrical and computer engineering from the University of Vermont, Burlington, in 1992.
Since joining Apple-IBM-Motorola’s Somerset Design Center in 1992, she has worked on various microprocessor designs of the PowerPC 603 and follow-on processors which led to PowerPC G3. In 2000, she worked on the PowerPC G5 as Chip Timing Lead. In 2001, she joined the Cell processor design team as Memory Management Technical Lead. She is currently a Manager of the circuit design, timing, and physical integration team for the PowerPC Processor Element (PPE).

Atsushi Kameyama (M’91) received the B.S., M.S., and Ph.D. degrees in physical electronics from the Tokyo Institute of Technology, Tokyo, Japan, in 1982, 1984, and 1999, respectively.
In 1984, he joined Toshiba VLSI Research Center, where he was engaged in the circuit design and development of GaAs digital ICs. From 1992 to 1994, he was a visiting scholar at Stanford University, Stanford, CA, doing research in the field of pnp-type AlGaAs/GaAs HBT. Since 1994, he has been involved in the development of GaAs MMIC and low-power circuit design using SOI. He is currently a Design Engineering Manager with Toshiba America Electronic Components, Inc., Austin, TX.

Jürgen Pille received the M.S. degree in microelectronics from the University of Hanover, Germany, in 1990.
He joined IBM at the IBM Development Laboratory, Boeblingen, Germany, in 1990 to work on Advanced VLSI Designs. He was involved in several areas of the S/390 and PowerPC microprocessor development including assignments at the IBM Development Laboratories, Poughkeepsie, NY, and Austin, TX. He has worked in several areas including circuit design, array design, and custom logic design. As a Senior Technical Staff Member, he is currently responsible for array designs in the STI Design Center, Austin, TX.

Stephen Posluszny received the B.S. and M.S. degrees in computer engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1982.
He joined IBM in 1982, working in Fishkill, NY, with IBM’s corporate design automation organization in the area of silicon compaction. In 1988, he moved to Austin, TX, to work on logic synthesis development for the RS/6000 line of processors. Between 1995 and 2000, he worked in IBM’s Austin Research Laboratory on the first 1-Gigahertz Processor. In 2001, he joined the Sony-Toshiba-IBM Design Center as IBM’s Senior Technical Staff Member and Tools and Methodology Technical Lead. He has contributed to the Center’s high-frequency design methodology definition and leads the tools team supporting the design effort.

James Warnock (M’89) received the Ph.D. degree in physics from the Massachusetts Institute of Technology, Cambridge, in 1985.
Since then, he has been at IBM, Yorktown Heights, NY, working on advanced bipolar and CMOS silicon technologies, and more recently in the area of circuit design for high-performance digital microprocessors. This included work on the S/390 G4 processor and also on the Power4 chip, where he was the circuit design team leader. Currently, he is the circuit lead for the Entertainment and Embedded Processor Development Organization in IBM’s Systems and Technology group. He is involved with work on the Cell processor and also with several other microprocessor development programs. He has experience in process technology and device design as well as in SOI digital circuit design, and has authored or co-authored over 160 conference or journal papers.

Mack Riley received the B.S.E.E. degree from the Tuskegee Institute, Tuskegee, AL, in 1979, and the M.S.E.E. degree from Stanford University, Stanford, CA, in 1981.
He joined IBM in 1981 at the Austin, TX, site. Initially, he worked on design and field support for the IBM 5520 Administrative system and several 55xx peripheral products. He was a member of the original design teams for the RT PC and the RS/6000 systems. In 1989, he joined the Graphics Development team for the RS/6000 systems. In this role, he was manager of entry and mid-range 2-D and 3-D graphics adapters and 3-D graphics controller design. In 2001, he joined the STI project as Manager and Technical Lead for the Pervasive and Design for Test teams. In this role, he is responsible for all the chip-level control, performance monitors, debug, RAS, test logic functions and delivery of manufacturing test patterns.

Stephen Weitzel (M’75) received the B.S. degree in electrical engineering from Pennsylvania State University, University Park.
He joined IBM as a Test Equipment Engineer in 1974 and held various positions in test engineering and circuit design in the East Fishkill development center, Noyce design center, and Somerset design center. He is currently a Senior Technical Staff Member working on high-frequency clock distributions for IBM in the STI Design Center, Austin, TX. He has 11 patents and has co-authored seven papers.

Daniel L. Stasiak received the B.S.E.E. degree from the University of Missouri-Rolla in 1987, and the M.S.E.E. degree from the University of Minnesota, Minneapolis, in 1991.
Since joining IBM in 1987, he has worked on various microprocessor design and development efforts: AS/400 processor design, IBM’s PowerPC, and most recently, a first-generation Cell processor design. He is currently the Sony-Toshiba-IBM Design Center Power Technical Lead.

Masakazu Suzuoki, photograph and biography not available at the time of publication.

Dieter Wendel received the B.S. degree in electrical engineering from the University of Wuerzburg, Germany, in 1981.
He is an IBM Distinguished Engineer. He joined IBM in 1981 to work on the very large scale integration fellowship team in Boeblingen, Germany. After a two-year assignment from 1984 to 1986 at the IBM Research Laboratory in Yorktown Heights, NY, he joined the S/390 microprocessor development organization at the IBM Boeblingen Laboratory to work in several areas including test, array design, and custom logic design. He joined the STI Design Center in Austin, TX, as circuit lead when it was founded in 2001. His current interests focus on concepts in high-frequency circuit design and the exploitation of new technologies.

Osamu Takahashi (M’97) received the B.S. degree in engineering physics and the M.S. degree in electrical engineering from the University of California, Berkeley, in 1993 and 1995, respectively, and the Ph.D. degree in computer and mathematical sciences from Tohoku University, Sendai, Japan, in 2001.
He joined IBM Austin Research Laboratory, Austin, TX, in 1995 as a member of the high-performance VLSI design research team. He has been involved in the development of the first 1-GHz processor as well as the development of a fully synchronous and pipelined 1-GHz clock 1-MB embedded DRAM macro. He is an IBM Senior Technical Staff Member and the Manager of the circuit design, timing, and physical integration team for the Synergistic Processor Element (SPE) developed by Sony, Toshiba, and IBM.

Kazuaki Yazawa (M’99) received the B.S. degree in mechanical engineering from Chiba University, Japan, in 1980, and is currently pursuing the Ph.D. degree in mechanical engineering at Toyama Prefectural University, Japan.
Since joining Sony Corporation, he has gained 25 years of mechanical and thermal engineering experience on various consumer products. He has published four journal papers and 11 conference papers, and holds 34 patents, some pending. He is currently serving as Thermal Architect for Sony Computer Entertainment Inc., Tokyo, Japan, as well as Distinguished Engineer for the Micro Systems Network Company of Sony Corporation.
