Overview of The Architecture, Circuit Design, and Physical Implementation of A First-Generation Cell Processor
Abstract—This paper reviews the design challenges that current and future processors must face, with stringent power limits, high-frequency targets, and the continuing system integration trends. This paper then describes the architecture, circuit design, and physical implementation of a first-generation Cell processor and the design techniques used to overcome the above challenges. A Cell processor consists of a 64-bit Power Architecture processor coupled with multiple synergistic processors, a flexible IO interface, and a memory interface controller that supports multiple operating systems including Linux. This multi-core SoC, implemented in 90-nm SOI technology, achieved a high clock rate by maximizing custom circuit design while maintaining reasonable complexity through design modularity and reuse.
Index Terms—Bus interface controller (BIC), cell processor, element interconnect bus (EIB), flexible IO, hardware content protection, local store, media-centric computing, memory interface controller (MIC), modularity, multi-core, natural human interaction, Power Architecture, power processor element (PPE), real-time system, simultaneous multi-threading, SoC, synergistic processor, virtualization, 90-nm SOI.
I. INTRODUCTION
concurrent activities. Cell supports multiple operating systems including Linux, and is designed for flexibility with a wide variety of application domains. Other attributes include hardware content protection and extensive single-precision floating-point capability. By extending the Power Architecture with synergistic processor elements (SPE) having coherent direct memory access (DMA) to system storage and with multi-operating system resource management, Cell supports concurrent media-centric and conventional computing. With a dual-threaded power processor element (PPE) and eight SPEs this implementation is capable of ten simultaneous threads and over 128 outstanding memory requests.
II. ARCHITECTURE OVERVIEW
The first-generation Cell processor consists of eight SPEs [2], each with its own local store (LS) [3], the PPE and its L2 cache, a high bandwidth internal element interconnect bus (EIB), a memory interface controller (MIC), a bus interface controller (BIC) providing two configurable noncoherent IO inter-
then asynchronously transfers data between the local storage and main storage. This allows overlapping communication and computation, facilitating real-time operation. SPE access to shared memory via DMA, a large register file, and standard in-order execution semantics provide a general-purpose streaming programming environment. Each SPE can be dynamically configured to operate in a mode in which its resources can only be accessed by validated programs. The SPE is organized in a modular fashion, so that a single integrated SPE physical entity is placed 8 times on the chip.
The PPE is a 64-bit processor compliant to the family of Power Architecture processors [4], [5]. It is implemented as a dual-threaded core with the integer, floating point, VMX, and MMU units of the Power Architecture. The processor contains 32-kB instruction and data caches, a 512-kB L2 cache, and on-chip bus interface logic. The PPE core is a completely new implementation, built from the ground up with an extended pipeline to achieve a low FO4 cycle time to match the SPE. The core is an enhanced in-order design with a moderate pipeline length to provide state-of-the-art performance capability. The PPE has been extended with resource management tables for the cache and translation tables to support real-time operations. Through memory mapped IO (MMIO) control registers, the PPE can also initiate DMA requests on behalf of an SPE and can support communication with SPE mailboxes. It also implements the Power Architecture hypervisor extension to allow multiple concurrent operating systems to be run on it at the same time through thread management support.
The EIB is central to the Cell processor. This coherent bus transfers up to 96 bytes per processor cycle. The bus is organized as four 16-byte-wide (half-rate) rings that each support up to three simultaneous data transfers. A separate address and
PHAM et al.: ARCHITECTURE, CIRCUIT DESIGN, AND PHYSICAL IMPLEMENTATION OF A FIRST-GENERATION Cell PROCESSOR 181
command network manages bus requests and the coherence protocol. Each of the 12 on-ramps to the bus and 12 off-ramps are 16 bytes wide and operate at half the base clock rate. Fig. 2 shows the location of the EIB relative to the other units within the processor die. Fig. 3 shows the physical interleaving of the bus, designed to minimize noise coupling issues.
Fig. 3. Element interconnect bus (EIB).
The MIC supports two Rambus XDR memory banks. Each bank is 36 bits wide with an independent control bus. The two banks are interleaved in the processor’s address space, but for increased system flexibility, the MIC can be configured to support just a single bank of XDR memory. For added reliability, the MIC supports ECC and a periodic ECC scrub of memory. The XDR interface operates asynchronously to the processor and IO interfaces; hence, the MIC contains speed-matching SRAM buffers and logic, and requires two clock domains. One clock domain operates at half the global processor clock rate, while the other domain operates at half the rate of the XDR interface. When the processor frequency is lower than that of the XDR, the MIC is configured to pace memory requests in order to avoid overrunning the SRAM buffers. The asynchronous interface not only provides for greater flexibility, but the separation of clocks also allows the processor to be stopped and its system state scanned out without affecting the training of the transceiver. This training is required as a result of the high speed of the XDR interface, with the training sequence performed by either the PPE or an SPE.
The BIC provides two flexible interfaces with varying protocols and bandwidth capabilities to address differing system requirements. The interfaces can be configured as either two IO interfaces (IOIF 0/1) or as an IO and a coherent SMP interface (IOIF and BIF). With seven transmit and five receive bytes of Rambus RRAC IOs, this design provides substantial bandwidth support. For increased flexibility, the RRAC transmitters and receivers operate asynchronously to the processors and memory, and the available bandwidth is configurable between the two interfaces. In addition, the RRACs for one of the interfaces can be run at half rate. The design thus supports multiple configurations without the need to redesign or repackage the chip. The BIC provides an asynchronous interface between the EIB and the two IO interfaces, and hence contains speed-matching SRAM buffers and logic with three clock domains. The processor side operates at half the global rate, while the IO side operates at one third the rate of the RRACs, with a small distribution network operating at half the rate of the RRACs. For BIC and RRAC calibration, required to support the high speed of the transmitters and receivers, an elastic buffer is used to eliminate the skew between the bytes comprising the interface.
The pervasive unit contains all of the global logic needed for basic functional operation of the chip, lab debug, and manufacturing test. For each of these areas, described in detail below, debug mechanisms have been incorporated into the design to assist with bring-up and characterization of the logic and circuits.
The pervasive unit provides several features needed to support basic functional operation of the chip. The unit contains a serial peripheral interface (SPI) that is used by an external controller to communicate with the chip during normal operation, and to receive status information from the pervasive unit. The SPI interface is also used by the external controller to train the high-speed IO. In addition, the pervasive unit provides the control for the clock generation and distribution. This logic enables the correct phase-locked loop (PLL) functions on the chip, determines which clocks are active on each of the different clock grids, enabling/disabling asynchronous clocks as required, and maintains the phase relationships between full-speed and half-speed clocks across the chip. Power-on reset (POR) is another base function that is provided here. The POR engine for the Cell processor is a 32-instruction state machine that systematically initializes all the units of the processor. The POR engine has four hard-coded initialization sequences that can be selected by configuration pins on the chip. The POR engine also has a debug mode that allows its instructions to be single-stepped, skipped, or performed out of order.
For lab debug and bring-up, the pervasive unit contains chip level error checking and reporting mechanisms. Global fault isolation registers are provided to allow the operating system to quickly determine which unit generated an error condition. To assist with performance analysis, a performance monitor (PFM) is provided, which consists of a centralized unit connected to all functional units on the chip via the trace/debug bus. To assist with debug, an on-board trace logic analyzer (TLA) captures and stores internal signals while the chip is running at full speed. The TLA is programmable and allows complex trace/capture sequences to be created. An on-chip control processor, with an IEEE 1149.1 interface, is also available to help with debug. The pervasive logic interfaces with this processor to control and monitor the chip internal logic.
For manufacturing test, the pervasive unit supports 11 different test modes, which include debug mechanisms to assist with test pattern development and bring-up. Array built-in self-test (ABIST) is one such mode controlled by the pervasive unit. The ABIST engines are programmable to provide debug capability and the ability to add customized test patterns during manufacturing test. All ABIST engines are operated in parallel to reduce test time. Logic BIST (LBIST) is also provided, by means of a centralized controller housed in the pervasive unit and 15 LBIST satellites elsewhere in the design. Both ABIST and LBIST are capable of scanning and capturing data at full functional speed to improve AC test speed, and all internal scan chains are timed at the full functional speed to support this requirement. Finally, the pervasive unit provides the logic used for programming and testing electronic fuses (eFuses), which are used for array repair and chip customization during manufacturing test.
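As a quick check of the EIB figures quoted earlier in this section, the peak of 96 bytes per processor cycle follows from four rings, 16 bytes per ring, up to three simultaneous transfers per ring, and a bus clock at half the processor clock. The sketch below only restates that arithmetic. The function names and the clock-frequency parameter are illustrative assumptions and do not come from the paper.

```python
from fractions import Fraction

# Figures taken from the text of this section.
RINGS = 4                          # data rings on the EIB
RING_WIDTH_BYTES = 16              # each ring is 16 bytes wide
TRANSFERS_PER_RING = 3             # up to three simultaneous transfers per ring
BUS_CLOCK_RATIO = Fraction(1, 2)   # rings run at half the processor clock


def eib_peak_bytes_per_processor_cycle() -> Fraction:
    """Peak EIB data throughput, expressed per processor clock cycle."""
    bytes_per_bus_cycle = RINGS * RING_WIDTH_BYTES * TRANSFERS_PER_RING
    return bytes_per_bus_cycle * BUS_CLOCK_RATIO


def eib_peak_gbytes_per_second(processor_clock_ghz: float) -> float:
    """Peak throughput in GB/s for a hypothetical processor clock (assumption)."""
    return float(eib_peak_bytes_per_processor_cycle()) * processor_clock_ghz


if __name__ == "__main__":
    print(eib_peak_bytes_per_processor_cycle())
```

The product 4 x 16 x 3 gives 192 bytes per bus cycle, and halving it for the half-rate bus clock gives the 96 bytes per processor cycle stated in the text. This is a peak figure; sustained throughput depends on traffic patterns that the sketch does not model.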
182 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006
Power dissipation and thermal design are two key issues which were considered early in the development of the first-generation Cell processor. For power reduction, a power management unit (PMU) provides a mechanism to allow software controls to reduce the chip power when the full processing capabilities are not needed. The PMU allows the operating system to throttle, pause, or stop single or multiple units, or the entire chip, in order to manage chip power. For thermal monitoring and control, the processor employs two types of thermal sensors and a hardware controlled thermal management unit (TMU). A linear diode connected to two module pins allows an external device to monitor the temperature of the processor. This sensor is located in a relatively constant temperature location, giving a reading of the global temperature of the processor. This sensor is designed to be used for controlling external cooling mechanisms. Ten digital thermal sensors (DTS) are distributed on the chip to monitor temperatures in critical local regions. One DTS is located in each element and one is adjacent to the linear sensor. The TMU continuously monitors each digital thermal sensor and can be programmed to dynamically control the temperature of each processing element and interrupt the PPE when a temperature specified for each sensor is observed. The temperature of each element is controlled by adjusting the instruction issue rate (i.e., throttling) of a processing element based on the temperature of the associated DTS. Software controls the TMU by setting four temperature values and the amount of throttling for each sensor in the TMU. In increasing temperature, the first temperature specifies when the throttling of an element stops, the second when throttling starts, the third when the element is completely stopped, and the fourth when the chip's clocks are shut down. The first two temperature levels provide some level of hysteresis to avoid frequent transitions between throttling and normal operation. The third level stops the processing element when throttling alone is not enough to control the temperature. As long as the temperature is maintained below the third level, the processing element is guaranteed to achieve a certain performance level to meet the real-time needs of an application. When the temperature exceeds the safe operating range, specified by the fourth level, the TMU automatically shuts down the chip clocks to avoid permanent damage to the chip.
The DTS employed in the first-generation Cell processor is a diode-based design with a programmable temperature set point. Fig. 4 shows a schematic of the DTS. The diode reference voltage (Vref_diode) linearly decreases with increasing temperature while the reference voltage (Vref) linearly increases with an increase in temperature. By comparing Vref_diode to Vref, the temperature can be determined to be below or above the temperature associated with the reference voltage. The DTS provides many reference voltages and an analog multiplexer for the selection of which Vref to compare. The TMU constantly changes the selection to find the reference voltages where the comparison (Tx Detect) changes. Once determined, the temperature is within the range associated with reference voltages. The voltage difference between each reference corresponds to a 2 °C temperature difference. While the relative difference is constant, the absolute temperature varies with process. To eliminate the process variance, each DTS is calibrated during manufacture of the chip.
III. CMOS TECHNOLOGY SOI AND PACKAGING TECHNOLOGY OVERVIEW
The Cell processor was designed in a 90-nm PD-SOI process. The choice of the technology characteristics was crucial in being able to meet the design goals of the chip. Previously, designers of high-speed systems were mainly concerned with performance and area, but current and future designs must meet the triple constraints of power, performance, and area. In addition to system-level controls such as clock gating, and circuit level control such as low-power latch designs, technology choices of device thresholds, voltages, oxide thicknesses, and channel lengths are keys to controlling the power. Table I shows the technology parameters.
The 90-nm device targets included a 10.5-Å gate oxide thickness, 46-nm channel length, and a nominal voltage of 1 V. This device had the requisite performance to achieve the frequency goal at the 11FO4 design point. The optimum point for the device definition was a balance between the various characteristics. A thinner gate oxide would result in potentially higher performance but with significantly higher gate tunneling current and reliability concerns especially at the slower end of the process window. A shorter channel length or lower threshold voltage would result in potentially improved performance at the expense of significantly increased leakage current especially at the faster end of the process window. A higher voltage would result in more performance but higher AC and DC power across the process window. A detailed study of these gradients was conducted to determine the optimum design point. Since not all devices are used in circuits with critical delay paths, an additional higher threshold voltage (HVT) was used to help reduce the power of the chip. The HVT device was used throughout the design in logic blocks, array cells, and analog designs.
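Returning to the thermal management scheme of Section II, the four software-programmed TMU temperature levels (throttling stops, throttling starts, element stops, chip clocks shut down) can be sketched as a small decision function with hysteresis. This is a minimal illustration under stated assumptions and does not describe the hardware implementation: the type and function names are hypothetical, the exact comparison boundaries are assumed, the clock shutdown is assumed to latch, and an element leaving the stopped state is assumed to resume in the throttled state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Action(Enum):
    RUN = "run"
    THROTTLE = "throttle"
    STOP_ELEMENT = "stop_element"
    SHUT_DOWN_CLOCKS = "shut_down_clocks"


@dataclass(frozen=True)
class Thresholds:
    """Four per-sensor levels in increasing temperature (hypothetical names)."""
    throttle_end: float    # first level: throttling stops
    throttle_start: float  # second level: throttling starts
    element_stop: float    # third level: element is completely stopped
    chip_shutdown: float   # fourth level: chip clocks are shut down

    def __post_init__(self) -> None:
        levels = (self.throttle_end, self.throttle_start,
                  self.element_stop, self.chip_shutdown)
        if any(a >= b for a, b in zip(levels, levels[1:])):
            raise ValueError("levels must be strictly increasing")


def next_action(temp_c: float, previous: Action, t: Thresholds) -> Action:
    """Choose the action for one sensor reading, given the previous action."""
    # Assumption: once the clocks are shut down, the state latches.
    if previous is Action.SHUT_DOWN_CLOCKS or temp_c >= t.chip_shutdown:
        return Action.SHUT_DOWN_CLOCKS
    if temp_c >= t.element_stop:
        return Action.STOP_ELEMENT
    if temp_c >= t.throttle_start:
        return Action.THROTTLE
    if temp_c <= t.throttle_end:
        return Action.RUN
    # Hysteresis band between the first two levels: keep throttling if the
    # element was already throttled or stopped, otherwise keep running.
    if previous in (Action.THROTTLE, Action.STOP_ELEMENT):
        return Action.THROTTLE
    return Action.RUN


def run_trace(temps_c: Iterable[float], t: Thresholds) -> List[Action]:
    """Apply next_action to a sequence of readings, starting from RUN."""
    actions: List[Action] = []
    previous = Action.RUN
    for temp in temps_c:
        previous = next_action(temp, previous, t)
        actions.append(previous)
    return actions
```

With illustrative levels of 70, 80, 90, and 100 (arbitrary values chosen for the example), a reading of 75 leaves a running element running but keeps a throttled element throttled, which is the hysteresis the first two levels are described as providing.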
TABLE I
TECHNOLOGY PARAMETERS
was verified by quasi-static 2-D and 3-D and full-wave 3-D electromagnetic extraction and simulation and validated by passive characterization of specifically designed test vehicles. Final verification was done by passive and active characterization of the initial prototype hardware.
takes three cycles for the first 140-bit word to be read, with the second word in the fourth cycle. Thus, write/reads can be interleaved with data continuously being fetched. Bit-line redundancy is implemented using a bit-shifting approach. There is one repair action possible for each sub-array leading to a total of 16 repairs per macro.
The SPE main memory is the LS. The LS unit in the SPE is a local memory comprised of several macros performing load/stores, transactions for DMA, and instruction fetches into an instruction line buffer (ILB). Because the LS occupies one-third of the SPE floorplan, area, power, and yield are as important as performance.
As shown in Figs. 19 and 20, the LS consists of a sum addressed decoder (memdec), four 64-kB memory arrays (mem64k), write accumulation buffers (wacc and wtb), and read accumulation buffers (rdb1, rdb2, and rdb41), distributed throughout the SPE. It takes four cycles to complete a write and six cycles to perform a read. The numbered latch points in the pipeline diagram also appear in the physical image to indicate the location of the latches. A memory access starts by setting address operand registers in the memdec macro. The memdec effectively adds the operands while pre-decoding bit groups of the sum, with the results latched at the macro output at the end of the first cycle. In the second cycle, the pre-decoded indices are distributed to the four copies of the 64-kB mem64k memory block. The decoding of the address is completed within mem64k in the third cycle, which ends with the word-line selection latched at the word-line (WL) driver. The sub-array access (fourth cycle), starts with WL activation, and a write operation completes during this cycle. For a read, the sense amplifier (SA) senses the bit-line differential signal and holds the value until it is captured by the read latch (RL) at the beginning of the fifth cycle. The remainder of the fifth cycle transfers the data read from the arrays either to the read buffers or to a four-ported latch (rdb41). The fetched data are forwarded to the execution units in the sixth cycle. Macro placement, number of inversions, pin locations, wire length balance, bus interleaving, hostile/quiet neighbor wire assignment, choice of metal layer, metal width and spacing were engineered to achieve the six-cycle 11FO4 path.
Two basic read schemes were used for the arrays on this chip. The base domino read implementation used for both the L1 and L2 caches is shown in Fig. 21. The local bit-line has 16 cells connected. The global bit-line can have 4, 8, or 16 local evaluation circuits connected. The continuous bit-line has 64, 128, or 256 cells connected, respectively. The local/global bit-lines are
Fig. 20. Die photo of the PPE caches and SPE local store.
used to read the cell, the continuous bit-line and write data_b are used to write. The local evaluation shown is a minimum device solution to reduce the local overhead. The local/global bit-lines are used to read the cell in dynamic fashion when the cell is activated. The continuous bit-line and write data_b are used to write the cell. The continuous bit-line provides complement data directly to the cell while data_b provides true data to the local eval to drive onto the local bit-lines to the cell. When the cell is activated the true and complement data update the cell contents.
The sense amplifier used in the LS design is also shown. The sense amplifier is a traditional gate only input design with added cross coupled PFETs to enhance its signal sensing ability, while retaining the ability to pre-charge at a different time than the storage cells. One sense amplifier can service four sets of 64-cell/bit-line groups.
Special attention was directed to cell stability. Mismatches, due to random dopant fluctuations, increase as transistors get smaller. In addition, factors like ACLV, corner rounding, printability issues, poly roughness and SOI history effects add variability to the devices, representing a tremen-
hierarchy, from processor elements to power and clock distributions, global routing, and the chip assembly techniques, are engineered to support a modular design in a building-block-like construction.
B. Noise Analysis
Given the aggressive frequency targets and the circuit design techniques described in the previous section, it was apparent that a detailed chip noise analysis strategy would be required to ensure stable operation of the chip. With the chip device counts described above, it is impractical to analyze the whole chip using a flat transistor-level simulator. Thus, in keeping with the hierarchical/modular design philosophy, a hierarchical noise analysis approach was applied, which relied on a macro-level static noise analysis [14], followed by a unit/chip-level noise analysis, as shown in Fig. 23.
Macro-level static noise analysis was carried out on all designs, including static/dynamic circuits, arrays, and synthesized random logic macros (RLMs), creating a noise abstract at the same time. The analysis was performed by a transistor-level simulator, running on a net list with parasitics extracted from the layout. This noise tool carried out a static noise analysis on each channel connected component, looking for noise failures throughout the design. A macro-level noise abstract was generated during the analysis, which included input noise tolerance level, input capacitance, and output resistance.
Unit/chip-level noise analysis was then performed using the macro noise abstracts and information from timing analysis. Macros were represented by their abstracts, which specified the input capacitance and output resistance. Extracted parasitic parameters are obtained from the global layout. Unit/chip-level noise was then analyzed with an equivalent circuit composed of resistors, capacitors, and voltage sources which were driven according to the timing information. A noise failure is reported if the noise level on a net exceeds the receiver pin’s input tolerance level, which was determined during the macro-level noise analysis.
C. Power Analysis
To calculate the total power, input switching factors (SF) and clock activity (CA) for each macro are monitored in an RTL-level simulation of a given workload. For every macro instance, the input SF is calculated by observing the percent of inputs that have changed state from the previous cycle. The CA is measured by observing the number of local clock buffers that are turned on in a given cycle. The ability to have a cycle-by-cycle power for each macro for a realistic workload made this methodology very useful for package analysis. Since power is dependent on both SF and CA, these factors need to be calculated for every cycle of the simulation and cannot be simply averaged over the entire simulation. In a given cycle the total power is calculated as
[Equation (1) is missing from the extracted text; only the fragments "1" and "2" survive.]
where [symbol missing] is the amount of global net capacitance switched in the cycle.
To estimate the power consumption of the first-generation Cell processor and for refining and verifying the power management logic of the chip, each core or functional unit on the chip was required to run at least three different types of workloads: idle, typical, and high power. The idle workload was expected to be the lowest power state, since it shuts off as many local clock buffers as possible. The high power test case was used as a stimulus for power grid integrity analysis, thermal analysis, and as an input for the IR drop analysis.
Fig. 24 shows the power analysis flow. Fig. 25 shows the macro and net powers of the typical power workload, and also the power traces of the three workloads. Fig. 26 shows the number of local clock buffers active for the three different workloads. The runtime for 4000 clock cycles on a core with 20.9 M transistors was approximately 30 minutes.
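The per-cycle bookkeeping described under Power Analysis can be sketched as follows: the input switching factor (SF) is the fraction of a macro's inputs that changed since the previous cycle, and the clock activity (CA) is the number of local clock buffers turned on in the cycle. This is a minimal sketch with hypothetical function names and a made-up trace format (one tuple of 0/1 values per cycle). It does not reproduce the paper's power equation, which is not available in this text.

```python
from typing import List, Sequence, Tuple


def switching_factor(prev_inputs: Sequence[int], curr_inputs: Sequence[int]) -> float:
    """Fraction of macro inputs whose value differs from the previous cycle."""
    if len(prev_inputs) != len(curr_inputs) or not curr_inputs:
        raise ValueError("input vectors must be non-empty and of equal length")
    changed = sum(1 for a, b in zip(prev_inputs, curr_inputs) if a != b)
    return changed / len(curr_inputs)


def clock_activity(lcb_enables: Sequence[int]) -> int:
    """Number of local clock buffers that are turned on in this cycle."""
    return sum(1 for enabled in lcb_enables if enabled)


def per_cycle_factors(
    input_trace: Sequence[Sequence[int]],
    lcb_trace: Sequence[Sequence[int]],
) -> List[Tuple[float, int]]:
    """Return (SF, CA) for every cycle after the first one in the trace."""
    if len(input_trace) != len(lcb_trace):
        raise ValueError("traces must cover the same number of cycles")
    factors: List[Tuple[float, int]] = []
    for cycle in range(1, len(input_trace)):
        sf = switching_factor(input_trace[cycle - 1], input_trace[cycle])
        ca = clock_activity(lcb_trace[cycle])
        factors.append((sf, ca))
    return factors


if __name__ == "__main__":
    inputs = [(0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 1, 1)]
    lcbs = [(0, 0), (1, 1), (0, 0)]
    print(per_cycle_factors(inputs, lcbs))
```

For the small trace in the `__main__` block, the per-cycle pairs are (1.0, 2) and (0.0, 0). If some quantity depended on the product of the two factors (an illustrative stand-in only, not the paper's formula), the mean of the per-cycle products would be 1.0, while the product of the averaged factors would be 0.5 x 1.0 = 0.5. That gap is one way to see why the text says the factors cannot simply be averaged over the entire simulation.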
Fig. 26. Percent of local clock buffers active for idle, typical, and high-power workload.
ACKNOWLEDGMENT
The authors gratefully acknowledge the collaborations and contributions from the entire Sony-Toshiba-IBM team who worked tirelessly side-by-side on the design of this processor. They also gratefully acknowledge the executive and the management teams of the three companies who provided management oversight and created the right business conditions for this project. Special thanks go to M. Beattie, E. Behnen, C. Carter, S. Dhong, T. Himeno, B. Krauter, S. Lee, J. Petrovick, G. Plumb, P. Restle, D. Shippy, J. Silberman, K. Miki, J. Qi, E. Hailu, M. Yoshida, and D. Widiger for their contributions to this project.
Dac C. Pham received the B.S.E.E. degree from University of Rhode Island, Kingston, in 1986, the M.S.E.E. degree from Rensselaer Polytechnic Institute, Troy, NY, in 1987, and the Ph.D. degree in materials science from the University of Vermont, Burlington.
He joined IBM in 1987, where he worked on IBM’s 386 SX, 386SLP, and Blue Lightning family of processors that were used in IBM PS/2 Systems. In 1992, he joined the Apple-IBM-Motorola Somerset Design Center, where he worked on IBM’s PowerPC 603 and follow-on processors which led to PowerPC G3. He also worked on IBM’s PowerPC 602 and a Graphics Co-Processor that were used in Gaming Systems, Consumer Electronics, and PDA devices. In 1997, he joined the IBM Giga-Processor team where he worked on IBM’s Power 4 processor which was used in IBM servers and Apple’s G5. Between 1999 and 2003, he was with Intel’s Texas Design Center and managed the Chip Implementation team of a next-generation Pentium processor. He re-joined IBM in 2003, where he worked on the high-speed, multi-core, first-generation Cell processor.
Dr. Pham is currently the Sony-Toshiba-IBM Design Center Chief Engineer and Global Convergence Functional Manager.
Tony Aipperspach received the B.S. degree in electrical engineering from Montana State University, Bozeman, in 1979.
He joined IBM’s Rochester Development Laboratory, Rochester, MN, in July 1979. He was involved in the design of an on-chip DRAM in 3-µm NMOS. From 1981 to 1993, he was involved with various ASIC library designs, as well as integrated SRAM design. From 1994 to 1996, he was involved in BICMOS circuit design and a variety of memory configurations including SRAM, CAM, and register arrays. From 1996 to 2000, he was the team lead for the array design group and supported all memory elements of the Star series of processors for the AS400. Between 2000 and 2005, he was Array Lead for various processor designs in support of the gaming industry, including the Cell processor. His current position is Array and Circuit Lead/Consultant on various projects.
Rajat Chaudry received the B.S. and M.S. degrees in electrical engineering from the University of Texas at Austin in 1996 and 1998, respectively.
He currently works for the Tools and Methodology group at the STI Design Center in Austin, TX. He works on power estimation and power grid verification tools.
Dennis Cox received the B.S. degree in electrical engineering from the University of Wisconsin at Madison in 1970 and the M.S. degree in electrical engineering from Syracuse University, Syracuse, NY, in 1975.
He is a Distinguished Engineer in Engineering and Technology Services (E&TS) and a member of the IBM Academy of Technology working in the area of technology delivery and circuit design. He joined IBM, Kingston, NY, in 1970 as a Circuit Design Engineer where he designed SRAMs, dynamic logic processors, and PLA. He transferred to Circuit Technology in Rochester, MN, in 1976 and worked in the development of ASIC design systems. Subsequently, as a part of Server Group, he worked on a number of different Power PC microprocessors. With the advent of E&TS, he has been working with a number of companies in different design and consulting contracts as well as leading various IBM Academy studies.
H. Peter Hofstee (M’96) received the “Doctorandus” degree in theoretical physics from Groningen University, The Netherlands, in 1989, and the Ph.D. degree in computer science from the California Institute of Technology (Caltech), Pasadena, in 1995.
He is a Senior Technical Staff Member in the Sony-Toshiba-IBM (STI) Design Center, IBM, Austin, TX. He is the chief scientist for Cell and the chief architect of the Cell Synergistic Processor Element. He joined the Caltech faculty in 1995 and 1996 to teach computer science and VLSI. In 1996, he joined the IBM Austin Research Laboratory where he helped to create the first GHz CMOS processor. Between 1997 and 2000, he worked on a number of other high-frequency server processor designs. In 2000, he helped create the concept for Cell and became one of the founding members of the STI Design Center in 2001. His current interest focuses on application of the Cell processor beyond the gaming space and on future Cell processor designs. He is a member of the ACM and a member of the IBM Academy of Technology.
John Keaty (M’86) received the B.S. degree in mathematics from the State University of New York, Plattsburgh, in 1974, and the M.A. degree in mathematics in 1977 and the M.S. degree in computer science, both from the University of Wisconsin-Madison, in 1980.
He is a Senior Technical Staff Member and Global Integration Leader in the STI Design Center, Austin, TX. He joined the IBM Microelectronics Division in Burlington, VT, in 1981 to work on automated diagnostic systems for semiconductor logic products. Since then, he has worked in ASIC product development on several CMOS logic families, and also on industry standard microprocessor development in the IBM Microelectronics Division before transferring to the IBM Server Group in Austin, TX, in 1996. He was responsible for timing of the Power4 processor in the R/S 6000 Regatta system. He is currently responsible for the global integration and physical design of the Cell processor family.
Charles Johns received the B.S. degree in electrical engineering from the University of Texas at Austin in 1984. He is a Senior Technical Staff Member in the Sony-Toshiba-IBM Design Center, Austin, TX. After joining IBM Austin in 1984, he worked on various disk, memory, voice communication, and graphics adapters for the IBM Personal Computer. From 1988 until he transferred to the STI project in 2000, he was part of the Graphics Organization and was responsible for the architecture and development of entry and midrange 3-D graphics adapters and raster engines. He is now responsible for the Cell Broadband Engine Architecture (BPA) and participates in the development of the first-generation Cell processor.
Jim Kahle received the B.S. degree from Rice University, Houston, TX, in 1983. He is an IBM Fellow with over 20 years of research and development experience in chip design. He is a renowned expert in the microprocessor industry for delivering breakthrough designs. His current role is Chief Architect and Director of Technology for the Austin Design Center for Cell Technology, which is a partnership between IBM, Sony, and Toshiba. He has been working for IBM since the early 1980s on RISC-based microprocessors. His work started in physical design tools and is now concentrated on RISC architecture. He was a key designer for the RIOS I processor that launched IBM into the RS/6000 line of workstations and servers. He was also one of the founding members of the Somerset Design Center, where he was the project manager for the PowerPC 603 and follow-on processors which led to the PowerPC G3. These chips and cores have been used in many applications, from embedded cores to Apple systems and Nintendo's new GameCube. He was key to the definition of the Power Architecture and of the superscalar techniques used at IBM. He has been at the forefront of dual-core designs, asymmetric multiprocessors, superscalar design, and SMT microarchitectures. He was the Chief Architect for the Power4 core and continually assists in Power processor roadmap planning.
Yoshio Masubuchi (M’89) received the B.S. and M.S. degrees in electronics engineering from the University of Tokyo in 1981 and 1983, respectively. He is an Assistant to the General Manager, Broadband System LSI Development Center, Toshiba Corporation Semiconductor Company, Austin, TX. Since joining Toshiba in 1983, he has been engaged in research and development of microprocessors and computer systems. From 1988 to 1990, he was a visiting Fellow at the University of California, Berkeley. He served as Toshiba Project Leader for the Cell project at the STI Design Center, Austin, TX, from 2002 to 2005. He is a member of the ACM, the Information Processing Society of Japan, and the Institute of Electronics, Information and Communication Engineers.
Mydung Pham received the B.S. degree in electrical and computer engineering from the University of Vermont, Burlington, in 1992. Since joining Apple-IBM-Motorola’s Somerset Design Center in 1992, she has worked on various microprocessor designs of the PowerPC 603 and follow-on processors which led to the PowerPC G3. In 2000, she worked on the PowerPC G5 as Chip Timing Lead. In 2001, she joined the Cell processor design team as Memory Management Technical Lead. She is currently a Manager of the circuit design, timing, and physical integration team for the PowerPC Processor Element (PPE).
Atsushi Kameyama (M’91) received the B.S., M.S., and Ph.D. degrees in physical electronics from the Tokyo Institute of Technology, Tokyo, Japan, in 1982, 1984, and 1999, respectively. In 1984, he joined Toshiba VLSI Research Center, where he was engaged in the circuit design and development of GaAs digital ICs. From 1992 to 1994, he was a visiting scholar at Stanford University, Stanford, CA, doing research in the field of pnp-type AlGaAs/GaAs HBT. Since 1994, he has been involved in the development of GaAs MMIC and low-power circuit design using SOI. He is currently a Design Engineering Manager with Toshiba America Electronic Components, Inc., Austin, TX.
Jürgen Pille received the M.S. degree in microelectronics from the University of Hanover, Germany, in 1990. He joined IBM at the IBM Development Laboratory, Boeblingen, Germany, in 1990 to work on Advanced VLSI Designs. He was involved in several areas of the S/390 and PowerPC microprocessor development including assignments at the IBM Development Laboratories, Poughkeepsie, NY, and Austin, TX. He has worked in several areas including circuit design, array design, and custom logic design. As a Senior Technical Staff Member, he is currently responsible for array designs in the STI Design Center, Austin, TX.
196 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006
Stephen Posluszny received the B.S. and M.S. degrees in computer engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1982. He joined IBM in 1982, working in Fishkill, NY, with IBM’s corporate design automation organization in the area of silicon compaction. In 1988, he moved to Austin, TX, to work on logic synthesis development for the RS/6000 line of processors. Between 1995 and 2000, he worked in IBM’s Austin Research Laboratory on the first 1-Gigahertz Processor. In 2001, he joined the Sony-Toshiba-IBM Design Center as IBM’s Senior Technical Staff Member and Tools and Methodology Technical Lead. He has contributed to the Center’s high-frequency design methodology definition and leads the tools team supporting the design effort.
James Warnock (M’89) received the Ph.D. degree in physics from the Massachusetts Institute of Technology, Cambridge, in 1985. Since then, he has been at IBM, Yorktown Heights, NY, working on advanced bipolar and CMOS silicon technologies, and more recently in the area of circuit design for high-performance digital microprocessors. This included work on the S/390 G4 processor and also on the Power4 chip, where he was the circuit design team leader. Currently, he is the circuit lead for the Entertainment and Embedded Processor Development Organization in IBM’s Systems and Technology group. He is involved with work on the Cell processor and also with several other microprocessor development programs. He has experience in process technology and device design as well as in SOI digital circuit design, and has authored or co-authored over 160 conference or journal papers.