
A 48-Core IA-32 Processor in 45 nm CMOS
Using On-Die Message-Passing and DVFS
for Performance and Power Scaling
Jason Howard, Saurabh Dighe, Sriram R. Vangal, Gregory Ruhl, Member, IEEE, Nitin Borkar, Shailendra Jain,
Vasantha Erraguntla, Michael Konow, Michael Riepen, Matthias Gries, Guido Droege, Tor Lund-Larsen,
Sebastian Steibl, Shekhar Borkar, Vivek K. De, and Rob Van Der Wijngaart

Abstract—This paper describes a multi-core processor that integrates 48 cores, 4 DDR3 memory channels, and a voltage regulator controller in a 6 × 4 2D-mesh network-on-chip architecture. Located at each mesh node is a five-port virtual cut-through packet-switched router shared between two IA-32 cores. Core-to-core communication uses message passing while exploiting 384 KB of on-die shared memory. Fine-grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. At the nominal 1.1 V supply, the cores operate at 1 GHz while the 2D-mesh operates at 2 GHz. As performance and voltage scale, the processor dissipates between 25 W and 125 W. The 567 mm² processor is implemented in 45 nm Hi-K CMOS and has 1.3 billion transistors.

Index Terms—2D routing, CMOS digital integrated circuits, DDR3 controllers, dynamic voltage frequency scaling (DVFS), IA-32, message passing, network-on-chip (NoC).

I. INTRODUCTION

A FUNDAMENTAL shift in microprocessor design from frequency scaling to increased core counts has facilitated the emergence of many-core architectures. Recent many-core designs have demonstrated high performance while achieving greater energy efficiency [1]. However, the complexity of maintaining coherency across traditional memory hierarchies in many-core designs creates a dilemma: the computational value gained through additional cores will at some point be exceeded by the protocol overhead needed to maintain cache coherency among the cores. Architectural techniques can delay this crossover point for only so long. An alternative approach is to eliminate hardware cache coherency altogether and rely on software to maintain data consistency between cores. Many-core architectures also face steep design challenges with respect to power consumption. The seemingly endless compaction and density increase of transistors, as stated by Moore's Law [2], has positively impacted the growth in core counts while negatively impacting thermal gradients through the surge in power density. To mitigate these effects, many-core architectures will be required to employ a variety of power-saving techniques.

Manuscript received April 15, 2010; revised July 16, 2010; accepted August
30, 2010. Date of publication November 09, 2010; date of current version December 27, 2010. This paper was approved by Guest Editor Tanay Karnik.
J. Howard, S. Dighe, S. Vangal, G. Ruhl, N. Borkar, S. Borkar, and
V. K. De are with Intel Corporation, Hillsboro, OR 97124 USA (e-mail:
jason.m.howard@intel.com).
S. Jain and V. Erraguntla are with Intel Labs, Bangalore, India.
M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, and S. Steibl
are with Intel Labs, Braunschweig, Germany.
R. Van Der Wijngaart is with Intel Labs, Santa Clara, CA.
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2010.2079450

The prototype processor (Fig. 1) described in this paper is an evolutionary approach toward many-core Network-on-Chip (NoC) architectures that removes the dependence on hardware-maintained cache coherency while remaining within a constrained power budget. The 48 cores communicate over an on-die network using a message passing architecture that allows data sharing with software-maintained memory consistency. The processor also uses voltage and frequency islands with Dynamic Voltage and Frequency Scaling (DVFS) to improve energy efficiency.
The remainder of the paper is organized as follows. Section II gives a more in-depth architectural description of the 48-core processor and describes key building blocks. The section also highlights enhancements made to the IA-32 core and describes the accompanying non-core logic. Router architectural details and packet formats are also described, followed by an explanation of the DDR3 memory controller. Section III presents a novel message-passing-based software protocol used to maintain data consistency in shared memory. Section IV describes the DVFS power reduction techniques. Details of an on-die voltage regulator controller are also discussed. Experimental results, chip measurements, and programming methodologies are given in Section V.
II. TOP LEVEL ARCHITECTURE
The processor is implemented in 45 nm high-K metal-gate CMOS [3] with a total die area of 567 mm² and contains 1.3 billion transistors. The architecture integrates 48 Pentium-class IA-32 cores [4] using a tiled design methodology; the 24 tiles are arrayed in a 6 × 4 grid with 2 cores per tile. High-speed, low-latency routers are embedded within each tile to provide a 2D-mesh interconnect network with sufficient bandwidth, an essential ingredient in complex, many-core NoCs. Four DDR3 memory channels reside on the periphery of the 2D-mesh network to provide up to 64 GB of system memory. Additionally, an 8-byte bidirectional high-speed I/O interface is used for all off-die communication.
Included within a tile are two 256 KB unified L2 caches, one for each core, and supporting network interface (NI) logic required for core-to-router communication.


Fig. 1. Block diagram and tile architecture.

Fig. 2. Full-chip and tile micrograph and characteristics.

Each tile's NI logic also features a Message Passing Buffer (MPB), a 16 KB block of on-die shared memory. The MPB is used to increase the performance of a message passing programming model in which cores communicate through local shared memory.
Total die power is kept to a minimum by dynamically scaling both voltage and performance. Fine-grained voltage change commands are transmitted over the on-die network to a Voltage Regulator Controller (VRC). The VRC interfaces with two on-package voltage regulators, each with a 4-rail output, to supply 8 on-die voltage islands. Further power savings are achieved through active frequency scaling at tile granularity. Frequency change commands are issued to a tile's un-core logic, where the frequency adjustments are processed. Tile performance scales from 300 MHz at 700 mV to 1.3 GHz at 1.3 V. The on-chip network scales from 60 MHz at 550 mV to 2.6 GHz at 1.3 V. The design target for nominal usage is 1 GHz for tiles and 2 GHz for the 2-D network when supplied by 1.1 V. Full-chip and tile micrographs and characteristics are shown in Fig. 2.

A. Core Architecture
The core is an enhanced version of the second-generation Pentium processor [4]. The L1 instruction and data caches have been upsized to 16 KB, over the previous 8 KB design, and support 4-way set associativity and both write-through and write-back modes for increased performance. Additionally, data cache lines have been extended with a new status bit that marks the content of the cache line as Message Passing Memory Type (MPMT). The MPMT is introduced to differentiate between normal memory data and message passing data. The cache line's MPMT bit is determined by page table information found in the core's TLB and must be set up properly by the operating system. The Pentium instruction set architecture has been extended with a new instruction, INVDMB, used to support software-managed coherency. When executed, an INVDMB instruction invalidates all MPMT cache lines in a single clock cycle. Subsequent reads or writes to MPMT cache lines are guaranteed to miss, and data will be fetched from or written to memory. The instruction gives the programmer direct control of cache management in a message passing environment.


Fig. 3. Router architecture and latency.

The addressable space of the core has been extended from 32 bits to 36 bits to support 64 GB of system memory. This is accomplished using a 256-entry look-up table (LUT) extension. To ensure proper physical routing of system addresses, the LUTs also provide address destination and routing information. A bypass status bit in an LUT entry allows direct access to the local tile MPB. To provide applications and designers the most flexibility, the LUT can be reconfigured dynamically.
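To make the LUT mechanism concrete, the following C sketch models the translation step. Only the 256-entry size, the 36-bit target address, the destination/routing fields, and the MPB bypass bit come from the text; the entry layout, the field widths, and the assumption that the LUT is indexed by the top 8 bits of the 32-bit core address (giving 16 MB segments) are illustrative guesses, not the documented format.

```c
#include <stdint.h>
#include <stdbool.h>

/* One LUT entry, as a software model. Field widths and layout are
 * illustrative assumptions; the paper only states that each entry
 * supplies extended address bits, destination/routing information,
 * and a bypass bit for the local MPB. */
typedef struct {
    uint16_t ext_bits;   /* upper 12 bits of the 36-bit system address */
    uint8_t  route;      /* tile/router the request is steered to      */
    uint8_t  dest_id;    /* agent within the destination tile          */
    bool     bypass_mpb; /* true: access goes straight to the tile MPB */
} lut_entry_t;

typedef struct {
    uint64_t sys_addr;   /* 36-bit physical system address             */
    uint8_t  route;
    uint8_t  dest_id;
    bool     bypass_mpb;
} mem_request_t;

/* Translate a 32-bit core address into a routed 36-bit system request,
 * assuming the LUT is indexed by the top 8 address bits (256 entries,
 * 16 MB segment granularity). */
mem_request_t translate(const lut_entry_t lut[256], uint32_t core_addr)
{
    const lut_entry_t *e = &lut[core_addr >> 24];          /* 8-bit index */
    mem_request_t req = {
        /* splice the extension bits above the 24-bit segment offset */
        .sys_addr   = ((uint64_t)e->ext_bits << 24) | (core_addr & 0x00FFFFFFu),
        .route      = e->route,
        .dest_id    = e->dest_id,
        .bypass_mpb = e->bypass_mpb,
    };
    return req;
}
```

Because the LUT is dynamically reconfigurable, software can remap a 16 MB segment (for example, to point at another tile's MPB) without touching the core's 32-bit view of memory.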
B. L2 Cache and Controller
Each core's L1 data and instruction caches are reinforced by a unified 256 KB 4-way write-back L2 cache. The L2 cache uses a 32-byte line size to match the line size of the core's internal L1 caches. Salient features of the L2 cache include a 10-cycle hit latency, in-line double-error detection and single-error correction for improved performance, several programmable sleep modes for power reduction, and a programmable time-out and retry mechanism for increased system reliability. Evicted cache lines are selected through a strict least-recently-used (LRU) algorithm. After every reset deassertion, the L2 cache needs 4000 cycles to initialize the state array and LRU bits; during that time, grants to the core are deasserted and no requests are accepted.
Several architectural attributes of the core were considered during the design of the L2 cache and its associated cache controller. Simplifications to the controller's pipeline were possible because the core allows only one outstanding read/write request at a given time. Additionally, inclusion is not maintained between the core's L1 caches and its L2 cache. This eliminates the need for snoop or inquire cycles between the L1 and L2 caches and allows data to be evicted from the L2 cache without an inquire cycle to the L1 cache. Finally, there is no allocate-on-write capability in the core's L1 caches; thus, an L1 cache write miss that hits in the L2 cache does not write the cache line back into the L1 cache.
High post-silicon visibility into the L2 cache was achieved through a comprehensive scan approach. All L2 cache lines were made scan addressable, including both tag and data arrays and the LRU status bits. A full self-test feature was also included that allows the L2 cache controller to write either random or programmed data to all cache lines, followed by a read comparison of the results.
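As an illustration of the replacement policy, the sketch below is a minimal software model of strict LRU for a single 4-way set; the actual encoding of the LRU bits held in the L2 arrays is not described in the paper.

```c
#include <stdint.h>

#define WAYS 4

/* Software model of strict LRU for one 4-way L2 set.
 * age[w] == 0 marks the most recently used way, age[w] == 3 the
 * least recently used; a real design would pack this differently. */
typedef struct {
    uint8_t age[WAYS];
} lru_set_t;

void lru_reset(lru_set_t *s)
{
    for (int w = 0; w < WAYS; w++)
        s->age[w] = (uint8_t)w;          /* any strict ordering works */
}

/* On a hit (or a fill) of 'way': ways younger than it age by one,
 * and the touched way becomes most recently used. */
void lru_touch(lru_set_t *s, int way)
{
    uint8_t old = s->age[way];
    for (int w = 0; w < WAYS; w++)
        if (s->age[w] < old)
            s->age[w]++;
    s->age[way] = 0;
}

/* On a miss, the victim is the way with the largest age. */
int lru_victim(const lru_set_t *s)
{
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (s->age[w] > s->age[victim])
            victim = w;
    return victim;
}
```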
C. Router Architecture
The 5-port router [5] uses two 144-bit unidirectional links to connect with 4 neighboring routers and one local port, forming the 2-D mesh on-die network. As an alternative to the wormhole routing used in earlier work [1], virtual cut-through switching is used for reduced mesh latency. The router has 4 pipe stages (Fig. 3) and an operational frequency of 2 GHz at 1.1 V. The first stage includes link traversal of the incoming packet and the input buffer write. Switch arbitration is done in the second stage; the third and fourth stages are the VC allocation and switch traversal stages, respectively. Two message classes (MCs) and eight virtual channels (VCs) ensure deadlock-free routing and maximize bandwidth utilization. Two VCs are reserved: VC6 for request MCs and VC7 for response MCs.
Dimension-ordered XY routing eliminates network deadlock, and route pre-computation in the previous hop allows fast output port identification on packet arrival. Input port and output port arbitrations are done concurrently using a centralized, conflict-free wrapped wave-front arbiter [6] formed from a 5 × 5 array of asymmetric cells (Fig. 4). A cell with a row (column) token that is unable to use the token passes it to the right (down), wrapping around at the end of the array. These tokens propagate in a wave-front from the priority diagonal group. If a cell with a request receives both a row and a column token, it grants the request and stops the propagation of tokens.


Fig. 4. Wrapped wave-front arbiter.

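The arbitration scheme can be captured in a small functional model. The C sketch below emulates the wrapped wave-front arbiter behaviorally (it is not the RTL): cells are visited diagonal by diagonal starting at the priority diagonal, which mirrors the order in which the row and column tokens arrive.

```c
#include <stdbool.h>
#include <string.h>

#define N 5   /* 5 x 5 array of cells: 5 input ports x 5 output ports */

/* Functional model of the wrapped wave-front arbiter. req[i][j] is set
 * when input port i requests output port j; prio selects the priority
 * diagonal (rotated every cycle for fairness in a real design). A cell
 * grants only if it has a request and both its row and column tokens
 * are still unclaimed; otherwise the tokens pass on (right / down,
 * wrapping at the array edge), i.e. remain available to later cells. */
void wwf_arbitrate(const bool req[N][N], int prio, bool grant[N][N])
{
    bool row_token[N], col_token[N];
    memset(grant, 0, sizeof(bool) * N * N);
    for (int k = 0; k < N; k++) {
        row_token[k] = true;            /* one token per input port  */
        col_token[k] = true;            /* one token per output port */
    }

    for (int d = 0; d < N; d++) {           /* wave-front sweep        */
        int diag = (prio + d) % N;          /* cells with (i+j)%N==diag */
        for (int i = 0; i < N; i++) {
            int j = (diag - i + N) % N;     /* this diagonal's cell in row i */
            if (req[i][j] && row_token[i] && col_token[j]) {
                grant[i][j]  = true;        /* match input i -> output j */
                row_token[i] = false;       /* consume both tokens       */
                col_token[j] = false;
            }
        }
    }
}
```

Because a cell grants only when it holds both tokens, at most one grant is produced per input and per output each cycle, giving a conflict-free crossbar schedule in a single pass.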
Crossbar switch allocation is done in a single clock cycle, on
a packet granularity. No-load router latency is 4 clock cycles,
including link traversal. Individual links offer 64 GB/s of interconnect bandwidth, enabling the total network to support 256
GB/s of bisection bandwidth.
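These figures can be reconstructed under one assumption not spelled out in the text, namely that 128 of the 144 bits per flit carry data payload (the rest being control/sideband):

```latex
BW_{\text{link}} \approx 2 \times 16\,\text{B} \times 2\,\text{GHz} = 64\,\text{GB/s},
\qquad
BW_{\text{bisection}} \approx 4 \times 64\,\text{GB/s} = 256\,\text{GB/s}.
```

The factor of four corresponds to the four bidirectional links crossing the cut that splits the 6 × 4 mesh into two 3 × 4 halves.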
D. Communication and Router Packet Format
A packet is the granularity at which all 2-D mesh agents communicate with each other. A packet may be subdivided into one or more FLITs, or flow control units. The header (Fig. 5) is the first FLIT of any packet and contains routing-related fields and flow control commands. Protocol-layer packets are divided into request and response packets through the message class field. When required, data payload FLITs follow the header FLIT. Important fields within the header FLIT include:
Route: identifies the tile/router a packet will traverse to;
DestID: identifies the final agent within a tile a packet is addressed to;
SourceID: identifies the mesh agent within each node/tile a packet is from;
Command: identifies the type of MC (request or response);
TransactionID: a unique identifier assigned at packet generation time.
Packetization of core requests and responses into FLITs is handled by the associated un-core logic. It is incumbent on the user to correctly packetize all FLITs generated off die.
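A software-side view of the header FLIT might look like the sketch below. Only the five field names listed above come from the text; the widths, the presence of an address field, and the 16-byte data-flit payload size are illustrative assumptions (the actual layouts are those of Fig. 5).

```c
#include <stdint.h>

/* Illustrative software view of a request header FLIT. Field names
 * follow the text; field sizes and the extra members are assumptions
 * chosen only to make the sketch self-contained. */
typedef struct {
    uint8_t  route;          /* tile/router the packet traverses to      */
    uint8_t  dest_id;        /* final agent within the destination tile  */
    uint8_t  source_id;      /* mesh agent within the originating tile   */
    uint8_t  command;        /* message class: request or response       */
    uint8_t  transaction_id; /* unique ID assigned at packet generation  */
    uint64_t address;        /* 36-bit system address (assumed field)    */
    uint8_t  payload_flits;  /* number of data FLITs that follow         */
} header_flit_t;

/* Packetization as performed by the un-core: a header flit followed by
 * data flits; a 16-byte payload per data flit is again an assumption. */
static inline int num_flits(uint32_t payload_bytes)
{
    return 1 + (int)((payload_bytes + 15u) / 16u);
}
```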
E. DDR3 Memory Controller
Memory transactions are serviced by four DDR3 [7] integrated memory controllers (IMCs) positioned at the periphery of the 2D-mesh network. The controllers support DDR3-800, -1066, and -1333 speed grades and reach 75% bandwidth efficiency using rank and bank interleaving with a closed-page policy. By supporting dual ranks and two DIMMs per channel, a system memory of 64 GB is realized using 8 GB DIMMs. An overview of the IMC is shown in Fig. 6.
All memory access requests enter the IMC through the Mesh Interface Unit (MIU). The MIU reassembles memory transactions from the 2-D mesh packet protocol and passes each transaction to the Access Controller (ACC) block, the controller state machine. The Analog Front End (AFE) circuits provide the actual I/O buffers and fine-grain DDR3-compliant compensation and training control. The IMC's AFE is a derivative of productized IP [8].
The ACC block is responsible for buffering and issuing up to eight data transfers in order while interleaving control sequences such as refresh and ZQ calibration commands. Control sequence interleaving results in a 5X increase in achievable bandwidth, since activate and precharge delays can be hidden behind data transfers on the DDR3 bus. The ACC also applies closed-page mode by precharging activated memory pages with auto-precharge as soon as a burst access finishes. A complete feature list of the IMC is given in Table I.
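To illustrate why rank and bank interleaving helps under a closed-page policy, the sketch below shows one possible physical-address decode. The actual bit mapping used by the IMC is not given in the paper; the field positions are assumptions chosen only to show consecutive transfers spreading across banks and ranks so that activate/precharge latencies overlap with data transfers.

```c
#include <stdint.h>

/* Illustrative DRAM coordinate decode (per channel). */
typedef struct {
    unsigned column;  /* column within the (auto-precharged) page        */
    unsigned bank;    /* 8 banks per rank in DDR3                        */
    unsigned rank;    /* 2 ranks per DIMM x 2 DIMMs = 4 ranks/channel    */
    unsigned row;     /* remaining upper bits select the row             */
} dram_coord_t;

static dram_coord_t decode(uint64_t addr /* 36-bit, channel bits removed */)
{
    dram_coord_t c;
    addr >>= 5;                              /* 32 B line: low 5 bits are
                                                the within-line offset    */
    c.column = addr & 0x7F;  addr >>= 7;     /* low bits: column          */
    c.bank   = addr & 0x07;  addr >>= 3;     /* next: bank (interleave)   */
    c.rank   = addr & 0x03;  addr >>= 2;     /* then: rank (interleave)   */
    c.row    = (unsigned)addr;               /* rest: row                 */
    return c;
}
```

With bank and rank bits placed just above the line offset, sequential cache-line accesses rotate through banks and ranks, which is the effect the ACC exploits when it hides activate and precharge behind data transfers.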
III. MESSAGE PASSING
Shared memory coherency is maintained through software in an effort to eliminate the communication and hardware overhead required for a memory-coherent 2D-mesh. Inspired by software coherency models such as SHMEM [9], MPI, and OpenMP [10], the message passing protocol is based on one-sided put and get primitives that efficiently move data from the L1 cache of one core to the L1 cache of another [11]. As described earlier, the new Message Passing Memory Type (MPMT) is introduced in conjunction with the new INVDMB instruction as an architectural enhancement to optimize data sharing using these software procedures.


Fig. 5. Request (a) and Response (b) Header FLITs.

Fig. 6. Integrated memory controller block diagram.

The MPMT retains all the performance benefits of a conventional cache line but distinguishes itself by addressing non-coherent shared memory. The strict message passing protocol proceeds as follows (Fig. 7(a)): a core initiates a message write by first invalidating all message passing data cache lines. Next, when the core attempts to write data from a private address to a message passing address, a write miss occurs and the data is written to memory. Similarly, when a core reads a message, it begins by invalidating all message passing data cache lines. The read attempt then causes a read miss, and the data is fetched from memory.
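The put/get primitives can be sketched directly from this sequence. In the C sketch below, invdmb() stands in for the new INVDMB instruction (assumed to be exposed as a compiler intrinsic or assembly stub), and mpb_of() is a hypothetical helper that returns a core's share of the MPB, already mapped by the OS with the message-passing memory type; neither name comes from the paper.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper around the INVDMB instruction. */
extern void invdmb(void);

/* Hypothetical helper: base of the given core's message-passing
 * buffer, mapped with the MPMT page-table attribute. */
extern volatile uint8_t *mpb_of(int core);

/* put: move data from the caller's private memory into the target
 * core's message-passing buffer. */
void put(int target_core, size_t offset, const void *src, size_t len)
{
    invdmb();   /* invalidate all MPMT cache lines in one cycle        */
    /* The stores below now miss in L1 and are written out to the MPB. */
    memcpy((void *)(mpb_of(target_core) + offset), src, len);
}

/* get: move data from a core's message-passing buffer into the
 * caller's private memory. */
void get(void *dst, int source_core, size_t offset, size_t len)
{
    invdmb();   /* invalidate again so stale MPMT lines cannot hit     */
    /* The loads below miss and fetch the current MPB contents.        */
    memcpy(dst, (const void *)(mpb_of(source_core) + offset), len);
}
```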

The 16 KB MPB, found in each tile and used as on-die shared memory, further optimizes the design by decreasing the latency of shared memory accesses. Messages smaller than 16 KB see a 15x latency improvement when passed through the MPB rather than sent through main memory (Fig. 7(b)). However, messages larger than 16 KB lose this performance edge, since the MPB is completely filled and the remaining portion must be sent through main memory.
IV. DVFS
To maximize power savings, the processor is implemented using 8 Voltage Islands (VIs) and 28 Frequency Islands (FIs) (Fig. 8).

TABLE I
INTEGRATED MEMORY CONTROLLER FEATURES

Fig. 7. Message passing protocol (a) and message passing versus DDR3-800 (b).

Software-based power management protocols take advantage of the voltage/frequency islands through Dynamic Voltage and Frequency Scaling (DVFS). Two voltage islands supply the 2-D mesh and die periphery, with the remaining 6 voltage islands divided among the core area. The on-die VRC interfaces with two on-package voltage regulators [12] to scale the voltages of the 2-D mesh and core area dynamically from 0 V to 1.3 V in 6.25 mV steps. Since the VRC acts like any other 2-D mesh agent, it is addressable by all cores. Upon reception of a new voltage change command, the VRC and on-package voltage regulators respond in under a millisecond. VIs for idle cores can be set to 0.7 V, a safe voltage for state retention, or completely collapsed to 0 V if retention is unnecessary. Voltage level isolation and translation circuitry allow VIs with active cores to continue execution with no impact from collapsed VIs and provide a clean interface across voltage domains.
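A core-side sketch of a voltage request is shown below, assuming only what the text states: 6.25 mV steps between 0 V and 1.3 V and a VRC reachable as an ordinary mesh agent. The mesh_send() helper, the VRC address constants, and the command encoding are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper that packetizes and sends a request to a mesh
 * agent identified by its route and destination ID. */
extern void mesh_send(uint8_t route, uint8_t dest_id, uint32_t payload);

#define VRC_ROUTE    0x00    /* illustrative mesh address of the VRC    */
#define VRC_DEST_ID  0x07    /* illustrative agent ID within that node  */
#define VRC_STEP_UV  6250    /* 6.25 mV step size, in microvolts        */

/* Request a new supply level for one of the 8 voltage islands.
 * Returns the step code sent; completion takes on the order of 1 ms. */
uint32_t vrc_set_voltage(uint8_t island, uint32_t target_uv)
{
    if (target_uv > 1300000u)           /* clamp to the 1.3 V ceiling   */
        target_uv = 1300000u;
    uint32_t steps = target_uv / VRC_STEP_UV;
    mesh_send(VRC_ROUTE, VRC_DEST_ID, ((uint32_t)island << 16) | steps);
    return steps;
}
```

For example, a power-management thread parking an idle island at the 0.7 V retention level would call vrc_set_voltage(island, 700000).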

The processor's 28 FIs are divided as follows: one FI for each tile (24 total), one FI for the entire 2-D mesh, and the remaining three FIs for the system interface, VRC, and memory controllers, respectively. Similar to the VIs, all core-area FIs are dynamically adjustable to an integer division (up to 16) of the globally distributed clock. However, unlike voltage changes, the response time of frequency changes is significantly faster, around 20 ns when a 1 GHz clock is used. Thus, frequency changes are much more common than voltage changes in power-optimized software. Deterministic first-in-first-out (FIFO) based clock crossing units (CCFs) are used for synchronization across clocking domains [5]. Frequency-aware read pointers ensure that the same FIFO location is not read from and written to simultaneously. Embedded level shifters (Fig. 9) within the clock crossing unit handle voltage translation.
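Tile frequency selection therefore reduces to picking an integer divider of the global clock. The sketch below chooses the fastest divider (1 to 16) whose output does not exceed a requested tile frequency; the global clock frequency is passed in as a parameter rather than assumed.

```c
#include <stdint.h>

/* Return the smallest divider in 1..16 whose divided clock does not
 * exceed the requested frequency, i.e. the fastest legal setting at or
 * below the target; falls back to the slowest setting otherwise. */
static unsigned pick_divider(uint32_t global_khz, uint32_t target_khz)
{
    for (unsigned div = 1; div <= 16; div++) {
        if (global_khz / div <= target_khz)
            return div;
    }
    return 16;                          /* slowest available setting */
}

/* Example: with a 2 GHz global clock, a 1 GHz tile target picks div = 2
 * and a 500 MHz target picks div = 4. Unlike a voltage change (~1 ms),
 * applying a new divider takes on the order of 20 ns. */
```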


Fig. 8. Voltage/frequency islands, clock crossing FIFOs and clock gating.

Fig. 9. Voltage level translation circuit.

V. EXPERIMENTAL RESULTS & PROGRAMMING


The board and packaged processor used for evaluation and testing are shown in Fig. 10. The die is packaged in a 14-layer (5-4-5), 1567-pin LGA package with 970 signal pins, most of which are allocated to the 4 DDR3 channels. A standard Xeon server socket houses the package on the board. A standard PC running customized software interfaces with the processor and populates the DDR3 memory with a bootable OS for every core. After reset de-assertion, all 48 cores boot independent OS threads. Silicon has been validated to be fully functional.
The Lower-Upper symmetric Gauss-Seidel solver (LU) and Block Tri-diagonal solver (BT) benchmarks from the NAS Parallel Benchmarks [13] were successfully ported to the processor architecture with minimal effort. LU employs a symmetric successive over-relaxation scheme to solve regular, sparse, block 5 × 5 lower and upper triangular systems of equations. LU uses a pencil decomposition to assign a column block of a 3-dimensional discretization grid to each core. A 2-dimensional pipeline algorithm is used to propagate a wavefront communication pattern across the cores. BT solves multiple independent systems of block tri-diagonal equations with 5 × 5 blocks. BT decomposes the problem into a larger number of blocks that are distributed among the cores with a cyclic distribution. The communication patterns are regular and emphasize nearest-neighbor communication as the algorithm sweeps successively over each plane of blocks. It is important to note that these two benchmarks have distinctly different communication patterns. Results for runs on the processor with cores running at 533 MHz and the mesh at 1 GHz, for the benchmarks running on a 102 × 102 × 102 discretization grid, are shown in Fig. 11. As expected, the speedup is effectively linear across the range of problems studied.
Measured maximum frequency for both the core and the router as a function of voltage is shown in Fig. 12. Silicon measurements were taken while maintaining a constant case temperature of 50 °C using a 400 W external chiller. As the router voltage is scaled from 550 mV to 1.34 V, Fmax increases from 60 MHz to 2.6 GHz. Likewise, as the IA-core voltage is scaled from 730 mV to 1.32 V, Fmax increases from 300 MHz to 1.3 GHz. The offset between the two profiles is explained by the difference in design target points; the core was designed for 1 GHz operation at 1.1 V, while the router was designed for 2 GHz operation at the same voltage.
The chart shown in Fig. 13 illustrates the increase in total power as supply voltage is scaled. During data collection, the processor's operating frequency was always set to the Fmax of the supply voltage being measured. Thus, we see a cubic trend in power, since power is proportional to frequency times voltage squared. At 700 mV the processor consumes 25 W, while at 1.28 V it consumes 201 W. During nominal operation, a 1 GHz core and a 2 GHz router operate at 1.14 V while the processor consumes 125 W.
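The cubic trend follows from the standard dynamic-power relation: with the chip always run at the Fmax of the applied supply, and Fmax growing roughly in proportion to the supply over this range (Fig. 12), power scales approximately with the cube of the voltage; leakage and the not-quite-linear Fmax curve account for the remaining deviation.

```latex
P_{\mathrm{dyn}} \approx C_{\mathrm{eff}}\, V_{dd}^{2}\, f ,
\qquad
f_{\max} \propto V_{dd}
\;\;\Rightarrow\;\;
P_{\mathrm{dyn}} \propto V_{dd}^{3} .
```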


Fig. 10. Package and test board.

Fig. 13. Measured chip power versus supply.

Fig. 11. NAS parallel benchmark results with increasing core count.

at 1.14 V and consuming 125 W of power. Fig. 14 presents the


measured power breakdown at full power and low power operation. When the processor is dissipating 125 W, 69% of this
power, or 88 W, is attributed to the cores. When both voltage
and frequency are reduced and the processor dissipates 25 W,
only 21% of this power, or 5.1 W, is due to the cores. At this
low power operation we see the memory controllers becoming
the major power consumer, largely because the analog I/O voltages cannot be scaled due to the DDR3 spec.
VI. CONCLUSION

Fig. 12. Maximum frequency (Fmax) versus supply.


In this paper, we have presented a 48-core IA-32 processor in a 45 nm Hi-K CMOS process that utilizes a 2D-mesh network and 4 DDR3 channels. The processor uses a new message passing protocol and 384 KB of on-die shared memory for increased performance of core-to-core communication. It employs dynamic voltage and frequency scaling with 8 voltage islands and 28 frequency islands for power management. Silicon operates over a wide voltage and frequency range, from 0.7 V and 125 MHz up to 1.3 V and 1.3 GHz. Measured results show a power consumption of 125 W at 50 °C when operating under typical conditions, 1.14 V and 1 GHz. With active DVFS, measured power is reduced by 80% to 25 W at 50 °C. These results demonstrate the feasibility of many-core architectures and high-performance, energy-efficient computing in the near future.


Fig. 14. Measured full power and low power breakdowns.

ACKNOWLEDGMENT
The authors thank Yatin Hoskote, D. Finan, D. Jenkins,
H. Wilson, G. Schrom, F. Paillet, T. Jacob, S. Yada, S. Marella,
P. Salihundam, J. Lindemann, T. Apel, K. Henriss, T. Mattson,
J. Rattner, J. Schutz, M. Haycock, G. Taylor, and J. Held for
their leadership, encouragement, and support, and the entire
mask design team for chip layout.

Jason Howard is a senior technical research lead for the Advanced Microprocessor Research team within Intel Labs, Hillsboro, Oregon. During his time with Intel Labs, Howard has worked on projects ranging from high-performance, low-power digital building blocks to the 80-Tile TeraFLOPs NoC Processor. His research interests include alternative microprocessor architectures, energy-efficient design techniques, variation-aware and -tolerant circuitry, and exascale computing.
Jason Howard received the B.S. and M.S. degrees in electrical engineering from Brigham Young University, Provo, UT, in 1998 and 2000, respectively. He joined Intel Corporation in 2000. He has authored and co-authored several papers and has several patents issued and pending.

REFERENCES
[1] S. Vangal et al., "An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS," ISSCC Dig. Tech. Papers, pp. 98–99, Feb. 2007.
[2] G. Moore, "Cramming more components onto integrated circuits," Electronics, vol. 38, no. 8, Apr. 1965.
[3] K. Mistry et al., "A 45 nm logic technology with high-k + metal gate transistors, strained silicon, 9 Cu interconnect layers, 193 nm dry patterning, and 100% Pb-free packaging," IEDM Dig. Tech. Papers, Dec. 2007.
[4] J. Schutz, "A 3.3 V 0.6 µm BiCMOS superscalar microprocessor," ISSCC Dig. Tech. Papers, pp. 202–203, Feb. 1994.
[5] P. Salihundam et al., "A 2 Tb/s 6 × 4 mesh network with DVFS and 2.3 Tb/s/W router in 45 nm CMOS," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2010.
[6] Y. Tamir and H.-C. Chi, "Symmetric crossbar arbiters for VLSI communication switches," IEEE Trans. Parallel Distrib. Syst., vol. 4, no. 1, pp. 13–27, Jan. 1993.
[7] JEDEC Solid State Technology Association, DDR3 SDRAM Specification, JESD79-3B, Apr. 2008.
[8] R. Kumar and G. Hinton, "A family of 45 nm IA processors," ISSCC Dig. Tech. Papers, pp. 58–59, Feb. 2009.
[9] SHMEM Technical Note for C, Cray Research, Inc., SG-2516 2.3, 1994.
[10] L. Smith and M. Bull, "Development of mixed mode MPI/OpenMP applications," Scientific Programming, vol. 9, no. 2–3, pp. 83–98, 2001.
[11] T. Mattson et al., "The Intel 48-core single-chip cloud computer (SCC) processor: Programmer's view," in Int. Conf. High Performance Computing, 2010.
[12] G. Schrom, F. Paillet, and J. Hahn, "A 60 MHz 50 W fine-grain package integrated VR powering a CPU from 3.3 V," in Applied Power Electronics Conf., 2010.
[13] D. H. Bailey et al., "The NAS parallel benchmarks," Int. J. Supercomputer Applications, vol. 5, no. 3, pp. 63–73, 1991.


Saurabh Dighe received the M.S. degree in computer engineering from the University of Minnesota, Minneapolis, in 2003. He was with Intel Corporation, Santa Clara, working on front-end logic and validation methodologies for the Itanium processor and the Core processor design team. Currently, he is a member of the Advanced Microprocessor Research team at Intel Labs, Oregon, involved in the definition, implementation, and validation of future Tera-scale computing technologies such as the Intel Teraflops processor and the 48-Core IA-32 Message Passing Processor. His research interests are in the areas of energy-efficient computing and low-power, high-performance circuits.

Sriram R. Vangal (S'90–M'98) received the B.S. degree from Bangalore University, India, the M.S. degree from the University of Nebraska, Lincoln, and the Ph.D. degree from Linköping University, Sweden, all in electrical engineering. He is currently a Principal Research Scientist with Advanced Microprocessor Research, Intel Labs. Sriram was the technical lead for the advanced prototype team that designed the industry's first single-chip 80-core, sub-100 W Polaris TeraFLOPS processor (2006) and co-led the development of the 48-core Rock Creek prototype (2009). His research interests are in the areas of low-power high-performance circuits, power-aware computing, and NoC architectures. He has published 20 journal and conference papers and has 16 issued patents with 8 pending.


Gregory Ruhl (M'07) received the B.S. degree in computer engineering and the M.S. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, in 1998 and 1999, respectively.
He joined Intel Corporation, Hillsboro, OR, in 1999 as part of the Rotation Engineering Program, where he worked on the PCI-X I/O switch, Gigabit Ethernet validation, and clocking circuit and test automation research projects. After completing the REP program, he joined Intel's Circuits Research Lab, where he worked on design, research, and validation on a variety of topics ranging from SRAMs and signaling to terascale computing. In 2009, Greg became a part of the Microprocessor Research Lab within Intel Labs, where he has since been designing and working on tera- and exa-scale research silicon and near-threshold voltage computing projects.

Nitin Borkar received the M.Sc. degree in physics from the University of Bombay, India, in 1982, and the M.S.E.E. degree from Louisiana State University in 1985.
He joined Intel Corporation in Portland, OR, in 1986, where he worked on the design of the i960 family of embedded microcontrollers. In 1990, he joined the i486DX2 microprocessor design team and led the design and the performance verification program. After successful completion of the i486DX2 development, he worked on high-speed signaling technology for the Teraflop machine. He now leads the prototype design team in the Microprocessor & Programming Research Laboratory, developing novel technologies in the high-performance, low-power circuit area and applying them toward future computing and systems research.

Shailendra Jain received the B.Tech. degree in electronics engineering from Devi Ahilya Vishwavidyalaya, Indore, India, in 1999, and the M.Tech. degree in microelectronics and VLSI design from IIT Madras, India, in 2001. With Intel since 2004, he is currently a technical research lead at the Bangalore Design Lab of Intel Labs, Bangalore, India. His research interests include near-threshold-voltage digital circuit design, energy-efficient design techniques for TeraFLOPs NoC processors and floating-point arithmetic units, and many-core advanced rapid prototyping. He has co-authored ten papers in these areas.

Vasantha Erraguntla received the B.E. degree in electrical engineering from Osmania University, India, and the M.S. degree in computer engineering from the University of Louisiana. She joined Intel in 1991 to be a part of the Teraflop machine design team and worked on its high-speed router technology. Since June 2004, Vasantha has been heading the Intel Labs Bangalore Design Lab, facilitating the world's first programmable Terascale processor and the 48-IA-core Single-Chip Cloud Computer. Vasantha has co-authored over 13 IEEE journal and conference papers and holds 3 patents with 2 pending. She is a member of the IEEE. She served on the organizing committee of the 2008 and 2009 International Symposium on Low Power Electronics and Design (ISLPED) and on the Technical Program Committees of ISLPED 2007 and the Asia Solid-State Circuits Conference (A-SSCC) in 2008 and 2009. She is also a Technical Program Committee member for energy-efficient digital design for ISSCC 2010 and ISSCC 2011, and is serving on the Organizing Committee for the VLSI Design Conference 2011.


Michael Konow manages an engineering team working on research and prototyping of future processor architectures within Intel Labs, Braunschweig, Germany. During his time with Intel Labs, Michael led the development of the first in-socket FPGA prototype of an x86 processor and has worked on several FPGA and silicon prototyping projects. His research interests include future microprocessor architectures and FPGA prototyping technologies.
Michael received his diploma degree in electrical engineering from the University of Braunschweig in 1996. Since then he has worked on the development of integrated circuits for various companies and a wide range of applications. He joined Intel in 2000 and Intel Labs in 2005.

Michael Riepen is a senior technical research engineer in the Advanced Processor Prototyping team within Intel Labs, Braunschweig, Germany. During his time at Intel Labs, Michael has worked on FPGA prototyping of processor architectures as well as on efficient many-core pre-silicon validation environments. His research interests include exascale computing, many-core programmability, and efficient validation methodologies.
Michael Riepen received the Master of Computer Science degree from the University of Applied Sciences Wedel, Germany, in 1999. He joined Intel Corporation in 2000. He has authored and co-authored several papers and has several patents issued and pending.

Matthias Gries joined Intel Labs at Braunschweig, Germany, in 2007, where he is working on architectures and design methods for memory subsystems. Before that, he spent three years at Infineon Technologies in Munich, Germany, refining micro-architectures for network applications in the Corporate Research and Communication Solutions departments. He was a postdoctoral researcher at the University of California, Berkeley, in the Computer-Aided Design group, implementing design methods for application-specific programmable processors from 2002 to 2004. He received the Doctor of Technical Sciences degree from the Swiss Federal Institute of Technology (ETH) Zurich in 2001 and the Dipl.-Ing. degree in electrical engineering from the Technical University Hamburg-Harburg, Germany, in 1996.
His interests include architectures, methods, and tools for developing x86 platforms, resource management, and MP-SoCs.

Guido Droege received his Diploma in electrical engineering from the Technical University Braunschweig, Germany, in 1992 and the Ph.D. degree in 1997. His academic work focused on analog circuit design automation. After graduation, Droege worked at an ASIC company, Sican GmbH, and later at Infineon Technologies. He designed RF circuits for telecommunication and worked on MEMS technology for automotive applications. In 2001 he joined Intel Corporation, where he started with high-speed interface designs for optical communication. As part of Intel Labs he was responsible for the analog front end of several silicon prototype designs. Currently, he works in the area of high-bandwidth memory research.


Tor Lund-Larsen is the Engineering Manager for the Advanced Memory Research team within Intel Labs Germany in Braunschweig, Germany. During his time with Intel Labs, Tor has worked on projects ranging from FPGA-based prototypes for multi-radio and adaptive clocking to analog clocking concepts and memory controllers. His research interests include multi-level memory, computation-in-memory for many-core architectures, resiliency, and high-bandwidth memory.
Tor Lund-Larsen received the M.B.A. and M.S.E. degrees from the Engineering and Manufacturing Management program at the University of Washington, Seattle, in 1996. He joined Intel Corporation in 1997.

Sebastian Steibl is the Director of Intel Labs Braunschweig in Germany and leads a team of researchers and engineers developing technologies ranging from next-generation Intel CPU architectures, high-bandwidth memory and memory architectures, to emulation and FPGA many-core prototyping methodology. His research interests include on-die message passing and embedded many-core microprocessor architectures.
Sebastian Steibl holds a degree in electrical engineering from the Technical University of Braunschweig and holds three patents.

Shekhar Borkar is an Intel Fellow, an IEEE Fellow, and director of Exascale Technologies in Intel Labs. He received B.S. and M.S. degrees in physics from the University of Bombay, India, in 1979, and the M.S. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, in 1981, and then joined Intel Corporation, where he has worked on the design of the 8051 family of microcontrollers, the iWarp multi-computer, and high-speed signaling technology for Intel supercomputers. Shekhar's research interests are low-power, high-performance digital circuits and high-speed signaling. He has published over 100 articles and holds 60 patents.


Vivek K. De is an Intel Fellow and director of Circuit Technology Research in Intel Labs. In his current role, De provides strategic direction for future circuit technologies and is responsible for aligning Intel's circuit research with technology scaling challenges. De received his bachelor's degree in electrical engineering from the Indian Institute of Technology, Madras, India, in 1985 and his master's degree in electrical engineering from Duke University in 1986. He received a Ph.D. in electrical engineering from Rensselaer Polytechnic Institute in 1992. De has published more than 185 technical papers and holds 169 patents with 33 patents pending.

Rob Van Der Wijngaart is a senior software engineer in the Developer Products Division of Intel's Software and Services Group. At Intel he has worked on parallel programming projects ranging from benchmark development, programming model design and implementation, and algorithmic research to fine-grain power management. He developed the first application to break the TeraFLOP-on-a-chip barrier for the 80-Tile TeraFLOPs NoC Processor.
Van der Wijngaart received an M.S. degree in applied mathematics from Delft University of Technology, the Netherlands, and a Ph.D. degree in mechanical engineering from Stanford University, CA, in 1982 and 1989, respectively. He joined Intel in 2005. Before that he worked at NASA Ames Research Center for 12 years, focusing on high performance computing. He was one of the implementers of the NAS Parallel Benchmarks.
