1, JANUARY 2011
I. INTRODUCTION
Manuscript received April 15, 2010; revised July 16, 2010; accepted August
30, 2010. Date of publication November 09, 2010; date of current version December 27, 2010. This paper was approved by Guest Editor Tanay Karnik.
J. Howard, S. Dighe, S. Vangal, G. Ruhl, N. Borkar, S. Borkar, and
V. K. De are with Intel Corporation, Hillsboro, OR 97124 USA (e-mail:
jason.m.howard@intel.com).
S. Jain and V. Erraguntla are with Intel Labs, Bangalore, India.
M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, and S. Steibl
are with Intel Labs, Braunschweig, Germany.
R. Van Der Wijngaart is with Intel Labs, Santa Clara, CA.
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2010.2079450
A. Core Architecture
The core is an enhanced version of the second generation Pentium processor [4]. The L1 instruction and data caches have been
upsized to 16 KB, up from the previous 8 KB design, and support 4-way set associativity and both write-through and write-back modes for increased performance. Additionally, data cache
lines have been modified with a new status bit used to mark
the content of the cache line as Message Passing Memory Type
(MPMT). The MPMT is introduced to differentiate between
normal memory data and message passing data. The cache line's
MPMT bit is determined by page table information found in the
core's TLB and must be set up properly by the operating system.
The Pentium instruction set architecture has been extended to
include a new instruction, INVDMB, used to support software
managed coherency. When executed, an INVDMB instruction
invalidates all MPMT cache lines in a single clock cycle. Subsequent reads or writes to the MPMT cache lines are therefore guaranteed
to miss, and the data will be fetched or written. The instruction gives the programmer direct control of cache management
while in a message passing environment.
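The effect of INVDMB on the cache state can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class and method names (`CacheLine`, `L1DataCache`, `fill`, `hit`, `invdmb`) are hypothetical, the cache is modeled as a flat dictionary rather than a 4-way set-associative array, and no timing is modeled.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    tag: int
    data: bytes
    valid: bool = True
    mpmt: bool = False  # Message Passing Memory Type status bit


class L1DataCache:
    """Toy model (hypothetical): lines keyed by tag, no sets or ways."""

    def __init__(self):
        self.lines = {}

    def fill(self, tag, data, mpmt):
        # In hardware the MPMT bit comes from page table information in the TLB;
        # here the caller passes it in directly.
        self.lines[tag] = CacheLine(tag, data, valid=True, mpmt=mpmt)

    def hit(self, tag):
        line = self.lines.get(tag)
        return line is not None and line.valid

    def invdmb(self):
        # Invalidate every MPMT line; normal memory lines are left untouched.
        for line in self.lines.values():
            if line.mpmt:
                line.valid = False


cache = L1DataCache()
cache.fill(0x10, b"private", mpmt=False)  # normal memory data
cache.fill(0x20, b"message", mpmt=True)   # message passing data
cache.invdmb()
print(cache.hit(0x10), cache.hit(0x20))   # True False
```

After the invalidation, an access to the message passing line misses while the normal line still hits, which is the property software-managed coherency relies on.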
HOWARD et al.: A 48-CORE IA-32 PROCESSOR IN 45 NM CMOS USING ON-DIE MESSAGE-PASSING AND DVFS FOR PERFORMANCE AND POWER SCALING
cache write miss and L2 cache write hit does not write the cache
line back into the L1 cache.
High post-silicon visibility into the L2 cache was achieved
through a comprehensive scan approach. All L2 cache lines
were made scan addressable, including both tag and data arrays
and LRU status bits. A full self-test feature was also included
that allowed the L2 cache controller to write either random or
programmed data to all cache lines and then read the data back for comparison.
C. Router Architecture
The 5-port router [5] uses two 144-bit unidirectional links
to connect with 4 neighboring routers and one local port,
forming the 2-D mesh on-die network. Instead of the
wormhole routing used in earlier work [1], virtual cut-through
switching is used to reduce mesh latency. The router has 4 pipe stages (Fig. 3) and operates
at 2 GHz at 1.1 V. The first stage covers link
traversal for the incoming packet and the input buffer write.
Switch arbitration is done in the second stage, and the third and
fourth stages are the VC allocation and switch traversal stages,
respectively. Two message classes (MCs) and eight virtual
channels (VCs) ensure deadlock-free routing and maximize
bandwidth utilization. Two VCs are reserved: VC6 for request
MCs and VC7 for response MCs.
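To make packet movement between 5-port routers on a 2-D mesh concrete, the following minimal sketch applies dimension-ordered XY routing, in which the X offset is resolved completely before any hop is taken in Y. The port names, the `(x, y)` coordinate convention, and the function names `xy_next_port` and `xy_route` are assumptions made for illustration and are not taken from the router design.

```python
# Hypothetical port names and coordinate steps for a 5-port mesh router.
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}


def xy_next_port(cur, dst):
    """Return the output port for one hop of dimension-ordered XY routing."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                      # resolve the X dimension first
        return "EAST" if dx > cx else "WEST"
    if cy != dy:                      # then resolve the Y dimension
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"                    # arrived: deliver to the local port


def xy_route(src, dst):
    """Return the list of router coordinates visited from src to dst."""
    path = [src]
    cur = src
    while True:
        port = xy_next_port(cur, dst)
        if port == "LOCAL":
            return path
        sx, sy = STEP[port]
        cur = (cur[0] + sx, cur[1] + sy)
        path.append(cur)


print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because a packet never turns from the Y dimension back into X, the set of allowed turns contains no cycle, which is the usual argument for why this routing order avoids routing deadlock on a mesh.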
Dimension-ordered XY routing eliminates network deadlock,
and route pre-computation in the previous hop allows fast output
port identification on packet arrival. Input port and output port
arbitrations are done concurrently using a centralized conflict-free wrapped wave-front arbiter [6] formed using a 5 × 5 array
of asymmetric cells (Fig. 4). A cell with a row (column) token
that is unable to use the token passes the token to the right
(down), wrapping around at the end of the array. These tokens
TABLE I
INTEGRATED MEMORY CONTROLLER FEATURES
Fig. 7. Message passing protocol (a) and message passing versus DDR3-800 (b).
Fig. 11. NAS parallel benchmark results with increasing core count.
ACKNOWLEDGMENT
The authors thank Yatin Hoskote, D. Finan, D. Jenkins,
H. Wilson, G. Schrom, F. Paillet, T. Jacob, S. Yada, S. Marella,
P. Salihundam, J. Lindemann, T. Apel, K. Henriss, T. Mattson,
J. Rattner, J. Schutz, M. Haycock, G. Taylor, and J. Held for
their leadership, encouragement, and support, and the entire
mask design team for chip layout.
REFERENCES
[1] S. Vangal et al., "An 80-tile 1.28 TFLOPS network-on-chip in 65 nm
CMOS," ISSCC Dig. Tech. Papers, pp. 98–99, Feb. 2007.
[2] G. Moore, "Cramming more components onto integrated circuits,"
Electronics, vol. 38, no. 8, Apr. 1965.
[3] K. Mistry et al., "A 45 nm logic technology with high-k + metal
gate transistors, strained silicon, 9 Cu interconnect layers, 193 nm dry
patterning, and 100% Pb-free packaging," IEDM Dig. Tech. Papers,
Dec. 2007.
[4] J. Schutz, "A 3.3 V 0.6 µm BiCMOS superscalar microprocessor,"
ISSCC Dig. Tech. Papers, pp. 202–203, Feb. 1994.
[5] P. Salihundam et al., "A 2 Tb/s 6 × 4 mesh network with DVFS and
2.3 Tb/s/W router in 45 nm CMOS," in Symp. VLSI Circuits Dig. Tech.
Papers, Jun. 2010.
[6] Y. Tamir and H.-C. Chi, "Symmetric crossbar arbiters for VLSI communication switches," IEEE Trans. Parallel Distrib. Syst., vol. 4, no. 1,
pp. 13–27, Jan. 1993.
[7] JEDEC, Solid State Technology Association: DDR3 SDRAM Specification, Apr. 2008, JESD79-3B.
[8] R. Kumar and G. Hinton, "A family of 45 nm IA processors," ISSCC
Dig. Tech. Papers, pp. 58–59, Feb. 2009.
[9] SHMEM Technical Note for C, Cray Research, Inc., 1994, SG-2516
2.3.
[10] L. Smith and M. Bull, "Development of hybrid mode MPI/OpenMP
applications," Scientific Programming, vol. 9, no. 2–3, pp. 83–98, 2001.
[11] T. Mattson et al., "The Intel 48-core single-chip cloud computer (SCC)
processor: Programmer's view," in Int. Conf. High Performance Computing, 2010.
[12] G. Schrom, F. Paillet, and J. Hahn, "A 60 MHz 50 W fine-grain package
integrated VR powering a CPU from 3.3 V," in Applied Power Electronics Conf., 2010.
[13] D. H. Bailey et al., "The NAS parallel benchmarks," Int. J. Supercomputer Applications, vol. 5, no. 3, pp. 63–73, 1991.
Saurabh Dighe received his MS degree in Computer Engineering from the University of Minnesota,
Minneapolis in 2003. He was with Intel Corporation, Santa Clara, working on front end logic and
validation methodologies for the Itanium processor
and the Core processor design team. Currently
he is a member of the Advanced Microprocessor
Research team at Intel Labs, Oregon, involved in the
definition, implementation and validation of future
Tera-scale computing technologies like the Intel
Teraflops processor and 48-Core IA-32 Message
Passing Processor. His research interests are in the area of energy efficient
computing and low power high performance circuits.
Guido Droege received his Diploma in electrical engineering from Technical University Braunschweig,
Germany in 1992 and the Ph.D. degree in 1997. His
academic work focused on analog circuit design automation. After graduation, Droege worked at an ASIC
company, Sican GmbH, and later at Infineon Technologies. He designed RF circuits for telecommunication
and worked on MEMS technology for automotive.
In 2001 he joined Intel Corporation where he started
with high-speed interface designs for optical communication. As part of Intel Labs he was responsible for
the analog frontend of several Silicon prototype designs. Currently, he works in
the area of high-bandwidth memory research.
Sebastian Steibl is the Director of Intel Labs Braunschweig in Germany and leads a team of researchers
and engineers in developing technologies ranging
from the next generation Intel CPU architectures,
high bandwidth memory and memory architectures
to emulation and FPGA many-core prototyping
methodology. His research interests include on-die
message passing and embedded many-core microprocessor architectures.
Sebastian Steibl has a Degree in Electrical Engineering from Technical University of Braunschweig
and holds three patents.
Vivek K. De is an Intel Fellow and director of Circuit Technology Research in Intel Labs. In his current
role, De provides strategic direction for future circuit
technologies and is responsible for aligning Intel's
circuit research with technology scaling challenges.
De received his bachelor's degree in electrical engineering from the Indian Institute of Technology in
Madras, India in 1985 and his master's degree in electrical engineering from Duke University in 1986. He
received a Ph.D. in electrical engineering from Rensselaer Polytechnic Institute in 1992. De has published
more than 185 technical papers and holds 169 patents with 33 patents pending.