You are on page 1of 13

MICROPROCESSOR

REPORT www.MPRonline.com

T H E I N S I D E R ’ S G U I D E T O M I C R O P R O C E S S O R H A R D WA R E

SPEC CPU2006 BENCHMARK SUITE


The Newest Iteration of a Widely Used Measure of Processor Performance
By Har lan McGhan {10/10/06-02}

On August 24, 2006, the nonprofit Standard Performance Evaluation Corporation (SPEC)
announced the release of its new SPEC CPU2006 benchmark suite. This is the fifth iteration
in a line of SPEC CPU benchmark suites that began in 1989 and continued with new versions

in 1992 (CPU92), 1995 (CPU95), and 2000 (CPU2000). The Nature of the SPEC CPU Benchmark Suite
roughly five-year lifetime of a SPEC CPU benchmark suite now SPEC was an industrywide response to an industrywide
seems to be established policy, which is to say that we can expect problem that reached a crisis point during the “processor
CPU2006 to be the standard from now through at least 2011. wars” of the late 1980s: namely, the lack of any generally
Benchmarks tend to arouse strong passions among acceptable means of comparing processor performance
those with a professional (or monetary) interest in com- across widely different instruction-set architectures (ISAs—
puter performance. Many veterans of the performance wars see sidebar, “The Problem With MIPS”). To solve this prob-
believe that benchmarks tend to conceal and distort far lem, the leading companies of the day (including DEC, HP,
more than they ever reveal or reflect. (See MPR 6/26/00-01, IBM, Intel, MIPS Technologies, Motorola, SGI, Sun) banded
“Benchmarks Are Bunk.”) Fortunately, it is not necessary to together to create the SPEC consortium. SPEC’s goal was to
first agree on the merits of benchmarking before agreeing develop a robust cross-platform suite of representative test
that the SPEC CPU benchmark suite is important. programs, to create methods for converting this test suite
SPEC CPU is by far the most widely used measure of into meaningful scores, and to establish fair rules for run-
processor performance in history. The SPEC CPU bench- ning benchmark programs and reporting benchmark scores.
mark programs themselves are among the most extensively The focus of SPEC CPU is on “component,” specifically
studied programs in the computer industry, the primary processor (CPU), performance rather than “whole system”
target of many processor architects, compiler writers, com- performance. (The capital C prefixed to the names of SPEC
petitive analysts, and academic researchers—to name only CPU tests, e.g., Cint2006 and Cfp2006, stands for compo-
some vitally interested parties. A new iteration of the SPEC nent.) One charge sometimes lodged against SPEC CPU
CPU benchmark suite, therefore, carries weight. It is sure to scores, therefore, can be readily laid to rest: namely, the objec-
affect the entire industry—for good or ill—in many ways. tion that SPEC CPU scores measure only some aspects of a
Given the length of time that has elapsed since its predeces- system’s performance. (And—for anyone who has ever been
sor suite was released (see MPR 4/17/00-02, “SPEC stuck on the wrong side of a slow disk drive or a slow network
CPU2000 Released”) and the significant changes that have connection—by no means the most important aspect.) This
occurred in the processor landscape since 2000, this latest objection is absolutely true and also completely irrelevant.
version of the SPEC CPU benchmark suite merits careful SPEC CPU is intended to measure only one aspect of com-
consideration. (See sidebar on “Price and Availability of puter performance: specifically, basic processor “horsepower.”
SPEC CPU2006” for product details.) Admittedly, many other aspects of a system also contribute to

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


2 SPEC CPU2006 Benchmark Suite

the overall performance that system delivers. Indeed, over must be written in a “neutral” way (favoring no special fea-
the years, SPEC has formed a variety of committees to tures of any particular ISA or microarchitecture), using a
develop generally acceptable benchmarks for other impor- “standard” programming language (i.e., one for which good
tant aspects of computer performance, including graphics, compilers readily exist on all platforms). To date, this means
as well as for various specialized types of “whole” systems, that SPEC programs have been limited to just C, C++, and
including Java client servers, mail servers, web servers, and (for floating-point-intensive programs) Fortran (both For-
network file servers. tran77 and Fortran90 have been used). Second, to test all and
None of these other tests, however, tells us anything only the relevant parts of the memory hierarchy, the bench-
(much) about processor performance, just as SPEC CPU mark program must compile to a working set too large to fit
tells us nothing (much) about graphics or disk or network into any available (on- or off-chip) cache but small enough to
I/O performance. If processor performance is your interest, fit into a typical main memory.
you need a spotlight to pick it out of the general landscape.
The specific intention of SPEC CPU is to provide that spot- Why a New SPEC CPU Benchmark Suite?
light (leaving all else in shadow). The fact that the typical size of both cache and main mem-
From this perspective, it is regrettable that SPEC CPU ory for processors continually increases is one important
is not even more specific than it is or can be. To support reason that the SPEC CPU benchmark suite must be revised
SPEC CPU’s cross-platform requirement, programs in the periodically. For example, when SPEC CPU2000 was intro-
SPEC benchmark suite have to be supplied in source form. duced, the size of a typical on-chip cache was 256KB (or
Consequently, the minimum component testable by these less), the largest off-chip cache was 8MB, and a reasonable
programs on any given machine includes the compiler that minimum size for main memory was 256MB. In 2006, typi-
converts these programs into (more or less) efficient cal on-chip caches range up to 4MB, the largest off-chip
machine code. Also, since few realistic programs fit entirely cache is 36MB, and a reasonable minimum main memory
into a processor’s on-chip cache memory (which also varies size is 1GB. Programs that once fairly tested the largest mem-
greatly in size and configuration from processor to proces- ory hierarchies of 2000 are now too small to realistically
sor), the SPEC committee broadened its definition of the exercise normal memory hierarchies—hence, the debut of
basic processor component to include both off-chip cache SPEC CPU 2006, replacing SPEC CPU 2000.
and main memory (but to exclude disk drives, network stor- Just as the size of typical memories increases over time,
age, and all other I/O operations). so does the power of typical processors. In 2000, the fastest
This notion of the “processor” component (encompass- processor was Intel’s just-introduced Pentium 4 at 1.5GHz.
ing everything from the compiler to main memory) sets the In 2006, Intel’s latest Pentium 4 Extreme Edition operates at
limits for a suitable SPEC CPU benchmark program. First, it 3.73GHz (with the company’s even more powerful next-
generation Core 2 Extreme processor just coming
into use). SPEC CPU2000 programs that once took
800 hundreds of seconds to execute on a 1.5GHz Pen-
700 41.67MHz POWER 4164 tium 4 now execute in a few tens of seconds on
64K/8K I/D Caches these newer machines. With such short run times,
SPEC Performance Ratio

600 64MB Memory


AIX 3.1.5 Operating System results are increasingly perturbed by minor fluctu-
500 ations in the overall system state or measurement
circumstances. In short, many older programs sim-
400
ply don’t run long enough to draw accurate con-
300 clusions from them. Correcting this problem
requires not only bigger programs but more com-
200
plex programs, as well as programs with much
100 larger working sets that take far longer to complete.
0 Improvements in compiler technology also
gcc espresso spice doduc nasa7 li eqntott matrix300 fpppp tomcatv can obsolete a particular benchmark. The first and
Benchmark still one of the most famous examples of this type
of problem occurred with the matrix300 compo-
Figure 1. 1991 SPEC CPU89 results for the IBM PowerStation 550 (based on the nent program in the original 1989 benchmark
41.67MHz POWER 4164 processor) with and without enhancing compilers. suite. In 1990, it made little difference whether a
Although use of a geometrical mean average, used for computing overall SPECmarks,
smooths out disturbances caused by atypical results on individual component pro- SPECmark was computed with or without this
grams, even a geometric mean couldn’t average away the effect of the extreme component program. A year later, it made all the
increase in matrix300 performance caused by newfound ways to optimized this com- difference in the world. For example, in 1990, the
ponent benchmark. The SPEC committee dropped matrix300 from the 1992 test
suite. The nasa7 component, however, survived until the 1995 test suite (despite the 30MHz IBM POWER 3064 achieved a SPECmark
ability of optimizing compilers to significantly enhance its performance, too). of 34.7; recomputing this value omitting

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


SPEC CPU2006 Benchmark Suite 3

matrix300 resulted in a slightly higher overall score of 35.7. scrutiny of every component program in each successive iter-
Just a year later, in 1991, the 41.67MHz POWER 4164 ation of this benchmark suite. Both hardware designers and
achieved a SPECmark of 72.2; recomputing this value omit- compiler writers measure their efforts to improve successive
ting matrix300 resulted in a dramatically lower score of 55.9. generations of their respective products by the degree to
So how much did the performance of IBM’s POWER which they raise SPEC benchmark scores. Occasionally, as
processor line improve in 1991 from 1990? The matrix300 with matrix300, they are inspired to find a new way of attack-
program looms large in the answer. Counting it, the improve- ing a formerly intractable performance issue. While the long-
ment was over 108%. Excluding it, the improvement was less term effect of this unrelenting attack on knotty performance
than 57%. The explanation of this phenomenon is that, when issues is generally positive, constantly adding new and refined
introduced in 1989, matrix300 was (by intention) a cache software and hardware technologies to the overall perform-
“thrasher” program, requiring constant trips to main mem- ance arsenal, the short-term effect on the SPEC benchmark
ory. Hence, it ran poorly, meaning that omitting this compo- results themselves can be both disruptive and misleading—
nent test was likely to improve a processor’s overall SPECmark until the suite is revised to eliminate the source of the distur-
score. In 1991, compiler writers (or, more precisely, compiler bance. (For another example of benchmark obsolescence due
preprocessor writers) discovered how to “bust” this program, to compiler optimization, see the discussion of the 023.eqntott
devising a scheme for organizing its data references so that the component in the SPEC CPU92 suite at www.spec.org/
program ran almost entirely in cache. The result was like osg/news/articles/news9512/eqntott.html.)
throwing a switch. Scores on this one component program Still another important reason to periodically refresh
suddenly rocketed through the roof, taking the machine’s the SPEC CPU benchmark suite is the fact that typical
entire SPECmark score up with them. For an illustration of workloads shift over time, with new types of computations
the dramatic difference caused by introducing this new opti- coming to the foreground. The SPEC CPU2000 benchmark
mization, see Figure 1. suite put more emphasis on 3D modeling, image recogni-
Although an extreme example of the effect of compiler tion, and data compression than did the SPEC CPU95
(or precompiler) optimization, matrix300 is by no means an benchmark suite. The new SPEC CPU2006 suite adds video
isolated example. An unintended consequence of the rise of compression and speech recognition to the mix and greatly
the SPEC CPU benchmark suite to prominence, as the central expands the emphasis placed on molecular biology among
focus of the processor performance wars, has been intense the scientific (floating-point-intensive) tasks. See Table 1

CPU2000 CPU2006
Benchmark Description Integer Lng RT Integer Lng RT
GNU C compiler 176.gcc C 1,100 403.gcc C 8,050
Manipulates strings & prime numbers in Perl language 253.perlbmk C 1,800 400.perlbench C 9,766
Minimum cost network flow solver (combinatorial optimization) 181.mcf C 1,800 429.mcf C 9,120
Data compression utility 256.bzip2 C 1,500 401.bzip2 C 9,644
Data compression utility 164.gzip C 1,400
Video compression & decompression 464.h264ref C 22,235
Artificial intelligence, plays game of Chess 186.crafty C 1,000 458.sjeng C 12,141
Artificial intelligence, plays game of Go 445.gobmk C 10,489
Artificial intelligence used in games for finding 2D paths across terrains 473.astar C++ 7,017
Natural language processing 197.parser C 1,800
XML processing 483.xalancbmk C++ 6,869
FPGA circuit placement and routing 175.vpr C 1,400
EDA place and route simulator 300.twolf C 3,000
Search gene sequence 456.hmmer C 9,333
Ray tracing 252.eon C++ 1,300
Computational group theory 254.gap C 1,100
Database program 255.vortex C 1,900
Library for simulating a quantum computer 462.libquantum C 20,704
Discrete event simulation 471.omnetpp C++ 6,270
hours 5.3 19,100 hours 36.6 131,638

Table 1. Changes in SPEC CPU benchmark integer programs between CPU2000 and CPU2006. The reference time (RT) for each component is meas-
ured on a reference machine. (A similar, but not identical, near-300MHz UltraSPARC II generation processor is used for both suites.) Although four com-
ponents have been retained from the 2000 suite, the run time for all programs has dramatically increased in CPU 2006. Data compression has been
deemphasized, while video compression and decompression have been added to the test suite. More emphasis is placed on artificial intelligence (includ-
ing a greatly improved chess-playing program). Natural language processing has been eliminated in favor of extended markup language (XML) pro-
cessing. The former emphasis on electronic design-automation programs is gone, as are former programs for ray tracing, computational group theory,
and database operations. In their place is a new emphasis on gene sequencing, quantum computing, and discrete event simulation. More emphasis is
also placed on C++ programming.

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


4 SPEC CPU2006 Benchmark Suite

and Table 2 for a detailed comparison of the component This particular objection, of course, can be met by sub-
programs in the 2000 and 2006 SPEC CPU integer and setting the benchmark suite, looking just at those programs
floating-point suites, respectively. that actually do reflect the types of applications run at a given
site. Thus, a site doing fluid dynamics or quantum chemistry
Is SPEC CPU2006 Well Balanced? might look primarily at the specific results reported on
The most usual objection raised to any new version of the SPEC benchmark programs in those particular areas, giving weight
CPU benchmark suite is that some key types of programs are to other components only to the degree that they mirror its
over-represented, while others are under-represented, with cor- own computational needs.
responding skews in the numbers reported. Although the num- It is more difficult to meet the objection that some areas
ber of programs included in the SPEC CPU benchmark suite of interest are entirely ignored by the programs included in the
has gradually crept up over the years (from 10 in the original suite. For example, critics complained the CPU2000 suite
CPU89 suite to 18 in CPU95 to 26 in CPU2000 to 29 in the new dropped JPEG compression (discrete-cosine transformations),
CPU2006 suite), there is no way to ever incorporate enough eliminated single-precision floating-point operations, ignored
programs to entirely eliminate this objection. Certainly, the spe- advanced compression algorithms (e.g., fractals and wavelets),
cific mix of programs in CPU2006 (or any past or probable and omitted speech-recognition technologies. Furthermore,
future version of the suite) is unlikely to represent the mix of they argued against the general orientation of the SPEC suite,
programs run at any actual computing site. For example, there leaning as it does toward limited-interest, heavy-duty scientific
are very few Go-playing, gene-sequencing, weather-modeling, applications (run, say, at a site like Brookhaven National Labs
astrophysical quantum chemistry sites. for investigations into quark coloring [chromodynamics] or

CPU2000 CPU2006
Benchmark Description Floating Pnt Lng RTime Floating Pnt Lng RTime
Weather prediction, shallow water model 171.swim F77 3,100
Velocity & distribution of pollutants based on temperature, wind 301.apsi F77 2,600
Weather modeling (30km area over 2 days) 481.wrf C/F 11,215
Physics, particle accelerator model 200.sixtrack F77 1,100
Parabolic/elliptic partial differential equations 173.applu F77 2,100
Multi-grid solver in 3D potential field 172.mgrid F77 1,800
General relativity, solves Einstein evolution equations 436.cactusADM C/F 11,927
Computational electromagnetics (solves Maxwell equations in 3D) 459.GemsFDTD F 10,583
Quantum chromodynamics 168.wupwise F77 1,600
Quantum chromodynamics, gauge field generation with dynamical quarks 433.milc C 9,180
Fluid dynamics, analysis of oscillatory instability 178.galgel F90 2,900
Fluid dynamics, computes 3D transonic transient laminar viscous flow 410.bwaves F 13,592
Computational fluid dynamics for simulation of astrophysical phenomena 434.zeusmp F 9,096
Fluid dynamics, large eddy simulations with linear-eddy model in 3D 437.leslie3d F 9,358
Fluid dynamics, simulates incompressible fluids in 3D 470.lbm C 13,718
Molecular dynamics (simulations based on newtonian equations of motion) 435.gromacs C/F 7,132
Biomolecular dynamics, simulates large system with 92,224 atoms 444.namd C++ 8,018
Computational chemistry 188.ammp C 2,200
Quantum chemistry package (object-oriented design) 465.tonto F 9,822
Quantum chemistry, wide range of self-consistent field calculations 416.gamess F 19,575
Computer vision, face recognition 187.facerec F90 1,900
Speech recognition system 482.sphinx3 C 19,528
3D graphics library 177.mesa C 1,400
Neural network simulation (adaptive resonance theory) 179.art C 2,600
Earthquake modeling (finite element simulation) 183.equake C 1,300
Crash modeling (finite element simulation) 191.fma3d F90 2,100
Number theory (testing for primes) 189.lucas F90 2,000
Structural mechanics (finite elements for linear & nonlinear 3D structures) 454.calculix C/F 8,250
Finite element analysis (program library) 447.dealII C++ 11,486
Linear programming optimization (railroad planning, airlift models) 450.soplex C++ 8,338
Image ray tracing (400x400 anti-aliased image with abstract objects) 453.povray C++ 5,346
hours 8.0 28,700 hours 52 186,164
hours 13.3 47,810 hours 88.3 317,802

Table 2. Changes in SPEC CPU benchmark floating-point programs between CPU2000 and CPU2006. Again, note the dramatic sevenfold increase in run
time of the suite on the (very similar) reference machine. CPU2006 marks the first-ever total revision of a SPEC benchmark suite from one generation to
the next. Although meteorology, physics, fluid dynamics, and chemistry have all been retained as areas of interest, CPU2006 provides new, larger, and
(often) more programs in each field, as well as new programs in the area of molecular dynamics. Face recognition has been eliminated in favor of speech
recognition. Miscellaneous graphics, modeling, and number theory programs are replaced by engineering, route planning, and ray tracing programs. The
variety of languages represented has also been substantially increased, including four programs written in a mixture of C and Fortran.

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


SPEC CPU2006 Benchmark Suite 5

big bang theory) and away from far more widely relevant level. In truth, rather than too few programs, the several SPEC
PC-style applications (involving complex floating-point CPU suites have each tended to contain too many programs:
calculations, advanced game playing, positional 3D sound, that is, they invariably comprehend redundant programs
and so on). that add little or nothing to the mixture of operations repre-
To an extent, CPU2006 repairs some of these omissions sented and whose inclusion or exclusion makes little or no
in CPU2000. The SPEC committee included both video- difference to the overall score achieved.
compression/decompression and speech-recognition pro- To the extent that the greatest error of a SPEC CPU
grams, throwing in XML (extended markup language) process- benchmark suite is plain redundancy, it is at least a relatively
ing for good measure. It also put considerably more emphasis harmless flaw. To be sure, running redundant programs adds
on artificial intelligence for game playing. But at the same time, time and cost to the SPEC CPU measurements. However, this
it entirely eliminated EDA (electronic design automation) pro- is a relatively small price to pay for completeness of coverage.
grams. The emphasis placed on advanced physics and other Far better to err, as the SPEC committee has consistently
heavy-duty scientific applications is now even greater than it done, on the side of over-representing typical workloads than
was in CPU2000, with eleven programs where before there were risk under-representing the range of tasks computers are
only six. Moreover, as measured by the reference machine, called upon to perform.
these eleven CPU2006 programs (in physics, fluid dynamics,
molecular dynamics, and quantum chemistry) represent more The SPEC CPU Measures
than 65% of the floating-point suite’s total run time. In Although, for all the reasons enumerated above, the basic
CPU2000, the corresponding six programs took only 40% of thing SPEC CPU must periodically upgrade is the bench-
the floating-point suite’s total run time. mark suite itself, both the associated measures and their run
and reporting rules have evolved over time in an effort to bet-
Two Defenses of SPEC CPU2006 Balance ter capture CPU performance. In 1989, SPEC introduced an
There are two possible defenses to the objection that CPU2006 initial suite of ten programs, including six heavy in floating-
is an unbalanced mix of programs (an objection, to repeat, that point operations. The result of running the ten programs in
is doubtlessly unavoidable with any relatively small group of this benchmark suite on a processor was captured in a single
programs). The first defense is to argue simply that any “SPECmark” number of merit.
machine powerful enough to handle the needs of the National SPECmarks were relative measures of performance based
Laboratory at Brookhaven (Sandia, Oak Ridge, Los Alamos, on a reference machine. For the CPU89 benchmark suite, the
Lawrence Livermore, etc.) is probably powerful enough to SPEC committee selected the VAX 11/780 as the reference
handle just about any other conceivable type of need. machine. (The VAX 11/780 was the traditional “1mips” refer-
The second defense involves shifting the focus away ence machine used to compute “Dhrystone mips”—see side-
from the application level to a more basic level of computer bar on “The Problem With mips.”) By definition, the number
operation. From the standpoint of basic machine code, not of seconds a VAX 11/780 took to execute each of the ten pro-
that many different types of tasks are performed by comput- grams in the CPU89 suite set the base reference score for each
ers. All programs (regardless of application type) consist of a program. The number of seconds any other machine took to
mixture of loads and stores, interspersed with sequences of execute the same program was then divided into its VAX refer-
calculations interrupted by branches. At bottom, the calcula- ence time. This generated a ratio, indicating how much faster
tions themselves reduce to combinations of a relatively few (or slower) than the reference machine the machine being
fundamental operations: addition, subtraction, compares, tested was at executing the given task. The individual ratios for
logical AND, OR, NOT, and so forth—the exact number and all ten scores were then averaged to create an overall ratio,
nature of the available operators depending on a particular reported as the SPECmark for the tested machine.
machine’s specific ISA. How well any program performs, For example, in 1989, a 33MHz Motorola 68030 system
then, depends on the number and nature of the loads, stores, achieved a measured SPECmark of 4.0, meaning it was
and branches it involves, as well as the specific combinations exactly four times faster than the reference VAX 11/780
of basic arithmetical and logical operations that it performs machine (on average, across the ten programs in the SPEC
(for example, the extent to which these various activities con- CPU89 benchmark suite). The same year, a MIPS R3000 at
ceal or reveal exploitable instruction-level parallelism). 25MHz measured at 17.6 SPECmark. Since both the MIPS
Over the years, numerous studies have been conducted R3000 and the Motorola 68030 were compared with the same
of the various SPEC CPU benchmark suites at this more common standard, it also was possible to relate these two very
basic computer science level. All these studies reach a similar different types of processors to each other, with the solid con-
conclusion: namely, that the same overall results can be clusion that (at least, when executing the set of ten programs
obtained by using just a subset of the programs actually in the SPEC CPU benchmark suite) the R3000 was 17.6/4.0 =
included in the suite. In other words, when analyzed at a fun- 4.4 times faster than the 68030.
damental level, programs in the suite tend to be less dissim- In short, for the first time, the SPECmark provided a
ilar from each other than may first appear at the application level playing field specifically designed to allow meaningful,

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


6 SPEC CPU2006 Benchmark Suite

direct comparisons of performance across processor architec- into a single number of merit was provided. Instead, the
tures. The benchmark suite itself encompassed a reasonable notion of an overall SPECmark score was simply dropped.
sample of different, challenging computational tasks. The The 1992 SPEC committee also totally revamped the
same public rules governed running these benchmark pro- means of calculating SPECthruput, renaming the reformed
grams and reporting their results. Equally important, it had measure SPECrate. Since there were now two separate and
been agreed to by all the major players involved in shaping distinct speed measures, one for integer programs and one
processor performance. Consequently, the SPECmark for floating-point programs, two correspondingly separate
quickly established itself as the single number of merit to use and distinct integer and floating-point SPECrates were
when comparing processors (replacing all earlier mips-based defined: Cint_rate92 and Cfp_rate92.
performance metrics). The 1995 revision of the SPEC CPU benchmark suite
In 1990, the SPEC committee introduced a second brought one final refinement in SPEC score reporting. To
type of performance measurement for multiprocessor sys- counteract the effect of increasingly elaborate (and highly
tems, originally termed SPECthruput. Unlike traditional specific) optimizations being invoked during compilation of
speed metrics (including SPECmark), SPECthruput is a each benchmark program, the 1995 SPEC committee defined
measure of throughput performance. This difference is fun- a new required “base” metric for each of the four SPEC CPU
damental. Speed metrics measure the time it takes to com- measures, making a total of eight SPEC CPU95 metrics in all.
plete a given unit of work—with shorter times awarded Although SPEC reporting rules always required disclosure of
higher marks. (Alternatively, speed can be defined as a all the options set to compile each of the benchmark pro-
measure of the time it takes to cover a given distance—in a grams, the new base metrics imposed strict limits on the
SPEC benchmark program, specifically, the time it takes to number of compiler flags that could be used and required
get from first to last instruction.) By contrast, a throughput that every program in the given suite be compiled with the
measure computes scores based on the total amount of same (restricted) set of flags, invoked in the same order.
work done in a given unit of time (with scores scaling in The idea behind the new restricted base metrics was to
direct proportion to the amount of work done, rather than represent the sort of performance ordinary users might
inversely with the amount of time needed to do the work). achieve on a tested processor—users who simply wanted
For example, a sports car carrying two passengers their programs to compile correctly with a minimum of fuss
might cover a mile in 18 seconds, achieving a speed mark of and were willing to accept whatever performance fell out of
200 miles per hour and a throughput mark of 400 passenger the compiler, processor, and its associated memory hierar-
miles per hour. A bus carrying 60 passengers might cover the chy in consequence. The old unrestricted metrics would
same mile in 60 seconds, achieving a speed mark of 60 miles continue to represent the sort of performance power users
per hour and a throughput mark of 3,600 passenger miles might achieve on a tested processor—users who were will-
per hour. The point is, a sports car and a bus are fundamen- ing to experiment extensively with different combinations
tally different types of vehicles and need to be judged by of the available compiler optimizations, in an effort to
fundamentally different types of metrics. (A third type of wring the last bit of performance out of a program com-
metric, only now coming into focus for processors—SPEC piled to run on their target processor.
expects to introduce its first metric in this area early in Of the two types of measures, SPEC requires vendors to
2007—is efficiency. In the present example, efficiency might report only base scores. Reports of peak optimized scores (in
be measured by something like the amount of fuel burned addition) are optional. Although the restrictions placed on
per passenger mile.) base scores are not foolproof (they can be gotten around, for
Although the intention behind both SPECmark and example, by building advanced optimizations into the default
SPECthruput was good, and the basic idea behind each settings for a compiler), they are clearly an improvement over
sound, both measures were reformed and renamed in the having just peak scores reported. Partly because peak scores
second, 1992, iteration of the SPEC CPU benchmark suite. tend to be less consistent, and partly because they tend to be
The single mix of integer and floating-point programs in less representative of “real world” performance, many organ-
the original benchmark suite was replaced with two separate izations (including Microprocessor Report) list only base
suites of programs, one composed of six general integer scores when reporting SPEC CPU numbers. The item that
programs (with little or no floating-point content) and the follows illustrates the sorts of differences found in the num-
other composed of 14 scientific and technical programs ber and complexity of the flags used for base and peak com-
(heavy in floating-point content). piler optimization, and Table 3 summarizes the changes in
Corresponding to this bifurcation in the structure of the SPEC benchmark metrics over the years.
the SPEC benchmark suite, the single SPECmark measure of
merit was replaced by corresponding separate SPEC Cint92 IBM Corporation report on IBM System p5 595
and Cfp92 measures of merit. Although Cint92 and Cfp92 (2,300MHz, 1 CPU), August 2006
were both calculated in exactly the same way that SPECmarks Base Fortran compiler flags (all Fortran programs):
were computed, no method of combining these two scores -O5 -lhmu -blpdata –lmass

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


SPEC CPU2006 Benchmark Suite 7

Peak Fortran compiler flags for 178.galgel: Although the SPEC measures underwent dramatic
-qpdf1/pdf2 changes in both 1992 and 1995, the basic arithmetic used to
-O5 -qfdpr -qalign=struct=natural -lhmu -blp- compute processor speed from the run times of the individual
data -lmass -qessl -lessl programs in a benchmark suite has not changed since the
fdpr -q -O3 original 1989 SPECmark measure itself. Converting run times
for a set of test programs into a test score poses two problems.
Overall, peak optimization improved the floating- The first problem is that, although the SPEC bench-
point scores reported on this machine by just over mark programs are carefully chosen to represent different but
8%, but the effect of peak optimization on individ- similarly challenging types of tasks, they are nonetheless not
ual component scores varied all the way from a all of the same size and complexity, i.e., some of the test pro-
small (less than 1%) negative impact up to an grams are significantly larger than others, and some have sig-
improvement of nearly 32% (for 178.galgel). nificantly longer run times than others. To eliminate this
inequality—giving each test program equal weight (regard-
Basic SPEC CPU Speed Arithmetic less of its relative size or complexity)—the run time of each
SPEC CPU component benchmark programs are never arti- individual test program is divided into the run time of the
ficial or made-up programs, designed to simulate or other- same program on the reference machine, creating a relative
wise represent “typical” workloads. Instead, they are selected performance ratio. Performance ratios are all the same “size,”
from actual applications used by real people to do meaning- whatever the size and run times of the individual programs
ful work. Applications are carefully selected to represent dif- used to determine them.
ferent, but (in their own way) equally challenging types of The second problem is to find a fair way to average the
tasks. It is not necessary that applications be “popular” or resulting set of ratios (once every program in the suite has
widely used. For the same reason, no attempt is made to been run and divided into its corresponding reference
weight applications by frequency of use. The goal of SPEC value). With a small set of numbers, obviously mode and
CPU is to test processors against demanding workloads median values are of no significance, while an arithmetical
rather than against “typical” workloads. (Assuming what mean value is liable to be unduly influenced by a single high
seems doubtful, that it is possible to determine a “typical” or low result. For example, for a suite of ten programs, imag-
workload for a “typical” user.) ine a test machine twice as fast as the reference machine on

Year Iteration Suites Languages Measures Reference Machine


1989 SPEC CPU Vax 11/780
10 SPEC programs
RT: 18.66 hours C(4) & Fortran(5) & SPECmark 5 MHz
(scores not rounded) C/Fortran(1) SPECthruput 8K cache off-chip
memory N/A
1992 SPEC CPU92 6 CINT92 programs
RT: 6.21 hours C(6) SPECint92
14 CFP92 programs SPECfp92 same as SPEC89
RT: 41.27 hours C(2) & Fortran(12) SPECint_rate92
SPECfp_rate92
(scores rounded to 10s place)
SPECint95
1995 SPEC CPU95
SPECint_base95
8 CINT95 programs SPECfp95 SPARCstation 10/40
RT: 5.25 hours C(8) SPECfp_base95 40 MHz SuperSPARC I
10 CFP95 programs SPECint_rate95 20K/16K I/D L1 on-chip
RT: 11.00 hours Fortran(10) SPECint_rate_base95 no L2 cache
(scores rounded to 100s place) SPECfp_rate95 128MB memory
SPECfp_rate_base95
2000 SPEC CPU2000 12 CINT2000 programs Ultra 5 model 10
C(11) & C++(1)
RT: 5.31 hours 300 MHz UltraSPARC IIi
same set of 8 measures
14 CFP2000 programs 16K/16K I/D L1 on-chip
C(4) & Fortran77(6) & defined for SPEC CPU95
RT: 7.97 hours 2MB L2 cache off-chip
(scores rounded to 100s place) Fortran90(4) 256MB memory
2006 SPEC CPU2006 12 CINT2006 programs C(9) & C++(3) Ultra Enterprise 2
RT: 36.57 hours 296 MHz UltraSPARC II
same set of 8 measures
17 CFP2006 programs C(3) & C++(4) & 16K/16K I/D L1 on-chip
defined for SPEC CPU95
RT: 51.71 hours Fortran(6) & 2MB L2 cache off-chip
(scores not rounded) C/Fortran(4) 1GB memory

Table 3. Changes in SPEC CPU benchmark suites over the years. The benchmark committee stabilized on the current set of eight CPU measures in
1995. The 2000 and 2006 iterations of the benchmark focus on upgrades to the two suites of test programs and refinements of the run and report-
ing rules. The 1992 and 1995 changes in the basic reporting measures themselves were applied retroactively to the former test suites. Thus, for
example, one occasionally sees references to a SPECint89 or a SPECint_base92 score, although no such measures existed in 1989 or 1992.

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


8 SPEC CPU2006 Benchmark Suite

every component test except one, where it is 100 times as fast. common standard. As with the SPEC CPU speed metrics,
The arithmetical mean for this set of numbers is 118/10 = normalization of the throughput measures is accomplished
11.8—a value far more misleading than revealing, since the by relating each individual program in the test suite to the
test machine is not close to being 12 times faster than the ref- reference machine. Specifically, an additional normalizing
erence machine on any of the test programs. term (called a “reference factor” by SPEC) is inserted into the
To solve this problem, the SPEC committee adopted the above equation, equal to the execution time (on the reference
geometric mean as the fairest single-value summary of a machine) of the given benchmark program divided by the
small set of numbers, smoothing out the influence of one or execution time (on the reference machine) of the longest-
two “extreme” elements. The geometric mean of a set of N running program in the same benchmark suite. In essence,
numbers is computed by multiplying the numbers together the reference factor gives the longest-running program in the
and then taking the Nth root of the product. In the present suite a weight of 1.0, and every other program in the suite a
example, the geometric mean of ten individual SPECratios is fractional weight, in direct proportion to its relative run time
the tenth root of 51,200 = 2.96—a number far more repre- (as determined by the reference machine).
sentative of the machine’s overall performance than its arith- To see how this works, let’s return to the above example,
metical mean value of 11.8. where we had two very different throughput numbers based
on the very different running times of the two programs used
Basic SPEC CPU Throughput Arithmetic to measure throughput performance. Using the large 3,000-
SPECrate is SPEC’s “homogeneous capacity method,” mean- second program, the eight-way machine being tested had a
ing it measures a system’s ability to simultaneously execute throughput of 9.6 copies an hour. Using the small 1,000-
separate iterations of the same task rather than of different second program, its throughput was 28.8 copies an hour. Now
tasks. The tasks used for computing SPECrates are the same suppose these two large and small programs execute on the
SPEC benchmark suites used to compute SPEC speed meas- reference machine in 9,000 and 3,000 seconds, respectively,
ures. SPECrates are measured by simultaneously starting whereas the longest-running program on the reference
multiple instances of each program in one of the test suites machine takes 10,000 seconds. The reference factor for the
on a multiway system and measuring the time needed for all first equation, then, is 9,000/10,000, and the reference factor
the copies launched to finish executing. Typically, a separate for the second equation is 3,000/10,000. Inserting these two
copy of the program is started on each separate execution terms into the above two throughput calculations then yields:
engine in a system, although there is no SPEC rule requiring 8 x (9,000/10,000) x (3,600/3,000) = 8.64
that an eight-way system (for example) start eight instances 8 x (3,000/10,000) x (3,600/1,000) = 8.64
of a particular benchmark program. If a vendor thinks a bet- In other words, the reference factor successfully elimi-
ter overall result can be achieved by starting instead just seven nates differences in throughput calculations based on differ-
instances of a program (reserving one engine to handle sys- ences in the size and complexity of the program used to
tem overhead), it is free to use an eight-way machine to run measure throughput. Of course, in practice, throughput
seven-way (or any other-way) SPECrate tests. results obtained by running different programs are unlikely
Given the number of tasks launched simultaneously, to be quite as neat as in this particular example. Although the
and the time needed to finish all of them, computing the reference term is an effective way to level out major inequali-
number of copies of the given task that could run to comple- ties between the different programs in a SPEC benchmark
tion on the tested machine in a fixed unit of time, say, one suite, smaller idiosyncratic differences in the way different
hour, is straightforward. For example, if an eight-way programs run on both the test and reference machines are
machine ran eight copies of a given program in 3,000 sec- likely to remain. In consequence, depending on the specific
onds, it could run 8 x (3,600/3,000) = 9.6 copies of the pro- program used to benchmark throughput, somewhat different
gram in an hour. results are likely. Here, just as with the speed metrics, once a
But the number of copies of a program that will run to throughput number is determined for each individual pro-
completion in an hour is a function of the length and com- gram in a benchmark suite, an overall throughput number
plexity of the given program—which can vary greatly. For for the suite is generated by taking the geometric mean of the
example, if eight copies of a second program complete on the individual throughput results.
same machine in 1,000 seconds, it could run 8 x (3,600/1,000) The overall SPECrate reported for a multiway system,
= 28.8 copies of that second program in an hour—not then, is the geometric mean of the normalized SPECrates of
because the machine suddenly has three times the compute the component programs in the benchmark suite. This repre-
capacity when executing the second program, but because sents the number of copies of an “average” workload the tested
executing a copy of the second program involves only one- machine can execute in a given unit of time. When SPECrate
third of the work needed to execute the first program. was introduced in 1992, the unit of time used was a week.
To determine the compute capacity of a given machine SPEC CPU95 dropped the throughput period to a day, and
by running a variety of different programs on it, then, SPEC CPU2000 further shortened it to an hour. The new
requires some way to normalize all the different tasks to a CPU2006 takes an unobvious next step and simply eliminates

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


SPEC CPU2006 Benchmark Suite 9

any reference to a fixed unit of time in its SPECrate calcula- The key to comparing machines across different bench-
tion. Specifically, the formula used to compute SPECrates in mark eras is to find a linking machine, i.e., a machine with
the 2006 suite is no longer: scores reported on both the older and newer measures. In a
N x RF x U/T table like Table 4, adjacent machines are always compared on
Where N is the number of copies of the benchmark the same scale (never across different benchmark suites).
program run, RF is the normalizing reference factor for that Linking machines are special only in the fact that they are
benchmark program, U is a specified fixed unit of time (in compared with their predecessor on the older scale and with
seconds), and T is the time (in seconds) the tested machine their successor on the newer scale. As long as each suite fairly
takes to finish all N simultaneously launched copies. Instead, compares the machines of its own era (including the dual-
the revised 2006 formula is simply: membership linking machine), the mathematical law of tran-
N x (RF/T) sitivity allows fair comparisons across successive suites.
Applying this revised formula to the two examples For example, if
above yields: 1. the POWER3-II at 375MHz is fairly regarded as 18.9%
8 x [(9,000/10,000)/3,000] = 0.0024 faster than the PowerPC RS64-III at 450MHz (based on
8 x [(3,000/10,000)/1,000] = 0.0024 their respective SPEC CPU95 benchmark scores of 22.6
In effect, dropping any specific reference to U from vs. 19.0), and
the calculation changes the SPECrate fixed unit of time to 2. the POWER3-II at 450MHz is fairly regarded as 15.3%
the unit used to measure times—in other words, from one faster than the POWER3-II at 375MHz (based on their
hour (3,600 seconds) to one second. (Multiplying 0.0024 x respective SPEC CPU2000 benchmark scores of 286
3,600 = 8.64.) vs. 248). Then it follows (by the law of transitivity) that
3. the POWER3-II at 450MHz (in 2000) is fairly regarded
SPEC CPU As a Window Into the Past as 37.2% faster than the PowerPC RS64-III at 450MHz
SPEC occasionally has been criticized for periodically revising (in 1999)—even though those two machines were never
the programs in its CPU benchmark suite (and changing the scored on a common SPEC benchmark suite.
associated reference machine), on the
grounds that it makes comparisons
from one benchmark-suite era to the
5MHz 5MHz 40MHz 300MHz
next one difficult or impossible. In
VAX VAX SS I US IIi EQV
reality, these changes are not a serious Processor Year MHz INT89 INT92 INT95 INT00 Gain INT00
handicap—as long as one assumes that POWER1 3064 1990 30 24.0 N/A 9.76
each separate suite fairly compares the POWER1 4164 1991 41.67 33.8 38.9% 13.6
systems of its own era. Although SPEC POWER1 6264 1992 62.50 61.3 59.2 50.0% 20.3
POWER2 1993 71.50 125.9 112.7% 43.2
always carefully stresses that no for-
POWER2 1994 71.50 125.9 0.0% 43.2
mula exists for converting measure- PowerPC 604 1995 133 142.2 4.45 12.9% 48.8
ments made using one benchmark PowerPC 604e 1996 200 7.18 61.3% 78.8
suite into measurements for an earlier PowerPC 604e 1997 360 12.1 68.5% 133
or later suite, no such formula is PowerPC 604e 1998 375 14.5 19.8% 159
PowerPC RS64- 1999 450 19.0 31.0% 208
needed to conduct longitudinal stud-
POWER3-II 2000 375 22.6 248 18.9% 248
ies of changes in performance over the POWER3-II 2000 450 286 15.3% 286
years (spanning different SPEC CPU POWER4 2001 1,300 790 176.2% 790
benchmark-suite eras). All that’s nec- POWER4+ 2002 1,450 909 15.1% 909
essary is to find “linking systems” that POWER4+ 2003 1,700 1,077 18.5% 1,077
POWER5 2004 1,900 1,398 29.8% 1,398
connect each scale of reference with
POWER5+ 2005 1,900 1,470 5.2% 1,470
the following scale. POWER5+ 2006 2,300 1,820 23.8% 1,820
IBM’s POWER processor line
furnishes an excellent example for this
sort of study, because its lifespan Table 4. Integer performance of IBM’s POWER/PowerPC processor line from introduction to pres-
ent. As a reminder, the reference machine “scale” for each suite is shown at the top of its column.
(1990–present) coincides almost The POWER3-II at 375MHz is shown in italics because it is just a linking machine and not properly
exactly with the lifespan of the SPEC part of this sequence of best integer performers on a year-by-year basis. It would have been con-
benchmark suite itself (1989–present). venient if IBM had reported a SPEC CPU95 score for 2000’s fastest POWER3-II at 450MHz—directly
Table 4 shows the best SPEC CPU inte- showing its 37% performance advantage over the 450MHz PowerPC RS64-III at 450MHz—but it
ger score achieved by a POWER (or did not. Fortunately, it is not necessary that companies report the most convenient scores, as long as
they report scores for some linking machine, bridging the gap from numbers reported in a former
PowerPC) processor over the 17-year suite to numbers reported in a subsequent suite. (Since SPEC requires that dual scores be reported
history of this architecture, on a year- for a three-month period when transitioning to a new suite, the existence of some crossover machine
by-year basis. or other for every active architecture is virtually guaranteed at each transition point.)

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


10 SPEC CPU2006 Benchmark Suite

The principle of transitivity also allows machines from was one of great ferment among high-end microprocessor
different SPEC CPU benchmark eras all to be rescaled to any architectures. Intel’s new i860 line threatened the domi-
arbitrary common measure, just as long as the new scale pre- nance of that company’s just-introduced i486 processor
serves the same fair difference between each adjacent pair of family. Motorola’s 68000 line remained an established force,
machines reported on the given SPEC CPU benchmark suite. while the new Motorola 88000 family attempted to super-
Since the base for SPEC CPU2000 was set at 100 rather than sede it. The SPARC, MIPS, and PA-RISC processor architec-
1.00, it makes a convenient general reference scale for all tures contended for RISC preeminence. The DEC VAX,
machines over time (avoiding any need for small fractional Intergraph/Fairchild Clipper, and HP/Apollo Prism were
values all the way down to machines with just 1% the com- still present. The IBM POWER, the DEC Alpha, and the
puting power of a 300MHz UltraSPARC IIi). HP/Intel Itanium had yet to appear.
Note that the integer performance of the POWER/ Today, most of those alternatives are gone. The SPEC
PowerPC line has improved (at least slightly) every year, with the CPU2000 suite serviced just seven architectures: the various
singular exception of 1994, when no advance was posted over x86 lines of Intel and AMD (Pentium, Xeon, Core, Athlon,
the huge (nearly 113%) gain recorded by the new POWER2 Opteron) plus Itanium and the five major high-end RISC
design in 1993. Much can be learned about processor perform- architectures: PA-RISC, MIPS, SPARC, POWER, Alpha. Of
ance from studying a historical table like this one—which, in that diminished pool, at least three architectures (PA-RISC,
turn, is possible only because of the SPEC CPU benchmark suite. MIPS, Alpha) seem certain never to be tested on SPEC
In fact, a table like the one shown in this section, aug- CPU2006. In short, SPEC CPU2006 appears to be a vehicle
mented by similar tables for floating-point performance with a relatively limited practical scope: specifically, com-
and comparable tables for other architectures, is an invalu- paring the various implementations of x64 processors from
able performance tool (albeit, besides POWER, only SPARC Intel and AMD with each other and with alternative SPARC,
and Pentium span the entire period represented by the POWER/PowerPC, and Itanium processors.
SPEC benchmark suite). These sorts of tables provide nearly Even this relatively limited scope appears threatened
endless opportunities for analysis and discussion, backed up with further contraction. As Table 5 shows, exactly one inte-
by hard, measured numbers rather than mere opinion and ger result has been posted (to date) during 2006 for Itanium—
guesswork. Certainly, one of the most significant contribu- despite Intel’s July 18 release of its once much-ballyhooed
tions of the SPEC benchmark suite—empowered by its abil- 1.7-billion-transistor Dual-Core Itanium 2 Processor 9000
ity to periodically refresh itself to stay relevant to the latest series (code-name “Montecito”). Neither the most interest-
generation of processors—is this sort of historical window ing new SPARC processor in years, the UltraSPARC T1
into the changing nature of processor performance itself. (code-name “Niagara”), nor IBM’s recent Cell processor has
bothered to post a single SPEC CPU number.
How Relevant Is SPEC CPU to the Future? To be sure, SPEC CPU has never been a benchmark for
In the 17 years since SPEC CPU89 first appeared, many changes all purposes or all architectures. Embedded processors have
have occurred in the computing industry. The year 1989 their own distinct set of benchmarks, developed by the
Embedded Microprocessor Benchmark
Consortium (EEMBC; for recent coverage,
2000 2001 2002 2003 2004 2005 2006 Total see MPR 2/22/05-01, “EEMBC Expands
Xeon 0 9 16 29 45 118 130 347
Pentium 36 29 62 31 41 37 35 271
Benchmarks,” and MPR 7/17/06-02,
Opteron 17 14 68 74 173 “EEMBC Energizes Benchmarking”). No
SPARC 6 43 19 19 7 8 8 110 IBM mainframe or mainframe clone has
POWER 16 13 18 21 10 10 21 109 ever been subjected to SPEC CPU testing,
Itanium 4 7 15 21 10 1 58 nor has any Cray or other supercomputer.
Athlon 2 14 11 6 4 2 6 45
Alpha 15 8 10 3 4 40 And even though the candidate micro-
Core 23 23 processor architectures for SPEC CPU test-
PA-RISC 5 8 4 17 ing are dwindling down to a select few, the
R1X00 (MIPS) 5 4 2 11 number of vendors anxious to test their sys-
Total Cint2000 reports 85 132 149 141 146 253 298 1,204
tems on SPEC CPU measures seems, if any-
Table 5. Breakdown of SPEC CPUint2000 results by processor type. Although 2002 was the last
thing, to be on the increase (see Table 6).
year any results were reported for either PA-RISC or MIPS processors, and 2004 marked the last
report for an Alpha processor, architectural dropout has not cast a damper on results reporting. A Brave New World of
To the contrary, reporting has picked up sharply in the past two years as a consequence of the Processor Performance
ongoing performance duel between Xeon-based and Opteron-based systems. New Core-based A more serious difficulty than the dwindling
systems may fuel a further increase in reporting during 2007. The sharp drop-off in reported
results for SPARC-based systems, starting in 2004, reflects the fact that Sun has declined to post
number of high-end architectures targeted
any Cint2000 results for its throughput-oriented UltraSPARC IV generation, leaving Fujitsu to by the SPEC CPU benchmark suite is the
carry the burden of single-thread performance with its alternative SPARC64 V line. changing nature of high-end processors

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


SPEC CPU2006 Benchmark Suite 11

themselves. The introduction of both multicore chip designs


(starting with IBM’s POWER4, introduced in 2001) and mul-
Price & Availability
tithreaded core designs (starting with Intel’s “hyperthreaded”
Pentium 4 Xeon, introduced in 2002) has changed the very SPEC CPU2006 is available for immediate delivery (on
notion of processor performance. DVD) from SPEC. Since SPEC is a nonprofit corporation,
With a widespread shift to dual-core designs already in the price is modest, intended only to defray associated
full swing, and still higher core counts promised for following costs. For new customers, the charge is $800; for cus-
generations, traditional single-core, single-thread processors tomers upgrading from SPEC CPU2000, the cost is $400;
are rapidly becoming extinct. In a multicore, multithread and for qualified educational institutions, the cost is just
milieu, every processor is (or soon will be) a multiprocessor $200. Requirements for running the benchmarks on a test
system on a chip. For these designs, performance evaluation system include a Windows, Unix, or Linux operating sys-
must inevitable shift away from the traditional speed metrics tem plus a minimum of 1GB of memory, at least 8GB of
in the SPEC CPU benchmark suite toward throughput meas- unused disk space, and a set of C, C++, and Fortran com-
ures. Indeed, some of the newest designs, like Sun’s Ultra- pilers (plus, of course, a DVD drive to read the SPEC disks).
SPARC T1 (8 integer cores x 4 threads/core) and IBM’s Cell See www.spec.org/ for additional details about the
broadband engine (1 dual-thread PowerPC core + 8 vector SPEC benchmark suites.
synergistic processor elements or SPEs) already stress the
throughput half of the speed/throughput dichotomy much
more heavily than any dual-core and/or dual-thread design. SPEC measures, it must be that SPEC measures do not fairly or
Regrettably, the SPECrate measure, although useful, is usefully reflect the performance these sorts of radical multi-
both incomplete and inadequate as a sole measure of through- thread designs provide.
put performance. There are several issues here. Vendors that
want to post a throughput score but not a speed score For 2006, Just a New CPU Suite Is Not Enough
sometimes hesitate to use SPECrate, because of its link to the The continued widespread use of the SPEC CPU bench-
SPEC speed metrics (fearing that competitors may attempt mark suite to exhibit and compare processor performance is
to reverse engineer a speed number out of a posted a testament to the admirable job done by the SPEC CPU
SPECrate number). Set-
ting this issue aside,
SPECrate measures only 4Q99 2000 2001 2002 2003 2004 2005 1-3Q06 Total
“homogeneous” work- 1 Acer Incorporated 0 0 0 0 0 0 25 4 29
loads, but many multi- 2 Advanced Micro Devices 0 2 14 11 19 9 17 10 82
3 Alpha Processor, Inc. 0 2 0 0 0 0 0 0 2
threaded workloads are 4 ASUS Computer International 0 0 0 0 0 0 0 6 6
heterogeneous and so 5 Banha for Electronic Industeries 0 0 0 0 0 0 0 1 1
remain unrepresented by 6 Bull 0 0 0 0 3 1 0 10 14
any existing SPEC bench- 7 CAIRA 0 0 0 0 0 0 0 2 2
8 Compaq Computer Corporation 6 7 9 1 0 0 0 0 23
mark number. Further- 9 Dell 2 11 19 39 28 22 21 38 180
more, SPECrate was ini- 10 Fujitsu* 0 1 28 12 10 7 30 31 119
tially conceived and still 11 gcocco software, inc 0 0 1 0 0 0 0 0 1
13 Hewlett-Packard Company 0 5 11 24 22 26 46 45 179
exists primarily as a metric
12 Hitachi 0 0 0 0 0 0 3 0 3
for comparing multi- 14 IBM Corporation 1 15 13 18 23 14 68 84 236
processor systems, rather 16 Intel Corporation 1 21 18 33 17 22 3 22 137
than for comparing indi- 15 ION Computer Systems 0 0 0 0 1 10 7 15 33
17 PathScale, Inc. 0 0 0 0 0 1 0 0 1
vidual processors at the 18 RLX 0 0 0 2 0 0 0 0 2
level of multiprocessor sys- 19 SGI 2 3 4 2 4 7 1 0 23
tems on a chip. 20 Sun Microsystems 1 5 13 7 14 7 9 11 67
It is significant that 21 Supermicro 0 0 0 0 0 20 23 19 62
22 Verlag Heinz Heise 0 0 2 0 0 0 0 0 2
neither of the two most Total Cint2000 reports 13 72 132 149 141 146 253 298 1,204
interesting recent processor Total companies reporting 6 10 11 10 10 12 12 14
designs—Sun’s Ultra-
SPARC T1 and IBM’s Cell Table 6. Breakdown of companies reporting SPEC CPUint2000 results. The increase in Xeon and Opteron report-
processor—are represented ing during 2005–2006 (shown in Table 5) has been largely driven by Acer, Bull, Dell, Fujitsu, HP, IBM, Sun, and
anywhere in the current set Supermicro. (All Sun Cint2000 results from 2004 on are for Opteron systems.) The 14 different companies report-
ing results in 2006 (to date) is also an all-time high. In addition to the 22 companies listed above, six more com-
of posted SPEC CPU panies have reported results in one or more of the other three SPEC CPU results categories (for floating-point or
results. Since neither Sun rate numbers): HCL Infosystems, Itautec, NEC, RackSaver, Unisys, and Wipro. *Includes results from Fujitsu Lim-
nor IBM hesitates to use ited (41), Fujitsu Siemens Computers (77), and Fujitsu Egypt (1).

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


12 SPEC CPU2006 Benchmark Suite

committee over the years. The efforts of this committee to periodically refresh and renew the SPEC CPU benchmark
The Problem With “mips”

SPEC was born out of a controversy over processor per- The biggest difficulty with both peak and sustainable
formance that raged during the mid-to-late 1980s, at the mips ratings, however, is that comparisons are properly lim-
time of the “RISC revolution” in processor design. Before ited to machines that share a common ISA. That is to say,
the introduction of the SPEC benchmark suite, processor it makes sense for a computer architect to use mips when
performance was measured in “mips,” an acronym for mil- thinking about the difference in performance between two
lions of instructions executed by a processor per second. generations of machines based on the same ISA. But where
(Note that mips is not a plural form; 1mips is the proper ISAs differ, the same mips number will mean two different
performance designation for a machine able to execute one things. Particularly when ISA differences are as profound as
million instructions per second.) There is nothing intrinsi- RISC vs. CISC, mips comparisons are likely to be very mis-
cally wrong with measuring performance by the number of leading. CISC proponents rightly complained that their
instructions executed in a unit of time. To the contrary, “complex” instructions tended to do more work than the
computer architects typically spend their time thinking of “reduced” instructions RISC machines executed, There-
ways to get a next-generation machine to execute more fore, CISC mips were bigger than RISC mips, e.g., a
instructions per second than the current generation of “3mips” CISC machine might actually outperform a
machine. Thus, the waggish translation of mips as “mean- “5mips” RISC machine (i.e., do more work in three million
ingless indicator of processor speed”—popular during the instructions than a given RISC machine would accomplish
height of the RISC performance wars—was unduly harsh. by executing five million instructions).
The problem with mips as a measure of processor per- These sorts of intractable difficulties with peak and
formance is not that it’s meaningless, but rather that its sustainable mips led to “Dhrystone mips” ratings (also
meaning is not fixed. There are peak mips, which is the known as “VAX mips”). Dhrystone mips were not mips at
maximum number of instructions a particular machine could all, except by courtesy. This value was computed by count-
possibly execute in a second. This can be computed by sim- ing the number of times a small artificial integer-code
ply multiplying the number of instructions a machine can benchmark program called “Dhrystones” (a pun on a well-
issue in a clock cycle by its clock speed in megahertz. Thus, known floating-point benchmark known as “Whetstones”)
the Intel i860, one of the early superscalar designs, was pro- iterated in a second and dividing that number by 1,760—
moted as offering up to “150mips” of performance, based supposedly, the number of times this program iterated in a
on its (intended) clock speed of 50MHz and its ability to second on a VAX 11/780 (a 1978 DEC system widely pro-
issue a maximum of three instructions in a single clock cycle. moted as a “1mips” machine, despite the fact that it exe-
A peak mips rating, however, gives a very unrealistic cuted only about 500,000 instructions a second).
appraisal of a machine’s actual performance. A better Dhrystone mips enjoyed several advantages. It was
measure is “sustainable mips,” which is computed by mul- both clear and measurable (in contrast to sustainable mips),
tiplying a processor’s clock speed by the average number of bore some relationship to actual performance (in contrast
instructions it issues in a clock cycle during the course of to peak mips), and could be used to compare machines
some reasonable workload. Thus, the initial SPARC-based with very different ISAs (in contrast to either peak or sus-
systems from Sun were advertised as “10mips” machines. tained mips). It suffered from two disadvantages: its only
Since the processor ran at 16.67MHz, clearly somebody at connection to mips was an exaggerated DEC marketing
Sun estimated a typical issue rate of about six instructions claim and the awkward truth that the artificial benchmark
out of every 10 clock cycles, or 60% of its peak issue rate itself was not representative of any realistic application pro-
of one instruction per clock. gram (see MPR 12/16/02-01, “Dhrystones Are All Wet”).
Peak vs. sustainable mips ratings had opposite virtues Still, the notion of Dhrystone mips was clearly a step
and vices. Peak mips were clear, definite values that every- in the right direction. The basic idea itself—compare the
one could agree on, but they were very unrealistic in terms performance of a benchmarked machine against the
of a machine’s achieved (or even achievable) performance. known performance of a standard reference machine on a
Average mips might more closely reflect the machine’s standardized test program—was exactly right. The problem
achieved performance, but there was nothing clear or def- lay in using Dhrystone as the standardized test program. It
inite about them. Not only did they vary quite dramatically was simply too small and too prone to compiler optimiza-
by workload, but the manner of computing an average tion to be a reliable measure. The solution adopted by SPEC
number of instructions per clock was often only vaguely was to replace this one small, artificial program with a
specified, if it was specified at all. robust suite of much larger programs, drawn from real life

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT


SPEC CPU2006 Benchmark Suite 13

P R O B L E M W I T H “ M I P S ” (C o n t i n u e d ) metric will never cease to be relevant and interesting. Use of


the current “auto-parallelization” option, when compiling
and carefully chosen to represent a range of important and running the individual programs in the two speed
and typical computational tasks. As a side benefit, benchmark suites, allows the multiple cores provided by
because machine speed was directly measured by the today’s processors to be applied to the task of speeding up a
elapsed time needed to complete specific programs single thread of execution. (Although it may be desirable to
rather than by counting the number of instructions exe- revisit the current single-thread measures—with and with-
cuted per second, the notion of “mips” could be out auto-parallelization—looking for potential weaknesses
dropped altogether in the various SPEC measures. and shortcomings in the light of new, massively threaded
chips like the UltraSPARC T1 and the IBM Cell.)
Current measures of the throughput performance a
single processor chip can deliver are less satisfactory, how-
suite have kept it current with, and relevant to, successive ever. CPU2006 SPECrate numbers provide only limited illu-
generations of processors. However, if SPEC CPU2006 is to mination here, in part because this measure was never
remain the computer industry’s premier benchmarking tool designed to address performance at the component proces-
for analyzing and comparing processor performance over sor level (but only at the multiprocessor system level).
the next five years, more than just a revision of the pro- Another vital issue for the processors of 2006 and
grams in the CPU2000 benchmark suite may be needed. For beyond is the efficiency with which a processor chip deliv-
the first time in over a decade, it may be necessary for the ers performance, both from the standpoint of its speed rat-
SPEC committee to revisit the basic scales created for meas- ing and in terms of its throughput rating. Is there, say, an
uring processor performance. optimal speed and/or throughput performance-per-watt
In particular, it may be time to broaden the scope of figure for a processor, and, if so, how can a customer know
the measures comprehended in SPEC CPU2006 beyond the when this ideal balance point is reached?
current three dichotomies of integer vs. floating-point, The SPEC committee is already hard at work develop-
speed vs. rate, and base vs. peak performance. Without ing ways to measure a processor’s efficiency (due to be
question, the focus of SPEC CPU2006 should remain on released sometime in early 2007). This is a positive sign that
component processor performance, but 2007 should see an SPEC intends to continue changing with the times in the
expansion in the focus of this component to encompass all future, as it has in the past. It is to be hoped that the SPEC
the relevant performance issues raised by current and future committee will not rest after adding new efficiency meas-
radically multithreaded processor chips. ures. The processors available for testing by the CPU2006
New measures should be in addition to the current, benchmark suite will be more different from the processors
older measures. Certainly, Cint2006 and Cfp2006 will con- tested on CPU2000 than CPU2000 processors were from
tinue to adequately address the sort of performance the processors tested by the original CPU89 suite. It is time
achieved by a single core running a single thread. As IBM’s to revisit the fundamental notions of measuring processor
next-generation POWER6 design indicates, this basic speed performance.

To subscribe to Microprocessor Report, phone 480.483.4441 or visit www.MPRonline.com

© I N - S TAT OCTOBER 10, 2006 MICROPROCESSOR REPORT