
Janet Wilson

Although the problems are manifold, chip architects from Sun, Cyrix,
Motorola, Mips, Intel, and Digital see challenges rather than walls.
This year is near an inflection point in the three-year cycle for designing
new microprocessors. For this reason, several companies will produce or
introduce new products in 1998:
• Digital's 21264 and IBM's Power 3, both 64-bit processors, are projected
to reach volume production.
• Sun's UltraSparc III will sample.
• Intel plans to debut its latest Pentium II, code-named Deschutes.
• Sun plans to release details about UltraJava, its newest high-performance
Java/media/3D processor.
• Cyrix's first foray into the high-performance CPU core arena, Cayenne,
will debut.
• AMD expects to release a 3D-enhanced version of the K6, and Silicon
Graphics plans to announce the R12000, its next Mips processor.
Toward the end of 1998, we can also expect more news about Merced, Intel's
next-generation, 64-bit processor. (For a related discussion, see
"Introduction to Predicated Execution," pp. 49-50.)
Microprocessor design is often spotlighted for technological innovation, but
trends here also have tremendous economic ramifications. Worldwide sales for
microprocessors reached $23.6 billion in 1997, according to the World
Semiconductor Trade Statistics organization (WSTS Semiconductor Forecast,
Semiconductor Industry Assoc., San Jose, Calif., 1997). This group, which
represents 70 semiconductor companies, also reports that microprocessors
outsold DRAMs for the first time in 1997, recording a 27.6 percent increase
over the previous year's sales. The WSTS estimates the 1998 market for
microprocessors will hit $28.4 billion.
VIRTUAL ROUNDTABLE
As part of this outlook issue, Computer invited six computer architects to
participate in a virtual roundtable. Each participant responded to the
following list of questions posed by the Computer staff:
• What are the major roadblocks chip architects will have to overcome in the
short term (five years)? Which (if any) are fundamental problems of physics?
• What will be the foremost obstacle to continued microprocessor performance
improvements in the next five years?
• Are increasing costs associated with validation and testing a looming
bottleneck? If so, what ways do you see around this problem?
• Will slow bus speeds prove a significant problem to increased chip speed?
Why or why not?
• Will slow memory access be a major problem?
• What types of applications will drive microprocessor design in the next
five years? Many point to multimedia; are there others?
• What are the trends surrounding microprocessor design itself: team sizes,
schedules, targets, tools? Are these trends acceptable, or do any also
constitute a threat to the business?
• Will standards evolve to support modular systems on a chip? Do you see the
major companies working together?
• What functionality may migrate to software?
With so much at stake in this competitive field, we feared that participants
would find it difficult, if not impossible, to share their insights. As one
participant so eloquently put it, "I am an academic at heart and like nothing
better than all-out discussions of interesting problems. On the other hand,
I'm paid with checks that have a company's name at the bottom, and I must
zealously guard their interests."
Despite this, these six architects shared several insights of interest to
those of us not intimately connected with processor design. We thank them for
their candor and for giving generously of their time.
Janet Wilson is an associate editor for Computer. Contact her at
j.wilson@computer.org.
Increasing Work, Pushing the Clock
Marc Tremblay
Sun Microsystems
When designers create a new generation of processors, improving performance
is often the key goal. There are three main factors that affect performance:
• how fast you can crank up the clock,
• how much work you can do per cycle, and
• how many instructions you need to perform a task.
Designers optimize these factors through microarchi-
tecture techniques, compiler optimizations, and
instruction set architecture innovations.
People in industry talk about these three factors, but I haven't seen too
much, even from academia, that really improves how much work a processor does
per cycle and also pushes the clock rate. Most ideas usually improve one
factor to the detriment of the other. This can be fine, because if you
improve one factor by 50 percent and decrease the other by 10 percent then,
by the multiplication of factors, you're still better off.
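To make the arithmetic concrete (a generic back-of-the-envelope illustration,
not a figure from any particular Sun design), the three factors simply
multiply:

    Execution time = (instructions per task) x (cycles per instruction) x (seconds per cycle)

So a change that delivers 50 percent more work per cycle while stretching the
cycle time by 10 percent still yields a net speedup of roughly 1.5 / 1.1, or
about 1.36.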
Today, most design houses are merely extending
what already exists, designing microprocessors capa-
ble of issuing more instructions per cycle, for example.
New machines are capable of more out-of-order exe-
cution, can access memory faster, or perform two
memory operations in parallel. Yet this evolution may
present problems because it conflicts with the goal of
keeping the cycle time short. Another challenge for
architects is to develop strategies that not only improve
performance but facilitate the physical design.
Tailoring the processor. One way to branch out of
this simple evolution is to design processors that are
much more tailored to what users want to run. Now
companies are designing microprocessors that end up
being used in servers, powerful desktops, and even
(sometimes) network computers. Architects are trying
to design a processor that will run say, huge database
applications or huge EDA applications, like
CAD/CAM software. Then the same processor must
also run a word processor, visualize 3D models, play
a video clip, and so on. I foresee the day that design-
ers will partition a microprocessor family into, say,
client chips and server chips.
Client chips would focus on user interaction; that's what 99 percent of users
care about. By focusing on user interactions (multimedia, voice recognition,
speech processing, video, audio, and especially 3D graphics) we can optimize
the chip, because all these applications have very predictable data accesses.
For example, playing video on a client is easy; it's just streaming data
coming in and being decompressed and then displayed on the screen. In this
example, the processor doesn't need to be close to memory, so there's no need
to integrate memory with the processor on a single chip. That strategy may
apply to other applications for which it's hard to predict what data the chip
needs to access from memory. But for multimedia, it's fairly well understood
what needs to be brought into the microprocessor, and this data can be
accessed speculatively and ahead of time. So the industry will specialize
microprocessors not only by user, but by application categories.
Design drivers. Besides multimedia, other important
applications for the next five years are personal pro-
ductivity applications (like word processors, spread-
sheets, and presentation software), 3D browsing, and
shared whiteboards. There may also be more industrial applications, in which
users visualize and order parts.
Microprocessors for these applications are completely different from those
that sit on a server and run SAP, Baan, and Oracle software. So it would be
much more economical if we could partition a microprocessor family to run a
specific set of applications. We could then make trade-offs that deliver
better performance and permit a chip to excel at its particular task.
Leveraging design teams. One trend is that very high-end processors keep
getting more complicated. This means that team size has remained large. One
way companies will continue to work around this is by overlapping work
between processor generations. Although there are two independent design
teams for, say, an UltraSparc III and UltraSparc IV, there's a lot of
technology and design that can be leveraged from one to the other, even
though they're completely different architectures. For example, the people
designing the on-chip memory (the instruction and the data caches) can
leverage a lot of their design. Actually, what we use are small SWAT teams,
which focus on parts of the design that can be highly leveraged across chips.
Another thing that could be leveraged, if managed properly, is a global
verification methodology. We could also leverage some CAD tools from one
generation to the next. This can help keep teams smaller so that designing
two processors in an overlapping fashion does not require twice the size of
the original team.
More parallelism. To take microprocessors to the next level, we need to look
at parallelism at higher than the instruction level. We need to run more than
one execution thread in parallel. A word processor, for instance, reformats
pages sequentially. The program starts with paragraph one, then goes to two,
three, and so on. We could rewrite code to reformat all the paragraphs in
parallel so that one part of the machine could do the first 10 paragraphs in
parallel, another part, the next 10, and eventually you could have several
such threads format a document in parallel. There's a lot of inherent
parallelism in most applications; it's just that we haven't approached it
like that.
Focusing on this higher level parallelism would
allow smaller computational units to work in paral-
lel on these threads, communicating only when nec-
essary. This would lead to a much more modular
approach to processor design.
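A minimal sketch of this idea in C with POSIX threads follows; everything in
it is an assumption made for illustration (the reformat_paragraph stand-in,
the paragraph count, and the four-way split), not code from any Sun product.
Each thread reformats an independent block of paragraphs, and the threads
communicate only at the final join.

    /* Sketch: thread-level parallelism for paragraph reformatting. */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_PARAGRAPHS 40
    #define NUM_THREADS 4

    struct range { int first, last; };

    /* Stand-in for the real layout work on one paragraph. */
    static void reformat_paragraph(int p)
    {
        printf("reformatted paragraph %d\n", p);
    }

    /* Each worker handles a contiguous, independent block of paragraphs. */
    static void *worker(void *arg)
    {
        struct range *r = (struct range *)arg;
        for (int p = r->first; p < r->last; p++)
            reformat_paragraph(p);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_THREADS];
        struct range ranges[NUM_THREADS];
        int chunk = NUM_PARAGRAPHS / NUM_THREADS;

        for (int t = 0; t < NUM_THREADS; t++) {
            ranges[t].first = t * chunk;
            ranges[t].last = (t == NUM_THREADS - 1) ? NUM_PARAGRAPHS
                                                    : (t + 1) * chunk;
            pthread_create(&tid[t], NULL, worker, &ranges[t]);
        }
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);   /* threads synchronize only here */
        return 0;
    }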
Part of the problem with doing this, though, is that we've been tied to the
same ISA for 20 years: Binaries created in the late 1970s for the x86 still
need to run unmodified on x86-compatible processors. But if we had a layer,
basically a virtual machine, we could avoid these problems. This problem is
the trigger for run-time ISAs (as described by Josh Fisher in "Walk-Time
Techniques: Catalyst for Architectural Change," Computer, Sept. 1997,
pp. 40-42), which are, of course, provided by Java. In Java, we no longer
have to run existing binaries or deal with legacy ISAs, so now we can tailor
the ISA to what users intend to run.
Marc Tremblay is a distinguished engineer involved in the architecture of
high-performance processors at Sun Microsystems. Prior to his current work on
the architecture for UltraJava and picoJava, he was coarchitect for the
UltraSparc I and II processors. Tremblay received an MS and a PhD in computer
science from UCLA and a BS in physics engineering from Laval University,
Canada. He is a member of the IEEE Computer Society.
Reining in Complexity
Greg Grohoski
Cyrix
Over the last few years we've seen a trend toward increasing complexity, in
an attempt to execute more instructions per cycle. The pendulum will swing
the other direction now: Designers have gotten fed up with the amount of
complexity that's in microprocessors. In the final analysis, complexity often
fails to yield the expected IPC and winds up costing a lot in die size, clock
cycle, and schedule. As a result, designers will simplify designs and take a
hard look at why we're including certain architectural features. Why, for
instance, is there a high degree of out-of-order execution? How much can be
done by the compiler with ISA extensions? Can we use various prescheduling
techniques? Can we get the performance through an efficient pipeline with a
higher clock rate?
In reducing complexity, RISC chips or other processors with new instruction
sets have an advantage because they're able to retarget the compiler. Many
applications for such systems are also developed by end users rather than
purchased from third-party vendors. End users are usually willing to
recompile to get a reasonable performance increase, typically at least 20
percent. The x86, however, is in a whole different arena; we've got basically
no flexibility whatsoever to recompile. So it's a much bigger challenge to
get more and more performance. Historically, that's led to immense
complexity.
Memory access is also a major problem. Most of
the high-frequency chips coming to market today
spend a large fraction of time waiting for data from
the off-chip cache or from memory. In the future, we
can mitigate some of these intraprocessor communi-
cation problems through careful integration and sys-
tem design.
My personal favorite solution is putting the CPU on a DRAM. Although we may
run a high-speed bus between the memory and CPU on today's split-chip
designs, we can't fundamentally get the latency and bandwidth possible if the
two were on the same chip. CPU/memory integration will make sense for a wide
variety of systems. Let's say you have a system that can get by with 16 to 32
Mbytes of memory. On one 256-Mbit memory chip, using techniques like memory
compression, you can afford to devote some reasonable percentage of that die
to a simple CPU. I think this type of design can be far more effective in
terms of dollars per MIPS than today's split-chip designs.
The next step will be getting data from the network
or from a disk. Interconnects like Universal Serial Bus
and Firewire will help, but they're still too slow to
keep up with future microprocessor speeds. We'll need
other strategies to attack those problems.
Design drivers. Processors will continue to change
because of the applications that drive the design. I see
large data-mining applications driving microproces-
sor design. Visual computing is also on the horizon;
many users would like outstanding 3D graphics and
multimedia. Clearly we've already seen this with widespread
adoption of MMX or MMX-like technology,
not only in Intel-compatible CPUs but also in RISC
designs. At the low end, cost is a driving factor. When
we can really sell powerful CPUs with memory and
graphics for, say, $20 or $30 apiece, vendors can
design PCs that sell for a few hundred dollars. This
results in a different class of PCs, for which engineers
have to figure out optimal cost-performance trade-
offs. At the high end, where die size is less of an issue, we tend to throw
the kitchen sink at a problem; what's another couple hundred thousand
transistors? It's a lot more challenging to get world-class performance when
your die-size target is 80 to 100 mm².
So there's no question that we're tailoring products to a particular market.
At Cyrix, one idea we're pursuing is a CPU core that works like a large ASIC
block. System developers can instantiate various devices on a chip with this
CPU core. For a group of companies to cooperate on a system on a chip,
though, will take a product that has a potential to sell in quantities of 10
to 50 million units. When that opportunity arises, barriers to industry
cooperation will fall. Of course the first company that attacks that problem
successfully is likely to drive the standards and make the money.
Yet Cyrix's interest in cores doesn't mean we're giving up on
high-performance CPUs. What we're after now is to deliver a compact,
efficient core useful in a wide variety of applications.
Tool problems. Although I don't want to offend anyone, the state of industry
VLSI design tools is disappointing. Most EDA vendors focus on the ASIC
market, and rightly so, because they sell a lot more seats that way. But I
don't know any microprocessor designer who's happy with their set of tools.
Schematic entry, for instance, is painful. In industry, cycle simulation is
still in its infancy. Better tools would enable better schedules.
This gets back to managing the design. We have to come up with designs that
don't require 250 engineers and billions of cycles for verification. We
really have to cut back and simplify to deliver competitive performance
without all the complexity and risk that's inherent in unbounded designs.
Schedules and team sizes. Cyrix prides itself on accomplishing a lot with
very few people. We have been pretty successful maintaining two- to
three-year schedules with small, focused teams. There's this desire (I call
it a fantasy) that we can design a microprocessor faster than before. But in
my experience, new designs always take around two and a half to three years.
It takes time to staff a new team, to get them working with each other, and
to adapt a new set of tools that are always deficient in one way or another.
Yet the real problem is that the useful lifetime of microprocessors is
measured in maybe a year and a half, so you've got to work on several
different designs at once. This puts tremendous pressure on the engineers.
To get to design teams of less than a few hundred engineers, we really have
to cut back and simplify the cores. When 200 people are working on a design,
you spend a lot of time coordinating. There are also very few people who
understand how the whole design works. So when you find a bug, a half dozen
people have to sit in a room to devise a fix and also ensure they've covered
all the corner cases. With a simple design, it's much easier for one or maybe
two engineers to work out a solution and prototype it quickly.
Fundamentally, we have to create less complexity so we spend less time
debugging the complexity. Then we can begin the transistor-level design
sooner. Since we won't spend so much time developing the behavioral model,
we'll be able to spend more time with the transistor-level design. This would
allow us to at least maintain these three-year schedules and get to high
clock frequencies without having 200 people on a design team.
Greg Grohoski is a project manager at Cyrix and is currently working on the
Jalapeno core. He previously worked at IBM Research as the lead architect for
the Power 1 (IBM's first superscalar RISC processor) and also worked on the
microarchitecture for Power 2. Grohoski received an MS in electrical
engineering from the University of Illinois at Urbana-Champaign.
A Way of Life
Brad Burgess
Motorola
With respect to roadblocks, many challenges lie ahead. There are several prob-
lems related to memory, the foremost of
which is latency. Memory is a problem
because the faster the processor runs,
the more difficult it is to hide latency to memory.
Caches are fairly effective at addressing memory
latency, depending on the application, though I do
expect software writers and compilers to become bet-
ter at tuning applications for cacheability.
In terms of processor design, one of the biggest
headaches were seeing is wiring delay. At increasingly
higher clock frequencies with sub-quarter-micron
designs, metal is becoming a significant portion of the
delay time in circuits. In the past, designers could focus
primarily on gate delays and lay transistors down in a
fairly straightforward fashion. In the future, the archi-
tecture, floor planning, and circuit design will be much
more tightly interwoven.
How do we sustain continued performance improve-
ments? As a hardware guy I hate to say it, but we need
to improve the software. For large gains, we must find
better ways of expressing and exposing parallelism to
the processor. In recent years, weve seen aggressive
superscalar and very long instruction word compilers,
parallelizing Fortran compilers, and languages support-
ing explicit multithreading. Those are the beginnings of
what programmers and compilers will need to do to
expose more parallelism and improve performance.
With respect to roadblocks and physics, there are
physical limits to how far one can shrink a CMOS tran-
sistor. In a Hot Chips presentation, for example, James
Meindl of Georgia Tech pointed out that if we keep
shrinking transistors as in the past, we will run into the
limits of physics within the next 25 years. How do you
shrink a transistor smaller than a single atom? You
don't; you find other ways to improve performance.
What applications will drive microprocessor designs?
There are different drivers for different markets.
Unfortunately, many people tend to think of a proces-
sor only in terms of their desktop PC; I view processors
much more broadly. Although multimedia will likely
push desktop PCs, there are other applications such as
transaction processing, networking, signal processing,
and real-time control that will significantly push high-
end designs. Often each application space brings with it a unique set of
requirements. A good general-purpose processor may offer high SPECmarks, but
it doesn't necessarily make a good transaction processor or real-time control
processor. In addition, there are many market segments that require
performance/power/cost trade-offs. Our PowerPC 750 processor, for example,
has very high-end performance at significantly lower power dissipation than
the competition. This gives us a significant advantage in the laptop PC and
certain embedded systems market segments.
Will standards evolve to support modular systems
on a chip? Again, it depends on the market. In the
embedded space we've been doing modular systems-
on-a-chip for several years. A number of customers
have designed modules based on our internal bus stan-
dards. Whether or not we line up with other vendors
is a business decision; the technology is there.
What about the size of design teams? There are sev-
eral philosophies about how big design teams should
be. I believe that smaller teams have an advantage in
that communication is much more open, design details
can be worked out, and trade-offs can be made quickly.
At the same time, you must have enough people to get
the job accomplished. In large teams, you often have
to build a bureaucracy just to manage communication
and decision-making. The bureaucracy not only limits
communication, but the decision-making is often too
far removed from the problem. At Motorola, we've
avoided the 300 to 400 person teams.
What about better tools? In our work, we can always ask for better tools, and
we push forward the state of the art. Our philosophy is to look at what's on
the market and use the best tools we can. At Somerset, we use a lot of our
own tools because they are better than anything we can buy. We also use
vendor tools, for example, layout and schematic capture tools. If you can
find a comparable vendor tool, then by all means, use it. Support and service
from outside keeps you from weighing down your own tools organization.
Brad Burgess is chief architect for the PowerPC 750/740 microprocessors (G3)
and next-generation PowerPC microprocessor (G4) projects at Motorola and
IBM's Somerset Microprocessor Design Center in Austin, Texas. He holds a BSEE
and an MSEE from Texas A&M University.
Challenges, Not Roadblocks
Earl Killian
Mips Technologies
In the short term (the next five years) there are no major roadblocks. In
microprocessor design, we're always looking out about five years, and so we
understand what the solutions are in that time frame. Just beyond that time
frame several
issues arise. Many have suggested, and continue to
suggest, major roadblocks for electronics technology.
Yet technologists have successfully gone around those
roadblocks in the past, and there is no reason to think
this wont continue for the next decade or two.
Naysayers do serve a useful role: They point out the
places where innovation is required! In that context,
there are some more detailed issues that the industry
will need to address:
• Optical lithography will need to be replaced with deep-UV, X-ray, or
electron beam techniques to permit feature sizes below 0.15 to 0.12 micron.
This will require major investments by semiconductor and stepper companies,
which will result in tremendous benefits to companies that pick the correct
technology. This will benefit fabless companies, as opposed to those with
fabs, because they will have access to whichever fabs best make the
transition.
• Below 0.1 micron, quantum effects begin to play a larger role in the
operation of transistors. There are both potential pitfalls and opportunities
in this area. We may also still have a few generations to go before this
becomes a problem; IBM has built 0.08 micron devices in its labs that operate
successfully.
• Interconnections on chip will be of increasing concern, but not in terms of
local interconnects; global interconnect will require increasing care. Copper
metalization is worth a 20- to 30-percent reduction in wire delay, which is
helpful, but it is not a fundamental change (it only delays the problems by a
generation or so). A trend I see is that chip designers will need to begin
thinking about a chip more like a PC board. You'll have components on this
large expanse of silicon, and you have to think of some as far away and
others as much more local.
• Power consumption will be an increasing problem. Power has increased from
generation to generation because transistor count and frequency scaling have
exceeded the rate of voltage scaling. In addition, it is uncertain whether
voltage scaling can continue to very low voltages.
Obstacles and driving applications. There are no
major obstacles that we can see in the next five years.
Beyond that it gets a little murkier, but then it has
always been a little murky that far out.
The widening gap between processors and mem-
ory is an issue, but has been increasing for decades.
We're fairly adept now at dealing with it. In general,
we look for ways to convert bandwidth into latency.
Silicon Graphics' approach to microprocessor design is to focus on particular
market segments, and not to try to design an "all things to all people" sort
of processor. Even in that context, our target applications are fairly broad,
but are essentially all big-data problems: manipulating and organizing large
sets of data. We believe the action in microprocessor design will
increasingly be how well you handle big data and how well you do
multiprocessing. That's where we are focusing.
Video and audio are currently minor challenges for
microprocessors and will become less so over time
because there is a limit to what the human visual and
audio systems can perceive. Thus, these areas are inher-
ently self-limiting and won't be long-term issues. They
illustrate a general phenomenon. Microprocessors
begin by broaching a performance level that makes
some fixed application barely possible, and designers
struggle with that application for a while. A genera-
tion or two later, microprocessors can handle that
application easily, and microprocessor designers have
moved on to the next challenging application area.
Applications that can scale indefinitely are more
interesting; 3D graphics may be in this category, at
least for a while. The issue here is not the graphics
itself, but the fact that 3D performance is growing at
a faster rate than general-purpose processor perfor-
mance. This will ultimately lead to a situation in
which a single processor will be unable to feed the
graphics pipelines that we are able to build. The
advantage here will go to companies that can apply
multiprocessing to feeding 3D graphics hardware. For
a similar reason, attempts to put part of the 3D graph-
ics pipeline into the processor are misguided.
Image, handwriting, and speech recognition will be
other major challenges. It is arguable whether these
are fixed hurdles, like video and audio processing, or
whether they can scale indefinitely.
Other areas that show the ability to challenge
processor designers for some time are network and
disk technology. Both had been increasing at fairly
slow rates, but have recently picked up their pace.
Applications that involve massive data storage and
communication will continue to tax processors for the
foreseeable future. Increasingly, this involves data min-
ing and information retrieval. We've got almost too
much information, even today, so the question
becomes how well can machines organize it for us?
Can machines help turn information into knowledge?
To help with the performance challenges, we should
not move processor functionality back into the com-
piler. Rather, the compiler could better help with the
real solution to the big-data challenge: multiprocessing.
We should expect significant improvements in the area
of automatic parallelization, for example.
Slow buses. One thing that is not a performance bottleneck for SGI is bus
speeds. For years, slow bus speeds have been a concern of low-end chip
vendors; they've responded to the challenges by standing still (PCs are stuck
at 60 to 66 MHz). On the other hand, SGI is way out in front of this one;
we've been shipping products with 400-MHz chip-to-chip communication by not
using buses. In their place, we generally use point-to-point communication
between chips, usually unidirectional. We can actually communicate at 400 MHz
over five meters of cable, and can do significantly better on a PC board over
shorter distances.
Testing, validation, and design. Validation and test-
ing is a big area with many subcomponents. There are
challenges in design validation, but no looming bot-
tleneck. Chip debug in the lab is an increasing concern,
because the inability to probe makes debugging diffi-
cult. The answer here is more design validation with
increasing emphasis on circuit validation via tools to
raise it to the level of logic validation. For example,
SGI has been writing its own tools for circuit valida-
tion to fill a gap left by the CAD industry.
Team sizes and budgets are slowly increasing over time. It is roughly linear,
but certainly not exponential. It would be disastrous if it were tied to
transistor count (that is, exponential), but fortunately we're able to design
things once and replicate to prevent that.
Systems on a chip. We'll see modular systems on a
chip, but probably not as the result of standards, and
not from major companies working together, which
they almost never do.
You will see more of the system moving onto the chip at SGI. One way we can
increase the value of Mips microprocessors in the future is to take SGI's
developed systems technology, improve it, and integrate it into our
processors.
Earl Killian is director of architecture for Mips Technologies. Previously,
he cofounded Quantum Effect Design and participated in the development of
their R4600, R4650, and R4700 RISC processors. Prior to that he was involved
in the Mips R3000 and the first 64-bit microprocessor, the R4000. He has a
BSEE from MIT.
Maintaining a Leading Position
Robert Colwell
Intel Corp.
Assuming that nothing basic in manufacturing or physics breaks in the next
couple of years, there is no reason the historical trend in microprocessor
performance will slow. For the short term, industry has modeled and
understood the physics that will affect processor design and manufacture.
Changes to accommodate those effects are in the production planning phase.
One force that could, however, interrupt the
microprocessor performance trend is if con-
sumers begin to prefer cheaper computers
instead of faster ones. We can make them
cheaper or faster, but its extremely difficult to
do both at once.
Increasing costs associated with validation
and testing are a problem, but not the real
threat. The real threat is shipping 30 million
CPUs only to discover they're imperfect in
some way that causes a recall. Such an expense
could easily bankrupt a company. The conver-
gence of three factors may make it more likely
for a problem to rise to recall severity:
• The number of transistors per CPU is increasing rapidly per Moore's law.
• Designs are becoming more complicated to drive microarchitecture
performance higher.
• The technological sophistication of consumers purchasing desktop systems is
decreasing.
There is every reason to suspect that this threat will
grow, potentially to levels that require changes in the
way companies conduct business.
Thus, chip manufacturers are pumping ever higher
levels of validation into design efforts. More than 20
percent of the Pentium Pro design effort was associated
with validation. As design teams grow 30 percent or so
per major product generation, validation teams are now
larger than the entire design team of only six years ago.
We are now much more rigorous at tracking both pre- and post-silicon
"sightings," anomalies in expected results that a human discovers. And
although
our validation tools have also improved enormously,
we will still rely on humans writing test code in assem-
bly language.
Over the last few years, random-instruction testing
has also become much more useful and prevalent.
Though consuming time and requiring huge amounts
of code, RIT sequences are notorious for testing cor-
ner cases of a machines architecture that no human
would ever think of. Both RIT and assembly language
test code are therefore the current mainstays of a chip
development effort.
Formal methods are now useful in some cases. For
example, the Pentium recall was due to a design flaw
in the chips floating-point divider, and formal meth-
ods are now mature enough that they can identify such
errors. However, as currently practiced and envi-
sioned, these techniques have serious practical limita-
tions. For instance, they can be directed only to
specific, well-defined, and contained areas of the
design because they blow up quickly if we incorporate
too much complexity. (A program that blows
up begins to consume too much memory and time and
is not likely to return useful results.) Also, like other
validation methods, they are incapable of proving that
a design is correct, a fact commonly forgotten.
Working around slow buses. Future CPUs will run
in the gigahertz frequencies, and front-side bus elec-
tronics will not support anything close. Platform solu-
tions will arise, such as bringing the L2 cache onto
the CPU die and integrating the memory controller
onto the CPU. In these approaches, memory traffic
does not traverse a slow bus.
Memory. Memory will not keep pace with CPU per-
formance in terms of latency; it is slow and, relative
to CPU speed, becoming slower. Industry will con-
tinue to find ways to ameliorate (but not solve) this
problem, including larger caches, more elaborate
cache hierarchies, prefetch hints, compiler tricks,
streaming buffers, intelligent memory controllers, and
bank interleaving. The impending introduction of
Rambus DRAMs into volume PCs will also help.
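As one concrete illustration of the prefetch-hint idea (a sketch only:
__builtin_prefetch is GCC's hint, other compilers expose similar intrinsics,
and the lookahead distance here is an arbitrary assumption that would be
tuned per machine), a loop can request data several iterations ahead of its
use so that memory latency overlaps useful work:

    #include <stddef.h>

    #define AHEAD 16   /* assumed prefetch distance, in elements */

    double sum_with_prefetch(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + AHEAD < n)
                __builtin_prefetch(&a[i + AHEAD], 0, 1); /* read hint */
            sum += a[i];
        }
        return sum;
    }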
Communications focus. Within the next five years,
computers will become primarily communication
devices, as opposed to performing number-crunching
tasks. We hadn't noticed this emphasis until very
recently because machines were not fast enough to
execute the required workloads at human real-time
speeds, nor was the I/O available to feed those
machines. In playing games, for instance, most peo-
ple (with Garry Kasparov a possible exception) report
that a human is by far a more interesting and chal-
lenging opponent than a machine. Likewise, in vir-
tual reality worlds, new explorers commonly wander
about the space until a seemingly intelligent agent
(such as a fellow human player) appears. Their atten-
tion transfers to the new agent as if by magic. I think
this same human instinct is why people are using com-
puters more as communication enhancement devices.
Beyond these multimedia-related communication
functions, users will require better dependability and
security. Antilock brakes that "mostly worked" or "hardly ever crashed"
wouldn't be acceptable, but that describes general-purpose computing today.
This
situation arises because we do not design hardware in
conjunction with software, application developers
don't design software with the OS, and companies
place less emphasis on the overall hardware-software
system reliability than in getting to market quickly.
This does not indict the industry; these are trends that the world has
rewarded. Yet ultimately the lack of system dependability could well become
industry's concern because it will become society's burden.
Computer viruses are a plague, and system design has
thus far downplayed hacker attacks, rather than
guarded against them. There is a place for more secure
computers and better encryption and decryption.
Trends in the design process. More transistors and higher performance beget
larger design teams. Schedules are not getting any longer; in fact they're
constantly shorter. This means we do much more
work presilicon, which in some cases is quicker over-
all but less efficient in terms of designer time than
performing the same task post-silicon.
Complexity is high and increasing because the
quest for higher performance leads in that direction.
More complexity in less time produces a much higher
risk of functional bugs. The need for more thorough
validation in those shortened pre- and post-silicon
time periods increases dramatically.
The software base for Intel architecture processors
is mushrooming, and every new CPU must be com-
patible with the entire software base. The more soft-
ware a new design must be compatible with, the
harder the job of ensuring that it is.
No one of these issues seems to be a fundamental
threat to the business, but, taken together, they defi-
nitely are.
System on a chip. It would not be surprising to see CPUs with integrated
versions of emerging standards such as Intel's Accelerated Graphics Port
(AGP) on-die. For companies trying to break into the business quickly, buying
on-die units (phase-locked loops, floating-point units, or multiport register
files) may make
sense. For Intel, with our own fabs and process technology, a huge premium on
tight integration and small die sizes, and the expectation of very large unit
volume shipments, it makes less sense to contract out parts of the design.
Software functionality. In cost-sensitive market segments, as much
functionality as is possible will shift to software because that saves money.
Sound-card functions, modems, network controllers, and 3D operations may
appear in software very soon. However, the trick is not to emulate a function
in software, but to emulate several functions simultaneously, especially
considering that many are real-time. A machine cannot fall behind the
combined workload of all the real-time applications taken together. Today's
most popular operating systems cannot support such functions in software, so
I expect an industry learning experience to accompany this shift.

Robert Colwell is an Intel Fellow and director of 32-bit microprocessor
development at Intel. He also worked on the Multiflow Trace computer. Colwell
has a BS in electrical engineering from the University of Pittsburgh, and an
MS and a PhD in electrical engineering from Carnegie Mellon University. He is
a member of the IEEE Computer Society.

Managing Speed
Paul I. Rubinfeld
Digital Semiconductor

Five issues will be important to microprocessor design over the next five
years; I list them in no particular order. One is power; as processors become
faster, managing the power dissipation becomes a significant issue. More
transistors switching in parallel on one die dissipate more heat, which
presents a whole bunch of electrical and thermal-management issues.
Second, at the same time the chips are running faster, they're becoming more
complex. So what I would broadly call on-chip signal integrity (maintaining
the integrity of the signal as it moves from one end of the chip to the
other) becomes more difficult. Some architects call this the interconnect
nightmare. You essentially have lots of tightly packed wires that communicate
parasitically to each other-a difficult problem that cre-
ates all types of circuit issues.
A third design issue is the idea of soft-error upsets:
Cosmic rays or gamma radiation causes storage nodes
to lose charge, which causes failures. More advanced
work says this will imply future designs that are fault tol-
erant. We know how to provide fault tolerance for memory with
error-correcting code, but the way circuits are shrinking, any storage node
could become susceptible to radiation. So beyond just memory structures, we
may have to deal with fault tolerance throughout the chip.
Recent research concentrated on soft-error rates due
to alpha particles. Yet it now appears that gamma radi-
ation is an order of magnitude more significant. In the
future, as geometries get even smaller, as stored charge gets smaller,
designers may have to assume that soft errors happen fairly frequently. We'll
have to deal with that at the logic level.
A fourth issue of concern is on-chip process
variation. With smaller geometries and larger
chips, we're seeing process variation on a single
die, which adds to signal skews between one
part of the chip and another. This adds to the
uncertainty, which basically limits the ability to
design high-speed circuits. So that's yet another
set of analyses to do, another set of effects to
design for.
The fifth issue concerns the human aspects of these projects: managing large
teams isn't easy. Clearly, large engineering teams are not unusual outside
chip design, but inside it's somewhat new.
Limits of parallelism. Perhaps one obstacle I see is that there may be
fundamental limits to parallelism. How much parallelism can we extract from
software? In part, the way software is written today and the language
structures it uses are factors that limit the amount of extractable
parallelism. Of course, it takes a while for software to migrate to new
programming languages and models; software isn't that soft anymore. But we're
already working on diminishing returns, doing twice as much development work
to extract less and less benefit. Parallelism is a rock we're squeezing
awfully hard.
Continuous profiling. Another trend we'll see for putting performance
improvements in software is this idea of continuous application profiling.
Continuously executing an application and profiling the execution gives us
information to use in post-optimizations to the order of the code. That's a
technique Digital's using in several of our tools, and it pays big dividends.
Validation, buses, and memory. As for the validation problem, we seem to have
that under control. We were able to functionally verify the Alpha 21264
efficiently, booting all the operating systems and running the application
software on the first pass.
Bus speeds won't necessarily be slow; they will probably scale. Bandwidth
will also improve dramatically because area-array flip-chip packaging will
let us implement much wider buses. There are also tricks (some have been
used, some haven't) to hide latency. Cache hierarchy and cache efficiency
improvements will also mitigate the relative effect of slow buses. In the
future, multimegahertz buses will become common.
A similar set of arguments could be made for memory. For the first time in
many years, new technolo-
gies-Rambus, SyncLink, and DDR RAM-are
improving memory speeds. This, coupled with
improvements in cache hierarchy and latency hiding,
will mitigate the effects of the differences in speed
between the internal processor and external memory.
Design drivers. At least three applications will prob-
ably drive the computer industry. The first is a generic
category I call modeling reality. Virtual reality is an
example, as are games. The computational power
required by these new games is quite remarkable; they
and the whole entertainment field will drive a lot of
computational needs.
Similarly, this whole idea of natural interfaces with
computers-speech recognition, and so on-can con-
sume lots of computational power and make com-
puters more usable.
We also have to acknowledge that data sets and databases are growing large.
There's just far more data, and we want to manipulate it. So transaction
processing will drive some development.
Design process issues. When it comes to schedules, I see two types of
products: new designs and leveraged designs. I'm not sure there's been a
significant improvement in the design time around major new redesigns of a
machine; more of the products seem to be leveraged designs, such as those
that move to a new process generation. Good engineering says you ought to
leverage your design for everything you can, but the full-blown, soup-to-nuts
CPU still takes time to design. I'm not sure we've improved in the last five
years.
One potential threat to the business is important to mention: There's a
shortage of qualified design engineers in the custom design space. Several
companies appear to be fighting tooth and nail for those qualified designers.
Few colleges really teach people how to design high-speed microprocessors.
Squeezing the last 50 percent out of the silicon requires circuit expertise
and attention to physical details. You don't find too many people who can
deal with on-chip process variation and on-chip signal integrity. There are
also reliability issues: how do we design fast circuits that are immune to
various failure mechanisms?
Another problem is that most microprocessor companies really have to design
their own analysis tools and essentially their own CAD packages to
accommodate leading-edge problems. Maybe that's a good indicator of whether
you're really designing a high-speed microprocessor: If you use standard
tools, you're not.
It's important to realize that none of these are fundamental limits to
processor performance. As we move from generation to generation, the problems
just get harder, and that makes our business fun.
Paul I. Rubinfeld is a senior engineering manager for microprocessor
development at Digital Semiconductor, where he manages development of the
Alpha 21164 microprocessors as well as chipset development for the entire
Alpha family. He has worked on the VAX and PDP-11 CPU development projects.
Rubinfeld received a BS and an MS in electrical engineering from Carnegie
Mellon University.
Introduction to Predicated Execution
Wen-mei Hwu
University of Illinois, Urbana-Champaign
The story of Merced, Intel's first processor based on its next-generation
64-bit architecture, will continue to unfold in 1998. Intel expects this
product of its collaboration with Hewlett-Packard to reach volume production
in 1999. To date, however, the two companies have released few details about
Intel Architecture 64 (IA-64). One significant change they did admit to at
the October 1997 Microprocessor Forum was the switch to full predicated
execution, a technique that no other commercial general-purpose processor
employs.
Computer wanted to give its readers advance notice of this promising
technique. We invited Wen-mei Hwu, a prominent researcher in this area, to
explain predication, a topic you may be hearing more about in 1998.
-Janet Wilson
Predicated execution is a mechanism that supports the conditional execution
of individual operations. Compared to a conventional instruction set, an
operation in a predicated-execution architecture has an additional input
operand, a predicate, that can assume a value of true or false. During
runtime, a predicated-execution processor fetches operations regardless of
their predicate value. The processor executes operations with true predicates
normally; it nullifies operations with false predicates and prevents them
from modifying the processor state.
Using predication inherently changes the representation of a program's
control flow. A conventional instruction set requires all control flow to be
explicitly represented in the form of branches, the only mechanism available
to conditionally execute operations. An instruction set with predicated
execution, however, can support conditional execution via either conventional
branches or predicated operations.
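As a rough C-level illustration of the difference (an assumption-laden
sketch, not IA-64 code: the ?: expression merely stands in for the predicated
or conditional-move operations a compiler could emit), compare a branching
and a branch-free version of the same computation:

    /* Branching form: the compiler emits a conditional branch, and a
       misprediction flushes the pipeline. */
    int abs_branch(int x)
    {
        if (x < 0)
            x = -x;
        return x;
    }

    /* Branch-free form: both candidate values exist, and the predicate
       (x < 0) selects one. A machine with predication nullifies the
       operation whose predicate is false instead of branching around it. */
    int abs_predicated(int x)
    {
        int neg = -x;              /* executed under predicate "x < 0" */
        return (x < 0) ? neg : x;  /* select: no control-flow branch   */
    }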
INSTRUCTION SET ENHANCEMENTS
To effectively support predicated execution, an instruction set should
provide
• the means to specify a predicate operand to individual operations,
• a predicate register file,
• a nullification mechanism, and
• a set of predicate-defining operations.
Adding a predicate operand to conventional instruction formats requires extra
bits in the instruction encoding. Instruction sets that must stay within a
32-bit encoding are typically limited to adding predicate operands to only
operations that take fewer than 32 bits to encode, such as MOV operations.
This is referred to as partial predication. Adding a predicate operand to an
operation also increases the number of operand values the processor must read
and write from architectural registers during runtime. Using integer
registers to hold predicates would require more registers as well as more
ports to the register file. An efficient solution is to specify a separate
predicate register file. A study by Scott Mahlke and colleagues showed that
partial and full predication can both result in significant performance
improvement to conventional instruction sets.2
Special predicate-defining operations are used to compute predicates. Such
operations replace branch operations as the compiler generates code. Each
branch thus replaced results in two predicates, one for operations along the
branch's taken path and another for those on the fall-through path. Thus,
each predicate-defining operation should simultaneously define two
predicates. Figure 1 illustrates this. In traditional code with branches,
shown in Figure 1a, two branch operations jointly choose one of three
possible execution paths. Figure 1b shows the predicates assigned to each
branch. In Figure 1c, the code for predicated execution uses two
predicate-defining operations. The first operation defines two predicates
according to a set of U-type rules: Predicate p1 is true if the original
branch condition is false; predicate p2 assumes the complementary value of
p1. In the second predicate-defining operation, the value of p1 indicates
whether the program should reach the corresponding branch in the conventional
code. If not, then neither of the two possible paths (controlled by p3 and
p4) should be activated. Thus, the second predicate instruction may define
both p3 and p4 as false values. Figure 1d summarizes the U-type rules for
determining such destination predicates (pd) of a predicate-defining
operation.
These are simple examples; Hewlett-Packard's PlayDoh architecture
specification discusses predicate-defining operations based on more advanced
rules.3
Advanced rules allow the compiler to generate pred-
icated code for more sophisticated program control
structures. They also allow the compiler to perform
implicit control flow optimizations on predicated
code.
COMPILER SUPPORT
Predication is most commonly utilized in a compiler
by employing if-conversion. This technique converts
conditional branches in an acyclic region of the control
flow into predicate-defining operations.
[Figure 1. Example of branching code: (a) C source code, (b) assembly code
control flow diagram, (c) predicated code, and (d) U-type rules for
predicate-defining operations.]
With if-
conversion, a straight-line sequence of predicated code
can replace complex nets of branching code. As Figure
1 shows, the execution of the code after if-conversion
does not involve any branch. This means the compiler
can eliminate problematic branches and avoid their
associated overhead. It also facilitates increased
instruction-level parallelism by allowing the compiler
to overlap the execution of separate control flow
paths. This allows the processor to simultaneously
execute multiple paths from a single thread of control.
An important benefit of predication not illus-
trated in Figure 1 is that it allows overlapping and
independent control flow constructs without expand-
ing the code.
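A small, compiler-agnostic sketch of if-conversion (the predicated form is
expressed here with explicit predicate variables and select-style expressions
in C; a compiler for a fully predicated machine would emit real predicated
operations instead) shows how a two-branch diamond collapses into
straight-line code:

    /* Original acyclic region containing two conditional branches. */
    int clamp_branch(int x, int lo, int hi)
    {
        if (x < lo)
            x = lo;
        else if (x > hi)
            x = hi;
        return x;
    }

    /* After if-conversion: one straight-line sequence. Each assignment is
       guarded by a predicate; a predicated machine nullifies guarded
       operations whose predicates are false rather than branching. */
    int clamp_if_converted(int x, int lo, int hi)
    {
        int p1 = (x < lo);            /* predicate for the first taken path */
        int p2 = !p1 && (x > hi);     /* predicate for the second branch    */
        x = p1 ? lo : x;              /* executed under p1                  */
        x = p2 ? hi : x;              /* executed under p2                  */
        return x;
    }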
Providing compiler support for predicated execu-
tion is challenging. Current optimizing compilers rely
on control flow representation as the foundation of
analysis and optimization. Because predicated code
changes the control flow representation, effectively
handling it requires an extensive modification of the
compiler infrastructure, particularly in the areas of
classical and ILP optimizations, code scheduling, and
register allocation. An effective compiler must balance the control flow and
the use of predication.4 If resources become oversubscribed or dependence
heights (the lengths of the chains of dependent operations) become unbalanced
among paths, predicated execution can degrade performance.
Predicated execution started as a software approach
to avoiding conditional branches in early super-
computers. Vector architectures such as the Cray 1
and array-processing architectures such as Illiac IV
adopted predication in the form of mask registers to
allow effective vectorization of loops with conditional
branches. During the era of mini-supercomputers, the
Cydrome Cydra 5 became the first machine to sup-
port generalized predication. Parallel to the Cydra 5,
the Multiflow Trace machine adopted partial predi-
cation by introducing a single instruction with a pred-
icate input, a select instruction. Contemporary proces-
sors, such as the DEC Alpha and the Sparc V9, have
adopted the partial-predication approach so they can
maintain a 32-bit instruction encoding.
In the future, integrating control and data speculation with predicated
execution will enable advanced compiler techniques to increase the
performance of future processors.5 With the adoption of advanced full
predication support in IA-64 and perhaps many other architectures, predicated
execution may become one of the most significant advances in the history of
computer architecture and compiler design.
References
1. B.R. Rau et al., "The Cydra 5 Departmental Supercomputer," Computer, Jan.
1989, pp. 12-35.
2. S.A. Mahlke et al., "A Comparison of Full and Partial Predicated Execution
Support for ILP Processors," Proc. Int'l Symp. Computer Architecture, ACM
Press, New York, 1995, pp. 138-149.
3. V. Kathail, M.S. Schlansker, and B.R. Rau, HPL PlayDoh Architecture
Specification: Version 1.0, Tech. Report HPL-93-80, Hewlett-Packard Labs,
Palo Alto, Calif., 1994.
4. D.I. August, W.W. Hwu, and S.A. Mahlke, "A Framework for Balancing Control
Flow and Predication," Proc. Micro-30, IEEE CS Press, Los Alamitos, Calif.,
1997, pp. 92-103.
5. W.W. Hwu et al., "Compiler Technology for Future Microprocessors," Proc.
IEEE, IEEE Press, New York, 1995.
Wen-mei Hwu is a professor in the Department of
Electrical and Computer Engineering at the University
of Illinois, Urbana-Champaign. Contact him at
hwu@crhc.uiuc.edu.