RELIABILITY: FALLACY OR REALITY?


AS CHIP ARCHITECTS AND MANUFACTURERS PLUMB EVER-SMALLER PROCESS TECHNOLOGIES, NEW SPECIES OF FAULTS ARE COMPROMISING DEVICE RELIABILITY. FOLLOWING AN INTRODUCTION BY ANTONIO GONZÁLEZ, SCOTT MAHLKE AND SHUBU MUKHERJEE DEBATE WHETHER RELIABILITY IS A LEGITIMATE CONCERN FOR THE MICROARCHITECT. TOPICS INCLUDE THE COSTS OF ADDING RELIABILITY VERSUS THOSE OF IGNORING IT, HOW TO MEASURE IT, TECHNIQUES FOR IMPROVING IT, AND WHETHER CONSUMERS REALLY WANT IT.

Antonio González, Intel
Scott Mahlke, University of Michigan
Shubu Mukherjee, Intel
Resit Sendag, University of Rhode Island
Derek Chiou, University of Texas at Austin
Joshua J. Yi, Freescale Semiconductor
Moderator's introduction: Antonio González
Technology projections suggest that Moore's law will continue to be effective for at least the next 10 years. Basically, as Figure 1 shows, in each new generation devices will continue to get smaller, become faster, and consume less energy. However, the new technology also brings along some new cotravelers. Among them are variations, which manifest in multiple ways. First, there are variations caused by the characteristics of the materials and the way chips are manufactured; these are called process variations. There are multiple types of process variations: spatial and temporal, within die and between dies. Random dopant fluctuations are one type of process variation. Second, there are voltage variations, such as voltage droops. Third, there are variations caused by temperature. Temperature affects many key parameters, such as delay and energy consumption. Finally, there are variations due to inputs. A given functional unit behaves differently—in terms of delay, energy, and other parameters—depending on the data input sets.

Figure 1. Technology scaling trends: more, faster, less-energy transistors.

Faults are another group of cotravelers that accompany new technology. Faults have multiple potential sources. One of these is radiation particles; it is expected that future devices will be more vulnerable to particle strikes. Another source of faults is wear-out effects, such as electromigration. Finally, faults are also caused by variations. When variations are high, we might want to target designs to the common case rather than the worst case. When the worst case occurs, the system would need to take some corrective action, to continue operating correctly.

Basically, the topic of the panel is these faults. We typically classify faults into three main categories:

• Transient faults appear for a very short period of time and then disappear by themselves. Particle strikes are the most common type of transient fault.
• Intermittent faults appear and disappear by themselves, but the duration can be very long—that is, undetermined. Voltage droops are an example of this type of fault.
• Permanent faults remain in the system until a corrective action is taken. Electromigration is an example of this type of fault.

Some of these faults have a changing probability of occurrence over the lifetime of a device. Failure rates during device lifetimes exhibit a bathtub curve behavior. At the beginning, during the period called infant mortality (1 to 20 weeks), the probability of a fault occurring is relatively high. Then, during the normal lifetime, the probability is far lower. Finally, during the wear-out period, the probability starts to increase again.
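To make the bathtub shape concrete, here is a minimal Python sketch that superimposes a decaying infant-mortality term, a constant background rate, and a rising Weibull-style wear-out term. Every parameter in it is an illustrative assumption rather than data for any real device.

    # Illustrative bathtub-curve hazard model; all parameters are made up.
    import math

    def bathtub_hazard(t_years):
        infant = 2.0 * math.exp(-t_years / 0.25)      # early-life failures, fading within weeks
        background = 0.02                             # constant rate during the normal lifetime
        wearout = (5 / 12) * (t_years / 12) ** 4      # Weibull-style rising wear-out hazard
        return infant + background + wearout          # hazard rate, failures per year (arbitrary scale)

    for t in (0.1, 1, 5, 10, 15):
        print(f"{t:5.1f} years: hazard = {bathtub_hazard(t):.3f}")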
Many questions remain open in reliability research. First, what will be the magnitude of these faults, and what opportunities will arise from exploiting these variations? What will the impact of particle strikes be in the future? What is the degree of wear-out in the typical lifetime of a processor? Will reliability be critical to keep high yields? How much does the processor contribute to the total faults in a system? Is it really an important part of the problem, or can architects just ignore it and take care of the other parts?

Is the microarchitecture the right level at which to address these issues? Are system-, software-, or circuit-level solutions preferable? Which types of solution are the most adequate, feasible, and cost-effective?

Is ignoring reliability an option? Do we need schemes and methods just to anticipate and detect faults? Or, do we need mechanisms to detect and correct these faults? For instance, some parts that fail (or are likely to fail) might be transparently corrected, just as when you go to a mechanic to replace weak or failing parts of your car. What will be the cost of such solutions? Will all users be willing to pay for the cost of reliability, or would only certain classes of users (for example, users of large servers) be willing to pay? Does reliability depend on the application? Are certain applications more sensitive? Is reliability going to be a mainstream architecture consideration, or is it going to be limited to a niche of systems and applications? All of these remain to be answered.

Until recently, reliability has been addressed in the manufacturing process (through burn-in and testing) and through circuit techniques, whereas microarchitecture techniques have focused primarily on mission-critical systems. However, over the past 5 to 10 years, reliability has moved more into the mainstream of computer architecture research. On the one hand, transient and permanent faults due to CMOS scaling are a looming problem that must be solved. In a recent keynote address, Shekhar Borkar summed up the emerging design space as follows: "Future designs will consist of 100 billion transistors, 20 billion of which are unusable due to manufacturing defects; 10 billion will fail over time due to wear-out, and regular intermittent errors will be observed."1 This vision clearly suggests that fault tolerance must become a first-class design feature.

On the other hand, some people in the computer architecture community believe that reliability will provide little added value for most of the computer systems that will be sold in the near future. They claim that researchers have artificially enhanced the magnitude of the problems to increase the perceived value of their work. In their opinion, unreliable operation has been accepted by consumers as commonplace, and significant increases in hardware failure rates will have little effect on the end-user experience. From this perspective, reliability is simply a tax that the doomsayers want to levy on your computer system.

The goal of this panel is to debate the relevance of reliability research for computer architects. Highlights of the discussion follow the panelists' position statements.

About this article
Derek Chiou, Resit Sendag, and Josh Yi conceived and organized the 2007 CARD Workshop and transcribed, organized, and edited this article based on the panel discussion. Video and audio of this and the other panels can be found at http://www.ele.uri.edu/CARD/.

The need for reliability is a fallacy: Scott Mahlke
I believe there is a need for highly reliable microprocessors in mission-critical systems, such as airplanes and the Space Shuttle. In these systems, the cost of the computer systems is not a dominant factor; the more important reality is that people's lives are at stake. However, for the mainstream computer systems used in consumer and business electronic devices, the need for reliability is a fallacy. Starting with the most obvious and working toward the least obvious, the following are the top five reasons why computer architecture research in reliability is a fallacy.

Reason 1: It's the software, stupid!
The unreliability of software has long dominated that of hardware. Figure 2 shows the failure rates for several system components. The failures per billion hours of operation (also called failures in time, or FITs) in Microsoft Windows are an order of magnitude higher than corresponding values for all hardware components. Bill Gates has stated that the average Windows machine will fail twice a month. In fact, when operating systems start off, they fail very frequently. Mature operating systems can have a mean time to failure (MTTF) measured in months, whereas a newer operating system might crash every few days.

Figure 2. Failures in billions of hours of operation.2-5

This is not intended as a bash of Microsoft or other software companies. Software is inherently much more complex than hardware, and software verification is an open research question. The bottom line is that to improve the reliability of current systems for the user, the focus should be on the software, not the hardware.
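Since failure rates quoted in FITs can be hard to relate to everyday timescales, the following sketch shows the conversion between FIT and MTTF; both sample rates in it are made-up illustrations, not values read off Figure 2.

    # FIT (failures in time) = expected failures per 10^9 hours of operation.
    # With a constant failure rate, MTTF is simply the reciprocal: 10^9 / FIT hours.
    HOURS_PER_YEAR = 24 * 365

    def fit_to_mttf_years(fit):
        return 1e9 / fit / HOURS_PER_YEAR

    def mttf_hours_to_fit(mttf_hours):
        return 1e9 / mttf_hours

    print(fit_to_mttf_years(1000))      # a 1,000-FIT part has an MTTF of ~114 years
    print(mttf_hours_to_fit(2 * 168))   # failing every two weeks is roughly 3 million FIT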
Reason 2: Electronics have become disposable
One of the big issues that researchers are examining today is transistor wear-out and how to build wear-out-tolerant computer systems. But, the majority of consumers care little about the reliable operation of electronic devices, and their concerns are decreasing as these devices become more disposable. In 2006, the average lifetime of a business cell phone was nine months. The average lifetimes of a desktop and a laptop computer were about two years and one year, respectively. Most electronic devices are replaced before wear-out defects can manifest. Therefore, building devices whose hardware functions flawlessly for 20 years is simply unnecessary. Furthermore, from the economic perspective, reliability can be quite expensive in terms of chip area, power consumption, and design complexity. Thus, it's often not worth the cost.

Reason 3: A transient fault is about as likely as my winning the lottery
Data from the IBM z900 server system shows that three transient faults occurred in 198 million hours of operation, or about one fault every 2.7 million days. One of my favorite comments about this was that it was much more likely that somebody would walk by and kick the power cord than it was that an actual transient fault would occur.
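The arithmetic behind that figure is a single division; the short sketch below simply reproduces it, converting the observed rate into days and years between faults.

    # Reproducing the transient-fault rate quoted for the IBM z900 data.
    faults = 3
    hours_observed = 198e6

    hours_per_fault = hours_observed / faults      # 66 million hours per fault
    days_per_fault = hours_per_fault / 24          # ~2.75 million days
    years_per_fault = days_per_fault / 365         # ~7,500 years of continuous operation

    print(f"{days_per_fault:,.0f} days (~{years_per_fault:,.0f} years) per transient fault")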
Let's compare the rate of transient faults to some other things that we don't think about in our everyday lives. My chance of winning the lottery is about equal: 1 in 3 million. The chance of getting struck by lightning in a thunderstorm is about twice that of a transient fault: 1 in 1.4 million. How about getting murdered? The chance is about 1 in 10,000 in the United States. The chance of being involved in a fatal car crash is about 1 in 6,000, and the chance of a plane crashing is 1 in 10 million. The point is that we don't constantly worry about these things happening to us, so do we really need to be concerned about a transient fault happening? The chances are so unlikely that the best thing to do about them may be to ignore them.

Reason 4: Does anyone care?
In many situations, 100 percent reliable operation of hardware is not important or worth the extra cost. For instance, no one can tell the difference if a few pixels are incorrect in a picture displayed on their laptop. Imagine a streaming video coming through; no one would care if occasionally a pixel on a frame had the wrong color. How often do you lose a call on your cell phone? How often is a word garbled and you have to ask the person on the other end to repeat something? The majority of consumers today either do not notice or readily accept imperfect operation of electronic devices. There is also a lot of redundancy in applications such as streaming video, so maybe the answer is that software, rather than hardware, should be made more resilient.

Reason 5: This problem is better solved closer to the circuit level
Even if you accept reliability as a problem for microprocessor designers, one important question is at what level of design the problem should be solved—architectural or circuit. An example of the circuit-level approach is Razor, which introduced a monitoring latch.6 Essentially, Razor is an in situ sensor that measures the latency of a particular circuit. It enables timing speculation and the detection of transient pulses caused by energetic particle strikes. The point is that architects may not need to worry about solving these problems. Rather, techniques like Razor can handle reliability problems beneath the surface, where the area and power overhead is lower. Another important factor is that many designs can benefit from circuit-level techniques, so point solutions need not be constructed. Finally, in situ solutions naturally handle process variation.

The need for reliability is a reality: Shubu Mukherjee
Captain Jean-Luc Picard of the starship USS Enterprise once said that there are three versions of the truth: your truth, his truth, and the truth. The God-given truth is that circuits are becoming more unreliable due to increasing soft errors, cell instability, time-dependent device degradation, and extreme device variations. We have to figure out how to deal with them.

The user's truth
Users care deeply about the reliability of their systems. If they get hundreds of errors a day, they will be unhappy. But, if the number of hardware errors is in the noise relative to other errors, hardware errors may not matter as much to them. That is exactly the direction in which the entire industry is moving. The goal of hardware vendors is to maintain a low enough hardware error rate that those errors continue to be obfuscated by software crashes and bugs. However, there have always been point risks that make certain individual corruption or crashes critical—for example, a Windows 98 crash during a Bill Gates demo—even if such errors occur rarely.

The IT manager or vendor's truth
The truth, however, is very different from the perspective of an IT manager who has to deal with thousands of users. The greater the number of user complaints per day, the greater is her company's total cost of ownership for those machines. It is like the classic light bulb phenomenon: The more you have, the sooner at least one of them will fail. And that's what we see in many houses. In a house with 48 light bulbs, each with a 4-year MTTF, we replace a light bulb every month. These failures negatively impact business, because billions of dollars are involved. User-visible errors, even when few surface, have an enormous impact on the industry. They increase cost because companies start getting product returns. Companies can face product returns even for soft errors, because users sometimes demand replacement of parts. In addition, there is the issue of loss of data or availability.
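The light-bulb arithmetic generalizes directly: for independent parts with the same MTTF, the aggregate replacement rate is just the part count divided by the MTTF. A minimal sketch follows; the 10,000-machine fleet in the last line is a hypothetical illustration, not a figure cited by the panel.

    # Expected time between failures in a pool of n independent parts,
    # each with the same MTTF: pool MTTF = part MTTF / n.
    def months_between_failures(n_parts, mttf_years):
        return 12.0 * mttf_years / n_parts

    print(months_between_failures(48, 4))         # the 48-bulb house: one replacement a month
    print(months_between_failures(10_000, 30))    # hypothetical fleet: ~0.04 months, about one failure a day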
The designer's awakening
The designer's awakening is an experience similar to going through the four stages of grief. First, you have the shock: "Soft errors (SERs) are the crabgrass in the lawn of computer design." This is followed by denial: "We will do the SER work two months before tape out." Then comes anger: "Our reliability target is too ambitious." Finally, there is acceptance: "You can deny physics only so long." These are all real comments by my colleagues. The truth is, designers have accepted silicon reliability as a challenge they will have to deal with.

The designer's challenge
The industry is addressing the reliability problem with the help of the research community. We need solutions at every level. Protection comes at many levels: at the process level through improved process technology; at the materials level through shielding for alpha particles; at the circuit level through radiation-hardened cells; at the architecture level through error-correcting code (ECC), parity, hardened gates, and redundant execution; and at the software level through higher-level detection and recovery. Companies are doing a lot. They are constantly making trade-offs between the cost of protection (in terms of performance and die size) and chip reliability, without sacrificing the minimum performance, reliability, and power thresholds that they must achieve.

Industry needs universities' help with research
Industry needs help from academia, but academia has some misconceptions about reliability research. One of the misconceptions concerns mean time between failures (MTBF), which is only a rough estimate of an individual part's life. Using MTBF to predict the time to failure (TTF) of a single part is fundamentally flawed, because MTBF does not apply to a specific part. Thus, we cannot start optimizing lifetime reliability on the basis of MTBF.

Another common misconception is the notion that a system hang doesn't cause data corruption. However, if you cannot prove otherwise, you should assume it does cause data corruption, because your data might already have been written to the disk before the system hangs. Finally, one other misconception is that adding protection without correction reduces the overall error rate; in reality, it does not.

Many questions remain unanswered in different areas of silicon reliability, and industry needs help from the universities. How do we predict and/or measure error rates from radiation, wear-out, and variability? How do we detect soft errors, wear-out, and variability on individual parts? Many traditional solutions exist, but how do we make them cheaper?

Cost of reliability—Are users willing to pay?
Mahlke: The big question is how much are the people willing to pay? This is very market dependent. For example, a credit card company trying to compute bills would be willing to pay a fair amount of money. But what about the average laptop user, how much extra are they willing to pay? My theory is that end users are not willing to pay. Either they are used to errors, they accept them, or they don't care about infrequent errors.

González: I want to add that cost could also be reflected in some performance penalty. Reliability can sometimes be provided at the expense of some decrease in performance—for example, lower frequency—or some increase in power due to the extra hardware.

Mukherjee: If you look back, people pay for ECC. We do. We pay for parity, we pay for RAID systems (redundant arrays of independent disks), we pay a lot for fault-tolerant file servers from EMC. So, people pay. The main thing that we need to do is to measure and show them what they are paying for. It turns out that every company has some applications that they need to run with much more reliability than others. E-mail, surprisingly, is one of them. Financial applications are another. So, yes, they are willing to pay if we show them what they are paying for.

González: Following up on that, do we have any quantification of reliability in terms of area or any other metric? How much is already in the chip today to guarantee certain levels of reliability?

Mukherjee: The amount of area that we are putting in for error correction logic is going up exponentially in order to keep a constant MTTF. And it will continue to grow.

Mahlke: Yes, but most of that error protection logic is in the memory, right? There is not much in the actual processor.

Mukherjee: Not necessarily. I cannot publicly reveal the details.

Mean time to failure—what does it tell us?
Audience member: Why do we make products with an MTTF of seven years, when most of the users are going to throw them away in one year? It's just a matter of doing the mathematics. It all depends on the distribution of the failures versus time. You can have 90 percent of your population failing in one year, and still have an MTTF of seven years with the right distribution. So, just because my MTTF is seven years doesn't mean that unacceptably large fractions of the people are not going to see failures in one year.
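A hypothetical two-point lifetime distribution makes the point concrete: if 90 percent of units die at one year and the surviving 10 percent last 61 years, the mean works out to exactly seven years.

    # MTTF is a mean; by itself it says nothing about how failures are spread out.
    population = [
        (0.90, 1.0),    # 90% of units fail at 1 year
        (0.10, 61.0),   # the surviving 10% last 61 years
    ]

    mttf = sum(share * lifetime for share, lifetime in population)
    early_failures = sum(share for share, lifetime in population if lifetime <= 1.0)

    print(f"MTTF = {mttf:.1f} years")                             # 7.0 years
    print(f"units failing within a year = {early_failures:.0%}")  # 90%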
Mahlke: I think you are right. Just because the MTTF is seven years, it doesn't mean that all fail at seven years. Many of them will fail before that. But, I think if you look at it, people are keeping these things 11 months and then throwing them away. If you look at the data for how many phones actually fail after 11 months, I believe that it is a very small number, even when there are hundreds of millions of phones sold each year. And, if we just replace those phones, and each phone is $100 or something, the cost is relatively small.

The counter-argument is this: Let's say I am going to add $10 worth of electronics for reliability to each phone. Does the average phone customer want to pay that money to get that little bit of extra reliability? I think there is a big distinction between servers that are doing important computation and disposable electronic devices that people use. Maybe we need two different reliability strategies for these two domains. Because, for disposable electronics, where we try to reduce the cost, it may be too much overhead if we blindly incorporate reliability mechanisms such as parity bits or dual modular redundancy (DMR) into the hardware.

Mukherjee: We don't specify lifetime reliability based on the mean, but rather on a very high percentile of the chips surviving whatever number of years we internally think the chips should survive. So, it is not based on the mean.

Audience member: I would like to add one sentence to that. I have seen graphs from Intel that are publicly available. They show that this number of years for this 99.99+ percent of chips is going down. The reason is that it is harder to give the same guarantee.

Mukherjee: Good observation.

Error detection and correction
Audience member: One of the things I find disturbing is silent failure. If the hardware failure is a silent failure, I don't know if my data is corrupted; I have no indication. So, in these systems, if I have error detection, then I can track the data being corrupted and follow on. But, if we are having bit-flipping hardware, I have neither detection nor correction. Detection is not correction, I agree, but detection is one of the reasons we tolerate failures in software and supplement systems.

Mukherjee: That's a very good point. I was in the same camp for a while, but after interacting with some of the customers, I am beginning to think otherwise, because fault detection raises your overall error rate. Silent data corruption falls into two classifications: the one that you care about, and the one that you don't care about. When you put in fault detection to prevent silent data corruption, you end up flagging both types of errors, and the customer annoyance factor goes up. The bottom line is that detection alone is not enough. You have to go for full-blown correction.
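Parity is the textbook case of detection without correction, and it shows why detection alone merely turns silent corruption into an alarm. The sketch below uses plain even parity over a word as a generic illustration, not any product's actual scheme.

    # Even parity: one extra bit detects any single (or odd-count) bit flip,
    # but it carries no information about which bit to repair.
    def parity(word):
        return bin(word).count("1") & 1

    def check(word, stored_parity):
        return parity(word) == stored_parity    # True means "no error detected"

    data = 0xDEADBEEFCAFEF00D
    p = parity(data)
    corrupted = data ^ (1 << 17)                # simulate a particle strike on bit 17

    print(check(data, p))         # True: the clean word passes
    print(check(corrupted, p))    # False: detected, but the flipped bit is unknown

An error-correcting code such as a SECDED Hamming code adds enough redundant bits to locate the failing position as well, which is what moving from detection to full correction buys.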
González: So, there are errors that matter and errors that don't—but are we not using the same kind of systems for both? For instance, you may be checking your bank account with the same computer that you're using to run your media entertainment.

Mahlke: If you are accessing something and you knew there was an error in something small, maybe your address book, you can download a copy of it, right? But, if there was something larger than that, and maybe you didn't have a copy of the data—or maybe, as Antonio [González] said, you were doing something critical like transferring money from your bank account—then I kind of agree with Shubu [Mukherjee] that detection alone may not be good enough. If you want to go down this reliability path, you may need to detect and correct, because detection just throws many red flags and you start worrying about what you lost versus fixing it behind the scenes. If a fault actually leads to a system hanging or crashing, then you will know about it. This may reduce the number of things we need to worry about to the things that lead to silent data corruption. Because, if that is a relatively small number, and I can figure out the other ones when they occur, maybe I don't need to worry about the small subset of faults.

Mukherjee: A system hang is not necessarily a detected error; it can cause silent data corruptions.

Audience member: In Scott's [Mahlke] presentation, he used the z series from IBM as an example, but these are systems that are all about the reliability, and they are enormously internally redundant and fault tolerant. So, when you talked about the low error rate, that is the low error rate after all the expense devoted to reliability and adding everything for reliability. It would be very interesting to understand the non-z series experience that people have.

Mahlke: The errors that I mentioned did not cause corruption, but were actually parity errors that were caught and corrected in their system. These weren't the errors that got through all the armor that they have put up.

Mukherjee: I have an example of that, from a recently published paper from Los Alamos National Lab.7 There is a system of 2048 HP AlphaServer ES45s, where the MTTF is quite small. It is proven that cosmic-ray-induced neutrons are the primary cause of the BTAG (board-level cache tag) parity errors that are causing the machines to fail.

Will classical solutions be enough?
Audience member: I think one thing that both of you agreed on is that reliability has a cost, either in area or in performance. You may not see the errors, because the system is over-dimensioned. Perhaps we are not doing our jobs as microarchitects to actually look at the trade-offs between performance and errors that we are able to tolerate, or to look at dimensioning the system to tolerate more errors, and perhaps to make the system cheaper using cheaper materials or architectures. Perhaps the tragedy here is that a lot of these trade-offs at this point are done at the semiconductor level rather than at the microarchitecture level. But, in the future it may be done at the microarchitecture level, where you can compute the trade-offs between performance versus reliability versus cost.

Mahlke: One of the problems is that as you go towards more multipurpose systems, these systems tend to do both critical and noncritical things. You might have different trade-offs for reliability versus performance for different tasks. For instance, for a video encoder, the system requires maximum performance and lower reliability. Therefore, as we go towards less-programmable systems, the trade-off is more obvious. On the other hand, as we go towards more-programmable systems doing both critical and noncritical tasks, the trade-off becomes a little bit harder, or a little bit foggy, with these multipurpose systems.

González: Do you have a good example of a potential area where you believe that current approaches—for example, ECC—won't be enough? Do you have a good example to motivate what can be done at the microarchitecture level, Shubu?

Mukherjee: If you look at logic gates today, their contribution to the error rate is hidden in the noise of all the factors that cause the errors, such as soft errors and cell instability. But, if you look five to 10 years ahead—once timing problems start to show up, maybe due to variations or wear-outs—logic is going to become a problem. So, in that case, classical ECC is not going to buy you anything. We can start looking at residue-checking or parity-prediction circuits to detect logic errors.

It also comes down to this fundamental point that you have a full stack, starting from the software all the way down to the process. As you go up from one layer to another, the definition of errors makes it clearer whether it is an error or not. That's what you need to track and what you need to expose. That's what Joel Emer and I have worked on for a long time. We are convinced that if you look at different levels of a system, the resilience is very different. In some cases, if you hit the bit, you immediately see the error, but sometimes you don't see it at all. There is a wide degree of variability.
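As a concrete flavor of the residue checking Mukherjee mentions, the sketch below guards an addition with a mod-3 residue; the function name and the fault-injection hook are invented for illustration, but the underlying identity, that (a + b) mod 3 equals ((a mod 3) + (b mod 3)) mod 3, is the standard basis of the technique.

    # Mod-3 residue check for an adder: any single-bit flip in the result changes
    # its residue (2^k mod 3 is never 0), so a mismatch is detected.
    def checked_add(a, b, corrupt=None):
        result = a + b
        if corrupt is not None:
            result = corrupt(result)              # model a logic fault in the datapath
        if result % 3 != (a % 3 + b % 3) % 3:
            raise RuntimeError("residue check failed: logic error detected")
        return result

    print(checked_add(123_456, 789_012))          # 912468, check passes
    try:
        checked_add(123_456, 789_012, lambda r: r ^ (1 << 8))
    except RuntimeError as err:
        print(err)                                # the injected flip is caught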
Measuring reliability
Audience member: Can you give us an idea of how much reliability we gain by what you are putting in the processor, compared to when we have nothing?

Mukherjee: I have some data on a cancelled processor project. I was the lead architect for reliability of that processor. Our data showed that if you didn't have any protection, that chip would be failing in months due to all kinds of reliability issues.
Audience member: How much can we put in the commodity processors that customers are willing to pay extra pennies for?

Mukherjee: That goes to the fundamental problem of how to let customers know what they are getting for the extra price. The problem is that if you look at soft errors, we cannot tell them what extra benefit they are going to get. We can measure the performance by clock time, but we don't have a good measurement of a system's reliability.

González: Any idea on how we can measure reliability?

Mukherjee: We fundamentally need a mechanism to measure these things. For hard errors, the problem may be tractable. For soft errors—induced by radiation—this is still a hard problem. For gradual errors, such as wear-out, we still don't know how to measure the reliability of an individual part. So, the answer is that, in many cases, we don't know how to measure reliability.

Mahlke: There may be a different angle of looking at how reliability can be measured. Instead of thinking of reliability as a tax that you have to pay, and trying to justify this tax, maybe the right way to go about this is thinking of what else we get in addition to reliability.

Let's take Razor as an example. Razor can identify events like transient faults, and it also allows you to drive voltage down and essentially operate at the lowest voltage margin possible. It identifies when the voltage goes too low and self-corrects the circuit. If we talk about adaptive systems and how to make systems more adaptable for reliability or power consumption, then it may be about justifying the cost of some feature that the user really wants, and reliability just kind of happens magically behind your back.

How much extra hardware is needed for reliability?
Audience member: How much can Intel afford to put in the chip for reliability?

Mukherjee: We will put as much as we need to hide under the software errors.

Mahlke: I guess you are saying that you are putting in too much, since the software errors are two orders of magnitude greater than the hardware errors.

Mukherjee: Microsoft has actually shown that Windows causes very few of the problems. It is the device drivers that cause many of the problems. Memory is also a big problem, since more than 90 percent of memories out there don't have any fault detection or error correction in them. Stratus is a company that actually builds fault-tolerant systems using Windows boxes running on Pentium 3s. How did they do that? They tested all the device drivers. And they don't let anyone install any device driver arbitrarily on those systems. So, they have a highly reliable, fault-tolerant Windows system running on Pentium 3s. Believe it or not, that exists. So, blaming Windows is not the right way. Microsoft has done a phenomenal job showing that it is not Windows itself that causes most of the reliability problems in today's computers.

Acknowledgments
All views expressed by Antonio González in this article are his alone, and all views expressed by Shubu Mukherjee are his alone. Neither author represents the position of Intel Corporation in any shape or form. Although Mahlke argues the fallacy viewpoint in this article, his research group actively works in the areas of designing reliable and adaptive computer systems.

References
1. S. Borkar, "Microarchitecture and Design Challenges for Gigascale Integration," keynote address, 37th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2004.
2. National Software Testing Labs, http://www.nstl.com.
3. R. Mariani, G. Boschi, and A. Ricca, "A System-Level Approach for Memory Robustness," Proc. Int'l Conf. Memory Technology and Design (ICMTD 05), 2005; http://www.icmtd.com/proceedings.htm.
4. J. Srinivasan et al., "Lifetime Reliability: Toward an Architectural Solution," IEEE Micro, vol. 25, no. 3, May-June 2005, pp. 70-80.
5. Center for Advanced Life Cycle Engineering, Univ. of Maryland; http://www.calce.umd.edu.
6. D. Ernst et al., "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation," Proc. 36th Ann. Int'l Symp. Microarchitecture (MICRO 03), IEEE CS Press, 2003, pp. 7-18.
7. S.E. Michalak et al., "Predicting the Number of Fatal Soft Errors in Los Alamos National Laboratory's ASC Q Supercomputer," IEEE Trans. Device and Materials Reliability, vol. 5, no. 3, Sept. 2005, pp. 329-335.

Antonio González is the founding director of the Intel-UPC Barcelona Research Center and a professor of computer architecture at Universitat Politècnica de Catalunya. His research focuses on computer architecture, with particular emphasis on processor microarchitecture and code generation techniques. González is an associate editor of IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, ACM Transactions on Architecture and Code Optimization, and Journal of Embedded Computing.

Scott Mahlke is an associate professor in the Electrical Engineering and Computer Science Department at the University of Michigan, where he directs the Compilers Creating Custom Processors research group. His research interests include application-specific processors, high-level synthesis, compiler optimization, and computer architecture. Mahlke has a PhD from the University of Illinois, Urbana-Champaign. He is a member of the IEEE and ACM.

Shubu Mukherjee is a principal engineer and director of SPEARS (Simulation and Pathfinding of Efficient and Reliable Systems) at Intel. The SPEARS Group spearheads architectural innovation in the delivery of microprocessors and chipsets by building and supporting simulation and analytical models of performance, power, and reliability. Mukherjee has a PhD in computer science from the University of Wisconsin–Madison.

The biographies of Resit Sendag, Derek Chiou, and Joshua J. Yi appear on p. 24.

Direct questions and comments about this article to Antonio González, Intel and UPC, c/Jordi Girona 29, Edifici Nexus II, 3a. planta, 08034 Barcelona, Spain; antonio.gonzalez@intel.com.

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.
