
FAILURES IN HARDWARE

All electronic systems carry the possibility of failure. An embedded system has intrinsic
intelligence that facilitates the possibility of predicting failure and mitigating its effects. This
two-part series reviews the options for self-testing that are open to the embedded software
developer, along with testing algorithms for memory and some ideas for self-monitoring
software in multi-tasking and multi-CPU systems. In this section, we will look at self-testing
approaches to guard against hardware failure. Embedded software can be incredibly complex, so
the possibilities for something going wrong are extensive. Add to this the complexity and
potential unreliability of hardware, and system failure seems almost inevitable. And yet, most
systems are amazingly reliable, functioning faultlessly for months at a time. This is no accident.
The reliability comes about through careful design in the first place and a tacit acceptance of the
possibility of failure in the second.

Writing robust software that is less likely to exhibit failure requires three issues to be
considered:

- How to reduce the likelihood of failure
- How to handle impending failure (and maybe prevent it)
- How to recover from a failure condition

Four main points of failure

The first task is to identify the possible points of failure. There are broadly four
possibilities where failure may occur:

- The CPU (microprocessor or microcontroller) itself
- Circuitry around the CPU or one or more peripheral devices
- Memory
- Software

Each of these may be addressed separately.

CPU failure
The failure of just the CPU (or part of it) in an embedded system is quite rare. In the
event of such a failure occurring, it is unlikely that any instructions could be executed,
so self-testing code is irrelevant. Fortunately, this kind of failure is most likely to
happen on power up, when the dead system is likely to be noticed by a user.

As multicore systems become more common, the possibility arises for the CPUs to
maintain a health check on one another. A simple handshake communication
during initialization would suffice to verify that power-up failure has not occurred.
Unfortunately, if all the cores are on the same chip, there is less likelihood of one of
them failing in isolation.
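Such an initialization handshake can be sketched as a pair of signatures written into shared memory. The mailbox layout, the `ALIVE_MAGIC` value, and the function names below are illustrative assumptions, not taken from any particular platform:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical shared-memory "mailbox" visible to both cores. On real
   hardware this would sit at a fixed address agreed by both images. */
typedef struct {
    volatile uint32_t core_a_alive;
    volatile uint32_t core_b_alive;
} health_mailbox_t;

/* An arbitrary signature; deliberately not all-zeros or all-ones, so that
   unprogrammed or stuck memory is not mistaken for a healthy peer. */
#define ALIVE_MAGIC 0xC0DEFACEu

/* Each core writes its signature during initialization... */
void announce_core_a(health_mailbox_t *mb) { mb->core_a_alive = ALIVE_MAGIC; }
void announce_core_b(health_mailbox_t *mb) { mb->core_b_alive = ALIVE_MAGIC; }

/* ...and checks that its peer did the same within a bounded number of polls. */
bool peer_b_ok(const health_mailbox_t *mb, unsigned max_polls)
{
    while (max_polls--) {
        if (mb->core_b_alive == ALIVE_MAGIC)
            return true;
        /* on real hardware: a short delay or wait-for-event here */
    }
    return false;
}
```

If the check fails, the surviving core can raise an alarm or hold the system in a safe state rather than proceeding with a dead peer.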

Peripheral failure

An embedded system may have any amount of other electronics around the CPU, and
all of it can potentially fail. If a peripheral dies totally, it will most likely
“disappear” – it no longer responds to its address, and accessing it results in a trap.
Suitable trap code is a good precaution.

Other possible self-testing is totally device dependent. A communications device, for
example, may have a loopback mode, which facilitates some rudimentary testing. A
display can be loaded with something distinctive so that an operator might observe
any visible failures.
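A loopback check can be sketched as below. The register layout and the loopback control bit are hypothetical (real devices document their own); the line that copies `tx` into `rx` stands in for the hardware loopback path so the sketch is self-contained:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical UART register block -- names and layout are illustrative,
   not taken from any particular device. */
typedef struct {
    volatile uint32_t ctrl;   /* control: bit 0 = loopback enable (assumed) */
    volatile uint32_t tx;     /* transmit data */
    volatile uint32_t rx;     /* receive data */
} uart_regs_t;

#define UART_CTRL_LOOPBACK 0x1u

/* Send a few distinctive patterns through internal loopback and verify
   that each one comes back unchanged. */
bool uart_loopback_test(uart_regs_t *u)
{
    static const uint8_t patterns[] = { 0x00, 0xFF, 0xAA, 0x55 };
    bool ok = true;

    u->ctrl |= UART_CTRL_LOOPBACK;          /* enter loopback mode */
    for (unsigned i = 0; i < sizeof patterns; i++) {
        u->tx = patterns[i];
        u->rx = u->tx;   /* stand-in for the hardware loopback path;
                            real code would poll an "RX ready" flag */
        if ((u->rx & 0xFFu) != patterns[i]) {
            ok = false;
            break;
        }
    }
    u->ctrl &= ~UART_CTRL_LOOPBACK;         /* restore normal operation */
    return ok;
}
```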

Memory failure

Given the enormous amount of memory in modern systems and the tiny geometry of
the chip technology, it is surprising that memory failure is not a very frequent
occurrence. There are two broad types of possible fault: transient and hard.
Transient faults occur from time to time and are virtually impossible to prevent.
They are caused by stray radiation – typically cosmic rays – that randomly flip a
single bit of memory. Heavy shielding might reduce their likelihood, but this is not
practical for many types of device. There is no reliable way to detect a transient fault
itself. It is most likely to become manifest as a software malfunction, because some
code or data was corrupted.

Hard faults are permanent malfunctions and show up in three forms:

1. Memory not responding to being addressed at all.
2. One or more bits are stuck at 0 or 1.
3. There is cross-talk; addressing one bit has an effect on one or more others.

(1) results in a trap in the same way as the aforementioned peripheral failure. Ensuring
that a suitable trap handler is implemented addresses this issue.
The other forms of hard memory failure, (2) and (3), may be detected using self-test
code. This testing may be run at start-up and can also be executed as a background
task.

Start-up memory testing

There are a couple of reasons why testing memory on start-up makes a lot of sense.
As with most electronic devices, the time when memory is most likely to fail is on
power-up. So, testing before use is logical. It is also possible to do more
comprehensive testing on memory that does not yet contain meaningful data.
A common power-up memory test is called “moving ones”. This tests for cross-talk –
i.e. whether setting or clearing one bit affects any others. Here is the logic of a moving
ones test:

set every bit of memory to 0
for each bit of memory {
    verify that all bits are 0
    set the bit under test to 1
    verify that it is 1
    verify all other bits are 0
    set the bit under test to 0
}
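The moving-ones logic above might be sketched in C as follows. This is illustrative only: it operates on a supplied block of words, whereas a real start-up test would walk physical memory from flash-resident code, keeping all of its working state in registers:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Moving-ones test over a block of 32-bit words.
   Returns true if no stuck bits or cross-talk were detected. */
bool moving_ones(volatile uint32_t *mem, size_t words)
{
    /* clear everything first */
    for (size_t i = 0; i < words; i++)
        mem[i] = 0;

    for (size_t w = 0; w < words; w++) {
        for (unsigned b = 0; b < 32; b++) {
            mem[w] = UINT32_C(1) << b;           /* set the bit under test */
            if (mem[w] != (UINT32_C(1) << b))
                return false;                    /* stuck bit in this word */
            for (size_t i = 0; i < words; i++)   /* every other word must stay 0 */
                if (i != w && mem[i] != 0)
                    return false;                /* cross-talk detected */
            mem[w] = 0;                          /* clear before moving on */
        }
    }
    return true;
}
```

Swapping the roles of 0 and 1 throughout gives the corresponding moving-zeros test.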

The same idea may be applied to implement a moving zeros test. Ideally, both tests
should be used in succession. Coding these tests needs care. The process should not,
itself, use any memory – code should be executed from flash and all working data
must be stored in CPU registers.
With increasingly large amounts of memory, the time taken to perform these tests
escalates quadratically (testing n bits requires on the order of n² operations) and
could result in an unacceptable delay in the start-up time
for a device. Knowledge of memory architecture can enable optimization. For
example, cross talk is more likely within a given memory array. So, if there are
multiple arrays, the test can be performed individually on each one. Afterwards, a
quick check can be performed to verify that there is no cross-talk between arrays,
thus:

fill all memory with 0s
for each memory array {
    fill array with 1s
    verify that other arrays still contain just 0s
    fill array with 0s
}

This can then be repeated with all of memory starting full of ones.
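The inter-array check can be sketched as below. The arrays are modeled as equal-sized, contiguous word blocks, which is an assumption for illustration; real code would use the device's actual memory-array map:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Fill one array at a time with 1s and confirm the other arrays still
   read back all 0s. Returns true if no inter-array cross-talk was seen. */
bool array_crosstalk_test(volatile uint32_t *mem,
                          size_t arrays, size_t words_per_array)
{
    size_t total = arrays * words_per_array;

    for (size_t i = 0; i < total; i++)           /* fill all memory with 0s */
        mem[i] = 0;

    for (size_t a = 0; a < arrays; a++) {
        volatile uint32_t *arr = mem + a * words_per_array;

        for (size_t i = 0; i < words_per_array; i++)
            arr[i] = 0xFFFFFFFFu;                /* fill this array with 1s */

        for (size_t i = 0; i < total; i++) {     /* other arrays untouched? */
            if (i / words_per_array != a && mem[i] != 0)
                return false;
        }

        for (size_t i = 0; i < words_per_array; i++)
            arr[i] = 0;                          /* restore 0s and move on */
    }
    return true;
}
```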
Background memory testing

Once “real” code is running, comprehensive memory testing is no longer possible.
However, testing of individual bytes/words of memory is possible, so long as tiny
interruptions in software execution can be tolerated. Most embedded systems have
some idle time or run a background task, when there is no real work to be done. This
may be an opportunity to run a memory test.
A simple approach is to write, read and verify a series of bit patterns: all ones, all
zeros and alternate one/zero patterns. Here is the logic:

for each byte of memory {
    turn off interrupts
    save memory byte contents
    for values 0x00, 0xff, 0xaa, 0x55 {
        write value to byte under test
        verify value of byte
    }
    restore byte data
    turn on interrupts
}

Implementing this code requires a little care, as an optimizing compiler is likely to
conclude that some or all of the memory accesses are redundant and optimize them
away; the compiler assumes that memory is reliable, so it sees no reason to read back
a value it has just written. Declaring the pointer used to access the byte under test
as volatile prevents this.
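A per-byte test for the idle task might look like the sketch below. The `volatile` qualifier is what stops the compiler from deleting the write/read pairs; interrupt masking is platform specific, so it appears here as hypothetical empty macros:

```c
#include <stdint.h>
#include <stdbool.h>

/* Platform-specific interrupt control -- placeholders only.
   On a real target these would be, e.g., the vendor's intrinsics. */
#define DISABLE_INTERRUPTS()
#define ENABLE_INTERRUPTS()

/* Non-destructive test of a single byte, suitable for background use.
   The original contents are saved and restored, so real data survives. */
bool test_byte(volatile uint8_t *p)
{
    static const uint8_t patterns[] = { 0x00, 0xFF, 0xAA, 0x55 };
    bool ok = true;

    DISABLE_INTERRUPTS();
    uint8_t saved = *p;                  /* save memory byte contents */
    for (unsigned i = 0; i < sizeof patterns; i++) {
        *p = patterns[i];                /* write value to byte under test */
        if (*p != patterns[i]) {         /* verify value of byte */
            ok = false;
            break;
        }
    }
    *p = saved;                          /* restore byte data */
    ENABLE_INTERRUPTS();
    return ok;
}
```

The background task would call this for one byte (or a small batch) per idle period, keeping each interruption tiny.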
