All electronic systems carry the possibility of failure. An embedded system has intrinsic
intelligence that facilitates the possibility of predicting failure and mitigating its effects. This
two-part series reviews the options for self-testing that are open to the embedded software
developer, along with testing algorithms for memory and some ideas for self-monitoring
software in multi-tasking and multi-CPU systems. In this section, we will look at self-testing
approaches to guard against hardware failure. Embedded software can be incredibly complex, so
the possibilities for something going wrong are extensive. Add to this the complexity and
potential unreliability of hardware, and system failure seems almost inevitable. And yet, most
systems are amazingly reliable, functioning faultlessly for months at a time. This is no accident.
The reliability comes about through careful design in the first place and a tacit acceptance of the
possibility of failure in the second.
Writing robust software that is less likely to exhibit failure requires three categories of
hardware failure to be considered:
CPU failure
The failure of just the CPU (or part of it) in an embedded system is quite rare. In the
event of such a failure occurring, it is unlikely that any instructions could be executed,
so self-testing code is irrelevant. Fortunately, this kind of failure is most likely to
happen on power up, when the dead system is likely to be noticed by a user.
As multicore systems are becoming more common, the possibilities for CPUs to
maintain a health check on one another arise. A simple handshake communication
during initialization would suffice to verify that power-up failure has not occurred.
Unfortunately, if all the cores are on the same chip, there is less likelihood of one of
them failing in isolation.
Peripheral failure
An embedded system may have any number of other electronic devices around the CPU,
and all of them can potentially fail. If a peripheral dies totally, it will most likely
“disappear” – it no longer responds to its address, and accessing it results in a trap.
Suitable trap code is a good precaution.
Memory failure
Given the enormous amount of memory in modern systems and the tiny geometry of
the chip technology, it is surprising that memory failure is not a very frequent
occurrence. There are two broad types of possible fault: transient and hard.
Transient faults occur from time to time and are virtually impossible to prevent.
They are caused by stray radiation – typically cosmic rays – that randomly flip a
single bit of memory. Heavy shielding might reduce their likelihood, but this is not
practical for many types of device. There is no reliable way to detect a transient fault
itself. It is most likely to become manifest as a software malfunction, because some
code or data has been corrupted.
A hard fault is permanent and broadly takes one of three forms: (1) complete failure of
a memory device, (2) one or more bits stuck at 0 or 1, or (3) cross-talk, where writing
one bit disturbs another. Complete device failure (1) results in a trap in the same way
as the aforementioned peripheral failure. Ensuring that a suitable trap handler is
implemented addresses this issue.
The other forms of hard memory failure, (2) and (3), may be detected using self-test
code. This testing may be run at start-up and can also be executed as a background
task.
Start-up memory testing
There are a couple of reasons why testing memory on start-up makes a lot of sense.
As with most electronic devices, the time when memory is most likely to fail is on
power-up. So, testing before use is logical. It is also possible to do more
comprehensive testing on memory that does not yet contain meaningful data.
A common power-up memory test is called “moving ones”. This tests for cross-talk –
i.e. whether setting or clearing one bit affects any others. Here is the logic of a moving
ones test:

   set all of memory to zero
   for each bit of memory:
      set the bit to 1
      verify that it is 1
      verify that all other bits are still 0
      set the bit back to 0
The same idea may be applied to implement a moving zeros test. Ideally, both tests
should be used in succession. Coding these tests needs care. The process should not,
itself, use any memory – code should be executed from flash and all working data
must be stored in CPU registers.
With increasingly large amounts of memory, the time taken to perform these tests
escalates rapidly – every other bit must be checked for each bit written, so the cost
grows quadratically with memory size – and could result in an unacceptable delay in
the start-up time for a device.
for a device. Knowledge of memory architecture can enable optimization. For
example, cross-talk is more likely within a given memory array. So, if there are
multiple arrays, the test can be performed individually on each one. Afterwards, a
quick check can be performed to verify that there is no cross-talk between arrays,
thus:

   set all of memory to zero
   for each array:
      set every bit in the array to 1
      verify that all bits in the other arrays are still 0
      set every bit in the array back to 0
This can then be repeated with all of memory starting full of ones.
Background memory testing
A test running in the background must not disturb memory that is in use, so each
location is tested individually and atomically, with interrupts masked so that no other
code can observe the scratch values. The logic for each location is:

   turn off interrupts
   save the contents of the location
   run moving ones and moving zeros on just that location
   restore the saved contents
   turn on interrupts