
When Correct Is Not Enough:

Formal Verification of
Fault-Tolerant Hardware
Copyright OneSpin Solutions © 2020

Abstract
Once upon a time, hardware functional verification was all about
ensuring that a circuit would perform its specified functions under
all legal input stimuli. Today, though, gaining confidence that a
hardware design is correct is often not enough. Several
industries, including automotive, medical, and aerospace, rely on
safety-critical hardware to keep people safe. Other systems, for
example, in hard-to-reach equipment, data centers, and storage
applications, may also require fault-tolerant electronics.
Fault-tolerant hardware development is no longer a niche and
presents new challenges. Engineers face the daunting task of
having to examine countless faulty variants of their designs in
order to integrate and verify multiple safety mechanisms within
complex systems-on-chip (SoCs).
This white paper examines key goals and challenges in fault-
tolerant hardware verification and presents formal solutions that
ensure predictable hardware behavior under all relevant
operating conditions and fault scenarios, while saving
engineering and computational resources.

Contact · info@onespin.com · www.onespin.com


USA: +1 408 734 1900 · Europe: +49 89 99013-0 · Japan: +81 45 285 1573
© Copyright 2020 OneSpin. All rights reserved. OneSpin is a registered trademark of OneSpin Solutions GmbH. OneSpin
Solutions, OneSpin 360, the 360 product names, the OneSpin logo are trademarks of OneSpin Solutions GmbH.
All other trademarks are the property of their respective owners.
2020-05 V. 14

TABLE OF CONTENTS
Introduction
Safety Mechanisms
Formal Verification of Safety Mechanisms
ECC-Protected FIFO
Duplicated Configuration Registers
Lockstep Comparator and Recovery Module
Summary
Author
References and Further Reading


INTRODUCTION
In the past fifteen years, advanced verification flows based on constrained-random simulation,
emulation, and formal tools have enabled engineers to tape out complex systems-on-chip
(SoCs) successfully. However, with ever more applications posing security and safety
requirements on hardware—in addition to the established ones on performance, area, and
power consumption—new challenges have emerged that require innovative and powerful
verification solutions.
Faults caused by wear-out, interference from environmental radiation, power glitches, and
other physical effects can appear at any time during hardware operation. Increasing design
complexity and shrinking transistor geometries and power budgets are the main risk drivers
for the occurrence of random hardware failures. Fault-tolerant hardware with integrated safety
mechanisms can counterbalance this risk, and is used not only in safety-critical products
governed by functional safety standards, such as ISO 26262 for automotive or DO-254 for
avionics, but also for a range of other applications, including storage and data centers, that
require a controlled and predictable response to failures.
Safety mechanisms prevent or control the occurrence of dangerous failures, but they also
make hardware development harder and more expensive. For functional verification
specifically, engineers need to consider that:

• The proportion of faults detected by the safety mechanisms could be insufficient to
achieve the required safety integrity level (SIL).
• Bugs in the safety mechanisms could lead to undetected faults or even to interference
with the normal hardware functions, potentially causing dangerous failures, rather than
preventing them.

To demonstrate the achievement of the required level of fault tolerance, a quantitative analysis
of the effectiveness of the safety mechanisms is needed. OneSpin’s portfolio of safety-critical
solutions includes formal apps that address this task, namely, the Fault Propagation App
(FPA™) and the Fault Detection App (FDA™). These apps complement fault simulation to
speed up the verification and analysis of diagnostic coverage estimates, while helping to
achieve the best possible results. For more information, please refer to the OneSpin website
(www.onespin.com) and to the white paper, “Using Formal to Verify Safety-Critical Hardware
for ISO 26262”.

To avoid bugs in the specification or implementation of safety mechanisms, a rigorous and
efficient functional verification process is essential. This is the focus of this paper. It is also
worth noting that functional verification and diagnostic coverage analysis are closely related
activities, but it is common for companies to have separate teams addressing the two tasks.
In fact, diagnostic coverage estimates might have to be verified on a gate-level model of the
chip, while functional verification is often carried out at module level and mostly on RTL
models.

The next section outlines a number of widely used safety mechanisms.


SAFETY MECHANISMS
Modern SoCs may include a variety of software and hardware safety mechanisms, depending
on their target application and associated safety integrity level. Some well-known mechanisms
are shown below in Figure 1.

Figure 1 – Examples of safety mechanisms

• Software Self-Test. This technique consists of software routines that perform simple
operations and compare their results against expected values. Routines running
periodically during system operation must be small and fast, otherwise they could
reduce the availability of the system.
• Dual-Core Lockstep. CPU cores are duplicated, often with the exclusion of large
memories to save area and reduce cost, with one core being used as master, and the
other as checker. The two cores receive the same input stimuli and their outputs are
continuously compared. An alarm is raised if the outputs of the two cores differ.
• Hardware Replication. Besides processor cores, other critical modules (e.g., arbiters
and DMAs) can be replicated and their outputs continuously compared to detect faults.
• Watchdog. During normal operation, the system periodically resets the watchdog
timer. However, due to hardware faults or programming errors, the timer could elapse
and generate a timeout signal that triggers appropriate actions.
• Built-In Self-Test (BIST). A typical application is memory BIST modules that write and
read memories with predefined patterns to detect faults, and potentially use spare logic
to repair them. BIST logic can be used at power-up, for example, whenever a car is
turned on, or even during system operation. In the latter case, though, tests might have to
be split into small chunks and wrapped in save/restore functions to limit their
impact on system performance and avoid memory corruption.


• Error Detecting and Correcting Codes (EDC/ECC). Address and data information
traveling on buses or saved into memories is encoded (and later decoded) using
specialized algorithms. Extra logic and storage are the price to pay to gain confidence
that information has not been corrupted by hardware faults. SRAM memories are often
protected with logic providing correction of single bit errors, and detection of double bit
errors (SECDED). Occasionally, double error correction and triple error detection
(DECTED) strategies are also used.
• Triple Modular Redundancy (TMR). Critical logic or configuration registers are
triplicated and majority-voting logic added to decide which of the three copies presents
the correct output.
• Safe FSM. If a glitch or other issue causes the FSM to make an illegal state transition,
this is detected and the FSM transitions to a predefined recovery state.
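
As a minimal illustration of the majority voting used in TMR, the following Python sketch models the three copies of a register as integers and votes bit by bit. The function names and the 8-bit width are assumptions made for this example; they are not taken from any particular design or tool.

```python
from itertools import product

WIDTH = 8  # illustrative register width (assumption)


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant copies."""
    return (a & b) | (a & c) | (b & c)


def single_copy_faults_are_masked(value: int) -> bool:
    """Flip each bit of each copy in turn and check the voted output is unchanged."""
    for copy_index, bit in product(range(3), range(WIDTH)):
        copies = [value, value, value]
        copies[copy_index] ^= 1 << bit  # single bit flip in one copy
        if majority_vote(*copies) != value:
            return False
    return True


# Enumerate every register value and every single-bit fault location.
all_masked = all(single_copy_faults_are_masked(v) for v in range(1 << WIDTH))
print(all_masked)
```

The voter masks any fault confined to one copy. If the same bit is corrupted in two copies, the vote follows the corrupted value, which is why the fault model matters when choosing a mechanism.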

Besides providing the required level of protection against random hardware faults, safety
mechanisms must also be quick in detecting and reporting faults. The fault tolerant time
interval (FTTI) is defined as the time a fault can be present in a system before a hazardous
event could occur. The time that a system needs to detect and react to a fault must be shorter
than its FTTI.
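
The timing requirement above amounts to a simple inequality: detection time plus reaction time must be less than the FTTI. The sketch below states it in Python. The function name and all numeric values are hypothetical and serve only to show the check.

```python
def meets_ftti(detection_us: int, reaction_us: int, ftti_us: int) -> bool:
    """True when fault detection plus reaction completes strictly within the FTTI."""
    return detection_us + reaction_us < ftti_us


# Hypothetical budget, in microseconds, chosen only for illustration.
FTTI_US = 10_000
within_budget = meets_ftti(detection_us=2_000, reaction_us=5_000, ftti_us=FTTI_US)
too_slow = meets_ftti(detection_us=6_000, reaction_us=5_000, ftti_us=FTTI_US)
print(within_budget, too_slow)
```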

Finally, it is worth noting that the logic implementing hardware safety mechanisms can itself
be affected by faults. In some cases, despite compromising functionality, these faults could
remain undetected and be latent, as safety functions might be inactive when no faults are
present in the logic implementing the normal hardware functions. Latent faults are a threat to
safety and must be taken into account when designing and implementing safety mechanisms.

The next section uses three examples to provide more information on the formal verification
of important safety mechanisms such as lockstep, ECC and hardware duplication.

FORMAL VERIFICATION OF SAFETY MECHANISMS


Assertions are widely used in RTL development to capture functional expectations, from low-
level intent to high-level requirements. Nowadays, engineers are well aware that formal
verification of assertions is a powerful, efficient technique to uncover corner case bugs that
could be missed even by sophisticated simulation and emulation environments. Nonetheless,
in some projects, formal is still considered little more than a nice-to-have element of the flow.


Fault-tolerant and safety-critical hardware, however, present unique challenges that position
formal as a clear must-have:

• Fault Injection. Engineers must set up a consistent and reusable flow capable of
automated and controlled fault injection. The injection of fault scenarios is necessary
to activate safety functions that would otherwise be dormant. Formal tools provide a
flexible, powerful platform for controlled fault injection.
• Exhaustive Fault Verification. Engineers must analyze and verify the effects of
different fault types occurring on huge numbers of bit-level locations (wires, registers,
memories) and at different times during system operation. It is not feasible to foresee
all the corner case functional scenarios to which a design could be driven by the
sudden appearance of faults. Exhaustive verification of all relevant fault scenarios is
therefore necessary to achieve the high level of confidence required by critical
applications. Only formal tools have the potential to deliver exhaustive analysis.
• Rigorous Requirements Analysis. Engineers must identify errors and gaps in the
requirements specification and verification plan of safety functions. Unspecified or
unverified functions are not acceptable in safety-critical systems, even when these
functions are not used in the target application. Formal verification, paired with an
appropriate assertion development methodology, can provide a rigorous, automated
solution to this task.

The OneSpin Fault Injection App (FIA) provides an intuitive and reusable formal environment
to define and inject fault scenarios and associate them with assertions. A fault scenario can be
seen as a set of faulty variants of the design under test (DUT) and is defined by three elements
(see Fig. 2): the set of bit-level locations that are to be considered for fault injection; the timing
of fault injection; and the type of fault that is injected. In this context, the original design
corresponds to the particular fault scenario of no faults being present.

Figure 2 – Fault scenario

Additionally, FIA has debug features that clearly identify the specific fault scenario being
debugged and the differences between the original and the faulty design behavior. Moreover,
under-the-hood proof strategies mitigate complexity issues that can arise from the huge number
of fault scenarios that need to be examined.
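
To make the three elements of a fault scenario concrete, here is a small Python sketch that records locations, timing, and fault type, and enumerates the faulty design variants as sets of corrupted locations. The `FaultScenario` class and its fields are illustrative assumptions; they do not represent FIA's data model or command interface.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterator, Tuple


@dataclass(frozen=True)
class FaultScenario:
    locations: Tuple[str, ...]  # bit-level locations eligible for injection
    timing: str                 # e.g. "always" or "anytime"
    fault_type: str             # e.g. "bit-flip"
    simultaneous_faults: int    # number of locations corrupted at once

    def variants(self) -> Iterator[Tuple[str, ...]]:
        """Enumerate faulty design variants as tuples of corrupted locations."""
        return combinations(self.locations, self.simultaneous_faults)


# Hypothetical 8-bit encoded read-data bus with double bit flips.
bits = tuple(f"rd_data_enc[{i}]" for i in range(8))
scenario = FaultScenario(bits, "always", "bit-flip", 2)
variant_count = sum(1 for _ in scenario.variants())
print(variant_count)  # 28

# With zero injected faults, the only variant is the original design.
fault_free = list(FaultScenario(bits, "always", "bit-flip", 0).variants())
print(fault_free)  # [()]
```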

OneSpin’s GapFreeVerification™ methodology, supported by the OneSpin 360 DV-Certify™
product, on the other hand, provides a systematic process to capture requirements in
end-to-end Operational SVA and automatically identify gaps and errors in specifications and
verification plans.

The following examples demonstrate the benefits of both FIA and GapFreeVerification.


ECC-PROTECTED FIFO
The FIFO shown in Fig. 3 is implemented using a small memory and logic to keep track of the
read and write pointers. An ECC safety mechanism using a Hamming encoder and decoder
protects the content of the FIFO against single event upset (SEU) faults caused by radiation
flipping the state of memory bits. The mechanism implements a SECDED algorithm. The
encoder generates extra bits that are stored in the memory alongside their associated data entry.
When the FIFO is popped, the decoder checks the data and ECC bits. If it detects a double
error, the alarm output is pulsed. In case of a single error, the corrected output is pulsed and
the corrected data is presented at the output.

Figure 3 – ECC FIFO

The verification of the normal FIFO behavior, carried out with no injected faults, shall also
prove that the alarm and corrected outputs are inactive. The verification of the safety
requirements, on the other hand, needs fault injection. Ideally, to make the verification process
robust and efficient, engineers should use a unified formal environment enabling them to
achieve exhaustive verification and functional sign-off for the whole FIFO module. FIA builds
fault injection handling capabilities on top of OneSpin’s advanced formal verification platform,
enabling engineers to use the same environment for the verification of both normal and safety
functions, with unified formal coverage metrics and verification plan (see Fig. 4).

Considering the requirement of detecting double bit errors, the first step for its verification is
to define the fault locations. Although the obvious choice could seem to be the memory, there
are other factors to consider. Double bit faults injected in an invalid FIFO slot would have no
effect. Moreover, to determine if the alarm output shall be raised after a pop operation, the
environment should keep track of the corrupted FIFO slots. To simplify the verification, it might
be beneficial to inject faults on the encoded read data signal. This signal goes out of the
memory read port to the decoder, which is responsible for extracting the original data.
Assuming that the FIFO works correctly without faults (this will have to be verified), this is a
valid choice, as the decoder has no knowledge of which FIFO slot the data is coming from.


The fault scenario and associated assertion below are sufficient to capture one of the ECC
FIFO safety requirements:

Assertion: double_fault_alarm: assert property (
    !empty && rd_en |=> alarm);

Fault Location: any two bits of rd_data_enc[WIDTH:0]

Fault Timing: inject always

Fault Type: bit-flip.

Figure 4 – Formal signoff flow with FIA

The fault scenario can be easily expressed using two FIA TCL commands as follows:

inject_faults -signals {rd_data_enc}

inject_faults -map { {double_fault_alarm flip_2} }

It is worth noting that, for 64-bit data, which requires 8 ECC bits for SECDED, there are 2^64
different data values to examine, and, considering that faults can also affect the ECC bits, for
each data value there are 2556 double-bit fault combinations (see Table I). While exhaustive
verification of all these combinations is by far not feasible with simulation, formal can achieve
it in a few seconds.


Table I. Complexity and proof runtime for three FIFO configurations with increasing input data width
(FIFO with Hamming encoder/decoder for SECDED protection)

| Bit-width of wr_data | Required ECC redundant bits | Number of different data values | Double-bit-flip fault scenarios per data value | Proof runtime (double_fault_alarm) | Equivalent simulation cycles |
| 4                    | 4                           | 16                              | 28                                             | 0.25 sec                           | 448                          |
| 32                   | 7                           | 2^32                            | 741                                            | 6.5 sec                            | > 3 trillion                 |
| 64                   | 8                           | 2^64                            | 2556                                           | 57 sec                             | > 47 x 10^6 trillion         |
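
The fault-scenario counts in Table I can be reproduced with a few lines of Python: the number of double-bit flips is the number of unordered pairs among the data and ECC bits. The sketch also multiplies by the number of data values, assuming one simulation cycle per data value and fault pair. That is an interpretation of the last column, although it does reproduce the 448 figure in the first row.

```python
from math import comb

# (data bits, ECC bits) for the three FIFO configurations in Table I
configs = [(4, 4), (32, 7), (64, 8)]

rows = []
for data_bits, ecc_bits in configs:
    data_values = 2 ** data_bits
    # unordered pairs of flipped bits across data and ECC bits
    double_faults = comb(data_bits + ecc_bits, 2)
    rows.append((data_bits, double_faults, data_values * double_faults))

for data_bits, double_faults, total in rows:
    print(f"{data_bits:>2}-bit data: {double_faults} double-bit faults, "
          f"{total:.3e} data/fault combinations")
```

The products for the 32-bit and 64-bit configurations are consistent with the lower bounds quoted in the table.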

For many designs, simulation-based verification relies on constrained-random stimuli and
functional coverage to ensure, among other things, that foreseen corner cases have been
exercised. However, examining the RTL code for the 64-bit decoder (see Fig. 5), one can
imagine how easy it would be to make a mistake in one bit of the numerous case conditions
or xor operands. Trying to identify specific corner cases for data and fault scenario
combinations would be an effort-intensive and error-prone task. For this design, exhaustive
verification with formal technology and an environment that is independent from the actual
encoding and decoding algorithms is therefore not a nice-to-have, but rather a basic
verification aim.

Figure 5 – 64-bit decoder RTL
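
For the smallest configuration (4-bit data, 4 ECC bits), the exhaustive check can be spelled out by brute force. The Python sketch below uses a textbook extended Hamming(8,4) code, which is an assumption and not the RTL of Fig. 5, and it enumerates cases explicitly rather than proving them formally. The checks themselves refer only to the data, the corrected flag, and the alarm flag, never to the encoding. This mirrors the idea of an environment that is independent of the encoding and decoding algorithms.

```python
from itertools import combinations

DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions holding data bits
PARITY_POSITIONS = (1, 2, 4)    # Hamming parity positions; position 0 is overall parity


def encode(data: int) -> int:
    """Encode 4 data bits into an 8-bit codeword (bit i of the word = position i)."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, 8):
            if pos & p and (word >> pos) & 1:
                parity ^= 1
        if parity:
            word |= 1 << p
    if bin(word).count("1") % 2:  # overall parity makes the number of ones even
        word |= 1
    return word


def decode(word: int):
    """Return (data, corrected, alarm) for a possibly corrupted codeword."""
    syndrome = 0
    for pos in range(1, 8):
        if (word >> pos) & 1:
            syndrome ^= pos
    overall_odd = bin(word).count("1") % 2 == 1
    corrected = alarm = False
    if overall_odd:
        word ^= 1 << syndrome  # single error; syndrome 0 means the overall parity bit
        corrected = True
    elif syndrome != 0:
        alarm = True           # even parity but non-zero syndrome: double error
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= ((word >> pos) & 1) << i
    return data, corrected, alarm


double_fault_checks = 0
for data in range(16):
    codeword = encode(data)
    assert decode(codeword) == (data, False, False)            # no fault
    for bit in range(8):                                       # single bit flips
        assert decode(codeword ^ (1 << bit)) == (data, True, False)
    for b0, b1 in combinations(range(8), 2):                   # double bit flips
        _, corrected, alarm = decode(codeword ^ (1 << b0) ^ (1 << b1))
        assert alarm and not corrected
        double_fault_checks += 1

print(double_fault_checks)  # 448
```

The 448 double-fault checks match the first row of Table I. The same enumeration is out of reach for 64-bit data, which is where exhaustive formal analysis becomes necessary.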


DUPLICATED CONFIGURATION REGISTERS


In a complex design, hundreds of multibit registers for memory configuration, clock control,
security, etc., may require protection against a variety of faults, including stuck-at, SEU, and
bridging. Register duplication is a basic protection method. The outputs of the two registers
are continuously compared, and an alarm is raised if they differ. Multiple alarm signals may
be hierarchically combined before being routed to a safety control unit, as depicted in Figure
6, below.

Figure 6 – Register duplication

Verification has to ensure that any single fault occurring at any time on each individual
register bit will raise the combined alarm. Moreover, one might also have to prove that the
alarm is raised within a maximum number of cycles, regardless of the state of all other
registers or the rest of the design. It is not possible to cover all faults and state combinations
with simulation-based testing. On the other hand, assertions and formal fault injection can
deliver exhaustive verification within minutes of runtime and with minimal engineering effort.
With OneSpin’s FIA, it is easy to define fault scenarios with any timing and force the corruption
of at least a single register bit for at least one cycle. A simple assertion can then capture the
expectation that the alarm is raised within a range of cycles.

Assertion: fault_raises_alarm: assert property (

|fault_en |-> eventually [1:2] alarm);

Fault Location: at least one bit of any protected register

Fault Timing: inject anytime

Fault Type: bit-flip.

The fault scenario can be easily expressed using two FIA TCL commands as follows:

inject_faults \
    -signals { reg1 reg1d reg2 reg2d reg3 reg3d } \
    -enable { fault_en }

inject_faults -map { {fault_raises_alarm flip_any} }


Fault enable is a verification signal that controls the fault injection timing. It is left free to enable
any faults at any time. Formal technology exhaustively examines all possible fault enable
scenarios, uncovering any subtle bug. Moreover, the entire flow can be automated using the
tool’s TCL interface and leveraging machine-readable specifications of the registers.
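
The property being verified can be pictured with a small behavioral model. The Python sketch below assumes three 4-bit registers, each with a duplicate. It checks that every single bit flip in any copy raises the combined alarm, for every combination of register values. The register names follow the TCL example above. The width, the purely combinational comparison, and the omission of the alarm latency are simplifying assumptions.

```python
from itertools import product

WIDTH = 4                          # illustrative register width (assumption)
NAMES = ("reg1", "reg2", "reg3")   # each has a duplicate named "<name>d"


def combined_alarm(state: dict) -> bool:
    """OR of the per-register comparisons between a register and its duplicate."""
    return any(state[name] != state[name + "d"] for name in NAMES)


def check_all_single_bit_flips() -> int:
    """Flip every bit of every copy for every register value; return the check count."""
    checked = 0
    for values in product(range(1 << WIDTH), repeat=len(NAMES)):
        golden = {}
        for name, value in zip(NAMES, values):
            golden[name] = value
            golden[name + "d"] = value
        assert not combined_alarm(golden)       # no fault, no alarm
        for target, bit in product(golden, range(WIDTH)):
            faulty = dict(golden)
            faulty[target] ^= 1 << bit          # single bit flip in one copy
            assert combined_alarm(faulty)
            checked += 1
    return checked


total_checks = check_all_single_bit_flips()
print(total_checks)
```

Even this toy model needs close to a hundred thousand checks. Identical corruption of a register and its duplicate would not be caught by comparison and lies outside the single-fault model considered here.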

It is also worth noting that debugging failing assertions on a design that is purposely faulty
can get quite confusing. Dedicated debug features are necessary to help distinguish the
effects of RTL or assertion bugs from those of injected faults. OneSpin’s debug waveform
viewer provides a clear view of the fault injection timing and of the original (design without
faults) and faulty values of the corrupted signals (see Fig. 7).

Figure 7 – Debug – alarm not raised on time

LOCKSTEP COMPARATOR AND RECOVERY MODULE


Safety-critical automotive SoCs often feature dual-core lockstep architectures. This is a
powerful technique to prevent uncontrolled failures, but it is expensive in terms of area and
power consumption.

The two cores operate on the same input stimuli, though there might be a delay of a few clock
cycles to achieve temporal diversity and reduce the risk of faults causing the same effects on
both cores. A comparator realigns the outputs of the two cores and raises an alarm should
they be different.
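
The realignment step can be sketched behaviorally. In the Python model below, the checker core's outputs lag the master's by a fixed number of cycles, and the comparator delays the master outputs by the same amount before comparing. The function name, the two-cycle delay, and the output values are assumptions chosen for illustration.

```python
from collections import deque
from typing import Iterable, List


def lockstep_alarms(master_out: Iterable[int], checker_out: Iterable[int],
                    delay: int) -> List[bool]:
    """Compare checker outputs with master outputs delayed by `delay` cycles."""
    pipeline = deque([None] * delay)  # realignment registers for the master outputs
    alarms = []
    for m, c in zip(master_out, checker_out):
        pipeline.append(m)
        aligned = pipeline.popleft()
        # No comparison until the realignment pipeline has filled.
        alarms.append(aligned is not None and aligned != c)
    return alarms


DELAY = 2
master = [3, 1, 4, 1, 5, 9, 2, 6]
checker_ok = [0, 0] + master[:-DELAY]   # same outputs, two cycles later
checker_faulty = list(checker_ok)
checker_faulty[5] ^= 0b100              # inject a bit flip in one checker output

no_fault = lockstep_alarms(master, checker_ok, DELAY)
with_fault = lockstep_alarms(master, checker_faulty, DELAY)
print(no_fault)
print(with_fault)
```

The alarm is raised only in the cycle where the corrupted checker output is compared, and it does not indicate which core is at fault.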


The comparator logic can be exhaustively verified using formal verification. If the module is
treated in isolation, fault injection is not required. If the comparator is verified as part of the
SoC, FIA can be applied to inject faults. At SoC level, one could use formal connectivity
checking to verify that the appropriate outputs of the master and checker core, along with
reset, clock, and other system-level control signals, are correctly wired to the comparator
inputs, while the comparator alarm output is correctly wired to the safety control unit and fault
logging registers.

Figure 8 – Dual-core lockstep comparator and recovery

One of the shortcomings of this mechanism is that, in case of an alarm, it is not possible to
know if one of the cores could still operate correctly. A significantly more complex module that
can log error information, test the two cores, and multiplex their outputs, can improve system
availability (see Fig. 8). On an alarm, the recovery module can select and run appropriate
tests. If the results identify a core as free from faults, the recovery module can select its outputs
in the multiplexer and allow the system to continue operation, at least to complete the
execution of an ongoing critical program.

The verification of the control-dominated recovery mechanism is a critical task. All possible
sequences of recovery actions must be fully specified and verified. Operational SVA
developed according to the GapFreeVerification methodology is ideal to capture the
operations of the recovery module in a concise, elegant way. Operational SVA leverages a
library of standard SystemVerilog properties and sequences called TiDAL that, among other
things, enable the use of high-level timing conditions. The example below shows the
general structure of an Operational SVA, with clear separation of causes (light blue boxes)
referring to inputs and current values of architectural states, and effects (light green boxes)
referring to outputs and next values of architectural states.


Operational SVA: exec_test_1: assert property (
    t_start ##0 rec_state == idle and
    t_start ##0 alarm_detected and
    t_start ##0 start_test_1
implies
    during(nxt(t_start,1), t_end_test, <test_proc_on>) and
    during(nxt(t_start,1), t_end_test, <test_res_off>) and
    during(nxt(t_end_test,1), t_res, <test_proc_off>) and
    during(nxt(t_end_test,1), t_res, <test_res_on>) and
    t_res ##0 rec_state == idle);

Time point:

sequence t_res; await(t_end_test, res_req && res_ack);
endsequence

High-level time points enable capturing of requirements in a similar way to the timing diagrams
often shown in specification documents, including in the case of overlapping, pipelined
cause-and-effect sequences. The time point in the example above marks the completion of
the result interface handshake that takes place after test completion.

Another crucial advantage of using Operational SVA is that OneSpin’s DV-Certify performs an
automated design-independent formal analysis of the set of assertions and spots gaps and
errors in the specifications or the verification plan. For more information on OneSpin
Operational SVA, the GapFreeVerification methodology, and DV-Certify, please refer to the
following two white papers: “Achieving 100% Functional Coverage by Operational
Assertion-based Verification” and “Capturing Timing Diagrams in Operational SVA”.


SUMMARY
Fault-tolerant hardware features a variety of safety mechanisms to prevent or control failures
during operation. Its development requires a particularly rigorous functional verification flow
and the examination of a vast number of faulty design variants. Moreover, it must often satisfy
tight budget and time-to-market constraints that are becoming similar to those of consumer
applications.

The exhaustive analysis performed by formal verification tools, when augmented with
appropriate proof strategies, user interfaces, and methodology, is a powerful way to address
these challenges and achieve rigorous functional sign-off for a range of hardware safety
mechanisms.

In this paper, we have taken a closer look at three widely used mechanisms, namely: ECC
codes for memory protection, lockstep comparators and recovery modules, and duplication of
critical control registers. With practical examples, we have shown how fault injection and
formal ABV can deliver exhaustive verification of all relevant fault scenarios.

OneSpin FIA provides an automated and project-independent solution to intuitively define and
manage fault injection scenarios, bringing the formal verification of normal and safety functions
under a unified common environment. Through FIA, engineers can immediately leverage
OneSpin’s other products and apps, including 360 DV-Verify™ and Quantify™, to develop
fault-aware assertions, track overall progress with intuitive coverage metrics, and achieve
functional sign-off. Furthermore, OneSpin’s Operational SVA, coupled with the
GapFreeVerification methodology and 360 DV-Certify, provides a particularly rigorous and
systematic verification solution that can automatically detect gaps and errors in requirements
specifications and verification plans.

OneSpin’s portfolio of verification solutions includes apps that automate recurring tasks,
like SoC connectivity checking, register checking, design inspection, and more. Moreover,
OneSpin’s Quantify metric-driven verification solution enables engineers to follow a systematic
and predictable path to functional sign-off of RTL designs. To learn more about OneSpin and
its products, please visit www.onespin.com. If you wish to have hands-on guidance on how
to integrate formal verification into your fault-tolerant hardware development flow, contact
support@onespin.com. OneSpin’s expert team will be delighted to help.

AUTHOR
Sergio Marchese is the Technical Marketing Manager at OneSpin Solutions. He started his
career at Infineon Technologies, applying coverage-driven constrained-random simulation
and formal methods to verify the TriCore CPU, an architecture widely used in today's
automotive SoCs.


Over the past 16 years, he has worked on solutions in many domains, including
communications, consumer, industrial and aerospace, in an effort to leverage the most
advanced formal tools and methodologies to implement rigorous and efficient hardware
development flows.

Marchese has also built and managed state-of-the-art teams, successfully signing off complex
hardware designs solely using formal verification.

REFERENCES AND FURTHER READING


1. ISO 26262 Standard - Road Vehicles – Functional Safety - Parts 1-10. 15 Nov. 2011.

2. Sergio Marchese. Using Formal to Verify Safety Critical Hardware for ISO 26262, Apr
2017, OneSpin White Paper.

3. M. Siegel and S. Beyer. Achieving 100% Functional Coverage by Operational
Assertion-based Verification, OneSpin White Paper.

4. Klaus Winkelmann. Capturing Timing Diagrams in Operational SVA, OneSpin White
Paper.

5. Haissam Ziade, Rafic Ayoubi, and Raoul Velazco. A Survey on Fault Injection
Techniques. The International Arab Journal of Information Technology, Vol. 1, No. 2,
July 2004.

6. T. Blackmore, S. Marchese, F. Bruno. Formal Verification of a Key Block of the
TriCore2 Microprocessor. Euro DesignCon, 2004.

7. T. Blackmore, F. Bruno, J. Bormann, S. Beyer, A. Maggiore, M. Siegel, S. Skalberg.
Complete Formal Verification of TriCore2 and Other Processors. DVCon, 2007.

8. H. Busch. Quantification of Formal Properties for Productive Automotive
Microcontrollers. DVCon USA, 2013.

9. H. Busch. An automated Formal Verification Flow for Safety Registers. DVCon Europe,
2015.

10. Traskov et al. Redundant Two-Processor Controller and Control Method. U.S. Patent
US 8,959,392 B2, issued February 17, 2015.
