
HARDWARE & RELIABILITY
A quick glimpse
Reliability of electronic hardware
• (Traditionally) Assured through
• Design
• (Hopefully good) model of the product
• Production
• Carried out according to experience-based standards
• Experience from "yesterday's techniques"?
• Reliability testing is often not carried out, and if it is, reliability is verified once the design is complete.
• Does faster adoption of new materials and packaging call for a change in methodology?
Failure rate (recap)
• Component vs system
• Any component fails, whole system fails?
• Exponential Failure Law
• Failure rate assumed constant
• Measured in FITs (failures per 10^9 hours)
• Good approximation if past infant mortality period
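A minimal numerical sketch of the exponential failure law, assuming the usual form R(t) = exp(-lambda*t) with a constant failure rate lambda. The function names and the 500 FIT example value are illustrative assumptions, not taken from the slides.

```python
import math

HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def fit_to_lambda(fit):
    """Convert a failure rate in FIT to failures per hour."""
    return fit / HOURS_PER_FIT_UNIT


def reliability(fit, hours):
    """Probability of surviving `hours`, assuming a constant failure rate."""
    return math.exp(-fit_to_lambda(fit) * hours)


def mttf_hours(fit):
    """Mean time to failure (1/lambda) under the same assumption."""
    return HOURS_PER_FIT_UNIT / fit


if __name__ == "__main__":
    fit = 500.0                # hypothetical component failure rate
    ten_years = 10 * 365 * 24  # hours
    print("lambda per hour:", fit_to_lambda(fit))
    print("R(10 years):", round(reliability(fit, ten_years), 4))
    print("MTTF in hours:", mttf_hours(fit))
```

The constant-rate assumption only holds past the infant mortality period, so the numbers are an approximation for the flat part of the failure-rate curve.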
Combining hardware modules
• Series

• All components need to work for system to work


• R_sys = R_A*R_B*R_C
• Parallel
• At least one component needs to work
• R_sys = R_A*R_B*R_C + (1-R_A)*R_B*R_C + (1-R_B)*R_A*R_C + (1-R_C)*R_A*R_B
  + (1-R_A)*(1-R_B)*R_C + (1-R_A)*(1-R_C)*R_B + (1-R_B)*(1-R_C)*R_A
  = 1 - (1-R_A)*(1-R_B)*(1-R_C)
• Mix
• Can tolerate one of the two B components failing
• R_sys = R_A*(1-(1-R_B)^2)*R_C
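The series, parallel and mixed formulas can be checked with a short sketch. It assumes independent component failures; the helper names and the value 0.9 are illustrative only.

```python
from math import prod


def series(*r):
    """All components must work: product of the reliabilities."""
    return prod(r)


def parallel(*r):
    """At least one component must work: 1 minus the product of the unreliabilities."""
    return 1.0 - prod(1.0 - x for x in r)


def mixed(r_a, r_b, r_c):
    """A in series with two parallel copies of B, in series with C."""
    return series(r_a, parallel(r_b, r_b), r_c)


if __name__ == "__main__":
    r = 0.9  # hypothetical reliability of every component
    print("series  :", round(series(r, r, r), 4))
    print("parallel:", round(parallel(r, r, r), 4))
    print("mixed   :", round(mixed(r, r, r), 4))
```

With equal component reliabilities the ordering is always series <= mixed <= parallel, which matches the intuition that every added redundant path helps.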
Embedded reliability
• reliability of an embedded application
• designed to run continuously for years without errors, often in hostile
environments that no desktop PC could endure
• sophisticated testing processes and performance evaluation techniques let a
company estimate how its design would hold up over the years, without having
to wait that long

• HALT: Highly Accelerated Life Testing


• time compression testing protocol that utilizes a step stress approach in
subjecting products to varied thermal and vibration stresses
• used to uncover design limitations and weaknesses in the preliminary design phase,
helping designers of embedded systems to find and fix errors before they occur in the
field
• boards are subjected to stress conditions using repeatable testing techniques
• thermal step stress
• rapid thermal transitions
• vibration step stress
• combined temperature and vibration environments
4 TS-7250 RevA units submitted to the HALT process
Placement of measurement equipment: thermocouples, a spectrum analyzer, accelerometers, a data acquisition interface and a signal conditioner
Results during the combined temperature and vibration cycle
The testing unit was exposed to 10 ½ rapid temperature cycles, starting at -50°C to +110°C and ending at -85°C to +120°C. The vibration level was set to 5 Grms for the first temperature cycle and then increased in 5 Grms increments before each additional cycle through cycle 5. For cycles 6 through 11, vibration levels started at 25 Grms and ended at 65 Grms. The dwell time at each temperature extreme was 10 minutes, and the thermal transition rate was set to the chamber maximum (> 60°C/min, empty table).
Failure analysis
• Microscopes
  • Optical microscope
  • Liquid crystal
  • Scanning acoustic microscope (SAM)
  • Scanning acoustic tomography (SCAT)
  • Atomic force microscope (AFM)
  • Stereomicroscope
  • Photoemission electron microscope (PEM)
  • X-ray microscope
  • Infra-red microscope
  • Scanning SQUID microscope
• Sample preparation
  • Jet-etcher
  • Plasma etcher
  • Back side thinning tools
    • Mechanical back-side thinning
    • Laser chemical back-side etching
• Scanning electron microscopy
  • Scanning electron microscope (SEM)
    • Electron beam induced current (EBIC) in SEM
    • Charge-induced voltage alteration (CIVA) in SEM
    • Voltage contrast in SEM
    • Electron backscatter diffraction (EBSD) in SEM
    • Energy-dispersive X-ray spectroscopy (EDS) in SEM
  • Transmission electron microscope (TEM)
  • Computer-controlled scanning electron microscope (CCSEM)
• Device modification
  • Focused ion beam etching (FIB)
• Surface analysis
  • Dye penetrant inspection
  • Other surface analysis tools
• Laser signal injection microscopy (LSIM)
  • Photo carrier stimulation
    • Static
      • Optical beam induced current (OBIC)
      • Light-induced voltage alteration (LIVA)
    • Dynamic
      • Laser-assisted device alteration (LADA)
  • Thermal laser stimulation (TLS)
    • Static
      • Optical-beam-induced resistance change (OBIRCH)
      • Thermally induced voltage alteration (TIVA)
      • External induced voltage alteration (XIVA)
      • Seebeck effect imaging (SEI)
    • Dynamic
      • Soft defect localization (SDL)
• Semiconductor probing
  • Mechanical probe station
  • Electron beam prober
  • Laser voltage prober
  • Time-resolved photon emission prober (TRPE)
• Software-based fault location techniques
  • CAD Navigation
  • Automatic test pattern generation (ATPG)
• Spectroscopic analysis
  • Transmission line pulse spectroscopy (TLPS)
  • Auger electron spectroscopy
  • Deep-level transient spectroscopy (DLTS)
Problems
Fault Tolerance
• Ability of system to continue error-free operation in
presence of unexpected fault

• Fault Tolerance requires some form of redundancy


• Time Redundancy
• Hardware Redundancy
• Information Redundancy
Redundancy comparison
• Time
  • How does it work: Perform the same operation twice and see if the result is the same both times; if not, a fault occurred
  • + Little to no hardware overhead
  • + Can detect temporary faults
  • - Impacts system or circuit performance
  • - Cannot detect permanent faults
• Hardware
  • How does it work: Replicate hardware and compare outputs from two or more modules
  • + Little or no performance impact
  • + Detects both permanent and temporary faults
  • - Area and power for redundant hardware
• Information
  • How does it work: Encode outputs with an error detecting or correcting code; code selected to minimize redundancy for the class of faults
  • + Less hardware to generate redundant information than replicating the module
  • - Added complexity in design
Information redundancy
Error detecting codes
Rollback
Checkpoint
Coding information - multiplication
Coding information - division
LFSR
• With characteristic polynomial equal to g(x)
• Append n-k 0’s to end of message
• Ex:
• m(x) = x^2+x+1 and g(x) = x^3+x+1
Check result
• Shift codeword into LFSR
• with same characteristic polynomial as used to generate it
• If final state of LFSR non-zero, then error
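The encode and check steps can be sketched in software for the example m(x) = x^2+x+1, g(x) = x^3+x+1. This is a bit-level model of the polynomial division that the LFSR performs serially, not a description of any particular hardware. Polynomials are stored as integer bit masks, and the function names are hypothetical.

```python
def poly_mod(value, gen):
    """Remainder of value(x) / gen(x) over GF(2); bit i is the coefficient of x^i."""
    gen_deg = gen.bit_length() - 1
    while value and value.bit_length() - 1 >= gen_deg:
        # Cancel the current leading term by XOR-ing in a shifted copy of gen.
        value ^= gen << (value.bit_length() - 1 - gen_deg)
    return value


def encode(message, gen):
    """Append n-k zeros to the message, then replace them with the remainder."""
    check_bits = gen.bit_length() - 1  # n-k
    shifted = message << check_bits
    return shifted | poly_mod(shifted, gen)


def has_error(codeword, gen):
    """A non-zero remainder (final LFSR state) signals an error."""
    return poly_mod(codeword, gen) != 0


M = 0b111   # m(x) = x^2 + x + 1
G = 0b1011  # g(x) = x^3 + x + 1

codeword = encode(M, G)
print(format(codeword, "06b"))             # 111010
print(has_error(codeword, G))              # False
print(has_error(codeword ^ 0b000100, G))   # True: one flipped bit
```

The remainder of x^3*m(x) divided by g(x) is x, so the three appended zeros become 010 and the transmitted codeword is 111010.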
Selecting Generator Polynomial
• If first and last bit of polynomial are 1
• Will detect burst errors of length n-k or less
• If generator polynomial is multiple of (x+1)
• Will detect any odd number of errors
• If g(x) = (x+1)p(x) where p(x) is primitive of degree n-k-1 and
n < 2^(n-k-1)
• Will detect single, double, triple, and odd errors
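These properties can be checked by brute force on a small case. The sketch below assumes g(x) = (x+1)(x^3+x+1) = x^4+x^3+x^2+1 and a codeword length n = 7, which satisfies n < 2^(n-k-1) = 8. The polynomial, the length and the helper names are illustrative choices, not taken from the slides. An error pattern e(x) is detected exactly when e(x) mod g(x) is non-zero.

```python
def poly_mod(value, gen):
    """Remainder of value(x) / gen(x) over GF(2); bit i is the coefficient of x^i."""
    gen_deg = gen.bit_length() - 1
    while value and value.bit_length() - 1 >= gen_deg:
        value ^= gen << (value.bit_length() - 1 - gen_deg)
    return value


G = 0b11101                  # (x+1)(x^3+x+1), first and last bit are 1
N = 7                        # codeword length n
CHECK = G.bit_length() - 1   # n-k = 4


def detected(error):
    return poly_mod(error, G) != 0


def weight(error):
    return bin(error).count("1")


def burst_length(error):
    """Distance from the lowest to the highest flipped bit, inclusive."""
    lowest = (error & -error).bit_length() - 1
    return error.bit_length() - lowest


errors = range(1, 1 << N)  # every non-zero error pattern of length n
bursts = [e for e in errors if burst_length(e) <= CHECK]
odd = [e for e in errors if weight(e) % 2 == 1]
up_to_triple = [e for e in errors if weight(e) <= 3]

print(all(detected(e) for e in bursts))        # True
print(all(detected(e) for e in odd))           # True
print(all(detected(e) for e in up_to_triple))  # True
```

The only undetected patterns at this length are the non-zero codewords themselves, which all have weight 4.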
Memory ECC Architecture
Hamming Code for ECC RAM
Hardware redundancy
gate-level
module-level
chip-level
board-level
Hardware redundancy forms
• Static / Passive
  • Masks faults rather than detects them
  • Provides uninterrupted operation
  • Important for real-time systems
    • No time to reconfigure or retry operation
  • Simple, self-contained
    • No need to update or rollback system state
• Dynamic / Active
  • Detects faults and reconfigures to spare hardware
  • Involves
    • Detecting fault
    • Locating faulty hardware unit
    • Reconfiguring system to use spare fault-free hardware unit
• Hybrid
  • Combines active and passive approaches
  • Masks faults like static
  • Detects and reconfigures like dynamic
Simple examples – static & dynamic
• R_SS = R + (1-R)*R = 2*R - R^2
• R_TMR = R*R*R + (1-R)*R*R + R*(1-R)*R + R*R*(1-R) = 3*R^2 - 2*R^3
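A small sketch comparing the two formulas for a few module reliabilities. It assumes that R_SS describes one module plus one spare with perfect fault detection and switching, and that R_TMR uses a perfect voter. The slides do not state these assumptions explicitly, and the function names are illustrative.

```python
def r_standby(r):
    """R_SS = R + (1-R)*R: the primary works, or it fails and the spare works."""
    return r + (1.0 - r) * r


def r_tmr(r):
    """R_TMR = R^3 + 3*(1-R)*R^2: at least two of three modules work."""
    return r**3 + 3.0 * (1.0 - r) * r**2


if __name__ == "__main__":
    for r in (0.3, 0.5, 0.9, 0.99):
        print(f"R={r:.2f}  R_SS={r_standby(r):.4f}  R_TMR={r_tmr(r):.4f}")
```

Under these assumptions TMR only improves on a single module when R > 0.5; at R = 0.5 the two are equal, and below that the voter is outvoted by faulty modules more often than not.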
Static redundancy – interwoven logic
• Replace each gate with 4 gates using an interconnection pattern
that automatically corrects errors
• Traditionally not as attractive as TMR
• Requires lots of area overhead
• Renewed interest by researchers investigating emerging
nanoelectronic technologies
Hybrid redundancy - Self-Purging Redundancy
• Uses threshold voter instead of majority voter
• Threshold voter outputs 1 if the number of inputs that are 1 is greater than
the threshold
• Otherwise outputs 0
• Requires hot spares
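A minimal sketch of how a threshold voter and purging could interact. It assumes that a module whose output disagrees with the voter is switched out and no longer counted, which is one common reading of self-purging; the slides do not spell this out. The class name, the 5-module configuration and the threshold of 1 are hypothetical.

```python
class SelfPurgingSystem:
    """Hot modules feed a threshold voter; disagreeing modules are purged."""

    def __init__(self, n_modules, threshold):
        self.active = [True] * n_modules
        self.threshold = threshold

    def vote(self, outputs):
        # Count the 1s among the modules that are still active.
        ones = sum(1 for bit, on in zip(outputs, self.active) if on and bit == 1)
        result = 1 if ones > self.threshold else 0
        # Purge every active module that disagrees with the voter output.
        for i, bit in enumerate(outputs):
            if self.active[i] and bit != result:
                self.active[i] = False
        return result


system = SelfPurgingSystem(n_modules=5, threshold=1)
print(system.vote([1, 1, 1, 1, 0]))  # 1, module 4 disagrees and is purged
print(system.vote([0, 0, 0, 1, 1]))  # 0, module 3 is purged, module 4 is ignored
print(system.active)                 # [True, True, True, False, False]
```

Because purged modules contribute nothing to the count, a fixed threshold keeps working as the number of active modules shrinks, which a majority voter would not do without reconfiguration.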
Time redundancy
Repeated execution
Multi-threaded redundant execution
Multiple sampling of outputs
Multiple Sampling of Outputs
• Done at circuit-level
• Sample once at end of normal clock cycle
• Sample again after a delay
• The two samples are compared; a mismatch indicates an error
• Detects faults whose duration is less than the delay
• Performance overhead depends on size of delay relative
to normal clock period
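The double-sampling idea can be illustrated with a toy model. The sketch assumes an output that should hold a constant value and a transient fault that inverts it for a fixed time window. The signal model, time units and function names are hypothetical.

```python
def make_signal(correct, glitch_start, glitch_length):
    """Return a function of time that is inverted while the glitch is active."""
    def signal(t):
        if glitch_start <= t < glitch_start + glitch_length:
            return 1 - correct
        return correct
    return signal


def double_sample(signal, t_clock, delay):
    """Sample at the clock edge and again after `delay`; report any mismatch."""
    first = signal(t_clock)
    second = signal(t_clock + delay)
    return first, second, first != second


# Glitch shorter than the delay: the second sample sees the correct value.
short = make_signal(correct=1, glitch_start=9, glitch_length=2)
print(double_sample(short, t_clock=10, delay=3))   # (0, 1, True)

# Glitch longer than the delay: both samples are wrong, so nothing is detected.
long_ = make_signal(correct=1, glitch_start=9, glitch_length=5)
print(double_sample(long_, t_clock=10, delay=3))   # (0, 0, False)
```

A longer delay catches longer transients but also adds time to every cycle, which is the trade-off against the normal clock period.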
Solutions
Summary