
Watchdog Timer for Robust Embedded Systems

Pratyush Gehlot is an embedded engineer, working as a firmware developer at General Industrial Controls Pvt Ltd, Pune.
In a complex embedded system, a small bug may crash the whole system or, worse, put it into a dangerous
operating mode. Bugs are not the only problem: even a perfectly designed and tested device running flawless
code can still fail. A watchdog timer (WDT) is a safety mechanism that brings the system back to life when it
crashes. For this reason, it must be well designed and implemented for robust embedded system development.
A WDT is a piece of hardware that contains a timing device and a clock source. The timing device is a
free-running timer that is loaded with a certain value and decremented continuously. When the value reaches
zero, the WDT circuitry generates a short pulse that resets and restarts the system.

Fig. 1: External watchdog timer


Fig. 2: Internal watchdog timer
It is the application's responsibility to reload the WDT value each time before it reaches zero; otherwise, the
WDT circuitry will reset the system. Once reloaded, the timer starts decrementing again. In short, the WDT
constantly watches the execution of the code and resets the system if the software hangs or no longer
executes the correct sequence of code. Reloading the WDT value from software is called kicking the watchdog.
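The countdown-and-kick behaviour described above can be sketched as a small software model. This is only an illustration meant to run on a host machine: the names `wdt_model_t`, `wdt_start`, `wdt_kick` and `wdt_tick` are hypothetical and do not correspond to any real device's registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software model of a down-counting watchdog (hypothetical names). */
typedef struct {
    uint32_t reload;   /* value loaded on every kick                 */
    uint32_t counter;  /* current count, decremented once per tick   */
    bool     reset;    /* latched once the counter has reached zero  */
} wdt_model_t;

/* Load the counter and start counting down. */
static void wdt_start(wdt_model_t *w, uint32_t reload)
{
    w->reload = reload;
    w->counter = reload;
    w->reset = false;
}

/* Called by the application: reload the counter ("kick"). */
static void wdt_kick(wdt_model_t *w)
{
    w->counter = w->reload;
}

/* Called once per watchdog clock tick.
 * Returns true once the reset has been asserted. */
static bool wdt_tick(wdt_model_t *w)
{
    if (!w->reset && w->counter > 0 && --w->counter == 0) {
        w->reset = true;
    }
    return w->reset;
}
```

With a reload value of 3, the model asserts reset on the third tick after the most recent kick; as long as the application kicks earlier than that, the reset never occurs.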
Watchdog based design considerations
1. The clock source for the WDT must be separate, which means that it should not share the system clock.
Otherwise, if the crystal stops, even under normal operation (say, in sleep mode), the watchdog will not work.
2. Once WDT initialisation is complete and the WDT has started, the software should not be able to disable the
watchdog or modify its control registers, so that buggy code cannot accidentally disable it. Some processors
provide such a locking feature.
3. After the watchdog resets, the system must come back to a known state under any condition.
4. The watchdog reset sequence must ensure that all connected peripherals are also brought back to a known
state.
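The locking behaviour mentioned in point 2 can be modelled as a control register that ignores writes once a lock bit is set. The sketch below is a host-side model under that assumption; the bit positions and the names `WDT_CTRL_ENABLE`, `WDT_CTRL_LOCK` and `wdt_ctrl_write` are invented for illustration and differ from device to device.

```c
#include <stdbool.h>
#include <stdint.h>

#define WDT_CTRL_ENABLE (1u << 0)  /* hypothetical bit positions */
#define WDT_CTRL_LOCK   (1u << 7)

/* Model of a watchdog control register with a write-once lock. */
typedef struct {
    uint8_t ctrl;
} wdt_ctrl_model_t;

/* Write to the control register.
 * Returns false when the write is ignored because the register is locked. */
static bool wdt_ctrl_write(wdt_ctrl_model_t *w, uint8_t value)
{
    if (w->ctrl & WDT_CTRL_LOCK) {
        return false;  /* locked until the next reset */
    }
    w->ctrl = value;
    return true;
}
```

After initialisation writes `WDT_CTRL_ENABLE | WDT_CTRL_LOCK`, a later stray write of zero leaves the watchdog enabled.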
Types of watchdog timers
WDTs can be divided into two general categories: external WDT and internal WDT. Most microcontrollers have
an internal WDT. Various chip vendors also provide external WDT chips.
An external WDT drives the physical reset pin of the processor. An I/O pin of the processor is used to kick the
watchdog.
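As a rough sketch of kicking an external WDT through an I/O pin, the model below assumes that the chip restarts its timer on a transition of its input pin rather than on a level, which is common but should be checked against the data sheet of the actual part. All names (`ext_wdt_model_t`, `wdi_pin_write`, `ext_wdt_kick`, `ext_wdt_tick`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of an edge-triggered external watchdog. */
typedef struct {
    bool     wdi_level;         /* level last seen on the watchdog input */
    uint32_t timeout_ticks;     /* ticks allowed between two edges       */
    uint32_t ticks_since_edge;  /* ticks elapsed since the last edge     */
} ext_wdt_model_t;

/* Stand-in for a GPIO write to the pin wired to the watchdog input. */
static void wdi_pin_write(ext_wdt_model_t *w, bool level)
{
    if (level != w->wdi_level) {  /* only a transition counts as a kick */
        w->wdi_level = level;
        w->ticks_since_edge = 0;
    }
}

/* Kick the watchdog by toggling the pin. */
static void ext_wdt_kick(ext_wdt_model_t *w)
{
    wdi_pin_write(w, !w->wdi_level);
}

/* One watchdog clock tick; returns true while reset is asserted. */
static bool ext_wdt_tick(ext_wdt_model_t *w)
{
    if (w->ticks_since_edge < w->timeout_ticks) {
        w->ticks_since_edge++;
    }
    return w->ticks_since_edge >= w->timeout_ticks;
}
```

Under this assumption, a pin that is stuck at one level does not keep the system alive, because writing the same level again produces no edge.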
Non-watchdog based design problems
In 1994, a deep-space probe, Clementine, was launched to make observations of the Moon and a large
asteroid, 1620 Geographos. After months of operation, a software exception caused a control thruster to fire
for 11 minutes, which depleted most of the remaining fuel and caused the probe to rotate at 80rpm. Control
was eventually regained, but it was too late to successfully complete the mission.
There can always be a bug present in an embedded system design, even if the code is designed very
carefully. If the device operates in an electrically noisy environment, a high-voltage spike may corrupt the
program counter or stack pointer. Cosmic rays are also harmful to digital systems and can alter the
processor's register bits.
Software can cause the system to hang indefinitely, as in the case of an infinite loop, a buffer overflow or a
deadlock. In a small embedded device, it is easy to find the exact root cause of a bug, but not so in a complex
embedded system. However, by using a watchdog, we can ensure that the system will not hang indefinitely.
The system software should never hang indefinitely, whatever the situation. A general solution, in case it
does hang, is to reset the system, and this is where watchdogs in embedded systems come in handy.
Watchdog timer based system design
The software needs to kick the watchdog constantly. In some implementations, a sequence of bytes must be
written to the watchdog register to kick the watchdog. This reduces the chance that errant code kicks the
watchdog accidentally.
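A keyed kick of this kind can be modelled as a tiny state machine that accepts a kick only when two specific bytes arrive back to back. The key values 0xAA and 0x55 and the names `wdt_feed_model_t` and `wdt_feed_write` are chosen for illustration; the real sequence is defined by the device's data sheet.

```c
#include <stdbool.h>
#include <stdint.h>

#define WDT_KEY_FIRST  0xAAu  /* illustrative key values */
#define WDT_KEY_SECOND 0x55u

/* Model of a watchdog feed register that needs a two-byte sequence. */
typedef struct {
    bool     armed;  /* true after the first key byte has been seen */
    uint32_t kicks;  /* number of accepted kicks                    */
} wdt_feed_model_t;

/* Model of a write to the feed register.
 * Returns true only when the write completes a valid sequence. */
static bool wdt_feed_write(wdt_feed_model_t *w, uint8_t value)
{
    if (w->armed && value == WDT_KEY_SECOND) {
        w->armed = false;
        w->kicks++;
        return true;
    }
    /* Any other byte restarts the sequence. */
    w->armed = (value == WDT_KEY_FIRST);
    return false;
}
```

A single stray write, or the two bytes separated by some other value, does not count as a kick in this model.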
After the WDT times out, it asserts the processor reset line. Some processors and controllers can generate an
interrupt before resetting the device, which acts as an early warning of an upcoming watchdog reset. In this
interrupt, we can save useful information, such as the status register, in non-volatile memory and read it
back after recovery. From these reset logs, we can debug the root cause of the reset.
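One possible shape for such a reset log is sketched below. It assumes an early-warning interrupt is available and uses an ordinary variable as a stand-in for non-volatile storage; the record layout, the marker value and the names `reset_log_t`, `wdt_warning_isr` and `reset_log_take` are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define RESET_LOG_MAGIC 0x57444F47u  /* arbitrary marker for a valid record */

typedef struct {
    uint32_t magic;       /* marks the record as valid                 */
    uint32_t status_reg;  /* status register captured before the reset */
    uint32_t last_state;  /* application-defined progress marker       */
} reset_log_t;

/* Stand-in for non-volatile storage (EEPROM, flash, battery-backed RAM). */
static reset_log_t nv_reset_log;

/* Body of the early-warning interrupt: keep it short, reset is imminent. */
static void wdt_warning_isr(uint32_t status_reg, uint32_t last_state)
{
    nv_reset_log.status_reg = status_reg;
    nv_reset_log.last_state = last_state;
    nv_reset_log.magic = RESET_LOG_MAGIC;  /* written last */
}

/* Called during start-up: copies the record out and clears it.
 * Returns false when no watchdog reset was logged. */
static bool reset_log_take(reset_log_t *out)
{
    if (nv_reset_log.magic != RESET_LOG_MAGIC) {
        return false;
    }
    *out = nv_reset_log;
    nv_reset_log.magic = 0;
    return true;
}
```

Writing the marker last means that a record interrupted by the reset itself is not mistaken for a complete one.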
A watchdog can also be used to wake up the device from sleep or idle mode. In sleep mode, watchdog
timeout will not reset the system, but just cause it to wake up.
Simply enabling the WDT and kicking it regularly is not enough to ensure system reliability. To get the full
benefit, the watchdog must be implemented carefully as part of a robust design.
Watchdog time-out period
To select the watchdog time-out period, we must have a proper understanding of the software loop latency.
An unusually large number of interrupts may occur during a single pass of the loop, and the extra time spent
in the interrupt service routines (ISRs) will increase the main loop latency. A software delay routine will also
increase loop latency. A design with delays scattered in various places in the code makes the loop latency,
and hence the timing of the watchdog kick, hard to control, which can prove to be problematic.
For some time-critical applications, the recovery time from a watchdog reset is very important. In such a
system, the time-out period needs to be very precise. After a watchdog reset, the system must boot up as fast
as possible. For example, in the case of a pacemaker, the system must boot up almost within a heartbeat.
The initialisation after a watchdog reset should be much shorter than power-on initialisation.
Very short time-out periods may lead to the system resetting unnecessarily. If the
system is not time-critical, it is better to choose a time-out of the order of seconds.
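A back-of-the-envelope way to turn these considerations into a number is sketched below. The formula (nominal loop time plus worst-case interrupt load plus software delays, with a percentage of headroom on top) is an assumption made for illustration, not a rule from any standard, and the function names are hypothetical.

```c
#include <stdint.h>

/* Worst-case main-loop latency in microseconds:
 * nominal loop time + worst-case interrupt load + software delays. */
static uint32_t worst_case_loop_us(uint32_t loop_us, uint32_t max_irqs,
                                   uint32_t isr_us, uint32_t delay_us)
{
    return loop_us + max_irqs * isr_us + delay_us;
}

/* Smallest time-out that leaves margin_pct percent of headroom. */
static uint32_t min_timeout_us(uint32_t worst_us, uint32_t margin_pct)
{
    return worst_us + (worst_us * margin_pct) / 100u;
}
```

For instance, a 2000µs loop that can be hit by up to 20 interrupts of 50µs each and contains 500µs of delays has a worst-case latency of 3500µs; with 50 per cent headroom, the time-out should be at least 5250µs.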
Implementation of watchdog timer for single-thread software design
The traditional approach for a single-thread design is to kick the WDT at the end of the main loop.

Fig. 3: Traditional watchdog kicking inside the main loop

In a single-thread design, we can use a state-machine-like architecture as shown in the code snippet below.
Increment the state variable at three different sections of the code, each of which is guaranteed to execute
once per pass of the loop. At the end of the main loop, check the state value; if it is three, it means that the
code has executed in the proper sequence. Then, kick the watchdog and clear the state variable. If the state
value is not three, it means there is some fault in the execution of the code. In this case, do not kick the
watchdog, so that the system will reset after the watchdog time-out.
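----------------CODE-----------------
#include <stdint.h>

/* Hypothetical device-specific routine that reloads the WDT. */
extern void kick_watchdog(void);

static uint8_t State = 0;

int main(void)
{
    for ( ; ; )
    {
        if (State == 0) State = 0x01; /* checkpoint 1 */
        ...
        ...
        if (State == 1) State = 0x02; /* checkpoint 2 */
        ...
        ...
        if (State == 2) State = 0x03; /* checkpoint 3 */
        ...
        ...
        if (State == 0x03)
        {
            /* All three sections ran in order during this pass. */
            kick_watchdog();
            State = 0;
        }
    }
}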
----------------CODE-----------------
On some microcontrollers, the built-in watchdog has a maximum time-out of the order of a few hundred
milliseconds. If the main loop scan time is longer than the maximum allowed watchdog time-out, we
need to extend the time-out in software.
For example, with a main loop latency of 500ms and a maximum allowed watchdog time-out period of 100ms
(which means that the watchdog must be kicked within 100ms), kicking from the main loop is not possible. In
this case, we can configure one of the processor's internal timers as a 50ms free-running timer and define a
state flag that is set to Alive at the end of the main loop.
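----------------CODE-----------------
typedef enum { RESET, ALIVE, UNKNOWN } state_t;

volatile state_t State = RESET; /* shared with the 50ms timer ISR */

int main(void)
{
    for ( ; ; )
    {
        ...
        ...
        State = ALIVE; /* one full pass of the main loop completed */
    }
}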
----------------CODE-----------------
In every 50ms ISR, increment a count and check the state flag. Kick the watchdog only if the state is not
Unknown. When the count reaches ten (500ms have elapsed), the ISR also evaluates the state flag. If the state
is Alive, it means that the program is running correctly, and the ISR sets the state back to Reset for the next
window. Otherwise, it sets the state to Unknown. This indicates that there is some problem in the execution of
the code, so the ISR will not kick the watchdog anymore and the system will restart after the watchdog
time-out of 100ms.
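----------------CODE-----------------
static unsigned char Count = 0;

void ISR(void) /* 50ms free running */
{
    Count++;
    if (Count >= 10) /* 10 x 50ms = 500ms window elapsed */
    {
        Count = 0;

        if (State == ALIVE)
        {
            State = RESET; /* main loop must set ALIVE again */
        }
        else
        {
            State = UNKNOWN; /* main loop missed the window */
        }
    }

    if (State != UNKNOWN)
    {
        kick_watchdog(); /* hypothetical device-specific reload */
    }
}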
----------------CODE-----------------
Never kick the watchdog unconditionally in an ISR, or devote an RTOS task solely to this activity, because, if
the main code crashes, interrupts (and even the scheduler) may continue to run, so the watchdog never times
out. Such an approach is not recommended, as it gives us no idea whether any code other than the timer ISR
is working.
Implementation of watchdog timer for RTOS based application
In a multitasking environment, there are a couple of independent loops running in parallel, known as tasks.
The scheduler schedules each task based on priority. To validate that each task is running properly, each task
must contribute to the decision to kick the watchdog.
To implement the watchdog mechanism in an RTOS environment, we can design a separate task that will
monitor the status of all running tasks; we can call this the watchdog task. Only this task gets the privilege
of kicking the watchdog.

Fig. 4: Watchdog design for RTOS, approach 1

Fig. 5: Watchdog design for RTOS, approach 2


Let us take an approach in which there is a status byte and each bit of this byte is associated with a task. For
example, our system has three tasks running and each task will set corresponding bits in the status flag at
the end of its body.
When the watchdog task wakes up, it will check whether all three bits are set (which means that all tasks
are running properly). If so, it will kick the watchdog and clear the status flag. In this case, the priority of the
watchdog task must be lower than that of the other system tasks. Once the watchdog task has completed its
execution, it sleeps for less than the watchdog time-out period.
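The status-byte scheme can be sketched in portable C as follows. No RTOS calls are used, so that the logic can be tried on a host; the names `task_checkin` and `watchdog_task_check` are hypothetical, and on a real RTOS the update of the shared byte would need a critical section or an atomic operation.

```c
#include <stdbool.h>
#include <stdint.h>

#define TASK_COUNT     3u
#define ALL_TASKS_MASK ((uint8_t)((1u << TASK_COUNT) - 1u))  /* 0x07 */

/* One bit per task; shared between the tasks and the watchdog task. */
static volatile uint8_t status_flags;

/* Each task calls this at the end of its body with its index (0..2). */
static void task_checkin(uint8_t task_id)
{
    /* Read-modify-write: protect this on a real RTOS. */
    status_flags |= (uint8_t)(1u << task_id);
}

/* One wake-up of the watchdog task.
 * Returns true when every task has checked in, in which case the
 * caller kicks the hardware watchdog. */
static bool watchdog_task_check(void)
{
    if (status_flags == ALL_TASKS_MASK) {
        status_flags = 0;
        return true;
    }
    return false;
}
```

If any one task fails to check in, the byte never reaches 0x07, the watchdog is not kicked and the hardware time-out resets the system.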
This approach to watchdog design for an RTOS (Fig. 4) will work well if all tasks, including the watchdog task,
execute at least once within the watchdog reset period. But if any of the tasks sleeps for a couple of seconds,
or has to wait for an event, this approach will not work.
We can implement it in a better way by using message queues, where each task blocks on its message
queue. The watchdog task will post messages to all tasks and go into sleep mode for a specified time interval
(less than the watchdog time-out period).
After the arrival of the message in the message queue, the tasks will wake up one by one based on priority.
Each task reads the message and, if it has been woken up by the watchdog task, it will set the
corresponding bit in the status flag.
When the watchdog task wakes up, it checks the status flag. If all corresponding bits are set, it kicks the
watchdog and clears the status flag. In this approach, the watchdog task must have higher priority than all
other system tasks.
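The message-based scheme can be sketched in the same host-side style. A one-slot mailbox per task stands in for a real RTOS message queue, which is a deliberate simplification (a real queue would hold several messages and block the reader), and all names here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TASK_COUNT     3u
#define ALL_TASKS_MASK ((uint8_t)((1u << TASK_COUNT) - 1u))  /* 0x07 */

typedef enum { MSG_NONE, MSG_WDT_PING, MSG_WORK } msg_t;

static msg_t   mailbox[TASK_COUNT];  /* one-slot stand-in for each queue */
static uint8_t status_flags;         /* one bit per task                 */

/* Watchdog task, just before it sleeps: ping every task. */
static void watchdog_post_pings(void)
{
    for (uint8_t i = 0; i < TASK_COUNT; i++) {
        mailbox[i] = MSG_WDT_PING;
    }
}

/* One pass of a task after it unblocks on its queue. */
static void task_handle_message(uint8_t task_id)
{
    msg_t msg = mailbox[task_id];
    mailbox[task_id] = MSG_NONE;
    if (msg == MSG_WDT_PING) {
        status_flags |= (uint8_t)(1u << task_id);  /* answer the ping */
    }
    /* MSG_WORK and other messages would be handled here. */
}

/* Watchdog task, after it wakes up.
 * Returns true when every task answered, in which case the caller
 * kicks the hardware watchdog. */
static bool watchdog_collect(void)
{
    if (status_flags == ALL_TASKS_MASK) {
        status_flags = 0;
        return true;
    }
    return false;
}
```

Because a task only has to answer the ping, it can otherwise stay blocked on its queue for long periods without causing a false watchdog reset.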
The selection of priority of the watchdog task is very important as it depends on the design architecture of
the system.