
Embedded Software Gems

Ahmed Tolba
xgameprogrammer@hotmail.com
Vienna, Austria.

1. Introduction to Data Structures and Algorithms in Embedded Software
2. Introduction to Design Patterns and Software Design in Embedded Software
3. Introduction to SIMD in ARM Cortex Architecture
4. Math Concepts for Embedded Software Engineers

This is a tutorial handbook, which means you will need to do a lot of your own
research on the topics presented here. I give you the essential information and
applications; you are expected to work through them and search for further details
yourself. This is not a textbook that explains every detail of each subject. I have
grouped material from several books, which are listed in the reference section, and
I tried my best to collect every piece of information that I wanted to learn about.
DI Ahmed Tolba
Introduction to Software Architecture in
Embedded Software
Architectural Coupling
Coupling refers to how closely related different modules or classes are to each other
and the degree to which they are interdependent. The degree to which the
architecture is coupled determines how well a developer can achieve their
architectural goals. For example, if I want to develop a portable architecture, I need
to ensure that my architecture has low coupling.
There are several different types and causes for coupling to occur in a software
system. First, common coupling occurs when multiple modules have access to the
same global variable(s). In this instance, code can’t be easily ported to another
system without bringing the global variables along. In addition, the global variables
become dangerous because any module can access them in the system. Easy access
encourages “quick and dirty” access to the variables from other modules, which then
increases the coupling even further. The modules have a dependency on those
globally shared variables.
Another type of coupling that often occurs is content coupling. Content coupling
is when one module accesses another module’s functions and APIs. While at first
this seems reasonable because data might be encapsulated, developers have to be
careful how many function calls the module depends on. It’s possible to create not
just tightly coupled dependencies but also circular dependencies that can turn the
software architecture into a big ball of mud.
Coupling is most easily seen when you try to port a feature from one code base to
another. I think we’ve all gone through the process of grabbing a module, dropping
it into our new code base, compiling it, and then discovering a ton of compilation
errors. Upon closer examination, there is a module dependency that was overlooked.
So, we grab that dependency, put it in the code, and recompile. More compilation
errors! Adding the new module quadrupled the number of errors! It made things
worse, not better. Weeks later, we finally decided it’s faster to just start from scratch.
Software architects must carefully manage their coupling to ensure they can
successfully meet their architecture goals. Highly coupled code is always a
nightmare to maintain and scale. I would not want to attempt to port highly coupled
code, either. Porting tightly coupled code is time-consuming, stressful, and not fun!

Architectural Cohesion
The coupling is only the first part of the story. Low module coupling doesn’t
guarantee that the architecture will exhibit good characteristics and meet our goals.
Architects ultimately want to have low coupling and high cohesion. Cohesion
refers to the degree to which the module or class elements belong together.
In a microcontroller environment, a low cohesion example would be lumping
every microcontroller peripheral function into a single module or class. The
module would be large and unwieldy. Instead, a base class could be created
that defines the common interface for interacting with peripherals. Each peripheral
could then inherit from that interface and implement the peripheral-specific
functionality. The result is a highly cohesive architecture, low coupling, and other
desirable characteristics like reusable, portable, scalable, and so forth. Cohesion is
really all about putting “things” together that belong together. Code that is highly
cohesive is easy to follow because everything needed is in one place. Developers
do not have to search and hunt through the code base to find related code. For
example, I often see developers using an RTOS (real-time operating system)
spread their task creation code throughout the application. The result is low
cohesion. Instead, I pull all my task creation code into a single module so that I
only have one place to go. The task creation code is highly cohesive, and it’s easily
ported and configured as well.
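As a minimal sketch of that idea, assuming a FreeRTOS-based application (the
module and task names here are illustrative, not from any particular project), all
task creation lives in one module:

/* task_config.c - centralized, highly cohesive task creation */
#include "FreeRTOS.h"
#include "task.h"

extern void Sensor_Task(void *params);
extern void Display_Task(void *params);

void Tasks_CreateAll(void)
{
/* every task in the application is created here, in one place */
xTaskCreate(Sensor_Task, "Sensor", 256, NULL, 2, NULL);
xTaskCreate(Display_Task, "Display", 256, NULL, 1, NULL);
}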
Now that we have a fundamental understanding of the characteristics we are
interested in as embedded architects and developers, let’s examine architectural
design patterns that are common in the industry.
The Unstructured Monolithic Architecture
The great sage Wikipedia describes a monolithic application as “a single-tiered
software application in which the user interface and data access code are combined
into a single program from a single platform.” An unstructured monolithic
architecture looks like a tangle of modules that all depend directly on each other.
The architecture is tightly coupled, which makes it extremely difficult to reuse,
port, and maintain. Individual modules or classes might have high cohesion, but
the coupling is out of control.
An unstructured monolith was one of the most common architectures in embedded
systems.
Layered Monolithic Architectures
I would like to argue that layered monolithic architectures are the most common
architecture used in embedded applications today. The layered architecture allows
the architect to separate the application into various independent layers that
interact only through well-defined abstraction layers. You have probably seen
layered monolithic architectures many times.
A layered monolithic application attempts to improve the high coupling of an
unstructured monolithic architecture by breaking the application into independent
layers. Each layer should only be allowed to communicate with the layer directly
above or below it through an abstraction layer. The layers help to break the
coupling, which allows layers to be swapped in and out as needed.
Porting between different microcontrollers is a good example: if the driver layer
is placed behind a standard hardware abstraction layer (HAL), the application
depends only on the HAL, not on the underlying hardware.
Leveraging APIs and abstraction layers is one of the most significant benefits of
a layered monolithic architecture. It breaks the coupling between layers, allows
easier portability and reuse, and can improve maintainability.
As layers are added to an application, performance can take a hit, since the
application can no longer access hardware directly and clock cycles must be spent
traversing the layers. In addition to the performance hit, a layered architecture
requires more time up front to design properly. Finally, the code base does tend to
be a little bit bigger. Despite these disadvantages, modern microcontrollers in most
instances have more than enough processing power to overcome them.
Consider an example of a modern, layered embedded architecture.

First, note how we can exchange the driver layer for working with nearly any
hardware by using a hardware abstraction layer. For example, a common HAL
today can be found in Arm’s CMSIS.
The HAL again decouples the hardware drivers from the above code, breaking the
dependencies.
Next, notice how we don’t even allow the application code to depend on an RTOS
or OS. Instead, we use an operating system abstraction layer (OSAL). If the team
needs to change RTOSes, which does happen, they can just integrate the new
RTOS without having to change a bunch of application code. I’ve encountered
many teams that directly make calls to their RTOS APIs, only later to decide they
need to change RTOSes. An example of OSAL can be found in CMSIS-RTOS2.
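As a minimal sketch of what an OSAL interface can look like (the names below are
illustrative, not the actual CMSIS-RTOS2 API):

/* osal.h - the thin layer the application codes against */
typedef void (*osal_task_fn)(void *arg);

int OSAL_TaskCreate(osal_task_fn fn, void *arg, unsigned priority);
void OSAL_DelayMs(unsigned ms);

/* osal_freertos.c would implement these with xTaskCreate()/vTaskDelay();
an Azure RTOS port would implement the same two functions with its own
thread calls. Changing RTOSes means swapping only this implementation file. */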
Next, the board support package exists outside the driver layer! At first, this may
seem counterintuitive. Shouldn’t hardware like sensors, displays, and so forth be in
the driver layer? I view the hardware and driver layer as dedicated to only the
microcontroller. Any sensors and other devices connected to the microcontroller
should communicate through the HAL. For example, a sensor might be on the
I2C bus. The sensor would depend on the I2C HAL, not the low-level hardware.
The abstraction dependency makes it easier for the BSP to be ported to other
applications.
Finally, we can see that even the middleware should be wrapped in an abstraction
layer. If someone is using a TLS library or an SD card library, you don’t want your
application to be dependent on these. Again, I look at this as a way to make code
more portable, but it also isolates the application so that it can be simulated and
tested off target.

Event-Driven Architectures
Event-driven architectures make a lot of sense for real-time embedded applications
and applications concerned with energy consumption. In an event-driven
architecture, the system is generally in an idle state or low-power state unless an
event triggers an action to be performed. For example, a widget may be in a
low-power idle state until a button is clicked. Clicking the button triggers an event
that sends a message to a message processor, which then wakes up the system.

Event-driven architectures typically utilize interrupts to respond to the event
immediately. However, processing the event is usually offloaded to a central
message processor or a task that handles the event. Therefore, event-driven
architectures often use message queues, semaphores, and event flags to signal that
an event has occurred in the system.
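To make the flow concrete, here is a minimal sketch assuming FreeRTOS (the
event type and names are illustrative): an interrupt posts an event to a queue, and
the central message processor task blocks on that queue.

#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"

typedef enum { EVENT_BUTTON_PRESSED, EVENT_SENSOR_SAMPLE } event_t;

static QueueHandle_t eventQueue; /* created at init: xQueueCreate(8, sizeof(event_t)) */

void Button_ISR(void)
{
event_t e = EVENT_BUTTON_PRESSED;
BaseType_t woken = pdFALSE;
xQueueSendFromISR(eventQueue, &e, &woken); /* signal the event */
portYIELD_FROM_ISR(woken);
}

void MessageProcessor_Task(void *params)
{
event_t e;
for (;;) {
/* block (consuming no CPU) until an event arrives */
if (xQueueReceive(eventQueue, &e, portMAX_DELAY) == pdPASS) {
/* dispatch to the handler registered for e */
}
}
}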
The event-driven architecture has many benefits. First, it is relatively scalable. For
example, if the widget needed to filter sensor data when a new sample became
available, the architecture could easily be extended to do so. New events can be
added to the software by adding an event handler, an event message, and the
function that handles the event.

Another benefit to the event-driven architecture is that software modules
generally have high cohesion. Each event can be separated and focused on just a
single purpose.
One last benefit to consider is that the architecture has low coupling. Each event
minimizes dependencies. The event occurs and needs access only to the message
queue that is input to the central message processor. Message queue access can be
passed in during initialization, decoupling the module from the messaging system.
Even the message processor can have low coupling. The message processor needs
access to the list of messages it accepts and either the function to execute or a new
message to send. Depending on the implementation, the coupling can seem higher;
however, the message processor can take configuration tables during its
initialization to minimize the coupling.
The disadvantage to using an event-driven architecture with a central message
processor is that there is additional overhead and complexity whenever anything
needs to be done. For example, instead of a button press just waking the system up,
it needs to send a message that then needs to be processed that triggers the event.
The result is extra latency, a larger code base, and complexity. However, a trade-
off is made to create a more scalable and reusable architecture.
If performance is a concern, architects can use an event-driven architecture that
doesn't use the central message processor, or limits its use. For example, the
button press could directly wake the system, while other events, like a completed
sensor sample, could be routed through the message processor.

It’s not uncommon for an architectural solution to be tiered. A tiered architecture
provides multiple solutions depending on the real-time performance constraints
and requirements placed on the system. A tiered architecture helps to balance the
need for performance and architectural elegance with reuse and scalability. The
tiered architecture also helps to provide several different solutions for different
problem domains that exist within a single application. Unfortunately, I often see
teams get stuck trying to give a single elegant solution, only to talk themselves in
circles. Sometimes, the simplest solution is to tier the architecture into multiple
solutions and use the solution that fits the problem for that event.

Principles of RTOS Software Design

Embedded software has steadily become more and more complex. As businesses
focus on joining the IoT, the need for an operating system to manage low-level
hardware, memory, and time has steadily increased. Approximately 65% of
embedded systems implement a real-time operating system; the remaining systems
are simple enough for bare-metal scheduling techniques to meet the system
requirements.
Real-time systems require not only the logical correctness of their computations
but also timely responses. There are many scheduling algorithms that developers
can use to get real-time responses, such as
1. Run to completion schedulers
2. Round-robin schedulers
3. Time slicing
4. Priority-based scheduling

RTOSes are much more compact than general-purpose operating systems like
Android or Windows, which can require gigabytes of storage space to hold the
operating system. A good RTOS typically requires a few kilobytes of storage
space, depending on the specific application needs. (Many RTOSes are
configurable, and the exact settings determine how large the build gets.)
An RTOS provides developers with several key capabilities that can be time-
consuming and costly to develop and test from scratch. For example, an RTOS will
provide
A multithreading environment
At least one scheduling algorithm
Mutexes, semaphores, queues, and event flags
Middleware components (generally optional)

While an RTOS can provide developers with a great starting point and several
tools to jump-start development, designing an RTOS-based application can be
challenging for developers the first few times they use one. There are common
questions that developers encounter, such as
How do I figure out how many tasks to have in my application?
How much should a single task do?
Can I have too many tasks?
How do I set my task priorities?
Tasks, Threads, and Processes
An RTOS application is typically broken up into tasks, threads, and processes.
These are the primary building blocks available to developers; therefore, we must
understand their differences.
A task has several definitions that are worth discussing. First, a task is a
concurrent and independent program that competes for execution time on a
CPU. This definition tells us that tasks are isolated programs that don't interact
with other tasks in the system but may compete with them for CPU time. They also
need to appear as if they are the only program running on the processor. This
definition is helpful, but it doesn't fully represent what a task is on an embedded system.
The second definition, I think, is a bit more accurate. A task is a
semi-independent portion of the application that carries out a specific duty. This
definition of a task fits well. From it, we can gather that there are several
characteristics we can expect from a task:
It is a separate “program.”
It may interact with other tasks (programs) running on the system.
It has a dedicated function or purpose.

This definition fits well with what we expect a task to be in a microcontroller-based
embedded system. Surveying several different RTOSes available in the wild,
you'll find that there are several that provide task APIs, such as FreeRTOS and
µC/OS-II/III.
On the other hand, a thread is a semi-independent program segment that
executes within a process. From it, we can gather that there are several
characteristics we can expect from a thread:
First, it is a separate “program.”
It may interact with other tasks (programs) running on the system.
It has a dedicated function or purpose.

For most developers working with an RTOS, a thread and a task are synonyms!
Surveying several different RTOSes available in the wild, you’ll find that there are
several that provide thread APIs, such as Azure RTOS, Keil RTX, and Zephyr.
These operating systems provide similar capabilities that compete with RTOSes
that use task terminology.
A process is a collection of tasks or threads and associated memory that runs in
an independent memory location. A process will often leverage a memory
protection unit (MPU) to group the various elements that are part of the process.
These elements can consist of
Flash memory locations that contain executable instructions or data
RAM locations that include executable instructions or data
Peripheral memory locations
Shared RAM, where data is stored for interprocess communication

A process groups resources in a system that work together to achieve the
application's goal. Processes have the added benefit of improving application
robustness and security because they limit what each process can access. A typical
multithreaded application has all the application tasks, input/output, interrupts, and
other RTOS objects in a single address space.

Task Decomposition Techniques


The question I’m asked the most by developers attending my real-time operating
systems courses is, “How do I break my application up into tasks?”. At first glance,
one might think breaking up an application into semi-independent programs would
be straightforward. The problem is that there are nearly infinite ways that a
program can be broken up, but not all of them will be efficient or result in good
software architecture. We will talk about two primary task decomposition
techniques: feature-based and the outside-in approach.
Feature-Based Decomposition
Feature-based decomposition is the process of breaking an application into tasks
based on the application features. A feature is a unique property of the system or
application. For example, the display or the touch screen would each be a feature
of an IoT thermostat, and each could very easily become its own task within the
application.
Decomposing an application based on features is a straightforward process. A
developer can start by simply listing the features in their application. For an IoT
thermostat, I might make a list something like the following:

Display
Touch screen
LED backlight
Cloud connectivity
Temperature measurement
Humidity measurement
HVAC controller
Most teams will create a list of features the system must support when they
develop their stakeholder diagrams and identify the system requirements. This
effort can also be used to determine the tasks that make up the application
software.
Feature-based task decomposition can be very useful, but sometimes it can
result in an overly complex system. For example, if we create tasks based on all the
system features, it would not be uncommon to identify upward of a hundred tasks
in the system quickly! This isn’t necessarily wrong, but it could result in an overly
complex system with more memory and RAM than required.
When using the feature-based approach, it's critical that developers also go
through an optimization phase to see where identified tasks can be combined based
on common functionality. For example, tasks may be specified for measuring
temperature, pressure, humidity, etc. However, having a separate task for each
individual measurement will overcomplicate the design. Instead, these
measurements could all be combined into a single sensor task.
A feature-based task decomposition for an IoT thermostat would show each of
these features as a task in the software.

Using features is not the only way to decompose tasks. One of my favorite
methods is the outside-in approach.
Example:

Step #1 – Identify the Major Components

Humidity/temperature sensor
Gesture sensor
Touch screen
Analog sensors
Connectivity devices (Wi-Fi/Bluetooth)
LCD/display
Fan/motor control
Backlight
Etc.

Step #2 – Draw a High-Level Block Diagram


Step #3 – Label the Inputs

Step #4 – Label the Outputs


Step #5 – Identify First-Tier Tasks

Step #6 – Determine Concurrencies, Dependencies, and Data Flow

Step #7 – Identify Second-Tier Tasks


Setting Task Priorities
Designers often set task priorities based on experience and intuition. There are
several problems with this. First, if you don't have experience, you'll have no clue
how to set the priorities! Next, even if you have experience, it doesn't mean you
have experience with the application being designed. Finally, if you rely on
experience and intuition, the chances are high that the system will not be
optimized or may not work at all! The implementers may need to constantly fiddle
with the priorities to get the software stable.
The trick to setting task priorities relies not on experience or intuition but on
engineering principles! When setting task priorities, designers want to examine
task scheduling algorithms that dictate how task priorities should be set.
There are typically three different algorithms that designers can use to set task
priorities:

Shortest job first (SJF)
Shortest response time (SRT)
Periodic execution time (rate-monotonic scheduling, RMS)

Explaining these algorithms is beyond the scope of this tutorial; you can refer to
any RTOS book out there. We recommend the book Operating System Concepts.
Math Concepts for Embedded Software
Engineers.
Look Up Tables:
We have all been taught in school to use math to solve real-world problems,
especially mechanical concepts like acceleration, velocity, etc. Many embedded
projects make heavy use of trigonometric functions like sin(x) and cos(x). For
example, a GPS navigation project that calculates distances between points on a
sphere (the Earth) needs to compute a lot of trig functions in real time! These
functions are terribly slow when executed on a microcontroller running at 8 MHz.
These functions are usually computed using the Taylor series, which approximates
complex functions. Using the Taylor series, sin(x) can be computed with the
following series:
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
As you can see, you need to calculate terms like factorials and powers, which is a
disaster if you want to compute them on a microcontroller.
One solution, used back in the 80's when computer game programmers worked on
8-bit computers like the Atari, is the look-up table. If you really must work with
trig functions, look-up tables are to the rescue.
Look-up tables are precomputed values of some computation that you know you'll
perform during run-time. You simply compute all possible values at startup and then
run the embedded software. For example, say you needed the sine and cosine of
the angles from 0-359 degrees. Computing them at run-time using sin() and cos()
would kill you, but with a look-up table your code gets sin() or cos() in a few
cycles because it's just a table look-up. Here's an example:

#include <math.h>

// storage for look-up tables
float SIN_LOOK[360];
float COS_LOOK[360];

// create the look-up tables (call once at startup)
void init_trig_tables(void)
{
for (int angle = 0; angle < 360; angle++)
{
// convert angle to radians since the math library uses
// rads instead of degrees; there are 2*pi rads in 360 degrees
float rad_angle = angle * (3.14159f / 180.0f);
// fill in the next entries in the look-up tables
SIN_LOOK[angle] = sinf(rad_angle);
COS_LOOK[angle] = cosf(rad_angle);
}
} // end init_trig_tables

As an example of using the look-up tables:

for (int ang = 0; ang < 360; ang++)
{
// compute the next point on a circle of radius 10
x_pos = 10 * COS_LOOK[ang];
y_pos = 10 * SIN_LOOK[ang];
// Do something with x_pos, y_pos
}
You first compute the values at system initialization; then, in the main loop,
whenever you need sin or cos, you simply look the value up in the table.
Yes, it costs some space, but speed at run time matters too, especially on low-end
microcontrollers like the AVR.
Identifying Fast and Slow Operations
Optimizing your system to do its mathematical operations quickly requires you to
understand a bit more about your compiler and processor. Once you understand
which operations occur quickly (and which ones take up one line of code but
compile to use two libraries and an absurd amount of processing), you'll have the
basis to optimize your system.
So, addition and subtraction are fast. Shifting bits is fast. Division is very slow.
Anything with floating point is dead slow.
What about multiplication? On a DSP, it is fast: multiply and add together form a
single instruction (MAC for multiply-accumulate). On a non-DSP (e.g. an ARM
or your PC), multiplication is between addition and division, closer to addition.
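For instance, when the divisor is a power of two and the value is unsigned, a
divide can be replaced by a cheap shift (a small sketch; many compilers already do
this at higher optimization levels):

unsigned int scale_down(unsigned int x)
{
return x >> 4; /* same as x / 16 for unsigned x, typically one cycle */
}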
Fixed Point Math
Some microcontrollers don't support floating-point operations, as they don't have
an FPU; floating-point operations on these microcontrollers are emulated in
software. You can look at the map file for your specific microcontroller to see how
much code space a single floating-point addition pulls in. Besides the code-space
increase, emulated floating-point operations are slow and therefore expensive in
embedded software, because the software emulation takes many CPU cycles per
operation. The idea behind the following explanation is to fake floating-point
numbers with fixed-point math.
The point is to fake floating point with integers. Implementation-wise, there are
two main ways to do fixed-point math: on the CPU or in dedicated logic.
For the CPU approach, it can be a full-blown DSP (Digital Signal Processor), or it
can be an application processor running at a high clock rate or with built-in support
for SIMD instructions. Dedicated logic can be implemented through an FPGA, or it
can be a hardware accelerator that becomes part of an SoC (System on Chip). Each
approach has its pros and cons. And because there are two approaches for doing
fixed-point math, the code snippets presented in this chapter will also be in one of
these two forms: C/C++ or Verilog/SystemVerilog.
In the case of AVRs and PICs, the compiler knows there is no FPU available, so it
will translate every single floating-point operation into a bunch of instructions that
the CPU does support. It has to normalize both operands to a common exponent,
perform the operation on the mantissas as on integral numbers, and then adjust the
exponent. This is quite a lot of work, so emulated floating point is slow. Besides
that, if you optimize for size, every floating-point operation may become a
function call.

On the ARM architecture, things can be quite weird. There are ARMs with an FPU
and without, and you may want a universal application that runs on both. For that
case there is a tricky (and slow) scheme: the application uses FPU instructions, and
if the CPU does not have an FPU, each such instruction triggers an exception, in
which the OS emulates the instruction, clears the error bit, and returns control to
the application. That scheme turned out to be very slow and is not commonly used.

Floating point is just scientific notation in base 2. Both the mantissa and exponent
are integers, and softfloat libraries break floating-point operations into operations
on the mantissa and exponent, which can use the CPU's integer support.

For example, (x * 2^n) * (y * 2^m) = (x * y) * 2^(n+m).

See also: Fixed-Point vs. Floating-Point Digital Signal Processing (Analog Devices).

Remember from school that we can have a number like 10.7, but can we represent
it as an integer? Not directly; there is no decimal place in integer values. However,
we can scale it up by 10, and the result, 107, is an integer. You scale numbers by
some factor and make sure to take this scale into consideration when doing math.

We will use 32-bit integers for our fixed-point representation. There are many
fixed-point formats, depending on the size of the integer used. We will show an
example of 16.16 fixed-point math.

You put the whole part in the upper 16 bits and the decimal part in the lower 16
bits. Hence, you're scaling all numbers by 2^16, or 65,536. Moreover, to extract
the integer portion of a fixed-point number, you shift and mask the upper 16 bits,
and to get the decimal portion, you mask the lower 16 bits. Here are some working
types for fixed-point math:

#define FP_SHIFT 16 // shifts to produce a fixed-point number
#define FP_SCALE 65536 // scaling factor
typedef int FIXPOINT;
Here's a macro that converts an integer to fixed-point:
#define INT_TO_FIXP(n) ((FIXPOINT)((n) << FP_SHIFT))
For example:
FIXPOINT speed = INT_TO_FIXP(100);
And here’s a macro to convert floating-point numbers to fixed-point:
#define FLOAT_TO_FIXP(n) ((FIXPOINT)((float)(n) * FP_SCALE))
For example:
FIXPOINT speed = FLOAT_TO_FIXP(100.5);
Extracting from a fixed-point number is simple too. Here's a macro to get the
integral portion from the upper 16 bits:
#define FIXP_INT_PART(n) ((n) >> FP_SHIFT)
And to get the decimal portion in the lower 16 bits, you simply mask off the
integral part:
#define FIXP_DEC_PART(n) ((n) & 0x0000ffff)

Addition and Subtraction

Addition and subtraction of fixed-point numbers are trivial. You can use the
standard + and - operators:
FIXPOINT f1 = FLOAT_TO_FIXP(10.5),
f2 = FLOAT_TO_FIXP(-2.6),
f3 = 0; // zero is 0 no matter what baby
// to add them
f3 = f1 + f2;
// to subtract them
f3 = f1 - f2;
Multiplication and Division
Multiplication and division are a little more complex than addition and subtraction.
The problem is that the fixed-point numbers are scaled; when you multiply them,
you not only multiply the fixed-point numbers but also the scaling factors. Take a
look:
f1 = n1 * scale
f2 = n2 * scale
f3 = f1 * f2 = (n1 * scale) * (n2 * scale) = n1*n2*scale^2
See the extra factor of scale? To remedy this, you need to divide or shift out one
factor of scale. Hence, here's how to multiply two fixed-point numbers:
#define FP_MUL(f1,f2) (((f1) * (f2)) >> FP_SHIFT)
f3 = ((f1 * f2) >> FP_SHIFT);

Division of fixed-point numbers has the same scaling problem as multiplication,
but in the opposite sense. Take a look at this math:
f1 = n1 * scale
f2 = n2 * scale
Given this, then
f3 = f1/f2 = (n1*scale) / (n2*scale) = n1/n2 // no scale!
Note that you've lost the scale factor and thus turned the quotient into a
non-fixed-point number. This is useful in some cases, but to maintain the
fixed-point property, you must prescale the numerator like this:
f3 = (f1 << FP_SHIFT) / f2;
The problem with both multiplication and division is overflow and underflow. In
the case of multiplication, the result might need 64 bits in the worst case. Similarly,
in the case of division, prescaling shifts the upper 16 bits of the numerator out, so
they are always lost, leaving only the decimal portion. The solution?
Use a 24.8 format or use full 64-bit math.
This will allow multiplication and division to work better because you won't lose
everything all the time, but with only 8 fractional bits your accuracy will suffer.
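Putting the pieces together, here is a minimal sketch of the full-64-bit-math variant
of 16.16 multiply and divide (the fp_mul/fp_div names are illustrative):

#include <stdint.h>

#define FP_SHIFT 16
typedef int32_t FIXPOINT;

/* 64-bit intermediates avoid the overflow/precision loss discussed above */
static FIXPOINT fp_mul(FIXPOINT a, FIXPOINT b)
{
return (FIXPOINT)(((int64_t)a * b) >> FP_SHIFT);
}

static FIXPOINT fp_div(FIXPOINT a, FIXPOINT b)
{
return (FIXPOINT)((((int64_t)a) << FP_SHIFT) / b);
}
/* e.g., fp_mul(FLOAT_TO_FIXP(10.5), FLOAT_TO_FIXP(2.0)) yields 21.0 in 16.16 */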
FIRMWARE CONCEPTS IN C
Stack painting
An effective way to measure the amount of stack space needed consists of filling
the estimated stack space with a well-known pattern. This mechanism, informally
referred to as stack painting, reveals the maximum expansion of the execution
stack at any time. By running the software with a painted stack, it is in fact
possible to measure the amount of stack used by looking for the last recognizable
pattern, assuming that the stack pointer has moved during execution at most to
that point.
We can perform stack painting manually in the reset handler, during memory
initialization. To do so, we need to assign an area to paint; in this case it would be
the last 8 KB of memory, up until _end_stack. Once again, while manipulating the
stack in the reset_handler function, local variables should not be used. The handler
function will store the value of the current stack pointer into the global variable sp:
static unsigned int sp;
Within the handler, the following section can be added before invoking main:

The first assembly instruction stores the current value of the stack pointer into the
variable sp, which ensures that the painting stops at the last unused address in the
stack:
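The code itself is not reproduced here; a minimal reconstruction, assuming a
Cortex-M target built with GCC, a linker-provided _end_stack symbol, and
0xDEADC0DE as the paint pattern, might look like this:

extern unsigned int _end_stack;
static unsigned int *dst; /* global: no locals allowed in reset_handler */

/* inside reset_handler, before invoking main() */
asm volatile("mrs %0, msp" : "=r"(sp)); /* save the current stack pointer */
dst = ((unsigned int *)&_end_stack) - (8192 / sizeof(unsigned int));
while ((unsigned int)dst < sp) { /* paint only the unused area */
*dst = 0xDEADC0DE; /* the recognizable pattern */
dst++;
}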
The current stack usage can be checked periodically at runtime, for instance in the
main loop, to detect the area painted with the recognizable pattern. The areas that
are still painted have never been used by the execution stack so far, and indicate
the amount of stack still available.
This mechanism may be used to verify the amount of stack space required by the
application to run comfortably. According to the design, this information can be
used later on to set a safe lower limit on the segment that can be used for the stack.

Explicit Type of Bit-Width


One of the main functions of the firmware is to configure hardware, which usually
involves register/memory access. Most register and memory buses will have a bit-
width that is the multiple of a byte (8-bit). Since each CPU has its own native word
length (bit-width) for bus and register file, the size of integer-type varies from one
compiler to another. It is always good practice for a firmware project to typedef its
own primitive types, with explicit bit-width markings. These typedefs should be
put in a common header file and included by the whole project to avoid any
ambiguity about bit-width. In the example below, 8-bit, 16-bit, and 32-bit unsigned
integer types are defined for the KEIL C51 compiler and the GNU C 32-bit
compiler. (Notice that GNU C can support 64-bit operations on 32-bit hardware
through a software library, so a 64-bit unsigned integer can be defined with the
"unsigned long long" type in GNU C.) All the rest of the project should only use
these U8/U16/U32/U64 types to define and declare variables for the sake of
readability and portability. DSP programmers might also want to define signed
integer types (S8/S16/S32) for their own platforms if necessary.
#ifndef COMMON_TYPE_H
#define COMMON_TYPE_H
#include "debug.h"
#if defined (KEIL_C51)
typedef unsigned long int U32;
typedef unsigned short U16;
typedef unsigned char U8;
#elif defined (GNU_C_32BIT)
typedef unsigned long long U64;
typedef unsigned int U32;
typedef unsigned short U16;
typedef unsigned char U8;
C_ASSERT(sizeof(U64) == 8);
#else
...
#endif
C_ASSERT(sizeof(U32) == 4);
C_ASSERT(sizeof(U16) == 2);
C_ASSERT(sizeof(U8) == 1);
#endif /* COMMON_TYPE_H */

Data Alignment
Most computer systems have some alignment requirement on the starting memory
address of a variable. The memory address of a C variable often must be aligned.
The smallest unit exchanged between the processor and the
memory is a byte (8 bits), and thus the memory address is always in terms of bytes.
A variable is n-byte aligned in memory if its starting memory address is some
multiple of n. Typically, n is a power of 2, such as 2 (halfword aligned), 4 (word
aligned), and 8 (double word aligned). Suppose a 32-bit variable is word aligned. If
the address of the next available byte in memory is 0x8001, the variable is then
stored in a continuous span of 4 bytes from 0x8004 to 0x8007. The compiler or the
program inserts three bytes at memory addresses 0x8001, 0x8002, and 0x8003.
These three bytes are called padding bytes.
Enforcing data alignment improves memory performance. A memory
system consists of multiple storage units, and the processor typically distributes
data among these units in a round-robin fashion. Because the number of pins
available on a processor is limited, these memory units typically share some pins in
the memory address bus. To allow these memory units to transfer data
concurrently, the target data stored in all memory units needs to share a portion of
their memory addresses. The data alignment ensures that all data of a variable
stored in different memory units meet this requirement.
When the processor reads a properly aligned variable, only one access is required
to transfer the data out of these memory units. Otherwise, two separate memory
accesses might be necessary, slowing down the processor performance.

Suppose the data memory is organized into four banks feeding a 32-bit data bus,
so four bytes in the same row of all banks can be loaded into the processor
concurrently. If a word such as 0x78563412 is not aligned with the word
boundaries, the processor takes two memory accesses to load it into a register.
However, it takes only one memory access to load an aligned word such as
0x44332211.
Align the Data Structure
As mentioned in the previous section, each CPU has its own word length
(bit-width), and it varies from one platform to another. Due to the broad scope of
embedded CPUs, the word length can be as small as 8 bits (such as the Intel 8051)
or as big as 32 bits (like the popular ARM processors). Some high-end products
can even afford 64-bit CPUs. Most systems also allow memory access with a
width less than the native word length (for example, most 32-bit CPUs can
read/write memory a single byte at a time as well, with degraded efficiency).
Consequently, this gives rise to the alignment issue for all data structures. From
the performance standpoint,
rise to the alignment issue for all data structures. From the performance standpoint,
memory accesses that are aligned to native word length are always preferred (i.e.,
the beginning address is at the boundary of native word length, and the R/W length
is a multiple of native word length.) Thus most compilers will do some
optimization by inserting space padding into the data structure when they see the
chance of misalignment, as demonstrated

typedef struct {
U8 a;
U32 b;
U16 c;
} STRUCT_NOT_PACKED;
sizeof(STRUCT_NOT_PACKED) == 12
For a 32-bit target CPU, the GNU C compiler will insert three bytes of padding
after field a and two bytes of padding after field c in the struct defined previously,
so that struct member b will be aligned to a 32-bit boundary, and the total size of
STRUCT_NOT_PACKED becomes 12. Under GNU C, such alignment padding
can be disabled by using the packed attribute, as illustrated below. After applying
the packed attribute, the size of the data structure is reduced to 7. However, struct
member b is now stored across a 32-bit boundary and carries performance
penalties if accessed individually.
#define PACKED __attribute__((packed))
typedef struct {
U8 a;
U32 b;
U16 c;
} PACKED STRUCT_PACKED;
GNU C Compiler, 32 bit target CPU:
sizeof(STRUCT_PACKED) == 7

Manual Padding for Alignment


typedef struct {
U8 a;
U8 padding;
U16 c;
U32 b;
} STRUCT_MANUAL_PADDING;
GNU C Compiler, 32 bit target CPU:
sizeof(STRUCT_MANUAL_PADDING) == 8
A data structure defined in C language aggregates multiple basic variables into a
single complex entity. By default, compilers ensure that all variables in a structure
are aligned to their required memory boundaries. In a structure array, compilers
also ensure that all variables in this array meet their alignment requirements.
Therefore, compilers may place padding bytes between structure variables.
C language also supports packed structures in which variables are not aligned.
Therefore, compilers do not add any padding bytes into a data structure. Packed
structures are often used in communication protocols (such as USB) to save
transmission time.
A packed modifier is often used to map a structure onto a special data area
in memory, such as a USB communication packet received into a memory buffer.
The Cortex-M processors do support unaligned memory accesses. However,
unaligned accesses are still slower than aligned memory accesses, and thus it is
recommended to avoid using unaligned accesses.

Let me give one example where packing is needed: consider a microcontroller
interfaced with an EEPROM where some structure is being stored. Imagine a
function writing to the EEPROM that looks like this:

Write_EEPROM(eeprom_address, ram_address, byte_count);

Now if packing is not done, the extra padding bytes would occupy space in the
EEPROM, which is of no use.
Type Qualifier “volatile”

As we all know, RAM access is much slower than register access. If you know
beforehand that a certain variable will be used frequently, you would like it to be
stored in a register to speed up subsequent accesses.
One way to achieve such acceleration in C is to specify the storage class of that
variable as register instead of auto, so that the compiler will store the variable in a
register if one is available. However, such manual optimization has its limitations
and is usually unnecessary in practice. Instead, you can let the compiler figure out
the best register-allocation scheme by turning on the optimization switch; most
modern C compilers do a good job in this regard.
Let's do some experiments to get an idea of how compilers optimize memory
access. The results shown here were produced with 32-bit Cygwin on a 64-bit x86
machine running Windows 10; the object file format is pei-i386. I chose 32-bit
Cygwin since most embedded processors are 32 bits or fewer. If you run the
experiments with 64-bit Cygwin, the object file format will be pei-x86-64 and the
final assembly code will use 64-bit instructions, but the same idea of optimization
applies. Check the generated code again after enabling optimization flags in your
compiler.
Replacing memory access with register access could have harmful side effects,
which in turn lays the ground for inconsistency between the debug build and the
release build.
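A classic sketch of that effect: a flag written by an interrupt handler must be
declared volatile, or an optimizing release build may cache it in a register and spin
forever (the ISR name here is illustrative):

volatile int data_ready = 0; /* may change outside the normal program flow */

void UART_RX_ISR(void)
{
data_ready = 1; /* set from interrupt context */
}

void wait_for_data(void)
{
/* without volatile, the compiler may read data_ready once, keep it
in a register, and never observe the ISR's update */
while (!data_ready) { }
}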
Access Peripheral Registers

In hardware, peripheral registers are usually mapped into a memory space. The
following mixed use of const and volatile is recommended when these registers
are being operated on:
• Use volatile unsigned int * const to specify a register address: In practice,
unsigned int can also be replaced by U32 or U16. Here's an example of such a
practice, where the control register is written, followed by waiting on the busy flag.
#define BUSY_FLAG (1 << 0)
...
volatile U32* const REG_CONTROL = (U32*) 0xABCD0000;
volatile U32* const REG_STATUS = (U32*) 0xABCD0004;
(*REG_CONTROL) = ... ; // correct statement
REG_CONTROL = ... ; // will produce compile error
while((*REG_STATUS) & BUSY_FLAG); // wait on flag
...
• Use const volatile type * to explore a read-only data buffer: If the peripheral
exposes a buffer of read-only data to the CPU, a pointer of const volatile type can
be used to explore the data, with an additional sanity check from the compiler to
prevent inadvertent writes on the buffer.
const volatile U32 *p;
U32 data1, data2, data3;
volatile U32* const DATA_BUFFER = (U32*) 0xABCD0008;
p = DATA_BUFFER;
data1 = *p++;
data2 = *p++;
data3 = *p++;
(*p) = ...; // will produce compile error
• Use volatile type * to read/write the data buffer: If the peripheral exposes a
bidirectional buffer to the CPU, a pointer of volatile type * can be used to explore
the data.
#define DATA_BUFFER 0xABCD0008
volatile U32 *p;
U32 data;
p = (volatile U32 *) DATA_BUFFER;
data = *p; // read from the buffer
*p = data; // write back to the buffer

Atomic Operation and Critical Section


As stressed in previous sections, it is quite possible for data to be modified
externally, and that is the main reason why the volatile keyword was introduced in
C. In addition to the recommended practice, firmware engineers should also
recognize the atomic operation. For example, if p is a pointer to 32-bit volatile data
(memory-mapped to peripheral data buffer), the statement (*p) = ...; will generate
bus-write transactions. In a 32-bit system, such a statement will be compiled into
one instruction, which corresponds to one bus-write transaction, so the statement
by itself is an atomic operation. However, in a 16-bit system, an atomic bus-write
operation can only take 16-bit data, and such a statement will be chopped into two
consecutive writes on the bus. If an interrupt is also enabled when these
transactions are being carried out, chances are that an ISR (Interrupt Service
Routine) could cut in between those two bus writes. And even worse, the
ISR may operate on the same peripheral data buffer as well, which generates a
corner case that has never been mapped out in the first place. In general, if it is
possible that more than one agent will operate on the same object concurrently, and
if such an operation is not atomic by itself, some protection measures have to be
taken to avoid conflicts. And the piece of code that operates on the object is often
called critical section . There are many ways to protect critical sections, such as
disabling-interrupt (for single processor), spin-lock, semaphore.
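As a minimal sketch of the disabling-interrupt approach on a single-core Cortex-M
system (assuming the CMSIS intrinsics __get_PRIMASK(), __disable_irq(), and
__enable_irq() are available), a non-atomic pair of 16-bit bus writes can be
protected like this:

#include <stdint.h>

void write_32bit_as_two_16bit(volatile uint16_t *p, uint32_t value)
{
uint32_t primask = __get_PRIMASK(); /* remember the interrupt state */
__disable_irq(); /* enter the critical section */
p[0] = (uint16_t)(value & 0xFFFFu);
p[1] = (uint16_t)(value >> 16); /* two bus writes, now indivisible */
if (primask == 0u) {
__enable_irq(); /* restore only if interrupts were enabled */
}
}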
Object-Oriented Programming in C

Developers should consider developing their drivers and their application code in
an object-oriented manner. The C programming language is not an object-oriented
programming language. C is a procedural programming language where the
primary focus is to specify a series of well-structured steps and procedures within
its programming context to produce a program.7 An object-oriented programming
language, on the other hand, is a programming language that focuses on the
definition of and operations that are performed on data.
There are several characteristics that set an object-oriented programming language
apart from a procedural language. These include:
• Abstraction
• Encapsulation
• Objects
• Classes
• Inheritance
• Polymorphism
Despite C not being object-oriented, developers can still implement some concepts
in their application that will dramatically improve their software. While there are
ways to create classes, inheritance, and polymorphism in C, if these features are
required, developers would be better off just using C++. Applications can benefit
greatly from using abstractions and encapsulation.

Abstractions and Abstract Data Types (ADTs)


An abstraction hides the underlying implementation details while making the
functionality available to developers. For example, a well-implemented GPIO
driver will provide an interface that tells a developer what can be done with the
driver, but the developer doesn’t need to know any details about how the driver is
implemented or even on what hardware it runs. Abstractions hide the details from
developers, creating a black box that simplifies what they need to know to use the
software.
Abstractions don’t only apply to component interfaces. Abstractions can just as
easily be applied to data types. Abstract data types (often written as ADT for short)
are data types whose underlying implementation details are hidden from the view
of the user. There are several different methods that can be used to create
an ADT in C. One method that is straightforward can be done in five easy steps.
Let’s look at how we can create an ADT for managing a memory stack.
First, a developer defines the abstract data type. The ADT in C is usually defined
as a pointer to a structure. The ADT is declared within a header file without any
underlying details, leaving it up to the implementer to fully declare the ADT in the
source module. An example of an ADT would be a StackPtr_t, NodePtr_t, or
QueuePtr_t, to name a few.
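A minimal sketch of the header for such a stack ADT (the function names are
illustrative, following the StackPtr_t convention above):

/* stack.h - the ADT is an opaque pointer; the struct layout lives only
in stack.c, hidden from every caller */
#ifndef STACK_H
#define STACK_H

typedef struct StackT *StackPtr_t;

StackPtr_t Stack_Create(int depth);
int Stack_Push(StackPtr_t stack, int value); /* returns 0 on success */
int Stack_Pop(StackPtr_t stack, int *value); /* returns 0 on success */
void Stack_Destroy(StackPtr_t stack);

#endif /* STACK_H */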

Encapsulation and Data Hiding

Encapsulation and data hiding are important concepts that embedded-software
developers should follow. Encapsulation is the idea that related data, functions, and
operations should all be wrapped together into a single unit. For example, all the
general-purpose input and output operations would be wrapped together in a single
GPIO module. Any operations and data that involve the GPIO would be put into
that module. The idea can go even further by considering data hiding. Data hiding
is where developers hide the data and the implementation from the module user.
It’s not important that the caller understand the implementation, only how to use
the interface and what its inputs and outputs are.

OOP IN C EXAMPLE

#ifndef Sensor_H
#define Sensor_H
/*## class Sensor */
typedef struct Sensor Sensor;
struct Sensor {
int filterFrequency;
int updateFrequency;
int value;
};
int Sensor_getFilterFrequency(const Sensor* const me);
void Sensor_setFilterFrequency(Sensor* const me, int p_filterFrequency);
int Sensor_getUpdateFrequency(const Sensor* const me);
void Sensor_setUpdateFrequency(Sensor* const me, int p_updateFrequency);
int Sensor_getValue(const Sensor* const me);
Sensor * Sensor_Create(void);
void Sensor_Destroy(Sensor* const me);
#endif /* Sensor_H */

/* Sensor.c */
#include <stdlib.h>
#include "Sensor.h"
void Sensor_Init(Sensor* const me) {
}
void Sensor_Cleanup(Sensor* const me) {
}
int Sensor_getFilterFrequency(const Sensor* const me) {
return me->filterFrequency;
}
void Sensor_setFilterFrequency(Sensor* const me, int p_filterFrequency) {
me->filterFrequency = p_filterFrequency;
}
int Sensor_getUpdateFrequency(const Sensor* const me) {
return me->updateFrequency;
}
void Sensor_setUpdateFrequency(Sensor* const me, int p_updateFrequency) {
me->updateFrequency = p_updateFrequency;
}
int Sensor_getValue(const Sensor* const me) {
return me->value;
}
Sensor * Sensor_Create(void) {
Sensor* me = (Sensor *) malloc(sizeof(Sensor));
if(me!=NULL)
{
Sensor_Init(me);
}
return me;
}
void Sensor_Destroy(Sensor* const me) {
if(me!=NULL)
{
Sensor_Cleanup(me);
}
free(me);
}
Polymorphism and Virtual Functions
Polymorphism is a valuable feature of object-oriented languages. It allows for the
same function name to represent one function in one context and another function
in a different context. In practice, this means that when either the static or dynamic
context of an element changes, the appropriate operation can be called.
One approach uses a switch-case statement; another uses function pointers. Please
refer to a book on object-oriented programming in C.

int acquireValue(Sensor *me) {
int *r, *w; /* read and write addresses */
int j;
switch(me->whatKindOfInterface) {
case MEMORYMAPPED:
w = (int*)WRITEADDR; /* address to write to sensor */
*w = WRITEMASK; /* sensor command to force a read */
for (j=0;j<100;j++) { /* wait loop */ };
r = (int *)READADDR; /* address of returned value */
me->value = *r;
break;
case PORTMAPPED:
me->value = inp(SENSORPORT);
/* inp() is a compiler-specific port function */
break;
} /* end switch */
return me->value;
}

Inheritance
Inheritance lets a child class extend a parent class, modeling a parent/child (is-a)
relationship (think of Vehicle and Car). The Queue and CachedQueue example
below shows one way to express this in C.

#ifndef QUEUE_H_
#define QUEUE_H_
#define QUEUE_SIZE 10
/* class Queue */
typedef struct Queue Queue;
struct Queue {
int buffer[QUEUE_SIZE]; /* where the data things are */
int head;
int size;
int tail;
int (*isFull)(Queue* const me);
int (*isEmpty)(Queue* const me);
int (*getSize)(Queue* const me);
void (*insert)(Queue* const me, int k);
int (*remove)(Queue* const me);
};
/* Constructors and destructors:*/
void Queue_Init(Queue* const me,int (*isFullfunction)(Queue* const me),
int (*isEmptyfunction)(Queue* const me),
int (*getSizefunction)(Queue* const me),
void (*insertfunction)(Queue* const me, int k),
int (*removefunction)(Queue* const me) );
void Queue_Cleanup(Queue* const me);
/* Operations */
int Queue_isFull(Queue* const me);
int Queue_isEmpty(Queue* const me);
int Queue_getSize(Queue* const me);
void Queue_insert(Queue* const me, int k);
int Queue_remove(Queue* const me);
Queue * Queue_Create(void);
void Queue_Destroy(Queue* const me);
#endif /*QUEUE_H_*/
/* queue.c */
#include <stdio.h>
#include <stdlib.h>
#include "queue.h"
void Queue_Init(Queue* const me,int (*isFullfunction)(Queue* const me),
int (*isEmptyfunction)(Queue* const me),
int (*getSizefunction)(Queue* const me),
void (*insertfunction)(Queue* const me, int k),
int (*removefunction)(Queue* const me) ){
/* initialize attributes */
me->head = 0;
me->size = 0;
me->tail = 0;
/* initialize member function pointers */
me->isFull = isFullfunction;
me->isEmpty = isEmptyfunction;
me->getSize = getSizefunction;
me->insert = insertfunction;
me->remove = removefunction;
}
/* operation Cleanup() */
void Queue_Cleanup(Queue* const me) {
}
/* operation isFull() */
int Queue_isFull(Queue* const me){
return (me->head+1) % QUEUE_SIZE == me->tail;
}
/* operation isEmpty() */
int Queue_isEmpty(Queue* const me){
return (me->head == me->tail);
}
/* operation getSize() */
int Queue_getSize(Queue* const me) {
return me->size;
}
/* operation insert(int) */
void Queue_insert(Queue* const me, int k) {
if (!me->isFull(me)) {
me->buffer[me->head] = k;
me->head = (me->head+1) % QUEUE_SIZE;
++me->size;
}
}
/* operation remove */
int Queue_remove(Queue* const me) {
int value = -9999; /* sentinel value */
if (!me->isEmpty(me)) {
value = me->buffer[me->tail];
me->tail = (me->tail+1) % QUEUE_SIZE;
--me->size;
}
return value;
}
Queue * Queue_Create(void) {
Queue* me = (Queue *) malloc(sizeof(Queue));
if(me!=NULL)
{
Queue_Init(me, Queue_isFull, Queue_isEmpty, Queue_getSize,
Queue_insert, Queue_remove);
}
return me;
}
void Queue_Destroy(Queue* const me) {
if(me!=NULL)
{
Queue_Cleanup(me);
}
free(me);
}

#ifndef CACHEDQUEUE_H_
#define CACHEDQUEUE_H_
#include "queue.h"
typedef struct CachedQueue CachedQueue;
struct CachedQueue {
Queue* queue; /* base class */
/* new attributes */
char filename[80];
int numberElementsOnDisk;
/* aggregation in subclass */
Queue* outputQueue;
/* inherited virtual functions */
int (*isFull)(CachedQueue* const me);
int (*isEmpty)(CachedQueue* const me);
int (*getSize)(CachedQueue* const me);
void (*insert)(CachedQueue* const me, int k);
int (*remove)(CachedQueue* const me);
/* new virtual functions */
void (*flush)(CachedQueue* const me);
void (*load)(CachedQueue* const me);
};

#endif /*CACHEDQUEUE_H_*/

/* cachedqueue.c */
#include <string.h>
#include "cachedqueue.h"

void CachedQueue_Init(CachedQueue* const me, char* fName,
int (*isFullfunction)(CachedQueue* const me),
int (*isEmptyfunction)(CachedQueue* const me),
int (*getSizefunction)(CachedQueue* const me),
void (*insertfunction)(CachedQueue* const me, int k),
int (*removefunction)(CachedQueue* const me),
void (*flushfunction)(CachedQueue* const me),
void (*loadfunction)(CachedQueue* const me)) {
/* initialize base class */
me->queue = Queue_Create(); /* queue member must use its original functions */
/* initialize subclass attributes */
me->numberElementsOnDisk = 0;
strcpy(me->filename, fName);
me->outputQueue = Queue_Create();
/* initialize subclass virtual operations ptrs */
me->isFull = isFullfunction;
me->isEmpty = isEmptyfunction;
me->getSize = getSizefunction;
me->insert = insertfunction;
me->remove = removefunction;
me->flush = flushfunction;
me->load = loadfunction;
}

Design Patterns in Embedded Software:


Design patterns aren't magic, and they aren't all that difficult. Applying design
patterns is what good designers (including architects) do every day anyway, even
if they don't recognize that that is what they are doing. Good designers examine
their new design problems and reason about what they've done or seen done in the
past to solve similar problems. That is nothing more or less than applying design
patterns, even though it is implicit rather than explicit. What a design-pattern-centric
design approach does is formalize this a bit, to simplify both the capture of
good design solutions and their application to specific design contexts.
A design pattern is a “generalized solution to a commonly occurring problem”. If a
design solution addresses a problem very specific to a particular system, there is no
value in abstracting it into a reusable design pattern. Similarly, a design pattern
must abstract away the specifics of a particular system so that it may be easily
applied to other systems operating in other contexts.
The Hardware Proxy Pattern abstracts hardware, encapsulating details that are
likely to change away from the usage of the information provided to or by the
hardware. The Hardware Adapter Pattern extends the Hardware Proxy Pattern to
provide the ability to support different hardware interfaces. The Mediator Pattern
supports coordination of multiple hardware devices to achieve a system-level
behavior. The Observer Pattern is a way of distributing sensed data to the software
elements that need it. The Debouncing and Interrupt Patterns are simple reusable
approaches to interfacing with hardware devices. The Timer Interrupt Pattern
extends the Interrupt Pattern to provide accurate timing for embedded systems.

SuperLoop Design Pattern

Smaller embedded systems are typically designed as a "superloop" that runs on a
bare-metal CPU, without any underlying operating system. This is also the most
basic structure that all embedded programmers learn at the beginning of their
careers. For example, here you can see a superloop adapted from the basic Arduino
Blink Tutorial, sketched below. The code is structured as an endless "while (1)"
loop, which turns an LED on, waits for 1000 ms, turns the LED off, and waits for
another 1000 ms. All this results in blinking the LED. The main characteristic of
this approach is that the code often waits in-line for various conditions, for example
a time delay. "In-line" means that the code won't proceed until the specified
condition is met. Programming that way is called sequential programming. The
main problem with this sequential approach is that while waiting for one kind of
event, the "superloop" is unresponsive to any other events, so it is difficult to add
new events to the loop. Of course, the loop can be modified to wait for ever shorter
periods of time to check for various conditions more often. But adding new events
to the loop becomes increasingly difficult and often causes an upheaval to the whole
structure and timing of the entire loop.
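
A minimal sketch of such a superloop; led_on(), led_off(), delay_ms(), and
hardware_init() are hypothetical helpers:

int main(void)
{
    hardware_init();            /* bring up clocks, GPIO, etc. */

    while (1)                   /* the "superloop" never exits */
    {
        led_on();
        delay_ms(1000);         /* code blocks in-line here...         */
        led_off();
        delay_ms(1000);         /* ...unresponsive to any other events */
    }
}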

An obvious solution to the unresponsiveness of a single superloop is to allow
multiple superloops to run on the same CPU. Multiple superloops can wait for
multiple events in parallel. And this is exactly what a Real-Time Operating System
(RTOS) allows you to do. Through the process of scheduling and switching the
CPU, which is called multitasking or multithreading, an RTOS allows you to run
multiple superloops on the same CPU. The main job of the RTOS is to create an
illusion that each superloop, now called a thread, has the entire CPU all to itself.
For example, here you have two threads: one for blinking an LED and another for
sounding an alarm when a button is pressed. As you can see, the code for the Blink
thread is really identical to the Blink superloop, so it is also sequential and
structured as an endless while(1) loop. The only difference now is that instead of
the polling delay() function, you use RTOS_delay(), which is very different
internally, but from the programming point of view it performs exactly the same
function.
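
A sketch of the two threads; RTOS_delay() comes from the text above, while
RTOS_waitForButton(), led_on(), led_off(), and sound_alarm() are hypothetical:

void BlinkThread(void)          /* structurally identical to the superloop */
{
    while (1)
    {
        led_on();
        RTOS_delay(1000);       /* blocks this thread only; others keep running */
        led_off();
        RTOS_delay(1000);
    }
}

void AlarmThread(void)          /* waits for a different event in parallel */
{
    while (1)
    {
        RTOS_waitForButton();   /* blocks until the button is pressed */
        sound_alarm();
    }
}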

How does the RTOS achieve multitasking? Each thread in an RTOS has a
dedicated private context in RAM, consisting of a private stack area and a thread-
control-block (TCB). The context for every thread must be that big because, in
sequential code like this, the context must remember the whole nested function call
tree and the exact place in the code, that is, the program counter. For example, in
the Blink thread, the contexts of the two calls to RTOS_delay() will have identical
call stacks but will differ in the values of the program counter (PC). Every time a
thread makes a blocking call, such as RTOS_delay(), the RTOS saves the CPU
registers on that thread's stack and updates its TCB. The RTOS then finds the next
thread to run in a process called scheduling. Finally, the RTOS restores the CPU
registers from that next thread's stack. At this point the next thread resumes
execution and becomes the current thread. The whole context-switch process is
typically coded in CPU-specific assembly language and takes a few microseconds
to complete.
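
To make the per-thread context concrete, a simplified thread-control-block might
look like the sketch below; this is illustrative only and not taken from any
particular RTOS:

#include <stdint.h>

#define MAX_THREADS   8
#define STACK_WORDS   256

/* Illustrative TCB; real kernels store considerably more state */
typedef struct
{
    uint32_t *sp;          /* saved stack pointer (top of saved registers)  */
    uint8_t   priority;    /* scheduling priority                           */
    uint32_t  delayTicks;  /* ticks remaining while blocked in RTOS_delay() */
} TCB_t;

static TCB_t    tcb[MAX_THREADS];                 /* one TCB per thread     */
static uint32_t stacks[MAX_THREADS][STACK_WORDS]; /* one private stack each */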
Compared to a "superloop", an RTOS kernel brings a number of very important
benefits:
1. It provides a "divide-and-conquer" strategy, because it allows you to partition
your application into multiple threads. Each one of these threads is much easier to
develop and maintain than one "kitchen sink" superloop.
2. Threads that wait for events are efficiently blocked and don't consume CPU
cycles. This is in contrast to the wasteful polling loops often used in the superloop.
3. Certain schedulers, most notably preemptive, priority-based schedulers, can
execute your application such that the timing of high-priority threads is insensitive
to changes in low-priority threads (if the threads don't share resources). This is
because under these conditions, high-priority threads can always preempt lower-
priority threads. This enables you to apply formal timing analysis methods, such as
Rate Monotonic Analysis (RMA), which can guarantee that (under certain
conditions) all your higher-priority threads will meet their deadlines.
EVENT-BASED SYSTEMS
Remember the Windows event loop? (The event-loop figure here was taken from
the internet and shows the JavaScript event loop.) Windows messaging in games
is covered in Tricks of the Windows Game Programming Gurus by Andre' LaMothe <3
We are going to implement a system which helps us handle quickly the things
which need to be fast, while postponing the things which can wait. For this we are
going to implement an Event module. I'm going to describe the interface and the
high-level concept first, and then go into the implementation details later.
The Event module proposed here follows the idea that the interrupt service routine
only sets an event flag. That flag is processed asynchronously by the event handler
loop. That way the main loop or event handler does the heavy work, while the
interrupt service routine only posts a notification of the event. This approach is not
limited to interrupts; it can be used for polled keys or other cases. It is possible
that a single event can cause multiple actions, or that an event can cause the
creation of additional events. That way a sequence of events and actions can be
created, or actions and events can be nested. A minimal sketch of the idea follows.
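
In this sketch the ISR only posts a flag and the handler loop does the heavy work;
handle_key() and the flag layout are illustrative:

#include <stdint.h>

static volatile uint8_t eventFlags;        /* shared between ISR and loop */
#define EVENT_KEY_PRESSED   (1u << 0)

void Key_ISR(void)                         /* fast: only set a notification */
{
    eventFlags |= EVENT_KEY_PRESSED;
}

void Event_HandlerLoop(void)               /* slow work happens here */
{
    while (1)
    {
        if (eventFlags & EVENT_KEY_PRESSED)
        {
            /* A real module would guard this read-modify-write against
               the ISR (critical section). */
            eventFlags &= (uint8_t)~EVENT_KEY_PRESSED;
            handle_key();                  /* may itself post new events */
        }
        /* ...check further event flags here... */
    }
}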
Software Design Using Component-Based
Architecture
The HAL itself, in theory, is fully reusable and does not need rewriting when you
port to new hardware. In practice, the HAL typically needs tweaking to
accommodate some idiosyncrasy of the new platform, but as a stable API it
remains unchanged from the perspective of the higher levels of software. What this
means from a software reuse perspective is that, when you port the code to new
hardware, all access into and out of the hardware-sensitive code remains
unchanged. In theory, you can swap in any hardware layer and it still works.

Managing Peripheral Data


One of the primary design philosophies is that data dictates design. When
designing embedded software, we must follow the data. In many cases, the data
starts in the outside world, interacts with the system through a peripheral device,
and is then served up to the application. A major design decision that all
architects encounter is “How to get the data from the peripheral to the
application?”. As it turns out, there are several different design mechanisms that
we can use, such as

Polling
Interrupts
Direct memory access (DMA)
Each of these design mechanisms, in turn, has several design patterns that can
be used to ensure data loss is not encountered. Let’s explore these mechanisms
now.

Peripheral Polling

The most straightforward design mechanism to collect data from a peripheral is to
simply have the application poll the peripheral periodically to see if any data is
available to manage and process. The following Figure demonstrates polling using
a sequence diagram. We can see that the data becomes available but sits in the
peripheral until the application gets around to requesting the data. There are
several advantages and disadvantages to using polling in your design.

The advantage of polling is that it is simple! There is no need to set up the
interrupt controller or interrupt handlers for the peripheral. Typically, a single
application thread with very well-known timing is used to check a status bit within
the peripheral to see if data is available or if the peripheral needs to be managed.
Unfortunately, there are more disadvantages to using polling than there are
advantages. First, polling tends to waste processing cycles. Developers must
allocate processor time to go out and check on the peripheral whether there is data
there or not. In a resource-constrained or low-power system, these cycles can add
up significantly. Second, there can be a lot of jitter and latency in processing the
peripheral depending on how the developers implement their code.
For example, if developers decide that they are going to create a while loop that
just sits and waits for data, they can get the data very consistently with low latency
and jitter, but it comes at the cost of a lot of wasted CPU cycles. Waiting in this
manner is polling using blocking. On the other hand, developers can instead use a
nonblocking method where other code is executed; but if the data arrives while
that other code is running, there can be a delay before the application gets to the
data, adding latency. Furthermore, if the data comes in at nonperiodic rates, it's
possible for the latency to vary, causing jitter in the processing time. The jitter
may or may not affect other parts of the embedded system or cause system
instability depending on the application.
Despite the disadvantages of polling, sometimes polling is just the best solution. If
a system doesn't have much going on and it doesn't make sense to add the
complexity of interrupts, then why add them? Debugging a system that uses
interrupts is often much more complicated. If polling fits, then it might be the right
solution; however, if the system needs to minimize response times and latency, or
must wake up from low-power states, then interrupts might be a better solution.
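
Both variants, sketched against a hypothetical UART (the UART, STATUS,
RX_READY, and RXDATA names are illustrative):

#include <stdbool.h>
#include <stdint.h>

uint8_t Uart_ReadBlocking(void)
{
    while ((UART->STATUS & RX_READY) == 0)
    {
        ;                              /* spin: wasted cycles, low jitter */
    }
    return UART->RXDATA;
}

bool Uart_ReadNonblocking(uint8_t *out)
{
    if (UART->STATUS & RX_READY)       /* check once and move on */
    {
        *out = UART->RXDATA;
        return true;
    }
    return false;                      /* caller retries later: adds latency */
}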

Peripheral Interrupts
Interrupts are a fantastic tool available to designers and developers to overcome
many disadvantages polling presents. Interrupts do precisely what their name
implies; they interrupt the normal flow of the application to allow an interrupt
handler to run code to handle an event that has occurred in the system. For
example, an interrupt might fire for a peripheral when data is available, has been
received, or even transmitted.
The following Figure shows an example sequence diagram for what we can expect
from an interrupt design.
There are several advantages to using an interrupt. First, there is no need to
waste CPU cycles checking to see if data is ready. Instead, an interrupt fires when
there is data available. Next, the latency to get the data is deterministic. It takes the
same number of clock cycles when the interrupt fires to enter and return from the
interrupt service routine (ISR). The latency for a lower priority interrupt can vary
though if a higher priority interrupt is running or interrupts it during execution.
Finally, jitter is minimized and only occurs if multiple interrupts are firing
simultaneously. In this case, the interrupt with the highest priority gets executed
first. The jitter can potentially become worse as well if the interrupt fires when an
instruction is executing that can’t be interrupted.
Despite interrupts solving many problems associated with polling, there are still
some disadvantages to using interrupts. First, interrupts can be complicated to set
up. While interrupt usage increases complexity, the benefits usually overrule this
disadvantage. Next, designers must be careful not to use interrupts that fire too
frequently. For example, trying to use interrupts to debounce a switch can cause an
interrupt to fire very frequently, potentially starving the main application and
breaking its real-time performance. Finally, when interrupts are used to receive
data, developers must carefully manage what they do in the ISR. Every clock cycle
spent in the ISR is a clock cycle away from the application. As a result, developers
often need to use the ISR to handle the immediate action required and then offload
processing and non-urgent activities to the application, causing software design
complexity to increase.

Interrupt Design Patterns


When an interrupt is used in a design, there is a chance that the work performed on
the data will take too long to run in the ISR. When we design an ISR, we want the
interrupt to:
• Run as quickly as possible (to minimize the interruption)
• Avoid memory allocation operations like declaring nonstatic variables,
manipulating the stack, or using dynamic memory
• Minimize function calls to avoid clock cycle overhead and issues with
nonreentrant functions or functions that may block

There’s also a good chance that the data just received needs to be combined
with past or future data to be useful. We can’t do all those operations in a timely
manner within an interrupt. We are much better served by saving the data and
notifying the application that data is ready to be processed. When this happens, we
need to reach for design patterns that allow us to get the data quickly, store it, and
get back to the main application as soon as possible.
Designers can leverage several such patterns on bare-metal and RTOS-based
systems. A few of the most exciting patterns include:
• Linear data store
• Ping-pong buffers
• Circular buffers
• Circular buffer with semaphores
• Circular buffer with event flags
• Message queues

Linear Data Store Design Pattern


A linear data store is a shared memory location that an interrupt service routine can
directly access, typically to write new data to memory. The application code,
usually the data reader, can also directly access this memory, as shown in Figure

Now, if you’ve been designing and developing embedded software for any
period, you’ll realize that linear data stores can be dangerous! Linear data stores
are where we often encounter race conditions because access to the data store
needs to be carefully managed so that the ISR and application aren’t trying to read
and write from the data store simultaneously. In addition, the variables used to
share the data stored between the application and the ISR also need to be declared
volatile to prevent the compiler from optimizing out important instructions caused
by the interruptible nature of the operations.
Data stores often require the designer to build mutual exclusion into the data store.
Mutual exclusion is needed because data stores have a critical section where if the
application is partway through reading the data when an interrupt fires and changes
it, the application can end up with corrupt data. We don’t care how the developers
implement the mutex at the design level, but we need to make them aware that the
mutex exists. I often do this by putting a circular symbol on the data store
containing either an “M” for a mutex or a key symbol, as shown in the following
Figure. Unfortunately, at this time, there are no standards that support official
nomenclature for representing a mutex
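
A sketch of a linear data store with the volatile qualifier and a simple critical
section for the mutual exclusion described above; ADC_READ(),
DISABLE_INTERRUPTS(), and ENABLE_INTERRUPTS() are hypothetical helpers:

#include <stdint.h>

static volatile uint16_t adcResult;     /* the shared data store */

void ADC_ISR(void)                      /* writer */
{
    adcResult = ADC_READ();             /* hypothetical register read */
}

uint16_t App_AdcResultGet(void)         /* reader */
{
    uint16_t copy;

    DISABLE_INTERRUPTS();               /* enter critical section (the "mutex") */
    copy = adcResult;                   /* read cannot be torn by the ISR */
    ENABLE_INTERRUPTS();                /* exit critical section */

    return copy;
}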

Ping-Pong Buffer Design Pattern


Ping-pong buffers, also sometimes referred to as double buffers, offer another
design solution meant to help alleviate some of the race condition problems
encountered with a data store. Instead of having a single data store, we have two
identical data stores, as shown in Figure

Now, at first, having two data stores might seem like an opportunity to simply
double the trouble, but it's a potential race condition saver. A ping-pong buffer is
so named because the data buffers are used back and forth in a ping-pong-like
manner. For example, at the start of an application, both buffers are marked as
write-only, and the ISR stores data in the first data store when data comes in.
When the ISR is done and the buffer is ready for the application code to read, it
marks that data store as ready to read. While the application reads that data, the
ISR stores any additional incoming data in the second data store. The process
then repeats.
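
A minimal sketch of the ping-pong mechanism; PERIPHERAL_DATA() and process()
are hypothetical, and the sketch assumes the application drains one buffer before
the other fills:

#include <stdbool.h>
#include <stdint.h>

#define BUFFER_SIZE  64u

static uint8_t buffer[2][BUFFER_SIZE];  /* the two identical data stores */
static volatile uint8_t writeIndex;     /* buffer the ISR currently fills */
static volatile bool    readyToRead;    /* set when a buffer is full */

void Data_ISR(void)
{
    static uint8_t pos;

    buffer[writeIndex][pos++] = PERIPHERAL_DATA();  /* hypothetical read */
    if (pos >= BUFFER_SIZE)
    {
        pos = 0;
        readyToRead = true;
        writeIndex ^= 1u;               /* "ping-pong" to the other buffer */
    }
}

void App_Process(void)
{
    if (readyToRead)
    {
        readyToRead = false;
        process(buffer[writeIndex ^ 1u], BUFFER_SIZE);  /* the full buffer */
    }
}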
Circular Buffer Design Pattern
One of the simplest and most used patterns to get and use data from an interrupt is
to leverage a circular buffer. A circular buffer is a data structure that uses a single,
fixed-size buffer as if it were connected end to end. Circular buffers are often
represented as a ring, as shown in the following Figure. Microcontroller memory is
not circular but linear. When we build a circular buffer in code, we specify the start
and stop addresses, and once the stop address is reached, we loop back to the
starting address.

The idea with the circular buffer is that the real-time data we receive in the
interrupt can be removed from the peripheral and stored in a circular buffer. As a
result, the interrupt can run as fast as possible while allowing the application code
to process the circular buffer at its discretion. Using a circular buffer helps ensure
that data is not lost, the interrupt is fast, and we still process the data reasonably.
The most straightforward design pattern for a circular buffer can be seen in Figure
5-7. In this pattern, we are simply showing how data moves from the peripheral to
the application. The data starts in the peripheral, is handled by the ISR, and is
stored in a circular buffer. The application can come and retrieve data from the
circular buffer when it wants to. Of course, the circular buffer needs to be sized
appropriately, so the buffer does not overflow.
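
A sketch of the single-producer, single-consumer case; PERIPHERAL_DATA() and
process_byte() are hypothetical, and the free-running 32-bit counters assume a CPU
(such as a Cortex-M) where aligned 32-bit accesses are atomic:

#include <stdint.h>

#define CB_SIZE  32u                    /* power of two keeps the wrap cheap */

static uint8_t           cb[CB_SIZE];
static volatile uint32_t head;          /* written only by the ISR */
static volatile uint32_t tail;          /* written only by the application */

void Rx_ISR(void)                       /* producer: fast, just stores data */
{
    cb[head % CB_SIZE] = PERIPHERAL_DATA();
    head++;                             /* buffer must be sized so head
                                           never laps tail (overflow) */
}

void App_Drain(void)                    /* consumer: runs at its own pace */
{
    while (tail != head)
    {
        process_byte(cb[tail % CB_SIZE]);
        tail++;
    }
}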

In the following tutorials, I have also mentioned a design of Circular buffer.


Circular Buffer with Notification Design Pattern
The circular buffer design pattern is great, but there is one problem with it that we
haven’t discussed; the application needs to poll the buffer to see if there is new
data available. While this is not a world-ending catastrophe, it would be nice to
have the application notified that the data buffer should be checked. Two methods
can signal the application: a semaphore and an event flag.
A semaphore is a synchronization primitive that is included in real-time operating
systems. Semaphores can be used to signal tasks about events that have occurred in
the application. We can leverage this tool in our design pattern, as shown in Figure.
The goal is to have the ISR respond to the peripheral as soon as possible, so there
is no data loss. The ISR then saves the data to the circular buffer. At this point, the
application
doesn’t know that there is data to be processed in the circular buffer without
polling it. The ISR then signals the application by giving a semaphore before
completing execution. When the application code runs, the task that manages the
circular buffer can be unblocked by receiving the semaphore. The task then
processes the data stored in the circular buffer.
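
A sketch of the semaphore notification, using FreeRTOS-style calls for
illustration; CircularBuffer_Put(), CircularBuffer_Drain(), and PERIPHERAL_DATA()
are hypothetical helpers, and the semaphore is assumed to be created with
xSemaphoreCreateBinary() during initialization:

#include "FreeRTOS.h"
#include "semphr.h"

static SemaphoreHandle_t dataReady;   /* created once at init */

void Rx_ISR(void)
{
    BaseType_t woken = pdFALSE;

    CircularBuffer_Put(PERIPHERAL_DATA());      /* drain peripheral quickly */
    xSemaphoreGiveFromISR(dataReady, &woken);   /* signal the waiting task */
    portYIELD_FROM_ISR(woken);                  /* request a context switch */
}

void ProcessingTask(void *params)
{
    (void)params;
    for (;;)
    {
        xSemaphoreTake(dataReady, portMAX_DELAY);  /* block, no CPU wasted */
        CircularBuffer_Drain();                    /* process buffered data */
    }
}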

Using a semaphore is not the only method to signal the application that data is
ready to be processed. Another approach is to replace the semaphore with an event
flag. An event flag is an individual bit that is usually part of an event flag group
that signals when an event has occurred. Using an event flag is more efficient in
most real-time operating systems than using a semaphore. For example, a designer
can have 32 event flags in a single RAM location on an Arm Cortex-M processor.
In contrast, just a single semaphore with its semaphore control block can easily be
a few hundred bytes.
Semaphores are often overused in RTOS applications because developers jump
straight into the coding and often don’t take a high-level view of the application.
The result is a bunch of semaphores scattered throughout the system. I’ve also
found that developers are less comfortable with event flags because they aren't
covered or discussed as often in classes or engineering literature.
An example design pattern for using event flags and interrupts can be seen in
Figure. We can represent an event flag version of a circular buffer with a
notification design pattern. As you can see, the pattern itself does not change, just
the tool we use to implement it. The implementation here results in fewer clock
cycles being used and less RAM.
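
The same notification, sketched with FreeRTOS-style event groups for illustration
(note that in FreeRTOS, setting event bits from an ISR is deferred to the timer
daemon task; other RTOSes implement event flags more directly). The
circular-buffer helpers remain hypothetical:

#include "FreeRTOS.h"
#include "event_groups.h"

#define EVT_RX_DATA   (1u << 0)

static EventGroupHandle_t events;     /* created with xEventGroupCreate() */

void Rx_ISR(void)
{
    BaseType_t woken = pdFALSE;

    CircularBuffer_Put(PERIPHERAL_DATA());
    xEventGroupSetBitsFromISR(events, EVT_RX_DATA, &woken);
    portYIELD_FROM_ISR(woken);
}

void ProcessingTask(void *params)
{
    (void)params;
    for (;;)
    {
        /* pdTRUE clears the bit on exit; pdFALSE = wait for any bit */
        xEventGroupWaitBits(events, EVT_RX_DATA, pdTRUE, pdFALSE,
                            portMAX_DELAY);
        CircularBuffer_Drain();
    }
}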

FUNDAMENTALS OF DESIGNING DEVICE DRIVERS:

Struct Overlays

In embedded systems featuring memory-mapped I/O devices, it is sometimes
useful to overlay a C struct onto each peripheral's control and status registers. The
benefits of struct overlays are that you can read and write through a pointer to the
struct, the registers are described nicely by the struct, the code is kept clean, and
the compiler does the address construction at compile time.

The following example code shows a struct overlay for a timer peripheral. If a
peripheral's registers do not align correctly, reserved members can be included in
the struct. Thus, in the following example, an extra field that you'll never refer to is
included at offset 4 so that the control field lies properly at offset 6.
typedef struct
{
uint16_t count; /* Offset 0 */
uint16_t maxCount; /* Offset 2 */
uint16_t _reserved1; /* Offset 4 */
uint16_t control; /* Offset 6 */
} volatile timer_t;

timer_t *pTimer = (timer_t *)(0xABCD0123);

It is very important to be careful when creating a struct overlay to ensure that the
sizes and addresses of the underlying peripheral's registers map correctly.
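
Since C11, this check can be pushed to compile time; a minimal sketch against the
timer_t overlay above:

#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

static_assert(sizeof(timer_t) == 8, "timer_t must span exactly 8 bytes");
static_assert(offsetof(timer_t, control) == 6, "control must sit at offset 6");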

The bitwise operators shown earlier to test, set, clear, and toggle bits can also be
used with a struct overlay. The following code shows how to access the timer
peripheral's registers using the struct overlay. Here's the code for testing bits:

if (pTimer->control & 0x08)


{
/* Do something here... */
}

Here's the code for setting bits:

pTimer->control |= 0x10;

Here's the code for clearing bits:

pTimer->control &= ~(0x04);

And here's the code for toggling bits:

pTimer->control ^= 0x80;

Design by Contract
Design-by-contract is a methodology developers can use to specify the
pre-conditions, post-conditions, side effects, and invariants associated with an
interface. Every component then has a contract that must be adhered to in order
for the component to integrate into the application successfully.

As developers, we must examine a component’s inputs, outputs, and the work (the
side effects) that will be performed. The pre-conditions describe what conditions
must already exist within the system prior to performing an operation with the
component.
For example, a GPIO pin state cannot be toggled unless it first has the GPIO clock
enabled. Enabling the clock would be a pre-condition or a pre-requisite for the
GPIO component. Failing to meet this condition would result in nothing happening
when a call to perform a GPIO operation occurs.
A side effect is basically just that: something in the system changes. Maybe a
memory region is written or read, an I/O state is altered, or data is simply returned.
Something useful happens by interacting with the component's interface. The
resulting side effect then produces post-conditions that a developer can
expect: the system state has changed into a desired state.
Finally, the outputs for the component are extracted. Perhaps the interface returns
a success or a failure flag—maybe even an error code. Something is returned to let
the caller know that everything proceeded as expected and the resulting side effect
should now be observable.

Assertions
The best definition for an assertion that I have come across is
“An assertion is a Boolean expression at a specific point in a program that will
be true unless there is a bug in the program.”
There are three essential points that we need to note about the preceding definition:
1. An assertion evaluates an expression as either true or false.
2. The assertion is an assumption of the state of the system at a specific point in
the code.
3. The assertion is validating a system assumption that, if not true, reveals a bug
in the code. (It is not an error handler!)
As you can see from the definition, assertions are particularly useful for verifying
the contract for an interface, function, or module.
Each programming language that supports assertions does so in a slightly different
manner. For example, in C/C++, assertions are implemented as a macro named
assert that can be found in assert.h. There is quite a bit to know about assertions,
which we will cover in this chapter, but before we move on to everything we can
do with assertions, let’s first look at how they can be used in the context of Design-
by-Contract. First, take a moment to review the contract that we specified in the
documentation in the following Listing

The Doxygen documentation for a function can specify the conditions in the
Design-by-Contract for that interface call
/*************************************************************
* Function : Dio_Init()
*//**
* \b Description:
*
* This function is used to initialize the Dio based on the
* configuration table defined in dio_cfg module.
* PRE-CONDITION: Configuration table needs to be populated
* (sizeof > 0) <br>
* PRE-CONDITION: NUMBER_OF_CHANNELS_PER_PORT > 0 <br>
* PRE-CONDITION: NUMBER_OF_PORTS > 0 <br>
* PRE-CONDITION: The MCU clocks must be configured and
* enabled.
*
* POST-CONDITION: The DIO peripheral is setup with the
* configuration settings.
*
* @param[in] Config is a pointer to the configuration
* table that contains the initialization for the peripheral.
*
* @return void
*
* \b Example:
* @code
* const DioConfig_t *DioConfig = Dio_ConfigGet();
* Dio_Init(DioConfig);
* @endcode
*
* @see Dio_Init
* @see Dio_ChannelRead
* @see Dio_ChannelWrite
* @see Dio_ChannelToggle
* @see Dio_ChannelModeSet
* @see Dio_ChannelDirectionSet
* @see Dio_RegisterWrite
* @see Dio_RegisterRead
* @see Dio_CallbackRegister
*************************************************************/
Documentation can be an excellent way to create a contract between the interface
and the developer. However, it suffers from one critical defect; the contract cannot
be verified by executable code. As a result, if a developer doesn’t bother to read
the documentation or pay close attention to it, they can violate the contract, thus
injecting a bug into their source code that they may or may not be aware of. At
some point, the bug will rear its ugly head, and the developer will likely need to
spend countless hours hunting down their mistake. Various programming
languages deal with Design-by-Contract concepts differently, but for embedded
developers working in C/C++, we can take advantage of a built-in language feature
known as assertions. Assertions can be used within the Dio_Init function to verify
that the contract specified in the documentation is met. For example, if you were to
write the function stub for Dio_Init and include the assertions, it would look
something like the following Listing. As you can see, for each precondition, you
have an assertion. We could also use assertions to perform the checks on the
postcondition and invariants. This is a bit stickier for the digital I/O example
because we may have dozens of I/O states that we want to verify, including pin
multiplexing. I will leave this up to you to consider and work through some
example code on your own.
Listing An example contract definition using assertions in C
void Dio_Init(DioConfig_t const * const Config)
{
    assert(Config != NULL);   /* note: sizeof(Config) would only measure the
                                 pointer itself, so test the pointer instead */
    assert(NUMBER_OF_CHANNELS_PER_PORT > 0);
    assert(NUMBER_OF_PORTS > 0);
    assert(Mcu_ClockState(GPIO) == true);

    /* TODO: Define the implementation */
}

Defining Assertions
The use of assertions goes well beyond creating an interface contract. Assertions
are interesting because they are pieces of code developers add to their applications
to verify assumptions and detect bugs directly. When an assertion is found to be
true, there is no bug, and the code continues to execute normally. However, if the
assertion is found to be false, the assertion calls to a function that handles the failed
assertion.
Each compiler and toolchain tend to implement the assert macro slightly
differently. However, the result is the same. The ANSI C standard dictates how
assertions must behave, so the differences are probably not of interest if you use an
ANSI C–compliant compiler. I still find it interesting to look and see how each
vendor implements it, though. For example, the next Listing shows how the
STM32CubeIDE toolchain defines the assert macro, and the Listing after it
demonstrates how Microchip implements it in MPLAB X. Again, the result is the
same, but the function that the assert macro calls when the expression is false
differs between toolchains.

The assert macro is defined in the STM32CubeIDE assert.h header


/* required by ANSI standard */
#ifdef NDEBUG
# define assert(__e) ((void)0)
#else
# define assert(__e) ((__e) ? (void)0 : __assert_func \
(__FILE__, __LINE__, __ASSERT_FUNC, #__e))
#endif
Looking at the definitions, you’ll notice a few things. First, the definition for
macros changes depending on whether NDEBUG is defined. The controversial
idea here is that a developer can use assertions during development with NDEBUG
not defined. However, for production, they can define NDEBUG, changing the
definition of assert to nothing, which removes it from the resultant binary image.
This idea is controversial because you should ship what you test! So if you test
your code with assertions enabled, you should ship with them enabled! Feel free to
ship with them disabled if you test with them disabled.

The assert macro is defined in the MPLAB X assert.h header


#ifdef NDEBUG
#define assert(x) (void)0
#else
#define assert(x) ((void)((x) || (__assert_fail(#x, __FILE__,\
__LINE__, __func__),0)))
#endif

Next, when we are using the full definition of assert, you'll notice that there is an
assertion-failed function that is called. The exact function is toolchain-defined and
differs between compilers: for STM32CubeIDE, the function is __assert_func,
while for MPLAB X the function is __assert_fail.
Several key results occur when an assertion fails, which include
1. Collecting the filename and line number where the assertion failed
2. Printing out a notification to the developer that the assertion failed and where it
occurred
3. Stopping program execution so that the developer can take a closer look

The assertion tells the developer where the bug was detected and halts the program
so that they can review the call path and memory states and determine what exactly
went wrong in the program execution. This is far better than waiting for the bug to
rear its head in how the system behaves, which may occur instantly or take
considerable time.
When and Where to Use Assertions
The assert macro is an excellent tool to catch bugs as they occur. It also
preserves the call stack and the system in the state it was in when the assertion
failed. This helps us pinpoint the problem much faster, but where does it make
sense to use assertions? Let's look at a couple of proper and improper uses for
assertions.
First, it’s important to note that we can use assertions anywhere in our program
where we want to test an assumption, meaning assertions could be found just about
anywhere within our program. Second, I’ve found that one of the best uses for
assertions is verifying function preconditions and postconditions, as discussed
earlier in the chapter. Still, they can also be “sprinkled” throughout the function
code.
As a developer writes their drivers and application code, in every function, they
analyze what conditions must occur before this function is executed for it to run
correctly. They then develop a series of assertions to test those preconditions. The
preconditions become part of the documentation and form a contract with the caller
on what must be met for everything to go smoothly.
A great example of this is a function that changes the state of a system variable in
the application. For example, an embedded system may be broken up into several
different system states that are defined using an enum. These states would then be
passed into a function like System_StateSet to change the system's operational
state, as shown in the following Listing

A set function for changing a private variable in a state machine


void System_StateSet(SystemState_t const State)
{
SystemState = State;
}
What happens if only five system states exist, but an application developer passes
in a state greater than the maximum state? In this case, the system state is set to
some nonexistent state which will cause an unknown behavior in the system.
Instead, the software should be written such that a contract exists between the
application code and the System_StateSet function that the state variable passed
into the function will be less than the maximum state. If anything else is passed in,
that is a defect, and the developer should be notified. We can do this using assert,
and the updated code can be seen in the following Listing
set function for changing a private variable in a state machine with an assertion to
check the preconditions
void System_StateSet(SystemState_t const State)
{
assert(State < SYSTEM_STATE_MAX);
SystemState = State;
}
Now you might say that the assertion could be removed and a simple if statement
used to check the parameter. This would be acceptable, but that is error handling!
In this case, we are constraining the conditions under which the function executes,
which means we don’t need error handling but a way to detect an improper
condition (bug) in the code. This brings us to an important point; assertions
should NOT be used for error handling! For example, a developer should NOT
use an assertion to check that a file exists on their file system, as shown in the
following Listing.
In this case, creating an error handler to create a default version of UserData.cfg
makes a lot more sense than signaling the developer that there is a bug in the
software. There is not a bug in the software but a file that is missing from the file
system; this is a runtime error.

An INCORRECT use of assertions is to use them for error handling


FileReader = fopen("UserData.cfg", "r");
assert(FileReader != NULL);
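
A sketch of the error-handling alternative described above; CreateDefaultUserData
is a hypothetical helper:

FileReader = fopen("UserData.cfg", "r");

if (FileReader == NULL)
{
    /* A missing file is a runtime error, not a bug: recover by creating
       a default configuration instead of asserting. */
    FileReader = CreateDefaultUserData();
}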

Does Assert Make a Difference?


There is a famous paper written by Microsoft several years ago that examined how
assertion density affected code quality. They found that if developers achieved an
assertion density of 2–3%, the quality of the code was much higher, that is, it had
fewer bugs. The paper does mention that the assertion density must be a true
density, which means developers aren't just adding assertions to reach 2–3%.
Instead, they are adding assertions where it makes sense and where they are
genuinely needed.
Setting Up and Using Assertions
Developers can follow a basic process to set up assertions in their code base. These
steps include
1. Include <assert.h> in your module.
2. Define a test assertion using assert(false);.
3. Compile your code (this helps the IDE bring in function references).
4. Right-click your assert and select “goto definition.”
5. Examine the assertion implementation.
6. Implement the assert_failed function.
7. Compile and verify the test assertion.
I recommend following this process early in the development cycle. Usually, after
I create my project, I go through the process before writing even a single line of
code. If you set up assertions early, they’ll be available to you in the code base,
which hopefully means you’ll use them and catch your bugs faster.
The preceding process has some pretty obvious steps. For example, steps 1–4 don’t
require further discussion. We examined how assertions are implemented in two
toolchains earlier in the chapter. Let’s change things up and look at steps 5–7 using
Keil MDK.

Examining the Assert Macro Definition


After walking through steps 1–4, a developer would need to examine how
assertions are implemented. When working with Keil MDK, the definition for
assert located in assert.h is slightly different from the definitions we’ve already
explored. The code in the following Figure shows the definitions for assert, starting
with the NDEBUG block. There are several important things we should notice.
First, we can control whether the assert macro is replaced with nothing, basically
compiled out of the code base, or we can define a version that will call a function if
the assertion fails. If we want to disable our assertions, we must create the symbol
NDEBUG. This is usually done through the compiler settings.
Next, we also must make sure that we define
__DO_NOT_LINK_PROMISE_WITH_ASSERT. Again, this is usually done
within the compiler settings symbol table. Finally, at this point, we can then come
to the definition that will be used for our assertions:
#define assert(e) ((e) ? (void)0 : __CLIBNS __aeabi_assert(#e, \
__FILE__, __LINE__))
As you can see, there is a bit more to defining assertions in Keil MDK compared to
GCC (GNU Compiler Collection)-based tools like STM32CubeIDE and MPLAB
X. Notice that the functions that are called if the assert fails are similar but, again,
named differently. Therefore, to define your assertion function, you must start by
reviewing what your toolchain expects. This is important because we will have to
define our assert_failed function, and we need to know what to call it so that it is
properly linked to the project.

Implementing assert_failed
Once we have found how the assertion is implemented, we need to create the
definition for the function. assert.h makes the declaration, but nothing useful will
come of it without defining what that function does. There are four things that we
need to do:
• Copy the declaration and paste it into a source module
• Turn the new declaration into a function definition
• Output something so that the developer knows the assertion failed
• Stop the program from executing

For a developer using Keil MDK, their assertion failed function would look
something like the code in Listing
void __aeabi_assert(const char *expr, const char *file,
int line)
{
Uart_printf(UART1, "Assert failed in %s at line %d\n",
file, line);
}
We copied and pasted the declaration from assert.h and turned it into a function
definition. (Usually, you can right-click the function call in the macro, which will
take you to the official declaration. You can just copy and paste this instead of
stumbling to define it yourself.) The function, when executed, prints a message
through one of the microcontroller's UARTs to notify the developer that an
assertion failed. A typical printout notifies the developer of which file the
assertion failed in and the line number. This tells the developer exactly where
the problem is.
This brings us to an interesting point. You can create very complex-looking
assertions that test for multiple conditions within a single assert, but then you’ll
have to do a lot more work to determine what went wrong. I prefer to keep my
assertions simple, checking a single condition within each assertion. There are
quite a few advantages to this, such as
First, it’s easier to figure out which condition failed.
Second, the assertions become clear, concise documentation for how the function
should behave.
Third, maintaining the code is more manageable.

Finally, once we have notified the developer that something has gone wrong, we
want to stop program execution. There are several ways to do this. The method I
see the most is just to use an empty while (true) or for(;;) statement. At this point,
the system "stops" executing any new code and just sits in a loop. This is okay to
do, but from an IDE perspective, it doesn't show the developer that something
went wrong. If my debugger can handle flash breakpoints, I prefer to place a
breakpoint in this function, or I'll use the assembly instruction __BKPT to halt the
processor. At that point, the IDE will stop and highlight the line of code. Using
__BKPT can be seen in the following Listing
void __aeabi_assert(const char *expr, const char *file,
int line)
{
Uart_printf(UART1, "Assert failed in %s at line %d\n",
file, line);
__asm("BKPT");
}

Verifying Assertions Work


Once the assertion is implemented, I will always go through and test it to ensure it
is working the way I expect it to. The best way to do this is to simply create an
assertion that will fail somewhere in the application. A great example is to place
the following assertion somewhere early in your code after initialization:
assert(false);
This assertion will never be true, and once the code is compiled and executed,
we might see serial output from our application that looks something like this:
“Assertion failed in Task_100ms.c at line 17”
As you can see, when we encounter the assertion, our new assert failed function
will tell us that an assertion failed. In this case, it was in the file Task_100ms.c at
line number 17. You can’t get more specific about where a defect is hiding in your
code than that!

Three Instances Where Assertions Are Dangerous


Assertions, if used properly, have been proven to improve code quality. Still,
despite code quality improvements, developers need to recognize that there are
instances where using assertions is either ineffective or could cause serious
problems. Therefore, before using assertions, we must understand the limits of
assertions. There are three specific instances where using an assertion could cause
problems and potentially be dangerous.
Instance #1 – Initialization
The first instance where assertions may not behave or perform as expected is
during the system initialization. When the microcontroller powers up, it reads the
reset vector and then jumps to the address stored there. That address usually points
to vendor-supplied code that brings up the clocks and performs the C copy down to
initialize the runtime environment for the application. While this seems like a
perfect place to have assertions, trying to sprinkle assertions throughout this
code is asking for trouble for several reasons.
First, the oscillator is starting up, so the peripherals are not properly clocked if
an assertion were to fire. Second, at this point in the application, printf will most
likely not have been mapped, and whatever resource it would be mapped to would
not have been initialized. Stopping the processor or trying to print something could
result in an exception that would prevent the processor from starting and result in
more issues and debugging than we would want to spend.
For these reasons, it's best to keep the assertion to areas of the code after the
low-level initializations and under the complete control of the developer. I’ve
found that if you are using a library or autogenerated code, it’s best to leave these
as is and not force assertion checks into them. Instead, in your application code,
add assertions to ensure that everything is as expected after they have “done their
thing.”

Instance #2 – Microcontroller Drivers


Drivers are a handy place to use assertions, but, again, we need to be careful which
drivers we use assertions in. Consider the case where we get through the start-up
code. One of the first things many developers do is initialize the GPIO pins. I often
use a configuration table that is passed into the Gpio_Init function. If I have
assertions checking this structure and something is not right, I’m going to fire an
assertion, but the result of that assertion will go nowhere! Not only are the GPIO
pins not initialized, but printf and the associated output have not yet been mapped!
At this point, I’ll get an assertion that fails silently or, worse, some other code in
the assertion that fails that then has me barking up the wrong tree.
Otherwise, a silent assertion is not necessarily a bad thing. We still get the line
number of the failed assertion, and information about the possible cause can be
stored in memory. The issue is that we don't realize that the assertion has fired and
just look confounded at our system for a while, trying to understand why it isn’t
running. As we discussed earlier, we could set an automatic breakpoint or use
assembly instructions to make it obvious to us. One of my favorite tricks is to
dedicate an LED as an assert LED that latches when an assertion fires. We will talk
more about it shortly

Instance #3 – Real-Time Hardware Components


The first two instances we have looked at are truthfully the minor cases where we
could get ourselves into trouble. The final instance is where there is a potential for
horrible things to happen. For example, consider an embedded system that has a
motor that is being driven by the software. That motor might be attached to a series
of gears lifting something heavy, driving a propulsion mechanism, or doing other
practical work. If an operator was running that system and the motor suddenly
stopped being driven, a large payload could suddenly give way! That could
potentially damage the gearing, or the payload could come down on the user!
Another example could be if an electronic controller were driving a turbine or
rocket engine. Suppose the engine was executing successfully, and the system
gained altitude or was on its way to orbit, and suddenly an assertion was hit. In that
case, we suddenly have an uncontrolled flying brick! Obviously, we can’t allow
these types of situations to occur.
For systems that could be in the middle of real-time or production operations, our
assertions need to have more finesse than simply stopping our code. The assertions
need to be able to log whatever data they can and then notify the application code
that something has gone wrong. The application can then decide if the system
should be shut down safely or terminated or if the application should just proceed
(maybe even attempt recovery).

As discussed earlier, you want to be able to detect a bug the moment that it
occurs, but you also don't want to break the system or put it into an unsafe state.
Real-time assertions are essentially standard assertions, except that they don't use a
"BKPT" instruction or infinite loop to stop program execution. Instead, the
assertion needs to record the failure and notify the application without halting
the system. There are many ways developers can do this, but let's look at four
tips that should aid you when getting started with real-time assertions.

Real-Time Assertion Tip #1 – Use a Visual Aid


The first and most straightforward technique developers can use to be notified of
an assertion without just halting the CPU is to signal with a visual aid. In most
circumstances, this will be with an LED, but it is possible to use LED bars and
other visual indicators. The idea here is that we want to print out the message, but
we also want to get the developer's attention. Therefore, we can modify the
assert-failed function as in the following Listing, where we remove the BKPT
instruction and instead place a call to change the state of an LED.
void __aeabi_assert(const char *expr, const char *file,
                    int line)
{
    Uart_printf(UART1, "Assertion failed in %s at line %d\n",
                file, line);

    // Turn on the LED and signal an assertion.
    LED_StateWrite(LED_Assert, ON);
}
The latched LED shows that the assertion has occurred, and the developer can then
check the terminal for the details. While this can be useful, it isn’t going to store
more than the file and line, which makes its use just a notch up from continuously
watching the terminal. Unfortunately, as implemented, we also don’t get any
background information about the system's state. You only know which assertion
failed, but if you crafted the assertion expression correctly, this should provide
nearly enough information to at least start the bug investigation.

Real-Time Assertion Tip #2 – Create an Assertion Log


Assertions usually halt the CPU, but if we have a motor spinning or some other
reason we don’t want to stop executing the code, we can redirect the terminal
information to a log. Of course, logs can be used in the development, but they are
also one of the best ways to record assertion information from a production system
without a terminal. If you look at the DevAlert product from Percepio, they tie
assertions into their recording system. They can then push the information to the
cloud to notify developers about issues occurring in the field.
There are several different locations to which we could redirect the assertion log.
The first location that we should consider, no matter what, is to log the assertion
data to RAM. I will often use a circular buffer that can hold a string of a maximum
length and then write the information to the buffer. I’ll often include the file, line
number, and any additional information that I think could be important. I
sometimes modify the assertion failed function to take a custom string that can
provide more information about the conditions immediately before the assertion
was fired. Finally, we might log the data doing something like the following
Listing
void __aeabi_assert(const char *expr, const char *file,
                    int line)
{
#if ASSERT_UART == TRUE
    Uart_printf(UART1, "Assertion failed in %s at line %d\n",
                file, line);
#else
    Log_Append(ASSERT, Assert_String, file, line);
#endif

    // Turn on the LED and signal an assertion.
    LED_StateWrite(LED_Assert, ON);
}

You’ll notice in the Listing that I’ve also started adding conditional compilation
statements to define different ways the assertion function can behave. For example,
if ASSERT_UART is true, we just print the standard assertion text to the UART.
Otherwise, we call Log_Append, which will store additional information and log
the details in another manner. Once the log information is stored in RAM, there
would be some task in the main application that would periodically store the RAM
log to nonvolatile memory such as flash, an SD card, or other media.

Real-Time Assertion Tip #3 – Notify the Application


Assertions are not designed to be fault handlers; a real-time assertion can’t just
stop the embedded system. Of course, we may still want to stop the system, but to
do so, we need to let the main application know that a defect has been detected and
that we need to move into a safe mode as soon as possible (at least during
development). We can create a signaling mechanism between the assertion library
and the main application.
For example, we may decide that there are three different types of assertions in
our real-time system:
1) Critical assertions that require the system to move into a safe state
immediately.
2) Moderate assertions don’t require the system to stop, but the developer is
immediately notified of an issue so that the developer can decide how to
proceed.
3) Minor assertions don't require the system to be stopped and may not even
need to notify the application. However, these assertions would be logged.
We might add these assertion severity levels into an enum that we can use in our
code. For example,
typedef enum
{
    ASSERT_SEVERITY_MINOR,
    ASSERT_SEVERITY_MODERATE,
    ASSERT_SEVERITY_CRITICAL,
    ASSERT_SEVERITY_MAX_COUNT
} AssertSeverity_t;
Our assert failed function would then require further modifications that would
allow it to notify the main application. We might even need to move away from the
C standard library assert functions. For example, we might decide that the
developer should specify if the assertion failed and what severity level the
assertion is. We may want to redefine our assertion failed function to look
something like the following Listing
void assert_failed(const char *expr, const char *file, int line,
                   AssertSeverity_t Severity)
{
#if ASSERT_UART == TRUE
    Uart_printf(UART1, "Assertion failed in %s at line %d\n",
                file, line);
#else
    Log_Append(ASSERT, Assert_String, file, line);
#endif

    // Turn on the LED and signal an assertion.
    LED_StateWrite(LED_Assert, ON);

    App_Notify(Severity);
}
In the modified assertion function, we allow an assertion severity to be passed to
the function and then have a custom function named App_Notify that passes that
severity to a function that will behave how the application needs it to based on the
severity level. After all, each application may have its requirements for handling
these things. So, for example, App_Notify can decide if the assertion is just logged
or if some custom handler is executed to put the system into a safe state.

Real-Time Assertion Tip #4 – Conditionally Configure Assertions


Assertions are a simple mechanism for detecting defects, but developers can create
as sophisticated a mechanism as they decide will fit their application. If you plan to
use assertions in both development and production, it can be helpful to create a
series of conditions that determine how your assertions will function. For example,
you might create conditions that allow the output to be mapped to
1. UART
2. Debug Console
3. Serial Wire Debug
4. A log file
There can also be conditions that disable assertion functionality altogether or
control what information is gathered and provided to the log. There may be
different capabilities that a developer wants during development vs. what they
would like in production. What's important is to think through what you would
like to get from your assertion capabilities and design the most straightforward
functionality you need. The more complex you make it, the greater the chances
that something will go wrong.
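
For example, the output channel might be selected at compile time along these
lines; the configuration symbol names here are illustrative, while Uart_printf and
Log_Append are the functions used earlier in this chapter:

/* Select exactly one assertion output channel at build time */
#if defined(ASSERT_OUTPUT_UART)
    #define ASSERT_REPORT(f, l)  Uart_printf(UART1, "Assert: %s:%d\n", (f), (l))
#elif defined(ASSERT_OUTPUT_LOG)
    #define ASSERT_REPORT(f, l)  Log_Append(ASSERT, "", (f), (l))
#elif defined(ASSERT_DISABLED)
    #define ASSERT_REPORT(f, l)  ((void)0)
#else
    #error "No assertion output channel selected"
#endif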
REVIEW: THE VOLATILE, STATIC, AND EXTERN KEYWORDS:

Memory-Mapping Methodologies in Device Drivers

The simplest techniques tend to not be reusable or portable, while the more
complex techniques are. There are several memory-mapping techniques that are
commonly used in driver design.
These methods include the following:
• Direct memory mapping
• Using pointers
• Using structures
• Using pointer arrays
Let’s examine the different methods that can be used to map a driver to memory.
Mapping Memory Directly
Once a developer has thought through the different driver models that can be used
to control the microcontroller peripherals, it is time to start writing code. There are
multiple techniques that a developer could use to map their driver into the
peripherals’ memory space, such as directly writing registers or using pointers,
structures, or pointer arrays.
The simplest technique to use—and the least reusable—is to write directly to a
peripheral’s register. For example, let’s say that a developer wants to configure
GPIO Port C. In order to set up and read the port, a developer can examine the
register definition file, find the correct identifier, and then write code similar to that
seen in Figure

PORTC_SET_PIN_2();
Writing code in this manner is very manual and labor intensive. The code is written
for a single and very specific setup. The code can be ported, but there are
opportunities for the wrong values to be written, which can lead to a bug and then
time spent debugging. Very simple applications that won’t be reused often use this
direct register write method for setting up and controlling peripherals. Directly
writing to registers in this manner is also fast and efficient, and it doesn’t require a
lot of flash space.
While directly writing to registers can be useful, the technique is often employed
for software that will not be reused or that is written on a very resource-constrained
embedded system, such as a simple 8-bit microcontroller. A technique that is
commonly used when reuse is necessary is to use pointers to map into memory. An
example declaration to map into the GPIO Port C register—let’s say it’s the data
register—can be seen in Figure
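
The missing figure presumably showed a plain pointer declaration along these
lines (the address is hypothetical):

#include <stdint.h>

uint32_t *Gpio_PortC = (uint32_t *)0x400FF080;  /* hypothetical address;
                                                   note: no volatile yet */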

A PROBLEM?!
The compiler assumes that memory is changed only by the program itself, so it is
free to optimize away what look like redundant reads of the register. In order to
resolve this issue, developers need to use the volatile keyword. Volatile
essentially tells the compiler that the data being read can change out of sequence
at any time without any code changing the value. There are three places where
volatile is typically used:
• Variables that are being mapped to hardware registers
• Data being shared between interrupt service routines and application
code
• Data being shared between multiple threads
Volatile basically tells the compiler to not optimize out the read but instead make
sure that the data stored in the memory location is read every time the variable is
encountered.

With the volatile keyword in the correct place, we now know the compiler won’t
optimize out reading the variable. However, there still is a problem with the
declaration the way it has been written. Take a moment to examine the code shown
in Figure
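
The hazard shown in the missing figure is along these lines (address hypothetical):

uint32_t volatile *Gpio_PortC = (uint32_t volatile *)0x400FF080;

Gpio_PortC++;   /* perfectly legal C, but the pointer now aims at some
                   neighboring register or an entirely different peripheral */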

It is perfectly legal to increment our pointer Gpio_PortC. After incrementing the
pointer, we could be pointing at Port D, a different register in Port C, or even an
SPI or I2C peripheral. Once a pointer is mapped into memory, a developer should
not be allowed to increment, decrement, or modify the location of the pointer.
This is extremely dangerous! So instead, in our declaration, we should declare our
pointer to be constant, as shown in Figure
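
With both qualifiers in place, the declaration might look like this (address still
hypothetical):

uint32_t volatile * const Gpio_PortC = (uint32_t volatile *)0x400FF080;
/* volatile: the pointed-to register can change at any time           */
/* const:    the pointer itself can never be incremented or re-aimed  */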

Mapping Memory with Structures


The next technique, and probably the most common technique provided by
microcontroller vendors, is to use structures to map into memory. Structures
provide developers with a way to create data members that directly map to a
memory location.
The C standard guarantees that the data members of a structure appear in the order
they are declared, but it does not forbid padding between them, so developers must
verify (for example, with static assertions) that the layout matches the hardware.
The result is the ability to create structure pointers that directly map into a
peripheral's memory space, as shown in Figure
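
Since the figure is not reproduced here, a sketch for a hypothetical GPIO
peripheral; the register names, offsets, and base address are all illustrative:

#include <stdint.h>

typedef struct
{
    uint32_t MODER;   /* offset 0x00: mode register        */
    uint32_t OTYPER;  /* offset 0x04: output type register */
    uint32_t IDR;     /* offset 0x08: input data register  */
    uint32_t ODR;     /* offset 0x0C: output data register */
} volatile Gpio_t;

#define GPIOC ((Gpio_t *)0x48000800u)   /* hypothetical base address */

void Gpio_Example(void)
{
    GPIOC->ODR |= (1u << 2);            /* set pin 2 on port C */
}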
Remember Structure Overlays?
Using structures to map memory can be efficient and provides developers with a
way to start creating reusable memory-mapped drivers. Using standards such as
the ARM® Cortex® Microcontroller Software Interface Standard (CMSIS) can
provide a common and reusable method for accessing peripheral registers that
improves portability. Unfortunately, as of this writing, many vendors still use
their own naming conventions, which still requires a fair amount of work to adapt
to different microcontrollers.
https://github.com/ARM-
software/NXP_LPC/blob/master/LPC1700/CMSIS/Driver/GPIO_LPC17xx.c
Introduction to SIMD in ARM Cortex
Architecture
Let’s take the example of an AVR microcontroller: it processes a single data item
with each instruction. Remember Intel’s MMX and SSE extensions? They support a
single instruction working on multiple data. CPUs based on such architectures are
very effective in multimedia applications, such as image processing, where you
work on multiple pixels at the same time, or audio processing, where you work on
multiple audio samples using a single instruction. Adding two bitmaps pixel by
pixel, for example, is a natural fit:

for (int i = 0; i < width; i++)
{
    for (int j = 0; j < height; j++)
    {
        bitmap_3[i][j] = bitmap_1[i][j] + bitmap_2[i][j];
    }
}

Most processors that support a SIMD instruction set work on 32-bit or 64-bit data
words, so processing capability is wasted when working on a single datum. Using
SIMD, we work on multiple data in parallel, which yields a large performance gain.
Modern ARM architecture microcontrollers support SIMD, and some support an
extension called NEON; covering NEON is beyond the scope of this book. We will
cover basic SIMD on an ARM Cortex-M4 based microcontroller, which makes it
easy to grasp the fundamentals while discussing it.
Suppose you have a 2D bitmap image stored as a 2D array of pixels, and each
pixel consumes 32 bits in the format A8R8G8B8.
Let’s try an example of SIMD: we would like to subtract a value from each pixel in
that framebuffer (image). This is a very basic operation.
Executing the following code consumes 300 ms:
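A sketch of the loop, assuming 120 x 120 pixels at 4 bytes each (which gives the
57600 iterations mentioned below); the framebuffer declaration is an assumption:

#include <stdint.h>

#define FRAME_BYTES (120 * 120 * 4) /* 57600 channel bytes */

extern volatile uint8_t framebuffer[FRAME_BYTES];

void darken_scalar(void)
{
    for (uint32_t i = 0; i < FRAME_BYTES; i++)
    {
        uint8_t temp = framebuffer[i]; /* fetch one channel byte            */
        temp = temp - 100;             /* subtract 100 (wraps on underflow) */
        framebuffer[i] = temp;         /* store the result back             */
    }
}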

As a debugging technique, I’m toggling a pin around the processing loop to
measure the elapsed time the loop consumed. As you can see, that is a very basic
operation, and it took 300 ms, because the loop executes 57600 times, plus the
many operations executed inside the loop: comparing the index i, fetching data
from the array, subtracting 100, and assigning the value temp back. All of these
operations consume a lot of cycles, as the disassembled version of that piece of
code shows.
Looking at the SIMD instruction set manual of the ARM Cortex-M4, there is an
instruction called USUB8. According to the manual, it subtracts each of the four
bytes of its second operand from the corresponding byte of its first operand. That
means we work on 4 bytes at a time, which is multiple data, and that will enhance
the performance a lot, as we will see.
The example is shown below:
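This sketch assumes the CMSIS __USUB8 intrinsic and the framebuffer from the
scalar version, now walked one 32-bit word (four channel bytes) at a time:

#include <stdint.h>
/* __USUB8 is provided by the CMSIS core header for the Cortex-M4,
   e.g. core_cm4.h pulled in by the device header. */

#define FRAME_WORDS (120 * 120) /* 14400 words = 57600 bytes / 4 */

extern volatile uint32_t framebuffer_words[FRAME_WORDS];

void darken_simd(void)
{
    const uint32_t sub = 0x64646464u; /* 0x64 = 100 packed into all four byte lanes */

    for (uint32_t i = 0; i < FRAME_WORDS; i++)
    {
        /* Subtract 100 from four channel bytes in a single instruction. */
        framebuffer_words[i] = __USUB8(framebuffer_words[i], sub);
    }
}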
The value 0x64 is 100 in decimal. As you can see, we process 4 bytes each time
through the loop, hence the division by 4 in the loop count. Recall the image
representation A8R8G8B8, in which each channel consists of one byte; 100 is
therefore subtracted from every byte of each pixel using that instruction.
Measuring the performance the same way, the width of the pulse is now 3.3 ms,
which is a huge performance gain! In conclusion, you can consult the ARM
Cortex-M4 SIMD manual, and you can also read more about NEON, which is an
extension of ARM SIMD. There are many situations where you will want to work
on multiple data to gain performance, as we showed in this chapter.
VARIOUS TOPICS IN EMBEDDED
SOFTWARE:
Endianness
Endianness is the attribute of a system that indicates whether integers are
represented from left to right or right to left.

Endianness comes in two varieties: big and little. A big-endian representation has a
multibyte integer written with its most significant byte on the left; a number
represented thus is easily read by English-speaking humans. A little-endian
representation, on the other hand, places the most significant byte on the right. Of
course, computer architectures don't have an intrinsic "left" or "right." These
human terms are borrowed from our written forms of communication. The
following definitions are more precise:

Big-endian

Means that the most significant byte of any multibyte data field is stored at
the lowest memory address, which is also the address of the larger field

Little-endian

Means that the least significant byte of any multibyte data field is stored at the
lowest memory address, which is also the address of the larger field

It matters only when two computers are trying to communicate. Every processor
and every communication protocol must choose one type of endianness or the
other. Thus, two processors with different endianness will conflict if they
communicate through a memory device. Similarly, a little-endian processor trying
to communicate over a big-endian network will need to do software-byte
reordering.
A C programming example

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t word = 0x0A0B0C0D;                      /* An unsigned 32-bit integer. */
    unsigned char *pointer = (unsigned char *)&word; /* A pointer to the first octet of the word;
                                                        unsigned char avoids sign extension. */

    for (int i = 0; i < 4; i++)
    {
        printf("%02x ", (unsigned int) pointer[i]);  /* print each byte in memory order */
    }
    puts("");
}

Output from various endiannesses:

• Modern little-endian: 0d 0c 0b 0a
• Modern big-endian: 0a 0b 0c 0d
If you develop software for a microcontroller, it is likely that you will measure
some physical quantity (e.g., acceleration or temperature) and sample it at regular
time intervals with an analog-to-digital converter to obtain digital data. Now,
regardless of what you are measuring, you will need to take into account the
endianness of the AD-converter.
The Web is full of data sheets for many AD-converters. I picked the following one
(almost at random):
www.analog.com/media/en/technical-documentation/data-sheets/AD7981.pdf
It is an industrial converter, designed to operate at high temperatures, that can
perform 600 kSPS (i.e., six hundred thousand samples per second) of an input
voltage between 0V and 5.1V. But those details are irrelevant for the purposes of
this book. What matters is that it converts the input voltage to a 16-bit number.

What is important to note is that the microprocessor/microcontroller receives the
16 bits that represent the sampled analog value one bit at a time. The 16 bits fit into
the two bytes of an unsigned short integer. But, and this is the crucial question, in
which order? That is, with which endianness? Data sheets are never easy bedside
reading, and if you look at the 25-page-long data sheet of the AD7981 you will find
the information you need on page 17, where it says: “When CNV goes low, the
MSB is output onto SDO. The remaining data bits are then clocked by subsequent
SCK falling edges.” It means that, as you get the MSB first, the 16-bit number is
transferred in big-endian format, the opposite of how numbers are represented in
our (little-endian) computers! In practical terms, you need to swap the two bytes of
each value you receive from the AD-converter. Within each byte the bit order is
fine, with the byte’s MSB “on the left” and the LSB “on the right.” You only need
to change the byte order.
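A minimal sketch of that swap (the function name is illustrative):

#include <stdint.h>

/* Swap the two bytes of a 16-bit sample received MSB-first (big-endian). */
static inline uint16_t swap16(uint16_t sample)
{
    return (uint16_t)((sample << 8) | (sample >> 8));
}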
MEMORY TESTING STRATEGIES:

Data bus test

We need to confirm that any value placed on the data bus by the processor is
correctly received by the memory device at the other end. The most obvious way to
test that is to write all possible data values and verify that the memory device
stores each one successfully. However, that is not the most efficient test available.
A faster method is to test the bus one bit at a time. The data bus passes the test if
each data bit can be set to 0 and 1, independently of the other data bits.

A good way to test each bit independently is to perform the so-called walking 1's
test. The following Table shows the data patterns used in an 8-bit version of this
test. The name walking 1's comes from the fact that a single data bit is set to 1 and
"walked" through the entire data word. The number of data values to test is the
same as the width of the data bus. This reduces the number of test patterns from 2^n
to n, where n is the width of the data bus.

Table. Consecutive data values for an 8-bit walking 1's test

00000001
00000010
00000100
00001000
00010000
00100000
01000000
10000000
Because we are testing only the data bus at this point, all of the data values can be
written to the same address. Any address within the memory device will do.
However, if the data bus splits as it makes its way to more than one memory chip,
you will need to perform the data bus test at multiple addresses, one within each
chip. To perform the walking 1's test, simply write the first data value in the table,
verify it by reading it back, write the second value, verify, and so on. When you
reach the end of the table, the test is complete. This time, it is okay to do the read
immediately after the corresponding write because we are not yet looking for
missing chips. In fact, this test may provide meaningful results even if the memory
chips are not installed!

int memtestDataBus(datum *pAddress, datum **ppFailAddr)
{
    datum pattern;

    *ppFailAddr = NULL;

    /* Perform a walking 1's test at the given address. */
    for (pattern = 1; pattern != 0; pattern <<= 1)
    {
        /* Write the test pattern. */
        *pAddress = pattern;

        /* Read it back (immediately is okay for this test). */
        if (*pAddress != pattern)
        {
            *ppFailAddr = pAddress;
            return 0;
        }
    }

    return 1;
}

Validating Memory Contents


How can we tell whether the data or program stored in a nonvolatile memory
device is still valid? One of the easiest ways is to compute a checksum of the data
when it is known to be valid prior to programming the ROM, for example. Then,
each time you want to confirm the validity of the data, you need only recalculate
the checksum and compare the result to the previously computed value. If the two
checksums match, the data is assumed to be valid. By carefully selecting the
checksum algorithm, we can increase the probability that specific types of errors
will be detected, while keeping the size of the checksum, and the time required to
check it, reasonable.
A cyclic redundancy check (CRC) is a specific checksum algorithm that is
designed to detect the most common data errors. The theory behind the CRC is
quite mathematical and beyond the scope of this book. However, cyclic
redundancy codes are frequently useful in embedded applications that require the
storage or transmission of large blocks of data. What follows is a brief explanation
of the CRC technique and some source code that shows how it can be implemented
in C. Thankfully, you don't need to understand why CRCs detect data errors or
even how they are implemented to take advantage of their ability to detect errors.
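As a sketch, here is a bitwise CRC-16-CCITT, one common embedded choice; the
polynomial 0x1021 and initial value 0xFFFF are conventional parameters, not
necessarily the ones a given protocol or tool uses:

#include <stddef.h>
#include <stdint.h>

uint16_t crc16_ccitt(const uint8_t *data, size_t length)
{
    uint16_t crc = 0xFFFF; /* conventional initial value */

    for (size_t i = 0; i < length; i++)
    {
        /* Fold the next byte into the high byte of the running CRC. */
        crc ^= (uint16_t)((uint16_t)data[i] << 8);

        for (int bit = 0; bit < 8; bit++)
        {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021); /* shift and apply polynomial */
            else
                crc = (uint16_t)(crc << 1);
        }
    }

    return crc;
}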

Introduction to Data Structures and Algorithms
These are very important concepts that everyone must know about: organizing data
in structures. There are many different ways to organize your data, which is why
data structures are such an important concept. This is not limited to embedded
software; any software engineering effort needs data structures to handle its data.
We are not going to talk about the implementation of these data structures in
detail, because there are many books on that subject out there; instead we will show
practical examples using the data structures that are important in the embedded
software life cycle.
A linear data structure is one in which the data elements are stored in a linear, or
sequential, order; that is, data is stored in consecutive memory locations. A linear
data structure can be represented in two ways; either it is represented by a linear
relationship between various elements utilizing consecutive memory locations as in
the case of arrays, or it may be represented by a linear relationship between the
elements utilizing links from one element to another as in the case of linked lists.
Examples of linear data structures include arrays, linked lists, stacks, queues, and
so on.
A non-linear data structure is one in which the data is not stored in any sequential
order or consecutive memory locations. The data elements in this structure are
represented by a hierarchical order. Examples of non-linear data structure include
graphs, trees, and so forth.
A static data structure is a collection of data in memory which is fixed in size and
cannot be changed during runtime. The memory size must be known in advance as
the memory cannot be reallocated later in a program. One example is an array.
A dynamic data structure is a collection of data in which memory can be
reallocated during execution of a program. The programmer can add or remove
elements according to his/her need. Examples include linked lists, graphs, trees,
and so on.

Advantages of using arrays

1. Elements are stored in adjacent memory locations; hence, access is very fast, as
any element can be reached directly by its index.
2. Arrays do not require dynamic memory allocation, so all the memory
management is done by the compiler.
Limitations of using arrays
1. Insertion and deletion of elements in arrays is complicated and very time-
consuming, as it requires the shifting of elements.
2. Arrays are static; hence, the size must be known in advance.
3. Elements in the array are stored in consecutive memory locations, which may or
may not be available.

A Queue is a linear collection of data elements in which the element inserted first
will be the element that is taken out first; that is, a queue is a FIFO (First In First
Out) data structure. A queue is a popular linear data structure in which the first
element is inserted from one end called the REAR end (also called the tail end),
and the deletion of the element takes place from the other end called the FRONT
end (also called the head).
Practical Application:
For a simple illustration of a queue, think of a line of people standing at the bus
stop waiting for the bus: the first person standing in the line will get onto the bus
first.
Other examples include buffering data from a keypad or a serial port, and video
buffering as on YouTube.

A Stack is a linear collection of data elements in which insertion and deletion take
place only at the top of the stack. A stack is a Last In First Out (LIFO) data
structure, because the last element pushed onto the stack will be the first element to
be deleted from the stack. Three operations can be performed on the stack, which
includes PUSH, POP, and PEEP operations. The PUSH operation inputs an
element into the top of the stack, while the POP operation removes an element
from the stack. The PEEP operation returns the value of the topmost element in the
stack without deleting it from the stack. Every stack has a variable TOP which is
associated with it. The TOP pointer stores the address of the topmost element in
the stack. The TOP is the position from where insertion and deletion take place.
Practical Application:
A real-life example of a stack is a pile of plates arranged on a table: a person picks
up the plate on top, which was the last one placed on the pile.

Linked List
The major drawback of the array is that the size or the number of elements must be
known in advance. Thus, this drawback gave rise to the new concept of a linked
list. A Linked list is a linear collection of data elements. These data elements are
called nodes, which point to the next node using pointers. A linked list is a
sequence of nodes in which each node contains one or more than one data field and
a pointer which points to the next node. Also, linked lists are dynamic; that is,
memory is allocated as and when required.

Each node of the linked list is divided into two slots:
1. The first slot contains the information/data.
2. The second slot contains the address of the next node.
Practical Application:
A simple real-life example is a train: each coach is connected to its previous and
next coach (except the first and last coach).
Another example is a Snake game: consider using a linked list to connect the
segments of the snake.
We have already learned that an array is a collection of data elements stored in
contiguous memory locations. Also, we studied that arrays were static in nature;
that is, the size of the array must be specified when declaring an array, which limits
the number of elements to be stored in the array. For example, if we have an array
declared as int array[15], then the array can contain a maximum of 15 elements and
not more than that. This method of allocating memory is good when the exact
number of elements is known, but if we are not sure of the number of elements
then there will be a problem, as our aim with data structures is to make programs
efficient in both memory consumption and execution time. To overcome this
problem, we will use linked lists.
A linked list is a linear collection of data elements. These data elements are called
nodes, and they point to the next node by means of pointers. A linked list is a data
structure which can be used to implement other data structures such as stacks,
queues, trees, and so on. A linked list is a sequence of nodes in which each node
contains one or more than one data field and a pointer which points to the next
node. Also, linked lists are dynamic in nature; that is, memory is allocated as and
when required. There is no need to know the exact size or exact number of
elements as in the case of arrays. The following is an example of a simple linked
list which contains five nodes:

Each node in this list is divided into two parts:
1. The first part contains the information/data.
2. The second part contains the address of the next node.
The last node has no next node connected to it, so it stores the special value NULL;
in C, NULL is the null pointer constant (numerically 0). The NULL pointer
therefore represents the end of the linked list. There is also another special pointer,
START, which stores the address of the first node of the linked list; the START
pointer therefore represents the beginning of the linked list. If START = NULL,
the linked list is empty. Because each node points to another node of the same
type, a linked list is known as a self-referential data type, or a self-referential
structure.
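In C, such a self-referential node can be declared as follows (the int data field is
just an example):

struct node
{
    int          data;  /* the information part                   */
    struct node *next;  /* address of the next node; NULL at end  */
};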

Advantages of linked lists

1. Linked lists are dynamic data structures; that is, they can grow or shrink during
the execution of the program.
2. Linked lists have efficient memory utilization. Memory is allocated whenever it
is required, and it is de-allocated whenever it is no longer needed.
3. Insertion and deletion are easier and more efficient.
4. Many complex applications can be easily carried out with linked lists.
Disadvantages of linked lists
1. They consume more space, because every node requires an additional pointer to
store the address of the next node.
2. Searching for a particular element in the list is difficult and time-consuming.
Linked List Applications – A simple memory manager

In C and C++, it can be very convenient to allocate and de-allocate blocks of
memory as and when needed. This is certainly standard practice in both languages
and almost unavoidable in C++. However, the handling of such dynamic memory
can be problematic and inefficient. For desktop applications, where memory is
freely available, these difficulties can be ignored. For embedded - generally real
time - applications, ignoring the issues is not an option.

Dynamic memory allocation tends to be nondeterministic; the time taken to
allocate memory may not be predictable, and the memory pool may become
fragmented, resulting in unexpected allocation failures. In this section the
problems will be outlined in detail and an approach to deterministic dynamic
memory allocation detailed.

C/C++ Memory Spaces

It may be useful to think in terms of data memory in C and C++ as being divided
into three separate spaces:

Static memory. This is where variables, which are defined outside of functions,
are located. The keyword static does not generally affect where such variables are
located; it specifies their scope to be local to the current module. Variables that are
defined inside of a function, which are explicitly declared static, are also stored in
static memory. Commonly, static memory is located at the beginning of the RAM
area. The actual allocation of addresses to variables is performed by the embedded
software development toolkit: a collaboration between the compiler and the linker.
Normally, program sections are used to control placement, but more advanced
techniques, like Fine Grain Allocation, give more control. Commonly, all the
remaining memory, which is not used for static storage, is used to constitute the
dynamic storage area, which accommodates the other two memory spaces.
Automatic variables. Variables defined inside a function, which are not declared
static, are automatic. There is a keyword to explicitly declare such a variable – auto
– but it is almost never used. Automatic variables (and function parameters) are
usually stored on the stack. The stack is normally located using the linker. The end
of the dynamic storage area is typically used for the stack. Compiler optimizations
may result in variables being stored in registers for part or all of their lifetimes; this
may also be suggested by using the keyword register.

The heap. The remainder of the dynamic storage area is commonly allocated to the
heap, from which application programs may dynamically allocate memory, as
required.

Dynamic Memory in C

In C, dynamic memory is allocated from the heap using some standard library
functions. The two key dynamic memory functions are malloc() and free().
The malloc() function takes a single parameter, which is the size of the requested
memory area in bytes. It returns a pointer to the allocated memory. If the allocation
fails, it returns NULL. The prototype for the standard library function is like this:

void *malloc(size_t size);

The free() function takes the pointer returned by malloc() and de-allocates the
memory. No indication of success or failure is returned. The function prototype is
like this:

void free(void *pointer);

To illustrate the use of these functions, here is some code to statically define an
array and set the fourth element’s value:

int my_array[10];
my_array[3] = 99;

The following code does the same job using dynamic memory allocation:

int *pointer;
pointer = malloc(10 * sizeof(int));
*(pointer+3) = 99;

The pointer de-referencing syntax is hard to read, so normal array referencing
syntax may be used, as [ and ] are just operators:

pointer[3] = 99;

When the array is no longer needed, the memory may be de-allocated thus:

free(pointer);
pointer = NULL;

Assigning NULL to the pointer is not compulsory, but it is good practice, as it will
cause an error to be generated if the pointer is erroneously utilized after the memory
has been de-allocated.

The amount of heap space actually allocated by malloc() is normally one word
larger than that requested. The additional word is used to hold the size of the
allocation and is for later use by free(). This “size word” precedes the data area to
which malloc() returns a pointer.

There are two other variants of the malloc() function: calloc() and realloc().

The calloc() function does basically the same job as malloc(), except that it takes
two parameters – the number of array elements and the size of each element –
instead of a single parameter (which is the product of these two values). The
allocated memory is also initialized to zeros. Here is the prototype:

void *calloc(size_t nelements, size_t elementSize);

The realloc() function resizes a memory allocation previously made by malloc(). It
takes as parameters a pointer to the memory area and the new size that is required.
If the size is reduced, data may be lost. If the size is increased and the function is
unable to extend the existing allocation, it will automatically allocate a new
memory area and copy data across. In any case, it returns a pointer to the allocated
memory. Here is the prototype:

void *realloc(void *pointer, size_t size);

Dynamic Memory in C++

Management of dynamic memory in C++ is quite similar to C in most respects.
Although the library functions are likely to be available, C++ has two additional
operators – new and delete – which enable code to be written more clearly,
succinctly and flexibly, with less likelihood of errors. The new operator can be
used in three ways:

p_var = new typename;
p_var = new type(initializer);
p_array = new type [size];

In the first two cases, space for a single object is allocated; the second one includes
initialization. The third case is the mechanism for allocating space for an array of
objects.

The delete operator can be invoked in two ways:


delete p_var;
delete[] p_array;

The first is for a single object; the second deallocates the space used by an array. It
is very important to use the correct de-allocator in each case.

There is no operator that provides the functionality of the C realloc() function.

Here is the code to dynamically allocate an array and initialize the fourth element:

int* pointer;
pointer = new int[10];
pointer[3] = 99;

Using the array access notation is natural. De-allocation is performed thus:

delete[] pointer;
pointer = NULL;

Again, assigning NULL to the pointer after deallocation is just good programming
practice. Another option for managing dynamic memory in C++ is to use the
Standard Template Library. This may be inadvisable for real time embedded
systems.

Issues and Problems

As a general rule, dynamic behavior is troublesome in real time embedded
systems. The two key areas of concern are determination of the action to be taken
on resource exhaustion and nondeterministic execution performance.

There are a number of problems with dynamic memory allocation in a real time
system. The standard library functions (malloc() and free()) are not normally
reentrant, which would be problematic in a multithreaded application. If the source
code is available, this should be straightforward to rectify by locking resources
using RTOS facilities (like a semaphore). A more intractable problem is associated
with the performance of malloc(). Its behavior is unpredictable, as the time it takes
to allocate memory is extremely variable. Such nondeterministic behavior is
intolerable in real time systems.
Without great care, it is easy to introduce memory leaks into application code
implemented using malloc() and free(). This is caused by memory being allocated
and never being deallocated. Such errors tend to cause a gradual performance
degradation and eventual failure. This type of bug can be very hard to locate.

Memory allocation failure is a concern. Unlike a desktop application, most
embedded systems do not have the opportunity to pop up a dialog and discuss
options with the user. Often, resetting is the only option, which is unattractive. If
allocation failures are encountered during testing, care must be taken with
diagnosing their cause. It may be that there is simply insufficient memory available
– this suggests various courses of action. However, it may be that there is sufficient
memory, but not available in one contiguous chunk that can satisfy the allocation
request. This situation is called memory fragmentation.

Memory Fragmentation

The best way to understand memory fragmentation is to look at an example. For
this example, it is assumed that there is a 10K heap. First, an area of 3K is
requested, thus:

#define K (1024)
char *p1;
p1 = malloc(3*K);

Then, a further 4K is requested:

char *p2;
p2 = malloc(4*K);

3K of memory is now free.

Some time later, the first memory allocation, pointed to by p1, is de-allocated:

free(p1);

This leaves 6K of memory free in two 3K chunks. A further request for a 4K
allocation is issued:

p1 = malloc(4*K);
This results in a failure – NULL is returned into p1 – because, even though 6K of
memory is available, there is not a 4K contiguous block available. This is memory
fragmentation.

It would seem that an obvious solution would be to de-fragment the memory,
merging the two 3K blocks to make a single one of 6K. However, this is not
possible because it would entail moving the 4K block to which p2 points. Moving
it would change its address, so any code that has taken a copy of the pointer would
then be broken. In other languages (such as Visual Basic, Java and C#), there are
defragmentation (or “garbage collection”) facilities. This is only possible because
these languages do not support direct pointers, so moving the data has no adverse
effect upon application code. This defragmentation may occur when a memory
allocation fails or there may be a periodic garbage collection process that is run. In
either case, this would severely compromise real time performance and
determinism.

Memory with an RTOS

A real time operating system may provide a service which is effectively a reentrant
form of malloc(). However, it is unlikely that this facility would be deterministic.

Memory management facilities that are compatible with real time requirements –
i.e. they are deterministic – are usually provided. This is most commonly a scheme
which allocates blocks – or “partitions” – of memory under the control of the OS.

Block/partition Memory Allocation

Typically, block memory allocation is performed using a “partition pool”, which is
defined statically or dynamically and configured to contain a specified number of
blocks of a specified fixed size. For Nucleus OS, the API call to define a partition
pool has the following prototype:

STATUS NU_Create_Partition_Pool(NU_PARTITION_POOL *pool, CHAR *name,
    VOID *start_address, UNSIGNED pool_size, UNSIGNED partition_size,
    OPTION suspend_type);

This is most clearly understood by means of an example:

status = NU_Create_Partition_Pool(&MyPool, "any name", (VOID *) 0xB000,
    2000, 40, NU_FIFO);

This creates a partition pool with the descriptor MyPool, containing 2000 bytes of
memory, filled with partitions of size 40 bytes (i.e. there are 50 partitions). The
pool is located at address 0xB000. The pool is configured such that, if a task
attempts to allocate a block, when there are none available, and it requests to be
suspended on the allocation API call, suspended tasks will be woken up in a first-
in, first-out order. The other option would have been task priority order.

Another API call is available to request allocation of a partition. Here is an
example using Nucleus OS:

status = NU_Allocate_Partition(&MyPool, &ptr, NU_SUSPEND);

This requests the allocation of a partition from MyPool. When successful, a pointer
to the allocated block is returned in ptr. If no memory is available, the task is
suspended, because NU_SUSPEND was specified; other options, which may have
been selected, would have been to suspend with a timeout or to simply return with
an error.

When the partition is no longer required, it may be de-allocated thus:

status = NU_Deallocate_Partition(ptr);

If a task of higher priority was suspended pending availability of a partition, it
would now be run. There is no possibility for fragmentation, as only fixed size
blocks are available. The only failure mode is true resource exhaustion, which may
be controlled and contained using task suspend, as shown.

Additional API calls are available which can provide the application code with
information about the status of the partition pool – for example, how many free
partitions are currently available. Care is required in allocating and de-allocating
partitions, as the possibility for the introduction of memory leaks remains.

Memory Leak Detection

The potential for programmer error resulting in a memory leak when using
partition pools is recognized by vendors of real time operating systems. Typically,
a profiler tool is available which assists with the location and rectification of such
bugs.

Real Time Memory Solutions

Having identified a number of problems with dynamic memory behavior in real
time systems, some possible solutions and better approaches can be proposed.

Dynamic Memory

It is possible to use partition memory allocation to implement malloc() in a robust
and deterministic fashion. The idea is to define a series of partition pools with
block sizes in a geometric progression; e.g. 32, 64, 128, 256 bytes. A malloc()
function may be written to deterministically select the correct pool to provide
enough space for a given allocation request. This approach takes advantage of the
deterministic behavior of the partition allocation API call, the robust error handling
(e.g. task suspend) and the immunity from fragmentation offered by block
memory.
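A hedged sketch of the pool-selection idea; pool_alloc() and the pool indices are
hypothetical stand-ins for the RTOS partition-allocation call and the statically
created pools:

#include <stddef.h>

/* Hypothetical: pool i was created with blocks of block_sizes[i] bytes. */
extern void *pool_alloc(int pool_index);

void *rt_malloc(size_t size)
{
    static const size_t block_sizes[] = { 32, 64, 128, 256 };

    /* Deterministic: at most four comparisons to find the right pool. */
    for (int i = 0; i < 4; i++)
    {
        if (size <= block_sizes[i])
            return pool_alloc(i);
    }

    return NULL; /* request larger than the largest block size */
}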

A Simple Memory Manager using Linked List:

In general, a heap manager allows the program to allocate blocks of variable size,
but in this section we will develop a simplified heap manager that handles just
fixed-size blocks. In this example, the block size is specified by SIZE. The
initialization will create a linked list of all the free blocks. A list is a collection of
dissimilar objects, typically implemented in C with a struct. In this case, each list
element is an array where the first element is a pointer and the remaining elements
are the memory to be allocated. A linked list is a collection of lists that are
connected together with pointers, as shown in the sketch below.
FreePt points to a linear linked list of free blocks. Initially these free blocks are
contiguous and in order, but as the manager is used, the positions and order of the
free blocks can vary. It is the pointers that thread the free blocks together.
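A minimal sketch of this fixed-block manager; the payload size, block count, and
function names are illustrative assumptions:

#include <stddef.h>

#define SIZE    80 /* payload bytes per block */
#define NBLOCKS 5  /* blocks in the pool      */

/* Each free block's first field threads it to the next free block. */
typedef struct block
{
    struct block *next;
    char payload[SIZE];
} block_t;

static block_t pool[NBLOCKS];
static block_t *FreePt;

void Heap_Init(void)
{
    for (int i = 0; i < NBLOCKS - 1; i++)
        pool[i].next = &pool[i + 1]; /* chain the blocks in order */
    pool[NBLOCKS - 1].next = NULL;
    FreePt = &pool[0];
}

void *Heap_Alloc(void)
{
    block_t *b = FreePt;
    if (b != NULL)
        FreePt = b->next; /* unlink the first free block */
    return b;             /* NULL means the pool is exhausted */
}

void Heap_Free(void *p)
{
    block_t *b = p;   /* void* converts implicitly in C    */
    b->next = FreePt; /* push the block back onto the list */
    FreePt = b;
}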

Conclusions

C and C++ use memory in various ways, both static and dynamic. Dynamic
memory includes stack and heap.

Dynamic behavior in embedded real time systems is generally a source of concern,
as it tends to be non-deterministic and failure is hard to contain.

Using the facilities provided by most real time operating systems, a dynamic
memory facility may be implemented which is deterministic, immune from
fragmentation and with good error handling.

Queue Application - The Need for Buffering

for (int x = 0; x < 50; x++) {
    while (peripheral_is_busy());
    peripheral_send_byte(data[x]);
}

This loop is simple. For each of the 50 bytes, it first waits until the peripheral isn’t
busy, then tells the peripheral to send it. You can imagine what implementations of
the peripheral_is_busy() and peripheral_send_byte() functions might look like.

While you’re transmitting these 50 bytes, the rest of your program can’t run
because you’re busy in this loop making sure all of the bytes are sent correctly.
What a waste, especially if the data transmission rate is much slower than your
microcontroller! (Typically, that will be the case.) There are so many more
important tasks your microcontroller could be doing in the meantime than sitting in
a loop waiting for a slow transmission to complete. The solution is to buffer the
data and allow it to be sent in the background while the rest of your program does
other things.
How to buffer the data

So how do you buffer the data? You create a buffer that will store data waiting to
be transmitted. If the peripheral is busy, rather than waiting around for it to finish,
you put your data into the buffer. When the peripheral finishes transmitting a byte,
it fires an interrupt. Your interrupt handler takes the next byte from the buffer and
sends it to the peripheral, then immediately returns back to your program. Your
program can then continue to do other things while the peripheral is transmitting.
You will periodically be interrupted to send another byte, but it will be a very short
period of time — all the interrupt handler has to do is grab the next byte waiting to
be transmitted and tell the peripheral to send it off. Then your program can get
back to doing more important stuff.

This is called interrupt-driven I/O, and it’s awesome. The original code I showed
above is called polled I/O.

Implementing Ring Buffer

A really easy way to implement a queue is by creating a ring buffer, also called
a circular buffer or a circular queue. It’s a regular old array, but when you reach the
end of the array, you wrap back around to the beginning. You keep two indexes:
head and tail. The head is updated when an item is inserted into the queue, and it is
the index of the next free location in the ring buffer. The tail is updated when an
item is removed from the queue, and it is the index of the next item available for
reading from the buffer. When the head and tail are the same, the buffer is empty.
As you add things to the buffer, the head index increases. If the head wraps all the
way back around to the point where it’s right behind the tail, the buffer is
considered full and there is no room to add any more items until something is
removed. As items are removed, the tail index increases until it reaches the head
and it’s empty again. The head and tail endlessly follow this circular pattern–the
tail is always trying to catch up with the head–and it will catch up, unless you’re
constantly transmitting new data so quickly that the tail is always busy chasing the
head.

Anyway, we’ve determined that you need three things:

• An array
• A head
• A tail

These will all be accessed by both the main loop and the interrupt handler, so they
should all be declared as volatile. Also, updates to the head and updates to the
tail each need to be an atomic operation, so they should be the native size of your
architecture. For example, if you’re on an 8-bit processor like an AVR, it should be
a uint8_t (which also means the maximum possible size of the queue is 256 items).
On a 16-bit processor it can be a uint16_t, and so on. Let’s assume we’re on an 8-
bit processor here, so ring_pos_t in the code below is defined to be a uint8_t.

#define RING_SIZE 64

typedef uint8_t ring_pos_t;

volatile ring_pos_t ring_head;

volatile ring_pos_t ring_tail;

volatile char ring_data[RING_SIZE];

One final thing before I give you code: it’s a really good idea to use a power of two
for your ring size (16, 32, 64, 128, etc.). The reason for this is because the
wrapping operation (where index 63 wraps back around to index 0, for example) is
much quicker if it’s a power of two. I’ll explain why. Normally a programmer
would use the modulo (%) operator to do the wrapping. For example:

ring_tail = (ring_tail + 1) % 64;

If your tail began at 60 and you repeated this line above multiple times, the tail
would do the following:

61 -> 62 -> 63 -> 0 -> 1 -> …

That works perfectly, but the problem with this approach is that modulo is pretty
slow because it’s a divide operation. Division is a pretty slow operation on
computers. It turns out when you have a power of two, you can do the equivalent
of a modulo by doing a bitwise AND, which is a much quicker operation. It works
because if you take a power of two and subtract one, you get a number which can
be represented in binary as a string of all 1 bits. In the case of a queue of size 64,
bitwise ANDing the head or tail with 63 will always keep the index between 0 and
63.

So you can do the wrap-around like so:

ring_tail = (ring_tail + 1) & 63;
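Putting the pieces together, hypothetical enqueue/dequeue helpers built on the
declarations above might look like this (RING_SIZE must remain a power of two
for the mask to work):

int ring_add(char c)
{
    ring_pos_t next_head = (ring_head + 1) & (RING_SIZE - 1);

    if (next_head == ring_tail)
        return -1;             /* buffer full: head would collide with tail */

    ring_data[ring_head] = c;
    ring_head = next_head;     /* publish the new item last */
    return 0;
}

int ring_remove(char *c)
{
    if (ring_head == ring_tail)
        return -1;             /* buffer empty */

    *c = ring_data[ring_tail];
    ring_tail = (ring_tail + 1) & (RING_SIZE - 1);
    return 0;
}

The main program calls ring_add() to queue a byte, and the transmit interrupt
handler calls ring_remove() to fetch the next byte for the peripheral.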

Producer-Consumer using a FIFO Queue

A FIFO is also used when you ask the computer to print a file. Rather than waiting
for the actual printing to occur character by character, the print command will put
the data in a FIFO. Whenever the printer is free, it will get data from the FIFO. The
advantage of the FIFO is that it allows you to continue to use your computer while
the printing occurs in the background. To implement this magic of background
printing we will need interrupts.

The classic producer/consumer problem has two threads. One thread produces data
and the other consumes data. For an input device, the background thread is the
producer because it generates new data, and the foreground thread is the consumer
because it uses the data up. For an output device, the data flows in the other
direction, so the producer/consumer roles are reversed. It is appropriate to pass data
from the producer thread to the consumer thread using a FIFO queue.
A graphics display uses two buffers called a front buffer and a back buffer. The
graphics hardware uses the front buffer to create the visual image on the display,
i.e., the front buffer contains the data that you see. The software uses the back
buffer to create a new image, i.e., the back buffer contains the data that you see
next. When the new image is ready, and the time is right, the two buffers are
switched (the front becomes the back and the back becomes the front.) In this way,
the user never sees a partially drawn image.

And God is the one whose help is sought.
DI Ahmed TOLBA

References:
1. Tricks of the Windows Game Programming Gurus
2. Design Patterns (Gang of Four)
3. Operating System Concepts
4. Embedded Software Design
5. The C Programming Language
And all the references cited in the works above.
