Insight

Understand the Machine to Write Efficient Code

How many programmers can actually write assembly programs? With the rising popularity of high-level languages (like Java, VB.Net, etc), there is rarely any need for programmers to learn assembly or low-level programming. However, there are domains where writing efficient code is very important, for example, game programming and scientific computing.
When can we write highly efficient code? It is when we understand how the underlying machine works and make the best use of that knowledge. One well-known way to write highly efficient code is to write it in assembly. There are many disadvantages with this; for example, we cannot port the programs easily, it is difficult to maintain the code, etc. So, we need to look for alternatives for writing efficient code.

We can write in low-level programming languages like C to get code whose efficiency is often comparable to that of equivalent code written in assembly. For this reason, C is often referred to as a high-level assembler. In this article, we'll look at various programming constructs from the perspective of efficiency. We'll consider a general machine architecture for illustration; for specific examples, the x86 architecture will be used. A word of caution before we proceed: the techniques and issues covered here are not for general-purpose programming.
Basic types
The machine, in general, understands only three types of values: addresses, integers and floating-point values. For representation and manipulation, here is the correspondence between the types that the machine can understand and what C supports. Addresses correspond to the pointer construct; integers, both signed and unsigned, correspond to short, int, long, long long, char (yes, a char is represented as an int internally!), etc; floating-point types correspond to float, double, long double, etc.
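For instance, a small program like the one below (not from the original article) lets you check how these C types map onto your machine. The sizes it prints are implementation-defined; the comments note values typical of a 32-bit x86 Linux system.

#include <stdio.h>

int main(void)
{
    // sizes are implementation-defined; the comments show typical 32-bit x86 values
    printf("char:        %lu\n", (unsigned long)sizeof(char));         // 1
    printf("short:       %lu\n", (unsigned long)sizeof(short));        // 2
    printf("int:         %lu\n", (unsigned long)sizeof(int));          // 4 (the machine word)
    printf("long:        %lu\n", (unsigned long)sizeof(long));         // 4
    printf("long long:   %lu\n", (unsigned long)sizeof(long long));    // 8
    printf("void *:      %lu\n", (unsigned long)sizeof(void *));       // 4 (an address)
    printf("float:       %lu\n", (unsigned long)sizeof(float));        // 4
    printf("double:      %lu\n", (unsigned long)sizeof(double));       // 8
    printf("long double: %lu\n", (unsigned long)sizeof(long double));  // 12 (80 bits, padded)
    return 0;
}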
The most efficient data type that a processor handles is a word, which corresponds to the int type in C. For floating-point types, all computation is typically done in a larger floating-point type. For example, on x86 machines, floating-point computation is traditionally done in extended precision, which is 80 bits in size and usually corresponds to long double in C; if floating-point expressions are used in the code, the operands are internally converted to extended precision by the processor and the results are converted back to float (which occupies 32 bits). The processor does floating-point computations in a separate co-processor unit.
Address computation (such
as array index access) is done
using pointers in C. They directly
correspond to memory access
operations in the underlying machine
(such as index addressing, in this
case).
The int type is the most efficient for computations. Unsigned types and operations on them are as efficient as signed types. Floating-point types and operations are slow compared to integral types. For example, on an imaginary processor, if an integer division takes four cycles, a floating-point division might take 50 to 100 cycles. Memory access operations are very slow: if the desired memory location is not in the cache, it might take hundreds of cycles to fetch the data from main memory.
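The cycle counts above are only illustrative. A crude way to get a feel for the difference on your own machine is a micro-benchmark such as the sketch below; it is an assumption-laden illustration rather than a rigorous measurement (the loop count, the use of clock() and the volatile accumulators that stop the compiler optimising the work away are illustrative choices), and the gap it shows will vary with the processor and compiler.

#include <stdio.h>
#include <time.h>

#define N 100000000

int main(void)
{
    volatile int isum = 0;       // volatile keeps the loops from being optimised away
    volatile double dsum = 0.0;
    clock_t t0, t1, t2;
    int i;

    t0 = clock();
    for (i = 1; i <= N; i++)
        isum += 1000 / i;        // integer division
    t1 = clock();
    for (i = 1; i <= N; i++)
        dsum += 1000.0 / i;      // floating-point division
    t2 = clock();

    printf("integer division loop:        %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("floating-point division loop: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}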
Operators
C supports a rich set of operators. Some of them are directly supported by the processor, while a few are simulated in software.

For integral types, the bit-manipulation operators are faster than other operators such as the arithmetic, logical or relational operators. One way to write efficient code is to use bitwise operations in place of other, slower operations. Here is a well-known example: using a shift (x >> 1 or x << 1) is more efficient than dividing or multiplying an integer value by 2. We'll look at a different example for illustration here.
A typical code segment for toggling a character's case uses relational operators, as in:

// precondition: the char ch provided is in the range [a-z] or [A-Z]
char toggle_ascii_char_case(char ch) {
    if( (ch >= 'a') && (ch <= 'z') )        // lower case
        ch = ch - 0x20;
    else if( (ch >= 'A') && (ch <= 'Z') )   // upper case
        ch = ch + 0x20;
    return ch;
}
The code works on the following assumption: the given char ch is within the range [a-z] or [A-Z]. If the char is in [A-Z], it returns the corresponding char in [a-z] and vice versa; that is, it toggles the case of the character. The value 0x20 is added or subtracted because corresponding lowercase and uppercase characters are separated by the hex value 0x20 in the ASCII table.
But this function is slow. If
this is a library function used for
toggling the case of characters
in a string, then it is not a good
implementation.
The code can be improved as follows. The comparison operators are not required, since the function's precondition says that the ch passed to it is in the given range [a-z] or [A-Z]; following C tradition, we need not check that the given char is in fact in this range. Also, since we are either adding or subtracting 0x20, which amounts to changing a single bit, we can replace the arithmetic with a bitwise operation that toggles that bit using the exclusive-OR operator. With this, the code becomes efficient and simple:

// precondition: the char ch provided is in the range [a-z] or [A-Z]
char toggle_ascii_char_case(char ch)
{
    return (ch ^= 0x20);
}
This example is just for illustration purposes. Using bit-wise operators obscures the code, but it usually improves the efficiency of the code significantly.
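As a quick usage sketch (the driver below is illustrative, not part of the original example), here is how such a function might be applied to toggle the case of every alphabetic character in a string:

#include <stdio.h>

// the bitwise version from above, repeated so this sketch compiles on its own
char toggle_ascii_char_case(char ch)
{
    return (ch ^= 0x20);
}

int main(void)
{
    char s[] = "Hello, World";
    int i;

    for (i = 0; s[i] != '\0'; i++) {
        // honour the precondition: only pass characters in [a-z] or [A-Z]
        if ((s[i] >= 'a' && s[i] <= 'z') || (s[i] >= 'A' && s[i] <= 'Z'))
            s[i] = toggle_ascii_char_case(s[i]);
    }
    printf("%s\n", s);   // prints: hELLO, wORLD
    return 0;
}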
Control flow
C has various conditional and looping constructs. A C compiler transforms such constructs into branching (also known as jump) instructions, and the goto statement is the construct that maps to them most directly. In fact, it is possible to take any C program and create an equivalent program that replaces all its conditions and loops with goto statements. Though everything ultimately becomes branching instructions, there are subtle differences between the constructs when it comes to efficiency.
Which one is more efficient: nested if conditions or switch statements? In general, a switch is more efficient than nested if statements. If both are implemented using branching, why is switch more efficient than nested if conditions? Recall that, in a switch statement, all the cases are constants. So, a
compiler can transform a switch statement into a range check or a look-up table, which is more efficient than a long list of jumps.
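For instance, in a sketch like the following (the function is illustrative, not from the article), the case labels are small, dense constants, so the compiler is free to implement the switch as a look-up table indexed by the month rather than as a chain of comparisons:

// month is 1..12; leap years are ignored to keep the example short
int days_in_month(int month)
{
    switch (month) {
    case 4: case 6: case 9: case 11:
        return 30;
    case 2:
        return 28;
    default:
        return 31;
    }
}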
Note that executing jump instructions is not costly in itself, but unpredictable jumps can result in considerably slower execution. For example, mispredicted jumps can result in the flushing of the pipeline. Similarly, processors typically look ahead in the instruction stream, pre-fetch the memory that will soon be needed and put it in the cache; unpredictable jumps defeat this and can result in memory faults (cache misses), wasting hundreds of cycles while the processor waits for the memory value to become available before it can continue execution. In general, a program with fewer branches is faster than one with a large number of branches.
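As a small, hedged sketch of reducing branches (an illustrative example, not from the article; whether it actually wins depends on the compiler and the branch predictor), a data-dependent if inside a hot loop can sometimes be replaced by straight-line arithmetic:

// Count the negative elements of an array.
// The first version branches on every element; the second is branch-free,
// relying on the fact that a relational expression in C evaluates to 0 or 1.
int count_negatives_branchy(const int *a, int n)
{
    int i, count = 0;
    for (i = 0; i < n; i++) {
        if (a[i] < 0)
            count++;
    }
    return count;
}

int count_negatives_branchless(const int *a, int n)
{
    int i, count = 0;
    for (i = 0; i < n; i++)
        count += (a[i] < 0);   // adds 0 or 1 without an if
    return count;
}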
Memory access
As said earlier, memory access is a costly operation. Let us take a specific example to illustrate this. Typically, it takes the same time to access global data as local (stack-allocated) data. However, it is preferable to use local data instead of global data because of the well-known principle of locality. When a memory location is accessed, the processor doesn't fetch the value at just that location; it fetches many adjacent values as well, since the program is likely to access variables located nearby. Fetching a block of data and putting it in the cache is not time consuming, but a memory fault can leave the processor waiting for hundreds of cycles for the access to complete. If there is a large number of global variables and their accesses are spread throughout the program, the program's execution becomes considerably slower.

Let us consider another (well-known) example to illustrate this important principle of locality. Of the following two loops, which one is more efficient?
for(i = 0; i < 50; i++)
    for(j = 0; j < 50; j++)
        for(k = 0; k < 50; k++)
            printf("%d ", a[k][j][i]);

for(i = 0; i < 50; i++)
    for(j = 0; j < 50; j++)
        for(k = 0; k < 50; k++)
            printf("%d ", a[i][j][k]);
C arrays are laid out in row-major order; that is, elements that differ only in the last index are adjacent in memory. The second loop is more efficient because it accesses memory locations sequentially, in that row-major order. The processor fetches memory blocks into the cache, and since the access pattern is sequential, this works well. In the first loop, however, the access pattern is not sequential, so there may be many memory faults, making it considerably slower than the second loop.

From these examples we learn that memory faults are costly, and that we need to minimise them to write efficient code.
Compound types
C supports compound types like structs, unions and bit-fields, which are implemented in terms of other primitive or compound types. The hardware does not understand any compound types; all processing is done on primitive types only. There are many aspects of using compound types, such as padding and alignment, that can affect performance. Here, we'll look at an example of bit-fields to understand how they are supported by the hardware.
In C, we can manipulate and access bits using bit-fields. It is a compile-time error to attempt taking the address of a bit-field member. Why? The granularity of addressing in modern computers is the byte, and it is not possible to take the address of specific bits within a byte. Though it is space-efficient to use bit-fields, it is not time-efficient: since individual bits cannot be accessed directly, the compiler emits code to access whole words and then does bit-manipulation to extract or update the bit-field member values.

The following bit-field struct represents the time of day in HH:MM:SS format. For the hour, the range is 0-23, and for minutes and seconds, the range is 0-59. We can use the following struct:
struct time {
    unsigned int hour   : 5;
    unsigned int minute : 6;
    unsigned int second : 6;
} tm1, tm2;

To access tm1.minute, the compiler has to generate code that reads the whole word (4 bytes) and then does bit-manipulation to extract only the 5th to 10th bits of that word, which is slow. So, avoid using bit-fields extensively if performance is important for your software. A better option in this case is to use a struct (without bit-fields) with a byte each for the hour, minute and second.
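To compare the two layouts, a small check like the following can be used (this sketch and the struct names are illustrative, not from the article; the exact sizes are implementation-defined, though on a typical 32-bit x86 compiler the bit-field version packs into a single 4-byte word while the plain version occupies 3 bytes whose members are cheap to access):

#include <stdio.h>

// space-efficient but slower to access
struct time_bits {
    unsigned int hour   : 5;
    unsigned int minute : 6;
    unsigned int second : 6;
};

// a byte per field: simpler code, faster access
struct time_bytes {
    unsigned char hour;
    unsigned char minute;
    unsigned char second;
};

int main(void)
{
    printf("bit-fields: %lu bytes\n", (unsigned long)sizeof(struct time_bits));
    printf("bytes:      %lu bytes\n", (unsigned long)sizeof(struct time_bytes));
    return 0;
}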
In this article, we explored some fundamental issues to understand how various programming constructs can affect the efficiency of programs. There are many other issues, such as unaligned memory access, the cost of I/O operations, etc, that are not covered here. This article is just a starting point for understanding such problems. If you are interested, you can read books on assembly language, computer architecture and compiler optimisation to get a better understanding of the issues related to writing efficient programs.
By S G Ganesh, a research engineer at Siemens (Corporate Technology), Bangalore. His latest book, 60 Tips on Object Oriented Programming, was published by Tata McGraw-Hill in December 2007. You can reach him at sgganesh@gmail.com.
