



Joel Falcou and Jocelyn Sérot
LASMEA, UMR 6602 CNRS/UBP, Clermont-Ferrand, France

ABSTRACT

Microprocessors with SIMD-enhanced instruction sets have been proposed as a solution for delivering higher hardware utilization for the most demanding media-processing applications. But direct exploitation of these SIMD extensions places a significant burden on the programmer, since it generally involves a complex, low-level and awkward programming style. In this paper, we propose a high-level C++ class library dedicated to the efficient exploitation of these SIMD extensions. We show that the weaknesses traditionally associated with library-based approaches can be avoided by resorting to sophisticated template-based metaprogramming techniques at the implementation level. Benchmarks for a prototype implementation of the library, targeting the PowerPC G4 processor and its Altivec extension, are presented. They show that performance close to that obtained with hand-crafted code can be reached from a formulation offering a significant increase in expressivity.

1. INTRODUCTION

Recently, SIMD-enhanced instructions have been proposed as a solution for delivering higher microprocessor hardware utilization. SIMD (Single Instruction, Multiple Data) extensions started appearing in 1994 in HP's MAX2 and Sun's VIS extensions and can now be found in most microprocessors, including Intel's Pentiums (MMX/SSE/SSE2) and Motorola/IBM's PowerPCs (Altivec). They have proved particularly useful for accelerating applications based upon data-intensive, regular computations, such as signal or image processing.
However, to benefit from this acceleration, algorithms have to be restructured so that operations can be applied in parallel to groups of values (vectors), and programs have to be rewritten to make use of the specific SIMD instruction set. This parallelization process can be carried out by the programmer or by a dedicated compiler.
Explicit parallelization (by the programmer) generally offers the best performance. But it is a difficult and error-prone task, since it involves identifying SIMD parallelism within loops, rewriting these loops so that they operate on vectors, explicitly handling vector register allocation and mapping the operations of the algorithm onto the SIMD instruction set. For this reason, this approach cannot reasonably be applied at a large scale and is reserved for simple algorithms and applications.
Dedicated compilers – such as the Intel C compiler for the Pentium SSE – also rely on loop vectorization and are able to automatically extract parallelism from programs written in a conventional sequential language. Although performance may become comparable with that obtained with hand-crafted code for some simple, highly regular program patterns, these compilers have difficulties generating efficient code for more complex applications involving less regular data access and processing patterns. Moreover, these compilers are generally devoted to one specific processor and are not easily retargeted.
An intermediate solution for taking advantage of SIMD extensions is to rely on a class library providing a high-level view of parallel vectors and a related set of operations, and encapsulating all the low-level details related to their actual implementation and manipulation by the SIMD processor extension. With this approach, the programmer is responsible for identifying parallelism in programs, but not for mapping this parallelism onto the actual SIMD mode of operation. For example, he will write code in the style of Fig. 1.

Vector<float> a(4000),b(4000),c(4000);
Vector<float> r(4000);
r = a/b+c;

Figure 1: Vector processing with a high-level API

It is well known, however, that the code generated by such "naive" class libraries is often very inefficient, due to the unwanted copies caused by temporaries. This has led in turn to the development of Active Libraries [1], which both provide domain-specific abstractions and dedicated code optimization mechanisms. This paper describes how this approach can be applied to the specific problem of generating efficient SIMD code from a high-level C++ Application Programming Interface (API).
It is organized as follows. Sect. 2 explains why generating efficient code for vector expressions is not trivial and introduces the concept of template-based metaprogramming. Sect. 3 explains how this can be used to generate optimized SIMD code. Sect. 4 rapidly presents the API of the library we built upon these principles. Sect. 5 gives simple benchmark results, while Sect. 6 describes the implementation of a complete image processing application. Sect. 7 is a brief survey of related work and Sect. 8 concludes.

2. TEMPLATE-BASED META-PROGRAMMING

The problem of generating efficient code for vector expressions can be understood, in essence, by starting with the code sample given in Fig. 1. Ideally, this code should be inlined as:

for (int i=0; i<4000; i++)
  r[i] = a[i] / b[i] + c[i];

In practice, due to the way overloaded operators are handled in C++, it is expanded as:

Vector<float> _t1(4000), _t2(4000);
for (int i=0; i<4000; ++i)
  _t1[i] = a[i] / b[i];
for (int i=0; i<4000; ++i)
  _t2[i] = _t1[i] + c[i];
for (int i=0; i<4000; ++i)
  r[i] = _t2[i];

The allocation of temporary vectors and the redundant loops result in poor efficiency. For more complex expressions, the performance penalty can easily reach one order of magnitude. In this case, it is clear that expressiveness is obtained at a prohibitive cost. This problem can be overcome by using an advanced C++ technique known as expression templates [2, 3]. The basic idea of expression templates is to create parse trees of vector expressions at compile time and to use these parse trees to generate customized loops. This is done in two steps. In the first step, the parse trees – represented as C++ types – are constructed using recursive template instantiations and overloaded versions of the vector operators (/, +, etc.). In the second step, these parse trees are "evaluated" using overloaded versions of the assignment (=) and indexing ([]) operators to compute the left-hand side vector in a single pass, with no temporary. Both steps rely on the definition of two classes:

• a Vector class, for representing application-level vectors; this class has a member data_, where the array elements are stored, and an operator[] for accessing these elements (for simplicity, the description given here only refers to vectors of float values; in practice, a Vector class is defined for every scalar type supported by the SIMD extension – 8-bit, 16-bit and 32-bit integers and 32-bit floats for the Altivec, for example):

float Vector::operator[](int index)
{ return data_[index]; }

• an Xpr class, for representing (encoding) vector expressions:

template<class LHS,class OP,class RHS>
class Xpr {};

Consider, for example, the statement r=a/b+c given in the code sample above. Its right-hand side expression (a/b+c, where a, b and c have type Vector) will be encoded with the following C++ type:

Xpr<Xpr<Vector,div,Vector>,add,Vector>

This type will be automatically built from the expression syntax a/b+c using overloaded versions of the / and + operators:

template<class T> Xpr<T,div,Vector>
operator/(T, Vector)
{ return Xpr<T,div,Vector>(); }
template<class T> Xpr<T,add,Vector>
operator+(T, Vector)
{ return Xpr<T,add,Vector>(); }

The "evaluation" of the encoded vector expression is carried out by an overloaded version of the assignment operator (=):

template<class T> Vector&
Vector::operator=( const T& xpr ) {
  for(int i=0;i<size;i++)
    data_[i] = xpr[i];
  return *this;
}

For this, the Xpr class provides an operator[] method, so that each element of the result vector (data_[i]) gets the value xpr[i]:

template<class LEFT,class OP,class RIGHT>
float Xpr<LEFT,OP,RIGHT>::operator[](int i)
{ return OP::eval(left_[i],right_[i]); }

where left_ and right_ are the members storing the left and right sub-expressions of an Xpr object, and eval is the static method of the C++ functor associated with the vector operation OP. Such a functor is defined for each possible operation. For example, the div and add functors associated with the / and + vector operators are defined as:

class div {
  static float eval(float x,float y)
  { return x/y; }
};
class add {
  static float eval(float x,float y)
  { return x+y; }
};

Using the mechanism described above, a standard C++ compiler can reduce the statement r=a/b+c to the following "optimal" code:

for (int i=0; i<4000; ++i)
  r[i] = a[i]/b[i]+c[i];

3. GENERATING EFFICIENT SIMD CODE

The template-based meta-programming technique described in the previous section can readily be adapted to support the generation of efficient code for processors with SIMD extensions. Suppose that:

• vec_t is the type of SIMD registers, as viewed by the compiler (we assume here that all SIMD operations are accessible through a C API, as for the Altivec extension of PowerPC processors; the technique remains applicable, though less readable, when these operations must be inserted as inline assembly code),
• vec_ld(addr) is the operation for loading a SIMD register with a vector of data located at address addr in memory,
• vec_st(r, addr) is the operation for storing a SIMD register in memory at address addr,
• vec_add(r1,r2), vec_mul(r1,r2), etc. are the SIMD operations for performing addition, multiplication, etc. between SIMD registers.

Then this adaptation only requires:

• the addition of a method for loading a part of a vector into a SIMD register using the vec_ld instruction,
• a modification of the assignment operator, which now uses the vec_st instruction to store the result,
• a modification of the add (resp. mul, etc.) functors, which now use the available vec_add (resp. vec_mul, etc.) instructions:

vec_t Array::load(int index)
{ return vec_ld(data_,index*ELT_SIZE); }

template<class T>
Array& Array::operator=(T xpr) {
  for(int i=0;i<size/4;i++)
    vec_st(xpr.load(i),0,data_);
  return *this;
}

class add {
  static vec_t eval(vec_t x,vec_t y)
  { return vec_add(x,y); }
};
class mul {
  static vec_t eval(vec_t x,vec_t y)
  { return vec_mul(x,y); }
};

With this approach, the compilation of the previous example (r=a/b+c) now gives code equivalent to:

for(int i=0;i<size/4;i++)
  vec_st(vec_add(vec_div(vec_ld(a,16*i),
                         vec_ld(b,16*i)),
                 vec_ld(c,16*i)),
         0,data_);

This code is very similar to a hand-crafted one. It is optimal in the sense that it contains the minimum number of vector load/store operations and computations: three vector loads, one vector division, one vector addition and one vector store.

4. THE EVE LIBRARY

As a proof of concept, we have implemented a high-level C++ vector library for exploiting the Altivec extension of the PowerPC processors. This library, called EVE (for Expressive Velocity Engine), can be used to accelerate programs manipulating 1D and 2D arrays, and provides specific operations dedicated to signal and image processing. A brief survey of the Altivec is given in Sect. 4.1. The library itself is sketched in Sect. 4.2.

4.1. Altivec

AltiVec [4] is an extension designed to enhance the performance of PowerPC processors (PPC 74xx (G4) and PPC 970 (G5)). Its architecture is based on a SIMD processing unit integrated with the PowerPC architecture. It introduces a new set of 128-bit wide registers, distinct from the existing general-purpose or floating-point registers. These registers are accessible through 160 new "vector" instructions that can be freely mixed with other instructions (there are no restrictions on how vector instructions can be intermixed with branch, integer or floating-point instructions, with no context switching nor overhead for doing so). Altivec handles data as 128-bit vectors that can contain sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers or four 32-bit floating-point values. For example, any vector operation performed on a vector
char is in fact performed on sixteen chars simultaneously, and theoretically runs sixteen times faster than the equivalent scalar operation. AltiVec vector functions cover a large spectrum, extending from simple arithmetic functions (additions, subtractions) to boolean evaluation or lookup table solving. Altivec is natively programmed by means of a C API [5].

4.2. EVE API

EVE basically provides two classes, array and matrix, for dealing with 1D and 2D arrays respectively. Its API can be roughly divided into four families:

1. Arithmetic and boolean operators. These are the direct vector extension of their C++ counterparts. For example:

array<char> a(64),b(64),c(64),d(64);
d = (a+b)/c;

2. Boolean predicates. These functions can be used to manipulate boolean vectors and use them as selection masks. For example:

array<char> a(64),b(64),c(64);
c = where(a < b, a, b);
// c[i] = a[i]<b[i] ? a[i] : b[i]

3. Mathematical and STL functions. These functions work like their STL or math.h counterparts. The only difference is that they take an array (or matrix) as a whole argument instead of a couple of iterators. Apart from this difference, EVE functions and operators are very similar to their STL counterparts, which allows algorithms developed with the STL to be ported (and accelerated) with minimum effort on a PowerPC platform with EVE. Example:

array<float> a(64),b(64);
b = tan(a);
float r = inner_product(a, b);
// r = a[0]*b[0]+...+a[63]*b[63]

4. Signal and image processing functions. These functions allow the direct expression (without explicit decomposition into sums and products) of 1D and 2D FIR filters. For example:

array<float> image(64),res(64);
filter<3,horizontal> gaussian = 1,2,1;
res = gaussian(image);

5. BENCHMARKS

Two kinds of performance tests have been performed: basic tests, involving only one vector operation, and more complex tests, in which several vector operations are composed into larger expressions. All tests involved vectors of different types (8-bit integers, 16-bit integers, 32-bit integers and 32-bit floats) but of the same total length (16 Kbytes), in order to reduce the impact of cache effects on the observed performance; i.e. the vector size in elements was 16K for 8-bit integers, 8K for 16-bit integers and 4K for 32-bit integers or floats. The tests have been conducted on a 1.2GHz PowerPC G4 with gcc 3.3.1 (options: -faltivec -ftemplate-depth-128 -O3). A selection of performance results is given in Table 1. For each test, four numbers are given: the maximum theoretical speedup (TM), which depends on the type of the vector elements (16 for 8-bit integers, 8 for 16-bit integers and 4 for 32-bit integers and floats); the measured speedup for a hand-coded version of the test using the Altivec native C API (NC); the measured speedup with a "naive" vector library (NV), which does not use the expression template mechanism described in Sect. 2; and the measured speedup with the EVE library.
It can be observed that, for most of the tests, the speedup obtained with EVE is close to the one obtained with a hand-coded version of the algorithm using the native C API. Tests 1-2 correspond to basic operations, which are mapped directly to a single AltiVec instruction. In this case, the measured speedup is very close to the theoretical maximum. Tests 3-6 correspond to more complex operations, involving several AltiVec instructions. (For test 4, despite the fact that the operands are vectors of 8-bit integers, the computations are actually carried out on vectors of 16-bit integers in order to keep a reasonable precision; the theoretical maximum speedup is therefore 8 instead of 16.) Here again, the observed speed-ups are close to the ones obtained with hand-crafted code. By contrast, the performance of the "naive" class library is disappointing. This clearly demonstrates the effectiveness of the metaprogramming-based optimization.
Table 1: Selected performance results

Test                                  Vector type     TM    NC    NV    EVE
1. v3=v1+v2                           8-bit integer   16    15.7  8.0   15.4
2. v2=tan(v1)                         32-bit float     4     3.6  2.0    3.5
3. v3=inner_prod(v1,v2)               8-bit integer    8     7.8  4.5    7.2
4. 3x1 FIR                            8-bit integer    8     7.9  0.1    7.8
5. 3x1 FIR                            32-bit float     4     3.7  0.1    3.7
6. v5=sqrt(tan(v1+v2)/cos(v3*v4))     32-bit float     4     3.9  0.04   3.9

6. REALISTIC CASE STUDY

In this section, we apply EVE to the parallelization of a complete image processing (IP) application. This application, a classic one in the field, aims at detecting points of interest (POIs) using the Harris filter [6]. It is often used in applications performing 3D video reconstruction, where it provides a set of points for the 3D matching algorithms. Its high demand for computing power – especially when it must be performed in real time on a digital video stream – together with the relatively regular nature of its data access patterns makes it a good candidate for SIMD parallelization. With this experiment, our goals were twofold: first, to show that the presence of a few IP-specific operators can greatly ease the writing of such applications; second, to demonstrate that this increase in expressivity is not obtained at the expense of efficiency, even when many vector computations are composed within the same application.

6.1. Algorithm

Starting from an input image I(x, y), horizontal and vertical gaussian filters are applied to remove noise. Then, for each pixel (x, y), the following matrix is computed:

M(x,y) = | (∂I/∂x)^2        (∂I/∂x)(∂I/∂y) |
         | (∂I/∂x)(∂I/∂y)   (∂I/∂y)^2      |

where ∂I/∂x and ∂I/∂y are respectively the horizontal and vertical gradients of I(x, y). M(x, y) is filtered again with a gaussian filter, and the following quantity is computed:

H(x,y) = Det(M) - k.trace(M)^2,   k ∈ [0.04; 0.06]

H is viewed as a measure of pixel interest. Local maxima of H are then searched in 3x3 windows, and the n first maxima are finally selected. Figure 6.1 shows the result of the detection algorithm on a video frame picturing an outdoor scene.

6.2. Implementation

In this implementation, only the filtering and the pixel detection are vectorized. Sorting an array cannot easily be vectorized with the AltiVec instruction set; it is not worth it anyway, since the time spent in the final sorting and selection process only accounts for a small fraction (around 3%) of the total execution time of the algorithm. The code for computing the M coefficients and the H values is shown in Figure 2. It can be split into three sections:

1. a declarative section, where the needed matrix and filter objects are instantiated. matrix objects are declared as float containers to prevent overflow when filtering is applied to the input image, and to speed up the final computation by removing the need for type casting.

2. a filtering section, where the coefficients of the M matrix are computed. We use EVE filter objects, instantiated for gaussian and gradient filters. filter supports an overloaded * operator that is semantically used as the composition operator between filters.

3. a computing section, where the final value of H(x, y) is computed using the overloaded versions of the arithmetic operators.

6.3. Performance

The performance of this detector implementation has been compared to that of the same algorithm written in C, both using a 320x240 pixel video sequence as input. The tests were run on a 1.2GHz PowerPC G4 and compiled with gcc 3.3.1. As the two steps of the algorithm (filtering and detection) use two different parts of the EVE API, we give the execution times for each step along with the total execution time.
The performance of the first step is very satisfactory (82% of the theoretical maximum speed-up). The speed-up observed for the second step is more disappointing.
// Declarations
#define W 320
#define H 240
matrix<short> I(W,H),a(W,H),b(W,H);
matrix<short> c(W,H),t1(W,H),t2(W,H);
matrix<float> h(W,H);
float k = 0.05f;

filter<3,horizontal> smooth_x = 1,2,1;
filter<3,horizontal> grad_x = -1,0,1;
filter<3,vertical> smooth_y = 1,2,1;
filter<3,vertical> grad_y = -1,0,1;

// Evaluation of the coefficients of M
t1 = grad_x(I);
t2 = grad_y(I);
a = (smooth_x*smooth_y)(t1*t1);
b = (smooth_x*smooth_y)(t2*t2);
c = (smooth_x*smooth_y)(t1*t2);

// Evaluation of H(x,y)
h = (a*b-c*c)-k*(a+b)*(a+b);

Figure 2: The Harris detector, coded with EVE

Step         Execution time    Speed-up
Filtering    3.03 ms           3.32
Evaluation   1.08 ms           1.18
Total        4.11 ms           2.75

The reasons for this have been identified: it is due to multiple redundant vector load and store operations. For the evaluation of H, the current code generator actually issues eight vector load instructions (three for a, three for b and two for c); these redundant loads have a very negative effect on Altivec performance, for which the best strategy is to load all operands at once, then compute, and finally write all results back. We are currently working on fixing this. The global performance of the algorithm, however, remains satisfactory.

7. RELATED WORK

Projects aiming at simplifying the exploitation of the SIMD extensions of modern microprocessors can be divided into two broad categories: compiler-based approaches and library-based approaches.
The SWAR (SIMD Within A Register, [7]) project is an example of the first approach. Its goal is to propose a versatile data-parallel C language making full SIMD-style programming models effective for commodity microprocessors. An experimental compiler (SCC) has been developed that extends C semantics and the C type system, and can target several families of microprocessors. Started in 1998, the project now seems to be in a dormant state.
Another example of the compiler-based approach is given by Kyo et al. in [8]. They describe a compiler for a parallel C dialect (1DC, One-Dimensional C) producing SIMD code for Pentium processors and aimed at the succinct description of parallel image processing algorithms. Benchmark results show that speed-ups in the range of 2 to 7 (compared with code generated by a conventional C compiler) can be obtained for low-level image processing tasks. But the parallelization techniques described in this work – which are derived from those used for programming linear processor arrays – seem to be applicable only to simple image filtering algorithms based upon sweeping a horizontal pixel-updating line row-wise across the image, which restricts their applicability. Moreover, and this can be viewed as a limitation of compiler-based approaches, retargeting another processor may be difficult, since it requires a good understanding of the compiler's internal representations.
The VAST code optimizer [9] has a specific back-end for generating Altivec/PowerPC code. This compiler offers automatic vectorization and parallelization from conventional C source code, automatically replacing loops with inline vector extensions. The speedups obtained with VAST are claimed to be close to those obtained with hand-vectorized code. VAST is a commercial product.
There have been numerous attempts to provide a library-based approach to the exploitation of SIMD features in microprocessors. Apple's VECLIB [10], which provides a set of Altivec-optimized functions for signal processing, is an example. But most of these attempts suffer from the weaknesses described in Sect. 2; namely, they cannot handle complex vector expressions and produce inefficient code when multiple vector operations are involved in the same algorithm. MacSTL [11] is the only work we are aware of that aims at eliminating these weaknesses while keeping the expressivity and portability of a library-based approach. MacSTL is actually very similar to EVE in goals and design principles. This C++ class library provides a fast valarray class optimized for Altivec, and relies on template-based metaprogramming techniques for code optimization. The only difference is that it only provides STL-compliant functions and operators (it can actually be viewed as a specific implementation of the STL for G4/G5 computers), whereas EVE offers additional domain-specific functions for signal and image processing.

8. CONCLUSION

We have shown how a classical technique – template-based metaprogramming – can be applied to the design and implementation of an efficient high-level vector manipulation library aimed at scientific computing on PowerPC platforms. This library offers a significant improvement in terms of expressivity over the native C API traditionally used for taking advantage of the SIMD capabilities of this processor. It allows developers to obtain significant speedups without having to deal with low-level implementation details. A prototype version of the library can be downloaded from the following URL: http://lasmea.univ-…/Personnel/Joel.Falcou/eng/eve
We are currently working on improving the performance obtained with this prototype. This involves, for instance, globally minimizing the number of vector load and store operations, using Altivec-specific cache manipulation instructions more judiciously, or taking advantage of fused operations (e.g. multiply/add). Finally, it can be noted that, although the current version of EVE has been designed for PowerPC processors with Altivec, it could easily be retargeted to Pentium 4 processors with MMX/SSE2, because the code generator itself (using the expression template mechanism) can be made largely independent of the SIMD instruction set.

REFERENCES

[1] T. Veldhuizen and D. Gannon, "Active libraries: Rethinking the roles of compilers and libraries," in Proceedings of the SIAM Workshop on Object Oriented Methods for Inter-operable Scientific and Engineering Computing (OO'98), Philadelphia, PA, USA, 1998. SIAM.

[2] T. Veldhuizen, "Expression templates," C++ Report, vol. 7, no. 5, pp. 26–31, June 1995.

[3] T. Veldhuizen, "Using C++ template metaprograms," C++ Report, vol. 7, no. 4, pp. 36–38, 40, 42–43, May 1995.

[4] I. Ollman, "Altivec velocity engine tutorial."

[5] Apple, "Altivec instructions references, tutorials and presentation."

[6] C. Harris and M. Stephens, "A combined corner and edge detector," in Proc. 4th Alvey Vision Conf., Manchester, Aug. 1988, pp. 189–192.

[7] Purdue University, "The SWAR Home Page," swar.

[8] S. Kyo, S. Okasaki, and I. Kuroda, "An extended C language and a SIMD compiler for efficient implementation of image filters on media extended micro-processors," in Proceedings of Acivs 2003 (Advanced Concepts for Intelligent Vision Systems), Ghent, Belgium, Sept. 2003.

[9] "Altivec," altivec.html.

[10] Apple, "The Apple VecLib Framework."

[11] G. Low, "The MacSTL Library."