MAPPING VISION ALGORITHMS ON SIMD ARCHITECTURE SMART CAMERAS

Chen Wu, Hamid Aghajan
Wireless Sensor Networks Lab
Stanford Univ., Stanford CA 94305
Richard Kleihorst
NXP Research and Philips Research
Eindhoven, the Netherlands
ABSTRACT
SIMD (Single-Instruction Multiple-Data) processors have
demonstrated high performance for vector-based image process-
ing, thereby facilitating real-time vision applications. How-
ever, to fully exploit the advantages of the SIMD architecture,
implementation of a given vision algorithm needs to undergo
a mapping from a general purpose CPU programming style
to a pixel parallel style. This paper describes how part of a
given gesture analysis algorithm is mapped on a smart cam-
era with the SIMD processor to achieve real-time operation.
The pixel parallel nature of the SIMD processing restricts
diversified treatment of pixels. Therefore, in this paper we
show how to modify the algorithm and discuss improvements
in the architecture in order to achieve the intended function-
ality. Mapping of background removal, segmentation, and la-
beling functions is described. We also discuss robustness is-
sues since the mapping to smart cameras aims for practical,
real-time applications.
1. INTRODUCTION
Through offering access to multiple sources of visual data,
distributed vision networks facilitate interpretation of events
in smart environments. In a human-centric application in-
volving interpretation of postures and gestures of the user,
such as in assisted living [1] or multimedia and gaming appli-
cations, having access to interpretations of gesture elements
obtained from visual data over time enables higher-level rea-
soning modules to deduce the user’s actions, context, and be-
havior models, and decide upon suitable actions or responses
to the situation. Recently, methods have been developed for
gesture analysis from multiple collaborating cameras [2].
Under the assumption of networked cameras, tradeoffs
between local processing of images and the costs associated
with communicating data have motivated the development of
advanced processors to handle the early vision processing op-
eration efficiently at the camera node [3, 4]. Through local
processing, functions such as event detection, filtering, at-
tribute extraction, and object-based operations can be performed
locally to achieve real-time operation, requiring the network nodes
to only exchange processed information instead of raw im-
ages. This allows the network to operate efficiently in the
communication domain, where only the event descriptions are
forwarded to a host system for taking appropriate decisions [5].
Advanced video processing applications on smart cam-
eras such as object-based video coding and scene monitoring
require extensive computational power to deal with the in-
creasing complexity of the algorithms and the high pixel rates.
The processing involves real-time robust detection of objects
and their attributes in uncontrolled lighting conditions. Mod-
ern algorithms address this by performing multiple passes over
the same frame at different scales and rotations using large
sets of filters, e.g. Haar filters [6] for faces, and blob filters for
objects [7]. A useful attribute of video processing algorithms
is the large extent of data-level parallelism which computer
architectures can exploit to arrive at the required performance
level. One class of architectures that fits this scenario well is
the single-instruction multiple-data (SIMD) processing par-
adigm [8]. In most multimedia extensions (e.g., MMX and
AltiVec) of general purpose processors and the latest media-
oriented architectures [9, 10], short to medium-length SIMD
processing engines are used. At the extreme end, one finds
massively parallel SIMD (MP-SIMD) machines which try to
exploit as much data-level parallelism as possible, thereby
maximizing the effective performance level [11, 12, 13, 14].
In our smart camera application for human gesture analysis,
we have exploited data-level parallelism by using a power-
efficient, fully-programmable IC called Xetal, which provides
a peak performance of 50 GOPS when operating at 80 MHz
and 1.8 V, while dissipating less than 500 mW. Internally, the
IC has 32 VGA line memories of working data with a band-
width access rate of over 1 Tbit/s.
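As a rough sanity check on these figures (our own back-of-the-envelope estimate, not a number taken from the paper or a datasheet): 320 PEs each completing one operation per clock at 80 MHz give 320 x 80 M = 25.6 GOPS, and counting a single-cycle multiply-accumulate as two operations brings this to about 51 GOPS, consistent with the quoted 50 GOPS peak.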
Mapping a given vision algorithm, after its development in a
computer-based processing environment, onto an embedded
processor with a parallel processing architecture often requires
the algorithm to be revised or even redesigned, both to exploit
the benefits of the hardware structure and to meet the algo-
rithm's implementation requirements. In this paper we describe
how part of a vision algorithm for gesture analysis is mapped
onto the smart camera based on the Xetal chip. The Xetal chip
is programmed in an extended C (XTC) language that exposes
the parallelism of the SIMD processor through a vector data
type representing the 320-element line memories. To fit this
programming model, the mapping calls for certain revisions
in the algorithm, which will be discussed in
the paper. The architecture of the smart camera and the SIMD
processor is described in section 2. This is followed by an in-
troduction to the vision algorithm for human gesture analysis
in section 3. Section 4 discusses the process of mapping
the vision algorithm to the SIMD processor and examines the
necessary modifications in the algorithm to be programmed
on a parallel processor. Section 5 includes a description of
the demonstration setup, and some concluding remarks and
ideas for further investigation are presented in section 6.
2. CAMERA HARDWARE PLATFORM
Real-time video processing on (low-cost and low-power) pro-
grammable platforms is now becoming possible thanks to ad-
vances in integration techniques [3, 11, 15, 16]. It is impor-
tant that these platforms are programmable since new vision
methods and applications emerge frequently. The two types
of programmable processors that we propose to be included
in smart camera architectures are the SIMD (Single Instruc-
tion Multiple Data) massively parallel processor, and (one or
more) general purpose DSPs [17, 8].
The algorithms in the application areas of smart cameras
can be grouped into 3 levels: low-level, intermediate-level and
high-level tasks. Figure 1 and Figure 2 show the task classifi-
cation and the corresponding data entities respectively.
Fig. 1. Algorithm classification with respect to the type of
operations
The low- or early- image processing level is associated
with typical kernel operations like convolutions and
data-dependent operations using a limited neighbourhood of
the current pixels. In this part, often a classification or the
initial steps towards pixel classification are performed. Be-
cause every pixel could be classified in the end as “interest-
ing”, the algorithms per pixel are essentially the same. So, if
Fig. 2. Data entities with processing characteristics and pos-
sible ways to increase performance by exploiting parallelism
more performance is needed in this level of image process-
ing, with up to a billion pixels per second, it is very fruitful to
use this inherent data parallelism by operating on more pix-
els per clock cycle. The processors exploiting this have an
SIMD architecture, where the same instruction is issued on
all data items in parallel [8, 18]. From a power consump-
tion point of view, SIMD processors prove to be economical
[19]. The parallel architecture reduces the number of mem-
ory accesses, clock speed, and instruction decoding, thereby
enabling higher arithmetic performance at lower power con-
sumption [3, 11].
In the high- and intermediate-level part of image process-
ing, decisions are made and forwarded to the user. General
purpose processors are ideal for these tasks because they of-
fer the flexibility to implement complex software tasks and
are often capable of running an operating system and doing
networking applications.
With the earlier stated considerations in mind, the camera con-
sists of four basic components: one or two VGA color im-
age sensors, an SIMD processor for low-level image process-
ing, a general purpose processor for intermediate- and high-
level processing and control, and a communication module.
Both processors are coupled through a dual-port RAM that en-
ables them to work in a shared workspace at their own process-
ing pace (see Figure 3).
2.1. Top-Level Architecture
Figure 4 shows the top-level architecture of Xetal, which is
based on a massively-parallel SIMD processing paradigm. A
linear processing array (LPA) consisting of 320 processing
elements (PEs) and an on-chip memory of 64 lines of 320
Fig. 3. Complete architecture of the wireless camera showing
all processing and hardware blocks
positions handle the compute-intensive data processing. The
data input processor (DIP) and data output processor (DOP)
provide interfaces to 3 video channels, each with 8-bit reso-
lution. A global control processor (GCP) manages the oper-
ation of the IC. The chip is programmable in a sub-set of C
extended with a vector data type.
Fig. 4. Xetal top-level architecture: the global control processor with program (16k x 56b) and data (2k x 16b) memories, the linear processor array (320 PEs), the sequential I/O memory (2 lines x 320 pixels), the frame memory (2048 lines x 320 pixels), the DIP and DOP with IMEM (240 kb) and OMEM/LUT (240 kb), and the Video In, Video Out, I2C, GPI and GPO interfaces.
The adopted MP-SIMD processing paradigm provides high
computational efficiency because of the good match with the
parallelism inherently present in the algorithms and the data.
Most image and video processing algorithms consist of many
small kernels (such as convolution and edge detection) that
operate on all data elements alike. This makes MP-SIMD
processing a natural choice with computational efficiency orig-
inating from the reduced overhead of control operations. The
cost of instruction and address decoding is amortized over the
many processing elements.
The achievable computational power of an MP-SIMD proces-
sor is determined by two primary design parameters: the op-
erating frequency and the number of PEs (the degree of par-
allelism). The choice of the two parameters affects the over-
all chip physical design, the computational efficiency and the
software framework. We have kept the number of PEs at 320
(the same as [11]) since 320 is an integer divisor of the image
line width for most standard video formats, e.g., CIF (320x240),
VGA (640x480) and HDTV (1280x720). Even though Xetal oper-
ates at relatively low frequency, the effective 10-bit multiply-
accumulate (MAC) performance is still high, thanks to the
large number of PEs and the wide datapath.
2.2. Programming Xetal
The Xetal chip is programmed using an extended C (XTC)
language; sample code is shown in Table 1. One of the major
extensions is the introduction of a vector data type (vint) to
represent the 320-element wide memory in the LPA. The re-
maining data are represented in a 10-bit integer or fixed point.
In the example code, the indices [-1,1] are used to access left
and right neighbors, respectively. The optional index [0] rep-
resents the data directly connected to each PE.
Another specific programming feature is the distinction
with respect to conditional statements. To perform a global
conditional, one needs to use the while or if-then-else con-
structs which lead to checking the wired-or of all the 320 LPA
flags by the global controller. On the other hand, a local con-
ditional that needs to be executed in each PE based on its own
flag requires the ternary construct y = (x > a) ? p : q
which translates into a conditional pass (PASSC) assembly
instruction.
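As a concrete illustration, the fragment below (our own sketch in the style of Table 1, not code taken from the Xetal tool chain, assuming a vector line y and a scalar threshold TH as in Table 1) contrasts the two forms:

// global conditional: a test on a vector expression is resolved by the
// GCP through the wired-OR of all 320 LPA flags, so one branch is
// taken for the whole line (here: skip work if no pixel exceeds TH)
if (y > TH) {
    // ... executed once for the entire line
}

// local conditional: each PE selects its own result based on its own
// flag; this compiles to a conditional pass (PASSC) instruction
y = (y > TH) ? 255 : 0;

In practice this means that per-pixel decisions have to be phrased as selections rather than branches, which is one of the main sources of the algorithm revisions discussed in section 4.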
3. ALGORITHM
In our experiment there are three cameras with a person in the
common FOVs. The goal is to reconstruct a 3D model of the
person. The model consists of the following components: 1.
Geometric configuration: body part lengths, angles. 2. Color
or texture of body parts. 3. Motion of body parts.
Opportunistic data fusion is applied aiming for intelligent
and efficient vision interpretations in the camera network. One
underlying constraint of the network is the relatively low band-
width. Therefore for efficient collaboration between cameras,
we expect concise descriptions instead of raw image data as
outputs from local processing in a single camera. This process
inevitably removes certain details in images of a single cam-
Fig. 5. Algorithm flowchart for 3D human skeleton model reconstruction: color segmentation and ellipse fitting in local processing (background subtraction, rough segmentation, EM refinement of the color models, watershed segmentation, ellipse fitting); combination of the three views to obtain the 3D skeleton geometric configuration (generate test configurations, score them together with local processing from the other cameras, update each test configuration using PSO, and repeat until the stop criteria are met); and update and maintenance of the 3D human body model (color/texture, motion) using the previous color distribution and the previous geometric configuration and motion.
Table 1. XTC Example
------------------------------------------------
#include <stdio.xtc>
// variables mapped to GCP data memory
int rows; int MaxRows = 480; float C = 0.333;
int TH = 64;            // edge threshold (value illustrative; not given in the original listing)
// variables mapped to LPA frame memory
vint r, g, b, ny, y, py, nyf, yf, pyf, t1, t2;
// main program
loop {
  vsync();              // frame sync
  rows = 0;
  while(rows < MaxRows)
  {
    hsync();            // line sync
    get_RGB(r, g, b);   // get from input
    rgb2y(r, g, b, ny); // compute intensity
    // 2D 3x3 box filter
    t1  = C*ny + C*y + C*py;
    py  = y;
    y   = ny;
    nyf = C*t1[-1] + C*t1[0] + C*t1[1];
    // horizontal Prewitt edge detection
    t1 = (nyf[-1] - pyf[-1])+
         (nyf[ 0] - pyf[ 0])+
         (nyf[ 1] - pyf[ 1]);
    t1 = (t1 > TH) ? 255 : 0;
    pyf = yf;
    yf  = nyf;
    send_line(t1,t1,t1); // send to output
    rows++;
  }
}
------------------------------------------------
era, which requires the camera to have some “intelligence”
about its observations (smart cameras), i.e., some knowledge of
the subject. This provides one of the motivations for opportunistic
data fusion between cameras (space, time and complementary
features, see [2]), which compensates for partial observations
in individual cameras. So the output from opportunistic data
fusion (a model of the subject) is fed to local processing. On
the other hand, outputs of local processing in single cameras
enable opportunistic data fusion by contributing local descrip-
tions from multiple views. It is the interactive loop that brings
in the potential for achieving both efficient and adequate vi-
sion analysis in the camera network. Therefore, the algorithm
is composed of two main parts, the local processing part in
a single camera and the collaboration part between cameras.
The flowchart is shown in Fig. 5.
Besides serving as an output for gesture interpretations, the 3D
human model also acts as an enabler for camera collabora-
tion. It takes on two roles. First, the 3D human body model
provides a unified interface between the vision network and
application-level reasoning modules for a variety of gesture
interpretations. Second, as a representation of up-to-date in-
formation from both current and historical observations of
all cameras, it creates a feedback path from spatiotemporal
and feature fusion operations to low-level vision processing in
each camera. Instead of being a passive decision output, the
3D model implicitly enables more interaction across the three
fusion dimensions (space, time, and features) by being actively involved in vision analy-
sis. For example, although predefined appearance attributes
are generally not reliable, adaptively learned appearance at-
tributes can be used to identify the person or body parts.
In Fig. 5, local processing in single cameras includes seg-
mentation and ellipse fitting for a concise parametrization of
segments. We assume the 3D model is initialized with a dis-
tinct color distribution for the subject. For each camera, the
color distribution is first refined using the EM algorithm and
then used for segmentation. Undetermined pixels from EM
are assigned labels through watershed segmentation. Exam-
ples for segmentation and ellipse fitting are shown in Fig. 6.
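The paper does not spell out how the ellipse parameters are computed. One common choice, which also suits the line-streaming processing model because it needs only a handful of running sums per label, is to derive them from the segment's first- and second-order moments; the function below is our illustration under that assumption, not the implementation used in [2].

#include <math.h>

// Ellipse parameters of one labeled segment from its running sums:
// N = pixel count, Sx/Sy = sums of x and y, Sxx/Syy/Sxy = sums of
// x*x, y*y and x*y, all accumulated while the lines stream by.
void ellipse_from_moments(double N, double Sx, double Sy,
                          double Sxx, double Syy, double Sxy,
                          double *cx, double *cy, double *theta,
                          double *a, double *b)
{
    *cx = Sx / N;  *cy = Sy / N;                      // centroid
    double cxx = Sxx / N - (*cx) * (*cx);             // central second-order moments
    double cyy = Syy / N - (*cy) * (*cy);
    double cxy = Sxy / N - (*cx) * (*cy);
    *theta = 0.5 * atan2(2.0 * cxy, cxx - cyy);       // orientation of the major axis
    double d = sqrt((cxx - cyy) * (cxx - cyy) + 4.0 * cxy * cxy);
    *a = 2.0 * sqrt((cxx + cyy + d) / 2.0);           // major semi-axis (2-sigma ellipse)
    *b = 2.0 * sqrt((cxx + cyy - d) / 2.0);           // minor semi-axis
}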
For spatial collaboration, ellipses from all cameras are merged
to find the geometric configuration of the 3D skeleton model.
That is, if the optimal 3D skeleton model is projected onto im-
age planes of the cameras, the projections maximally match
ellipses from all the cameras. Candidate configurations are
evolved using PSO. Examples of fitted skeleton are shown in
Fig. 7. Details of the algorithm are presented in [2].
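For reference, a standard particle swarm update (the exact variant used in [2] may differ) moves each candidate configuration x_i with velocity v_i according to v_i <- w*v_i + c1*r1*(p_i - x_i) + c2*r2*(g - x_i) and x_i <- x_i + v_i, where p_i is the best configuration that particle has scored so far, g is the best configuration over all particles, r1 and r2 are uniform random numbers in [0,1], and w, c1, c2 are tuning constants; the score of a configuration is the match between its projections and the ellipses reported by the cameras.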
Fig. 6. Experiment results for local processing in single cam-
eras. (a) original images; (b) segments; (c) fitted ellipses.
Fig. 7. Experiment results for 3D skeleton reconstruction.
Original images from 3 cameras’ views and the skeletons are
shown.
4. MAPPING FROM THE ALGORITHM TO THE
ARCHITECTURE
The goal is to map the gesture estimation algorithm onto Wica.
The whole system consists of four Wicas, one of which is con-
nected to a PC to display the reconstructed skeletons. Local im-
age processing runs on the IC3D of three Wicas, with the param-
eters of the segments as output. These parameters are transmit-
ted from the three Wicas to the fourth one through ZigBee. A
“server” program on the fourth Wica collects the segmentation
parameters from the others and reconstructs the skeleton. Fi-
nally, the skeleton is passed to the PC for demonstration.
The mapping is not a direct conversion from a PC-based
program to embedded programming. Instead, for good per-
formance the mapping is constrained by requirements imposed
by both sides. First, the SIMD architecture of IC3D largely
specifies instructions that can be implemented for local im-
age processing. On one hand, SIMD is very fast for vector-
based instructions and especially fast for image filtering. On
the other hand, it makes single-pixel operations very difficult.
So effort needs to be spent on designing the algorithm to best
utilize the potential provided by SIMD. Second,
resource limitation on a single Wica (e.g. memory) has much
impact on the mapping as well. Third, for real-time commu-
nication between Wicas, the output of a single camera should
not exceed its allocated bandwidth. Finally, the high frame rate
allows for some of the local processing to be simplified. For
example, iterative processes can be carried out on consecutive
frames with one iteration on each frame.
Algorithm mapping in a single camera is explained in the
following sections.
4.1. Background Subtraction
The current assumption is a static background, and Y in YUV
is used for background subtraction. However, this is a much
too constrained experiment environment. In practical situa-
tions, there are both slow and sudden changes in the back-
ground. Slow changes include lighting variations. Sudden
changes include people passing by and movements of background
objects. One aspect of our current work is to develop an al-
gorithm for background adaptations, and apply it to the sys-
tem. We aim to use depth information provided by multi-view
cameras to remove foreground outliers.
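A per-line sketch of the scheme currently assumed, written in the style of Table 1, is given below. It is our own illustration of the description above, not the deployed code: get_Y() and get_bg_line() stand for fetching the current luminance line and the stored background line (kept in the external RAM, see section 5), and the threshold value is arbitrary.

vint y, ybg, diff, fg;                      // current Y line, background Y line, mask
int TH_BG = 20;                             // illustrative threshold

hsync();                                    // line sync
get_Y(y);                                   // assumed helper: Y of the current line
get_bg_line(ybg);                           // assumed helper: background line from RAM
diff = (y > ybg) ? (y - ybg) : (ybg - y);   // per-PE absolute difference
fg   = (diff > TH_BG) ? 255 : 0;            // foreground mask for this line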
4.2. K-means Clustering on IC3D for Segmentation
Due to memory limitation, no image pixels are stored on the
node and we only have access to the current row. Therefore
it is impossible to do many iterations for K-means on a single
image. If we assume the frame rate is high enough such that
subsequent frames don’t differ much, iterations can be done
once for each frame. This assumption is valid for our experi-
ment since the frame rate is 30fps, high enough compared to
the object’s motion. Segmentation is based on color; suppose
there are two dominant colors. The adopted K-means procedure
for finding the color kernels is as follows (a per-line code sketch
in the style of Table 1 is given below).
1. At initialization, two color kernels are given as (u1, v1),
(u2, v2).
2. Every pixel in the foreground is assigned to either class
1 or class 2 based on its distance to the kernels.
3. Update kernels (u1, v1), (u2, v2). First, it is very important
to update (u1, v1), (u2, v2) using only “reliable pixels”: pixels
close to the kernels are used, while boundary pixels are discarded
when updating the kernels, although they are still classified and
displayed. This procedure proves to be important since we do not
have a goodness measure of a pixel belonging to a certain class.
Nevertheless, this simple method works well, since the kernels
quickly converge as the processing goes through frame by frame;
the per-frame processing therefore remains simple and fast. Second,
since the precision of (u1, v1), (u2, v2) is low (integers, not floating
point), in each iteration we only increase or decrease the components
of (u1, v1), (u2, v2) by 1, based on a cheap count of how many pixels
are greater or lower than the kernels in their u and v values.

Fig. 8. Effects of different scanning directions on the labeling of a segment. (The same segment is labeled under two scanning directions, from top-left and from bottom-right: in (a) the segment is split into labels 1 and 2, while in (b) it receives a single label 1.)
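A minimal per-line sketch of this scheme, in the XTC style of Table 1, is shown below; it is our illustration, not the deployed program. The pixel-parallel part only classifies pixels and gathers votes, while the kernel update by +/-1 is scalar work done once per frame on the GCP. count_votes() stands for whatever reduction the real code uses to count set flags across the 320 PEs, and TH_REL is an arbitrary “reliability” threshold.

vint u, v, fg, du, dv, d1, d2, cls, rel;
int u1, v1, u2, v2;                       // the two color kernels (scalars on the GCP)
int above_u1, below_u1, TH_REL;           // vote counters for the u component of kernel 1

// per line: assign each foreground pixel to the nearer kernel (L1 distance in (u,v))
du  = (u > u1) ? (u - u1) : (u1 - u);
dv  = (v > v1) ? (v - v1) : (v1 - v);
d1  = du + dv;                            // distance to kernel 1
du  = (u > u2) ? (u - u2) : (u2 - u);
dv  = (v > v2) ? (v - v2) : (v2 - v);
d2  = du + dv;                            // distance to kernel 2
cls = (d1 < d2) ? 1 : 2;
cls = (fg > 0) ? cls : 0;                 // background pixels stay unlabeled
rel = (d1 < d2) ? d1 : d2;                // distance to the chosen kernel
rel = (rel < TH_REL) ? cls : 0;           // only "reliable" pixels may vote

// votes for nudging the u component of kernel 1 up or down
above_u1 += count_votes((rel == 1) ? ((u > u1) ? 1 : 0) : 0);
below_u1 += count_votes((rel == 1) ? ((u < u1) ? 1 : 0) : 0);

// once per frame, on the GCP: move each kernel component by at most 1
if (above_u1 > below_u1) u1 = u1 + 1;
if (below_u1 > above_u1) u1 = u1 - 1;
// ... likewise for v1 and for (u2, v2)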
4.3. Labeling for Segments
Since we do not want to buffer the streamed lines, relabeling
due to newly discovered connectivity is impossible. Therefore,
for every pixel p in the new line, we examine the neighboring
labels of the previous line. If p is adjacent to two labels (e.g.,
1 and 2), it picks up the smaller one (e.g., 1). If p has no
adjacent predecessor, a new label is declared. The resulting
labeling for one segment can look like Fig. 8.
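Written out as plain scalar C over two line buffers, purely to illustrate the rule rather than the pixel-parallel Xetal implementation, one labeling pass looks roughly as follows (letting a pixel also inherit the label of its left neighbour, so that a horizontal run stays connected, is our addition; the text above only mentions the previous line):

#define WIDTH 320

// One labeling pass over a new line, given the labels of the previous line.
// fg[] marks foreground pixels, prev[] holds the previous line's labels,
// cur[] receives the new line's labels, *next_label is a running counter.
void label_line(const int fg[WIDTH], const int prev[WIDTH],
                int cur[WIDTH], int *next_label)
{
    for (int x = 0; x < WIDTH; x++) {
        if (!fg[x]) { cur[x] = 0; continue; }          // background: no label
        int best = 0;
        if (x > 0 && cur[x - 1] > 0)                   // left neighbour (our addition)
            best = cur[x - 1];
        for (int dx = -1; dx <= 1; dx++) {             // three neighbours in the previous line
            int nx = x + dx;
            if (nx < 0 || nx >= WIDTH) continue;
            if (prev[nx] > 0 && (best == 0 || prev[nx] < best))
                best = prev[nx];                       // adjacent to two labels: keep the smaller
        }
        cur[x] = (best > 0) ? best : (*next_label)++;  // no adjacent predecessor: new label
    }
}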
Note that the over-segmentation in Fig. 8(a) may actually help:
for ellipse fitting we want to split the shape in Fig. 8(a) anyway,
and the scanning sequence of the hardware may achieve that
automatically. Another interesting observation is that by changing
the scanning direction (bottom-to-top or top-to-bottom), we may
discover different dent directions in the shape (compare Fig. 8(a)
and (b)).
5. DEMONSTRATION
The algorithm runs on the camera at a speed of 30 frames
per second. The image size is VGA (640 by 480 pixels).
From the sensor, 8-bit YUV data in 4:2:2 format is provided.
The sensor also takes care of demosaicing, exposure time con-
trol, and white balancing. Although in the final application
data will be shared among cameras, in the current set-up an
LCD screen is connected to the camera to see the intermediate
video results. The display is also of VGA format, but accepts
RGB data. The Xetal IC in this camera runs at 80MHz and
the 320 processing elements do single clock cycle arithmetic
at 10-bit word size. Background subtraction is performed us-
ing the external RAM as the data pool. We communicate the
number of objects found, their positions and statistics to the
8051 host processor.
The program as described in earlier sections has been mapped
to the camera in its programming language XTC. The pro-
gram source size is 24 KBytes and contains almost 800 lines
of code. The line count is low compared to similar algorithms
on other processors because each instruction in this program
invokes 320 parallel instructions. After compilation the total
program size is 1953 24-bit instructions.
During processing the processor load is between 6.9 and
39 GOPS, depending on the image activity. The internal
bandwidth averages 62 Gbit/s. The power consumption of the
Xetal processor is between 50 and 250 mW, again depending
on the image activity. The power consumption of the remainder
of the camera system (excluding the LCD display) is around
100 mW in this application, which brings the continuous power
consumption to a level between 150 and 350 mW for this
application.
During the mapping we only started to reach the limits of
the system near the end. Then, the limited size of internal
coefficient memories started to limit the number of regions
that we can track per frame. This could be solved to some
extent by optimizing the code and using other memories for
storage. Also, we have taken these issues into account during
the design of a successor IC, where significantly more line
memories and more coefficient storage will be available.
Figure 9 shows the setup of the system; a toy duck with
three major object parts (yellow, orange and black) was used.
It can be seen in the background of the image, while the camera
mote is visible in the foreground. Some images were also
taken of the system in action at different moments in time.
Figure 10 shows an intermediate result of the k-means sup-
ported segmentation. Each color indicates a certain cluster of
pixels. Finally, Figure 11 shows the result of tracking, where
only three objects remain in the pictures, indicated by dif-
ferent values of green related to the label value. Small blue
markers in the image indicate the tracked centers of the parts
of the toy duck. The results are completely stable to move-
ments of the duck, but dependent on lighting changes and
shadows in the scene because of the simple background model
that we used.
6. DISCUSSION
The demand for a real-time implementation requires a bal-
ance between algorithm complexity and accuracy. One way is
to reduce complexity for each single frame while improving
accuracy by accumulated computation through a number of
sequential frames. Given a high frame rate, this accumulation
can be fast in time.
Generally for a vision application a critical question is
what kind of features to use to solve the problem. Either
Fig. 9. Picture of our setup.
Fig. 10. Snapshot of the LCD screen showing the segmenta-
tion results.
Fig. 11. Snapshot of the LCD screen with label tracking.
the chosen feature is invariant, or we are able to predict its
change under different situations. In our experiment now we
mainly rely on color to differentiate the foreground from the
background as well as to analyze different parts of the ob-
ject. This works for short periods of time during which light-
ing conditions remain mostly the same. However, colors are
highly sensitive to a number of factors. When lighting is too
strong or too weak, illumination will be very high or low and
chrominance becomes hard to discern. White balancing of
the cameras often causes sudden changes for all colors in the
image, which may well result in losing track of objects in the
scene.
In an unconstrained environment, there are additional chal-
lenges to operate a robust and real-time system. One is the
need to handle different situations, such as variations in the
appearance of objects of interest or the background from the
desired or assumed settings. The other challenge is the abil-
ity to recover from failures. Issues related to timely detection
of a failure and quick re-initialization of the parameters after
a failure are worthy of further examination. The mapping of
an algorithm on the smart camera in the current investigation
showed clearly to us the necessity of a robust algorithm to
cope with even partly constrained environments. This urges
us to re-address for instance the background subtraction ap-
proach to allow practical deployments in which the assump-
tion of a stationary background may not always be valid.
The performance level and processing bandwidth required
for even an application like the one under study here showed
us the need for high-performance hardware in order to run
the operation in real-time. We were approaching the limits
of the system, so additional hardware capabilities need to be
provisioned for progressing towards applications with more
complex processing needs.
7. REFERENCES
[1] H. Aghajan, J. Augusto, C. Wu, P. McCullagh, and
J. Walkden, “Distributed vision-based accident manage-
ment for assisted living,” in ICOST 2007, Nara, Japan.
[2] Chen Wu and Hamid Aghajan, “Model-based human
posture estimation for gesture analysis in an opportunis-
tic fusion smart camera network,” in AVSS 2007 ( to
appear ), London, UK.
[3] A.A. Abbo and R.P. Kleihorst, “A programmable smart-
camera architecture,” in ACIVS2002, Gent, Belgium,
Sept 2002.
[4] Richard Kleihorst, Ben Schueler, Anteneh Abbo, and
Vishal Choudhary, “Design challenges for power con-
sumption in mobile smart cameras,” in COGIS2006,
Paris, France, Mar. 2006.
[5] Richard Kleihorst, Ben Schueler, and Alexander
Danilin, “Architecture and applications of wireless
smart cameras (networks),” in Proceedings ICASSP07,
2007.
[6] M. Jones and P. Viola, “Fast multi-view face detection,”
in Proceedings IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2003.
[7] D.G. Lowe, “Distinctive image features from scale-
invariant keypoints,” International Journal of Computer
Vision, vol. 60, no. 2, 2004.
[8] P. Jonker, “Why linear arrays are better image proces-
sors,” in Proc. 12th IAPR Conf. on Pattern Recognition,
Jerusalem, Israel, 1994, pp. 334–338.
[9] B. Khailany et al., “Imagine: Media processing with
streams,” IEEE MICRO, pp. 35 – 46, Apr 2001.
[10] “Cell,” http://researchweb.watson.ibm.com.
[11] R.P. Kleihorst, A.A. Abbo, A. van der Avoird,
M.J.R. Op de Beeck, L. Sevat, P. Wielage, R. van
Veen, and H. van Herten, “Xetal: A low-power high-
performance smart camera processor,” in ISCAS 2001,
Sydney, Australia, May 2001.
[12] S. Kyo, T. Koga, S. Okazaki, and I. Kuroda, “An in-
tegrated memory array processor for embedded image
recognition systems,” in Proc. of ISCA, June 2005, pp.
134 – 145.
[13] M. Nakajima et al., “A 40GOPS 250mW massively par-
allel processor based on matrix architecture,” in ISSCC
Dig. of Tech. Papers 2006, Feb 2006, pp. 410 – 411.
[14] A. Abbo, R. Kleihorst, V. Choudhary, L. Sevat,
P. Wielage, S. Mouy, and M. Heijligers, “Xetal-II:
A 107GOPS, 600mW massively-parallel processor for
video scene analysis,” in ISSCC2007 Digest of techni-
cal papers, San Francisco, CA, USA, 2007.
[15] J.C. Gealow and C.G. Sodini, “A pixel-parallel image
processor using logic pitch-matched to dynamic mem-
ory,” IEEE Journal of Solid-State Circuits, vol. 34, June
1999.
[16] H. Yamashita and C. Sodini, “A 128 × 128 CMOS im-
ager with 4 × 128 bit-serial column-parallel PE array,”
in ISSCC2001 Digest of technical papers, 2001.
[17] P.P. Jonker, Morphological Image Processing: Archi-
tecture and VLSI design, Kluwer, 1992.
[18] D. W. Hammerstrom and D. P. Lulich, “Image process-
ing using one-dimensional processor arrays,” Proceedings
of the IEEE, vol. 84, no. 7, pp. 1005–1018, July 1996.
[19] R. Kleihorst et al., “An SIMD smart camera architec-
ture for real-time face recognition,” in Abstracts of the
SAFE & ProRISC/IEEE Workshops on Semiconductors,
Circuits and Systems and Signal Processing, Veldhoven,
The Netherlands, Nov 26–27, 2003.