
VIGILANTE: A machine and algorithms for neural rich pixel image recognition at video-rate and faster


Steve Suddarth, Ballistic Missile Defense Organization, Washington, DC
Anil Thakoor, Curtis Padgett, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA

Introduction
In spite of recent gains, image recognition still suffers from disappointing recognition performance and low processing speed. Generally, system designers face a dilemma: choose algorithms optimized for limited computational architectures, or algorithms that might perform better but cannot be effectively tested with the computer resources readily available today. Many ATR designers would prefer to develop recognition algorithms based upon various eye-brain theories, but these present special hardware challenges, perhaps requiring a new architecture. The VIGILANTE neural processor was designed as an experimental processor for such ATR research, running a wide variety of algorithms at video rates. Its architecture is based upon a rich-pixel processing paradigm.

Processing with rich pixels


One way of viewing the image recognition process is to break it down into four stages analogous to biological vision: collection of imagery, generation of synthetic imagery, image fusion, and semantic interpretation. Figure 1 shows the authors' (highly simplified) model of the eye-brain system. Although biological vision does not have such cleanly defined boundaries as do artificial systems, it could be argued that the brain uses synthetic imagery to analyze scenes by comparing corresponding pixels among the various imagery types. This is essentially a rich pixel concept, where the brain becomes a data fusion machine at a pixel level before analyzing the scene in a semantic way. Enriching pixels can be seen as a way of improving the evidence used in properly classifying each pixel of the image. This paper will examine approaches for improving our rich pixel processing ability.

Figure 1: Notional eye-brain process flow

Artificial systems likewise enrich pixels. Examples include hyperspectral/multispectral imaging, motion imaging, edge detection, texture analysis, and spatial filtering.

Why thinking of ATR as a rich pixel problem is a good idea


Rich pixels are a useful organizing principle for mating algorithms to hardware, as this paper will later show.
Essentially, rich pixel processing consists of:
1. Sensing raw imagery
2. Generating synthetic imagery
3. Fusing the rich pixels from steps 1 and 2 above
4. Interpreting the resultant fused images
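To make the four steps concrete, here is a minimal NumPy sketch of the pipeline, assuming SciPy for the synthetic layers; the particular layers chosen (an edge map and a smoothed copy) and the toy per-pixel decision are our illustrations, not part of VIGILANTE itself.

    import numpy as np
    from scipy import ndimage

    def rich_pixel_stack(raw):
        """Stack raw imagery with synthetic layers so that each pixel
        becomes a feature vector (a 'rich pixel')."""
        raw = raw.astype(float)
        edges = ndimage.sobel(raw)                      # step 2: synthetic imagery
        smooth = ndimage.gaussian_filter(raw, sigma=2)  # step 2: another layer
        # Step 3: fuse at the pixel level; here, simple channel stacking.
        return np.stack([raw, edges, smooth], axis=-1)

    raw = np.random.randint(0, 256, (64, 64))           # step 1: sensed imagery
    pixels = rich_pixel_stack(raw)                      # shape (64, 64, 3)
    # Step 4 (semantic interpretation) would classify each pixel's vector;
    # a toy stand-in: flag pixels with strong edge evidence.
    flagged = pixels[..., 1] > pixels[..., 1].mean()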

The importance of making this distinction is that the structure of these four tasks helps simplify hardware. Many image processing systems suffer from trying to perform steps 2-4 on the same processing architecture, even though the processes themselves have very different structures. To speed up recognition, designers often resort to special-purpose circuits that implement a particular algorithm quickly, but which lack generality for other types of problems. VIGILANTE's philosophy, however, is to map the above functions to a relatively small set of special-purpose hardware that, when properly configured, can implement a wide variety of algorithms. For example, the regularity of synthetic image generation tasks (spatial filtering, motion, correspondence) justifies special-purpose hardware. Pixel-level fusion, although less structured, can be performed on regular parallel architectures such as SIMD arrays. Semantic analysis involves a wide variety of algorithmic approaches; however, it seldom presents a significant computational bottleneck compared to the other functions, so it can be handled with general-purpose hardware. This hardware mapping concept is illustrated in Figure 2.
Figure 2: Three broad classes of ATR primitives. The figure pairs each type of process with examples and its optimal hardware:
- N^4 operations (examples: convolution, gray-scale morphology, rotation/scale-invariant patterns, illustrated with kernels K_0 through K_6 and operations such as C_k = Convolve(ap, K_n, 1, 1) and C_k = Normalize(C_k, 0, 255)): some special-purpose circuits.
- N^2 point operations (examples: summing, thresholding, masking): various massively parallel architectures.
- Other operations (examples: connected components, local histograms, semantic interpretation): massively parallel architectures (in some cases) or serial processors, depending upon the algorithm.

The ANTE Processor


The processor portion of VIGILANTE, referred to as the Artificial Neural Three-D Experiment (ANTE), has an architecture optimized for the processing shown in Figure 3, bringing the following features to bear in an inexpensive laboratory setting:
- High-speed, large-format convolutions (for N^4 operations) operating at about 2 x 10^12 OPS
- A SIMD point operations processor (for N^2 operations) operating at about 10 nominal GigaOPS
- A high-bandwidth link between the convolution processor and the point operations processor
- A general-purpose computer with adequate (PCI) bussing to the remainder of the system for program support

ANTE is capable of performing sophisticated feature-based, context-sensitive image recognition at video frame rates. ANTE takes advantage of a general ATR process flow, depicted in Figure 3:
1. The 3DANN-M network produces 64 simultaneous convolutions with 64x64 masks. This corresponds to the N^4 operations shown above.
2. The 64 analog values generated by 3DANN-M are converted to 8-bit digital values and passed along to the associated feedback memory and Point Operation Processor (POP).
3. POP takes the output from the 3DANN-M and performs those target recognition/tracking functions that can be performed at a pixel level. This corresponds to the N^2 operations above.
4. Command and control of VIGILANTE operations (e.g., detection/classification/tracking mode command, loading of templates, point operation functions, data recording, etc.) are done through the P6 motherboard (shown as the processor/memory block in Figure 3). This corresponds to the other processes discussed above.

Figure 3: The VIGILANTE


processing architecture that
orchestrates the data flow from
sensor through neural processor
also serves as the basis for
developing methodologies for
ATR applications.


Figure 4: The 3DANN-M has 64 layers of 64x64 synapse arrays based on an 8-bit MDAC. It uses a special-purpose image-write device called CLIC that is bump-bonded to the synapse array stack. 3DANN-M is a 10-g, 3-cm^3 package with a power consumption of 4.5 W. CLIC rasters a 64x64 window of a larger image from the frame buffer and is synchronized with 3DANN-M's 250-ns inner-product operations.

Convolution is performed via Custom ICs and Stacks


The heart of the ANTE processor is the 3DANN-M neural "sugarcube" chip stack (Figure 4), which performs the high-speed convolutions [1,2,3]. 3DANN uses a neural circuit design based on Multiplying Digital-to-Analog Converter (MDAC) technology; each circuit has 8-bit-resolution digital weight storage and an analog multiplier with a voltage-input/current-output configuration. In 3DANN-M, 64 complete neural inner products, each with a 4096-element (i.e., 64x64) input array, can be accomplished in 250 nanoseconds (i.e., 10^12 multiply-and-add operations per second). The 3DANN-M circuitry has a low power consumption of approximately 4.5 watts.
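Functionally (ignoring the analog circuit details), one 3DANN-M cycle amounts to a 64x4096 matrix-vector product. The NumPy sketch below emulates a single cycle and checks the throughput arithmetic implied by the 250-ns cycle time; the random templates and window are placeholders.

    import numpy as np

    # 64 templates of 64x64 weights (4096 weights each), as in 3DANN-M.
    templates = np.random.randint(-128, 128, (64, 64, 64))
    window = np.random.randint(0, 256, (64, 64))

    # One 3DANN-M cycle: 64 simultaneous 4096-element inner products.
    outputs = templates.reshape(64, -1) @ window.ravel()   # shape (64,)

    # Throughput implied by the 250-ns cycle time:
    macs_per_cycle = 64 * 4096              # 262,144 multiply-accumulates
    print(macs_per_cycle / 250e-9)          # ~1.05e12 MACs/s, i.e. ~10^12
    print(2 * macs_per_cycle / 250e-9)      # ~2.1e12 OPS, counting multiply and add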

Fat Pipe Communications is Provided by Custom Circuitry


Although current computer busses (such as PCI) can easily handle video-rate data (about 2 Mbytes/sec), the output of the 3DANN-M cube is 64 images. The data between the convolution circuit (3DANN-M) and POP is therefore equivalent to as many as 64 video streams, or about 128 Mbytes/sec. To allow high-speed transfer of this output, data is staged into a memory and then sent over 16 custom I/O circuits.

Point Operations Processing


Currently, the feedback memory and POP are implemented in VIGILANTE with a commercial product, four of Adaptive Solutions' CNAPS array processor boards (each board containing 128 SIMD processors and 32 megabytes of memory), providing flexibility in programming different point processing operations [4]. In later stages of the project, a custom VLSI implementation of POP may be designed and fabricated. POP takes the output from the 3DANN-M and performs the desired target recognition/tracking functions.
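Since point operations are independent per pixel, they vectorize trivially. The following sketch illustrates a few representative POP functions (summing, thresholding, and masking across the 64 convolution planes) in NumPy; the specific functions and threshold are our choice of example, not the CNAPS microcode.

    import numpy as np

    def pop_point_ops(planes, threshold=128):
        """Representative point operations on the 64 planes delivered by
        3DANN-M. Every step is independent per pixel, which is why the
        work maps naturally onto a SIMD array such as CNAPS."""
        evidence = planes.sum(axis=0)           # summing across planes
        mask = evidence > threshold             # thresholding
        return planes * mask                    # masking each plane

    planes = np.random.randint(0, 256, (64, 128, 128))  # 64 feature planes
    masked = pop_point_ops(planes)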

Recognition with Rich Pixels (some examples)


Recognition from convolution alone is not satisfactory
Rich pixel processing is shown in the following example of recognizing facial features in test images based upon
sub-images from a reference image. The goal is shown in Figure 5.
Figure 5: Facial feature recognition task. A reference image and a sample test image are shown, with the eyes (left, right), ears (left, right), nose, mouth, and chin marked.

A simple way to perform this task is to take convolutions between the base image and feature detectors (kernels) such as the ones shown in Figure 6. The next step is to develop a zero-mean version of each kernel, i.e., normalizing the kernel such that $\sum_{i,j} n_{i,j} = 0$. Thus, the convolution operation becomes analogous to correlation. Where the kernel is a good match in the tested image, the output (convolved) image is bright, so one might be tempted to use thresholding of these convolution outputs as a simple way of detecting features. Unfortunately, the performance of this approach for images other than the reference image is generally disappointing. Figure 6 shows how the reference image produces several false alarms after the threshold is set sufficiently low to detect all features. The same figure also shows how performance is far worse using the test image. In fact, the simple method of convolution followed by thresholding generally fails whenever the system looks for features in a new image.
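A minimal sketch of this baseline, assuming SciPy's correlate2d for the 2-D operation and an arbitrary global threshold:

    import numpy as np
    from scipy.signal import correlate2d

    def detect_feature(image, kernel, threshold):
        """Zero-mean the kernel (the sum of n_ij becomes 0), so that the
        convolution behaves like a correlation, then apply a global
        threshold to the response image."""
        k = kernel.astype(float) - kernel.mean()
        response = correlate2d(image.astype(float), k, mode="same")
        return response > threshold     # bright response = candidate feature

    # On the reference image, a threshold low enough to catch every feature
    # also admits false alarms; on a new test image it performs far worse.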
Figure 6: Convolution filters K_0 through K_6 (mouth, right eye, left eye, nose, right ear, left ear, chin) and the features detected based upon those filters in the reference and test images.

A Simple Rich Pixel Algorithm using Linear Methods


Recognition can be enhanced by using convolution as a first step in recognizing feature candidates that can be
pared down to a final set of features. A simple linear algorithm is shown here to illustrate how even complex
functions such as elastic graph matching can be performed using simple
(and parallelizable) point processes.

Figure 7: General facial geometry used by the algorithm
In the faces example, a large number of false alarms can be eliminated by considering that faces generally have the
geometry shown in Figure 7. Thus, the algorithm follows up convolution with a comparison between all
convolution results in order to determine whether a face can be plausibly interpreted within the scene. The strategy
in this phase is to try to locate the centroid of a face through collective activation based upon the output of the
feature detector convolutions.
The algorithm works by shifting blurred copies of the convolution outputs such that they align in the center of the
face. From there, a face can be detected from the various spatial spectra in the imagery. Once detected, a face
centroid is then shifted back out to the original feature locations and used as a mask on the original convolution
outputs. This approach can be performed entirely with point processes. The process is shown below in Figure 8.
Figure 8: A simple rich pixel algorithm for matching features to faces. Each convolution output C_n (n = 0..6, for the mouth, right eye, left eye, nose, right ear, left ear, and chin) is processed as follows, where (I_n, J_n) is the expected offset of feature n from the face centroid in Figure 7:

1. Blur: $B_n = C_n * k_{\mathrm{disk}}$
2. Shift: $S_n(i, j) = B_n(i - I_n,\ j - J_n)$
3a, 3b, 3c. Sum and clean up: $T = \sum_n S_n$; $M(i, j) = \tanh^{-1}\big[a\,(T(i, j) - t)\big]$; $Z = M * k_{\mathrm{gaussian}}$
4. Shift centroid out for masking: $E_n(i, j) = Z(i + I_n,\ j + J_n)$
5. Mask: $H_n(i, j) = C_n(i, j)\, E_n(i, j)$
6. Threshold: $F_n(i, j) = \Theta\big(H_n(i, j) - t\big)$

The figure's panels show the intermediate images E_0 through E_6 and H_0 through H_6.


Figure 9: Comparison of feature detection from convolution alone to feature detection from convolution plus facial geometry


Note: Steps 1-3a could be performed with a single, large convolution template. The steps were kept distinct for
clarity. Real-time implementations of this algorithm on the ANTE hardware will use nonlinear approaches to
perform pixel fusion. In that case, the steps may not be combined.
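For reference, the following NumPy/SciPy transcription of Figure 8 keeps the six steps distinct, as the figure does. The kernel sizes, the gain a, the threshold t, and the clipping that keeps the inverse tanh inside its domain are our placeholder choices, not values from the paper.

    import numpy as np
    from scipy import ndimage

    def face_geometry_filter(C, offsets, a=0.05, t=4.0):
        """C: the 7 convolution outputs C_0..C_6; offsets: the expected
        (I_n, J_n) offset of each feature from the face centroid
        (the Figure 7 geometry)."""
        disk = np.ones((9, 9)) / 81.0                               # step 1 kernel
        B = [ndimage.convolve(c, disk) for c in C]                  # 1. blur
        S = [np.roll(b, (-i, -j), axis=(0, 1))                      # 2. shift toward
             for b, (i, j) in zip(B, offsets)]                      #    the centroid
        T = np.sum(S, axis=0)                                       # 3a. sum
        M = np.arctanh(np.clip(a * (T - t), -0.99, 0.99))           # 3b. clean up
        Z = ndimage.gaussian_filter(M, sigma=3.0)                   # 3c. smooth
        E = [np.roll(Z, (i, j), axis=(0, 1)) for i, j in offsets]   # 4. shift back out
        H = [c * e for c, e in zip(C, E)]                           # 5. mask
        return [h > t for h in H]                                   # 6. threshold

Every step here is either a small convolution or a point process, which is exactly the split that maps onto 3DANN-M and POP.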

A General Detector, Classifier, Orienter


To efficiently recognize objects of arbitrary size and orientation, a hierarchical neural network approach based on eigenvectors is employed (see Figure 10). 3DANN-M is used to convolve with 64 eigenvector templates representing the principal axes of a collection of multidimensional data points (i.e., object images of various configurations) [5,6,7]. Since each data point (image) is a 4096-element vector, finding a set of 4096 orthonormal eigenvectors is possible (64 of which can reside on 3DANN-M). Selecting the 64 most significant eigenvectors constructed from principal component analysis of target imagery reduces the dimensionality of the image sets, yet still retains much of the information relevant for classification.
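A hedged sketch of the template-construction and projection steps, using an SVD as a stand-in for whatever principal component procedure was actually employed, with random data as a placeholder:

    import numpy as np

    def eigentemplates(train, k=64):
        """train: (num_images, 4096) matrix of vectorized 64x64 object
        images. Returns the k most significant principal axes, which
        could be reshaped into 64x64 templates and loaded onto 3DANN-M."""
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        return Vt[:k], mean

    templates, mean = eigentemplates(np.random.rand(500, 4096))
    window = np.random.rand(64, 64)                  # one 64x64 image window
    features = templates @ (window.ravel() - mean)   # 64-dimensional code
    # 'features' is the reduced representation handed to the classifier.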


The most problematic aspect of this technique is that, unless some restrictions are placed on variations in the target imagery, the most significant components become so general as to be unsuitable for fine distinctions such as object orientation or identity (e.g., missile type). Our strategy is to parameterize the object space (e.g., by lighting, pose, class, identity, and scale) and partition it in a hierarchical fashion. To classify each partition, a neural network (or other classifier) is trained on imagery drawn from the set of variables that define the partition and projected onto eigenvectors suited to that particular distribution of data.

Figure 10: General target recognition is achieved using eigenvector projections in conjunction with a neural network classifier trained on selected data sets.
Information about the object (its class, identity, or pose) is processed in a coarse-to-fine manner. For instance, after detecting an object in a frame, a rough estimate of image pose/scale is made, a result that can then be used to limit the variation that needs to be considered during object classification (i.e., plane, helicopter, or missile). Results using the technique described here have achieved nearly 97% detection rates, 94% classification rates for determining the angle of the principal dimension of an object with respect to the image (30°), and object classification rates approaching 95%. See Figure 11 for results on object/non-object image classification rates achieved with a helicopter/missile/plane data set.
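Schematically, the coarse-to-fine flow is a cascade of classifiers, each trained on the partition selected by the previous stage; in the Python sketch below, the stage functions and bin structure are illustrative only.

    def classify_coarse_to_fine(features, detector, pose_estimator, classifiers):
        """features: the 64-dimensional eigenvector projection of a window.
        Each stage is a network trained on a progressively narrower
        partition of the object space."""
        if not detector(features):            # coarse: object vs. non-object
            return None
        pose_bin = pose_estimator(features)   # finer: rough pose/scale bin
        # Finest: a classifier specialized to that bin assigns the class,
        # e.g. plane vs. helicopter vs. missile.
        return classifiers[pose_bin](features)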

Figure 11: High detection/classification rates are achieved on selected data sets that include all possible orientations and scales of targets.

Conclusions
In this paper we have discussed the architecture of the ANTE processor and how it applies to a processing paradigm of rich-pixel image recognition, in which synthetic images are generated and then fused. This architecture provides for the possibility of sophisticated end-to-end image recognition at video frame rates. It also allows for the mixing of evidence among spatial, motion, and spectral features, all done in parallel at a pixel level. The ANTE hardware should be functional by the end of summer 1997, and the results from running algorithms similar to those shown should be available then. Future directions for research include extensive mapping of algorithms to this architecture and the architecture's expansion to include more highly integrated and streamlined circuitry, particularly in the areas of internal communications and the Point Operations Processor (POP).

References

1. J. Carson, "On focal plane array feature extraction using a 3-D artificial neural network (3DANN)," Proc. SPIE, vol. 1541, Part I: pp. 141-144, Part II: pp. 227-231, 1991.
2. T. Duong, S. Kemeny, T. Daud, A. Thakoor, C. Saunders, and J. Carson, "Analog 3-D neuroprocessor for fast frame focal plane image processing," SIMULATION, vol. 65, no. 1, pp. 11-24, 1995.
3. T. Duong, T. Thomas, T. Daud, A. Thakoor, and B. Lee, "64x64 analog input array for 3-dimensional neural network processor," Proceedings of the 3rd International Conference on Neural Networks and Their Applications, Marseilles, France, 1997.
4. D. Hammerstrom, E. Means, M. Griffin, G. Tahara, K. Knopp, R. Pinkham, and B. Riley, "An 11 million transistor digital neural network execution engine," Proceedings of the IEEE International Solid-State Circuits Conference, pp. 180-181, 1991.
5. M. Turk and A. Pentland, "Eigenfaces for recognition," J. of Cognitive Neuroscience, vol. 3, pp. 71-86, 1991.
6. C. Padgett, G. Cottrell, and R. Adolphs, "Categorical perception in facial emotion classification," Proceedings of the 18th Annual Conference of the Cognitive Science Society, Hillsdale, NJ, pp. 201-207, 1996.
7. C. Padgett, M. Zhu, and S. Suddarth, "Detection and object identification using VIGILANTE processing system," Proc. SPIE, vol. 3077, 1997.
