
Preattentive grouping and Attentive selection for early visual computation

Winfried A. Fellenz and Georg Hartmann
E-Mail: getfell@get.uni-paderborn.de, hartmann@get.uni-paderborn.de
Universität-GH Paderborn, FB Elektrotechnik, Pohlweg 47-49, 33098 Paderborn, Germany

Abstract
The segmentation of objects in a real world scene is a prerequisite for any higher level recognition or interpretation process. Biological visual systems exploit efficient mechanisms for object extraction which seem to be mostly data driven. We propose a network for perceptual grouping inspired by neurophysiological and psychophysical findings, incorporating a phase diffusion process which labels the whole image into its constituent objects and the background, followed by a selective attention stage which sequentially extracts objects in the scene. The image is processed by four successive stages, copying the design of visual cortical mechanisms. Direction specific edge responses are used as starting points for a competitive and cooperative phase process. The resulting phase image is processed by an attention mechanism, extracting homogeneous regions using both spatial and phase information, followed by the generation of a saccadic signal.

1. Introduction
Intermediate vision is concerned with the extraction of objects and their attributes in a scene and serves as a suitable representation, linking low-level processes like feature extraction to higher level processes like object recognition and scene interpretation. Apart from shape, motion, and depth analysis, the segmentation of the scene into its basic regions is the most common intermediate level computation. As object boundaries normally coincide with intensity discontinuities in the image, many segmentation methods rely on a low level edge detection stage and group the extracted edges into region boundaries using edge following and linking methods. These edge based methods are opposed by region growing schemes, which merge initially small regions with similar intensity values into larger regions by comparing pixel properties in a local neighbourhood or by using a global statistic. It was shown [21] that most of the segmentation methods can be unified using a common variational formulation which is equivalent to the Mumford-Shah energy if the first three terms of that functional are kept:

\[ E(u, K) = \int_{\Omega \setminus K} \big( |\nabla u(x, y)|^2 + (u - g)^2 \big) \, dx \, dy + L(K) \qquad (1) \]

The functional defines the segmentation problem as a joint smoothing and edge detection problem, whereby the first term imposes the smoothness of the image outside the edges (on the image domain Ω without K), the second term guarantees that the piecewise smooth image u(x, y) indeed approximates the image g(x, y), and the third term forces the discontinuity set K to have minimal length L. The final edge set K marks the boundaries in the image which separate regions with uniform properties. A similar formulation was used in [2, 17, 27] for surface reconstruction, where a flexible template is fitted to the sparse and noisy data, allowing discontinuities in the data to be labeled explicitly. This labeling is introduced by a line process [11] coupled to the regularising approximation process, or by a weak continuity constraint allowing occasional cracks in the interpolation process, which are charged in the corresponding energy function by a penalty term.

However, a perceptually satisfying segmentation of an intensity image should also result in regions corresponding to the observed perceptual groups and objects, using the discontinuities to indicate region boundaries. Therefore the interpolation term can be neglected and the segmentation can be decoupled from the intensity image to allow the emergent forming of regions in the phase domain, corresponding to perceptual groups in the image domain.

Next, a scheme will be proposed which transforms the preattentive grouping and segmentation process into phase space, thereby decoupling the resulting phase image from the intensity image. The responses of an initial edge detection stage are rectified into ON and OFF channels, producing direction selective responses at each position of the intensity image. Simple local constraints between the resulting edge maps are used to relax the associated phase labels into homogeneous regions and phase discontinuities which correspond to zero crossings of the smoothed second intensity derivative, followed by the attentive extraction of the regions.
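As a rough illustration of functional (1), the following sketch evaluates a discretised version on a small pixel grid. The function name `mumford_shah_energy`, the forward-difference approximation of the gradient, and the encoding of the edge set K as a set of cut links between 4-neighbours (whose count stands in for the length L(K)) are assumptions made for this example and are not taken from the original formulation.

```python
def mumford_shah_energy(u, g, edges):
    """Discrete stand-in for E(u, K) on a 2D grid given as lists of rows.

    u, g  : piecewise smooth approximation and observed image
    edges : set of frozenset({(r, c), (r2, c2)}) links between 4-neighbours
            that are cut by the discontinuity set K (hypothetical encoding)
    """
    rows, cols = len(u), len(u[0])
    smooth = 0.0    # squared forward differences across links not in K
    fidelity = 0.0  # squared deviation of u from g
    for r in range(rows):
        for c in range(cols):
            fidelity += (u[r][c] - g[r][c]) ** 2
            for r2, c2 in ((r + 1, c), (r, c + 1)):
                if r2 < rows and c2 < cols:
                    if frozenset({(r, c), (r2, c2)}) not in edges:
                        smooth += (u[r2][c2] - u[r][c]) ** 2
    length = float(len(edges))  # number of cut links approximates L(K)
    return smooth + fidelity + length


# A vertical step of height 2 between column 1 and column 2.
g = [[0.0, 0.0, 2.0, 2.0] for _ in range(3)]
step_edges = {frozenset({(r, 1), (r, 2)}) for r in range(3)}

e_without = mumford_shah_energy(g, g, set())       # step is penalised as roughness
e_with = mumford_shah_energy(g, g, step_edges)     # step is explained by K
```

In this toy case, marking the step as part of K lowers the energy, because the length penalty of three cut links is smaller than the smoothness penalty the unmarked step would incur.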

Figure 1. a) Open disk scene; b) Summed ON and OFF channel responses; c) Phase image after 100 iterations.

2. The Edge Detection Stage

Since the early days of computer vision, many operators for edge detection have been proposed, ranging from gradient and template matching operators to parametric edge models. Because most of these operators are designed to detect special kinds of edges, it soon became obvious that a general purpose edge detector would be a compromise between certain performance criteria. In [4] a linear operator was derived for the detection of step edges, which optimises the joint criteria of good localisation and reliability. Being an odd-symmetric operator, it suffers from false localisation of line and roof edges. The detection of zero crossings (ZC) in the second derivative of the intensity function using an even symmetric gradient filter [19] faces similar problems at line edges. A nonlinear combination using the summed squares of both even and odd symmetric filters has proven to be a good detector of edges composed of steps, peaks, and roofs [22, 24]. Using both the local energy and phase, it is possible to reconstruct the generating edge, which shows the applicability of this representation to image coding. However, to localise the edge exactly, a search for the maximum response is necessary, contrary to the detection of zero crossings, which are by definition dimensionless. We will show that the presented relaxation phase labeling (RPL) procedure is able to detect ZCs in phase space, thereby sharpening the edge response, while at the same time performing the perceptual grouping of arranged dots into contours and closed objects.

In the presented system we use six pairs of oriented Gabor filters [6] in quadrature phase to extract the local energy, followed by a differentiation step using odd symmetric Gabor filters to rectify the oriented responses:

\[ q_{+}(x, y) = \kappa \cos(\nu x) \exp\!\big( -(\alpha x^2 + \beta y^2) \big) \qquad (2) \]
\[ q_{-}(x, y) = \kappa \sin(\nu x) \exp\!\big( -(\alpha x^2 + \beta y^2) \big) \qquad (3) \]

where q+(x, y) represents the even symmetric function and q-(x, y) its odd symmetric Hilbert transform. The constants α and β specify the envelope of the oriented Gaussian, ν sets the appropriate frequency of the modulating sinusoid, and κ is a normalisation factor. Figure 2 shows the edge detection stage and the model hyper-columns, which are the initial data for the relaxation labeling process described in the next section.

Figure 2. a) Hierarchical extraction of direction specific contour lines; b) Hyper-columnar structure with parametric phase for the relaxation phase labeling process.

3. The Relaxation Phase Labeling Process

Twenty years ago a mechanism for scene labeling was proposed [26], which reduces the ambiguity among objects in a scene in terms of an iterated relaxation procedure, performed in parallel on the data array. Since then numerous approaches to parallel relaxation operations have been described. We have adopted the general strategy of relaxing labels corresponding to observed properties in the scene, using parametric phase labels, which group into coherent objects in phase space, giving the relaxation procedure a new degree of freedom to accomplish a consistent labeling.

As revealed by the Gestalt school in the first half of the century, visual perception is governed by certain simple rules which group parts into wholes, employing laws like grouping by proximity, similarity, closure, symmetry and good continuation [18]. Although these principles are easy to investigate in psychophysical experiments, their underlying neuronal computations are mainly unknown. It has been speculated that the synchronisation of visual cortical neurons, revealed by recent electrophysiological studies [8, 12], may serve as the carrier of the observed perceptual grouping phenomenon. The differences in oscillator phase between spatially neighbouring spiking cells could in principle be used to label different objects in the scene for their intrinsic segmentation. The proposed grouping criteria of spatial contiguity and coherence of particular feature domains indeed show similarities to the proposed Gestalt laws. However, the law of good continuation, which plays a central role in many edge grouping and linking schemes in computer vision, is able to override both proximity and similarity. This emphasises the role of oriented edges both in the implementation of perceptual grouping and in synchronisation mechanisms.

The emergent forming of a perceptual group, including both edge and region based information, is depicted in figure 1: the dots forming an incomplete circle are grouped into a synchronised round disk with a discontinuity at the upper right indicating the missing dot in phase space. In figure 4 the results of the proposed segmentation scheme for a scene with three simple objects are shown. Although the objects are defined by different boundary types, ranging from intensity discontinuities over lines to dots, the phase gradient shows a common interpretation of all contour types.

In figure 3 the general idea of the proposed phase relaxation and diffusion mechanism is depicted. The principal processing is as follows [10]: we defined smoothly varying constraints on the interaction strength between all direction selective responses of the second preprocessing stage. These constraints support orientation continuity by positive interactions between similar directions, and decouple both sides of the contour by negative interactions between opposite directions. The spreading of labels into regions is introduced by synchronising phase oscillators at the contours with oscillators in the interior of objects. This filling-in is similar to brightness diffusion [5, 23], allowing the separation of figure and ground [16], but instead uses the coherency of cyclic phases to label the whole scene. The proposed labeling process can be formulated in terms of minimising an explicit functional depending on the basic compatibility relations, using results developed in [14]. The phases φ_{i,j} of each hypercolumnar vector at position (i, j) are updated according to a Gauss-Seidel procedure, using a sigmoid nonlinearity g for summing up the individual activations, and a shifted cosine for calculating the contributions of neighbouring elements depending on their phase difference:

\[ \dot{\phi}_{i,j} = \omega_{i,j} + \sum_{m,n \in M} v_{m,n} \, E_{i,j}(m) \, Z_{i,j}(n) \qquad (4) \]
\[ Z_{i,j}(n) = g\Big( \sum_{k,l \in N} h_{k,l} \, E_{i+k,j+l}(n) \, f(\Delta\phi) \Big) \]
\[ f(\Delta\phi) = \gamma \, \big( \delta + \cos(\Delta\phi) \big) \qquad (5) \]
\[ \Delta\phi = \phi_{i+k,j+l} - \phi_{i,j}, \qquad -\pi \le \Delta\phi < \pi \]
\[ v_{m,n} = \exp\!\big( -\lambda \, \|m - n\|^2 \big) - \mu \]

Notation
φ_{i,j}        Phase at position (i, j)
ω_{i,j}        Random variable
E_{i,j}(m)     Activity in m-th feature map
Z_{i,j}(n)     Contribution of n-th feature map
v_{m,n}        Compatibility constraints
h_{k,l}        Connectivity matrix
g(x)           Sigmoid nonlinearity (tanh x)
f(x)           Periodic function of phase difference
Δφ             Phase difference
γ, δ, λ, μ     Constants
m, n ∈ M       Set of discrete directions

Figure 3. Scheme for relaxation and diffusion of phase labels. The intensity distribution (a) is filtered to extract intensity gradients (b) corresponding to perceived edges in the image. The smoothed derivative of the edge map is rectified into ON and OFF channels (c), allowing simple compatibility constraints between channels to modify an initially uniform phase map (d); (e) intermediate and (f) final phase distribution of the phase image evolving in parallel over time.

Figure 4. a) Scene with edge and line defined objects; b) Phase image after 28 iterations; c) Phase gradient of b.
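To make the Gauss-Seidel update of process equation (4) more tangible, here is a strongly reduced sketch on a one-dimensional chain. It is not the authors' implementation: it assumes a single feature map, unit connectivity to the two nearest neighbours, no noise term, no compatibility matrix, and a sine coupling (a Kuramoto-type choice) in place of formulation (5). The names `wrap`, `relax_phases`, `step` and `sweeps` are hypothetical.

```python
import math


def wrap(phase):
    """Map a phase to the interval [-pi, pi)."""
    return (phase + math.pi) % (2.0 * math.pi) - math.pi


def relax_phases(phases, activity, sweeps=100, step=0.5):
    """Gauss-Seidel sweeps over a 1D chain of phase labels.

    phases   : initial phase per position
    activity : E_i in [0, 1]; zero activity marks a decoupling gap
    """
    phi = list(phases)
    n = len(phi)
    for _ in range(sweeps):
        for i in range(n):  # in-place update: later sites see new values
            drive = 0.0
            for k in (-1, 1):  # sparse neighbourhood of two sites
                j = i + k
                if 0 <= j < n:
                    # h = 1 and f = sin are assumptions of this sketch
                    drive += activity[j] * math.sin(wrap(phi[j] - phi[i]))
            # g = tanh bounds the summed contribution; E_i gates the update
            phi[i] = wrap(phi[i] + step * activity[i] * math.tanh(drive))
    return phi


# Two active segments separated by an inactive site at index 3.
activity = [1, 1, 1, 0, 1, 1, 1]
start = [0.0, 0.2, 0.4, 1.0, 2.0, 2.2, 2.4]
final = relax_phases(start, activity)
left, right = final[:3], final[4:]
```

Within each active segment the phases settle on a common label, while the inactive site keeps the two segments from exchanging phase information, so a phase discontinuity remains between them.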

The compatibility function v_{m,n}, depicted in Fig. 5a), is modelled as a shifted Gaussian. A sparse horizontal connectivity scheme h_{k,l} was chosen to improve the synchronisation behaviour. In Figure 5b) the qualitative convergence properties of the system are depicted, showing average phase change and normalised average energy over iteration steps. The periodic function f(x) can be set to sin(x) to resemble the Kuramoto oscillator; we instead used formulation (5) to speed up convergence. The zero mean random variable ω_{i,j} introduces noise into the decision process, thereby resolving ambiguous situations and forcing the process to move from the initial equilibrium state, with all phases being equal, to a global solution in phase space. As can be seen from the process equation (4), the change in phase at each location is governed by a correlated activity in at least one feature map at neighbouring positions. To allow the spreading of phase labels into regions formed by the oriented contours, a uniform activity E_{i,j}(m+1) is added to an additional feature map (m+1), to resemble spontaneous neuronal activity.

Figure 9b)-d) shows the extracted direction selective edges of the test image Paolina, using only odd-symmetric Gabor filters to half-wave rectify the oriented responses into ON and OFF channels. The result of the constraint satisfaction relaxation procedure is shown in 9e), from which the phase gradient 9f) has been computed. To compare the performance of the segmentation, the binarised gradient of the phase image and the edges detected by a Canny edge detector are shown. It can be seen that the contours of the binarised phase gradient in Figure 9g) resemble the Canny edges, although no postprocessing like edge linking and maximum detection was necessary. In figure 10 the same maps are shown for a boat image.

Figure 5. a) Competitive/cooperative interaction constraints between direction selective responses; b) Qualitative convergence behaviour of the relaxation process, continuous: average phase change, dashed: average energy. [Plot axes: Average Phase Change / Compatibility Energy over Steps, 0 to 200.]

Figure 6. Sketch of the maps involved in the process of segmenting and extracting objects from a scene. [Diagram labels: Image Plane (Retina); Feature Maps, V1 - V5 (Preattentive Segmentation, Synchronization); Attention Engagement (Pulvinar, Thalamus); Object Recognition (IT); Target Selection (Superior Colliculus); Spatial Modulation (IOR, FEF); Spatial Map, engage Attention (Posterior Parietal Cortex).]
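The quadrature filter pair of equations (2) and (3) and the half-wave rectification into ON and OFF channels can be sketched in one dimension as follows. This is a simplified illustration rather than the six-orientation 2D filter bank of the paper: the kernel size, the values of `freq` and `alpha`, the removal of the mean from the even kernel, and the border handling are all assumptions chosen for the example, and the function names are hypothetical.

```python
import math


def gabor_pair(half_width=6, freq=0.5, alpha=0.05):
    """Sampled 1D even (cosine) and odd (sine) Gabor kernels."""
    xs = list(range(-half_width, half_width + 1))
    env = [math.exp(-alpha * x * x) for x in xs]
    even = [e * math.cos(freq * x) for e, x in zip(env, xs)]
    odd = [e * math.sin(freq * x) for e, x in zip(env, xs)]
    mean = sum(even) / len(even)
    even = [v - mean for v in even]  # assumption: remove the DC response
    return even, odd


def correlate(signal, kernel):
    """Same-size correlation with border values replicated."""
    h = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - h, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def on_off_channels(signal):
    """Local energy plus half-wave rectified odd response (ON / OFF)."""
    even, odd = gabor_pair()
    r_even = correlate(signal, even)
    r_odd = correlate(signal, odd)
    energy = [a * a + b * b for a, b in zip(r_even, r_odd)]
    on = [max(r, 0.0) for r in r_odd]    # responds to rising edges
    off = [max(-r, 0.0) for r in r_odd]  # responds to falling edges
    return energy, on, off


rising = [0.0] * 20 + [1.0] * 20
falling = [1.0] * 20 + [0.0] * 20
energy_r, on_r, off_r = on_off_channels(rising)
energy_f, on_f, off_f = on_off_channels(falling)
```

A rising step drives only the ON channel at the edge position and a falling step only the OFF channel, which is the polarity separation that the phase labeling stage relies on.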

4. Selective Attention
Two types of theories have been suggested to explain how attention is allocated to perform visual tasks. According to region based theories, an attentional spotlight of circular shape and varying diameter is directed to spatial positions in the visual field. Object based theories, on the other hand, propose that attention is directed to perceptual groups and not just locations. However, the main advantage of an attentional mechanism is the information reduction capability of spatially selecting salient portions of the visual field, and the possible simplification of the binding problem by linking together the output of cells coding different features of the attended object. Recent research reveals evidence for object-based theories of attention [29], with objects acting as wholes in a slow, competitive process working in parallel across the visual field [7], although spatial selection and top-down control are part of the attentional system.

Figure 6 shows a simplified sketch of the brain maps involved in the segmentation of objects from a complex scene by applying a cortical grouping mechanism and an attentional focus to the early representation of the scene. Both processes are part of early vision mechanisms [15], which operate bottom-up, whereby the attentive control serves the coupling of data driven and cognitive processing streams, both possessing cyclic and feedback loops. The visual information of an image is decomposed into sets of features of multiple feature maps (V1-V5) which interact by excitatory and inhibitory connections between locations (horizontal) and features (vertical). The pre-attentively grouped visual information is further processed by an attention mechanism (pulvinar) which chooses the most salient perceptual group and selectively enhances the responsiveness of neurons to this location at the expense of information from other groups or locations. The target selection map (SC) precomputes the expected saccade in a retinotopic coordinate frame, which is transformed into a spatial attentional map in viewer centred (environmental) coordinates (PP). The spatial modulation map (FEF) integrates information about attentionally relevant locations from PP with recently visited locations (IOR) and cognitive information like expected locations and overall scanning behaviour (compare with [30]).

5. The object based attention process


The phase image of the preattentive stage was used for the sequential extraction of objects by a selective attention mechanism [9]. This stage of processing applies an object-based attention filter to the presegmented early visual information by selectively enhancing and inhibiting regions corresponding to preattentively synchronised perceptual groups in the earlier visual maps. The attentional filter is computed by a global winner-take-all (WTA) mechanism in a separate attentional map, integrating the information from all feature and scale specific earlier visual maps and the temporally decaying memory map (IOR) representing recently attended objects. The dynamics of the system have been adapted from the shunting feedback network proposed by S. Grossberg [13], and have been rewritten for discrete simulation on a computer:

\[ \Delta C_{ij} = e^{-\mathrm{inh}_{ij}} (B - C_{ij}) A_{ij} - (C_{ij} - D) I - C_{ij} + \mathrm{ext}_{ij} \]
\[ A_{ij} = \frac{1}{(2v+1)(2w+1)} \sum_{k=-v}^{v} \sum_{l=-w}^{w} h_{kl} \, (C_{i+k,j+l})^2 \qquad (6) \]
\[ I = \frac{1}{mn} \sum_{i}^{m} \sum_{j}^{n} (C_{ij})^2 \]

where C_ij corresponds to the map element at position (i, j), I equals the normalised sum of the squared activations, and A_ij corresponds to the normalised result of convolving C² with kernel h at C_ij. ext_ij denotes the excitatory and inh_ij the inhibitory input for IOR. B and D are arbitrarily chosen constants bounding the activation of C_ij between D and B. For reasons of simplicity we have chosen D = 0 and B = 1.

Figure 7. Sequence of attentional foci (white) using both edge energy and phase, overlaid on the phase image of 8a).

In the presented simulations, the constants have been set to 1.0, 1.0, 0.1, 1.0, 3.0, and 0.1. Critical for the overall performance of the network are the size and form of the convolution kernel h, for which we have chosen a Gaussian with diameter five, and the parameter which influences the size of the variable attentional spotlight. In the presented simulations the excitatory input consists of two arrays for the phase and activity at each spatial location. In the last processing stage the selected visual information from the feature maps is integrated in a target selection map (SC), which executes a saccade by applying a nonlinear model of local lateral interactions for saccade averaging [28], based on ensemble coding and linear vector addition of movement contributions [20]. Figure 7 shows the sequence of attentional foci computed from an objects image, overlaid on its phase image. Figure 8 shows phase and activity maps of the excitatory input and the sequence of inhibitory maps that prevent the system from revisiting recently attended locations. As can be seen, the selected regions are a compromise between spatial and phasic coherence, allowing perceptual groups and objects to be extracted from the input.
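The interplay of a global winner-take-all selection with a decaying inhibitory memory map (IOR) can be illustrated with a much simpler loop than the shunting dynamics of (6). The sketch below swaps in a plain argmax for the competitive network and keeps only the multiplicative attenuation of the input by exp(-inh); the one-dimensional saliency map, the function name `attend_sequence`, and the values of `inhibit` and `decay` are assumptions made for this example.

```python
import math


def attend_sequence(saliency, n_fixations=3, inhibit=3.0, decay=0.5):
    """Return the order in which locations win a global competition.

    saliency : excitatory input per location (hypothetical 1D map)
    inhibit  : inhibition added to the memory map at the attended location
    decay    : factor by which the memory map fades after each selection
    """
    memory = [0.0] * len(saliency)  # IOR map, initially empty
    visited = []
    for _ in range(n_fixations):
        # Inhibition attenuates the input multiplicatively, as exp(-inh).
        gated = [s * math.exp(-m) for s, m in zip(saliency, memory)]
        winner = max(range(len(gated)), key=gated.__getitem__)
        visited.append(winner)
        memory = [m * decay for m in memory]  # temporal decay of the memory
        memory[winner] += inhibit             # suppress the attended site
    return visited


saliency = [0.2, 0.9, 0.1, 0.6, 0.4]
order = attend_sequence(saliency)
longer = attend_sequence(saliency, n_fixations=4)
```

With these values the most salient location is attended first, then the next two in order of saliency, and once its inhibition has decayed sufficiently the first location is selected again.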

Figure 8. a) Phase image of objects scene; b) Summed activity of edge maps; c) Sequence of inhibitory memory.

6. Conclusion
A four stage processing model for object segmentation and selection has been proposed which combines neurophysiological and psychological data to account for its biological plausibility. We have described a relaxation phase labeling procedure for the preattentive grouping and perceptual segmentation of objects in phase space, and an attention mechanism which sequentially extracts perceptual groups in a cluttered scene, consistent with an object based theory of visual attention. The original contribution of the presented biological framework for perceptual segmentation and selection of objects in a real world scene is the transformation of the grouping process into phase space, using a simple relaxation labeling procedure. By introducing directional responses and local constraints thereupon, serving the grouping of similar directions and the decoupling of both sides of a contour line, the proposed mechanism is able to detect zero-crossings in phase space without an explicit and biologically implausible search. The gradient in phase space is sharpened compared to the edge response or the intensity discontinuity, and the whole scene is labelled into objects and background. Furthermore, the relaxation phase labeling (RPL) process is able to extract the most salient contour lines of perceptual groups in phase space, suppressing false responses generated by the preprocessing stage. Therefore the RPL process can be used to link edges into object boundaries by closing small gaps in the contour lines of the intensity image, or to group perceptual primitives like dots, points or dashes into perceptual wholes using grouping principles originally proposed by Gestalt psychology. For a more complete segmentation scheme involving both different spatial frequencies and multiple feature domains, the system could be expanded by a scale space approach [3, 23] and the integration of parallel texture, motion, and colour specific processing channels [25, 1]. An extension on the feature level will be the integration of distinctive maps for two dimensional features like direction of motion, texture, curvature, endstoppings and junctions.

Figure 9. a) Paolina image (Size 200x200); b) Summed responses of six ON channels; c) Summed responses of six OFF channels; d) Phase image after 21 iteration steps; e) Binarised phase gradient of d; f) Canny detector with σ = 1, and threshold (0.3, 0.9).

Figure 10. a) Boat scene (Size 200x200); b) Summed responses of six ON channels; c) Summed responses of six OFF channels; d) Phase image after 51 iteration steps; e)-f) same as in Fig. 9.
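The binarised phase gradient shown in the panels e) of Figures 9 and 10 has to respect the cyclic nature of the phase labels, so that two labels on either side of the wrap-around point are treated as similar. A minimal sketch of such a computation follows; the forward differences, the border handling, the threshold value and the function names are assumptions of this example, not details given in the paper.

```python
import math


def wrapped_diff(a, b):
    """Smallest signed difference a - b on the circle, in [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi


def binarised_phase_gradient(phase, threshold=0.5):
    """Mark pixels whose cyclic forward-difference gradient is large."""
    rows, cols = len(phase), len(phase[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Forward differences; the last row/column reuses its own value.
            dx = wrapped_diff(phase[r][min(c + 1, cols - 1)], phase[r][c])
            dy = wrapped_diff(phase[min(r + 1, rows - 1)][c], phase[r][c])
            if math.hypot(dx, dy) > threshold:
                out[r][c] = 1
    return out


# Two regions with clearly different phase labels: a boundary is marked.
two_regions = [[0.0, 0.0, 1.5, 1.5] for _ in range(3)]
boundary = binarised_phase_gradient(two_regions)

# Labels 3.0 and -3.0 are close on the circle: no boundary is marked.
near_wrap = binarised_phase_gradient([[3.0, -3.0]])
```

The second case shows why a plain subtraction would be misleading here: the raw difference of 6.0 would exceed any reasonable threshold, whereas the cyclic difference is below 0.3.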

References
[1] J. Aloimonos and D. Shulman. Integration of Visual Modules: An Extension to the Marr Paradigm. Academic Press, 1989.
[2] A. Blake and A. Zisserman. Invariant surface reconstruction using weak continuity constraints. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, pages 62-67. IEEE, 1986.
[3] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Trans. on Communications, 31(4):532-540, 1983.
[4] J. F. Canny. A computational approach to edge detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986.
[5] M. A. Cohen and S. Grossberg. Neural dynamics of brightness perception: Features, boundaries, diffusion, and resonance. Perception and Psychophysics, 36:428-456, 1984.
[6] J. G. Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A, 2:1160-1168, July 1985.
[7] R. Desimone and J. Duncan. Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18:193-222, 1995.
[8] R. Eckhorn, R. Bauer, W. Jordan, M. Brosch, M. Kruse, W. Munk, and H. J. Reitboeck. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60:121-130, 1988.
[9] W. A. Fellenz. A sequential model for attentive object selection. In Proc. 39th IWK, Sept. 27-30, vol. II, pages 109-116, TU Ilmenau, 1994.
[10] W. A. Fellenz and G. Hartmann. Image segmentation by phase label diffusion. In Proc. of the Int. Conference on Artificial Neural Networks, ICANN-95, Paris, vol. II, pages 309-314, 1995.
[11] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[12] C. M. Gray, P. König, A. K. Engel, and W. Singer. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338:334-336, 1989.
[13] S. Grossberg. Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1:17-61, 1988.
[14] R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling processes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 5:267-287, 1983.
[15] B. Julesz. Foundations of Cyclopean Perception. University of Chicago Press, 1971.
[16] P. K. Kienker, T. J. Sejnowski, G. E. Hinton, and L. E. Schumacher. Separating figure from ground with a parallel network. Perception, 15:197-216, 1986.
[17] C. Koch, J. Marroquin, and A. Yuille. Analog neuronal networks in early vision. Proceedings of the National Academy of Science, 83:4263-4267, 1986.
[18] K. Koffka. Principles of Gestalt Psychology. Harcourt, Brace & World, New York, 1935.
[19] D. Marr and E. Hildreth. Theory of edge detection. Proceedings of the Royal Society of London B, 207:187-216, 1980.
[20] J. T. McIlwain. Distributed spatial coding in the superior colliculus: A review. Visual Neuroscience, 6:3-13, 1991.
[21] J.-M. Morel and S. Solimini. Variational Methods in Image Segmentation. Birkhäuser, Boston, 1995.
[22] M. C. Morrone and D. C. Burr. Feature detection in human vision: a phase-dependent energy model. Proceedings of the Royal Society of London B, 235:221-245, 1988.
[23] P. Perona and J. Malik. Detecting and localizing edges composed of steps, peaks and roofs. In Proc. of the 3rd Int. Conf. on Computer Vision, pages 52-57. IEEE Comp. Soc., Osaka, 1990.
[24] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629-639, 1990.
[25] T. Poggio, E. B. Gamble, and J. J. Little. Parallel integration of visual modules. Science, 242:436-440, 1988.
[26] A. Rosenfeld, R. A. Hummel, and S. W. Zucker. Scene labeling by relaxation operations. IEEE Transactions on Systems, Man and Cybernetics, 6:420-433, 1976.
[27] D. Terzopoulos. Regularization of inverse visual problems involving discontinuities. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(4):413-424, 1986.
[28] A. J. Van Opstal and J. A. M. Van Gisbergen. A nonlinear model for collicular spatial interactions underlying the metrical properties of electrically elicited saccades. Biol. Cybern., 60:171-183, 1989.
[29] S. Yantis. Multielement visual tracking: Attention and perceptual organization. Cognitive Psychology, 24:295-340, 1992.
[30] A. L. Yarbus. Eye movements and vision. Plenum, New York, 1967.