

Andrea Teschioni, Franco Oberti and Carlo Regazzoni
Department of Biophysical and Electronic Engineering, University of Genoa
Via all'Opera Pia 11a, I-16145 Genova, Italy

Abstract

Monitoring systems for outdoor surveillance applications need real-time solutions to complex computer vision problems. However, advanced visual surveillance systems need not only to detect and track moving objects, but also to interpret their patterns of behaviour. This work presents a method based on neural networks for the classification of moving tracked objects. The application scenario is the entrance access of a tourist village crossed by vehicles, and the main aim of the method is to recognise the possible presence of pedestrians in the zone.

1. Introduction

Visual surveillance primarily involves the human interpretation of image sequences through direct monitor inspection. Advanced visual surveillance of complex environments goes further and automates the detection of predefined alarm situations in a given context. The role of automatic computation in such systems is to support the human operator in performing different tasks, such as observing and detecting, interpreting and understanding, logging, giving alarms, and diagnosing false alarms. In many surveillance systems, most of the activity of the human operators in a control room is devoted to observing, on several monitors, the images provided by a set of cameras. The number of monitors is usually smaller than the number of video sensors, and automatic processing of multisensorial information can be used, for example, to select the subset of cameras whose output must be displayed. The restriction to a temporary selection of scenes from different cameras helps the human operator to concentrate his decision capabilities on possibly dangerous situations, by means of significant focus-of-attention messages. Other tasks, such as information logging from image sequences, would require a continuous analysis of visual information. In this case, the most practical solution used to be off-line analysis of recorded sequences. This solution has two major disadvantages: it requires the selection and storage of a large amount of data, and it charges the human operator with an annoying, repetitive task.

Modern automatic visual surveillance systems provide the operator with both attention-focusing information (which allows important events to be signalled by means of user-friendly messages and suggestions) and intelligent logging capabilities (which allow variables of interest to be automatically extracted, coded and memorized, saving recording resources). In this way, possible human failures are expected to be overcome and better surveillance performance is obtained. In this work, a neural-network-based approach to image understanding using a Multilayer Perceptron [1] is presented, dealing with a particular surveillance application in the transport field. In this field, surveillance systems are used to detect dangerous situations, to gather statistical knowledge about traffic activities (maintenance schedules, traffic flow plans and simulators, etc.) and to provide users with information about accidents or traffic jams so that they may travel safely and comfortably. In particular, the aim of the proposed work is to develop an algorithm for discriminating, within the wide class of moving objects present in a road traffic scene, between pedestrians and vehicles, in order to signal possible alarm situations to a remote operator in charge of monitoring the examined scene. The work is organized as follows: Section 2 provides a brief overview of the faced problem, Section 3 presents the proposed system in general, Section 4 examines the neural-based approach in detail, and Section 5 presents the results obtained in moving object classification.

2. Problem overview

The system that we are considering must be applied to the surveillance of dangerous situations at the entrance access of a tourist village: in particular, the system must be able to detect, as its main functionality, the presence of people walking in zones of the road normally reserved for vehicles, and to provide an alarm to a remote operator whenever such an event occurs. As a further functionality, the system must also be able, within the class of detected people, to discriminate between the municipality personnel of the village, who are dressed in a particular colour, and civilians, who must be counted for statistical purposes, and to provide a different kind of alarm depending on the class of the detected people.

3. General system architecture

Figure 3 shows the general architecture of the proposed surveillance system. The following assumptions are made: (a) stationary and precalibrated camera, (b) ground-plane hypothesis, (c) known set of object and behaviour models. The system is composed of six modules: image acquisition (IA), background updating (BU), mobile object detection (MOD), object tracking (OT), object recognition (OR) and dynamic scene interpretation (DSI).
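As an illustration, the per-frame data flow through these modules can be sketched as follows; every function body is a toy placeholder (a running-average background, a fixed-threshold difference, a single-centroid "tracker") and an assumption of this sketch, not the paper's implementation:

```python
import numpy as np

def update_background(bck, frame, alpha=0.05):        # BU: running average
    return (1 - alpha) * bck + alpha * frame

def detect_mobile_objects(frame, bck, thresh=30):     # MOD: difference + threshold
    return np.abs(frame - bck) > thresh               # binary mask B(x,y)

def track_objects(mask):                              # OT: toy single-centroid "tracker"
    ys, xs = np.nonzero(mask)
    return [] if xs.size == 0 else [(xs.mean(), ys.mean())]

def process_frame(frame, bck):
    mask = detect_mobile_objects(frame, bck)          # MOD
    tracks = track_objects(mask)                      # OT
    bck = update_background(bck, frame)               # BU
    return tracks, bck                                # OR/DSI would consume `tracks`
```

In the real system, the OR and DSI modules would consume the track list produced here.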

Figure 1: Typical image of the considered entrance access

Fig. 1 represents a typical scene to be monitored. The solution adopted is based on a cascade of two Multilayer Perceptron neural networks used as classifiers, each devoted to a particular task. The first network classifies each moving object into one of three different classes, i.e. people, vehicles or other objects. The second network, which takes as input the output of the first one, discriminates, among people, between uniformed personnel and civilians, in order to determine the counting increment to be computed. Fig. 2 illustrates the particular solution we have designed for this problem.

Figure 3: General system architecture
3.1 Image acquisition and background updating

A color surveillance camera, mounted on a pole and equipped with a wide-angle lens to capture the activity over a wide-area scene, acquires the visual images that represent the input of the system. A pin-hole camera model has been selected. A background updating procedure is used to adapt the background image BCK(x,y) to significant changes in the scene (e.g., illumination, new static objects, etc.).
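The paper does not detail the BU procedure; a common running-average scheme, with an optional mask that keeps detected moving pixels from being absorbed into BCK(x,y), could look as follows (the blending rule and parameter names are assumptions of this sketch):

```python
import numpy as np

def update_background(bck, frame, alpha=0.05, change_mask=None):
    """Running-average update of the background image BCK(x,y).

    bck, frame   : float arrays of identical shape
    alpha        : adaptation rate (small -> slow adaptation)
    change_mask  : optional boolean array marking detected moving pixels;
                   those pixels are left unchanged so that moving objects
                   are not blended into the background."""
    updated = (1.0 - alpha) * bck + alpha * frame
    if change_mask is not None:
        updated[change_mask] = bck[change_mask]  # freeze moving areas
    return updated
```

A small alpha adapts slowly to illumination changes, while masked pixels let newly static objects eventually be absorbed only once they stop being detected as moving.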


3.2 Object detection


Figure 2: Proposed hierarchical approach to the moving object classification task (color camera -> object detection and tracking system -> list of detected moving areas, i.e. blobs -> first neural network: vehicle-pedestrian classification -> second neural network: uniformed personnel-civilian classification)

We have chosen to use the same kind of neural network at both classification levels, obviously differentiating the features provided to each network according to the task to be faced.

A change detection (CD) procedure based on a simple difference method identifies mobile objects in the scene by separating them from the static background. Let B(x,y) be the output of the CD algorithm. B(x,y) is a binary image where pixels representing mobile objects are set to 1 and background pixels are set to 0. The B(x,y) image normally contains some noisy isolated points or small spurious blobs generated during the acquisition process. A morphological erosion operator is applied to eliminate these undesired effects. Let Bi be the binary blob representing the i-th detected object in the scene.
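A minimal sketch of this step, assuming a fixed difference threshold and a 3x3 structuring element (both illustrative choices not stated in the text):

```python
import numpy as np

def change_detection(frame, bck, thresh=30):
    """B(x,y) = 1 where the current frame departs from the background
    by more than `thresh` (an illustrative threshold), 0 elsewhere."""
    return (np.abs(frame.astype(int) - bck.astype(int)) > thresh).astype(np.uint8)

def erode3x3(b):
    """Morphological erosion with a 3x3 structuring element, used to
    remove isolated noisy points and small spurious blobs: a pixel
    survives only if its whole 3x3 neighbourhood is foreground."""
    p = np.pad(b, 1, mode='constant')      # zero border
    out = np.ones_like(b)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out &= p[dy:dy + b.shape[0], dx:dx + b.shape[1]]
    return out
```

Connected components of the eroded mask then yield the blobs Bi.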


3.3 Object tracking

The positions and dimensions of the minimum bounding rectangles (MBRs) enclosing the detected blobs on the image plane are considered as target features and matched between two successive frames. In particular, the displacement (dx,dy) of the MBR centroid and the variations (dh,dl) in the MBR size are computed. After that, an extended Kalman filter (EKF) estimates the depth Zb of each object's center of gravity in a 3D general reference system (GRS), together with the width W and the length L of the object itself. A ground-plane hypothesis is applied to perform 2D-to-3D transformations from the image plane into the GRS. However, the tracking task is not the focus of this paper and will not be examined further in the following.
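For illustration only, a simplified constant-velocity Kalman filter tracking the MBR centroid on the image plane is sketched below; the paper's tracker is an extended Kalman filter estimating 3D quantities (Zb, W, L) under the ground-plane hypothesis, and all matrices and noise values here are assumptions:

```python
import numpy as np

class CentroidKF:
    """Toy linear Kalman filter, state [x, y, vx, vy], unit frame interval."""
    def __init__(self, x0, y0, q=1e-2, r=1.0):
        self.x = np.array([x0, y0, 0.0, 0.0])
        self.P = np.eye(4)
        self.F = np.array([[1, 0, 1, 0],
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], float)   # constant-velocity model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], float)   # we observe the centroid only
        self.Q = q * np.eye(4)                     # process noise (assumed)
        self.R = r * np.eye(2)                     # measurement noise (assumed)

    def step(self, zx, zy):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # update with the measured MBR centroid (zx, zy)
        z = np.array([zx, zy])
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

Matching MBRs between frames then amounts to associating each measurement with the track whose prediction it is closest to.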

3.4 Dynamic object recognition and behaviour understanding

The overall purpose of a visual surveillance system is to provide an accurate description of a dynamic scene. To do this, an effective interpretation of the dynamic behaviour of 3D moving objects is required. A set of object features, extracted from the input images, is matched against projected geometric features of object models. In particular, regular moment invariant features are considered to characterize the object shape. Each detected blob in the binary image B(x,y) represents the silhouette of a mobile object as it appears in a perspective view from an arbitrary viewpoint in the 3D scene, and the 3D object is constrained to move on the ground plane. Since the viewpoint is arbitrary, the position, size and orientation of the 2D blob can vary from image to image. A large set of possible perspective shapes of multiple 3D object models, e.g., cars, lorries, buses, motorcycles, pedestrians, etc., has been considered. In the next sections, the proposed solution for the particular kind of classifier that has been chosen and for the features used as its input will be presented, followed by some experimental results obtained in the classification task.

4. The neural network based approach

4.1 The choice of the classifier

The choice of the best classifier for the considered application is constrained by the trade-off among training time, required memory, computational complexity and classification time, in addition to the probability of success. For the considered application, the classification should be performed as fast as possible, in order to guarantee the real-time behaviour of the system; for this reason we have considered a three-layer Multilayer Perceptron with the backpropagation learning rule: this classifier needs a long off-line training step, but it is able to process data quickly [1].

This structure has been chosen because it allows supervised learning, with each input vector having a corresponding known target output. The difference between the network's actual output and the target is computed in order to determine the error. The weights in the network are then updated, according to the backpropagation algorithm, to minimise the output error. Learning is deemed to be finished when the error at the output has reached a visible minimum when plotted against training time (number of presentations of the training set), that is, when the weights converge to a particular solution for the training set. For the choice of the number of layers and of units in the hidden layers, no specific rule exists: it must be made on the basis of acquired experience. In our specific case, the inputs of the network correspond to a particular set of features computed on each detected blob and significant for the particular task, while the outputs correspond to the clusters into which the blobs must be classified: three for the first level (people, vehicles and other) and two for the second level (uniformed personnel and civilians). The particular configuration of the perceptron that we have chosen consists of 20 neurons in the hidden layer (see Fig. 4).

Figure 4: Neural network for classification (n = 20)

The training set has been selected as composed of a large number of patterns representative of the classes, while tests have been performed by using different initializations of the network. The parameters used in the initialization influence the weight updating during training; they are briefly explained in the following.

Learning rate: the weights of the network are updated by means of the following relation:

    w_{jk}^{p+1}(s) = w_{jk}^{p}(s) - η ∂E_p/∂w_{jk}^{p}(s) + α [w_{jk}^{p}(s) - w_{jk}^{p-1}(s)]

where: E_p is the global error performed by the network at step p; w_{jk}^{p}(s) represents the weight value between units j and k at layer s and training step p; η represents the learning rate; α represents the momentum. It is possible to notice that the learning rate plays an important role in the convergence of the algorithm, as shown in the previous equation; in fact it represents the weight updating step: the smaller the learning rate, the slower and usually more precise the training process; nevertheless, using too small a learning rate introduces a risk, because the algorithm may converge to a local minimum.

The momentum parameter has been introduced in order to ease the convergence of the algorithm, while the weight decay influences the speed at which the weights not influential for the training are set to zero. During the training process, the weight initialization represents a critical issue, because a bad initial weight set could make the training process too slow or generate a large error. For this reason, it is better to consider different trials corresponding to different initial weight values; at the end, only the initial set that provides the best results is retained. At the beginning of the training process, the weights assume random values, and the training stops according to the following criterion:

    (1 / (nc · np)) Σ_{p=1}^{np} Σ_{c=1}^{nc} |y_c(p) - t_c(p)| ≤ threshold

where: nc = number of clusters, np = number of patterns, y_c(p) = desired output for pattern p, t_c(p) = output for pattern p obtained with the current weight values. At this point, the last step to be faced is the selection of the feature set used as input to the neural classifier: the features will be presented in the following section.

4.2 The set of features to be used

In order to recognise each observed blob, the Multilayer Perceptron has been trained with Hu moments, which are invariant to rotation, translation and scale changes [2]. Let f_1,...,f_7 be these invariant moments:

    F_1 = m_{2,0} + m_{0,2}

    F_2 = (m_{2,0} - m_{0,2})^2 + 4 m_{1,1}^2

    F_3 = (m_{3,0} - 3 m_{1,2})^2 + (3 m_{2,1} - m_{0,3})^2

    F_4 = (m_{3,0} + m_{1,2})^2 + (m_{2,1} + m_{0,3})^2

    F_5 = (m_{3,0} - 3 m_{1,2})(m_{3,0} + m_{1,2})[(m_{3,0} + m_{1,2})^2 - 3 (m_{2,1} + m_{0,3})^2] + (3 m_{2,1} - m_{0,3})(m_{2,1} + m_{0,3})[3 (m_{3,0} + m_{1,2})^2 - (m_{2,1} + m_{0,3})^2]

    F_6 = (m_{2,0} - m_{0,2})[(m_{3,0} + m_{1,2})^2 - (m_{2,1} + m_{0,3})^2] + 4 m_{1,1}(m_{3,0} + m_{1,2})(m_{2,1} + m_{0,3})

    F_7 = (3 m_{2,1} - m_{0,3})(m_{3,0} + m_{1,2})[(m_{3,0} + m_{1,2})^2 - 3 (m_{2,1} + m_{0,3})^2] - (m_{3,0} - 3 m_{1,2})(m_{2,1} + m_{0,3})[3 (m_{3,0} + m_{1,2})^2 - (m_{2,1} + m_{0,3})^2]

where m_{p,q} = v_{p,q} / (v_{0,0})^b (with b = (p+q)/2 + 1) represents the normalized central moment

    v_{p,q} = Σ_{(x,y)∈B_i} (x - x_0)^p (y - y_0)^q I(x,y)    (p,q = 0,1,2,...)

computed on the area B_i, (x_0,y_0) being the centre of gravity of the blob. A particular comment has to be reserved for the choice of the measure I(x,y) associated with each pixel of the image: it is a sort of luminosity index, the luminosity of a pixel having been considered as a discriminant criterion for the distinction between vehicles and people (e.g., humans have a different reflectivity coefficient with respect to vehicles). In previous works [3,4], which were limited to grey-level image processing, it was straightforward to use the grey level of each pixel as the reference luminosity value, but in our case this is not possible, because of the vectorial nature of the luminosity values in color images.

As scalar luminosity index we have then selected, for the first network, the Y coefficient of the YUV color space [5], which well represents a luminosity index in color images. The case of the second network is different: in order to recognise uniformed personnel within the wider and more general class of people, a-priori knowledge about the particular uniform color (in most cases orange) has been taken into account, and this has led us to consider as scalar luminosity index the H (hue) value of the HLS (Hue-Luminance-Saturation) space [5]. The normalized central moments used in our case are then computed from:

    v_{p,q} = Σ_{(x,y)∈B_i} (x - x_0)^p (y - y_0)^q Y(x,y)    (p,q = 0,1,2,...)

for the first network and from:

    v_{p,q} = Σ_{(x,y)∈B_i} (x - x_0)^p (y - y_0)^q H(x,y)    (p,q = 0,1,2,...)

for the second one.

The pattern for the i-th detected object is then composed as follows: p(x,y) = [f_1, f_2, f_3, f_4, f_5, f_6, f_7], where the functions f_1,...,f_7 are computed on the blob B_i. A set of feature vectors extracted from several models representing different objects taken from different viewpoints is used as the pattern set of the training procedure.

After the object has been recognized, a new Multilayer Perceptron, trained with different dynamic features of the same object taken from different consecutive frames, is employed to generate system alarms. Each pattern is composed of information about the recognized object class and the estimated object speed and position in each sequence frame. Each reference model is characterized by a set of parameters that are specific to the behaviour class to which the object belongs. The main advantage of such a feature-set selection is its independence from 3D knowledge, so it is possible to use the features without the necessity of a different training depending, for example, on a particular zoom degree of the acquiring camera.
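The moment-based feature extraction described in this section can be sketched as follows; the helper names are hypothetical, the Y computation uses the standard YUV luminance coefficients, and only the first two invariants F1 and F2 are shown for brevity:

```python
import numpy as np

def luminance_Y(rgb):
    """Y component of the YUV color space (standard coefficients)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def normalized_moment(blob_mask, I, p, q):
    """m_{p,q}: normalized central moment of the blob, weighted by the
    scalar luminosity index I(x,y), with b = (p+q)/2 + 1 as in the text."""
    ys, xs = np.nonzero(blob_mask)
    w = I[ys, xs]
    v00 = w.sum()
    x0, y0 = (xs * w).sum() / v00, (ys * w).sum() / v00   # weighted centroid
    vpq = ((xs - x0) ** p * (ys - y0) ** q * w).sum()
    return vpq / v00 ** ((p + q) / 2 + 1)

def hu_f1_f2(blob_mask, I):
    """First two invariants of the feature vector [f1..f7]:
    F1 = m20 + m02,  F2 = (m20 - m02)^2 + 4*m11^2."""
    m20 = normalized_moment(blob_mask, I, 2, 0)
    m02 = normalized_moment(blob_mask, I, 0, 2)
    m11 = normalized_moment(blob_mask, I, 1, 1)
    return m20 + m02, (m20 - m02) ** 2 + 4 * m11 ** 2
```

For the second network, the same functions would be applied with the H (hue) channel in place of Y.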

5. Experimental results

Experimental results have been obtained by using a set of different sequences representing a pilot entrance access, under different illumination and traffic conditions, in order to have a significant validation of the proposed classification approach. Let us recall that the surveillance system must be able to detect moving objects, localize and recognise them, and interpret their behaviour in order to prevent possible dangerous situations, e.g. one or more pedestrians moving in an area completely devoted to vehicle traffic. Images have been selected from a database of sequences acquired with different pan, tilt and zoom settings of the same color video camera and, in this paper, particular results about the recognition task are presented. The performance provided by the examined surveillance system, in terms of object classification and discrimination capabilities, has been measured through the percentage of correct object recognition. For example, if within a certain sequence N areas relevant to pedestrians present in the scene have been detected, then the percentage of correct object recognition is computed as:

    Perc = (PR / N) · 100

where PR indicates the number of times in which the person has been correctly detected. The following table presents, for each sequence, the average values of the correct recognition percentage, both for object discrimination and for the discrimination of pedestrians between civilians and municipality personnel:

    Sequence                                  1     2     3     4     5     6     7     8
    Percentage of object recognition         94%   98%   89%   92%   87%   94%   88%   92%
    Percentage of civilian discrimination    84%   86%   88%   81%   83%   84%   87%   85%

6. Conclusions

A surveillance system for detecting dangerous situations at a road entrance access has been presented. The system is based on the use of a Multilayer Perceptron neural network to perform both object classification and scene understanding. The average correct classification rate between people and vehicles is equal to 90%, while the correct recognition percentage for uniformed personnel within the more general class of pedestrians is equal to 85%.

Acknowledgements

The present work has been partially supported by the European Commission under ESPRIT contract no. 28494 AVS-RIO (Advanced Video Surveillance: cable-television-based remote video surveillance system for protected sites monitoring).

References

[1] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, UK, 1996.
[2] M.K. Hu, "Visual pattern recognition by moment invariants", IEEE Trans. on Information Theory, Vol. 8, 1962, pp. 179-187.
[3] G.L. Foresti, "A neural tree based image understanding system for advanced visual surveillance", in Advanced Video Based Surveillance Systems, Kluwer Academic Publishers, 1998, pp. 117-129.
[4] J.E. Hollis, D.J. Brown, I.C. Luckraft and C.R. Gent, "Feature vectors for road vehicle scene classification", Neural Networks, Vol. 9, No. 2, 1996, pp. 337-344.
[5] B. Furht, S.W. Smoliar, H. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, 1995.