4, DECEMBER 2002


Automatic Detection and Indexing of Video-Event Shots for Surveillance Applications
Gian Luca Foresti, Senior Member, IEEE, Lucio Marcenaro, and Carlo S. Regazzoni, Senior Member, IEEE

Abstract—Increased communication capabilities and automatic scene understanding allow human operators to simultaneously monitor multiple environments. Due to the amount of data to be processed in new surveillance systems, the human operator must be helped by automatic processing tools in the work of inspecting video sequences. In this paper, a novel approach allowing layered content-based retrieval of video-event shots referring to potentially interesting situations is presented. Interpretation of events is used for defining new video-event shot detection and indexing criteria. Interesting events refer to potentially dangerous situations: abandoned objects and predefined human events are considered in this paper. Video-event shot detection and indexing capabilities are used for online and offline content-based retrieval of scenes to be detected.

Index Terms—Information retrieval, object detection, surveillance, video signal processing.

I. INTRODUCTION

IN THE LAST decade, automatic surveillance systems have been developed in order to improve the efficiency of the prevention of dangerous situations [1], [2]. User requirements indicate that a surveillance system should be able to alert the attention of a human operator when a dangerous situation occurs in the guarded environment and to help him in retrieving the part of the video sequence related to the event that represents the reason why he has been alerted (called hereafter the video-event shot). The development of content-based retrieval techniques has suggested the idea of integrating video-based surveillance techniques and video-event shot detection and indexing algorithms, allowing one to automate the task of recovering the causes of alarms as well [3], [4]. In the literature, many approaches to video shot detection and content-based retrieval are presented which use low-level feature-based approaches [5]–[17]. Such techniques usually detect video shots on the basis of color or texture features that are compared for consecutive frames in order to detect scene changes. The retrieval is performed on the basis of feature vectors that are extracted from the data (i.e., the key frame of the sequences) and then used as indexes.
Manuscript received April 1, 2001; revised March 7, 2002. The associate editor coordinating the review of this paper and approving it for publication was Dr. H.-J. Zhang. G. L. Foresti is with the Department of Mathematics and Computer Science (DIMI), University of Udine, 33100 Udine, Italy. L. Marcenaro and C. S. Regazzoni are with the Department of Biophysical and Electronic Engineering (DIBE), University of Genoa, 16145 Genoa, Italy. Digital Object Identifier 10.1109/TMM.2002.802024


For the application considered in this work, low-level (i.e., mean gray level, shape of the histogram, etc.) feature-based approaches are not sufficient to detect the changes of interest. Often, the fraction of the image that changes during a video-event shot is too small to create detectable changes in the low-level features. Moreover, the color and intensity of newly introduced objects do not necessarily change histogram-derived features significantly. Layered methods [18]–[20] are more suitable for the integration of video surveillance systems and content-based retrieval techniques. In the proposed system, the human operator is alerted whenever an alarm is detected. The layer is represented by the foreground image and the related descriptive metadata associated with a dangerous object, or by the people generating the event of interest, and the retrieval is performed on the basis of descriptive features associated with the layer. Irani and Anandan [21] present an example of such a layered event representation. Their method transforms video data from a sequential frame-based representation into a single, common scene-based representation to which each frame (layer) can be directly related. This method is based on the mosaic representation of the scene, obtained by composing the information from different views of the scene into a single image. Each frame can be mapped to the mosaic representation by using a three-dimensional (3-D) scene structure. In the application described here, however, only one view of the scene is available. Therefore, instead of the mosaic representation, we use as a reference image a background image representing the guarded environment with no extraneous objects or persons. The concept of a layer here thus refers to a subimage containing the object superimposed on the background image; in this sense, the layer is similar to an MPEG-4 object [32].
In [21], the sequence is segmented into video clips, the beginning and the end of which are determined by drastic changes in the scene, expressed in terms of color and gray-level distribution. Sequences acquired by the proposed surveillance system also contain clips, associated with changes represented by the appearance or disappearance of objects or persons; these kinds of changes usually do not vary the color distribution of the image, but they introduce a relevant variation in the semantic content of the sequence. The detection and indexing of the video-event shot is then performed at the layer level of objects. As the object layer can be further classified depending on the type of video event with which it is associated, a specific feature vector can be used for detection and indexing that characterizes the layer representing the alarm (abandoned objects or people with suspect behavior).

1520-9210/02$17.00 © 2002 IEEE



A first attempt at joining surveillance systems and content-based retrieval techniques is presented in [4]. In [4], the application considered was the detection of both the presence of abandoned objects and the video-event shot containing the person who left them. The results were promising, but the system detected video-event shots consisting of a fixed number of frames for each situation in which an abandoned object was found, without considering whether the alarm-generating object was present in all the frames of the video-event shot. Moreover, the approach described there had limitations related to the analysis of human behavior. A substantial improvement to the system presented in [4] is described in this work, in which the task of tracking moving objects has been introduced as a process running in parallel with abandoned object detection, in order to determine more precisely the video-event shots in which dangerous objects are present. Moreover, this new module also allows the system to perform multifunctional object retrieval, i.e., to retrieve events of different types. In particular, people in the scene exhibiting a predefined kind of behavior can be detected. The proposed system reaches this goal by integrating two modules: abandoned object detection, which has some improvements with respect to the one described in [4], and a new moving object detection and tracking module, which is described in more depth here. Recently, researchers have investigated techniques for describing and recognizing human activities. The main objective of this type of research is to analyze human behavior from an image sequence in order to automatically detect interesting events, e.g., suspicious behaviors. Hogg [22] first addressed the problem of detecting a walking person in image sequences. His approach integrates image processing and geometric reasoning techniques, and it has recently been extended to take into account the nonrigidity of the human shape in motion [23].
Image sequences of human behavior have been studied to detect and identify activities that exhibit regular and cyclic characteristics (e.g., walking) [24]–[27]. Galton [28] generates complex descriptions of human actions on a set of generic basic spatio-temporal propositions to distinguish different human behaviors. Brand et al. [29] propose the use of probabilistic models, i.e., hidden Markov models, to capture the uncertainty of mobile object properties, while Chleq and Thonnat [30] propose a generic framework for the real-time interpretation of real-world scenes with humans. As an example of behavior detection, the problem of intruder detection near high security zones has been explored by Rosin and Ellis [31]. In their work, simple processing of sequences of images was used to extract regions of interest (blobs), and the motion of these blobs was analyzed using a frame-based system. The system is able to differentiate between flocks of birds, small animals, and other organisms in poor quality images. Recently, Wren et al. [33] proposed a real-time system, called Pfinder, for tracking the human body and interpreting its behavior in a controlled indoor environment. The system uses a multi-class statistical model of color and shape to obtain a two-dimensional (2-D) representation of head and hands in a wide range of viewing conditions. Pfinder has been successfully used in a wide range of applications including wireless interfaces, video databases, and low-bandwidth

coding. Haritaoglu et al. [34] propose a real-time visual surveillance system, called W4, for detecting and tracking multiple people and monitoring their activities in an outdoor environment. The system employs a combination of shape analysis and tracking to locate people and their parts (head, hands, feet, torso) and to create models of people’s appearance so that they can be tracked through interactions such as occlusions. It can recognize events between people and objects, such as depositing an object, exchanging bags, or removing an object. McKenna et al. [35] present a system able to track multiple people in relatively unconstrained environments. Tracking is performed at three levels of abstraction: regions, people, and groups. A novel, adaptive background subtraction method that combines color and gradient information is used to cope with shadows and unreliable color cues. People are tracked through mutual occlusions as they form groups and separate from one another. The system is tested using both indoor and outdoor sequences. An interesting survey on the visual analysis of human motion can be found in [41]. Most of the above mentioned studies assume a complete, or at least sufficient, isolation of objects or human bodies; for this reason, they are not suitable for complex surveillance applications, for which the assumption of no occlusions is not realistic. The proposed system is robust to temporary or partial occlusions between moving objects, thanks to a long-memory algorithm that is able to correctly re-assign object identities after occlusion and improves the performance of abandoned object detection; moreover, the proposed system is able to detect, index, and retrieve interesting video-event shots of human activities. This paper is organized as follows. Section II describes the general architecture of the proposed system, composed of several subsystems dedicated to particular tasks.
The subsystems for detecting abandoned objects, for tracking moving objects, and for recognizing anomalous human behaviors are also briefly described. The method used for detecting, indexing, and retrieving interesting video-event shots is described in Section III. In Section IV, results are shown in terms of success rate in detecting abandoned objects, in classifying behaviors, in detecting interesting video-event shot boundaries, and in layer retrieval. Finally, conclusions are reported in Section V.

II. GENERAL SYSTEM ARCHITECTURE

The proposed system aims at performing three tasks: the online detection of abandoned objects, the online classification of events including humans as actors, and the automatic detection and indexing of video-event shots showing the cause of alarms (Fig. 1). As far as abandoned objects are concerned, it is assumed that the cause of an alarm consists of the person who left the object in the scene. While in [4] the number of frames of the retrieved video-event shot was fixed, a new step is introduced here in order to estimate the number of frames containing the person who carried the abandoned object into the scene. The link between an abandoned object and the person who left it is established with the help of a further subsystem whose aim is the detection and tracking of objects moving in the scene.



Fig. 1. General architecture of the proposed system.

Data about detected and tracked objects are also used by the subsystem aimed at detecting human events. As regards the classification of human events, the cause of an alarm is contained in the video-event shot in which persons are found by the system to act in a way close to a description of the event of interest. Therefore, a video-event shot can be defined as a temporal subpart of the considered sequence associated with a set of information that can be structured in layers. The layers correspond to
— the entire video sequence containing the person associated with the event, i.e., the base layer;
— the sequence of the part of the video containing only the person that caused the event, i.e., the video-object layer;
— the metadata describing the video-object layer, i.e., the metadata object layer.
In our system there exists a separate video-object layer for each active process. An object tracking layer (OTL) is associated with the tracking module, and different higher-level video-object layers are associated with different processes of the system. In particular, there exists an abandoned object layer (AOL) corresponding to the abandoned object detection module and a human event layer (HEL) connected to the human event detection module. In this way, the scene that contains the event of interest (abandoned object, anomalous human events, etc.) can be fully described by using extracted information structured in the different layers described above. In this section, we briefly present the modules for detecting abandoned objects, for tracking moving objects, and for recognizing anomalous human events. The module for video-event detection, indexing, and retrieval, which represents the main novelty of the paper, will be described in detail in Section III.

A. Detection of Abandoned Objects

The architecture of the subsystem for detecting abandoned objects is shown in Fig. 2. This system is based on a long-term change detection algorithm [4], [42] that is able to extract pixels

Fig. 2. Architecture of the subsystem for abandoned objects detection.

in the image related to static changes in the scene by differencing the currently acquired image with respect to a reference scene, kept updated through a background updating algorithm. The considered technique is able to absorb slow environmental changes in the scene (i.e., moving cast shadows), while leaving untouched the background regions corresponding to detected objects (static or moving), at least until a static object is classified as an abandoned object and its identity is passed to the abandoned object detection module. It must be noticed that the method is robust to occlusions occurring for a few frames and hiding the abandoned objects from the sensor's point of view. Clearly, when image complexity increases (in terms of moving objects present in the scene), the probability of missed detection gets higher. Pixels detected as potential abandoned objects are filtered and compacted into regions of interest (named blobs) by means of morphological operations [4], [38]. Blobs are then classified by a neural network in order to avoid false alarms when a static change is detected that is not due to a lost object (e.g., persons remaining relatively still for some time, like a seated person). Shape features are used to discriminate among different causes of permanent change [4]. A multilayer perceptron (MLP) has been selected as the neural classifier because of the time constraints on the classification process. The architecture of the MLP has been obtained by trials [4]: a three-layer network with one hidden layer composed of 20 hidden neurons has reached the best results. The classification step considers four possible causes of permanent changes; these are
— abandoned objects;
— persons—a person seated on a chair (e.g., waiting for a train) generates a permanent change in a scene similar to the change generated by an abandoned object;



Fig. 3. Example of event detected by the system for detecting abandoned objects: in the last frame, the area is shown containing the video-object layer related to the detected abandoned object.

— lighting effects—waiting rooms have at least one door and often also windows. The opening of a door or window, and persons passing near the windows, can generate local changes in the lighting of guarded environments that may persist in time. In this case, not performing the classification step would generate a false alarm;
— structural changes—permanent changes in the scene may happen when the positions of the objects in the scene (doors, chairs, etc.) vary.
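As an illustration of the change-detection and blob-compaction steps described above, the following sketch thresholds the difference between the current frame and the reference background and then groups the changed pixels into blobs. Connected-component labeling is used here in place of the paper's morphological operations; all names and the threshold value are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): threshold the frame/background
# difference, then compact changed pixels into blobs (connected components).

THRESH = 30  # gray-level difference threshold (assumed value)

def detect_changes(frame, background, thresh=THRESH):
    """Binary change mask: 1 where |frame - background| > thresh."""
    h, w = len(frame), len(frame[0])
    return [[1 if abs(frame[y][x] - background[y][x]) > thresh else 0
             for x in range(w)] for y in range(h)]

def label_blobs(mask):
    """Group changed pixels into 4-connected blobs (lists of (x, y))."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                stack, blob = [(x, y)], []
                seen[y][x] = True
                while stack:
                    cx, cy = stack.pop()
                    blob.append((cx, cy))
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                                   (cx, cy + 1), (cx, cy - 1)):
                        if (0 <= nx < w and 0 <= ny < h
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((nx, ny))
                blobs.append(blob)
    return blobs
```

In the real system, the resulting blobs would then be passed to the MLP shape classifier described above.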

By means of the above reported operations, the system is able to detect isolated events, as shown in the temporal diagrams of Fig. 3. Each event detected at a generic instant is labeled by a feature vector containing the features used for the object classification [4].

B. Tracking Module

The main novelty of the indexing and retrieval step with respect to [4] is the capability of indexing the video-object layer as well, and not only the base sequence layer. This is achieved thanks to a more efficient integration with the tracking algorithm. Such integration also allows one to adaptively estimate the temporal boundaries of a video-event shot in a more precise way, with respect to the fixed shot extension introduced in [4]. The tracking subsystem shown in Fig. 4 aims at detecting and tracking static and moving objects that are present in the scene. Objects are detected in the scene by performing a difference between the currently acquired image and a continuously updated reference frame. A background updating algorithm [45] updates the reference frame with two different speeds for image areas corresponding to detected objects (no update) and background zones (fast update, for dealing with illumination variations in the scene). As soon as a still object is recognized as abandoned, the corresponding image portion is marked as background and updated: in this way, the abandoned object is absorbed into the reference image.
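The two-speed background update just described can be sketched as follows. The blending rate ALPHA and all function names are assumptions; the actual algorithm is the one in [45] and may differ in detail.

```python
# Illustrative sketch of a two-speed background update: background pixels
# are blended quickly (to absorb illumination changes), pixels covered by
# detected objects are frozen, and a blob recognized as abandoned is
# released so it is absorbed into the reference image.

ALPHA = 0.5  # fast update rate for background zones (assumed value)

def update_background(background, frame, object_mask, alpha=ALPHA):
    """Blend frame into background except where object_mask is set."""
    h, w = len(frame), len(frame[0])
    return [[background[y][x] if object_mask[y][x]
             else (1 - alpha) * background[y][x] + alpha * frame[y][x]
             for x in range(w)] for y in range(h)]

def absorb_abandoned(object_mask, blob_pixels):
    """Mark an abandoned blob's pixels as background so it gets absorbed."""
    for (x, y) in blob_pixels:
        object_mask[y][x] = 0
    return object_mask
```

After `absorb_abandoned` is called for a blob classified as abandoned, subsequent `update_background` calls progressively blend the object into the reference frame, mirroring the behavior described above.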

Fig. 4. Architecture of the system for tracking objects and temporal graph describing the OTL.

This system is based on a long-memory matching algorithm [42], which uses information from blobs extracted by change detection and focus-of-attention methods [39]. The long-memory matching algorithm is able to track objects related to changes in the scene with respect to a reference situation. The term “long-memory” refers to the fact that the algorithm is able to recover blob identity after the merging of multiple objects into aggregated blobs and their successive splitting (Fig. 5). Identity estimation is performed by means of a matching measure between the histograms of the involved objects based on the Bhattacharyya coefficient [49]. Features used for correspondence matching are color-based features (mean and variance of the color of the regions composing the blob), geometric features (area and perimeter of the blob), and the 3-D position of the blob [16].
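The Bhattacharyya-based histogram match mentioned above can be sketched minimally as below. The function names are illustrative; only the coefficient itself (the sum over bins of the square root of the product of the two normalized histograms) and the common distance form derived from it are standard.

```python
# Illustrative sketch: Bhattacharyya coefficient between two normalized
# histograms, and a derived distance, as used for blob identity matching.
import math

def bhattacharyya_coefficient(p, q):
    """Sum of sqrt(p_i * q_i) over the bins; 1.0 for identical histograms."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def bhattacharyya_distance(p, q):
    """Common distance form sqrt(1 - BC): 0 for identical histograms."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))
```

Two blobs would be declared the same object when this distance (combined with the geometric and 3-D position features) is small enough.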



Fig. 5. (a) Objects detected before the merge event; (b) the “merge” phase; (c) after the split, the identities of the objects are preserved; (d) associated temporal graph.

Fig. 6. Examples of different kinds of nodes labeled by the tracking module.

By means of the previously reported operations, the system is able to detect events continuously in time, as shown in the temporal diagrams of Fig. 5, in parallel with the detection of abandoned objects. Each event, corresponding to a tracked blob detected at a generic instant, is labeled by a feature vector containing the features used for the object tracking (i.e., metadata). Moreover, the events are linked in time and represented by means of a temporal graph in which each node represents a tracked object and the features of the related blob. Four types of nodes are considered (Fig. 6):
— new nodes—this label is given to objects detected by the system as new objects in the scene;
— overlapped nodes—this label is given to objects that are not new in the scene; usually, the minimum rectangles bounding these blobs are partially overlapped in two consecutive frames;
— split nodes—this label is assigned when a node is split into two or more nodes; this may occur, for example, when two persons walking together start to move in different directions;
— merge nodes—this label is assigned when two or more objects merge into one.
The characteristics of the temporal graph (links between nodes and labels of the nodes) are the basis for the video-event shot detection and layered indexing, as shown in Section III.

C. Human Action Recognition

The human action recognition (HAR) subsystem takes as input features extracted from each image of a temporal sequence and performs real-time scene interpretation [41]. The interpretation process is performed with respect to a finite library of scenario models representing both normal and unusual events produced by human actions. A scenario model is defined as a set of event models and/or temporal constraints [42].
The aim of the HAR subsystem is to automatically detect patterns of human actions that correspond to predefined model events, e.g., vandalism actions, entrance in prohibited areas, etc., in order to generate alarms or require further data acquisition.

In order to detect patterns of human actions generated by an interesting event in the scene, it is necessary to define the bounds of normal actions, which are obviously different from application to application. It is therefore necessary to calibrate the surveillance system to monitor a particular scene and to learn the characteristics of normal and unusual human actions. For example, people remaining stationary, or almost stationary, for a long period in front of a bank entrance may be considered a suspicious event. People moving in the area of a railway level-crossing or of a metro-line station with an abnormal trajectory (e.g., crossing the tracks) can be considered as unusual human actions. The HAR subsystem (Fig. 7) is composed of two main modules, the human recognizer and the event detector. It uses information from blobs extracted by the detection and tracking subsystems over a long image sequence. In particular, it analyzes, in a hierarchical way, the temporal graph which contains the events corresponding to tracked blobs and the related features. The minimum rectangle bounding the blob is divided into four parts, called quartils [44], according to the position of its center of mass. The four distances between the center of mass and the centers of mass of the four quartils are computed. These distances give a measure of the blob shape. In order to increase the robustness of such a representation, the angles formed by the distance vectors with the horizontal axis of the 2-D reference system centered in the center of mass have been considered. The pattern related to the detected blob is thus composed of the following eight values, i.e., the four distances d1, ..., d4 and the four angles a1, ..., a4:

(d1, d2, d3, d4, a1, a2, a3, a4). (1)

A neural tree (NT) [43] is applied to recognize blobs which represent humans moving in the scene. In order to eliminate the dependency of the obtained patterns on scale variations and on rotation around the vertical axis, a large set of 2-D object shapes, obtained by observing 3-D object models from different viewpoints, have been taken into account [43].
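Under one plausible reading of the quartil representation described above, the eight-value pattern could be computed as follows. The exact formulation is in [44], so everything here (partitioning rule, function names) is an illustrative assumption.

```python
# Illustrative sketch: split a blob into four parts ("quartils") around its
# center of mass, then collect the four distances from the center of mass
# to each quartil's own center of mass and the four angles of those vectors.
import math

def quartil_pattern(pixels):
    """pixels: list of (x, y) blob coordinates -> (d1..d4, a1..a4)."""
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    quartils = [[], [], [], []]
    for (x, y) in pixels:
        idx = (0 if x <= cx else 1) + (0 if y <= cy else 2)
        quartils[idx].append((x, y))
    dists, angles = [], []
    for q in quartils:
        if not q:  # degenerate blob: empty quartil
            dists.append(0.0)
            angles.append(0.0)
            continue
        qx = sum(x for x, _ in q) / len(q)
        qy = sum(y for _, y in q) / len(q)
        dists.append(math.hypot(qx - cx, qy - cy))
        angles.append(math.atan2(qy - cy, qx - cx))
    return tuple(dists + angles)
```

For a symmetric blob the four distances coincide, which matches the intuition that the pattern measures how mass is distributed around the center.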
In particular, for each object model, eight different representative views (taken at camera viewpoints separated by 45°) have been considered to learn the neural tree. Different 2-D human shapes have been used to learn the neural tree. These blobs have been obtained both from 3-D models and from real scenes. The proposed approach has been focused on scenes containing a single moving human or multiple humans whose motion produces a limited number of occlusions over the whole sequence. An interesting work on tracking groups of people can be found in [35]. The output of the NT is represented by two classes: the class human and the class nonhuman. At the end of the classification phase, the obtained classification is added to the corresponding node of the temporal graph. The event detector module considers a set of consecutive graph nodes, i.e., a brief history of the object motion, 1) to give a more robust classification of the detected object and 2) to individuate interesting events. In particular, a winner-takes-all scheme is applied to find the best object classification on the considered frames as the majority of the classification results



Fig. 7. General architecture of the HAR module.

over a temporal window. If the detected object has been classified as a human, information about its location and about its trajectory on the ground plane is used to identify interesting events. Let v(t) be the human speed and let d(t) be the human direction at the t-th frame, measured with respect to the center of mass (x(t), y(t)) of the blob, i.e.,

v(t) = {[x(t) − x(t−1)]² + [y(t) − y(t−1)]²}^(1/2) (2a)

d(t) = arctan{[y(t) − y(t−1)] / [x(t) − x(t−1)]} (2b)

Let F(t) be the feature vector associated with the recognized human blob at the time instant t. It contains the position of the center of mass of the blob (in the coordinates of the image plane), its speed v, and its direction d, measured at each of the last N frames.


Changes in human speed and direction can easily be used as models of interesting events. Moreover, available a-priori knowledge about the environment (presence of static objects, forbidden areas, etc.) can be related to the localization of the human and used to define specific models of interesting events, e.g., entrance into prohibited zones, anomalous trajectories, etc. To this aim, a set of scenario models, comprising both normal and unusual events, is created. An event is characterized by a set of human features observed over a sequence of frames. For example, let us consider the scenario of a metro station where the event “a human overpasses the yellow line which limits the track area” is considered a dangerous event. This event can be represented by a given set of feature vectors, obtained by varying the human position, speed, and trajectory. The whole set of scenario models is stored into an event database and used to learn a binary neural tree. Each node of the NT is characterized by its input, i.e., the components of the feature vector, and its output, i.e., normal and unusual human actions.
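The trajectory features of (2a) and (2b) and the resulting feature vector can be sketched as below, assuming (as reconstructed above) that speed and direction are derived from the displacement of the center of mass between consecutive frames; all names and the window length are illustrative.

```python
# Illustrative sketch: per-frame speed and direction of a tracked person's
# center of mass, collected over the last N frames into the feature vector
# consumed by the event detector.
import math

def speed_and_direction(prev, curr):
    """prev, curr: (x, y) centers of mass in consecutive frames."""
    dx, dy = curr[0] - prev[0], curr[1] - prev[1]
    v = math.hypot(dx, dy)      # (2a) displacement per frame
    theta = math.atan2(dy, dx)  # (2b) direction of motion
    return v, theta

def feature_vector(trajectory, n_frames=4):
    """Position, speed, and direction over the last n_frames samples."""
    feats = []
    pts = trajectory[-(n_frames + 1):]
    for prev, curr in zip(pts, pts[1:]):
        v, theta = speed_and_direction(prev, curr)
        feats.append((curr[0], curr[1], v, theta))
    return feats
```

Such vectors, gathered for normal and unusual scenarios, would form the event database used to train the binary neural tree.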



Note that scene complexity can affect feature vectors in several ways. First, the number of frames in which a pedestrian appears as isolated decreases when the number of moving objects in the scene increases. Second, when the pedestrian reappears as isolated after merging with other objects, the probability of correct recognition decreases with increasing complexity as well. Moreover, as the trajectory during merging is interpolated between the positions before and after merging, complexity in the scene may cause trajectory features to deteriorate. However, in many medium-complexity scenes the proposed approach works well. For either more complex scenes or more accurate trajectory estimation, some very recent, more complex multi-object trackers, such as those proposed in PETS [46], [47], could be used.

III. VIDEO-EVENT SHOT DETECTION AND INDEXING

The proposed method for video-event shot detection and indexing is obtained by integrating at the metadata level the three subsystems mentioned in the previous sections, as well as by addressing video-object layers represented by blobs associated with data. Three different kinds of video-object layers are considered:
— a base layer corresponding to the entire video sequence;
— a video-object layer corresponding to each detected object in the scene; this layer can be divided into three sublayers as described in the previous sections (i.e., OTL, AOL, and HEL);
— a metadata object layer associated with features extracted from detected objects in the scene.

A. Abandoned Object Detection

We first analyze the procedure for the boundary detection of the video-event shot related to abandoned objects, which is based here on the integration between the output of the subsystem for detecting abandoned objects and that of the subsystem for detecting and tracking moving objects.
These two subsystems provide two typologies of metadata: static events (abandoned objects), with blobs associated with the abandoned object layer detected at a generic instant, and dynamic events (foreground objects), with the related blobs in the object tracking layer. These metadata are integrated in order to perform content-based video-event shot detection and indexing. More precisely, in order to detect video-event shot temporal boundaries and indexes, the system performs the following steps:
— abandoned object event detection and video-event shot final frame identification;
— computation of the video-event shot initial frame and layered indexing of the detected video-event shot.
1) Video-Event Shot Final Frame Identification: The first operation is based on blob label matching, and it aims at detecting a correspondence between an object detected by the abandoned object detection module and a static object detected in the object tracking layer. The graph analysis starts by locating, in the tracking graph, the blob corresponding to the event. As a latency of some seconds occurs between the abandoned object's appearance and its


Fig. 8. Example of abandoned object detection: (a) temporal graph associated with OTL and AOL; (b) video-object layer associated with the detected event: OTL and AOL data are superimposed on the base layer.

detection, a backward search is necessary in the tracking graph in order to detect the corresponding blob and the frame that corresponds to the abandonment. The search is driven by the time instant of the detected event: a blob is searched backward in the tracking graph for approximately a number of frames equal to the product of the latency of the abandoned object detection subsystem and the frame rate (fps) of the tracking module. It can be supposed that a split between the unattended object and the abandoning person happened approximately the latency time before the alarm generation: the frame corresponding to this split is the last frame of the video-event shot (in Fig. 8, the frame of the split event). In order to be sure not to miss the abandoning event, each blob detected in a spatial window centered in the position where the abandoned object has been detected is checked in the object tracking graph. Let I(x, y) be the intensity of the generic color component of the pixel at the position (x, y) of the image lattice. The normalized histogram of a generic blob corresponding to the event is a function defined as the number of pixels of each color belonging to the blob, normalized with respect to the number of pixels (area) of the blob itself. For computational reasons, the histogram of a blob has been derived by considering only 32 intensity levels for each color component. This simplification does not affect the performance of the blob matching operation if the number of considered colors is not too limited [40]. The matching event is the event, within the considered spatio-temporal search window in the object tracking layer, whose histogram is most similar to the



A segmentation-driven validation step is performed in order to find the abandoned object inside each video-event object. To this end, the normalized histogram of the interesting object is searched inside the detected video-object by using the above described color-based similarity measure. In particular, the following steps are performed:
Fig. 9. Example of video-objects in the (a) object tracking layer and (b) abandoned object layer.

— segmentation of the foreground moving blob related to the dangerous object;
— matching between the abandoned object histogram and the histogram of the regions in the segmented blob;
— selection of the region that best fits the abandoned object histogram.
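The three validation steps above can be sketched as follows. The 32-level color quantization, the mean-square-error distance, and the minimum-distance selection follow the text; the function and variable names are our own (a sketch, not the authors' implementation):

```python
import numpy as np

LEVELS = 32  # intensity levels kept per color component, as in the text

def normalized_histogram(pixels):
    """Normalized color histogram of a blob or region.

    pixels: (N, 3) sequence of RGB values in [0, 255]; each bin is the
    fraction of pixels whose quantized color falls into that bin.
    """
    q = (np.asarray(pixels) * LEVELS // 256).astype(int)
    hist = np.zeros((LEVELS, LEVELS, LEVELS))
    for r, g, b in q:
        hist[r, g, b] += 1
    return hist / len(pixels)

def histogram_distance(h1, h2):
    """Mean square error between two normalized histograms."""
    return float(np.mean((np.asarray(h1) - np.asarray(h2)) ** 2))

def best_matching_region(object_hist, region_pixel_sets):
    """Index of the segmented region whose normalized histogram is
    closest (minimum MSE) to the abandoned-object histogram."""
    dists = [histogram_distance(object_hist, normalized_histogram(p))
             for p in region_pixel_sets]
    return int(np.argmin(dists))
```

The same histogram and distance are reused for whole-blob matching in the human-event case.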

Fig. 10. Example of backward search in the temporal graph associated with the OTL in order to find video-event shot boundaries.

one of the detected abandoned objects in the AOL. The distance between two histograms is the mean square error between them:

A video object can be segmented in different regions [38], and one can suppose that, if the considered blob contains the abandoned object, one of the regions resulting from the segmentation step corresponds to the dangerous object. The matching operation, based on the normalized histogram as in the video-event shot boundary detection, allows one to isolate the interesting object from the other regions. The probabilistic approach for selecting regions is made possible by the definition of the normalized color histogram introduced in the event matching. Each value assumed by the histogram function ranges from 0 to 1 and can be seen as the probability that a pixel with a given color belongs to the object represented by the histogram: if this probability is higher than a threshold, then the pixel is considered. More formally, by considering a region R, a pixel at position (x, y) is extracted if

h(c(x, y)) > T    (5)

where T indicates the threshold and c(x, y) represents the color of the pixel. The pixels extracted with the thresholding operation are then compacted by means of the focus-of-attention procedure used during the blob detection task and, finally, the interesting object is retrieved by considering the minimum distance between its normalized histogram and the normalized histograms of the remaining regions.
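The thresholding test of Eq. (5) can be illustrated with a short sketch; the quantization constant and the threshold value used in the example are assumptions of ours, not settings from the paper:

```python
import numpy as np

LEVELS = 32  # color quantization used for the normalized histograms

def extract_candidate_pixels(region_pixels, object_hist, threshold):
    """Keep the pixels of a region whose quantized color has, in the
    abandoned object's normalized histogram, a value above `threshold`
    (the test of Eq. (5)). The kept pixels would then be compacted by
    the focus-of-attention step described in the text.

    region_pixels: (N, 3) RGB values in [0, 255].
    object_hist: (32, 32, 32) normalized histogram of the object.
    """
    q = (np.asarray(region_pixels) * LEVELS // 256).astype(int)
    keep = np.array([object_hist[r, g, b] > threshold for r, g, b in q])
    return np.asarray(region_pixels)[keep]
```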

d(h_i, h_j) = (1/N_c) Σ_c [h_i(c) − h_j(c)]²    (4)

where N_c is the number of histogram bins. In this way, the final time instant t_e of the video-event shot can be retrieved. A representation of the event matching is shown in Figs. 8 and 9, both for what concerns the object tracking and the abandoned object layers.
2) Video-event shot initial frame identification and layered indexing: In order to estimate the instant when the abandoned object is first present, a search is performed in the temporal graph extracted by the object tracker. By using blob identity information, a foreground object in the OTL can be associated with a considered abandoned object event through a backward inspection of the OTL graph: for simplicity, the backward search is performed in the OTL among the blobs associated with the one where the event is detected, until a single-connected path (i.e., a set of nodes of successive time instants linked in the OTL graph) can be identified. For example, when a merge event previous to the detection is met, the path is interrupted (Fig. 10). In this way, the number of frames of the retrieved video-event shot is given by the maximum length of the single-connected path backward searched in the tracking graph. A video-event sequence is also produced, corresponding to a specific video-object layer in the OTL, as well as to the metadata associated with it.
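The backward inspection of the OTL graph amounts to a walk that stops at the first merge node. A minimal sketch, with a predecessor-map representation of the graph that is our own assumption:

```python
def backward_single_connected_path(predecessors, start_node):
    """Walk the tracking graph backward from the node where the event is
    detected, following nodes while each has exactly one predecessor;
    stop at a merge (more than one predecessor) or at the blob's first
    appearance. The length of the returned path gives the number of
    frames of the retrieved video-event shot.

    predecessors: dict mapping node -> list of predecessor nodes.
    Returns the path ordered from the earliest node to start_node.
    """
    path = [start_node]
    node = start_node
    while len(predecessors.get(node, [])) == 1:
        node = predecessors[node][0]
        path.append(node)
    path.reverse()
    return path
```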

Human Event Sequence Detection: A similar procedure is used to detect the boundaries of the video-event shots in which people assume suspicious behaviors in the guarded environment, by relating the events in the HEL with those in the OTL. The system performs a data integration between the subsystem for classifying human behavior and the subsystem for tracking moving objects: a dangerous behavior is an event linked to the corresponding dynamic event by means of the blob-label matching operation, according to the same strategy used for linking an abandoned object to the temporal graph. In this case, the last frame of the video shot is the one corresponding to the instant in which the suspicious behavior is detected (i.e., no latency is introduced); the first frame is determined by means of graph analysis, by searching the first frame in which the blob appeared in the scene, as performed for video-event shots related to abandoned objects. Single-connection is used also in this case to fix the path.
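Under the same single-connection rule, the boundaries of a human-event shot can be sketched as below. We assume, purely for illustration, that the OTL stores one node per blob per frame; the names and data structures are hypothetical:

```python
def human_event_shot(event_frame, event_label, otl_nodes, predecessors):
    """Boundaries of the video-event shot for a suspicious behavior:
    the last frame is the detection instant (no latency), and the first
    frame is found by walking the blob's single-connected path backward
    in the OTL graph until a merge or the blob's first appearance.

    otl_nodes: dict mapping (frame, blob_label) -> node id.
    predecessors: dict mapping node id -> list of predecessor node ids.
    Returns (first_frame, last_frame).
    """
    node = otl_nodes[(event_frame, event_label)]
    first_frame = event_frame
    while len(predecessors.get(node, [])) == 1:
        node = predecessors[node][0]
        first_frame -= 1
    return first_frame, event_frame
```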



Fig. 11. Real image sequence containing a group of persons inside a metro station. (a) Temporal graph associated with the tracker; (b) image sequence with the video-object layers; (c) the OTL and (d) the HEL.

Fig. 12. (a) Temporal graph. Image sequence containing three people moving in (b) an outdoor parking area, (c) associated OTL, and (d) HEL.

Also for human-event shots, indexing is introduced by using the normalized histogram of the person causing the alarm. Layer retrieval is performed as in the case of abandoned objects, with the difference that the entire extracted blobs, instead of the regions composing a blob, are considered in the matching operations: the retrieved person corresponds to the blob presenting the minimum-distance histogram with respect to the blobs of the image. Figs. 11 and 12 show, respectively, the human-event shot indexing process for two different kinds of human events, i.e., yellow line crossing and suspicious trajectory detection.

IV. EXPERIMENTAL RESULTS

The performances of the proposed system are presented in order to evaluate the robustness in both the event detection and layer retrieval tasks. This section is organized as follows: first, detection results are presented by means of values that indicate the success rate in classifying abandoned objects and human actions; then, video-event shot detection and retrieval results are presented.

A. Abandoned Object Detection

The system for detecting abandoned objects has been tested in three different environments: an “artificial” environment (the laboratory) and the waiting rooms in the railway stations of Genova-Rivarolo and Genova-Borzoli, Italy. In the tests, each environment corresponds to an acquired video sequence. The sequences were acquired at a frame rate of 1 frame/s and the size of each frame is 256 × 256 pixels. For each test, we

considered about 2800–3000 events equally distributed over the four classes (abandoned objects, persons, lighting effects, and structural changes) considered by the system (about 700–750 abandoned objects, 700–750 humans, etc.). Performances of the system are measured in terms of the probability of false alarm and the probability of misdetection for classifying detected objects. A false alarm occurs whenever a change not related to an abandoned object is classified as an abandoned object. A misdetection happens whenever an abandoned object is classified as if it were not. On the basis of these definitions, the performance rates of the system in the different environments were as follows:
— laboratory: …% and …%;
— Genova-Rivarolo railway station: …% and …%;
— Genova-Borzoli railway station: …% and …%.
It is possible to notice that the performances are satisfactory also in the real cases.

B. Human Action Recognition

A real metro station and an outdoor parking area have been chosen to test the performances of the HAR system. Fig. 11 shows an image sequence containing a group of people inside a metro station. The considered video sequence, which was acquired in a Belgian underground [39], is characterized by the following parameters: frame rate of 1 frame/s, image size of 256 × 256 pixels, and sequence length of 500 frames. Two events have been considered as unusual human actions: (a) one



Fig. 13. Blob images extracted from a sequence acquired in a metro station with strong shadows (gray-level pixels indicate shadow points).

TABLE I
OBJECT CLASSIFICATION (P = ONE OR MORE ISOLATED PERSONS, GP = GROUP OF PERSONS) AND DANGEROUS EVENT DETECTION (YL = PERSON OR GROUP OF PERSONS BEYOND THE YELLOW LINE) IN THE IMAGE SEQUENCE IN Fig. 14


Fig. 14. (a) Map of the parking area with some examples of unusual human actions considered for learning the NT and (b) estimated trajectories of the detected humans in the sequence of Fig. 12.

or more persons overpass the yellow line and (b) one or more persons cross the tracks. If the scene is illuminated with artificial lights, each object projects long shadows on the ground floor. Fig. 13 shows a sequence acquired in a metro station and the extracted blobs. The presence of shadows (symmetric with respect to the intersection of the blob with the ground plane) modifies the blob's shape and increases the complexity of the classification process. A shadow detection procedure [48] has been applied to eliminate the shadows from the detected blobs. In particular, the gray pixels in Fig. 13 represent detected shadows. The neural tree used for pedestrian recognition is composed of 74 internal perceptron nodes and 54 leaf nodes (distributed on … levels), while the neural tree applied for suspicious event detection is composed of 36 internal perceptron nodes and 21 leaf nodes (distributed on … levels). Table I shows the object classification obtained for each frame of the sequence, and the event classification obtained over a set of sequence frames. It is worth noting that the HAR system is able to correctly classify the persons present in the scene as one or more

isolated persons (in all frames) and/or a group of persons (in frames from 1 to 9). Moreover, the HAR system correctly recognizes the presence of the unusual events in the input sequence. Fig. 12 shows an image sequence containing a group of people moving in an outdoor parking area. This video sequence, which was acquired in a parking area of the University of Genoa, is characterized by the following parameters: frame rate of 1 frame/s, image size of 256 × 256 pixels, and sequence length of … frames. Scenarios containing persons moving around different cars with an irregular trajectory have been considered as unusual human actions. Fig. 14(a) shows some examples of unusual trajectories, drawn by hand by a human operator and used for the training of the neural tree; Fig. 14(b) shows the trajectories computed by the tracking subsystem for the sequence shown in Fig. 12. The neural tree applied for unusual human action recognition is composed of 112 internal perceptron nodes and 68 leaf nodes (distributed on … levels). The HAR system is able to detect, track, and correctly classify the three persons present in the scene, and to correctly recognize the presence of two unusual events in the input sequence.

C. Video-Event Shot Detection and Layer Retrieval

The video-event shot detection performances depend on the performances of the subsystems for detecting abandoned objects, for detecting and tracking moving objects, and for classifying human behavior. In this section, we describe the performances of video-event shot detection, by assuming that an alarm really corresponds to a dangerous situation (abandoned object or unusual behavior). In Fig. 15(a), an example of an alarm situation due to the presence of an abandoned object is shown. The related video-event shot was automatically detected by the proposed system.



Fig. 15. Example of alarms and related video-event shots detected in the case of (a) abandoned object and (b) predefined human event.

In Fig. 15(b), one can see the alarm generated by the suspicious behavior of persons who passed the yellow line, and the related video-event shot automatically detected by the proposed system. The performances of the video-event shot detector are measured by counting how many times the system is able to store the whole sequence containing a particular dangerous situation. Errors concern the precise detection of the boundaries of the video-event shot. In detecting the beginning of video-event shots, possible errors are due to temporary occlusions of the object or person that caused the alarm. The system for tracking moving objects [39] has a mechanism to recover an object if it disappears for a few consecutive frames, so that the video-event shot detection procedure is more robust to partial occlusions of the object (layer) representing the video content to be extracted. The detection of the end of video-event shots may be critical only for what concerns an abandoned object, because a split in the temporal graph may occur with a little delay with respect to the instant in which the object is left, as can be seen in Fig. 8. Nevertheless, this error is not dramatic for the operator who must reconstruct an alarming situation. It is interesting to measure how the performances of the system vary with the complexity of the scene, where the complexity is identified with the number of persons moving in the guarded environment. In fact, the probability of temporary occlusion of objects increases when more persons are moving in the scene. For laboratory tests, we define three levels of complexity:
— low complexity: two persons, at maximum, in the scene, corresponding to a maximum density of 0.12 person/m²;
— medium complexity: four persons, at maximum, in the scene, corresponding to a maximum density of 0.24 person/m²;
— high complexity: more than four persons (i.e., more than 0.24 person/m²).
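The three complexity levels above reduce to a simple density threshold; a sketch using the densities quoted in the text (the function name is ours):

```python
def scene_complexity(person_density):
    """Map a person density (persons per square meter) to the complexity
    level used in the laboratory tests: up to 0.12 person/m^2 is low,
    up to 0.24 person/m^2 is medium, and anything above that is high."""
    if person_density <= 0.12:
        return "low"
    if person_density <= 0.24:
        return "medium"
    return "high"
```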

Fig. 16. Performances obtained by the proposed method for video-event shot boundary-detection versus the complexity of the guarded environment.

Fig. 17. Performances obtained by the proposed retrieval procedure.

These definitions, introduced during experiments in the laboratory, have been considered also in real transport environments. Numerical results are shown in Fig. 16, which reports the mean success probability in detecting video-event shot boundaries for the different levels of scene complexity. It is possible to notice that, although the performances decay as the complexity increases, good results are obtained also with a medium level of scene complexity. Moreover, performances are measured for the layer retrieval step. In this case, after having detected the interesting video-event shot, the system retrieves the layer (dangerous object and person who left it, or person with suspicious behavior) that caused the alarm. Evaluations are performed by considering the probabilities of success, false alarm, and misdetection in retrieving layers. Two different kinds of success are considered. The system reaches full success when it is able to retrieve the searched object for at least 80% of its area in the image plane. When the system retrieves only from 20% to 80% of the pixels of the searched object, a partial success has been reached. A misdetection occurs when the system does not retrieve the object or it retrieves less than 20% of the pixels



of the object. Finally, the system provides the operator with a false alarm when regions different from the searched object are retrieved. Numerical results are shown in Fig. 17, in which it is possible to notice that the success rate is quite high (71%), and that the probability of full success (65%) is considerably higher than that of partial success (6%). This consideration proves the efficiency of the proposed retrieval procedure.

V. CONCLUSION

In this paper, a novel approach for video shot detection and layered indexing has been presented, with applications to video-based surveillance systems. The considered video-based surveillance system aims at supporting a human operator in guarding indoor environments, such as waiting rooms of railway stations or metro stations, by providing him with an alarm signal whenever a dangerous situation is detected. By means of the method introduced in this paper, the human operator also has the possibility of retrieving the video-event shot in which the causes of alarms (persons leaving abandoned objects or people with a suspicious behavior) are shown. The system considered in this paper presents a successful event detection rate of about 97%. Within this set of correctly detected events, the proposed method for semantic video shot boundary detection works correctly at a rate of 95%, 75%, and 33% in the case of low, medium, and high complexity of the scene, respectively. The proposed procedure for content-based retrieval extracts the correct layers in 71% of the cases in the performed experiments. From the numerical results, it is concluded that the proposed system provides human operators with a powerful instrument for reconstructing dangerous situations that can happen in transport environments.

ACKNOWLEDGMENT

The authors thank the anonymous reviewers for their useful suggestions and F. Oberti for valuable discussions and technical support.

REFERENCES
[1] C. S. Regazzoni, G. Vernazza, and G. Fabri, Advanced Video-Based Surveillance Systems. Dordrecht, The Netherlands: Kluwer, 1998.
[2] G. L. Foresti, P. Mahonen, and C. S. Regazzoni, Multimedia Video-Based Surveillance Systems: Requirements, Issues and Solutions. Dordrecht, The Netherlands: Kluwer, 2000.
[3] E. Stringa and C. S. Regazzoni, “Content-based retrieval and real time detection from video sequences acquired by surveillance systems,” Proc. IEEE Int. Conf. Image Processing, vol. 3, pp. 138–142, Oct. 4–7, 1998.
[4] ——, “Real-time video-shot detection for scene surveillance applications,” IEEE Trans. Image Processing, vol. 9, pp. 69–79, Jan. 2000.
[5] T. Rachidi, S. A. Maelainin, and A. Bensaid, “Scene change detection using adaptive skip factors,” Proc. IEEE Int. Conf. Image Processing (ICIP97), Oct. 26–29, 1997.
[6] H. H. Yu and W. Wolf, “A hierarchical multiresolution video shot transition detection scheme,” Comput. Vis. Image Understand., vol. 75, no. 1–2, pp. 196–213, 1999.
[7] N. V. Patel and I. K. Sethi, “Video shot detection and characterization for video databases,” Pattern Recognit., vol. 30, no. 4, pp. 583–592, 1997.
[8] A. D. Doulamis, N. D. Doulamis, and S. D. Kollias, “A fuzzy video content representation for video summarization and content-based retrieval,” Signal Process., vol. 80, no. 6, pp. 1049–1067, 2000.
[9] H. J. Zhang, J. H. Wu, D. Zhong, and S. W. Smoliar, “An integrated system for content-based video retrieval and browsing,” Pattern Recognit., vol. 30, no. 4, pp. 643–658, 1997.

[10] J. D. Courtney, “Automatic video indexing via object motion analysis,” Pattern Recognit., vol. 30, no. 4, pp. 607–625, 1997.
[11] M. S. Drew, J. Wei, and Z. N. Li, “Illumination-invariant image retrieval and video segmentation,” Pattern Recognit., vol. 32, no. 8, pp. 1369–1388, 1999.
[12] S. Dagtas, W. Al-Khatib, A. Ghafoor, and R. L. Kashyap, “Models for motion-based video indexing and retrieval,” IEEE Trans. Image Processing, vol. 9, pp. 88–101, Jan. 2000.
[13] V. Roth, “Content-based retrieval from digital video,” Image Vis. Comput., vol. 17, no. 7, pp. 531–540, 1999.
[14] W. G. Chen, G. B. Giannakis, and N. Nandhakumar, “A harmonic retrieval framework for discontinuous motion estimation,” IEEE Trans. Image Processing, vol. 7, pp. 1242–1257, Sept. 1998.
[15] T. Rachidi, S. A. Maelainin, and A. Bensaid, “Scene change detection using adaptive skip factors,” Proc. IEEE Int. Conf. Image Processing (ICIP97), Oct. 26–29, 1997.
[16] B. Furht, S. W. Smoliar, and H. Zhang, Video and Image Processing in Multimedia Systems. Dordrecht, The Netherlands: Kluwer, 1995.
[17] H. Hampapur, R. Jain, and T. E. Weymouth, “Indexing in video databases,” in Proc. SPIE, Storage and Retrieval for Image and Video Databases II, vol. 2420, Feb. 1995, pp. 292–306.
[18] J. Y. A. Wang and E. H. Adelson, “Layered representation for image sequence coding,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 221–224, Apr. 1993.
[19] ——, “Layered representation for motion analysis,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 361–366, June 1993.
[20] H. S. Sawhney and S. Ayer, “Compact representations of videos through dominant and multiple motion estimation,” IEEE Trans. Pattern Anal. Machine Intell., vol. 18, pp. 814–830, Aug. 1996.
[21] M. Irani and P. Anandan, “Video indexing based on mosaic representations,” Proc. IEEE, vol. 86, pp. 905–921, May 1998.
[22] D. C. Hogg, “Model-based vision: A program to see a walking person,” Image Vis. Comput., vol. 1, no. 1, pp. 5–20, 1983.
[23] A. Baumberg and D. Hogg, “Learning flexible models from image sequences,” in Proc. 3rd Eur. Conf. Computer Vision, vol. I, 1994, pp. 299–308.
[24] R. Polana and R. Nelson, “Detecting activities,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2–7, June 1993.
[25] K. Rohr, “Incremental recognition of pedestrian from image sequences,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 8–13, June 1993.
[26] A. J. Lipton, H. Fujiyoshi, and C. Thorpe, “Moving target classification and tracking from real-time video,” Proc. 4th IEEE Workshop of Applications of Computer Vision, pp. 8–14, Oct. 1998.
[27] H. Fujiyoshi and A. J. Lipton, “Real-time human motion analysis by image skeletonization,” Proc. 4th IEEE Workshop of Applications of Computer Vision, pp. 15–21, Oct. 1998.
[28] A. Galton, “Toward an integrated logic of space, time and motion,” presented at the Int. Joint Conf. Artificial Intelligence (IJCAI), Chambery, France, Aug. 1993.
[29] M. Brand, N. Oliver, and A. Pentland, “Coupled hidden Markov models for complex action recognition,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1997.
[30] N. Chleq and M. Thonnat, “Real-time image sequence interpretation for video-surveillance applications,” Proc. IEEE Int. Conf. Image Processing (ICIP96), vol. III, pp. 801–804, Sept. 1996.
[31] P. L. Rosin and T. J. Ellis, “Frame-based system for image interpretation,” Image Vis. Comput., vol. 9, pp. 353–361, 1991.
[32] R. Koenen, “Profiles and levels in MPEG-4: Approach and overview,” Signal Process.: Image Commun., no. 4–5, pp. 463–478, Jan. 2000.
[33] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 780–785, July 1997.
[34] I. Haritaoglu, D. Harwood, and L. S. Davis, “W4: Real-time surveillance of people and their activities,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 809–830, Aug. 2000.
[35] S. J. McKenna, S. Jabri, Z. Duric, H. Wechsler, and A. Rosenfeld, “Tracking groups of people,” Comput. Vis. Image Understand., vol. 80, pp. 42–56, 2000.
[36] D. Gavrila, “The visual analysis of human movement: A survey,” Comput. Vis. Image Understand., vol. 73, no. 1, pp. 82–98, 1999.
[37] A. Tesei, A. Teschioni, C. S. Regazzoni, and G. Vernazza, “Long-memory matching of interacting complex objects from real image sequences,” in Time-Varying Image Processing and Moving Object Recognition, V. Cappellini, Ed. Amsterdam, The Netherlands: Elsevier, 1997, pp. 283–288.




[38] A. Chanda, “Application of binary mathematical morphology to separate overlapped objects,” Pattern Recognit. Lett., vol. 13, no. 9, pp. 639–645, Sept. 1992.
[39] M. Bogaert, N. Chleq, P. Cornez, C. S. Regazzoni, A. Teschioni, and M. Thonnat, “The PASSWORDS project,” Proc. IEEE Int. Conf. Image Processing, pp. 675–678, Sept. 1996.
[40] M. J. Swain and D. H. Ballard, “Color indexing,” Int. J. Comput. Vis., vol. 7, no. 1, pp. 11–32, 1991.
[41] G. L. Foresti and F. Roli, “Real-time recognition of suspicious events for advanced visual-based surveillance,” in Multimedia Video-Based Surveillance Systems: From User Requirements to Research Solutions, G. L. Foresti, C. S. Regazzoni, and P. Mahonen, Eds. Dordrecht, The Netherlands: Kluwer, 2000, pp. 84–93.
[42] F. Bremond and M. Thonnat, “Tracking multiple nonrigid objects in video sequences,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 585–591, 1998.
[43] G. L. Foresti, “Outdoor scene classification by a neural tree based approach,” Pattern Analysis and Applications, vol. 2, pp. 129–142, 1999.
[44] E. James, L. Hollins, D. J. Brown, I. C. Luckraft, and C. R. Gent, “Feature vectors for road vehicle scene classification,” Neural Networks, vol. 9, no. 2, pp. 337–344, 1996.
[45] L. Marcenaro, F. Oberti, and C. S. Regazzoni, “Multiple objects color-based tracking using multiple cameras in complex time-varying outdoor scenes,” Proc. 2nd IEEE Int. Workshop on PETS, Dec. 9, 2001.
[46] F. Oberti, “Reconfigurable, Distributed and Intelligent Video Surveillance Networks,” Ph.D. thesis, Univ. Genoa, Italy, Feb. 2002.
[47] J. H. Piater and J. L. Crowley, “Multi-modal tracking targets using Gaussian approximations,” Proc. 2nd IEEE Int. Workshop on PETS, Dec. 9, 2001.
[48] G. L. Foresti, “Object detection and tracking in time-varying and badly illuminated outdoor environments,” Opt. Eng., vol. 37, no. 9, pp. 2550–2564, 1998.
[49] T. Kailath, “The divergence and Bhattacharyya distance measures in signal selection,” IEEE Trans. Commun. Tech., vol. COM-15, pp. 52–60, 1967.

Lucio Marcenaro was born in Genoa, Italy, in 1974. He received the Laurea degree in electronic engineering, with telecommunication and telematic specialization, from the University of Genoa in 1999, with a thesis on flexible models for human motion for the analysis of images from multiple video-surveillance cameras. He is currently pursuing the Ph.D. degree in electronic engineering and computer science at the University of Genoa, where he is a member of the Signal Processing & Telecommunications Group. His main research interests include image and sequence processing for video-surveillance systems and statistical pattern recognition.

Gian Luca Foresti (S’93–M’95–SM’01) was born in Savona, Italy, in 1965. He received the Laurea degree (cum laude) in electronic engineering and the Ph.D. degree in computer science from the University of Genoa, Italy, in 1990 and 1994, respectively. Since 1998, he has been Professor of computer science at the Department of Mathematics and Computer Science (DIMI), University of Udine, Italy, and Director of the Artificial Vision and Real-Time System (AVIRES) Lab. His main interests involve active vision, image processing, multisensor data fusion, and artificial neural networks. The proposed techniques have found applications in the following fields: automatic video-based systems for surveillance and monitoring of outdoor environments, vision systems for autonomous vehicle driving and/or road traffic control, 3-D scene interpretation, and human behavior understanding. He is author or coauthor of more than 100 papers published in international journals and refereed international conferences. He has contributed to seven books in his area of interest and is coauthor of the book Multimedia Video-Based Surveillance Systems (Dordrecht, The Netherlands: Kluwer, 2000). He has served as a reviewer for several international journals, and for the European Union in different research programs (MAST III, Long Term Research, Brite-CRAFT). He has been responsible for DIMI in several European and national research projects in the field of computer vision and image processing. Prof. Foresti has been general co-chair, chairman, and member of technical committees at several conferences, and has organized several special sessions on video-based surveillance systems and data fusion at international conferences. He has been Guest Editor of a Special Issue on “Video Communications, Processing and Understanding for Third Generation Surveillance Systems” of the PROCEEDINGS OF THE IEEE. In February 2000, he was appointed as the Italian member of the Information Systems Technology (IST) panel of the NATO-RTO.
He is a member of IAPR.

Carlo S. Regazzoni (S’90–M’92–SM’00) was born in Savona, Italy, in 1963. He received the Laurea degree in electronic engineering and the Ph.D. degree in telecommunications and signal processing from the University of Genoa, in 1987 and 1992, respectively. Since 1998, he has been Professor of telecommunication systems in the Engineering Faculty of the University of Genoa and, also since 1998, has been responsible for the Signal Processing and Telecommunications (SP&T) Research Group, Department of Biophysical and Electronic Engineering (DIBE), University of Genoa, which he joined in 1987. His main current research interests are multimedia and nonlinear signal and video processing, signal processing for telecommunications, and multimedia broadband wireless and wired telecommunication systems. He has been involved in research on multimedia surveillance systems since 1988. He is co-editor of the books Advanced Video-Based Surveillance Systems (Dordrecht, The Netherlands: Kluwer, 1999) and Multimedia Video-Based Surveillance Systems (Kluwer, 2000). He is author or coauthor of 43 papers in international scientific journals and of more than 130 papers presented at refereed international conferences. Dr. Regazzoni has been co-organizer and chairman of the first two International Workshops on Advanced Video Based Surveillance, held in Genoa, Italy, 1998, and Kingston, U.K., 2001. He has also organized several special sessions in the same field at international conferences (Image Analysis and Processing (ICIAP99), Venice, 1999; European Signal Processing Conference (Eusipco2000), Tampere, Finland, 2000). He is a consultant of the EU Commission for the definition of the 6th research framework program in the ambient intelligence domain.
He has been responsible for several EU research and development projects dealing with video surveillance methodologies and applications in the transport field (ESPRIT Dimus, Athena, Passwords, AVS-PV, AVS-RIO); he has also been responsible for several research contracts with Italian industries; he has served as a referee for international journals and as a reviewer for the EU in different research programs. He is a member of AEI and IAPR.
