



A desirable video object extraction scheme for content-based applications should meet
the following criteria:
1) The segmented object should conform to human perception, i.e., semantically meaningful
objects should be segmented.
2) The segmentation algorithm should be efficient and fast.
3) Initialization should be simple and easy for users to operate (human intervention
should be minimized).
One feasible solution that satisfies these criteria is edge change detection.
In Video Object (VO) segmentation methods that use mathematical morphology and
a perspective motion model, objects of interest must initially be outlined by a human
observer. From the manually specified object boundary, the correct object boundary is
calculated using a morphological segmentation tool. The obtained VOP is then automatically
tracked and updated in successive frames. This approach has difficulty dealing with large
non-rigid object movement and with occlusion, especially in the VOP tracking schemes.
The algorithm based on edge change detection allows automatic detection of the new
appearance of a VO. Edge change detection on the inter-frame difference is another popular
stream of schemes because it is straightforward to implement and enables automatic
detection of newly appearing objects. This ability makes it possible to develop fully
automated object-based systems, such as an object-based video surveillance system.
Algorithms based on inter-frame change detection are found to provide automatic
detection of objects and to tolerate larger non-rigid motion than the mathematical morphology
and perspective motion model methods. Their drawback is the small false regions detected
through decision errors caused by noise. Thus, small hole removal using morphological
operations and removal of false parts, such as uncovered background, by motion information
are usually employed.
Another drawback of edge change detection is that object boundaries are irregular in
some critical image areas and must be smoothed and adapted using spatial edge
information. Since spatial edge information is useful for generating VOPs with accurate
boundaries, a simple binary edge difference scheme may be assumed to be a good solution.
To overcome boundary inaccuracy, multiple features, multiple frames, and spatial-temporal
entropy methods are used. In addition, this gives robustness to noise and occluding objects.
The first stage is applied to the first two frames of a video shot to discover moving
objects, while the second stage is applied to the remaining frames to extract the detected
objects throughout the video shot. The first two frames of the video sequence are taken and
motion vectors are computed using the Adaptive Rood Pattern Search (ARPS) algorithm.
Simultaneously, the components of optical flow are computed for each block in the image.
Using the motion vectors, a motion-compensated frame is generated.
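The block-wise motion compensation step can be sketched as follows. This is a minimal illustration in NumPy (the function name, block size, and vector layout are assumptions for illustration, not taken from the thesis): each block of the compensated frame is copied from the reference frame at the position indicated by its motion vector.

```python
import numpy as np

def motion_compensate(ref, mvs, block=8):
    """Build a motion-compensated frame by copying each block of the
    reference frame from the location indicated by its motion vector.
    mvs[i, j] holds the (dy, dx) vector for block (i, j)."""
    h, w = ref.shape
    out = np.zeros_like(ref)
    for i in range(0, h, block):
        for j in range(0, w, block):
            dy, dx = mvs[i // block, j // block]
            # Clamp the source window so it stays inside the frame.
            y = min(max(i + dy, 0), h - block)
            x = min(max(j + dx, 0), w - block)
            out[i:i + block, j:j + block] = ref[y:y + block, x:x + block]
    return out
```

With all-zero motion vectors the compensated frame is identical to the reference, which is a useful sanity check before plugging in ARPS vectors.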
Initial segmentation is performed on the first frame of the traffic sequence. Applying
the watershed transformation directly to the gradient of the image results in over-segmentation.
To avoid this, a morphological gradient is computed on the frame before the
watershed transformation is performed. After the watershed transformation, some regions may
still need to be merged because of residual over-segmentation.
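The morphological gradient mentioned above is simply greyscale dilation minus greyscale erosion. A minimal sketch using SciPy (function name and structuring-element size are illustrative assumptions; SciPy also ships a ready-made `scipy.ndimage.morphological_gradient`):

```python
import numpy as np
from scipy import ndimage

def morph_gradient(frame, size=3):
    """Morphological gradient: greyscale dilation minus greyscale erosion.
    Compared with a derivative-based gradient it responds over a wider,
    flatter ridge, which yields fewer spurious catchment basins when a
    watershed transform is applied afterwards."""
    dil = ndimage.grey_dilation(frame, size=(size, size))
    ero = ndimage.grey_erosion(frame, size=(size, size))
    return dil - ero
```

On a flat region the gradient is zero; near an intensity step it equals the step height across a band as wide as the structuring element.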
The Canny binary edge image is used to localize an object in subsequent frames of the
video sequence and to detect true weak edges. Intensity edge pixels are used as feature points
because of the key role edges play in the human visual process and because edges are
little affected by variations in luminance. Object models evolve from one frame to the next,
capturing the changes in the shape of objects as they move. The algorithm naturally
establishes the temporal correspondence of objects throughout the video sequence, and its
output is a sequence of binary models representing the motion and shape changes of the
objects.
The object model is obtained by subtracting the background edges from the edge image
and eliminating unlinked pixels. After a binary model for the object of interest has been
derived, the motion vectors generated by the ARPS algorithm are used to match subsequent
frames in the sequence. Matching is performed on edge images because it is computationally
efficient and fairly insensitive to changes in illumination. The degree of change in the shape
of an object from one frame to the next is determined using a simplified Hausdorff distance,
defined as a combination of distance transformation and correlation. The image is
distance-transformed and then thresholded by different amounts to form differently dilated
image sets. To search for the object in the image, one must find the amount by which the
image should be dilated so that the maximum number of points in the object model are
matched to the image set.
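The distance-transform view of this matching can be sketched briefly. Reading the distance transform at the model points and thresholding at tau is equivalent to dilating the edge image by tau and counting which model points fall on it (the function name and score definition below are illustrative assumptions, not the thesis implementation):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def hausdorff_match_score(model_pts, edge_img, tau=3.0):
    """Simplified (partial) Hausdorff matching: the fraction of model edge
    points that lie within tau pixels of an image edge. Thresholding the
    distance transform at tau stands in for dilating the edge image by
    that amount and correlating with the model."""
    # Distance from every pixel to the nearest edge pixel.
    dist = distance_transform_edt(edge_img == 0)
    d = dist[model_pts[:, 0], model_pts[:, 1]]
    return float(np.mean(d <= tau))
```

A score near 1 means almost every model point sits on (or near) an image edge; sweeping tau reproduces the "different dilated image sets" described above.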
In this automatic VO segmentation algorithm, edge change detection starts with edge
detection, the first and most important stage of the human visual process. Edge information
plays a key role in extracting the physical change of the corresponding surface in a real
scene. However, exploiting a simple difference of edges to extract the shape information of
moving objects in a video sequence suffers from a great deal of noise, even with a stationary
background. This is because the random noise created in one frame differs from that created
in the successive frame, which results in slight changes of the edge locations between
successive frames. Therefore, the difference edge between frames suppresses the noise in the
luminance difference by means of the Canny edge detector.
Motion estimation is based on temporal changes in image intensities. The underlying
assumption is that the patterns corresponding to objects and background in one frame of a
video sequence move within the frame to form corresponding objects in the subsequent frame.
Motion estimation is accomplished using the ARPS algorithm. ARPS exploits the fact that the
general motion in a frame is usually coherent: if the macroblocks around the current
macroblock move in a particular direction, there is a high probability that the current
macroblock will have a similar motion vector. The algorithm uses the motion vector of the
macroblock to its immediate left to predict its own motion vector.
The ARPS algorithm tries to achieve a Peak Signal-to-Noise Ratio (PSNR) similar
to that of the Exhaustive Search (ES) algorithm. ES, also known as Full Search, is the
most computationally expensive block-matching algorithm: it evaluates the cost
function at every possible location in the search window. As a result, it finds the best possible
match and gives the highest PSNR among block-matching algorithms, but its obvious
disadvantage is the large number of computations required to estimate the motion vectors.
ARPS aims for the same PSNR as ES with far fewer computations.
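The cost of ES is easy to see in a sketch: it probes every one of the (2p+1)^2 candidates in the search window, whereas ARPS probes only a handful of rood-pattern points per step, sized by the predicted motion vector. Below is a minimal exhaustive search with the usual SAD cost (names and window size are illustrative assumptions):

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences, the usual block-matching cost."""
    return int(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

def full_search(cur_blk, ref, y, x, p=4):
    """Exhaustive Search: evaluate the SAD cost at every one of the
    (2p+1)^2 candidate positions in the search window around (y, x)
    and return the best motion vector and its cost."""
    h, w = ref.shape
    b = cur_blk.shape[0]
    best_cost, best_mv = None, (0, 0)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy <= h - b and 0 <= xx <= w - b:
                cost = sad(cur_blk, ref[yy:yy + b, xx:xx + b])
                if best_cost is None or cost < best_cost:
                    best_cost, best_mv = cost, (dy, dx)
    return best_mv, best_cost
```

For p = 4 this is 81 SAD evaluations per macroblock; ARPS typically needs well under ten while finding the same minimum when the motion is coherent.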



The existing methodology is a saliency-based video object extraction (VOE)
framework. The framework aims to automatically extract foreground objects of
interest without any user interaction or the use of any training data. To separate foreground
and background regions within and across video frames, the method utilizes visual
and motion saliency information extracted from the input video. A conditional random field
is applied to effectively combine the saliency-induced features, which makes it possible to
deal with unknown pose and scale variations of the foreground object (and its articulated
parts). Because the framework preserves both spatial continuity and temporal consistency,
experiments on a variety of videos verify that the method produces quantitatively and
qualitatively satisfactory VOE results.

2.1.1. Introduction
A human can easily determine the subject of interest in a video, even when that
subject appears against an unknown or cluttered background or has never been seen
before. Given the complex cognitive capabilities of the human brain, this process can
be interpreted as the simultaneous extraction of both foreground and background information
from a video.
Many researchers have been working toward closing the gap between human and
computer vision. However, without any prior knowledge on the subject of interest or training
data, it is still very challenging for computer vision algorithms to automatically extract the
foreground object of interest in a video. As a result, if one needs to design an algorithm to
automatically extract the foreground objects from a video, several tasks need to be addressed.
1) The object category and the number of object instances in a video are unknown.
2) Foreground objects may exhibit complex or unexpected motion due to articulated parts or
arbitrary movements.
3) Foreground and background regions may have ambiguous appearance due to similar color,
low contrast, insufficient lighting, and similar conditions.

In practice, it is infeasible to prepare all possible foreground object or background
models beforehand. However, if one can extract representative information from either
foreground or background (or both) regions from a video, the extracted information can be
utilized to distinguish between foreground and background regions, and thus the task of
foreground object extraction can be addressed. As discussed later in Section II, most of the
prior works either consider a fixed background or assume that the background exhibits
dominant motion across video frames. These assumptions might not be practical for real
world applications, since they cannot generalize well to videos captured by freely moving
cameras with arbitrary movements.
In this system, we propose a robust video object extraction (VOE) framework, which
utilizes both visual and motion saliency information across video frames. The observed
saliency information allows us to infer several visual and motion cues for learning foreground
and background models, and a conditional random field (CRF) is applied to automatically
determine the label (foreground or background) of each pixel based on the observed models.
With the ability to preserve both spatial and temporal consistency, our VOE
framework exhibits promising results on a variety of videos, and produces quantitatively and
qualitatively satisfactory performance. While we focus on VOE problems for single-concept
videos (i.e., videos in which only one object category of interest is present), the proposed
method is able to deal with multiple object instances (of the same type) under pose, scale,
and similar variations.
Methods to solve the occlusion problem in tracking multiple interacting objects have
been presented previously. Shiloh, Chang and Dockstader overcame occlusion in multiple
object tracking by fusing multiple camera inputs. Cucchiara proposed probabilistic masks and
appearance models to cope with frequent shape changes and large occlusions. Eng developed
a Bayesian segmentation approach that fused a region-based background subtraction and a
human shape model for people tracking under occlusion. Wu proposed a dynamic Bayesian
network which accommodates an extra hidden process for partial occlusion handling. Andrew
used appearance models to track occluded objects. Siebel proposed a tracking system with
three co-operating parts: an active shape tracker, a region tracker and a head detector. The
region tracker exploits the other two modules to resolve occlusions. Hieu proposed a template
matching algorithm that updates the template using appearance features smoothed by a
Kalman filter. Tao presented a dynamic background layer model in which each moving object
is modeled as a foreground layer; together with the foreground ordering, this provides the
complete information necessary for reliably tracking objects through occlusion. Alper tracked
the complete object, evolving the contour from frame to frame by minimizing energy
functions.
An obvious step towards video segmentation, taken in Efficient Hierarchical Graph-Based
Video Segmentation by Matthias Grundmann et al., is to apply image segmentation
techniques to video frames without considering temporal coherence. These methods are
inherently scalable and may generate segmentation results in real time. However, the lack of
temporal information from neighboring frames may cause jitter across frames. Freedman and
Kisilev applied a sampling-based fast mean shift approach to a cluster of 10 frames, treated as
a larger set of image features, to generate smoother results without taking temporal
information into account.
Spatio-temporal video segmentation techniques can be distinguished by whether
information from future frames is used in addition to past frames. Causal methods apply
Kalman filtering to aggregate data over time, considering only past data. Paris et al.
derived the equivalent tool of mean-shift image segmentation for video streams based on the
ubiquitous use of the Gaussian kernel. They achieved real-time performance without
considering future frames in the video.
Another class of spatio-temporal techniques takes advantage of both past and future
data in a video. They treat the video as a 3D space-time volume, and typically use a variant of
the mean shift algorithm for segmentation. Dementhon applied mean shift on a 3D lattice and
used a hierarchical strategy to cluster the space-time video stack for computational efficiency.
Wang et al. used anisotropic kernel mean shift segmentation for video tooning. Wang and
Adelson used motion heuristics to iteratively segment video frames into motion consistent
layers. Tracking-based video segmentation methods generally define segments at frame-level
and use motion, color and spatial cues to force temporal coherence.
Following the same line of work, Brendel and Todorovic used contour cues to allow
splitting and merging of segments to boost the tracking performance. Interactive object
segmentation has recently shown significant progress. These systems produce high quality
segmentations driven by user input. We exhibit a similar interactive framework driven by our
segmentation. Our video segmentation method builds on Felzenszwalb and Huttenlocher's
graph-based image segmentation technique. Their algorithm is efficient, being nearly linear

in the number of edges in the graph, which makes it suitable for extension to spatio-temporal
segmentation. We extend the technique to video making use of both past and future frames,
and improve the performance and efficiency using a hierarchical framework.
2.1.2. Methodology
In this paper, we aim at automatically extracting foreground objects from videos
captured by freely moving cameras. Instead of assuming that the background motion is
dominant and differs from that of the foreground, as prior works did, we relax this assumption
and allow foreground objects to be present in freely moving scenes.
We exploit both visual and motion saliency information across video frames, and a
CRF model is utilized to integrate the associated features for VOE (i.e., visual saliency,
shape, foreground/background color models, and spatial/temporal energy terms). Our
quantitative and qualitative experiments verify that our VOE results exhibit spatial
consistency and temporal continuity, and our method is shown to outperform state-of-the-art
unsupervised VOE approaches. It is worth noting that the proposed VOE framework is an
unsupervised approach that requires neither prior knowledge (i.e., training data) of the
object of interest nor user interaction for annotation.
Fig. 2.1. Block Diagram of Existing System: input video → visual saliency and motion
saliency (using optical flow) → color and shape cues → detected object.


Most existing unsupervised VOE approaches treat the foreground objects as
outliers in terms of the observed motion information, so that the induced appearance, color,
and similar features are utilized to distinguish between foreground and background regions.
However, these methods cannot generalize well to videos captured by freely moving cameras
as discussed earlier. In this work, we propose a saliency-based VOE framework which learns
saliency information in both spatial (visual) and temporal (motion) domains.
By advancing conditional random fields (CRF), the integration of the resulting
features can automatically identify the foreground object without the need to treat either
foreground or background as outliers.

Fig. 2.2. Overview of Existing VOE Framework

In general, one can address VOE problems using supervised or unsupervised
approaches. Supervised methods require prior knowledge on the subject of interest and need
to collect training data beforehand for designing the associated VOE algorithms. For
example, Wu and Nevatia and Lin and Davis both decomposed an object shape model in a
hierarchical way to train object part detectors, and these detectors are used to describe all
possible configurations of the object of interest (e.g. pedestrians).
Another type of supervised method requires user interaction for annotating candidate
foreground regions. For example, some proposed image segmentation algorithms focused on
an interactive scheme and required users to manually provide ground-truth label information.
For videos captured by a monocular camera, methods applied a conditional random field
(CRF) maximizing a joint probability over color, motion, and similar models to predict the
label of


each image pixel. Although the color features can be automatically determined from the input
video, these methods still need the user to train object detectors for extracting shape or
motion features.
Recently, researchers proposed to use some preliminary strokes to manually select the
foreground and background regions, and they utilized such information to train local
classifiers to detect the foreground objects. While these works produce promising results, it
might not be practical for users to manually annotate a large amount of video data.
2.1.3. Visual Saliency
The salience (also called saliency) of an item, be it an object, a person, or a pixel,
is the state or quality by which it stands out relative to its neighbors. Saliency detection is
considered to be a key attention mechanism that facilitates learning and survival by enabling
organisms to focus their limited perceptual and cognitive resources on the most pertinent
subset of the available sensory data.
Saliency typically arises from contrasts between items and their neighborhood, such
as a red dot surrounded by white dots, a flickering message indicator of an answering
machine, or a loud noise in an otherwise quiet environment. Saliency detection is often
studied in the context of the visual system, but similar mechanisms operate in other sensory
systems. What is salient can be influenced by training: for example, for human subjects
particular letters can become salient by training.
When attention deployment is driven by salient stimuli, it is considered to be bottom-up,
memory-free, and reactive. Attention can also be guided by top-down, memory-dependent,
or anticipatory mechanisms, such as when looking ahead of moving objects or
sideways before crossing streets. Humans and other animals have difficulty paying attention
to more than one item simultaneously, so they face the challenge of continuously
integrating and prioritizing different bottom-up and top-down influences.
In the domain of psychology, efforts have been made to model the mechanism of
human attention, including learning to prioritize the different bottom-up and top-down
influences.


In the domain of computer vision, efforts have been made in modeling the mechanism
of human attention, especially the bottom-up attention mechanism. Such a process is also
called visual saliency detection.
Generally speaking, there are two kinds of models that mimic the bottom-up saliency
mechanism. One is based on spatial contrast analysis: for example, a center-surround
mechanism is used to define saliency across scales, inspired by the putative neural
mechanism. The other is based on frequency-domain analysis: while some used the amplitude
spectrum to assign saliency to rarely occurring magnitudes, Guo et al. used the phase
spectrum instead, and later work introduced systems that use both the amplitude and the
phase spectra.
A key limitation in many such approaches is their computational complexity which
produces less than real-time performance, even on modern computer hardware. Some recent
work attempts to overcome these issues but at the expense of saliency detection quality under
some conditions.
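The phase-spectrum idea is compact enough to sketch. Keeping only the phase of the Fourier transform and inverting it suppresses the image's dominant regularities, so unusual locations come back strong (the function name is an assumption; published variants also smooth the result with a Gaussian kernel):

```python
import numpy as np

def phase_saliency(img):
    """Frequency-domain saliency in the spirit of Guo et al.'s phase
    spectrum approach: keep only the phase of the 2-D Fourier transform,
    discard the amplitude, and take the inverse transform. Locations that
    break the image's dominant regularities come back with high energy."""
    f = np.fft.fft2(img.astype(float))
    phase_only = np.exp(1j * np.angle(f))
    sal = np.abs(np.fft.ifft2(phase_only)) ** 2
    return sal / sal.max()
```

On a flat field with one odd pixel, the saliency map peaks exactly at that pixel.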
Our attention is attracted to visually salient stimuli. It is important for complex
biological systems to rapidly detect potential prey, predators, or mates in a cluttered visual
world. However, simultaneously identifying any and all interesting targets in one's visual
field has prohibitive computational complexity, making it a daunting task even for the most
sophisticated biological brains, let alone any existing computer.

Fig. 2.3. Example of visual saliency calculations. (a) Original video frame. (b) and (c)
Visual saliency of (a).
One solution, adopted by primates and many other animals, is to restrict complex
object recognition process to a small area or a few objects at any one time. The many objects
or areas in the visual scene can then be processed one after the other. This serialization of


visual scene analysis is operationalized through mechanisms of visual attention: A common

(although somewhat inaccurate) metaphor for attention is that of a virtual spotlight, shifting
to and highlighting different sub-regions of the visual world, so that one region at a time can
be subjected to more detailed visual analysis.
Visual attention may be a solution to the inability to fully process all locations in
parallel. However, this solution produces a problem. If you are only going to process one
region or object at a time, how do you select that target of attention? Visual salience helps
your brain achieve reasonably efficient selection. Early stages of visual processing give rise
to a distinct subjective perceptual quality which makes some stimuli stand out from among
other items or locations. Our brain has evolved to rapidly compute salience in an automatic
manner and in real-time over the entire visual field. Visual attention is then attracted towards
salient visual locations.
The core of visual salience is a bottom-up, stimulus-driven signal announcing that a
location is sufficiently different from its surroundings to be worthy of attention.
This bottom-up deployment of attention towards salient locations can be strongly modulated,
or even sometimes overridden, by top-down, user-driven factors. Thus, a lone red object in a
green field will be salient and will attract attention in a bottom-up manner. In contrast, if you
are looking through a child's toy bin for a red plastic dragon, amidst plastic objects of many
vivid colors, no one color may be especially salient until your top-down desire to find the red
object renders all red objects, whether dragons or not, more salient.
Visual salience is sometimes carelessly described as a physical property of a visual
stimulus. It is important to remember that salience is the consequence of an interaction of a
stimulus with other stimuli, as well as with a visual system (biological or artificial). As a
straightforward example, consider that a color-blind person will have a dramatically
different experience of visual salience than a person with normal color vision, even when
both look at exactly the same physical scene.
As a more controversial example, expertise may change the salience of some stimuli
for some observers. Nevertheless, because visual salience arises from fairly low-level and
stereotypical computations in the early stages of visual processing, the factors contributing to
salience are generally quite comparable from one observer to the next, leading to similar
experiences across a range of observers and behavioral conditions.


2.1.4. Motion Saliency

Motion saliency detection has an important impact on further video processing tasks,
such as video segmentation, object recognition, and adaptive compression. Unlike image
saliency, in videos, moving regions (objects) catch human attention much more easily than
static ones. Based on this observation, we propose a novel method of motion saliency
detection, which makes use of low-rank and sparse decomposition on video slices along the
X-T and Y-T planes to achieve the goal, i.e., separating foreground moving objects from
backgrounds. To detect motion saliency in videos, however, most of the techniques for
images (mentioned above) are not applicable.
Unlike image saliency detection, moving regions (objects) catch more human
attention than static ones, even when the static regions have large contrast with their
neighbours in still images. That is to say, the focal point changes from the regions with large
contrast to their neighbours (for images) to those with motion discrimination (for videos).
Therefore, contrast-based methods can hardly be applied to videos directly.
An exception exists, which extends spectrum residual analysis from images to
videos. In fact, the goal of separating moving objects from the background is the same as that
of motion saliency detection. A few solutions for separating foreground moving objects from
backgrounds have been proposed, such as the Gaussian Mixture Model.
To detect each moving part and its corresponding pixels, we perform dense optical
flow forward and backward propagation at each frame of a video. A moving pixel qt at frame
t is determined by its propagated pixel pair, where the pair denotes the pixels detected by
forward or backward optical flow propagation.
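One common way to realize such a forward-backward pairing, sketched here under the assumption that the two dense flow fields have already been computed (the function name and tolerance are illustrative, not from the source): a pixel is kept as reliably moving when propagating it forward and then backward returns approximately to its start, and the forward displacement itself is non-negligible.

```python
import numpy as np

def moving_pixel_mask(flow_fw, flow_bw, tol=0.5):
    """Label a pixel as reliably moving when the forward flow and the
    backward flow at the propagated target (approximately) cancel out,
    and the forward displacement itself exceeds the tolerance.
    flow_fw/flow_bw have shape (H, W, 2) storing (dy, dx) per pixel."""
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Forward-propagated positions (rounded to the nearest pixel).
    y2 = np.clip(np.rint(ys + flow_fw[..., 0]).astype(int), 0, h - 1)
    x2 = np.clip(np.rint(xs + flow_fw[..., 1]).astype(int), 0, w - 1)
    # Round-trip error: forward flow plus backward flow at the target.
    err_y = flow_fw[..., 0] + flow_bw[y2, x2, 0]
    err_x = flow_fw[..., 1] + flow_bw[y2, x2, 1]
    consistent = np.hypot(err_y, err_x) < tol
    moving = np.hypot(flow_fw[..., 0], flow_fw[..., 1]) > tol
    return consistent & moving
```

Pixels whose round trip does not close (occlusions, flow errors) are discarded rather than misclassified as moving.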

Determination of Shape Cues:

Although motion saliency allows us to capture motion salient regions within and
across video frames, those regions might only correspond to moving parts of the foreground
object within some time interval.

Determination of Color Cues:

Besides the motion-induced shape information, we also extract both foreground and
background color information for improved VOE performance. Since we cannot assume that
each moving part of the foreground object forms a complete sampling of itself, we cannot
construct foreground or background color models simply based on visual or motion saliency
detection results at each individual frame.

Fig. 2.4. Motion Saliency Calculated for Fig 2.3. (a) Calculation of the Optical Flow.
(b) Motion Saliency Derived from (a)


2.1.5. Conditional Random Field

Utilizing an undirected graph, the conditional random field (CRF) is a powerful
technique for estimating the structural information (e.g., class labels) of a set of variables
from the associated observations. For video foreground object segmentation, CRFs have
been applied to predict the label of each observed pixel in an image I.
A CRF is a class of statistical modeling methods often applied in pattern recognition
and machine learning and used for structured prediction. Whereas an
ordinary classifier predicts a label for a single sample without regard to "neighboring"
samples, a CRF can take context into account; e.g., the linear-chain CRF popular in natural
language processing predicts sequences of labels for sequences of input samples.
CRFs are a type of discriminative undirected probabilistic graphical model. They are
used to encode known relationships between observations and to construct consistent
interpretations, often for labeling or parsing sequential data, such as natural language text
or biological sequences, as well as in computer vision. Specifically, CRFs find applications
in shallow parsing, named entity recognition, and gene finding, among other tasks, as an
alternative to the related hidden Markov models. In computer vision, CRFs are often used for
object recognition and image segmentation.
Conditional random fields (CRFs) are a probabilistic framework for labeling and
segmenting structured data, such as sequences, trees, and lattices. The underlying idea is to
define a conditional probability distribution over label sequences given a particular
observation sequence, rather than a joint distribution over both label and observation
sequences.


The primary advantage of CRFs over hidden Markov models is their conditional
nature, resulting in the relaxation of the independence assumptions required by HMMs in
order to ensure tractable inference.
Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum
entropy Markov models (MEMMs) and other conditional Markov models based on directed
graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world
tasks in many fields, including bioinformatics, computational linguistics, and speech
recognition.
Imagine you have a sequence of snapshots from a day in Justin Bieber's life, and you
want to label each image with the activity it represents (eating, sleeping, driving, etc.). How
can you do this?
One way is to ignore the sequential nature of the snapshots and build a per-image
classifier. For example, given a month's worth of labeled snapshots, you might learn
that dark images taken at 6am tend to be about sleeping, images with lots of bright colors
tend to be about dancing, images of cars are about driving, and so on.
By ignoring this sequential aspect, however, you lose a lot of information. For
example, what happens if you see a close-up picture of a mouth: is it about singing or
eating? If you know that the previous image is a picture of Justin Bieber eating or cooking,
then it's more likely this picture is about eating; if, however, the previous image shows
Justin Bieber singing or dancing, then this one probably shows him singing as well.
Thus, to increase the accuracy of our labeler, we should incorporate the labels of
nearby photos, and this is precisely what a conditional random field does.
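As a toy illustration of how neighboring labels help, here is a minimal Viterbi decoder for a linear-chain model. The scores are log-potentials and all names and numbers are made up for illustration; this is the decoding step shared by linear-chain CRFs, not a full CRF with learned feature weights:

```python
import numpy as np

def viterbi(unary, pairwise):
    """MAP label sequence for a linear-chain model: unary[t, k] scores
    label k at step t, pairwise[j, k] scores the transition j -> k.
    Higher scores are better (log-potentials)."""
    T, K = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[j, k] = best score ending in j, then transitioning to k.
        cand = score[:, None] + pairwise + unary[t]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    labels = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]
```

With a transition score that rewards staying in the same activity, an ambiguous middle frame is pulled toward the label of its confident neighbors, which is exactly the "use the nearby photos" intuition above.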
2.1.6. Disadvantages

Not possible to work with videos with a moving background.

Shape and color cues extract foreground objects with missing parts.



-Ken Fukuchi, Kouji Miyazato, Akisato Kimura, Shigeru Takagi and Junji Yamato
This paper proposes a new method for achieving precise video segmentation without
any supervision or interaction. The main contributions of this report include
1) the introduction of fully automatic segmentation based on the maximum a
posteriori (MAP) estimation of the Markov random field (MRF) with graph cuts and saliency
driven priors.
2) the updating of priors and feature likelihoods by integrating the previous
segmentation results and the currently estimated saliency-based visual attention.
Methods used

Markov random field (MRF) model

Here each hidden state corresponds to the label of a position, representing an object
or the background, and an observation is a frame of the input video. The density calculated in
the previous step can be utilized to estimate the priors of objects/backgrounds and the
feature likelihoods of the MRF. When calculating priors and likelihoods, the regions
extracted from the previous frames are also available.

Image segmentation
Consider a set of random variables A = {A_x}, x ∈ I, defined on a set I of coordinates.
Each random variable A_x takes a value a_x from the set L = {0, 1}, which corresponds to the
background (0) and an object (1), respectively.
A MAP-based MRF estimation can be formulated as an energy minimization
problem, where the energy corresponding to the configuration a is the negative log likelihood
of the joint posterior density of the MRF, E(a|D) = -log p(A = a|D), where D represents the
input image.
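The energy being minimized typically decomposes into a per-pixel data term plus a smoothness prior over neighboring labels. A minimal sketch (the Potts smoothness term, weight name, and function name are illustrative assumptions; graph cuts are then used to find the labelling that minimizes such an energy):

```python
import numpy as np

def mrf_energy(labels, unary, lam=1.0):
    """Energy of a binary labelling `labels` (H x W array of {0, 1}):
    a data term (unary[y, x, k] = negative log likelihood of label k at
    pixel (y, x)) plus a Potts smoothness prior charging `lam` for each
    4-connected neighbour pair with differing labels."""
    h, w = labels.shape
    # Data term: pick each pixel's cost under its assigned label.
    data = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # Smoothness term: count label disagreements along rows and columns.
    smooth = (labels[1:, :] != labels[:-1, :]).sum() \
           + (labels[:, 1:] != labels[:, :-1]).sum()
    return float(data + lam * smooth)
```

A single pixel that disagrees with a uniform neighbourhood pays both its data cost and four cut edges, which is why isolated noisy labels are suppressed by the MAP solution.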

Proposed a new method for achieving precise video segmentation without any
supervision or interaction.


The main contributions included

1) the introduction of MAP-based frame-wise segmentation with graph cuts and
saliency-driven priors.
2) the technique for updating priors and likelihoods with a Kalman Filter.

A drawback is that segmented regions were sometimes randomly switched.


-Radhakrishna Achanta and Sabine Susstrunk

Detection of visually salient image regions is useful for applications like object
segmentation, adaptive compression, and object recognition. Recently, full-resolution salient
maps that retain well defined boundaries have attracted attention. In these maps, boundaries
are preserved by retaining substantially more frequency content from the original image than
older techniques. However, if the salient regions comprise more than half the pixels of the
image, or if the background is complex, the background gets highlighted instead of the salient
object.

This paper introduces a method for salient region detection that retains the
advantages of such saliency maps while overcoming their shortcomings. The method exploits
features of color and luminance, is simple to implement, and is computationally efficient.

Methods used

Saliency computation methods:

Saliency has been referred to as visual attention, unpredictability, rarity, or surprise.

Saliency estimation methods can broadly be classified as biologically based, purely

computational, or those that combine the two ideas. In general, most methods employ a
low-level approach of determining the contrast of image regions relative to their surroundings
using one or more features of intensity, color, and orientation.
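A toy contrast-based saliency map in this spirit computes, per pixel, the distance between a smoothed pixel value and the mean image colour, loosely following frequency-tuned saliency; the 3x3 box blur below is a cheap stand-in for a proper Gaussian blur, so this is a sketch rather than the paper's exact method:

```python
import numpy as np

def simple_saliency(img):
    """Per-pixel saliency as the Euclidean distance between each
    box-blurred pixel and the mean image colour.
    img: H x W x C float array; returns an H x W map in [0, 1]."""
    mean = img.reshape(-1, img.shape[2]).mean(axis=0)
    # 3x3 box blur as a cheap stand-in for a Gaussian blur
    pad = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode='edge')
    blur = sum(pad[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)) / 9.0
    sal = np.linalg.norm(blur - mean, axis=2)
    return sal / (sal.max() + 1e-12)
```

Pixels whose colour deviates strongly from the global mean score high; such a map could then seed the foreground/background terminals of a graph cut, as described above.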

Graph Based Segmentation

Graph cuts based methods are popular for image segmentation applications. Boykov

and Jolly perform interactive segmentation using graph cuts. They require a user to provide
scribble based input to indicate foreground and background regions.


A graph cuts based algorithm then segments foreground from background. We use a
similar approach, however, instead of the user indicating the background and foreground
pixels using scribbles, we use the saliency map to assign these pixels automatically.

This method improves upon six existing state-of-the-art algorithms in precision and
recall with respect to a ground truth database.

The saliency maps generated by this method suffer from low resolution


-Kuo-Chin Lien and Yu-Chiang Frank Wang
Proposes a motion-driven video object extraction (VOE) method, which is able to
model and segment foreground objects in single-concept videos, i.e. videos which have only
one object category of interest but may have multiple object instances with pose, scale, etc.
variations. Given such a video, we construct a compact shape model induced by motion cues,
and extract the foreground and background color information accordingly.
It integrates these feature models into a unified framework via a conditional random
field (CRF), and this CRF can be applied to video object segmentation and further video
editing and retrieval applications. One of the advantages of this method is that it does not
require the prior knowledge of the object of interest, and thus no training data or
predetermined object detectors are needed; this makes this approach robust and practical to
real-world problems. Very attractive empirical results on a variety of videos with highly
articulated objects support the feasibility of our proposed method.
Methods used

Object Modeling And Extraction

First extract the motion cues from the moving object across video frames, and

combine the motion-induced shape, foreground and background color models into a CRF.
Without prior knowledge of the object of interest, this CRF model is designed to address
VOE problems in an unsupervised setting.

Conditional random field:

By utilizing an undirected graph, CRF is a powerful technique to estimate the

structural information (e.g. class label) of a set of variables. For object segmentation, CRF is


used to predict the label of each observed pixel in an image I. As shown in Figure 1, pixel i is
associated with observation zi, while the hidden node Fi indicates its corresponding label (i.e.
foreground or background).
In this CRF framework, the label Fi is inferred from the observation zi, while the
spatial coherence between this output and the neighboring observations zj and labels Fj is
simultaneously taken into consideration.

Extraction of motion cues

Aim was to extract different feature information from these moving parts for the later

CRF construction. To detect the moving parts and their corresponding pixels, we perform
dense optical flow forward and backward propagation at every frame.
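As a much simpler stand-in for dense optical flow, a thresholded temporal gradient already yields a crude motion cue; the sketch below is illustrative only and not the method of the paper:

```python
import numpy as np

def motion_mask(prev_frame, next_frame, thresh=25):
    """Crude motion cue: pixels whose grey value changes by more than
    `thresh` between consecutive frames. A simplified stand-in for the
    dense forward/backward optical flow used in the paper."""
    diff = np.abs(next_frame.astype(np.int32) - prev_frame.astype(np.int32))
    return (diff > thresh).astype(np.uint8)
```

The resulting binary mask marks candidate moving parts from which shape and colour models could then be built.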

Proposed a method which utilizes multiple motion-induced features such as shape and
foreground/background color models to extract foreground objects in single-concept videos.
We advanced a unified CRF framework to integrate the above feature models.

Some of the motion cues might be negligible due to low contrast


-P. Harding, N. M. Robertson
This technique is validated using comprehensive human eye-tracking experiments.
This algorithm is known as Visual Interest (VI) since the resultant segmentation reveals
image regions that are visually salient during the performance of multiple observer search
tasks. They demonstrate that it works on generic, eye-level photographs and is not dependent
on heuristic tuning.
The descriptor-matching property of the SURF feature points can be exploited via
object recognition to modulate the context of the attention probability map for a given object
search task, refining the salient area. The Visual Interest algorithm is fully validated by
applying it to salient compression, using a pre-blur of non-salient regions prior to JPEG, and
by conducting comprehensive observer performance tests.


Methods used

Feature points extraction

Computer vision feature points (basically, local features at which the signal changes

three-dimensionally in space and scale) have many attractive properties, such as robust
invariant descriptor matching over scale, rotation and affine offset that could be useful in
combination with their use as a primitive saliency detector.
Feature matching has been used in estimating inter-frame homography mismatching
as an estimate of temporal saliency in video, but not as a measure of spatial saliency. Harding
and Robertson compare the co-occurrence of a set of computer vision feature points with
predictive maps of visual saliency.
Presented an image segmentation algorithm to segment image areas which are
visually salient to observers performing multiple tasks. In contrast to bottom-up saliency
alone, and combined with specific task search models, our technique finds image areas
relevant to the performance of multiple objective tasks, without the need for prior training.
The general model acts on eye-level imagery with parameters chosen from careful
experimentation and requires no machine learning stage. The technique is built upon feature
points and the descriptors of these feature points can be compared to database representations
of stored objects to narrow the focus of the attention prediction map for object class search,
all in one algorithmic iteration.
The application to compression is just one possible use for a segmentation
algorithm based on visually salient information.



-Yun Zhai, Mubarak Shah
A hierarchical spatial attention representation is established to reveal the interesting
points in images as well as the interesting regions. Finally, a dynamic fusion technique is
applied to combine both the temporal and spatial saliency maps, where temporal attention is
dominant over the spatial model when large motion contrast exists, and vice versa. The
proposed spatiotemporal attention framework has been extensively applied on several video
sequences, and attended regions are detected to highlight interesting objects and motions
present in the sequences with very high user satisfaction rate.

Methods used:

Video attention detection

Propose a bottom-up approach for modeling the spatiotemporal attention in video

sequences. The proposed technique is able to detect the attended regions as well as attended
actions in video sequences. Different from the previous methods, most of which are based on
the dense optical flow fields, our proposed temporal attention model utilizes the interest point
correspondences and the geometric transformations between images.
In this model, feature points are firstly detected in consecutive video images, and
correspondences are established between the interest points using the Scale Invariant Feature
Transform (SIFT).

Spatiotemporal saliency map

A linear time algorithm is developed to compute pixel-level saliency maps. In this

algorithm, color statistics of the images are used to reveal the color contrast information in
the scene. Given the pixel-level saliency map, attended points are detected by finding the
pixels with the local maxima saliency values. The region-level attention is constructed based
upon the attended points. Given an attended point, a unit region is created with its center to
be the point.
This region is then iteratively expanded by computing the expansion potentials on the
sides of the region. Rectangular attended regions are finally achieved. The temporal and
spatial attention models are finally combined in a dynamic fashion. Higher weights are
assigned to the temporal model if large motion contrast is present in the sequence.
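The dynamic fusion step can be sketched as follows; the specific weighting rule (motion contrast normalised by a constant k) is an assumption for illustration, not the paper's exact formula:

```python
import numpy as np

def fuse_saliency(temporal, spatial, motion_contrast, k=1.0):
    """Dynamically fuse temporal and spatial saliency maps: the weight
    of the temporal map grows with the measured motion contrast
    (motion_contrast >= 0), so temporal attention dominates when large
    motion contrast exists, and vice versa."""
    w = motion_contrast / (motion_contrast + k)  # in [0, 1) for mc >= 0
    return w * temporal + (1.0 - w) * spatial
```

With zero motion contrast the fused map reduces to the spatial model; with large motion contrast it approaches the temporal model.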


Presented a spatiotemporal attention detection framework for detecting both attention
regions and interesting actions in video sequences. The saliency maps are computed
separately for the temporal and spatial information of the videos.

It can fail to highlight the entire salient region, or highlight smaller salient regions better
than larger ones.


-T. Bouwmans, F. El Baf, B. Vachon
Mixture of Gaussians is a widely used approach for background modeling to detect
moving objects from static cameras. Numerous improvements of the original method
developed by Stauffer and Grimson have been proposed over the recent years and the purpose
of this paper is to provide a survey and an original classification of these improvements. It
also discusses relevant issues to reduce the computation time. Firstly, the original MoG is
reviewed and discussed in light of the challenges met in video sequences.
This survey categorizes the different improvements found in the literature in terms
of the strategies used to improve the original MoG, and discusses them in terms of the
critical situations they claim to handle. After analyzing the
strategies and identifying their limitations, we conclude with several promising directions for
future research.

Methods used

Background Modeling
In the context of a traffic surveillance system, Friedman and Russell proposed to

model each background pixel using a mixture of three Gaussians corresponding to road,
vehicle and shadows. This model is initialized using an EM algorithm. Then, the Gaussians
are manually labeled in a heuristic manner as follows: the darkest component is labeled as
shadow; in the remaining two components, the one with the largest variance is labeled as
vehicle and the other one as road.


This remains fixed for the whole process, giving a lack of adaptation to changes over time.
First, each pixel is characterized by its intensity in the RGB color space. Then, the probability
of observing the current pixel value X_t is given, in the multidimensional case, by

P(X_t) = Σ_{i=1..K} ω_{i,t} η(X_t, μ_{i,t}, Σ_{i,t})

where K is the number of Gaussians and ω_{i,t}, μ_{i,t}, Σ_{i,t} are the weight, mean, and
covariance of the i-th component at time t.

Foreground Detection
In this case, the ordering and labeling phases are conserved and only the matching test

is changed to be more exact statistically. Indeed, Stauffer and Grimson check every new
pixel against the K existing distributions to classify it as a background or a foreground
pixel.
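The per-pixel matching test can be sketched as follows for grey-level values; the parameter values are illustrative, and a real implementation also updates the weights, means, and variances online:

```python
import numpy as np

def classify_pixel(x, weights, means, sigmas, bg_thresh=0.7, match_k=2.5):
    """Classify one grey-level pixel value against K Gaussians,
    Stauffer-Grimson style: distributions are ranked by weight/sigma,
    the first ones whose cumulative weight exceeds bg_thresh form the
    background model, and x is background if it lies within
    match_k standard deviations of one of them."""
    order = np.argsort(-np.asarray(weights) / np.asarray(sigmas))
    cum = 0.0
    background = []
    for i in order:
        background.append(i)
        cum += weights[i]
        if cum > bg_thresh:
            break
    for i in background:
        if abs(x - means[i]) < match_k * sigmas[i]:
            return 'background'
    return 'foreground'
```

A pixel matching none of the background distributions is declared foreground, which is exactly the decision the survey's foreground detection step describes.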
The survey allows the reader to compare the strategies and can effectively guide the
selection of the best improvement for a specific application.

A drawback is the misdetection of foreground objects and background.


-Matthias Grundmann, Vivek Kwatra, Mei Han, Irfan Essa
This hierarchical approach generates high quality segmentations, which are
temporally coherent with stable region boundaries, and allows subsequent applications to
choose from varying levels of granularity. It further improves segmentation quality by using
dense optical flow to guide temporal connections in the initial graph, and also proposes two
novel approaches to improve the scalability of this technique:
(a) a parallel out-of-core algorithm that can process volumes much larger than an in-core algorithm.
(b) a clip-based processing algorithm that divides the video into overlapping clips in
time, and segments them successively while enforcing consistency.


Methods used

Graph-based Algorithm
Specifically, for image segmentation, a graph is defined with the pixels as nodes,

connected by edges based on an 8-neighborhood. Edge weights are derived from the per-pixel
normalized color difference.

Hierarchical Spatio-Temporal Segmentation:

The above algorithm can be extended to video by constructing a graph over

the spatio-temporal video volume with edges based on a 26-neighborhood in 3D space-time.

Following this, the same segmentation algorithm can be applied to obtain volumetric regions.
This simple approach generally leads to somewhat underwhelming results due to several
drawbacks.

Parallel Out-of-Core Segmentation:

It consists of a multi-grid-inspired out-of-core algorithm that operates on a

subset of the video volume. Performing multiple passes over windows of increasing size, it
still generates a segmentation identical to the in-memory algorithm. Besides segmenting large
videos, this algorithm takes advantage of modern multi-core processors, and segments several
parts of the same video in parallel.
This method applies the segmentation algorithm to a wide range of videos, from classic
examples to long dynamic movie shots, studying the contribution of each part of the
approach. However, as it defines a graph over the entire video volume, there is a restriction on
the size of the video that it can process, especially for the pixel-level over-segmentation stage.



-Yuri Boykov, Olga Veksler, Ramin Zabih

This paper addresses the problem of minimizing a large class of energy functions that
occur in early vision. The major restriction is that the energy function's smoothness term must
only involve pairs of pixels. It proposes two algorithms that use graph cuts to compute a local
minimum even when very large moves are allowed.


The first move we consider is an α-β swap: for a pair of labels α, β, this move
exchanges the labels between an arbitrary set of pixels labeled α and another arbitrary set
labeled β.
The first algorithm generates a labeling such that there is no swap move that decreases the
energy. The second move we consider is an α-expansion: for a label α, this move assigns an
arbitrary set of pixels the label α. The second algorithm, which requires the smoothness term
to be a metric, generates a labeling such that there is no expansion move that decreases the
energy. Moreover, this solution is within a known factor of the global minimum.
Methods used

Energy minimization via graph cuts:

The most important property of these methods is that they produce a local minimum

even when large moves are allowed. In this section, we discuss the moves we allow, which
are best described in terms of partitions. We sketch the algorithms and list their basic
properties. We then formally introduce the notion of a graph cut, which is the basis for the
algorithms.

Graph cuts
The minimum cut problem is to find the cut with smallest cost. There are numerous

algorithms for this problem with low-order polynomial complexity; in practice these methods
run in near-linear time.
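As an illustration of the min-cut computation these methods rely on, the following is a minimal Edmonds-Karp max-flow sketch (by max-flow/min-cut duality, the returned flow value equals the cost of the minimum cut); production graph-cut systems use specialised solvers such as the Boykov-Kolmogorov algorithm:

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp max-flow on a small directed graph given as a
    dict-of-dicts of edge capacities. Repeatedly augments along the
    shortest residual path found by BFS."""
    res = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u in capacity:                      # add reverse residual edges
        for v in capacity[u]:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent = {s: None}                  # BFS for an augmenting path
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path = []                           # recover s -> t path
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res[u][v] for u, v in path)   # bottleneck capacity
        for u, v in path:
            res[u][v] -= aug
            res[v][u] += aug
        flow += aug
```

In a segmentation graph, the terminals s and t play the roles of the foreground and background labels, and edge capacities encode the unary and smoothness costs.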

A limitation is that these methods produce only a local minimum of the energy (within a
known factor of the global minimum for expansion moves).
-Xue Bai, Guillermo Sapiro
The proposed technique is based on the optimal, linear time, computation of weighted
geodesic distances to user-provided scribbles, from which the whole data is automatically
segmented. The weights are based on spatial and/or temporal gradients, considering the
statistics of the pixels scribbled by the user, without explicit optical flow or any advanced and
often computationally expensive feature detectors. These could be naturally added to the
proposed framework as well if desired, in the form of weights in the geodesic distances.


Methods used


Algorithm starts from two types of user-provided scribbles, F for foreground and B for
background, roughly placed across the main regions of interest. Now the problem is how to
learn from them and propagate this prior information/labeling to the entire image, exploiting
both the marked pixel statistics and their positions.

Feature Distribution Estimation

The role of the scribbles is twofold. The scribbles indicate spatial constraints, and also

collect labeled information from F/B regions. We use discriminative features to learn from
the samples on the scribbles (pixels marked by the user via the scribbles), and to classify all
the remaining pixels in the image.

Geodesic Distance:
We use the geodesic distance from these user-provided scribbles to classify the pixels

x in the image (outside of the scribbles), labeling them F or B. The geodesic distance d(x) is
simply the smallest integral of a weight function over all possible paths from the scribbles to
x (in contrast with the average distance used in random walks or diffusion/Laplace-based
approaches).
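A minimal sketch of computing such geodesic distances on a 4-connected pixel grid with Dijkstra's algorithm; the per-pixel weight array here is a placeholder for the gradient/likelihood-based weights described above:

```python
import heapq
import numpy as np

def geodesic_distance(weight, seeds):
    """Smallest accumulated weight from any seed pixel to every pixel,
    computed with Dijkstra on the 4-connected grid.
    weight: H x W per-pixel cost; seeds: list of (row, col) scribble pixels."""
    H, W = weight.shape
    dist = np.full((H, W), np.inf)
    heap = []
    for r, c in seeds:
        dist[r, c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r, c]:
            continue                      # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < H and 0 <= nc < W:
                nd = d + weight[nr, nc]   # cost of entering the neighbour
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return dist
```

Running this once from the F scribbles and once from the B scribbles, and comparing the two distance maps per pixel, gives the foreground/background classification described in the text.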

Presented a geodesics-based algorithm for (interactive) natural image, 3D, and video
segmentation and matting. We introduced the framework for still images and extended it to
video segmentation and matting, as well as to 3D medical data.
Although the proposed framework is general, we mainly exploited weights in the
geodesic computation that depend on the pixel value distributions. Algorithm does not work
when these distributions significantly overlap.



-Jeroen van Baar, Paul Beardsley, Marc Pollefeys, Markus Gross
This paper proposes an interactive method for the segmentation of objects in video. It
aims to exploit multiple modalities to reduce the dependency on color discrimination alone.
Given an initial segmentation for the first and last frame of a video sequence, this method
aims to propagate the segmentation to the intermediate frames of the sequence. Video frames
are first segmented into super pixels.
The segmentation propagation is then regarded as a super pixels labeling problem.
The problem is formulated as an energy minimization problem which can be solved
efficiently. Higher-order energy terms are included to represent temporal constraints. Our
proposed method is interactive, to ensure correct propagation and relabel incorrectly labeled
super pixels.
Methods used

Segmentation Propagation as Energy Minimization

We formulate the problem of propagating the known segmentations (for the first and
last frames) as an energy minimization:

E = Σ_i ψ(x_i) + Σ_{i,j} φ(x_i, x_j) + Σ_c θ(x_c)

Here ψ(x_i) represents a unary term, φ(x_i, x_j) represents a binary term between neighboring
super pixels x_i and x_j, and finally θ(x_c) represents a so-called higher-order clique term. Each
super pixel is assigned a label, with the set of labels L defined by the different segments.

Interactive Segmentation Correction

Super pixels may have an incorrect label after propagation. It is therefore necessary

for the user to correct these incorrect segmentation labels. Rather than requiring to re-label
individual pixels, in our case the interactive correction step can be more easily performed on
the super pixels directly.

Exploiting Multiple Modalities:

We exploit the thermal signal in the super pixel segmentation and in the matching of

super pixels. This is especially helpful for scenes with human actors, since the thermal signal
helps to separate the actors from their background, and could also help to separate actors
from each other.


Described an interactive video segmentation approach based on the propagation of
known segmentations for the first and last frame, to the intermediate frames of a video
sequence. The straightforward matching of super pixels across the video sequence could
easily handle moving cameras and non-rigidly moving objects.
The method is less efficient when the objects move very fast.


Technique Used / Method / Drawback

1. Saliency based video segmentation with graph cuts and sequentially updated priors
Method: Markov random field (MRF) model
Drawback: Segmented regions were randomly switched

2. Saliency detection using maximum symmetric surround
Method: Saliency computation
Drawback: The saliency maps generated by this method suffer from low resolution

3. Automatic object extraction in single concept videos
Method: Object modeling and extraction
Drawback: Some of the motion cues might be negligible due to low contrast

4. Visual saliency from image features with application to compression
Method: Feature points extraction
Drawback: Compression is just one possible use for an algorithm based on visually salient information

5. Visual attention detection in video sequences using spatiotemporal cues
Method: Saliency map
Drawback: Fails to highlight the entire salient region

6. Background modeling using mixture of Gaussians for foreground detection
Method: Background modeling
Drawback: Leads to misdetection of foreground objects and background

7. Efficient hierarchical graph-based video segmentation
Method: Hierarchical spatio-temporal segmentation
Drawback: There is a restriction on the size of the video it can process

8. Fast approximate energy minimization via graph cuts
Method: Energy minimization via graph cuts
Drawback: Produces only a low-energy (locally minimal) labeling

9. A framework for fast interactive image and video segmentation and matting
Method: Feature distribution estimation
Drawback: Does not work when the feature distributions significantly overlap

10. Interactive video segmentation supported by multiple modalities with an application to depth maps
Method: Segmentation propagation as energy minimization
Drawback: Not very efficient when objects move very fast

The project proposes efficient motion detection and people counting based on
background subtraction using a dynamic threshold. Here, three different methods are used
for object detection, and their performance is compared based on detection accuracy.
The techniques used are frame differencing, dynamic threshold based detection, and the mixture
of Gaussians model. After foreground object detection, parameters such as
speed, velocity and angle of motion will be determined.
Most previous methods depend on the assumption that the background is
static over short time periods. However, structured motion patterns of the background, which
are distinctive from variations due to noise, are hardly tolerated under this assumption and
thus still lead to high false positive rates with previous models. In dynamic threshold
based object detection, morphological processing and filtering are also used for removing
unwanted pixels from the background.
Along with this dynamic threshold, we introduce a background subtraction algorithm
for temporally dynamic texture scenes using a mixture of Gaussians, which has the ability to
greatly attenuate color variations generated by background motion while still highlighting
moving objects. Finally, the proposed method will be shown to be effective for background
subtraction in dynamic texture scenes compared to several competing methods, and the
parameters of the moving object will be evaluated for all methods.
In short, the goal is background subtraction for accurate moving object detection in dynamic
scenes using dynamic threshold detection and a mixture of Gaussians model, along with the
determination of object parameters.


Fig. 3.1. Block Diagram of Proposed System



Frame Separation

Frame Subtraction

Dynamic Threshold Approach

Morphological Filtering

Object Detection

Parameter analysis
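The stages above can be sketched end to end as follows; the mean-plus-k-standard-deviations threshold and the 3x3 opening are illustrative choices, not the exact rules of the proposed system:

```python
import numpy as np

def detect_moving_objects(prev, curr, k=2.0):
    """Sketch of the pipeline: frame subtraction, a dynamic threshold
    (mean + k*std of the difference image, recomputed per frame), and
    a morphological opening to drop single-pixel noise."""
    diff = np.abs(curr.astype(np.float64) - prev.astype(np.float64))
    thresh = diff.mean() + k * diff.std()        # dynamic, per-frame
    mask = diff > thresh

    def erode(m):    # 3x3 erosion: keep pixels whose whole window is set
        p = np.pad(m, 1, constant_values=False)
        out = np.ones_like(m)
        for i in range(3):
            for j in range(3):
                out &= p[i:i + m.shape[0], j:j + m.shape[1]]
        return out

    def dilate(m):   # 3x3 dilation: set pixels with any window pixel set
        p = np.pad(m, 1, constant_values=False)
        out = np.zeros_like(m)
        for i in range(3):
            for j in range(3):
                out |= p[i:i + m.shape[0], j:j + m.shape[1]]
        return out

    return dilate(erode(mask))                   # morphological opening
```

The opening removes isolated false detections while restoring the extent of genuine moving regions; connected component analysis and parameter measurement would follow on the returned mask.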

3.4.1. Frame Separation

A method to shorten the time required for frame separation in the time-division-multiplexed
delta modulation system is described. Employing successive "1"s as the sync
pattern, the system detects the sync channel out of the delta modulated information pulses,
which occur at a rate of one half on average.
Using a memory device with the capacity of one frame, the detection is performed by
taking successive frame correlation of each channel in parallel. For example, in the system
with 20 channels, the frame separation is established in approximately six frame periods.
The system includes such facilities as to stabilize the frame separation against the
error of the sync pattern and against the occurrence of the information pattern similar to the
sync pattern.
3.4.2. Frame Subtraction
In digital photography, dark-frame subtraction is a way to minimize image noise for
pictures taken with long exposure times. It takes advantage of the fact that a component of
image noise, known as fixed-pattern noise, is the same from shot to shot: noise from the
sensor, dead or hot pixels. It works by taking a picture with the shutter closed.
A dark frame is an image captured with the sensor in the dark, essentially just an
image of noise in an image sensor. A dark frame, or an average of several dark frames, can
then be subtracted from subsequent images to correct for fixed-pattern noise such as that
caused by dark current. Dark-frame subtraction has been done for some time in scientific imaging;
many newer consumer digital cameras offer it as an option, or may do it automatically for
exposures beyond a certain time.


Visible fixed-pattern noise is often caused by hot pixels: pixel sensors with higher-than-normal
dark current. On long exposures, they can appear as bright pixels. Sensors on the
CCD that always appear as brighter pixels are called stuck pixels while sensors that only
brighten up after long exposure are called hot pixels.
The dark-frame-subtraction technique is also used in digital photogrammetry, to
improve the contrast of satellite and air photograms, and is considered part of "best practice."
Each CCD has its own dark signal signature that is present on every acquired image
(bias, dark frame, light frame). By cooling the CCD to a very low temperature (down to
−30 °C), the dark signal is attenuated but not completely cancelled. Image processing is then
required to remove dark signature from each raw image. The method usually used consists of
acquiring dark frames that are subsequently subtracted from each image. This permits
keeping only relevant information related to the observed target.
During an acquisition session, an ideal dark frame is obtained by acquiring and
averaging several dark frames. This ideal dark frame has a reduced noise signature and is, in
theory, acquired with the same conditions (temperature and moisture) as the observation
images. This averaged dark frame will be subtracted from every raw image.
This approach for reducing dark signal produces very nice astronomical images. It is
suitable for many applications which do not require very accurate photometry results.
However, the technique is not necessarily sufficiently accurate for detecting and analyzing
very low signal-to-noise ratio objects.
3.4.3. Dynamic Threshold Approach
Fixed decision boundaries (or fixed threshold) classification approaches are
successfully applied to segment human skin. These fixed thresholds mostly fail in two
situations, as they only search for a certain skin color range:
1) Any non-skin object may be classified as skin if the non-skin object's color values
belong to the fixed threshold range.
2) Any true skin region may be mistakenly classified as non-skin if its color values do
not belong to the fixed threshold range. Instead of predefined fixed thresholds, novel online
learned dynamic thresholds are used to overcome the above drawbacks. The experimental
results show that our method is robust in overcoming these drawbacks.
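The cited work learns its thresholds online from skin samples; as one standard example of replacing a fixed threshold with a data-driven one, here is Otsu's method, which picks the threshold maximising the between-class variance of the grey-level histogram:

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold in [1, 255] maximising the between-class
    variance of the grey-level histogram (Otsu's method)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()      # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * p[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2       # between-class variance
        if var > best_var:
            best_t, best_var = t, var
    return best_t
```

Because the threshold is recomputed from each image's own histogram, it adapts to the data in the same spirit as the dynamic thresholds discussed here, though it is not the cited online-learning scheme itself.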


3.4.4. Morphological Filtering

Given a sampled binary image signal f[x], with values 1 for the image object and 0
for the background, typical image transformations involving a moving window set
W = {y1, y2, ..., yn} of n sample indexes would be

ψ_b(f)[x] = b(f[x − y1], ..., f[x − yn])

where b(v1, ..., vn) is a Boolean function of n variables. The mapping f → ψ_b(f) is
called a Boolean filter. By varying the Boolean function b, a large variety of Boolean filters
can be obtained. For example, choosing a Boolean AND for b would shrink the input image
object, whereas a Boolean OR would expand it.
Numerous other Boolean filters are possible, since there are 2^(2^n) possible Boolean
functions of n variables. The main applications of such Boolean image operations have been
in biomedical image processing, character recognition, object detection, and general 2D
shape analysis.
Among the important concepts offered by mathematical morphology was to use sets
to represent binary images and set operations to represent binary image transformations.
Specifically, given a binary image, let the object be represented by the set X and its
background by the set complement Xᶜ. The Boolean OR transformation of X by a (window)
set B is equivalent to the Minkowski set addition, also called dilation, of X by B:

X ⊕ B = {x + y : x ∈ X, y ∈ B}
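The Minkowski addition can be implemented directly by shifting the binary image by each offset of the structuring element and taking the union; the following is a small illustrative sketch (the helper is an assumption, not taken from the text):

```python
import numpy as np

def dilate(X, B):
    """Minkowski addition (dilation) of a binary image X by a
    structuring element B given as a list of (dy, dx) offsets:
    the union of X translated by each element of B."""
    out = np.zeros_like(X, dtype=bool)
    H, W = X.shape
    for dy, dx in B:
        shifted = np.zeros_like(out)
        # destination and source slices for a shift by (dy, dx)
        ys = slice(max(dy, 0), min(H + dy, H))
        xs = slice(max(dx, 0), min(W + dx, W))
        ys_src = slice(max(-dy, 0), min(H - dy, H))
        xs_src = slice(max(-dx, 0), min(W - dx, W))
        shifted[ys, xs] = X[ys_src, xs_src]
        out |= shifted                      # union over all translates
    return out
```

Erosion is the dual operation (an intersection of translates), and the opening/closing filters discussed below are compositions of the two.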


Extending morphological operators from binary to gray level images can be done by
using set representations of signals and transforming these input sets via morphological set
operations. Thus, consider an image signal f(x) defined on the continuous or discrete plane
ID = R² or Z², assuming values in R̄ = R ∪ {−∞, +∞}. Thresholding f at all amplitude levels v
produces an ensemble of binary images represented by the threshold sets

Θ_v(f) = {x ∈ ID : f(x) ≥ v},  −∞ < v < +∞


Lattice Opening Filters

The three types of nonlinear filters defined below have proven to be very useful for

image enhancement. If a 2D image f contains 1D objects, e.g. lines, and B is a 2D disk-like


structuring element, then the simple opening or closing of f by B will eliminate these 1D
objects. Another problem arises when f contains large-scale objects with sharp corners that
need to be preserved; in such cases opening or closing f by a disk B will round these corners.
These two problems could be avoided in some cases if we replace the conventional opening
with a radial opening.

3.4.5. Object Detection

A substantial amount of research has been done in developing techniques for locating
objects of interest automatically in digitized pictures. Drawing the boundaries around
objects is essential for pattern recognition, object tracking, image enhancement, data
reduction, and various other applications. This constitutes a good survey of research and
applications in image processing and picture analysis. Most researchers in picture analysis
have assumed that:
(1) The image of an object is more or less uniform or smooth in its local properties
(that is, illumination, color, and local texture change smoothly inside the image of an
object).
(2) There is a detectable discontinuity in local properties between the images of two
different objects. These two assumptions are adopted here, assuming no strong textural variation.
The work on automatic location of objects in digitized images has split into two
approaches: edge detection and edge following versus region growing. Edge detection applies
local independent operators over the picture to detect edges and then uses algorithms to trace
the boundaries by following the detected local edges. A recent survey of the literature covers this area.
The region growing approach uses various clustering algorithms to grow regions of
almost uniform local properties in the image for typical applications. More detailed
references will be given later.
In this method the two approaches are combined to complement each other; the result
is a more powerful mechanism to segment pictures into objects. This will develop a new edge
detector and combined it with new region growing techniques to locate objects; in so doing
we resolved the confusion in regular edge following that the results where more than one
isolated object on a uniform background is in the scene.
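A minimal sketch of the edge-detection half of this combination, in pure Python, applies a Sobel operator to a synthetic image containing a vertical step edge. The image and the helper name gradient_magnitude are illustrative assumptions, not the new edge detector described above.

```python
# Sobel gradient magnitude over the interior of a small grayscale image.

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def gradient_magnitude(img):
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            mag[y][x] = (gx * gx + gy * gy) ** 0.5
    return mag

# 8x8 image: left half dark (0), right half bright (100).
img = [[0] * 4 + [100] * 4 for _ in range(8)]
mag = gradient_magnitude(img)

# The response peaks along the step between columns 3 and 4 and is zero
# inside the two uniform regions - assumption (2) above in action.
print(mag[4][3], mag[4][4], mag[4][1])
```

An edge follower would then trace the chain of high-magnitude pixels, while a region grower would cluster the two flat halves; combining both views is what disambiguates multiple isolated objects on a uniform background.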


3.4.6. Parameters Analysis

This section analyzes the main parameters considered in this work.





Correlation and Convolution are basic operations that we will perform to extract
information from images. They are in some sense the simplest operations that we can perform
on an image, but they are extremely useful. Moreover, because they are simple, they can be
analyzed and understood very well, and they are also easy to implement and can be computed
very efficiently.
Our main goal is to understand exactly what correlation and convolution do, and why
they are useful. We will also touch on some of their interesting theoretical properties; though
developing a full understanding of them would take more time than we have.
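The relationship between the two operations can be sketched in a few lines of pure Python: correlation slides a kernel over the signal and takes dot products, while convolution first flips the kernel, which is what makes it commutative and associative. The function names below are illustrative.

```python
# 1-D correlation and convolution on plain Python lists ("valid" mode).

def correlate(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def convolve(signal, kernel):
    return correlate(signal, kernel[::-1])  # flip the kernel, then correlate

signal = [0, 0, 1, 0, 0]          # a unit impulse
kernel = [1, 2, 3]

print(correlate(signal, kernel))  # kernel appears reversed around the impulse
print(convolve(signal, kernel))   # convolution reproduces the kernel itself
```

Running both on a unit impulse makes the difference visible: convolution with an impulse returns the kernel unchanged, which is why the kernel is often called the impulse response.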
We estimate the sensitivity of the image processing operators with respect to
parameter changes by performing a sensitivity analysis. This is not new to the image
processing community but often left out for performance reasons.
Image quality assessment is an important but difficult issue in image processing applications such as compression coding and digital watermarking. For a long time, mean square error (MSE) and peak signal-to-noise ratio (PSNR) have been widely used to measure the degree of image distortion, because they represent the overall gray-value error contained in the entire image and are mathematically tractable. In many applications, it is straightforward to design systems that minimize MSE or PSNR.
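For two images flattened to lists of gray values, the two metrics can be sketched as follows; the peak value 255 assumes 8-bit images, and the helper names are illustrative.

```python
# MSE and PSNR between an original and a distorted image (pure Python).
import math

def mse(a, b):
    """Mean squared gray-value error over all pixels."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in decibels; higher means less distortion."""
    m = mse(a, b)
    return float('inf') if m == 0 else 10.0 * math.log10(peak * peak / m)

original  = [50, 100, 150, 200]
distorted = [52, 98, 149, 203]

print(mse(original, distorted))
print(round(psnr(original, distorted), 2))
```

Note that both functions treat every pixel identically and independently, which is exactly the limitation discussed next.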

MSE works satisfactorily when the distortion is mainly caused by additive noise. However, the problem inherent in MSE and PSNR is that they take into account neither the viewing conditions nor visual sensitivity with respect to image content. With MSE or PSNR, only the gray-value differences between corresponding pixels of the original and the distorted version are considered: pixels are treated as independent of their neighbors, and all pixels in an image are assumed to be equally important. This, of course, is far from true; pixels at different positions in an image can have very different effects on the human visual system (HVS).

3.4.7. Advantages

Less sensitive to noise

Automatic background updating model

Higher accuracy

3.4.8. Applications

Video surveillance

Object detection

People counting






The computer used for the development of the project has the following hardware configuration:

RAM         : 256 MB DDR
Hard Disc   : 80 GB
Floppy Disc : 1.44 MB
CD Drive
Monitor     : Samsung 17"
Keyboard    : Samsung, 108 keys
Mouse       : Logitech scroll mouse
Zip Drive   : 250 MB
Printer     : HP DeskJet


The project was developed using the following software:

Operating System : Windows XP

The software used in this project is MATLAB 7.10 or above.


Introduction To Matlab
MATLAB is a product of The MathWorks, Inc. The name MATLAB stands for MATRIX LABORATORY. It is a high-performance language for technical computing.

It integrates computation, visualization, and programming in an easy-to-use
environment where problems and solutions are expressed in familiar mathematical notation.
Typical uses include:
1. Math and computation
2. Algorithm development
3. Data acquisition
4. Modeling, simulation, and prototyping
5. Data analysis, exploration, and visualization
6. Scientific and engineering graphics
7. Application development, including graphical user interface building

MATLAB is an interactive system whose basic data element is an array that does not require dimensioning. This allows you to solve many technical computing problems, especially those with matrix and vector formulations, in a fraction of the time it would take to write a program in a scalar, non-interactive language such as C or Fortran.

MATLAB was originally written to provide easy access to matrix software developed by the LINPACK and EISPACK projects. Today, MATLAB engines incorporate the LAPACK and BLAS libraries, embedding the state of the art in software for matrix computation.

MATLAB has evolved over a period of years with input from many users. In
university environments, it is the standard instructional tool for introductory and advanced
courses in mathematics, engineering, and science. In industry, MATLAB is the tool of choice
for high-productivity research, development, and analysis.


MATLAB features a family of add-on, application-specific solutions called toolboxes.

Very important to most users of MATLAB, toolboxes allow you to learn and apply specialized technology. Toolboxes are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems. Areas in which toolboxes are available include signal processing, control systems, neural networks, fuzzy logic, wavelets, simulation, and many others.
The MATLAB system consists of five main parts

Desktop tools and development environment

This is the set of tools and facilities that help you use MATLAB functions and files.

Many of these tools are graphical user interfaces. It includes the MATLAB desktop and
Command Window, a command history, an editor and debugger, a code analyzer and other
reports, and browsers for viewing help, the workspace, files, and the search path.

MATLAB windows
1. Command window: This is the main window, characterized by the MATLAB command prompt >>. All commands, including those for running user-written programs, are typed in this window at the MATLAB prompt.
2. Graphics window: The output of all graphics commands is flushed into the graphics or figure window.
3. Edit window: This is the place to create, write, edit, and save programs in files called M-files. Any text editor can be used to carry out these tasks.

The MATLAB mathematical function library

This is a vast collection of computational algorithms ranging from elementary

functions, like sum, sine, cosine, and complex arithmetic, to more sophisticated functions like
matrix inverse, matrix eigen values, Bessel functions, and fast Fourier transforms.


The MATLAB language

This is a high-level matrix/array language with control flow statements, functions,

data structures, input/output, and object-oriented programming features. It allows both

programming in the small to rapidly create quick and dirty throw-away programs, and
programming in the large to create large and complex application programs.

Graphics handler

MATLAB has extensive facilities for displaying vectors and matrices as graphs, as well as for annotating and printing these graphs. It includes high-level functions for two-dimensional and three-dimensional data visualization, image processing, animation, and presentation graphics. It also includes low-level functions that allow you to fully customize the appearance of graphics and to build complete graphical user interfaces for your MATLAB applications.
4.3.3. The MATLAB application program interface
This is a library that allows you to write C and Fortran programs that interact with MATLAB. It includes facilities for calling routines from MATLAB (dynamic linking), calling MATLAB as a computational engine, and reading and writing MAT-files.
4.3.4. MATLAB documentation
MATLAB provides extensive documentation, in both printed and online formats, to help you learn about and use all of its features. If you are a new user, start with the Getting Started book: it covers all the primary MATLAB features at a high level, including many examples. The MATLAB online help provides task-oriented and reference information about MATLAB features and functions.

Introduction to M-function programming

One of the most powerful features of MATLAB is the capability it provides users to

program their own new functions. MATLAB function programming is flexible and
particularly easy to learn.



M-files in MATLAB can be scripts that simply execute a series of MATLAB statements, or they can be functions that accept arguments and produce one or more outputs. M-file functions extend the capabilities of both MATLAB and the Image Processing Toolbox to address specific, user-defined applications. M-files are created using a text editor and are stored with a name of the form filename.m, such as average.m and filter.m.

The components of a function M-file are,

The function definition line

The H1 line

Help text

The function body




MATLAB operators are grouped into three main categories:

Arithmetic operators that perform numeric computations

Relational operators that compare operands quantitatively

Logical operators that perform the functions AND, OR, and NOT


Image Processing Toolbox

Image Processing Toolbox provides a comprehensive set of reference-standard algorithms and graphical tools for image processing, analysis, visualization, and algorithm development. You can perform image enhancement, image deblurring, feature detection, noise reduction, image segmentation, geometric transformations, and image registration. Many toolbox functions are multithreaded to take advantage of multicore and multiprocessor computers.
Image Processing Toolbox supports a diverse set of image types, including high dynamic range, gigapixel resolution, embedded ICC profile, and tomography. Its capabilities include spatial image transformations, morphological operations, neighborhood and block operations, linear filtering and filter design, transforms, image analysis and enhancement, image registration, deblurring, and region-of-interest operations.


You can extend the capabilities of Image Processing Toolbox by writing your own M-files, or by using the toolbox in combination with other toolboxes, such as Signal Processing Toolbox and Wavelet Toolbox.
Graphical tools let you explore an image, examine a region of pixels, adjust the contrast, create contours or histograms, and manipulate regions of interest (ROIs). With toolbox algorithms you can restore degraded images, detect and measure features, analyze shapes and textures, and adjust color balance.

Interfacing with other languages

Libraries written in Java, ActiveX, or .NET can be directly called from MATLAB, and many MATLAB libraries (for example, XML or SQL support) are implemented as wrappers around Java or ActiveX libraries. Calling MATLAB from Java is more complicated, but can be done with a MATLAB extension, which is sold separately by MathWorks, or using an undocumented mechanism called JMI (Java-to-MATLAB Interface), which should not be confused with the unrelated Java Metadata Interface that is also called JMI.
As alternatives to the MuPAD-based Symbolic Math Toolbox available from MathWorks, MATLAB can be connected to Maple or Mathematica.
MATLAB has a direct interface to modeFRONTIER, a multidisciplinary and multi-objective optimization and design environment written to allow coupling to almost any computer-aided engineering (CAE) tool. Once a result has been obtained using MATLAB, the data can be transferred to and stored in modeFRONTIER.


System design is the process of planning a new system to complement or altogether replace the old system. The design phase is the first step in moving from the problem domain to the solution domain. The design of the system is a critical aspect that affects the quality of the software. System design is also called top-level design. The design phase translates the logical aspects of the system into physical aspects of the system.




Visual Saliency




Background subtraction for accurate moving-object detection from dynamic scenes, using dynamic threshold detection and a mixture-of-Gaussians model, together with determination of object parameters with the help of morphological filtering, will be done in the next phase.
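As a rough preview of that phase, the sketch below uses the simplest background model, a running average with a fixed threshold, rather than the planned mixture-of-Gaussians model. The learning rate, threshold, and pixel values are illustrative assumptions, not the project's parameters.

```python
# Running-average background subtraction over 1-D "frames" (pure Python).

ALPHA = 0.5   # background learning rate (assumed for illustration)
THRESH = 20   # foreground decision threshold (assumed for illustration)

def update(background, frame, alpha=ALPHA):
    """Blend the new frame into the background model."""
    return [(1 - alpha) * b + alpha * f for b, f in zip(background, frame)]

def foreground_mask(background, frame, thresh=THRESH):
    """A pixel is foreground if it deviates strongly from the model."""
    return [int(abs(f - b) > thresh) for b, f in zip(background, frame)]

# Three frames: a static scene, then an object appearing at pixel 2.
frames = [[10, 10, 10, 10], [10, 10, 10, 10], [10, 10, 200, 10]]

bg = frames[0]
masks = []
for frame in frames[1:]:
    masks.append(foreground_mask(bg, frame))
    bg = update(bg, frame)

print(masks)  # the second mask flags the new object at pixel 2
```

A mixture-of-Gaussians model generalizes this by keeping several weighted Gaussian components per pixel instead of a single average, which is what allows it to handle dynamic backgrounds.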




1. Saliency-based video segmentation with graph cuts and sequentially updated priors,
K. Fukuchi, K. Miyazato, A. Kimura, S. Takagi, and J. Yamato,
in Proc. IEEE Int. Conf. Multimedia Expo, Jun.-Jul. 2009, pp. 638-641.

2. Saliency detection using maximum symmetric surround,
R. Achanta and S. Süsstrunk,
in Proc. IEEE Int. Conf. Image Process., Sep. 2010, pp. 2653-2656.

3. Automatic object extraction in single concept videos,
K.-C. Lien and Y.-C. F. Wang,
in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2011, pp. 1-6.

4. Visual saliency from image features with application to compression,
P. Harding and N. M. Robertson,
Cognit. Comput., vol. 5, no. 1, pp. 76-98, 2012.

5. Visual attention detection in video sequences using spatiotemporal cues,
Y. Zhai and M. Shah,
in Proc. ACM Int. Conf. Multimedia, 2006, pp. 815-824.

6. Background modeling using mixture of Gaussians for foreground detection - a survey,
T. Bouwmans, F. E. Baf, and B. Vachon,
Recent Patents Comput. Sci., vol. 3, no. 3, pp. 219-237, 2008.

7. Efficient hierarchical graph-based video segmentation,
M. Grundmann, V. Kwatra, M. Han, and I. Essa,
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2141-2148.

8. Fast approximate energy minimization via graph cuts,
Y. Y. Boykov, O. Veksler, and R. Zabih,
IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222-1239, Nov. 2001.

9. A geodesic framework for fast interactive image and video segmentation and matting,
X. Bai and G. Sapiro,
in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2007, pp. 1-8.

10. Key-segments for video object segmentation,
Y. J. Lee, J. Kim, and K. Grauman,
in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 1995-2002.