
Real-Time Object Detection and Tracking

Hammad Naeem∗ , Jawad Ahmad∗ and Muhammad Tayyab∗

∗ HITEC University Taxila Cantt, Pakistan.

Abstract—Real-time object detection and tracking is a vast, vibrant, yet inconclusive and complex area of computer vision. Its increased use in surveillance, security tracking systems and many other applications has propelled researchers to continuously devise more efficient and competitive algorithms. However, problems emerge when implementing object detection and tracking in real time: tracking under a dynamic environment, computation that is too expensive to meet real-time performance, and multi-camera multi-object tracking all make this task strenuously difficult. Although many methods and techniques have been developed, in this literature review we discuss some well-known and basic methods of object detection and tracking. In the end we also give their general applications and results.

Keywords: real-time, object detection, tracking, surveillance

978-1-4799-3043-2/13/$31.00 2013 IEEE 148

Object tracking, using video sensing technique, is one of the major areas of research due to its increased commercial applications such as surveillance systems, mobile robots, medical therapy, security systems and driver assistance systems [1]. Object tracking, by definition, is to track an object (or multiple objects) over a sequence of images. Tracking is usually performed for higher-level applications that require the location of the object in every frame.

The most popular application in this area is vision-based surveillance, which helps to understand the movement patterns of people with suspicious actions. Traffic scene analysis is also a well-known application, in which tracking information is used to keep vehicles in lane and prevent accidents. Thus, object detection and tracking under dynamic conditions is still a challenge for real-time performance, which requires the computational complexity to be minimal.

Various methods for object detection have been proposed, such as feature-based and template-based object detection and background subtraction [1]. But the selection of the best technique for a specific application is relative and depends upon the hardware resources and the scope of the application. Feature-based detection searches for corresponding features in successive frames, including Harris corners, edges, SIFT, contours or colour pixels.

Background subtraction is a popular method which uses a static background and calculates the difference between the hypothesized background and the current image. This approach is fast and good for a fixed background, but it cannot deal with a dynamic environment, with different illumination and motions of small objects [2]. The goal of tracking is to establish a correspondence between the detected target objects across image frames. Tracking using a mean shift kernel is also introduced. This method performs well except when there is occlusion, which can be handled using templates [2]. Camshift (Continuously Adaptive Meanshift) can track a single object quickly and robustly using color features, but it is ineffective under occlusion.

There is also research on appearance-based object detection [3]. It uses whole 2-D images to perform tracking for navigation in faster time. However, this kind of approach requires several templates and does not work when the target object, color or perspective view is changed. The main problems in object detection and tracking are the temporal variation of objects due to perspective, occlusion, interaction between objects and the appearance or disappearance of objects. These cause the appearance of a target to change during long tracking. The background in a long image sequence is also dynamic even if it is taken by a stationary camera [4].

Detection and tracking of multiple objects at the same time is an important issue for real-time performance [5]. The comprehensive search in multiple-object tracking is computationally expensive and cannot run as a real-time system. Another issue arises when a moving camera is used instead of a camera with a fixed location, which requires an analysis of the camera platform coordinate system.

In this review, five different real-time object detection and tracking techniques are analyzed in terms of accuracy, computational time and memory consumption, and the best among these is proposed for real-time implementation in mobile robots. The techniques are: (1) object tracking by image differencing [1, 2]; (2) object tracking by using non-parametric local transformations [4]; (3) object tracking by using morphological based object detection [6]; (4) object tracking by the Kanade-Lucas technique [3]; and (5) the Meanshift algorithm. In the end, based on simulation results, we propose the Kanade-Lucas algorithm as the most feasible video sensing algorithm for real-time implementation in mobile robots because it achieves the best results in computational time, accuracy and memory consumption. The development of hardware technology also affects the real-time performance of object detection and tracking. A real-world object tracking system has to be robust enough to tackle changing environments under real-time constraints and with limited processing resources and memory. For example, Ptrack is used as a solution for low-cost, fast, marker-based camera tracking.

Consequently, handling complex tracking using a software-only solution is not flexible, since it is limited by the processing capability under real-time constraints [7]. Real-time application requires that the tracking system designed must be fast enough, power efficient, with managed memory to meet the hard real-time constraints. Thus, the organization of this paper is as follows. Section I presents a detailed study of each algorithm with the steps involved for implementation. Section II gives a brief comparison of each algorithm and proposes the suitable algorithm for real-time object detection and tracking.
Finally, in Section III we conclude by summarizing the important aspects of the studied algorithms and outlining future work.

1) A. J. Lipton, H. Fujiyoshi and R. S. Patil discussed an algorithm based on frame differencing [8]. The algorithm has no complexities and is fairly accurate for objects that do not change size, color, etc. However, the error between two frames grows exponentially throughout the video sequence.
2) Stein, Rosenberg and Werman [4] proposed the use of non-parametric local transforms for object tracking. Ramin Zabih and John Woodfill used this idea and developed one such transform known as the Census Transform. Their approach works well even in an environment which is corrupted by noise and suffers from illumination changes. However, it is computationally expensive.
3) Owens, Hunter and Fletcher [9] proposed an algorithm that tracks moving objects based on morphological characteristics. This algorithm provides a solution to object merging while tracking multiple objects; however, object recognition through morphological methods is a somewhat complex process and is repeated continuously [9].
4) Carlo Tomasi and Takeo Kanade [10] proposed a simple object tracking method which minimizes the sum of squared intensity differences between two consecutive frames. This method is computationally fast and robust in nature and is recommended for real-time object tracking.

A. Absolute Differences
Jaewon Shin [11] explained the absolute difference method, which works on the principle of taking two image frames and finding the absolute difference between the grey-level intensities of corresponding pixels [11] [5], as shown in Figures 1 and 2. Absolute differences can be mathematically represented using the following equation.

$$D(t) = |I(t_i) - I(t_j)| \tag{1}$$

where I(t_i) is the image I at time i, I(t_j) is the image I at time j, and D(t) is the absolute difference for that time. In an ideal case when there is no motion,

$$I(t_i) = I(t_j) \tag{2}$$

and D(t) = 0.

Fig. 1: a) Input image with objects. b) Background model

Fig. 2: Detected Objects.

There are two methods which can be used for motion detection using absolute differences. They are as follows:
1) Frame Difference
2) Background Subtraction
This technique is very simple to implement, and the results produced by it, using both of the above methods, are satisfactory for the purpose of motion detection, but they have some serious restrictions. Some of the drawbacks which can be observed in this algorithm are:
1) Both absolute differencing methods are based on differences in grey-level pixel intensities, which involve a lot of computation and take a large number of instructions that are stored in the hardware memory, which makes the algorithm time consuming.
2) Not feasible for DSP implementation because of the large instruction set.

B. Census Transform Method
A better video sensing technique for object tracking, which uses non-parametric local transforms as the basis for correlation, was presented by Zabih and Woodfill [12]. Non-parametric local transforms mainly depend upon the local order of intensity values surrounding some central pixel instead of the actual intensity values of the pixels [4]. The two main steps towards object tracking using these transforms are:

• Apply local transform to consecutive frames.
• Compute correspondence of similar pixels between the two frames using correlation.

In this method, consecutive difference images are first formed by differencing consecutive frames of a video [13]. Then, for each pixel in these difference images, the pixels in its local surrounding, of some size, are replaced either by bit 1 if they are greater than this pixel or by 0 otherwise, and in this way a bit-string, called the signature vector, is generated for each pixel [12], [13]. This is shown in Figure 3. Then separate lists for each image are generated which contain the signature vectors for all pixels with the corresponding pixel positions [12], [13].
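The signature-vector step of the census method can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation evaluated in [12] or [13]: the function names `census_signatures` and `hamming_distance`, the 3 × 3 neighbourhood (`radius=1`), the edge padding at the image border and the use of the Hamming distance as the matching score are all assumptions made for the example. Following the text, a neighbour contributes bit 1 when it is greater than the central pixel.

```python
import numpy as np


def census_signatures(image, radius=1):
    """Return one integer signature per pixel.

    Each neighbour in the (2*radius+1)^2 window contributes one bit:
    1 if the neighbour is greater than the central pixel, else 0.
    Signatures are packed into uint64, so radius must be at most 3.
    """
    img = np.asarray(image, dtype=np.int64)
    h, w = img.shape
    # Assumption: replicate border pixels so every pixel has a full window.
    padded = np.pad(img, radius, mode="edge")
    signatures = np.zeros((h, w), dtype=np.uint64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # the central pixel is not compared with itself
            neighbour = padded[radius + dy:radius + dy + h,
                               radius + dx:radius + dx + w]
            bit = (neighbour > img).astype(np.uint64)
            signatures = (signatures << np.uint64(1)) | bit
    return signatures


def hamming_distance(sig_a, sig_b):
    """Number of differing bits between two signatures (lower is more similar)."""
    return bin(int(sig_a) ^ int(sig_b)).count("1")


frame = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
signatures = census_signatures(frame)
# A brightness change that preserves the local ordering of intensities.
brighter = census_signatures(2 * frame + 50)
```

Because only the ordering of intensities inside the window is used, `signatures` and `brighter` are identical, which is the property behind the robustness to illumination changes claimed for this method.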

Fig. 3: Census Method: Signature Vector Generation

Fig. 4: Census Method Steps

The steps of the census method are shown in Figure 4.
This algorithm is immune to noise and illumination changes because the value of any pixel depends upon the values of the surrounding pixels, and a change in one pixel does not affect the output much. Furthermore, it takes less time and is faster and more accurate than the previous algorithm, but it is still not suitable for hardware implementation in mobile robots. For hardware implementation, these parameters need much more improvement than this algorithm can achieve.

C. Feature Based Methods
A feature-based algorithm involves extracting regions of interest, clustering them into higher-level features and matching them in consecutive frames. A combined feature tracking approach is introduced using background subtraction and a feature grouping algorithm. The tracking process includes three steps: corner feature tracking, cluster tracking and object-level tracking.

1) Corner feature tracking [2]: The corner features are only detected in the foreground region, which is estimated by the background subtraction algorithm. The detected corners are tracked by applying cross-correlation template matching on a small image patch. Then, a variation of an Expectation-Maximization (EM) algorithm is applied to group these corners into small clusters. Finally, the clusters are grouped into objects. This method can track multiple objects of various sizes well and provides a set of robust trajectories.

2) Morphology Based Object Tracking: This technique is relatively smart and more complex. It tracks objects by taking into consideration their anatomy [14]. The technique involves three basic processes: 1) background estimation, 2) frame differencing and 3) object registration [14]. Two consecutive frames are subtracted from each other and a binary threshold is applied. This yields the moving objects between the frame and the estimated background. These objects are then registered based on their morphological characteristics, i.e. their width, area, height and histogram [14]. The same process is repeated for the subsequent frames, i.e. background estimation, frame differencing and object registration [14]. The newly registered objects are then matched to the previously registered objects using a cost function [14], and the optical flow is calculated [14]. This is shown in Figures 5 and 6.

Fig. 5: (a) Object delineated by rectangle calculated from binary silhouette (b). (c) Segmented object

Fig. 6: Objects are successfully tracked per frames

The computational efficiency of this method appears to be much better than that of the other methods, but it depends largely on the number of contours or the number of moving objects in the video sequence. This greatly affects its computational efficiency, as it normally generates multiple contours and its number of operations increases exponentially with the number of contours, making it computationally expensive. But this technique is definitely smarter than the Census transform and Absolute difference techniques. Its major advantages are that it is capable of tracking multiple objects in the video sequence and that it handles the object merging problem effectively while tracking multiple objects. The drawbacks are its complexity and continuous repetition, resulting in computational overhead.

D. Kanade-Lucas Technique
Object tracking using the Kanade-Lucas technique can be achieved in the following steps:
First, it takes two consecutive frames from a video [3]. In case there is any motion in the incoming video, the second frame can be considered as the previous frame with a small shift, and the frame error in a window of size (2w_x + 1) × (2w_y + 1) is calculated as:

$$\varepsilon(d) = \sum_{x=u_x-w_x}^{u_x+w_x} \sum_{y=u_y-w_y}^{u_y+w_y} \bigl( I(x, y, t) - J(x + d_x, y + d_y, t) \bigr)^2 \tag{3}$$

which can be simplified into

$$G = \sum_{x=u_x-w_x}^{u_x+w_x} \sum_{y=u_y-w_y}^{u_y+w_y} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \tag{4}$$

$$b = \sum_{x=u_x-w_x}^{u_x+w_x} \sum_{y=u_y-w_y}^{u_y+w_y} \begin{bmatrix} \delta I \, I_x \\ \delta I \, I_y \end{bmatrix} \tag{5}$$

where I_y and I_x are the image gradients in the y and x directions, respectively, and δI is the intensity difference between the two frames. After acquiring consecutive frames, their derivatives in the y and x directions are calculated to evaluate G [3]. Then the first image is searched for some trackable features using a detection algorithm. After selecting the features which are to be tracked, the matrix b is found for these frames and then the Kanade-Lucas model, given below, is applied to compute the optical flow for these features.

$$V = G^{-1} b$$

This process is repeated in iterations to minimize the error and to get the optimum value of V. The value of V, when optimized, approximately provides the difference between the previous and new positions of the selected features. So the new feature positions can be found by adding V to the previous positions.
The Kanade-Lucas algorithm apparently solves most of the problems which were observed in the previously studied algorithms. It is much more accurate and has a better error detection rate with much less computational time. However, it works on the constraint that the optical flow between two image frames is constant in a local neighborhood around the central pixel under consideration at any given time [3].

E. Mean-Shift Method
The Mean-Shift method is used to find the mode of the probability distribution of samples. The mean-shift algorithm is a non-parametric probability density estimation method that finds modes by climbing density gradients. This avoids choosing a model and estimating its distribution parameters [5].
In the mean shift algorithm, the density gradient estimate is given by the following formula.

$$\hat{\nabla} f_{h,K}(x) = \frac{2 C_{k,d}}{n h^{d+2}} \sum_{i=1}^{n} (x_i - x)\, g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right) \tag{6}$$

where g is the kernel function, which is a weighting function used in density estimation, describing the relative likelihood for this random variable to occur at a given point in the observation space. The first term is proportional to the density estimate at x, while the second term is the mean shift vector, obtained by subtracting x from the weighted mean. Therefore, it can be calculated with the following formula.

$$M_{h,G}(x) = \frac{h^2 \, \hat{\nabla} f_{h,K}(x)}{(2/C) \, \hat{f}_{h,G}(x)} \tag{7}$$

Then the window is moved iteratively in the direction of the local gradient with the step of the mean shift vector, until convergence. In object tracking, mean shift is used to find the target candidate that is the most similar to a given target model. The object is represented using a colour histogram. The similarity of the object models is measured by the Bhattacharyya coefficient. Then the mode of the similarity function is found using the mean shift algorithm. The tracking process includes several steps: 1) initialize the location of the target in the current frame, compute the distribution models and evaluate the Bhattacharyya coefficient (the similarity function); 2) based on the mean shift vector, derive the new location of the target; 3) move the location until reaching the maximum of the similarity function. The procedure of tracking is shown in Figure 7.

Fig. 7: Meanshift Steps: Tracking Results and Correspond. (a) Frame 1: Target Initialization (b) Frame 2: Target Mode Finding (c) Frame 2: Mean Shift Iteration (d) Frame 2: Mean Shift Iteration 2

The mean shift algorithm has been improved in different ways. First, it was developed into the Camshift algorithm, which employs an adaptive window size in order to deal with dynamically changing colour distributions. The algorithm starts with an initial window size. For each frame, mean shift is applied with several iterations. The position and size of the window are calculated for the next frame. Then this process is repeated until convergence.
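The mean shift iteration described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions rather than the tracker of [5]: it seeks the mode of a set of sample points instead of a colour-histogram similarity surface, it assumes a Gaussian profile for g, and the names `mean_shift_mode`, `bandwidth`, `tol` and `max_iter` are hypothetical. Each step moves x to the g-weighted mean of the samples, so the displacement applied at every step is the mean shift vector (weighted mean minus x).

```python
import numpy as np


def mean_shift_mode(samples, start, bandwidth, tol=1e-6, max_iter=200):
    """Climb the kernel density estimate of `samples` from `start` to a mode."""
    samples = np.asarray(samples, dtype=float)
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Squared, bandwidth-scaled distance ||(x - x_i) / h||^2 to every sample.
        r2 = np.sum(((samples - x) / bandwidth) ** 2, axis=1)
        # Assumed Gaussian profile: g(r2) = exp(-r2 / 2).
        weights = np.exp(-0.5 * r2)
        weighted_mean = weights @ samples / weights.sum()
        shift = weighted_mean - x  # the mean shift vector
        x = weighted_mean
        if np.linalg.norm(shift) < tol:
            break  # converged: the window no longer moves
    return x


# Samples on a small symmetric grid centred on (5, 5).
grid = np.linspace(4.8, 5.2, 5)
points = np.array([[a, b] for a in grid for b in grid])
mode = mean_shift_mode(points, start=[4.0, 4.5], bandwidth=1.0)
```

In a tracker the samples would be pixel positions weighted by how well their colour matches the target histogram, and Camshift would additionally re-estimate the window size after each frame; neither refinement is shown here.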

Another improvement on the traditional mean shift algorithm is to use more features instead of only the colour feature. One method is to combine SIFT features with the mean shift tracking algorithm. The target is represented by a joint histogram of the two most descriptive features, selected out of 7 colour features and a shape-texture feature. The division histogram is used to calculate the shift vector in order to find the current position of the target. The target model is updated using the initial target model. The weight of the initial model in the update is determined by the similarity between the initial model and the current target. The updated model is calculated as shown in the following formula.

$$H_m^{i} = (1 - s_{ic}) H_i + s_{ic} H_{ci} \tag{8}$$

where H_i is the initial model; H_{ci} is the combined histogram of the current target appearance and the previous target model; and s_{ic} is the similarity between the initial model and the current target appearance. This method can track moving targets well in a surveillance video. But it fails in the case of a very similar object nearby or a long duration of heavy occlusion.

A. Application
There is a wide range of applications of real-time object detection and tracking. One example is detecting and tracking cars that pass on the highway. The tracked cars are then counted by the image processing system in order to provide traffic information. Another important and popular application is security surveillance. A suspicious movement by a person can be detected and tracked using a real-time surveillance system. There is rapid development from industry in this field to give better performance. Object tracking is also the main issue in vision-based mobile robots. A robot can track objects and use the feature information to build a map for localization.
In more complex environments, a combination of traditional and modern methods is applied, such as background subtraction combined with KLT/meanshift. An interesting application comes from the paper [1], where the system can detect and track objects for outdoor night surveillance. Since using only the traditional method cannot achieve the desired result, they used a two-stage algorithm. In the first stage, objects are detected using inter-frame contrast changes. In the second stage, motion prediction and spatial nearest neighbor data association are used to track objects and give feedback to the first stage. This application can be implemented in real time at more than 25 FPS at 320 × 240 resolution with a 1.8G CPU.
1) Lucas-Kanade Method: In the following application [5], KLT is used as the bottom-level approach to do real-time object tracking using multiple cameras. The object detection targets are described by contour labels in different colors. Feature points (see Figure 8, blue dots) are detected by the KLT algorithm. Optical flow is calculated for every feature point, which has a magnitude/length and a direction.
A moving camera requires the spherical camera platform coordinate. Hence, as the camera pans to the left from frame 310 to frame 312, the feature points on the background are tracked moving to the right. Objects are robustly detected and tracked in real time on a moving camera platform by employing the KLT algorithm at the bottom level and a SIR particle filter at the top level.

Fig. 8: Lucas Kanade Implementation. (a) Frame 310 (b) Frame 312

B. Discussion
Obviously it is difficult to analyze the strengths and weaknesses of each algorithm on one particular DSP processor, because each research group uses a dedicated hardware design according to the application task. Thankfully, Shah [6] recently analyzed different video tracking algorithms on the fixed-point DSP BF537, and from the findings of Shah we summarize the results as shown in Table I. The image size used in the experiments is 320 × 240 pixels per frame.

TABLE I: Timing Analysis of Different Methods
ALGORITHM | Arithmetic and Logic operations | Time taken by Algorithm | Program Execution time
Absolute Differencing | 4230100 | 16 | 13
Census Transform | 2416000 | 5.4 | 6
Morphological | 352210 | 14.2 | 15
Kanade Lucas | 500825 | 0.486 | 2

From Table I, morphological based tracking appears computationally more efficient than Kanade Lucas in terms of arithmetic and logic operations (352210 versus 500825), but the problem with morphological based tracking is that it generates more than one contour even for a single moving object, and its operations increase exponentially with the number of contours. Thus it is clear that computationally Kanade Lucas is the most efficient algorithm [3].
Table II also discusses some of the advantages and drawbacks of each method.

IV. CONCLUSION AND FUTURE WORKS
In this literature review, we have analyzed five methods for real time object detection and tracking, in terms of accuracy,

computational time and memory consumption. The success of a video tracking algorithm depends upon the quality of its response to a high frame-rate input video. The higher the frame rate is, the more difficult it becomes for an algorithm to produce accurate results. Based on the simulation results, we recommend the 'Kanade-Lucas' algorithm for the hardware (real-time) implementation in mobile robots. The 'Kanade-Lucas' algorithm is the fastest, consumes the least memory and is the most accurate, with no implementation complexities. It gives high accuracy in scenarios where the distortion is very high. This algorithm is one of the standards in motion detection. Because of its iterative nature, it provides good support for video sequences having high frame rates. However, the Kanade-Lucas algorithm puts an additional constraint on the image intensities, which must be constant between consecutive frames.
It is almost impossible to evaluate the fundamental nature of each algorithm, as some algorithms produce good results in one set of conditions but may not perform well in another. For example, an algorithm that tracks an object by identifying a specific feature may work very efficiently in one scenario, but in the case of a video containing different objects with similar features, it may fail completely.
Therefore, some of the papers use a combination of methods to achieve appropriate results according to their application. For example, in [5] a bottom-level/top-level combination of KLT tracking and a particle filter is applied to perform multiple-object tracking with a moving camera. Also, in [2], Camshift is implemented with background modelling to perform moving object tracking in real-time color video. Similarly, a particular algorithm may work efficiently to track rigid bodies, but will fail to track non-rigid bodies such as humans.
The recommended algorithm works very well in the present scenario, i.e. for mobile robots. However, there are some limitations which need to be improved in future. They are as follows:
1) Camera interfacing on the mobile robots is a big problem. It needs to be improved for better results.
2) The other possible improvement is to design a position control system for the moving camera so that this algorithm may track motion in a video captured by a moving camera.

TABLE II: Result Comparison of Different Methods
Technique | Strengths | Weaknesses
Absolute Differences Technique [2][5] | 1. Easy to implement. 2. Allows continuous tracking despite cessations in motion videos. | 1. Computationally expensive. 2. Low accuracy. 3. Slow.
Feature Technique [1][6] | 1. Can track multiple objects. 2. Handles the problem of merging. | 1. Object registration is complex and computationally slow. 2. In case of multiple objects per frame, the algorithm becomes more complex.
Census Transform Technique [13] | 1. Immune to noise and illumination changes. | 1. A bit more accurate and faster, but computationally too expensive. 2. Large memory consumption.
Lucas Kanade Transform Technique [3] | 1. Efficient algorithm with high object detection rate. 2. Computation time is very low compared to other algorithms. 3. Fair response against noise and disturbances in real scenes. 4. Instruction set is simple. | 1. Takes large memory if the window size increases. 2. Algorithm works on the constraint that brightness will stay the same during the image tracking.
Mean Shift | 1. Efficient approach to tracking objects whose appearance is defined by color. | 1. Ineffective when there are occlusion problems. 2. It can only track a single moving object.

REFERENCES

[1] K. Huang, L. Wang, T. Tan, and S. Maybank, “A real-time object detecting and tracking system for outdoor night surveillance,” Pattern Recognition, vol. 41, no. 1, pp. 432–444, 2008.
[2] J. Li, F. Li, and M. Zhang, “A Real-time Detecting and Tracking Method for Moving Objects Based on Color Video,” in 2009 Sixth International Conference on Computer Graphics, Imaging and Visualization. IEEE, 2009, pp. 317–322.
[3] F. Labrosse, “Short and long-range visual navigation using warped panoramic images,” Robotics and Autonomous Systems, vol. 55, no. 9, pp. 675–684, 2007.
[4] W. Junqiu and Y. Yagi, “Integrating color and shape-texture features for adaptive real-time object tracking,” IEEE Trans on Image Processing, vol. 17, no. 2, pp. 235–240, 2008.
[5] Q. Wang and Z. Gao, “Study on a Real-Time Image Object Tracking System,” in Computer Science and Computational Technology, 2008. ISCSCT’08. International Symposium on, vol. 2, 2008.
[6] L. Braun, D. Göhringer, T. Perschke, V. Schatz, M. Hübner, and J. Becker, “Adaptive real-time image processing exploiting two dimensional reconfigurable architecture,” Journal of Real-Time Image Processing, vol. 4, no. 2, pp. 109–125, 2009.
[7] Y. Meng, “Agent-based reconfigurable architecture for real-time object tracking,” Journal of Real-Time Image Processing, vol. 4, no. 4, pp. 339–351, 2009.
[8] A. J. Lipton, H. Fujiyoshi, and R. S. Patil, “Moving target classification and tracking from real-time video,” in Applications of Computer Vision, 1998. WACV’98. Proceedings., Fourth IEEE Workshop on. IEEE, 1998, pp. 8–14.
[9] J. Owens, A. Hunter, E. Fletcher et al., “A fast model-free morphology-based object tracking algorithm,” 2002.
[10] C. Tomasi and T. Kanade, “Shape and motion from image streams under orthography: a factorization method,” International Journal of Computer Vision, vol. 9, no. 2, pp. 137–154, 1992.
[11] J. Shin, “Initialization of visual object tracker using frame absolute difference,” 2000.
[12] R. Zabih and J. Woodfill, “Non-parametric local transforms for computing visual correspondence,” in Computer Vision ECCV’94. Springer, 1994, pp. 151–158.
[13] S. Shah, T. Khattak, M. Farooq, Y. Khawaja, A. Bais, A. Anees, and M. Khan, “Real Time Object Tracking in a Video Sequence Using a Fixed Point DSP,” Advances in Visual Computing, pp. 879–888.
[14] Y. Yao, C. Chen, A. Koschan, and M. Abidi, “Adaptive online camera coordination for multi-camera multi-target surveillance,” Computer Vision and Image Understanding, 2010.