
Pedestrian Tracking Independent Study Spring 2009

Isaac Case

May 13, 2009


One ability of the human visual system is the ability to detect and follow motion. This is a critical skill in understanding the world around us as it helps us identify objects of importance and helps us identify change in our environment. This ability is highly desired in computer vision systems. Tracking motion can be used to monitor activity in an area. For this work, I have applied this idea of object tracking in an attempt to track pedestrian motion in video.



The goal of this work was to track people as they walk through a sequence of video frames. The system was intended to track only the motion of pedestrians; all other motion should be ignored. Partial and full occlusion should be compensated for. It should not matter how many people are in the video, or how large a person is relative to the video frame. Some of these goals were met; others could not be implemented within the ten weeks available for this research.




2.1 Overview

In order to accomplish this goal, it was necessary to break it down into smaller pieces. The current implementation is based on the idea of background subtraction. Given a background, we can take the difference between the current frame and the background to find the foreground. The foreground should be all of the objects that are worth investigation. With a given foreground, the next step is to segment it into regions of interest referred to as blobs. Then, with a set of blobs, we track the motion from frame to frame, matching these blobs across a series of frames.

2.2 Implementation

The implementation of this system followed the same process as described in the overview. Each part is a separate algorithm, but all work together to produce the results given. The parts are somewhat independent: it may be possible to replace one subsection without any effect on the others, so each could be researched independently.

2.2.1 Background Detection

The first operation was to determine the background for every frame. A very simple approach was tested at first. That implementation was as simple as just taking the average value of a series of frames as the background as in:

$$B = \frac{1}{k} \sum_{n=1}^{k} F_n$$

where $B$ is the found background, $k$ is the number of frames to be analyzed, and $F_n$ is the $n$th frame.

This assumes that the background fills the frame a majority of the time, that the background is fairly static, and that the value of k is large enough to force any non-background object to become insignificant. Given those conditions, the results can be fairly good; however, given any deviation from those conditions, the results worsen quickly, as can be seen in figure 1.



Figure 1: A Comparison of Averaged backgrounds where (a) meets the criteria for averaging and (b) does not. (b) is averaged over too few frames
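As a minimal sketch of the frame-averaging idea, the following Python (not the Matlab used for this work) averages a few tiny grayscale frames pixel by pixel. The function name `mean_background` and the nested-list frame layout are assumptions made for illustration only.

```python
def mean_background(frames):
    """Average k equally sized grayscale frames pixel by pixel: B = (1/k) * sum(F_n)."""
    k = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / k for c in range(cols)]
            for r in range(rows)]

# Three 1x3 frames; a bright object (value 200) covers the middle pixel in one frame.
frames = [[[10, 10, 10]], [[10, 200, 10]], [[10, 10, 10]]]
background = mean_background(frames)
```

With only three frames the object still raises the middle pixel well above the true background value of 10, which is the "too few frames" failure described for figure 1 (b).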

One major issue with this method of background detection is that it does not allow the background to change over time. If an object was moving but then stops, it should become part of the background instead of forever staying a foreground object. One example is a car that enters a scene and then parks. While moving, the car may be of interest and should be considered part of the foreground, but once it parks it should eventually be ignored again, as it is no longer a moving object. An improvement over the previous method was also implemented for this research. It is based on the same idea of averaging multiple frames of video, but instead of averaging every pixel for every frame, it only includes those pixels for which there is little motion over some range of frames. It follows a similar function, with some important differences:

$$B_{i,j} = \frac{\sum f_{i,j}}{|f|}$$

where $f_{i,j}$ is the set of values for the pixel at position $i, j$ in the series of frames evaluated for which there is little change compared to the frame $n$ frames previous and $n$ frames in the future, and $|f|$ is the number of elements in the set $f$.

This is done a whole frame at a time and can be seen in figure 2. The difference between frame $n$ and frame $n-k$, referred to as $\Delta_{n-k,n}$, is calculated, as well as the difference between frame $n$ and frame $n+k$, referred to as $\Delta_{n,n+k}$. The section that changes in frame $n$ is then the boolean AND of the two differences, or $\Delta_{n-k,n} \wedge \Delta_{n,n+k}$. This then allows us to claim that the background for that frame is $\neg(\Delta_{n-k,n} \wedge \Delta_{n,n+k})$.


[Figure 2 panels: T=-10; T=0; T=10; ∆ between T=-10 & T=0; ∆ between T=0 & T=10; Detected difference for T=0; Detected background for T=0]

Figure 2: The detection of background for this one frame
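The two-sided difference test and the selective average can be sketched as follows, in Python rather than the Matlab used for this work. This is only an illustration under stated assumptions: frames are small grayscale nested lists, the frame difference is a plain thresholded absolute difference (the report's actual difference is computed in HSV), and the names `changed`, `background_mask`, and `masked_mean` are hypothetical.

```python
def changed(a, b, threshold):
    """Boolean mask of pixels whose absolute difference exceeds the threshold."""
    return [[abs(x - y) > threshold for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def background_mask(prev, cur, nxt, threshold):
    """True where cur counts as background: not (delta(prev, cur) and delta(cur, nxt))."""
    d_prev = changed(prev, cur, threshold)
    d_next = changed(cur, nxt, threshold)
    return [[not (p and n) for p, n in zip(rp, rn)]
            for rp, rn in zip(d_prev, d_next)]

def masked_mean(frames, masks):
    """Per-pixel mean over only the samples flagged as background (None if no sample)."""
    rows, cols = len(frames[0]), len(frames[0][0])
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            samples = [f[r][c] for f, m in zip(frames, masks) if m[r][c]]
            row.append(sum(samples) / len(samples) if samples else None)
        result.append(row)
    return result

# An object (value 200) sits on the middle pixel at T=0 and has moved right by T=+k.
prev, cur, nxt = [[10, 10, 10]], [[10, 200, 10]], [[10, 10, 200]]
mask = background_mask(prev, cur, nxt, threshold=20)
```

Only the middle pixel differs from both the earlier and the later frame, so it alone is excluded from the background of the current frame.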

The actual method for calculating the difference between frames in this portion of the project was based on the HSV color space. The difference was calculated in the following manner:

$$\Delta H = \cos^{-1}\{\cos H_i \cos H_j + \sin H_i \sin H_j\}$$

$$\Delta S = |S_i - S_j|$$

$$\Delta V = |V_i - V_j|$$
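A small sketch of the per-channel HSV differences, written in Python rather than the Matlab used for this work. It assumes hue is given in radians and that each pixel is an `(h, s, v)` tuple; the function names are hypothetical. The hue term is the angular distance between the two hues, so hues on either side of the wrap-around point are treated as close.

```python
import math

def hue_difference(h_i, h_j):
    """Angular distance between two hues given in radians."""
    c = math.cos(h_i) * math.cos(h_j) + math.sin(h_i) * math.sin(h_j)
    return math.acos(max(-1.0, min(1.0, c)))  # clamp to guard against rounding error

def hsv_difference(p_i, p_j):
    """Return (delta_h, delta_s, delta_v) for two (h, s, v) pixels."""
    h_i, s_i, v_i = p_i
    h_j, s_j, v_j = p_j
    return hue_difference(h_i, h_j), abs(s_i - s_j), abs(v_i - v_j)

# Two hues 0.1 rad either side of zero are 0.2 rad apart, not nearly a full turn.
wrapped = hue_difference(0.1, 2 * math.pi - 0.1)
```

The sketch stops at the three channel differences and does not combine them into a single image difference.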

The difference for the whole region is then defined as a combination of these three channels. According to Shan et al. [3], it is possible to use this color space to help remove shadows from our difference. For the HSV color model it is done in this manner:

I = ∆V (∆H S)

where ∆I is the entire image difference for all three channels.

2.2.2 Image Registration

This change in algorithm helps with some of the problems of averaging entire frames over some region, but problems still remain. The main remaining problem is camera motion. The previous method works well if the camera is stationary, but fails if the camera is moving at all. Without compensation, motion that is camera motion rather than object motion would add noise to our estimated background image. There are many ways to compensate for camera motion, but the method I implemented requires a distinguishable amount of background to work. The general idea is to register the frames from two different times with one another. If we can realign these frames, then we can still determine the background pixel values for those pixels that are visible for a given time period. We can also ensure that the frame that we will later evaluate against this background is accurately registered with the found background. The registration I implemented is based on the idea that, for a given image, we can detect feature points, specifically Harris points [1], and with those feature points we can find a correlation between the feature points of one image and another; see figure 3.


Figure 3: Feature point matching from one frame to another

Two different algorithms were attempted for feature point correlation. One can sometimes be faster, while the other is slower but sometimes more accurate. The faster algorithm uses an approximation of the slower algorithm, which is why they produce similar results.

The first algorithm takes all of the feature points, $p_0$, from an image frame $F_0$, and computes the Euclidean distance, as separate x and y components, from those points to all the feature points, $p_1$, from frame $F_1$. Along with the distance, the absolute value of the difference in intensity is also calculated for each feature point pair, where intensity refers to the pixel value of the source image, not the strength of the Harris point itself. All pairs with a difference in intensity greater than some threshold are ignored. For all the other pairs, a histogram of spatial difference is computed for both the x and y components. Given the histogram for each component, the bin with the most points is selected. Then, given the highest individual components, we check whether they correspond to a high number of point pairs that match both the selected x and y distances. If the number of pairs is greater than a preselected threshold, then those pairs of points are returned, giving the x and y shift for the given frames. If it is not, then we go through different combinations of x values and y values, in order of occurrence, to try to find some pair for which there is a match in both directions. In many cases the first values are selected and the function is quite fast. If it is difficult to find a set of matching pairs, it may take somewhat longer to find the match. This happens in some frames where there is either not much background, e.g. a blank white wall, or a background with a lot of random noise that the Harris corner detector reports as corners, which are less useful.
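The histogram search described above might be sketched like this, in Python rather than the Matlab used for this work. Several details are assumptions for illustration: feature points are `(x, y, intensity)` tuples with integer coordinates, histogram bins are one pixel wide, the function returns the shift rather than the matching pairs, and the names `estimate_shift`, `max_intensity_diff`, and `min_pairs` are hypothetical.

```python
from collections import Counter

def estimate_shift(points0, points1, max_intensity_diff, min_pairs):
    """Return the (dx, dy) shift supported by at least min_pairs point pairs, or None."""
    # Keep only pairs whose source-image intensities are similar.
    pairs = [(x1 - x0, y1 - y0)
             for (x0, y0, i0) in points0
             for (x1, y1, i1) in points1
             if abs(i1 - i0) <= max_intensity_diff]
    dx_hist = Counter(dx for dx, _ in pairs)
    dy_hist = Counter(dy for _, dy in pairs)
    joint = Counter(pairs)
    # Try candidate components from most to least frequent until both agree.
    for dx, _ in dx_hist.most_common():
        for dy, _ in dy_hist.most_common():
            if joint[(dx, dy)] >= min_pairs:
                return dx, dy
    return None

points0 = [(10, 10, 100), (20, 15, 150), (30, 40, 200)]
# The same points shifted by (+3, -2), plus one spurious corner.
points1 = [(13, 8, 101), (23, 13, 149), (33, 38, 202), (50, 50, 100)]
shift = estimate_shift(points0, points1, max_intensity_diff=5, min_pairs=3)
```

In the common case the most frequent x and y components already agree, so the nested loops exit on the first combination, which matches the observation that the function is usually fast.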

An alternative approach is similar, but requires on average more time to compute the feature point matching. The goal is the same for this algorithm: find the pairs of points for which there is a high amount of correlation and use those points to realign the image. Instead of looking at the distances between feature points and finding a correlation among them, we look at the image itself. For each feature point $p$ in $p_0$, create a window $w$ from frame $F_0$. Also, for all points $p$ in $p_1$, create windows $w_1 \ldots w_n$ from frame $F_1$. Then perform a normalized cross correlation between window $w$ and each of the windows $w_1 \ldots w_n$. Of all these cross correlations, pick the one with the highest value, $w_i$, and as long as that highest value is above some threshold, assign it as the match for $p_0$, so $p_0$ correlates to $p_i$.
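A minimal sketch of the window-matching step, in Python rather than the Matlab used for this work. It assumes each window has already been cut out and flattened into a list of intensities of equal length; the names `ncc` and `best_match` are hypothetical, and a window with no variation is given a score of zero by convention here.

```python
import math

def ncc(w_a, w_b):
    """Normalized cross correlation of two equally sized, flattened windows."""
    mean_a = sum(w_a) / len(w_a)
    mean_b = sum(w_b) / len(w_b)
    da = [a - mean_a for a in w_a]
    db = [b - mean_b for b in w_b]
    denom = math.sqrt(sum(a * a for a in da) * sum(b * b for b in db))
    return sum(a * b for a, b in zip(da, db)) / denom if denom else 0.0

def best_match(window, candidates, threshold):
    """Index of the candidate with the highest NCC, or None if it is not above threshold."""
    if not candidates:
        return None
    scores = [ncc(window, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > threshold else None

window = [1, 2, 3, 4]
# An inverted window, a brighter copy of the same pattern, and a flat window.
candidates = [[4, 3, 2, 1], [10, 20, 30, 40], [5, 5, 5, 5]]
match = best_match(window, candidates, threshold=0.9)
```

Because the correlation is normalized, the brighter copy of the pattern still scores highest, which is why this approach tolerates overall brightness changes between frames.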

Whichever method is used, the result is a list of points that match from one frame to another. Using these points we can create a linear transform to shift the image. For all the point pairs given, we calculate the distance in both the x and y directions. Then, given those values, ∆x and ∆y, we pick the mode of each set and use that as the translation values. Since translating the image will likely leave the images only partially overlapping, the image is padded with empty data on all edges where data may be lacking due to registration. For background detection, that also means that if we constrain ourselves to the size of the original image, we are very likely to lose information. To get around this, if we know that there is camera motion, we assume that the background is likely to move and pad our found background on each side to compensate. Then we can keep some information and a tracked background even if the camera moves back and forth. An example of the found background for a given frame from a video where the camera was moving can be seen in figure 4. The black border surrounding the frame is the padding provided for camera motion. If it were not for the registration from frame to frame, the image would appear much blurrier and would be more difficult to use as a representation of the background.

[Figure 4 panels: Found Background; Original Image]

Figure 4: Comparison of found background image to actual image frame

2.2.3 Foreground Selection

Once a background is found for a frame, the next step is to segment the foreground from the background. For some implementations it is acceptable to simply take the difference between the found background and the current frame. This process has some drawbacks, mainly that it ignores normal variance or noise in the video. This implementation keeps track of approximately twenty frames of found background information. With these twenty frames it was possible to derive the mean and standard deviation of every pixel over a range of frames; see figure 5. Specifically, the mean for a given pixel is the average of only those samples at that pixel location that our background detection algorithm claimed as background. Non-background content is ignored in this assessment of the mean. The standard deviation is also calculated from this data set and ignores the approximated foreground content. Given this information, it was possible to determine what content in the current frame was out of range, that is, more than one standard deviation from the mean found background. Content out of range is considered foreground; see figure 6. This differs from the foreground/background selection process used for background detection in that it accounts for the subtle changes that can occur over time, and also includes other noise sources, such as camera noise or video compression noise, and their variance over time. For multi-channel images, such as RGB, the difference from the mean is calculated separately on all three channels. Therefore $FG_r = |r - \bar{r}| > \sigma_r$, etc., where $\bar{r}$ and $\sigma_r$ are the per-pixel mean and standard deviation of the red channel. The foreground mask of an entire image frame is just the boolean AND of all three channels, so $FG_f = FG_r \wedge FG_g \wedge FG_b$. Much of this work was inspired by an equation given by Shan et al. [3].

[Figure 5 panels: frames at T=-10, T=-5, T=0, T=+5, T=+10; Per Pixel Mean; Per Pixel Standard Deviation]

Figure 5: Computation on a range of frames for background statistics
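The per-pixel statistics and the one-standard-deviation test can be sketched as below, in Python rather than the Matlab used for this work. The sketch makes several assumptions: a pixel's history is a list with `None` wherever the background detector flagged foreground, at least one background sample exists, the population standard deviation is used, and the function names are hypothetical.

```python
import math

def background_stats(samples):
    """Mean and standard deviation of one pixel channel, skipping foreground samples.

    `samples` holds the channel value in each recent frame, with None where the
    background detector did not claim the pixel as background.
    """
    values = [v for v in samples if v is not None]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std

def is_foreground(pixel_rgb, stats_rgb):
    """True when every channel is more than one standard deviation from its mean."""
    return all(abs(v - mean) > std
               for v, (mean, std) in zip(pixel_rgb, stats_rgb))

# One pixel's history; the third frame was flagged as foreground and is skipped.
history = [100, 102, None, 98, 100]
stats = [background_stats(history)] * 3  # same statistics reused for R, G and B
```

A pixel such as `(150, 150, 150)` is far outside the range on all three channels and is marked foreground, while `(101, 150, 150)` is not, because its red channel stays within one standard deviation and the channels are combined with a boolean AND.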

The foreground sections should contain the areas where objects are moving. These are the objects that we are interested in. Connected components are segmented out into different blobs. Each blob contains information about itself, including the pixel values contained in a bounding box, the mask associated with the content that was determined to be foreground, the relative location within the frame at which the blob was found, and the blob mask's centroid. This information helps us determine how a blob moves from frame to frame and allows us to track the blob's motion through the video sequence.

[Figure 6 panels: Current Frame; Per Pixel Mean; Per Pixel Standard Deviation; Current Foreground Frame]

Figure 6: Foreground Calculation from background statistics
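Segmenting a foreground mask into blobs can be illustrated with a simple flood fill, written in Python rather than the Matlab used for this work. The choice of 4-connectivity, the dictionary layout of a blob, and the name `find_blobs` are assumptions for this sketch; the report does not state which connectivity was used.

```python
from collections import deque

def find_blobs(mask):
    """Label 4-connected foreground regions and describe each one as a blob."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            blobs.append({
                "bbox": (min(ys), min(xs), max(ys), max(xs)),  # top, left, bottom, right
                "centroid": (sum(ys) / len(ys), sum(xs) / len(xs)),
                "pixels": pixels,
            })
    return blobs

mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 1]]
blobs = find_blobs(mask)
```

The example mask yields two blobs: a 2x2 square in the top-left corner and a single pixel in the bottom-right corner.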

2.2.4 Blob Tracking

For video frames, any motion is an optical illusion created from multiple static images with small changes from frame to frame; presented to a person in quick succession, they create the illusion that the contents of the image have moved. Unfortunately this motion is not encoded in the video itself; the individual static frames are all we have to work with in trying to detect motion. Given the previously explained parts of our system, the actual motion tracking occurs as a function of finding associations between blobs over time. Much of this work was inspired by Masoud and Papanikolopoulos’s “A novel method for tracking and counting pedestrians in real-time using a single camera” [2]; however, not all of their techniques were implemented. Just as smooth motion is the illusion created from slight variations frame to frame, each blob should change only slightly frame to frame if it is to give the illusion of smooth motion. Our goal of tracking motion thus becomes a matter of tracking blobs from any given previous frame to the current frame. This mapping of blobs from T=0 to T=1 can decompose the path traveled by any of these blobs over time. This is the perceived motion.

In the previous implementation of this algorithm, one major failure was observed: the assumption that any given blob detected in a video frame could only be identified with a single moving object. For a video sequence with only one moving object, or with objects that never cross paths, this never becomes a problem. However, most real-world situations do not follow this strict rule. Where individual objects cross paths, or form groups and then diverge, a single blob must be able to represent multiple separate and distinct objects. For this reason, this revision adds the ability for a blob found in a frame to be matched to multiple previously found objects. This way, objects need not be lost as different moving objects overlap.

The basis for the tracking of blobs from frame to frame is a “best match” search from all the blobs of the previous frame to the blobs of the current frame. This matching process is based on a cost function composed of the Euclidean distance, the difference in the size of the bounding box, and the difference between the color histograms of the current frame blobs and the previous frame blobs:

$$\mathit{diff} = w_0 d_E + w_1 d_A + w_2 d_H$$

The Euclidean distance of two blobs is defined as the distance between the centroid of a given blob in the current frame and the predicted centroid of a blob from the previous frame. The predicted centroid is calculated from the current predicted speed of the blob and the previous frame’s known centroid. The difference in size of the bounding box is simply the absolute value of the difference between the area of the previous bounding box and the area of the bounding box of the blob being evaluated. The color histogram difference is calculated as the sum of the percentage of non-matching color values, those values for which there is no equivalent in the comparison, for the combinations of {red, green}, {red, blue}, and {blue, green}, assuming that the image is defined in RGB space. The percentage is calculated relative to the size of the smaller of the two blobs. Also, only the RGB values for the region of the blob, not the whole bounding box, are compared. Other pixel values are considered background and ignored in this histogram comparison. The color histogram difference has a value of 0 for a perfect match and a maximum value of 3 for a complete mismatch. This color histogram matching is similar to work done by Swain and Ballard [4], except that they were matching an entire image to an image database, whereas here we are only matching small sections of an image to other sections for which shape and size are not constant.
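One way to read the cost function is sketched below in Python rather than the Matlab used for this work. The histogram term is an interpretation, not the report's exact code: for each pair of channels, colour values are counted, matched values are those present in both blobs (a histogram intersection), and the unmatched fraction is taken relative to the smaller blob. The blob dictionary keys, the weights, and the function names are hypothetical.

```python
import math
from collections import Counter

def histogram_difference(pixels_a, pixels_b):
    """Sum over three channel pairs of the fraction of unmatched colour values (0 to 3)."""
    smaller = min(len(pixels_a), len(pixels_b))
    total = 0.0
    for i, j in ((0, 1), (0, 2), (2, 1)):  # {red, green}, {red, blue}, {blue, green}
        hist_a = Counter((p[i], p[j]) for p in pixels_a)
        hist_b = Counter((p[i], p[j]) for p in pixels_b)
        matched = sum((hist_a & hist_b).values())  # counts common to both histograms
        total += 1.0 - matched / smaller
    return total

def match_cost(previous, candidate, weights):
    """diff = w0 * dE + w1 * dA + w2 * dH for one previous/current blob pair."""
    w0, w1, w2 = weights
    d_e = math.dist(previous["predicted_centroid"], candidate["centroid"])
    d_a = abs(previous["area"] - candidate["area"])
    d_h = histogram_difference(previous["pixels"], candidate["pixels"])
    return w0 * d_e + w1 * d_a + w2 * d_h

previous = {"predicted_centroid": (0, 0), "area": 100,
            "pixels": [(255, 0, 0), (255, 0, 0)]}
candidate = {"centroid": (3, 4), "area": 90,
             "pixels": [(255, 0, 0), (0, 0, 255)]}
cost = match_cost(previous, candidate, weights=(1.0, 0.1, 2.0))
```

In a matcher, this cost would be evaluated for every admissible pair of blobs and the lowest-cost pairing kept; the weights let distance, size, and colour be balanced against one another.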

In order to reduce mismatches, only blobs for which there is some overlap between the mask of the predicted previous frame blob and the mask of the current frame blob are considered. This is determined by first performing a quick comparison of bounding box overlap, which cheaply rules out most of the possible mismatches. If the bounding boxes overlap, then given the bounding boxes and the masks of the blobs we perform a boolean AND operation on the two masks projected into their proper coordinates. After the boolean AND, a sum is performed on the resultant matrix. If the sum of that matrix is nonzero, then there is some overlap. If not, then there is no overlap and this blob is ignored for matching. Reducing the number of blobs examined reduces the complexity of finding a matching blob for tracking smooth motion frame to frame. This assumes that objects are not moving in jumps larger than their own size each frame.
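The two-stage overlap test can be sketched as follows, in Python rather than the Matlab used for this work. As an assumption for brevity, each mask is stored as a collection of `(row, column)` frame coordinates, so the "boolean AND followed by a sum" becomes the size of a set intersection; the dictionary keys and function names are hypothetical.

```python
def boxes_overlap(a, b):
    """Boxes are (top, left, bottom, right) with inclusive edges."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def masks_overlap(blob_a, blob_b):
    """Cheap bounding box rejection first, then an exact mask intersection."""
    if not boxes_overlap(blob_a["bbox"], blob_b["bbox"]):
        return False
    # Equivalent to AND-ing the two masks in frame coordinates and summing the result.
    return len(set(blob_a["pixels"]) & set(blob_b["pixels"])) > 0

diagonal = {"bbox": (0, 0, 2, 2), "pixels": [(0, 0), (1, 1), (2, 2)]}
anti_diagonal = {"bbox": (0, 0, 2, 2), "pixels": [(0, 2), (2, 0)]}
corner = {"bbox": (2, 2, 3, 3), "pixels": [(2, 2), (3, 3)]}
far_away = {"bbox": (5, 5, 6, 6), "pixels": [(5, 5), (6, 6)]}
```

Here `diagonal` and `anti_diagonal` share a bounding box but no pixels, so only the mask test rejects them, while `far_away` is rejected by the bounding box test alone.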

Given this cost function, all blobs of the previous frame are compared to the blobs of the current frame. If there is a 1:1 matching between these blobs, the current frame blobs inherit all of the properties of the previous frame blobs other than the updated mask, updated image segment, and updated coordinates. The predicted speed of this blob is calculated based on the previous speed of the previous frame blob. Currently the function used is:

X = T



X + S








= T Y + S




where $S_X$ is the speed in the X direction at time $T$ and $\Delta_X^T$ is the change in the X direction between time $T$ and $T-1$. The same calculations are performed for speed in the Y direction, and the two are kept separate as a vector denoting the approximate speed of the blob. This is not the best predictor for speed, as it only works for things moving in a linear manner. Any other type of constantly changing function would not be well predicted by this function; however, it is very simple to implement and works fairly well for simple pedestrian movement.

If there is a many-to-one mapping of previous frame blobs to a current frame blob, then we take this into consideration and create a new blob, which is just a placeholder with a new characteristic called “sub blobs”. These sub blobs are the independent components that are believed to make up the combined blob. This relation is held for one frame; then, when it comes time to compare this combined blob to the blobs of the new current frame, it is decomposed into its core parts, and the “sub blobs” are compared to the current frame blobs. In this way, individual components retain their individuality even if it appears that they have been combined into a larger object. If they were to split at some time in the future, they would retain all previously known information about themselves, i.e. color, mask, speed, direction, etc.

For all blobs which have been matched in either a 1:1 or a many-to-one mapping for multiple frames, we recognize this as a blob worth tracking and track it. Tracking for simple 1:1 matched blobs is performed by marking the previously known centroids for the object and drawing the bounding box around the currently found location of the blob for this frame. An example of this type of output can be seen in figure 7.

Figure 7: Tracking the motion of found blobs for a series of frames

For blobs which have been matched in a many-to-one situation, we try to preserve the independence of the sub blobs. To do this, we do not draw a bounding box around the merged blob, but instead draw projected centroids for where we believe the sub blobs are, and a bounding box around where we believe each to be at this point in time. The current frame position prediction is based on the last known speed vector of the blob and is calculated as the number of frames since this object was an individual multiplied by the known speed. Since this prediction is based on the last known speed of the sub blob, it could be very incorrect. For this reason the projection is clipped at the boundaries of the larger merged blob: if the centroid of a sub blob is projected to lie outside the merged blob, it is clipped at the boundary of the merged blob's bounding box. This prevents the predicted placement of centroids and bounding boxes from being too far away from the actual motion of the merger of the sub blobs. Since we maintain the existence of sub blobs, this helps with the situation where two people cross paths. In the previous paradigm one would be lost or ignored until they separated again. In the current implementation we acknowledge the merger and attempt to do the best we can with what we have. For some scenarios the results are much improved over the past results; see figure 8.

[Figure 8 panels: Blocks Crossing Old Method; People Crossing Old Method; Blocks Crossing New Method; People Crossing New Method]

Figure 8: A Comparison of maintaining independence of sub blobs in a merge situation to not acknowledging merges
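The projection and clipping of a sub blob's centroid inside a merged blob can be sketched in a few lines of Python (rather than the Matlab used for this work). The argument layout, `(row, column)` coordinates, and the name `project_sub_blob` are assumptions for illustration.

```python
def project_sub_blob(last_centroid, speed, frames_since_merge, merged_bbox):
    """Predict a sub blob's centroid and clip it to the merged blob's bounding box.

    last_centroid and speed are (row, column) pairs; merged_bbox is
    (top, left, bottom, right).
    """
    top, left, bottom, right = merged_bbox
    # Linear prediction: last known position plus speed times elapsed frames.
    y = last_centroid[0] + speed[0] * frames_since_merge
    x = last_centroid[1] + speed[1] * frames_since_merge
    # Clip each coordinate so the prediction never leaves the merged blob.
    return (min(max(y, top), bottom), min(max(x, left), right))

inside = project_sub_blob((10, 10), (0, 5), 2, (5, 8, 20, 25))
clipped = project_sub_blob((10, 10), (0, 5), 4, (5, 8, 20, 25))
```

After two frames the prediction still falls inside the merged box and is left alone; after four frames the unclipped column would be 30, so it is pulled back to the right edge at 25.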

So that we do not keep trails forever for objects that have disappeared, after each frame we prune the list of blobs that we are interested in. If a blob is not matched frame to frame, it is marked as having missed frames. If the number of missed frames exceeds some tolerance value, then that blob is pruned from the list. It will not be drawn anymore and it will no longer be compared to any new blobs or matched again. This ensures that we recognize that a blob has left the scene and is no longer of interest for tracking.
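The pruning step might look like the following sketch, in Python rather than the Matlab used for this work. It assumes each tracked blob is a dictionary with an `id` and a `missed` counter, and that a successful match resets the counter to zero; the report does not say whether the counter is reset, so that detail and all names here are hypothetical.

```python
def prune(blobs, matched_ids, max_missed):
    """Update missed-frame counters and drop blobs unmatched for too many frames."""
    kept = []
    for blob in blobs:
        # Assumption: a match resets the counter; a miss increments it.
        missed = 0 if blob["id"] in matched_ids else blob["missed"] + 1
        if missed <= max_missed:
            kept.append({**blob, "missed": missed})
    return kept

tracked = [{"id": 1, "missed": 1}, {"id": 2, "missed": 2}, {"id": 3, "missed": 0}]
remaining = prune(tracked, matched_ids={1}, max_missed=2)
```

Blob 1 is matched and kept with its counter reset, blob 2 exceeds the tolerance and is dropped, and blob 3 is kept with one missed frame recorded.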




Compared with previous results, this new methodology improves in many areas over past implementations. It addresses the issues involved with partial occlusion and crossed paths. It attempts to address changes in background over time and to ignore some of the normal variance included in the video. Although not all situations are fully addressed, this implementation is much improved over previous results and in certain conditions produces fairly good results.

Specifically, with the video file entitled “people01 08 05.avi”, as presented on the Computer Vision website, results were much improved over previous results. Portions where people cross paths are much improved in that each individual's path is preserved and is fairly accurate given the limitations. Also, the camera noise that was picked up by previous algorithms is fairly successfully ignored, along with changes in lighting from the sun moving behind clouds and back out again.

Videos with not much background information, e.g. a blank white wall, had poor results compared to other video files. Tracking the camera motion appeared not to be possible with the currently implemented algorithms. Also, the motion of a person walking toward the camera is not fully captured. At the points where the person is distinguishable from the background, only parts of the body are recognized as objects in motion. This causes failures in object matching, as the system recognizes the arms or legs as moving objects instead of the whole person. Then, when tracking the person, even when the whole person is detected as a foreground object, the implemented system assumes that it is just the merger of the appendages instead of recognizing the person as a whole.




Unfortunately, given the situation posed to the class as our working situation, this method has many drawbacks. One of the most significant failures seen in attempting to track students entering a classroom had to do with the white wall background. Given the significant lack of detail in the background of this situation, it was very difficult to determine the difference between camera motion and object motion. One possible explanation for why people do not share this difficulty is that the human visual system compares the scene as a whole with what makes the most ‘sense’ given the situation. If in a series of frames a person appears to move from right to left without any movement of their appendages, it becomes likely that it is not the person that is moving but the observer. With that hypothesis, it is possible to observe everything in the scene that is not a human and see if that theory still makes sense. Unfortunately my implementation is not advanced enough to bridge that type of semantic gap between what is perceived and what is possible for all of the different objects in the given scene.

One other advantage the human visual system has over the processing I was able to do is its ability to segment out objects from a static scene. For my analysis, I had to wait for an object to move before I was able to identify it as an object worth noting. Our visual system, including our brain for processing, can segment out all possible objects from a scene before any motion happens, and therefore can pay more attention to regions where motion is probable and somewhat safely ignore other regions. This would include paying more attention to the people in the video and less attention to the blank white wall. In combination with this ability, our visual system can also bridge the gap between an object and the parts that compose it. If a person is moving their arms, we know that the person is not going to move as a whole object until the legs are also moving. This connection between arm motion, leg motion, etc. and the motion of the object or person as a whole is lost in the simple processing that I have done.

5 Future Work

There is much work left to do in this area. One main portion that was not completed during this course was the application of a pedestrian detector. Once foreground content has been detected, it is just assumed to be of importance. For this to be an actual pedestrian tracking algorithm, one more step would need to be implemented: once foreground pieces are detected, each piece must be evaluated as to whether or not it is a pedestrian. In this way we can eliminate much of the background noise, and also eliminate non-pedestrian movement. There are many ways that this could be accomplished, but unfortunately none were attempted.

Other possible refinements include a better history of the objects in motion. Currently the only historical data kept is the previous locations of the center of the mask. The current speed of the object is somewhat based on previous data, but historical speed is not kept for reference or statistical query. Keeping it could allow a more accurate projection of an object's movement while occluded. Also, with more historical data, it might be possible to implement more complex predictions of the object's placement. Currently we have looked at linear prediction, but there are many other ways to accomplish this.

Historical references to the object's appearance could also be very useful in detecting changes over time. In the current implementation we only have the appearance of the object in the last frame in which it was tracked. This is a poor model for matching and provides little data for comparison. It also does not ensure that there is a match between scenes where change is possible.

Currently the probabilistic method for obtaining the foreground data from a frame is based on the mean and standard deviation in RGB space. It would be useful to compare different color spaces to see if any produce better results. One prime example worth testing is using the HSV color space for this detection. Also, other color spaces should be tested in the background detection section. The use of HSV in the current implementation is based on some initial tests between RGB, HSV and YCbCr. Although HSV was chosen based on those initial results, it may be possible to use a different space and find a better metric for non-changing content, within some range.

6 Time Tracking

This section is provided as a departmental requirement for the independent study. The original estimate of work to be performed for this independent study was a little optimistic. Unfortunately, due to the nature of the work, progress was fairly slow. One of the main reasons for this was the time it takes to test new ideas. The majority of the work was done in Matlab, which, although very nice to work with and easy to implement fairly complex algorithms in, is very slow compared to languages like C++ or Java. Because of this, it would often take over 20 minutes just to test one changed parameter of an algorithm. Most of the time was spent not in developing new ideas, but rather in implementing ideas, validating that they are sound, testing them on sample data, retuning parameters, and finally testing on real-world data. For this reason, the time to implement many of the algorithms was much larger than originally expected.

Looking at the individual portions of this project, the breakdown would match the following:

Background Detection    2 weeks
Image Registration      3 weeks
Foreground Selection    2 weeks
Blob Tracking           3 weeks

Although it was never a complete split between the four sections, this is an approximate estimate of the total time given. If hours are requested, an approximation of the number of hours worked per week on this project would be at least 10 hours per week, and often more in the beginning, as my algorithms were not very well optimized, so it took longer to get any results.


[1] Chris Harris and Mike Stephens. A combined corner and edge detector. 4th Alvey Vision Conference, pages 189–192, 1988.

[2] O. Masoud and N.P. Papanikolopoulos. A novel method for tracking and counting pedestrians in real-time using a single camera. Vehicular Technology, IEEE Transactions on, 50(5):1267–1278, Sep 2001.

[3] Yong Shan, Fan Yang, and Runsheng Wang. Color space selection for moving shadow elimination. pages 496–501, Aug. 2007.

[4] M. Swain and D. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11–32, 1991.