Finding Mean Traffic Speed in Low Frame-rate Video

Jared Friedman
January 14, 2005
Final Project Report
Computer Science 283


In this paper, I present a novel approach to estimate mean traffic speed using low
frame-rate video taken from an uncalibrated camera. This approach takes advantage of a
known relationship between traffic speed and traffic density to make tracking of
individual vehicles unnecessary. The algorithm has been developed especially for
nighttime conditions, though extensions to daytime images seem quite possible. It has
been tested on several image sequences and shown to produce results consistent with
human estimations from those sequences.

Computer vision techniques have been applied to images of traffic scenes for a
variety of purposes [1]. One of the more popular of these purposes is to attempt to extract
a measure of the level of congestion of the road in the scene. Properly disseminated, this
information can be used by drivers to plan routes that avoid traffic and by first responders
to identify accidents. The measurement typically used for congestion is the mean speed
of traffic, as this is the measurement a rational traveler should care about. In this paper, I
present a new algorithm to estimate mean traffic speed using video images at low frame
speed. The work is motivated by the presence of Trafficland, a new company that offers
video streams from over 400 traffic cameras in the Washington D.C. area through free
internet access. Previous work on finding traffic speed has
worked by finding it directly, essentially by tracking vehicles for a known time over a
known distance and calculating an average ratio. However, due to bandwidth limitations,
Trafficland’s cameras give video at less than one frame per second, with unreliable and
difficult to determine time intervals between frames, making tracking extremely difficult
if at all possible. Some published algorithms [2],[3] instead place two virtual lines or
“tripwires” on the road at a known separation and measure the time interval between cars
crossing the first and crossing the second, in the natural computer vision analogue of the
physical loop detectors on roads. While this may seem to be a different technique from
tracking, it shares many similarities, including the assumption that cars will not travel
much from one frame to the next, and in practice it requires an even higher frame speed
than tracking.
Nevertheless, clearly humans are able to judge traffic levels from Trafficland’s
video, and they would still be able to do so even if shown only every second or third
frame, making tracking practically impossible. I assert that the way we make this
judgment is by determining how closely spaced the cars are and using the intuitive fact
that closely spaced cars tend to travel more slowly than widely spaced ones. More
precisely, we use
the inverse correlation between mean traffic speed and traffic density, which is defined as
vehicles per lane-mile [4]. The approach of this paper is to take advantage of this
relationship: compute the density directly, which is feasible even at low frame speeds,
and then convert it into a speed using the known relationship.
To compute density in a particular region of interest, we must know the number of
lanes of the region, the length of the region, and the number of cars in the region in
each frame. Some traffic vision systems have had to accommodate cameras that could be
rotated and zoomed by traffic operators with joysticks (e.g., [1], [5]), and thus have had to
build in some automatic calibration capability to their programs. However, Trafficland’s
cameras appear to be stationary, and we make the assumption that the number of lanes
and the length of the region need only be determined once, and take advantage of human
input in this one-time low-cost setup procedure. Many published and commercial
systems [6] [7] also require some initial human setup: for one, if there are multiple roads
or two directions of a single road in the picture (as is usually the case), the software
cannot possibly know which road is the intended one without some human input.
Specifically, the initial calibration setup simply requires a human to draw a rectangle (in
world coordinates) around an area of interest and then to trace out the lanes in the region.
Using some simple geometric constraints and assuming a typical lane width, the length of
the region can then be calculated.
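With these quantities in hand, the density follows directly from its definition as
vehicles per lane-mile:

$$k = \frac{N}{n\,\ell}$$

where N is the number of cars counted in the region, n is the number of lanes, and
ℓ is the length of the region in miles.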
The calculation of the number of vehicles in the region proved to be more
challenging than expected. This is partially because surprisingly little previous work has
been done on the problem. Several algorithms have developed excellent tracking of cars
in daytime conditions at high frame speeds, which implies that they are able to recognize
vehicles to some extent. However, in tracking vehicles directly, it is not necessary to
segment cars properly, but only to identify blobs that correspond to multiple vehicles or
parts of vehicles, since in a traffic stream all the vehicles, their parts and their shadows
tend to move at about the same velocity. The algorithm reported in [1] does require
correct segmentation of vehicles, because it must estimate their size correctly, but it
requires correct segmentation of only a few vehicles at a time, and thus it simply throws
away any blobs that do not correspond to a tightly defined vehicle profile. Accurate
counting of vehicles in daytime conditions requires a more sophisticated approach to deal
with occlusion, shadows, and vehicles of widely varying appearance. In this preliminary
report, I chose to focus on nighttime conditions only and to leave daytime conditions for
future work. Nighttime conditions are easier because at night, it is usually possible to
simply count the number of headlights appearing in the region, and headlights are much
more visible and less vulnerable to occlusion, shadow, and varying appearance than cars.
Nighttime conditions are in any case a more suitable potential use of the algorithm
advanced in this paper, since tracking-based systems ordinarily find daytime conditions
much easier than nighttime conditions, giving the density approach a particular
advantage in these conditions.
This paper first reviews the key assumptions of the algorithm and discusses their
validity and which ones could be relaxed in further work. I then discuss in detail the
workings of the algorithm and follow by some considerations of its computational
efficiency. I conclude with empirical results validating the accuracy of the algorithm.

Underlying Assumptions
The following gives a list of the key assumptions used to simplify the problem
and some discussion of their validity.

1) Images are taken at night. Headlights are the brightest objects in the region of interest.
2) Traffic is moving generally towards the camera, but not (almost) directly into it. The
second requirement exists because when traffic is going almost directly towards the
camera, the glare from the headlights creates bloom, lens flares, and severe distortion.
The first requirement exists because if the traffic is moving away from the camera, the
headlights will not be directly visible. In my opinion, the second of these twin
requirements is much more reasonable than the first. Many of Trafficland’s nighttime
images are so severely distorted by the lens flares that it is nearly impossible even for a
human to determine the amount of traffic, and working with these images would be a real
challenge. However, tracking cars going away from the camera will obviously be
necessary for a fielded system, and future systems could use either the rear vehicle lights
or the fairly bright reflected glare from the headlights to accomplish this.
3) Vehicles are confined to the road plane, and there exists a region of interest with
straight edges. Also, the number of lanes in the region of interest is constant. These
requirements are necessary for the calculation of the geometry of the situation.
4) The width of each lane in the picture is approximately 11.5 feet. This assumed width
provides the distance required to determine the scale of the image and thus the length of
the region
of interest. The validity of the assumption is taken from [8] which states that virtually all
American highways have lane widths between 10 and 12.5 feet at all times, with lane
widths close to 12 feet being the most common. Other systems have used a variety of
means to attempt to produce a scale measurement, mostly by placing physical marks on
the road [9], [10], although [1] did so by assuming a known distribution of vehicle
lengths. However, in this situation it was impossible to have an operator placing marks
on or near the road. Estimating by mean vehicle length requires an algorithm that can
accurately determine vehicle length of all vehicles, including trucks, which is difficult to
construct and must be run over a considerable period to get an accurate mean value.
Furthermore, it is not at all clear that this mean vehicle length is more constant from road
to road than the mean lane width. [11] reports evidence that the mean vehicle length
changed considerably depending on the time of day, the highway, and the lane observed,
primarily due to the considerable variation in the presence of large trucks, leading to large
errors in systems that assumed a constant vehicle length. Using the lane width as a
calibration tool appears to be a novel suggestion, and it seems a sensible choice for a
variety of situations, not limited to low frame-rate video. It is perhaps worth noting that
if the lane width did not hold to the 10-12.5 foot range, then the validity of assumption
(5) would be in question anyway, as this would affect the density-speed relationship.
5) Traffic speed and density have a known and constant relationship, specifically Edie’s
model as given in [4]. This assumption is admittedly somewhat controversial. While the
inverse correlation between speed and density is obvious, the exact relationship has been
a topic of considerable debate. For decades, it was believed to be a linear relationship on
the basis of a single study using seven data points all collected from a single highway [4].
Further study by Greenberg found that a logarithmic relationship was the best fit, as in
Fig. 1. However, despite the seemingly excellent fit, a number of caveats can be raised
with Greenberg’s methods, and several later studies found that Greenberg’s relationship
was only a mediocre fit to their data. The modern favored choice for the relationship is
Edie’s hypothesis, which is a piecewise function shown in Fig. 2. The piecewise
relationship has not only been confirmed by a rigorous study into the matter [12], it also fits
well with theoretical models of traffic flow, which invariably divide the problem into at
least two subcases corresponding to free-flow and congested-flow, if not more. Despite
all the debate as to the precise nature of the relationship, the actual difference for this
application between the initial Greenshields model and the more recent Edie model is
only at most 10-20%. However, the data for these studies appears to have been collected
only during the daytime and only in normal weather conditions. How nighttime
conditions and adverse weather might affect the relationship is quite unknown.

Fig. 1. Greenberg’s speed-density hypothesis, plotted with his data.

Fig. 2. Edie’s speed-density hypothesis, plotted with his data.
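To make the use of this relationship concrete, the following MATLAB sketch implements a
piecewise speed-density function of Edie’s form, with a free-flow (Underwood-type) branch
and a congested (Greenberg-type) branch. All parameter values are illustrative assumptions
for exposition, not the fitted constants reported in [4].

    function v = speedFromDensity(k)
    % Piecewise Edie-style speed-density relationship (sketch).
    % All parameter values below are illustrative assumptions, not the
    % fitted constants reported in [4].
    vf = 65;     % assumed free-flow speed (mi/h)
    km = 50;     % assumed optimal density (vehicles/lane-mile)
    kj = 180;    % assumed jam density (vehicles/lane-mile)
    kt = 50;     % assumed free/congested breakpoint (vehicles/lane-mile)
    vc = 25;     % assumed congested-branch speed scale (mi/h)
    if k <= kt
        v = vf * exp(-k / km);    % free-flow branch (Underwood form)
    else
        v = vc * log(kj / k);     % congested branch (Greenberg form)
    end
    end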


Finally, it is important to note that there is another relationship between traffic
variables that could be useful in further study. The relationship is between traffic speed
and traffic volume, which is defined as cars per lane-hour. I chose not to use this
relationship in my algorithm because the relationship between volume and speed is
considerably less well-established than the one between speed and density. However,
calculating volume does not require finding the length of the region of interest, and thus it
is immune from that source of error. If the expected level of error from the estimation of
the length of the region were greater than the expected error from the estimation of the
speed-volume relationship, then it would be a reasonable choice to abandon density and
instead calculate volume. Determining whether this is in fact the case unfortunately goes
beyond the scope of this paper, but if it were, then the change would be quite easy to
make. The part of the algorithm that calculates the length of the region would simply be
dropped, and instead the number of cars counted by the second part of the algorithm,
divided by the product of the time period of operation and the number of lanes, could be
plugged into the function in [4] that computes an expected relationship between volume
and speed.
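For concreteness, this alternative computation would be:

$$q = \frac{N_{\mathrm{total}}}{n\,T}$$

where N_total is the total number of cars counted by the second part of the algorithm,
n is the number of lanes, and T is the period of operation in hours, giving the volume q
in cars per lane-hour, which would then be converted to a speed using the speed-volume
function in [4].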
Algorithm Operation
This section details the workings of the algorithm. The first part explains the counting of
cars, and the second part explains the camera calibration and the determination of the
region length.
I. Counting Cars
The algorithm operates on a sequence of nighttime images, sampled at virtually
any frame rate. The images in the dataset are originally in color, but they are converted to
greyscale for analysis. For nighttime images, there is normally very little useful
information in the color channels, and what information there might have been is
obscured by the terrible color distortion in Trafficland’s cameras. The car counting part
of the algorithm assumes that an operator has drawn a rectangular region of interest and
that we are counting cars only in that region.
The car-counting algorithm operates essentially by counting headlights. Unlike
other papers [13], I do not assume a particular shape for headlights, nor do I require that
each vehicle have exactly two nearly identical headlights. While those assumptions are
often valid, occlusion, reflections on the road and on the vehicles, and varying headlight
configurations complicate the picture. Instead, I merely assume that each vehicle has one
or more brightly colored dots on or right next to it. The algorithm finds those dots and
attempts to determine which dots belong to which vehicles.


Fig. 3. A typical unprocessed Trafficland image
The first step of the algorithm is to crop the image to the smallest size that
contains the region of interest, and then to set all the pixels outside the region of interest
to zero intensity. The image is then converted to greyscale. A typical image at this stage
is shown in Fig. 3. The image is then top-hat filtered. Top-hat filtering is a technique
used to smooth out uneven dark backgrounds. It is defined by subtracting the result of
performing a morphological opening on the input image from the input image itself. This
has the effect of reducing background noise by eroding it away, and thus producing a more
even, clean background. Results of top-hat filtering are shown in Fig. 4. Top-hat
filtering requires a choice of a structuring element for the morphological opening. The
choice of a disk shape was easy – this is standard. The choice of the size of the disk
was more difficult and also somewhat arbitrary. Examination of the scale of a few
Trafficland cameras showed that a disk radius of 10 pixels gave good results. Some
further experimentation showed that the performance of the algorithm was highly
insensitive to changes in this size.
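A minimal MATLAB sketch of this preprocessing stage (frame.jpg, roiRect, and roiMask are
placeholder names for the current frame and the operator-defined region of interest):

    % Crop to the region of interest, convert to greyscale, top-hat filter.
    I = imread('frame.jpg');          % hypothetical current frame
    I = imcrop(I, roiRect);           % smallest box containing the region
    G = rgb2gray(I);
    G(~roiMask) = 0;                  % zero intensity outside the region
    se = strel('disk', 10);           % 10-pixel disk radius, as chosen above
    J = imtophat(G, se);              % image minus its morphological opening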


Fig. 4. Image after top-hat filtering.
The next step is to choose a threshold and convert the greyscale image into black
and white. Choosing the threshold is obviously the difficult part. If there are headlights
in the image, then Otsu’s method, which chooses a threshold to minimize the intra-class
variance, works very well. This is essentially because in this case, the image histogram
will be strongly bimodal between headlight and not-headlight, and this method will easily
find the dividing point. However, in an image that has no headlights, Otsu’s method will
return terrible results, as it will cause a segmentation of the road itself based on random
noise on the road, but in this case we want it to segment to all black pixels. One way to
solve this problem is simply to set a fixed parameter that represents a minimum
reasonable headlight intensity, and to take the threshold to be the maximum of this
intensity and the threshold computed by Otsu’s method. In practice, this method will
return good results with almost all images, as the difference between the road
background, generally 0 to .3 in intensity, and the headlight intensity, usually .9 to 1,
is so extreme that any parameter value choice of .4 to .7 will correctly separate them,
and the choice of the parameter within this region will have little effect on the quality
of the segmentation.
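A sketch of this combined rule, taking .5 as one admissible choice of the fixed floor
(J is the top-hat filtered image from the previous sketch):

    % Otsu threshold with a fixed minimum-headlight-intensity floor.
    tOtsu = graythresh(J);            % Otsu's method on the filtered image
    T = max(tOtsu, 0.5);              % .5 is one choice from the .4-.7 range
    BW = imbinarize(J, T);            % black-and-white segmentation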
In the expectation that this method might not return optimal results for all images, I
pursued a method that would learn the threshold from previous images. The algorithm
begins using the fixed parameter method with a low fixed parameter value. A record is
kept of all the thresholds determined by Otsu’s method during the past 100 images –
including those determinations when the value was not used as the actual threshold
because it was too low. To this vector, Otsu’s method is itself applied again, to separate
the vector into values that were found during no vehicle presence and ones that were
found during an actual vehicle presence. If there was at least one image with headlights
and one image without headlights, this method will return good results. Once again, the
problem can occur when trying to separate a non-bimodal distribution. I make the
assumption that in at least one of these images, a car was present. To test to see whether
all the images have headlights in them, I do a 2-sample t-test, comparing the values below
the computed threshold to the values above the computed threshold. If the difference is
significant, then there are likely two different distributions in the data – one with
headlights, and one without headlights, and I set the minimum parameter to the threshold
computed on the vector of 100 previous thresholds. If the difference is not significant,
then probably all of the images had cars in them, and I keep the old threshold. This
assumes that the values of the grey threshold computed when there are cars in the picture
and when there are not cars in the picture both have normal distributions, and I have
found this to be a fairly accurate assumption, as confirmed both visually and by the
Lilliefors normality test. Fig. 5 shows a histogram of the thresholds determined in a 200
image sequence with a highly bimodal distribution due to the presence or absence of
headlights. I found this method to accurately determine when there were actually cars
present in the picture, and to choose a grey threshold accordingly that reflected the
lighting conditions of the image – e.g., images with brighter backgrounds had a higher
threshold. However, due to the insensitivity of the rest of the algorithm to the value of
this parameter, this method failed to return significantly better or even significantly
different results on the actual dataset than the simple-minded hard-coded parameter
method.
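The learned-threshold variant can be sketched as follows, assuming history holds the Otsu
thresholds from the last 100 frames and using MATLAB's two-sample t-test ttest2:

    % Update the minimum-threshold parameter from recent Otsu thresholds.
    history = [history(2:end), tOtsu];      % rolling window of 100 values
    tSplit = graythresh(history);           % Otsu applied to the thresholds
    lo = history(history <  tSplit);        % presumed headlight-free frames
    hi = history(history >= tSplit);        % presumed frames with headlights
    if numel(lo) > 1 && numel(hi) > 1 && ttest2(lo, hi)
        minParam = tSplit;                  % two distributions: adopt new floor
    end                                     % otherwise keep the old parameter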

Fig. 5. Histogram of 100 threshold values determined by Otsu’s method.
Having converted the image into black and white, the next step is to identify cars
from the white blobs that may correspond to headlights or reflections. A typical image at
this stage is shown in Fig. 6. As discussed before, other work [13] has attempted to use
prior knowledge about headlight shape to accomplish this segmentation. Unfortunately,
they give virtually no details about their algorithm, so it was not possible to reproduce
their results. After a considerable amount of experimentation with template-based
matching, the technique used in [13], I decided that these assumptions about headlight
shape were not valid enough in general to be useful, and instead I use a simpler method
that relies only on the assumption that all the headlights and headlight reflections on a car
will be close. First, the binary image is dilated, which tends to connect the unconnected
blobs belonging to a single car. Unfortunately, this simple technique runs the risk of
joining blobs of adjacent cars incorrectly, leading to undercounting the actual number of
vehicles. To mitigate this problem, I use the fact that the user has drawn segments
corresponding to the lanes in the region in the initial setup and I separate blobs along
those lane boundaries. Assuming that all cars are entirely in a lane, this essentially solves
the problem of connection across lane boundaries, and leaves only the potential problem
of connecting two cars within a lane. But since the headlights of cars in a single lane are
separated by dark car bodies, this is rarely a problem. Simply counting the blobs found at
this stage gives a reasonable result, but I do one further step of noise-reduction that
improves performance further. Since all headlights should by now have been joined
together in blobs of considerable size, I eliminate all blobs below a certain threshold size,
since these usually correspond to small reflections on or around vehicles that have already
been counted. The size chosen is an area of 15 pixels, which is an extremely
conservative estimate for the size of a headlight, especially after dilation. Rather than an
assumption about the size of the headlights in the images, it is best considered as a
constraint on the choice of the region of interest by the traffic operator, requiring the
region to be placed close enough to the camera so that dilated headlights have an area
of more than 15 pixels. Indeed, if this is not the case, the rest of the algorithm is unlikely
to perform well anyway, as the resolution will be very poor.
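A sketch of this stage (laneMask is assumed to mark the pixels of the operator's lane
tracings, and seRadius is the structuring-element radius derived below):

    % Group headlight blobs into vehicles and count them.
    D = imdilate(BW, strel('disk', seRadius));  % join blobs within one car
    D(laneMask) = 0;                            % cut blobs at lane boundaries
    D = bwareaopen(D, 15);                      % drop blobs under 15 pixels
    CC = bwconncomp(D);
    carCount = CC.NumObjects;                   % ideally one blob per vehicle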

Fig. 6. Black and white segmentation.
Dilating the image in the above step requires a choice of a structuring element,
and this choice is best determined in a principled manner, as the performance of the
algorithm is considerably affected by it. In images with small headlights, like the ones
shown in the above figures, dilation is not necessary, though a small amount rarely hurts.
In images with a closer view of the traffic, dilation becomes essential. The shape of the
structuring element is simply a disk, as is standard. The idea behind my method of
choosing the radius of the disk is that the disk needs to be large enough to join headlight
pairs but should not be much larger, else it will run the risk of joining together different
cars. The algorithm for calculating this size depends on some assumptions about
headlight size taken from [14] and also informally observed in Trafficland’s images. The
key assumption, specifically, is that the average distance between the headlights is
approximately proportional to the typical headlight size, as both appear in the image. One
case where the assumption is clearly false is when the cars are traveling almost directly towards
the camera, as then the distortion will make the headlights seem much larger. Another
time the assumption does not work well is when the traffic is traveling almost directly
across the camera’s field of view; however, in this case, some dilation is still useful in
connecting cars with the glare reflection, which will be particularly prominent. In
between these two extremes, however, the constancy of this ratio is good enough that
setting the size of the structuring element to be a constant times the estimated headlight
size returns excellent results. The correct ratio is difficult to determine exactly, but it is
close to one; the value I use is 1.3. I measure the size of the headlights by finding the
median area of all the blobs found in the first 100 frames, and calculating the
corresponding radius of a circle of this area.
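A sketch of this sizing rule (blobAreas is assumed to accumulate the blob areas observed
over the first 100 frames):

    % Estimate headlight size and set the dilation radius to 1.3x the
    % radius of a circle with the median blob area.
    stats = regionprops(bwconncomp(BW), 'Area');
    blobAreas = [blobAreas, [stats.Area]];      % accumulate over 100 frames
    A = median(blobAreas);                      % typical headlight area
    seRadius = round(1.3 * sqrt(A / pi));       % equivalent-circle radius x 1.3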
II. Determining the Region Length
The algorithm I use to determine the region length is taken from [15], adapted to
the information available for the scene. First, the traffic operator gives the initial set-up
information pictured in Fig. 7. This includes a region of interest, whose projection onto
the road plane must be rectangular in world coordinates, and a trace of all the lanes in
this region of interest. For good performance, the region of interest must be a straight
section of road, it should begin as close to the camera as possible, and it should not
extend so far that the resolution at the end of the region is too poor (see above for a more
precise definition).

Fig. 7. What the traffic operator draws.


The estimate of the region length begins with computing a camera calibration
from the data that the operator has given. The camera calibration can then be combined
with some simple geometry to estimate the region length. The camera calibration
technique described in [15] is easier for the traffic operator than the one described in [7],
which requires that a grid evenly spaced along the road axis be determined by the
operator, which in practice is a difficult judgment for a human to make. I do not repeat
the detailed derivation of the algorithm in [15] here, but I give an overview of its
operation as applied to this situation. Most of what follows is taken from this paper; for
brevity, I omit the exact citations.
Camera calibration involves finding a camera’s intrinsic and external parameters.
First consider the intrinsic parameters. Recall that they can be represented by the matrix

$$A = \begin{bmatrix} \alpha_u & -\alpha_u \cot\theta & u_0 \\ 0 & \alpha_v/\sin\theta & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$
To simplify the calibration process, I make several assumptions about the internal
parameters which are approximately true for most cameras and common in computer
vision applications. First, I assume that the image axes are in fact perpendicular, so
that θ = 90° and the skew term vanishes. I also assume that the horizontal and vertical
focal lengths are equal (αu = αv), and that u0 and v0, the coordinates of the camera
center, are actually at the image center. This reduces what was previously a five-parameter
problem to a one-parameter problem, leaving αu as the only unknown intrinsic parameter.
To calculate the external parameters, we first calculate a vanishing point. The
vanishing point that can be calculated most accurately is the one in the road direction.
We could use only the region of interest boundaries to calculate this point, but we will get
better results if we also use the tracings of the lanes the user has made. Specifically, we
wish to find the point whose sum of squared distances to all the lines is a minimum. This
least squares estimate can be easily determined by solving a system of linear equations of
the form Mx = b. If there are n lanes on the road, then M and b will each have (n+2)
rows. Let Li be a unit vector in homogeneous coordinates representing the direction of
the ith line (out of the n+2), and let Pi be a point on that line (in homogeneous
coordinates), then we can define the ith row of M and b as follows:
$$M_i = \begin{bmatrix} -L_{i2} & L_{i1} \end{bmatrix}, \qquad b_i = (L_i \times P_i) \cdot \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}^T$$


Then the vanishing point x is simply the pseudo-inverse of M times b. The vanishing
direction can be computed as A⁻¹x, where A is the camera intrinsic parameters matrix
(which is not yet entirely known).
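A minimal MATLAB sketch of this least-squares step, assuming the n+2 line directions and
points are stacked as the rows of arrays L and P:

    % Least-squares vanishing point from the n+2 user-drawn lines.
    % L(i,:) is the unit direction of line i (third component 0);
    % P(i,:) is a point on that line, in homogeneous coordinates.
    m = size(L, 1);
    M = zeros(m, 2);
    b = zeros(m, 1);
    for i = 1:m
        M(i,:) = [-L(i,2), L(i,1)];
        li = cross(L(i,:), P(i,:));   % homogeneous line through P(i,:)
        b(i) = li(3);                 % equals (L_i x P_i) . [0 0 1]'
    end
    x = M \ b;                        % least-squares (pseudo-inverse) solution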
We can describe the world coordinates in terms of three axes: Gx, which is
perpendicular to the vanishing direction, Gy, which is parallel to the vanishing direction,


and Gz, which completes the coordinate system. Let v denote the normalized vanishing
direction, let φ be the roll angle about the vanishing direction, and define β = 1/(1 + vy).
Then we can determine the three axes in terms of these variables, only two of which are
unknown. [15] provides the expressions for Gx and Gz (the expression for Gy is trivial,
since Gy = v); unfortunately, the expression for Gz contains several apparently
typographical errors. The correct expressions are:

$$G_x = \begin{bmatrix} (1-\beta v_x^2)\cos\varphi + \beta v_x v_z \sin\varphi \\ v_z \sin\varphi - v_x \cos\varphi \\ -(1-\beta v_z^2)\sin\varphi - \beta v_x v_z \cos\varphi \end{bmatrix}
\qquad
G_z = \begin{bmatrix} (1-\beta v_x^2)\sin\varphi - \beta v_x v_z \cos\varphi \\ -v_x \sin\varphi - v_z \cos\varphi \\ (1-\beta v_z^2)\cos\varphi - \beta v_x v_z \sin\varphi \end{bmatrix}$$
If these axes were known (currently they are written in terms of two unknowns), we
would then be able to use the axes, our internal parameter matrix, and some geometry to
calculate distances on the road plane. Specifically, this can be done in the following
manner. Given an image point x, compute its projection p̂ = A⁻¹x. This is of course a
vector in the direction of the ray that goes from the camera center to the point x. The
intersection of this ray and the road plane is then

$$P = \frac{G_{z2}}{\hat{p} \cdot G_z}\,\hat{p}$$

Given two such projections P1 and P2, the distance between them is simply ||P1 – P2||, up
to some unknown but constant scale factor.
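As a sketch, once A and Gz are known this distance computation is only a few lines
(roadDistance is a hypothetical helper name):

    % Road-plane distance between two image points, up to the global
    % scale factor fixed later by the lane-width assumption.
    function d = roadDistance(x1, x2, A, Gz)
    % x1, x2: image points as 3x1 homogeneous column vectors
    p1 = A \ x1;                         % back-project to viewing rays
    p2 = A \ x2;
    P1 = (Gz(2) / dot(p1, Gz)) * p1;     % intersect each ray with road plane
    P2 = (Gz(2) / dot(p2, Gz)) * p2;
    d = norm(P1 - P2);                   % distance up to constant scale
    end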

Of course, all of this assumes that we have values for the two unknowns αu and ϕ. The key
insight of [15] is that with knowledge of the ratios of lengths in the picture, we can use
a non-linear optimization process to solve for those two unknowns. For the Trafficland
situation, say again that we have n lanes. Then we know of n+2 segments in the direction
of the road that must be of the same length in the world. Also, all the 2n+2 segments
perpendicularly connecting the lanes at the beginning and end of the region of interest
must be of the same length. Let us denote the n+2 road-parallel segments as q0, …, qn+1 and
the 2n+2 perpendicular segments as s0, …, s2n+1. The residual I compute is a modification
of the one in [15] and is defined by:

 2 n+1 s

r = ∑  0 − 1 + ∑  0 − 1
1  qi
1  si

n +1



A non-linear optimization process can then be used to solve for the αu and φ that minimize
r. [15] recommends the Levenberg-Marquardt method, but I use a subspace trust region
method based on the interior-reflective Newton method, as some informal
experimentation showed that this algorithm was much less likely to converge to incorrect
local minima when given an initial value distant from the correct one. Finally, the scale
factor can be easily determined by assuming the lane width as stated above and dividing
the actual lane width by the average of the computed ones.
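In MATLAB, this corresponds to lsqnonlin's trust-region-reflective algorithm. The sketch
below assumes a hypothetical helper segmentLengths that applies the projection formulas
above to the operator's segments and returns the q and s lengths for a candidate (αu, φ):

    % Solve for alpha_u and phi by minimizing the residual r.
    opts = optimoptions('lsqnonlin', 'Algorithm', 'trust-region-reflective');
    p0 = [500, 0];                       % illustrative initial [alpha_u, phi]
    p = lsqnonlin(@calibResidual, p0, [], [], opts);

    function res = calibResidual(p)
    % segmentLengths (hypothetical) returns the road-parallel lengths q
    % (n+2 of them) and the perpendicular lengths s (2n+2 of them).
    [q, s] = segmentLengths(p(1), p(2));
    res = [q(1)./q(2:end) - 1, s(1)./s(2:end) - 1];
    end   % lsqnonlin minimizes sum(res.^2), which is exactly r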
Some Considerations of Computational Efficiency
The algorithm essentially has two parts, the initial setup and calibration, and the
counting of vehicles in actual operation. Obviously, the efficiency requirements of the
two are quite different. The part of the algorithm that operates in real-time must be
highly efficient; however, the initial setup, which only happens once, is not nearly so
constrained.
Virtually the entirety of the computational time for the initial setup is consumed
by the nonlinear optimization process, which must compute a relatively computationally
intensive function many times to find a good minimum. The function that it computes is
constant time with respect to the size of the image, but it contains several matrix
inversions and a good deal of matrix multiplication and arithmetic operations. Typical
running time for the nonlinear optimization process to complete is about ten seconds on a
Pentium 4. Considering that it will take the operator significantly longer to draw the lines
on the image that are used in the calibration process, this seems within acceptable limits.
The part of the algorithm which must operate in real time is the car counting.
Empirically, this algorithm is highly efficient, requiring only about .2 seconds per frame,
whereas the frames occur at less than one frame per second. This implies that each
computer could process the video feeds from 5 cameras simultaneously, assuming no frames
are dropped (and indeed it would be possible to drop frames without affecting performance
significantly). This is superior to most published algorithms, which are usually able to
handle only one camera [1]. Part of the reason for the high efficiency
comes from the small size of the images that are effectively being worked with. The
original traffic image is 320x240. But most of this is background, and the region of
interest size is typically on the order of 100x100. Profiling the execution of the algorithm
using the excellent profiler tool in MATLAB showed that the algorithm spent 64.4% of
its time directly computing morphological operations of some kind. When the time to
check arguments, resize matrices and execute other miscellaneous utility functions
connected with the morphological operations is taken into account, the actual percentage
of the time spent doing morphological operations is between 80% and 90%. Virtually all
of the rest of the CPU time is spent resizing the image and converting it to greyscale.
Computing the threshold using Otsu's method takes only 2.0% of the time. The top-hat
transformation at the beginning is particularly computationally intensive (55%), as for a
circular structuring element, the computational complexity of the morphological opening is
proportional to the area of the circle, which is rather large. This may indicate that in
situations where computational efficiency is at a premium, a smaller structuring element or
one of an easier shape should be substituted, though this was not investigated.

The memory requirements of the algorithm are very modest. Since each image is
processed individually, only the data to process that particular image must be stored. In
my implementation, that is approximately 4 times the memory requirement of the image
cropped down to the region of interest, because temporarily we must store the original
image, the top-hat enhanced image, the segmented image, and the dilation of the
segmented image. The other major memory requirement comes from storing the sizes of
the headlights of the past 100 frames. In a fielded system, this would really not need to
be calculated every frame; instead it could just be re-calculated every few thousand
frames. However, during its calculation, storing these sizes takes a matrix of about 1000
elements, assuming about ten blobs per image. The other variables are independent of the size of
the images and very small.
Empirical Results
The gold standard in the empirical validation of computer vision speed detection
algorithms is simultaneous inductance loop data. Inductance loops are wires run under
the highway and connected to electrical monitoring equipment in such a way that a clear
and easily measurable electrical impulse occurs when an axle rolls over the wire. By
building two inductance loops close together at a known distance, the speed of traffic on
the highway can be measured very accurately. If the images being analyzed by a
computer vision algorithm have simultaneous loop detector data, the algorithm’s results
can be compared with the known good speeds from the inductance loops and the
algorithm validated with high accuracy.
Unfortunately, no such simultaneous data is publicly available. Without it, a total
verification of the algorithm’s accuracy is impossible, but significant confirmation is still
possible. Recall that the accuracy of the entire algorithm rests on the accuracy of three
components: the counting of the cars, the determination of the size of the region, and the
relationship between speed and density; if all three of these are correct, then the speed
estimates produced must also be correct. The third of these is impossible for me to test,
but it has been verified by numerous studies into the matter, and so it is reasonable to
assume its accuracy. The second of these is also nearly impossible for me to test
accurately. However, I can say informally that the algorithm returns results within some
reasonable bound – there is at least no egregious error in implementation. More
importantly, this algorithm has been used before in [15], and they provide considerable
empirical validation of the approach using data obtained from physically measuring the
road. Thus, the only part of the algorithm whose accuracy is in serious question is the car
counting, and this is easily checked by counting the cars by hand and comparing that
actual result to the estimated result found by the algorithm. In brief, such a comparison
shows that the algorithm has excellent accuracy.
However, it is not sufficient validation to test the algorithm on a single camera in
that manner, and furthermore it is not really sufficient validation of the robustness of the
algorithm to test the algorithm on a camera whose images have been used to develop the
algorithm. To provide a valid test of robustness, I developed the algorithm while working
with the images of only one camera. Once the algorithm was performing well, I froze the
code and then tested it on image sequences from several new cameras. However, I did
not choose the new cameras randomly – rather I chose only cameras that met the fairly
restrictive criteria outlined in the Assumptions section. Unfortunately, only a small
percentage of Trafficland’s cameras actually meet those criteria; however I believe that
my algorithm will work more or less equally well on all that do. Most of the cameras are
disqualified either because the traffic is going in the wrong direction or because of some
form of severe distortion from headlight or streetlight glare.
The key results of the study are shown in Table 1. They consist of twenty-image
sequences from four cameras, and they compare the hand-counted results with the
automatically determined results. Of the four cameras, one was the base camera the
algorithm was developed on, and three were the new cameras in the test set. The results
show that the algorithm estimates are essentially nonbiased and quite accurate over a
fairly large range of traffic densities.
Table 1. Empirical Testing Results: hand-counted vs. algorithm vehicle counts and % error
for each camera.

References

1. Dailey, D.J., Cathey, F.W., Pumrin, S., An algorithm to estimate mean traffic speed
using uncalibrated cameras, IEEE Trans. Intelligent Transportation Systems, vol. 1, no. 2,
June 2000, pp. 98-107.
2. S. Takaba, M. Sakauchi, T. Kaneko, B. Won-Hwang, and T. Sekine,
Measurement of traffic flow using real time processing of moving pictures,
in Proc. 32nd IEEE Vehicular Technology Conf., San Diego, CA, May 23–26, 1982.
3. N. Hashimoto, Y. Kumagai, K. Sakai, K. Sugimoto, Y. Ito, K. Sawai,
and K. Nishiyama, Development of an image-processing traffic flow measuring system,
Sumitomo Electric Tech. Rev., no. 25, pp. 133–137.
4. Traffic Flow Theory, edited by N.H. Gartner, C.J. Messer, and A.K. Rathi.
Washington, D.C.: US Federal Highway Administration. Chap. 2, Traffic Stream
Characteristics, by Hall, F.


5. José Melo, Andrew Naftel, Alexandre Bernardino, José Santos-Victor: Viewpoint
Independent Detection of Vehicle Trajectories and Lane Geometry from Uncalibrated
Traffic Surveillance Cameras. Proc. Int. Conference on Image Analysis and Recognition
(ICIAR 2004), Porto, October 2004, pp. 454-462.
6. Peek Traffic VideoTrak Detection System.
7. Worrall, A. D., Sullivan, G. D. and Baker, K. D. A simple, intuitive camera calibration
tool for natural images, Proc. 5th British Machine Vision Conference, 13-16 September,
University of York, York, 1994, pp 781-790.
8. A Policy on Geometric Design of Highways and Streets (AASHTO Green Book), American
Association of State Highway and Transportation Officials, Jan. 2001.
9. K.W. Dickinson and R. C.Waterfall, “Video image processing for monitoring road
traffic,” in Proc. IEE Int. Conf. Road Traffic Data Collection, Dec. 5–7, 1984, pp. 105–
10. R. Ashworth, D. G. Darkin, K.W. Dickinson, M. G. Hartley, C. L.Wan, and R. C.
Waterfall, “Applications of video image processing for traffic control systems,” in Proc.
2nd Int. Conf. Road Traffic Control, London, U.K., Apr. 14–18, 1985, pp. 119–122.
11. Bickel, P., Chen, C., Kwon, J., Rice, J., van Zwet, E., Varaiya, P., Measuring
traffic, preprint, June 2004.
12. Drake, J.S., J.L. Schofer, and A.D. May. 1967. A statistical analysis of speed density
hypotheses. Highway Research Record 154, Highway Research Board, NRC,
Washington, D.C.: 53-87.
13. Cucchiara, R., Piccardi, M., Vehicle detection under day and night illumination, in
Proc. of IIA’99 - Third Int. ICSC Symp. on Intelligent Industrial Automation., Special
Session on Vision Based Intelligent Systems for Surveillance and Traffic Control, 1999,
pp. 789-794.
14. Zwahlen, H.T., and Schnell, T., Driver-headlamp dimensions, driver characteristics,
and vehicle and environmental factors in retroreflective target visibility calculations,
Transportation Research Record 1692, National Academy of Sciences, Washington, DC.
15. Masoud, O., Papanikolopoulos, N.P., Kwon, E., The use of computer vision in
monitoring weaving sections, IEEE Trans. Intelligent Transportation Systems, vol. 2,
no. 1, March 2001, pp. 18-25.