
Finding Mean Traffic Speed in Low Frame-rate Video

Jared Friedman

January 14, 2005 Final Project Report Computer Science 283

Abstract

In this paper, I present a novel approach to estimate mean traffic speed using low frame-rate video taken from an uncalibrated camera. This approach takes advantage of a known relationship between traffic speed and traffic density to make tracking of individual vehicles unnecessary. The algorithm has been developed especially for nighttime conditions, though extensions to daytime images seem quite possible. It has been tested on several image sequences and shown to produce results consistent with human estimations from those sequences.

Introduction

Computer vision techniques have been applied to images of traffic scenes for a variety of purposes [1]. One of the more popular of these purposes is to attempt to extract a measure of the level of congestion of the road in the scene. Properly disseminated, this information can be used by drivers to plan routes that avoid traffic and by first responders to identify accidents. The measurement typically used for congestion is the mean speed of traffic, as this is the measurement a rational traveler should care about. In this paper, I present a new algorithm to estimate mean traffic speed using low frame-rate video. The work is motivated by Trafficland, a new company that offers video streams from over 400 traffic cameras in the Washington D.C. area through free internet access (at www.trafficland.com). Previous work on finding traffic speed has measured it directly, essentially by tracking vehicles for a known time over a known distance and calculating an average ratio. However, due to bandwidth limitations, Trafficland's cameras give video at less than one frame per second, with unreliable and difficult-to-determine time intervals between frames, making tracking extremely difficult, if possible at all. Some published algorithms [2], [3] instead place two virtual lines or "tripwires" on the road at a known separation and measure the time interval between cars crossing the first and crossing the second, in the natural computer vision analogue of the physical loop detectors on roads. While this may seem to be a different technique from tracking, it shares many similarities, including the assumption that cars will not travel far from one frame to the next, and in practice it requires an even higher frame rate than tracking.

Nevertheless, humans are clearly able to judge traffic levels from Trafficland's video, and they would still be able to do so even if shown only every two or three frames, making tracking literally impossible. I assert that the way we make this judgment is by determining how closely spaced the cars are and using the intuitive fact that closely spaced cars tend to travel more slowly than widely spaced ones. More precisely, we use the inverse correlation between mean traffic speed and traffic density, which is defined as vehicles per lane-mile [4]. The approach of this paper is to take advantage of this relationship: compute density directly, which is much easier to do at low frame rates, and then convert it into a speed using the known relationship.

To compute density in a particular region of interest, we must know the number of lanes of the region, the length of the region, and the number of cars in the region in each frame. Some traffic vision systems have had to accommodate cameras that could be rotated and zoomed by traffic operators with joysticks (e.g., [1], [5]), and thus have had to build in some automatic calibration capability to their programs. However, Trafficland's cameras appear to be stationary, and we make the assumption that the number of lanes and the length of the region need only be determined once, and take advantage of human input in this one-time low-cost setup procedure. Many published and commercial systems [6], [7] also require some initial human setup: for one, if there are multiple roads or two directions of a single road in the picture (as is usually the case), the software cannot possibly know which road is the intended one without some human input. Specifically, the initial calibration setup simply requires a human to draw a rectangle (in world coordinates) around an area of interest and then to trace out the lanes in the region. Using some simple geometric constraints and assuming a typical lane width, the length of the region can then be calculated.

The calculation of the number of vehicles in the region proved to be more challenging than expected. This is partially because surprisingly little previous work has been done on the problem. Several algorithms have developed excellent tracking of cars in daytime conditions at high frame rates, which implies that they are able to recognize vehicles to some extent. However, in tracking vehicles directly, it is not necessary to segment cars properly, but only to identify blobs that correspond to multiple vehicles or parts of vehicles, since in a traffic stream all the vehicles, their parts, and their shadows tend to move at about the same velocity. The algorithm reported in [1] does require correct segmentation of vehicles, because it must estimate their size correctly, but it requires correct segmentation of only a few vehicles at a time, and thus it simply throws away any blobs that do not correspond to a tightly defined vehicle profile. Accurate counting of vehicles in daytime conditions requires a more sophisticated approach to deal with occlusion, shadows, and vehicles of widely varying appearance.

In this preliminary report, I chose to focus on nighttime conditions only and to leave daytime conditions for future work. Nighttime conditions are easier because at night it is usually possible to simply count the number of headlights appearing in the region, and headlights are much more visible and less vulnerable to occlusion, shadow, and varying appearance than cars. Nighttime conditions are in any case a more suitable potential use of the algorithm advanced in this paper, since tracking-based systems ordinarily find daytime conditions much easier than nighttime conditions, giving the density approach a particular advantage in these conditions.

This paper first reviews the key assumptions of the algorithm and discusses their validity and which ones could be relaxed in further work. I then discuss in detail the workings of the algorithm, followed by some considerations of its computational efficiency. I conclude with empirical results validating the accuracy of the algorithm.

Underlying Assumptions

The following gives a list of the key assumptions used to simplify the problem and some discussion of their validity.

1) Images are taken at night. Headlights are the brightest objects in the region of interest.

2) Traffic is moving generally towards the camera, but not (almost) directly into it. The second requirement exists because when traffic is going almost directly towards the camera, the glare from the headlights creates bloom, lens flares, and severe distortion. The first requirement exists because if the traffic is moving away from the camera, the headlights will not be directly visible. In my opinion, the second of these twin requirements is much more reasonable than the first. Many of Trafficland’s nighttime images are so severely distorted by the lens flares that it is nearly impossible even for a human to determine the amount of traffic, and working with these images would be a real challenge. However, tracking cars going away from the camera will obviously be necessary for a fielded system, and future systems could use either the rear vehicle lights or the fairly bright reflected glare from the headlights to accomplish this.

3) Vehicles are confined to the road plane, and there exists a region of interest with straight edges. Also, the number of lanes in the region of interest is constant. These requirements are necessary for the calculation of the geometry of the situation.

4) The width of each lane in the picture is approximately 11.5 feet. This assumed distance is required to determine the scale of the image and thus the length of the region of interest. The validity of the assumption is taken from [8], which states that virtually all American highways have lane widths between 10 and 12.5 feet, with lane widths close to 12 feet being the most common. Other systems have used a variety of means to attempt to produce a scale measurement, mostly by placing physical marks on the road [9], [10], although [1] did so by assuming a known distribution of vehicle lengths. However, in this situation it was impossible to have an operator place marks on or near the road. Estimating by mean vehicle length requires an algorithm that can accurately determine the length of all vehicles, including trucks, which is difficult to construct and must be run over a considerable period to get an accurate mean value. Furthermore, it is not at all clear that the mean vehicle length is more constant from road to road than the mean lane width. [11] reports evidence that the mean vehicle length changed considerably depending on the time of day, the highway, and the lane observed, primarily due to the considerable variation in the presence of large trucks, leading to large errors in systems that assumed a constant vehicle length. Using the lane width as a calibration tool appears to be a novel suggestion, and it seems a sensible choice for a variety of situations, not limited to low frame-rate video. It is perhaps worth noting that if the lane width did not hold to the 10-12.5 foot range, then the validity of assumption (5) would be in question anyway, as this would affect the density-speed relationship.

5) Traffic speed and density have a known and constant relationship, specifically Edie's model as given in [4]. This assumption is admittedly somewhat controversial. While the inverse correlation between speed and density is obvious, the exact relationship has been a topic of considerable debate. For decades, it was believed to be a linear relationship on the basis of a single study using seven data points, all collected from a single highway [4]. Further study by Greenberg found that a logarithmic relationship was the best fit, as in Fig. 1. However, despite the seemingly excellent fit, a number of caveats can be raised with Greenberg's methods, and several later studies found that Greenberg's relationship was only a mediocre fit to their data. The modern favored choice for the relationship is Edie's hypothesis, which is a piecewise function shown in Fig. 2. The piecewise relationship has not only been confirmed by a rigorous study into the matter [12], it also fits well with theoretical models of traffic flow, which invariably divide the problem into at least two subcases corresponding to free flow and congested flow, if not more. Despite all the debate as to the precise nature of the relationship, the actual difference for this application between the initial Greenshields model and the most recent Edie model is at most 10-20%. However, the data for these studies appears to have been collected only during the daytime and only in normal weather conditions. How nighttime conditions and adverse weather might affect the relationship is quite unknown.


Fig. 1. Greenberg’s speed-density hypothesis, plotted with his data.


Fig. 2. Edie's speed-density hypothesis, plotted with his data.

Finally, it is important to note that there is another relationship between traffic variables that could be useful in further study. The relationship is between traffic speed and traffic volume, which is defined as cars per lane-hour. I chose not to use this relationship in my algorithm because the relationship between volume and speed is considerably less well-established than the one between speed and density. However, calculating volume does not require finding the length of the region of interest, and thus it is immune from that source of error. If the expected level of error from the estimation of the length of the region were greater than the expected error from the estimation of the speed-volume relationship, then it would be a reasonable choice to abandon density and instead calculate volume. Determining whether this is in fact the case unfortunately goes beyond the scope of this paper, but if it were, then the change would be quite easy to make. The part of the algorithm that calculates the length of the region would simply be dropped, and instead the number of cars counted by the second part of the algorithm, divided by the product of the time period of operation and the number of lanes, could be plugged into the function in [4] that computes an expected relationship between volume and speed.
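To make the density-to-speed conversion concrete, the following is a minimal MATLAB sketch of the approach: the vehicle count is turned into a density and then mapped to a speed through a piecewise relationship in the spirit of Edie's hypothesis (an Underwood-style exponential in free flow, a Greenberg-style logarithm in congested flow). The parameters vFree, kCrit, and kJam are illustrative assumptions, not values taken from [4].

    % Sketch: convert a vehicle count in the region of interest into a mean
    % speed estimate via density. The model parameters below are illustrative
    % assumptions, not values from the traffic flow literature cited in [4].
    function v = speedFromCount(numCars, numLanes, regionLengthMiles)
        k = numCars / (numLanes * regionLengthMiles);   % density, vehicles per lane-mile
        vFree = 65;    % assumed free-flow speed, mph
        kCrit = 50;    % assumed density at the regime break, veh/lane-mile
        kJam  = 220;   % assumed jam density, veh/lane-mile
        if k < kCrit
            v = vFree * exp(-k / kCrit);                                  % free-flow regime
        else
            v = vFree * exp(-1) * log(kJam / k) / log(kJam / kCrit);      % congested regime, matched at kCrit
        end
    end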

Algorithm Operation

This section details the workings of the algorithm. The first part explains the counting of cars, and the second part explains the camera calibration and determination of the region length.

I. Counting Cars

The algorithm operates on a sequence of nighttime images, sampled at virtually any frame rate. The images in the dataset are originally in color, but they are converted to greyscale for analysis. For nighttime images, there is normally very little useful information in the color channels, and what information there might have been is obscured by the terrible color distortion in Trafficland’s cameras. The car counting part of the algorithm assumes that an operator has drawn a rectangular region of interest and that we are counting cars only in that region.

The car-counting algorithm operates essentially by counting headlights. Unlike other papers [13], I do not assume a particular shape for headlights, nor do I require that each vehicle have exactly two nearly identical headlights. While those assumptions are often valid, occlusion, reflections on the road and on the vehicles, and varying headlight configurations complicate the picture. Instead, I merely assume that each vehicle has one or more bright spots on or right next to it. The algorithm finds those spots and attempts to determine which ones belong to which vehicles.


Fig. 3. A typical unprocessed Trafficland image

The first step of the algorithm is to crop the image to the smallest size that contains the region of interest, and then to set all the pixels outside the region of interest to zero intensity. The image is then converted to greyscale. A typical image at this stage is shown in Fig. 3. The image is then top-hat filtered. Top-hat filtering is a technique used to smooth out uneven dark backgrounds; it is defined by subtracting a morphological opening of the input image from the input image itself. This has the effect of reducing background noise by eroding it away, and thus producing a more even, clean background. Results of top-hat filtering are shown in Fig. 4. Top-hat filtering requires a choice of a structuring element for the morphological opening. The choice of a disk shape was easy, as this is standard. The choice of the size of the disk was more difficult and also somewhat arbitrary. Examination of the scale of a few Trafficland cameras showed that a disk radius of 10 pixels gave good results, and some further experimentation showed that the performance of the algorithm was highly insensitive to changes in this size.
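As a concrete illustration, the preprocessing described above can be sketched in a few lines of MATLAB (Image Processing Toolbox). The variables frame and roiMask (the operator-drawn region of interest as a logical mask) are assumed to exist; cropping to the bounding box of the mask is omitted for brevity.

    % Preprocessing sketch: greyscale conversion, ROI masking, top-hat filtering.
    gray = rgb2gray(frame);          % discard the unreliable color channels
    gray(~roiMask) = 0;              % zero out all pixels outside the region of interest
    se = strel('disk', 10);          % disk structuring element, radius 10 pixels
    tophat = imtophat(gray, se);     % suppress the uneven dark background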


Fig. 4. Image after top-hat filtering.

The next step is to choose a threshold and convert the greyscale image into black and white. Choosing the threshold is obviously the difficult part. If there are headlights in the image, then Otsu’s method, which chooses a threshold to minimize the intra-class variance, works very well. This is essentially because in this case, the image histogram will be strongly bimodal between headlight and not-headlight, and this method will easily find the dividing point. However, in an image that has no headlights, Otsu’s method will return terrible results, as it will cause a segmentation of the road itself based on random noise on the road, but in this case we want it to segment to all black pixels. One way to solve this problem is simply to set a fixed parameter that represents a minimum reasonable headlight intensity, and to take the threshold to be the maximum of this intensity and the threshold computed by Otsu’s method. In practice, this method will return good results with almost all images, as the difference between the road background, generally 0 to .3 in intensity, and the headlight intensity, usually .9 to 1, is so extreme that any parameter value choice of .4 to .7 will correctly separate them, and the choice of the parameter within this region will have little effect on the quality of the segmentation.
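A minimal sketch of this thresholding rule, assuming the top-hat filtered image tophat from the previous step and a minimum headlight intensity of 0.5 (any value in the .4 to .7 range discussed above would do):

    % Thresholding sketch: Otsu's method with a fixed lower bound.
    tophatD = im2double(tophat);               % scale intensities to [0,1]
    minHeadlight = 0.5;                        % assumed minimum plausible headlight intensity
    level = max(graythresh(tophatD), minHeadlight);
    bw = im2bw(tophatD, level);                % binary image of candidate headlight pixels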

In the expectation that this method might not return optimal results for all images, I pursued a method that would learn the threshold from previous images. The algorithm begins using the fixed-parameter method with a low fixed parameter value. A record is kept of all the thresholds determined by Otsu's method during the past 100 images, including those determinations when the value was not used as the actual threshold because it was too low. To this vector, Otsu's method is itself applied again, to separate the vector into values that were found when no vehicle was present and values that were found when a vehicle was actually present. If there was at least one image with headlights and one image without headlights, this method will return good results. Once again, the problem can occur of trying to separate a non-bimodal distribution. I make the assumption that in at least one of these images, a car was present. To test whether all the images have headlights in them, I do a 2-sample t-test, comparing the values below the computed threshold to the values above the computed threshold. If the difference is significant, then there are likely two different distributions in the data, one with headlights and one without, and I set the minimum parameter to the threshold computed on the vector of 100 previous thresholds. If the difference is not significant, then probably all of the images had cars in them, and I keep the old threshold. This assumes that the values of the grey threshold computed when there are cars in the picture and when there are not both have normal distributions, and I have found this to be a fairly accurate assumption, as confirmed both visually and by the Lilliefors normality test. Fig. 5 shows a histogram of the thresholds determined in a 200-image sequence with a highly bimodal distribution due to the presence or absence of headlights. I found this method to accurately determine when there were actually cars present in the picture, and to choose a grey threshold accordingly that reflected the lighting conditions of the image (e.g., images with brighter backgrounds had a higher threshold). However, due to the insensitivity of the rest of the algorithm to the value of this parameter, this method failed to return significantly better or even significantly different results on the actual dataset from the simple-minded hard-coded parameter method.
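The learned-threshold variant can be sketched as follows, assuming otsuHistory is a vector holding the Otsu thresholds computed for the last 100 frames (ttest2 is from the Statistics Toolbox); this is a sketch of the idea rather than the exact implementation.

    % Learned-threshold sketch: split the threshold history with Otsu's method,
    % then use a two-sample t-test to decide whether two populations are present.
    split = graythresh(otsuHistory);            % Otsu applied to the history vector
    low   = otsuHistory(otsuHistory <  split);  % thresholds presumed to come from empty frames
    high  = otsuHistory(otsuHistory >= split);  % thresholds presumed to come from frames with headlights
    if numel(low) > 1 && numel(high) > 1 && ttest2(low, high)
        minHeadlight = split;                   % the two groups differ: adopt the learned floor
    end                                         % otherwise keep the previous floor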


Fig. 5. Histogram of 100 threshold values determined by Otsu’s method.

Having converted the image into black and white, the next step is to identify cars from the white blobs that may correspond to headlights or reflections. A typical image at this stage is shown in Fig. 6. As discussed before, other work [13] has attempted to use prior knowledge about headlight shape to accomplish this segmentation. Unfortunately, they give virtually no details about their algorithm, so it was not possible to reproduce their results. After a considerable amount of experimentation with template-based matching, the technique used in [13], I decided that these assumptions about headlight shape were not valid enough in general to be useful, and instead I use a simpler method
that relies only on the assumption that all the headlights and headlight reflections on a car will be close together. First, the binary image is dilated, which tends to connect the unconnected blobs belonging to a single car. Unfortunately, this simple technique runs the risk of joining blobs of adjacent cars incorrectly, leading to undercounting the actual number of vehicles. To mitigate this problem, I use the fact that the user has drawn segments corresponding to the lanes in the region in the initial setup, and I separate blobs along those lane boundaries. Assuming that all cars are entirely in a lane, this essentially solves the problem of connection across lane boundaries, and leaves only the potential problem of connecting two cars within a lane. But since the headlights of cars in a single lane are separated by dark car bodies, this is rarely a problem. Simply counting the blobs found at this stage gives a reasonable result, but I do one further step of noise reduction that improves performance further. Since all headlights should by now have been joined together into blobs of considerable size, I eliminate all blobs below a certain threshold size, since these usually correspond to small reflections on or around vehicles that have already been counted. The size chosen is an area of 15 pixels, which is an extremely conservative estimate for the size of a headlight, especially after dilation. Rather than an assumption about the size of the headlights in the images, it is best considered as a constraint on the choice of the region of interest by the traffic operator, requiring the region to be placed close enough to the camera so that dilated headlights have an area of more than 15 pixels. Indeed, if this is not the case, the rest of the algorithm is unlikely to perform well anyway, as the resolution will be very poor.
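The grouping and counting steps can be sketched as below, assuming laneMask is a binary image with ones along the operator-traced lane boundaries and dilateRadius is the structuring-element radius chosen as described in the next paragraph.

    % Blob grouping and counting sketch.
    bwDil = imdilate(bw, strel('disk', dilateRadius));  % join headlights and reflections of one car
    bwDil(laneMask) = 0;                                % re-separate blobs along lane boundaries
    bwClean = bwareaopen(bwDil, 15);                    % discard blobs smaller than 15 pixels
    cc = bwconncomp(bwClean);
    numCars = cc.NumObjects;                            % estimated number of vehicles in the region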


Fig. 6. Black and white segmentation.

Dilating the image in the above step requires a choice of a structuring element, and this choice is best determined in a principled manner, as the performance of the algorithm is considerably affected by it. In images with small headlights, like the ones shown in the above figures, dilation is not necessary, though a small amount rarely hurts. In images with a closer view of the traffic, dilation becomes essential. The shape of the structuring element is simply a disk, as is standard. The idea behind my method of choosing the radius of the disk is that the disk needs to be large enough to join headlight pairs but should not be much larger, else it will run the risk of joining together different cars. The algorithm for calculating this size depends on some assumptions about headlight size taken from [14] and also informally observed in Trafficland's images. The key assumption, specifically, is that the average distance between the headlights is approximately proportional to the typical headlight size, as recorded in the image. One case where the assumption is clearly false is when the cars are traveling almost directly towards the camera, as then the distortion will make the headlights seem much larger. Another case where the assumption does not work well is when the traffic is traveling almost directly across the camera's field of view; however, in this case, some dilation is still useful in connecting cars with the glare reflection, which will be particularly prominent. In between these two extremes, however, the constancy of this ratio is good enough that setting the size of the structuring element to be a constant times the estimated headlight size returns excellent results. The correct ratio is difficult to determine exactly, but it is close to one; the value I use is 1.3. I measure the size of the headlights by finding the median area of all the blobs found in the first 100 frames, and calculating the corresponding radius of a circle of this area.
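This sizing rule amounts to a couple of lines, assuming blobAreas is a vector of blob areas collected from the first 100 frames:

    % Structuring-element sizing sketch: median blob area -> equivalent circle
    % radius, scaled by the ratio of roughly 1.3 discussed above.
    headlightRadius = sqrt(median(blobAreas) / pi);
    dilateRadius = max(1, round(1.3 * headlightRadius));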

II. Determining the Region Length

The algorithm I use to determine the region length is taken from [15], adapted to the information available for the scene. First, the traffic operator gives the initial set-up information pictured in Fig. 7. This includes a region of interest, whose projection onto the road plane must be rectangular in world coordinates, and a trace of all the lanes in this region of interest. For good performance, the region of interest must be a straight section of road; it should begin as close to the camera as possible, and it should not extend so far that the resolution at the end of the region is too poor (see above for a more precise definition).


Fig. 7. What the traffic operator draws.

The estimate of the region length begins with computing a camera calibration from the data that the operator has given. The camera calibration can then be combined with some simple geometry to estimate the region length. The camera calibration technique described in [15] is easier for the traffic operator than the one described in [7], which requires that a grid evenly spaced along the road axis be determined by the operator, which in practice is a difficult judgment for a human to make. I do not repeat the detailed derivation of the algorithm in [15] here, but I give an overview of its operation as applied to this situation. Most of what follows is taken from this paper; for brevity, I omit the exact citations.

Camera calibration involves finding a camera’s intrinsic and external parameters. First consider the intrinsic parameters. Recall that they can be represented by the matrix

$$A = \begin{bmatrix} \alpha_u & -\alpha_u \cot\theta & u_0 \\ 0 & \alpha_v / \sin\theta & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

To simplify the calibration process, I make several assumptions about the internal parameters which are approximately true for most cameras and common in computer vision applications. First, I assume that the image axes are in fact perpendicular, so that θ = 90° and the skew term vanishes. I also assume that the horizontal and vertical focal lengths are equal (α_u = α_v), and that u_0 and v_0, the coordinates of the camera center, are actually at the image center. This reduces what was previously a five-parameter problem to a one-parameter problem, leaving α_u as the only unknown.

To calculate the external parameters, we first calculate a vanishing point. The vanishing point that can be calculated most accurately is the one in the road direction. We could use only the region of interest boundaries to calculate this point, but we will get better results if we also use the tracings of the lanes the user has made. Specifically, we wish to find the point whose sum of squared distances to all the lines is a minimum. This least squares estimate can be easily determined by solving a system of linear equations of the form Mx = b. If there are n lanes on the road, then M and b will each have (n+2) rows. Let L i be a unit vector in homogeneous coordinates representing the direction of the i th line (out of the n+2), and let P i be a point on that line (in homogeneous coordinates), then we can define the i th row of M and b as follows:

$$M_i = \begin{bmatrix} L_{i2} & -L_{i1} \end{bmatrix}, \qquad b_i = -\left( L_i \times P_i \right)^T \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}^T$$

Then the vanishing point x is simply the pseudo-inverse of M times b. The vanishing direction can be computed as $A^{-1}x$, where A is the camera intrinsic parameter matrix (which is not yet entirely known).
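A minimal sketch of this least-squares estimate, with L(:,i) the unit direction of the i-th traced line and P(:,i) a point on it (both in image coordinates); the sign convention in b is my own consistent choice, and alphaU, imgW, and imgH stand for the single unknown focal parameter and the image dimensions.

    % Vanishing-point sketch: build M and b from the traced lines and solve.
    nLines = size(L, 2);                            % n + 2 lines in total
    M = zeros(nLines, 2);  b = zeros(nLines, 1);
    for i = 1:nLines
        M(i, :) = [L(2,i), -L(1,i)];                % normal to the line direction
        b(i)    = L(2,i)*P(1,i) - L(1,i)*P(2,i);    % chosen so that M(i,:) * P(1:2,i) == b(i)
    end
    x = pinv(M) * b;                                % least-squares vanishing point

    % Intrinsics under the simplifying assumptions: zero skew, equal focal
    % lengths, principal point at the image center; alphaU is the only unknown.
    A = [alphaU, 0, imgW/2;  0, alphaU, imgH/2;  0, 0, 1];
    vDir = A \ [x; 1];                              % vanishing direction, up to normalization
    vDir = vDir / norm(vDir);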

We can describe the world coordinates in terms of three axes: G_x, which is perpendicular to the vanishing direction, G_y, which is parallel to the vanishing direction, and G_z, which completes the coordinate system. Let v denote the normalized vanishing direction, let φ be the roll angle about the vanishing direction, and define λ = 1/(1 + v_y). Then we can determine the three axes in terms of these variables, only two of which are unknown. [15] provides the expressions for G_x and G_z (the expression for G_y is trivial); unfortunately, the expression for G_z contains several apparently typographical errors. The correct expressions are:

$$G_x = \begin{bmatrix} (1 - \lambda v_x^2)\cos\varphi - \lambda v_x v_z \sin\varphi & \;\; -v_x \cos\varphi - v_z \sin\varphi & \;\; -\lambda v_x v_z \cos\varphi + (1 - \lambda v_z^2)\sin\varphi \end{bmatrix}^T$$

$$G_z = \begin{bmatrix} -(1 - \lambda v_x^2)\sin\varphi - \lambda v_x v_z \cos\varphi & \;\; v_x \sin\varphi - v_z \cos\varphi & \;\; \lambda v_x v_z \sin\varphi + (1 - \lambda v_z^2)\cos\varphi \end{bmatrix}^T$$

If these axes were known (currently they are written in terms of two unknowns), we would then be able to use the axes, our internal parameter matrix, and some geometry to calculate distances on the road plane. Specifically, this can be done in the following manner. Given an image point x, compute its projection p = A -1 x. This is of course a vector in the direction of the ray that goes from the camera center to the point x. But the intersection of this ray and the road plane is

$$P = \frac{\hat{p}}{\hat{G}_z \cdot \hat{p}}$$

where $\hat{p}$ denotes p normalized to unit length. Given two such projections P_1 and P_2, the distance between them is simply ||P_1 - P_2||, up to some unknown but constant scale factor.
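A small sketch of this projection, assuming A and the road-plane normal Gz have already been computed:

    % Road-plane projection sketch: distances come out correct up to one
    % unknown global scale factor.
    function P = roadPoint(x, A, Gz)
        p = A \ [x(1); x(2); 1];      % ray through the image point x
        p = p / norm(p);
        P = p / dot(Gz, p);           % intersection with the road plane, up to scale
    end
    % Example: relativeLength = norm(roadPoint(x1, A, Gz) - roadPoint(x2, A, Gz));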

Of course, all of this assumes that we have values for the two unknowns α_u and φ. The key insight of [15] is that with a knowledge of the ratios of lengths in the picture, we can use a non-linear optimization process to solve for those two unknowns. For the Trafficland situation, say again that we have n lanes. Then we know of n+2 segments in the direction of the road that must be of the same length in the world. Also, all the 2n+2 segments perpendicularly connecting the lanes at the beginning and end of the region of interest must be of the same length. Let us denote the n+2 road-parallel segments as q_0, …, q_{n+1} and the 2n+2 perpendicular segments as s_0, …, s_{2n+1}. The residual I compute is a modification of the one in [15] and is defined by:

$$r = \sum_{i=1}^{n+1} \left( \frac{q_i}{q_0} - 1 \right)^2 \;+\; \sum_{i=1}^{2n+1} \left( \frac{s_i}{s_0} - 1 \right)^2$$

A non-linear optimization process can then be used to solve for the α_u and φ that minimize r. [15] recommends the Levenberg-Marquardt method, but I use a subspace trust region method based on the interior-reflective Newton method, as some informal experimentation showed that this algorithm was much less likely to converge to incorrect local minima when given an initial value distant from the correct one. Finally, the scale factor can be easily determined by assuming the lane width as stated above and dividing the actual lane width by the average of the computed ones.
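A sketch of this search using MATLAB's lsqnonlin, whose 'trust-region-reflective' algorithm is the subspace trust region / interior-reflective Newton method referred to above. The helper segmentLengths is hypothetical: it stands for the steps already described (vanishing point, axes, road-plane projection) that turn a candidate α_u and roll angle φ into the n+2 road-parallel lengths q and the 2n+2 perpendicular lengths s.

    % Calibration search sketch (Optimization Toolbox). lsqnonlin minimizes the
    % sum of squared residuals, which equals the residual r defined above.
    opts = optimoptions('lsqnonlin', 'Algorithm', 'trust-region-reflective');
    params0 = [500, 0];                 % initial guess: alphaU in pixels, phi in radians
    best = lsqnonlin(@(p) calibResiduals(p, lines, imgSize), params0, [], [], opts);

    function res = calibResiduals(params, lines, imgSize)
        [q, s] = segmentLengths(params(1), params(2), lines, imgSize);  % hypothetical helper
        res = [q(2:end) ./ q(1) - 1,  s(2:end) ./ s(1) - 1];            % one residual per length ratio
    end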

Some Considerations of Computational Efficiency

The algorithm essentially has two parts, the initial setup and calibration, and the counting of vehicles in actual operation. Obviously, the efficiency requirements of the two are quite different. The part of the algorithm that operates in real-time must be highly efficient; however, the initial setup, which only happens once, is not nearly so constrained.

Virtually the entirety of the computational time for the initial setup is consumed by the nonlinear optimization process, which must compute a relatively computationally intensive function many times to find a good minimum. The function that it computes is constant time with respect to the size of the image, but it contains several matrix inversions and a good deal of matrix multiplication and arithmetic operations. Typical running time for the nonlinear optimization process to complete is about ten seconds on a Pentium 4. Considering that it will take the operator significantly longer to draw the lines on the image that are used in the calibration process, this seems within acceptable limits.

The part of the algorithm which must operate in real time is the car counting. Empirically, this algorithm is highly efficient, requiring only about .2 seconds per frame, whereas the frames occur at less than one frame per second. This implies that each computer could process the video feeds from five cameras simultaneously (assuming no frames are dropped, and indeed it would be possible to drop frames without affecting performance significantly), which is superior to most published algorithms, which are usually able to handle only one camera [1]. Part of the reason for the high efficiency comes from the small size of the images that are effectively being worked with. The original traffic image is 320x240, but most of this is background, and the region of interest size is typically on the order of 100x100. Profiling the execution of the algorithm using the excellent profiler tool in MATLAB showed that the algorithm spent 64.4% of its time directly computing morphological operations of some kind. When the time to check arguments, resize matrices, and execute other miscellaneous utility functions connected with the morphological operations is taken into account, the actual percentage of the time spent doing morphological operations is between 80% and 90%. Virtually all of the rest of the CPU time is spent resizing the image and converting it to greyscale. Computing the threshold using Otsu's method takes only 2.0% of the time. The top-hat transformation at the beginning is particularly computationally intensive (55%), as for a circular structuring element the computational complexity of the morphological opening is proportional to the area of the circle, which is rather large. This may indicate that in situations where computational efficiency is at a premium, a smaller structuring element or one of a simpler shape should be substituted, though this was not investigated.

The memory requirements of the algorithm are very modest. Since each image is processed individually, only the data to process that particular image must be stored. In my implementation, that is approximately 4 times the memory requirement of the image cropped down to the region of interest, because temporarily we must store the original image, the top-hat enhanced image, the segmented image, and the dilation of the segmented image. The other major memory requirement comes from storing the sizes of the headlights of the past 100 frames. In a fielded system, this would really not need to be calculated every frame; instead it could just be re-calculated every few thousand frames. However, during its calculation it takes a matrix of about 1000 elements to store it, assuming about ten blobs per image. The other variables are independent of the size of the images and very small.

Empirical Results

The gold standard in the empirical validation of computer vision speed detection algorithms is simultaneous inductance loop data. Inductance loops are wires run under the highway and connected to electrical monitoring equipment in such a way that a clear and easily measurable electrical impulse occurs when a vehicle passes over the wire. By building two inductance loops close together at a known distance, the speed of traffic on the highway can be measured very accurately. If the images being analyzed by a computer vision algorithm have simultaneous loop detector data, the algorithm's results can be compared with the known good speeds from the inductance loops and the algorithm validated with high accuracy.

Unfortunately, no such simultaneous data is publicly available. Without it, a total verification of the algorithm's accuracy is impossible, but significant confirmation is still possible. Recall that the accuracy of the entire algorithm rests on the accuracy of three components: the counting of the cars, the determination of the size of the region, and the relationship between speed and density; if all three of these are correct, then the speed estimates produced must also be correct. The third of these is impossible for me to test, but it has been verified by numerous studies into the matter, and so it is reasonable to assume its accuracy. The second is also nearly impossible for me to test accurately. However, I can say informally that the algorithm returns results within some reasonable bound; there is at least no egregious error in implementation. More importantly, this algorithm has been used before in [15], and they provide considerable empirical validation of the approach using data obtained from physically measuring the road. Thus, the only part of the algorithm whose accuracy is in serious question is the car counting, and this is easily checked by counting the cars by hand and comparing that actual count to the estimated count found by the algorithm. In brief, such a comparison shows that the algorithm has excellent accuracy.

However, it is not sufficient validation to test the algorithm on a single camera in that manner, and furthermore it is not really a sufficient validation of the algorithm's robustness to test it on a camera whose images were used to develop the algorithm. To provide a valid test of robustness, I developed the algorithm while working with the images of only one camera. Once the algorithm was performing well, I froze the code and then tested it on image sequences from several new cameras. However, I did not choose the new cameras randomly; rather, I chose only cameras that met the fairly restrictive criteria outlined in the Assumptions section. Unfortunately, only a small percentage of Trafficland's cameras actually meet those criteria; however, I believe that my algorithm will work more or less equally well on all that do. Most of the cameras are disqualified either because the traffic is going in the wrong direction or because of some form of severe distortion from headlight or streetlight glare.

The key results of the study are shown in Table 1. They consist of twenty-image sequences from four cameras, and they compare the hand-counted results with the automatically determined results. Of the four cameras, one was the base camera the algorithm was developed on, and three were the new cameras in the test set. The results show that the algorithm estimates are essentially nonbiased and quite accurate over a fairly large range of traffic densities.

Camera      Type        Mean    S.D.    % Error
Base        Manual      4.25    1.1     1.2
Base        Automatic   4.30    1.7
Camera 1    Manual      0.40    0.5     12.5
Camera 1    Automatic   0.45    0.6
Camera 2    Manual      7.30    1.5     2.1
Camera 2    Automatic   7.15    2.1
Camera 3    Manual      4.5     1.5     4.4
Camera 3    Automatic   4.7     1.8

Table 1. Empirical Testing Results.

References

1. Dailey, D.J., Cathey, F.W., and Pumrin, S., "An algorithm to estimate mean traffic speed using uncalibrated cameras," IEEE Trans. Intelligent Transportation Systems, vol. 1, no. 2, June 2000, pp. 98-107.

2. Takaba, S., Sakauchi, M., Kaneko, T., Won-Hwang, B., and Sekine, T., "Measurement of traffic flow using real time processing of moving pictures," in Proc. 32nd IEEE Vehicular Technology Conf., San Diego, CA, May 23-26, 1982, pp. 488-494.

3. Hashimoto, N., Kumagai, Y., Sakai, K., Sugimoto, K., Ito, Y., Sawai, K., and Nishiyama, K., "Development of an image-processing traffic flow measuring system," Sumitomo Electric Tech. Rev., no. 25, pp. 133-137.

4. Hall, F., "Traffic Stream Characteristics," Chap. 2 in Traffic Flow Theory, edited by Gartner, N.H., Messer, C.J., and Rathi, A.K., US Federal Highway Administration, Washington, D.C.

5. Melo, J., Naftel, A., Bernardino, A., and Santos-Victor, J., "Viewpoint Independent Detection of Vehicle Trajectories and Lane Geometry from Uncalibrated Traffic Surveillance Cameras," in Proc. International Conference on Image Analysis and Recognition (ICIAR 2004), Porto, October 2004, pp. 454-462.

6. Peek Traffic VideoTrak Detection System. Described at http://www.peek-traffic.com/File.asp?FileID=ss96-081-1VideoTrak.

7. Worrall, A.D., Sullivan, G.D., and Baker, K.D., "A simple, intuitive camera calibration tool for natural images," in Proc. 5th British Machine Vision Conference, University of York, York, 13-16 September 1994, pp. 781-790.

8. A Policy on Geometric Design of Highways and Streets (AASHTO Green Book), American Association of State Highway and Transportation Officials, Jan. 2001, pp. 315-316.

9. Dickinson, K.W., and Waterfall, R.C., "Video image processing for monitoring road traffic," in Proc. IEE Int. Conf. Road Traffic Data Collection, Dec. 5-7, 1984, pp. 105-109.

10. Ashworth, R., Darkin, D.G., Dickinson, K.W., Hartley, M.G., Wan, C.L., and Waterfall, R.C., "Applications of video image processing for traffic control systems," in Proc. 2nd Int. Conf. Road Traffic Control, London, U.K., Apr. 14-18, 1985, pp. 119-122.

11. Bickel, P., Chen, C., Kwon, J., Rice, J., van Zwet, E., and Varaiya, P., "Measuring traffic," preprint, June 2004, http://www.stat.berkeley.edu/users/rice/664.pdf.

12. Drake, J.S., Schofer, J.L., and May, A.D., "A statistical analysis of speed density hypotheses," Highway Research Record 154, Highway Research Board, NRC, Washington, D.C., 1967, pp. 53-87.

13. Cucchiara, R., and Piccardi, M., "Vehicle detection under day and night illumination," in Proc. of IIA'99 - Third Int. ICSC Symp. on Intelligent Industrial Automation, Special Session on Vision Based Intelligent Systems for Surveillance and Traffic Control, 1999, pp. 789-794.

14. Zwahlen, H.T., and Schnell, T., "Driver-headlamp dimensions, driver characteristics, and vehicle and environmental factors in retroreflective target visibility calculations," Transportation Research Record 1692, National Academy of Sciences, Washington, D.C., 1999.

15. Masoud, O., Papanikolopoulos, N.P., and Kwon, E., "The use of computer vision in monitoring weaving sections," IEEE Trans. Intelligent Transportation Systems, vol. 2, no. 1, March 2001, pp. 18-25.