
Pedestrian Detection: A Benchmark

Piotr Dollár¹    Christian Wojek²    Bernt Schiele²    Pietro Perona¹

¹ Dept. of Electrical Engineering, California Institute of Technology, {pdollar,perona}@caltech.edu
² Dept. of Computer Science, TU Darmstadt, {wojek,schiele}@cs.tu-darmstadt.de

Abstract
Pedestrian detection is a key problem in computer vision, with several applications including robotics, surveillance and automotive safety. Much of the progress of the past few years has been driven by the availability of challenging public datasets. To continue the rapid rate of innovation, we introduce the Caltech Pedestrian Dataset, which is two orders of magnitude larger than existing datasets. The dataset contains richly annotated video, recorded from a moving vehicle, with challenging images of low resolution and frequently occluded people. We propose improved evaluation metrics, demonstrating that commonly used per-window measures are flawed and can fail to predict performance on full images. We also benchmark several promising detection systems, providing an overview of state-of-the-art performance and a direct, unbiased comparison of existing methods. Finally, by analyzing common failure cases, we help identify future research directions for the field.

Figure 1. Example images (cropped) and annotations. The solid green boxes denote the full pedestrian extent while the dashed yellow boxes denote the visible regions. The Caltech Pedestrian Dataset, collected from a vehicle driving through regular traffic in an urban environment, consists of 350,000 labeled pedestrian bounding boxes in 250,000 frames.

1. Introduction
Detecting people in images is a problem with a long history [37, 13, 35, 27, 16, 41, 23, 5]; in the past two years there has been a surge of interest in pedestrian detection [6, 9, 11, 18, 20, 21, 25, 30, 32, 33, 36, 38, 42]. Accurate pedestrian detection would have immediate and far-reaching impact on applications such as surveillance, robotics, assistive technology for the visually impaired, content-based indexing (e.g. Flickr, Google, movies), advanced human-machine interfaces and automotive safety, among others. Automotive applications [12, 14, 34] are particularly compelling as they have the potential to save numerous lives [39]. Publicly available benchmarks, the most popular of which is the INRIA dataset [5], have contributed to spurring interest and progress in this area of machine vision. However, as algorithm performance improves, more challenging datasets are necessary to continue the rapid pace of progress and to inspire novel ideas. Existing pedestrian datasets often contain a limited range of scale, occlusion and pose variation, and are fairly small, making it difficult to assess real world performance (see Sec. 2.4).

As we will demonstrate, the established methodology of evaluating pedestrian detectors, which uses per-window measures of performance, is flawed and can fail to predict actual per-image performance.

Our contribution is fourfold. (1) We introduce the Caltech Pedestrian Dataset¹, which is two orders of magnitude larger than any existing dataset. The pedestrians vary widely in appearance, pose and scale; furthermore, occlusion information is annotated (see Fig. 1). These statistics are more representative of real world applications and allow for in-depth analysis of existing algorithms. (2) We propose improved performance metrics. (3) We benchmark seven algorithms [40, 5, 7, 30, 11, 42, 21], either obtained directly from the original authors or reimplemented in-house. (4) We highlight situations of practical interest under which existing methods fail and identify future research directions.

We introduce the Caltech Pedestrian Dataset and describe its statistics in Sec. 2. In Sec. 3, we discuss the pitfalls of per-window metrics and describe our evaluation methodology, based on the PASCAL criteria [28]. In Sec. 4 we report a detailed performance evaluation of seven promising methods for pedestrian detection. We summarize our findings and discuss open problems in Sec. 5.
¹ www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/

2. Dataset

Challenging datasets are catalysts for progress in computer vision. The Berkeley Segmentation Dataset [22], the Barron et al. [3] and Middlebury [2] optical flow datasets, the Middlebury Stereo Dataset [31] and the Caltech 101 object categorization dataset [10] all improved performance evaluation and helped drive innovation in their respective fields. Much in the same way, our goal in introducing the Caltech Pedestrian Dataset is to provide a better benchmark and to help identify conditions under which current detection methods fail, and thus to focus research effort on these difficult cases. This will also serve as a foundation for establishing the requirements for a real world system.

2.1. Images and Ground Truth

We collected approximately 10 hours of 30Hz video (∼10^6 frames) taken from a vehicle driving through regular traffic in an urban environment (camera setup shown in Fig. 2). The CCD video resolution is 640 × 480 and, not unexpectedly, the overall image quality is lower than that of still images of comparable resolution. The video was captured in the greater Los Angeles metropolitan area from neighborhoods chosen for their relatively high concentration of pedestrians: LAX, Santa Monica, Hollywood, Pasadena, and Little Tokyo. The driver was independent from the authors of this study and had instructions to drive normally through areas where pedestrians were frequently present. There are minor variations in the camera position due to repeated mountings of the camera. The video was stabilized to remove effects of the vehicle pitching, primarily to simplify annotation; to perform the stabilization, we implemented a differential camera tracker based on the system described in [45].

Figure 2. Camera setup.

We annotated 250,000 frames (in 137 approximately minute-long segments) for a total of 350,000 labeled bounding boxes and 2300 unique pedestrians. For every frame in which a given pedestrian is visible, labelers drew a tight bounding box (BB) that indicated the full extent of the entire pedestrian; for occluded pedestrians this involves estimating the location of hidden parts. In addition, a second BB was used to delineate the visible region. During an occlusion event, the estimated full BB stays relatively constant while the visible BB may change rapidly. For comparison, in the PASCAL labeling scheme [28] only the visible BB is labeled and occluded pedestrians are marked as 'truncated'. Each sequence of BBs belonging to a given object was assigned one of three labels. Individual pedestrians were labeled 'Person' (∼1900 instances). Large groups of pedestrians for which it would have been tedious or impossible to label individuals were delineated using a single BB and labeled as 'People' (∼300). Finally, the label 'Person?' was assigned when clear identification of a pedestrian was ambiguous or easily mistaken (∼110). Example images with overlaid annotations are shown in Fig. 1.

To make such a large scale labeling effort feasible we created a user-friendly labeling tool, briefly described in Fig. 3. It is designed so that users can efficiently navigate and annotate the video with a minimum amount of labor. The most salient aspect of the labeler is an interactive procedure in which the user labels only a sparse set of frames and the system automatically labels intermediate frames by interpolation (see the sketch below).

Figure 3. Screenshot of the video labeler.
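The interpolation step can be illustrated with a minimal sketch; this is not the actual labeling tool (whose implementation is not described here) and all names below are hypothetical, but it shows the idea of filling in intermediate frames from two labeled keyframes.

```python
# Minimal sketch (not the released tool): interpolate BB labels between keyframes.

def interp_bb(bb0, bb1, t):
    """Linearly interpolate two boxes (x, y, w, h) for t in [0, 1]."""
    return tuple(a + t * (b - a) for a, b in zip(bb0, bb1))

def fill_frames(key0, bb0, key1, bb1):
    """Assign an interpolated BB to every frame between two annotated keyframes."""
    span = float(key1 - key0)
    return {f: interp_bb(bb0, bb1, (f - key0) / span) for f in range(key0, key1 + 1)}

# Example: a pedestrian labeled at frames 100 and 130; frames 101-129 are filled in.
labels = fill_frames(100, (320.0, 180.0, 24.0, 60.0), 130, (300.0, 175.0, 30.0, 75.0))
```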

2.2. Dataset Statistics

A summary of the database is given in Fig. 4. About 50% of the frames contain no pedestrians, while 30% contain two or more; pedestrians are visible for 5s on average. Below, we give a detailed analysis of the distribution of pedestrian scale, occlusion and location.

Figure 4. Database summary: total frames ∼1000K, labeled frames ∼250K, frames with pedestrians ∼132K, # bounding boxes ∼350K, # occluded BBs ∼126K, # unique pedestrians ∼2300, average pedestrian duration ∼5s, labeling effort ∼400h.

Scale: We group pedestrians by their image size (height in pixels) into three scales: near (80 or more pixels), medium (between 30-80 pixels) and far (30 pixels or less). This division into three scales is motivated by the distribution of sizes in the dataset, human performance and automotive system requirements. In Fig. 5(a) we histogram the heights of the 350,000 BBs in our database using logarithmically sized bins. Note that ∼68% of the pedestrians lie in the medium scale, and that the cutoffs for the near/far scales correspond to about ±1 standard deviation from the mean height (in log space). One expects the number of pedestrians to decrease proportionally with their image area; the decrease at the other end, i.e. below 30 pixels, is due to annotators having difficulty identifying small pedestrians reliably.

We chose a camera setup that mirrors expected automotive applications: vertical field of view of 27°, resolution of 640 × 480, and focal length fixed at 7.5mm. Assuming 1.8m tall pedestrians, we can obtain an estimate of the distance to a pedestrian of observed pixel height h: d ≈ 1800/h m. With the vehicle traveling at an urban speed of 55 km/h (∼15 m/s), an 80 pixel person is just 1.5s away, while a 30 pixel person is 4s away (see Fig. 5(b)). Thus detecting near scale pedestrians, where human performance is excellent, may leave insufficient time to alert the driver, while far scale pedestrians, where human performance degrades noticeably, are less relevant. Most pedestrians are observed at the medium scale, human performance is still quite good there, and for automotive and safety applications detection must also occur at this scale. However, most current research targets the near scale, and most current algorithms perform poorly even at the medium scale (see Sec. 4). Thus there is an important mismatch between current research efforts and the requirements of real world systems. We shall use the near/medium/far distinction throughout this work.

Figure 5. (a) Height distribution (near ∼16%, medium ∼68%, far ∼15% of pedestrians); cutoffs for the near/far scales are marked. (b) Distance from camera vs. pixel height at 55 km/h.

Occlusion: Little previous work has been done to quantify detection performance in the presence of occlusion (using real data). As described, occluded pedestrians are annotated with two BBs that denote the visible and full pedestrian extent. In Fig. 6(a) we plot the frequency of occlusion: for each pedestrian we measure the fraction of frames in which the pedestrian was occluded. Note that over 70% of pedestrians are occluded in at least one frame. The distribution has three distinct peaks: pedestrians that are never occluded (29%), occluded in some frames (53%) and occluded in all frames (19%). For each occluded pedestrian, we can compute the fraction of occluded area as one minus the fraction of visible area over total area (calculated from the visible and full BBs). Aggregating, we obtain the histogram in Fig. 6(b). Over 80% occlusion typically indicates full occlusion, while 0% is used to indicate that a BB could not represent the extent of the visible region (e.g. due to a diagonal occluder). The interesting cases occur in between, which we further subdivide into partial occlusion (1-35% area occluded) and heavy occlusion (35-80% occluded). Finally, in Fig. 6(c) we display a heat map that indicates which regions of a pedestrian were most likely to be occluded (obtained by averaging the occlusion masks). There is a strong bias for the lower portion of the pedestrian to be occluded and for the top portion to be visible; not unexpectedly, the typical occluder is lower to the ground. This bias contradicts the common assumption that the probability of occlusion is uniform.

Figure 6. Occlusion statistics. (a) Occlusion frequency (never 29%, sometimes 53%, always 19%). (b) Occlusion amount (invalid 10%, partial 9%, heavy 38%, full 42%). (c) Occlusion mask; see text for details.

Position: Viewpoint and ground plane geometry (Fig. 2) constrain pedestrians to appear only in certain regions of the image. We compute the expected center position (over the 350,000 BBs) and plot the resulting heat map in Fig. 7(a). As can be seen, pedestrians are typically located in a narrow band running horizontally across the center of the image (the y-coordinate varies somewhat with distance/height). The heat maps are log-normalized, meaning pedestrian location is even more concentrated than immediately apparent. Note that the same constraints are not valid when photographing a scene from arbitrary viewpoints, e.g. in the INRIA dataset. In Fig. 7(b) we show a heat map obtained by using BBs generated by the HOG [5] pedestrian detector with a low threshold. About half of the detections, including both true and false positives, occur in the same band as the true positives; moreover, many objects, not just pedestrians, tend to be concentrated in this same region. Thus incorporating this constraint would considerably speed up detection but would only moderately improve performance.

Figure 7. Expected center location of pedestrian BBs for (a) ground truth and (b) HOG detections. The heat maps are log-normalized.
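The expected-center heat map described under Position can be illustrated with a minimal sketch. This is not the authors' code; it assumes each BB is given as (x, y, w, h) in pixel coordinates and simply accumulates centers on a 480 × 640 grid before log-normalizing.

```python
import numpy as np

def center_heatmap(bbs, height=480, width=640):
    """Accumulate BB centers (bbs given as (x, y, w, h)) into a log-normalized heat map."""
    heat = np.zeros((height, width), dtype=np.float64)
    for x, y, w, h in bbs:
        cx = int(round(x + w / 2.0))
        cy = int(round(y + h / 2.0))
        if 0 <= cx < width and 0 <= cy < height:
            heat[cy, cx] += 1.0
    return np.log(heat + 1.0)  # log-normalize so sparsely populated locations remain visible
```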

2.3. Comparison to Existing Datasets

Existing datasets may be grouped into two types: (1) 'person' datasets containing people in unconstrained pose in a wide range of domains, and (2) 'pedestrian' datasets containing upright people (standing or walking), typically viewed from more restricted viewpoints but often containing motion information and more complete labeling. The most widely used 'person' datasets include subsets of the MIT LabelMe data [29] and the PASCAL VOC datasets [28]. In this work we focus primarily on pedestrian detection, which is more relevant to certain applications including surveillance, robotics and automotive safety.

Table 1 provides a detailed overview of existing pedestrian datasets. The first six columns indicate the amount of training/testing data in each dataset, with 'k' used to denote thousands (1k = 10^3). The columns are: number of unique pedestrian BBs (not counting reflections, shifts, etc.), number of images containing no pedestrians (a † indicates cropped negative BBs only), and number of images containing at least one pedestrian. The next three columns give the 10th percentile, median and 90th percentile pixel heights of the pedestrians, showing the range of scales found in each dataset. The final columns summarize additional properties of each dataset. Note that the proposed dataset is two orders of magnitude larger than existing datasets.

Table 1. Comparison of pedestrian datasets: MIT [27], USC-A [43], USC-B [43], USC-C [44], CVC [14], TUD-det [1], INRIA [5], DC [24], ETH [8] and the proposed Caltech dataset, compared in terms of training and testing data (# pedestrians, # negative images, # positive images), pedestrian pixel heights (10% quantile, median, 90% quantile) and additional properties (color images, per-image evaluation, no selection bias, video sequences, temporal correspondence, occlusion labels).

Pedestrians can be labeled in photographs [5], surveillance video [26], and images taken from a mobile recording setup, such as a robot or vehicle [8]. Datasets gathered from photographs suffer from selection bias, as photographs must be manually chosen to contain only upright people and negative images are chosen according to arbitrary criteria, while surveillance videos have restricted backgrounds. Datasets collected with a mobile recording setup largely eliminate selection bias. Furthermore, our dataset was not built to demonstrate the effectiveness of a particular method and thus provides for an impartial, challenging test bed.

The INRIA dataset [5] has helped drive recent advances in pedestrian detection and remains the most widely used; however, it is biased toward large, mostly unoccluded pedestrians. The other most relevant datasets are the DaimlerChrysler (DC) [24] and ETH [8] datasets. The DC data, also captured in an urban setting, contains only very small, cropped pedestrians. The ETH data, captured using a pair of cameras attached to a stroller, has reasonable scale variation and a significant amount of labeled data; however, occlusions are not annotated and each frame is labeled independently.

We conclude by summarizing the most novel and important aspects of the Caltech Pedestrian Dataset. It includes O(10^5) pedestrian BBs labeled in O(10^5) frames, two orders of magnitude more than any other dataset. The dataset includes color video sequences and contains pedestrians with a large range of scales and more pose variability than typical pedestrian datasets. Finally, as far as we know, this is the first dataset with temporal correspondence between BBs and detailed occlusion labels.

2.4. Training and Testing Data

We split the database into training/testing data and specify our evaluation methodology. Our data was captured over 11 sessions, each filmed in one of 5 city neighborhoods. We divide the data roughly in half, setting aside 6 sessions for training (0-5) and 5 for testing (sessions 6-10). For detailed statistics about the amount of training/testing data see the bottom row of Table 1. We are not releasing the test data (sessions 6-10) at this time; instead we ask authors to submit final, trained classifiers which we shall proceed to evaluate. Our aim is to help prevent overfitting and to extend the dataset's lifespan; in addition, it ensures that all algorithms are evaluated in precisely the same manner. To allow different research groups to compare their methods directly, we urge authors to adhere to one of the three training/testing scenarios described below:

• Scenario-A: Train on any external data, test on sessions 6-10.

• Scenario-B: Perform 6-fold cross validation using sessions 0-5. In each phase use 5 sessions for training and the 6th for testing, then merge results on the validation sets and report performance on the entire training set (sessions 0-5); a sketch of this procedure is given below. Scenario-B allows authors to compare to other groups prior to having us evaluate on the full test set under Scenario-C.

• Scenario-C: Train using sessions 0-5, test on sessions 6-10. Authors are encouraged to re-train their systems on our larger training set.

In this paper we focus on evaluating existing, pre-trained pedestrian detectors; the results reported here use Scenario-A, as it allows for a broad survey of existing methods without any retraining.
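A minimal sketch of the Scenario-B protocol follows. The helper names train_detector and evaluate are placeholders (no such code is released with the dataset); the point is only the session-wise hold-out structure.

```python
def cross_validate(sessions, train_detector, evaluate):
    """6-fold cross validation over training sessions 0-5 (Scenario-B).
    train_detector and evaluate are user-supplied placeholder functions."""
    results = []
    for held_out in sessions:               # e.g. sessions = [0, 1, 2, 3, 4, 5]
        train = [s for s in sessions if s != held_out]
        detector = train_detector(train)    # train on the other five sessions
        results.append(evaluate(detector, held_out))  # test on the held-out session
    return results  # merge these to report performance over all of sessions 0-5
```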

3. Evaluation Methodology

The established methodology for evaluating pedestrian detectors is flawed. Most authors compare per-window performance, as opposed to the per-image measures frequently used in object detection [28]. Dalal & Triggs suggest evaluating a detector by classifying cropped windows centered on pedestrians against windows sampled at a fixed density from images without pedestrians; this is the accepted methodology for the INRIA dataset [5]. Typically results are reported on cropped positives, but the INRIA dataset also contains full images with the same pedestrians within their original context. The typical assumption is that better per-window scores will lead to better performance on entire images; however, in practice per-window performance can fail to predict per-image performance (see Fig. 8).

There may be a number of explanations. In practice, a per-window detector is densely scanned across an image and nearby detections merged, e.g. using non-maximal suppression (NMS). Per-window evaluation does not measure errors caused by detections at incorrect scales or positions or arising from false detections on body parts, nor does it take into account the effect of NMS (which can reduce false positives at varying rates for different methods). Detectors may require different sampling schemes [36], particularly those that are somewhat invariant to changes in position and scale; furthermore, there can be complex interactions between sampling density and NMS, especially in crowded scenes [32]. Together, these factors make evaluating a classifier independently of the overall detection scheme difficult. Perhaps the biggest pitfall of the per-window scheme pertains to the use of cropped positives and uncropped negatives for training and testing: classifiers may exploit window boundary effects as discriminative features, leading to good per-window performance but poor per-image performance. We observed this in two of the algorithms evaluated [30, 21]². Finally, not all detection systems are based on sliding windows [19, 17], and per-window evaluation of such systems is impossible.

Figure 8. Results on the INRIA dataset (each algorithm is described in more detail in Sec. 4): (a) INRIA per-window results; (b) INRIA per-image results. The per-window results are reproduced from the original publications, when available (the VJ curve is extracted from [5]). We computed the per-image results using the 288 full images (each containing at least one pedestrian) and the methodology described in Sec. 3.1. Note the reordering of the classification performance between the per-window and per-image results.

² Both groups have acknowledged this; for both algorithms we evaluate updated, corrected versions. For details see the advisory posted at Mori's website: www.cs.sfu.ca/~mori/research/papers/sabzmeydani_shapelet_cvpr07.html.

3.1. Per-image evaluation

We perform single frame evaluation using a modified version of the scheme laid out in the PASCAL object detection challenges [28]. A detection system needs to take in an image and return a BB and a score or confidence for each detection. The system should perform multiscale detection and any necessary NMS or other post processing; evaluation is performed on the final generated list of detected BBs.

A detected BB (BB_dt) and a ground truth BB (BB_gt) form a potential match if their areas overlap sufficiently. Specifically, we employ the PASCAL measure, which states that the area of overlap a_o must exceed 50%:

    a_o = area(BB_dt ∩ BB_gt) / area(BB_dt ∪ BB_gt) > 0.5    (1)

The threshold of 50% is arbitrary but reasonable. Each BB_dt and BB_gt may be matched at most once. We resolve any assignment ambiguity by matching detections with highest confidence first; in rare cases this assignment may be suboptimal, but in practice the effect should be negligible. Unmatched BB_dt count as false positives and unmatched BB_gt as false negatives (a sketch of the matching procedure is given at the end of this section). To compare methods we plot miss rate against false positives per-image by varying the threshold on detection confidence. This is preferred to precision-recall curves for certain tasks, e.g. automotive applications, as typically there is an upper limit on the acceptable false positives per-image rate independent of pedestrian density.

To evaluate performance on different subsets of the ground truth, we introduce the notion of ignore regions. Ground truth BBs selected to be ignored, denoted BB_ig, need not be matched; however, matches are not considered mistakes either. Matching proceeds as before, except BB_dt matched to BB_ig do not count as true positives, and unmatched BB_ig do not count as false negatives (matches to BB_gt are therefore preferred). Note that setting a BB to ignore is not the same as removing that BB from the ground truth; in the latter case detections in the ignore regions would count as false positives. E.g., to evaluate performance on unoccluded pedestrians, we set all BBs that contain occluded pedestrians to ignore; detections in these regions then do not affect performance. Four types of ground truth are always set to ignore: any BB_gt under 20 pixels high, BBs near image borders (to avoid boundary effects), those containing a 'Person?' (difficult or ambiguous cases), and those containing 'People'. In addition, each 'People' BB is broken down into multiple overlapping BB_ig, each having the same height as the 'People' BB.

We conclude by listing additional details. Some detectors output BBs with padding around the pedestrian (e.g. HOG outputs 128 × 64 BBs around 96 pixel tall people); such BBs are cropped appropriately. Methods usually detect pedestrians at some minimum size; to coax smaller detections we upscale the input images.
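A minimal sketch of the matching rule described above follows. It assumes BBs given as (x, y, w, h); the helper names are illustrative and not part of any released evaluation code. It computes the PASCAL overlap of Eq. 1, processes detections in order of decreasing confidence, prefers matches to regular ground truth over ignore regions, and counts unmatched detections and unmatched regular ground truth as false positives and false negatives.

```python
def overlap(a, b):
    """PASCAL overlap (Eq. 1) between two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def match(dets, gts, igs, thr=0.5):
    """Greedy matching by decreasing confidence; dets are (bb, score) pairs.
    Returns (#true positives, #false positives, #false negatives)."""
    dets = sorted(dets, key=lambda d: -d[1])
    used_gt, used_ig = set(), set()
    tp = fp = 0
    for bb, _ in dets:
        # matches to regular ground truth are preferred over ignore regions
        cand = [i for i in range(len(gts)) if i not in used_gt and overlap(bb, gts[i]) > thr]
        if cand:
            used_gt.add(max(cand, key=lambda i: overlap(bb, gts[i])))
            tp += 1
            continue
        k = next((i for i in range(len(igs)) if i not in used_ig and overlap(bb, igs[i]) > thr), None)
        if k is not None:
            used_ig.add(k)  # ignore-region match: neither a true positive nor a false positive
            continue
        fp += 1
    fn = len(gts) - len(used_gt)
    return tp, fp, fn
```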

4. Evaluation Results

To measure performance we evaluated seven promising pedestrian detectors (Table 2). All approaches use sliding windows and NMS (all except LatSvm use kernel density estimation for NMS, as proposed in [4]), and all use variants of HOG or Haar features and are trained with variations of boosting or SVMs. We obtained the detectors directly from their authors; the only exceptions were the VJ and Shapelet detectors, which were reimplemented in [42] (these outperform the OpenCV VJ code and the original Shapelet code, see Fig. 8(b)). The tested implementations of Shapelet and HikSvm have been corrected so they no longer overfit to boundary effects (see Sec. 3). LatSvm was trained on the PASCAL dataset, the others on the INRIA pedestrian dataset. Only LatSvm and MultiFtr reported results using per-image measures; the rest of the algorithms were originally evaluated using per-window measures.

Table 2. Overview of tested algorithms: VJ [40] (Haar features, AdaBoost, '04), HOG [5] (HOG features, linear SVM, '05), FtrMine [7] (generalized Haar features, AdaBoost, '07), Shapelet [30] (gradient features, AdaBoost, '07), MultiFtr [42] (HOG+Haar features, AdaBoost, '08), LatSvm [11] (HOG features, latent SVM, '08) and HikSvm [21] (HOG-like features, HIK SVM, '08). For each detector the table also lists whether the original implementation was used, whether it was trained on INRIA data and evaluated per-image, the model height (in pixels) used for training, the scale stride used for testing, and the runtime (sec / frame) per 640 × 480 image.

We focus on evaluating existing, pre-trained pedestrian detectors (Scenario-A described in Sec. 2.4). We use the evaluation methodology outlined in Sec. 3.1, plotting miss rate versus false positives per-image (FPPI) in log-log scale (lower curves indicate better performance), and we use the miss rate at 1 FPPI as a common reference point to compare results (note that on average there are 1.4 pedestrians per image; a sketch of this computation is given at the end of this section). For ground truth, the full BB is always used for matching, not the visible BB. Due to time and memory constraints, we were unable to run HikSvm on upscaled images; this adversely affects HikSvm's overall performance as small pedestrians are not detected. Due to time constraints, all reported results are computed using every 30th frame in the test data. To understand where the methods fail we examine performance on varying subsets of the data.

Overall: We begin by plotting results on the entire dataset in Fig. 9(a). MultiFtr achieves the best relative performance, but absolute performance is quite poor, with a 72% miss rate at 1 FPPI. VJ and Shapelet perform poorly.

Scale: As discussed in Sec. 2.2, we group pedestrians according to their pixel height into the near (80 or more pixels), medium (30-80 pixels) and far (30 pixels or less) scales. Results for each scale, on unoccluded pedestrians only, are shown in Fig. 9(d)-9(f). HOG performs best on near, unoccluded pedestrians (typical errors are shown in Fig. 10), with a miss rate under 40% at 1 FPPI. At the medium scale, which contains over 68% of the annotated pedestrians, MultiFtr outperforms the remaining methods, leading to a maximum recall of slightly under 30% at 1 FPPI, with HOG as a close second; results for HikSvm are artificially depressed since small people are not detected. At the far scale performance is rather abysmal: none of the algorithms is able to achieve more than 8% recall at 1 FPPI.

Occlusion: The impact of occlusion on detecting pedestrians with a minimum height of 50 pixels is shown in Fig. 9(g)-9(i). As described in Sec. 2.2, we classify pedestrians as unoccluded, partially occluded (1-35% area occluded) and heavily occluded (35-80% area occluded). For unoccluded, 50 pixel or taller pedestrians, LatSvm and especially HOG perform best. Performance drops significantly even under partial occlusion, with the best miss rate at 1 FPPI, achieved by HOG, over 80%; note that LatSvm, which is part-based, degrades least, while purely gradient based detectors such as HikSvm lose accuracy even for partially occluded pedestrians. For heavy occlusion the situation becomes worse, with maximum recall dropping to 7% at 1 FPPI.

Aspect ratio: The mean aspect ratio of BBs in the proposed dataset is about 0.43; atypical aspect ratios (outside of one standard deviation) frequently correspond to variations in viewpoint and pose. Results split according to aspect ratio, computed for unoccluded, 50 pixel or taller pedestrians, are shown in Fig. 9(b) and 9(c). Performance clearly degrades for atypical aspect ratios, from a maximum recall of about 56% at 1 FPPI on typical aspect ratios to about 40% recall achieved by both HOG and MultiFtr; however, the impact is not as severe as for occlusion and scale.

Summary: HOG, MultiFtr and FtrMine tend to outperform the other methods surveyed. MultiFtr ties or outperforms HOG on more difficult cases (smaller scales, atypical aspect ratios), and as these comprise the bulk of the dataset, MultiFtr achieves a slightly higher overall performance. LatSvm likely suffers from being trained on the PASCAL dataset; performance should improve somewhat upon retraining. However, absolute performance in these cases is still poor.

Figure 10. Selected HOG false negatives (left) and high confidence false positives (right) for near scale unoccluded pedestrians.
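A minimal sketch of how a miss rate versus FPPI curve, and the miss rate at 1 FPPI reference point, can be computed from per-image matching totals follows. The helper eval_at is a placeholder (e.g. built on the matching sketch above); this is not the released evaluation code.

```python
import numpy as np

def fppi_miss_curve(eval_at, thresholds, n_images):
    """eval_at(thr) should return total (tp, fp, fn) over all test images
    when only detections with confidence >= thr are kept (placeholder helper)."""
    pts = []
    for thr in thresholds:
        tp, fp, fn = eval_at(thr)
        fppi = fp / float(n_images)
        miss = fn / float(tp + fn) if (tp + fn) > 0 else 0.0
        pts.append((fppi, miss))
    return sorted(pts)  # sorted by FPPI for plotting and interpolation

def miss_rate_at_1_fppi(pts):
    """Common reference point: interpolate the miss rate at 1 false positive per image."""
    fppis = [p[0] for p in pts]
    misses = [p[1] for p in pts]
    return float(np.interp(1.0, fppis, misses))
```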

Figure 9. Miss rate versus false positives per-image curves shown for various subsets of the data: (a) overall performance on the entire dataset; (b)-(c) performance w.r.t. aspect ratio, for typical and atypical aspect ratios (computed for unoccluded pedestrians 50 pixels or taller); (d)-(f) performance w.r.t. scale, for the near, medium and far scales (computed for unoccluded pedestrians); (g)-(i) performance under no, partial and heavy occlusion (computed for pedestrians 50 pixels or taller). The miss rate at 1 FPPI for each algorithm is shown in the plot legends; lower curves indicate better performance. Due to time and memory constraints, we were unable to run HikSvm on upscaled images; this adversely affects HikSvm's performance on many of the plots shown.

5. Discussion and Future Work

We introduced the large, richly annotated Caltech Pedestrian Dataset for training and evaluating pedestrian detectors and benchmarked a number of promising methods. Although recent literature would suggest otherwise, our analysis shows that HOG remains competitive, especially when properly benchmarked (using per-image metrics). For unoccluded pedestrians over 80 pixels high, HOG achieves 60% recall at 1 FPPI on the proposed dataset (see Fig. 9(d)). This is worse but comparable to the 80% recall at 1 FPPI on the INRIA data on which HOG was trained. Under these conditions performance is reasonable but still below the levels necessary for real world applications.

Under more realistic and challenging conditions, performance degrades rapidly. Note that pedestrians in the medium/far scales represent more than 80% of the data; furthermore, in automotive tasks it is crucial to identify pedestrians early to give ample warning to the driver. Occlusion is likewise common: only 30% of pedestrians remain unoccluded for the entire time they are present. Two under-explored cases thus stand out as being particularly frequent and relevant in the data gathered: pedestrians at lower resolution and under partial occlusion. Yet, as our analysis has shown, these are precisely the tasks for which current methods fail.

Further research addressing detection at smaller scales and of partially occluded pedestrians is crucial. A number of cues should help improve performance at low resolutions and under occlusion. We are planning to extend our benchmark to explore two more issues. The first of these is context, both spatial [9, 15] and temporal [4]. We are also interested in evaluating detectors that utilize features computed over 2-4 frames [41, 4] and also algorithms that integrate information over longer time scales. Discriminative part-based approaches [6, 11] may also provide more robustness to occlusion, although those may be ill-suited for low resolution pedestrians. Of primary importance is to repeat the evaluation of each algorithm after re-training on our dataset (Scenario-C). The database, annotation tool and evaluation code are available on the project website.

Acknowledgments: We would like to thank Eugene Bart, Ryan Gomes and Mohamed Aly for valuable help and feedback, and Irina Bart for her many long hours labeling small pedestrians. This work was partially supported by the Office of Naval Research grant N00014-06-1-0734 and a gift from an automobile manufacturer who wishes to remain anonymous.

References
[1] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In CVPR, 2008.
[2] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, and R. Szeliski. A database and evaluation methodology for optical flow. In ICCV, 2007.
[3] J. Barron, D. Fleet, and S. Beauchemin. Performance of optical flow techniques. IJCV, 12(1):43-77, 1994.
[4] N. Dalal. Finding People in Images and Videos. PhD thesis, Institut National Polytechnique de Grenoble, 2006.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[6] P. Dollár, B. Babenko, S. Belongie, P. Perona, and Z. Tu. Multiple component learning for object detection. In ECCV, 2008.
[7] P. Dollár, Z. Tu, H. Tao, and S. Belongie. Feature mining for image classification. In CVPR, 2007.
[8] A. Ess, B. Leibe, and L. Van Gool. Depth and appearance for mobile scene analysis. In ICCV, 2007.
[9] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. A mobile vision system for robust multi-person tracking. In CVPR, 2008.
[10] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. PAMI, 28(4):594-611, 2006.
[11] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[12] D. M. Gavrila and S. Munder. Multi-cue pedestrian detection and tracking from a moving vehicle. IJCV, pages 41-59, 2007.
[13] D. M. Gavrila and V. Philomin. Real-time object detection for "smart" vehicles. In ICCV, 1999.
[14] D. Gerónimo, A. Sappa, A. López, and D. Ponsa. Adaptive image sampling and windows classification for on-board pedestrian detection. In Inter. Conf. on Computer Vision Systems, 2007.
[15] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. In CVPR, 2006.
[16] S. Ioffe and D. Forsyth. Probabilistic methods for finding people. IJCV, pages 45-68, 2001.
[17] C. Lampert, M. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In CVPR, 2008.
[18] B. Leibe, N. Cornelis, K. Cornelis, and L. Van Gool. Dynamic 3D scene analysis from a moving vehicle. In CVPR, 2007.
[19] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. IJCV, pages 259-289, 2008.
[20] Z. Lin and L. Davis. A pose-invariant descriptor for human detection and segmentation. In ECCV, 2008.
[21] S. Maji, A. Berg, and J. Malik. Classification using intersection kernel SVMs is efficient. In CVPR, 2008.
[22] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 26(5):530-549, 2004.
[23] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. In ECCV, 2004.
[24] S. Munder and D. Gavrila. An experimental study on pedestrian classification. PAMI, pages 1863-1868, 2006.
[25] S. Munder, C. Schnörr, and D. Gavrila. Pedestrian detection and tracking using a mixture of view-based shape-texture models. IEEE Transactions on Intelligent Transportation Systems, 2008.
[26] A. Nghiem, F. Bremond, M. Thonnat, and V. Valentin. ETISEO, performance evaluation for video surveillance systems. In AVSS, 2007.
[27] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 38(1):15-33, 2000.
[28] J. Ponce, T. Berg, M. Everingham, D. Forsyth, M. Hebert, S. Lazebnik, M. Marszałek, C. Schmid, B. Russell, A. Torralba, C. Williams, J. Zhang, and A. Zisserman. Dataset issues in object recognition. In Towards Category-Level Object Rec., pages 29-48. Springer, 2006.
[29] B. Russell, A. Torralba, K. Murphy, and W. Freeman. LabelMe: A database and web-based tool for image annotation. IJCV, 77(1-3):157-173, 2008.
[30] P. Sabzmeydani and G. Mori. Detecting pedestrians by learning shapelet features. In CVPR, 2007.
[31] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47:7-42, 2002.
[32] E. Seemann, M. Fritz, and B. Schiele. Towards robust pedestrian detection in crowded image sequences. In CVPR, 2007.
[33] V. Sharma and J. Davis. Integrating appearance and motion cues for simultaneous detection and segmentation of pedestrians. In ICCV, 2007.
[34] A. Shashua, Y. Gdalyahu, and G. Hayun. Pedestrian detection for driving assistance systems: single-frame classification and system level performance. In Intelligent Vehicles Symposium, 2004.
[35] Y. Song, X. Feng, and P. Perona. Towards detection of human motion. In CVPR, 2000.
[36] D. Tran and D. Forsyth. Configuration estimates improve pedestrian finding. In NIPS, 2007.
[37] T. Tsukiyama and Y. Shirai. Detection of the movements of persons from a sparse sequence of TV images. PR, 18(3-4):207-213, 1985.
[38] O. Tuzel, F. Porikli, and P. Meer. Pedestrian detection via classification on Riemannian manifolds. PAMI, 30(10):1713-1727, 2008.
[39] United Nations Economic Commission for Europe. Statistics of road traffic accidents in Europe and North America, 2005.
[40] P. Viola and M. Jones. Robust real-time object detection. IJCV, 57(2):137-154, 2004.
[41] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. IJCV, 2005.
[42] C. Wojek and B. Schiele. A performance evaluation of single and multi-feature people detection. In DAGM, 2008.
[43] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In ICCV, 2005.
[44] B. Wu and R. Nevatia. Cluster boosted tree classifier for multi-view, multi-pose object detection. In ICCV, 2007.
[45] H. Yang, M. Pollefeys, G. Welch, J.-M. Frahm, and A. Ilie. Differential camera tracking through linearizing the local appearance manifold. In CVPR, 2007.