
**Articulated Pose Estimation with Flexible Mixtures-of-Parts**

Yi Yang and Deva Ramanan
Dept. of Computer Science, University of California, Irvine

{yyang8,dramanan}@ics.uci.edu

Abstract

We describe a method for human pose estimation in static images based on a novel representation of part models. Notably, we do not use articulated limb parts, but rather capture orientation with a mixture of templates for each part. We describe a general, flexible mixture model for capturing contextual co-occurrence relations between parts, augmenting standard spring models that encode spatial relations. We show that such relations can capture notions of local rigidity. When co-occurrence and spatial relations are tree-structured, our model can be efficiently optimized with dynamic programming. We present experimental results on standard benchmarks for pose estimation that indicate our approach is the state-of-the-art system for pose estimation, outperforming past work by 50% while being orders of magnitude faster.

1. Introduction

We examine the task of human pose estimation in static images. A working technology would immediately impact many key vision tasks such as image understanding and activity recognition. An influential approach is the pictorial structure framework [7, 12], which decomposes the appearance of objects into local part templates, together with geometric constraints on pairs of parts, often visualized as springs. When parts are parameterized by pixel location and orientation, the resulting structure can model articulation. This has been the dominant approach to human pose estimation. In contrast, traditional models for object recognition use parts parameterized solely by location, which simplifies both inference and learning. Such models have been shown to be very successful for object recognition [2, 9]. In this work, we introduce a novel unified representation for both models that produces state-of-the-art results for human pose estimation.

Representations for articulated pose: Full-body pose estimation is difficult because of the many degrees of freedom to be estimated. Moreover, limbs vary greatly in appearance due to changes in clothing and body shape, as well as changes in viewpoint manifested in in-plane orientations and foreshortening. These difficulties complicate inference, since one must typically search images with a large number of rotated and foreshortened templates. We address these problems by introducing a novel but simple representation for modeling a family of affinely-warped templates: a mixture of non-oriented pictorial structures (Fig. 1). We empirically demonstrate that such approximations can outperform explicitly articulated parts because mixture models can capture orientation-specific statistics of background features (Fig. 2).

Representations for objects: Current object recognition systems are built on relatively simple structures encoding mixtures of star models defined over tens of parts [9], or implicitly-defined shape models built on hundreds of parts [19, 2]. In order to model the varied appearance of objects (due to deformation, viewpoint, etc.), we argue that one will need vocabularies of hundreds or thousands of parts, where only a subset are instanced at a time. We augment classic spring models with co-occurrence constraints that favor particular combinations of parts. Such constraints can capture notions of local rigidity; for example, two parts on the same limb should be constrained to have the same orientation state (Fig. 1). We show that one can embed such constraints in a tree relational graph that preserves tractability. An open challenge is that of learning such complex representations from data.

Figure 1: On the left, we show the classic articulated limb model of Marr and Nishihara [20]. In the middle, we show different orientation and foreshortening states of a limb, each of which is evaluated separately in classic articulated body models. On the right, we approximate these transformations with a mixture of non-oriented pictorial structures, in this case tuned to represent near-vertical and near-horizontal limbs.

We demonstrate results on the difficult task of pose estimation, using two standard benchmark datasets [23, 10]. We outperform all published past work on both datasets, reducing error by up to 50%. We do so with a novel but simple representation that is orders of magnitude faster than previous approaches. Our model requires roughly 1 second to process a typical benchmark image, allowing for the possibility of real-time performance with further speedups (such as cascaded or parallelized implementations).

Figure 2: We plot the average HOG feature as a polar histogram over 18 gradient orientation channels, as computed from the entire PASCAL 2010 dataset [6]. We see that on average, images contain more horizontal gradients than vertical gradients, and much stronger horizontal gradients as compared to diagonal gradients. This means that gradient statistics are not orientation invariant. In terms of object detection, we argue that it is easier to find diagonal limbs (as opposed to horizontal ones) because one is less likely to be confused by diagonal background clutter. Articulated limb models obtained by rotating a single template cannot exploit such orientation-specific cues. Notably, our mixture models are tuned to detect parts at particular orientations, and so can exploit such statistics.

2. Related Work

Pose estimation has typically been addressed in the video domain, dating back to the classic model-based approaches of O'Rourke and Badler [22], Hogg [13], and Rohr [25]. Recent work has examined the problem for static images, assuming that such techniques will be needed to initialize video-based articulated trackers. One area of research is the encoding of spatial structure. Tree models are efficient and allow for efficient inference [7], but are plagued by the well-known phenomena of double-counting. Loopy models require approximate inference strategies such as importance sampling [7, 18], loopy belief propagation [28], or iterative approximations [33]. Recent work has suggested that branch and bound algorithms with tree-based lower bounds can globally solve such problems [31, 32]. Another approach to tackling the double-counting phenomena is the use of stronger pose priors, advocated by [17]. However, as warned by [28, 29], such approaches may be more susceptible to overfitting to the statistics of a particular dataset. An alternate family of techniques has explored the tradeoff between generative and discriminative models trained explicitly for pose estimation; approaches include conditional random fields [24] and margin-based or boosted detectors [27, 26, 29]. A final crucial issue is that of feature descriptors. Past work has explored the use of superpixels [21], contours [27, 30], foreground/background color models [23, 10], and gradient descriptors [1, 16, 15]. In terms of object detection, our work is most similar to pictorial structure models that reason about mixtures of parts [5, 7, 9]; we show that our model generalizes such representations in Sec. 3.1. Our model, when instanced as a tree, can be written as a recursive grammar of parts [8]. From this body of work, we conclude that supervision is a key ingredient for learning structured relational models.

3. Model

Let us write I for an image, p_i = (x, y) for the pixel location of part i, and t_i for the mixture component of part i. We write i ∈ {1, ..., K}, p_i ∈ {1, ..., L}, and t_i ∈ {1, ..., T}. We call t_i the "type" of part i. Our motivating examples of types include orientations of a part (e.g., a vertical versus a horizontally oriented hand), but types may span semantic classes (an open versus a closed hand). For notational convenience, we use the lack of a subscript to indicate a set spanned by that subscript (e.g., t = {t_1, ..., t_K}).

Co-occurrence model: To score a configuration of parts, we first define a compatibility function for part types that factors into a sum of local and pairwise scores:

    S(t) = Σ_{i ∈ V} b_i^{t_i} + Σ_{ij ∈ E} b_{ij}^{t_i,t_j}    (1)

The parameter b_i^{t_i} favors particular type assignments for part i, while the pairwise parameter b_{ij}^{t_i,t_j} favors particular co-occurrences of part types. For example, if part types correspond to orientations and parts i and j lie on the same rigid limb, then b_{ij}^{t_i,t_j} would favor consistent orientation assignments. We write G = (V, E) for a K-node relational graph whose edges specify which pairs of parts are constrained to have consistent relations. We can now write the full score associated with a configuration of part types and positions:

    S(I, p, t) = S(t) + Σ_{i ∈ V} w_i^{t_i} · φ(I, p_i) + Σ_{ij ∈ E} w_{ij}^{t_i,t_j} · ψ(p_i − p_j)    (2)

where φ(I, p_i) is a feature vector (e.g., a HOG descriptor [3]) extracted from pixel location p_i in image I. We write ψ(p_i − p_j) = [dx  dx²  dy  dy²]ᵀ, where dx = x_i − x_j and dy = y_i − y_j give the relative location of part i with respect to part j. Notably, this relative location is defined with respect to the pixel grid and not the orientation of part i (as in classic articulated pictorial structures [7]). The first sum in (2) is an appearance model that computes the local score of placing a template w_i^{t_i}, tuned for type t_i, at location p_i. The second pairwise sum can be interpreted as a "switching" spring model that controls the relative placement of parts i and j by switching between a collection of springs; each spring is tailored for a particular pair of types (t_i, t_j), and is parameterized by its rest location and rigidity, which are encoded by w_{ij}^{t_i,t_j}.
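To make the scoring function concrete, the following is a minimal Python sketch that evaluates S(I, p, t) of Eqs. (1)-(2) for one candidate configuration, assuming the type-specific appearance score maps (the convolutions w_i^{t_i} · φ(I, ·)) have already been computed. The names `score_configuration` and `spring_features` are ours, not from any released code.

```python
import numpy as np

def spring_features(pi, pj):
    """psi(p_i - p_j) = [dx, dx^2, dy, dy^2] from Eq. (2)."""
    dx, dy = pi[0] - pj[0], pi[1] - pj[1]
    return np.array([dx, dx * dx, dy, dy * dy], dtype=float)

def score_configuration(p, t, appearance, b, b_pair, w_pair, edges):
    """Total score S(I, p, t) of Eq. (2) for one configuration.

    p:          list of (x, y) pixel locations, one per part
    t:          list of type indices, one per part
    appearance: appearance[i][t_i][y, x] = w_i^{t_i} . phi(I, p_i)
    b:          b[i][t_i], local type biases of Eq. (1)
    b_pair:     b_pair[(i, j)][t_i, t_j], co-occurrence biases of Eq. (1)
    w_pair:     w_pair[(i, j)][t_i, t_j], 4-vector spring weights
    edges:      list of (child, parent) index pairs forming the tree E
    """
    s = 0.0
    for i, ((x, y), ti) in enumerate(zip(p, t)):
        # local type bias plus appearance score of template t_i at p_i
        s += b[i][ti] + appearance[i][ti][y, x]
    for (i, j) in edges:
        ti, tj = t[i], t[j]
        # co-occurrence bias plus type-specific spring deformation score
        s += b_pair[(i, j)][ti, tj]
        s += w_pair[(i, j)][ti, tj] @ spring_features(p[i], p[j])
    return s
```

Inference (next section) maximizes exactly this quantity over all p and t rather than evaluating a single configuration.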

3.1. Special cases

We now describe various special cases of our model which have appeared in the literature. One obvious case is T = 1, in which case our model reduces to a standard pictorial structure.

Semantic part models: [5] argue that part appearances should capture semantic classes and not visual classes. Consider a face model with eye and mouth parts. One may want to model different types of eyes (open and closed) and mouths (smiling and frowning). The spatial relationship between the two does not likely depend on their type, but open eyes may tend to co-occur with smiling mouths. This can be obtained as a special case of our model by using a single spring for all types of a particular pair of parts:

    w_{ij}^{t_i,t_j} = w_{ij}    (3)

Mixtures of deformable parts: [9] define a mixture of models, where each model is a star-based pictorial structure. This can be achieved by restricting the co-occurrence model to allow only globally-consistent types:

    b_{ij}^{t_i,t_j} = 0 if t_i = t_j, and −∞ otherwise    (4)

Articulation: In our experiments, we explore a simplified version of (2) with a reduced set of springs. For example, let i be a hand part, j its parent elbow part, and assume part types capture orientation:

    w_{ij}^{t_i,t_j} = w_{ij}^{t_i}    (5)

The above simplification states that the relative location of a part with respect to its parent depends on the part's own type, but not its parent's type. This relational model states that a sideways-oriented hand should tend to lie next to the elbow, while a downward-oriented hand should lie below the elbow, regardless of the orientation of the upper arm. Our articulated model is no more computationally complex than the deformable mixtures of parts in [9], but is considerably more flexible (as we show in our experiments).

4. Inference

Inference corresponds to maximizing S(I, p, t) from (2) over p and t. When the relational graph G = (V, E) is a tree, this can be done efficiently with dynamic programming. Let kids(i) be the set of children of part i in G. We compute the message part i passes to its parent j by the following:

    score_i(t_i, p_i) = b_i^{t_i} + w_i^{t_i} · φ(I, p_i) + Σ_{k ∈ kids(i)} m_k(t_i, p_i)    (6)

    m_i(t_j, p_j) = max_{t_i} [ b_{ij}^{t_i,t_j} + max_{p_i} ( score_i(t_i, p_i) + w_{ij}^{t_i,t_j} · ψ(p_i − p_j) ) ]    (7)

(6) computes the local score of part i, at location p_i, for all possible types t_i, by collecting messages from the children of i. (7) computes, for every location and possible type of part j, the best scoring location and type of its child part i. Once messages are passed to the root part (i = 1), score_1(t_1, p_1) represents the best scoring configuration for each root position and type. One can use these root scores to generate multiple detections in image I by thresholding them and applying non-maximum suppression (NMS). By keeping track of the argmax indices, one can backtrack to find the location and type of each part in each maximal configuration.

Computation: The computationally taxing portion of dynamic programming is (7). One has to loop over L × T possible parent locations and types, and compute a max over L × T possible child locations and types, making the computation O(L²T²) for each part. When ψ(p_i − p_j) is a quadratic function (as is the case for us), the inner maximization in (7) can be computed efficiently for each combination of t_i and t_j in O(L) with a max-convolution or distance transform [7]. Since one has to perform T² distance transforms, message passing reduces to O(LT²) per part. Special cases: model (3) maintains only a single spring per part, so message passing reduces to O(L); models (4) and (5) maintain only T springs per part, reducing message passing to O(LT). In practice, T is small (≤ 6 in our experiments) and the distance transform is quite efficient, so the computation time is dominated by computing the local scores of each type-specific appearance model w_i^{t_i} · φ(I, p_i). Since this score is linear, it can be computed efficiently for all positions p_i by optimized convolution routines.
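As a concrete illustration of the message computation in (7), here is a toy Python sketch for a single child-to-parent message on a small score map. The function name `pass_message` is ours, and the brute-force inner maximization (the O(L²T²) form) stands in for the O(L) generalized distance transform used in practice; it also records the argmax tables needed for backtracking.

```python
import numpy as np

def pass_message(child_score, w_pair, b_pair):
    """Compute m_i(t_j, p_j) of Eq. (7) by brute force, plus argmax tables.

    child_score: (T, H, W) array holding score_i(t_i, p_i) of Eq. (6)
    w_pair:      (T, T, 4) spring weights w_ij^{ti,tj} for [dx, dx^2, dy, dy^2]
    b_pair:      (T, T) co-occurrence biases b_ij^{ti,tj}
    Returns (msg, best): msg[t_j, y_j, x_j] is the message value, and
    best[t_j, y_j, x_j] = (t_i, y_i, x_i) is the maximizing child placement.
    """
    T, H, W = child_score.shape
    msg = np.full((T, H, W), -np.inf)
    best = np.zeros((T, H, W, 3), dtype=int)
    ys, xs = np.mgrid[0:H, 0:W]            # all candidate child locations
    for tj in range(T):
        for yj in range(H):
            for xj in range(W):
                for ti in range(T):
                    dx, dy = xs - xj, ys - yj   # relative child location
                    spring = (w_pair[ti, tj, 0] * dx + w_pair[ti, tj, 1] * dx**2
                              + w_pair[ti, tj, 2] * dy + w_pair[ti, tj, 3] * dy**2)
                    total = b_pair[ti, tj] + child_score[ti] + spring
                    yi, xi = np.unravel_index(np.argmax(total), (H, W))
                    if total[yi, xi] > msg[tj, yj, xj]:
                        msg[tj, yj, xj] = total[yi, xi]
                        best[tj, yj, xj] = (ti, yi, xi)
    return msg, best
```

Running this bottom-up over a tree and adding each message into the parent's local score map reproduces (6); the real speedup comes from replacing the inner two loops with a distance transform per (t_i, t_j) pair.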

5. Learning

We assume a supervised learning paradigm. Given labeled positive examples {I_n, p_n, t_n} and negative examples {I_n}, we define a structured prediction objective function similar to those proposed in [9, 16]. To do so, let us write z_n = (p_n, t_n) and note that the scoring function (2) is linear in the model parameters β = (w, b), and so can be written as S(I, z) = β · Φ(I, z). We would learn a model of the form:

    arg min_{β, ξ_n ≥ 0}  ½ β·β + C Σ_n ξ_n    (8)
    s.t.  ∀n ∈ pos:       β · Φ(I_n, z_n) ≥ 1 − ξ_n
          ∀n ∈ neg, ∀z:   β · Φ(I_n, z) ≤ −1 + ξ_n

The above constraints state that positive examples should score better than 1 (the margin), while negative examples, for all configurations of part positions and types, should score less than −1. The objective function penalizes violations of these constraints using slack variables ξ_n.

Detection vs pose estimation: Traditional structured prediction tasks do not require an explicit negative training set, and instead generate negative constraints from positive examples with mis-estimated labels z. While this translates directly to a pose estimation task, our formulation also includes a "detection" component: it trains a model that scores highly on ground-truth poses, but generates low scores on images without people. We find the above to work well for both pose estimation and person detection.

Optimization: The above optimization is a quadratic program (QP) with an exponential number of constraints, since the space of z is (LT)^K. This form of learning problem is known as a structural SVM, and there exist many well-tuned solvers, such as the cutting-plane solver of SVMStruct [11] and the stochastic gradient descent solver in [9]. Fortunately, only a small minority of the constraints will be active on typical problems (e.g., the support vectors), making them solvable in practice. On our training datasets, the number of positive examples varies from 200-1000 and the number of negative images is roughly 1000; we treat each possible placement of the root on a negative image as a unique negative example x_n, meaning we have millions of negative constraints. Furthermore, we consider models with hundreds of thousands of parameters. We found that a carefully optimized solver was necessary to manage learning at this scale, and obtained good results by implementing our own dual coordinate-descent solver, which we will describe in an upcoming tech report.

5.1. Learning in practice

Most human pose datasets include images with labeled joint positions [23, 10, 2]. We define parts to be located at joints, so these provide part position labels p, but not part type labels t. We now describe a procedure for generating type labels for our articulated model (5). Because we wish to model articulation, we can assume that part types should correspond to different relative locations of a part with respect to its parent in E. For example, sideways-oriented hands occur next to elbows, while downward-facing hands occur below elbows. This means we can use relative location as a supervisory cue to help derive type labels that capture orientation.

Deriving part type from position: Assume that our n-th training image I_n has labeled joint positions p_n. Let p_i^n be the relative position of part i with respect to its parent in image I_n. For each part i, we cluster its relative position over the training set {p_i^n : ∀n} to obtain T clusters, using K-means with K = T. Each cluster corresponds to a collection of part instances with consistent relative locations, and hence consistent orientations by our arguments above. We define the type labels t_i^n for parts based on cluster membership. We show clustering results in Fig. 3. We use the derived type labels to construct a fully supervised dataset, from which we learn flexible mixtures of parts.

Partial supervision: Because part type is derived heuristically above, one could treat t_i^n as a latent variable that is also optimized during learning. This latent SVM problem can be solved by coordinate descent [9] or the CCCP algorithm [34]. We performed some initial experiments with latent updating of part types using the coordinate descent framework of [9], but found that type labels tend not to change over iterations. We leave such partially-supervised learning as interesting future work.

6. Experimental Results

Datasets: We evaluate results using the Image Parse dataset [23] and the Buffy dataset [10]. The Parse set contains 305 pose-annotated images of highly-articulated full-body human poses. The Buffy dataset contains 748 pose-annotated video frames over 5 episodes of a TV show. Both datasets include a standard train/test split and a standardized evaluation protocol based on the probability of a correct pose (PCP), which measures the percentage of correctly localized body parts. Buffy is also distributed with a set of validated detection windows returned by an upper-body person detector run on the testset; most previous work report results on this set, as do we. Since our model also serves as a person detector, we can also present PCP results on the full Buffy testset. To train our models, we use the negative training images from the INRIAPerson database [3] as our negative training set. These images tend to be outdoor scenes that do not contain people.

Models: We define a full-body skeleton for the Parse set, and an upper-body skeleton for the Buffy set. We first manually define the edge structure E by connecting joint positions based on average proximity. We then group parts into orientations based on their relative location with respect to their parents (as described in Sec. 5.1).
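The type-label derivation of Sec. 5.1 can be sketched in a few lines. The helper below is ours: a plain Lloyd's K-means over a part's (dx, dy) offsets from its parent, with a deterministic seeding along the x-offset chosen for this sketch (the paper only specifies K-means with K = T, not an initialization).

```python
import numpy as np

def derive_type_labels(rel_positions, T, iters=20):
    """Cluster a part's relative positions {p_i^n} into T types (Sec. 5.1).

    rel_positions: (N, 2) array of (dx, dy) offsets of one part from its
    parent, one row per training image. Returns per-example type labels and
    the T cluster means, which serve as rest locations for the type-specific
    springs of Eq. (5). The spread-along-x seeding is an implementation
    choice of this sketch, not of the paper.
    """
    order = np.argsort(rel_positions[:, 0])
    seeds = order[np.linspace(0, len(order) - 1, T).astype(int)]
    centers = rel_positions[seeds].astype(float)
    for _ in range(iters):
        # assign each offset to its nearest center, then recompute means
        d = np.linalg.norm(rel_positions[:, None, :] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(T):
            if np.any(labels == k):
                centers[k] = rel_positions[labels == k].mean(axis=0)
    return labels, centers
```

Running this once per part yields the supervised type labels t_i^n used to train the model with the structural SVM of Eq. (8).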

Structure: We consider the effect of varying T (the number of mixtures or types) and K (the number of parts) on the accuracy of pose estimation on the Parse dataset in Fig. 4. We experiment with a 14-part model defined at 14 joint positions (shoulder, elbow, hand, etc.) and a 27-part model where midway points between limbs are added (mid-upper arm, mid-lower arm, etc.) to increase coverage. We set all parts to be 5 × 5 HOG cells in size. For both cases, increasing the number of mixture components improves performance, likely because more orientations can be modeled; increasing the number of parts (by instancing parts at limb midpoints in addition to joints) also improves performance. We saw a slight improvement using a variable number of mixtures (5 or 6) per part, tuned by cross-validation; these are the results presented below. We show the full-body model learned on the Parse dataset in Fig. 5. For reference, we also trained a star model, but saw inferior performance compared to the tree models shown in Fig. 5.

Figure 4: We show the effect of model structure on pose estimation by evaluating PCP performance on the Parse dataset, for the 14-part and 27-part models with 1 to 6 mixture components per part. Performance increases with denser coverage and an increased number of part types.

Parse: We give quantitative results for PCP in Table 2, and show example images in Fig. 6. Notably, all previous work uses articulated parts, but we outperform all previously published results by a significant margin. We refer the reader to the captions for a detailed analysis.

Buffy: We give quantitative results for PCP in Table 3, and show example images in Fig. 7. Again, all previous approaches use articulated parts, but our method outperforms all previously published results, whether evaluated on a subset of standardized windows or on the entire testset. When evaluated on the entire testset, our approach reduces error by 54%. Our algorithm is also several orders of magnitude faster than the next-best approaches of [26, 27].

Detection accuracy: We use our model as an upper-body detector on the Buffy dataset in Table 1. The dataset includes two alternate detectors based on a rigid HOG template and a mixture of star models [9], which perform at 85% and 94%, respectively; the latter, a star-structured model of HOG templates trained with weakly-supervised data, is widely regarded as a state-of-the-art system for object recognition. We correctly detect 99.6% of the people in the testset, reducing error by 25%. These results indicate the potential of our representation and supervised learning framework for general object detection.

Table 1: Upper-body detection accuracy on Buffy. Rigid HOG [10]: 85.1; Mixtures of Def. Parts [9]: 93.8; Us: 99.6. Our model clearly outperforms past approaches for upper-body detection.

Conclusion: We have described a simple but flexible extension of tree-based models to mixtures of parts. Our representation provides a general framework for modeling co-occurrence relations between mixtures of parts, as well as classic spatial relations between the locations of parts. We show that such relations capture notions of local rigidity. When part mixture models correspond to part orientations, our representation can model articulation with greater speed and accuracy than classic approaches. Notably, articulated models are often learned in stages (using pre-trained, orientation-invariant part detectors) due to the computational burden of inference; in contrast, our parts and relations are simultaneously learned in a discriminative framework, which, when learned with supervision, can yield improved results for detection. We believe our high performance is due to the fact that our models leverage orientation-specific statistics (Fig. 2). We are applying this approach to the task of general object detection, but have already demonstrated impressive results for the challenging task of human pose estimation.

Acknowledgements: Funding for this research was provided by NSF Grant 0954083, ONR-MURI Grant N00014-10-1-0933, and support from Google and Intel.

References

[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.
[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, 2009.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886-893, 2005.

Figure 3: We take a "data-driven" approach to orientation-modeling by clustering the relative locations of parts with respect to their parents (shown here for the neck w.r.t. the head, the left elbow w.r.t. the shoulder, the left hand w.r.t. the elbow, the left knee w.r.t. the hip, and the left foot w.r.t. the knee). These clusters are used to generate mixture labels for parts during training.

Figure 5: A visualization of our model for T = 4, trained on the Parse dataset. We show the local templates above, and the tree structure below, obtained by composing different part types together and placing parts at their best-scoring location relative to their parent. Though we visualize 4 trees, there exists an exponential number of realizable combinations; the score associated with each combination decomposes into a tree, and so is efficient to search over (1). For example, heads tend to be upright, and so the associated mixture models focus on upright orientations, while hands articulate to a large degree, so their mixture models are spread apart to capture a larger variety of relative orientations.

[4] M. Eichner and V. Ferrari. Better appearance models for pictorial structures. In BMVC, 2009.
[5] B. Epshtein and S. Ullman. Semantic hierarchies for recognizing objects and parts. In CVPR, 2007.
[6] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303-338, 2010.
[7] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55-79, 2005.
[8] P. Felzenszwalb and D. McAllester. Object detection grammars. Technical report, University of Chicago, 2010.
[9] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE PAMI, 2010.
[10] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In CVPR, June 2008.
[11] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning, pages 304-311, 2008.
[12] M. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, C-22(1):67-92, 1973.
[13] D. Hogg. Model-based vision: a program to see a walking person. Image and Vision Computing, 1(1):5-20, 1983.
[14] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, 2010.
[15] S. Johnson and M. Everingham. Combining discriminative appearance and segmentation cues for articulated human pose estimation. In ICCV Workshops, 2009.
[16] M. Kumar, A. Zisserman, and P. Torr. Efficient discriminative learning of parts-based models. In ICCV, 2009.

[Table 2 body: per-part PCP values on the Image Parse testset for R [23], ARS [1], JEa [15], SNH [29], JEb [14], and Our Model; columns: Torso, Head, Upper legs, Lower legs, Upper arms, Lower arms, Total.]

Table 2: We compare our model to all previous published results on the Parse dataset, using the standard criteria of PCP. We outperform or (nearly) tie all previous results on a per-part basis. Our total performance of 74.9% compares favorably to the best previous result of 66.2%.

[Table 3 body: PCP values on the subset of the Buffy testset (left) and the full Buffy testset (right) for TF [32], EFZ [4], STT [27], SJT [26], and Our Model; columns: Torso, Head, U. arms, L. arms, Total.]

Table 3: The Buffy testset is distributed with a subset of windows detected by a rigid HOG upper-body detector. We compare our results to all published work on this set on the left. We also outperform all previous results on a per-part basis. We obtain the best overall PCP while being orders of magnitude faster than the next-best approaches. Our total pipeline requires 1 second to process an image, while [26, 27] take 5 minutes. The distributed evaluation protocol also allows one to compute performance on the full test videos by multiplying PCP values with the overall detection rate. We do this for published results on the right table. As pointed out by [32], this subset contains little pose variation because it is biased to be responses of a rigid template. Because our model also serves as a very accurate detector (Table 1), we obtain significantly better results than past work when evaluated on the full testset.

[17] X. Lan and D. Huttenlocher. Beyond trees: Common-factor models for 2d human pose recovery. In ICCV, volume 1, pages 470–477. IEEE, 2005.
[18] M. Lee and I. Cohen. Proposal maps driven mcmc for estimating human body pose in static images. In CVPR, volume 2. IEEE, 2004.
[19] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In ECCV04 workshop on statistical learning in computer vision, pages 17–32, 2004.
[20] D. Marr and H. Nishihara. Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London, Series B, Biological Sciences, 200(1140):269–294, 1978.
[21] G. Mori, X. Ren, A. Efros, and J. Malik. Recovering human body configurations: Combining segmentation and recognition. In CVPR. IEEE, 2004.
[22] J. O'Rourke and N. Badler. Model-based image analysis of human motion using constraint propagation. PAMI, 2(6):522–536, 1980.
[23] D. Ramanan. Learning to parse images of articulated bodies. NIPS, 19:1129, 2007.
[24] D. Ramanan and C. Sminchisescu. Training deformable models for localization. In CVPR, volume 1, pages 206–213. IEEE, 2006.
[25] K. Rohr. Towards model-based recognition of human movements in image sequences. CVGIP-Image Understanding, 59(1):94–115, 1994.
[26] B. Sapp, A. Toshev, and B. Taskar. Cascaded models for articulated pose estimation. In ECCV, pages 406–420, 2010.
[27] B. Sapp, C. Jordan, and B. Taskar. Adaptive pose priors for pictorial structures. In CVPR, pages 422–429. IEEE, 2010.
[28] L. Sigal and M. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In CVPR, volume 2, pages 2041–2048. IEEE, 2006.
[29] V. Singh, R. Nevatia, and C. Huang. Efficient inference with multiple heterogenous part detectors for human pose estimation. In ECCV, 2010.
[30] P. Srinivasan and J. Shi. Bottom-up recognition and parsing of the human body. In CVPR. IEEE, 2007.
[31] T. Tian and S. Sclaroff. Fast globally optimal 2D human detection with loopy graph models. In CVPR. IEEE, 2010.
[32] D. Tran and D. Forsyth. Improved human parsing with a full relational model. In ECCV, 2010.
[33] Y. Wang and G. Mori. Multiple tree models for occlusion and spatial constraints in human pose estimation. In ECCV, pages 710–724, 2008.
[34] A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.
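The two evaluation steps used above can be sketched in a few lines: the per-part PCP test (a limb counts as correct when both predicted endpoints fall within half the ground-truth part length of the true endpoints, the common PCP convention), and the subset-to-full-testset conversion for Table 3 (subset PCP scaled by the detector's recall). This is a minimal illustration under those stated conventions, not the authors' evaluation code; the function and variable names are ours.

```python
import math

def pcp_correct(pred, gt, alpha=0.5):
    """A part (limb segment given by its two endpoints) is correct if both
    predicted endpoints lie within alpha * ground-truth part length of the
    corresponding ground-truth endpoints (the usual PCP criterion)."""
    (p1, p2), (g1, g2) = pred, gt
    length = math.dist(g1, g2)
    return math.dist(p1, g1) <= alpha * length and math.dist(p2, g2) <= alpha * length

def full_testset_pcp(subset_pcp, detection_rate):
    """Full-testset number as described in the Table 3 caption: PCP measured
    on the detected subset of windows, multiplied by the overall detection
    rate (frames the upper-body detector misses score zero for every part)."""
    return subset_pcp * detection_rate

# A prediction close to the ground-truth limb counts as correct.
print(pcp_correct(((0, 0), (10, 0)), ((1, 1), (9, 1))))  # -> True
# Subset PCP of 85.9 at a 95% detection rate.
print(full_testset_pcp(85.9, 0.95))
```

Note that this conversion is why a model that doubles as an accurate detector gains so much on the full testset: the multiplier approaches 1.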

Figure 6: Results on the Parse dataset. We show 27 part bounding boxes reported by our algorithm for each image (head, torso, l.leg, r.leg, l.arm, r.arm). The top 3 rows show successful examples, while the bottom row shows failure cases. Examining failure cases from left to right, we find our model is not flexible enough to model horizontal people, suffers from double-counting phenomena common to tree models (both the left and right legs fire on the same image region), is confused by overlapping people, and is confused when objects partially occlude people.

Figure 7: Results on the Buffy dataset. We show 17 part bounding boxes, reported by our algorithm, corresponding to upper body parts. The left 4 columns show successful examples, while the right column shows failure cases. From top to bottom, we see that our model still has difficulty with raised arms and is confused by vertical limb-like clutter in the background.
