Learning Hierarchical Poselets for Human Parsing
Yang Wang, Duan Tran, Zicheng Liao, David Forsyth

Previous part-based models for human parsing use rigid parts (head, torso, half limbs) as their basic part representations. To train a detector for a particular part, one collects a set of positive patches (Fig. 1) and random negative patches, then trains a classifier. Fig. 2 shows a visualization of the trained classifiers (using simple binary edge features) for each part in [4]. Intuitively, each part detector looks for rectangular shapes with parallel lines in the image.
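The training recipe described above (positive patches, random negatives, binary edge features, a classifier) can be sketched as follows. This is only an illustration under stated assumptions, not the detector used in [4]: the patches are synthetic (a bright vertical bar standing in for a limb), the edge feature is a thresholded gradient magnitude, logistic regression is swapped in as the classifier, and all function names are hypothetical.

```python
import numpy as np

def binary_edge_features(patch, threshold=0.25):
    """Flat 0/1 vector marking pixels whose gradient magnitude exceeds threshold."""
    gy, gx = np.gradient(patch.astype(float))
    return (np.hypot(gx, gy) > threshold).astype(float).ravel()

def make_limb_patch(rng, size=16):
    """Synthetic positive (assumption): a bright vertical bar, i.e. two parallel edges."""
    patch = rng.normal(0.0, 0.02, (size, size))
    left = rng.integers(4, 7)          # bar position jitters by a few pixels
    patch[:, left:left + 5] += 1.0
    return patch

def make_random_patch(rng, size=16):
    """Synthetic negative (assumption): low-amplitude noise with no structure."""
    return rng.normal(0.0, 0.02, (size, size))

def train_linear_classifier(X, y, epochs=200, lr=0.1):
    """Logistic regression fitted by batch gradient descent; returns (weights, bias)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        grad = p - y                              # gradient of the log loss w.r.t. the scores
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
pos = [binary_edge_features(make_limb_patch(rng)) for _ in range(100)]
neg = [binary_edge_features(make_random_patch(rng)) for _ in range(100)]
X = np.array(pos + neg)
y = np.array([1.0] * 100 + [0.0] * 100)

w, b = train_linear_classifier(X, y)
predictions = (X @ w + b) > 0
accuracy = np.mean(predictions == (y == 1.0))
print(f"training accuracy: {accuracy:.2f}")

# Reshaping w to the patch size gives a template comparable in spirit to Fig. 2:
# large weights sit on the columns where the bar's two parallel edges fall.
template = w.reshape(16, 16)
```

The learned template responds to any pair of parallel vertical edges, which is why a detector of this kind would also fire on structures that are not limbs.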
Although the part visualization in Fig. 2 is intuitive, we believe part detectors learned in this way are not effective. The reason is that parallel lines appear almost everywhere in an image (e.g. buildings, windows). Part detectors trained to find the parts shown in Fig. 1 are therefore likely to produce many false positives.

The main idea of this work is to use a broader definition of "parts" that goes beyond the traditional rigid parts. In addition to rigid parts, we also define parts covering large pieces of the human body. As shown in Fig. 3, large body parts can have very distinctive visual patterns that rarely appear in the background, which makes it easier to build detectors for those parts.
In particular, we define 20 parts covering various pieces of the human body and organize them in a hierarchical model (Fig. 4). We use a structured learning approach to learn the parameters of the model. The main properties of our approach are:

Discriminative "parts": Our work is based on a new concept of "parts" that goes beyond the traditional rigid parts. We consider parts covering a wide range of portions of the human body. Detectors for these parts are more discriminative than rigid part detectors.

Coarse-to-fine granularity: Different parts in our model capture features at various levels of detail. Conceptually, this provides an efficient coarse-to-fine search strategy.

Structured hierarchical model: Information from the various parts is aggregated via a structured model, which differs from the simple Hough voting scheme in [5].

Datasets

UIUC people dataset
Sport image dataset

Please cite [1] if you use the sport image dataset, and cite [2] if you use the UIUC people dataset. See the README file in each dataset.

References

[1] Learning Hierarchical Poselets for Human Parsing. Y. Wang, D. Tran and Z. Liao, in CVPR 2011
[2] Improved Human Parsing with a Full Relational Model. D. Tran and D. Forsyth, in ECCV 2010
[3] Recovering human body configuration: combining segmentation and recognition. G. Mori, X. Ren, A. Efros and J. Malik, in CVPR 2004
[4] Learning to parse images of articulated bodies. D. Ramanan, in NIPS 2006
[5] Detecting people using mutually consistent poselet activations. L. Bourdev, S. Maji, T. Brox and J. Malik, in ECCV 2010

Acknowledgement

This work was supported in part by NSF under IIS-0803603 and IIS-1029035, by ONR under N00014-01-1-0890 and N00014-10-1-0934 as part of the MURI program, and by an NSERC postdoctoral fellowship.