Learning Hierarchical Poselets for Human Parsing

Yang Wang, Duan Tran, Zicheng Liao, David Forsyth


We consider the problem of human parsing with part-based models. Most previous work on part-based models considers only rigid parts (e.g. torso, head, half limbs) guided by human anatomy. We argue that this representation of parts is not necessarily appropriate for human parsing. In this paper, we introduce hierarchical poselets -- a new representation for human parsing. Hierarchical poselets can be rigid parts, but they can also be parts that cover large portions of the human body (e.g. torso + left arm). In the extreme case, they can be the whole body. We develop a structured model to organize poselets in a hierarchical way (see the figure above) and learn the model parameters in a max-margin framework. We demonstrate the superior performance of our approach on two datasets with aggressive pose variations.

Previous part-based models for human parsing use rigid parts (head, torso, half limbs) as their basic part representation. To train a detector for a particular part, one collects a set of positive patches (Fig 1) and random negative patches, then trains a classifier. Fig 2 shows a visualization of the classifiers trained (using simple binary edge features) for each part in [4]. Intuitively, each part detector looks for rectangular shapes with parallel lines in the image.
Fig 1. Examples of positive patches (from [3]).
Fig 2. Visualization of learned part models (from [4]).
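The training procedure above can be sketched in a few lines. This is a minimal illustration only: it assumes binarized gradient magnitudes as the "binary edge features" and plain logistic regression as the classifier, with synthetic patches standing in for real data -- it is not the actual pipeline of [3] or [4].

```python
# Minimal sketch of rigid-part detector training: positive patches with
# parallel-line structure vs. random negative patches, crude binary edge
# features, and a linear classifier trained by gradient descent.
# All patches, features, and parameters here are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def edge_features(patch, threshold=0.5):
    """Binarize gradient magnitudes into a crude edge map (stand-in feature)."""
    gx = np.abs(np.diff(patch, axis=1, prepend=patch[:, :1]))
    gy = np.abs(np.diff(patch, axis=0, prepend=patch[:1, :]))
    return ((gx + gy) > threshold).astype(float).ravel()

def train_part_detector(pos_patches, neg_patches, epochs=50, lr=0.1):
    """Logistic regression on edge features of positive/negative patches."""
    X = np.stack([edge_features(p) for p in pos_patches + neg_patches])
    y = np.array([1.0] * len(pos_patches) + [0.0] * len(neg_patches))
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)        # gradient step on weights
        b -= lr * np.mean(p - y)                # gradient step on bias
    return w, b

def limb_patch():
    """Synthetic 'half-limb' positive: two parallel vertical lines."""
    patch = rng.uniform(0, 0.2, (12, 12))
    patch[:, 3] = patch[:, 8] = 1.0
    return patch

pos = [limb_patch() for _ in range(40)]
neg = [rng.uniform(0, 1, (12, 12)) for _ in range(40)]  # random clutter
w, b = train_part_detector(pos, neg)
score = edge_features(limb_patch()) @ w + b  # detector response on a limb
```

Note that the random negatives here also contain many accidental edges, which is exactly the failure mode discussed below: a detector tuned to parallel lines must reject clutter that looks locally similar.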

Although the part visualizations in Fig 2 are intuitive, we believe part detectors learned in this way are not effective. The reason is that parallel lines appear almost everywhere in an image (e.g. buildings, windows). Part detectors trained to find the parts shown in Fig 1 are therefore likely to produce many false positives.

The main idea of this work is to use a broader definition of "parts" that goes beyond the traditional rigid parts. In addition to rigid parts, we also define parts covering large pieces of the human body. As shown in Fig 3, large body parts can have very distinctive visual patterns that rarely appear in the background, which makes it easier to build detectors for them.
Fig 3. Examples of large parts (legs) that have distinctive visual patterns which tend not to appear in the background.
Fig 4. Illustration of the hierarchical pose representation. Each vertex corresponds to a part.

In particular, we define 20 parts covering various pieces of human bodies and organize them in a hierarchical model (Fig. 4). We use a structured learning approach to learn the parameters of the model.
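Inference in such a hierarchical model can be done by dynamic programming (max-sum) over the part tree: each part proposes scored candidate placements, and pairwise terms between parent and child enforce consistency. The sketch below illustrates this on a toy two-level hierarchy; the part names, candidate scores, and pairwise term are made-up numbers, not the learned 20-part model or its max-margin weights.

```python
# Toy tree-structured inference: max-sum dynamic programming over a small
# part hierarchy. Every value here is an illustrative stand-in.
# Hierarchy: whole_body -> {upper_body, legs}.
tree = {"whole_body": ["upper_body", "legs"], "upper_body": [], "legs": []}

# Appearance score of each candidate placement of each part (made-up numbers;
# in the real model these come from learned part detectors).
unary = {
    "whole_body": [0.5, 0.2],
    "upper_body": [0.1, 0.9],
    "legs":       [0.7, 0.3],
}

def pairwise(i, j):
    """Toy parent/child compatibility: favor geometrically consistent pairs."""
    return 0.4 if i == j else 0.0

def best_score(part, placement):
    """Max total score of the subtree rooted at `part`, given its placement."""
    total = unary[part][placement]
    for child in tree[part]:
        # Each child independently picks its best placement given the parent.
        total += max(
            pairwise(placement, j) + best_score(child, j)
            for j in range(len(unary[child]))
        )
    return total

# Best overall parse: maximize over placements of the root part.
root_best = max(best_score("whole_body", i)
                for i in range(len(unary["whole_body"])))
```

Because the hierarchy is a tree, this maximization is exact and runs in time linear in the number of parts times the square of the number of candidate placements; max-margin training adjusts the weights that produce the unary and pairwise scores.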

The main properties of our approach are:
Discriminative "parts": our work is based on a new concept of "parts" that goes beyond the traditional rigid parts. Here we consider parts covering a wide range of portions of the human body. These parts have greater discriminative power than rigid part detectors.
Coarse-to-fine granularity: different parts in our model capture features at various levels of detail. Conceptually, this provides an efficient coarse-to-fine search strategy.
Structured hierarchical model: Information of various parts is aggregated via a structured model, which is different from the simple Hough voting scheme in [5].

Datasets
UIUC people dataset
Sport image dataset

Please cite [1] if you use the sport image dataset, and cite [2] if you use the UIUC people dataset. See the README file in each dataset.
References
[1] Learning Hierarchical Poselets for Human Parsing. Y. Wang, D. Tran and Z. Liao, in CVPR 2011
[2] Improved Human Parsing with a Full Relational Model. D. Tran and D. Forsyth, in ECCV 2010
[3] Recovering human body configurations: combining segmentation and recognition. G. Mori, X. Ren, A. Efros and J. Malik, in CVPR 2004
[4] Learning to parse images of articulated bodies. D. Ramanan, in NIPS 2006
[5] Detecting people using mutually consistent poselet activations. L. Bourdev, S. Maji, T. Brox and J. Malik, in ECCV 2010
Acknowledgement
This work was supported in part by NSF under IIS-0803603 and IIS-1029035, and by ONR under N00014-01-1-0890 and N00014-10-1-0934 as part of the MURI program, and by an NSERC postdoctoral fellowship.