Introduction
This paper addresses the problem of recovering 3D human pose from
a single monocular image, using a discriminative
bag-of-words approach. In previous work, the visual words are
learned by unsupervised clustering algorithms. They capture the
most common patterns and are good features for coarse-grained
recognition tasks such as object classification. But for tasks
that depend on subtle differences, such as pose estimation, such a
representation may lack the needed discriminative power. In this
paper, we propose to jointly learn the visual words and the pose
regressors in a supervised manner. More specifically, we learn an
individual distance metric for each visual word to optimize the
pose estimation performance. The learned metrics rescale the
visual words to suppress unimportant dimensions such as those
corresponding to background. Another contribution is that we
design an Appearance and Position Context (APC) local
descriptor that achieves both selectivity and invariance while
requiring no background subtraction. We test our approach on both
a quasi-synthetic dataset and a real dataset (HumanEva) to verify
its effectiveness. Our approach is also computationally fast,
thanks to the integral histograms used in APC descriptor
extraction and the fast inference of the pose regressors.
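
To illustrate why integral histograms make descriptor extraction fast, the following is a minimal sketch of the general technique, not the paper's implementation. It assumes each pixel has already been quantized to an integer bin index; the function names `integral_histogram` and `region_histogram`, and the choice of what is being binned, are hypothetical. After one pass over the image, the histogram of any axis-aligned rectangle is obtained from four table lookups, regardless of the rectangle's size.

```python
def integral_histogram(bins, n_bins):
    """Build a cumulative histogram table from a 2D grid of bin indices.

    bins   : list of rows, each a list of ints in [0, n_bins) (assumed input).
    Returns a (H+1) x (W+1) table; entry [y][x] holds the histogram of
    all pixels in rows 0..y-1 and columns 0..x-1.
    """
    h, w = len(bins), len(bins[0])
    ih = [[[0] * n_bins for _ in range(w + 1)] for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            cell = ih[y + 1][x + 1]
            up, left, diag = ih[y][x + 1], ih[y + 1][x], ih[y][x]
            # Standard summed-area recurrence, applied per bin.
            for b in range(n_bins):
                cell[b] = up[b] + left[b] - diag[b]
            # Count the current pixel in its own bin.
            cell[bins[y][x]] += 1
    return ih


def region_histogram(ih, top, left, bottom, right):
    """Histogram of rows top..bottom-1 and columns left..right-1.

    Uses four lookups in the table, independent of the region size.
    """
    a = ih[bottom][right]
    b = ih[top][right]
    c = ih[bottom][left]
    d = ih[top][left]
    return [a[i] - b[i] - c[i] + d[i] for i in range(len(a))]


# Toy 3x3 grid of bin indices with 3 bins (illustrative data only).
grid = [[0, 1, 1],
        [2, 0, 1],
        [1, 1, 2]]
table = integral_histogram(grid, 3)
whole = region_histogram(table, 0, 0, 3, 3)   # entire grid
patch = region_histogram(table, 1, 1, 3, 3)   # bottom-right 2x2 block
```

Because the table is built once per image, histograms for many overlapping local regions can be read off at constant cost each, which is the property a dense local descriptor benefits from.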