Action Detection in Complex Scenes with Spatial and Temporal Ambiguities
This paper studies the problem of human action detection in complex scenes.
Unlike conventional action recognition in well-controlled environments,
action detection in complex scenes suffers from cluttered backgrounds,
heavy crowds, occluded bodies, and spatial-temporal boundary ambiguities
caused by imperfect human detection and tracking.
To overcome such spatial-temporal ambiguities, we treat the candidate
regions of an action as a bag of instances and develop a novel
multiple-instance learning framework, named SMILE-SVM (Simulated annealing
Multiple Instance LEarning Support Vector Machines), to detect human actions
based on imprecise action locations. Our approach works well on the
CMU action database and on a real-world problem: detecting
whether customers in a shopping mall show an intention
to purchase the merchandise on the shelf (even if they didn’t
buy it eventually).
Yuxiao Hu, Liangliang Cao, Fengjun Lv, Shuicheng Yan, Yihong Gong, and Thomas S. Huang
Action Detection in Complex Scenes with Spatial and Temporal Ambiguities
IEEE Proc. Int'l Conf. Computer Vision (ICCV), 2009 [pdf] [bib] [slides]
We develop a multiple-instance learning approach to handle the spatial and temporal ambiguities.
Although we do not know exactly where and when the target action
happens, we can estimate a “bag” covering more than one
potential region and time slice. A bag is either positive (the target
action happens somewhere in the bag) or negative (no action of
interest happens anywhere in the bag). Every positive bag contains at
least one positive instance, while all instances
in a negative bag are non-action instances. This
multiple-instance method makes it possible not only to recognize
the action of interest, but also to locate the position and
time period of the action.
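The bag rule described above can be illustrated with a minimal sketch. The names `Instance`, `bag_score`, and `detect`, the box and frame fields, and the zero threshold are all assumptions made for illustration; they are not taken from the paper. The sketch only shows the standard multiple-instance convention that a bag is scored by its highest-scoring instance, and that this instance gives the estimated location and time slice.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instance:
    """Hypothetical candidate: one spatial region plus one time slice."""
    box: Tuple[int, int, int, int]  # (x, y, width, height), assumed layout
    frames: Tuple[int, int]         # (start_frame, end_frame), assumed layout
    score: float                    # classifier output; > 0 is read as "action"


def bag_score(bag: List[Instance]) -> float:
    """Score a bag by its best instance (the usual MIL max rule)."""
    return max(inst.score for inst in bag)


def detect(bag: List[Instance], threshold: float = 0.0) -> Optional[Instance]:
    """Return the best instance if the bag is positive, otherwise None.

    A bag is called positive when at least one instance exceeds the
    threshold; that instance localizes the action in space and time.
    """
    best = max(bag, key=lambda inst: inst.score)
    return best if best.score > threshold else None
```

For a bag whose instances all score below the threshold, `detect` returns `None`, which matches the requirement that every instance in a negative bag is a non-action instance.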
To avoid being trapped in local minima caused by the unbalanced
data during the MIL iterations, simulated annealing
(SA) is introduced to search for the global optimum in the
learning process. We call the proposed algorithm Simulated
annealing Multiple Instance LEarning (SMILE).
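The role of simulated annealing can be sketched as a search over which instance is selected from each positive bag. This is not the SMILE-SVM implementation: the function `anneal_selection`, its parameters, the geometric cooling schedule, and the toy energy are assumptions chosen for illustration. In a real system the energy would presumably come from a classifier trained on the current selection; here it is any user-supplied function where lower is better.

```python
import math
import random
from typing import Callable, List, Sequence


def anneal_selection(
    bag_sizes: Sequence[int],
    energy: Callable[[List[int]], float],
    t_start: float = 1.0,
    t_end: float = 1e-3,
    cooling: float = 0.95,
    steps_per_temp: int = 50,
    seed: int = 0,
) -> List[int]:
    """Pick one instance index per bag by simulated annealing.

    Worse selections are accepted with probability exp(-delta / t)
    (Metropolis rule), so the search can leave local minima while the
    temperature t is still high.
    """
    rng = random.Random(seed)
    current = [rng.randrange(n) for n in bag_sizes]
    current_e = energy(current)
    best, best_e = list(current), current_e
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temp):
            # Propose re-selecting the instance of one random bag.
            b = rng.randrange(len(bag_sizes))
            candidate = list(current)
            candidate[b] = rng.randrange(bag_sizes[b])
            cand_e = energy(candidate)
            delta = cand_e - current_e
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_e = candidate, cand_e
                if current_e < best_e:
                    best, best_e = list(current), current_e
        t *= cooling  # geometric cooling (an assumed schedule)
    return best


# Toy stand-in for a learned objective: 1-D instance features, where the
# "true" action instances lie near 2.0 and the rest near 0.
toy_bags = [[0.1, 0.2, 2.0], [1.9, -0.3], [0.0, 2.1, 0.4]]


def toy_energy(selection: List[int]) -> float:
    return sum((bag[i] - 2.0) ** 2 for bag, i in zip(toy_bags, selection))


selected = anneal_selection([len(bag) for bag in toy_bags], toy_energy)
```

The best selection seen so far is tracked separately from the current state, so a late uphill move cannot discard a good solution.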
SMILE-SVM algorithm
Example of multiple instances
Experimental Results on Ke Yan's CMU dataset
Experimental Results on retail store dataset
Overview
The dataset
- Real video from a retail store in Tokyo
- 1 hour long, 20 minutes for training
- ~150 positive bags, ~50 for training
- ~75k positive instances, ~25k for training
- ~382 negative bags randomly selected from non-action tracking trajectories
- ~113k negative instances, ~34k for training
Example of detection:
Note: The dashed box depicts an event bag, in which the positive instances are bounded by red rectangles.
The distribution of multiple instance learning scores:
The work of Y. Hu was done in part during his internship with NEC
Laboratories America, Inc.
Cao and Huang are partially supported by the U.S.
Government VACE Program. Yan is partially supported by
NRF/IDM grant NRF2008IDM-IDM004-029.
- E. Aarts. Simulated Annealing: Theory and Applications. Springer, 1987.
- S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for
multiple-instance learning. NIPS, 2002.
- J. Davis and A. Bobick. The representation and recognition of human movement
using temporal templates. CVPR, 1997.
- Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos.
ICCV, 2007.