Action Detection in Complex Scenes with Spatial and Temporal Ambiguities


This paper studies the problem of human action detection in complex scenes. Unlike conventional action recognition in well-controlled environments, action detection in complex scenes suffers from cluttered backgrounds, heavy crowds, occluded bodies, and spatial-temporal boundary ambiguities caused by imperfect human detection and tracking. To overcome such spatial-temporal ambiguities, we introduce a multiple-instance learning framework in which the candidate regions of an action are treated as a bag of instances, and develop a novel algorithm, SMILE-SVM (Simulated annealing Multiple Instance LEarning Support Vector Machines), to detect human actions from imprecise action locations. Our approach works well on the CMU action database and on a real-world problem: detecting whether customers in a shopping mall show an intention to purchase the merchandise on a shelf (even if they did not eventually buy it).

Yuxiao Hu, Liangliang Cao, Fengjun Lv, Shuicheng Yan, Yihong Gong, and Thomas S. Huang
Action Detection in Complex Scenes with Spatial and Temporal Ambiguities
IEEE Proc. Int'l Conf. Computer Vision (ICCV), 2009 [pdf] [bib] [slides]


This work aims to detect actions of interest in complex scenes, e.g., with cluttered backgrounds or partially occluded crowds. The difficulty lies in several aspects:
  • It is very difficult to locate the human body precisely. When cropping an object from a complex scene without human interaction, we often have to tolerate substantial misalignment or occasional drifting.
  • Some real-world actions, such as picking up an object, taking a photo, or pushing an elevator button, happen in a non-repetitive way and last only a short time. For such actions it is hard to decide the start or end point, since human motion is continuous and its speed varies greatly even within the same action category.
These spatial and temporal ambiguities make the action detection task seriously difficult.

We develop a multiple-instance learning (MIL) approach to handle the spatial and temporal ambiguities. Although we do not know exactly where and when the target action happens, we can estimate a "bag" covering more than one potential region and time slice. A bag can be positive (the target action happens somewhere in the bag) or negative (absolutely no action of interest happens in it). A positive bag must contain at least one positive instance, while all instances in a negative bag are non-action instances. This multi-instance formulation not only recognizes the action of interest but also locates the exact position and time period of the action. To avoid the local-minimum traps caused by unbalanced data during the MIL iterations, simulated annealing (SA) is introduced to search for the global optimum in the learning process. We call the proposed algorithm Simulated annealing Multiple Instance LEarning SVM (SMILE-SVM).
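The alternating structure described above (fit a classifier on the current instance labels, then re-select the "witness" instance in each positive bag, with an annealing step to escape local minima) can be sketched as follows. This is a toy illustration, not the paper's implementation: a simple centroid-based linear scorer stands in for the SVM, and the annealing schedule (`t0`, `cool`) and acceptance rule are illustrative assumptions.

```python
import math
import random

def train_scorer(instances, labels):
    """Fit a toy linear scorer (difference of class centroids) in place of
    the SVM used in the paper; score > 0 predicts 'action'."""
    pos = [x for x, y in zip(instances, labels) if y == 1]
    neg = [x for x, y in zip(instances, labels) if y == 0]
    d = len(instances[0])
    mean = lambda pts: [sum(p[i] for p in pts) / len(pts) for i in range(d)]
    mp, mn = mean(pos), mean(neg)
    w = [a - c for a, c in zip(mp, mn)]   # direction from negative toward positive centroid
    bias = -sum(wi * (a + c) / 2 for wi, a, c in zip(w, mp, mn))  # bisecting hyperplane
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + bias

def smile_train(pos_bags, neg_bags, iters=50, t0=1.0, cool=0.9, seed=0):
    """SMILE-style MIL loop: alternate (1) fitting a classifier on the current
    instance labeling and (2) re-selecting the witness instance of each
    positive bag, with simulated-annealing randomness to escape local minima."""
    rng = random.Random(seed)
    neg = [x for bag in neg_bags for x in bag]
    witness = [rng.randrange(len(bag)) for bag in pos_bags]  # initial guesses
    best, best_err, temp = None, float("inf"), t0
    for _ in range(iters):
        X = [bag[w] for bag, w in zip(pos_bags, witness)] + neg
        y = [1] * len(pos_bags) + [0] * len(neg)
        score = train_scorer(X, y)
        # keep the classifier with the lowest training error seen so far
        err = sum((score(x) > 0) != (yi == 1) for x, yi in zip(X, y)) / len(X)
        if err < best_err:
            best, best_err = score, err
        # propose new witnesses: normally the top-scoring instance per bag,
        # but occasionally a random one while the temperature is high
        for i, bag in enumerate(pos_bags):
            cand = max(range(len(bag)), key=lambda j: score(bag[j]))
            if rng.random() < math.exp(-1.0 / max(temp, 1e-9)):
                cand = rng.randrange(len(bag))
            witness[i] = cand
        temp *= cool
    return best
```

In the paper the inner classifier is an SVM; the SA step plays the same role here, accepting non-greedy witness choices early on and becoming greedy as the temperature cools.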
SMILE-SVM algorithm
Example of multiple instances

Experimental Results on Yan Ke's CMU dataset [4]

Experimental Results on retail store dataset
The dataset
  • Real video from a retail store in Tokyo
  • 1 hour long, 20 minutes for training
  • ~150 positive bags, ~50 for training
  • ~75k positive instances, ~25k for training
  • ~382 negative bags randomly selected from non-action tracking trajectories
  • ~113k negative instances, ~34k for training
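Each bag above is derived from a human tracking trajectory: the trajectory is expanded into many candidate spatio-temporal windows, which become the instances of one bag. A minimal sketch of such bag construction (the window length, stride, and spatial offsets are illustrative assumptions, not the paper's settings):

```python
def trajectory_to_bag(track, win=15, stride=5, jitter=(-8, 0, 8)):
    """Turn one tracking trajectory (a list of per-frame detections) into a
    bag of candidate instances. Each instance is a (start_frame, end_frame,
    dx, dy) hypothesis: a temporal window plus a spatial offset of the
    tracked box, so the bag covers the spatial-temporal ambiguity of where
    and when the action actually occurs."""
    bag = []
    for start in range(0, max(1, len(track) - win + 1), stride):
        for dx in jitter:          # horizontal offset of the tracked box
            for dy in jitter:      # vertical offset of the tracked box
                bag.append((start, min(start + win, len(track)), dx, dy))
    return bag
```

A bag built from a trajectory overlapping the true action is labeled positive; bags from trajectories with no action are negative, matching the instance counts listed above.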
Example of detection:
Note: The dashed box depicts an event bag, in which the positive instances are bounded by red rectangles.
The distribution of multiple instance learning scores:

The work of Y. Hu was done in part during his internship with NEC Laboratories America, Inc. Cao and Huang are partially supported by the U.S. Government VACE Program. Yan is partially supported by NRF/IDM grant NRF2008IDM-IDM004-029.

  1. E. Aarts. Simulated Annealing: Theory and Applications. Springer, 1987.
  2. S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. NIPS, 2002.
  3. J. Davis and A. Bobick. The representation and recognition of human movement using temporal templates. CVPR, 1997.
  4. Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. ICCV, 2007.