Hierarchical Space-time Model Enabling Efficient Search for Human Actions

Draft in PDF.


Introduction

We propose a five-layer hierarchical space-time model (HSTM) for representing and searching human actions in videos. In a feature, both invariance and selectivity are desirable, yet the two seem to contradict each other. To make them coexist, we introduce a coarse-to-fine search and verification scheme for action searching based on the HSTM model. Because moving through the layers of the hierarchy corresponds to progressively turning the knob between invariance and selectivity, this strategy enables the search for human actions ranging from the rapid movements of sports to the subtle motions of facial expressions. The Histogram of Gabor Orientations (HIGO) feature lets the search proceed smoothly across the hierarchical layers of the HSTM model. Matching efficiency is enhanced by applying integral histograms to compute the features in the top two layers. The HSTM model was tested on three selected challenging video sequences and on the KTH human action database, where it improved on other state-of-the-art algorithms. These results validate that the HSTM model is both selective and robust for searching human actions.

 

Framework

Our model consists of five layers. Following the naming conventions of [11] and [28], we refer to the top four layers as S1, C1, S2, and C2, as shown in Fig. 2. Note, however, that these layers are defined in 3D space-time. The lowest layer is the original video. The S1 responses are obtained by convolving the original video with a bank of 3D Gabor filters. Pooling over limited ranges in the S1 layer through a max-operation yields the C1 layer. We adopt histogram of Gabor orientations (HIGO) features for the S2 layer and 3D Gabor coefficient histograms for the C2 layer. Using 3D integral histograms significantly reduces the cost of computing the features in the S2 and C2 layers. Because our S2 layer is built from histograms of Gabor orientations, it does not resemble simple or complex cells in the brain. Nevertheless, we decided to retain the conventional name.
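The cost saving from 3D integral histograms can be illustrated with a minimal NumPy sketch. It assumes each voxel has already been quantized to a single bin index (for example, a dominant Gabor orientation); the function names `integral_histogram_3d` and `box_histogram` and the (t, y, x) array layout are illustrative choices, not taken from the paper. Once the cumulative table is built, the histogram of any axis-aligned space-time box costs eight lookups per bin, regardless of the box size.

```python
import numpy as np

def integral_histogram_3d(bin_index, n_bins):
    """Build a 3D integral histogram.

    bin_index: integer array of shape (T, H, W); each voxel holds a bin
               label in [0, n_bins). (Assumed input format.)
    Returns an array of shape (T+1, H+1, W+1, n_bins) whose entry
    [t, y, x] is the histogram of the box [0, t) x [0, y) x [0, x).
    """
    t, h, w = bin_index.shape
    table = np.zeros((t + 1, h + 1, w + 1, n_bins), dtype=np.int64)
    tt, yy, xx = np.indices((t, h, w))
    # One-hot encode every voxel, leaving a zero border at index 0.
    table[tt + 1, yy + 1, xx + 1, bin_index] = 1
    # Cumulative sums along t, y and x turn the one-hot volume into the table.
    return table.cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)

def box_histogram(ih, t0, t1, y0, y1, x0, x1):
    """Histogram of the half-open box [t0, t1) x [y0, y1) x [x0, x1),
    obtained by 3D inclusion-exclusion on the eight box corners."""
    return (ih[t1, y1, x1]
            - ih[t0, y1, x1] - ih[t1, y0, x1] - ih[t1, y1, x0]
            + ih[t0, y0, x1] + ih[t0, y1, x0] + ih[t1, y0, x0]
            - ih[t0, y0, x0])

# Usage on a small random volume of bin labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 8, size=(6, 7, 9))
ih = integral_histogram_3d(labels, n_bins=8)
hist = box_histogram(ih, 1, 5, 2, 6, 3, 8)
```

The table is computed once per video, so an exhaustive scan over all (x, y, t) locations only pays the constant per-box lookup cost.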

 

[Figure: overview2.png]

 

Coarse-to-fine search scheme

The searching procedure starts from the highest layer, where the query video is correlated against the reference video at all locations in (x, y, t) space. The candidate locations are passed to the lower layers for further verification. Verification at lower layers allows for the discrimination of subtle actions, since the features become more selective.
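The control flow of this scheme can be sketched as a cascade. The sketch below is an illustration under stated assumptions, not the authors' implementation: each layer is represented by a hypothetical scoring function that maps an (x, y, t) location to a similarity value, and a hypothetical per-layer threshold decides which candidates survive. The source does not specify how candidates are selected, so the thresholding rule is an assumption.

```python
def coarse_to_fine_search(layer_scorers, locations, thresholds):
    """Cascade search from the top (coarsest) layer down.

    layer_scorers: list of callables, ordered from the highest layer to
                   the lowest; each maps a location (x, y, t) to a
                   similarity score (hypothetical interface).
    locations:     iterable of all (x, y, t) locations to scan.
    thresholds:    one acceptance threshold per layer (assumed rule).
    Returns the locations that pass verification at every layer.
    """
    candidates = list(locations)
    for score, threshold in zip(layer_scorers, thresholds):
        # Only survivors of the previous layer are scored at this layer.
        candidates = [loc for loc in candidates if score(loc) >= threshold]
        if not candidates:
            break
    return candidates

# Toy usage: lookup tables stand in for the real layer similarities.
coarse = {(0, 0, 0): 0.9, (1, 0, 0): 0.8, (2, 0, 0): 0.2}
fine = {(0, 0, 0): 0.95, (1, 0, 0): 0.3, (2, 0, 0): 0.99}
fine_calls = []

def fine_scorer(loc):
    fine_calls.append(loc)  # record which locations reach the finer layer
    return fine[loc]

matches = coarse_to_fine_search([coarse.get, fine_scorer],
                                list(coarse), [0.5, 0.5])
```

In the toy run, the third location scores well at the fine layer but is never evaluated there because it was rejected at the coarse layer. This is the intended trade: the cheap, invariant layer prunes the search space, and the expensive, selective layers only verify what remains.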

 

Experimental Results

Query tennis strokes. See Video for better illustration.

 

 

Query ballet turn. See Video for better illustration.

 

Query person smiling. See Video for better illustration.


Query on the KTH Database

Precision and recall on the KTH database for both leave-one-sequence-out (LOSO) and leave-one-person-out (LOPO) setups. (a) Precision; (b) Recall.

 

Action Recognition on the KTH Database

Kim et al. [18] report a result slightly higher than ours under experimental Setup 2. However, they manually align the actions in space-time, avoiding the tough preprocessing step of action recognition. In our work, the alignment is done fully automatically.

 

Run time of the system