Hierarchical Space-time Model Enabling Efficient Search for Human Actions
Introduction
We propose a five-layer hierarchical space-time
model (HSTM) for representing and searching human actions in videos. From a
feature point of view, both invariance and selectivity are desirable
characteristics, which seem to contradict each other. To make these characteristics
coexist, we introduce a coarse-to-fine search and verification scheme based
on the HSTM model for action searching. Because going through layers of the
hierarchy corresponds to progressively turning the knob between invariance
and selectivity, this strategy enables the search for human actions ranging
from rapid sports movements to subtle facial expressions. The
Histogram of Gabor Orientations (HIGO) feature lets the search proceed
smoothly across the hierarchical layers of the HSTM model. Matching
efficiency is improved by using integral histograms to compute the features
in the top two layers. The HSTM model was tested on three challenging video
sequences and on the KTH human action database, and it improved on other
state-of-the-art algorithms.
These results validate that the HSTM model is both selective and robust for
searching human actions.
Framework
Our model consists of five layers. Following the
naming conventions of [11] and [28], we refer to the top four layers as S1,
C1, S2, and C2 as shown in Fig. 2. Note, however, that these layers are
defined in 3D space-time. The lowest layer is the original video. The S1
responses are obtained by convolving the original video with a bank of 3D
Gabor filters. Pooling over limited ranges in the S1 layer through a
max-operation results in the C1 layer. We adopt the histogram of Gabor
orientations (HIGO) features for the S2 layer and 3D Gabor coefficient
histograms for the C2 layer. Using 3D integral histograms significantly
reduces the cost of computing the features in the S2 and C2
layers. Our S2 layer is built from histograms of Gabor orientations, which
makes it unlike simple or complex cells in the brain. Nevertheless, we
decided to retain the conventional name.
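The saving from 3D integral histograms can be illustrated with a minimal sketch. It assumes each voxel has already been quantized to a single bin index (for example, a dominant Gabor orientation); the function names `build_integral_histogram` and `box_histogram` and the nested-list layout are illustrative choices, not taken from the paper.

```python
def build_integral_histogram(labels, n_bins):
    """Build a 3D integral histogram (illustrative sketch, not the paper's code).

    labels[t][y][x] is a bin index in range(n_bins).
    Returns ih with shape (T+1, H+1, W+1, n_bins), where ih[t][y][x][b]
    counts the voxels of bin b in the box labels[:t][:y][:x].
    """
    T, H, W = len(labels), len(labels[0]), len(labels[0][0])
    ih = [[[[0] * n_bins for _ in range(W + 1)]
           for _ in range(H + 1)]
          for _ in range(T + 1)]
    for t in range(1, T + 1):
        for y in range(1, H + 1):
            for x in range(1, W + 1):
                cell = ih[t][y][x]
                for b in range(n_bins):
                    # 3D inclusion-exclusion over the seven preceding corners.
                    cell[b] = (ih[t - 1][y][x][b] + ih[t][y - 1][x][b]
                               + ih[t][y][x - 1][b]
                               - ih[t - 1][y - 1][x][b]
                               - ih[t - 1][y][x - 1][b]
                               - ih[t][y - 1][x - 1][b]
                               + ih[t - 1][y - 1][x - 1][b])
                # Add the current voxel to its own bin.
                cell[labels[t - 1][y - 1][x - 1]] += 1
    return ih


def box_histogram(ih, t0, t1, y0, y1, x0, x1):
    """Histogram of the half-open box [t0,t1) x [y0,y1) x [x0,x1).

    Uses eight corner lookups per bin, independent of the box size.
    """
    n_bins = len(ih[0][0][0])
    return [ih[t1][y1][x1][b]
            - ih[t0][y1][x1][b] - ih[t1][y0][x1][b] - ih[t1][y1][x0][b]
            + ih[t0][y0][x1][b] + ih[t0][y1][x0][b] + ih[t1][y0][x0][b]
            - ih[t0][y0][x0][b]
            for b in range(n_bins)]


# Toy volume: 3 frames of 4x5 pixels with 4 hypothetical orientation bins.
toy_labels = [[[(t + 2 * y + 3 * x) % 4 for x in range(5)]
               for y in range(4)]
              for t in range(3)]
toy_ih = build_integral_histogram(toy_labels, n_bins=4)
toy_hist = box_histogram(toy_ih, 1, 3, 0, 2, 2, 5)
```

After the one-time pass over the volume, the histogram of any space-time box costs a fixed number of lookups per bin, which is what makes evaluating a histogram feature at every (x, y, t) location affordable.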
Coarse-to-fine search scheme
The searching procedure starts from the highest
layer, where the query video is correlated against the reference video at all
locations in (x, y, t) space. The candidate locations are passed to the lower
layers for further verification. Verification at lower layers allows for the
discrimination of subtle actions, since the features become more selective.
Experimental Results
Query tennis strokes. See Video for better illustration.
Query ballet turn. See Video for better illustration.
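The coarse-to-fine search scheme behind these queries can be sketched as a cascade. This is a minimal illustration under stated assumptions: the per-layer scoring functions and thresholds are hypothetical stand-ins for the layer-wise matching described earlier, and the name `coarse_to_fine_search` is not from the paper.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Location = Tuple[int, int, int]       # (x, y, t)
Scorer = Callable[[Location], float]  # similarity of the query at a location


def coarse_to_fine_search(
    locations: Iterable[Location],
    scorers: Sequence[Scorer],
    thresholds: Sequence[float],
) -> Tuple[List[Location], List[int]]:
    """Filter candidate locations layer by layer (illustrative sketch).

    scorers are ordered from the highest (most invariant) layer to the
    lowest (most selective) layer. Returns the surviving locations and
    the number of score evaluations performed at each layer.
    """
    if len(scorers) != len(thresholds):
        raise ValueError("need exactly one threshold per scorer")
    candidates = list(locations)
    evaluations = []
    for score, threshold in zip(scorers, thresholds):
        evaluations.append(len(candidates))
        # Only locations that pass this layer are verified at the next one.
        candidates = [loc for loc in candidates if score(loc) >= threshold]
        if not candidates:
            break
    return candidates, evaluations


# Toy example with hypothetical scorers on a 4x4x4 grid of locations.
target = (2, 1, 3)
grid = [(x, y, t) for x in range(4) for y in range(4) for t in range(4)]


def coarse(loc: Location) -> float:
    # Invariant layer: accepts anything within one step of the target.
    return 1.0 if max(abs(a - b) for a, b in zip(loc, target)) <= 1 else 0.0


def fine(loc: Location) -> float:
    # Selective layer: only the exact location scores highly.
    return 1.0 if loc == target else 0.5


matches, evaluations = coarse_to_fine_search(grid, [coarse, fine], [0.5, 0.9])
# matches == [(2, 1, 3)]; evaluations == [64, 18]
```

In this toy run the selective scorer is evaluated at 18 locations instead of all 64, which mirrors why the cheaper, more invariant layers are searched exhaustively and the more selective layers are used only for verification.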

Query person smiling. See Video for better illustration.

Query on the KTH Database
Precision and recall on the KTH database for both leave-one-sequence-out (LOSO) and
leave-one-person-out (LOPO) setups. (a) Precision; (b) Recall.
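For reference, precision and recall of a query can be computed as in the sketch below. It uses the standard definitions; the clip identifiers are toy values, and how the paper decides that a retrieved clip counts as correct is not specified here, so the set-based matching is an assumption.

```python
def precision_recall(retrieved, relevant):
    """Standard precision and recall for one query (illustrative sketch).

    retrieved: identifiers returned by the search.
    relevant:  identifiers that truly contain the queried action.
    Returning 0.0 for an empty set is a convention chosen for this sketch.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: 4 clips retrieved, 5 clips relevant, 2 in common.
toy_retrieved = ["clip_01", "clip_02", "clip_03", "clip_07"]
toy_relevant = ["clip_01", "clip_02", "clip_04", "clip_05", "clip_06"]
toy_precision, toy_recall = precision_recall(toy_retrieved, toy_relevant)
# toy_precision == 0.5, toy_recall == 0.4
```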

Action Recognition on the KTH Database
Kim et al. [18] report a result slightly higher than ours under experimental
Setup 2. However, they manually align the actions in space-time, which avoids
the difficult preprocessing step of action recognition. In our work, the
alignment is done fully automatically.

Run time of the system
