Shih-Fu Chang
Columbia University
A Position Statement for Panel 2: Video representation, coding, indexing
The 1998 International Workshop on Very Low Bitrate Video Coding
People are used to the structured representation of languages, both at the
syntactic and the semantic levels. Are there good analogies for visual
and audio content? MPEG-4 includes object-based representation of
audio-visual data, to which flexible manipulation and interaction can be
applied. Still at an evolving stage, MPEG-7 may use a similar hierarchical
framework to index the visual content at multiple levels, including story,
scene, shot, object, and feature. However, are these analogies any closer
to how people describe and interpret visual content? Can new
automatic or semi-automatic tools be facilitated by this new type of
visual representation? For example, will new visual representations
(beyond pixel-based) provide new opportunities for bridging automatically
extractable features to high-level semantics? We will explore positive
answers to the above questions by presenting an innovative
bi-directional interactive environment for humans and machines to jointly
define semantic concepts based on audio-visual cues.
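As a minimal sketch (not part of the original statement), the multilevel hierarchy mentioned above, from story down through scene, shot, and object to low-level features, could be modeled as nested index nodes. All class, field, and label names here are hypothetical illustrations, not MPEG-7 constructs:

```python
# Hypothetical sketch of a multilevel video index in the spirit of the
# story / scene / shot / object / feature hierarchy described above.
# All names are illustrative assumptions, not drawn from MPEG-7 itself.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class IndexNode:
    level: str                       # "story", "scene", "shot", or "object"
    label: str                       # human-readable description
    # low-level features (e.g. a color histogram) attached to this node
    features: Dict[str, List[float]] = field(default_factory=dict)
    children: List["IndexNode"] = field(default_factory=list)

    def add_child(self, child: "IndexNode") -> "IndexNode":
        self.children.append(child)
        return child

    def find(self, level: str) -> List["IndexNode"]:
        """Collect every node at a given level, searching the subtree."""
        hits = [self] if self.level == level else []
        for c in self.children:
            hits.extend(c.find(level))
        return hits

# Build a tiny index: one story -> one scene -> two shots with objects.
story = IndexNode("story", "evening news")
scene = story.add_child(IndexNode("scene", "sports segment"))
shot1 = scene.add_child(IndexNode("shot", "wide view of field"))
shot1.add_child(IndexNode("object", "player",
                          features={"color_hist": [0.2, 0.5, 0.3]}))
shot2 = scene.add_child(IndexNode("shot", "close-up of goal"))
shot2.add_child(IndexNode("object", "ball",
                          features={"color_hist": [0.9, 0.05, 0.05]}))

print(len(story.find("shot")))                      # shots under the story
print([o.label for o in story.find("object")])      # object labels
```

A hierarchy of this shape lets queries move between levels: a semantic query at the story or scene level can be answered by descending to the feature level, which is one way the bridge between extractable features and high-level semantics could be organized.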