Shih-Fu Chang
Columbia University

A Position Statement for Panel 2: Video representation, coding, indexing
The 1998 International Workshop on Very Low Bitrate Video Coding

People are accustomed to the structured representation of language at both the syntactic and the semantic levels. Are there good analogies for visual and audio content? MPEG-4 includes an object-based representation of audio-visual data, to which flexible manipulation and interaction can be applied. MPEG-7, still at an evolving stage, may use a similar hierarchical framework to index visual content at multiple levels, including story, scene, shot, object, and feature. However, are these analogies anywhere close to how people describe and interpret visual content? Can new automatic or semi-automatic tools be facilitated by this new type of visual representation? For example, will new visual representations (beyond pixel-based ones) provide new opportunities for bridging automatically extractable features to high-level semantics? We will explore positive answers to the above questions by presenting an innovative bi-directional interactive environment in which humans and machines jointly define semantic concepts based on audio-visual cues.
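The multi-level hierarchy mentioned above (story, scene, shot, object, feature) can be made concrete with a small data-structure sketch. This is purely illustrative and not part of any MPEG specification; all class names, the "anchorperson" label, and the feature values are hypothetical, chosen only to show how semantic levels could nest from story down to extractable features.

```python
# Illustrative sketch (assumptions, not a standard): a hierarchical video
# index in the spirit of the story / scene / shot / object / feature levels.
from dataclasses import dataclass, field

@dataclass
class Feature:
    name: str     # e.g. an automatically extractable cue such as "color_histogram"
    value: list   # raw feature vector

@dataclass
class VideoObject:
    label: str                                    # hypothetical semantic label
    features: list = field(default_factory=list)  # low-level cues for this object

@dataclass
class Shot:
    start_frame: int
    end_frame: int
    objects: list = field(default_factory=list)

@dataclass
class Scene:
    shots: list = field(default_factory=list)

@dataclass
class Story:
    title: str
    scenes: list = field(default_factory=list)

    def all_objects(self):
        """Walk the hierarchy and yield every annotated object."""
        for scene in self.scenes:
            for shot in scene.shots:
                yield from shot.objects

# Example: index a one-scene, one-shot story containing a single object.
obj = VideoObject("anchorperson",
                  [Feature("color_histogram", [0.2, 0.5, 0.3])])
story = Story("news", [Scene([Shot(0, 120, [obj])])])
assert [o.label for o in story.all_objects()] == ["anchorperson"]
```

A traversal like `all_objects` is one place where a semi-automatic tool could attach or refine semantic labels, linking low-level features at the leaves to concepts at higher levels.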