HongJiang Zhang
HP Labs
A Position Statement for Panel 5: MPEG-7 issues
The 1998 International Workshop on Very Low Bitrate Video Coding

In contrast to previous MPEG standards, MPEG-7, formally named "Multimedia Content Description Interface," addresses the need for multimedia content access, although it does not aim to standardize search engines or feature extraction engines. To establish a useful and widely acceptable standard for multimedia content description, it is necessary to understand what is needed to describe the content of any given multimedia data and what types of applications the standard will support. In this statement, we present a list of issues related to supporting content-based video browsing.

Given the vast amount of video data, developing means for quick relevance assessment of video documents is critical. However, the conventional tools for indexing and retrieving video data remain very preliminary, even though more and more computers are equipped with video capture, compression, and playback functions. For example, selecting a video clip in current video information systems or on World Wide Web sites of video collections rarely involves anything better than keyword search or category browsing, and browsing of the video itself is limited to the lowest level: VCR-like controls and a display window. The fundamental requirement for changing this situation is the same as for image databases: video data should be structured and indexed. Content-based image retrieval technologies can be extended to video retrieval, but such an extension is not straightforward.

If a video clip is treated as a sequence of image frames, indexing each frame as a still image not only introduces extremely high redundancy but is also impossible in practice, given the number of frames in even a one-minute video. Furthermore, such a scheme destroys or misses the story structure that makes video distinct from still images. Video is a structured medium in which actions and events in time and space make up stories or convey particular visual information. That is, a video program should be viewed more like a document than like an unstructured sequence of frames. The indexing of video should therefore be analogous to text document indexing, where structural analysis decomposes a document into paragraphs, sentences, and words before an index is built. In other words, we need to identify the structure of a video, decompose the video into basic components, and then build indexes based on this structural information in addition to individual image frames.
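
As a rough illustration of the document analogy, the decomposition can be sketched as a tree of temporal segments whose flattened form serves as a structural index. This is only a minimal sketch under stated assumptions: the `Segment` class, the helper functions, the unit labels, and the frame ranges are all hypothetical and are not taken from MPEG-7 or from any existing system.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Segment:
    """A temporal unit of video covering frames [start, end). Hypothetical type."""
    label: str
    start: int
    end: int
    children: List["Segment"] = field(default_factory=list)


def walk(seg: Segment, depth: int = 0) -> Iterator[Tuple[int, Segment]]:
    """Yield (depth, segment) pairs in depth-first, temporal order."""
    yield depth, seg
    for child in seg.children:
        yield from walk(child, depth + 1)


def build_index(root: Segment) -> List[Tuple[int, str, int, int]]:
    """Flatten the structure into index entries: (depth, label, start, end)."""
    return [(d, s.label, s.start, s.end) for d, s in walk(root)]


def at_depth(root: Segment, depth: int) -> List[Segment]:
    """Return the segments at one level of granularity."""
    return [s for d, s in walk(root) if d == depth]


# Assumed example: a 900-frame clip split into two units,
# the first of which is split further into two sub-units.
video = Segment("video", 0, 900, [
    Segment("unit-1", 0, 400, [
        Segment("unit-1a", 0, 150),
        Segment("unit-1b", 150, 400),
    ]),
    Segment("unit-2", 400, 900),
])

index = build_index(video)
overview = [s.label for s in at_depth(video, 1)]
```

In this toy case the structural index holds five entries rather than one entry per frame, and `at_depth` returns the units at a chosen level, much as one would list the paragraphs or the sentences of a text document.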

Moreover, given the large amount of data in video, content-based browsing is another significant issue for quick relevance assessment of video source material. By browsing, we mean informal but quick access to content, possibly without any specific goal or focus. How can we spend only a few minutes viewing an hour of video and still form a fairly accurate perception of its contents? How can we map an entire segment to a small number of representative images? Browsing may be the more suitable means of addressing these needs. Browsing is also intimately related to, and needed for, video retrieval. In image retrieval, results can easily be presented as thumbnails for examination; viewing video retrieval results requires more sophisticated browsing tools, given the temporal nature and sheer volume of video data. Furthermore, browsing retrieval results is the best way to provide feedback for a given query, and thus serves as an aid to formulating queries, making it easier for the user to "just ask around" while figuring out the most appropriate query to pose. To achieve content-based browsing, we need a representation that presents the information landscape or structure of a video in a more abstracted or summarized manner. Therefore, the following issues need to be addressed by MPEG-7 if it is to support content-based video browsing:

  1. How many levels are needed to represent the temporal structure of a video program so that it supports both detail and overview, or different levels of granularity, in video browsing?
  2. What kinds of structure representation schemes are needed to support random and non-linear browsing?
  3. What kinds of content features should the standard specify to support the similarity-based grouping needed in video browsing?
  4. Although MPEG-7 will not standardize the process of extracting structural and semantic content, how and at which stage the structural and semantic content descriptors should be extracted remains a valid concern. This will affect the acceptance of MPEG-7: if there is no effective way to extract the content descriptors, the usefulness of MPEG-7 will be seriously limited.
  5. Can we define the content description scheme so that it not only supports search and browsing as separate applications but also bridges the two?
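
To make the similarity-based grouping in issue 3 more concrete, the following is a minimal sketch of how key frames might be grouped by a content feature so that one representative per group can stand in for the rest during browsing. Everything in it is an assumption made for illustration: the three-bin feature vectors, the L1 distance, the greedy grouping rule, and the threshold value. None of it is specified by MPEG-7, and other features or clustering methods could serve equally well.

```python
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def group_by_similarity(features: List[Sequence[float]],
                        threshold: float) -> List[List[int]]:
    """Greedy grouping of items by feature similarity.

    Each item joins the first group whose representative (its first
    member) lies within `threshold`; otherwise it starts a new group.
    """
    groups: List[List[int]] = []
    for i, feat in enumerate(features):
        for group in groups:
            if l1_distance(features[group[0]], feat) <= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Hypothetical three-bin feature vectors for five key frames.
features = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
    [0.75, 0.15, 0.10],
    [0.20, 0.10, 0.70],
]

groups = group_by_similarity(features, threshold=0.3)
representatives = [group[0] for group in groups]
```

With these assumed values, frames 0, 1, and 3 fall into one group and frames 2 and 4 into another, so two representative images summarize the five key frames. Whatever features the standard specifies would take the place of the toy vectors here.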