On Progress in Segmentation of Video
 
Atul Puri
AT&T Labs - Research
 
A Position Statement for Panel 1: Image/Video feature extraction and segmentation
The 1998 International Workshop on Very Low Bitrate Video Coding
 

1. Overview and Status

Segmentation can be defined as the operation of partitioning a scene into regions extracted according to a given criterion. Effective segmentation of objects in static or dynamic scenes (images or video) has thus demanded much attention and, despite progress, still poses a significant challenge. Typical schemes for image segmentation have included extraction of features such as edges and curves, and integration of these features into continuous shapes that are spatially coherent. Typical schemes for video segmentation, in addition to those used for image segmentation, have included temporal change detection due to motion of individual objects in a temporally coherent manner, as well as combinations of the two. In the past, there have been many motivations for work on image and video segmentation, such as scene analysis, pattern matching, character recognition, industrial vision systems, target recognition, biomedical imaging, etc. The work has generally been very application specific and has included scenes captured with different types of sensors and noise levels. Not surprisingly, the results have also been application specific; for instance, some techniques work well for detection of enemy vehicles in high-resolution satellite imagery, while others work well for detection and identification of machine parts, and yet others work well for traffic sensing and surveillance.
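The feature-extraction step mentioned above can be illustrated with a minimal sketch: mark pixels whose intensity gradient is strong as edge candidates. The function name and threshold below are illustrative choices, not values from any particular scheme.

```python
import numpy as np

def edge_map(image, threshold=30):
    """Mark pixels with a strong intensity gradient as edge candidates.

    'image' is a 2-D array of luminance values; 'threshold' is a
    hypothetical noise margin chosen for illustration.
    """
    img = image.astype(np.int16)
    # Simple horizontal and vertical first differences stand in for a
    # proper gradient operator.
    gx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))
    gy = np.abs(np.diff(img, axis=0, prepend=img[:1, :]))
    return (gx + gy) > threshold

# A step image: dark left half, bright right half -> one vertical edge.
image = np.zeros((6, 6), dtype=np.uint8)
image[:, 3:] = 200
mask = edge_map(image)
```

A real scheme would follow this with linking of edge pixels into continuous curves and closure into spatially coherent shapes, which this sketch does not attempt.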
More recently, an evolving breed of multimedia applications requiring advanced functionalities has provided a new focus for work on segmentation. The main requirement of these applications is access to individual audio-visual objects in the scene, and the advanced functionalities that need to be supported include the capability to move these objects freely and rearrange the scene, the capability to add, drop, or modify objects in a coded scene without re-encoding, the capability to improve the spatial or temporal quality of objects, and the capability to combine natural coded objects with synthetic objects (defined by model parameters). MPEG-4 is an object-based multimedia standard (in progress) designed to address such needs. MPEG-4 video standardizes the syntax and semantics of the video bitstream and specifies the decoding process; it does not mandate any specific preprocessing or details of encoding. MPEG-4 encoding assumes the availability of segmented video objects (VO's) in the form of a sequence of snapshots in time of these objects, referred to as video object planes (VOP's). For information, the current specification of MPEG-4 includes a discussion of segmentation of objects based on the work of the Multifunctional ad hoc group of MPEG-4 video. Section 2 provides a brief overview of this work.
 

2. The MPEG-4 Approach

Figure 1 shows the framework for segmentation of video being examined in MPEG-4; it consists of up to three major steps. In the first step, global motion compensation and scene cut detection are applied as preprocessing to compensate for overall camera movement. In the second step, either temporal segmentation alone or both temporal and spatial segmentation are performed. The third step is needed only when both temporal and spatial segmentation are performed in the second step, and simply consists of merging the results of the second step.

Figure 1: The MPEG-4 framework for video segmentation
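The control flow of the three steps can be sketched as follows. Every helper here is a hypothetical placeholder with a deliberately trivial body; it shows only how the steps compose, not how any real MPEG-4 segmentation tool implements them.

```python
import numpy as np

def global_motion_compensate(frames):
    # Step 1 (placeholder): a real system would estimate camera motion
    # and warp the frames to cancel it; here frames pass through unchanged.
    return frames

def temporal_segment(frames, threshold=20):
    # Step 2a (placeholder): mark pixels that changed since the previous
    # frame; 'threshold' is an illustrative noise margin.
    masks = [np.zeros(frames[0].shape, dtype=bool)]
    for prev, curr in zip(frames, frames[1:]):
        masks.append(np.abs(curr.astype(int) - prev.astype(int)) > threshold)
    return masks

def spatial_segment(frames):
    # Step 2b (placeholder): a real system would group pixels by color or
    # texture; here "bright" pixels stand in for a coherent region.
    return [f > 100 for f in frames]

def merge_masks(temporal, spatial):
    # Step 3 (placeholder): combine the two segmentations; the sketch
    # simply intersects them.
    return [t & s for t, s in zip(temporal, spatial)]

def segment_sequence(frames, use_spatial=True):
    frames = global_motion_compensate(frames)
    masks = temporal_segment(frames)
    if use_spatial:
        # The merging step is needed only when spatial segmentation
        # was also performed.
        masks = merge_masks(masks, spatial_segment(frames))
    return masks
```

For example, two 4x4 frames where a bright 2x2 block appears in the second frame yield a merged mask that is true exactly at those four pixels.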

3. Possibilities and Limitations

While we humans are able to identify meaningful semantic objects in scenes with relative ease, thanks to our experience and to vision and recognition processes that employ feature, shape, color, and movement analysis, the same task is rather difficult for an automated algorithm and requires significant processing. Machine vision systems have been successful only by solving a subset of the bigger problem and by incorporating learning systems. In a general sense, robust automatic segmentation of video is an area in which significant advances are needed; even the state of the art is far from satisfactory. The next best approach may be to devise semi-automatic segmentation algorithms that work well with minimal human intervention. Perhaps segmentation at a scene change can be specified manually, followed by tracking of the movement of the objects. A mechanism could determine when the objects become untrackable, at which point the segmentation map for that frame is manually re-specified and motion tracking resumes. In specialized cases, such as when video scenes consist of video objects on a background of a chosen chromakey color (with some noise), automatic real-time segmentation should be possible. Such applications include advanced videoconferencing with background insertion, TV weather forecasting, live video games, etc.