Joern Ostermann
AT&T Labs-Research
A Position Statement for Panel 3: 3D Modeling and VR
The 1998 International Workshop on Very Low Bitrate Video Coding

Semantic Coding Meets Animation

Research currently addresses the audiovisual representation of video and its associated audio from two different angles.

One group works on a layered coding approach in which the video coder switches its source model according to the level of scene understanding obtained from image analysis and scene knowledge. The source model can be as simple as the statistical dependencies between neighboring pels of an image (intra mode in a video coder), as common as the motion of square blocks (block-based motion compensation in H.261, MPEG-4, ...), or as complicated as knowledge of the scene contents, where a 3D source model is adapted to each person in the scene and only semantic parameters like 'turn head' or 'smile' are transmitted. This semantic coding approach should yield very high coding efficiency, since the number of degrees of freedom needed to describe the video signal is drastically reduced. Currently, the main challenge for semantic coders is their heavy dependence on reliable real-time image analysis, which limits their usefulness.
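The switching between source models can be illustrated with a minimal sketch. It assumes a hypothetical `AnalysisResult` that summarizes what image analysis managed to establish for the current frame; the field names and the fallback order are illustrative assumptions, not part of any standard or existing coder.

```python
from dataclasses import dataclass
from enum import Enum


class SourceModel(Enum):
    INTRA = "statistical dependencies between neighboring pels"
    BLOCK_MOTION = "motion of square blocks"
    SEMANTIC_3D = "3D model driven by semantic parameters"


@dataclass
class AnalysisResult:
    # Hypothetical summary of image analysis for one frame (assumed fields).
    has_reference_frame: bool    # a previously decoded frame is available
    person_model_adapted: bool   # a 3D model was reliably fitted to a person


def select_source_model(analysis: AnalysisResult) -> SourceModel:
    """Pick the most abstract source model the analysis can support."""
    if analysis.person_model_adapted:
        return SourceModel.SEMANTIC_3D
    if analysis.has_reference_frame:
        return SourceModel.BLOCK_MOTION
    # Nothing is known about the scene: fall back to intra coding.
    return SourceModel.INTRA
```

The point of the layering is the fallback: when analysis fails to adapt a 3D model, the coder can still drop to block-based or intra coding rather than fail outright.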

A second group is concerned with animation: it creates 3D models with behavior and animates them using proprietary or MPEG-4 animation parameter streams. Usually, these streams carry semantic information like 'turn head' or 'smile'. In a coding environment, the coder first transmits the 3D model to the decoder. The decoder then animates this model using the received animation parameter stream. In applications requiring a dialog system with an avatar or a virtual company representative, the visual animation can be augmented by audio generated with a text-to-speech (TTS) synthesizer. In this case, audio-visual animation can easily be created at bit rates below 300 bit/s, not counting the initial transmission of the model itself.
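A rough back-of-envelope calculation shows why rates of this order are plausible when the audio is sent as text for a TTS synthesizer. Every number below is an illustrative assumption (speaking rate, character size, command rate, command size), not a figure from this statement or from MPEG-4; the helper `stream_bit_rate` is hypothetical.

```python
def stream_bit_rate(chars_per_second: float,
                    bits_per_char: float,
                    commands_per_second: float,
                    bits_per_command: float) -> float:
    """Bit rate in bit/s of a text stream plus a semantic command stream.

    The one-time transmission of the 3D model is deliberately excluded.
    """
    text_rate = chars_per_second * bits_per_char
    animation_rate = commands_per_second * bits_per_command
    return text_rate + animation_rate


# Assumed values: about 15 characters of text per second at 8 bits each,
# and one 16-bit semantic command ('turn head', 'smile') every two seconds.
rate = stream_bit_rate(chars_per_second=15, bits_per_char=8,
                       commands_per_second=0.5, bits_per_command=16)
print(f"{rate:.0f} bit/s")
```

Under these assumptions the text dominates the rate and the semantic commands add very little, which is consistent with a total well below 300 bit/s.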

Both semantic coders and animation systems transmit semantic parameters; only the method for creating the parameter sets differs. However, to generate realistic animations of humans within a computer-driven dialog system, an understanding of human motion and behavior is becoming more and more important. Video analysis is therefore gaining importance in the computer animation field as well.

With the progress of image analysis, semantic coders using 3D models for humans will become as feasible as animation is today. At the same time, progress in speech recognition and speech synthesis will allow coding speech in an abstract form and decoding it with speech synthesizers. Combining these technologies with computer animation will enable new multimedia applications in which you or your avatar talk to me or my avatar, and neither of us will know what is real and what is computed.