Behzad Shahraray
AT&T Labs - Research
A Position Statement for Panel 4: Applications
The 1998 International Workshop on Very Low Bitrate Video Coding
As we approach the 21st century, we are witnessing a communications revolution. This revolution is the result of major advancements in networking, computing, information processing, and communications protocols. High-performance communications networks now connect many remote locations of the world. High-speed modems, cable modems, digital subscriber lines, and optical fiber connections provide access to these broadband networks from homes and businesses. The Internet and the user interfaces provided for the World Wide Web (WWW) have changed the face of communications. The information contained on the large number of interconnected computers has created a huge, and continuously growing, distributed digital library.

The sheer size and rapid growth of these information repositories have created a pressing need for intelligent and content-based information retrieval techniques. Today, users of the WWW rely heavily on the available search engines to find their way to the information of interest on the Web. The Web sites providing the search capabilities are among the most frequently visited sites. Although the result of a search can be of a multimedia nature, most of the search mechanisms used today are based only on the textual information, in explicit or implicit form, that accompanies multimedia data. Intelligent techniques for the content-based retrieval of non-textual information are needed. These techniques rely on machine understanding of multimedia information, such as speech, audio, video, and text, to organize and index the information.

The ability of machines to automatically understand the visual contents of still images can result in the generation of high-level linguistic as well as non-linguistic descriptors. Image sequences contained in video programs carry dynamic information that can generate an even richer set of descriptors. The extraction of such content descriptors would enable the creation of powerful image and video search engines. In practice, these high-level descriptors have proven difficult to extract with the current state of the art in image understanding. Nevertheless, many working systems exist that serve as proof of the effectiveness of even partial and domain-specific content descriptors for selective retrieval of pictorial information.

While pictorial content is a major source of information in a video program, much valuable information is also carried in the other media components, such as text (overlaid on the images, or included as closed captions), audio, and speech, that accompany the pictorial component. A combined and cooperative analysis of these components would be far more effective in the characterization of the video program. The discussion presented here considers a more general definition of video content analysis and indexing that involves the analysis of all the media components contained in a video program. In other words, the discussion considers multimedia indexing. Many Internet and telecommunications applications can benefit from content-based processing of multimedia information. In the remainder of this paper, we attempt to point out several existing and potential applications of content-based video and multimedia processing.

The availability of large multimedia libraries that can be efficiently searched has a very strong impact on education. Students and educators can take advantage of such capabilities to expand their access to educational material. The significance of this has been acknowledged in the Telecommunications Act of 1996, which has special provisions for providing Internet access to schools and public libraries. This holds the promise of turning small libraries that contain a small number of books and multimedia sources into ones with immediate access to every book, audio program, video program, and other multimedia educational material. It gives students access to large informational resources without even leaving their classrooms.

Another application area is the automated conversion (re-purposing) of already produced material to enable alternative means of presentation. Media organizations and television broadcasting companies have shown considerable interest in presenting their information through the WWW. This allows users to retrieve only the information that they are interested in, whenever they wish. Analysis of the video content through image and video understanding, speech transcription, and linguistic processing can serve to create alternative presentations of the information suitable for the Web. Large archives of information created in this way can be easily searched to retrieve current or historic information. The Web presentations can be augmented with related and supplementary information, and can thereby become a richer source of information than the video programs from which they are generated. A survey conducted by the Pew Research Center for the People and the Press indicates that the number of Americans who obtain their news on the Internet is growing at an astonishing rate. The survey also indicates that users tend to search the Internet for relevant and supplementary information about events that they have been made aware of through mass media. Intelligent content analysis engines are an essential part of this process. When sufficient bandwidth is available, television programs can be delivered with the same high quality as the original productions. The Real-time Transport Protocol (RTP) and the specific payload types defined by the Internet Engineering Task Force (IETF) have already made the delivery of high-quality MPEG-2 encoded video over IP networks possible. In the short term, this will only be feasible over private local IP networks. In the long term, however, this will enable the creation of searchable and browsable TV.

An information delivery system has to be able to adapt to the constraints imposed by the information appliance and the available bandwidth. When the delivery of information involves transmitting visual information contained in a video program, such an adaptation involves spatial and temporal scalability. Video content analysis not only provides useful indices into the video program, but has also proven useful as an effective data reduction method. A content-based sampling of the video frames, based on changes in the visual information, results in a compact representation that supports information access on information appliances such as mobile communications devices.
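
As a rough illustration of content-based sampling, the sketch below retains a frame only when it differs sufficiently from the last retained frame. It is a minimal example under stated assumptions, not the method of any particular system: frames are assumed to be flat sequences of grayscale intensities of equal length, and the names `mean_abs_diff` and `sample_by_content`, the mean-absolute-difference measure, and the threshold value are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Average absolute intensity difference between two equal-size frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def sample_by_content(frames: Sequence[Sequence[float]],
                      threshold: float = 20.0) -> List[int]:
    """Return indices of frames to retain.

    The first frame is always kept. Each later frame is compared with the
    most recently kept frame (not with its immediate predecessor), so slow
    gradual changes eventually trigger a new sample as well.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Six tiny 4-pixel "frames": two visual changes occur, at frames 3 and 5.
    frames = [[10] * 4, [12] * 4, [11] * 4, [200] * 4, [198] * 4, [50] * 4]
    print(sample_by_content(frames))
```

Only the retained frames would need to be transmitted to a low-bandwidth device; the threshold trades compactness against the risk of missing small changes.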

Intelligent indexing of multimedia presentations is another area where content-based analysis can play a major role. The existing video compression and transmission standards have made it possible for presentations to be transmitted to remote sites. These presentations can be stored for on-demand replay. Different media components of the presentation can be processed to characterize and index it. Such processing could include the analysis of the motion and gestures of the speaker, slide transition detection, extraction of textual information by performing OCR on the slides, speech recognition, speaker identification and discrimination, and audio event detection. The information extracted by this processing generates very powerful indexing capabilities that would enable selective and content-based retrieval of different segments of a presentation. An archive of presentations may be searched to find information about a topic of interest.
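
One simple way to picture such an index is a time-stamped inverted index that merges the text recovered from several media components. The sketch below assumes that upstream analysis (for example OCR on slides and speech recognition) has already produced text events with start times; the event format, the source labels, and the function names `build_index` and `search` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One analysis result: (start time in seconds, source label, recognized text).
Event = Tuple[float, str, str]
Posting = Tuple[float, str]


def build_index(events: Iterable[Event]) -> Dict[str, List[Posting]]:
    """Map each lower-cased word to the (time, source) pairs where it occurs."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for start, source, text in events:
        # set() avoids duplicate postings when a word repeats within one event.
        for word in set(text.lower().split()):
            index[word].append((start, source))
    for postings in index.values():
        postings.sort()  # chronological order for playback
    return dict(index)


def search(index: Dict[str, List[Posting]], term: str) -> List[Posting]:
    """Return the time-ordered occurrences of a single term."""
    return index.get(term.lower(), [])


if __name__ == "__main__":
    events = [
        (0.0, "slide_ocr", "Video Indexing Overview"),
        (42.5, "speech", "we now turn to indexing of speech"),
        (90.0, "slide_ocr", "Surveillance"),
    ]
    index = build_index(events)
    print(search(index, "indexing"))
```

Each returned time offset could serve as an entry point for replaying the matching segment of the stored presentation.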

Multimedia collaborative systems can also benefit from effective multimedia understanding and indexing techniques. Communication networks give people the ability to work together despite geographic distance. Multimedia collaborative sessions involve real-time exchange of visual, textual, and auditory information. The information that is retained is often limited to the end result of the collaboration and does not include the steps that were taken or the discussions that took place. Archiving systems can be set up for storing all the information together with relevant synchronization information. Content-based analysis and indexing of these archives based on multiple information streams enables the retrieval of segments of the collaborative process. This gives the users of the system access not only to the end result, but also to the process that led to it. When the communication links used for the collaborative session are established by a conferencing bridge, the information that is readily available at the bridge can be utilized in the indexing process, thereby reducing the processing that is required to identify the source of each stream.

Content analysis is also an effective tool for monitoring and surveillance applications. Reliable detection of certain events is instrumental in the automation of monitoring systems. At the lowest level, simple motion detection can be an effective way of detecting activities in an area. Higher-level processing of the speed and patterns of motion can help detect unwanted activities in the presence of normal ones.
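
The lowest-level case can be sketched with plain frame differencing: motion is flagged when a large enough fraction of pixels changes between consecutive frames. This is a minimal sketch under assumptions, not a description of any deployed system; frames are assumed to be flat sequences of grayscale intensities, and the function names and both threshold values are hypothetical.

```python
from typing import List, Sequence


def motion_fraction(prev: Sequence[int], curr: Sequence[int],
                    pixel_threshold: int = 25) -> float:
    """Fraction of pixels whose intensity changed by more than pixel_threshold."""
    if len(prev) != len(curr) or not curr:
        raise ValueError("frames must be non-empty and of equal size")
    changed = sum(1 for p, c in zip(prev, curr) if abs(p - c) > pixel_threshold)
    return changed / len(curr)


def detect_motion(frames: Sequence[Sequence[int]],
                  pixel_threshold: int = 25,
                  area_threshold: float = 0.2) -> List[int]:
    """Return indices of frames that differ from their predecessor over a
    large enough area to count as motion rather than isolated pixel noise."""
    return [
        i for i in range(1, len(frames))
        if motion_fraction(frames[i - 1], frames[i], pixel_threshold) > area_threshold
    ]


if __name__ == "__main__":
    still = [0] * 10
    moved = [0] * 5 + [100] * 5   # half of the pixels change
    noisy = [100] + [0] * 9       # a single pixel flickers
    print(detect_motion([still, still, moved, moved]))
    print(detect_motion([still, noisy]))
```

The per-pixel threshold suppresses sensor noise, while the area threshold suppresses small isolated changes; the higher-level analysis of speed and motion patterns mentioned above would build on detections like these.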

In conclusion, we emphasize the significance of automated content analysis in dealing with the explosion of information that has been brought about by the Internet. The realization of this fact has resulted in major activities, on several fronts, to provide solutions to this problem. The current activities of MPEG-7 are one example of such efforts aimed at creating a multimedia content description interface. The target of this effort is to standardize the description of different multimedia content. The definition of a set of content descriptors that meet the needs of a wide range of applications is a challenge not to be taken lightly. The major challenge, however, is in the creation of media analysis techniques that would enable the extraction of such features. More attention needs to be paid to the computational aspects of these algorithms, since high computational requirements tend to limit the applicability of robust algorithms. The difficulties in the machine understanding of multimedia content lead to inaccuracies in the search results and create a need for effective information browsing capabilities. Given these inadequacies, multi-modal user interfaces need to be created that combine the speed and capabilities of machines with the strong perceptual abilities of humans in a closed loop to locate relevant information.