Spoken Language Processing and Multimodal Communication: A View from Europe
Joseph J. Mariani
Human-Machine Communication is a very active research field and one of the main areas in which computer manufacturers, telecommunications companies and the consumer electronics industry invest substantial R&D effort. Human-Machine Communication includes perception, production and cognition aspects. Several means can be used to communicate, such as spoken or written language, image, vision or gesture. Those topics represent a large, long-term, interdisciplinary research area.
Spoken language communication with machines is a research topic per se, and includes speech recognition and synthesis as well as speaker or language recognition. Considerable effort has been devoted to that field, and substantial progress has been made in the recent past, based on the use of stochastic modeling approaches, which require large quantities of training data.
However, communication between humans usually involves different modalities, even for a single medium, such as audio processing and face/lip "reading" in spoken language communication. When speech is used together with gesture and vision, communication may become more robust, more natural and more efficient. Bringing this multimodal communication ability to the field of human-machine communication is now a major challenge, and raises difficult problems, such as the integration over time of speech and gesture stimuli, or the use of a common visual reference shared in a human-machine dialog. It also appears that similar methodologies can be used for addressing the different communication modalities.
Another interesting aspect of the work in this area is the possibility of transferring information from one medium to another, such as vision to speech, or speech to gesture, in order to help handicapped people communicate better with the machine. Finally, it may be thought that, in the long term, the study of multimodal training is necessary in order to develop a single communication mode system.
In order to conduct research in those areas, know-how in various related areas such as Natural Language Processing, Speech Processing, Computer Vision, Computer Graphics, Gestural Interaction, Human Factors, Cognitive Psychology and Sociology of Innovation has been gathered at Limsi. The integration of modalities has been studied with different approaches. Multimodal communication has been applied to providing information to customers (« Multimodal-Multimedia Automated Service Kiosk » (MASK) project), to graphic design interfaces (LimsiDraw and Mix3D projects) and to car navigation. Transmodal communication has been studied for developing aids for the blind (Meditor) and for the deaf (French Sign Language recognition).
Those research areas are similarly very active at the European level. Several Spoken Language Processing projects have been conducted in the framework of major programs of the European Union, such as Esprit, Esprit « Long Term Research » or Telematics « Language Research and Engineering ». They have addressed both basic research and technology development since the very beginning in 1984. New programs have been started recently, which focus more on the applications of such technologies and on the response to the needs of society, such as the « Language Engineering » program, the Info2000 program for the publishing industry, the « Multilingual Information Society » (MLIS) program, which supports multilingual communication all over Europe, and TIDE for aids for the handicapped. Also, the multimedia and multimodal aspects are specifically addressed in subprograms such as Intelligent Information Interfaces, Multimedia Systems or Educational Multimedia.
1.0 Human Communication: Perception, Production and Cognition
Human-Machine Communication (HMC) is a very active research field, and is one of the main areas in which computer manufacturers, telecommunications companies and the consumer electronics industry invest a large amount of R&D effort. Communication includes the perception or production of a message or of an action, as an explicit or implicit cognitive process. This communication establishes a link between the human being and his environment, which is made up partly of other human beings. Several communication modes coexist. Perception involves the "5 senses": hearing, vision, touch, taste and smell, with reading as a specific visual operation, and speech perception as a specific hearing operation related to spoken language sounds. Production includes sound (speech, or general sound production) and vision (generation of drawings, of graphics or, more typically, of written messages). Those means are specifically involved in the communication between human beings. Other actions can be produced, such as grasping, throwing, holding..., which generalize communication to the whole physical world. Cognition is the central entity, which should be able to understand, or to generate, a message or an action from a knowledge source. This activity relies on conscious processes (that allow, for example, reasoning), or unconscious ones. It takes into account the task to be fulfilled and the goal which is aimed at. It plans the scheduling of actions in order to reach that goal, and makes decisions. A specific aspect of HMC is the teleological component (the fact that we prepare well in advance the actual generation of a linguistic event, of a sound or of a move).
2.0 Human-Machine Communication
The idea is thus to give this ability, which is typical of human beings, to machines, in order to allow for a dialog with the machine, which then becomes an interface between the human being and the physical world with which he communicates. This world can be reduced to objects, but it can also be constituted by other human beings. In the first case, which is much simpler, and for the three communication links of this human-machine-world triplet, both perception and production aspects appear. The communication between the human and the real world can be direct, or may be conducted entirely through the machine. The communication between the machine and the real world corresponds to Robotics, and includes effectors (robot, machine tool...), and sensors (normal or infra-red camera, sonar sensor, laser telemeter..., recognition of sounds (engine noises for example), or of odors (exhaust gas)...).
In the domain of HMC, the computer already has artificial perception abilities: speech, character, graphics, gesture or movement recognition. This recognition function can be accompanied by the recognition of the identity of the person through the same modes. Recognition and understanding are closely related in the framework of a dynamic process, as the understanding of the beginning of a message will influence the recognition of the rest of the message. The abilities of those communication modes are still limited, and impose strong constraints on the use of the systems. Gesture or movement recognition is achieved through the use of special equipment which includes position sensors, such as the VPL DataGlove, or even DataSuit, or the Cyberglove. Other sensors allow for recognizing the direction of gaze (through an oculometer, or through a camera). Reciprocally, the computer can produce messages. The most trivial is of course the display on a screen of a pre-determined textual or graphical (including icons) message. We could add Concept-to-Text generation, summary generation, speech synthesis, and static or animated image synthesis. Images can be produced in stereovision, or as a complete environment in which the user is immersed ("virtual" reality), or be superimposed on the real environment ("augmented" reality), which requires the user to wear special equipment. The provided information is multimedia, including text, real or synthetic images and sound. It is also possible, in the gestural communication mode, to produce kinesthetic feedback, allowing for the generation of simulated solid objects.
Finally, the machine must also have cognitive abilities. It must have a model of the user, of the world on which he acts, of the relationship between those two elements, but also of the task that has to be carried out, and of the structures of the dialog. It must be able to reason, to plan a linguistic or non-linguistic act in order to reach a target, to solve problems and support decisions, to merge information coming from various sensors, to learn new knowledge or new structures, etc. Multimodal communication raises the problem of co-reference, when the user designates an object, or a spot, on the computer display and pronounces a sentence relative to an action on that object ("Put that there"). Communication is a global operation, and the meaning comes from the simultaneous co-occurrence of various stimuli at a given time, in a given situation. It is also necessary in the design of an HMC system to adjust the transmission of information through the various modalities in order to optimize the global communication. The machine transmits to the human a real representation of the world, a modified reality or a fictitious world. If the world also includes humans, the model gets more complex. The machine then interacts with several participants, who share a common reference. This reference can be a fiction generated by the machine.
3.0 A Long-Term Interdisciplinary Research
This research area is large. Initially, laboratories used to work on the different communication modes independently. Now, interdisciplinary research projects or laboratories address several modes in parallel (11, 23, 29). It is important to understand the human functions, in order to get inspiration when designing an automatic system, but, moreover, in order to model in the machine the user with whom it has to communicate. It is necessary to model not only those functions, but also the world in which they occur. This gives an idea of the size of the effort that is required, and it extends Human-Machine Communication to various research domains such as room acoustics, physics or optics. It is important to link the study of both the perception and the production modes, with the machine playing the role of an information emitter or receiver, at various degrees going from a simple signal coding/decoding process to a complete understanding/generating process, for voice as well as for writing or visual information. Artificial systems can also extend human capabilities: speaking with different timbres, or in several languages for example.
4.0 Progress in Spoken Language Processing
Considerable progress has been made in recent years in the field of Spoken Language Processing, and especially in Speech Recognition (25). The first operational systems were able, in the early 80's, to recognize a small vocabulary (40 to 50 words), pronounced in isolation by a single speaker. Those systems were based on pattern matching using dynamic programming techniques. Several improvements were made possible along each of the 3 axes (from isolated to continuous speech, from speaker-dependent to speaker-independent recognition, and in the size of the vocabulary), with a similar approach. But the use of statistical models, such as Hidden Markov Models (HMM), allowed for a large improvement along the 3 axes simultaneously, while also increasing the robustness of systems. It made it possible to include in the same model various pronunciations by the same speaker or pronunciations by various speakers. It also made it possible to use phone models instead of word models, and to reconstruct the word from the phone models, including the various pronunciations reflecting regional variants or foreign accents for the same language. Using context-dependent phones, or even word-dependent phones, helped to take the coarticulation effect into account, while solving the problem of a priori segmentation. A similar statistical approach was also used for language modeling, using counts of sequences of 2 or 3 words in large quantities of text corresponding to the application task (such as large quantities of correspondence for mail dictation). New techniques allowed for better recognition in noisy conditions or when using different microphones, and for processing spontaneous speech, including hesitations, stuttering, etc. This progress has now resulted in the availability of systems which can be used on specific tasks.
But the development of those systems for each application requires a very large effort, as a large database reflecting as closely as possible the future use of the system in operational conditions must be built (recorded and transcribed). Dialog modeling also remains a very difficult problem. The design of task-independent systems is therefore the present challenge. Also, as a general problem, the actual use of prosody in spoken language understanding is an open issue.
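The count-based language modeling described above can be illustrated with a minimal Python sketch. It counts word trigrams in a small tokenized corpus and turns the counts into conditional probabilities. The function names, the toy correspondence sentences and the additive smoothing are assumptions made for illustration; the source only states that counts of 2- or 3-word sequences are used.

```python
from collections import Counter


def train_trigram_counts(sentences):
    """Count word trigrams and their two-word histories in tokenized sentences."""
    trigram_counts = Counter()
    history_counts = Counter()
    for words in sentences:
        # Pad with sentence-start and sentence-end markers (illustrative convention).
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            trigram_counts[history + (padded[i],)] += 1
            history_counts[history] += 1
    return trigram_counts, history_counts


def trigram_probability(word, history, trigram_counts, history_counts,
                        vocab_size, alpha=1.0):
    """Estimate P(word | history) with additive smoothing (an assumed choice)."""
    numerator = trigram_counts[tuple(history) + (word,)] + alpha
    denominator = history_counts[tuple(history)] + alpha * vocab_size
    return numerator / denominator


# Toy corpus standing in for "large quantities of correspondence".
corpus = [
    "dear sir thank you for your letter".split(),
    "dear madam thank you for your order".split(),
]
tri, hist = train_trigram_counts(corpus)
vocab = {w for sentence in corpus for w in sentence} | {"</s>"}

p_your = trigram_probability("your", ("you", "for"), tri, hist, len(vocab))
p_sir = trigram_probability("sir", ("you", "for"), tri, hist, len(vocab))
print(p_your, p_sir)
```

In this toy example the word seen after "you for" in the training text receives a higher probability than an unseen one, which is the effect a recognizer exploits when ranking competing word hypotheses.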
One could also report progress in other fields of spoken language processing. In text-to-speech synthesis, the quality was generally improved by using concatenative approaches based on real speech segments. Intelligibility and naturalness were improved, even if the prosody still needs much improvement. This corpus-based approach allows for voice conversion from one speaker to another, by using techniques developed for speech recognition, and for relatively easy adaptation to a new language. The techniques developed for speech recognition were also successfully used for speaker recognition, language identification or even topic spotting. They were also imported into the field of Natural Language Processing, allowing for progress in the specific field of written language processing, as well as in the contribution of this field of research to speech understanding and generation.
5.0 Linking Language and Image, and Reality
With the advent of "intelligent" images, the relationship between Language and Image is getting closer (11). It justifies advanced human-machine communication modes, because it requires such modes. In an "intelligent" synthetic image (which implies the modeling of the real world, with its physical characteristics), a sentence such as "Throw the ball on the table" will induce a complex scenario where the ball will rebound on the table, then fall on the ground, which would be difficult to describe to the machine with usual low-level computer languages or interfaces. Visual communication is directly involved in human to machine communication (for recognizing the user, or the expressions on his face, for example), but also indirectly in the building of a visual reference that will be shared by the human and the machine, allowing for a common understanding of the messages that they exchange (for example, in the understanding of the sentence "Take the knife which is on the small marble table"). Instead of considering the user on one side, and the machine on the other side, the user himself can become an element of the simulated world, acting and moving in this world, and getting reactions from it.
6.0 Common Methodologies
One can find several similarities in the research concerning those different communication modes. In Speech and Vision Processing, similar methods are used for signal processing, coding and pattern recognition. In Spoken and Written Language Processing, part of the morphological, lexical, grammatical and semantic information will be common, together with similar approaches in the understanding, learning and generating processes. The same pattern recognition techniques can also be used for Speech and Gesture recognition. Human-Machine communication is central in the debate between Knowledge-Based methods and Self-Organizing ones. The first approach implies that the experts formalize the knowledge, and it may result in the necessity of modeling the laws of physics or mechanics, with the corresponding equations. The second approach is based on automatic training, through statistical or neuromimetic techniques, applied to very large quantities of data. It has been applied with similar algorithms to various domains of HMC, such as speech recognition or synthesis, character or object visual recognition, or the syntactic parsing of text data. The complementarity of those two approaches has still to be determined. While the idea of introducing by hand all the knowledge and all the strategies which are necessary for the various HMC modes appears unrealistic, the self-organizing approach raises the problem of defining the training processes, and of building sufficiently large multimodal databases (one should compare the few hours of voice recordings that can be presently used by existing systems to the few years of multimodal perception and production that are necessary in the acquisition of the language by human beings). Apart from the theoretical issues, the availability of computer facilities having enough power and memory is crucial.
This critical threshold has been attained very recently for speech processing, and should be reached in the near future in the case of computer vision.
7.0 The Need for Large Corpora in Order to Design and Assess Systems
A system which has been tested in ideal laboratory conditions will perform much worse when placed in the real context of the application, if it was not designed to be robust enough. The assessment of HMC systems, in order to evaluate their quality and their adequacy for the intended application, is in itself a research domain. This evaluation implies the design of large enough databases, so that the models will include the various phenomena that they must represent, and so that the results are statistically valid. In the domain of speech communication, such databases have been built in order to assess the recognition systems on specific tasks. Similar actions have also started in the case of Natural Language Processing, and of Image Processing, which requires even larger quantities of information. This approach has been used extensively in the design of Voice Dictation systems. In this case, the language model can be built relatively easily from the huge amount of text data which is available in various domains (newspapers, legal texts, medical reports and so on). The acoustic model can also be built from sentences, extracted from this text data, which are read aloud. It is somewhat more difficult to apply this approach to spontaneous speech, as found in actual human-human dialogs, as there are no available corpora of transcribed spontaneous speech comparable to what can be obtained for text dictation data, in order to build the language model. However, using an already existing speech recognition system developed for voice dictation, in order to recognize spontaneous speech and build the corresponding language model, could be considered as soon as the recognition rate is good enough to reduce the number of errors to a threshold acceptable for manual correction, and to detect the possibility of such errors (by using measures of confidence).
Reaching this threshold will probably allow for major progress in the abilities of speech recognition systems, in the framework of a bootstrapping process, as it would make it possible to use speech data of unlimited size, such as those continuously provided by radio or TV broadcast. This approach could be extended to multimodal communication. It will then raise many research problems related to the evaluation of multimodal communication systems, and to the corresponding design, acquisition and labeling of multimodal databases. It is also necessary to define a metric in order to measure the quality of the systems. Finally, the ergonomic study of the system aims at providing the user with an efficient and comfortable communication. It appears that the design of the systems should aim at copying reality as much as possible, in order to place the human in a universe that he knows well and which looks natural to him.
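The bootstrapping idea above, in which confidence measures decide which automatic transcriptions can be reused, can be sketched as a simple triage step. The `Hypothesis` type, the per-word confidence scores and both thresholds are hypothetical; the source does not specify how confidence is computed or which thresholds would be acceptable.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One automatically transcribed utterance (hypothetical structure)."""
    words: List[str]
    confidences: List[float]  # assumed per-word scores in [0, 1]


def triage(hypotheses, accept_at=0.9, review_at=0.6):
    """Split transcriptions by their weakest word confidence.

    accepted:  reliable enough to feed language model training directly
    to_review: few enough doubtful words to be corrected by hand
    rejected:  too unreliable to be worth correcting
    """
    accepted, to_review, rejected = [], [], []
    for hyp in hypotheses:
        weakest = min(hyp.confidences) if hyp.confidences else 0.0
        if weakest >= accept_at:
            accepted.append(hyp)
        elif weakest >= review_at:
            to_review.append(hyp)
        else:
            rejected.append(hyp)
    return accepted, to_review, rejected


# Illustrative data only; the scores are invented for the example.
broadcast = [
    Hypothesis(["good", "morning"], [0.95, 0.97]),
    Hypothesis(["uh", "the", "train"], [0.70, 0.92, 0.90]),
    Hypothesis(["xx"], [0.20]),
]
accepted, to_review, rejected = triage(broadcast)
print(len(accepted), len(to_review), len(rejected))
```

Using the weakest word as the decision criterion is one possible design choice; an average or a count of low-confidence words would serve the same purpose.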
8.0 Multimodal Spoken Language Communication
Humans use multimodal communication when they speak to each other, except in the case of pathology or of telephone communication. Not only the movements of the face and lips, but also expression and posture are involved in the spoken language communication process. This fact, together with interesting phenomena such as the McGurk effect and the availability of more advanced image processing technologies, prompted the study of bimodal (acoustical and optical) spoken language communication. Studies in speech intelligibility also showed that having both visual and audio information improves communication, especially when the message is complex or when the communication takes place in a noisy environment. This led to studies in bimodal speech synthesis and recognition.
In the field of speech synthesis, models of speaking faces were designed and used in speech dialog systems (5, 7, 18). The face and lip movements were synthesized by studying those movements in human speech production, through image analysis. This resulted in text-to-talking-head synthesis systems. The effect of using visual information in speech communication was studied in various ways (using the image of the lips only, the bottom of the face or the entire face), and it was shown that intelligibility was improved for the human "listener", especially in a noisy environment. In the same way, the use of visual face information, and especially the lips, in speech recognition was studied, and results showed that using both sources of information gives better recognition performance than using only the audio or only the visual information, especially in a noisy environment (14, 24, 31, 33). Visual lip information was also used, interestingly, for speaker recognition (19).
This visual information on the human image has been used as part of the spoken language communication process. However, other types of visual information related to the human user can be considered by the machine. The presence of the user in the room, the fact that he is seated in front of the computer display, or the direction of his gaze can be used in the communication process: waiting for the presence of the human in the room before synthesizing a message; choosing between a graphic or spoken mode for delivering information, depending on whether the user is in front of the computer or somewhere else in the room; adjusting the synthesis volume depending on how far he is from the loudspeaker; adapting a microphone array on the basis of the position of the user in the room (31); or checking what the user is looking at on the screen in order to deliver information relative to this area. Face recognition may also help in order to synthesize a message addressed to a man or a woman, or specifically aimed at that user. Even the mood of the user could be evaluated from his expression and considered in the way the machine will communicate with him (15, 16). Reciprocally, the expressions of the synthesized talking head can vary in order to improve the communication efficiency and comfort. It has been shown for example that eyelid blinking is important for a better sense of presence and better communication, as is a simple groan from the talking head while the machine computes the answer to a question in a human-machine dialog (7).
9.0 Multimodal Communication
Communication can also use different media, both verbal and non-verbal. A. Waibel proposes a multimodal (speech, gesture and handwriting) interface for an appointment scheduling task (31). Those different modes can be used to enter a command, to provide missing information or to resolve an ambiguity, following a request from the system. Berkley and Flanagan (6) designed the AT&T HuMaNet system for multipoint conferencing over the public telephone network. The system features hands-free sound pickup through microphone arrays, voice control of call setup, data access and display through speech recognition, speech synthesis, speaker verification for privileged data, and still image and stereo image coding. It has been extended to also include tactile interaction, gesturing and handwriting inputs and face recognition (12). In Japan, ATR has a similar advanced teleconferencing program, including 3D object modeling, face modeling, voice command and gestural communication. Rickheit & Wachsmuth (26) describe a "situated artificial communicators" project using speech input by microphone, and vision input (for gestural instructions) by camera, to command a one-arm robot constructor. At Apple Computer (27), A. James & E. Ohmaye have designed the Puppeteer tool, which aims at helping designers to build simulations of interaction for people. Puppeteer supports user input in various forms (icon selection, typing or speech recognition), and combines animation and speech synthesis to produce talking heads. At IRST, Stringa et al. (28) have designed, within the MAIA project, a multimodal interface (speech recognition and synthesis, and vision) to communicate with a "concierge" of the institute, which answers questions on the institute and its researchers, and with a mobile robot, which has the task of delivering books or accompanying visitors.
In the closely related domain of multimedia information processing, very interesting results have been obtained in the Informedia project at CMU on the automatic indexing of TV broadcast data (News), and multimedia information query by voice. The system uses continuous speech recognition to transcribe the talks. It segments the video information into sequences, and uses Natural Language Processing techniques to automatically index those sequences from the resulting textual transcriptions. Although the speech recognition is far from perfect (about 50% recognition rate), it seems to be good enough to allow the user to get a sufficient amount of multimedia information from his queries (32). Similar projects have started in Europe, and they aim in the future at better recognition performance, more complex image processing (including character recognition) and multilingual information processing.
10.0 Experience in Spoken Dialog Systems Design at Limsi
Several multimodal human-machine communication systems have been developed at LIMSI, as a continuation of our work in the design of vocal dialog systems. This issue was first studied in the design of a vocal pilot-plane dialog system sponsored by the French DoD. A cooperative effort with the MIT Spoken Language Systems Group aimed at developing a French version of the MIT Air Travel Information Service (ATIS), called L'ATIS (8). "Wizard of Oz" experiments were conducted very early, and the linguistic analysis of the resulting corpus in a train timetable inquiry system simulation was conducted (20). The Standia study of an automated telematic (voice+text) switchboard was conducted as a joint effort between the "Speech Communication" group and the "Language & Cognition" group. The design of a voice dialog system has been explored within the framework of a multimedia air-controller training application, in the Parole project (21). The goal was to replace the humans who presently play the role of the pilots in those training systems by a spoken dialog module. Speech understanding uses speech recognition in conjunction with a representation of the semantic and pragmatic knowledge related to the task, both static (structure of the plane call-signs, dictionary, confusion matrix between words...) and dynamic (dialog history, air traffic context...). The dialog manager determines the meaning of a sentence by merging those two kinds of information: the acoustic evidence from the speech recognition, and the knowledge information from the task model. It then generates a command that modifies the context of the task (and the radar image), and a vocal message to the air-controller student, using multivoice speech synthesis. The whole system is bilingual (French and English), and recognizes the language which is used. It is able to generate pilot initiatives, and several dialogs can be held in parallel.
11.0 The Multimodal-Multimedia Automated Service Kiosk (MASK) Project
In the ESPRIT "Multimodal-Multimedia Automated Service Kiosk" (MASK) project, speech recognition and synthesis are used in parallel with other input (touch screen) and output (graphics) means (13). The application is to provide railway travel information to railway customers, including the possibility of making reservations. The complete system uses a speaker-independent, continuous speech recognition system, with a vocabulary of about 1500 words. The signal acquisition is achieved by using 3 microphones, and the system has to work in the very noisy environment of a railway station. The acoustic model and the language model were built by using a prototype to record speakers carrying out a set of scenarios. The language model is based on trigrams, trained on the transcription of about 15K utterances. The semantic analysis is conducted by using a case grammar similar to the one developed for the Parole project. A dialog manager is used to determine a final complete semantic scheme which accesses a DBMS containing the railway travel information, and to generate a response including both graphical and vocal information. The vocal information is provided through a "concatenated" speech approach, in order to obtain the best possible quality. The system is presently mostly monomodal. In the near future, it should allow the users to use either speech or tactile input, or both together. However, first Wizard of Oz studies seem to show that subjects tend to use one mode or the other, but not both at the same time, at least within a single utterance (while they may switch to another mode during the dialog, if one mode appears to be unsatisfactory).
12.0 The LIMSIDraw, Tycoon and Meditor Multimodal Communication Systems
A first truly multimodal communication system has been designed using vocal and gestural input, and visual output communication. The task is the drawing of colored geometric objects on a computer display (30). The system uses a Datavox continuous speech recognition system, a high definition touch screen, and a mouse. Each of those communication modes can be used in isolation, or in combination with the others, in order to give a command to the system. Each input device has its own language model, and is connected to an interpreter, which translates the low-level input information (x,y coordinates, recognized words...) into higher-level information, accompanied by timing information, which is inserted in a waiting list as part of a command. Each interpreter uses the information provided by a User Model, a Dialog Model and a Model of the Universe corresponding to the application. The Dialog Manager analyzes the content of the waiting list and launches an execution command towards the output devices when it has filled the arguments of the specific command which was identified, taking into account the application model. In order to assign the proper value to an argument, the dialog manager uses two types of constraints: type compatibility (the value should correspond to the nature of the argument) and time compatibility (the reference to the argument and the corresponding value should be produced within a sufficiently short time interval). The manager has two working modes: a designation mode, without feedback to the output, and a small movement mode, where the user can follow his input on the screen until a stop condition occurs (such as lifting the hand from the screen). The multimodal grammar of the user interface is described by an Augmented Transition Network structure (3).
The experiments which were conducted with the system clearly show that gestural interaction is better for transmitting analog information, while speech is better for transmitting discrete information.
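The argument-filling step of the Dialog Manager, with its type and time compatibility constraints, can be sketched as follows. This is a hypothetical illustration and not the LimsiDraw implementation: the `Event` and `Slot` structures, the type labels, the 1.5-second window and the nearest-in-time tie-break are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Event:
    """A higher-level input item placed in the waiting list by an interpreter."""
    modality: str    # e.g. "touch" or "speech"
    value_type: str  # e.g. "object", "position", "color"
    value: Any
    time: float      # seconds since the start of the interaction


@dataclass
class Slot:
    """An argument of the identified command, referenced at a given time."""
    name: str
    expected_type: str
    reference_time: float  # when the user referred to it, e.g. said "that"
    value: Optional[Any] = None


def fill_slots(slots: List[Slot], waiting_list: List[Event], max_gap: float = 1.5) -> bool:
    """Assign to each slot the closest-in-time event that satisfies both constraints."""
    used = set()
    for slot in slots:
        best = None
        for index, event in enumerate(waiting_list):
            if index in used:
                continue
            if event.value_type != slot.expected_type:
                continue  # type compatibility fails
            gap = abs(event.time - slot.reference_time)
            if gap > max_gap:
                continue  # time compatibility fails
            if best is None or gap < best[0]:
                best = (gap, index)
        if best is not None:
            used.add(best[1])
            slot.value = waiting_list[best[1]].value
    # The command can be executed only once every argument is filled.
    return all(slot.value is not None for slot in slots)


# "Put that there": two spoken references resolved by two touch events.
slots = [
    Slot("object", "object", reference_time=0.6),         # "that"
    Slot("destination", "position", reference_time=1.2),  # "there"
]
waiting_list = [
    Event("touch", "object", "circle_1", time=0.7),
    Event("touch", "position", (300, 200), time=1.3),
    Event("touch", "position", (10, 10), time=5.0),  # too late to be "there"
]
complete = fill_slots(slots, waiting_list)
print(complete, slots[0].value, slots[1].value)
```

The late touch event is ignored because it violates the time constraint, even though its type would fit the destination argument.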
Several developments followed this first attempt. A tool for the specification of multimodal interfaces has been designed. It considers multimodality types, based on the number of devices per statement (one versus several), the statement production (sequential versus parallel), and the device use (exclusive versus simultaneous). Another approach to the integration of modalities was tried as an alternative to Augmented Transition Networks. The TYCOON system (22) is based on a neuromimetic approach, called Guided Propagation Networks, which uses the detection of temporal coincidences between events of different kinds (in this case multimodal events). It features a command language which allows the user to combine speech, keyboard and mouse interactions. A general modality server has been designed in a Unix environment, whose role is to time-stamp the events detected from those different modalities. A multimodal recognition score is computed, based on the speech recognition score, the correspondence between expected events and detected events, and a linear temporal coincidence function. It has been applied to extend the capabilities of a GUI used for editing Conceptual Graphs, and to the design of an interface (speech + mouse) with a map, in an itinerary description task.
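The text does not give the formula of the multimodal recognition score, so the following is only a sketch of how the three cues could be combined. The function names, the 2-second window, the triangular shape of the coincidence function and the product used as combination rule are all assumptions, not the TYCOON implementation.

```python
def temporal_coincidence(dt, window=2.0):
    """Linear coincidence: 1.0 for simultaneous events, 0.0 at `window` seconds or more."""
    return max(0.0, 1.0 - abs(dt) / window)


def multimodal_score(speech_score, expected, detected, speech_time, window=2.0):
    """Combine the three cues into one score (assumed combination rule).

    speech_score: score of the speech recognizer, assumed to lie in [0, 1]
    expected:     event kinds the command requires, e.g. ["click"]
    detected:     (kind, time_stamp) pairs delivered by the modality server
    speech_time:  time stamp of the spoken part of the command
    """
    if not expected:
        return speech_score
    available = list(detected)
    coincidences = []
    for kind in expected:
        matches = [d for d in available if d[0] == kind]
        if not matches:
            coincidences.append(0.0)   # expected event was never detected
            continue
        # Keep the detected event that coincides best with the speech.
        best = max(matches,
                   key=lambda d: temporal_coincidence(d[1] - speech_time, window))
        coincidences.append(temporal_coincidence(best[1] - speech_time, window))
        available.remove(best)
    # Mean coincidence over expected events also penalizes missing events.
    return speech_score * sum(coincidences) / len(expected)
```

With these assumptions, a spoken command recognized with a score of 0.8 and accompanied by a mouse click half a second later, as in `multimodal_score(0.8, ["click"], [("click", 10.5)], speech_time=10.0)`, scores about 0.6, while the same utterance without any click scores 0.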
Based on the previous LimsiDraw system, a multimodal text editor for the blind has been designed (4). The system uses a regular keyboard, a braille keyboard and a speech recognition system as inputs, and a text-to-speech system, a sound synthesis system and a braille display as outputs. The system allows for the following functions: read a text with embedded character attributes, such as style, color and fonts; select, copy, move or delete parts of a text; modify text or attributes; search for strings of text with specific attributes; read parts of text using speech synthesis; insert, consult and modify additional information on words (annotations). Most of the input operations involve tactile communication (through the braille keyboard) accompanied by speech. Output information integrates tactile (braille) and spoken modalities. A first-stage evaluation was conducted, addressing three kinds of exercises (coloring words having a given grammatical category, getting the definition of some words by speech synthesis, and text editing using cut, copy and paste commands). Those experiments showed that multimodal communication appeared very natural and easy to learn. Future work will address the major problem of how to adapt Graphical User Interfaces (GUIs) for blind users.
13.0 Transmodal Communication and the Tactile Communication Mode as an Alternative to Speech
Another interesting aspect of the work in this area is the possibility of transferring information from one modality to another, such as vision to speech, or speech to gesture, in order to help handicapped people communicate better with the machine. Activities may be reported in the field of the generation of text from a sequence of video frames. Initial systems would generate a text corresponding to a complete image sequence, while current systems, such as VITRA, are capable of producing text incrementally about real-world traffic scenes or short video clips of soccer matches (35). Reciprocally, other systems generate images from text, such as the work on the production of cartoons directly from a scenario. Of great interest is also the automatic bimodal (text+graphics) generation from concepts. It has been applied to the automatic design of directions-for-use booklets. The result includes both a graphical and a textual part, and the system automatically decides which information should be provided graphically, which should be provided by text, and how they should relate to each other (1, 34). Very interesting results have also been obtained by transcribing a visual scene into tactile information, as it seems that blind people rapidly learn to recognize such transcribed scenes (17).
The gestural communication mode may be used as an alternative to speech, especially when speech is used for communicating with other humans, so that also communicating by speech with the machine would bring confusion. In this framework, we conducted experiments on the possibility of using free-hand gestural communication for managing a slide presentation during a talk (2). The HMC is thus monomodal (gestural), although the complete communication is multimodal (visual, gestural and vocal) and multimedia. The gestural communication was made through a VPL DataGlove. The user could use 16 gestural commands (using the arm and the hand) to navigate in the hypertext and conduct his presentation, such as going to the next or previous slide, going to the next or previous chapter, or highlighting a zone of the slide. Although it appeared that the system was usable after some training, some user errors were difficult to handle: those caused by the "immersion syndrome" (the fact that the user will also use gestures that are not commands to the system, but which naturally accompany his verbal presentation), and those caused by hesitations when issuing a command, due to stress. It appeared that, while it was easy to correct a clearly misrecognized gesture, it was more difficult to correct the result of an insertion error, as it is difficult to figure out which gesture caused the error and to remember the corresponding counter-gesture.
Another use of gestural communication was for sign language communication (ARGO system). In this framework, a VPL DataGlove was used in order to communicate with the machine. The "speaker" produces full "sentences", which include a sequence of gestures, each of them corresponding to a concept, according to the French Sign Language (LSF). The HMM software developed at Limsi for continuous speech recognition was used for recognizing the sequence of concepts. The meaning of the sentence is transcribed on a 3D model of the scene, as it highly depends on the spatial layout (place of the "speaker" in relation to the places where the "actors" he is speaking of are standing) (10).
14.0 The MIX3D and Sammovar Projects
A more ambitious project is now being conducted, including computer vision and 3D modeling, natural language and knowledge representation, speech and gestural communication. The aim of the project is to design a system able to analyze a real static 3D scene by stereovision, and to model the corresponding objects (object models will benefit from the real world input and will in turn improve the scene analysis system). The user will have the possibility to designate, to move and to change the shape of the reconstructed objects, using voice and gestures. This project will address the difficult problem of model training in the framework of multimodal information (how non-verbal (visual, gestural) information can be used in order to build a language model, how linguistic information can help in order to build models of objects, and how to train a multimodal multimedia model).
A first step has been achieved in this project with the design of a multimodal X Window kernel for CAD applications (9). The input side of the interface is made up of a keyboard, a mouse and a speech recognition system. The output comprises a high-quality graphic display and a speech synthesis system. The designer uses both the mouse and the voice input to design the 3D objects. The user inputs graphic information with the tactile device, while attaching or modifying information by voice on the corresponding figures. If necessary, the system provides information through speech synthesis, informing the user that the action has been properly completed. The multimodal interface allows the user to concentrate his attention on the drawing and object design, as he does not have to use the same tactile device to click on a menu, nor to read written information somewhere else on the display. The interaction is still not synergetic, but sequential. The same interface has also been applied to the design of 2D objects, including hand drawing, followed by the naming of a geometrical type for the previously drawn object. The final shape is generated as a result of the integration of those two kinds of information.
15.0 Spoken Language Processing and Multimodal Communication in the European Union Programs
The European Union launched several Framework Programs in the R&D area, each lasting 4 to 5 years: FP1 (1984-1987), FP2 (1987-1991), FP3 (1990-1994), FP4 (1994-1998), and it is now preparing the next program, FP5 (1998-2002). The activities in spoken language processing, and more generally in multimodal communication, took place essentially in 2 programs: ESPRIT, now called the IT (Information Technology) program, and TELEMATICS, with a specific action line on Language Engineering. But those topics may also be present in other programs of the Commission.
The ESPRIT / IT program was managed by DGXIII-A, and is now managed by DGIII in Brussels. The pilot phase of this program started in 1983. The following programs have taken place so far, with some overlap between the different phases: ESPRIT I (1984-1988), ESPRIT II (1987-1992), ESPRIT III (1990-1994) and ESPRIT IV (1994-1998). The underlying policy is cooperation, not competition, within projects. There is not much cooperation between projects, and all projects have a limited duration.
We may identify 31 Spoken Language projects in various areas, from 1983 on, in different topics related to speech processing, and each topic contains several projects :
Apart from those Spoken Language Processing projects, there were also several Natural Language Processing projects, in different areas :
Other projects are more remotely linked to spoken or written language processing : HERODE, SOMIW, PODA, HUFIT, ACORD...
In Fall 1993, the ESPRIT management commissioned a study of the impact of the ESPRIT Speech projects, which resulted in a report.
The study found that a 126 MEcu effort (total budget) had been devoted to speech within ESPRIT. This corresponds roughly to a total of 1,000 man-years over 10 years (1983-1993). It is estimated that this represents about 12% of the total European activity on speech. This means that ESPRIT funding itself, which covers only part of the total project budget, accounted for 6% of the total activity, which is considered to be a small share of the total effort.
22 projects have been conducted and 9 of those projects produced demonstrators. In 1993, 13 industrial companies reported their intention to put products on the market, with an extra effort of 18 MEcu. 4 were already on the market in 1993 and reported a 2 MEcu income for that year, and all 13 together expected to reach a 100 MEcu income by 1996, with 2 SMEs accounting for 90% of that income. This represents a return on investment of 1.3% by 1993 and 72% by 1996, which is considered to be low.
This low performance was attributed to the fact that no exploitation plans were included in the projects from the beginning, no market investigation was conducted, and the attitude of the large industrial groups was closer to "technology watch." Since then, large groups have either left the speech scene or remained in "technology watch" mode. Those groups are ready to buy elsewhere (illustrating the "not-invented-here syndrome"). The Small and Medium Enterprises (SMEs) are more active, for "staying alive" reasons. The projects were too much technology-push and not enough market-pull, while speech was still the "cherry on the cake" for many customers.
In the present IT program (1994-1998), although there is a specific "Language Engineering" program in the Telematics action, it was said that "Speech and Natural Language Processing" was also most welcome (especially technology development within Long Term Research), but there is no specific area for it in the program. Human-Machine Communication activities, including verbal and non-verbal communication, may be found in Domain 1 (Software Technologies) / Area 4 (Human-Centered Interfaces), with applications to manufacturing, command & control, training, transport, entertainment, home and electronic business systems. They are contained in 3 action lines: User-Centered development, Usability, and User interface technologies (Virtual Reality, multimodality, NL & speech interfaces). The goal is to study the application-user interface interaction. The same topic may also be found in Domain 3 (Multimedia systems), within Area 1 (Multimedia Technology) and Area 2 (Multimedia objects trading and Intellectual Property Rights).
Domain 4 (Long Term Research) is structured in 3 different areas: Area 1: Openness to ideas, Area 2: Reactiveness to Industrial Needs, and Area 3: Proactiveness. One of the 2 action lines within Area 3 (the other one being "Advanced Research Initiative in Electronics") is "The Intelligent Information Interfaces" (I3), which addresses the concept of a broad population interacting with information in a human-centered system. The projects should address new interfaces and new paradigms. This resulted in the start-up of the I3Net network, made up of a small set of founding members, which also gathers representatives from each project retained by the European Commission in this area. The action is structured into 2 schemata: The Connected Community (mostly dealing with Augmented Reality), and The Inhabited Information Spaces (addressing large-scale information systems with broad citizen participation).
In Domain 5, one spoken language project is developed within the Open Microprocessor Systems Initiative (OMI) : IVORY, with applications in the domain of games. In Domain 6 (High Performance Computing & Networking (HPCN)), it is present in Area 4 : Networked Multi-site applications.
The Telematics program is managed by DGXIII in Luxembourg. It included 12 Sectors, with a 900 MEcu budget: Information Engineering, Telematics for Libraries, Education and Training, Transport, Urban and rural areas... Within Telematics, the "Linguistic Research and Engineering" (LRE) program lasted from 1991 to 1994, with a budget of 25 MEcu coming from the EU. It was followed by a Multilingual Action Plan (MLAP) (1993-1994), with an 8 MEcu budget from the EU. Finally, the Language Engineering program (LE) (1994-1998) is now ongoing. The total budget spent in this program is presently 80 MEcu, with 50 MEcu coming from the EU. Those programs are the follow-up of the Eurotra Machine Translation program, and they now include spoken language processing. The program is application- and user-oriented, and the idea when it was launched in 1994 was to make good applications, if possible, with still imperfect technologies.
In Linguistic Research and Engineering (LRE) and in MLAP, 8 projects out of a total of 56 were related to speech. They covered several R&D areas: Assessment (SQALE (Multilingual Speech Recognizer Quality Evaluation)); Spoken Language Resources (EUROCOCOSDA (Interface to Cocosda / Speech resources), RELATOR (Repository of Linguistic Resources), ONOMASTICA (Multilingual pronunciations of proper names), SPEECHDAT (Speech DBs for Telephone applications & Basic Research)); Language Acquisition (ILAM (Aid to Language Acquisition)); Railway Inquiry Systems (MAIS (Multilingual Automatic Inquiry System), RAILTEL (Spoken dialog for train time-table inquiries)).
In the ongoing Language Engineering (LE) program, 8 projects out of a total of 38 are on speech. Those projects may also be gathered in several R&D areas: Spoken dialog (ACCESS (Automated Call Center Through Speech Understanding Systems), REWARD (REal World Applications of Robust Dialogue), SPEEDATA (Speech recognition for data-entry applications), ARISE (Telephone-based railway inquiries)); Car navigation (VODIS (Advanced Speech technologies for Voice Operated Driver Information Systems)); Speaker recognition (CAVE (Caller Verification In Banking and Telecommunications)); Spoken Language resources (SPEECHDAT-2 (Speech Databases for Creation of Voice Driven Teleservices)); and Language training (SPEAK (Language Training and authoring keys)). The 4th Call for Proposals was issued in December 1996, with a deadline in April 1997 and a budget of 21 MEcu. The present feeling is that there is now a need to invest again in the development of Language Engineering technologies, in parallel with applications.
It also appears that there should be a permanent infrastructure. Such an infrastructure may be set up in different areas: Coordination of research (ELSNET: European Language and Speech Network, founded in 1991), Standards (EAGLES: Expert Advisory Group on Language Engineering Standards) and Language Resources (ELRA: European Language Resources Association).
Other Programs also address Human-Machine Communication and Human Language Technology.
The MLIS (Multilingual Information Society) program started for a duration of 3 years (1997-1999), with a budget of 15 MEcu. The main objective is to bring technological support to multilingualism in Europe. It is organized in 3 domains :
A Call for Proposals was issued in December 1996 on "Translation and language use in business environment".
The INFO2000 program is scheduled from 1996 to 1999. The goal is the development of the Multimedia content industry and the use of Multimedia content. It contains 2 actions :
We should also mention DRIVE (on Education technologies and applications), AIM (on Medical applications) and TIDE (Aids to the handicapped) within the Telematics program.
Some spoken language processing projects also appear in the RACE / ACTS (Advanced Communication Technologies and Services) program, under different headings :
Apart from those technically oriented programs, other programs addressed socio-economic issues. The Forecasting and Assessment in the field of Science and Technology (FAST) program lasted from 1978 to 1987, with a follow-up in the MONITOR/FAST program (1989-1993). It included an Anthropocentric Production Systems (APS) part. The program was stopped, apparently due to pressure coming from technologists, and to the difficulty of forwarding the recommendations to the designers.
More recently, the Targeted Socio-Economic Research (TSER) action was launched. It includes 3 sub-parts :
Transversal programs may also bring support to the specific programs. Several actions may be found in HCM / TMR (Human Capital and Mobility / Training and Mobility of Researchers). The ERASMUS / SOCRATES program supports an academic training network in "Phonetics and Speech Communication." This program also allows cooperation actions with the USA. The PECO, INCO-COPERNICUS and INTAS programs allow for cooperation with Central and Eastern European countries and the former Soviet Union (FSU). The INCO-DC program allows for cooperation with Mediterranean countries (Maghreb...) and others (India...).
A new transversal Thematic Call Workprogramme aims at promoting pluridisciplinary actions, which are trans-domain and trans-program (IT, ACTS, Telematics...). It contains 4 different topics : IT for Mobility, Electronic Commerce, Information Access & Interfaces, and Learning & Training in Industry.
Another transversal action is the Educational Multimedia Joint Call. It gathers expertise from different participants which may already be in different programs, and may get support from specific programs (Telematics, Information Technologies (IT) or Targeted Socio-Economic Research (TSER)), but also from Education (Socrates), Training (Leonardo da Vinci), or Trans-European Networks (TEN-Telecom) programs.
Finally, it should be stressed that those topics are included in one of the 3 priorities of the fifth Framework Program FP5 (1998 - 2002), now under discussion :
Research and development in Human-Machine Communication is a very large domain, with many different application areas. It is important to ensure a good link between the studies of the various parts which are necessary for building a complete system and ensure its adequacy with a societal or economic need, and to devote a large effort in the long term for the development of the various components themselves as well as for the integration of those various components at a deep level. The European Commission and the European laboratories developed a large effort in this area and prepare an even larger effort for the near future.
More information on the EU programs may be found at the EC servers: All programs: CORDIS Server: http://www.cordis.lu/
Telematics: I*M Server: http://www.echo.lu/
(1) Andre, E. and Rist, T., (1995), "Research in Multimedia Systems at DFKI," in Integration of Natural Language and Vision Processing, Vol. II: Intelligent Multimedia, P. Mc Kevitt ed., (Kluwer Academic Publishers).
(2) Baudel, T. and Beaudouin-Lafon, M., (1993), "Charade: Remote Control of Objects using Free-Hand Gestures," Communications of the ACM, 36 (7).
(3) Bellik, Y. and Teil, D., "A Multimodal Dialogue Controller for Multimodal User Interface Management System Application: A Multimodal Window Manager," Interchi'93, Amsterdam, April 24-29, 1993.
(4) Bellik, Y. and Burger, D., "The Potential of Multimodal Interfaces for the Blind: An Exploratory Study," Proc. Resna'95, Vancouver, Canada, June 1995.
(5) Benoit, C., Massaro, D.W., and Cohen, M.M., (1996), "Multimodality: Facial Movement and Speech Synthesis," in Survey of the State of the Art in Human Language Technology, R. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue eds, (Cambridge University Press).
(6) Berkley, D.A. and Flanagan, J., "HuMaNet : An Experimental Human/Machine Communication Network Based on ISDN," AT&T Technical Journal, 69:87-98.
(7) Beskow, J., "Talking heads: Communication, Articulation and Animation", Proc. of Fonetik'96, Nasslingen, May 1996.
(8) Bonneau-Maynard, H., Gauvain, J.L., Goodine, D., Lamel, L.F., Polifroni, J., and Seneff, S., (1993), "A French Version of the MIT-ATIS System: Portability Issues," Proc. of Eurospeech'93, pp 2059-2062.
(9) Bourdot, P., Krus, M., and Gherbi, R., "Cooperation Between a Model of Reactive 3D Objects and A Multimodal X Window Kernel for CAD Applications," in Cooperative Multimodal Communication, H. Bunt, R.J. Beun eds, (Addison-Wesley).
(10) Braffort, A., "A Gesture Recognition Architecture for Sign Language," ACM Assets'96, Vancouver, April 1996.
(11) M. Denis and M. Carfantan, eds, "Images et Langages: Multimodalites et Modelisation Cognitive," Proceedings Colloque CNRS Images et Langages, Paris, April 1-2, 1993.
(12) Flanagan, J.L., (1996), "Overview on Multimodality," in Survey of the State of the Art in Human Language Technology, R. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue eds, (Cambridge University Press).
(13) Gauvain, J.L., Gangolf, J.J. and Lamel, L., "Speech recognition for an Information Kiosk," ICSLP'96, Philadelphia, 3-6 October, 1996.
(14) Goldschen, A.J., (1996), "Multimodality: Facial Movement and Speech Recognition," in Survey of the State of the Art in Human Language Technology, R. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue eds, (Cambridge University Press).
(15) Hayamizu, S., Hasegawa, O., Itou, K., Tanaka, K., Nakazawa, M., Endo, T., Togawa, F., Sakamoto, K. and Yamamoto K., "RWC Multimodal database for interactions by integration of spoken language and visual information," ICSLP'96, Philadelphia, October, 3-6, 1996.
(16) Iwano, Y., Kageyama, S., Morikawa, E., Nakazato, S. and Shirai, K., "Analysis of head movements and its role in Spoken Dialog," ICSLP'96, Philadelphia, October, 3-6, 1996.
(17) Kaczmarek, K. and Bach-y-Rita, P., "Tactile Displays," in Advance Interface Design and Virtual Environments, W. Barfield and T.F. III eds, (Oxford University Press), in press.
(18) Le Goff, B. and Benoit, C., "A Text-to-audiovisual speech Synthesizer for French," ICSLP'96, Philadelphia, October, 3-6, 1996.
(19) Luettin, J., Thacker, N.A. and Beet, S.W., "Speaker Identification by lipreading," ICSLP'96, Philadelphia, October, 3-6, 1996.
(20) Luzzati, D., (1987), "ALORS: A Skimming Parser for Spontaneous Speech Processing," Computer Speech and Language, Vol. 2.
(21) Marque, F., Bennacef, S.K., Neel, F. and Trinh, S., "PAROLE: A Vocal Dialogue System for Air Traffic Control Training," ESCA/NATO ETRW, Applications of Speech Technology, Lautrach, September 16-17.
(22) Martin, J.C., Veldman, R. and Beroule, D., "Developing Multimodal Interfaces : A Theoretical Framework and Guided-Propagation Networks," in Cooperative Multimodal Communication, H. Bunt, R.J. Beun eds, (Addison-Wesley).
(23) Maybury, M.T., (1995), "Research in Multimedia and Multimodal parsing and generation," in Integration of Natural Language and Vision Processing, Vol. II: Intelligent Multimedia, P. Mc Kevitt ed., (Kluwer Academic Publishers).
(24) Petajan, E., Bischoff, B., Bodoff, D. and Brooke, N.M., "An Improved Automatic Lipreading System to Enhance Speech Recognition," CHI'88, pp. 19-25.
(25) Price, P., (1996), "A Decade of Speech Recognition; The Past as Springboard to the Future," Proceedings ARPA 1996 Speech Recognition Workshop, (Morgan Kaufmann publishers).
(26) Rickheit, G., Wachsmuth, I., (1996), "Situated Artificial Communicators," in Integration of Natural Language and Vision Processing, Vol. IV: Recent Advances, P. Mc Kevitt ed., (Kluwer Academic Publishers).
(27) Spohrer, J.S., (1995), "Apple Computer's Authoring Tools & Titles R&D Program," in Integration of Natural Language and Vision Processing, Vol. II: Intelligent Multimedia, P. Mc Kevitt ed., (Kluwer Academic Publishers).
(28) Stock, O., (1995), "A Third Modality of Natural Language?," in Integration of Natural Language and Vision Processing, Vol. II: Intelligent Multimedia, P. Mc Kevitt ed., (Kluwer Academic Publishers).
(29) M. Taylor, F. Neel, and D. Bouwhuis, eds, (1989), The structure of multimodal dialogue, (Elsevier Science Publishers).
(30) Teil, D. and Bellik, Y., "Multimodal Interaction Interface Using Voice and Gesture," in The Structure of Multimodal Dialogue II, M. Taylor, F. Neel and D. Bouwhuis eds, Proceedings, The Structure of Multimodal Dialogue Workshop, Maratea, September, 1991.
(31) Vo, M.T., Houghton, R., Yang, J., Bub, U., Meier, U., Waibel, A. and Duchnowski, P., "Multimodal Learning Interfaces," in Proceedings ARPA 1995, Spoken Language Systems Technology Workshop, Austin, January 22-25, 1995, (Morgan Kaufmann publishers).
(32) Wactlar, H., Kanade, T., Smith, M. and Stevens, S., (1996), "Intelligent Access to Digital Video: The Informedia Project," IEEE Computer, 29 (5).
(33) Waibel, A., Vo, M.T., Duchnowski, P. and Manke, S., (1996), "Multimodal Interfaces," in Integration of Natural Language and Vision Processing, Vol. IV: Recent Advances, P. Mc Kevitt ed., (Kluwer Academic Publishers).
(34) Wahlster, W., "Multimodal Presentation Systems: Planning Coordinated Text, Graphics and Animation," in Proceedings Colloque CNRS Images et Langages, M. Denis and M. Carfantan eds., Paris, April 1-2, 1993.
(35) Wahlster, W., (1996), "Multimodality: Text and Images," in Survey of the State of the Art in Human Language Technology, R. Cole, J. Mariani, H. Uszkoreit, A. Zaenen and V. Zue eds., (Cambridge University Press).