Abstract: Panel on Performance Measurement of Human-Centered Systems
Panelists: Donna Harman, Sharon Laskowski, David Pallett, Jean Scholtz, NIST
Catherine Plaisant, University of Maryland
B.H. Juang, Bell Labs
The NIST Perspective
The National Institute of Standards and Technology is committed to providing researchers with measurement instruments and methodologies. We believe evaluation is critical to determine research directions necessary for progress. At NIST we have been looking at human-centered systems and we believe that a new focus on metrics and measurements is needed. It seems only logical that systems designed to be human-centered need human-centered evaluations. The goal of this panel is to explain our vision for a new set of measurements and methodologies and to enlist your help in making this vision a reality.
The panelists we have assembled today represent both NIST researchers and researchers from the human-centered systems community. We are here because we feel that this community shares our concerns. The report that was issued after the February 1997 Human-Centered Systems workshop contained numerous recommendations to the National Science Foundation. There is a great deal of overlap between these recommendations and our independently developed vision. One recommendation calls for the establishment of a Human-Centered Systems collaboratory to provide, among other items, a digital library of case studies and data and a set of demonstrations that illustrate HCS principles in context. Further recommendations are for the establishment of HCS testbeds, HCS competitions modeled after TREC, and a critical research initiative on metrics and evaluation of HCS. The vision we have addresses these recommendations as well as others.
Our objectives are:
The vision is twofold. The first part is coordination of agency programs through a common corpus that can be used by all researchers of component technologies. The second is an additional focus on user interaction with the component technologies. To provide a better understanding of the integrated system, including user interactions, a demonstration capability would be provided.
We are interested in evaluations of core technologies and of applications that use the data from these core technologies. NIST has a long-standing record in evaluating core technologies such as speech recognition and text retrieval. However, there is expanding interest from the research community in complex tasks, such as access to multimedia information, production of summaries or translations from retrieved information, and automatic creation of structured information, where the structures are knowledge-base frames, object-oriented databases, etc. To further research in these areas, an infrastructure of test corpora and evaluation metrics will be needed. Particularly needed are ways of measuring the interaction between the components of these complex systems, including how the performance of the various core technologies affects performance on the final task output.
User interaction also occurs within the core technologies and applications. This interaction can improve the end result and therefore must be measured and combined with the technological performance metric. With appropriate user interactions, the overall result of a system can far exceed what the technical metric alone would suggest. The reverse is also true: if the user interactions are inappropriate, the end result can be perceived as worse than the performance metric indicates. The interaction component has thus far been missing from consideration in most of the core technologies.
In order to achieve this vision, we must put a plan into action. We need to work with researchers and agencies to produce the common tasks that would be appropriate. A common corpus that will support these tasks needs to be agreed upon and developed. Methodologies and metrics for evaluating interaction need to be developed as well. We do not yet know what these metrics or methodologies will be.
Finally, a demonstration system that shows the integration of the various component technologies and possible user interactions with these integrated technologies will be available. While the showcase system would NOT be used for evaluation, it would serve to gain understanding and consensus about the integration of the technologies and user interactions. In the early stages, missing component technologies will be replaced by sample processed data. The demonstration will be scenario-based, and therefore will show the integrated system in several contexts.
Our vision builds on current expertise and on current evaluation programs, but will be expanded to include evaluations of interactions, both between component technologies and between users and those technologies. The use of a common corpus and common set of tasks will serve as a unifying framework for this evaluation, and NIST can serve as a central resource. The common corpus, the demonstration system and the evaluation methodologies for the technologies and user interactions will be developed and maintained at NIST. Other agencies and researchers will use these resources as appropriate.
Catherine Plaisant: Experiences with user interface evaluation
A large portion of the formal user interface evaluation work uses controlled experiments to compare interface components and measure performance data as well as subjective satisfaction. Those studies are conducted using ad-hoc samples of data and tasks, often painfully developed by the testers themselves. Such studies are very useful for highlighting the strengths and weaknesses of the interfaces, but the results of even similar studies are difficult, if not impossible, to combine. Our experience also suggests that providing real data to designers is more likely to lead rapidly to a successful design than the slow process of design, test with made-up data, test with real data, redesign and retest.
B.H. ("Fred") Juang: Building an evaluation methodology
We need common evaluation that is built on a sound methodology. In order to do that, however, we need to know:
Many of these things will, of course, come in time. The first step is to understand how to construct a measurement of the quality of interaction between human and machine.