
Introduction: The testbed for this project involves science education, using the Lego Mindstorms construction materials with children, oversampled for females and underserved populations.
The main focus of this project will be on developing Module 5, Learning and Inferring Action Decisions, which includes the dialogue controller, and on examining the effects of the resulting proactive computing on human reactions and behavior. However, this work requires that the other components of the system also be present, so that a full system exists. To accomplish this, we will take the state detectors developed prior to and early in the project and integrate them into a full, though initially very limited, system. Only by having a simple but complete proactive computer system can we begin to deal with the technical problems of integration and the psychological issues involved in working with such a system. As more sophisticated sensors, state detectors and computer action synthesis modules are developed, they will be integrated into the system, and the action decision and dialogue modules will be improved.

Testbed and participants: The testbed we have chosen is that of Lego-Logo/Mindstorms (LL/M). Lego-Logo is a learning environment that enables children to build working machines out of specialized Lego equipment (including lights, sensors, and motors) and control them using computers. Mindstorms extends the Lego-Logo environment with a GUI and advanced robotics equipment that allows students to build and program sophisticated designs [185,186,187,188,189,190]. It includes an on-board computer processor into which programs can be downloaded from a PC via IR transmission. The processor can control multiple motors and has connections for multiple sensors, including light, touch and rotation sensors (others are being developed by the company). These can be built right into a construction, positioned as needed. There are several reasons for choosing this testbed.
· Computer coaching: There are many situations in education, industry and the military in which it would be desirable to have a computer that can 'coach' people in tasks involving real-world materials, whether for training, equipment maintenance or construction. This is a high-priority application area. The LL/M environment provides a prototype environment for these types of applications.
· Hands-on science education: Encouraging children in their learning of science is a national priority, especially for females and underrepresented groups, who traditionally have had less interest and success.
· Rich, yet constrained, environment: Construction elements and structures and the programming are well defined, yet the combinatorial possibilities are virtually endless. Tasks can range from simple and very structured to highly complex and creative, and can be produced for an enormous range of skill levels.
· Emotionally rich: It is an environment in which some children become energetically engaged and emotionally expressive while other children are timid and hesitant in initiating task activity.
The project will focus on using proactive computing to achieve two ends. The first is helping children to begin work in the LL/M environment, particularly enticing children to participate who are timid, fearful or generally uninterested in science or construction activities, with the goal of helping them gain a vision of the benefits of these activities, together with a basic skill level, that will sustain a continued interest. The second is encouraging creativity and perseverance in children who already feel comfortable with the LL/M environment, developing a willingness to explore further possibilities and to accept greater challenges. Thus, we will seek ways for a proactive computer to recognize and reduce fear and frustration, as well as to gently guide skill learning and issue challenges for independent initiative. These are general goals of excellent hands-on science initiatives.
Children of age 10 to 12, as well as some college undergraduates, will be the subjects, with an oversampling of females and of children from underrepresented groups. The Don Moyers Boys' and Girls' Club is just 5 blocks from Beckman Institute, providing access to many minority group children. Denisha Tate, the director of research for the Club, has agreed to a cooperation between the Club and our project. In addition, one of the graduate research assistants who will work on the project, Dean Grosshandler, conducts after-school children's classes using LL/M materials. His present and past students constitute a set of experienced users.
The physical arrangement will consist of a child sitting at a table with a computer monitor about 70 cm in from the edge of the table. Immediately before her will be a tray of LL/M parts needed for the current task. Instructions and messages can be presented visually on the monitor or auditorially. Video cameras will be directed toward the child, including an active camera focused on the face to record facial expressions, and a wide-view camera giving information about hand gestures and other body movements. Additional cameras will be directed down from above at the LL/M workspace, giving information about the current state of the task. A throat microphone will record the child's vocalizations. An eyetracker and head-tracking apparatus will record eye and head movements. A clock system will be used to provide a synchronizing signal on the various data sources so that they can be coordinated in later analyses, including examination of multiple videotapes in synch with audio and eye movement information. This environment will be developed in Beckman Institute's Integration Support Lab where the integration of diverse equipment is occurring.
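As a minimal sketch of how the synchronizing clock signal described above might be used in later analyses, the following Python looks up the sample in one timestamped stream that lies closest to a time taken from another stream. The function name and the assumption that each stream is stored as sorted timestamps in seconds are illustrative only; they are not part of the planned apparatus.

```python
import bisect


def nearest_sample(timestamps, values, t):
    """Return the value whose timestamp is closest to time t.

    timestamps must be sorted in ascending order and match values in length.
    """
    i = bisect.bisect_left(timestamps, t)
    if i == 0:
        return values[0]
    if i == len(timestamps):
        return values[-1]
    before, after = timestamps[i - 1], timestamps[i]
    # Prefer the later sample only when it is strictly closer to t.
    return values[i] if after - t < t - before else values[i - 1]
```

With such a lookup, a video frame time could be paired with the nearest eye-position sample, assuming both were stamped against the same clock.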
In initial observational studies, a tutor/experimenter will sit next to the child, with video camera and microphone placement that allows a video and audio record of their interactions as well as of the child's behavior (body movement, gestures, etc.).

Developing a corpus for identifying user states and computer actions, and for computer learning: Video and audio recordings will be made of 40 children of different skill levels, as they carry out LL/M construction and programming tasks. Instructions, in the form of CAD-type displays and/or printed instructions, will be presented on the computer screen. Tasks will be selected that are appropriate for the skill levels of the different children. These will fall into three classes: following a CAD diagram to build an object (guided construction task), building an object to meet some specified goal (goal-specified construction task), or building an object of their own design (free construction task). In the first two cases, tasks will be chosen that exemplify a principle of science or engineering or of programming, such as levers for increasing lift power, or a washing machine with internal parts that turn only when the door is closed, using touch or light sensing. For 15 of the children, an experienced tutor (Grosshandler) will be sitting by their side, giving suggestions and encouragement as appropriate. A complete video and audio record, including eye movement recording, will be made.
A detailed analysis of the recordings of a few children will be conducted by expert psychologists and educators to identify emotional, motivational and cognitive states and behavioral indicators of those states. In addition, types of tutor actions will be identified, as well as key task events (completion of substructures and structures, errors and their correction, etc.). Once the basic categories have been established from the analysis of these records, undergraduate students will be trained to conduct these analyses. The remainder of the videotaped records will then be analyzed to label tutor actions and the periods in which mental and task states can be identified. In addition, experienced tutors will view the tapes and indicate other potentially useful tutor actions that could have been taken at various points in time. The resulting labeled corpus, called the base corpus, is crucial for the rest of the project. It will be used to train computer models for state detection and for inferring appropriate computer actions. The corpus will also be made publicly available for use by other investigators. We expect that it will be widely used in related projects.

Recognition and tracking of human states: Prior research by team members has led to significant progress in tracking faces, facial expressions and body movement (including hand gestures), recognizing faces, analyzing the speech signal for affective indicators, and using eye movement indicators to detect confusion. Research on human motion tracking and understanding will continue in this project, particularly using the base corpus.
Facial expression recognition and head tracking: This research involves reliable tracking of the subject's face, extracting the features used by the classifier. We have developed a reliable non-rigid facial motion-tracking algorithm [191]. The outputs of this are the Action Units representing shape and motion of facial features including lips, eyebrows and cheeks. Using these features, a classifier has been constructed to classify facial expressions. Chen [41] has shown that the classification can be done with good accuracy with a small number of major emotions even on a single frame; we will extend this to classification using temporal cues. This can be achieved by a number of methods, such as Hidden Markov Models and time warping using Dynamic Programming algorithms. These methods also provide a record of the subject's head movements.
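To illustrate the time-warping option mentioned above, here is a minimal Python sketch that compares two sequences of per-frame feature values with dynamic programming and assigns the label of the nearest template. The scalar features, the absolute-difference cost, and the function names are assumptions made for illustration; the actual classifier would operate on Action Unit vectors and may be built quite differently.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b holds
                                 cost[i][j - 1],      # b advances, a holds
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def classify(sequence, templates):
    """Return the label of the template closest to sequence under DTW."""
    return min(templates, key=lambda label: dtw_distance(sequence, templates[label]))
```

Because the alignment may stretch or compress time, an expression performed slowly can still match a template recorded at a faster pace.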
Body tracking research will focus on both tracking the person as a blob and on tracking of body parts such as arms and hands. We have already developed algorithms for head/body/limb tracking using range data [192] and hand/face tracking using adaptive color segmentation [193]. In our proposed research, we shall take an integrated approach where the human upper body (head, torso, arms/hands) is modeled as an articulated object where each part is represented by a deformable object [194]. Based on data taken by multiple video cameras, the parameters of the model (joint angles, surface shape parameters) will be estimated as functions of time. Deformable surfaces are needed in order to recognize shoulder movement, shrugging, twisting, etc. The challenge lies in finding suitable deformable models for the upper human body, and algorithms for estimating the model parameters. The algorithms will rely heavily on statistical learning techniques including switching state-space models, dynamic Bayesian networks, and self-organizing maps.

Voice Analysis: For the purposes of this project we will need bidirectional speech communication between the subject and the system. This will require both speech recognition and text-to-speech synthesis. Neither of these is a solved problem but the state-of-the-art is sufficient for the construction of a useful experimental platform. We intend to use the IBM "Via-Voice" system for speech recognition and the Bellcore "Orator" system for synthesis. The technology for both of these commercial products is well known. For the speech recognition part, the method of choice is hidden Markov modeling [195]. For synthesis, it is concatenative LPC [196]. Analysis of prosodic features is also well known and will follow the autocorrelation technique [197]. Higher level linguistic analysis will follow the method described in Levinson and Shipley [198].
We will also analyze the speech signal in order to detect mental states. Most of the information about mental state is reflected in the prosodic features of speech, namely pitch, energy and duration. Physical correlates of these features can be extracted from the signal quite reliably without the need to actually determine what words were spoken. Pitch is estimated from extraction of the fundamental frequency, energy from the signal envelope and duration from syllabic rate as determined by counts of peaks in the energy contour. Since speech recognition is required for other aspects of this project, we can use the syntactic and semantic analysis to provide additional information about mental state, using specific words and phrase structures as indicators of mental state.
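The three prosodic measurements described above can be sketched in a few lines of Python. This is a simplified illustration, assuming a short mono signal held as a list of floats; the function names, the 75 to 500 Hz search range, and the fixed-length framing are assumptions, not details taken from the cited methods.

```python
import math


def estimate_pitch(samples, sample_rate, fmin=75.0, fmax=500.0):
    """Estimate fundamental frequency in Hz from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        value = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    # A lag of 0 means no positive correlation was found (e.g. silence).
    return sample_rate / best_lag if best_lag else 0.0


def energy_envelope(samples, frame_len):
    """Root-mean-square energy of consecutive non-overlapping frames."""
    return [math.sqrt(sum(x * x for x in samples[s:s + frame_len]) / frame_len)
            for s in range(0, len(samples) - frame_len + 1, frame_len)]


def count_peaks(envelope, threshold):
    """Count local maxima of the energy contour that exceed threshold."""
    return sum(1 for i in range(1, len(envelope) - 1)
               if envelope[i] > threshold
               and envelope[i] > envelope[i - 1]
               and envelope[i] >= envelope[i + 1])
```

Dividing the peak count by the duration of the utterance gives a rough syllabic rate. A practical system would add voicing detection and smoothing, which are omitted here.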

Eyetracking: Initial research will be directed at identifying eye movement indicators of various emotional and cognitive states. Episodes will be identified in the base corpus in which a child is confused, when she is progressing without difficulty, when she is daydreaming, when she is planning her next steps, or is in other identifiable states. Analyses will seek to identify eye movement patterns that are typical of these different states.
In addition, children will be asked to perform different cognitive tasks in the LL/M environment (read instructions carefully, scan textual instructions quickly, scan a Lego CAD diagram, examine a Lego CAD diagram carefully, identify the next component needed from a Lego CAD diagram, search for a particular building component, check a computer program for consistency, determine which command to include next in a program, etc.). Again, analyses will be aimed at identifying properties of the eye movement record that tend to accompany each type of cognitive activity. Discrimination methods involving patterns of eye movements are exemplified in Althoff [81] and Yang & McConkie [82].
Based on these findings, eye movement pattern detectors will be developed and integrated into the proactive computer system.

Fusion: In the above subsections, research in visual face and body tracking, voice analysis, and eye movements was discussed separately. A major aim of our proposed research as a whole is to combine cues from these different modalities so that the emotional and cognitive state of a person can be assessed more accurately than is possible when only cues from a single modality are available [199]. We propose to use a probabilistic framework. In particular, various architectures of Dynamic Bayesian networks [200] will be explored. Obviously, this research will depend critically on the base corpus, where user appearance is related to user state.
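The benefit of combining modalities can be seen in a deliberately simplified Python sketch. It assumes that the cues from each modality are conditionally independent given the user state (a naive Bayes assumption), which is a much weaker model than the Dynamic Bayesian network architectures to be explored; the state names and numbers are invented for illustration.

```python
def fuse(prior, likelihoods):
    """Posterior over user states given per-modality likelihoods.

    prior: dict mapping state -> P(state)
    likelihoods: list of dicts, each mapping state -> P(observation | state)
    Assumes modalities are conditionally independent given the state.
    """
    unnormalized = dict(prior)
    for modality in likelihoods:
        for state in unnormalized:
            unnormalized[state] *= modality[state]
    total = sum(unnormalized.values())
    return {state: value / total for state, value in unnormalized.items()}


# Hypothetical example: face and voice each weakly suggest confusion.
prior = {"confused": 0.5, "engaged": 0.5}
face = {"confused": 0.6, "engaged": 0.4}
voice = {"confused": 0.7, "engaged": 0.3}
posterior = fuse(prior, [face, voice])
```

Here two weak cues that agree yield a stronger combined belief in "confused" than either modality would give alone.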

Tracking task state: Computer vision techniques will be used to monitor the state of a user's assembly task. While easily obtained measures such as the height of a construction may provide sufficient feedback to the proactive system for certain tasks, our longer term objective for this subsystem is to track the Lego parts and to continually determine the configuration of a partially assembled project. One can view the process as reverse engineering, taking image data as input and producing Computer Aided Design (CAD) models. Since there is a modest number of different Lego parts, each of which is monochromatic and has a well-defined geometry, a combination of well-understood color image segmentation and model-based vision techniques can be used to recognize and track isolated parts from monocular, binocular or trinocular image data [12,86,87,96,88,89,90].
Our approach will exploit complete geometric models of each part annotated with appearance-based information, particularly color. An overhead commercial trinocular vision system (Triclops) will provide a depth map of the entire workspace and be used to locate isolated parts and assemblies; however, there is insufficient resolution to completely determine an assembly's structure. To provide greater visual coverage in the presence of occlusion, multiple color video cameras will observe the scene, and we will develop algorithms for estimating and tracking the assembly's structure from one or more video streams.
There are numerous challenges and opportunities in this context because assemblies are composed of numerous parts, parts with the same color may mate flush, obscuring the interpart boundaries, and parts may be partially or wholly occluded. The importance and challenge of exploiting part decompositions for object recognition was one of the main conclusions of the "1995 NSF/ARPA Workshop on 3D Object Representations in Computer Vision," yet very little research or progress has ensued, in part due to successes of appearance-based methods [201]. Unfortunately, appearance-based methods are most effective when there is limited parametric variability in viewpoint, articulation, shape deformation or lighting, and here we are directly confronting structural variability.
The construction process may involve the addition or deletion of a single part or the mating or separation of sub-assemblies. By tracking the state of assemblies and observing which new part is grasped from a part pallet, kinematic constraints restrict the possible location of the new part with respect to the current assembly. These kinematic constraints might simply be a finite set of locations (two basic blocks can only be attached together in a small number of ways) or they may involve revolute or prismatic joints with continuous degrees of freedom. In general, the number of configurations is exponential in the number of parts, but since assemblies are constructed incrementally, we will not have to consider the entire state space. Nonetheless, when the vision system cannot unambiguously determine the result of an operation, a hypothesis space must be maintained, and it too may grow exponentially. As further parts are added or as an assembly is rotated to provide a different view, we believe that this hypothesis space can be effectively pruned. Depending upon the task (e.g., building a specified structure or freeform building to meet a specified design objective), we will develop mechanisms to compare an observed assembly to the specified goal. This will require establishing meaningful metrics between a measured state and a goal state, or between the set of hypothesized states and the goal.
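The idea of maintaining and pruning a hypothesis space can be sketched as follows. This Python example is a toy under strong assumptions: an assembly hypothesis is a tuple of (part, slot) placements, attachment sites are a small finite set, a new observation is a yes/no consistency check, and the goal metric is simply the number of differing placements. All names are hypothetical.

```python
def extend(hypotheses, part, candidate_slots):
    """Branch every hypothesis over each slot where the new part could attach."""
    return [h + ((part, slot),) for h in hypotheses for slot in candidate_slots]


def prune(hypotheses, is_consistent):
    """Keep only the hypotheses that agree with a new observation."""
    return [h for h in hypotheses if is_consistent(h)]


def distance_to_goal(hypotheses, goal):
    """Fewest differing placements between any hypothesis and the goal."""
    goal_set = set(goal)
    return min(len(set(h) ^ goal_set) for h in hypotheses)


# Two parts are added while the view is ambiguous, so hypotheses multiply.
hypotheses = [()]
hypotheses = extend(hypotheses, "brick_a", [0, 1, 2])
hypotheses = extend(hypotheses, "brick_b", [0, 1])
ambiguous_count = len(hypotheses)

# A rotated view later reveals that brick_a sits in slot 1.
hypotheses = prune(hypotheses, lambda h: ("brick_a", 1) in h)
```

The set-difference metric stands in for the "meaningful metrics" mentioned above; a real metric would need to account for geometry and sub-assembly structure.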

Identifying candidate computer actions: The base corpus video/audio recordings in which a tutor was present will be analyzed to identify the tutor's actions and the conditions under which they occur, in attempting to assist the child. We anticipate that these actions (mostly verbal, but also with gestures) will fall into six categories: social (not directly related to the task, such as greetings, jokes, and various side comments), control (messages intended to set limits on a child's behavior and to encourage her to abide by them; attention-attracting actions), affective (expressing interest, giving assurance or encouragement, praising), procedural (instructions for specific actions to take and how to carry them out), exemplar (presenting an example of some construction, usually a substructure, as a way of helping the child to see how to progress; presenting examples of objects that might be constructed; presenting examples of computer command sequences to produce different action patterns for computerized objects), or question (posing a question to help a child recollect past knowledge, or to help direct her thoughts).
Tutor actions in the base corpus will be labeled according to their type. These data will provide a basis for computer learning algorithms to identify appropriate conditions for different action types. In addition, a collection of broadly-applicable actions (comments, displays, sounds) of each type will be compiled that can be used by a proactive computer in its communications. Frames for narrowly-applicable actions will be identified, together with their conditions, and these will be incorporated into the dialogue controller.

Develop models to guide computer action: We will investigate and study the relative merits of two conceptually different approaches to developing a system to guide computer actions. One is a direct classification method that attempts to decide on the best action to use without an intermediate step of density estimation and the other develops a probability distribution over predetermined states and uses it to select a course of action. Beyond that, the two approaches differ in the training policy (type of feedback given to the learning program) as well as in the amount of manual intervention required when building the system.
The first approach we will study is a direct approach that attempts to learn a mapping directly from sensors to actions, using the data from the base corpus described above. For this approach we will develop a model for supervised learning of action strategies in dynamic stochastic domains, and learn strategies represented by (generalized) rule-based systems. In this model the learning program will be given access to traces of good behavior, namely, of a human expert observing a student, and will attempt to learn a strategy for behaving successfully in similar situations. This general direction has roots in works on learning to reason [202,203] and, more specifically, in works on learning to take actions [204,205]. Technically, this framework is based on the PAC model of learning from examples [206] but is applied here to problems in which a program acts and needs to achieve goals. The formalization studied considers stochastic partially observable worlds as in reinforcement learning [207] where the state is described using relational information. This general framework has been studied theoretically and has been quite successful experimentally in several domains (see Khardon [208]).
The main challenge in applying a framework of this sort is that our domain is significantly more complex, both in terms of input dimensionality and expected functional complexity of the actions. We are planning to address this in two ways that involve representational and computational issues. First, the basic action strategies will be presented as rules of the form C --> A, where A is one of several actions, and C is a condition (potentially an existentially quantified expression) that is expressed as a simple function of some sensor measurements along with state variables. Our representation for C would be that of a generalized rule [209] that can be learned more efficiently from examples, even in the presence of incomplete information [203,210,211] and of very large input dimensionality that is composed mostly of irrelevant variables, as is the case when interacting with real world sensory data. A secondary advantage of this representation is that it can be manually initialized and/or augmented by experts to facilitate building a quick prototype. The second way we will address the challenge is by studying more complex action strategies that are composed hierarchically. The basic action strategies described above will be used as subroutines in a hierarchically composed strategy. These intermediate representations might include the generation (as part of the learning process) of support predicates and the identification of internal states. As before, it would be possible for a domain expert to study the action strategies learned by the system, name internal states or add states and rules.
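A rule set of the form C --> A can be pictured with a small Python sketch. It assumes that rules are tried in order and the first matching condition fires, which is only one possible reading of the generalized rules described above; the sensor variables, thresholds, and action names are hypothetical and hand-written here rather than learned.

```python
# Hypothetical sensor and state variables, named here only for illustration.
RULES = [
    (lambda obs: obs["frustration"] > 0.7, "encourage"),
    (lambda obs: obs["idle_seconds"] > 30 and not obs["task_complete"], "offer_hint"),
    (lambda obs: obs["task_complete"], "praise"),
]


def select_action(rules, observation, default="wait"):
    """Return the action of the first rule whose condition C holds."""
    for condition, action in rules:
        if condition(observation):
            return action
    return default
```

Because each rule is a readable condition paired with an action, a domain expert could inspect, reorder, or add rules by hand, which matches the secondary advantage noted above.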
Unlike the learning centered approach that directly learns a mapping from sensory input to actions (and, potentially, represents intermediate states while doing that) the second approach assumes in its input a more abstract representation of state information, such as that represented in Figure 1 as component 3, and attempts to learn a joint probability distribution over the space of internal states (of the task and user) and actions. This probability distribution will then be used to infer the most likely action given an observation. In this formalization we separate the stage of recognizing task and user states from that of action selection and assume that they were done earlier.
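In its simplest form, the second approach can be sketched as a joint probability table estimated from labeled corpus examples and then queried for the most likely action. The Python below is an illustrative assumption only: it uses raw relative frequencies over discrete (user state, task state, action) triples, with invented labels, whereas the proposed work concerns Bayesian network representations of this distribution.

```python
from collections import Counter


def fit_joint(examples):
    """Estimate P(user_state, task_state, action) by relative frequency."""
    counts = Counter(examples)
    total = sum(counts.values())
    return {triple: count / total for triple, count in counts.items()}


def most_likely_action(joint, user_state, task_state):
    """Return the action maximizing P(action | user_state, task_state)."""
    scores = {}
    for (u, t, action), p in joint.items():
        if u == user_state and t == task_state:
            scores[action] = scores.get(action, 0.0) + p
    # None signals that this state combination never appeared in the data.
    return max(scores, key=scores.get) if scores else None


# Hypothetical labeled examples standing in for base-corpus annotations.
examples = (
    [("frustrated", "stalled", "encourage")] * 3
    + [("frustrated", "stalled", "offer_hint")]
    + [("engaged", "progressing", "wait")] * 2
)
joint = fit_joint(examples)
```

The None result for unseen state combinations is a small instance of the incompleteness problem that motivates the extensions discussed next.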
The focus of this approach will be on basic methods for constructing a situation model that integrates the diverse pieces of information we anticipate at the state level. Such information is quite dynamic and can be fraught with uncertainty and incompleteness. The key technical challenge in this direction lies in the coherent and efficient extension of Bayesian networks to accommodate such diverse types of information, with most of the work falling within the realm of probability theory, knowledge representation and reasoning and, less than in the previous approach, learning theory.
Bayesian networks are among the most successful approaches for managing uncertain information and have been used in a variety of applications, including diagnosis, planning, and course-of-action evaluation. The success of this formalism stems mostly from its roots in probability theory, which equips it with a well-accepted semantics, and from the associated computational machinery which allows for practical implementations in certain applications. The application we have at hand, however, places technical demands that are outside the realm of standard Bayesian networks. For example, a Bayesian network requires a complete probabilistic model of a given situation, while the information pertaining to the state of our student, observed and interpreted by a battery of recognition programs, will be often incomplete.
This limitation has been long observed, with little progress achieved so far on dealing with it in a principled manner. Although one can define the notion of an incomplete Bayesian network, in terms of a set of probabilistic models, a major unresolved difficulty remains of reasoning efficiently with this class of networks. Intrinsic computational difficulties in inference with Bayesian networks [212], which can often be addressed using stochastic methods [213,214], seem to be too severe in these cases. We plan to investigate here a new and promising direction that is based on new results on compiling Bayesian networks into parameterized arithmetic expressions [215]. Moreover, the situations we anticipate require the ability to adapt the probabilistic model to user input, sensory input or other forms of feedback. While adaptation is known to be computationally hard when done directly with the Bayesian network representation, results in learning theory indicate that it may be easier to adapt the probabilistic representation in its arithmetic expression form. This is another direction we intend to investigate.
Another major limitation of Bayesian networks is their static nature, as they do not include a standard representation of temporal information. Representing time in dynamic Bayesian networks without paying a hefty computational price remains an open challenge, which we plan to address. Our approach will be based on recent results obtained for networks with repetitive structures, a class that includes temporal networks as a special case.
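For readers unfamiliar with temporal networks, the simplest dynamic Bayesian network has a single hidden state variable repeated at each time step. The Python sketch below shows one step of standard forward filtering in such a model. It is a textbook procedure given here for orientation, not the repetitive-structure method referred to above, and the state names and probabilities are invented.

```python
def filter_step(belief, transition, likelihood):
    """One forward-filtering step: predict with the transition model,
    then reweight by the likelihood of the new observation.

    belief: dict state -> P(state at t-1 | observations so far)
    transition: dict prev_state -> dict next_state -> P(next | prev)
    likelihood: dict state -> P(current observation | state)
    """
    predicted = {
        s2: sum(belief[s1] * transition[s1][s2] for s1 in belief)
        for s2 in belief
    }
    unnormalized = {s: predicted[s] * likelihood[s] for s in predicted}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}


# Hypothetical two-state model of a child's affective state.
belief = {"calm": 0.5, "frustrated": 0.5}
transition = {
    "calm": {"calm": 0.9, "frustrated": 0.1},
    "frustrated": {"calm": 0.2, "frustrated": 0.8},
}
observation_likelihood = {"calm": 0.1, "frustrated": 0.9}
updated = filter_step(belief, transition, observation_likelihood)
```

The cost of this step grows with the square of the number of joint states, which hints at why richer temporal models become computationally expensive.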
In summary, the focus of this component of our research program will be on extending current models of learning and Bayesian networks in several key dimensions: The former will focus on learning action strategies in dynamic stochastic domains and the generation of intermediate state information to allow for complex action strategies to be composed. The latter would allow for the use of incomplete/missing information, the adaptation of the representation given additional information and for an efficient and canonical integration of different kinds of information into Bayesian networks.

Intelligent dialog system and system integration: The dialog manager is the interface between the subject and the system. It has two main functions. First, it interprets the spoken input from the subject. Second, in conjunction with the task-learning and user state modules, it gives appropriate spoken information to the subject by providing encouragement, evaluating her activities or responding to her need for help. Whatever part of that action is in the form of a verbal response will be controlled by the dialog manager. In particular, the dialogue system must be proactive in initiating conversation under different state conditions, rather than simply responding to the user's verbalizations.
The dialog manager has three parts: an interpreter, a memory and a response generator. The interpreter comprises a syntactic and a semantic analyzer. It takes the structure of the subject's input and, based on the contents of the task model stored in memory, generates a response and updates the task model. The memory contains, in addition to the dynamic task model and user state information, all the factual and procedural knowledge necessary to understand the subject's input and to take responsive or proactive actions. The response generator uses a formal grammar to compose appropriate output sentences. These sentences must then be marked prosodically so that the correct intonation can be applied to convey the affective aspects of the response. This information is then given to the text-to-speech synthesizer so that a spoken response can be made to the subject. The technical details of such a system are given in Levinson and Shipley [198]. These techniques will be modified for the Lego-Logo task.
A special technique that we have found effective in the past for the integration of such a complex system is to build the basic framework of the entire system and add specific capabilities one at a time. This methodology should be used at a very fine-grained level so that the ability of the system to respond to very specific sentential forms will be added one at a time. A goal of our development will be to accomplish the purposes of the system with a minimum of speech understanding requirements.

Affective synthesized speech and face displays: Some communication between the system and the subject will be by means of an animated face and a corresponding voice with emotional content. Both the articulation as manifest in the face and the prosodic features of the voice must reflect the desired emotions.
For face animation, we choose the 3D-model rather than the appearance-based approach, because for our application complete realism is not needed (or even desired) and because 3D models are more flexible. We have already developed a generic 3D face/head model (which we call the iFace) which can be fitted to any particular person's face based on range or multiple 2D image data. The iFace will be driven by text and a facial expression script (provided by the "computer action decision" module). We have developed a preliminary mapping from phonemes and facial expressions (smile, frown, etc.) to facial movements. This mapping needs to be improved. Of particular interest is the co-articulation of speech-related lip movement and expression-related facial movement. To do this research, we are currently using Microsoft's TTS (Text-To-Speech) software. In the near future, we plan to get an open version of either Lucent Technologies' or Motorola's TTS, which will provide much more flexibility.
Initially, the voice will be provided by a text-to-speech synthesizer in which the prosodic features of an utterance may be altered by the insertion of special symbols in the text. Affective speech is thus generated by applying a set of hand-crafted rules to the text of the desired utterance so that the appropriate symbols are placed in the text prior to synthesis. Later, we will build our own synthesizer so that we can get full control of the internal parameters. Combined with our word modeling and intelligent agent research, we want to apply semantic information when the speech is synthesized so prosodic parameters can be selected more sensibly. Our preliminary results show that pitch contours play important roles in expressing emotions and that they depend on the content being synthesized.
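The rule-based insertion of prosodic symbols can be illustrated with a very small Python sketch. The bracketed control symbols, the emotion labels, and the table of rules are all invented for this example; each real synthesizer defines its own markup, and none of the products named above is assumed to accept this syntax.

```python
# Hypothetical control symbols; real synthesizers define their own markup.
EMOTION_MARKUP = {
    "encouraging": ("[pitch:+20%][rate:-5%]", "[reset]"),
    "calm": ("[pitch:-10%][rate:-15%]", "[reset]"),
}


def mark_prosody(text, emotion):
    """Wrap text in the control symbols associated with an emotion.

    Unknown emotions leave the text unchanged.
    """
    prefix, suffix = EMOTION_MARKUP.get(emotion, ("", ""))
    return f"{prefix}{text}{suffix}"
```

A fuller rule set would also place symbols inside the utterance, for example around stressed words, which is where semantic information about the content would become useful.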

Evaluation: There are three levels of evaluation planned: 1) does the system produce acceptable interaction dialogues with the students, 2) what effects do various interactions have on the students, and 3) does proactive computing help the students achieve desired learning, motivational, and affective goals? The first level will be the focus of evaluation during the early stages of the grant, and the second and third levels will become the focus during later stages when the system is consistent in taking reasonably appropriate actions.
At the initial stages of the project, the gaps between the system's performance and reasonable tutor performance will likely be large enough that little formal evaluation will be necessary. However, as these gaps are narrowed, more sensitive formative evaluation will be needed. To accomplish this, sessions of children working with the computer will be videotaped and segmented. Segments will then be presented to other children and to teachers for their judgments of the appropriateness of the computer's actions, and for suggestions for improvement. In addition, interactional analyses [166] will be carried out to examine subtleties in the types of actions being taken by the computer under different user and task state conditions.
Once the system is capable of a reasonable degree of interaction, we can examine the effects that the various computer actions have on the students. This can be accomplished either through analysis of videotapes (including going through a tape with the subject herself to probe her reactions) or through using 'think-aloud' methods in which children verbalize their reactions as they are engaged in the task.
As the system matures we will be able to evaluate the extent to which different aspects of proactive interaction facilitate the achievement of learning, motivational, and affective goals. Since the goals are varied, this evaluation must be multidimensional. It is expected to include the length of time children remain at the computer, the proportion of this time that the child is actually engaged in the task, the frequencies of negative and positive reactions, whether children are more successful in accomplishing tasks with proactive assistance and encouragement, and whether principles learned in building one object transfer to building another. The proactive computing system will be constructed in such a way that parts of its interaction capability can be disabled, thus allowing investigations of the effects of the presence or absence of these capabilities.

Summary: An interdisciplinary team will attempt to construct and evaluate the effects of a proactive computer system that responds, not just to user requests, but to sensed mental and task states, as well. This is an example of an attempt to create a true "human centered" computer system by providing the computer with the means of acquiring a great deal of information about its user and basing its actions on that information in addition to direct requests. This is a high risk project: no one has previously attempted to make this much real-time user information available to the computer, and developing the ability to act based on this information presents serious challenges. At the same time, it is a major attempt at defining and exploring characteristics of a new human-computer interface paradigm. This type of interface, should it be refined and available generally, would dramatically change people's relationships with their computers. The computer itself would carry much of the burden of the dialogue, thereby greatly easing requirements on the human for the benefits of successful interaction to occur. We believe that, used wisely, this approach can facilitate new users' entry into computer use, and increase the value of computer assistance to experienced users. It is much more than just adjusting to users' expressed preferences; it is a process of getting to know the user and becoming a close and helpful companion.

 

Copyright 2001 Beckman Institute,
University of Illinois at Urbana-Champaign