State of the Art

The proposed project builds on past and current research from a number of different areas that can only be briefly summarized here.

Recognition and tracking of human states:
Using sensors, especially visual ones, to track people and understand their behaviour has become an active research field, particularly in the computer vision community. This challenging task can be made somewhat easier by using wearable computers [27]. In our proposed research, however, we concentrate on remote, non-contact sensors so as not to disturb the student's learning style; the one exception is a wearable eye-tracker, which will be used initially.
Face tracking and analysis: Emotion recognition from facial expressions emerged following the work of Paul Ekman [28,29,30]. This work showed that the facial expressions of six basic emotions are universally recognized, and a mapping of facial muscle movements (described as activations of Action Units) to these emotions was constructed and named the Facial Action Coding System (FACS). Although FACS was designed for emotion recognition by humans, it has been the basis for automatic emotion recognition based on facial expressions. The algorithms differ in their feature extraction methods, the type of classifier, and whether recognition is done from still images or video sequences. Essa and Pentland [31,32] used optical flow to extract features from video sequences, together with a distance-based classifier. Otsuka et al. [33], Yacoob et al. [34], Rosenblum et al. [35] and Mase [36] also used optical flow results as features, but with Hidden Markov Model based, rule-based, Neural Network, and template-based K-Nearest Neighbor classifiers, respectively. Black et al. [37] used local parametric models to track and recognize facial expressions with a rule-based classifier. Edwards et al. [38] and Hong et al. [39] used template-based classifiers with different feature extraction methods for emotion recognition from still images. Recognition rates for still-image methods are generally lower than for sequence-based methods. Recent work has used both vision and audio information for emotion recognition. Chen et al. [40,41] used a parametric 3-D model to extract the Action Units from video, together with several audio features and a network-based classifier. De Silva et al. [42] showed that human judgments of emotion improve when both modalities are used. A recent paper by Donato et al. [43] compared various techniques for the automatic recognition of facial Action Units.
Gesture and body position tracking: Research in human tracking can be divided into two broad categories: one concerned with localization, and the other focused on tracking individual body parts such as the hands, fingers and head. The time difference image, combined with color information, is the most effective way to do localization. Tracking of arms and fingers is achieved by exploiting articulatory constraints and solving the inverse kinematics problem. Existing research focuses both on powerful learning techniques such as SLDS [44] and HMM [45] and on complex vision algorithms [46,47,48,49]. Researchers have successfully tracked humans using either geometrical or appearance models in constrained environments [50,51,52,53]. Work on low-level tracking has mainly focused on tracking the head pose [54,55,56] and the articulated motion of the fingers and the human arm [57,58,59,60], with application to American Sign Language and gesture recognition. For accurate tracking in 3D and handling of occlusion, range data and stereo cameras have been used [61,62,63,64]. A recent trend has been to combine learning techniques with vision algorithms, exploiting the power of both to handle complex environments.
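The time difference image used for localization can be illustrated with a minimal sketch. It assumes greyscale frames stored as nested lists and a hypothetical threshold of 30 intensity levels; the color cue mentioned above is omitted, and none of the cited systems is reproduced here.

```python
def difference_mask(prev, curr, threshold=30):
    """Mark pixels whose intensity changed by more than `threshold` between frames."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def bounding_box(mask):
    """Return (top, left, bottom, right) of the changed pixels, or None if nothing moved."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# Toy 4x5 frames (illustrative values): two pixels brighten between frames.
prev_frame = [[10] * 5 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[1][2] = 200
curr_frame[2][3] = 180

box = bounding_box(difference_mask(prev_frame, curr_frame))
print(box)
```

The bounding box of the changed pixels gives a coarse estimate of where the moving person is; a practical system would add noise filtering and the color information before trusting it.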
Voice analysis: Picard [65] indicates that emotion recognition is an important step towards natural human-computer interaction. Dellaert et al. [66] classified emotions by using a technique called majority voting of subspace specialists on features extracted from a smoothing spline approximation of the pitch contour. Cowie et al. [67] found that variations in the so-called augmented prosodic domain are emotive. Li et al. [68] used short-time features as well as long-term features and achieved 62% accuracy on six emotions. Nakatsu et al. [69] applied their emotion recognition algorithm to a computer agent that plays a character role in an interactive movie system. Nakatsu et al. [69] built a system that can realize emotion recognition and emotion synthesis by relating the physical features of emotional speech to the emotional content through linear statistical methods. Chen [41] achieved high recognition accuracy by using both audio and video information. Current emotion recognition systems are not yet robust, because training data is sparse and usually recorded in a single session. Although video information can improve the recognition results, real-time tracking is still a difficult task.
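The general idea of majority voting among classifiers that each look at a subset of the features can be sketched as follows. This is only an illustration of the voting principle, not the method of [66]: the nearest-centroid specialists, the three features, the centroid values and the subspaces are all hypothetical.

```python
from collections import Counter


def nearest_centroid(sample, centroids, dims):
    """Label of the centroid closest to `sample`, using only the feature indices in `dims`."""
    def squared_distance(label):
        return sum((sample[d] - centroids[label][d]) ** 2 for d in dims)
    return min(sorted(centroids), key=squared_distance)


def majority_vote(sample, centroids, subspaces):
    """Let one specialist per feature subspace vote, then return the most common label."""
    votes = [nearest_centroid(sample, centroids, dims) for dims in subspaces]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical features: mean pitch (Hz), pitch range (Hz), normalized energy.
centroids = {
    "happy": [220.0, 60.0, 0.8],
    "sad": [150.0, 20.0, 0.3],
}
# Each tuple lists the feature indices one specialist is allowed to see.
subspaces = [(0,), (1,), (2,), (0, 1)]

label = majority_vote([210.0, 25.0, 0.7], centroids, subspaces)
print(label)
```

In this toy case the pitch-range specialist votes differently from the other three and is outvoted, which is the behaviour that makes the combination more robust than any single specialist.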
Eye movement recording: There is a long history of using eye movement recording to study cognitive processes during reading [70,71], picture viewing [72], and other tasks. Eye behavior is useful in studying issues regarding both low-level perceptuo-oculomotor processes [73] and higher-level syntactic, semantic and knowledge-based processes [74]. Unless people specifically suppress normal eye movements, their eyes are directed toward the part of the stimulus field to which they are giving attention [75] and remain at that location for a period of time that is correlated with the time required for processing the information [76]. People adopt unique patterns of eye movements (oculomotor strategies) in different tasks [77,78] and tend to show regular sequences that have been referred to as 'scan paths' [79,80]. Thus, various aspects of eye behavior reflect and indicate a number of different types of mental activity. Given this fact, eye movement recording has recently been used to indicate what aspect of the stimulus an individual is attending to at the moment (current work by A. Kramer), and whether that person is acquainted with a particular person or object [81], is encountering processing difficulty during reading [82], or is examining the stimulus carefully or superficially. Applications to human-computer interaction require the development of methods for classifying data from individual subjects on individual episodes; this work is only beginning.

Fusing multi-modal input for state detection:
The automatic detection and recognition of human emotional and cognitive state is an extremely challenging problem and is largely unexplored. As mentioned earlier, preliminary work has been done on combining visual facial expression and voice analysis to recognize human emotion. However, human emotional and cognitive states are manifested not only in facial expressions and tone of voice but also, to a large degree, in head, arm, hand and body gestures and in eye movement. The chance of successful automatic human state recognition will be greatly enhanced if we take advantage of all these cues. Little research has been done in this direction. The most relevant work is the pioneering research of Cassell [83] and Quek [84].

Task Tracking:
Video analysis of complex structures: We intend to use computer vision to monitor a subject's progress in a task, specifically to determine the locations of individual Lego parts and the configuration of assemblies. There has been tremendous progress in 3-D model-based computer vision, which uses CAD-like geometric object models, for the recognition of polyhedral [12,85,86,87,88,89,90,91] and curved objects [11,12,15,16,17,92,93,94] as well as for tracking the 3-D position and orientation of objects over time [90,95,96,97]. Alternatively, appearance-based methods, which learn a representation from a collection of training images, have recently been shown to be particularly effective for objects with complex shape and reflectance [98,99,100,101]; however, they tolerate only a limited amount of pose, illumination or shape variation, and so these techniques are inappropriate for assemblies. Additionally, there has been some work on recognizing and understanding assemblies, particularly within teach-by-showing methods for robot programming [102,103,104], though these methods cannot yet be considered mature technologies.

Guiding Computer Actions:
Computer and human learning: The learning field is so vast that we do not attempt to present a literature review. Several members of our team are experts in computer learning theory and its applications. In addition to theoreticians (Levinson, Roth) the team includes experts in the key domains of learning for natural language interactions (Levinson, Roth), machine vision (Huang, Kriegman), pattern analysis (Huang), science learning (Brown) and knowledge acquisition from text and pictures (McConkie).

Intelligent Dialogue Systems:
Dialogue systems: In recent years, significant improvements have been achieved in commercial speech recognition systems. It is now practical to build dialogue systems based on these recognition engines. Some experimental spoken dialogue systems in various domains have been built [105,106,107,108,109]. These systems can interact with the user within limited contexts. Given the vast variety of human speech, the current research trend is to model the dialogue process stochastically. This is reflected in the recent special issue on language modeling and dialogue systems of IEEE Transactions on Speech and Audio Processing. Levin et al. [110] proposed to model the dialogue as a Markov decision process. Riccardi et al. [111] claimed that the problem of dialogue design can be formalized as an optimization problem with an objective function. Siu et al. [112] applied the variable n-gram design algorithm to conversational speech. Van Noord et al. [113] also gave a robust grammar analysis algorithm for processing spoken input. Interest in automated learning of dialogue schemes has grown recently [110,114]. Current research efforts have also been directed toward improving dialogue management through various fine-tuning techniques such as mixed-initiative [115,116,117,118,119,120].
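To make the Markov decision process view of dialogue concrete, the following minimal sketch solves a toy dialogue MDP by value iteration. The states, actions, probabilities, rewards and discount factor are all hypothetical, and value iteration on a known model is used purely for illustration; it is not claimed to be the procedure of [110].

```python
GAMMA = 0.95  # discount factor (assumed)

# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "need_info": {
        "ask": [(0.8, "have_info", -1.0), (0.2, "need_info", -1.0)],
        "close": [(1.0, "done", -5.0)],  # closing without the information is penalized
    },
    "have_info": {
        "ask": [(1.0, "have_info", -1.0)],  # a redundant question costs a turn
        "close": [(1.0, "done", 10.0)],
    },
    "done": {},  # terminal state, no actions
}


def q_value(state, action, values):
    """Expected discounted return of taking `action` in `state`."""
    return sum(p * (r + GAMMA * values[nxt]) for p, nxt, r in transitions[state][action])


def value_iteration(sweeps=200):
    """Return state values and the greedy policy after a fixed number of sweeps."""
    values = {state: 0.0 for state in transitions}
    for _ in range(sweeps):
        for state, actions in transitions.items():
            if actions:
                values[state] = max(q_value(state, a, values) for a in actions)
    policy = {
        state: max(actions, key=lambda a: q_value(state, a, values))
        for state, actions in transitions.items()
        if actions
    }
    return values, policy


values, policy = value_iteration()
print(policy)
```

The resulting policy asks while information is missing and closes once it has been obtained, which is the kind of strategy a dialogue designer would otherwise have to hand-code.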

Creating Affective Displays:
Expressive voice: To achieve friendly man-machine communication, computers need affective characteristics [65]. Generating emotional speech is one way to accomplish this. One method is to use a commercial synthesizer and adjust the speech features related to emotion. Researchers at the University of Dundee, UK [121,122,123] and at MIT [124] have achieved quite good results, in terms of emotion identifiability, using this approach with DECtalk. Many other researchers have also tried to modify acoustic features such as pitch and intensity using non-commercial synthesizers [125,126,127,128,129]. The approaches taken by some Japanese researchers are interesting in that their methods can be used for both synthesis and recognition [130,131].
Facial expressions: Producing facial expressions is another way to achieve affective displays. Early research on facial expression synthesis used linear interpolation between pre-digitized sculptures of a face with various expressions [132,133]. The parametric approach [134] produced face animation by controlling a set of parameters. Based on linguistic and psychological studies, Pelachaud et al. [135] used a set of rules to link intonation, emotion and facial animation. Terzopoulos et al. [136] used deformable contour models to track the non-rigid motions of facial features and estimated the muscle contractions that are used for face animation. Essa et al. [137] used optical flow measurements to extract facial action parameters from images for face animation. Hong et al. [138] combined eigen mouth shapes and motion curves, obtained by analyzing video sequences, to synthesize an expressive talking mouth. Performance-based approaches [139,140] directly animated a face model using the tracking results of points on a live actor's face, without analysis. Brand [141] trained an HMM to map audio features to expressive face animation. Appearance-based methods are often very effective [142,143,144].
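The early interpolation approach mentioned above amounts to blending corresponding vertices of two digitized faces. A minimal sketch, assuming each face is a list of (x, y, z) vertices in the same order; the two-vertex "faces" and the function name are hypothetical:

```python
def interpolate_face(neutral, target, weight):
    """Blend two vertex lists linearly: weight 0 gives `neutral`, weight 1 gives `target`."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(neutral) != len(target):
        raise ValueError("faces must have the same number of vertices")
    return [
        tuple((1.0 - weight) * n + weight * t for n, t in zip(n_vertex, t_vertex))
        for n_vertex, t_vertex in zip(neutral, target)
    ]


# Two toy "faces" of two vertices each (illustrative coordinates).
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
smile = [(0.0, 0.5, 0.0), (1.0, 0.5, 0.2)]

halfway = interpolate_face(neutral, smile, 0.5)
print(halfway)
```

Sweeping the weight from 0 to 1 over successive frames produces the animation; the parametric and muscle-based approaches replace this single weight with a richer set of controls.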

Testbed: Science Education:
Hands-on science education: Recent science and mathematics education policy documents summarize the broad consensus among science educators that students need to be engaged in thoughtful consideration of hands-on interactions [145,146,147]. Such thoughtful hands-on manipulations help students construct understandings (rather than just absorbing often meaningless information) from interactions with phenomena, symbolic representations, and other people [148,149,150,151,152]. This holds particularly true for populations traditionally underrepresented in science, such as females and minorities, who often have fewer out-of-class hands-on experiences [153,154,155,156]. One important class of hands-on activities is interaction with "construction kits" [157] such as Lego-Logo, which can help students meaningfully interact with phenomena and ideas of science, mathematics, and technology [158,159,160,161,162].

Corpus Development:
Video analyses of interactions aim to identify regularities or stabilities in interactions and to show how these stabilities influence, and are influenced by, the evolution of the interaction. These analyses can focus on observable aspects of interactions [163,164,165,166,167], they can assist in the development of grounded models of unobservable processes such as thinking [169,170,171,172,173], and in some cases both observable and unobservable processes can be the focus [174]. Identifying and articulating these regularities can then assist in developing instructional interactions that take these previously often unseen stabilities into account [175,176,177,178,179,180,181,182,183,184].

 

Copyright 2001 Beckman Institute, University of Illinois at Urbana-Champaign