State of the Art
The proposed project builds on past and current research from a number
of different areas that can only be briefly summarized here.
Recognition and tracking of human states:
Using sensors, especially visual, to track people and understand their
behaviour, has become an active research field (particularly in the computer
vision community). This challenging task can be made relatively easier
by using wearable computers [27]. However, so as not to disturb the
student's learning style, our proposed research concentrates on remote,
non-contact sensors; the one exception is a wearable eye-tracker, which
will be used initially.
Face tracking and analysis: Emotion recognition using facial expression
emerged following the work done by Paul Ekman [28,29,30]. This work showed
that facial expressions of six basic emotions are universally recognized,
and a mapping of facial muscle movement (described as activation of Action
Units) to these emotions was constructed and named the Facial Action Coding
System (FACS). Although FACS was designed for emotion recognition by humans,
it has been the basis for automatic emotion recognition based on facial
expressions. The differences among algorithms are in the feature extraction
methods, type of classifiers and whether the recognition is done from
still images or video sequences. Essa and Pentland [31,32] used optical
flow to extract the features from video sequences and a distance-based
classifier. Otsuka et al. [33], Yacoob et al. [34], Rosenblum et al. [35]
and Mase [36] also used optical flow results as features, but with different
classifiers: Hidden Markov Model based, rule-based, neural network, and
template-based K-Nearest Neighbor classifiers, respectively.
Black et al. [37] used local parametric models as a means of tracking
and recognizing facial expressions using a rule-based classifier. Edwards
et al. [38] and Hong et al. [39] used template based classifiers with
different feature extraction methods for emotion recognition from still
images. Recognition rates for still image methods are generally lower
than for sequence based methods. Recent work has used both vision and
audio information for emotion recognition. Chen et al. [40,41] used a
parametric 3-D model to extract the Action Units from video, together with
several audio features and a network-based classifier. De Silva et al. [42]
showed that human judgments of emotion improve when both modalities are used.
A recent paper by Donato et al. [43] compared various techniques for the
automatic recognition of facial Action Units.
Gesture and body position tracking: The research in human tracking can
be divided into two broad categories, one that is concerned with localization
and the other that focuses on tracking individual body parts such as the
hands, fingers and head. The time difference image, together with color
information, is the most effective way to perform localization. The tracking of arms and
fingers is achieved by making use of articulatory constraints and solving
the inverse kinematic problem. Existing research focuses both on powerful
learning techniques such as SLDS [44] and HMM [45] and on complex vision algorithms
[46,47,48,49]. Researchers have successfully tracked humans using either
geometrical or appearance models in constrained environments [50,51,52,53].
The work on low level tracking has mainly focused on tracking the head
pose [54,55,56] and articulated motion of fingers and the human arm [57,58,59,60]
with application to American sign language and gesture recognition. For
accurate tracking in 3D and handling of occlusion, range data and stereo
cameras have been used [61,62,63,64]. A recent trend has been to combine
learning techniques with the vision algorithms, exploiting the power of
both methods together to handle complex environments.
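As a rough illustration of localization from a time difference image, the sketch below thresholds the per-pixel change between two grayscale frames and returns the bounding box of the changed region. It is a minimal sketch, not a method from the cited work: the function name, the nested-list frame format and the threshold value are assumptions made for the example, and the color cue mentioned above is omitted.

```python
def localize_by_frame_difference(prev_frame, curr_frame, threshold=30):
    """Locate motion between two grayscale frames (lists of rows of ints).

    Returns the bounding box (top, left, bottom, right) of all pixels whose
    intensity changed by more than `threshold`, or None if nothing changed.
    The threshold is an illustrative assumption, not a tuned value.
    """
    changed_rows = []
    changed_cols = []
    for r, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for c, (before, after) in enumerate(zip(prev_row, curr_row)):
            if abs(after - before) > threshold:
                changed_rows.append(r)
                changed_cols.append(c)
    if not changed_rows:
        return None
    return (min(changed_rows), min(changed_cols),
            max(changed_rows), max(changed_cols))


# Hypothetical 4x5 frames: a bright blob appears in the second frame.
previous = [[0] * 5 for _ in range(4)]
current = [[0] * 5 for _ in range(4)]
current[1][2] = 200
current[2][3] = 200
box = localize_by_frame_difference(previous, current)
```

A practical system would work on camera images, suppress noise in the difference image and combine the result with a color model; the sketch shows only the differencing step.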
Voice analysis: Picard [65] indicates that emotion recognition
is an important step towards natural human computer interaction. Dellaert
et al. [66] classified emotions by using a technique called majority voting
of subspace specialists on features extracted from a smoothing spline
approximation of the pitch contour. Cowie et al. [67] found that variations
in the so-called augmented prosodic domain are emotive. Li et al. [68]
used short time features as well as long-term features and achieved 62%
accuracy on six emotions. Nakatsu et al. [69] applied their emotion recognition
algorithm to a computer agent that plays a character role in an interactive
movie system. Nakatsu et al. [69] built a system that can realize emotion
recognition and emotion synthesis by relating the physical features of
emotional speech to the emotional content through linear statistical methods.
Chen [41] achieved high recognition accuracy by using both audio and video
information. Current emotion recognition systems are not yet robust,
because training data is sparse and usually recorded in a single session.
Though video information can improve the recognition results, real time
tracking is still a difficult task.
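The idea of majority voting among specialists can be sketched as follows: several simple classifiers each look at one aspect of the pitch features, and the label with the most votes wins. This is only an illustration of the voting step. The three specialists, their feature names and their thresholds are hypothetical and are not taken from Dellaert et al. [66].

```python
from collections import Counter


def majority_vote(specialists, features):
    """Return (winning_label, votes) for a list of specialist classifiers.

    Each specialist is a function mapping a feature dict to a label.
    """
    votes = [specialist(features) for specialist in specialists]
    winning_label, _count = Counter(votes).most_common(1)[0]
    return winning_label, votes


# Hypothetical specialists, each using a single pitch-contour feature.
def by_mean_pitch(features):
    return "happy" if features["mean_pitch_hz"] > 200 else "sad"


def by_pitch_range(features):
    return "happy" if features["pitch_range_hz"] > 80 else "sad"


def by_pitch_slope(features):
    return "happy" if features["pitch_slope"] > 0 else "sad"


utterance = {"mean_pitch_hz": 220, "pitch_range_hz": 60, "pitch_slope": 0.5}
label, votes = majority_vote(
    [by_mean_pitch, by_pitch_range, by_pitch_slope], utterance
)
```

Using an odd number of specialists for a two-class problem avoids ties; with more classes a tie-breaking rule would have to be chosen explicitly.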
Eye movement recording: There is a long history of using eye movement
recording to study cognitive processes during reading [70,71], picture
viewing [72], and other tasks. Eye behavior is useful in studying issues
regarding both low-level perceptuo-oculomotor processes [73] and higher-level
syntactic, semantic and knowledge-based processes [74]. Unless people
specifically suppress normal eye movements, their eyes are directed toward
the part of the stimulus field to which they are giving attention [75]
and remain at that location for a period of time that is correlated with
the time required for processing the information [76]. People adopt unique
patterns of eye movements (oculomotor strategies) in different tasks [77,78]
and tend to show regular sequences that have been referred to as 'scan
paths' [79,80]. Thus, various aspects of eye behavior reflect and indicate
a number of different types of mental activity. Given this fact, eye movement
recording has recently been used to indicate what aspect of the stimulus
an individual is attending to at the moment (current work by A. Kramer),
and whether that person is acquainted with a particular person or object
[81], is encountering processing difficulty during reading [82], or is
examining the stimulus carefully or superficially. Applications to human-computer
interaction require the development of methods for classifying data from
individual subjects on individual episodes; this work is only beginning.
Fusing multi-modal input for state detection:
The automatic detection/recognition of human emotional and cognitive state
is an extremely challenging problem and is largely unexplored. As mentioned
earlier, preliminary work has been done in combining visual facial expression
and voice analysis in recognizing human emotion. However, human emotional
and cognitive states are manifested not only in facial expressions and
tone of voice but also very much in head, arm/hands, and body gestures,
and eye movement. The chance of successful automatic human state recognition
will be greatly enhanced if we take advantage of all these cues. Little
research has been done in this direction. The most relevant work is
the pioneering research of Cassell [83] and Quek [84].
Task Tracking:
Video analysis of complex structures: We intend to use computer vision
to monitor a subject's progress in a task, specifically to determine the
locations of individual Lego parts and the configuration of assemblies.
There has been tremendous progress in 3-D model-based computer vision,
which uses CAD-like geometric object models, for the recognition of polyhedral
[12,85,86,87,88,89,90,91] and curved objects [11,12,15,16,17,92,93,94] as well as for tracking the
3-D position and orientation of objects over time [90,95,96,97]. Alternatively,
appearance-based methods, which learn a representation from a collection
of training images, have recently been shown to be particularly effective
for objects with complex shape and reflectance [98,99,100,101]; however,
they tolerate only a limited amount of pose, illumination or shape variation,
and so these techniques are inappropriate for assemblies. Additionally,
there has been some work in recognizing and understanding assemblies,
particularly within teach-by-showing methods for robot programming [102,103,104],
though these methods cannot yet be considered mature technologies.
Guiding Computer Actions:
Computer and human learning: The learning field is so vast that we do
not attempt to present a literature review. Several members of our team
are experts in computer learning theory and its applications. In addition
to theoreticians (Levinson, Roth) the team includes experts in the key
domains of learning for natural language interactions (Levinson, Roth),
machine vision (Huang, Kriegman), pattern analysis (Huang), science learning
(Brown) and knowledge acquisition from text and pictures (McConkie).
Intelligent Dialogue Systems:
Dialogue systems: In recent years, significant improvements have been
achieved in commercial speech recognition systems. It is now practical
to build dialogue systems based on these recognition engines. Some experimental
spoken dialogue systems in various domains have been built [105,106,107,108,109].
These systems can interact with the user based on limited contexts. Given
the vast variety of human speech, the current research trend is to stochastically
model the dialogue process. This is reflected in the recent special
issue on language modeling and dialogue systems of IEEE Transactions on
Speech and Audio Processing. Levin et al. [110] proposed to model the
dialogue as a Markov decision process. Riccardi et al. [111] claimed that
the problem of dialog design can be formalized as an optimization problem
with an objective function. Siu et al. [112] applied the variable n-gram
design algorithm to conversational speech. Van Noord et al. [113] also
gave a robust grammar analysis algorithm for processing spoken input.
Interest in automated learning of dialogue schemes has grown recently
[110,114]. Current research efforts have also been directed toward improving
dialogue management through various fine-tuning techniques such as mixed-initiative
dialogue [115,116,117,118,119,120].
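To make the Markov decision process view of dialogue concrete, the sketch below solves a toy dialogue MDP by value iteration. It is an illustrative assumption throughout: the two dialogue states, the "ask" and "close" actions, the transition probabilities and the rewards are invented for the example and do not reproduce the model of Levin et al. [110].

```python
def q_value(outcomes, values, gamma):
    """Expected return of an action given (prob, next_state, reward) outcomes."""
    return sum(p * (reward + gamma * values[nxt]) for p, nxt, reward in outcomes)


def value_iteration(mdp, gamma=0.95, tol=1e-9):
    """Return (state values, greedy policy) for a small tabular MDP.

    `mdp[state][action]` is a list of (probability, next_state, reward);
    a state with no actions is terminal and keeps value 0.
    """
    values = {state: 0.0 for state in mdp}
    while True:
        delta = 0.0
        for state, actions in mdp.items():
            if not actions:
                continue
            best = max(q_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            break
    policy = {
        state: max(actions, key=lambda a: q_value(actions[a], values, gamma))
        for state, actions in mdp.items()
        if actions
    }
    return values, policy


# Hypothetical dialogue: the system needs one piece of information.
# Asking costs one turn and is understood 80% of the time; closing the
# dialogue is rewarded only if the information has been obtained.
dialogue_mdp = {
    "need_slot": {
        "ask": [(0.8, "have_slot", -1.0), (0.2, "need_slot", -1.0)],
        "close": [(1.0, "done", -10.0)],
    },
    "have_slot": {
        "ask": [(1.0, "have_slot", -1.0)],
        "close": [(1.0, "done", 10.0)],
    },
    "done": {},
}
values, policy = value_iteration(dialogue_mdp)
```

With these made-up numbers the resulting policy asks while the information is missing and closes once it is available, which is the kind of strategy an optimization-based formulation is meant to recover automatically.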
Creating Affective Displays:
Expressive voice: To achieve friendly man-machine communication, computers
need affective characteristics [65]. Generating emotional speech is one
way to accomplish this. One method is to use a commercial synthesizer
and adjust the speech features related to emotion. Researchers at the University
of Dundee, UK [121,122,123], and at MIT [124] have achieved quite good
results using this approach with DECtalk, in terms of emotion identifiability.
Many other researchers have also tried to modify acoustic features such as
pitch and intensity using non-commercial synthesizers [125,126,127,128,129].
Approaches taken by some Japanese researchers are interesting in that
their methods can be used in both synthesis and recognition [130,131].
Facial expressions: Producing facial expressions is another way to achieve
affective displays. Early research on facial expression synthesis used linear
interpolation between pre-digitized sculptures of a face with various
expressions [132,133]. The parametric approach [134] produced face animation
by controlling a set of parameters. Based on linguistics and psychological
studies, Pelachaud et al. [135] used a set of rules to link the intonation,
emotion and facial animation. Terzopoulos et al. [136] used deformable
contour models to track the non-rigid motions of facial features and estimated
the muscle contractions that are utilized for face animation. Essa et
al. [137] used optical flow measurements to extract facial action parameters
from images for face animation. Hong et al. [138] combined the eigen mouth
shapes and motion curves, which are obtained by analyzing video sequences,
for synthesizing an expressive talking mouth. Performance-based approaches
[139,140] directly animated a face model using the tracking results of
the points on a live actor's face without analysis. Brand [141] trained
an HMM to map audio features to expressive face animation. Appearance-based
methods are often very effective [142,143,144].
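The linear interpolation used in early expression synthesis can be sketched as a per-vertex blend between two face shapes. This is a minimal sketch under stated assumptions: the function name, the representation of a face as a list of (x, y, z) vertices and the sample coordinates are hypothetical, not details from [132,133].

```python
def interpolate_expression(neutral, target, t):
    """Blend two face shapes vertex by vertex.

    `neutral` and `target` are equal-length lists of (x, y, z) tuples;
    t = 0 gives the neutral face and t = 1 gives the target expression.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(neutral) != len(target):
        raise ValueError("face shapes must have the same number of vertices")
    return [
        tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
        for p, q in zip(neutral, target)
    ]


# Hypothetical two-vertex "mouth corner" shapes.
neutral_face = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
smiling_face = [(0.0, 1.0, 0.0), (1.0, 2.0, 0.0)]
half_smile = interpolate_expression(neutral_face, smiling_face, 0.5)
```

Stepping t from 0 to 1 over successive frames produces a simple transition from the neutral face to the target expression.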
Testbed: Science Education:
Hands-on science education: Recent science and mathematics education policy
documents summarize the broad consensus among science educators that students
need to be engaged in thoughtful consideration of hands-on interactions
[145,146,147]. Such thoughtful hands-on manipulations help students construct
understandings (rather than just absorbing often meaningless information) from
interactions with phenomena, symbolic representations, and other people
[148,149,150,151,152]. This holds particularly true for populations traditionally
underrepresented in science, such as females and minorities,
who often have fewer out-of-class hands-on experiences [153,154,155,156].
One important class of hands-on activities is interaction with "construction
kits" [157] such as Lego-Logo, which can help students meaningfully
interact with phenomena and ideas of science, mathematics, and technology
[158,159,160,161,162].
Corpus Development:
Video analyses of interactions have as their goal the identification of
regularities or stabilities in interactions and how these stabilities
influence and are influenced by the evolution of the interaction. These
analyses can focus on observable aspects of interactions [163,164,165,166,167],
they can assist in the development of grounded models of unobservable
processes such as thinking [169,170,171,172,173], and in some cases both
observable and unobservable processes can be the focus [174]. Identification
and articulation of these regularities can then assist in developing instructional
interactions that take these previously often unseen stabilities into
account [175,176,177,178,179,180,181,182,183,184].