Decision Theoretic Learning of Human Facial Displays and Gestures

by Jesse Hoey

B.Sc. Physics, McGill University, 1992
M.Sc. Physics, University of British Columbia, 1995

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
December 2003
© Jesse Hoey, 2003

Abstract

We present a vision-based, adaptive, decision-theoretic model of human facial displays and gestures in interaction. Changes in the human face occur due to many factors, including communication, emotion, speech, and physiology. Most systems for facial expression analysis attempt to recognize one or more of these factors, resulting in a machine whose inputs are video sequences or static images, and whose outputs are, for example, basic emotion categories. Our approach is fundamentally different. We make no prior commitment to any particular recognition task. Instead, we consider that the meaning of a facial display for an observer is contained in its relationship to actions and outcomes. Agents must distinguish facial displays according to their affordances, or how they help an agent to maximize utility. To this end, our system learns relationships between the movements of a person's face, the context in which they are acting, and a utility function. The model is a partially observable Markov decision process, or POMDP. The video observations are integrated into the POMDP using a dynamic Bayesian network, which creates spatial and temporal abstractions amenable to decision making at the high level. The parameters of the model are learned from training data using an a-posteriori constrained optimization technique based on the expectation-maximization algorithm. The training does not require labeled data, since we do not train classifiers for individual facial actions and then integrate them into the model. Rather, the learning process discovers clusters of facial motions and their relationship to the context automatically. As such, it can be applied to any situation in which non-verbal gestures are purposefully used in a task. We present an experimental paradigm in which we record two humans playing a collaborative game, or a single human playing against an automated agent, and learn the human behaviors. We use the resulting model to predict human actions. We show results on three simple games.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
  1.1 Non-Verbal Displays
  1.2 Model Design
  1.3 Model Structure
  1.4 Using the Model
  1.5 Experiments and Results
  1.6 Contributions
  1.7 Thesis Outline

2 The Study of Faces and Gestures
  2.1 Psychology of Facial Expression
  2.2 Describing Human Motion with Computer Vision
    2.2.1 Representing the Human Body in Video
    2.2.2 Classification
    2.2.3 Unsupervised Classification
  2.3 Purposeful Modeling of Human Action

3 Modeling Facial Displays
  3.1 Zernike Polynomial Basis Functions
  3.2 Modeling Facial Dynamics
    3.2.1 Optical Flow
    3.2.2 Projections of Optical Flow Fields
    3.2.3 Probabilistic Projections
    3.2.4 Feature Weighting
    3.2.5 Experiments with Probabilistic Projections
  3.3 Modeling Facial Configuration
  3.4 Lighting Changes
  3.5 Temporal Modeling
  3.6 Temporal Abstraction
    3.6.1 Context Dependent Mixtures of Hidden Markov Models
    3.6.2 Markov Chains of Mixtures of Hidden Markov Models
    3.6.3 Context Dependent 4MGs
    3.6.4 Temporal Segmentation
  3.7 Tracking
    3.7.1 Tracker: Procedural Description
    3.7.2 Tracker: Bayesian Model

4 Learning Models of Facial Displays
  4.1 Expectation Maximization
  4.2 Clustering Individual Flow Fields
  4.3 Clustering Individual Poses
  4.4 Learning Temporal Models
  4.5 Mixtures of Hidden Markov Models (3MGs)
  4.6 Context Dependent Mixtures of Hidden Markov Models (C3MGs)
  4.7 Context Dependent Markov Chains of Mixtures of Hidden Markov Models (C4MGs)
  4.8 Parameter List
  4.9 Initialization
  4.10 Software Implementation Details
    4.10.1 Numerical Scaling

5 Facial Displays in Games
  5.1 Decision-Theoretic Foundations
    5.1.1 Fully Observable MDPs
    5.1.2 Factored Representations and Structured Representations
    5.1.3 Partially Observable Markov Decision Processes
    5.1.4 Learning POMDPs
  5.2 POMDPs for Non-Verbal Displays in Games
  5.3 Value-Directed Structure Learning
  5.4 Experimental Design
  5.5 Imitation Game
  5.6 Hand Gestures for Robot Control
  5.7 Card Matching Game
    5.7.1 Bidding Round
    5.7.2 Displaying Round
    5.7.3 Combining MDPs

6 Experiments
  6.1 Imitation Game
    6.1.1 Clustering Flow Fields
    6.1.2 Clustering Images
    6.1.3 Clustering Display Sequences
    6.1.4 Inferring Agent Actions
  6.2 Robot Control Gestures
  6.3 Card Matching Game
    6.3.1 Learning a POMDP
    6.3.2 Structure of Learned Model
    6.3.3 Policy of Action
    6.3.4 Using the Policy
    6.3.5 Symmetry Considerations
    6.3.6 Output Models

7 Conclusions
  7.1 Contributions
  7.2 Future Work
    7.2.1 Modeling Human Non-Verbal Behaviors in Context
    7.2.2 Optimal or Approximate High-Dimensional POMDP Solutions
    7.2.3 Unification of Decision Theory and Computer Vision for Embodied Human-Interactive Agents

Bibliography

Appendix A Proof of Equation (4.3)
Appendix B Tracking Updates
Appendix C Update Equations for Mixture Model
Appendix D Estimating C4MG Model Parameters
  D.1 Estimating State
  D.2 C4MG Parameter Updates

List of Tables

4.1 Parameter list
5.1 Variables and action for Bob in the card matching game
5.2 Conditional probability distributions to be learned
5.3 Actions during displaying round
6.1 Probability distribution learned for subject A
6.2 Confusion matrices and success rates
6.3 Confusion matrices and success rates (best two)
6.4 Confusion matrices and success rates (supervised)
6.5 Log for the card matching games
6.6 Log for card matching training game
6.7 Log for card matching test game

List of Figures

1.1 Interaction between two humans as a POMDP
1.2 Schematic overview of model for describing video sequences
2.1 Major research areas for modeling of facial expressions and gestures
2.2 Computer vision approach to recognising human motion
3.1 Example flows generated from ZPs
3.2 Reconstructing flow fields from Zernike projections
3.3 Bayesian network for the mixture of Gaussians over optical flow fields
3.4 Example of probabilistic projection advantage
3.5 Synthetic images and flows
3.6 Yosemite sequence results
3.7 Example images generated from ZPs
3.8 Reconstructing images from Zernike projections
3.9 Bayesian network for the mixture of Gaussians over images
3.10 Erroneous flow caused by large lighting change
3.11 Erroneous flow caused by small lighting change
3.12 DBN for modeling of pose and dynamics
3.13 Mixture of coupled hidden Markov models
3.14 Simplified mixture of HMMs
3.15 Context dependent mixture of HMMs
3.16 Markov chain of mixtures of HMMs (4MG)
3.17 Context dependent 4MG
3.18 Procedural description of the tracking process
3.19 Tracker failure and recovery example (1)
3.20 Tracker failure and recovery example (2)
3.21 DBN for tracking and recognition
4.1 Bayesian network for the mixture of Gaussians over optical flow fields
4.2 Bayesian network for the mixture of Gaussians over images
4.3 DBN for modeling of pose and dynamics
4.4 Mixture of coupled hidden Markov models
4.5 Context dependent 4MG
5.1 Markov decision process as a Bayesian network
5.2 Partially observable Markov decision process as a Bayesian network
5.3 Graphical model of two agents in a game
5.4 Refined graphical model of two agents in a game
5.5 Procedure for value-directed structure and parameter learning for POMDPs
5.6 Faces which subjects were told to imitate
5.7 Time slice of graphical model of simple imitation game
5.8 POMDP for robot control gestures
5.9 Game interfaces for two players during a typical round
5.10 Game interfaces for two players during a typical round (2)
5.11 Two time slices of Bob's graphical model for card matching game
5.12 Two time slices of combined POMDP for the card matching game
6.1 Feature weights learned for facial expression data
6.2 Clustering result for facial expressions along two most relevant features
6.3 Eyebrow raising classified as states 2 and 4
6.4 Smiling event classified as state 3
6.5 Eyebrow lowering event classified as state 6
6.6 Smile returning to neutral classified as state 5
6.7 Feature weights learned for facial expression data
6.8 Clustering result for facial expressions
6.9 Smiling pose classified as state 6
6.10 Return from smiling pose classified as state 4
6.11 Feature weights for model 1 dynamics and configuration chains
6.12 Dynamics chain model 1 output states
6.13 Configuration chain model 1 output distributions
6.14 Trajectory of dynamics feature vector for sequence 13
6.15 Trajectory of configuration feature vector for sequence 13
6.16 Example of smiling motion
6.17 Example of smiling motion
6.18 Feature weights model 2
6.19 Dynamics chain model 2 output states
6.20 Configuration chain model 2 output distributions
6.21 Trajectory of dynamics feature vector for sequence 11
6.22 Trajectory of configuration feature vector for sequence 11
6.23 Example of eyebrows raising motion
6.24 Example of eyebrows raising motion
6.25 Example of eyebrows raising motion
6.26 Six-state value function and policy
6.27 Four-state merged value function and policy
6.28 4-state gesture model's explanation of a stop gesture
6.29 4-state gesture model's explanation of a stop gesture (cont)
6.30 4-state gesture model's explanation of a forwards gesture
6.31 4-state gesture model's explanation of a forwards gesture (cont)
6.32 Portion of value function for card matching game
6.33 Portion of policy function for card matching game
6.34 Conditional probability distribution over Ann's action
6.35 Conditional probability distribution over Ann's display
6.36 Policy of action for card matching game
6.37 Conditional probability distribution over Ann's display
6.38 Symmetrised conditional probability distribution over Ann's action
6.39 Symmetrised conditional probability distribution over Ann's display
6.40 Symmetrised policy of action for card matching game
6.41 Feature weights model 2
6.42 Dynamics chain model 2 output states
6.43 Configuration chain model 2 output distributions
6.44 Trajectory of most likely dynamics feature vector for sequence 6
6.45 Trajectory of most likely configuration feature vector for sequence 6
6.46 C4MG explanation of nodding motion
6.47 C4MG explanation of nodding motion (cont)
6.48 Feature weights model 3
6.49 Dynamics chain model 3 output states
6.50 Configuration chain model 3 output distributions
6.51 Trajectory of most likely dynamics feature vector for sequence 5
6.52 Trajectory of most likely configuration feature vector for sequence 5
6.53 C4MG explanation of shaking
6.54 C4MG explanation of shaking (cont)
C.1 Bayesian network for the mixture of Gaussians over optical flow fields
D.1 DBN for modeling of pose and dynamics
D.2 Context dependent 4MG

Acknowledgements

Elite Member: Catherine Dat, Catherine Hoey, John Hoey, Dylan Hoey
Board Member: Jim Little, David Lowe, David Poole, Cristina Conati
Associate Board Member: Sid Fels, Michiel van de Panne, Stan Sclaroff
Preferred Member: Don Murray, Pantelis Elinas, Nicole Arksey, Pascal Poupart, Giuseppe Carenini, Robert St-Aubin, Andrea Bunt, Kenji Okuma, Peter Carbonetto, Kasia Muldner, Jochen Lang
Star Contributor: Matthew Wilkins, Bruce Dow, Luc Dierckx, Dave Brent, Mike Sanderson, Sean Godel, Chris Majewski
Diamond Contributor / Honorary Board Member: Craig Boutilier, Alan Hu, Ron Rensink, Enrique Sucar-Sucar, Nando de Freitas, Jack Snoeyink
Financial Donor: Valerie McRae, Joyce Poon, Giuliana Villegas, Sunnie Khuman, Monica Glaboff, NSERC, UBC, IRIS/Precarn, Nissan

Jesse Hoey
The University of British Columbia
December 2003

Chapter 1 Introduction

This thesis describes a model of human gestures and facial expressions that unifies computer vision, uncertain reasoning, and decision theory. The motivation is that, since humans use non-verbal signals in communication, computational agents that interact with humans will need capabilities for learning, recognising and using the extensive panoply of human non-verbal communication skills. The verb use is important: non-verbal signals are useful. For example, they may predict the future actions of the signaler, or they may request actions from the perceiver. The perceiver of a non-verbal signal must not only recognise the signal, but must understand what it is useful for.
The signal's usefulness will be defined by its relationship to the joint space of both signaler and receiver, their joint actions, their possible futures together, and their individual ways of assigning value to these futures. The model we present in this thesis aims to unify the computer vision aspects of automatically perceiving human non-verbal behaviors with the decision-theoretic aspects of putting those perceptions to use. The ability to explicitly reason about uncertainty plays an important role in this unification. If an agent is to take decisions based upon noisy visual data, then the agent must explicitly model its own uncertainty about its perceptions. Bayesian networks are the ideal tool for modeling uncertainty in video measurements and in decision-theoretic models, providing a solid theoretical basis for these models to co-exist. All the models presented in this thesis are Bayesian networks (with utility).

We claim that it is important not to separate computer vision from decision theory when modeling human behavior. It is not sufficient to build computer vision sensors that deliver information to decision-theoretic reasoning modules, which then decide upon actions. Recognition and action are too tightly coupled for this kind of one-way communication. Decision making is difficult, and is best dealt with at a high level of abstraction. Humans seem capable of acting in the world based upon abstract representations of their sensory inputs. If these abstract representations are sufficient for decision making, then there cannot be more useful information in the stimuli than there is in the abstractions. Therefore, an efficient perceptual system (which is presumably what humans have evolved to have) will be able to adapt its low-level representation system to pick up only those parts of the signal that are sufficient for building the high-level abstractions. It is important for low-level vision components not only to send, but also to receive information from high-level decision-making components. We believe that a unified model of the two aspects is the most effective method for approaching this problem.

This thesis presents a vision-based, adaptive, Bayesian model of human facial displays and gestures. The model is, in fact, a partially observable Markov decision process, or POMDP, with observations over the space of video sequences of human facial displays and gestures. The POMDP model integrates the recognition of the non-verbal signals with their interpretation and use in a utility-maximization framework. To ease the burden on decision making, the model builds temporal and spatial abstractions of the input video data. The model can be acquired from data, and can be used for decision making based, in part, on observation of a human's non-verbal behavior. However, optimal decision making in the face of large output spaces is still an open problem, which this thesis does not attempt to solve. Instead, we apply some simple approximate solution techniques to compute policies of action based on non-verbal displays. These approximate policies work fairly well in the examples we present, but would be insufficient in more complex situations. Finding better approximate solutions is part of our current research.

Our work is distinguished from other work on recognising human non-verbal behavior primarily because it does not require labeling of training data for particular facial displays or gestures.
We do not train classifiers for different behaviors and then base decisions upon the classifier outputs. Instead, the model training process discovers categories of behaviors in the data. This discovery is based, in part, upon which non-verbal displays are important for making value-directed decisions. The advantage of this approach is threefold. First, we do not need expert knowledge about which displays are important. Second, since the system learns categories of motions, it will adapt to novel displays without modification, and without a-priori specification of said displays. Third, resources can be focused on tasks which will be useful for the agent. It is wasteful to train complex classifiers for the recognition of fine motion if only simple displays are being used in the agent's context.

In contrast, psychologists have advocated the use of coding systems for the description of facial action. In particular, the Facial Action Coding System (FACS) [EF78] has become the standard for psychological inquiries into facial expression. Computer vision researchers have also adopted it as the standard to strive towards [EP97, TKC01, BLB+03]. However, although the recognition of facial action units may give the ability to discriminate between very subtle differences in facial motion, or "microactions", it requires extensive training and domain-specific knowledge. For example, Bartlett et al. write that it would require approximately 250 minutes, or nearly half a million hand-coded frames of video, to train their system for most facial action units [BLB+03], while their current data set contains only 17 minutes of coded video frames. Notwithstanding this hurdle, which will surely be cleared eventually, these systems are still not able to accurately distinguish action units from each other in spontaneous video. The current state of the art is only to distinguish brow lowering from random facial motion 75% of the time [BLB+03] with fewer than 20 training examples. Their system needs over 50 labeled training examples to achieve recognition rates over 90%. As Pentland has pointed out, the importance of such a fine level of representation is not clear for computer vision systems that intend to take actions based on observations of the human, and the action unit analysis may only be useful for the behavioral sciences [Pen00]. Finally, the same type of analysis is not applicable to gestures or body motion, as these do not have well-defined standards of form [McN92]: it will be difficult to come up with a small set of "basic" action units which span the space of all possible gestures.

The next section gives a basic introduction to human facial displays from a psychology perspective. Psychology provides guidelines which we use to design a model of facial displays and gestures for decision making in Section 1.2. Section 1.3 then gives an overview of our model. Section 1.4 describes how the model could be used for decision making in an autonomous agent. Section 1.5 describes the experiments we have performed. Contributions of the thesis are given in Section 1.6. Finally, Section 1.7 gives an outline of the main body of the thesis.

1.1 Non-Verbal Displays

Communication plays a critical role in life. Be it a conversation, a business transaction, the writing of a letter, or a choice of dress, the diversity of the human act of communication is clearly evolutionarily dominant, allowing for collusion and collaboration [Fri94]. Although the auditory-speech language channel appears to carry the most weight, communication occurs in multiple channels, adding the generation and visual interpretation of gesture and facial expressions to speech and hearing. Gestures and other actions complement speech, adding information and increasing effect. Aristotle, in Rhetoric, points out ([Ari], Book II, Chapter 8):

  ... those who heighten the effect of their words with suitable gestures, tones, dress, and dramatic action generally, are especially successful ...
Although the auditory-speech language channel appears to carry the most weight, communication occurs in multiple channels, adding the generation and visual interpretation of gesture and facial expressions to speech and hearing. Gestures and other actions complement speech, adding information and increasing effect. Aristotle, in Rhetoric, points out ( [Ari], Book II, Chapter 8): . . . those who heighten the effect of their words with suitable gestures, tones, dress, and dramatic action generally, are especially successful . . . 3 Falling under the umbrella of non-verbal communication, or gesture, the human face, hands and body all perform communicative acts, which have been ex-tensively practiced [dJ32], and studied [Dar72, ER97, Fri94, McN92]. The creation and interpretation of signs has clear evolutionary advantages [Fri94], and these abil-ities are deeply rooted in the history of the human species, far before the evolution of spoken language. The perception and use of signs, and in particular facial ex-pressions, develops in human children at a very young age. Three month old babies can generate and recognize smiles [RFD97a]. The antecedence of facial expression learning to spoken language acquisition in individual humans seems to mirror the evolutionary development of non-verbal to verbal signals in the human species. The advantages of non-verbal communication are rapidity and ease of in-terpretation. Displays of aggression in males, for example, do not require higher intellectual processing, and can be accomplished more effectively with simple ges-tures or sounds. A more complex example is the initiation of conversation between two humans, a joint action which requires a great deal of synchrony. In fact, timing in conversations in general is critical, yet seems effortless for adult humans. At-tempting to negotiate such synchrony using spoken utterances would be costly and slow. Instead, the face, gaze and body posture are used as rapid and effective timing signals. For example, the raising of hands by a listener is a signal that the listener wants to say something, and so is asking the speaker to give her the floor [CasOO]. A speaker who lowers their eyes during a pause in a conversation may be indicating that they still wish to hold the floor, and need only a moment to figure out what to say next [Cho91]. Another well known example is the back-channel display of acknowledgment, which is used by a listener to assure the speaker that their message to date has been properly received, without causing an interruption. The multiple channels used by humans in their communication also has evo-lutionary advantages, since they allow for meaning to be more richly conveyed, opening the door to more complex messages. Hand gestures in humans, for ex-ample, may complement speech in a global way, adding an overall context to an utterance which could not be simultaneously and synchronously achieved through language alone [McN92]. Hand gestures may also be used to abstract concepts, such as imperatives. For example, the spoken instruction to "put it down on the table beside the record player" can be summarized as "put it there", if accompanied by a pointing gesture to the table beside the record player. The increased efficiency this allows is typically cited as one of the primary evolutionary advantages of multi-modal communication. Many animals use multiple modalities for communication. A defensive feline does not simply hiss, but raises her hair and tail, and arches her back. 
In fact, even spiders use vibration, vision and chemical senses for multimodal communication [UR02].

There has been a growing body of work in the past decade on the communicative function of the face [Fri94, RFD97b] and of the hands [McN92]. This psychological research has drawn three major conclusions. First, non-verbal gestures are often purposeful communicative signals [Fri94]. Second, the purpose is not defined by the gesture alone, but is dependent on both the gesture and the context in which the gesture was emitted [RFD97b, Cho91]. Third, the signals are not universal, but vary between individuals in their physical appearance, their contextual relationships, and their purpose [RFD97b]. We believe that these three considerations should be used as critical constraints in the design of computational communicative agents able to learn, recognise, and use human facial signals.

1.2 Model Design

Psychologists have found that gestures and facial displays are context dependent, are often used for a purpose, and are individually variable. Therefore, the design of a communicative agent that uses non-verbal displays must include the following constraints:

1. Context dependence implies that the agent must model the relationships between the displays and the context in which they are shown.

2. If the agent is to act rationally, then it must be able to compute the utility of taking actions in situations involving purposeful non-verbal displays. It must understand the relationships between the displays, the context, and its own utility function.

3. The signals are individual and context dependent, and so the agent needs to adapt to new interactants and new situations, by learning new relationships between non-verbal signals and other context.

These constraints can be integrated in a decision-theoretic, vision-based model of human non-verbal signals. The model can be used to predict human behavior, or to choose actions which maximize expected utility. We present this model using the example interaction shown in Figure 1.1. Charlie (on the left) is trying to decide upon an action to take. He is considering stealing a cake from the cook's tray, but is watching the cook's behavior. He is trying to optimise over the outcomes that will occur as a result of his actions. However, he is not in complete control of the outcomes, as they also depend on the cook's actions. Nevertheless, he can observe the non-verbal actions of the cook, and use these to predict what the cook's actions will be, and what the resulting outcomes are. In general, the outcomes depend on both persons' actions, and on other contextual factors, such as the plate of cakes on the counter (to see what the outcome was, visit www.cs.ubc.ca/~jhoey/dtfaces).

Figure 1.1: Interaction between two humans as a partially observable Markov decision process (POMDP).

The dependencies shown in Figure 1.1 are those of a partially observable Markov decision process, or POMDP [KLC98]. A POMDP describes the effects of an agent's actions upon its environment, the utility of states in the environment, and the relationship between the observations, the actions and the states. The Markovian assumption is that the agent's history is contained in its current state. A POMDP model allows an agent to predict the long-term effects of its actions upon its environment, and to choose valuable actions based on these predictions.
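To make the POMDP machinery concrete before it is put to use below, here is a minimal belief-update sketch in Python. All names and the toy numbers are illustrative assumptions, not the thesis implementation: an agent that cannot observe the state directly maintains a distribution (belief) over states, and revises it after each action and observation.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One POMDP belief update: b'(s') ∝ O[a][s', o] * sum_s T[a][s, s'] * b(s).

    b : (S,) current belief over hidden states
    a : action index;  o : observation index
    T : (A, S, S) transition model, T[a, s, s'] = P(s' | s, a)
    O : (A, S, S_obs) observation model, O[a, s', o] = P(o | s', a)
    """
    predicted = T[a].T @ b            # predict: sum_s P(s'|s,a) b(s)
    unnorm = O[a][:, o] * predicted   # weight by likelihood of the observation
    return unnorm / unnorm.sum()      # renormalize to a distribution

# Tiny hypothetical example: two hidden states ("cook watching", "cook distracted"),
# one action ("wait"), two observations ("frown", "smile").
T = np.array([[[0.9, 0.1], [0.3, 0.7]]])   # P(s' | s, wait)
O = np.array([[[0.8, 0.2], [0.1, 0.9]]])   # P(o | s', wait)
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, T=T, O=O))  # belief after seeing a smile
```

A policy then maps such beliefs to actions; computing good policies for large observation spaces is exactly the open problem noted above.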
The POMDP framework addresses the design constraints above in the following ways:

1. The observations of the POMDP are the displays of a human partner in a video stream. The state of the POMDP is the context in which the displays are emitted, and thus the POMDP models the relationship between displays and context. As psychologists tell us, the displays are only meaningful when analysed in context. Their meaning, then, is this relationship between displays and context modeled by the POMDP.

2. A utility function is part of a POMDP, and allows the agent to make decisions leading to valuable states based, in part, upon the "meaning" of displays as contained in the model.

3. POMDPs can be learned from data, allowing for adaptivity. Our work is novel in this regard in that it does not use any prior knowledge about the kinds of facial displays that are expected, but rather builds classes of facial displays from unlabeled data. The models we learn discover the ways in which the learned classes of non-verbal signals are related to the expected utility of the agent's actions. This type of learned model opens the door to modeling only those aspects of the signal which are important to the agent. It is not necessary to learn models of all possible facial displays, but only those whose recognition is useful for achieving value.

Rational agents must rapidly make value-directed decisions about actions based on sensory information, and so must have some system for assigning value to states of the world. Decision making is expensive, however, and raw video inputs are far too information-intensive for any hope of tractability. The agent must construct abstract representations of the world over which a resource-bounded decision-making process can occur. In a POMDP, these abstractions are the non-observable states which describe facial displays at a high level. For example, one such state may correspond to the wink of an eye, whereas another may correspond to a smile. These abstract, high-level states condition the raw video signals using a multi-level dynamic Bayesian network which incorporates both spatial and temporal abstraction. This model is summarized in the next section.

There will be little discussion of speech recognition and natural language understanding in this thesis. Our work focuses solely on recognizing and using non-verbal communicative acts. However, it is well known that gesture and facial expression are intimately tied to speech [McN92, CBC+00], and one might object to our omission. Nevertheless, research has shown that gestures take place globally and synthetically, as opposed to language, which is linear-segmented (can be broken down into individual units) [McN92]. Further, semantic gestures do not combine to form larger gestures, but remain one to a clause. Finally, gestures and facial expressions vary from person to person [McN92, Cho91]. These findings give us good reason to keep speech and gesture recognition at a distance, unifying them at some higher level of temporal abstraction. That is, it does not make sense to try to correlate a particular phoneme with one 30ms snapshot taken during a gesture. We must instead seek to correlate some description of the entire gesture with the recognized words or concepts conveyed by the language. Our Bayesian model shows how we can classify sequences of motions in video, and how to learn correlations between those classifications and modeled data from other modalities.
With the important assumption of independence at the lowest levels, we can easily incorporate modeled data from any other sensor, so long as it can be described by a set of discrete states.

1.3 Model Structure

This thesis presents a unified Bayesian model of these vision-based aspects of non-verbal display recognition and understanding. The model is a multi-level dynamic Bayesian network, the parameters of which can be learned from unlabeled training data. We consider that the context and the raw video signal are observable, but the model is "free" to come up with an internal intermediate representation to map between fast, continuous input signals and slowly changing, discrete context signals. The parameters are learned using an a-posteriori optimization technique based on the expectation-maximization (EM) algorithm [DLR77]. The model also includes utility, and can be used to compute optimal or approximate policies of action. In turn, these computations can guide the classification of the signals towards those which are useful for achieving value. The model performs both spatial and temporal abstraction in order to alleviate the burden on the decision process.

Figure 1.2 shows a schematic overview of the model. We will describe it here using facial displays as an example, but it can be applied equally well to gestures.

Figure 1.2: Schematic overview of the POMDP and its output function for creating spatially and temporally abstract descriptions of video sequences. Actions and context are shown at the top (the utility function is not shown explicitly). The observation (in this case facial expression) is a video sequence shown at the bottom. The intermediate levels spatially and temporally abstract the video.

At the top of Figure 1.2 we see the same model shown in Figure 1.1, with outcomes depending on actions, context, and high-level descriptors of non-verbal displays. At the bottom, we see the input video sequence. The goal is to learn a mapping between the video sequence and the abstract, high-level descriptor of the display. This means that we want to compute the likelihood of observing an input video sequence, given this high-level descriptor: P(video|display).

We consider that analyzing human facial displays from video streams involves modeling both the current configuration or pose of the face and the way the face is currently moving, or the dynamics of the face. Dynamics and configuration complement one another, and are akin to momentum and position in classical dynamics. They can both be useful in describing what a human face does. The instantaneous pose of the face is given by the image region, I, containing the face. The region is given by an independent tracking process. The instantaneous dynamics of the face is given by the temporal derivative over the region of interest, dI, between successive images. Thus, video = {I, dI} = {I_1, ..., I_T, dI_1, ..., dI_{T-1}}, and we seek to compute P({I, dI}|display). We use boldface symbols to denote sets of variables.

We factor the likelihood computation of P({I, dI}|display) into three terms:

    P({I, dI}|display) = Σ_{Z,Y} P(I, dI|Z) P(Z|Y) P(Y|display),        (1.1)

where each Z is a continuous feature vector representing both the instantaneous dynamics and pose of the face, but in a much lower-dimensional space than the input videos and derivatives. The continuous vectors, Z, are then discretized through P(Z|Y) to a set of multinomial variables, Y, which are high-level descriptors of instantaneous pose and motion over the course of the sequence.
Finally, these high-level, discrete variables are temporally abstracted to the display descriptor through the mapping P(Y|display). The primary contribution of this thesis is a method for analytically computing the likelihood in Equation (1.1) given a parametrization of the three distributions, and for learning the parameters of these distributions from unlabeled training data. We give a brief description of each of the three terms in Equation (1.1). The details can be found in Chapter 3.

P(I, dI|Z): The images and temporal derivatives are considered independent given the continuous feature vectors at each time step, and independent of one another given their continuous descriptors. Thus,

    P(I, dI|Z) = ∏_t P(I_t|Z_{I,t}) P(dI_t|Z_{dI,t}).

The likelihood functions, P(I_t|Z_{I,t}) and P(dI_t|Z_{dI,t}), are computed using a projection to a set of a-priori basis functions. Using pre-determined basis functions defers any commitment to particular types of motion to higher levels of processing, without affecting computational efficiency. We use the complete and orthogonal basis of Zernike polynomials [vZ34], which have useful properties for modeling flow fields [HL00] and images [TC88]. The images are projected directly to the Zernike basis functions. Projections of temporal derivatives are slightly more complex, since they do not characterise motion directly. Instead, the projection is done through the intermediary of the (unknown) optical flow field, which is probabilistically related to the derivative. One of the main contributions of this thesis is an analytical method for evaluating the probability distribution over this projection. This is important, as the flow fields include noise which is data dependent: it varies spatially across the image according to the local structure. The method we present describes how to account for this noise when computing the likelihoods of spatial derivatives given high-level classes [HL03]. Since we do not know which of these basis functions will be useful for any particular task, we use a feature weighting technique that works by placing priors on the cluster means, such that basis functions which discriminate between classes are weighted more highly.

P(Z|Y): Again, we assume the continuous features are independent given their discrete counterparts, such that

    P(Z|Y) = ∏_t P(Z_{I,t}|W_t) P(Z_{dI,t}|X_t),

where Y_t = {X_t, W_t}. The likelihoods of continuous features given discrete descriptors are modeled with mixtures of Gaussian distributions, such that P(Z_{I,t}|W_t) = N(Z_{I,t}; μ_W, Λ_W) and P(Z_{dI,t}|X_t) = N(Z_{dI,t}; μ_X, Λ_X), where μ_W and Λ_W (μ_X and Λ_X) are the mean and covariance, respectively, of the Gaussian distribution for a particular value of W (X). States of X and W correspond to classes of instantaneous motion and pose, respectively. For example, during a smiling display, classes of X may correspond to smile expansion or smile relaxation motions, while classes of W may correspond to smiling or neutral poses.
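As a minimal illustration of the P(Z|Y) term just defined, the following sketch (all function names are ours, and the example is hypothetical) evaluates the Gaussian class-conditional likelihoods N(Z; μ, Λ) and the resulting posterior over the discrete motion classes X for one feature vector Z:

```python
import numpy as np

def log_gaussian(z, mu, cov):
    """log N(z; mu, cov) for one feature vector z of dimension d."""
    d = z.shape[0]
    diff = z - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)        # Mahalanobis distance
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def state_posteriors(z, mus, covs, prior):
    """P(X = x | Z = z) for each discrete motion class x, by Bayes' rule."""
    logp = np.array([log_gaussian(z, m, c) for m, c in zip(mus, covs)])
    logp += np.log(prior)
    logp -= logp.max()                              # numerical stability
    p = np.exp(logp)
    return p / p.sum()
```

In the model, the same computation is carried out for the pose features (with the W classes); the winning classes at each time step form the sequence Y discussed next.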
P(Y|display): This likelihood involves temporal abstraction. Recall that Y is a sequence of discrete motion and pose descriptors, which independently characterize the instantaneous dynamics and configuration in the input video. When used to describe facial motion, they may represent a smiling facial pose, or motions observed during contraction of the eyebrows. On the other hand, display is a high-level display descriptor. For example, one display may describe sequences in which a person's face expands into a smile; another may describe the contraction of the face into a frown, and the subsequent relaxation back to a neutral pose. The mapping from sequences of instantaneous motions and poses to high-level descriptors of facial display sequences is achieved with a mixture of coupled hidden Markov models (CHMMs). A CHMM is a dynamic Bayesian network which couples two Markovian processes over the high-level descriptors of dynamics and configuration, X and W. The CHMM models the temporal progression of both motion and pose descriptors, as well as the interaction between them. The similarity between the two output distribution models over dynamics and configuration makes this model elegant and simple to describe, yet powerful in its ability to model facial displays. Each display state has a separate CHMM, combined in a mixture model. The distribution over high-level displays can be computed using belief propagation [Pea88] in each CHMM.
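The following sketch illustrates the idea behind this mixture classification, using ordinary (uncoupled) HMMs rather than the CHMMs of the thesis for brevity: each display class owns one HMM, a forward pass in log space yields P(Y|display), and Bayes' rule gives the posterior over displays. All names are illustrative.

```python
import numpy as np

def log_forward(log_obs, log_trans, log_init):
    """Log-likelihood of one observation sequence under one HMM.

    log_obs  : (T, K) log P(y_t | hidden state k)
    log_trans: (K, K) log P(k_t | k_{t-1});  log_init: (K,)
    """
    alpha = log_init + log_obs[0]
    for t in range(1, log_obs.shape[0]):
        m = alpha.max()  # manual log-sum-exp over the previous state
        alpha = np.log(np.exp(alpha - m) @ np.exp(log_trans)) + m + log_obs[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def display_posterior(log_obs_per_display, hmms, log_prior):
    """P(display | Y): one HMM per display class, combined as a mixture."""
    ll = np.array([log_forward(lo, *hmm)
                   for lo, hmm in zip(log_obs_per_display, hmms)])
    ll += log_prior
    ll -= ll.max()
    p = np.exp(ll)
    return p / p.sum()
```

The CHMM version runs the same recursion jointly over the two coupled chains (X and W), so its hidden state is the pair {X_t, W_t}.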
Chapter 3 describes how to compute the likelihood in Equation (1.1). Chapter 4 shows how the parameters of the model, including the means and covariances of the Gaussian output distributions, the feature weights, and the parameters of the mixture of CHMMs, can be learned using a Bayesian a-posteriori optimization based on the expectation-maximization algorithm [DLR77]. The use of priors allows learning of the model even in the face of little training data, or online in a reinforcement learning situation.

1.4 Using the Model

It would be foolhardy to attempt to learn models of full-blown human interactions directly. Much work in understanding the learning process must be done before we can hope to reach this lofty goal. Instead, we develop an experimental paradigm which allows us to focus only on a subset of displays in a restricted context. We record human subjects playing cooperative, interactive games which require the use of non-verbal communication. The games are designed such that the communication is critical for winning the game. Cooperative games are used for two reasons. First, we believe that most simple human interaction is cooperative. This is what makes a species survive! Second, cooperative games are simpler to model because we can assume that each player is working towards the same goal. Thus, a player can do sufficiently well without explicitly modeling his partner's utility function.

The video data taken during play of a particular game is used to learn a decision-theoretic model that can be used to compute strategies of action for a rational agent. There are many ways of approaching the task of finding optimal action plans for multi-agent games. A game-theoretic solution concept involves examining the models and utility functions of all players, and attempting to understand the behavior of all the players in the game simultaneously, assuming that they are all rational agents. However, these types of solutions are used to understand the way in which humans will interact together. We would like our methods to be useful for an artificial agent playing a game with a human. In this case, it is not possible, in general, to know the utility functions of each player in the game, and we must instead adopt the decision-analytic approach to games [Mye91]. This approach considers the game from the perspective of individual agents. Each agent's optimal strategy is to maximize its expected payoff with respect to its beliefs about the strategies of all the other agents. There are problems with this approach. The possible strategies of all other agents will inevitably include their assessments of the possible strategies of all other agents as well. Therefore, to figure out how to behave rationally, an agent would have to know how another rational agent would expect him to behave, which is the problem he started out with. This infinite regress makes it difficult to justify decision-analytic solution concepts from a game-theoretic standpoint.

However, a decision-analytic agent can attempt to explicitly model this infinitely nested system of beliefs, as described in [Gmy02]. The idea is that the agent can perform sufficiently well in a game by "cutting off" the infinite nesting at some point. For example, agent A can explicitly model the beliefs of agent B, but not the beliefs that agent B has about agent A's beliefs. This type of analysis may be particularly relevant in cooperative games, in which agents may assume that all collaborators share roughly the same utility function. The games we consider in this thesis are simple enough that these types of considerations do not play a major role. Further work into the use of decision-analytic models for multi-agent systems is warranted for more complex situations.

1.5 Experiments and Results

We present experimental results on data sets taken during play of three games: the imitation game, a robot control "game", and the card matching game. The first (imitation game) only involves facial displays as actions, and so does not have a reward function. This game is used to explore the representational power of our computer vision modeling techniques. The second (robot control) involves a single human performing gestures for robot control. The robot control experiments are intended as a simple demonstration of our system working with something other than facial expression, and of our value-directed structure learning techniques. The experimental setup for the robot control experiments is very simple, and we do not claim a gesture recognition system that would deal with orientation, time scaling, or robot movement and viewpoint. The third (card matching game) involves two humans playing a collaborative game. The facial displays are fairly simple, but the decision theory problem is much more complex than in the other two games. This data is used to demonstrate how a policy can be computed based, in part, on non-verbal displays.

During play of the imitation game, a human tries to mimic the facial displays of an animated cartoon face. These displays involve some complex facial motions, and we use this game primarily to demonstrate the computer vision modeling techniques. We perform three analyses on this data in order to gain understanding of the different components involved in the model. The first analysis looks at the spatial abstraction of instantaneous motion, and assumes temporal independence: each flow field is independent of all others. This allows understanding of the motion representation we use, and shows how clusters of similar instantaneous motions can be discovered in a data set. The second analysis does the same as the first, but for the facial configuration. The third analysis includes the temporal abstraction of both instantaneous motion and configuration, and looks at how similar sequences of motions (temporally extended facial displays) can be clustered together. We present these three separate treatments because each model performs a different function, and the first two are an integral part of the third. The simple mixture of Gaussians model assumes temporal independence, and learns models which spatially abstract each frame of the video data. The coupled hidden Markov models use these spatial abstraction models as their lowest level, and perform temporal abstraction, learning the temporal progression of the data.
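For reference, the two steps of EM for such a temporally independent Gaussian mixture look as follows. This is a textbook sketch with diagonal covariances for brevity; the thesis uses full covariances, feature weighting and priors, none of which are shown here, and all names are ours.

```python
import numpy as np

def em_step(Z, mus, vars_, pis):
    """One EM iteration for a diagonal-covariance Gaussian mixture.

    Z: (N, d) feature vectors;  mus, vars_: (K, d);  pis: (K,) mixing weights.
    """
    N, d = Z.shape
    # E-step: responsibilities r[n, k] = P(cluster k | z_n)
    log_r = -0.5 * (((Z[:, None, :] - mus[None]) ** 2 / vars_[None]).sum(-1)
                    + np.log(vars_).sum(-1) + d * np.log(2 * np.pi))
    log_r += np.log(pis)
    log_r -= log_r.max(axis=1, keepdims=True)       # numerical stability
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from responsibility-weighted data
    Nk = r.sum(axis=0)
    mus = (r.T @ Z) / Nk[:, None]
    vars_ = (r.T @ Z ** 2) / Nk[:, None] - mus ** 2 + 1e-6  # variance floor
    pis = Nk / N
    return mus, vars_, pis
```

Iterating `em_step` to convergence yields the clusters of instantaneous motions and poses discussed above; the temporal models of Chapter 4 extend the E-step with the forward-backward recursions.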
We show results from a leave-one-out cross-validation experiment for each of three subjects playing the imitation game, where we attempt to infer the cartoon display that produced the observed facial motion on unseen data. These results are compared to those obtained if the training data is labeled, and we find our clustering method performs well.

Robot control using gestures is the second "game". An operator signals navigation commands to a robot using the hand gestures forwards, stop, go left and go right, and rewards the robot for performing the correct action after each gesture. The robot learns the gesture categories, their number, and their relationships to its actions and utility functions. It can then use the learned model to take actions after subsequent gestural commands. We perform a leave-one-out cross-validation experiment and measure how much reward the robot would gain by taking these actions on unseen data. We find error rates of only 2% over 48 test sequences.

The card matching game is played by two humans through a computer interface. The players can see, but not hear, one another. They are allowed to communicate through this visual link, but no restrictions are placed on the type of communication. Each player has a hand of three cards, but can only see their own cards, not their partner's. Each player gets to play a single card, and they both win if the suits of their played cards match. The idea is that the players must somehow agree, using only visual communication, about which card to play. This general form of the card matching game has a similar flavor to the Krauss-Glucksberg referential test used to study the development of language [KG77]. In a Krauss-Glucksberg test, the two subjects can hear, but not see, one another, and must choose matching "nonsense" symbols from identical sets. To pass the test, they must develop a common referential language for the symbols, which often involves one subject imitating the symbols of the other. The general card matching game is similar because there are no pre-defined standard gestures for playing cards, and the players have to develop a communication system for referring to card suits.

We present results on a simplified version of the card matching game, in which one player has the ability to make a "bid" of a card suit. The other player must then simply agree or disagree with the bid. This game simplifies the analysis because it allows the players to use a pre-defined set of gestures: head nods and head shakes. The data is used to train a decision-theoretic model, which must learn the relationship between what a player is doing, the state of the game, what the future actions will be, and how the outcome will benefit the players. The model must also learn what a player expects his partner to display, given the state of the game. The model is used to compute a policy of action using an approximation technique, and is demonstrated on a test data set.

1.6 Contributions

The major contributions of this thesis are as follows. They are ranked in order of importance, and references to our published work on these subjects are shown.
• [Hoe01, Hoe02, HL04, Hoe04] A novel and unified model of human non-verbal behavior in video streams, integrated with decision making over high-level context states and actions, together with a Bayesian a-posteriori learning procedure for adapting the parameters of the model to training data. This model aims at unifying computer vision with decision theory. The goal of computer vision systems has traditionally been exactly the task that the computer vision is performing: the goal of a face recognition system is to recognise faces, while the goal of a pedestrian tracker is to track pedestrians. However, the field is ready to move on to domains in which the goal is something larger than just the computer vision task. For example, facial expression recognition can be used by an embodied conversational agent, and pedestrian trackers can be used by smart buildings. Integrating computer vision with actions and goals is an important step, not just for the larger systems that use computer vision, but also for the computer vision task itself. In fact, the task of the computer vision system is defined by the goals it is being used to reach. It is important, therefore, not to consider the computer vision task alone, but rather in the context of whatever it is being used for. Decision theory is the standard method used by rational agents to choose actions, and most planning tasks can be formulated as decision-theoretic models. This thesis shows how computer vision modeling of human action can be unified with decision-theoretic planning in a single, Bayesian model.

• [HL03] A novel probabilistic method for spatially abstracting optical flow fields to a set of basis functions. The usual method for constructing simple descriptions of classes of optical flow fields is to first compute the flow field from the spatio-temporal image derivatives, then to project it to some low-dimensional subspace, and finally to model the distribution over projections as arising from a number of discrete categories. However, each stage of this process discards information about the noise inherent in the estimation. We have presented a Bayesian approach that directly models the distribution over spatio-temporal image derivatives given the motion classes. The method includes the application of a Bayesian feature weighting technique, obviating the need for prior selection of basis functions.

• [HL00] Use of Zernike polynomials as descriptors of optical flow. The Zernike polynomial basis functions have been used for analysis of shapes [TC88, BSA91, HSJ95, KK00]. Their advantages are efficiency of computation and accuracy of reconstruction. Our work is the first to apply the Zernike polynomial basis to representing optical flow [HL00].
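For the curious reader, a minimal sketch of a Zernike projection follows (standard formulas; the sampling and normalization are simplified relative to the thesis, and the function names are ours): the radial polynomial R_n^m is evaluated on the unit disk and the image is projected against the basis function V_n^m(ρ, θ) = R_n^m(ρ) e^{imθ}. A flow field would be projected component-wise in the same way.

```python
import numpy as np
from math import factorial

def zernike_radial(n, m, rho):
    """Radial polynomial R_n^m(rho); requires n - |m| even and non-negative."""
    m = abs(m)
    R = np.zeros_like(rho)
    for k in range((n - m) // 2 + 1):
        c = ((-1) ** k * factorial(n - k)
             / (factorial(k) * factorial((n + m) // 2 - k)
                * factorial((n - m) // 2 - k)))
        R += c * rho ** (n - 2 * k)
    return R

def zernike_moment(img, n, m):
    """Discrete projection A_{nm} of a square image onto V_n^m over the unit disk."""
    h, w = img.shape
    y, x = np.mgrid[-1:1:h * 1j, -1:1:w * 1j]       # image mapped to [-1,1]^2
    rho, theta = np.hypot(x, y), np.arctan2(y, x)
    mask = rho <= 1.0                               # keep only the unit disk
    V = zernike_radial(n, m, rho) * np.exp(1j * m * theta)
    dA = (2.0 / h) * (2.0 / w)                      # pixel area as measure
    return (n + 1) / np.pi * np.sum(img[mask] * np.conj(V[mask])) * dA
```

A handful of low-order moments A_{nm} then serves as the compact feature vector Z described in Section 1.3.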
1.7 Thesis Outline

The thesis is structured as follows. Chapter 2 discusses some of the previous work on the psychology of facial movement, on the modeling of facial motion and human body motion in general using computer vision, and on the decision-theoretic modeling of human action. Chapter 3 presents a detailed description of a general class of Bayesian hierarchical models of human facial displays which conform to the assumptions made above. This chapter also describes the method we use for tracking a face in a video sequence. Chapter 4 shows how the parameters of the models can be learned from data using the expectation-maximization algorithm. Some implementation details are also given in this chapter. Chapter 5 then describes how the learned models can be used to predict human actions in repeated Bayesian games with non-verbal communication, and how to choose optimal actions for a computational agent in an interaction with a human. Chapter 6 demonstrates the model learning techniques described in Chapter 4, and shows results on three sets of data. Our conclusions and directions for future work are described in Chapter 7.

There are four appendices which follow the bibliography. They contain most of the mathematical derivations which are not critical to an understanding of the work, but would be necessary for an implementation. Appendix A is a small proof of an equation used in the derivation of the optimization algorithm we use for learning the model. Appendix B shows a derivation used in the tracking equations. Appendix C derives the learning equations for the simple time-independent mixture model, including feature weighting. Appendix D derives the learning equations for the temporal sequence models.

Chapter 2 The Study of Faces and Gestures

This chapter examines some of the prior work on human non-verbal behaviors from the perspective of psychologists and of computer scientists. The literature on this subject is vast, and arises in at least four major research areas. Indeed, much of the work on facial expression and gesture analysis is inherently inter-disciplinary, bridging cognitive science, computer vision, artificial intelligence, and human-computer interaction (HCI) [LS00]. In this chapter, we will first give a broad summary of each of these disciplines, and the relationships between them. The summary will be annotated with some key references to the major works in each field and in each cross-disciplinary endeavour. These will be books, survey articles, or conference proceedings where possible, to give the reader an idea of the broadest place to look for further information. We will then proceed to give a more in-depth review of the work in each area. In particular, we will focus on the work in psychology, and the work in computer vision. This will be followed by a review of some of the work bridging these topics. We will show how our work falls into this last category, combining ideas from each of the major research areas.

Figure 2.1 is a schematic of the four major research areas which are important to this thesis. The cognitive science and psychology literature is an important factor, as it studies human behavior, and can give indications about what to expect. Psychologists have been studying facial expression from Darwin's 1872 The expression of the emotions in man and animals [Dar72] and Ekman's 1972 Emotion in the human face [EFE72], to the more recent expositions in McNeill's 1992 Hand and Mind: What Gestures Reveal about Thought [McN92], Fridlund's 1994 Evolution of Facial Expression [Fri94], and Russell's 1997 The Psychology of Facial Expression [RFD97a]. Section 2.1 discusses this literature in more detail.

Figure 2.1: Four major research areas (rectangles) and key references for decision-theoretic modeling of facial expressions and gestures: cognitive science [Dar72, EFE72, McN92, Fri94, RFD97a], computer vision [FG02, FL03, WHT03], decision/game theory [vNM53, Mye91, Put94], and HCI [CHI03, UM03]. Links show that research areas draw from one another. Ellipses show inter-disciplinary research areas and key references. HCI and computer vision both draw heavily from cognitive science, so the links between these fields are emphasised.
Finally, the central star shape in Figure 2.1 denotes this thesis topic, and its links to the four major research areas.

Computer vision scientists have been describing human motion for many years. The biennial Proceedings of the International Conference on Automatic Face and Gesture Recognition publishes the frontier of research in the field [FG02]. A recent survey on facial expression analysis is Fasel and Luettin's 2003 Automatic Facial Expression Analysis: A Survey [FL03], while a more general review of human motion analysis is Wang et al.'s 2003 Recent Developments in Human Motion Analysis [WHT03]. Much of the computer vision literature on the analysis of faces and gestures draws from cognitive science, but primarily from a small number of works relating to emotion in the human face [EFE72], and from McNeill's book on gestures [McN92]. Section 2.2 delves further into the vast computer vision literature.

Decision theory and game theory play an important role in the design of rational agents. Von Neumann and Morgenstern laid the groundwork for this field in their 1953 Theory of Games and Economic Behavior [vNM53]. A more recent book on the subject is Myerson's 1991 Game Theory: Analysis of Conflict [Mye91]. Markov decision processes (MDPs) are discussed at length in Puterman's 1994 Markov Decision Processes: Discrete Stochastic Dynamic Programming [Put94]. Partially observable Markov decision processes (POMDPs) are discussed by Kaelbling, Littman and Cassandra [KLC98]. We will leave discussion of more recent work with decision-theoretic models to the discussion of MDPs and POMDPs in Chapter 5.

Human-computer interaction, or HCI, considers methods for getting humans and computers to interact more easily. Two major yearly conferences cover much of the most recent research in this field: the CHI (computer-human interaction) conference covers many areas from privacy to video-conferencing [CHI03], while the UM (user modeling) conference deals more specifically with adaptive user models [BCdR03].

Some of the most interesting recent work takes place in the intersections between the four major fields we have just described. Some of these cross-disciplinary areas are also shown in Figure 2.1. Lisetti and Schiano's 2000 Automatic facial expression interpretation: Where human-computer interaction, artificial intelligence and cognitive science interact [LS00] gives an informative overview of research combining HCI, computer vision and cognitive science. Cassell's 2000 Embodied Conversational Agents [CSPC00] gathers a number of papers from researchers interested in using computer vision in HCI-related tasks. Pentland's 2000 Looking at People [Pen00] gives another overview from a computer vision perspective. A new workshop has been initiated to discuss research in using computer vision for HCI: the IEEE Workshop on Computer Vision and Pattern Recognition for Human Computer Interaction [CVP03]. The yearly International Conference on Multimodal Interfaces (ICMI) also presents much of the work in this field [ICM03].

Efforts have recently been made at integrating decision theory with HCI. Murray and VanLehn's 2000 paper on DT Tutor shows how decision-theoretic models can be used to select optimal tutor actions in an intelligent tutoring environment. Paek and Horvitz describe how decision theory can be used to model conversation in their 2000 Conversation as Action under Uncertainty [PH00]. Gmytrasiewicz and Lisetti have recently described emotional dynamics using decision-theoretic models in their 2000 Using decision theory to formalize emotions for multi-agent system applications [GL00].
Gmytrasiewicz and Lisetti have recently described emotional dynamics using decision theoretic models in their 2000 Using decision theory to formalize emotions for multi-agent system applications [GL00]. However, none of these efforts at bringing decision theory to HCI uses computer vision. Active vision is an exception: Darrell and Pentland used decision theoretic models (POMDPs) to control the foveation of an active camera in their 1996 Active Gesture Recognition using Partially Observable Markov Decision Processes [DP96].

The work we present in this thesis finds common ground with all four of the major fields we have been discussing. It is motivated by recent work in cognitive science, and uses decision theoretic models with computer vision observations. We learn these models from data taken during interactions between humans and computers. They explicitly integrate computer vision observations with actions, and are designed for human computer interaction.

2.1 Psychology of Facial Expression

Changes in the human face arise from many sources. While it is commonly held that the face is primarily used for the expression of emotion, there are other, perhaps more significant, reasons behind facial displays¹. This is particularly important from a computer vision perspective, for we only have to notice the prevalence of speech-related lip movement during interaction to realise that the modeling of facial motion will involve much more than just emotion recognition. We first give an overview of the research into the expression of emotion in the human face, followed by some of the more recent explorations into other motivations behind facial displays.

Facial expressions and gestures in both humans and animals have been extensively studied over the course of history. Charles Darwin's The expression of the emotions in man and animals is perhaps the oldest writing that is still influential today [Dar72]. In it, he attempts to link the emotions with the human face in a two-factor model. He posited that facial expressions were either involuntary expressions of emotion or "willful and employed to social ends" [Fri94]. Darwin's two-factor model was more recently taken up in the facial expression program [ER97, Iza97]. The facial expression program assumes that there are somewhere between five and nine basic emotions, and that these emotions are innate, universal and discrete. Other emotions are mixtures, or blends, of the basic emotions. Furthermore, each emotion has an involuntary characteristic facial expression associated with it. Expression of emotion in the face can, however, be voluntarily masked, replaced or imitated, but the "true emotion" always leaks out and can be detected in the face.

The facial expression program led to the development of the Facial Action Coding System, or FACS [EF78]. FACS describes the pose of the human face as a combination of basic facial deformations, called action units, or AUs. Each action unit is essentially binary, is numbered, and is objectively described. For example, AU1 corresponds to a raised inner brow, while AU10 is the upper lip raised, and AU46 is a wink. A trained human FACS coder can observe a still image of a face, and determine which action units are "on", and which are not. Ekman claims that combining AUs allows a human to express emotion. For example, the expression of happiness corresponds to a mixture of AU1 + 2 + 10 + 25 + 26. The FACS coding system has been used extensively in research on the motion of the face [GLW01, Rus97, DBH+99].

¹Hence our use of the terminology display rather than expression, to avoid restricting ourselves to emotional expression in the human face.
Ekman's facial expression program used an experimental method that consisted of showing still photographs of people's faces who were in the midst of displaying what he claimed were prototypical expressions of emotions. The subjects were then asked to choose one of the five emotions (fear, happiness, surprise, anger or disgust) as the one most associated with the photographs. Stripped of context as they were, the photographs were very often classified as the emotion they were supposedly displaying [EFE72]. However, in a different experiment, Carroll and Russell [CR96] showed the same photographs to a different set of subjects. Before showing the photographs, they gave the subjects a story to read and told them that the face in the photograph was the face of the person in the story. The subjects were then asked to describe the emotion the person in the photograph was feeling. The results showed that the interpretation of meaning in the face was heavily biased by the context. Subjects ascribed "anger" to "fear" photographs, and "hope" to "surprise" photographs. If there is an emotional content to the meaning of facial displays, it does not appear to be categorical and universal, but rather dimensional and contextual. That is, emotions do not occur in discrete categories, such as "happiness" or "sadness", but vary continuously along dimensions such as arousal (high-low) and valence (positive-negative). Further, emotions are not independent entities that are culturally and individually invariant, but are dependent upon the context in which they occur.

Russell and Fernandez-Dols [RFD97b] and Fridlund [Fri94] both point out that Darwin may have been misinterpreted, and that there are other, more pragmatic, ways of understanding facial expressions. They suggest the behavioral ecology view of human facial expression, and argue that the human face is used primarily as a communication tool, which complements and associates with speech and language. For example, Chovil [Cho91] measured people's facial motion during conversations and noticed that displays rarely are expressions of emotion, but rather constitute syntactic and semantic signals. She classified facial displays in terms of their informative function, rather than their physical description. She found that movements in speakers' faces occurred primarily as syntax markers, as illustrations of words, and as additional information. Listeners, on the other hand, used their faces to non-invasively comment on the speaker's dialogue.

There is much evidence to support the hypothesis that faces, as gestures, are only interpretable within the context in which they are displayed, and are individually unique [Fri97, Cho91, McN92]. Context is defined as all circumstances relevant to the display, and may include concurrent or proximate speech and gestures of observer and performer, as well as other environmental factors. For example, eyebrows are sometimes raised during conversation when the speaker is thinking or remembering. However, eyebrows are also raised in back-channel displays of acknowledgment, or when an individual is taking a turn in a conversation [Cas00]. In any case, the observed facial display carries little meaning by itself, while the combination of the current context and the observed facial display is meaningful.
Russell provides the following description of an experiment conducted in Russian director Lev Kuleshov's workshop just after the 1917 revolution (from [Rus97], p. 295):

    ... Kuleshov created three silent film strips each ending with the same footage of a deliberately deadpan face of the actor Ivan Mozhukhin. In one strip, Mozhukhin's face was preceded by a bowl of hot soup, in the second by a dead woman lying in a coffin, and in the third by a young girl playing with a teddy bear. The result was an illusion: Audiences saw emotions expressed in Mozhukhin's expressionless face. [One student] recalled, "The public raved about the acting of the artist. They pointed out the heavy pensiveness of his mood over the forgotten soup, were touched and moved by the deep sorrow with which he looked on the dead woman, and admired the light, happy smile with which he surveyed the girl at play. But we knew that in all three cases the face was exactly the same."

This quote exposes the idea that the interpretation of expressions in faces cannot be decoupled from the context in which they occur.

Non-verbal behaviors, including facial displays and gestures, are important aspects of human communication. Agents can use non-verbal signals to facilitate or manage face-to-face interactions. The use of modalities other than speech is an important advantage, which allows for more efficient communication. Consider that agents in a conversation are working together on a joint project [Cla96]. That is, they are trying to achieve some common goal which holds value for all involved agents. The agents are working on this goal by communicating propositional content with speech. Using non-verbal displays, however, they can simultaneously issue performative signals, which indicate the reasons for the communication [Aus62, PP00]. Raised eyebrows and rising intonation of speech are typical performatives accompanying a question, and indicate to the listener that they are expected to give an answer. In this case, the content of the question is the proposition, while the raised eyebrows and rising intonation of speech indicate that the speaker is communicating this statement because he wants the listener to answer a question.

Performative acts, therefore, are an efficient method for each agent to inform their partner of which actions they are expected to take. That is, the speaker's performative efficiently biases the listener's choice of response action [PP00]. For example, the speaker who raised his eyebrows at the end of a question was telling the listener to answer the question. Suppose the listener, however, was about to make a statement (utter a proposition). She now has two choices. She can either ignore the speaker's performative, and go ahead with her statement, or she can take the speaker's suggestion and answer the question. In general, she may be able to choose some intermediate action which attempts to optimise all agents' preferences. The performative is necessary for cooperation in this simple example. Without it, the listener would only have the option of going ahead with her statement (she does not know that the speaker's utterance required an answer), leading to possible missed goals for the speaker (he does not get his answer). In general, the performative only biases the listener's actions towards those that are of maximal value to the speaker. To cooperate, the listener must then choose an action based upon both her own values and those of the speaker, which are indicated by his performative.
At this stage, other factors come into play, such as the power structure between speaker and listener, which may influence how the two values are combined.

This psychology literature leads us to the conclusion that, in order to interact naturally with humans, computational agents will need to recognize (and generate) situated and purposeful non-verbal signals, which vary widely across individuals and contexts [RN96, Cas00]. That is, they will need to understand the relationship between the observed displays, the context of the interaction, and the expected utility (including the goals of all involved agents), in order to use the information to make cooperative action choices.

2.2 Describing Human Motion with Computer Vision

Eadweard Muybridge was perhaps the first to analyse human and animal motion from video sequences [Muy87]. Commissioned by Leland Stanford, he attempted to find out what animals and humans actually were doing during typical movements, and his findings shed light on the kinesiology of human and animal motion. For example, his careful analysis of videos of galloping horses demonstrated that the commonly held belief about symmetrical foreleg motion was incorrect, and that galloping was actually a complex, asymmetrical pattern of leg movements.

More recently, computer vision researchers have been trying to automatically analyse human motion. Research has focussed on three major categories: facial expression [MP91, TP91, BY97, EP97, OPB97, LTC97, CET98, DBH+99, TKC01], gesture [SP95, DEP96, WBC97, CT98, VM99, WB99, WPG01], and whole body motion [NA94, BH96, JBY96, Bre97, LB98, LF98, RB99, Bra99a, WCP00, KM00, GJH01, BD01, OHG02]. There is also a large body of work in recognising lip and facial motions connected to speech [LT97, RE01, VBPB03], and on generation of facial motion from speech using models learned from data [BCS97, Bra99b, DH01]. This work has a wide range of potential applications, from visual surveillance to advanced user interfaces. Survey articles have recently been published on human motion analysis [Gav99, WHT03], on gesture recognition [PSH97], and on facial expression analysis [PR00, FL03]. Lisetti and Schiano [LS00] also give an informative multidisciplinary review of facial expression analysis. Our research is distinct from most of this work in that we do not attempt to learn pre-defined categories of motions, instead building a system to automatically discover the important patterns in the data. Research which is similar in spirit to this has been carried out by [DEP96, Bra97, CP99, WCP00, WPG01].

Figure 2.2: The general computer vision approach to recognition of human motion maps video inputs to categorical outputs. It typically involves three stages: acquisition of regions of video, extraction of features, and classification.

Analysis of human motion involves three major stages, as shown in Figure 2.2. First, the acquisition or segmentation of the part(s) of video to be analysed. For example, detecting and tracking of faces, hands, or entire human bodies is the usual first step towards any facial expression, gesture or human motion recognition system. We cover some of the literature on tracking in Section 3.7. Second, the extraction of features from these acquired parts. This is usually done by first estimating some quantity of interest at the pixel level, and then spatially abstracting this to a low dimensional feature vector.
Optical flow [MP91, BY97, EP97, CT98, LKCL98, DBH+99], color blobs [SP95, Bre97], deformable models [LTC97], motion energy [BD01], and filtered images [BLB+03] are the most widely used pixel-level features. Spatial abstraction is achieved, for example, by using the pixel-level features to modulate a 3D model [EP97], or by projecting them to a low dimensional subspace [TP91, BY97]. The third stage is to classify the extracted features into a discrete set of non-verbal behaviors, such as facial expressions or gestural primitives. Since the extracted features are extended in time, this usually involves some kind of temporal modeling. Spatio-temporal templates [EP97], dynamic time warping [DEP96], and hidden Markov models [SP95] are all popular approaches.

While much research in the analysis of human motion separates the task of representing the motion and deformation of the body from that of analysing its temporal trajectory, as we have just described, some of the early work considers the spatio-temporal space directly. For example, Niyogi and Adelson [NA94] analysed walking figures by searching the XYT (spatio-temporal) volume of a video segment for repetitive patterns of motion.

The work described in this thesis uses an optical flow and skin-color based tracker to locate and track a human face, described in Section 3.7. Our features are projections of optical flow fields and raw intensity values to a pre-defined set of orthogonal Zernike polynomials, described in Chapter 3. Our method for temporal abstraction uses a multi-level dynamic Bayesian network, also described in Chapter 3. Finally, our model explicitly includes actions and utility functions at the highest level (Chapter 5), an addition that is made by none of the above-cited works.

The remainder of this section describes in more detail the previous work on the second and third stages of analysis: the representation and classification of human motion. Tracking is not a major focus of this thesis, and we leave review of relevant work to Section 3.7. Most of the methods we review will be strictly supervised, in that models are trained for one of several pre-defined categories of human behaviors, and are tested for recognition of the same behaviors. Research on the automatic discovery of human motion patterns from video is more closely related to the approach presented in this thesis, and is reviewed in Section 2.2.3.

2.2.1 Representing the Human Body in Video

One of the most well-used approaches for analysis of human body motion is finding the principal subspace of variation [Koh89]. The basic idea is that although the dimensionality of image space is extremely large (number of dimensions = number of pixels), there is some (typically) low dimensional subspace in which a particular type of motion can be accurately represented. Finding this subspace, called the principal eigenspace, is achieved using a singular value decomposition, also known as principal components analysis (PCA). This method has been applied to describing facial structure for face recognition (eigenfaces) [TP91], motion in the face [LKCL98, FBYJ00], and spatio-temporal variation in body motion [BH96, LF98]². One of the difficulties with standard linear PCA is that it can only deal with one mode of variation (e.g. identity or expression), but faces vary in many ways at once (e.g. lighting, pose, expression, identity).

²Eigenspaces have been used for many other purposes as well, e.g. modeling the pose of static objects [MN95].
View-based eigenspaces approach this problem by computing a different subspace for each view [PMS94]. Another approach is an extension to the standard PCA approach that decomposes along multiple modes at once. This has been called N-mode SVD, or TensorFaces when applied to faces [VT02, WA03]. The modes of variation found using PCA are generative, in that they describe the overall variation across all classes, and so can be used for accurate reconstruction of images in a dataset. Discriminative methods, on the other hand, find modes of variation that discriminate between classes. The Fisher linear discriminant (FLD), for example, finds modes that describe the variation which maximizes the ratio of between-class scatter to within-class scatter. While FLDs are not useful for describing all images in a dataset, they are very good for distinguishing classes within that dataset [BHK97].

Principal component analysis and Fisher linear discriminants both operate using the second order statistics (variance) of the data. However, much relevant information in a facial expression analysis task may be contained in the higher order relationships between pixels. Independent component analysis (ICA) is a generalization of PCA which capitalises on these higher order statistics for representing images. ICA has an information theoretic and a biological motivation [BS95]. While PCA finds the modes which maximize the variance, ICA finds the modes that maximize the entropy. Bartlett has used ICA for facial expression analysis [Bar01].

Another popular type of motion and image representation is to project optical flow fields or images to a pre-defined set of basis functions. Translational or affine models are typical for optical flow fields. Mase and Pentland [MP91] used optical flow to model muscle motion over coarse areas of the face. Yacoob and Davis [YD94], and Black and Yacoob [BY97], use a low-order parameterisation of flow fields to describe local areas of deformation in the human face during expressions. Their work is extended in [JBY96] to include models of leg motion. Zernike moments, Hu moments, Legendre moments, and rotational moments have also been studied [TC88, BSA91] for representing images and optical flow fields. Moments of images or optical flow fields are extensively used; in particular, central moments were used by Little and Boyd [LB98] for representing motion of the human body for gait recognition. The angular radial transform (ART) basis is a type of rotational moment used in the MPEG-7 standard for video encoding [Bob01]. Another popular type of basis set are the Gabor filters, which are attractive due to their close relationship with the receptive fields of simple cells in the visual cortex. Bartlett et al. used Gabor-filtered images to train support vector machines for the recognition of action units [BLB+03]. They used hand-labeled feature points on human faces to warp the faces to a canonical view in a pre-processing stage. Gabor wavelets have also been used for face recognition [LBA99].

Template-based approaches store a number of exemplars of face or body poses, and use some correlation metric to match unlabeled data to the stored exemplars. Bobick and Davis used temporal templates called motion history images (MHIs), which encode temporal information in a single image, with more recently occurring events more heavily weighted [BD01].
These MHIs are then spatially abstracted using the Hu moments, and compared to stored templates. The method was tested on aerobics movement sequences, and was used in an interactive play-space for children called the KidsRoom. Efros et al. use optical flow templates to characterise motions of soccer players, tennis strokes and ballet steps [ABMM03].

There are a number of model-based approaches in the literature, which attempt to fit facial or body motion and structure to a model of the face or body. Three dimensional models in particular are useful for synthesising computer graphics characters. Terzopoulos and Waters [TW93] used a three dimensional wire-frame model of the face driven by tracked facial features. However, their system required facial markers. A similar, but non-invasive, method was used by Essa and Pentland in an approach that uses optical flow to estimate muscle motion [EP97]. They first register the 3D wire-frame mesh to the face by locating the positions of eyes, nose and lips. Optical flow is then measured between registered images, and used in conjunction with the 3D model to estimate the forces underlying the facial motion. The 3D model is also used to constrain the flow field, which can then be used to construct spatio-temporal templates for recognition. Three dimensional models promise to avoid the problems that view-based approaches have with different views of the face or body. Bartlett et al. use a 3D face model to which the image is registered [BLB+03]. They currently do the registration manually, although they plan to eventually do so automatically. Models are also extensively used for other body parts. For example, Vogler and Metaxas fit 3D models of arms to motion capture data [VM98]. Inferring stick figures from motion-capture data is a standard procedure for recovering physically-based models of human motion for graphics applications. However, 3D modeling usually comes at the cost of heavy computational requirements.

Feature points are used for representing motion of faces, hands and bodies. Perhaps the simplest way of doing this is using motion capture systems, or colored markers, which unfortunately are invasive and expensive. Other methods usually involve some kind of model-based approach, in which feature points or edges are located and tracked according to their fit to a model. For example, Lanitis, Taylor and Cootes use flexible shape models which track and adjust to a set of facial feature points [LTC97]. Similar models are used to represent human body silhouettes by Galata et al. [GJH99]. Feature points and edges are also used by Tian et al. [TKC01] for recognising facial expressions from static images.

A simple and robust method for representing the positions of hands and face regions is the use of image intensity or color "blobs". A blob is a statistical model of the distribution of image intensity, color, or optical flow over an oriented region of space. Blobs are estimated from training images, can be simply tracked, and give a coarse estimate of the position and orientation of hands and face. Blobs have been used for analysis of gestures [SP95, JP98, CT98], body motion [Bre97, Bra97, WADP97], and faces [OPB97].

Our work uses Zernike basis functions [vZ34] for holistic representation of the face and facial motion. The Zernike polynomial basis provides a rich and data independent description of optical flow fields and grayscale images which can be seen as an extension of the affine basis.
These characteristics make it a desirable approach for unsupervised classification of patterns of motion in the face. When applied to optical flow, the Zernike basis can be seen as an extension of the standard affine basis [HL00]; in fact, Black and Yacoob [BY97] have used the first three orders of the Zernike basis for local representations of motion in the face. The Zernike representation differs from approaches such as eigen-analysis [TP91] or facial action unit recognition [TKC01] in that it makes no commitment to a particular type of motion, leading to a transportable classification system (e.g., usable for gesture clustering). The Zernike basis has also been used extensively for shape descriptions [Tea80, TC88]. Teh & Chin show that Zernike and pseudo-Zernike moments perform best in terms of noise insensitivity and image reconstruction [TC88]. Zernike moments have also been proposed as shape descriptors for the MPEG-7 video coding standard [KK00], although angular-radial transformations (ARTs) are currently used [Bob01]. ART bases, however, are not orthogonal, making reconstruction (and visualisation) impossible. Zernike polynomials have been used in the vision community for recognising hand poses [HSJ95], handwriting and silhouettes [BSA91], shape-based image retrieval [ZL02], and optical flow fields [HL00].

2.2.2 Classification

Once a region of a set of images containing some human body part(s) has been compactly represented, the final task is to classify these representations. The usual method here is to gather and manually label a set of training examples of each motion that needs to be recognised. These training examples are then used to train a classifier, which is tested on unseen data.

Features extracted from a single image can be used directly for classification of the facial pose. For example, the model fits used by Tian et al. are used as inputs to a neural network for the classification of FACS units, which are defined for static images [TKC01]. Other methods represent the spatio-temporal volume using a set of features, explicitly taking the temporal nature of the signal into account [NA94, BD01]. Cutler and Davis analyse periodic motion using Fourier analysis applied to image differences [CD00]. However, one of the most popular approaches is to model the temporal progression of features extracted from images or flow fields using some kind of dynamical model. Simple temporal models of facial expressions were used by Yacoob and Davis and by Black and Yacoob [YD94, BY97]. They characterised the progression of an expression in three phases (beginning, peak and ending), and used a rule-book to classify emotions. Cutler and Turk also use a rule-based temporal classification scheme with features based on the Fourier analysis of blob motions [CT98].

Hidden Markov models (HMMs) are perhaps the best known dynamical model, as their use in speech recognition has been extensively advocated and practiced [RH93]. A hidden Markov model is a generative model, in which the observations (feature vectors) arise from one of a discrete set of states. The expected distribution of feature vectors for each such discrete state is characterised with some kind of mixture model, such as a mixture of Gaussians (for continuous features), or a mixture of multinomials (for discrete features). The discrete states are then assumed to be temporally connected in a Markovian chain.
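To make these mechanics concrete, here is a minimal sketch of the forward algorithm for computing the likelihood of a feature-vector sequence under an HMM with a single diagonal Gaussian per state. This is an illustration of the standard algorithm [RH93], not code from any of the systems cited in this chapter, and the toy parameter values are hypothetical:

    import numpy as np

    def hmm_log_likelihood(obs, pi, A, means, stds):
        """Forward algorithm in log space: returns log P(obs | model).
        obs: (T, D) feature vectors; pi: (K,) initial state probabilities;
        A: (K, K) transition matrix; means, stds: (K, D) per-state Gaussians."""
        T, D = obs.shape
        # log-density of each frame under each state's diagonal Gaussian
        diff = (obs[:, None, :] - means[None, :, :]) / stds[None, :, :]
        log_b = (-0.5 * np.sum(diff ** 2, axis=2)
                 - np.sum(np.log(stds), axis=1)
                 - 0.5 * D * np.log(2 * np.pi))      # shape (T, K)
        log_alpha = np.log(pi) + log_b[0]
        for t in range(1, T):
            m = log_alpha.max()                      # log-sum-exp for stability
            log_alpha = m + np.log(np.exp(log_alpha - m) @ A) + log_b[t]
        m = log_alpha.max()
        return m + np.log(np.sum(np.exp(log_alpha - m)))

    # toy two-state model with one-dimensional features
    pi = np.array([0.6, 0.4])
    A = np.array([[0.9, 0.1], [0.2, 0.8]])
    means, stds = np.array([[0.0], [3.0]]), np.ones((2, 1))
    obs = np.array([[0.1], [0.2], [2.9], [3.1]])
    print(hmm_log_likelihood(obs, pi, A, means, stds))

Supervised classification as described above trains one such model per labeled motion category, and assigns a test sequence to the category whose model gives the highest likelihood.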
HMMs have recently been applied to many problems in computer vision. Schlenzig, Hunter and Jain used them to classify hand gestures [SHJ94], while Starner and Pentland used them to recognise American Sign Language (ASL) gestures in a view-based method using color blobs [SP95]. Vogler and Metaxas also recognised ASL signs with HMMs, but used three-dimensional models [VM98]. Their data, however, was captured using magnetic sensors, not computer vision. Morimoto, Yacoob and Davis recognised head gestures using HMMs in a view-based approach [MYD96]. Lien et al. and Bartlett et al. use HMMs to distinguish FACS facial action units [LKCL98, BLB+03]. While Lien et al. use principal components analysis and a vector quantizer to represent the images, Bartlett et al. use Gabor filters and support vector machines. Our work used HMMs to recognize facial expressions represented holistically over the face region using projections to Zernike polynomial basis functions [HL00]. Although not in a temporally dynamic setting, HMMs have recently been successfully applied to face recognition, where the Markovian chain is defined over the spatial extent of the face [NI00].

Hidden Markov models are usually only first order, so that the state at time t is independent of the history given the state at time t - 1. However, HMMs with more "memory" represent higher-order temporal dependencies, and can be useful for more complex action recognition. Vogler and Metaxas use second-order HMMs to recognise ASL gestures [VM98], while Galata, Johnson and Hogg use models in which the "memory" length is also learned from the data [GJH99].

Hidden Markov models use a Markov chain over a discrete set of states. A close relative of the HMM uses a continuous state, a model usually referred to as a linear dynamical system (LDS) [Min99]. State estimation in LDSs (forward propagation) is better known as a Kalman filter [Kal60], which has been extensively used in computer vision.

HMMs are really only the beginning of the story on statistical temporal models, however. They are, in fact, a special case of the more general dynamic Bayesian networks (DBNs), which are Bayesian networks in which a discrete time index is explicitly represented. Inference and learning in DBNs is simply an application of network propagation in Bayesian networks [Pea88]. A comprehensive review of DBNs is given in Murphy's thesis [Mur02]. DBNs usually make a Markovian assumption, but explicitly represent conditional independencies in the variables, allowing for more efficient and accurate inference and learning.

A simple DBN extension of HMMs is the coupled hidden Markov model (CHMM), introduced by Brand, Oliver and Pentland [BOP97] for recognition of simultaneous human actions. CHMMs have two Markovian chains, each modeling a different stream of data. The two chains are coupled to allow modeling of the inter-dependence between the two data streams. They used these models for analysing patterns of pedestrian motions: the coupling between the chains models the interactions between the people. Vogler and Metaxas [VM99] use a similar model for American Sign Language (ASL) recognition, in which the two Markovian chains model the two arms. As pointed out earlier, they do not use computer vision data, but instead capture motion information from wearable magnetic sensors. They show that removing the coupling between the chains leads to increases in efficiency without significantly affecting performance.
Some of the most interesting DBNs model the data at multiple temporal scales. This has been an important issue in speech recognition, where phoneme-level models are combined at a higher level into word or sentence models [RH93]. However, as noted in [BF95], speech recognition systems do not learn high-level semantics (at the word or sentence level), relying heavily on prior knowledge. Computer vision, on the other hand, has made significant progress in learning hierarchical temporal models. A general formulation of hierarchical hidden Markov models is given by Fine, Singer and Tishby [FST98]. Murphy and Paskin have shown how the parameters of this model can be learned efficiently [MP01]. We review some of the work on hierarchical models in computer vision here.

Bregler [Bre97] used a multi-level system which analysed the motion of color blobs over short time scales using linear dynamical systems (LDSs), and then modeled the LDSs as arising as outputs from a hidden Markov model. The idea is that long, complex motions can be represented as concatenations of shorter, simpler motions. For example, the motion of the foot during running is bi-phasic (one phase while it is on the ground, one while it is in the air), and the running motion alternates between the two phases. A similar hybrid dynamical model has more recently been proposed by Pavlovic, Frey and Huang [PFH99], who also advocate the use of variational methods for estimating parameters.

Bobick and Ivanov [BI98] used a multi-level model that described primitive events using a set of hidden Markov models, and then used a stochastic context-free grammar (SCFG) to provide high-level semantic help. An SCFG enables the modeling of more complex semantic patterns of behavior than a hidden Markov model. They use the SCFG and HMMs to model positions of head and hands acquired using a stereo vision system during the performance of simple gestures, and of musical conducting.

Cohen et al. [CSG+03] have recently proposed a multilevel HMM to recognize facial expressions. Their main contribution is a method for automatically segmenting a video stream into facial expression events, assuming a neutral state between each expression. Oliver et al. [OHG02] learned a multi-layer model of office activity to choose actions for a computational agent. The model uses multimodal inputs, making only very slight use of computer vision. Our work uses a similar model, called the abstract hidden Markov model, to describe facial expressions [Hoe01].

2.2.3 Unsupervised Classification

Most of the methods we have been describing use extensive training processes with labeled data. That is, their goal is to classify gestures, facial expressions, or body motions into a pre-defined set of categories. Training data is acquired for each of the categories, and is used to train models independently. The likelihood of test data given the learned models is then used for classification of unseen examples. The most significant drawback to such methods is that training data must be manually labeled in a time consuming process. This also implies that adaptation to novel types of motions requires new training data to be acquired and labeled. The alternative is to develop systems that can discover categories of motions in training data. These latter systems are more unsupervised than the former, and have been much less extensively researched. We give an overview here.
Clustering sequences of data using mixtures of hidden Markov models was proposed by Smyth [Smy97], and has been further investigated by Li and Biswas [LB99]. A mixture of hidden Markov models is a mixture model in which the mixing components are hidden Markov models. Such a model can be trained on unlabeled data, and finds clusters of entire sequences of data: it discovers which sequences are similar to each other. This is in contrast to the supervised HMM classification experiments we described in the last section, in which an expert human explicitly labeled this similarity by grouping training data into pre-defined categories.

The mixture of hidden Markov models, and other closely related models, have been used in computer vision for unsupervised clustering of data. Clarkson et al. [CP99] examine the understanding of purposeful motion in the context of wearable computing devices. The idea is to have a camera which is attached to the human subject, and which performs unsupervised, hierarchical modeling of the situations in which the wearer finds themselves. Their results are promising, and show that unsupervised learning can provide fruitful classes of video data. Alon et al. [ASKP03] have also recently examined such models for action recognition. Darrell, Essa and Pentland [DEP96] examine the same kind of models, but use dynamic time warping (DTW) instead of hidden Markov models as mixture components. They also use these models for generation of facial expressions. They do not explicitly model actions and utilities, however.

2.3 Purposeful Modeling of Human Action

The previous section examined automatic recognition of human actions. However, most of this work takes the slant that the recognition itself is the goal. Clearly, it is what to do with the recognised states which is of most interest. Here, we discuss work on integrating the recognition of human action with decision making. Most of this work lies between computer vision and human-computer interaction (HCI), in which computer vision systems provide information about the state or actions of a human user of a computer application. This information is used by the application to tailor its interface. While most current HCI systems make use only of human interface actions, such as mouse or keyboard actions, some have begun to integrate visual and auditory information. Clearly, if we want to leave keyboard and mouse behind and progress to more natural interfaces, these multi-modal cues are of utmost importance.

Perhaps the simplest application of computer vision for user interfaces is in using head gestures for controlling mouse actions, as described by Kjeldsen [Kje01]. Leibe et al. describe a system for interaction with the "perceptive workbench", a virtual reality device in which a user can manipulate objects on a virtual table using their hands [LSR+00]. The device locates and tracks the user with computer vision methods. Pentland reviews a number of systems which make use of facial or gestural inputs [Pen00].

Cassell has stressed the importance of recognising and generating both verbal and non-verbal signals for embodied conversational agents (ECAs) [CSPC00]. ECAs are built upon a conversational architecture: they are designed based upon the psychology of human conversational behaviors. This involves the high-level modeling of communicative and domain goals, the understanding and synthesis of propositional and interactional information, and low-level issues related to recognition of speech and gesture, and to timing.
Cassell's group has built an ECA named Rea, who interacts with a human as a real-estate agent (hence the name). Rea's visual inputs are the positions and orientations of head and hands from color "blob" trackers.

Jebara and Pentland [JP98] presented action-reaction learning, in which a dynamic model was learned from observing video of two persons interacting. The model was then used in a reactive way to simulate interactions for a single user. The features are color blobs of head and hands, and the modeling of the joint likelihood of the two persons' features is accomplished using a variant of the expectation-maximization algorithm. Our work bears a close resemblance to action-reaction learning, but generalizes it by adding high-level context states, actions and utilities. Action-reaction learning is designed strictly for imitation-type tasks, while our model is applicable to interactions in more general contexts, in which plans need to be developed autonomously.

A number of robotic systems have made use of computer vision for interactions with humans. Fong et al. survey some of the most recent work in building socially interactive robots [FND03]. Breazeal's robot Kismet interacts with people on an emotional level, using stereo vision to detect presence, proximity and color [BS99]. Thrun's group has been developing robotic guides. The latest embodiment, Pearl, is a robotic helper for the elderly that uses stereo vision [MPR+02]. Jose is a robotic waiter that uses stereo vision for navigation and person detection to accomplish its task of delivering food at a social gathering [EHL+02]. HOMER, Jose's successor, is a messenger robot that interacts with people [EHL03].

The model we focus on in this thesis is the partially observable Markov decision process, or POMDP. POMDPs were applied to the problem of active gesture recognition in [DP96], in which the goal is to model unobservable, non-foveated regions. POMDPs have also been applied to the dialogue management problem [PH00, RPT00, ZCMG01] for human-computer and human-robot interaction. This work, like Cassell's work on ECAs, models some of the basic mechanics underlying dialogue, such as turn taking, channel control, and signal detection. These agents typically use very few (if any) manually specified facial expressions or gestures.

Decision theoretic models have yet to make solid contact with computer vision, probably because of the lack of optimal or efficient solution algorithms for continuous inputs. However, many approximation techniques have now been studied [Hau00], and we believe they are appropriate for use in modern computer vision systems. The advantage of using a decision theoretic model is its ability to make provably rational choices in situations involving uncertainty. This thesis presents a method which promises to begin filling this important gap in the inter-disciplinary work bridging computer vision, decision theory and human-computer interaction. Its main strength, and what distinguishes it from other research in this area, is that it presents the method as a unified whole, using the paradigm of POMDPs to combine computer vision with actions for user interfaces.

Chapter 3

Modeling Facial Displays

This chapter describes the modeling of human facial motion from video. We are interested in computational models which can recognize and adapt to facial displays and the relationships between the displays and other states of context.
These context states, however, exist in a temporally and spatially abstract space as compared to the raw video signal. Our models must therefore perform spatio-temporal abstraction in order to find appropriate correlations between facial displays and context. We first discuss the spatial abstraction methods, which summarize video frames and spatio-temporal derivatives of video frames with projections to a set of basis functions. We then describe how the resulting spatial summarizations are temporally abstracted using dynamic Bayesian networks.

We consider that the instantaneous state of a human's face during a facial display consists of a pose, or configuration, and an instantaneous motion, or dynamics. Dynamics and configuration complement one another, and are akin to velocity and position in classical dynamics. They can both be useful in describing the way a human face moves. In particular, modeling the configuration of the face disambiguates temporal sequences of dynamics, while the dynamics of the face can predict future configurations. Configuration and dynamics are both useful for tracking.

The pose data is simply an image containing the face, while the dynamics data are the spatio-temporal derivatives over the same region from one video frame to the next. Thus, the measurements we start from contain simultaneous descriptions of the instantaneous configuration and dynamics of the face. The spatio-temporal derivatives induce a dense optical flow field, by assuming that the image intensity structure is locally constant across short periods of time. The optical flow field is a projection of the 3D scene velocity to the image plane, and gives the motion in the image at each pixel. The same method is used for spatial abstraction of both the configuration and dynamics of the face. Image regions and optical flow fields are each projected to a set of basis functions, yielding finite dimensional feature vectors.

The distributions of each of these feature vectors (for configuration and dynamics) are modeled by a mixture model. The states of the mixture model correspond to classes of instantaneous configuration and dynamics of the face in the training data. For example, the configuration classes may correspond to characteristic facial poses, such as the apex of a smile, the apex of an eyebrow raise, or a neutral expression. The dynamics classes are motion classes, and may correspond to, for example, motion during expansion of the face to a smile, or during contraction of the eyebrows from raised to neutral.

The resulting configuration and dynamics mixtures are then each temporally connected in a Markov chain, and the two chains are coupled. This coupled Markov model describes a temporal sequence of facial movement with two interacting distributions over instantaneous pose and dynamics states. Keeping the two chains separated, but coupled, allows additional independencies to be leveraged in the temporal chains. We temporally abstract finite length sequences in this coupled chain by modeling it as the output distribution of another mixture model. This high-level mixture model, then, represents entire sequences of video with a distribution over a finite set of facial display classes. For example, one such class may be a smiling facial display. The corresponding coupled Markov chain would represent sequences of instantaneous facial poses and flows during a typical smiling display. Typically, there will be a small number of classes (< 10), but the model does not preclude more.
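To make this architecture concrete, the following schematic sketch samples a sequence from the generative process just described: a high-level display class selects a coupled pair of Markov chains over configuration and dynamics states, each of which emits a Gaussian feature vector per frame. This is an illustration of the structure only, not the learned model itself; the parameter layout, the uniform class prior, and the exact direction of the coupling between the chains are simplifying assumptions:

    import numpy as np

    def sample_display(params, T, rng=np.random.default_rng(0)):
        """Sample T frames of (dynamics, configuration) feature vectors from a
        randomly chosen facial display class. params is a list with one entry
        per display class; the dict layout used here is hypothetical."""
        d = rng.choice(len(params))                      # display class (uniform prior)
        p = params[d]
        x = rng.choice(len(p["pi_x"]), p=p["pi_x"])      # dynamics (motion) state
        c = rng.choice(len(p["pi_c"]), p=p["pi_c"])      # configuration (pose) state
        frames = []
        for _ in range(T):
            zd = rng.normal(p["mu_x"][x], p["sd_x"][x])  # flow-field feature vector
            zc = rng.normal(p["mu_c"][c], p["sd_c"][c])  # pose feature vector
            frames.append((zd, zc))
            # coupled transitions: each chain conditions on the other chain's state;
            # p["A_x"][x, c] is a distribution over the next dynamics state, and
            # p["A_c"][c, x_new] over the next configuration state
            x_new = rng.choice(len(p["pi_x"]), p=p["A_x"][x, c])
            c = rng.choice(len(p["pi_c"]), p=p["A_c"][c, x_new])
            x = x_new
        return d, frames

Learning inverts this process: given only the observed feature vectors and context, the parameters of all levels are estimated together, which is how the display classes are discovered without labels.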
Throughout most of this chapter, we will assume that an image region has been selected in each video frame. When modeling facial expression, the human face must be located throughout the video sequence. Specifying these regions involves some method for tracking the region of interest. We discuss the tracking method we have used at the end of this chapter.

This chapter will be structured as follows. The next section describes the Zernike polynomial basis set, which is used for spatial abstraction of images and flow fields. Section 3.2 describes spatial abstraction of the dynamics of the face, including computation of optical flow, projections to basis functions, and modeling of the resulting feature vector using normal distributions. Section 3.3 then describes spatial abstraction of the facial pose, which is very similar to that for the dynamics, but is slightly simpler because it does not involve the computation of optical flow. Section 3.4 briefly discusses some issues related to changes of lighting. These are mainly encountered in more unconstrained environments than we use in our experiments, but are important considerations for future work. Section 3.5 discusses the temporal modeling of sequences of facial pose and dynamics using coupled Markov models, and Section 3.6 describes the temporal abstraction of sequences using mixture models. This section also discusses how to integrate context measurements. We show how the context states can be integrated into the model as conditioning or being conditioned by the mixture components. Section 3.6.4 touches on the issue of temporal segmentation. Finally, Section 3.7 discusses the tracking method we use to locate the face in the video stream, as well as considerations about how to fully integrate the tracking process into the Bayesian model we use for facial motion representation.

3.1 Zernike Polynomial Basis Functions

Spatial abstraction of flow fields and images involves finding appropriate subspaces in which the motions and poses we are trying to categorize are sufficiently well separated. A standard approach to this problem is to compute a data-dependent subspace using principal components analysis, or PCA. This approach has been used for representing general 3-D objects [MN95], facial pose [TP91], and facial motion [FBYJ00]. However, there are two problems with this method. First, the dimensionality of the subspace must be manually specified. While principal components analysis finds a set of basis functions that span the entire observation space, it does not find which correspond to a useful subspace. Typically, the subspace is chosen as the one that accounts for the most variability in the data. However, other choices may be more conducive to the clustering we wish to perform. The second problem with PCA-based approaches is that a separate basis set must be computed and stored for every type of motion we wish to recognize.

We believe that a data independent subspace surmounts the two aforementioned problems. We choose a complete and orthogonal set of basis functions a priori, and use them for all our modeling methods. The advantage of data independence is that the basis can equally well be used for representing any motion. For example, while we learn classes of face motions in this thesis, our system could be easily applied to gestures or gaits, without re-computation of a set of basis functions. Leaving any commitment to particular motions to higher level processing is an advantage in many cases.
The usual objection to this type of modeling is that we do not know which set of basis functions is best for a particular modeling task. However, our method includes a feature weighting technique which learns the subset of basis functions which are most useful for the classification task. Thus, as long as we use a complete set of basis functions, we should be able to learn the subset which is best, while only storing the feature weights for each type of motion we wish to recognize, not complete basis sets. We believe that this is an improvement over PCA-based approaches. Further comparison of the PCA approach with the Zernike basis functions can be found in [HL00].

Data independent basis sets have been used for shape representation in computer vision for many years. In particular, analysis of images using moments has been well studied, including geometric moments, rotational moments, Legendre moments, Zernike and pseudo-Zernike moments [Tea80, TC88]. Teh & Chin show that Zernike and pseudo-Zernike moments perform best in terms of noise insensitivity and image reconstruction [TC88]. Zernike moments have also been proposed as shape descriptors for the MPEG-7 video coding standard [KK00], although angular-radial transformations (ARTs) are currently used [Bob01]. ART bases, however, are not orthogonal, making reconstruction (and visualisation) not possible. In this work, we use the basis of Zernike polynomials as our a priori basis set [vZ34]. The Zernike polynomial basis provides a rich and data independent description of optical flow fields and grayscale images which can be seen as an extension of the affine basis. Zernike polynomials have been used in the vision community for recognising hand poses [HSJ95], handwriting and silhouettes [BSA91], shape-based image retrieval [ZL02], and optical flow fields [HL00].

Zernike polynomials are an orthogonal set of complex polynomials defined on the unit disk [PR89]. The lowest two orders of Zernike polynomials correspond to the standard affine basis. Higher orders represent higher spatial frequencies. The basis is orthogonal over the unit disk, such that each order can be used as an independent characterization of a 2D function, and each such function has a unique decomposition in the basis. Zernike polynomials are expressed in polar coordinates as a radial function, R_n^m(\rho), modulated by a complex exponential in the angle, \phi, as follows:

    U_n^m(\rho, \phi) = R_n^m(\rho) e^{im\phi}    (3.1)

with radial function, R_n^m(\rho), given by

    R_n^m(\rho) = \sum_{l=0}^{(n-|m|)/2} \frac{(-1)^l (n-l)!}{l! \left(\frac{n+|m|}{2}-l\right)! \left(\frac{n-|m|}{2}-l\right)!} \rho^{n-2l}

for n and m integers with n \geq |m| \geq 0 and n - m even. The first few radial basis functions are therefore:

    R_0^0 = 1    R_1^1 = \rho    R_2^2 = \rho^2    R_2^0 = 2\rho^2 - 1    R_3^1 = 3\rho^3 - 2\rho    R_3^3 = \rho^3

The indices, n and m, are indicators of the spatial frequency of the Zernike basis function. The larger the value of n, the higher the spatial frequency in the radial direction. Similarly, the larger the value of m, the higher the spatial frequency in the angular direction. For each n, polynomials are defined for a selection of values for m in the range {0, n}. Thus, each combination of radial and angular spatial frequencies is available.

The Zernike polynomials are orthogonal on the unit disk and obey the following orthogonality relation:

    \int_0^{2\pi} \int_0^1 U_n^m(\rho,\phi)\, U_{n'}^{m'*}(\rho,\phi)\, \rho\, d\rho\, d\phi = \frac{\pi}{n+1}\, \delta_{nn'}\, \delta_{mm'}    (3.2)

where \delta_{nn'} = 1 if n = n', and 0 otherwise.
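As a sanity check on these definitions, the radial polynomials are easy to implement and compare against the closed forms listed above. The following short NumPy routine is our own illustration, not the implementation used in this thesis; it is reused in the projection sketch later in this chapter:

    import numpy as np
    from math import factorial

    def zernike_radial(n, m, rho):
        """Radial polynomial R_n^m(rho), for integers n >= |m| >= 0, n - m even."""
        m = abs(m)
        assert n >= m and (n - m) % 2 == 0
        rho = np.asarray(rho, dtype=float)
        out = np.zeros_like(rho)
        for l in range((n - m) // 2 + 1):
            coeff = ((-1) ** l * factorial(n - l)
                     / (factorial(l)
                        * factorial((n + m) // 2 - l)
                        * factorial((n - m) // 2 - l)))
            out += coeff * rho ** (n - 2 * l)
        return out

    rho = np.linspace(0.0, 1.0, 11)
    assert np.allclose(zernike_radial(2, 0, rho), 2 * rho ** 2 - 1)        # R_2^0
    assert np.allclose(zernike_radial(3, 1, rho), 3 * rho ** 3 - 2 * rho)  # R_3^1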
The orthogonality of the basis allows the decomposition of an arbitrary function on the unit disk, f(\rho, \phi), in terms of a unique combination of odd and even Zernike polynomials. That is, [PR89]

    f(\rho, \phi) \approx \sum_{m=0}^{M} \sum_{n=m}^{N} \left[ A_n^m \cos(m\phi) + B_n^m \sin(m\phi) \right] R_n^m(\rho)    (3.3)

which can be used to approximate a sufficiently smooth function f(\rho, \phi) to any degree of accuracy by making N and M large enough. Using the orthogonality relation (Equation 3.2), the coefficients A_n^m and B_n^m can be obtained as

    \begin{bmatrix} A_n^m \\ B_n^m \end{bmatrix} = \frac{\epsilon_m (n+1)}{\pi} \int_0^1 \int_0^{2\pi} f(\rho, \phi) \begin{bmatrix} \cos(m\phi) \\ \sin(m\phi) \end{bmatrix} R_n^m(\rho)\, \rho\, d\phi\, d\rho    (3.4)

where

    \epsilon_m = \begin{cases} 1 & \text{if } m = 0 \\ 2 & \text{otherwise} \end{cases}

Zernike polynomials are defined on a disk, and so an elliptical area must be identified which will be projected onto the Zernike basis. An elliptical region allows for a more flexible image region than a simple disk, and corresponds roughly to the shape of the human face. Although we only consider ellipses with axes aligned with the image axes, it would be possible to allow for rotations of the ellipse in future work. Our tracking method for updating this region in a video stream is described in Section 3.7.

Once a scale and centroid have been identified for each flow image, a 2D function (either a flow field or an image region) f(x, y) is projected onto the Zernike basis using the discrete equivalent of Equation 3.4:

    \begin{bmatrix} A_n^m \\ B_n^m \end{bmatrix} = \frac{\epsilon_m (n+1)}{\pi} \sum_x \sum_y f(x, y) \begin{bmatrix} \cos(m\phi) \\ \sin(m\phi) \end{bmatrix} R_n^m(\rho)    (3.5)

where \phi = \arctan(y'/x'), \rho = \sqrt{x'^2 + y'^2} \leq 1, x' = (x - x_c)/r_x, y' = (y - y_c)/r_y, and {x_c, y_c} and {r_x, r_y} are the centroid and scales of the region of interest.

We can write the projection equation (3.3) for a 2D function f(x, y) as a matrix equation if we write f as an N x 1 column vector by reading pixels from f row-wise from top to bottom, where N is the number of pixels in the function:

    f = P z    (3.6)

The Zernike basis functions are the columns of the N x N_z matrix P (also arranged row-wise), and the projection coefficients are in the N_z x 1 column vector z, where N_z is the total number of Zernike polynomial basis functions. The columns of P (and the rows of z) correspond to the Zernike polynomials in order of increasing n and m, alternating between A_n^m and B_n^m, such that columns 0, 1, 2, 3, ... are Zernike polynomials A_0^0, A_1^1, B_1^1, A_2^0, ....
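The discrete projection of Equation 3.5 is straightforward to implement. The sketch below, reusing zernike_radial from above, computes A_n^m and B_n^m for a region over an axis-aligned ellipse. It is our own illustration, and the normalization by the pixel area on the unit disk (the 1/(r_x r_y) factor) is our assumption about how the sum approximates the integral of Equation 3.4:

    def zernike_project(f, xc, yc, rx, ry, n, m):
        """Project a 2D function f (a grayscale region or one flow component)
        onto the Zernike polynomial of order (n, m) over the ellipse centred
        at (xc, yc) with scales (rx, ry). Returns the pair (A_n^m, B_n^m)."""
        H, W = f.shape
        y, x = np.mgrid[0:H, 0:W]
        xp = (x - xc) / rx                        # map the ellipse to the unit disk
        yp = (y - yc) / ry
        rho = np.hypot(xp, yp)
        phi = np.arctan2(yp, xp)
        inside = rho <= 1.0
        eps = 1.0 if m == 0 else 2.0              # epsilon_m from Equation 3.4
        R = zernike_radial(n, m, rho[inside])
        norm = eps * (n + 1) / (np.pi * rx * ry)  # 1/(rx*ry): pixel area on the disk
        A = norm * np.sum(f[inside] * np.cos(m * phi[inside]) * R)
        B = norm * np.sum(f[inside] * np.sin(m * phi[inside]) * R)
        return A, B

Sampling each basis function on the same pixel grid and stacking the samples as columns of P gives the matrix form of Equation 3.6.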
3.2 Modeling Facial Dynamics

We wish to classify the dynamics of the human face into a discrete set of classes, starting from image derivatives, \nabla f. However, the image derivatives by themselves are not sufficient to describe image motion, because there is a many-to-one correspondence between derivatives and motion. We can constrain the derivatives using the hypothesis that the intensity structure of the scene is locally stable across short time intervals [HS81b]. This allows us to estimate the way things are moving, or the optical flow, in the image plane between frames, which is what we want to classify. Optical flow is the projection of the 3D velocity of objects in the scene relative to the camera onto the image plane.

Classification of optical flow fields directly is difficult due to the high dimensionality of the signal. Instead, we classify optical flow in a subspace of flow fields defined by their projections to the Zernike polynomial basis. Thus, we will be finding clusters of subspace optical flow fields by classifying Zernike basis projections. However, simply computing optical flow and subsequently projecting to the Zernike basis does not make use of the uncertainties inherent in the estimation process. We describe an efficient Bayesian solution for directly computing probability distributions over basis coefficients from image gradients using brightness constancy. We show how the method's incorporation of the flow uncertainties improves the estimation of the parametrized flow by discounting image regions with less certain flow vectors. Our mixture model includes weights on the basis coefficients, which describe the effectiveness of each feature at achieving good clusters. Our probabilistic projections ensure that the uncertainties inherent in the calculation of optical flow are propagated to the cluster membership variables, leading to a more robust clustering.

This section will first describe the extraction of optical flow from video streams. Section 3.2.2 then describes our method for projecting optical flow fields to the Zernike basis, including a method for weighting or selecting features of the projection basis. Some simple experiments are also described, which show the usefulness of the probabilistic projection method.

3.2.1 Optical Flow

As will be shown in Section 3.6, our estimation of optical flow is tightly integrated with the projection to a basis set, and the subsequent classification. Here we describe the underlying mechanics of optical flow estimation. There are many ways to estimate optical flow, usually based on the assumption that local image regions maintain their intensity structure over small periods of time [HS81b]. If some small part of the world appears at time t at some position in the image, then we assume it will appear the same at t + 1, albeit in a different position in the image. The difference in position is the (true) optical flow, v. The motion in the image is denoted by a vector optical flow field, v(x, y), which describes the direction and magnitude of the motion of the image pixel at (x, y). That is, if some image region, at (x, y), undergoes an optical flow of v(x, y), the same region can be found after the flow at (x + v_x(x, y), y + v_y(x, y)), where v_x, v_y are the components of v in the x and y directions, respectively:

    I(x, y, t) \approx I(x + v_x \delta t, y + v_y \delta t, t + \delta t)    (3.7)

The assumption (3.7) is known as brightness constancy, and can be used to derive optical flow estimation techniques based on phase, on correlation, or on gradients [BB95]. Gradient based techniques use a Taylor expansion of Equation 3.7, and drop terms above first order [LK81]. Writing the image brightness function as f, such that the spatial and temporal derivatives of the image are f_S = {f_x, f_y} and f_T, respectively, the brightness constancy assumption is that¹

    f_T + f_S v = 0    (3.8)

The equation says that the change of brightness at a pixel over a time interval is exactly the brightness difference between the current pixel and a pixel separated from the current pixel by v, assuming a planar function separating the two pixels.

¹These variables are fields over all N pixels in the image: f_T is an N x 1 column matrix, f_S = [f_x f_y] is an N x 2N matrix (where f_x is an N x N matrix with the horizontal spatial derivative f_x along the diagonal, and similarly for f_y), and v = [v_x v_y]' is a 2N x 1 matrix with the components of horizontal and vertical flow.
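As a small numerical illustration of the constraint (our own, with simple finite differences standing in for whatever derivative filters a real implementation would use), the residual of Equation 3.8 can be evaluated for any candidate flow field:

    def constancy_residual(I0, I1, v):
        """Per-pixel residual f_T + f_S v of Equation 3.8, for two frames
        I0, I1 of shape (H, W) and a dense flow field v of shape (H, W, 2)."""
        fx = np.gradient(I0, axis=1)    # horizontal spatial derivative
        fy = np.gradient(I0, axis=0)    # vertical spatial derivative
        ft = I1 - I0                    # temporal derivative
        return ft + fx * v[..., 0] + fy * v[..., 1]

A flow field consistent with brightness constancy drives this residual toward zero; many fields do so equally well wherever the image gradient vanishes or the flow is normal to it, which is the ambiguity discussed next.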
Since we are projecting to a set of basis functions, we are implicitly constraining the flow to be globally smooth, and so do not need any further regularization.

We also know that the brightness constancy assumption is often violated. Occluding edges and reflections are some obvious examples, which can be dealt with using robust methods, for example [BA96]. However, the primary sources of violations of the assumption are failures of the planarity assumption, camera noise, and errors in the derivative calculations. Simoncelli [SAH91] has described how to account for such violations in a Bayesian framework, in which the variability due to each error source is explicitly included in the model. We describe the noise as arising from two independent zero-mean Gaussian noise sources, $n_1$ and $n_2$, which account for failures of the planarity assumption, and errors in the temporal derivative measurements, respectively. The brightness constancy assumption thus becomes [SAH91]

$$f_T + f_s(v - n_1) = n_2, \qquad n_1 \sim \mathcal{N}(0, \Lambda_1), \quad n_2 \sim \mathcal{N}(0, \Lambda_2). \qquad (3.9)$$

We can assume that the errors in the spatial derivatives are minimal compared to those in the temporal derivatives, since the temporal sampling is much coarser than the spatial sampling. Equation 3.9 thus describes the conditional probability $P(f_T|v, f_s)$:

$$P(\nabla f|v) \propto \mathcal{N}(f_T; -f_s v,\ f_s\Lambda_1 f_s' + \Lambda_2), \qquad (3.10)$$

where $\Lambda_1 = \sigma_1 I_N$, $\Lambda_2 = \sigma_2 I_N$ ($I_N$ is the $N \times N$ identity). The important thing to notice about this distribution is the dependence of the variance on the spatial derivative, $f_s$. The magnitude of the spatial derivative, $\|f_s\|^2$, is the image contrast, which plays an important role in determining the distribution of flow fields [SAH91]. Optical flow is difficult to estimate (and so has high variance) in regions of low contrast. Using Bayes' rule, we can compute the probability distribution over the flow fields as

$$P(v|\nabla f) = \frac{P(\nabla f|v)P(v)}{P(\nabla f)}, \qquad (3.11)$$

where $P(v)$ is a prior distribution over flow vectors. This distribution is useful if one is interested in the optical flow field itself, in which case a zero-mean prior $P(v) \sim \mathcal{N}(v; 0, \Lambda_p)$ is used in Equation 3.11. Since the likelihood and prior are both Gaussian, the posterior is also Gaussian, with mean and covariance [SAH91]

$$\mu_v = -\Lambda_v f_s'(f_s\Lambda_1 f_s' + \Lambda_2)^{-1}f_T \qquad (3.12)$$

$$\Lambda_v = \left[ f_s'(f_s\Lambda_1 f_s' + \Lambda_2)^{-1}f_s + \Lambda_p^{-1} \right]^{-1} \qquad (3.13)$$

If this solution is computed at each pixel independently, then the mean, $\mu_v$, will be approximately the component of the optical flow normal to the local spatial gradient. Estimates of the true optical flow field can be obtained by including a regularizer to enforce smoothness [HS81b]. A simple regularization technique is to simply average over small regions in the image, such that the mean and variance are

$$\mu_v = -\Lambda_v \sum_i w_i f_{si}'(f_{si}\Lambda_1 f_{si}' + \Lambda_2)^{-1}f_{ti}$$

$$\Lambda_v = \left[ \sum_i w_i f_{si}'(f_{si}\Lambda_1 f_{si}' + \Lambda_2)^{-1}f_{si} + \Lambda_p^{-1} \right]^{-1}$$

where the sums are over small image regions around each pixel, $f_{si}$ and $f_{ti}$ are the spatial and temporal derivatives at pixel $i$, and $w_i$ is a weighting function such that points further away are weighted less. We refer to this as the Simoncelli method in the following, and it is what we will use to compute optical flow fields where needed.

However, we are concerned in this work with interpreting the optical flow field as a distribution over a small, temporally and spatially abstract set of discrete states. As we will see in the next section, this will mean replacing the prior over flow fields with a conditional distribution over flow fields given the high-level discrete states. We further parameterize this distribution by projecting to our basis set.
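A sketch of the Simoncelli method as just described, using a box average in place of a general weighting $w_i$. The parameter values and names are illustrative assumptions, not the thesis's implementation; the loop form is written for clarity rather than speed.

```python
import numpy as np

def simoncelli_flow(fx, fy, ft, s1=0.08, s2=1.0, lam_p=10.0, radius=2):
    """Per-pixel posterior mean and covariance of the flow (Equations 3.12-3.13),
    with a local box average standing in for the weighted sum over w_i.
    Each pixel contributes precision g g' / (s1 * |g|^2 + s2)."""
    H, W = ft.shape
    mu = np.zeros((H, W, 2))
    cov = np.zeros((H, W, 2, 2))
    prior_prec = np.eye(2) / lam_p               # zero-mean prior N(0, lam_p I)
    for y in range(radius, H - radius):
        for x in range(radius, W - radius):
            A = prior_prec.copy()
            b = np.zeros(2)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    g = np.array([fx[y + dy, x + dx], fy[y + dy, x + dx]])
                    var = s1 * (g @ g) + s2      # f_s L1 f_s' + L2, a scalar per pixel
                    A += np.outer(g, g) / var
                    b += -g * ft[y + dy, x + dx] / var
            cov[y, x] = np.linalg.inv(A)
            mu[y, x] = cov[y, x] @ b
    return mu, cov
```

Note how low-contrast pixels (small $\|g\|$) contribute little precision, so their flow estimates come out with large variance, exactly the behavior the probabilistic projections below exploit.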
3.2.2 Projections of Optical Flow Fields

The lowest orders of the Zernike basis, when used to describe optical flow fields, correspond to the affine basis, which is capable of representing simple planar motions such as translations, rotations, and expansions. The next order polynomials correspond to extensions of the affine basis, roughly the planar projections of yaw, pitch and roll. In particular, Black & Yacoob found these next orders to be particularly useful for modeling motion around the mouth region [BY97]. Equation 3.5 applied to the horizontal flow field, $v_x(x,y)$, gives the projection coefficients ${}^uA_n^m$ and ${}^uB_n^m$:

$$\{{}^uA_n^m, {}^uB_n^m\} = \frac{\epsilon_m(n+1)}{\pi} \sum_x \sum_y v_x(x,y) \left\{ \begin{array}{c} \cos(m\phi) \\ \sin(m\phi) \end{array} \right\} R_n^m(\rho) \qquad (3.14)$$

A similar set of equations is obtained for the vertical flow estimates, ${}^vA_n^m$ and ${}^vB_n^m$. Figure 3.1 shows some example flow fields reconstructed from different orders (values of $n$ and $m$) of Zernike polynomials. Higher orders of Zernike polynomials result in flow fields with higher spatial frequencies, representing more complex motions. The flows can be reconstructed from the coefficients using Equation 3.3, as shown for an example flow field in Figure 3.2. As we reconstruct with more coefficients, we are including higher spatial frequencies, leading to a more accurate reconstruction of the original. As in Equation 3.6, we can represent the optical flow field projections as $v = Mz$, where

$$v = \begin{bmatrix} v_x \\ v_y \end{bmatrix}, \qquad M = \begin{bmatrix} P & 0 \\ 0 & P \end{bmatrix}, \qquad z = \begin{bmatrix} z_x \\ z_y \end{bmatrix} \qquad (3.15)$$

The columns of $P$ are the basis vectors and $z_x, z_y$ are the Zernike coefficients for horizontal and vertical flow, respectively: $\{{}^uA_n^m, {}^uB_n^m\}$ and $\{{}^vA_n^m, {}^vB_n^m\}$. In practice, $M$ will be some subset of the Zernike basis vectors, the remaining variance in the flow fields being attributed to zero-mean Gaussian noise. Thus, we write $v = Mz + n_p$, where $n_p \sim \mathcal{N}(0, \Lambda_p)$, and so

$$P(v|z) = \mathcal{N}(v; Mz, \Lambda_p) \qquad (3.16)$$

The noise, $n_p$, is a combination of three noise sources: the reconstruction error (energy in the higher order moments not in $M$), the geometric error (due to discretization of a circular region), and the numerical error (from discrete integration) [LP98]. The choice of a subset of basis elements to use will depend on what the projections are being used for. We discuss a consistent method for making this choice in Section 3.2.4.

Figure 3.1: Example flows generated from ZPs corresponding to the indicated subsets of the feature dimensions.

Figure 3.2: Reconstructing flow fields from Zernike projections. Shown are three reconstructions, using increasing numbers of coefficients from left to right: 3 ZPs or affine ($n = m = 1$), 10 ZPs ($n = m = 3$) and 128 ZPs ($n = m = 14$).
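The matrix forms of Equations 3.6 and 3.15 can be sketched as follows, reusing `radial_poly` from the sketch in Section 3.1. The ordering convention and zeroing outside the ellipse follow the text; everything else (names, the `orders` argument) is mine.

```python
import numpy as np

def zernike_basis_matrix(shape, centroid, scales, orders):
    """Stack sampled Zernike polynomials as the columns of P (Equation 3.6).
    `orders` is a list of (n, m, kind) with kind 'A' (cosine) or 'B' (sine),
    in increasing n and m as in the text. Pixels outside the ellipse are zeroed."""
    H, W = shape
    y, x = np.mgrid[0:H, 0:W]
    xc, yc = centroid
    rx, ry = scales
    xp, yp = (x - xc) / rx, (y - yc) / ry
    rho, phi = np.hypot(xp, yp), np.arctan2(yp, xp)
    inside = (rho <= 1.0).ravel()
    cols = []
    for n, m, kind in orders:
        ang = np.cos(m * phi) if kind == 'A' else np.sin(m * phi)
        col = (radial_poly(n, m, rho) * ang).ravel()
        col[~inside] = 0.0
        cols.append(col)
    return np.stack(cols, axis=1)                 # N x Nz

def flow_basis_matrix(P):
    """Block-diagonal M of Equation 3.15, acting on stacked (v_x, v_y)."""
    Z = np.zeros_like(P)
    return np.block([[P, Z], [Z, P]])             # 2N x 2Nz
```

Given `M`, a deterministic subspace projection of a flow field is just the least-squares fit `z = np.linalg.lstsq(M, v, rcond=None)[0]`; the probabilistic projection of the next section replaces this with a full distribution over `z`.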
3.2.3 Probabilistic Projections

We wish to use these flow field projections for classification tasks, in which some flow field, $v$, is classified as originating from one of a set of causes, $X$. In the remainder of this section, we show how to perform this classification using the maximization of a probability function over $X$. We only discuss classifying individual flow fields in this section, but the analysis is also used in classifying sequences of flow fields, as will be shown in Section 3.6.

Given a set of images, $I_1 \ldots I_{N_t}$, we wish to assign each of the $N_t - 1$ flow fields one of $N_x$ cluster labels $X_1 \ldots X_{N_x}$, such that the optical flows with the same label are as similar as possible to each other, but as dissimilar as possible from the flow fields with any other label. For example, if describing motion over the human face, states of $X$ may correspond to instantaneous motion fields during smiling (mouth expansion), frowning (contraction between the eyes), or talking (lip motion). The determination of $N_x$ is discussed in Section 5.3.

Figure 3.3 shows our model represented as a Bayesian network. We consider the measurements to be the image spatial ($f_s = \{f_x, f_y\}$) and temporal ($f_t$) derivatives, calculated using a centered difference method. We can express classification of an image motion as the maximization of the probability distribution over the classes, $X$, given the spatial and temporal derivatives,

$$P(X|\nabla f, \Theta) \propto P(\nabla f|X, \Theta)P(X|\Theta), \qquad (3.17)$$

where $\Theta$ are the parameters of the model, and $\nabla f = \{f_x, f_y, f_t\}$.

Figure 3.3: Bayesian network for the mixture of Gaussians over optical flow fields with feature weighting. Shaded nodes are observed or fixed (known), while unshaded nodes are unknown random variables. Boxes are fixed hyper-parameters. The dashed line delineates the priors for feature weighting. $X \in 1 \ldots N_x$ are discrete motion classes, $Z$ is the Zernike feature vector (projection of the optical flow field), $V$ is the optical flow field, $f_s$ are the spatial derivatives, and $f_t$ is the temporal derivative. $\mu_z, \Lambda_z$ are the parameters of the mixture of Gaussians over the $Z$ vector space, and $T$ are the feature weights. $\Theta_X$ is the class probability parameter (a multinomial), and $\alpha_X$ is the parameter of the (conjugate) Dirichlet prior over $\Theta_X$. $\mu_*$ and $\Lambda_*$ are the mean and variance over all classes (of all the data).

Since we wish to classify optical flow fields, we expand the probability distribution over the classes, $X$, as

$$P(X|\nabla f, \Theta) = \int_v P(\nabla f|v, \Theta)P(v|X, \Theta)P(X|\Theta)$$

where we have assumed the image derivatives to be independent of the high level motion class given the optical flow. There are three terms in the integration. The prior over classes, $P(X|\Theta)$, is part of our model, parametrized with a multinomial $\Theta_{X,i} = P(X = i)$. The distribution over spatio-temporal derivatives conditioned on the flow, $P(\nabla f|v, \Theta)$, is estimated in a gradient-based formulation using the brightness constancy assumption, and is given by Equation 3.10.

We do not represent $P(v|X)$ directly in our model, but instead we parametrise this distribution using a probabilistic projection of $v$ to the basis of Zernike polynomials. As shown in Equation 3.16, this projection can be written as a distribution over $v$ given the projection coefficients, $z$: $P(v|z) \propto \mathcal{N}(v; Mz, \Lambda_p)$. We then parametrise the distribution over $z$ given $X$ with a normal $P(z|X) = \mathcal{N}(z; \mu_{z,x}, \Lambda_{z,x})$. We are assuming that the flow fields will be normally distributed in the space of the basis function projections. The Gaussian noise can be attributed to performance differences in the flow fields (on the part of the human being observed). We can now write down the likelihood of the image derivatives given the high-level motion class as

$$P(f_T|X, f_s, \Theta) = \int_{v,z} \mathcal{N}(f_T; -f_s v, \Lambda)\,\mathcal{N}(v; Mz, \Lambda_p)\,\mathcal{N}(z; \mu_{z,x}, \Lambda_{z,x}) \qquad (3.18)$$

where $\Lambda = f_s\Lambda_1 f_s' + \Lambda_2$. A more naive approach avoids the integrations over $z$ by first computing the mean optical flow field, $\mu_v$, using the Simoncelli method, projecting this field to the Zernike basis, and taking the resulting feature vector, $z$, as the input data to a classification scheme using $P(z|x)$ [HL00].
That is, the naive approach considers $P(X|\nabla f) \propto P(M'\mu_v|X, \Theta)P(X|\Theta)$. However, this approach ignores the variance information in the flow calculation, leading to less accurate results. For example, Figure 3.4 shows two frames from a video sequence of a person's face. There is significant motion upwards near and above the eyebrows and downwards along the sides of the jaw. The mean flow field, $\mu_v$, calculated using the method of Simoncelli [SAH91], is shown in Figure 3.4, for the image region of the subject's face as indicated. The certainty of the flow vectors (the trace of the inverse flow variances) is also shown in Figure 3.4 (brighter means the flow estimates are more certain). Large variance flows are prevalent in regions with little contrast since we are using a gradient based optical flow calculation. The jaw, forehead and the background wall are examples. A projection of these flow fields to a low dimensional basis will suffer because of these regions, unless the flow variances are taken into account. The bottom row in Figure 3.4 shows reconstructions of the flow fields from projections to the Zernike basis. On the left, we see a simple projection (dot product), while on the right is the projection which takes the variances on the flow into account.

Figure 3.4: Top: two sequential frames from a video sequence. Middle: flow field computed using the Simoncelli method (zero-mean prior) and certainty in the flow (trace of inverse variance of the flow). Bottom: naive and probabilistic reconstructions from the low dimensional basis.

Improvements can be seen in the low contrast area just above the subject's left eyebrow. The Simoncelli method flow field finds nearly horizontal flow vectors in this area, which bias the naive projections (incorrectly) towards the horizontal. The probabilistic projections discount this area, and the reconstructed flow field is less biased by these errors. We compare our probabilistic projection approach with the naive approach further in Section 3.2.5.

In the remainder of this section, we show how the integrations in Equation (3.18) can be performed analytically, leading to an efficient method for calculating $P(X|\nabla f)$, taking all variance information in the flow fields into account. We then show how to implement weights on the dimensions of the projections, and how our methods can be implemented in a multi-scale approach. Chapter 4 shows how the distribution $P(X|\nabla f)$ is used to learn the parameters of the model using the expectation-maximization algorithm.

Note that the model does not take violations of the brightness constancy assumption, such as occlusions, reflections, or transparent motions, into account. It is possible to extend the brightness constancy assumption to account for overall brightness changes [NY93]. We discuss this further in Section 3.4.

Computing the likelihood

Since all terms in Equation 3.18 are Gaussian distributions, we can perform the integrations over $v$ and $z$ analytically by successively completing the squares in $v$ and $z$ to obtain

$$P(f_T|X, f_s) = \frac{|\Lambda_v|^{1/2}|\hat{\Lambda}_{z,x}|^{1/2}}{(2\pi)^{N/2}|\Lambda|^{1/2}|\Lambda_p + \Lambda_v|^{1/2}|\Lambda_{z,x}|^{1/2}}\, e^{-\frac{1}{2}\left( \mu_{z,x}'\Lambda_{z,x}^{-1}\mu_{z,x} - \hat{\mu}_{z,x}'\hat{\Lambda}_{z,x}^{-1}\hat{\mu}_{z,x} + e \right)} \qquad (3.19)$$

where

$$K_v = (f_s'\Lambda^{-1}f_s + \Lambda_p^{-1})^{-1}$$
$$\Lambda_v = (f_s'\Lambda^{-1}f_s)^{-1}$$
$$\hat{\Lambda}_{z,x} = \left( \Lambda_{z,x}^{-1} + M'(\Lambda_p + \Lambda_v)^{-1}M \right)^{-1} \qquad (3.20)$$
$$\hat{\mu}_{z,x} = \hat{\Lambda}_{z,x}\left( \Lambda_{z,x}^{-1}\mu_{z,x} - M'(\Lambda_p + \Lambda_v)^{-1}\Lambda_v w \right)$$
$$e = f_T'\Lambda^{-1}f_T - w'K_v w$$
$$w = f_s'\Lambda^{-1}f_T$$

If we normalize this distribution over $X$, we can remove all terms which are independent of $X$, and obtain

$$P(f_T|X, f_s) \propto \frac{|\hat{\Lambda}_{z,x}|^{1/2}}{|\Lambda_{z,x}|^{1/2}}\, e^{-\frac{1}{2}\left( \mu_{z,x}'\Lambda_{z,x}^{-1}\mu_{z,x} - \hat{\mu}_{z,x}'\hat{\Lambda}_{z,x}^{-1}\hat{\mu}_{z,x} \right)} \qquad (3.21)$$

The mean, $\hat{\mu}_{z,x}$, and covariance, $\hat{\Lambda}_{z,x}$, are the parameters of the distribution of basis vector coefficients, $z$:

$$P(z|X, \nabla f) = (2\pi)^{-N_z/2}|\hat{\Lambda}_{z,x}|^{-1/2}\, e^{-\frac{1}{2}(z - \hat{\mu}_{z,x})'\hat{\Lambda}_{z,x}^{-1}(z - \hat{\mu}_{z,x})}$$
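A sketch of how these quantities might be computed, written in the flow-posterior form (the multi-scale variant, Equation 3.25 below), starting from a flow posterior $(\mu_v, \Lambda_v)$ such as the Simoncelli method returns. All function names and argument conventions are mine.

```python
import numpy as np

def posterior_coeffs(mu_v, cov_v, M, lam_p, mu_zx, cov_zx):
    """Posterior mean/covariance of the Zernike coefficients for one class
    (cf. Equations 3.20 and 3.25). Shapes: mu_v (2N,), cov_v (2N, 2N),
    M (2N, Nz), lam_p (2N, 2N) basis-noise covariance."""
    S = np.linalg.inv(lam_p + cov_v)                 # (Lambda_p + Lambda_v)^{-1}
    cov_hat = np.linalg.inv(np.linalg.inv(cov_zx) + M.T @ S @ M)
    mu_hat = cov_hat @ (np.linalg.solve(cov_zx, mu_zx) + M.T @ S @ mu_v)
    return mu_hat, cov_hat

def class_posterior(mu_v, cov_v, M, lam_p, means, covs, theta_x):
    """P(X | grad f), keeping only the X-dependent terms of Equation 3.21."""
    logps = []
    for mu_zx, cov_zx, p in zip(means, covs, theta_x):
        mu_hat, cov_hat = posterior_coeffs(mu_v, cov_v, M, lam_p, mu_zx, cov_zx)
        logps.append(0.5 * np.linalg.slogdet(cov_hat)[1]
                     - 0.5 * np.linalg.slogdet(cov_zx)[1]
                     - 0.5 * mu_zx @ np.linalg.solve(cov_zx, mu_zx)
                     + 0.5 * mu_hat @ np.linalg.solve(cov_hat, mu_hat)
                     + np.log(p))
    logps = np.array(logps)
    w = np.exp(logps - logps.max())                  # normalize in log space
    return w / w.sum()
```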
The expected value of the Zernike vector for a particular state of $X = x_i$, $\bar{z}_{x,i}$, is therefore

$$\bar{z}_{x,i} = \int_z z\, P(z|x_i, \nabla f) = \hat{\mu}_{z,i} \qquad (3.22)$$

and the expected value of $z$ given the entire model is

$$\bar{z} = \int_z z\, P(z|\nabla f) = \sum_i \int_z z\, P(z|x_i, \nabla f)P(x_i|\nabla f) = \sum_i \hat{\mu}_{z,i} P(x_i|\nabla f). \qquad (3.23)$$

The distribution over $X$ in this expression is computed as

$$P(x_i|\nabla f) = \frac{P(\nabla f|x_i)P(x_i)}{P(\nabla f)}$$

where $P(\nabla f|x_i)$ is given by Equation (3.19), and $P(\nabla f)$ normalizes the distribution. The expected flow field for a given state, $\bar{v}_x$, and for the whole model, $\bar{v}$, can be computed as

$$\bar{v}_x = M\hat{\mu}_{z,x} \qquad \bar{v} = M\bar{z} \qquad (3.24)$$

Multi-scale implementation

The brightness constancy assumption fails if the velocity $v$ is large enough to produce aliasing. Therefore, a multi-scale pyramid decomposition of the optical flow field must be used. This results in a distribution over the flow vectors, $P(v|\nabla f) \sim \mathcal{N}(v; \mu_v, \Lambda_v)$, where $\Lambda_v = (f_s'\Lambda^{-1}f_s)^{-1}$ and $\mu_v = -\Lambda_v f_s'\Lambda^{-1}f_T$ [SAH91]. Using these coarse-to-fine estimates, Equations 3.20 become

$$\hat{\Lambda}_{z,x} = \left( \Lambda_{z,x}^{-1} + M'(\Lambda_p + \Lambda_v)^{-1}M \right)^{-1}$$
$$\hat{\mu}_{z,x} = \hat{\Lambda}_{z,x}\left( \Lambda_{z,x}^{-1}\mu_{z,x} + M'(\Lambda_p + \Lambda_v)^{-1}\mu_v \right) \qquad (3.25)$$

The mean of this distribution, $\hat{\mu}_{z,x}$, is a weighted combination of the mean Zernike projection from the data ($M'\mu_v$) and the model mean, $\mu_{z,x}$. The weighting of the data involves the variance $\Lambda_v$, which, if we take $\Lambda_1$ to be diagonal with entry $\sigma_1$ and the scalar variance $\Lambda_2 = \sigma_2$, is

$$\Lambda_v^{-1} = \frac{f_s' f_s}{\sigma_1\|f_s\|^2 + \sigma_2}.$$

In regions of low contrast ($\sigma_1\|f_s\|^2 < \sigma_2$), the data weighting increases with the contrast ($\Lambda_v$ decreases as $\|f_s\|^2$ increases). In regions of high contrast, the $\sigma_1\|f_s\|^2$ term in the denominator normalizes the numerator, $f_s'f_s$, and so the data weighting stays constant. This seems intuitively reasonable, since we would not expect the data to continue being more heavily weighted once the contrast (or signal-to-noise ratio) has increased above the noise level.

3.2.4 Feature weighting

In general, we will not know which basis coefficients are the most useful for our classification task: which basis vectors should be included in $M$, and which should be left out (as part of $n_p$). Further, selecting a relevant subset of the basis vectors for clustering can lead to significant computational savings. We use the feature weighting techniques of [CdFGT03], which characterize the relevance of basis vectors by examining how the cluster means, $\mu_{z,x}$, are distributed along each basis dimension, $k = 1 \ldots N_z$. Relevant dimensions will have well separated means (large inter-class distance along that dimension), while irrelevant dimensions will have means which are all similar to the mean of the data, $\mu_*$. Note that the feature weights do not play a role in the recognition task, only in the learning of the model, as we will describe in Chapter 4. The feature weights change the learning process, however, in such a way that the model will be less sensitive to dimensions with lower feature weights.

To implement these notions, we place a conjugate normal prior on the cluster means, $\mu_{z,x} \sim \mathcal{N}(\mu_*, T)$, where $T$ is diagonal with elements $\tau_1^2 \ldots \tau_{N_z}^2$, and $\tau_k^2$ is the feature weight for dimension $k$. The prior biases the model means to be close to the data mean along dimensions with small feature weights (small variance of the means), but allows them to be far from the data mean along dimensions with large feature weights (large variance of the means). Thus, $\tau_k^2$ will be large if $k$ is a dimension relevant to the clustering task, while $\tau_k^2 \to 0$ if the dimension is irrelevant. Feature selection occurs if we allow $\tau_k^2 = 0$ for some $k$.
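The effect of the prior is easy to see numerically. A small sketch (names and numbers are mine) evaluating the log of the prior $\mathcal{N}(\mu_*, T)$ on a set of cluster means: a small $\tau_k^2$ makes any spread of the means along dimension $k$ expensive, which is exactly how irrelevant dimensions are suppressed during learning.

```python
import numpy as np

def log_prior_means(cluster_means, mu_star, tau2):
    """Log of the conjugate normal prior N(mu_star, T) on the cluster means,
    with T = diag(tau2); cluster_means is an (Nx, Nz) array."""
    diff = cluster_means - mu_star
    return -0.5 * np.sum(diff ** 2 / tau2 + np.log(2 * np.pi * tau2))

# Toy illustration: dimension 0 is "relevant" (large tau2), dimension 1 is not,
# so spread of the means along dimension 1 is heavily penalized.
means = np.array([[-2.0, 0.1], [2.0, -0.1]])
print(log_prior_means(means, np.zeros(2), tau2=np.array([4.0, 0.01])))
```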
Conjugate priors are placed on the feature weights, $\tau_k^2$, and on the model covariances, $\Lambda_{z,x}$. Each feature weight is univariate, and so an inverse gamma distribution is the prior on each $\tau_k^2$:

$$P(\tau^2|a, b) \propto (\tau^2)^{-a-1} e^{-b/\tau^2}. \qquad (3.26)$$

This prior allows some control over the magnitude of the learned feature weights, $\tau_k^2$. The model covariances are multivariate, for which the conjugate prior is an inverse-Wishart prior:

$$P(\Lambda_{z,x}|\alpha, \Lambda_*) \propto |\Lambda_{z,x}|^{-(\alpha + N_z + 1)/2} e^{-\frac{1}{2}\mathrm{tr}(\alpha\Lambda_*\Lambda_{z,x}^{-1})}, \qquad (3.27)$$

where $\Lambda_*$ is the covariance of all the data, and $\alpha$ is a parameter which dictates the expected size of the clusters (the intra-class distance). This prior stabilizes the cluster learning.

3.2.5 Experiments with Probabilistic Projections

We performed two experiments to examine the advantages of using the probabilistic projection described by Equation (3.25) over the naive projection. In the first, flow fields were reconstructed from 500 twenty-dimensional Zernike vectors, with all coefficients randomly generated in the interval $[-1, 1]$. The resulting flows (< 5 pixels/frame) were used to warp a synthetic 120 × 120 image using linear interpolation. The original image, shown in Figure 3.5(a), is a sine grating with added Gaussian noise of $\sigma = 5$ greyscale values. An additional amount of Gaussian noise ($\sigma = 5$ again) was added after the warp. Figure 3.5 (b) and (c) show an example flow field and the corresponding warped image, respectively. Optical flow was projected to the Zernike basis using both naive and probabilistic methods. The probabilistic projection used a single $x$ state with a zero mean prior, $\mu_{z,x} = 0$, and a diagonal covariance $\Lambda_{z,x} = \sigma_{z,x}I$, with $\sigma_{z,x} = 0.01$. The coefficients were compared with the ground truth. The mean Euclidean distances were 1.08 ± 0.32 for the probabilistic projection, and 1.55 ± 0.37 for the naive projection. The mean difference (naive − probabilistic) was 0.48 ± 0.13, showing that the probabilistic projection outperforms the naive method, as expected.

Figure 3.5: Synthetic images and flows: (a) original image, (b) flow field, (c) warped image.

Our second experiment used the synthetic Yosemite fly-through sequence, constructed from an aerial image and a depth map [BFB94]. Ground truth over the ground region is used to evaluate the performance of the projection methods. There are fourteen 316 × 252 frames in the sequence, and the flow fields range up to 4 pixels/frame. The ground sections of two frames are shown in the top row of Figure 3.6, while the middle row shows the ground truth and the estimate of the flow field using the method of [SAH91]. We pre-smoothed each image using separable Gaussian filters ($\sigma = 1.0$), and used a 3-level Gaussian pyramid. The noise parameters were set to $\sigma_1 = 0.08$, $\sigma_2 = 1.0$, $\sigma_p = 10.0$, $\sigma_q = 0.5$ and $\sigma_d = 0.1$. The angular error² on the flow estimate is 8.4 ± 12.0. The first 10 Zernike coefficients were estimated for all frames using both naive and probabilistic projections, and were used to reconstruct flow fields over an ellipsoidal region covering the rigid portion of the scene using Equation 3.3. We used a single zero-mean ($\mu_z = 0$) model with diagonal covariance $\Lambda_z = 0.001$. The bottom row in Figure 3.6 shows the reconstructed flow fields for the two projections. The average angular errors over all frames for the reconstructions were 6.27 ± 6.36 for the naive and 5.66 ± 6.22 for the probabilistic projections.

²The angular error, $E$, between ground truth $v_c$ and estimate $v_e$ is $E = \arccos(\tilde{v}_c \cdot \tilde{v}_e)$, where $\tilde{v} = (u, v, 1)/\sqrt{u^2 + v^2 + 1}$.
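The angular error measure in the footnote is straightforward to implement; a sketch (names mine):

```python
import numpy as np

def angular_error_deg(u_true, v_true, u_est, v_est):
    """Angular error between flow fields, in degrees: each flow (u, v) is
    lifted to the unit 3-vector (u, v, 1)/sqrt(u^2 + v^2 + 1) and the angle
    between the lifted vectors is returned."""
    def lift(u, v):
        n = np.sqrt(u ** 2 + v ** 2 + 1.0)
        return u / n, v / n, 1.0 / n
    a, b = lift(u_true, v_true), lift(u_est, v_est)
    dot = np.clip(a[0] * b[0] + a[1] * b[1] + a[2] * b[2], -1.0, 1.0)
    return np.degrees(np.arccos(dot))
```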
Figure 3.6: Top: two frames from the Yosemite sequence. Middle: ground truth flow and estimate using [SAH91]. Bottom: reconstructed flow fields, naive (left) and probabilistic (right).

Again, we see the advantage of the probabilistic projection.

3.3 Modeling Facial Configuration

We use the Zernike polynomial basis to describe image regions as facial poses. We use a small subset (only the first 32 basis polynomials), which gives a very coarse approximation to the brightness structure over the face. However, the modeling of configuration is only secondary to the dynamics modeling we have just described, and is used to disambiguate zero-motion flow fields from each other, as we discussed in the beginning of this chapter. Our focus has been primarily on the dynamics modeling. It may be desirable in the future to build more precise models of configuration. The temporal modeling techniques we describe in Sections 3.5 and 3.6 would still be applicable, however, and we proceed with this simplified model.

Equation 3.5 is used directly to compute the coefficients, $A_n^m$ and $B_n^m$, of the projection of an image region, $H$, to the Zernike basis. $H$ is the model estimate of the image over the projection region. Equation 3.6 allows us to represent $H$ as $H = Pz$, where the columns of $P$ are the $N_z$ basis vectors and $z$ are the Zernike coefficients, $A_n^m$ and $B_n^m$. As with the flow fields, we can write $H = Pz + n_q$, where $n_q \sim \mathcal{N}(0, \Lambda_q)$, and so $P(H|z) = \mathcal{N}(H; Pz, \Lambda_q)$. As we will show, the same feature weighting techniques apply equally well here for the selection of basis vectors.

Figure 3.7 shows some examples of Zernike polynomials of different orders (values of $n$ and $m$) as grayscale images. Higher orders of Zernike polynomials represent more complex brightness patterns. Images can be reconstructed from the coefficients using Equation 3.3, as shown for an example image in Figure 3.8. As we reconstruct with more coefficients, we are including higher spatial frequencies, leading to a more accurate reconstruction of the original.

Figure 3.7: Example images generated from individual ZPs as shown.

In the following, we only discuss classifying individual images, but the analysis will be combined with that for the flow field projections in Sections 3.5 and 3.6. The classification of image configurations is to assign each of a set of images, $I_1 \ldots I_{N_t}$, to one of $N_w$ cluster labels $W_1 \ldots W_{N_w}$, such that the images with the same label are as similar as possible to each other, but as dissimilar as possible from the images with any other label. For example, if describing the human face, states of $W$ may correspond to instantaneous poses during smiling (mouth expanded), frowning (contraction between the eyes), or talking (lips in a particular configuration). We use the same model described in the last section, as shown in Figure 3.9, which is the same as Figure 3.3, except the labels have changed. The measurements are now the images, $I$. The subspace projections over image regions are labeled $H$. We can express classification of an image as the maximization of the probability distribution over the classes, $W$, given the image:

$$P(W|I, \Theta) \propto P(I|W, \Theta)P(W|\Theta), \qquad (3.28)$$

where $\Theta$ are the parameters of the model. Since we plan to classify image projections, we expand this probability distribution as

$$P(W|I, \Theta) = \int_{h,z} P(I|h, \Theta)P(h|z, \Theta)P(z|W, \Theta)P(W|\Theta)$$

There are four terms in the integration.
The prior over classes, $P(W|\Theta)$, is part of our model, parametrized with a multinomial $\Theta_{W,i} = P(W = i)$. The distribution over $z$ given $W$ is parametrised with a normal $P(z|W) = \mathcal{N}(z; \mu_{z,w}, \Lambda_{z,w})$. The distribution over the image regions given the Zernike projection is normal, $P(h|z, \Theta) = \mathcal{N}(h; Pz, \Lambda_q)$, as previously described. The distribution over images given the subspace image region, $h$, $P(I|h, \Theta)$, can be approximated using a normal distribution at each pixel, $P(I|h, \Theta) \approx \mathcal{N}(I; h, \Lambda_I)$. The integrations over $h$ and $z$ can be performed since all the distributions inside the integral are Gaussian. The results are similar to those for the dynamics (Equation 3.21):

$$P(W|I, \Theta) \propto \frac{|\Lambda_0|^{1/2}}{|\Lambda_{z,w}|^{1/2}}\, e^{-\frac{1}{2}\left( \mu_{z,w}'\Lambda_{z,w}^{-1}\mu_{z,w} - \mu_0'\Lambda_0^{-1}\mu_0 \right)} \qquad (3.29)$$

where

$$\mu_0 = \Lambda_0\left( P'\Lambda_q^{-1}\left[\Lambda_q^{-1} + \Lambda_I^{-1}\right]^{-1}\Lambda_I^{-1}I + \Lambda_{z,w}^{-1}\mu_{z,w} \right) \qquad (3.30)$$

$$\Lambda_0 = \left( P'\Lambda_q^{-1}P - P'\Lambda_q^{-1}\left[\Lambda_q^{-1} + \Lambda_I^{-1}\right]^{-1}\Lambda_q^{-1}P + \Lambda_{z,w}^{-1} \right)^{-1} \qquad (3.31)$$

In this case, however, there are no data-dependent variances as in the flow field case. Therefore, the naive projection is a good approximation to the full integration, and we write $P(W|I, \Theta) = P(z_w|W, \Theta)P(W|\Theta)$, where $z_w = P'I$ is the projection of the image region to the basis set.

Figure 3.8: Reconstructing images from Zernike projections. Shown are three reconstructions, using increasing numbers of coefficients from left to right: 3 ZPs or affine ($n = m = 1$), 32 ZPs and 64 ZPs.

Figure 3.9: Bayesian network for the mixture of Gaussians over images with feature weighting. Shaded nodes are observed or fixed (known), while unshaded nodes are unknown random variables. Boxes are fixed hyper-parameters. The dashed line delineates the priors for feature weighting. $W \in 1 \ldots N_w$ are discrete image classes, $Z$ is the Zernike feature vector (projection of the image), $H$ is the projected image region, and $I$ is the full image. $\mu_z, \Lambda_z$ are the parameters of the mixture of Gaussians over the $Z$ vector space, and $T$ are the feature weights. $\Theta_W$ is the class probability parameter (a multinomial), and $\alpha_W$ is the parameter of the (conjugate) Dirichlet prior over $\Theta_W$.
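Since the naive projection suffices here, pose classification reduces to a Gaussian mixture over $z_w = P'I$. A minimal sketch, with names mine and $P$ the basis matrix from the earlier sketches:

```python
import numpy as np

def classify_pose(image_region, P, means_w, covs_w, theta_w):
    """Naive configuration classification: project the flattened image region
    with z_w = P' I, then score each pose class W under its Gaussian and the
    multinomial prior; returns the normalized posterior P(W | I)."""
    z_w = P.T @ image_region.ravel()
    logps = []
    for mu, cov, p in zip(means_w, covs_w, theta_w):
        d = z_w - mu
        logps.append(-0.5 * d @ np.linalg.solve(cov, d)
                     - 0.5 * np.linalg.slogdet(cov)[1] + np.log(p))
    logps = np.array(logps)
    w = np.exp(logps - logps.max())
    return w / w.sum()
```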
3.4 Lighting Changes

The modeling of pose and dynamics is sensitive to lighting changes which are due to factors other than motion in the face. Shadows passing over and reflections onto the face are both examples of environmental lighting changes that will seriously affect the models we have described. Such violations are common in videos taken outdoors or with a moving camera. The data we analyse, however, is taken in a laboratory setting, with constant diffuse light, and we do not encounter these problems. Nevertheless, we give a brief discussion of the problem here, as it would have to be dealt with for a system operating outdoors, or on a mobile platform.

The dynamics process is also affected by external lighting changes. As described in Section 3.2.1, optical flow is estimated using brightness constancy, which assumes that any brightness changes in the image are due only to motion in the scene (which is what we are trying to estimate). Brightness changes that occur due to external causes violate the brightness constancy assumption, and result in erroneous flow estimates. An example is shown in Figure 3.10, in which a face (which is stationary) undergoes a lighting change due to a new light source. The flow estimation algorithm assumes the changes in brightness are due to motion of the face, and estimates flow accordingly.

Figure 3.10: Example of erroneous flow generated by a large lighting change. The subject's head does not move, but the flow estimates are large due to the lighting cast on the subject's face by a passing object.

Large brightness differences can be easily detected since they result in large violations of brightness constancy: $|I_x u + I_y v + I_t| = E_b \gg 0$. Thus, we can threshold the value of $E_b$, and reject the corresponding (erroneous) flow fields. Unfortunately, even small brightness changes can cause erroneous flow measurements, as shown in Figure 3.11, in which a small lighting change on the subject's face causes a violation of brightness constancy of the same magnitude as that encountered in normal motion estimation, and so escapes the thresholding operation. It is possible to extend the brightness constancy assumption to account for overall brightness changes [NY93].

The effects on the configuration modeling are clear: the images with lighting effects will project to different regions of the Zernike vector space. This is a well studied (and difficult) problem in computer vision. One solution is to have different models for each possible lighting condition, or to estimate the lighting condition based on the image. However, this involves adding much complexity to the method. Another solution is to use spatial filters to remove the effects of lighting. For example, working in hue-saturation space only (removing luminosity) has been known to remove some of the problems.

3.5 Temporal Modeling

We have seen in the last two sections how to compute spatially abstract representations of images and their derivatives. These models are applicable only for individual time frames, however, and we would like to describe what happens over longer time periods. To this end, we combine the two spatial representations together in a single temporal model, as shown in Figure 3.12. This dynamic Bayesian network is a flavor of hidden Markov model, known as a coupled hidden Markov model, or CHMM [BOP97]. The idea is to model the temporal dependencies of the dynamics and configuration, as well as the interaction between the two, at a high level of abstraction. The coupled hidden Markov model represents the temporal evolution of the dynamics and configuration using two Markovian processes over $X$ and $W$, which we call the dynamics process and the configuration process respectively. The transition probabilities in these two processes are parametrized by $\Theta_x = P(X_t|X_{t-1}W_t)$ and $\Theta_w = P(W_t|W_{t-1}X_{t-1})$, while the initial state probabilities are parametrized with $\pi_x = P(X_0)$ and $\pi_w = P(W_0)$. Note that the coupling in this model is reduced from the usual full coupling in a CHMM [BOP97]: we have introduced the additional independency that $X_t$ is independent of $W_{t-1}$ given $W_t$ and $X_{t-1}$. This is because we do not expect the dynamics to depend on the previous and current configurations given the previous dynamics. We do, however, expect that our beliefs about the dynamics will be affected by the current configuration (the pose gives us information about what changes may be expected) and the previous dynamics (smoothness of the dynamics process).
We have left out explicit representation of the model parameters LIZ,AZ,T, and other low-level noise terms, but included the transition parameters, Qx and Qw and their priors, ax and aw, respectively. pose gives us information about what changes may be expected) and the previous dynamics (smoothness of the dynamics process). Computing the likelihood of a sequence of T images and derivatives, I, Vf given the parameters of the coupled hidden Markov model, 0 , is accomplished by us-ing an extension of the usual "forward" computation from H M M estimation [Rab89]. We can derive a recursive formula for this computation, in which we write O = IVf, 62 and an entire sequence of data as O = {0} = {I\... IT, V / I . . . V/x}, p(o|e) = ^ p ( x T i i w r j { o } | e ) = Y/p(iTvfT\xT,iwT,j {0} e)P(xT,lwTJ {o} |9) ij 1>T-1 l . T - l = Y,p( IT\WT,ie)P(vfT\xTje)^p(xT,iwTjXT-likWT-u {o} \e) l . T - l where and ij kl J2P(lT\WT,ie)P(VfT\XTjG) y2QxijkQwjkiP(XT-i,kWT-i,i {0} 6) ij kl 1,T-1 (3.32) Iwjkl = P(Wt = j\Wt-x = k,Xt^ = I) eXijk = p(xt = i|xt_i = jwt = k) Given that we have a coupled hidden Markov model such as we have just described, we may be interested in estimating the temporal state evolution of the dynamics and configuration processes for some set of observations. That is, we want to find the single best state sequences, X \ . . . XT, and W\... WT given observations, O i . . . O r , which can be formulated as maximizing P ( { X , W } | { O } 0 ) . The Viterbi 1,T 1,T algorithm performs this maximization. It needs to be modified slightly to accomo-date the two temporal chains. Appendix D . l derives the Viterbi algorithm for this model. 3.6 Temporal abstraction We have just described the modeling of sequences of input data using the hidden state of two Markovian processes. A single such model is only useful for classifying data on a frame by frame basis, so, except for the tracking and the addition of temporal dependence in the dynamics and configuration processes, we are no further ahead than at the end of Section 3.3. However, recall that we wish to build temporal abstractions into our model, and infer high-level states which can be used in a decision-making agent. The state of the dynamics and configuration processes at any time, t, will be descriptions of instantaneous motion and pose. Over some longer period of time, however, the sequence of states will be a description of the 63 longer term trajectories of motion and pose. For example, suppose two states of the dynamics process, Xi,Xj, correspond to instantaneous flow fields observed when the subject raises or lowers their eyebrows, respectively. If the subject "flashes" their eyebrows (raises, then quickly lowers them), the sequence of states in the dynamics process over the course of the display may start in Xi at the beginning of the expansion phase, change to Xj at the apex of the display, and remain there until the end of the retraction phase. However, the subject may also hold their eyebrows raised for a period of time, during which the dynamics process will be in a state X^ corresponding to no motion. The sequence of states will then start in Xi, change to Xf~ at the apex of the display, and change to Xj at the beginning of the retraction phase of the display. If it is important to distinguish these two displays, then we must hypothesise a higher level, discrete, variable, D, which conditions the lower level variables, as shown in Figure 3.13. This figure shows a mixture of C H M M s like the one in Figure 3.12. 
3.6 Temporal abstraction

We have just described the modeling of sequences of input data using the hidden state of two Markovian processes. A single such model is only useful for classifying data on a frame by frame basis, so, except for the tracking and the addition of temporal dependence in the dynamics and configuration processes, we are no further ahead than at the end of Section 3.3. However, recall that we wish to build temporal abstractions into our model, and infer high-level states which can be used in a decision-making agent. The state of the dynamics and configuration processes at any time, $t$, will be descriptions of instantaneous motion and pose. Over some longer period of time, however, the sequence of states will be a description of the longer term trajectories of motion and pose. For example, suppose two states of the dynamics process, $X_i, X_j$, correspond to instantaneous flow fields observed when the subject raises or lowers their eyebrows, respectively. If the subject "flashes" their eyebrows (raises, then quickly lowers them), the sequence of states in the dynamics process over the course of the display may start in $X_i$ at the beginning of the expansion phase, change to $X_j$ at the apex of the display, and remain there until the end of the retraction phase. However, the subject may also hold their eyebrows raised for a period of time, during which the dynamics process will be in a state $X_k$ corresponding to no motion. The sequence of states will then start in $X_i$, change to $X_k$ at the apex of the display, and change to $X_j$ at the beginning of the retraction phase of the display. If it is important to distinguish these two displays, then we must hypothesise a higher level, discrete, variable, $D$, which conditions the lower level variables, as shown in Figure 3.13. This figure shows a mixture of CHMMs like the one in Figure 3.12.

For clarity, some of the variables have been grouped together: the joint space of the mixture variables is denoted $Y = \{X, W\}$, the joint space of the Zernike projections is $Z = \{Z_x, Z_w\}$, the joint space of the image and flow projections is $G = \{H, V\}$, and the joint space of observations is denoted $O = \{I, \nabla f\}$. The high-level motion state at some time $\tau$, $D_\tau$, explains the sequence of observations, $O$, from $t = g_\tau \ldots g_{\tau+1} - 1$. The indicator variables, $g_\tau$, are indices to frame times which demarcate the beginning of each sequence explained by a $D_\tau$. In general, the sequence of data from $g_\tau$ to $g_{\tau+1} - 1$ will be part of a longer sequence. The question of how to find the temporal segments, $g$, is discussed in Section 3.6.4.

The parameters of this model are the prior over high-level states, $\Theta_{D,i} = P(D_i)$, and a set of parameters as in the coupled hidden Markov model (CHMM) for each state, $D$. For example, the transition probability over the dynamics process, $X$, will now be conditioned on $D$: $\Theta_{x,ijkl} = P(X_{g_\tau+t,i}|X_{g_\tau+t-1,j}, W_{g_\tau+t,k}, D_{\tau,l})$. This model is a mixture of hidden Markov models [Smy97]. The likelihood of a sequence of data, $O$, given this model can be computed by summing over $D$ as follows:

$$P(O|\Theta) = \sum_k P(O|D_k, \Theta)P(D_k|\Theta) \qquad (3.33)$$

The likelihood of the data for a particular high-level state, $D_n$, $P(O|D_n)$, is computed using the same equations as for the CHMM model (Equation 3.32), with all parameters conditioned on $D_n$:

$$P(O|D_n\Theta) = \sum_{ij} P(I_T|W_{T,j}D_n\Theta)P(\nabla f_T|X_{T,i}D_n\Theta) \sum_{kl} \Theta_{x,ikjn}\,\Theta_{w,jlkn}\, P(X_{T-1,k}W_{T-1,l}\{O\}_{1,T-1}|D_n\Theta) \qquad (3.34)$$

Figure 3.13: Mixture of coupled hidden Markov models as a dynamic Bayesian network. The mixture variable, $D$, conditions a CHMM like the one shown in Figure 3.12, where we have grouped sets of variables together: $Y = \{X, W\}$, $Z = \{Z_x, Z_w\}$, $G = \{H, V\}$, and $O = \{I, \nabla f\}$. The multinomial mixture parameter is $\Theta_D$, and $\alpha_D$ is the parameter of its Dirichlet prior.

We may also want to compute the most likely high-level motion sequence class, $D^*$, given a set of observations, $O$, which means solving

$$D^* = \arg\max_i P(D_i|O) = \arg\max_i P(O|D_i)\Theta_{D,i}$$

where $P(O|D_i)$ is computed using Equation 3.34, and $\Theta_{D,i}$ is the model parameter.

The model in Figure 3.13 is a mixture of coupled hidden Markov models, which we will refer to as a 3MG in the following (there are three 'M's in MCHMM). We can simplify notation by representing the network in Figure 3.13 by the one in Figure 3.14. In this figure, $O_\tau = O_{g_\tau} \ldots O_{g_{\tau+1}-1}$, and we have left out explicit representation of the hidden states, $Y, Z, G$. This representation makes clear the fact that we are again dealing with a mixture model, as we were in Section 3.2.2, and will be useful in the following section.

Figure 3.14: Simplified mixture of hidden Markov models as a Bayesian network. This is the same model as shown in Figure 3.13, except we have left out explicit representation of the hidden variables, $Y$, $Z$ and $G$.
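Equation 3.33 composes directly with the CHMM forward computation. A sketch, reusing `chmm_likelihood` from the sketch in Section 3.5 (the packaging of the $D$-conditioned parameters is my own convention):

```python
import numpy as np

def mixture_chmm_likelihood(segment, params, theta_d):
    """Equation 3.33: mix the CHMM likelihood over high-level states D for one
    observation segment. `segment` is (lik_x, lik_w); `params[k]` holds the
    D_k-conditioned CHMM parameters (theta_x, theta_w, pi_x, pi_w).
    Returns P(O) and the posterior P(D | O)."""
    lik_x, lik_w = segment
    per_class = np.array([chmm_likelihood(*params[k], lik_x, lik_w)
                          for k in range(len(theta_d))])
    joint = per_class * theta_d
    return joint.sum(), joint / joint.sum()
```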
In the simple games we consider in Chapter 5, the state of the game affects what kinds of facial displays the players use, and the actions of the players are affected by the displays of their partner. Thus, the context can either condition (in which case we label it C) or be conditioned by (in which case we label it A), the high-level mo-tion state, D. This gives us the model in Figure 3.15. We refer to this model as a C 3 M G (a Context dependent 3MG). Since the context is observable, the likelihood of interest is now the likelihood of all the observations, p{oCiAj\e) = ^2p(DkoCiAj\e) k = Y/P(Aj\Dk)P(0\Dk)P(Dk\Ci)P(Ci) k = Y,QAik^CkiP{0\Dk)P{d) (3.35) fe where we have introduced two new parameters, @Ajk = P(Aj\Dk) and Qcki = P(Dk\Ci), and the likelihood of the data, P(0\Dk), is computed using Equa-tion 3.34. Again, we want to compute the most likely high-level motion sequence class, D*, given a set of observations, O, Ck,Ai, which means solving D* = a r g m a x P ( A | 0 , C j , Ak) = argmaxP(0|A)@Afci@Dij 67 Figure 3.16: Markov chain of mixtures of hidden Markov models (4MG) as a dy-namic Bayesian network. where P(0\Di) is computed using Equation 3.34, and Qoij^Aki are the model parameters. 3.6.2 Markov chains of Mixtures of Hidden Markov Models The highest level motion descriptors, D, in a 3 M G may also not be temporally independent. If we assume that their dependence is also Markovian, then we have the dynamic Bayesian network shown in Figure 3.16. This is a Markov chain of mixtures of hidden coupled hidden Markov models, which we refer to as a 4 M G (4 'M's in M M C H M M ) . Such a model is also known as an embedded hidden Markov model [NI00], or a hierarchical mixture of Markov chains [HoeOl]. The likelihood for this model can be computed in the same way as for a normal hidden Markov model (HMM), using the likelihood function given by Equation 3.34: This is a recursive formula which can efficiently computed using the usual forwards algorithm from H M M estimation [Rab89]. To compute the most likely sequence of D \ . . . DT given a set of sequences of observations O i . . . O x , we use the standard Viterbi algorithm [Rab89], using P(0\D) as given by Equation 3.34 as the likelihood function. p(o|e) = £p(z? T l f c{o}|e) 1,T (3.36) 68 Figure 3.17: Context dependent Markov chain of mixtures of hidden Markov models (C4MG) as a dynamic Bayesian network, including context variables, C and A. 3.6.3 Context Dependent 4MGs Adding context variables to the model shown in the last section gives us the one shown in Figure 3.17. This model is a C 4 M G (context dependent 4 M G ) . This model is known as an input-output hidden Markov model, or I O H M M [BF96]. In this case, the context variable, C, is the input observations, and both the image data, O, and the conditioned context, A, are output observations. Again, the likelihood is computed using a recursive formula which combines Equation 3.36 and 3.35. The likelihood is the same calculation as for an I O H M M , except that the output observation distribution, P(A,0\D), is factored into two terms, P(A\D)P(0\D). To compute the most likely sequence of D \ . . . DT given a set of sequences of observations O i . . . O r , C\... CT, A\... AT, we again use the Viterbi algorithm, except that now the likelihood function is a product of two terms, P(0\D) and P(A\D), and the transition function over D is selected based on the value of C for each time frame. 
3.6.4 Temporal Segmentation

The models we have just presented all describe a finite length sequence of data, denoted by the indicator variable, $g_\tau$. In a complete Bayesian model, these indicators will be part of the model, and must be integrated out, as in a hierarchical hidden Markov model [FST98]. Although linear time algorithms exist for such models [MP01], we claim that the segmentation can be achieved by looking at the context variables, $C$ and $A$. We have assumed these to be fully observable, and to be happening at event times which span multiple frame times in general. Thus, the frame times of occurrence of these context states can be used directly as the indicators, $g$. For example, in the simple card game we consider in Chapter 5, players take turns playing cards or making bids. These plays and bids are clear temporal markers in the games, and we do not expect facial displays to span the plays. We expect the players to communicate and play sequentially, not simultaneously. Previous authors have approached this temporal segmentation problem by manually fixing it [OHG02], by exhaustive search for the most likely time scale [WCP00, WBC97], or by searching for discontinuities in the temporal trajectories [WPG01, RA00].
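Deriving the indicators from observable context events is then trivial; a sketch (names mine):

```python
def segments_from_events(event_frames, n_frames):
    """Each segment runs from one context event (e.g., a card being played)
    up to the frame before the next, giving the indicator variables g."""
    g = sorted(set(event_frames)) + [n_frames]
    return [(g[i], g[i + 1] - 1) for i in range(len(g) - 1)]

# e.g. plays at frames 0, 120 and 300 of a 450-frame game:
print(segments_from_events([0, 120, 300], 450))  # [(0, 119), (120, 299), (300, 449)]
```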
3.7 Tracking

The previous sections all assumed that some region in the image had been selected for projection. While in some cases we may be interested in the camera's field of view (the entire image), in most we will be interested in the motion of some figure upon some background, for example. Whatever aspect of this figure we are interested in describing the motion of (for example, the human face) must be located in each frame that we wish to process. In a video stream, however, images occur at regular 33 millisecond intervals, and we can use temporal continuity to make this location process much simpler.

One of the primary requirements of our face tracker is that the tracked region be consistently registered to the face: the facial features must show up in exactly the same positions relative to the tracked region. Furthermore, the tracker must be robust to failures and so must be able to re-initialize automatically. For example, if we envision an online user interface, then the tracker must be ready to lock onto people when they appear, and re-initialize after they turn their heads, or bend down to pick up a dropped pencil. However, since our tests are performed with humans using a computer, their faces will tend to be fairly stationary, and will be facing the camera (mounted above the screen). Therefore, we can focus more resources on the registration than on the localization.

While many head and body trackers have been implemented, most are designed to robustly maintain a rough lock on the tracked region, and to maintain tracking through occlusions, full 360° rotations, and distractions (other faces present). For example, the Birchfield tracker uses color histograms and intensity gradients to fit an ellipse to the head region [Bir98]. Jepson et al. describe robust, adaptive appearance models which are learned on-line using the EM algorithm for tracking of natural objects (including faces) [JFEM01]. Again, these models are tailored for robustness in the face of occlusions and variations in 3D pose. There are many other examples of tracking algorithms in the literature. Our method uses an optical flow tracker with corrections from an exemplar database and skin color detection. It is similar to the method used by Kruger [KZ02]. Exemplars give the ability to accurately register the face. Skin color gives the ability to re-initialize after failure. It may also be desirable to track facial features, such as pupils or eyes, for additional registration.

Section 3.7.1 gives a procedural description of the optical flow tracker with corrections from exemplar images to avoid long term drift, and briefly discusses some experiences we have had with this tracker for very long video sequences. Section 3.7.2 then describes how the tracker is actually a special case of a more general class of dynamic Bayesian models. This interpretation would allow us to integrate tracking with modeling of facial displays, as described in previous sections, in a consistent framework. Further investigations into this Bayesian tracking model would be necessary in the future.

3.7.1 Tracker: procedural description

The tracking problem is to update a region as described by centroid and scale parameters $a_t = \{x_c, y_c, r_x, r_y\}$ from one frame, $t$, to the next, $t+1$. The centroid is $\{x_c, y_c\}$ (pixels) while the sides of the image region are $\{r_x, r_y\}$ (pixels). We assume that there is only one region to be tracked in all frames. A schematic of our tracking method is shown in Figure 3.18. Since optical flow is an estimate of the change between two frames, we get an initial tracker update from the first and second order coefficients of the projected flow estimates:

$$x_c' = x_c + {}^uA_0^0 \qquad y_c' = y_c + {}^vA_0^0 \qquad r_x' = r_x + {}^uA_1^1 \qquad r_y' = r_y + {}^vB_1^1 \qquad (3.38)$$

where the $A, B$ are components of $\bar{z}_t$ (Equation 3.23). However, updates using only the flow are prone to significant drift over any sequence longer than roughly 600 frames (20 seconds). We correct for this drift at each frame using color template image matching.

Figure 3.18: Procedural description of the tracking process. Estimates of optical flow are used to update the track, with corrections to prevent long term drift by matching raw image regions to an exemplar database. Skin segmentation is used to re-initialize after tracker failure.

We maintain a set of templates (also called exemplars), which are example images of the region we are tracking in a variety of poses under different lighting conditions. These exemplars should span the space of poses and lighting conditions we expect to see in the subject's face, and restrict the tracker to operate only for a particular subject. Although it may be possible to automatically select the exemplars [KZ02], we select them by hand in the following way. The tracker is run with human supervision until it fails, at which point an exemplar is selected which allows the tracker to recover from failure. This process is continued until the tracking is stable, and then the sequence is re-tracked from the start without supervision. In a typical laboratory setting, only about 20-30 exemplars are needed. In more complex environments, this tracker would probably not suffice, and would need further work.

The exemplars are all scaled to be the same size: $N_e \times N_e$. The scaling can be done using the Mitchell algorithm [Sch92], or using graphics rendering hardware if available. We search over small local corrections to the scale prediction, $a_t' = \{x_c', y_c', r_x', r_y'\}$, and compare the image in each corrected region to each exemplar by first scaling the image region to the fixed exemplar size ($N_e \times N_e$) and then computing the distance using the sum of squares pixelwise metric. The most likely scale correction (which could be no correction at all) is then computed based on these distances.
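The flow-based prediction and exemplar correction can be sketched as below. The indexing of the four coefficients and the patch-extraction callback are my assumptions; a real implementation would also cache the exemplar rescaling.

```python
import numpy as np

def flow_update(region, z4):
    """Flow-based region update (Equation 3.38). `region` is (xc, yc, rx, ry);
    z4 holds the four relevant coefficients (uA00, vA00, uA11, vB11) extracted
    from the expected Zernike vector."""
    xc, yc, rx, ry = region
    uA00, vA00, uA11, vB11 = z4
    return (xc + uA00, yc + vA00, rx + uA11, ry + vB11)

def best_correction(image_patcher, predicted, exemplars, offsets):
    """Drift correction: try small perturbations of the predicted region and
    keep the one whose rescaled patch is closest (sum of squared pixel
    differences) to any exemplar. `image_patcher(region)` must return an
    Ne x Ne patch; `offsets` includes (0, 0, 0, 0) for 'no correction'."""
    best, best_err = predicted, np.inf
    for d in offsets:
        cand = tuple(p + q for p, q in zip(predicted, d))
        patch = image_patcher(cand)
        err = min(np.sum((patch - ex) ** 2) for ex in exemplars)
        if err < best_err:
            best, best_err = cand, err
    return best, best_err
```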
Severe tracker failure can still occur for a variety of reasons, including head rotations, changes in lighting, and occlusions by the subject's hands, such as when scratching the face. Tracker failures also occur when the subject's face leaves the camera's field of view. For example, they may bend down to pick up a dropped item. Therefore, a robust tracker needs some way of initializing a track without prior knowledge of the previous location of the region of interest. Tracker failure is signaled if either of two error measures crosses a threshold: first, the degree to which the brightness constancy assumption is violated in the flow estimate, and second, the exemplar match scores. The thresholds are assigned manually after observing the tracker behavior.

We accomplish tracker initialization using skin color segmentation and the exemplar database. We transform the RGB color images to HSV (hue-saturation-value) space, and segment the image using simple thresholding in hue and saturation. Median filtering removes noisy estimates, and we find the connected components in the resulting binary image. The bounding box, $b_0$, of the largest skin component is our initial estimate of scale. However, such an image region will not necessarily be the same as any exemplar in the database, since the exemplars are manually selected regardless of skin color segmentation. Thus, for each exemplar, say $h_k$, we maintain a skin scale mapping, $M_k^s$, which maps the bounding box, $b_k$, of $h_k$'s largest connected skin component, to the actual bounding box that was selected for the exemplar manually, $b_k^e$: $M_k^s(b_k) = b_k^e$. The inverses of the skin scale mappings are $M_k^{s-1}(b_k^e) = b_k$. For a given exemplar, $k$, our estimate of the scale in the image, $a_0^k$, given the bounding box $b_0$, is then $a_0^k = M_k^s(b_0)$. The most likely initial scale is then computed as before, but using these skin-based predictions instead of the usual flow-based predictions.

The skin-scale mapping enables the initialization of the tracker to be fairly independent of the performance of the skin segmentation procedure. For example, the skin segmentation shown in Figure 3.18 only classifies about half the subject's face as skin pixels. In this particular example, it is because the HSV thresholds had to be tightened in order to deal with a nearly skin-colored portion of the background. However, as long as the mapping between the skin segmentation bounding box and some exemplar is consistent, the correct scale can still be correctly estimated.
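The skin-based bounding box step might look as follows. The threshold values are illustrative assumptions, not the thesis's; the structure (hue/saturation thresholds, median filter, largest connected component) follows the text.

```python
import numpy as np
from scipy import ndimage  # median filter, connected components

def rgb_to_hue_sat(rgb):
    """Vectorized RGB -> (hue, saturation), hue in [0, 1)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx, mn = rgb.max(axis=2), rgb.min(axis=2)
    c = mx - mn
    h = np.zeros_like(mx)
    m = (c > 0) & (mx == r); h[m] = ((g - b)[m] / c[m]) % 6
    m = (c > 0) & (mx == g); h[m] = (b - r)[m] / c[m] + 2
    m = (c > 0) & (mx == b); h[m] = (r - g)[m] / c[m] + 4
    s = np.where(mx > 0, c / np.maximum(mx, 1e-6), 0.0)
    return h / 6.0, s

def skin_bounding_box(rgb, h_max=0.12, s_min=0.2, s_max=0.7):
    """Return b0: the bounding box of the largest connected skin component."""
    h, s = rgb_to_hue_sat(rgb.astype(float) / 255.0)
    mask = ((h <= h_max) & (s >= s_min) & (s <= s_max)).astype(np.uint8)
    mask = ndimage.median_filter(mask, size=5)
    labels, n = ndimage.label(mask)
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    ys, xs = np.where(labels == 1 + int(np.argmax(sizes)))
    return xs.min(), ys.min(), xs.max(), ys.max()   # b0 = (x0, y0, x1, y1)
```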
Note that the frame rate in this example is only 12 frames per sec-ond, making the tracking more difficult than some of our other experiments, which were run at 30 frames per second. The tracker is operating normally in Figure 3.19, with updates usually based only on optical flow estimates, and occasionally on skin color. Figure 3.20 shows the tracker failing due to the subject throwing her head back while laughing. This pose, in which the face is barely visible, is not contained in the exemplar database, and so the tracker locks onto another part of the image. The tracker recovers after the subject's face returns to a normal, frontal pose. This sequence is nearly 10,000 frames long, or about 14 minutes at 12 fps. The tracker successfully recovered from failure in all but two instances, where manual interven-tion was required due to the tracker locking onto an image region that was not the face, but had close enough template matches to pass the failure detector. Typical tracker speeds are over 10 frames per second for an tracked region of about 100 x 80 pixels. 3.7.2 Tracker: Bayesian model The last section described an optical flow based tracker, with corrections from tem-plate database matches. However, this approach separates the tracking problem from the recognition problem, although the two are tightly coupled. This section gives a rough overview of how tracking and recognition can be combined in a single Bayesian model. The method described in this section has not been implemented, and is described for future work only. Nevertheless, we show how our tracker can be derived as a special case of the more general version presented here. The model implements essentially the same tracker as we described in the last section, except that the exemplars are replaced with the configuration modeling described in Sec-tion 3.3. Thus, optical flow provides predictions about the tracked region, and the configuration model provides corrections to the track. Addition of the track causes efficiency problems, however, as it must be integrated out during the computation of any quantities of interest. We discuss some approximation methods for surmounting these difficulties. Figure 3.21 shows the same model as in Figure 3.12, but with the facial region, a, explicitly represented. The scale, a forms another Markovian chain, called the transformation process, which conditions both measurements, V / and I. In our previous derivations, we have always assumed that the scale is fixed, such that, P(Vf\Xa) = P ( V / Q | X ) and P(I\Wa) = P(Ia\W), where V / a and Ia are the regions of V / and I delineated by a. In the full Bayesian model shown in Figure 3.21, this is no longer the case, and the scales must be explicitly included, and integrated 74 8456 8457 * 1 8469 8472 8479 8481 8 JMH M m l UP* Site* 8482 8483 Figure 3.19: Tracker successful operation. Top left of each frame is the image and the tracked region. Top right is the skin color segmentation connected components. Different shades of grey denote different components. Exemplars are shown along the bottom, ranked by increasing match scores from left to right (best exemplar on the left). The frame rate in this sequence is 12 fps. The tracker is operating normally at this point. Re-initialization occurs at frames 8469, 8482 and 8483 due to large motions of the head. The track holds successfully, as exemplars in the database are found that closely match the current pose. However, the tracker is about to fail at frame 8484, shown in Figure 3.20. 
Figure 3.21: Dynamic Bayesian network for the full recognition and tracking model.

Given a scale, $a$, the likelihood functions are computed as they were shown in Sections 3.2 and 3.3.

The Zernike projections of the optical flow field, $Z_x$, are inputs to the transformation process: they condition the scale. Recall that the flow field projections, $Z_x$, are computed from the spatio-temporal image derivatives between the current and previous frames, $I_t$ and $I_{t-1}$. These projections give us evidence about where the tracked region is at time $t$, given where it was at time $t-1$: $a_t = a_{t-1} + z_{t-1}$. The flow-based updates are not exact, however, and we model deviations from them with additive, zero-mean Gaussian noise, $a_t = a_{t-1} + z_{t-1} + n_a$, where $n_a \sim \mathcal{N}(0, \Lambda_a)$. Therefore, we can write

$$P(a_t|a_{t-1}z_{t-1}) = \mathcal{N}(a_t; a_{t-1} + z_{t-1}, \Lambda_a) \qquad (3.39)$$

This noise will cause significant drift in the track, as noted in the last section. The tracking problem is to estimate the expected scale, $\bar{a}_t$, at time $t$ given all previous observations, $\{O\}_{1,t-1}$, and the current image, $I_t$:

$$\bar{a}_t = \sum_{a_t} a_t P(a_t|\{O\}_{1,t-1}, I_t) \qquad (3.40)$$

Using Bayes' rule we can write a recursive update for the distribution over $a_t$:

$$P(a_t|I_t\{O\}_{1,t-1}) = \sum_{W_t X_{t-1}} P(a_t W_t X_{t-1}|I_t\{O\}_{1,t-1}) \propto \sum_{W_t X_{t-1}} P(I_t|a_t W_t)P(a_t W_t X_{t-1}|\{O\}_{1,t-1}) \qquad (3.41)$$

$$= \sum_{W_t X_{t-1}} P(I_t|a_t W_t) \sum_{a_{t-1}} \xi(a_t, \nabla f_{t-1}) \sum_{W_{t-1}} \Theta_w \sum_{X_{t-2}} \Theta_x P(X_{t-2} W_{t-1} a_{t-1}|I_{t-1}\{O\}_{1,t-2}) \qquad (3.42)$$

where

$$\xi(a_t, \nabla f_{t-1}) = \int_{z_{x,t-1}} P(a_t|a_{t-1}z_{x,t-1})P(\nabla f_{t-1}|a_{t-1}z_{x,t-1})P(z_{x,t-1}|X_{t-1})$$

We show in Appendix B that this integration over $z_{x,t-1}$ can be performed analytically, giving

$$\xi(a_t, \nabla f_{t-1}) = \mathcal{N}(a_t; \mu_a, \bar{\Lambda}_a)\,e^{\gamma} \qquad (3.43)$$

where

$$\mu_a = a_{t-1} + \hat{\mu}_{z,x} \qquad \bar{\Lambda}_a = \Lambda_a + \hat{\Lambda}_{z,x} \qquad (3.44)$$

$$\gamma \propto -\frac{1}{2}\left( f_T'\Lambda^{-1}f_T + w'\Lambda_v w - \hat{\mu}_{z,x}'\hat{\Lambda}_{z,x}^{-1}\hat{\mu}_{z,x} \right) \qquad (3.45)$$

As we expect, the mean scale is given by the updates based on the mean Zernike basis projection as given by Equation 3.20, and the covariance on the scale is the sum of the noise from the estimation of $z$ and the noise from the transformation process. When the tracker re-initializes after failure, there is no prior evidence in the transformation process, but there is new evidence given by the skin segmentation algorithm, as described in the last section. This is in the form of a bounding box, $b_0$, which is related to the scale for each exemplar through the skin-scale mapping, $M^s$. In this case, the expected scale is

$$\bar{a}_0 = \sum_{a_0,k,i} a_0 P(I_0|a_0, h_k)P(b_0|h_k, a_0)P(h_k|c_i)P(c_i)$$

Although it is possible to incorporate the skin segmentation evidence at every frame (not just the initial one), the exemplar matches are far more accurate, rendering the skin evidence only useful at the initial frame.

Due to the complex dependence of the image and image derivative regions on the scale, the double sums over scales, $a_t, a_{t-1}$, in Equation (3.42) cannot be performed analytically, and would make the updates very inefficient. Therefore, we need some way to approximate these updates.
Note that any terms in the sums with α_t far (in terms of Λ_α) from the μ_α will be small, and can be neglected without significantly affecting tracking accuracy. Therefore, we can approximate the scale updates using a particle filter, and we can expect the number of particles necessary for accurate tracking to be small. The particle filter implements Equation (3.42) as follows, starting from an initial set of particles and their weights at time t − 1, {α^i_{t-1}, a^i_{t-1}}, i ∈ 1...N_p:

1. re-sample the particles according to their weights, a^i_{t-1}
2. project each particle to time t according to the deterministic dynamics α^i_t = α^i_{t-1} + z̄(α^i_{t-1})
3. add noise by moving each particle by an amount randomly sampled from N(α; 0, Λ_α)
4. compute new weights, a^i_t, using Equation (3.43).

The complexity of this procedure is multi-linear in the size of the image region being tracked and in the number of particles necessary for maintaining an accurate track. Typically, only a small number of particles are needed for accurate tracking over long periods. The tracker we described in Section 3.7.1 uses only a single particle, but makes it active [Jen02]. To make a particle active, we simply add a step in which the particle locally performs gradient descent in the likelihood space.
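To make the four steps above concrete, the following is a minimal C++ sketch of one filtering step for a one-dimensional scale parameter. The helper functions flowUpdate and likelihood are hypothetical placeholders, standing in for the mean flow-based update z̄ and the analytic likelihood of Equation (3.43); they are not part of any implementation described in this thesis.

    // One step of the proposed particle filter for the scale (transformation)
    // process, for a scalar scale parameter. Illustrative sketch only.
    #include <vector>
    #include <random>
    #include <cmath>
    #include <utility>

    struct Particle { double alpha; double weight; };

    // Hypothetical placeholders: the mean Zernike flow update and the
    // Gaussian likelihood of Equation (3.43).
    double flowUpdate(double alpha) { return 0.0; }
    double likelihood(double alpha) { return 1.0; }

    void particleFilterStep(std::vector<Particle>& ps, double lambdaAlpha,
                            std::mt19937& rng) {
        const int np = (int)ps.size();

        // 1. Re-sample particles in proportion to their weights.
        std::vector<double> w(np);
        for (int i = 0; i < np; ++i) w[i] = ps[i].weight;
        std::discrete_distribution<int> resample(w.begin(), w.end());
        std::vector<Particle> next(np);
        for (int i = 0; i < np; ++i) next[i] = ps[resample(rng)];

        // 2.-3. Project each particle through the deterministic flow dynamics,
        // then diffuse it with zero-mean Gaussian transformation noise.
        std::normal_distribution<double> noise(0.0, std::sqrt(lambdaAlpha));
        for (auto& p : next) p.alpha += flowUpdate(p.alpha) + noise(rng);

        // 4. Re-weight by the analytic likelihood and normalize the weights.
        double total = 0.0;
        for (auto& p : next) { p.weight = likelihood(p.alpha); total += p.weight; }
        for (auto& p : next) p.weight /= total;

        ps = std::move(next);
    }

Making the single-particle version active, as described above, would amount to replacing the noise step with a few iterations of local gradient ascent on the likelihood.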
Chapter 4

Learning Models of Facial Displays

The preceding chapter described a method for representing flow fields and poses over a discrete set of normal distributions. It showed how to combine the modeling of facial dynamics and configurations in a temporal model, and how to build temporal abstractions of the sequences, leading to descriptions of entire sequences of motions and poses over extended periods of time. It also demonstrated how high-level observable context can be included. In this chapter, we show how to learn the parameters of these models from data.

Our models fall into the general class of Bayesian mixture models with hidden state. We make some noisy observations of the world, which we hypothesise arise from some number of discrete causes. We assume that the causes generate the observations through some process, and that this process is corrupted by noise. We also assume that the observations are generated in a multi-stage process, during which temporally and spatially more complex states are generated. The last chapter described how to compute the likelihood of a set of observations given such a model. This likelihood can be used, along with Bayes' rule, to estimate the distribution over the high-level discrete causes given the observations. The current chapter addresses the more difficult problem of estimating the most likely model which could have given rise to our entire set of observations. This problem can be formulated as a constrained optimization of the likelihood of the observations given the model, over the constrained model parameters. We use the well-known expectation-maximization, or EM, algorithm to find a locally optimal solution [DLR77]. If we can find a good initialization to the model, then the EM algorithm will optimize this model locally.

Section 4.1 is a general presentation of the EM algorithm, including a simple proof of convergence. Section 4.2 then shows a particular application of the EM algorithm to the simple mixture model over flow fields discussed in Section 3.2. Only the major results are presented; the details are shown in Appendix C. Section 4.4 then discusses how to learn the parameters of the temporal models, including dynamics, configuration and transformation chains, as described in Section 3.7. The learning equations are very similar to those for a hidden Markov model [Rab89], but include propagation of information forwards and backwards through all three Markovian chains, and use the likelihood functions for flow fields and configurations as described in Chapter 3. Appendix D derives the equations in full. Sections 4.5, 4.6, and 4.7 describe learning the parameters of the more complex models which were presented in Section 3.6. The last two of these models will be the primary models for analysing the experimental data in Chapter 6. Section 4.9 discusses initialization techniques for the EM algorithm applied to our models. The chapter concludes with a discussion of the implementation of the learning algorithms in software.

4.1 Expectation Maximization

The expectation-maximization (EM) algorithm is a statistical technique for estimating the maximum likelihood values of model parameters in cases where there is missing or incomplete data. It addresses the situation where we have observed some set of data, y, but not the values of some other, unobserved data, X. The random variable X is related to y through some model or function parametrized by θ. We additionally may have some constraints on the values of θ. Thus, we wish to find the maximum a-posteriori (MAP) values of the parameters, θ*. That is, we wish to solve

    θ* = argmax_θ P(y | θ) P(θ) = argmax_θ Σ_X P(y, X | θ) P(θ),

subject to the constraints on θ. Since P(y, X | θ) is typically a product over all observations in y, the expression above is usually hard to evaluate, since it involves a large sum of a large product. The EM algorithm simplifies this problem by allowing us to work with the logarithm of the product, leading to a double sum which can be manipulated to yield tractable solutions. The EM algorithm operates by first computing the expected values of the hidden variables, given the observations, the current guess at the model parameters, and the model constraints. This essentially fills in the hidden data, and is the expectation step of EM. The maximization step of EM then follows by updating the model parameters from the (now complete) data by counting over expectations. This description is slightly simplistic, because the filling in of the data is done probabilistically, so each datum has a certain probability of being generated by each hidden state. The update is not simple counting, but rather a constrained optimization of the expected complete data likelihood over the model parameters.

To derive the EM algorithm, we use the fact that Σ_X P(X | y, θ') = 1, so that

    P(y | θ) = P(y | θ)^{Σ_X P(X|y,θ')} = Π_X P(y | θ)^{P(X|y,θ')} = Π_X ( P(y, X | θ) / P(X | y, θ) )^{P(X|y,θ')}

Taking the logarithm of both sides, we obtain:

    log Σ_X P(y, X | θ) = Σ_X P(X | y, θ') log P(y, X | θ) − Σ_X P(X | y, θ') log P(X | y, θ).   (4.2)

If we define

    L(θ) = log Σ_X P(y, X | θ)

    Q(θ | θ') = Σ_X P(X | y, θ') log P(y, X | θ) = E_{P(X|y,θ')}[ log P(y, X | θ) ]

    H(θ | θ') = Σ_X P(X | y, θ') log P(X | y, θ) = E_{P(X|y,θ')}[ log P(X | y, θ) ]

so that

    L(θ) = Q(θ | θ') − H(θ | θ')

Now, suppose that we arbitrarily choose some value for θ', and then find the values of θ that maximize Q(θ | θ') (using whatever method we want). We can show that the log likelihood of the observations with this new set of parameters will always be greater than or equal to the log likelihood of the observations with the old set of parameters, θ': if θ'' = argmax_θ Q(θ | θ'), then L(θ'') ≥ L(θ'). To see why, notice that by definition of our maximization over θ, Q(θ'' | θ') ≥ Q(θ' | θ').
Further, an application of Jensen's inequality gives (see proof in Appendix A)

    H(θ'' | θ') ≤ H(θ' | θ').   (4.3)

Therefore,

    L(θ'') − L(θ') = Q(θ'' | θ') − Q(θ' | θ') + H(θ' | θ') − H(θ'' | θ') ≥ 0.

If, instead of maximizing Q(θ | θ') (which gives the maximum likelihood, or ML, estimates), we maximize Q(θ | θ') + log P(θ), the same conclusion holds, but we obtain the MAP estimates. Given a value for the model parameters, θ', the following procedure is guaranteed to find a new value of θ, θ'', with higher log posterior, L(θ'') + log P(θ''):

    Procedure EM_iterate(θ', y)
    1. compute P(X | y, θ')
    2. find θ'' = argmax_θ Q(θ | θ') + log P(θ)
    return θ''

The expectation-maximization algorithm can now be defined as

    Procedure EM(θ, y)
    do
      1. set θ' ← θ
      2. set θ ← EM_iterate(θ', y)
    while L(θ) − L(θ') + log( P(θ) / P(θ') ) > ε
    return θ

Since the likelihood is guaranteed to increase at each step, the EM procedure is guaranteed to converge to the set of parameters, θ, at which L(θ) has a local maximum (ignoring the possibility of saddle points). The EM algorithm works well because Q(θ | θ') and P(X | y, θ') are easy to maximize over, whereas L(θ) is hard.

Notice that we could replace the function P(X | y, θ') in the above derivation with any proper probability distribution over X. In fact, this function need only be chosen so that it increases the log likelihood at each iteration of the EM algorithm. The resulting algorithm is a generalized EM (GEM) algorithm, which can lead to incremental versions of the algorithm [NH93]. However, the function that gives the optimal increase is the one we have selected above. We do not discuss GEM algorithms further here, although our results could be easily generalized to allow for incremental variants. The following sections derive implementations of the EM algorithm for specific models.
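Before deriving those model-specific updates, the generic alternation and convergence test can be illustrated with a minimal C++ sketch of EM for a two-component, one-dimensional Gaussian mixture. This is an illustrative example only, not the implementation used in this thesis, and it omits the parameter prior, so it computes ML rather than MAP estimates.

    // Minimal EM loop for a two-component 1-D Gaussian mixture (ML, no priors).
    // Illustrates the E-step (posterior responsibilities) and the M-step
    // (weighted re-estimation) alternation described above.
    #include <vector>
    #include <cmath>

    struct GMM2 { double mu[2], var[2], pi[2]; };

    static const double PI = 3.141592653589793;

    static double gauss(double x, double mu, double var) {
        return std::exp(-0.5 * (x - mu) * (x - mu) / var) / std::sqrt(2.0 * PI * var);
    }

    double emIterate(GMM2& m, const std::vector<double>& y) {
        const size_t n = y.size();
        std::vector<double> r(2 * n);   // responsibilities P(X = i | y_k, theta')
        double logLik = 0.0;

        // E-step: compute P(X | y, theta') for each datum.
        for (size_t k = 0; k < n; ++k) {
            double p0 = m.pi[0] * gauss(y[k], m.mu[0], m.var[0]);
            double p1 = m.pi[1] * gauss(y[k], m.mu[1], m.var[1]);
            r[2*k] = p0 / (p0 + p1);  r[2*k+1] = p1 / (p0 + p1);
            logLik += std::log(p0 + p1);
        }
        // M-step: maximize Q(theta | theta') in closed form.
        for (int i = 0; i < 2; ++i) {
            double w = 0, s = 0, ss = 0;
            for (size_t k = 0; k < n; ++k) { w += r[2*k+i]; s += r[2*k+i] * y[k]; }
            m.mu[i] = s / w;
            for (size_t k = 0; k < n; ++k)
                ss += r[2*k+i] * (y[k] - m.mu[i]) * (y[k] - m.mu[i]);
            m.var[i] = ss / w;
            m.pi[i] = w / n;
        }
        return logLik;   // log likelihood under the parameters before this M-step
    }

    void em(GMM2& m, const std::vector<double>& y, double eps = 1e-6) {
        double prev = emIterate(m, y);
        for (;;) {   // iterate until the log likelihood stops improving
            double cur = emIterate(m, y);
            if (cur - prev <= eps) break;
            prev = cur;
        }
    }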
4.2 Clustering Individual Flow Fields

The model presented in Section 3.2 is a mixture of Gaussians, with output distributions over the space of Zernike polynomial projections of optical flow fields. The Bayesian network, originally shown in Figure 3.3, is repeated in Figure 4.1.

Figure 4.1: Bayesian network for the mixture of Gaussians over optical flow fields with feature weighting. Repeat of Figure 3.3.

To learn the parameters of this model using the expectation-maximization (EM) algorithm, we optimize Q(θ | θ') over the model parameters, θ, subject to constraints on θ. In this case, the observations, y, are the spatio-temporal image derivatives, ∇f. Since we are attempting to learn the distribution over the Zernike vector space, it will be convenient to explicitly include the integration over this space in the Q function. Since the observations in this model are independent, the conditional expectation is written

    P(X | ∇f, θ') = Π_{k=1}^{N_t} ∫_{z_k} P(X_k, z_k | ∇f_k, θ')   (4.4)

where N_t is the number of data points, which is computed using Equations (3.17) and (3.19). The complete data posterior is

    P(X, ∇f | θ) = ∫_z P(∇f, X, z | θ) = Π_{k=1}^{N_t} ∫_{z_k} P(∇f_k | z_k, θ) P(z_k | X_k, θ) P(X_k | θ)

Thus, we are trying to find θ*:

    θ* = argmax_θ [ Σ_X ∫_Z P(X, Z | ∇f, θ') log P(∇f, X, Z | θ) ] + log P(θ)   (4.5)

over the model parameters, θ, where θ' are the current estimates. The complete set of parameters is θ = {μ_z, Λ_z, Θ_X, τ, λ₁, λ₂, Λ_p, a, α, b}. While {μ_z, Λ_z, Θ_X, τ} are learned from data, {λ₁, λ₂, Λ_p, a, α, b} are fixed. Recall, the boldface variables indicate sets of variables, X = {X₁, ..., X_{N_t}}, where N_t is the number of flow fields, and similarly for Z and ∇f. The number of dynamics and configuration states, N_x, must still be chosen. A method for doing so automatically is discussed in Section 5.3.

The EM algorithm alternates between "E" and "M" steps until the increase in the log-posterior becomes smaller than some convergence threshold. The expectation, or "E", step of the EM algorithm is the calculation of the posterior according to Equation (4.4), using the current model parameters, θ'. The "M" step is then to solve Equation (4.5) for the most likely model parameters, θ*. Analytical expressions for these updates can be found by taking derivatives, setting to zero and solving subject to the parameter constraints.

The update equations differ from those for a standard mixture of Gaussians with feature weighting [CdFGT03] because of the integrations over Z. To derive the EM update equations, we only perform the integrations at the end. The update equations for each parameter are derived by setting the derivative of the argument in Equation (4.5) with respect to that parameter to zero, and solving for the parameter. The derivations are given in Appendix C; here we present the update results for the mean, μ_{z,i}, the feature weights, τ, the covariance, Λ_{z,i}, and the class probability, Θ_X:

    μ_{z,i} = ( Σ_{k=1}^{N_t} ẑ_{k,i} ξ_{k,i} + τ^{-1} μ* ) / ( ξ_{·,i} + τ^{-1} )   (4.6)

    τ_k² = ( b + ½ Σ_{i=1}^{N_x} (μ_{z,i,k} − μ*_k)² ) / ( a + N_x/2 + 1 )   (4.7)

    Λ_{z,i} = [ Σ_{k=1}^{N_t} ( Λ̂_{z,k} + (ẑ_{k,i} − μ_{z,i})(ẑ_{k,i} − μ_{z,i})′ ) ξ_{k,i} + α Λ* ] / ( ξ_{·,i} + α + N_z + 1 )   (4.8)

    Θ_{X,i} = ξ_{·,i} / N_t   (4.9)

where ξ_{k,i} = P(X_{k,i} | ∇f_k), ξ_{·,i} = Σ_{k=1}^{N_t} ξ_{k,i}, and N_z is the number of features. Thus, the most likely mean for each state x is the weighted sum of the most likely values of z as given by Equation (3.25). Dimensions of the means, μ_{z,i}, with small feature weights, τ_k, will be biased toward the data mean, μ*, in that dimension. This is reasonable, because such dimensions are not relevant for clustering, and so should be the same for any cluster, X. The updates to the feature weights (Equation 4.7) show that those dimensions, k, with μ_{z,i,k} very different from the data mean, μ*_k, across all states, will receive large values of τ_k², while those with μ_{z,i,k} ≈ μ*_k will receive small values of τ_k². Intuitively, the dimensions along which the data is well separated (large inter-class distance) will be weighted more. The covariance is updated based on a combination of the most likely covariance, Λ̂_{z,k}, and the covariance of the most likely mean, ẑ_{k,i}, about the model mean, μ_{z,i}, for each data point, k = 1...N_t, weighted by the probability of state i given the data (Equation 4.8). The prior covariance is represented with αΛ*, which stabilizes the updates, avoiding matrix singularity problems in the inverses in Equation (3.21). The updates to the distribution over X are simply the summed weights for each datum given each state of X.
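The following is a minimal sketch of the "M"-step bookkeeping implied by the regularized updates above, for the diagonal-covariance case, assuming the expected projections ẑ have already been computed by the E step. It illustrates the shrinkage toward the data mean μ* and the αΛ* covariance prior; the exact constants and the full-covariance, integrated form are those of Equations (4.6)-(4.9) and Appendix C, not this sketch.

    // M-step for a feature-weighted Gaussian mixture with diagonal covariances.
    // zhat[k] is the expected Zernike projection for datum k (from the E step);
    // xi[k][i] = P(X_k = i | data). Illustrative sketch only.
    #include <vector>
    using Vec = std::vector<double>;
    using Mat = std::vector<Vec>;

    void mStep(const Mat& zhat, const Mat& xi, const Vec& muStar, const Vec& lamStar,
               const Vec& tau, double alpha, Mat& mu, Mat& lam, Vec& theta) {
        const size_t Nt = zhat.size(), Nx = mu.size(), Nz = muStar.size();
        for (size_t i = 0; i < Nx; ++i) {
            double w = 0;                              // xi_{.,i}: summed weights
            for (size_t k = 0; k < Nt; ++k) w += xi[k][i];

            for (size_t d = 0; d < Nz; ++d) {          // mean, Eq (4.6) style
                double s = 0;
                for (size_t k = 0; k < Nt; ++k) s += xi[k][i] * zhat[k][d];
                double shrink = 1.0 / tau[d];          // small tau pulls toward mu*
                mu[i][d] = (s + shrink * muStar[d]) / (w + shrink);
            }
            for (size_t d = 0; d < Nz; ++d) {          // covariance, Eq (4.8) style
                double ss = 0;
                for (size_t k = 0; k < Nt; ++k) {
                    double r = zhat[k][d] - mu[i][d];
                    ss += xi[k][i] * r * r;
                }
                lam[i][d] = (ss + alpha * lamStar[d]) / (w + alpha + Nz + 1);
            }
            theta[i] = w / Nt;                         // class probability, Eq (4.9)
        }
    }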
4.3 Clustering Individual Poses

Figure 4.2: Bayesian network for the mixture of Gaussians over images with feature weighting. Repeat of Figure 3.9.

The mixture model over facial poses described in Section 3.3 is very similar to the mixture of flow fields. The Bayesian network, originally shown in Figure 3.9, is repeated in Figure 4.2. The difference is that it is no longer necessary to integrate over the Zernike feature vector. In this case, the update equations are the same as in [CdFGT03]. We are trying to find the parameters θ* that maximize

    θ* = argmax_θ [ Σ_W P(W | I, θ') log P(I, W | θ) ] + log P(θ)   (4.10)

Note that we are re-using many of the same symbols here as in the last section. This should not cause confusion, as we will make the difference clear when necessary. The update equations are as follows:

    μ_{z,i} = ( Σ_{k=1}^{N_t} z_{w,k} ξ_{k,i} + τ^{-1} μ* ) / ( ξ_{·,i} + τ^{-1} )   (4.11)

    τ_k² = ( b + ½ Σ_{i=1}^{N_w} (μ_{z,i,k} − μ*_k)² ) / ( a + N_w/2 + 1 )   (4.12)

    Λ_{z,i} = [ Σ_{k=1}^{N_t} (z_{w,k} − μ_{z,i})(z_{w,k} − μ_{z,i})′ ξ_{k,i} + α Λ* ] / ( ξ_{·,i} + α + N_z + 1 )   (4.13)

    Θ_{W,i} = ξ_{·,i} / N_t   (4.14)

where ξ_{k,i} = P(W_{k,i} | I_k), ξ_{·,i} = Σ_{k=1}^{N_t} ξ_{k,i}, and N_w is the number of states of W.

4.4 Learning Temporal Models

As we described in Section 3.5, the addition of temporal continuity to the basic Gaussian mixture models leads to two temporal Markov chains: the dynamics and configuration processes. The Bayesian network, originally shown in Figure 3.12, is repeated in Figure 4.3.

Figure 4.3: Two time slices of a dynamic Bayesian network (DBN) for simultaneous modeling of pose and dynamics. Repeat of Figure 3.12.

The optimization we are trying to perform is to find θ*:

    θ* = argmax_θ [ Σ_Y P(Y | O, θ') log P(O, Y | θ) ] + log P(θ)   (4.15)

Recall that Y = {X, W} is the joint space of the mixture variables and O = {I, ∇f} is the joint space of observations. The expectation step of the EM algorithm is to evaluate P(Y | O, θ'), which is slightly more complex now, as Y and O are sets of variables spanning some time interval. Thus, estimating this quantity involves propagating information forwards and backwards through both temporal chains. Appendix D derives the EM algorithm for this model. The resulting update equations are similar to those for a simple hidden Markov model [Rab89], but involve forwards and backwards variables over both dynamics and configuration chains. The updates for the means and covariances of the output distributions in the dynamics and configuration chains are the same as those derived in the last two sections. The only difference is in the computation of ξ_{k,i}, which in this case is conditioned on the data for the entire sequence, ∇f or I, not only that for the current frame, k. That is, for the dynamics chain, we have

    ξ_{k,i} = P(X_{k,i} | ∇f, θ')

while for the configuration chain, we have

    ξ_{k,i} = P(W_{k,i} | I, θ')

The forward-backward algorithm derived in Appendix D is used to compute this quantity, as well as the sufficient statistics for the transition parameters.
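For reference, the following is a minimal C++ sketch of the standard forward-backward smoothing pass for a single discrete-state Markov chain, which computes quantities of the form ξ_{k,i} = P(state at frame k = i | all observations). The coupled two-chain version used here, derived in Appendix D, runs the same recursions across both the dynamics and configuration chains. The helper obsLik, supplying the per-frame output likelihoods, is an assumed placeholder.

    // Minimal forward-backward smoother for one discrete Markov chain.
    // Returns xi[k][i] = P(state_k = i | o_1..o_T). obsLik(k, i) is a
    // hypothetical helper returning P(o_k | state i).
    #include <vector>
    using Mat = std::vector<std::vector<double>>;

    Mat forwardBackward(int T, int N, const Mat& trans,        // trans[i][j] = P(j|i)
                        const std::vector<double>& init,       // init[i] = P(state_1 = i)
                        double (*obsLik)(int k, int i)) {
        Mat alpha(T, std::vector<double>(N)), beta(T, std::vector<double>(N, 1.0));
        std::vector<double> scale(T);

        // Forward pass, rescaled at every frame to avoid numerical underflow.
        for (int i = 0; i < N; ++i) alpha[0][i] = init[i] * obsLik(0, i);
        for (int k = 0; k < T; ++k) {
            if (k > 0)
                for (int j = 0; j < N; ++j) {
                    double s = 0;
                    for (int i = 0; i < N; ++i) s += alpha[k-1][i] * trans[i][j];
                    alpha[k][j] = s * obsLik(k, j);
                }
            scale[k] = 0;
            for (int i = 0; i < N; ++i) scale[k] += alpha[k][i];
            for (int i = 0; i < N; ++i) alpha[k][i] /= scale[k];
        }
        // Backward pass, using the same scaling factors.
        for (int k = T - 2; k >= 0; --k)
            for (int i = 0; i < N; ++i) {
                double s = 0;
                for (int j = 0; j < N; ++j)
                    s += trans[i][j] * obsLik(k + 1, j) * beta[k+1][j];
                beta[k][i] = s / scale[k+1];
            }
        // Smoothed posteriors: xi[k][i] proportional to alpha * beta.
        Mat xi(T, std::vector<double>(N));
        for (int k = 0; k < T; ++k) {
            double s = 0;
            for (int i = 0; i < N; ++i) { xi[k][i] = alpha[k][i] * beta[k][i]; s += xi[k][i]; }
            for (int i = 0; i < N; ++i) xi[k][i] /= s;
        }
        return xi;
    }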
4.5 Mixtures of Hidden Markov Models (3MGs)

Figure 4.4: Mixture of coupled hidden Markov models as a dynamic Bayesian network. Repeat of Figure 3.13.

Learning equations for a mixture of hidden Markov models (a 3MG), as shown in Figure 4.4 (a repeat of Figure 3.13), are simple to construct once we have the update equations for the hidden Markov model components in the mixture. We are trying to find θ*:

    θ* = argmax_θ [ Σ_{Y,D} P(Y, D | O, θ') log P(O, Y, D | θ) ] + log P(θ)

However, since P(O, Y, D | θ) = P(O, Y | D, θ) P(D | θ), this is

    θ* = argmax_θ [ Σ_{Y,D} P(Y | D, O, θ') P(D | O, θ') log P(O, Y | D, θ) + Σ_D P(D | O, θ') log P(D | θ) ] + log P(θ)   (4.16)

The second term in this expression only varies in the optimization over the mixture weight parameter, Θ_D, while the first does not vary over Θ_D. Thus, when optimizing this expression with respect to Θ_D, only the second and last terms are involved. On the other hand, when optimizing with respect to all other parameters in the model, only the first and last terms are involved. Therefore, the update to the mixture weight parameter is similar to those for any mixture model (e.g. Equation 4.9):

    Θ_{D,i} = ξ_{·,i} / N_s

where in this case, ξ_{k,i} = P(D_{k,i} | O_k), and N_s is the number of training sequences. The updates to the component hidden Markov models are then weighted by the posterior probability over the mixture weights, P(D_k | O), which is

    P(D_k | O) = P(O | D_k) Θ_{D,k} / Σ_{D_k} P(O | D_k) Θ_{D,k}

where Θ_D are the mixture weight parameters, and P(O | D_k) is computed from Equation 3.34.

4.6 Context Dependent Mixtures of Hidden Markov Models (C3MGs)

The addition of the observable context changes the learning slightly. We are trying to find θ*:

    θ* = argmax_θ [ Σ_{Y,D} P(Y, D | O, C, A, θ') log P(O, Y, D, C, A | θ) ] + log P(θ)

However, since P(O, Y, D, C, A | θ) = P(O, Y, A, D | C, θ) P(C | θ), we again obtain two terms:

    θ* = argmax_θ [ Σ_{Y,D} P(Y, D | O, A, C, θ') log P(O, Y, A | D, C, θ) + Σ_D P(D | O, C, A, θ') log P(D | C, θ) ] + log P(θ)   (4.17)

The first term in Equation 4.17 is the term that is maximized for a mixture of hidden Markov models, conditioned on the (observed) variable C. The addition of the observable A only adds an extra multiplicative factor into the equations. Maximizing the second term in Equation 4.17 updates the parameterized probability distribution Θ_{D,ij} = P(D = i | C = j) by counting the expected number of times the cluster variable D = i when the context variable C = j, N_{D,ij}. That is,

    Θ_{D,ij} = E_{P(D|O,C,A)}[N_{D,ij}] / Σ_i E_{P(D|O,C,A)}[N_{D,ij}]

where E_{P(D|O,C,A)}[N_{D,ij}] = Σ_t P(D_t = i | O, C, A) δ(C_t = j). Thus, the update equation for the probability of the weight parameter given the context state, Θ_{D,ij} = P(D = i | C = j), becomes

    Θ_{D,ij} = ( α_{D,ij} + Σ_{k | C_k = j} ξ_{k,i} ) / Σ_{i'} ( α_{D,i'j} + Σ_{k | C_k = j} ξ_{k,i'} )

where

    ξ_{k,i} = P(D_{k,i} | C, A, O) = P(O_k | D_{k,i}) P(A | D_{k,i}) P(D_{k,i} | C_k) / Σ_j P(O_k | D_{k,j}) P(A | D_{k,j}) P(D_{k,j} | C_k)

4.7 Context Dependent Markov Chains of Mixtures of Hidden Markov Models (C4MGs)

Figure 4.5: Context dependent Markov chain of mixtures of hidden Markov models (C4MG) as a dynamic Bayesian network, including context variables, C and A. Repeat of Figure 3.17.

As pointed out in Section 3.6.3, the model shown in Figure 4.5 (a repeat of Figure 3.17) can be seen as an input-output hidden Markov model [BF96]. We are trying to maximize

    θ* = argmax_θ [ Σ_D P(D | O, C, A, θ') log P(D, O, C, A | θ) ] + log P(θ)

where O are the sequences of images and derivatives, while C and A are the sequences of context variables (input and output, respectively). The update equations for the parameters are nearly the same as for a normal hidden Markov model, except for the factorization of the outputs into two sets, O and A, and the additional conditioning variable, C. While the additional outputs, A, cause no special problems, the conditioning variable, C, requires that we compute updates for parameters separately for each value of C, and use only those data which co-occurred with that value of C in the updates. The derivation is given in Appendix D.

4.8 Parameter list

Table 4.1 shows all the parameters in the model. The top section shows parameters that have fixed values. These numbers are set manually based on prior expectations of their values. The bottom section shows the parameters that are learned using the EM algorithm, the initialization heuristics, or the value-directed structure learning methods of Section 5.3.
    Parameter                 Used for                              Value or Method
    λ₁, λ₂                    optical flow                          0.08, 1.0
    Λ_p                       projection error                      0.01
    N_z                       number of features                    problem dependent
    a, α, b                   feature weights (X chain)             N_z+2, 1, 0.01
    a, α, b                   feature weights (W chain)             N_z+2, 1, 0.01
    α_X, α_W                  transition Dirichlet priors (CHMM)    problem dependent
    α_D                       transition Dirichlet prior (D)        problem dependent
    α_A                       action expectation Dirichlet prior    problem dependent
    μ_{z,x}, Λ_{z,x}, τ_x     mixture of Gaussians (X chain)        EM
    μ_{z,w}, Λ_{z,w}, τ_w     mixture of Gaussians (W chain)        EM
    Θ_X, Θ_W                  transition functions (CHMM)           EM
    Θ_D                       transition (D chain)                  EM
    Θ_A                       action expectation                    EM
    N_x, N_w                  number of states in X and W chains    initialization
    N_D                       number of states in D chain           Sec. 5.3

Table 4.1: List of parameters in the model. (Top) fixed parameters. (Bottom) learned parameters.

4.9 Initialization

The EM algorithm performs hill climbing on the likelihood surface, converging from its starting point to a local maximum. It is therefore dependent on the initial choice of the parameters. In models with as many parameters as we have described, the likelihood surface can contain many local maxima. It is therefore critical to achieve good initialization, introducing as much prior knowledge about the domain as possible before attempting a full maximization of the posterior probability of the model given the data.

Initialization of the simple mixture model (Section 4.2) with N_x classes from a set of single (independent) spatio-temporal derivative fields, ∇f, proceeds as follows (a code sketch of this seeding procedure is given below):

1. Compute the most likely values of Z for each frame using a single zero-mean model with diagonal covariance with constant variance σ_z = 0.001.
2. Select a subset of the data points where |Z| > δ, where δ is some threshold on the magnitude of the Zernike vector. Since much of our data contains very little motion (so Z ≈ 0), this ensures that the clustering is not biased towards clustering on the noise in the frames where little is happening.
3. Select N_x such data points randomly as the initial seeds for N_x-means (K-means with K = N_x) clustering with a Euclidean distance measure between values of Z.
4. Fit Gaussian distributions to the classes resulting from the N_x-means clustering.
5. Initialize all the feature weights τ_k² to 1.
6. Initialize the state assignment probability Θ_X evenly.

The results for the simple clustering experiments were relatively insensitive to the initialization. The mixture model over the configurations (Section 4.3) is initialized in a similar way, using the projections of image regions to the Zernike basis.
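The following is a minimal sketch of steps 2 and 3 of this seeding procedure: threshold out near-zero Zernike vectors, pick N_x random seeds among the rest, and run K-means with Euclidean distance. It is illustrative only; the fitted Gaussians and feature weights are then initialized from the resulting clusters as in steps 4-6.

    // K-means seeding for the simple mixture model (steps 2-3 above).
    // Z holds one Zernike projection vector per frame; delta is the motion
    // threshold. Returns cluster labels for the retained frames.
    #include <vector>
    #include <cmath>
    #include <random>

    using Vec = std::vector<double>;

    static double dist2(const Vec& a, const Vec& b) {
        double d = 0;
        for (size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
        return d;
    }

    std::vector<int> kmeansInit(const std::vector<Vec>& Z, int Nx, double delta,
                                std::mt19937& rng, std::vector<Vec>& centers) {
        // Keep only data points with |Z| above the motion threshold.
        std::vector<size_t> keep;
        for (size_t k = 0; k < Z.size(); ++k)
            if (std::sqrt(dist2(Z[k], Vec(Z[k].size(), 0.0))) > delta) keep.push_back(k);

        // Choose Nx random seeds among the retained points.
        std::uniform_int_distribution<size_t> pick(0, keep.size() - 1);
        centers.clear();
        for (int i = 0; i < Nx; ++i) centers.push_back(Z[keep[pick(rng)]]);

        // Standard K-means iterations: assign, then re-estimate the centers.
        std::vector<int> label(keep.size(), 0);
        for (int it = 0; it < 100; ++it) {
            for (size_t n = 0; n < keep.size(); ++n) {
                double best = 1e300;
                for (int i = 0; i < Nx; ++i) {
                    double d = dist2(Z[keep[n]], centers[i]);
                    if (d < best) { best = d; label[n] = i; }
                }
            }
            for (int i = 0; i < Nx; ++i) {
                Vec c(centers[i].size(), 0.0); int cnt = 0;
                for (size_t n = 0; n < keep.size(); ++n)
                    if (label[n] == i) {
                        ++cnt;
                        for (size_t d = 0; d < c.size(); ++d) c[d] += Z[keep[n]][d];
                    }
                if (cnt > 0) { for (double& x : c) x /= cnt; centers[i] = c; }
            }
        }
        return label;   // memberships used to fit the initial Gaussians (step 4)
    }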
Initialization of the context dependent mixture of coupled hidden Markov models (Section 4.6) with N_d classes from a set of N_t sequences of spatio-temporal derivative fields, ∇f, and images, I, proceeds as follows:

1. Perform the initialization of a mixture of Gaussians for the dynamics states, X, as described above, from the spatio-temporal derivative fields for each frame, ∇f, considering all frames to be independent. Choose N_x larger than the maximum number of classes of dynamics states we expect (see Section 5.3). Maximization of the posterior ensures that no singularity problems will be encountered.
2. Perform a similar initialization for the configuration states, W. Choose the number of W classes, N_w, larger than the maximum number of classes of configuration states we expect (see Section 5.3).
3. Classify all the data (including the data that did not pass the thresholding test, above) using the two mixture models. This gives two sets of state membership indices, w and x.
4. Find the set of X states visited by each sequence.
5. Find the largest N_d sets of sequences whose sets of visited X states match exactly, irrespective of the number of X states visited by the sequences. These N_d clusters group together the sequences which match closely at the Gaussian output level in the dynamics process. The sequences which are not in a group are only similar to a small number of other sequences, and are neglected for the moment. This is a bootstrapping process whereby only the "best" sequences are used to initialize the model, ensuring that the results are not "washed out" by noisy data.
6. Find the set of X states visited by all the sequences in each cluster, i. This gives the number of X states for the dynamics chain of model i, N_x. Do the same for the W states, giving N_w.
7. Initialize a coupled hidden Markov model for each cluster, i, by assigning the output distributions (including feature weights, if applicable) to be those in the simple mixture models which are used by the sequences in the cluster. Initialize the transition and initial state probabilities randomly.
8. Train each coupled hidden Markov model, keeping the output distributions fixed.
9. Initialize the mixture probabilities for the mixture of coupled HMMs evenly for each context state.

Smyth [Smy97] suggests a different initialization method for mixtures of hidden Markov models, which fits a simple HMM to each individual sequence, evaluates the log-likelihood of each sequence given every simple HMM, and then clusters the sequences into K groups using the log-likelihood distance matrix. Simple HMMs are then fit to each of these K clusters, and the results are used to initialize the final mixture model. The conditional probabilities of cluster membership, D, given the context variables, C, are initialized by counting the number of observed states C in each cluster. We have experimented with this method, using agglomerative clustering with complete linkage (furthest neighbors merging). However, the resulting models tend to be significantly "washed out", and do not find clusters which are well matched with the context states. The reason is that the individual HMMs are not sufficiently well supported by the data, and tend to learn models heavily biased by the prior distributions. These biases are then propagated to the final model.

We have chosen the method described above to take advantage of the good initializations that can be performed at the lowest level (the Gaussian and multinomial output distributions). This level is very important since it involves many parameters and performs the spatial abstraction step which is crucial to the efficient description of our data.

4.10 Software Implementation Details

The models we have been discussing in this chapter and in the last are implemented in software written in C++. The structure of the models lends itself easily to an elegant class structure, with a small number of base classes providing support for derived classes for each of the models we have described. We use a data structure to hold our observations which fits in with the model class structure, allowing for a simple and intuitive programming interface to the model learning algorithms.

A probability distribution, or PDist, is the fundamental base class. It is pure virtual, allowing all other classes to include generic PDists without specific class assignment until runtime. The normal distributions which characterize continuous outputs are Gaussians, derived from PDist. The probabilistic normal projections are encapsulated in a class, ZernGauss, derived from Gaussian. The multinomial distribution is encapsulated in a class, Multinomial, derived from PDist. The simple mixture model with Nx states, a MG, also derives from PDist, and contains an array of Nx output PDists, and an Nx × 1 array for the mixture weight parameter.
The context-dependent mixture model, or CMG, derives from MG (and so is also a PDist), and adds the additional parameters necessary for describing the P(D|C). The hidden Markov model with Nx states, a HMM, also derives from PDist, contains an array of Nx output PDists, a Nx × Nx matrix for the transition parameter, and a Nx × 1 array for the initial state parameter. The coupled hidden Markov model, or XCHMM, derives from the HMM class (and so is also a PDist), adds the C transition probability, and another set of output distributions, also PDists.

The mixture model, or MG class, can be used at run time to implement a mixture of any PDists. For example, a mixture of Gaussians assigns the MG object's Nx output PDists to be Gaussian objects. A mixture of HMMs assigns the MG object's Nx output PDists to be HMM objects. A Markov chain of mixtures of Markov chains, or 4MG, is a HMM with output PDists as MG objects, each of which has output PDists as HMM objects. Thus, we see the class structure leads to an efficient way of implementing hierarchical models of any depth. In this thesis, we only use models with two layers.

The hierarchical structure of the models requires a hierarchical data structure, which is encapsulated in the PData class. A PData object can contain raw floating-point data, or an array of PData objects, allowing for a hierarchical structure. The base PDist class defines the three major functions necessary for learning the model parameters. In the following, let X refer to the model states, and O refer to the observations.

1. pdf: Computes the posterior distribution over the model states, X, given an observation (a PData object), O. It first computes the likelihood for each state X, by calling pdf on each of its children PDists. It then computes the posterior by multiplying the likelihood by the model parameters. For the mixture model, MG, this is a simple multiplication by the weights, P(X). For a hidden Markov model, HMM, this involves the forwards and backwards passes.
2. addSuffStats: Called after a call to pdf, with a single argument giving the weight assigned to the previous piece of data by a higher level (or 1.0 if at the highest level). This function updates the sufficient statistics for the PDist by adding in the data passed to pdf, weighted by the posterior distribution computed in pdf and by the passed-in higher level weights.
3. update: Updates the model parameters based upon the accumulated sufficient statistics, and resets those statistics.

Thus, the EM algorithm for any PDist given any set of data in a PData object can be implemented as a member function trainModel as follows:

    function trainModel(PData **O)
      while (!converged)
        for each data in O
          pdf(data)
          addSuffStats(1.0)
        end
        update()
      end
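The class relationships described above can be summarized in outline form. The following is a minimal skeleton of the hierarchy, with member variables abbreviated and most methods elided; it is a sketch of the structure described in this section, not the actual source code.

    // Skeleton of the class hierarchy described above (illustrative only).
    #include <vector>

    class PData;  // holds raw data or an array of PData (hierarchical observations)

    class PDist {                       // pure virtual base class
    public:
        virtual ~PDist() {}
        virtual double pdf(const PData& o) = 0;        // posterior over model states
        virtual void addSuffStats(double weight) = 0;  // accumulate weighted statistics
        virtual void update() = 0;                     // re-estimate parameters
    };

    class Gaussian : public PDist { /* mean, covariance ... */ };
    class ZernGauss : public Gaussian { /* probabilistic Zernike projections */ };
    class Multinomial : public PDist { /* discrete outputs */ };

    class MG : public PDist {           // mixture: Nx outputs plus mixture weights
    protected:
        std::vector<PDist*> outputs;    // any PDist: Gaussians, HMMs, ...
        std::vector<double> weights;    // P(X)
    };
    class CMG : public MG { /* adds P(D|C) for context dependence */ };

    class HMM : public PDist {          // Markov chain: outputs plus transitions
    protected:
        std::vector<PDist*> outputs;
        std::vector<std::vector<double>> trans;
        std::vector<double> init;
    };
    class XCHMM : public HMM { /* coupled chain: second output set, C transitions */ };

Because every model is itself a PDist, a 4MG is built at run time simply by giving a HMM object MG outputs whose own outputs are HMM objects, which is what makes arbitrary-depth hierarchies cheap to assemble.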
4.10.1 Numerical Scaling

The computation of the data likelihood, described in Chapter 3, involves taking the exponential of what can be a very large quantity. This causes numerical problems in an implementation. However, the exponentials are only used to compute the sufficient statistic, which is always normalized over higher level states. This normalization allows us to get around the numerical problems with a simple scaling trick, described here for the case of a simple mixture of Gaussians, but which can be applied to any of the distributions learned. We are trying to compute the sufficient statistic for the simple mixture of Gaussians:

    ξ_{j,i} = f_i e^{a_{ji}} p_i / Σ_{i'=1}^{N_x} f_{i'} e^{a_{ji'}} p_{i'}

where f_i is the normalizing factor in the Gaussian, (2π)^{-d/2} |Λ_i|^{-1/2}, a_{ji} = −(x_j − μ_i)′ Λ_i^{-1} (x_j − μ_i)/2, p_i is the model state assignment parameter, N_d is the number of data, and N_x is the number of states (models). Problems occur in this calculation if the factors in the exponential, a_{ji}, are too large in magnitude (greater than about 10³), since the exponential cannot be computed. However, since we are normalizing by a sum over states, we can scale each factor by subtracting some large number which is independent of i. The largest such factor is m_j = max_i a_{ji}. Thus, we can multiply both top and bottom by e^{−m_j}, giving

    ξ_{j,i} = f_i e^{a_{ji} − m_j} p_i / Σ_{i'=1}^{N_x} f_{i'} e^{a_{ji'} − m_j} p_{i'}

Now, all terms in the exponentials are guaranteed to be at most 0, and one of the exponentials is always exactly 1. When computing the log-likelihood, we must also be careful, since the same exponentials appear:

    L = Σ_{j=1}^{N_d} log Σ_{i=1}^{N_x} f_i e^{a_{ji}} p_i

In this case, we multiply the exponential by 1 = e^{m_j} e^{−m_j}, giving

    L = Σ_{j=1}^{N_d} [ m_j + log Σ_{i=1}^{N_x} f_i e^{a_{ji} − m_j} p_i ]

Note that this scaling trick also works well when the factors are really large negative numbers for all values of i. Such cases cause problems when computing the log likelihood, since we will be trying to take the logarithm of 0. The scaling technique ensures that the overall smallness of the a_{ji} factors is taken out of the sum over states, since we want to take the logarithm of it anyway.
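This scaling trick is the standard "log-sum-exp" technique, and a minimal sketch of it follows, assuming the per-state log factors l_i = log f_i + a_{ji} + log p_i have already been computed. This is an illustration of the technique above, not the thesis code.

    // Numerically stable evaluation of log sum_i exp(l_i), and of the
    // normalized weights xi_i = exp(l_i) / sum_i exp(l_i).
    #include <vector>
    #include <cmath>
    #include <algorithm>

    double logSumExp(const std::vector<double>& l, std::vector<double>& xi) {
        double m = *std::max_element(l.begin(), l.end());  // m_j = max_i l_i
        double s = 0.0;
        xi.resize(l.size());
        for (size_t i = 0; i < l.size(); ++i) {
            xi[i] = std::exp(l[i] - m);   // every exponent is <= 0; one is exactly 0
            s += xi[i];
        }
        for (double& x : xi) x /= s;      // sufficient statistics xi_{j,i}
        return m + std::log(s);           // contribution to the log likelihood
    }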
Chapter 5

Facial Displays in Games

The previous chapter described methods for learning statistical models of temporal patterns of motion in the human face. We have demonstrated how such models can be learned from observations, and how the learned models establish classes of facial motions at different levels of temporal abstraction. We have also shown how to incorporate high-level observations of context into the models. Now we may ask: what can such models be used for?

One answer is action. As we have seen in Chapter 2, humans use facial displays during interactions as relevant signals which may affect actions of other agents. As such, they can be considered as simply other actions in an agent's repertoire. Suppose that some rational agent can describe the probabilistic relationships between classes of facial motions and the states of the world, his actions and his value system. Since facial signals are relevant, this agent must use such relationships in selecting actions which maximize its expected utility [vNM53]. In this chapter, we present the concepts of expected utility maximization using the framework of Markov decision processes, or MDPs [Put94]. MDPs are a simple, yet general, model of the interactions between an agent's actions, the state of the world, and the agent's value system. Since we are modeling observations of the state of the world using a temporally and spatially abstract set of (unobservable) states, we will, in fact, be using a partially observable Markov decision process, or POMDP [KLC98]. Further, since we are modeling the interactions between two agents, we are using multi-agent POMDPs [Gmy02].

Decision making requires an agent to consider the long-term effects of his actions upon his world model, a process which is typically much more difficult than learning models from input data. It is necessary for an agent to build temporally and spatially abstract descriptions of the world over which decision making can be tractably attempted. These descriptions are the agent's internal state. This chapter will show that by equating actions and internal states with context, the models of spatial and temporal abstraction presented in Chapter 4 can be used to learn relationships between facial motion and actions, utilities and states. The learned dependencies are at a level of abstraction at which decision making can be realistically attempted, yielding policies of action based, in part, upon observed facial motions.

Attempting to model unconstrained human communication would be foolhardy, however, given the enormously complex social, emotional, and psychological context in which any communication takes place. Instead, in order to study the relationships between action recognition and action, we have developed an experimental paradigm in which we artificially constrain the structure of an interaction between two humans. We impose constraints as rules in a computer game the humans play. We then observe the humans playing the game, and learn models of the relationships between their facial motions and the states and actions in the game. The models are restricted cases of the general model, where the restrictions are the constraints imposed by the game. Subsequent analysis of the learned models reveals how the humans were using their faces for achieving value in the game. We can also extract policies of action from the models. The constraints imposed by the game restrict the humans' degrees of freedom in the interaction, and so focus attention in the model on the aspects we are interested in. Further, the constraints simplify the structure of the problems such that solutions become more feasible.

The chapter is structured as follows. Section 5.1 begins with the concepts of a multi-agent interaction from a game-theoretic perspective, and argues that POMDPs are a useful model of such situations. This is followed by an overview of the decision-theoretic foundations, including fully and partially observable Markov decision processes, solution techniques, and how such models can be learned from data. We discuss optimal solution methods for POMDPs with discrete output spaces, and approximate methods for POMDPs with continuous output spaces. Section 5.2 presents a general model of communication with non-verbal signals in a dual-agent situation as a partially observable Markov decision process. Section 5.4 discusses the experimental paradigm we use to evaluate this model. The last three sections describe particular applications of our experimental method to three games.

Section 5.5 presents a very simple imitation game in which a human plays with an agent. The imitation game requires the players to use complex facial displays, and we use this game primarily to demonstrate the computer vision modeling techniques discussed in Chapters 3 and 4. We show the learned models, example sequences, and their classifications given the learned models. There are no real decisions to be made in this game, but we present classification results in a leave-one-out cross-validation experiment.

Section 5.6 describes a game in which a human controls a robot using gestures. This game is a simple decision problem, but involves a complex vision task of recognising hand gestures. It is used primarily to demonstrate that our models are not restricted to facial displays alone. Section 5.7 shows a more complex, card matching game, in which two human players face off.
The third game, in contrast to the first two, only calls for very simple facial displays, but has a more complex, high-level structure, involving nearly 10⁵ states, six actions, and temporal look-ahead. This model is used to demonstrate the learning of the decision process, and the subsequent generation of policies of action.

5.1 Decision-Theoretic Foundations

In situations involving other intelligent agents, a decision-making agent who wishes to interpret other agents' actions must be prepared to reason using game theory. Since he is operating in a multi-agent environment, he must think about the fact that the other agents are decision-makers as well, who are possibly considering strategies of action based upon their beliefs about his possible strategies of action. Of course, knowing this, he must account for the other agents' knowing that he knows about their strategic plans based on his beliefs, etc. This infinite regress is the subject of game theory, usually approached by seeking equilibria in the strategy choices of all the agents simultaneously. At such equilibria, no single rational agent would change strategies if he believed no other agent would either. The strength of the concept of equilibria in games lies in its ability to avoid explicit reasoning about infinitely nested beliefs.

However, game-theoretic solution concepts such as equilibria are not sufficient in a general conversational setting. The reason is that the payoffs to (value functions of) other agents may not be known with certainty, and so it may be difficult for an agent to figure out what other agents are planning to do. Further, there may be multiple equilibria in such settings (non-uniqueness), and agents may not act according to their equilibrium strategies (incompleteness) [Gmy02]. To surmount these problems, we consider that each agent will choose strategies of action based upon his beliefs about other agents' internal states, value functions, and strategy choices. This approach is called the decision-analytic approach to games [Mye91]. It avoids the difficulties of uniqueness and incompleteness of the equilibrium approach at the cost of explicitly representing the infinitely nested belief systems. That is, suppose that some agent, a, is trying to come up with a strategy in an interaction with agent b. Taking the decision-analytic approach, agent a will try to explicitly assess agent b's strategy choices. To do so, he may put himself in agent b's shoes, and will realise that agent b is trying to assess agent a's strategy choices. Agent a will then be forced to explicitly assess what agent b's beliefs are about agent a's beliefs, etc., in an infinite regress. Nevertheless, we assume that this regress can be terminated at some (not too deeply nested) point, and the approximate solution will be sufficient for our purposes. Note that, in the games we will describe in the latter part of this chapter, these concepts do not play an important role, but will need to be addressed further for more complex interactions.

The decision-analytic approach to multi-agent games can be formalised as a partially observable Markov decision process, or POMDP [Gmy02]. A POMDP is a probabilistic temporal model of an agent interacting with the environment [KLC98]. A POMDP is similar to a hidden Markov model in that it describes observations as arising from hidden states, which are linked through a Markovian chain. However, the POMDP adds actions and rewards, allowing for decision-theoretic planning.
In a multi-agent setting, the states of the POMDP include complete POMDP models of other agents. That is, each decision-analytic agent will explicitly represent the POMDP models for each other agent. These types of models have been called interactive POMDPs, or I-POMDPs [Gmy02]. Since we wish to focus on the use of non-verbal displays in games, we will use a restricted form of I-POMDPs, in which much of the state space is assumed to be observable. However, our models should integrate seamlessly with I-POMDPs once the assumptions are relaxed.

5.1.1 Fully Observable MDPs

Markov decision processes (MDPs) have become the semantic model of choice for decision-theoretic planning (DTP) in the AI planning community. Fully-observable MDPs [Bel57, Put94] model the domain of interest with a finite set of (fully observable) states S. Actions of an agent, drawn from a finite set A, induce stochastic state transitions, with P(S_t | A_t, S_{t-1}) denoting the probability with which state S_t ∈ S is reached when action A_t ∈ A is executed at state S_{t-1} ∈ S. A real-valued reward function, R, associates with each state, s, and action, a, its immediate utility R(s, a). Figure 5.1 shows the Bayesian network representation of an MDP.

Figure 5.1: Two time slices of a Markov decision process (MDP) as a Bayesian network. The full network involves infinitely many time slices. Actions, A, induce transitions between states, S. A reward function, R, gives the utility of taking an action in a state.

A stationary policy π : S → A describes a particular course of action to be adopted by an agent, with π(s) denoting the action to be taken in state s. We assume that the agent acts indefinitely (an infinite horizon). The stationarity of a policy means that it will not change over the course of time. We compare different policies by adopting an expected total discounted reward as our optimality criterion, wherein future rewards are discounted at a rate 0 < β < 1, and the value of a policy is given by the expected total discounted reward accrued. The expected value V_π(s) of a policy π at a given state s satisfies [Put94]:

    V_π(s) = R(s, π(s)) + β Σ_{t∈S} Pr(t | π(s), s) · V_π(t)   (5.1)

A policy π is optimal if V_π(s) ≥ V_π'(s) for all s ∈ S and policies π'. The optimal value function V* is the value of any optimal policy.

Value iteration [Bel57] is a simple iterative approximation algorithm for constructing optimal policies. It proceeds by constructing a series of n-stage-to-go value functions Vⁿ. Setting V⁰ = R, we define

    V^{n+1}(s) = max_{a∈A} { R(s, a) + β Σ_{t∈S} Pr(s, a, t) · Vⁿ(t) }   (5.2)

The sequence of value functions Vⁿ produced by value iteration converges linearly to the optimal value function V*. For some finite n, the actions that maximize Equation 5.2 form an optimal policy, and Vⁿ approximates its value. A commonly used stopping criterion specifies termination of the iteration procedure when

    ||V^{n+1} − Vⁿ|| < ε(1 − β) / (2β)   (5.3)

(where ||X|| = max{|x| : x ∈ X} denotes the supremum norm). This ensures that the resulting value function V^{n+1} is within ε/2 of the optimal function V* at any state, and that the resulting policy is ε-optimal [Put94]. We will deal primarily with finite-horizon problems, in which the decision maker knows about some time, T, in the future at which his decision-making will stop.
For these cases, there is no need for discounting, so β = 1, and value iteration is only iterated up to the horizon:

    V^T(s) = max_{a∈A} { R(s, a) + Σ_{t∈S} Pr(s, a, t) · V^{T−1}(t) }   (5.4)

where, again, V⁰ = R.
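The following is a minimal sketch of finite-horizon value iteration (Equation 5.4) over a flat transition-matrix representation; the input arrays are hypothetical, and the factored solvers discussed next avoid this explicit state enumeration. For simplicity it starts from V = 0 rather than V⁰ = R, folding the final-stage reward into the first backup.

    // Finite-horizon value iteration for a flat MDP with nS states and nA
    // actions. P[a][s][t] = Pr(t | s, a); R[s][a] = immediate reward.
    #include <vector>

    struct Policy { std::vector<double> V; std::vector<int> act; };

    Policy valueIteration(int nS, int nA, int horizon,
                          const std::vector<std::vector<std::vector<double>>>& P,
                          const std::vector<std::vector<double>>& R) {
        Policy p;
        p.V.assign(nS, 0.0);
        p.act.assign(nS, 0);
        for (int n = 0; n < horizon; ++n) {
            std::vector<double> Vnext(nS);
            for (int s = 0; s < nS; ++s) {
                double best = -1e300; int bestA = 0;
                for (int a = 0; a < nA; ++a) {
                    double q = R[s][a];                   // immediate reward
                    for (int t = 0; t < nS; ++t)
                        q += P[a][s][t] * p.V[t];         // expected future value
                    if (q > best) { best = q; bestA = a; }
                }
                Vnext[s] = best; p.act[s] = bestA;        // greedy action, this stage
            }
            p.V = Vnext;
        }
        return p;
    }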
5.1.2 Factored Representations and Structured Representations

The classical technique for representing MDPs is a tabular, or flat, format, in which the transition function is written as a table giving the probability of going from any state to any other under all the actions. Although the flat representation is often effective for small problems, typical AI planning problems fall prey to Bellman's curse of dimensionality: the size of the state space grows exponentially with the number of domain features. The flat representation, which requires explicit enumeration of the state space, is typically infeasible. Further, it is usually difficult to write down, and hard to interpret.

A more interpretable representation is one in which the state space is implicitly represented as the cross product of a set of multinomial, discrete variables. The transition function is then written as a set of conditional probability distributions (CPTs). Each CPT is typically an easily interpretable function of a small set of variables. Factored MDPs also allow the conditional independencies in the Markov network to be exploited by MDP solvers. In particular, CPTs can be represented using decision trees [DB97] or networks [HSAHB99], allowing similar states to be abstracted. These types of solvers usually gain much in efficiency, and have been successfully applied to large problems [Plu03]. We will use our SPUDD solver, which was the first MDP solver to use algebraic decision diagrams for representation and computation [HSAHB99].

5.1.3 Partially Observable Markov Decision Processes

In many cases of interest, the state space, S, is not fully observable, and can only be inferred from observations, resulting in a partially observable Markov decision process, or POMDP. Figure 5.2 shows a POMDP as a Bayesian network. As in the fully observable case, the environment is modeled using a finite set of states, S, and a real valued reward function R(s) maps states to values. The agent has available to it a finite set of actions, A, which induce state transitions according to the probability function P(S_t | A_t, S_{t-1}). In the partially observable case, however, the state is not observable directly: it is some hidden underlying cause for the observations the agent makes, O_t, which are generated by states according to the probability function P(O_t | S_t, A_t).

Figure 5.2: Two time slices of a partially observable Markov decision process (POMDP) as a Bayesian network. Shaded nodes are directly observable, while unshaded nodes are unobservable. Actions, A, induce transitions between states, S, which are not observable directly, but only through their effects on a set of observations, O. The reward function, R, gives the utility of being in states S.

Since the state cannot be directly observed, it must be inferred from the observations, resulting in a belief state, b(s), such that b(s_t) is the probability that the system is in state s_t at time t. Updating the belief state in a dynamic Markovian model with observable inputs (the actions) was the subject of Sections 3.6.3 and 4.7. It involves only computing the forwards pass of the standard forwards-backwards algorithm.

Once the agent has the belief state, however, she is faced with a much more difficult problem: how to choose an optimal action based upon it. The difficulty arises because the belief state lives in a continuous (bounded) space, and so the effects of each action must be evaluated against an infinite number of possible immediate futures that it may lead to. This is in contrast with the fully observable case, where the immediate (one-step look-ahead) future is finitely enumerable. The standard way to approach this problem is to recast it as a fully observable MDP with a continuous, |S|-dimensional state space consisting of all possible belief states. In this case, the value of a belief state, b(s_t), can be recursively computed using Equation 5.2, where we must now integrate over all possible observations and next belief states. Since R(b(s_{t-1})) = Σ_{s_{t-1}} b(s_{t-1}) R(s_{t-1}) and

    P(o | b(s_{t-1}), a) = Σ_{s_t, s_{t-1}} P(o | s_t, a) P(s_t | s_{t-1}, a) b(s_{t-1}),

this is

    V(b(s_{t-1})) = max_{a∈A} [ Σ_{s_{t-1}} b(s_{t-1}) R(s_{t-1}) + β ∫_{o_t} P(o_t | b(s_{t-1}), a) V(b(s_t)) ]   (5.5)

Notice that we can write V(b(s)) = Σ_{s∈S} b(s) V(s). Since each V(s) is some real-valued number, V(b(s)) is a piecewise linear function of b(s). If the observations, o, are drawn from a finite set, then it can be shown that the dynamic programming update in Equation 5.5 preserves the piecewise linearity of the value function [KLM96]. This property means that it is possible to solve for the optimal policy, although such a calculation may be difficult due to the explosion of the number of linear pieces that make up the value function. It is often the case, however, that many of these linear pieces are dominated everywhere by some others, and can be removed from the representation of the value function, leading to significant efficiency improvements. Incremental pruning [CLZ97] is an algorithm for doing this. It is one of the most efficient algorithms, because it interleaves pruning into the dynamic programming step in such a way that the number of linear pieces is kept to a minimum. Factored state versions of the incremental pruning algorithm have also been developed [HF00], which make use of the structure inherent in the factored representation to increase the efficiency of the algorithm further.

If the observation space is continuous, as in our case, the problem becomes much more difficult. In fact, there are no known algorithms for computing optimal policies for such problems. The intractability of these problems arises because each dynamic programming step involves an integration over the entire observation space, destroying the piecewise linearity of the value function. The simplest possible approximation technique is to simply consider the POMDP as a fully observable MDP, and to use the value of the state, S, as the most likely value over the belief state, S = argmax_s b(s). This is called the MDP approximation. A slightly better approximation augments this most likely state with the entropy of the belief state, suitably discretized [RPT00]. For example, the position of a robot could be described as at x with low certainty, or the display of a human face could be described as a smile with high certainty. More complex approximate algorithms using Monte-Carlo sampling methods have also recently been studied [Thr00]. However, the complexity of these approximations grows very rapidly with the dimensionality of the input space, which is extremely large in our case.
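For a discrete observation set, the belief-state update referred to above (the forwards pass) has a simple closed form, b'(s') ∝ P(o | s', a) Σ_s P(s' | s, a) b(s), and a minimal sketch of it follows. The flat transition and observation arrays are hypothetical inputs for illustration.

    // Minimal POMDP belief update for a discrete observation set.
    #include <vector>

    std::vector<double> beliefUpdate(
            const std::vector<double>& b,   // current belief b(s)
            int a, int o,
            const std::vector<std::vector<std::vector<double>>>& T,  // T[a][s][s'] = P(s'|s,a)
            const std::vector<std::vector<std::vector<double>>>& Z)  // Z[a][s'][o] = P(o|s',a)
    {
        const int nS = (int)b.size();
        std::vector<double> bp(nS, 0.0);
        double norm = 0.0;
        for (int sp = 0; sp < nS; ++sp) {
            double pred = 0.0;
            for (int s = 0; s < nS; ++s) pred += T[a][s][sp] * b[s];  // prediction
            bp[sp] = Z[a][sp][o] * pred;                              // correction
            norm += bp[sp];
        }
        for (double& x : bp) x /= norm;   // normalize; norm equals P(o | b, a)
        return bp;
    }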
5.1.4 Learning POMDPs

There are two elements to be learned from data in POMDPs. First, the parameters of the underlying Bayesian network: the transition function and the observation function. Second, the reward function, which maps states and actions to values for the agent. Reinforcement learning deals with learning of both these quantities [KLM96]. Usually, this involves trading off between exploration of the world in order to learn a model, and exploitation of the learned model in order to achieve optimal rewards. There are many ways to do this. The most popular are model-free approaches, which directly learn the value function without learning models of state transitions and observations. Since we are dealing with very high dimensional observation spaces, these approaches are impractical. Model-based approaches, on the other hand, learn the transition and reward functions from data, and then generate a policy of action from these, as described in the last two sections.

We take a very simple model-based approach, and acquire training data and reinforcement signals during a training phase, during which the agent acts randomly. The models are then learned offline, and the results are used in a testing phase in which the agent acts according to the generated policy. This approach is known as certainty equivalence [KLM96]. The learning procedures we have described for POMDPs in Chapter 4 can be applied for learning the parameters of the POMDPs, while the reward function can be learned by simple counting. Although there are significant limitations to certainty equivalence, we are primarily interested in demonstrating how our computer vision techniques for temporal and spatial abstraction can be integrated with the generation of policy. Taking this system one step further to a fully on-line version would involve implementing a more complex learning algorithm, interleaving exploration and exploitation. One such model-based algorithm is prioritized sweeping [KLM96].

5.2 POMDPs For Non-Verbal Displays in Games

As described at the beginning of this chapter, agents in an interaction can be modeled as playing a multi-agent game, and this can be described as a POMDP using the decision-analytic approach. In this section, we restrict our attention to two-agent interactions, and label the two agents a and b. We will be looking at agent a's POMDP, which maintains a dynamically evolving model of the world that we will describe at time t as a state, S^a, in some discrete state space S^a. The state space does not have to be fully observable, but can be inferred from some observations, O^s. Each agent has some way of inferring distributions over unobservable states. In general, S^a includes agent a's model of the other agent in his environment, agent b. We are interested primarily in the non-verbal displays of agent b, so we will explicitly factor agent a's state space into agent a's description of agent b's non-verbal action, A^{b:a}, and the remainder of the state space, denoted again as S^a, as shown in Figure 5.3.

Figure 5.3: Time slice of the graphical model of interaction between two agents in a game. If the nodes are clustered according to the dashed lines, this model is a standard partially observable Markov decision process (POMDP), as in Figure 5.2.

Again, the actions A^{b:a} do not have to be fully observable, but may need to be inferred from observations, O. As usual, agent a can perform some action, A^a, which comes from some set of actions A^a. Finally, agent a maintains a reward function, R^a, which is a mapping from S^a × A^a × A^{b:a} to the real numbers.
Note that we have assumed that the agent has some way of separating the observations in O from those in O^s. This assumption requires that an agent is able to distinguish those observations that are related to the non-verbal actions, A^{b:a}, from those related to other states, S^a. If A^{b:a} are facial displays, for example, this could be accomplished through a visual face tracker that explicitly separates visual observations of the face, the hands, the context, and other observations. However, in order to focus our attention on the non-verbal displays, we further assume that only these displays are unobservable directly. The states of context and of other agent actions, S^a, are considered fully observable, as shown in Figure 5.4. Other than the utility nodes, this model has the same structure as the one discussed in Section 3.6.3. Nodes in Figure 3.17 have been relabeled in Figure 5.4 as D → A^{b:a}, A → A^a, and C → {A^a, S^a}. Some further differences in the temporal structure can be included or deleted without loss of generality. The observations, O, are entire sequences of video frames and spatio-temporal derivatives. We assume that the temporal segmentation is provided directly by the onset times of S^a and A^a. That is, we assume that observable states and actions are delimiters for temporal events, and that the non-verbal displays will occur between these times.

Figure 5.4: Time slice of the graphical model of interaction between two agents in a game, in which the state, S^a, is fully observable.

The parameters of such a model can be learned from data, as described in Section 4.7. The utility function can be learned using reinforcement learning, as discussed in Section 5.1.4. An approximate policy can then be derived using the techniques in Section 5.1.

5.3 Value-Directed Structure Learning

The value function, V(s), gives the expected value for the decision maker in each state. However, there may be parts of the state space which are indistinguishable (or nearly so) with respect to certain characteristics, such as value or optimal action choice. These indistinguishable states can be grouped or merged together to form an aggregate or abstract state. The set of abstract states partitions the state space according to some characteristic. States of the original MDP which are part of the same abstract state are not distinguishable insofar as decisions go. Eliminating the distinctions between them by merging states can lead to efficiency gains without compromising decision quality. An agent needs only distinguish those states which are useful to it for achieving value.

In fact, such state aggregation is a form of structure learning based upon the utility of states. This value-directed structure learning is in contrast to more data dependent structure learning, in which the structure is determined solely based upon the statistical distribution of the data, and the complexity of the model. For example, many structure learning algorithms use some simplicity prior (such as the minimum description length [Ris78, Bra99a, WPG01]), and find a trade-off between the model's precision and complexity. Others use an incremental generate-and-test mechanism in which states are added or deleted, and the model is tested to see if any advantage is gained [SO94]. We now discuss a particular technique for value-directed state aggregation applied to learning the number of facial displays or gestures that need to be distinguished in our learned POMDP.
As we have mentioned, the state space is represented in a factored POMDP as a product over a set of variables. In our model, the values of one of these variables, A^{b:a}, are the (unlabeled) gestures or facial displays. This variable splits the value function into N_d pieces, V_i, one for each value, i, of the variable A^{b:a}. Each such V_i gives the values of being in any state in which A^{b:a} = i. A similar split occurs for the policy, yielding sub-policies, π_i, giving the actions to take for each A^{b:a} = i. The V_i can be compared by computing the difference between them, d_ij = ||V_i - V_j||, where ||X|| = max{x : x ∈ X} is the supremum norm. Two sub-policies, π_i and π_j, are considered equivalent, written π_i ≡ π_j, if the optimal actions agree for every state. These comparisons are used in the following algorithm for learning the number of display states, N_d. The algorithm starts by assigning N_d to be as large as the training data will support, and prunes redundant states.

repeat
   1. learn the POMDP model
   2. compute V_i and π_i for all i using value iteration
   3. compute d_ij = ||V_i - V_j|| for all i ≠ j
   4. if there exist (i, j) such that π_i ≡ π_j:
   5.    {i, j} = argmin_{k,l} { d_kl | π_k ≡ π_l }
   6.    merge states i and j
   7.    N_d ← N_d - 1
until N_d stops changing

Figure 5.5: Procedure for value-directed structure and parameter learning for POMDPs.

There are many potential ways to merge states at step 6, but we simply delete one of the redundant states. Note that the algorithm could also start with N_d = 2 and add states until redundancies appear, but we have not experimented with this version. Added states could be initialized randomly, or as a copy of the most populated current state with added noise.
Learning the POMDP parameters (step 1 in Figure 5.5) involves iterations of expectation-maximization, as implemented by the forwards-backwards algorithm (message passing in the Bayesian network). The complexity of this procedure is O(N_d(N_x^2 + N_w^2)T), where N_x is the number of states in the dynamics process, N_w is the number of states in the configuration process, N_d is the number of high-level facial display states, and T is the length of the entire sequence of data. Value iteration (step 2 in Figure 5.5) has a complexity of O(N_s^2 N_a H), where N_s is the number of states in the POMDP, N_a is the number of actions, and H is the horizon. The remainder of the algorithm (steps 4-7 in Figure 5.5) is O(N_d^2). Therefore, the complete learning procedure has a worst-case complexity of O(N_d^2 (N_x^2 + N_w^2) T + N_d N_s^2 N_a H). In typical problems, N_d, N_x and N_w are all quite small numbers, while N_s is very large (exponential in the number of variables in the POMDP). Therefore, even using simple POMDP solution approximations, the complexity will generally be dominated by the second term, O(N_d N_s^2 N_a H). Attempting to compute optimal POMDP solutions would increase this complexity.
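The loop of Figure 5.5 can be sketched in Python as follows. Here learn_pomdp, value_iteration and relabel are hypothetical stand-ins for the EM learning of Chapter 4, the (approximate) solver of Section 5.1, and the merge step; the sketch only fixes the control flow.

    # A sketch of value-directed state merging (Figure 5.5). Sub-policies are
    # compared for exact agreement; the closest redundant pair (in sup norm
    # between their value functions) is merged by deleting one state.
    import numpy as np

    def merge_display_states(data, n_d):
        while True:
            model = learn_pomdp(data, n_d)               # step 1
            V, pi = value_iteration(model)               # step 2: V[i], pi[i] per display i
            pairs = [(i, j) for i in range(n_d) for j in range(i + 1, n_d)
                     if np.array_equal(pi[i], pi[j])]    # step 4: equivalent sub-policies
            if not pairs:
                break
            i, j = min(pairs,                            # steps 3 and 5: smallest d_ij
                       key=lambda p: np.max(np.abs(V[p[0]] - V[p[1]])))
            data = relabel(data, old=j, new=i)           # step 6: delete state j
            n_d -= 1                                     # step 7
        return model, n_d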
5.4 Experimental Design
To investigate the relationships between facial displays and other, conditioning factors, we adopt an experimental paradigm in which we observe humans playing computer games. The games embody constraints to be placed on the POMDPs, so we can focus the learning on specific areas of interest. The following describes the method.
1. Design the game and encode it as a POMDP. In the general case, the encoding of the POMDP would be learned directly from data. However, we are constraining many of the variables in the game in order to focus on facial display models, and so we characterize the context in terms of a set of discrete, high-level variables. There is no single way to encode a particular domain as a POMDP. Part of the process involves testing the encoding by manually assigning conditional probability distributions based on prior knowledge of the domain, and then computing a policy to ensure that it correctly predicts actions. This also gives the encoder some idea about how the problem is structured as an MDP, allowing for easier interpretation of results when the model is learned from data.
2. Gather training and test data sets. This involves a human playing the game either against another human or against a computer agent. In the latter case, the agent selects actions randomly in both the training and test data sets.
3. Apply the procedure in Figure 5.5 to learn the parameters and structure of the POMDP model from the training data. This also computes an approximate policy of action.
4. Use the model to predict actions in the test data set. In the case of two humans playing against one another, the predictions are over the (observable) actions of one of the players, and can be compared to the actual actions taken in the test data for a performance measure. In the case of a human playing against an agent, the predictions are action selections from the policy, but since the agent was playing randomly, these cannot be compared to the agent's actions. Instead, we use the actions as if they were selected in the real game, and collect rewards based upon them. The total collected rewards are a performance indicator.
The following three sections describe this procedure applied to three simple games. The first (imitation game) involves only facial displays as actions, and so does not have a reward function. This game is used to explore the representational power of our computer vision modeling techniques. The second (robot control) involves a single human performing gestures for robot control. The robot control experiments are intended as a simple demonstration of our system working with something other than facial expression. The experimental setup for the robot control experiments is very simple, and we do not claim that it is a gesture recognition system that would deal with orientation, time scaling, or robot movement and viewpoint. This second game also demonstrates our value-directed structure learning techniques. The third (card matching game) involves two humans playing a collaborative game. The facial displays are fairly simple, but the decision-theoretic problem is much more complex than in the other two games. This data is used to demonstrate how a policy can be computed based, in part, on non-verbal displays.
5.5 Imitation Game
The imitation game is a single-player game in which the player watches a computer-animated face on a screen, and is told to imitate the actions of the face. While many face generation systems use complex 3D graphics, this face is a simple cartoon. This allows for fast rendering, and avoids problems with the "uncanny valley", in which human-like interfaces look disturbing if they are just less than perfectly realistic [Mor82]. Further, research has shown that humans can interact with simple generated faces as with a real human face [RN96]. The animated displays start from a neutral face, as shown in Figure 5.6(a), then warp to one of the four poses shown in Figure 5.6.
The pose is held for roughly a second, and the face then warps back to the neutral pose, where it remains for an additional second. Although these displays may be reminiscent of so-called prototypical facial expressions, the displays they elicited were clearly not expressions of emotion, but only reactions to context. That they may have been interpreted as emotional displays by the subjects is not relevant here.
Figure 5.6: (a) The neutral face; (a_1 ... a_4) the four faces which subjects were told to imitate.
In this game, the actions of the human player are the imitative displays, and the state of the game is the animated display. The actions of the human are not observable, but must be inferred from a video stream. The additional independencies in this game result in a simplification of the model in Figure 5.3, as shown in Figure 5.7. The agent's actions, A^a, are choices of cartoon facial displays, a_1 ... a_4. The observations of the human's actions, O, are sequences of video images and the spatio-temporal derivatives between subsequent video frames, as described in Chapter 3. The human player's actions, as modeled by the agent, are described by a discrete N_d-valued variable, A^{b:a} ∈ {a_1 ... a_{N_d}}, where the agent has control over N_d. Although it is possible to include a utility function in this game, it would imply associating some reward with particular facial displays (e.g., smiles). The agent could then use the reward function to figure out which of its cartoon displays is most likely to elicit the rewarded displays. However, this game was designed to test the vision sequence representation model, and so the facial displays are not tied to any task, and a utility function is not very useful. Instead, we leave out the utility function, and compute a measure of the ability of the model to represent the different imitation displays. We therefore hide the cartoon display labels, A^a, in the test data set, and compute the distribution over this variable:

P(A^a|O) ∝ P(O|A^a) P(A^a) = [ Σ_i P(O|A^{b:a} = i) P(A^{b:a} = i|A^a) ] P(A^a)

The maximum of this distribution tells us which cartoon display was most likely to have produced the observed human display. Agreement with the actual cartoon displays gives us an indication of how well the model can represent the display imitations.
Figure 5.7: Time slice of the graphical model of the simple imitation game.
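As a sketch, this marginalization is a single matrix-vector product. The array names below are assumptions: like_obs[i] stands for P(O|A^{b:a} = i), computed from the learned sequence models.

    # A sketch of inferring the cartoon display from the observed imitation:
    # P(A^a|O) is proportional to sum_i P(O|A^{b:a}=i) P(A^{b:a}=i|A^a) P(A^a).
    import numpy as np

    def cartoon_posterior(like_obs, p_display_given_cartoon, p_cartoon):
        """like_obs: (N_d,); p_display_given_cartoon: (N_d, 4); p_cartoon: (4,)."""
        unnorm = (like_obs @ p_display_given_cartoon) * p_cartoon
        return unnorm / unnorm.sum()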
5.6 Hand Gestures for Robot Control
This "game" involves a human operator issuing navigation commands to a robot using hand gestures. The robot has four possible navigation actions, go left, go right, stop and go forwards, and the human operator uses four distinct hand gestures corresponding to each command. The robot, however, must learn the mapping from hand gestures to its actions. It learns this mapping during a training phase in which it acts randomly in response to the operator's hand gestures. Rewards of one unit are explicitly assigned by the operator if the robot performs the correct action. The robot gets no reward if it performs the wrong action. Again, each interaction in this game is temporally independent. The POMDP for each time slice is shown in Figure 5.8 from the point of view of the robot. The robot's action is denoted A^a, while the operator's action (to reward the robot or not) is explicitly represented in the model as A^b, and is fully observable. Thus, the reward function is a one-to-one mapping from this action. The robot's observations of the operator's hand gesture, O, are conditioned on the high-level interpretation of the gesture, A^{b:a}, which is a discrete-valued variable with N_d values. An optimal policy of action in this model need only be computed over a horizon of one time step (since the actions are temporally independent). The policy, π, specifies an action for each possible recognized gesture, A^{b:a}, such that A^a = π(A^{b:a} = i) is the action which will most likely result in the operator rewarding the robot if A^{b:a} = i is observed. For example, if the command is a stop gesture, then the robot's stop action will most likely result in a reward. Notice that the MDP approximation can easily lead to sub-optimal action choices. Suppose that the distribution over gesture interpretations, P(A^{b:a}|O), has a roughly equal value for two values of A^{b:a}, so the robot is uncertain about which gesture was actually performed. The MDP approximation will simply choose the most likely one, possibly leading to an error. An optimal solution to the POMDP would include this uncertainty, and might specify some other action (such as asking for clarification) in the case of such uncertainty. This type of analysis is important for dialogue modeling [PH00].
Figure 5.8: Two time slices of a POMDP for robot control gestures. A^a is the robot's action (one of go left, go right, stop, forwards), A^b is the operator's action (reward or punish the robot), R is the reward function (a one-to-one mapping from A^b), A^{b:a} is the robot's interpretation of the control command (gesture), and O is the video sequence observation of the gesture.
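The MDP approximation amounts to collapsing the belief over gestures to its mode before acting. A sketch, with hypothetical names:

    # A sketch of the one-step MDP approximation: act on the most likely
    # gesture interpretation, discarding the rest of the belief distribution.
    import numpy as np

    def mdp_approx_action(belief, policy):
        """belief: P(A^{b:a}|O) over gestures; policy: gesture index -> action."""
        return policy[int(np.argmax(belief))]

An optimal POMDP policy would instead map the entire belief vector to an action, which is what allows it to prescribe, for example, a clarification request when the belief is nearly flat.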
5.7 Card Matching Game
Two teams of two players per team play the card matching game. At the start of a round, each player is dealt three cards: a heart, a diamond and a club. Each player can only see his own set of cards. The values of the cards (ranging from ace to ten), and their placement on the table, are randomly distributed. Each player's cards are dealt from a different deck, which is re-shuffled after every round. The players all play a single card simultaneously, and if the suits of the cards played by the two members of a team match, then that team reserves the sum of the values on the two cards; otherwise, that team reserves zero. The team with the highest reserve wins their reserve, while the other team wins nothing (and loses their reserve). On alternate rounds, a player has an opportunity to send a confidential bid to his partner, indicating a card suit. The bids are non-binding and do not directly affect the payoffs in the game. During the other rounds (displaying rounds), the player can only see her partner's bids, and then play one of her cards. There is no time limit for playing a card, but the decision to play a card is final once made. Finally, each player can see (but not hear) his teammate through a real-time video link. This game constrains the players to use their gestures by not having an audio link, so the players cannot speak to one another. Players cannot see members of the opposing team, nor can they see the confidential bids of the opposing team. One of the two teams is simply implemented in software, and does not actually have a video link or a bidding process. Its presence is only to give the human players incentive to cooperate, and to try to choose the highest matching suit. We can, therefore, assume that the human players we will be modeling assign positive value to choosing matching pairs of cards with the highest possible value. This will allow us to disregard the opposing team in the analysis, and use a simplified reward structure.
A picture of the game interface during a typical interaction is shown in Figures 5.9 and 5.10. We will refer to the two players by name in what follows: player 1 is "Ann", while player 2 is "Bob". On the left, Ann's interface shows her cards face up below the 'card table', Bob's face, Bob's cards face down below his face, the opposing team's cards on the left and right of the card table, and both teams' current winnings along the bottom. On the right, Bob's interface shows the same thing, but he sees Ann's face and sees only his own cards face up. Figure 5.9 shows stage 1, in which Ann (left) has made a bid of her highest card (a heart) to Bob (right). In stage 2, both players have committed their hearts, and the round terminates with the team winning 12 points. The interfaces at the beginning of the next round are shown in Figure 5.10, in which the players are happy to have won, can see their winnings added to their total, and it is now Bob's turn to bid.
Figure 5.9: Game interfaces for the two players during a typical round. See text for description.
Figure 5.10: The beginning of the round subsequent to the one shown in Figure 5.9.
We now describe the encoding of the card matching game as a POMDP. The only part of the state which is non-observable is the facial displays of the players. In more complex games, this may not be true, and we would have to model other parts of the state space as non-observable. Our modeling techniques do not preclude this. However, solutions to the POMDPs will become increasingly difficult as the non-observable part of the state space grows. For example, in the card matching game, Bob could maintain beliefs about the values of Ann's cards. These beliefs could be useful in taking an optimal decision. If he notices that Ann vehemently shook her head when he bid clubs, sort of grimaced when he bid hearts, and sort of nodded when he bid diamonds, he might assess a distribution over her cards which would help him bid the optimal choice for both. Our assumption, however, seems to be well supported in our experimental data, where the players often commit to bids on the first step of a round, without engaging in complex deliberations about the precise value of all the cards. For a given player, each round of the game is either a bidding round (when she makes the bid), or not (when she sees her partner's bid).
To simplify the analysis, we will consider each of these types of rounds separately. We will show how the two can be combined into a single POMDP in Section 5.7.3.
5.7.1 Bidding Round
There are nine variables which describe the state of the game when a player has the bid. While these may not be the smallest set of variables, they give an intuitive description of the game. Aggregation techniques [SAHB00] can always be used to reduce the state space further. Table 5.1 shows the variables, their values, and gives a brief description of each.

variable | values | description
B♥v   | {v1, v2, v3}       | value of Bob's heart card
B♦v   | {v1, v2, v3}       | value of Bob's diamond card
B♣v   | {v1, v2, v3}       | value of Bob's club card
Acv   | {null, v1, v2, v3} | value of card Ann committed
Bcv   | {null, v1, v2, v3} | value of card Bob committed
match | {null, ♥, ♦, ♣}    | match of played cards is false (null) or one of the suits
Aact  | {null, cmt♥, cmt♦, cmt♣} | Ann's action is nothing (null) or commit (cmt) a suit
bid   | {null, ♥, ♦, ♣}    | current bid suit
Acom  | {d_1 ... d_N_d}    | state of Ann's facial display
Bact  | {null, bid♥, bid♦, bid♣, cmt♥, cmt♦, cmt♣} | Bob's action is nothing (null), or bid (bid) or commit (cmt) a suit

Table 5.1: Variables and actions for Bob in the card matching game POMDP during a round when player 2 has the bid.

The POMDP we will describe can be used for either player. However, to keep things clear, we will develop the model from Bob's perspective. Variable names which are Bob's private, observable state will start with a B, while those which Bob uses to describe Ann's state and actions will start with an A. Variables which are common knowledge do not start with either letter. The state could be used to model Ann's perspective, of course, by renaming Ann to Bob and vice-versa. The three suits in the game, hearts, diamonds, and clubs, will be denoted by the symbols ♥, ♦ and ♣, respectively. Bob's actions (those over which he has control) are Bact, which can be null (no action), sending a confidential bid of one of the suits (bid♥, bid♦, bid♣), or committing a card of some suit (cmt♥, cmt♦, cmt♣). Ann's observed actions are part of the state, Aact, and can be null (no action), or committing a card of some suit (cmt♥, cmt♦, cmt♣). The values of Bob's heart, diamond and club cards are B♥v, B♦v and B♣v, respectively. There are ten possible values for each card, which adds much complexity to the decision problem by expanding the state space. To reduce this complexity, we approximate these ten values with three values, v1, v2, v3, where cards valued 1-4 are labeled v1, cards valued 5-7 are labeled v2, and cards valued 8-10 are labeled v3. This approximation may hide some of the structure of the game. For example, if a player's cards are 1, 3 and 4, then this re-labeling will give them all equal value (v1). However, we assume that when cards are close in value, the player has little preference over which one is actually played. This assumption seems to be borne out by experience, and the approximation seems sufficient to generate a near-optimal policy. Another possibility would be variables describing which suits hold the highest, second highest and lowest cards.
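As a sketch, this value abstraction is a simple binning of the raw card value:

    # A minimal sketch of the card value abstraction: raw values 1-10 are
    # collapsed to the three abstract levels v1 (1-4), v2 (5-7) and v3 (8-10).
    def abstract_value(card_value):
        if card_value <= 4:
            return 'v1'
        if card_value <= 7:
            return 'v2'
        return 'v3'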
The values of Ann's committed cards are Acv, and the values of Bob's committed cards are Bcv. These can be null, or v1, v2, v3. The match variable gives the suit of a match (if the committed cards do indeed match); otherwise it is false (null). These variables (Acv, Bcv and match) only become non-null after commit actions from both players, and immediately become null again on any subsequent action. The bid variable describes the bid currently on the table (observable to both players on a team). It is affected by Bob's bidding actions, but is set to null after each round. Finally, the Acom variable describes Ann's communication through the video link. It is one of N_d states, d_1 ... d_{N_d}, which have unspecified meaning. They will, however, obtain meaning through their interactions with the other variables in the POMDP.
The POMDP model is shown in Figure 5.11 as a two time-slice Bayesian network. This is the same model as was shown in Figure 5.3, with the additional assumption that Ann's action, Aact (A^{b:a} in Figure 5.3), is fully observable, and so the action observations, W, do not play a role. Furthermore, we have factored the state, S_t, into nine variables, and included the action of the partner (Ann) as part of the state.
Figure 5.11: Two time slices of Bob's graphical model for the card matching game. The variables and actions are described in Table 5.1.
The conditional dependencies between fully observable variables are easy to learn by counting co-occurrences. In fact, they can be read off from the structure of the game. For example, the heart's value (B♥v) stays the same on a bid action, but gets reset randomly on a commit action. The value of Bob's committed card (Bcv) is exactly the value (from B♥v, B♦v, B♣v) of the card Bob committed (as given by Bact ∈ {cmt♥, cmt♦, cmt♣}). The value of Ann's committed card (Acv) is an even distribution over the possible values if Ann's action (Aact) is non-null. Finally, the match is suit s if Aact and Bact are both cmt(s). The distributions which must be learned from video data are shown in Table 5.2.

name | S_t  | S_{t-1}, A_t      | description
Θ_A  | Aact | {bid, Acom, Bact} | what Ann will do given Bob's action, the current bid, and Ann's previous display
Θ_D  | Acom | {bid, Acom, Bact} | what Ann will display given Bob's action, the current bid, and Ann's previous display
N(O) | O_t  | Acom              | probability of a set of observations given the high-level description of Ann's display

Table 5.2: Conditional probability distributions to be learned from data in the POMDP for the card matching game from Bob's perspective.

All three of the distributions in Table 5.2 are part of the model we discussed in Section 3.6.3, and so learning the POMDP reduces to learning the C3MG. The first, Θ_A, describes what Ann will do given Bob's action, the previous bid, and Ann's display. For example, we may expect that Ann will commit her diamond if the previous bid is diamonds and Ann's display was a nod of the head. This distribution is actually independent of Bact, since Bob's action is only made public after both players have acted, so that Ann will not be affected by Bob's action in the same time step. The second learned distribution, Θ_D, describes what Ann will display given Bob's action, the previous bid, and Ann's previous display. For example, we may expect Ann to nod her head if the previous bid was hearts, her previous display was a shake of the head, and Bob has just bid diamonds. Certainly, agreement will be more likely on the second bid than on the first if the first was refused. It should now be clear how these two distributions give "meaning" to Ann's displays, Acom. Bob only needs to know what displays to expect from Ann, and what actions Ann's displays are predictive of in any given situation. The POMDP based on these distributions will yield the same policy of action regardless of what particular displays Ann actually chooses. It only matters that her displays are consistent predictors of her future actions in each context.
The number of display states to be learned (N_d) must still be specified manually. However, since we are using the results in a probabilistic model (the POMDP), it is only critical to choose N_d large enough to capture all the important displays in the game. Our value-directed structure learning techniques (Section 5.3) can then be used to find the smallest set of display states which still yields high expected value policies. Practical concerns limit the value of N_d we start with, since it also depends on the amount of training data available.
With a small training set, choosing N_d too large will result in models which are heavily biased by the prior distributions. The resulting policies may be very sub-optimal.
The reward function is based only upon fully observable variables, and so can be written down directly. Recall that, in the actual game, the players only get the sum of the values on the cards they played if the card suits match, and only if that sum is greater than the other team's sum when the other team's cards also match. However, the other team only plays a motivational role in the game, and we can disregard it here by assuming the players are rewarded by getting a match, and that the reward in the POMDP is the sum of the card values, regardless of the other team's play. We are assuming that the players are striving only to play matched suits which have the greatest sum, since they have no control over the other team's play. Thus, the reward function is Acv + Bcv if match is not null, and 0 otherwise.
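A sketch of this reward function, assuming a hypothetical numeric encoding of the abstract card values:

    # A sketch of the bidding-round reward: the sum of the committed card
    # values when the suits match, and zero otherwise. LEVEL is a hypothetical
    # numeric stand-in for the abstract values v1-v3.
    LEVEL = {'v1': 1, 'v2': 2, 'v3': 3}

    def reward(match, acv, bcv):
        if match is None:            # 'null': the committed suits did not match
            return 0
        return LEVEL[acv] + LEVEL[bcv]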
Section 6.3 shows values for these distributions learned from training data, and shows a policy of action generated from the resulting POMDP.
5.7.2 Displaying Round
On alternate rounds in the card matching game, a player cannot bid, but can only watch for bids from her partner. However, the actions available to the player now involve generating displays indicative of agreement (or not) with the bid. This is not a facial display modeling problem (since a player cannot see their own display), but a generation problem: the player must make facial displays. Thus, a POMDP model of the displaying rounds would involve actions for each facial display necessary in the game, and these actions could be used to drive an animated character, for example. However, prior knowledge of the facial displays and a model of how to generate them would be necessary. Since this thesis is not concerned with the generation problem, we do not discuss this round of the game in detail, but only give a rough overview of what the associated decision problem looks like. We then show how the MDP for the displaying round can be integrated with the POMDP for the bidding round to form a single POMDP for each player for the game.
In the displaying round, the variables are the same as in the bidding round, except that A and B are reversed, and there is no need to model the displays of the partner, so the MDP is fully observable. The actions available to the decision maker are now performing facial displays and committing cards. As we have seen, the displays used by players of the card matching game are simple head nods and shakes, which we use as actions in the displaying round MDP. The model combines these displays with the actions to commit cards in the game (which are final in the round), to give the five actions shown in Table 5.3.

action   | variables affected | description
wait     | none | do nothing
shake    | bid  | indicate disagreement with current bid; expect next bid to be different
nod.cmt♥ | all  | indicate agreement with current bid and commit ♥
nod.cmt♦ | all  | indicate agreement with current bid and commit ♦
nod.cmt♣ | all  | indicate agreement with current bid and commit ♣

Table 5.3: Actions during the displaying round.

5.7.3 Combining MDPs
Combining the partially observable MDP for the bidding round with the fully observable one for the displaying round involves simply adding a round variable, which alternates between bid and display. When round is bid, the conditional probability tables are those for the bidding round, and when round is display, the conditional probability tables are those for the displaying round. The full action set is always available, but bidding round actions have no effect in the displaying round, and vice-versa. Figure 5.12 shows the combined POMDP from Bob's perspective. We have labeled the subset of state variables S^a = {B♥v, B♦v, B♣v, Acv, Bcv, match, bid}, but have kept Ann's action, Aact, and Ann's display, Acom, factored. The policy generated by this combined POMDP is factored on the round variable into the policy for the bidding round and that for the displaying round. Each of these smaller POMDPs can be solved independently, and the resulting policies combined in this way.
Figure 5.12: Two time slices of the combined POMDP for the card matching game.
Chapter 6
Experiments
This chapter presents experimental results from training some of the models presented in Chapters 3 and 4. We present results from the three interactions described in Chapter 5. Section 6.1 describes the imitation game, in which a single subject imitates a cartoon display face (Section 5.5). Section 6.1.1 shows how we can find clusters of similar flow fields using the simple Gaussian mixture model from Section 3.2 (learning described in Section 4.2). This model considers all frames of the video to be independent, and looks only to find flow field models which efficiently span the set of flow fields in the data. The results are similar to those we have presented in [HL03]. Section 6.1.2 shows the same type of procedure applied to (temporally independent) images using the model in Section 3.3 (learning in Section 4.3). The results in these two sections are meant to demonstrate how the spatial abstraction of flow fields and images is performed. The analysis is qualitative, and is meant to acquaint the reader with our descriptions of the low-level learned models. Section 6.1.3 then shows the results of training models with temporal dependence. In particular, we train the context-dependent mixture of coupled hidden Markov models (C3MG), as described in Section 3.6.1. The analysis in this case does not include any utility measures, as described in Section 5.5. Instead, we predict labels on the test sequences as an indication of the representative power of the model. The robot control experiments involve a single person gesturing to a robot, and are described in Section 6.2. The results are from cross-validation experiments in which the robot actions are chosen according to the learned model for the test data set, and the accumulated reward is used as the measure of success. Section 6.3 presents results on the data taken during the simple card matching game described in Section 5.7. In this case, we train the full model including actions, and attempt to find policies of action based upon the model. While the facial displays are simpler during this game, the experiments we present show how actions can be incorporated and how policies can be discovered based upon the learned visual models.
6.1 Imitation Game
Subjects were seated in front of a computer terminal on which an animated cartoon showed facial displays according to the game described in Section 5.5. Three subjects were told that their task was to imitate these displays; they were shown each of the displays initially and told to practice imitating them.
Once they were satisfied with their imitations, they pressed a key, and the system began recording a video sequence through a Sony EVI-D30 color camera mounted above the computer screen. While the subjects were being recorded, the cartoon face performed a series of 40 randomly selected facial displays over a period of 2 minutes. Frames were captured at 160 × 120 with a BTTV frame grabber card on a desktop Pentium III PC running the Linux operating system. The frame rates were almost always above 28 fps. After the experiment, most subjects reported either that they did not notice a significant difference between cartoon displays a_1 and a_2, or that they could not find a way to imitate the second one, a_2, due to its extremely down-turned mouth.
6.1.1 Clustering Flow Fields
We first examine the results of learning the simple mixture of Gaussians model with feature weighting, described in Section 4.2, on data taken during the imitation game. Given the video sequence, we want to discover the major categories of instantaneous motions (flow fields) present. This is an important preliminary step in learning the full temporal models, as it constitutes the lowest level and performs the spatial abstraction. We clustered a set of 904 frames from a 3600-frame sequence of a person performing 4 different facial expressions during the imitation game. As described in Section 5.5, the subject was imitating an on-screen cartoon face which was displaying 4 prototypical expressions: happy, sad, surprised and angry. The person's face was tracked using the flow-based tracker described in Section 3.7. The video frames in this data set are not labeled, and so the analysis is qualitative: our methods discover clusters of optical flow fields, and we interpret these clusters, which can be related to established high-level concepts. We automatically selected the 904 frames which had significant motion in them by thresholding the mean magnitude of the optical flow. Applying our methods to all 3600 frames does not substantially change the result, since the flow fields from the other frames all fall close to the origin, and so are represented by one of the learned clusters.
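A sketch of this selection step; compute_flow stands in for the flow computation of Section 3.2, and the threshold value is an assumption:

    # A sketch of frame selection: keep only frames whose mean optical flow
    # magnitude exceeds a threshold, so that clustering sees significant motion.
    import numpy as np

    def select_moving_frames(frames, threshold=0.5):
        keep = []
        for t in range(1, len(frames)):
            u, v = compute_flow(frames[t - 1], frames[t])  # dense flow (H x W each)
            if np.mean(np.hypot(u, v)) > threshold:
                keep.append(t)
        return keep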
We used the first 16 Zernike coefficients for each of the horizontal and vertical flows, resulting in a 32-dimensional basis vector, z. The noise parameters were set to σ_1 = 0.08, σ_2 = 1.0, σ_p = 10.0, σ_0 = 0.5 and σ_a = 0.1. The feature weighting parameters were set to a = 1, b = 0.01 and α = 34. The parameters μ_{z,x} and Λ_{z,x} were initialized by choosing K data points randomly as the initial seeds for K-means clustering, and Gaussians were fit to the resulting classes. Feature weights were initialized to 1. Results were relatively insensitive to the initialization. We trained a model with 8 classes. We chose 8 classes because this number is large enough to see some structure in the learned classes, but small enough to allow for significant numbers of data points in each class. Figure 6.1 shows the final values of the feature weights, τ_k.
Figure 6.1: Feature weights learned for facial expression data.
The first 16 dimensions are the Zernike coefficients corresponding to horizontal flow (uA_n^m, uB_n^m for n < 5), while the last 16 are those corresponding to vertical flow (vA_n^m, vB_n^m for n < 5), ordered by increasing n and m values. The feature weights clearly favor the vertical flows, because a major component of the facial expressions is the raising and lowering of the eyebrows. The four most relevant features are all low-order vertical flow coefficients, including vB_1, vA_1 and vA_0. There are six other moderately relevant features. The remaining 22 features are irrelevant.
Figure 6.2 shows the reconstructed Zernike vectors plotted along two of the relevant features (vB_1, vA_0). The clusters are denoted by the shape and color of the data points. Reconstructed optical flow fields (using Equation 3.24) are shown for representative frames within each cluster. The classes are roughly (1) little or no motion, (2) eyebrows raising slowly, (3) jaw expanding, (4) eyebrows raising quickly, (5) jaw relaxing and (6) eyebrows lowering. The remaining two classes corresponded to translational motions (7) up and (8) down. However, these last two clusters only accounted for a small fraction of the data, and are not considered further.
Figure 6.2: Clustering result for facial expressions along the two most relevant features.
Figure 6.3 shows an example of an eyebrow-raising event. The original images are shown with the tracked regions superimposed. The temporal derivatives are shown below the images. We do not show the multi-scale temporal derivatives, only those at the highest resolution; they are meant to be demonstrative of the change in the image only. The flow fields computed using the Simoncelli method are shown below the derivatives. Finally, the expected reconstructed flow fields from the model states are shown. The two central flow fields (frames 105-107) are detected as state 4 (eyebrows raising rapidly), surrounded by more slowly rising eyebrow motions (state 2). Once the eyebrows reach their apex, the state returns to 1 (no motion) by frame 108.
Figure 6.3: Eyebrow raising classified as states 2 and 4. The corresponding eyebrow lowering is shown in Figures 6.5 and 6.6.
Figure 6.4 shows a smiling event detected as state 3 from frames 2174-2178, surrounded by state 1 events (no motion). Figures 6.5 and 6.6 show the sequel to Figure 6.3, in which the subject's face returns to neutral. He begins by lowering his eyebrows (Figure 6.5, frames 115-117), which is classified as state 6, followed by a relaxation of his smile (Figure 6.6, frames 162-164), which is classified as state 5.
Figure 6.5: Eyebrow lowering event classified as state 6.
Figure 6.6: Smile returning to neutral classified as state 5.
6.1.2 Clustering Images
In this section, we show results from training the simple mixture of feature-weighted Gaussians on the pose information (projections of the images onto the Zernike basis). Again we clustered the data from the imitation game. In this experiment, however, we use all 3600 frames as the training data. We used the first 64 Zernike coefficients. The feature weighting parameters were set to a = 1, b = 0.01 and α = 34. The parameters μ_{z,x} and Λ_{z,x} were initialized by choosing K data points randomly as the initial seeds for K-means clustering, and Gaussian distributions were fit to the resulting classes. The feature weights were all initialized to 1. The results were relatively insensitive to the initialization. We trained a model with 8 classes. Figure 6.7 shows the final values of the feature weights, τ_k. The two most relevant features are A_1 and B_2, and five further coefficients are also significant.
Figure 6.7: Feature weights learned for facial expression data.
Figure 6.8 shows the reconstructed Zernike vectors plotted along two of the relevant features (A_1, B_2). The clusters are denoted by the shape and color of the data points. Also shown are the means and level curves of the covariances of the Gaussian output distributions. Reconstructed configurations from the mean Zernike feature vectors are also shown as grayscale images. The classes are roughly: (1), (3) and (8) surprised, (2) frown, (4) and (5) neutral, (6) smile with eyebrows raised, and (7) smile.
Figure 6.8: Clustering result for facial expressions along the two most relevant features. The 8 means are numbered, and their associated covariances are shown as level curves. Reconstructed images for the mean Zernike feature vectors are shown.
Figure 6.9 shows the classification of the same frames shown in Figure 6.4, taken during a smiling event. The frames with the subject in a smiling pose are classified as state 6, which we have previously recognized as a smiling pose.
Figure 6.9: Smiling pose classified as state 6. The original frames with the tracked facial region are shown, along with the expected reconstructions given the simple mixture model. The most likely states of the mixture are given below the reconstructions.
Figure 6.10: Return from smiling pose classified as state 4.
6.1.3 Clustering Display Sequences
The videos from the imitation game described in the last section were temporally segmented using the onset times of the cartoon displays, and the resulting sequences were input to the mixture of coupled HMMs clustering and training algorithm described in Section 4.6, using 4 clusters (the number of displays the subjects were trying to imitate). The Viterbi algorithm was used to assign cluster membership, D, to each sequence; these assignments were then compared to the known classes of displays the subjects were trying to imitate. The exact model recovered is partially dependent on the randomness in the initialization procedure. However, we found the recovered cluster memberships to be fairly consistent, an indication that our initialization procedure is robust to the random starting points. The results we present in the following section are usually the results we obtained on the first trial. Some, however, were given multiple trials, and the most often recurring results are presented. The remainder of this section evaluates the results from one of the subjects who performed the experiment. The results from this subject are demonstrative of the results from the other subjects. We first show the learned high-level probability distribution, P(A^{b:a}|A^a), which describes the likelihood of observing each high-level motion state given each cartoon display. We then show two of the models learned for this subject: one for "smiling" imitations, and one for "surprised" imitations. For each model, we show the learned feature weights and output distributions for both the dynamics and configuration chains. We also show how each model analyses two sequences. Table 6.1 shows the learned model parameter, P(A^{b:a}|A^a), for subject A, in which each row is one A^a state (cartoon display on screen) and each column is one recovered cluster, A^{b:a}_1 ... A^{b:a}_4. To simplify notation, we use C = A^a and D = A^{b:a}.

        D_1   D_2   D_3   D_4
C_1    0.01  0.01  0.61  0.37
C_2    0.02  0.48  0.17  0.33
C_3    0.01  0.97  0.01  0.01
C_4    0.91  0.07  0.01  0.01

Table 6.1: Probability distribution P(A^{b:a}|A^a) learned for subject A.
We see that most of the responses to the cartoon display C_4 were classified as D_1, and most of the responses to C_3 were classified as D_2. Responses to C_2 were split between those that looked the same as responses to C_3 (and so were classified as D_2), and those that looked similar to some of the responses to C_1 (classified together in state D_4). The D_3 model classified the majority of the responses to C_1. In the following sections, we describe each model, and show some sequences which were classified as belonging to that model.
Model D_1
Feature weights for the model 1 dynamics and configuration chains are shown in Figure 6.11. The dynamics chain has five significant features: two in the horizontal flow components, uA_1 and uB_2, and three in the vertical flow components, vA_0, vB_1 and vA_1. The three most significant features in the configuration chain are B_1, B_3 and A_1.
Figure 6.11: Feature weights for model 1 dynamics and configuration chains.
The output distributions of the four states (X) in the dynamics chain are shown in Figure 6.12, plotted along the two most significant feature dimensions, uA_1 and vB_1. Two states (X = 2, 4) correspond to no motion (the face is stationary), while the other two correspond to expansion upwards and outwards in the bottom of the face region (X = 1), and contraction downwards and inwards in the bottom of the face region (X = 3). We will see that these states correspond to the expansion and relaxation phases of smiling.
Figure 6.12: Dynamics chain model 1 output states plotted along the two most significant dimensions according to the feature weights, uA_1 and vB_1. Reconstructed flow fields for the X state means are also shown.
The output distributions of the five configuration states are shown in Figure 6.13. There are two states (W = 2 and W = 4) which describe the face in a fairly relaxed pose, while W = 1 and W = 3 describe "smiling" configurations.
Figure 6.13: Configuration chain model 1 output distributions. Reconstructed grayscale images are shown for the W state means.
We now show how the C3MG model analyses a particular sequence from the imitation game for which the most likely high-level state is D = 1. We show the expected feature vector for each frame in the dynamics and configuration chains, Z_x and Z_w, respectively, and the expected output flow fields and poses. Figure 6.14 shows the trajectory of the reconstructed dynamics Zernike vector (Z_x) along the two most important feature vector dimensions for a sequence classified as model D = 1. We see the sequence starts in state 2 (no motion), enters state 1 (smile expansion), then returns to state 2 (holding the smile), finally contracting (state 3) back to a relaxed pose (state 2 again).
Figure 6.14: Trajectory of the dynamics feature vector (Z_x) for sequence 13 (for which model 1 has the highest posterior), shown given model 1. The level curves of the Gaussian output covariances for model 1 are also shown.
Figure 6.15 shows the trajectory of the reconstructed configuration Zernike vector (Z_w) along the two most important feature vector dimensions for the same sequence. We see that the face is in the relaxed pose, W = 2, at the beginning and the end of the sequence, and in the "smiling" pose, W = 1 and W = 3, in the middle, from frames 1189 to 1250.
Figure 6.15: Trajectory of the configuration feature vector (Z_w) for sequence 13 (for which model 1 has the highest posterior), shown given model 1. The level curves of the Gaussian output covariances for model 1 are also shown.
Figures 6.16 and 6.17 show the model's explanation of the same sequence, for the frames indicated in Figures 6.14 and 6.15. We see the high-level distribution over D is peaked at D = 1. Distributions over the dynamics and configuration chains show which state is most likely at each frame. The expected pose, H, and flow field, V, are shown conditioning the image, I, and the temporal derivative, f_t, respectively.
Figure 6.16: The face starts in its rest configuration (W = 2), and expands into a smiling pose (W = 1) with the X = 1 flow field.
Figure 6.17: The face starts in a smiling pose (W = 1), and contracts with the X = 3 flow field to a rest configuration (W = 2).
Model D_2
Feature weights for the model 2 dynamics and configuration chains are shown in Figure 6.18. The dynamics chain has three significant features, all in the vertical flow components: vA_0, vB_1 and vB_3. The three most significant features in the configuration chain are B_1, A_0 and A_1.
Figure 6.18: Feature weights for model 2 dynamics and configuration chains.
The output distributions of the seven states (X) in the dynamics chain are shown in Figure 6.19, plotted along the two most significant feature dimensions, vA_0 and vB_1. States X = 2, X = 3 and X = 6 all describe little or no motion. State X = 1 describes flow upwards in the upper part of the image, roughly corresponding to the eyebrows raising. State X = 5 also includes upwards motion in the upper part of the image, but adds motion downwards and inwards in the lower part of the image, corresponding to the eyebrows raising and the jaw dropping at the same time. State X = 4 describes motion downwards in the upper part of the image, with slight upwards motion in the lower part of the image, corresponding to the face relaxing from an expanded state. State X = 7 describes small motions downwards in the upper part of the image.
Figure 6.19: Dynamics chain model 2 output states plotted along the two most significant dimensions according to the feature weights, vA_0 and vB_1. Reconstructed flow fields for the X state means are also shown.
The output distributions of the five configuration states are shown in Figure 6.20. There are two states (W = 1 and W = 5) which describe the face in a fairly relaxed pose, while W = 2 and W = 3 describe configurations in which the eyebrows are raised and the jaw is dropped. State W = 4 appears to be a "smiling" pose.
Figure 6.20: Configuration chain model 2 output distributions. Reconstructed grayscale images are shown for the W state means.
We now show how the C3MG model analyses a particular sequence from the imitation game for which the most likely high-level state is D = 2. We show the expected feature vector for each frame in the dynamics and configuration chains, Z_x and Z_w, respectively, and the expected output flow fields and poses. Figure 6.21 shows the trajectory of the reconstructed dynamics Zernike vector (Z_x) along the two most important feature vector dimensions for a sequence classified as model D = 2.
We see the sequence starts with no motion, enters state 1 (eyebrow raise), then returns to no motion (holding the eyebrows raised), finally contracting (state 4) back to a relaxed pose.
Figure 6.21: Trajectory of the dynamics feature vector (Z_x) for sequence 11 (for which model 2 has the highest posterior), shown given model 2. The level curves of the Gaussian output covariances for model 2 are also shown.
Figure 6.22 shows the trajectory of the reconstructed configuration Zernike vector (Z_w) along the two most important feature vector dimensions for a sequence classified as model D = 2. We see that the face is in the relaxed pose, W = 5, at the beginning and the end of the sequence, and in the "eyebrow raised" pose, W = 2 and W = 3, in the middle, from frames 1002 to 1059.
Figure 6.22: Trajectory of the configuration feature vector (Z_w) for sequence 11 (for which model 2 has the highest posterior), shown given model 2. The level curves of the Gaussian output covariances for model 2 are also shown.
Figures 6.23 and 6.24 show the model's explanation of the same sequence, for the frames indicated in Figures 6.21 and 6.22. We see the high-level distribution over D is peaked at D = 2. Distributions over the dynamics and configuration chains show which state is most likely at each frame. The expected pose, H, and flow field, V, are shown conditioning the image, I, and the temporal derivative, f_t, respectively.
Figure 6.23: The face starts in its rest configuration (W = 5), and expands with the X = 1 flow field towards an "eyebrows raised" (W = 2) configuration (see Figure 6.24).
Figure 6.24: The face starts in an "eyebrows raised" pose (W = 2), and the mouth opens with the X = 5 flow field, resulting in a "surprised" configuration (W = 3).
6.1.4 Inferring Agent Actions
The preceding section gave a qualitative analysis of one subject's displays in the imitation game, and the models we could learn from a set of training data. We can obtain quantitative results by attempting to infer the cartoon display given the human imitation, based on the learned model, as described in Section 5.5. We did this analysis using a leave-one-out cross-validation experiment for each of the three subjects who participated in the imitation game. There were 40 sequences (imitations) for each subject, one of which was removed. The remaining 39 sequences were used to train the C3MG as in the last section. The learned model was used to infer the cartoon display, A^a, from the remaining (left-out) sequence. The most likely value was chosen and compared to the actual display. This process was repeated three times for each subject with different random initializations. The cross-validation is only performed within each subject's data pool, since the models are designed to be person-dependent. Table 6.2 shows the confusion matrices for the three subjects (summed over the three experiments), and the total success rates.

(Rows: actual cartoon display; columns: inferred display.)

Subject A (success: 78%)
        1    2    3    4
   1   23    0    0    1
   2    5    2    9    2
   3    0    4   32    0
   4    1    0    4   37

Subject B (success: 74%)
        1    2    3    4
   1   20    8    3    2
   2    0   12    0    6
   3    0    0   40    5
   4    0    4    3   17

Subject C (success: 62%)
        1    2    3    4
   1    0    6    6    6
   2    9   31    8    0
   3    1    1   16    6
   4    0    0    3   27

Table 6.2: Confusion matrices and success rates from cross-validation experiments inferring cartoon displays from sequences of three subjects.

The majority of the mis-classifications are from display a_2, which subjects reported as being difficult to distinguish from a_1. The models make quite accurate predictions of a_3 (83%) and a_4 (84%).
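The leave-one-out protocol can be sketched as follows; train_c3mg and infer_display are hypothetical stand-ins for the unsupervised training of Section 4.6 and the computation of P(A^a|O) from Section 5.5:

    # A sketch of leave-one-out cross-validation for one subject: train on all
    # but one sequence, infer the cartoon display for the held-out sequence,
    # and accumulate a confusion matrix of actual versus inferred displays.
    import numpy as np

    def leave_one_out(sequences, labels, n_clusters=4):
        confusion = np.zeros((4, 4), dtype=int)
        for k in range(len(sequences)):
            train = [s for i, s in enumerate(sequences) if i != k]
            model = train_c3mg(train, n_clusters)      # unsupervised: labels unused
            post = infer_display(model, sequences[k])  # distribution over A^a
            confusion[labels[k], np.argmax(post)] += 1
        return confusion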
However, it is important to note that these results are obtained by always looking at the most likely cartoon display, which ignores the model's explicit representation of uncertainty. That is, we estimate P(A^a|O), which is a distribution over A^a, and only the peak of the distribution is used in Table 6.2. However, there are some cases where there is a second display which is nearly as likely as the best one. To demonstrate this, Table 6.3 shows the confusion matrices obtained if we classify a sequence as correct if the true display falls in the top two most likely displays, but only if the probability of the most likely display is less than 0.5. We see that many of the mis-classified sequences were assigned maximum likelihood with much uncertainty.

(Rows: actual cartoon display; columns: inferred display.)

Subject A (success: 95%)
        1    2    3    4
   1   24    0    0    0
   2    2   14    1    1
   3    0    0   36    0
   4    1    0    1   40

Subject B (success: 93%)
        1    2    3    4
   1   26    2    3    2
   2    0   18    0    0
   3    0    0   45    0
   4    0    1    0   23

Subject C (success: 84%)
        1    2    3    4
   1    9    0    3    6
   2    0   41    7    0
   3    1    1   21    1
   4    0    0    0   30

Table 6.3: Confusion matrices and success rates from cross-validation experiments inferring cartoon displays from sequences of three subjects, using the top two most likely cartoon displays when the most likely is uncertain.

These results can be compared to the results obtained in a supervised experiment, where each sequence is explicitly labeled, so D is observed. These results are shown in Table 6.4.

(Rows: actual cartoon display; columns: inferred display.)

Subject A (success: 82%)
        1    2    3    4
   1    6    2    0    0
   2    2    1    3    0
   3    0    0   12    0
   4    0    0    0   14

Subject B (success: 97%)
        1    2    3    4
   1   10    0    1    0
   2    0    6    0    0
   3    0    0   15    0
   4    0    0    0    8

Subject C (success: 82%)
        1    2    3    4
   1    0    4    1    1
   2    0   15    1    0
   3    0    0    8    0
   4    0    0    0   10

Table 6.4: Confusion matrices and success rates from supervised experiments inferring cartoon displays from sequences of three subjects.

When compared to the unsupervised maximum likelihood experiments (Table 6.2), we see that the supervised models perform substantially better for two subjects (B and C), but only slightly better for subject A. This is as expected (better performance in the supervised experiments), but the unsupervised models perform well, in particular when the probability distribution is taken into account (Table 6.3), in which case the unsupervised models outperform the supervised ones.
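For concreteness, the relaxed scoring rule behind Table 6.3 can be written in a few lines (a sketch, assuming a posterior vector over the four cartoon displays):

    # A sketch of the top-two rule: a sequence counts as correct if the true
    # display is the most likely one, or among the top two when the most
    # likely display has probability below 0.5.
    import numpy as np

    def top_two_correct(posterior, true_label):
        order = np.argsort(posterior)[::-1]   # displays sorted by probability
        if posterior[order[0]] >= 0.5:
            return order[0] == true_label
        return true_label in order[:2]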
6.2 Robot Control Gestures
We recorded a set of examples of four hand gestures, designed for simple robotic direction control: forwards, stop, go left and go right. A dozen examples of each gesture were performed by a single subject in front of a stationary camera during a training session. Video was grabbed from an IEEE 1394 (Firewire) camera at 150 × 150 with a narrow field of view. The region of interest was taken to be the entire image, and so no tracking was required. Clearly, this would only be possible with a static camera. Sequences were taken with a fixed length of 90 frames. A robotic agent (not embodied at this stage) chose actions in response to each gesture according to a random policy, and was rewarded by the operator's good or bad action, Aact, for choosing the correct action. The gestures were performed in sequence, and the subject practiced performing the gestures so that they would be as similar as possible. It is important to re-state that these experiments are not meant to demonstrate gesture recognition. It is clear that, with this simple tracking and registration method (taking the whole image), this system would not deal with the high variability in gesture orientation or speed. These experiments are meant as a simple demonstration of the value-directed structure learning techniques: they show how our system can correctly discover the number of meaningful gestures in a simple interaction. They are also meant to demonstrate that our system can be easily applied to more than just facial displays: our representation can deal with the complexity of a gestural motion.
We trained the POMDP with N_d = 6 states. The value function and policy are shown in Figure 6.26 as decision diagrams. The policies for two of the display states are equivalent and their values are identical, and so the value-directed structure learning algorithm merges them first, by simply deleting one of the two states. The POMDP is re-trained, resulting in a five-state value function (not shown), in which two more states are found to agree and are merged. Again the POMDP is re-trained, this time giving a value function and policy in which no displays are found to be redundant, shown in Figure 6.27.
Figure 6.26: Original six-state value function (top) and policy (bottom), shown as decision diagrams. States are the labels on each path from the root to a leaf, which contains the value or optimal action for that state.
Figure 6.27: Final four-state merged value function (top) and policy (bottom).
Figures 6.28 and 6.29 show the final 4-state model's explanation of two parts of a stop gesture sequence. We see the high-level distribution over D is peaked at D = 3. Distributions over the dynamics and configuration chains show which state is most likely at each frame. The expected pose, H, and flow field, V, are shown conditioning the image, I, and the temporal derivative, f_t, respectively. Stop gestures consist of an expansion phase (Figure 6.28) followed by a retraction phase (Figure 6.29).
Figure 6.28: Final 4-state gesture model's explanation of a stop gesture as d_3 (expansion phase; see Figure 6.29).
Figure 6.29: Final 4-state gesture model's explanation of a stop gesture as d_3. Continuation of Figure 6.28.
Figures 6.30 and 6.31 show the model's interpretation of a part of a forwards sequence in the new 4-state POMDP. Forwards gestures consist of one or more iterations of the motions shown: the hand moves forwards and to the right as it opens (Figure 6.30), and then backwards and to the left as it closes (Figure 6.31).
Figure 6.30: Final 4-state gesture model's explanation of a forwards gesture. Continued in Figure 6.31.
Figure 6.31: Final 4-state gesture model's explanation of a forwards gesture. Continuation of Figure 6.30.
To evaluate how well the model chooses actions, we performed a cross-validation experiment in which the POMDP was trained on all but one sequence of each gesture. The model was then used to choose actions based upon the four sequences left out. If the action is correct, one reward is given. This process is repeated for 12 different sets of four test sequences, and the total rewards gathered give an indication of how well the model performs on unseen data. The model chose the correct action 47 out of a total of 12 × 4 = 48 times, for a success rate of 47/48, or 98%. The one failure was due to a mis-classification of a "left" gesture as a "right" gesture, caused by a large rightwards motion of the hand at the beginning of the stroke. The final POMDP models learned that there were N_d = 4 states in all 12 cases.
6.3 Card Matching Game

The card matching game, as described in Section 5.7, was played by two players, "Bob" and "Ann", separated by a partition. The players were both students in our laboratory. The players could not make visual or audio contact with each other, except the visual contact available through the game interface. Each player viewed their partner through a direct s-video link from a Sony EVI s-video camera mounted above their partner's screen. The average frame rate at 320 x 240 resolution was over 28 fps. This data was also recorded along with a game log describing the actions of each player in the game (card values and suits, bids and committed cards). There were no restrictions placed on the type of communication allowed between the players through the video link: the players were not given any instructions about the types of displays they could or could not use. The rules of the game were explained to the subjects, and they played four games of five rounds each.

In the following, we will focus on the model learned from Bob's perspective during his bidding rounds (so the displays will be those of Ann when Bob was bidding). Bob bid first in each game, and so there were three bidding turns for him per game. We will use data from the first three games to train the POMDP model described in Section 5.7.1. The learned model's performance will then be tested on the data from the last game.

We do not analyse Bob's displays here because he was, in fact, the author of the thesis, and his results may be biased. In general, this experiment should be performed with arbitrary subjects. However, we do not intend this experiment as a general human-computer interaction study, but rather as a proof of concept: our POMDP model can be applied to this setting, and can learn a valuable policy of action. The generalisability of the system to arbitrary subjects is a subject of future work.

The training data set is certainly large enough to learn models of facial displays, but is quite small when it comes to learning an optimal policy. To see why, notice that, to learn an accurate POMDP, we need examples of every possible valuable state-action pair. Even if many of the 10,000 states and 6 actions in the card matching game will never be visited, the amount of training data required is still quite large. Nevertheless, under certain symmetry assumptions, it may be possible to make near-optimal decisions based only on a small training set. Equivalently, in an on-line reinforcement learning method, it may be possible to start exploiting the learned model with little exploration. For example, if we assume that the players' behaviour is symmetrical under permutation of the card suits, then we can average the conditional probability distributions over all such permutations. This allows us to fill in more of the model without having to explicitly explore those situations. For example, suppose that a player notices that a certain facial display in response to a bid of hearts is usually followed by her partner committing hearts. If she assumes that her partner has no special preference between the heart and the diamond suits, then she may extrapolate her experiences with hearts to diamonds, and assume that the same facial display in response to a bid of diamonds will be followed by her partner committing diamonds. However, these types of arguments are dependent on the particular game being played, and generalisations require more complex models than the POMDPs considered here. We will, however, show in Section 6.3.5 how they can be used for the card game, and how they make for much better policies when training with little data.

As discussed in Section 5.7, we can construct MDPs for the two turns independently, and here we focus only on Bob's bidding turns. Table 6.5 shows the game log from Bob's bidding rounds in the three games in the training set, as well as the game used for testing (see Section 6.3.4).

Table 6.5: Log for the card matching games. The card values are shown for each suit and each player as the actual value on the card subscripted with the value in the model. Thus, i_j means the players saw a card with value i, but it is characterized as v(j) in the POMDP. A blank means the card values were the same as the previous sequence. Ann's display, Acom, is the most likely as classified by the final model. The first three games are used for training the model, the last for testing (see Section 6.3.4).

The first and third training games involved an initial bid that was refused, followed by a second bid that was accepted. All other bids were accepted on the first bid. Notice that, in some cases (the first, third and last rounds in Table 6.5), Bob's bid pointed to Ann's second highest card, but was nevertheless immediately accepted by Ann. In all three cases, however, this strategy paid off because they got their reward faster, and it was equal to or better than the optimal reward. For example, in the first round, Bob bids his high club (7), and Ann accepted and committed her second highest card (a 7 as well), giving a total payoff of 12. Why did Ann not try to get Bob to agree to a play of diamonds? If she had disagreed with his bid of clubs, Bob would surely have bid his second highest card, which may have been a diamond or a heart with equal probability. The additional payoff to the team is (3 - e) if Bob's second bid was diamonds, and (-5 - e) if Bob's second bid was hearts, where e is the expected difference between Bob's highest and second highest card. If Ann and Bob were not communicating over a limited channel, then they could surely come to agreement (after some deliberation) about the optimal solution. However, the communication difficulties imposed made the risk of failure high enough that Ann simply accepted the first bid. This approximate solution works well in many cases.
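To make the trade-off explicit (filling in the expected-value arithmetic left implicit above), with the two remaining suits equally likely, the expected additional payoff to Ann's team from disagreeing is

    E[additional payoff] = (1/2)(3 - e) + (1/2)(-5 - e) = -1 - e,

which is negative even before accounting for the extra risk of miscommunication, so immediately accepting the first bid is the better gamble on average.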
6.3.1 Learning a POMDP

We learned the parameters of the POMDP for the bidding rounds from Bob's perspective, using the methods described in Section 4.7. The model was trained with four display states, which is as large as we think it possible to learn reliable models given the training set size. Two of the learned display states described sequences with little motion ("null" states). The other two corresponded roughly to "nodding" and "shaking" of the head.

A slightly modified version of the structure learning algorithm described in Section 5.3 was applied. In this case, we do not compare value functions and policies for every state of the game, as they will often not agree in many places. Instead, we look at where the majority of states agree on a merge. Two states were merged, resulting in a three-state model. The two merged models, d1 and d2, both described "null" sequences, with little facial motion. This redundancy shows up in the value function, V(s), a small portion of which is shown in Figure 6.32. The highest value is when the partner nods (d3), because we expect to get a reward of 3 within one time step. The expected values for the null displays are almost equal. This indicates that making the distinction between d1 and d2 is not useful for determining value. This same redundancy shows up in the policy for the four-state model, as shown in Figure 6.33.

Figure 6.32: Small portion of the two stage-to-go value function of the card matching game with four display states.

Figure 6.33: Small portion of the two stage-to-go policy for the card matching game with four display states.

The model was re-trained with only three states after merging states d1 and d2. The result was, as expected, one "null" state (d1), one "nodding" state (d2) and one "shaking" state (d3). The remainder of this section discusses this reduced three-state model.
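The merge test itself is straightforward once the value function and policy have been computed for each candidate display state. A minimal sketch, assuming hypothetical array shapes and a majority threshold that are not values from the thesis:

    import numpy as np

    def propose_merge(values, policies, tol=1e-3):
        # values[d, s]:   value of game state s when the display state is d
        # policies[d, s]: index of the optimal action for that state
        n_displays = values.shape[0]
        for i in range(n_displays):
            for j in range(i + 1, n_displays):
                agree = ((policies[i] == policies[j]) &
                         (np.abs(values[i] - values[j]) < tol))
                if agree.mean() > 0.5:   # majority of game states agree
                    return i, j          # merge display states i and j
        return None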
6.3.2 Structure of Learned Model

Recall from Section 5.2 that there are two conditional probability distributions to be learned from the training data, both of which are model parameters in the C4MG model we described in Section 3.6.3. Due to the lack of some instances of game states in the training data, the learned conditional probabilities contained some distributions which were only based on the Dirichlet smoothing prior. For example, there is no instance in the data when the player bid hearts after bidding clubs (and receiving a head shake of disagreement from her partner), so the action bid hearts was never taken in a state where the bid was clubs, for any value of Acom.

Figure 6.34 shows the learned conditional probability distribution over Ann's action, Aact, given the current bid and Ann's display, Acom, as a decision tree. The leaves show the distribution over the possible values of Aact, from top to bottom: null, commit hearts, commit diamonds, commit clubs. We see that, if the bid is null, we expect Ann to do nothing in response. If the bid is some suit, s, and Ann's display (Acom) is the "nodding" display d2, then there is a good chance that Ann will commit her card of suit s.
On the other hand, if Ann's display is the "shaking" display, d3, or the "null" display, d1, then we expect her to do nothing (and wait for another bid from Bob).

The conditional probability distributions of Ann's display, Acom, at time t, given the previous and current bids, bid(t-1) and bid(t), respectively, are different for each of Bob's actions. This is because Ann observes Bob's bid the moment he makes it. One example, for Bob's bid of clubs, is shown in Figure 6.35.

Figure 6.34: Learned conditional probability distribution over Ann's action, Aact, given the current bid and Ann's display, Acom. The leaves of the decision tree show the distribution over the possible values of Aact from top to bottom: null, commit hearts, commit diamonds, commit clubs.

Figure 6.35: Learned conditional probability distribution over Ann's display, Acom, at time t, given the previous and current bids, bid(t-1) and bid(t), respectively, for Bob's bid of clubs. The leaves of the decision tree show the distribution over the possible values of Acom from top to bottom: d1, d2, d3.

These distributions carry two important pieces of information for Bob:

1. At the beginning of a round, any bid is likely to elicit a non-null display, d2 or d3. As shown in Figure 6.35, the expected distribution over Acom = d1, d2, d3 after the bid action, if the current bid is null (at the beginning of a round) and the previous display was d1 (a null display), is 0.01, 0.49, 0.49. Thus a d1 (null) display is not very likely, while d2 and d3 (nod and shake) are equally likely.

2. A "nodding" display is more likely after a "shaking" display if the bid is changed. As shown in Figure 6.35, the expected distribution over Acom = d1, d2, d3 after a bid of clubs, if the current bid is diamonds and the previous Acom was d3, is 0.004, 0.993, 0.004: if Bob bids diamonds and sees a d3 display (a shake), then a bid of clubs will most likely elicit a d2 display (a nod).

6.3.3 Policy of Action

The full policy for the game is quite large, and we instead show the sub-policy for one particular set of values for the cards. In particular, Figure 6.36 shows the policy of action if the player's cards have values B(hearts)v = v3, B(diamonds)v = v3 and B(clubs)v = v1. To consult the policy, simply follow the path from the root to a leaf corresponding to the state and read the recommended action at the leaf (a minimal sketch of this lookup is given after the list below). The lightly colored leaves are actions which are "correct" in that they make sense given our interpretation of the display states (d1 is null, d2 is a nod, and d3 is a shake). The darkly shaded nodes are actions that are "incorrect", usually due to a lack of information about these states in the training data.

Figure 6.36: Policy of action in the card matching game for the situation in which B(hearts)v = v3, B(diamonds)v = v3 and B(clubs)v = v1.

Figure 6.37: Learned conditional probability distribution over Ann's display, Acom, at time t, given the previous and current bids, bid(t-1) and bid(t), respectively, for Bob's bid of clubs. The leaves of the decision tree show the distribution over the possible values of Acom from top to bottom: d1, d2, d3.

The policy recommends the following actions:

• If there is no bid on the table, then bid the club. This is a sub-optimal decision due to lack of training data, and resulting asymmetries between suits in the conditional probability distributions. Figure 6.37 shows the probability distribution over Ann's display, Acom, for Bob's bid of clubs. As the figure shows, when the bid is null, the expected Acom is d2 with high probability, and so the optimal bid is always clubs regardless of the values of the cards. This probability distribution arises because in the training data, every bid of clubs is followed by a display d2 (is accepted).

• If the bid is diamonds, and the partner nodded (d2), then commit the diamond. Otherwise, if the partner did nothing (d1), then bid the diamond (again). Finally, if the partner shook their head (d3), then bid the club. Again, this last action is sub-optimal.

• If the bid is hearts and the partner nodded (d2), then commit the heart; otherwise, bid the diamond (the other high card).

• If the bid is clubs and the partner nodded (d2), then commit the club; otherwise, bid the diamond.
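Reading an action off such a decision-diagram policy amounts to a path lookup. A minimal sketch, in which the tree encoding and the fragment of Figure 6.36 shown are hypothetical and for illustration only:

    # A policy tree is either an action (a leaf, here a string) or an
    # internal node: a dict naming the state variable to test and its branches.
    def consult_policy(tree, state):
        # follow the path from the root to a leaf and return its action
        while isinstance(tree, dict):
            tree = tree["branches"][state[tree["var"]]]
        return tree

    # hypothetical fragment of the sub-policy of Figure 6.36:
    policy = {"var": "bid",
              "branches": {"null": "bid_club",
                           "diamond": {"var": "Acom",
                                       "branches": {"d1": "bid_diamond",
                                                    "d2": "cmt_diamond",
                                                    "d3": "bid_club"}}}}

    consult_policy(policy, {"bid": "diamond", "Acom": "d2"})  # -> "cmt_diamond"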
6.3.4 Using the Policy

Tables 6.6 and 6.7 show Ann's displays, the bids, Bob's cards and Bob's actions during the training and test games, respectively (see Table 6.5 for other variables). The second-to-last column shows the predictions of Bob's actions made by the policy, π(s), discussed in the last section. This policy correctly predicted 14 out of 20 actions in the training games, and 5 out of 7 actions in the test game. The incorrect actions are shaded in Tables 6.6 and 6.7.

Table 6.6: Log for the training games showing the predicted actions from the policy, π(s), and from the symmetrised policy, π'(s). Shaded entries are incorrect policy predictions.

Table 6.7: Log for the testing game showing the predicted actions from the policy, π(s), and from the symmetrised policy, π'(s). Shaded entries are incorrect policy predictions.

However, as we have pointed out, many of the problems with this policy can be attenuated by applying symmetry arguments to construct a symmetrised policy, π'(s). This is discussed in the next section.

6.3.5 Symmetry Considerations

As we have pointed out, symmetry arguments can be applied to the learned POMDP in order to attenuate the effects of the lack of training data. In particular, we may assume that players do not have any particular preference over card suits, such that the conditional probability tables should be symmetric under permutation of suits. Therefore, we can "symmetrise" the probability distributions by simply averaging over the six card suit permutations. Figures 6.38 and 6.39 show the symmetrised conditional probability distributions over Aact and Acom. These can be compared to the unsymmetrised versions in Figures 6.34 and 6.35, respectively.
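The averaging step is mechanical. A minimal sketch, in which the relabelling helpers permute_state and permute_action are stand-ins for the game-specific index bookkeeping, not functions from the model:

    from itertools import permutations
    import numpy as np

    SUITS = ("hearts", "diamonds", "clubs")

    def symmetrise(cpt, permute_state, permute_action):
        # cpt[s, a]: any suit-dependent conditional probability table
        # permute_*(index, perm): the index with its suits relabelled by perm
        out = np.zeros_like(cpt)
        perms = list(permutations(SUITS))
        for perm in perms:
            for s in range(cpt.shape[0]):
                for a in range(cpt.shape[1]):
                    out[s, a] += cpt[permute_state(s, perm),
                                     permute_action(a, perm)]
        return out / len(perms)   # average over the six suit permutations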
Figure 6.40 shows a portion of the symmetrised policy computed for the symmetrised POMDP. The predictions of this policy are shown in the last columns of Tables 6.6 and 6.7 for training and test data, respectively. The symmetrised policy correctly predicts all but one of Bob's actions in the training games, for an error rate of only 5%. The mis-classification was due to the subject looking at something to one side of the screen, yielding significant horizontal head motion and leading to a classification as d3.

Figure 6.38: Learned conditional probability distribution over Ann's action, Aact, given the current bid and Ann's display, Acom. The leaves of the decision tree show the distribution over the possible values of Aact from top to bottom: null, commit hearts, commit diamonds, commit clubs.

Figure 6.39: Learned conditional probability distribution over Ann's display, Acom, at time t, given the previous and current bids, bid(t-1) and bid(t), respectively, for Bob's bid of clubs. The leaves of the decision tree show the distribution over the possible values of Acom from top to bottom: d1, d2, d3.

Figure 6.40: Policy of action in the card matching game for the situation in which B(hearts)v = v3, B(diamonds)v = v3 and B(clubs)v = v1.

The symmetrised policy also correctly predicts all but one of Bob's actions in the test game. It does correctly predict the re-bid in the third round, where the first bid (of hearts) was refused by the partner's display. Figure 6.40 shows the policy of action in this situation. At the beginning, when there is no bid on the table (so bid is null), the policy recommends that Bob bid his heart (as he did). The C4MG then recognized Ann's subsequent display as state d3 (shake), and the policy recommends changing the bid to diamonds. The C4MG classifies the following display as d3, which is incorrect (Ann actually nodded her head), so the policy recommends bidding the heart again. The last sequence is longer than usual (over 300 frames), and includes some horizontal head motion in the beginning which appears as shaking to the model. This mis-classification may expose a weakness of the temporal segmentation method we use, which is based entirely on the observable actions and game states. Although this sequence is long, it is only the (fairly vigorous) head nod at the very end which is the important display. Perhaps a temporal segmentation which focussed more on later motions would be more attuned to this kind of sequence.

6.3.6 Output Models

Model 2: "nodding"

Feature weights for the dynamics and configuration chains are shown in Figure 6.41. There is one significant feature in the dynamics chain, VA00, which describes vertical translational movement. The significant features in the configuration process are four of the low-order Zernike coefficients.

Figure 6.41: Feature weights for the model 2 dynamics and configuration chains.

The output distributions of the seven states (X) in the dynamics chain are shown in Figure 6.42, plotted along the two most significant feature dimensions. Four states (X = 1, 2, 3, 7) correspond to no motion (the face is stationary), while two correspond to translation upwards (X = 4, 5), and the last to translation downwards (X = 6).

Figure 6.42: Dynamics chain model 2 output states plotted along the two most significant dimensions according to the feature weights. Reconstructed flow fields for the X state means are also shown.

The output distributions of the five configuration states are shown in Figure 6.43. States 1-6 and 8 describe orientations of the face (tilted upwards, facing forwards, etc.), while state 7 describes an unusual pose that was caused by the subject adjusting her hair, so that her hand entered into the top left of the facial region.

Figure 6.43: Configuration chain model 2 output distributions plotted along the two most significant dimensions according to the feature weights. Reconstructed grayscale images are shown for the C state means.
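Why the A00 coefficients single out translation is a standard property of the Zernike basis, noted here for reference. The lowest-order basis function is constant over the unit disk, Z00(ρ, θ) = R00(ρ) = 1, so the A00 coefficient of each flow component is, up to normalisation, simply the mean of that component over the facial region:

    UA00 ∝ (1/π) ∫∫ u(ρ, θ) ρ dρ dθ,    VA00 ∝ (1/π) ∫∫ v(ρ, θ) ρ dρ dθ.

A large VA00 therefore signals a whole-region vertical translation (as in a nod), and a large UA00 a horizontal one (as in a shake).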
We now show how the C4MG model analyses a particular sequence from the card matching game for which the most likely high level state is D = 3. We show the expected feature vector for each frame in the dynamics and configuration chains, Zx and Zw, respectively, and the expected output flow fields and poses. Figure 6.44 shows the trajectory of the reconstructed dynamics Zernike vector (Zx) along the two most important feature vector dimensions for a sequence classified as model D = 3. The sequence alternates between motion upwards (states 4 and 5) and motion downwards (state 6). Figure 6.45 shows the trajectory of the reconstructed configuration Zernike vector (Zc) along the two most important feature vector dimensions for the same sequence.

Figures 6.46 and 6.47 show the model's explanation of the same sequence, for the frames indicated in Figures 6.44 and 6.45. We see the high level distribution over D is peaked at D = 3. Distributions over the dynamics and configuration chains show which state is most likely at each frame. The expected pose, H, and flow field, V, are shown conditioning the image, I, and the temporal derivative, f_t, respectively.

Figure 6.44: Trajectory of the most likely dynamics feature vector (Zx) for sequence 6 (for which model 2 has highest posterior). The level curves of the Gaussian output covariances for model 2 are also shown.

Figure 6.45: Trajectory of the most likely configuration feature vector (Zc) for sequence 6 (for which model 2 has highest posterior). The level curves of the Gaussian output covariances for model 2 are also shown.

Figure 6.46: The C4MG model's explanation of part of a sequence during the card matching game (frames 1006-1010). The subject's face tilts downwards during a nodding motion.

Figure 6.47: The C4MG model's explanation of part of a sequence during the card matching game (frames 1011-1015). The subject's face tilts upwards during a nodding motion.

Model 3: "shaking"

Feature weights for the dynamics and configuration chains are shown in Figure 6.48. There is one significant feature in the dynamics chain, UA00, which describes horizontal translational movement. The significant features in the configuration process are again four of the low-order Zernike coefficients.

The output distributions of the six states (X) in the dynamics chain are shown in Figure 6.49, plotted along the two most significant feature dimensions. Two states (X = 2, 3) correspond to no motion (the face is stationary), while two (X = 1, 5) correspond to motion to the left and right, respectively. The remaining two account for small motion up and down.

Figure 6.49: Dynamics chain model 3 output states plotted along the two most significant dimensions according to the feature weights. Reconstructed flow fields for the X state means are also shown.

The output distributions of the five configuration states are shown in Figure 6.50. Different angles of the face are represented: state 1 is angled to the left, states 3 and 4 are angled to the right, and states 2 and 8 are facing forward.

We now show how the C4MG model analyses a particular sequence from the card matching game for which the most likely state is D = 4. We show the expected feature vector for each frame in the dynamics and configuration chains, Zx and Zw, respectively, and the expected output flow fields and poses (these expectations are computed as sketched below).
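The per-frame expectations are straightforward mixtures of the state output means. A minimal sketch, with illustrative array names:

    import numpy as np

    def expected_feature_vectors(state_posteriors, state_means):
        # state_posteriors: (T, K) array of P(X_t = k | observations) from the
        #                   forwards-backwards pass over the chain
        # state_means:      (K, F) array of Gaussian output means, one per state
        # returns a (T, F) array: Z_t = sum_k P(X_t = k | O) mu_k, the
        # trajectories plotted in Figures 6.44 and 6.51
        return state_posteriors @ state_means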
Figure 6.51 shows the trajectory of the reconstructed dynamics Zernike vector (Zx) along the two most important feature vector dimensions for a sequence classified as model D = 4. The sequence alternates between motion to the right (state 5) and to the left (state 1). Figure 6.52 shows the trajectory of the reconstructed configuration Zernike vector (Zc) along the two most important feature vector dimensions for the same sequence. Figures 6.53 and 6.54 show the model's explanation of the same sequence, for the frames indicated in Figures 6.51 and 6.52. We see the high level distribution over D is peaked at D = 4. Distributions over the dynamics and configuration chains show which state is most likely at each frame. The expected pose, H, and flow field, V, are shown conditioning the image, I, and the temporal derivative, f_t, respectively.

Figure 6.51: Trajectory of the most likely dynamics feature vector (Zx) for sequence 5 (for which model 3 has highest posterior). The level curves of the Gaussian output covariances for model 3 are also shown.

Figure 6.52: Trajectory of the most likely configuration feature vector (Zc) for sequence 5 (for which model 3 has highest posterior). The level curves of the Gaussian output covariances for model 3 are also shown.

Figure 6.53: The C4MG's explanation of part of a sequence during the card matching game (frames 885-889). The subject's face swings to the right during a shaking motion.

Figure 6.54: The C4MG's explanation of part of a sequence during the card matching game (frames 889-893). The subject's face swings to the left during a shaking motion.

Chapter 7

Conclusions

Computer vision has traditionally focussed on recognition for its own sake. Computer vision modules are designed to perform specific, pre-defined tasks in isolation from the system they are to be used in. This approach tends to drive computer vision research into areas where it may not be entirely necessary to go. If computer vision is being used to achieve some goal, then it is the goal itself that should define what the computer vision task should be, not some pre-defined notion of what types of inputs will be needed to accomplish the goal. Goals usually change over time, due to learning, a changing environment, or changing resource bounds. Computer vision systems must therefore be able to self-modify to reflect these changes. This implies that decoupling the task from the goal is not the optimal approach, and we have argued in this thesis for a model that unifies the low-level computer vision tasks with the high-level goal oriented systems in a coherent whole using Bayesian networks. For example, a security system may have the goal of detecting subversive actions.
The traditional way of approaching this problem is to first have experts in decision-making for security define subversive action, and a set of inputs that would be sufficient for its detection. Computer vision researchers then build systems to detect these inputs. For example, the decision experts may deem that face recognition, and person detection and tracking, are important, and computer vision systems for accomplishing this will be implemented. However, these types of systems are inflexible, in that they cannot easily be changed to reflect new techniques for subversion, new visual inputs, or changing resource bounds.

In contrast, the models we describe in this thesis require a different design approach. The decision experts first define the security task as a set of states that the system can be in, and the conditional dependencies between the states. The computer vision inputs do not need to be explicitly defined, only included in the model as generic 'visual information' variables. The model is then trained in the actual environment it is to be used in, and the learned model 'discovers' the categories of visual inputs which are important to the task. This type of analysis unifies the decision process with the learning of computer vision models.

The model developed in this thesis combines computer vision, probabilistic modeling and decision theory. The three are closely related, and together are a powerful solution concept for vision-based computer tasks.

Computer Vision  The computer vision task is to detect human motions from video streams, and to compress the motions into spatially and temporally abstract representations suitable for integration with higher level processing. This thesis has investigated data-independent sets of basis functions for simultaneously modeling the instantaneous configuration and dynamics of human facial displays and hand gestures.

Probabilistic Models  The uncertainties inherent in the sensing task call for probabilistic modeling techniques. Dynamic Bayesian networks (DBNs) in particular are a structured representation of uncertain beliefs for sequential data. I have shown how to use DBNs to model human behaviors in video sequences, allowing temporally abstract representations of sequences of motion to be built automatically.

Decision Theory  Rational decisions require that non-verbal human behaviors must not only be sensed and represented, but must also be used for choosing actions which optimize utility over outcomes. Expected utility maximization is the standard approach for rational decision making, and forms the basis for Markov decision processes (MDPs). MDPs, and their close relatives, POMDPs (partially observable MDPs), have become the semantic model of choice for representing large planning problems. Our work on structured representations of MDPs has allowed them to be applied to larger problems than before. In particular, the research presented in this thesis has focussed on representing human non-verbal interactions with POMDPs.

This thesis has shown how partially observable Markov decision processes, or POMDPs, can be used to combine computer vision, probabilistic modeling and decision theory. The model allows an agent to incorporate actions and utilities into the sensing and representation of visual observations, and provides top-down value-based evidence for the learned probabilistic models: the agent can learn the models most conducive to achieving value in a particular task.
One of the key features of this technique is that it does not require labeled data sets. That is, the model makes no prior assumptions about the form or number of non-verbal behaviors used in an interaction, but rather discovers this from the data during training.

This model is of interest to researchers in both computer vision and decision theory. Computer vision scientists will find a new model for human action in video streams, and a method for learning the model from data. Further, they will see how the model can be attached to a high-level decision process that implicitly defines the computer vision task. This definition is a general one: the systems must be designed so that they can learn from a set of training data. Performance can be explicitly evaluated on a task, giving computer vision researchers solid feedback on their algorithms. Decision theorists, on the other hand, will find an output model that brings a large and important data source into contact with their models. While they have traditionally focussed on solution techniques for "toy" problems, they are always interested in real data. The model we have presented in this thesis connects them to vision data, and opens the door to research on much more difficult problems than have usually been attempted.

7.1 Contributions

The main contribution of this thesis is a novel and unified model of human non-verbal behavior in video streams, integrated with decision making over high level context states and actions. This includes a Bayesian a-posteriori learning procedure for adapting the parameters of the model to training data. The learning does not require labeled data, so the model automatically discovers the categories of behaviors relevant to a particular task. The learned models can be used to predict human behavior, or by an autonomous agent to choose actions during collaboration with a human. We have demonstrated this model in three interactive settings.

1. During play of an imitation game, a human tries to mimic the facial displays of an animated cartoon face. The resulting displays involve some complex facial motions, and we used this game primarily to demonstrate the computer vision modeling techniques. We showed how our model could represent the facial motions during the four different imitative displays, and we demonstrated the function of the different components of the model. Finally, we performed a quantitative analysis on this data, showing prediction of the cartoon displays on unseen data, with rates comparable to the equivalent supervised experiments, in which the data is labeled.

2. Robot control using gestures is the second "game". An operator signals navigation commands to a robot using hand gestures, and rewards the robot for performing the correct action after each gesture. The robot learns the gesture categories, their number, and their relationships to its actions and utility functions. It can then use the learned model to take actions after subsequent gestural commands. We performed a leave-one-out cross-validation experiment and measured how much reward the robot would gain by taking these actions on unseen data. We find error rates of only 2% over 48 test sequences.

3. The card matching game is played by two humans through a computer interface. The players can see, but not hear, one another. They are allowed to communicate through this visual link, but no restrictions are placed on the type of communication.
Each player has a hand of three cards, but can only see their own cards, not their partner's. Each player gets to play a single card, and they both win if the suits of their played cards match. The idea is that the players must somehow agree, using only visual communication, about which card to play. This game involves fairly simple facial displays, but has a more complex decision structure, and we used it to demonstrate how computer vision is integrated with decision theory. We trained the model on a set of three games and tested it on a fourth. The model learned categories of motion that made intuitive sense in the card matching game, and the learned POMDP conditional probability tables showed the kind of structure one would expect in the game. Further, it predicted the actions of the human players correctly in all but one case out of seven on test data.

This work has also introduced a novel probabilistic method for spatially abstracting optical flow and images to a set of basis functions, including a Bayesian feature weighting technique to avoid prior selection of features. We have shown how this probabilistic modeling technique is a part of the general Bayesian solution to the problem of recognising and interpreting video sequences. Further, we have shown how a particular basis set, the Zernike polynomials, are useful descriptors of both optical flow and images, and have argued the advantages of using a pre-determined set of functions, as opposed to a data-dependent one.

7.2 Future Work

This thesis researches the unification of computer vision and decision theory for modeling human non-verbal displays. We briefly examine the future challenges in the computer vision and decision theoretic fields, and then describe some of the open problems when the two are combined.

7.2.1 Modeling human non-verbal behaviors in context

Learning decision-theoretic models of human behavior from unlabeled training data presents a number of different challenges. These center in particular on segmentation and tracking, as well as multimodal analysis.

Spatial Segmentation and Representation  There is a tradeoff in modeling human behavior between spatial segmentation and spatial representation. A holistic approach, such as we have taken in this thesis, spatially segments the entire region of interest (such as the face, or the hands) from the background, and represents the motion and spatial structure in this region using a relatively high dimensional feature vector. However, it is equally possible to break the region in which the behavior is occurring into smaller areas, modeling motion and intensity in each piece using a lower dimensional vector. For example, the facial display models we have presented involve representing the entire facial region using the basis of Zernike polynomials. We have also experimented with segmenting the face into a number of smaller regions over the eyes and mouth. The corresponding Zernike projections can each be lower dimensional, but need to be combined into a single vector. It is not yet clear how this tradeoff can be optimally exploited. Future work will involve investigating the advantages of each, and methods for automatically segmenting the images spatially. The spatial segmentation/representation problem is closely related to the tracking problem.

Tracking  Throughout this thesis, we have assumed that a spatial region of interest has been tracked through each video sequence.
We have described a method for tracking based on optical flow and exemplar matching (Section 3.7), which effectively decouples tracking from recognition. In fact, this is a crude approximation: tracking, spatial segmentation, recognition and decision-making are highly interrelated. To see why, recall that it is advantageous to describe decision making processes in terms of a small number of high-level, discrete states. The mapping between the high-dimensional, continuous video signal and this small high-level state space has been one of the central topics of this thesis. This abstraction of the visual space must proceed such that the high-level state space is sufficient for decision-making. There may be many events in the video sequence, and it is the job of the tracking and spatial segmentation processes to separate these events from the background. The separation, however, depends on the task being performed. We have briefly described a method for integrating the tracking and recognition process in Section 3.7.2.
However, there are cases in which the subjects were moving their heads or faces for reasons other than 186 communication. This could be detected using models which are common for dialogue management, distinguishing different levels of communication [Cla96, PHOO]. These models essentially build a hierarchy of attentional levels, from the channel level for the perception and generation of behaviors, to the conversation level, at which propositional content is modeled. A conversation must include a connection between partners at all levels. Detection and tracking of eyes could also be very useful in this regard [Ver03]. 7.2.2 Optimal or approximate high-dimensional POMDP solutions The models we have described are P O M D P s with extremely high-dimensional out-put spaces. Optimal and approximate solutions to these models lie at the frontier of P O M D P research. One of the main research directions is in methods to com-press belief spaces, allowing for more tractable approximate solutions while not compromising decision quality [PB03]. Since belief spaces are abstractions of the output spaces, the belief state compressions can give indications of how to con-struct representations of the video sequences which are most conducive to achieving value within the P O M D P ' s task. Thus, solutions to the P O M D P s are intimately connected with the learning of the models, including the representations of vision-based observations: a rational learning agent must learn the P O M D P s and solve them concurrently. Future work will involve investigating belief state compression techniques and their relationship to video outputs. The goal will be learning and solution techniques which construct sparse representations of visual data that lead to optimal decisions for obtaining value in the long term. 7.2.3 Unification of decision theory and computer vision for em-bodied human-interactive agents Artificial agents that provide an effortless interface between humans and machines will be useful in two major areas. First, intelligent agents are already being used to help humans in difficult situations, and their roles will increase in the future with the advent of more visually driven systems. For example, agents can increase automobile safety by monitoring and interacting with drivers, or can help people with mental disabilities perform everyday tasks with dignity. Second, the study of interactive agents provides a wealth of information about human psychology. Watching humans interact with simple machines gives insight into the workings of the human brain. Intelligent agents should be able to sense, interpret, act in conjunction with, and learn from a complex and changing environment containing other intelligent agents, particularly humans. The long term goals of this research are to under-187 stand how agents can learn and use relationships between visual observations of other agents, their actions, other contextual information and utility. Learning these relationships will allow agents to adapt to new environments, new tasks, and new collaborators, without manual intervention. As this thesis has argued, computer vision alone is not sufficient, but must be combined with decision theory. The use of probabilistic modeling techniques consolidates the two, and allows the whole to be described on sound theoretical ground. In turn, the decision theoretic modeling is useful for the vision modeling, as it allows value-directed structures to be learned. 
The experimental paradigm we have presented can be used in future work to investigate more complex interactions involving non-verbal behaviors. Games involving gestures, facial displays and speech will be important for these investigations.

Bibliography

[ABMM03] Alexei A. Efros, Alexander C. Berg, Greg Mori, and Jitendra Malik. Recognizing action at a distance. In Proceedings of the International Conference on Computer Vision, pages 726-733, Nice, France, 2003.

[Ari] Aristotle. Rhetoric. Written circa 350 BC.

[ASKP03] Jonathan Alon, Stan Sclaroff, George Kollios, and Vladimir Pavlovic. Discovering clusters in motion time-series data. In Proceedings of the Intl. Conference on Computer Vision and Pattern Recognition 2003, pages 682-688, Madison, Wisconsin, June 2003.

[Aus62] John L. Austin. How to do things with words. Oxford University Press, 1962.

[BA96] Michael Black and P. Anandan. The robust estimation of multiple motions: parametric and piecewise smooth flow fields. Computer Vision and Image Understanding, 63(1):75-104, January 1996.

[Bar01] Marian Stewart Bartlett. Face Image Analysis by Unsupervised Learning. Kluwer Academic Publishers, Norwell, MA, 2001.

[BB95] Steven S. Beauchemin and John L. Barron. The computation of optical flow. ACM Computing Surveys, 27(3):433-467, 1995.

[BCdR03] Peter Brusilovsky, Albert Corbett, and Fiorella de Rosis, editors. Proceedings of the 9th International Conference on User Modeling, volume 2702 of LNCS. Springer-Verlag, January 2003.

[BCS97] Christoph Bregler, Michele Covell, and Malcolm Slaney. Video rewrite: Driving visual speech with audio. Computer Graphics, 31(Annual Conference Series):353-360, 1997.

[BD01] Aaron F. Bobick and James W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257-267, March 2001.
Obermayer, editors, Advances in Neural Information Processing Systems, volume 15, pages 382-386. M I T Press, 2003. [BobOl] Miroslaw Bober. M P E G - 7 visual shape descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):716-719, June 2001. 190 [BOP97] Matthew Brand, Nuria Oliver, and Alex Pentland. Coupled hidden Markov models for complex action recognition. In International Con-ference on Computer Vision and Pattern Recognition, pages 994-999, Puerto Rico, 1997. [Bra97] Matthew Brand. Learning concise models of human activity from am-bient video via a structure-inducin m-step estimator. Technical Report TR-97-25, Mitsubishi Electric Research Laboratory, November 1997. [Bra99a] Matthew Brand. Structure learning in conditional probability models via an entropic prior and parameter extinction. Neural Computation, 11:1155-1182, 1999. [Bra99b] Matthew Brand. Voice puppetry. In Proc. ACM SIGGRAPH, pages 21-28, August 1999. [Bre97] Chris Bregler. Learning and recognising human dynamics in video se-quences. In IEEE Conference on Computer Vision and Pattern Recog-nition, pages 568-574, 1997. [BS95] Anthony J . Bell and Terrence J . Sejnowski. A n information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995. [BS99] Cynthia Breazeal and Brian Scassellati. A context-dependent atten-tion system for a social robot. In Proceedints of the Sixteenth Inter-national Joint Conference on Artificial Intelligence, pages 1146-1151, Stockholm, Sweden, 1999. [BSA91] Saeid O. Belkasim, Malayappan Shridhar, and Majid Ahmadi. Pattern recognition with moment invariants: A comparative study and new results. Pattern Recognition, 24(12):1117-1138, 1991. [BY97] Michael Black and Yaser Yacoob. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image mo-tions. International Journal of Computer Vision, 25(l):23-48, 1997. [CasOO] Justine Cassell. More than just another pretty face: Embodied conver-sational agents. Communications of the ACM, 43(4):70-78, 2000. [CBC+00] Justine Cassell, Timothy Bickmore, Lee Campbell, Hannes Vilhjamsson, and Hao Yan. Human conversation as a system framework: Designing embodied conversational agents. In Justine 191 Cassell, Joseph Sullivan, Scott Prevost, and Elizabeth Churchill, editors, Embodied Conversational Agents, chapter 2, pages 29-63. M I T Press, 2000. [CD00] Ross Cutler and Larry Davis. Robust real-time periodic motion detec-tion, analysis, and applications. IEEE Transactions on Pattern Anal-ysis and Machine Intelligence, 22(8):348-362, 2000. [CdFGT03] Peter Carbonetto, Nando de Freitas, Paul Gustafson, and Natalie Thompson. Bayesian feature weighting for unsupervised learning with application to object recognition. In C M Bishop and B J Prey, editors, Ninth International Workshop on Artificial Intelligence and Statistics, pages 122-128, Key West, Florida, January 2003. Timothy F. Cootes, Gareth J . Edwards, and Chris J . Taylor. Active appearance models. Lecture Notes in Computer Science, 1407:484-490, 1998. Proc. SIGCHI, 2003. Nicole Chovil. Social determinants of facial displays. Journal of Non-verbal Behavior, 15(3):141-154, Fall 1991. Herbert H . Clark. Using Language. Cambridge University Press, Cam-bridge, U K , 1996. Anthony Cassandra, Michael L . Littman, and Nevin L . Zhang. Incre-mental pruning: A simple, fast exact method for partially observable Markov decision processes. 
In Proceedings of Uncertainty in Artificial Intelligence, pages 54-61, 1997. Brian Clarkson and Alex Pentland. Unsupervised clustering of ambu-latory audio and video. In Proc. ICASSP, 1999. James M . Carroll and James A . Russell. Do facial expressions signal specific emotions? Judging the face in context. Journal of Personality and Social Psychology, 70:205-218, 1996. [CSG+03] Ira Cohen, Nice Sebe, Ashutosh Garg, Lawrence S. Chen, and Thomas S. Huang. Facial expression recognition from video sequences: temporal and static modeling. Computer Vision and Image Under-standing, 91:160-187, 2003. [CET98] [CHI03] [Cho91] [Cla96] [CLZ97] [CP99] [CR96] 192 [CSPCOO] Justine Cassell, Joseph Sullivan, Scott Prevost, and Elizabeth Churchill, editors. Embodied Conversational Agents. M I T Press, 2000. [CT98] Ross Cutler and Matthew Turk. View-based interpretation of real-time optical flow for gesture recognition. In Proc. Intl. Conference on Automatic Face and Gesture Recognition, pages 98-104, Nara, Japan, Apr i l 1998. [CVP03] IEEE Workshop on Computer Vision and Pattern Recognition for Hu-man Computer Interaction, Madison, Wisconsin, June 2003. [Dar72] Charles Darwin. Expressions of the Emotions in Man and Animals. Abermale, London, 1872. [DB97] Richard Dearden and Craig Boutilier. Abstraction and approximate decision theoretic planning. Artificial Intelligence, 89:219-283, 1997. [DBH+99] Gianluca Donato, Marian Stewart Bartlett, Joseph C. Hager, Paul Ek-man, and Terrence J . Sejnowski. Classifying facial actions. IEEE Trans-actions on Pattern Analysis and Machine Intelligence, 21(10):974-989, October 1999. [DEP96] Trevor J . Darrell, Irfan A . Essa, and Alex P. Pentland. Task-specific gesture analysis in real-time using interpolated views. IEEE Transac-tions on Pattern Analysis and Machine Intelligence, 18(12):1236-1242, 1996. [DH01] Vincent E . Devin and David C. Hogg. Reactive memories: A n interac-tive talking head. In Proc. British Machine Vision Conference, pages 234-243, 2001. [dJ32] Andrea de Jorio. Gesture in Naples and gesture in classical antiq-uity : a translation of "La mimica degli antichi investigata nel gestire napoletano", Gestural expression of the ancients in the light of Neapoli-tan gesturing, translated by Adam Kendon. Indiana University Press, Boomington, Indiana, 2000 (Original 1832). [DLR77] A .P . Dempster, N . M . Laird, and D .B . Rubin. Maximum likelihood from incomplete data using the E M algorithm. Journal of the Royal Statistical Society, 39(B):l-38, 1977. [DP96] Trevor Darrell and Alex P. Pentland. Active gesture recognition using partially observable Markov decision processes. In 13th IEEE Intl. Conference on Pattern Recognition, Vienna, Austria, 1996. 193 [EF78] Paul Ekman and Wallace V . Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, C A , 1978. [EFE72] Paul Ekman, Wallace V . Friesen, and P. Ellsworth. Emotion in the Human Face. Pergamon, New York, 1972. [EHL+02] Pantelis Elinas, Jesse Hoey, Darrell Lahey, Jeff Montgomery, Don Mur-ray, Stephen Se, and James J . Little. Waiting with Jose, a vision based mobile robot. In Proc. International Conference on Robotics and Au-tomation, pages 678-682, Washington, D.C. , May 2002. [EHL03] Pantelis Elinas, Jesse Hoey, and James J . Little. Homer: Human ori-ented messenger robot. In Proc. AAAI Spring Symposium on Human Interaction with Autonomous Systems in Complex Environments, Stan-ford, C A , March 2003. [EP97] Irfan A . Essa and Alex P. 
[ER97] Paul Ekman and Erika L. Rosenberg. What the Face Reveals. Oxford Univ. Press, New York, 1997.

[FBYJ00] David J. Fleet, Michael J. Black, Yaser Yacoob, and Allan D. Jepson. Design and use of linear models for image motion analysis. International Journal of Computer Vision, 36(3):171-193, 2000.

[FG02] Proceedings of IEEE International Conference on Face and Gesture, Washington, DC, May 2002.

[FL03] Beat Fasel and Juergen Luettin. Automatic facial expression analysis: A survey. Pattern Recognition, 36(1):259-275, 2003.

[FND03] Terrence Fong, Illah Nourbakhsh, and Kerstin Dautenhahn. A survey of socially interactive robots. Robotics and Autonomous Systems, 42(3-4):143-166, March 2003.

[Fri94] Alan J. Fridlund. Human facial expression: an evolutionary view. Academic Press, San Diego, CA, 1994.

[Fri97] Alan J. Fridlund. The new ethology of human facial expressions. In James A. Russell and Jose Miguel Fernandez-Dols, editors, The Psychology of Facial Expression, chapter 5, pages 103-129. Cambridge University Press, Cambridge, UK, 1997.

[FST98] Shai Fine, Yoram Singer, and Naftali Tishby. The hierarchical Hidden Markov Model: Analysis and applications. Machine Learning, 32(1):41-62, 1998.

[Gav99] Dariu M. Gavrila. The visual analysis of human movement: a survey. Computer Vision and Image Understanding, 73(1):82-98, 1999.

[GJH99] Aphrodite Galata, Neil Johnson, and David Hogg. Learning structured behaviour models using variable length Markov models. In IEEE Workshop on Modelling People, Corfu, Greece, September 1999.

[GJH01] Aphrodite Galata, Neil Johnson, and David Hogg. Learning variable-length Markov models of behavior. Computer Vision and Image Understanding, 81(3):398-413, 2001.

[GL00] Piotr J. Gmytrasiewicz and Christine L. Lisetti. Using decision theory to formalize emotions for multi-agent system applications: preliminary report. In Second ICMAS-2000 Workshop on Game Theoretic and Decision Theoretic Agents, Boston, 2000.

[GLW01] John Gottman, Robert Levenson, and Erica Woodin. Facial expressions during marital conflict. Journal of Family Communication, 1(1):37-57, 2001.

[Gmy02] Piotr J. Gmytrasiewicz. Toward optimal planning in multiagent environments: Basic framework. Technical report, University of Chicago, Chicago, IL, November 2002.

[Hau00] Milos Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33-94, 2000.

[HF00] Eric A. Hansen and Zhengzhu Feng. Dynamic programming for POMDPs using a factored state representation. In Proc. Fifth International Conference on Artificial Intelligence Planning and Scheduling, Breckenridge, CO, April 2000.

[HL00] Jesse Hoey and James J. Little. Representation and recognition of complex human motion. In Proc. IEEE CVPR, pages 752-759, Hilton Head, SC, June 2000.

[HL03] Jesse Hoey and James J. Little. Bayesian clustering of optical flow fields. In Proc. International Conference on Computer Vision, pages 1086-1093, Nice, France, October 2003.

[HL04] Jesse Hoey and James J. Little. Decision theoretic modeling of human facial displays. In Proc. European Conference on Computer Vision, Prague, CZ, 2004.

[Hoe01] Jesse Hoey. Hierarchical unsupervised learning of facial expression categories. In Proc. ICCV Workshop on Detection and Recognition of Events in Video, pages 99-106, Vancouver, Canada, July 2001.
[Hoe01] Jesse Hoey. Hierarchical unsupervised learning of facial expression categories. In Proc. ICCV Workshop on Detection and Recognition of Events in Video, pages 99-106, Vancouver, Canada, July 2001.

[Hoe02] Jesse Hoey. Clustering contextual facial display sequences. In Proceedings of IEEE International Conference on Face and Gesture, Washington, DC, May 2002.

[Hoe04] Jesse Hoey. Value-directed learning of gestures and facial displays. In Proc. Intl. Conference on Computer Vision and Pattern Recognition, Washington, DC, 2004.

[HS81a] H.V. Henderson and S.R. Searle. On deriving the inverse of a sum of matrices. SIAM Review, 23:53-60, 1981.

[HS81b] Berthold K.P. Horn and B.G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-203, 1981.

[HSAHB99] Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of Uncertainty in Artificial Intelligence, pages 279-288, Stockholm, 1999.

[HSJ95] Edward Hunter, Jennifer Schlenzig, and Ramesh Jain. Posture estimation in reduced-model gesture input systems. In International Workshop on Automatic Face- and Gesture-Recognition, pages 290-295, Zurich, Switzerland, 1995.

[ICM03] International Conference on Multimodal Interfaces, Vancouver, BC, November 2003. ACM Press.

[Iza97] Carroll E. Izard. Emotions and facial expressions: A perspective from differential emotions theory. In James A. Russell and Jose Miguel Fernandez-Dols, editors, The Psychology of Facial Expression, chapter 5, pages 103-129. Cambridge University Press, Cambridge, UK, 1997.

[JBY96] Shannon Ju, Michael Black, and Yaser Yacoob. Cardboard people: A parameterized model of articulated image motion. In Proc. IEEE Conference on Automatic Face and Gesture Recognition, pages 38-44, 1996.

[Jen02] Cullen Jennings. Probabilistic evidence combination for robust real time finger recognition and tracking. PhD thesis, University of British Columbia, November 2002.

[JFEM01] Allan D. Jepson, David J. Fleet, and Thomas F. El-Maraghi. Robust online appearance models for visual tracking. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pages 415-422, Kauai, Hawaii, 2001.

[JP98] Tony Jebara and Alex P. Pentland. Action reaction learning: Analysis and synthesis of human behaviour. In IEEE Workshop on The Interpretation of Visual Motion, 1998.

[Kal60] Rudolph E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME-Journal of Basic Engineering, 82(Series D):35-45, 1960.

[KG77] Robert M. Krauss and S. Glucksberg. Social and nonsocial speech. Scientific American, 236:100-105, 1977.

[Kje01] Rick Kjeldsen. Head gestures for computer control. In Proceedings of the Workshop on Recognition And Tracking of Face and Gesture - Real Time Systems, pages 61-67, Vancouver, Canada, July 2001.

[KK00] Whoi-Yul Kim and Yong-Sung Kim. A region-based shape descriptor using Zernike moments. Signal Processing: Image Communication, 16:95-102, 2000.

[KLC98] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.

[KLM96] Leslie Pack Kaelbling, Michael Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

[KM00] Ioannis Kakadiaris and Dimitris Metaxas. Model-based estimation of 3D human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1453-1459, 2000.
[Koh89] Teuvo Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, 1989.

[KZ02] Volker Krüger and Shaohua Zhou. Exemplar-based face recognition from video. In Proceedings of IEEE International Conference on Face and Gesture, Washington, DC, May 2002.

[LB98] James J. Little and Jeffrey Boyd. Recognizing people by their gait: the shape of motion. Videre, 1(2), Winter 1998.

[LB99] Cen Li and Gautam Biswas. Temporal pattern generation using hidden Markov model based unsupervised classification. In D.J. Hand, J.N. Kok, and M.R. Berthold, editors, Advances in Intelligent Data Analysis, number 1642 in LNCS, pages 245-256, Berlin, 1999. Springer-Verlag.

[LBA99] Michael J. Lyons, Julien Budynek, and Shigeru Akamatsu. Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1357-1362, December 1999.

[LF98] Michael E. Leventon and William T. Freeman. Bayesian estimation of 3-d human motion from an image sequence. Technical Report TR-98-06, Mitsubishi Electric Research Laboratory, July 1998.

[LK81] Bruce Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proc. International Joint Conference on Artificial Intelligence, 1981.

[LKCL98] James Jenn-Jier Lien, Takeo Kanade, Jeffrey F. Cohn, and Ching-Chung Li. Subtly different facial expression recognition and expression intensity estimation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 443-449, Santa Barbara, CA, 1998.

[LP98] Simon X. Liao and Miroslaw Pawlak. On the accuracy of Zernike moments for image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1358-1364, December 1998.

[LS00] Christine L. Lisetti and Diane J. Schiano. Automatic facial expression interpretation: Where human-computer interaction, artificial intelligence and cognitive science interact. Pragmatics and Cognition, 8(1):185-235, 2000.

[LSR+00] Bastian Leibe, Thad Starner, William Ribarsky, Zachary Wartell, David Krum, Justin Weeks, Brad Singletary, and Larry Hodges. Towards spontaneous interaction with the perceptive workbench, a semi-immersive virtual environment. In IEEE Virtual Reality, pages 13-20, New Brunswick, NJ, March 2000.

[LT97] Juergen Luettin and Neil A. Thacker. Speechreading using probabilistic models. Computer Vision and Image Understanding, 65(2):163-178, February 1997.

[LTC97] Andreas Lanitis, Chris J. Taylor, and Timothy F. Cootes. Automatic interpretation and coding of face images using flexible models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):743-756, July 1997.

[McN92] David McNeill. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago, IL, 1992.

[Min99] Thomas P. Minka. From hidden Markov models to linear dynamical systems. Technical Report 531, M.I.T., 1999.

[MN95] Hiroshi Murase and Shree K. Nayar. Visual learning and recognition of 3-d objects from appearance. International Journal of Computer Vision, 14:5-24, 1995.

[Mor82] Masahiro Mori. The Buddha in the Robot. Charles E. Tuttle Co., 1982.

[MP91] Kenji Mase and Alex P. Pentland. Recognition of facial expression from optical flow. IEICE Transactions E, 74(10):3474-3483, 1991.

[MP01] Kevin Murphy and Mark Paskin. Linear time inference in hierarchical HMMs. In Advances in Neural Information Processing Systems, volume 14, Vancouver, BC, 2001.
[MPR+02] Michael Montemerlo, Joelle Pineau, Nicholas Roy, Sebastian Thrun, and Vandi Verma. Experiences with a mobile robotic guide for the elderly. In Proceedings of the AAAI National Conference on Artificial Intelligence, Edmonton, Canada, 2002. AAAI.

[Mur02] Kevin P. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UC Berkeley, Computer Science Division, July 2002.

[Muy87] Eadweard Muybridge. Animal Locomotion. University of Pennsylvania, Philadelphia, 1887.

[MYD96] Carlos Morimoto, Yaser Yacoob, and Larry Davis. Recognition of head gestures using hidden Markov models. In Proceedings of ICPR, pages 461-465, Austria, 1996.

[Mye91] Roger B. Myerson. Game Theory: Analysis of Conflict. Harvard University Press, Cambridge, Massachusetts, 1991.

[NA94] Sourabh A. Niyogi and Edward H. Adelson. Analyzing and recognizing walking figures in XYT. In IEEE Conference on Computer Vision and Pattern Recognition, pages 469-474, 1994.

[NH93] Radford M. Neal and Geoffrey E. Hinton. A new view of the EM algorithm that justifies incremental and other variants. Technical report, Dept. of Computer Science, University of Toronto, 1993.

[NI00] Ara V. Nefian and Monson H. Hayes III. Maximum likelihood training of the embedded HMM for face detection and recognition. In Proc. IEEE International Conference on Image Processing, pages 153-160, 2000.

[NY93] Shahriar Negahdaripour and Chih-Ho Yu. A generalized brightness change model for computing optical flow. In Proceedings of Fourth International Conference on Computer Vision, pages 2-11, Berlin, Germany, May 1993.

[OHG02] Nuria Oliver, Eric Horvitz, and Ashutosh Garg. Layered representations for human activity recognition. In Proceedings of International Conference on Multimodal Interfaces, Pittsburgh, PA, October 2002.

[OPB97] Nuria Oliver, Alex P. Pentland, and Francois Berard. LAFTER: Lips and face real time tracker. In Proc. Computer Vision and Pattern Recognition, pages 977-984, 1997.

[PB03] Pascal Poupart and Craig Boutilier. Value-directed compression of POMDPs. In Advances in Neural Information Processing Systems, volume 15, pages 1547-1554, Vancouver, 2003.

[Pea88] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.

[Pen00] Alex Pentland. Looking at people: sensing for ubiquitous and wearable computing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):107-119, 2000.

[PFH99] Vladimir Pavlovic, Brendan J. Frey, and Thomas S. Huang. Variational learning in mixed-state dynamic graphical models. In Proc. Uncertainty in Artificial Intelligence, pages 522-530, Stockholm, Sweden, July 1999. Morgan Kaufmann.

[PH00] Tim Paek and Eric Horvitz. Conversation as action under uncertainty. In Proceedings of Uncertainty in Artificial Intelligence, Stanford, CA, June 2000.

[Plu03] Mark Plutowski. MDP solver for a class of location-based decisioning tasks. In Uncertainty in Artificial Intelligence Bayesian Modeling Applications Workshop, Mexico, August 2003.

[PMS94] Alex P. Pentland, Baback Moghaddam, and Thad Starner. View-based and modular eigenspaces for face recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 199-204, Seattle, WA, June 1994.
[PP00] Isabella Poggi and Catherine Pelachaud. Performative facial expressions in animated faces. In Justine Cassell, Joseph Sullivan, Scott Prevost, and Elizabeth Churchill, editors, Embodied Conversational Agents, chapter 6, pages 155-187. MIT Press, 2000.

[PR89] Aluizio Prata and W.V.T. Rusch. Algorithm for computation of Zernike polynomials expansion coefficients. Applied Optics, 28(4):749-754, February 1989.

[PR00] Maja Pantic and Leon J.M. Rothkrantz. Automatic analysis of facial expressions: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), December 2000.

[PSH97] Vladimir Pavlovic, Rajeev Sharma, and Thomas S. Huang. Visual interpretation of hand gestures for human-computer interaction: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):677-695, July 1997.

[Put94] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, NY, 1994.

[RA00] Yong Rui and P. Anandan. Segmenting visual actions based on spatio-temporal motion patterns. In Proceedings of International Conference on Computer Vision and Pattern Recognition, Hilton Head, SC, June 2000.

[Rab89] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-296, February 1989.

[RB99] Jens Rittscher and Andrew Blake. Classification of human body motion. In Proceedings of International Conference on Computer Vision, pages 679-685, Corfu, Greece, 1999.

[RE01] Lionel Reveret and Irfan Essa. Visual coding and tracking of speech related facial motion. In Proc. of the IEEE CVPR International Workshop on Cues in Communication, pages 221-229, Dec 2001.

[RFD97a] James A. Russell and Jose Miguel Fernandez-Dols, editors. The Psychology of Facial Expression. Cambridge University Press, Cambridge, UK, 1997.

[RFD97b] James A. Russell and Jose Miguel Fernandez-Dols. What does facial expression mean? In James A. Russell and Jose Miguel Fernandez-Dols, editors, The Psychology of Facial Expression, chapter 1, pages 3-30. Cambridge University Press, Cambridge, UK, 1997.

[RH93] Lawrence R. Rabiner and B.H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.

[Ris78] Jorma Rissanen. Modelling by shortest data description. Automatica, 14:465-471, 1978.

[RN96] Byron Reeves and Clifford Nass. The media equation. Cambridge University Press, 1996.

[RPT00] Nicholas Roy, Joelle Pineau, and Sebastian Thrun. Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, 2000.

[Rus97] James A. Russell. Reading emotions from and into faces: Resurrecting a dimensional-contextual perspective. In James A. Russell and Jose Miguel Fernandez-Dols, editors, The Psychology of Facial Expression, chapter 13, pages 295-320. Cambridge University Press, Cambridge, UK, 1997.

[SAH91] Eero P. Simoncelli, Edward H. Adelson, and David J. Heeger. Probability distributions of optical flow. In Proceedings of International Conference on Computer Vision and Pattern Recognition, pages 310-315, Maui, Hawaii, USA, 1991.

[SAHB00] Robert St-Aubin, Jesse Hoey, and Craig Boutilier. APRICODD: Approximate policy construction using decision diagrams. In Neural Information Processing Systems, volume 14, pages 1089-1095, 2000.

[Sch92] Dale Schumacher. General filtered image rescaling. In D. Kirk, editor, Graphics Gems III. Harcourt Brace Jovanovich, 1992.
[SHJ94] Jennifer Schlenzig, Edward Hunter, and Ramesh Jain. Vision based hand gesture interpretation using recursive estimation. In Proc. Asilomar Conference on Signals, Systems and Computation, pages 394-399, October 1994.

[Smy97] Padhraic Smyth. Clustering sequences with hidden Markov models. In Advances in Neural Information Processing Systems, volume 10, 1997.

[SO94] Andreas Stolcke and Stephen M. Omohundro. Best-first model merging for hidden Markov model induction. Technical Report TR-94-003, International Computer Science Institute, Berkeley, CA, January 1994.

[SP95] Thad Starner and Alex P. Pentland. Visual recognition of American Sign Language using hidden Markov models. In International Workshop on Automatic Face and Gesture Recognition, pages 189-194, Zurich, Switzerland, 1995.

[TC88] Cho-Huak Teh and Roland T. Chin. On image analysis by the methods of moments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(4):496-513, July 1988.

[Tea80] Michael Reed Teague. Image analysis via the general theory of moments. Journal of the Optical Society of America, 70(8):920-930, 1980.

[Thr00] Sebastian Thrun. Monte Carlo POMDPs. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 1064-1070. MIT Press, 2000.

[TKC01] Yingli Tian, Takeo Kanade, and Jeffrey F. Cohn. Recognizing action units for facial expression analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), February 2001.

[TP91] Matthew Turk and Alex P. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.

[TW93] Demetri Terzopoulos and Keith Waters. Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6):569-579, June 1993.

[UR02] George W. Uetz and J. Andrew Roberts. Multisensory cues and multimodal communication in spiders: Insights from video/audio playback studies. Brain, Behavior and Evolution, 59:222-230, 2002.

[VBPB03] Eric Vatikiotis-Bateson, P. Perrier, and G. Bailly, editors. Advances in audio-visual speech processing. MIT Press, Cambridge, MA, 2003.

[Ver03] Roel Vertegaal. Attentive user interfaces. Communications of the ACM, 46(3), March 2003.

[VM98] Christian Vogler and Dimitris Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. In IEEE Intl. Conference on Computer Vision, pages 363-369, Mumbai, India, 1998.

[VM99] Christian Vogler and Dimitris Metaxas. Parallel hidden Markov models for American Sign Language recognition. In Proceedings International Conference on Computer Vision, Corfu, Greece, September 1999.

[vNM53] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, 3rd edition, 1953.

[VT02] Alex O. Vasilescu and Demetri Terzopoulos. Multilinear analysis of image ensembles: Tensorfaces. In A. Heyden, editor, Proc. of European Conference on Computer Vision, volume 2350 of LNCS, pages 447-460. Springer-Verlag, 2002.

[vZ34] Frits von Zernike. Beugungstheorie des Schneidenverfahrens und seiner verbesserten Form, der Phasenkontrastmethode. Physica, 1:689-704, 1934.

[WA03] Hongcheng Wang and Narendra Ahuja. Facial expression decomposition. In Proceedings of International Conference on Computer Vision, pages 958-965, Nice, France, October 2003.
[WADP97] Christopher Richard Wren, Ali Azarbayejani, Trevor Darrell, and Alex Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780-785, 1997.

[WB99] Andrew D. Wilson and Aaron F. Bobick. Parametric hidden Markov models for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9):884-900, September 1999.

[WBC97] Andrew D. Wilson, Aaron F. Bobick, and Justine Cassell. Temporal classification of natural gesture and application to video coding. In Proc. CVPR, Puerto Rico, June 1997.

[WCP00] Christopher R. Wren, Brian P. Clarkson, and Alex P. Pentland. Understanding purposeful human motion. In Proc. Face and Gesture Recognition, Grenoble, France, 2000.

[WHT03] Liang Wang, Weiming Hu, and Tieniu Tan. Recent developments in human motion analysis. Pattern Recognition, 36:585-601, 2003.

[WPG01] Michael Walter, Alexandra Psarrou, and Shaogang Gong. Data driven gesture model acquisition using minimum description length. In Proc. British Machine Vision Conference, Manchester, UK, September 2001.

[YD94] Yaser Yacoob and Larry Davis. Computing spatio-temporal representations of human faces. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pages 65-69, 1994.

[ZCMG01] Bo Zhang, Qinsheng Cai, Jianfeng Mao, and Baining Guo. Planning and acting under uncertainty: A new model for spoken dialogue system. In Proceedings of Uncertainty in Artificial Intelligence, pages 572-579, Seattle, WA, August 2001.

[ZL02] Dengsheng Zhang and Guojun Lu. An integrated approach to shape based image retrieval. In Proceedings of 5th Asian Conference on Computer Vision, Melbourne, Australia, January 2002.

Appendix A

Proof of Equation (4.3)

\[
H(\theta''|\theta') - H(\theta'|\theta') = \sum_{\mathbf{X}} P(\mathbf{X}|\mathbf{Y},\theta')\left[\log P(\mathbf{X}|\mathbf{Y},\theta'') - \log P(\mathbf{X}|\mathbf{Y},\theta')\right]
\]
\[
= \sum_{\mathbf{X}} P(\mathbf{X}|\mathbf{Y},\theta')\log\frac{P(\mathbf{X}|\mathbf{Y},\theta'')}{P(\mathbf{X}|\mathbf{Y},\theta')}
\;\le\; \sum_{\mathbf{X}} P(\mathbf{X}|\mathbf{Y},\theta')\left[\frac{P(\mathbf{X}|\mathbf{Y},\theta'')}{P(\mathbf{X}|\mathbf{Y},\theta')} - 1\right] \qquad (\text{using } \log x \le x - 1)
\]
\[
= \sum_{\mathbf{X}}\left[P(\mathbf{X}|\mathbf{Y},\theta'') - P(\mathbf{X}|\mathbf{Y},\theta')\right] = 0.
\]

This shows that $H(\theta''|\theta') \le H(\theta'|\theta')$. Notice that if we had used some distribution other than $P(\mathbf{X}|\mathbf{Y},\theta')$ in Equation (4.1), this inequality would not hold. In such cases, one must maximize the expression in Equation (4.2) with respect to both this arbitrary distribution and the model parameters. This leads to a more general exposition of the EM algorithm [NH93].
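As a concrete illustration of the bound, the following short Python fragment (not part of the original derivation; the two distributions are arbitrary toy values standing in for $P(\mathbf{X}|\mathbf{Y},\theta')$ and $P(\mathbf{X}|\mathbf{Y},\theta'')$) verifies the inequality numerically:

    import numpy as np

    # Check that sum_X p log q <= sum_X p log p for any distributions p, q,
    # which is exactly H(theta''|theta') <= H(theta'|theta') above.
    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(5))   # posterior under old parameters theta'
    q = rng.dirichlet(np.ones(5))   # posterior under candidate theta''

    H_qp = np.sum(p * np.log(q))    # H(theta''|theta')
    H_pp = np.sum(p * np.log(p))    # H(theta'|theta')
    assert H_qp <= H_pp + 1e-12

Equality holds only when the two posteriors coincide, which is why the E-step choice of $P(\mathbf{X}|\mathbf{Y},\theta')$ is the one that makes each EM iteration non-decreasing in likelihood.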
Appendix B

Tracking updates

We are trying to estimate

\[
P(a_t|a_{t-1},X_{t-1}) = \int_{z_{x,t-1}}\int_{v_{t-1}} P(a_t|a_{t-1},z_{x,t-1})\,P(\nabla f_{t-1}|a_{t-1},v_{t-1})\,P(v_{t-1}|z_{x,t-1})\,P(z_{x,t-1}|X_{t-1}) \tag{B.1}
\]

All the terms in this equation are Gaussian distributions, and so we can write the integrand as a single exponential, $e^\gamma$. If we write $\delta a = a_t - a_{t-1}$, and drop the temporal subscript, the exponent is
\[
\gamma = -(\delta a - z)'\Lambda_a^{-1}(\delta a - z) - (f_T + f_s v)'\Lambda^{-1}(f_T + f_s v) - (v - Mz)'\Lambda_v^{-1}(v - Mz) - (z - \mu_{z,x})'\Lambda_{z,x}^{-1}(z - \mu_{z,x})
\]
Completing the squares over $v$ gives
\[
\gamma = -(v - \mu_w)'\Lambda_w^{-1}(v - \mu_w) + \mu_w'\Lambda_w^{-1}\mu_w - (\delta a - z)'\Lambda_a^{-1}(\delta a - z) - f_T'\Lambda^{-1}f_T - z'M'\Lambda_v^{-1}Mz - z'\Lambda_{z,x}^{-1}z + 2\mu_{z,x}'\Lambda_{z,x}^{-1}z - \mu_{z,x}'\Lambda_{z,x}^{-1}\mu_{z,x}
\]
where
\[
w = f_s'\Lambda^{-1}f_T, \qquad \Lambda_w = \left(f_s'\Lambda^{-1}f_s + \Lambda_v^{-1}\right)^{-1}, \qquad \mu_w = \Lambda_w\left(\Lambda_v^{-1}Mz - w\right).
\]
The integration over $v$ in Equation (B.1) can be performed, leaving
\[
\gamma = \mu_w'\Lambda_w^{-1}\mu_w - (\delta a - z)'\Lambda_a^{-1}(\delta a - z) - f_T'\Lambda^{-1}f_T - z'M'\Lambda_v^{-1}Mz - z'\Lambda_{z,x}^{-1}z + 2\mu_{z,x}'\Lambda_{z,x}^{-1}z - \mu_{z,x}'\Lambda_{z,x}^{-1}\mu_{z,x}
\]
We can now complete the square in $z$, which gives
\[
\gamma = -(z - \mu_z^a)'\Lambda_z^{a\,-1}(z - \mu_z^a) + \mu_z^{a\,\prime}\Lambda_z^{a\,-1}\mu_z^a - f_T'\Lambda^{-1}f_T + w'\Lambda_w w - \mu_{z,x}'\Lambda_{z,x}^{-1}\mu_{z,x} - \delta a'\Lambda_a^{-1}\delta a
\]
where
\[
\Lambda_z^a = \left(\Lambda_a^{-1} + M'\Lambda_v^{-1}M + \Lambda_{z,x}^{-1} - M'\Lambda_v^{-1}\Lambda_w\Lambda_v^{-1}M\right)^{-1} = \left(\Lambda_a^{-1} + \bar\Lambda_{z,x}^{-1}\right)^{-1}
\]
\[
\mu_z^a = \Lambda_z^a\left(\Lambda_{z,x}^{-1}\mu_{z,x} - M'\Lambda_v^{-1}\Lambda_w w + \Lambda_a^{-1}\delta a\right) = \Lambda_z^a\left(\bar\Lambda_{z,x}^{-1}\bar\mu_{z,x} + \Lambda_a^{-1}\delta a\right)
\]
The integral over $z$ in Equation (B.1) can be performed, leaving
\[
\gamma = \mu_z^{a\,\prime}\Lambda_z^{a\,-1}\mu_z^a - f_T'\Lambda^{-1}f_T + w'\Lambda_w w - \mu_{z,x}'\Lambda_{z,x}^{-1}\mu_{z,x} - \delta a'\Lambda_a^{-1}\delta a
\]
Finally, we complete the square in $\delta a$, giving
\[
\gamma = -(\delta a - \mu_a)'\hat\Lambda_a^{-1}(\delta a - \mu_a) + \mu_a'\hat\Lambda_a^{-1}\mu_a - f_T'\Lambda^{-1}f_T + w'\Lambda_w w - \mu_{z,x}'\Lambda_{z,x}^{-1}\mu_{z,x}
\]
where
\[
\hat\Lambda_a = \left(\Lambda_a^{-1} - \Lambda_a^{-1}\Lambda_z^a\Lambda_a^{-1}\right)^{-1}, \qquad \mu_a = \hat\Lambda_a\Lambda_a^{-1}\Lambda_z^a\bar\Lambda_{z,x}^{-1}\bar\mu_{z,x}.
\]
We can simplify these expressions using the matrix inversion lemma [HS81a]:
\[
(B^{-1} + CD^{-1}C')^{-1} = B - BC(D + C'BC)^{-1}C'B
\]
and so
\[
\hat\Lambda_a = \left(\Lambda_a^{-1} - \Lambda_a^{-1}(\bar\Lambda_{z,x}^{-1} + \Lambda_a^{-1})^{-1}\Lambda_a^{-1}\right)^{-1} = \Lambda_a + \bar\Lambda_{z,x}
\]
and thus
\[
\mu_a = (\Lambda_a + \bar\Lambda_{z,x})\Lambda_a^{-1}\Lambda_z^a\bar\Lambda_{z,x}^{-1}\bar\mu_{z,x} = (\Lambda_a + \bar\Lambda_{z,x})(\Lambda_a + \bar\Lambda_{z,x})^{-1}\bar\mu_{z,x} = \bar\mu_{z,x}.
\]
That is, the pose change $\delta a$ is distributed as a Gaussian with mean equal to the expected flow projection, $\bar\mu_{z,x}$, and covariance $\Lambda_a + \bar\Lambda_{z,x}$.
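The two matrix identities used in the last step are easy to confirm numerically. The following sketch (our own check, not part of the thesis; matrix sizes and names are arbitrary) verifies both the lemma of [HS81a] and the resulting simplification $\hat\Lambda_a = \Lambda_a + \bar\Lambda_{z,x}$:

    import numpy as np

    rng = np.random.default_rng(1)
    inv = np.linalg.inv

    def random_spd(n):
        # random symmetric positive-definite matrix
        A = rng.normal(size=(n, n))
        return A @ A.T + n * np.eye(n)

    n = 4
    La = random_spd(n)     # Lambda_a
    Lzx = random_spd(n)    # bar Lambda_{z,x}

    # Lemma with B = Lambda_a, C = I, D = bar Lambda_{z,x}:
    # (B^-1 + C D^-1 C')^-1 = B - B C (D + C' B C)^-1 C' B
    lhs = inv(inv(La) + inv(Lzx))
    rhs = La - La @ inv(Lzx + La) @ La
    assert np.allclose(lhs, rhs)

    # (Lambda_a^-1 - Lambda_a^-1 Lambda_z^a Lambda_a^-1)^-1 = Lambda_a + bar Lambda_{z,x}
    Lza = inv(inv(La) + inv(Lzx))              # Lambda_z^a
    hat_La = inv(inv(La) - inv(La) @ Lza @ inv(La))
    assert np.allclose(hat_La, La + Lzx)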
Appendix C

Update Equations for the Mixture Model

Figure C.1: Bayesian network for the mixture of Gaussians over optical flow fields with feature weighting. Shaded nodes are observed or fixed (known), while unshaded nodes are unknown random variables. Boxes are fixed hyper-parameters. The dashed line delineates the priors for feature weighting. $X \in 1\ldots N_x$ are discrete motion classes, $Z$ is the Zernike feature vector (projection of the optical flow field), $V$ is the optical flow field, $f_s$ are the spatial derivatives, and $f_t$ is the temporal derivative. $\mu_z, \Lambda_z$ are the parameters of the mixture of Gaussians over the $Z$ vector space, and $T$ are the feature weights. $\Theta_X$ are the class probability parameters (a multinomial), and $\alpha_X$ is the parameter of the (conjugate) Dirichlet prior over $\Theta_X$.

Figure C.1 shows the simple mixture model with feature weighted Gaussian distributions that was discussed in Section 3.2.2. Here we derive the update equations for the parameter learning using the expectation maximization (EM) algorithm. Our goal is to show how Equations 4.6-4.9 are derived.

To update the output mean, we set the derivative with respect to the mean for state $X = X_i$, $\mu_{z,i}$, to zero:
\[
\int_{\mathbf{XZ}} P(\mathbf{XZ}|\mathbf{Vf},\theta')\frac{\partial}{\partial\mu_{z,i}}\log P(\mathbf{VfXZ}|\Theta) + \frac{\partial}{\partial\mu_{z,i}}\log P(\mu_{z,i}) = 0
\]
where $P(\mu_{z,i}) \sim \mathcal{N}(\mu^*, T)$ is the prior distribution over the mean for state $i$, such that
\[
\frac{\partial}{\partial\mu_{z,i}}\log P(\mu_{z,i}) = T^{-1}(\mu^* - \mu_{z,i}).
\]
The log of the complete posterior is a sum of logarithmic terms, and only those involving $\mu_z$ are non-zero, such that:
\[
\frac{\partial}{\partial\mu_{z,i}}\log P(\mathbf{VfXZ}|\Theta) = \sum_{k=1}^{N_t}\frac{\partial}{\partial\mu_{z,i}}\log P(Z_k|X_k,\Theta).
\]
The sum over $k$ can be taken outside the expression. The derivative picks out the $i$th value from the sum over $X_k$, and the sums over $X_l$ and the integrations over $Z_l$ for $l = 1\ldots k-1, k+1\ldots N_t$ can be performed, giving unity and leaving
\[
\sum_{k=1}^{N_t}\int_{z_k} P(X_{k,i}z_k|\mathbf{Vf},\theta')\frac{\partial}{\partial\mu_{z,i}}\log P(z_k|X_{k,i},\Theta) = T^{-1}(\mu^* - \mu_{z,i}). \tag{C.1}
\]
Since $P(z_k|X_{k,i},\Theta) \sim \mathcal{N}(\mu_{z,i},\Lambda_{z,i})$, the derivative gives $\Lambda_{z,i}^{-1}(z_k - \mu_{z,i})$. Further, since the measurements are independent in the mixture model, $P(X_{k,i}z_k|\mathbf{Vf}\theta') = P(z_k|X_{k,i}\nabla f_k\theta')P(X_{k,i}|\nabla f_k\theta')$, we have
\[
\left[\sum_{k=1}^{N_t} P(X_{k,i}|\nabla f_k\theta')\Lambda_{z,i}^{-1} + T^{-1}\right]\mu_{z,i} = \Lambda_{z,i}^{-1}\sum_{k=1}^{N_t}\int_{z_k} z_k\, P(z_k|X_{k,i}\nabla f_k\theta')P(X_{k,i}|\nabla f_k\theta') + T^{-1}\mu^* \tag{C.2}
\]
and we can solve for $\mu_{z,i}$:
\[
\mu_{z,i} = \left(\xi_i\Lambda_{z,i}^{-1} + T^{-1}\right)^{-1}\left[\Lambda_{z,i}^{-1}\sum_{k=1}^{N_t}\xi_{k,i}\bar\mu_{z,x} + T^{-1}\mu^*\right]
\]
where $\xi_{k,i} = P(X_{k,i}|\nabla f_k\theta')$, $\xi_i = \sum_{k=1}^{N_t}\xi_{k,i}$, and $\bar\mu_{z,x}$ is given by Equation (3.25). Thus, the most likely mean for each state $x$ is the weighted sum of the most likely values of $z$ as given by Equation (3.25). Dimensions of the means, $\mu_{z,i}$, with small feature weights, $\tau_k^2$, will be biased toward the data mean, $\mu^*$, in that dimension. This is reasonable, because such dimensions are not relevant for clustering, and so should be the same for any cluster, $X$.
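A minimal NumPy sketch of this mean update follows. The function name and array conventions are our own assumptions (not from the thesis); the inputs are the responsibilities $\xi_{k,i}$ and the per-frame expected projections $\bar\mu_{z,x}$ computed in the E-step:

    import numpy as np

    def update_means(xi, mu_bar, Lambda_z_inv, T_inv, mu_star):
        """MAP update of the class means mu_{z,i} (a sketch).

        xi:           (Nt, Nx) responsibilities xi_{k,i} = P(X_{k,i} | grad f_k)
        mu_bar:       (Nt, Nz) expected projections bar mu_{z,x} (Eq. 3.25)
        Lambda_z_inv: (Nx, Nz, Nz) inverse output covariances per class
        T_inv:        (Nz, Nz) inverse prior covariance over the means
        mu_star:      (Nz,) prior (data) mean mu^*
        """
        Nt, Nx = xi.shape
        Nz = mu_star.shape[0]
        mu = np.zeros((Nx, Nz))
        for i in range(Nx):
            xi_i = xi[:, i].sum()
            # (xi_i Lambda^-1 + T^-1) mu = Lambda^-1 sum_k xi_{k,i} bar mu_k + T^-1 mu*
            A = xi_i * Lambda_z_inv[i] + T_inv
            b = Lambda_z_inv[i] @ (xi[:, i] @ mu_bar) + T_inv @ mu_star
            mu[i] = np.linalg.solve(A, b)
        return mu

The shrinkage toward $\mu^*$ happens automatically: wherever $T^{-1}$ is large (small feature weight $\tau_k^2$), the prior term dominates that dimension of the solve.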
The updates to the feature weights, $\tau_k^2$, are found by taking the derivative with respect to $\tau_k^2$. However, the complete data likelihood, $P(\mathbf{VfXZ}|\Theta)$, is independent of the feature weights, so we are solving
\[
\frac{\partial}{\partial\tau_k^2}\log\left[\prod_{j=1}^{N_z}P(\tau_j^2)\prod_{i=1}^{N_x}P(\mu_{z,i})\right] = \sum_{j=1}^{N_z}\frac{\partial}{\partial\tau_k^2}\log P(\tau_j^2) + \sum_{i=1}^{N_x}\frac{\partial}{\partial\tau_k^2}\log P(\mu_{z,i}) = 0
\]
where $P(\tau_j^2)$ are the prior distributions over the feature weights along each dimension of the output space, given by Equation (3.26), and $P(\mu_{z,i}) \sim \mathcal{N}(\mu^*, T)$, so that
\[
\frac{\partial}{\partial\tau_k^2}\log P(\tau_k^2) = \frac{\partial}{\partial\tau_k^2}\left[-(a+1)\log(\tau_k^2) - \frac{b}{\tau_k^2}\right] = -\frac{a+1}{\tau_k^2} + \frac{b}{(\tau_k^2)^2}
\]
and
\[
\frac{\partial}{\partial\tau_k^2}\log P(\mu_{z,i}) = \frac{\partial}{\partial\tau_k^2}\left[-\frac{1}{2}\log|T| - \frac{1}{2}(\mu_{z,i}-\mu^*)'T^{-1}(\mu_{z,i}-\mu^*)\right] = -\frac{1}{2\tau_k^2} + \frac{(\mu_{z,i,k}-\mu^*_k)^2}{2(\tau_k^2)^2}
\]
where $\mu_{z,i,k}$, $\mu^*_k$ are the $k$th dimensions of $\mu_{z,i}$ and $\mu^*$, respectively. Thus, we have
\[
\sum_{i=1}^{N_x}\frac{(\mu_{z,i,k}-\mu^*_k)^2}{2(\tau_k^2)^2} - \frac{N_x}{2\tau_k^2} - \frac{a+1}{\tau_k^2} + \frac{b}{(\tau_k^2)^2} = 0.
\]
Solving for $\tau_k^2$ yields
\[
\tau_k^2 = \frac{b + \frac{1}{2}\sum_{i=1}^{N_x}(\mu_{z,i,k}-\mu^*_k)^2}{a + N_x/2 + 1}.
\]
The updates to the feature weights show that those dimensions, $k$, with $\mu_{z,i,k}$ very different from the data mean, across all states, will receive large values of $\tau_k^2$, while those with $\mu_{z,i,k} \approx \mu^*_k$ will receive small values of $\tau_k^2$. Intuitively, the dimensions along which the data is well separated (large inter-class distance) will be weighted more.

The updates to the covariance matrix, $\Lambda_{z,i}$, are found in a similar way to the mean. We present the main points here. We set the derivative with respect to the covariance for state $X = X_i$, $\Lambda_{z,i}$, to zero:
\[
\int_{\mathbf{XZ}} P(\mathbf{XZ}|\mathbf{Vf},\theta')\frac{\partial}{\partial\Lambda_{z,i}}\log P(\mathbf{VfXZ}|\Theta) + \frac{\partial}{\partial\Lambda_{z,i}}\log P(\Lambda_{z,i}) = 0
\]
where $P(\Lambda_{z,i})$ is the prior distribution over the covariance for state $i$, given by Equation (3.27), so that
\[
\frac{\partial}{\partial\Lambda_{z,i}}\log P(\Lambda_{z,i}) = \frac{\partial}{\partial\Lambda_{z,i}}\left[-\frac{\alpha+N_z+1}{2}\log|\Lambda_{z,i}| - \frac{1}{2}\mathrm{tr}\left(\alpha\Lambda^*\Lambda_{z,i}^{-1}\right)\right] = -\frac{\alpha+N_z+1}{2}\Lambda_{z,i}^{-1} + \frac{1}{2}\Lambda_{z,i}^{-1}\alpha\Lambda^*\Lambda_{z,i}^{-1}. \tag{C.3}
\]
Following the same derivation as that leading up to Equation (C.1), we obtain
\[
\sum_{k=1}^{N_t}\int_{z_k}P(X_{k,i}z_k|\nabla f_k\theta')\frac{\partial}{\partial\Lambda_{z,i}}\log P(z_k|X_{k,i},\Theta) = \frac{\alpha+N_z+1}{2}\Lambda_{z,i}^{-1} - \frac{1}{2}\Lambda_{z,i}^{-1}\alpha\Lambda^*\Lambda_{z,i}^{-1}. \tag{C.4}
\]
The derivative on the left gives
\[
\frac{\partial}{\partial\Lambda_{z,i}}\log P(z_k|X_{k,i},\Theta) = -\frac{1}{2}\Lambda_{z,i}^{-1} + \frac{1}{2}\Lambda_{z,i}^{-1}(z_k-\mu_{z,i})(z_k-\mu_{z,i})'\Lambda_{z,i}^{-1}.
\]
Multiplying Equation (C.4) on the left and right by $\Lambda_{z,i}$ gives
\[
\sum_{k=1}^{N_t}\int_{z_k}P(X_{k,i}z_k|\nabla f_k\theta')\left[(z_k-\mu_{z,i})(z_k-\mu_{z,i})' - \Lambda_{z,i}\right] = (\alpha+N_z+1)\Lambda_{z,i} - \alpha\Lambda^*
\]
which is
\[
\sum_{k=1}^{N_t}\int_{z_k}P(X_{k,i}z_k|\nabla f_k\theta')(z_k-\mu_{z,i})(z_k-\mu_{z,i})' + \alpha\Lambda^* = \left[\xi_i + \alpha + N_z + 1\right]\Lambda_{z,i}.
\]
Since $\int_{z_k}P(z_k|X_{k,i}\nabla f_k\theta') = 1$, we can solve for $\Lambda_{z,i}$ as
\[
\Lambda_{z,i} = \frac{1}{\xi_i + \alpha + N_z + 1}\left[\sum_{k=1}^{N_t}\xi_{k,i}\left\langle(z_k-\mu_{z,i})(z_k-\mu_{z,i})'\right\rangle + \alpha\Lambda^*\right].
\]
We must now evaluate the expected covariance, $\langle(z_k-\mu_{z,i})(z_k-\mu_{z,i})'\rangle$, where the expectation is taken with respect to the distribution over $z$ given the state, $i$, and the data, $P(z_k|X_{k,i}\nabla f_k\theta')$. Expanding, and using Equation (3.22),
\[
\langle(z_k-\mu_{z,i})(z_k-\mu_{z,i})'\rangle = \langle z_k z_k'\rangle - \bar\mu_{z,x}\mu_{z,i}' - \mu_{z,i}\bar\mu_{z,x}' + \mu_{z,i}\mu_{z,i}'.
\]
Since the expected covariance over $z$ is
\[
\bar\Lambda_{z,x} = \left\langle(z-\bar\mu_{z,x})(z-\bar\mu_{z,x})'\right\rangle = \langle zz'\rangle - \bar\mu_{z,x}\bar\mu_{z,x}'
\]
we can write
\[
\langle(z_k-\mu_{z,i})(z_k-\mu_{z,i})'\rangle = \bar\Lambda_{z,x} + \bar\mu_{z,x}\bar\mu_{z,x}' - \bar\mu_{z,x}\mu_{z,i}' - \mu_{z,i}\bar\mu_{z,x}' + \mu_{z,i}\mu_{z,i}'
\]
so that the updates to the covariance can be written as
\[
\Lambda_{z,i} = \frac{\sum_{k=1}^{N_t}\xi_{k,i}\left[\bar\Lambda_{z,x} + (\bar\mu_{z,x}-\mu_{z,i})(\bar\mu_{z,x}-\mu_{z,i})'\right] + \alpha\Lambda^*}{\xi_i + \alpha + N_z + 1}.
\]
The covariance is updated based on a combination of the most likely covariance, $\bar\Lambda_{z,x}$, and the covariance of the most likely mean, $\bar\mu_{z,x}$, about the model mean $\mu_{z,i}$, for each data point, $k = 1\ldots N_t$, weighted by the probability of state $i$ given the data. The prior covariance is represented with $\alpha\Lambda^*$, which stabilizes the updates, avoiding matrix singularity problems in the inverses in Equation (3.21).

The updates to the class probability, $\Theta_X$, are given by taking the derivative as usual, except that we must now enforce the constraint that $\Theta_X$ is a proper probability distribution, $\sum_i\Theta_{X,i} = 1$. We do this using a Lagrange multiplier, so we want to solve
\[
\int_{\mathbf{XZ}}P(\mathbf{XZ}|\mathbf{Vf},\theta')\frac{\partial}{\partial\Theta_{X,i}}\log P(\mathbf{VfXZ}|\Theta) + \frac{\partial}{\partial\Theta_{X,i}}\log P(\Theta_X) = \lambda.
\]
The prior over $\Theta_X$ is distributed according to a Dirichlet distribution, $P(\Theta_X) \propto \prod_i\Theta_{X,i}^{\alpha_{X,i}}$. Taking the derivatives, and performing the integration over $z_k$, this is
\[
\sum_{k=1}^{N_t}\frac{P(X_{k,i}|\nabla f_k,\theta')}{\Theta_{X,i}} + \frac{\alpha_{X,i}}{\Theta_{X,i}} = \lambda. \tag{C.5}
\]
Multiplying by $\Theta_{X,i}$ and summing over $i$ gives
\[
\sum_{i=1}^{N_x}\left[\sum_{k=1}^{N_t}P(X_{k,i}|\nabla f_k,\theta') + \alpha_{X,i}\right] = \lambda\sum_{i=1}^{N_x}\Theta_{X,i} = \lambda.
\]
We substitute this value of $\lambda$ back into Equation (C.5) and solve for $\Theta_{X,i}$ to obtain
\[
\Theta_{X,i} = \frac{\alpha_{X,i} + \xi_i}{\sum_{i=1}^{N_x}(\alpha_{X,i} + \xi_i)}.
\]
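The covariance and class-probability updates translate directly into code. The sketch below is ours (names and array layouts are assumptions, not from the thesis), and pairs with the mean update shown earlier:

    import numpy as np

    def update_class_probs(xi, alpha_x):
        """MAP update of Theta_X with a Dirichlet prior (the last formula above)."""
        counts = alpha_x + xi.sum(axis=0)   # alpha_{X,i} + xi_i
        return counts / counts.sum()        # normalizes over i

    def update_covariance(xi_i, mu_bar, Lambda_bar, mu_i, alpha, Lambda_star):
        """MAP update of one class covariance Lambda_{z,i} (a sketch).

        xi_i:       (Nt,) responsibilities for class i
        mu_bar:     (Nt, Nz) expected projections per frame (Eq. 3.25)
        Lambda_bar: (Nt, Nz, Nz) posterior covariances per frame (Eq. 3.21)
        mu_i:       (Nz,) current class mean
        alpha, Lambda_star: inverse-Wishart prior strength and matrix
        """
        Nz = mu_i.shape[0]
        dev = mu_bar - mu_i   # bar mu_{z,x} - mu_{z,i}, per frame
        S = np.einsum('k,kab->ab', xi_i, Lambda_bar) \
          + np.einsum('k,ka,kb->ab', xi_i, dev, dev)
        return (S + alpha * Lambda_star) / (xi_i.sum() + alpha + Nz + 1)

Note how the prior term $\alpha\Lambda^*$ keeps the returned matrix full rank even when a class receives almost no responsibility mass.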
Appendix D

Estimating C4MG Model Parameters

Figure D.1: Two time slices of a dynamic Bayesian network (DBN) for simultaneous modeling of pose and dynamics. Repeat of Figure 3.12.

This appendix derives the parameter learning equations for the context-dependent Markov chain of mixtures of hidden Markov models (C4MG) described in Section 4.7. However, we first start by deriving the update equations for the coupled hidden Markov model for temporal modeling of pose and dynamics, as discussed in Section 4.4. The Bayesian network for this model is shown in Figure D.1.

We use the notation that a set of variables $X_t\ldots X_T$ is written $\{X\}_{t,T}$. To simplify things further, we denote $X_t\ldots X_T, W_t\ldots W_T$ as $\{XW\}_{t,T}$. If no subscript is given, it is assumed to be $1,T$, the full variable set. Finally, we denote the observations $O = \mathbf{I}\,\mathbf{\nabla f}$, and the hidden state, $Y = XW$. We assume that every term is conditioned on the model parameters, $\theta'$.

The update equation for the $X$ transition probability, $\Theta_{Xijk}$, is derived in a similar way to the class probability in the simple mixture model. Due to the Markovian dependence in the dynamics and configuration chains, we must take derivatives of a more complex expression, subject to the constraint that $\sum_i\Theta_{Xijk} = 1$:
\[
\int_{\mathbf{XWZH}}P(\mathbf{XWZH}|\mathbf{Vf},\mathbf{I},\theta')\frac{\partial}{\partial\Theta_{Xijk}}\log P(\mathbf{VfXWZH}|\Theta) - \frac{\partial}{\partial\Theta_{Xijk}}\lambda\left(\sum_i\Theta_{Xijk}-1\right) + \frac{\partial}{\partial\Theta_{Xijk}}\log P(\Theta_{Xijk}) = 0. \tag{D.1}
\]
The prior over $\Theta_{Xijk}$ is distributed according to a Dirichlet distribution, $P(\Theta_{Xijk}) \propto \prod_i\Theta_{Xijk}^{\alpha_{Xijk}}$, and so
\[
\frac{\partial}{\partial\Theta_{Xijk}}\log P(\Theta_{Xijk}) = \alpha_{Xijk}/\Theta_{Xijk}.
\]
The complete data posterior can be computed as
\[
P(\mathbf{VfXWZH}|\Theta) = \prod_{k=2}^{N_t}\left[P(\nabla f_k|z_k)P(z_k|x_k)P(I_k|h_k)P(h_k|w_k)P(x_k|w_k x_{k-1})P(w_k|x_{k-1}w_{k-1})\right]P(x_1|w_1)P(w_1). \tag{D.2}
\]
Therefore,
\[
\frac{\partial}{\partial\Theta_{Xijk}}\log P(\mathbf{VfXWZH}|\Theta) = \sum_t\frac{\partial}{\partial\Theta_{Xijk}}\log P(x_t|w_t x_{t-1}).
\]
The derivative picks out the values $x_t = i$, $w_t = j$ and $x_{t-1} = k$, so that the integrations over $\mathbf{H}$ and $\mathbf{Z}$, as well as the sums over $X_l$ and $W_l$ for $l = 1\ldots t-2, t+1\ldots N_t$, can all be performed in Equation (D.1), leaving
\[
\lambda = \sum_{t=1}^{N_t}P(X_{t,i}W_{t,j}X_{t-1,k}|O\theta')\frac{1}{\Theta_{Xijk}} + \frac{\alpha_{Xijk}}{\Theta_{Xijk}}.
\]
Multiplying by $\Theta_{Xijk}$ and summing over $i$ gives
\[
\sum_{i=1}^{N_x}\left[\sum_{t=1}^{N_t}P(X_{t,i}W_{t,j}X_{t-1,k}|O\theta') + \alpha_{Xijk}\right] = \lambda\sum_{i=1}^{N_x}\Theta_{Xijk} = \lambda
\]
so we can solve for $\Theta_{Xijk}$ as
\[
\Theta_{Xijk} = \frac{\alpha_{Xijk} + \sum_{t=1}^{N_t}P(X_{t,i}W_{t,j}X_{t-1,k}|O\theta')}{\sum_i\left[\alpha_{Xijk} + \sum_{t=1}^{N_t}P(X_{t,i}W_{t,j}X_{t-1,k}|O\theta')\right]}. \tag{D.3}
\]
The update equations for the $W$ transition probability, $\Theta_{Wijk}$, and for the initial state probabilities, $\Pi_{Xij}$ and $\Pi_{Wi}$, are derived in a similar way, giving
\[
\Theta_{Wijk} = \frac{\alpha_{Wijk} + \sum_{t=1}^{N_t}P(W_{t,i}X_{t-1,j}W_{t-1,k}|O\theta')}{\sum_i\left[\alpha_{Wijk} + \sum_{t=1}^{N_t}P(W_{t,i}X_{t-1,j}W_{t-1,k}|O\theta')\right]} \tag{D.4}
\]
\[
\Pi_{Xij} = \frac{\alpha_{Xij} + P(X_{1,i}W_{1,j}|O\theta')}{\sum_i\left[\alpha_{Xij} + P(X_{1,i}W_{1,j}|O\theta')\right]} \tag{D.5}
\]
\[
\Pi_{Wi} = \frac{\alpha_{Wi} + \sum_j P(X_{1,j}W_{1,i}|O\theta')}{\sum_i\left[\alpha_{Wi} + \sum_j P(X_{1,j}W_{1,i}|O\theta')\right]}. \tag{D.6}
\]
The update equations for the parameters of the output distributions in the dynamics chain, $\mu_{z,i}, \Lambda_{z,i}$, are almost the same as those for the simple mixture model derived in Appendix C:
\[
\int P(\mathbf{XWZH}|\mathbf{Vf},\mathbf{I},\theta')\frac{\partial}{\partial\mu_{z,i}}\log P(\mathbf{VfXWZH}|\Theta) + \frac{\partial}{\partial\mu_{z,i}}\log P(\mu_{z,i}) = 0.
\]
The derivative picks out only the terms involving $\mu_z$, and all the sums over $\mathbf{W}$ and $\mathbf{H}$ can be performed, leaving the equivalent of Equation (C.1):
\[
\sum_{k=1}^{N_t}\int_{z_k}P(X_{k,i}z_k|O)\frac{\partial}{\partial\mu_{z,i}}\log P(z_k|X_{k,i},\Theta) = T^{-1}(\mu^*-\mu_{z,i}). \tag{D.7}
\]
Now, however, the observations are not all independent, and we have that
\[
P(X_{k,i}z_k|O) = P(z_k|X_{k,i}O)P(X_{k,i}|O).
\]
The probability of a given state of the dynamics process at time $t$, $X_{t,i}$, given all the data, $O$, can be computed as follows:
\[
P(X_{t,i}|\{O\}_{1,T}) = \sum_j P(X_{t,i}W_{t,j}\{O\}_{1,T})/P(\{O\}_{1,T})
= \sum_j P(\{O\}_{t+1,T}|X_{t,i}W_{t,j})P(X_{t,i}W_{t,j}\{O\}_{1,t})/P(\{O\})
= \frac{\sum_j\beta^X_{t,ij}\kappa^X_{t,ij}}{\sum_{ij}\beta^X_{t,ij}\kappa^X_{t,ij}}
\]
where $\kappa^X_{t,ij} = P(X_{t,i}W_{t,j}\{O\}_{1,t})$ and $\beta^X_{t,ij} = P(\{O\}_{t+1,T}|X_{t,i}W_{t,j})$ are forwards and backwards variables for the dynamics process. We show recursive derivations for these variables later in this appendix. (These quantities are the equivalents of the usual "forwards" and "backwards" variables, $\alpha$ and $\beta$, in hidden Markov model parameter and likelihood estimation [Rab89]. We use $\kappa$ here since $\alpha$ is our scale variable.)

Returning to Equation (D.7), we see that the update equations derived for the simple mixture model (Equations (4.6), (4.7), and (4.8)) can be used for the Markovian model if we use $\xi_{t,i} = P(X_{t,i}|\{O\}_{1,T})$ as derived above.

Update equations for the emission probability over the exemplars, $H$, given the configuration class, $W$, $\Theta_{Hij}$, can be similarly derived, leading to
\[
\Theta_{Hij} = \frac{\delta_{Hij} + \sum_{k=1}^{N_t}P(H_{k,i}W_{k,j}|O)}{\sum_i\left[\delta_{Hij} + \sum_{k=1}^{N_t}P(H_{k,i}W_{k,j}|O)\right]}
\]
where $\delta_{Hij}$ is the prior distribution, which is Dirichlet, $Dir(\Theta_{Hij},\delta_{Hij})$. We can expand
\[
P(H_{k,i}W_{k,j}|O) = P(H_{k,i}|W_{k,j}O)P(W_{k,j}|O)
\]
and since
\[
P(H_{k,i}|W_{k,j}O) = P(H_{k,i}|W_{k,j}I_k) = \frac{P(I_k|H_{k,i})\Theta_{Hij}}{P(I_k|W_{k,j})}
\]
and the probability of a given state of the configuration process at time $t$, $W_{t,i}$, is
\[
P(W_{t,i}|\{O\}_{1,T}) = \sum_j P(W_{t,i}X_{t-1,j}\{O\}_{1,T})/P(\{O\}_{1,T}) = \frac{\sum_j\beta^W_{t,ij}\kappa^W_{t,ij}}{\sum_{ij}\beta^W_{t,ij}\kappa^W_{t,ij}}
\]
we obtain the update, where $\kappa^W_{t,ij} = P(W_{t,i}X_{t-1,j}\{O\}_{1,t-1}I_t)$ and $\beta^W_{t,ij} = P(\nabla f_t\{O\}_{t+1,T}|W_{t,i}X_{t-1,j})$ are the forwards and backwards variables for the configuration process.
The expected number of transitions in the dynamics chain is
\[
\sum_t P(X_{t,i}W_{t,j}X_{t-1,k}|O) = \sum_t P(X_{t,i}W_{t,j}X_{t-1,k}\{O\}_{1,T})/P(O)
\]
\[
= \sum_t P(\{O\}_{t+1,T}\nabla f_t|X_{t,i}W_{t,j}X_{t-1,k}\{O\}_{1,t-1}I_t)\,P(X_{t,i}W_{t,j}X_{t-1,k}\{O\}_{1,t-1}I_t)/P(O)
\]
\[
= \sum_t P(\{O\}_{t+1,T}|X_{t,i}W_{t,j})P(\nabla f_t|X_{t,i})P(X_{t,i}|W_{t,j}X_{t-1,k})P(W_{t,j}X_{t-1,k}\{O\}_{1,t-1}I_t)/P(O)
\]
\[
= \sum_t\beta^X_{t,ij}\,P(\nabla f_t|X_{t,i})\,\Theta_{Xijk}\,\kappa^W_{t,jk}/P(O)
\]
where $P(O) = \sum_{ijk}\beta^X_{t,ij}P(\nabla f_t|X_{t,i})\Theta_{Xijk}\kappa^W_{t,jk}$ and the likelihood of the image derivatives, $P(\nabla f_t|X_{t,i})$, is calculated from Equation (3.19).

The expected number of transitions in the configuration chain is
\[
\sum_t P(W_{t,i}X_{t-1,j}W_{t-1,k}|O) = \sum_t P(W_{t,i}X_{t-1,j}W_{t-1,k}\{O\}_{1,T})/P(O)
\]
\[
= \sum_t P(\nabla f_t I_t\{O\}_{t+1,T}|W_{t,i}X_{t-1,j})P(W_{t,i}|X_{t-1,j}W_{t-1,k})P(X_{t-1,j}W_{t-1,k}\{O\}_{1,t-1})/P(O)
\]
\[
= \sum_t\beta^W_{t,ij}\,P(I_t|W_{t,i})\,\Theta_{Wijk}\,\kappa^X_{t-1,jk}/P(O).
\]
The likelihood of the image, $P(I_t|W_{t,i})$, is calculated from the configuration mixture model.

The expectation of the number of initial states in the dynamics chain is
\[
P(X_{1,i}W_{1,j}|O) = \frac{\kappa^X_{1,ij}\beta^X_{1,ij}}{\sum_{ij}\kappa^X_{1,ij}\beta^X_{1,ij}}
\]
while the expectation of the number of initial states in the configuration chain is
\[
\sum_j P(X_{1,j}W_{1,i}|O) = \frac{\sum_j\kappa^X_{1,ji}\beta^X_{1,ji}}{\sum_{ij}\kappa^X_{1,ij}\beta^X_{1,ij}}.
\]

The forwards variable for the configuration process is recursively computed as follows:
\[
\kappa^W_{t,ij} = P(W_{t,i}X_{t-1,j}\{O\}_{1,t-1}I_t) = P(I_t|W_{t,i})\sum_k P(W_{t,i}X_{t-1,j}W_{t-1,k}\{O\}_{1,t-1})
\]
\[
= P(I_t|W_{t,i})\sum_k\Theta_{Wijk}\,P(X_{t-1,j}W_{t-1,k}\{O\}_{1,t-1}) = P(I_t|W_{t,i})\sum_k\Theta_{Wijk}\,\kappa^X_{t-1,jk}.
\]
The backwards variable for the configuration process is recursively computed as follows:
\[
\beta^W_{t,ij} = P(\nabla f_t\{O\}_{t+1,T}|W_{t,i}X_{t-1,j}) = \sum_k P(\nabla f_t|X_{t,k})P(\{O\}_{t+1,T}|X_{t,k}W_{t,i})P(X_{t,k}|W_{t,i}X_{t-1,j})
\]
\[
= \sum_k P(\nabla f_t|X_{t,k})\,\Theta_{Xkij}\,\beta^X_{t,ki}.
\]
The forwards variable for the dynamics process is recursively computed as follows:
\[
\kappa^X_{t,ij} = P(X_{t,i}W_{t,j}\{O\}_{1,t}) = \sum_k P(X_{t,i}W_{t,j}X_{t-1,k}\{O\}_{1,t})
\]
\[
= P(\nabla f_t|X_{t,i})\sum_k P(X_{t,i}|W_{t,j}X_{t-1,k})P(W_{t,j}X_{t-1,k}\{O\}_{1,t-1}I_t) = P(\nabla f_t|X_{t,i})\sum_k\Theta_{Xijk}\,\kappa^W_{t,jk}.
\]
The initial value of the forwards dynamics variable is
\[
\kappa^X_{1,ij} = P(X_{1,i}W_{1,j}\nabla f_1 I_1) = P(\nabla f_1|X_{1,i})\,P(I_1|W_{1,j})\,\Pi_{Xij}\,\Pi_{Wj}.
\]
The backwards variable for the dynamics process is recursively computed as follows:
\[
\beta^X_{t-1,ij} = P(\{O\}_{t,T}|X_{t-1,i}W_{t-1,j}) = \sum_k P(I_t|W_{t,k})P(\nabla f_t\{O\}_{t+1,T}|W_{t,k}X_{t-1,i})P(W_{t,k}|X_{t-1,i}W_{t-1,j})
\]
\[
= \sum_k P(I_t|W_{t,k})\,\beta^W_{t,ki}\,\Theta_{Wkij}.
\]
The dynamics backwards process is initialized at time $T$ to be even over the joint space of $X, W$:
\[
\beta^X_{T,ij} = (N_x N_w)^{-1}.
\]
The procedure for computing the sufficient statistics of the transition parameters for each training sequence is:

    Initialize kappa^X_1
    forwards pass:
      for t = 2:T
        compute kappa^W_t from kappa^X_{t-1}
        compute kappa^X_t from kappa^W_t
      end
    backwards pass:
      Initialize beta^X_T
      for t = T:-1:2
        compute beta^W_t from beta^X_t
        compute beta^X_{t-1} from beta^W_t
      end
    update sufficient statistics:
      compute E_{P(Y|O,theta')}(N_Xijk)
      compute E_{P(Y|O,theta')}(N_Wijk)

The likelihood of a sequence of data, $\{O\}_{1,T}$, is easily computed from $\kappa^X$:
\[
P(\{O\}_{1,T}) = \sum_{jk}\kappa^X_{T,jk}.
\]
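To make the two-chain bookkeeping concrete, here is a minimal NumPy sketch of the forwards pass above. The array conventions and function name are our own assumptions, not from the thesis, and a practical implementation would rescale each $\kappa$ (this is the role of the scale variable $\alpha$) to avoid numerical underflow:

    import numpy as np

    def forwards_pass(pVf_x, pI_w, Theta_X, Theta_W, Pi_X, Pi_W):
        """Forwards pass for the coupled chains (sketch of the kappa recursions).

        pVf_x[t,i] = P(grad f_t | X_{t,i})  (Eq. 3.19)
        pI_w[t,j]  = P(I_t | W_{t,j})       (configuration mixture model)
        Theta_X[i,j,k] = P(X_t=i | W_t=j, X_{t-1}=k)
        Theta_W[i,j,k] = P(W_t=i | X_{t-1}=j, W_{t-1}=k)
        Pi_X[i,j] = P(X_1=i | W_1=j);  Pi_W[j] = P(W_1=j)
        Returns kX with kX[t,i,j] = P(X_{t,i} W_{t,j} {O}_{1,t}).
        """
        T, Nx = pVf_x.shape
        Nw = pI_w.shape[1]
        kX = np.zeros((T, Nx, Nw))
        # kappa^X_{1,ij} = P(grad f_1|X_{1,i}) P(I_1|W_{1,j}) Pi_X(i,j) Pi_W(j)
        kX[0] = pVf_x[0][:, None] * pI_w[0][None, :] * Pi_X * Pi_W[None, :]
        for t in range(1, T):
            # kappa^W_{t,ij} = P(I_t|W_{t,i}) sum_k Theta_W[i,j,k] kappa^X_{t-1,jk}
            kW = pI_w[t][:, None] * np.einsum('ijk,jk->ij', Theta_W, kX[t - 1])
            # kappa^X_{t,ij} = P(grad f_t|X_{t,i}) sum_k Theta_X[i,j,k] kappa^W_{t,jk}
            kX[t] = pVf_x[t][:, None] * np.einsum('ijk,jk->ij', Theta_X, kW)
        likelihood = kX[-1].sum()   # P({O}_{1,T}) = sum_jk kappa^X_{T,jk}
        return kX, likelihood

The backwards pass mirrors this structure, alternating $\beta^W_t$ and $\beta^X_{t-1}$ as in the procedure above.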
D.1 Estimating State

Given that we have a coupled hidden Markov model such as we have just described, we may be interested in estimating what the temporal state evolution of the dynamics and configuration processes is for some set of observations. That is, we want to find the single best state sequence, $Y_1\ldots Y_{N_t}$, given observations, $O_1\ldots O_{N_t}$, which can be formulated as maximizing $P(\{Y\}_{1,N_t}|\{O\}_{1,N_t}\Theta)$, which is equivalent to maximizing $P(\{Y\}_{1,N_t}\{O\}_{1,N_t}|\Theta)$. The Viterbi algorithm performs this maximization. It needs to be modified slightly to accommodate the two temporal chains. Define $\delta_t(i,j)$ to be the highest probability along a single path which accounts for the first $t$ observations and ends in state $x_t = i, w_t = j$:
\[
\delta_t(i,j) = \max_{\{Y\}_{1,t-1}}P(\{Y\}_{1,t-1}\,x_{t,i}\,w_{t,j}\,\{O\}_{1,t}).
\]
We derive an inductive formula for this quantity as follows:
\[
\delta_t(i,j) = \max_{k,l}\ \max_{\{Y\}_{1,t-2}}P(\{Y\}_{1,t-2}\,x_{t-1,k}\,w_{t-1,l}\,\{O\}_{1,t-1})\,P(O_t\,x_{t,i}\,w_{t,j}|x_{t-1,k}\,w_{t-1,l})
\]
\[
= \max_{k,l}\,\delta_{t-1}(k,l)\,P(O_t|x_{t,i}w_{t,j})\,P(x_{t,i}w_{t,j}|x_{t-1,k}w_{t-1,l})
= \max_{k,l}\left[\delta_{t-1}(k,l)\,\Theta_{Xijk}\,\Theta_{Wjkl}\right]P(\nabla f_t|x_{t,i})\,P(I_t|w_{t,j}). \tag{D.8}
\]
The optimal state sequence is given by the argument which maximized Equation (D.8) for each $t, i$ and $j$. We keep track of this using the array $\psi_t(i,j)$. The procedure is as follows:

1. Initialization:
\[
\delta_1(i,j) = \Pi_{Wj}\,\Pi_{Xij}\,P(\nabla f_1|x_{1,i})\,P(I_1|w_{1,j})
\]
2. Recursion:
\[
\delta_t(i,j) = \max_{k,l}\left[\delta_{t-1}(k,l)\,\Theta_{Xijk}\,\Theta_{Wjkl}\right]P(\nabla f_t|x_{t,i})\,P(I_t|w_{t,j})
\]
\[
\psi_t(i,j) = \arg\max_{k,l}\left[\delta_{t-1}(k,l)\,\Theta_{Xijk}\,\Theta_{Wjkl}\right]
\]
3. Termination:
\[
Y^*_{N_t} = \arg\max_{k,l}\,\delta_{N_t}(k,l)
\]
and the optimal path is recovered by backtracking, $Y^*_t = \psi_{t+1}(Y^*_{t+1})$, where $Y^*_t = \{X^*_t, W^*_t\}$ is the state at time $t$ along the optimal path.
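The following sketch implements this two-chain Viterbi procedure. The input conventions follow the forwards-pass sketch above; working in log space is our own implementation choice (it avoids underflow but is not prescribed by the text, and assumes strictly positive parameters):

    import numpy as np

    def viterbi_coupled(pVf_x, pI_w, Theta_X, Theta_W, Pi_X, Pi_W):
        """Best joint state sequence {X*, W*} for the coupled chains (Eq. D.8)."""
        T, Nx = pVf_x.shape
        Nw = pI_w.shape[1]
        lTX, lTW = np.log(Theta_X), np.log(Theta_W)
        # step 1: initialization
        delta = np.log(Pi_W)[None, :] + np.log(Pi_X) \
              + np.log(pVf_x[0])[:, None] + np.log(pI_w[0])[None, :]
        psi = np.zeros((T, Nx, Nw, 2), dtype=int)
        # step 2: recursion
        for t in range(1, T):
            # score[i,j,k,l] = delta_{t-1}(k,l) + log Theta_X[i,j,k] + log Theta_W[j,k,l]
            score = delta[None, None] + lTX[:, :, :, None] + lTW[None, :, :, :]
            flat = score.reshape(Nx, Nw, -1)
            best = flat.argmax(axis=2)
            psi[t, :, :, 0], psi[t, :, :, 1] = np.unravel_index(best, (Nx, Nw))
            delta = flat.max(axis=2) \
                  + np.log(pVf_x[t])[:, None] + np.log(pI_w[t])[None, :]
        # step 3: termination and backtracking
        i, j = np.unravel_index(delta.argmax(), (Nx, Nw))
        path = [(i, j)]
        for t in range(T - 1, 0, -1):
            i, j = psi[t, i, j]
            path.append((i, j))
        return path[::-1]   # [(X*_1, W*_1), ..., (X*_T, W*_T)]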
The summand can be factored as follows P ( A , i A - i J - | 0 , A , C 6 r ' ) = P ( A , i A - i j O A C ) / P ( O A C ) = P ( { O A C } | A , i A - i j { O A C } ) P ( A , i A - i j { O A C } ) / P ( O A C ) t+l,T l,t l,t = Pt,iP(At\Dttil Ct,k)P(Ot\Dt}i)eDijkat-hj where atj = P(Dtj{OAC}) and /3 t ) i = P({OAC}|A,i) are the usual forwards and i,t ' t+l,T backwards variables, for which we can derive recursive updates « t j = E p(A,iA-i,fc{OAC}) k = V P{OtAt\DtJ A-i,jfeCt{OAC})P(Aj A-i,*Ct{OAC}) = y;p(o t |Aj)P(^|A, J-Ct)P(AjlA-i,fc,C7 t)P(A-i,fc{OAC}) k = p{Qt\ Dt,j)&A*j*®pjk*ttt-i,k where we write @A*J* = P(At = * | A j A = *) Pt-i,i = VJp({OAC}A,fc|A-i,i) = V P ( { O A C } | O t A t C t A , f e A - i , i ) P ( O t A t C t A , f c | A - i , i ) fc *+I,T = £ A,fcQ>t*fc*P(Otl A,fc)Qpfcj* 2 2 6 The updates to the output distribution over A , &Aijk = P(Atti\DtjCt,k) a r e @Mjk= E P ( A j | O A C ) (D.10) te{i...Art}|At=iACt=fe where the summand is expanded as P(DtJ\OAC) = P ( A j O A C ) / P ( O A C ) = P({OJ t+: = Pt,jOct,j A C } | A J { O A C } ) P ( A J { O A C } ) l,T l , i l,t 227 
