UBC Theses and Dissertations
Accelerating reinforcement learning through imitation. Price, Robert Roy (2002)


Full Text

Accelerating Reinforcement Learning through Imitation

by

Robert Roy Price
M.Sc., University of Saskatchewan, 1995
B.C.S., Carleton University, 1993

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
December 2002
© Robert Roy Price, 2002

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

The University of British Columbia
Vancouver, Canada

Abstract

Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments an agent's ability to learn useful behaviours by making intelligent use of the knowledge implicit in behaviors demonstrated by cooperative teachers or other more experienced agents.
Using reinforcement learning theory, we construct a new, formal framework for imitation that permits agents to combine prior knowledge, learned knowledge and knowledge extracted from observations of other agents. This framework, which we call implicit imitation, uses observations of other agents to provide an observer agent with information about its action capabilities in unexperienced situations. Efficient algorithms are derived from this framework for agents with both similar and dissimilar action capabilities, and a series of experiments demonstrate that implicit imitation can dramatically accelerate reinforcement learning in certain cases. Further experiments demonstrate that the framework can handle transfer between agents with different reward structures, learning from multiple mentors, and selecting relevant portions of examples. A Bayesian version of implicit imitation is then derived and several experiments are used to illustrate how it smoothly integrates model extraction with prior knowledge and optimal exploration. Initial work is also presented to illustrate how extensions to implicit imitation can be derived for continuous domains, partially observable environments and multi-agent Markov games.
The derivation of algorithms and experimental results are supplemented with an analysis of relevant domain features for imitating agents, and some performance metrics for predicting imitation gains are suggested.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
2 Background Material
2.1 Markov Decision Processes
2.2 Model-based Reinforcement Learning
2.3 Bayesian Reinforcement Learning
2.4 Bayesian Exploration
2.5 Probabilistic Reasoning and Bayesian Networks
2.6 Model-Free Learning
2.7 Partial Observability
2.8 Multi-agent Learning
2.9 Technical Definitions
3 Natural and Computational Imitation
3.1 Conceptual Issues
3.2 Practical Implementations
3.3 Other Issues
3.4 Analysis of Imitation in Terms of Reinforcement Learning
3.5 A New Model: Implicit Imitation
3.6 Non-interacting Games
4 Implicit Imitation
4.1 Implicit Imitation in Homogeneous Settings
4.1.1 Homogeneous Actions
4.1.2 The Implicit Imitation Algorithm
4.1.3 Empirical Demonstrations
4.2 Implicit Imitation in Heterogeneous Settings
4.2.1 Feasibility Testing
4.2.2 k-step Similarity and Repair
4.2.3 Empirical Demonstrations
5 Bayesian Imitation
5.1 Augmented Model Inference
5.2 Empirical Demonstrations
5.3 Summary
6 Applicability and Performance of Imitation
6.1 Assumptions
6.2 Predicting Performance
6.3 Regional Texture
6.4 The Fracture Metric
6.5 Suboptimality and Bias
7 Extensions to New Problem Classes
7.1 Unknown Reward Functions
7.2 Benefiting from penalties
7.3 Continuous and Model-Free Learning
7.4 Interaction of Agents
7.5 Partially Observable Domains
7.6 Action Planning in Partially Observable Environments
7.7 Non-stationarity
8 Discussion and Conclusions
8.1 Applications
8.2 Concluding Remarks
Bibliography

List of Tables

4.1 An algorithm to compute the Augmented Bellman Backup
4.2 An algorithm for action feasibility testing
4.3 Elaborated augmented backup test

List of Figures

2.1 Explicit distribution over exhaustive enumeration of cloud-rain-umbrella states
2.2 Compact distributions over independent factors for cloud-rain-umbrella domain
4.1 V_o and V_m represent value estimates derived from observer and mentor experience respectively. The lower bounds V_o^- and V_m^- incorporate an "uncertainty penalty" associated with these estimates
4.2 A simple grid world with start state S and goal state X showing representative trajectories for a mentor (solid) and an observer (dashed)
4.3 On a simple grid world, an observer agent 'Obs' discovers the goal and optimizes its policy faster than a control agent 'Ctrl' without observation. The 'Delta' curve plots the gain of the observer over the control
4.4 Delta curves compare gain due to observation for the 'Basic' scenario described above, a 'Scale' scenario with a larger state space and a 'Stoch' scenario with increased noise
4.5 Prior models assume diagonal connectivity and hence that the +5 rewards are accessible to the observer. In fact, the observer has only 'NEWS' actions. Unupdated priors about the mentor's capabilities may lead to overvaluing diagonally adjacent states
4.6 Misleading priors about mentor abilities may eliminate imitation gains and even result in some degradation of the observer's performance relative to the control agent, but confidence testing prevents the observer from permanently fixating on infeasible goals
4.7 A complex maze with 343 states has considerably less local connectivity than an equivalently-sized grid world. A single 133-step path is the shortest solution
4.8 On the difficult maze problem, the observer 'Obs' converges long before the unaided control 'Ctrl'
4.9 The shortcut (4 → 5) in the maze is too perilous for unknowledgeable (or unaided) agents to attempt
4.10 The observer 'Obs' tends to find the shortcut and converges to the optimal policy, while the unaided control 'Ctrl' typically settles on the suboptimal solution
4.11 A scenario with multiple mentors. Each mentor covers only part of the space of interest to the observer
4.12 The gain from simultaneous imitation of multiple mentors is similar to that seen in the single-mentor case
4.13 An alternative path can bridge value backups around infeasible paths
4.14 The agent without feasibility testing 'NoFeas' can become trapped by overvalued states. The agent with feasibility testing 'Feas' shows typical imitation gains
4.15 Obstacles block the mentor's intended policy, though stochastic actions cause the mentor to deviate somewhat
4.16 Imitation gain is reduced by observer obstacles within mentor trajectories but not eliminated
4.17 Generalizing an infeasible mentor trajectory
4.18 A largely infeasible trajectory is hard to learn from and minimizes gains from imitation
4.19 The River scenario in which squares with grey backgrounds result in a penalty for the agent
4.20 The agent with repair finds the optimal solution on the other side of the river
5.1 Dependencies among transition model T and evidence sources H_o, H_m
5.2 The Chain-world scenario in which the myopically attractive action b is suboptimal in the long run
5.3 Chain-world results (averaged over 10 runs)
5.4 Loop domain
5.5 Loop domain results (averaged over 10 runs)
5.6 Flag-world
5.7 Flag-world results (averaged over 50 runs)
5.8 Flag-world, moved goal (averaged over 50 runs)
5.9 Schematic illustration of a transition in the tutoring domain
5.10 Tutoring domain (averaged over 50 runs)
5.11 No South scenario
5.12 No South results (averaged over 50 runs)
6.1 A system of state space region classifications for predicting the efficacy of imitation
6.2 Four scenarios with equal solution lengths but different fracture coefficients
6.3 Percentage of runs (of ten) converging to optimal policy given fracture Φ and exploration rate δ
7.1 Overview of influences on observer's observations
7.2 Detailed model of influences on observer's observations

Acknowledgements

First of all, I would like to thank my supervisor, Craig Boutilier, for his unwavering patience, optimism and professionalism over too many years. I think I owe him about two dozen beers, or would that be coffees? Through him I've come to appreciate sophistication in grammar and the value of style, learned something of the fine art of finding the gems among the stones, and gained exposure to a vast and powerful set of tools for reasoning about uncertainty.

Secondly, I would like to thank the many reviewers of my work, both anonymous and personal, who, even if they did not always further my publishing ambitions, certainly furthered my understanding of both the breadth of what might be called the field of learning by observation and the technical craft of scientific investigation and communication.

Thirdly, I would like to thank my office mate Pascal Poupart for many pleasant discussions on all sorts of problems, formal and informal, as well as for many apples. I am also grateful for all of his questions, which helped me keep in perspective the amount I have learned versus the amount I still have yet to know.

And finally, I would like to thank my thesis committee, university examiners and external examiner for their thorough reading of the thesis and their thought-provoking questions: thank you to Alan Mackworth, David Poole, Martin Puterman, David Lowe, Bertrand Clarke and Maja Mataric.

ROBERT ROY PRICE

The University of British Columbia
March 2003

To Lottie, with love.
Chapter 1

Introduction

Computational reinforcement learning is the study of techniques for automatically adapting an agent's behavior to maximize some objective function, typically set by the agent's designer. Since the designer need not specify how the agent will achieve the objective, reinforcement learning is a popular choice for the design of agents when suitable instructions for the agent cannot be easily encoded, or the agent's environment is difficult to adequately characterize. Since the agent has only the objective function to guide its search for optimal behavior, reinforcement learning typically requires the agent to engage in extensive experimentation with its environment and extended search through possible action policies.

The field of multi-agent reinforcement learning extends reinforcement learning to the situation in which multiple agents independently attempt to learn behaviors in a shared environment. In general, the multi-agent dimension adds additional complexity.
For instance, it is well known that a naive application of reinforcement learning techniques to environments in which agents can change their behavior in response to other agents' actions can lead to cycling in the search for optimizing actions and a consequent failure to converge to optimal behavior. Cyclic divergence has been addressed using techniques such as game-theoretic equilibria (Littman, 1994). The presence of multiple agents, however, can also simplify the learning problem: agents may act cooperatively by pooling information (Tan, 1993) or by distributing search (Mataric, 1998).

Another way in which individual agent performance can be improved is by having a novice agent learn reasonable behavior from an expert mentor. This type of learning can be brought about through explicit teaching or demonstration (Lin, 1992; Whitehead, 1991), by sharing of privileged information (Mataric, 1998), or through what might more traditionally be thought of as imitation (Atkeson & Schaal, 1997; Bakker & Kuniyoshi, 1996). In imitation, an agent observes another agent performing an activity in context and then attempts to replicate the observed agent's behavior.
Typically, the imitator must ground its observations of other agents' behaviors in terms of its own capabilities and resolve any ambiguities in observations arising from partial observability and noise. A common thread in all of this work is the use of a mentor to guide the exploration of the observer. Guidance is typically achieved through some form of explicit communication between mentor and observer. Imitation is related to learning apprentice systems (Mitchell, Mahadevan, & Steinberg, 1985), but imitation research generally assumes the acquired knowledge will be applied by the agent itself, whereas learning apprentices typically work in conjunction with a human expert.

In existing approaches to imitation, we see a number of underlying assumptions. In teaching models, it is assumed that there is a cooperative mentor willing to provide appropriately encoded example inputs for agents. It is also frequently assumed that the mentor's action choices can be directly observed. In programming by demonstration, it is often assumed[1] that the observer has identical objectives to the expert mentor. Many explicit imitation models assume that behaviors can be represented as indivisible and unalterable units that can be evaluated and then replayed.
[1] We understand an action choice to be the agent's intended control signal, as opposed to changes in the agent's state resulting from the direct effects of an action, such as the agent's position or the orientation of the agent's effectors.

We suggest that there are practical problems for which it is useful to relax some of these assumptions. For instance, a mentor may be unwilling or unable to alter its behavior to teach the observer, or even communicate information to it: common communication protocols may be unavailable to agents designed by different developers (e.g., Internet agents); agents may find themselves in a competitive situation in which there is a disincentive to share information or skills; or there may simply be no incentive for one agent to provide information to another.[2]

If the explicit communication assumption is relaxed, we immediately encounter difficulties with the assumption that the mentor's action choices can be observed: due to perceptual aliasing, it is frequently impossible to know an agent's intended action choice. We assume that an agent's action choice can have non-deterministic outcomes. The outcomes, however, may take the form of a change in the state of the environment, may enhance the persistence of some state of the environment, or may have no discernible effect on the resulting state of the environment.
It is quite possible that two or more of the agent's action choices could have the same probabilistic outcomes in a given situation and therefore be indistinguishable. In general, in the absence of explicit communication, an observer cannot know the mentor's intended action choice with certainty (although the observer may be able to observe changes in the mentor's state, such as a change in the position of the mentor's effectors or its location).

We may also want to relax the assumption that the mentor and observer have identical objectives. Even in cases where mentor and observer objectives differ, the two agents may still need to perform similar subtasks. In these circumstances, an observer might still want to learn behaviors from a mentor, but apply them to accomplish a different goal. In order to be able to identify and reuse these subtasks, the observer cannot treat mentor behaviors

[2] For reasons of consistency, we use the term "mentor" to describe any agent from which an observer can learn, even if the mentor is an unwilling or unwitting participant.
In contrast to e x i s t i n g m o d e l s , it does not a s s u m e the e x i s t e n c e o f a c o o p e r -  a t i v e mentor, that agents c a n e x p l i c i t l y c o m m u n i c a t e , that the a c t i o n c h o i c e s o f other agents are d i r e c t l y o b s e r v a b l e , o r that the agents share i d e n t i c a l o b j e c t i v e s . I n a d d i t i o n , o u r framew o r k a l l o w s the o b s e r v e r to integrate p a r t i a l b e h a v i o r s f r o m m u l t i p l e m e n t o r s , a n d a l l o w s the o b s e r v i n g agent to refine these b e h a v i o r s at a  fine-grained  l e v e l u s i n g its o w n e x p e r i -  ence. I m p l i c i t i m i t a t i o n ' s a b i l i t y to r e l a x these a s s u m p t i o n s d e r i v e s f r o m its different v i e w o f the i m i t a t i o n p r o c e s s . I n i m p l i c i t i m i t a t i o n , the g o a l o f the agent is the a c q u i s i t i o n o f i n f o r m a t i o n about its p o t e n t i a l a c t i o n c a p a b i l i t i e s i n u n e x p e r i e n c e d s i t u a t i o n s rather than the d u p l i c a t i o n o f b e h a v i o r s o f other agents. W e d e v e l o p the i m p l i c i t i m i t a t i o n f r a m e w o r k i n s e v e r a l stages. W e start w i t h a s i m p l i f i e d p r o b l e m i n w h i c h n o n - i n t e r a c t i n g agents h a v e homogeneous  actions:  that i s , the l e a r n -  i n g agent a n d the m e n t o r h a v e i d e n t i c a l a c t i o n c a p a b i l i t i e s . I n i t i a l l y , h o w e v e r , the o b s e r v e r m a y b e u n a w a r e o f w h a t these c a p a b i l i t i e s are i n a l l p o s s i b l e s i t u a t i o n s . I n the h o m o g e n e o u s setting, the o b s e r v e r c a n use o b s e r v a t i o n s o f changes i n the m e n t o r ' s e x t e r n a l state to e s t i m a t e a m o d e l o f its o w n p o t e n t i a l a c t i o n c a p a b i l i t i e s i n u n e x p l o r e d s i t u a t i o n s . 
Through the adaptation of standard reinforcement learning techniques, these potential capability models can be used to bias the observer's exploration towards the most promising actions. The potential capability models can also be used to allocate computational effort associated with reinforcement learning techniques more efficiently. Through these two mechanisms, implicit imitation may reduce the cost and speed the convergence of reinforcement learning.

When agents have heterogeneous action capabilities, the observer may not be able to perform every action the mentor can in every situation. The observer may therefore be misled by observations of the mentor about its potential capabilities and may overestimate the value of situations, with potentially disastrous consequences. We formalize this overestimation problem and develop mechanisms for extending implicit imitation to this context. We develop a new procedure called action feasibility testing and show how it can identify infeasible mentor actions and thereby minimize the effects of overestimation. We go on to show that a sequence of locally infeasible actions may be part of a larger family of action sequences that implement a desired behavior. We derive a mechanism called k-step repair and show how it can generalize mentor behaviors that would require locally infeasible actions to an alternative behavior that is feasible for the observer.

Implicit imitation is based on the idea of transferring detailed information about action capabilities from mentor to observer. At the expense of further complexity, we demonstrate that Bayesian methods enable the transfer of richer action capability models. The Bayesian representation allows designers to incorporate prior information about both the observer and mentor action capabilities; permits observers to avoid suboptimal actions in situations on the basis of mentor observations; and allows the agent to employ optimal Bayesian exploration methods in order to rationally trade off the cost of exploring unknown actions against the benefits of exploiting known actions. Bayesian exploration also eliminates tedious exploration-rate parameters often found in reinforcement learning algorithms. Using standard approximations, we derive a computationally tractable algorithm and demonstrate it on several examples.

The Bayesian model opens up rich inference capabilities.
We show how the Bayesian formulation enables us to derive a mechanism for transfer between mentor and observer when the environment is only partially observable. We point out that partial observability has specific implications for imitators, such as the need to evaluate the benefit of taking actions that increase the visibility of the mentor.

Our earlier versions of implicit imitation are formulated in terms of discrete environment states and actions using an explicit model of the agent's environment. We explain the model-based and model-free paradigms within reinforcement learning, show that implicit imitation can be reformulated in a model-free form, and propose specific mechanisms for implementing implicit imitation functions. As with unguided reinforcement learning, moving to a model-free paradigm permits implicit imitation to be extended to environments represented in terms of continuous state variables.

Our final extension of the implicit imitation framework considers how implicit imitation can be adapted to the case where there are multiple interacting agents that can affect each other's environment.
We argue that knowledge about how to be successful in an interacting multi-agent environment can also be acquired by imitation and that imitation is therefore an important strategy for finding good behaviors in interacting multi-agent contexts (e.g., game-theoretic scenarios).

Our formulation of implicit imitation is based upon the framework provided by Markov decision processes, reinforcement learning and Bayesian methods. Chapter 2 reviews these topics and establishes our notation. Chapter 3 examines various approaches to imitation and attempts to understand existing approaches in terms of reinforcement learning and to expose common assumptions in this work. The examination of assumptions leads to a new model of imitation based on the transfer of transition models. Chapter 4 develops two specific instantiations of our new framework for homogeneous and heterogeneous settings. Chapter 5 introduces the Bayesian approach to implicit imitation and explains the benefits of Bayesian exploration for imitation. Chapter 6 examines the applicability of implicit imitation in more depth and develops a metric for predicting imitation performance based on domain characteristics.
Chapter 7 presents initial work on the extension of implicit imitation to environments with partial observability, multiple interacting agents and continuous state variables. This chapter also points out that the extension of implicit imitation to new problem classes introduces additional problems specific to imitators and briefly discusses them. Chapter 8 discusses some general challenges for imitation research and how one might tackle these problems before concluding with a brief summary of the contribution implicit imitation makes to the fields of imitation and reinforcement learning.

Chapter 2

Background Material

Our model of imitation is based on the formal framework of Markov decision processes, reinforcement learning, graphical models and multi-agent systems. This chapter briefly reviews some of the most relevant material within these fields. Throughout the chapter, the reader is referred to a number of background references for more detail.
2.1  Markov Decision Processes

Markov decision processes (MDPs) have proven very useful in modeling stochastic sequential decision problems, and have been widely used in decision-theoretic planning to model domains in which an agent's actions have uncertain effects, an agent's knowledge of the environment may be uncertain, and multiple objectives must be traded off against one another (Bertsekas, 1987; Boutilier, Dean, & Hanks, 1999; Puterman, 1994). This section describes the basic MDP model and reviews one classical solution procedure.

An MDP can be viewed as a stochastic automaton in which an agent's actions influence the transitions between states, and rewards are obtained depending on the states visited by an agent. Initially, we will concern ourselves with fully observable MDPs. In a fully observable MDP, it is assumed that the agent can determine its current state with certainty (e.g., there are no sensor errors or limits to the range of sensors). Formally, a fully observable MDP can be defined as a tuple (S, A, T, R), where S is a finite set of states or possible worlds, A is a finite set of actions, T is a state transition function, and R is a reward function. The agent can influence future states of the system to some extent by choosing actions a from the set A.
Actions are stochastic in that the future states cannot generally be predicted with certainty. The transition function T : S × A × S → ℝ describes the effects of each action at each state. For compactness, we write T_{s,a}(s') to denote the probability of ending up in future state s' ∈ S when action a is performed at the current state s. We require that 0 ≤ T_{s,a}(s') ≤ 1 for all s, s', a, and that for all s, a, Σ_{s'∈S} T_{s,a}(s') = 1. The components S, A and T determine the dynamics of the system being controlled. Since the system is fully observable (that is, the agent knows the true state s at each time t once that stage is reached), its choice of action can be based solely on the current state. Thus uncertainty lies only in the prediction of an action's effects, not in determining its actual effect after its execution.

A (deterministic, stationary, Markovian) policy π : S → A describes a course of action to be adopted by an agent controlling the system. An agent adopting such a policy performs action π(s) whenever it finds itself in state s. Policies of this form are Markovian since the action choice at any state does not depend on the system history, and are stationary since action choice does not depend on the stage of the decision problem. For the problems we consider, optimal stationary Markovian policies always exist.
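To make these components concrete, the tuple (S, A, T, R) and a stationary Markovian policy can be written out directly. The two-state MDP below is an invented illustration, not an example from the thesis:

```python
# A tiny, hand-built fully observable MDP (S, A, T, R).
# T[s][a][t] is the probability T_{s,a}(t) of reaching t when
# action a is taken in state s.
S = ["home", "work"]
A = ["stay", "move"]
T = {
    "home": {"stay": {"home": 0.9, "work": 0.1},
             "move": {"home": 0.2, "work": 0.8}},
    "work": {"stay": {"home": 0.1, "work": 0.9},
             "move": {"home": 0.8, "work": 0.2}},
}
R = {"home": 0.0, "work": 1.0}  # instantaneous reward R(s) for occupying s

# Each row of T must be a proper probability distribution over successors.
for s in S:
    for a in A:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9

# A deterministic, stationary, Markovian policy: a map from states to actions.
policy = {"home": "move", "work": "stay"}
```

Because the policy depends only on the current state, executing it requires no memory of the system history.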
We assume a bounded, real-valued reward function R : S → ℝ; R(s) is the instantaneous reward an agent receives for occupying state s. (MDPs can be generalized in a natural fashion to actions with costs and to domains with continuous state and action spaces.) A number of optimality criteria can be adopted to measure the value of a policy π, all measuring in some way the reward accumulated by an agent as it traverses the state space through the execution of π. In this work, we focus on discounted infinite-horizon problems: the current value of a reward received t stages in the future is discounted by some factor γ^t (0 < γ < 1). This allows simpler computational methods to be used, as discounted total reward will be finite. In many situations, discounting can be justified on practical grounds (e.g., economic and random termination) as well. The value V_π : S → ℝ of a policy π at state s is simply the expected sum of discounted future rewards obtained by executing π beginning at s. A policy π* is optimal if, for all s ∈ S and all policies π, we have V_{π*}(s) ≥ V_π(s). We are guaranteed that such optimal (stationary) policies exist in our setting (Puterman, 1994, see Ch. 5). The (optimal) value of a state V*(s) is its value V_{π*}(s) under any optimal policy π*.

By solving an MDP, we are referring to the problem of constructing an optimal policy. Value iteration (Bellman, 1957) is a simple iterative approximation algorithm for optimal policy construction. Given some arbitrary estimate V^0 of the true value function V*, we iteratively improve this estimate as follows:

    V^n(s) = R(s) + max_{a∈A} { γ Σ_{s'∈S} T_{s,a}(s') V^{n-1}(s') }    (2.1)

The computation of V^n(s) given V^{n-1} is known as a Bellman backup. The sequence of value functions V^n produced by value iteration converges linearly to V*. Each iteration of value iteration requires O(|S|²|A|) computation time, and the number of iterations is polynomial in |S|. For some finite n, the actions a that maximize the right-hand side of Equation 2.1 form an optimal policy, and V^n approximates its value. Various termination criteria can be applied. For instance, the value function V^{i+1} is within ε/2 of the optimal function V* at any state whenever Equation 2.2 is satisfied (Puterman, 1994, see Sec. 6.3.2):

    ||V^{i+1} − V^i|| < ε(1 − γ)/2γ    (2.2)

The double bar notation above, ||X||, should be taken as the supremum norm, defined max{|x| : x ∈ X}, that is, the maximum component of the vector.
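As an illustration (ours, not code from the thesis), Equation 2.1 and the termination test of Equation 2.2 translate into a short routine. The sketch assumes T and R are supplied as nested dictionaries, and it returns both the final value estimate and the greedy policy formed from the maximizing actions:

```python
def value_iteration(S, A, T, R, gamma=0.9, epsilon=1e-6):
    """Iterate Equation 2.1 until Equation 2.2 holds, then extract a policy."""
    V = {s: 0.0 for s in S}                         # arbitrary initial estimate V^0
    threshold = epsilon * (1 - gamma) / (2 * gamma)  # Equation 2.2
    while True:
        # One Bellman backup at every state (Equation 2.1).
        V_new = {
            s: R[s] + gamma * max(
                sum(T[s][a][t] * V[t] for t in S) for a in A)
            for s in S
        }
        # Supremum norm ||V^{i+1} - V^i||; stopping here leaves V_new
        # within epsilon/2 of V* at every state.
        if max(abs(V_new[s] - V[s]) for s in S) < threshold:
            V = V_new
            break
        V = V_new
    # The maximizing actions of Equation 2.1 form an (approximately) optimal policy.
    pi = {s: max(A, key=lambda a: sum(T[s][a][t] * V[t] for t in S)) for s in S}
    return V, pi
```

Any finite MDP expressed in this dictionary form can be passed directly; the returned pi holds the value-maximizing action at each state.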
Many additional methods exist for solving MDPs, including policy iteration, modified policy iteration methods and linear programming formulations (Puterman, 1994), as well as asynchronous methods and methods using approximate representations of the value function such as neuro-dynamic programming (Bertsekas & Tsitsiklis, 1996).

The computational complexity (in terms of arithmetic operations) of solving an MDP under the expected discounted cumulative reward criterion is known to be polynomial in the number of states (Littman, Dean, & Kaelbling, 1995). This bound is hardly reassuring, however, as the number of states in a problem is generally exponential in the number of variables or attributes in the problem. The polynomial time result is obtained through the use of a standard transformation from MDPs to linear programs. Although worst-case polynomial algorithms exist for linear programming, the polynomial is high-order. Often, a variation of the Simplex method is used, which is often much faster than the guaranteed polynomial algorithm, but exponential in the worst case. The complexity of alternative techniques for solving MDPs, such as policy iteration and value iteration with a general discount factor, remains an open question.
In the case of policy iteration, it is known that any given policy can be evaluated using matrix inversion in O(n³), but presently the tightest bound admits a possibly exponential number of evaluations. Further, Papadimitriou and Tsitsiklis (1987) show that MDPs are P-complete, which means that an efficient parallel solution of MDPs would imply an efficient parallel solution to all problems in the class P, which is considered unlikely by researchers in the field. From a numerical point of view, the value function updated by value iteration or policy iteration is known to converge linearly to the optimal value (Puterman, 1994, see Theorem 6.3.3). Given specific conditions, the value function updated under policy iteration may converge quadratically (Puterman, 1994, see Theorem 6.4.8).

A concept that will be useful later is that of a Q-function. Given an arbitrary value function V, we define Q^V_a(s) as

    Q^V_a(s) = R(s) + γ Σ_{s'∈S} T_{s,a}(s') V(s')    (2.3)

Intuitively, Q^V_a(s) denotes the value of performing action a at state s and then acting in a manner that has value V (Watkins & Dayan, 1992). In particular, we define Q* to be the Q-function defined with respect to V*, and Q^n to be the Q-function defined with respect to V^{n-1}. In this manner, we can rewrite Equation 2.1 as:

    V^n(s) = max_a {Q^n_a(s)}    (2.4)

We define an ergodic policy in an MDP as a policy which induces a Markov chain in which every state is reachable from any other state in a finite number of steps with non-zero probability.

2.2  Model-based Reinforcement Learning

One difficulty with the use of MDPs is that the construction of an optimal policy requires that the agent know the exact transition probabilities T and reward model R. In the specification of a decision problem, these requirements, especially the detailed specification of the
I n the g e n e r a l m o d e l , w e a s s u m e that an agent is c o n t r o l l i n g an M D P (S, A, T , R) a n d i n i t i a l l y k n o w s its state a n d a c t i o n spaces, S a n d A, b u t n o t the t r a n s i t i o n m o d e l T o r r e w a r d f u n c t i o n R. T h e agent acts i n its e n v i r o n m e n t , a n d at e a c h stage o f the p r o c e s s m a k e s a " t r a n s i t i o n " (s, a, r, t); that i s , it takes a c t i o n a at state s, r e c e i v e s r e w a r d r a n d m o v e s to state t. B a s e d o n repeated e x p e r i e n c e s o f this t y p e it c a n d e t e r m i n e an o p t i m a l p o l i c y i n o n e o f t w o w a y s : (a) i n  model-based reinforcement learning,  these e x p e r i e n c e s c a n be u s e d to  l e a r n the true nature o f T a n d R, a n d the M D P c a n be s o l v e d u s i n g standard m e t h o d s (e.g., v a l u e i t e r a t i o n ) ; o r (b) i n  model-free reinforcement learning, these e x p e r i e n c e s  can be used  to d i r e c t l y u p d a t e an estimate o f the o p t i m a l v a l u e f u n c t i o n o r Q - f u n c t i o n . P r o b a b l y the s i m p l e s t m o d e l - b a s e d r e i n f o r c e m e n t l e a r n i n g s c h e m e is the  equivalence approach.  certainty  I n t u i t i v e l y , a l e a r n i n g agent is a s s u m e d to h a v e s o m e current e s t i -  m a t e d t r a n s i t i o n m o d e l T o f its e n v i r o n m e n t c o n s i s t i n g o f e s t i m a t e d p r o b a b i l i t i e s T  s,a  a n d an e s t i m a t e d r e w a r d s m o d e l R(s).  (t)  W i t h e a c h e x p e r i e n c e (s, a, r , t) the agent updates  its e s t i m a t e d m o d e l s , s o l v e s the e s t i m a t e d M D P M to o b t a i n an p o l i c y TT that w o u l d b e o p timal  if its estimated models were correct,  a n d acts a c c o r d i n g to that p o l i c y .  
To make the certainty equivalence approach precise, a specific form of estimated model and update procedure must be adopted. A common approach is to use the empirical distribution of observed state transitions and rewards as the estimated model. For simplicity, we assume the model will be applied to a discrete state and action space. For instance, if action a has been attempted C(s, a) times at state s, and on C(s, a, t) of those occasions state t has been reached, then the estimate T_{s,a}(t) = C(s, a, t)/C(s, a). If C(s, a) = 0, some prior estimate is used (e.g., one might assume all state transitions are equiprobable).

One difficulty with the certainty equivalence approach is the computational burden of re-solving an MDP M with each update of the models T and R (i.e., with each experience). As pointed out above, using common techniques such as value iteration or policy iteration could require computation polynomial or even exponential in the number of states, and certainly exponential in the number of variables in a problem. Generally, certainty equivalence is impractical. One could circumvent this to some extent by batching experiences and updating (and re-solving) the model only periodically.
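A minimal sketch of this bookkeeping (our illustration; the class name and interface are invented) maintains the counts C(s, a) and C(s, a, t), derives the estimated model on demand, and signals only periodically that the estimated MDP should be re-solved:

```python
from collections import defaultdict

class CertaintyEquivalenceLearner:
    """Empirical-count model estimate; re-solve the MDP only periodically."""

    def __init__(self, S, A, resolve_every=50):
        self.S, self.A = S, A
        self.count_sa = defaultdict(int)    # C(s, a)
        self.count_sat = defaultdict(int)   # C(s, a, t)
        self.experiences = 0
        self.resolve_every = resolve_every

    def observe(self, s, a, r, t):
        """Record one experience (s, a, r, t); return True when it is
        time to re-solve the estimated MDP."""
        self.count_sa[(s, a)] += 1
        self.count_sat[(s, a, t)] += 1
        self.experiences += 1
        return self.experiences % self.resolve_every == 0

    def T_hat(self, s, a, t):
        """Estimated probability T_{s,a}(t) = C(s, a, t) / C(s, a)."""
        n = self.count_sa[(s, a)]
        if n == 0:
            return 1.0 / len(self.S)   # prior: all successors equiprobable
        return self.count_sat[(s, a, t)] / n
```

When observe returns True, the agent would re-solve the estimated MDP (e.g., by value iteration over T_hat) and act on the resulting policy; batching experiences this way amortizes the planning cost.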
Alternatively, one could use computational effort judiciously and apply Bellman backups only at those states whose values (or Q-values) are likely to change the most given a change in the model. Moore and Atkeson's (1993) prioritized sweeping algorithm does just this. When T is updated by changing a component T_{s,a}(t), a Bellman backup is applied at s to update its estimated value V(s), as well as the Q-value Q(s, a). Suppose the magnitude of the change in V(s) is given by ΔV(s). For any predecessor w, the Q-values Q(w, a'), and hence the values V(w), can change if T_{w,a'}(s) > 0. The magnitude of the change is bounded by T_{w,a'}(s)ΔV(s). All such predecessors w of s are placed on a priority queue with T_{w,a'}(s)ΔV(s) serving as the priority. A fixed number of Bellman backups are applied to states in the order in which they appear on the queue. With each backup, any change in value can cause new predecessors to be inserted into the queue. In this way, computational effort is focussed on those states where a Bellman backup has the greatest impact due to the model change. Furthermore, the backups are applied only to a subset of states, and are generally only applied a fixed number of times (by way of contrast, in the certainty equivalence approach, backups are applied until convergence). Thus prioritized sweeping can be viewed as a specific form of asynchronous value iteration with optimized computational properties (Moore & Atkeson, 1993).

Both certainty equivalence and prioritized sweeping will eventually converge to an optimal value function and policy as long as acting optimally according to the sequence of estimated models leads the agent to explore the entire state-action space sufficiently. Unfortunately, the use of the value-maximizing action at each state is unlikely to result in thorough exploration in practice. For this reason, explicit exploration policies are invariably used to ensure that each action is tried at each state sufficiently often. By acting randomly (assuming an ergodic MDP), an agent is assured of sampling each action at each state infinitely often in the limit. Unfortunately, the actions of such an agent will fail to exploit (in fact, will be completely uninfluenced by) its knowledge of the optimal policy. This exploration-exploitation tradeoff refers to the tension between trying new actions in order to find out more about the environment and executing actions believed to be optimal on the basis of the current estimated model.

The most common method for exploration is the ε-greedy method, in which the agent chooses a random action a fraction ε of the time, where 0 < ε < 1.
Typically, ε is decayed over time to increase the agent's exploitation of its knowledge. Given a sufficiently slow decay schedule, the agent will converge to the optimal policy. In the Boltzmann approach, each action is selected with a probability proportional to the exponential of its value:

    Pr(a|s) = exp(Q(s, a)/τ) / Σ_{a'∈A} exp(Q(s, a')/τ)

The proportionality can be adjusted nonlinearly with the temperature parameter τ. As τ → 0, the probability of selecting the action with the highest value tends to 1. Typically, τ is started high so that actions are randomly explored during the early stages of learning. As the agent gains knowledge about the effects of its actions and the value of these effects, the parameter τ is decayed so that the agent spends more time exploiting actions known to be valuable and less time randomly exploring actions.

More sophisticated methods attempt to use information about model confidence and value magnitudes to plan a utility-maximizing exploration plan. An early approximation of this scheme can be found in the interval estimation method (Kaelbling, 1993). Bayesian methods have also been used to calculate the expected value of information to be gained from exploration (Meuleau & Bourgine, 1999; Dearden, Friedman, & Andre, 1999; Dearden, Friedman, & Russell, 1998).
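Both undirected strategies can be sketched in a few lines. The code below is our illustration, assuming Q-values are stored in a dictionary keyed by (state, action) pairs:

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps):
    """With probability eps pick a uniformly random action;
    otherwise exploit argmax_a Q(s, a)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, tau):
    """Sample an action with probability proportional to exp(Q(s, a)/tau)."""
    m = max(Q[(s, a)] for a in actions)   # subtract the max for numerical stability
    weights = [math.exp((Q[(s, a)] - m) / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

Decaying eps (or tau) across episodes implements the schedule described above, shifting the agent from random exploration toward exploitation of its learned Q-values.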
We will have a specific interest in Bayesian methods and will therefore explore this approach in detail in the next section on Bayesian reinforcement learning (Dearden et al., 1998).

The complexity of reinforcement learning can be characterized in two ways: sample complexity, the number of samples of the agent's environment required for the agent to find a good policy; and computational complexity, the amount of computation required by the agent to turn samples into a policy. The computational complexity of reinforcement learning methods varies considerably depending on the technique chosen to solve the problem. Model-free approaches to reinforcement learning such as Q-learning and TD(λ) with linear approximation take constant time per sample. Model-based approaches vary considerably in their time complexity. An implementation of the certainty-equivalence approach would involve solving a complete MDP at each stage, which can be guaranteed to be no worse than high-order polynomial time. A model-based approach using heuristics to update state values, such as prioritized sweeping, can be constructed to update the value function in time which is linear in the number of steps per sample.
To some extent, the computational complexity required to update the agent's value function after a single sample can be traded off against the number of samples required under a particular technique to achieve a good policy. The notion of sample complexity is therefore of particular interest in reinforcement learning. Where transitions are "ideally-mixed" it can be shown that agents can learn an ε-optimal policy with probability at least 1 − δ in time that is O((N log(1/ε)/ε²)(log(N/δ) + log log(1/ε))), or roughly O(N log N), where N is the number of states (Kearns & Singh, 1998a). Given knowledge of the maximum reward and its variance, a stronger result is possible which bounds both computation time and sample complexity to be polynomial in the number of states and horizon length (Kearns & Singh, 1998b). A similar bound can be found for discounted problems.

2.3  Bayesian Reinforcement Learning

Bayesian methods in model-based RL provide a general framework for deriving reinforcement learning algorithms that flexibly combine various sources of agent knowledge including prior beliefs about its transition and reward models, evidence collected through experience, and evidence collected indirectly through observations.
Bayesian reinforcement learning also maintains a richer representation in the form of probability distributions over transition and reward model parameters (essentially second-order distributions), whereas non-Bayesian methods typically represent these models only in terms of the mean values of parameters. For instance, a non-Bayesian transition model might have a single parameter representing the probability of a transition from state s to t. A Bayesian transition model has a probability density function over transition probabilities. In states where the model is unknown, the agent might have a uniform distribution over possible transition probabilities. In a state where the agent has reason to believe a transition is unlikely, one might find a distribution that assigns 90% probability to the transition having a probability below 50% and a 10% probability to the transition having a probability greater than 50%.

Formally, a Bayesian reinforcement learner assigns a prior density Pr(T) over the set T of possible transition models. Let H = (s_0, s_1, ..., s_t) denote the current state history of the agent up to time t, and A = (a_0, a_1, ..., a_{t−1}) denote the corresponding action history.
The posterior, or updated model, Pr(T|H, A) represents the agent's new beliefs about its transition model given its current experience H and A. The update is done through Bayes' rule (DeGroot, 1975, Sec. 2.2):

    Pr(T|H, A) = Pr(H|T, A) Pr(T|A) / Pr(H|A)

The denominator is simply a normalizing factor over the set of possible transition models T to make the resultant distribution sum to one. The normalizing factor is traditionally denoted by α. The potential for incorporating evidence other than H and A into the update of the posterior probability distribution makes this framework attractive for imitation agents.

In general, the representation of prior and posterior distributions over transition models is nontrivial. However, the formulation of Dearden et al. (1999) introduces a specific tractable model: the probability density over possible models Pr(T) is the product of independent local probability distribution models for each individual state and action; specifically,

    Pr(T) = Π_{s,a} Pr(T^{s,a})

and each density Pr(T^{s,a}) is Dirichlet with parameters n^{s,a} (DeGroot, 1975). The Bayesian update of a Dirichlet distribution has a simple closed form based on event counts. To model Pr(T^{s,a}) we require one parameter n^{s,a}(s') for each possible successor state s'.
The vector n^{s,a} collectively represents the counts for each possible successor state. Update of a Dirichlet is straightforward: given prior Pr(T^{s,a}; n^{s,a}) and data vector c^{s,a} (where c^{s,a}(s') is the number of observed transitions from s to s' under a), the posterior is given by parameters n^{s,a} + c^{s,a}. Dearden's representation therefore permits independent transition models at each step to be updated as follows:

    Pr(T^{s,a} | H^{s,a}) = α Pr(H^{s,a} | T^{s,a}) Pr(T^{s,a})        (2.7)

Here, H^{s,a} is the subset of the history referring only to transitions originating from s and influenced by action a. Efficient methods also exist for sampling possible transition models from the Dirichlet distributions.

2.4  Bayesian Exploration

Bayesian reinforcement learning maintains distributions over transition model parameters rather than point estimates. A Bayesian reinforcement learner can therefore quantify its uncertainty about its model estimates in terms of the spread of the distribution over possible models. Given this extra information, an agent could reason explicitly about its state of knowledge, choosing actions for both the expected value of the action (i.e., the estimated long-term return) and an estimate of the value of information that experience with the action might provide to the agent.
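Before turning to exploration, the Dirichlet bookkeeping of equation (2.7) can be made concrete in a few lines. The sketch below is ours (function names and state labels are invented); it uses the standard fact that a Dirichlet sample can be drawn by normalizing independent Gamma draws, so only the Python standard library is needed.

```python
import random

def dirichlet_update(n, c):
    """Posterior Dirichlet parameters: prior pseudo-counts n plus observed counts c."""
    return {s2: n[s2] + c.get(s2, 0) for s2 in n}

def sample_transition_model(n):
    """Draw one transition distribution Pr(.|s,a) from Dirichlet(n) by
    normalizing independent Gamma(n_i, 1) draws."""
    g = {s2: random.gammavariate(n[s2], 1.0) for s2 in n}
    total = sum(g.values())
    return {s2: g[s2] / total for s2 in g}

# Prior pseudo-counts for the successors of some (s, a), then three observed
# transitions to 's1' and one to 's2':
prior = {'s1': 1, 's2': 1, 's3': 1}
posterior = dirichlet_update(prior, {'s1': 3, 's2': 1})   # {'s1': 4, 's2': 2, 's3': 1}
model = sample_transition_model(posterior)                # one sampled model row
```

Repeated calls to `sample_transition_model` yield the sampled MDPs used for exploration below.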
This form of exploration is called Bayesian exploration (Dearden et al., 1999). To keep computations tractable, the agent generally computes only a myopic approximation of the value of information.

Since an agent's actions ultimately determine the agent's real payoffs in the world, the value of accurate knowledge about the world can be expressed in terms of its effect on the agent's payoffs. The value of information x is defined as the return the agent would receive acting optimally with the information x less the return the agent would receive acting without the information x (Howard, 1996). In the case of reinforcement learning, agents make their decisions based upon Q-values for each state and action. The value of accurate Q-values can therefore be computed by comparing the agent's return given the accurate Q-values with the agent's expected return given its prior uncertain Q-values. An agent's model, together with its reward function, determines its beliefs about its optimal Q-values. A change in the agent's beliefs about the environment model will therefore result in a change in its beliefs about its Q-values. Reinforcement learners can obtain new information about Q-values by taking exploratory actions, observing the results and calculating new Q-values from their updated models.
Since these exploratory actions potentially increase the agent's knowledge about the world, they will have an information value in addition to whatever concrete payoff they provide. Unfortunately, we cannot confidently predict how much information will be revealed to the agent by any one experiment with an action. We can, however, obtain an expectation over the value of the information such exploration could reveal: based on prior beliefs, the agent will have a distribution over possible Q-values for the action in question. We could therefore compute an expected value of perfect information by computing the value of perfect information for each possible Q-value assignment and multiplying by the assignment's estimated probability. This expected value of perfect information can be added to the agent's actual Q-value estimates for each action. Maximization of these informed Q-values allows the agent to take the value of exploration into account when choosing its actions.

Intuitively, an agent's return will vary when new information about its Q-values causes it to change its beliefs about the best action to perform in a state. Formally, let Q_{s,a} represent the Q-value of action a in state s. Let Pr(Q_{s,a} = q) be the probability that Q_{s,a} = q, where q ∈ ℝ.
Then the expected Q-value of action a at state s is

    E[Q_{s,a}] = ∫ Pr(Q_{s,a} = q) q dq.

A rational agent would then use these expected Q-values to rank its actions in order from best to worst, a_best, a_nextBest, ..., and choose the best action a_best for exploitation. The agent does not know the true Q-values, but it does know the possible range of Q-values (e.g., the set of reals) and it can compute the value of learning that the true Q-value has a particular real value. Let q*_{s,a} be a candidate for the true Q-value of action a in state s. If the fact that q*_{s,a} was the true Q-value for action a in state s caused the agent to change the action currently identified as the best in state s, the agent's policy would change and the agent's actual return in the world could change as well. Assuming q*_{s,a} gives the true value of action a in state s, it could affect the agent's belief about the current ordering of its actions in one of three ways: the agent might decide that the current best action is actually inferior to the next best; the agent might decide that some other action is superior to the current best; or the agent might decide that the current best is still the best. If the identity of the best action changes, the change in the agent's return will be the difference between the value of the new best action and the value of the old best action.
The three cases are summarized below:

    VPI(s, a, q*_{s,a}) =
        { E[Q_{s,a_nextBest}] − q*_{s,a}    if a = a_best and q*_{s,a} < E[Q_{s,a_nextBest}]
        { q*_{s,a} − E[Q_{s,a_best}]        if a ≠ a_best and q*_{s,a} > E[Q_{s,a_best}]
        { 0                                 otherwise
                                                                                        (2.8)

The value of perfect information, VPI(s, a, q*_{s,a}), gives the value of knowing that the Q-value for action a in state s has the true value q*_{s,a}. However, a reinforcement learning agent does not know these true values when exploring. It could, however, maintain a distribution over possible Q-values. Using this distribution over the possible assignments to Q-values, we could compute an Expected Value of Perfect Information:

    EVPI(s, a) = ∫_{−∞}^{∞} VPI(s, a, q*_{s,a}) Pr(Q_{s,a} = q*_{s,a}) dq*_{s,a}        (2.9)

Given the expected value of perfect information (EVPI) for each possible action, a Bayesian reinforcement learner constructs informed Q-values for each action, defined as the sum of the actual return and the value of information for the action. The agent then maximizes the resulting informed action values to yield an action that smoothly trades off exploitation and exploration in a principled fashion. Such a policy is defined as:

    π_EVPI(s) = argmax_a { E[Q_{s,a}] + EVPI(s, a) }        (2.10)

Since Q-values are a deterministic function of the reward and transition model, the uncertainty in Q-values comes from the uncertainty in models.
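Equations (2.8)–(2.10) can be approximated directly from a finite sample of Q-values for one state (e.g., the Q-values of sampled MDPs). The sketch below is illustrative rather than definitive: all names are ours, and a sample average stands in for the integral in (2.9).

```python
def evpi_policy(q_samples):
    """q_samples: action -> list of sampled Q-values for one state.
    Returns the action maximizing E[Q] + EVPI, per equations (2.8)-(2.10)."""
    exp_q = {a: sum(qs) / len(qs) for a, qs in q_samples.items()}
    ranked = sorted(exp_q, key=exp_q.get, reverse=True)
    a_best, a_next = ranked[0], ranked[1]

    def vpi(a, q_star):                                  # equation (2.8)
        if a == a_best and q_star < exp_q[a_next]:
            return exp_q[a_next] - q_star
        if a != a_best and q_star > exp_q[a_best]:
            return q_star - exp_q[a_best]
        return 0.0

    evpi = {a: sum(vpi(a, q) for q in qs) / len(qs)      # Monte Carlo form of (2.9)
            for a, qs in q_samples.items()}
    return max(q_samples, key=lambda a: exp_q[a] + evpi[a])   # policy (2.10)

# Action a2 has the lower expected value (9 vs 10) but a large upside, so its
# EVPI (here 4) tips the informed choice toward exploring it:
choice = evpi_policy({'a1': [10.0, 10.0], 'a2': [0.0, 18.0]})   # -> 'a2'
```

When no action has upside beyond the current best (all VPI terms are zero), the informed policy reduces to plain exploitation of E[Q].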
Unfortunately, it is difficult to predict the uncertainty of any specific Q-value given uncertainty in model parameters. Instead, practical implementations of Bayesian reinforcement learners generally sample possible MDPs from a distribution over possible transition model parameters and solve each MDP to get a distribution over possible Q-values at each state (Dearden et al., 1999). When states and actions are discrete, distributions over model parameters can be represented compactly by Dirichlet distributions and individual MDPs are easily sampled. Since transition models change very little with a single agent observation, sophisticated incremental techniques for solving MDPs, such as prioritized sweeping, can be used to efficiently maintain the Q-value distributions (Dearden et al., 1999).

2.5  Probabilistic Reasoning and Bayesian Networks

Graphical models (Pearl, 1988) provide a powerful mechanism for efficiently reasoning about uncertain knowledge in complex domains. In a graphical model, a domain is represented by a set of variables that may take on various values. Uncertainty within the domain is represented by the use of a probability distribution over all possible assignments to these variables.
The graphical aspect of the model allows one to represent these distributions compactly. The domain model can then be used to infer the probability of unseen variables, given observations of the remaining variables.

    C   R   U   Pr(C, R, U)
    T   T   T      0.126
    T   T   F      0.014
    T   F   T      0.054
    T   F   F      0.006
    F   T   T      0
    F   T   F      0.008
    F   F   T      0
    F   F   F      0.792

Figure 2.1: Explicit distribution over exhaustive enumeration of cloud-rain-umbrella states

We will illustrate the representation and reasoning process with an example. Imagine a simple world in which it is sometimes cloudy, sometimes rainy and sometimes one takes an umbrella. We let the boolean variables C, R and U stand for the propositions that it is cloudy, that it is raining and that one took one's umbrella, respectively. The domain requires reasoning with uncertainty since it does not always rain and one does not always take one's umbrella when it is cloudy. The uncertainty in the domain could be represented by an explicit probability distribution over each possible assignment of the world as in Figure 2.1. Given a domain with n boolean variables, a complete probability distribution would have 2^n possible assignments.
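As an illustration of brute-force inference over the explicit table, the joint of Figure 2.1 can be stored directly and marginalized on demand (a sketch; the dictionary encoding and function names are ours):

```python
# Joint distribution of Figure 2.1, keyed by (C, R, U) truth values.
joint = {
    (True,  True,  True):  0.126, (True,  True,  False): 0.014,
    (True,  False, True):  0.054, (True,  False, False): 0.006,
    (False, True,  True):  0.0,   (False, True,  False): 0.008,
    (False, False, True):  0.0,   (False, False, False): 0.792,
}

def prob(pred):
    """Marginal probability of the event described by a predicate on (C, R, U)."""
    return sum(p for cru, p in joint.items() if pred(*cru))

# Pr(R | U) = Pr(R, U) / Pr(U) = 0.126 / 0.18 = 0.7
p_rain_given_umbrella = prob(lambda c, r, u: r and u) / prob(lambda c, r, u: u)
```

Even this tiny domain shows the cost of the explicit representation: the table has 2^3 entries and every query touches all of them.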
The representation of such tables and the knowledge acquisition process required to fill them in make explicit table representations impractical for probabilistic reasoning. In a graphical model, the need for complete tables is reduced by factoring one large table into smaller tables of independent variables. For instance, our rain-cloud-umbrella domain could be expressed as the graphical model in Figure 2.2(a). In this model, "cloudiness" causes rain and "cloudiness" causes one to take one's umbrella. Under the assumptions of this model, knowing whether it is cloudy or not completely predicts the probability that one takes one's umbrella, and knowing whether it is raining or not does not have any additional predictive power. That is, Pr(U|C) = Pr(U|C, R). The independence assumptions represented by the graphical model allow us to represent the domain more compactly, yet still provide equivalent information.

    Pr(C)      0.2
    Pr(R|C)    0.7
    Pr(R|¬C)   0.01
    Pr(U|C)    0.9
    Pr(U|¬C)   0.0

    (a) Graphical model        (b) Model factors

Figure 2.2: Compact distributions over independent factors for cloud-rain-umbrella domain
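To check that the five factors of Figure 2.2(b) carry the same information as the eight entries of Figure 2.1, one can simply multiply them out (a sketch; variable names are ours):

```python
p_c = 0.2                                  # Pr(C)
p_r = {True: 0.7, False: 0.01}             # Pr(R | C)
p_u = {True: 0.9, False: 0.0}              # Pr(U | C)

def pr(c, r, u):
    """Joint Pr(C=c, R=r, U=u) = Pr(R|C) Pr(U|C) Pr(C), using the
    complement rule (probabilities summing to one) for negated variables."""
    pc = p_c if c else 1 - p_c
    prc = p_r[c] if r else 1 - p_r[c]
    puc = p_u[c] if u else 1 - p_u[c]
    return pc * prc * puc

# Five numbers reproduce all eight entries of Figure 2.1, e.g.:
assert abs(pr(True, True, True) - 0.126) < 1e-9
assert abs(pr(False, False, False) - 0.792) < 1e-9
```

The factored form needs only five parameters where the explicit table needed seven independent ones, and the gap widens exponentially with the number of variables.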
Using the independence assumption and the fact that complementary probabilities always sum to one, the rain-cloud-umbrella domain can be encoded with the probability factors in Figure 2.2(b). Multiplying the conditional probabilities gives us the same joint probability, Pr(R, U, C) = Pr(R|C) Pr(U|C) Pr(C); however, the factors can be represented more compactly and specialized algorithms can be used to compute queries efficiently.

In general, belief inference in Bayesian networks is quite difficult in the worst case. Finding the most probable settings of variables is known to be NP-complete (Littman, Majercik, & Pitassi, 2001) and finding the posterior belief distribution over variables is PP-complete (Littman et al., 2001).

2.6  Model-Free Learning

In model-based approaches, the agent estimates an explicit transition model in order to calculate state values. Model-based approaches make the underlying theory of reinforcement learning clear and may use evidence more efficiently in some cases,[2] but suffer from a number of disadvantages. First, transition models are a function of three variables: prior state, action choice and successor state. Generic domain models can therefore have complexity on the order of the square of the size of the underlying state space.
Second, in model-based techniques, the value function is typically calculated from the transition model through Bellman backups. In a domain with continuous variables it is sometimes possible to create continuous transition models (e.g., dynamic Bayesian networks or mixtures of Gaussians), but generally it is difficult to represent and calculate the continuous value function for such a model.[3]

Model-free methods such as TD(λ) (Sutton, 1988) and Q-learning (Watkins & Dayan, 1992) estimate the optimal value function or Q-function directly, without recourse to a domain model. In model-free reinforcement learning, the value of each state-action pair is sampled empirically from interactions with the agent's environment. On each interaction the agent obtains a tuple (s, a, r, t) corresponding to the prior state, chosen action, reward and successor state, respectively. Since the successor state and reward are generated by the environment, they are effectively sampled with the probability that they occur in the world. As before, let Q(s, a) be the Q-value of action a in state s. The Q-learning rule, derived by Watkins, can be used to update the agent's Q-values incrementally from each experience:

[2] Though see Least Squares TD (Boyan, 1999).
[3] An approach that illustrates this difficulty and one method of attacking it can be found in the work of Forbes and Andre (2000).

    Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a'} Q(t, a'))        (2.11)

The rule effectively computes the average sum of immediate and expected future rewards for action a at state s, assuming that the agent will take the best action in the next state (i.e., the max over a'). If α is decayed appropriately and every state is visited infinitely often, the Q-learning rule will cause Q(s, a) to converge to its true value in the limit (Watkins & Dayan, 1992). The TD(λ) family of methods improves on the Q-learning rule by sampling not only the current state, but a proportion of future states as well. It typically converges much more quickly than Q-learning given the same number of samples. Recent research on gradient (Baird, 1999) and policy search (Ng & Jordan, 2000) based methods has expanded the scope and applicability of model-free techniques to a wide variety of challenging continuous variable control problems.

2.7  Partial Observability

The MDP assumes that the agent can uniquely identify its state at each decision epoch. MDPs can be naturally generalized to problems in which agents are uncertain about their current state.
This uncertainty may be due to the limited scope of sensors which report only on some subset of the agent's state, noisy sensors that report complete but somewhat inaccurate versions of the agent's state, or some combination of the two. The Partially Observable Markov Decision Process (POMDP) is defined by the tuple (S, A, Z, T, Z, R). As in the fully observable MDP, S, A, T, and R represent the set of states in the problem, the actions available to the agent, the environment transition model and the reward function. In the partially observable model, Z represents possible observations or sensor readings for the agent and Z : S → Δ(Z) represents an observation function or sensor model which maps possible world-states to distributions over observations.

Techniques for solving POMDPs parallel those for solving MDPs. The POMDP can be solved by both value-based (Smallwood & Sondik, 1973) and policy-based algorithms (Hansen, 1998). The theory of POMDPs has been generalized to partially observable reinforcement learning by appealing to traditional techniques based on determining the underlying state space and model (McCallum, 1992), as well as more recent work using gradient methods (Baxter & Bartlett, 2000; Ng & Jordan, 2000).
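Although the chapter does not spell it out, value-based POMDP techniques typically operate on a belief state b that is updated by Bayes' rule after taking action a and observing z: b'(s') ∝ Z(s')(z) Σ_s T(s, a, s') b(s). A minimal sketch, with purely illustrative numbers and names of our own:

```python
def belief_update(b, a, z, T, Z):
    """Bayesian belief-state update: b'(s') is proportional to
    Z[s'][z] * sum_s T[s][a][s'] * b[s], renormalized to sum to one."""
    unnorm = {s2: Z[s2][z] * sum(T[s][a][s2] * b[s] for s in b) for s2 in b}
    total = sum(unnorm.values())          # the normalizing factor
    return {s2: p / total for s2, p in unnorm.items()}

# Two states, one action, and a noisy sensor (all numbers are illustrative):
T = {'s1': {'go': {'s1': 0.2, 's2': 0.8}},
     's2': {'go': {'s1': 0.6, 's2': 0.4}}}
Z = {'s1': {'beep': 0.9, 'quiet': 0.1},
     's2': {'beep': 0.3, 'quiet': 0.7}}
b1 = belief_update({'s1': 0.5, 's2': 0.5}, 'go', 'beep', T, Z)
```

The belief state is a sufficient statistic for the history, which is what lets POMDP solution methods parallel those for fully observable MDPs over the (continuous) belief space.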
We refer the reader to existing comprehensive surveys for more details (Cassandra, Kaelbling, & Littman, 1994; Lovejoy, 1991; Smallwood & Sondik, 1973). The solution of general POMDPs is considered intractable. Current state-of-the-art algorithms, such as incremental pruning, still have exponential running time in the size of the state space in the worst case, and even ε-optimal approximations are NP- or PSPACE-complete (Lusena, Goldsmith, & Mundhenk, 2001).

2.8  Multi-agent Learning

Single-agent reinforcement learning theory assumes that the agent is learning within a stationary environment (i.e., the transition model T does not change over time). When additional agents are added to the environment, the effects of an agent's action may depend upon the action choices of other agents. Changes to the agent's policy may cause other agents to change their policies in turn, which may result in the system of agents as a whole entering into cyclic policy changes that never converge. In domains where agents must coordinate in order to achieve goals, agents may fail to find compatible policies and converge to non-optimal solutions. Reinforcement learning can be extended to multi-agent domains through the theory of stochastic games.
Though Shapley's (1953) original formulation of stochastic games involved a zero-sum (fully competitive) assumption, various generalizations of the model have been proposed, allowing for arbitrary relationships between agents' utility functions (Myerson, 1991). Formally, an n-agent stochastic game (S, {A_i : i ≤ n}, T, {R_i : i ≤ n}) comprises a set of n agents (1 ≤ i ≤ n), a set of states S, a set of actions A_i for each agent i, a state transition function T, and a reward function R_i for each agent i. Each state s represents the complete state of the system. In some games, the system state can intuitively be understood as the product of the state of individual agents playing the game. Unlike an MDP, individual agent actions do not determine state transitions; rather, it is the joint action taken by the collection of agents that determines how the system evolves at any point in time. Let A = A_1 × ⋯ × A_n be the set of joint actions; then T : S × A × S → ℝ denotes the probability of ending up in the system state s' ∈ S when joint action a is performed at the system state s. For convenience, we introduce the notation A_{−i} (read A, not i) to denote the set of joint actions A_1 × ⋯ × A_{i−1} × A_{i+1} × ⋯ × A_n involving all agents except i. We use a_i · a_{−i} to denote the (full) joint action obtained by conjoining a_i ∈ A_i with a_{−i} ∈ A_{−i}.
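The joint-action notation can be made concrete in a few lines (a sketch; the three agents and their action sets are an invented example):

```python
from itertools import product

A = {1: ['up', 'down'], 2: ['left', 'right'], 3: ['stay', 'go']}

# Full joint action space A = A_1 x A_2 x A_3:
joint_actions = list(product(*(A[i] for i in sorted(A))))

def joint_without(i):
    """A_{-i}: joint actions of all agents except i."""
    return list(product(*(A[j] for j in sorted(A) if j != i)))

def conjoin(i, a_i, a_minus_i):
    """a_i . a_{-i}: reinsert agent i's action to form a full joint action."""
    rest = list(a_minus_i)
    return tuple(a_i if j == i else rest.pop(0) for j in sorted(A))
```

The size of the joint action space is the product of the individual action-set sizes, which is one reason multi-agent learning scales poorly with the number of agents.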
In a general stochastic game, the predictive paradigm for action choice can break down. An agent can no longer predict the effect of its action choices, as it would have to predict the action choices of other agents, which in turn would be based upon predictions of its own action choices. Instead, game theory makes recourse to the idea of a solution concept: a principle outside of the domain of interest that can be used to constrain the action choices of all agents.

A popular technique for the solution of games is the minimax principle. Under minimax, an agent chooses the action that leaves it least vulnerable to malicious behavior of other agents. The minimax principle forms the core of the minimax Q-learning algorithm (Littman, 1994), which extends reinforcement learning to competitive zero-sum games.[4] The minimax-Q algorithm assumes that the state, reward and action choices of other agents in the environment are observable. Under this assumption, an agent can learn Q-tables for both its own actions and the actions of other agents.[5] A linear program is then constructed to find the probabilistic policy that maximizes the worst possible Q-value an agent could receive.

[4] For example, see the fully cooperative multiagent MDP model proposed by Boutilier (1999).
[5] In a two-player zero-sum game, one agent's payoff is the negative of the other's, so only one Q-value need be represented explicitly.
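The minimax computation at a single state can be illustrated on a matrix game. The crude grid search below stands in for the mathematical program a full minimax-Q implementation solves; the matching-pennies payoff matrix is our own example:

```python
def minimax_mixed(payoff, steps=1000):
    """Worst-case-optimal mixed strategy for the row player of a 2x2
    zero-sum game: grid search over p = Pr(play row 0)."""
    best_p, best_value = 0.0, float("-inf")
    for i in range(steps + 1):
        p = i / steps
        # Expected payoff against each opponent column; an adversarial
        # opponent picks the column minimizing the row player's payoff.
        worst = min(p * payoff[0][col] + (1 - p) * payoff[1][col]
                    for col in range(len(payoff[0])))
        if worst > best_value:
            best_p, best_value = p, worst
    return best_p, best_value

# Matching pennies: every deterministic policy can be exploited for -1,
# but mixing 50/50 guarantees a worst-case expected payoff of 0.
p, v = minimax_mixed([[1, -1], [-1, 1]])
```

A deterministic policy (p = 0 or p = 1) has worst-case payoff −1 here, which is exactly why minimax-Q searches over probabilistic policies.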
In a purely competitive game, a deterministic policy could be predicted by other agents and taken advantage of. One drawback of the minimax principle is that it assumes that agents will always act maliciously, which is only reasonable in a purely competitive (or zero-sum) game in which no agent can profit unless another agent loses. In non-competitive games, agents can find a solution using quadratic programming. The solution is a Nash equilibrium, in which no agent can independently modify its action choice to improve its own outcome. In the reinforcement learning context, Hu and Wellman (1998) generalize Littman's work to the case where agents may have overlapping goals using quadratic programming. A given game may have a number of Nash equilibria. For optimal play, all agents must play the optimal action for the same equilibrium. Hu and Wellman's work does not explicitly address games in which there are multiple equilibria. In non-competitive games, gradient-based methods have also been suggested (Bowling & Veloso, 2001).

2.9  Technical Definitions

We make recourse to a couple of technical definitions. The relative entropy or Kullback-Leibler distance between two probability distributions tells us how much more information it would take to represent one distribution in terms of another.
It is often used to compare two distributions, although it does not actually satisfy the complete definition of a metric because it is not symmetric and does not satisfy the triangle inequality. Let P and Q be two probability distributions. The Kullback-Leibler or KL distance D(P||Q) is:

    D(P||Q) = Σ_x P(x) log ( P(x) / Q(x) )        (2.12)

The Chebyshev inequality gives us a way of bounding the probability mass lying within a multiple k of the standard deviation σ of the mean μ for an arbitrary distribution:

    P(|X − μ| < kσ) ≥ 1 − 1/k²        (2.13)

A homomorphism is a mapping between two groups such that the outcome of the group operation on the first group has a corresponding outcome for the group operation on the second set. Let X and Y be two sets and θ be a map θ : X → Y. Let ⊕ be an operator on X, so that ⊕ : X → X. Let ⊗ be an operator on Y, so that ⊗ : Y → Y. Then the map θ is a homomorphism with respect to ⊗ if, for all x ∈ X, θ(⊕x) = ⊗θ(x).

Chapter 3

Natural and Computational Imitation

Research into imitation spans a broad range of dimensions, from ethological studies, to abstract algebraic formulations, to industrial control algorithms.
As these fields cross-fertilize and inform each other, we are coming to stronger conceptual definitions of imitation and a better understanding of its limits and capabilities. Many computational models have been proposed to exploit specialized niches in a variety of control paradigms, and imitation techniques have been applied to a variety of real-world control problems. This chapter reviews a representative sample of the literature on natural imitation and computational implementations and concludes with an analysis of how various forms of imitation fit into the paradigm of knowledge transfer within a reinforcement learning framework.

3.1 Conceptual Issues

The conceptual foundations of imitation have been clarified by work on natural imitation. From work on apes (Russon & Galdikas, 1993), octopi (Fiorito & Scotto, 1992), and other animals, we know that socially facilitated learning is widespread throughout the animal kingdom. A number of researchers have pointed out, however, that social facilitation can take many forms (Conte, 2000; Noble & Todd, 1999). For instance, a mentor's attention to some feature of its environment may draw an observer's attention to this feature. The observer
may subsequently experiment with the feature of the environment and independently learn a model of the feature's behaviour. The definition of "true imitation" attempts to rule out acquisition of behaviour by independent learning, even if other agents motivated this independent learning. Visalberghi and Fragaszy (1990) cite Mitchell's definition, which says that some behaviour C is learned by imitation when the following conditions are met:

1. C (the copy of the behaviour) is produced by an organism;

2. C is similar to M (the model behaviour);

3. observation of M is necessary for the production of C (above baseline levels of C occurring spontaneously);

4. C is designed to be similar to M;

5. the behaviour C must be a novel behaviour not already organized in that precise way in the organism's repertoire.

This definition perhaps presupposes a cognitive stance towards imitation in which an agent explicitly reasons about the behaviours of other agents and how these behaviours relate to its own action capabilities and goals. Imitation can be further analyzed in terms of the type of correspondence demonstrated by the mentor's behaviour and the observer's acquired behaviour (Nehaniv & Dautenhahn, 1998; Byrne & Russon, 1998). Correspondence types are distinguished by level.
At the action level, there is a correspondence between actions. At the program level, the actions may be completely different, but correspondence may be found between subgoals. At the effect level, the agent plans a set of actions that achieve the same effect as the demonstrated behaviour, but there is no direct correspondence between subcomponents of the observer's actions and the mentor's actions. The term abstract imitation has been proposed for the case where agents imitate behaviours which derive from imitating the mental state of other agents (Demiris & Hayes, 1997).

The study of specific computational models of imitation has yielded insights into the nature of the observer-mentor relationship and how it affects the acquisition of behaviours by observers. For instance, in the related field of behavioural cloning, it has been observed that mentors that implement risk-averse policies generally yield more reliable clones (Urbancic & Bratko, 1994). Highly trained mentors following an optimal policy with small coverage of the state space yield less reliable clones than those that make more mistakes (Sammut, Hurst, Kedzier, & Michie, 1992), since mentors which make mistakes provide information about recovering from failure conditions.
For partially observable problems, learning from mentors who have access to more informative perceptions than the observer can be disastrous. Given the additional information in its perceptions, a mentor may be able to choose a policy that would be too risky for an agent whose perceptions did not provide enough information to make the required decisions (Scheffer, Greiner, & Darken, 1997). Consider the case of two robots navigating along a cliff: the agent with higher-precision sensors may be able to afford a path closer to the edge than the agent with low-precision sensors. Finally, in the case of human mentors, it has been observed that successful clones would often outperform the original human mentor due to the "clean-up effect" (Sammut et al., 1992): essentially, the imitator's machine learning algorithm averages out the error in many runs to produce a better policy than that of the original human controller.

One of the original goals of behavioural cloning (Michie, 1993) was to extract knowledge from humans to speed up the design of controllers. For the extracted knowledge to be useful, it has been argued that rule-based systems often offer the best chance of intelligibility (van Lent & Laird, 1999). It has become clear, however, that symbolic representations are not a complete answer.
Representational capacity is also an issue. Humans often organize control tasks by time, which is typically lacking in state- and perception-based approaches to control. Humans also naturally break tasks down into independent components and subgoals (Urbancic & Bratko, 1994). Studies have also demonstrated that humans will give verbal descriptions of their control policies which do not match their actual actions (Urbancic & Bratko, 1994). The potential for saving time in acquisition has been borne out by one study which explicitly compared the time to extract rules with the time required to program a controller (van Lent & Laird, 1999). Research into human imitation has also highlighted the use of human social cues in directing the attention of an agent to the features that should be imitated (Breazeal & Scassellati, 2002).

In addition to what has traditionally been considered imitation, an agent may also face the problem of "learning how to imitate", or finding a correspondence between the actions and states of the observer and mentor (Nehaniv & Dautenhahn, 1998). Nehaniv et al.
instantiate their notion of correspondence in two implementations: a random-exploration memory-based model (Alissandrakis, Nehaniv, & Dautenhahn, 2002) and a subgoal-search procedure (Alissandrakis, Nehaniv, & Dautenhahn, 2000). Presently, the targets for subgoal search must be set using some previously defined state correspondence function. A complete model of learning by observation in the absence of communication protocols will have to deal with this first, primal correspondence problem.

Most of the literature on imitation considers the problem of agents learning the solutions of single-agent problems from other agents. It is possible, however, for agents to learn multi-agent tasks from other agents. A form of imitation can be used when an agent who has learned to be an effective member of a team must be replaced by a new agent with possibly different capabilities. The knowledge related to the team task of the previous team member can be reused by the new team member (Denzinger & Ennis, 2002).

Outside of the imitation learning field, the notion of imitation as implicit transfer between agents can be seen as a solution to transferring data between representations that lack a simple transformation process.
In statistical models, the most intuitive representations for prior distributions may be intractable to update, but it is sometimes possible to construct a more tractable prior capable of approximately expressing the desired prior (Neal, 2001). Using data sampled from the intractable prior, one can fit the approximating prior. Computations can then be performed using the tractable prior. We envision imitation learning being used in a similar capacity to bridge representations.

3.2 Practical Implementations

The theoretical developments in imitation research have been accompanied by a number of practical implementations. These implementations take advantage of properties of different control paradigms to demonstrate various aspects of imitation. Early behavioural cloning research took advantage of supervised learning techniques such as decision trees (Sammut et al., 1992). The decision tree was used to learn how a human operator mapped perceptions to actions. Perceptions were encoded as discrete values. A time delay was inserted in order to synchronize perceptions with the actions they trigger.

Learning apprentice systems (Mitchell et al., 1985) also attempted to extract useful knowledge by watching users, but the goal of apprentices is not to independently solve problems.
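To illustrate the flavour of this state-to-action cloning (this is not Sammut et al.'s actual system, which induced decision trees; the demonstration trace below is invented), a minimal behavioural clone can be built by tabulating which action the demonstrator most often took for each discretized perception:

```python
from collections import Counter, defaultdict

def fit_clone(trace):
    """Learn a perception -> action policy by majority vote over a
    demonstration trace of (perception, action) pairs."""
    counts = defaultdict(Counter)
    for perception, action in trace:
        counts[perception][action] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}

# Hypothetical trace from a human operator, with perceptions already
# discretized as in the early behavioural cloning work.
trace = [("left_of_track", "steer_right"),
         ("left_of_track", "steer_right"),
         ("left_of_track", "brake"),        # an operator mistake
         ("on_track", "hold"),
         ("right_of_track", "steer_left")]

policy = fit_clone(trace)
assert policy["left_of_track"] == "steer_right"  # mistake outvoted
assert policy["on_track"] == "hold"
```

Majority voting is also a trivial way the "clean-up effect" mentioned above can arise: an occasional operator error is outvoted by the correct demonstrations, so the clone can be more consistent than the demonstrator.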
Learning apprentices are closely related to programming-by-demonstration systems (Lieberman, 1993; Lau, Domingos, & Weld, 2000). Later efforts used more sophisticated techniques to extract actions from visual perceptions and abstract these actions for future use (Kuniyoshi, Inaba, & Inoue, 1994). Work on associative and recurrent learning models extended early approaches to imitation to the learning of temporal sequences (Billard & Hayes, 1999). Associative learning has been used together with innate following behaviours to acquire navigation expertise from other agents (Billard & Hayes, 1997).

A related but slightly different form of imitation has been studied in the multi-agent reinforcement learning community. An early precursor to imitation can be found in work on sharing of perceptions between agents (Tan, 1993). Closer to imitation is the idea of replaying the perceptions and actions of one agent to a second agent (Lin, 1991; Whitehead, 1991). Here the transfer is from one agent to another, in contrast to behavioural cloning's transfer from human to agent. The representation is also different.
Reinforcement learning provides agents with the ability to reason about the effects of current actions on expected future utility, so agents can integrate their own knowledge with knowledge extracted from other agents by comparing the relative utility of the actions suggested by each knowledge source.

The "seeding" approaches are closely related. Trajectories recorded from human subjects are used to initialize a planner which subsequently optimizes the plan in order to account for differences between the human effector and the robotic effector (Atkeson & Schaal, 1997). This technique has been extended to handle the notion of subgoals within a task (Atkeson & Schaal, 1997). Subgoals are also addressed by others (Šuc & Bratko, 1997).

Another approach in this family is based on the assumption that the mentor is rational, has the same reward function as the observer, and chooses from the same set of actions. Given these assumptions, we can conclude that the action chosen by a mentor in a particular state must have higher value to the mentor than the alternatives open to the mentor (Utgoff & Clouse, 1991), and therefore higher value to the observer than any alternative. The system therefore iteratively adjusts the values of the actions until this constraint is satisfied in its model.
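A minimal sketch of this constraint-adjustment idea (our own reconstruction for illustration, not Utgoff and Clouse's implementation; the Q-table and margin are invented): whenever the mentor is observed choosing an action in a state, the estimated value of that action is raised until it exceeds the values of all alternatives in that state.

```python
def enforce_rational_constraint(q, state, mentor_action, margin=0.1):
    """Adjust Q-values so the mentor's observed action is strictly best.

    q maps (state, action) -> estimated value.  If the mentor is rational
    and shares our rewards, its chosen action must have the highest value,
    so we raise it to the best alternative plus a small margin.
    """
    best_other = max(v for (s, a), v in q.items()
                     if s == state and a != mentor_action)
    if q[(state, mentor_action)] <= best_other:
        q[(state, mentor_action)] = best_other + margin
    return q

# Hypothetical Q-table: the learner currently prefers 'b' in state s1,
# but the mentor was observed choosing 'a'.
q = {("s1", "a"): 0.2, ("s1", "b"): 0.5, ("s1", "c"): 0.1}
q = enforce_rational_constraint(q, "s1", "a")
assert q[("s1", "a")] > q[("s1", "b")] > q[("s1", "c")]
```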
A related approach uses the methodology of linear-quadratic control (Šuc & Bratko, 1997). First a model of the system is constructed. Then the inverse control problem is solved to find a cost matrix that would result in the observed controller behaviour given an environment model. Recent work on inverse reinforcement learning takes a related approach but uses a linear programming technique to reconstruct a special class of reward functions from observed behaviour (Ng & Russell, 2000). Ng observes that many possible reward functions could explain a given behaviour and proposes heuristics to regularize the search. One heuristic is to pick the reward function that maximizes the difference between the best and second-best actions. The inverse reinforcement learning approach can be extended to continuous domains and to the situation where the domain model is unknown. Ng's approach has been extended to generate posterior distributions over possible agent policies (Chajewska, Koller, & Ormoneit, 2001).

Several researchers have independently explored the idea of common representations for both perceptual functions and action planning.
Demiris and Hayes (1999) develop a shared-representation model based on a control engineering technique called Proportional-Integral-Derivative (PID) control (Astrom & Hagglund, 1996). PID is a forward-control model which attempts to minimize the error e(t) between the actual trajectory at time t and the desired trajectory at time t using a linear function of the error, its integral and its derivative. In the model proposed by Demiris and Hayes, each of the agent's possible behaviours is represented by a separate PID controller. Observations of other agents are compared with the predictions made by each of the PID controllers. The agent assigns labels to the behaviours of other agents based on which controller most closely predicts the observations. This "closest" PID controller can also be used to regenerate the observed behaviour when appropriate.

Biologically inspired motor action schemas have also been investigated in the dual role of perceptual and motor representations (Mataric, Williamson, Demiris, & Mohan, 1998; Mataric, 2002). This work demonstrated that only the observation of agent effectors was necessary for the tasks they consider.
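The PID control law is u(t) = K_p e(t) + K_i ∫e dτ + K_d de/dt. The sketch below (the plant, gains, and the two candidate behaviours are all invented for illustration) shows the recognition side of the Demiris and Hayes scheme: each candidate behaviour is a PID controller driving toward its own target, and an observed trajectory is labelled with the controller whose closed-loop prediction tracks it most closely.

```python
def pid_rollout(setpoint, x0, kp=1.0, ki=0.1, kd=0.5, dt=0.1, steps=30):
    """Simulate a simple first-order plant (dx/dt = u) under PID control."""
    x, integral, prev_err, traj = x0, 0.0, setpoint - x0, []
    for _ in range(steps):
        err = setpoint - x
        integral += err * dt
        deriv = (err - prev_err) / dt
        u = kp * err + ki * integral + kd * deriv   # the PID control law
        x += u * dt
        prev_err = err
        traj.append(x)
    return traj

def recognize(observed, behaviours, x0):
    """Label an observed trajectory with the closest-predicting controller."""
    def sse(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(behaviours,
               key=lambda name: sse(observed, pid_rollout(behaviours[name], x0)))

# Two hypothetical behaviours, each represented by its own PID target.
behaviours = {"reach_left": -1.0, "reach_right": 1.0}
observed = pid_rollout(1.0, x0=0.0)          # mentor reaching rightward
assert recognize(observed, behaviours, 0.0) == "reach_right"
```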
We expect that this may be due to the fact that relative position and hand orientation constrain human arm geometry so that additional information is redundant. This suggests the usefulness of redundancy analysis in the design of perceptual systems. Mataric finds that three classes of primitives (reaching, oscillating, and overall body posture templates) are sufficient for imitating human behaviours such as dance. More recent efforts have attempted to automatically generate templates using clustering.

3.3 Other Issues

Imitation techniques have been applied in a diverse collection of applications. Classical control applications include control systems for robot arms (Kuniyoshi et al., 1994; Friedrich, Munch, Dillmann, Bocionek, & Sassin, 1996), aeration plants (Scheffer et al., 1997), and container-loading cranes (Šuc & Bratko, 1997; Urbancic & Bratko, 1994). Imitation learning has also been applied to acceleration of generic reinforcement learning (Lin, 1991; Whitehead, 1991). Less traditional applications include transfer of musical style (Canamero, Arcos, & de Mantaras, 1999) and the support of a social atmosphere (Billard, Hayes, & Dautenhahn, 1999; Breazeal, 1999; Scassellati, 1999). Imitation has also been investigated as a route to language acquisition and transmission (Billard et al.
, 1999; Oliphant, 1999).

3.4 Analysis of Imitation in Terms of Reinforcement Learning

The literature on imitation spans many dimensions. For the purposes of developing concrete algorithms, however, we can look at the literature through the lens of reinforcement learning. In reinforcement learning, agents make use of knowledge about rewards, transition models, Q-values and action policies. If we view imitation as the process of transferring knowledge from one agent to another, we should be able to express the nature of transfer in terms of the kinds of knowledge reinforcement learners use. In this section, we examine the role of each type of knowledge in turn.

The behavioural cloning approaches to imitation (Sammut et al., 1992; Urbancic & Bratko, 1994), and, to some extent, Mitchell's paradigm of "true" imitation, are based on the idea of transferring a policy from one agent to another. In the behavioural cloning approach this transfer is executed through learning to associate actions with states. Unfortunately, it assumes that the observer and mentor share the same reward function and action capabilities. It also assumes that complete and unambiguous trajectories (including action choices) can be observed.
Utgoff's rational constraint technique and Šuc and Bratko's LQ controller induction methods represent an attempt to define imitation in terms of the transfer of Q-values from one agent to another. Again, however, this approach assumes congruity of objectives. Ng's inverse reinforcement learning (Ng & Russell, 2000) can be seen as a model that attempts to implicitly transfer a reward function between agents.

Conspicuously absent in the reviewed literature is the idea of explicitly transferring transition model information between agents. The advantage of transition model transfer lies in the fact that agents need not share identical reward functions to benefit from increased knowledge about their environment. Our own work is based on the idea of an agent extracting a model from a mentor and using this model information to place bounds on the value of actions using its own reward function. Agents can therefore learn from mentors with reward functions different from their own. Chapter 4 explores this niche in depth, develops several algorithms, and demonstrates their properties on a wide range of experimental domains. In particular, it is shown that generic transition model information about what the best action could accomplish is valuable even if the identity of the action is unknown.
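The core of this idea can be sketched as an augmented Bellman backup (a drastic simplification of the algorithms developed in Chapter 4; the toy chain domain and all numbers below are invented). The observer keeps its own estimated transition model and a state-transition distribution estimated from mentor observations; the backup at each state takes the better of its own best action value and the value implied by the mentor's (unidentified) action, evaluated against the observer's own reward function.

```python
def augmented_backup(V, R, own_model, mentor_model, gamma=0.9):
    """One sweep of value iteration where each state may also be backed up
    through the mentor's observed transition distribution, evaluated with
    the observer's OWN rewards."""
    newV = {}
    for s in V:
        candidates = [sum(p * V[s2] for s2, p in dist.items())
                      for dist in own_model.get(s, {}).values()]
        if s in mentor_model:   # mentor's action is unknown; only its effect
            candidates.append(sum(p * V[s2]
                                  for s2, p in mentor_model[s].items()))
        newV[s] = R[s] + gamma * (max(candidates) if candidates else 0.0)
    return newV

# Toy 3-state chain: the observer only knows a self-loop everywhere; the
# mentor has been observed moving s0 -> s1 -> s2 (the rewarding state).
states = ["s0", "s1", "s2"]
R = {"s0": 0.0, "s1": 0.0, "s2": 1.0}
own_model = {s: {"stay": {s: 1.0}} for s in states}
mentor_model = {"s0": {"s1": 1.0}, "s1": {"s2": 1.0}}

V = {s: 0.0 for s in states}
for _ in range(50):
    V = augmented_backup(V, R, own_model, mentor_model)
assert V["s0"] > 0.5   # the mentor's model reveals a path to reward
```

Without the `mentor_model` term, the observer's own self-loop model leaves V("s0") at zero; the mentor-derived transitions are what propagate value back along the chain.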
The result is a new class of algorithms for imitation that do not require identical rewards, the ability to observe a mentor's action, or the co-operation of the mentor as a teacher.

Part of the computational aspect of reinforcement learning is deciding which states to apply backups to in order to update a value function. The mentor could give the imitator information about which states are most important to back up and thereby optimize computational resources. We propose two specific mechanisms to implement allocation of computation in Chapter 4. The role of allocation in the effectiveness of imitation is also considered in the analysis of performance gains in imitation carried out in Chapter 8.

Multi-agent reinforcement learning models introduce the concepts of co-ordination problems, competition and solution concepts. It is possible to think of a co-ordination protocol as knowledge about how to solve the multi-agent part of the problem. An imitation mechanism could be used to transfer such multi-agent meta-knowledge from one agent to another. Denzinger's (2002) work shows how team knowledge can be reused by new team members.
A method for the transfer of coordination knowledge based on a multi-agent reinforcement learning framework will be introduced in Chapter 7.

Partially observable learning is considered only by Scheffer et al. (1997). They do not explicitly consider the transfer of quantities specific to partially observable learners. The exploration of how imitation might be used to transfer observation models or belief states between agents remains open.

3.5 A New Model: Implicit Imitation

In the previous subsections, various approaches to computational imitation were examined under the assumption that imitation can be understood as the transfer of knowledge about various parameters describing an agent's environment. The survey concluded that the transfer of transition model information is a promising area for research into imitation. This section examines the implications of imitation as transition model transfer and points out that this form of imitation is applicable in a much wider setting than other forms. The section also sets up a framework of assumptions that will be used to develop specific algorithms in Chapters 4, 5 and 7.
Explicit transfer infeasible

Before delving into assumptions specific to imitation as transfer of transition models, one assumption common to all approaches to imitation should be underscored: if it were feasible for the agents to explicitly exchange knowledge through a formal representation, there would be no need to acquire knowledge indirectly through observations of other agents' behaviours. It is therefore sensible to consider imitation techniques only when explicit transfer between agents is infeasible. This assumption will be shown to have additional implications shortly.

Known independent objectives

Putting aside, for the moment, how one might transfer transition-model information between agents without making recourse to explicit communication, we would like to examine the implications of imitation as transition-model transfer for the nature of the observer's objectives and the relationship of these objectives to those of other agents. The implications can be teased out by examining the process of reinforcement learning. In reinforcement learning, an agent learns about the effects of its actions by experimenting with them and observing the outcomes.
The outcomes are summarized in a transition model, which can then be used to assign an expected future value to each action based on the probability of the action leading to future rewarding states. Accurate values can only be assigned in those regions of the state space where accurate transition models are available to describe the probability of reaching future rewarding states. The transfer of transition model information from a mentor to an observer would allow the observer agent to increase the accuracy of its action values, leading to better decisions. Of particular interest, however, is the possibility that the observer might acquire transition models for regions of the state space in which it has no prior samples of its environment. In this situation, the transition model information from a mentor may confer highly advantageous insights about the long-term consequences of the observer's actions. Since the exploration of long sequences of actions is extremely costly, mentor-derived information about the effect of extended sequences of actions can lead to dramatic improvements in the observer's performance.
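The expected-value computation described above is just a one-step Bellman backup, Q(s, a) = R(s) + γ Σ_s' P(s'|s, a) V(s'). A minimal illustration (the states, transition probabilities and values are invented for this example):

```python
def action_value(R_s, transition, V, gamma=0.9):
    """Expected future value of one action:
    R(s) + gamma * sum_s' P(s'|s,a) * V(s')."""
    return R_s + gamma * sum(p * V[s2] for s2, p in transition.items())

# Hypothetical learned transition model for one state with two actions.
V = {"goal": 10.0, "pit": -5.0, "hall": 0.0}
risky = {"goal": 0.6, "pit": 0.4}       # estimated P(s' | s, risky)
safe = {"hall": 0.9, "goal": 0.1}       # estimated P(s' | s, safe)

q_risky = action_value(0.0, risky, V)   # 0.9 * (6.0 - 2.0) = 3.6
q_safe = action_value(0.0, safe, V)     # 0.9 * 1.0 = 0.9
assert q_risky > q_safe
```

The accuracy of these values depends directly on the accuracy of the estimated transition probabilities, which is why mentor-derived transition information for unexplored regions is so valuable.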
A direct consequence of acquiring knowledge in the form of transition models is that the same knowledge about the effects of actions can be used by different agents in different ways, each in the way that best maximizes its own objectives. Imitation as model transfer therefore does not require that the observing agent share the same reward function as the mentor agent. Of course, if the reward functions are significantly different, the mentor and observer may largely occupy different regions of the state space or have completely incompatible policies, and therefore not possess any knowledge of common interest.

Imitation as transition-model transfer has additional implications concerning an agent's reward function. Consider that an agent's choice of action depends not only on the effects of its actions, but also on the rewards to be had in the states resulting from those effects. Transition-model information would be most beneficial in those regions where the agent also possesses reward-model information. There are two general treatments of reward models in reinforcement learning. In one treatment, it is assumed that the agent (for our purposes, the observer) does not know its reward function and must sample the rewards at each state.
In the second treatment, it is assumed that the agent knows the reward function for every state in its domain. The utility of transition-model information will undoubtedly be greater for an agent with a known reward model for all states, as the agent will be able to make much better decisions given the ability to predict both the long-term effects of actions and the value of those effects. In assuming that the observer knows its own reward function, we give up little of practical value. The idea of a known reward function is consistent with the idea of automatic programming: the agent's designer specifies the agent's objectives in the form of a reward function, and the agent finds a policy that maximizes these objectives. Any uncertainty in reward functions can be recast as uncertainty about state spaces: one can be certain about the state but uncertain about the reward attached to it, or one can be uncertain about the state but certain about the rewards attached to the possible states. The most sensible treatment of reward for an agent using imitation as transition-model transfer therefore assumes that the observer agent possesses a reward model for all states in its environment.

The assumption that the observer knows its reward model leads to another serendipitous result.
Observations of the mentor provide the observer with additional information about the environment's transition function. This additional transition-function information can be converted into improved value estimates for the observer's actions. The observer is therefore better able to choose actions that will lead to higher returns. Importantly, however, the agent only makes use of the mentor's behaviour to inform its understanding of the transition dynamics. Unlike other approaches to imitation, the observer is not constrained to duplicate the mentor's actions if the observer is aware of a better action plan. The observer may end up duplicating the mentor's behavior, but only as a consequence of indirect reasoning about the best course of action. As we feel this is a central and defining aspect of our model, we have called our approach implicit imitation. This term underscores the fact that any duplication of mentor behavior is merely part of informed value maximization.

Observable expert behavior

Under the imitation-as-transfer-of-transition-model paradigm, the observer agent gains additional information about the transition model for its environment from a mentor.
It follows that, in order for this model to benefit the observer, there must exist agents in the observer's environment who possess more expertise than the observer for some part of the problem the observer faces.

Our assumption that explicit transfer of knowledge is infeasible imposes another constraint on observations. The action choice of the mentor agent, in the sense defined in reinforcement learning, cannot be directly observed. The argument is as follows: since more than one action may have the same effect on the environment, the identity of an action cannot be known with certainty through observations of the mentor's effect on its environment. We take these effects to include the position of its effectors.[1] Essentially, the mentor's action choices correspond to its intentions, and these could only be transferred through explicit communication, which we have already determined is infeasible (or we would not need imitation techniques to begin with).

For the purposes of the work we will take up shortly, one further constraint will be imposed on the observations made by observers.
From reinforcement learning theory, we know that the problem of learning to control an environment is considerably simplified when the agent can completely and unambiguously observe the state of the system it is attempting to control. Initially, we will therefore assume that the observer is learning to control a completely observable MDP and that the mentor's state (though not its actions) is fully observable. In Chapter 7, this assumption will be reexamined and a derivation will be provided showing how our paradigm might be extended to partially observable environments.

[1] Consider an agent who is attempting to force open a door. The agent and its effectors may appear stationary, but surely the agent is taking an action quite distinct from resting near the door.

Observer-mentor state-space correlation

Thus far, we have talked about two kinds of knowledge: the observer's direct experience with action choices and their resultant state transitions, and knowledge acquired by observation of mentor state transitions, without saying anything about the relationship between them. In many imitation learning paradigms, an explicit teaching paradigm is assumed in which a mentor demonstrates a task, the task is then reset, and the observer attempts the task. It is assumed that the mentor and imitator interact with the environment in identical ways.
Given full observability of the mentor, the mentor's states can be directly identified with corresponding observer states. The transition-model information gained by observing mentor transitions can be seen to have direct and straightforward bearing on the observer's transition models for the corresponding states.

In our model, however, we would like to shift to a perspective in which multiple agents occupy the same environment. These agents may or may not interact with the environment in the same way (e.g., they may have different effectors), and they may or may not face exactly the same problem (e.g., they may have different body states). There may also be multiple mentors with varying degrees of relevance in different regions of the state space. In this more general setting, it is clear that observations made of the mentor's state transitions can offer no useful information to the observer unless there is some relationship between the state of the observer and that of the mentor.

There are several ways in which the relationship between state spaces can be specified. Dautenhahn and Nehaniv (1998) use a homomorphism to define the relationship between mentor and observer for a specific family of trajectories.
In its most general form, our approach can learn from any stochastic process for which a homomorphism can be constructed between the states and transitions of the process and the states and transitions of the observing agent. Our notion of a mentor is therefore quite broad.

In order to simplify our model and avoid undue attention to the (admittedly important) topic of constructing suitable analogical mappings, we will initially assume that the mentor and the observer have "identical" state spaces; that is, S_m and S_o are in some sense isomorphic. The precise sense in which the spaces are isomorphic (or, in some cases, presumed to be isomorphic until proven otherwise) is elaborated below when we discuss the relationship between agent abilities. Thus, from this point on, we simply refer to the state space S without distinguishing the mentor's local space S_m from the observer's S_o.

Analogous Abilities

Even with a mapping between states, observations of a mentor's state transitions only tell the observer something about the mentor's abilities, not its own. We must assume that the observer can in some way "duplicate" the actions taken by the mentor to induce analogous transitions in its own local state space.
In other words, there must be some presumption that the mentor and the observer have similar action capabilities. It is in this sense that the analogical mapping between state spaces can be taken to be a homomorphism. Specifically, we might assume that the mentor and the observer have the same actions available to them (i.e., A_m = A_o = A) and that there exists a homomorphic mapping h : S_m → S_o with respect to T(·, a, ·) for all a ∈ A. In practice, this requirement can be weakened substantially without diminishing its utility. We only require that there exist at least one action in the observer's action set for state s which will duplicate the effects of the action actually taken by the mentor in state s (the mentor's other possible actions are irrelevant). Finally, we might have an observer that assumes it can duplicate the actions taken by the mentor until it finds evidence to the contrary. In this case, there is a presumed homomorphism between the state spaces. In what follows, we will distinguish between implicit imitation in homogeneous action settings (domains in which the analogical mapping is indeed homomorphic) and in heterogeneous action settings (where the mapping may not be a homomorphism).
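Operationally, a presumed homomorphism can be tested action by action: at each mentor-visited state, look for an observer action whose outcome distribution matches the mentor's observed one. The following is a minimal sketch under the assumptions above; the array encoding, the tolerance, and all names are illustrative choices, not part of the thesis:

```python
import numpy as np

def matching_action(T_o, mentor_dist, s, tol=0.05):
    """Return an observer action at state s whose outcome distribution
    matches the mentor's observed next-state distribution within tol,
    or None if no action matches.  Finding such an action at every
    mentor-visited state is what the presumed homomorphism amounts to
    in practice."""
    for a in range(T_o.shape[1]):
        if np.abs(T_o[s, a] - mentor_dist).max() <= tol:
            return a
    return None

# Observer with 2 actions over 3 states; the mentor's move at state 0
# happens to look like the observer's action 1.
T_o = np.array([[[1.0, 0.0, 0.0],
                 [0.1, 0.8, 0.1]],
                [[0.0, 1.0, 0.0],
                 [0.0, 0.1, 0.9]],
                [[0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]]])
```

A mentor distribution that matches no observer action is evidence against the presumed homomorphism at that state, which is the situation the heterogeneous setting must handle.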
There are more general ways of defining similarity of ability, for example, by assuming that the observer may be able to move through the state space in a fashion similar to the mentor without following the same trajectories (Nehaniv & Dautenhahn, 1998). For instance, the mentor may have a way of moving directly between key locations in state space, while the observer may be able to move between analogous locations in a less direct fashion. In such a case, the analogy between states may not be determined by single actions, but rather by sequences of actions or local policies. We will suggest ways of dealing with restricted forms of this in Section 4.2.

3.6 Non-interacting Games

The previous section outlined a basic model and a set of assumptions for an imitation learning framework. In Chapters 4 and 5, we will translate this framework into specific algorithms. The translation will be based on the notion of the stochastic game defined in Chapter 2. To simplify the development of algorithms, however, we will initially consider a special class of stochastic games in which strategic interactions need not be considered (we will return to games with interactions in Chapter 7).
We define noninteracting stochastic games by appealing to the notion of an agent projection function, which is used to extract an agent's local state from the underlying game. Informally, an agent's local state determines all aspects of the global state that are relevant to its decision-making process, while the projection function determines which global states are identical from this local perspective. Formally, for each agent i, we assume a local state space S_i and a projection function L_i : S → S_i. For any s, t ∈ S, we write s ~_i t iff L_i(s) = L_i(t). This equivalence relation partitions S into a set of equivalence classes such that the elements within a specific class (i.e., L_i^{-1}(s_i) for some s_i ∈ S_i) need not be distinguished by agent i for the purposes of individual decision making. We say a stochastic game is noninteracting if there exist a local state space S_i and a projection function L_i for each agent i such that:

1. If s ~_i t, then for all a_i ∈ A_i, a_{-i}, b_{-i} ∈ A_{-i}, and u_i ∈ S_i, we have

   Σ_{u ∈ L_i^{-1}(u_i)} T(s, (a_i, a_{-i}))(u) = Σ_{u ∈ L_i^{-1}(u_i)} T(t, (a_i, b_{-i}))(u)

2. R_i(s) = R_i(t) if s ~_i t.

Intuitively, condition 1 above imposes two distinct requirements on the game from the perspective of agent i. First, if we ignore the existence of other agents, it provides a notion of state-space abstraction suitable for agent i.
Specifically, L_i clusters together states s ∈ S only if each state in an equivalence class has identical dynamics with respect to the abstraction induced by L_i. This type of abstraction is a form of bisimulation of the type studied in automaton minimization (Hartmanis & Stearns, 1966; Lee & Yannakakis, 1992) and in automatic abstraction methods developed for MDPs (Dearden & Boutilier, 1997; Dean & Givan, 1997). It is not hard to show, ignoring the presence of other agents, that the underlying system is Markovian with respect to the abstraction (or, equivalently, with respect to S_i) if condition 1 is met. The quantification over all a_{-i} imposes a strong noninteraction requirement, namely, that the dynamics of the game from the perspective of agent i be independent of the strategies of the other agents. Condition 2 simply requires that all states within a given equivalence class for agent i have the same reward for agent i. This means that no states within a class need to be distinguished: each local state can be viewed as atomic.

The strong noninteraction constraints of this model are consistent with a broad range of scenarios in which an observer inhabits a multiagent world and benefits from observations of other agents, but is performing a task that can be solved without considering interaction with those agents.
One might consider the problem of controlling a machine or solving a specific kind of planning problem: other agents' experience may be relevant, but the actual task assigned to the agent does not involve any multiagent interactions. We discuss possible extensions of the imitation model to tasks that include multiagent interactions in Chapter 7.

A noninteracting game induces an MDP M_i = (S_i, A_i, T_i, R_i) for each agent i, where T_i is given by condition (1) above. Specifically, for each s_i, t_i ∈ S_i,

   T_i(s_i, a_i)(t_i) = Σ_{t ∈ L_i^{-1}(t_i)} T(s, (a_i, a_{-i}))(t)

where s is any state in L_i^{-1}(s_i) and a_{-i} is any element of A_{-i}. Let π_i : S_i → A_i be an optimal policy for M_i. We can extend this to a strategy π_i' : S → A_i for the underlying stochastic game by simply applying π_i(s_i) to every state s ∈ S such that L_i(s) = s_i. Since every state s ∈ S such that L_i(s) = s_i has the same reward and transition probabilities with respect to agent i, and the reward and transition probabilities of agent i are independent of the states and actions of other agents, the policy π_i' is dominant for player i. Thus each agent can solve the noninteracting game by abstracting away irrelevant aspects of the state space, ignoring other agents' actions, and solving its "personal" MDP M_i.
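The marginalization defining T_i can be illustrated on a deliberately tiny game; the two-cell domain and all names below are invented for illustration. Each agent's action deterministically sets its own cell, so projecting the global dynamics through L_i(s) = s[i] yields a local model that does not depend on the representative global state or on the other agent's action:

```python
from itertools import product

# Two agents, each occupying one of two cells; global state = (pos0, pos1).
# A joint action (a0, a1) moves each agent to the cell named by its own
# action, so the game is noninteracting by construction.
STATES = list(product(range(2), range(2)))
ACTIONS = [0, 1]

def T_global(s, a0, a1):
    return (a0, a1)

def local_model(i):
    """Build T_i(s_i, a_i) by projecting global dynamics through
    L_i(s) = s[i].  The noninteraction conditions say the local outcome
    must not depend on the representative global state or on a_{-i};
    we collect every candidate outcome so this can be checked."""
    outcomes = {}
    for s in STATES:
        for a_i in ACTIONS:
            for a_other in ACTIONS:
                joint = (a_i, a_other) if i == 0 else (a_other, a_i)
                t = T_global(s, *joint)
                outcomes.setdefault((s[i], a_i), set()).add(t[i])
    return outcomes

T0 = local_model(0)
```

In a genuinely interacting game, some `outcomes` set would contain more than one element, witnessing a violation of condition 1.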
Given an arbitrary stochastic game, it can generally be quite difficult to discover whether it is noninteracting, since this requires the construction of appropriate projection functions. For the purposes of developing algorithms in the next chapter, we will simply assume that the underlying multiagent system is a noninteracting game. Rather than specifying the game and projection functions, we will specify the individual MDPs M_i themselves. The noninteracting game induced by the set of individual MDPs is simply the "cross product" of the individual MDPs. Such a view is often quite natural. Consider the example of three robots moving in some two-dimensional office domain. If we are able to neglect the possibility of interaction (for example, if the robots can occupy the same 2-D position at a suitable level of granularity and do not require the same resources to achieve their tasks), then we might specify an individual MDP for each robot. The local state might be determined by the robot's x, y-position, orientation, and the status of its own tasks. The global state space would be the cross product S_1 × S_2 × S_3 of the local spaces. The individual components of any joint action would affect only the local state, and each agent would care (through its reward function R_i) only about its local state.
We note that the projection function L_i should not be viewed as equivalent to an observation function. We do not assume that agent i can only distinguish elements of S_i; in fact, observations of other agents' states will be crucial for imitation. Rather, the existence of L_i simply means that, from the point of view of decision making with a known model, the agent need not worry about distinctions other than those made by L_i. Assuming no computational limitations, an agent i need only solve M_i, but may use observations of other agents in order to improve its knowledge about M_i's dynamics.

Chapter 4

Implicit Imitation

The implicit imitation framework is built around the metaphor of an agent interacting with its environment. In this metaphor, the agent makes observations of the environment's current state and chooses actions likely to influence the future state of the environment in a desirable way. Like a reinforcement learner, the implicit imitator is uncertain of the effects of its actions and must experiment within its environment to improve its knowledge. The problem is difficult because the correct action for the agent to take at some state s may depend on what action choices are open to the agent in some distant successor state s'. Generally, the agent will not initially know its action capabilities in s'.
Unlike a reinforcement learning agent, however, an implicit imitation agent may have access to state transition histories at the future state s', generated from observations of an expert mentor agent. This additional knowledge can be used to help the imitator decide whether to pursue actions that lead to s'. Imitation therefore reduces the search requirements. In the best of cases, the reduction in search can lead to an exponential reduction in the time required to learn a reasonable policy for a given problem.

In the previous chapter, we saw that the conception of imitation as transfer of knowledge about state transition probabilities is a new and provocative perspective. As we will see more fully in this chapter, it also has a number of novel implications for learners. Implicit imitators need not share reward functions with mentors, can integrate examples from many mentors, and are able to evaluate observed behaviours at a fine-grained level using explicit expected-value calculations. In the previous chapter, techniques for representing transition models and integrating observations from various sources into these representations were glossed over.
The next two sections derive specific algorithms for implicit imitation in two settings: the homogeneous action setting, in which the action capabilities of the observer and mentor are identical; and the heterogeneous action setting, in which their capabilities differ. The algorithms are demonstrated in a set of simple domains carefully chosen to illustrate specific issues in imitation learning.

4.1 Implicit Imitation in Homogeneous Settings

We begin by describing implicit imitation in homogeneous action settings; the extension to heterogeneous settings will build on the insights developed in this section. First, we define the homogeneous setting. Then we develop the implicit imitation algorithm. Finally, we demonstrate how implicit imitation in a homogeneous setting works on a number of simple problems designed to illustrate the role of the various mechanisms we describe.

4.1.1 Homogeneous Actions

The homogeneous action setting is defined as follows. We assume a single mentor m and observer o, with individual MDPs M_m = (S, A_m, T_m, R_m) and M_o = (S, A_o, T_o, R_o), respectively.[1] Note that the agents share the same state space (more precisely, we assume a trivial isomorphic mapping that allows us to identify their local states).
We also assume that the mentor has been pretrained and is executing some stationary policy π_m. We will often treat this policy as deterministic, but most of our remarks apply to stochastic policies as well.[2] Let the support set Supp(π_m, s) be the set of actions a ∈ A_m accorded nonzero probability by π_m at state s. We assume that the observer has the same abilities as the mentor in the following sense: for all s, t ∈ S and a_m ∈ Supp(π_m, s), there exists an a_o ∈ A_o such that T(s, a_o)(t) = T(s, a_m)(t). In other words, the observer has an action whose outcome distribution matches that of the mentor's action. Initially, however, the observer may not know which action(s) will achieve this distribution. The condition above essentially defines the agents' local state spaces to be isomorphic with respect to the actions actually taken by the mentor, at the subset of states where those actions might be taken. This is much weaker than requiring a full homomorphism from S_m to S_o. Of course, the existence of a full homomorphism is sufficient from our perspective, but our results do not require it.

[1] The extension to multiple mentors is straightforward and is discussed later.

4.1.2 The Implicit Imitation Algorithm

The implicit imitation algorithm can be understood in terms of its component processes.
In addition to learning from its own transitions, the observer observes the mentor in order to acquire additional information about possible transition dynamics at the states the mentor visits. The mentor follows some policy that need not be better than the observer's own, but will generally be more useful to the observer if it is better than the observer's current policy for at least a subset of the states in the problem. The transition dynamics observed for the mentor at state s will constitute a model of the mentor's action in state s. We call the building of a model from mentor observations action model extraction. We can integrate this information into the observer's own value estimates by augmenting the usual Bellman backup with mentor action models. A confidence testing procedure ensures that we only use this augmented model when the model implicitly extracted from the mentor is more reliable than the observer's model of its own behavior. We also extract occupancy information from the observer's observations of mentor trajectories in order to focus the observer's computational effort (to some extent) on specific parts of the state space.

[2] The observer must have a deterministic policy with value greater than or equal to the mentor's stochastic policy. We know there exists an optimal deterministic policy for Markov decision processes, so we know the observer can find it in principle.
Finally, we augment our action selection process to choose actions that will explore high-value regions revealed by the mentor. The remainder of this section expands upon each of these processes and explains how they fit together.

Model Extraction

The information available to the observer in its quest to learn how to act optimally can be divided into two categories. First, with each action it takes, it receives an experience tuple ⟨s_o, a_o, r_o, t_o⟩, where a_o is the observer's action, s_o and t_o represent the observer's state transition, and r_o is the observer's reward. We will ignore the sampled reward r_o, since we assume the reward function R is known in advance. As in standard model-based learning, each such experience can be used to update its own transition model T_o(s_o, a_o, ·). Second, with each mentor transition, the observer obtains an experience tuple ⟨s_m, t_m⟩. Note that the observer does not have direct access to the action taken by the mentor, only to the induced state transition. Assume the mentor is implementing a deterministic, stationary policy π_m, with π_m(s) denoting the mentor's choice of action at state s. Let T be the true transition model for the observer's environment.
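These two kinds of experience tuple can be tallied directly into empirical models. The sketch below is illustrative only (the class and variable names are ours, not the thesis's); it keys observer counts on (state, action) pairs and mentor counts on states alone:

```python
from collections import defaultdict

class TransitionCounts:
    """Tally observed transitions; report relative-frequency estimates."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # key -> {successor: n}

    def observe(self, key, t):
        self.counts[key][t] += 1

    def prob(self, key, t):
        total = sum(self.counts[key].values())
        return self.counts[key][t] / total if total else 0.0

# Observer tuples <s_o, a_o, t_o> update T_o(s, a, .), keyed on (state, action);
# mentor tuples <s_m, t_m> update the mentor chain T_m(s, .), keyed on state alone.
T_o, T_m = TransitionCounts(), TransitionCounts()
T_o.observe(('s1', 'up'), 's2')
T_o.observe(('s1', 'up'), 's2')
T_o.observe(('s1', 'up'), 's3')
T_m.observe('s1', 's2')
```

Note that the mentor model needs no action label, which is exactly what makes it usable when the mentor's actions are unobservable.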
Then the mentor's policy π_m induces a Markov chain T_m(·, ·) over S, with T_m(s, t) = T(s, π_m(s), t) denoting the probability of transition from s to t for the chain. (This is somewhat imprecise, since the initial distribution of the Markov chain is unknown. For our purposes, it is only the dynamics that are relevant to the observer, so only the transition probabilities are used. See also Chapter 3.) Since the learner observes the mentor's state transitions, it can construct an estimate T̂_m of this chain: T̂_m(s, t) is simply estimated by the relative observed frequency of mentor transitions s → t (with respect to all transitions taken from s). If the observer has some prior over the possible mentor transitions, standard Bayesian update techniques can be used instead. (A Bayesian approach based on priors and incremental model updates is developed in Chapter 5.) We use the term model extraction for this process of estimating the mentor's Markov chain.

Augmented Bellman Backups

Suppose the observer has constructed an estimate T_m of the mentor's Markov chain. By construction, T_m(s, t) = T(s, π_m(s), t). By the homogeneity assumption, the successor state distribution of the mentor's action π_m(s) can be replicated by the observer at state s as π_m(s) ∈ A_o. Thus, the policy π_m can, in principle, be duplicated by the observer (were it able to identify the actual actions used). As such, we can define the value of the mentor's policy from the observer's perspective:

    V_m(s) = R_o(s) + γ Σ_{t∈S} T_m(s, t) V_m(t)    (4.1)

Notice that Equation 4.1 uses the mentor's dynamics T_m, but the observer's reward function, R_o. Let V denote the observer's optimal value function. Since the observer is assumed to have an action capable of duplicating the mentor's dynamics T_m(s, t), it will always be the case that the observer's value function V(s) ≥ V_m(s). V_m therefore provides a lower bound on V. More importantly, the terms making up V_m(s) can be integrated directly into the Bellman equation for the observer's MDP, forming the augmented Bellman equation:

    V(s) = R_o(s) + γ max { max_{a∈A_o} Σ_{t∈S} T_o(s, a, t) V(t), Σ_{t∈S} T_m(s, t) V(t) }    (4.2)

This is the usual Bellman equation with an extra term added, namely, the second summation, Σ_{t∈S} T_m(s, t) V(t), denoting the expected value of duplicating the mentor's action a_m. Since this (unknown) action is identical to one of the observer's actions, the term is redundant and the augmented value equation is valid. We call the improved value estimates produced by the augmented Bellman equation augmented values.

During on-line learning, an observer will have only estimated versions of the transition models.
If the observer's exploration policy ensures that each state-action pair is visited infinitely often, the estimates of the T_o terms will converge to their true values. If the mentor's policy is ergodic with respect to S (i.e., its policy ensures that the mentor will visit every state in S), then the estimate of T_m will also converge to its true value. If the mentor's policy is restricted to a subset of states S' ⊂ S (i.e., those states with nonzero probability in the stationary distribution of its Markov chain), then the estimates of T_m for the subset will converge correctly with respect to S' if the chain is ergodic. The states in S − S' will remain unvisited and the estimates will remain uninformed by data. Since the mentor's policy is not under the control of the observer, there is no way for the observer to influence the distribution of samples attained for T_m. An observer must therefore be able to reason about the accuracy of the estimated model T̂_m for any s and restrict the application of the augmented equation to those states where T_m is known with sufficient accuracy.

While T_m cannot be used indiscriminately, we argue that it can be highly informative early in the learning process. Early in learning, the observer devotes its action choices to exploration.
In the absence of guidance, every action in every state must be sampled in order for the observer to gain sufficient information to compute the best policy. The mentor, by contrast, is pursuing an optimal policy designed to exploit its knowledge in order to maximize return with respect to R_m. An observer learning from transitions of such a mentor would find that the mentor transitions are often constrained to a much smaller subspace of the problem and represent the effects of a single action. Typically, the smaller set of states and the choice of a single action in each state cause the estimates of T_m(s, t) in regions relevant to the domain to converge much more rapidly to their true values than the corresponding observer estimates T_o(s, a, t). The use of T_m(s, t) can therefore often give more informed value estimates at state s, which in turn allow the observer to make better informed choices earlier in learning. We note that the reasoning above holds even if the mentor is implementing a (stationary) stochastic policy (since the expected value of a stochastic policy for a fully-observable MDP cannot be greater than that of an optimal deterministic policy).
While the "direction" offered by a mentor implementing a deterministic policy tends to be more focussed, empirically we have found that mentors offer broader guidance in moderately stochastic environments or when they implement stochastic policies, since they tend to visit more of the state space. We note that the extension to multiple mentors is straightforward: each mentor model can be incorporated into the augmented Bellman equation without difficulty.

Model Confidence

When the mentor's Markov chain is not ergodic, or even if the mixing rate is sufficiently low, the mentor may visit a certain state s relatively infrequently. The estimated mentor transition model corresponding to a state that is rarely (or never) visited by the mentor may provide a very misleading estimate, based on the small sample or the prior for the mentor's chain, of the value of the mentor's (unknown) action at s; and since the mentor's policy is not under the control of the observer, this misleading value may persist for an extended period. Since the augmented Bellman equation does not consider the relative reliability of the mentor and observer models, the value of such a state s may be overestimated.
That is to say, the observer can be tricked into overvaluing the mentor's (unknown) action, and consequently overestimating the value V(s) of state s. To overcome this, we have incorporated an estimate of model confidence into our augmented backups. Our confidence model makes use of a Dirichlet distribution over the parameters of both the mentor's Markov chain T_m and the observer's action models T_o (DeGroot, 1975). The mean of the Dirichlet distribution corresponds to the familiar point estimate of the transition probability, namely T_o(s, a, t) and T_m(s, t). However, the Dirichlet distribution actually defines a complete probability distribution over possible values for the transition probability between states s and t (i.e., Pr(T_o(s, a, t)) and Pr(T_m(s, t))). The initial distribution over possible transition probabilities is given by a prior distribution, and it reflects the observer's initial beliefs about which transitions are possible and to what degree. The initial distribution is then updated with observations of mentor and observer transitions to get a more accurate posterior distribution.
The Dirichlet distribution is parameterized by counts of the form n_o^{s,a}(t) and n_m^s(t), denoting the number of times the observer has been observed making the transition from state s to state t when action a was performed, and the number of times the mentor was observed to make the transition from state s to t while following a stationary policy. In an online application, the agent updates the appropriate count after each observed transition. Prior counts c_o^{s,a}(t) and c_m^s(t) can be added to the Dirichlet parameters to represent an agent's prior beliefs about the probability of various transitions in terms of a number of "fictitious" samples. Given the complete distribution over transition model parameters provided by the Dirichlet distribution, we could attempt to calculate a distribution over possible state values V(s); however, there is no simple closed-form expression for state values as a function of the agent's estimated models. (Note that underestimates based on such reliability considerations are not problematic, since the augmented Bellman equation then reduces to the usual Bellman equation.) We could attempt to employ sampling methods here, but in the interest of simplicity we have employed an approximate method for combining information sources inspired by Kaelbling's (1993) interval estimation method.
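In code, a Dirichlet model over successor states reduces to keeping one parameter per successor: the prior count c(t) plus the observed count n(t). A minimal sketch (the class and names are ours, for illustration):

```python
class DirichletModel:
    """Dirichlet over the next-state distribution for one (state[, action]) pair.
    Each parameter is a prior count c(t) plus the observed count n(t)."""
    def __init__(self, states, prior=1.0):
        self.params = {t: prior for t in states}  # c(t): "fictitious" samples

    def update(self, t):
        self.params[t] += 1.0                     # one more observed transition to t

    def mean(self, t):
        # Posterior mean = the familiar point estimate of the transition probability.
        return self.params[t] / sum(self.params.values())

d = DirichletModel(['s1', 's2'], prior=1.0)
d.update('s2')
d.update('s2')
print(d.mean('s2'))  # (1 + 2) / (1 + 1 + 2) = 0.75
```

With no observations the mean is just the normalized prior, so unvisited states fall back gracefully on the agent's initial beliefs.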
This method uses an expression for the variance in the estimate of a transition model parameter to represent uncertainty. The variance in the estimate of the transition model parameter is given by a simple function of the Dirichlet parameters (see Chapter 2, Section 2.3). For the observer's transition model, let α = n_o^{s,a}(t) + c_o^{s,a}(t) and β = Σ_{t'∈S−{t}} n_o^{s,a}(t') + c_o^{s,a}(t'). Then the variance σ_o^{s,a}(t)² in the estimate of the transition model parameters is given by

    σ_o^{s,a}(t)² = αβ / ((α + β)²(α + β + 1))    (4.3)

The variance in the estimate of the transition model derived from mentor observations is analogous. Let α = n_m^s(t) + c_m^s(t) and β = Σ_{t'∈S−{t}} n_m^s(t') + c_m^s(t'). Then the variance σ_m^s(t)² in the estimate of the transition model parameters is:

    σ_m^s(t)² = αβ / ((α + β)²(α + β + 1))    (4.4)

The variance of the transition model parameters can be used to compare the observer's confidence in the value estimates derived from the observer's own transition model T_o versus the value estimates derived from the transition model inferred from mentor observations, T_m. Intuitively, we compute the variance in the value estimate due to variance in the model used to compute the estimate. This can be found by simple application of the rule for linear combinations of variances, Var(cX + dY) = c²Var(X) + d²Var(Y), to the expression for the Bellman backup, R(s) + γ Σ_t T(s, a, t) V(t). Note that the term R(s) is independent of the variance in the random variable T(s, a, t), and the terms γ and V(t) act like scalar multipliers of the variance in the random variable T(s, a, t). The resultant expression gives the variance in the value estimate V(s) due to uncertainty in the local transition model:

    σ_vo(s, a)² = γ² Σ_t σ_o^{s,a}(t)² V(t)²    (4.5)

The variance in the estimate of the value of state s calculated from the estimated mentor transition model proceeds in a similar fashion:

    σ_vm(s)² = γ² Σ_t σ_m^s(t)² V(t)²    (4.6)

Using the variances in value estimates, we can compute a heuristic value indicating the observer's relative confidence in the value estimates computed from observer and mentor observations. An augmented Bellman backup with respect to V using confidence testing proceeds as follows. We first compute the observer's optimal action a*_o based on the estimated augmented values for each of the observer's actions. Let Q(a*_o, s) = V_o(s) denote the value of the optimal action as calculated from the observer's transition model T_o.
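The variance computations above are short enough to state directly in code. A sketch with illustrative names (α is the combined count for the target successor, β the summed counts for all other successors, as defined above):

```python
def dirichlet_variance(alpha, beta):
    """Variance of one transition probability under a Dirichlet posterior:
    alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))."""
    n = alpha + beta
    return alpha * beta / (n * n * (n + 1.0))

def value_variance(sigma2, V, gamma=0.95):
    """Variance of the backed-up value due to local model uncertainty:
    gamma^2 * sum_t sigma^2(t) * V(t)^2."""
    return gamma ** 2 * sum(sigma2[t] * V[t] ** 2 for t in V)

# Two successors, each seen a few times (counts include priors):
sigma2 = {'t1': dirichlet_variance(3.0, 1.0), 't2': dirichlet_variance(1.0, 3.0)}
V = {'t1': 10.0, 't2': 2.0}
print(value_variance(sigma2, V))
```

As counts grow, both α and β increase and the variance shrinks roughly like 1/n, so well-sampled models earn tighter value bounds.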
The variance due to model uncertainty can then be used to construct a lower bound V_o^-(s) in terms of the standard deviation of the value estimate σ_vo(s):

    V_o^-(s) = V_o(s) − c σ_vo(s)    (4.7)

The value c is chosen so that the lower bound V_o^-(s) has sufficient probability of being less than the possible true value of V_o(s). Using Chebychev's inequality, which states that 1 − 1/c² of the probability mass for an arbitrary distribution will be within c standard deviations of the mean, we can obtain any desired confidence level even though the Dirichlet distributions for small sample counts are highly non-normal. We used c = 5 to obtain approximately 96% confidence in our experiments, though this confidence level can be set to any suitable value. Similarly, the variance in the value estimate due to the model being updated with mentor observations can be used to construct a lower bound for the value estimate derived from mentor observations, V_m^-(s).

[Figure 4.1: V_o and V_m represent value estimates derived from observer and mentor experience respectively. The lower bounds V_o^- and V_m^- incorporate an "uncertainty penalty" associated with these estimates.]
One may interpret the subtraction of a multiple of the standard deviation from the value estimate as a penalty corresponding to the observer's "uncertainty" in the value estimate (see Figure 4.1). (Ideally, we would like to take not only the uncertainty of the model at the current state into account, but also the uncertainty of future states as well (Meuleau & Bourgine, 1999).)

Given the lower bounds on the value estimates constructed from models estimated from mentor and observer observations, we may then perform the following test: if V_m^-(s) < V_o^-(s), then we say that V_o(s) supersedes V_m(s) and we write V_o(s) ≻ V_m(s). When V_o(s) ≻ V_m(s), then either the mentor-inspired model has, in fact, a lower expected value (within a specified degree of confidence) and uses a nonoptimal action (from the observer's perspective), or the mentor-inspired model has lower confidence. In either case, we reject the information provided by the mentor and use a standard Bellman backup using the action model derived solely from the observer's experience (thus suppressing the augmented backup). Specifically, we define the backed-up value to be V_o(s) in this case.

An algorithm for computing an augmented backup using this confidence test is shown in Table 4.1.
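The supersedes test itself is a one-line predicate on the two penalized estimates; a sketch (the numbers in the example are illustrative only):

```python
import math

def supersedes(V_o, var_o, V_m, var_m, c=5.0):
    """Confidence test: the observer's estimate supersedes the mentor's when
    its uncertainty-penalized lower bound V_o - c*sigma exceeds the mentor's."""
    return V_o - c * math.sqrt(var_o) > V_m - c * math.sqrt(var_m)

# A mentor estimate backed by many samples (low variance) survives against a
# slightly higher but very uncertain observer estimate:
print(supersedes(V_o=9.0, var_o=4.0, V_m=8.5, var_m=0.01))  # False
```

This captures the intended asymmetry: the mentor's information is rejected only when the observer's own estimate is both higher and sufficiently trustworthy.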
The algorithm parameters include the current estimate of the augmented value function V, the current estimated model T_o and its associated variance σ_o², and the model of the mentor's Markov chain T_m and its associated local variance σ_m². It calculates lower bounds and returns the mean value, V_o or V_m, with the greatest lower bound. The parameter c determines the width of the confidence interval used in the mentor rejection test.

Focusing

In addition to using mentor observations to augment the quality of its value estimates, an observer can exploit its observations of the mentor to focus its attention on the states visited by the mentor. In a model-based approach, the specific focusing mechanism we adopt requires the observer to perform a (possibly augmented) Bellman backup at state s whenever the mentor makes a transition from s. This has three effects. First, if the mentor tends to visit interesting regions of space (e.g., if it shares a certain reward structure with the observer), then the significant values backed up from mentor-visited states will bias the observer's exploration towards these regions. Second, computational effort will be concentrated toward parts of state space where the estimated model T_m(s, t) changes, and hence where the estimated value of one of the observer's actions may change.
Third, computation is focused where the model is likely to be more accurate (as discussed above).

FUNCTION augmentedBackup(V, T_o, σ_o², T_m, σ_m², s, c)
    a* = argmax_{a∈A_o} Σ_{t∈S} T_o(s, a, t) V(t)
    V_o(s) = R_o(s) + γ Σ_{t∈S} T_o(s, a*, t) V(t)
    V_m(s) = R_o(s) + γ Σ_{t∈S} T_m(s, t) V(t)
    σ_vo(s)² = γ² Σ_{t∈S} σ_o^{s,a*}(t)² V(t)²
    σ_vm(s)² = γ² Σ_{t∈S} σ_m^s(t)² V(t)²
    V_o^-(s) = V_o(s) − c σ_vo(s)
    V_m^-(s) = V_m(s) − c σ_vm(s)
    IF V_o^-(s) ≥ V_m^-(s) THEN
        V(s) = V_o(s)
    ELSE
        V(s) = V_m(s)
    END

Table 4.1: An algorithm to compute the Augmented Bellman Backup

Action Selection

The integration of exploration techniques in the action selection policy is important for any reinforcement learning algorithm to guarantee convergence. In implicit imitation, it plays a second, crucial role in helping the agent exploit the information extracted from the mentor. Our improved convergence results rely on the greedy quality of the exploration strategy to bias an observer towards the higher-valued regions of the state space revealed by the mentor. For expediency, we have adopted the ε-greedy action selection method, using an exploration rate ε that decays over time (we could easily employ other semi-greedy methods such as Boltzmann exploration). In the presence of a mentor, greedy action selection becomes more complex.
The observer examines its own actions at state s in the usual way and obtains a best action a* which has a corresponding value V_o(s). A value V_m(s) is also calculated for the mentor's action. If V_o(s) ≻ V_m(s), then the observer's own action model is used and the greedy action is defined exactly as if the mentor were not present. If, however, V_o(s) ⊁ V_m(s), then we would like to define the greedy action to be the action dictated by the mentor's policy at state s. Unfortunately, the observer does not know which action this is, so we define the greedy action to be the observer's action "closest" to the mentor's action according to the observer's current model estimates at s. More precisely, the action most similar to the mentor's at state s, denoted κ_m(s), is that whose outcome distribution has minimum Kullback-Leibler divergence from the mentor's action outcome distribution:

    κ_m(s) = argmin_{a∈A_o} Σ_{t∈S} T_m(s, t) ln( T_m(s, t) / T_o(s, a, t) )    (4.8)

The observer's own experience-based action models will be poor early in training, so there is a chance that the closest-action computation will select the wrong action. We rely on the exploration policy to ensure that each of the observer's actions is sampled appropriately in the long run. If the mentor is executing a stochastic policy, the test based on KL-divergence can mislead the learner.
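The closest-action computation of Equation 4.8 can be sketched as follows (names are ours; the small epsilon guarding against zero-probability successors is an implementation detail not specified in the text):

```python
import math

def kl(p, q, eps=1e-9):
    """KL divergence D(p || q) between two successor distributions."""
    return sum(p[t] * math.log(p[t] / max(q.get(t, 0.0), eps))
               for t in p if p[t] > 0.0)

def closest_action(T_m_s, T_o_s, actions):
    """Observer action whose outcome distribution at s is nearest, in KL
    divergence, to the mentor's observed outcome distribution."""
    return min(actions, key=lambda a: kl(T_m_s, T_o_s[a]))

# The mentor mostly reaches t1 from s; 'up' matches that profile best.
T_m_s = {'t1': 0.9, 't2': 0.1}
T_o_s = {'up':   {'t1': 0.8, 't2': 0.2},
         'down': {'t1': 0.1, 't2': 0.9}}
print(closest_action(T_m_s, T_o_s, ['up', 'down']))  # 'up'
```

Taking the divergence in this direction weights mismatches by where the mentor actually goes, which is the only distribution the observer can trust here.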
A full treatment of action inference in a general setting can be handled more naturally in a Bayesian formulation of imitation. We consider this in Chapter 5.

In an online reinforcement learning scenario, an observer agent will have limited opportunities to sample its environment and constraints on the amount of computation it can perform before it must act in the world. Both of these constraints put bounds on the accuracy of the Q-values an observer can hope to achieve. In the absence of high-quality estimates for the value of every action in every state, we would like to bias the value of the states along the mentor's trajectories so that they look worthwhile to explore. We do this by setting the initial Q-values over the entire space to a value less than the estimated lower bound on the value of any state for the problem. In our examples, rewards are strictly positive, so all state values are positive. We can therefore set the initial Q-value estimates to zero in order to make the values of unexplored states look worse than those for which the agent has evidence courtesy of the mentor. In this case, the greedy step in the exploration method will prefer actions that lead to mentor states over actions for which the agent has no information.
Model Extraction in Specific Reinforcement Learning Algorithms

Model extraction, augmented backups, the focusing mechanism, and our extended notion of greedy action selection can be integrated into model-based reinforcement learning algorithms with relative ease. Generically, our implicit imitation algorithm requires that: (a) the observer maintain an estimate of its own model T_o(s, a, t) and the mentor's Markov chain T_m(s, t), which is updated with every observed transition; and (b) all backups performed to estimate its value function use the augmented backup Equation 4.2 with confidence testing. Of course, these backups are implemented using estimated models T̂_o(s, a, t)
W h e n e v e r an e x p e r i e n c e t u p l e (s, a, r, t) is s a m p l e d , the e s t i m a t e d m o d e l at state s c a n c h a n g e ; a B e l l m a n b a c k u p is p e r f o r m e d at s to i n c o r p o r a t e the r e v i s e d m o d e l a n d s o m e ( u s u a l l y fixed) n u m b e r o f a d d i t i o n a l b a c k u p s are p e r f o r m e d at selected states. States are s e l e c t e d using a  priority that  estimates the p o t e n t i a l c h a n g e i n their v a l u e s b a s e d o n the c h a n g e s pre-  c i p i t a t e d b y e a r l i e r b a c k u p s . . E s s e n t i a l l y , c o m p u t a t i o n a l resources ( b a c k u p s ) are f o c u s e d 9  o n those states that c a n m o s t "benefit" f r o m those b a c k u p s . I n c o r p o r a t i n g o u r ideas i n t o p r i o r i t i z e d s w e e p i n g s i m p l y r e q u i r e s the f o l l o w i n g changes  •  W i t h each transition  (s, a, t) the  o b s e r v e r takes, the e s t i m a t e d m o d e l  T (s,a,t) 0  is u p -  dated a n d an a u g m e n t e d b a c k u p is p e r f o r m e d at state s. A u g m e n t e d b a c k u p s are then p e r f o r m e d at a f i x e d n u m b e r o f states u s i n g the u s u a l p r i o r i t y q u e u e i m p l e m e n t a t i o n .  •  W i t h e a c h o b s e r v e d m e n t o r t r a n s i t i o n ( s , t), the e s t i m a t e d m o d e l T  m  (s, t) is u p d a t e d  a n d an a u g m e n t e d b a c k u p is p e r f o r m e d at s. A u g m e n t e d b a c k u p s are then p e r f o r m e d at a f i x e d n u m b e r o f states u s i n g the u s u a l p r i o r i t y q u e u e i m p l e m e n t a t i o n .  K e e p i n g samples o f mentor behavior implements m o d e l extraction. 
Augmented backups integrate this information into the observer's value function, and performing augmented backups at observed transitions (in addition to experienced transitions) incorporates our focusing mechanism. The observer is not forced to "follow" or otherwise mimic the actions of the mentor directly. But it does back up value information along the mentor's trajectory as if it had. Ultimately, the observer must move to those states to discover which actions are to be used; but in the meantime, important value information is propagated that can guide its exploration.

Implicit imitation does not alter the long-run theoretical convergence properties of the underlying reinforcement learning algorithm. The implicit imitation framework is orthogonal to ε-greedy exploration as it alters only the definition of the "greedy" action, not when the greedy action is taken. Given a theoretically appropriate decay factor, the ε-greedy strategy will thus ensure that the distributions for the action models at each state are sampled infinitely often in the limit and converge to their true values.
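In code, the ε-greedy rule with an exponentially decaying exploration rate might look as follows; the parameter names and default values are illustrative, not those tuned in the experiments.

```python
import random

def egreedy_action(q_values, step, eps0=0.5, decay=0.999, rng=random):
    """Pick the greedy action (arg max of the supplied value estimates)
    with probability 1 - eps, and a uniformly random action with
    probability eps, where eps = eps0 * decay**step decays with time."""
    eps = eps0 * decay ** step
    actions = list(q_values)
    if rng.random() < eps:
        return rng.choice(actions)
    best = max(q_values.values())
    # Break ties among equally valued greedy actions at random.
    return rng.choice([a for a in actions if q_values[a] == best])
```

Under implicit imitation only the value estimates fed into `q_values` change; when and how often the greedy action is taken is untouched, which is why the underlying convergence guarantees carry over.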
By the homogeneity assumption, the extracted model from the mentor corresponds to one of the observer's own actions, so its effect on the value function calculations is no different than the effect of the observer's own sampled action models. The confidence mechanism ensures that the model with more samples will eventually come to dominate if it is, in fact, better. We can therefore rest assured that the convergence properties of reinforcement learning with implicit imitation are identical to those of the underlying reinforcement learning algorithm.

The benefit of implicit imitation lies in the way in which the models extracted from the mentor allow the observer to calculate a lower bound on the value function and use this lower bound to choose its greedy actions to move the agent towards higher-valued regions of state space. As we will demonstrate in our experimental section, the typical result is quicker convergence to optimal policies and better short-term practical performance with respect to accumulated discounted reward while learning.

Extensions

The implicit imitation model can easily be extended to extract model information from multiple mentors, mixing and matching pieces extracted from each mentor to achieve the most informative model.
It does this by searching, at each state, the set of mentors it knows about to find the mentor with the highest value estimate. The value estimate of this mentor is then compared, using the confidence test described above, with the observer's own value estimate. The formal expression of the algorithm is given by the multi-augmented Bellman equation:

V(s) = max{ max_{a ∈ A_o} [ R(s) + γ Σ_t T_o(s, a, t) V(t) ],
            max_{m ∈ M} [ R(s) + γ Σ_t T_m(s, t) V(t) ] }    (4.9)

where M is the set of candidate mentors. Ideally, confidence estimates should be taken into account in the selection of the best mentor, as we may get a mentor with a high mean value estimate but large variance winning out over a mentor with lower mean but high confidence. If the observer has any experience with the state at all, a low-confidence mentor will likely be rejected as having poorer quality information than the observer. If all mentors can be guaranteed to be executing the same policy (perhaps the optimal policy), then the equation could be simplified so that the samples from all mentors could be pooled into a single model T_m(s, t).

Up to now, we have assumed no action costs (i.e., rewards that depend only on the state); however, we can use more general reward functions (e.g., where reward has the form R(s, a)). The difficulty lies in backing up action costs when the mentor's chosen action is unknown.
We can make use of κ, the closest action function developed for action selection, to choose the appropriate reward. The augmented Bellman equation with generalized rewards takes the following form:

V(s) = max{ max_{a ∈ A_o} [ R_o(s, a) + γ Σ_t T_o(s, a, t) V(t) ],    (4.10)
            R(s, κ(s)) + γ Σ_t T_m(s, t) V(t) }    (4.11)

We can substitute Equation 4.11 for the usual augmented value equation defined in Equation 4.2 and perform implicit imitation with general rewards. We note that Bayesian methods could also be used to estimate the mentor's action choice (see Chapter 5).

4.1.3 Empirical Demonstrations

In this section, we present a number of experiments comparing standard reinforcement learning agents with an observer agent based on prioritized sweeping, but augmented with model extraction and our focusing mechanism. These results illustrate how specific features of domains interact with implicit imitation and provide a quantitative assessment of the gains implicit imitation can provide. Since the imitation agent is also a reinforcement learning agent, we first describe the general form of the reinforcement learning problems we will test our imitation agents on, and then we explain how we implemented mentors and observations of mentors.
The experiments with imitation agents are all performed in stochastic grid-world domains, since these domains are easy to understand and make it clear to what extent the observer's and mentor's optimal policies overlap (or fail to). Figure 4.2 shows a simple 10 x 10 grid-world example. In a grid-world problem, an agent must discover a sequence of actions which will move it from the start state to the end state. The agent's movement from state s to state t under action a will be defined by an underlying environment transition model of the form T_{s,a}(t).

Figure 4.2: A simple grid world with start state S and goal state X showing representative trajectories for a mentor (solid) and an observer (dashed).

In the context of reinforcement learning, the imitation agent does not have complete knowledge of the effects of its actions (i.e., it is uncertain about T_{s,a}(t)) and must therefore experiment with its actions in order to learn a model. In our formulation of the grid-world problem, the agent is given the prior knowledge that its actions can only make transitions to legal neighbouring states.
In general, the neighbouring states are defined as the union of the eight immediately adjacent states in the grid (horizontal, vertical and diagonal neighbours) and the state itself (to represent self-transitions). Where the boundary of the grid truncates a neighbouring state or a state is occupied by an "obstacle", we say that the neighbouring state is infeasible. We therefore define the legal neighbourhood of each state as the set of feasible neighbouring states. In some experiments, a small number of clearly indicated states are defined in which the agent has different action capabilities (e.g., the ability to jump to another state). When such states are defined, the legal neighbourhood of the state is extended to reflect these capabilities.

The agent's prior knowledge of the legal neighbourhood is represented by a Dirichlet prior over only the legal neighbouring states, with the hyperparameter for each potential successor state initially set to 1. This locally uninformative prior represents the fact that agents are initially uncertain about which "direction" their actions will take them in, but they know it will be a legal direction. The main motivation for using local neighbourhood priors is to avoid the computational blowup as the number of states in our problems is increased.
In our 25 x 25 grid-world example, there are 625 states, which would require roughly 400,000 parameters to represent a locally factored global prior. The decision to use these local priors reflects the fact that the agent is known to be confronting a navigation problem in a geometric space. As we argue below, the use of local priors does not materially affect the questions we wish to answer. Alternatively, we might have employed sparse multinomial priors (Friedman & Singer, 1998).

Given the agent's prior knowledge of the transition model, some initial value calculations could be done in theory. In goal-based scenarios, prior knowledge that actions have only local scope could be used to predict that the value of states nearer to the goal will be higher than the value of states far away from the goal. Since the agent's actions have uniform probability of making the transition to neighbour states, however, every action in any given state would have the same value according to its priors. The agent would still have to experiment with the actions in each state in order to learn which actions move it towards the goal. The prior values assigned to states could be used to bias a greedy exploration mechanism towards those states that are nearer the goal.
However, when states with negative values are introduced, the agent may be unsure whether there exists a policy able to avoid these penalty states with sufficient probability to make achieving the goal worthwhile. The agent may therefore avoid the goal state in order to avoid a region it believes likely to result in a net loss.

In all of our experiments, the agent's initial Q-values are all set to zero. We do not perform any initial backups before the agent starts its task. Agents employ prioritized sweeping with a fixed number of backups per observed or experienced sample.¹⁰ Since prioritized sweeping only acts on those states where there have been experiences, or states most likely to be affected by experiences, significant parts of the state space may remain unupdated during any one sweep.

We have explained that the agent's prior beliefs about its action capabilities admit the possibility that any particular action choice could result in the transition to any of typically nine successor states. In our experiments, however, the agent only possesses 4 distinct actions or control signals.
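The Dirichlet prior over legal neighbourhoods can be sketched as follows. The class and method names are our own; the same structure serves for the observer's per-action models by keying on (state, action) pairs.

```python
class LegalNeighbourDirichlet:
    """Dirichlet prior over the legal neighbourhood of each state: every
    feasible successor starts with hyperparameter 1 (locally uninformative),
    and each observed transition increments the matching hyperparameter."""

    def __init__(self, legal_neighbours):
        # legal_neighbours: state -> iterable of feasible successor states
        self.alpha = {s: {t: 1.0 for t in nbrs}
                      for s, nbrs in legal_neighbours.items()}

    def update(self, s, t):
        self.alpha[s][t] += 1.0   # one more observed transition s -> t

    def predictive(self, s):
        # Posterior predictive distribution: normalized hyperparameters.
        total = sum(self.alpha[s].values())
        return {t: a / total for t, a in self.alpha[s].items()}
```

Before any observations every legal neighbour is equally likely; after repeated moves to one neighbour the predictive mass concentrates there, while infeasible states never receive any mass at all.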
The actual effects of the four actions are ε-stochastic (i.e., each action a is associated with an intended direction such as "North", and upon executing this action the environment sends the agent in this direction (1 − ε) of the time and some other direction ε of the time). In most experiments, the four actions stochastically move the agent in the compass directions North, South, East and West. To illustrate pertinent points, we employ transition models representing different action capabilities in different experiments.

ε-greedy exploration is used with decaying ε. In our formulation, the agent selects an action randomly from the set of actions that is currently estimated to attain the highest value, and selects a random action from the complete action set ε of the time. We use exponentially decaying exploration rates. The initial values and decay parameters were tuned for each problem and are described in specific experiments.

In each of the experiments, we compare the performance of independent reinforcement learners with observation-making imitation learners. The imitation learners simultaneously observe an expert mentor while attempting to solve the problem. In each experiment, the mentor is following an ε-greedy policy with a small ε (on the order of 0.01).
This tends to cause the mentor's trajectories to lie within a "cluster" surrounding optimal trajectories (and reflect good if not optimal policies). Even with a small amount of exploration and some environment stochasticity, mentors generally do not visit the entire state space, so confidence testing is important. A typical optimal mentor trajectory is illustrated by the solid line between the start and end states. The dotted line in Figure 4.2 shows that a typical mentor-influenced trajectory will be quite similar to the observed mentor trajectory.

Our goal in these experiments is to illustrate how imitation affects the convergence of a reinforcement learner to an optimal policy, how various properties of specific domains can affect the gain due to imitation, and how the various mechanisms we propose work together to deal with problems we have anticipated for imitation learners. To shed light on these questions, we sought an appropriate measure of performance. In our goal-based scenarios, the instantaneous reward occurs only at the end of a successful trajectory.

¹⁰Generally, the number of backups was set to be roughly equal to the length of the optimal "noise-free" path. This heuristic value is big enough to propagate value changes across the state space, but small enough to be tractable in the inner loop of a reinforcement learner.
We found graphs of instantaneous reward difficult to interpret and compare. In cumulative reward graphs, optimal performance shows up as a fixed upward slope representing a constant goal attainment rate. We found that slope comparisons caused difficulties for readers. We settled on a compromise solution: a windowed average of reward. The window length was chosen to be long enough to integrate instantaneous reward signals and short enough that we can readily see when the goal rate is changing and when it has converged to the optimal value. When two agents have converged to the same quality of policy, their windowed average graph lines will also converge. We emphasize that our measure is not the same as average reward, as the windowing function discards all data before the window cutoff.

In the remainder of this section we present the results of our experiments. We look at the gain in reward obtained by a reinforcement learner in a basic grid-world scenario. We then look at how this gain is affected by changes to the scale of the problem and the stochasticity of the agent's actions.
We look at how the observer's prior beliefs about mentor capabilities can cause problems, how qualitatively more difficult problems affect the gain due to imitation, and conclude with an example in which information from several mentors is simultaneously combined with the imitator's own experience to improve its policy.

Experiment 1: The Imitation Effect

In our first experiment we compared the performance of an observer using model extraction and an expert mentor with the performance of a control agent using standard reinforcement learning. Given the lack of obstacles and the lack of intermediate rewards, confidence testing was not required.¹¹ Both agents attempted to learn a policy that maximizes discounted return in a 10 x 10 grid world. They started in the upper-left corner and sought a goal with value 1.0 in the lower-right corner. Upon reaching the goal, the agents were restarted in the upper-left corner. Action dynamics were noisy, resulting in the intended direction 90 percent of the time and a random direction 10 percent of the time (uniformly across the North, South, East and West directions). The discount factor was 0.9. As explained above, the mentor executes a mildly stochastic perturbation of the optimal policy (one percent noise) for the domain.
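The windowed average of reward used in the following plots can be computed with a simple sliding sum; this is a minimal sketch (the experiments window over the previous 1000 steps).

```python
from collections import deque

def windowed_reward(rewards, window=1000):
    """Sum of instantaneous rewards over the trailing window at each step.
    Data before the window cutoff is discarded, so once a policy converges
    the curve flattens at a constant goal-attainment rate."""
    buf, total, out = deque(), 0.0, []
    for r in rewards:
        buf.append(r)
        total += r
        if len(buf) > window:
            total -= buf.popleft()   # drop reward past the window cutoff
        out.append(total)
    return out
```

Unlike a true average-reward measure, an early burst of reward disappears from this curve as soon as it slides past the window cutoff.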
In Figure 4.3 we plot the cumulative number of goals obtained over the previous 1000 time steps for the observer (Obs) and control (Ctrl) agents (results are averaged over ten runs). The observer was able to quickly incorporate a policy learned from the mentor into its value estimates. This results in a steeper learning curve. In contrast, the control agent slowly explored the space to build a model first. Both agents converged to the same optimal value function, which is reflected in the graph by the goal rates for observer and control agents converging to the same optimal rate. The Delta curve, defined as Obs − Ctrl, shows the gain of the imitator over the unaided reinforcement learner. The gain peaks during the early learning phase and goes to zero once both agents have converged to optimal policies.

¹¹Confidence testing will be relevant in subsequent scenarios.

Figure 4.3: On a simple grid-world, an observer agent 'Obs' discovers the goal and optimizes its policy faster than a control agent 'Ctrl' without observation. The 'Delta' curve plots the gain of the observer over the control.

Experiment 2: Scaling and Noise

The next experiment illustrated the sensitivity of imitation to the size of the state space and the stochasticity of actions.
Again, the observer used model extraction but not confidence testing. In Figure 4.4 we plot the Delta curves for the basic scenario (Basic) just described, the scaled-up scenario (Scale) in which the state space size is increased 69 percent (to a 13 x 13 grid), and the stochastic scenario (Stoch) in which the noise level is increased from 10 percent to 40 percent (results are averaged over ten runs). The Delta curves represent the increase in goal rate due to imitation over the goal rate obtained by an unaided reinforcement learner for a given scenario. The generally positive values of all the Delta curves represent the fact that in each of the scenarios considered, imitation increases the imitator's goal rate. However, the timing and duration of the gain due to imitation vary depending on the scenario.

When we compare the gain obtained in the basic imitation scenario (Basic) with the gain obtained in the scaled-up scenario (Scale), we see that the gain in the scaled-up scenario increases more slowly than in the basic scenario. This reflects the fact that even with the mentor's guidance, the imitator requires more exploration to find the goal in the larger scenario.
The peak in the gain for the scaled-up scenario is lower, as the optimal goal rate in a larger problem is lower; however, the duration of the gain in the scaled-up scenario is considerably longer. This reflects Whitehead's (1991) observation that for grid worlds, exploration requirements can increase quickly with state space size (i.e., the work to be done by an unguided agent), but the optimal path length increases only linearly (e.g., the example provided by the mentor). We conclude that the guidance of the mentor can help more in larger state spaces.

Figure 4.4: Delta curves compare gain due to observation for the 'Basic' scenario described above, a 'Scale' scenario with a larger state space and a 'Stoch' scenario with increased noise.

When we compare the gain due to imitation in the basic scenario (Basic) with the gain due to imitation in a scenario where the stochasticity of actions is dramatically increased (Stoch), we see that the gain in the stochastic scenario is considerably reduced in maximum value. Increasing the stochasticity will increase the time required to complete the problem for both agents and reduce the maximum possible goal rate. The increased stochasticity also considerably delays the gain due to imitation relative to the basic scenario.
Again, the imitator takes longer to discover the goal. As with the scaled-up problem, the gain persists for a longer period, indicating that it takes longer for the unaided reinforcement learner to converge in the stochastic scenario. We speculate that increasing the noise level reduces the observer's ability to act upon the information received from the mentor and therefore erodes its advantage over the control agent. We conclude, however, that the benefit of imitation degrades gracefully with increased stochasticity and is present even at this relatively extreme noise level.

Experiment 3: Confidence Testing

The agents in our experiments have prior beliefs about the outcomes of their actions. In the case of an imitation agent, the observer also has prior beliefs about the outcomes of the mentor's actions. Like the observer's priors about its own actions, the observer's prior beliefs about mentor actions take the form of a Dirichlet distribution with uniform hyperparameters over the legal neighbouring states. Thus, the observer expects that the mentor will move in a way that is constrained to single steps through a geometric space. The observer, however, does not know the actual distribution of successor states for the mentor's action. More intuitively, the observer does not know which direction the mentor will move in each state.
The mentor priors are only intended to give the observer a rough guess about the mentor's behaviour, but will generally not reflect the mentor's actual behaviour. Since the priors over the mentor's action choices are used by the observer to estimate the value of the mentor's action, there may be situations in which the observer's prior beliefs about the transition probabilities of the mentor cause the observer to overestimate the value of the mentor's action. In assuming that it can duplicate the mentor's actions, the observer may then overvalue the state in which the mentor's action was observed. The confidence mechanism proposed in the previous section can prevent the observer from being fooled by misleading priors over the mentor's transition probabilities.

To demonstrate the role of the confidence mechanism in implicit imitation, we designed an experiment based on the scenario illustrated in Figure 4.5. Again the agent's task was to navigate using the North, South, East and West actions from the top-left corner to the bottom-right corner of a 10 x 10 grid in order to attain a reward of +1. We created a pathological scenario with islands of high reward (+5). These islands are isolated horizontally and vertically from adjacent states by obstacles.
The diagonal path, however, remains unblocked. Since our agent's priors admit the possibility of making a transition to any state in the legal neighbourhood of the current state, including the cells diagonally adjacent to the current state, the high-valued cells in the middle of each island are initially believed to be reachable from the states diagonally adjacent to them with some small prior probability. In reality, however, the agent's action set precludes this possibility and the agent will therefore never be able to realize this value. The four islands in this scenario thus create a fairly large region in the center of the space with a high estimated value, which could potentially trap an observer if it persisted in its prior beliefs.

Notice that a standard reinforcement learner will "quickly" learn that none of its actions take it towards the rewarding islands. In contrast, an implicit imitator using augmented backups could be fooled by its prior mentor model. If the mentor does not visit the states neighboring the island, the observer will not have any evidence upon which to change its prior belief that the mentor's actions are equally likely to result in a transition to any state in

Figure 4.5: Prior models assume diagonal connectivity and hence that the +5 rewards are accessible to the observer.
In fact, the o b s e r v e r has o n l y ' N E W S ' a c t i o n s . U n u p d a t e d p r i o r s about the m e n t o r ' s c a p a b i l i t i e s m a y l e a d to o v e r v a l u i n g d i a g o n a l l y adjacent states.  the l o c a l n e i g h b o u r h o o d i n c l u d i n g those d i a g o n a l l y adjacent to the current state. T h e i m i tator m a y f a l s e l y c o n c l u d e o n the b a s i s o f the m e n t o r a c t i o n m o d e l that a n there e x i s t s a n a c t i o n w h i c h w o u l d a l l o w it to access the i s l a n d s o f v a l u e . T h e o b s e r v e r w o u l d then o v e r v a l u e the states d i a g o n a l l y adjacent to the h i g h - v a l u e d states. T h e o b s e r v e r therefore needs a c o n f i d e n c e m e c h a n i s m to detect w h e n the m e n t o r m o d e l is less r e l i a b l e than its o w n m o d e l . T o test the c o n f i d e n c e m e c h a n i s m , w e a s s i g n e d the m e n t o r a p a t h a r o u n d the o u t s i d e o f the obstacles so that its path c o u l d not l e a d the o b s e r v e r out o f the trap (i.e., it p r o v i d e s n o e v i d e n c e to the o b s e r v e r that the d i a g o n a l m o v e s i n t o the i s l a n d s are n o t f e a s i b l e ) . T h e c o m b i n a t i o n o f a h i g h i n i t i a l e x p l o r a t i o n rate a n d the a b i l i t y o f p r i o r i t i z e d s w e e p i n g to spread v a l u e across large distances v i r t u a l l y guarantees that the o b s e r v e r w i l l b e l e d to the trap. G i v e n this s c e n a r i o , w e ran t w o o b s e r v e r agents a n d a c o n t r o l . T h e first o b s e r v e r u s e d a c o n f i d e n c e i n t e r v a l w i t h w i d t h g i v e n b y 5<r, w h i c h , a c c o r d i n g to the C h e b y c h e v r u l e , s h o u l d c o v e r at least 96 percent o f an arbitrary d i s t r i b u t i o n . 
The second observer was given a 0σ interval, which effectively disables confidence testing.

In Figure 4.6, the performance of the observer with confidence testing is compared to the performance of the control agent (results are averaged over 10 runs). The observer without confidence testing consistently became stuck. Examination of the value function for the observer without confidence testing revealed consistent peaks within the trap region and inspection of the agent state trajectories showed that it was stuck in the trap. This agent failed to converge to the optimal goal rate in any run and is not plotted on the graph.

The first detail that stands out in the graph of the observer with confidence testing (Obs) is that the agent does not experience the rapid convergence to the optimal goal rate seen in previous scenarios. Initially, due to the high exploration rate early in learning, both the observer (Obs) and control agent (Ctrl) sporadically reach the goal, but due to the presence of the islands, the goal rate is comparatively low in the first 2000 steps. Even the control agent shows a longer exploration phase than in the basic scenario shown in Figure 4.3. After the first 2000 steps, we see both agents rapidly converging to the optimal goal rate with the observer initially having a slight lead.
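The Chebychev rule invoked here is simple enough to check numerically. A minimal sketch (the function name is ours):

```python
def chebyshev_coverage(k):
    """Chebyshev's inequality: at least 1 - 1/k^2 of the mass of an
    arbitrary distribution lies within k standard deviations of its mean."""
    return 1.0 - 1.0 / (k * k)

# A 5-sigma interval covers at least 96 percent of any distribution,
# while a 0-sigma interval guarantees nothing, disabling the test.
print(chebyshev_coverage(5))  # 0.96
```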
Observation of its value function over time shows that the trap formed, but faded away as the observer gained enough experience to overcome erroneous priors on both its own models and its models of the mentor. At the end of the experiment, the observer's performance is slightly less than that of the control agent. This may be due to improperly tuned exploration rate parameters. Theoretically these lines will converge given proper exploration schedules. Based on the difference in performance between the observer with confidence testing and the observer without confidence testing, we conclude that confidence testing is necessary for implicit imitation and effective in maintaining reasonable performance (on the same order as the control agent) even in this pathological case.

Figure 4.6: Misleading priors about mentor abilities may eliminate imitation gains and even result in some degradation of the observer's performance relative to the control agent, but confidence testing prevents the observer from permanently fixating on infeasible goals.

Figure 4.7: A complex maze with 343 states has considerably less local connectivity than an equivalently-sized grid world. A single 133-step path is the shortest solution.
Experiment 4: Qualitative Difficulty

The next experiment demonstrates how the potential gains of imitation can increase with the (qualitative) difficulty of the problem. The observer employed both model extraction and confidence testing, though confidence testing did not play a significant role in this experiment (in this problem there are no intermediate rewards or shortcuts whose effect on value might be propagated by mentor priors). In the "maze" scenario, we introduced obstacles in order to increase the difficulty of the learning problem. The maze is set on a 25 x 25 grid (Figure 4.7) with 286 obstacles complicating the agent's journey from the top-left to the bottom-right corner. The optimal solution takes the form of a snaking 133-step path, with distracting paths (up to length 22) branching off from the solution path, necessitating frequent backtracking. The discount factor is 0.98. With 10 percent noise, the optimal goal-attainment rate is about six goals per 1000 steps. The mentor repeatedly follows an ε-optimal path through the maze (ε = 0.01).

From the graph in Figure 4.8 (with results averaged over ten runs), we see that the control agent took on the order of 150,000 steps to build a decent value function that reliably led to the goal. At this point, it only achieved four goals per 1000 steps on average, as its exploration rate is still reasonably high (unfortunately, decreasing exploration more quickly leads to a higher probability of a suboptimal policy and therefore a lower value function). The imitation agent is able to take advantage of the mentor's expertise to build a reliable value function in about 20,000 steps. Since the control agent was unable to reach the goal at all in the first 20,000 steps, the Delta between the control and the imitator is simply equal to the imitator's performance. The imitator quickly achieved the optimal goal-attainment rate of six goals per 1000 steps, as its exploration rate decayed much more quickly.

Figure 4.8: On the difficult maze problem, the observer 'Obs' converges long before the unaided control 'Ctrl'.

Experiment 5: Improving Suboptimal Policies by Imitation

The augmented backup rule does not require that the reward structure of the mentor and observer be identical. There are many useful scenarios where rewards are dissimilar but induce value functions and policies that share some structure. In this experiment, we demonstrated one interesting scenario in which it is relatively easy to find a suboptimal solution, but difficult to find the optimal solution.
Once the observer found this suboptimal path, however, it was able to exploit its observations of the mentor to see that there is a shortcut that significantly shortens the path to the goal. The structure of the scenario is shown in Figure 4.9. The suboptimal solution lies on the path from location 1 around the "scenic route" to location 2 and on to the goal at location 3. The mentor took the vertical path from location 4 to location 5 through the shortcut. To discourage the use of the shortcut by novice agents, it was lined with cells (marked "*") that jump back to the start state. It was therefore difficult for a novice agent executing many random exploratory moves to make it all the way to the end of the shortcut and obtain the value which would reinforce its future use. Both the observer and control therefore generally found the scenic route first.

In Figure 4.10, the performance (measured using goals reached over the previous 1000 steps) of the control and observer is compared (averaged over ten runs), indicating the value of these observations. We see that the observer and control agent both found the longer scenic route, though the control agent took longer to find it. The observer went on to find the shortcut and jumped to almost double the goal rate.
This experiment shows that mentors can improve observer policies even when the observer's goals are not on the mentor's path. (A mentor proceeding from 5 to 4 would not provide guidance without prior knowledge that actions are reversible.)

Figure 4.9: The shortcut (4 → 5) in the maze is too perilous for unknowledgeable (or unaided) agents to attempt.

Figure 4.10: The observer 'Obs' tends to find the shortcut and converges to the optimal policy, while the unaided control 'Ctrl' typically settles on the suboptimal solution.

Figure 4.11: A scenario with multiple mentors. Each mentor covers only part of the space of interest to the observer.

Experiment 6: Multiple Mentors

The final experiment illustrated how model extraction can be readily extended so that the observer can extract models from multiple mentors and exploit the most valuable parts of each. Again, the observer employed model extraction and confidence testing. In Figure 4.11, the learner must move from start location 1 to goal location 4. Two expert agents with different start and goal states serve as potential mentors. One mentor repeatedly moved from location 3 to location 5 along the dotted line, while a second mentor departed from location 2 and ended at location 4 along the dashed line.
In this experiment, the observer had to combine the information from the examples provided by the two mentors with independent exploration of its own in order to solve the problem. In Figure 4.12, we see that the observer successfully pulled together the mentor observations and independent exploration to learn much more quickly than the control agent (results are averaged over 10 runs). We see that the use of a value-based technique allowed the observer to choose which mentor's influence to use on a state-by-state basis in order to get the best solution to the problem.

Figure 4.12: The gain from simultaneous imitation of multiple mentors is similar to that seen in the single mentor case.

Our experiments suggest that model extraction and confidence testing are applicable under a wide variety of conditions, but we have seen that these techniques perform best when there is overlap in the reward functions between mentor and observer, when there is a mentor following a policy with good coverage of the state space, and when there is a problem that is sufficiently difficult to warrant imitation techniques.
We can gain further insight into the applicability and performance of imitation by considering how the mechanism of implicit imitation interacts with features of the domain. We will explore these interactions in more depth in Chapter 6.

4.2 Implicit Imitation in Heterogeneous Settings

The augmented backup equation, as presented in the previous section, relies on the homogeneous actions assumption: that the observer possesses an action which will duplicate the long-run average dynamics of the mentor's actions. When the homogeneity assumption is violated, the implicit imitation framework described above can cause the learner's convergence rate to slow dramatically and, in some cases, cause the learner to become stuck in a small neighborhood of state space. In particular, if the learner is unable to make the same state transition (or a transition with the same probability) as the mentor at a given state, it may drastically overestimate the value of that state. The inflated value estimate causes the learner to return repeatedly to this state even though this exploration will never produce a feasible action for obtaining the inflated estimated value.
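The overestimation problem can be seen in a single-state sketch of an augmented backup. This is our simplified illustration, not the thesis's exact formulation: the function names, data structures, and toy numbers are assumptions.

```python
def augmented_backup(R, gamma, V, T_own, T_mentor):
    """One augmented Bellman backup at a single state: take the max of the
    best own-action backup and a backup through the mentor's observed
    transition distribution (assumed duplicable under homogeneity).
    T_own maps action -> {successor: prob}; T_mentor maps successor -> prob."""
    own = max(sum(p * V[t] for t, p in T_own[a].items()) for a in T_own)
    mentor = sum(p * V[t] for t, p in T_mentor.items())
    return R + gamma * max(own, mentor)

# Toy case: the mentor's transition reaches a high-valued state t that the
# observer's only action cannot reach, so the mentor term dominates.
V = {'t': 10.0, 'u': 0.0}
value = augmented_backup(0.0, 0.9, V,
                         {'a': {'u': 1.0}},      # own action: reaches u only
                         {'t': 0.9, 'u': 0.1})   # mentor: mostly reaches t
print(value)  # 8.1
```

If the mentor's transition is in fact infeasible for the observer, this 8.1 estimate can never be realized, which is exactly the inflation described above.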
There is no mechanism for removing the influence of the mentor's Markov chain on value estimates: the observer can be extremely (and correctly) confident in its observations about the mentor's model and yet the mentor's model is still infeasible for the observer. The problem lies in the fact that the augmented Bellman backup is justified by the assumption that the observer can duplicate every mentor action. That is, at each state s, there is some a ∈ A such that T_o(s, a, t) = T_m(s, t) for all t. When an equivalent action a does not exist, there is no guarantee that the value calculated using the mentor action model can in fact be achieved.

4.2.1 Feasibility Testing

In such heterogeneous settings, we can prevent "lock-up" and poor convergence through the use of an explicit action feasibility test: before an augmented backup is performed at s, the observer tests to see if the mentor's action a_m "differs" from each of its actions at s, given its current estimated models. If every observer action differs from the mentor action, then the augmented backup is suppressed and a standard Bellman backup is used to update the value function. By default, mentor actions are assumed to be feasible for the observer; however, once the observer is reasonably confident that a_m is infeasible at state s, augmented backups are suppressed at s.
Recall that transition models describing each action are estimated from observations, and that the learner's uncertainty about the true transition probabilities is reflected in a Dirichlet distribution over possible models. The point estimates of the transition probabilities used to perform backups correspond to the means of the corresponding Dirichlet distributions. We can therefore view the difference between some observer action a_o and the mentor action a_m as a classical difference-of-means test (Mendenhall, 1987, Section 9.5) with respect to the corresponding Dirichlet distributions. That is, we want to know if the mean transition probabilities representing the effects of the two actions differ at a specified level of confidence. A difference-of-means test over transition model distributions is complicated by the fact that Dirichlets are highly non-normal for small sample counts, and that transition distributions are multinomial. We deal with the non-normality through two techniques: we require a minimum number of samples for each test and we employ robust Chebychev bounds on the pooled variance of the distributions to be compared.
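The Dirichlet point estimate just mentioned reduces to a ratio of counts; a minimal sketch (the function name and the toy counts are ours):

```python
def dirichlet_mean(counts):
    """Mean of a Dirichlet posterior over a multinomial transition model:
    each transition probability is its (prior + observed) count divided by
    the total count at that state-action pair."""
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

# Uniform prior pseudo-counts of 1 over three successors, plus 5 observed
# transitions to 't': the point estimate shifts toward the evidence.
print(dirichlet_mean({'t': 1 + 5, 'u': 1, 'v': 1}))
# {'t': 0.75, 'u': 0.125, 'v': 0.125}
```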
Conceptually, we will evaluate an expression like the following:

|T_o(s, a_o, t) − T_m(s, t)| > Z_{α/2} · sqrt[ (n_o(s, a_o, t) σ_o²(s, a_o, t) + n_m(s, t) σ_m²(s, t)) / (n_o(s, a_o, t) + n_m(s, t)) ]   (4.12)

Essentially, we have a difference-of-means test on the transition probabilities where the variance is a weighted pooling of variances from the observer- and mentor-derived models. The term Z_{α/2} is the critical value of the test. The parameter α is the significance of the test, or the probability that we will falsely reject the two actions as being different when they are actually the same. (The decision is binary, but we could envision a smoother decision criterion that measures the extent to which the mentor's action can be duplicated; see Bayesian imitation in Chapter 5.) Given our highly non-normal distributions early in the training process, the appropriate Z value for a given α can be computed from Chebychev's bound by solving 1 − α = 1 − 1/Z² for Z.

When we have too few samples to do an accurate test, we persist with augmented backups (embodying our default assumption of homogeneity). If the value estimate is inflated by these backups, the agent will be biased to obtain additional samples, which will then allow the agent to perform the required feasibility test. Our assumption is therefore self-correcting.
We deal with the multivariate complications by performing the Bonferroni test (Seber, 1984), which has been shown to give good results in practice (Mi & Sampson, 1993), is efficient to compute, and is known to be robust to dependence between variables. A Bonferroni hypothesis test is obtained by conjoining several single-variable tests. Suppose the actions a_o and a_m result in r possible successor states s_1, ..., s_r (i.e., r transition probabilities to compare). For each s_i, the hypothesis E_i denotes that a_o and a_m have the same transition probability to successor state s_i; that is, T(s, a_m, s_i) = T(s, a_o, s_i). We let Ē_i denote the complementary hypothesis (i.e., that the transition probabilities differ). The Bonferroni inequality states:

Pr[ E_1 ∩ ... ∩ E_r ] ≥ 1 − Σ_{i=1}^{r} Pr[ Ē_i ]

Thus we can test the joint hypothesis E_1 ∩ ... ∩ E_r (that the two action models are the same) by testing each of the r complementary hypotheses Ē_i at confidence level α/r. If we reject any of the hypotheses, we reject the notion that the two actions are equal with confidence at least α. The mentor action a_m is deemed infeasible if for every observer action a_o, the multivariate Bonferroni test just described rejects the hypothesis that the action is the same as the mentor's.
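A runnable sketch of the resulting feasibility test, combining the pooled-variance statistic with the Bonferroni correction. The data structures, helper names, and the exact critical-value convention are our assumptions, not the thesis's implementation.

```python
import math

def z_critical(alpha):
    """Chebyshev-based critical value for significance alpha
    (solving 1 - alpha = 1 - 1/Z^2 for Z)."""
    return 1.0 / math.sqrt(alpha)

def feasible(T_o, var_o, n_o, T_m, var_m, n_m, actions, succs, alpha=0.05):
    """Bonferroni action-feasibility test at a single state. T_o/var_o/n_o
    map (action, successor) to the observer's estimated transition
    probability, its variance, and its sample count; T_m/var_m/n_m map
    successors to the mentor-chain estimates. Returns True iff some observer
    action is not rejected as differing from the mentor's action."""
    r = len(succs)
    z_crit = z_critical(alpha / r)   # each of the r sub-tests runs at alpha/r
    for a in actions:
        all_similar = True
        for t in succs:
            pooled_var = ((n_o[(a, t)] * var_o[(a, t)] + n_m[t] * var_m[t])
                          / (n_o[(a, t)] + n_m[t]))
            z = abs(T_o[(a, t)] - T_m[t]) / math.sqrt(pooled_var)
            if z > z_crit:           # reject E_i: this probability differs
                all_similar = False
        if all_similar:
            return True              # action a may duplicate the mentor's
    return False
```

Rejecting any one of the r per-successor hypotheses rejects equality of the two actions; only if every observer action is rejected is the mentor's action deemed infeasible.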
Pseudo-code for the Bonferroni component of the feasibility test appears in Table 4.2.

FUNCTION feasible(m, s) : Boolean
    FOR each a_o in A_o do
        allSuccessorProbsSimilar = true
        FOR each t in successors(s) do
            Z_A = |T_o(s, a_o, t) − T_m(s, t)| / sqrt[ (n_o(s, a_o, t) σ_o²(s, a_o, t) + n_m(s, t) σ_m²(s, t)) / (n_o(s, a_o, t) + n_m(s, t)) ]
            IF Z_A > Z_{α/2r}
                allSuccessorProbsSimilar = false
        END FOR
        IF allSuccessorProbsSimilar RETURN true
    END FOR
    RETURN false

Table 4.2: An algorithm for action feasibility testing

It assumes a sufficient number of samples. Presently, for efficiency reasons, we cache the results of the feasibility testing. When the duplication of the mentor's action at state s is first determined to be infeasible, we set a flag for state s to this effect.

4.2.2 k-step Similarity and Repair

Action feasibility testing essentially makes a strict decision as to whether the agent can duplicate the mentor's action at a specific state: once it is decided that the mentor's action is infeasible, augmented backups are suppressed and all potential guidance offered is eliminated at that state. Unfortunately, the strictness of the test results in a somewhat impoverished notion of similarity between mentor and observer. This, in turn, unnecessarily limits the transfer between mentor and observer.
We propose a mechanism whereby the mentor's influence may persist even if the specific action it chooses is not feasible for the observer; we instead rely on the possibility that the observer may approximately duplicate the mentor's trajectory instead of exactly duplicating it.

Suppose an observer has previously constructed an estimated value function using augmented backups. Using the mentor action model (i.e., the mentor's chain T_m), a high value has been calculated for state s. Subsequently, suppose the mentor's action at state s is judged to be infeasible. This is illustrated in Figure 4.13, where the estimated value at state s is originally due to the mentor's action π_m(s), which for the sake of illustration moves with high probability to state t, which itself can lead to some highly rewarding region of state space. After some number of experiences at state s, however, the learner concludes that the action π_m(s), and the associated high-probability transition to t, is not feasible.

At this point, one of two things must occur: either (a) the value calculated for state s and its predecessors will "collapse" and all exploration towards highly valued regions beyond state s ceases; or (b) the estimated value drops slightly but exploration continues towards the highly-valued regions.
The latter case may arise as follows. If the observer has previously explored in the vicinity of state s, the observer's own action models may be sufficiently developed that they still connect the higher value regions beyond state s to state w through Bellman backups. For example, if the learner has sufficient experience to have learned that the highly valued region can be reached through the alternative trajectory s-u-v-w, the newly discovered infeasibility of the mentor's transition from s to t will not have a deleterious effect on the value estimate at s. If s is highly valued, it is likely that states close to the mentor's trajectory will be explored to some degree. In this case, state s may not be as highly valued as it was when using the mentor's action model, but it will still be valued highly enough that it is likely to guide further exploration toward the area. We call this alternative (in this case, s-u-v-w) to the mentor's action a bridge, because it allows value from higher value regions to "flow over" an infeasible mentor transition. Because the bridge was formed without the intention of the agent, we call this process spontaneous bridging.

Figure 4.13: An alternative path can bridge value backups around infeasible paths.
Where a spontaneous bridge does not exist, the observer's own action models are generally undeveloped (e.g., they are close to their uniform prior distributions). Typically, these undeveloped models assign a small probability to every possible outcome and therefore diffuse value from higher valued regions and lead to a very poor value estimate for state s. When the mentor's transition is finally judged infeasible, the result is a dramatic drop in the value of state s and all of its predecessors. Subsequently, exploration towards the highly valued region through the neighborhood of state s ceases. In our example, this could occur if the observer's transition models at state s assign low probability (e.g., close to prior probability) of moving to state u due to lack of experience (or similarly if the surrounding states, such as u or v, have been insufficiently explored).

The spontaneous bridging effect motivates a broader notion of similarity. When the observer can find a "short" sequence of actions that bridges an infeasible action on the mentor's trajectory, the mentor's example can still provide extremely useful guidance. For the moment, we assume a "short path" is any path of length no greater than some given integer k.
We say an observer is k-step similar to a mentor at state s if the observer can duplicate in k or fewer steps the mentor's nominal transition at state s with "sufficiently high" probability.

Given this notion of similarity, an observer can now test whether a spontaneous bridge exists and determine whether the observer is in danger of value function collapse and the concomitant loss of guidance if it decides to suppress an augmented backup at state s. To do this, the observer initiates a reachability analysis starting from state s using its own action models T_o(s, a, t) to determine if there is a sequence of actions which leads with sufficiently high probability from state s to some state t on the mentor's trajectory downstream of the infeasible action. If a k-step bridge already exists, augmented backups can be safely suppressed at state s. For efficiency, we maintain a flag at each state to mark it as "bridged." Once a state is known to be bridged, the k-step reachability analysis need not be repeated.

If a spontaneous bridge cannot be found, it might still be possible to intentionally set out to build one. To build a bridge, the observer must explore from state s up to k steps away, hoping to make contact with the mentor's trajectory downstream of the infeasible mentor action.
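The k-step reachability analysis can be sketched as a breadth-first search over the observer's estimated model. The data structures, probability threshold, and function name are our assumptions; the thesis's analysis may also track the overall probability of the path rather than thresholding each transition.

```python
def k_step_bridge_exists(T_o, s, targets, k, p_min=0.1):
    """Breadth-first reachability over the observer's estimated model: is
    some downstream mentor state in `targets` reachable from s within k
    steps, using only transitions of probability at least p_min?
    T_o: dict state -> dict action -> dict successor -> prob."""
    frontier = {s}
    for _ in range(k):
        nxt = set()
        for u in frontier:
            for a, dist in T_o.get(u, {}).items():
                nxt.update(t for t, p in dist.items() if p >= p_min)
        if nxt & targets:
            return True
        frontier = nxt
    return False

# An s-u-v-w bridge of length 3 around an infeasible transition s -> t.
T_o = {'s': {'E': {'u': 0.9, 's': 0.1}},
       'u': {'E': {'v': 0.9, 'u': 0.1}},
       'v': {'N': {'w': 0.9, 'v': 0.1}}}
print(k_step_bridge_exists(T_o, 's', {'w'}, k=3))  # True
print(k_step_bridge_exists(T_o, 's', {'w'}, k=2))  # False
```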
We implement a single search attempt as a k^2-step random walk, which will result in a trajectory on average k steps away from s as long as ergodicity and local connectivity assumptions are satisfied. In order for the search to occur, we must motivate the observer to return to the state s and engage in repeated exploration. We could provide motivation to the observer by asking the observer to assume that the infeasible action will be repairable. The observer will therefore continue the augmented backups which support high-value estimates at the state s, and the observer will repeatedly engage in exploration from this point. The danger, of course, is that there may not in fact be a bridge, in which case the observer will repeat this search for a bridge indefinitely. We therefore need a mechanism to terminate the repair process when a k-step repair is infeasible. We could attempt to explicitly keep track of all of the possible paths open to the observer and all of the paths explicitly tried by the observer, and determine when the repair possibilities had been exhausted. Instead, we elect to follow a probabilistic search that eliminates the need for bookkeeping: if a bridge cannot be constructed within n attempts of a k-step random walk, the "repairability assumption" is judged falsified, the augmented backup at state s is suppressed, and the observer's bias to explore the vicinity of state s is eliminated. If no bridge is found for state s, a flag is used to mark the state as "irreparable."

This approach is, of course, a very naive heuristic strategy; but it illustrates the basic import of bridging. More systematic strategies could be used, involving explicit "planning" to find a bridge using, say, local search (Alissandrakis et al., 2000). Another aspect of this problem that we do not address is the persistence of search for bridges. In a specific domain, after some number of unsuccessful attempts to find bridges, a learner may conclude that it is unable to reconstruct a mentor's behavior, in which case the search for bridges may be abandoned. This involves simple, higher-level inference, and some notion of (or prior beliefs about) "similarity" of capabilities. These notions could also be used to automatically determine parameter settings (discussed below).

[15] In a more general state space where ergodicity is lacking, the agent must consider predecessors of state s up to k steps before s to guarantee that all k-step paths are checked.
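The n-attempt random-walk search described above can be sketched as follows; the `env.step(state, action)` sampling interface and the returned flags are hypothetical names for illustration:

```python
import random

def attempt_repair(env, s, targets, actions, k, n, rng=None):
    """Try up to n random-walk searches of k*k steps each, starting from s.
    Returns "bridged" on contact with a target state downstream of the
    infeasible mentor action, else "irreparable" (repairability falsified)."""
    rng = rng or random.Random(0)
    for _ in range(n):
        state = s
        for _ in range(k * k):  # a k^2-step walk strays about k steps from s
            state = env.step(state, rng.choice(actions))
            if state in targets:
                return "bridged"
    return "irreparable"
```

The returned flag would be cached per state, exactly as the "bridged" and "irreparable" flags are in the text, so the search is never rerun.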
The parameters k and n must be tuned empirically, but can be estimated given knowledge of the connectivity of the domain and prior beliefs about how similar (in terms of length of average repair) the trajectories of the mentor and observer will be. For instance, n >= 8k - 4 seems suitable in an 8-connected grid world with low noise, based on the number of trajectories required to cover the perimeter states of a k-step rectangle around a state. We note that very large values of n can reduce performance below that of non-imitating agents, as it results in temporary but prolonged "lockup." In the absence of prior knowledge, one might be able to implement an incremental process for finding k that starts with k = 1 and increments k when the policy space at granularity k is likely to have been exhausted.

Feasibility and k-step repair are easily integrated into the homogeneous implicit imitation framework. Essentially, we simply elaborate the conditions under which the augmented backup will be employed. Of course, some additional representation will be introduced to keep track of whether a state is feasible, bridged, or repairable, and how many repair attempts have been made.
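The grid-world heuristic and the incremental schedule for k can be written down directly; pairing each granularity with its attempt budget is one possible reading of the scheme, not a prescription from the text:

```python
def min_attempts(k):
    """Heuristic bound n >= 8k - 4: an 8-connected grid world has 8k - 4
    perimeter states on the k-step rectangle around a state, suggesting at
    least that many random-walk attempts in low-noise settings."""
    return 8 * k - 4

def incremental_k(max_k):
    """Absent prior knowledge, grow k from 1, exhausting each granularity
    (with its suggested attempt budget) before moving on."""
    return [(k, min_attempts(k)) for k in range(1, max_k + 1)]
```

For the k = 3 used in the experiments below, this gives n >= 20, matching the setting reported there.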
The action selection mechanism will also be overridden by the bridge-building algorithm when required in order to search for a bridge. Bridge building always terminates after n attempts, however, so it cannot affect long-run convergence. All other aspects of the algorithm, such as the exploration policy, are unchanged.

The complete elaborated decision procedure used to determine when augmented backups will be employed at state s with respect to mentor m appears in Table 4.3. It uses some internal state to make its decisions. As in the original model, we first check to see if the observer's experience-based calculation for the value of the state supersedes the mentor-based calculation; if so, then the observer uses its own experience-based calculation. If the mentor's action is feasible, then we accept the value calculated using the observation-based value function. If the action is infeasible, we check to see if the state is bridged. The first time the test is requested, a reachability analysis is performed, but the results will be drawn from a cache for subsequent requests. If the state has been bridged, we suppress augmented backups, confident that this will not cause value function collapse. If the state is not bridged, we ask if it is repairable. For the first n requests, the agent will attempt a k-step repair.
If the repair succeeds, the state is marked as bridged. If we cannot repair the infeasible transition, we mark it not-repairable and suppress augmented backups.

We may wish to employ implicit imitation with feasibility testing in a multiple mentors scenario. The key change from implicit imitation without feasibility testing is that the observer will only imitate feasible actions. When the observer searches through the set of mentors for the one with the action that results in the highest value estimate, the observer must consider only those mentors whose actions are still considered feasible (or assumed to be repairable).

Table 4.3: Elaborated augmented backup test

  FUNCTION use_augmented?(s, m) : Boolean
    IF V_o(s) >= V_m(s) THEN RETURN false
    ELSE IF feasible(s, m) THEN RETURN true
    ELSE IF bridged(s, m) THEN RETURN false
    ELSE IF reachable(s, m) THEN
      bridged(s, m) := true
      RETURN false
    ELSE IF not repairable(s, m) THEN RETURN false
    ELSE  % we are searching
      IF 0 < search_steps(s, m) < k THEN  % search in progress
        RETURN true
      IF search_steps(s, m) >= k THEN  % search failed
        IF attempts(s) > n THEN
          repairable(s) := false
          RETURN false
        ELSE
          reset_search(s, m)
          attempts(s) := attempts(s) + 1
          RETURN true
      attempts(s) := 1  % initiate first attempt of a search
      initiate_search(s)
      RETURN true

4.2.3  Empirical Demonstrations

In this section, we empirically demonstrate the utility of feasibility testing and k-step repair and show how the techniques can be used to surmount both differences in actions between agents and small local differences in state-space topology. The problems here have been chosen specifically to demonstrate the necessity and utility of both feasibility testing and k-step repair. The background details on methodology and learning parameters are identical to those described at the beginning of Section 4.1.3 except where noted.

Experiment 1: Necessity of Feasibility Testing

Our first experiment shows the importance of feasibility testing in implicit imitation when agents have heterogeneous actions. In this scenario, all agents must navigate across an obstacle-free, 10x10 grid world from the upper-left corner to a goal location in the lower-right. The agent is then reset to the upper-left corner. The first agent is a mentor with the "NEWS" action set (North, South, East, and West movement actions). The mentor is given an optimal stationary policy for this problem. We study the performance of three learners, each with the "Skew" action set (N, S, NE, SW) and unable to duplicate the mentor exactly (e.g., duplicating a mentor's E-move requires the learner to move NE followed by S, or move SE then N).
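The two action sets can be written as grid displacements to check the composition claim; the (row, column) encoding is an illustrative assumption:

```python
# Displacements in (row, col): the mentor's NEWS set vs. the learner's Skew set.
NEWS = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
SKEW = {'N': (-1, 0), 'S': (1, 0), 'NE': (-1, 1), 'SW': (1, -1)}

def apply(pos, moves, actions):
    """Apply a sequence of named moves from an action set to a grid position."""
    r, c = pos
    for m in moves:
        dr, dc = actions[m]
        r, c = r + dr, c + dc
    return (r, c)

# One mentor E-move equals NE followed by S for the Skew learner.
assert apply((5, 5), ['E'], NEWS) == apply((5, 5), ['NE', 'S'], SKEW)
```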
Due to the nature of the grid world, the control and imitation agents will actually have to execute more actions to get to the goal than the mentor, and the optimal goal rates for both the control and imitator are therefore lower than that of the mentor. The first learner employs implicit imitation with feasibility testing, the second uses imitation without feasibility testing, and the third control agent uses no imitation (i.e., is a standard reinforcement learning agent). All agents experience limited stochasticity in the form of a 5% chance that the effect of their action will be randomly perturbed. As in the last section, the agents use model-based reinforcement learning with prioritized sweeping. Based on a prior belief about how "close" the mentor and observer action sets are, we set k = 3 and n = 20.

Figure 4.14: The agent without feasibility testing 'NoFeas' can become trapped by overvalued states. The agent with feasibility testing 'Feas' shows typical imitation gains.

The effectiveness of feasibility testing in implicit imitation can be seen in Figure 4.14. The horizontal axis represents time in simulation steps and the vertical axis represents the average number of goals achieved per 1000 time steps (averaged over 10 runs).
We see that the imitation agent with feasibility testing (Feas) converged much more quickly to the optimal goal-attainment rate than the other agents. The agent without feasibility testing (NoFeas) achieved sporadic success early on, but frequently "locked up" due to repeated attempts to duplicate infeasible mentor actions. The agent still managed to reach the goal from time to time, as the stochastic actions do not permit the agent to become permanently stuck in this obstacle-free scenario. The control agent (Ctrl) without any form of imitation demonstrated a significant delay in convergence relative to the imitation agents due to the lack of any form of guidance, but easily surpassed the agent without feasibility testing in the long run. The more gradual slope of the control agent is due to the higher variance in the control agent's discovery time for the optimal path, but both the feasibility-testing imitator and the control agent converged to optimal solutions. As shown by the comparison of the two imitation agents, feasibility testing is necessary to adapt implicit imitation to contexts involving heterogeneous actions.

Experiment 2: Changes to State Space

We developed feasibility testing and bridging primarily to deal with the problem of adapting to agents with heterogeneous actions.
The same techniques, however, can be applied to agents with differences in their state-space connectivity (these are ultimately equivalent notions). To test this, we constructed a domain where all agents have the same NEWS action set, but we alter the environment of the learners by introducing obstacles that aren't present for the mentor. In Figure 4.15, the learners find that the mentor's path is obstructed by obstacles. Movement toward an obstacle causes a learner to remain in its current state. In this sense, its action has a different effect than the mentor's. Imitation is still possible in this scenario, as the unblocked states can be assigned a value through prior action values that leak around the obstacles, and because the mentor policy is somewhat noisy and occasionally goes around obstacles.

Figure 4.15: Obstacles block the mentor's intended policy, though stochastic actions cause the mentor to deviate somewhat.

In Figure 4.16 we see that the results are qualitatively similar to the previous experiment. In contrast to the previous experiment, both imitator and control use the "NEWS" action set and therefore have a shortest path with the same length as that of the mentor. Consequently, the optimal goal rate of the imitators and control is higher than in the previous experiment. The observer without feasibility testing (NoFeas) has difficulty with the maze, as the value function augmented by mentor observations consistently led the observer to states whose path to the goal is directly blocked. The agent with feasibility testing (Feas) quickly discovered states where the mentor's influence is inappropriate. We conclude that local differences in state are well handled by feasibility testing.

Experiment 3: Generalizing trajectories

Our next experiment investigates the question of whether imitation agents can completely generalize the mentor's trajectory. Here the mentor follows a path which is completely infeasible for the imitating agent. We fix the mentor's path for all runs and give the imitating agent the maze shown in Figure 4.17, in which all but two of the states the mentor visits are blocked by an obstacle. The imitating agent should be able to use the mentor's trajectory for guidance and build its own parallel trajectory which is completely disjoint from the mentor's. The results in Figure 4.18 show that the gain of the imitator with feasibility testing (Feas) over the control agent (Ctrl) is insignificant in this scenario.
However, we see that the agent without feasibility testing (NoFeas) did very poorly, even when compared to the control agent.

Figure 4.16: Imitation gain is reduced by observer obstacles within mentor trajectories but not eliminated.

Figure 4.17: Generalizing an infeasible mentor trajectory.

Figure 4.18: A largely infeasible trajectory is hard to learn from and minimizes gains from imitation.

This is because it gets stuck around the doorway. The high-value gradient backed up along the mentor's path becomes accessible to the agents at the doorway. The imitation agent with feasibility testing concluded that it could not proceed south from the doorway (into the wall) and that it must try a different strategy. The imitator without feasibility testing never explored far enough away from the doorway to set up an independent value gradient that would guide it to the goal. With a slower decay schedule for exploration, the imitator without feasibility testing would have found the goal, but this would still reduce its performance below that of the imitator with feasibility testing.
Experiment 4: Bridging and repair

As explained earlier, in simple problems there is a good chance that the informal effects of prior value leakage and stochastic exploration may form bridges before feasibility testing cuts off the value propagation that guides exploration. In more difficult problems, where the agent spends a lot more time exploring, it will accumulate sufficient samples to conclude that the mentor's actions are infeasible long before the agent has constructed its own bridge. The imitator's performance would then drop down to that of an unaugmented reinforcement learner.

To demonstrate bridging, we devised a domain in which agents must navigate from the upper-left corner to the bottom-right corner, across a "river" which is three steps wide and exacts a penalty of -0.2 per step (see Figure 4.19). The river is represented by the squares with a grey background. The goal state is worth 1.0. In the figure, the path of the mentor is shown starting from the top corner, proceeding along the edge of the river and then crossing the river to the goal. The mentor employs the "NEWS" action set. The observer uses the "Skew" action set (N, NE, S, SW) and attempts to reproduce the mentor trajectory. It will fail to reproduce the critical transition at the border of the river (because the "East" action is infeasible for a "Skew" agent).
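As a concrete illustration, the river domain's reward might be encoded as below; the river's column positions are an assumption for the example, since the text fixes only its width (three steps) and penalty:

```python
def river_reward(state, river_cols, goal):
    """River-domain reward: 1.0 at the goal, -0.2 per step inside the
    three-column river, and 0 elsewhere.  `state` is a (row, col) pair;
    `river_cols` is the assumed set of river columns."""
    if state == goal:
        return 1.0
    _, col = state
    return -0.2 if col in river_cols else 0.0
```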
The mentor action can no longer be used to back up value from the reward state, and there will be no alternative paths because the river blocks greedy exploration in this region. Without bridging or an optimistic and lengthy exploration phase, observer agents quickly discover the negative states of the river and curtail exploration in this direction before actually making it across.

We can examine the value function estimate of an imitator (after 1000 steps) with feasibility testing but no repair capabilities. We plot the value function on the map figure with circles whose radius represents the magnitude of the value in the corresponding state. The increasing radius of black circles represents increasing positive value. The increasing radius of grey circles represents increasing negative value.

Figure 4.19: The River scenario, in which squares with grey backgrounds result in a penalty for the agent.

We see that, due to suppression by feasibility testing, the black-circled high-value states backed up from the goal terminate abruptly at an infeasible transition without making it across the river. In fact, they are dominated by the lighter grey circles showing negative values.
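The circle plot can be reduced to a pure mapping from state values to glyph sizes and colours; the maximum radius and the colour rule for zero values are illustrative choices:

```python
def value_glyphs(values, max_radius=0.4):
    """Map a state-value table to plot glyphs: radius proportional to |V(s)|,
    black circles for positive values, grey circles for negative ones."""
    vmax = max((abs(v) for v in values.values()), default=1.0) or 1.0
    return {s: (max_radius * abs(v) / vmax, 'black' if v > 0 else 'grey')
            for s, v in values.items()}
```

The resulting (radius, colour) pairs can be handed to any plotting backend to reproduce a map like the one described above.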
In this experiment, we show that bridging can prolong the exploration phase in just the right way. We employ the k-step repair procedure with k = 3. Examining the graph in Figure 4.20, we see that both imitation agents experienced an early negative dip as they are guided deep into the river by the mentor's influence. The agent without repair eventually decided the mentor's action is infeasible, and thereafter avoids the river (and the possibility of finding the goal). The imitator with repair also discovered the mentor's action to be infeasible, but did not immediately dispense with the mentor's guidance. It kept exploring in the area of the mentor's trajectory using a random walk, all the while accumulating a negative reward, until it suddenly found a bridge and rapidly converged on the optimal solution.[16] The control agent discovered the goal only once in ten runs.

[16] While repair steps take place in an area of negative reward in this scenario, this need not be the case. Repair doesn't imply short-term negative return.

Figure 4.20: The agent with repair finds the optimal solution on the other side of the river.

The results in this section suggest that feasibility testing and k-step repair are a viable technique for extending imitation to agents with heterogeneous action capabilities. At present, our technique for handling heterogeneous actions requires tuning of ad hoc bridging parameters to optimize imitation performance. Examining techniques to automate the selection of these parameters would significantly broaden the appeal of our method. The wider applicability of these methods to general domains remains to be analyzed. We consider the question of applicability of imitation and general performance issues in Chapter 6.

Chapter 5

Bayesian Imitation

The utility of imitation as transition model transfer was demonstrated in the previous chapter. A series of simple experiments demonstrated that the simple statistical tests used were both necessary and sufficient to achieve significant performance gains in many problems. In this chapter, we examine how more sophisticated Bayesian methods can be used to smoothly integrate the observer's prior knowledge, knowledge the observer gains from interaction with the world, and knowledge the observer extracts from observations of an expert mentor.
The Bayesian technique can be derived quite cleanly, but its implementation for most useful model classes is mathematically intractable. We derive an approximate update scheme based on sampling techniques, which we call augmented model inference. Experiments in this chapter demonstrate that Bayesian imitation methods, at some additional computational cost, can outperform their non-Bayesian counterparts and eliminate tedious parameter tuning. The clean general form of the Bayesian model will also set the stage for more complex models in the next chapter.

The derivation of Bayesian imitation is premised on the same assumptions developed in Chapter 3, namely the presence of an expert, complete observability of observer and mentor states, knowledge of reward functions, similar action capabilities and the existence of a mapping between observer and mentor state spaces. Bayesian imitation also retains the idea of augmenting the observer's knowledge with observations of mentor state transitions. In contrast to the work of Chapter 4, however, the observer does not choose between experience-based and mentor-observation-based model estimates; instead, it pools the evidence collected from self-observation, mentor observations and prior beliefs to update a single, more accurate augmented model. Using the transition model augmented with additional model information derived from an expert mentor, the observer can then calculate more accurate expected values for each of its actions and make better action choices early in the learning process.
5.1  Augmented Model Inference

Implicit imitation can be recast in the Bayesian paradigm as the problem of updating a probability distribution over possible transition models T given various sources of information. The observer starts with a prior probability over possible transition models Pr(T), where T ∈ T. After observing its own transitions and those of the mentor, the observer will calculate an updated posterior distribution over possible models. The observations of the observer's own state transitions are collected into a sequence called the state history. Let H_o = <s_o^0, s_o^1, ..., s_o^t> denote the history of the observer's transitions from the initial state s_o^0 to the current state s_o^t, and let A_o = <a_o^0, a_o^1, ..., a_o^t> represent the observer's history of action choices from the initial state to the present. The history of mentor observations is similarly represented by H_m = <s_m^0, s_m^1, ...>. Since the mentor is assumed to be following a stationary policy π_m, its actions are strictly a function of its state history and need not be explicitly represented.

The sources of observer knowledge, H_o, H_m and Pr(T), and the relationships between these sources of knowledge and other variables are shown graphically in a causal model in Figure 5.1 (see Section 2.5 of Chapter 2 for more on graphical models).

Figure 5.1: Dependencies among transition model T and evidence sources H_o, H_m.

In the causal model, we see that the agent's action choices A_o and the transition model T influence the observer's history H_o. Similarly, the mentor's action policy π_m and the transition model T influence the mentor's history H_m. Using Bayes' rule, it is possible to reason about the posterior probability of a particular transition model T given the prior probability of T and the agent observation histories H_o, A_o and H_m. We denote the posterior probability of model T given observations H_o, A_o and H_m as Pr(T | H_o, A_o, H_m). Using Bayes' rule, and the independencies implied by the graph in Figure 5.1 (i.e., H_o and H_m are independent given a transition model T), we arrive at the Bayesian update:

    Pr(T | H_o, A_o, H_m) = α Pr(H_o, H_m | T, A_o) Pr(T)
                          = α Pr(H_o | T, A_o) Pr(H_m | T) Pr(T)        (5.1)

where α is a normalization parameter.

The simple form of the Bayesian update belies a great deal of practical complexity. The space of possible models in the domains we are considering is the set of all real-valued functions over S x A -> R. Representing an explicit distribution over possible models T ∈ T is infeasible. Instead, we assume that the prior Pr(T) over possible transition models has a factored form in which the probability distribution over transition models Pr(T) can be expressed as the product of distributions over local transition models for each state and action, Pr(T) = Π_{a ∈ A} Π_{s ∈ S} Pr(T^{s,a}). The distribution over each local transition model Pr(T^{s,a}) is represented as a Dirichlet (see Section 2.3).
In the absence of observations from a mentor agent, the independent Dirichlet factors can be updated using the usual increments to the count parameters n(s, a, t). The introduction of mentor observations creates some complications. In principle, upon observing the mentor make the transition from state s to state t, we could increment the corresponding count n(s, a, t) for whichever action a was equivalent to the mentor action a_m. Unfortunately, we do not know the identity of a_m. The Bayesian perspective would naturally suggest that we treat the mentor action as a random variable A_m and represent our uncertainty as a probability distribution Pr(A_m = a). The key question is how to maintain the elegant product-of-Dirichlets form for our distribution over possible transition models when our source of evidence takes the form of uncertain events over an unknown action.

The answer can be found by breaking down the update into two distinct cases. Though the mentor's action a_m = π_m(s) is unknown, suppose it is different from the observer's action a (i.e., π_m(s) ≠ a).
In this case, the model factor T^{s,a} would be independent of the mentor's history, and we can employ the standard Bayesian update without regard for the mentor:

    Pr(T^{s,a} | H_o^{s,a}, π_m(s) ≠ a) = α Pr(H_o^{s,a} | T^{s,a}) Pr(T^{s,a})        (5.2)

Note that the update need only consider the subset of the agent history H_o^{s,a} which concerns the outcome of action a in state s. The update has a simple closed form in which one increments the Dirichlet parameter n(s, a, t) according to a count c(s, a, t) of the number of times the observer has visited state s, performed action a and ended up in state t.

Suppose, however, that the mentor action a_m = π_m(s) is in fact the same as the observer's action a (i.e., π_m(s) = a). Then the mentor observations are relevant to the update of Pr(T^{s,a}):

    Pr(T^{s,a} | H_o^{s,a}, H_m, π_m(s) = a)
        = α Pr(H_o^{s,a}, H_m | T^{s,a}, π_m(s) = a) Pr(T^{s,a})
        = α Pr(H_o^{s,a} | T^{s,a}) Pr(H_m | T^{s,a}, π_m(s) = a) Pr(T^{s,a})

Let n^{s,a} be the prior parameter vector for Pr(T^{s,a}), let c_o^{s,a} denote the counts of observer transitions from state s via action a, and let c_m^{s} denote the counts of the mentor transitions from state s. The posterior density Pr(T^{s,a} | H_o^{s,a}, H_m, π_m(s) = a) is then a Dirichlet with parameters n^{s,a} + c_o^{s,a} + c_m^{s}; that is, we simply update the Dirichlet parameter with the sum of the observer and mentor counts.
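The two cases above amount to a simple rule on Dirichlet parameters. A small sketch (our own, with toy count vectors):

```python
def posterior_params(prior, obs_counts, mentor_counts, mentor_matches):
    """Dirichlet parameters for T^{s,a} under the two cases in the text.

    prior:          n^{s,a}, the prior parameter vector over successor states
    obs_counts:     c_o^{s,a}, observer transition counts for (s, a)
    mentor_counts:  c_m^{s}, mentor transition counts from state s
    mentor_matches: True when pi_m(s) is taken to equal the action a
    """
    if mentor_matches:
        # pi_m(s) = a: pool observer and mentor evidence (conjugate update)
        return [p + o + m for p, o, m in zip(prior, obs_counts, mentor_counts)]
    # pi_m(s) != a: the mentor's history is irrelevant to this factor
    return [p + o for p, o in zip(prior, obs_counts)]

print(posterior_params([1, 1], [3, 0], [2, 1], True))   # [6, 2]
print(posterior_params([1, 1], [3, 0], [2, 1], False))  # [4, 1]
```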
The Dirichlet over models T with parameters n^{s,a} + c_o^{s,a} + c_m^{s}, the augmented model factor, is denoted Pr(T^{s,a}; n^{s,a} + c_o^{s,a} + c_m^{s}):

    Pr(T^{s,a} | H_o^{s,a}, H_m, π_m(s) = a) = Pr(T^{s,a}; n^{s,a} + c_o^{s,a} + c_m^{s})

Since the observer does not know the identity of the mentor's action, we compute the expectation with respect to these two possible cases:

    Pr(T^{s,a} | H_o^{s,a}, H_m) = Pr(π_m(s) = a | H_o^{s,a}, H_m) Pr(T^{s,a}; n^{s,a} + c_o^{s,a} + c_m^{s})
                                 + Pr(π_m(s) ≠ a | H_o^{s,a}, H_m) Pr(T^{s,a}; n^{s,a} + c_o^{s,a})        (5.3)

Equation 5.3 describes an update having the usual conjugate form, but distributing the mentor counts c_m^{s} (i.e., evidence provided by mentor transitions at state s) across all actions, weighted by the posterior probability that the mentor's policy chooses that action at state s.²

    ² This assumes that at least one of the observer's actions is equivalent to the mentor's, but our model can be generalized to the heterogeneous case.

The final step in the derivation is to show that the posterior distribution over the mentor's policy Pr(π_m(s) = a) can be computed in an efficient factored form as well. Using Bayes' rule and the independencies of the causal model in Figure 5.1, we can derive an expression for the posterior distribution:

    Pr(π_m | H_m, H_o) = α Pr(H_m | π_m, H_o) Pr(π_m | H_o)        (5.4)

If we assume that the prior over the mentor's policy is factored in the same way as the prior over models (that is, we have independent distributions Pr(π_m(s)) over A_m for each s), this update can be factored as well, with history elements at state s being the only ones relevant to computing the posterior over π_m(s).
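Since Equation 5.3 is a mixture of two Dirichlets, the expected transition probabilities are the correspondingly weighted mixture of the two Dirichlet means. A toy sketch (our own illustration):

```python
def dirichlet_mean(params):
    """Mean of a Dirichlet: normalized parameter vector."""
    total = sum(params)
    return [p / total for p in params]

def expected_model(prior, obs, mentor, p_match):
    """Expected Pr(t | s, a) under the mixture in Equation 5.3.

    p_match stands in for Pr(pi_m(s) = a | H_o, H_m)."""
    pooled = dirichlet_mean([p + o + m for p, o, m in zip(prior, obs, mentor)])
    alone = dirichlet_mean([p + o for p, o in zip(prior, obs)])
    return [p_match * x + (1 - p_match) * y for x, y in zip(pooled, alone)]

# With p_match = 1 mentor evidence is fully pooled; with p_match = 0 it is ignored.
print(expected_model([1, 1], [3, 0], [2, 1], 0.5))  # [0.775, 0.225]
```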
The last difficulty to overcome is the computation of the integral over possible transition models ∫_{T∈𝒯} Pr(H_m | π_m, T) Pr(T | H_o) dT. This type of integral is often found in the calculation of expectations and can be approximated using Monte Carlo integration (Mackay, 1999). Using the techniques of Dearden et al. (1999), we sample transition models to estimate this quantity. Specifically, we sample a model T^{s,a} for a specific state and action from the Dirichlet factor Pr(T^{s,a} | H_o^{s,a}). We then compute the likelihood that the mentor's action is equivalent to the observer's action a by computing the joint probability of the observed mentor transitions under the model T^{s,a}:

    Pr(H_m^{s} | π_m, T^{s,a}) = ∏_{t∈S} T^{s,a}(t)^{c_m^{s}(t)}        (5.5)

We can combine the expression for the factored augmented model update in Equation 5.3 with our expression for the mentor policy likelihood in Equation 5.5 to obtain a tractable algorithm for updating the observer's beliefs about the dynamics model T based on prior beliefs, its own experience and its observations of the mentor.³ Pseudo-code to calculate the augmented model appears in Algorithm 1.

    ALGORITHM 1: T⁺ = augmentedModel(N_o, N_m, s, a, η)

    L = [0, …, 0]                      % L_i is the likelihood of action i being the mentor's
    b = 0                              % sum for normalization
    for each action a' ∈ A_o
        for c = 1 to η
            T^{s,a'} ~ D[N_o^{s,a'}]   % sample from Dirichlet on s, a'
            L(a') = L(a') + ∏_{t∈S} T^{s,a'}(t)^{N_m^{s}(t) + C_m^{s}(t)}
        b = b + L(a')
    a_m ~ M[L/b]                       % sample from multinomial on average normalized likelihoods
    if a_m = a                         % if mentor action likely the same, add mentor evidence
        T⁺ = mean(D[N_o^{s,a} + C_o^{s,a} + C_m^{s}])
    else
        T⁺ = mean(D[N_o^{s,a} + C_o^{s,a}])
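A runnable rendering of Algorithm 1 can help fix the ideas. This is a sketch under our own simplifications, not the thesis implementation: count vectors are plain lists, the mentor's Dirichlet prior is folded into its counts C_m, and the mentor action is drawn from the normalized sampled likelihoods.

```python
import random

def sample_dirichlet(params, rng):
    """Draw a multinomial from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]

def augmented_model(N_o, C_o, C_m, s, a, eta, rng):
    """Sketch of Algorithm 1. N_o[(s, act)] and C_o[(s, act)] are the
    observer's Dirichlet priors and transition counts; C_m[s] holds the
    mentor's transition counts from state s (prior pseudo-counts folded in).
    eta is the number of Monte Carlo samples per action."""
    actions = sorted(act for (st, act) in N_o if st == s)
    L = {act: 0.0 for act in actions}
    for act in actions:                  # likelihood of each action being the mentor's
        for _ in range(eta):
            T = sample_dirichlet(N_o[(s, act)], rng)
            like = 1.0
            for t, c in enumerate(C_m[s]):
                like *= T[t] ** c        # Pr(mentor transitions | sampled T)
            L[act] += like
    total = sum(L.values()) or 1.0
    r, acc, a_m = rng.random() * total, 0.0, actions[-1]
    for act in actions:                  # sample mentor action from likelihoods
        acc += L[act]
        if r <= acc:
            a_m = act
            break
    if a_m == a:                         # mentor evidence judged relevant: pool it
        params = [n + c + m for n, c, m in
                  zip(N_o[(s, a)], C_o[(s, a)], C_m[s])]
    else:
        params = [n + c for n, c in zip(N_o[(s, a)], C_o[(s, a)])]
    total = sum(params)
    return [p / total for p in params]   # mean of the resulting Dirichlet
```

When the mentor's observed transitions closely match one action's model, that action almost always wins the likelihood comparison, and its factor absorbs the mentor counts.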
The algorithm takes the arguments N_o and N_m, corresponding to the Dirichlet parameters for the distributions over observer and mentor models, the arguments s and a, corresponding to the current state and action, and the argument η, corresponding to the number of samples to use in Monte Carlo integration. The algorithm returns the augmented model T⁺. The model T⁺ takes the form of a multinomial distribution over successor states corresponding to the expected or mean transition probabilities of the Dirichlet distribution over transition models.

    ³ The terms T^{s,a}(t)^{c_m^{s}(t)} in Equation 5.5 can become very small as the number of visits to a state becomes large, resulting in underflow. Scaling techniques such as those used for hidden Markov model computation can prove quite useful here.

A Bayesian imitator thus proceeds as follows: at each stage, it observes its own state transition and that of the mentor. The Dirichlet parameters for the observer and mentor models are incremented. Conceptually, an augmented model T⁺ is then sampled probabilistically from the pooled distribution based on both mentor and observer observations using Algorithm 1. The augmented model T⁺ is used to update the agent's values. These augmented values have the potential to be more accurate early in learning and therefore allow a Bayesian imitator to choose better actions early in the learning phase.

Like any reinforcement learning agent, an imitator requires a suitable exploration mechanism. In the Bayesian exploration model (Dearden et al., 1999), the uncertainty about the effects of actions is captured by a Dirichlet, and is used to estimate a distribution over possible Q-values for each state-action pair. Notions such as value of information can then be used to approximate the optimal exploration policy (Section 2.3). Bayesian exploration also eliminates the parameter tuning required by methods like ε-greedy, and adapts locally and instantly to evidence. These properties make it a good candidate to combine with Bayesian imitation.

5.2  Empirical Demonstrations

In this section we attempt to empirically characterize the applicability and expected benefits of Bayesian imitation through several experiments. Using domains from the literature and two unique domains, we compare Bayesian imitation to the non-Bayesian imitation model developed in the previous chapter and to several standard model-based reinforcement learning (non-imitating) techniques, including Bayesian exploration, prioritized sweeping and complete Bellman backups.
We also investigate how Bayesian exploration combines with imitation.

First, we describe the set of agents used in our experiments. The Oracle agent employs a fixed policy optimized for each domain. The Oracle provides both a baseline and a source of expert mentor behavior for the observer agents. The EGBS agent combines ε-greedy exploration (EG) with a full sweep of Bellman backups at each time step (i.e., it performs a Bellman backup at every state for each experience obtained). It provides an example of a generic model-based approach to learning. The EGPS agent is a model-based RL agent, using ε-greedy (EG) exploration with prioritized sweeping (PS) (Moore & Atkeson, 1993). EGPS uses fewer backups, but applies them where they are predicted to do the most good. EGPS does not have a fixed backup policy, so it can propagate value multiple steps across the state space in situations where EGBS would not. The BE agent employs Bayesian exploration (BE) with prioritized sweeping for backups. BEBI combines Bayesian exploration (BE) with Bayesian imitation (BI). EGBI combines ε-greedy exploration (EG) with Bayesian imitation (BI).

Figure 5.2: The Chain-world scenario in which the myopically attractive action b is suboptimal in the long run.
Experiment 1: Suboptimal Policy

Figure 5.3: Chain world results (Averaged over 10 runs)

Our first experiments with the agents were on the "loop" and "chain" examples, which were designed to show the benefits of Bayesian exploration (Dearden et al., 1999). The chain problem consists of 5 states in a chain as pictured in Figure 5.2. It requires the agent to choose between action b, which moves the agent back to the beginning with a small local
W e see that the B a y e s i a n i m i t a t o r ( B E B I ) , the n o n - B a y e s i a n i m i t a t o r ( E G N B I ) a n d the B a y e s i a n e x p l o r e r w i t h o u t i m i t a t i o n ( B E ) a l l c l u s t e r together near o p t i m a l p e r f o r m a n c e . S i n c e the v a r i a n c e o f the results is h i g h , it is not c l e a r that a n y o f the agents are d o m i n a n t i n this s c e n a r i o . G i v e n the s m a l l state a n d a c t i o n space a n d the B a y e s i a n e x p l o r e r ' s a b i l i t y to d y n a m i c a l l y adjust its e x p l o r a t i o n rate, it is c a p a b l e o f p e r f o r m i n g as w e l l as the best i m i t a tors w h o g a i n i n s i g h t i n t o the v a l u e f u n c t i o n , but s t i l l m u s t r e c o v e r the i d e n t i t y o f the m e n t o r ' s a c t i o n . T h e top p e r f o r m i n g agents d o m u c h better than s t a n d a r d r e i n f o r c e m e n t learners ( E G B S a n d E G P S ) w h o settle o n a s u b o p t i m a l p o l i c y . I n c r e a s i n g the e x p l o r a t i o n o f these agents causes t h e i r p e r f o r m a n c e to f a l l further i n the short t e r m , t h o u g h i t d o e s e v e n t u a l l y a l l o w t h e m to find the o p t i m a l p o l i c y . T h e c o m b i n a t i o n o f B a y e s i a n i m i t a t i o n w i t h e-greedy e x p l o r a t i o n does u n e x p e c t e d l y p o o r l y . T h i s result is a l s o seen i n other e x p e r i m e n t s . W e s p e c u l a t e that there is an i n t e r a c t i o n b e t w e e n the s a m p l i n g o f t r a n s i t i o n m o d e l s u n d e r B a y e s i a n i m i t a t i o n a n d £ - g r e e d y e x p l o r a t i o n — p a r t i c u l a r l y o n these s i m p l e p r o b l e m s w h i c h are s o l v e d after v e r y f e w s a m -  119  F i g u r e 5.4: L o o p d o m a i n  p i e s . 
W e speculate that e a r l y i n the l e a r n i n g p r o c e s s , w h e n the d i s t r i b u t i o n o v e r p o s s i b l e t r a n s i t i o n m o d e l s is diffuse, o u r m o d e l s a m p l i n g s c h e m e generates h i g h l y w h i c h i n turn result i n  fluctuations  fluctuating  models  i n the r e l a t i v e m a g n i t u d e s o f Q - v a l u e s . U n d e r a greedy  a c t i o n s c h e m e , the n o n - l i n e a r effect o f the m a x operator c a n c a u s e s u b s e q u e n t v i s i t s to a state to result i n t o t a l l y different a c t i o n c h o i c e s . T h i s c a n p r e v e n t the agent f r o m c o n s i s tently b a c k i n g u p v a l u e f r o m future states a n d e s t a b l i s h i n g the v a l u e s that w o u l d g u i d e it t o w a r d s the distant g o a l . T h e B a y e s i a n i m i t a t o r w i t h B a y e s i a n e x p l o r a t i o n , c h o o s e s a c t i o n s p r o b a b i l i s t i c a l l y a c c o r d i n g to their Q - v a l u e s a n d is therefore less s e n s i t i v e to the r e l a t i v e o r d e r i n g o f the Q - v a l u e s . T h e n o n - B a y e s i a n i m i t a t o r w i t h e - g r e e d y e x p l o r a t i o n d e r i v e s its m o d e l f r o m the m e a n o f the D i r i c h l e t m o d e l w h i c h is c o n s i d e r a b l y m o r e stable. Its Q - v a l u e s are therefore a l s o m o r e stable, a n d the agent is therefore a b l e to p u r s u e a l o n g t e r m strategy c o n s i s t e n t w i t h the o v e r a l l v a l u e f u n c t i o n d e r i v e d f r o m the m e n t o r agent. W e see that this p r o b l e m is n o t d i f f i c u l t e n o u g h to d e m o n s t r a t e a s i g n i f i c a n t difference i n p e r f o r m a n c e bet w e e n the B a y e s i a n e x p l o r e r w i t h o u t i m i t a t i o n ( B E ) a n d the B a y e s i a n e x p l o r e r w i t h i m i t a tion ( B E B I ) .  
Figure 5.5: Loop domain results (averaged over 10 runs)
I n this d o m a i n w e see the n o n - B a y e s i a n a g e n t ' s i n a b i l i t y to f l e x i b l y adjust the e x p l o r a t i o n rate g i v e n its l o c a l k n o w l e d g e is starting to hurt its p e r f o r m a n c e . T h e c o m b i n a t i o n o f B a y e s i a n i m i t a t i o n w i t h e-greedy e x p l o r a t i o n s h o w s the s a m e i n t e r a c t i o n effect as i n the c h a i n w o r l d a n d w e speculate that a s i m i l a r effect is o c c u r r i n g . T h i s p r o b l e m is not d i f f i c u l t e n o u g h to separate the p e r f o r m a n c e o f the B a y e s i a n e x p l o r e r w i t h o u t i m i t a t i o n ( B E ) f r o m the B a y e s i a n e x p l o r e r w i t h B a y e s i a n i m i t a t i o n ( B E B I ) . W e o b s e r v e that the a d d i t i o n o f i m i t a t i o n c a p a b i l i t i e s to an e x i s t i n g learner d i d not hurt the its p e r f o r m a n c e .  122  s  FI  Gl  F3 F2  G2  IIMHIS  F i g u r e 5.6: F l a g - w o r l d .  18r  F i g u r e 5.7: F l a g - w o r l d results ( A v e r a g e d o v e r 5 0 r u n s )  123  Experiment 3 : More Complex Problems U s i n g the m o r e c h a l l e n g i n g " f l a g - w o r l d " d o m a i n ( D e a r d e n et a l . , 1 9 9 9 ) , w e see m e a n i n g f u l differences i n p e r f o r m a n c e a m o n g s t the agents. I n  flag-world,  agent starts at state S a n d searches f o r the g o a l state Gl.  T h e agent m a y p i c k u p a n y o f three  flags b y v i s i t i n g states F I , F2 a n d F3.  s h o w n i n F i g u r e 5.6, the  U p o n r e a c h i n g the g o a l state, the agent r e c e i v e s 1  p o i n t f o r e a c h flag c o l l e c t e d . E a c h a c t i o n ( N , E , S , W ) succeeds w i t h p r o b a b i l i t y 0.9 i f the c o r r e s p o n d i n g d i r e c t i o n is clear, a n d w i t h p r o b a b i l i t y 0.1 m o v e s the agent p e r p e n d i c u l a r to the d e s i r e d d i r e c t i o n . 
I f the c o r r e s p o n d i n g d i r e c t i o n is not clear, the agent w i l l r e m a i n w h e r e it is w i t h p r o b a b i l i t y 0.9 a n d attempt a p e r p e n d i c u l a r d i r e c t i o n w i t h the r e m a i n i n g p r o b a b i l i t y . In g e n e r a l , the p r i o r s o v e r the agent's a c t i o n o u t c o m e s are set to a l o w c o n f i d e n c e u n i f o r m d i s t r i b u t i o n o v e r the i m m e d i a t e n e i g h b o r s o f the state. T h e p r i o r s are set to take i n t o a c c o u n t the p r e s e n c e o f obstacles a n d the t r a n s i t i o n to states i n w h i c h the agent p o s s e s s e s flags. T h u s , a c t i o n s h a v e a n o n - z e r o p r o b a b i l i t y o f e n d i n g u p i n at m o s t 4 n e i g h b o r i n g states, b u t p o s s i b l y less d u e to obstacles. T h i s d o m a i n is a g o o d e x e r c i s e f o r e x p l o r a t i o n m e c h a n i s m s as it is easy to find s u b o p t i m a l s o l u t i o n s to this p r o b l e m p r o c e e d i n g to the g o a l state w i t h less than the f u l l c o m p l e m e n t o f flags. F i g u r e 5.7 s h o w s the r e w a r d c o l l e c t e d i n the p r i o r 2 0 0 steps f o r e a c h agent avera g e d o v e r 5 0 runs. T h e O r a c l e agent s h o w s the o p t i m a l p e r f o r m a n c e f o r this d o m a i n . T h e B a y e s i a n i m i t a t o r w i t h B a y e s i a n e x p l o r a t i o n ( B E B I ) a c h i e v e s the q u i c k e s t c o n v e r g e n c e to the o p t i m a l s o l u t i o n , t h o u g h the c o m p l e x i t y o f this d o m a i n n o w r e q u i r e s the o b s e r v e r to e n g a g e i n c o n s i d e r a b l e e x p e r i m e n t a t i o n to d i s c o v e r the correct m e n t o r a c t i o n . T h i s results i n a s i g n i f i c a n t s e p a r a t i o n b e t w e e n the B a y e s i a n I m i t a t o r a n d the o p t i m a l o r a c l e agent perf o r m a n c e . 
W e n o t i c e that the c o m b i n a t i o n o f e - g r e e d y e x p l o r a t i o n w i t h B a y e s i a n i m i t a t i o n ( E G B I ) does better o n this p r o b l e m . I n this d o m a i n , c o n s i d e r a b l y m o r e s a m p l e s are r e q u i r e d to b u i l d u p accurate v a l u e f u n c t i o n s a n d the o b s e r v e r agent has m o r e p o s s i b l e paths l e a d i n g  124  F i g u r e 5.8: F l a g - w o r l d m o v e d g o a l ( A v e r a g e o f 5 0 r u n s )  to a p p r o x i m a t e l y the s a m e o u t c o m e .  W e s p e c u l a t e that these features m i t i g a t e the issues  of m a x i m i z a t i o n over sampled models. T h e non-Bayesian imitator ( E G N B I ) performs rela t i v e l y p o o r l y . I n this p r o b l e m , the n o n - B a y e s i a n i m i t a t o r ' s i n a b i l i t y to t a i l o r e x p l o r a t i o n rates to its l o c a l state o f k n o w l e d g e m a k e it d i f f i c u l t f o r the n o n - B a y e s i a n i m i t a t o r to e x p l o i t m e n t o r k n o w l e d g e w h i l e r e t a i n i n g sufficient e x p l o r a t i o n to i n f e r the m e n t o r ' s a c t i o n s . T h e B a y e s i a n e x p l o r e r ( B E ) settles o n a s u b o p t i m a l strategy. T h e agents u s i n g g e n e r i c r e i n f o r c e m e n t l e a r n i n g strategies c o n v e r g e m o r e s l o w l y to the o p t i m a l s o l u t i o n , t h o u g h the g r a p h has b e e n truncated before  convergence,  125  Experiment 4: Dissimilar Goals In this e x p e r i m e n t , w e altered the f l a g - w o r l d d o m a i n : k e e p i n g the g o a l o f the m e n t o r agent at l o c a t i o n G l a n d m o v i n g the g o a l f o r a l l other agents to the n e w l o c a t i o n l a b e l e d G 2 i n F i g u r e 5.6. T h e results, F i g u r e 5.8(a), s h o w that the i m i t a t i o n transfer i s q u a l i t a t i v e l y s i m i l a r to the case w i t h i d e n t i c a l r e w a r d s . 
We notice that the combination of ε-greedy exploration with Bayesian imitation (EGBI) does better on this problem. In this domain, considerably more samples are required to build up accurate value functions, and the observer agent has more possible paths leading to approximately the same outcome. We speculate that these features mitigate the issues of maximization over sampled models. The non-Bayesian imitator (EGNBI) performs relatively poorly. In this problem, the non-Bayesian imitator's inability to tailor exploration rates to its local state of knowledge makes it difficult for it to exploit mentor knowledge while retaining sufficient exploration to infer the mentor's actions. The Bayesian explorer (BE) settles on a suboptimal strategy. The agents using generic reinforcement learning strategies converge more slowly to the optimal solution, though the graph has been truncated before convergence.

Figure 5.8: Flag-world moved goal (Average of 50 runs)

Experiment 4: Dissimilar Goals

In this experiment, we altered the flag-world domain, keeping the goal of the mentor agent at location G1 and moving the goal for all other agents to the new location labeled G2 in Figure 5.6. The results, Figure 5.8(a), show that the imitation transfer is qualitatively similar to the case with identical rewards. We conclude that imitation transfer is robust to certain differences in objectives between mentor and imitator. This is readily explained by the fact that the mentor's policy provides model information over most states in the domain, and this information can be employed by the observer to achieve its own goals.

Experiment 5: Multi-dimensional Problems

We introduced a new problem we call the tutoring domain, which requires agents to schedule the presentation of simple patterns to human learners in order to minimize training time. To simplify our experiments, we have the agents teach a simulated student. The student's performance is modeled by independent, discretely approximated, exponential forgetting curves for each concept (Anderson, 1990, page 164). The agent's action will be its choice of concept to present. The agent receives a reward when the student's forgetting rate has been reduced below a predefined threshold for all concepts. Presenting a concept lowers its forgetting rate; leaving it unpresented increases its forgetting rate. The action space grows linearly with the number of concepts, and the state space exponentially. Expert performance for the oracle in this domain was encoded as a simple heuristic that picks the first unlearned concept for the next presentation.
Our discrete approximation goes as follows: each concept i is represented by a single binary variable corresponding to whether the concept is learned (c_i = 1) or unlearned (c_i = 0). In our experiments, there are six concepts. The state of the simulated student at any time is the joint state c = ⟨c_1, c_2, …, c_6⟩. When the tutor takes an action a_j to present concept j, the state of each individual concept c_i in c is updated nondeterministically according to the concept evolution function:

    Pr(c_i′ | c_i, a_j) =  ψ    if i = j and c_i = 0    (concept learned)
                           1    if i = j and c_i = 1    (concept reviewed)
                           0    if i ≠ j and c_i = 0    (stays unknown)
                           ω    if i ≠ j and c_i = 1    (forgotten)        (5.6)

where ψ is the learning rate, or probability that a concept will be learned when it is presented, and ω is the forgetting rate, or the probability that a concept will be forgotten given that it was not presented. In our experiments the learning rate is ψ = 0.1 and the forgetting rate is ω = 0.01. We note that Pr(¬c_i′ | c_i, a_j) = 1 − Pr(c_i′ | c_i, a_j). The probability of the joint successor state is the product of the individual concept evolutions, Pr(c′ | c, a_j) = ∏_i Pr(c_i′ | c_i, a_j). As shown in Figure 5.9, the individual dynamics of each concept result in a large number of possible successor states.
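Equation 5.6 can be simulated directly. In this sketch (our own), we read each case as the probability of the named event, so an unpresented learned concept is forgotten with probability ω and otherwise stays learned:

```python
import random

PSI, OMEGA = 0.1, 0.01   # learning and forgetting rates from the text

def evolve_concept(i, c_i, j, rng):
    """One-step evolution of concept i when concept j is presented."""
    if i == j:
        if c_i == 0:
            return 1 if rng.random() < PSI else 0   # learned with prob psi
        return 1                                    # reviewed: stays learned
    if c_i == 0:
        return 0                                    # unpresented: stays unknown
    return 0 if rng.random() < OMEGA else 1         # forgotten with prob omega

def evolve_state(c, j, rng):
    """Concepts evolve independently, so the joint successor factorizes."""
    return tuple(evolve_concept(i, ci, j, rng) for i, ci in enumerate(c))

rng = random.Random(0)
state = (0,) * 6
for _ in range(500):
    # the oracle's heuristic: present the first unlearned concept
    j = next((i for i, ci in enumerate(state) if ci == 0), 0)
    state = evolve_state(state, j, rng)
print(state)
```

Even this small simulation makes the branching factor visible: every learned concept that is not being presented can independently flip back to unlearned.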
There can be up to 64 possible successor states if every concept state changes (everything is forgotten), or only a single successor state if no concept state changes. The discount rate employed in our experiments is 0.95, and the agent receives a reward of 1.0 if and only if every concept is known (i.e., ∀i, c_i). The learner's priors over possible transition models assume that all feasible successor states are equally likely, with low confidence. We define as feasible the set of successor states granted non-zero support by the transition model defined in Equation 5.6.

Figure 5.9: Schematic illustration of a transition in the tutoring domain.

The results for the tutoring domain appear in Figure 5.10. We see that the imitators learn quickly, with BEBI and EGBI outperforming the others. The Bayesian pooling of mentor and observer observations continues to offer better performance than the discrete selection mechanism of non-Bayesian imitation (EGNBI). Of particular interest is the performance of the Bayesian explorer (BE). Given the high branching factor of the domain, the conservative bias of the Bayesian explorer causes it to spend a lot of time exploring many branches. The extra samples from mentor examples allow the Bayesian imitator with Bayesian exploration (BEBI) to quickly improve its confidence in relevant actions and consequently enable it to discard branches that are not worth exploring early in the learning process.

Experiment 6: Misleading Priors

The next domain provides further insight into the combination of Bayesian imitation and Bayesian exploration. In this grid world (Figure 5.11), agents can move south only in the first column. In this domain, the Bayesian explorer (BE) chooses a path based on its prior beliefs that the space is completely connected. The agent can easily be guided down one of the long "tubes" in this scenario, only to have to retrace its steps.

The results for this domain, shown in Figure 5.12, clearly differentiate the early performance of the imitation agents from the Bayesian explorer and other independent learners. The initial value function constructed from the learner's prior beliefs about the connectivity of the grid world leads it to over-value many of the states that lead to a dead end. This results in a costly misdirection of exploration and poor performance.
Figure 5.10: Tutoring domain (Average of 50 runs). [Legend: Oracle (Fixed Policy Mentor); EGBS (Eps Greedy, Bell. Sweep Control); EGPS (Eps Greedy, Prio. Sweep Control); BE (Bayes Exp, Bell. Sweep Control); EGNBI (Non-Bayesian Imitation)]

Figure 5.11: No South scenario

Figure 5.12: No South results (Averaged over 50 runs)

We see that the Bayesian imitator's ability to pool information sources and adapt its exploration rate to the local quality of information allows it to exploit the mentor observations more quickly than agents using generic exploration strategies like ε-greedy. We can see from the significant delay before the Bayesian explorer (BE) reaches the reward for the first time that the agent is getting stuck repeatedly in dead ends; its prior beliefs work strongly against its success. However, it is able to do better than the generic learners with global exploration strategies.

5.3 Summary

This chapter introduced a Bayesian imitation model that is able to smoothly integrate both mentor and observer observations. In empirical demonstrations we saw that the combination of Bayesian imitation with Bayesian exploration (BEBI) clearly dominated across all of the problems we studied.
Without further experiments, however, we cannot clearly attribute this advantage to either Bayesian imitation in itself or to Bayesian exploration in itself. In our experiments, Bayesian exploration by itself (BE) is typically dominated by imitators that do not use Bayesian exploration, such as non-Bayesian imitators (EGNBI) and ε-greedy Bayesian imitators (EGBI). We saw in the tutoring domain and the no-south domain that Bayesian exploration tends to be conservative given diffuse priors. On the other hand, ε-greedy Bayesian imitators are sometimes dominated by non-Bayesian imitators (EGNBI). Earlier in this chapter, we speculated that there may be a negative interaction between the model fluctuations generated by sampling in Bayesian imitation and the action maximization in ε-greedy. We suggest this as a promising area for future experiments. From a theoretical perspective, we can see that the combination of Bayesian exploration with Bayesian imitation (BEBI) will have a synergistic effect, as the extra evidence from mentor observations will improve not only the accuracy of the agent's Q-values but their confidence as well.
Since this confidence drives local exploration, we would expect the observer to focus quickly on rewarding behaviors demonstrated by the mentor. From a practical perspective, Bayesian explorers are much easier to train, as they do not require repeated runs to optimize learning rate parameters.

Chapter 6

Applicability and Performance of Imitation

The applicability and performance of imitation can be evaluated from a number of different viewpoints: the assumptions inherent in our framework, the potential for accelerating learning as a function of properties of state spaces, and the likelihood that mentor input will bias an observer towards suboptimal performance. In this chapter we examine what each of these perspectives can tell us about the performance of imitation and then go on to propose a formal metric for predicting performance as a function of domain properties. We conclude with a discussion of the potential for mentors to bias agents to suboptimal policies.

6.1 Assumptions

We began our development of the implicit imitation framework by developing a set of assumptions consistent with imitation. These assumptions necessarily influence the applicability of imitation.
These include: lack of explicit communication between mentors and observer; independent objectives for mentors and observer; full observability of mentors by the observer; unobservability of mentors' actions; and (bounded) heterogeneity. Assumptions such as full observability are necessary for our model, as formulated, to work (though we will discuss extensions to our model for the partially observable case in Chapter 7). The two assumptions that explicit communication is infeasible and that actions are unobservable extend the applicability of implicit imitation beyond other models in the literature; if these conditions do not hold, implicit imitation could be augmented to take greater advantage of the additional information. Finally, the assumptions of bounded heterogeneity and independent objectives also ensure that implicit imitation can be applied widely. However, as we have seen in earlier experiments, the degree to which rewards are the same and actions are homogeneous can have an impact on the utility of implicit imitation (i.e., on how much it accelerates learning).
6.2 Predicting Performance

Once we have determined that the basic assumptions of implicit imitation have been satisfied, we face the question of how likely implicit imitation is to actually accelerate learning. The key to answering this question is to examine briefly the reasons for potential acceleration in implicit imitation. In the implicit imitation model, we use observations of other agents to improve the observer's knowledge about its environment and then rely on a sensible exploration policy to exploit this additional knowledge. A clear understanding of how knowledge of the environment affects exploration is therefore central to understanding how implicit imitation will perform in a domain. Within the implicit imitation framework, agents know their reward functions, so knowledge of the environment consists solely of knowledge about the agent's action models. In general, these models can take any form. For simplicity, we have restricted ourselves to models that can be decomposed into local models for each possible combination of a system state and agent action. The local models for state-action pairs allow the prediction of a j-step successor state distribution given any initial state and a sequence of actions or a local policy.
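As a toy illustration of how local models chain into a j-step prediction (the matrix representation and function name here are our own choices, not the thesis's formulation):

```python
import numpy as np

def j_step_distribution(p0, T, j):
    """Push an initial state distribution p0 through j applications of a
    row-stochastic transition matrix T, where T[s, t] = Pr(t | s) under
    some fixed local policy."""
    d = np.asarray(p0, dtype=float)
    for _ in range(j):
        d = d @ T  # one step of the chained local models
    return d
```

An error in even a single row of T corrupts every subsequent step, which is exactly the fragility of intermediate state-action models discussed next.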
The quality of the j-step state predictions will be a function of every action model encountered between the initial state and the states at time j - 1. Unfortunately, the quality of the j-step estimate can be drastically altered by the quality of even a single intermediate state-action model. This suggests that connected regions of state space, in which all states have reasonably accurate models, will allow reasonably accurate future state predictions. Since the estimated value of a state s is based on both the immediate reward and the reward expected to be received in subsequent states, the quality of this value estimate will depend on the connectivity of the state space. Since both the greedy and Bayesian exploration methods bias their exploration according to the estimated value of actions, the exploratory choices of an agent at state s will also be dependent on the connectivity of the state space. We have therefore organized our analysis of implicit imitation performance around the idea of state space connectivity and the regions such connectivity defines. Since connected regions play an important role in implicit imitation, we introduce an informal classification of different regions within the state space, shown graphically in Figure 6.1.
We use these informal classes to motivate a precise definition for a possible metric predicting imitation performance. The regions are described in depth below. We first observe that many tasks can be carried out by an agent in a small subset of the states within the state space defined for the problem. More precisely, in many MDPs, the optimal policy will ensure that an agent remains in a small subspace of state space. This leads us to the definition of our first regional distinction: relevant versus irrelevant regions. The relevant region is the set of states with a non-zero probability of occupancy under the optimal policy.¹ An ε-relevant region is a natural generalization in which the optimal policy keeps the system within the region a fraction 1 - ε of the time.

Figure 6.1: A system of state space region classifications for predicting the efficacy of imitation. [Regions shown: observer explored region, mentor augmented region, irrelevant region, reward.]

Within the relevant region, we distinguish three additional subregions. The explored region contains those states where the observer has formulated reliable action models on the basis of its own experience. The augmented region contains those states where the observer lacks reliable action models but has improved value estimates due to mentor observations.
Finally, the blind region designates those states where the observer has neither (significant) personal experience nor the benefit of mentor observations. Note that both the explored and augmented regions are created as the result of observations made by the observer (either of its own transitions or those of a mentor). These regions will therefore have significant "connected components," that is, contiguous regions of state space where reliable action or mentor models are available. Any information about states within the blind region will come (largely) from the agent's prior beliefs.

¹One often assumes that the system starts in one of a small set of states. If the Markov chain induced by the optimal policy is ergodic, then the irrelevant region will be empty.

The potential gain for an imitator in a domain can be understood in part as a function of the interaction of the regions described above. Consider the distinction between relevant and irrelevant states. Not all model information is equally helpful: the imitator needs only enough information about the irrelevant region to be able to avoid it. Since action choices are influenced by the relative value of actions, the irrelevant region will be avoided when it looks worse than the relevant region.
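The explored/augmented/blind distinction above can be phrased operationally. The sketch below is purely illustrative: the visit counters (`own_counts`, `mentor_counts`) and the reliability threshold are our own devices, not part of the framework's formal definitions:

```python
def classify_state(s, own_counts, mentor_counts, threshold=5):
    """Label a state by which kind of evidence the observer has for it."""
    if own_counts.get(s, 0) >= threshold:  # reliable model from own experience
        return "explored"
    if mentor_counts.get(s, 0) > 0:        # mentor observations only
        return "augmented"
    return "blind"                         # neither: priors only
```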
Given diffuse priors on action models, none of the actions open to an agent will initially appear particularly attractive or unattractive. However, a mentor that provides observations within the relevant region can quickly make the relevant region look much more promising as a method of achieving higher returns and therefore constrain exploration significantly. Therefore, considering problems just from the point of view of relevance, a problem with a small relevant region relative to the entire space, combined with a mentor that operates within the relevant region, will result in maximum advantage for an imitation agent over a non-imitating agent. In the explored region, the observer has sufficiently accurate models to compute a policy that is optimal with respect to the rewarding states within the region (the policy may be suboptimal with respect to rewards outside of the explored region). Additional observations on the states within the explored region provided by the mentor can still improve performance somewhat if significant evidence is required to accurately discriminate between the expected values of two actions. Hence, mentor observations in the explored region can help, but will not result in dramatic speedups in convergence.
In the augmented region, the observer's Q-values are likely to be more accurate than the initial Q-values, as they have been calculated from models that have been augmented with observations of a mentor. In experiments in previous chapters, we have seen that an observer entering an augmented region can experience significant speedups in convergence. Characteristics of the augmented zone, however, can affect the degree to which augmentation improves convergence speed: since the observer receives observations of only the mentor's state and not its actions, the observer has improved value estimates for states in the augmented region, but it does not have a policy. The observer must therefore infer which actions should be taken to duplicate the mentor's behavior. Where the observer has prior beliefs about the effects of its actions, it may be possible to perform immediate inference about the mentor's actual choice of action (perhaps using the KL-distance or maximum likelihood). Where the observer's prior model is uninformative, the observer will have to explore the local action space. In exploring a local action space, however, the agent must take an action, and this action will have an effect.
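A maximum-likelihood version of the action inference mentioned above can be sketched in a few lines; the dictionary-of-models representation is a hypothetical one chosen for brevity:

```python
def infer_mentor_action(s, s_next, action_models):
    """Guess which observer action the mentor took: choose the action whose
    (prior) transition model assigns the highest probability to the observed
    mentor transition s -> s_next.
    action_models maps action -> {(s, s_next): probability}."""
    return max(action_models,
               key=lambda a: action_models[a].get((s, s_next), 0.0))
```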
Since there is no guarantee that the agent took the action that duplicates the mentor's action, it may end up somewhere different than the mentor. If the action causes the observer to fall outside of the augmented region, the observer will lose the guidance that the augmented value function provides and fall back to the performance level of a non-imitating agent. An important consideration, then, is the probability that the observer will remain in augmented regions and continue to receive guidance. One quality of the augmented region that affects the observer's probability of staying within its boundaries is its relative coverage of the state space. The policy of the mentor may be sparse or complete. In a relatively deterministic domain with defined begin and end states, a sparse policy covering few states may define an adequate policy for the mentor or observer. In a highly stochastic domain without defined goal states, an agent may need a complete policy (i.e., covering every state). Implicit imitation provides more guidance to the agent in domains that are more stochastic and require more complete policies, since the policy will cover a larger part of the state space.
In addition to the completeness of a policy, we must also consider the probability that the observer will make transitions into and out of the augmented region. Where the actions in a domain are largely invertible (directly, or effectively so), the agent has a chance of re-entering the augmented region. Where this property is lacking, however, the agent may have to wait until the process undergoes some form of "reset" before it has the opportunity to gather additional evidence regarding the identity of the mentor's actions in the augmented region. The "reset" places the agent back into the explored region, from which it can make its way to the frontier where it last explored. The lack of invertible actions would reduce the agent's ability to make progress towards high-value regions before resets, but the agent is still guided by the augmented region on each attempt. Effectively, the agent will concentrate its exploration on the boundary between the explored region and the mentor-augmented region. The utility of mentor observations will depend on the probability of the augmented and explored regions overlapping in the course of the agent's exploration. In the explored regions, accurate action models allow the agent to move as quickly as possible to high-value regions.
In augmented regions, augmented Q-values inform agents about which states lead to highly-valued outcomes. When an augmented region abuts an explored region, the improved value estimates from the augmented region are rapidly communicated across the explored region by accurate action models. The observer can use the resultant improved value estimates in the explored region, together with the accurate action models in the explored region, to rapidly move towards the most promising states on the frontier of the explored region. From these states, the observer can explore outwards and eventually expand the explored region to encompass the augmented region. In the case where the explored region and the augmented region do not overlap, there is a blind region. Since the observer has no information beyond its priors for the blind region, the observer is reduced to random exploration. In a non-imitation context, any states that are not explored are blind. However, in an imitation context, the blind area can be reduced considerably by the augmented area. Hence, implicit imitation effectively shrinks the size of the search space of the problem even when there is no overlap between explored and augmented spaces.
The most challenging case for implicit imitation transfer occurs when the region augmented by mentor observations fails to connect to either the observer explored region or regions with significant reward values. In this case, the augmented region will initially provide no guidance. Once the observer has independently located rewarding states, the augmented regions can be used to highlight "shortcuts". These shortcuts represent improvements on the agent's policy. In domains where a feasible solution is easy to find but optimal solutions are difficult, implicit imitation can be used to convert a feasible solution to an increasingly optimal solution.

6.3 Regional Texture

We have seen how distinctive regions can be used to understand how imitation will perform in various domains. We can also analyze imitation performance in terms of properties that cut across the state space. In our analysis of how model information impacted imitation performance, we saw that regions of states connected by accurate action models allowed an observer to use mentor observations to learn about the most promising direction for exploration.
We see, then, that any set of mentor observations will be more useful if it is concentrated on a connected region and less useful if dispersed about the state space in unconnected components. We are fortunate that in completely observable environments, observations of mentors tend to capture continuous trajectories, thereby providing continuous regions of augmented states. In partially observable environments, occlusion and noise could lessen the value of mentor observations in the absence of a model to predict the mentor's state. The effects of heterogeneity, whether due to differences in the action capabilities of the mentor and observer or due to differences in the environments of the two agents, can also be understood in terms of the connectivity of action models. Value can propagate along chains of action models until we hit a state in which the mentor and observer have different action capabilities. At this state, it may not be possible to achieve the mentor's value and therefore value propagation is blocked.
Again, we see that the sequential decision making aspect of reinforcement learning leads to the conclusion that many scattered differences between mentor and observer will create discontinuity throughout the problem space, whereas a contiguous region of differences between mentor and observer will cause discontinuity in a region but leave other large regions fully connected. Hence, the distribution pattern of differences between mentor and observer capabilities is as important as the prevalence of difference. We will explore this pattern in the next section.

6.4 The Fracture Metric

We now try to characterize connectivity in the form of a metric. Since differences in reward structure, environment dynamics and action models that affect connectivity would all manifest themselves as differences in policies between mentor and observer, we propose a metric based on optimal policy difference. We call this metric fracture. Essentially, it computes the average value of the minimum distance from a state in which the mentor and observer disagree on a policy to a state in which the mentor and observer agree on the policy. This measure roughly captures the difficulty the observer faces in profitably exploiting mentor observations to reduce its exploration demands.
More formally, let S be the state space, π_m be the mentor's optimal policy and π_o be the observer's. Let S_{π_m≠π_o} be the set of disputed states where the mentor and observer have different optimal actions. A set of neighboring disputed states constitutes a disputed region. The set S − S_{π_m≠π_o} will be called the undisputed states. Let M be a distance metric on the space S. This metric corresponds to the number of transitions along the "minimal length" path between states (i.e., the shortest path using nonzero probability observer transitions).² In a standard grid world, it will correspond to the minimal Manhattan distance. We define the fracture Φ(S) of state space S to be the average minimal distance between a disputed state and the closest undisputed state:

    Φ(S) = (1 / |S_{π_m≠π_o}|) Σ_{s ∈ S_{π_m≠π_o}} min_{t ∈ S − S_{π_m≠π_o}} M(s, t)        (6.1)

Figure 6.2: Four scenarios with equal solution lengths but different fracture coefficients: (a) Φ = 0.5, (b) Φ = 1.7, (c) Φ = 3.5, (d) Φ = 6.0.

Other things being equal, a lower fracture value will tend to increase the propagation of value information across the state space, potentially resulting in less exploration being required. To test our metric, we applied it to a number of scenarios with varying fracture coefficients. It is difficult to construct scenarios which vary in their fracture coefficient yet have the same expected value.
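Equation 6.1 can be computed by a breadth-first search over the observer's nonzero-probability transitions. This is our own sketch of the computation, assuming an adjacency-list representation of the transition graph rather than anything specified in the text:

```python
from collections import deque

def fracture(states, neighbors, disputed):
    """Average, over disputed states s, of the shortest-path distance M(s, t)
    to the nearest undisputed state t (Equation 6.1).
    neighbors maps each state to its nonzero-probability successors."""
    undisputed = set(states) - set(disputed)
    total = 0
    for s in disputed:
        seen, queue = {s}, deque([(s, 0)])
        while queue:
            t, d = queue.popleft()
            if t in undisputed:      # first undisputed state reached is nearest
                total += d
                break
            for u in neighbors[t]:
                if u not in seen:
                    seen.add(u)
                    queue.append((u, d + 1))
    return total / len(disputed)
```

On an undirected chain 0-1-2-3-4 with states 1 and 2 disputed, each disputed state is one transition from an undisputed state, so Φ = 1.0.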
The scenarios in Figure 6.2 have been constructed so that the lengths of all possible paths from the start state s to the goal state x are the same in each scenario.² In each scenario, however, there is an upper path and a lower path. The mentor is trained in a scenario that penalizes the lower path and so the mentor learns to take the upper path. The imitator is trained in a scenario in which the upper path is penalized and should therefore take the lower path. We equalized the difficulty of these problems as follows: using a generic ε-greedy learning agent with a fixed exploration schedule (i.e., initial rate and decay) in one scenario, we tuned the magnitude of penalties and their exact placement along loops in the other scenarios so that a learner using the same exploration policy would converge to the optimal policy in roughly the same number of steps in each.

² The expected distance would give a more accurate estimate of fracture, but is more difficult to calculate.

Figure 6.3: Percentage of runs (of ten) converging to optimal policy given fracture Φ and exploration rate δ_i.
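The generic ε-greedy agent with a fixed exploration schedule used to equalize the scenarios might look like the following sketch. The function names and the exponential decay form are our assumptions; the thesis does not specify them.

```python
import random

def decayed_epsilon(initial_rate, decay, step):
    """Fixed exploration schedule: the initial rate decayed once per step."""
    return initial_rate * (decay ** step)

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon pick a random action index, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Holding `initial_rate` and `decay` fixed across scenarios is what makes the convergence-time comparison meaningful.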
In Figure 6.2(a), the mentor takes the top of each loop and, in an optimal run, the imitator would take the bottom of each loop. Since the loops are short and the length of the common path is long, the average fracture is low. When we compare this to Figure 6.2(d), we see that the loops are very long; the majority of states in the scenario are on loops. Each of these states on a loop has a distance to the nearest state where the observer and mentor policies agree, namely, a state not on the loop. This scenario therefore has a high average fracture coefficient.

Since the loops in the various scenarios differ in length, penalties inserted in the loops vary with respect to their distance from the goal state and therefore affect the total discounted expected reward in different ways. The penalties may also cause the agent to become stuck in a local minimum in order to avoid the penalties if the exploration rate is too low. In this set of experiments, we therefore compare observer agents on the basis of how likely they are to converge to the optimal solution given the mentor example. Figure 6.3 presents the percentage of runs (out of ten) in which the imitator converged to the optimal solution (i.e., taking only the lower loops) as a function of exploration rate and scenario fracture.
We can see a distinct diagonal trend in the table, illustrating that increasing fracture requires the imitator to increase its levels of exploration in order to find the optimal policy. We conclude that fracture captures a property of domains that is important in predicting the efficacy of implicit imitation.

6.5 Suboptimality and Bias

Implicit imitation is fundamentally about biasing the exploration of the observer. As such, it is worthwhile to ask when this has a positive effect on observer performance. The short answer is that a mentor following an optimal policy for an observer will cause the observer to explore in the neighborhood of the optimal policy, and this will generally bias the observer towards finding the optimal policy. That is to say, a good mentor teaches more than a bad mentor.

A more detailed answer requires looking explicitly at exploration in reinforcement learning. In theory, an ε-greedy exploration policy with a suitable rate of decay will cause implicit imitators to eventually converge to the same optimal solution as their unassisted counterparts. However, in practice, the exploration rate is typically decayed more quickly in order to improve early exploitation of mentor input.
Given practical, but theoretically unsound, exploration rates, an observer may settle for a mentor strategy that is feasible, but non-optimal. We can easily imagine examples: consider a situation in which an observer is observing a mentor following some policy. Early in the learning process, the value of the policy followed by the mentor may look better than the estimated value of the alternative policies available to the observer. It could be the case that the mentor's policy actually is the optimal policy. On the other hand, it may be the case that some alternative policy is actually superior. Given a lack of information, a low-exploration exploitation policy might lead the observer to falsely conclude that the mentor's policy is optimal. While implicit imitation can bias the agent to a suboptimal policy, we have no reason to expect that an agent learning in a domain sufficiently challenging to warrant the use of imitation would have discovered a better alternative. We emphasize that even if the mentor's policy is suboptimal, it still provides a feasible solution which will be preferable to no solution for many practical problems. In this regard, we see that the classic exploration/exploitation tradeoff has an additional interpretation in the implicit imitation setting.
A component of the exploration rate will correspond to the observer's belief about the sufficiency of the mentor's policy. In this paradigm, then, it seems somewhat misleading to think in terms of a decision about whether to "follow" a specific mentor or not. It is more a question of how much exploration to perform in addition to that required to reconstruct the mentor's policy.

It is unclear that independent learners could do better than imitators in the general case without increasing the level of exploration they employ. A more definitive statement must await further analytical results. For specific subclasses, we can offer more insight. In competitive multiagent problems with interactions between agents, a malicious mentor, or set of malicious mentors with knowledge of the observer's exploration policy, might coordinate to bias the observer away from states and actions with strategic value during the observer's exploration phase. Once the observer has largely settled into a policy, the mentors could exploit the observer's suboptimal policy. This kind of strategic reasoning about the learning process has not yet been fully addressed in multi-agent systems.
We talk briefly about multi-agent imitation with interactions in the next chapter.

Chapter 7

Extensions to New Problem Classes

The model of implicit imitation presented in the preceding chapters makes certain restrictive assumptions regarding the structure of the decision problem being solved (e.g., full observability, knowledge of the reward function, and discrete state and action spaces). While these simplifying assumptions aided the detailed development of the model, we believe the basic intuitions and much of the technical development above can be extended to richer problem classes. We suggest several possible extensions in this chapter, each of which could form the basis of a significant future project.

7.1 Unknown Reward Functions

Our current paradigm assumes that the observer knows its own reward function. This assumption is consistent with the view of reinforcement learning as a form of automatic programming. However, we can envision at least two ways of relaxing this assumption. The first method would require the assumption that the agent is able to generalize observed rewards. Suppose that the expected reward can be expressed in terms of a probability distribution over features of the observer's state, Pr(r_o | f(s)).
In model-based reinforcement learning, this distribution can be learned by the agent through its own experience. If the same features can be applied to the mentor's state s_m, then the observer can use what it has learned about the reward distribution to estimate expected reward for mentor states as well. This extends the paradigm to domains in which rewards are unknown, but preserves the ability of the observer to evaluate mentor experiences on its "own terms."

A second method of relaxing the known reward function assumption would involve exploring the relationship of implicit imitation to other techniques that do not assume a known reward function. Imitation techniques designed around the assumption that the observer and the mentor share identical rewards, such as Utgoff's (1991), would of course work in the absence of a reward function. In Utgoff's approach, the mentor is assumed to be rational, so Q-values are incrementally adjusted using a gradient approach until they are consistent with the actions taken by the mentor. Inverse reinforcement learning (Ng & Russell, 2000) uses the same idea, but different machinery. A linear program is used to find the reward function that would explain the decisions made by a mentor in a sampled trajectory of the mentor's behavior.
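The first method above, learning Pr(r_o | f(s)) from the observer's own experience and reusing it to evaluate mentor states, might be sketched as follows. We track only the empirical mean reward per feature vector; the class and method names are our assumptions.

```python
from collections import defaultdict

class FeatureRewardModel:
    """Empirical estimate of expected reward given state features,
    learned from the observer's own experience."""

    def __init__(self, feature_fn):
        self.f = feature_fn                 # f: state -> hashable feature vector
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def record(self, s_o, reward):
        """Update the model with one step of the observer's own experience."""
        key = self.f(s_o)
        self.totals[key] += reward
        self.counts[key] += 1

    def expected_reward(self, s, default=0.0):
        """Estimate expected reward for any state with known features,
        including a mentor state s_m, on the observer's 'own terms'."""
        key = self.f(s)
        if self.counts[key] == 0:
            return default                  # no evidence for these features yet
        return self.totals[key] / self.counts[key]
```

Because the estimate is keyed on features rather than states, a mentor state never visited by the observer can still be scored, provided its features have been seen.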
We suggest that it would be profitable to consider scenarios in between the extremes of implicit imitation, which assumes nothing about the mentor's reward function, and value-inversion approaches, which assume the mentor's reward function is identical. There may be a useful synthesis of these approaches based on an observer's prior beliefs about the degree of correlation between its own reward model and that of the mentor it is observing.

7.2 Benefiting from Penalties

Under Bayesian imitation, knowledge from both observer and mentor sources is pooled to create a more accurate model. Under this technique, it is possible for the observer to learn that some action leads with high probability to a negative outcome. It would therefore be of interest to examine in greater depth the potential to use imitation techniques in learning to avoid undesirable states.

7.3 Continuous and Model-Free Learning

In many realistic domains, continuous attributes and large state and action spaces prohibit the use of explicit table-based representations. Reinforcement learning in these domains is typically modified to make use of function approximators to estimate the Q-function at points where no direct evidence has been received.
We perceive two broad approaches to approximation: parameter-based models (e.g., neural networks) (Bertsekas & Tsitsiklis, 1996) and memory-based approaches (Atkeson, Moore, & Schaal, 1997). In both of these approaches, model-free learning is employed. That is, the agent keeps a value function, but uses the environment as an implicit model to perform backups using the sampling distribution provided by environment observations.

One straightforward approach to casting implicit imitation in a continuous setting would employ a model-free learning paradigm (Watkins & Dayan, 1992). First, recall the augmented Bellman backup used in implicit imitation:

    V(s) = R_o(s) + γ max{ max_{a ∈ A_o} Σ_{t ∈ S} T_o(s, a, t) V(t), Σ_{t ∈ S} T_m(s, t) V(t) }    (7.1)

When we examine the augmented backup equation, we see that it can be converted to a model-free form in much the same way as the ordinary Bellman backup. We use a standard Q-function with observer actions, but we will add one additional action which corresponds to the action taken by the mentor, a_m.¹ When the observer interacts with the environment, it receives a tuple <s_o, a_o, s'_o>. The tuple is used to back up the corresponding state with the following Q-learning rule:

¹ This doesn't imply the observer knows which of its actions corresponds to a_m.
    Q(s_o, a_o) = (1 − α) Q(s_o, a_o) + α ( R_o(s_o, a_o) + γ max{ max_{a' ∈ A_o} Q(s'_o, a'), Q(s'_o, a_m) } )    (7.2)

When the observer makes an observation of the mentor, it receives a tuple of the form <s_m, s'_m>, which can be backed up with the rule:

    Q(s_m, a_m) = (1 − α) Q(s_m, a_m) + α ( R_o(s_m, a_m) + γ max{ max_{a' ∈ A_o} Q(s'_m, a'), Q(s'_m, a_m) } )    (7.3)

As in the discrete version of implicit imitation, the relative quality of mentor and observer estimates of the Q-function at specific states may vary; in order to avoid having inaccurate prior beliefs about the mentor's action models bias exploration, we need to employ a confidence measure to decide when to apply these augmented equations. We feel the most natural setting for these kinds of tests is in the memory-based approaches to function approximation. Memory-based approaches, such as locally-weighted regression (Atkeson et al., 1997), not only provide estimates for functions at points previously unvisited, they also maintain the evidence set used to generate these estimates. We note that the implicit bias of memory-based approaches assumes smoothness between points unless additional data proves otherwise.
On the basis of this smoothness bias, it might make sense to compare the average squared distance from the query to the exemplars used in the estimate of the mentor's Q-value with the average squared distance from the query to the exemplars used in the observer-based estimate, to heuristically decide which agent has the more reliable Q-value. The approach suggested here would not benefit from prioritized sweeping. Prioritized sweeping has, however, been adapted to continuous settings (Forbes & Andre, 2000). We feel a reasonably efficient integration of these techniques could be made to work.

7.4 Interaction of Agents

The framework we have developed in previous chapters assumes a non-interacting game. While non-interacting games can usefully model many problems, such as the programming of assembly robots (Friedrich et al., 1996), we would like to be able to extend imitation to problems that involve agents which interact with each other. The general problem of planning for agents which can interact is difficult. We have identified what we believe to be a useful subclass of problems and we demonstrate a potential method for solving problems in this class efficiently.
We then go on to describe applications of imitation in general Markov games with interactions and comment on how an implicit imitation strategy might be applied. While the ideas of this section are far from complete, they set up some promising starting points for future research.

Role Swapping

We define a stabilized Markov game with replacement as both a useful subclass of games for imitators and a class with a potentially efficient solution. We see this class as a natural generalization of the "training the new guy on the job" paradigm (Denzinger & Ennis, 2002). In a stabilized game, a group of agents is following a prior policy that allows them to work efficiently together. This policy may be the result of a gifted designer, or the result of extensive learning of models and search through game-theoretic equilibria. In either case, the optimal long-run policy for the game is known. In a game with replacement, players may be added to or removed from the game. The goal of a newly added agent is to quickly find a policy that works with the existing policy of the group. We argue that imitation can be used to help the new agent learn this policy.

Consider the simple coordination problem faced by a set of mobile agents trying to avoid each other while carrying out a fully cooperative task with identical interests.
A gifted designer or prior experience may result in a policy for this group of agents which requires agents to always pass on the right-hand side of oncoming agents. A new group of agents is to join the existing group or perhaps replace a subset of the existing group. When this group of new agents is small relative to the existing group, there will be no incentive for the group as a whole to change its policy. The most efficient policy is for the new agents in the group to acquire the group policy as quickly as possible.

The learning of new agents can be accelerated by imitation. We treat each of the new agents as imitators. We provide each of them with a slightly modified version of implicit imitation. As in the non-interacting version, the imitation agent's observation space includes both its own state and the state of other agents in the group. As in our non-interacting developments, we assume the existence of a mapping from states in the mentor state space to states in the observer state space. Unlike the non-interacting version, however, the observer calculates its Q-values over both its own state and action as well as the state of other agents in its environment. The other agents are assumed to be following a fixed policy and their actions are ignored.
Because the observer's Q-values are over the joint state, it can choose a different action depending on whether it is blocked by another agent or not. It can learn more quickly what this action should be through imitation. To imitate, the observer chooses an agent to be its mentor and "swaps roles" with the mentor: it swaps the observed state of this agent with its own state and replaces its own action with a guess about the mentor's action calculated from the mentor's observed transition using the same mathematics developed for the Bayesian imitation paradigm in Chapter 5. The imitator then uses the observation with roles swapped to update its Q-values. In this way, we suggest that the Q-values will quickly come to represent the value of following the mentor's approach to coordinating its actions with other agents. We call the approach implicit imitation with role swapping. We point out that imitation in this context is actually transferring knowledge about how to solve game-theoretic coordination problems, inasmuch as this knowledge is embedded in the behavior of other agents.

In a non-cooperative game, or a cooperative game with non-identical interests, there may be an incentive for the new agents to flout the conventions of the existing group in order to improve their own payoffs.
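The role-swapped observation described above might be sketched as follows. The names are ours, and `infer_action` stands in for the Bayesian action-inference machinery of Chapter 5.

```python
def role_swapped_experience(own_state, mentor_prev, mentor_next, infer_action):
    """Build a training tuple by swapping roles with the mentor: the
    mentor's observed state takes the 'self' slot of the joint state, and
    the unobserved mentor action is replaced by an inferred guess.

    Returns (joint_state, guessed_action, joint_next) suitable for an
    ordinary Q-learning backup over joint states. For simplicity this
    sketch holds the other agent (here, the imitator itself) fixed
    during the mentor's transition."""
    joint_state = (mentor_prev, own_state)
    joint_next = (mentor_next, own_state)
    a_hat = infer_action(mentor_prev, mentor_next)  # e.g. argmax_a T(s, a, s')
    return joint_state, a_hat, joint_next
```

The resulting tuple is fed to the same joint-state Q update the imitator uses for its own experience, so the mentor's coordination behavior is learned without ever observing its actions.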
We believe the role swapping technique can be extended to these mixed-interest situations using a simple solution employed by many real-world agent communities. The non-identical interests, as they relate to established group conventions, are converted to identical interests through the application of penalties on non-conforming agents.

Imitation in general symmetric Markov games

The role swapping technique relies on the stability of the established group policy to provide an efficient imitation-based solution to the problem of adding new agents to a group. As with single-agent reinforcement learning techniques, an imitator using role swapping in a non-stationary game is not guaranteed to converge to a good policy. Where multiple agents are learning simultaneously, gradient techniques (Singh, Kearns, & Mansour, 2000) or solution concepts from game theory can be used to adapt existing reinforcement learning techniques (Littman, 1994; Hu & Wellman, 1998; Bowling & Veloso, 2001).

We focus here on techniques inspired by solution concepts from game theory. The key insight of these approaches is to treat each stage of the reinforcement learning problem as a matrix game with the Q-values for joint actions substituted for matrix payoffs (see Chapter 2, Section 2.8).
The estimation of these Q-values, however, requires that every agent be able to observe both the state and action values of all agents in the game. In essence, each agent is independently calculating the same global plan using the same global information. Unfortunately, the requirement that the actions of all agents be observable conflicts with our assumption that imitation is often sensible only when actions are not observable (see Chapter 3, Section 3.5). Since the other agents may change their policy, we cannot use the value of their "Markov chains" as stationary estimates of the value of the policy they are following. A key problem for implementing implicit imitation in this context would be the estimation of the actions of other players. We suspect that assumptions similar to the homogeneous action assumption would be required to obtain a significant advantage through imitation. We suggest that the class of symmetric Markov games coupled with action inference would be a good starting point for research in this area.

Additional considerations in interacting games

Other challenges and opportunities present themselves when imitation is used in multi-agent settings.
For example, in competitive or educational domains, agents not only have to choose actions that maximize information from exploration and returns from exploitation; they must also reason about how their actions communicate information to other agents. In a competitive setting, one agent may wish to disguise its intentions, while in the context of teaching, a mentor may wish to choose actions whose purpose is abundantly clear. These considerations must become part of any action selection process.

7.5 Partially Observable Domains

In this section we consider the potential for the extension of implicit imitation to partially observable domains. We consider two forms of partial observability. In the first type, the observer's own state is fully observable while the state of the mentor is only intermittently observable. In this case the agent can continue to observe the mentor, but it will receive fewer complete transitions from the mentor and, for reasons that will be fully described in the next chapter, derive significantly less benefit from the mentor. The more interesting case
It is i m p o r t a n t to r e m e m b e r that it is o n l y o b s e r v a b i l i t y o f the m e n t o r w i t h respect to the o b s e r v e r that is i m p o r t a n t . T h e c e n t r a l i d e a o f i m p l i c i t i m i t a t i o n is to extract m o d e l i n f o r m a t i o n f r o m observ a t i o n s o f the mentor, rather than d u p l i c a t i n g m e n t o r b e h a v i o r . T h i s m e a n s that the m e n t o r ' s i n t e r n a l b e l i e f state a n d p o l i c y are n o t ( d i r e c t l y ) r e l e v a n t to the learner. W e t a k e a s o m e w h a t b e h a v i o r i s t stance a n d c o n c e r n o u r s e l v e s o n l y w i t h w h a t the m e n t o r ' s o b s e r v e d b e h a v i o r s t e l l us about the p o s s i b i l i t i e s inherent i n the e n v i r o n m e n t . T h e o b s e r v e r d o e s h a v e to k e e p a b e l i e f state about the m e n t o r ' s current state, but this c a n be d o n e u s i n g the s a m e e s t i m a t e d w o r l d m o d e l the o b s e r v e r uses to update its o w n b e l i e f state. In k e e p i n g w i t h o u r m o d e l - b a s e d a p p r o a c h to r e i n f o r c e m e n t l e a r n i n g , w e p r o p o s e to t a c k l e p a r t i a l l y o b s e r v a b l e r e i n f o r c e m e n t l e a r n i n g b y h a v i n g the o b s e r v e r e x p l i c i t l y e s t i m a t e a t r a n s i t i o n m o d e l T. G i v e n the m o d e l , the agent m a y e m p l o y s t a n d a r d t e c h n i q u e s f o r s o l v i n g the resultant P O M D P (See S e c t i o n 2.7). W e w i l l a s s u m e that the o b s e r v e r k n o w s its r e w a r d f u n c t i o n R(s)  a n d its o b s e r v a t i o n m o d e l Z : S —> A ( 2 ) . 
Our goal in this section is to show that, given our assumptions, the estimation process for the transition model T can be factored into separate components for the observer and mentor and that the updates for each of these factors can be recast in a recursive form. This recursive form, or state-history rollup, allows an agent to incrementally improve its model with each observation. Given this model, the POMDP can be solved using existing techniques (Smallwood & Sondik, 1973).

Our derivation relies on some additional notation. First, we modify our notation for states slightly, using S_o^t to represent the observer's state at time t and S_o^{1..t} to represent the sequence of observer states from time 1 to time t. We will often abbreviate the complete state history S_o^{1..t} to S_o. The mentor's state history is represented analogously by S_m. In a partially observable environment, the agent's true state is not observable. Instead, the agent must reason from incomplete and possibly noisy observation signals denoted Z_o^t. The history of observer observations is given by Z_o^{1..t}. As before, we let A_o represent observer actions, T represent the transition model for the environment and π_m the mentor's policy. Figure 7.1 gives a graphical model for the dependencies between these variables.

Figure 7.1: Overview of influences on observer's observations
We see that the mentor model π_m and the transition model T influence the mentor state history S_m, which in turn influences the observer's observations of the mentor Z_m. Note that the mentor's observations of its own behavior are irrelevant to the observer. Similarly, the observer's action choices A_o together with the transition model T determine the observer's state history S_o, which in turn determines the observations the observer will make of its own behavior Z_o.

In a partially observable environment, we still retain the Markov condition, which guarantees that the dynamics of the environment are strictly a function of the agent's current state and action. However, an agent may need to make a series of observations in order to resolve its current state. The coarse model above can be refined into a more detailed model, as shown in Figure 7.2, to include the influence of agent choices and the transition model on the detailed history of the agent. In the more detailed view, we see that the agent's beliefs about each state in its history are influenced by the prior and future states, the transition model, its action choices and the observations it made.

Just before we proceed to the derivation, we would like to explain our notation.
Figure 7.2: Detailed model of influences on observer's observations

In each step of the derivation, underbraces are used to highlight the subexpression to be modified in the next step, and the phrase immediately below the underbrace gives the justification for this modification. For example, the notation (ab + ac) + d + e with the justification "a factors out" beneath the underbraced term (ab + ac) indicates that we will factor out the term a from that subexpression. Expressions of the form A ⊥ B | C mean that the variable A is conditionally independent of variable B given variable C. Expressions of the form "see number" refer to subderivations on the following pages. Our discussion will refer only to the parent derivation. The reader may refer to the subderivations for details.

Derivation 1 begins with the term Pr(T | Z_o, Z_m, A_o). We see that the transition model probability can be elegantly decomposed into two terms representing the model likelihood according to the observer's observation history and the model likelihood according to observations of the mentor. Skipping over considerable intermediate steps, we can see that the last line of Derivation 1 shows us two things. First, the model likelihood given the observer's own observation history factors into a recursive form in which previous history can be rolled up into a prior distribution for time t - 1.
Integrating a new observation into the likelihood can be done by considering only the probability of transitions from t - 1 to t and the observation probability of Z_o^t given S_o^t. Second, the likelihood with respect to observer observations of the mentor can be expressed recursively, but is more complex. The prior states S_m^{1..t-1} can again be marginalized out. Unfortunately, an argument representing the mentor's policy π_m remains. We therefore have to store the marginalization for every possible mentor policy. Intuitively, the presence of π_m is due to the fact that for each possible mentor policy, there may be a transition model that makes the mentor policy consistent with the observations. We must therefore consider all possible policies and their prior probabilities when assessing the likelihood of any given model T.

Computing an expectation over possible mentor policies is somewhat onerous, but still conceivable. Where mentor policies can be represented by a small number of parameters or factored into local policies for individual states, it will be computationally tractable to compute an expectation over possible policies.
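To make the policy marginalization concrete, the following toy sketch (our own construction; the two models, two policies, trajectory and priors are all illustrative assumptions) scores two candidate transition models against an observed mentor state sequence, summing the trajectory likelihood over a small discrete set of stationary policies in the spirit of Pr(Z_m | T) = Σ_{π_m} Pr(Z_m | π_m, T) Pr(π_m). For simplicity the mentor's states are treated as directly observed (a noise-free Z_m).

```python
# Illustrative sketch of the expectation over mentor policies; all numbers
# and names are ours, not the thesis'.

# T[a][s][s2]: candidate transition models over 2 states and 2 actions
T_slippery = {0: [[0.5, 0.5], [0.5, 0.5]],
              1: [[0.5, 0.5], [0.5, 0.5]]}
T_reliable = {0: [[0.9, 0.1], [0.1, 0.9]],   # action 0: tends to stay put
              1: [[0.1, 0.9], [0.9, 0.1]]}   # action 1: tends to switch

policies = [{0: 0, 1: 0},    # stationary policy "always action 0"
            {0: 1, 1: 1}]    # stationary policy "always action 1"
policy_prior = [0.5, 0.5]

trajectory = [0, 1, 0, 1, 0, 1]   # observed mentor states

def model_likelihood(T):
    """Pr(trajectory | T) = sum over candidate policies pi of
    Pr(trajectory | pi, T) * Pr(pi)."""
    total = 0.0
    for pi, p_pi in zip(policies, policy_prior):
        like = 1.0
        for s, s_next in zip(trajectory, trajectory[1:]):
            like *= T[pi[s]][s][s_next]   # Pr(s_next | s, pi(s), T)
        total += p_pi * like
    return total

print(model_likelihood(T_slippery), model_likelihood(T_reliable))
```

T_reliable scores far higher here only because one of the candidate policies ("always switch") renders the observed alternation likely under it; this is exactly why the mentor's policy cannot be marginalized away for free and a term for each π_m must be carried along.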
While much work remains to be done, we have shown that transition model updates could, in theory, be factored in a way to permit incremental update of an augmented model by an observer. We are therefore encouraged that a partially observable version of implicit imitation could be worked out using the framework already established for calculating value functions and choosing actions in single-agent POMDPs.

DERIVATION 1  Augmented Model Probability

1. Pr(T | Z_o, Z_m, A_o)
2. ∝ Pr(Z_o, Z_m | T, A_o) Pr(T | A_o)    [Bayes rule]
3. ∝ Pr(Z_o, Z_m | T, A_o) Pr(T)    [T ⊥ A_o]
4. ∝ Pr(Z_o | T, A_o) Pr(Z_m | Z_o, T, A_o) Pr(T)    [product rule]
5. ∝ Pr(Z_o | T, A_o) Pr(Z_m | T) Pr(T)    [Z_m ⊥ Z_o, A_o | T; see 7.2; see 7.5]
6. ∝ [ Σ_{S_o^{t-1..t}} Pr(Z_o^t | S_o^t) Pr(S_o^t | S_o^{t-1}, A_o^{t-1}, T) Pr(Z_o^{1..t-1} | S_o^{1..t-1}) Pr(S_o^{1..t-1} | A_o^{1..t-2}, T) ]
   × [ Σ_{π_m} Σ_{S_m^{t-1..t}} Pr(Z_m^t | S_m^t) Pr(S_m^t | S_m^{t-1}, π_m, T) Pr(Z_m^{1..t-1}, S_m^{1..t-1} | π_m, T) Pr(π_m) ] Pr(T)

In the last line, the first bracketed factor is the model likelihood given observer observations, with the terms over times 1..t-1 forming the rolled-up prior for time t - 1; the second is the model likelihood given mentor observations, with its own rolled-up prior for time t - 1.

DERIVATION 2  Model Likelihood given Observations of Observer

1. Pr(Z_o | T, A_o)
2. = Σ_{S_o} Pr(Z_o | S_o, T, A_o) Pr(S_o | T, A_o)    [intro S_o]
3. = Σ_{S_o} Pr(Z_o^t | S_o^t) Pr(Z_o^{1..t-1} | S_o^{1..t-1}) Pr(S_o^t | S_o^{t-1}, A_o^{t-1}, T) Pr(S_o^{1..t-1} | A_o^{1..t-2}, T)    [see 7.3; see 7.4]
4. = Σ_{S_o^{t-1..t}} Pr(Z_o^t | S_o^t) Pr(S_o^t | S_o^{t-1}, A_o^{t-1}, T) [ Σ_{S_o^{1..t-2}} Pr(Z_o^{1..t-1} | S_o^{1..t-1}) Pr(S_o^{1..t-1} | A_o^{1..t-2}, T) ]    [combine past terms and present terms; perform the inner summation and store the marginal]

DERIVATION 3  Probability of Observer's Own Observations

1. Pr(Z_o^{1..t} | S_o, A_o, T)
2. = Pr(Z_o^t, Z_o^{1..t-1} | S_o, A_o, T)    [joint event]
3. = Pr(Z_o^t | Z_o^{1..t-1}, S_o, A_o, T) Pr(Z_o^{1..t-1} | S_o, A_o, T)    [product rule]
4. = Pr(Z_o^t | S_o^t) Pr(Z_o^{1..t-1} | S_o^{1..t-1})    [Z_o^t ⊥ A_o, Z_o^{1..t-1} | S_o^t; Z_o^{1..t-1} ⊥ A_o, S_o^t | S_o^{1..t-1}]

DERIVATION 4  Probability of Observer's Own Trajectory

1. Pr(S_o | A_o, T)
2. = Pr(S_o^t, S_o^{1..t-1} | A_o, T)    [joint]
3. = Pr(S_o^t | S_o^{1..t-1}, A_o, T) Pr(S_o^{1..t-1} | A_o, T)    [product rule]
4. = Pr(S_o^t | S_o^{t-1}, A_o^{t-1}, T) Pr(S_o^{1..t-1} | A_o^{1..t-2}, T)    [S_o^t ⊥ S_o^{1..t-2} | S_o^{t-1}, T, A_o; S_o^{1..t-1} ⊥ A_o^{t-1} | A_o^{1..t-2}, T]

DERIVATION 5  Model Likelihood given Observations of Mentor

1. Pr(Z_m | T)
2. = Σ_{S_m} Pr(Z_m | S_m, T) Pr(S_m | T)    [intro S_m]
3. = Σ_{S_m} Pr(Z_m | S_m) Σ_{π_m} Pr(S_m | π_m, T) Pr(π_m | T)    [Z_m ⊥ T | S_m; introduce π_m]
4. = Σ_{S_m} Σ_{π_m} Pr(Z_m^t | S_m^t) Pr(Z_m^{1..t-1} | S_m^{1..t-1}) Pr(S_m^t | S_m^{t-1}, π_m, T) Pr(S_m^{1..t-1} | π_m, T) Pr(π_m)    [see 7.6; see 7.7; π_m ⊥ T]
5. = Σ_{S_m} Σ_{π_m} Pr(Z_m^t | S_m^t) Pr(S_m^t | S_m^{t-1}, π_m, T) Pr(Z_m^{1..t-1}, S_m^{1..t-1} | π_m, T) Pr(π_m)    [combine past terms]
6. = Σ_{π_m} Σ_{S_m} Pr(Z_m^t | S_m^t) Pr(S_m^t | S_m^{t-1}, π_m, T) Pr(Z_m^{1..t-1}, S_m^{1..t-1} | π_m, T) Pr(π_m)    [reverse summations]
7. = Σ_{S_m^{t-1..t}} Pr(Z_m^t | S_m^t) Σ_{π_m} Pr(S_m^t | S_m^{t-1}, π_m, T) [ Σ_{S_m^{1..t-2}} Pr(Z_m^{1..t-1}, S_m^{1..t-1} | π_m, T) ] Pr(π_m)    [swap summations, distribute inward and sum out the marginalization over S_m^{1..t-2}]

DERIVATION 6  Probability of Mentor Observations

1. Pr(Z_m^{1..t} | S_m)
2. = Pr(Z_m^t, Z_m^{1..t-1} | S_m^{1..t})    [joint event]
3. = Pr(Z_m^t | Z_m^{1..t-1}, S_m^{1..t}) Pr(Z_m^{1..t-1} | S_m^{1..t})    [product rule]
4. = Pr(Z_m^t | S_m^t) Pr(Z_m^{1..t-1} | S_m^{1..t-1})    [Z_m^t ⊥ Z_m^{1..t-1} | S_m^t; Z_m^{1..t-1} ⊥ S_m^t | S_m^{1..t-1}]

DERIVATION 7  Probability of Mentor's Trajectory

1. Pr(S_m | π_m, T)
2. = Pr(S_m^t, S_m^{1..t-1} | π_m, T)    [joint]
3. = Pr(S_m^t | S_m^{1..t-1}, π_m, T) Pr(S_m^{1..t-1} | π_m, T)    [product rule]
4. = Pr(S_m^t | S_m^{t-1}, π_m, T) Pr(S_m^{1..t-1} | π_m, T)    [Markov property]

7.6  Action Planning in Partially Observable Environments

Action selection in partially observable environments raises additional issues for imitation agents. In addition to the usual exploration-exploitation tradeoff, an imitation agent must determine whether it is worthwhile to take actions that improve the visibility of the mentor so that this source of information remains available while learning. We speculate that the decision about whether to make an effort in order to observe a mentor can be folded directly into the agent's value maximization process.

7.7  Non-stationarity

In some environments, the transition model for the environment may change over time. In a multi-agent learning system with interactions between agents, all agents in an environment may be learning simultaneously. In either case, an agent must continually adapt to its changing surroundings.
We are unaware of imitation-specific work that deals with this condition. As hinted at in the section on suboptimality and bias (Section 6.5), knowledge of how an agent's learning changes its policy could have strategic implications for imitation agents. This area will require further exploration.

Chapter 8

Discussion and Conclusions

Over the course of the last seven chapters we developed a new model of imitation which we call implicit imitation. Through numerous experiments we demonstrated important aspects of both the implicit imitation model and the mechanisms we proposed to implement it. We looked at how to characterize the applicability and performance of imitation learning and discussed several significant extensions to our basic model. In this chapter, we speculate on several possible real-world applications for our model of imitation and then conclude with a few remarks on our contributions to the field of computational imitation.

8.1  Applications

We see applications for implicit imitation in a variety of contexts. The emerging electronic commerce and information infrastructure is driving the development of vast networks of multi-agent systems.
In networks used for competitive purposes such as trade, implicit imitation can be used by a reinforcement learning agent to learn about the buying strategies or information filtering policies of other agents in order to improve its own behavior.

In control, implicit imitation could be used to transfer knowledge from an existing learned controller which has already adapted to its clients to a new learning controller with a completely different architecture. Many modern products such as elevator controllers (Crites & Barto, 1998), cell traffic routers (Singh & Bertsekas, 1997) and automotive fuel injection systems use adaptive controllers to optimize the performance of a system for specific user profiles. When upgrading the technology of the underlying system, it is quite possible that the sensors, actuators and internal representation of the new system will be incompatible with the old system. Implicit imitation provides a method of transferring valuable user information between systems without any explicit communication.

A traditional application for imitation-like technologies lies in the area of bootstrapping intelligent artifacts using traces of human behavior.
The behavioral cloning paradigm has investigated transfer in applications such as piloting aircraft (Sammut et al., 1992) and controlling loading cranes (Sue & Bratko, 1997). Other researchers have investigated the use of imitation in simplifying the programming of robots (Kuniyoshi et al., 1994; Friedrich et al., 1996). The ability of imitation to transfer complex, nonlinear and dynamic behaviors from existing human agents makes it particularly attractive for control problems.

8.2  Concluding Remarks

A central theme in our approach has been to examine imitation through the framework of reinforcement learning. The interpretation of imitation as reinforcement learning has not only helped us to see how various approaches to imitation are related, but has also helped us to see that imitation as model transfer is a promising and largely unexplored area for new research. We believe our model has also helped to make the field of reinforcement learning more accessible and relevant to social learning researchers (personal communication with Brian Edmonds) and we expect that tools from reinforcement learning and Markov decision processes will see greater use in future models of social agents research. The model
W e s a w that the c o n c e p t o f i m i t a t i o n i m p l i c i t l y a s s u m e s b o t h the presence o f o b s e r v a b l e m e n t o r e x p e r t i s e a n d the i n f e a s i b i l i t y o f e x p l i c i t c o m m u n i c a t i o n , w h i c h i n turn suggests the u n o b s e r v a b i l i t y o f m e n t o r a c t i o n s . W e e x p l a i n e d that the o b s e r v e r ' s k n o w l e d g e o f its o w n o b j e c t i v e s is b o t h r e a s o n a b l e a n d h i g h l y advantageous i n an i m i t a t i o n c o n t e x t . R e i n f o r c e m e n t l e a r n i n g , together w i t h a c a r e f u l c o n s i d e r a t i o n o f r e a s o n a b l e a s s u m p t i o n s f o r s e n s i b l e i m i t a t i o n , p r o v i d e d a g e n e r a l f r a m e w o r k f o r i m i t a t i o n . U s i n g the framew o r k , w e d e v e l o p e d a s i m p l e yet p r i n c i p l e d a l g o r i t h m c a l l e d m o d e l e x t r a c t i o n to h a n d l e p r o b l e m s i n w h i c h agents h a v e h o m o g e n e o u s a c t i o n c a p a b i l i t i e s . W e s h o w e d that i t is c r i t i c a l f o r i m i t a t o r s to b e a b l e to assess the r e l a t i v e c o n f i d e n c e they h a v e i n the i n f o r m a t i o n d e r i v e d f r o m t h e i r o w n o b s e r v a t i o n s versus that w h i c h they h a v e d e r i v e d f r o m o b s e r v a t i o n s o f m e n t o r s . W e a l s o s a w that r e i n f o r c e m e n t l e a r n i n g a l l o w s agents to a s s i g n v a l u e s to beh a v i o r s they h a v e a c q u i r e d b y o b s e r v a t i o n i n the s a m e w a y they a s s i g n v a l u e to b e h a v i o r s they h a v e l e a r n e d t h e m s e l v e s . S i n c e a l l b e h a v i o r s c a n b e e v a l u a t e d i n the s a m e terms, an agent c a n r e a d i l y d e c i d e w h e n to e m p l o y a b e h a v i o r d e r i v e d f r o m m e n t o r o b s e r v a t i o n s . 
Our framework thus solves a long-standing issue in imitation learning. The implicit imitation framework also permits an observer to simultaneously observe multiple mentors, to automatically extract relevant subcomponents of mentor behavior, and to synthesize new behaviors from both observations of mentors and the observer's own experience. Agents need not duplicate mentor behavior exactly. In addition, we showed that implicit imitation is robust in the face of noise and its benefit may actually increase with increasing problem difficulty.

We saw that model extraction can be elegantly extended to domains in which agents have heterogeneous action capabilities through the mechanisms of action feasibility testing and k-step repair. Adding bridging capabilities preserves and extends the mentor's guidance in the presence of infeasible actions, whether due to differences in action capabilities or to local differences in state spaces.

We also showed that it is possible to recast implicit imitation in a Bayesian framework. At the cost of additional computation, a Bayesian agent is able to smoothly integrate observations from its own experience, observations of mentor behavior and prior beliefs it has about its environment.
In combination with Bayesian exploration, the Bayesian imitation algorithm eliminated the tedious parameter tuning associated with greedy exploration strategies. We saw that Bayesian imitators are capable of outperforming their non-Bayesian counterparts. We also saw that imitation may overcome one of the drawbacks of Bayesian exploration: the possibility of converging to a suboptimal policy due to misleading priors. Of course, a fundamental issue for Bayesian learning will be the further extension of this paradigm to heterogeneous action capabilities.

We then used our understanding of how exploration is guided in reinforcement learning to analyze how characteristics of the domain affect the gains to be made by imitation agents. We analyzed the role of continuity in explaining the performance gains of imitation and developed the fracture metric, which was shown to be predictive of imitation gains in a number of experiments.

To scale implicit imitation to larger problems, we will have to integrate it with recent developments in large-scale reinforcement learning.
We showed that it is possible to express implicit imitation in a model-free form and sketched how this might be combined with memory-based function approximators to create viable imitation agents in continuous and high-dimensional spaces. We also saw that there is potential to apply Bayesian imitation learning in partially observable domains. While a practical implementation will require considerably more effort, our derivation shows that the problem of partially observable imitation can, in principle, be factored into a form suitable for use with existing approaches for solving POMDPs. In the course of this derivation we also pointed out that partial observability introduces a new issue for imitation agents: deciding when to employ actions that enhance the visibility of a mentor.

We briefly examined the potential for applying implicit imitation in multiagent problems with interacting agents. We discovered that coordination strategies could, themselves, be the subject of transfer by imitation. We concluded that there is potential to implement implicit imitation using existing techniques for Markov games, but effort would need to be given to developing methods of estimating the actions taken by other agents first.
The extensions we have proposed provide considerable inspiration for future work in imitation. One of our fundamental assumptions has been the existence of a mapping function between the state spaces of the mentor and observer. A key research problem for imitation learning is the automatic determination of such mappings (Bakker & Kuniyoshi, 1996). While existing research has looked at how to find corresponding behaviors given intermediate states with known correspondences along the path, we have yet to develop a satisfactory method for finding the state correlations to begin with. We suggest that it might be profitable to examine abstraction mechanisms as a way of finding similarities between state spaces.

In addition to this long-standing question in imitation learning research, our current work has introduced two new questions into the field: how should an agent reason about the information-broadcasting aspect of its actions in addition to whatever concrete value the actions have for the agent, and how should agents reason about the value of actions that will lead to increased visibility of mentors?

The reinforcement learning framework has led us to a better understanding of imitation learning.
It has also helped us to create three new algorithms for imitation learning based on assumptions that make these algorithms applicable in domains previously inaccessible to imitation learners. The framework has also clearly given us a starting point on several promising extensions to the existing work and highlighted two important new questions.

Our work on imitation as model transfer in reinforcement learning has not only resulted in advances in imitation learning, but has opened up a whole new way of exploring this field that poses an abundant source of ready-to-explore problems for future research.

Bibliography

Alissandrakis, A., Nehaniv, C. L., & Dautenhahn, K. (2000). Learning how to do things with imitation. In Bauer, M., & Rich, C. (Eds.), AAAI Fall Symposium on Learning How to Do Things, 3-5 November, pp. 1-6, Cape Cod, Massachusetts.

Alissandrakis, A., Nehaniv, C. L., & Dautenhahn, K. (2002). Do as I do: Correspondences across different robotic embodiments. In Proceedings of the 5th German Workshop on Artificial Life, pp. 143-152, Lübeck.

Anderson, J. R. (1990). Cognitive Psychology and Its Implications. W. H. Freeman, New York.

Åström, K., & Hägglund, T. (1996). PID control. In Levine, W. (Ed.), The Control Handbook, chap. 10.5, pp. 198-209. IEEE Press.

Atkeson, C. G., & Schaal, S. (1997). Robot learning from demonstration. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 12-20, Nashville, TN.

Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning for control. Artificial Intelligence Review, 11(1-5), 75-113.

Baird, L. (1999). Reinforcement Learning Through Gradient Descent. Ph.D. thesis, Carnegie Mellon University.

Bakker, P., & Kuniyoshi, Y. (1996). Robot see, robot do: An overview of robot imitation. In AISB '96 Workshop on Learning in Robots and Animals, pp. 3-11, Brighton, UK.

Baxter, J., & Bartlett, P. L. (2000). Reinforcement learning in POMDP's via direct gradient ascent. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 41-48, Stanford, CA.

Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton.

Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Athena, Belmont, MA.

Billard, A., & Hayes, G. (1997). Learning to communicate through imitation in autonomous robots. In Proceedings of The Seventh International Conference on Artificial Neural Networks, pp. 763-768, Lausanne, Switzerland.

Billard, A., & Hayes, G. (1999). DRAMA, a connectionist architecture for control and learning in autonomous robots. Adaptive Behavior Journal, 7, 35-64.

Billard, A., Hayes, G., & Dautenhahn, K. (1999). Imitation skills as a means to enhance learning of a synthetic proto-language in an autonomous robot. In Proceedings of the AISB '99 Symposium on Imitation in Animals and Artifacts, pp. 88-95, Edinburgh.

Boutilier, C. (1999). Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 478-485, Stockholm.

Boutilier, C., Dean, T., & Hanks, S. (1999). Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11, 1-94.

Bowling, M., & Veloso, M. (2001). Rational and convergent learning in stochastic games. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 1021-1026, Seattle.

Boyan, J. A. (1999). Least-squares temporal difference learning. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 49-56. Morgan Kaufmann, San Francisco, CA.

Breazeal, C. (1999). Imitation as social exchange between humans and robot. In Proceedings of the AISB '99 Symposium on Imitation in Animals and Artifacts, pp. 96-104, University of Edinburgh.

Breazeal, C., & Scassellati, B. (2002). Challenges in building robots that imitate people. In Imitation in Animals and Artifacts, pp. 363-390. MIT Press, Cambridge, MA.

Byrne, R. W., & Russon, A. E. (1998). Learning by imitation: a hierarchical approach. Behavioral and Brain Sciences, 21, 667-721.

Cañamero, D., Arcos, J. L., & de Mantaras, R. L. (1999). Imitating human performances to automatically generate expressive jazz ballads. In Proceedings of the AISB '99 Symposium on Imitation in Animals and Artifacts, pp. 115-120, University of Edinburgh.

Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 1023-1028, Seattle.

Chajewska, U., Koller, D., & Ormoneit, D. (2001). Learning an agent's utility function by observing behavior. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 35-42.

Conte, R. (2000). Intelligent social learning. In Proceedings of the AISB'00 Symposium on Starting from Society: the Applications of Social Analogies to Computational Systems, pp. 1-14, Birmingham.

Crites, R., & Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3), 235-262.

Dean, T., & Givan, R. (1997). Model minimization in Markov decision processes. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 106-111, Providence.

Dearden, R., Friedman, N., & Russell, S. (1998). Bayesian Q-learning. In The Fifteenth National Conference on Artificial Intelligence, pp. 761-768, Madison, Wisconsin.

Dearden, R., & Boutilier, C. (1997). Abstraction and approximate decision theoretic planning. Artificial Intelligence, 89, 219-283.

Dearden, R., Friedman, N., & Andre, D. (1999). Model-based Bayesian exploration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 150-159, Stockholm.

DeGroot, M. H. (1975).
Probability and statistics. A d d i s o n - W e s l e y  Pub. Co., Reading,  Mass.  D e m i r i s , J . , & H a y e s , G . ( 1 9 9 7 ) . D o r o b o t s ape?. I n  on Socially Intelligent Agents, p p .  Proceedings of the AAAI Symposium  28-31 Cambridge, Masschussets.  172  D e m i r i s , J . , & H a y e s , G . ( 1 9 9 9 ) . A c t i v e a n d p a s s i v e routes to i m i t a t i o n . I n  Proceedings of  the AISB '99 Symposium on Imitation in Animals and Artifacts, p p .  81-87 Edinburgh.  D e n z i n g e r , J . , & E n n i s , S. ( 2 0 0 2 ) . B e i n g the n e w g u y i n a n e x p e r i e n c e d team: e n h a n c i n g t r a i n i n g o n the j o b .  In  Proceedings of the First International Joint Conference on  Autonomous Agents and Multiagent Systems: Part 3,  p p . 2 4 6 - 1 2 5 3 B o l o g n a , Italy.  F i o r i t o , G . , & S c o t t o , P. ( 1 9 9 2 ) . O b s e r v a t i o n a l l e a r n i n g i n o c t o p u s v u l g a r i s .  Science, 256,  545^17.  Forbes, J., & A n d r e , D . (2000).  Practical reinforcement learning i n continuous domains.  T e c h . rep. U C B / C S D - 0 0 - 1 1 0 9 , C o m p u t e r S c i e n c e D i v i s i o n , U n i v e r s i t y o f C a l i f o r n i a Berkeley.  F r i e d m a n , N . , & S i n g e r , Y . ( 1 9 9 8 ) . E f f i c i e n t b a y e s i a n p a r a m e t e r e s t i m a t i o n i n large d i s c r e t e domains. In  tems,  Eleventh conference on Advances in Neural Information Processing Sys-  pp. 417^123 Denver,Colorado.  F r i e d r i c h , H . , M u n c h , S., D i l l m a n n , R . , B o c i o n e k , S., & S a s s i n , M . ( 1 9 9 6 ) . R o b o t p r o g r a m m i n g b y d e m o n s t r a t i o n ( R P D ) : S u p p o r t the i n d u c t i o n b y h u m a n i n t e r a c t i o n .  Machine  Learning,23, 1 6 3 - 1 8 9 . H a n s e n , E . A . ( 1 9 9 8 ) . S o l v i n g P O M D P s b y s e a r c h i n g i n p o l i c y space. 
I n  Proceedings of  the Fourteenth Conference on Uncertainty in Artificial Intelligence. H a r t m a n i s , J . , & Stearns, R . E . ( 1 9 6 6 ) .  Algebraic Structure Theory of Sequential Machines.  Prentice-Hall, E n g l e w o o d Cliffs.  H o w a r d , R . A . ( 1 9 9 6 ) . I n f o r m a t i o n v a l u e theory.  and Cybernetics, SSC-2(\),  22-26.  173  IEEE Transactions on Systems Science  H u , J . , & W e l l m a n , M . P. ( 1 9 9 8 ) . M u l t i a g e n t r e i n f o r c e m e n t l e a r n i n g : t h e o r e t i c a l f r a m e w o r k a n d an a l g o r i t h m . I n  Proceedings of the Fifthteenth International Conference on Ma-  chine Learning, p p . 2 4 2 - 2 5 0 . K a e l b l i n g , L . P. ( 1 9 9 3 ) .  Morgan Kaufmann, San Francisco, C A .  Learning in Embedded Systems.  M I T Press, C a m b r i d g e , M A .  K a e l b l i n g , L . P., L i t t m a n , M . L . , & M o o r e , A . W . ( 1 9 9 6 ) . R e i n f o r c e m e n t l e a r n i n g : A survey.  Journal of Artificial Intelligence Research, 4,  237-285.  K e a r n s , M . , & S i n g h , S. ( 1 9 9 8 a ) . F i n i t e s a m p l e c o n v e r g e n c e rates f o r Q - l e a r n i n g a n d i n d i rect a l g o r i t h m s . I n  Eleventh Conference on Neural Information Processing Systems,  pp. 9 9 6 - 1 0 0 2 Denver, C o l o r a d o .  K e a r n s , M . , & S i n g h , S. ( 1 9 9 8 b ) . N e a r - o p t i m a l r e i n f o r c e m e n t l e a r n i n g i n p o l y n o m i a l t i m e .  In Proceedings of the 15th International Conference on Machine Learning, pp. 2 6 0 268 M a d i s o n , W i s c o n s i n .  K u n i y o s h i , Y , Inaba, M . , & I n o u e , H . ( 1 9 9 4 ) . L e a r n i n g b y w a t c h i n g : E x t r a c t i n g r e u s a b l e task k n o w l e d g e f r o m v i s u a l o b s e r v a t i o n o f h u m a n p e r f o r m a n c e .  on Robotics and Automation, 10(6),  799-822.  L a u , T., D o m i n g o s , P., & W e l d , D . S. ( 2 0 0 0 ) . 
to p r o g r a m m i n g b y d e m o n s t r a t i o n . I n  IEEE Transactions  V e r s i o n space a l g e b r a a n d its a p p l i c a t i o n  Proceddings of the Seventeenth International  Conference on Machine Learning, p p . 5 2 7 - 5 3 4  Stanford, C A .  L e e , D . , & Yannakakis, M . (1992). O n l i n e m i m i n i z a t i o n o f transition systems. In  Proceed-  ings of the 24th Annual ACM Symposium on the Theory of Computing (STOC-92), pp. 264-274 Victoria, B C .  L i e b e r m a n , H . ( 1 9 9 3 ) . M o n d r i a n : A teachable g r a p h i c a l editor. I n A . , C . ( E d . ) ,  I Do: Programming by Demonstration, p p .  Watch What  3 4 0 - 3 5 8 . M I T Press, C a m b r i d g e , M A .  L i n , L . - J . ( 1 9 9 1 ) . S e l f - i m p r o v e m e n t b a s e d o n r e i n f o r c e m e n t l e a r n i n g , p l a n n i n g a n d teach-  ing. In Machine Learning: Proceedings of the Eighth International Workshop, pp. 323-27 Chicago, Illinois.  L i n , L . - J . ( 1 9 9 2 ) . S e l f - i m p r o v i n g r e a c t i v e agents b a s e d o n r e i n f o r c e m e n t l e a r n i n g , p l a n n i n g and teaching.  Machine Learning, 8,  293-321.  L i t t m a n , M . L . ( 1 9 9 4 ) . M a r k o v g a m e s as a f r a m e w o r k f o r m u l t i - a g e n t r e i n f o r c e m e n t l e a r n -  ing. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 1 5 7 - 1 6 3 N e w B r u n s w i c k , N J .  L i t t m a n , M . L . , D e a n , T. L . , & K a e l b l i n g , L . P. ( 1 9 9 5 ) . O n the c o m p l e x i t y o f s o l v i n g M a r k o v decision problems. In  Proceedings of the Eleventh Annual Conference on Uncertainty  in Artificial Intelligence (UAI-95), p p . 394-402 M o n t r e a l ,  Quebec, Canada.  L i t t m a n , M . L . , M a j e r c i k , S. M . , & P i t a s s i , T. ( 2 0 0 1 ) . S t o c h a s t i c b o o l e a n s a t i s f i a b i l i t y .  nal of Automated Reasoning, 27(3),  Jour-  251-296.  
L o v e j o y , W . S. ( 1 9 9 1 ) . A s u r v e y o f a l g o r i t h m i c m e t h o d s f o r p a r t i a l l y o b s e r v e d M a r k o v dec i s i o n processes.  Annals of Operations Research, 28,  47-66.  L u s e n a , C , G o l d s m i t h , J . , & M u n d h e n k , M . ( 2 0 0 1 ) . N o n a p p r o x i m a b i l i t y results f o r p a r t i a l l y o b s e r v a b l e M a r k o v d e c i s i o n processes.  Journal of Artificial Intelligence Research,  14, 8 3 - 1 0 3 .  M a c k a y , D . ( 1 9 9 9 ) . I n t r o d u c t i o n to M o n t e C a r l o m e t h o d s . I n J o r d a n , M . I. ( E d . ) ,  in Graphical Models, p p .  Learning  1 7 5 - 2 0 4 . M I T Press, C a m b r i d g e , M A .  M a t a r i c , M . J . ( 1 9 9 8 ) . U s i n g c o m m u n i c a t i o n to r e d u c e l o c a l i t y i n d i s t r i b u t e d m u l t i - a g e n t learning.  Journal Experimental and Theoretical Artificial Intelligence, 10(3),  369.  175  357-  Mataric, M . J. (2002).  Visuo-Motor Primitives as a Basisfor Learning by Imitation: Linking  Perception to Action and Biology to Robotics, p p .  3 9 2 - 4 2 2 . M I T Press, C a m b r i d g e ,  MA.  Mataric, M . J., W i l l i a m s o n , M . , D e m i r i s , J., & M o h a n , A . (1998). Behaviour-based p r i m i t i v e s f o r a r t i c u l a t e d c o n t r o l . I n Pfiefer, R . , B l u m b e r g , B . , M e y e r , J . , & W i l s o n , S. W . (Eds.),  Fifth International Conference on Simulation of Adaptive Behavior SAB'98,  pp. 1 6 5 - 1 7 0 Zurich.  M c C a l l u m , R . A . (1992).  Learning.  First Results with Utile Distinction Memory for Reinforcement  U n i v e r s i t y o f Rochester, C o m p u t e r Science Technical Report 446.  M e n d e n h a l l , W . (1987).  Introduction to Probability and Statistics, Seventh Edition.  PWS-  Kent Publishing Company, Boston, M A .  M e u l e a u , N . , & B o u r g i n e , P. ( 1 9 9 9 ) . 
E x p l o r a t i o n o f m u l t i - s t a t e e n v i r o n m e n t s : L o c a l m e a sures a n d b a c k - p r o p a g a t i o n o f uncertainty.  M i , J., & Sampson, A . R . (1993).  Machine Learning, 35(2),  A c o m p a r i s o n o f the B o n f e r r o n i a n d S c h e f f e b o u n d s .  Journal of Statistical Planning and Inference, 36, M i c h i e , D . (1993).  117-154.  101—105.  K n o w l e d g e , learning and machine intelligence. In Sterling, L . (Ed.),  Intelligent Systems, p p .  1-9. P l e n u m P r e s s , N e w Y o r k .  M i t c h e l l , T. M . , M a h a d e v a n , S., & S t e i n b e r g , L . ( 1 9 8 5 ) . L E A P : A l e a r n i n g a p p r e n t i c e f o r V L S I design. In  Proceedings of the Ninth International Joint Conference on Artificial  Intelligence, p p .  573-580 L o s Angeles, California.  M o o r e , A . W . , & A t k e s o n , C . G . (1993). Prioritized sweeping: Reinforcement learning w i t h less data a n d less real t i m e .  Machine Learning, 13(1),  176  103-30.  M y e r s o n , R . B . (1991).  Game Theory: Analysis of Conflict.  H a r v a r d U n i v e r s i t y Press, C a m -  bridge.  N e a l , R . M . ( 2 0 0 1 ) . T r a n s f e r r i n g p r i o r i n f o r m a t i o n b e t w e e n m o d e l s u s i n g i m a g i n a r y data. T e c h . rep. 0 1 0 8 , D e p a r t m e n t o f S t a t i s t i c s U n i v e r s i t y o f T o r o n t o .  Nehaniv, C , & Dautenhahn, K . (1998). M a p p i n g between dissimilar bodies: Affordances a n d the a l g e b r a i c f o u n d a t i o n s o f i m i t a t i o n . I n  Workshop on Learning Robots, p p .  Proceedings of the Seventh European  64-72 Edinburgh.  N g , A . Y , & J o r d a n , M . ( 2 0 0 0 ) . P E G A S U S : A p o l i c y search m e t h o d f o r l a r g e M D P s a n d POMDPs.  In  Proceedings of the Sixteenth Conference on Uncertainty in Artificial  Intelligence, p p .  4 0 6 ^ - 1 5 Stanford, California.  N g , A . Y , & R u s s e l l , S. ( 2 0 0 0 ) . 
A l g o r i t h m s f o r i n v e r s e r e i n f o r c e m e n t l e a r n i n g . I n  Proceed-  ings of the Seventeenth International Conference on Machine Learning, p p .  663-670.  M o r g a n Kaufmann, San Francisco, C A .  N o b l e , J . , & T o d d , P. M . ( 1 9 9 9 ) . Is it r e a l l y i m i t a t i o n ? a r e v i e w o f s i m p l e m e c h a n i s m s i n social information gathering. In  in Animals and Artifacts, p p .  Proceedings of the AISB'99 Symposium on Imitation  65-73 Edinburgh.  O l i p h a n t , M . ( 1 9 9 9 ) . C u l t u r a l t r a n s m i s s i o n o f c o m m u n i c a t i o n s s y s t e m s : C o m p a r i n g observational and reinforcement learning models. In  Proceedings of the AISB'99 Sympo-  sium on Imitation in Animals and Artifacts, p p .  47-54 Edinburgh.  P a p a d i m i t r i o u , C , & Tsitsiklis, J . (1987). T h e c o m p l e x i t y o f m a r k o v d e c i s i o n processes..  Mathematics of Operations Research, Pearl, J. (1988).  ference.  72(3), 4 4 1 ^ - 5 0 .  Probabilistic Reasoning in Intelligent Systems: Networks of Plausible InM o r g a n Kaufmann, San Mateo, C A .  Puterman, M . L . (1994).  gramming.  Markov Decision Processes: Discrete Stochastic Dynamic Pro-  J o h n W i l e y a n d S o n s , Inc., N e w Y o r k .  R u s s o n , A . , & G a l d i k a s , B . (1993). Imitation i n free-ranging rehabilitant orangutans (pongopygmaeus).  Journal of Comparative Psychology, 107(2),  147-161.  S a m m u t , C , H u r s t , S., K e d z i e r , D . , & M i c h i e , D . ( 1 9 9 2 ) . L e a r n i n g to fly. I n  of the Ninth International Conference on Machine Learning, p p .  Proceedings  385-393 Aberdeen,  UK. S c a s s e l l a t i , B . ( 1 9 9 9 ) . K n o w i n g w h a t to i m i t a t e a n d k n o w i n g w h e n y o u s u c c e e d . I n Pro-  ceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 1 0 5 113 U n i v e r s i t y o f E d i n b u r g h .  Scheffer, T . , G r e i n e r , R . 
, & D a r k e n , C . ( 1 9 9 7 ) . W h y e x p e r i m e n t a t i o n c a n b e better than perfect g u i d a n c e . I n  Learning, p p .  Proceedings of the Fourteenth International Conference on Machine  331-339 Nashville.  Seber, G . A . F . ( 1 9 8 4 ) .  Multivariate Observations.  S h a p l e y , L . S. ( 1 9 5 3 ) . S t o c h a s t i c g a m e s .  Wiley, N e w York.  Proceedings of the National Academy of Sciences,  39, 3 2 7 - 3 3 2 .  S i n g h , S., K e a r n s , M . , & M a n s o u r , Y . ( 2 0 0 0 ) . N a s h c o n v e r g e n c e o f g r a d i e n t d y n a m i c s i n general-sum games. In  Proceedings of the Sixteenth Conference on Uncertainity in  Artificial Intelligence, p p .  5 4 1 - 5 4 8 Stanford, C A .  S i n g h , S., & B e r t s e k a s , D . ( 1 9 9 7 ) . R e i n f o r c e m e n t l e a r n i n g f o r d y n a m i c c h a n n e l a l l o c a t i o n i n c e l l u l a r t e l e p h o n e systems. I n M o z e r , M . C , J o r d a n , M . I., & P e t s c h e , T . ( E d s . ) ,  Advances in Neural Information Processing Systems, Press.  V o l . 9, p p . 9 7 4 - 9 8 0 . T h e M I T  S m a l l w o o d , R . D . , & Sondik, E . J. (1973). M a r k o v p r o c e s s e s o v e r a finite h o r i z o n .  The optimal control o f partially observable  Operations Research, 21,  1071-1088.  Sue, D . , & B r a t k o , I. ( 1 9 9 7 ) . S k i l l r e c o n s t r u c t i n g as i n d u c t i o n o f L Q c o n t r o l l e r s w i t h s u b goals.  In  Proceedings of the Fifteenth International Joint Conference on Artificial  Intelligence, p p .  914-919 Nagoya.  S u t t o n , R . S. ( 1 9 8 8 ) . L e a r n i n g to p r e d i c t b y the m e t h o d o f t e m p o r a l d i f f e r e n c e s .  Machine  Learning, 3, 9 - 4 4 . S u t t o n , R . S., & B a r t o , A . G . ( 1 9 9 8 ) .  Reinforcement Learning: An Introduction.  M I T Press,  Cambridge, M A .  T a n , M . ( 1 9 9 3 ) . 
M u l t i - a g e n t r e i n f o r c e m e n t l e a r n i n g : I n d e p e n d e n t v s . c o o p e r a t i v e agents. I n  Proceedings of the Tenth International Conference on Machine Learning, p p .  330-37  Amherst, M A .  U r b a n c i c , T , & B r a t k o , I. ( 1 9 9 4 ) . R e c o n s t r u c t i o n h u m a n s k i l l w i t h m a c h i n e l e a r n i n g . I n  Eleventh European Conference on Artificial Intelligence,  pp. 4 9 8 - 5 0 2 Amsterdam,  The Netherlands.  U t g o f f , P. E . , & C l o u s e , J . A . ( 1 9 9 1 ) . T w o k i n d s o f t r a i n i n g i n f o r m a t i o n f o r e v a l u a t i o n f u n c t i o n l e a r n i n g . In  gence, p p .  Proceedings of the Ninth National Conference on Artificial Intelli-  596-600 Anaheim, C A .  van Lent, M . , & L a i r d , J. (1999). L e a r n i n g hierarchical performance k n o w l e d g e b y observation. In  Proceedings of the Sixteenth International Conference on Machine Learning,  pp. 2 2 9 - 2 3 8 B l e d , S l o v e n i a .  179  V i s a l b e r g h i , E . , & F r a g a z y , D . ( 1 9 9 0 ) . D o m o n k e y s ape?. I n P a r k e r , S., & G i b s o n , K . ( E d s . ) ,  Language and Intelligence in Monkeys and Apes, p p .  247-273. Cambridge University  Press, C a m b r i d g e .  W a t k i n s , C . J . C . H . , & D a y a n , P. ( 1 9 9 2 ) . Q - l e a r n i n g .  Machine Learning, 8,  279-292.  W h i t e h e a d , S. D . ( 1 9 9 1 ) . C o m p l e x i t y a n a l y s i s o f c o o p e r a t i v e m e c h a n i s m s i n r e i n f o r c e m e n t l e a r n i n g . In  Proceedings of the Ninth National Conference on Artificial Intelligence,  pp. 6 0 7 - 6 1 3 A n a h e i m .  180  
