Accelerating Reinforcement Learning through Imitation

by

Robert Roy Price

M.Sc., University of Saskatchewan, 1995
B.C.S., Carleton University, 1993

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
December 2002

© Robert Roy Price, 2002

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

The University of British Columbia
Vancouver, Canada

Abstract

Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments an agent's ability to learn useful behaviors by making intelligent use of the knowledge implicit in behaviors demonstrated by cooperative teachers or other more experienced agents. Using reinforcement learning theory, we construct a new, formal framework for imitation that permits agents to combine prior knowledge, learned knowledge and knowledge extracted from observations of other agents. This framework, which we call implicit imitation, uses observations of other agents to provide an observer agent with information about its action capabilities in unexperienced situations. Efficient algorithms are derived from this framework for agents with both similar and dissimilar action capabilities, and a series of experiments demonstrates that implicit imitation can dramatically accelerate reinforcement learning in certain cases. Further experiments demonstrate that the framework can handle transfer between agents with different reward structures, learning from multiple mentors, and selecting relevant portions of examples. A Bayesian version of implicit imitation is then derived and several experiments are used to illustrate how it smoothly integrates model extraction with prior knowledge and optimal exploration. Initial work is also presented to illustrate how extensions to implicit imitation can be derived for continuous domains, partially observable environments and multi-agent Markov games. The derivation of algorithms and experimental results is supplemented with an analysis of relevant domain features for imitating agents, and some performance metrics for predicting imitation gains are suggested.
Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction

2 Background Material
2.1 Markov Decision Processes
2.2 Model-based Reinforcement Learning
2.3 Bayesian Reinforcement Learning
2.4 Bayesian Exploration
2.5 Probabilistic Reasoning and Bayesian Networks
2.6 Model-Free Learning
2.7 Partial Observability
2.8 Multi-agent Learning
2.9 Technical Definitions

3 Natural and Computational Imitation
3.1 Conceptual Issues
3.2 Practical Implementations
3.3 Other Issues
3.4 Analysis of Imitation in Terms of Reinforcement Learning
3.5 A New Model: Implicit Imitation
3.6 Non-interacting Games

4 Implicit Imitation
4.1 Implicit Imitation in Homogeneous Settings
4.1.1 Homogeneous Actions
4.1.2 The Implicit Imitation Algorithm
4.1.3 Empirical Demonstrations
4.2 Implicit Imitation in Heterogeneous Settings
4.2.1 Feasibility Testing
4.2.2 k-step Similarity and Repair
4.2.3 Empirical Demonstrations

5 Bayesian Imitation
5.1 Augmented Model Inference
5.2 Empirical Demonstrations
5.3 Summary

6 Applicability and Performance of Imitation
6.1 Assumptions
6.2 Predicting Performance
6.3 Regional Texture
6.4 The Fracture Metric
6.5 Suboptimality and Bias

7 Extensions to New Problem Classes
7.1 Unknown Reward Functions
7.2 Benefiting from Penalties
7.3 Continuous and Model-Free Learning
7.4 Interaction of Agents
7.5 Partially Observable Domains
7.6 Action Planning in Partially Observable Environments
7.7 Non-stationarity

8 Discussion and Conclusions
8.1 Applications
8.2 Concluding Remarks

Bibliography

List of Tables

4.1 An algorithm to compute the Augmented Bellman Backup
4.2 An algorithm for action feasibility testing
4.3 Elaborated augmented backup test

List of Figures

2.1 Explicit distribution over exhaustive enumeration of cloud-rain-umbrella states
2.2 Compact distributions over independent factors for the cloud-rain-umbrella domain
4.1 Vo and Vm represent value estimates derived from observer and mentor experience respectively. The lower bounds Vo⁻ and Vm⁻ incorporate an "uncertainty penalty" associated with these estimates
4.2 A simple grid world with start state S and goal state X showing representative trajectories for a mentor (solid) and an observer (dashed)
4.3 On a simple grid world, an observer agent 'Obs' discovers the goal and optimizes its policy faster than a control agent 'Ctrl' without observation.
The 'Delta' curve plots the gain of the observer over the control
4.4 Delta curves compare gain due to observation for the 'Basic' scenario described above, a 'Scale' scenario with a larger state space and a 'Stoch' scenario with increased noise
4.5 Prior models assume diagonal connectivity and hence that the +5 rewards are accessible to the observer. In fact, the observer has only 'NEWS' actions. Unupdated priors about the mentor's capabilities may lead to over-valuing diagonally adjacent states
4.6 Misleading priors about mentor abilities may eliminate imitation gains and even result in some degradation of the observer's performance relative to the control agent, but confidence testing prevents the observer from permanently fixating on infeasible goals
4.7 A complex maze with 343 states has considerably less local connectivity than an equivalently-sized grid world. A single 133-step path is the shortest solution
4.8 On the difficult maze problem, the observer 'Obs' converges long before the unaided control 'Ctrl'
4.9 The shortcut (4 → 5) in the maze is too perilous for unknowledgeable (or unaided) agents to attempt
4.10 The observer 'Obs' tends to find the shortcut and converges to the optimal policy, while the unaided control 'Ctrl' typically settles on the suboptimal solution
4.11 A scenario with multiple mentors. Each mentor covers only part of the space of interest to the observer
4.12 The gain from simultaneous imitation of multiple mentors is similar to that seen in the single mentor case
4.13 An alternative path can bridge value backups around infeasible paths
4.14 The agent without feasibility testing 'NoFeas' can become trapped by over-valued states. The agent with feasibility testing 'Feas' shows typical imitation gains
4.15 Obstacles block the mentor's intended policy, though stochastic actions cause the mentor to deviate somewhat
4.16 Imitation gain is reduced by observer obstacles within mentor trajectories but not eliminated
4.17 Generalizing an infeasible mentor trajectory
4.18 A largely infeasible trajectory is hard to learn from and minimizes gains from imitation
4.19 The River scenario in which squares with grey backgrounds result in a penalty for the agent
4.20 The agent with repair finds the optimal solution on the other side of the river
5.1 Dependencies among transition model T and evidence sources Ho, Hm
5.2 The Chain-world scenario in which the myopically attractive action b is suboptimal in the long run
5.3 Chain-world results (averaged over 10 runs)
5.4 Loop domain
5.5 Loop domain results (averaged over 10 runs)
5.6 Flag-world
5.7 Flag-world results (averaged over 50 runs)
5.8 Flag-world moved goal (average of 50 runs)
5.9 Schematic illustration of a transition in the tutoring domain
5.10 Tutoring domain (average of 50 runs)
5.11 No South scenario
5.12 No South results (averaged over 50 runs)
6.1 A system of state space region classifications for predicting the efficacy of imitation
6.2 Four scenarios with equal solution lengths but different fracture coefficients
6.3 Percentage of runs (of ten) converging to optimal policy given fracture and exploration rate
7.1 Overview of influences on observer's observations
7.2 Detailed model of influences on observer's observations

Acknowledgements

First of all, I would like to thank my supervisor, Craig Boutilier, for his unwavering patience, optimism and professionalism over too many years. I think I owe him about two dozen beers—or would that be coffees? Through him I've come to appreciate sophistication in grammar and the value of style, learned something of the fine art of finding the gems among the stones, and gained exposure to a vast and powerful set of tools for reasoning about uncertainty.

Secondly, I would like to thank the many reviewers of my work, both anonymous and personal, who, even if they did not always further my publishing ambitions, certainly furthered my understanding of both the breadth of what might be called the field of learning by observation and the technical craft of scientific investigation and communication.

Thirdly, I would like to thank my office mate Pascal Poupart for many pleasant discussions on all sorts of problems, formal and informal, as well as for many apples. I am also grateful for all of his questions, which helped me keep in perspective the amount I have learned versus the amount I still have yet to know.

And finally, I would like to thank my thesis committee, university examiners and external examiner for their thorough reading of the thesis and their thought-provoking questions: thank you to Alan Mackworth, David Poole, Martin Puterman, David Lowe, Bertrand Clarke and Maja Mataric.

ROBERT ROY PRICE

The University of British Columbia
March 2003

To Lottie, with love.

Chapter 1

Introduction

Computational reinforcement learning is the study of techniques for automatically adapting an agent's behavior to maximize some objective function typically set by the agent's designer.
Since the designer need not specify how the agent will achieve the objective, reinforcement learning is a popular choice for the design of agents when suitable instructions for the agent cannot be easily encoded, or the agent's environment is difficult to adequately characterize. Since the agent has only the objective function to guide its search for optimal behavior, reinforcement learning typically requires the agent to engage in extensive experimentation with its environment and extended search through possible action policies.

The field of multi-agent reinforcement learning extends reinforcement learning to the situation in which multiple agents independently attempt to learn behaviors in a shared environment. In general, the multi-agent dimension adds additional complexity. For instance, it is well known that a naive application of reinforcement learning techniques to environments in which agents can change their behavior in response to other agents' actions can lead to cycling in the search for optimizing actions and a consequent failure to converge to optimal behavior. Cyclic divergence has been addressed using techniques such as game-theoretic equilibria (Littman, 1994). The presence of multiple agents, however, can also simplify the learning problem: agents may act cooperatively by pooling information (Tan, 1993) or by distributing search (Mataric, 1998).

Another way in which individual agent performance can be improved is by having a novice agent learn reasonable behavior from an expert mentor. This type of learning can be brought about through explicit teaching or demonstration (Lin, 1992; Whitehead, 1991), by sharing of privileged information (Mataric, 1998), or through what might more traditionally be thought of as imitation (Atkeson & Schaal, 1997; Bakker & Kuniyoshi, 1996). In imitation, an agent observes another agent performing an activity in context and then attempts to replicate the observed agent's behavior. Typically, the imitator must ground its observations of other agents' behaviors in terms of its own capabilities and resolve any ambiguities in observations arising from partial observability and noise. A common thread in all of this work is the use of a mentor to guide the exploration of the observer. Guidance is typically achieved through some form of explicit communication between mentor and observer. Imitation is related to learning apprentice systems (Mitchell, Mahadevan, & Steinberg, 1985), but imitation research generally assumes the acquired knowledge will be applied by the agent itself, whereas learning apprentices typically work in conjunction with a human expert.

In existing approaches to imitation, we see a number of underlying assumptions. In teaching models, it is assumed that there is a cooperative mentor willing to provide appropriately encoded example inputs for agents.
It is also frequently assumed that the mentor's action choices[1] can be directly observed. In programming by demonstration, it is often assumed that the observer has identical objectives to the expert mentor. Many explicit imitation models assume that behaviors can be represented as indivisible and unalterable units that can be evaluated and then replayed.

[1] We understand an action choice to be the agent's intended control signal, as opposed to changes in the agent's state resulting from the direct effects of an action, such as the agent's position or the orientation of the agent's effectors.

We suggest that there are practical problems for which it is useful to relax some of these assumptions. For instance, a mentor may be unwilling or unable to alter its behavior to teach the observer, or even to communicate information to it: common communication protocols may be unavailable to agents designed by different developers (e.g., Internet agents); agents may find themselves in a competitive situation in which there is a disincentive to share information or skills; or there may simply be no incentive for one agent to provide information to another.[2]

[2] For reasons of consistency, we use the term "mentor" to describe any agent from which an observer can learn, even if the mentor is an unwilling or unwitting participant.

If the explicit communication assumption is relaxed, we immediately encounter difficulties with the assumption that the mentor's action choices can be observed: due to perceptual aliasing it is frequently impossible to know an agent's intended action choice. We assume that an agent's action choice can have non-deterministic outcomes. The outcomes, however, may take the form of a change in the state of the environment, may enhance the persistence of some state of the environment, or may have no discernible effect on the resulting state of the environment. It is quite possible that two or more of the agent's action choices could have the same probabilistic outcomes in a given situation and therefore be indistinguishable. In general, in the absence of explicit communication, an observer cannot know the mentor's intended action choice with certainty (although the observer may be able to observe changes in the mentor's state, such as a change in the position of the mentor's effectors or its location).

We may also want to relax the assumption that the mentor and observer have identical objectives. Even in cases where mentor and observer objectives differ, the two agents may still need to perform similar subtasks. In these circumstances, an observer might still want to learn behaviors from a mentor, but apply them to accomplish a different goal. In order to be able to identify and reuse these subtasks, the observer cannot treat mentor behaviors as indivisible trajectories. The observer must be able to represent and evaluate behaviors at a fine-grained level where individual components of the behavior can be evaluated independently.

We have developed a framework for the use of mentor guidance which we call implicit imitation.
In contrast to existing models, it does not assume the existence of a cooperative mentor, that agents can explicitly communicate, that the action choices of other agents are directly observable, or that the agents share identical objectives. In addition, our framework allows the observer to integrate partial behaviors from multiple mentors, and allows the observing agent to refine these behaviors at a fine-grained level using its own experience. Implicit imitation's ability to relax these assumptions derives from its different view of the imitation process. In implicit imitation, the goal of the agent is the acquisition of information about its potential action capabilities in unexperienced situations rather than the duplication of the behaviors of other agents.

We develop the implicit imitation framework in several stages. We start with a simplified problem in which non-interacting agents have homogeneous actions: that is, the learning agent and the mentor have identical action capabilities. Initially, however, the observer may be unaware of what these capabilities are in all possible situations. In the homogeneous setting, the observer can use observations of changes in the mentor's external state to estimate a model of its own potential action capabilities in unexplored situations. Through the adaptation of standard reinforcement learning techniques, these potential capability models can be used to bias the observer's exploration towards the most promising actions. The potential capability models can also be used to allocate the computational effort associated with reinforcement learning techniques more efficiently. Through these two mechanisms, implicit imitation may reduce the cost and speed the convergence of reinforcement learning.

When agents have heterogeneous action capabilities, the observer may not be able to perform every action the mentor can in every situation. The observer may therefore be misled by observations of the mentor about its potential capabilities and may overestimate the value of situations, with potentially disastrous consequences. We formalize this overestimation problem and develop mechanisms for extending implicit imitation to this context. We develop a new procedure called action feasibility testing and show how it can identify infeasible mentor actions and thereby minimize the effects of overestimation. We go on to show that a sequence of locally infeasible actions may be part of a larger family of action sequences that implement a desired behavior. We derive a mechanism called k-step repair and show how it can generalize mentor behaviors that would require locally infeasible actions to an alternative behavior that is feasible for the observer.

Implicit imitation is based on the idea of transferring detailed information about action capabilities from mentor to observer.
At the expense of further complexity, we demonstrate that Bayesian methods enable the transfer of richer action capability models. The Bayesian representation allows designers to incorporate prior information about both the observer and mentor action capabilities; permits observers to avoid suboptimal actions in situations on the basis of mentor observations; and allows the agent to employ optimal Bayesian exploration methods in order to rationally trade off the cost of exploring unknown actions against the benefits of exploiting known actions. Bayesian exploration also eliminates the tedious exploration-rate parameters often found in reinforcement learning algorithms. Using standard approximations, we derive a computationally tractable algorithm and demonstrate it on several examples.

The Bayesian model opens up rich inference capabilities. We show how the Bayesian formulation enables us to derive a mechanism for transfer between mentor and observer when the environment is only partially observable. We point out that partial observability has specific implications for imitators, such as the need to evaluate the benefit of taking actions that increase the visibility of the mentor.

Our earlier versions of implicit imitation are formulated in terms of discrete environment states and actions using an explicit model of the agent's environment. We explain the model-based and model-free paradigms within reinforcement learning, show that implicit imitation can be reformulated in a model-free form, and propose specific mechanisms for implementing implicit imitation functions. As with unguided reinforcement learning, moving to a model-free paradigm permits implicit imitation to be extended to environments represented in terms of continuous state variables.

Our final extension of the implicit imitation framework considers how implicit imitation can be adapted to the case where there are multiple interacting agents that can affect each other's environment. We argue that knowledge about how to be successful in an interacting multi-agent environment can also be acquired by imitation and that imitation is therefore an important strategy for finding good behaviors in interacting multi-agent contexts (e.g., game-theoretic scenarios).

Our formulation of implicit imitation is based upon the framework provided by Markov decision processes, reinforcement learning and Bayesian methods. Chapter 2 reviews these topics and establishes our notation. Chapter 3 examines various approaches to imitation, attempts to understand existing approaches in terms of reinforcement learning, and exposes common assumptions in this work. The examination of assumptions leads to a new model of imitation based on the transfer of transition models. Chapter 4 develops two specific instantiations of our new framework for homogeneous and heterogeneous settings.
Chapter 5 introduces the Bayesian approach to implicit imitation and explains the benefits of Bayesian exploration for imitation. Chapter 6 examines the applicability of implicit imitation in more depth and develops a metric for predicting imitation performance based on domain characteristics. Chapter 7 presents initial work on the extension of implicit imitation to environments with partial observability, multiple interacting agents and continuous state variables. This chapter also points out that the extension of implicit imitation to new problem classes introduces additional problems specific to imitators and briefly discusses them. Chapter 8 discusses some general challenges for imitation research and how one might tackle these problems, before concluding with a brief summary of the contribution implicit imitation makes to the fields of imitation and reinforcement learning.

Chapter 2

Background Material

Our model of imitation is based on the formal framework of Markov Decision Processes, reinforcement learning, graphical models and multi-agent systems. This chapter briefly reviews some of the most relevant material within these fields. Throughout the chapter, the reader is referred to a number of background references for more detail.

2.1 Markov Decision Processes

Markov decision processes (MDPs) have proven very useful in modeling stochastic sequential decision problems, and have been widely used in decision-theoretic planning to model domains in which an agent's actions have uncertain effects, an agent's knowledge of the environment may be uncertain, and multiple objectives must be traded off against one another (Bertsekas, 1987; Boutilier, Dean, & Hanks, 1999; Puterman, 1994). This section describes the basic MDP model and reviews one classical solution procedure.

An MDP can be viewed as a stochastic automaton in which an agent's actions influence the transitions between states, and rewards are obtained depending on the states visited by an agent. Initially, we will concern ourselves with fully observable MDPs. In a fully observable MDP, it is assumed that the agent can determine its current state with certainty (e.g., there are no sensor errors or limits to the range of sensors).

Formally, a fully observable MDP can be defined as a tuple (S, A, T, R), where S is a finite set of states or possible worlds, A is a finite set of actions, T is a state transition function, and R is a reward function.[1] The agent can influence future states of the system to some extent by choosing actions a from the set A. Actions are stochastic in that the future states cannot generally be predicted with certainty. The transition function T : S × A × S → ℝ describes the effects of each action at each state. For compactness, we write T^{s,a}(s') to denote the probability of ending up in future state s' ∈ S when action a is performed at the current state s.
We require that 0 ≤ T^{s,a}(s') ≤ 1 for all s, a, s', and that for all s and a, Σ_{s'∈S} T^{s,a}(s') = 1.

[1] MDPs can be generalized in a natural fashion to actions with costs and to domains with continuous state and action spaces.

The components S, A and T determine the dynamics of the system being controlled. Since the system is fully observable—that is, the agent knows the true state s at each time t (once that stage is reached)—its choice of action can be based solely on the current state. Thus uncertainty lies only in the prediction of an action's effects, not in determining its actual effect after its execution.

A (deterministic, stationary, Markovian) policy π : S → A describes a course of action to be adopted by an agent controlling the system. An agent adopting such a policy performs action π(s) whenever it finds itself in state s. Policies of this form are Markovian since the action choice at any state does not depend on the system history, and are stationary since the action choice does not depend on the stage of the decision problem. For the problems we consider, optimal stationary Markovian policies always exist.

We assume a bounded, real-valued reward function R : S → ℝ; R(s) is the instantaneous reward an agent receives for occupying state s. A number of optimality criteria can be adopted to measure the value of a policy π, all measuring in some way the reward accumulated by an agent as it traverses the state space through the execution of π. In this work, we focus on discounted infinite-horizon problems: the current value of a reward received t stages in the future is discounted by some factor γ^t (0 ≤ γ < 1). This allows simpler computational methods to be used, as discounted total reward will be finite. In many situations, discounting can be justified on practical grounds (e.g., economic factors and random termination) as well.

The value V_π : S → ℝ of a policy π at state s is simply the expected sum of discounted future rewards obtained by executing π beginning at s. A policy π* is optimal if, for all s ∈ S and all policies π, we have V_{π*}(s) ≥ V_π(s). We are guaranteed that such optimal (stationary) policies exist in our setting (Puterman, 1994, see Ch. 5). The (optimal) value of a state V*(s) is its value V_{π*}(s) under any optimal policy π*.

By solving an MDP, we are referring to the problem of constructing an optimal policy. Value iteration (Bellman, 1957) is a simple iterative approximation algorithm for optimal policy construction. Given some arbitrary estimate V^0 of the true value function V*, we iteratively improve this estimate as follows:

V^n(s) = R(s) + \max_{a \in A} \Big\{ \gamma \sum_{s' \in S} T^{s,a}(s')\, V^{n-1}(s') \Big\}    (2.1)

The computation of V^n(s) given V^{n-1} is known as a Bellman backup. The sequence of value functions V^n produced by value iteration converges linearly to V*. Each iteration of value iteration requires O(|S|²|A|) computation time, and the number of iterations is polynomial in |S|.
For some finite n, the actions a that maximize the right-hand side of Equation 2.1 form an optimal policy, and V^n approximates its value. Various termination criteria can be applied. For instance, the value function V^{i+1} is within ε/2 of the optimal function V* at any state whenever Equation 2.2 is satisfied (Puterman, 1994, see Sec. 6.3.2).

\| V^{i+1} - V^{i} \| < \frac{\epsilon (1 - \gamma)}{2\gamma}    (2.2)

The double bar notation ||X|| above should be taken as the supremum norm, defined as max{|x| : x ∈ X}, that is, the maximum component of the vector.

Many additional methods exist for solving MDPs, including policy iteration, modified policy iteration methods and linear programming formulations (Puterman, 1994), as well as asynchronous methods and methods using approximate representations of the value function such as neuro-dynamic programming (Bertsekas & Tsitsiklis, 1996).

The computational complexity (in terms of arithmetic operations) of solving an MDP under the expected discounted cumulative reward criterion is known to be polynomial in the number of states (Littman, Dean, & Kaelbling, 1995). This bound is hardly reassuring, however, as the number of states in a problem is generally exponential in the number of variables or attributes in the problem. The polynomial time result is obtained through the use of a standard transformation from MDPs to linear programs. Although worst-case polynomial algorithms exist for linear programming, the polynomial is high-order. Often, a variation of the Simplex method is used, which is often much faster than the guaranteed polynomial algorithm but exponential in the worst case.

The complexity of alternative techniques for solving MDPs, such as policy iteration and value iteration with a general discount factor, remains an open question. In the case of policy iteration, it is known that any given policy can be evaluated using matrix inversion in O(n³), but presently the tightest bound admits a possibly exponential number of evaluations. Further, Papadimitriou and Tsitsiklis (1987) show that MDPs are P-complete, which means that an efficient parallel solution of MDPs would imply an efficient parallel solution to all problems in the class P, which is considered unlikely by researchers in the field.

From a numerical point of view, the value function updated by value iteration or policy iteration is known to converge linearly to the optimal value (Puterman, 1994, see Theorem 6.3.3). Given specific conditions, the value function updated under policy iteration may converge quadratically (Puterman, 1994, see Theorem 6.4.8).
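To make the preceding definitions concrete, the sketch below implements value iteration for a small MDP stored as Python dictionaries. It is a minimal illustration of Equations 2.1 and 2.2, not code from the thesis; the toy MDP, the function names and the tolerance are our own assumptions.

```python
# A toy fully observable MDP (S, A, T, R) stored as plain dictionaries.
# T[s][a][s2] is the probability T^{s,a}(s2) of reaching s2 from s under a.
S = ["s0", "s1", "s2"]
A = ["left", "right"]
T = {
    "s0": {"left": {"s0": 1.0}, "right": {"s1": 0.8, "s0": 0.2}},
    "s1": {"left": {"s0": 1.0}, "right": {"s2": 0.8, "s1": 0.2}},
    "s2": {"left": {"s1": 1.0}, "right": {"s2": 1.0}},
}
R = {"s0": 0.0, "s1": 0.0, "s2": 1.0}
gamma = 0.9

def backup(V, s, a):
    """One-step lookahead value of action a in state s (the term maximized in Equation 2.1)."""
    return R[s] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())

def value_iteration(epsilon=1e-4):
    """Repeat Bellman backups until the supremum-norm change satisfies Equation 2.2."""
    V = {s: 0.0 for s in S}                              # arbitrary initial estimate V^0
    while True:
        V_new = {s: max(backup(V, s, a) for a in A) for s in S}
        delta = max(abs(V_new[s] - V[s]) for s in S)     # ||V^{i+1} - V^i||
        V = V_new
        if delta < epsilon * (1 - gamma) / (2 * gamma):
            break
    # The maximizing actions form a (near-)optimal stationary policy.
    policy = {s: max(A, key=lambda a: backup(V, s, a)) for s in S}
    return V, policy

if __name__ == "__main__":
    values, policy = value_iteration()
    print(values)
    print(policy)
```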
A concept that will be useful later is that of a Q-function. Given an arbitrary value function V, we define Q^V_a(s) as

Q^V_a(s) = R(s) + \gamma \sum_{s' \in S} T^{s,a}(s')\, V(s')    (2.3)

Intuitively, Q^V_a(s) denotes the value of performing action a at state s and then acting in a manner that has value V (Watkins & Dayan, 1992). In particular, we define Q* to be the Q-function defined with respect to V*, and Q^n_a to be the Q-function defined with respect to V^{n-1}. In this manner, we can rewrite Equation 2.1 as:

V^n(s) = \max_{a \in A} \{ Q^n_a(s) \}    (2.4)

We define an ergodic policy in an MDP as a policy which induces a Markov chain in which every state is reachable from any other state in a finite number of steps with non-zero probability.

2.2 Model-based Reinforcement Learning

One difficulty with the use of MDPs is that the construction of an optimal policy requires that the agent know the exact transition probabilities T and the reward model R. In the specification of a decision problem, these requirements, especially the detailed specification of the domain's dynamics, can impose a significant burden on the agent designer. Reinforcement learning can be viewed as solving an MDP in which the full details of the model, in particular T and R, are not known to the agent. Instead the agent learns how to act optimally through experience with its environment. This section provides a brief overview of reinforcement learning in which model-based approaches are emphasized. Additional material can be found in the texts of Sutton and Barto (1998) and Bertsekas and Tsitsiklis (1996), and the survey of Kaelbling, Littman and Moore (1996).

In the general model, we assume that an agent is controlling an MDP (S, A, T, R) and initially knows its state and action spaces, S and A, but not the transition model T or the reward function R. The agent acts in its environment, and at each stage of the process makes a "transition" (s, a, r, t); that is, it takes action a at state s, receives reward r and moves to state t. Based on repeated experiences of this type it can determine an optimal policy in one of two ways: (a) in model-based reinforcement learning, these experiences can be used to learn the true nature of T and R, and the MDP can be solved using standard methods (e.g., value iteration); or (b) in model-free reinforcement learning, these experiences can be used to directly update an estimate of the optimal value function or Q-function.

Probably the simplest model-based reinforcement learning scheme is the certainty equivalence approach. Intuitively, a learning agent is assumed to have some current estimated transition model T̂ of its environment, consisting of estimated probabilities T̂^{s,a}(t), and an estimated reward model R̂(s). With each experience (s, a, r, t) the agent updates its estimated models, solves the estimated MDP M̂ to obtain a policy π̂ that would be optimal if its estimated models were correct, and acts according to that policy.

To make the certainty equivalence approach precise, a specific form of estimated model and update procedure must be adopted.
A common approach is to use the empirical distribution of observed state transitions and rewards as the estimated model. For simplicity, we assume the model will be applied to a discrete state and action space. For instance, if action a has been attempted C(s, a) times at state s, and on C(s, a, t) of those occasions state t has been reached, then the estimate is T̂^{s,a}(t) = C(s, a, t)/C(s, a). If C(s, a) = 0, some prior estimate is used (e.g., one might assume all state transitions are equiprobable).

One difficulty with the certainty equivalence approach is the computational burden of re-solving an MDP M̂ with each update of the models T̂ and R̂ (i.e., with each experience). As pointed out above, using common techniques such as value iteration or policy iteration could require computation polynomial or even exponential in the number of states, and certainly exponential in the number of variables in a problem. Generally, certainty equivalence is impractical. One could circumvent this to some extent by batching experiences and updating (and re-solving) the model only periodically.

Alternatively, one could use computational effort judiciously and apply Bellman backups only at those states whose values (or Q-values) are likely to change the most given a change in the model. Moore and Atkeson's (1993) prioritized sweeping algorithm does just this. When T̂ is updated by changing a component T̂^{s,a}(t), a Bellman backup is applied at s to update its estimated value V̂(s), as well as the Q-value Q̂(s, a). Suppose the magnitude of the change in V̂(s) is given by ΔV(s). For any predecessor w, the Q-values Q̂(w, a')—and hence the values V̂(w)—can change if T̂^{w,a'}(s) > 0. The magnitude of the change is bounded by T̂^{w,a'}(s)ΔV(s). All such predecessors w of s are placed on a priority queue with T̂^{w,a'}(s)ΔV(s) serving as the priority. A fixed number of Bellman backups are applied to states in the order in which they appear on the queue. With each backup, any change in value can cause new predecessors to be inserted into the queue. In this way, computational effort is focussed on those states where a Bellman backup has the greatest impact due to the model change. Furthermore, the backups are applied only to a subset of states, and are generally only applied a fixed number of times (by way of contrast, in the certainty equivalence approach, backups are applied until convergence). Thus prioritized sweeping can be viewed as a specific form of asynchronous value iteration with optimized computational properties (Moore & Atkeson, 1993).
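The following sketch illustrates the bookkeeping just described: empirical counts define the estimated model, a backup at the changed state yields ΔV(s), and predecessors are queued with priority T̂^{w,a'}(s)·ΔV(s). It is a simplified illustration of the idea rather than Moore and Atkeson's implementation; the data structures, function names and the backup budget are our own assumptions.

```python
import heapq
from collections import defaultdict

gamma = 0.9
R_hat = defaultdict(float)     # estimated reward model R_hat(s)
V_hat = defaultdict(float)     # estimated value function V_hat(s)
C_sa = defaultdict(int)        # C(s, a): number of times a was tried in s
C_sat = defaultdict(int)       # C(s, a, t): number of times t followed (s, a)
preds = defaultdict(set)       # predecessors of a state: set of (w, a') pairs

def t_hat(s, a, t):
    """Empirical estimate T_hat^{s,a}(t) = C(s, a, t) / C(s, a)."""
    return C_sat[(s, a, t)] / C_sa[(s, a)] if C_sa[(s, a)] else 0.0

def backup(s, actions):
    """Apply one Bellman backup at s with the estimated model; return |delta V(s)|."""
    old = V_hat[s]
    q_values = []
    for a in actions:
        succ = [t for (s2, a2, t) in list(C_sat) if s2 == s and a2 == a]
        q_values.append(sum(t_hat(s, a, t) * V_hat[t] for t in succ))  # untried a contributes 0
    V_hat[s] = R_hat[s] + gamma * max(q_values)
    return abs(V_hat[s] - old)

def prioritized_sweeping(s, a, r, t, actions, budget=5):
    """Record the experience (s, a, r, t), then spend a fixed budget of backups,
    always popping the state whose value is predicted to change the most."""
    C_sa[(s, a)] += 1
    C_sat[(s, a, t)] += 1
    R_hat[s] = r
    preds[t].add((s, a))

    queue = [(-backup(s, actions), s)]          # max-heap via negated priorities
    for _ in range(budget):
        if not queue:
            break
        _, u = heapq.heappop(queue)
        delta = backup(u, actions)
        for (w, a2) in preds[u]:                # predecessors w with T_hat^{w,a'}(u) > 0
            priority = t_hat(w, a2, u) * delta
            if priority > 0:
                heapq.heappush(queue, (-priority, w))
```

In a full agent loop, prioritized_sweeping would be called once per observed transition, with actions set to the MDP's action set.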
Both certainty equivalence and prioritized sweeping will eventually converge to an optimal value function and policy as long as acting optimally according to the sequence of estimated models leads the agent to explore the entire state-action space sufficiently. Unfortunately, the use of the value-maximizing action at each state is unlikely to result in thorough exploration in practice. For this reason, explicit exploration policies are invariably used to ensure that each action is tried at each state sufficiently often. By acting randomly (assuming an ergodic MDP), an agent is assured of sampling each action at each state infinitely often in the limit. Unfortunately, the actions of such an agent will fail to exploit (in fact, will be completely uninfluenced by) its knowledge of the optimal policy. This exploration-exploitation tradeoff refers to the tension between trying new actions in order to find out more about the environment and executing actions believed to be optimal on the basis of the current estimated model.

The most common method for exploration is the ε-greedy method, in which the agent chooses a random action ε of the time, where 0 ≤ ε ≤ 1. Typically, ε is decayed over time to increase the agent's exploitation of its knowledge. Given a sufficiently slow decay schedule, the agent will converge to the optimal policy. In the Boltzmann approach, each action is selected with a probability proportional to an exponential function of its value:

\Pr(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a' \in A} e^{Q(s,a')/\tau}}    (2.5)

The proportionality can be adjusted nonlinearly with the temperature parameter τ. As τ → 0, the probability of selecting the action with the highest value tends to 1. Typically, τ is started high so that actions are randomly explored during the early stages of learning. As the agent gains knowledge about the effects of its actions and the value of these effects, the parameter τ is decayed so that the agent spends more time exploiting actions known to be valuable and less time randomly exploring actions.
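As a small illustration of the two schemes just described, the sketch below implements ε-greedy and Boltzmann action selection over a table of Q-values; the Q-table, temperature values and function names are illustrative assumptions rather than anything specified in the thesis.

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, tau):
    """Sample an action with probability proportional to exp(Q(s, a) / tau) (Equation 2.5)."""
    prefs = [math.exp(Q[(s, a)] / tau) for a in actions]
    total = sum(prefs)
    r = random.random() * total
    for a, p in zip(actions, prefs):
        r -= p
        if r <= 0:
            return a
    return actions[-1]

# Illustrative usage: as tau (or epsilon) is decayed, the agent shifts from
# exploring randomly to exploiting the action with the highest Q-value.
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.9}
for tau in [5.0, 1.0, 0.1]:
    picks = [boltzmann(Q, "s0", ["left", "right"], tau) for _ in range(1000)]
    print(tau, picks.count("right") / 1000.0)
```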
More sophisticated methods attempt to use information about model confidence and value magnitudes to plan a utility-maximizing course of exploration. An early approximation of this scheme can be found in the interval estimation method (Kaelbling, 1993). Bayesian methods have also been used to calculate the expected value of information to be gained from exploration (Meuleau & Bourgine, 1999; Dearden, Friedman, & Andre, 1999; Dearden, Friedman, & Russell, 1998). We will have a specific interest in Bayesian methods and will therefore explore this approach in detail in the next section on Bayesian reinforcement learning (Dearden et al., 1998).

The complexity of reinforcement learning can be characterized in two ways: sample complexity, the number of samples of the agent's environment required for the agent to find a good policy; and computational complexity, the amount of computation required by the agent to turn samples into a policy. The computational complexity of reinforcement learning methods varies considerably depending on the technique chosen to solve the problem. Model-free approaches to reinforcement learning, such as Q-learning and TD(λ) with linear approximation, take constant time per sample. Model-based approaches vary considerably in their time complexity. An implementation of the certainty equivalence approach would involve solving a complete MDP at each stage, which can be guaranteed to be no worse than high-order polynomial time. A model-based approach using heuristics to update state values, such as prioritized sweeping, can be constructed to update the value function in time which is linear in the number of steps per sample.

To some extent, the computational complexity required to update the agent's value function after a single sample can be traded off against the number of samples required under a particular technique to achieve a good policy. The notion of sample complexity is therefore of particular interest in reinforcement learning. Where transitions are "ideally mixed" it can be shown that agents can learn an ε-optimal policy with probability at least 1 − δ in time that is O((N log(1/ε)/ε²)(log(N/δ) + log log(1/ε))), or roughly O(N log N), where N is the number of states (Kearns & Singh, 1998a). Given knowledge of the maximum reward and its variance, a stronger result is possible which bounds both computation time and sample complexity to be polynomial in the number of states and the horizon length (Kearns & Singh, 1998b). A similar bound can be found for discounted problems.

2.3 Bayesian Reinforcement Learning

Bayesian methods in model-based RL provide a general framework for deriving reinforcement learning algorithms that flexibly combine various sources of agent knowledge, including prior beliefs about its transition and reward models, evidence collected through experience, and evidence collected indirectly through observations. Bayesian reinforcement learning also maintains a richer representation in the form of probability distributions over transition and reward model parameters (essentially second-order distributions), whereas non-Bayesian methods typically represent these models only in terms of the mean values of parameters.

For instance, a non-Bayesian transition model might have a single parameter representing the probability of a transition from state s to t. A Bayesian transition model has a probability density function over transition probabilities. In states where the model is unknown, the agent might have a uniform distribution over possible transition probabilities.
T h e posterior, or updated m o d e l , Pr(T\H, A) represents the agent 's new bel iefs about its t ransi t ion m o d e l g i v e n its current exper ience H and A. T h e update is done th rough B a y e s ' rule ( D e G r o o t , 1975, Sec . 2.2): p r f m A ) Pr(H\T,A)Pr(T\A) T h e denomina to r is s i m p l y a n o r m a l i z i n g factor ove r the set o f pos s ib l e t rans i t ion mode ls T to make the resultant d i s t r ibu t ion s u m to one. T h e n o r m a l i z i n g factor is t r ad i t iona l ly denoted b y a. T h e potent ia l for incorpora t ing ev idence other than H and A in to the update o f the poster ior p robab i l i t y d i s t r ibu t ion makes this f r amework attractive for i m i t a t i o n agents. In general , the representation o f p r io r and poster ior d is t r ibut ions ove r t rans i t ion m o d -els is n o n t r i v i a l . H o w e v e r , the fo rmula t ion o f Dearden et a l . (1999) in t roduces a specif ic tractable m o d e l : the p robab i l i t y densi ty over poss ib le mode ls P r ( T ) is the product o f inde-pendent l o c a l p robab i l i t y d i s t r ibu t ion mode ls for each i n d i v i d u a l state and ac t ion ; speci f i -ca l ly , Pr(r) = Y[?v{Ts'a); s,a and each densi ty Pr(Ts'a) is D i r i c h l e t w i t h parameters ns'a ( D e G r o o t , 1975). T h e B a y e s i a n update o f a D i r i c h l e t d i s t r ibu t ion has a s imp le c losed f o r m based o n event counts . T o m o d e l P r ( T s , a ) w e require one parameter ns'a (s ') for each poss ib l e successor state s'. T h e vector n s ' a c o l l e c t i v e l y represents the counts for each poss ib l e successor state. U p d a t e o f a D i r i c h -let is a s t ra ightforward: g i v e n p r io r P r ( T s ' a ; n s , a ) and data vector c s , a (where cs'a's> is the 18 number o f observed transi t ions f rom s to s' under a) , the poster ior is g i v e n b y parameters n s , a _|_ c s , a Dea rden ' s representation therefore permits independent t ransi t ion mode l s at each step to be updated as f o l l o w s : Pr(Ts'a\Hs'a) = a Pv(Hs'a\Ts'a) Pr(Ts'a) (2.7) Here , H s , a is the subset o f his tory referr ing to o n l y to t ransi t ions o r i g i n a t i n g f r o m s and inf luenced b y ac t ion a. Ef f ic ien t methods a lso exis t for s a m p l i n g pos s ib l e t ransi t ion mode l s f r om the D i r i c h l e t d is t r ibut ions . 2.4 Bayesian Exploration B a y e s i a n re inforcement learn ing mainta ins d is t r ibut ions ove r t rans i t ion m o d e l parameters rather than po in t estimates. A B a y e s i a n reinforcement learner can therefore quant i fy its un -certainty about its m o d e l estimates in terms o f the spread o f the d i s t r i bu t ion over poss ib le mode ls . G i v e n this extra in fo rmat ion , an agent c o u l d reason e x p l i c i t l y about its state o f k n o w l e d g e , c h o o s i n g act ions for bo th the expected va lue o f the ac t ion (i.e., the est imated l o n g term return) and an estimate o f the value of informationihat exper ience w i t h the act ion migh t p r o v i d e to the agent. T h i s f o rm o f exp lora t ion is ca l l ed B a y e s i a n exp lo ra t i on (Dearden et a l . , 1999). T o keep computa t ions tractable, the agent genera l ly computes o n l y a m y o p i c app rox ima t ion o f the va lue o f in fo rmat ion . 
S i n c e an agent 's act ions u l t imate ly determine the agent 's real payoffs i n the w o r l d , the va lue o f accurate k n o w l e d g e about the w o r l d can be expressed i n terms o f its effect on the agent 's payoffs . T h e va lue o f in fo rmat ion x is defined as the return the agent w o u l d receive ac t ing o p t i m a l l y w i t h the in fo rmat ion x less the return the agent w o u l d rece ive ac t ing wi thou t the in fo rmat ion x ( H o w a r d , 1996). In the case o f reinforcement l ea rn ing , agents make their 19 dec is ions based u p o n Q-va lues for each state and ac t ion . T h e va lue o f accurate Q-va lues can therefore be compu ted b y c o m p a r i n g the agent 's return g i v e n the accurate Q-va lues w i t h the agent 's expected return g i v e n its p r io r uncertain Q-va lues . A n agent 's m o d e l , together w i t h its r eward func t ion determine its be l iefs about its o p t i m a l Q-va lues . A change i n the agent 's bel iefs about the env i ronment m o d e l w i l l there-fore result i n a change i n its bel iefs about its Q-va lues . Re in fo rcemen t learners can obta in new in format ion about Q-va lues b y t ak ing explora tory act ions , obse rv ing the results and ca l cu l a t ing new Q-va lues f rom their updated mode l s . S i n c e these explora tory act ions poten-t i a l ly increase the agent 's k n o w l e d g e about the w o r l d , they w i l l have an in fo rma t ion va lue i n add i t ion to whatever concrete payof f they p rov ide . Unfor tuna te ly , w e cannot conf ident ly predic t h o w m u c h in fo rmat ion w i l l be revealed to the agent b y any one exper iment w i t h an ac t ion . W e can , however , obta in an expecta t ion over the va lue o f the i n fo rma t ion such ex-p lo ra t ion c o u l d reveal : based on p r io r bel iefs , the agent w i l l have a d i s t r i bu t ion over poss ib le Q-va lues for the ac t ion i n ques t ion . W e c o u l d therefore compute an expected value of per-fect information b y c o m p u t i n g the va lue o f perfect i n fo rma t ion for each pos s ib l e Q - v a l u e ass ignment and m u l t i p l y i n g b y the assignment 's est imated p robab i l i ty . T h i s expected va lue o f perfect i n fo rma t ion can be added to the agent's actual Q - v a l u e estimates for each ac t ion . M a x i m i z a t i o n o f these in fo rmed Q-va lues a l l o w s the agent to take the va lue o f exp lo ra t ion in to account when c h o o s i n g its actions. In tu i t ive ly , an agent 's return w i l l vary when new in fo rmat ion about its Q-va lues cause it to change its bel iefs about the best ac t ion to pe r fo rm i n a state. F o r m a l l y , L e t QSA represent the Q - v a l u e o f ac t ion a i n state s. L e t P r ( Q S ) Q = q) be the p robab i l i t y that Qs<a = q where q € JR. T h e n , the expected Q - v a l u e o f ac t ion a at state s is E [ Q s , a ] = Jq P r ( Q s , a = q)Sq. A ra t ional agent w o u l d then use these expected Q-va lues to rank its act ions i n order f rom best to worst , abest, a n e x t B e s t , • • • and choose the best ac t ion a b e s t for exp lo i t a t i on . T h e 20 agent does not k n o w the true Q-va lues , but it does k n o w the poss ib l e range o f Q-va lues (e.g., the set o f reals) and it can compute the va lue o f l ea rn ing that the true Q - v a l u e has a par t icular real value. L e t q* a be a candidate for the true Q - v a l u e o f ac t ion a i n state s. 
If the fact that q*_{s,a} was the true Q-value for action a in state s caused the agent to change the action currently identified as the best in state s, the agent's policy would change and the agent's actual return in the world could change as well. Assuming q*_{s,a} gives the true value of action a in state s, it could affect the agent's belief about the current ordering of its actions in one of three ways: the agent might decide that the current best action is actually inferior to the next best; the agent might decide that some other action is superior to the current best; or the agent might decide that the current best is still the best. If the identity of the best action changes, the change in the agent's return will be the difference between the value of the new best action and the value of the old best action. The three cases are summarized below:

    VPI(s, a, q*_{s,a}) =
        E[Q_{s,a_nextBest}] - q*_{s,a}    if a = a_best and E[Q_{s,a_nextBest}] > q*_{s,a}
        q*_{s,a} - E[Q_{s,a_best}]        if a ≠ a_best and q*_{s,a} > E[Q_{s,a_best}]
        0                                 otherwise                                    (2.8)

The value of perfect information, VPI(s, a, q*_{s,a}), gives the value of knowing that the Q-value for action a in state s has the true value q*_{s,a}. However, a reinforcement learning agent does not know these true values when exploring. It could, however, maintain a distribution over possible Q-values. Using this distribution over the possible assignments to Q-values, we could compute an expected value of perfect information:

    EVPI(s, a) = ∫_{-∞}^{∞} VPI(s, a, q*_{s,a}) Pr(Q_{s,a} = q*_{s,a}) dq*_{s,a}    (2.9)

Given the expected value of perfect information (EVPI) for each possible action, a Bayesian reinforcement learner constructs informed Q-values for each action, defined as the sum of the expected Q-value and the value of information for the action. The agent then maximizes the resulting informed action values to yield an action that smoothly trades off exploitation and exploration in a principled fashion. Such a policy is defined as:

    π_EVPI(s) = argmax_a { E[Q_{s,a}] + EVPI(s, a) }    (2.10)

Since Q-values are a deterministic function of the reward and transition model, the uncertainty in Q-values comes from the uncertainty in models. Unfortunately, it is difficult to predict the uncertainty of any specific Q-value given uncertainty in model parameters. Instead, practical implementations of Bayesian reinforcement learners generally sample possible MDPs from a distribution over possible transition model parameters and solve each MDP to get a distribution over possible Q-values at each state (Dearden et al., 1999). When states and actions are discrete, distributions over model parameters can be represented compactly by Dirichlet distributions and individual MDPs are easily sampled. Since transition models change very little with a single agent observation, sophisticated incremental techniques for solving MDPs, such as prioritized sweeping, can be used to efficiently maintain the Q-value distributions (Dearden et al., 1999).
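The following sketch illustrates one way the myopic computation of Equations 2.8-2.10 could be approximated when the distribution over Q-values at a state is represented by samples, for example Q-values obtained by solving MDPs drawn from the Dirichlet posterior. The function names and the sample data are illustrative assumptions rather than a description of Dearden et al.'s implementation; the sample-based average simply stands in for the integral in Equation 2.9.

    # Sample-based approximation of myopic Bayesian exploration (Eqs. 2.8-2.10).
    import numpy as np

    def vpi(q_star, a, q_means, best, next_best):
        # Gain in return if the true Q-value of action a were known to be q_star.
        if a == best and q_star < q_means[next_best]:
            return q_means[next_best] - q_star
        if a != best and q_star > q_means[best]:
            return q_star - q_means[best]
        return 0.0

    def evpi_policy(q_samples):
        # q_samples[a] is an array of sampled Q-values for action a at this state.
        q_means = {a: float(np.mean(qs)) for a, qs in q_samples.items()}
        ranked = sorted(q_means, key=q_means.get, reverse=True)
        best, next_best = ranked[0], ranked[1]
        informed = {}
        for a, qs in q_samples.items():
            evpi = np.mean([vpi(q, a, q_means, best, next_best) for q in qs])
            informed[a] = q_means[a] + evpi   # informed Q-value of Equation 2.10
        return max(informed, key=informed.get)

    samples = {"left": np.array([1.0, 1.2, 0.9]), "right": np.array([0.2, 2.6, 0.4])}
    print(evpi_policy(samples))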
2.5 Probabilistic Reasoning and Bayesian Networks

Graphical models (Pearl, 1988) provide a powerful mechanism for efficiently reasoning about uncertain knowledge in complex domains. In a graphical model, a domain is represented by a set of variables that may take on various values. Uncertainty within the domain is represented by a probability distribution over all possible assignments to these variables. The graphical aspect of the model allows one to represent these distributions compactly. The domain model can then be used to infer the probability of unseen variables, given observations of the remaining variables.

We will illustrate the representation and reasoning process with an example. Imagine a simple world in which it is sometimes cloudy, sometimes rainy and sometimes one takes an umbrella. We let the boolean variables C, R and U stand for the propositions that it is cloudy, that it is raining and that one took one's umbrella, respectively. The domain requires reasoning with uncertainty since it does not always rain and one does not always take one's umbrella when it is cloudy. The uncertainty in the domain could be represented by an explicit probability distribution over each possible assignment of the world, as in Figure 2.1.

    C  R  U   Pr(C, R, U)
    T  T  T   0.126
    T  T  F   0.014
    T  F  T   0.054
    T  F  F   0.006
    F  T  T   0
    F  T  F   0.008
    F  F  T   0
    F  F  F   0.792

    Figure 2.1: Explicit distribution over an exhaustive enumeration of cloud-rain-umbrella states

Given a domain with n boolean variables, a complete probability distribution would have 2^n possible assignments. The representation of such tables and the knowledge acquisition process required to fill them in make explicit table representations impractical for probabilistic reasoning. In a graphical model, the need for complete tables is reduced by factoring one large table into smaller tables of independent variables. For instance, our rain-cloud-umbrella domain could be expressed as the graphical model in Figure 2.2(a). In this model, "cloudiness" causes rain and "cloudiness" causes one to take one's umbrella.

    Pr(C) = 0.2    Pr(R|C) = 0.7    Pr(R|¬C) = 0.01    Pr(U|C) = 0.9    Pr(U|¬C) = 0.0

    Figure 2.2: Compact distributions over independent factors for the cloud-rain-umbrella domain: (a) graphical model; (b) model factors

Under the assumptions of this model, knowing whether it is cloudy or not completely predicts the probability that one takes one's umbrella, and knowing whether it is raining or not does not have any additional predictive power. That is, Pr(U|C) = Pr(U|C, R). The independence assumptions represented by the graphical model allow us to represent the domain more compactly, yet still provide equivalent information.
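As a purely illustrative check of this compactness claim, the short sketch below multiplies the three factors of Figure 2.2(b) to recover the joint distribution of Figure 2.1 and answers a simple query by summing over the joint; the query and the code organization are assumptions for illustration, and a real Bayesian-network implementation would exploit the structure more directly.

    # Reconstructing the joint of Figure 2.1 from the factors of Figure 2.2(b).
    from itertools import product

    p_c = {True: 0.2, False: 0.8}             # Pr(C)
    p_r_given_c = {True: 0.7, False: 0.01}    # Pr(R | C)
    p_u_given_c = {True: 0.9, False: 0.0}     # Pr(U | C)

    def joint(c, r, u):
        pr = p_r_given_c[c] if r else 1.0 - p_r_given_c[c]
        pu = p_u_given_c[c] if u else 1.0 - p_u_given_c[c]
        return p_c[c] * pr * pu               # Pr(C) Pr(R|C) Pr(U|C)

    # Reprint the explicit table of Figure 2.1 from the compact factors.
    for c, r, u in product([True, False], repeat=3):
        print(c, r, u, round(joint(c, r, u), 3))

    # Example query: probability of rain given that the umbrella was taken.
    p_u = sum(joint(c, r, True) for c, r in product([True, False], repeat=2))
    p_ru = sum(joint(c, True, True) for c in [True, False])
    print("Pr(R | U) =", p_ru / p_u)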
Using the independence assumptions and the fact that complementary probabilities always sum to one, the rain-cloud-umbrella domain can be encoded with the probability factors in Figure 2.2(b). Multiplying the conditional probabilities gives us the same joint probability, Pr(R, U, C) = Pr(R|C) Pr(U|C) Pr(C); however, the factors can be represented more compactly and specialized algorithms can be used to compute queries efficiently.

In general, belief inference in Bayesian networks is quite difficult in the worst case. Finding the most probable setting of the variables is known to be NP-complete (Littman, Majercik, & Pitassi, 2001) and finding the posterior belief distribution over variables is PP-complete (Littman et al., 2001).

2.6 Model-Free Learning

In model-based approaches, the agent estimates an explicit transition model in order to calculate state values. Model-based approaches make the underlying theory of reinforcement learning clear and may use evidence more efficiently in some cases (though see Least Squares TD (Boyan, 1999)), but suffer from a number of disadvantages. First, transition models are a function of three variables: prior state, action choice and successor state. Generic domain models can therefore have complexity on the order of the square of the size of the underlying state space. Second, in model-based techniques, the value function is typically calculated from the transition model through Bellman backups. In a domain with continuous variables it is sometimes possible to create continuous transition models (e.g., dynamic Bayesian networks or mixtures of Gaussians), but generally it is difficult to represent and calculate the continuous value function for such a model (an approach that illustrates this difficulty, and one method of attacking it, can be found in the work of Forbes and Andre (2000)).

Model-free methods such as TD(λ) (Sutton, 1988) and Q-learning (Watkins & Dayan, 1992) estimate the optimal value function or Q-function directly, without recourse to a domain model. In model-free reinforcement learning, the value of each state-action pair is sampled empirically from interactions with the agent's environment. On each interaction the agent obtains a tuple (s, a, r, t) corresponding to the prior state, chosen action, reward and successor state, respectively. Since the successor state and reward are generated by the environment, they are effectively sampled with the probability that they occur in the world. As before, let Q(s, a) be the Q-value of action a in state s. The Q-learning rule, derived by Watkins, can be used to update the agent's Q-values incrementally from each experience:

    Q(s, a) ← (1 - α) Q(s, a) + α (r + γ max_{a' ∈ A} Q(t, a'))    (2.11)

The rule effectively computes the average sum of immediate and expected future rewards for action a at state s, assuming that the agent will take the best action in the next state (i.e., max_{a' ∈ A}).
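A minimal sketch of the tabular update in Equation 2.11 is given below. The environment interaction and the epsilon-greedy action choice are illustrative assumptions, not part of the rule itself; only the single-line update implements the equation.

    # Tabular Q-learning update (Equation 2.11) with an illustrative action picker.
    import random
    from collections import defaultdict

    def q_learning_step(Q, s, a, r, t, alpha=0.1, gamma=0.95, actions=(0, 1)):
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(t, a'))
        best_next = max(Q[(t, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

    def epsilon_greedy(Q, s, actions=(0, 1), eps=0.1):
        # A common (assumed) exploration rule; not part of Equation 2.11.
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    Q = defaultdict(float)
    q_learning_step(Q, s=0, a=1, r=1.0, t=2)   # one (s, a, r, t) experience tuple
    print(Q[(0, 1)])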
If α is decayed appropriately and every state is visited infinitely often, the Q-learning rule will cause Q(s, a) to converge to its true value in the limit (Watkins & Dayan, 1992). The TD(λ) family of methods improves on the Q-learning rule by sampling not only the current state, but a proportion of future states as well. It typically converges much more quickly than Q-learning given the same number of samples. Recent research on gradient-based (Baird, 1999) and policy-search (Ng & Jordan, 2000) methods has expanded the scope and applicability of model-free techniques to a wide variety of challenging continuous-variable control problems.

2.7 Partial Observability

The MDP assumes that the agent can uniquely identify its state at each decision epoch. MDPs can be naturally generalized to problems in which agents are uncertain about their current state. This uncertainty may be due to the limited scope of sensors which report on only some subset of the agent's state, noisy sensors that report complete but somewhat inaccurate versions of the agent's state, or some combination of the two. The Partially Observable Markov Decision Process (POMDP) is defined by the tuple (S, A, Z, T, Z, R). As in the fully observable MDP, S, A, T, and R represent the set of states in the problem, the actions available to the agent, the environment transition model and the reward function. In the partially observable model, Z represents the possible observations or sensor readings for the agent, and Z : S → Δ(Z) represents an observation function or sensor model which maps possible world-states to distributions over observations.

Techniques for solving POMDPs parallel those for solving MDPs. The POMDP can be solved by both value-based (Smallwood & Sondik, 1973) and policy-based algorithms (Hansen, 1998). The theory of POMDPs has been generalized to partially observable reinforcement learning by appealing to traditional techniques based on determining the underlying state space and model (McCallum, 1992), as well as more recent work using gradient methods (Baxter & Bartlett, 2000; Ng & Jordan, 2000). We refer the reader to existing comprehensive surveys for more details (Cassandra, Kaelbling, & Littman, 1994; Lovejoy, 1991; Smallwood & Sondik, 1973).

The solution of general POMDPs is considered intractable. Current state-of-the-art algorithms, such as incremental pruning, still have exponential running time in the size of the state space in the worst case, and even ε-optimal approximations are NP- or PSPACE-complete (Lusena, Goldsmith, & Mundhenk, 2001).

2.8 Multi-agent Learning

Single-agent reinforcement learning theory assumes that the agent is learning within a stationary environment (i.e., the transition model T does not change over time). When additional agents are added to the environment, the effects of an agent's action may depend upon the action choices of other agents.
Changes to the agent's policy may cause other agents to change their policies in turn, which may result in the system of agents as a whole entering into cyclic policy changes that never converge. In domains where agents must coordinate in order to achieve goals, agents may fail to find compatible policies and converge to non-optimal solutions. Reinforcement learning can be extended to multi-agent domains through the theory of stochastic games.

Though Shapley's (1953) original formulation of stochastic games involved a zero-sum (fully competitive) assumption, various generalizations of the model have been proposed, allowing for arbitrary relationships between agents' utility functions (Myerson, 1991); see, for example, the fully cooperative multiagent MDP model proposed by Boutilier (1999). Formally, an n-agent stochastic game (S, {A_i : i ≤ n}, T, {R_i : i ≤ n}) comprises a set of n agents (1 ≤ i ≤ n), a set of states S, a set of actions A_i for each agent i, a state transition function T, and a reward function R_i for each agent i. Each state s represents the complete state of the system. In some games, the system state can intuitively be understood as the product of the states of the individual agents playing the game. Unlike an MDP, individual agent actions do not determine state transitions; rather, it is the joint action taken by the collection of agents that determines how the system evolves at any point in time. Let A = A_1 × ... × A_n be the set of joint actions; then T : S × A × S → R denotes the probability of ending up in the system state s' ∈ S when joint action a is performed at the system state s. For convenience, we introduce the notation A_{-i} (read "A, not i") to denote the set of joint actions A_1 × ... × A_{i-1} × A_{i+1} × ... × A_n involving all agents except i. We use a_i · a_{-i} to denote the (full) joint action obtained by conjoining a_i ∈ A_i with a_{-i} ∈ A_{-i}.

In a general stochastic game, the predictive paradigm for action choice can break down. An agent can no longer predict the effect of its action choices, as it would have to predict the action choices of other agents, which in turn would be based upon predictions of its own action choices. Instead, game theory makes recourse to the idea of a solution concept: a principle outside of the domain of interest that can be used to constrain the action choices of all agents.

A popular technique for the solution of games is the minimax principle. Under minimax, an agent chooses the action that leaves it least vulnerable to malicious behavior of other agents. The minimax principle forms the core of the minimax Q-learning algorithm (Littman, 1994), which extends reinforcement learning to competitive zero-sum games. The minimax-Q algorithm assumes that the state, reward and action choices of other agents in the environment are observable. Under this assumption, an agent can learn Q-tables for both its own actions and the actions of other agents.
A quadratic program is then constructed to find the probabilistic policy that maximizes the worst-case Q-value the agent could receive (in a two-player zero-sum game, one agent's payoff is the negative of the other's, so only one Q-value need be represented explicitly). In a purely competitive game, a deterministic policy could be predicted by other agents and taken advantage of.

One drawback of the minimax principle is that it assumes that agents will always act maliciously, which is only reasonable in a purely competitive (or zero-sum) game in which no agent can profit unless another agent loses. In non-competitive games, agents can find a solution using quadratic programming. The solution is a Nash equilibrium, in which no agent can independently modify its action choice to improve its own outcome. In the reinforcement learning context, Hu and Wellman (1998) generalize Littman's work to the case where agents may have overlapping goals using quadratic programming. A given game may have a number of Nash equilibria. For optimal play, all agents must play the optimal action for the same equilibrium. Hu and Wellman's work does not explicitly address games in which there are multiple equilibria. In non-competitive games, gradient-based methods have also been suggested (Bowling & Veloso, 2001).

2.9 Technical Definitions

We make recourse to a couple of technical definitions. The relative entropy or Kullback-Leibler distance between two probability distributions tells us how much more information it would take to represent one distribution in terms of another. It is often used to compare two distributions, although it does not actually satisfy the complete definition of a metric because it is not symmetric and does not satisfy the triangle inequality. Let P and Q be two probability distributions. The Kullback-Leibler or KL distance D(P||Q) is:

    D(P||Q) = Σ_x P(x) log ( P(x) / Q(x) )    (2.12)

The Chebyshev inequality gives us a way of bounding, for an arbitrary distribution, the probability mass lying more than a multiple k of the standard deviation σ away from the mean μ:

    P(|X - μ| ≥ kσ) ≤ 1/k²    (2.13)

A homomorphism is a mapping between two groups such that the outcome of the group operation on the first group has a corresponding outcome for the group operation on the second. Let X and Y be two sets and θ be a map θ : X → Y. Let ⊗ be an operator on X, so that ⊗ : X → X, and let ⊗' be an operator on Y, so that ⊗' : Y → Y. Then the map θ is a homomorphism with respect to ⊗ if, for all x ∈ X, θ(⊗x) = ⊗'θ(x).
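The two definitions above are easy to exercise numerically. The sketch below computes the KL distance of Equation 2.12 for a pair of illustrative discrete distributions (assuming P places no mass where Q is zero, so the logarithm stays finite) and checks the Chebyshev bound of Equation 2.13 empirically on arbitrary sampled data; the distributions and sample sizes are assumptions for illustration only.

    # Kullback-Leibler distance (Eq. 2.12) and an empirical Chebyshev check (Eq. 2.13).
    import numpy as np

    def kl_distance(P, Q):
        P, Q = np.asarray(P, float), np.asarray(Q, float)
        mask = P > 0
        return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

    print(kl_distance([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))   # D(P || Q)
    print(kl_distance([0.5, 0.3, 0.2], [0.7, 0.2, 0.1]))   # note: not symmetric

    # Chebyshev: for any distribution, Pr(|X - mu| >= k*sigma) <= 1/k^2.
    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=100_000)
    mu, sigma, k = x.mean(), x.std(), 2.0
    print((np.abs(x - mu) >= k * sigma).mean(), "<=", 1 / k**2)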
Chapter 3

Natural and Computational Imitation

Research into imitation spans a broad range of dimensions, from ethological studies, to abstract algebraic formulations, to industrial control algorithms. As these fields cross-fertilize and inform each other, we are coming to stronger conceptual definitions of imitation and a better understanding of its limits and capabilities. Many computational models have been proposed to exploit specialized niches in a variety of control paradigms, and imitation techniques have been applied to a variety of real-world control problems. This chapter reviews a representative sample of the literature on natural imitation and computational implementations and concludes with an analysis of how various forms of imitation fit into the paradigm of knowledge transfer within a reinforcement learning framework.

3.1 Conceptual Issues

The conceptual foundations of imitation have been clarified by work on natural imitation. From work on apes (Russon & Galdikas, 1993), octopi (Fiorito & Scotto, 1992), and other animals, we know that socially facilitated learning is widespread throughout the animal kingdom. A number of researchers have pointed out, however, that social facilitation can take many forms (Conte, 2000; Noble & Todd, 1999). For instance, a mentor's attention to some feature of its environment may draw an observer's attention to this feature. The observer may subsequently experiment with the feature of the environment and independently learn a model of the feature's behaviour. The definition of "true imitation" attempts to rule out acquisition of behaviour by independent learning, even if other agents motivated this independent learning. Visalberghi and Fragazy (1990) cite Mitchell's definition, which says that some behaviour C is learned by imitation when the following conditions are met:

1. C (the copy of the behavior) is produced by an organism;
2. C is similar to M (the model behavior);
3. observation of M is necessary for the production of C (above baseline levels of C occurring spontaneously);
4. C is designed to be similar to M;
5. the behavior C must be a novel behavior, not already organized in that precise way in the organism's repertoire.

This definition perhaps presupposes a cognitive stance towards imitation in which an agent explicitly reasons about the behaviors of other agents and how these behaviors relate to its own action capabilities and goals.

Imitation can be further analyzed in terms of the type of correspondence demonstrated by the mentor's behavior and the observer's acquired behavior (Nehaniv & Dautenhahn, 1998; Byrne & Russon, 1998). Correspondence types are distinguished by level. At the action level, there is a correspondence between actions. At the program level, the actions may be completely different, but a correspondence may be found between subgoals. At the effect level, the agent plans a set of actions that achieve the same effect as the demonstrated behavior, but there is no direct correspondence between subcomponents of the observer's actions and the mentor's actions. The term abstract imitation has been proposed for the case where agents imitate behaviors which come from imitating the mental state of other agents (Demiris & Hayes, 1997).
T h e study o f specif ic computa t iona l mode l s o f imi t a t ion has y i e l d e d ins ights in to the nature o f the observer-mentor re la t ionship and h o w it affects the acqu i s i t i on o f behaviors b y observers. F o r instance, i n the related f ie ld o f behav io ra l c l o n i n g , it has been observed that mentors that implemen t po l i c i e s that are r i sk-averse general ly y i e l d more re l i ab le c lones ( U r b a n c i c & B r a t k o , 1994). H i g h l y - t r a i n e d mentors f o l l o w i n g an o p t i m a l p o l i c y w i t h s m a l l coverage o f the state space y i e l d less re l iab le c lones than those that m a k e more mis takes (Sammut , Hurs t , K e d z i e r , & M i c h i e , 1992), s ince mentors w h i c h make mis takes p r o v i d e in fo rmat ion about r ecover ing f rom fa i lure cond i t ions . F o r par t ia l ly observable p rob lems , l ea rn ing f rom mentors w h o have access to more in fo rmat ive percept ions than the observer can be disastrous. G i v e n the add i t iona l in fo rmat ion i n its percept ions , a mentor m a y be able to choose a p o l i c y that w o u l d be too r i s k y for an agent whose percept ions d i d not p rov ide enough in fo rma t ion to make the requi red dec is ions (Scheffer, Gre iner , & D a r k e n , 1997). C o n s i d e r the case o f t w o robots nav iga t ing a long a c l i f f . T h e agent w i t h h igher p r ec i s ion sensors may be able to afford a path c loser to the edge than the agent w i t h l o w - p r e c i s i o n sensors. F i n a l l y , i n the case o f human mentors , it has been observed that successful c lones w o u l d often outper form the o r ig ina l human mentor due to the "c leanup effect" (Sammut et a l . , 1992)—essent ia l ly , the imi ta tor ' s mach ine l ea rn ing a l g o r i t h m averages out the error i n many runs to p roduce a better p o l i c y than that o f the o r i g i n a l human control ler . 33 O n e o f the o r i g i n a l goals o f behav io ra l c l o n i n g ( M i c h i e , 1993) was to extract k n o w l -edge f r o m humans to speed up the des ign o f contro l lers . F o r the extracted k n o w l e d g e to be useful , it has been argued that rule-based systems often the best chance o f i n t e l l i g i b i l i t y (van L e n t & L a i r d , 1999). It has become clear, however , that s y m b o l i c representations are not a comple te answer. Representa t ional capaci ty is a lso an issue. H u m a n s often organize con t ro l tasks b y t ime, w h i c h is t y p i c a l l y l a c k i n g i n state and percept ion-based approaches to c o n -t ro l . H u m a n s also natura l ly break tasks d o w n into independent components and subgoals ( U r b a n c i c & B r a t k o , 1994). Studies have also demonstrated that humans w i l l g i v e verba l descr ip t ions o f their cont ro l po l i c i e s w h i c h do not match their ac tual act ions ( U r b a n c i c & B r a t k o , 1994). T h e potent ia l for sav ing t ime i n acqu i s i t i on has been born out b y one study w h i c h e x p l i c i t l y compared the t ime to extract rules w i t h the t ime requi red to p r o g r a m a c o n -t rol ler (van L e n t & L a i r d , 1999). Research in to human imi ta t ion has a lso h i g h l i g h t e d the use o f human s o c i a l cues i n d i r ec t ing attention o f an agent to the features that s h o u l d be imi ta ted (Breazea l & Scasse l la t i , 2002) . 
In add i t ion to what has t rad i t iona l ly been cons idered imi t a t i on , an agent may also face the p r o b l e m o f " l ea rn ing h o w to imi ta te" or finding a cor respondence be tween the ac-t ions and states o f the observer and mentor (Nehan iv & Dautenhahn , 1998). N e h a n i v et a l . instantiate their n o t i o n o f correspondence i n two implementa t ions : a r andom-exp lo ra t ion memory-based m o d e l ( A l i s s a n d r a k i s , N e h a n i v , & Dautenhahn , 2002) and a subgoal-search procedure ( A l i s s a n d r a k i s , N e h a n i v , & Dautenhahn , 2000) . Present ly , the targets for subgoa l search must be set u s ing some p rev ious ly defined state correspondence func t ion . A comple te m o d e l o f l ea rn ing b y observat ion i n the absence o f c o m m u n i c a t i o n pro toco ls w i l l have to deal w i t h this first, p r i m a l correspondence p r o b l e m . M o s t o f the li terature on imi ta t ion considers the p r o b l e m o f agents l ea rn ing the so lu -t ions o f s ingle agent p rob lems f r o m other agents. It is poss ib le , however , for agents to learn 34 mult i -agent tasks f rom other agents. A f o r m o f imi t a t ion can be used w h e n an agent w h o has learned to be an effective member o f a team must be replaced b y a new agent w i t h p o s s i b l y different capabi l i t i es . T h e k n o w l e d g e related to the team task o f the p rev ious team member can be reused b y the new team member (Denz inge r & E n n i s , 2002) . Ou t s ide o f the imi t a t ion learn ing f ie ld , the no t ion o f im i t a t i on as i m p l i c i t transfer be-tween agents can be seen as a so lu t ion to transferring data be tween representations that l ack a s i m p l e t ransformat ion process. In stat is t ical mode ls , the most in tu i t ive representations for p r i o r d is t r ibut ions may be intractable to update, but it is somet imes pos s ib l e to construct a more tractable p r io r capable o f approx imate ly express ing the des i red p r i o r ( N e a l , 2001) . U s i n g data sampled f r o m the intractable pr ior , one can fit the a p p r o x i m a t i n g pr ior . C o m p u -tations can then be per formed us ing the tractable pr ior . W e e n v i s i o n i m i t a t i o n l ea rn ing b e i n g used i n a s i m i l a r capaci ty to b r idge representations. 3.2 Practical Implementations T h e theoret ical developments i n imi t a t ion research have been accompan ied b y a number o f prac t ica l implementa t ions . These implementa t ions take advantage o f propert ies o f differ-ent con t ro l paradigms to demonstrate var ious aspects o f imi t a t ion . E a r l y b e h a v i o r a l c l o n i n g research took advantage o f supervised learn ing techniques such as d e c i s i o n trees (Sammut et a l , 1992). T h e d e c i s i o n tree was used to learn h o w a human operator mapped percep-t ions to act ions. Percept ions were encoded as discrete values. A t ime de lay was inserted i n order to synchron ize percept ions w i t h the act ions they trigger. L e a r n i n g apprent ice systems ( M i t c h e l l et a l . , 1985) also attempted to extract useful k n o w l e d g e b y w a t c h i n g users, but the g o a l o f apprentices is not to independent ly so lve p rob lems . L e a r n i n g apprentices are c lo se ly related to p rogramming-by-demons t ra t ion systems (L i ebe rman , 1993; L a u , D o m i n g o s , & W e l d , 2000) . 
L a t e r efforts used more sophis t icated techniques to extract act ions f r o m v i -35 sual percept ions and abstract these act ions for future use ( K u n i y o s h i , Inaba, & Inoue, 1994). W o r k on associa t ive and recurrent learn ing mode ls extended early approaches to im i t a t i on to the l ea rn ing o f temporal sequences ( B i l l a r d & H a y e s , 1999). A s s o c i a t i v e l ea rn ing has been used together w i t h innate f o l l o w i n g behaviors to acquire nav iga t ion exper t ise f r o m other agents ( B i l l a r d & H a y e s , 1997). A related but s l igh t ly different f o r m o f imi t a t ion has been s tudied i n the mul t i -agent re inforcement l ea rn ing c o m m u n i t y . A n early precursor to im i t a t i on can be f o u n d i n w o r k on shar ing o f percept ions between agents (Tan, 1993). C l o s e r to im i t a t i on is the idea o f re-p l a y i n g the percept ions and act ions o f one agent to a second agent ( L i n , 1991 ; W h i t e h e a d , 1991). H e r e the transfer is f r om one agent to another, i n contrast to b e h a v i o r a l c l o n i n g ' s transfer f rom human to agent. T h e representation is also different. Re in fo rcemen t l ea rn ing prov ides agents w i t h the ab i l i t y to reason about the effects o f current act ions o n expected future u t i l i t y so agents can integrate their o w n k n o w l e d g e w i t h k n o w l e d g e extracted f rom other agents b y c o m p a r i n g the re la t ive u t i l i t y o f the act ions suggested b y each k n o w l e d g e source. T h e "seeding approaches" are c l o s e l y related. Trajectories recorded f r o m human subjects are used to i n i t i a l i z e a p lanner w h i c h subsequently op t imizes the p l a n i n order to account for differences between the human effector and the robo t ic effector ( A t k e s o n & Schaa l , 1997). T h i s technique has been extended to handle the n o t i o n o f subgoals w i t h i n a task ( A t k e s o n & Schaa l , 1997). Subgoa l s are also addressed b y others (Sue & B r a t k o , 1997). A n o t h e r approach i n this f a m i l y is based on the assumpt ion that the mentor is rat io-na l , has the same reward func t ion as the observer, and chooses f rom the same set o f act ions. G i v e n these assumpt ions , we can conc lude that the ac t ion chosen b y a mentor i n a par t icular state must have h igher va lue to the mentor than the alternatives open to the mentor ( U t g o f f & 36 C l o u s e , 1991) and therefore h igher va lue to the observer than any al ternative. T h e sys tem therefore i te ra t ive ly adjusts the values o f the act ions u n t i l this constra int is satisfied i n its m o d e l . A related approach uses the me thodo logy o f l inear-quadrat ic con t ro l (Sue & B r a t k o , 1997). F i r s t a m o d e l o f the sys tem is constructed. T h e n the inverse con t ro l p r o b l e m is so lved to find a cost matr ix that w o u l d result in the observed con t ro l le r behav io r g i v e n an e n v i r o n -ment m o d e l . Recen t w o r k o n inverse reinforcement learn ing takes a related approach but uses a l inear p rog ram technique to reconstruct a spec ia l class o f r eward funct ions f r o m ob-served behav io r ( N g & R u s s e l l , 2000) . N g observes that m a n y poss ib l e reward funct ions c o u l d exp l a in a g i v e n behav io r and proposes heurist ics to regular ize the search. 
O n e heuris-t ic is to p i c k the p o l i c y that m a x i m i z e s the difference between best and second best act ions. T h e inverse re inforcement l ea rn ing approach can be extended to con t inuous doma ins and to the s i tuat ion where the d o m a i n m o d e l is u n k n o w n . N g ' s approach has been extended to gen-erate poster ior d is t r ibut ions over poss ib le agent po l i c i e s (Cha jewska , K o l l e r , & O r m o n e i t , 2001) . Severa l researchers have independent ly exp lo red the idea o f c o m m o n representa-t ions for bo th perceptual funct ions and act ion p l ann ing . D e m i r i s and H a y e s (1999) deve lop a shared-representation m o d e l based on a con t ro l engineer ing technique c a l l e d P r o p o r t i o n a l -In tegra l -Der iva t ive con t ro l ( A s t r o m & H a g g l u n d , 1996). P I D is a fo rward -con t ro l m o d e l w h i c h attempts to m i n i m i z e the error e(t) between the actual trajectory at t ime t and the de-sired trajectory at t ime t u s i ng a l inear func t ion o f the error, its in tegral and its der iva t ive . In the m o d e l p roposed b y D e m i r i s and H a y e s , each o f the agent 's pos s ib l e behav iours is rep-resented b y a separate P I D control ler . Observa t ions o f other agents are c o m p a r e d w i t h the predic t ions made b y each o f the P I D control lers . T h e agent assigns labels to the behav iours o f other agents based o n w h i c h con t ro l l e r most c l o s e l y predicts the observa t ions . T h i s " c l o s -est" P I D con t ro l l e r can also be used to regenerate the observed behav iou r w h e n appropriate. 37 B i o l o g i c a l l y insp i red moto r ac t ion schema have a lso been inves t iga ted i n the dua l ro le o f perceptual and moto r representations (Mata r i c , W i l l i a m s o n , D e m i r i s , & M o h a n , 1998; M a t a r i c , 2002) . T h i s w o r k demonstrated that o n l y the observa t ion o f agent effectors was necessary for the tasks they consider . W e expect that this may be due to the fact that re la t ive pos i t i on and hand or ienta t ion constra in human a rm geometry so that add i t i ona l in fo rma t ion is redundant. T h i s suggests the usefulness o f redundancy ana lys is i n the des ign o f perceptual systems. M a t a r i c finds that three classes o f p r i m i t i v e s — reaching , o s c i l l a t i n g and o v e r a l l b o d y posture templates—are sufficient for imi t a t ing human behaviors such as dance. M o r e recent efforts have attempted to au tomat ica l ly generate templates u s i n g c lus te r ing . 3.3 Other Issues Imi ta t ion techniques have been app l i ed i n a d iverse c o l l e c t i o n o f app l ica t ions . C l a s s i c a l c o n -t ro l appl ica t ions i nc lude con t ro l systems for robot arms ( K u n i y o s h i et a l . , 1994; F r i e d r i c h , M u n c h , D i l l m a n n , B o c i o n e k , & Sass in , 1996), aeration plants (Scheffer et a l . , 1997), and container l o a d i n g cranes (Sue & Bra tko , 1997; U r b a n c i c & B r a t k o , 1994) . Imi ta t ion lea rn ing has also been app l i ed to accelerat ion o f generic reinforcement l ea rn ing ( L i n , 1991; W h i t e -head, 1991). L e s s t radi t ional appl ica t ions i nc lude transfer o f m u s i c a l s tyle (Canamero , A r -cos , & de Mantaras , 1999) and the support o f a soc ia l atmosphere ( B i l l a r d , H a y e s , & Dau t -enhahn, 1999; Breazea l , 1999; Scasse l la t i , 1999). 
Imi ta t ion has a lso been inves t iga ted as a route to language acqu i s i t ion and t ransmiss ion ( B i l l a r d et a l . , 1999; O l i p h a n t , 1999). 3.4 Analysis of Imitation in Terms of Reinforcement Learning T h e li terature on imi ta t ion spans many d imens ions . F o r the purposes o f d e v e l o p i n g concrete a lgor i thms, however , w e can l o o k at the li terature through the lens o f re inforcement learn ing . 38 In re inforcement learn ing , agents make use o f k n o w l e d g e about rewards , t rans i t ion mode l s , Q-va lues and ac t ion po l i c i e s . I f we v i e w imi t a t ion as the process o f t ransferr ing k n o w l e d g e f rom one agent to another, we shou ld be able to express the nature o f transfer i n terms o f the k i n d s o f k n o w l e d g e reinforcement learner 's use. In this sect ion, w e examine the ro le o f each type o f k n o w l e d g e i n turn. T h e behav io ra l c l o n i n g approaches to imi ta t ion (Sammut et a l . , 1992; U r b a n c i c & B r a t k o , 1994), and to some extent, M i t c h e l l ' s p a r a d i g m o f " t rue" im i t a t i on are based on the idea o f transferring a p o l i c y f rom one agent to another. In the behav io ra l c l o n i n g approach this transfer is executed through learn ing to associate act ions w i t h states. Unfor tuna te ly , it assumes that the observer and mentor share the same reward func t ion and ac t ion capabi l i t i es . It a lso assumes that comple te and unambiguous trajectories (including ac t ion cho ices ) can be observed. U t g o f f ' s ra t iona l constraint technique and Sue and B r a t k o ' s L Q con t ro l l e r i n d u c t i o n methods represent an attempt to define imi t a t ion i n terms o f the transfer o f Q-va lues f r o m one agent to another. A g a i n , however , this approach assumes congru i ty o f object ives . N g ' s inverse re inforcement learn ing ( N g & R u s s e l l , 2000) can be seen as a m o d e l that attempts to i m p l i c i t l y transfer a reward func t ion between agents. C o n s p i c u o u s l y absent in the r e v i e w e d literature is the idea o f e x p l i c i t l y t ransferring t ransi t ion m o d e l in fo rma t ion between agents. T h e advantage o f t rans i t ion m o d e l transfer l ies i n the fact that agents need not share iden t i ca l r eward funct ions to benefit f r o m increased k n o w l e d g e about their envi ronment . O u r o w n w o r k is based o n the idea o f an agent extract-i n g a m o d e l f r o m a mentor and us ing this m o d e l in fo rmat ion to p lace bounds on the va lue o f act ions u s ing its own r eward func t ion . A g e n t s can therefore learn f r o m mentors w i t h reward funct ions different than their o w n . Chapter 4 explores this n iche i n depth, deve lops several a lgor i thms, and demonstrates their properties on a w i d e range o f exper imenta l doma ins . In 39 part icular , it is s h o w n that generic t ransi t ion m o d e l in fo rma t ion about what the best ac t ion c o u l d a c c o m p l i s h is va luab le even i f the ident i ty o f the ac t ion is u n k n o w n . T h e result is a n e w class o f a lgor i thms for imi ta t ion that d o not require iden t i ca l rewards , the a b i l i t y to observe a mentor ' s ac t ion, or the co-opera t ion o f the mentor as a teacher. Part o f the computa t iona l aspect o f reinforcement l ea rn ing is d e c i d i n g w h i c h states to apply backups to i n order to update a va lue func t ion . 
T h e mentor c o u l d g i v e the imi ta tor in fo rmat ion about w h i c h states are most important to backup and thereby o p t i m i z e c o m p u -tat ional resources. W e propose t w o specif ic mechan i sms to imp lemen t a l l oca t i on o f c o m p u -tation in Chapte r 4. T h e ro le o f a l loca t ion i n the effectiveness o f i m i t a t i o n is a lso cons idered i n the analys is o f performance gains in imi ta t ion carr ied out in Chapte r 8. M u l t i - a g e n t reinforcement learn ing models in t roduce the concepts o f co -o rd ina t ion p rob lems , compe t i t i on and so lu t ion concepts . It is poss ib l e to th ink o f a co -o rd ina t ion pro-toco l as k n o w l e d g e about h o w to so lve the mul t i -agent part o f the p r o b l e m . A n imi t a t i on m e c h a n i s m c o u l d be used to transfer such mul t i -agent me ta -knowledge f r o m one agent to another. D e n z i n g e r ' s (2002) w o r k shows h o w team k n o w l e d g e can be reused b y new team members . A me thod for the transfer o f coord ina t ion k n o w l e d g e based o n mul t i -agent r e in -forcement learn ing f ramework w i l l be in t roduced i n Chapte r 7. Pa r t i a l l y observable learning is cons idered on l y b y Scheffer et a l . (1997) . T h e y do not e x p l i c i t l y cons ider the transfer o f quanti t ies specif ic to par t ia l ly observable learners. T h e exp lora t ion o f h o w imi t a t ion migh t be used to transfer observa t ion mode l s o r b e l i e f states between agents remains open. 3.5 A New Model: Implicit Imitation In the p rev ious subsect ions, var ious approaches to computa t iona l i m i t a t i o n were e x a m i n e d under the assumpt ion that imi t a t ion can be unders tood as the transfer o f k n o w l e d g e about 40 var ious parameters desc r ib ing an agent 's envi ronment . T h e survey c o n c l u d e d that the trans-fer o f t ransi t ion m o d e l in fo rmat ion is a p r o m i s i n g area for research in to i m i t a t i o n . T h i s sec-t ion examines the imp l i ca t i ons o f im i t a t i on as t ransi t ion m o d e l transfer and points out that this f o r m o f imi ta t ion is app l icab le in a m u c h w i d e r sett ing than other forms. T h e sect ion also sets up a f r amework o f assumptions that w i l l be used to deve lop specif ic a lgor i thms i n Chapters 4, 5 and 7. Explicit transfer infeasible Befo re d e l v i n g in to assumptions specif ic to im i t a t i on as transfer o f t rans i t ion mode l s , one assumpt ion c o m m o n to a l l approaches to imi t a t ion shou ld be underscored: I f it were feasi-b l e for the agents to e x p l i c i t l y exchange k n o w l e d g e through a f o r m a l representat ion, there w o u l d be no need to acquire k n o w l e d g e ind i rec t ly through observat ions o f other agent 's be-haviors . It is therefore sensible to cons ider imi t a t ion techniques o n l y w h e n e x p l i c i t transfer between agents is infeasible . T h i s assumpt ion w i l l be shown to have add i t i ona l i m p l i c a t i o n s short ly. 
Known independent objectives Pu t t i ng aside, for the moment , h o w one migh t transfer t r ans i t ion-mode l i n fo rma t ion between agents w i thou t m a k i n g recourse to e x p l i c i t c o m m u n i c a t i o n , w e w o u l d l i k e to e x a m i n e the i m p l i c a t i o n s o f imi ta t ion as t rans i t ion-model transfer for the nature o f the observer ' s objec-t ives and the re la t ionship o f these object ives to those o f other agents. T h e i m p l i c a t i o n s can be teased out b y e x a m i n i n g the process o f re inforcement learn-ing . In re inforcement l ea rn ing , an agent learns about the effects o f its act ions b y exper iment-i n g w i t h them and obse rv ing the outcomes. T h e outcomes are s u m m a r i z e d i n a t ransi t ion m o d e l w h i c h can then be used to assign an expected future va lue to each ac t ion based o n the 41 probab i l i t y o f the ac t ion l ead ing to future r eward ing states. A c c u r a t e values can on ly be ass igned i n those regions o f the state space where accu-rate t ransi t ion mode l s are ava i lab le to descr ibe the p robab i l i t y o f r each ing future r e w a r d i ng states. T h e transfer o f t ransi t ion m o d e l in fo rma t ion f rom a mentor to observer w o u l d a l l o w the observer agent to increase the accuracy o f its ac t ion values, l e ad ing to better dec i s ions . O f par t icular interest, however , is the pos s ib i l i t y that the observer m i g h t acquire t ransi t ion mode ls for regions o f the state space i n w h i c h it has no p r io r samples o f its env i ronment . In this s i tuat ion, the t ransi t ion m o d e l in fo rmat ion f r o m a mentor m a y confer h i g h l y advanta-geous ins ights about the long- te rm consequences o f the observer ' s ac t ions . S i n c e the e x p l o -ra t ion o f l o n g sequences o f act ions is ext remely cost ly , mentor -der ived i n fo rma t ion about the effect o f extended sequences o f act ion can lead to dramat ic improvement s i n the observer ' s performance. A direct consequence o f acqu i r i ng k n o w l e d g e i n the f o r m o f t rans i t ion mode l s is the fact that the same k n o w l e d g e about the effect o f act ions c o u l d be used b y different agents i n different ways so as to best m a x i m i z e the agent's o w n object ives . Imi ta t ion as m o d e l trans-fer therefore does not require that the obse rv ing agent share the same reward func t ion as the mentor agent. O f course, i f the reward funct ions are s ign i f ican t ly different, the mentor and observer may largely occupy different regions o f the state space or have comple t e ly i n c o m -pat ib le po l i c i e s and therefore not possess any k n o w l e d g e o f c o m m o n interest. Imi ta t ion as t ransi t ion m o d e l transfer has add i t iona l i m p l i c a t i o n s c o n c e r n i n g an agent 's reward func t ion . C o n s i d e r the fact that an agent's cho i ce o f ac t ion depends not o n l y on the effects o f its act ions, but also on the rewards to be had i n states resu l t ing f r o m the effects o f its act ions. T rans i t ion m o d e l in format ion w o u l d be most benef ic ia l i n those regions where the agent also possesses reward m o d e l in fo rmat ion . There are two general treatments o f r eward models i n re inforcement l ea rn ing . In 42 one treatment, it is assumed that the agent (for our purposes, the observer) does not k n o w its reward func t ion , and must sample the rewards at each state. 
In the second treatment, it is assumed that the agent k n o w s the reward func t ion for every state i n its d o m a i n . T h e u t i l i t y o f m o d e l t ransi t ion in fo rmat ion w i l l undoubtedly be greater for an agent w i t h a k n o w n reward m o d e l for a l l states, as the agent w i l l be able to make m u c h better dec i s ions g i v e n the ab i l i t y to predict bo th the l o n g term effects o f actions and the va lue o f those effects. In a s suming that the observer k n o w s its o w n reward func t ion , w e g i v e up l i t t le o f prac t ica l va lue . T h e idea o f a k n o w n reward func t ion is consis tent w i t h the idea o f automatic p r o g r a m m i n g : the agent 's designer specifies the agent 's object ives i n the f o r m o f a reward funct ion and the agent finds a p o l i c y that m a x i m i z e s these object ives . A n y uncerta inty i n reward funct ions can be recast as uncertainty about state spaces. O n e can be certain about the state, but uncer ta in about the reward attached to it, or one can be uncer ta in about the state, but certain about the rewards attached to poss ib le states. T h e most sens ib le treatment o f reward for an agent u s ing imi ta t ion as t ransi t ion m o d e l transfer assumes the observer agent possesses a r eward m o d e l for a l l states i n its envi ronment . T h e assumpt ion that the observer k n o w s its reward m o d e l leads to another serendip-i tous result. Observa t ions o f the mentor p rov ide the observer w i t h add i t i ona l in fo rmat ion about the env i ronment t ransi t ion func t ion . T h i s add i t iona l t rans i t ion func t ion in fo rma t ion can be conver ted into i m p r o v e d value estimates for the observer ' s act ions. T h e observer is therefore better able to choose actions that w i l l lead to h igher returns. Important ly , h o w -ever, the agent on ly makes use o f the mentor ' s behav iour to i n f o r m its unders tanding o f the t ransi t ion d y n a m i c s . U n l i k e other approaches to imi t a t ion , the observer is not cons t ra ined to dupl ica te the mentor ' s act ions i f the observer is aware o f a better ac t ion p l an . T h e observer may end up d u p l i c a t i n g the mentor ' s behavior , but o n l y as a consequence o f ind i rec t reason-i n g about the best course o f act ion. A s w e feel this is a central and def in ing aspect o f our 43 m o d e l , w e have ca l l ed our approach implicit imitation. W e feel this te rm underscores the fact that any dup l i c a t i on o f mentor behav ior is mere ly part o f i n fo rmed va lue m a x i m i z a t i o n . Observable expert behavior U n d e r the imitat ion-as- t ransfer-of- t ransi t ion-model pa rad igm, the observer agent gains ad-d i t i o n a l i n fo rma t ion about the t ransi t ion m o d e l for its env i ronment f r o m a mentor. It f o l l o w s , that in order for this m o d e l to benefit the observer, there must exis t agents i n the observer ' s env i ronment w h o possess more expert ise than the observer for some part o f the p r o b l e m fac ing the observer. O u r assumpt ion that e x p l i c i t transfer o f k n o w l e d g e is infeas ib le imposes another constraint on observat ions . T h e act ion cho i ce o f the mentor agent, in the sense defined in re-inforcement l ea rn ing , cannot be d i rec t ly observed. 
T h e argument is as f o l l o w s : s ince more than one ac t ion may have the same effect on the envi ronment , the ident i ty o f an ac t ion can-not be k n o w n w i t h certainty through observat ions o f the mentor ' s effect o n its env i ronment . W e take these effects to i nc lude the pos i t i on o f its effectors. 1 E s sen t i a l l y , the mentor ' s ac-t ion choices cor respond to its intentions and these c o u l d o n l y be transferred th rough e x p l i c i t c o m m u n i c a t i o n w h i c h w e have already de termined is infeasible , o r w e w o u l d not need i m -i ta t ion techniques to beg in w i t h . F o r the purposes o f the w o r k w e w i l l take up short ly, one further constraint w i l l be i m p o s e d o n the observat ions made b y observers. F r o m re inforcement l ea rn ing theory, w e k n o w that the p r o b l e m o f learn ing to con t ro l an env i ronment is cons ide rab ly s i m p l i f i e d when the agent can comple t e ly and u n a m b i g u o u s l y observe the state o f the sys tem that it is a t tempt ing to con t ro l . In i t i a l ly , w e w i l l therefore assume that the observer i s l ea rn ing to con t ro l a comple t e ly observable M D P and that the mentor ' s state ( though not its act ions) 1 Consider an agent who is attempting to force open a door. The agent and its effectors may appear stationary, but surely the agent is taking an action quite distinct from resting near the door. 44 are f u l l y observable . In Chapter 7, this assumpt ion w i l l be r eexamined and a de r iva t ion w i l l be p r o v i d e d s h o w i n g h o w our pa rad igm migh t be extended to par t i a l ly observable e n v i r o n -ments. Observer-mentor state-space correlation T h u s far, w e have ta lked about t w o k inds o f k n o w l e d g e : the observer ' s d i rect exper ience w i t h ac t ion choices and their resultant state t ransi t ions, and k n o w l e d g e acqu i red b y obser-va t ion o f mentor state transi t ions, wi thou t s ay ing any th ing about the re la t ionsh ip between them. In many imi t a t ion learn ing paradigms, an e x p l i c i t teaching p a r a d i g m is assumed i n w h i c h a mentor demonstrates a task and then the task is reset and the observer attempts the task. It is assumed that the mentor and imi ta tor interact w i t h the env i ronment i n iden t i ca l ways . G i v e n f u l l obse rvab i l i ty o f the mentor, the mentor ' s state can be d i rec t ly ident i f ied w i t h co r respond ing observer states. T h e t ransi t ion m o d e l i n fo rma t ion ga ined by obse rv ing mentor t ransi t ions can be seen to have direct and straightforward bear ing on the observer ' s t ransi t ion mode ls for the cor responding state. In our m o d e l , however , w e w o u l d l i k e to shift to a perspect ive i n w h i c h m u l t i p l e agents occupy the same envi ronment . These agents may or m a y not interact w i t h the e n v i -ronment i n the same way (e.g., they may have different effectors) and they m a y or may not face exact ly the same p r o b l e m (i.e., they may have different b o d y states). There may also be m u l t i p l e mentors w i t h v a r y i n g degrees o f re levance i n different regions o f the state space. In this more general sett ing, it is clear that observat ions made o f the mentor ' s state transi t ions can offer no useful in fo rma t ion to the observer unless there is some re la t ionsh ip between the state o f the observer and mentor. 
There are several ways i n w h i c h the re la t ionship between state spaces can be spec-i f ied . Dau tenhahn and N e h a n i v (1998) use a h o m o m o r p h i s m to define the re la t ionsh ip be-45 tween mentor and observer for a specific f a m i l y o f trajectories. In its mos t general f o rm, our approach can learn f rom any stochastic process for w h i c h a h o m o m o r p h i s m can be c o n -structed between the state and transit ions o f the process and the state and t ransi t ions o f the obse rv ing agent. O u r no t ion o f a mentor is therefore qui te broad. In order to s imp l i fy our m o d e l and a v o i d undue attention to the (admit tedly impor -tant) top ic o f cons t ruc t ing sui table ana log ica l mappings , we w i l l i n i t i a l l y assume that the mentor and the observer have " i d e n t i c a l " state spaces; that is , Sm and i S 0 are i n some sense i somorph i c . T h e precise sense i n w h i c h the spaces are i s o m o r p h i c — o r i n some cases, pre-sumed to be i s o m o r p h i c un t i l p roven o the rwise—is elaborated b e l o w w h e n w e discuss the re la t ionsh ip between agent abi l i t ies . T h u s f rom this po in t w e s i m p l y refer to the state S w i t h -out d i s t i ngu i sh ing the mentor ' s l o c a l space Sm f r o m the observer ' s S0. Analogous Abilities E v e n w i t h a m a p p i n g between states, observat ions o f a mentor ' s state t ransi t ions o n l y te l l the observer someth ing about the mentor ' s abi l i t ies , not its o w n . W e must assume that the observer can i n some way "dup l i ca te" the actions taken b y the mentor to induce analogous transi t ions i n its o w n l o c a l state space. In other words , there must be some p resumpt ion that the mentor and the observer have s im i l a r ac t ion capabi l i t ies . It is i n this sense that the ana-l o g i c a l m a p p i n g between state spaces can be taken to be a h o m o m o r p h i s m . Spec i f i ca l ly , w e migh t assume that the mentor and the observer have the same act ions ava i l ab le to them (i.e., A m — A0 = A) and that there exists a h o m o m o r p h i c m a p p i n g h : Sm —> S0 w i t h respect to T ( - , a , •) for a l l a £ A. In pract ice, this requirement can be weakened substant ia l ly wi thou t d i m i n i s h i n g its u t i l i ty . W e o n l y require that there exists at least one ac t ion i n the observer ' s ac t ion set for state s w h i c h w i l l dupl ica te the effects o f the ac t ion ac tua l ly taken b y the m e n -tor i n state s (the mentor ' s other poss ib le act ions are i rrelevant) . F i n a l l y , w e m i g h t have an 46 observer that assumes that it can dupl ica te the act ions taken b y the mentor u n t i l it finds ev-idence to the contrary. In this case, there is a presumed homomorphism be tween the state spaces. In what f o l l o w s , we w i l l d i s t i ngu i sh between i m p l i c i t im i t a t i on i n homogeneous ac-tion settings—domains in w h i c h the ana log ica l m a p p i n g is indeed h o m o m o r p h i c — a n d i n heterogeneous action settings—where the m a p p i n g may not be a h o m o m o r p h i s m . There are more general ways o f def in ing s imi l a r i t y o f ab i l i ty , for example , b y as-s u m i n g that the observer may be able to m o v e through state space i n a s i m i l a r fash ion to the mentor wi thou t f o l l o w i n g the same trajectories ( N e h a n i v & Dau tenhahn , 1998). 
For instance, the mentor may have a way of moving directly between key locations in state space, while the observer may be able to move between analogous locations in a less direct fashion. In such a case, the analogy between states may not be determined by single actions, but rather by sequences of actions or local policies. We will suggest ways of dealing with restricted forms of this in Section 4.2.

3.6 Non-interacting Games

The previous section outlined a basic model and a set of assumptions for an imitation learning framework. In Chapters 4 and 5, we will translate this framework into specific algorithms. The translation will be based on the notion of the stochastic game defined in Chapter 2. To simplify the development of algorithms, however, we will initially consider a special class of stochastic games in which strategic interactions need not be considered (we will return to games with interactions in Chapter 7).

We define noninteracting stochastic games by appealing to the notion of an agent projection function, which is used to extract an agent's local state from the underlying game. Informally, an agent's local state determines all aspects of the global state that are relevant to its decision making process, while the projection function determines which global states are identical from this local perspective. Formally, for each agent i, we assume a local state space S_i and a projection function L_i : S -> S_i. For any s, t \in S, we write s ~_i t iff L_i(s) = L_i(t). This equivalence relation partitions S into a set of equivalence classes such that the elements within a specific class (i.e., L_i^{-1}(s_i) for some s_i \in S_i) need not be distinguished by agent i for the purposes of individual decision making. We say a stochastic game is noninteracting if there exists a local state space S_i and a projection function L_i for each agent i such that:

1. If s ~_i t, then for all a_i \in A_i, a_{-i} \in A_{-i}, and w_i \in S_i, we have
   \sum_{u \in L_i^{-1}(w_i)} T^{s, a_i, a_{-i}}(u) = \sum_{u \in L_i^{-1}(w_i)} T^{t, a_i, a_{-i}}(u)

2. R_i(s) = R_i(t) if s ~_i t

Intuitively, condition 1 above imposes two distinct requirements on the game from the perspective of agent i. First, if we ignore the existence of other agents, it provides a notion of state space abstraction suitable for agent i. Specifically, L_i clusters together states s \in S only if each state in an equivalence class has identical dynamics with respect to the abstraction induced by L_i. This type of abstraction is a form of bisimulation of the type studied in automaton minimization (Hartmanis & Stearns, 1966; Lee & Yannakakis, 1992) and in automatic abstraction methods developed for MDPs (Dearden & Boutilier, 1997; Dean & Givan, 1997). It is not hard to show, ignoring the presence of other agents, that the underlying system is Markovian with respect to the abstraction (or equivalently, w.r.t. S_i) if condition 1 is met.
The quantification over all a_{-i} imposes a strong noninteraction requirement, namely, that the dynamics of the game from the perspective of agent i be independent of the strategies of the other agents. Condition 2 simply requires that all states within a given equivalence class for agent i have the same reward for agent i. This means that no states within a class need to be distinguished; each local state can be viewed as atomic.

The strong noninteraction constraints of this model are consistent with a broad range of scenarios in which an observer inhabits a multiagent world and benefits from observations of other agents, but is performing a task that can be solved without considering interaction with those agents. One might consider the problem of controlling a machine or solving a specific kind of planning problem. Other agents' experience may be relevant, but the actual task assigned to the agent does not involve any multiagent interactions. We discuss possible extensions of the imitation model to tasks which include multiagent interactions in Chapter 7.

A noninteracting game induces an MDP M_i for each agent i, where M_i = (S_i, A_i, T_i, R_i) and T_i is given by condition (1) above. Specifically, for each s_i, t_i \in S_i:

T_i^{s_i, a_i}(t_i) = \sum_{t \in L_i^{-1}(t_i)} T^{s, a_i, a_{-i}}(t)

where s is any state in L_i^{-1}(s_i) and a_{-i} is any element of A_{-i}. Let \pi_i : S_i -> A_i be an optimal policy for M_i. We can extend this to a strategy over S for the underlying stochastic game by simply applying \pi_i(s_i) to every state s \in S such that L_i(s) = s_i. Since every state s \in S such that L_i(s) = s_i has the same reward and transition probabilities with respect to agent i, and the reward and transition probabilities of agent i are independent of the states and actions of other agents, the resulting policy is dominant for player i. Thus each agent can solve the noninteracting game by abstracting away irrelevant aspects of the state space, ignoring other agents' actions, and solving its "personal" MDP M_i.

Given an arbitrary stochastic game, it can generally be quite difficult to discover whether it is noninteracting, requiring the construction of appropriate projection functions. For the purposes of developing algorithms in the next chapter, we will simply assume that the underlying multiagent system is a noninteracting game. Rather than specifying the game and projection functions, we will specify the individual MDPs M_i themselves. The noninteracting game induced by the set of individual MDPs is simply the "cross product" of the individual MDPs. Such a view is often quite natural. Consider the example of three robots moving in some two-dimensional office domain. If we are able to neglect the possibility of interaction, for example, if the robots can occupy the same 2-D position (at a suitable level of granularity) and do not require the same resources to achieve their tasks, then we might specify an individual MDP for each robot. The local state might be determined by the robot's x, y-position, orientation, and the status of its own tasks. The global state space would be the cross product S_1 x S_2 x S_3 of the local spaces. The individual components of any joint action would affect only the local state, and each agent would care (through its reward function R_i) only about its local state.
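To make the induced local MDP concrete, the following minimal Python sketch aggregates a global transition function through a projection function, following the definition of T_i above. It assumes the global model is given as a dictionary keyed by (global state, a_i, a_{-i}); the names (induced_local_model, global_T, L_0) are ours, not the thesis's, and the tiny example at the bottom is purely illustrative.

from collections import defaultdict

def induced_local_model(global_T, L_i):
    """global_T[(s, a_i, a_minus_i)] -> {t: prob}; L_i maps a global state to a local state.

    For a noninteracting game the result is independent of the representative
    global state s and of the a_minus_i chosen for each (s_i, a_i), so we simply
    keep the first representative encountered.
    """
    T_i = {}
    for (s, a_i, a_minus_i), dist in global_T.items():
        key = (L_i(s), a_i)
        if key in T_i:                       # any representative gives the same answer
            continue
        local_dist = defaultdict(float)
        for t, p in dist.items():
            local_dist[L_i(t)] += p          # aggregate probability over L_i^{-1}(t_i)
        T_i[key] = dict(local_dist)
    return T_i

# Example: two agents on a line of length 3; L_0 extracts agent 0's position,
# and agent 1's position is irrelevant to agent 0's dynamics.
if __name__ == "__main__":
    def L_0(s):                              # s = (pos_agent0, pos_agent1)
        return s[0]
    global_T = {((x, y), "right", "any"): {(min(x + 1, 2), y): 1.0}
                for x in range(3) for y in range(3)}
    print(induced_local_model(global_T, L_0))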
We note that the projection function L_i should not be viewed as equivalent to an observation function. We do not assume that agent i can only distinguish elements of S_i; in fact, observations of other agents' states will be crucial for imitation. Rather, the existence of L_i simply means that, from the point of view of decision making with a known model, the agent need not worry about distinctions other than those made by L_i. Assuming no computational limitations, an agent i need only solve M_i, but it may use observations of other agents in order to improve its knowledge about M_i's dynamics.

Chapter 4

Implicit Imitation

The implicit imitation framework is built around the metaphor of an agent interacting with its environment. In this metaphor, the agent makes observations of the environment's current state and chooses actions likely to influence the future state of the environment in a desirable way. Like a reinforcement learner, the implicit imitator is uncertain of the effects of its actions and must experiment within its environment to improve its knowledge. The problem is difficult because the correct action for the agent to take at some state s may depend on what action choices are open to the agent in some distant successor state s'. Generally, the agent will not initially know its action capabilities in s'. Unlike a reinforcement learning agent, however, an implicit imitation agent may have access to state transition histories at the future state s' generated from observations of an expert mentor agent. This additional knowledge can be used to help the imitator decide whether to pursue actions that lead to s'. Imitation therefore reduces the search requirements. In the best of cases, the reduction in search can lead to an exponential reduction in the time required to learn a reasonable policy for a given problem.

In the previous chapter, we saw that the conception of imitation as transfer of knowledge about state transition probabilities is a new and provocative perspective. As we will see more fully in this chapter, it also has a number of novel implications for learners. Implicit imitators need not share reward functions with mentors, can integrate examples from many mentors, and are able to evaluate observed behaviours at a fine-grained level using explicit expected value calculations. In the previous chapter, techniques for representing transition models and integrating observations from various sources into these representations were glossed over.
The next two sections derive specific algorithms for implicit imitation in two settings: the homogeneous action setting, in which the action capabilities of the observer and mentor are identical, and the heterogeneous action setting, in which their capabilities differ. The algorithms are demonstrated in a set of simple domains carefully chosen to illustrate specific issues in imitation learning.

4.1 Implicit Imitation in Homogeneous Settings

We begin by describing implicit imitation in homogeneous action settings; the extension to heterogeneous settings will build on the insights developed in this section. First, we define the homogeneous setting. Then we develop the implicit imitation algorithm. Finally, we demonstrate how implicit imitation in a homogeneous setting works on a number of simple problems designed to illustrate the role of the various mechanisms we describe.

4.1.1 Homogeneous Actions

The homogeneous action setting is defined as follows. We assume a single mentor m and observer o, with individual MDPs M_m = (S, A_m, T_m, R_m) and M_o = (S, A_o, T_o, R_o), respectively. (The extension to multiple mentors is straightforward and is discussed later.) Note that the agents share the same state space (more precisely, we assume a trivial isomorphic mapping that allows us to identify their local states). We also assume that the mentor has been pretrained and is executing some stationary policy \pi_m. We will often treat this policy as deterministic, but most of our remarks apply to stochastic policies as well. (The observer must have a deterministic policy with value greater than or equal to that of the mentor's stochastic policy. We know there exists an optimal deterministic policy for Markov decision processes, so we know the observer can find such a policy in principle.)

Let the support set Supp(\pi_m, s) for \pi_m at state s be the set of actions a \in A_m accorded nonzero probability by \pi_m at state s. We assume that the observer has the same abilities as the mentor in the following sense: for all s, t \in S and a_m \in Supp(\pi_m, s), there exists an action a_o \in A_o such that T(s, a_o)(t) = T(s, a_m)(t). In other words, the observer has an action whose outcome distribution matches that of the mentor's action. Initially, however, the observer may not know which action(s) will achieve this distribution.

The condition above essentially defines the agents' local state spaces to be isomorphic with respect to the actions actually taken by the mentor at the subset of states where those actions might be taken. This is much weaker than requiring a full homomorphism from S_m to S_o. Of course, the existence of a full homomorphism is sufficient from our perspective; but our results do not require this.

4.1.2 The Implicit Imitation Algorithm

The implicit imitation algorithm can be understood in terms of its component processes. In addition to learning from its own transitions, the observer observes the mentor in order to acquire additional information about possible transition dynamics at the states the mentor visits. The mentor follows some policy that need not be better than the observer's own, but it will generally be more useful to the observer if it is better than the observer's current policy for at least a subset of the states in the problem.
The transition dynamics observed for the mentor at state s constitute a model of the mentor's action in state s. We call the building of a model from mentor observations action model extraction. We can integrate this information into the observer's own value estimates by augmenting the usual Bellman backup with mentor action models. A confidence testing procedure ensures that we only use this augmented model when the model implicitly extracted from the mentor is more reliable than the observer's model of its own behavior. We also extract occupancy information from the observer's observations of mentor trajectories in order to focus the observer's computational effort (to some extent) on specific parts of the state space. Finally, we augment our action selection process to choose actions that will explore the high-value regions revealed by the mentor. The remainder of this section expands upon each of these processes and explains how they fit together.

Model Extraction

The information available to the observer in its quest to learn how to act optimally can be divided into two categories. First, with each action it takes, it receives an experience tuple (s_o, a_o, r_o, t_o), where a_o is the observer's action, s_o and t_o represent the observer's state transition, and r_o is the observer's reward. We will ignore the sampled reward r_o, since we assume the reward function R is known in advance. As in standard model-based learning, each such experience can be used to update the observer's own transition model T_o(s_o, a_o, .). Second, with each mentor transition, the observer obtains an experience tuple (s_m, t_m). Note that the observer does not have direct access to the action taken by the mentor, only to the induced state transition (see Chapter 3).

Assume the mentor is implementing a deterministic, stationary policy \pi_m, with \pi_m(s) denoting the mentor's choice of action at state s. Let T be the true transition model for the observer's environment. Then the mentor's policy \pi_m induces a Markov chain T_m(., .) over S, with T_m(s, t) = T(s, \pi_m(s), t) denoting the probability of a transition from s to t in the chain. (This is somewhat imprecise, since the initial distribution of the Markov chain is unknown. For our purposes, only the dynamics are relevant to the observer, so only the transition probabilities are used.) Since the learner observes the mentor's state transitions, it can construct an estimate \hat{T}_m of this chain: \hat{T}_m(s, t) is simply estimated by the relative observed frequency of mentor transitions s -> t (with respect to all observed transitions taken from s). If the observer has some prior over the possible mentor transitions, standard Bayesian update techniques can be used instead. (A Bayesian approach based on priors and incremental model updates is developed in Chapter 5.) We use the term model extraction for this process of estimating the mentor's Markov chain.
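As a concrete illustration, the following Python sketch maintains the transition counts for the mentor chain (together with optional prior counts acting as "fictitious" samples) and returns the point estimate of T_m(s, t). The class and method names are ours; the thesis specifies only the counting scheme and the frequency estimator.

from collections import defaultdict

class MentorChainEstimator:
    def __init__(self, prior_counts=None):
        # counts[s][t] holds the prior count plus the number of observed mentor transitions s -> t
        self.counts = defaultdict(lambda: defaultdict(float))
        if prior_counts:
            for s, dist in prior_counts.items():
                for t, c in dist.items():
                    self.counts[s][t] = c

    def observe(self, s, t):
        """Record one observed mentor transition s -> t."""
        self.counts[s][t] += 1.0

    def estimate(self, s, t):
        """Point estimate of T_m(s, t): the mean of the Dirichlet posterior."""
        total = sum(self.counts[s].values())
        return self.counts[s][t] / total if total > 0 else 0.0

# Example: three observed transitions out of state 'A'
est = MentorChainEstimator()
for nxt in ['B', 'B', 'C']:
    est.observe('A', nxt)
print(est.estimate('A', 'B'))   # 2/3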
Augmented Bellman Backups

Suppose the observer has constructed an estimate \hat{T}_m of the mentor's Markov chain. By construction, T_m(s, t) = T(s, \pi_m(s), t). By the homogeneity assumption, the successor state distribution of the mentor's action \pi_m(s) can be replicated by the observer at state s by some action in A_o. Thus, the policy \pi_m can, in principle, be duplicated by the observer (were it able to identify the actual actions used). As such, we can define the value of the mentor's policy from the observer's perspective:

V_m(s) = R_o(s) + \gamma \sum_{t \in S} T_m(s, t) V_m(t)    (4.1)

Notice that Equation 4.1 uses the mentor's dynamics T_m, but the observer's reward function R_o. Let V denote the observer's optimal value function. Since the observer is assumed to have an action capable of duplicating the mentor's dynamics T_m(s, t), it will always be the case that V(s) >= V_m(s). V_m therefore provides a lower bound on V. More importantly, the terms making up V_m(s) can be integrated directly into the Bellman equation for the observer's MDP, forming the augmented Bellman equation:

V(s) = R_o(s) + \gamma \max \left\{ \max_{a \in A_o} \sum_{t \in S} T_o(s, a, t) V(t), \; \sum_{t \in S} T_m(s, t) V(t) \right\}    (4.2)

This is the usual Bellman equation with an extra term added, namely, the second summation \sum_{t \in S} T_m(s, t) V(t), denoting the expected value of duplicating the mentor's action a_m. Since this (unknown) action is identical to one of the observer's actions, the term is redundant and the augmented value equation is valid. We call the improved value estimates produced by the augmented Bellman equation augmented values.

During online learning, an observer will have only estimated versions of the transition models. If the observer's exploration policy ensures that each state-action pair is visited infinitely often, the estimates of the T_o terms will converge to their true values. If the mentor's policy is ergodic with respect to S (i.e., its policy ensures that the mentor will visit every state in S), then the estimate of T_m will also converge to its true value. If the mentor's policy is restricted to a subset of states S' \subset S (i.e., those states with nonzero probability in the stationary distribution of its Markov chain), then the estimates of T_m for the subset will converge correctly with respect to S' if the chain is ergodic. The states in S - S' will remain unvisited and their estimates will remain uninformed by data. Since the mentor's policy is not under the control of the observer, there is no way for the observer to influence the distribution of samples attained for T_m. An observer must therefore be able to reason about the accuracy of the estimated model \hat{T}_m at any s and restrict the application of the augmented equation to those states where T_m is known with sufficient accuracy.
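The augmented backup of Equation 4.2 translates directly into code. The sketch below assumes the estimated models are stored as nested dictionaries and omits the confidence test introduced in the next subsection; all identifiers are our own, and the numbers in the example are purely illustrative.

def augmented_backup(s, V, R_o, T_o, T_m, gamma=0.9):
    # Best value achievable under the observer's own estimated action models.
    own = max(sum(p * V[t] for t, p in T_o[s][a].items()) for a in T_o[s])
    # Expected value of duplicating the mentor's (unknown) action at s, if any mentor data exists.
    mentor = sum(p * V[t] for t, p in T_m[s].items()) if s in T_m else float('-inf')
    return R_o[s] + gamma * max(own, mentor)

# Example with two states and one observer action:
V = {'s': 0.0, 'g': 1.0}
R_o = {'s': 0.0, 'g': 1.0}
T_o = {'s': {'a0': {'s': 0.9, 'g': 0.1}}}
T_m = {'s': {'g': 1.0}}                      # the mentor reliably reaches the goal from s
print(augmented_backup('s', V, R_o, T_o, T_m))   # mentor term dominates: 0.9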
While \hat{T}_m cannot be used indiscriminately, we argue that it can be highly informative early in the learning process. Early in learning, the observer devotes its action choices to exploration. In the absence of guidance, every action in every state must be sampled in order for the observer to gain sufficient information to compute the best policy. The mentor, by contrast, is pursuing an optimal policy designed to exploit its knowledge in order to maximize return with respect to R_m. An observer learning from the transitions of such a mentor would find that the mentor transitions are often constrained to a much smaller subspace of the problem and represent the effects of a single action. Typically, the smaller set of states and the choice of a single action in each state cause the estimates of T_m(s, t) in regions relevant to the domain to converge much more rapidly to their true values than the corresponding observer estimates T_o(s, a, t). The use of T_m(s, t) can therefore often give more informed value estimates at state s, which in turn allow the observer to make better informed choices earlier in learning.

We note that the reasoning above holds even if the mentor is implementing a (stationary) stochastic policy (since the expected value of a stochastic policy for a fully observable MDP cannot be greater than that of an optimal deterministic policy). While the "direction" offered by a mentor implementing a deterministic policy tends to be more focussed, empirically we have found that mentors offer broader guidance in moderately stochastic environments or when they implement stochastic policies, since they tend to visit more of the state space. We note that the extension to multiple mentors is straightforward: each mentor model can be incorporated into the augmented Bellman equation without difficulty.

Model Confidence

When the mentor's Markov chain is not ergodic, or even if the mixing rate is sufficiently low, the mentor may visit a certain state s relatively infrequently. The estimated mentor transition model corresponding to a state that is rarely (or never) visited by the mentor may provide a very misleading estimate, based on the small sample or on the prior for the mentor's chain, of the value of the mentor's (unknown) action at s; and since the mentor's policy is not under the control of the observer, this misleading value may persist for an extended period. Since the augmented Bellman equation does not consider the relative reliability of the mentor and observer models, the value of such a state s may be overestimated. (Note that underestimates based on such considerations are not problematic, since in that case the augmented Bellman equation reduces to the usual Bellman equation.) That is to say, the observer can be tricked into overvaluing the mentor's (unknown) action, and consequently overestimating the value V(s) of state s.

To overcome this, we have incorporated an estimate of model confidence into our augmented backups. Our confidence model makes use of a Dirichlet distribution over the parameters of both the mentor's Markov chain T_m and the observer's action models T_o (DeGroot, 1975). The mean of the Dirichlet distribution corresponds to the familiar point estimate of the transition probability, namely T_o(s, a, t) and T_m(s, t).
However, the Dirichlet distribution actually defines a complete probability distribution over the possible values of the transition probability between states s and t (i.e., Pr(T_o(s, a, t)) and Pr(T_m(s, t))). The initial distribution over possible transition probabilities is given by a prior distribution and reflects the observer's initial beliefs about which transitions are possible and to what degree. The initial distribution is then updated with observations of mentor and observer transitions to get a more accurate posterior distribution.

The Dirichlet distribution is parameterized by counts of the form n_o^{s,a}(t) and n_m^s(t), denoting the number of times the observer has been observed making the transition from state s to state t when action a was performed, and the number of times the mentor was observed making the transition from state s to t while following a stationary policy. In an online application, the agent updates the appropriate count after each observed transition. Prior counts c_o^{s,a}(t) and c_m^s(t) can be added to the Dirichlet parameters to represent an agent's prior beliefs about the probability of various transitions in terms of a number of "fictitious" samples.

Given the complete distribution over transition model parameters provided by the Dirichlet distribution, we could attempt to calculate a distribution over possible state values V(s); however, there is no simple closed-form expression for state values as a function of the agent's estimated models. We could attempt to employ sampling methods here, but in the interest of simplicity we have employed an approximate method for combining information sources inspired by Kaelbling's (1993) interval estimation method. This method uses an expression for the variance in the estimate of a transition model parameter to represent uncertainty. The variance in the estimate of a transition model parameter is given by a simple function of the Dirichlet parameters (see Chapter 2, Section 2.3).

For the observer's transition model, let \alpha = n_o^{s,a}(t) + c_o^{s,a}(t) and \beta = \sum_{t' \in S, t' \neq t} n_o^{s,a}(t') + c_o^{s,a}(t'). Then the variance \sigma_o^{s,a}(t)^2 in the estimate of the transition model parameter is given by

\sigma_o^{s,a}(t)^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}    (4.3)

The variance in the estimate of the transition model derived from mentor observations is analogous. Let \alpha = n_m^s(t) + c_m^s(t) and \beta = \sum_{t' \in S, t' \neq t} n_m^s(t') + c_m^s(t'). Then the variance \sigma_m^s(t)^2 in the estimate of the transition model parameter is

\sigma_m^s(t)^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}    (4.4)

The variance of the transition model parameters can be used to compare the observer's confidence in the value estimates derived from the observer's own transition model T_o with the value estimates derived from the transition model inferred from mentor observations, T_m. Intuitively, we compute the variance in the value estimate due to variance in the model used to compute the estimate. This can be found by simple application of the rule for linear combinations of variances, Var(cX + dY) = c^2 Var(X) + d^2 Var(Y), to the expression for the Bellman backup, R(s) + \gamma \sum_t T(s, a, t) V(t).
T h i s can be found b y s i m p l e app l i ca t i on o f the ru le for l inear combina t ions o f var iances, Var(cX + dY) = c2Var{X) + d2Var{Y), to the express ion for the B e l l m a n backup , R(s) + 7 ^2t T(t, s, a)V{t). N o t e that the te rm R(s) 7 See Chapter 2, Section 2.3. 59 is independent o f the var iance i n the r a n d o m var iab le T(s, a, t) and the terms 7 and V(t) act l i k e a scalar mu l t ip l i e r s o f the var iance i n the r a n d o m var iab le T(s, a, t). T h e resultant express ion g ives the var iance i n the va lue estimate V ( s ) due to uncer ta inty i n the l o c a l tran-s i t ion m o d e l : ^ 0 ( ^ « ) = 72E<aw2uW2 (4-5) t T h e var iance i n the estimate o f the value o f state s ca lcu la ted f r o m the est imated mentor t ransi t ion m o d e l proceeds i n a s im i l a r fashion: *L(*) = 72£<WM02 w t U s i n g the var iances i n va lue estimates w e can compute a heur is t ic va lue i nd i ca t i ng the observer ' s re la t ive confidence i n the value estimates c o m p u t e d f r o m observer and m e n -tor observat ions. A n augmented B e l l m a n backup w i t h respect to V u s i n g conf idence test ing proceeds as f o l l o w s . W e first compute the observer ' s o p t i m a l ac t ion a* based on the est i -mated augmented values for each o f the observer ' s act ions. L e t Q(a*0, s) = V0(s) denote the va lue o f the o p t i m a l ac t ion as ca lcula ted f rom the observer ' s t rans i t ion m o d e l T0. T h e var iance due to m o d e l uncertainty can then be used to construct a l o w e r b o u n d V~ (s) i n terms o f the standard dev ia t ion o f the va lue estimate avo ( s ) : V-(s) = V0(s)-caV0(s) (4.7) T h e va lue c is chosen so that the l o w e r b o u n d V~ (s) has sufficient p robab i l i t y o f b e i n g less that the poss ib le true va lue o f V0 (s ) . U s i n g C h e b y c h e v ' s inequal i ty , w h i c h states 60 v o v M F i g u r e 4 .1 : V0 and Vm represent values estimates de r ived f rom observer and mentor expe-r ience respect ively. T h e l o w e r bounds V~ and V~ incorporate an "uncer ta inty pena l ty" as-sociated w i t h these estimates. that 1 - ^ o f the p robab i l i t y mass for an arbitrary d i s t r ibu t ion w i l l be w i t h i n c standard de-via t ions o f the mean, we can obtain any desi red confidence l e v e l even though the D i r i c h l e t d is t r ibut ions for s m a l l sample counts are h i g h l y non-no rma l . W e used c=5 to ob ta in approx-imate ly 9 6 % conf idence i n our exper iments , though this confidence l e v e l can be set to any suitable value. S i m i l a r l y , the var iance i n the value estimate due to the m o d e l b e i n g updated w i t h mentor observat ions can be used to construct a l o w e r b o u n d for the va lue estimate de-r i v e d f rom mentor observat ions , V~ (s). O n e may interpret the subtract ion o f a m u l t i p l e o f the standard dev ia t ion f r o m the va lue estimate as penal ty co r re spond ing to the observer ' s "uncer ta in ty" i n the va lue estimate (see F i g u r e 4 .1 ) . 8 G i v e n the l o w e r bounds on the va lue estimates const ructed f r o m mode l s est imated f r o m mentor and observer observat ions , w e may then per form the f o l l o w i n g test: I f V~ (s) > V~(s), then w e say that V0(s) supersedes Vm(s) and w e wr i t e V0(s) y Vm(s). 
When V_o(s) \succ V_m(s), then either the mentor-inspired model has, in fact, a lower expected value (within a specified degree of confidence) and uses a nonoptimal action (from the observer's perspective), or the mentor-inspired model has lower confidence. In either case, we reject the information provided by the mentor and use a standard Bellman backup using the action model derived solely from the observer's experience (thus suppressing the augmented backup). Specifically, we define the backed-up value to be V_o(s) in this case.

An algorithm for computing an augmented backup using this confidence test is shown in Table 4.1. The algorithm parameters include the current estimate of the augmented value function V, the current estimated model T_o and its associated local variance \sigma_o^2, and the model of the mentor's Markov chain T_m and its associated variance \sigma_m^2. It calculates lower bounds and returns the mean value, V_o or V_m, with the greater lower bound. The parameter c determines the width of the confidence interval used in the mentor rejection test.

FUNCTION augmentedBackup(V, T_o, \sigma_o^2, T_m, \sigma_m^2, s, c)
    a^* = argmax_{a \in A_o} \sum_{t \in S} T_o(s, a, t) V(t)
    V_o(s) = R_o(s) + \gamma \sum_{t \in S} T_o(s, a^*, t) V(t)
    V_m(s) = R_o(s) + \gamma \sum_{t \in S} T_m(s, t) V(t)
    \sigma_{V_o}(s, a^*)^2 = \gamma^2 \sum_{t \in S} \sigma_o^{s,a^*}(t)^2 V(t)^2
    \sigma_{V_m}(s)^2 = \gamma^2 \sum_{t \in S} \sigma_m^s(t)^2 V(t)^2
    V_o^-(s) = V_o(s) - c * \sigma_{V_o}(s, a^*)
    V_m^-(s) = V_m(s) - c * \sigma_{V_m}(s)
    IF V_o^-(s) > V_m^-(s) THEN
        V(s) = V_o(s)
    ELSE
        V(s) = V_m(s)
    END

Table 4.1: An algorithm to compute the augmented Bellman backup

Focusing

In addition to using mentor observations to augment the quality of its value estimates, an observer can exploit its observations of the mentor to focus its attention on the states visited by the mentor. In a model-based approach, the specific focusing mechanism we adopt requires the observer to perform a (possibly augmented) Bellman backup at state s whenever the mentor makes a transition from s. This has three effects. First, if the mentor tends to visit interesting regions of the space (e.g., if it shares a certain reward structure with the observer), then the significant values backed up from mentor-visited states will bias the observer's exploration towards these regions. Second, computational effort will be concentrated toward parts of the state space where the estimated model T_m(s, t) changes, and hence where the estimated value of one of the observer's actions may change. Third, computation is focused where the model is likely to be more accurate (as discussed above).

Action Selection

The integration of exploration techniques into the action selection policy is important for any reinforcement learning algorithm to guarantee convergence. In implicit imitation, it plays a second, crucial role in helping the agent exploit the information extracted from the mentor. Our improved convergence results rely on the greedy quality of the exploration strategy to bias an observer towards the higher-valued regions of the state space revealed by the mentor.
For expediency, we have adopted the ε-greedy action selection method, using an exploration rate ε that decays over time (we could easily employ other semi-greedy methods such as Boltzmann exploration). In the presence of a mentor, greedy action selection becomes more complex. The observer examines its own actions at state s in the usual way and obtains a best action a_o^* which has a corresponding value V_o(s). A value is also calculated for the mentor's action, V_m(s). If V_o(s) \succ V_m(s), then the observer's own action model is used and the greedy action is defined exactly as if the mentor were not present. If, however, V_o(s) does not supersede V_m(s), then we would like to define the greedy action to be the action dictated by the mentor's policy at state s. Unfortunately, the observer does not know which action this is, so we define the greedy action to be the observer's action "closest" to the mentor's action according to the observer's current model estimates at s. More precisely, the action most similar to the mentor's at state s, denoted \kappa_m(s), is that whose outcome distribution has minimum Kullback-Leibler divergence from the mentor's action outcome distribution:

\kappa_m(s) = \arg\min_{a \in A_o} \sum_{t \in S} T_m(s, t) \log \frac{T_m(s, t)}{T_o(s, a, t)}    (4.8)

The observer's own experience-based action models will be poor early in training, so there is a chance that the closest-action computation will select the wrong action. We rely on the exploration policy to ensure that each of the observer's actions is sampled appropriately in the long run. If the mentor is executing a stochastic policy, the test based on KL divergence can mislead the learner. A full treatment of action inference in a general setting can be handled more naturally in a Bayesian formulation of imitation. We consider this in Chapter 5.
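As an illustration of the closest-action computation of Equation 4.8, the short Python sketch below selects the observer action whose estimated outcome distribution has minimum KL divergence from the estimated mentor outcome distribution. The small smoothing constant used to keep the divergence finite when the observer has assigned zero probability to a successor is our own choice, not the thesis's.

import math

def closest_action(T_m_s, T_o_s, eps=1e-6):
    """T_m_s: dict t -> P_m(t | s); T_o_s: dict a -> dict t -> P_o(t | s, a)."""
    def kl(p, q):
        return sum(p_t * math.log(p_t / max(q.get(t, 0.0), eps))
                   for t, p_t in p.items() if p_t > 0.0)
    return min(T_o_s, key=lambda a: kl(T_m_s, T_o_s[a]))

# Example: the mentor mostly moves east, so action 'a1' is judged closest.
T_m_s = {'east': 0.9, 'south': 0.1}
T_o_s = {'a0': {'north': 0.9, 'east': 0.1},
         'a1': {'east': 0.85, 'south': 0.15}}
print(closest_action(T_m_s, T_o_s))   # 'a1'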
In an online reinforcement learning scenario, an observer agent will have limited opportunities to sample its environment and constraints on the amount of computation it can perform before it must act in the world. Both of these constraints put bounds on the accuracy of the Q-values an observer can hope to achieve. In the absence of high-quality estimates for the value of every action in every state, we would like to bias the values of the states along the mentor's trajectories so that they look worthwhile to explore. We do this by setting the initial Q-values over the entire space to a value less than the estimated lower bound on the value of any state in the problem. In our examples, rewards are strictly positive, so all state values are positive. We can therefore set the initial Q-value estimates to zero in order to make the values of unexplored states look worse than those for which the agent has evidence courtesy of the mentor. In this case, the greedy step in the exploration method will prefer actions that lead to mentor-visited states over actions for which the agent has no information.

Model Extraction in Specific Reinforcement Learning Algorithms

Model extraction, augmented backups, the focusing mechanism, and our extended notion of greedy action selection can be integrated into model-based reinforcement learning algorithms with relative ease. Generically, our implicit imitation algorithm requires that: (a) the observer maintain an estimate of its own model T_o(s, a, t) and of the mentor's Markov chain T_m(s, t), which are updated with every observed transition; and (b) all backups performed to estimate its value function use the augmented backup of Equation 4.2 with confidence testing. Of course, these backups are implemented using the estimated models \hat{T}_o(s, a, t) and \hat{T}_m(s, t). In addition, the focusing mechanism requires that an augmented backup be performed at any state visited by the mentor or observer.

We demonstrate the generality of these mechanisms by combining them with the well-known and efficient prioritized sweeping algorithm (Moore & Atkeson, 1993). Roughly, prioritized sweeping works by maintaining an estimated transition model T and reward model R. Whenever an experience tuple (s, a, r, t) is sampled, the estimated model at state s can change; a Bellman backup is performed at s to incorporate the revised model, and some (usually fixed) number of additional backups are performed at selected states. States are selected using a priority that estimates the potential change in their values based on the changes precipitated by earlier backups (see Section 2.2; also see Moore and Atkeson (1993) for further details). Essentially, computational resources (backups) are focused on those states that can most "benefit" from those backups. Incorporating our ideas into prioritized sweeping simply requires the following changes:

• With each transition (s, a, t) the observer takes, the estimated model T_o(s, a, t) is updated and an augmented backup is performed at state s. Augmented backups are then performed at a fixed number of states using the usual priority queue implementation.

• With each observed mentor transition (s, t), the estimated model T_m(s, t) is updated and an augmented backup is performed at s. Augmented backups are then performed at a fixed number of states using the usual priority queue implementation.

Keeping samples of mentor behavior implements model extraction. Augmented backups integrate this information into the observer's value function, and performing augmented backups at observed transitions (in addition to experienced transitions) incorporates our focusing mechanism. The observer is not forced to "follow" or otherwise mimic the actions of the mentor directly. But it does back up value information along the mentor's trajectory as if it had. Ultimately, the observer must move to those states to discover which actions are to be used; but in the meantime, important value information is propagated that can guide its exploration.
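The control flow of these two changes can be sketched as follows. The model object here is a hypothetical stand-in whose update and backup methods represent the estimators above and the confidence-tested backup of Table 4.1; the priority rule is simplified to the magnitude of the value change. None of these interfaces are specified by the thesis.

import heapq

def backup_at(s, model, V, pq):
    new_v = model.augmented_backup(s, V)          # confidence-tested backup at s (Table 4.1)
    delta = abs(new_v - V.get(s, 0.0))
    V[s] = new_v
    if delta > 0.0:
        for u in model.predecessors(s):           # states whose values may change next
            heapq.heappush(pq, (-delta, u))       # larger expected change = higher priority

def sweep(s, model, V, pq, n_backups=10):
    backup_at(s, model, V, pq)                    # always back up the state whose model changed
    for _ in range(n_backups):                    # then a fixed number of prioritized backups
        if not pq:
            break
        _, u = heapq.heappop(pq)
        backup_at(u, model, V, pq)

def on_observer_transition(s, a, t, model, V, pq):
    model.update_observer(s, a, t)                # update the estimate of T_o(s, a, .)
    sweep(s, model, V, pq)

def on_mentor_transition(s, t, model, V, pq):
    model.update_mentor(s, t)                     # update the estimate of T_m(s, .)
    sweep(s, model, V, pq)                        # focusing: backups at mentor-visited states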
Implicit imitation does not alter the long-run theoretical convergence properties of the underlying reinforcement learning algorithm. The implicit imitation framework is orthogonal to ε-greedy exploration, as it alters only the definition of the "greedy" action, not when the greedy action is taken. Given a theoretically appropriate decay factor, the ε-greedy strategy will thus ensure that the distributions for the action models at each state are sampled infinitely often in the limit and converge to their true values. By the homogeneity assumption, the model extracted from the mentor corresponds to one of the observer's own actions, so its effect on the value function calculations is no different than the effect of the observer's own sampled action models. The confidence mechanism ensures that the model with more samples will eventually come to dominate if it is, in fact, better. We can therefore rest assured that the convergence properties of reinforcement learning with implicit imitation are identical to those of the underlying reinforcement learning algorithm.

The benefit of implicit imitation lies in the way the models extracted from the mentor allow the observer to calculate a lower bound on the value function and use this lower bound to choose greedy actions that move the agent towards higher-valued regions of state space. As we will demonstrate in our experimental section, the typical result is quicker convergence to optimal policies and better short-term practical performance with respect to accumulated discounted reward while learning.

Extensions

The implicit imitation model can easily be extended to extract model information from multiple mentors, mixing and matching pieces extracted from each mentor to achieve the most informative model. It does this by searching, at each state, the set of mentors it knows about to find the mentor with the highest value estimate. The value estimate of this mentor is then compared, using the confidence test described above, with the observer's own value estimate. The formal expression of the algorithm is given by the multi-augmented Bellman equation:

V(s) = R_o(s) + \gamma \max \left\{ \max_{a \in A_o} \sum_{t \in S} T_o(s, a, t) V(t), \; \max_{m \in M} \sum_{t \in S} T_m(s, t) V(t) \right\}    (4.9)

where M is the set of candidate mentors. Ideally, confidence estimates should be taken into account in the selection of the best mentor, as we may otherwise get a mentor with a high mean value estimate but large variance winning out over a mentor with a lower mean but high confidence. If the observer has any experience with the state at all, a low-confidence mentor will likely be rejected as having poorer-quality information than the observer. If all mentors can be guaranteed to be executing the same policy (perhaps the optimal policy), then the equation could be simplified so that the samples from all mentors could be pooled into a single model T_m(s, t).
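A minimal sketch of the multi-mentor backup of Equation 4.9 follows, reusing the dictionary conventions of the earlier sketches; variable names and the example are ours. As the text notes, a fuller treatment would also weigh each mentor's confidence when selecting the best mentor term.

def multi_augmented_backup(s, V, R_o, T_o, mentor_chains, gamma=0.9):
    # Best value under the observer's own estimated action models.
    best = max(sum(p * V[t] for t, p in T_o[s][a].items()) for a in T_o[s])
    # Take the best value offered by any candidate mentor chain observed at s.
    for T_m in mentor_chains:                 # mentor_chains: list of dicts s -> {t: prob}
        if s in T_m:
            best = max(best, sum(p * V[t] for t, p in T_m[s].items()))
    return R_o[s] + gamma * best

V = {'s': 0.0, 'g': 1.0}
T_o = {'s': {'a0': {'s': 0.9, 'g': 0.1}}}
chains = [{'s': {'g': 1.0}}, {}]              # the second mentor has never been observed at s
print(multi_augmented_backup('s', V, {'s': 0.0, 'g': 1.0}, T_o, chains))   # 0.9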
Up to now, we have assumed no action costs (i.e., rewards that depend only on the state); however, we can use more general reward functions (e.g., where reward has the form R(s, a)). The difficulty lies in backing up action costs when the mentor's chosen action is unknown. We can make use of \kappa_m, the closest-action function developed for action selection, to choose the appropriate reward. The augmented Bellman equation with generalized rewards takes the following form:

V(s) = \max \left\{ \max_{a \in A_o} \left[ R_o(s, a) + \gamma \sum_{t \in S} T_o(s, a, t) V(t) \right], \; R_o(s, \kappa_m(s)) + \gamma \sum_{t \in S} T_m(s, t) V(t) \right\}    (4.11)

We can substitute Equation 4.11 for the usual augmented value equation defined in Equation 4.2 and perform implicit imitation with general rewards. We note that Bayesian methods could also be used to estimate the mentor's action choice (see Chapter 5).

4.1.3 Empirical Demonstrations

In this section, we present a number of experiments comparing standard reinforcement learning agents with an observer agent based on prioritized sweeping, but augmented with model extraction and our focusing mechanism. These results illustrate how specific features of domains interact with implicit imitation and provide a quantitative assessment of the gains implicit imitation can provide. Since the imitation agent is also a reinforcement learning agent, we first describe the general form of the reinforcement learning problems we will test our imitation agents on, and then we explain how we implemented mentors and observations of mentors.

The experiments with imitation agents are all performed in stochastic grid-world domains, since these domains are easy to understand and make it clear to what extent the observer's and mentor's optimal policies overlap (or fail to). Figure 4.2 shows a simple 10 x 10 grid-world example. In a grid-world problem, an agent must discover a sequence of actions which will move it from the start state to the end state. The agent's movement from state s to state t under action a is defined by an underlying environment transition model of the form T^{s,a}(t).

Figure 4.2: A simple grid world with start state S and goal state X showing representative trajectories for a mentor (solid) and an observer (dashed).

In the context of reinforcement learning, the imitation agent does not have complete knowledge of the effects of its actions (i.e., it is uncertain about T^{s,a}(t)) and must therefore experiment with its actions in order to learn a model. In our formulation of the grid-world problem, the agent is given the prior knowledge that its actions can only make transitions to legal neighbouring states. In general, the neighbouring states are defined as the union of the eight immediately adjacent states in the grid (horizontal, vertical and diagonal neighbours) and the state itself (to represent self-transitions). Where the boundary of the grid truncates a neighbouring state, or a state is occupied by an "obstacle", we say that the neighbouring state is infeasible.
We therefore define the legal neighbourhood of each state as the set of feasible neighbouring states. In some experiments, a small number of clearly indicated states are defined in which the agent has different action capabilities (e.g., the ability to jump to another state). When such states are defined, the legal neighbourhood of the state is extended to reflect these capabilities.

The agent's prior knowledge of the legal neighbourhood is represented by a Dirichlet prior over only the legal neighbouring states, with the hyperparameter for each potential successor state initially set to 1. This locally uninformative prior represents the fact that agents are initially uncertain about which "direction" their actions will take them in, but they know it will be a legal direction. The main motivation for using local neighbourhood priors is to avoid the computational blowup as the number of states in our problems is increased. In our 25 x 25 grid-world example, there are 625 states, which would require roughly 400,000 parameters to represent a locally factored global prior. The decision to use these local priors reflects the fact that the agent is known to be confronting a navigation problem in a geometric space. As we argue below, the use of local priors does not materially affect the questions we wish to answer. Alternatively, we might have employed sparse multinomial priors (Friedman & Singer, 1998).

Given the agent's prior knowledge of the transition model, some initial value calculations could be done in theory. In goal-based scenarios, prior knowledge that actions have only local scope could be used to predict that the value of states nearer the goal will be higher than the value of states far from the goal. Since the agent's actions have uniform probability of making the transition to neighbouring states, however, every action in any given state would have the same value according to its priors. The agent would still have to experiment with the actions in each state in order to learn which actions move it towards the goal. The prior values assigned to states could be used to bias a greedy exploration mechanism towards those states that are nearer the goal. However, when states with negative values are introduced, the agent may be unsure whether there exists a policy able to avoid these penalty states with sufficient probability to make achieving the goal worthwhile. The agent may therefore avoid the goal state in order to avoid a region it believes likely to result in a net loss. In all of our experiments, the agent's initial Q-values are set to zero.
1 0 S i n c e p r i o r i t i z e d sweep ing o n l y acts on those states where there have been exper iences , or states mos t l i k e l y to be effected b y experiences, s ignif icant parts o f the state space m a y r ema in unupdated dur-i n g any one sweep. W e have exp l a ined that the agent 's p r io r bel iefs about i t ac t ion capabi l i t i es admit the p o s s i b i l i t y that any par t icular ac t ion cho i ce c o u l d result i n the t rans i t ion to any o f t y p i -c a l l y n ine successor states. In our exper iments , however , the agent o n l y possesses 4 d is t inc t act ions or con t ro l s ignals . T h e actual effects o f the four act ions are e-stochastic (i.e., each ac t ion a is associated w i t h an in tended d i rec t ion such as " N o r t h " and u p o n execu t ing this act ion the env i ronment sends the agent i n this d i rec t ion (1 — e) o f the t ime and some other d i rec t ion s o f the t ime. In most exper iments , the four act ions s tochas t ica l ly m o v e the agent i n the compass d i rec t ions N o r t h , South , Eas t and West . T o i l lustrate pert inent po in ts , w e e m p l o y t ransi t ion mode ls representing different act ion capabi l i t ies i n different exper iments . £ - g r e e d y exp lora t ion is used w i t h decay ing e. In our fo rmu la t i on , the agent selects an ac t ion r a n d o m l y f rom the set o f act ions that is current ly es t imated to attain the highest va lue and r a n d o m l y selects a r andom act ion f rom the comple te ac t ion set e o f the t ime. W e use exponen t i a l ly decay ing exp lora t ion rates. T h e in i t i a l values and decay parameters were tuned for each p r o b l e m and are descr ibed i n specif ic exper iments . In each o f the exper iments , we compare the performance o f independent re inforce-ment learners w i t h obse rva t ion -mak ing imi ta t ion learners. T h e i m i t a t i o n learners s imul ta -neous ly observes an expert mentor w h i l e a t tempting to so lve the p r o b l e m . In each exper i -ment, the mentor is f o l l o w i n g an e-greedy p o l i c y w i t h a s m a l l e (on the order o f 0.01). T h i s ' "General ly, the number of backups was set to be roughly equal to the length of the optimal "noise-free" path. This heuristic value is big enough to propagate value changes across the state space, but small enough to be tractable in the inner loop of a reinforcement learner. 72 tends to cause the mentor ' s trajectories to l i e w i t h i n a "c lus te r" su r round ing o p t i m a l trajec-tories (and reflect g o o d i f not o p t i m a l po l i c i e s ) . E v e n w i t h a s m a l l amount o f exp lo ra t ion and some env i ronment s tochast ici ty, mentors general ly do not v i s i t the entire state space, so confidence testing is important . A t y p i c a l o p t i m a l mentor trajectory is i l lus t ra ted b y the s o l i d l i ne between the start and end states. T h e dotted l i ne i n F i g u r e 4 .2 shows that a t y p i c a l mentor- inf luenced trajectory w i l l be quite s i m i l a r to the observed mentor trajectory. O u r g o a l in these exper iments is to i l lustrate h o w imi t a t ion affects the convergence o f a re inforcement learner to an o p t i m a l p o l i c y , h o w var ious propert ies o f specif ic domains can affect the ga in due to imi t a t ion , and h o w the var ious mechan i sms w e propose w o r k to-gether to deal w i t h p rob lems w e have ant ic ipated for im i t a t i on learners. 
T o shed l igh t on these quest ions , w e sought an appropriate measure o f performance. I n ou r goa l -based sce-nar ios , the instantaneous reward occurs on ly at the end o f a successful trajectory. W e f o u n d graphs o f instantaneous reward dif f icul t to interpret and compare . In c u m u l a t i v e r eward graphs, op t ima l performance shows up as a fixed u p w a r d s lope represent ing a constant goa l at tainment rate. W e found that s lope compar i sons caused di f f icul t ies for readers. W e set-t led on a c o m p r o m i s e so lu t ion : a w i n d o w e d average o f reward. T h e w i n d o w length was chosen to be l o n g enough to integrate instantaneous reward s ignals and short enough that w e can readi ly see w h e n the goa l rate is chang ing and w h e n i t has converged to the op t i -m a l va lue . W h e n t w o agents have converged to the same qua l i ty o f p o l i c y , their w i n d o w e d average graph l ines w i l l a lso converge. W e emphas ize that our measure is not the same as average reward as the w i n d o w i n g func t ion discards a l l data before the w i n d o w cutoff. In the remainder o f this section w e present the results o f our exper iments . W e l o o k at the ga in i n reward obtained by a reinforcement learner i n a bas ic g r i d - w o r l d scenario. W e then l o o k at h o w this ga in is affected by changes to the scale o f the p r o b l e m and the stochas-t ic i ty o f the agent 's act ions. W e l o o k at h o w the observer ' s p r io r be l iefs about mentor ca -73 pabi l i t i es can cause p rob lems , h o w qua l i t a t ive ly more dif f icul t p rob lems affect the ga in i n imi t a t ion , and c o n c l u d e w i t h an example i n w h i c h in fo rmat ion f r o m several mentors is s i -mul taneous ly c o m b i n e d w i t h the imi ta tor ' s o w n exper ience to i m p r o v e its p o l i c y . E x p e r i m e n t 1: T h e I m i t a t i o n E f f ec t In our first exper iment w e compared the performance o f an observer us ing m o d e l ext rac t ion and an expert mentor w i t h the performance o f a con t ro l agent u s i n g standard re inforcement learn ing . G i v e n the l ack o f obstacles and the lack o f intermediate rewards , conf idence test ing was not r e q u i r e d . 1 1 B o t h agents attempted to learn a p o l i c y that m a x i m i z e s d i scoun ted return i n a 10 X 10 g r i d w o r l d . T h e y started i n the upper-left corner and sought a g o a l w i t h va lue 1.0 i n the lower - r igh t corner. U p o n reaching the goa l , the agents were restarted i n the upper-left corner. A c t i o n d y n a m i c s were noisy, resu l t ing i n the intended d i r ec t ion 90 percent o f the t ime and a r a n d o m d i rec t ion 10 percent o f the t ime (un i fo rmly across the N o r t h , Sou th , Eas t and Wes t d i rec t ions) . T h e d iscount factor was 0.9. A s exp l a ined above, the mentor executes a m i l d l y stochastic perturbation o f the o p t i m a l p o l i c y (one percent noise) for the d o m a i n . In F i g u r e 4.3 w e p lo t the cumula t ive number o f goals obta ined over the p rev ious 1000 t ime steps for the observer (Obs) and con t ro l (Ct r l ) agents (results are averaged over ten runs). T h e observer was able to q u i c k l y incorporate a p o l i c y learned f r o m the mentor into its va lue estimates. T h i s results i n a steeper l ea rn ing curve . In contrast, the con t ro l agent s l o w l y exp lo red the space to b u i l d a m o d e l first. 
Both agents converged to the same optimal value function, which is reflected in the graph by the goal rates for the observer and control agents converging to the same optimal rate. The Delta curve, defined as Obs - Ctrl, shows the gain of the imitator over the unaided reinforcement learner. The gain peaks during the early learning phase and goes to zero once both agents have converged to optimal policies.

11 Confidence testing will be relevant in subsequent scenarios.

Figure 4.3: On a simple grid-world, an observer agent 'Obs' discovers the goal and optimizes its policy faster than a control agent 'Ctrl' without observation. The 'Delta' curve plots the gain of the observer over the control.

Experiment 2: Scaling and Noise

The next experiment illustrated the sensitivity of imitation to the size of the state space and the stochasticity of actions. Again, the observer used model extraction but not confidence testing. In Figure 4.4 we plot the Delta curves for the basic scenario (Basic) just described, the scaled-up scenario (Scale) in which the state space size is increased 69 percent (to a 13 x 13 grid), and the stochastic scenario (Stoch) in which the noise level is increased from 10 percent to 40 percent (results are averaged over ten runs). The Delta curves represent the increase in goal rate due to imitation over the goal rate obtained by an unaided reinforcement learner for a given scenario. The generally positive values of all the Delta curves represent the fact that in each of the scenarios considered, imitation increases the imitator's goal rate. However, the timing and duration of the gain due to imitation vary depending on the scenario.

When we compare the gain obtained in the basic imitation scenario (Basic) with the gain obtained in the scaled-up scenario (Scale), we see that the gain in the scaled-up scenario increases more slowly than in the basic scenario. This reflects the fact that even with the mentor's guidance, the imitator requires more exploration to find the goal in the larger scenario. The peak in the gain for the scaled-up scenario is lower, as the optimal goal rate in a larger problem is lower; however, the duration of the gain in the scaled-up scenario is considerably longer. This reflects Whitehead's (1991) observation that for grid worlds, exploration requirements can increase quickly with state space size (i.e., the work to be done by an unguided agent), but the optimal path length increases only linearly (e.g., the example provided by the mentor). We conclude that the guidance of the mentor can help more in larger state spaces.
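For concreteness, the windowed goal-rate and Delta measures used in these plots can be computed from per-step reward logs roughly as follows. This is a minimal sketch: the function names are mine, the window length of 1000 steps is taken from the experimental description above, and averaging over runs is omitted.

    def windowed_goal_rate(rewards, window=1000):
        """Number of goals (unit rewards) obtained in the trailing window at each step."""
        rates, total = [], 0.0
        for i, r in enumerate(rewards):
            total += r
            if i >= window:
                total -= rewards[i - window]   # discard data before the window cutoff
            rates.append(total)
        return rates

    def delta_curve(observer_rewards, control_rewards, window=1000):
        """Delta = Obs - Ctrl, the gain of the imitator over the unaided learner."""
        obs = windowed_goal_rate(observer_rewards, window)
        ctrl = windowed_goal_rate(control_rewards, window)
        return [o - c for o, c in zip(obs, ctrl)]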
When we compare the gain due to imitation in the basic scenario (Basic) with the gain due to imitation in a scenario where the stochasticity of actions is dramatically increased (Stoch), we see that the gain in the stochastic scenario is considerably reduced in maximum value. Increasing the stochasticity will increase the time required to complete the problem for both agents and reduce the maximum possible goal rate. The increased stochasticity also considerably delays the gain due to imitation relative to the basic scenario. Again, the imitator takes longer to discover the goal. As with the scaled-up problem, the gain persists for a longer period, indicating that it takes longer for the unaided reinforcement learner to converge in the stochastic scenario. We speculate that increasing the noise level reduces the observer's ability to act upon the information received from the mentor and therefore erodes its advantage over the control agent. We conclude, however, that the benefit of imitation degrades gracefully with increased stochasticity and is present even at this relatively extreme noise level.

Figure 4.4: Delta curves compare gain due to observation for the 'Basic' scenario described above, a 'Scale' scenario with a larger state space and a 'Stoch' scenario with increased noise.

Experiment 3: Confidence Testing

The agents in our experiments have prior beliefs about the outcomes of their actions. In the case of an imitation agent, the observer also has prior beliefs about the outcomes of the mentor's actions. Like the observer's priors about its own actions, the observer's prior beliefs about mentor actions take the form of a Dirichlet distribution with uniform hyperparameters over the legal neighbouring states. Thus, the observer expects that the mentor will move in a way that is constrained to single steps through a geometric space. The observer, however, does not know the actual distribution of successor states for the mentor's action. More intuitively, the observer does not know which direction the mentor will move in each state. The mentor priors are only intended to give the observer a rough guess about the mentor's behaviour, but will generally not reflect the mentor's actual behaviour.

Since the priors over the mentor's action choices are used by the observer to estimate the value of the mentor's action, there may be situations in which the observer's prior beliefs about the transition probabilities of the mentor can cause the observer to overestimate the value of the mentor's action. In assuming that it can duplicate the mentor's actions, the observer may then overvalue the state in which the mentor's action was observed. The confidence mechanism proposed in the previous section can prevent the observer from being fooled by misleading priors over the mentor's transition probabilities.
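The Dirichlet machinery just described can be summarized in a few lines of Python. The sketch below maintains uniform hyperparameters over the legal neighbouring states and returns posterior mean transition probabilities once some mentor transitions have been counted; it is only meant to illustrate why unvisited states retain their uniform prior guess. The class and helper names (e.g., legal_neighbours) are illustrative assumptions, not the thesis's implementation.

    from collections import defaultdict

    class MentorTransitionModel:
        """Dirichlet model of the mentor's (unknown) action at each state."""

        def __init__(self, legal_neighbours, prior_count=1.0):
            self.legal_neighbours = legal_neighbours   # state -> list of successor states
            self.prior_count = prior_count             # uniform Dirichlet hyperparameter
            self.counts = defaultdict(float)           # (s, t) -> observed transition count

        def observe(self, s, t):
            """Record one observed mentor transition s -> t."""
            self.counts[(s, t)] += 1.0

        def mean(self, s):
            """Posterior mean Pr(t | s) over the legal neighbourhood of s."""
            succ = self.legal_neighbours(s)
            raw = [self.prior_count + self.counts[(s, t)] for t in succ]
            total = sum(raw)
            return {t: c / total for t, c in zip(succ, raw)}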
To demonstrate the role of the confidence mechanism in implicit imitation we designed an experiment based on the scenario illustrated in Figure 4.5. Again the agent's task was to navigate using the North, South, East and West actions from the top-left corner to the bottom-right corner of a 10 x 10 grid in order to attain a reward of +1. We created a pathological scenario with islands of high reward (+5). These islands are isolated horizontally and vertically from adjacent states by obstacles. The diagonal path, however, remains unblocked. Since our agent's priors admit the possibility of making a transition to any state in the legal neighbourhood of the current state, including the cells diagonally adjacent to the current state, the high-valued cells in the middle of each island are initially believed to be reachable from the states diagonally adjacent to them with some small prior probability. In reality, however, the agent's action set precludes this possibility and the agent will therefore never be able to realize this value. The four islands in this scenario thus create a fairly large region in the center of the space with a high estimated value, which could potentially trap an observer if it persisted in its prior beliefs.

Notice that a standard reinforcement learner will "quickly" learn that none of its actions take it towards the rewarding islands. In contrast, an implicit imitator using augmented backups could be fooled by its prior mentor model. If the mentor does not visit the states neighboring the islands, the observer will not have any evidence upon which to change its prior belief that the mentor actions are equally likely to result in a transition to any state in the local neighbourhood, including those diagonally adjacent to the current state. The imitator may falsely conclude on the basis of the mentor action model that there exists an action which would allow it to access the islands of value. The observer would then overvalue the states diagonally adjacent to the high-valued states. The observer therefore needs a confidence mechanism to detect when the mentor model is less reliable than its own model.

Figure 4.5: Prior models assume diagonal connectivity and hence that the +5 rewards are accessible to the observer. In fact, the observer has only 'NEWS' actions. Unupdated priors about the mentor's capabilities may lead to overvaluing diagonally adjacent states.

To test the confidence mechanism, we assigned the mentor a path around the outside of the obstacles so that its path could not lead the observer out of the trap (i.e., it provides no evidence to the observer that the diagonal moves into the islands are not feasible). The combination of a high initial exploration rate and the ability of prioritized sweeping to spread value across large distances virtually guarantees that the observer will be led to the trap.
Given this scenario, we ran two observer agents and a control. The first observer used a confidence interval with width given by 5σ, which, according to the Chebychev rule, should cover at least 96 percent of an arbitrary distribution. The second observer was given a 0σ interval, which effectively disables confidence testing. In Figure 4.6, the performance of the observer with confidence testing is compared to the performance of the control agent (results are averaged over 10 runs). The observer without confidence testing consistently became stuck. Examination of the value function for the observer without confidence testing revealed consistent peaks within the trap region, and inspection of the agent state trajectories showed that it was stuck in the trap. This agent failed to converge to the optimal goal rate in any run and is not plotted on the graph.

The first detail that stands out in the graph of the observer with confidence testing (Obs) is that the agent does not experience the rapid convergence to the optimal goal rate seen in previous scenarios. Initially, due to the high exploration rate early in learning, both the observer (Obs) and control agent (Ctrl) sporadically reach the goal, but due to the presence of the islands, the goal rate is comparatively low in the first 2000 steps. Even the control agent shows a longer exploration phase than in the basic scenario shown in Figure 4.3. After the first 2000 steps, we see both agents rapidly converging to the optimal goal rate with the observer initially having a slight lead. Observation of its value function over time shows that the trap formed, but faded away as the observer gained enough experience to overcome erroneous priors on both its own models and its models of the mentor.

At the end of the experiment, the observer's performance is slightly less than that of the control agent. This may be due to improperly tuned exploration rate parameters. Theoretically these lines will converge given proper exploration schedules. Based on the difference in performance between the observer with confidence testing and the observer without confidence testing, we conclude that confidence testing is necessary for implicit imitation and effective in maintaining reasonable performance (on the same order as the control agent) even in this pathological case.

Figure 4.6: Misleading priors about mentor abilities may eliminate imitation gains and even result in some degradation of the observer's performance relative to the control agent, but confidence testing prevents the observer from permanently fixating on infeasible goals.

Figure 4.7: A complex maze with 343 states has considerably less local connectivity than an equivalently-sized grid world. A single 133-step path is the shortest solution.
Experiment 4: Qualitative Difficulty

The next experiment demonstrates how the potential gains of imitation can increase with the (qualitative) difficulty of the problem. The observer employed both model extraction and confidence testing, though confidence testing did not play a significant role in this experiment.12 In the "maze" scenario, we introduced obstacles in order to increase the difficulty of the learning problem. The maze is set on a 25 x 25 grid (Figure 4.7) with 286 obstacles complicating the agent's journey from the top-left to the bottom-right corner. The optimal solution takes the form of a snaking 133-step path, with distracting paths (up to length 22) branching off from the solution path, necessitating frequent backtracking. The discount factor is 0.98. With 10 percent noise, the optimal goal-attainment rate is about six goals per 1000 steps. The mentor repeatedly follows an ε-optimal path through the maze (ε = 0.01).

12 In this problem there are no intermediate rewards or shortcuts whose effect on value might be propagated by mentor priors.

From the graph in Figure 4.8 (with results averaged over ten runs), we see that the control agent took on the order of 150,000 steps to build a decent value function that reliably led to the goal. At this point, it only achieved four goals per 1000 steps on average, as its exploration rate is still reasonably high (unfortunately, decreasing exploration more quickly leads to a higher probability of a suboptimal policy and therefore a lower value function). The imitation agent is able to take advantage of the mentor's expertise to build a reliable value function in about 20,000 steps. Since the control agent was unable to reach the goal at all in the first 20,000 steps, the Delta between the control and the imitator is simply equal to the imitator's performance. The imitator quickly achieved the optimal goal attainment rate of six goals per 1000 steps, as its exploration rate decayed much more quickly.

Figure 4.8: On the difficult maze problem, the observer 'Obs' converges long before the unaided control 'Ctrl'.

Experiment 5: Improving Suboptimal Policies by Imitation

The augmented backup rule does not require that the reward structure of the mentor and observer be identical. There are many useful scenarios where rewards are dissimilar but induce value functions and policies that share some structure. In this experiment, we demonstrated one interesting scenario in which it is relatively easy to find a suboptimal solution, but difficult to find the optimal solution. Once the observer found this suboptimal path, however, it was able to exploit its observations of the mentor to see that there is a shortcut that significantly shortens the path to the goal. The structure of the scenario is shown in Figure 4.9. The suboptimal solution lies on the path from location 1 around the "scenic route" to location 2 and on to the goal at location 3.
The mentor took the vertical path from location 4 to location 5 through the shortcut.13 To discourage the use of the shortcut by novice agents, it was lined with cells (marked "*") that jump back to the start state. It was therefore difficult for a novice agent executing many random exploratory moves to make it all the way to the end of the shortcut and obtain the value which would reinforce its future use. Both the observer and control therefore generally found the scenic route first. In Figure 4.10, the performance (measured using goals reached over the previous 1000 steps) of the control and observer are compared (averaged over ten runs), indicating the value of these observations. We see that the observer and control agent both found the longer scenic route, though the control agent took longer to find it. The observer went on to find the shortcut and jumped to almost double the goal rate. This experiment shows that mentors can improve observer policies even when the observer's goals are not on the mentor's path.

13 A mentor proceeding from 5 to 4 would not provide guidance without prior knowledge that actions are reversible.

Figure 4.9: The shortcut (4 → 5) in the maze is too perilous for unknowledgeable (or unaided) agents to attempt.

Figure 4.10: The observer 'Obs' tends to find the shortcut and converges to the optimal policy, while the unaided control 'Ctrl' typically settles on the suboptimal solution.

Experiment 6: Multiple Mentors

The final experiment illustrated how model extraction can be readily extended so that the observer can extract models from multiple mentors and exploit the most valuable parts of each. Again, the observer employed model extraction and confidence testing. In Figure 4.11, the learner must move from start location 1 to goal location 4. Two expert agents with different start and goal states serve as potential mentors. One mentor repeatedly moved from location 3 to location 5 along the dotted line, while a second mentor departed from location 2 and ended at location 4 along the dashed line. In this experiment, the observer had to combine the information from the examples provided by the two mentors with independent exploration of its own in order to solve the problem.

Figure 4.11: A scenario with multiple mentors. Each mentor covers only part of the space of interest to the observer.

Figure 4.12: The gain from simultaneous imitation of multiple mentors is similar to that seen in the single mentor case.

In Figure 4.12, we see that the observer successfully pulled together the mentor observations and independent exploration to learn much more quickly than the control agent (results are averaged over 10 runs). We see that the use of a value-based technique allowed the observer to choose which mentor's influence to use on a state-by-state basis in order to get the best solution to the problem.
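The state-by-state choice among mentors can be pictured as taking the best of the observer's own action values and the value implied by each mentor's observed chain. The sketch below is only a schematic rendering of that idea, not the thesis's augmented backup (which is defined earlier in the thesis); the backup helper, model dictionaries and parameter names are assumptions introduced for illustration.

    def augmented_value(s, reward, gamma, V, own_models, mentor_chains):
        """Schematic value estimate at state s with several mentors.

        own_models: action -> {successor: probability} for the observer's own actions.
        mentor_chains: mentor -> {successor: probability}, the observed chain at s.
        Returns the max over the observer's actions and over each mentor's chain.
        """
        def backup(probs):
            return reward(s) + gamma * sum(p * V[t] for t, p in probs.items())

        own_best = max(backup(own_models[a]) for a in own_models)
        mentor_best = max((backup(chain) for chain in mentor_chains.values()),
                          default=float('-inf'))
        return max(own_best, mentor_best)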
Our experiments suggest that model extraction and confidence testing are applicable under a wide variety of conditions, but we have seen that these techniques perform best when there is overlap in the reward functions between mentor and observer, when there is a mentor following a policy with good coverage of the state space, and when there is a problem that is sufficiently difficult to warrant imitation techniques. We can gain further insight into the applicability and performance of imitation by considering how the mechanism of implicit imitation interacts with features of the domain. We will explore these interactions in more depth in Chapter 6.

4.2 Implicit Imitation in Heterogeneous Settings

The augmented backup equation, as presented in the previous section, relies on the homogeneous actions assumption—that the observer possesses an action which will duplicate the long-run average dynamics of the mentor's actions. When the homogeneity assumption is violated, the implicit imitation framework described above can cause the learner's convergence rate to slow dramatically and, in some cases, cause the learner to become stuck in a small neighborhood of state space. In particular, if the learner is unable to make the same state transition (or a transition with the same probability) as the mentor at a given state, it may drastically overestimate the value of that state. The inflated value estimate causes the learner to return repeatedly to this state even though this exploration will never produce a feasible action for obtaining the inflated estimated value. There is no mechanism for removing the influence of the mentor's Markov chain on value estimates—the observer can be extremely (and correctly) confident in its observations about the mentor's model, and yet the mentor's model is still infeasible for the observer.

The problem lies in the fact that the augmented Bellman backup is justified by the assumption that the observer can duplicate every mentor action. That is, at each state s, there is some a ∈ A such that T_o(s, a, t) = T_m(s, t) for all t. When an equivalent action a does not exist, there is no guarantee that the value calculated using the mentor action model can in fact be achieved.

4.2.1 Feasibility Testing

In such heterogeneous settings, we can prevent "lock-up" and poor convergence through the use of an explicit action feasibility test: before an augmented backup is performed at s, the observer tests to see if the mentor's action a_m "differs" from each of its actions at s, given its current estimated models. If every observer action differs from the mentor action, then the augmented backup is suppressed and a standard Bellman backup is used to update the value function. By default, mentor actions are assumed to be feasible for the observer; however, once the observer is reasonably confident that a_m is infeasible at state s, augmented backups are suppressed at s.
Recall that transition models describing each action are estimated from observations, and that the learner's uncertainty about the true transition probabilities is reflected in a Dirichlet distribution over possible models. The point estimates of the transition probabilities used to perform backups correspond to the mean of the distribution described by the corresponding Dirichlet. We can therefore view the difference between some observer action a_o and the mentor action a_m as a classical difference of means test (Mendenhall, 1987, Section 9.5) with respect to the corresponding Dirichlet distributions. That is, we want to know if the mean transition probabilities representing the effects of the two actions differ at a specified level of confidence.

A difference of means test over transition model distributions is complicated by the fact that Dirichlets are highly non-normal for small sample counts, and that transition distributions are multinomial. We deal with the non-normality through two techniques: we require a minimum number of samples for each test and we employ robust Chebychev bounds on the pooled variance of the distributions to be compared. Conceptually, we will evaluate an expression like the following:

|T_o(s, a_o, t) - T_m(s, t)| \;>\; Z_{\alpha/2} \sqrt{\frac{n_o(s, a_o, t)\,\sigma_o^2(s, a_o, t) + n_m(s, t)\,\sigma_m^2(s, t)}{n_o(s, a_o, t) + n_m(s, t)}}    (4.12)

Essentially, we have a difference of means test on the transition probabilities where the variance is a weighted pooling of variances from the observer- and mentor-derived models. The term Z_{α/2} is the critical value of the test. The parameter α is the significance of the test, or the probability that we will falsely reject the two actions as being different when they are actually the same.14 Given our highly non-normal distributions early in the training process, the appropriate Z value for a given α can be computed from Chebychev's bound by solving α = 1/Z_{α/2}² for Z_{α/2}.

14 The decision is binary, but we could envision a smoother decision criterion that measures the extent to which the mentor's action can be duplicated. See Bayesian imitation in Chapter 5.

When we have too few samples to do an accurate test, we persist with augmented backups (embodying our default assumption of homogeneity). If the value estimate is inflated by these backups, the agent will be biased to obtain additional samples, which will then allow the agent to perform the required feasibility test. Our assumption is therefore self-correcting. We deal with the multivariate complications by performing the Bonferroni test (Seber, 1984), which has been shown to give good results in practice (Mi & Sampson, 1993), is efficient to compute, and is known to be robust to dependence between variables. A Bonferroni hypothesis test is obtained by conjoining several single variable tests.
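As a minimal illustration of the test in Equation 4.12, the Python fragment below compares one successor probability under an observer action and under the mentor's action, using the pooled variance and a Chebychev critical value; the full per-state test conjoins such comparisons over all successors via the Bonferroni procedure described next and in Table 4.2. The function and argument names are mine, not the thesis's, and the default significance level is a placeholder.

    import math

    def chebychev_z(alpha):
        """Critical value Z such that at most alpha tail mass lies beyond Z pooled
        standard deviations under Chebychev's bound."""
        return 1.0 / math.sqrt(alpha)

    def probs_differ(t_obs, t_ment, n_obs, n_ment, var_obs, var_ment, alpha=0.05):
        """Reject 'same transition probability' when the difference of means is large.

        t_obs, t_ment: estimated transition probabilities to one successor state.
        n_obs, n_ment: sample counts behind each estimate.
        var_obs, var_ment: corresponding variance estimates.
        """
        pooled = (n_obs * var_obs + n_ment * var_ment) / (n_obs + n_ment)
        if pooled == 0.0:
            return t_obs != t_ment
        z = abs(t_obs - t_ment) / math.sqrt(pooled)
        return z > chebychev_z(alpha)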
Suppose the actions a_o and a_m result in r possible successor states, s_1, ..., s_r (i.e., r transition probabilities to compare). For each s_i, the hypothesis E_i denotes that a_o and a_m have the same transition probability to successor state s_i; that is, T(s, a_m, s_i) = T(s, a_o, s_i). We let Ē_i denote the complementary hypothesis (i.e., that the transition probabilities differ). The Bonferroni inequality states:

\Pr\left[\bigcap_{i=1}^{r} E_i\right] \;\geq\; 1 - \sum_{i=1}^{r} \Pr\left[\bar{E}_i\right]

Thus we can test the joint hypothesis ⋂_{i=1}^{r} E_i—the two action models are the same—by testing each of the r complementary hypotheses Ē_i at confidence level α/r. If we reject any of the hypotheses, we reject the notion that the two actions are equal with confidence at least α. The mentor action a_m is deemed infeasible if, for every observer action a_o, the multivariate Bonferroni test just described rejects the hypothesis that the action is the same as the mentor's. Pseudo-code for the Bonferroni component of the feasibility test appears in Table 4.2.

Table 4.2: An algorithm for action feasibility testing

FUNCTION feasible(m, s) : Boolean
  FOR each a in A_o DO
    allSuccessorProbsSimilar := true
    FOR each t in successors(s) DO
      H_A := |T_o(s, a, t) - T_m(s, t)|
      Z_A := H_A / sqrt( (n_o(s,a,t) * σ_o²(s,a,t) + n_m(s,t) * σ_m²(s,t)) / (n_o(s,a,t) + n_m(s,t)) )
      IF Z_A > Z_{α/2r} THEN allSuccessorProbsSimilar := false
    END FOR
    IF allSuccessorProbsSimilar RETURN true
  END FOR
  RETURN false

The algorithm assumes a sufficient number of samples. Presently, for efficiency reasons, we cache the results of the feasibility testing. When the duplication of the mentor's action at state s is first determined to be infeasible, we set a flag for state s to this effect.

4.2.2 k-step Similarity and Repair

Action feasibility testing essentially makes a strict decision as to whether the agent can duplicate the mentor's action at a specific state: once it is decided that the mentor's action is infeasible, augmented backups are suppressed and all potential guidance offered is eliminated at that state. Unfortunately, the strictness of the test results in a somewhat impoverished notion of similarity between mentor and observer. This, in turn, unnecessarily limits the transfer between mentor and observer. We propose a mechanism whereby the mentor's influence may persist even if the specific action it chooses is not feasible for the observer; we instead rely on the possibility that the observer may approximately duplicate the mentor's trajectory instead of exactly duplicating it.

Suppose an observer has previously constructed an estimated value function using augmented backups. Using the mentor action model (i.e., the mentor's chain), a high value has been calculated for state s. Subsequently, suppose the mentor's action at state s is judged to be infeasible. This is illustrated in Figure 4.13, where the estimated value at state s is originally due to the mentor's action π_m(s), which for the sake of illustration moves with high probability to state t, which itself can lead to some highly rewarding region of state space.
After some number of experiences at state s, however, the learner concludes that the action π_m(s)—and the associated high probability transition to t—is not feasible. At this point, one of two things must occur: either (a) the value calculated for state s and its predecessors will "collapse" and all exploration towards highly valued regions beyond state s ceases; or (b) the estimated value drops slightly but exploration continues towards the highly-valued regions.

The latter case may arise as follows. If the observer has previously explored in the vicinity of state s, the observer's own action models may be sufficiently developed that they still connect the higher value regions beyond state s to state s through Bellman backups. For example, if the learner has sufficient experience to have learned that the highly valued region can be reached through the alternative trajectory s-u-v-w, the newly discovered infeasibility of the mentor's transition s-t will not have a deleterious effect on the value estimate at s. If s is highly valued, it is likely that states close to the mentor's trajectory will be explored to some degree. In this case, state s may not be as highly valued as it was when using the mentor's action model, but it will still be valued highly enough that it will likely guide further exploration toward the area. We call this
F o r the moment , w e assume a "short pa th" is any path o f length no greater than some g i v e n integer k. W e say an observer is k-step similar to a mentor at state s i f the observer can dupl ica te i n k or fewer steps the mentor ' s n o m i n a l t ransi t ion at state s w i t h "suff ic ient ly h i g h " p robab i l i ty . 94 G i v e n this no t ion o f s imi la r i ty , an observer can n o w test whether a spontaneous b r idge exists and determine whether the observer is in danger o f va lue func t ion co l l apse and the concomi tan t loss o f gu idance i f it decides to suppress an augmented b a c k u p at state s. T o do this , the observer init iates a reachabi l i ty analys is starting f r o m state s u s i n g its o w n ac-t ion m o d e l To'a (t) to determine i f there is a sequence o f act ions w i t h leads w i t h suff ic ient ly h i g h p robab i l i t y f rom state s to some state t on the mentor ' s trajectory d o w n s t r e a m o f the infeas ib le a c t i o n . 1 5 I f a fc-step br idge already exis ts , augmented backups can be safely sup-pressed at state s. F o r efficiency, we main ta in a flag at each state to m a r k it as " b r i d g e d . " O n c e a state is k n o w n to be b r idged , the fc-step reachabi l i ty ana lys i s need not be repeated. I f a spontaneous br idge cannot be found , it m igh t s t i l l be pos s ib l e to in ten t iona l ly set out to b u i l d one. T o b u i l d a br idge , the observer must exp lo re f r o m state s up to A;-steps away h o p i n g to make contact w i t h the mentor ' s trajectory downs t r eam o f the infeas ib le m e n -tor ac t ion . W e imp lemen t a s ingle search attempt as a A; 2-step r a n d o m w a l k , w h i c h w i l l result i n a trajectory on average k steps away f rom s as l o n g as e rgod ic i ty and l o c a l connec t iv i t y assumpt ions are satisfied. In order for the search to occur, w e must mot iva te the observer to return to the state s and engage i n repeated exp lo ra t ion . W e c o u l d p r o v i d e m o t i v a t i o n to the observer b y a s k i n g the observer to assume that the infeas ib le ac t ion w i l l be repairable. T h e observer w i l l therefore cont inue the augmented backups w h i c h support h igh -va lue es-timates at the state s and the observer w i l l repeatedly engage i n exp lo ra t i on f r o m this point . T h e danger, o f course, is that there may not i n fact be a b r idge , i n w h i c h case the observer w i l l repeat this search for a b r idge indefini te ly . W e therefore need a m e c h a n i s m to terminate the repai r process w h e n a k-step repair is infeas ible . W e c o u l d attempt to e x p l i c i t l y keep track o f a l l o f the poss ib le paths open to the observer and a l l o f the paths e x p l i c i t l y t r ied b y the observer and determine the repair poss ib i l i t i es had been exhausted. Instead, w e elect to 1 5 I n a more general state space where ergodicity is lacking, the agent must consider predecessors of state s up to k steps before s to guarantee that al l fc-step paths are checked. 95 f o l l o w a p robab i l i s t i c search that e l iminates the need for b o o k k e e p i n g : i f a b r idge cannot be const ructed w i t h i n n attempts o f ft-step r a n d o m w a l k , the " repa i rab i l i ty a s sumpt ion" is j u d g e d fa ls i f ied, the augmented backup at state s is suppressed and the observer ' s bias to explore the v i c i n i t y o f state s is e l imina ted . 
I f no br idge is found for state s, a flag is used to mark the state as " i r reparable ." T h i s approach is , o f course, a very na ive heur is t ic strategy; but it i l lustrates the bas ic impor t o f b r i d g i n g . M o r e systematic strategies c o u l d be used, i n v o l v i n g e x p l i c i t " p l a n n i n g " to find a b r idge u s ing , say, l o c a l search ( A l i s s a n d r a k i s et a l . , 2000). A n o t h e r aspect o f this p r o b l e m that we do not address is the persistence o f search for br idges . In a specif ic d o m a i n , after some number o f unsuccessful attempts to find br idges , a learner m a y c o n c l u d e that it i s unable to reconstruct a mentor ' s behavior , i n w h i c h case the search fo r b r idges m a y be abandoned. T h i s i n v o l v e s s imple , h igher - leve l inference, and some n o t i o n o f (or p r io r be-l iefs about) " s i m i l a r i t y " o f capabi l i t ies . These not ions c o u l d a lso be used to au tomat ica l ly determine parameter settings (discussed b e l o w ) . T h e parameters k and n must be tuned e m p i r i c a l l y , but can be es t imated g i v e n k n o w l -edge o f the connec t iv i ty o f the d o m a i n and p r io r bel iefs about h o w s i m i l a r ( in terms o f length o f average repair) the trajectories o f the mentor and observer w i l l be. F o r instance, n > 8k - 4 seems sui table in an 8-connected g r id w o r l d w i t h l o w noise , based o n the number o f trajectories requi red to cove r the perimeter states o f a k-step rectangle a round a state. W e note that very large values o f n can reduce performance b e l o w that o f n o n - i m i t a t i n g agents as it results i n temporary, but p ro longed " l o c k up . " In the absence o f p r i o r k n o w l e d g e , one migh t be able to implemen t an incremental process for finding k that starts w i t h k = 1 and increments k w h e n the p o l i c y space at granular i ty k is l i k e l y to have been exhausted. F e a s i b i l i t y and A;-step repair are eas i ly integrated into the homogeneous i m p l i c i t i m -i ta t ion f ramework . Essen t i a l ly , w e s i m p l y elaborate the cond i t i ons under w h i c h the aug-96 merited b a c k u p w i l l be e m p l o y e d . O f course, some add i t iona l representat ion w i l l be in t ro-duced to keep track o f whether a state is feasible, b r idged , or repairable , and h o w m a n y repair attempts have been made. T h e ac t ion se lec t ion m e c h a n i s m w i l l a lso be over r idden b y the b r i d g e - b u i l d i n g a l g o r i t h m when requi red i n order to search for a b r idge . B r i d g e b u i l d i n g a lways terminates after n attempts, however , so it cannot affect l o n g run convergence . A l l other aspects o f the a lgo r i thm, however , such as the exp lo ra t ion p o l i c y , are unchanged . T h e comple te elaborated dec i s ion procedure used to determine w h e n augmented back-ups w i l l be e m p l o y e d at state s w i t h respect to mentor m appears i n Tab le 4.3. It uses some internal state to make its dec is ions . A s i n the o r i g i n a l m o d e l , w e first check to see i f the ob-server 's exper ience-based ca lcu la t ion for the va lue o f the state supersedes the mentor-based ca l cu l a t i on ; i f so, then the observer uses its o w n exper ience-based ca l cu l a t i on . I f the m e n -tor 's ac t ion is feasible , then we accept the va lue ca lcu la ted u s ing the observa t ion-based va lue func t ion . 
I f the ac t ion is infeasible w e check to see i f the state is b r idged . T h e first t ime the test is requested, a reachabi l i ty analys is is performed, but the results w i l l be d r a w n f r o m a cache for subsequent requests. I f the state has been b r idged , w e suppress augmented back-ups, confident that this w i l l not cause value func t ion co l lapse . I f the state is not b r idged , we ask i f it is repairable. F o r the first n requests, the agent w i l l attempt a fc-step repair. I f the repair succeeds, the state is marked as b r idged . I f w e cannot repair the infeas ib le t ransi t ion, we mark it not-repairable and suppress augmented backups . W e may w i s h to e m p l o y i m p l i c i t im i t a t ion w i t h feas ib i l i ty test ing i n a m u l t i p l e m e n -tors scenario. T h e k e y change f rom i m p l i c i t im i t a t ion wi thou t f eas ib i l i t y test ing is that the observer w i l l o n l y imi ta te feasible act ions. W h e n the observer searches th rough the set o f mentors for the one w i t h the ac t ion that results i n the highest va lue estimate, the observer must cons ider o n l y those mentors whose actions are s t i l l cons idered feasible (or assumed to be repairable) . 97 Table 4.3: Elaborated augmented backup test F U N C T I O N use_augmented?(s,m) : Boolean IF V0(s) y Vm(s) T H E N R E T U R N false E L S E IF feasible(s, m) T H E N R E T U R N true E L S E IF bridged(s, m) T H E N R E T U R N false E L S E IF reachable(s, m) T H E N bridged(s,m) := true R E T U R N false E L S E IF not repairable(s, m) T H E N return false E L S E % we are searching IF 0 < search.steps(s, m) < k T H E N % search in progress return true IF search.steps(s, m) > k T H E N % search failed IF atternpts(s) > n T H E N repairable(s) = false R E T U R N false E L S E reset_search(s,m) attempts(s) := attempts(s) + 1 R E T U R N true attempts(s) :=1 % initiate first attempt of a search initiate-search(s) R E T U R N true 98 4.2.3 Empirical Demonstrations In this sect ion, w e e m p i r i c a l l y demonstrate the u t i l i t y o f f eas ib i l i ty test ing and fc-step repair and show h o w the techniques can be used to surmount both differences i n act ions between agents and s m a l l l o c a l differences i n state-space topology . T h e p rob lems here have been chosen spec i f ica l ly to demonstrate the necessity and u t i l i t y o f bo th f eas ib i l i t y test ing and k-step repair. T h e b a c k g r o u n d details on me thodo logy and lea rn ing parameters are iden t i ca l to that descr ibed at the b e g i n n i n g o f Sec t ion 4.1.3 except where noted. Experiment 1: Necessity of Feasibility Testing O u r first exper iment shows the impor tance o f feas ib i l i ty test ing i n i m p l i c i t im i t a t i on w h e n agents have heterogeneous act ions. In this scenario, a l l agents must navigate across an obstacle-free, 10x10 g r i d w o r l d f r o m the upper-left corner to a goa l loca t ion i n the lower - r igh t . T h e agent is then reset to the upper-left corner. T h e first agent is a mentor w i t h the " N E W S " ac-t ion set (Nor th , Sou th , East , and Wes t movement act ions) . T h e mentor is g i v e n an o p t i m a l stationary p o l i c y for this p r o b l e m . 
W e study the performance o f three learners, each w i t h the " S k e w " ac t ion set ( N , S, N E , S W ) and unable to dupl ica te the mentor exac t ly (e.g., d u p l i -ca t ing a mentor ' s E - m o v e requires the learner to m o v e N E f o l l o w e d by S, o r m o v e S E then N ) . D u e to the nature o f the g r i d - w o r l d , the con t ro l and imi t a t i on agents w i l l ac tua l ly have to execute more act ions to get to the goa l than the mentor and the o p t i m a l g o a l rate for bo th the con t ro l and imi ta tor are therefore l o w e r than that o f the mentor. T h e first learner e m p l o y s i m p l i c i t im i t a t i on with f eas ib i l i ty test ing, the second uses im i t a t i on without f eas ib i l i t y test-i n g , and the th i rd con t ro l agent uses no imi t a t ion (i.e., is a standard re inforcement l ea rn ing agent). A l l agents exper ience l i m i t e d s tochast ic i ty i n the f o r m o f a 5 % chance that the effect o f their ac t ion w i l l be r a n d o m l y perturbed. A s i n the last sect ion, the agents use mode l -based reinforcement learn ing w i t h p r io r i t i zed sweep ing . B a s e d on a p r i o r b e l i e f about h o w " c l o s e " 99 401 1 1 1 1 1 1 1 r 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Simulation Steps F i g u r e 4.14: T h e agent wi thou t feas ib i l i ty test ing ' N o F e a s ' can become trapped b y o v e r v a l -ued states. T h e agent w i t h feas ib i l i ty test ing 'Feas ' shows t y p i c a l im i t a t i on gains . the mentor and observer act ion sets are, w e set k = 3 and n — 20 . T h e effectiveness o f feas ib i l i ty testing i n i m p l i c i t im i t a t i on can be seen i n F i g u r e 4.14. T h e hor izon ta l axis represents t ime in s imu la t ion steps and the ve r t i ca l axis represents the av-erage number o f goals ach ieved per 1000 t ime steps (averaged over 10 runs) . W e see that the imi t a t ion agent w i t h feas ib i l i ty test ing (Feas) converged m u c h more q u i c k l y to the o p t i m a l goal-at tainment rate than the other agents. T h e agent wi thou t feas ib i l i ty test ing (NoFeas ) ach ieved sporadic success early on , but frequently " l o c k e d u p " due to repeated attempts to dupl ica te infeas ible mentor act ions. T h e agent s t i l l managed to reach the g o a l f r o m t ime to t ime as the stochastic act ions do not permi t the agent to become permanent ly s tuck i n this obstacle-free scenario. T h e con t ro l agent (Ct r l ) w i thou t any f o r m o f im i t a t i on d e m o n -100 strated a s ignif icant delay i n convergence re la t ive to the imi t a t ion agents due to the l ack o f any f o r m o f gu idance , but eas i ly surpassed the agent wi thou t f eas ib i l i t y test ing i n the l o n g run. T h e more gradual s lope o f the con t ro l agent is due to the h igher var iance i n the con t ro l agent 's d i s cove ry t ime for the o p t i m a l path, but both the feas ib i l i ty - tes t ing imi ta to r and the con t ro l agent converged to o p t i m a l solut ions . A s s h o w n b y the c o m p a r i s o n o f the t w o imi t a -t ion agents, f eas ib i l i ty test ing is necessary to adapt i m p l i c i t i m i t a t i o n to contexts i n v o l v i n g heterogeneous act ions. Experiment 2: Changes to State Space W e deve loped feas ib i l i ty test ing and b r i d g i n g p r i m a r i l y to deal w i t h the p r o b l e m o f adapt-i n g to agents w i t h heterogeneous act ions. 
T h e same techniques, however , can be app l i ed to agents w i t h differences i n their state-space connec t iv i ty (these are equ iva len t no t ions u l t i -mate ly) . T o test this , w e constructed a d o m a i n where a l l agents have the same N E W S ac t ion set, but we alter the env i ronment o f the learners b y in t roduc ing obstacles that aren't present for the mentor. In F i g u r e 4 .15 , the learners find that the mentor ' s path is obstructed b y ob-stacles. M o v e m e n t t oward an obstacle causes a learner to r ema in i n its current state. In this sense, its ac t ion has a different effect than the mentor ' s . Imi ta t ion is s t i l l pos s ib l e i n this sce-nar io as the u n b l o c k e d states can be ass igned a va lue th rough p r i o r ac t ion va lues that leak a round the obstacles and because the mentor p o l i c y is somewhat n o i s y and o c c a s i o n a l l y goes a round obstacles. In F i g u r e 4.16 w e see that the results are qua l i t a t ive ly s i m i l a r to the p rev ious exper-iment . In contrast to the prev ious exper iment , bo th imi ta tor and con t ro l use the " N E W S " ac t ion set and therefore have a shortest path w i t h the same length as that o f the mentor. Consequent ly , the o p t i m a l g o a l rate o f the imitators and con t ro l is h igher than i n the pre-v ious exper iment . T h e observer wi thou t feas ib i l i ty testing (NoFeas ) has d i f f i cu l ty w i t h the 101 F i g u r e 4 .15 : Obstac les b l o c k the mentor ' s intended p o l i c y , though stochast ic act ions cause mentor to deviate somewhat . maze as the va lue func t ion augmented by mentor observat ions cons is tent ly l ed the observer to states w h o s e path to the g o a l is d i rec t ly b l o c k e d . T h e agent w i t h feas ib i l i ty test ing (Feas) q u i c k l y d i scove red states where the mentor ' s inf luence is inappropria te . W e conc lude that loca l differences i n state are w e l l handled by feas ib i l i ty testing. Experiment 3: Generalizing trajectories O u r next exper iment investigates the quest ion o f whether im i t a t i on agents can comple t e ly general ize the mentor ' s trajectory. Here the mentor f o l l o w s a path w h i c h is comple t e ly i n -feasible for the imi t a t ing agent. W e fix the mentor ' s path for a l l runs and g i v e the imi t a t ing agent the maze s h o w n i n F i g u r e 4.17 i n w h i c h a l l but t w o o f the states the mentor v is i t s are b l o c k e d b y an obstacle. T h e imi t a t i ng agent shou ld be able to use the mentor ' s trajectory for guidance and b u i l d its o w n para l le l trajectory w h i c h is comple te ly d i s jo in t f r o m the mentor ' s . T h e results i n F i g u r e 4.18 show that ga in o f the imi ta tor w i t h feas ib i l i ty test ing (Feas) over the con t ro l agent (Ct r l ) is ins igni f icant i n this scenario. H o w e v e r , w e see that the agent wi thou t feas ib i l i ty test ing (NoFeas ) d i d very poor ly , even w h e n compared to the con t ro l 102 501 1 1 i 1 1 r 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Simulation Steps Figure 4.16: Imitation gain is reduced by observer obstacles within mentor trajectories but not eliminated. s h-. ! J ! 1 i ; ; ; _ j - - - Observer " Mer tor ... T ,. X Figure 4.17: Generalizing an infeasible mentor trajectory. 
Figure 4.18: A largely infeasible trajectory is hard to learn from and minimizes gains from imitation.

This is because it gets stuck around the doorway. The high-value gradient backed up along the mentor's path becomes accessible to the agents at the doorway. The imitation agent with feasibility testing concluded that it could not proceed south from the doorway (into the wall) and that it must try a different strategy. The imitator without feasibility testing never explored far enough away from the doorway to set up an independent value gradient that would guide it to the goal. With a slower decay schedule for exploration, the imitator without feasibility testing would have found the goal, but this would still reduce its performance below that of the imitator with feasibility testing.

Experiment 4: Bridging and repair

As explained earlier, in simple problems there is a good chance that the informal effects of prior value leakage and stochastic exploration may form bridges before feasibility testing cuts off the value propagation that guides exploration. In more difficult problems where the agent spends a lot more time exploring, it will accumulate sufficient samples to conclude that the mentor's actions are infeasible long before the agent has constructed its own bridge. The imitator's performance would then drop down to that of an unaugmented reinforcement learner.

To demonstrate bridging, we devised a domain in which agents must navigate from the upper-left corner to the bottom-right corner, across a "river" which is three steps wide and exacts a penalty of -0.2 per step (see Figure 4.19). The river is represented by the squares with a grey background. The goal state is worth 1.0. In the figure, the path of the mentor is shown starting from the top corner, proceeding along the edge of the river and then crossing the river to the goal. The mentor employs the "NEWS" action set. The observer uses the "Skew" action set (N, NE, S, SW) and attempts to reproduce the mentor trajectory. It will fail to reproduce the critical transition at the border of the river (because the "East" action is infeasible for a "Skew" agent). The mentor action can no longer be used to back up value from the reward state, and there will be no alternative paths because the river blocks greedy exploration in this region. Without bridging or an optimistic and lengthy exploration phase, observer agents quickly discover the negative states of the river and curtail exploration in this direction before actually making it across.

We can examine the value function estimate of an imitator (after 1000 steps) with feasibility testing but no repair capabilities. We plot the value function on the map figure with circles whose radius represents the magnitude of the value in the corresponding state.
The increasing radius of the black circles represents increasing positive value; the increasing radius of the grey circles represents increasing negative value.

Figure 4.19: The River scenario in which squares with grey backgrounds result in a penalty for the agent.

We see that, due to suppression by feasibility testing, the black-circled high-value states backed up from the goal terminate abruptly at an infeasible transition without making it across the river. In fact, they are dominated by the lighter grey circles showing negative values.

In this experiment, we show that bridging can prolong the exploration phase in just the right way. We employ the k-step repair procedure with k = 3. Examining the graph in Figure 4.20, we see that both imitation agents experienced an early negative dip as they are guided deep into the river by the mentor's influence. The agent without repair eventually decided the mentor's action is infeasible, and thereafter avoids the river (and the possibility of finding the goal). The imitator with repair also discovered the mentor's action to be infeasible, but did not immediately dispense with the mentor's guidance. It kept exploring in the area of the mentor's trajectory using a random walk, all the while accumulating a negative reward until it suddenly found a bridge and rapidly converged on the optimal solution.16 The control agent discovered the goal only once in ten runs.

16 While repair steps take place in an area of negative reward in this scenario, this need not be the case. Repair doesn't imply short-term negative return.

Figure 4.20: The agent with repair finds the optimal solution on the other side of the river.

The results in this section suggest that feasibility testing and k-step repair are a viable technique for extending imitation to agents with heterogeneous action capabilities. At present, our technique for handling heterogeneous actions requires tuning of ad hoc bridging parameters to optimize imitation performance. Examining techniques to automate the selection of these parameters would significantly broaden the appeal of our method. The wider applicability of these methods to general domains remains to be analyzed. We consider the question of applicability of imitation and general performance issues in Chapter 6.

Chapter 5

Bayesian Imitation

The utility of imitation as transition model transfer was demonstrated in the previous chapter. A series of simple experiments demonstrated that the simple statistical tests used were both necessary and sufficient to achieve significant performance gains in many problems.
In this chapter, we examine how more sophisticated Bayesian methods can be used to smoothly integrate the observer's prior knowledge, knowledge the observer gains from interaction with the world, and knowledge the observer extracts from observations of an expert mentor. The Bayesian technique can be derived quite cleanly, but its implementation for most useful model classes is mathematically intractable. We derive an approximate update scheme based on sampling techniques, which we call augmented model inference. Experiments in this chapter demonstrate that Bayesian imitation methods, at some additional computational cost, can outperform their non-Bayesian counterparts and eliminate tedious parameter tuning. The clean general form of the Bayesian model will also set the stage for more complex models in the next chapter.

The derivation of Bayesian imitation is premised on the same assumptions developed in Chapter 3, namely the presence of an expert, complete observability of observer and mentor states, knowledge of reward functions, similar action capabilities and the existence of a mapping between observer and mentor state spaces. Bayesian imitation also retains the idea of augmenting the observer's knowledge with observations of mentor state transitions. In contrast to the work of Chapter 4, however, the observer does not choose between experience-based and mentor-observation-based model estimates; instead, it pools the evidence collected from self-observation, mentor observations and prior beliefs to update a single, more accurate augmented model. Using the transition model augmented with additional model information derived from an expert mentor, the observer can then calculate more accurate expected values for each of its actions and make better action choices early in the learning process.

5.1 Augmented Model Inference

Implicit imitation can be recast in the Bayesian paradigm as the problem of updating a probability distribution over possible transition models T given various sources of information. The observer starts with a prior probability over possible transition models Pr(T), where T ∈ 𝒯. After observing its own transitions and those of the mentor, the observer will calculate an updated posterior distribution over possible models. The observations of the observer's own state transitions are collected into a sequence called the state history. Let H_o = ⟨s_o^0, s_o^1, ..., s_o^t⟩ denote the history of the observer's transitions from the initial state s_o^0 to the current state s_o^t, and let A_o = ⟨a_o^0, a_o^1, ..., a_o^t⟩ represent the observer's history of action choices from the initial state to the present. The history of mentor observations is similarly represented by H_m = ⟨s_m^0, s_m^1, ..., s_m^t⟩. Since the mentor is assumed to be following a stationary policy π_m, its actions are strictly a function of its state history and need not be explicitly represented.

The sources of observer knowledge, H_o, H_m and Pr(T), and the relationships between these sources of knowledge and other variables are shown graphically in a causal model in Figure 5.1 (see Section 2.5 of Chapter 2 for more on graphical models).

Figure 5.1: Dependencies among the transition model T and evidence sources H_o, H_m.
In the causal model, we see that the agent's action choices A_o and the transition model T influence the observer's history H_o. Similarly, the mentor's action policy π_m and the transition model T influence the mentor's history H_m. Using Bayes' rule, it is possible to reason about the posterior probability of a particular transition model T given the prior probability of T and the agent observation histories H_o, A_o and H_m. We denote the posterior probability of model T given observations H_o, A_o and H_m as Pr(T | H_o, A_o, H_m). Using Bayes' rule, and the independencies implied by the graph in Figure 5.1 (i.e., H_o and H_m are independent given a transition model T), we arrive at the Bayesian update:1

Pr(T | H_o, A_o, H_m) = α Pr(H_o, H_m | T, A_o) Pr(T)
                      = α Pr(H_o | T, A_o) Pr(H_m | T) Pr(T)     (5.1)

1 α is a normalization parameter.

The simple form of the Bayesian update belies a great deal of practical complexity. The space of possible models in the domains we are considering is the set of all real-valued functions over S × A → ℝ. Representing an explicit distribution over possible models T ∈ 𝒯 is infeasible. Instead, we assume that the prior Pr(T) over possible transition models has a factored form, in which the probability distribution over transition models can be expressed as the product of distributions over local transition models for each state and action, Pr(T) = ∏_{a∈A} ∏_{s∈S} Pr(T^{s,a}). The distribution over each local transition model Pr(T^{s,a}) is represented as a Dirichlet (see Section 2.3).

In the absence of observations from a mentor agent, the independent Dirichlet factors can be updated using the usual increments to the count parameters n(s, a, t). The introduction of mentor observations creates some complications. In principle, upon observing the mentor make the transition from state s to state t, we could increment the corresponding count n(s, a, t) for whichever action a was equivalent to the mentor action a_m. Unfortunately, we do not know the identity of a_m. The Bayesian perspective would naturally suggest that we treat the mentor action as a random variable A_m and represent our uncertainty as a probability distribution Pr(A_m = a). The key question is how to maintain the elegant product-of-Dirichlets form for our distribution over possible transition models when our source of evidence takes the form of uncertain events over an unknown action.

The answer can be found by breaking down the update into two distinct cases. Though the mentor's action a_m = π_m(s) is unknown, suppose it is different from the observer's action a (i.e., π_m(s) ≠ a). In this case, the model factor T^{s,a} would be independent of the mentor's history, and we can employ the standard Bayesian update without regard for the mentor:

Pr(T^{s,a} | H_o^{s,a}, π_m(s) ≠ a) = α Pr(H_o^{s,a} | T^{s,a}) Pr(T^{s,a})     (5.2)

Note that the update need only consider the subset H_o^{s,a} of the agent's history that concerns the outcome of action a in state s. The update has a simple closed form in which one increments the Dirichlet parameter n(s, a, t) according to a count c(s, a, t) of the number of times the observer has visited state s, performed action a and ended up in state t.
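To make the factored Dirichlet representation concrete, the following Python sketch (my own illustration, not code from the thesis) stores one Dirichlet parameter vector per state-action pair, applies the standard conjugate update of Equation 5.2 to the observer's own transitions, and recovers the expected local transition model from the counts. The uniform prior of 1 mirrors the weak priors used in the experiments later in this chapter; the array sizes and function names are assumptions made for the example.

    import numpy as np

    n_states, n_actions = 25, 4   # illustrative sizes

    # n[s, a, t] holds the Dirichlet parameter for the local model T^{s,a};
    # initializing every parameter to 1 encodes a weak prior belief in a
    # uniform distribution over possible action outcomes.
    n = np.ones((n_states, n_actions, n_states))

    def record_observer_transition(n, s, a, t):
        # Standard conjugate update: increment the count for the observed outcome.
        n[s, a, t] += 1.0

    def mean_model(n, s, a):
        # Expected transition distribution (the mean of the Dirichlet) for T^{s,a}.
        return n[s, a] / n[s, a].sum()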
Suppose, however, that the mentor action a_m = π_m(s) is in fact the same as the observer's action a (i.e., π_m(s) = a). Then the mentor observations are relevant to the update of Pr(T^{s,a}):

Pr(T^{s,a} | H_o^{s,a}, H_m^s, π_m(s) = a)
    = α Pr(H_o^{s,a}, H_m^s | T^{s,a}, π_m(s) = a) Pr(T^{s,a} | π_m(s) = a)
    = α Pr(H_o^{s,a} | T^{s,a}) Pr(H_m^s | T^{s,a}, π_m(s) = a) Pr(T^{s,a})

Let n^{s,a} be the prior parameter vector for Pr(T^{s,a}), let c_o^{s,a} denote the counts of observer transitions from state s via action a, and let c_m^s denote the counts of the mentor transitions from state s. The posterior augmented model factor density Pr(T^{s,a} | H_o^{s,a}, H_m^s, π_m(s) = a) is then a Dirichlet with parameters n^{s,a} + c_o^{s,a} + c_m^s; that is, we simply update the Dirichlet parameters with the sum of the observer and mentor counts. The Dirichlet over models T^{s,a} with parameters n^{s,a} + c_o^{s,a} + c_m^s is denoted Pr(T^{s,a}; n^{s,a} + c_o^{s,a} + c_m^s):

Pr(T^{s,a} | H_o^{s,a}, H_m^s, π_m(s) = a) = Pr(T^{s,a}; n^{s,a} + c_o^{s,a} + c_m^s)

Since the observer does not know the identity of the mentor's action, we compute the expectation with respect to these two possible cases:

Pr(T^{s,a} | H_o^{s,a}, H_m) = Pr(π_m(s) = a | H_o^{s,a}, H_m) Pr(T^{s,a}; n^{s,a} + c_o^{s,a} + c_m^s)
                             + Pr(π_m(s) ≠ a | H_o^{s,a}, H_m) Pr(T^{s,a}; n^{s,a} + c_o^{s,a})     (5.3)

Equation 5.3 describes an update having the usual conjugate form, but distributing the mentor counts (i.e., the evidence provided by mentor transitions at state s) across all actions, weighted by the posterior probability that the mentor's policy chooses that action at state s.2

2 This assumes that at least one of the observer's actions is equivalent to the mentor's, but our model can be generalized to the heterogeneous case.

The final step in the derivation is to show that the posterior distribution over the mentor's policy, Pr(π_m(s) = a), can be computed in an efficient factored form as well. Using Bayes' rule and the independencies of the causal model in Figure 5.1, we can derive an expression for the posterior distribution:

Pr(π_m | H_m, H_o) = α Pr(H_m | π_m, H_o) Pr(π_m | H_o)     (5.4)

If we assume that the prior over the mentor's policy is factored in the same way as the prior over models, that is, we have independent distributions Pr(π_m(s)) over A_m for each s, this update can be factored as well, with history elements at state s being the only ones relevant to computing the posterior over π_m(s). The last difficulty to overcome is the computation of the integral over possible transition models, ∫_{T∈𝒯} Pr(H_m | π_m, T) Pr(T | H_o). This type of integral is often found in the calculation of expectations and can be approximated using Monte Carlo integration (Mackay, 1999). Using the techniques of Dearden et al. (1999), we sample transition models to estimate this quantity. Specifically, we sample a model T^{s,a} for a specific state and action from the Dirichlet factor Pr(T^{s,a} | H_o^{s,a}).
We then compute the likelihood that the mentor's action at s is equivalent to the observer's action a by computing the joint probability of the observed mentor transitions under the model T^{s,a}:

Pr(H_m^s | π_m(s) = a, T^{s,a}) = ∏_{t∈S} T^{s,a}(t)^{c_m^s(t)}     (5.5)

We can combine the expression for the factored augmented model update in Equation 5.3 with our expression for the mentor policy likelihood in Equation 5.5 to obtain a tractable algorithm for updating the observer's beliefs about the dynamics model T based on prior beliefs, its own experience and its observations of the mentor.3

3 The terms T^{s,a}(t)^{c_m^s(t)} in Equation 5.5 can become very small as the number of visits to a state becomes large, resulting in underflow. Scaling techniques such as those used for hidden Markov model computation can prove quite useful here.

Pseudo-code to calculate the augmented model appears in Algorithm 1. The algorithm takes the arguments N_o and N_m, corresponding to the Dirichlet parameters for the distributions over observer and mentor models; the arguments s and a, corresponding to the current state and action; and the argument η, corresponding to the number of samples to use in Monte Carlo integration. The algorithm returns the augmented model T+. The model T+ takes the form of a multinomial distribution over successor states corresponding to the expected or mean transition probabilities of the Dirichlet distribution over transition models.

ALGORITHM 1  T+ = augmentedModel(N_o, N_m, s, a, η)
    L = [0, ..., 0]                       % L(a') is the likelihood of action a' being the mentor's
    b = 0                                 % sum for normalization
    for each action a' ∈ A_o
        for c = 1 to η
            T^{s,a'} ~ D[N_o^{s,a'}]      % sample from the Dirichlet for s, a'
            L(a') = L(a') + ∏_{t∈S} T^{s,a'}(t)^{c_m^s(t)}
        b = b + L(a')
    a_m ~ M[L/b]                          % sample from multinomial on normalized average likelihoods
    if a_m = a                            % if the mentor's action is likely the same, add mentor evidence
        T+ = mean(D[N_o^{s,a} + c_o^{s,a} + c_m^s])
    else
        T+ = mean(D[N_o^{s,a} + c_o^{s,a}])

A Bayesian imitator thus proceeds as follows: at each stage, it observes its own state transition and that of the mentor. The Dirichlet parameters for the observer and mentor models are incremented. Conceptually, an augmented model T+ is then sampled probabilistically from the pooled distribution based on both mentor and observer observations using Algorithm 1. The augmented model T+ is used to update the agent's values. These augmented values have the potential to be more accurate early in learning and therefore allow a Bayesian imitator to choose better actions early in the learning phase.
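As a concrete illustration, the following Python sketch implements the sampling scheme of Algorithm 1 for a tabular domain. It is my own rendering rather than the thesis's code: the array names (N_o, C_o, C_m), the explicit separation of prior parameters from counts, and the default number of samples are assumptions made for the example.

    import numpy as np

    def augmented_model(N_o, C_o, C_m, s, a, n_samples=20, rng=None):
        # Expected augmented transition distribution T+(s, a, .), pooling mentor
        # evidence when the mentor's (unobserved) action at s is judged to match a.
        #   N_o[s, a]: prior Dirichlet parameters for the local model T^{s,a}
        #   C_o[s, a]: observer transition counts from s under action a
        #   C_m[s]:    mentor transition counts from state s
        rng = rng or np.random.default_rng()
        n_actions = N_o.shape[1]
        L = np.zeros(n_actions)                  # likelihood of each action being the mentor's
        for b in range(n_actions):
            for _ in range(n_samples):
                T_sb = rng.dirichlet(N_o[s, b] + C_o[s, b])   # sampled local model for (s, b)
                # joint probability of the observed mentor transitions under this sample;
                # in practice the product may underflow and require scaling (see footnote 3)
                L[b] += np.prod(T_sb ** C_m[s])
        total = L.sum()
        post = L / total if total > 0 else np.full(n_actions, 1.0 / n_actions)
        a_m = rng.choice(n_actions, p=post)      # sampled guess at the mentor's action at s
        params = N_o[s, a] + C_o[s, a] + (C_m[s] if a_m == a else 0.0)
        return params / params.sum()             # mean of the (possibly augmented) Dirichlet

In use, the returned distribution would simply replace the observer's current point estimate of the local model for (s, a) before value backups are performed.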
Like any reinforcement learning agent, an imitator requires a suitable exploration mechanism. In the Bayesian exploration model (Dearden et al., 1999), the uncertainty about the effects of actions is captured by a Dirichlet and is used to estimate a distribution over possible Q-values for each state-action pair. Notions such as value of information can then be used to approximate the optimal exploration policy (Section 2.3). Bayesian exploration also eliminates the parameter tuning required by methods like ε-greedy, and adapts locally and instantly to evidence. These facts make it a good candidate to combine with Bayesian imitation.

5.2 Empirical Demonstrations

In this section we attempt to empirically characterize the applicability and expected benefits of Bayesian imitation through several experiments. Using domains from the literature and two unique domains, we compare Bayesian imitation to the non-Bayesian imitation model developed in the previous chapter and to several standard model-based reinforcement learning (non-imitating) techniques, including Bayesian exploration, prioritized sweeping and complete Bellman backups. We also investigate how Bayesian exploration combines with imitation.

First, we describe the set of agents used in our experiments. The Oracle agent employs a fixed policy optimized for each domain. The Oracle provides both a baseline and a source of expert mentor behavior for the observer agents. The EGBS agent combines ε-greedy exploration (EG) with a full sweep of Bellman backups at each time step (i.e., it performs a Bellman backup at every state for each experience obtained). It provides an example of a generic model-based approach to learning. The EGPS agent is a model-based RL agent, using ε-greedy (EG) exploration with prioritized sweeping (PS) (Moore & Atkeson, 1993). EGPS uses fewer backups, but applies them where they are predicted to do the most good. EGPS does not have a fixed backup policy, so it can propagate value multiple steps across the state space in situations where EGBS would not. The BE agent employs Bayesian exploration (BE) with prioritized sweeping for backups. BEBI combines Bayesian exploration (BE) with Bayesian imitation (BI). EGBI combines ε-greedy exploration (EG) with Bayesian imitation (BI). EGNBI combines ε-greedy exploration with the non-Bayesian imitation model of the previous chapter.

Experiment 1: Suboptimal Policy

Our first experiments with the agents were on the "loop" and "chain" examples, which were designed to show the benefits of Bayesian exploration (Dearden et al., 1999). The chain problem consists of five states in a chain, as pictured in Figure 5.2. It requires the agent to choose between action b, which moves the agent back to the beginning with a small local reward, and action a, which moves the agent towards a distant, highly rewarding goal. There are no local rewards associated with the actions that lead to the highly rewarding goal. At each state, there is a 0.2 probability of the agent's action having the opposite of the intended effect, and the discount factor is 0.99.

Figure 5.2: The Chain-world scenario, in which the myopically attractive action b is suboptimal in the long run.
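For concreteness, the chain domain just described can be encoded as a small tabular MDP along the following lines. This is an illustrative sketch only: the specific reward magnitudes are stand-ins, since the text above does not restate them, and the slip is modeled as the action producing the opposite effect.

    import numpy as np

    N_STATES, (A, B) = 5, (0, 1)    # action a advances along the chain; action b returns to the start
    SLIP, GAMMA = 0.2, 0.99         # probability of the opposite effect; discount factor

    T = np.zeros((N_STATES, 2, N_STATES))
    for s in range(N_STATES):
        forward, start = min(s + 1, N_STATES - 1), 0
        T[s, A, forward] += 1 - SLIP    # a: move toward the distant goal ...
        T[s, A, start] += SLIP          # ... but behave like b with probability 0.2
        T[s, B, start] += 1 - SLIP      # b: return to the beginning ...
        T[s, B, forward] += SLIP        # ... but behave like a with probability 0.2

    R = np.zeros((N_STATES, 2))
    R[:, B] = 1.0                   # small local reward for b (illustrative magnitude)
    R[N_STATES - 1, A] = 10.0       # large reward at the far end of the chain (illustrative magnitude)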
The chain problem challenges agents to engage in sufficient exploration to find the optimal policy, which is to choose action a at each state. The agent's Dirichlet parameters are initially set to 1 for all action outcomes, to represent a weak prior belief in a uniform distribution over possible action outcomes.

Figure 5.3: Chain-world results (averaged over 10 runs).

The results for the chain scenario appear in Figure 5.3. We measure agent performance as the average reward collected over the last 200 steps (see Section 4.1.3 for more information on performance measures). Each line in the graph corresponds to the average of 10 runs for each agent, and the error bars represent standard error on the sample of 10 runs. We see that the Bayesian imitator (BEBI), the non-Bayesian imitator (EGNBI) and the Bayesian explorer without imitation (BE) all cluster together near optimal performance. Since the variance of the results is high, it is not clear that any of these agents is dominant in this scenario. Given the small state and action space and the Bayesian explorer's ability to dynamically adjust its exploration rate, it is capable of performing as well as the best imitators, who gain insight into the value function but still must recover the identity of the mentor's action. The top-performing agents do much better than the standard reinforcement learners (EGBS and EGPS), who settle on a suboptimal policy. Increasing the exploration of these agents causes their performance to fall further in the short term, though it does eventually allow them to find the optimal policy.

The combination of Bayesian imitation with ε-greedy exploration does unexpectedly poorly. This result is also seen in other experiments. We speculate that there is an interaction between the sampling of transition models under Bayesian imitation and ε-greedy exploration, particularly on these simple problems, which are solved after very few samples. We speculate that early in the learning process, when the distribution over possible transition models is diffuse, our model sampling scheme generates highly fluctuating models, which in turn result in fluctuations in the relative magnitudes of Q-values. Under a greedy action scheme, the nonlinear effect of the max operator can cause subsequent visits to a state to result in totally different action choices. This can prevent the agent from consistently backing up value from future states and establishing the values that would guide it towards the distant goal. The Bayesian imitator with Bayesian exploration chooses actions probabilistically according to their Q-values and is therefore less sensitive to the relative ordering of the Q-values. The non-Bayesian imitator with ε-greedy exploration derives its model from the mean of the Dirichlet model, which is considerably more stable.
Its Q-values are therefore also more stable, and the agent is able to pursue a long-term strategy consistent with the overall value function derived from the mentor agent. We see that this problem is not difficult enough to demonstrate a significant difference in performance between the Bayesian explorer without imitation (BE) and the Bayesian explorer with imitation (BEBI).

Experiment 2: Loop

The "loop" problem (Dearden et al., 1999) consists of eight states arranged in two loops, as shown in Figure 5.4. Actions in this domain are deterministic. The right-hand loop, involving states 1, 2, 3 and 4, contains no branches based on actions and is therefore easy for the agent to traverse, and the modest reward available in state 4 can be quickly backed up. The loop domain is difficult to explore, as the left-hand loop, involving states 5, 6, 7 and 8, has frequent arcs back to 0, which makes it difficult for the agent to find the larger reward available in state 8. It therefore takes longer to back up this reward along the trajectory. The optimal policy is to perform action b everywhere. The discount and priors are assigned as in the chain problem.

Figure 5.4: Loop domain.

Figure 5.5: Loop domain results (averaged over 10 runs).

The results for the loop domain appear in Figure 5.5. Again, the graphs represent the average of 10 runs, with the performance on each run measured by the average reward collected in the last 200 steps. Qualitatively, the results are very similar to those for the chain world. The Bayesian agents (BE and BEBI) perform as well as the optimal oracle agent (their lines all overlap in this graph). The non-Bayesian imitator (EGNBI) performs slightly less well. In this domain we see that the non-Bayesian agent's inability to flexibly adjust its exploration rate given its local knowledge is starting to hurt its performance. The combination of Bayesian imitation with ε-greedy exploration shows the same interaction effect as in the chain world, and we speculate that a similar effect is occurring. This problem is not difficult enough to separate the performance of the Bayesian explorer without imitation (BE) from the Bayesian explorer with Bayesian imitation (BEBI). We observe that the addition of imitation capabilities to an existing learner did not hurt its performance.

Figure 5.6: Flag-world.

Figure 5.7: Flag-world results (averaged over 50 runs).

Experiment 3: More Complex Problems

Using the more challenging "flag-world" domain (Dearden et al., 1999), we see meaningful differences in performance amongst the agents. In flag-world, shown in Figure 5.6, the agent starts at state S and searches for the goal state G1. The agent may pick up any of three flags by visiting states F1, F2 and F3.
Upon reaching the goal state, the agent receives 1 point for each flag collected. Each action (N, E, S, W) succeeds with probability 0.9 if the corresponding direction is clear, and with probability 0.1 moves the agent perpendicular to the desired direction. If the corresponding direction is not clear, the agent will remain where it is with probability 0.9 and attempt a perpendicular direction with the remaining probability. In general, the priors over the agent's action outcomes are set to a low-confidence uniform distribution over the immediate neighbors of the state. The priors are set to take into account the presence of obstacles and the transitions to states in which the agent possesses flags. Thus, actions have a non-zero probability of ending up in at most four neighboring states, but possibly fewer due to obstacles. This domain is a good exercise for exploration mechanisms, as it is easy to find suboptimal solutions to this problem by proceeding to the goal state with less than the full complement of flags.

Figure 5.7 shows the reward collected in the prior 200 steps for each agent, averaged over 50 runs. The Oracle agent shows the optimal performance for this domain. The Bayesian imitator with Bayesian exploration (BEBI) achieves the quickest convergence to the optimal solution, though the complexity of this domain now requires the observer to engage in considerable experimentation to discover the correct mentor action. This results in a significant separation between the Bayesian imitator and the optimal oracle agent's performance. We notice that the combination of ε-greedy exploration with Bayesian imitation (EGBI) does better on this problem. In this domain, considerably more samples are required to build up accurate value functions, and the observer agent has more possible paths leading to approximately the same outcome. We speculate that these features mitigate the issues of maximization over sampled models. The non-Bayesian imitator (EGNBI) performs relatively poorly. In this problem, the non-Bayesian imitator's inability to tailor exploration rates to its local state of knowledge makes it difficult for it to exploit mentor knowledge while retaining sufficient exploration to infer the mentor's actions. The Bayesian explorer (BE) settles on a suboptimal strategy. The agents using generic reinforcement learning strategies converge more slowly to the optimal solution, though the graph has been truncated before convergence.

Experiment 4: Dissimilar Goals

In this experiment, we altered the flag-world domain, keeping the goal of the mentor agent at location G1 and moving the goal for all other agents to the new location labeled G2 in Figure 5.6.

Figure 5.8: Flag-world with moved goal (averaged over 50 runs).
The results, Figure 5.8(a), show that the imitation transfer is qualitatively similar to the case with identical rewards. We conclude that imitation transfer is robust to certain differences in objectives between mentor and imitator. This is readily explained by the fact that the mentor's policy provides model information over most states in the domain, and this information can be employed by the observer to achieve its own goals.

Experiment 5: Multi-dimensional Problems

We introduced a new problem we call the tutoring domain, which requires agents to schedule the presentation of simple patterns to human learners in order to minimize training time. To simplify our experiments, we have the agents teach a simulated student. The student's performance is modeled by independent, discretely approximated, exponential forgetting curves for each concept (Anderson, 1990, page 164). The agent's action is its choice of concept to present. The agent receives a reward when the student's forgetting rate has been reduced below a predefined threshold for all concepts. Presenting a concept lowers its forgetting rate; leaving it unpresented increases its forgetting rate. The action space grows linearly with the number of concepts, and the state space exponentially. Expert performance for the oracle in this domain was encoded as a simple heuristic that picks the first unlearned concept for the next presentation.

Our discrete approximation goes as follows: each concept i is represented by a single binary variable corresponding to whether the concept is learned (c_i = 1) or unlearned (c_i = 0). In our experiments, there are six concepts. The state of the simulated student at any time is the joint state c = ⟨c_1, c_2, ..., c_6⟩. When the tutor takes an action a_j to present concept j, the state of each individual concept c_i in c is updated nondeterministically according to the concept evolution function:

Pr(c_i' = 1 | c, a_j) =
    ψ        if i = j and c_i = 0   (concept learned)
    1        if i = j and c_i = 1   (concept reviewed)
    0        if i ≠ j and c_i = 0   (stays unknown)
    1 − ω    if i ≠ j and c_i = 1   (forgotten with probability ω)     (5.6)

where ψ is the learning rate, or the probability that a concept will be learned when it is presented, and ω is the forgetting rate, or the probability that a concept will be forgotten given that it was not presented. In our experiments the learning rate is ψ = 0.1 and the forgetting rate is ω = 0.01. We note that Pr(¬c_i' | c_i, a_j) = 1 − Pr(c_i' | c_i, a_j). The probability of the joint successor state is the product of the individual concept evolutions, Pr(c' | c, a_j) = ∏_i Pr(c_i' | c_i, a_j). As shown in Figure 5.9, the individual dynamics of each concept result in a large number of possible successor states. There can be up to 64 possible successor states if every concept state changes (everything is forgotten) or only a single successor state if no concept state changes. The discount rate employed in our experiments is 0.95, and the agent receives a reward of 1.0 if and only if every concept is known (i.e., ∀i, c_i).
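A direct transcription of this concept-evolution model into code might look as follows. This is an illustrative sketch, not the thesis's implementation; the function names and the use of NumPy's random generator are my own choices.

    import numpy as np

    PSI, OMEGA = 0.1, 0.01       # learning rate and forgetting rate used in the experiments

    def prob_known_next(c_i, presented, psi=PSI, omega=OMEGA):
        # Pr(c_i' = 1 | c_i, a_j): the four cases of Equation 5.6.
        if presented:                              # i == j: concept i was presented
            return 1.0 if c_i else psi             # reviewed concepts stay known; new ones learned w.p. psi
        return (1.0 - omega) if c_i else 0.0       # unpresented: forgotten w.p. omega, never learned

    def joint_prob(c, c_next, j):
        # Pr(c' | c, a_j) as the product of the individual concept evolutions.
        p = 1.0
        for i, (ci, ci_next) in enumerate(zip(c, c_next)):
            p_known = prob_known_next(ci, i == j)
            p *= p_known if ci_next else (1.0 - p_known)
        return p

    def sample_successor(c, j, rng=np.random.default_rng()):
        # Sample a joint successor state; each concept evolves independently.
        return tuple(int(rng.random() < prob_known_next(ci, i == j)) for i, ci in enumerate(c))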
The learner's priors over possible transition models assume that all feasible successor states are equally likely, with low confidence. We define feasible as the set of successor states granted non-zero support by the transition model defined in Equation 5.6.

Figure 5.9: Schematic illustration of a transition in the tutoring domain.

Figure 5.10: Tutoring domain results (averaged over 50 runs).

The results for the tutoring domain appear in Figure 5.10. We see that the imitators learn quickly, with BEBI and EGBI outperforming the others. The Bayesian pooling of mentor and observer observations continues to offer better performance than the discrete selection mechanism of non-Bayesian imitation (EGNBI). Of particular interest is the performance of the Bayesian explorer (BE). Given the high branching factor of the domain, the conservative bias of the Bayesian explorer causes it to spend a lot of time exploring many branches. The extra samples from mentor examples allow the Bayesian imitator with Bayesian exploration (BEBI) to quickly improve its confidence in relevant actions and consequently enable it to discard branches that are not worth exploring early in the learning process.

Experiment 6: Misleading Priors

The next domain provides further insight into the combination of Bayesian imitation and Bayesian exploration. In this grid world (Figure 5.11), agents can move south only in the first column. The Bayesian explorer (BE) chooses a path based on its prior belief that the space is completely connected. The agent can easily be guided down one of the long "tubes" in this scenario, only to have to retrace its steps.

Figure 5.11: No South scenario.

Figure 5.12: No South results (averaged over 50 runs).

The results for this domain, shown in Figure 5.12, clearly differentiate the early performance of the imitation agents from the Bayesian explorer and the other independent learners. The initial value function constructed from the learner's prior beliefs about the connectivity of the grid world leads it to over-value many of the states that lead to a dead end. This results in a costly misdirection of exploration and poor performance.
We see that the Bayesian imitator's ability to pool information sources and adapt its exploration rate to the local quality of information allows it to exploit the mentor observations more quickly than agents using generic exploration strategies like ε-greedy. We can see from the significant delay before the Bayesian explorer (BE) reaches the reward for the first time that the agent is getting stuck repeatedly in dead ends; its prior beliefs work strongly against its success. However, it is able to do better than the generic learners with global exploration strategies.

5.3 Summary

This chapter introduced a Bayesian imitation model that is able to smoothly integrate both mentor and observer observations. In empirical demonstrations we saw that the combination of Bayesian imitation with Bayesian exploration (BEBI) clearly dominated across all of the problems we studied. Without further experiments, however, we cannot clearly attribute this advantage to either Bayesian imitation in itself or to Bayesian exploration in itself. In our experiments, Bayesian exploration by itself (BE) is typically dominated by imitators that do not use Bayesian exploration, such as non-Bayesian imitators (EGNBI) and ε-greedy Bayesian imitators (EGBI). We saw in the tutoring domain and the No South domain that Bayesian exploration tends to be conservative given diffuse priors. On the other hand, ε-greedy Bayesian imitators are sometimes dominated by non-Bayesian imitators (EGNBI). Earlier in this chapter, we speculated that there may be a negative interaction between the model fluctuations generated by sampling in Bayesian imitation and the action maximization in ε-greedy. We suggest this as a promising area for future experiments. From a theoretical perspective, we can see that the combination of Bayesian exploration with Bayesian imitation (BEBI) will have a synergistic effect, as the extra evidence from mentor observations will improve not only the accuracy of the agent's Q-values, but also their confidence. Since this confidence drives local exploration, we would expect the observer to focus quickly on rewarding behaviors demonstrated by the mentor. From a practical perspective, Bayesian explorers are much easier to train, as they do not require repeated runs to optimize learning rate parameters.
Chapter 6

Applicability and Performance of Imitation

The applicability and performance of imitation can be evaluated from a number of different viewpoints: the assumptions inherent in our framework, the potential for accelerating learning as a function of properties of state spaces, and the likelihood that mentor input will bias an observer towards suboptimal performance. In this chapter we examine what each of these perspectives can tell us about the performance of imitation, and then go on to propose a formal metric for predicting performance as a function of domain properties. We conclude with a discussion of the potential for mentors to bias agents towards suboptimal policies.

6.1 Assumptions

We began our development of the implicit imitation framework by developing a set of assumptions consistent with imitation. These assumptions necessarily influence the applicability of imitation. They include: lack of explicit communication between mentors and observer; independent objectives for mentors and observer; full observability of mentors by the observer; unobservability of mentors' actions; and (bounded) heterogeneity. Assumptions such as full observability are necessary for our model, as formulated, to work (though we will discuss extensions to our model for the partially observable case in Chapter 7). The two assumptions that explicit communication is infeasible and that actions are unobservable extend the applicability of implicit imitation beyond other models in the literature; if these conditions do not hold, implicit imitation could be augmented to take greater advantage of the additional information. Finally, the assumptions of bounded heterogeneity and independent objectives also ensure that implicit imitation can be applied widely. However, as we have seen in earlier experiments, the degree to which rewards are the same and actions are homogeneous can have an impact on the utility of implicit imitation (i.e., on how much it accelerates learning).

6.2 Predicting Performance

Once we have determined that the basic assumptions of implicit imitation have been satisfied, we face the question of how likely implicit imitation is to actually accelerate learning. The key to answering this question is to examine briefly the reasons for potential acceleration in implicit imitation.

In the implicit imitation model, we use observations of other agents to improve the observer's knowledge about its environment and then rely on a sensible exploration policy to exploit this additional knowledge. A clear understanding of how knowledge of the environment affects exploration is therefore central to understanding how implicit imitation will perform in a domain.
Within the implicit imitation framework, agents know their reward functions, so knowledge of the environment consists solely of knowledge about the agent's action models. In general, these models can take any form. For simplicity, we have restricted ourselves to models that can be decomposed into local models for each possible combination of a system state and agent action. The local models for state-action pairs allow the prediction of a j-step successor state distribution given any initial state and a sequence of actions or a local policy. The quality of the j-step state predictions will be a function of every action model encountered between the initial state and the states at time j − 1. Unfortunately, the quality of the j-step estimate can be drastically altered by the quality of even a single intermediate state-action model. This suggests that connected regions of state space, in which all states have reasonably accurate models, will allow reasonably accurate future state predictions.

Since the estimated value of a state s is based on both the immediate reward and the reward expected to be received in subsequent states, the quality of this value estimate will depend on the connectivity of the state space. Since both the greedy and Bayesian exploration methods bias their exploration according to the estimated value of actions, the exploratory choices of an agent at state s will also be dependent on the connectivity of the state space. We have therefore organized our analysis of implicit imitation performance around the idea of state space connectivity and the regions such connectivity defines. Since connected regions play an important role in implicit imitation, we introduce an informal classification of different regions within the state space, shown graphically in Figure 6.1. We use these informal classes to motivate a precise definition for a possible metric predicting imitation performance. The regions are described in depth below.

Figure 6.1: A system of state space region classifications for predicting the efficacy of imitation (showing the observer explored region, the mentor augmented region, the irrelevant region and the reward).

We first observe that many tasks can be carried out by an agent in a small subset of the states within the state space defined for the problem. More precisely, in many MDPs, the optimal policy will ensure that an agent remains in a small subspace of the state space. This leads us to the definition of our first regional distinction: relevant versus irrelevant regions. The relevant region is the set of states with a non-zero probability of occupancy under the optimal policy.1 An ε-relevant region is a natural generalization in which the optimal policy keeps the system within the region a fraction 1 − ε of the time. Within the relevant region, we distinguish three additional subregions.

1 One often assumes that the system starts in one of a small set of states. If the Markov chain induced by the optimal policy is ergodic, then the irrelevant region will be empty.
The explored region contains those states where the observer has formulated reliable action models on the basis of its own experience. The augmented region contains those states where the observer lacks reliable action models but has improved value estimates due to mentor observations. Finally, the blind region designates those states where the observer has neither (significant) personal experience nor the benefit of mentor observations. Note that both the explored and augmented regions are created as the result of observations made by the observer (either of its own transitions or of those of a mentor). These regions will therefore have significant "connected components," that is, contiguous regions of state space where reliable action or mentor models are available. Any information about states within the blind region will come (largely) from the agent's prior beliefs.

The potential gain for an imitator in a domain can be understood in part as a function of the interaction of the regions described above. Consider the distinction between relevant and irrelevant states. Not all model information is equally helpful: the imitator needs only enough information about the irrelevant region to be able to avoid it. Since action choices are influenced by the relative value of actions, the irrelevant region will be avoided when it looks worse than the relevant region. Given diffuse priors on action models, none of the actions open to an agent will initially appear particularly attractive or unattractive. However, a mentor that provides observations within the relevant region can quickly make the relevant region look much more promising as a method of achieving higher returns and therefore constrain exploration significantly. Therefore, considering problems just from the point of view of relevance, a problem with a small relevant region relative to the entire space, combined with a mentor that operates within the relevant region, will result in maximum advantage for an imitation agent over a non-imitating agent.

In the explored region, the observer has sufficiently accurate models to compute a policy that is optimal with respect to the rewarding states within the region (the policy may be suboptimal with respect to rewards outside of the explored region). Additional observations of the states within the explored region provided by the mentor can still improve performance somewhat if significant evidence is required to accurately discriminate between the expected values of two actions. Hence, mentor observations in the explored region can help, but will not result in dramatic speedups in convergence.
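Purely as an illustration of this taxonomy (the thesis does not prescribe thresholds or any particular procedure), the classification could be operationalized by labelling states according to the amount of observer and mentor evidence available for them:

    def classify_regions(obs_counts, mentor_counts, relevant_states,
                         explored_min=10, augmented_min=1):
        # obs_counts / mentor_counts: state -> number of observed transitions from that state.
        # relevant_states: states with non-zero occupancy under the optimal policy; any state
        # outside this set is treated as irrelevant. The thresholds are hypothetical stand-ins
        # for "reliable action model" and "useful mentor evidence".
        labels = {}
        for s in relevant_states:
            if obs_counts.get(s, 0) >= explored_min:
                labels[s] = "explored"     # reliable models from the observer's own experience
            elif mentor_counts.get(s, 0) >= augmented_min:
                labels[s] = "augmented"    # no reliable model, but mentor-informed value estimates
            else:
                labels[s] = "blind"        # neither personal experience nor mentor evidence
        return labels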
In the augmented region, the observer's Q-values are likely to be more accurate than the initial Q-values, as they have been calculated from models that have been augmented with observations of a mentor. In experiments in previous chapters, we have seen that an observer entering an augmented region can experience significant speedups in convergence. Characteristics of the augmented zone, however, can affect the degree to which augmentation improves convergence speed. Since the observer receives observations of only the mentor's state and not its actions, the observer has improved value estimates for states in the augmented region, but it does not have a policy. The observer must therefore infer which actions should be taken to duplicate the mentor's behavior. Where the observer has prior beliefs about the effects of its actions, it may be possible to perform immediate inference about the mentor's actual choice of action (perhaps using the KL-distance or maximum likelihood). Where the observer's prior model is uninformative, the observer will have to explore the local action space. In exploring a local action space, however, the agent must take an action, and this action will have an effect. Since there is no guarantee that the agent took the action that duplicates the mentor's action, it may end up somewhere different than the mentor. If the action causes the observer to fall outside of the augmented region, the observer will lose the guidance that the augmented value function provides and fall back to the performance level of a non-imitating agent.

An important consideration, then, is the probability that the observer will remain in augmented regions and continue to receive guidance. One quality of the augmented region that affects the observer's probability of staying within its boundaries is its relative coverage of the state space. The policy of the mentor may be sparse or complete. In a relatively deterministic domain with defined begin and end states, a sparse policy covering few states may define an adequate policy for the mentor or observer. In a highly stochastic domain without defined goal states, an agent may need a complete policy (i.e., covering every state). Implicit imitation provides more guidance to the agent in domains that are more stochastic and require more complete policies, since the policy will cover a larger part of the state space.

In addition to the completeness of a policy, we must also consider the probability that the observer will make transitions into and out of the augmented region. Where the actions in a domain are largely invertible (directly, or effectively so), the agent has a chance of re-entering the augmented region. Where this property is lacking, however, the agent may have to wait until the process undergoes some form of "reset" before it has the opportunity to gather additional evidence regarding the identity of the mentor's actions in the augmented region.
The "reset" places the agent back into the explored region, from which it can make its way to the frontier where it last explored. The lack of invertible actions would reduce the agent's ability to make progress towards high-value regions before resets, but the agent is still guided by the augmented region on each attempt. Effectively, the agent will concentrate its exploration on the boundary between the explored region and the mentor-augmented region.

The utility of mentor observations will depend on the probability of the augmented and explored regions overlapping in the course of the agent's exploration. In explored regions, accurate action models allow the agent to move as quickly as possible to high-value regions. In augmented regions, augmented Q-values inform agents about which states lead to highly valued outcomes. When an augmented region abuts an explored region, the improved value estimates from the augmented region are rapidly communicated across the explored region by accurate action models. The observer can use the resultant improved value estimates in the explored region, together with the accurate action models in the explored region, to rapidly move towards the most promising states on the frontier of the explored region. From these states, the observer can explore outwards and eventually expand the explored region to encompass the augmented region.

In the case where the explored region and the augmented region do not overlap, there is a blind region. Since the observer has no information beyond its priors for the blind region, the observer is reduced to random exploration. In a non-imitation context, any states that are not explored are blind. However, in an imitation context, the blind area can be reduced considerably by the augmented area. Hence, implicit imitation effectively shrinks the size of the search space of the problem even when there is no overlap between explored and augmented spaces.

The most challenging case for implicit imitation transfer occurs when the region augmented by mentor observations fails to connect to either the observer's explored region or regions with significant reward values. In this case, the augmented region will initially provide no guidance. Once the observer has independently located rewarding states, the augmented regions can be used to highlight "shortcuts". These shortcuts represent improvements to the agent's policy. In domains where a feasible solution is easy to find but optimal solutions are difficult, implicit imitation can be used to convert a feasible solution into an increasingly optimal solution.

6.3 Regional Texture

We have seen how distinctive regions can be used to understand how imitation will perform in various domains. We can also analyze imitation performance in terms of properties that cut across the state space.
In our analysis of how model information impacted imitation performance, we saw that regions of states connected by accurate action models allowed an observer to use mentor observations to learn about the most promising direction for exploration. We see, then, that any set of mentor observations will be more useful if it is concentrated on a connected region and less useful if dispersed about the state space in unconnected components. We are fortunate that, in completely observable environments, observations of mentors tend to capture continuous trajectories, thereby providing contiguous regions of augmented states. In partially observable environments, occlusion and noise could lessen the value of mentor observations in the absence of a model to predict the mentor's state.

The effects of heterogeneity, whether due to differences in the action capabilities of the mentor and observer or due to differences in the environments of the two agents, can also be understood in terms of the connectivity of action models. Value can propagate along chains of action models until we hit a state in which the mentor and observer have different action capabilities. At this state, it may not be possible to achieve the mentor's value, and therefore value propagation is blocked. Again, we see that the sequential decision-making aspect of reinforcement learning leads to the conclusion that many scattered differences between mentor and observer will create discontinuity throughout the problem space, whereas a contiguous region of differences between mentor and observer will cause discontinuity in a region but leave other large regions fully connected. Hence, the distribution pattern of differences between mentor and observer capabilities is as important as the prevalence of difference. We will explore this pattern in the next section.

6.4 The Fracture Metric

We now try to characterize connectivity in the form of a metric. Since differences in reward structure, environment dynamics and action models that affect connectivity would all manifest themselves as differences in policies between mentor and observer, we propose a metric based on optimal policy difference. We call this metric fracture. Essentially, it computes the average value of the minimum distance from a state in which the mentor and observer disagree on a policy to a state in which the mentor and observer agree on the policy. This measure roughly captures the difficulty the observer faces in profitably exploiting mentor observations to reduce its exploration demands.

More formally, let π_m be the mentor's optimal policy and π_o be the observer's. Let S be the state space and S_{π_m≠π_o} be the set of disputed states, where the mentor and observer have different optimal actions. A set of neighboring disputed states constitutes a disputed region. The set S − S_{π_m≠π_o} will be called the undisputed states.
Let $M$ be a distance metric on the space $S$. This metric corresponds to the number of transitions along the "minimal length" path between states (i.e., the shortest path using nonzero probability observer transitions). (The expected distance would give a more accurate estimate of fracture, but is more difficult to calculate.) In a standard grid world, it will correspond to the minimal Manhattan distance. We define the fracture $\Phi(S)$ of state space $S$ to be the average minimal distance between a disputed state and the closest undisputed state:

$$\Phi(S) = \frac{1}{|S_{\pi_m \neq \pi_o}|} \sum_{s \in S_{\pi_m \neq \pi_o}} \min_{t \in S - S_{\pi_m \neq \pi_o}} M(s, t) \qquad (6.1)$$

Other things being equal, a lower fracture value will tend to increase the propagation of value information across the state space, potentially resulting in less exploration being required.

To test our metric, we applied it to a number of scenarios with varying fracture coefficients. It is difficult to construct scenarios which vary in their fracture coefficient yet have the same expected value. The scenarios in Figure 6.2 have been constructed so that the length of all possible paths from the start state s to the goal state x are the same in each scenario. In each scenario, however, there is an upper path and a lower path. The mentor is trained in a scenario that penalizes the lower path and so the mentor learns to take the upper path. The imitator is trained in a scenario in which the upper path is penalized and should therefore take the lower path. We equalized the difficulty of these problems as follows: using a generic ε-greedy learning agent with a fixed exploration schedule (i.e., initial rate and decay) in one scenario, we tuned the magnitude of penalties and their exact placement along loops in the other scenarios so that a learner using the same exploration policy would converge to the optimal policy in roughly the same number of steps in each.

In Figure 6.2(a), the mentor takes the top of each loop and in an optimal run, the imitator would take the bottom of each loop. Since the loops are short and the length of the common path is long, the average fracture is low. When we compare this to Figure 6.2(d), we see that the loops are very long; the majority of states in the scenario are on loops. Each of these states on a loop has a distance to the nearest state where the observer and mentor policies agree, namely, a state not on the loop. This scenario therefore has a high average fracture coefficient.

Since the loops in the various scenarios differ in length, penalties inserted in the loops vary with respect to their distance from the goal state and therefore affect the total discounted expected reward in different ways.
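To make Equation 6.1 concrete, the following is a minimal sketch (not code from the thesis) of how fracture could be computed for a small discrete domain, using breadth-first search over observer transitions as the distance metric M; the graph representation, function names and policy encoding are illustrative assumptions.

from collections import deque

def fracture(states, neighbors, mentor_policy, observer_policy):
    """Average minimal distance from each disputed state to the closest
    undisputed state (Equation 6.1), using shortest-path distance as M.

    states:      iterable of states
    neighbors:   function state -> states reachable in one observer
                 transition (nonzero probability)
    *_policy:    dict state -> optimal action for that agent
    """
    disputed = {s for s in states
                if mentor_policy.get(s) != observer_policy.get(s)}
    undisputed = set(states) - disputed
    if not disputed:
        return 0.0

    def dist_to_undisputed(start):
        # Breadth-first search over observer transitions.
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            s, d = queue.popleft()
            if s in undisputed:
                return d
            for t in neighbors(s):
                if t not in seen:
                    seen.add(t)
                    queue.append((t, d + 1))
        return float('inf')   # no undisputed state reachable

    return sum(dist_to_undisputed(s) for s in disputed) / len(disputed)

For a deterministic grid world, `neighbors` would return the adjacent cells reachable in one step, so the BFS distance reduces to the minimal Manhattan distance mentioned above.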
T h e penalt ies m a y also cause the agent to become stuck i n a l o c a l m i n i m u m i n order to a v o i d the penalt ies i f the exp lo ra t ion rate is too low. In this set o f exper iments , we therefore compare observer agents on the basis o f h o w l i k e l y they are to converge to the o p t i m a l so lu t ion g i v e n the mentor example . F i g u r e 6.3 presents the percentage o f runs (out o f ten) i n w h i c h the imi ta to r c o n -verged to the o p t i m a l so lu t i on (i.e., t ak ing o n l y the l o w e r loops ) as a func t ion o f exp lo ra t i on rate and scenario fracture. W e can see a d is t inc t d i agona l trend i n the table, i l lus t ra t ing that increas ing fracture requires the imi ta tor to increase its leve ls o f exp lo ra t i on i n order to find the o p t i m a l p o l i c y . W e conc lude thatfracture captures a proper ty o f doma ins that i s i m p o r -tant in p red ic t ing the efficacy o f i m p l i c i t im i t a t i on . 6.5 Suboptimality and Bias I m p l i c i t im i t a t i on is fundamenta l ly about b i a s i n g the exp lo ra t ion o f the observer. A s such, it is w o r t h w h i l e to ask w h e n this has a pos i t i ve effect on observer performance. T h e short answer is that a mentor f o l l o w i n g an o p t i m a l p o l i c y fo r an observer w i l l cause an observer to exp lo re i n the n e i g h b o r h o o d o f the o p t i m a l p o l i c y and this w i l l genera l ly bias the observer towards finding the o p t i m a l p o l i c y . Tha t is to say, a g o o d mentor teaches more than a bad mentor. A more deta i led answer requires l o o k i n g e x p l i c i t l y at exp lo ra t i on i n re inforcement learn ing . In theory, an e-greedy exp lo ra t ion p o l i c y w i t h a sui table rate o f decay w i l l cause i m p l i c i t imi ta tors to eventua l ly converge to the same o p t i m a l so lu t ion as their unassis ted counterparts. H o w e v e r , i n pract ice , the exp lo ra t ion rate is t y p i c a l l y decayed more q u i c k l y i n order to i m p r o v e early exp lo i t a t ion o f mentor input . G i v e n p rac t i ca l , but theore t ica l ly unsound exp lo ra t ion rates, an observer may settle for a mentor strategy that i s feasible , but n o n - o p t i m a l . W e can eas i ly imag ine examples : C o n s i d e r a s i tua t ion i n w h i c h an observer 143 is obse rv ing a mentor f o l l o w i n g some p o l i c y . E a r l y i n the l ea rn ing process , the va lue o f the p o l i c y f o l l o w e d b y the mentor may l o o k better than the es t imated va lue o f the al ternative po l i c i e s ava i l ab le to the observer. It c o u l d be the case that the mentor ' s p o l i c y ac tua l ly is the o p t i m a l p o l i c y . O n the other hand, it may be the case that some al ternat ive p o l i c y is actual ly superior. G i v e n a l ack o f in fo rmat ion , a l o w - e x p l o r a t i o n exp lo i t a t i on p o l i c y migh t lead the observer to fa lse ly conc lude that the mentor ' s p o l i c y is o p t i m a l . W h i l e i m p l i c i t im i t a t ion can bias the agent to a subop t ima l p o l i c y , w e have no reason to expect that an agent learn ing i n a d o m a i n suff icient ly c h a l l e n g i n g to warrant the use o f i m -i ta t ion w o u l d have d i scove red a better alternative. 
W e emphas ize that even i f the mentor ' s p o l i c y is subop t ima l , it s t i l l p rov ides a feasible so lu t ion w h i c h w i l l be preferable to no s o l u -t i on for many prac t ica l p rob lems . In this regard, w e see that the c lass ic exp lo ra t ion /exp lo i t a t ion tradeoff has an add i t iona l interpretation i n the i m p l i c i t im i t a t i on sett ing. A componen t o f the exp lo ra t ion rate w i l l cor respond to the observer ' s b e l i e f about the suff ic iency o f the mentor ' s p o l i c y . In this pa rad igm, then, it seems somewhat m i s l e a d i n g to th ink i n terms o f a d e c i s i o n about whether to " f o l l o w " a specif ic mentor or not. It is more a ques t ion o f h o w m u c h ex-p lo ra t ion to pe r fo rm i n add i t ion to that requi red to reconstruct the mentor ' s p o l i c y . It is unc lear that independent learners c o u l d do better than imita tors i n the general case wi thou t increas ing the l e v e l o f exp lo ra t ion they employ . A more def in i t ive statement must awai t further ana ly t i ca l results. F o r specific subclasses, w e can offer more ins ight . In compe t i t i ve mul t iagent p rob lems w i t h interact ions be tween agents, a m a l i c i o u s mentor, or set o f m a l i c i o u s mentors w i t h k n o w l e d g e o f the observer ' s e x p l o r a t i o n - p o l i c y m i g h t co -ordinate to b ias the observer away f r o m states and act ions w i t h strategic va lue d u r i n g the observer ' s exp lo ra t ion phase. O n c e the observer has largely settled in to a p o l i c y , the m e n -tors c o u l d e x p l o i t the observer ' s subop t ima l p o l i c y . T h i s k i n d o f strategic reason ing about the learn ing process has not been f u l l y addressed i n mul t i -agent systems as o f yet. W e ta lk 144 brief ly about mul t i -agent imi t a t ion w i t h interactions i n the next chapti 145 Chapter 7 Extensions to New Problem Classes T h e m o d e l o f i m p l i c i t im i t a t ion presented i n the p reced ing chapters makes certain res t r ic t ive assumpt ions regard ing the structure o f the d e c i s i o n p r o b l e m b e i n g s o l v e d (e.g., f u l l observ-abi l i ty , k n o w l e d g e o f r eward funct ion , and discrete state and ac t ion spaces). W h i l e these s i m p l i f y i n g assumpt ions a ided the deta i led deve lopment o f the m o d e l , w e be l i eve the bas ic in tu i t ions and m u c h o f the technica l deve lopment above can be extended to r i cher p r o b l e m classes. W e suggest several poss ib le extensions i n this chapter, each o f w h i c h c o u l d f o r m the basis o f a s ignif icant future project. 7.1 Unknown Reward Functions O u r current p a r a d i g m assumes that the observer k n o w s its o w n reward func t ion . T h i s as-sumpt ion is consis tent w i t h the v i e w o f reinforcement l ea rn ing as a f o r m o f automatic pro-g r a m m i n g . H o w e v e r , we can e n v i s i o n at least t w o ways o f r e l a x i n g this assumpt ion . T h e first me thod w o u l d require the assumpt ion that the agent is able to genera l ize observed re-wards . Suppose that the expected reward can be expressed i n terms o f a p r o b a b i l i t y d i s -t r ibu t ion over features o f the observer ' s state, P r ( r | / ( s 0 ) ) . In mode l -based re inforcement 146 l ea rn ing , this d i s t r ibu t ion can be learned b y the agent th rough its o w n exper ience. 
A second method of relaxing the known reward function assumption would involve exploring the relationship of implicit imitation to other techniques that do not assume a known reward function. Imitation techniques designed around the assumption that the observer and the mentor share identical rewards, such as Utgoff's (1991), would of course work in the absence of a reward function. In Utgoff's approach, the mentor is assumed to be rational, so Q-values are incrementally adjusted using a gradient approach until they are consistent with the actions taken by the mentor. Inverse reinforcement learning (Ng & Russell, 2000) uses the same idea, but different machinery. A linear program is used to find the reward function that would explain the decisions made by a mentor in a sampled trajectory of the mentor's behavior. We suggest that it would be profitable to consider scenarios in between the extremes of implicit imitation, which assumes nothing about the mentor's reward function, and value-inversion approaches, which assume the mentor's reward function is identical. There may be a useful synthesis of these approaches based on an observer's prior beliefs about the degree of correlation between its own reward model and that of the mentor it is observing.

7.2 Benefiting from penalties

Under Bayesian imitation, knowledge from both observer and mentor sources is pooled to create a more accurate model. Under this technique, it is possible for the observer to learn that some action leads with high probability to a negative outcome. It would therefore be of interest to examine in greater depth the potential to use imitation techniques in learning to avoid undesirable states.

7.3 Continuous and Model-Free Learning

In many realistic domains, continuous attributes and large state and action spaces prohibit the use of explicit table-based representations. Reinforcement learning in these domains is typically modified to make use of function approximators to estimate the Q-function at points where no direct evidence has been received. We perceive two broad approaches to approximation: parameter-based models (e.g., neural networks) (Bertsekas & Tsitsiklis, 1996) and memory-based approaches (Atkeson, Moore, & Schaal, 1997). In both these approaches, model-free learning is employed. That is, the agent keeps a value function, but uses the environment as an implicit model to perform backups using the sampling distribution provided by environment observations. One straightforward approach to casting implicit imitation in a continuous setting would employ a model-free learning paradigm (Watkins & Dayan, 1992).
First, recall the augmented Bellman backup function used in implicit imitation:

$$V(s) = R_o(s) + \gamma \max\Big\{ \max_{a \in A_o} \sum_{t \in S} \Pr_o(s, a, t)\,V(t),\ \ \sum_{t \in S} \Pr_m(s, t)\,V(t) \Big\} \qquad (7.1)$$

When we examine the augmented backup equation, we see that it can be converted to a model-free form in much the same way as the ordinary Bellman backup. We use a standard Q-function with observer actions, but we will add one additional action which corresponds to the action taken by the mentor, a_m. (This doesn't imply the observer knows which of its actions corresponds to a_m.) When the observer interacts with the environment, it receives a tuple <s_o, a_o, s'_o>. The tuple is used to back up the corresponding state with the following Q-learning rule:

$$Q(s_o, a_o) = (1-\alpha)\,Q(s_o, a_o) + \alpha\Big( R_o(s_o, a_o) + \gamma \max\big\{ \max_{a' \in A_o} Q(s'_o, a'),\ Q(s'_o, a_m) \big\} \Big) \qquad (7.2)$$

When the observer makes an observation of the mentor, it receives a tuple of the form <s_m, s'_m> which can be backed up with the rule:

$$Q(s_m, a_m) = (1-\alpha)\,Q(s_m, a_m) + \alpha\Big( R_o(s_m, a_m) + \gamma \max\big\{ \max_{a' \in A_o} Q(s'_m, a'),\ Q(s'_m, a_m) \big\} \Big) \qquad (7.3)$$

As in the discrete version of implicit imitation, the relative quality of mentor and observer estimates of the Q-function at specific states may vary; in order to avoid having inaccurate prior beliefs about the mentor's action models bias exploration, we need to employ a confidence measure to decide when to apply these augmented equations. We feel the most natural setting for these kinds of tests is in the memory-based approaches to function approximation. Memory-based approaches, such as locally-weighted regression (Atkeson et al., 1997), not only provide estimates for functions at points previously unvisited, they also maintain the evidence set used to generate these estimates. We note that the implicit bias of memory-based approaches assumes smoothness between points unless additional data proves otherwise. On the basis of this bias, it might make sense to compare the average squared distance from the query to the exemplars used in the estimate of the mentor's Q-value to the average squared distance from the query to the exemplars used in the observer-based estimate to heuristically decide which agent has the more reliable Q-value. The approach suggested here would not benefit from prioritized sweeping. Prioritized sweeping has, however, been adapted to continuous settings (Forbes & Andre, 2000). We feel a reasonably efficient integration of these techniques could be made to work.
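As a concrete illustration of the augmented updates in Equations 7.2 and 7.3 (a minimal tabular sketch for clarity, not the thesis's implementation), the mentor can be represented by a single extra pseudo-action; the class and method names below are assumptions.

from collections import defaultdict

class ModelFreeImitator:
    """Tabular sketch of the augmented Q-updates (Eqs. 7.2 and 7.3).

    The mentor is represented by one extra pseudo-action 'a_m'; the
    observer never needs to know which of its own actions it denotes.
    """

    def __init__(self, actions, alpha=0.1, gamma=0.95):
        self.actions = list(actions)   # the observer's own action set A_o
        self.alpha, self.gamma = alpha, gamma
        self.Q = defaultdict(float)    # (state, action) -> value, incl. 'a_m'

    def _backup_target(self, reward, s_next):
        own_best = max(self.Q[(s_next, a)] for a in self.actions)
        mentor_val = self.Q[(s_next, 'a_m')]
        return reward + self.gamma * max(own_best, mentor_val)

    def observer_update(self, s, a, reward, s_next):
        """Eq. 7.2: backup from the observer's own transition <s, a, s'>."""
        target = self._backup_target(reward, s_next)
        self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])

    def mentor_update(self, s_m, reward, s_m_next):
        """Eq. 7.3: backup from an observed mentor transition <s_m, s_m'>,
        scored with the observer's own reward."""
        target = self._backup_target(reward, s_m_next)
        self.Q[(s_m, 'a_m')] += self.alpha * (target - self.Q[(s_m, 'a_m')])

    def greedy_action(self, s):
        # Only the observer's real actions can be executed.
        return max(self.actions, key=lambda a: self.Q[(s, a)])

A confidence test of the kind discussed above would gate mentor_update, applying it only where the mentor-derived estimate is judged at least as reliable as the observer's own.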
7.4 Interaction of Agents

The framework we have developed in previous chapters assumes a non-interacting game. While non-interacting games can usefully model many problems, such as programming of assembly robots (Friedrich et al., 1996), we would like to be able to extend imitation to problems that involve agents which interact with each other. The general problem of planning for agents which can interact is difficult. We have identified what we believe to be a useful subclass of problems and we demonstrate a potential method for solving problems in this class efficiently. We then go on to describe applications of imitation in general Markov games with interactions and comment on how an implicit imitation strategy might be applied. While the ideas of this section are far from complete, they set up some promising starting points for future research.

Role Swapping

We define a stabilized Markov game with replacement as both a useful subclass of games for imitators and a class with a potentially efficient solution. We see this class as a natural generalization of the "training the new guy on the job" paradigm (Denzinger & Ennis, 2002). In a stabilized game, a group of agents is following a prior policy that allows them to work efficiently together. This policy may be the result of a gifted designer, or the result of extensive learning of models and search through game-theoretic equilibria. In either case, the optimal long-run policy for the game is known. In a game with replacement, players may be added to or removed from the game. The goal of a newly added agent is to quickly find a policy that works with the existing policy of the group. We argue that imitation can be used to help the new agent learn this policy.

Consider the simple coordination problem faced by a set of mobile agents trying to avoid each other while carrying out a fully cooperative task with identical interests. A gifted designer or prior experience may result in a policy for this group of agents which requires agents to always pass to the right-hand side of oncoming agents. A new group of agents is to join the existing group or perhaps replace a subset of the existing group. When this group of new agents is small relative to the existing group, there will be no incentive for the group as a whole to change its policy. The most efficient policy is for the new agents in the group to acquire the group policy as quickly as possible.

The learning of new agents can be accelerated by imitation. We treat each of the new agents as imitators. We provide each of them with a slightly modified version of implicit imitation. As in the non-interacting version, the imitation agent's observation space includes both its own state and the state of other agents in the group. As in our non-interacting developments, we assume the existence of a mapping from states in the mentor state space to states in the observer state space. Unlike the non-interacting version, however, the observer calculates its Q-values over both its own state and action as well as the state of other agents in its environment. The other agents are assumed to be following a fixed policy and their actions are ignored. Because the observer's Q-values are over the joint state, it can choose a different action depending on whether it is blocked by another agent or not.
It can learn more quickly what this action should be through imitation. To imitate, the observer chooses an agent to be its mentor and "swaps roles" with the mentor: it swaps the observed state of this agent with its own state and replaces its own action with a guess about the mentor's action calculated from the mentor's observed transition using the same mathematics developed for the Bayesian imitation paradigm in Chapter 5. The imitator then uses the observation with roles swapped to update its Q-values. In this way, we suggest that the Q-values will quickly come to represent the value of following the mentor's approach to coordinating its actions with other agents. We call the approach implicit imitation with role swapping. We point out that imitation in this context is actually transferring knowledge about how to solve game-theoretic coordination problems, inasmuch as this knowledge is embedded in the behavior of other agents.

In a non-cooperative game, or a co-operative game with non-identical interests, there may be an incentive for the new agents to flout the conventions of the existing group in order to improve their own payoffs. We believe the role swapping technique can be extended to these situations using a simple solution employed by many real-world agent communities. The non-identical interests, as they relate to established group conventions, are converted to identical interests through the application of penalties on non-conforming agents.
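To illustrate the role-swapping update described above (a sketch under assumed interfaces, not code from the thesis), the observer can take an observed joint transition, exchange the mentor's state with its own, infer the mentor's action, and feed the result through its ordinary Q-update; infer_mentor_action stands in for the Bayesian action inference of Chapter 5, reward_fn is the observer's own reward, and all other names are illustrative.

def role_swapped_update(imitator, joint_obs_before, joint_obs_after,
                        reward_fn, infer_mentor_action):
    """Update an imitator's Q-values from an observed mentor transition
    by swapping the mentor's state into the imitator's own slot.

    joint_obs_* are (own_state, mentor_state) pairs observed before and
    after one step of the stabilized group policy.
    """
    own_s, mentor_s = joint_obs_before
    own_s2, mentor_s2 = joint_obs_after

    # Swap roles: pretend the imitator occupied the mentor's position,
    # with the mentor's old position filled by the imitator's own state.
    swapped_s = (mentor_s, own_s)
    swapped_s2 = (mentor_s2, own_s2)

    # Guess the mentor's action from its observed transition
    # (stand-in for the Bayesian inference of Chapter 5).
    a_guess = infer_mentor_action(mentor_s, mentor_s2)

    # Score the swapped experience with the imitator's own reward and
    # feed it through the imitator's ordinary Q-learning update
    # (e.g., the observer_update of the model-free sketch above).
    r = reward_fn(swapped_s, a_guess)
    imitator.observer_update(swapped_s, a_guess, r, swapped_s2)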
Imitation in general symmetric Markov games

The role swapping technique relies on the stability of the established group policy to provide an efficient imitation-based solution to the problem of adding new agents to a group. As with single-agent reinforcement learning techniques, an imitator using role swapping in a non-stationary game is not guaranteed to converge to a good policy. Where multiple agents are learning simultaneously, gradient techniques (Singh, Kearns, & Mansour, 2000) or solution concepts from game theory can be used to adapt existing reinforcement learning techniques (Littman, 1994; Hu & Wellman, 1998; Bowling & Veloso, 2001).

We focus here on techniques inspired by solution concepts from game theory. The key insight of these approaches is to treat each stage of the reinforcement learning problem as a matrix game with the Q-values for joint actions substituted for matrix payoffs (see Chapter 2, Section 2.8). The estimation of these Q-values, however, requires that every agent be able to observe both the state and action values of all agents in the game. In essence, each agent is independently calculating the same global plan using the same global information. Unfortunately, the requirement that the actions of all agents be observable conflicts with our assumption that often imitation is sensible only when actions are not observable (see Chapter 3, Section 3.5). Since the other agents may change their policy, we cannot use the value of their "Markov chains" as stationary estimates of the value of the policy they are following. A key problem for implementing implicit imitation in this context would be the estimation of the actions of other players. We suspect that assumptions similar to the homogeneous action assumption would be required to obtain a significant advantage through imitation. We suggest that the class of symmetric Markov games coupled with action inference would be a good starting point for research in this area.

Additional considerations in interacting games

Other challenges and opportunities present themselves when imitation is used in multi-agent settings. For example, in competitive or educational domains, agents not only have to choose actions that maximize information from exploration and returns from exploitation; they must also reason about how their actions communicate information to other agents. In a competitive setting, one agent may wish to disguise its intentions, while in the context of teaching, a mentor may wish to choose actions whose purpose is abundantly clear. These considerations must become part of any action selection process.

7.5 Partially Observable Domains

In this section we consider the potential for the extension of implicit imitation to partially observable domains. We consider two forms of partial observability. In the first type, the observer's own state is fully observable while the state of the mentor is only intermittently observable. In this case the agent can continue to observe the mentor, but it will receive fewer complete transitions from the mentor, and for reasons that will be fully described in the next chapter, derive significantly less benefit from the mentor. The more interesting case to consider is when the observer and mentor are partially observable in the sense that the observer cannot always distinguish its own state or the state of the mentor.

It is important to remember that it is only observability of the mentor with respect to the observer that is important. The central idea of implicit imitation is to extract model information from observations of the mentor, rather than duplicating mentor behavior. This means that the mentor's internal belief state and policy are not (directly) relevant to the learner. We take a somewhat behaviorist stance and concern ourselves only with what the mentor's observed behaviors tell us about the possibilities inherent in the environment. The observer does have to keep a belief state about the mentor's current state, but this can be done using the same estimated world model the observer uses to update its own belief state.

In keeping with our model-based approach to reinforcement learning, we propose to tackle partially observable reinforcement learning by having the observer explicitly estimate a transition model T. Given the model, the agent may employ standard techniques for solving the resultant POMDP (see Section 2.7).
We will assume that the observer knows its reward function R(s) and its observation model Z : S → Δ(Z). Our goal in this section is to show that, given our assumptions, the estimation process for the transition model T can be factored into separate components for the observer and mentor and that the updates for each of these factors can be recast in a recursive form. This recursive form, or state-history rollup, allows an agent to incrementally improve its model with each observation. Given this model, the POMDP can be solved using existing techniques (Smallwood & Sondik, 1973).

Our derivation relies on some additional notation. First, we modify our notation for states slightly, using $S_o^t$ to represent the observer's state at time t and $S_o^{1..t}$ to represent the sequence of observer states from time 1 to time t. We will often abbreviate the complete state history $S_o^{1..t}$ to $S_o$. The mentor's state history is represented analogously by $S_m$. In a partially observable environment, the agent's true state is not observable. Instead, the agent must reason from incomplete and possibly noisy observation signals denoted $Z_o^t$. The history of observer observations is given by $Z_o^{1..t}$. As before, we let $A_o$ represent observer actions, T represent the transition model for the environment and $\pi_m$ the mentor's policy.

[Figure 7.1: Overview of influences on observer's observations]

Figure 7.1 gives a graphical model for the dependencies between these variables. We see that the mentor model $\pi_m$ and the transition model T influence the mentor state history $S_m$, which in turn influences the observer's observations of the mentor $Z_m$. Note that the mentor's observations of its own behavior are irrelevant to the observer. Similarly, the observer's action choices $A_o$ together with the transition model T determine the observer's state history $S_o$, which in turn determines the observations the observer will make of its own behavior $Z_o$.

In a partially observable environment, we still retain the Markov condition, which guarantees that the dynamics of the environment are strictly a function of the agent's current state and action. However, an agent may need to make a series of observations in order to resolve its current state. The coarse model above can be refined into a more detailed model, as shown in Figure 7.2, to include the influence of agent choices and the transition model on the detailed history of the agent. In the more detailed view, we see that the agent's beliefs about each state in its history are influenced by the prior and future states, the transition model, its action choices and the observations it made.

[Figure 7.2: Detailed model of influences on observer's observations]
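Since the observer tracks both its own state and the mentor's state with the same estimated world model, the filtering step can be sketched as follows (an illustrative HMM-style belief update, not the thesis's algorithm); the dictionary-based model representation and function names are assumptions, and the mentor's unknown action is folded into an estimated state-to-state transition model.

def update_belief(belief, transition, obs_model, observation):
    """One step of belief filtering over a discrete state set.

    belief:      dict state -> probability (current belief b)
    transition:  dict (s, s') -> Pr(s' | s), e.g. the estimated mentor
                 state-transition model, or the observer's model
                 conditioned on the action it executed
    obs_model:   function (observation, state) -> Pr(z | s)
    observation: the signal just received about the tracked agent
    """
    new_belief = {}
    for s2 in {s2 for (_, s2) in transition}:
        # Predict: propagate belief through the estimated dynamics.
        predicted = sum(belief.get(s, 0.0) * transition.get((s, s2), 0.0)
                        for s in belief)
        # Correct: weight by the likelihood of the observation.
        new_belief[s2] = obs_model(observation, s2) * predicted
    total = sum(new_belief.values())
    if total > 0.0:
        new_belief = {s: p / total for s, p in new_belief.items()}
    return new_belief

The same routine serves for the observer's own belief state by passing in the transition model conditioned on the action it actually executed.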
F o r example , the notat ion (ab + ac) + V d + e indicates that we v v ' a factors out w i l l factor out the term a f r om the underbraced term (ab + ac). T h e express ions o f the f o r m AALB\C mean that the var iab le A is c o n d i t i o n a l l y independent o f var iab le B g i v e n var iab le C. E x p r e s s i o n s o f the f o r m "see n u m b e r " refer to subder iva t ions on the f o l l o w i n g pages. O u r d i s c u s s i o n w i l l refer o n l y to the parent de r iva t ion . T h e reader m a y refer to the subder iva t ion for detai ls . D e r i v a t i o n 1 begins w i t h the term Pv(T\Z0, Zm, A0). W e see that the t rans i t ion m o d e l p robab i l i t y can be e legant ly decomposed into two terms represent ing the m o d e l l i k e l i h o o d acco rd ing to the observer ' s observat ion his tory and the m o d e l l i k e l i h o o d acco rd ing to obser-vat ions o f the mentor. S k i p p i n g over cons iderable intermediate steps, w e can see that the last l i ne o f D e r i v a t i o n 1 shows us t w o things. F i r s t , m o d e l l i k e l i h o o d g i v e n the observer ' s o w n observa t ion his tory factors in to a recurs ive f o r m i n w h i c h p rev ious h is tory can be r o l l e d up into a p r io r d i s t r ibu t ion for t ime t - 1. Integrating a new observa t ion in to the l i k e l i h o o d can be done by cons ide r ing o n l y the p robab i l i ty o f transit ions f r o m t - 1 to t and the observa t ion p robab i l i t y o f Z\ g i v e n 5*. Second , the l i k e l i h o o d w i t h respect to observer observat ions o f 156 the mentor can be expressed recurs ive ly , but is more c o m p l e x . T h e p r i o r states S 1 - - - * - 1 can again be marg ina l i z ed out. Unfor tunate ly , an argument represent ing the men to r ' s p o l i c y nm remains . W e therefore have to store the marg ina l i za t i on for every pos s ib l e mentor p o l i c y . In tu i t ive ly , the presence o f 7r m is due to the fact that for each pos s ib l e mentor p o l i c y , there may be a t ransi t ion m o d e l that makes the mentor p o l i c y consis tent w i t h the observat ions . W e must therefore cons ider a l l poss ib le po l i c i e s and their p r i o r p robab i l i t i e s w h e n assessing the l i k e l i h o o d o f any g i v e n m o d e l T. C o m p u t i n g an expecta t ion over poss ib le mentor po l i c i e s is somewhat onerous, but s t i l l conce ivab le . W h e r e mentor po l i c i e s can be represented b y a s m a l l n u m b e r o f parame-ters o r factored into l o c a l po l i c i e s for i n d i v i d u a l states, i t w i l l be c o m p u t a t i o n a l l y tractable to compute an expecta t ion over poss ib le po l i c i e s . W h i l e m u c h w o r k remains to be done, w e have s h o w n that t rans i t ion m o d e l updates c o u l d , i n theory, be factored i n a w a y to per-m i t inc rementa l update o f an augmented m o d e l b y an observer. W e are therefore encouraged that a par t ia l ly observable ve r s ion o f i m p l i c i t im i t a t ion c o u l d be w o r k e d out u s i n g the frame-w o r k already es tabl ished for ca l cu la t ing va lue funct ions and c h o o s i n g act ions i n s ingle-agent P O M D P s . D E R I V A T I O N 1 A u g m e n t e d M o d e l P r o b a b i l i t y Pr(T\Z0, Zm, A0) 1. Pr(T\Z0,Zm,A0) " v ' BayesRule 2. aPr(Z0,Zm\T,A0)Pr(T\A0) v v ' TA±Aa 3. aPr(Z0,Zm\T,A0)Pr{T) v * ' product rule 4. aPr{Z0\T, A0) Pr{Zm\Z0, T, A0) Pr{T) * * ' Zm±LZ0, Aa\T 5. 
DERIVATION 2: Model Likelihood given Observations of the Observer

1. $\Pr(Z_o \mid T, A_o)$
2. $= \sum_{S_o} \Pr(Z_o \mid S_o, T, A_o)\,\Pr(S_o \mid T, A_o)$   [introduce $S_o$; see Derivations 3 and 4]
3. $= \sum_{S_o} \Pr(Z_o^t \mid S_o^t)\,\Pr(Z_o^{1..t-1} \mid S_o^{1..t-1})\,\Pr(S_o^t \mid S_o^{t-1}, A_o^{t-1}, T)\,\Pr(S_o^{1..t-1} \mid A_o^{1..t-2}, T)$   [substitute Derivations 3 and 4; combine present and past terms]
4. $= \sum_{S_o^{t-1..t}} \Pr(Z_o^t \mid S_o^t)\,\Pr(S_o^t \mid S_o^{t-1}, A_o^{t-1}, T)\,\Pr(Z_o^{1..t-1}, S_o^{t-1} \mid A_o^{1..t-2}, T)$   [sum out $S_o^{1..t-2}$ and store the resulting marginal as a rolled-up prior]

DERIVATION 3: Probability of the Observer's Own Observations

1. $\Pr(Z_o \mid S_o, A_o, T)$
2. $= \Pr(Z_o^{1..t} \mid S_o, A_o, T)$   [joint event]
3. $= \Pr(Z_o^t \mid Z_o^{1..t-1}, S_o, A_o, T)\,\Pr(Z_o^{1..t-1} \mid S_o, A_o, T)$   [product rule]
4. $= \Pr(Z_o^t \mid S_o^t)\,\Pr(Z_o^{1..t-1} \mid S_o^{1..t-1})$   [$Z_o^t \perp Z_o^{1..t-1} \mid S_o^t$; $Z_o^{1..t-1} \perp S_o^t \mid S_o^{1..t-1}$; $Z_o \perp A_o, T \mid S_o$]

DERIVATION 4: Probability of the Observer's Own Trajectory

1. $\Pr(S_o \mid A_o, T)$
2. $= \Pr(S_o^{1..t} \mid A_o, T)$   [joint event]
3. $= \Pr(S_o^t \mid S_o^{1..t-1}, A_o, T)\,\Pr(S_o^{1..t-1} \mid A_o, T)$   [product rule]
4. $= \Pr(S_o^t \mid S_o^{t-1}, A_o^{t-1}, T)\,\Pr(S_o^{1..t-1} \mid A_o^{1..t-2}, T)$   [Markov property: $S_o^t \perp S_o^{1..t-2}, A_o^{1..t-2} \mid S_o^{t-1}, A_o^{t-1}, T$; past states do not depend on future actions]

DERIVATION 5: Model Likelihood given Observations of the Mentor

1. $\Pr(Z_m \mid T)$
2. $= \sum_{S_m} \Pr(Z_m \mid S_m, T)\,\Pr(S_m \mid T)$   [introduce $S_m$]
3. $= \sum_{S_m} \Pr(Z_m \mid S_m) \sum_{\pi_m} \Pr(S_m \mid \pi_m, T)\,\Pr(\pi_m)$   [$Z_m \perp T \mid S_m$; introduce $\pi_m$; $\pi_m \perp T$; see Derivations 6 and 7]
4. $= \sum_{S_m} \sum_{\pi_m} \Pr(Z_m^t \mid S_m^t)\,\Pr(Z_m^{1..t-1} \mid S_m^{1..t-1})\,\Pr(S_m^t \mid S_m^{t-1}, \pi_m, T)\,\Pr(S_m^{1..t-1} \mid \pi_m, T)\,\Pr(\pi_m)$   [substitute Derivations 6 and 7; combine present and past terms]
5. $= \sum_{S_m^{t-1..t}} \Pr(Z_m^t \mid S_m^t) \sum_{\pi_m} \Pr(S_m^t \mid S_m^{t-1}, \pi_m, T)\,\Pr(Z_m^{1..t-1}, S_m^{t-1} \mid \pi_m, T)\,\Pr(\pi_m)$   [reverse the order of summation, sum out $S_m^{1..t-2}$ and store the marginal for each $\pi_m$, then swap summations and distribute inward]

DERIVATION 6: Probability of Mentor Observations

1. $\Pr(Z_m \mid S_m)$
2. $= \Pr(Z_m^{1..t} \mid S_m^{1..t})$   [joint event]
3. $= \Pr(Z_m^t \mid Z_m^{1..t-1}, S_m^{1..t})\,\Pr(Z_m^{1..t-1} \mid S_m^{1..t})$   [product rule]
4. $= \Pr(Z_m^t \mid S_m^t)\,\Pr(Z_m^{1..t-1} \mid S_m^{1..t-1})$   [$Z_m^t \perp Z_m^{1..t-1} \mid S_m^t$; $Z_m^{1..t-1} \perp S_m^t \mid S_m^{1..t-1}$]

DERIVATION 7: Probability of the Mentor's Trajectory

1. $\Pr(S_m \mid \pi_m, T)$
2. $= \Pr(S_m^{1..t} \mid \pi_m, T)$   [joint event]
3. $= \Pr(S_m^t \mid S_m^{1..t-1}, \pi_m, T)\,\Pr(S_m^{1..t-1} \mid \pi_m, T)$   [product rule]
4. $= \Pr(S_m^t \mid S_m^{t-1}, \pi_m, T)\,\Pr(S_m^{1..t-1} \mid \pi_m, T)$   [Markov property]

7.6 Action Planning in Partially Observable Environments

Action selection in partially observable environments raises additional issues for imitation agents. In addition to the usual exploration-exploitation tradeoff, an imitation agent must determine whether it is worthwhile to take actions that improve the visibility of the mentor so that this source of information remains available while learning. We speculate that the decision about whether to make an effort in order to observe a mentor can be folded directly into the agent's value maximization process.

7.7 Non-stationarity

In some environments, the transition model for the environment may change over time. In a multi-agent learning system with interactions between agents, all agents in an environment may be learning simultaneously. In either case, an agent must continually adapt to its changing surroundings. We are unaware of imitation-specific work that deals with this condition. As hinted at in the section on suboptimality and bias (Section 6.5), knowledge of how an agent's learning changes its policy could have strategic implications for imitation agents. This area will require further exploration.

Chapter 8

Discussion and Conclusions

Over the course of the last seven chapters we developed a new model of imitation which we call implicit imitation. Through numerous experiments we demonstrated important aspects of both the implicit imitation model and the mechanisms we proposed to implement it. We looked at how to characterize the applicability and performance of imitation learning and discussed several significant extensions to our basic model. In this chapter, we speculate on several possible real-world applications for our model of imitation and then conclude with a few remarks on our contributions to the field of computational imitation.

8.1 Applications

We see applications for implicit imitation in a variety of contexts. The emerging electronic commerce and information infrastructure is driving the development of vast networks of multi-agent systems. In networks used for competitive purposes such as trade, implicit imitation can be used by a reinforcement learning agent to learn about buying strategies or information filtering policies of other agents in order to improve its own behavior.

In control, implicit imitation could be used to transfer knowledge from an existing learned controller which has already adapted to its clients to a new learning controller with a completely different architecture. Many modern products such as elevator controllers (Crites & Barto, 1998), cell traffic routers (Singh & Bertsekas, 1997) and automotive fuel injection systems use adaptive controllers to optimize the performance of a system for specific user profiles.
W h e n upgrad ing the t echnology o f the u n d e r l y i n g sys tem, it is qui te poss ib le that sensors, actuators and the internal representation o f the new sys tem w i l l be i n c o m p a t i -b le w i t h the o l d sys tem. I m p l i c i t im i t a t ion p rov ides a method o f t ransferr ing va luab le user in fo rma t ion between systems wi thout any e x p l i c i t c o m m u n i c a t i o n . A t rad i t iona l app l i ca t ion for im i t a t i on - l i ke technologies l ies i n the area o f bootstrap-p i n g in te l l igent artifacts u s ing traces o f human behavior . T h e behav io ra l c l o n i n g p a r a d i g m has inves t iga ted transfer i n appl ica t ions such as p i l o t i n g aircraft (Sammut et a l . , 1992) and c o n t r o l l i n g l o a d i n g cranes (Sue & B r a t k o , 1997). Other researchers have inves t iga ted the use o f im i t a t i on i n s i m p l i f y i n g the p r o g r a m m i n g o f robots ( K u n i y o s h i et a l . , 1994; F r i e d r i c h et a l . , 1996). T h e ab i l i t y o f imi t a t ion to transfer c o m p l e x , non l inea r and d y n a m i c behaviors f rom ex i s t i ng human agents makes it par t icu la r ly attractive for con t ro l p rob lems . 8.2 Concluding Remarks A central theme i n our approach has been to examine im i t a t i on th rough the f r amework o f re inforcement l ea rn ing . T h e interpretation o f imi t a t ion as re inforcement l ea rn ing has not o n l y he lped us to see h o w var ious approaches to imi t a t ion are related, but has a lso he lped us to see that im i t a t i on as m o d e l transfer is a p r o m i s i n g and largely u n e x p l o r e d area for new research. W e be l i eve our m o d e l has also he lped to make the field o f re inforcement l ea rn ing more access ib le and relevant to soc ia l l earn ing researchers (personal c o m m u n i c a t i o n w i t h B r i a n E d m o n d s ) and w e expect that tools f r o m reinforcement l ea rn ing and M a r k o v d e c i -s ion processes w i l l see greater use i n future mode l s o f soc i a l agents research. T h e m o d e l 164 w e proposed rests u p o n a number o f assumptions . W e saw that the concept o f i m i t a t i o n i m -p l i c i t l y assumes bo th the presence o f observable mentor expert ise and the in feas ib i l i ty o f e x p l i c i t c o m m u n i c a t i o n , w h i c h i n turn suggests the unobse rvab i l i t y o f mentor act ions. W e exp l a ined that the observer ' s k n o w l e d g e o f its o w n object ives is bo th reasonable and h i g h l y advantageous i n an im i t a t i on context . Re in fo rcemen t learn ing , together w i t h a careful cons idera t ion o f reasonable assump-tions for sensible imi t a t ion , p r o v i d e d a general f ramework for im i t a t i on . U s i n g the frame-w o r k , w e deve loped a s imp le yet p r i n c i p l e d a l g o r i t h m ca l l ed m o d e l ext rac t ion to handle p rob lems i n w h i c h agents have homogeneous ac t ion capabi l i t i es . W e s h o w e d that i t is cr i t -i c a l for imitators to be able to assess the re la t ive confidence they have i n the in fo rma t ion de r ived f r o m their o w n observat ions versus that w h i c h they have d e r i v e d f r o m observat ions o f mentors . W e also saw that re inforcement learn ing a l l o w s agents to ass ign values to be-hav iors they have acqu i red b y observat ion i n the same w a y they ass ign va lue to behaviors they have learned themselves . 
Since all behaviors can be evaluated in the same terms, an agent can readily decide when to employ a behavior derived from mentor observations. Our framework thus solves a long-standing issue in imitation learning. The implicit imitation framework also permits an observer to simultaneously observe multiple mentors, to automatically extract relevant subcomponents of mentor behavior, and to synthesize new behaviors from both observations of mentors and the observer's own experience. Agents need not duplicate mentor behavior exactly. In addition, we showed that implicit imitation is robust in the face of noise and its benefit may actually increase with increasing problem difficulty.

We saw that model extraction can be elegantly extended to domains in which agents have heterogeneous action capabilities through the mechanisms of action feasibility testing and k-step repair. Adding bridging capabilities preserves and extends the mentor's guidance in the presence of infeasible actions, whether due to differences in action capabilities or to local differences in state spaces.

We also showed that it is possible to recast implicit imitation in a Bayesian framework. At the cost of additional computation, a Bayesian agent is able to smoothly integrate observations from its own experience, observations of mentor behavior and prior beliefs it has about its environment. In combination with Bayesian exploration, the Bayesian imitation algorithm eliminated the tedious parameter tuning associated with greedy exploration strategies. We saw that the Bayesian imitators are capable of outperforming their non-Bayesian counterparts. We also saw that imitation may overcome one of the drawbacks of Bayesian exploration: the possibility of converging to a suboptimal policy due to misleading priors. Of course, a fundamental issue for Bayesian learning will be the further extension of this paradigm to heterogeneous action capabilities.

We then used our understanding of how exploration is guided in reinforcement learning to analyze how characteristics of the domain affect the gains to be made by imitation agents. We analyzed the role of continuity in explaining the performance gains of imitation and developed the fracture metric which was shown to be predictive of imitation gains in a number of experiments.

To scale implicit imitation to larger problems, we will have to integrate it with recent developments in large-scale reinforcement learning. We showed that it is possible to express implicit imitation in a model-free form and sketched how this might be combined with memory-based function approximators to create viable imitation agents in continuous and high-dimensional spaces. We also saw that there is potential to apply Bayesian imitation learning in partially observable domains.
While a practical implementation will require considerably more effort, our derivation shows that the problem of partially observable imitation can, in principle, be factored into a form suitable for use with existing approaches for solving POMDPs. In the course of this derivation we also pointed out that partial observability introduces a new issue for imitation agents: deciding when to employ actions that enhance the visibility of a mentor. We briefly examined the potential for applying implicit imitation in multiagent problems with interacting agents. We discovered that coordination strategies could, themselves, be the subject of transfer by imitation. We concluded that there is potential to implement implicit imitation using existing techniques for Markov games, but effort would need to be given to developing methods of estimating the actions taken by other agents first. The extensions we have proposed provide considerable inspiration for future work in imitation.

One of our fundamental assumptions has been the existence of a mapping function between the state spaces of the mentor and observer. A key research problem for imitation learning is the automatic determination of such mappings (Bakker & Kuniyoshi, 1996). While existing research has looked at how to find corresponding behaviors given intermediate states with known correspondences along the path, we have yet to develop a satisfactory method for finding the state correlations to begin with. We suggest that it might be profitable to examine abstraction mechanisms as a way of finding similarities between state spaces. In addition to this long-standing question in imitation learning research, our current work has introduced two new questions into the field: how should an agent reason about the information-broadcasting aspect of its actions in addition to whatever concrete value the actions have for the agent, and how should agents reason about the value of actions that will lead to increased visibility of mentors?

The reinforcement learning framework has led us to a better understanding of imitation learning. It has also helped us to create three new algorithms for imitation learning based on assumptions that make these algorithms applicable in domains previously inaccessible to imitation learners. The framework has also clearly given us a starting point on several promising extensions to the existing work and highlighted two important new questions. Our work on imitation as model transfer in reinforcement learning has not only resulted in advances in imitation learning, but has opened up a whole new way of exploring this field that poses an abundant source of ready-to-explore problems for future research.

Bibliography

Alissandrakis, A., Nehaniv, C. L., & Dautenhahn, K. (2000). Learning how to do things with imitation. In Bauer, M., & Rich, C. (Eds.), AAAI Fall Symposium on Learning How to Do Things, 3-5 November, pp. 1-6, Cape Cod, Massachusetts.
Alissandrakis, A., Nehaniv, C. L., & Dautenhahn, K. (2002). Do as I do: Correspondences across different robotic embodiments. In Proceedings of the 5th German Workshop on Artificial Life, pp. 143-152, Lübeck.

Anderson, J. R. (1990). Cognitive Psychology and Its Implications. W. H. Freeman, New York.

Astrom, K., & Hagglund, T. (1996). PID control. In Levine, W. (Ed.), The Control Handbook, chap. 10.5, pp. 198-209. IEEE Press.

Atkeson, C. G., & Schaal, S. (1997). Robot learning from demonstration. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 12-20, Nashville, TN.

Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning for control. Artificial Intelligence Review, 11(1-5), 75-113.

Baird, L. (1999). Reinforcement Learning Through Gradient Descent. Ph.D. thesis, Carnegie Mellon University.

Bakker, P., & Kuniyoshi, Y. (1996). Robot see, robot do: An overview of robot imitation. In AISB'96 Workshop on Learning in Robots and Animals, pp. 3-11, Brighton, UK.

Baxter, J., & Bartlett, P. L. (2000). Reinforcement learning in POMDP's via direct gradient ascent. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 41-48, Stanford, CA.

Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton.

Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Athena, Belmont, MA.

Billard, A., & Hayes, G. (1997). Learning to communicate through imitation in autonomous robots. In Proceedings of The Seventh International Conference on Artificial Neural Networks, pp. 763-768, Lausanne, Switzerland.

Billard, A., & Hayes, G. (1999). Drama, a connectionist architecture for control and learning in autonomous robots. Adaptive Behavior Journal, 7, 35-64.

Billard, A., Hayes, G., & Dautenhahn, K. (1999). Imitation skills as a means to enhance learning of a synthetic proto-language in an autonomous robot. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 88-95, Edinburgh.

Boutilier, C. (1999). Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 478-485, Stockholm.

Boutilier, C., Dean, T., & Hanks, S. (1999). Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11, 1-94.

Bowling, M., & Veloso, M. (2001). Rational and convergent learning in stochastic games. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 1021-1026, Seattle.

Boyan, J. A. (1999). Least-squares temporal difference learning. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 49-56. Morgan Kaufmann, San Francisco, CA.
Breazeal, C. (1999). Imitation as social exchange between humans and robot. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 96-104, University of Edinburgh.

Breazeal, C., & Scassellati, B. (2002). Challenges in building robots that imitate people. In Imitation in Animals and Artifacts, pp. 363-390. MIT Press, Cambridge, MA.

Byrne, R. W., & Russon, A. E. (1998). Learning by imitation: a hierarchical approach. Behavioral and Brain Sciences, 21, 667-721.

Cañamero, D., Arcos, J. L., & de Mantaras, R. L. (1999). Imitating human performances to automatically generate expressive jazz ballads. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 115-120, University of Edinburgh.

Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 1023-1028, Seattle.

Chajewska, U., Koller, D., & Ormoneit, D. (2001). Learning an agent's utility function by observing behavior. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 35-42.

Conte, R. (2000). Intelligent social learning. In Proceedings of the AISB'00 Symposium on Starting from Society: the Applications of Social Analogies to Computational Systems, pp. 1-14, Birmingham.

Crites, R., & Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3), 235-262.

Dean, T., & Givan, R. (1997). Model minimization in Markov decision processes. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 106-111, Providence.

Dearden, R., Friedman, N., & Russell, S. (1998). Bayesian Q-learning. In The Fifteenth National Conference on Artificial Intelligence, pp. 761-768, Madison, Wisconsin.

Dearden, R., & Boutilier, C. (1997). Abstraction and approximate decision theoretic planning. Artificial Intelligence, 89, 219-283.

Dearden, R., Friedman, N., & Andre, D. (1999). Model-based Bayesian exploration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 150-159, Stockholm.

DeGroot, M. H. (1975). Probability and Statistics. Addison-Wesley Pub. Co., Reading, Mass.

Demiris, J., & Hayes, G. (1997). Do robots ape? In Proceedings of the AAAI Symposium on Socially Intelligent Agents, pp. 28-31, Cambridge, Massachusetts.

Demiris, J., & Hayes, G. (1999). Active and passive routes to imitation. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 81-87, Edinburgh.

Denzinger, J., & Ennis, S. (2002). Being the new guy in an experienced team: enhancing training on the job. In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 3, pp. 1246-1253, Bologna, Italy.
B e i n g the new guy i n an exper ienced team: enhanc ing t ra in ing o n the j o b . In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 3, pp . 246 - 1253 B o l o g n a , Italy. F i o r i t o , G . , & Scot to , P. (1992). Obse rva t iona l learn ing i n octopus vu lga r i s . Science, 256, 5 4 5 ^ 1 7 . Forbes , J . , & A n d r e , D . (2000) . P rac t i ca l re inforcement l ea rn ing i n con t inuous domains . Tech . rep. U C B / C S D - 0 0 - 1 1 0 9 , C o m p u t e r Sc i ence D i v i s i o n , U n i v e r s i t y o f C a l i f o r n i a Be rke l ey . F r i e d m a n , N . , & Singer , Y . (1998). Ef f ic ien t bayes ian parameter es t imat ion i n large discrete domains . In Eleventh conference on Advances in Neural Information Processing Sys-tems, pp . 4 1 7 ^ 1 2 3 D e n v e r , C o l o r a d o . F r i e d r i c h , H . , M u n c h , S., D i l l m a n n , R . , B o c i o n e k , S., & Sass in , M . (1996) . R o b o t p rog ram-m i n g b y demonst ra t ion ( R P D ) : Suppor t the i n d u c t i o n b y h u m a n in terac t ion . Machine Learning,23, 1 6 3 - 1 8 9 . Hansen , E . A . (1998) . S o l v i n g P O M D P s by searching i n p o l i c y space. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Har tman i s , J . , & Stearns, R . E . (1966) . Algebraic Structure Theory of Sequential Machines. P r e n t i c e - H a l l , E n g l e w o o d C l i f f s . H o w a r d , R . A . (1996). Informat ion va lue theory. IEEE Transactions on Systems Science and Cybernetics, SSC-2(\), 2 2 - 2 6 . 173 H u , J . , & W e l l m a n , M . P. (1998) . M u l t i a g e n t re inforcement l ea rn ing : theoret ical f r amework and an a lgor i thm. In Proceedings of the Fifthteenth International Conference on Ma-chine Learning, pp . 2 4 2 - 2 5 0 . M o r g a n K a u f m a n n , San F r a n c i s c o , C A . K a e l b l i n g , L . P. (1993) . Learning in Embedded Systems. M I T Press , C a m b r i d g e , M A . K a e l b l i n g , L . P., L i t t m a n , M . L . , & M o o r e , A . W . (1996) . Re in fo rcemen t l ea rn ing : A survey. Journal of Artificial Intelligence Research, 4, 2 3 7 - 2 8 5 . K e a r n s , M . , & S i n g h , S. (1998a). F i n i t e sample convergence rates for Q - l e a r n i n g and i n d i -rect a lgor i thms. In Eleventh Conference on Neural Information Processing Systems, pp. 9 9 6 - 1 0 0 2 Denve r , C o l o r a d o . K e a r n s , M . , & S i n g h , S. (1998b) . Nea r -op t ima l re inforcement l ea rn ing i n p o l y n o m i a l t ime. In Proceedings of the 15th International Conference on Machine Learning, pp . 2 6 0 -268 M a d i s o n , W i s c o n s i n . K u n i y o s h i , Y , Inaba, M . , & Inoue, H . (1994). L e a r n i n g b y wa tch ing : E x t r a c t i n g reusable task k n o w l e d g e f rom v i s u a l observa t ion o f human performance. IEEE Transactions on Robotics and Automation, 10(6), 7 9 9 - 8 2 2 . L a u , T., D o m i n g o s , P., & W e l d , D . S. (2000) . V e r s i o n space a lgebra and its app l i ca t ion to p r o g r a m m i n g b y demonstra t ion. In Proceddings of the Seventeenth International Conference on Machine Learning, pp. 5 2 7 - 5 3 4 Stanford , C A . L e e , D . , & Y a n n a k a k i s , M . (1992). O n l i n e m i m i n i z a t i o n o f t rans i t ion systems. In Proceed-ings of the 24th Annual ACM Symposium on the Theory of Computing (STOC-92), pp. 2 6 4 - 2 7 4 V i c t o r i a , B C . L i e b e r m a n , H . (1993) . 
M o n d r i a n : A teachable g raph ica l editor. In A . , C . (Ed . ) , Watch What I Do: Programming by Demonstration, pp . 3 4 0 - 3 5 8 . M I T Press, C a m b r i d g e , M A . L i n , L . - J . (1991) . Se l f - improvemen t based on re inforcement l ea rn ing , p l a n n i n g and teach-i n g . In Machine Learning: Proceedings of the Eighth International Workshop, pp . 3 2 3 - 2 7 C h i c a g o , I l l i n o i s . L i n , L . - J . (1992) . S e l f - i m p r o v i n g reactive agents based on re inforcement l ea rn ing , p l a n n i n g and teaching. Machine Learning, 8, 2 9 3 - 3 2 1 . L i t t m a n , M . L . (1994) . M a r k o v games as a f ramework for mul t i -agent re inforcement learn-i n g . In Proceedings of the Eleventh International Conference on Machine Learning, pp. 1 5 7 - 1 6 3 N e w B r u n s w i c k , N J . L i t t m a n , M . L . , D e a n , T. L . , & K a e l b l i n g , L . P. (1995). O n the c o m p l e x i t y o f s o l v i n g M a r k o v d e c i s i o n p rob lems . In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-95), pp . 394-402 M o n t r e a l , Quebec , Canada . L i t t m a n , M . L . , M a j e r c i k , S. M . , & Pi tass i , T. (2001). S tochas t ic boo l ean sat isf iabi l i ty . Jour-nal of Automated Reasoning, 27(3), 2 5 1 - 2 9 6 . L o v e j o y , W . S. (1991) . A survey o f a lgo r i thmic methods for pa r t i a l ly obse rved M a r k o v de-c i s i o n processes. Annals of Operations Research, 28, 4 7 - 6 6 . L u s e n a , C , G o l d s m i t h , J . , & M u n d h e n k , M . (2001). N o n a p p r o x i m a b i l i t y results for pa r t i a l ly observable M a r k o v dec i s ion processes. Journal of Artificial Intelligence Research, 14, 8 3 - 1 0 3 . M a c k a y , D . (1999) . In t roduct ion to M o n t e C a r l o methods. In Jordan , M . I. (Ed . ) , Learning in Graphical Models, pp. 1 7 5 - 2 0 4 . M I T Press, C a m b r i d g e , M A . M a t a r i c , M . J . (1998) . U s i n g c o m m u n i c a t i o n to reduce l o c a l i t y i n d is t r ibu ted mul t i -agent learn ing . Journal Experimental and Theoretical Artificial Intelligence, 10(3), 3 5 7 -369. 175 M a t a r i c , M . J . (2002) . Visuo-Motor Primitives as a Basis for Learning by Imitation: Linking Perception to Action and Biology to Robotics, pp . 3 9 2 - 4 2 2 . M I T Press , C a m b r i d g e , M A . M a t a r i c , M . J . , W i l l i a m s o n , M . , D e m i r i s , J . , & M o h a n , A . (1998) . B e h a v i o u r - b a s e d p r i m i -t ives for ar t iculated con t ro l . In Pfiefer, R . , B l u m b e r g , B . , M e y e r , J . , & W i l s o n , S. W . (Eds . ) , Fifth International Conference on Simulation of Adaptive Behavior SAB'98, pp. 1 6 5 - 1 7 0 Z u r i c h . M c C a l l u m , R . A . (1992). First Results with Utile Distinction Memory for Reinforcement Learning. U n i v e r s i t y o f Rochester , C o m p u t e r Sc ience T e c h n i c a l Repor t 4 4 6 . M e n d e n h a l l , W . (1987). Introduction to Probability and Statistics, Seventh Edition. P W S -K e n t P u b l i s h i n g C o m p a n y , B o s t o n , M A . M e u l e a u , N . , & B o u r g i n e , P. (1999). E x p l o r a t i o n o f mult i-state env i ronments : L o c a l mea-sures and back-propaga t ion o f uncertainty. Machine Learning, 35(2), 117 - 154 . M i , J . , & S a m p s o n , A . R . (1993) . A compar i son o f the B o n f e r r o n i and Scheffe bounds . Journal of Statistical Planning and Inference, 36, 101—105. M i c h i e , D . 
(1993) . K n o w l e d g e , l earn ing and mach ine in te l l igence . In S t e r l i ng , L . (Ed . ) , Intelligent Systems, pp . 1-9. P l e n u m Press, N e w Y o r k . M i t c h e l l , T. M . , M a h a d e v a n , S., & Ste inberg, L . (1985) . L E A P : A l ea rn ing apprentice for V L S I des ign . In Proceedings of the Ninth International Joint Conference on Artificial Intelligence, pp. 5 7 3 - 5 8 0 L o s A n g e l e s , C a l i f o r n i a . M o o r e , A . W . , & A t k e s o n , C . G . (1993). P r i o r i t i z e d sweep ing : Re in fo rcemen t l ea rn ing w i t h less data and less real t ime. Machine Learning, 13(1), 1 0 3 - 3 0 . 176 M y e r s o n , R . B . (1991) . Game Theory: Analysis of Conflict. H a r v a r d U n i v e r s i t y Press , C a m -br idge . N e a l , R . M . (2001) . Transfer r ing p r io r in fo rmat ion between mode l s u s i n g i m a g i n a r y data. Tech . rep. 0108 , Depar tment o f Stat ist ics U n i v e r s i t y o f Toron to . N e h a n i v , C , & Dautenhahn , K . (1998). M a p p i n g between d i s s i m i l a r bod ies : Af fo rdances and the a lgebraic foundat ions o f imi t a t ion . In Proceedings of the Seventh European Workshop on Learning Robots, pp. 6 4 - 7 2 E d i n b u r g h . N g , A . Y , & Jordan , M . (2000). P E G A S U S : A p o l i c y search m e t h o d for large M D P s and P O M D P s . In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp . 4 0 6 ^ - 1 5 Stanford, C a l i f o r n i a . N g , A . Y , & R u s s e l l , S. (2000). A l g o r i t h m s for inverse re inforcement l ea rn ing . In Proceed-ings of the Seventeenth International Conference on Machine Learning, pp . 6 6 3 - 6 7 0 . M o r g a n K a u f m a n n , San F ranc i s co , C A . N o b l e , J . , & T o d d , P. M . (1999). Is it rea l ly imi ta t ion? a r e v i e w o f s i m p l e mechan i sms i n soc i a l i n fo rma t ion gathering. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp . 6 5 - 7 3 E d i n b u r g h . O l iphan t , M . (1999) . C u l t u r a l t ransmiss ion o f c o m m u n i c a t i o n s systems: C o m p a r i n g obser-va t iona l and re inforcement l ea rn ing models . In Proceedings of the AISB'99 Sympo-sium on Imitation in Animals and Artifacts, pp . 4 7 - 5 4 E d i n b u r g h . P a p a d i m i t r i o u , C , & T s i t s i k l i s , J . (1987). T h e c o m p l e x i t y o f m a r k o v d e c i s i o n processes. . Mathematics of Operations Research, 72(3), 4 4 1 ^ - 5 0 . Pea r l , J . (1988) . Probabilistic Reasoning in Intelligent Systems: Networks of Plausible In-ference. M o r g a n K a u f m a n n , San M a t e o , C A . Puterman, M . L . (1994) . Markov Decision Processes: Discrete Stochastic Dynamic Pro-gramming. J o h n W i l e y and Sons , Inc. , N e w Y o r k . R u s s o n , A . , & G a l d i k a s , B . (1993). Imi ta t ion i n free-ranging rehabi l i tant orangutans (pongo-pygmaeus) . Journal of Comparative Psychology, 107(2), 1 4 7 - 1 6 1 . Sammut , C , Hurs t , S., K e d z i e r , D . , & M i c h i e , D . (1992). L e a r n i n g to fly. In Proceedings of the Ninth International Conference on Machine Learning, pp . 3 8 5 - 3 9 3 A b e r d e e n , U K . Scasse l la t i , B . (1999) . K n o w i n g what to imitate and k n o w i n g w h e n y o u succeed. In Pro-ceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp . 1 0 5 -113 U n i v e r s i t y o f E d i n b u r g h . Scheffer, T. , Gre iner , R . , & D a r k e n , C . (1997). 
W h y exper imenta t ion can be better than per-fect guidance. In Proceedings of the Fourteenth International Conference on Machine Learning, pp . 3 3 1 - 3 3 9 N a s h v i l l e . Seber, G . A . F. (1984) . Multivariate Observations. W i l e y , N e w Y o r k . Shapley , L . S. (1953) . S tochas t ic games. Proceedings of the National Academy of Sciences, 39, 3 2 7 - 3 3 2 . S i n g h , S., K e a r n s , M . , & M a n s o u r , Y . (2000) . N a s h convergence o f gradient d y n a m i c s i n genera l - sum games. In Proceedings of the Sixteenth Conference on Uncertainity in Artificial Intelligence, pp . 5 4 1 - 5 4 8 Stanford, C A . S i n g h , S., & Ber tsekas , D . (1997). Re inforcement l ea rn ing for d y n a m i c channe l a l loca t ion i n ce l lu l a r te lephone systems. In M o z e r , M . C , Jordan, M . I., & Petsche, T. (Eds . ) , Advances in Neural Information Processing Systems, V o l . 9, pp . 9 7 4 - 9 8 0 . T h e M I T Press. S m a l l w o o d , R . D . , & S o n d i k , E . J . (1973) . T h e o p t i m a l con t ro l o f pa r t i a l ly observable M a r k o v processes ove r a finite h o r i z o n . Operations Research, 21, 1 0 7 1 - 1 0 8 8 . Sue, D . , & B r a t k o , I. (1997) . S k i l l reconst ruct ing as i n d u c t i o n o f L Q cont ro l le rs w i t h sub-goals . In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pp . 9 1 4 - 9 1 9 N a g o y a . Sut ton , R . S. (1988) . L e a r n i n g to predict b y the method o f t empora l differences. Machine Learning, 3, 9 - 4 4 . Sut ton , R . S., & B a r t o , A . G . (1998). Reinforcement Learning: An Introduction. M I T Press, C a m b r i d g e , M A . Tan , M . (1993) . M u l t i - a g e n t re inforcement learn ing: Independent vs . coopera t ive agents. In Proceedings of the Tenth International Conference on Machine Learning, pp . 3 3 0 - 3 7 A m h e r s t , M A . U r b a n c i c , T , & B r a t k o , I. (1994) . Recons t ruc t ion human s k i l l w i t h m a c h i n e l ea rn ing . In Eleventh European Conference on Artificial Intelligence, pp . 4 9 8 - 5 0 2 A m s t e r d a m , T h e Nether lands . Utgof f , P. E . , & C l o u s e , J . A . (1991). T w o k i n d s o f t ra in ing in fo rma t ion for eva lua t ion func-t ion learn ing . In Proceedings of the Ninth National Conference on Artificial Intelli-gence, pp . 5 9 6 - 6 0 0 A n a h e i m , C A . van L e n t , M . , & L a i r d , J . (1999) . L e a r n i n g h ie ra rch ica l performance k n o w l e d g e b y observa-t ion . In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 2 2 9 - 2 3 8 B l e d , S l o v e n i a . 179 V i s a l b e r g h i , E . , & Fragazy , D . (1990). D o m o n k e y s ape?. In Parker, S., & G i b s o n , K . (Eds . ) , Language and Intelligence in Monkeys and Apes, pp . 2 4 7 - 2 7 3 . C a m b r i d g e U n i v e r s i t y Press, C a m b r i d g e . W a t k i n s , C . J . C . H . , & D a y a n , P. (1992). Q- l ea rn ing . Machine Learning, 8, 2 7 9 - 2 9 2 . W h i t e h e a d , S. D . (1991) . C o m p l e x i t y analysis o f coopera t ive mechan i sms i n re inforcement learn ing . In Proceedings of the Ninth National Conference on Artificial Intelligence, pp. 6 0 7 - 6 1 3 A n a h e i m . 180 
