UBC Theses and Dissertations

Linkup Action Dependent Dynamic Programming. Ip, John Chong Ching, 1995.


Full Text

LINKUP ACTION DEPENDENT DYNAMIC PROGRAMMING

By

John C. C. Ip

B.A.Sc. (Electrical Engineering), University of British Columbia
M.A.Sc. (Electrical Engineering), University of British Columbia

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
November 1995
(c) John C. C. Ip, 1995

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver, Canada

Abstract

Incremental forms of Dynamic Programming (DP), incremental forms of Action Dependent Dynamic Programming (ADDP) such as Q-Learning (QL), and Temporal Difference Learning (TDL) are related types of algorithms, grouped under the heading of Reinforcement Learning (RL). They form the bases for a new generation of methods for optimal control. RL methods make the following assumptions: 1) the system is discrete-time/discrete-state or discrete-time/continuous-state, 2) a model of the dynamics is unavailable and difficult to identify, 3) the system is stochastic, and 4) incremental learning is necessary or acceptable. These methods target control situations which are beyond the scope of most existing control methods.

Convergence results exist for the basic and derived forms of the above types of algorithms. Numerous empirical results have demonstrated the usefulness of these algorithms for optimal control problems in aircraft control, manufacturing, and process control. However, further expansion of the applications fundamentally depends on 1) reduced accumulation of approximation errors when universal function approximators are used to store the various mappings in RL algorithms, and 2) the extension of RL algorithms to multiple temporal and spatial resolutions such that they can be applicable to hierarchical control. This thesis addresses the issue of approximation error accumulation in DP and ADDP algorithms in RL, and the issue of the extension of RL, by introducing modifications to the f and Q notations of DP and ADDP, respectively.
Derivation from the definitions of these new f and Q notations leads to three Linkup Relations. These relations enable a new type of update operation called "linkup", in addition to "backup", which is the fundamental update operation in DP and ADDP. Several algorithms, collectively called Linkup Action Dependent Dynamic Programming (LADDP) algorithms, are devised based on linkups and backups. The central idea of LADDP is that the increase of the horizon of the f- and Q-values can be done in increments of variable numbers of base time steps per update.

Deterministic LADDP algorithms are proposed for a learning situation where the system is either discrete-time/discrete-state or discrete-time/continuous-state, a model of the dynamics is available or easily identified, the system is deterministic, and incremental learning is neither necessary nor acceptable. This learning situation differs from the QL learning situation in the last three points.

The deterministic LADDP algorithms are shown to be correct, meaning that the computed Q-values will be identical to the optimal Q-values in the absence of approximation errors. It is proven that the worst case upper bounds of two of these algorithms are much lower than that of a representative, synchronous version of QL, the Synchronous ADDP algorithm. The error performance of these algorithms is compared by simulation using temporally and spatially correlated approximation errors. The error performance in the mean and variance of the proposed algorithms is better than that of Synchronous ADDP by as much as an order of magnitude. In terms of control policy degradation and control performance, the two deterministic LADDP algorithms are shown to be more tolerant of and less sensitive to the presence of approximation errors.

For the same learning situation for which the deterministic LADDP algorithms are proposed, several goal change algorithms are proposed to adapt (with certainty) existing optimal control policies from one goal to a new goal. This capability is unique in the literature. Significant savings in terms of the number of linkup cycles are demonstrated.

A stochastic LADDP algorithm is proposed for a learning situation where the system is either discrete-time/discrete-state or discrete-time/continuous-state, a model of the dynamics is unavailable and difficult to identify, the system is stochastic, and incremental learning is necessary or acceptable. This learning situation is identical to that in which incremental DP and TDL algorithms are used.
The underlying basis of the one LADDP algorithm proposed for stochastic systems is shown to be correct. It is shown by simulation to have an advantage in a qualitative performance measure called "orderliness" over TDL and an advantage in learning speed over a single control law version of QL. Both of these advantages are important in practical applications.

The main limitation of the Linkup concept is that an efficient algorithm based on the core Linkup Relation has not been realized. However, the proposed LADDP algorithms, based on variants of the core Linkup Relation, are applicable to both deterministic and stochastic systems.

Table of Contents

Abstract
List of Tables
List of Figures
List of Symbols
Glossary
Acknowledgement

1 Introduction
1.1 Motivation for this Work
1.2 Problem Statement
1.3 Contributions
1.4 Limitations
1.5 Position of my Work in the Literature
1.6 Format of Thesis

2 Framework of Field
2.1 Deterministic Continuous-Time and -State
2.1.1 Problem definition
2.1.2 Analytical: Calculus of Variations
2.1.3 Analytical: Pontryagin's Minimum Principle
2.1.4 Numerical: Variation of Extremals
2.1.5 Numerical: Quasilinearization
2.1.6 Numerical: Steepest Descent
2.1.7 Numerical: Gradient Projection
2.1.8 Remarks on numerical methods
2.2 Stochastic Continuous-Time and -State
2.2.1 Problem definition
2.2.2 Analytical methods
2.2.3 Analytical: Dynamic Programming
2.2.4 Analytical: Martingale methods
2.2.5 Analytical: Minimum Principles
2.2.6 Numerical: Finite element approximation of the DP equation
2.2.7 Numerical: Discretization of the stochastic control problem
2.3 Discrete-Time and Continuous-States
2.3.1 Analytical methods
2.3.2 Numerical methods
2.3.3 Neural network optimization with the Minimum Principle
2.3.4 Reinforcement Learning
2.3.5 Reinforcement Learning: Policy Iteration
2.3.6 Basic TD(λ)
2.3.7 Adaptive Heuristic Critic
2.3.8 HDP (with Ap_p or gradient methods)
2.3.9 Reinforcement Learning: Value Iteration
2.3.10 ADAC or ADHDP
2.4 Discrete-Time and -State
2.4.1 Markovian Decision Tasks
2.4.2 Search methods
2.4.3 Reinforcement Learning
2.4.4 Reinforcement Learning: Policy Iteration
2.4.5 Basic Adaptive Critics
2.4.6 DP in general
2.4.7 Synchronous DP (SDP)
2.4.8 Relaxed forms of SDP
2.4.9 DP interleaved with control
2.4.10 Adaptive Real-Time DP
2.4.11 Reinforcement Learning: Value Iteration
2.4.12 An overall view of Q-Learning
2.4.13 Off-Line Q-Learning
2.4.14 Real-Time Q-Learning
2.4.15 Q-Learning implies Action Dependent Dynamic Programming
2.4.16 Dyna-Q
2.4.17 Prioritized Sweeping
2.4.18 Variable Resolution
2.4.19 Transition Point DP
2.5 Subgoals
2.6 Summary of Literature Review and Focus of this Work
2.7 Outstanding Issues
2.7.1 Approximation errors in DP and Q-Learning
2.7.2 Speed of convergence
2.7.3 Hierarchies
2.8 Increasing Amount of Context

3 The Key Problems and A New Approach
3.1 The One-Step Q-Learning Algorithm
3.2 Approximation Errors in Q-Learning and DP
3.3 Goal Change
3.4 Hierarchies
3.4.1 The Albus model of intelligence and RL
3.4.2 The upper levels of a hierarchy and macros
3.4.3 Other remarks on hierarchies
3.5 A New Approach: Linkup Action Dependent Dynamic Programming
3.5.1 Goal states and reached states
3.5.2 The horizon
3.6 Pre- and Post-Circuits
3.7 Summary

4 The Problem Formulation and Proposed Solutions
4.1 The Initial Problem Definition
4.2 Considerations in Applying the Linkup Relation
4.2.1 Prototype algorithm
4.2.2 Issue #1
4.2.3 Issue #2
4.2.4 Issue #3
4.2.5 Issue #4
4.2.6 Linkup is useful in two situations
4.2.7 Deterministic systems
4.2.8 Evaluating one policy at a time
4.3 A Family of LADDP Algorithms for Deterministic Systems
4.3.1 Algorithm 1
4.3.2 Algorithm 2
4.3.3 The Deterministic Doubling Algorithm
4.3.4 The Synchronous Q-Learning Algorithm
4.4 An Asynchronous LDP (ALDP) Algorithm for Stochastic Systems
4.5 A Natural Algorithm for Stochastic Systems
4.6 Linkup Action Dependent Dynamic Programming and Hierarchies
4.7 Goal Change Algorithms
4.7.1 One goal state
4.7.2 Two different applications
4.7.3 The best goal change approach
4.7.4 More than one goal state
4.7.5 Extracting macros
4.7.6 Temporal versus spatial aggregation
4.7.7 Comparison with related works
4.8 Deviation from the Core Algorithms
4.9 Summary

5 Evaluation of Proposed Algorithms
5.1 On the Algorithms for Deterministic Systems
5.1.1 Proof of the Deterministic Linkup Relation
5.1.2 Numerical demonstration of correctness
5.1.3 Complexity results
5.1.4 Accumulated approximation error upper bounds
5.1.5 Numerical demonstration of error upper bounds
5.1.6 Numerical results for correlated errors
5.1.7 Simulations of control policy degradation and performance
5.2 On the Algorithms for Stochastic Systems
5.2.1 Numerical demonstration of correctness
5.2.2 Comparison of orderliness
5.2.3 A natural but flawed algorithm
5.3 On the Goal Change Algorithms
5.3.1 The correctness of the goal change algorithms
5.3.2 Increased efficiency of the goal change algorithms
5.4 Summary

6 Conclusions and Future Challenges
6.1 Conclusions
6.2 Limitation of the Linkup Approach
6.3 Discussion
6.4 Future Challenges

Bibliography

List of Tables

4.1 2-step expected discounted sum for all possible combinations of u(s0) and u(s2)
4.2 Value identification phase
4.3 Build-up phase of Algorithm 1
4.4 Build-up phase of Algorithm 2
4.5 Build-up phase of the Deterministic Doubling Algorithm
4.6 The Synchronous Q-Learning Algorithm
4.7 Asynchronous One Policy Linkup Dynamic Programming
4.8 Identification Phase of the SDA
4.9 Build-up Phase of the SDA
4.10 The Removal Pass for a Single Goal State
4.11 The Addition Pass for a Single Goal State
4.12 The Removal Pass for Multiple Goal States
4.13 The Addition Pass for Multiple Goal States
5.14 Transition definitions and costs of the test system
5.15 Q-values generated for the test system
5.16 Comparison of complexities of the Synchronous Q-Learning Algorithm, Algorithm 2, and the Doubling Algorithm
5.17 Worst case upper bounds for absolute cumulative fractional approximation errors for the SQL algorithm, Algorithm 2, and the Doubling Algorithm
5.18 Method for generating temporally and spatially correlated approximation error
5.19 Fractional degradation of the control policy with zero-mean e
5.20 Fractional degradation of the control policy with positively skewed e
5.21 Fractional degradation of the control policy with negatively skewed e
5.22 f-values generated for the test system
5.23 Worst case fraction of the linkup operations in one linkup cycle not performed by using the Addition Pass for a Single Goal State

List of Figures

2.1 The basic adaptive critic control system
2.2 The critic system as adapted by HDP
2.3 The overall ADAC system
2.4 An adaptive non-recurrent subgoal generator
2.5 A summary of the literature review. Each box immediately above a thick line is a category which applies to all boxes under that box. There is no additional inheritance relationship between boxes in the same vertical column
2.6 A {state} -> {evaluation} mapping
2.7 A {state, control} -> {evaluation} mapping
2.8 A {state, future-state, control} -> {evaluation} mapping
2.9 A {state, future-state, control, horizon} -> {evaluation} mapping
3.10 Graphical representation of the linkup operation
4.11 Why the + policy may be different from the * policy
4.12 Illustration of the tree of trajectories for a policy identical at all time steps
4.13 Illustration of the tree of trajectories for different policies at each time step
4.14 Contributions from different trajectory families
5.15 State transitions in the 7 x 7 deterministic simulation
5.16 Comparison of the mean accumulated fraction errors for the SQL algorithm, Algorithm 2, and the Doubling Algorithm
5.17 Comparison of the standard deviation of accumulated fraction errors for the SQL Algorithm, Algorithm 2, and the Doubling Algorithm
5.18 One-link manipulator
5.19 State trajectories for the Doubling Algorithm for zero-mean e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.20 State trajectory for Algorithm 2 for zero-mean e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.21 State trajectory for the Synchronous Q-Learning algorithm for zero-mean e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.22 State trajectories for the Doubling Algorithm for positively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.23 State trajectory for Algorithm 2 for positively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.24 State trajectory for the Synchronous Q-Learning algorithm for positively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.25 State trajectories for the Doubling Algorithm for negatively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.26 State trajectory for Algorithm 2 for negatively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.27 State trajectory for the Synchronous Q-Learning algorithm for negatively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.28 State transitions in the 7 x 7 stochastic simulation
5.29 f-values as a function of number of state transitions for each state in Figure 5.18 while learning under the One Policy DP algorithm
5.30 f-values as a function of number of state transitions for each state in Figure 5.18 while learning under the TD(λ) algorithm
5.31 f-values as a function of number of state transitions for each state in Figure 5.18 while learning under the ALDP algorithm

List of Symbols

x(t), x_t: State vector.
u(t): Control vector.
f(x, u, t): First-order derivative of the state vector with respect to time; may be nonlinear.
J: Optimality criterion to be maximized.
h(x(t_f), t_f): Final time payoff function.
g(x, u, t): Payoff function.
t_f: Final time.
t_0: Initial time.
J_a: Augmented optimality criterion.
p: Costate vector.
H: Hamiltonian.
x*: A state on the optimal state trajectory.
u*: Optimal control action. It is a vector.
p*: A costate on the optimal costate trajectory.
b, b: First-order derivative of the state vector with respect to time; may be nonlinear.
σ(x_t): An n x n matrix relating white noise Brownian motion to dx_t.
w_t: A vector Brownian motion.
α: Discount factor.
c(x_t, u_t): Instantaneous cost.
Φ(x_T): Final cost at time T.
T: Final time.
V: Value function.
i, j: Indices to the elements of a matrix.
a_ij: The (i, j)th entry of σσ'.
U: Set of permissible action vectors.
v: A particular action vector.
E_t,x: The expectation given X_t = x.
Ψ: The fundamental matrix solution of Eq. (2.17).
s: A time variable.
e(t): Noise vector with an unknown but time-invariant distribution.
c(x(t), u(t)): Immediate cost.
E: Expectation.
N: Terminal time index.
z: Final primary reinforcement at an unspecified and variable final time.
W: Weight vector of the adaptive critic.
α: Learning rate in TD(λ).
P_t, P_k: Prediction of z at time t or k, respectively.
k: A time index.
λ: Discount factor.
m: Total number of time steps to reach the end of a run.
c_t: Immediate cost incurred at time t.
S: State set of a discrete-time dynamic system.
n: The last state in the state set S.
s_t: State at time t.
i: A particular state, usually used to denote the predecessor state in a state transition.
j: A particular state, usually used to denote the successor state in a state transition.
u: An action.
U(i): Set of permissible actions for state i.
p_ij(u): State transition probability from state i to state j when action u is applied.
c_i(u): Immediate cost incurred when action u is applied in state i.
μ: Control policy or control law. Also a mapping from state to action.
γ: Discount factor.
c_t: Immediate cost at time t.
f_μ(i): The expected total infinite-horizon discounted cost that will be incurred if policy μ is consistently used and i is the initial state.
E_μ: The expectation assuming the controller always uses policy μ.
f(i): Abbreviated form of f_μ(i).
k: The stage of Synchronous DP.
f*(i): The optimal evaluation function for state i.
f_k(i): Stage k estimate of f*(i).
f: The complete set of f_k(i) for all states. Also called the evaluation function, f-mapping, or f-function.
k_t: The total number of stages of updates of the f-mapping completed up to time t.
p^t_ij(u): The maximum-likelihood system model at time t.
n_ij(t): The observed number of times before time t that action u was applied in state i and the system made a transition to state j.
n^u_i(t): The number of times action u was applied in state i.
Q_f(i, u): The cost of action u in state i as evaluated by f.
f*: The optimal evaluation function. Also the complete set of f*(i) for all states.
Q*(i, u): The cost of action u in state i as evaluated by f*.
Q_k(i, u): Stage k estimate of Q*(i, u).
S_k: The set of state-action pairs scheduled to be updated at stage k.
α_k(i, u): The stage k learning rate for state i and action u.
successor(i, u): The successor state in the next time step when action u is applied in state i.
u_t: Action at time t.
Q_t(s_t, u_t): Time t estimate of Q*(s_t, u_t).
f_t(s_t+1): Time t estimate of f*(s_t+1).
Q_μ(i, u): The expected infinite-horizon discounted cost of executing action u in state i as evaluated by policy μ.
μ*: An optimal control policy.
f_μ(j): The expected infinite-horizon discounted cost at state j as evaluated by policy μ.
f_μ*(j): Identical to f*(j).
Q_μ*(i, u): Q_μ(i, u) for the optimal policy μ*.
c̄_i(u): Expected value of c_i(u).
d: A time index.
n: The horizon.
μ_t: The control policy at time t.
μ⃗: A vector of control policies.
μ⃗*: The optimal μ⃗.
f_μ⃗(s_0, s_n, n): The expected infinite-horizon discounted cost at starting state s_0 and ending state s_n as evaluated by policy vector μ⃗ with a horizon of n.
E_μ⃗: Expectation as evaluated by policy vector μ⃗.
μ⃗(s_t): An action specified by policy vector μ⃗, state s_t, and time t.
f*(s_0, s_n, n): f_μ⃗(s_0, s_n, n) for an optimal policy μ⃗*.
f_μ⃗*(s_0, s_n, n): Identical to f*(s_0, s_n, n).
Q*(s_0, s_n, u_0, n): The expected infinite-horizon discounted cost of action u_0 in starting state s_0, as evaluated by f*(s_0, s_n, n), when all trajectories end up at state s_n in n time steps.
n_a: An intermediate time step.
s_na: An intermediate state.
P_i,sna: The n_a-step transition probability from state i to state s_na, given that u_0 is executed and then policy vector μ⃗ is used.
μ*_t: The optimal control policy at time t.
μ⃗_*: The optimal policy vector for the associated post-circuit.
μ⃗_+: The optimal policy vector for the associated pre-circuit.
Q_+(i, s_na, u, n_a): The expected finite-horizon discounted cost of action u in starting state i, as evaluated by the pre-circuit policy vector, when all trajectories end up at state s_na in n_a time steps.
Q: A simple notation for a member of, or the entire set of, Q variables of the type being addressed.
Q̂: An estimate of the corresponding Q*.
m: The fixed horizon of the pre-circuit in Algorithm 2.
f̂: An estimate of the corresponding f.
F_A: A set of f̂ satisfying the pre-circuit specification in the ALDP algorithm.
F_B: A set of f̂ satisfying the post-circuit specification in the ALDP algorithm.
n'_a: Dummy time index in the Build-up Phase of the SDA.
s*: Optimal state trajectory.
s*_t: Optimal state at time t.
S: Number of states in the complexity comparison.
A: Number of actions in the complexity comparison.
H: The horizon in the complexity comparison.
ε: Maximum absolute value of the negative fractional error for universal function approximation Method 1.
δ: Maximum absolute value of the negative fractional error for universal function approximation Method 2.
ε_c(h): Maximum absolute value of the cumulative negative fractional error for horizon h.
ε'_c(h): A loose upper bound for the absolute value of the cumulative negative fractional error for horizon h.
a: The magnitude of the a term in the error upper bound proofs.
b: The magnitude of the b term in the error upper bound proofs.
e': Error variable used to track simulated correlated approximation error in time and space.
e'_M: Maximum fractional error.
e(s, t): The fractional error in the extracted 7 x 7 core of the super grid.
e: Uniformly distributed random variable in [-1, 1].
Φ: Number of states in the system.
φ: Maximum number of states that any particular state can reach in 1 time step.
ψ: Maximum number of states that can reach any particular state in 1 time step.

Glossary

* policy: The control policy which produces an optimal post-circuit (see Section 3.5).
+ policy: The control policy which produces an optimal pre-circuit (see Section 3.5).
AC: Adaptive critic (see Section 2.4.5).
ACE: Adaptive critic element (see Section 2.3.6).
ADAC: Action dependent adaptive critic (see Section 2.3.10).
ADDHP: Action dependent dual heuristic programming (see Section 2.3.10).
ADDP: Action dependent dynamic programming (see Section 2.4.15).
ADHDP: Action dependent heuristic dynamic programming (see Section 2.3.10).
ADP: Asynchronous dynamic programming (see Section 2.4.8).
AHC: Adaptive heuristic critic (see Section 2.3.7).
ALDP: Asynchronous linkup dynamic programming (see Section 4.4).
ASE: Associative search element (see Section 2.3.6).
Actual horizon: The horizon accumulated by Q-Learning in actual execution, rather than by definition (see Section 3.1).
Actor-critic: A reinforcement learning system consisting of a control element (actor) and a DP evaluation element (critic).
Adaptive critic: The critic in an actor-critic system.
Addition pass: An algorithm used to add the reward effects of a goal state to a set of Q functions.
Approximator: Universal function approximator.
Backup: The update operation in One-Step Q-Learning, Eq. (3.29).
Circuit: The complete set of trajectories generated by the system under a given control policy or control policy vector, given the initial state, the end state, and the number of time steps.
Control policy: Control law.
Control policy vector: A vector of control policies where the elements are n policies for a horizon of n.
DP: Dynamic programming.
Deterministic linkup relation: Eq. (4.42).
End state: The last state of a circuit or the last state of a Q function.
Evaluation: The performance of an explicit or implicit control law as evaluated by an optimality criterion.
GSDP: Gauss-Seidel dynamic programming (see Section 2.4.8).
Goal state: In Section 4.7, the state with a reward component in its immediate costs.
Goal Set: A set of goal states (see Section 4.7.4).
HDP: Heuristic dynamic programming (see Section 2.3.8).
Hierarchy: A multi-level system where the upper levels have coarser resolutions of state and time than the lower levels.
Horizon: The time period spanned by an evaluation function.
Immediate cost: The one time step cost incurred by being in a given state and executing a given action.
Inner minimization: The minimization with respect to the action for a given intermediate state in the various linkup relations.
Intermediate state: In a linkup operation, the end state of the Q+ value and the start state of the Q* value.
LADDP: Linkup action dependent dynamic programming.
Linkup: A linkup operation or a linkup cycle, depending on the usage.
Linkup operation: The addition of a Q+ and a Q* at one particular intermediate state.
Linkup relation: Eq. (3.37).
Linkup cycle: A learning cycle where the linkup operations are done for all possible intermediate states.
Macro: A control policy to reach a given goal state. Used in the context of the discussions on hierarchies and re-use of skills.
NSG: Neural subgoal generator.
Neurocontrol: Any control technique which uses neural networks as a component.
OPDP: One policy dynamic programming (see Section 5.2.2).
One-Step Q-Learning: The algorithm described in Section 3.1.
One policy linkup relation: A linkup relation where the purpose is to evaluate a single control policy with respect to an optimality criterion and the entire set of immediate costs.
Outer minimization: The minimization with respect to the intermediate state in the various linkup relations.
Payoff: Immediate cost.
Policy iteration: A class of DP algorithms where the computation of the evaluation of a control law and the computation of a control law from the evaluation are done alternately.
Policy: Control law.
Post-circuit: A family of trajectories whose evaluation is the value of one Q* on the RHS of Eq. (3.37).
Post-circuit policy: The control policy vector which generates a given post-circuit.
Pre-circuit: A family of trajectories whose evaluation is the value of one Q+ on the RHS of Eq. (3.37).
Pre-circuit policy: The control policy vector which generates a given pre-circuit.
Q-Learning: An incremental, action dependent DP method.
RTDP: Real time dynamic programming (see Section 2.4.9).
Re-use: Using evaluation functions or skills of one goal for another goal.
Reached state: The state the system happened to reach after executing a control policy.
Reinforcement learning: A class of neurocontrol techniques with the characteristic that the control system only requires the immediate cost and the state to learn the optimal control incrementally.
Removal pass: An algorithm used to remove the reward effects of a goal state from a set of Q functions.
Reward effects: The difference in the Q values that the reward value at a goal state makes.
SDA: Stochastic doubling algorithm (see Section 4.5).
SDP: Synchronous dynamic programming (see Section 2.4.7).
SADDP: Synchronous ADDP (see Section 4.3.4).
Spatial aggregation: The aggregation of neighbouring states as a means to reduce the resolution of evaluation functions.
Start state: The first state of a circuit or the first state of a Q function.
Subgoal: An intermediate goal state.
TD(λ): A temporal difference learning algorithm (see Section 2.3.6).
TDL: Temporal difference learning (see Section 2.3.6).
Temporal resolution: The time period spanned by a state transition, or the time period increment of a backup or linkup operation.
Temporal aggregation: The aggregation of neighbouring time periods as a means to reduce the resolution of evaluation functions.
Value iteration: A class of DP algorithms where the iteration is done over the evaluations, without computing the control policy at any time except after completion.

Acknowledgement

I am obligated to my supervisors, Prof. P. D. Lawrence and Prof. M. R. Ito, for their guidance and patience throughout this program. Special thanks go to Ken Buckland for his technical assistance and eclectic discussions. I would like to extend my appreciation to Rob Ross for not rebooting the network during my massive simulation runs. Thanks go to Paul Tschetter, Brian Spears, Greg Grudic, Don Gillies, and Herve le Pocher for their friendship and advice. This thesis would not have been possible without the support and encouragement of my family. My deepest gratitude goes to my wife, Lisa, for her faith and laughter throughout these trying years.

Chapter 1

Introduction

1.1 Motivation for this Work

Dynamic Programming (DP) and Action Dependent Dynamic Programming (ADDP) used with universal function approximators (for simplicity, just approximators) are powerful methods for the solution of many machine learning and optimal control problems. Reinforcement Learning (RL) is the field in which DP and ADDP (with approximators) have been studied most intensively in recent years. The DP and ADDP methods in RL have tended to be tailored to fit the dominant class of problems in RL, namely machine learning problems. In the application of DP and ADDP (with approximators) to linked robot optimal control problems, the current RL methods take no advantage of some of the characteristics of robotics, such as the availability of robot models. Problems of even moderate size require the use of approximators to store the various mappings in these algorithms.
If approximators can approximate the various mappings required in RL to a sufficient level of accuracy, they can be used to overcome the problem of combinatoric explosion [1] in many real world applications. The approximators, the mappings, and how the mappings are approximated will be described after the details of the RL algorithms are introduced in Chapter 2.

The RL algorithms of the value iteration type are heavily affected by the accumulation of approximation errors [2]. The precise meaning of value iteration will be explained in Chapter 2. Roughly, it means that the "values" or numerical "evaluations" of the final optimal control policy with respect to a pre-defined optimality criterion are computed by iterating on the values rather than on the different policies. On each iteration, the values of the previous iteration are used, with other new information, to compute the new values at the current iteration. The approximation errors of the old values are compounded by the approximation errors of the new values. As the number of iterations grows, this accumulation can cause the final control policy to be non-optimal and even unstable. Each iteration changes the values by accounting for an additional time period. Fewer iterations, achieved by lengthening the time period, may reduce such accumulation. This time period is not variable in most existing RL algorithms. This author is not aware of any satisfactory solution. This shortcoming is one motivation behind the research reported in this work.
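To make the mechanism concrete, the following is a minimal, illustrative sketch of a single synchronous value-iteration stage for a small deterministic, discrete-time, discrete-state system; it is not an algorithm from this thesis. The names costs, successor, and GAMMA are assumed placeholders for the immediate-cost model, the transition model, and the discount factor.

```python
# A minimal, illustrative sketch (not the thesis's algorithm): one synchronous
# value-iteration stage for a deterministic discrete-time, discrete-state
# system.  The names costs, successor and GAMMA are assumed placeholders.

GAMMA = 0.9  # discount factor

def value_iteration_stage(f_old, states, actions, costs, successor):
    """Compute the stage-(k+1) values entirely from the stage-k values.

    Because every new value is built on top of the previous stage's values,
    any approximation error stored in f_old is carried forward and compounded
    by the error introduced at the new stage, which is the accumulation effect
    described above.  Each stage extends the horizon covered by the values by
    one base time step.
    """
    f_new = {}
    for i in states:
        f_new[i] = min(costs[i][u] + GAMMA * f_old[successor(i, u)]
                       for u in actions[i])
    return f_new
```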
A second motivation is that if such a scheme for varying the time resolution were to be developed, it would permit the future development of hierarchical control systems updated by DP and ADDP methods. More complex optimal control problems, to be presented in Chapter 2, are probably most effectively solved by hierarchies. This is because a single level controller (or equivalently, a controller using a single resolution of time and space) is not sufficient to effectively and efficiently solve problems which require optimization over a wide range of temporal and spatial resolutions. Therefore, the same controller must have different levels, or a hierarchy of different resolutions. RL is a promising means of learning optimal actions in at least some aspects of a hierarchy.

1.2 Problem Statement

This thesis addresses the problem of the accumulation of approximation errors by DP and ADDP methods and provides for the extension of DP and ADDP methods to hierarchies. More specifically, a number of algorithms are proposed: 1) for deterministic systems, to find optimal control policies a) from all policies simultaneously (i.e., "value iteration") or b) by adapting existing optimal policies for one goal to a new goal, and 2) for stochastic systems, to find optimal control policies by evaluating the optimality of one policy at a time (i.e., "policy iteration").

The new algorithms belong to a new class of algorithms proposed here called Linkup Action Dependent Dynamic Programming (LADDP). LADDP is based on the idea that the quantities being summed (or backed up) in DP and ADDP should be in different temporal resolutions in different levels of a hierarchy. It is also based on Schmidhuber's method of recursively breaking down a goal into smaller subgoals [3]. In control terms, a goal is equivalent to the task of optimally moving from a start state to a specified end state. The result is a method of summing the evaluations of two state/control trajectories when the preceding trajectory is more than 1 base time unit long. This summing operation is called a Linkup. Linkups allow the length of time covered by an evaluation to grow very quickly, depending on the algorithm, even exponentially with respect to the number of iterations.
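As an informal illustration of why the covered horizon can grow exponentially, the sketch below doubles the horizon of a table of evaluations for a deterministic system following a single fixed policy. It is a simplification, not the Linkup Relation of this thesis (which is stated over Q-values and involves a minimization over intermediate states); the names V, reach, and GAMMA are assumed placeholders.

```python
# A simplified, illustrative sketch of the horizon-doubling flavour of a
# linkup for a deterministic system under one fixed policy.  V[n][s] is the
# n-step discounted cost of following the policy from state s, and
# reach[n][s] is the state reached after those n steps.  Not the thesis's
# Linkup Relation; names are assumed placeholders.

GAMMA = 0.9

def linkup_double(V, reach, n):
    """Build 2n-step evaluations from n-step evaluations in a single pass.

    The n-step evaluation of the preceding (pre-) trajectory is added to the
    discounted n-step evaluation of the trajectory continuing from the state
    where it ends, so one pass doubles the horizon instead of extending it by
    a single base time step.
    """
    V2, reach2 = {}, {}
    for s in V[n]:
        mid = reach[n][s]                        # state reached after n steps
        V2[s] = V[n][s] + (GAMMA ** n) * V[n][mid]
        reach2[s] = reach[n][mid]                # state reached after 2n steps
    V[2 * n] = V2
    reach[2 * n] = reach2
```

Starting from one-step evaluations, k such passes would cover a horizon of 2^k base time steps.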
The LADDP concept and algorithms are descended from a study of RL problems and algorithms, hence the emphasis on RL methods throughout this thesis. The new algorithms are not intended for the same problem definitions as the existing RL algorithms, which target the Markovian Decision Task. The problem definition for the new algorithms will be described in more detail in Chapter 4.

1.3 Contributions

A number of advances were expected. An increase in the speed of learning with respect to the number of iterations was expected. Since the number of iterations required to compute evaluations covering a certain length of time is much less than that in DP and ADDP, an advantage in the accumulation of errors was also expected. Lastly, it was expected that this investigation would allow RL to apply to hierarchical control. These advances are largely realized by this work.

It should be mentioned that a reduction in approximation errors does not necessarily lead to better control performance, because the control law is computed by comparing the relative fitness of all possible actions at each point in the state space or state-action space. Ideally, the approximation errors should be reduced in such a way that the errors at each point in the approximation are equal. Then the relative fitness would not be affected by errors. If the errors are reduced to zero, this would in fact be true. This suggests that reducing the approximation errors absolutely is likely to be a fruitful approach to yielding optimal control laws.

This thesis makes the following contributions to the study of DP and ADDP methods with approximation, reinforcement learning, and their connection to learning in a hierarchy:

1. Derivation of three Linkup Relations, which are the foundations for using linkup operations to compute Q-values in QL and f-values in DP, in various situations.

2. Proposal of one asynchronous and two synchronous algorithms for a deterministic version of the Markovian Decision Task, and proposal of one asynchronous algorithm for a stochastic version of the Markovian Decision Task, but with one control policy operating at all times.

3. Proposal of algorithms to efficiently change the goal of the top level Q-functions produced by the Linkup algorithms.
1.5 Position of my Work in the Literature N e u r o c o n t r o l c a n be d i v i d e d i n to m a n y classes, based b o t h o n the class of p r o b l e m a n d o n the app roach t aken . Re in fo rcemen t L e a r n i n g ( R L ) is a m a j o r class of neu rocon t ro l Chapter 1. Introduction 6 m e t h o d s w h i c h has been deve lop ing for m a n y years [4] [5] [6] [7] [8] [9] [10] [11] [12]. R L inc ludes Q - L e a r n i n g ( Q L ) , T e m p o r a l Difference L e a r n i n g ( T D L ) a n d i n c r e m e n t a l forms of D y n a m i c P r o g r a m m i n g ( D P ) for con t ro l . R L is d i s t i nc t f r o m the o ther classes of neu-r o c o n t r o l because the cont ro l le r learns to o p t i m i z e a func t i on of the payoffs or costs. A s y s t e m o p t i m a l l y con t ro l l ed c a n e x h i b i t r i c h behav iours , e spec ia l ly i f the s y s t e m is non-l inea r a n d / o r discrete . S ince R L is app l i cab l e to non l inear a n d / o r discre te sys tems, i t has been recogn ized tha t R L c o n t r o l can p rov ide d r a m a t i c a l l y new capab i l i t i e s i n cont ro l le rs , e spec ia l ly i f the c o n t r o l s y s t e m has h ierarchies , m u l t i p l e representa t ions of states a n d cont ro l s , p a t t e r n r ecogn i t i on capab i l i t i e s , a n d other s t ruc tures u s u a l l y associa ted w i t h a r t i f i c i a l in te l l igence . Bes ides several o lder R L a lgo r i t hms [8] [13], m a n y a lgo r i t hms connec ted w i t h D P [12] [14] have appea red recent ly . T h i s connec t ion is espec ia l ly s ignif icant because " D P is the o n l y exac t a n d efficient m e t h o d for l ea rn ing o p t i m a l con t ro l of u n k n o w n , non l inea r and s tochas t ic sys tems w i t h o u t m a k i n g h i g h l y spec ia l i zed a s sumpt ions about the sy s t em" [15]. T h e o lder a l go r i t hms are also connec ted to D P , bu t w i t h the a s s u m p t i o n tha t o n l y one c o n t r o l p o l i c y is o p e r a t i n g at a l l t imes , a con t ro l p o l i c y b e i n g another t e r m for a c o n t r o l l a w . Together , these a lgo r i t hms presen t ly o c c u p y a large n iche i n the con t ro l w o r l d w i t h o u t a c o m p e t i t o r a n d are w o r t h y of s tudy. Q L [12] is one of the mos t i m p o r t a n t asynchronous A D D P a l g o r i t h m s because i t is w e l l s u p p o r t e d t heo re t i ca l l y a n d is the basis for a n u m b e r of R L a l g o r i t h m s . T D L is a t y p e of acce le ra ted supe rv i sed l e a r n i n g for the p r e d i c t i o n of fu ture values g i v e n a sequence of past a n d present values. T h e p r i n c i p a l a l g o r i t h m i n T D L is T D ( A ) [10]. Together w i t h ce r t a in o ther a l g o r i t h m s for l i n k i n g i ts results w i t h the con t ro l l e r ac t ions , i t has h i s t o r i c a l l y been a pa r t of R L . S o m e of the mos t impress ive resul ts i n R L are o b t a i n e d u s ing T D ( A ) a l g o r i t h m s . D P has l ong been k n o w n [1] as a power fu l m e t h o d for f ind ing the eva lua t ions of lowest cost pa ths i n Sequen t i a l D e c i s i o n Tasks . L i k e T D ( A ) , together Chapter 1. Introduction 7 w i t h o ther a l g o r i t h m s , i n c r e m e n t a l forms of D P are also an i m p o r t a n t pa r t of R L . T h e deta i ls of a l l 3 types of a lgo r i t hms w i l l be covered i n C h a p t e r 2. 
In a dynamic system, state and time can each be either discrete or continuous, leading to four possible combinations. All RL methods assume the system has one of the following combinations: 1) discrete-time and discrete-state, or 2) discrete-time and continuous-state. All RL methods assume that the model of the dynamics is unknown and that a model of the dynamics is difficult to construct by identification. All RL methods assume that the system is stochastic. Lastly, all RL methods learn incrementally. These assumptions and the incrementality of learning are the components of the RL problem definition.

However, RL being the focal point of research in the combined use of DP/ADDP and approximators means that the results should be cross-utilized for other problem types which are also amenable to solution using DP and approximators. The following are two problem types where one of the components of the traditional RL problem definition is not applicable:

Known Dynamics or Easy to Construct Dynamic Models: If the model of the dynamics is known or can be easily identified and constructed, algorithms which use the model to build optimal control policies may give better results than algorithms which assume the dynamics are unknown. There are many problems with well understood dynamics for which, due to nonlinearity, discontinuity, or certain types of optimality criteria, no satisfactory solution has been found. The walking machine, to be discussed in more detail later, is such a problem. The fact that their problem type does not fit the traditional RL problem definition does not mean RL results should not be borrowed and extended to better suit them.
F u r t h e r m o r e , the sy s t em m o d e l m a y be a b less ing ra ther t h a n a b u r d e n , s ince m a n y p h y s i c a l sys tems cannot w i t h s t a n d pro longed exposure to the r a n d o m pol ic ies used i n i n c r e m e n t a l l ea rn ing . M a n y o p t i m a l con t ro l p rob lems i n robot ics require e i ther of the 2 componen t s l i s t ed above. T h e concepts i n t r o d u c e d and a lgo r i t hms proposed i n th is thesis are subs t an t i a l l y m o d i f i e d D P a n d A D D P concepts and a lgo r i t hms in t ended for the even tua l so lu t ion of these p rob l ems a n d s i m i l a r types of p rob lems . T h e proposed a l g o r i t h m s are pos i t i oned i n the ex tan t l i t e r a tu re i n the fo l lowing way: . 1. T h e d e t e r m i n i s t i c L A D D P a lgo r i t hms proposed .are i m p r o v e m e n t s over ADDP al-gorithms in RL i n the case where the sy s t em is d i sc re te - t ime/d i sc re te - s ta te or d i sc re te - t ime /con t inuous - s t a t e , the m o d e l of the d y n a m i c s is k n o w n , the sy s t em is d e t e r m i n i s t i c , and i n c r e m e n t a l l ea rn ing is not r equ i red a n d not acceptable . Fo r the same case, the goal change a lgo r i t hms proposed p r o v i d e u n i q u e new capa-b i l i t i e s i n the a d a p t a t i o n of an ex i s t i ng o p t i m a l con t ro l p o l i c y for one goal to a different goa l . T h e l ea rn ing s i tua t ion for b o t h types of p roposed a lgo r i t hms is different f r o m those i n A D D P a lgo r i t hms i n R L i n tha t these A D D P a lgo r i t hms assume d isc re te - t ime/d i sc re te -s ta te , u n k n o w n d y n a m i c s , s tochas t ic sys tems, a n d Chapter 1. Introduction 9 i n c r e m e n t a l l e a rn ing . 2. T h e s tochas t ic L A D D P a l g o r i t h m proposed is an i m p r o v e m e n t over DP algorithms and TD(\) in RL i n the case where the sys t em is e i ther d i sc re te - t ime/d i sc re te - s ta te or d i sc re te - t ime /con t inuous - s t a t e , the m o d e l of the d y n a m i c s is u n k n o w n , the sys-t e m is s tochas t ic , a n d i n c r e m e n t a l l e a rn ing is r equ i r ed or acceptab le . T h e l ea rn ing s i t u a t i o n is i d e n t i c a l for t he s tochas t ic L A D D P a l g o r i t h m , a n d D P a l g o r i t h m s a n d T D ( A ) i n R L . 3. C o n t i n u i n g f r o m the sec t ion o n L i m i t a t i o n s , no a l g o r i t h m has been p roposed to c o m p a r e w i t h ADDP algorithms in RL i n the case where the s y s t e m is e i ther d i sc re te - t ime/d i sc re te - s ta te or d i sc re te - t ime /con t inuous - s ta te , the m o d e l of the dy-n a m i c s is u n k n o w n , the sys t em is s tochas t ic , a n d i n c r e m e n t a l l e a rn ing is r equ i red or acceptab le . 1.6 Format of Thesis T h e r e m a i n i n g chapters of th is thesis are o rgan ized as fol lows: Chapter 2 A b r o a d ove rv iew of techniques of o p t i m a l con t ro l of non l inea r a n d discre te sys tems w i l l be presented, end ing w i t h a focus o n the types of a lgo r i t hms tha t th is thesis addresses. T h e purpose of th is chapter is not to descr ibe the p rob lems tha t t he w o r k i n th i s thesis is i n t e n d e d to solve, b u t t o b r i n g together i n one p lace the var ious so lu t i on me thods for s i m i l a r p r o b l e m types . 
T h e purpose is to p r o m o t e a m o r e comple t e ove rv iew a n d to unde r s t and how var ious approaches fit i n the bigger p i c t u r e . S o m e o u t s t a n d i n g issues of these a lgo r i t hms w i l l be br ie f ly d iscussed. Chapter 3 T h e n o t i o n of L i n k u p L e a r n i n g w i l l be deve loped f r o m the idea of b r e a k i n g a goa l i n to subgoals and the need to ex t end R L to hierarchies . A n u m b e r of subt le t ies Chapter 1. Introduction 10 of L i n k u p L e a r n i n g w i l l be br ief ly d iscussed. Chapter 4 T h e p r o b l e m class to be so lved , M a r k o v i a n D e c i s i o n Tasks , w i l l be fo rmu-l a t ed precise ly . T h e s t ra ight fo rward a p p l i c a t i o n of L i n k u p L e a r n i n g to th is p r o b l e m class w i l l be cons idered by i n t r o d u c i n g a p r o t o t y p e a l g o r i t h m . A n u m b e r of issues w h i c h w o u l d m a k e th is diff icul t w i l l be presented, a long w i t h two so lu t ions . A f a m i l y of a l g o r i t h m s for the case of d e t e r m i n i s t i c sys tems a n d a n a l g o r i t h m for s tochas t ic sys tems w i l l be presented. A n a l g o r i t h m w h i c h s h o u l d be a n a t u r a l for s tochas t ic s y s t e m w i l l be m e n t i o n e d . A l g o r i t h m s to change the goa l of the top leve l Q- func t ions w i l l be presented and e x p l a i n e d . T h e relevance of L i n k u p L e a r n i n g to a l e a rn ing i n a h i e ra rchy w i l l be discussed. Chapter 5 T h e o r e t i c a l a n d s i m u l a t i o n results of the a lgo r i t hms p roposed i n C h a p t e r 4 w i l l be shown a n d discussed. Poss ib le var ia t ions of the p roposed a lgo r i t hms w i l l be discussed. A s t rong reason to reject the n a t u r a l a l g o r i t h m desc r ibed i n C h a p t e r 4 w i l l be presented. Chapter 6 T h i s chapter w i l l i n c l u d e genera l discussions, conc lus ions , specu la t ions , a n d suggestions for future work . Chapter 2 Framework of Field In th is chapter , the f r amework of the f ie ld of n o n l i n e a r / d i s c r e t e o p t i m a l con t ro l is re-v i e w e d . T h e purpose is to show how D P me thods re la te to o ther so lu t i on to n o n l i n -ea r /d i sc re t e o p t i m a l con t ro l p rob lems . N o c l a i m is m a d e tha t the m e t h o d s p roposed i n th i s thesis solve the p rob l ems tha t a l l of these me thods target , except tha t D P me thods have advantages over o ther me thods i n ce r t a in s i tua t ions and the p roposed m e t h o d s have advantages over e x i s t i n g D P me thods i n ce r t a in o ther s i tua t ions . T h e reader w h o is not in teres ted i n th is b a c k g r o u n d m a t e r i a l m a y w i s h to sk ip fo rward to S e c t i o n 2.6. T h i s chapter reviews the choices ava i lab le to the c o n t r o l s y s t e m designer w h o m i g h t want to solve ce r t a in types of p rob lems . These p rob lems have the f o l l o w i n g character is-t i cs : 1. C o n t r o l is o p t i m a l ; 2. T h e s y s t e m is p o t e n t i a l l y h i g h - d i m e n s i o n a l ; 3. T h e s y s t e m m a y be cont inuous non l inea r or discrete . If the s y s t e m is con t inuous , i t m a y be s t rong ly non l inea r a n d not confined to spec ia l classes of non l i nea r i t y ; 4. T h e s y s t e m m a y be s tochast ic ; 5. It is not k n o w n a priori the correct s tate t r a j ec to ry or con t ro l t r a j ec to ry ; 11 Chapter 2. Framework of Field 12 6. 
T h e s y s t e m m o d e l is unava i l ab le and the s t ruc tu re of the sy s t em m o d e l m a y also be unava i l ab le ; a l t e rna t ive ly , the sy s t em m o d e l m a y be ava i lab le , bu t i t s m a t h e m a t i c a l d e s c r i p t i o n is not eas i ly used to der ive an o p t i m a l c o n t r o l l a w . T h i s l i s t of charac te r i s t ics are not necessar i ly those of the p rob l ems tha t the p roposed m e t h o d s target , bu t are c losely re la ted . S o m e of the so lu t i on m e t h o d s , n a m e l y the D P m e t h o d s , are the bases of the p roposed me thods . Some of the p rob l ems encounte red i n the use of the D P me thods are those addressed by the p roposed m e t h o d s . The se charac te r i s t ics are not des igned to e l i m i n a t e a l l b u t a few c o n t r o l me thods . R a t h e r , they are n a t u r a l charac ter i s t ics of m a n y p r a c t i c a l con t ro l a n d m a c h i n e i n t e l l i -gence p r o b l e m s . A clear e x a m p l e is the w a l k i n g m a c h i n e p r o b l e m . A w a l k i n g m a c h i n e w i t h severa l legs has a large n u m b e r of degrees of f reedom. T h e s y s t e m d y n a m i c s are h i g h l y non l inea r and are d i scon t inuous w h e n the legs m a k e or b reak con tac t w i t h the g r o u n d . T h e chains of l i n k s are i n t e r m i t t e n l y open a n d c losed loop . T h e e n v i r o n m e n t tha t the m a c h i n e is i n t ended to func t i on i n is t y p i c a l l y rugged a n d u n p r e d i c t a b l e . T h i s i n -t roduces s tochas t i c i t y i n to the d y n a m i c s . T o have a w a l k i n g m a c h i n e b a l a n c i n g , w a l k i n g , t u r n i n g , a n d c l i m b i n g , to n a m e a few bas ic tasks, involves the p r o b l e m of f i nd ing con t ro l t ra jec tor ies to resul t i n these des i red goals a n d to keep the states out of some regions of the state space, e.g., regions where the legs are c o l l i d i n g or where the m a c h i n e is f a l l i ng over . T o m a k e the search of the con t ro l t ra jector ies m o r e d i f f icul t , the des i red t r a j ec to ry of the states is not k n o w n . F u r t h e r m o r e , we w o u l d l i ke the m a c h i n e to be o p t i m a l i n t e r m s of energy efficiency, speed, m a r g i n of safety, etc. T h e d y n a m i c s are c o m p l e x . E v e n the s i m p l i f i e d d y n a m i c mode l s t y p i c a l l y used i n w a l k i n g m a c h i n e research are diff icul t to m a n i p u l a t e to y i e l d useful , let alone o p t i m a l , con t ro l laws. A w a l k i n g m a c h i n e is also an i n t u i t i v e p l a t f o r m to i l l u s t r a t e m a n y tasks tha t are be ing inves t iga ted i n robot ics and m a c h i n e l ea rn ing . Severa l po in t s i n th i s thesis w i l l be Chapter 2. Framework of Field 13 e x p l a i n e d w i t h examples f r o m the con t ro l of w a l k i n g mach ines . T h i s chapter shows tha t we have o n l y a few choices for the so lu t i on of the s ta ted p r o b l e m t y p e a n d tha t the j u r y is s t i l l out o n w h i c h of these choices w i l l p rove to be the i d e a l s o l u t i o n . T h e focus is t hen shif ted to a p r o m i s i n g class of m e t h o d s a n d the o u t s t a n d i n g research issues i n v o l v e d . 
T h e fo l lowing sections are o rgan ized a long three m a j o r l ines of d iv i s ions : cont inuous- or d i sc re te - t ime , cont inuous- or discrete-s tate , a n d d e t e r m i n i s t i c or s tochas t ic d y n a m i c s . 2.1 Deterministic Continuous-Time and -State T h e m e t h o d s i n con t inuous - t ime o p t i m a l con t ro l for b o t h d e t e r m i n i s t i c case a n d stochas-t i c case are somewha t r e m o t e l y re la ted to the focus of th is paper . T h e r ev iew be low is m e a n t o n l y to g ive a flavor of how these o p t i m a l con t ro l p r o b l e m s are approached a n a l y t i c a l l y a n d n u m e r i c a l l y . T h i s r ev iew of these approaches m a y serve to b roaden the pe r spec t ive of re inforcement l ea rn ing theoris ts and r emove the sense of vagueness about how th ings are done ou ts ide of tha t f ie ld . 2.1.1 Problem definition T h e non l inea r d y n a m i c s y s t e m i n ques t ion is often desc r ibed i n th is f o r m : where x(t) is the state vec tor a n d u(t) is the con t ro l vec tor . T h e exact na tu re of the f u n c t i o n a l / depends o n the assumpt ions of the m e t h o d used to solve the p r o b l e m . W e are in te res ted i n h a v i n g / at least be m o r e t h a n 1 d i m e n s i o n a n d non l inea r . T h e o p t i m a l i t y c r i t e r i o n is u s u a l l y of the f o r m x = f(x,u,t) (2.1) (2.2) Chapter 2, Framework of Field 14 where t0 is the s tar t t i m e and tf is the end t i m e . T h e po in t of o p t i m a l c o n t r o l is to f i n d a c o n t r o l l aw u(t) w h i c h m a x i m i z e s J. T h e d y n a m i c equa t ion makes the c o n t r o l p r o b l e m a cons t r a ined o p t i m i z a t i o n p r o b l e m . W h i t e a n d J o r d a n [16] r ev i ewed the a n a l y t i c a l a n d n u m e r i c a l s o l u t i o n m e t h o d s for th is class of p rob l ems . 2.1.2 Analytical: Calculus of Variations Since the o p t i m i z a t i o n p r o b l e m is the m i n i m u m of a f unc t i ona l , J , th i s is a p r o b l e m i n the ca lcu lus of va r i a t ions . T o take the d y n a m i c equa t ion in to account i n the o p t i m a l i t y c r i t e r i o n J, costate var iables are i n t r o d u c e d to f o r m an augmen ted o p t i m a l i t y c r i t e r i o n Ja i n the f o l l o w i n g way: where p is a t i m e v a r y i n g vec tor ca l l ed the costates. O p t i m i z i n g Ja w i t h respect to x, p and u is equivalent to o p t i m i z i n g J w i t h respect to u. T h e resul t is a s t a n d a r d set of necessary cond i t ions for o p t i m a l i t y [16, page 187]. T h e i n i t i a l c o n d i t i o n for the state equa t ion a n d the f ina l c o n d i t i o n for one of the equat ions ca l l ed the costate equa t i on have to be found also. T h e o p t i m a l c o n t r o l p r o b l e m is now a two-po in t b o u n d a r y va lue p r o b l e m . A n a l y t i c so lu t ions to these cond i t ions are genera l ly unava i l ab le . C o m m o n l y k n o w n n u m e r i c a l a l g o r i t h m s for the so lu t ion of different ia l equat ions a n d b o u n d a r y value prob-lems are di f f icul t to a p p l y because of the sp l i t b o u n d a r y cond i t i ons a n d the i n s t a b i l i t y of one of the state a n d costate equat ions . T h e n u m e r i c a l m e t h o d s tha t w o u l d w o r k w i l l be desc r ibed i n a la ter sec t ion . (2.3) Chapter 2. 
Framework of Field 15 2.1.3 Analytical: Pontryagin's Minimum Principle If the c o n t r o l p r o b l e m is fur ther cons t r a ined b y inequa l i t i e s o n the cont ro ls or the states, i t c an be h a n d l e d b y P o n t r y a g i n ' s m i n i m u m p r i n c i p l e . A clear char t of the o p t i m a l c o n t r o l a p p r o a c h based o n th is p r i n c i p l e together w i t h a w o r k e d out e x a m p l e is i n (Sage [17]). T h e bas ic steps are the fo l l owing : 1. Def ine a va r i ab l e c a l l e d t he H a m i l t o n i a n : H = g+pTf; (2.4) 2. F i n d the adjoint or canonic equat ions : 3. Def ine a fur ther c o n d i t i o n : H{x\u\p\t) <H{x\ u,p*,t) (2.7) for a l l a d m i s s i b l e u(t). O r a l t e rna t ive ly , th i s c o n d i t i o n c a n be expressed as u*{t) = arg m i n (g(x, u. t) + pTf(x, u,«)) ; (2.8) 4. O b t a i n the two po in t b o u n d a r y cond i t ions : i(t0) = x0, (2.9) /, v _ dh(x(tf),tf)^ P { t f ) ~ dm(tj) ' ( 2 - 1 0 ) 5. So lve the t w o po in t b o u n d a r y value p r o b l e m ei ther a n a l y t i c a l l y or u s ing a n u m e r i c a l a l g o r i t h m . T h e las t s tep, i f done a n a l y t i c a l l y , q u i c k l y becomes u n w i e l d y for rea l i s t i c sys tems. So for r ea l i s t i c p r o b l e m s , n u m e r i c a l me thods are u s u a l l y e m p l o y e d to solve the two po in t b o u n d a r y va lue p r o b l e m . Chapter 2. Framework of Field 16 2.1.4 Numerical: Variation of Extremals T h e V a r i a t i o n of E x t r e m a l s [16] a t t emp t s to i t e r a t i v e l y e s t ima te an i n i t i a l va lue for the costate equa t ions , the reby conve r t i ng the two-po in t sp l i t b o u n d a r y value p r o b l e m in to an eas i ly so lvable i n i t i a l va lue p r o b l e m . A f t e r each fo rward pass, the guess for the i n i t i a l va lue is u p d a t e d u s ing N e w t o n ' s m e t h o d w i t h a n u m e r i c a l l y - d e t e r m i n e d J a c o b i a n m a t r i x tha t relates i n i t i a l costates to final costates. N e w t o n ' s m e t h o d can perhaps be rep laced b y some o ther o p t i m i z a t i o n m e t h o d . T h e success depends o n the closeness of the first guess to the t rue i n i t i a l va lue . 2.1.5 Numerical: Quasilinearization Q u a s i l i n e a r i z a t i o n [16] [18] breaks up the non l inea r o p t i m i z a t i o n i n t o a sequence of l inear o p t i m i z a t i o n p rob l ems . L i k e the v a r i a t i o n of ex t remal s m e t h o d , a first guess needs to be m a d e b u t i t is o n the state a n d costate t ra jector ies . T h e n a r educed ve r s ion of the state a n d costate equat ions are l i nea r i z ed about these t ra jector ies . These l i n e a r i z e d different ia l equa t ions are in t eg ra ted to fit the k n o w n values of i n i t i a l cond i t i ons . T h e i n i t i a l values of the costates f r o m th is so lu t ion is used to in tegra te the o r i g i n a l non l inea r equat ions fo rward . A set of s tate a n d costate t ra jector ies are o b t a i n e d a n d the who le process repeats . T h e success of th is m e t h o d aga in depends o n a good first guess. 2.1.6 Numerical: Steepest Descent T h e m e t h o d of Steepest Descent [16] begins by guessing a c o n t r o l t ra jec tory . Together w i t h the b o u n d a r y value o n the state, the state equat ions are in t eg ra t ed fo rward i n t i m e . 
T h e state t ra jec tor ies p r o d u c e d and the guessed con t ro l t ra jec tor ies enable the costate equat ions to be in teg ra ted b a c k w a r d . A f t e r th i s po in t , to o p t i m i z e the cost func t ion J we jus t need to o p t i m i z e the H a m i l t o n i a n w i t h respect to u a long the state a n d costate Chapter 2. Framework of Field 17 t ra jec tor ies [19] [20] [21]. In the l i t e ra tu re , the o p t i m i z a t i o n m e t h o d is steepest descent w i t h a l l i t s d r awbacks , bu t newer o p t i m i z a t i o n me thods m a y be subs t i t u t ed . 2.1.7 Numerical: Gradient Projection G r a d i e n t P r o j e c t i o n [16] [22] [23] avoids the two-po in t b o u n d a r y value p r o b l e m by con-v e r t i n g the o p t i m a l c o n t r o l p r o b l e m in to a non l inea r o p t i m i z a t i o n p r o b l e m where the cons t ra in t s are h a n d l e d d i r ec t ly , ra ther t h a n i n d i r e c t l y t h r o u g h the m e c h a n i s m of the costate equa t ions . T h e gradient p ro j ec t i on m e t h o d is so n a m e d because the gradient is p ro j ec t ed a long the cons t ra in t boundar ies . F o r systems w i t h convex cost func t ions a n d l inea r cons t ra in t s , the o p t i m a l po in t is found on the cons t ra in t boundar i e s . T h e n o n l i n -ear o p t i m a l c o n t r o l p r o b l e m has to undergo three t r a n s f o r m a t i o n before the non l inea r p r o g r a m m i n g t echn ique can be a p p l i e d . F i r s t , the s y s t e m is conve r t ed to d i sc re te - t ime . Second , the non l inea r d y n a m i c s is p iecewise l i nea r i zed . T h i r d , the dependence of the d y n a m i c s o n the states is r e m o v e d b y so lv ing the d i sc re te - t ime equa t i on i n t e rms of the c o n t r o l s ignals . G r a d i e n t p r o j e c t i o n is i t e ra ted w i t h the second a n d t h i r d t rans format ions to f i n d the o p t i m a l con t ro l t ra jec tory . T h e advantage of th is m e t h o d is i t s a b i l i t y to deal w i t h cons t ra in t s o n con t ro l a n d state t ra jector ies . T h e cons t ra in ts l i m i t the size of the search space. B u t s ince the o p t i m i z a t i o n is done d i r e c t l y i n the space of N step con t ro l vec tors , the c o m p l e x i t y of the search grows q u i c k l y w i t h the fineness of d i s c r e t i z a t i o n and the size of the c o n t r o l vectors . 2.1.8 Remarks on numerical methods T h e r e are m a n y rea l i s t i c p rob lems tha t can be so lved by the f o l l o w i n g n u m e r i c a l me thods . P r o b l e m s arise i f accura te non l inea r mode l s of the d y n a m i c s are not ava i lab le . These m e t h o d s m a y have p rob l ems w i t h convergence i n i n t eg ra t i ng the di f ferent ia l equat ions , p a r t i c u l a r l y the costate equa t ion . Chapter 2. Framework of Field 18 T h e r e are m a n y areas of poss ib le improvemen t s i n these techniques , m o s t o b v i o u s l y i n the choice of t he o p t i m i z a t i o n m e t h o d s a n d i n i t i a l guess se lec t ion m e t h o d s . Ident i f i -c a t i o n u s ing a p p r o x i m a t o r s m a y remove the need for p l an t mode l s expressed i n s y m b o l i c m a t h e m a t i c s . It s h o u l d be no t ed tha t these me thods are not m u t u a l l y exc lus ive of d y n a m i c pro-g r a m m i n g based n u m e r i c a l me thods and others to be covered la ter . 
It m a y be benef ic ia l to use h y b r i d approaches where some of these n u m e r i c a l m e t h o d s presented above can c o m p l e m e n t each other . 2.2 Stochastic Continuous-Time and -State T h e m a t h e m a t i c s of con t inuous - t ime s tochas t ic o p t i m a l con t ro l is qu i t e different f r o m the d e t e r m i n i s t i c t ype . T h e no ta t ions used i n mos t of the sections i n th i s t o p i c are f r o m one m a j o r t r a d i t i o n [24] [25]. 2.2.1 Problem definition T h e d e t e r m i n i s t i c sy s t em w h e n d i s t u r b e d b y w h i t e noise can be r e w r i t t e n i n th is s tochas t ic d i f ferent ia l equa t ion ( S D E ) m o d e l f o r m : where x denotes the state G Rn, wt is a vec tor B r o w n i a n m o t i o n , u : Rn —• Rm is a feedback c o n t r o l , a n d tr is an n x n m a t r i x . x = b(x, u) (2.11) dxt - b(xt, ut)dt + tr(xt)dwt (2.12) Chapter 2. Framework of Field 19 2.2.2 Analytical methods T h e o p t i m a l i t y c r i t e r i o n of a n a l y t i c a l me thods is s l i g h t l y different f r o m tha t i n n u m e r i c a l m e t h o d a n d so i t is desc r ibed separa te ly i n t h i s sec t ion . U n d e r several r e s t r i c t ive L i p s c h i t z cond i t ions , there is a u n i q u e s t rong so l u t i on w h i c h is a f u n c t i o n a l of a g i v e n driving B r o w n i a n m o t i o n [24]. U n d e r m u c h m i l d e r cond i t ions a weak so lu t i on i n a p r o b a b i l i t y space of s tate and d i s tu rbance exis ts . T h i s weak- so lu t ion concept a l lows m a n y t e c h n i c a l res t r ic t ions to be r emoved f r o m the theory. F o r a g iven c o n t r o l u, the cost f unc t i ona l genera l ly take the f o r m of rX J(u) = E i e-atc(xt,ut)dt + e-aT$(xT) Jo (2.13) where a > 0 is the d iscount factor , c(xt,ut) is the ins tan taneous cost , a n d $ ( * x ) the f ina l cost at t i m e T. 2.2.3 Analytical: Dynamic Programming F o r m a l a p p l i c a t i o n of B e l l m a n ' s p r i n c i p l e of o p t i m a l i t y [24] to E q . (2.12) a n d (2.13) leads to the B e l l m a n equa t i on + \ £ aS^C.") - «y('> •) + -Vj (E f£(*> v) + c(x, v)) = o dv dt »,j • •> \ / (2.14) V(T,x) = $(as) where a.j is the i,jth en t ry of the m a t r i x <r<rT. T h e B e l l m a n e q u a t i o n is a quas i - l inear p a r a b o l i c p a r t i a l d i f ferent ia l equa t i on for the value func t i on V(t, x) w i t h the b o u n d a r y c o n d i t i o n g i v e n at t = T. A large b o d y of theory exists for the quest ions of the exis tence of a so lu t i on to E q . (2.14) w i t h specif ied smoothness proper t ies , a n d of the co r respond ing o p t i m a l c o n t r o l [26] [27] [28]. Chapter 2. Framework of Field 20 2.2.4 Analytical: Martingale methods M a r t i n g a l e m e t h o d s [24] [29] [30] [31] are weak so lu t ion m e t h o d s tha t define a t y p e of m i n i m u m of the expec t ed cost E q . (2.13) a n d show tha t i t is a sub -mar t i nga l e [32]. It is poss ib le t o p rove t ha t t h i s m i n i m u m is a m a r t i n g a l e i f a n d o n l y i f t he c o n t r o l is o p t i m a l . V a r i o u s s t anda rd m a r t i n g a l e proper t ies a n d theorems are used to der ive a genera l m i n i m u m p r i n c i p l e a n d a m e t h o d for finding the o p t i m a l c o n t r o l . If the reader is no m o r e en l igh tened after th i s de sc r ip t i on , i t is because the m a t h e m a t i c s i n v o l v e d defy a concise s u m m a r y . 
T o o b t a i n a m o r e de ta i l ed s u m m a r y , see [24, page 3521]. T h e m a r t i n g a l e a p p r o a c h appl ies equa l ly w e l l t o n o n - M a r k o v i a n sys tems where the coefficients of the s y s t e m d y n a m i c s E q . (2.12) depends o n the past state t r a j ec to ry r a the r t h a n jus t the cur ren t s tate. 2.2.5 Analytical: Minimum Principles Stochas t i c m i n i m u m pr inc ip l e s can be d e r i v e d b y me thods analogous to the d e t e r m i n i s t i c case b y convex analys is [33] or by a p p l i c a t i o n of general e x t r e m a l theory [34] [24]. T h e m a i n conce rn is w i t h cha rac t e r i z ing a p r o b a b i l i s t i c ve r s ion of the P o n t r y a g i n costate d y n a m i c s . If « o is a n o p t i m a l M a r k o v i a n con t ro l , t h e n the d y n a m i c s of the costates [35] is desc r ibed by (d$ rT dc \ p(t,x) = -Et,x f ! _ ( « T ) t f ( r , t ) + j f ^[«„u°( 5 ,«.)]*(M)<fa) (2-15) where Ettx is the e x p e c t a t i o n g iven xt = x a n d is the f u n d a m e n t a l m a t r i x so lu t ion of the l i n e a r i z e d equa t i on (2.16) Chapter 2. Framework of Field 21 2.2.6 Numerical: Finite element approximation of the DP equation C o n t r o l l e d di f ferent ia l equat ions p e r t u r b e d b y a noise can be m o d e l e d as c o n t r o l l e d dif-fus ion processes [25]. T h e cond i t ions for o p t i m a l c o n t r o l are o b t a i n e d b y the d y n a m i c p r o g r a m m i n g m e t h o d . These cond i t ions are expressed as a non l inea r p a r t i a l d i f ferent ia l e q u a t i o n ca l l ed the H a m i l t o n - J a c o b i equa t ion c o n t a i n i n g the pa r t i a l s of y , the expec ted f in i t e -ho r i zon d i scoun ted in t eg ra l of the payoffs, w i t h respect to t i m e a n d the states. T h e so lu t i on of th is equa t ion gives y as a f unc t i on of t i m e a n d the states. In genera l , th i s e q u a t i o n cannot be so lved a n a l y t i c a l l y . If the d i m e n s i o n of the states is s m a l l , n u m e r i c a l f in i te-e lement m e t h o d s can be used to a p p r o x i m a t e i ts so lu t i on . In the f o r m u l a t i o n of so lu t ion m e t h o d , i t is a ssumed tha t the s y s t e m m o d e l is k n o w n , the noise m o d e l is k n o w n and belongs to a ce r t a in class, a n d the effect of the noise on the states is k n o w n . T h e r e is also a hypothes i s of s m o o t h d y n a m i c s . T h e a c t u a l f in i te e lement equat ions depend o n the user 's choice of 1) p a r t i t i o n of the t ime-s ta te space and 2) i n t e r n a l a p p r o x i m a t i o n s . T h i s dependence makes the i m p l e m e n t a t i o n of the m e t h o d c u m b e r s o m e . H o w e v e r , w i t h the r ight choice of i n t e r n a l a p p r o x i m a t i o n , a genera l ized H o w a r d a l g o r i t h m gives the so lu t i on . A convergence resul t is ava i lab le w h i c h shows tha t the a p p r o x i m a t e so lu t i on approaches the t rue so lu t ion as the e lement size goes to zero [25]. T h i s m e t h o d leads to g o o d n u m e r i c a l resul ts . T h e disadvantages of the m e t h o d seem to be 1) the r equ i rement of var ious mode l s and 2) the diff icul t t r a n s l a t i o n process f r o m the mode l s a n d des ign choices to the f ini te e lement equa t ions . It is not c lear tha t the second process can be a u t o m a t e d easily. 
Therefore, even if the system model and noise model can be identified, it would still be difficult to automate the entire design process.

No equation is presented in this summary because a large number of equations would have to be presented to adequately show the circuitous derivation of the finite element equations from the dynamic model, the noise model, and the design choices. Such a complete coverage of this method is beyond the scope of this thesis.

2.2.7 Numerical: Discretization of the stochastic control problem

A continuous-time and -state controlled diffusion process can be time-discretized if the system, noise, and cost models satisfy some very specific conditions [25] [36]. It can be shown that such optimal control problems have solutions. For each admissible diffusion process, the discrete-time process approximating it has a precise set of admissible state transition probabilities. A discrete-time dynamic programming equation can then be defined. The finite-horizon sum of the costs is computed iteratively backwards in time. The idea is to approximate the actual control problem by a Markov chain control problem. It can be shown that the probability law of the optimally controlled Markov chain converges to the law of the optimal diffusion as the discretization interval tends to zero [25] [36]. This convergence is sufficient to prove the convergence of the optimal discrete cost to the optimal continuous cost. In addition, there is a further step that can be taken to discretize the system in space [37].

This approach is useful for correctly discretizing a certain class of continuous stochastic control problems. Discretization is one possible way of solving such problems. As with the approach described in the previous section, this is also a dynamic programming method which is useful only when the dimension of the state is small.

2.3 Discrete-Time and Continuous-States

We want to find the optimal control for the system

x(t+1) = f(x(t), u(t), t) + e(t),

where x ∈ R^n is the state vector, u ∈ R^m is the action vector, and e ∈ R^n is the noise vector. f can be nonlinear. The elements of e have an unknown but time-invariant distribution. After u, calculated from some control law, is executed at time t, either some method or the system itself returns an immediate utility value U(t). The purpose is to find a control law which optimizes some criterion. If the system is deterministic, i.e., e(t) = 0 for all t, one type of optimality criterion is

J = \sum_{t=0}^{N} U(x(t), u(t)),   (2.17)

where N may be finite or N = ∞. If the system is stochastic, one type of optimality criterion is

J = E\left[ \sum_{t=0}^{N} U(x(t), u(t)) \right].   (2.18)
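For concreteness, the following short Python fragment sketches how the criteria of Eqs. (2.17) and (2.18) can be evaluated by simulation for a given control law: a single run accumulates the cost for the deterministic case, and a Monte Carlo average over runs estimates the expectation for the stochastic case. The dynamics f, utility U, control law, noise model, and the constants in the example are hypothetical placeholders introduced only for illustration; they are not drawn from the problems discussed in this thesis.

import random

def rollout_cost(f, U, control_law, x0, N, noise=None):
    """Evaluate Eq. (2.17) for one run: the sum of U(x(t), u(t)) over t = 0..N.
    f, U, control_law, and noise are placeholders for the problem at hand."""
    x, total = x0, 0.0
    for t in range(N + 1):
        u = control_law(x, t)
        total += U(x, u)
        e = noise() if noise else 0.0          # e(t) = 0 recovers the deterministic case
        x = f(x, u, t) + e
    return total

def expected_cost(f, U, control_law, x0, N, noise, runs=1000):
    """Monte Carlo estimate of Eq. (2.18): average the run costs over many rollouts."""
    return sum(rollout_cost(f, U, control_law, x0, N, noise)
               for _ in range(runs)) / runs

# Hypothetical one-dimensional example: drive x toward zero with a linear law.
f = lambda x, u, t: 0.9 * x + u
U = lambda x, u: x ** 2 + 0.1 * u ** 2
law = lambda x, t: -0.5 * x
print(rollout_cost(f, U, law, x0=1.0, N=20))
print(expected_cost(f, U, law, x0=1.0, N=20, noise=lambda: random.gauss(0.0, 0.01)))

The sketch also makes plain why testing a candidate control law this way is expensive: every rating of the controller requires whole trajectories, and many of them in the stochastic case.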
There appears to be no great difference between the optimal control approaches for deterministic and stochastic systems. If the plant is deterministic, there are two major alternatives for finding the best sequences:

1. Try all possible control sequences and choose the best one according to the total cost, which is available at the end of the sequence; and

2. Use dynamic programming (DP) to reduce the time it takes to find the best sequence.

Regardless of the particulars of the system (continuous or discrete in time or states), if we know that the mathematical model representing the controller can represent an optimal controller, then we can try to find the parameters of the mathematical model directly. The difficulty lies in the test to determine the optimality of any particular set of parameter values. With the control problem defined above, the controller would have to test a huge number of sequences of state trajectories to obtain an accurate rating of the particular controller. This is an unacceptably cumbersome test. Furthermore, if the structure of the optimal controller is not known beforehand, a universal structure (or approximator) would have to be used. Most approximators have a large number of parameters. Therefore, the search space can be vast. The first alternative is practical only for small problems. To tackle larger and/or stochastic problems, DP is the method of choice because it vastly reduces the complexity of the test mentioned above.

2.3.1 Analytical methods

The problem defined above can have an analytical solution using DP, if the system and the optimality criterion are simple enough. An important example is the linear quadratic regulator [16]. Once again, if the problem is more complex, analytical methods become intractable.

2.3.2 Numerical methods

Most of the methods under this section heading require, or could benefit from, function approximation techniques such as neural networks [38] [39] [40]. Function approximation enables the storage of high dimensional functions. The structure of the function approximator may also be useful for computing derivative information, which is useful for optimization. The approximation capability of neural networks [41] [42] [43] has resulted in many new control methods. The non-optimal neurocontrol methods have in common with RL control the capability to compute a control law and probably store it in a function approximator.
The method of obtaining the control law is the difference. Only optimal neurocontrol will be discussed in this thesis. The reader interested in learning more about neurocontrol in general can consult [44] [45] [46] [15] [47] [48] [49] [50].

2.3.3 Neural network optimization with the Minimum Principle

In (Fong et al. [51]), the tasks in the minimum principle are performed by using a neural network to represent the controller. Instead of optimizing in the space of N-step control vectors, the optimization is done with respect to the weights of the neural net. Together with automatic differentiation, the necessary conditions for optimal control in the minimum principle are used to compute the gradient of the Hamiltonian with respect to the weights. An implicit assumption is that the same control law is applied at all time steps. This treats the finite-time problem as if it were infinite-time and eliminates the explosion of possible control trajectories. It is not known whether this treatment would yield good results in all cases. Good performance was obtained for the inverted pendulum problem. The principal difficulty is the requirement for a good plant model for the purpose of computing the partials of the dynamics with respect to the weights.

2.3.4 Reinforcement Learning

Reinforcement Learning (RL) has long been recognized to be different from supervised and unsupervised learning [52]. In a typical reinforcement learning situation, the learner is not provided instructions on the right actions or outputs to take, but there is a teaching signal which is an evaluation of the learner's actions or outputs. When RL was applied to sequential decision tasks, it became apparent that it is a method of self-learning optimal control for very general systems. But RL is more than simply self-learning optimal control. The generality of the system, state, and action descriptions and RL's ability to learn the right action without instructions mean it has tremendous relevance to machine intelligence. The scope of self-learning control for nonlinear and/or discrete systems is much larger than optimal control for linear systems. Some of the applicable problems are

1. Continuous dynamic systems, e.g., walking machines,

2. Discrete or higher cognitive level systems, e.g., learning artificial intelligence machines, and

3. Hybrid (low/high level and continuous/discrete state) systems, e.g., learning survival machines.

The term low level refers to control using information closer to the finest temporal and spatial resolution of states. The term high level refers to control at a coarser resolution.
2.3.5 Reinforcement Learning: Policy Iteration

Reinforcement learning can be divided into two main classes: policy iteration and value iteration. The difference between them is that policy iteration effectively evaluates one policy at a time by iterating over policies, and value iteration effectively evaluates all policies at the same time by iterating over the evaluations (or values) of policies. Many algorithms deviate from these compact generalizations.

Policy iteration is optimization at one point in the policy space at a time. It is a 2-step method which is best explained by looking at non-incremental versions of policy iteration. The term non-incremental means that each of the 2 steps is completed before the other step is done. In non-incremental versions, the 2 steps are as follows: in the 1st step, the policy is fixed while the learning system computes an evaluation of this policy; in the 2nd step, this evaluation is used to compute or suggest a new policy to try. This is much like optimization of functions. This 2-step method of optimizing a controller is certainly preferable to directly optimizing the controller parameters and using some other means to test each control law thoroughly. The incremental versions do not wait until an evaluation is complete before the next policy is tried. The evaluation and optimization processes take place concurrently. The convergence of the evaluator and the controller adapting concurrently is just beginning to be explored [53]. The description of value iteration will be left to Subsection 2.3.9. The rest of this section is devoted to policy iteration.

[Figure 2.1: The basic adaptive critic control system. A block diagram in which the Controller sends Actions to the Plant, the Plant returns States, and the adaptive critic receives the States and the Primary Reinforcement and outputs the Approximate Secondary Reinforcement.]

In Figure 2.1, the controller and adaptive critic are usually represented by approximators, commonly feedforward neural networks. The 2-net adaptive critic system (also called an actor-critic system) in Figure 2.1 learns to approximate a total cost or performance criterion (secondary reinforcement) at the current state that is inherent to the control problem. At each time step, the Adaptive Critic (AC) is presented the state and the primary reinforcement. The primary reinforcement can be of at least two types: a binary performance indicator and an immediate cost. The true secondary reinforcement is a probabilistically and temporally weighted summation of primary reinforcement sequences, e.g., Eq. 2.17 and 2.18.
The parameters in the adaptive critic are changed via supervised learning to reduce the difference between the approximate secondary reinforcement and the true secondary reinforcement. With time, the output of the adaptive critic is supposed to closely approximate the true secondary reinforcement. While the AC is being trained, the controller is optimized in some manner. Thus, it is advantageous if the AC is able to learn the true mapping quickly. There is a wide variety of optimization techniques available, including local optimization techniques such as those presented in [54] and global optimization techniques such as stochastic optimization [55] and genetic algorithms [56]. The exact manner of training the controller depends on a number of factors, one of which is the form of the AC. Another is how the approximate secondary reinforcement affects the parameters of the controller. In Sections 2.3.6 to 2.3.8, methods of adapting the AC and the controller will be described.

The actor-critic system is more general than low level optimal control. The states can be high level descriptions as found in artificial intelligence (AI). The action could also simply be a high level description of actions and activities internal to the system, for example, a code for different conventional control schemes. That code could indirectly specify a large and highly time-varying action vector.

If the controller can be adapted such that the output of the adaptive critic is at an optimum for all combinations of goals and states, the controller would be optimal for the given performance criterion. The controllers generated by the adaptive critic methods are intended to achieve this objective: perform the job of a conventional controller (tracking and stabilizing) while optimizing an externally imposed criterion. The stability is embedded in the criterion. Unlike a conventional optimal controller, where the control law is derived analytically from the dynamics, the noise models, and the performance criterion, AC methods optimize a performance measure by learning from its values during run time. AC methods are suitable for certain types of problems in control. The following are some examples (a schematic sketch of the overall arrangement follows the list):

• most generally, to find control and state trajectories which would optimize a criterion,

• to find the best control trajectory (according to some criterion) to take the state from one point to another,

• to keep the states out of certain domains of the state space or, conversely, to take the states to a certain domain, and

• to optimize a criterion which is evaluable only at the end of a sequence.
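To make the two-mapping arrangement of Figure 2.1 concrete, the following Python sketch runs a minimal tabular actor-critic on a hypothetical five-state chain task: the critic is updated toward a one-step evaluation target, and the action preferences (the controller) are adjusted by the same error. This is only a schematic illustration of the incremental form of policy iteration described above; the chain task, the softmax action selection, and the learning constants are assumptions made for the example and do not correspond to any specific algorithm from the cited literature.

import math, random

N_STATES, ACTIONS = 5, (0, 1)                 # state 4 is terminal; action 1 moves right
V = [0.0] * N_STATES                          # critic: approximate secondary reinforcement
pref = [[0.0, 0.0] for _ in range(N_STATES)]  # controller: action preferences per state
alpha_v, alpha_p, gamma = 0.1, 0.1, 0.95

def act(s):
    """Softmax over preferences; actions with higher accumulated cost become less likely."""
    z = [math.exp(-p) for p in pref[s]]
    r, cum = random.random() * sum(z), 0.0
    for a in ACTIONS:
        cum += z[a]
        if r <= cum:
            return a
    return ACTIONS[-1]

def step(s, a):
    """Plant: move left or right on the chain; every transition incurs a cost of 1."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, 1.0

for episode in range(2000):
    s = 0
    while s != N_STATES - 1:
        a = act(s)
        s2, cost = step(s, a)
        target = cost + (0.0 if s2 == N_STATES - 1 else gamma * V[s2])
        delta = target - V[s]                 # evaluation error for the current policy
        V[s] += alpha_v * delta               # 1st step: improve the evaluation
        pref[s][a] += alpha_p * delta         # 2nd step: adjust the policy away from costly actions
        s = s2

print([round(v, 2) for v in V])               # roughly the discounted cost-to-go of the learned policy

Because both updates use the same error signal within one pass through the data, the sketch corresponds to the incremental version in which evaluation and policy improvement proceed concurrently.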
2.3.6 Basic TD(λ)

If the primary reinforcement is available only when the system reaches some terminal state, we can use the TD(λ) family of algorithms [10] [12]. The TD(λ) algorithms seem to be a type of prediction algorithm that converges quickly. The mapping used is applicable to nonlinear systems. An example of a problem with such a primary reinforcement is the inverted pendulum problem in [8] and [9]. In those papers, a variant of the TD(λ) algorithm is used to adapt the ACE net of the ACE/ASE 2-net system.

The purpose of the algorithm by itself is to update the weights of a predictor, given multiple and different sequences of states ending in a final primary reinforcement z, such that the predictor can learn to predict z with as few sequences as possible. P_t = P_t(x, W) is the prediction of z at time t, where x is the state vector of a system and W is the weight vector of the adaptive critic. At each time step t, a weight change vector is computed as follows:

\Delta W_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_W P_k,   0 \le \lambda \le 1.

W is changed to W + \sum_{t=1}^{m} \Delta W_t at the end of a complete run, where m is the total number of time steps to reach the end. The rationale for the algorithm can be summarized as follows:

1. At each time step t, in the fashion of typical supervised-learning algorithms, the correct weight change is \Delta W_t = \alpha (z - P_t) \nabla_W P_t.

2. The prediction error can be divided up as follows:

z - P_t = \sum_{k=t}^{m} (P_{k+1} - P_k),

where P_{m+1} = z.

3. With some algebra, we can see

W + \sum_{t=1}^{m} \alpha (z - P_t) \nabla_W P_t = W + \sum_{t=1}^{m} \alpha \sum_{k=t}^{m} (P_{k+1} - P_k) \nabla_W P_t   (2.19)
= W + \sum_{k=1}^{m} \alpha \sum_{t=1}^{k} (P_{k+1} - P_k) \nabla_W P_t   (2.20)
= W + \sum_{t=1}^{m} \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_W P_k   (2.21)

Note that m is a free parameter which depends on how long it takes to reach a terminal state in a particular run.

The above is a deterministic explanation. See [57] and [58] for a stochastic explanation. Proofs of the convergence and optimality of linear TD(0) (i.e., λ = 0) are available [10]. Also, Dayan [57] and then Jaakkola et al. [59] recently proved convergence for the general case of TD(λ). Convergence with probability 1 is proved for the discrete case and for linear representations [58].

The formulation above does not explicitly address control. TD(λ) is equivalent to Q-Learning (to be presented later) if the control law is fixed. With the 2-net design, two controllers [10] [12] based on this algorithm were successful in learning to control an inverted pendulum. The idea is to get a quick estimate of the evaluation of a control law and then use some kind of optimization method to better it. It may be able to handle servo problems, but this is not addressed in [10]. This algorithm has also been tested on a backgammon game with impressive results [60] [61].
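The update rule above can be stated compactly in code for a linear predictor with a one-of-n state encoding, in which case \nabla_W P_k is simply the feature vector of the state visited at step k and the inner sum can be maintained as a decaying trace. The following Python sketch applies it to a hypothetical five-state random walk whose terminal outcome z is 0 or 1; the task and the constants are illustrative only and are not the pendulum or backgammon experiments cited above.

import random

N, alpha, lam = 5, 0.05, 0.8
W = [0.5] * N                                   # weight vector of the linear predictor

def features(state):
    x = [0.0] * N
    x[state] = 1.0                              # one-of-n encoding, so P_t = W[state]
    return x

def run_sequence():
    """One random-walk sequence from the middle state; z = 1 if it exits right, 0 if left."""
    seq, s = [], N // 2
    while 0 <= s < N:
        seq.append(s)
        s += random.choice((-1, 1))
    return seq, (1.0 if s >= N else 0.0)

for _ in range(2000):
    seq, z = run_sequence()
    trace = [0.0] * N                           # running sum_k lambda^(t-k) grad_W P_k
    delta_W = [0.0] * N                         # weight changes gathered over the run
    for t, s in enumerate(seq):
        x = features(s)
        trace = [lam * e + xi for e, xi in zip(trace, x)]
        P_t = W[s]
        P_next = z if t + 1 == len(seq) else W[seq[t + 1]]   # P_{m+1} = z at the terminal step
        err = P_next - P_t
        delta_W = [d + alpha * err * e for d, e in zip(delta_W, trace)]
    W = [w + d for w, d in zip(W, delta_W)]     # weights changed only at the end of the run

print([round(w, 2) for w in W])                 # should approach 1/6, 2/6, ..., 5/6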
2.3.7 Adaptive Heuristic Critic

Adaptive Heuristic Critic (AHC) [13] is a variant of the basic TD(λ) algorithm. The connection with TD(λ) is best illustrated in [10, pages 39-40]. It is appropriate if the primary reinforcement is observable in each time step, instead of a single primary reinforcement at the end of a run. More precisely, if some process generates costs c_{t+1} at each state transition from t to t+1, we may want P_t to predict the discounted sum

\sum_{k=0}^{\infty} \gamma^{k} c_{t+k+1},

where γ, 0 ≤ γ < 1, is the discount-rate parameter. Then the weight update rule given is

\Delta W_t = \alpha (c_{t+1} + \gamma P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_W P_k.   (2.22)

2.3.8 HDP (with A_{r-p} or gradient methods)

The idea of using adaptive critics to approximate dynamic programming was first extensively covered by Werbos. He described a class of algorithms called Heuristic Dynamic Programming (see Section 2.3.8) [14]. A proof of the consistency of the algorithm for one particular problem is available. Werbos claims that HDP is a generalization of TD(λ). Since neither the control u nor a goal are inputs to the adaptive critic, the prediction J is really the equivalent of f_k(i) in Section 2.4.6. This algorithm has been successfully applied to the manufacture of composite materials [62], a difficult control problem. The consistency (convergence of the prediction to the correct value) of HDP has been proven for its application to a simple reinforcement learning problem [14].

[Figure 2.2: The critic system as adapted by HDP. Two copies of the adaptive critic map x(t) to J(t) and x(t+1) to J(t+1); J(t+1), the cost U(t), and the discount factor γ form the training target for J(t).]

Heuristic Dynamic Programming (HDP) [63] is a method of training an adaptive critic. The critic system is illustrated by Figure 2.2. The reinforcement c may be available continuously, unlike that assumed in the TD(λ) family of algorithms. The focus of HDP is not on the function of the overall system but on the adaptation of the adaptive critic. Here, the critic is supposed to learn, given the current control law, a prediction of the long term cost based on the current state. Training is done in the adaptation phase of pass n. The output target of the critic for time t, within pass n, is

J(x(t+1), W^{n-1}) + c(x(t)),

where x is the state and W^{n-1} is the vector of weights of the adaptive critic at pass n-1. It should be mentioned that c may have an implicit c_0 that may cause divergence. There are methods to deal with this [14].
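The pass-based construction of the critic target can be sketched as follows, assuming a critic J(x, W) that is differentiable in its weights. The trajectory, the cost function c, the linear critic form, and the learning rate below are hypothetical stand-ins; only the use of the previous pass's weights W^{n-1} inside the target J(x(t+1), W^{n-1}) + c(x(t)) follows the description above, and the divergence issues mentioned in the text are ignored.

# Schematic of one HDP adaptation pass with a frozen copy of the previous pass's weights.
def hdp_pass(trajectory, c, J, grad_J_W, W_prev, W, lr=0.01):
    """trajectory: list of states x(0..T); W_prev: weights frozen from pass n-1."""
    for t in range(len(trajectory) - 1):
        x, x_next = trajectory[t], trajectory[t + 1]
        target = J(x_next, W_prev) + c(x)          # critic output target for time t
        err = target - J(x, W)
        W = [w + lr * err * g for w, g in zip(W, grad_J_W(x, W))]  # supervised step
    return W

# Hypothetical linear critic on a scalar state: J(x, W) = W[0] + W[1]*x.
J = lambda x, W: W[0] + W[1] * x
grad_J_W = lambda x, W: [1.0, x]
c = lambda x: x * x
traj = [1.0, 0.8, 0.5, 0.2, 0.0]                   # an illustrative recorded state trajectory
W = [0.0, 0.0]
for n in range(50):                                # passes; targets always use the previous pass's weights
    W_prev = list(W)
    W = hdp_pass(traj, c, J, grad_J_W, W_prev, W)
print([round(w, 3) for w in W])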
While the adaptive critic is being trained, the controller can be adapted by the A_{r-p} [64] algorithm. If the critic is implemented with a continuous feedforward network, the gradient of J(t) with respect to x(t) can be computed by the dual form [65, Section 10.6] of the backpropagation algorithm (dual backpropagation for short) and made available to the controller. If the critic mapping is implemented by a different type of function approximator but is still continuous, the same may still apply. The system of passing information from the critic to the controller is called a Backpropagated Critic. The backpropagated critic, a relatively new method, eases the bottleneck presented by the older A_{r-p} algorithm, which passes only a scalar value to the controller and uses it to adjust the controller weights in a manner similar to the mechanism in Stochastic Learning Automata [32].

HDP tends to break down, through very slow learning, as the size of the problem grows bigger [66]. To get around this, Werbos and his group have proposed substantially more complex forms of HDP, called Dual Heuristic Programming (DHP) [67] and Globalized DHP [68]. They make extensive use of dual backpropagation in moving information around more complex systems of mappings. The general idea is to use the critic to approximate the gradient of J with respect to x.

2.3.9 Reinforcement Learning: Value Iteration

The two-mapping systems in policy iteration make it clumsy to compute the optimal actions because of the lack of an action specification in the inputs to the adaptive critic. The Action Dependent Adaptive Critic (ADAC) and a more complex form of ADAC proposed by Werbos [66] can overcome this problem.

[Figure 2.3: The overall ADAC system. The action-dependent critic outputs J(t); its training target is J(t+1) + U(t).]

2.3.10 ADAC or ADHDP

ADAC is also called Action Dependent Heuristic Dynamic Programming (ADHDP). It is the continuous-state version of Q-Learning, to be presented later. Figure 2.3 shows the structure of the ADAC system. J(t+1) + U(t) is the target output to be learned by the critic. The partial derivative of J with respect to u is computed by dual backpropagation and used to change the weights in the controller by moving along the gradient. A complex form of ADAC has the critic approximating the partial derivatives of J with respect to u and x. This form is called Action Dependent Dual Heuristic Programming (ADDHP) [67]. Once again, it makes extensive use of dual backpropagation. At present there is little collective experience in DHP, GDHP, and ADDHP.
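A minimal sketch of the ADAC arrangement follows, assuming a scalar plant, a quadratic action-dependent critic, and a linear control law u = -k x. Only the target J(t+1) + U(t) and the use of dJ/du to adjust the controller follow the description above; everything else (the plant, the cost, the exploration noise, and the learning rates) is a hypothetical choice made so that the fragment runs.

import random

f = lambda x, u: 0.9 * x + u                       # hypothetical scalar plant
U = lambda x, u: x * x + 0.1 * u * u               # immediate cost U(t)

# Critic J(x, u, W) = W[0]*x*x + W[1]*u*u + W[2]*x*u, a simple quadratic form.
J = lambda x, u, W: W[0] * x * x + W[1] * u * u + W[2] * x * u
dJ_dW = lambda x, u, W: [x * x, u * u, x * u]
dJ_du = lambda x, u, W: 2.0 * W[1] * u + W[2] * x  # what dual backpropagation would supply

W = [0.0, 0.0, 0.0]
k = 0.0                                            # controller gain, u = -k*x
lr_c, lr_a = 0.05, 0.01

for episode in range(300):
    x = 1.0
    for t in range(20):
        u = -k * x + random.gauss(0.0, 0.1)        # exploration noise excites the action inputs
        x_next = f(x, u)
        u_next = -k * x_next
        target = J(x_next, u_next, W) + U(x, u)    # J(t+1) + U(t), the critic's target output
        err = target - J(x, u, W)
        W = [w + lr_c * err * g for w, g in zip(W, dJ_dW(x, u, W))]
        dJ_dk = dJ_du(x, u, W) * (-x)              # chain rule through u = -k*x
        k -= lr_a * dJ_dk                          # move the controller down the gradient of J
        x = x_next

print(round(k, 2), [round(w, 2) for w in W])

Because the action is an explicit input to the critic, the controller improvement step needs nothing beyond the derivative of the critic output with respect to that input, which is the point of the action-dependent formulation.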
2.4 Discrete-Time and -State

Discrete-state systems can be thought of as supersets of continuous-state systems. Although the number of states is usually assumed to be finite, the discreteness allows for arbitrary levels of nonlinearity. The discreteness also allows for a greater variety of solution methods coming from different lines of scientific inquiry.

2.4.1 Markovian Decision Tasks

A most widely studied class of discrete-state discrete-time systems is the Markovian Decision Task. A Markovian decision task is defined in terms of a discrete-time stochastic dynamic system with finite state set S = {1, ..., n}. At each time step, a controller observes the system's current state s_t = i and generates a control u from a finite set of admissible actions U(i), which is applied as input to the system. The system's state at the next time step will be s_{t+1} = j with the state-transition probability p_{ij}(u). The application of action u in state i incurs an immediate cost c_i(u). The goal is to find either a control policy μ which optimizes the expected total infinite-horizon discounted cost

E\left[ \sum_{t=0}^{\infty} \gamma^{t} c_{s_t}(\mu(s_t)) \right],

where 0 ≤ γ < 1 is the discount rate, or a control policy vector {μ_0, ..., μ_{N-1}} which optimizes the expected finite-horizon discounted cost

E\left[ \sum_{t=0}^{N-1} \gamma^{t} c_{s_t}(\mu_t(s_t)) \right],

where 0 ≤ γ ≤ 1. In the latter case, the elements of the policy vector may be constrained to be identical.

2.4.2 Search methods

To illustrate the connection between control and higher level problems such as planning, a description of the (deterministic) optimal control problem follows:

1. The states change according to a set of rules (the dynamic equations);

2. The change of the states can be influenced by the inputs;

3. The states can be pushed from one point to another by an appropriate sequence of inputs; and

4. Out of the set of all allowable input sequences, one or more will be optimal in some sense.

Markovian decision tasks are analogous to the control of typical AI problems such as chess games. In chess, the possible future states expand in a tree structure. At t = 0, the player making a particular move will take the state to a new state. The opponent's moves are unknown and can be thought of as being stochastic. After the opponent's move at t = 1, the possible states expand to a large but not complete subset of all possible states. The moves of the two players are alternated and the possible states expand in time. If we think of the consequence of both players' moves from the perspective of the first player, his move generates a large number of possible states. Each of these states has a certain probability of being reached. This system is then exactly the same as the dynamic systems defined. Therefore the search techniques used to solve these kinds of problems can be used to solve control problems.
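Before turning to the learning algorithms, it may help to fix a concrete encoding of a Markovian decision task: the state-transition probabilities p_ij(u) and immediate costs c_i(u) of Section 2.4.1 can be held in simple tables, and the expected discounted cost of a fixed policy (the evaluation function f^μ defined in Section 2.4.6) can then be computed by repeated sweeps. The three-state task, its numbers, and the chosen policy below are purely illustrative.

# A purely illustrative encoding of a Markovian decision task.
n = 3                                     # states 0, 1, 2;  state 2 is absorbing
U = {0: [0, 1], 1: [0, 1], 2: [0]}        # admissible actions per state
p = {                                     # p[i][u][j] = Pr(next state = j | state i, action u)
    0: {0: [0.8, 0.2, 0.0], 1: [0.1, 0.9, 0.0]},
    1: {0: [0.3, 0.6, 0.1], 1: [0.0, 0.2, 0.8]},
    2: {0: [0.0, 0.0, 1.0]},
}
c = {0: {0: 2.0, 1: 4.0}, 1: {0: 1.0, 1: 3.0}, 2: {0: 0.0}}
gamma = 0.9

def policy_cost(mu, sweeps=200):
    """Iteratively evaluate f^mu(i), the expected infinite-horizon discounted cost of the
    fixed policy mu, via f(i) <- c_i(mu(i)) + gamma * sum_j p_ij(mu(i)) * f(j)."""
    f = [0.0] * n
    for _ in range(sweeps):
        f = [c[i][mu[i]] + gamma * sum(p[i][mu[i]][j] * f[j] for j in range(n))
             for i in range(n)]
    return f

mu = [1, 1, 0]                            # an arbitrary stationary policy
print([round(v, 2) for v in policy_cost(mu)])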
2.4.3 Reinforcement Learning

Although the algorithms to be presented in the following sections are formulated for discrete-state systems, most researchers working on discrete-state algorithms intend to also apply them to continuous systems. The discrete-state formulation offers greater theoretical tractability. Some of these methods have counterparts which work on continuous systems, as covered in Sections 2.3.5 and 2.3.9.

2.4.4 Reinforcement Learning: Policy Iteration

Some of the Policy Iteration methods [69] to be presented in this section are discrete-state versions of the ideas presented in Section 2.3.5.

2.4.5 Basic Adaptive Critics

The basic adaptive critic system for continuous-state plants shown in Figure 2.1 is applicable to discrete-state plants also. The difference is in the adaptation algorithms for the adaptive critic (AC) and the controller. The adaptation algorithms for the AC include the discrete-state version of the TD(λ) algorithm and the DP algorithms to be presented in the following sections.

2.4.6 DP in general

Most of the material on DP algorithms to be introduced is summarized from [70]. The reader is encouraged to refer to this paper for a thorough coverage. Although DP was mentioned in Section 2.2.3, it is covered in depth here, because the discrete-time and -state formulation is much clearer.

Carrying on from Section 2.4.1, DP attempts to find an accurate numerical evaluation of the optimal control policy without knowing this policy in advance. Denote the control policy as μ. Then, for any control policy μ and state i, f^μ(i) is defined to be the expected total infinite-horizon discounted cost that will be incurred, given that the controller uses control policy μ and i is the initial state:

f^\mu(i) = E_\mu\left[\sum_{t=0}^{\infty} \gamma^t c_{s_t}(\mu(s_t)) \,\middle|\, s_0 = i\right],

where 0 ≤ γ ≤ 1 is a discount factor, and E_μ is the expectation assuming the controller always uses control policy μ. For simplicity, f^μ(i) is abbreviated to f(i).

In Sections 2.4.7 to 2.4.14, DP algorithms are presented in decreasing rigidity in their updating order. It should be remembered that they are all applicable to the training of an adaptive critic.

2.4.7 Synchronous DP (SDP)

The most rigid form of DP is Synchronous DP [70], which is the locked-step computation of the f(i) values using a complete model of the state transition probabilities and costs. The following equation is effectively executed simultaneously for all states:

f_{k+1}(i) = \min_{u \in U(i)} \Big[ c_i(u) + \gamma \sum_{j \in S} p_{ij}(u) f_k(j) \Big], \quad \forall i \in S, \qquad (2.23)

where k is the index of the update stage at which all f-values are updated. No values of f_{k+1} appear on the right-hand side of the equation. The application of this update equation for state i is referred to as backing up i's cost. We have these Convergence Results [70]:
1. If γ < 1 (the discounted case), repeated synchronous updates produce a sequence of functions that converges to the optimal evaluation function, f*, for any initial approximation. Although the cost of a state need not get closer to its optimal cost on each iteration, the maximum error between f_k(i) and f*(i) over all states i must decrease. The synchronous iteration is a contraction mapping with respect to the maximum norm and has f* as its unique fixed point.

2. If γ = 1 (the undiscounted case), to avoid infinite f-values, one or more states must have properties which distinguish them as terminal states. The set of terminal states is called the terminal set. Once the system enters a terminal state, it can move to a state in the terminal set only. A policy which gives the system a non-zero probability of entering a terminal state starting from any state is called a proper policy. Given the above definitions, this algorithm converges [69] [71] to f* in stochastic optimal path problems under the following conditions:

• The initial cost of every terminal state is zero;
• There is at least one proper policy; and
• All policies that are not proper incur infinite cost for at least one state.

2.4.8 Relaxed forms of SDP

DP does not need to be performed in such a rigid manner as SDP. The backups can be done in different orders and even partially in each stage. There are numerous reasons for this. The important reason for the purpose of this thesis is that this type of relaxed backup is a bridge to a whole new class of DP algorithms, namely incremental DP.

In Gauss-Seidel DP (GSDP) [70], the costs are backed up one state at a time in a sequential sweep of all the states, with the computation for each state using the most recent costs of the other states. This algorithm converges under the same conditions as those in SDP. GSDP tends to converge faster than SDP, because each backup uses the latest costs of the other states. The ordering of the backups also has an influence.

In Asynchronous DP (ADP) [70], at each stage k, the costs of a subset of the states are backed up synchronously, and the costs remain unchanged for the other states. That subset varies from stage to stage. The choice of these subsets distinguishes different ADP algorithms. The convergence condition for discounted ADP is that the cost of each state is backed up infinitely often. In the undiscounted case, ADP converges under these conditions [71]:

• The initial cost of every terminal state is zero.
• There is at least one proper policy. A policy is proper if its use implies a nonzero probability of eventually reaching the terminal set starting from any state. Using a proper policy also implies that the terminal set will be reached eventually from any state with probability one.
• All immediate costs incurred by transitions from non-terminal states are positive, i.e., c_i(u) > 0 for all non-terminal states i and actions u ∈ U(i).

Note that a single backup of a state's cost in ADP does not necessarily improve it as an estimate of the state's optimal cost; it may in fact make it worse. The ordering is again important, as in GSDP.

2.4.9 DP interleaved with control

This section covers algorithms in which the controller performs DP on-line during control, with the computational steps of an off-line DP algorithm interleaved with the generation of actions. It is assumed that a complete and accurate model of the Markovian decision task is known. Interleaving DP with on-line control means that during the time intervals of the decision problem, the controller executes some finite number of stages of ADP. Control decisions are based on the most recent evaluation function. An example is Real-Time DP (RTDP) [70]. In RTDP, the policy is the so-called greedy policy from the evaluation function f_{k_t}, where k_t is the total number of updates of the f function completed up to time t. The greedy policy is the set of actions which minimize the costs computed from f_{k_t}. Since the f function does not have u as a parameter, the policy would have to be chosen and stored at the time that f_{k_t} is computed.

If the state-transition probabilities p_ij(u) or the immediate costs are unknown, then under this section heading we have the choices of Bayesian and Non-Bayesian methods.

Bayesian methods [70] rest on an assumption of a known a priori probability distribution over the class of possible dynamic systems. This distribution is successively revised using Bayes' rule, and at each time step an action is selected by using DP to find a policy that minimizes the expected costs over the set of possible systems as well as time. To do this, DP is applied to a decision problem involving a system whose state consists of an entire posterior distribution over the set of possible systems. The computations required can be prohibitive in all but the simplest cases.

Non-Bayesian methods attempt to find the optimal policy asymptotically for any system within some pre-specified class of systems. These are a type of self-learning control scheme. It is useful to divide these self-learning control methods into the following classes [72]:

1. Indirect methods. These rely on the on-line formation of an explicit model of the dynamic system being controlled, as well as estimates of the immediate costs if the latter are also unknown.
2. Direct methods. These determine a policy without forming an explicit system model.

2.4.10 Adaptive Real-Time DP

This indirect method is exactly the same as RTDP except that

1. a system model is updated using some on-line system identification method, such as the following method for estimating the maximum-likelihood system model:

\hat{p}_{ij}(u) = \frac{n^u_{ij}(t)}{n^u_i(t)},

where n^u_ij(t) is the observed number of times before time step t that action u was generated when the system was in state i and made a transition to state j, and n^u_i(t) = \sum_{j \in S} n^u_{ij}(t) is the number of times action u was generated in state i. If the c_i(u)'s are also unknown, they can be determined by memorizing them as they are observed;

2. the current system model is used in performing the stages of real-time DP instead of the true system model; and

3. the action at each time step is determined by some randomized policy. The degree of randomness is high at the beginning of learning and is gradually lowered later.

A possible disadvantage of this method is that the formation of the model may be more difficult than finding the control policy directly. Obviously this depends on the system in question.

2.4.11 Reinforcement Learning: Value Iteration

Some of the methods to be presented in this section are discrete-state versions of the ideas presented in Section 2.3.9.

2.4.12 An overall view of Q-Learning

The most important algorithm with a strong theoretical foundation for the purpose of self-learning optimal control is a recently invented algorithm type called Q-Learning. It is formulated for the Markovian decision task. Carrying on from Section 2.4.6, Watkins [12] defined a new notation, Q. For each state i and control u ∈ U(i), let

Q^f(i, u) = c_i(u) + \gamma \sum_{j \in S} p_{ij}(u) f(j). \qquad (2.24)

Q^f(i, u) is the cost of action u in state i as evaluated by f. It is the sum of the immediate cost and the discounted expected value of the costs of the possible successor states j under control u. Referring back to the distinctions made in Section 2.4.9, this family of usually direct methods estimates the optimal Q-values for all pairs of states and admissible actions. Q*(i, u), the optimal Q-value for state i and action u ∈ U(i), is the cost of generating action u in state i and thereafter following an optimal policy. After the estimates Q^f(i, u) have converged to Q*(i, u), any greedy action with respect to the optimal Q-values for a state is an optimal action. In the following sections, specific Q-Learning algorithms will be presented.
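As a concrete reading of the synchronous backup of Eq. 2.23 and the Q-values of Eq. 2.24, the sketch below performs repeated synchronous sweeps over a known model and then extracts a greedy policy from the resulting f. It assumes the hypothetical MDP container sketched at the end of Section 2.4.2 and is not a reproduction of any implementation from this thesis.

```python
def synchronous_dp(mdp, n_stages):
    """Synchronous DP: back up every state's cost at each stage,
    f_{k+1}(i) = min_u [ c_i(u) + gamma * sum_j p_ij(u) f_k(j) ]   (Eq. 2.23)."""
    f = [0.0] * mdp.n_states
    for _ in range(n_stages):
        f_next = [min(backed_up_cost(mdp, i, u, f) for u in mdp.actions[i])
                  for i in range(mdp.n_states)]
        f = f_next                 # no f_{k+1} values appear on the right-hand side
    return f

def backed_up_cost(mdp, i, u, f):
    """Q^f(i, u) = c_i(u) + gamma * sum_j p_ij(u) f(j)   (Eq. 2.24)."""
    expected_next = sum(p_j * f_j for p_j, f_j in zip(mdp.p[(i, u)], f))
    return mdp.c[(i, u)] + mdp.gamma * expected_next

def greedy_policy(mdp, f):
    """Greedy policy with respect to f: pick the action minimizing Q^f(i, u)."""
    return [min(mdp.actions[i], key=lambda u: backed_up_cost(mdp, i, u, f))
            for i in range(mdp.n_states)]
```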
2.4.13 Off-Line Q-Learning

Instead of maintaining an estimate of the optimal evaluation function, f*, as is done by all the methods described above, Q-Learning maintains an estimate of the optimal Q-values for each state and admissible action. Since f* is the minimum of the optimal Q-values for each state, the Q_k's implicitly define f_k, a stage-k approximation of f*, defined as follows:

f_k(i) = \min_{u \in U(i)} Q_k(i, u). \qquad (2.25)

At each stage k, Off-Line Q-Learning synchronously updates the Q-values of a subset of the state-action pairs and leaves unchanged the Q-values for the other state-action pairs. For each k = 0, 1, ..., let the update set S_k ⊂ {(i, u) | i ∈ S, u ∈ U(i)} denote the set of state-action pairs whose Q-values are updated at stage k. Then Q_{k+1} is computed as follows:

1. If (i, u) ∈ S_k,

Q_{k+1}(i, u) = (1 - \alpha_k(i, u)) Q_k(i, u) + \alpha_k(i, u) \big[ c_i(u) + \gamma f_k(\mathrm{successor}(i, u)) \big],

where 0 ≤ α_k(i, u) ≤ 1 denotes the learning rate and j = successor(i, u) is a randomly generated next state according to the system's transition probabilities. The role of the successor function is played by the system itself in Real-Time Q-Learning, described in the next section.

2. If (i, u) ∉ S_k, Q_{k+1}(i, u) = Q_k(i, u).

Convergence proofs are available [12] [73] [70] [59] for general systems and [74] for linear quadratic regulation. Under certain conditions, the sequence Q_k(i, u) converges with probability 1 to Q*(i, u) as k → ∞.

2.4.14 Real-Time Q-Learning

This is done by interleaving Off-Line Q-Learning with control steps. If a current system model provides an approximate successor function, the result is an indirect adaptive method identical to adaptive RTDP. However, the term Real-Time Q-Learning is used for the case in which there is no model of the system underlying the decision problem and the real system plays the role of the successor function. This direct adaptive algorithm updates the Q-value for only a single state-action pair at each time step of control. The algorithm computes Q_{t+1} as follows:

Q_{t+1}(s_t, u_t) = (1 - \alpha_t(s_t, u_t)) Q_t(s_t, u_t) + \alpha_t(s_t, u_t) \big[ c_{s_t}(u_t) + \gamma f_t(s_{t+1}) \big], \qquad (2.26)

where f_t(s_{t+1}) = \min_{u \in U(s_{t+1})} Q_t(s_{t+1}, u) and α_t(s_t, u_t) is the learning rate at time t for the current state-action pair. The Q-values for all the other state-action pairs remain the same, i.e., Q_{t+1}(i, u) = Q_t(i, u), ∀(i, u) ≠ (s_t, u_t). This algorithm is the Basic QL algorithm. For reasons to be described in Chapter 3, it should also be called the One-Step Q-Learning Algorithm.
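The update of Eq. 2.26 can be written out directly. The sketch below is a minimal model-free version in which the system (here the hypothetical MDP container from Section 2.4.2, standing in for the real plant) plays the role of the successor function; the ε-greedy exploration rule and the 1/(number of visits) learning rate are illustrative choices only, not prescriptions from this thesis.

```python
import random

def real_time_q_learning(mdp, n_steps, epsilon=0.1, s0=0):
    """One-step (real-time) Q-Learning, Eq. 2.26:
    Q(s_t,u_t) <- (1-a) Q(s_t,u_t) + a [ c_{s_t}(u_t) + gamma * min_u Q(s_{t+1},u) ]."""
    Q = {(i, u): 0.0 for i in range(mdp.n_states) for u in mdp.actions[i]}
    visits = {key: 0 for key in Q}
    s = s0
    for _ in range(n_steps):
        # exploration: occasionally pick a random admissible action
        if random.random() < epsilon:
            u = random.choice(mdp.actions[s])
        else:
            u = min(mdp.actions[s], key=lambda a: Q[(s, a)])
        s_next, cost = mdp.step(s, u)                       # the plant supplies the successor
        f_next = min(Q[(s_next, a)] for a in mdp.actions[s_next])
        visits[(s, u)] += 1
        alpha = 1.0 / visits[(s, u)]                        # one simple learning-rate schedule
        Q[(s, u)] = (1 - alpha) * Q[(s, u)] + alpha * (cost + mdp.gamma * f_next)
        s = s_next
    return Q
```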
2.4.15 Q-Learning implies Action Dependent Dynamic Programming

Just as the gradual relaxation of the scheduling of updates in dynamic programming leads to incremental dynamic programming, it is implicit in Q-Learning that there is a rigidly scheduled algorithm as its ancestor. The same ancestor is also implied by ADHDP. An appropriate name for this ancestor is Action Dependent Dynamic Programming (ADDP). The update equation of this algorithm is simply to compute, for all s_t ∈ S and for all u_t ∈ U(s_t):

Q_{k+1}(s_t, u_t) = c(s_t, u_t) + \gamma \sum_{s_{t+1} \in S} p_{s_t s_{t+1}}(u_t) \min_{u_{t+1} \in U(s_{t+1})} Q_k(s_{t+1}, u_{t+1}),

where k is the index of the DP stage and t is the time index of the dynamic system. The difference between ADDP and its various descendent algorithms is in the scheduling of the updates. A simpler version of this algorithm for deterministic systems will be described in Chapter 5.

2.4.16 Dyna-Q

Sutton [75] developed a family of algorithms called Dyna which are variations of Q-Learning with a mixture of indirect adaptive control. In one particular Dyna algorithm, the Dyna-Q algorithm by Sutton, the update operation in Q-Learning uses both actual state transitions as well as hypothetical state transitions simulated by a system model. The latter part amounts to running multiple stages of Off-Line Q-Learning in the intervals between times at which the controller generates actions. This is only one possible combination. Dyna worked successfully on a small obstacle avoidance and planning problem.

2.4.17 Prioritized Sweeping

The GSDP mentioned earlier sweeps the states according to a schedule. One type of high-performance scheduling heuristic is Moore's family of Prioritized Sweeping algorithms [76]. In these heuristics, a descending sorted queue of states and possibly actions is maintained. The sorting is with respect to a priority value which measures the change to a state-action pair's f or Q value in the most recent update. In each Gauss-Seidel sweep, the f or Q values of a certain number of the top priority state-action pairs are updated. As they move toward convergence, their priority is lowered and other state-action pairs will move toward the top of the queue. These heuristics focus the sweeps in the most interesting parts of the state-action space. A closely related heuristic is the Dyna-Q-queue algorithm by Peng and Williams [77]. Both prioritized sweeping and Dyna-Q-queue greatly lower the number of Bellman's Equation processing steps before convergence.
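The following sketch shows, under stated assumptions, the bookkeeping that prioritized sweeping adds to the backups: after a state's cost is backed up, its predecessors are pushed onto a priority queue keyed by how strongly the change could affect them, so the next backups are spent where they matter most. It is only a schematic reading of the idea in [76]; the predecessors table, the threshold theta, and the reuse of backed_up_cost from the sketch after Section 2.4.12 are assumptions, not details taken from this thesis.

```python
import heapq

def prioritized_sweeping(mdp, predecessors, n_backups, theta=1e-4):
    """Back up f-values in order of priority (magnitude of the most recent change)
    rather than in a fixed Gauss-Seidel sweep over all states."""
    f = [0.0] * mdp.n_states
    queue = [(-1.0, i) for i in range(mdp.n_states)]    # seed every state once
    heapq.heapify(queue)
    for _ in range(n_backups):
        if not queue:
            break
        _, i = heapq.heappop(queue)
        new_f = min(backed_up_cost(mdp, i, u, f) for u in mdp.actions[i])
        change = abs(new_f - f[i])
        f[i] = new_f
        # a large change at i makes backups of i's predecessors worth doing soon
        for (s, u) in predecessors[i]:                   # pairs (s, u) with p_si(u) > 0
            priority = mdp.p[(s, u)][i] * change
            if priority > theta:
                heapq.heappush(queue, (-priority, s))    # max-priority via negation
    return f
```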
2.4.18 Variable Resolution

Q-Learning control may look like a control method in the engineering sense, but we have to keep in mind that the states may be the description of a system at various levels of abstraction. The states may be low level descriptions of a physical system. Alternatively, they may be somewhat higher level, as in the discrete states used in the subsumption architecture [78] [79]. An example of using a DP-type algorithm at varying levels of resolution is Moore's Varying Resolution DP algorithm [80]. He describes using states which are indicators of being in a range of values of the low level state space. After selecting a rough choice of actions or trajectories, a finer resolution of the selected trajectory is used and DP is done on the finer set of state descriptions. This is done successively until the lowest level description is reached. This method has a good chance of finding a good trajectory. These ideas touch on using Q-Learning at an AI level. Indeed, Q-Learning comes under the title of "Inductive Reasoning" in the AI literature.

2.4.19 Transition Point DP

Transition Point DP (TPDP) [81] [82] is inspired by the idea that in a continuous system, the control values are often constant through regions of the state space. Considerable savings in the storage of the control policy can be had by only storing the control values at the boundary of these regions. The TPDP algorithm was developed to find these boundaries and the associated control values, starting with the same information as the other RL algorithms. In many AI problems, the key decision points are at the boundaries. Once a decision is made and the associated action initiated, the course is set and it is no longer necessary to pay attention to alternatives while the situation remains inside the scope of the decision.

More speculatively, TPDP may be a method of partitioning a state space according to the criterion of constant control (or perhaps some other constant quantity). It may be a mechanism for forming intermediate level representations (to be discussed in Chapter 3). If the number of action types is large, the optimal control of a system by state space region triggered action switches in an orthogonal or semi-independent way is intriguing. In other words, if individual responses, triggered by the boundaries of state space regions, can be learned incrementally, the control system may be able to accumulate and link them together to optimize a global goal. This would be something akin to the automatic building and tuning of a subsumption architecture. TPDP seems to be natural as a part of a learning hierarchy.

2.5 Subgoals

A method closely related to optimal control is Neural Subgoal Generators (NSG) [83] [3] [84]. Figure 2.4 illustrates the essential components for a non-recursive subgoal generator. A recursive version can be obtained by a simple re-routing of the signals.

Figure 2.4: An adaptive non-recurrent subgoal generator.
A goal or a subgoal can be a state in time, a state without explicit reference to time, or an abstract goal. In optimal control, we can focus on either the controls or the states. In existing RL theory, the focus is on the controls. But there are a number of ways to find the optimal controls if the optimal trajectory is known (see Sections 2.1.5 and 2.1.6).

Both the subgoal generator and the evaluator are implemented in feedforward neural networks in the implementations in [83] [3] [84]. These networks allow the use of dual backpropagation [65] to pass the derivative of the evaluations with respect to the subgoals. The subgoal generator is updated by moving down the gradient of the evaluations. The purpose of the system in Figure 2.4 is to recursively break down a large goal into smaller goals, ultimately ending up with a sequence of subgoals from the start specification to the goal specification. The evaluators in the figure are identical to each other. The role of each copy is to compute (or map) an evaluation from its inputs. The evaluator is updated using the true evaluations when the subgoal sequence is executed. These approximate evaluations are used to compute a sum of evaluations for the entire sequence.

The breakdown of a large goal is one possible use of the subgoal method. The other is the suggestion of a different way of doing reinforcement learning, something akin to backing up the subgoals. The first key idea in the suggestion is that a lengthy state trajectory (going from the start state to the end state, the end state being the large goal) can be broken down roughly at the center, in time, of the state trajectory. The state at the point of breakage is the subgoal in the nomenclature of NSG. The second key idea is that the evaluations of both halves of the original trajectory are summed up to enable judgement of the optimality of the subgoal. The missing piece in NSG is a way to build up optimal trajectories of long horizon from optimal trajectories of short horizons, without relying on the continuity of the problem and the function approximators (usually neural networks) for hints. In this thesis, some of the essential ideas in NSG are framed in a discrete-state, discrete-time formulation, thus removing the reliance on continuity as a major mechanism. In doing so, a clearer understanding of the idea of subgoals is obtained (see Chapter 4).

2.6 Summary of Literature Review and Focus of this Work

Figure 2.5 shows a tabular summary of the literature review. DP approaches for optimal control have received a great deal of attention in recent years for many reasons.
The major reasons are 1) the availability of a large mass of theory, variations, and experimental results, 2) the simplicity and generality of the applications, and 3) a promising prospect of extension to AI-type problems, possibly via compositional learning [85] in a hierarchy. DP and Q-Learning are the focus of this work.

Figure 2.5: A summary of the literature review. Each box immediately above a thick line is a category which applies to all boxes under that box. There is no additional inheritance relationship between boxes on the same vertical column. The boxes of the figure group the reviewed methods as follows:

• Discrete-Time, Discrete-State (Numerical): Search Methods; RL Value Iteration (Off-Line Q-Learning, Real-Time Q-Learning, Dyna-Q, Transition Point DP); RL Policy Iteration (Actor-Critic); DP (Synchronous DP, Adaptive Real-Time DP, Prioritized Sweeping, Variable Resolution).
• Discrete-Time, Continuous-State (Numerical): Neural Network Optimization with the Minimum Principle; RL Value Iteration (ADAC or ADHDP); RL Policy Iteration (Actor-Critic: Adaptive Heuristic Critic, Heuristic Dynamic Programming).
• Discrete-Time, Continuous-State (Analytical): DP.
• Continuous-Time, Continuous-State, Stochastic (Numerical): Finite Element Approximation of DP; Discretization of the Stochastic Control Problem.
• Continuous-Time, Continuous-State, Stochastic (Analytical): Dynamic Programming; Martingale Methods; Minimum Principles.
• Continuous-Time, Continuous-State, Deterministic (Numerical): Variation of Extremals; Quasilinearization; Steepest Descent; Gradient Projection.
• Continuous-Time, Continuous-State, Deterministic (Analytical): Calculus of Variations; Pontryagin's Maximum Principle.

2.7 Outstanding Issues

2.7.1 Approximation errors in DP and Q-Learning

In DP and Q-Learning theories, it is usually assumed that the mappings, such as the f_k(j) function (or mapping) in Eq. 2.23 and the Q_t(s_t, u_t) and α_t(s_t, u_t) functions (or mappings) in Eq. 2.26, are stored without error. The storage methods used in simulations coming closest to this assumption are discrete storage methods such as lookup tables. When the input space, i.e., the space of j in f_k(j) and the space of {s_t, u_t} in Q_t(s_t, u_t) and α_t(s_t, u_t), becomes large, discrete storage methods are overwhelmed by the curse of dimensionality. To overcome the curse of dimensionality, approximators are used to store these mappings. An example of an approximator is feedforward neural networks using the backpropagation algorithm [38]. For simplicity, we will use f_k(j) as the basis of discussion in the rest of this paragraph, dropping the index k. Typically, universal function approximation is used in the following way:

• During the iterations in a DP algorithm, a sequence of {f, j} pairs is generated.
• As each pair {f, j} is obtained, the value of j is submitted to the approximator as the input and the value of f is submitted as the output.
• The parameters of the approximator are adjusted to reduce the difference between the output value supplied by the DP algorithm and the output value generated by the approximator.

If the DP algorithm is able to randomly sample all possible j's infinitely often, the approximator would have sampled its input space thoroughly and have a chance to become an accurate mapping from j to f. The other mappings described at the beginning of this section are also approximated similarly. The sequence of events described in the last paragraph can vary greatly, depending on the algorithm. In a synchronous algorithm, the pairs {f, j} for all j's are collected before they are submitted to the approximator. In an incremental algorithm, the pairs {f, j} are submitted to the approximator as soon as they are available.

Perhaps the most critical issue with the use of approximators is that, almost always, the output values generated by the approximator cannot exactly match the supplied output values over the entire input space. The general rule is that high accuracy requires more computational and memory resources. The reduction or elimination of this approximation error, using a preferably reasonable amount of computational and memory resources, has been the subject of research for many decades [41] [43].

The problem that approximation error presents is that successive iterations of DP and Q-Learning use the output of the approximators to compute new f-values and Q-values, respectively. The approximator errors from the last iteration are accumulated in the new f- and Q-values. These erroneous values are then again stored in the approximators, which very likely will not store them exactly, introducing new errors. Erroneous f-values and Q-values are critical to the final result obtained from DP and Q-Learning, which is the control policy. As the control policy is computed from the f- and Q-values, its optimality or even stability is compromised by the errors.

As Q-Learning algorithms are relatively new, few experimental results involving approximators have been obtained [86] [87]. The author is aware of one negative result [88] where Q-Learning fails to generate a stable control policy. The cumulative nature of these algorithms and the extraction of the control policy mean that the accuracy of approximation is paramount. The issue of storage error is now among the most pressing concerns.
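To make the three bullets above concrete, the sketch below runs synchronous DP sweeps in which f is stored not in a lookup table but in a deliberately crude approximator (simple state aggregation), so the f-values read back at stage k+1 already carry the storage error of stage k. The aggregation scheme and all names are illustrative assumptions chosen only to expose the structure of the loop, not a recommended approximator.

```python
class AggregationApproximator:
    """A crude approximator: states sharing a group index i // width share one stored
    value, so the f-values submitted to it are generally not reproduced exactly."""

    def __init__(self, n_states, width):
        self.width = width
        self.values = [0.0] * ((n_states + width - 1) // width)
        self.counts = [0] * len(self.values)

    def train(self, j, f_target):
        # incremental averaging of all targets ever submitted for this group of states
        g = j // self.width
        self.counts[g] += 1
        self.values[g] += (f_target - self.values[g]) / self.counts[g]

    def predict(self, j):
        return self.values[j // self.width]


def approximate_synchronous_dp(mdp, approx, n_stages):
    """Synchronous DP with the approximator, not a table, holding f between stages;
    whatever error the approximator makes is folded into the next stage's targets."""
    for _ in range(n_stages):
        targets = []
        for i in range(mdp.n_states):
            q_values = [mdp.c[(i, u)] + mdp.gamma *
                        sum(p_j * approx.predict(j)        # read back possibly erroneous f
                            for j, p_j in enumerate(mdp.p[(i, u)]))
                        for u in mdp.actions[i]]
            targets.append((i, min(q_values)))
        for i, f_target in targets:                        # synchronous: submit after the sweep
            approx.train(i, f_target)
    return approx
```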
Thrun and Schwartz [2] were the first to explore this area. For Q-Learning, they identified the minimization operator in Equations 2.24 and 2.25 as a major reason for the adverse accumulation of errors. There may be other reasons, such as the fact that in a typical learning situation, the approximation errors are temporally and spatially correlated.

There are a number of possible solutions. The most obvious is to use more accurate approximators. The level of accuracy required even for relatively simple problems [2] means the demand on the approximators will be severe. Certainly, accurate approximators should be used whenever practical, but to relieve this demand we should try to solve the problem in other ways also. Thrun and Schwartz [2] made a number of suggestions.

2.7.2 Speed of convergence

Both Q-Learning and Adaptive Critic methods in their basic forms are known to show slow convergence. Some of the causes are fundamental to the control problem and difficult to overcome as long as the assumptions of the problem are maintained. There are also causes which are artifacts of the algorithms themselves. Both types of causes should be examined. In Q-Learning, a fundamental assumption is that the system is stochastic. An artifact is that Q-Learning and DP methods are both one-step in their updates. The Dyna architecture, Prioritized Sweeping, and Variable Resolution DP are partial solutions to the problems introduced by the one-step nature of the base Q-Learning and DP methods.

The actor-critic systems using TD(λ) would likely be faster than those using a variant of DP. The former has a solid base of experimental support [8] [9] [89] [90] [60] [61] [62] [91] [92] [93] [88] [94] [87] [95]. There are some experiments which show that TD(λ) learning using approximators is slow [9] [89] [91]. In situations where much is known about the system and this knowledge strongly constrains the forms of the adaptive critic and the controller, the learning can be very fast [96].

Undoubtedly, the numerous parameters of approximators are a major cause of slow convergence. But perhaps there are other causes as well. The output of the critic mapping in actor-critic systems, e.g., TD(λ), is an estimate of the expected total discounted cost. The controller is updated based on the relative magnitudes of this cost with respect to the states. The problem with this approach is that the estimates are likely to be inaccurate before the overall learning is completed. Inaccurate estimates do not affect some controller adaptation algorithms as long as the relative magnitudes of the points on the critic mapping are correct.
Otherwise, the controller would be given incorrect training signals. At the least, the overall learning would be slowed. At worst, the controller would learn the wrong control law. Even if this is not an issue, the critic should ideally learn as quickly as possible while providing the right kind of information to the controller. An approach toward a solution is to consider using the state transition samples more thoroughly [91]. It should be noted that this issue is distinct from the issue of approximation errors in the previous section.

2.7.3 Hierarchies

A hierarchy is a multi-level system. Each level consists of evaluation and control components. The higher levels deal with the environment in terms of coarser temporal and spatial resolutions, and coarser actions. The lower levels deal with the environment in finer temporal and spatial resolutions, and finer actions. The levels communicate with each other during learning and interaction with the environment. RL is mainly a single-level learning control method. Several attempts have been made to extend RL to function in a hierarchy. There are many motivations for this.

Firstly, while it may seem that an approximator can map optimal actions to the states of a complex system, there are advantages to be gained from breaking up the mapping into a hierarchy. A monolithic evaluation or control structure, e.g., a massive single neural network, is not an ideal structure to update. If the system is complex or if the optimality criterion is difficult to fulfill, the mapping can be highly non-smooth. This would present severe problems to most approximators. Smaller mappings can prevent the learning in one part of the state space from affecting the learning in a distant part, for instance. The smaller mappings may be simpler to parameterize and train.

Secondly, consider an optimal control problem for a complex mechanism with controls and states of temporal resolution of a small fraction of a second and spatial resolution in the range of millimeters. If the optimality criterion is in a high level representation and encompasses a time of years and distances of thousands of kilometers, it is very likely that the controller cannot be a simple mapping from states to actions. A number of levels would probably be necessary to relate the controls and states to the high level optimality criterion.

Thirdly, hierarchies facilitate the conversion between numerical representations and symbolic representations.
The latter is a convenient domain for control at larger time/space scales and more complex dynamics. There is likely to be a speed advantage to learning in more compact representations of state, action, and time. Symbolic representation of the numerical world allows control to take place symbolically. A hierarchy could break a difficult and abstract goal down to concrete actions. A thorough discussion of hierarchies is in [97].

Optimality or near-optimality is important at all levels of a hierarchy. Ultimately, we want the goal at the top of the hierarchy to be achieved optimally. This means each of the subgoals must be achieved optimally. This is just a different statement of the Bellman optimality equation [1] [69]. The construction of optimal skills in each level of a hierarchy is where RL comes into play.

2.8 Increasing Amount of Context

The end product of self-learning is to have a ready-made {state}→{control} mapping. The various amounts of context available or kept determine what can be done to obtain those mappings. If accurate models of the system dynamics and costs are available, the {state}→{control} mapping can be produced by some means. This is the ultimate in adaptivity because if the model is changed, the mapping can be completely updated in one real time step. But to compute the {state}→{control} mapping, there is still no avoiding using one of the following mappings (see Figures 2.6, 2.7, 2.8, and 2.9) and the associated computations. The issue is at which level of context the desired numerical and complexity properties will be obtained. This issue is analogous to the comparison between stricter Stochastic Learning Automata (SLA) algorithms and looser SLA algorithms such as those described in [98]. The stricter SLA algorithms use a minimum amount of data structure (context), but are slow and scale very poorly in terms of convergence speed with respect to the number of actions. The looser SLA algorithms use more context, but are able to converge much more quickly and scale up well.

Figure 2.6: A {state}→{evaluation} mapping.

Figure 2.7: A {state,control}→{evaluation} mapping.

Figure 2.8: A {state,future-state,control}→{evaluation} mapping.

The lowest level of context is to have a {state}→{evaluation} mapping (see Figure 2.6), as is done in the actor-critic architectures. The evaluation is the evaluation of the control policy. This mapping is then used to construct the state-to-action mapping, i.e., the controller.
To do this, one of the following methods can be used: 1) keep a record of the actions at the time of the updates of f, 2) use the A_{R-P} algorithm [99] to train the controller, or 3) use derivative information as in the Backpropagated AC algorithm, the DHP algorithm, or the ADDHP algorithm.

The next higher level is to have a {state,control}→{evaluation} mapping (see Figure 2.7), as done in Q-Learning. This allows the easy extraction of the optimal control from the mapping. But there are a number of disadvantages with this approach, as will be discussed in Chapter 3.

The next higher level is to have a {state,future-state,control}→{evaluation} mapping (see Figure 2.8). Horne [100] talks about linking up {Old-State, Action, New-State} triples to get a workable control sequence, but does not address the optimality of these sequences.

An even higher level is to have a {state,future-state,control,horizon}→{evaluation} mapping (see Figure 2.9). The additional horizon parameter is scalar, so the increase in context is relatively small. This increase buys many advantages but also results in a higher level of complexity which may be partially mitigated by using function approximators. The rest of this thesis will take this point of view to address the issues raised in Section 2.7.

Figure 2.9: A {state,future-state,control,horizon}→{evaluation} mapping.

Chapter 3

The Key Problems and A New Approach

In this chapter, the bases for the new algorithms to be proposed in Chapter 4 are described. First, I will discuss the One-Step Q-Learning algorithm, the issue of approximation errors, and the issue of extension of reinforcement learning to hierarchies. Second, from these discussions I am motivated to look at the QL and DP algorithms from an aggregate perspective. Finally, I will look at the subtleties of doing QL and DP in aggregates.

3.1 The One-Step Q-Learning Algorithm

One-Step Q-Learning is the focus of this chapter. From the description of various algorithms in Chapter 2, it is clear that this algorithm is the foundation of many related algorithms. The notation Q_μ(i, u) denotes the expected total infinite-horizon discounted cost of executing action u in state i as evaluated by a control policy μ. A control policy μ specifies a specific action u for each state i:

Q_\mu(i, u) = E_\mu\left[\sum_{t=0}^{\infty} \gamma^t c_t \,\middle|\, s_0 = i, u_0 = u\right], \qquad (3.27)

where 0 ≤ γ ≤ 1 is a discount factor, and E_μ is the expectation assuming the controller always uses policy μ. The horizon is the largest value that t assumes in the definition of Q; in Eq. (3.27), the horizon is infinite.

For each state i and action u, Q_μ(i, u) is the expected value over all possible successor
states j in the next time step. For each j, the contribution to Q_μ(i, u), weighted by p_ij(u), is c_i(u) + γ f_μ(j), where c_i(u) is the immediate cost incurred for action u in state i, f_μ(j) is called the evaluation function, and

f_\mu(j) = \min_{u \in U(j)} Q_\mu(j, u). \qquad (3.28)

The definitions above contain an inconsistency. It is not clear that μ can be any arbitrary control policy and still satisfy Eq. (3.28) for every state. Yet, it is clear that if μ is optimal, this inconsistency does not exist. If μ is optimal, it is denoted by μ* and we have the following properties:

f^*(j) = f_{\mu^*}(j) = \min_{u \in U(j)} Q_{\mu^*}(j, u)

and

Q^*(i, u) = Q_{\mu^*}(i, u) = \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) f^*(j),

where \bar{c}_i(u) is the expected value of c_i(u). The optimal policy is found by some algorithm. The optimality in DP and Q-Learning is the result of the minimization operator in Eq. (3.28).

The Q-values can be constructed in a number of ways [101]. Watkins' Real-Time Q-Learning (RTQL) algorithm [73] uses an update rule where an estimate of Q*, denoted Q̂, is updated after the execution of an action in each time step according to

Q_{t+1}(s_t, u_t) = (1 - \alpha_t(s_t, u_t)) Q_t(s_t, u_t) + \alpha_t(s_t, u_t) \big[ c_t(s_t, u_t) + \gamma f_t(s_{t+1}) \big], \qquad (3.29)

where

f_t(s_{t+1}) = \min_{u \in U(s_{t+1})} Q_t(s_{t+1}, u) \qquad (3.30)

and α_t(s_t, u_t) is the learning rate at time t for the current state-action pair. The Q-values and α-values for all other state-action pairs remain the same, i.e., Q_{t+1}(i, u) = Q_t(i, u) and α_{t+1}(i, u) = α_t(i, u), ∀(i, u) ≠ (s_t, u_t). The learning rate parameter α_t(s_t, u_t) is scheduled to satisfy the following conditions:

\sum_{d=0}^{\infty} \alpha_d(s, u) = \infty, \qquad \sum_{d=0}^{\infty} \alpha_d^2(s, u) < \infty, \qquad \forall s \in S, \; u \in U(s),

where d is incremented each time the specific {s, u} pair is encountered.

This algorithm essentially passes a slightly modified f value of a state to its successor state. The end result after convergence is that the inconsistency mentioned earlier disappears. Therefore, one interpretation is to see RTQL as a relaxation algorithm. Notice that c_t(s_t, u_t) in Eq. (3.29) is the cost of a state transition which spans only one time step. To compare with the algorithms to be presented later, it is clearer to call this algorithm the One-Step Q-Learning algorithm. Although Eq. (3.27) shows the horizon to be infinite, during the course of learning according to Eq. (3.29), the horizon increases by 1 in each update. Infinite horizon is merely a convenient mathematical concept. It helps to call the horizons in the Q's the actual horizons.

This algorithm has been proven to converge for general [12] [73] [70] [59] and specific [74] learning problems. Yet in practice, convergence is not achieved in all control problems. A number of issues need attention.
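One schedule that satisfies the two conditions above is α_d(s, u) = 1/(1+d), since Σ_d 1/(1+d) diverges while Σ_d 1/(1+d)^2 converges. The sketch below keeps one visit counter d per state-action pair; the class and method names are illustrative, not taken from this thesis.

```python
class LearningRateSchedule:
    """Per-pair learning rates alpha_d(s, u) = 1 / (1 + d), where d counts how many
    times the pair (s, u) has been encountered; this satisfies
    sum_d alpha_d = infinity and sum_d alpha_d^2 < infinity."""

    def __init__(self):
        self.d = {}      # visit counts, keyed by (state, action)

    def next_alpha(self, s, u):
        d = self.d.get((s, u), 0)
        self.d[(s, u)] = d + 1
        return 1.0 / (1 + d)
```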
The next section will describe the inadequacies of this One-Step Q-Learning algorithm. It will be argued that they are the result of computing the Q-values at a single, fixed time resolution.

3.2 Approximation Errors in Q-Learning and DP

A critical difficulty is that the accumulation of approximation errors can defeat One-Step Q-Learning [2] and DP in their present forms. One argument is that zero-mean approximation errors are skewed by the minimization operator, assuming that approximation errors are zero-mean. Even if the errors are not skewed, they would still be a severe problem because they are unlikely to be zero-mean and they are accumulated multiplicatively among Q's of different actual horizons. When these errors are accumulated by the backup operation, Q-Learning can fail as the length of the horizon and/or the number of actions becomes large.

3.3 Goal Change

In many control problems, the goal shifts from moment to moment. For example, the goal of a walking machine may shift from standing to putting a foot forward to walking to running. In RL and DP, the goal is defined by the optimality criterion, which is in turn defined by the immediate costs. In RL, to change the goal of the control system, the entire learning process needs to be repeated with different immediate costs. In many cases, we are interested in constructing an optimal controller which depends on a range of goals. The range of goals may be continuous. To do this, the learning process would have to be repeated for each of the goals and the control laws produced in each iteration would be used to build up a goal-dependent controller. A more efficient way to change goals would greatly benefit the construction of this type of controller.

3.4 Hierarchies

A current goal of reinforcement learning research is to extend the capability of reinforcement learning to include learning optimal control at various temporal and spatial levels, in hierarchies. There are a number of motivations for this, as discussed in Chapter 2.

3.4.1 The Albus model of intelligence and RL

Albus [97] proposed a prototype of hierarchical intelligence which has many subsystems. The part of his proposal that is relevant to reinforcement learning is that each command center in each layer has a world model module, a value judgment module, and a behaviour generation module. The value judgment module evaluates the sensory inputs and the expected sensory inputs in the next time step, and alters the behaviour generation module. This interaction is very close to the idea of reinforcement learning. The Albus model is a good basis for thinking about the structure of a hierarchical RL system.
As the Albus model illustrates, RL is a learning problem that is necessary and pervasive in an intelligence hierarchy. It is clear that an intelligence has to evaluate the performance of its behaviours with respect to its many goals. It is the only way that an intelligence hierarchy could ensure the near satisfaction of many requirements, which leads to a better chance of survival, or to achieving whatever overall goal it has.

3.4.2 The upper levels of a hierarchy and macros

At the upper levels of a hierarchy, the representation should be more symbolic and less like the raw state and control representation at the lowest level. If labels could be assigned to significant and often repeated low level trajectories, combinations of control and state trajectories at a symbolic level could be done. When a useful, easily parameterizable control sequence [102] is found, a label and some associated parameters can be assigned. The parameters help to make the label more general. Let us call such a sequence a macro. At the interface between the numerical and symbolic levels, there would likely be levels where the representation is a numerical and symbolic mixture. For example, a movement like "jump" can take on many start states, end states, and desired heights. The system would also have a rough idea of the control sequence required, the time required, and the costs. This kind of hybrid representation could describe state/control trajectories and symbolic trajectories. As we move upward in the hierarchy, these macros would become more discrete, until they are almost purely symbolic.

Macros can be hand-crafted or learned. The combination of learning with some genetically determined seed goals seems promising. The seed goals trigger the learning of useful macros. Examples of seed goals are the goals to look at novel objects in a visual scene, to reach for an object, to move toward an object, etc. To accomplish these objectives, a whole array of in-between and peripheral accidental achievements would be encountered. These accidental achievements and the seed goals can be used to produce derived goals which exist at various levels in the hierarchy. When an accidental achievement is encountered, the system could determine if it is useful in terms of the seed or derived goals. If it is, and it does not fit in the lower levels, the achievement could become a new derived goal at a higher level. These goals focus the exploration of the state space. The action sequences caused by the goals are the macros. New macros are formed when a sequence of lower level macros is encountered often and is found to be useful, i.e., achieves some built-in or derived goal.
The seed goals should be strategically designed such that they lead to the formation of an adequate set of macros for the satisfaction of the overall goal. The shorter macros can be found by accident. The longer macros would have to be formed by more deliberate methods, such as a new form of RL or the method proposed in [85]. We can build new macros by concatenating several old macros. If the new macro is too complex to be encoded at a low level, the encoding is done at a higher level. In Section 4.6, several new algorithms will be related to these preceding discussions.

3.4.3 Other remarks on hierarchies

In a hierarchy, RL can be done either sequentially or concurrently at various command centers at several levels. At the lowest levels, this is identical to the learning situation in regular RL, where we deal in raw state representations and elemental time steps. This is not true even at the next higher level, where coarser state representations and time intervals are used. The basic RL algorithms do not address inter-level learning. Until recently, RL was a method for learning in a monolithic, not hierarchical, controller. Each backup operation involves an elemental transition of constant time interval and constant abstraction, whereas a hierarchy has a mixture of state transitions of different time intervals and abstractions.

In the next section, we look at Q-Learning, DP, and control problems from a new perspective, to address the issues raised in the preceding sections. A new Q notation will be introduced which has two new input specifiers. Only one of these is indispensable in the algorithms to be presented later which utilize this new Q notation. Compared to the DP notation, which uses only f, the Q notation in One-Step Q-Learning makes it easier to find the best first action. This change is a move to yield a better algorithm by using more input parameters. As with both Q-Learning and DP, the hope for application to high dimension systems is that the Curse of Dimensionality [1] would be partially mitigated by using continuous function approximation. The new Q notation is the next step in the same direction.

3.5 A New Approach: Linkup Action Dependent Dynamic Programming

More can be done with a sample of a transition than just doing a backup for a single Q-value. The word backup is appropriate because the Q-value is updated in the backward direction. We actually want to fill out as many Q's as possible by the linkup of separate Q's that happen to intersect at the transition. The exact definition of a linkup operation will be presented in a later section. We can look at the summation in Eq. (3.27) in another way, if we have new notations for f and Q which have two additional input specifiers each.
This new notation is defined:

    f_{\bar{\mu}}(s_0, s_n, n) = E_{\bar{\mu}}\Big[ \sum_{t=0}^{n-1} \gamma^{t} c_{s_t}(\bar{\mu}(s_t)) \;\Big|\; s_0, s_n \Big].   (3.31)

For a finite horizon, $\bar{\mu} = \{\mu_0, \dots, \mu_{n-1}\}$ is a vector of control policies, one for each time step in the summation. When the horizon is infinite, all these control policies collapse into a single $\mu$, i.e., $\mu_0 = \dots = \mu_{n-1}$. For this notation, $\bar{\mu}$ means using u's according to the policy vector starting at t = 0, i.e., $\bar{\mu} = \{\mu_0, \dots, \mu_{n-1}\}$. $\bar{\mu}$ can be any control policy vector. Let the optimal policy vector $\bar{\mu}^*$ be defined such that

    f^*(s_0, s_n, n) = f_{\bar{\mu}^*}(s_0, s_n, n) = \min_{\bar{\mu}} f_{\bar{\mu}}(s_0, s_n, n).

We now define another new notation:

    Q^*(s_0, s_n, u_0, n) = \bar{c}_{s_0}(u_0) + \gamma \sum_{s_1 \in S} p_{s_0,s_1}(u_0) \, f^*(s_1, s_n, n-1).   (3.32)

This Q* is like the standard Q* except that it is also conditioned on a reached state $s_n$ and a finite horizon n. The inclusion of $s_n$ and n allows an operation to be called linkup. The meaning of this will become clear shortly. Now, an expression is written for Q* having a slightly different meaning for $\bar{\mu}^*$:

    Q^*(s_0, s_n, u_0, n) = E_{\bar{\mu}^*}\Big[ \sum_{t=0}^{n-1} \gamma^{t} c_{s_t}(\bar{\mu}^*(s_t)) \;\Big|\; s_0, s_n, u_0 \Big].   (3.33)

Therefore, $f^*(s_0, s_n, n) = \min_{u_0 \in U(s_0)} Q^*(s_0, s_n, u_0, n)$. It is clear that here $\bar{\mu}^*$ has to mean using $u_0$ at t = 0 and then using u's according to the policy vector starting at t = 1, i.e., $\bar{\mu}^* = \{\mu^*_1, \dots, \mu^*_{n-1}\}$. So the meaning of $\bar{\mu}^*$ depends on the absence or presence of $u_0$ in the conditional part of the expectation term. This is true also for the presentation of the Q notation in the previous section. This distinction simplifies the roundabout definition of Q found in the literature.

For simplicity, $c_t$ will be used for $c_{s_t}(\bar{\mu}^*(s_t))$. Now the linkup relation is derived by looking at a particular $\{s_0, s_n, u_0, n\}$, namely $\{i, l, u, n\}$. We obtain this different breakup of Eq. (3.33):

    Q^*(i, l, u, n) = \underbrace{E_{\bar{\mu}^*}\Big[ \sum_{t=0}^{n_a - 1} \gamma^{t} c_t \;\Big|\; s_0 = i, s_n = l, u_0 = u \Big]}_{\text{Term A}} + \underbrace{E_{\bar{\mu}^*}\Big[ \sum_{t=n_a}^{n-1} \gamma^{t} c_t \;\Big|\; s_0 = i, s_n = l, u_0 = u \Big]}_{\text{Term B}},   (3.34)

where $0 < n_a < n$. To end up at a specific state $s_{n_a}$, the system may pass through a number of states at any intermediate time index t, where $0 < t < n_a$. The probability that, starting from the start state i, the system would reach a particular intermediate state $s_{n_a}$ in $n_a$ time steps while following the control policy $\bar{\mu}^*$ is defined as $p^{u,\bar{\mu}^*}_{i,s_{n_a}}$. This multi-step transition probability is dependent on the dynamics of the system, on $u_0$, and on the control policy $\bar{\mu}^*$. More precisely,

    p^{u,\bar{\mu}^*}_{i,s_{n_a}} = \sum_{s_1 \in S} \sum_{s_2 \in S} \cdots \sum_{s_{n_a-1} \in S} p_{i,s_1}(u)\, p_{s_1,s_2}(\bar{\mu}^*(s_1)) \cdots p_{s_{n_a-2},s_{n_a-1}}(\bar{\mu}^*(s_{n_a-2}))\, p_{s_{n_a-1},s_{n_a}}(\bar{\mu}^*(s_{n_a-1})).

Term B is not in the form of Q*. The differences are that it does not have a definite state at $t = n_a$ and that t begins at $n_a$ rather than 0. The state at $t = n_a$ is reached according to $p^{u,\bar{\mu}^*}_{i,s_{n_a}}$.
$\bar{\mu}^*$ is truncated to $\{\mu^*_{n_a+1}, \dots, \mu^*_{n-1}\}$ with no operational effect on Term B; the truncated vector will be written $\bar{\mu}^*_B$. Term B can then be written as

    \sum_{s_{n_a} \in S} p^{u,\bar{\mu}^*}_{i,s_{n_a}} \, E_{\bar{\mu}^*_B}\Big[ \sum_{t=n_a}^{n-1} \gamma^{t} c_t \;\Big|\; s_{n_a}, s_n = l, u_{n_a} \Big].

Note that $u_0 = u$ in the conditional part would work as well, but it would not be as clear. Defining $t' = t - n_a$ and $n = n_a + n_b$, Term B can be written as

    \sum_{s_{n_a} \in S} p^{u,\bar{\mu}^*}_{i,s_{n_a}} \, \gamma^{n_a} E_{\bar{\mu}^*_B}\Big[ \sum_{t'=0}^{n_b - 1} \gamma^{t'} c_{t'} \;\Big|\; s_{t=n_a}, s_{t'=n_b} = l, u_{t=n_a} \Big]

and then as

    \sum_{s_{n_a} \in S} p^{u,\bar{\mu}^*}_{i,s_{n_a}} \, \gamma^{n_a} \min_{u_{n_a} \in U(s_{n_a})} Q^*(s_{n_a}, l, u_{n_a}, n_b).   (3.35)

Note that Term A is also not in the form of Q*. $\bar{\mu}^*$ is truncated to $\{\mu^*_1, \dots, \mu^*_{n_a - 1}\}$ with no operational effect on Term A and is renamed $\bar{\mu}^+$. The symbol + indicates a different kind of optimality: this $n_a$ element policy vector is not optimal in taking the system from i to $s_{n_a}$ in the sense of Eq. (3.33), but is optimal in taking the system from i to l when followed by $\bar{\mu}^*_B$. This different sense of optimality will be discussed further below. Term A can be rewritten as

    \sum_{s_{n_a} \in S} p^{u,\bar{\mu}^+}_{i,s_{n_a}} \, E_{\bar{\mu}^+}\Big[ \sum_{t=0}^{n_a - 1} \gamma^{t} c_t \;\Big|\; s_0 = i, s_{n_a}, u_0 = u \Big],   (3.36)

and the corresponding Q notation is Q+. The difference between Term A and this expression is that the latter now has a definite end state, $s_{n_a}$. Rewriting Expression (3.36) in Q+ form and adding it to Expression (3.35), we get

    Q^*(i, l, u, n) = \sum_{s_{n_a} \in S} p^{u,\bar{\mu}^+}_{i,s_{n_a}} \Big( Q^+(i, s_{n_a}, u, n_a) + \gamma^{n_a} \min_{u_{n_a} \in U(s_{n_a})} Q^*(s_{n_a}, l, u_{n_a}, n_b) \Big).   (3.37)

Eq. (3.37) will be called the Linkup Relation. An intermediate state is a state at $t = n_a$. A linkup operation is the addition of a Q+ and a Q* at one particular intermediate state. The same name is used for the addition of the estimates of those quantities as well. The word linkup emphasizes the point that a Q* can be constructed by linking up Q+'s and Q*'s, both of which can span multiple time steps and cover a family of state trajectories. Figure 3.10 illustrates the two families of trajectories intersecting at a particular state $s_{n_a}$.

Figure 3.10: Graphical representation of the linkup operation (trajectories from t = 0 to t = $n_a$ meeting trajectories from t = $n_a$ to t = n at the intermediate state).

A circuit is the complete set of trajectories generated by the system under a given control policy or control policy vector, given the start state, the end state, and the number of time steps. Each trajectory occurs with the probability determined by the start state, all the intervening states, the end state, and the policies used. A pre-circuit is a family of trajectories whose evaluation is the value of one Q+ on the RHS of Eq. (3.37). The name also applies in the case of an estimate of Q+. A post-circuit is a family of trajectories whose evaluation is the value of one Q* on the RHS of Eq. (3.37). The name also applies in the case of an estimate of Q*. In Figure (3.10), the mass of trajectories to the left of $t = n_a$ is the pre-circuit and the mass of trajectories to the right of $t = n_a$ is the post-circuit. If there is no need to distinguish between a pre-circuit and a post-circuit, a family of trajectories will simply be referred to as a circuit.
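To make the Linkup Relation concrete, a minimal sketch of one evaluation of Eq. (3.37) over lookup tables is given below. The names (linkup, Q_plus, Q_star, p_pre, actions) are illustrative assumptions rather than notation from this thesis, and the tables are assumed to be already populated for the horizons involved; the point is only that a single linkup combines two families of multi-step trajectories, whereas a One-Step Q-Learning backup touches a single one-step transition.

    # Sketch of one evaluation of the Linkup Relation, Eq. (3.37).
    # Q_plus[(i, s_na, u, na)] : pre-circuit value  Q+(i, s_na, u, na)
    # Q_star[(s_na, l, u, nb)] : post-circuit value Q*(s_na, l, u, nb)
    # p_pre[(i, s_na, u, na)]  : multi-step probability of reaching s_na from i
    #                            under u followed by the + policy (assumed known)
    def linkup(i, l, u, n, na, states, actions, Q_plus, Q_star, p_pre, gamma=0.9):
        """Return Q*(i, l, u, n) by linking pre- and post-circuits at t = na."""
        nb = n - na
        total = 0.0
        for s_na in states:                      # sum over intermediate states
            pre = Q_plus.get((i, s_na, u, na))
            prob = p_pre.get((i, s_na, u, na), 0.0)
            if pre is None or prob == 0.0:
                continue                         # s_na not reachable in na steps
            # inner minimization over the first post-circuit action
            candidates = [Q_star[(s_na, l, u_na, nb)] for u_na in actions(s_na)
                          if (s_na, l, u_na, nb) in Q_star]
            if not candidates:
                continue                         # l not reachable from s_na in nb steps
            total += prob * (pre + gamma ** na * min(candidates))
        return total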
3.5.1 Goal states and reached states

Having $s_{n_a}$ as a goal state and ending up at $s_{n_a}$ are different concepts. The difference shows up in the cost function. The phrase reached state may suggest that this state is the desired goal state, but it simply means the state actually reached at the end of a trajectory. However, if the system is deterministic and the cost function is not defined with a specific goal state, the reached state can be thought of as the desired goal state. Otherwise, a separate set of Q-functions would be needed for each goal state, which would be important to avoid.

3.5.2 The horizon

Infinite horizon cost is a mathematical abstraction. In practice, only finite horizon costs can be learned. A finite horizon controller of a certain length is very different from one of a different length. If we take the extreme case of n = 1, we can see that the optimal policy will be very different from the optimal policy for n = ∞. The n = 1 policy is far too myopic and will not be very useful in controlling anything as complex as a walking machine, for example. If n = ∞, the control policy vector collapses into a single control policy.

3.6 Pre- and Post-Circuits

In this section, Eq. (3.37) will be used as the basis for discussion. There are important differences between a pre-circuit policy, $\bar{\mu}^+$, and a post-circuit policy, $\bar{\mu}^*$. A post-circuit policy is similar to the policy $\bar{\mu}^*$ of Q*(i, l, u, n), differing only in the horizon. $\bar{\mu}^*$ is optimal in that the sum of the Q*'s weighted by the multi-step transition probabilities is optimal. The same is true for the post-circuit policy. But $\bar{\mu}^+$ is optimal in a different sense. $\bar{\mu}^+$ generates pre-circuits from i to $s_{n_a}$ with probabilities $p^{u,\bar{\mu}^+}_{i,s_{n_a}}$ at expected costs of Q+(i, $s_{n_a}$, u, $n_a$). This family of costs and probabilities, combined with the expected costs and probabilities of the post-circuits, is optimal in the sense of Expression (3.38). Confined within the period t = 0 to t = $n_a$, $\bar{\mu}^+$ is not necessarily optimal in that sense. Therefore, when a Q* is used in a pre-circuit capacity, its policy $\bar{\mu}^*$ may not deliver the optimal combination of Q's and multi-step probabilities needed to achieve the overall optimal objective. This complication does not arise in One-Step Q-Learning. The immediate costs already have an action specification and have no further policy specifications, so there is no pre-circuit policy vector to consider.
The complication with the probabilities in the stochastic linkup relation, Eq. (3.37), can be thought of as follows: for controlling stochastic systems in the same manner as One-Step Q-Learning, the Q*'s should be used by minimizing the expected value

    \sum_{l \in S} p^{u,\bar{\mu}^*}_{i,l} \, Q^*(i, l, u, n)   (3.38)

with respect to u. This will give us the same sum as Eq. (3.27). It is straightforward to find $\bar{\mu}^*$; it can be done by One-Step Q-Learning. However, we really want to do it by using the Linkup Relation. To do so, we need to know how to find $\bar{\mu}^+$. This is not easy because this policy does not have an obvious optimality criterion.

3.7 Summary

The stage for the presentation of a new approach to Q-Learning was prepared by the consideration of the problems of approximation errors and the need to extend RL to hierarchies. This approach is centered on the Linkup Relation. It is expected that algorithms based on this relation will be able to at least partially solve these problems. The Linkup Relation presents many new opportunities. It may be possible to use it to control the same stochastic systems that Q-Learning is intended to control [103], but the pre- and post-circuits present several challenges to the application of the relation. They will be explored in depth in the next chapter. If the system is deterministic, or if a fixed policy is operating during learning, the linkup relation can be cleanly exploited. Several algorithms for deterministic systems will be presented in the next chapter. One algorithm for a learning situation suited for TD(λ) will also be presented.

Chapter 4

The Problem Formulation and Proposed Solutions

4.1 The Initial Problem Definition

The standard task studied in recent works in reinforcement learning is the Markovian decision task. The system to be controlled is a finite-state, discrete-time, stochastic dynamic system with finite state set S. Let U be the finite set of actions available to the controller, which may be different depending on the state. At each time step t, the controller observes the current state $s_t \in S$ and executes action $u_t \in U(s_t)$. The system responds to the action and moves to the next state $s_{t+1} \in S$ according to a transition probability $p_{s_t,s_{t+1}}(u_t)$. The controller receives from the system, or computes, a payoff value $c_t(s_t, u_t)$. The system is time-invariant, meaning that the transition probabilities and the payoffs associated with the transitions are constant in time. The object of learning is to find a control policy which optimizes some useful function of the payoffs. We are interested in this expected infinite-horizon discounted cost:

    Q_{\mu}(s_0, u_0) = E_{\mu}\Big[ \sum_{t=0}^{\infty} \gamma^{t} c_t \;\Big|\; s_{t=0} = s_0, u_{t=0} = u_0 \Big],   (4.39)

where $\mu$ is the control policy and $|c_t| < \infty$. The object of the optimization is to find the optimal control policy $\mu^*$.
$\mu^*$ is the policy which minimizes $Q_{\mu}(s_0, u_0)$ for every state-action vector $\{s_0, u_0\}$. Once all the $Q_{\mu^*}$'s are found, the optimal policy can be computed by finding the minimum of $Q_{\mu^*}(s, u)$ at every s with respect to $u \in U(s)$. The infinite horizon is an abstraction. During the course of computing Q, Q's of varying finite actual horizons will have to be constructed. The infinite horizon Q will only be constructed after an infinite period of learning.

4.2 Considerations in Applying the Linkup Relation

To uncover the issues that arise in the application of the Linkup Relation, we need to outline a prototype algorithm first. These issues could not be properly discussed at the end of the last chapter, when a prototype algorithm had not yet been introduced. For the sake of convenience, the word policy may be used in place of policy vector.

4.2.1 Prototype algorithm

A Q is the expected finite-horizon discounted sum of the payoffs associated with a circuit. In the Linkup Relation, the correct value of this sum for a post-circuit with an optimal policy vector is denoted by Q*. The notation Q (without the star) will be used for the estimate of Q*. The general idea is to use the Linkup Relation to link up the Q+ for multiple time-step pre-circuits with the Q* for multiple time-step post-circuits. This generates the Q* for new post-circuits having a horizon which is increased by the length of the pre-circuit horizon. Recursively doing this with successively longer horizon post- and pre-circuit Q's should result in very fast buildup of the horizon. The linkup operation (as opposed to the Linkup Relation) is illustrated in Figure (3.10).

The prototype algorithm is synchronous, meaning that all possible linkup operations at a fixed intermediate time $t = n_a$ and given $s_0$ and $s_n$ take place effectively simultaneously. Referring to Figure (3.10), we want to compute the contribution of the linkup of these two circuits to Q($s_0$, $s_n$, $u_0$, n). Keeping in mind that $n = n_a + n_b$, and considering only a linkup at time $t = n_a$, this intersection of circuits at the particular intermediate state $s_{n_a}$ is just one possibility among all possible states.
T h i s va lue m u l t i p l i e d b y t he p r o b a b i l i t y p " 0 ° i S n a u s i n g the + p o l i c y is the c o n t r i b u t i o n of th is l i n k u p to Q*(so, sn, uo, n), i .e . , th i s c o n t r i b u t i o n , denoted b y AQ*(so, sn, UQ, n), is equa l t o Pnso,sna \Q+(so,Sna,u0,na) +\na min Q*(sna,sn,una,n-na) . (4.40) \ «n a eu (s„ a ) J T h i s process proceeds recurs ive ly , b e g i n n i n g o n l y w i t h Q+ s p a n n i n g 1 t ime-s tep , Q* s p a n n i n g also 1 t i m e step, a n d the 1 t i m e step t r a n s i t i o n p r o b a b i l i t i e s . L o n g e r and longer Q + , s a n d Q*'s are b u i l t u s ing the mos t recent ly c o m p u t e d Q + , s a n d Q* ' s . E v e n t u a l l y , we want to be able to c o m p u t e , for some con t ro l s i tua t ions , Q(so,u0,oo) w i t h o u t reference to a n e n d state a n d , for other s i tua t ions , Q(SQ, SOO, UQ, O O ) . O t h e r t h a n some m i s s i n g pieces, the subject of discussions shor t ly , a ske le ta l frame-w o r k is now i n p lace for the d iscuss ion of the issues of e x p l o i t i n g the L i n k u p R e l a t i o n . W i t h th i s f r amework , efforts to discover a convergence resul t show tha t to do l i n k u p i n a m a n n e r analogous to tha t i n b a c k u p is not s t ra ight fo rward . T h e r e are i m p o r t a n t issues tha t have to be so lved or c i r c u m v e n t e d . These issues are not nea t ly o r t hogona l . 4.2.2 Issue #1 L e t the n a m e * policy refer to the con t ro l p o l i c y associa ted w i t h the o p t i m a l pos t - c i r cu i t , or to see i t i n another way, associa ted w i t h Q*. T h e * p o l i c y for t = 0 to t = n c an be d i v i d e d i n t o a * p o l i c y for the t i m e t = na to t = n w h i c h w i l l be ca l l ed the component * policy a n d a n o p t i m a l p o l i c y f r o m t = 0 to t = na w h i c h w i l l be ca l l ed the component Chapter 4. The Problem Formulation and Proposed Solutions 75 + policy. A + p o l i c y is not s i m i l a r to a * p o l i c y because i t does not o p t i m i z e a s u m l i k e E q . (3.38). Ins tead, i t has to be fo l lowed b y a * p o l i c y to o p t i m i z e such a s u m . F o r i d e n t i c a l sys tems, hor izons and o p t i m a l i t y c r i t e r i a , the + p o l i c y is not necessar i ly i d e n t i c a l t o the * p o l i c y . T h i s s ta tement is a c t u a l l y i l l - posed because a + p o l i c y depends o n the succeed ing * p o l i c y , bu t i t is an i m p o r t a n t po in t and has m a n y consequences. T h e f o l l o w i n g p roo f shows th is is t rue . Proposition 4.1 Given an optimal policy for a finite-horizon problem, the elements of the component + policy of half the horizon are not always identical to the elements of the component * policy of half the horizon. Proof. We begin with the hypothesis that the + policy is a lways identical to the * policy. To show that this is not the case, we need to find one particular system within the system definition at the beginning of this chapter that violates this hypothesis. We choose the case where the pre- and post-circuits are of length 1. This means na = 1 and n = 2 . We choose the case where the intermediate state j2 is identical to so and to s2. In Figure (4-11), the solid lines are the complete family of trajectories from t — 0 to t = 2, starting at a particular start state s0. The transition probability p and the transition cost c are shown on the figure alongside the associated transition. 
Let us say that the discount value, $\gamma$, is equal to 0.9. There are only 2 action choices at the state $s_0$, i.e., also at state j and $s_2$. The two families of trajectories produced from these two action choices are shown by the trajectories emanating from $s_0$ and from j. These choices are completely allowable by the system definition in Section 4.1. Figure (4.11) shows the description.

Figure 4.11: Why the + policy may be different from the * policy (trajectories from t = 0 to t = 2; transition probabilities p = 0.1, 0.9 and transition costs c = 1.1, 1.0 are marked alongside the transitions).

    u(s_0)   u(j)   two-step expected sum given s_{t=0} = s_0, u_{t=0} = u(s_0)
    0        0      1.1 + (0.9*0 + 0.1*1.1) = 1.21
    0        1      1.1 + (0.9*0 + 0.1*1.0) = 1.20
    1        0      1.0 + (0.1*0 + 0.9*1.1) = 1.99
    1        1      1.0 + (0.1*0 + 0.9*1.0) = 1.90

Table 4.1: 2 step expected discounted sum for all possible combinations of u(s_0) and u(j).

The family depicted by the solid lines can be separated into a set of pre-circuits and a set of post-circuits. The intermediate states are at t = 1. All the possible paths from $s_0$ to j are the pre-circuit associated with j. All the possible paths from j to $s_2$ are the post-circuit associated with j. Table 4.1 shows the 2 time-step expected discounted sum from $s_0$ for all possible combinations of u($s_0$) and u(j). As shown in Table 4.1, the best combination is u($s_0$) = 0 and u(j) = 1. Now note that the pre-circuits associated with j are different from the post-circuits associated with j. Remember that, given the beginning state, the end state, and the number of steps, the circuits can only differ because of the different policies. This means the pre-circuit policy is different from the post-circuit policy. Since the overall policy is optimal, the pre-circuit policy must be + optimal and the post-circuit policy must be * optimal. Therefore, one case is found where the + policy is different from the * policy for the same start states, the same end states, and the same time length. This shows that the + policy is not always identical to the * policy. □
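The arithmetic behind Table 4.1 can be checked mechanically. The short sketch below recomputes each row from the first-step cost, the two branch probabilities, and the second-step costs exactly as they appear in the table (it does not attempt to reconstruct Figure 4.11 itself; the variable names are purely descriptive), and confirms that the minimizing combination is u(s_0) = 0, u(j) = 1.

    # Recompute Table 4.1: two-step expected sums for every (u(s0), u(j)) pair.
    first_cost  = {0: 1.1, 1: 1.0}                 # first-step cost, by u(s0)
    branch_prob = {0: (0.9, 0.1), 1: (0.1, 0.9)}   # (zero-cost branch, costed branch), by u(s0)
    second_cost = {0: 1.1, 1: 1.0}                 # second-step cost on the costed branch, by u(j)

    table = {(u0, uj): first_cost[u0]
                       + branch_prob[u0][0] * 0.0
                       + branch_prob[u0][1] * second_cost[uj]
             for u0 in (0, 1) for uj in (0, 1)}

    for key in sorted(table):
        print(key, round(table[key], 2))           # 1.21, 1.20, 1.99, 1.90 as in Table 4.1
    print("best combination:", min(table, key=table.get))   # (0, 1)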
Issue #1 is that there is no elegant way of finding the + policy, or of constructing longer horizon + policies from shorter horizon + policies. This is reflected in the fact that there is no update or constructive equation for Q+. This means we cannot lessen the burden of exhaustive search through the control policy space for the time length from t = 0 to t = $n_a$. Also, the result of this exhaustive search cannot be re-used for * policies of different lengths because they may have different + policies. This is complicated further by Issue #3, to be presented below.

4.2.3 Issue #2

Issue #2 is the need for the estimation of the pre-circuit transition probabilities $p^{u_0,\bar{\mu}^+}_{s_0,s_{n_a}}$ and the post-circuit transition probabilities $p^{\bar{\mu}^*}_{s_{n_a},s_n}$. The post-circuit probabilities are needed to choose an action at t = $n_a$ to minimize the expected value in the Linkup Relation. Obtaining these post-circuit probabilities is relatively simple compared to obtaining the pre-circuit transition probabilities, because it is relatively easy to find the * policy. Suppose the * policy is obtained for the time t = $n_a$ to t = n; we can then use the system to supply the transition probabilities. One-step transition probabilities could be estimated by probing and observing the system over a period of time. Some method must be used to store the estimates. Once the complete one-step transition probabilities are obtained, $p^{\bar{\mu}^*}_{s_{n_a},s_n}$ can be found, for example, by Step 4 in Table 4.9.

The pre-circuit probabilities are needed to compute the contribution of a linkup to the n step Q. But since there is no ready-made + policy vector, if a model of the system is unavailable, then to obtain the pre-circuit transition probabilities we would have to exhaustively test over all possible pre-circuit policy vectors to get the probabilities and, through these probabilities, find the + policy. This process is clearly very expensive.
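For the post-circuit side, where the policy vector is known, composing stored one-step probabilities into multi-step ones is mechanical. The following is a minimal sketch under the assumption of a small tabular system, with numpy arrays standing in for the lookup tables mentioned above; nothing in it touches the harder, pre-circuit half of this issue.

    import numpy as np

    # Sketch: given one-step transition matrices P[u][i, j] = p_{i,j}(u) and a
    # policy vector mu = [mu_0, ..., mu_{n-1}] (each mu_t maps state -> action),
    # build the n-step transition matrix under that policy vector.
    def multi_step_probs(P, mu):
        n_states = next(iter(P.values())).shape[0]
        def step_matrix(policy_t):               # one-step matrix induced by mu_t
            M = np.zeros((n_states, n_states))
            for i in range(n_states):
                M[i, :] = P[policy_t(i)][i, :]
            return M
        result = np.eye(n_states)
        for policy_t in mu:                      # chain the per-time-step matrices
            result = result @ step_matrix(policy_t)
        return result                            # result[i, s_n] = n-step probability

    # toy usage with 2 states and 2 actions (illustrative numbers only)
    P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
         1: np.array([[0.1, 0.9], [0.7, 0.3]])}
    mu = [lambda s: 0, lambda s: 1, lambda s: 0]  # a 3-step policy vector
    print(multi_step_probs(P, mu))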
T h e w o r d " o n l y " is used because the goal is to b u i l d Q for hor izons tha t are effectively in f in i te . If the h o r i z o n is not effectively in f in i t e , the p e n a l t y is the storage of a l l Q* for every l eng th . T h i s p e n a l t y w o u l d o n l y be i n c u r r e d i f we want the a c t u a l expec ted d iscounted s u m of the payoffs to be i den t i ca l to tha t p r e d i c t e d b y the longest h o r i z o n Q*. In p rac t i ce , the h o r i z o n is u s u a l l y receding such tha t the h o r i z o n is fixed l eng th . If th i s is the case, Q*'s for shor ter hor izons can be d i sca rded after b u i l d - u p . T h e m e a n i n g of b u i l d - u p w i l l be clear la ter . B u t th is means the a c t u a l Q for a ce r t a in l eng th of t i m e w i l l be different t h a n the c o m p u t e d Q*. T h e p r o b l e m is m o r e serious for the p re -c i rcu i t po l i cy . In the worst case, we need a comple t e set of o p t i m a l p re -c i rcu i t p o l i c y vectors for pos t -c i rcu i t s of every l eng th . D u r i n g the b u i l d - u p t o w a r d Q* of effectively in f in i te hor izons , a n u m b e r of l i n k u p cycles have Chapter 4. The Problem Formulation and Proposed Solutions 80 f = 0 -— ^ = 1 Q F i g u r e 4.13: I l l u s t r a t i o n of the tree of t ra jector ies for different po l i c ies at each t i m e step. M a x i m u m h o r i z o n is 10. to be c o m p l e t e d . L e t ' s say N is the n u m b e r of l i n k u p cycles a n d i n each cyc le the post-c i r c u i t is different, w h i c h c o u l d ve ry w e l l be the case d u r i n g b u i l d - u p . Issue # 1 means tha t , i n the wors t case, we w o u l d have to p e r f o r m a cos t ly search for the + p o l i c y N t imes . T h e Q for the p re -c i r cu i t p e r i o d is a c t u a l l y specif ied by the i n p u t parameters {so, sna, Uo, u(si), ..., u(sna-i), na}. It is obvious tha t the size of i n p u t space grows q u i c k l y w i t h the n u m b e r of a c t i o n choices, the n u m b e r of states, and the l eng th of the p re -c i rcu i t h o r i z o n . In O n e - S t e p Q - L e a r n i n g , the con t ro l po l ic ies and the i r eva lua t ions are r e a l l y for f in i te hor izons . A s l e a rn ing progresses, the ac tua l hor izons increase f r o m 1 t o w a r d inf in i ty . T h e L i n k u p R e l a t i o n forces us to dea l w i t h the f in i te l eng th h o r i z o n p r o b l e m . In the conven t i ona l forms of Q - L e a r n i n g , th i s p r o b l e m is ignored by c o m p u t i n g expec t ed values over a ve ry l o n g t i m e , bu t i t is i m p l i c i t l y there nonetheless. A n c o - h o r i z o n o p t i m a l p o l i c y is not necessar i ly o p t i m a l i n the short t e r m . T h e stan-d a r d Q L a n d D P a l g o r i t h m s jus t seek a consis tent p o l i c y . If the c o n t r o l p r o b l e m is Chapter 4. The Problem Formulation and Proposed Solutions 81 • * * t = 0 t = n. t = n F i g u r e 4.14: C o n t r i b u t i o n s f r o m different t r a j ec to ry f ami l i e s . f i n i t e - t ime a n d th is t i m e is k n o w n , a co l l ec t i on of shorter t e r m pol ic ies w o u l d y i e l d be t te r pe r fo rmance . B y not d i s t i n g u i s h i n g be tween different a c t u a l ho r i zons , these a lgo r i t hms force the s y s t e m to choose o n l y one po l i cy . 
4.2.5 Issue #4

As shown by Figure (4.14), a linkup can happen at different intermediate states $s_{n_a}$ and for different end states $s_n$. Issue #4 is that, to find the + policy vector, we need to find the effect of all possible policies across all possible $s_{n_a}$'s and all possible $s_n$'s at the correct pre- and post-circuit transition probabilities. The effort it takes to do this is simply unthinkable. For example, referring to Figure (4.14), suppose the intermediate state of a linkup is j; we need to know the + policy vector. But it is unknown whether a pre-circuit policy vector is + optimal until we have computed Q($s_0$, $s_n$, $u_0$, n) for this policy vector, taking into account the pre-circuit transition probabilities produced by this policy vector and the post-circuit probabilities produced by the * policy vector. There is too much data to bring together for the evaluation of just one possible pre-circuit policy vector.

4.2.6 Linkup is useful in two situations

The above four issues can be avoided in at least two situations. First, if the system is deterministic [104]. Having a deterministic system eliminates the complications associated with the estimation, computation, and storage of transition probabilities. It also eliminates the problems with the + policy. Second, if there is only one policy operating at all times. This means the linkup relation can be used for the fast evaluation of the near infinite time performance of a single control policy.

4.2.7 Deterministic systems

If the system is deterministic, the system can be driven to any state reachable in a specific length of time. The purpose of the * policy is to take the system from a specified state to a specified end state in an exact length of time at minimal cost. The purpose of the + policy is to take the system from a specified start state to the best intermediate state at minimal cost. So these policies are similar except for the state specification. The issues with transition probabilities are irrelevant. This eliminates Issues #2 and #4 directly. Without transition probabilities, the meanings of the + policies and the * policies are much simpler. If it is known which intermediate state is the best, and if the pre- and post-circuits are equal in length, then the * policy would be + optimal for the pre-circuit as well. This eliminates Issue #1. The major operation here is choosing the best intermediate state. Regarding Issue #3, the Q* of every length still needs to be kept. This is so the control sequence can be executed to achieve the Q* of the longest horizon.
We also have the option of using a receding horizon during execution to eliminate this need.

4.2.8 Evaluating one policy at a time

If only one policy is being evaluated, there is no distinction between pre- and post-circuit policies and we can disregard Issue #1. Since the same policy is used at all times, Issue #3 can be disregarded as well. The transition probabilities still need to be accounted for, but now they are functions of $\{s_0, s_n, n\}$ rather than $\{s_0, s_n, u_0, \mu_1, \mu_2, \dots, \mu_{n-1}, n\}$. The storage of these probabilities is simplified. The transition probabilities can be used for both pre- and post-circuits. The transition probabilities can also be linked up in some way. These simplifications eliminate much of Issues #2 and #4.

The system may also be used to provide the transition probabilities, i.e., using an on-line algorithm. In this method, no estimates of transition probabilities need to be kept. Instead, in the manner of One-Step Q-Learning, the transitions are observed and the f's are changed by a small amount. These f's will be weighted by the different transitions at their sample frequencies after a sufficiently long period of time.

A Linkup Dynamic Programming (LDP) algorithm for evaluating one policy is useful in an actor-critic system. The critic's function is to quickly evaluate the mapping of the actor with respect to the optimality criterion. Ideally, the actor mapping is constant while it is being evaluated. If both the critic and actor mappings are adapted concurrently, the actor should change slowly enough for the critic to evaluate it correctly. In such a case, the quicker the critic can adapt to changes in the output of the actor, the more direct the relationship will be between the action output by the actor and the secondary reinforcement.

4.3 A Family of LADDP Algorithms for Deterministic Systems

The existing Q-Learning and incremental DP algorithms assume a stochastic system. This assumption is unnecessarily general for many important control problems and has resulted in unnecessarily weak algorithms for these problems. This section presents a variation of the Linkup Relation and a class of algorithms which assume a deterministic system. This class of algorithms has the characteristics of a greatly reduced number of linkup cycles and rapid growth of the horizon during learning.
The additional memory and computational requirements can be compensated for by the use of continuous function approximation, which is already required for most real world applications of Q-Learning algorithms.

The deterministic system would be almost identical to the stochastic one described in Section 4.1. The main difference is that $p_{i,j}(u)$ will be equal to 1 for exactly one state $j \in S$ and 0 for all other states. More so than the stochastic system, the system may be unreachable from some states to some other states in a fixed number of time steps, and sometimes some states are unreachable in any number of time steps from other states.

Without the complication of stochasticity, much of the complexity in the prototype algorithm disappears. We only need to find the intermediate state $s_{n_a}$ which optimizes the payoff of moving from $s_0$ to $s_n$. The deterministic form of the linkup relation (or the Deterministic Linkup Relation) is

    Q^*(i, l, u, n) = \min_{s_{n_a} \in S} \Big( Q^*(i, s_{n_a}, u, n_a) + \gamma^{n_a} \min_{u_{n_a} \in U(s_{n_a})} Q^*(s_{n_a}, l, u_{n_a}, n_b) \Big).   (4.42)

The minimization done inside the large parentheses will be called the inner minimization and the minimization done outside them the outer minimization. Note that the pre-circuit Q+ is now relabelled as Q*. A post-circuit type policy which optimally moves the system from i to l will, when used in the capacity of a pre-circuit type policy, also move the system from i to $s_{n_a}$ optimally. The effects of a pre-circuit policy are entirely captured in $s_{n_a}$. We have no need to contend with multi-step transition probabilities. So within the period from t = 0 to t = $n_a$, the meaning of an optimal policy is exactly the same as that of an optimal policy from t = 0 to t = n or from t = $n_a$ to t = n. Therefore, the minimized state-to-state Q* functions in the previous linkup cycle can be used directly as the pre-circuit Q*'s in the present cycle.
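A single linkup cycle of the Deterministic Linkup Relation is short enough to spell out. The sketch below assumes tabular Q* dictionaries whose horizons are fixed and therefore left out of the keys, with missing entries playing the role of the tag value for unreachable pairs; the names are illustrative, not the thesis's.

    # One linkup cycle of Eq. (4.42): combine na-step pre-circuit Q*'s with
    # nb-step post-circuit Q*'s into (na+nb)-step Q*'s.  Each table maps
    # (start_state, end_state, first_action) -> cost; absent entries mean
    # "unreachable" (the tag value of the value identification phase).
    def linkup_cycle(Q_pre, Q_post, states, actions, na, gamma=0.9):
        Q_new = {}
        for (i, s_na, u), pre_cost in Q_pre.items():
            for l in states:
                # inner minimization: best first action of the post-circuit at s_na
                candidates = [Q_post[(s_na, l, u_na)] for u_na in actions(s_na)
                              if (s_na, l, u_na) in Q_post]
                if not candidates:
                    continue                       # l unreachable from s_na
                value = pre_cost + gamma ** na * min(candidates)
                key = (i, l, u)
                # outer minimization: keep the best intermediate state s_na
                if key not in Q_new or value < Q_new[key]:
                    Q_new[key] = value
        return Q_new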
Three algorithms are devised to use this deterministic linkup relation, Eq. (4.42). Each of the three algorithms consists of two phases: a Value Identification Phase and a Build-up Phase. All three algorithms share the same value identification phase, as described in Table 4.2.

Step 1. Start with time index t = 0. Initialize the system to some state $s_0 \in S$.
Step 2. Excite the system with some action $u_0 \in U(s_0)$. Record the cost $c_0(s_0, u_0)$ in a lookup table as a function of the old state $s_0$, the new state $s_1$, and the action $u_0$. This is Q* for a horizon of 1, $Q^*(s_0, s_1, u_0, 1)$. (For Algorithm 1, this function has an additional input specifier of n = 1.)
Step 3. $s_0 := s_1$. Reset the time index t to 0. Randomly, or using a schedule, choose a new action $u_0$. If all possible combinations of old states $s_0 \in S$ and actions $u_0 \in U(s_0)$ have been tried, go to Step 4. If not, go to Step 2.
Step 4. Give a tag value, e.g., an impossible cost value, to entries in the lookup table which have not been filled. Exit.

Table 4.2: Value identification phase.

The algorithms differ in their build-up phases. A value identification phase observes the system responses and constructs the Q*-values for horizon 1. These values will be re-used to build longer horizon Q*'s. Compared to One-Step Q-Learning, which does not retain or re-use the observations, these algorithms use fewer input-output samples to build an optimal control policy.

The build-up phase of an algorithm follows the value identification phase. When the new control policy is completed, it will go into effect. For adaptability to system change, the value identification and build-up phases can be repeated continuously in parallel during the control of the system. The interleaving of value identification and build-up can be done to various degrees. It is not necessary to wait for a complete value identification cycle to finish before building up the control policy. This would give the control system adaptability between complete value identification cycles. The main price for this adaptability is that the system performance will be affected by the exploration.

4.3.1 Algorithm 1

Algorithm 1 stores pre-circuit Q*'s and post-circuit Q*'s of many different lengths in a common pool and asynchronously links them up to create longer Q*'s. This process follows Eq. (4.42) closely. If an end state l is unreachable from $s_{n_a}$ starting with $u_{n_a}$ in $n_b$ steps, a tag value is given to the corresponding Q*. Table 4.3 describes this algorithm.

In Step 2 of Table 4.3, the "replace" operation performs the outer minimization. The replace operator is used instead of a "sum" operator as in One-Step Q-Learning (see Eq. (3.29)) because the trajectories are deterministic; it is not necessary to stack trajectories together by summation. There are substantial redundancies in this algorithm, but it does permit the Q*'s a loose way of reaching the longer horizon without operationally requiring complete exploration of their input space. This might be a nice algorithm for generating useful control laws quickly without complete information. Obviously, many versions of this algorithm are possible. A useful version is one in which only Q*'s with n's greater than a certain lower bound are kept. This lower bound starts from 1 and is continually revised as the build-up phase progresses. This removes the short horizon Q*'s which are no longer being used to build up longer horizon Q*'s.
Step 1. From the set of existing $Q(s_{n_a}, l, u_{n_a}, n_b)$'s, disregarding their optimality, collect Q's of the same $\{s_{n_a}, l, n_b\}$ specification for all available $u_{n_a}$'s. Find the minimum with respect to $u_{n_a}$, i.e., $\min_{u_{n_a} \in U(s_{n_a})} Q(s_{n_a}, l, u_{n_a}, n_b)$.
Step 2. Compute the new $Q^*(i, l, u, n)$ according to
$Q^*(i, l, u, n) = Q^*(i, s_{n_a}, u, n_a) + \gamma^{n_a} \min_{u_{n_a} \in U(s_{n_a})} Q^*(s_{n_a}, l, u_{n_a}, n_b)$.
This new value replaces any larger value, or non-existent value, of the input specification $\{i, l, u, n\}$, $Q(i, l, u, n)$. Do this for all available Q's of the same pre-circuit input specification, $Q(i, s_{n_a}, u, n_a)$.
Step 3. Go to Step 1.

Table 4.3: Build-up phase of Algorithm 1.

4.3.2 Algorithm 2

Algorithm 2 builds up Q* functions of a fixed horizon, say m, for the pre-circuit and uses them as simple cost functions that replace the one-step costs in ADDP. This method is synchronous and builds Q* of a particular horizon completely before working on Q*'s of m-step longer horizon. Since we know the horizon of the pre-circuits, it is not necessary to have n as an input. This process' update equation is

    Q^*(i, l, u) := \min_{s_m \in S} \Big( Q^*(i, s_m, u) + \gamma^{m} \min_{u_m \in U(s_m)} Q^*(s_m, l, u_m) \Big).   (4.43)

Table 4.4 describes Algorithm 2. There is no need to use an estimate of Q* because all Q-values are optimal while the system is learning. With some function approximators, e.g., multilayer feedforward neural networks, the accuracy of approximation depends, at least initially, on the amount of resources spent on the learning. This means it may be a good idea to spend more resources on refining the approximation for a fixed length pre-circuit and perhaps a comparable amount of resources on the linkups as used in ADDP. The number of linkups is roughly m - 1 times less than the number of backups in a comparable ADDP algorithm. A comparable ADDP algorithm is the Synchronous Q-Learning Algorithm to be described in Chapter 5. The cumulative error of the linkups due to approximation errors may be lower than if a comparable amount of computational power is spent on ADDP.

Step 1. Choose 1 < M < H, where H is the desired horizon and H/M is an integer. Compute $Q^*(s_0, s_M, u_0, M)$ using one of the following options:
Option 1: Compute $Q^*(s_0, s_M, u_0, M)$ using the SQL algorithm described in Table 4.6.
Option 2: Use the Q* of horizon 1 from the value identification phase to build Q* functions of the desired pre-circuit horizon M, following
$Q^*(i, l, u) := \min_{s_m \in S} \big( Q^*(i, s_m, u) + \gamma^{m} \min_{u_m \in U(s_m)} Q^*(s_m, l, u_m) \big)$ with m = 1 for this step.
Option 3: Use any other workable method, such as Algorithm 1 or the Doubling Algorithm described in Section 4.3.3.
Step 2. Using the M-step Q* functions, $Q^*(s_0, s_M, u_0, M)$, computed in Step 1, build Q*'s of the final desired horizon H, $Q^*(s_0, s_H, u_0, H)$, using
$Q^*(i, l, u) := \min_{s_m \in S} \big( Q^*(i, s_m, u) + \gamma^{m} \min_{u_m \in U(s_m)} Q^*(s_m, l, u_m) \big)$ with m = M.
Step 3. Exit.

Table 4.4: Build-up phase of Algorithm 2.

4.3.3 The Deterministic Doubling Algorithm

The third and most intriguing algorithm is called the Deterministic Doubling Algorithm (or simply the Doubling Algorithm). Table 4.5 describes this algorithm. It uses the newly built Q*'s as both the pre-circuit Q*'s and the post-circuit Q*'s in the next linkup cycle. The horizon of the new Q*'s is twice that of the old Q*'s. This means the horizons are built up exponentially. Doubling builds an N-step horizon Q* function using only log2 N linkup cycles. This means an effectively infinite horizon Q function can be built rapidly. This method is synchronous and so does not require n in the input specification. There is no need to use an estimate of Q* because all Q-values are optimal while the system is learning.

Step 1. Use Q*'s of the existing horizon as both pre-circuit and post-circuit Q*'s to build new Q*'s according to
$Q^*(i, l, u) := \min_{s_m \in S} \big( Q^*(i, s_m, u) + \gamma^{m} \min_{u_m \in U(s_m)} Q^*(s_m, l, u_m) \big)$.
Step 2. Update the horizon variable m according to m := 2 * m.
Step 3. If m < M, where M is the desired horizon, go to Step 1.
Step 4. Exit.

Table 4.5: Build-up phase of the Deterministic Doubling Algorithm.

Not to over-emphasize the value and accuracy of analogies or of our own analysis of our thought processes, humans usually know roughly the cost of reaching a certain goal and the time it takes. The generation/retrieval of that knowledge is often an instant response, rather than a much slower mental simulation of our internal world model. This is an argument for the caching of mental simulations; or perhaps the kind of computation done is something like the computations done in this algorithm.
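The loop of Table 4.5 is compact enough to sketch directly. The helper linkup_cycle is assumed to behave like the single-cycle sketch given after Eq. (4.42), and Q1 stands for the horizon-1 table produced by the value identification phase; both names are assumptions for illustration.

    # Sketch of the build-up phase of Table 4.5 (Deterministic Doubling).
    def doubling_build_up(Q1, states, actions, desired_horizon, gamma=0.9):
        Q, m = dict(Q1), 1
        while m < desired_horizon:
            # the current Q*'s serve as both pre- and post-circuit tables,
            # so each cycle doubles the horizon: 1, 2, 4, 8, ...
            Q = linkup_cycle(Q, Q, states, actions, na=m, gamma=gamma)
            m *= 2
        return Q, m          # roughly log2(desired_horizon) linkup cycles in total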
4.3.4 The Synchronous Q-Learning Algorithm

To facilitate comparing various aspects of the algorithms proposed above with ADDP, a synchronous version of One-Step Q-Learning described by Eqs. (3.29) and (3.30) is devised. It will be called the Synchronous Q-Learning (SQL) algorithm. By synchronous it is meant that the Q* values for each actual horizon are built up simultaneously and are completed before the Q* values for the next longer actual horizon are computed. These Q-values are denoted $Q^*(s_0, s_n, u_0)$ because they are also parameterized by $s_n$. Table 4.6 describes the algorithm. The horizon parameters are omitted, since they can be easily tracked during the execution of this algorithm.

Relating to the discussion in Section 2.4.15, the Synchronous Q-Learning algorithm is one of the implied Action Dependent Dynamic Programming algorithms. The main difference between it and the One-Step Q-Learning algorithm is in the scheduling of the backup operation. In terms of the accumulation of approximation errors, this synchronous parent version shares the one-step nature of One-Step Q-Learning, but has the advantage that it allows us to focus on the properties of the backup operation itself rather than on the scheduling of backups.

Step 1. Set the current horizon n to 1.
Step 2. Initialize the system state to some $s_0 \in S$, according to a schedule or randomly.
Step 3. Apply action $u_0 \in U(s_0)$, according to a schedule or randomly. Record $c_t(s_0, u_0)$ in a lookup table. $s_0 := s_1$. If all possible state transitions have not been observed, reinitialize the system state and go to Step 2.
Step 4. Increment the current horizon n by 1.
Step 5. Using the set of $c_t(s_0, u_0)$ recorded in Steps 2 and 3, build up Q* according to
$Q^*(s_0, s_n, u_0) = c_t(s_0, u_0) + \gamma \min_{u_1} Q^*(s_1, s_n, u_1)$
until all Q* for the current horizon n are updated.
Step 6. If n < N, where N is the desired horizon, go to Step 4; else exit.

Table 4.6: The Synchronous Q-Learning Algorithm.
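For comparison, the backup sweep of Table 4.6 can also be sketched. The functions step(s, u) and cost(s, u) below stand for the deterministic one-step model recorded in Steps 2 and 3; they, and the other names, are assumptions for illustration. Unlike the linkup sketches above, the horizon here grows by one per sweep rather than doubling.

    # Sketch of the SQL build-up (Table 4.6): end-state-conditioned Q*'s grown
    # one step of horizon at a time.  Only the final-horizon table is returned
    # here; the algorithm itself keeps the table for every horizon.
    def sql_build_up(states, actions, step, cost, N, gamma=0.9):
        # horizon 1: Q*(s0, s1, u0) = cost of the observed transition
        Q = {(s, step(s, u), u): cost(s, u) for s in states for u in actions(s)}
        for _ in range(2, N + 1):                # one backup sweep per horizon
            Q_next = {}
            for s0 in states:
                for u0 in actions(s0):
                    s1 = step(s0, u0)
                    for sn in states:
                        tails = [Q[(s1, sn, u1)] for u1 in actions(s1)
                                 if (s1, sn, u1) in Q]
                        if tails:                # sn reachable from s1 in n-1 steps
                            Q_next[(s0, sn, u0)] = cost(s0, u0) + gamma * min(tails)
            Q = Q_next
        return Q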
(3.37) is now rep laced by / because the p o l i c y p is a l ready k n o w n . E a c h / is the expec t ed d i scoun ted cost over numerous t ra jec tor ies f r o m the start-i n g state SQ to the reached state sn. T h e / n o t a t i o n requires a s i m p l e r i n p u t p a r a m e t i z a -t i o n t h a n the Q n o t a t i o n . T h i s w i l l s ign i f ican t ly ease the storage issue. T h e m i n i m i z a t i o n opera tor i n E q . (3.37) has been e l i m i n a t e d because there is no longer a need to choose be tween different ac t ions . F r o m E q . (4.45) a n d E q . (3.37), an asynchronous a l g o r i t h m , shown i n T a b l e 4.7, is dev i sed where l i n k u p s t ake p lace as each state t r ans i t i on is observed , s i m i l a r to One-S tep Q - L e a r n i n g . ft is used be low as the es t imate of the / * at t i m e t. at = ctt(s0,sn,n) is the l e a rn ing rate a n d i ts va lue at the dth upda t e of the co r respond ing Q(so, sn, n) is assigned a c c o r d i n g to the s t a n d a r d s tochas t ic a p p r o x i m a t i o n manne r : oo oo 0 < aj, < 1, aj, —> 0 as d —> oo, ^  ad = c o , a n d a\ < oo. O n l y ce r t a in l i n k u p s of p re -c i rcu i t s a n d pos t -c i rcu i t s are p e r m i s s i b l e i n order to f o r m the e x p e c t e d values cor rec t ly . T h e record D i n Step 1 is i m p o r t a n t i n m a k i n g i t poss ible to o b t a i n an unb iased es t imate . T h e reasoning is tha t the n a - s t e p t r a n s i t i o n p r o b a b i l i t y p " j n o is r e a l i z e d i f the t r a n s i t i o n a c t u a l l y took p lace . N o t i c e tha t n o es t imates of t r a n s i t i o n p robab i l i t i e s are e x p l i c i t l y kept . Chapter 4. The Problem Formulation and Proposed Solutions 92 Step 1. E x e c u t e con t ro l ut at t i m e t a cco rd ing to the g i v e n p o l i c y a n d observe the t r a n s i t i o n f r o m state j to state k. U p d a t e a record D of {st} for t € {—T,.. • , 0} where t = —T means T t i m e steps i n the past . S tep 2. U p d a t e the 1-step costs as fol lows: ft(j, k, 1) = ct. Step 3. F o r a l l / a n d na where ft(k, I,, nb — 1) is ava i lab le , p e r f o r m b a c k u p as fol lows: / t + i ( i , I, nb) = (1 - at)ft(j, I, nb) + at (c 4 + -yft(k, I, nb - 1)) . S tep 4. F i n d a l l f(s0,sn,n) w i t h s0 — 5_„ £ D and sn = j, c o l l e c t i v e l y referred to as FA, a n d a l l f(so,sn,n) w i t h So — j, c o l l e c t i v e l y referred t o as FB-C o n s t r u c t new / ' s by l i n k i n g up the elements of FA a n d FB as fol lows: ft+i(i,l,n) = (1 - at)ft(i,l,n) + at (fa{i,j,na) + 'ynafb(j,l,nb)) where fa £ FA, fb G -FBI is a l ea rn ing rate a n d n = na + nb. D o th is for a l l c o m b i n a t i o n s of e lements of FA a n d FB- If -F/i is an e m p t y set, no l i n k u p is pe r fo rmed . It is imposs ib l e for FB to be an e m p t y set after S tep 3. T o ensure tha t the latest i n f o r m a t i o n is i n c o r p o r a t e d i n t o the new values of / ' s , these l i n k u p s are done i n ascend ing order of n. T a b l e 4.7: A s y n c h r o n o u s O n e P o l i c y L i n k u p D y n a m i c P r o g r a m m i n g . T h i s semi-off- l ine a l g o r i t h m is s i m i l a r to T D ( A ) i n purpose , to q u i c k l y evaluate the pe r fo rmance of one con t ro l p o l i c y at a t i m e . In a A C E and A S E s y s t e m [8], the upda te of the A C E a n d A S E m a p p i n g s are done i n lock step w i t h the ra te of d a t a rece ived f r o m the s y s t e m . 
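A minimal sketch of the updates in Table 4.7 may make the interplay of backups and linkups easier to follow. It is not the thesis implementation: the class and method names are invented, the learning rate uses a simple 1/d schedule (which satisfies the stochastic approximation conditions above), and the record D is approximated by a bounded history of recently visited states rather than the exact bookkeeping of which state was visited exactly n_a steps in the past.

```python
# Illustrative tabular sketch of the asynchronous one-policy linkup updates of
# Table 4.7; names and the simplified handling of the record D are assumptions.
from collections import deque

class OnePolicyLinkup:
    def __init__(self, gamma, history_len=64):
        self.gamma = gamma
        self.f = {}          # f[(s0, sn, n)] -> estimated discounted cost of the policy
        self.visits = {}     # per-entry update counts, used for the learning rate
        self.history = deque(maxlen=history_len)   # stands in for the record D

    def _alpha(self, key):
        d = self.visits[key] = self.visits.get(key, 0) + 1
        return 1.0 / d       # a 1/d schedule satisfies the stochastic-approximation conditions

    def _blend(self, key, target):
        a = self._alpha(key)
        self.f[key] = (1.0 - a) * self.f.get(key, target) + a * target

    def observe(self, j, k, cost):
        """Process one observed transition j -> k with its cost under the fixed policy."""
        self._blend((j, k, 1), cost)                                   # Step 2: 1-step cost
        for (s0, sn, n), val in list(self.f.items()):                  # Step 3: backups
            if s0 == k:
                self._blend((j, sn, n + 1), cost + self.gamma * val)
        pre = [(key, v) for key, v in self.f.items()
               if key[1] == j and key[0] in self.history]              # Step 4: F_A (pre-circuits)
        post = [(key, v) for key, v in self.f.items() if key[0] == j]  #         F_B (post-circuits)
        for (i, _, na), fa in sorted(pre, key=lambda kv: kv[0][2]):    # ascending-n approximation
            for (_, l, nb), fb in sorted(post, key=lambda kv: kv[0][2]):
                self._blend((i, l, na + nb), fa + (self.gamma ** na) * fb)
        self.history.append(j)
```

In use, observe(j, k, cost) would be called once per observed transition of the fixed policy, so the f-table grows only along trajectories the policy actually visits.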
T h i s m a y not be des i rable and a semi-off-l ine a l g o r i t h m for the A C E m a y be advantageous. 4.5 A Natural Algorithm for Stochastic Systems E q . (4.45) qu i t e n a t u r a l l y suggests a synchronous a l g o r i t h m where the l e a r n i n g is d i v i d e d i n t o an iden t i f i c a t i on phase a n d a b u i l d - u p phase, s i m i l a r to the D e t e r m i n i s t i c D o u b l i n g A l g o r i t h m presented ear l ier . T h i s a l g o r i t h m is presented here o n l y for d i scuss ion a n d for a n u m e r i c a l test to be presented i n C h a p t e r 5. T h i s is an off-line a l g o r i t h m i n the sense that the 1 t i m e step t r ans i t ions a n d costs of Chapter 4. The Problem Formulation and Proposed Solutions 93 the s y s t e m are iden t i f i ed i n an Identification Phase before the e v a l u a t i o n of the g iven p o l i c y begins i n a Build-up Phase. T h i s a l g o r i t h m w i l l be ca l l ed the Stochastic Dou-bling Algorithm ( S D A ) because i t is the s tochas t ic counte rpar t to the D e t e r m i n i s t i c D o u b l i n g A l g o r i t h m presented ear l ier . T a b l e 4.8 shows the Iden t i f i ca t ion Phase . I n i t i a l i z e the s y s t e m to some state s0 £ S a cco rd ing to a schedule . E x e c u t e the con t ro l p o l i c y PO(SQ). U p d a t e the s amp le f requency Ps 0 ,si ° f the t r a n s i t i o n so —> Si i n the t r a n s i t i o n p r o b a b i l i t y l o o k u p tab le , a cco rd ing to P „ " 0 where n " ° . is the observed n u m b e r of t imes tha t a c t i o n u0 was a p p l i e d w h e n the sys t em was i n state SQ and m a d e a t r a n s i t i o n to state S i , a n d n"° = Z ^ e s ™ " " , ^ ^ s ^ n e n u m h e r of t imes a c t i o n UQ a p p l i e d i n state SQ. R e c o r d the cost ct i n a l o o k u p tab le as a f u n c t i o n of { s 0 ) s i } , Ct{so,S\). T h i s is / for h o r i z o n of 1, / ( s o , s \ , 1). If p]Q are not accura te es t imates of p]0iSl, goto S tep 1. E x i t . ' T a b l e 4.8: Iden t i f i ca t ion Phase of the S D A . In T a b l e 4.8, we m a y want to i n i t i a l i z e the states acco rd ing to an i n i t i a l i z a t i o n schedule des igned to s a mp le the state space thoroughly . T h e design of the schedule is the user's choice a n d w i l l not be de ta i l ed here. In Step 2, r a n d o m noise s h o u l d not be in jec ted in to the c o n t r o l s igna l because the purpose is to accura te ly evaluate the g i v e n p o l i c y . T h e t r a n s i t i o n p r o b a b i l i t y l o o k u p tab le keeps the es t imates of pjSj for a l l d u r i n g the Iden t i f i ca t ion Pha s e and the es t imates of p™Sria for a l l {i, sna} and for m r a n g i n g f r o m 1 to the va lue n i n E x p r e s s i o n (4.44) d u r i n g the B u i l d - u p Phase . In S tep 3, the j u d g m e n t of the a c c u r a c y of the es t imates is done w i t h some s t anda rd s t a t i s t i c a l t echnique . A f t e r the iden t i f i ca t ion phase is done, the t r a n s i t i o n p r o b a b i l i t y t ab le a n d the / t ab le are b o t h for a h o r i z o n of 1. T h e y are used i n the B u i l d - u p Phase to c o m p u t e longer Chapter 4. The Problem Formulation and Proposed Solutions h o r i z o n / ' s . T a b l e 4.9 shows the B u i l d - u p Phase of the synchronous a l g o r i t h m . 
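Before the Build-up Phase in Table 4.9, the Identification Phase of Table 4.8 reduces to straightforward frequency counting. The sketch below is illustrative only: reset, step, and policy are assumed interfaces to the system and to the given policy, and the statistical stopping test of Step 3 is replaced by a fixed number of samples.

```python
# Illustrative sketch of the SDA Identification Phase (Table 4.8).
from collections import defaultdict

def identification_phase(reset, step, policy, num_samples):
    counts = defaultdict(int)   # counts[(s0, u0, s1)]: observed s0 --u0--> s1 transitions
    totals = defaultdict(int)   # totals[(s0, u0)]: times u0 was applied in s0
    f1 = {}                     # f1[(s0, s1)]: recorded 1-step cost, i.e. f(s0, s1, 1)
    for _ in range(num_samples):
        s0 = reset()            # the initialization schedule is the user's choice
        u0 = policy(s0)         # no exploration noise: the given policy is evaluated
        s1, cost = step(s0, u0)
        counts[(s0, u0, s1)] += 1
        totals[(s0, u0)] += 1
        f1[(s0, s1)] = cost
    p1 = {(s0, s1): c / totals[(s0, u0)] for (s0, u0, s1), c in counts.items()}
    return p1, f1
```

The returned tables p1 and f1 are exactly the 1-step transition probabilities and the f(s0, s1, 1) values that the Build-up Phase consumes.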
94 Choose des i red h o r i z o n and set up the fo l lowing counters : n := 2 n 0 := 1 D o u b l e the h o r i z o n of f(i, I, n) as fol lows: f(i,l,n):=ZsnaesPlasna (f(i, Sna,na) +r° f(snaJ,na)) ; Vi,l€S If n = des i red h o r i z o n , goto Step 5. D o u b l e the h o r i z o n of the t r a n s i t i o n p robab i l i t i e s as fol lows: n'a := na na := 2na i>?;ana :=£.„, € spS n i p;£,, . n a ; v?,s„a G s If des i red h o r i z o n is not reached, goto Step 2. E x i t . T a b l e 4.9: B u i l d - u p Phase of the S D A . S ince we k n o w the con t ro l p o l i c y and we are o n l y in teres ted i n the t rue va lue of E x p r e s s i o n (4.44), we can d i s ca rd a l l o lder f(i,l,n) and a s the B u i l d - u p Phase progresses. T h e nota t ions f(i,l,n), f(i,sna,na), and f(sna,l,na) c an be s i m p l i f i e d to / ( « , / ) , / ( « , - s n a ) , a n d f(sna,l), respec t ive ly , because we k n o w the values of n and na associa ted w i t h these var iables at a l l t imes . T h e nota t ions p?" , j3™° , a n d f>^'a . Jrlt3na 1 rl>3nf snf i s " a a T*a can be s i m p l i f i e d to p,-, S n , pilS , , a n d ps , i S n for the same reason. T h e storage of these na na a var iab les can be s i m p l i f i e d s i m i l a r l y . T h i s a l g o r i t h m seems a t t r ac t ive at th is po in t , bu t there is a n u m e r i c a l f law, to be discussed i n C h a p t e r 5, w h i c h is sufficient for the re jec t ion of not o n l y th is a l g o r i t h m bu t also mos t off-l ine, synchronous a lgo r i t hms for s tochas t ic sys tems. Chapter 4. The Problem Formulation and Proposed Solutions 95 4.6 Linkup Action Dependent Dynamic Programming and Hierarchies D e t e r m i n i s t i c L i n k u p A c t i o n Dependen t D y n a m i c P r o g r a m m i n g a l g o r i t h m s , spec i f ica l ly the D o u b l i n g A l g o r i t h m , fit i n t o hierarchies for d e t e r m i n i s t i c sys tems i n th i s manne r : • T h e lowest l eve l of a h i e ra rchy conta ins the 1 t i m e step Q values a n d 1 t i m e step c o n t r o l po l i c ies . • T h e nex t l eve l up conta ins the 2 t i m e step Q values a n d 2 t i m e step c o n t r o l po l i c ies . • T h i s i s r epea ted u p the h ie ra rchy , w i t h t he t op l eve l c o n t a i n i n g n t i m e s tep Q values a n d n t i m e step con t ro l po l ic ies . A l g o r i t h m 2 fits i n to h ierarchies s i m i l a r l y . T h i s correspondence shows how L i n k u p A c t i o n D e p e n d e n t D y n a m i c P r o g r a m m i n g fits i n w i t h a h i e r a rchy i f we cons t ruc t t he h i e r a r chy f r o m the b o t t o m leve l to the top . T h i s m a p p i n g covers o n l y the t e m p o r a l aggregat ion aspect of h ie rarchies . T h e spa t i a l aggregat ion a n d o ther aspects are not addressed. (These two types of aggregat ion is d iscussed i n m o r e d e t a i l i n Sec t i on 4.7.6.) Howeve r , L i n k u p A c t i o n Dependen t D y n a m i c P r o g r a m m i n g shows how to generate o p t i m a l con t ro l laws for th i s aspect of h ierarchies . 4.7 Goal Change Algorithms G o a l change is c lose ly r e l a t ed to h ierarchies because the uppe r levels of a h i e r a rchy shou ld be able to use the lower leve l sk i l l s for different goals. In th is sec t ion , 4 a l g o r i t h m s for chang ing the goa l of t he L i n k u p A c t i o n D e p e n d e n t D y n a m i c P r o g r a m m i n g con t ro l l e r are presented a n d discussed. 
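As a concrete rendering of the Build-up Phase equations in Table 4.9, the doubling of the f-values and of the transition probabilities can be sketched as below. This is for illustration only (the SDA is ultimately rejected in Chapter 5 for a numerical flaw); the function name, the dictionary layout, and the power-of-two horizon are assumptions, and p1 and f1 are the tables produced by the Identification Phase of Table 4.8.

```python
# Illustrative sketch of the SDA Build-up Phase (Table 4.9): repeatedly double
# the horizon of the policy's f-values and of the transition probabilities.
def sda_build_up(p1, f1, states, gamma, desired_horizon):
    p, f = dict(p1), dict(f1)
    na = 1
    while 2 * na <= desired_horizon:
        # Double the horizon of f using the current na-step transition probabilities.
        f_new = {}
        for i in states:
            for l in states:
                total = 0.0
                for s_mid in states:
                    prob = p.get((i, s_mid), 0.0)
                    if prob:
                        total += prob * (f.get((i, s_mid), 0.0)
                                         + (gamma ** na) * f.get((s_mid, l), 0.0))
                f_new[(i, l)] = total
        # Double the horizon of the transition probabilities.
        p_new = {}
        for i in states:
            for l in states:
                p_new[(i, l)] = sum(p.get((i, s_mid), 0.0) * p.get((s_mid, l), 0.0)
                                    for s_mid in states)
        p, f, na = p_new, f_new, 2 * na
    return f, p
```

The second inner loop is the usual composition of n-step transition probabilities (the Chapman-Kolmogorov relation), carried out once per doubling cycle.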
In C h a p t e r 5, the i r correctness a n d efficiency gains are discussed fur ther . W e w i l l r e t u r n to the top ic of h ierarchies i n Sect ions 4.7.5 to 4.7.7. G o a l change can be done w i t h i n L i n k u p A c t i o n Dependen t D y n a m i c P r o g r a m m i n g . T h i s requires a re -def in i t ion of the i m m e d i a t e costs ct. I n the p rev ious sect ions, r e w a r d Chapter 4. The Problem Formulation and Proposed Solutions 96 a n d p e n a l t y are c o m b i n e d in to one value as the i m m e d i a t e cost. In th is sec t ion , we redefine the goa l of the con t ro l sy s t em to that of be ing able to reach a goal s tate o p t i m a l l y , ra ther t h a n the m o r e general i m p l i e d goal of the i m m e d i a t e costs. F u r t h e r m o r e , we define the rewards a n d penal t ies tha t m a k e up the i m m e d i a t e costs i d e n t i c a l l y to those i n [85] for the C Q - L a l g o r i t h m . E a c h state incurs some penal ty , except for the goal s tate where i t also receives a r eward . T h e penal t ies m a y be different w i t h respect to the states bu t are i d e n t i c a l to any goal state we w i s h to define. T h e penal t ies are i d e n t i c a l for a l l goa l states. 4.7.1 One goal state Suppose after u s ing the D o u b l i n g A l g o r i t h m , w i t h o u t loss of general i ty , we have b u i l t u p a set of Q - m a p p i n g s for hor izons of 1 , 2 , 4 , . . . , TV for a ce r t a in goa l state. L e t us c a l l the effects of the r e w a r d componen t of a g iven goal state on the Q - m a p p i n g s the reward effects. W e c o u l d r emove the r e w a r d effects of th is goa l s tate f r o m the Q - m a p p i n g s and t h e n a d d the r e w a r d effects of another goa l state to these m a p p i n g s . If we c o u l d do th i s , we w o u l d be able to re-use the Q - m a p p i n g s to p roduce con t ro l laws for any goal s tate we l i k e . O f course we w o u l d l ike th is goal change p rocedure to be m o r e efficient t h a n the p rocedure to b u i l d up the Q - m a p p i n g s i n the first p lace . A s we w i l l see shor t ly , we can i n fact o b t a i n th is h igher efficiency. T o r emove the r e w a r d effects of the o r i g i n a l goal state, we p a r t i a l l y execute the D o u b l i n g A l g o r i t h m f r o m the 2 t i m e step h o r i z o n m a p p i n g to the n t i m e step h o r i z o n m a p p i n g . T h i s is ca l l ed the r e m o v a l pass. A remova l pass is m a d e up of l o g 2 TV cycles , where TV is the longest h o r i z o n . L e t Ss be a set of s tar t states. L e t Se be a set of end states. L e t P be a set of pa i rs of states sj, S[ where the first e lement of each pa i r is mean t to be a s tar t s tate a n d the last e lement of each pa i r is mean t to be an end state. These 3 sets are defined for a ce r t a in h o r i z o n . L e t a co r respond ing 3 sets be defined for the next Chapter 4. The Problem Formulation and Proposed Solutions 97 1. Ss := {sg}, Se := 0, P := 0, n := 2. 2. Let SV be the set of states reachable from sg in one time step. For all s i € 5 r , Q(sg, si, u0,1) := Q(sg, s i , u0,1) - r(sg) Se := Se U {Si} P:=PU{{sg,Sl}} 3. S'a := 0, ^ := 0, P ' := 0. 4. For all s n € S and for all s s 6 -So, if g(s s,s n,u 0 ,n) = min (<3(s„ sa, «0, f) + 7 f min Q(sa,5n,Ua,f)) then Q(ss,s„,u0,n) := min (Q(S0,SR,U0, f) + 72 min Q(s», 5; := S's U ^ := S'e U {aB} P ' : = P ' U { K , 5 n } } otherwise Q(ss,sn,u0,n) is not changed. 5. 
For all s0 £ S and for all se £ Se, if Q(s0,se,u0,n) = min (0(s o ,sa,u 0 , f) + 7* m j n . Q(*f,*e,«?,f)) then Q(50,se,u 0,n) := min (Q(s0, sa, u0, f) + 7 2 min Q(s<i, 2 2 2 5 e , « f , f ) ) := 5 i U {so} S'e := S'e U {se} P' :=P'U { { * „ , * . } } otherwise Q(so,se,uo,n) is not changed. 6. if n ^ N, n := 2 * n Ss := S's Se := ^ P := P' goto 3. 7. Exit. Table 4.10: The Removal Pass for a Single Goal State. Chapter 4. The Problem Formulation and Proposed Solutions 98 h igher h o r i z o n a n d be denoted by S's, S'e, and P' respect ive ly . T h e p rocedure to r emove the r e w a r d effects is ca l l ed the R e m o v a l P a s s as shown i n T a b l e 4.10. O n e cyc le of the r e m o v a l pass consists of Steps 3 to 6 i n T a b l e 4.10. T h e r a t iona le for the r e m o v a l pass as shown i n T a b l e 4.10 is the f o l l o w i n g , b e g i n n i n g w i t h the shortest h o r i z o n Q-values : W h e n the r eward componen t is r e m o v e d f r o m a goal state, o n l y one t y p e of Q-values is affected. T h i s t y p e is those Q ' s t h a t s tar t w i t h t he goa l state. A t the nex t h igher h o r i z o n , 2, two types of Q-values are affected: those tha t starts w i t h the goa l s tate a n d those tha t have the goal state as the i m m e d i a t e state. T o judge whe the r any p a r t i c u l a r 2 t i m e step Q f a l l i n t o one of these t w o types , w e check whe the r the o r i g i n a l Q is equa l to the resul t of a l i n k u p us ing a 1 t i m e step Q of one of the above types . If so, we k n o w tha t there is a p o s s i b i l i t y tha t r e m o v i n g the r e w a r d c o m p o n e n t w i l l i n v a l i d a t e t he o r i g i n a l Q. T h i s means we have to redo the l i n k u p o p e r a t i o n for t he same s tar t s tate a n d end state of th i s Q. If not , we k n o w tha t there is no p o s s i b i l i t y tha t r e m o v i n g the r e w a r d componen t w i l l i nva l ida t e the o r i g i n a l Q a n d we do not need to redo the l i n k u p o p e r a t i o n for t h i s Q. A f t e r we checked a l l c a n d i d a t e Q's o f h o r i z o n 2 a n d co l l ec t ed the s tar t and end state of a l l affected Q ' s , we use these s tar t a n d end states i n the same process as for the 2 t i m e step Q ' s to de t e rmine w h i c h 4 t i m e step Q's are affected. T h i s process is r epea ted u n t i l we have done the n t i m e step Q ' s . S ince a c y c l e conta ins a m i x t u r e of upda t e opera t ions for the t rue a n d false cases of the tests i n Steps 4 and 5 i n T a b l e 4.10, the upda te ope ra t i on i n the false case is a non -ope ra t i on , the u p d a t e ope ra t i on i n the t rue case is less in tens ive t h a n a L i n k u p ope ra t i on , the s tar t s tate is less t h a n or equa l to S, and the end state is less t h a n or equa l to S, the c o m p u t a t i o n a l effort is less t h a n a fu l l L i n k u p cyc le . N o t e t ha t s ince the D o u b l i n g A l g o r i t h m requires o n l y a few cycles a n d the n u m b e r of e lements i n P increases f r o m a ve ry s m a l l f r ac t ion , i n mos t d y n a m i c sys tems, of a l l poss ib le {s tar t state, end state} pa i r s , there is a large sav ing i n the first few cycles . A n Chapter 4. The Problem Formulation and Proposed Solutions 99 a d d i t i o n pass is also m a d e up of l o g 2 N cycles , where N is the longest h o r i z o n . T o a d d the r e w a r d effects of a new goal state, we do a s i m i l a r process as the R e m o v a l Pass . 
T h e i m m e d i a t e cost at the new goal s tate s'g is changed to i n c l u d e a r e w a r d r(s'g). T h i s process is c a l l ed the A d d i t i o n P a s s a n d i t is shown i n T a b l e 4 .11. T h e ra t iona le for the a d d i t i o n pass is s i m i l a r to the r a t iona le for the r e m o v a l pass. T h e difference is i n the test i n Steps 4 a n d 5. If, for a g iven h o r i z o n , u s ing the affected Q ' s of the i m m e d i a t e l y shor ter h o r i z o n , we can f ind a be t te r new Q t h a n the o r i g i n a l Q, t h e n th i s Q m u s t be affected b y the a d d i t i o n of a r eward c o m p o n e n t at the new goal s tate a n d i n fact i t is e x a c t l y equa l to the new Q. If the new Q ' s is no t be t t e r t h a n the o r i g i n a l Q ' s , we k n o w the o r i g i n a l QJs is the correct va lue . T h e t rue cases i n Steps 4 a n d 5 of T a b l e 4.11 is m u c h less in tens ive t h a n E q . (4.42) because the set of states over w h i c h the outer m i n i m i z a t i o n is done is m u c h less t h a n the f u l l s tate set for the first few cycles . E q . (4.42) is the m a i n u p d a t e equa t i on i n the D o u b l i n g A l g o r i t h m . T h e savings i n a cyc le of the a d d i t i o n pass c o m p a r e d to a cyc l e of genera l L i n k u p A c t i o n D e p e n d e n t D y n a m i c P r o g r a m m i n g c o m e f r o m th i s fact a n d the fact tha t the false case is a non-opera t ion . T h e fact t ha t we do not need to store the Q - m a p p i n g s for a l l i n t e r m e d i a t e hor izons f r o m 1 to i V is an i m p o r t a n t benefit of the D o u b l i n g A l g o r i t h m . T h e same savings c o u l d also be o b t a i n e d for A l g o r i t h m 2, a l t hough to a lesser extent . It is not c lear tha t the r e m o v a l / a d d i t i o n passes are poss ib le i f the a c t u a l h o r i z o n is t r ea ted as i f i t is in f in i t e , as is i n the case of One-S tep Q - L e a r n i n g . S ince the a d d i t i o n a n d r e m o v a l passes cannot be done for Q of different hor izons , they w o u l d l i k e l y to be done for the in f in i t e h o r i z o n Q a n d re i t e ra ted on the same Q ' s a n u m b e r of t imes . E a c h cyc le w o u l d t h e n be essent ia l ly a r e l a x a t i o n cyc le . If r e m o v a l / a d d i t i o n passes are poss ib le , the correct Q - m a p p i n g is p r o b a b l y ach ieved after a n u m b e r of r e l a x a t i o n cycles . T h e l ack of a n e n d state pa r ame te r i n the Q-values i n One-S tep Q - L e a r n i n g also poses a p r o b l e m . Chapter 4. The Problem Formulation and Proposed Solutions 100 's P' 1. | Ss := {s'g}, Se := 0, P := 0, n : = 2. 2. L e t 5,. be the set of states reachable f rom s'g i n one t i m e step. F o r a l l Si G Sr, Q(s'g, s i , u 0 , 1 ) := Q(s'g, si, u0,1) + r(s'g) 5 e := 5 e U P:=PU{{s'g,Sl}} 3. := 0, S'e := 0, P' := 0. 4. F o r a l l sn €. S a n d for a l l ss € 5 S , i f Q ( s „ s „ , u 0 , n ) > m i n (Q(SS,SR,U0,^) + ^ -min Q ( i a , i n ) u a , f ) ) 2 2 2 2 t h e n g ( 5 s , , „ , u 0 , n ) : = ^ g { m i n j € p (Q(S0, S,, „ 0 , f ) + 7 * ^ m i n ^ Q ( s f , s n , u f , f ) ) Si := 5 i U { 5 s } : = ^ U { s n } := P'U{{s3,sn}} otherwise s n , u 0 , n ) is not changed. F o r a l l s 0 € S a n d for a l l s e € Se, i f g(so,s e ,Uo ,rc) > m i n (Q(S0, , u 0 , f ) + 7* ™ j n , , a e , u a , f )) t h e n := ^ U {so} S'e := ^ U { . 
e } P ' : = P ' U { { s 0 , 5 E } } o therwise Q(-so, $e, uo, n) is not changed, i f n ^ N, n :=2 * n Ss := S's Se := S> P := P ' goto 3. E x i t . T a b l e 4.11: T h e A d d i t i o n Pass for a S ing le G o a l S ta te . Chapter 4. The Problem Formulation and Proposed Solutions 101 4.7.2 Two different applications I n one a p p l i c a t i o n , the r e m o v a l / a d d i t i o n passes c a n be used to solve m u l t i p l e goals w i t h a large s av ing i n c o m p u t a t i o n t i m e . In the second a p p l i c a t i o n , we are in te res ted i n be ing able to p roduce a c o n t r o l m a p p i n g w h i c h takes the goal s tate as an i n p u t . T h i s a l lows us to steer the s y s t e m t o w a r d any goal s tate we l i ke i n m i d course a n d k n o w tha t the c o n t r o l is o p t i m a l u n t i l the next goa l change. T h e a p p l i c a t i o n t o w a r d w a l k i n g mach ines , for e x a m p l e , is c lear . T h e opera tor of the m a c h i n e , h u m a n or no t , needs to adjust h is goals f requent ly w h i l e the m a c h i n e is i n m o t i o n because the e n v i r o n m e n t is c o m p l e x a n d m a y even be u n p r e d i c t a b l e . 4.7.3 The best goal change approach If our goal is to cons t ruc t a con t ro l m a p p i n g w i t h the goal s tate as an i n p u t , th i s is the best goa l change approach : T h e r e m o v a l passes are p r o b a b l y s ign i f i can t ly m o r e cos t ly t h a n t he a d d i t i o n passes because i n t he r emova l passes, the ou te r m i n i m i z a t i o n is over S whereas i n the a d d i t i o n passes, the outer m i n i m i z a t i o n is over a subset of Ss or Se. Idea l ly we do not want to pe r fo rm the r emova l passes at a l l d u r i n g the c o n s t r u c t i o n of th i s c o n t r o l m a p p i n g . If d u r i n g the l ea rn ing of the o r i g i n a l Q - m a p p i n g s , a goal state is not def ined, t h e n the m a p p i n g w o u l d not con ta in the r e w a r d effects of a goal state. T h i s means no r e m o v a l pass is r equ i r ed to remove the r e w a r d effects a n d the o r i g i n a l Q - m a p p i n g s c a n be used to p roduce Q - m a p p i n g s for any goal s tate b y us ing o n l y the a d d i t i o n passes. 4.7.4 More than one goal state U n t i l now i n th is sec t ion , the goal s tate is assumed to be a s ingle goa l s tate. In some s i tua t ions , we w o u l d be satisfied i f the sys t em spends as m u c h t i m e as poss ib le i n a set Chapter 4. The Problem Formulation and Proposed Solutions 102 of goa l states, to be ca l l ed the g o a l s e t and denoted by Sg. T h e r e m o v a l a n d a d d i t i o n passes c a n be m o d i f i e d to a c c o m m o d a t e th is . S u p p o s i n g tha t we want to r emove one goal set a n d replace i t w i t h another , the r emova l and a d d i t i o n passes are m o d i f i e d as i n T a b l e 4.12 a n d T a b l e 4.13, respect ive ly . A r e m o v a l pass is m a d e up of l o g 2 cycles , where N is the longest h o r i z o n . O n e c y c l e of the r e m o v a l pass consists of Steps 3 to 6 of T a b l e 4.13. A n a d d i t i o n pass is also m a d e u p of l o g 2 N cyc les , where N is t he longest h o r i z o n . T h e i m m e d i a t e costs of new goal set S'g are adjusted to i n c l u d e new r e w a r d values. O n e c y c l e of the a d d i t i o n pass consists of Steps 3 to 6 of T a b l e 4.13. 
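For the single-goal-state case, one cycle of the Addition Pass in Table 4.11 can be sketched as follows. The sketch is illustrative rather than definitive: it merges Steps 4 and 5 into one loop over the stored Q-values, stores the Q-mappings as dictionaries q[h] keyed by (start state, end state, first action), and omits Step 2 and the outer driver that doubles n and carries S'_s, S'_e, and P' forward to the next cycle; all names are hypothetical.

```python
# Illustrative sketch of one cycle (Steps 3 to 6 combined) of the Addition Pass
# for a single goal state (Table 4.11). q[h] maps (start, end, action) to the
# h-step Q-value; S_s, S_e, P are the affected sets from the previous horizon.
def addition_cycle(q, n, S_s, S_e, P, actions, gamma):
    half = n // 2
    S_s2, S_e2, P2 = set(), set(), set()

    def best_through(s0, s_end, u0, mids):
        # Redo the linkup for Q(s0, s_end, u0, n), but only through the affected
        # intermediate states in `mids`; return None if no route improves.
        best = None
        for s_m in mids:
            pre = q[half].get((s0, s_m, u0))
            if pre is None:
                continue
            post = min((q[half].get((s_m, s_end, u), float("inf"))
                        for u in actions(s_m)), default=float("inf"))
            cand = pre + (gamma ** half) * post
            if best is None or cand < best:
                best = cand
        return best

    for (s0, s_end, u0), old in list(q[n].items()):
        mids = set()
        if s0 in S_s:                      # Step 4: affected pre-circuits start at s0
            mids |= {b for (a, b) in P if a == s0}
        if s_end in S_e:                   # Step 5: affected post-circuits end at s_end
            mids |= {a for (a, b) in P if b == s_end}
        new = best_through(s0, s_end, u0, mids) if mids else None
        if new is not None and new < old:  # the added reward reaches this Q-value
            q[n][(s0, s_end, u0)] = new
            S_s2.add(s0)
            S_e2.add(s_end)
            P2.add((s0, s_end))
    return S_s2, S_e2, P2
```

Only Q-values whose start state lies in S_s or whose end state lies in S_e are ever re-linked, and then only through affected intermediate pairs recorded in P, which is where the savings over a full linkup cycle come from.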
In the ex t r eme , every state c o u l d have" a r eward , p r e s u m a b l y different rewards at different states. A s the set of goa l states approaches the comple t e s tate space, the c o m p u t a t i o n a l t i m e r equ i r ed w i l l approach but not exceed the t i m e r equ i r ed to b u i l d the first Q - m a p p i n g set. T h e r e m o v a l a n d a d d i t i o n passes of th is subsec t ion a n d Subsec t i on 4.7.1 c a n be per-f o r m e d i n d e p e n d e n t l y of each other . F o r e x a m p l e , we can replace a goa l s tate w i t h a goa l set or a goa l set w i t h a goa l state. F o r the r e m o v a l / a d d i t i o n passes, the D o u b l i n g A l g o r i t h m grea t ly reduces the n u m b e r of cycles i n b o t h passes. In contras t , i f Synchronous Q - L e a r n i n g is used , there w o u l d be as m a n y cycles as the l eng th of the h o r i z o n . Synchronous Q - L e a r n i n g differs f r o m O n e - S t e p Q - L e a r n i n g i n tha t i t has the end state as a pa r ame te r i n the Q n o t a t i o n . I n not h a v i n g th i s e n d state pa rame te r beside other i n c o m p a t i b i l i t i e s , O n e - S t e p Q - L e a r n i n g m a y not be able to take advantage of the r e m o v a l / a d d i t i o n passes. 4.7.5 E x t r a c t i n g m a c r o s In Sec t i on 3.4.2, the b u i l d i n g of macros was discussed. In th is sec t ion , we specula te on mac ros m a y be b u i l t u s ing the goal change procedures . N o c l a i m is b e i n g m a d e tha t the Chapter 4. The Problem Formulation and Proposed Solutions 103 1. Ss := Sg, Se := 0, P := 0, n : = 2. 2. D o for a l l sg (E Sg, let SV be the set of states reachable f r o m sg i n one t i m e step. F o r a l l s\ & Sr, Q(sg,suu0,l) := Q(sg,suuQ,l) - r(sg) Se := Se U {si} P : = P U { { 5 s , 5 l } } 3. S's := 0, S£ := 0, P' := 0. 4. F o r a l l sn £ S a n d for a l l s3 £ Ss, i f Q ( s s , s n , u 0 , n ) = m i n (Q(sat SR, u 0 , f ) + 7* m j n x t h e n Q(ss,sn,u0,n) := minL ( Q ( S 0 , s » , u 0 , f ) + 7* m j n , <2(s?, S n , ^ , f ) ) ^ := S'e U K ) P ' : = P ' U { { 5 S , 6 N } } o therwise Q(ss,sn,u0,n) is not changed. 5. F o r a l l s0 € S a n d for a l l se € Se, i f Q (>o , s e , uo , r c ) = m i n (Q{S0, sn., « 0 , f ) + 7* m j n , Q ( a » , s e , u n , 2 ) ) t h e n Q(s0,se,u0,n) := m i n ( Q ( S 0 , s n , « 0 , f ) + 7^ m i r n x Q ( s a , 2 2 2 3 e , t i f , f ) ) 5; := 5 i U {so} S'e := S'e U {se} P' :=P'U{{s0,se}} otherwise Q(SQ, se, Uo, n) is not changed. 6. i f n ^ TV, n := 2 * n 5 . := P := P' goto 3. 7. E x i t . T a b l e 4.12: T h e R e m o v a l Pass for M u l t i p l e G o a l States . Chapter 4. The Problem Formulation and Proposed Solutions 104 1. Ss := S'g, Se := 0, P := 0, n := 2. 2. D o for a l l ^ € S'g, let SV be the set of states reachable f r o m s'g F o r a l l si £ SV, i n one t i m e step. Q ( a ^ , 3 i , u 0 , l ) := Q(a ^ s i,Mo,l) - r ( s ^ ) S'e := S'e U {Si} P:=PU{{s'g,Sl}} 3. S's := 0, 5 e := 0, P ' := 0. 4. F o r a l l sn (E S a n d for a l l ss £ 5 S , i f Q ( s s , s „ , u 0 , n ) > m i n (<3(s s, s a , u 0 , f ) + 7 ^ m i n tine[/(sn) Q(s±,sn,u±,fj) t h e n Q ( s s , s n , u 0 , n ) := m i n ( Q ( s 0 , s a , u 0 . f ) + 7 * m i n Q{s% , 3 „ , « a , | ) ) 5 : := U {,.} ^ := S'e U { s n } P ' : = P ' U { { s s , s n } } o therwise Q(ss, sn, uo, n) is not changed. 5. F o r a l l s 0 € S a n d for a l l s e 6 S e , i f Q(s0,se,u0,n) > m i n ( Q ( S 0 , S £ , U 0 , j ) + 7 * m i n Q ( s ? 
, S e , U f , f ) ) t h e n Q ( s 0 , s e , w 0 , r a ) := m i n (Q(S0,S*,U0 sn.e{sxi,se}eP \ 2 f ) + 7 * m i n Q ( s a , S e , U " , f ) ) 5J := S's U { s 0 } 5 e := S'e U { s e } P ' : = P ' U { { s 0 , S e } } otherwise Q(s0, se, uo, n) is not changed. 6. i f n / N, n := 2 * n : = S"s S'e := S"e P := P' goto 3. 7. E x i t . T a b l e 4.13: T h e A d d i t i o n Pass for M u l t i p l e G o a l States . Chapter 4. The Problem Formulation and Proposed Solutions 105 genera l p roposa l to be m a d e i n th is sec t ion is the comple t e answer to the b u i l d i n g a n d u t i l i z a t i o n of mac ros . Suppose we have b u i l t the Q funct ions w i t h o u t spec i fy ing a goal state. T h e n we w o u l d p e r f o r m the a d d i t i o n pass for a large n u m b e r of goa l states. These goa l states co r re spond to the seed goals d iscussed i n Sec t ion 3.4.2. A s we p e r f o r m the a d d i t i o n passes for these seed goals, we col lec t a set of c o m m o n l y used con t ro l a n d state t ra jec tor ies . S t ra igh t -fo rward names c o u l d be assigned to these con t ro l and state sequences, such as a n a m e c o m p o s e d of the star t state, the end state, and the h o r i z o n . E a c h c o n t r o l sequence is specific to the star t of the co r respond ing state sequence. These t h e n are the macros . A s the set of macros becomes larger and more comple t e i n t e rms of b e i n g able to f o r m effective c o n t r o l a n d state t ra jector ies for most new goals we specify, the sys t em becomes inc reas ing re l ian t on macros ins tead of the comple t e set of Q values for a l l state c o m b i n a t i o n s a n d for var ious hor izons . In a sense, we c o u l d say tha t the sys t em is b e g i n n i n g to l ea rn i n a h igher l eve l representa t ion . F r o m another pe r spec t ive , the sy s t em learns to res t r ic t i ts search space to o n l y those con t ro l a n d state t ra jector ies tha t proves to be genera l ly useful . 4.7.6 Temporal versus spatial aggregation If a n u m b e r of n e i g h b o u r i n g states are g rouped together as a super state, i t c o u l d be ca l l ed spatial aggregation. If a n u m b e r of n e i g h b o u r i n g t i m e per iods or po in t s i n t i m e are g rouped together , i t c o u l d be ca l l ed temporal aggregation. B o t h t e m p o r a l a n d spa t i a l aggregat ions are necessary i n a h i e r a r ch i ca l con t ro l l e r [97]. 4.7.7 Comparison with related works L i n [105] approaches h i e r a r c h i c a l l ea rn ing for M a r k o v i a n , d e t e r m i n i s t i c , a n d discrete sys tems f r o m a state aggregat ion perspec t ive . T h e base state space is d i v i d e d i n t o regions Chapter 4. The Problem Formulation and Proposed Solutions 106 of a p p r o x i m a t e l y equa l size a n d shape. These regions are ca l l ed abs t rac t states. A b s t r a c t states serve as subgoals . A b s t r a c t act ions are close-loop c o n t r o l po l ic ies w h i c h takes the s y s t e m f r o m the i r subset of base states to the i r co r respond ing abs t rac t states. E a c h such subset is c a l l ed the a p p l i c a t i o n space of the abst ract a c t i o n . T h e a p p l i c a t i o n space is the set of base state f r o m w h i c h the abs t rac t state can be reached w i t h i n I steps, where / is the m a x i m u m dis tance be tween any two ne ighbor ing abs t rac t states. 
T h e abs t r ac t ion can be m u l t i - l e v e l , m e a n i n g tha t the abst ract states and ac t ions of one l eve l can be used as base states a n d ac t ions for the nex t h igher l eve l . H a v i n g define these concepts , L i n ' s h i e r a r ch i ca l l ea rn ing t echn ique funct ions is ve ry br ief ly as fol lows: 1) M a n u a l l y select the abst ract states and define the abs t rac t act ions for the base l eve l . 2) T r a i n each abs t rac t ac t ion i n i ts e x p l o r a t i o n space, w h i c h covers i ts a p p l i c a t i o n space. K e e p o n l y those abs t rac t act ions w h i c h succeeds i n t a k i n g the sy s t em f r o m the i r defined a p p l i c a t i o n space to the i r defined abs t rac t space. 3) T r a i n a h igh- l eve l p o l i c y to de t e r mine the o p t i m a l abs t rac t ac t ion for each abs t rac t space. L i n ' s t echn ique differs f r o m the L i n k u p approach i n tha t 1) i t is done i n a s tate aggre-g a t i o n perspec t ive ; 2) the different goals ex i s t i ng s i m u l t a n e o u s l y i n one con t ro l sy s t em means the t op l eve l c o n t r o l sy s t em can be s u b o p t i m a l (see p . 86, [105]). It is possi-b le tha t L i n ' s t echn ique is c o m p l e m e n t a r y w i t h the L i n k u p app roach a n d i f successful ly c o m b i n e d , w o u l d i n c l u d e b o t h spa t i a l and t e m p o r a l aggregations. M a e s a n d B r o o k s [106] desc r ibed a l ea rn ing techn ique for the s u b s u m p t i o n archi tec-tu re [78]. T h e s u b s u m p t i o n a rch i tec tu re is a d i s t r i b u t e d con t ro l l e r m a d e up of m a n u a l l y des igned a t o m i c behav iour s a n d m a n u a l l y designed sys t em of t r iggers to ac t iva te the behav iour s . A b e h a v i o u r is an a c t i o n or ac t ion sequence w h i c h can be ac t iva t ed w h e n i ts p re -cond i t ions are sat isf ied. T h e l ea rn ing techn ique can re l ieve the designer f r o m h a v i n g t o d e t e r m i n e h o w a n d w h e n the behav iours are t r iggered . In the i r d e m o n s t r a t i o n of a 6 legged w a l k i n g robo t l ea rn ing to wa lk , the p r o b l e m is s t ruc tu red such tha t o p t i m i z a t i o n Chapter 4. The Problem Formulation and Proposed Solutions 107 w i t h respect to each i m m e d i a t e cost ra ther t h a n to a sequence of i m m e d i a t e costs is sufficient to cause s i m p l e w a l k i n g behav iours to emerge. A d u a l for M a e s and B r o o k s ' approach is proposed by M a h a d e v a n a n d C o n n e l l [107]. In the fo rmer case, the a t o m i c behav iours are h u m a n designed a n d the c o o r d i n a t i o n of the behav iour s is l ea rned . In the la t te r case, the c o o r d i n a t i o n is first h a n d des igned and t h e n the a t o m i c behav iour s are learned . S i n g h [85] p roposed a t e m p o r a l aggregat ion f o r m a l i s m for the conca t ena t ion of tasks. In h is f o r m a l i s m , a compos i t e task is decomposed in to e l emen ta l tasks a n d lower l eve l c o m p o s i t e tasks. A n asynchronous , c o m p e t i t i v e l ea rn ing a l g o r i t h m is p roposed for the c o n s t r u c t i o n of h igher l eve l Q-funct ions f rom lower leve l Q-func t ions . T h i s app roach does not o p t i m i z e a s ingle goa l o p t i m a l con t ro l task i n the sense tha t each e l e m e n t a l task or lower l eve l c o m p o s i t e task m a y be s o l v i n g a different task, e.g., a different goa l s tate. 
T h i s means some c o m p r o m i s e is i m p l i c i t i n the so lu t ion of the top l eve l c o m p o s i t e task. In c o m p a r i s o n , the goal change approach discards the o r i g i n a l goa l states before accoun t ing for the new goal states. G i v e n the same p r o b l e m , S ingh ' s app roach w i l l p r o b a b l y y i e l d different c o n t r o l l aws f r o m conven t i ona l D P a l g o r i t h m s , whereas a l l a l g o r i t h m s p roposed i n th i s thesis w i l l y i e l d con t ro l laws i d e n t i c a l to those f r o m c o n v e n t i o n a l D P a lgo r i t hms , i f a p p r o x i m a t i o n errors are not present. 4.8 Deviation from the Core Algorithms Sect ions 2.4.7 to 2.4.10 showed the process i n w h i c h the core and r e l a t i ve ly r i g i d D P a l g o r i t h m is g r a d u a l l y r e l axed in to p a r t i a l upda te versions and la ter i n to i n c r e m e n t a l vers ions . L i k e w i s e , the L i n k u p A c t i o n Dependen t D y n a m i c P r o g r a m m i n g a lgo r i t hms are deve loped here i n the i r core forms. It is expec ted tha t some of the fur ther deve lopments w o u l d be i n r e l a x i n g the var ious aspects of the core L i n k u p A c t i o n Dependen t D y n a m i c Chapter 4. The Problem Formulation and Proposed Solutions 108 P r o g r a m m i n g a l g o r i t h m s presented i n th is thesis . T h e s tate reso lu t ions c a n be r e l axed . F o r e x a m p l e , the states m a y be aggregated as i n M o o r e ' s V a r i a b l e R e s o l u t i o n D y n a m i c P r o g r a m m i n g (see Sec t i on 2.4.18). T h e n l i n k u p w o u l d be done over these lower r e so lu t ion aggregates. O n c e reasonably g o o d t ra jector ies are f o u n d t h r o u g h these aggregates are ach ieved , l i n k u p c a n t h e n be done over the nex t h igher r e so lu t i on aggregat ion w i t h i n the t ra jectory . T h e process is g r a d u a l l y ref ined u n t i l the t r a j ec to ry is i n the highest s tate reso lu t ion . T h e r equ i remen t tha t the core a lgo r i t hms of L A D D P updates the Q values over a l l states c a n be r e l axed . F o r e x a m p l e , i t m a y be a good i d e a to keep a queue of the i n t e r m e d i a t e states w h i c h caused the largest r e d u c t i o n i n the Q values i n recent l i n k u p s , s i m i l a r to M o o r e ' s P r i o r i t i z e d Sweep ing me thods . K e e p i n g t r ack of the best i n t e rmed ia t e states is c lose ly re la ted to the b u i l d i n g of useful macros w h i c h is concerned w i t h keep ing t r ack of the best s tate t ra jector ies . T h e i m p l e m e n t a t i o n of the inner and outer m i n i m i z a t i o n m e t h o d s can also be r e l axed . F o r e x a m p l e , we m a y choose to accept a near m i n i m u m i f th i s resul ts i n a large r e d u c t i o n i n c o m p u t a t i o n a l effort. T h e r eco rd keep ing of the t imes i n the L A D D P a l g o r i t h m s can also be r e l axed . Per -haps we c o u l d t reat the f in i te h o r i z o n Q values as in f in i te ho r i zons . T h i s thesis presents the core a l g o r i t h m s , but i n p rac t i ce i t is e x p e c t e d tha t var ious r e l axed versions w i l l a c t u a l l y be used ins tead of t h e m . T h e core a l g o r i t h m s guarantee o p t i m a l i t y o n a f i xed schedule . T h e r e l axed versions w i l l genera l ly t r ade off t h i s guarantee for r educed c o m p u t a t i o n a l effort. Chapter 4. 
4.9 Summary

In this chapter, the dynamic system to be optimally controlled is defined. A prototype algorithm is put forth as a basis for thinking about the issues involved in applying the Linkup Relation. Then 4 issues that complicate the application are described. Two situations are identified which permit these issues to be circumvented. The first situation is when the system is deterministic. The second is when we are interested in evaluating one control policy at a time. A total of 4 algorithms is proposed for these situations. An algorithm which seems natural for Eq. (4.45) is proposed only for discussion and a numerical test later. The correspondence between Linkup Action Dependent Dynamic Programming and learning of optimal control in hierarchies is discussed. Algorithms for changing the goal of the controller are presented. In the next chapter, theoretical and simulation results are presented.

Chapter 5

Evaluation of Proposed Algorithms

In this chapter, theoretical and simulation results are presented for the 3 algorithms proposed in the last chapter for deterministic systems and for the 1 algorithm for stochastic systems. The results for the deterministic algorithms are on the correctness of the algorithms, on their accumulation of approximation errors, and on the effects of the approximation errors on the control policies and their performance. The results for the stochastic algorithms are on the correctness of the algorithms and on the smoothness of the learning.

5.1 On the Algorithms for Deterministic Systems

Several results are obtained for the 3 algorithms proposed for deterministic systems. First, proofs will be presented on the correctness of the Deterministic Linkup Relation and the correctness of the 3 algorithms. Next, the proofs for several upper bounds for the SQL algorithm, Algorithm 2, and the Doubling Algorithm will be presented. The upper bounds are then illustrated numerically. Then, a series of simulations of these 3 algorithms is done incorporating spatially and temporally correlated errors. In the accumulation of approximation errors, Algorithm 2 and the Doubling Algorithm are shown to be much better behaved in the mean and variance, with the Doubling Algorithm being the best of them. In terms of the degradation of the control law and control performance, Algorithm 2 and the Doubling Algorithm are shown to be more tolerant and less sensitive to the accumulated approximation errors.
5.1.1 Proof of the Deterministic Linkup Relation

The deterministic system was described in Section 4.3, where the following is claimed, in more precise terms here:

Theorem 5.1 Given

1. a deterministic system with a set of states $S$ and action sets $U(s \in S)$ which has deterministic transitions $\{s_t, u_t\} \rightarrow s_{t+1}$, where $s_t, s_{t+1} \in S$ and $u_t \in U(s_t)$, and the system has costs $c(s_t, u_t)$ attached to state transitions;

2. a control policy at each time step $t \in \{0, \ldots, n-1\}$ defined as the set $\mu_t$ of actions $u(s) \in U(s)$, $\forall s \in S$, one unique action for each state;

3. a vector of control policies for $t \in \{0, \ldots, n-1\}$ denoted by $\vec{\mu}$, with the optimal vector denoted by $\vec{\mu}^*$; and

4. set(s) of optimal (or minimal) finite-horizon discounted sums of costs,
$$Q^*(i, l, u, m) = \min_{\vec{\mu}} \left[ \sum_{t=0}^{m-1} \gamma^t c_t \;\Big|\; s_0 = i,\; s_m = l,\; u_0 = u \right]$$
for one or more horizons $m$, where $0 < \gamma < 1$ is the discount factor and each set for a particular $m$ is complete, i.e., a set for a particular $m$ contains $Q^*(i, l, u, m)$'s for all possible $\{i, l\} \in S$ and $u(i) \in U(i)$,

we can compute a $Q^*(i, l, u, n)$ for a longer horizon, i.e., $n$ is greater than all $m$'s in Item 4 above, in the following manner:
$$Q^*(i, l, u, n) = \min_{s_{n_a} \in S} \left( Q^*(i, s_{n_a}, u, n_a) + \gamma^{n_a} \min_{u_{n_a} \in U(s_{n_a})} Q^*(s_{n_a}, l, u_{n_a}, n_b) \right)$$
where $n_a$ and $n_b$ are both one of the $m$'s in Item 4 above and $n_a + n_b = n$.

Proof.
s*la must be the element of S which minimizes this expression: Q*(so,sna,u0,na) + m i n (inaQ*{sna,sn,una,n - n 0 ) j . This means Q*(s0,sn,u0,n) = m i n (Q*{so,sna,u0,na) + mm^"Q*(sna,sn,una,n - na)J . (5.51) Chapter 5. Evaluation of Proposed Algorithms 114 • In E q . (5.51), the m i n i m i z a t i o n done ins ide the large parentheses w i l l be ca l l ed in-ner minimization a n d the m i n i m i z a t i o n done outs ide t h e m the outer minimization. N o t e tha t the p re -c i r cu i t Q+ is now re labe l l ed as Q*. If there is no s tochas t ic i ty , the c o m p l i c a t i o n desc r ibed at the end of the last sect ion is e l i m i n a t e d . A pos t - c i r cu i t t y p e p o l i c y w h i c h o p t i m a l l y moves the states f r o m sna to sn w h e n used i n the c a p a c i t y of a p r e - c i r cu i t t y p e p o l i c y , w i l l also moves the states s0 to sna o p t i m a l l y . T h e effect of a p re -c i r cu i t p o l i c y is en t i r e ly cap tu r ed i n sna. W e have no need to con tend w i t h mu l t i - s t ep t r a n s i t i o n p r o b a b i l i t i e s . So w i t h i n the p e r i o d f r o m t — 0 to t = n Q , the m e a n i n g of a n o p t i m a l p o l i c y is e x a c t l y the same as an o p t i m a l p o l i c y f r o m t = 0 to t = n or f r o m t = na to t — n. Therefore , the m i n i m i z e d state-to-state Q* funct ions i n the prev ious l i n k u p c y c l e c a n be used d i r e c t l y as the p re -c i r cu i t Q* i n the present cyc l e . E q . (5.51) is s i m i l a r to E q . (3.37). T h e y are de r ived a long different l ines of reasoning , bu t the resul ts are equivalent versions of each other . T h e jl+ p o l i c y i n E q . (3.37) i n a d e t e r m i n i s t i c s y s t e m w i l l t ake the sys t em, i n na steps, to the s* a t ha t w o u l d be chosen b y the outer m i n i m i z a t i o n i n E q . (5.51). Corollary 5.1 Algorithm 1 and 2, and the Doubling Algorithm will produce correct val-ues of Q*(s0,sn,u0,n), assuming that in the Value Identification Phase all possible state transitions have been observed. For Algorithm 1, it is also assumed that it is not in its loose form, meaning that the observations in Value Identification Phase are complete and its collection schedule in Step 1 of Table Jh3 is exhaustive. Proof. Eq. (5.51) implies that an algorithm which builds up Q* of any horizon in strict accordance with the conditions of Theorem 5.1.1 will produce the correct values. This Chapter 5. Evaluation of Proposed Algorithms 115 clearly implies that Algorithm 2 and 3 are correct. The situation with Algorithm 1 is only slightly more complicated. Its correctness hinges on the asynchronous linkups to have covered all possible intermediate states, sna. If the probability of linkup at all $na 6 S is non-zero, Q's would converge to the corresponding Q* 's as time approaches infinity. The search for Q's in Step 1 of Table 4-3 is assumed to be exhaustive, meaning that each of these probabilities is 1. The "replace" operation in Step 2 of the Build-Up Phase of Algorithm 1 (Table 4-3) ensures that incorrect values of Q* will be erased and not affect the final Q* as time approaches infinity. If the linkups are deterministically scheduled to completely cover the intermediate states, then the situation is even simpler. a 5.1.2 Numerical demonstration of correctness E x t e n s i v e s i m u l a t i o n s have been done w i t h the S Q L a l g o r i t h m a n d a l l 3 p roposed algo-r i t h m s . 
T h e resul ts demons t ra t e the correctness of the de r iva t i on for the D e t e r m i n i s t i c L i n k u p R e l a t i o n . T h e s imu la t i ons are done on r a n d o m l y genera ted d iscre te sys tems w i t h m a n y different n u m b e r of states a n d different n u m b e r of ac t ions per state. T h e systems have t r a n s i t i o n costs be tween 0.0 and 1.0. A n e x a m p l e of the r a n d o m l y generated systems is shown i n T a b l e 5.14. F o r ease of p resen ta t ion , th i s s y s t e m has o n l y 4 states and 4 ac t ions per state. T h e number s i n the lower r igh t sec t ion of the tab le p rov ide two pieces of i n f o r m a t i o n : 1) the presence of a n u m b e r at a n in t e r sec t ion ind ica tes the nex t s tate s h o w n o n the le f tmos t c o l u m n is reached w i t h the g iven ac t i on , and 2) the value ind ica tes the cost of the t r a n s i t i o n . T h e p o s i t i o n a n d the va lue are r a n d o m l y generated. T h e other sys tems tes ted are not s i m i l a r to th is s y s t e m except h a v i n g the bas ic parameters of the n u m b e r of states, the n u m b e r of ac t ions per state, a n d the range of the costs. T h e other pa ramete r s of the con t ro l Chapter 5. Evaluation of Proposed Algorithms 116 ut = 0 ut = 1 u t = 2 u t = 3 St = 0 , 5 t + i = 0 0.697221 st = 0,st+i = 1 0.175798 st = 0 ,s t + i = 2 0.296271 s< - 0, s t + i = 3 0.592444 s t = = 0 s t = l , s i + 1 = 1 0.983885 st = l , s ( + 1 = 2 s t = l , s t + i = 3 0.536482 0.701650 0.283753 s< - 2, s-t+i - 0 0.699833 0.028425 0.640908 st - 2, s < + 1 = 1 s t = 2, s ( + i = 2 0.694930 s< = 2, s ( + i = 3 s t = 3 , s t + i = 0 0.767806 st = 3 , s t + 1 =• 1 S t = 3 , s i + i = 2 0.996451 0.817222 0.369249 s t = 3, s t + i = 3 T a b l e 5.14: T r a n s i t i o n def ini t ions a n d costs of the test sy s t em. B l a n k entr ies i n d i c a t e t r a n s i t i o n not poss ib le i n one t i m e step. p r o b l e m are A = 0.99 a n d n = 1024. T a b l e 5.15 shows the s i m u l a t i o n results for the sy s t em defined above. T h e Q-values genera ted by Synchronous Q - L e a r n i n g are i den t i c a l to the Q-values genera ted by the D o u b l i n g A l g o r i t h m , w h i c h is wha t the s i m u l a t i o n is i n t ended to demons t ra te . D o u b l e p r e c i s i o n var iab les are used t h roughou t t he s i m u l a t i o n . T h e purpose of T a b l e 5.15 is t o f ac i l i t a t e independen t ve r i f i ca t ion . T h i s s i m u l a t i o n resul t is representa t ive of the results o b t a i n e d over a w i d e range of r a n d o m l y genera ted sys tems of v a r y i n g sizes. A n in te res t ing po in t to note is tha t as n o n - s m o o t h as the s y s t e m is , the Q-values i n each h o r i z o n t a l b l o c k of number s differ o n l y b y a few percent w i t h respect t o different ac t ions . T h i s is cons i s ten t ly the case w i t h the o ther sys tems tes ted. A h o r i z o n of 1024 t i m e steps is effect ively in f in i t e for th i s s m a l l Chapter 5. 
Evaluation of Proposed Algorithms 117 u0 - 0 u0 — 1 u 0 = 2 UQ — 3 •so = 0, sn = 0 16.80323 16.83576 16.30154 16.50537 •so = 0 , s n = 1 17.31345 16.74727 16.91243 16.49454 s0 = 0 , s n = 2 16.70020 16.83837 16.16696 16.77944 s 0 = 0, sn = 3 16.90634 17.00172 16.82249 16.37452 s 0 = l , s „ = 0 16.80321 16.83574 16.30156 16.50539 s 0 = 1, s n = 1 17.31347 16.74725 16.91242 16.49452 s 0 = 1, sn = 2 16.70022 16.83835 16.16694 16.77943 s0 = 1, sn = 3 16.90633 17.00174 16.82251 16.37454 s 0 = 2 , s „ = 0 16.80322 16.83574 16.30156 16.50539 s 0 = 2 , s n = 1 17.31348 16.74725 16.91242 16.49452 s 0 = 2, s n = 2 16.70022 16.83836 16.16695 16.77943 s 0 = 2, s n = 3 16.90633 17.00174 16.82251 16.37454 s 0 = 3 , s n = 0 16.80323 16.83575 16.30154 16.50537 •so = 3, sn = 1 17.31346 16.74726 16.91243 16.49454 So = 3, s n = 2 16.70020 16.83837 16.16696 16.77944 So = 3, s n = 3 16.90634 17.00172 16.82250 16.37452 T a b l e 5.15: Q-values generated for the test s y s t e m . s y s t e m . T h i s i l lus t ra tes the h i g h leve l of a p p r o x i m a t i o n accu racy r equ i r ed i n D P and Q - L e a r n i n g . T h e first two a l g o r i t h m s are s i m i l a r i n na tu re to the D o u b l i n g A l g o r i t h m , except for some o p e r a t i o n a l differences. S i m u l a t i o n s demons t r a t ed tha t they also generate the same Q-values as the S Q L a l g o r i t h m and the D o u b l i n g A l g o r i t h m . 5.1.3 Complexity results T h e c o m p l e x i t i e s of the Synchronous Q - L e a r n i n g A l g o r i t h m ( S Q L ) , A l g o r i t h m 2, a n d the D o u b l i n g A l g o r i t h m ( D A ) are c o m p a r e d i n T a b l e 5.16. W e choose not to consider the c o m p l e x i t y of A l g o r i t h m 1 i n i ts in t ended loose f o r m because i ts c o m p l e x i t y is de-pendent o n two a d d i t i o n a l factors: the e x p l o r a t i o n schedule or p robab i l i t i e s , a n d the c o l l e c t i o n schedule i n Step 1 of T a b l e 4.3. T h e effect of these factors m a y differ f r o m Chapter 5. Evaluation of Proposed Algorithms 118 s y s t e m to s y s t e m a n d f r o m r u n to r u n of the a l g o r i t h m , a n d can s ign i f i can t ly change the c o m p l e x i t i e s . S Q L A l g . 2 D A m e m o r y 2S2A 3S2A + S2 + 2SA S2(2A + 1) e x p l o r a t i o n O(SAH) 0{SA) 0 ( S 7 l ) m i n i m i z a t i o n s S2A(H - 1) S2A{M-l) -r(S2 + S2A)(H/M - 1 ) S 2 ( l + A ) l o g 2 # reads S2A2{H - 1) (2S'2A + 2SA)(M - 1) +(S2A + 2S3A)(H/M - 1) ( S 2 A + 2S ' 3 A)log 2 # wri tes S2A(H - 1) (S2 + S2A)(M + H/M - 2 ) S 2 ( l + A ) l o g 2 # b a c k u p / l i n k u p cycles H - 1 M + H/M - 2 l o g 2 # T a b l e 5.16: C o m p a r i s o n of complex i t i e s of the Synchronous Q - L e a r n i n g A l g o r i t h m , A l -g o r i t h m 2, a n d the D o u b l i n g A l g o r i t h m . It is assumed tha t b o t h A l g . 2 a n d D A use an a p p r o x i m a t o r for the / * func t ion . F o r A l g . 2, i t is a ssumed tha t H/M is an integer . Severa l c o m m e n t s about the c o m p a r i s o n shou ld be made . A l l 3 a l g o r i t h m s operate on a d e t e r m i n i s t i c s y s t e m a n d the Q ' s i n b o t h a lgo r i t hms have sn i n the i n p u t specifiers. A s i m p l e t ab le l o o k u p i m p l e m e n t a t i o n of the Q funct ions is used . T h e s t a r tup detai ls are ignored . S denotes the n u m b e r of states i n the sys t em. A denotes the n u m b e r of ac t ions per state. 
It is a ssumed that each state has the same n u m b e r of poss ib le ac t ions . H denotes the h o r i z o n . T h e s ta te-ac t ion space is u n i f o r m l y " exp lo red" by a l l a lgo r i t hms . T h e "reads" a n d "wr i tes" are to a func t ion a p p r o x i m a t o r , i n th is case s i m p l y a l o o k u p tab le , ra ther t h a n to o r d i n a r y c o m p u t e r m e m o r y . T h e s ignif icance of th is c o m p l e x i t y c o m p a r i s o n is tha t , for d e t e r m i n i s t i c sys tems, the c o m p l e x i t i e s of Synchronous Q - L e a r n i n g , A l g o r i t h m 2, and the D o u b l i n g A l g o r i t h m , i n t e rms of m e m o r y , e x p l o r a t i o n , n u m b e r of m i n i m i z a t i o n s , n u m b e r of reads, a n d n u m b e r of wr i t e s , are c o m p a r a b l e , bu t A l g o r i t h m 2 and espec ia l ly the D o u b l i n g A l g o r i t h m requi re m u c h fewer cycles of b a c k u p s / l i n k u p s t h a n S Q L . Chapter 5. Evaluation of Proposed Algorithms 119 5.1.4 Accumulated approximation error upper bounds W e want to c o m p u t e some worst case upper bounds of the absolute c u m u l a t i v e f r ac t iona l a p p r o x i m a t i o n errors tha t w o u l d resul t f r o m the use of the var ious a l g o r i t h m s presented . T o c o m p u t e these worst case uppe r bounds , i t is assumed tha t the m a x i m u m m a g n i t u d e of the nega t ive a p p r o x i m a t i o n errors is at least as large as the m a x i m u m m a g n i t u d e of the pos i t i ve a p p r o x i m a t i o n errors. F i n a l l y , i t is also assumed tha t the m a x i m u m negat ive a p p r o x i m a t i o n error is a lways i n c u r r e d by a l l wr i tes to the a p p r o x i m a t o r . T hese me thods are ca l l ed " M e t h o d 1." W e define th is no t a t i on : £ = m a x |negat ive f rac t iona l approx . error for M e t h o d 1| . W e wan t to look at the m a x i m u m negat ive a p p r o x i m a t i o n error because, i f a l l else are equa l , the m i n i m i z a t i o n operators i n the var ious a lgo r i t hms w i l l choose the values w h i c h have the largest negat ive error . F o r A l g o r i t h m 2, i t is assumed tha t the m-s tep Q f unc t ion i n S tep 1 of the B u i l d -u p P h a s e uses c o m p u t a t i o n a l m e t h o d s s i m i l a r to M e t h o d 1 b u t w h i c h resul ts i n lower m a x i m u m a p p r o x i m a t i o n errors t h a n those f rom M e t h o d 1. T hese m e t h o d s are ca l l ed " M e t h o d 2 ." W e define th is no t a t i on : 8 = m a x |negat ive f rac t iona l approx . error for M e t h o d 21 . A s each t y p e of a l g o r i t h m progresses, e and 8 w i l l a c c u m u l a t e i n the b a c k u p or l i n k u p processes, wha teve r the case m a y be. T h e fo l lowing n o t a t i o n w i l l be used to denote the wors t case uppe r b o u n d for a c c u m u l a t e d error: £c(h) = m a x | c u m u l a t i v e nega t ive f r ac t iona l approx . er ror for h o r i z o n h\ . A n d the fo l l owing n o t a t i o n w i l l be used to denote a loose, m e a n i n g pos s ib ly worse t h a n wors t case, u p p e r b o u n d : e'JJi) < m a x | c u m u l a t i v e negat ive f rac t iona l approx . error for h o r i z o n h\ . Chapter 5. 
No attempt has been made to find the least upper bounds, because that would require the details of the system definition to be integral parts of the upper bound expressions, which would lessen the clarity of these results.

SQL Algorithm

Proposition 5.1  Using the SQL algorithm, the worst case upper bound for the absolute cumulative fractional approximation error for an n-horizon problem is

    ε_c(n) = (1 + ε)^n − 1 .

Proof. The worst case upper bound for the absolute cumulative fractional approximation error for a horizon of 1 is just the maximum magnitude of the negative fractional approximation error in storing the cost of a transition, ε, which is ε_c(1). Now assume the proposition for a horizon of n, i.e.,

    ε_c(n) = (1 + ε)^n − 1 .

Recall that the update equation of the SQL algorithm is the following:

    Q*(s_0, s_n, u_0) := c(s_0, u_0) + γ min_{u_1} Q*(s_1, s_n, u_1) ,    (5.52)

where the first term on the right is the a term and the second is the b term. Only the b term carries over ε_c(n) to Q*(s_0, s_n, u_0). Therefore, the fractional errors accumulate as follows:

    (1 + ε) ( 1 + (bγ / (a + bγ)) ((1 + ε)^n − 1) ) − 1    (5.53)

where a is the magnitude of the a term in Eq. (5.52) and b is the magnitude of the minimized Q* in the b term of the same equation. The factor bγ/(a + bγ) then depends on the details of the system definition. For 0 ≤ a < ∞ and 0 < bγ < ∞, it is monotonically decreasing in a and takes its largest value at a = 0. We want to choose the worse case, which is when a = 0. This choice also has the effect of making the upper bound independent of system details. Expression (5.53) now takes the form

    (1 + ε) ( 1 + ((1 + ε)^n − 1) ) − 1

and then

    (1 + ε)^{n+1} − 1

which is ε_c(n + 1). □

Algorithm 2

Proposition 5.2  Using Algorithm 2, a loose upper bound for the absolute cumulative fractional approximation error for an n = mn′ horizon problem is

    ε′_c(mn′) = (1 + ε)^{m+n′−1} − 1 ,

where m is the horizon of the circuit constructed at the completion of Step 1 of Algorithm 2 and n′ is 1 plus the number of times that this m-step circuit is used to link up to a horizon of mn′ in Step 2 in Table 4.4. It is assumed that Step 1 uses the SQL algorithm, all storage is done with Method 1, and all costs c(s, u) are greater than 0.

Proof. The worst case upper bound for the absolute cumulative fractional approximation error for a horizon of m is just that accumulated in m cycles of the SQL algorithm, i.e.,

    ε_c(m) = (1 + ε)^m − 1 ,

from the error upper bound proof for the SQL algorithm presented earlier. Now assume the proposition for a horizon of mn′, i.e.,

    ε′_c(mn′) = (1 + ε)^{m+n′−1} − 1 .

Recall that the update equation of Algorithm 2 is the following:

    Q*(i, l, u) := min_{s_m ∈ S} ( Q*(i, s_m, u) + γ^m min_{u_m} Q*(s_m, l, u_m) ) ,    (5.54)

where the first term inside the outer minimization is the a term and the second is the b term. Both the a and b terms carry over maximum absolute cumulative fractional approximation error to Q*(s_0, s_n, u_0, m(n′ + 1)). The a term is fixed at m steps and its maximum absolute cumulative fractional approximation error is fixed at (1 + ε)^m − 1. The b term changes as Algorithm 2 progresses. The b term has an mn′-step horizon with a loose maximum absolute cumulative fractional approximation error assumed to be (1 + ε)^{m+n′−1} − 1. The fractional errors accumulate as follows:

    (1 + ε) ( 1 + ( ((1 + ε)^m − 1) a + ((1 + ε)^{m+n′−1} − 1) bγ^m ) / (a + bγ^m) ) − 1    (5.55)

where a is the magnitude of the a term in Eq. (5.54) and b is the magnitude of the minimized Q* in the b term of the same equation. Since ε > 0, a > 0 by assumption, and a + bγ^m > 0 by assumption, Expression (5.55) is less than or equal to

    (1 + ε) ( 1 + ( ((1 + ε)^{m+n′−1} − 1)(a + bγ^m) ) / (a + bγ^m) ) − 1

which is equal to

    (1 + ε)^{m+n′} − 1

which is ε′_c(m(n′ + 1)). □

It was mentioned previously that the m-step Q*'s may be built with Method 2 to decrease their approximation errors, i.e., δ < ε. If this is the case, the worst case upper bound for Algorithm 2 would be lower than (1 + ε)^{m+n′−1} − 1.

Doubling Algorithm

Proposition 5.3  Using the Doubling Algorithm, the worst case upper bound for the absolute cumulative fractional approximation error for an n-horizon problem is

    ε_c(n) = (1 + ε)^{log_2 2n} − 1 .

Proof. The worst case upper bound for the absolute cumulative fractional approximation error for a horizon of 1 is just the maximum magnitude of the fractional approximation error in storing the cost of a transition, ε, which is ε_c(1). Now assume the proposition for a horizon of n, i.e.,

    ε_c(n) = (1 + ε)^{log_2 2n} − 1 .

Recall that the update equation of the Doubling Algorithm is the following:

    Q*(i, l, u, 2n) := min_{s_n ∈ S} ( Q*(i, s_n, u, n) + γ^n min_{u_n} Q*(s_n, l, u_n, n) ) .    (5.56)

Both the a and b terms carry over ε_c(n) to Q*(s_0, s_n, u_0, 2n). Since both terms have identical ε_c(n), the worst case fractional errors accumulate as follows:

    (1 + ε) ( 1 + ( a ε_c(n) + bγ^n ε_c(n) ) / (a + bγ^n) ) − 1    (5.57)

where a is the magnitude of the a term and b is the magnitude of the minimized Q* in the b term in Eq. (5.56). Expression (5.57) now takes the form

    (1 + ε)(1 + ε_c(n)) − 1

and then

    (1 + ε) ( 1 + (1 + ε)^{log_2 2n} − 1 ) − 1

and finally

    (1 + ε)^{log_2 2(2n)} − 1

which is ε_c(2n). □

The worst case upper bound for the Doubling Algorithm can be written as

    ε_c(n) = (2n)^{log_2 (1+ε)} − 1 .

This is an encouraging result because ε_c(n) is a sub-linear function of n if ε < 1.

It is clear that the worst case upper bounds for Algorithm 2 and the Doubling Algorithm are both much lower than that for the SQL algorithm. The next section illustrates numerically the difference in their worst case upper bounds.

5.1.5 Numerical demonstration of error upper bounds

Table 5.17 shows the numerical values of the upper bounds proven above. It is obvious that the worst case upper bound for the SQL algorithm deteriorates quickly with the horizon. The SQL algorithm is merely a rigidly scheduled version of One-Step Q-Learning, which implies that this upper bound also applies to One-Step Q-Learning. In simulations, the actual errors accumulated by One-Step Q-Learning will most likely not be as severe as the worst case upper bound, by many orders of magnitude, because the approximation errors will rarely be at their worst and there are cancellations among positive and negative errors.
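The three closed-form bounds are easy to tabulate. The following short sketch (Python; it is not from the thesis, and the function and variable names are my own) evaluates the worst case expressions of Propositions 5.1 to 5.3 for a given per-write error ε, and can be used to generate tables of the kind shown in Table 5.17.

import math

def sql_bound(eps, horizon):
    # Proposition 5.1: (1 + eps)^n - 1 for an n-horizon problem.
    return (1.0 + eps) ** horizon - 1.0

def alg2_bound(eps, m, n_prime):
    # Proposition 5.2 (loose bound): (1 + eps)^(m + n' - 1) - 1,
    # where m is the Step-1 circuit horizon and the final horizon is m * n'.
    return (1.0 + eps) ** (m + n_prime - 1) - 1.0

def doubling_bound(eps, horizon):
    # Proposition 5.3: (1 + eps)^(log2(2n)) - 1, sub-linear in n when eps < 1.
    return (1.0 + eps) ** math.log2(2.0 * horizon) - 1.0

if __name__ == "__main__":
    H, m, n_prime = 1024, 16, 64          # the setting used for Table 5.17
    for eps in [0.005 * k for k in range(11)]:
        print(f"{eps:5.3f}  {sql_bound(eps, H):12.5e}  "
              f"{alg2_bound(eps, m, n_prime):12.5e}  "
              f"{doubling_bound(eps, H):12.5e}")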
ε        SQL            Alg. 2         DA
0.000    0.00000E+00    0.00000E+00    0.00000E+00
0.005    1.64214E+02    4.82924E-01    5.63958E-02
0.010    2.66116E+04    1.19477E+00    1.15668E-01
0.015    4.18049E+06    2.24203E+00    1.77949E-01
0.020    6.40584E+08    3.77984E+00    2.43374E-01
0.025    9.57719E+10    6.03372E+00    3.12087E-01
0.030    1.39739E+13    9.33096E+00    3.84234E-01
0.035    1.99030E+15    1.41456E+01    4.59970E-01
0.040    2.76783E+17    2.11633E+01    5.39454E-01
0.045    3.75908E+19    3.13733E+01    6.22853E-01
0.050    4.98703E+21    4.62014E+01    7.10339E-01

Table 5.17: Worst case upper bounds for absolute cumulative fractional approximation errors for the SQL algorithm, Algorithm 2, and the Doubling Algorithm. The horizon is 1024. For Algorithm 2, m = 16 and n′ = 64.

Nevertheless, the numerical comparison shown in Table 5.17 clearly favors Algorithm 2 and the Doubling Algorithm.

5.1.6 Numerical results for correlated errors

Different systems are simulated in this section to examine how the accumulation of approximation errors behaves when the errors are spatially and temporally correlated. This type of correlation is encountered if an approximator is used to store the various mappings in these algorithms. While the system in Section 5.1.2 was chosen for ease of presentation of numerical results, systems with 2-dimensional neighbouring properties are chosen in this section. By changing the purpose and method of presentation, a larger system can be shown here.

Lookup tables are used to implement the various mappings required by the SQL algorithm, Algorithm 2, and the Doubling Algorithm. Given the memory constraints, the systems have a rectangular grid of 7 x 7 states and 2 actions per state. A balance is struck between the length of the horizon and the number of states; a final horizon of 128 seems reasonable for a system with 49 states.

A Prototype System

Figure 5.15: State transitions in the 7 x 7 deterministic simulation (left panel: action = 0; right panel: action = 1).

For the sake of clarity, one possible system is shown in Figure 5.15, which gives the state transition at each state for both possible actions. Notice that the transitions are all to neighbouring states. The local nature of the state transitions is intended to allow meaningful tests of the accumulation of spatially and temporally correlated approximation errors.

The actual error accumulation simulations

In the simulation results presented in this section, each individual run uses a different, randomly generated system. For each combination of algorithm and value of ε, 100 systems are simulated.
Step 1. Temporal correlation: For each e′ in the super grid, compute its new value as
        e′(t + 1) = 0.8 e′(t) + 0.2 e′_M e ,
        where e′_M is the maximum fractional error and e is a uniformly distributed random variable in [−1, 1].
Step 2. Spatial correlation via relaxation: Repeat 5 times: for each s in the 7 x 7 core grid, set e′ to the mean of the 3 x 3 block of e′ centered on s.
Step 3. Extraction and normalization: The core 7 x 7 e′ values are extracted from the 9 x 9 grid and normalized such that max |extracted e′| = e′_M.

Table 5.18: Method for generating temporally and spatially correlated approximation error.

For illustrative purposes, a system that we might construct manually is shown in Figure 5.15; as noted above, its state transitions, for both actions, are all to neighbouring states, which is what allows a meaningful test of the accumulation of spatially and temporally correlated approximation errors. The transition costs are as follows. Going in the right direction, with action 0, along a sequence of 5 contiguous transitions produces a high payoff or, equivalently, a low cost; this sequence is indicated by heavy arrows in the left part of the figure, and each transition in the sequence carries a cost of 0. Going backwards on this sequence, with action 1, carries a relatively heavy cost of 2. All other ordinary transitions have a cost of 1. The randomly generated systems are not constructed purposefully like this system.

When the various mappings in the algorithms to be compared are stored using approximators, the approximation errors typically are temporally and spatially correlated.

Figure 5.16: Comparison of the mean accumulated fraction errors for the SQL algorithm, Algorithm 2, and the Doubling Algorithm (horizontal axis: maximum fractional error per backup/linkup).

The temporal correlation comes from the fact that the approximators' parameters are usually adjusted slowly in time, so that the errors in the near future at any point in the input space are influenced by the errors at the present. The spatial correlation comes from the fact that the approximation is continuous. In these simulations, approximation errors are simulated instead of generated by actual function approximators; this gives us greater control over the error magnitude. A 9 x 9 super grid of error variables e′ is used to track the error in time and in space. Each state uses the corresponding fractional error e(s, t) in the 7 x 7 core of the super grid. These e(s, t)'s are generated by the method described in Table 5.18.
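As a concrete illustration of Table 5.18, the following sketch (Python with NumPy; it is not from the thesis and the names are mine) performs one update of the correlated errors on a 9 x 9 super grid with a 7 x 7 core. It reads Step 2 as a simultaneous (copy-based) relaxation, which is an assumption on my part.

import numpy as np

def update_correlated_errors(e_super, eps_max, rng):
    # One application of the Table 5.18 method.
    # e_super: 9 x 9 array of error variables e'; eps_max: e'_M.
    e_super = e_super.copy()

    # Step 1: temporal correlation -- each cell decays toward a fresh uniform draw.
    noise = rng.uniform(-1.0, 1.0, size=e_super.shape)
    e_super = 0.8 * e_super + 0.2 * eps_max * noise

    # Step 2: spatial correlation -- relax the 7 x 7 core five times with 3 x 3 means.
    for _ in range(5):
        smoothed = e_super.copy()
        for i in range(1, 8):            # core rows of the 9 x 9 grid
            for j in range(1, 8):        # core columns
                smoothed[i, j] = e_super[i-1:i+2, j-1:j+2].mean()
        e_super = smoothed

    # Step 3: extract the 7 x 7 core and normalize its largest magnitude to e'_M.
    core = e_super[1:8, 1:8]
    peak = np.abs(core).max()
    if peak > 0.0:
        core = core * (eps_max / peak)
    return e_super, core                 # core holds the e(s, t) used during storage

rng = np.random.default_rng(0)
e_super = np.zeros((9, 9))
e_super, core_errors = update_correlated_errors(e_super, eps_max=0.05, rng=rng)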
T h i s m e t h o d is pe r fo rmed once before the storage of the results of each l i n k u p / b a c k u p cyc le . T h e values at the po in t s i n the ex t r ac t ed 7 x 7 core g r i d are the f r ac t iona l errors used d u r i n g storage. F i g u r e 5.16 shows the m e a n a c c u m u l a t e d f rac t ion error, E[ec(l28)}, as a func t i on of the m a x i m u m f rac t iona l error , e. T h e D o u b l i n g A l g o r i t h m and A l g o r i t h m 2 are c lear ly Chapter 5. Evaluation of Proposed Algorithms 129 Maximum Fractional Error per Backup/Linkup F i g u r e 5.17: C o m p a r i s o n of the s t anda rd d e v i a t i o n of a c c u m u l a t e d f rac t ion errors for the S Q L A l g o r i t h m , A l g o r i t h m 2, and the D o u b l i n g A l g o r i t h m . super io r to the S Q L a l g o r i t h m i n th is c o m p a r i s o n . F i g u r e 5.17 shows tha t the D o u -b l i n g A l g o r i t h m a n d A l g o r i t h m 2 are m u c h bet ter behaved i n t e rms of the v a r i a b i l i t y of -E[e c (128)]. T h e m e a n and s t anda rd d e v i a t i o n are t aken over a l l Q ' s a n d a l l s i m u l a t i o n runs . 5.1.7 Simulations of control policy degradation and performance In Q - L e a r n i n g l i ke a l g o r i t h m s , i t is i m p o r t a n t to d i s t i ngu i sh be tween absolu te app rox i -m a t i o n errors a n d re la t ive a p p r o x i m a t i o n errors. A b s o l u t e errors is the t y p e descr ibed i n th i s thesis thus far, w h i c h are the differences be tween the l ea rned Q-func t ions and the er ror free Q- func t ions . F o r a set of Q or (^-functions of the same s tar t , end states and h o r i z o n bu t different ac t ions , the re la t ive errors are the differences i n absolu te errors be tween the e lements of the set. In Q - L e a r n i n g l ike a l g o r i t h m s , i t is the r e l a t ive errors Chapter 5. Evaluation of Proposed Algorithms 130 w h i c h u l t i m a t e l y de t e rmine the o p t i m a l i t y of the con t ro l laws. B e t t e r absolu te approx-i m a t i o n errors do not i m p l y be t te r con t ro l po l ic ies . If the r e l a t ive errors are zero , t h e n absolute errors are not relevant i n d e t e r m i n i n g the o p t i m a l i t y of the c o n t r o l laws. O n the o ther h a n d , i f the absolu te errors are zero, then re la t ive errors m u s t also be zero. T h i s is a s t rong reason to pursue the r e d u c t i o n of absolute errors as a means to reduce the r e l a t ive errors . In the p rev ious sect ions, i t is shown that 2 p roposed a l g o r i t h m s , the D o u b l i n g A l -g o r i t h m a n d A l g o r i t h m 2, have an advantage i n absolute errors over the Synchronous Q - L e a r n i n g a l g o r i t h m . I n t h i s sec t ion , we are e x a m i n i n g the effect of t he r e d u c t i o n of absolu te errors o n the con t ro l l aw by the proposed a lgo r i t hms . In the set of s i m u l a t i o n s for th is sec t ion , a one- l ink m a n i p u l a t o r is used i n c o m p a r i n g the c o n t r o l l aws a n d per formances a m o n g the D o u b l i n g A l g o r i t h m , A l g o r i t h m 2, a n d the Synch ronous Q - L e a r n i n g a l g o r i t h m . T h e purpose of these s i m u l a t i o n s is not to show tha t these a l g o r i t h m s can con t ro l th i s sys t em. 
The system is intended to be a platform for comparing the relative performance of these algorithms and for demonstrating how one uses the algorithms in a non-conceptual system. The dynamics of the manipulator are as follows:

    M L² θ̈ = T − M G L sin θ ,

where G = 9.81 m/s² is the gravitational constant, θ in radians is the angle as shown in Figure 5.18, T is the torque, M = 1.0 kg is the mass of the end mass, and L = 1.0 m is the length of the massless arm. The range of the angle, θ, is equally quantized into 32 divisions, with 0 radians at the center of one division. The range of the angular velocity, θ̇, is equally quantized into 33 divisions, with 0 radians/second at the center of one division. The control can be 1 of 2 possible values, 7.5 N·m and −7.5 N·m. The frequency of control changes is 12.5 Hz; between changes, the control value is constant. These values are chosen so that there is enough change in the states in one time period to cause the corresponding discretized states to vary from one time step to the next.

Figure 5.18: One-link manipulator.

This dynamic system is chosen over more complex systems because lookup tables are used to store the Q values and the other mappings. It would not have been possible to obtain the chosen level of fineness of quantization with more state variables. The fineness of quantization has priority over the number of state variables because the quantization errors should not be present in a learning system using approximators.

The optimality criterion is defined by Eq. (4.39) and by the form of the immediate costs. The immediate costs for a quantized state are calculated from the values of the state variables:

    c(s_0, u_0) = 0.8 |θ − π| + 0.2 |θ̇| ,

where s_0 is the quantized state obtained from θ and θ̇. The immediate cost specifies {θ = π, θ̇ = 0} as the goal state.

The approximation errors are simulated using the method described in Table 5.18, with the difference that the core grid here has dimension 32 x 33 x 32 x 33 and the super grid has dimension 32 x 35 x 32 x 35. The two θ variables do not require 2 more divisions in the super grid because angles wrap around the unit circle.

At various levels of error magnitude, the control laws of each of the 3 algorithms tested degrade. One measure of the degradation is the fraction of the μ(s)'s, out of all μ(s)'s in a control law, that are different from the error-free control law.
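To make both the simulation setup and this degradation measure concrete, here is a minimal sketch (Python/NumPy; it is not from the thesis). The Euler integration step, the assumed angular-velocity range used for quantization, and the flat Q[s, u] indexing (which ignores the end-state specifier carried by the Q's in the proposed algorithms) are my own simplifications.

import math
import numpy as np

G, M, L = 9.81, 1.0, 1.0              # gravity (m/s^2), end mass (kg), arm length (m)
DT = 1.0 / 12.5                        # control period: controls change at 12.5 Hz
N_TH, N_THDOT = 32, 33                 # quantization of theta and theta-dot
THDOT_MAX = 10.0                       # assumed velocity range (rad/s); not stated in the thesis
TORQUES = (7.5, -7.5)                  # the two admissible torques (N·m)

def step(theta, theta_dot, torque, substeps=10):
    # Euler integration of M L^2 theta'' = T - M G L sin(theta) over one control period.
    h = DT / substeps
    for _ in range(substeps):
        theta_dot += h * (torque - M * G * L * math.sin(theta)) / (M * L * L)
        theta = (theta + h * theta_dot) % (2.0 * math.pi)   # the angle wraps around the circle
    return theta, theta_dot

def quantize(theta, theta_dot):
    # Map the continuous state to one of the 32 x 33 discrete states.
    i = int(theta / (2.0 * math.pi) * N_TH) % N_TH
    v = min(max(theta_dot, -THDOT_MAX), THDOT_MAX)
    j = min(int((v + THDOT_MAX) / (2.0 * THDOT_MAX) * N_THDOT), N_THDOT - 1)
    return i * N_THDOT + j

def immediate_cost(theta, theta_dot):
    # c(s0, u0) = 0.8 |theta - pi| + 0.2 |theta-dot|; the goal is {theta = pi, theta-dot = 0}.
    return 0.8 * abs(theta - math.pi) + 0.2 * abs(theta_dot)

def fractional_degradation(q_noisy, q_clean):
    # Fraction of discrete states whose greedy (cost-minimizing) action differs
    # from the error-free control law -- the measure reported in Tables 5.19 to 5.21.
    return float(np.mean(np.argmin(q_noisy, axis=1) != np.argmin(q_clean, axis=1)))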
Each fractional degradation value is the result of one simulation. Table 5.19 shows this fractional degradation for various values of the maximum fractional error magnitude, ε, using zero-mean e uniformly distributed between −1 and 1.

ε       DA        Alg. 2    SQL
0.05    0.0019    0.0028    0.0038
0.10    0.0114    0.0028    0.0028
0.15    0.0265    0.0095    0.0047
0.20    0.0398    0.0294    0.0057
0.25    0.0417    0.0294    0.0028
0.30    0.0398    0.0369    0.0076
0.35    0.0938    0.0492    0.0085
0.40    0.0852    0.0369    0.0142
0.45    0.0739    0.0616    0.0208
0.50    0.1203    0.0568    0.0303

Table 5.19: Fractional degradation of the control policy with zero-mean e.

Table 5.20 shows this fractional degradation for various values of the maximum fractional error magnitude, ε, using positively skewed e uniformly distributed between −0.5 and 1.

ε       DA        Alg. 2    SQL
0.05    0.0028    0.0057    0.0530
0.10    0.0104    0.0085    0.1042
0.15    0.0170    0.0114    0.1345
0.20    0.0237    0.0133    0.1572
0.25    0.0417    0.0360    0.1837
0.30    0.0682    0.0303    0.2074
0.35    0.0777    0.0407    0.2282
0.40    0.0625    0.0464    0.2491
0.45    0.0919    0.0871    0.2623
0.50    0.1004    0.0625    0.5104

Table 5.20: Fractional degradation of the control policy with positively skewed e.

Table 5.21 shows this fractional degradation for various values of the maximum fractional error magnitude, ε, using negatively skewed e uniformly distributed between −1 and 0.5.

ε       DA        Alg. 2    SQL
0.05    0.0019    0.0019    0.0057
0.10    0.0076    0.0057    0.0104
0.15    0.0237    0.0123    0.0275
0.20    0.0199    0.0199    0.0616
0.25    0.0540    0.0275    0.5104
0.30    0.0275    0.0331    0.0919
0.35    0.0568    0.0341    0.1193
0.40    0.0407    0.0511    0.1780
0.45    0.0606    0.0634    0.2367
0.50    0.0947    0.0748    0.3580

Table 5.21: Fractional degradation of the control policy with negatively skewed e.

Tables 5.19, 5.20, and 5.21 show that the SQL algorithm is more variable with respect to the characteristics of the approximation errors than Algorithm 2 and the Doubling
F igu res 5.22, 5.23, and 5.24 are for the case of p o s i t i v e l y skewed e. F igu re s 5.25, 5.26, and 5.27 are for the case of nega t ive ly skewed e. E a c h g r a p h i n each figure shows 4 t ra jector ies , one for a different i n i t i a l state. T h e top left g r a p h i n each figure shows the error-free state t ra jec tor ies . T h e 4 s tar t states tes ted are {0 r a d , 0 r a d / s } , { | r a d , 0 r a d / s } , {7r r a d , 0 r a d / s } , a n d r a d , 0 r a d / s } . If the c o n t r o l l aw is a near m i r r o r image i n te rms of s ign of the er ror free con t ro l l a w , the signs of the c o n t r o l ac t ions are i nve r t ed so the results w o u l d not m i s i n t e r p r e t e d . T h e f r ac t iona l degrada t ion figures and the con t ro l pe r fo rmance tests show tha t the p roposed a l g o r i t h m s , the D o u b l i n g A l g o r i t h m and A l g o r i t h m 2, are o n average more error to le ran t a n d less error sensi t ive t h a n a parent of One-S tep Q - L e a r n i n g a l g o r i t h m , the S y n c h r o n o u s Q - L e a r n i n g a l g o r i t h m . Chapter 5. Evaluation of Proposed Algorithms 135 F i g u r e 5.19: S ta te t ra jector ies for the D o u b l i n g A l g o r i t h m for ze ro -mean e. T h e large number s be low the graphs i n d i c a t e e'M, the m a x i m u m f rac t iona l error . Chapter 5. Evaluation of Proposed Algorithms 136 Figure 5.20: State trajectory for Algorithm 2 for zero-mean e. The large numbers below the graphs indicate e'M, the maximum fractional error. Chapter 5. Evaluation of Proposed Algorithms 137 F i g u r e 5.21: S ta te t r a jec to ry for the Synchronous Q - L e a r n i n g a l g o r i t h m for zero-mean e. T h e large number s be low the graphs ind ica t e e'M, the m a x i m u m f rac t iona l error. Chapter 5. Evaluation of Proposed Algorithms 138 F i g u r e 5.22: Sta te t ra jector ies for the D o u b l i n g A l g o r i t h m for p o s i t i v e l y skewed e. T h e large n u m b e r s be low the graphs ind ica t e e'M, the m a x i m u m f r ac t i ona l error . Chapter 5. Evaluation of Proposed Algorithms 139 F i g u r e 5.23: S ta te t r a jec to ry for A l g o r i t h m 2 for p o s i t i v e l y skewed e. T h e large numbers be low the graphs i n d i c a t e e'M, the m a x i m u m f rac t iona l error . Chapter 5. Evaluation of Proposed Algorithms 140 0.15 S , e p s 0.35 F i g u r e 5.24: S ta te t r a jec to ry for the Synchronous Q - L e a r n i n g a l g o r i t h m for pos i t i ve ly skewed e. T h e large number s be low the graphs i n d i c a t e s'M, the m a x i m u m f rac t iona l error . Chapter 5. Evaluation of Proposed Algorithms 141 F i g u r e 5.25: Sta te t ra jector ies for the D o u b l i n g A l g o r i t h m for nega t ive ly skewed e. T h e large n u m b e r s be low the graphs ind ica t e e'M, the m a x i m u m f rac t iona l error . Chapter 5. Evaluation of Proposed Algorithms 142 F i g u r e 5.26: Sta te t r a jec to ry for A l g o r i t h m 2 for nega t ive ly skewed e. T h e large numbers be low the graphs i n d i c a t e e'M, the m a x i m u m f rac t iona l error. Chapter 5. Evaluation of Proposed Algorithms 143 F i g u r e 5.27: S ta te t r a jec to ry for the Synchronous Q - L e a r n i n g a l g o r i t h m for nega t ive ly skewed e. T h e large number s be low the graphs i nd i ca t e e'M, t he m a x i m u m f rac t iona l error . Chapter 5. 
It is also shown that the relative performance of these algorithms depends on the error characteristics of the approximator. My speculation, apparently supported by the simulation results in this section, is that approximators which exhibit relatively zero-mean errors might favor the SQL algorithm, because that algorithm offers more opportunity for error cancellation, even after skewing by the minimization operator. By requiring far fewer linkup cycles, the Doubling Algorithm and Algorithm 2 tend to preserve the skewness of the errors by re-using linkups repeatedly. My speculation, again apparently supported by the simulation results in this section, is that if the error characteristics of the approximators are skewed, the proposed algorithms would be favored relative to the SQL algorithm.

5.2 On the Algorithms for Stochastic Systems

In this section, simulation results are presented for the ALDP algorithm, comparing it to the TD(λ) algorithm and the QL algorithm. The apparent advantage of the ALDP algorithm is that it is faster than the QL algorithm and produces Q values with greater orderliness than the TD(λ) algorithm; the meaning of orderliness will be clarified later. General results will also be presented for the Stochastic Doubling Algorithm presented in Chapter 4. It apparently does not have an obvious advantage over a synchronous DP algorithm.

5.2.1 Numerical demonstration of correctness

The purpose of the simulation in this section is to demonstrate numerically the correctness of Eq. (4.45), reproduced here for convenience:

    f(i, l, n) = Σ_{s_{n_a}} P_{i s_{n_a}} ( f(i, s_{n_a}, n_a) + γ^{n_a} f(s_{n_a}, l, n_b) ) ,

where P_{i s_{n_a}} is the probability of occupying state s_{n_a} after n_a steps from state i under the policy, and n = n_a + n_b.

Figure 5.28: State transitions in the 7 x 7 stochastic simulation (left diagram: transitions taken with probability 0.8; right diagram: transitions taken with probability 0.2).

The simulation is on a stochastic system having only one control policy for all time steps. Each state has at least two possible subsequent states, each with a certain probability. The simulation uses the SDA described in Section 4.5 to build the horizon up to a specified value. This algorithm implements the above equation in its minimal form. If this algorithm functions correctly compared to a standard synchronous DP algorithm applied to the same system, it implies that the equation is correct for the tested cases.

Many systems fitting the above description have been tested. To facilitate presentation and third-party reproduction of the results, a relatively small system is chosen for illustration, described by Figure 5.28. In Figure 5.28, the left hand side diagram shows the state transitions with transition probability 0.8 and the right hand side diagram those with probability 0.2.
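To convey the structure being tested, here is a minimal sketch (Python/NumPy; my own naming) of one-policy evaluation by doubling linkups, checked against a standard synchronous DP sweep. It deliberately ignores the end-state specifier and the D-record bookkeeping of the full SDA; the linkup line in the first function is the indefinite-end-state specialization of Eq. (4.45).

import numpy as np

def doubling_policy_evaluation(P1, c, gamma, final_horizon):
    # P1[i, j]: one-step transition probability under the policy; c[i]: immediate cost.
    # Returns the expected discounted cost over final_horizon steps (a power of 2).
    f = c.copy()                      # horizon-1 values
    Pn = P1.copy()                    # n-step transition probabilities
    gamma_n = gamma
    n = 1
    while n < final_horizon:
        # Linkup: f_{2n}(i) = f_n(i) + gamma^n * sum_s Pn[i, s] * f_n(s)
        f = f + gamma_n * (Pn @ f)
        Pn = Pn @ Pn                  # 2n-step transition probabilities
        gamma_n = gamma_n * gamma_n
        n *= 2
    return f

def synchronous_policy_evaluation(P1, c, gamma, final_horizon):
    # Standard synchronous DP: one backup per horizon increment.
    f = np.zeros(len(c))
    for _ in range(final_horizon):
        f = c + gamma * (P1 @ f)
    return f

# Tiny check on a random 5-state system: both routes should agree to rounding.
rng = np.random.default_rng(0)
P1 = rng.random((5, 5)); P1 /= P1.sum(axis=1, keepdims=True)
c = rng.random(5)
print(np.allclose(doubling_policy_evaluation(P1, c, 0.99, 128),
                  synchronous_policy_evaluation(P1, c, 0.99, 128)))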
W i t h these k i n d s of l o c a l state t r ans i t ions a n d the h i g h bias i n the t r a n s i t i o n p robab i l i t i e s , the s y s t e m behaves l ike robo t i c m a n i p u l a t o r s i n tha t g iven the current state a n d current con t ro l , the robot has a h igh p r o b a b i l i t y of m o v i n g to a defini te next state. In o ther words , the ac t ions have a s t rong bu t not t o t a l con t ro l over the state changes. T h e t h i c k arrows i n the left d i a g r a m Chapter 5. Evaluation of Proposed Algorithms 146 7 -6 # of State Transitions (x 1000) F i g u r e 5.29: / - v a l u e s as a func t ion of n u m b e r of state t rans i t ions for each state i n F i g -ure 5.18 w h i l e l e a r n i n g under the O n e P o l i c y D P a l g o r i t h m . i n d i c a t e a t r a n s i t i o n cost of 2. A l l other t rans i t ions have a cost of 1. T h e f ina l desired h o r i z o n is 128. T h e d iscount factor 7 is 0.99. A l l c o m p u t a t i o n a n d storage o f the values use doub le p rec i s ion . T a b l e 5.22 shows that the / - v a l u e s generated by the S D P a l g o r i t h m and the S D A are i d e n t i c a l . T h e same is t rue of a large n u m b e r of s imu la t ions for different sys tems and s i m u l a t i o n cond i t ions . 5.2.2 Comparison of orderliness In Sec t ion 5.1, the focus is on the a c c u m u l a t i o n of a p p r o x i m a t i o n errors . F o r the algo-r i t h m s o n s tochas t ic sys tems, the focus is shif ted to a q u a l i t a t i v e measure t ha t can be a p p r o p r i a t e l y ca l l ed order l iness . T h e concern w i t h errors is she lved a n d the reason is p a r t i a l l y e x p l a i n e d by Sec t ion 5.2.3 w h i c h shows w h y off-line D P or Q L for s tochast ic sys tems is not p r a c t i c a l . Chapter 5. Evaluation of Proposed Algorithms 147 S ta te S D P S D A x = l , u = 1 50.17423 50.17423 x = l , u = 2 49.89191 49.89191 x = 1,2/ = 3 49.53544 49.53544 x = 1,2/ = 4 49.08536 49.08536 x = 1,2/ = 5 49.08536 49.08536 x = 1,2/ = 6 49.08524 49.08524 x = 1,2/ = 7 49.49901 49.49901 x = 2,2/ = 1 50.05454 50.05454 x = 2,2/ = 2 49.74674 49.74674 a; = 2,2/ = 3 49.40497 49.40497 x = 2,2/ = 4 49.08596 49.08596 x = 2, y = 5 48.68226 48.68226 x = 2, j / = 6 49.18237 49.18237 x = 2,2/ = 7 49.56323 49.56323 x = 3,2/ = 1 50.14771 50.14771 x = 3,2/ = 2 49.72291 49.72291 x = 3,y = 3 49.40751 49.40751 x = 3,t/ = 4 49.08902 49.08902 x = 3,2/ = 5 49.10445 49.10445 x = 3 , j / = 6 49.18359 49.18359 x = 3,2/ = 7 49.57594 49.57594 x = 4 ,y = 1 50.48382 50.48382 x = 4, j / = 2 50.28877 50.28877 x = 4 ,y = 3 49.91634 49.91634 x = 4,2/ = 4 49.56194 49.56194 T a b l e 5.22: / - v a l u e s S ta te S D P S D A x = 4,2/ = 5 49.42033 49.42033 x = 4, j / = 6 49.56369 49.56369 x = 4,2/ = 7 50.09675 50.09675 x = 5,2/ = 1 50.95069 50.95069 x = 5, y — 2 50.59444 50.59444 x = 5, y — 3 50.22748 50.22748 x = 5, y = 4 49.76415 49.76415 x = 5,2/ = 5 49.80419 49.80419 x = 5, j / = 6 49.91299 49.91299 x = 5,2/ = 7 50.61930 50.61930 x = 6,2/ = 1 51.17789 51.17789 x = 6, y = 2 50.85817 50.85817 x = 6, y — 3 50.52671 50.52671 x = 6,2/ = 4 50.10820 50.10820 x = 6,y — 5 50.10820 50.10820 x = 6, y = 6 50.21505 50.21505 x = 6,2/ = 7 51.17519 51.17519 x = 7,2/ = 1 51.33838 51.33838 x = 7,2/ = 2 51.00401 51.00401 x = 7,2/ = 3 50.66176 50.66176 x = 7,2/ = 4 50.41711 50.41711 x = 7,2/ = 5 50.41711 50.41711 x = 7,2/ = 6 51.89252 51.89252 x = 7,2/ = 7 53.29184 53.29184 for the test s y s t e m . Chapter 5. 
Figure 5.30: f-values as a function of number of state transitions for each state in Figure 5.18 while learning under the TD(λ) algorithm.

In incremental DP, the actual point values of the f-functions are less important than their values relative to one another. In several methods presented in Chapter 2, the f-values are used to indirectly select the optimal actions. During learning, the f-values typically increase slowly toward asymptotes. With an on-line algorithm, the actor mapping, such as the ASE mapping in the ACE/ASE system [8], would learn concurrently as the f-function learns. Let us define relative order as simply the order of the f-value of a state relative to the f-values of other states. In order for the action function to learn from consistent information, it is important that the relative order of the f-values be maintained as the number of state transitions increases. Furthermore, the relative order should be maintained as time goes toward infinity, because the actor function is learning at all times unless some additional mechanism is used to stop the learning in certain situations. The effective maintenance of relative order is defined here by the term orderliness.
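Orderliness is assessed qualitatively from Figures 5.29 to 5.31, but it can also be quantified. One simple possibility (my own construction, not from the thesis) is to record, at each checkpoint during learning, the fraction of state pairs whose f-values are already in the same relative order as in a reference f-function such as the converged values:

import numpy as np
from itertools import combinations

def pairwise_order_agreement(f_now, f_reference):
    # Fraction of state pairs (i, j) whose relative order of f-values matches the
    # reference ordering; 1.0 means the relative order is fully maintained.
    agree, total = 0, 0
    for i, j in combinations(range(len(f_now)), 2):
        total += 1
        if (f_now[i] - f_now[j]) * (f_reference[i] - f_reference[j]) > 0:
            agree += 1
    return agree / total

# Example with arbitrary illustrative values:
f_final = np.array([50.2, 49.9, 49.5, 49.1, 49.4])
f_snapshot = np.array([10.1, 9.9, 9.6, 9.2, 9.7])
print(pairwise_order_agreement(f_snapshot, f_final))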
Evaluation of Proposed Algorithms 150 A large n u m b e r of s imu la t ions are done, i t e r a t i n g over different values of the pre-c i r c u i t l eng th for the A L D P a l g o r i t h m or over different va lue of m for the T D ( A ) algo-r i t h m , a was f ixed for the O P D P a l g o r i t h m . F i g u r e 5.29 shows the resul t for the O P D P a l g o r i t h m . T h i s a l g o r i t h m produces w e l l o rdered state-to-cost m a p p i n g s qu i te early, bu t the costs take ve ry long to approach the correct values. F i g u r e 5.30 shows the best q u a l i t a t i v e result for the T D ( A ) a l g o r i t h m , w h i c h is for m = 1. T h e T D ( A ) a l g o r i t h m takes the cost to the ne ighbourhood of the correct values q u i c k l y , bu t the m a p p i n g s are not wel l -ordered . F i g u r e 5.31 shows the best q u a l i t a t i v e result for the A L D P a l g o r i t h m , w h i c h is for a p re -c i r cu i t l eng th of 20. T h e A L D P a l g o r i t h m produces w e l l o rdered state-to-cost m a p p i n g s q u i c k l y . O n e P o l i c y D P is s ign i f ican t ly slower t h a n the o ther two types of a l go r i t hms . In th is thesis , the order l iness of regular Q L ( w h i c h o p t i m i z e s the con t ro l p o l i c y as wel l ) is not covered . If regular Q - L e a r n i n g has good order l iness , i t m a y be able to achieve the ove ra l l ob j ec t ive of f i nd ing the o p t i m a l con t ro l p o l i c y faster t h a n the o ther two w i t h gradients . It s h o u l d be sa id tha t 2 stage l ea rn ing a lgo r i t hms l i ke T D ( A ) has achieved greater success i n m a n y app l i ca t i ons t h a n Q - L e a r n i n g , as m e n t i o n e d i n C h a p t e r 2. T h e reason .may have to do w i t h error to lerance and w i t h orderl iness . T h e con t r a ry specu la t ion is tha t regular Q L m a y be m u c h slower t h a n O n e P o l i c y D P such tha t regular Q L is not bet ter . R e g u l a r Q L assumes g loba l m i n i m i z a t i o n w h i c h is ve ry cost ly . If o n l y l o c a l m i n i m i z a t i o n is done, then regular Q L is d o i n g m i n i m i z a t i o n m u c h m o r e i n c r e m e n t a l l y t h a n E q . (3.29) and E q . (3.30) suggest. H o w T D ( A ) funct ions i n a A C E / A S E scheme is w e l l k n o w n (see C h a p t e r 2) . T h e A L D P a l g o r i t h m c a n also func t i on i n the same way. T h i s a l g o r i t h m w o u l d s i m p l y replace T D ( A ) i n the a lgo r i t hms descr ibed i n the sections under Sect ions 2.3.5 a n d 2.4.4. T h e Chapter 5.. Evaluation of Proposed Algorithms 151 defaul t m i n i m i z a t i o n is l oca l bu t there is n o t h i n g to prevent i t f r o m b e i n g g l o b a l . In c o m p a r i n g the 3 a lgo r i t hms , i t seems the A L D P a l g o r i t h m combines the best of b o t h wor lds of the O P D P a l g o r i t h m and the TD(A) a l g o r i t h m by offering fast a n d o rde r ly b u i l d u p of the / - v a l u e s . These are sufficient reasons to t ake o n the increased c o m p l e x i t y of the a l g o r i t h m . 5.2.3 A natural but flawed algorithm A s m e n t i o n e d i n Sec t ion 4.5, the one p o l i c y l i n k u p re la t ion po in t s t o w a r d the S tochas t i c D o u b l i n g A l g o r i t h m . T h i s a l g o r i t h m is used to demons t ra te the correctness of the r e l a t i on i n Sec t i on 5.2.1. 
In the A C E / A S E scheme, the synchronous and asynchronous a lgo r i t hms w o u l d be used for t he same purpose , bu t w i t h different d a t a s t ruc tures . In s imu la t i ons t a k i n g a p p r o x i m a t i o n errors i n to account , a p p r o x i m a t i o n errors co r rup t t w o i m p o r t a n t m a p p i n g s : the / - m a p p i n g and the Pitlna m a p p i n g . T h e errors w i t h i n each l i n k u p stage i n these 2 m a p p i n g s are c o m b i n e d i n Step 2 of T a b l e 4 . 9 , b y s u m m a t i o n a n d m u l t i p l i c a t i o n . T h e a c c u m u l a t i o n of errors by s u m m a t i o n and m u l t i p l i c a t i o n has a s ign i f ican t ly worse effect t h a n a c c u m u l a t i o n by s u m m a t i o n alone, as i n the case of Q -L e a r n i n g . T h i s flaw is enough to negate other advantages th is n a t u r a l a l g o r i t h m m a y offer w h e n i t is a p p l i e d i n rea l con t ro l p rob lems . O t h e r a lgo r i thms w h i c h use a p p r o x i m a t i o n of b o t h the cost and p r o b a b i l i t y m a p p i n g s w o u l d have to deal w i t h th is issue also. T h i s is w h y the m o r e c o m p l e x vers ion descr ibed i n Sec t i on 4.4 a n d tes ted i n the last sec t ion is used. T h e purpose of the finite l eng th Z)-record is to t ake account of m u l t i -s tep t r a n s i t i o n p robab i l i t i e s w i t h o u t a p p r o x i m a t i n g t h e m . O f course, a great dea l of i n f o r m a t i o n o n t r a n s i t i o n p robab i l i t i e s is d i sca rded , bu t th is is appa ren t l y necessary. Chapter 5. Evaluation of Proposed Algorithms 152 5.3 On the Goal Change Algorithms In Sec t ion 4.7.1 a n d 4.7.4, goal state r emova l and a d d i t i o n a lgo r i t hms are p roposed for s ingle goa l states a n d m u l t i p l e goal states. He re the correctness and increased efficiency are demons t r a t ed . W e w i l l beg in w i t h the correctness of the r emova l a n d a d d i t i o n a l -g o r i t h m s . T h e n we w i l l cover the increased efficiency of the a d d i t i o n a lgo r i t hms . T h e efficiency of the r e m o v a l a lgo r i t hms w i l l not be covered because, as discussed i n Sec-t i o n 4.7.1 a n d 4.7.4, i t is not necessary to pe r fo rm t h e m to effect goa l changes. 5.3.1 The correctness of the goal change algorithms T h e correctness of the a d d i t i o n pass for s ingle goal state, as desc r ibed i n T a b l e 4 .11, is a rgued here. T h e correctness of the r emova l pass for s ingle goal state is nea r ly i d e n t i c a l to th is a rgumen t and w i l l not be repeated. A t cyc l e 1, i t is c lear tha t the e l ig ib le start state is sg, the state i n Ss after the c o m p l e t i o n of S tep 2, a n d the e l ig ib le end states are the states i n Se after the c o m p l e t i o n of S tep 2. T h e Q ' s at. the end of S tep 2 are 1 t i m e step l o n g and the Q ' s i n at the end of the first cyc l e t h r o u g h Steps 3 to 6 are 2 t i m e steps long . T h e Q values a l te red i n S tep 2 can e i ther act i n Steps 4 or 5 as the p re -c i rcu i t Q ' s or the pos t - c i r cu i t Q ' s . If t hey act as the p re -c i r cu i t Q ' s , t h e n the o n l y 2 t i m e step Q ' s affected by the a d d i t i o n of the r eward c o m p o n e n t to sg are those tha t beg in w i t h the states i n Sa and end i n states tha t are reachable f r o m the states i n Ss i n 2 t i m e steps. 
If they act as the post-circuit Q's, then the only 2-time-step Q's affected by the addition of the reward component to s_g are those that end with the states in S_e and begin with the states which can reach the states in S_e in 2 time steps.

We can test whether a 2-time-step Q is affected by seeing if its original value is greater than a new Q computed using the affected 1-time-step Q's from Step 2. The greater-than comparison is valid because any change due to the reward component in s_g must decrease the value of the new Q. If the tests in Steps 4 and 5 are true, then we know that the start states and end states of the affected 2-time-step Q's must be added to S_s and S_e for the next cycle, respectively. Pairs of start and end states are added to P, which merely keeps track of the association of start and end states for the purpose of narrowing the domain of the outer minimization in the next cycle. Using the new Q's, the new S_s, the new S_e, and the new P, Steps 4 and 5 are repeated in the next cycle. The logic of these cycles is identical to that of the first cycle through Steps 3 to 6 described above.

5.3.2 Increased efficiency of the goal change algorithms

Let φ be the maximum number of states reachable from any state s ∈ S in 1 time step. Let ψ be the maximum number of states that can reach any particular state s ∈ S in 1 time step. Let Φ be the total number of states in S. Let Φ_s be the number of states in S_s. Let Φ_e be the number of states in S_e.

Let us start with the algorithm described in Table 4.11. The numbers of states in the sets S_s and S_e increase in this sequence:

Initialization:  Φ_s = 1,                       Φ_e = φ.
Cycle 1:         Φ_s = 1 + ψ,                   Φ_e = φ(1 + ψ).
Cycle 2:         Φ_s = (1 + ψ)^2,               Φ_e = φ(1 + ψ)^2.
Cycle n:         Φ_s = (1 + ψ)^{2^{n−1}},        Φ_e = φ(1 + ψ)^{2^{n−1}}.

These are worst case or maximum numbers, in that we are ignoring the overlaps between the states reached from a set of states. Another factor by which the addition pass reduces the amount of computation is that, within each linkup operation (Step 4 or Step 5, whatever the case may be) in the addition pass, the comparison test is less intensive than a full linkup operation as shown in Step 1 of Table 4.5, because the outer minimization is over S_s or S_e, both subsets of S.

The increased efficiency comes partly from the smaller number of linkup operations. In a full execution of the Deterministic Doubling Algorithm, the number of linkup operations performed in each linkup cycle is Φ². If we perform this algorithm without goal states, as discussed in Sections 4.7.1 and 4.7.4, the minimum number of linkup operations not performed (or savings) is, after some algebra, the following:

    Σ_{i=1}^{n} [ (Φ − min(Φ, (1 + ψ)^{2^{i−1}})) Φ + min(Φ, (1 + ψ)^{2^{i−1}}) (Φ − min(Φ, φ(1 + ψ)^{2^{i−1}})) ] ,

where n is the number of linkup cycles performed.
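The expression above is straightforward to evaluate. A small sketch (Python; my own naming, with the number of linkup cycles passed in explicitly) is given below for the kind of setting discussed next (Φ = 1056, ψ = 10).

def addition_pass_savings(total_states, phi, psi, cycles):
    # Worst-case count of linkup operations avoided by the single-goal addition pass,
    # summed over the linkup cycles; total_states is Phi, phi the maximum out-degree,
    # psi the maximum in-degree.
    saved = 0
    for i in range(1, cycles + 1):
        grow = (1 + psi) ** (2 ** (i - 1))
        phi_s = min(total_states, grow)            # eligible start states in cycle i
        phi_e = min(total_states, phi * grow)      # eligible end states in cycle i
        saved += (total_states - phi_s) * total_states + phi_s * (total_states - phi_e)
    return saved

# Expressed as a fraction of the Phi^2 linkup operations in one full linkup cycle:
Phi = 1056
print(addition_pass_savings(Phi, phi=2, psi=10, cycles=7) / Phi ** 2)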
Table 5.23 shows numerically the minimum number of linkup operations not performed for some plausible systems. In this table, Φ = 1056, which is the number of states in the one-link manipulator simulation, and ψ = 10, which is a reasonable number for the same simulation. The first column of the table shows the number of linkup cycles, and the first row shows φ. In the one-link manipulator simulation, the number of linkup cycles is 7 and φ is equal to 2. According to Table 5.23, at φ of 2 and 7 linkup cycles, the minimum savings, expressed as a fraction of the linkup operations in one linkup cycle, is roughly 2. This means that, in the number of linkup operations alone, the addition pass for a single goal state effectively eliminates at least 2 out of 7 linkup cycles.

The analysis of efficiency for the multiple goal state case is a simple extension of the above. The minimum number of linkup operations not performed is the following:

    Σ_{i=1}^{n} [ (Φ − min(Φ, m(1 + ψ)^{2^{i−1}})) Φ + min(Φ, m(1 + ψ)^{2^{i−1}}) (Φ − min(Φ, mφ(1 + ψ)^{2^{i−1}})) ]
T h e sy s t em s i m u l a t i o n is a s m a l l 2 d i m e n s i o n a l sys t em of abst ract states w i t h ne ighborhood proper t ies . A large n u m b e r of r a n d o m l y generated sys t em d y n a m i c s were tested to y i e l d a measure of the a c c u m u l a t e d a p p r o x i m a t i o n errors over a l l states. It is found tha t the Chapter 5. Evaluation of Proposed Algorithms 156 l a t t e r 2 a lgo r i t hms are subs t an t i a l ly be t ter behaved i n the m e a n and the var iance . A l -t h o u g h the abs t rac t sy s t em gives us the a b i l i t y to test a l a rge n u m b e r of poss ib le sy s t em d y n a m i c s , an eas i ly in te rpre tab le sys tem is needed to v i sua l i ze the i m p a c t of the a c c u m u -l a t ed a p p r o x i m a t i o n errors . T h e one- l ink m a n i p u l a t o r is chosen for th i s task . In a series of c o n t r o l l aw degrada t ion and con t ro l per formance s imu la t ions us ing aga in t e m p o r a l l y a n d s p a t i a l l y cor re la ted a p p r o x i m a t i o n errors, i t is found tha t the p roposed a lgo r i t hms are o n average m o r e to lerant of and less sensi t ive to the m a g n i t u d e of the source error e w h i c h genera ted the cor re la ted a p p r o x i m a t i o n errors. O n the s tochas t ic a lgo r i t hms , the focus is shifted f r o m a c c u m u l a t i o n of errors to a q u a l i t a t i v e measure ca l l ed order l iness . It is argued that order l iness is necessary for the effective l e a r n i n g of o p t i m a l s ta te- to-act ion m a p p i n g i n the ac to r - c r i t i c a l go r i t hms presented i n C h a p t e r 2. N u m e r i c a l demons t r a t i on of the correctness of the u n d e r l y i n g re-l a t i o n is presented first . T h e n the orderl iness of the O n e P o l i c y D P a l g o r i t h m , the T D ( A ) a l g o r i t h m , and the A L D P a l g o r i t h m are compared . It is found tha t the A L D P a l g o r i t h m e x h i b i t s the order l iness of the O n e P o l i c y D P a l g o r i t h m and the speed of l ea rn ing of the T D ( A ) a l g o r i t h m . T h e S D A , an a l g o r i t h m i n c l u d e d i n th is w o r k o n l y for d i scuss ion a n d for a n u m e r i c a l test, is shown to pe r fo rm p o o r l y w h e n func t ion a p p r o x i m a t i o n is used. T h i s resul t suggests tha t other a lgo r i thms w h i c h use b o t h cost and p r o b a b i l i t y m a p p i n g s w i l l be at a d isadvantage when used for large scale systems. O n the goal change a l g o r i t h m s , the correctness of the a d d i t i o n a lgo r i t hms is demon-s t ra ted a n d expressions for the m i n i m u m savings i n the n u m b e r of l i n k u p cycles are presented for a d d i t i o n a l g o r i t h m for single goal states and for m u l t i p l e goal states. Fo r one - l ink m a n i p u l a t o r p r o b l e m as s t ruc tu red i n th is chapter , the m i n i m u m theore t i ca l savings is at least 1.94 out of 7 l i n k u p cycles . T h e ac tua l savings w i l l be greater s ince a n u m b e r of factors w h i c h con t r ibu te to greater savings has not been t a k e n in to account . S ince the r e m o v a l a lgo r i t hms are near ly i den t i c a l to the a d d i t i o n a lgo r i t hms and they Chapter 5. Evaluation o f Proposed Algorithms 157 are not necessary i n the mos t c o m m o n goal change s i t u a t i o n , the deve lopment of the correctness and expressions for m i n i m u m savings has not been repeated for t h e m . 
Chapter 6 Conclusions and Future Challenges D y n a m i c P r o g r a m m i n g and Q - L e a r n i n g are general a l g o r i t h m types for a w i d e var ie ty of o p t i m a l con t ro l p rob l ems . Success i n a n u m b e r of app l i ca t ions have shown the i r u t i l i t i e s . T h e r e is also a subs tan t i a l n u m b e r of theore t ica l results w i t h respect to the convergence of a l go r i t hms of these types . Howeve r , there are some results showing the deleterious effects of a p p r o x i m a t i o n errors on l ea rn ing , bu t the au thor is not aware of any sat isfactory so lu t ions . D P a n d Q L are m a i n l y f o r m u l a t e d for l ea rn ing i n one t e m p o r a l reso lu t ion and one spa t i a l r e so lu t ion . A need exis ts for ex tens t ion of R L to hierarchies . R L a lgo r i t hms have been f o r m u l a t e d for a s ingle goa l as defined by the values of the i m m e d i a t e costs. A need exis ts for efficient m e t h o d s to change the goal o f a R L con t ro l l e r w i t h o u t r epea t ing the l e a r n i n g process i n i ts ent i re ty . 6.1 Conclusions In the process of e x t e n d i n g Q L to hierarchies , three L i n k u p R e l a t i o n s are de r ived . T h e y show new ways of c o m p u t i n g mod i f i ed forms of Q- and / - v a l u e s . T h e Re la t i ons use a new u p d a t e ope ra t i on ca l l ed l i n k u p , the n a m e i n d i c a t i n g the l i n k i n g up of the eva lua t ion of t w o fami l i es of t ra jector ies . T h e effect of these re la t ions is to enable new a lgor i thms w h i c h have grea t ly reduced n u m b e r of a p p r o x i m a t i o n cycles . T h e mos t general L i n k u p R e l a t i o n , E q . (3.37), is for a l e a rn ing s i t ua t i on s i m i l a r to tha t i n Q L , w h i c h consists o f a s tochas t ic sy s t em a n d m o r e t h a n one ac t ion poss ible 158 Chapter 6. Conclusions and Future Challenges 159 for each state. In a t t e m p t i n g to app ly th is L i n k u p R e l a t i o n , a n u m b e r of issues were d iscovered w h i c h show tha t th is is not p r a c t i c a l . These issues are va luab le h in t s on h o w h i e r a r c h i c a l l e a rn ing m u s t be done. T h e other two L i n k u p R e l a t i o n s , D e t e r m i n i s t i c L i n k u p R e l a t i o n ( E q . (4.42)) and O n e P o l i c y L i n k u p R e l a t i o n ( E q . (4.45)), are d e r i v e d i n response to the above discovery. It is found that l i n k u p s can be advantageous ly a p p l i e d i f the sy s t em is d e t e r m i n i s t i c or i f we are on ly eva lua t ing one p o l i c y at a t i m e , s i m i l a r tp the cond i t ions of l ea rn ing assumed by T D ( A ) a lgo r i thms . T h e mos t i m p o r t a n t a c c o m p l i s h m e n t is tha t for the d e t e r m i n i s t i c sys tems, two syn-chronous a lgo r i t hms based on the D e t e r m i n i s t i c L i n k u p R e l a t i o n are inven ted w h i c h achieve a large r e d u c t i o n o f a c c u m u l a t e d a p p r o x i m a t i o n errors over a representa t ive ver-s ion of Q L , the S Q L a l g o r i t h m . T h e ga in is ob ta ined w i t h o u t sacr i f ic ing o p t i m a l i t y . T h e r e d u c t i o n t r ans la t ed to be t ter con t ro l per formance and lesser con t ro l p o l i c y degrada t ion i n s i m u l a t i o n s of the one- l ink m a n i p u l a t o r i n the presence of var ious types of a p p r o x i m a -t i o n errors . 
It is also found that it is possible to change the goal of the controller using variations of the deterministic Linkup algorithms, without repeating the entire learning processes of these algorithms. These variations are collectively called the goal change algorithms. Expressions for the savings in using the goal change algorithms are presented, and they show significant savings for the one-link manipulator simulation described in Chapter 5.

Another accomplishment is that, for the one-policy, stochastic, asynchronous algorithm, a large improvement in orderliness is achieved over TD(λ) without the slowness of basic QL. Orderliness is the maintenance of the order of the f-values of the states during learning. In an ACE/ASE system, the ACE stores the f-values and the ASE learns the optimal actions based on the relative values of the states in the ACE. If the order of these values fluctuates during learning, the ASE would be given conflicting signals. It is therefore desirable that an algorithm produces high orderliness, not only during the early and intermediate periods of learning, but also during the final period.

In an attempt to devise an off-line, synchronous algorithm for one-policy, stochastic systems, it is found that, in such algorithms, the errors are multiplied not only between linkup cycles, but also within a linkup cycle. The errors in the approximation of the value function and of the transition probability mapping interact multiplicatively. The result is much worse than a one-policy version of QL. This is a convincing reason for not doing off-line QL or DP learning for a stochastic system, even for one policy only, if an accurate model of either the costs or the transition probabilities is not available.

Referring back to Schmidhuber's Subgoal idea, the algorithms proposed in this thesis are new methods of finding optimal subgoals for deterministic systems. The optimal subgoals are recorded during the buildup phases of these algorithms. The subgoals are the optimal intermediate states found by the inner and outer minimizations.

6.2 Limitation of the Linkup Approach

The Stochastic Linkup Relation is difficult to apply to a general QL situation where the system is stochastic and there is more than one possible action per state. The difficulties are described in Sections 4.2.2 to 4.2.5. Yet, this is perhaps a valuable discovery because it may hint at a fundamental nature of hierarchical learning.
This nature is that it is perhaps computationally impractical to try to find optimal control laws by taking into account every possible outcome at its correct probability and looking at all possible control laws simultaneously, when both trajectory sequences being linked together are at a higher time resolution than the time resolution of the control.

6.3 Discussion

In the RL literature, it is almost taken as fact that on-line, asynchronous algorithms are superior to synchronous algorithms. The result is that problems better suited to off-line, synchronous algorithms are not attended to sufficiently. One argument for this supposed superiority is that on-line, asynchronous algorithms do not require the complete set of state transition samples to obtain good results. It is also argued that the main disadvantage of synchronous algorithms is the difficulty of model building. In the case of stochastic systems, this thesis has found support for this argument, but the opposite conclusion is reached in the case of deterministic systems.

The distinct advantage found for synchronous LADDP on deterministic systems suggests that attention should be paid to synchronous algorithms again, particularly for the control of robots consisting of links and joints. In certain robotic problems, such as the problem of finding the optimal path for a robot through an area populated by obstacles and structured by walls, an analytical model of the problem is probably pointless to derive because we want the robot to be able to cope with new environments. The same is not true for many robotic problems of the links-and-joints variety. The dynamic and kinematic models of a particular robot of the latter type are relatively constant and often derivable, but the solution is difficult. An example is the walking machine. In this problem, the dynamics have largely been worked out, but there is no control law which can cause the machine to walk in a fault-tolerant fashion for a wide variety of start and end positions. This is where off-line, synchronous algorithms for deterministic systems have an advantage over on-line, asynchronous algorithms for stochastic systems, because the former are more tolerant of approximation errors and better suited to the problem.

On the question of hierarchies, LADDP is directly relevant to the chaining of macros. A macro is like a control sequence, except that it is parameterized to increase its generality. The aforementioned difficulty of applying LADDP to stochastic systems may mean that, in a hierarchy, the system being controlled must be assumed to be deterministic. If the system is in fact stochastic, doing so may mean that the system would not be controlled optimally. Perhaps this optimality cannot be achieved in practice in many large-scale systems.

The method of building a macro covering a longer time period is roughly as follows. Each macro has a domain of start states and a range of end states. The new and larger macro would be one with the domain of start states of the first macro and the range of end states of the second macro. The linkup is at the optimal intermediate states which are determined by the two minimization operations. The inner minimization is with respect to the action of the second macro and the outer is with respect to the intersecting states of the two smaller macros. This is a speculation of how linkup can be used in a hierarchy; the precise mechanisms are a large problem which is beyond the scope of this thesis.
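As a purely illustrative sketch of this chaining step, the fragment below represents each macro simply as a table of best costs between start and end states; the tabular representation, the names, and the discounting over the first macro's horizon are assumptions made for illustration only, not the parameterized macros envisioned above, and the inner minimization over the second macro's actions is assumed to be already folded into its table.

# Illustrative only: a "macro" is a dict mapping (start_state, end_state) to
# (best_cost, first_action); the second macro's costs already reflect the
# inner minimization over its own actions.
def chain_macros(macro1, macro2, gamma_T=1.0):
    chained = {}
    for (s, m), (c1, a1) in macro1.items():          # m is a candidate intermediate state
        for (m2, e), (c2, _) in macro2.items():
            if m2 != m:
                continue
            total = c1 + gamma_T * c2                # discount over the first macro's horizon
            key = (s, e)
            # outer minimization over the intersecting (intermediate) states
            if key not in chained or total < chained[key][0]:
                chained[key] = (total, a1, m)        # record the optimal subgoal m
    return chained

# Example with hypothetical states 0..3: chain two short macros into a longer one.
m1 = {(0, 1): (1.0, "a"), (0, 2): (0.5, "b")}
m2 = {(1, 3): (0.2, "c"), (2, 3): (1.0, "d")}
print(chain_macros(m1, m2))   # {(0, 3): (1.2, 'a', 1)}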
6.4 Future Challenges

A number of suggestions to further this work can be made:

1. Continue to push for wider application of the Stochastic Linkup Relation.

2. Come to a conclusion about the practicality of trying to find optimal control laws in the lowest time resolution when the system is stochastic and the trajectories to be linked are in coarser time resolutions. If it is found to be impractical, a different way of handling probabilistic outcomes should be devised, perhaps by using control laws in the same time resolution as the trajectories to be linked. It would probably mean giving up optimality as defined for the lowest time resolution.

3. Find more theoretical results about the error accumulation behaviours of SQL, Algorithm 2, the Doubling Algorithm, and other competing algorithms, in particular the mean accumulated fractional error.

4. The obstacle to the application of LADDP to a hierarchy is the intermediate layer(s), which interface the lower-level state and control representations to higher-level symbolic representations. If a general parameterized intermediate representation (or the form of the macros) is found, LADDP can provide the basis for the optimal linkups of these and higher-level representations.

5. An interesting speculation is whether the aggregated variable in LADDP can be something other than time, in addition to time.
There may be other variables which share some essential features with time which would render it possible to aggregate them. A most rewarding finding would be that spatial dimensions are such variables. This would partly solve the puzzle of learning with respect to different spatial resolutions in Albus' intelligence hierarchy [97]. Another such variable may be something like "exhaustion", which for a continuous period of work by a biological being would increase monotonically.

6. With the current abundance of research on approximators, they are able to handle an increasingly greater number of input variables. The time may be at hand, or is near, when substantial robotic control problems should be attempted. An economically valuable project would be the application of the proposed synchronous algorithms, combined with a powerful and accurate approximator, to a biped walking machine.

Bibliography

[1] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

[2] S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School, Hillsdale, NJ, December 1993. Lawrence Erlbaum.

[3] J. H. Schmidhuber and R. Wahnsiedler. Planning simple trajectories using neural subgoal generators. In The Second International Conference on Simulation of Adaptive Behavior (SAB92), 1992.

[4] M. L. Minsky. Theory of Neural-Analog Reinforcement Learning Systems and its Application to the Brain-Model Problem. PhD thesis, Princeton University, Princeton, NJ, 1954.

[5] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3:210-229, 1959.

[6] J. M. Mendel and R. W. McLaren. Reinforcement learning control and pattern recognition systems. In J. M. Mendel and K. S. Fu, editors, Adaptive, Learning and Pattern Recognition Systems: Theory and Application, pages 287-318. Academic Press, New York, 1970.

[7] D. Michie and R. A. Chambers. Boxes: An experiment in adaptive control. In E. Dale and D. Michie, editors, Machine Intelligence 2, pages 137-152. Oliver and Boyd, Edinburgh, 1968.

[8] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, SMC-13(5):834-846, 1983.

[9] C. W. Anderson. Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, Department of Computer and Information Science, University of Massachusetts, Amherst, MA, 1986.

[10] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.

[11] R. S. Sutton.
Integrating architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216-224, San Mateo, CA, 1990. Morgan Kaufmann.

[12] C. J. C. H. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.

[13] R. S. Sutton. Temporal credit assignment in reinforcement learning. COINS 84-02, Department of Computer and Information Science, University of Massachusetts, Amherst, MA, 1984.

[14] P. J. Werbos. Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3:179-189, 1990.

[15] P. J. Werbos. Backpropagation and neurocontrol: a review and prospectus. In Proceedings of the 1989 International Joint Conference on Neural Networks, pages I-209-I-215, New York, 1989. IEEE Press.

[16] D. A. White and M. I. Jordan. Optimal control: A foundation for intelligent control. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 185-214. Van Nostrand Reinhold, New York, 1992.

[17] A. P. Sage. Linear Systems Control. Matrix Publishers, Inc., 1978.

[18] R. McGill and P. Kenneth. Solution of variational problems by means of a generalized Newton-Raphson operator. AIAA Journal, 1964.

[19] H. J. Kelley. Method of gradients. In Optimization Techniques with Applications to Aerospace Systems. Academic Press, New York, 1962.

[20] H. J. Kelley. Gradient theory of optimal flight paths. ARS Journal, 1960.

[21] A. E. Bryson and W. F. Denham. Optimal programming methods with inequality constraints II: solution by steepest ascent. AIAA Journal, 1964.

[22] J. B. Rosen. The gradient projection method for nonlinear programming. J. SIAM Control, 1960.

[23] J. B. Rosen. Iterative solution of nonlinear optimal control problems. J. SIAM Control Series A, 1966.

[24] M. H. A. Davis. Optimal stochastic control: general aspects. In Systems and Control Encyclopedia, pages 3517-3523. Pergamon Press, New York, 1987.

[25] M. Goursat and J. P. Quadrat. Optimal stochastic control: numerical methods. In Systems and Control Encyclopedia, pages 3517-3523. Pergamon Press, New York, 1987.

[26] W. H. Fleming and R. W. Rishel. Deterministic and Stochastic Optimal Control. Springer, New York, 1975.

[27] N. V. Krylov. Controlled Diffusion Processes. Springer, New York, 1980.

[28] P. L. Lions. Generalized solutions of Hamilton-Jacobi equations. In Research Notes in Mathematics. Pitman, London, 1982.

[29] M. H. A. Davis. Martingale methods in stochastic control, volume 16. 1979.

[30] R. J. Elliott. Stochastic Calculus and Applications. Springer, New York, 1982.

[31] N. El Karoui.
Les aspects probabilistes du controle stochastique. In Lecture Notes in Mathematics, volume 876. Springer, Berlin, 1981.

[32] K. S. Narendra and M. A. L. Thathachar. Learning Automata: An Introduction. Prentice-Hall, New Jersey, 1989.

[33] J. M. Bismut. An introductory approach to duality in optimal stochastic control. SIAM Rev., 20:62-78, 1978.

[34] U. Haussmann. On the stochastic maximum principle. SIAM J. Control Optim., 16:236-51, 1978.

[35] U. Haussmann. On the adjoint process for optimal control of diffusion processes. SIAM J. Control Optim., 19:221-42, 1981.

[36] F. Stroock and S. R. S. Varadhan. Multidimensional Diffusion Processes. Springer, Berlin, 1979.

[37] J. P. Quadrat. Existence de solution et algorithme de resolution numeriques de problemes stochastiques degenerees ou non. SIAM J. Control, 18:199-226, 1980.

[38] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press/Bradford Books, Cambridge, MA, 1986.

[39] G. Girosi and T. Poggio. Networks and the best approximation property. Technical Report 1164, MIT Artificial Intelligence Lab., October 1989.

[40] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1990.

[41] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.

[42] K. Hornik, M. Stinchcombe, and H. White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3:551-560, 1990.

[43] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4:251-257, 1991.

[44] D. Psaltis, A. Sideris, and A. Yamamura. Neural controllers. In Proceedings of the 1987 IEEE International Conference on Neural Networks, pages IV-551-IV-558, 1987.

[45] D. Psaltis, A. Sideris, and A. Yamamura. A multilayered neural network controller. IEEE Control Systems Magazine, pages 17-21, 1988.

[46] A. Guez, J. L. Eilbert, and M. Kam. Neural network architecture for control. IEEE Control Systems Magazine, pages 22-25, April 1988.

[47] A. G. Barto. Connectionist learning for control: An overview. Technical Report 89-89, Department of Computer and Information Science, University of Massachusetts, Amherst, MA, 1989.

[48] W. Li and J.-J. E. Slotine. Neural network control of unknown nonlinear systems.
In Proceedings of the 1989 American Control Conference, pages 1136-1141, 1990.

[49] P. J. Werbos. A current overview of neurocontrol. Tutorial, the 1991 International Joint Conference on Neural Networks, Seattle, WA, July 1991.

[50] E. D. Sontag. Some topics in neural networks and control. Technical Report LS93-02, Department of Mathematics, Rutgers University, New Brunswick, NJ, 1993.

[51] K. F. Fong and A. P. Loh. Discrete-time optimal control using neural nets. In IJCNN 91, pages 1355-1360, Singapore, 1991.

[52] P. J. Werbos. Plenary Speech, the 1990 International Joint Conference on Neural Networks, San Diego, CA, June 1990.

[53] R. J. Williams and L. C. Baird, III. Analysis of some incremental variants of policy iteration: first steps toward understanding actor-critic learning systems. NU-CCS-93-11, College of Computer Science, Northeastern University, Boston, MA, 1993.

[54] D. F. Shanno. Recent advances in numerical techniques for large-scale optimization. In W. T. Miller, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control. MIT Press/Bradford Books, Cambridge, MA, 1990.

[55] N. Baba. A new approach for finding the global minimum of error function of neural networks. Neural Networks, 2:367-373, 1989.

[56] J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI, 1975.

[57] P. Dayan. Convergence of TD(λ) for general λ. Machine Learning, 8(3-4):341, 1992.

[58] P. Dayan and T. J. Sejnowski. TD(λ) converges with probability 1. Machine Learning, 14:295-301, 1993.

[59] T. Jaakkola, M. I. Jordan, and S. P. Singh. Stochastic convergence of iterative DP algorithms. In Advances in Neural Information Processing Systems 6, pages 703-710. Morgan Kaufmann Publishers, San Mateo, 1994.

[60] G. Tesauro. Reinforcement learning in game playing. Plenary Speech, International Joint Conference on Neural Networks, Seattle, WA, July 7-12, 1991.

[61] G. Tesauro. Practical issues in temporal difference learning. Technical report, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, 1991.

[62] D. A. Sofge and D. A. White. Applied learning - optimal control for manufacturing. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 259-282. Van Nostrand Reinhold, New York, 1992.

[63] P. J. Werbos. A menu of designs for reinforcement learning over time. In W. T. Miller, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control. MIT Press, Cambridge, MA, 1990.

[64] A. G. Barto and P. Anandan. Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, SMC-15:360-374, 1985.

[65] P. J.
Werbos, T. McAvoy, and T. Su. Neural networks, system identification, and control in the chemical process industries. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 283-356. Van Nostrand Reinhold, New York, 1992.

[66] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560, 1990.

[67] P. J. Werbos. Approximate dynamic programming for real-time control. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 493-525. Van Nostrand Reinhold, New York, 1992.

[68] P. J. Werbos. Neurocontrol and supervised learning: An overview and evaluation. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 65-89. Van Nostrand Reinhold, New York, 1992.

[69] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ, 1987.

[70] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Technical report, Department of Computer Science, University of Massachusetts, Amherst, MA, January 1993.

[71] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ, 1989.

[72] R. S. Sutton, A. G. Barto, and R. J. Williams. Reinforcement learning is direct adaptive optimal control. IEEE Control Systems Magazine, pages 19-22, April 1992.

[73] C. J. C. H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8:279-292, 1992.

[74] S. J. Bradtke. Reinforcement learning applied to linear quadratic regulation. In Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, 1993.

[75] R. S. Sutton. First results with Dyna, an integrated architecture for learning, planning and reacting. In W. T. Miller, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control. MIT Press/Bradford Books, Cambridge, MA, 1990.

[76] A. W. Moore and C. G. Atkeson. Memory-based reinforcement learning: Efficient computation with prioritized sweeping. In Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992.

[77] J. Peng and R. J. Williams. Efficient search control in Dyna. Technical report, College of Computer Science, Northeastern University, 1992.

[78] R. A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1):14-23, 1986.

[79] S. Mahadevan and J. Connell. Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55:311-365, 1992.

[80] A. W.
Moore. Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces. In Machine Learning: Proceedings of the Eighth International Workshop. Morgan Kaufmann, 1991.

[81] K. M. Buckland. Optimal Control of Dynamic Systems Through the Reinforcement Learning of Transition Points. PhD thesis, Department of Electrical Engineering, The University of British Columbia, Vancouver, B.C., Canada, 1994.

[82] K. M. Buckland and P. D. Lawrence. Transition point dynamic programming. In Advances in Neural Information Processing Systems 6, pages 639-646. Morgan Kaufmann Publishers, San Mateo, 1994.

[83] J. H. Schmidhuber. Towards compositional learning with dynamic neural networks. Technical Report FKI-129-90, Institut für Informatik, Technische Universität München, München, W. Germany, 1990.

[84] M. Eldracher and B. Baginski. Neural subgoal generation using backpropagation. In Proceedings of the World Congress on Neural Networks, pages III-145-III-148, Portland, OR, 1993.

[85] S. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3/4):323-339, 1992.

[86] C. W. Anderson. Q-learning with hidden-unit restarting. In Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA, 1993.

[87] C. K. Tham and R. W. Prager. Reinforcement learning methods for multi-linked manipulator obstacle avoidance and control. In Proceedings of the IEEE Asia-Pacific Workshop on Advances in Motion Control, pages 140-145, Singapore, July 1993.

[88] M. Sheppard, A. Oswald, C. Valenzuela, G. Sullivan, and R. Sotudeh. Reinforcement learning in control. In Proceedings of the 9th International Conference on Mathematical and Computer Modelling, Berkeley, CA, July 1993.

[89] J. Jameson. A neurocontroller based on model feedback and the adaptive heuristic critic. In IJCNN 90, pages II-37-II-44, San Diego, CA, 1990.

[90] D. A. Sofge and D. A. White. Neural network based process optimization and control. In Proceedings of the 29th Conference on Decision and Control, Honolulu, Hawaii, 1990.

[91] T. T. Jervis and F. Fallside. Pole balancing on a real rig using a reinforcement learning controller. CUED/F-INFENG/TR 115, Cambridge University Engineering Department, Cambridge, England, 1992.

[92] C. K. Tham and R. W. Prager. Reinforcement learning for multi-linked manipulator control. CUED/F-INFENG/TR 104, Cambridge University Engineering Department, Cambridge, UK, 1992.

[93] J. C. Hoskins and D. M. Himmelblau. Process control via artificial neural networks and reinforcement learning.
Computers and Chemical Engineering, 16(4):241-251, 1992.

[94] D. Yan and M. Saif. Neural network based controllers for non-linear systems. In Proceedings of the Second IEEE Conference on Control Applications, pages 331-336, Vancouver, B.C., Canada, 1993.

[95] K. KrishnaKumar, S. Sawhney, and R. Wai. Neuro-controllers for adaptive helicopter hover training. IEEE Transactions on Systems, Man, and Cybernetics, 24(8):1142-1152, 1994.

[96] J. Moody and V. Tresp. A trivial but fast reinforcement controller. To appear in Neural Computation, 6, 1994.

[97] J. S. Albus. Outline for a theory of intelligence. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):473-509, 1991.

[98] L. P. Devroye. A class of optimal performance directed probabilistic automata. IEEE Transactions on Systems, Man and Cybernetics, SMC-6:777-784, 1976.

[99] R. J. Williams. Reinforcement learning in connectionist networks: a mathematical analysis. Technical Report 8605, Institute for Cognitive Science, University of California, San Diego, CA, 1986.

[100] B. Horne and M. Jamshidi. A connectionist network for robotic gripper control. In Proceedings of the 27th Conference on Decision and Control, pages 1070-1074, Austin, Texas, December 1988.

[101] A. G. Barto, S. J. Bradtke, and S. P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, Department of Computer Science, University of Massachusetts, Amherst, MA, 1991.

[102] J. C. Houk, S. P. Singh, C. Fisher, and A. G. Barto. An adaptive sensorimotor network inspired by the anatomy and physiology of the cerebellum. In W. T. Miller, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control, pages 301-348. MIT Press, Cambridge, MA, 1990.

[103] J. C. C. Ip, P. D. Lawrence, and M. R. Ito. Linkup accelerated Q-learning. In Proceedings of the World Congress on Neural Networks, pages II-580-II-584, Portland, OR, 1993.

[104] J. C. C. Ip, M. R. Ito, and P. D. Lawrence. A class of linkup Q-learning algorithms for deterministic systems. Submitted to Journal of Artificial Intelligence Research, 1994.

[105] L.-J. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1993.

[106] P. Maes and R. Brooks. Learning to coordinate behaviors. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 796-802. AAAI Press/The MIT Press, 1990.

[107] S. Mahadevan and J. Connell. Scaling reinforcement learning to robotics by exploiting the subsumption architecture. In Proceedings of the Eighth International Workshop on Machine Learning, pages 328-332. Morgan Kaufmann, 1991.
