Algorithms for Partially Observable Markov Decision Processes by Hsien-Te Cheng B.Sc, National Taiwan University, 1975 M.Sc, National Taiwan University, 1977 M.Sc, University of Idaho, 1979 A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies Faculty of Commerce and Business Administration We accept this thesis as conforming to the required standard The University of British Columbia August 1988 ©Hsien-Te Cheng, 1988 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of Commerce The University of British Columbia Vancouver, Canada Date December 21, 1988 DE-6 (2/88) A B S T R A C T The thesis develops methods to solve discrete-time finite-state partially observable Markov decision processes. For the infinite horizon problem, only discounted reward case is considered. Several new algorithms for the finite horizon and the infinite horizon problems are developed. For the finite horizon problem, two new algorithms are developed. The first algo-rithm is called the relaxed region algorithm. For each support in the value function, this algorithm determines a region not smaller than its support region and modifies it implicitly in later steps until the exact support region is found. The second algorithm, called linear support algorithm, systematically approximates the value function until all supports in the value function are found. The most important feature of this algorithm is that it can be modified to find an approximate value function. The number of regions determined explicitly by both algorithms is the same as the number of supports in the value function, which is much less than the number of regions generated by the one-pass algorithm. Since the vertices of each region have to be found, these two algorithms are more efficient than the one-pass algorithm. The limited numerical examples also show that both methods are more efficient than the existing algorithms. For the infinite horizon problem, it is first shown that the approximation version of linear support algorithm can be used to substitute the policy improvement step in a standard successive approximation method to obtain an e-optimal value function. Next, an iterative discretization procedure is developed which uses a small number of states ii to find new supports and improve the value function between two policy improvement steps. Since only a finite number of states are chosen in this process, some techniques developed for finite MDP can be applied here. Finally, we prove that the policy improve-ment step in iterative discretization procedure can be replaced by the approximation version of linear support algorithm. The last part of the thesis deals with problems with continuous signals. We first show that if the signal processes are uniformly distributed, then the problem can be reformulated as a problem with finite number of signals. Then the result is extended to where the signal processes are step functions. Since step functions can be easily used to approximate most of the probability distributions, this method can be used to approximate most of the problems with continuous signals. Finally, we present some conditions which guarantee that the linear support can be computed for any given state, then the methods developed for finite signal cases can be easily modified and applied to problems for which the conditions hold. iii T A B L E O F C O N T E N T S A B S T R A C T ii T A B L E O F C O N T E N T iv L I S T O F T A B L E S vi L I S T O F F I G U R E S viii A C K N O W L E D G M E N T S ix C H A P T E R 1: I N T R O D U C T I O N 1 I. Development of P O M D P 3 II. Summary of Results and Plan of the Thesis 6 C H A P T E R 2: P R O B L E M F O R M U L A T I O N A N D P R E L I M I N A R Y R E S U L T S 12 I. T h e Partially Observable Markov Decision Processes 12 II. Problem Formulation 14 III. Notations and Operators 19 IV. Major Properties 20 C H A P T E R 3: A L G O R I T H M S F O R F I N I T E H O R I Z O N P O M D P 27 I. Partition Method 29 II. One-Pass Algorithm 32 III. T h e Monahan Algorithm 37 IV. Relaxed Region Algorithm 38 V. Linear Support Algorithm 52 VI. Numerical Examples 62 VII. Conclusion 79 iv C H A P T E R 4: A L G O R I T H M S F O R I N F I N I T E H O R I Z O N P O M D P . . . . 81 I. Existing Algorithms for Infinite Horizon P O M D P 82 II. Preliminaries 84 III. Approximate Value Iteration 89 IV. Methods for Calculating Xjt, Uk, and //* 9 5 V. A n Iterative Discretization Procedure For P O M D P 98 VI. Accelerating the Convergence 109 VII. T h e Iterative Discretization Procedure with Approximation Policy Improvement 118 VIII. Numerical Examples 120 C H A P T E R 5: P A R T I A L L Y O B S E R V A B L E M A R K O V D E C I S I O N P R O C E S S E S W I T H C O N T I N U O U S S I G N A L D I S T R I B U T I O N S 133 I. Assumptions, Notations, and Formulations 134 II. Uniformly Distributed Signal Processes 136 III. Methods for Solving P O M D P with Continuous Signal Space . . . 143 B I B L I O G R A P H Y 149 A P P E N D I X 1 154 A P P E N D I X 2 170 V LIST O F T A B L E S Table 3.1: C P U times of the machine maintenance problem 64 Table 3.2: C P U times of the machine maintenance problem for the approxima-tion method 66 Table 3.3: Results of the data set D3.1 69 Table 3.4: Results of the data set D3.2 69 Table 3.5: Results of the data set D3.3 70 Table 3.6: Results of the data set D3.4 70 Table 3.7: Results of the data set D3.5 71 Table 3.8: Results of the data set D4.1 72 Table 3.9: Results of the data set D4.2 73 Table 3.10: Results of the data set D4.3 74 Table 3.11: Results of the data set D4.4 75 Table 3.12: Results of the data set D4.5 76 Table 3.13: Results of data set D4.5 by using the approximation method which limits the maximal number of supports in every iteration 77 Table 3.14: Results of the data set D5.1 78 Table 4.1: Results of Sondik's example (with exact policy improvement) . . 122 Table 4.2: Results of Sondik's example (with approximate policy improvement) 123 Table 4.3: Results of Sondik's example (use \\Hv — v\\ to compute error bound) 123 Table 4.4: Results of data set 1 (with approximate policy improvement) . . 125 Table 4.5: Results of data set 2 (with exact policy improvement) 126 vi Table 4.6: Results of data set 2 (with approximate policy improvement) . . 127 Table 4.7: Results of data set 3 (with exact policy improvement) 128 Table 4.8: Results of data set 3 (with approximate policy improvement) . . 129 Table 4.9: Results of data set 4 (with approximate policy improvement) . . 130 Table 4.10: Results of data set 5 (with approximate policy improvement) . . 131 vii LIST O F F I G U R E S Figure 2.1: Decision diagram for completely observable Markov decision processes 13 Figure 2.2: Decision diagram for partially observable Markov decision processes 15 viii A C K N O W L E D G M E N T S I am indebted to Professor Shelby Brumelle for introducing me to the subject of partially observable Markov decision processes and for guiding this research. His suggestions, advice, patience, and financial support have been invaluable to this work. I am deeply grateful. I am grateful to Professor Martin L. Puterman for introducing me to the computa-tional aspect of dynamic programming and Markov decision processes. His comments on these topics were always useful. I wish to thank Professor John Hughes and Carl Walters, who reviewed earlier drafts and offered many comments that resulted in improved clarity of presentation in the final draft of the thesis. I am grateful to Dr. Sondik for providing the computer code for the one-pass algorithm, and Professor Rubin of the University of North Carolina for providing the computer code for the Mattheiss algorithm. I also want to thank my friends for their encouragement during this work. Special thanks go to Mei-Jean Goh and Margaret Judd for proof-reading my thesis. I would also like to thank my family for their encouragement and financial support during my study years. The greatest debt of gratitude goes to my wife Lily for willingly sharing the frustrations, and for the encouragement and limitless patience exhibited during this work. ix C H A P T E R 1 I N T R O D U C T I O N After Bellman (1957) and Howard (1960) introduced the dynamic programming and Markov decision process models, the Markov Decision Process (MDP) has received much attention in operations research. It has been applied to a wide range of problems. A common situation is that a problem can be formulated as a the Markov decision process in all respects except that the states of the system are not fully observable and significant costs may be incurred to get this information. The following maintenance and repair problem, similar to the one discussed in Smallwood and Sondik (1973), is illustration of this fact. A machine is used to produce a particular product. Only one product is produced at each time period. This machine has two identical components, each of which must operate once upon the product before it is finished. The life of each component has an exponential distribution. The lifetime of one component is independent of the other. There is a positive probability that an operational component will break down in the process of manufacturing a product. If both components function well, then there is only a small chance that this machine produces a defective product. However, if either component fails, there is a higher probability that the machine will produce a defective product. Since both components are identical, we can model the dynamics of the machine with a three-state discrete time Markov process. The three states correspond to zero, 1 one, or two components having failed. Generally, we do not know whether a component is in a good or compromised condition. In order to know the state of the machine, we may stop its operation for inspection. However, most of the time, we can expect the machine to be functioning well, and hence such an action will increase the costs unnecessarily. Alternatively, the decision maker may choose to continue production or to examine the final products. Continuing production will not give us any information about the state of the system. Examining the final product, though, will give us the probability distribution of the state of the system by Bayes' rule. These two options will not tell us the exact state of the system under study. Therefore, even if we indeed have a Markov decision process, we may not know the system state when we choose an action. A problem that can be formulated using the MDP but which suffers from an imper-fect state observation is usually referred to as a Partially Observable Markov Decision Process (POMDP). More precisely, a partially observable Markov decision process is a generalized Markov decision process which allows an imperfect observation of the system states. As a descriptive model, the POMDP offers many advantages over MDP. For exam-ple, the POMDP allows an imperfect state observation, which happens often in the real world. The observation process may be non-Markovian. The POMDP model also forces the model builder to make a clear distinction between real systems and observations. Moreover, similar to what seems to occur in practice, an action or decision taken in 2 POMDP may affect the quality of future observations. Hence, the POMDP has been used to model a wide range of problems which will be presented later. As a prescriptive tool, however, the POMDP is awkward. The decision maker may be forced to make decisions based upon the entire history of the system, a string of past decisions and observations. The POMDP is usually converted into an equiva-lent completely observable MDP, where the state space is the conditional probability distribution of the system state given the history of the system (Astorm 1965). This conversion facilitates analysis of the problem, but does not overcome computational difficulties introduced by the state space being continuous and not finite. The aim of this thesis is to develop some efficient solution methods for POMDP's. The development of the POMDP will be reviewed in Section I of this chapter, while the plan and major results of this thesis will be summarized in Section II. * I. Development of P O M D P The Partially Observable Markov Decision Process (POMDP) is a natural extension of the MDP. Research on the POMDP began in the early sixties, just a few years after Howard's work (1960). Drake (1962) was the first person who developed the explicit POMDP model. About the same time, Astorm (1965, 1969) formulated the finite horizon POMDP. Several researchers have studied the theory of POMDP. Sawaragi and Yoshikawa 3 (1970) studied the POMDP with an uncountable action space and a countable system states. Rhenius (1974) considered POMDP problems with both action and system state spaces being Borel spaces. White and Harrington (1980) studied the relationship between the value functions, observation quality, and suboptimal decisions. Platzman (1980) developed the conditions under which the undiscounted infinite horizon POMDP is well defined. Some researchers studied conditions that ensure the optimal policies have certain structural characteristics. Albright (1979) presented conditions under which the optimal policy of a two system state POMDP is monotone on the probability distribution of the system states. White (1980) gave conditions which yield monotone optimal policies for finite horizon POMDP that is either completely observable or nonobservable. Recently, Lovejoy (1987) provided sufficient conditions for the optimal value in a discrete time, finite POMDP to be monotone on the space of state probability vectors ordered by likelihood ratios. The computational difficulties associated with POMDP's have been under study since the mid-sixties. Kakalik (1965) divided the space of state probability vectors into equal area grids and considered each grid as a state. The problem was then trans-formed into a finite MDP problem. Satia and Lave (1973) developed an implicit enu-meration algorithm for computing an e-optimal value function for the finite horizon POMDP. Smallwood and Sondik (1973) and Sondik (1971) developed a so-called one-pass algorithm for the finite horizon POMDP, and discovered that POMDP's are not as computationally intractable as general nondenumberable state MDP's. Sondik (1971, 4 1978) developed a policy iteration type algorithm for the infinite horizon POMDP. Brumelle and Sawaki (1978) presented a partition method for finite horizon POMDP and a modified policy iteration algorithm for the infinite horizon POMDP. Platzman (1981) discussed a finite memory algorithm to find a e-optimal policy for infinite horizon POMDP. Recently, White and Scherer (1986) developed a reward revision algorithm to solve infinite horizon problems. There are some algorithms developed only for special cases of the POMDP. Wang (1976, 1977) considered a two action replacement problems. Buckman and Miller (1979) reformulated an investigation problem as a regenerative stopping problem. Hughes (1980) reformulated a two-action, two-state sequential quality control model as a re-newal problem. We will discuss some of the algorithms in more detail later in this thesis. POMDP have been used to model a wide range of problems (Monahan 1982). One of the major applications is the machine replacement and quality control problems. Eck-les (1968), Ohnishi et al. (1984, 1986), Ross (1971), Wang (1976, 1977), White (1977, 1978, 1979a, 1979b) applied POMDP to different settings of machine replacement and quality control problems. Kaplan (1969) applied the results of the machine inspection and replacement problem to a cost control problem in accounting. Hughes (1977) mod-eled the internal control of a corporate control system as a POMDP. Smallwood (1971) developed a two state POMDP model of optimal teaching strategies. Smallwood et al. (1971) used POMDP concepts in the development of methodology for the analysis of health care system. White (1976) applied the theory of POMDP to design question-5 naires in situations where responses may not be truthful. Hsu and Marcus (1980) used the POMDP as a tool to solve the decentralized control of finite Markov processes. Eagle (1984) used the POMDP to study searching for the moving target. Recently, Lovejoy (1983) and Lane (1986a, 1986b) have applied the POMDP to fishery problems. II. Summary of Results and Plan of the Thesis The major results of this thesis can be divided into three parts: algorithms for the finite horizon POMDP, algorithms for the infinite horizon POMDP, and methods for solving the POMDP with continuous signals. Each of these sets of results is discussed below: 1. Algorithms for Finite Horizon P O M D P : Two new algorithms are developed for solving finite horizon POMDP problems. The first algorithm, called the relaxed region algorithm, is a modification of Sondik's one-pass algorithm. Instead of finding an exact support region in the state space which corresponds to a linear support for the value function, a larger relaxed region which contains this exact region is found. In later steps, this relaxed region is modified. At the end of the procedures, the regions corresponding to each of the linear supports in the value function are found exactly. Unlike Sondik's one-pass algorithm, the number of regions produced by this method is exactly the same as the number of linear supports in the value function. In other words, the number of regions produced by this method is much smaller than that of Sondik's one-pass algorithm. Since fewer regions are 6 produced, this method is much more efficient than the Sondik method. The second new algorithm is a linear support algorithm. This algorithm can be viewed as a special type of relaxed region algorithm, although it is different from the re-laxed region algorithm discussed above. This algorithm uses only the convexity and piecewise-linearity of POMDP. This method systematically approximates the value function until all linear supports in the value function are found. One of the most important features of this algorithm is that it can be used to find an e-optimal value function. Although this algorithm may not be the only algorithm which can find an ap-proximate value function, it might be the only algorithm which can be used for finding both exact and approximate value functions. The numerical examples show that this is a very efficient algorithm for finding an exact value function. More importantly, when there are a large number of linear supports to form a value function, this algorithm requires only a fraction of CPU time to find an approximate value function which is very close to the optimal one. Because of its ability to find an approximate value func-tion efficiently, this algorithm also is used in the development of several methods in the infinite horizon POMDP. 2. Algorithms for Infinite Horizon P O M D P : Several new algorithms are developed in this part. In regular successive approximation method, the exact value function is found in each iteration. As we know, when a large number of linear supports are required to form a value function, it is difficult to find an exact value function, no matter which 7 algorithm is used. A natural resolution to this difficulty is to apply an approximate value function for each policy improvement and hope an e-optimal value function for infinite horizon POMDP can be obtained. We will prove that this approach is workable in Chapter 4; i.e., an e-optimal value function can be obtained by repeatedly applying the approximate policy improvement step. This is the first result in this part. Another possible method for overcoming the difficulty of the standard successive approximation approach is to reduce the number of iterations of policy improvement. The method developed under this category introduces a discrete phase between two policy improvement steps. In each of the discrete phases, only a small set of states are considered. In each iteration in a discrete phase, linear supports corresponding to these selected states are computed. The maximal value of newly computed linear supports and the linear supports in the previous value function form a new value function. By performing iterations in a discrete phase, the value function can be improved without complicating computations. We called this method the iterative discretization procedure. Since only a finite number of states are considered in a discrete phase, some tech-niques developed for the finite MDP can be applied, at lease in concept, to iterative discretization procedures to accelerate the convergence. Three methods are considered. They are the Gauss-Seidel method, the action elimination method, and the modified policy iteration method. In the iterative discretization procedure introduced here, the exact value function is required for each iteration of policy improvement. A natural extension is to replace 8 this exact value function by an approximate value function since an approximate value function is easier to compute and numerically more stable. It is shown that the iterative discretization procedure can work with an approximate policy improvement. 3. Algorithms for P O M D P with Continuous Signal Space: In this part, the assumption of only a finite number of signals is relaxed. It is assumed that there is a probability distribution of signals for each system state and action. There is little research in this area, especially with respect to a general purpose algorithm. The major difficulty for developing such an algorithm is that the property of piecewise linearity of the value function which is available in the finite signal setting is not preserved in the setting. This feature raises computational difficulties. It is first proven that, if all signal processes are uniformly distributed, the problem can be reformulated as a finite signal problem. As a result, all the algorithms for the finite signal problems can then be applied. This result is then extended to step functions; that is, if the signal processes are step functions, the problem can be reformulated as a finite signals problem. Although there may not be many problems in which signal processes are step functions, step functions can be easily applied to approximate any distribution. Therefore, we can use this approach to solve most problems. The assumption of finite signals is only used to guarantee that it is possible to find 9 the linear support for any given state in the algorithms discussed in Part 1 and Part 2. If there is a method which guarantees that linear supports can be obtained for the given states, then the algorithms can be applied to problems with continuous signals without major changes. Some conditions are developed that guarantee the existence of a linear support for any given state. If these conditions are met, then the methods developed i n previous two parts can be used in here too. T h e plan for the remainder of the thesis follows. Chapter 2: Problem Formulation and Preliminary Results The formal problem setting for a P O M D P is introduced. Next the problem is refor-mulated as a completely observable M D P with continuous state space. The properties of this newly formulated MDP, which will be used in later chapters, are then discussed. Chapter 3: Algorithms for Finite Horizon P O M D P T h i s chapter discusses the algorithms for finite horizon P OMDP. Major existing al-gorithms, the partition method, Sondik's one-pass algorithm, and the Monahan method, are presented and reviewed. Then two new algorithms, the relaxed region algorithm and the linear support algorithm, are developed. Some computational results are used to compare the efficiency of these algorithms. Chapter 4: Algorithms for Infinite Horizon P O M D P T h i s chapter discusses the computational methods for infinite horizon P O M D P problems and is organized as follows. First, the major existing algorithms are reviewed. 10 Next a successive approximation method with approximate policy improvement to ob-tain an c-optimal value function and an iterative discretization procedure are devel-oped. Then various methods to accelerate convergence are considered. Then some of the previous ideas are combined to apply approximate policy improvement with iterative successive approximation. Finally, a numerical comparison of algorithms is presented. Chapter 5: P O M D P with Continuous Signal Distribution; This chapter first considers the refomulation of a problem with uniformly dis-tributed signal processes to a problem with a finite number of signals. Then this proce-dure is extended to signal processes with distributions which are step functions. Some conditions which guarantee that a linear support to the value function at any given state can be constructed are stated. Finally, the algorithms developed in the previous two chapters are adapted to this setting. 11 C H A P T E R 2 P R O B L E M F O R M U L A T I O N A N D P R E L I M I N A R Y R E S U L T S In this chapter, the formal problem setting for a POMDP will be introduced. Then the problem is transformed to a completely observable MDP. Lastly, the properties to be used in later chapters are discussed. I. The Partially Observable Markov Decision Processes Underlying a POMDP is assumed to be a discrete-time finite-state Markov chain. This process has N stage-invariant states labelled 1,2, ...,N. In order to distinguish states for the underlying process from the states in the decision problem discussed later, the states in the underlying process are called system states. Time is divided into discrete periods labelled by nonnegative integers t. At any time epoch t the system state is denoted by Xt. At each decision time epoch on a stage, the decision maker has to choose an action from an action space D. The action chosen at time t is denoted by Yt. In this thesis, the action space is assumed finite, t It is also assumed that the action space D is stage-invariant for an infinite horizon stage problem. f The assumption of a finite action space is not necessary for the model de-velopment. Sawaragi and Yoshikawa (1970) developed the theory of POMDP's with an uncountable action space and a countable system state space. However, in order to develop efficient computational algorithms, we limit our attention to a finite action space. 12 At decision epoch t, if the process is in system state i and the decision maker chooses action d, then a set of transition probabilities {pdj(t)\ j = 1,2,..., N) ( 53 • pdj(t) = 1 for any system state i) is established to describe the probability of moving to system state j at the next time epoch; that is, pdj(t) = Prp^+i = j | Xt = i, Yt = d]. If the transition probability is independent of time t or it is clear from content, the index t is omitted. For an infinite horizon problem, stationary transition probabilities are assumed. It is also assumed that the decision maker knows the transition probability function. A finite reward rd(i) is obtained whenever the system is in system state i and action d is chosen. The reward is stage-invariant. The notation rd is a iV-dimensional column vector with rd(i) as its t-th element. System moves to a new state according to P\Yt t I t + l Choose Action Yt based on system state Incur Reward Fig. 2.1: Decision diagram for completely observable Markov decision processes If the decision maker knows the current system state before choosing an action, 13 the decision problem is a regular completely observable M D P . The decision maker can, then, choose an action based on the current system state to maximize the total expected (discounted) reward. To make the decision procedure clear, consider the block diagram of F ig . 2.1. P O M D P ' s have the characteristic that the state of the system cannot be observed directly. Instead, the decision maker receives a random signal from the system. In Chapter 2 to 4, the signal space 0 is assumed finite. This assumption will be relaxed in Chapter 5. ; Let the signal received at time t be denoted by Zt. Although the signal is random, the decision maker knows the conditional probability of getting signal 8 when the sys-tem state and action are known. Assuming that this conditional probability is stage invariant, it can be denoted by a matrix Qd = where gfe = Pv[zt = e\ xt ='i,yi_1 = d\ v*. Assume that the objective of the decision model is to maximize the total expected (discounted) reward. Since the decision maker does not know the exact system state of the process when choosing an actian,rhe/shemay-choose an action based on the entire history of the process. (This will be discussed in next section.) To contrast Fig. 2.1, consider the block diagram for POMDP i n F ig . 2.2. , 14 System moves to a new state according to P\Yt t + l Choose Action Yt based on system information T Incur Reward Transform system history to useful information *(0 It T Update system history It — {It-i,Yt-i,Zt} Receive a signal Zt according to Q\Yt F i g . 2.2: Decision diagram for partially observable Markov decision processes. II. Problem Formulation For a P O M D P problem, the system state is usually unknown to the decision maker at the time of choosing an action. Moreover, since the signal process itself need not be a Markov process, it may be necessary to base decisions on more than just the signal received i n epoch t. In this section, new formulations will be considered to resolve this difficulty. 15 Let 7r(0) = [7Ti(0),7T2(0), ...,7r#(0)] be the probability distribution of the initial system states where TT,(0) = TI{XQ = i) for 1 = 1,2,... , N . Assume 7r(0) is known by the decision maker. T h e history of the process up to time t is denoted as It where Io = [7r(0),Z 0] It = [n(0),Z0,Yo,Zi,Yi,... ,Zt-i,Yt-i,Zt] and It+i = [It,Yt,Zt+i). We remark that {It, t = 0,1,2,...} is a Markov decision process. Let vt(-) denote the optimal total expected reward from time t to the end of the decision horizon. The dynamic programming recursive equation can now be written as vt(lt) = max E{rY<(XT) + fi • »«+,(/,+,) | IuYt) it fc/j = m a x { E ( r y ' ( X t ) | IuYt) + fi • E(vt+1(It+1) \ IuYt)} N = m a x { ^ P r ( ^ = i | 7 t ) T y ' ( 0 i=l + fi • P r ( ^ + i = 6 I lu Yt) • vt+1({It,Yt,Zt+i = *])} (2 - 1) *ee where fi is a discount factor. We assume that 0 < fi < 1 for infinite horizon problems and 0 < fi for finite horizon problems. Note that the state variables are known exactly in equation (2-1). B y Bayes' rule and induction, it is also possible to find the value of P r ( Z t + i = 6 | ItyYt) for any given 6, It and action. Therefore, we have a completely observable Markov decision process. In order to solve the recursive equation (2-1), the value functions in period t must be evaluated for each possible history It. When there are a large number of possible actions 16 and/or signals or the decision horizon is long, then the number of possible histories It is large (or infinite) and grows linearly i n t. The computational requirements of this dynamic programming algorithm can be truly prohibitive. (The details of this problem are discussed i n Section 4.1 of Bertsekas (1976).) Therefore, we should consider other possible state variables. Let *i(t) = Pi(Xt = i | It) and ir(t) = M<), 7 r 2 ( < ) , . . w h e r e £i=i *.•(<) = 1 and 0 < 7Ti(<) < 1 for i = 1,2,..., N. It is easy to show Tr(Xt+1 = i | It,Yt,Zi+1) = Pr(Xt+1 = i | * (<) ,Y u Z t + 1 ) and P r ( Z t + a = 6 \ It,Yt) = P r ( Z t + 1 = 6 \ ir(t),Yt). Therefore, given It, Yt = d, and Z<+i = 6, by Bayes' rule, Pr(Xt+1=j \It,Yt = d,Zt+1=e) PT(Xt+1=j\*(i),Yt = d,Zt+1=6) ?T(Xt+l=j,Zt+1=6\n(t),Yt = d) ( 2 - 2 ) Or, in matrix form, T ( * (0 ,4«)sPr ( X l + 1 |Z„ y 1 = s < i , Z w = 6) = P r ( X t + 1 | TT(0, KI = <f, Z,+i = $) *(0 •PiQe (2 - 3) ir(t) • P ' • Q{ • 1 where is a diagonal matrix with q°- as its diagonal elements and 1 is an N-dimensional column vector with all elements being 1. 17 It is well known that 7r(t) is a sufficient statistic for It when choosing an action at time epoch f. More precisely, 7r(r) summarizes all of the necessary information of the history of the process for choosing an action at time t (Bertsekas 1976, Monahan 1982, Sondik 1971, Striebel 1965). Let II = {ir € RN : Y*=\ *i = 1 ^ *i ^ 0 for i = 1,2, ...,N). The following result is also readily established (Astrom 1965, Monahan 1982, Rhenius 1974, Sawaragi and Yoshikawa 1970, and Sondik 1971): for any fixed sequence of actions Yi,Y2,... ,Yt € D, the sequence of probabilities {n(k)} where k = 0 , 1 , . . . , t is a Markov process; that is, if T C II, then Pr(7r(t + 1) € T | TT(0), TT(I), . . . , nit), Yt) = Pr(n(t + 1) € T \ *(t\Yt). Therefore, consider using the probability distribution of the system states as the state variables; that is, using II as the new state space. Then, for 7r 6 II, «,(*) = max E{rd(Xt ) + vt+1 (T(v, d, 9)) \ TT, d) N = maxjy ; it, • rd(i) + 0 • Y Pr (Z f + 1 = 6 \ TT, d) • vt+1(T(*, d, 9))}. i=i eee ( 2 - 4 ) Note that, in this representation, the state variable is known exactly; and hence (2-4) is the recursive equation of a completely observable Markov decision process. Thus, the POMDP can be converted into an equivalent (completely observable) Markov decision process. When using It as the state variables, the state space is different in each stage. In contrast, using {pi(t),t = 0,1,.. .} as the state variables, the state space is always 18 the same, II. The stage invariant state space aids the problem analysis; however, to find an optimal policy is still not an easy task because the state space II is continuous and not finite. In general, it is not easy to solve a Markov decision process with a continuous state space. Fortunately, POMDP's possess some special properties which will be discussed later in this chapter. These properties will help us to develop some useful computational algorithms to find an optimal policy. III. Notations and Operators In this section, some notation and operators useful for later chapters are introduced. As mentioned in previous sections, II is the state space and D the action space. A policy is a function which maps the state space into the action space; i.e., if 6 is a policy, then 6 : H —* D where S(TT) is the action taken in state TT £ II. Let the policy space A be the set of all stationary policies. Let B be the set of all bounded real-valued functions on II. In this thesis, unless otherwise stated, the norm || • || is the supreme norm; for example, if v 6 B, then = sup{|u(7r)| : 7r 6 II}. For convenience, define the local income function h which assigns a real number to each triple (n,d, v) with n 6 II, d 6 D, and v 6 B. In Chapter 2 to 4, h is defined as h(n, d, v) = TT • rd + p • ^ P r ( * I *>d) • "(ZX*. d, 6)). eee Note that with this notation Equation (2-4) can be rewritten as vt(n) — rnax /i(7r, d, vt+i). 19 The local income function can, then, generate a return operator Hs on B for each 6 G A; i.e., [Hs(v)](7r) = ^(ff, 6(n), v). If 6(7?) = <f for all 7r 6 II, then if,f instead of i7{ is used. Finally an optimal operator H : B —» Bu is defined as J?t> = maxjGA[i?«v]. IV. Major Properties In the previous sections, the POMDP was converted into an equivalent completely observable Markov decision process. In this section, the more important properties of this new Markov decision process are considered. The first three properties, boundedness, monotonicity, and contraction, are com-monly assumed in dynamic prograrnming. It is shown that these properties are also present in a POMDP. Then, two further properties of the POMDP, i.e., piecewise lin-earity and convexity of the value functions, are discussed. Finally, the existence of an optimal value function and policy is also discussed. 1. Boundedness: The usual assumption of boundedness is as follows: There exist numbers K\ and K2 such that \\H6v\\ <K!+K2- \\v\\ for all v € B and 6 € A (Whitt 1978). Let r = maxdgrj){|r''(i)| : t = 1,2,... ,N), then for arbitrary 7r G II, 8 G A, and v e B \[Hsv\{*)\ = \vrd + p.Y, Pr(* I *,*00) • v(T(n,6(^), $))\ eee 20 < k • rd\ + fi • ]T I *, *00)' *))l *€© < r + / 9 . £ P r ( * | * ,*(*)) 'HI *€© = r + /?'NI Therefore, \\H6v\\ <r + fi\\v\\ for all v € B and A; that is, both for all 6 G A and H satisfy the boundedness assumption. 2. Monotonicity: The monotonicity assumption is as follows: If v > u in B, i.e., if v(ir) > U(TT) for all 7r G II, then Hsv > Hsu in B for all 6 G A (Whitt 1978). Now assume v > u in B, then for arbitrary 6 G A and n G II, (#»(TT) - (H6u)(ir) = fi-Y, P r ( * I • w(T(7r, *(TT), 5)) - 0 • £ P r ( * | M O O ) • u(T(ir, «€© = /? • £ Pr(0 | TT, 6(*)) • {V(T(TT, 6)) - u(T(*, «(*), 0))} > 0 Therefore, (HSV){TT) > (Hsu)(n) or Hsv > Hsu for all -n G II and 6 G A ; that is, both Hs for all 6 G A and H satisfy the monotonicity assumption. 3. Contraction: The contraction property is one of the most important properties in a problem with an infinite horizon. This property can be stated thus: For some fixed fi, 0 < fi < 1, then \\Hsv - H6u\\ < fi • \\v - u\\ for all u,v G B and 6 G A (Whitt 1978). 21 For arbitrary u,v € B and 7r € II, \(H6v)(*) - (H6u)(n)\ <P-Y, P r(* I *. *(*)) * l«(r(jr, *(*), $)) - U(T(TT, 6(n), 9))\ < 0 . £ P r ( 0 | -u||. Hence \\Hsv — Heu\\ < /? • ||t> — u|| for all u,v 6 B and 6 € A ; that is, both for all S G A and if satisfy the contraction mapping assumption. 4. Piecewise Linearity of the Value Function: The piecewise linearity and convexity of the value function are the two of the most important properties of a POMDP. Piecewise linearity was first discovered by Astrom (1965), but was formally introduced by Sondik (1971). Let us briefly describe the piecewise linearity of a POMDP. The following is Lemma 2.1 of Sondik (1971): Lemma 2.1: T(ir,d,0) preserves straight lines. That is, if 0 < p < 1, and Tr1,^2 G II, then p-ir1 +(1 — p)-ir2 is a straight line in II with end points n*,ir2. If the transformation of this line for a fixed signal 8 and action d is considered, then T(p • 7T1 -f (1 — p) • n2, a, 0) = fi-Tin1 ,d,0) + (l-p)-T(n2,d,8), where, as p ranges between 0 and 1, p. ranges between 0 and 1. Hence the image of a straight line under T(-,d,8) is a straight line, specifically, _ P-PT(8 \ -K\d) ^ " p • Pr(0 | n\d) + (1 - p) • Pr(0 ( n2,d)' 22 A function v is called piecewise linear and convex if there is a finite set of N-dimensional column vectors, A, such that v(n) = maxt=ii2>...,fc{7r • a' : where a' € A). For n € fl , d G D, and 6 € 0, define Anid>e = {a e A : T(TT, d, 9) • a > T(ir, d, 6) • a V-a € A}. Then v(T(ir,d,0)) = T{ir,d,B) • a for all a € A„<dt0. If a*tdj is a vector in An<d,6i then, by (2-3) and (2-4), Hv(it) can be written as: Hdv(ir) = ir-rd + l3-J2 Pr(0 | TT, d) • w(r(?r, rf, tf)) *ee = TT • [rd + fi • ^ P r f • Qde • a.,,,,]; (2-5) flee Since rd + fi • ^26^Q Pd • Qe • oc„tdt$ is an iV-dimensional vector, (2-5) can be simplified as Hdv(ir) = TT • a T ) d where a„td = rd + fi • J2eee p d ' Qe ' Q*,d,e- Moreover, Hv(n) = maxrfgo Hdv(ir), then Hv(ir) = m&x{Hdv(ir)} = TT • a , (2-6) where an = rd + fi • E$ee p d ' Qi ' a* «j $ if d is an optimal action for state 7r. As described in Lemma 2.1, T(-,d, 6) preserves straight lines. As a result of this property and the assumption of the piecewise linearity of v with a finite number of linear segments, for any given action d in D and signal 8 in 0, the state space II can be partitioned into a finite number of regions such that AK<dt$ is constant for all states n in the interior of each region. Moreover, if there are only finite numbers of actions 23 in D and signals in 0, a finer partition with a finite number of cells can be found such that >lir,<{,# = -<4ir',<i,0 for all d € D, and 8 G 0, given 7r and 7r' are in the same region. Therefore, if Hv(ir) = it • a*, then Hv(ir) = n • a* for all IT in the same region as ft in the finer partition. That is, Hv is a piecewise linear function. Note that, if the number of signals is not finite, the state space II might be parti-tioned into an infinite number of regions. In this case, Hv might not be a piecewise linear function with a finite number of linear segments. However, the convexity is preserved even in this case. We will discuss this type of problem in Chapter 5. 5. Convexity of the Value Function: The convexity of the value function for a system with a finite number of system states, actions, and signals was first presented and proven formally by Astrom (1969). His theorem can be simply described as: Given a convex function v, Hv is also convex. Note that this theorem can only be applied directly to the optimal operator. For an arbitrarily given policy 6 £ A , H&v is not necessarily convex even though t; is a convex function. However, in the proof of this theorem, the fact that H4V is convex for all d € D is proven. This is conceptually important for developing an algorithm for a finite horizon problem. White and Harrington (1980) extended Astrom's results to a more general setting where the signal space is not necessarily finite. In Lemma 3.1 of their paper, they showed that H^v is convex whenever v is, even if the signal space is not finite. Since the maximal value of a set of convex functions is convex function, Hv is also a convex 24 function. Therefore, if v is piecewise linear and convex, so is Hv. Moreover, from piecewise linearity, HV(TT) = ir • a* for TT in region i. Suppose there are k' regions and let AH = {a1 ,a2,... ,ak }. Since Hv is piecewise linear and convex, the function value of Hv can be written as Hv(n) = max {7r • a' : a* G AH} W G II. (2 - 7) T h a t is, if the vectors in AH are known, then, by piecewise linearity and convexity of Hv, the function value of Hv can be easily obtained by (2-7). It is not necessary to know the area of each region explicitly. 6. Optimal Value Function and Policy: Since the operators Hg and H are contraction mappings, for each operator, there exists an unique fixed point in B, denoted by vs and v*, respectively, such that Hgvs = vs and Hv* = v* (El'gol'c 1964). This implies that, if the initial state is TT G II and policy 6 G A is used over the infinite horizon, the total expected discount reward will be v&(n). Then, by definition of H, v*(n) = max^ A ^ ^ 7 1 " ) } for all 7r in II. Hence, v* is the optimal value function. A policy 6 G A which attains Htv* = Hv* is called an optimal policy. Sawaragi and Yoshikawa (1970) have shown that if the action space D is finite, there exists an optimal stationary policy. A policy 6 is said to be e-optimal if vs < v* and — vs \\ < t. For any given e > 0, there exists an e-optimal stationary policy even when the action space D is not finite (Sawaragi and Yoshikawa 1970). 25 Since a POMDP has a continuous state space, it is usually very difficult to find the vg for 6 € A or v*. Therefore, bounds for these values have been sought by several researchers. Astrom (1965) studied the finite horizon problem in an optimal control setting. He found that the value function for a partially observable control system is always better than that of an unobservable control system (or open-looped system). In contrast, the value function of a partially observable system is no better than that of a completely observable system. White and Harrington (1980) extended Astrom's theorem to compare the results of different qualities of measurement. They concluded that if the quality of the observation is improved, then the value function will also be improved. This result is true even when the optimal policy for the lower quality system is applied to the better measurement system. 26 C H A P T E R 3 A L G O R I T H M S F O R F I N I T E H O R I Z O N P O M D P Finite horizon algorithms are important not only to solve finite horizon problems but also for use as the policy improvement step for an infinite horizon problem. Hence an efficient algorithm is desired. If a one-stage POMDP problem can be solved, then by induction, a finite horizon problem can be solved. Therefore, in this chapter, the main focus will be on how to compute Hv for a given piecewise linear and convex function v- We will start with a characterization of v in term of its supports, and then compute a set of supports for Hv. Let v be a piecewise linear and convex function and A a given support set containing a finite number of iV-dimensional column vectors such that V(TT) > 7r • a Va 6 A and 7r 6 n and v(ir) = max{7r • a : a E A} W € II. Therefore, every vector in A is a support of v. The support sets discussed in this dissertation usually satisfy the following condition: for every a in a support set A, there exist a state 7r in n such that v(7r) = ir - a. The supports in a support set A which do not satisfy this condition can be deleted from A without changing the value function. In this dissertation, this condition is not a necessary one in the development of algorithms. However, delete the unnecessary supports will improve the efficiency of the algorithms. 27 A region, R(a, A), is called a support region for a support a € A if R(a,A) = {TT € II : TT • a > TT • a Vd G 4}. The support regions have the following two properties: (1) U o g ^ i ^ a , A) = II; and (2) intil(o;, A) ninti2(d, A) = 0 when a ^ a, where intR(a, A) is the the interior of R(a, A). Hence, V (TT) = TT • a if TT G A). As discussed in Chapter 2, /fu is also piecewise linear and convex. To represent Hv easily, a support set, AH, is required such that HV(TT) > 7r • a Va G AH and TT G n and HV(TT) = max{7r • a : a G A//} V7r G n. There are an infinite number of support sets which satisfy these two conditions. Among them, the smallest set, denoted as AH, is the one for which the interior of the support region for each support in A*H is not empty. By definition of piecewise linearity, A*H is a finite set. The smallest set is desired since it will simplify the calculation of Hv(n) for any given TT G II. Although the smallest set is desired, it might be difficult to find since identifying an empty interior of a region is a time-consuming task. Each of the algorithms discussed below will generate an alternative set of supports. Recall that, for given TT G II, d G D a n d f l e e , AN<D>E= {a : T(iT,d,6) • a > T(TT,d,0) • a Va G A}. Define AH\ as AH — {a : 3TT G II, d G D and a selection of QWld,e € Antd,e for each 8 G 0 f AH depends on v and A. For simplification, the dependence will not be shown on the notation of AH-28 such that a = rd + 0 Pd • Qd0 • o^.i.j and ifu(7r) = TT • a}. Note that A H is a finite set and A * H is a subset of A H . A H is equal to A * H most of the time; however, occasionally some unnecessary supports will be included in A H - These unnecessary supports affect the operations of the algorithms by imposing additional computations that are time-consuming to detect. However, it may be inefficient to delete these vectors from A H -The main purpose of this chapter is to discuss two methods, the relaxed region algorithm and the linear support algorithm, for computing Hv and A H from given v and A . These two algorithms are closely related to existing algorithms which will also be discussed. A partition method will be presented in Section I. Then the two well known algorithms, the one-pass algorithm and the Monahan algorithm, will be reviewed in Sections II and III. The relaxed region algorithm will be discussed in Section IV, and, in Section V, the linear support algorithm is developed. Some numerical examples will be presented in Section VI for comparing the efficiency of these algorithms. I. Partition Method The partition method was introduced by Brumelle and Sawaki (1978) for general piecewise linear dynamic programming and by Sawaki (1983) for the POMDP. When being used for the POMDP, this method is similar to Sondik's one-pass algorithm (Sondik 1971, Smallwood and Sondik 1973). A collection B = {2?i, B 2 , • • •, Bn) of subsets of n is a partition of state space n 29 if Bi is a convex polyhedron, \J"=1Bi = II and int(Bj) D 'mt(Bj) = 0 where i ^ j and int(Bt) is the interior of B,-. The product of two partitions Ci and C2 is C\ ® C2 = {BiDDj : Bi 6 Ci and Dj € C2}. The product of C i , C2, • • •, Cm is defined inductively by ®ZiCi = Cm®®?rllCi. If d is an action in D, $ is a signal in 0, and d is a support in A, define Sd,eta as Sd,*,* = {Tr e II: T(7r,d,0)-d > r (7r ,d , 0 ) -a VaGA} . Note that Sd,e,a is a convex polyhedron and might be empty. Then Sd,e = {Sd,e,a '• Va € A} is a partition of the state space II. A product partition can be formed from these partitions Sd,e for all 6 G 0; i.e., Sd — 0 e g 0 Sd,e- Denote the cells in Sd by cd / cd cd cd\ J — \J} , 02 5 • • • » Jl J • Recall that HdV is the value of using action d for one period for any 7r G II and terminating with reward v. Now we will construct a set of supports, A j , such that there is a support £ d l for each cell Sf in Sd. By (2-5), Hdv(n) = 7r-rd + /3j2 Pr(# I M ) • v(T(n, d, 0)) eee = *.[rd + fiY,Pd Qe •«.,*,#] (3-1) where antd,e € An>d,e- If w is in the interior of Sd,e,a, then A * , ^ = d. Since for any given signal 6 the mapping A.td,e '• * —* A„td,e is constant on the interior of Sd, rd+(3 See© Pd'Qi'a*,d,o will be the same for the states in the interior of Sd. Therefore, (3-1) can be rewritten as Hdv(*) = n • £di for TT G int(S?), 30 where £dl = rd + /3 X ^ e e Pd'Qi' Q*,d,e and is an N-dimensional column vector. Define Ad = {£dt : t = l ,2,.. . ,/}. It is possible that a 7r € II is on the boundaries of several regions. For example, let ir E Sf f) Sd. Since n is a state in Sf, a^td,e E A„td,e where 7f E int(Sf) and atajj is tbe constant vector for the states in the interior of Sf. Then, by the continuity of HdV, Hdv(ir) = IT - £ D * . Similarly, Hdv(ir) = TT • . Hence, no more new vectors are generated from these states on the boundaries. If Ad is known, then the region Sf can be simply represented as Sf = {TT € n : TT • £dt > n • £ Vf € Ad) and Hdv(Tr) = max{7r • a : Va 6 Ad}-It is possible to find the partition Sd for each d in D. A product partition can be formed from these partitions Sd; i.e., S = ^deD^d- Since there are only a finite number of cells in Sd for every d E D and a finite number of actions in £), 5 contains a finite number of cells. Denote the cells in 5 as 5 = {Bi, ..., Bn). For each action d, S is finer than Sd. Let adi = £*' if Bi is a subset of Sd, then Hdv(ir) = TT • adi for all TT E Bi where i = 1,2, . . . , n . Note that if both B , and Bj are subsets of SjJ,, then adl equals a r f j . To find Hv = maxdeDHdV, define the convex polyhedrons Gdi for t = 1,2, . . . , n and d E D by Gdi = {TT G Bi: TT • adi > TT • a 0 1 V a 6 D}. Note that Gd,- might be empty. If Gdi is not empty, then for n E Gdi the action d is optimal and Hv(ir) = ix • adi. 31 Therefore, ad* is a support of Hv and should be included in A H if Gdi is not empty. Let Ui = {Gdi '• d 6 D}. Then Z7; is a partition of Bi and U = \J?=1(Ui \ 0) is a partition of II. The sets in U can be rearranged as U = {R\, R2, • • •, Rm}- Define A H = {adl : d € A>'• — 1>2,...,n, and ^ 0}. Then, Hv{ix) = max{7r • a : a 6 AH } and we have finished the construction of the supports of Hv. It is worth mentioning that the number of supports in A H is usually less than the number of cells in U. This can be easily seen from the following example. Assume Ra = Gdi Q Bi and Ri, = Gdi Q Bi; moreover, assume both Bi and Bi are subsets of Sd. Then adt = adl = that is, the same vector corresponds to states in Ra and Rb-Let R(ad,,AH) be the support region for the vector ad\ i.e., R(adt,An) = {n e II : TX • adl > 7r • a Va € AH], then Ra U Rb C R(adi,AH). II. One-Pass Algorithm In this section, we will study Sondik's one-pass algorithm. The one-pass algorithm will produce the same partition U and support set A H as the partition method discussed in Section I; however, the approaches of these two methods are different. In the previous section, U is a partition of the state space II. Now let us focus on a cell Gjt- in U. Recall that Gdi = { 7 r € B t : TT • adl > TT • adi VdtD}. That is, a state 7r is in f- if and only if the following two conditions are satisfied: 32 (i) TT • adi > TT • adi for all d E D; (ii) E The first condition implies that d is the optimal action for the states in G^ • and a d ' is a vector in An. Since Bi = f | {TT e n : V£ € <*er> the second condition is similar to finding TT E II satisfying the following set of constraints TT-adi>Tx-id V ^ e Ai and dE D. Sondik (1971) and Smallwood and Sondik (1973) discussed a simple method to represent these two conditions. Let An<d = {a : a - rd + /3 • ^ Pd • Qde • an,d,e where am,d,8 E An<d,e}• Then Hdv(ir) = TT • a for all a € Anj. Let TT be an arbitrary state in G j , , then ad* E Airj- Since TT • a is a scalar and the values TT • a are the same for all a E A*d, the optimal action(s) for the state TT are = {d : d = argmax{7r • a**'}}. dED Note that .D* might contain more than one action; however, d is in D*. Moreover, it is possible that A^ j contains more than one vector, but adt is in A^ j. First, assume that there is only one action, d, in and one vector, a , in A^ j . This assumption will be relaxed later. 33 Now consider the second condition. T h e constraint set to represent the second condition might not seem easy to set up since all the vectors in A& for all d G D have to be known even though only the region i is considered. Fortunately, a simple representation can be developed. For simplification, assume that there is only one vector in Ajc,d for each d G D. Th is assumption will be relaxed later. Let adi = rd + fi- £ , € e Pd Qj • a M i , . Since {aM,*} = A M i , for all TT G int(B;), then by definition of A„td,e, for all TT G Bi, T ( 7 r , d, 6) • a * ) d > * > r(7r, d, 0) • a V a G A , or *-Pd-Qi-an,d,e>K-Pd-Qde-<x V a G A. ( 3 - 2 ) Therefore, if IT is in then (3-2) holds for all 6 G 0 and de D. Conversely, if (3-2) holds for all 6 G 0 and d G D, then, for ft G *-Pd-Qi- a M j , >n-Pd-Qd- a M ) * V0 G 0 where ft is an arbitrary state in II but not in Summing over all 6 G 0 , we obtain £ TT • P d • Qd • a * , , , , > ] [ > • P r f • • a * , , , , *€© «€© or TT . [rd + fi • 53 Pd . • o M f # ] > TT . [r<* + fi • ] T P r f • Qd • a M,,]. 06© «6© Since rd + fi • See© Pd ' Q$ ' a*,d,e is a vector in A d , hence TT • adi > IT • £D V ^ G A j . 34 Therefore, TT 6 II is in £ , if and only if (3-2) holds for for all tf € 0 and d 6 D;\ that is, Bi = 0D£D{K € II : IT • Pd - Qd6 • ( a M i * - a) > 0 V tf € 0 and a € A). The region Gj i can now be represented as Gdi = {TT € II: 7r • a d ' > 7r • a*** and TT - Pd • Qg • (a^^e - a) > 0 V deD, 9 € 0, and a € A}. (3-3) If there is more than one vector in A^^^ for some d € D and tf G 0, then consider all possible selections of one vector, say atir,d,6, from each Antd,e. If there is only one optimal action for TT, then, by using (3-3), each selection : d G D, 9 e 0} can be used to determine a region. Note that Gj f is one of the generated regions, and all of these regions contain it. Moreover, the intersection of the interior of any two of these generated regions is empty. If Djt has i actions, i.e. there are i optimal actions corresponding to the state TT, then, for each selection of {<Xjt,d,6 '• d € D,9 £ 0}, by changing the optimal action, i regions can be determined. Therefore, if there are j ways of selecting {a^td,e}i then there are t • j regions and Gdi is one of the regions. Although this method may be inefficient for determining a particular region f , all these regions generated are also cells in U. Having discovered this property, Sondik f In fact, as discussed in Smallwood and Sondik (1973), since T(-,d, tf) preserves straight lines, only those a's whose support regions have common boundaries with the support region for a^d,8 have to be considered. 35 (1971) and Smallwood and Sondik (1973) developed a systematic method, the one-pass algorithm, to find all vectors in A H -The one-pass algorithm uses an arbitrary state as an initial point. With this initial point, an optimal action (or actions) and an optimal vector (or vectors) for this point are found. Using (3-3), a region (or regions) can be obtained. All vertices of these generated region(s) are then determined. Similar to the procedure discussed above, each generated vertex is used to determine new regions and their corresponding vertices. Each vertex is used to determine regions once and the regions which generate this vertex need not be determined again. There are only a finite number of regions in U and each region only has a finite number of vertices. Hence, the algorithm will terminate in a finite number of iterations. Since the states in the interior of any region belong to exactly one region, the algorithm is called one-pass algorithm. As with the partition method, the one-pass algorithm has the major disadvantage of dividing the support regions into several subregions. As mentioned in Smallwood and Sondik (1973), most of the computational time in the one-pass algorithm is spent in solving linear programs to determine the vertices of the regions. The computational efficiency can be improved by not partitioning the support regions. This is the moti-vation for the development of the relaxed region algorithm which will be discussed in Section IV. 36 III. The Monahan Algorithm Monahan (1982) proposed an algorithm which he called Sondik's one-pass algo-rithm. Since the basic idea of this algorithm is different from the one-pass algorithm discussed above, we will consider his algorithm separately. As described in Section I, if TT € Sf, then Hdv(n) = TT • £di where £di = rd + f3 • YJeee Pd -Qg- an,d,e- Therefore, in order to find £ d t , we need to know a„td,e for each 6 in 0 and a state TT in Sf. As discussed in Section II, to find the region Sf is not an easy task. Let us consider an alternative approach. Instead of finding aWjd,e for some TT, an arbitrary ag from A is chosen for each 6 G 0 and the vector a = rd + /3-J2eeQ Pd-Qg-ckg. is calculated. If the vectors {ag,6 G 0} are chosen so that ag = av<d,e, then a equals £ d t . Let Ad be the set of all possible <5's; i.e., Ad = {a: a = rd + /3-^2Pd -Qi-ae where ae G A}. ee© Therefore, £d% € Ad and Ad Q Ad. This implies that Ad might contain some unnecessary supports of HdV, and for all TT G II Hdv(n) = max{7r • a : a G Ad} = max{7r • a : a G Ad}. Note that, if there are K actions in D, L signals in 0, and M supports in A, then at most there are ML vectors in Ad and K • ML vectors in ( J i e D ^ -Since AH Q UdeD^ <*> then AH Q {JdeD-^d- A support a in UdeD^d *s * n ^ H ^ and only if the support region of a is not empty. Therefore, to find whether if a vector 37 Oi in \Jdei)Ad is also in AH, the support region R(a,[jdGDAd) = {n € II : 7r • a > it • a foreach d € \JdeD-A-d) is determined. Note that R(a,\JdeDAd) is the same as R(a, AH) if a is a support in and empty otherwise. Hence, if R(a,[JdeDAd) is not empty, then a is a support m AH- A slight modification by Eagle(1984) which eliminate some unnecessary vectors by dominance arguments can determine the vectors in AH more efficiently. The Monahan algorithm is simple to use. However, when the number of actions, signals, or supports in A.is Jaige, the jmmber of^vectors in UdeD^rf * s a ^ s o l a r g e - m this case, it is time-consuming *to determine the supports in AH-IV. Relaxed Region Algorithm As discussed in Sections I and II, the major disadvantage of the partition method and the one-pass algorithm is "that a support region for a support in AH is usually divided into several subregions. The aim of this section is to develop a relaxed region algorithm which can reduce the number of generated regions. In fact, the number of regions found by this relaxed region algorithm is exactly the same as the number of supports in AH-Let i be an arbitrary state in 3L As discussed in Section II, A^j for each d £ D and Djr can be easily found."Now choose an arbitrary d € D*, and d E A . j . Then, by definition, the support region for d, J2(d, AH), is . .R(d, AH) = { T € TJ : 7r • d > 7r • a for each a G AH) 38 = {it G II : it • d > Tt • a for each a G Ad and d G D}. Therefore, a state it G II is in R(a, AH) if and only if it satisfies the following three sets of constraints: (i) Tt • d > it • ad for all ad G A ^ . d and d G D; (ii) Tt • d > Tt • Z* for all ^ G Ad; (iii) Tt • a > it • £d for all £d G Ad and d G D, but d ^ d. Since j4jr)(j for each d E D has already been found, the first set of constraints is easy to set up. B y the construction of A^j, a selection of a f r rf- g for all 0 in 0 is known such that d = r d + fi • J2eee Pd'Qe' an,d,e ^ o r s o m e special j e G A^ j g C A. As discussed in Section II, the second set of constraints can be rewritten as Tt-Pd •QJ-aitd-0>7t-Pd-QJ-a Va G A and 9 G 0. Therefore, the second set of constraints can also be set up.f In order to find the region i?(d, AH ) , the third set of constraints has to be set up too. However, this set of constraints is difficult to set up since it requires the methods discussed in the previous sections to find all vectors in Ad for each d G D. Since this set of constraints is difficult to set up, Sondik's method uses an ad G A*td for each d G D to substitute for d ; i.e., the third set of constraints is replaced by it - ad > it • £d for each £d G Ad, <xd G A * ^ , and d G D. t Similar to the footnote in Section II, only those a's G A whose correspond-ing support regions and the support region corresponding to a^. j e which have common boundaries should be considered. 39 Consequently, his method has to divide R(a, AH) into several smaller regions. Since the third set of constraints is so difficult to set up, an alternative method is proposed here. Consider a relaxed region which is defined by the first two sets of constraints but not the third set of constraints. By definition of R(a, AH), TT • a > TT • Q for all a G AH and TT € R{6L,AH)- Suppose that we have identified a particular set of AH and denoted it as AH- In order to fully utilize the available information, these supports in AH are used to determine a relaxed region for d; that is, the constraints TT • d > TT • a for each a E AH determine this relaxed region. A relaxed region for the support d, Ra, is defined as: Ra = {TT G II : TT • a > TT • ad V ad € AK<D and d G D; TT • PD • QJ • a^Q,e >TT-PDQdea Va G A; 7 r - d > 7 T - a ' V a' € AH] (3-4). Since RQ does not include the third set of constraints which is used to define R(a, AH), it is clear that R(a, AH) Q RQ- Moreover, all states TT G Ra must satisfy the constraints TT • a > TT • a for each a G AH- Since TT • a > TT • a for all TT G int(R(a, A H ) ) where a ,d G AH, it can be shown that Ra f]'mt(R(a, AH)) = 0 where a G AH and int(i?(a, AH)) is the interior of the support region for support a. Using the relaxed regions discussed above, a relaxed region algorithm can be de-veloped to determine all supports in AH- This algorithm starts with an arbitrary state it G II and an empty set AH- The sets A^j for each d G D and D* are determined. The 40 supports in (Jdeo* a i e * ^ e n P u t * n t o H ' According to (3-4), a relaxed region for each support in [ J r f e D * *^.<* * s obtained, and its vertices are found. As will be discussed later, these vertices are used to find the supports in A H and generate new relaxed re-gions. This procedure is repeated until no more new supports and relaxed regions are generated. The relaxed region algorithm relies on the vertices of the generated relaxed regions to find supports in AH - This process works as follows: Consider all vertices of a relaxed region RQ. Let TT' be an arbitrary vertex of RA. Similar to the procedure discussed before, an An>td for each d 6 D can be found. Since R(a,An) Q RA, can either be on the boundary of R(a,An) or not. If TT' is not on the boundary of R(a, AH), then d is not in A„>td for any d € D„>. Suppose d' € DT> and a' € An'td', then a' is a support in A H and a' ^ d. If TT' is on the boundary of R(a,An) and TT' is not a vertex of the state space II, then there is a support region other than R(6C,AH) such that TT' is on the common boundary of this region and of R(a,An)- This implies that d € U<feD > ^',d a n d there is at least one vector a' € U</GD / ^',d where a' ^ d and TT' G R(6L, AH) fl^(O'I4H)-Whether TT' is on the boundary of R(a, AH) or not, a support a' 6 A H and a' ^ d can be found. If a relaxed region for the support, RA>, for a', is not found previously, then a relaxed region RA> can be found using (3-4). Similarly, the vertices of RQ> can be used to find supports in AH- If RA' has been found previously, then TT' has reached the boundary of R(a', AH) since RQ P)int(i?(a', AH)) = 0. A relaxed region for support 41 a' need not be found again. There is only a finite number of supports in A H , and each support in AH is used to generate one relaxed region only. Moreover, there are only a finite number of vertices for each relaxed region. The relaxed region algorithm will terminate in a finite number of iterations. As will be shown in Theorem 3.2, the vertices of the support regions will be included in the vertices of the generated support regions. Although these vertices might not be useful in a finite horizon algorithm, they can play an important role in computing the error bound for an infinite horizon problem. This problem will be discussed in the next chapter. The procedure for this algorithm is outlined below. Step 0: Initialize A H , A H , W, E, and a point search table to empty sets. Put an arbitrary state TX € II into the point search table with an unmarked attribute. Step 1: Proceed to Step 7 if there is no unmarked state in the point search table. Step 2: Choose an unmarked state from the point search table. Denote this state as it. Mark it. Step 3: Find A*,* for each d € D and D*. Let W = U<*eD* --^M' ^ ^ n e s u m °f ^ n e number of vectors in W and the number of zero elements in it is greater than or equal to JV, then put it into E. Put all vectors in W into AH • 42 Step 4: Return to Step 1 if W is empty. Otherwise, go to Step 5. Step 5: Choose a vector from W and denote it as d. Delete d from W. If d is not in A H , then put it in AH and go to Step 6. Else, go to Step 4. Step 6: Use (3-4) to define a relaxed region RQ. Determine all the vertices of RA. Put all vertices which do not exist in the point search table into the point search table with an unmarked attribute. Go to Step 4. Step 7: Stop. Hv(ix) = max{7r • a : a € AH) for all TX £ II, and E contains all the vertices of the support regions. The following two system states, three actions and two signals example illustrates the use of this algorithm. Example: P 1 = .8 .2' .5 .5 QL = '.8 .6 .2' .4 r1 = ' -4 ' 5 P 2 = .5 .5' .4 .6 Q2 = .9 .4 .1' .6 r2 = -2 ' 3 P 3 = .6 A' .3 .7 Q3 = .9 .2 .1" .8 r3 = - 1 ' 1 A = {a\a2: where a 1 = 4' 5 ' and a2 = 3" 9 The discount factor /? is 1; that is, there is no discounting in reward. Choose 7r = [0,1] as the initial state and put it into the point search table with an unmarked attribute. 43 Iteration 1: Pick [0,1] from the point search table. Mark it. Denote it as ft. Then, 0.2 11.0 } ; ^ , 2 = { £2 1 } = { 4.0 9.6 }; A * , , = { { « } = { 4.4 8.2 Since it • £ n = 11.0 > it • (21 > it • f 3 1 , d = 1 is the optimal action for ft, and f 1 1 = [0.2,11.0]T is a support in AH- Put £ n into A H and AH- Since ft has a zero element, put ft in E. The relaxed region, R\, for £ n can then be determined by the following set of constraints: (i) T T - e 1 1 > 7 T - e 2 1 (ii) Since 0^,1,1 = a 2 , and 0:^1,2 = °2'•> then, T T - P 1 Q\ -[a2-a1} >0 (iii) 7r G II. Now substitute the data into the constraints and rewrite these constraints as: (i) TT 0.2 11.0 TT 0.2 11.0 > TT 4.0 9.6 4.4 8.2 (ii) TT ' .8 .2' .8 0' •( 3" '4' )>o .5 .5 0 .6 9 5 .8 .2 .2 0' •( 3' •4' )>o TT • .5 .5 0 .4 9 5 (iii) TT\ + 7T2 = 1 and H i , 7 T 2 > 0 44 The vertices of Ri are [0,1] and [0.27,0.73]. Since [0,1] is already in the point search table, according to the procedure, only [0.27,0.73] has to be put into the point search table with an unmarked attribute. Iteration 2: There is only one unmarked state in the point search table, hence ft = [0.27,0.73]. Mark ft. Then, 0.2 11.0 }; ^ , 2 = U22} = { 4.0 9.6 }; ^ , 3 = {e32} = { 4.4 8.2 Since ft • f 1 2 = ft • i22 = 8.09 > ft • £ 3 2 , both £ 1 2 and £ 2 2 are in W and ft should be in E. Since £ 1 2 is already in A H , only £ 2 2 is put into AH and AH- The relaxed region, R2, for £ 2 2 can be defined by the following constraints: (i) TT TT 4.0 9.6 4.0 9.6 > TT > TT 0.2 11.0 4.4 8.2 (ii) TT • .5 .5' .9 0' •( '3' 4" .4 .6 0 .4 9 5 7T • .5 .5' .1 0' •( 3' "4" .4 .6 0 .6 9 5 ) >o ) > o (iii) 7Ti + 1T2 = 1 and TT\ , TT2 > 0 The vertices of R2 are [0.27,0.73] and [0.78,0.22]. Since ]0.27,0.73] is already in the point search table, only [0.78,0.22] is put into the point search table with an unmarked attribute. Iteration 3: Now [0.78,0.22] is the only unmarked state in the point search table, denote it as 45 it. Mark it. Then, = it") = { 0.2 11.0 4.0 9.6 }; A M = { £ 3 3 } = { 4.62 7.91 Since TT • £ 3 3 = 5.35 > it • (23 > it • £ 1 3 , only £ 3 3 is included in A H - Put £ 3 3 into A H and A H - There is only one support corresponding to it and none of the elements of 7r is zero, so it should not be put into E. Similar to .Ri and R2, the relaxed region, R3, for vector £ 3 3 can be determined. The vertices of .R3 are [0.73,0.27] and [1,0]. Since both vertices are not in the point search table, both are put into the point search table with unmarked attribute. Iteration 4: Pick [0.73,0.27] from the point search table. Denote it as it. Mark it. Then, 0.2 11.0 }; A , , 2 = { £ 2 4 } = { 4.0 9.6 }; A M = { £3 4 } = { 4.62 7.91 Since 7 r - £ 2 4 = 7 r - £ 3 4 = 5.50 > 7 r - £ 1 4 , then £ 2 4 and £ 3 4 are both supports in AH- However, since both supports are already in A H , no relaxed region has to be determined and no new vertex is generated. There are two optimal vectors for it, so it should be put into E. Iteration 5: Now, there is only one unmarked state, [1,0], in the point search table, denote it as it. Mark it. Then, A* , i = { £ 1 5 } = { 0.36 10.2 }; A* , 2 = { £ 2 5 } = { 4.0 9.6 }; A * , 3 = { £3 5 } = { 4.62 7.91 46 Since ft • f 3 5 = 4.62 > ft • £ 2 5 > it • £ 1 5 , then only £ 3 5 is a support in A H - However, £ 3 5 is equal to f 3 3 and is in A H already, no relaxed region or new vertex is generated. There is a zero element in ft, so ft should be in E. Iteration 6: Finally, since there is no unmarked state in the point search table, the process is now completed. There are three supports, [0.2,11.0]r, [4.0,9.6]T, and [4.62,7.91]T, in A H , and four vertices, [0,1], [0.27,0.73], [0.73,0.27], and [1,0], in E. • The number of relaxed regions generated from this example is three which is equal to the number of supports in A H - For comparison, Sondik's one-pass algorithm generated five regions as shown below: State Used Corresponding Vector Vertices Generated [0,1] [0.2,11.0]T [0,1], [0.27,0.73] [0.27,0.73] [4.0,9.6]T [0.27,0.73], [0.57,0.43] [0.57,0.43] [4.0,9.6]T [0.57,0.43], [0.73,0.27] [0.73,0.27] [4.62,7.91]T [0.73,0.27], [0.83,0.17] [0.83,0.17] [4.62,7.91]T [0.83,0.17], [1.0,0] As in the relaxed region algorithm, the vertices of each region in Sondik's algorithm have to be identified. Since finding all vertices of a region is the most time-consuming step in both algorithms, the relaxed region algorithm requires less computational time than the one-pass algorithm because fewer regions are generated. 47 Some of the more salient properties of the relaxed region algorithm are discussed below. T H E O R E M 3 . 1 : All supports in A*H are also in the support set An which is generated by the relaxed region algorithm. Proof: Let a be the support which is in A*H but not in the generated AH-First assume that the supports whose support regions have common boundaries with the support region R(a, A*H) are found and put in AH- Without loss of generality, assume these supports be a 1 , a 2,..., ak where the superscripts stand for the order of generated sequence. Since RQk C R(ak,An) and .Ra* D int(R(a', A H ) ) = 0 for i = 1,2,..., k — 1, then at least one of the extreme points of RQk is in R(a, A*H). Let this point be it, then a € UdeD* {Ajrj}- Contradiction. If not all supports whose support regions have common boundaries with R(a, A*H) are found, then assume that J? be a set which contains R(a, A*H) and the support regions which can form a connected set with R(a,A*H) and its support is not found. Similar to the previous proof, let a1 ,a 2,... ,ak be a sequence of supports in AH and whose support regions have common boundaries with R. T h e n at least one of the extreme points of Rak is i n R, then this state can be used to find at least one of the supports which have not be found. Therefore, the procedure cannot terminate. Contradiction. II 48 T H E O R E M 3.2: The states in E are the vertices of the support regions for supports in AH- Con-versely, all vertices of the support regions for the supports in AH o.re in E. Proof: Let 7r be an arbitrary state in E. There is at least one support, d, in AH, such that Hv(it) = it - a. Without loss of generality, assume there are m supports in AH- Then consider the following linear programming problem: max 7r • d Subject to: (1) 7r • d > 7r • a for each a €: AH and d a (2) 7rfc > 0 = 1,2,:..,JV N (3) X> = 1 Note that there are m+N constraints which define the support region for d. Since it £ E, if 7rjt > 0 for all k = 1,2,..., N, then there are at least N supports in AH corresponding to it including d. Set the slack variables of the N — 1 constraints corresponding to these supports other than d to zero, then, it can be seen that it is a basic feasible solution of the LP. Similarly, if some ft* = 0, then set the slack variables of these constraints irk > 0 to zero. Then it is also a basic feasible solution of the LP. Therefore, it is a vertex of the support region for d. This proves the first part of the theorem. Now assume it is a vertex of a support region R(a, AH) where d is an arbitrary support in AH- Then it is one of the basic feasible solutions of the LP shown above. If n elements of it are zero where n can be 0,1,... or N — 1, then at least N — n — 1 constraints 49 in (1) should be tight when 7r is substituted by ft. Denote the a's in the tight constraints, including d, as a 1 , a 2 , . . . ,aN~n where the superscripts indicate the order of vectors in the set A H - The vertex ft will be placed in this point search table no later than the vertices of the relaxed region for aN~n. Since ft • a 1 = ft • a 2 = • • • = ft • aN~n, when ft is chosen for finding new supports in AH, then {a 1 ,a 2 , . . . ,aN~n) C \ J D E D T A * , , * ; that is, ft is in E. I Before closing this section, let us compare the relaxed region algorithm and Sondik's one-pass algorithm. The biggest difference between the relaxed region algorithm and Sondik's algorithm is in their methods of defining the regions for a support in A H - AS mentioned before, Sondik's algorithm needs to unite several subregions in order to define a support region for a support in A H - The relaxed region algorithm will always find a region not smaller than its support region. This can be seen in the example presented in this section. There are three supports in AH- For the first support, [0.2,11.0]T, the relaxed region algorithm and one-pass algorithm define the same region as its support region. However, for the second support, [4.0,9.6]T, the relaxed region algorithm finds a larger region initially, but the one-pass algorithm requires unification of two regions to form the support region. For the third support, [4.62,7.91]R, the relaxed region is exactly the same as its support region; but, the one-pass algorithm again requires unification of two regions to form its support region. The total number of regions generated from one-pass algorithm 50 is usually more than the number of supports in AH', however, the number of regions produced in the relaxed region algorithm is always the same as the number of supports in A H . From a computational point of view, there is a difference in the number of con-straints used to define a region between these two algorithms. In Sondik's algorithm, the constraint set includes all constraints shown in (3-3); however, in the relaxed region al-gorithm only those constraints in (3-3) which concern the optimal action are considered. As a result, the one-pass algorithm usually has a much larger number of constraints than the relaxed region algorithm for defining a region. It was mentioned in Smallwood and Sondik(1973) that most of the computational time in Sondik's algorithm was spent in linear programming in order to determine the boundaries and vertices of the regions. Since there are fewer constraints in each constraint set and fewer regions to be solved, the relaxed region algorithm requires less computer memory and computational time than Sondik's algorithm does. It is not clear whether the relaxed region algorithm requires less computational time and computer memory than the Monahan algorithm. However, use of the vertices of the support regions for computing the error bound might be an important consideration for using the relaxed region algorithm as a policy improvement step in an infinite horizon problem. 51 V . Linear Support Algorithm In the previous section, a relaxed region algorithm was discussed. Although the constraint sets for determining a region in this algorithm are much simpler than those of the one-pass algorithm, they are still very complicated. The original motivation for the linear support algorithm discussed in this section is to develop an algorithm which does not require complicated constraint sets. Besides having simpler constraint sets, the linear support algorithm also has a special property which makes it more attractive. If a large number of supports in A H are required to characterize Hv, then computing Hv and A H is usually very time consuming regardless of which algorithm is used. In this case, an approximate solution for Hv might be tolerable if the maximal difference between the exact solution and the approximate solution is less than a given error. However, none of the algorithms discussed so far can be modified to find this kind of solution. The linear support algorithm described in this section can provide this kind of approximation; this is the most important feature of this algorithm. Let An = UdEDw{Antd} for IT £ U. Then An is a set of supports for Hv at TX. In order to have a finite set of supports which characterize Hv, only those supports in A N for all 7r € II are considered. Recall that, if it is an arbitrary state in n and a £ A * , then HV(TX) > TX • a for all TX £ II and Hv(it) = it • a. The basic idea of the linear support algorithm can be described as follow. Let A H 52 be a finite set of supports for Hv and define HV(TT) = max{7r • a : a £ AH) for all TT £ II. If 5t> is used to approximate the value function Hv, then the maximal error for this approximation is max,r€n{#i>(7r)_#t,(7r)}- If this error is not zero, then a "proper" support can be chosen and included in AH to arrive at a better approximation. This procedure is repeated until the approximation is within a tolerable range. If the exact solution is needed, then the procedure can be repeated until the exact function is found. The major problem is how to choose a "proper" support. The algorithm starts with finding the linear supports corresponding to each of the extreme points of the state space II. The generated supports are put into AH- In order to know how good this approximation is, a relaxed region for a support a in AH is defined as RQ = {TT £ II : TT • a > TT • a V a £ AH}-Note that the support region R(a, AH) is a subset of Ra- If there are k supports currently in AH, then there will be k relaxed regions. All vertices of these relaxed regions are found.f If max a e ^^{7r • a} is used to approximate HV(TT) for all TT £ II, then the error of this approximation can be defined as a function g where g(Tr) = HV(TT) — maxQg^H {TT ' a} for all TT £ II. As shown in the following lemma and theorem, the maximal error of this approximation, i.e., the maximal value of g in II, will be at f In fact, only the vertices of k — 1 regions have to be found. The vertices of the fc-th region can be found in the vertices of other regions with the exception of the vertex which generates the support corresponding to this relaxed region. 53 one of the vertices of these relaxed regions. L E M M A 3.1: Let RA be the relaxed region for a support a G AH- The maximal value of g in RA will occur at one of the vertices of RA. Proof: For TT G RA, TT • a > 7r • a for all a G AH- Therefore, the function g in RA can be rewritten as </(7r) = HV{TC) — 7r • d for all TT G RA- Since Hv is a convex function and 7r • d is a linear function, g is a convex function in RA. Moreover, RQ is a convex polytope. The maximal value of a convex function in a convex polytope will be at one of the extreme points of the convex polytope. Therefore, the maximal value of g in RQ will be at one of the vertices of RA. I T H E O R E M 3.3: The maximal value of g in U will be at one of the vertices of these relaxed regions whose corresponding supports are in AH-Proof: As shown in Lemma 3.1, the maximal value of g in RA is in one of the vertices of RQ. By definition of the relaxed region, the union of all relaxed regions is II. There-fore, the maximal value of g will be in one of these relaxed regions; then, by Lemma 3.1, the maximal value of g in n will be at one of the vertices of these relaxed regions. H Assume that all vertices for the generated relaxed regions are in a set E. By 54 Theorem 3.3, the maximal error of this approximation will be at one of the vertices in E. Denote this vertex as ft. If g(it) is equal to zero, then there is no error for this approximation; i.e., this is an exact solution. If g(it) is greater than zero, the linear support(s) oi Hv at ft, A*, can be found. Note that the supports in A* are not currently in AH since Hv(it) = ft • a > max{ft • a : a G AH} where a G A*. Therefore, if one or more supports in A* axe included in AH, a better approximation of Hv can be found. If the supports in A* are included in AH, a new approximation for Hv and new relaxed regions for every support in AH can be determined. B y Theorem 3.3, the max-imal error of the new approximation is at one of the vertices of the newly generated relaxed regions for the supports currently in AH- However, some of these newly gen-erated relaxed regions are not the same as the relaxed regions before the supports in A^ are included in AH- In order to find the maximal error of the new approximation function, the vertices of all newly generated relaxed regions have to be determined again and this is time-consuming. Fortunately, as will be shown in the next lemma, all of the vertices of these relaxed regions are in the set E [j C where C is the set of all vertices for the relaxed region(s) for the support(s) in A*. This implies only those vertices in the relaxed region(s) for the support(s) in A* have to be identified. L E M M A 3.2: Let AH be a set of supports as described above and let the relaxed regions defined by the supports in AH be Ra = {?r G II : IT • a > IT • a where a G AH}- Let E be the set of all vertices for the relaxed regions corresponding to supports in AH- Assume it is a state in E and A * is the set of supports for Hv at it. Let the relaxed region for a 55 support a G A H U A N be R'a = {it G I I : it • a > it • a V a € i n U A*}. Let C be the set of all vertices for the relaxed regions R'a for a G A * and E' be the set of vertices for the relaxed regions R'a for a G A H U A * . Then E' C E U C. P r o o f : Let 7f be an arbitrary vertex of E'. li it is in R'Q for an a in A * , then it G C. If ft is not in R'a for any a E A N , then, since it G II C 9£N, TV equations are required to define it. If fr is not on the boundary of II, then there exists a set of supports { a 1 , a2,..., aN} such that these JV equations can be represented as 7r • aN = it • a' where i = 1,2,..., N — 1; N and ^ T f * = 1. Jt=i Note that a' G AH and a' ^ A „ for i = 1,2,..., N. Therefore, it is a vertex of Ran; that is, it is in E. Similarly, if it is on the boundary of II, some of the constraints needed to define if are the boundary conditions. Following the same argument, it should be a vertex in E. Therefore E' CEUC. • In fact E U C might contain some states which are no longer a vertex of any relaxed region. These vertices can be determined by computing the error. Define g'(n) = Hv(ir) — m a x a e ^ H U i 4 # { 7 r • a} for all n G E.] If g{Tx) > g'(7r) and IT is not a t A vertex it G E which has been used to find new supports will have g^it) = 0. Therefore, it is not necessary to compute g(-) for this kind of vertex. 56 vertex of C, then TT is not a vertex of any relaxed region and can be deleted from the set E'. If C is included in E and the supports in A * are included in AH , then the maximal error for the new approximation function, by Lemma 3.2, will be at one of the states in E. Determine g(Tr) for all TT G E and find the maximal value. This maximal error will not be greater than that of the previous approximation. The support of Hv is found at the state which has the maximal error in the approximation. The procedure discussed above is repeated until no more new vertices and supports are .generated. Then the set of supports can be used to form Hv without any error. If a small error is tolerable, this procedure can be modified slightly to find an approximate solution. When the errors corresponding to the vertices of the region are less than the tolerable error, a more accurate approximation is not necessary for the states in this region. This can be restated more constructively: when the error of a vertex is less than the tolerable error, this vertex does not have to be considered. This modification can guarantee that the maximal error from the resulting approximation will not be more than the tolerable error. This algorithm can be summarized as follows: Step 0. Initialize AH, E^Esnd C to empty sets. Put all vertices of II, TT1,TT2, ...,TTN , into E and E. Find the supports and the value of Hv at TT for each TT € E, then put the newly generated supports into the set An.. Determine the relaxed regions for each support in AH and find all vertices of these relaxed regions. 57 Put these generated vertices into C If a vertex is in C but not in E, put it in E. Step 1. Find HV(TT) for each TT G C. Empty C. Step 2. Compute g(ir) = HV(TT) — maxa6/jH{rr-a} for 7r G E\E. Ug(ir) is less than the given tolerable error, then put IT into E. If all g^tr) are less than the tolerable error, go to Step 6; otherwise, pick a vertex with the largest value of g from E and denote this vertex as TT. Step 3. Empty the set A * . Find the linear support(s) for Hv at it and put these generated supports in A * . Put the supports in A* which are not already in A H into AH- Find the relaxed region for each support in AFind all vertices of these newly generated relaxed regions and put into C. If it is not a vertex in C, delete it from E; otherwise, put it into E. Step 4. Compute g'{ir) = HV(TT) — max o € / i , IT • a for all TT G E\E. If g'iir) < g(n) and TT £ C, then delete TT from E. Step 5. Put the vertices in C which are not in E into E. Go to Step 1. Step 6. Stop. The value o{maxae^H {ir-a} is an approximation of HV(TT) with maximal error less than the given tolerable error. The set E contains all the vertices of the support regions for the supports in AH-Note that if the tolerable error in the above algorithm is zero, then an exact solution of Hv can be found, and A H is equal to AH-58 Unlike the one-pass algorithm and the relaxed region algorithm, only the support itself is important in the linear support algorithm. The information concerning which particular selection of an>d>e G A„,d,e forms a support in An is not used in this algorithm. Therefore, all supports in UdeD-A-d as defined in Monahan's algorithm can be generated. Hence, HV(TT) = max _ {TT • a} An = {d G UdeDAd • * • d > TT • a Va G Ud£DAd) and AH Q AH Q UdeDAd-Example: (continued) The linear support algorithm is used to solve the same problem as shown in Section I V . The extreme points of n, [0,1] and [1,0], are put into E and E. The solution procedure starts with finding the linear supports for the extreme points of n. The linear support for state [0,1] is [0.2,11.0]r, and the linear support for state [1,0] is [4.62,7.91]T. Therefore, AH = {[0.2,11.0]r, [4.62,7.91]r}. Since there are only two supports in A H , there is only one relaxed region whose vertices have to be found. Let Ri be the relaxed region for the support [0.2,11.0]T; that is, R\ = {it G n : m +^2 = 1 and O^-TTJ + II-TTJJ > 4.62-7^+7.91 -7r2}. The vertices of Rx are [0,1] and [0.41,0.59]. Since the vertex [0,1] is already in E, put [0.41,0.59] in E. The vertex [0.41,0.59] is the only one in E\E. Since <?([0.41,0.59]) = 0.74 > 0, [0.41,0.59] is used to find a new support. 59 T h e linear support for state [0.41,0.59] is [4.0,9.6]T. Since [4.0,9.6]T is a new support, it is put into A#. The vertices of the relaxed region for the support [4.0,9.6]T are [0.27,0.73] and [0.73,0.27]. Put these two vertices into C. Since [0.41,0.59] is not a vertex i n C, delete it from E. There are no vertices i n E\E. Put the vertices in C into E. Since these two vertices are not i n E, </([0.27,0.73]) and g([0.73,0.27]) must be computed. Both values are zero; therefore, they should be in E. Now, no vertex in E has function value g greater than 0, and the process is completed. A l l supports in AH have been found. There are three supports, [0.2,11.0]T, [4.0,9.6]T, and [4.62, 7.91] T, in AH, and four states, [0,1], [0.27,0.73], [0.73,0.27], and [1,0], in E. Now assume that the tolerable error is 0.75. Since <z([0.41,0.59]) is 0.74, which is smaller than the tolerable error, [0.41.0.59] is a vertex in E and not used to find new support. There is no vertex in E\E. The process is completed. The result is two sup-ports, [0.2,11.0]r, and [4.62,7.91]T, in A H , and three vertices, [0,1], [0.41,0.59], and [1,0], in E with a maximal approximation error of less than 0.75. H Before we conclude this section, the following questions are raised: (1) Can this algorithm be terminated within a finite number of iterations? (2) Gan this algorithm find all supports i n A//? (3) Does E contain all vertices of the support regions, or, for an approximate solution, does E contain all vertices of the relaxed regions corresponding to the supports in A//? The second half of question 3 has been answered by Lemma 3.2. The remaining questions are answered by the following theorem. 60 T H E O R E M 3.4: (1) The linear support algorithm will terminate in a finite number of iterations. If the tolerable error is set to be zero, then (2) All supports in AH can be found by the linear support algorithm. (S) E contains all vertices of the support regions. Proof: (1) When a vertex is chosen for finding a linear support, the linear support cannot be the same as any of the supports currently in AH- Since there is only a finite number of supports in AH , there is no more approximation error after all supports in AH are found. Therefore, no more relaxed regions and vertices are generated; that is, the algorithm will terminate after a finite number of iterations. (2) Assume that AH is not the same as AH and the process is terminated. Since AH is not the same as AH, this implies that there is at least one TT G II such that maxa£AH {TT •«} — m a x a e A H in ' a) > 0- Then, by Theorem 3.3, the maximal error should occur at one of the vertices of the relaxed region. Therefore, the process should be continued and the algorithm cannot be terminated. Contradiction. (3) When all supports in AH are found, the relaxed region corresponding to the sup-ports in AH are the support regions. Then, by Lemma 3.2, the result follows. H Let us now compare the linear support algorithm with the relaxed region algorithm. The linear support algorithm can be viewed as a relaxed region algorithm, but it is not the same as the one discussed in Section IV. The linear support algorithm uses a simpler 61 constraint set to define a relaxed region than the relaxed region algorithm. The number of relaxed regions whose vertices have to be found is the same as the number of supports in AH for the relaxed region algorithm; in contrast, there is one relaxed region for which the vertices do not have to be found for the linear support algorithm. Both algorithms can generate all vertices for the support regions. However, the computational time for both algorithms should be very close; although we might expect the computational time for the linear support algorithm to be slightly less. As mentioned before, the most important difference between these two algorithms is that, unlike the relaxed region algorithm, the linear support algorithm can-serve as an approximation algorithm. Although the basic idea and motivation of the linear support algorithm is to make the relaxed region algorithm more efficient, it is similar to an algorithm discussed in Sondik (1971) provided that there are only two system states. Sondik claimed that his algorithm might not be as efficient as the one-pass algorithm if the number of system states or supports in A J J is large. However, the computational requirement for the linear support algorithm is about the same as or less than that for the relaxed region algorithm which has been shown to be more efficient than the one-pass algorithm. VI. Numerical Examples In this section, several «ets of/test data are used to compare the efficiency of the algorithms discussed in this chapter. The basis of comparison is CPU time. All algo-rithms were implemented as Fortran T7 programs which were run on the Amdahl 5860 62 with FPU at the University of British Columbia. The code for the one-pass algorithm is based on the original code provided by Dr. Sondik. The Monahan algorithm is coded with Eagle's modification; that is, if all elements of a generated support are less than or equal to another support, then this support is deleted before linear programming is used to determine the unnecessary supports. Since there are usually more generated supports than system states, the dual formulations are used for linear programming. The IMSL routines are used to solve these linear programming problems to determine the unnecessary supports. In the relaxed region algorithm and the linear support algorithm, all vertices of a relaxed region have to be found. The Mattheiss algorithm is used here for finding all vertices of the convex polytopes. This algorithm is discussed in Mattheiss (1973) and Mattheiss and Rubin (1980). The test data can be divided into two groups. The first group contains the data for the machine maintenance problem discussed in Smallwood and Sondik (1973). The second group contains several sets of randomly generated data for problems with three, four, and five system states. In order to minimize the effect of the terminal reward, all problems are solved for twenty stages and the terminal reward is set to be zero. For all problems, the discount factor fi is 1; i.e., there is no discounting in reward. 1. Machine Maintenance Problem: The data for this test problem is in Smallwood and Sondik (1973). The CPU times 63 Relaxed Linear Number of Number of Sondik's Monahan's Region Support Periods Left supports Algorithm Algorithm Algorithm Algorithm 1 1 28 0 2 0 2 1 28 0 2 0 3 1 28 0 2 0 4 1 28 1 3 1 5 1 28 0 2 1 6 2 28 2 8 3 7 3 45 6 14 6 8 4 101 11 22 11 9 4 103 15 22 12 10 5 110 17 28 15 11 6 104 27 39 20 12 8 168 37 54 31 13 10 232 57 73 49 14 15 340 74 115 103 15 13 270 146 113 111 16 14 336 156 105 106 17 9 333 123 59 74 18 12 201 74 73 65 19 10 203 99 69 67 20 13 233 92 89 76 Total 2947 937 894 751 (unit : .001 CPU second) Table 3.1: CPU times of the machine maintenance problem and the number of supports at each period are recorded in Table 3.1. From Table 3.1, the CPU time required for every period using Sondik's one-pass algorithm is always longer than that for other algorithms. It is clear that Sondik's algorithm is the least efficient algorithm among these four algorithms. This result is expected as discussed before. As discussed in Section III, The Monahan algorithm generates K • ML supports 64 in each iteration where K is the number of actions, L is the number of signals, and M is the number of supports in the previous iteration. Although a large number of generated supports can be eliminated by simply comparing their elements, there might still be a large number of supports which require linear programming to determine whether or not it is a required support. Therefore, the CPU time required for an iteration of the Monahan algorithm is directly related to the number of supports in the previous iteration. In contrast, the relaxed region algorithm and the linear support algorithm need to identify the same number of relaxed regions as the number of required supports for an iteration. Therefore, the CPU time required for an iteration of the latter algorithms is directly related to the number of necessary supports in the current iteration. The results of this example confirm this phenomenon. For instance, there are only nine supports for the seventeenth iteration, but fourteen supports for the previous iteration. As a result, the Monahan algorithm spent about twice the CPU time to perform this iteration as compared with the relaxed region algorithm and the linear support algorithm. Similar results are also observed for the fifteenth and nineteenth iteration. On the other hand, ten supports are needed at the thirteenth iteration and fifteen supports at the fourteenth iteration. The Monahan algorithm only requires approximately | of CPU time used in the relaxed region algorithm and the linear support algorithm. The total CPU time requirement for the Monahan algorithm is 5% more than that for the relaxed region algorithm and 20% more than that for the linear support algorithm. It is clear that the relaxed region algorithm and the linear support algorithm 65 axe more efficient than the Monahan algorithm for this set of data. As discussed in Section V, the linear support algorithm can be used to find an approximate solution if a small error is tolerable in each iteration. For comparison purposes, the tolerable error is set to 0.1, 0.01, 0.005, and 0.001, respectively, in 4 case runs. The number of supports and CPU times required for different tolerable errors are shown in Table 3.2. It is easy to see that the CPU times required for these approximate solutions must be less than or equal to that for the linear support algorithm for finding an exact solution. However, the CPU time required to obtain these approximate solutions is surprisingly short. If the tolerable error is set to 0.1 for each iteration, it takes 19% of the CPU time required to find the exact solution by the Monahan method. It requires 57%, 64%, and 77% of the CPU times for the Monahan algorithm if the tolerable errors for each iteration are 0.01, 0.005, and 0.001, respectively. The large reduction of CPU time required is due to fewer supports being generated from each iteration. At the end of the 20th Iteration CPU Times # of supports Maximal Error Tolerable Error = 0.1 0.179 4 0.12508 Tolerable Error = 0.01 0.536 9 0.00863 Tolerable Error = 0.005 0.604 10 0.00283 Tolerable Error = 0.001 0.695 13 0 Table 3.2: CPU times of the machine maintenance problem for the approximation method The error bound for these approximate solutions can also be calculated using the 66 following formula: 1 -pn 1-/3 •T.E. if 0 < $ < 1; and n • T.E. if (3= 1 where n is the number of iterations, T.E. is the tolerable error for each iteration, and /? is the discount factor. Therefore, the error bounds are 2.0, 0.2, 0.1, and 0.02 for the tolerable errors of 0.1, 0.01, 0.005, and 0.001, respectively. These error bounds might be too large when compared with the maximal value of 10.59079 at the end of the twentieth iteration or when compared with the difference of 3.4025 between one-period maximal and minimal reward. Since the exact solution is known, the maximal error at the end of the twentieth iteration can be computed. The maximal errors are 0.12508, 0.00863, 0.00283, and 0 for the tolerable errors of 0.1, 0.01, 0.005, and 0.001, respectively. Except for the tolerable error of 0.1, the maximal errors are less than the tolerable error in one iteration. Even for a tolerable error of 0.1, the maximal error is just slightly more than the tolerable error in one iteration. The actual error is significantly smaller when compared with the maximal value at the end of the twentieth iteration or with the difference of one-period maximal and minimal reward. This example shows that if two or more supports have very similar slopes, then the approximation method does not recognize these as distinct supports and treats them as one support. In this way, the number of supports can be reduced at each iteration and, as a result, the CPU times required can also be reduced. Since supports with similar slopes are considered as one and the number of supports are reduced, numerical stability 67 improves. A problem which cannot be solved by other algorithms might be able to be solved by using this method to get an approximate solution. 2. Randomly Generated Data: Several sets of data with 3, 4, and 5 system states are generated to compare the efficiency of the algorithms discussed in this chapter. A l l these data are listed in the Appendix 1. From the discussions in Sections II, IV, and V, and the previous numerical example, it is clear that Sondik's one-pass algorithm is not as efficient as the other algorithms. Therefore, the one-pass algorithm will not be considered further in this thesis. T h e first group of randomly generated data consists of five data sets with three states, three actions, and three signals. These data are listed in D3.1 to D3.5 in Ap-pendix 1. T h e number of supports and the maximal error at the end of the twentieth iteration, and the C P U times are shown in Tables 3.3 to 3.7. 68 CPU Times At the end of the 20th Iteration # of supports Maximal Error Monahan's Algorithm 0.646 5 0 Relaxed Region Algorithm 0.812 5 0 Linear Support Algorithm T.E.= 0 0.473 5 0 Linear Support Algorithm T.E.= 0.1 0.230 4 0.00279 Linear Support Algorithm T.E.= 0.01 0.231 4 0.00279 Linear Support Algorithm T.E.= 0.005 0.226 4 0.00279 Linear Support Algorithm T.E.= 0.001 0.323 4 0.00015 Table 3.3: Results of the data set D3.1 CPU Times At the end of the 20th Iteration # of supports Maximal Error Monahan's Algorithm 0.882 7 0 Relaxed Region Algorithm 0.663 4 0 Linear Support Algorithm T.E.= 0 0.496 7 0 Linear Support Algorithm T.E.= 0.1 0.225 4 0.07402 Linear Support Algorithm T.E.= 0.01 0.321 5 0.00475 Linear Support Algorithm T.E.= 0.005 0.324 5 0.00475 Linear Support Algorithm T.E.= 0.001 0.417 6 0.00018 Table 3.4: Results of the data set D3.2 69 CPU Times A.t the end of the 20th Iteration # of supports Maximal Error Monahan's Algorithm 1.528 7 0 Relaxed Region Algorithm 0.926 5 0 Linear Support Algorithm T.E.= 0 0.818 7 0 Linear Support Algorithm T.E.= 0.1 0.260 4 0.13268 Linear Support Algorithm T.E.= 0.01 0.497 6 0.01228 Linear Support Algorithm T.E.= 0.005 0.481 6 0.01228 Linear Support Algorithm T.E.= 0.001 0.662 7 0.00041 Table 3.5: Results of the data set D3.3 CPU Times At the end of the 20th Iteration # of supports Maximal Error Monahan's Algorithm 4.626 10 0 Relaxed Region Algorithm 1.012 11 0 Linear Support Algorithm T.E.= 0 1.631 10 0 Linear Support Algorithm T.E.= 0.1 0.158 3 0.13672 Linear Support Algorithm T.E.= 0.01 0.478 5 0.02663 Linear Support Algorithm T.E.= 0.005 0.535 5 0.00928 Linear Support Algorithm T.E.= 0.001 1.068 7 0.00140 Table 3.6: Results of the data set D3.4 70 CPU Times At the end of the 20th Iteration # of supports Maximal Error Monahan's Algorithm 5.171 16 0 Relaxed Region Algorithm 2.230 8 0 Linear Support Algorithm T.E.= 0 2.348 15 0 Linear Support Algorithm T.E.= 0.1 0.389 5 0.06367 Linear Support Algorithm T.E.= 0.01 0.692 7 0.01219 Linear Support Algorithm T.E.= 0.005 0.727 7 0.01051 Linear Support Algorithm T.E.= 0.001 1.267 9 0.00093 Table 3.7: Results of the data set D3.5 From these five tables, it can be seen that in the case where only a small number of supports are needed to construct a value function, there is no significant difference in CPU times among these methods. However, when a larger number of supports are needed to form a value function, then the relaxed region algorithm and the linear support algorithm are much more efficient than the Monahan algorithm. The CPU times required for the relaxed region algorithm and the linear support algorithm are most often considerably less than a half of the CPU time required by the Monahan algorithm. The performance of the approximation method is still very impressive. The CPU times required for the approximation method are much less than those for the other methods to find the exact solution. The actual maximal error is about the same or less than the tolerable error in one iteration. Considering that the difference between one period maximal and minimal reward ranges from 6.1 to 9.7, or the maximal value at 71 the end of the twentieth iteration ranges from 119 to 175, the actual error is remarkably small. Now consider the second group of randomly generated data. This group of data includes five data sets with four states, four actions, and four signals. The data are listed in D4.1 to D4.5 in Appendix 1. The number of supports and maximal error at the end of the twentieth iteration, and the CPU times are shown in Table 3.8 to 3.12. CPU Times At the end of the 20th Iteration # of supports Maximal Error Monahan's Algorithm 58.820 26 0 Relaxed Region Algorithm 11.549 22 0 Linear Support Algorithm T.E.= 0 19.861 25 0 Linear Support Algorithm T.E.= 0.1 0.991 5 0.18222 Linear Support Algorithm T.E.= 0.01 3.609 11 0.00922 Linear Support Algorithm T.E.= 0.005 8.348 13 0.00394 Linear Support Algorithm T.E.= 0.001 11.787 16 0.00156 Table 3.8: Results of the data set D4.1 These five sets of data require more than 20 supports to form their value functions at the end of the twentieth iteration. Since there are more supports in each iteration, the CPU times required to solve these problems are much longer than the data in the previous group. This group of data can be divided into.three subgroups. The first subgroup contains 72 CPU Times At the end of the 20th Iteration # of supports Maximal Error Monahan's Algorithm 61.479 26 0 Relaxed Region Algorithm 13.365 19 0 Linear Support Algorithm T.E.= 0 23.493 25 0 Linear Support Algorithm T.E.= 0.1 2.293 8 0.16191 Linear Support Algorithm T.E.= 0.01 7.830 13 0.00905 Linear Support Algorithm T.E.= 0.005 10.301 14 0.00308 Linear Support Algorithm T.E.= 0.001 16.890 20 0.00038 Table 3.9: Results of the data set D4.2 data sets D4.1 to D4.3. The results for this subgroup are obtained by all methods. The Monahan algorithm requires more than twice the CPU time to solve these problems as the other two algorithms. The performance of the approximation method is excellent. For example, the approximate solutions with a tolerable error of 0.1 for every iteration only require 1.2% to 3.8% of the CPU times required for the Monahan algorithm to solve these problems. The maximal errors for these approximations are very small when compared with the maximal value at the end of the twentieth iteration or the difference between one period maximal and minimal reward. The second subgroup contains data set D4.4. Due to numerical problems, no result is generated by the relaxed region algorithm. For this set of data, the Monahan algo-rithm needs 762 CPU seconds to solve; however, the linear support algorithm requires only 172 CPU seconds to reach solution, which is only about 22.6% of the CPU time 73 CPU Times At the end of the 20th Iteration # of supports Maximal Error Monahan's Algorithm 79.154 30 0 Relaxed Region Algorithm 28.633 32 0 Linear Support Algorithm T.E.= 0 36.327 30 0 Linear Support Algorithm T.E.= 0.1 0.891 5 0.21916 Linear Support Algorithm T.E.= 0.01 4.301 9 0.02196 Linear Support Algorithm T.E.= 0.005 7.522 13 0.00813 Linear Support Algorithm T.E.= 0.001 21.877 20 0.00049 Table 3.10: Results of the data set D4.3 required by the Monahan algorithm. The performance of the approximation method is even more impressive. With a tolerable error of 0.1 for each iteration, it only takes 2.152 CPU seconds to reach solution and this is less than 0.3% of the CPU time required by the Monahan algorithm or 1.25% of the CPU time required by the linear support algo-rithm for finding the exact solution. The maximal error is only 0.01826. Consider the maximal value at the end of the twentieth iteration, 154.62, or the difference between one-period maximal and minimal reward, 6.8. This maximal error is so small that it can be ignored in this case. The third subgroup contains data set D4.5. In this set of data, all three algo-rithms for finding the exact solution generate more than the pre-set maximal number of supports, 50, at the sixth iteration. Therefore, none of the results are generated. Even though the exact solution is impossible or difficult to obtain, the approximation 74 At the end of the 20th Iteration CPU Times # of supports Maximal Error Monahan's Algorithm 762.173 32 0 Relaxed Region Algorithm Result is not obtained Linear Support Algorithm T.E.= 0 172.282 33 0 Linear Support Algorithm T.E.= 0.1 2.152 6 0.01826 Linear Support Algorithm T.E.= 0.01 2.604 7 0.01297 Linear Support Algorithm T.E.= 0.005 9.140 10 0.00546 Linear Support Algorithm T.E.= 0.001 57.226 19 0.00214 Table 3.11: Results of the data set D4.4 method still works very well. It takes only 3.113 CPU seconds to solve this problem if the tolerable error for each iteration is set to 0.1 or 45.688 CPU seconds if the tolerable error is set to a relatively small number, 0.001. Since the exact solution is unknown, the exact maximal error cannot be computed. As discussed before, the error bound can be calculated as n • T.E.. However, from the previous examples, the error bound computed by this method is too big. A narrower error bound is desired. By the triangle inequality, IK -w||<IK-«|| + ||«-v|| where v* is the exact solution, t; is the approximate solution whose error bound is desired, and v is another approximate solution whose error bound is known. Since the approximate solution with a tolerable error of 0.001 at every iteration is the best solution obtained, this solution can be chosen as the reference solution, v, where \\v* —v\\ < 0.02. 75 Therefore, only ||t; — u|| has to be computed in order to know the error bound for the approximate solution v. The error bounds shown in Table 3.12 are computed by this method. These error bounds are still very small. It appears from this example that very good approximate solutions are obtained by the approximation method, particularly since these error bounds are probably considerably overestimated. C P U Times it the end of the 20th Iteration # of supports Error bound Monahan's Algorithm over 50 supports at the 6th iteration Relaxed Region Algorithm over 50 supports at the 6th iteration Linear Support Algorithm T.E.= 0 over 50 supports at the 6th iteration Linear Support Algorithm T.E.= 0.1 3.113 8 0.11084 Linear Support Algorithm T.E.= 0.01 9.892 14 0.03778 Linear Support Algorithm T.E.= 0.005 32.259 20 0.02998 Linear Support Algorithm T.E.= 0.001 45.688 24 0.02000 Table 3.12: Results of the data set D4.5 Since the generated supports are added one at a time into the support set at every iteration of the linear support algorithm, it is easy to develop an approximation method by limiting the maximal number of supports at each iteration. Table 3.13 shows the results of limiting the maximal number supports at each iteration to be 10, 15, and 20, respectively, for the data set D4.5. When approximate error for each iteration is known, the error bound can be computed as *=i 76 where e* is the maximal approximate error in the k-th iteration. By this method, the error bounds are 0.85161, 0.14986, and 0.07117, for maximal number of supports of 10, 15, and 20, respectively. However, since a more accurate result is known, a narrower error bound can be obtained by the triangle inequality discussed above. These narrower error bounds are shown in Table 3.13. CPU Times At the end of the 20th Iteration # of supports Error bound Maximal # of supports = 10 4.219 10 0.08210 Maximal # of supports =15 14.479 15 0.03587 Maximal # of supports = 20 35.283 20 0.02655 Table 3.13: Results of data set D4.5 by using the approximation method which limits the maximal number of supports in every iteration When it is difficult to choose a priori tolerable error for every iteration, it can be very useful to limit the maximal number of supports at every iteration in the ap-proximation method. The selection of the maximal number of supports only depends on the number of actions and number of signals in the problem under consideration. When a reasonable number is chosen as the maximal number of supports, an approxi-mate solution can usually be obtained although the error bound cannot be determined beforehand. The last group of randomly generated data contains only one set of data, D5.1 in Appendix 1. This is a set of five states, "three actions and three signals. The results of 77 CPU Times it the end of the 20th Iteration # of supports Error bound Monahan's Algorithm 73.139 41 0 Relaxed Region Algorithm Result is not obtained Linear Support Algorithm T.E.= 0 Result is not obtained Linear Support Algorithm T.E.= 0.1 0.725 4 0.03799 Linear Support Algorithm T.E.= 0.01 2.364 6 0.02653 Linear Support Algorithm T.E.= 0.005 2.668 7 0.02343 Linear Support Algorithm T.E.= 0.001 10.622 12 0.02000 Table 3.14: Results of the data set D5.1 this set of data are shown in Table 3.14. In this set of data, the relaxed region algorithm and the linear support algorithm cannot complete the calculations. This difficulty is caused by too many supports being generated and the procedure of finding all vertices of relaxed regions being not suffi-ciently stable under this situation. As shown in the previous examples as well as this example, the approximation method can reduce the number of supports generated. As a result, the instability of the procedure for finding all vertices of a relaxed region is resolved. All four levels of approximation can obtain the results within a relatively short time. By the triangle inequality, the error bounds are 0.03799, 0.02653, 0.02343, and 0.02000 for tolerable errors of 0.1, 0.01, 0.005, and 0.001, respectively. This fact also shows that a large number of generated supports can be represented by a very small 78 number of supports and only a very small error occurs. For example, the 41 supports generated by the Monahan algorithm can be represented by only 4 supports and the maximal error occurred is less than 0.03799. This example shows that the approximation method can quickly find a stable and accurate approximation. VII. Conclusion In this chapter, four algorithms for solving finite horizon POMDP problems are discussed. Sondik's one-pass algorithm was the first systematic solution procedure for solving finite horizon POMDP problems. Since the unification of several regions is usually required to form a support region and all vertices of these region have to be found, it is clear that Sondik's method needs more CPU time to solve a problem than does the relaxed region algorithm or the linear support algorithm where the number of relaxed regions generated is the same as the number of supports. The Monahan algorithm is simple to code. When there is only a small number of generated supports at each iteration, the Monahan algorithm can be more efficient than the relaxed region algorithm or the linear support algorithm. However, when the number of generated supports increases, it is clear, as evidenced by the numerical examples shown in Section VI, that the relaxed region algorithm and the linear support algorithm need less CPU time to solve a problem than does the Monahan algorithm. 79 It is difficult to compare the efficiency of the relaxed region algorithm and the linear support algorithm. However, the linear support algorithm has two advantages over the relaxed region algorithm. The first advantage is that the constraint set which defines a relaxed region is easy to set up for the linear support algorithm. This constraint set also gives more stable results for finding all vertices of a relaxed region. More importantly, the linear support algorithm can be used as an approximation method. Both the relaxed region algorithm and the linear support algorithm can provide the vertices of all support regions. These vertices can be used to compute the maximal and minimal difference of two piecewise linear functions. The approximation method is the only method capable of solving a problem with a large number of supports. The approximation method reduces the number of generated supports with a small error. Reducing the number of supports not only significantly decreases the CPU time required for an iteration, but also decreases the possibility of numerical error caused by two or more very similar supports. As a result, a stable and relatively accurate solution can be obtained for more complex problems within reasonable CPU time. The approximation method can be performed by either setting the tolerable error for each iteration or limiting the number of supports generated for each iteration. 80 C H A P T E R 4 A L G O R I T H M S F O R INFINITE HOR IZON P O M D P In the previous chapter, the algorithms for finite horizon POMDP problems were discussed. In this chapter, the algorithms for infinite horizon discounted POMDP prob-lems will be presented. The discount factor fi, in this chapter, is assumed to be 0 < fi < 1. The assumption about the discount factor is important since it guaranties that H and Hf, are contrac-tions. Moreover, under this assumption Sawaragi and Yoshikawa (1970) have shown that there is a stationary optimal policy for a POMDP. Hence, only stationary policies have to be considered. Although only stationary policies have to be considered in an infinite horizon dis-counted POMDP, there are uncountably many stationary policies available because the state space n is continuous. Therefore, the convergence of the algorithm within finite time is not guaranteed. Moreover, the limit of a piecewise linear function is not nec-essarily piecewise linear. Papadimitriou and Tsitsiklis (1987) pointed out that infinite horizon POMDP problems are not combinatorial problems and do not appear to be exactly solvable by finite algorithms. However, if only an e-optimal solution is required, these difficulties may be resolved. The theme of the chapter is to find an e-optimal policy. The main aim of this chapter is to develop a special class of algorithms for infinite horizon discounted POMDP, called iterative discretization procedures (IDP), which can 81 find an e-optimal solution efficiently. This chapter is organized in the following way. The existing algorithms will be discussed in Section I. Section II introduces some of the basic results used in this chapter. Applying the approximation method discussed in Chapter 3 to the successive approximation method to get an e-optimal solution is the topic of Section III. Section IV discusses the methods to find some useful values for termination criteria. The iterative discretization procedure is developed in Section V. Methods for accelerating convergence for the iterative discretization procedure are discussed in Section VI. The iterative discretization procedure with the approximation policy improvement is presented in Section VII. Section VIII provides the numerical comparisons of algorithms discussed in this chapter. I. Existing Algorithms for Infinite Horizon P O M D P The most straightforward approach for solving an infinite horizon POMDP problem is the standard successive approximation method. As discussed in Chapter 2, a POMDP problem has the contraction property. Following Theorem 1 of Denardo (1967), it is easy to show that an e-optimal solution can be obtained in a finite number of iterations. However, an iteration in the successive approximation method is similar to solving one stage of a finite horizon POMDP. As discussed in Chapter 3, to solve one stage of a finite horizon POMDP is not an easy task. Moreover, Papadimitriou and Tsitsiklis (1987) have shown that a finite horizon POMDP is a PSPACE-complete problem, so this approach is not efficient. 82 Sondik (1971,1978) introduced a policy iteration algorithm. This algorithm is based on the assumption of a finitely transient policy property. If a policy is finitely transient, then the state space II can be partitioned into a finite number of convex regions such that, for any given signal, all states in one region will map onto the same region under this policy. However, it is difficult to verify that a given policy is finitely transient. Hence, a transient policy is used to approximate the given policy. Since partitioning the state space is not easy, it is difficult to perform this algorithm. Recently, White and Scherer (1986) proposed a reward revision algorithm for an infinite horizon POMDP. Their algorithm is an accelerated successive approximation algorithm. In their algorithm, the problem is approximated by a completely observable, finite state MDP and the reward is revised between two standard policy improvement steps. They presented several examples with two states for which the speed of conver-gence is reduced by ten times compared with the standard successive approximation algorithm. Although not necessarily the most efficient method, discretizing the state space is widely used for solving continuous state space MDP. Bertsekas (1975) showed that as the discretization grids become finer and finer, the performance of the resulting suboptimal policies comes arbitrarily close to that of the optimal. Kakalik (1965) used this method to solve a POMDP. He divided the state space n into equal area grids and each grid was considered as a state. In this way, a finite number of states is obtained. The usual finite state MDP techniques can, then, be used to solve this problem. The resulting solutions for each state are used to represent the whole grid; that is, a piecewise 83 constant function is used to represent the value function. This method is easy to use and techniques developed for finite MDP can be applied. The disadvantage of this method is that a large number of grids might be required to get a reasonable approximation to an optimal solution. Indeed, this finite state MDP might become more difficult to solve than the original problem. There are some other methods that have received less attention. Satia and Lave (1973) developed an implicit enumeration algorithm for computing e-optimal solution to an finite horizon POMDP. Brumelle and Sawaki (1978) and Sawaki (1980) developed a modified policy iteration algorithm to solve an infinite horizon POMDP. There are some special algorithms suitable only for some special cases. Wang (1976, 1977) considered a two action algorithm. Buckman and Miller (1979) reformulated the problem as a regenerative stopping problem. Algorithms of this nature are very efficient although they are not suitable for general infinite horizon POMDP problems. II. Preliminaries In this section, some results which will be used in later sections are developed. Let fi be a nonempty set in RN. Let B be the collection of real-valued bounded functions with domain fi. Define a metric || • || on B by ||u — u|| = s u p I g n \u(x) — v(x)\ where u,v G B, and let V be a subset of B which is complete in this metric. Let u,v G B, we say u = v if u(i) = v(x) for all x G fi and u < v if u(x) < v(x) for all 84 x £ SI. Recall that a mapping G : V —• V is called a contraction if for some fi strictly between 0 and 1, ||Gu — Gv\\ < fi\\u — v\\ Vu,t> £ V. Then by the principle of contraction mapping, there exists a unique fixed point, v*, in V such that Gv* = u* (El'sgol'c 1964). The following theorem is an extension of Theorem 12.2.1 in Ortega and Rheinboldt (1970) and can be proved by a simple modification of the their proof. T H E O R E M 4.1: Let G : V —• V be a contraction mapping, and assume Vo C V is a closed set such ihatGVo C V0. Let {uk} be any sequence of functions in VQ and set /z* = \\Guk — 1|, k = 0,1, Also let v° £ V0 and define vk+1 = Gvk for k = 0,1,.... Let v* be the unique fixed point of G in V (of course, v* £ Vo). Then, for k = 0,1,.. \\nk+1 -v*\\< [1/(1 - 0)] • [fi • \\uk+1 - uk\\ •+ /^]; (4 - 1) it - v*\\ < \\vk+i - + • fij + fik+1 • \\v° - u°\\, (4 - 2) and limfc—oo w* = v* if and only if l imt_ 0 0 fik = 0. The following definition is due to Van Nunen (1976). Definition: A mapping G from V to V is said to be fi-contracting (/i > 0) with contraction radius fi (0 < fi < 1) if for each u, v £ V, we have \\Gu-Gv\\<fi-\\u-v\\+fi. 85 L E M M A 4 . 1 : Let G : V —* V be a fi-contraction mapping. If u E V is such that VQ = {v £ V : \\v - Gu\\ < 7 } C V where 7 = [1/(1 - fi)] • [fi • \\Gu - u|| + fi], then GV0 C VQ. Proof: Let v € Vo, then \\Gv - Gu|| < fi • \\v - u\\ + fi <0.[||v-Gu|| + ||Gu-u||] + /z <fi"Y + fi-\\Gu-u\\ + fi = 7. II Now let us consider using a mapping G to approximate a contraction mapping G. The following theorem gives us a basic result for this approximation. T H E O R E M 4.2: Let G : V —* V be a contraction mapping on V with constant fi and G : V —* V be another mapping for which \\Gv - Gv\\ <fi Vu € V. Suppose for some v° € V, V0 = {v € V : ||t; - Gu°|| < 7} where 7 = [1/(1 - fi)] • [fi • ||Gt5° - t5°|| + 2fi]. Then the sequence {vk} defined by vk+1 = Gt;* for k = 0,1,... remains in Vo and \\vk+1 - v*\\ < 1)9/(1 - fi)] • ||6* + 1 - 0*11 + iikliX - fi) ^ [ ^ / ( l - ^ l - H S ^ - ^ l l + M l - / ? ) ( 4 - 3 ) 86 where /x/t = \\Gvk — Gvk\\. Moreover, if {vk} is the sequence defined by = Gvk for k = 0 , 1 , . . . with v° = v°, then k \\vk+1 -«* || < \\vk-»-v*\\ + X>*"V> i=o k <\\vk+i-v*\\ + Y,Fii (4-4) i=o where v* is the unique fixed point of G in Vo-Proof: The proof is divided into four parts. (i) G is a 2/x-contraction mapping. Let u , v € V , then ||Gu - Gv\\ < ||Gu - + ||Gv - Gu|| + ||Gu - Gu|| <H + H + 0-\\u-v\\ = P-\\u-v\\+2p. Therefore, by Lemma 4.1, GV0 C Vo. (ii) Now we show GVb C Vo-Let v 6 Vo, then Gv € V and ||Gi> - Gu°|| < \\Gv - Gv°\\ + \\Gv° - Gv°\\ <0.\\v-iP\\+r <0-(\\v-dv°\\ + \\Gvo-vo\\) + p < p . 7 + 0.\\Gvo-v°\\ + n 87 <fi.7 + fi.\\Gv0-v°\\ + 2fi </?. 7 + ( l - / ? ) - 7 = 7 Therefore, Gv € Vo for all v E V0. (iii) Now we will show that if v° — u°, then Gv° € V0. \\Gv° - Gv°\\ = \\Gv° - Gv°\\ < n < 7. Therefore, Gv° € V0. (iv) Then by Theorem 4.1 and fij < fi for j = 0,1,..., fc, the inequalities (4-3) and (4-4) are established. I As k approaches infinity, ||u*+1 — v*\\ approaches 0, and 53jLo 0 J * P approaches fi/(l — fi). Then, by Theorem 4.2, the maximal distance between the approximation function vk and the fixed point v* will be less than or equal to/x/(l — fl + e for all k sufficiently large. When k equals 0, the inequalities (4 — 3) and (4 — 4) can be rewritten as II*1 - t>*|| < [0/(1 - fl] • «°||+ [*>/(!-0)] < 10/(1-fl]- ||* «°|| +1^/(1-fl] (4-5) and HC 1 — w*H < Hw1 — v* || + / i o Since v° = v° and by the contraction mapping assumption, II*1 - v*\\ <[0/(l- fl] • h 1 -v°\\+fio ^ ^ / ( l - f l l - l ^ - i J l + A * (4-6) 88 where/xo = \\Gv° - Gv°\\. In practice, the current approximation to the solution can be viewed as v1 and the previous approximation to the solution can be viewed as v°. In this case, (4 — 5) and (4 — 6) are more useful than (4 — 3) and (4 — 4) since the error bound for the current approximation to the solution can be computed. This error bound can be computed and used to determine whether or not the e-optimality has been achieved. Note that /x0 < /x. Therefore, if /x0 is readily available, which is the case in later sections, then the first inequalities in (4-5) and (4-6) should be used. Moreover, if Wv1 — v°\\ = Wv1 — i;01| + /x, then (4-5) and (4-6) give the same error bound. However, if ||v1 — v°|| < Wv1 — v°\\ + /x, then (4-6) gives a tighter bound than (4-5). Therefore, if Uu1 — £>°|| can be computed easily and Wv1 — v°\\ < Wv1 — v°\\ + /x, then (4-6) is recommended. III. Approximate Value Iteration In the previous section, some very general results were discussed. In the next few sections, we will apply these results to the setting discussed in Chapter 2; that is, we will focus on the domain II, the set of bounded real-valued functions V, and the contraction operator H. In Chapter 2, the operator H was introduced. The computation of Hv for a given v was the major topic of Chapter 3. In the successive approximation method for an infinite 89 horizon problem, the operator H is repeatedly applied to find the optimal solution, i.e., l im Hnv = v*. As discussed and shown in the numerical examples in Chapter 3, when a large number of supports are needed to form Hv, it is usually time-consuming to compute Hv. The linear support algorithm discussed in the last chapter can be used to find an approximation solution. In contrast to the operator H, we will refer to this approxima-tion operator as H in this chapter.f The approximate value Hv might require much less time to obtain. We might expect that it may be easier to repeatedly apply operator H to find an e-optimal solution for an infinite horizon P O M D P problem. We will discuss this issue in this section. Theorem 4.2 gives a theoretical background for the use of an approximate evaluation for each step of policy improvement. Formula (4-4) shows that \\vk+1 —v*\\ < \\vk+1 — v*\\ + £*=o/?*"•' • H where vi = Hv'-1 and v{ = Hv'-1 for t = 1 , 2 , . . . , * + 1 and v° = v°. When k approaches infinity, vk+1 approaches v*. Therefore, for any given e > 0, — < jfqj + e k for k large enough. This implies that the maximal distance between v* and the result from repeatedly applying approximate policy improvement steps wil l not be more than e if fi is chosen to be less than (1— /?)-(e — e. Of course, in practice, fi is chosen to be much less than (1 — /?) • e in order to ensure faster convergence. In practice, inequalities similar to (4-5) and (4-6) are usually used to determine the f Although H is dependent on a selected error e, the dependency was suppressed from the notation for simplification. 90 error bound for the current solution. Let vk be the current solution. If \\Hvk — vk\\ < (i - g V c-,u o r | | ^ 5 f c _ t , * j | < ( l - ^ ) j e - ^ O w h e r e /x* = \\Hvk -Hvk\\, then, following from the inequalities (4-5) and (4-6), \\Hvk — v*\\ < e. That is, Hvk is an e-optimal value function. Successive approximations with extrapolation usually give a better bound for the optimal value function and also reduce the number of iterations required to get an e-optimal value function. T he following proposition is a generalization of Proposition 4 on page 237, in Bertsekas (1976). The proof of the following proposition is also a direct generalization of the proof in Bertsekas. PROPOS IT ION 4.1: Let v € V. Then for all n € U and k = 1,2,..., (Hkv)(x) + y^J < ( . f f l + 1 t > ) ( 7 r ) + ^Y^J-<(Hk^v)(ir) + < ( H k v ) ( « ) + ^ where Lk = in f {(Hkv)(ir) - (H*-lv)(ir)} iren and Uk = sup { ( t f*u ) ( 7 r ) - (H^v)^)}. jrgn Therefore, if (((3 - (Uk - Lk))/(1 e-optimal value function. - 0)) < e, then Hkv + ((/? • Lk)/(1 - 0)) is an 91 The above proposition and proof are restricted to the operations with an exact evaluation of H in the policy improvement step. The following proposition extends this result to the approximate policy improvement operator, H. P R O P O S I T I O N 4.2: Assume Hv > Hv and \\Hv — Hv\\ < \i for all v € V. Then, for all n £ TI and 1 - / 9 < Hkv(n) + 1 - / 9 where Uk = sup{(Hkv)(n) - (H^v)^)} Tren Uk = sup{(Hkv)(n) - (Hkv)(7r)}. wen Proof: Since Lk = infw e n{(B r kv)(v) - (JJ*"1 « ) (* ) } , then Hk~l v{*) + Lk < Hkv(Tr) VTT e n ( 4 - 7 ) Apply H to both sides, using the monotonicity of H, HHk-lv{x) + P-Lk< HHkv{ir), 92 and, by assumption, Hkv(ir) < HHk 1V(TT), and (4-7), Hk~lv{Tx) + Lk + j3 • Lk < HHkv{ix). T h i s process can be repeated. First apply H and then apply (4-7) to obtain Hk~lv{*) + Lk + r3-Lk + 0 2 - L k < Hkv(rr) + 0-Lk + P2 • Lk < H2Hkv{ix). After m steps this results in the inequality m Hkv(ir) + ^0i-Lk< HmHkv(n). Taking the limit as m - t oo we obtain Hkv(n)+^<v*(ir), which is the first inequality of this proposition. Now consider the second inequality. The proof is similar to the first one. Since Uk = s u P i r € n { ( £ * z , ) ( 7 r ) - (Hk-*v)(*)}, Hk-1v(ir) + Uk>Hkv. ( 4 - 8 ) A p p l y H to both sides, using the monotonicity of H, HHk~lv{ij) + P-Uk> HHkv(ir). B y assumption, HHk~1v(ir) < Hkv(ir) + fik, and ( 4 - 8 ) , Hkv + fik + p-Uk >HHk. 93 T h i s process can be repeated. First apply H and then apply (4-8) to obtain Hkv + nk + fi • fik + fi • Uk + fi2 • UK > H2Hk. After m steps this results in the inequality m m —1 Hkv(n) + ] T fi{ • Uk + Y, fi' • t*k > HmHkv(*). 1=1 t=0 Taking the limit as m —• co we obtain which is the second inequality. I C o r o l l a r y : If P(Uk~L*)+fi* < t } then, Hkv + is an e-optimal value function. P r o o f : The difference between Hkv(ir) + and Hkv(ir) + ^'U1k^k is less than or equal to M'-Wi" for all TT € II. Therefore, if ^ < u - - ^ ) + ^ < c, then Hkv + fjf* is an c-optimal value function. I Unlike Proposition 4.1, the bound need not decrease monotonically in Proposition 4.2; that is, Hkv(s) + (fi • Lk)/(1 - fi) < Hk+1v(s) + (fi-Lk+i)/(l - fi) and Hk+1v(s) + (fi • Uk+i + - fi)< Hkv(s) + (fi-Uk + / * * ) / ( ! ~ fi) m i g h t not be true because the operator H is not monotone. Since \\Hkv — Hk~1\\ = max{|Ljt|, \Uk\}, if both Lk and Uk are same sign, the slightly modified bound provided by Proposition 4.2 will always be smaller or equal to 94 the bound provided by Theorem 4.2. The quantities Lk and UK for the examples will be shown in Section VIII to be both positive, thus the bound provided by Proposition 4.2 is better than that of Theorem 4.2. IV. Methods for Calculating Lk, UKI and //* In order to obtain an e-optimal solution, Lk, UK, and fik have to be calculated. Since these values are also required in later sections for computing the error bound, method for calculating these values will be reviewed and discussed. For ease of discussion, let us consider a more general setting. Assume u and v are two paces linear continuous convex functions with a polytope domain II. Assume the supports of u are in the set A = {a^,a2,..., ak) and the supports of v are in the set H = £2, • • • 1 £1} where k and / are finite integers. Then, for all TT G II, U(TT) = max{7r • a : a G A} and V(TT) = max{7r • £ : £ € E}. Let L = inf {U(TT) *-en V(IT)} U = sup{u(7r) iren and fj. = \\u — v It is easy to show that \i — max{|L|, |17|}. 95 Now consider the function u — v. Since both u and v are paces linear functions, « — v is a paces linear and continuous function; however, u — v need not be a convex function. For aj E A and f j 6 E, define Rij as Ri,j = {* € II : (it - U ) ( T T ) = TT • O i — 7r • £ , } • Let the set R = {Rij : where a,- 6 A and £, G E}. As in Chapter 3, let the support regions for u and v be Ri = {TT £ II : TT • a, > 7r • a where .a,, a G A} and i?j; = { T T G TI : IT • fj > TT • £ where ^,-,£ G E}. Therefore, the region of i2t)j is the intersection of the support regions Ri and Rj. Since both Ri and .Ry are convex, Rij is a convex set. Note that Rij can be an empty set. Since u — v is a linear function on the polytope Rij, the maximal and the minimal values of u — v in Rij axe on the extreme points of Rij. Moreover, since both u and v have only a finite number of supports, there is a finite number of regions in R. Then L, U, and fi are on the extreme points of some regions in R. White and Scherer (1986) developed a linear programming method to implement the idea discussed above calculating L, U, and fi. Let Lij and Uij be the minimum and maximum values of u — v in Rij, respectively. To compute Lij, the following linear programming problem can be solved min TT • oti — TT • £j subject to 96 TT • a, > TT • a for all a 6 A * • ij > * • f for all £ € E IT € n The objective function value is the value of Lij. Analogously, Uij is the objective value of the maximization of the above linear programming. Having calculated Lij and Uij for each region in i2, let L = rmn{Lij} and U = max{Uij}. Using such a procedure to determine L and U requires solving 2 • k • I linear pro-gramming problems. This procedure can represent a significant computational effort, particularly if k and / are large numbers. White and Scherer (1986) suggested an ap-proximation method to find L and U. Let E = {e\, e2,..., e m } be a preselected set such that if et- € E then ej € II. Consequently, L = min{u(e) — v(e) : e £ E} U = max{tx(e) — v(e) : e € E} and fi = max{|L|,\U\}. Clearly, L < L, U > U, and fi > fi; White and Scherer did not discuss how to choose the set E to obtain a good approximation. A s discussed earlier, L and U will occur at some of the extreme points of some regions in R. If all of the extreme points of the regions in R are contained in the set E, then L = L and U = U. However, it is as difficult to find the extreme points of the regions in R as it is to solve all linear programming problems to find L and U. 97 Let R = {Ri : a, € A) and R = {£> : € E}. Also let ER, E R , and ER be the extreme points of the regions in R, R and R, respectively. Clearly, ER\JER C ER. More importantly, in our application, if u and v are found by the relaxed region algorithm or the linear support algorithm, then ER and ER are readily available for use. Since no extra effort is required to find ER and E R , it is recommended that E = ER U ER be used to find L, U and p. to approximate L, U, and fi. In particular for the two system states problem, it can be shown that ER U ER = ER and the approximation is exact. Once the approximate termination criterion based on L, U, and fi is satisfied, the exact values of L, U, and n can be computed to verify c-optimality. V . A n I t e r a t i v e D i s c r e t i z a t i o n P r o c e d u r e f o r P O M D P In Chapter 3 an approximation H was defined which could be used to apply approx-imate value iteration to compute an e-optimal value function as discussed in Section III. Although the time required for each iteration of approximate value iteration is much less than under regular successive approximation, for each iteration all vertices of the relaxed regions still have to be found. T h e procedure of finding all vertices is not an easy task. It is desirable to have a method which can approximate Hv for a given v which does not involve finding all vertices and which reduces the number of iterations of applying operator H. In this section, we present a method which accomplishes these purposes. Let vn be a piecewise linear and convex value function with An = {a1,.. • ,<»*} as 98 its support set. For any given n and An, Hvn(n) and the corresponding support can be calculated by the formula (2 - 5) and (2-6); that is = max n-{rd + P-Y,pd'Qdea*>w} (4 " 9) where ar,d<e G {a € An : TT • Pd • Qde • a > TT • Pd • Q\ • d for all d € A„}. For the example shown in Chapter 3, let us choose IT as [0.5,0.5] and fi to be 1, then Hvn(ir) can be obtained as ffv„([0.5,0.5]) = max [0.5,0.5] • { ' - 4 ' 5 + 0.8 0.2' 0.5 0.5 -2 ' 3 + 0.5 0.5' 0.4 0.6 - 1 ' 1 + 0.6 0.4' 0.3 0.7 0.8 0 0.9 0 0.9 0 0 0.6 0 0.4 0 0.2 • '3' 9 + 0.2 0 0 ' 0.4 • '3" 9 • '3' 9 + 0.1 0 0 ' 0.6 3' 9 3' 9 + 0.1 0 0 ' 0.8 • '3' 9 = max [0.5,0.5] = [0.5,0.5] = 6.8 { 0.2 4.0 4.4 11 9.6 8.2 } 4.0 9.6 The support of Hvn at this state is [4.0,9.6]T, and the second action is the best for this state. Similarly, if the chosen state is [0,1], the support of Hvn at this state is [0.2, l l ] r . If the chosen state is [1,0], then the support of Hvn at this state is [4.62,7.91]T. Notice that these three supports form the supports of Hvn on II. This implies that if some particular states are chosen and formula (4 — 9) is applied, then it is not necessary to perform the complicated procedures discussed in Chapter 3 in order to calculate the supports and values of Hvn. However, the major difficulty with this method is how to select the chosen states such that all necessary supports for Hvn can be found. 99 Although it is difficult to select a set of states such that all necessary supports can be found by only using these states in the set, a good selection of states can generate a good approximation of Hvn. A means of selecting states for this purpose will be discussed later in this section. For now, assume that there is a method for choosing a finite number of states such that a good approximation of Hvn can be generated. Since the computational time spent in finding supports by using a finite number of states is much less than that for an iteration of if as discussed in Chapter 3 provided not too many states are chosen, it might be worthwhile to approximate Hvn using a finite number of states to generate supports. A method to solve POMDP can be developed using such an approximation for Hvn. For a given value function vn < v* and the corresponding set of supports A n , first select a set of k states. Then compute the value of Hv and the corresponding linear supports for these states. The convex piecewise linear function generated by these sup-ports is used to approximate Hvn. Note that this approximated value might be less than vn for some TT £ II. In order to obtain Hvn > u n , we should also include the sup-ports in An when generating the approximation. Although the approximated solution might not be exactly the same as the result from an iteration of policy improvement (i.e., the H operator), the approximated value function might be reasonably good and the CPU time required for computing this approximation will be much less than that for an iteration of the H operation. Improvements for these chosen states due to this approximation can be obtained 100 through computing the difference between the current values and the original values of these states. If the maximal improvement among these chosen states is "large," then the same set of states and their approximate values can be used to find a new approximate value function, since a relatively "large" improvement can be expected for the new approximation. This procedure can be repeated. However, as expected, the improvements for these chosen states become smaller and smaller. It is not a good idea to continue the process using the same set of states if the maximal improvement is very small. Instead, an iteration of policy improvement should be performed. Since only a finite number of states are considered between two iterations of policy improvement, we refer to this period as a discrete phase. This is the basic concept for an iterative discretization procedure. Now let us discuss the iterative discretization procedure more precisely. Set Anto — An, and Vn,o{*) = maxi{n • a' : a' 6 An,o}- Denote the set of discrete states generated at period n as 3 I n = {TT1 ,TT2,• - • ,nk^}. Also let 6 be an arbitrary policy used in the discrete phase. Define tfn.m+i^') ^ see and Wn.m+iCTr*) as Vn)m+l(^) = mBx { ^ ^ . 1 ( 7 r , ) } •(**) = maxK - [rd + fi • £ Pd • Q% • a / ( i r , t , f M . , „ , ] } = V - d , ' ( 7 l , m + l) ( 4 - 1 1 ) 101 where G I I n , and a / ( l r t ) < , i M n m ) G {a G An,m : TT«1 • Pd • Qj • a > TT' • Pd • Qde • a for all or € A n > m } . Setting i n , m + a = An,m U { U ^ d ' ^ m + 1)}, vn,m+i can be defined as v„ > m + i(7r) = max{7r • a' : a'' € i „ , m + i } VTT € n . (4 - 12) Note that, since An,m Q -A n ,m+i> it follows that we have vn,m+i > w „ j m . Moreover, if a*,aJ G A n ) m + i where i ^ j i , and all the elements of a' are greater than or equal to those of a J , then a J can be deleted from A „ i m + 1 without changing the results of the whole process. T h e iterative discretization procedure can be summarized as follows: Step 0. Choose v'0 such that Hv'0 > v'0. Set n = 0. Step 1. Compute vn+i = Hv'n. Step 2. Calculate 17„ +i = s u p w € n { u n + 1 ( 7 r ) -vn(ir)} and L„+i = inf„ €n{u n+i ( 7 r ) -vn(n)} using the techniques described in Section IV. If Un+i — Ln+i < j ^ e , go to Step 8; otherwise, go to Step 3. Step 3. Set n = n + 1, m = 0, vnt0 = v n , An<0 = A „ , and select fc(n) disjoint states from II, put them into the set n n . Also select a small number ei(n) as a reference for stopping this discrete phase, and an integer number I(n) as the largest number of iterations to be performed in this discrete phase. 102 Step 4. Compute v„ ) T O+i(7r) for IT € II„. Then find An,m+i-Step 5. Proceed to Step 7 if max f f 6 n n (u n i m + i(7r) — u „ , m ( 7 r ) ) < ex(n) or m > I(n); otherwise, go to Step 6. Step 6. Set m = m + 1, then go to 4. Step 7. Set v'n = u„ , m +i , A'n = A n ) f n + i . Go to 1. Step 8. Set t>n+i(7r) = t>n+i(7r) + j^j ' ^n+i- Then \\v* — vn+i\\ < e and an e-optimal value function has been found by Proposition 1. Note that the steps 3 to 7 are the procedures in a discrete phase. If steps 4 to 7 are omitted and step 3 is changed to "Set n = n + 1, v'n = vn, and A'n = An, go to 1", then this becomes an ordinary successive approximation procedure. Example: One iteration of the discrete phase of iterative discretization procedure is illustrated using the problem posed in Sondik (1978) with the following data: P 1 = P 2 = 0.8 0.2 0.5 0.5 0.5 0.5 0.4 0.6 Q2 = 0.8 0.2 0.6 0.4 0.9 0.1 0.4 0.6 r = r 2 = -4 4 0 3 Assume that is 0.9 and A i ) 0 = A\ = {a1, a2} where a1 = [-4,4]^ and a2 = [0,3]r. For ease of calculation, two states, TT1 = [0,1] and TT2 = [1,0] are selected. Then *i,o("'1) = 4 and UI,O(TT2) = 0. Let us also choose ei(l) = 1.25 as the stopping criterion 103 for the discrete phase. By (4-10) and (4-11), « 1 , 1 ( [0 , l ] ) = max{[0,l] = [0,1] = 5.35 -3.46 5.35 ; [o, i] -3.46 5.35 1.44 4.80 and t;lil([l,0]) = max{[l,0] = [1,0] = 1.44. -3.46 5.35 ;[i,o] 1.44 4.80 1.44 4.80 Therefore, a 1 ( l , 1) = -3.46 5.35 and d 2 ( l , l ) = 1.44 4.80 . The set Altl is Alt0\J{a1(l, 1)}U {a2(l,l)}. However, since Q1(1,1) > a1 and a 2 ( l , l ) > a 2 , a 1 and a2 can be deleted from A\t\ to reduce unnecessary calculations and the solution will still be the same. Hence, A\y\ can be set as { lated by t ; l j l (7r) = max{7r • -3.46 5.35 -3.46 5.35 ,7T • 1.44 4.80 1.44 4.80 }. The value of t>i,i(7r) can then be calcu-Since both v\t\(i:1) — f i . o ^ 1 ) and U i , i ( 7 r 2 ) — VJ,O(TI"2) are greater than ei(l), 7T1 and 7r2 are used to perform the second iteration. Similarly, by (4-10) and (4-11), » 1 | 2 ( [0 , l ] ) = max{[0,l] = [0,1] = 6.81 -2.10 6.81 ; [0,1] 2.74 6.11 -2.10 6.81 and t;1>2([l,0]) = max{[l,0] -2.10 6.80 ;[i,o] 2.81 6.11 104 = [i,o] = 2.81. 2.81 6.11 Therefore, d*(l,2) = -2.10 6.81 2.81 6.11 Since d ^ l ^ ) > d ^ l , ! ) and and d 2(l,2) = a 2(l,2) > d 2 ( l , l ) , Aif2 can be set as d 1(l,2) Ud 2 (l ,2) . Moreover, since Vi^i*1) — vijin1) and vi,2(n2) — vi,\{n2) are greater than ci(l), the discrete phase should not be terminated. The states 7T 1 and TT2 are used to do the third iteration. Similarly, by (4-10) and (4-11), uIi3([0, l]) = max{[0,l] = [0,1] = 8.01 -0.88 8.01 ; [o, i] -0.88 8.01 3.98 7.36 } and vli3([l,0]) = max{[l,0] = [1,0] = 4.01. -0.88 8.01 ;[i,o] 4.01 7.31 Therefore, a 1 (1,3) = -0.88 8.01 4.01 7.31 4.01 7.31 . Since a 1 (1,3) > a1 {1,2) and and d 2(l,3) = d 2(l,3) > d 2(l,2), A i ( 3 can be set as d : ( l , 3) U Q2(1, 3). Since both VI^TT1) - VI^TT1) and th,3(7r2) — ui,2(7 1'2) A R E equal to 1.20 which is smaller than the preselected stopping Then do one iteration of the H operator to check whether or not an e-optimal solution has been obtained. D 105 as { -0.88" '4.01' 8.01 7.31 Observe that, i n this example, £1,1 = Hv\ and v\Y2 = Hv~iti — H2v\. Therefore, only is an approximation of H3V\. Compared with the methods discussed in Chapter 3, the computation shown in this example to obtain v\t\ and viT2 is much simpler than the computation of any method discussed in Chapter 3 for obtaining the values of Hv\ and H2V2. This is the major benefit of using the iterative discretization procedure. Let us now develop some properties of the iterative discretization procedure. L E M M A 4.2: Ifvn - Hv'n_x > v'n_x, then Vn < Vn,m < Vn,m+1 < v'„ < Hv'n < V* whereO < m < J(n) — 1, and I(n) is the number of iterations in the n-th discrete phase. Proof: vn < Vn,m < Vn,m+i < v'n follows immediately from the definitions of vn,m and v'n. Let 7r be an arbitrary state in II and m' = min{m > 1 : t>n>m(7r) = ^ nC71")}- Then Hv'n(x) = max { 7 r . rd + fi £ Pr(0|7r , d) • v'H{T(*\d, 6))} > max{7 r • rd + fi V Pr(0|,7r, d) (T(*\d,e))} > u„,m'(7r) where the first inequality follows from the monotonicity property, and the second in-equality follows from the definition of vn,m'(?r) and the convexity of un,m'-106 We still have to prove Hv'n < v*. Since Hv'n_x > t > ( , _ i , we have v'n_1 < v*. By the monotonicity property, Hv^^ < v*. Then by the monotonicity property and induction, Hvn,m < u*. The result follows. | T H E O R E M 4.3: Ifvb ^ HV'Q and the sequence of {vn} is defined as in the above algorithm, then the sequence {vn} converges to v* monotonically. Proof: The monotonicity follows directly from Lemma 4.2. We only have to prove the convergence. Let Hn = H o i f " - 1 . Then by induction, the monotone property and Lemma 4.2 Hnv0 <vn<v'n< v*. Since if is a contraction mapping, Hnvo converges to v* when n —• oo. Therefore, the sequence {vn} converges to v*. | The initial choice of the v'0 < Hv'Q is the key to getting the monotonic conver-gence for this algorithm. There are several methods which can be used to satisfy this requirement. One possible method involves starting the algorithm by choosing v'0(n) = j3^{max,feD[mini<i<jv rrf(i)]} W € II. This implies that the set AQ contains only one vector and each element of this vector is y^{max<igD[mini<,</v rd(i)]}. P R O P O S I T I O N 4.3: 107 If vo(n) = j 3 ^ { m a x d e D [ m i n i < t < N r < i(0]} ^ € n> ^ e n -^ uo > v0. Proof: Let d = argmaxd € D{niini<i<7v[r''(i)]} and r 0 = Yz^ { m a X(i e D[mini<i<Ar r d ( i ) ) } . T h e n r r f ( t ) > (1 — 8) • r 0 for 1 < i < N. Let TT be an arbitrary state in II, then HV0(TT) = max{7r T ^ + ^ V . P r ( % , d) • v0(T(Tr\d, 0))} eee = max{7r • r r f + 8 • rn) d e c 1 J > TT • + 0 • r 0 > TT • (1 - /?)r 0 • 1 + 0 r o = r0 = VO(TT) m Note that during the process, the conditions Hv'n > v'n and u n > m + i > are always satisfied. If accurate results are desired, the previous results v'n and u n > m can be used as the new initial values. Before closing this section, let us discuss the selection of states for a discrete phase. The ideal situation is to choose states that can generate supports which cannot be generated by other chosen states. In this case, supports of Hvn at these chosen states completely determine Hvn. However, it may be impossible to find such ideal states. The question then becomes how best to use the currently available information. The best estimation of Hvn will be the current value u„, especially when vn is very close to the optimal solution v*. If the operation H is performed by the relaxed region algorithm or the linear support algorithm discussed in Chapter 3, then all the extreme points of 108 the support regions are readily available. These extreme points may be a good start for quickly finding the supports. However, when a large number of such states are generated, a large amount of time may be required to perform an iteration in a discrete phase. The simple average of the vertices of each support region, which will be inside this region, may be an ideal alternative since it can reflect the information we now have, and thereby significantly reduce the number of chosen states. Unlike the methods discussed in the previous chapter, the information generated in the current iteration in a discrete phase is not used in the current iteration. More precisely, the computation to generate tin,™^*) is independent of the computation to generate vnim{iT3) for all the selected discrete states TT' and it3. These computations de-pend only on the information generated from the previous iteration, i.e., Anim-i. The advantages of this independence of information can be exploited by developing programs for parallel processing computers, thereby reducing the computational time required. The standard successive approximation algorithm and policy iteration algorithm devel-oped by Sondik (1978) are not suitable for parallel processing. The feasibility of using parallel processing computers is an advantage peculiar to the iterative discretization procedure. VI. Accelerating the Convergence As mentioned in the previous section, only a few discrete states are used in an itera-tion in a discrete phase. Since discrete MDP is well developed, some of the techniques for 109 discrete MDP might, at least in some sense, be applied to POMDP to accelerate the con-vergence. In this section, three such methods are discussed. They are the Gauss-Seidel method, the action elimination procedure, and the modified policy iteration algorithm. 1. Gauss-Seidel Method: Denardo (1982) discussed three methods for accelerating successive approximation for a discrete MDP. The second method, the Gauss-Seidel method, used the latest information in successive approximations. This method may be described as: vn(i) = max{rd(t) + 0 £ P * • w-O) + fi £ K ' «n-i(j)}-That is, all the values available before computing the vn(i) can be used for computing vn(i). A similar approach, easily applied to the POMDP setting, is discussed below. Define Anm = An<m and Axnm = Antm U(Uj<,-tV(n, m + 1)) for i > 1; that is, all of the supports including those just generated in the current iteration are in Axnm. When the value of the state 7r* is computed, all supports in A%nm are used as candidates for «/(*•• ,rf,Mj,,m)- T h e n w e h a v e = 7r*'-d ,'(n,m + l) (4-13) where o/(«« A M i m ) = {a € Anm : *•Pd • Qj • a > ** • Pd • Qde • a Vd € A^}. 110 The algorithm is similar to the one discussed in Section V, with the exception that in step 4, the new un,m+i(' r) discussed in this section is used. Example: Let us illustrate the Gauss-Seidel method for the problem in the previous section. First compute t>i,i([0,1]) using equation (4-13): wlfi([0, l]) = max{[0,l] = [0,1] = 5.35. -3.46 5.35 ; [o, i] 1.44 4.80 } -3.46 5.35 The result of i>i,i([0,1]) is the same as the example shown in the previous section. Now the generated support -3.46 5.35 is put into the support set -AX(1,0). Therefore, the support set A1(1,0) contains three supports: "-4" 0" , and -3.46" 4 3 5.35 Since, -3.46" -4 ' 5.35 > 4 _ , the support -4 4 can be deleted from A 1 (1,0). Now by using .A1 (1,0) and equation (4-13), Uj^QljO]) can be computed as v1,1([l,0]) = max{[l,0] = [1,0] = 1.83 -3.46 5.35 ;[i,o] 1.83 5.26 1.83 5.26 Note that the value «i ) i([l ,0]) generated here is larger than the one generated in the example of the previous section. Since both t>i,i([0,1]) — u"i,o([0,1]) and i>i,i([l,0]) — Uiio([l,0]) are greater than el5 the second iteration is performed. The support set A\,\ is { 111 -3.46' 1.83' 5.35 5.26 }. Then, by equation (4-13), tJ1|2([0,l]) = max{[0,l] = [0,1] = 7.19 -1.74 7.19 ; [0,1] 3.19 6.50 -1.74 7.19 Include the support -1.74 7.19 with the support set Ai , i and then compute i>ii2([l, 0]): »1,2([l,0]) = max{[l,0] = [1,0] = 3.55. -1.74 7.19 ;[i,o] 3.55 7.00 3.55 7.00 Since both u1)2([0,1]) — ui,i([0,1]) and Si)2([l,0]) — ^ ^([1,0]) are greater than ej, the procedure should not be terminated. Follow the same procedure to perform the third and fourth iterations. The support set after the fourth iteration is A 1 > 4 = { ' 1.16 ' '6.35' 10.09 9.78 }. Now perform the fifth iteration: vll5([0,l]) = max{[0,l] = [0,1]' = 11.26 2.33 11.26 ; [0,1] 2.33 11.26 Add 2.33 11.26 into Ai,4 and then compute viis([l,0]): t>1>5([l,0]) = max{[l,0] = [1,0] = 7.48 2.33 11.26 ;[i,o] 7.48 10.90 7.26 10.57 7.48 10.90 112 Now both fi)5([0,l]) — ri(4([0,l]) and u1(5([l,0]) — ui)4([l,0]) are less than ei, the pro-cedure should be terminated. H Comparing the results of this example and the one in the previous section, we observe that the values of the selected states grow much faster in the Gauss-Seidel method, implying that this method may reduce the computational time. T H E O R E M 4.4 If Hv'0 > V'Q and vn,m is computed from the Gauss-Seidel method, then Vn < Vn,l < Vn,m < v'„ < Hv'n. The sequence {vn} converges to v* monotonically. Proof: The proof is similar to the proof of Lemma 4.2 and Theorem 4.3 and it follows by induction and the monotone property. H Although we cannot prove Gauss-Seidel method will accelerate the convergence rate, numerical examples have shown that this method can reduce computation time. 2. Action Elimination: To obtain results from (4-11), all available actions have to be used in the com-putations. If a large number of actions are available, this procedure may take a long 113 time to compute. However, if one or more actions can be identified as suboptimal for a chosen state, these actions need not be considered for this particular state. This idea is difficult to apply to a continuous state space since there are uncountably many states; however, the action elimination procedure can be applied in the iterative discretization procedure since only finitely many states are considered. Let Un = sup i r €n (^ u n-i ( 7 r ) ~ v ' n - i ( * ) ) and L n = i n f ^ n - < - i W ) -As shown in Theorem 1, Hvn-\ > vn-\. Therefore, Un and Ln are both nonnegative real numbers. The following proposition is similar to the one used by Puterman and Shin (1982). P R O P O S I T I O N 4.4: Let vUii be the value calculated from (4-11) and (4~12). Then <V*(TT) T H E O R E M 4.5 Suppose in the n-th iteration that TT • r* + 8 £ Pr(0|7r, o)t>»(T0r|o, 6)) + ^ • Un • 1 < + ^ZJ ' L n l where vn<i is calculated from (4~11) and (4~12), then the action a is non optimal in state TT. 114 Proof: The proof directly follows from the Proposition 4.4 and MacQueen's lemma (1967). n The action elimination procedure discussed above is only useful in the standard procedure presented in Section III. To apply the action elimination procedure directly to the Gauss-Seidel method, a result similar to Theorem 4.5 is needed. However, this task is difficult to achieve since computing the improvement on vnt\ is not easy. To overcome this difficulty, the procedure may be appropriately modified. Formulas (4-11) and (4-12) can be used to compute vn,\. Following Theorem 4.5, the suboptimal actions for n € n n can be identified. The Gauss-Seidel method is then applied to compute i>n,m+i, taking into account only those actions which are not suboptimal. As discussed in Section IV, to compute Un and Ln is not an easy task. Therefore, the action elimination method is useful only when there are a large number of actions. When the problem involves only very few actions, the action elimination method may not be able to reduce the computational time because the exact computation has to be performed. 115 3. Modified Policy Iteration: The modified policy iteration algorithm for discrete MDP has been discussed in Puterman and Shin(197B). The same idea can be directly applied to this setting. The procedure for computing Hvn can be viewed as the policy improvement proce-dure in the modified policy iteration algorithm. The (discrete phase can then be viewed as the value iteration in the modified policy iteration.. That is, when we derive the set of discrete states in IIn, we also choose an action corresponding to each state in n„. Then tJ^^+iC71-) instead of un)Tn_(.i(x) is computed each iteration in this discrete phase. Since only one action is chosen for each selected state, it can save computational effort when a large number of actions are available. Ifor «ach selected state, the action chosen can be the one used in Hv. Therefore, no extra effort is needed far selecting the actions. Example: Consider the example discussed in Section V and under the the section on the Gauss-Seidel method. If action 1 is taken for the state {0,1], and action 2 for the state [1,0], all the results are the same in both examples; however, when vn<m([0,1]) is computed only the first action has to be calculated, and when i;.n ) m([l, 0]) is computed only the second action has to be calculated. ]No maximization has to be performed. Since only one action is considered for each state, the computational time can be drastically reduced- II T H E O R E M 4.6 116 Let vn,m be computed from the modified policy iteration algorithm, then V n < U n , l < Vn,m < v'n < Hv'n and the sequence vn converges to v* monotonically. Proof: The proof follows from induction and the monotone property. I We remark that in modified policy iteration algorithm, only one action in each state 7r 6 n„ is considered. There is a trade-off between finding a potentially better solution and reducing computational time. When the policy for those states in f l n is close to the policy chosen by the standard iterative discretization procedure discussed in the previous section, the value function derived from the modified policy iteration might be close to or equal to the one obtained from the standard method which performs optimization at every iteration. In this case, the modified policy iteration algorithm can reduce the time in finding values for different actions. On the other hand, if the actions used for states in n n are suboptimal, then the result obtained from a modified iteration might be far away from the solution obtained from the standard method, and might need more iterations for convergence. The Gauss-Seidel method is unsuitable for parallel processing, whereas the action elimination method and modified policy iteration are suitable, making it advantageous to use the latter two methods. 117 V I I . T h e I t e r a t i v e D i s c r e t i z a t i o n P r o c e d u r e w i t h A p p r o x i m a t e P o l i c y I m p r o v e m e n t T h e iterative discretization procedure described in Sections V and V I used an exact policy improvement step (i.e., the operator H). As discussed in Chapter 3, the evaluation of H might be difficult when there are a large number of supports forming the value function. A natural question arises: can the approximation of the operator i f , H i n Section III, be used in the iterative discretization procedure? In this section, this problem and a solution will be presented. In Sections V and VI, the convergence of the iterative discretization procedures was demonstrated using the monotonicity property of the operator H. Unfortunately, the monotonicity property does not hold for the operator H. In fact, when H is substituted for H i n the algorithm, vn will not necessarily converge to v*. However, we will show that vn is e-optimal when n is sufficiently large. We will also determine error bounds of vn which provide a termination criterion. Now let us discuss the convergence of vn to an e-optimal neighbourhood. For simplicity, introduce the following notation which will be used in next theorem and proof. Let v'Q € V and v'0 < v*. Define an operator H with the property that H m + 1 V k > HmVk > Vk for all m > 1. We remark that the operations in the discrete phase discussed i n the two previous section satisfy the assumption of H. Define Hv'k = Vk+i and Hm+1Vk = H oHmVk. Moreover, let n(k) be the number of operations of H in the fc-th discrete phase and Hn^Vk = v'k. Note that, unlike the previous two sections, the only 118 limitation for choosing the initial value function is that it should not be larger than v*, i.e. v'0 < v*. Since v'0 < t>*, it is also true that vk < v* and v'k < v*. T H E O R E M 4.7: If v'Q is the chosen starting value function such that v'0 < v*, then an e-optimal value function can be obtained by applying the iterative discretization procedure with approximation policy improvement. Proof: Let v* be the optimal value function. \\Hv'k-v^<\\Hv'k-Hv'k\\ + \\Hv'k-v*\\ = li + fi-\\Hn^vk-v*\\ <VL + f3-\\vk-v'\\ = H + fi-\\Hv'k_1-v*\\. where the third inequality arises since v* > Hn^vk > vk. Therefore, by recursion, i=0 When k approaches infinity, then the right hand side approaches fi/(l — fi). If n is chosen to be less than (1 — fi) • t, then an e-optimal solution is obtained for sufficiently large k. I Theorem 4.7 shows that the iterative discretization procedure with approximate policy improvement obtains an e-optimal value function. However, this theorem does 119 not provide any information as to when an e-optimal solution is obtained. An easy way to determine a stopping criterion is to use the values of v'k and Hv'k. We can view v'k as the current result and Hv'k as the updated value. These two values can be used to decide whether or not an e-optimal value function has been obtained. In this case, the result shown in Section III can be used. Therefore, if \\Hv'k — v'k\\ < ^ " ^ H - ^ * where fik = \\Hv'k — v'k\\, then \\Hv'k — v*\\ < e. Similarly, Proposition 4.2 can also be used; that is, if ^'^UkfJ^+ftk < c, then Hv'k + is an e-optimal value function where Lk = inf„ en{(Hv' k)(Tr) - v'k(Tr)}, Uk = sup„en{(Hv'k)(Tr) - v'k(Tr)}, and fik = sup„en{(Hvk)(*)-(Hvk)(*)}. Similarly, we can apply approximate policy improvement to the accelerated con-vergence methods discussed in Section VI. As might be expected, when a large number of supports are needed to form a value function, these methods might converge faster than the methods discussed in Section VI. The method discussed in this section together with the accelerated convergence techniques discussed in Section VI are especially well suited to complicated problems which require large numbers of supports to form a value function. Numerical comparisons for these methods will be provided in the next section. VIII. Numerical Examples In this section, several sets of test data are used to compare the efficiency of the algorithms discussed in this chapter. The basis of comparison is CPU time. All algo-rithms were implemented as Fortran 77 programs which were run on the Amdahl 5860 120 with FPU at the University of British Columbia. The algorithm for computing Hv or its approximate value, Hv, is the linear support algorithm which was developed in Chapter 3. Unless otherwise specified, ft,(t/*~_^)+/i* is used as the termination criterion. The "vertices of relaxed regions are used to compute the approximate termination criterion as discussed in Section IV. If the approximate termination criterion is satisfied, the linear programming method is used to compute the exact termination criterion. The IMSL routines are used as the linear programming solver. The test data can be divided into two groups. The first group contains the well known testing data discussed in Sondik (1978). The second group contains several sets of randomly generated data for the problems with three, four, and five system states. Unless otherwise specified, the discount factor 3 is 0.9. 1. Sondik's Testing Problem: This test problem is the only testing data shown in the literature (Platzman 1981, White 1980, White and Scherer 1986). We will study this set of data first. Since this is a two system states problem, the values of Uk, and Lk can be found exactly by the vertices of the relaxed regions. Linear programming is not required to find these values for this problem. The CPU times, the number of iterations, and error bounds for algorithms using the exact policy improvement discussed in this chapter are recorded and shown in Table 4.1. 121 Method C P U Time (Seconds) Number of Iterations Number of Supports Error Bound Action Elimination 0.035 4 3 0.000275 Gauss-Seidel Method 0.040 4 3 0.000137 Modified Policy Iteration 0.035 4 3 0.000275 Standard IDP 0.041 4 3 0.000283 Successive Approximation 0.027 7 3 0.000730 e for Stopping Criterion: 0.01 ci(n) for Stopping Discrete Phase: 0.001 Tolerable E r r or for Policy Improvement: 0.0 T a b l e 4.1: R e s u l t s o f Sondik's E x a m p l e ( w i t h E x a c t P o l i c y I m p r o v e m e n t ) From Table 4.1, all of these algorithms converge really quickly. Among these al-gorithms, the standard successive approximation method requires the least C P U time although the error bound is slightly higher than other methods. This is because it only takes seven iterations to get a convergence result for the standard successive approxi-mation. Although other algorithms only take four iterations to converge, they perform several iterations i n a discrete phase. As a result, for this simple problem, the standard successive approximation method performs better than other methods. T h e result for this problem by using the algorithms with tolerable error for the policy improvement is shown in Table 4.2. This result is similar to the result in Table 4.1 because this testing problem has a finitely transient optimal policy. If \\Hv — v\\ is used to calculate the error bound, the result is shown in Table 4.3. T h i s result is similar to the results shown in two previous tables except for the 122 Method CPU Time (Seconds) Number of Iterations Number of Supports Error Bound Action Elimination 0.035 4 3 0.000275 Gauss-Seidel Method 0.040 4 3 0.000137 Modified Policy Iteration 0.035 4 3 0.000275 Standard IDP 0.041 4 3 0.000283 Successive Approximation 0.024 7 3 0.000944 c for Stopping Criterion: 0.01 ci(n) for Stopping Discrete Phase: 0.001 Tolerable Error for Policy Improvement: 0.0005 Table 4.2: Results of Sondik's Example (with Approximate Policy Improvement) Method CPU Time (Seconds) Number of Iterations Number of Supports Error Bound Action Elimination 0.036 4 3 0.005768 Gauss-Seidel Method 0.041 4 3 0.004944 Modified Policy Iteration 0.036 4 3 0.005356 Standard IDP 0.052 5 3 0.006729 Successive Approximation 0.307 71 3 0.009476 e for Stopping Criterion: 0.01 Ci(n) for Stopping Discrete Phase: 0.001 Tolerable Error for Policy Improvement: 0.0 Table 4.3: Results of Sondik's Example (Use \\Hv — v\\ to Compute Error Bound) standard successive approximation method. The standard successive approximation method requires seventy-one iterations to get a convergence result. It also requires 6 to 8 times more CPU time to complete the computation. This is due to the fact that 123 there are many iterations in a discrete phase of any IDP method and each iteration in a discrete phase does a similar job as an iteration in the standard successive approximation method. Therefore, unlike the standard successive approximation, the IDP methods are not sensitive to the methods of computing error bound. Since this problem has a finitely transient optimal policy and only three linear supports are needed to form the optimal value function, the algorithms with tolerable error for the policy improvement obtain the exactly same results as presented in Table 4.3 if \\Hv — v\\ is used to compute the error bound. 2. Randomly Generated Data: This group consists five sets of data range from three to six system states. These data are listed in the Appendix 2. We did not use the data in Appendix 1 because the calculations either converged too fast or did not converge within a reasonable time limit. The first set of randomly generated data is a three system states, three actions, and three signal problem. If the algorithms with the exact policy improvement are used, there are too many linear supports generated (more than fifty supports), which is more than the design of the code can handle. However, if the algorithms with approximate policy improvement are used, the problem of space can be resolved. This is one of the advantages of using the algorithms with approximate policy improvement. The CPU time, number of iterations, number of supports and error bound for each method are 124 shown in Table 4.4. In order to compare the methods in more detail, we also show the CPU time less the time spent in using linear programming to calculate Un and Ln. CPU Time Number of Number of Error Method Total No LP Iterations Supports Bound Action Elimination 4.045 3.676 8 13 0.059359 Gauss-Seidel Method 4.093 3.718 10 13 0.058961 Modified Policy Iteration 3.922 3.561 10 13 0.058961 Standard IDP 3.303 2.940 8 13 0.062985 Successive Approximation 11.555 11.271 18 12 0.083387 e for Stopping Criterion: 0.1 ci(rc) for Stopping Discrete Phase: 0.01 Tolerable Error for Policy Improvement: 0.005 Table 4.4: Results of Data Set 1 (with Approximate Policy Improvement) From this example, we can see that all IDP methods require much less CPU time and have better error bounds than the standard successive approximation method. The IDP methods just need about one third of the CPU time of the standard successive approximation method. Among the IDP methods, standard IDP requires slightly less CPU time but has slightly larger error bound. Although both the action elimination method and standard IDP require eight iterations to get the solution and the action elimination method has fewer iterations in the discrete phases (not shown in Table 4.4), the action elimination method requires more time than standard IDP. This might be due to the fact that not many actions are eliminated and special effort is required to check which actions can be eliminated. 125 In order to find the effect of the action elimination method, let us consider the second data set. This is a three system states, six actions, and three signals problem. The results of algorithms with exact policy improvement are shown in Table 4.5. CPU Time Number of Number of Error Method Total No LP Iterations Supports Bound Action Elimination 0.669 0.484 5 11 0.030075 Gauss-Seidel Method 1.090 0.905 5 11 0.020606 Modified Policy Iteration 0.816 0.631 5 11 0.042023 Standard IDP 0.647 0.537 4 9 0.028022 Successive Approximation 3.356 3.132 8 11 0.034083 e for Stopping Criterion: 0.1 ei(n) for Stopping Discrete Phase: 0.01 Table 4.5: Results of Data Set 2 (with Exact Policy Improvement) From Table 4.5, we can see that the action elimination method requires less CPU time than the Gauss-Seidel method and the modified policy iteration although all three methods require five iterations to obtain the solutions. This result is as we expected since there are relatively large number of actions available and the action elimination method can omit the suboptimal actions. The modified policy iteration method only uses one action for each selected state in a discrete phase; however, in this case, the suboptimal actions might be chosen for some selected states. As a result, the modified policy iteration method requires more iterations in one discrete phase (not shown in Table 4.5). The standard IDP method requires slightly less CPU time to obtain a 126 solution than the action elimination method because it only takes four iterations to obtain its solution although it has more iterations in the discrete phases. As expected, the standard successive approximation takes the longest time to obtain its solution. The effect on the number of actions can be detected more easily by the algorithms with approximate policy improvement since there are fewer supports in the solutions. The time spent in computing policy improvement will be less if there are fewer supports in the support set. The results of using data set 2 are shown in Table 4.6. CPU Time Number of Number of Error Method Total No LP Iterations Supports Bound Action Elimination 0.466 0.391 5 7 0.045400 Gauss-Seidel Method 0.692 0.618 4 7 0.080924 Modified Policy Iteration 0.448 0.374 5 7 0.050268 Standard IDP 0.591 0.519 4 7 0.017349 Successive Approximation 0.971 0.903 8 7 0.057928 e for Stopping Criterion: 0.1 Cj(n) for Stopping Discrete Phase: 0.01 Tolerable Error for Policy Improvement: 0.005 Table 4.6: Results of Data Set 2 (with Approximate Policy Improvement) From Table 4.6, we can see that although requiring more iterations than the Gauss-Seidel Method and the standard IDP method, the modified policy iteration method and the action elimination method require less CPU time to obtain their solutions. This result is to be expected since the time required for the action elimination method or 127 the modified policy method does not relate directly to the number of available actions. The modified policy iteration method needs slightly less CPU time than the action elimination method because only one action is chosen for each state and no extra effort is required for the selection of actions. The next set of randomly generated data, data set 3, is a four system states, four actions, and four signals problem. Table 4.7 contains the results of algorithms with exact policy improvement steps. Method CPU Time Number of Iterations Number of Supports Error Bound Total No LP Action Elimination 16.537 16.044 4 14 0.022665 Gauss-Seidel Method 16.647 16.206 4 13 0.022324 Modified Policy Iteration 16.501 16.060 4 13 0.022324 Standard IDP 16.683 16.201 4 14 0.006259 Successive Approximation > 100 Seconds t for Stopping Criterion: 0.1 C i ( n ) for Stopping Discrete Phase: 0.01 Table 4.7: Results of Data Set 3 (with Exact Policy Improvement) From Table 4.7, we find that the CPU time required for all IDP methods are about the same. However, the standard successive approximation method is unable to yield a solution within 100 seconds. In contrast to the results shown in Table 4.7, the standard successive approxima-128 tion method with approximate policy improvement steps can obtain its solution within twelve seconds. (See Table 4.8.) Among the IDP methods shown in Table 4.8, the CPU time required varies significantly. The action elimination method can obtain its solution in two seconds. However, the Gauss-Seidel method needs 6.492 seconds to obtain its solution. This is because the action elimination method eliminates most of the suboptimal actions in every iteration and only a few actions are considered. CPU Time Number of Number of Error Method Total No LP Iterations Supports Bound Action Elimination 1.926 1.756 4 9 0.064620 Gauss-Seidel Method 6.492 6.153 6 9 0.057871 Modified Policy Iteration 6.323 5.988 6 9 0.057871 Standard IDP 3.425 3.256 4 9 0.051241 Successive Approximation 11.547 11.222 8 9 0.091206 c for Stopping Criterion: 0.1 ei(n) for Stopping Discrete Phase: 0.01 Tolerable Error for Policy Improvement: 0.005 Table 4.8: Results of Data Set 3 (with Approximate Policy Improvement) For the last two data sets, data set 4 and data set 5, the algorithms with the exact policy improvement steps generate more supports than the fifty supports that the design of the codes can accommodate. Therefore, no results are generated. Moreover, if an error bound of 0.1 is used in the previous three data sets, then the algorithms with approximate policy improvement will also generate more than fifty supports. Therefore, in these two data sets, only the algorithms with approximate policy improvement are 129 considered, and the error bound is set to be 0.5. Now let us consider the fourth data set which is a five system states, five actions, and five signals problem. The results can be obtained in relatively short time. From Table 4.9, we find that all IDP methods require much less CPU times than the stan-dard successive approximation method. Among the IDP methods, the modified policy iteration method requires less CPU time than other methods in this data set. CPU Time Number of Number of Error Method Total No LP Iterations Supports Bound Action Elimination 2.296 2.142 5 8 0.396919 Gauss-Seidel Method 1.991 1.807 4 9 0.452781 Modified Policy Iteration 1.882 1.732 4 8 0.362355 Standard IDP 2.322 2.171 4 8 0.373904 Successive Approximation 9.187 8.850 10 8 0.469870 € for Stopping Criterion: 0.5 ci(n) for Stopping Discrete Phase: 0.05 Tolerable Error for Policy Improvement: 0.025 Table 4.9: Results of Data Set 4 (with Approximate Policy Improvement) The last set of randomly generated data is a six system states, four actions, and six signals problem. The results for the algorithms with approximate policy improvement steps are shown in Table 4.10. From this table, we find that the CPU time requirements for all of the algorithms are about the same. The modified policy iteration requires the least CPU time among all methods. As we discussed before, there are several iterations 130 of computations in each discrete phase of IDP methods. In this example, the standard successive approximation method, which does not involve any discrete phase, only takes one or two more iterations of policy improvement than the IDP methods. However, the standard successive approximation method requires more CPU time than any of the IDP methods, which implies that each iteration of policy improvement might require more CPU time to perform than the computation of several iterations in a discrete phase. CPU Time Number of Number of Error Method Total No LP Iterations Supports Bound Action Elimination 26.919 26.205 4 13 0.347050 Gauss-Seidel Method 27.250 26.532 4 13 0.346822 Modified Policy Iteration 24.497 23.792 5 13 0.319670 Standard IDP 26.892 26.189 5 13 0.279474 Successive Approximation 27.910 27.387 6 12 0.330867 e for Stopping Criterion: 0.5 ei(n) for Stopping Discrete Phase: 0.05 Tolerable Error for Policy Improvement: 0.025 Table 4.10: Results of Data Set 5 (with Approximate Policy Improvement) The above examples clearly demonstrate that there are benefits to be reaped in using the algorithms with the approximate policy improvement steps. For the same accuracy, the algorithms with the approximate policy improvement steps generate fewer linear supports than their counterparts with the exact policy improvement. As a result, less computer memory and CPU time are needed for the algorithms with approximate 131 policy improvement. The algorithms with approximate policy improvement steps might also produce a more stable solution since there are fewer chances that two linear supports are similar. From the above examples, we have also found that the IDP methods require less CPU time to obtain the solution than the standard successive approximation method. These examples also suggest that the fewer iterations of policy improvement required is one of the major reasons that IDP methods are more efficient. Of course, we cannot conclude that IDP methods are more efficient than the standard successive approximation just from these limited examples. However, these limited examples have shown that a large portion of CPU time might be saved by using IDP methods. What is the best method among IDP algorithms? This is difficult to answer from these limited examples. We have found from these examples that no single method always performs better than other methods. This is because all performance results are data dependent. Only a rule of thumb might be formulated: when there are a large number of available actions, the action elimination method and the modified policy iteration method are recommended since these two methods only consider a subset of available actions. 132 C H A P T E R 5 P A R T I A L L Y O B S E R V A B L E M A R K O V D E C I S I O N P R O C E S S E S W I T H C O N T I N U O U S S I G N A L D I S T R I B U T I O N S In the last few chapters, partially observable Markov processes with finite discrete signal space were studied. The assumption of finite discrete signal space is restrictive since in many contexts it is more natural to model the signal space as continuous instead of finite. For example, in a machine replacement problem, the signal might be the temperature of this machine. This temperature could be any value within a certain range. As another example, consider blood pressure, which is a good indicator of certain diseases and which is frequently modeled with a continuous distribution. Moreover, if there are a large number of discrete signals, it is sometimes easier to model the problem as a continuous signal space. Therefore, it is desirable to extend the algorithms discussed in the last two chapters to a more general signal space. There has been some research on P O M D P with a general signal space. Whittle (1982) discussed some of the applications of P O M D P to statistical inference and sequen-tial design. Nir (1986) discussed control problems with two system states. However, these researchers relied on the particular structure of the problems studied. In this chapter, a more general approach to the problem will be discussed. The plan of this chapter is as follows: assumptions, notation, and formulation of P O M D P with a continuous signal space are presented in Section I; a method to convert a P O M D P problem with uniformly distributed signal processes to a P O M D P with 133 finite signals is presented in Section II; and applications of the algorithms developed in Chapter 3 and 4 to solve the POMDP problems with continuous signals (under certain assumptions hold) are discussed in Section III. I. Assumptions, Notation, and Formulation The basic assumptions of this chapter are the same as in Chapter 2. The only difference is the nature of the signal processes. In this chapter, we assume that the signal distributions have density functions, whereas in Chapter 2 we assume that they are discrete. The parameters of these probability density functions depend on the system state as well as the action taken in the previous decision epoch. More precisely, for each system state i € 5, each action d 6 D, and each time t = 0,1,2,..., there is a probability density function fft(-) on the set of signals. For the infinite horizon problem, we assume that the probability density function is time invariant and the dependency on t is suppressed from the notation. Let TT € II be defined as in Chapter 2. Given that the current distribution on the state space is TT and action d is used, the conditional density function on the set of signal is f(-\n,d). Then this conditional density function can then be computed as k=l j=l Or, in matrix form, f(0\TT,d) = TT-Pd.Qde.l, (5-1) 134 where Qf, is a diagonal matrix with ff(8) as its diagonal elements and 1 is an N -dimensional column vector with all elements being 1. Analogous to the definition of T(ir,d,8), define T(w,d,8) as the probability distribution of the system state at the next time epoch, given that the probability distribution of the current system state is 7T, the action applied is d, and the signal obtained in the next time epoch is 8; i.e., fi(TT,d,8) = Pr(Xt+1 = i\n,Yt = d,Zt+1 = 8), and f(*,d,8) = [fI(T:,d,8)). Then, by Bayes' rule, fi(n,d,8) = Pv(Xt+1 = i\n,Yt = d,Zt+1 = 9) Or, in matrix form, (5-2) rude) - *-pd-Qi *-pd-Qi , x If f{8J7r, d) = 0, then T(ir, d, 8) can be arbitrary and will not affect the following analysis. As in Chapter 2, let vt and vt+i be bounded convex continuous value functions at time t and t + 1, respectively. Then, for 7r € II, rd N vt(*) = maxE{rd{Xt) + fi • « t + 1 ( f ( M , 0 ) ) b r , « * } = max dec {I>.r'(i) + /9. / />Kd)-^ +i(T(7r,d ,tf))^}. (5-4) ,=i ^ee 135 If vt+i is piecewise linear function on the domain II, then the formula (5-3) can be rewritten as where a**'4'9) is support of vt+1 and f(?r,dt6) • a«l*>d>°) > f{ix,d,6) • ak for all ak which are supports of vt+i. Notice that the formula (5-5) is similar to the corresponding formula for the discrete signal case. The only difference is that the summation is replaced by an integral. White and Harrington (1980) showed that if vt+i is a convex function, so is vt. However, unlike the finite signal case, vt need not be piecewise linear even though vt+\ is piecewise linear. This property makes continuous signal problems more difficult than discrete signal problems. A uniform distribution is commonly used to model a process without much available information. Nir (1986) studied a two-state POMDP with uniformly distributed costs as its signals. We discuss tHs distribution separately from others because this problem can be reformulated as a discrete signal space problem. The algorithms developed for the discrete signal space can then be applied to solve this type of problem and are more efficient than the method "which will discussed in later sections. (5-5) II. Uniformly Distributed Signal Processes 136 Let 0 be the signal space. Also, at decision epoch t, let Qdt be the signal space for the process given that the system state is i and that the decision taken at previous decision epoch is d. The probability density function for the signal in this signal space is uniformly distributed. A trivial case occurs if the state can be deduced for sure from the observed signal; i.e., 0fj D 0 f j = 0 for all pairs of states i and j. This can clearly be formulated as a completely observable MDP. For example, suppose in a two state control problem that a machine in the good state has an operating cost in the range of 100 to 250 dollars per day; however, if the machine is in the bad state, the operating cost is in the range of 300 to 450 dollars per day. So if the current operating cost is 350 dollars, it is obvious that the system is in the bad state. Of course, the above technique fails if the supports of the signal distributions over-lap. However, if the signal distributions are uniform, then the problem can be reformu-lated as a POMDP with finite signal space. Let ©£, = 0\0 t d ) t and ©?• = { © ? , , © ? , } . Then ©£,- is a partition of the signal space 0. Let ©< = { © < , i , ©t,2, • • • > ©«,*} be the product partition of Of,- for all system states i and actions d. Since there are only a finite number, N, of system states, there are only a finite number of elements in 0 t . The key to converting a uniformly distributed signal problem to a discrete signal problem is that the only information provided by the signal is the cell of the partition in which it occurs. Each element in 0< can be viewed as a signal in a finite signal space problem. 137 For example, consider a machine with an operating cost that depends on its condition. When the machine is in excellent condition, the operating cost is uniformly distributed between $100 and $250 per day. If the machine is in fair condition, the operating cost is uniformly distributed between $200 and $350 per day. And if the machine is in bad condition, the operating cost is uniformly distributed between $300 and $450 per day. Thus the cost can be divided into five regions - that is, $100 to $200, $200 to $250, $250 to $300, $300 to $350, and $350 to $450. Each region can be viewed as a distinct signal. Since the observations are uniformly distributed in their signal space, the probabilities of each of the newly defined signals can be calculated by the integral of the density function on each of the regions in ©<. For example, there is 66.67% chance that the operating cost is between $100 and $200, and 33.33% of chance the operating cost is between $200 and $250 if the machine is in excellent condition. Similarly, the probability for each cost range can be computed for the fair and the bad machine conditions. Therefore, this problem is converted to a discrete signal space problem. The key reason that a POMDP with uniformly distributed signal processes can be modeled as a POMDP with a finite number of signals is because T(7T,d, •) is constant over each element of © ( . Let us show this result in the following lemmas. L E M M A 5.1: For every system state i G S, ff(-) is constant on every element of Qt. Proof: Let 8i and 62 be any two arbitrary signals in any 0 t ) J 6 ©t- By the method discussed above for generating the elements in 0 ( , either Qtj D Qd{ = 0 or Qt,j Q ©t, t 138 for all system states i. If ©,,_, n 0 j , = 0, then ff(0i) = ff(02) = 0. If 6 t ) i C Gdti, then, by the uniform assumption, ff{9\) = ff(02)• H Although this lemma is trivial, it is the primary reason why POMDP with uniformly distributed signal processes can be reformulated as POMDP with a finite number of signals. Moreover, in the proof of this lemma, we only require that Qtj is a subset of 0f ,• or ©tj D Qdj = 0 for all system states i and that the signal processes are uniformly distributed on Ot,j- Hence, we can extend this result to more general cases where the signal processes are step functions. We will discuss this issue later in this section. Since ff(-) is constant on any element of 0 t , then Qf is the same for all 6 in any element of O t . Moreover, f(8\n,d) = n • Pd • Qf, • 1, therefore, /(-|7r,<i) is constant on every element in 0 t . Now we can show that T(7r, d, •) is constant on each element of 0,. L E M M A 5.2: T(TT,d, •) is constant on each element of 0<. Proof: Let Qt,j be an arbitrary element of 0« and B\, 82 be any two arbitrary signals in €>,,,-. Since Qd0i = Qd3 and M M = / > 2 | M ) , by (5-3), T ( M , * i ) = ?{*,d,62). The result follows. H Now let us show that if the signal processes are uniformly distributed, then Equation (5-4) can be rewritten as Equation (2-4). Let us consider the integration part of 139 Equation (5-4) first. Let 0, be an arbitrary signal in O t i l . By Lemma 5.2, / f(8\*,d)-vt+l(t(v,d,8))d0 Jeee,,i =vt+l{f{rr,d,8i))' I hO\*,d) d8 Jeeetii =vt+1(f(ir,d,8i)) • Pr(0 6 6tl,-|M) = Pr(0 e © * > , < * ) • vt+1(f{7T,d,8i)). By Lemma 5.2, each element in 0< can be viewed as a signal. Define T ( 7 r , d , © < t ) = T(TT, d, 8i) for 8{ 6 0t,i- Since there are only a finite number of elements in 0(, Equation (5-4) can be rewritten as N r vt(ir) = m a x {V>, • rd(i) + 8- f(0\*,d) • vt+1(f(7r,d,8))d8} d e D ~[ Joee, N r = max{^i-rd(i) + 8- / f(e\rr,d)-vt+1(f(iT,d,9))d8} N = max{Y,*i-rd(i) + 8- £ Pr(0M|7r,d) • v t + a (T(7r ,<f , 0M))}, « = 1 e.,,ee« where P r ^ ^ j l T r , d) = / f l e e ( . f(8\ir, d)d8. The last equality is the same as Equation (2-4). Therefore, Equation (5-4) can be rewritten as (2-4); that is, P O M D P with uniformly distributed signal processes can be reformulated as P O M D P with a finite number of signals. As discussed above, the key reason that P O M D P with uniformly distributed signal processes can be formulated as P O M D P with finite number of signals is that the density function ff(-) is constant on every element of 0 t for every system state i € S and action d € D. Therefore, if other P O M D P problems have this property, by following the same 140 procedures we can show that they can be formulated as finite signal P O M D P problems. One example is P O M D P problems where signal processes are step functions. Let us consider a P O M D P problem which signal processes are step function. As-sume that 0* • is a partition of signal space 0t, and the density function is constant on O f i for every system state t 6 5 and action d € D. A product partition of © £ , for all system state i and action d can be performed and denoted as Qt = {©t,i, Qt,2, • • •, Therefore, the density function //(•) is constant in each element in Qt for each system state i and action d. B y following the same procedures as discussed above, it can be shown that T(ir,d, •) is also constant on each element in 0*. Therefore, P O M D P whose signal processes are step functions can also be formulated as P O M D P with finite signals. In practice, there might not be many applications whose signal processes are step functions. However, step functions can naturally be used to approximate any probability distribution. After the signal processes have been approximated by step functions, the problem can then be solved by the methods discussed in Chapters 3 and 4. Using this approach, the general problem with continuous signal space can be solved. Now let us discuss how to use step functions to approximate the distributions of the signal processes. For simplification, we assume that the signal space is time-invariant, and the time subscript, i, is suppressed. For each action d € D, and system state i G S, the signal space 0 is divided into a finite number of connected subsets, Qf = {Qdk}, such that LI*©,,* = 0 and Qfk D Qft = 0 for all pair of k and /. In order to have a good approximation, we usually require the density function to be continuous in each of the 141 subsets. The product partition of O f for all system states i G S and actions d £ D can be performed and denoted as 0 = { 0 i , 02,..., 0/}. If the conditional density function f(6\ir,d) is defined as in the first section, then Pr(0 € 0*|7r,d) = / f(8\ir,d)dd VTT G II, d 6 D, and 0* G 0. Since is continuous in each subset in 0 for all i and d, f(6\n,d) is also continuous in each set i n 0. Since each set i n 0 can be viewed as a signal and the conditional density functions for these signals are defined, we now have a finite signal P O M D P problem. The accuracy of the approximation depends on how the signal space is partitioned. Let us now develop an error bound for this approximation. Let 7r be an arbitrary state i n II, v a given value function, Hv the exact value function, and Hv the value function obtained from the approximation. If d G D is the action used for Hv at 7r and HV(TT) > HV(TT), then HV(TT) - HV(TT) <7T-rd + /3' I f(6|TT, d) • v(f (TT, d, 9))d6 Jeee - ( 7 r . r d + /3- £ P r C O t K ^ . u C T C T r , ^ ^ ) ) ) e»€© = /?• / f(8\n,d).v(f(n,d,8))d6-P- T Pr(0 t|7r,cf) • v(T(n,d,ek)) J°*« ekee = / ? • £ / f(8\*,d).v(f(ir,d,8))d8 -P- Pr ( e f c | M ) - t;(T ( M , e A ) ) ©t€© 142 f^t- E ( f f(0\7T,d)-Mkde-Pr(Qk\7T,d)-Lk) = B-Y, Pr(Qk\rr,d)-(Mk-Lk) e*€e <8-{Mk-lk) where Mk = m a x , ^ v(f (TT,d,0)), Lk = min^Q^ v(f (7r,d,0)), M f c = max f r e n V(TX), and Lfc = min^n v(ir). A similar approach can be used if HV(TT) is less than HV(TT). Note that Mk — Lk can be made arbitrarily small by dividing the signal space into very small regions, and in this case, the value function obtained will be very close to the exact solution. III. Methods for Solving P O M D P with Continuous Signal Space In the previous section, we discussed how to solve POMDP problems with continu-ous signal spaces by using step functions to approximate the signal processes. However, in order to obtain a value function which is close to the optimal value function, we might need to construct a step function with a large number of steps; that is, we might need to solve a finite signal POMDP problem with a large number of signals. To solve a problem with a large number of signals is not easy. In this section, we will introduce a method which can solve this problem without using step functions to approximate its signal processes if certain assumptions hold. As discussed in the first section, even if v is a piecewise linear function, Hv need not be piecewise linear. Moreover, integral signs have replaced summation signs in the 143 formulas (5-4) and (5-5). Therefore, unlike P O M D P with a finite number of signals, the construction of linear supports for any given state IT and action d is not trivial or a by-product of the procedure for finding the value of Hv(ir). The algorithms discussed in Chapters 3 and 4 are based on the linear supports for the given states. As a result, they cannot be directly applied to problems with continuous signal space. However, if certain assumptions which wil l be discussed later are satisfied, we can find linear supports for given states and actions. Then methods similar to those discussed in the previous two chapters can be applied to problems with a continuous signal space. We now focus on how to calculate the value and the support of Hv(ir) for arbitrary 7r € n if t; is a piecewise linear function (i.e., the evaluation of (5-5)). Let A be the set of all linear supports of v. If d is a support in A and cf is an action in the action space, define Q„td,a as Q*,d,& = {0 € 0 : IT • Pd Qja > * • Pd • Qdea V a € A}. (5-6) Apply (5-6), then (5-5) can be rewritten as Hv(ir) = maxjV Tr; • rd(i) + fi- TT • Pd • Qde • a ' M 9 ) d 6 } d e D ~{ Jeze N r = max{][> • r'(i) + fi • £ / TT • Pd • Qd • ad9) t=l ogA j6^Q'.d,c, N r = max{5> • rd(i) + fi • £ TT • Pd • / Qd • add). (5 - 7) In order to simplify the calculation of J ^ € @ _ w • Pd • Q<j • <*d9, let us assume that for any given 7r € n , d € D, and a € A, there are only a finite number of connected sets in Qn,d,a- Moreover, we also assume that the boundary of ©n,d,& has measure zero. 144 Since Qg is a diagonal matrix with ff(8) as its i-th diagonal element and a is an JV-dimensional vector with a, as the i-th element, we define f IeeewdJd(e)-&id8\ Therefore, (5-7) can be rewritten as N ( 5 - 8 ) HV(TT) = m a x { £ ^ • rd(i) + 0 • £ TT • P< • / Qj • d<f0} = max{» • [rd + 0 • £ Pd • C<fi]}. (5-9) ©w,4,a We remark that rd + 8- Y2ew d & Pd' C*,a * s a N-dimensional column vector and a linear support oi Hv at TT. Although the assumptions about Q*,d,a are strong, these assumptions are true for many commonly used distributions. More importantly, 7r, Pd, and a are known before ®*,d,a is computed. Therefore, in many situations, it is possible to verify whether the given signal processes will have a finite number of connected sets in ©«-,<*,a before the actual calculation is started. For example, if the signal processes are exponential distributions and there are only two system states, we can show that the assumptions hold. Now let us use the exponential distribution as a example. Example Let us illustrate the method for computing linear supports for POMDP with the 145 0.8 0.2' i - 4 ' 0.5 0.5 r = 4 0.5 0.5' 2 0' 0.4 0.6 r = 3 following example: P 1 = P 2 = The signal processes have the following density functions: for 8 > 0, /21(tf) = 10-e- l o e , f?(8) = 3 • e-3'*, /22(0) = 2-e-2-<, and f{(8) = .^(tf) = /2(tf) = /2(tf) = 0 for 8 < 0. Let us also assume that 0 is 0.9, Tt = [0,1], and A = {a1,a2} where a 1 = [-4,4]r and a 2 = [0,3]r. We first compute 0^,1,a»; i.e., we have to find 8 € 0 such that [0 1]. 0.8 0.2 0.5 0.5 „-6 0 0 10 -e-106 -4 4 > [0 !]• '0.8 0.2' 'e-e 0 "o" 0.5 0.5 0 10 -e-loe 3 After solving this inequality, we have 0 w , i , Q i = {0 < tf < 0.1}. Hence 0^,1,o* is 0.1 < tf < 0 0 . Similarly, we can find 0 f f >2 , o i - However, 0^ ,2,0* = {tf < -1.77} which is outside the domain of density functions, so 0 T ) 2 ) O r i = 0. and Qn<2,a* = {tf > 0}. Now apply 0 T,i, Q>, 0 w , i , a» , 0ir,2,a», and 0^,2,a» in (5-9), to obtain Hv([0 l]) = max{ [0 !]•{[• + 0.9 0.8 0.2 0.5 0.5 ( / 0 0 1 0 -4 • e-° dB . / 0 0 1 0 4-10-e- 1 0 ( ? d6. 146 [0 1] {[' = max{[0 1] 5.46 + 0.9 -3.62 5.46 )) 0.5 0.5 0.4 0.6 ; [o i ] J ~ 3 - 2 . e - " d » 1.35 4.62 Therefore, Hv at [0,1] is 5.46 and the linear support corresponding to this state is [-3.62,5.46]T. • Now let us look at the algorithms discussed in Chapter 3. Signals are intrinsic to the partition method and Sondik's one-pass algorithm. In Monahan's method, signals are only used to generate candidate linear supports. However, the signals do not for-mally appear in the relaxed region algorithm and linear support algorithm. These two algorithms only assume that, for any given state, it is possible to find a linear support corresponding to this state. The assumption of a finite number of signals guarantees an easy method to compute linear supports. If there is a method to generate the linear support for any given state, then the assumption of a finite number of signals is not required for the relaxed region algorithm or linear support algorithm. Provided that a support of a value function can be constructed at any given state, either the relaxed region algorithm or the linear support algorithm can be used to solve a POMDP which has continuous signals. However, in practice, only the linear support algorithm can be used because, for POMDP with a continuous signal space, Hv need not be a piecewise linear function even v is. If the relaxed region algorithm is used, the termination criterion might never be satisfied. As was discussed in Chapter 3, the 147 linear support algorithm can be used to approximate Hv, and can be used to find an approximate solution Hv which is a piecewise linear function. Similarly, the algorithms discussed in Chapter 4 only use signals in the computa-tion of the linear supports. However, since it might not be possible to calculate the exact value of Hv for POMDP with continuous signals as discussed above, only those algorithms which permit approximations of Hv can be applied; that is, the methods presented in Sections III and VII in Chapter 4. The assumptions made in this section are only to guarantee that a linear support can be found for any given state. If there is any other method which can produce the linear support for any given state in II, then the linear support algorithm and the methods shown in Sections III and VII in Chapter 4 can still be applied. We can expect that many POMDP problems having a continuous signal space either satisfy the assumptions in this section or a method exists for computing the linear support for any given state and therefore can be solved using linear support methods. Problems for which linear supports cannot be calculated for all states in II can still be solved by using step functions to approximate the signal processes as discussed in the previous section, although this method might not be efficient. 148 BIBLIOGRAPHY Albright, S. 1979. Structural Results for Partially Observable Markov Decision Pro-cesses. Operations Research 27, 1041-1053. Astrom, K. 1965. Optimal Control of Markov Processed with Incomplete State Infor-mation. Journal of Mathematical Analysis and Applications 10, 174-205. Astrom, K. 1969. Optimal Control of Markov Processes with Incomplete State infor-mation, II. The Convexity of the Loss Function. Journal of Mathematical Analysis and Applications 26, 403-406. Bellman, R. 1958. Dynamic Programming. Princeton University Press, Princeton, New Jersey. Bertsekas, D. P. 1975. Convergence of Discretization Procedures in Dynamic Program-ming. IEEE Transactions Automatic Control AC-21,415-419. Bertsekas, D. P. 1976. Dynamic Programming and Stochastic Control. Academic Press, New York. Brumelle, S., and K. Sawaki. 1978. Generalized Policy Improvement for Simple Dy-namic Programs with an Application to Partially Observable Markov Decision Pro-cesses. Working Paper 546, Faculty of Commerce, University of British Columbia, Vancouver. Buckman, A. G. and B. L. Miller. 1979. Optimal Investigation as a Regenerative Stopping Problem. Working Paper 289, Western Management Science Institute, University of California at Los Angles, California. Denardo, E. V. 1967. Contraction Mapping in the Theory Underlying Dynamic Pro-gramming. SI AM Reviews 9, 165-177. Denardo, E. V. 1982. Dynamic Programming Models and Applications. Prentice-Hall,Inc. Englewood Cliffs, New Jersey. Drake, A. 1962. Observation of a Markov Process Through u Noisy Channel. Unpub-lished Sc.D. Thesis, Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts. Eagle, J. N. 1984. The Optimal Search far a Moving Target "When the Search Path Is Constrained. Operations Research 32, 1107-1115. Eckles, J. 1968. Optimum Maintenance with Incomplete Information. Operations Re-149 search 16, 1058-1067. El'sgol'c, L. E. 1964 Qualitative Methods in Mathematical Analysis. American Mathe-matical Society. Providence, Rhode Island Howard, R. A. 1960. Dynamic Programming and Markov Processes. The M.I.T. Press, Cambridge, Massachusetts. Hsu, K. and S. Marcus. 1980. Decentralized Control of Finite State Markov Processes. Proceedings 19th IEEE Conference on Decision and Control, 143-148. Hughes, J. S. 1977. Optimal Internal Audit Timing. Accounting Review L H , 56-58. Hughes, J. S. 1977. A Note on Quality Control under Markovian Deterioration. Oper-ations Research 28, 421-423. Kakalik, J.S. 1965- Optimum Policies for Partially Observable Markov System. Techni-cal Report TR-18, Operations Research Center, Massachusetts Institute of Tech-nology, Cambridge, Massachusetts. Kaplan, R. 1969. Optimal Investigation Strategies with imperfect Information. Journal of Accounting Research 7, 32-43. Lane, D. E. 1986a. Dynamics Models of Decision Making by Fishermen. Unpublished Ph.D. dissertation, University of British Columbia, Vancouver, British Columbia, Canada. Lane, D. E. 1986b. A Partially Observable Model of Decision Making by Fishermen. Working Paper 86-46. University of Ottawa. Ontario. Lovejoy, W. S. 1983. Policy Bounds for Markov Decision Processes with Applications to Fisheries Management. Unpublished Ph.D. dissertation, University of Delaware. Lovejoy, W. S. 1987. Some Monotonicity Results of Partially Observed Markov Decision Processes. Operations Research 35, 736-743. Mattheiss, T. H. 1973. An Algorithm for Determining Irrelevant Constraints and All Vertices in Systems of Linear Inequalities. Operations Research 21 , 247-260. Mattheiss, T. H. , and D. S. Rubin. 1980. A Survey and Comparison of Methods for Finding All Vertices of Convex Polyhedral Sets. Mathematics of Operations Research 5, 167-185. MacQueen, J. 1967. A Test for Suboptimal Actions in Markovian Decision Problems. 150 Operations Research 15, 559-561. Monahan, G. E. 1982. A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms. Management Science 28, 1-16. Nir, Abraham. 1986. Optimal Control of a Partially Observable Markov Chain. Un-published Ph.D. dissertation, Northwestern University, Evanston, Illinois. van Nunen, J.A.E.E. 1976. Contracting Markov Decision Processes Amsterdam, Math-ematisch Centrum (Mathematical Centre Tracts, no. 71). Ohnishi, M. , Kawai, H., and H. Mine. 1986. An Optimal Inspection and Replace-ment Policy under Incomplete State Information. European Journal of Operational Research. 27, 117-128. Ohnishi, M. , Mine, H., and H. Kawai. 1986. An Optimal Inspection and Replacement Policy under Incomplete State Information: Average Cost Criterion, in Stochastic Models in Reliability Theory, 187-197. Springer-Verlag, Berlin. Ortega, J. M. , and W. C. Rheinboldt. 1970. Iterative Solution of Nonlinear Equations in Several Variables Academic Press. New York. Papadimitriou C H . and J. N. Tsitsiklis. 1987. The Complexity of Markov Decision Processes. Mathematics of Operations Research 12, 441-450 Platzman, L. 1980. Optimal Infinite-Horizon Undiscounted Control of finite Probabilis-tic Systems. SIAM Journal of Control and Optimization 18, 362-380 Platzman, L. 1981. A feasible Computational Approach to Infinite-Horizon Partially-Observed Markov Decision Problems, mimeograph, School of Industrial and Sys-tems Engineering, Georgia Institute of Technology, Atlanta, Georgia. Puterman, M. L., and M. C. Shin. 1978. Modified Policy Iteration Algorithms for Discounted Markov Decision Problems. Management Science 24, 1127-1137. Puterman, M . L., and M. C. Shin. 1982. Action Elimination Procedures for Modified Policy Iteration Algorithms. Operations Research 30, 301-318. Rhenius, D. 1974. Incomplete Information in Markovian Decision Models. Annals of Statistics 2, 1327-1334. Ross, S. 1971. Quality Control Under Markovian Deterioration. Management Science 17 587-596. 151 Satia, J. and R. Lave. 1973. Markovian Decision Processes with Probabilistic Observa-tion of States. Management Science 2 0 1-13. Sawaki, K. 1980. Piecewise Linear Markov Decision Processes with an Application into Partially Observable Models, in Recent Developments in Markov Decision Processes (R. Hartley et al, Eds.) Academic Press, New York. Sawaki, K. 1983. Transformation of Partially Observable Markov Decision Processes into Piecewise Linear Ones. Journal of Mathematical Analysis and Applications 9 1 , 112-118. Sawaragi, Y. and T. Yoshikawa. 1970. Discrete-Time Markovian Decision Processes with Imcomplete State Observation. Annals of Mathematical Statistics 4 1 , 78-86. Smallwood, R. D., and E. J. Sondik. 1973. The Optimal Control of Partially Observable Markov Processes over a Finite Horizon. Operations Research 2 1 , 1071-1088. Sondik, E. J. 1971. The Optimal Control of Partially Observable Markov Processes. Unpublished Ph.D. dissertation, Stanford University, Stanford, California. Sondik, E. J. 1978. The Optimal Control of Partially Observable Markov Processes over the Infinite Horizon: Discounted Cost. Operations Research 2 6 , 282-304. Striebel, C. 1965. Sufficient Statistics in the Control of Stochastic Systems. Journal of Mathematical Analysis and Applications 12 , 576-592. Wang, R. 1976. Computing Optimal Quality Control Policies-Two Actions. Journal of Applied Probability 1 3 , 826-832. Wang, R. 1977. Optimal Replacement Policy Under Unobservable States. Journal of Applied Probability 14 , 340-348. White, C. 1976. Optimal Diagnostic Questionnaires Which Allow Less Than Truthful Responses. Information and Control 3 2 , 61-74. White, C. 1977. A Markov Quality Control Process Subject to Partial Observation. Management science 2 3 , 843-852. White, C. 1978. Optimal Inspection and Repair of a Production Process Subject to Deterioration. Journal of Operational Research Society 2 9 , 235-243. White, C. 1979a. Bounds on Optimal Cost for a Replacement Problem with Partial Observation. Naval Research Logistics Quarterly 2 6 , 415-422. 152 White, C. 1979b. Optimal Control-Limit Strategies for a Partially Observed Replace-ment Problem. International Journal of Systems Science 10, 321-331. White, C. 1980. Monotone Control Laws for Noisy, Countable-State Markov Chains. European Journal of Operational Research 5, 124-132. White, C. C , and D. P. Harrington. 1980. Application of Jensen's Inequality to Adap-tive Suboptimal Design. Journal of Optimization Theory and Application 32, 89-99 White, C. C , and W. T. Scherer. 1986. Accelerated Successive Approximation Algo-rithms for Partially Observed Markov Decision Processes. Working Paper, Univer-sity of Virginia, Charlottesville, Virginia. Whitt, W. 1978. Approximations of Dynamic Programs, I. Mathematics of Operations Research 3, 231-243. Whittle, P. 1982. Optimization Over Time. Volume II. John Wiley & Sons Ltd. Chich-ester. 153 A P P E N D I X 1 R A N D O M G E N E R A T E D T E S T I N G D A T A F O R F I N I T E H O R I Z O N A L G O R I T H M S Number of States = 3 Number of Actions = 3 Number of Signals = 3 Action 1: D A T A S E T D3.1 P 1 = 0.573 0.416 0.103 0.346 0.441 0.390 0.081 0.143 0.507 0.646 0.106 0.120 0.199 0.643 0.220 0.155 0.251 0.660 Action 2: 0.357 0.370 0.169 0.345 0.460 0.192 0.298 0.170 0.639 Q2 = 0.704 0.0 0.188 0.116 0.840 0.135 0.179 0.160 0.677 Action 3: P 3 = 0.305 0.388 0.139 0.239 0.356 0.271 0.456 0.256 0.590 Q3 = 0.702 0.211 0.087 0.304 0.581 0.115 0.145 0.238 0.617 154 D A T A S E T D3.2 Number of States = 3 Number of Actions = 3 Number of Signals = 3 Action 1: P 1 0.256 0.002 0.578 0.063 0.693 0.051 0.273 0.725 0.359 0.712 0.089 0.014 0.066 0.790 0.126 0.222 0.121 0.860 r = 7.100 9.300 5.300 Action 2: P 2 = 0.350 0.244 0.227 0.239 0.237 0.323 0.411 0.519 0.450 Q2 = 0.763 0.081 0.092 0.041 0.566 0.031 0.196" 0.353 0.877 r = 8.300 8.600 7.800 Action 3: P 3 = 0.482 0.108 0.445 0.045 0.485 0.158 0.473 0.407 0.397 Q3 = 0.504 0.080 0.194 0.061 0.705 0.015 0.435 0.215 0.791 r = 0.900 3.900 8.800 155 D A T A S E T D3.3 Number of States = 3 Number of Actions = 3 Number of Signals = 3 Action 1: P 1 = 0.349 0.860 0.070 0.472 0.180 0.471 0.093 0.047 0.458 0.677 0.166 0.143 0.193 0.572 0.247 0.130 0.262 0.611 r = 9.900 0.200 2.400 Action 2: P 2 = 0.295 0.272 0.443 0.144 0.104 0.314 0.561 0.624 0.243 Q2 = 0.660 0.130 0.205 0.165 0.792 0.083 0.175 0.078 0.712 r = 5.900 7.400 5.500 Action 3: P 3 = 0.284 0.672 0.044 0.635 0.300 0.065 0.042 0.545 0.413 Q3 = 0.745 0.133 0.115 0.143 0.712 0.260 0.112' 0.155 0.625 r 3 = 3.700 4.900 9.800 156 D A T A S E T D3 .4 Number of States = 3 Number of Actions = 3 Number of Signals = 3 Action 1: P1 = 0.336 0.436 0.228 0.304 0.272 0.424 0.167 0.578 0.255 0.574 0.135 0.062 0.227 0.745 0.294 0.199 0.120 0.644 r = 3.700 3.000 6.200 Action 2: P2 = 0.339 0.525 0.136 0.435 0.525 0.040 0.374 0.243 0.383 Q2 = 0.640 0.252 0.236 0.180 0.584 0.024 0.180 0.164 0.740 r = 5.700 0.100 1.700 Action 3: P3 = 0.143 0.173 0.684 0.020 0.826 0.154 0.601 0.079 0.320 Q3 = 0.916 0.101 0.095 0.054 0.843 0.215 0.030 0.056 0.690 r = 7.400 0.400 7.000 157 D A T A S E T D3.5 Number of States = 3 Number of Actions = 3 Number of Signals = 3 Action 1: P 1 = 0.445 0.222 0.333 0.500 0.173 0.327 0.204 0.553 0.243 0.686 0.138 0.279 0.182 0.786 0.083 0.132 0.076 0.638 r = 5.200 4.600 4.100 Action 2: P 2 = 0.234 0.549 0.061 0.064 0.218 0.466 0.702 0.233 0.473 Q2 = 0.698 0.283 0.005 0.131 0.624 0.202 0.171 0.093 0.793 r 2 = Action 3: P 3 = 0.535 0.114 0.325 0.313 0.870 0.360 0.152 0.016 0.315 Q3 = 0.567 0.243 0.186 0.234 0.641 0.044 0.199 0.116 0.770 r 3 = 9.000 9.300 0.800 158 D A T A S E T D4.1 Number of States = 4 Number of Actions = Number of Signals = Action 1: P1 = •0.355 0.321 0.100 0.433 0.181 0.342 0.254 0.155 0.065 .0.066 0.420 0.172 -0.519 0.192 0.154 0.161 0.551 0.093 0.112 0.158 0.662 .0.157 0.126 0.055 0.224-0.044 0.526 0.342. 0.1351 rO.2" 0.195 i _ 8.2 0.068 r ~ 6.8 0.662J L8.1. Action 2: P 2 = Q2 = 0.094 0.241 0.351 0.064 0.571 0.158 0.089 0.109 0.311 0.173 0.259 0.174 0.013 0.547 0.180 0.198 0.262 0.233 0.273 0.454 0.162 0.105 0.629 0.149 0.333-0.353 0.117 0.308. 0.254-0.190 0.102 0.544. ro.8-1.0 5.4 L3.9. Action 3: P 3 = Q3 = -0.455 0.182 0.135 0-228-0.370 0.280 0.005 0.345 0.220 0.276 0.270 0.234 .0.403 0.499 0.071 0.027. -0.650 0.121 .,0.126 0.103" 0.175 0.538 0.134 0.153 0.064 0.086 0.676 0.174 .0.204 0.232 0.010 0.554. r 3 = 2.2" 6.7 2.6 Ll.7. 159 Action 4: P4 = Q 4 = -0.255 0.294 0.025 .0.267 "0.672 0.004 0.159 .0.070 0.255 0.108 0.143 0.298 0.103 0.694 0.166 0.172 0.230 0.417 0.352 0.194 0.134 0.109 0.621 0.123 0.260-0.181 0.480 0.241. 0.091-0.193 0.054 0.635. 160 D A T A S E T D4.2 Number of States = 4 Number of Actions = 4 Number of Signals = 4 Action 1: Action 2: Action 3: P J = 0.252 0.291 0.171 0.294 0.700 0.075 0.222 0.088 0.210 0.123 0.075 0.010 0.088 0.699 0.016 0.252 0.240 0.149 0.475 0.396 0.131 0.149 0.710 0.149 0.298 0.437 0.279 0.300 0.081 0.077 0.052 0.511 r 1 = 1.1 9.2 4.2 .9.9 P 2 = Q2 = 0.375 0.462 0.120 0.222 0.644 0.040 0.050 0.020 0.192 0.121 0.640 0.283 0.156 0.750 0.204 0.038 0.150 0.077 0.092 0.129 0.163 0.088 0.655 0.154 0.283 0.340 0.148 0.366 0.037 0.122 0.091 0.788 r 2 = T0.8 8.1 9.5 L7.1 P 3 = Q3 0.255 0.180 0.286 0.157 0.713 0.116 0.076 0.176 0.140 0.395 0.229 0.012 0.002 0.667 0.159 0.083 0.309 0.032 0.273 0.517 0.139 0.020 0.678 0.096 0.296 0.393 0.212 0.314 0.146 0.197 0.087 0.645 r 3 = T5.5 7.3 9.3 L1.3 161 Action 4: P 4 = <?4 = •0.316 0.333 0.280 .0.149 •0.728 0.102 0.198 .0.037 0.188 0.320 0.346 0.185 0.080 0.612 0.123 0.170 0.378 0.228 0.069 0.392 0.072 0.208 0.540 0.260 0.118 0.119 0.305 0.274 0.120 0.078 0.139 0.533 162 D A T A S E T D4.3 Number of States = 4 Number of Actions = 4 Number of Signals = 4 Action 1: Action 2: Action 3: PJ = Q1 = 0.016 0.184 0.126 0.344 0.677 0.030 0.172 0.152 0.086 0.237 0.396 0.342 0.239 0.634 0.002 0.079 0.634 0.280 0.298 0.110 0.066 0.185 0.671 0.173 0.264 0.299 0.180 0.204 0.018 0.151 0.155 0.596 r 1 = '5.7 8.8 8.3 .3.7 P2 = <?2 = 0.092 0.133 0.162 0.342 0.559 0.071 0.008 0.210 0.313 0.161 0.355 0.178 0.138 0.638 0.138 0.081 0.151 0.457 0.165 0.304 0.108 0.136 0.687 0.062 0.444 0.249 0.318 0.176 0.195 0.155 0.167 0.647 r 2 = 4.3 3.2 0.7 .5.3 P3 = e3 = 0.244 0.319 0.216 0.131 0.503 0.123 0.243 0.135 0.335 0.137 0.278 0.490 0.085 0.685 0.042 0.129 0.255 0.204 0.140 0.340 0.177 0.136 0.626 0.116 0.166 0.340 0.366 0.039 0.235 0.056 0.089 0.620 r 3 = T5.8 8.5 7.4 8.3 163 Action 4: P 4 = Q 4 = •0.174 0.297 0.140 .0.222 0.637 0.121 0.185 .0.137 0.230 0.283 0.260 0.341 0.052 0.675 0.032 0.142 0.372 0.164 0.316 0.055 0.227 0.059 0.675 0.120 0.224" 0.256 0.284 0.382. 0.084" 0.145 0.108 0.601. 164 D A T A S E T D 4 . 4 Number of States = 4 Number of Actions = 4 Number of Signals = 4 Action 1: Action 2: Action 3: P 1 = 0.093 0.269 0.255 0.096 0.634 0.156 0.073 0.124 0.337 0.306 0.352 0.171 0.154 0.612 0.167 0.106 0.226 0.178 0.264 0.354 0.124 0.115 0.670 0.156 0.344 0.247 0.129 0.379 0.088 0.117 0.090 0.614 r 1 = 4.2 4.5 3.8 L2.7 P 2 = Q2 = 0.280 0.205 0.284 0.418 0.675 0.302 0.202 0.231 0.162 0.251 0.295 0.379 0.058 0.549 0.157 0.065 0.029 0.322 0.153 0.181 0.153 0.112 0.567 0.110 0.529 0.222 0.268 0.022 0.114 0.037 0.074 0.594 •7.1 8.9 5.4 .6.9 P 3 = Q 3 = 0.408 0.307 0.072 0.244 0.563 0.167 0.155 0.091 0.0 0.245 0.346 0.303 0.119 0.584 0.112 0.146 0.355 0.213 0.419 0.208 0.236 0.211 0.601 0.091 0.237 0.235 0.163 0.245 0.082 0.038 0.132 0.672 r 3 = 6.2 2.1 4.6 L8.9 165 Action 4 : P 4 = QA = "0.485 0.280 0.047 .0.234 •0.531 0.097 0.089 .0.080 0.374 0.030 0.147 0.420 0.320 0.603 0.136 0.190 0.086 0.649 0.313 0.046 0.002 0.253 0.590 0.166 0.055 0.041 0.493 0.300 0.147 0.047 0.185 0.564 166 D A T A S E T D4.5 Number of States = 4 Number of Actions = 4 Number of Signals = 4 Action 1: Action 2: Action 3: P x = 0.191 0.230 0.105 0.502 0.536 0.193 0.148 0.146 0.159 0.210 0.370 0.231 0.140 0.541 0.140 0.023 0.325 0.317 0.101 0.088 0.173 0.155 0.608 0.147 0.325 0.243 0.424 0.179 0.151 0.111 0.104 0.684 r 1 = 9.4 2.9 7.1 .7.2 P2 = Q2 = 0.161 0.280 0.024 0.069 0.681 0.180 0.017 0.200 0.148 0.334 0.422 0.151 0.011 0.680 0.262 0.030 0.318 0.336 0.422 0.303 0.134 0.090 0.504 0.190 0.373 0.050 0.132 0.477 0.174 0.050 0.217 0.580 r 2 = 0.1 5.1 0.8 L8.5 P 3 = Q3 = 0.330 0.423 0.412 0.243 0.510 0.122 0.207 0.034 0.200 0.001 0.113 0.034 0.169 0.624 0.046 0.176 0.174 0.168 0.019 0.309 0.255 0.125 0.732 0.236 0.296 0.408 0.456 0.414 0.066 0.129 0.015 0.554 r 3 = T6.1 7.1 5.1 L8.9 167 Action 4: P 4 = "0.194 0.208 0.193 .0.436 "0.576 0.120 0.070 .0.040 0.305 0.212 0.415 0.492 0.287 0.590 0.180 0.254 0.491 0.029 0.181 0.023 0.041 0.143 0.510 0.200 0.010 0.551 0.211 0.049 0.096 0.147 0.240 0.506 168 D A T A S E T D5.1 Number of States = 5 Number of Actions = 3 Number of Signals = 3 Action 1: P 1 = Action 2: Action 3: "0.064 0.343 0.303 0.106 0.184" 0.340 0.087 0.214 0.069 0.290 0.189 0.200 0.402 0.097 0.112 0.322 0.003 0.053 0.322 0.300 .0.137 0.292 0.108 0.100 0.363. •0.092 0.403 0.505- -10.0-0.310 0.099 0.591 6.0 0.431 0.289 0.280 r 1 — 4.9 0.287 0.304 0.409 3.2 .0.382 0.213 0.405. . 7.4 . Q 3 = P 2 = Q2 = 0.425 0.108 0.470 0.173 0.154 0.081 0.736 0.521 0.360 0.040 0.117 0.298 0.033 0.184 0.136 0.495 0.024 0.467 0.199 0.447 0.261 0.092 0.155 0.284 0.236 0.4241 0.240 0.012 0.441 0.513 0.077 0.320 0.194 0.215 0.128 r = 0.120 0.182 0.148 0.144 0.346 J T4.5 0.2 2.5 6.5 L5.5 P 3 = 0.005 0.226 0.292 0.264 0.213" 0.239 0.126 0.266 0.220 0.149 = 0.228 0.016 0.259 0.040 0.457 0.092 0.019 0.518 0.034 0.337 .0.324 0.272 0.024 0.047 0.333. •0.226 0.403 0.371" -2.3" 0.416 0.144 0.440 7.9 0.462 0.319 0.219 r 3 = 6.2 0.260 0.103 0.637 6.5 .0.508 0.078 0.414. .3.3. 169 A P P E N D I X 2 R A N D O M G E N E R A T E D T E S T I N G D A T A F O R I N F I N I T E H O R I Z O N A L G O R I T H M S D A T A S E T 1 Number of States = 3 Number of Actions = 3 Number of Signals = 3 A c t i o n 1: '.483 .268 .249 " '.557 .220 .223' "5.10" P 1 = .000 .000 1.000 Qy = .031 .665 .304 r 1 = 5.20 .000 .698 .302 .223 .111 .667 9.50 A c t i o n 2: A c t i o n '.665 .335 .000" .624 .337 .039' "5.80" P 2 = .407 .223 .369 Q2 = .215 .706 .080 r2 = 4.60 _.695 .000 .305 _ .149 .088 .762 1.40_ 3: ' .363 .361 .275' '.643 .171 .186" '7.90 P 3 = .430 .000 .570 Q3 = .121 .727 .152 6.80 1.000 .000 .000 .256 .102 .643 7.30 170 D A T A S E T 2 Number of States = 3 Number of Actions = 6 Number of Signals = 3 Action 1: .000 .388 .612 .213 .451 .336' '1.90' P 1 = .580 .420 .000 Ql = .383 .617 .000 r 1 = 6.20 .379 .180 .441 .359 .641 .000 9.60 Action 2: '.000 1.000 .000" " 1.000 .000 .000' '2.70" P2 = .479 .000 .521 Q2 = .691 .000 .309 r 2 = 8.70 .529 .000 .471 .000 .527 .473 5.60 Action 3: .324 .291 .385" '.000 .643 .357" "1.80" P 3 = .506 .000 .494 Q 3 = .321 .000 .679 r 3 = 6.20 .000 1.000 .000 .486 .000 .514 8.10 Action 4: '.716 .002 .282" '.844 .043 .112' "7.20' P4 = .171 .612 .216 Q4 = .339 .534 .127 r 4 = 4.10 .124 .214 .661 .273 .026 .701 .40 Action 5: .822 .058 .120 .742 .196 .062" "2.00" P 5 = .175 .726 .099 Q 5 = .103 .641 .256 r 5 = 2.40 .215 .065 .721 .147 .208 .645 9.80 Action 6: ".666 .033 .301" '.681 .205 .114" "6.20" P 6 = .114 .652 .233 Q* = .101 .734 .166 r 6 = 9.60 .217 .116 .667 .102 .132 .766 4.30 171 D A T A S E T 3 Number of States = 4 Number of Actions = 4 Number of Signals = 4 Action 1: P 1 = -.379 .271 .000 .350" .000 .341 .407 .252 .000 .434 .000 .566 ..000 .414 .000 .586. ".589 .282 .018 .110" .057 .721 .144 .077 .123 .160 .561 .156 ..219 .011 .115 .654. r 1 = • .20 5.40 2.50 .8.00 Action 2: P 2 = Q2 = -.000 .631 .000 .369" .632 .000 .000 .368 .000 .000 .523 .477 ..000 .521 .000 .479. -.542 .200 .126 .133" .180 .658 .157 .005 .181 .103 .550 .166 ..094 .110 .146 .649. r 2 = •5.80 8.40 7.50 .1.70 Action 3: Q3 = .319 .339 .342 .000" .190 .332 .188 .290 .517 .000 .000 .483 ..000 .378 .000 .622. -.522 .021 .169 .287" .131 .593 .113 .163 .017 .233 .612 .138 ..145 .036 .149 .670. r 3 = 9.30 5.60 9.70 .8.80 172 Action 4: P 4 = Q4 = ".000 1.000 .000 .000 • .416 .000 .584 .000 .000 .199 .292 .509 ..402 .000 .272 .326 -.624 .162 .155 .059-.141 .654 .140 .065 .129 .232 .600 .038 ..027 .270 .124 .579. r 4 = •4.20" 9.40 4.40 .2.60. 173 D A T A S E T 4 Number of States = 5 Number of Actions = 5 Number of Signals = 5 Action 1: P 1 Action 2: p 2 Q2 .000 .234 .000 .292 .474" .266 .397 .000 .000 .337 .000 .000 .344 .339 .318 .000 .362 .638 .000 .000 ..274 .248 .000 .303 .175. -.705 .011 .041 .030 .214" "2.40-.058 .655 .164 .052 .071 5.80 .056 .032 .603 .148 .161 r 1 = 4.20 .025 .177 .157 .512 .129 2.70 ..150 .112 .082 .100 .555. .2.60. .000 .233 .315 .000 .452" .347 .360 .000 .000 .293 = .192 .192 .246 .105 .265 .000 .000 .279 .721 .000 ..314 .000 .304 .226 .157. .531 .210 .014 .099 .146" -3.70-5.10 7.60 5.60 9.00 . cr . .175 .623 .072 .040 .090 = .197 .064 .542 .190 .006 r 2 = .006 .082 .226 .634 .052 ..111 .139 .060 .064 174 .626. Action 3: <?3 = -.275 .257 .000 .381 .294 .000 .343 .264 ..595 .000 .525 .020 .036 .636 .038 .079 .009 .022 ..035 .024 Action 4: P 4 = Q 4 = .167 .301 .343 .000 .288 .000 .416 .251 ..264 .433 ".591 .107 .036 .735 .002 .131 .122 .018 ..143 .079 Action 5: P 5 = <?5 = -.177 .000 .274 .286 .284 .145 .386 .403 ..239 .000 .571 .224 .090 .625 .096 .185 .024 .042 ..085 .119 .284 .184 .ooo-.000 .358 .261 .000 .335 .371 .000 .265 .128 .405 .000 .000. .202 .074 .179" '5.50 .149 .105 .073 1.60 .600 .072 .210 r 3 = 1.00 .219 .522 .229 3.70 .059 .200 .683. .1.70 .250 .282 .000" .207 .450 .000 .170 .228 .314 .000 .333 .000 .000 .304 .000. .044 .132 .126" "3.90 .035 .149 .045 5.90 .627 .129 .111 r 4 = 3.70 .181 .617 .062 3.90 .177 .035 .566. .4.30 .390 .289 .144" .141 .183 .116 .261 .310 .000 .000 .211 .000 .253 .000 .508. .120 .078 .0081 "3.20 .122 .099 .064 1.80 .556 .154 .010 r 5 = 4.30 .011 .726 .198 1.50 .000 .168 .627. .9.10 175 D A T A S E T 5 Number of States = 6 Number of Actions = 4 Number of Signals = 6 A c t i o n 1: P> = A c t i o n 2: P 2 = Q2 = -.106 .167 .207 .196 .223 .101" .000 .357 .000 .303 .341 .000 .000 .000 .196 .207 .396 .201 .000 .254 .164 .153 .187 .242 .125 .234 .000 .239 .138 .265 ..200 .229 .159 .000 .197 .216. ".549 .051 .091 .148 .062 .098-.029 .620 .112 .097 .062 .080 .003 .007 .559 .203 .044 .183 .086 .007 .136 .660 .001 .110 .086 .090 .029 .126 .561 .108 ..084 .099 .019 .042 .144 .612. .275 .278 .000 .166 .000 .281" .000 .462 .000 .000 .538 .000 .000 .000 .231 .000 .228 .541 .173 .298 .177 .182 .169 .000 .114 .180 .116 .119 .304 .167 ..000 .236 .126 .213 .270 .155. -.501 .143 .148 .011 .049 .147" .028 .508 .134 .123 .097 .110 .106 .133 .621 .079 .023 .038 .092 .017 .150 .527 .066 .148 .110 .061 .052 .111 .607 .059 ..097 .089 .104 .093 .105 .513. 176 Action 3: P 3 = Q3 = Action 4: P 4 = Q4 = .209 .304 .204 .000 .282 .000" .299 .000 .197 .000 .000 .505 .239 .242 .000 .275 .243 .000 .000 .000 .325 .000 .343 .332 .213 .000 .390 .000 .397 .000 .151 .127 .212 .228 .000 .282. .633 .111 .069 .017 .136 .035" .030 .618 .148 .154 .018 .032 .080 .074 .628 .114 .035 .069 .043 .008 .060 .531 .207 .151 .112 .080 .097 .101 .572 .038 .102 .115 .031 .027 .123 .603. .179 .345 .000 .000 .269 .207" .404 .196 .401 .000 .000 .000 .186 .221 .210 .000 .383 .000 .186 .201 .218 .169 .000 .228 .179 .188 .182 .000 .207 .244 .268 .000 .000 .402 .000 .330. .538 .035 .096 .023 .175 .133" .097 .599 .109 .092 .090 .013 .038 .141 .599 .016 .103 .103 .074 .075 .120 .547 .134 .050 .123 .048 .166 .097 .516 .049 .103 .101 .128 .038 .064 .565. r 3 = ' .30 9.40 5.80 2.60 2.70 L1.70 r 4 = 6.40 7.10 7.60 6.70 6.20 .1.10 177
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Algorithms for partially observable Markov decision...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Algorithms for partially observable Markov decision processes Cheng, Hsien-Te 1988
pdf
Page Metadata
Item Metadata
Title | Algorithms for partially observable Markov decision processes |
Creator |
Cheng, Hsien-Te |
Publisher | University of British Columbia |
Date Issued | 1988 |
Description | The thesis develops methods to solve discrete-time finite-state partially observable Markov decision processes. For the infinite horizon problem, only discounted reward case is considered. Several new algorithms for the finite horizon and the infinite horizon problems are developed. For the finite horizon problem, two new algorithms are developed. The first algorithm is called the relaxed region algorithm. For each support in the value function, this algorithm determines a region not smaller than its support region and modifies it implicitly in later steps until the exact support region is found. The second algorithm, called linear support algorithm, systematically approximates the value function until all supports in the value function are found. The most important feature of this algorithm is that it can be modified to find an approximate value function. The number of regions determined explicitly by both algorithms is the same as the number of supports in the value function, which is much less than the number of regions generated by the one-pass algorithm. Since the vertices of each region have to be found, these two algorithms are more efficient than the one-pass algorithm. The limited numerical examples also show that both methods are more efficient than the existing algorithms. For the infinite horizon problem, it is first shown that the approximation version of linear support algorithm can be used to substitute the policy improvement step in a standard successive approximation method to obtain an e-optimal value function. Next, an iterative discretization procedure is developed which uses a small number of states to find new supports and improve the value function between two policy improvement steps. Since only a finite number of states are chosen in this process, some techniques developed for finite MDP can be applied here. Finally, we prove that the policy improvement step in iterative discretization procedure can be replaced by the approximation version of linear support algorithm. The last part of the thesis deals with problems with continuous signals. We first show that if the signal processes are uniformly distributed, then the problem can be reformulated as a problem with finite number of signals. Then the result is extended to where the signal processes are step functions. Since step functions can be easily used to approximate most of the probability distributions, this method can be used to approximate most of the problems with continuous signals. Finally, we present some conditions which guarantee that the linear support can be computed for any given state, then the methods developed for finite signal cases can be easily modified and applied to problems for which the conditions hold. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-10-10 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0098252 |
URI | http://hdl.handle.net/2429/29073 |
Degree |
Doctor of Philosophy - PhD |
Program |
Business Administration |
Affiliation |
Business, Sauder School of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-UBC_1989_A1 C45.pdf [ 8.54MB ]
- Metadata
- JSON: 831-1.0098252.json
- JSON-LD: 831-1.0098252-ld.json
- RDF/XML (Pretty): 831-1.0098252-rdf.xml
- RDF/JSON: 831-1.0098252-rdf.json
- Turtle: 831-1.0098252-turtle.txt
- N-Triples: 831-1.0098252-rdf-ntriples.txt
- Original Record: 831-1.0098252-source.json
- Full Text
- 831-1.0098252-fulltext.txt
- Citation
- 831-1.0098252.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0098252/manifest