UBC Theses and Dissertations
Linkup action dependent dynamic programming
Ip, John Chong Ching
Incremental forms of Dynamic Programming (DP), incremental forms of Action Dependent Dynamic Programming (ADDP) such as Q-Learning (QL), and Temporal Difference Learning (TDL) are related families of algorithms, grouped under the heading of Reinforcement Learning (RL). Together they form the basis for a new generation of methods for optimal control. RL methods make the following assumptions: 1) the system is discrete-time/discrete-state or discrete-time/continuous-state, 2) a model of the dynamics is unavailable and difficult to identify, 3) the system is stochastic, and 4) incremental learning is necessary or acceptable. These methods target control situations that are beyond the scope of most existing control methods. Convergence results exist for the basic and derived forms of the above types of algorithms, and numerous empirical results have demonstrated their usefulness in optimal control problems in aircraft control, manufacturing, and process control. However, further expansion of the applications depends fundamentally on 1) reducing the accumulation of approximation errors when universal function approximators are used to store the various mappings in RL algorithms, and 2) extending RL algorithms to multiple temporal and spatial resolutions so that they become applicable to hierarchical control. This thesis addresses the issue of approximation error accumulation in the DP and ADDP algorithms of RL, and the issue of the extension of RL, by introducing modifications to the ƒ and Q notations of DP and ADDP, respectively. Derivations from the definitions of these new ƒ and Q notations lead to three Linkup Relations. These relations enable a new type of update operation, called "linkup", in addition to "backup", the fundamental update operation in DP and ADDP. Several algorithms, collectively called Linkup Action Dependent Dynamic Programming (LADDP) algorithms, are devised based on linkups and backups.
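The "backup" operation that QL performs can be illustrated with a minimal tabular sketch. The chain environment, reward function, and learning parameters below are hypothetical, chosen only to show the one-step update that the linkup operation later generalises; they are not taken from the thesis.

```python
import random

# Hypothetical 1-D chain: states 0..4, state 4 is the goal.
# Action 1 moves right, action 0 moves left (clipped at the ends).
N, GOAL = 5, 4

def step(s, a):                      # deterministic toy dynamics
    return max(0, min(N - 1, s + (1 if a == 1 else -1)))

def reward(s):
    return 1.0 if s == GOAL else 0.0

Q = [[0.0, 0.0] for _ in range(N)]   # tabular Q-values
alpha, gamma, eps = 0.5, 0.9, 0.2    # illustrative learning parameters
rng = random.Random(0)

for _ in range(500):                 # episodes
    s = rng.randrange(N - 1)
    for _ in range(20):              # steps per episode
        a = rng.randrange(2) if rng.random() < eps \
            else max((0, 1), key=lambda x: Q[s][x])
        s2 = step(s, a)
        r = reward(s2)
        # One-step "backup": value information propagates a single
        # base time step per update.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        if s2 == GOAL:
            break
        s = s2

greedy = [max((0, 1), key=lambda x: Q[s][x]) for s in range(N)]
print(greedy)  # greedy policy should choose action 1 before the goal
```

Because each backup advances the horizon of the Q-values by exactly one base time step, many sweeps are needed before value information reaches distant states; this is the cost that linkups are designed to reduce.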
The central idea of LADDP is that the horizon of the ƒ- and Q-values can be increased in increments of a variable number of base time steps per update. Deterministic LADDP algorithms are proposed for a learning situation in which the system is either discrete-time/discrete-state or discrete-time/continuous-state, a model of the dynamics is available or easily identified, the system is deterministic, and incremental learning is neither necessary nor acceptable. This learning situation differs from the QL learning situation in the last three points. The deterministic LADDP algorithms are shown to be correct, meaning that, in the absence of approximation errors, the computed Q-values are identical to the optimal Q-values. It is proven that the worst-case upper bounds of two of these algorithms are much lower than that of a representative synchronous version of QL, the Synchronous ADDP algorithm. The error performance of these algorithms is compared by simulation using temporally and spatially correlated approximation errors; in both mean and variance, the proposed algorithms outperform Synchronous ADDP by as much as an order of magnitude. In terms of control policy degradation and control performance, the two deterministic LADDP algorithms are shown to be more tolerant of, and less sensitive to, the presence of approximation errors. For the same learning situation, several goal change algorithms are proposed to adapt (with certainty) existing optimal control policies from one goal to a new goal, a capability unique in the literature. Significant savings in the number of linkup cycles are demonstrated.
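The thesis's exact Linkup Relations are not reproduced here, but the flavour of growing the horizon in variable increments can be sketched for a deterministic system: composing an h-step cost-to-go map with a k-step map yields an (h+k)-step map in a single operation (a min-plus matrix product), rather than via h+k one-step backups. The toy shortest-path problem, the `linkup` name, and all parameters below are illustrative assumptions, not the thesis's algorithms.

```python
import numpy as np

# Hypothetical 4-state deterministic problem: states 0..3,
# action 0 = stay, action 1 = step right; state 3 is an
# absorbing goal with zero cost, every other step costs 1.
N = 4

def step(s, a):
    return min(s + 1, N - 1) if a == 1 else s

def cost(s, a):
    return 0.0 if s == N - 1 else 1.0

INF = np.inf
# C1[s, s2] = cheapest one-step cost from s to s2 (min-plus base map).
C1 = np.full((N, N), INF)
for s in range(N):
    for a in (0, 1):
        s2 = step(s, a)
        C1[s, s2] = min(C1[s, s2], cost(s, a))

def linkup(Ch, Ck):
    """Min-plus product: combine an h-step and a k-step cost-to-go
    map into an (h+k)-step map in one update, minimising over the
    intermediate state reached after h steps."""
    return np.min(Ch[:, :, None] + Ck[None, :, :], axis=1)

C2 = linkup(C1, C1)          # horizon 2 after one linkup
C4 = linkup(C2, C2)          # horizon 4 after one more linkup

# One-step backups for comparison: linking with C1 advances the
# horizon by a single base time step, so three of them are needed.
Cb = C1
for _ in range(3):
    Cb = linkup(Cb, C1)

print(np.allclose(C4, Cb))   # both routes give the same horizon-4 map
```

The point of the sketch is the update count: doubling reaches horizon 4 in two linkups where one-step backups need three, and the gap widens quickly with the horizon length.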
A stochastic LADDP algorithm is proposed for a learning situation in which the system is either discrete-time/discrete-state or discrete-time/continuous-state, a model of the dynamics is unavailable and difficult to identify, the system is stochastic, and incremental learning is necessary or acceptable. This is the same learning situation in which incremental DP and TDL algorithms are used. The underlying basis of the LADDP algorithm proposed for stochastic systems is shown to be correct. It is shown by simulation to have an advantage over TDL in a qualitative performance measure called "orderliness", and an advantage in learning speed over a single control law version of QL; both advantages are important in practical applications. The main limitation of the Linkup concept is that an efficient algorithm based on the core Linkup Relation has not been realized. However, the proposed LADDP algorithms, based on variants of the core Linkup Relation, are applicable to both deterministic and stochastic systems.