UBC Theses and Dissertations

Linkup Action Dependent Dynamic Programming. Ip, John Chong Ching, 1995.


Full Text

LINKUP ACTION DEPENDENT DYNAMIC PROGRAMMING

By

John C. C. Ip

B.A.Sc. (Electrical Engineering), University of British Columbia
M.A.Sc. (Electrical Engineering), University of British Columbia

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
November 1995
(c) John C. C. Ip, 1995

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver, Canada

Abstract

Incremental forms of Dynamic Programming (DP), incremental forms of Action Dependent Dynamic Programming (ADDP) such as Q-Learning (QL), and Temporal Difference Learning (TDL) are related types of algorithms, grouped under the heading of Reinforcement Learning (RL). They form the bases for a new generation of methods for optimal control. RL methods make the following assumptions: 1) the system is discrete-time/discrete-state or discrete-time/continuous-state, 2) a model of the dynamics is unavailable and difficult to identify, 3) the system is stochastic, and 4) incremental learning is necessary or acceptable. These methods target control situations which are beyond the scope of most existing control methods.

Convergence results exist for the basic and derived forms of the above types of algorithms. Numerous empirical results have demonstrated the usefulness of these algorithms for optimal control problems in aircraft control, manufacturing, and process control. However, further expansion of the applications fundamentally depends on 1) reduced accumulation of approximation errors when universal function approximators are used to store the various mappings in RL algorithms, and 2) the extension of RL algorithms to multiple temporal and spatial resolutions such that they can be applicable to hierarchical control. This thesis addresses the issue of approximation error accumulation in DP and ADDP algorithms in RL, and the issue of the extension of RL, by introducing modifications to the f and Q notations of DP and ADDP, respectively.
Derivation from the definitions of these new f and Q notations leads to three Linkup Relations. These relations enable a new type of update operation called "linkup", in addition to "backup", which is the fundamental update operation in DP and ADDP. Several algorithms, collectively called Linkup Action Dependent Dynamic Programming (LADDP) algorithms, are devised based on linkups and backups. The central idea of LADDP is that the increase of the horizon of the f- and Q-values can be done in increments of variable numbers of base time steps per update.

Deterministic LADDP algorithms are proposed for a learning situation where the system is either discrete-time/discrete-state or discrete-time/continuous-state, a model of the dynamics is available or easily identified, the system is deterministic, and incremental learning is neither necessary nor acceptable. This learning situation differs from the QL learning situation in the last three points.

The deterministic LADDP algorithms are shown to be correct, meaning that the computed Q-values will be identical to the optimal Q-values in the absence of approximation errors. It is proven that the worst case upper bounds of two of these algorithms are much lower than that of a representative, synchronous version of QL, the Synchronous ADDP algorithm. The error performance of these algorithms is compared by simulation using temporally and spatially correlated approximation errors. The error performance in the mean and variance of the proposed algorithms is better than that of Synchronous ADDP by as much as an order of magnitude. In terms of control policy degradation and control performance, the two deterministic LADDP algorithms are shown to be more tolerant of and less sensitive to the presence of approximation errors.

For the same learning situation for which the deterministic LADDP algorithms are proposed, several goal change algorithms are proposed to adapt (with certainty) existing optimal control policies from one goal to a new goal. This capability is unique in the literature. Significant savings in terms of the number of linkup cycles are demonstrated.

A stochastic LADDP algorithm is proposed for a learning situation where the system is either discrete-time/discrete-state or discrete-time/continuous-state, a model of the dynamics is unavailable and difficult to identify, the system is stochastic, and incremental learning is necessary or acceptable. This learning situation is identical to that in which incremental DP and TDL algorithms are used.
The underlying basis of the one LADDP algorithm proposed for stochastic systems is shown to be correct. It is shown by simulation to have an advantage in a qualitative performance measure called "orderliness" over TDL and an advantage in learning speed over a single control law version of QL. Both of these advantages are important in practical applications.

The main limitation of the Linkup concept is that an efficient algorithm based on the core Linkup Relation has not been realized. However, the proposed LADDP algorithms, based on variants of the core Linkup Relation, are applicable to both deterministic and stochastic systems.

Table of Contents

Abstract
List of Tables
List of Figures
List of Symbols
Glossary
Acknowledgement

1 Introduction
1.1 Motivation for this Work
1.2 Problem Statement
1.3 Contributions
1.4 Limitations
1.5 Position of my Work in the Literature
1.6 Format of Thesis

2 Framework of Field
2.1 Deterministic Continuous-Time and -State
2.1.1 Problem definition
2.1.2 Analytical: Calculus of Variations
2.1.3 Analytical: Pontryagin's Minimum Principle
2.1.4 Numerical: Variation of Extremals
2.1.5 Numerical: Quasilinearization
2.1.6 Numerical: Steepest Descent
2.1.7 Numerical: Gradient Projection
2.1.8 Remarks on numerical methods
2.2 Stochastic Continuous-Time and -State
2.2.1 Problem definition
2.2.2 Analytical methods
2.2.3 Analytical: Dynamic Programming
2.2.4 Analytical: Martingale methods
2.2.5 Analytical: Minimum Principles
2.2.6 Numerical: Finite element approximation of the DP equation
2.2.7 Numerical: Discretization of the stochastic control problem
2.3 Discrete-Time and Continuous-States
2.3.1 Analytical methods
2.3.2 Numerical methods
2.3.3 Neural network optimization with the Minimum Principle
2.3.4 Reinforcement Learning
2.3.5 Reinforcement Learning: Policy Iteration
2.3.6 Basic TD(λ)
2.3.7 Adaptive Heuristic Critic
2.3.8 HDP (with Ap_p or gradient methods)
2.3.9 Reinforcement Learning: Value Iteration
2.3.10 ADAC or ADHDP
2.4 Discrete-Time and -State
2.4.1 Markovian Decision Tasks
2.4.2 Search methods
2.4.3 Reinforcement Learning
2.4.4 Reinforcement Learning: Policy Iteration
2.4.5 Basic Adaptive Critics
2.4.6 DP in general
2.4.7 Synchronous DP (SDP)
2.4.8 Relaxed forms of SDP
2.4.9 DP interleaved with control
2.4.10 Adaptive Real-Time DP
2.4.11 Reinforcement Learning: Value Iteration
2.4.12 An overall view of Q-Learning
2.4.13 Off-Line Q-Learning
2.4.14 Real-Time Q-Learning
2.4.15 Q-Learning implies Action Dependent Dynamic Programming
2.4.16 Dyna-Q
2.4.17 Prioritized Sweeping
2.4.18 Variable Resolution
2.4.19 Transition Point DP
2.5 Subgoals
2.6 Summary of Literature Review and Focus of this Work
2.7 Outstanding Issues
2.7.1 Approximation errors in DP and Q-Learning
2.7.2 Speed of convergence
2.7.3 Hierarchies
2.8 Increasing Amount of Context

3 The Key Problems and A New Approach
3.1 The One-Step Q-Learning Algorithm
3.2 Approximation Errors in Q-Learning and DP
3.3 Goal Change
3.4 Hierarchies
3.4.1 The Albus model of intelligence and RL
3.4.2 The upper levels of a hierarchy and macros
3.4.3 Other remarks on hierarchies
3.5 A New Approach: Linkup Action Dependent Dynamic Programming
3.5.1 Goal states and reached states
3.5.2 The horizon
3.6 Pre- and Post-Circuits
3.7 Summary

4 The Problem Formulation and Proposed Solutions
4.1 The Initial Problem Definition
4.2 Considerations in Applying the Linkup Relation
4.2.1 Prototype algorithm
4.2.2 Issue #1
4.2.3 Issue #2
4.2.4 Issue #3
4.2.5 Issue #4
4.2.6 Linkup is useful in two situations
4.2.7 Deterministic systems
4.2.8 Evaluating one policy at a time
4.3 A Family of LADDP Algorithms for Deterministic Systems
4.3.1 Algorithm 1
4.3.2 Algorithm 2
4.3.3 The Deterministic Doubling Algorithm
4.3.4 The Synchronous Q-Learning Algorithm
4.4 An Asynchronous LDP (ALDP) Algorithm for Stochastic Systems
4.5 A Natural Algorithm for Stochastic Systems
4.6 Linkup Action Dependent Dynamic Programming and Hierarchies
4.7 Goal Change Algorithms
4.7.1 One goal state
4.7.2 Two different applications
4.7.3 The best goal change approach
4.7.4 More than one goal state
4.7.5 Extracting macros
4.7.6 Temporal versus spatial aggregation
4.7.7 Comparison with related works
4.8 Deviation from the Core Algorithms
4.9 Summary

5 Evaluation of Proposed Algorithms
5.1 On the Algorithms for Deterministic Systems
5.1.1 Proof of the Deterministic Linkup Relation
5.1.2 Numerical demonstration of correctness
5.1.3 Complexity results
5.1.4 Accumulated approximation error upper bounds
5.1.5 Numerical demonstration of error upper bounds
5.1.6 Numerical results for correlated errors
5.1.7 Simulations of control policy degradation and performance
5.2 On the Algorithms for Stochastic Systems
5.2.1 Numerical demonstration of correctness
5.2.2 Comparison of orderliness
5.2.3 A natural but flawed algorithm
5.3 On the Goal Change Algorithms
5.3.1 The correctness of the goal change algorithms
5.3.2 Increased efficiency of the goal change algorithms
5.4 Summary

6 Conclusions and Future Challenges
6.1 Conclusions
6.2 Limitation of the Linkup Approach
6.3 Discussion
6.4 Future Challenges

Bibliography

List of Tables

4.1 2-step expected discounted sum for all possible combinations of u(s0) and u(s2)
4.2 Value identification phase
4.3 Build-up phase of Algorithm 1
4.4 Build-up phase of Algorithm 2
4.5 Build-up phase of the Deterministic Doubling Algorithm
4.6 The Synchronous Q-Learning Algorithm
4.7 Asynchronous One Policy Linkup Dynamic Programming
4.8 Identification Phase of the SDA
4.9 Build-up Phase of the SDA
4.10 The Removal Pass for a Single Goal State
4.11 The Addition Pass for a Single Goal State
4.12 The Removal Pass for Multiple Goal States
4.13 The Addition Pass for Multiple Goal States
5.14 Transition definitions and costs of the test system
5.15 Q-values generated for the test system
5.16 Comparison of complexities of the Synchronous Q-Learning Algorithm, Algorithm 2, and the Doubling Algorithm
5.17 Worst case upper bounds for absolute cumulative fractional approximation errors for the SQL algorithm, Algorithm 2, and the Doubling Algorithm
5.18 Method for generating temporally and spatially correlated approximation error
5.19 Fractional degradation of the control policy with zero-mean e
5.20 Fractional degradation of the control policy with positively skewed e
5.21 Fractional degradation of the control policy with negatively skewed e
5.22 f-values generated for the test system
5.23 Worst case fraction of the linkup operations in one linkup cycle not performed by using the Addition Pass for a Single Goal State

List of Figures

2.1 The basic adaptive critic control system
2.2 The critic system as adapted by HDP
2.3 The overall ADAC system
2.4 An adaptive non-recurrent subgoal generator
2.5 A summary of the literature review. Each box immediately above a thick line is a category which applies to all boxes under that box. There is no additional inheritance relationship between boxes in the same vertical column
2.6 A {state} -> {evaluation} mapping
2.7 A {state, control} -> {evaluation} mapping
2.8 A {state, future-state, control} -> {evaluation} mapping
2.9 A {state, future-state, control, horizon} -> {evaluation} mapping
3.10 Graphical representation of the linkup operation
4.11 Why the + policy may be different from the * policy
4.12 Illustration of the tree of trajectories for a policy identical at all time steps
4.13 Illustration of the tree of trajectories for different policies at each time step
4.14 Contributions from different trajectory families
5.15 State transitions in the 7 x 7 deterministic simulation
5.16 Comparison of the mean accumulated fraction errors for the SQL algorithm, Algorithm 2, and the Doubling Algorithm
5.17 Comparison of the standard deviation of accumulated fraction errors for the SQL Algorithm, Algorithm 2, and the Doubling Algorithm
5.18 One-link manipulator
5.19 State trajectories for the Doubling Algorithm for zero-mean e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.20 State trajectory for Algorithm 2 for zero-mean e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.21 State trajectory for the Synchronous Q-Learning algorithm for zero-mean e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.22 State trajectories for the Doubling Algorithm for positively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.23 State trajectory for Algorithm 2 for positively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.24 State trajectory for the Synchronous Q-Learning algorithm for positively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.25 State trajectories for the Doubling Algorithm for negatively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.26 State trajectory for Algorithm 2 for negatively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.27 State trajectory for the Synchronous Q-Learning algorithm for negatively skewed e. The large numbers below the graphs indicate e'_M, the maximum fractional error
5.28 State transitions in the 7 x 7 stochastic simulation
5.29 f-values as a function of number of state transitions for each state in Figure 5.18 while learning under the One Policy DP algorithm
5.30 f-values as a function of number of state transitions for each state in Figure 5.18 while learning under the TD(λ) algorithm
5.31 f-values as a function of number of state transitions for each state in Figure 5.18 while learning under the ALDP algorithm

List of Symbols

x(t), x_t: State vector.
u(t): Control vector.
f(x, u, t): First-order derivative of the state vector with respect to time; may be nonlinear.
J: Optimality criterion to be maximized.
h(x(t_f), t_f): Final time payoff function.
g(x, u, t): Payoff function.
t_f: Final time.
t_0: Initial time.
J_a: Augmented optimality criterion.
p: Costate vector.
H: Hamiltonian.
x*: A state on the optimal state trajectory.
u*: Optimal control action. It is a vector.
p*: A costate on the optimal costate trajectory.
b, b: First-order derivative of the state vector with respect to time; may be nonlinear.
σ(x_t): An n x n matrix relating white noise Brownian motion to dx_t.
w_t: A vector Brownian motion.
α: Discount factor.
c(x_t, u_t): Instantaneous cost.
Φ(x_T): Final cost at time T.
T: Final time.
V: Value function.
i, j: Indices to the elements of a matrix.
a_ij: The (i, j)th entry of σσ'.
U: Set of permissible action vectors.
v: A particular action vector.
E_t,x: The expectation given X_t = x.
Ψ: The fundamental matrix solution of Eq. (2.17).
s: A time variable.
e(t): Noise vector with an unknown but time-invariant distribution.
c(x(t), u(t)): Immediate cost.
E: Expectation.
N: Terminal time index.
z: Final primary reinforcement at an unspecified and variable final time.
W: Weight vector of the adaptive critic.
α: Learning rate in TD(λ).
P_t, P_k: Prediction of z at time t or k, respectively.
k: A time index.
λ: Discount factor.
m: Total number of time steps to reach the end of a run.
c_t: Immediate cost incurred at time t.
S: State set of a discrete-time dynamic system.
n: The last state in the state set S.
s_t: State at time t.
i: A particular state, usually used to denote the predecessor state in a state transition.
j: A particular state, usually used to denote the successor state in a state transition.
u: An action.
U(i): Set of permissible actions for state i.
p_ij(u): State transition probability from state i to state j when action u is applied.
c_i(u): Immediate cost incurred when action u is applied in state i.
μ: Control policy or control law. Also a mapping from state to action.
γ: Discount factor.
c_t: Immediate cost at time t.
f_μ(i): The expected total infinite-horizon discounted cost that will be incurred if policy μ is consistently used and i is the initial state.
E_μ: The expectation assuming the controller always uses policy μ.
f(i): Abbreviated form of f_μ(i).
k: The stage of Synchronous DP.
f*(i): The optimal evaluation function for state i.
f_k(i): Stage k estimate of f*(i).
f: The complete set of f_k(i) for all states. Also called the evaluation function, f-mapping, or f-function.
k_t: The total number of stages of updates of the f-mapping completed up to time t.
p^t_ij(u): The maximum-likelihood system model at time t.
n_ij(t): The observed number of times before time t that action u was applied in state i and the system made a transition to state j.
n^u_i(t): The number of times action u was applied in state i.
Q_f(i, u): The cost of action u in state i as evaluated by f.
f*: The optimal evaluation function. Also the complete set of f*(i) for all states.
Q*(i, u): The cost of action u in state i as evaluated by f*.
Q_k(i, u): Stage k estimate of Q*(i, u).
S_k: The set of state-action pairs scheduled to be updated at stage k.
α_k(i, u): The stage k learning rate for state i and action u.
successor(i, u): The successor state in the next time step when action u is applied in state i.
u_t: Action at time t.
Q_t(s_t, u_t): Time t estimate of Q*(s_t, u_t).
f_t(s_t+1): Time t estimate of f*(s_t+1).
Q_μ(i, u): The expected infinite-horizon discounted cost of executing action u in state i as evaluated by policy μ.
μ*: An optimal control policy.
f_μ(j): The expected infinite-horizon discounted cost at state j as evaluated by policy μ.
f_μ*(j): Identical to f*(j).
Q_μ*(i, u): Q_μ(i, u) for the optimal policy μ*.
c̄_i(u): Expected value of c_i(u).
d: A time index.
n: The horizon.
μ_t: The control policy at time t.
μ⃗: A vector of control policies.
μ⃗*: The optimal μ⃗.
f_μ⃗(s_0, s_n, n): The expected infinite-horizon discounted cost at starting state s_0 and ending state s_n as evaluated by policy vector μ⃗ with a horizon of n.
E_μ⃗: Expectation as evaluated by policy vector μ⃗.
μ⃗(s_t): An action specified by policy vector μ⃗, state s_t, and time t.
f*(s_0, s_n, n): f_μ⃗(s_0, s_n, n) for an optimal policy μ⃗*.
f_μ⃗*(s_0, s_n, n): Identical to f*(s_0, s_n, n).
Q*(s_0, s_n, u_0, n): The expected infinite-horizon discounted cost of action u_0 in starting state s_0, as evaluated by f*(s_0, s_n, n), when all trajectories end up at state s_n in n time steps.
n_a: An intermediate time step.
s_na: An intermediate state.
P_i,sna: The n_a-step transition probability from state i to state s_na, given that u_0 is executed and then policy vector μ⃗ is used.
μ*_t: The optimal control policy at time t.
μ⃗_*: The optimal policy vector for the associated post-circuit.
μ⃗_+: The optimal policy vector for the associated pre-circuit.
Q_+(i, s_na, u, n_a): The expected finite-horizon discounted cost of action u in starting state i, as evaluated by the pre-circuit policy vector, when all trajectories end up at state s_na in n_a time steps.
Q: A simple notation for a member of, or the entire set of, Q variables of the type being addressed.
Q̂: An estimate of the corresponding Q*.
m: The fixed horizon of the pre-circuit in Algorithm 2.
f̂: An estimate of the corresponding f.
F_A: A set of f̂ satisfying the pre-circuit specification in the ALDP algorithm.
F_B: A set of f̂ satisfying the post-circuit specification in the ALDP algorithm.
n'_a: Dummy time index in the Build-up Phase of the SDA.
s*: Optimal state trajectory.
s*_t: Optimal state at time t.
S: Number of states in the complexity comparison.
A: Number of actions in the complexity comparison.
H: The horizon in the complexity comparison.
ε: Maximum absolute value of the negative fractional error for universal function approximation Method 1.
δ: Maximum absolute value of the negative fractional error for universal function approximation Method 2.
ε_c(h): Maximum absolute value of the cumulative negative fractional error for horizon h.
ε'_c(h): A loose upper bound for the absolute value of the cumulative negative fractional error for horizon h.
a: The magnitude of the a term in the error upper bound proofs.
b: The magnitude of the b term in the error upper bound proofs.
e': Error variable used to track simulated correlated approximation error in time and space.
e'_M: Maximum fractional error.
e(s, t): The fractional error in the extracted 7 x 7 core of the super grid.
e: Uniformly distributed random variable in [-1, 1].
Φ: Number of states in the system.
φ: Maximum number of states that any particular state can reach in 1 time step.
ψ: Maximum number of states that can reach any particular state in 1 time step.

Glossary

* policy: The control policy which produces an optimal post-circuit (see Section 3.5).
+ policy: The control policy which produces an optimal pre-circuit (see Section 3.5).
AC: Adaptive critic (see Section 2.4.5).
ACE: Adaptive critic element (see Section 2.3.6).
ADAC: Action dependent adaptive critic (see Section 2.3.10).
ADDHP: Action dependent dual heuristic programming (see Section 2.3.10).
ADDP: Action dependent dynamic programming (see Section 2.4.15).
ADHDP: Action dependent heuristic dynamic programming (see Section 2.3.10).
ADP: Asynchronous dynamic programming (see Section 2.4.8).
AHC: Adaptive heuristic critic (see Section 2.3.7).
ALDP: Asynchronous linkup dynamic programming (see Section 4.4).
ASE: Associative search element (see Section 2.3.6).
Actual horizon: The horizon accumulated by Q-Learning in actual execution, rather than by definition (see Section 3.1).
Actor-critic: A reinforcement learning system consisting of a control element (actor) and a DP evaluation element (critic).
Adaptive critic: The critic in an actor-critic system.
Addition pass: An algorithm used to add the reward effects of a goal state to a set of Q functions.
Approximator: Universal function approximator.
Backup: The update operation in One-Step Q-Learning, Eq. (3.29).
Circuit: The complete set of trajectories generated by the system under a given control policy or control policy vector, given the initial state, the end state, and the number of time steps.
Control policy: Control law.
Control policy vector: A vector of control policies where the elements are n policies for a horizon of n.
DP: Dynamic programming.
Deterministic linkup relation: Eq. (4.42).
End state: The last state of a circuit or the last state of a Q function.
Evaluation: The performance of an explicit or implicit control law as evaluated by an optimality criterion.
GSDP: Gauss-Seidel dynamic programming (see Section 2.4.8).
Goal state: In Section 4.7, the state with a reward component in its immediate costs.
Goal Set: A set of goal states (see Section 4.7.4).
HDP: Heuristic dynamic programming (see Section 2.3.8).
Hierarchy: A multi-level system where the upper levels have coarser resolutions of state and time than the lower levels.
Horizon: The time period spanned by an evaluation function.
Immediate cost: The one time step cost incurred by being in a given state and executing a given action.
Inner minimization: The minimization with respect to the action for a given intermediate state in the various linkup relations.
Intermediate state: In a linkup operation, the end state of the Q+ value and the start state of the Q* value.
LADDP: Linkup action dependent dynamic programming.
Linkup: A linkup operation or a linkup cycle, depending on the usage.
Linkup operation: The addition of a Q+ and a Q* at one particular intermediate state.
Linkup relation: Eq. (3.37).
Linkup cycle: A learning cycle where the linkup operations are done for all possible intermediate states.
Macro: A control policy to reach a given goal state. Used in the context of the discussions on hierarchies and re-use of skills.
NSG: Neural subgoal generator.
Neurocontrol: Any control technique which uses neural networks as a component.
OPDP: One policy dynamic programming (see Section 5.2.2).
One-Step Q-Learning: The algorithm described in Section 3.1.
One policy linkup relation: A linkup relation where the purpose is to evaluate a single control policy with respect to an optimality criterion and the entire set of immediate costs.
Outer minimization: The minimization with respect to the intermediate state in the various linkup relations.
Payoff: Immediate cost.
Policy iteration: A class of DP algorithms where the computation of the evaluation of a control law and the computation of a control law from the evaluation are done alternately.
Policy: Control law.
Post-circuit: A family of trajectories whose evaluation is the value of one Q* on the RHS of Eq. (3.37).
Post-circuit policy: The control policy vector which generates a given post-circuit.
Pre-circuit: A family of trajectories whose evaluation is the value of one Q+ on the RHS of Eq. (3.37).
Pre-circuit policy: The control policy vector which generates a given pre-circuit.
Q-Learning: An incremental, action dependent DP method.
RTDP: Real time dynamic programming (see Section 2.4.9).
Re-use: Using evaluation functions or skills of one goal for another goal.
Reached state: The state the system happened to reach after executing a control policy.
Reinforcement learning: A class of neurocontrol techniques with the characteristic that the control system only requires the immediate cost and the state to learn the optimal control incrementally.
Removal pass: An algorithm used to remove the reward effects of a goal state from a set of Q functions.
Reward effects: The difference in the Q values that the reward value at a goal state makes.
SDA: Stochastic doubling algorithm (see Section 4.5).
SDP: Synchronous dynamic programming (see Section 2.4.7).
SADDP: Synchronous ADDP (see Section 4.3.4).
Spatial aggregation: The aggregation of neighbouring states as a means to reduce the resolution of evaluation functions.
Start state: The first state of a circuit or the first state of a Q function.
Subgoal: An intermediate goal state.
TD(λ): A temporal difference learning algorithm (see Section 2.3.6).
TDL: Temporal difference learning (see Section 2.3.6).
Temporal resolution: The time period spanned by a state transition, or the time period increment of a backup or linkup operation.
Temporal aggregation: The aggregation of neighbouring time periods as a means to reduce the resolution of evaluation functions.
Value iteration: A class of DP algorithms where the iteration is done over the evaluations, without computing the control policy at any time except after completion.

Acknowledgement

I am obligated to my supervisors, Prof. P. D. Lawrence and Prof. M. R. Ito, for their guidance and patience throughout this program. Special thanks go to Ken Buckland for his technical assistance and eclectic discussions. I would like to extend my appreciation to Rob Ross for not rebooting the network during my massive simulation runs. Thanks go to Paul Tschetter, Brian Spears, Greg Grudic, Don Gillies, and Herve le Pocher for their friendship and advice. This thesis would not have been possible without the support and encouragement of my family. My deepest gratitude goes to my wife, Lisa, for her faith and laughter throughout these trying years.

Chapter 1

Introduction

1.1 Motivation for this Work

Dynamic Programming (DP) and Action Dependent Dynamic Programming (ADDP) used with universal function approximators (for simplicity, just approximators) are powerful methods for the solution of many machine learning and optimal control problems. Reinforcement Learning (RL) is the field in which DP and ADDP (with approximators) have been studied most intensively in recent years. The DP and ADDP methods in RL have tended to be tailored to fit the dominant class of problems in RL, namely machine learning problems. In the application of DP and ADDP (with approximators) to linked robot optimal control problems, the current RL methods take no advantage of some of the characteristics of robotics, such as the availability of robot models. Problems of even moderate size require the use of approximators to store the various mappings in these algorithms.
If approximators can approximate the various mappings required in RL to a sufficient level of accuracy, they can be used to overcome the problem of combinatoric explosion [1] in many real world applications. The approximators, the mappings, and how the mappings are approximated will be described after the details of the RL algorithms are introduced in Chapter 2.

The RL algorithms of the value iteration type are heavily affected by the accumulation of approximation errors [2]. The precise meaning of value iteration will be explained in Chapter 2. Roughly, it means that the "values" or numerical "evaluations" of the final optimal control policy with respect to a pre-defined optimality criterion are computed by iterating on the values rather than on the different policies. On each iteration, the values of the previous iteration are used, with other new information, to compute the new values at the current iteration. The approximation errors of the old values are compounded by the approximation errors of the new values. As the number of iterations grows, this accumulation can cause the final control policy to be non-optimal and even unstable. Each iteration changes the values by accounting for an additional time period. Fewer iterations, achieved by lengthening the time period, may reduce such accumulation. This time period is not variable in most existing RL algorithms. This author is not aware of any satisfactory solution. This shortcoming is one motivation behind the research reported in this work.
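To make the mechanism concrete, the following is a minimal, illustrative sketch of a single synchronous value-iteration stage for a small deterministic, discrete-time, discrete-state system; it is not an algorithm from this thesis. The names costs, successor, and GAMMA are assumed placeholders for the immediate-cost model, the transition model, and the discount factor.

```python
# A minimal, illustrative sketch (not the thesis's algorithm): one synchronous
# value-iteration stage for a deterministic discrete-time, discrete-state
# system.  The names costs, successor and GAMMA are assumed placeholders.

GAMMA = 0.9  # discount factor

def value_iteration_stage(f_old, states, actions, costs, successor):
    """Compute the stage-(k+1) values entirely from the stage-k values.

    Because every new value is built on top of the previous stage's values,
    any approximation error stored in f_old is carried forward and compounded
    by the error introduced at the new stage, which is the accumulation effect
    described above.  Each stage extends the horizon covered by the values by
    one base time step.
    """
    f_new = {}
    for i in states:
        f_new[i] = min(costs[i][u] + GAMMA * f_old[successor(i, u)]
                       for u in actions[i])
    return f_new
```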
A second motivation is that if such a scheme for varying the time resolution were to be developed, it would permit the future development of hierarchical control systems updated by DP and ADDP methods. More complex optimal control problems, to be presented in Chapter 2, are probably most effectively solved by hierarchies. This is because a single level controller (or equivalently, a controller using a single resolution of time and space) is not sufficient to effectively and efficiently solve problems which require optimization over a wide range of temporal and spatial resolutions. Therefore, the same controller must have different levels, or a hierarchy of different resolutions. RL is a promising means of learning optimal actions in at least some aspects of a hierarchy.

1.2 Problem Statement

This thesis addresses the problem of the accumulation of approximation errors by DP and ADDP methods and provides for the extension of DP and ADDP methods to hierarchies. More specifically, a number of algorithms are proposed: 1) for deterministic systems, to find optimal control policies a) from all policies simultaneously (i.e., "value iteration") or b) by adapting existing optimal policies for one goal to a new goal, and 2) for stochastic systems, to find optimal control policies by evaluating the optimality of one policy at a time (i.e., "policy iteration").

The new algorithms belong to a new class of algorithms proposed here called Linkup Action Dependent Dynamic Programming (LADDP). LADDP is based on the idea that the quantities being summed (or backed up) in DP and ADDP should be in different temporal resolutions in different levels of a hierarchy. It is also based on Schmidhuber's method of recursively breaking down a goal into smaller subgoals [3]. In control terms, a goal is equivalent to the task of optimally moving from a start state to a specified end state. The result is a method of summing the evaluations of two state/control trajectories when the preceding trajectory is more than 1 base time unit long. This summing operation is called a Linkup. Linkups allow the length of time covered by an evaluation to grow very quickly, depending on the algorithm, even exponentially with respect to the number of iterations.
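As an informal illustration of why the covered horizon can grow exponentially, the sketch below doubles the horizon of a table of evaluations for a deterministic system following a single fixed policy. It is a simplification, not the Linkup Relation of this thesis (which is stated over Q-values and involves a minimization over intermediate states); the names V, reach, and GAMMA are assumed placeholders.

```python
# A simplified, illustrative sketch of the horizon-doubling flavour of a
# linkup for a deterministic system under one fixed policy.  V[n][s] is the
# n-step discounted cost of following the policy from state s, and
# reach[n][s] is the state reached after those n steps.  Not the thesis's
# Linkup Relation; names are assumed placeholders.

GAMMA = 0.9

def linkup_double(V, reach, n):
    """Build 2n-step evaluations from n-step evaluations in a single pass.

    The n-step evaluation of the preceding (pre-) trajectory is added to the
    discounted n-step evaluation of the trajectory continuing from the state
    where it ends, so one pass doubles the horizon instead of extending it by
    a single base time step.
    """
    V2, reach2 = {}, {}
    for s in V[n]:
        mid = reach[n][s]                        # state reached after n steps
        V2[s] = V[n][s] + (GAMMA ** n) * V[n][mid]
        reach2[s] = reach[n][mid]                # state reached after 2n steps
    V[2 * n] = V2
    reach[2 * n] = reach2
```

Starting from one-step evaluations, k such passes would cover a horizon of 2^k base time steps.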
The LADDP concept and algorithms are descended from a study of RL problems and algorithms, hence the emphasis on RL methods throughout this thesis. The new algorithms are not intended for the same problem definitions as the existing RL algorithms, which target the Markovian Decision Task. The problem definition for the new algorithms will be described in more detail in Chapter 4.

1.3 Contributions

A number of advances were expected. An increase in the speed of learning with respect to the number of iterations was expected. Since the number of iterations required to compute evaluations covering a certain length of time is much less than that in DP and ADDP, an advantage in the accumulation of errors was also expected. Lastly, it was expected that this investigation would allow RL to apply to hierarchical control. These advances are largely realized by this work.

It should be mentioned that a reduction in approximation errors does not necessarily lead to better control performance, because the control law is computed by comparing the relative fitness of all possible actions at each point in the state space or state-action space. Ideally, the approximation errors should be reduced in such a way that the errors at each point in the approximation are equal. Then the relative fitness would not be affected by errors. If the errors are reduced to zero, this would in fact be true. This suggests that reducing the approximation errors absolutely is likely to be a fruitful approach to yielding optimal control laws.

This thesis makes the following contributions to the study of DP and ADDP methods with approximation, reinforcement learning, and their connection to learning in a hierarchy:

1. Derivation of three Linkup Relations, which are the foundations for using linkup operations to compute Q-values in QL and f-values in DP, in various situations.

2. Proposal of one asynchronous and two synchronous algorithms for a deterministic version of the Markovian Decision Task, and proposal of one asynchronous algorithm for a stochastic version of the Markovian Decision Task, but with one control policy operating at all times.

3. Proposal of algorithms to efficiently change the goal of the top level Q-functions produced by the Linkup algorithms.
1.5 Position of my Work in the Literature N e u r o c o n t r o l c a n be d i v i d e d i n to m a n y classes, based b o t h o n the class of p r o b l e m a n d o n the app roach t aken . Re in fo rcemen t L e a r n i n g ( R L ) is a m a j o r class of neu rocon t ro l Chapter 1. Introduction 6 m e t h o d s w h i c h has been deve lop ing for m a n y years [4] [5] [6] [7] [8] [9] [10] [11] [12]. R L inc ludes Q - L e a r n i n g ( Q L ) , T e m p o r a l Difference L e a r n i n g ( T D L ) a n d i n c r e m e n t a l forms of D y n a m i c P r o g r a m m i n g ( D P ) for con t ro l . R L is d i s t i nc t f r o m the o ther classes of neu-r o c o n t r o l because the cont ro l le r learns to o p t i m i z e a func t i on of the payoffs or costs. A s y s t e m o p t i m a l l y con t ro l l ed c a n e x h i b i t r i c h behav iours , e spec ia l ly i f the s y s t e m is non-l inea r a n d / o r discrete . S ince R L is app l i cab l e to non l inear a n d / o r discre te sys tems, i t has been recogn ized tha t R L c o n t r o l can p rov ide d r a m a t i c a l l y new capab i l i t i e s i n cont ro l le rs , e spec ia l ly i f the c o n t r o l s y s t e m has h ierarchies , m u l t i p l e representa t ions of states a n d cont ro l s , p a t t e r n r ecogn i t i on capab i l i t i e s , a n d other s t ruc tures u s u a l l y associa ted w i t h a r t i f i c i a l in te l l igence . Bes ides several o lder R L a lgo r i t hms [8] [13], m a n y a lgo r i t hms connec ted w i t h D P [12] [14] have appea red recent ly . T h i s connec t ion is espec ia l ly s ignif icant because " D P is the o n l y exac t a n d efficient m e t h o d for l ea rn ing o p t i m a l con t ro l of u n k n o w n , non l inea r and s tochas t ic sys tems w i t h o u t m a k i n g h i g h l y spec ia l i zed a s sumpt ions about the sy s t em" [15]. T h e o lder a l go r i t hms are also connec ted to D P , bu t w i t h the a s s u m p t i o n tha t o n l y one c o n t r o l p o l i c y is o p e r a t i n g at a l l t imes , a con t ro l p o l i c y b e i n g another t e r m for a c o n t r o l l a w . Together , these a lgo r i t hms presen t ly o c c u p y a large n iche i n the con t ro l w o r l d w i t h o u t a c o m p e t i t o r a n d are w o r t h y of s tudy. Q L [12] is one of the mos t i m p o r t a n t asynchronous A D D P a l g o r i t h m s because i t is w e l l s u p p o r t e d t heo re t i ca l l y a n d is the basis for a n u m b e r of R L a l g o r i t h m s . T D L is a t y p e of acce le ra ted supe rv i sed l e a r n i n g for the p r e d i c t i o n of fu ture values g i v e n a sequence of past a n d present values. T h e p r i n c i p a l a l g o r i t h m i n T D L is T D ( A ) [10]. Together w i t h ce r t a in o ther a l g o r i t h m s for l i n k i n g i ts results w i t h the con t ro l l e r ac t ions , i t has h i s t o r i c a l l y been a pa r t of R L . S o m e of the mos t impress ive resul ts i n R L are o b t a i n e d u s ing T D ( A ) a l g o r i t h m s . D P has l ong been k n o w n [1] as a power fu l m e t h o d for f ind ing the eva lua t ions of lowest cost pa ths i n Sequen t i a l D e c i s i o n Tasks . L i k e T D ( A ) , together Chapter 1. Introduction 7 w i t h o ther a l g o r i t h m s , i n c r e m e n t a l forms of D P are also an i m p o r t a n t pa r t of R L . T h e deta i ls of a l l 3 types of a lgo r i t hms w i l l be covered i n C h a p t e r 2. 
In a dynamic system, state and time can each be either discrete or continuous, leading to four possible combinations. All RL methods assume the system has one of the following combinations: 1) discrete-time and discrete-state, or 2) discrete-time and continuous-state. All RL methods assume that the model of the dynamics is unknown and that a model of the dynamics is difficult to construct by identification. All RL methods assume that the system is stochastic. Lastly, all RL methods learn incrementally. These assumptions and the incrementality of learning are the components of the RL problem definition.

However, RL being the focal point of research in the combined use of DP/ADDP and approximators means that the results should be cross-utilized for other problem types which are also amenable to solution using DP and approximators. The following are two problem types where one of the components of the traditional RL problem definition is not applicable:

Known Dynamics or Easy to Construct Dynamic Models: If the model of the dynamics is known or can be easily identified and constructed, algorithms which use the model to build optimal control policies may give better results than algorithms which assume the dynamics are unknown. There are many problems with well understood dynamics for which, due to nonlinearity, discontinuity, or certain types of optimality criteria, no satisfactory solution has been found. The walking machine, to be discussed in more detail later, is such a problem. The fact that their problem type does not fit the traditional RL problem definition does not mean RL results should not be borrowed and extended to better suit them.
F u r t h e r m o r e , the sy s t em m o d e l m a y be a b less ing ra ther t h a n a b u r d e n , s ince m a n y p h y s i c a l sys tems cannot w i t h s t a n d pro longed exposure to the r a n d o m pol ic ies used i n i n c r e m e n t a l l ea rn ing . M a n y o p t i m a l con t ro l p rob lems i n robot ics require e i ther of the 2 componen t s l i s t ed above. T h e concepts i n t r o d u c e d and a lgo r i t hms proposed i n th is thesis are subs t an t i a l l y m o d i f i e d D P a n d A D D P concepts and a lgo r i t hms in t ended for the even tua l so lu t ion of these p rob l ems a n d s i m i l a r types of p rob lems . T h e proposed a l g o r i t h m s are pos i t i oned i n the ex tan t l i t e r a tu re i n the fo l lowing way: . 1. T h e d e t e r m i n i s t i c L A D D P a lgo r i t hms proposed .are i m p r o v e m e n t s over ADDP al-gorithms in RL i n the case where the sy s t em is d i sc re te - t ime/d i sc re te - s ta te or d i sc re te - t ime /con t inuous - s t a t e , the m o d e l of the d y n a m i c s is k n o w n , the sy s t em is d e t e r m i n i s t i c , and i n c r e m e n t a l l ea rn ing is not r equ i red a n d not acceptable . Fo r the same case, the goal change a lgo r i t hms proposed p r o v i d e u n i q u e new capa-b i l i t i e s i n the a d a p t a t i o n of an ex i s t i ng o p t i m a l con t ro l p o l i c y for one goal to a different goa l . T h e l ea rn ing s i tua t ion for b o t h types of p roposed a lgo r i t hms is different f r o m those i n A D D P a lgo r i t hms i n R L i n tha t these A D D P a lgo r i t hms assume d isc re te - t ime/d i sc re te -s ta te , u n k n o w n d y n a m i c s , s tochas t ic sys tems, a n d Chapter 1. Introduction 9 i n c r e m e n t a l l e a rn ing . 2. T h e s tochas t ic L A D D P a l g o r i t h m proposed is an i m p r o v e m e n t over DP algorithms and TD(\) in RL i n the case where the sys t em is e i ther d i sc re te - t ime/d i sc re te - s ta te or d i sc re te - t ime /con t inuous - s t a t e , the m o d e l of the d y n a m i c s is u n k n o w n , the sys-t e m is s tochas t ic , a n d i n c r e m e n t a l l e a rn ing is r equ i r ed or acceptab le . T h e l ea rn ing s i t u a t i o n is i d e n t i c a l for t he s tochas t ic L A D D P a l g o r i t h m , a n d D P a l g o r i t h m s a n d T D ( A ) i n R L . 3. C o n t i n u i n g f r o m the sec t ion o n L i m i t a t i o n s , no a l g o r i t h m has been p roposed to c o m p a r e w i t h ADDP algorithms in RL i n the case where the s y s t e m is e i ther d i sc re te - t ime/d i sc re te - s ta te or d i sc re te - t ime /con t inuous - s ta te , the m o d e l of the dy-n a m i c s is u n k n o w n , the sys t em is s tochas t ic , a n d i n c r e m e n t a l l e a rn ing is r equ i red or acceptab le . 1.6 Format of Thesis T h e r e m a i n i n g chapters of th is thesis are o rgan ized as fol lows: Chapter 2 A b r o a d ove rv iew of techniques of o p t i m a l con t ro l of non l inea r a n d discre te sys tems w i l l be presented, end ing w i t h a focus o n the types of a lgo r i t hms tha t th is thesis addresses. T h e purpose of th is chapter is not to descr ibe the p rob lems tha t t he w o r k i n th i s thesis is i n t e n d e d to solve, b u t t o b r i n g together i n one p lace the var ious so lu t i on me thods for s i m i l a r p r o b l e m types . 
T h e purpose is to p r o m o t e a m o r e comple t e ove rv iew a n d to unde r s t and how var ious approaches fit i n the bigger p i c t u r e . S o m e o u t s t a n d i n g issues of these a lgo r i t hms w i l l be br ie f ly d iscussed. Chapter 3 T h e n o t i o n of L i n k u p L e a r n i n g w i l l be deve loped f r o m the idea of b r e a k i n g a goa l i n to subgoals and the need to ex t end R L to hierarchies . A n u m b e r of subt le t ies Chapter 1. Introduction 10 of L i n k u p L e a r n i n g w i l l be br ief ly d iscussed. Chapter 4 T h e p r o b l e m class to be so lved , M a r k o v i a n D e c i s i o n Tasks , w i l l be fo rmu-l a t ed precise ly . T h e s t ra ight fo rward a p p l i c a t i o n of L i n k u p L e a r n i n g to th is p r o b l e m class w i l l be cons idered by i n t r o d u c i n g a p r o t o t y p e a l g o r i t h m . A n u m b e r of issues w h i c h w o u l d m a k e th is diff icul t w i l l be presented, a long w i t h two so lu t ions . A f a m i l y of a l g o r i t h m s for the case of d e t e r m i n i s t i c sys tems a n d a n a l g o r i t h m for s tochas t ic sys tems w i l l be presented. A n a l g o r i t h m w h i c h s h o u l d be a n a t u r a l for s tochas t ic s y s t e m w i l l be m e n t i o n e d . A l g o r i t h m s to change the goa l of the top leve l Q- func t ions w i l l be presented and e x p l a i n e d . T h e relevance of L i n k u p L e a r n i n g to a l e a rn ing i n a h i e ra rchy w i l l be discussed. Chapter 5 T h e o r e t i c a l a n d s i m u l a t i o n results of the a lgo r i t hms p roposed i n C h a p t e r 4 w i l l be shown a n d discussed. Poss ib le var ia t ions of the p roposed a lgo r i t hms w i l l be discussed. A s t rong reason to reject the n a t u r a l a l g o r i t h m desc r ibed i n C h a p t e r 4 w i l l be presented. Chapter 6 T h i s chapter w i l l i n c l u d e genera l discussions, conc lus ions , specu la t ions , a n d suggestions for future work . Chapter 2 Framework of Field In th is chapter , the f r amework of the f ie ld of n o n l i n e a r / d i s c r e t e o p t i m a l con t ro l is re-v i e w e d . T h e purpose is to show how D P me thods re la te to o ther so lu t i on to n o n l i n -ea r /d i sc re t e o p t i m a l con t ro l p rob lems . N o c l a i m is m a d e tha t the m e t h o d s p roposed i n th i s thesis solve the p rob l ems tha t a l l of these me thods target , except tha t D P me thods have advantages over o ther me thods i n ce r t a in s i tua t ions and the p roposed m e t h o d s have advantages over e x i s t i n g D P me thods i n ce r t a in o ther s i tua t ions . T h e reader w h o is not in teres ted i n th is b a c k g r o u n d m a t e r i a l m a y w i s h to sk ip fo rward to S e c t i o n 2.6. T h i s chapter reviews the choices ava i lab le to the c o n t r o l s y s t e m designer w h o m i g h t want to solve ce r t a in types of p rob lems . These p rob lems have the f o l l o w i n g character is-t i cs : 1. C o n t r o l is o p t i m a l ; 2. T h e s y s t e m is p o t e n t i a l l y h i g h - d i m e n s i o n a l ; 3. T h e s y s t e m m a y be cont inuous non l inea r or discrete . If the s y s t e m is con t inuous , i t m a y be s t rong ly non l inea r a n d not confined to spec ia l classes of non l i nea r i t y ; 4. T h e s y s t e m m a y be s tochast ic ; 5. It is not k n o w n a priori the correct s tate t r a j ec to ry or con t ro l t r a j ec to ry ; 11 Chapter 2. Framework of Field 12 6. 
T h e s y s t e m m o d e l is unava i l ab le and the s t ruc tu re of the sy s t em m o d e l m a y also be unava i l ab le ; a l t e rna t ive ly , the sy s t em m o d e l m a y be ava i lab le , bu t i t s m a t h e m a t i c a l d e s c r i p t i o n is not eas i ly used to der ive an o p t i m a l c o n t r o l l a w . T h i s l i s t of charac te r i s t ics are not necessar i ly those of the p rob l ems tha t the p roposed m e t h o d s target , bu t are c losely re la ted . S o m e of the so lu t i on m e t h o d s , n a m e l y the D P m e t h o d s , are the bases of the p roposed me thods . Some of the p rob l ems encounte red i n the use of the D P me thods are those addressed by the p roposed m e t h o d s . The se charac te r i s t ics are not des igned to e l i m i n a t e a l l b u t a few c o n t r o l me thods . R a t h e r , they are n a t u r a l charac ter i s t ics of m a n y p r a c t i c a l con t ro l a n d m a c h i n e i n t e l l i -gence p r o b l e m s . A clear e x a m p l e is the w a l k i n g m a c h i n e p r o b l e m . A w a l k i n g m a c h i n e w i t h severa l legs has a large n u m b e r of degrees of f reedom. T h e s y s t e m d y n a m i c s are h i g h l y non l inea r and are d i scon t inuous w h e n the legs m a k e or b reak con tac t w i t h the g r o u n d . T h e chains of l i n k s are i n t e r m i t t e n l y open a n d c losed loop . T h e e n v i r o n m e n t tha t the m a c h i n e is i n t ended to func t i on i n is t y p i c a l l y rugged a n d u n p r e d i c t a b l e . T h i s i n -t roduces s tochas t i c i t y i n to the d y n a m i c s . T o have a w a l k i n g m a c h i n e b a l a n c i n g , w a l k i n g , t u r n i n g , a n d c l i m b i n g , to n a m e a few bas ic tasks, involves the p r o b l e m of f i nd ing con t ro l t ra jec tor ies to resul t i n these des i red goals a n d to keep the states out of some regions of the state space, e.g., regions where the legs are c o l l i d i n g or where the m a c h i n e is f a l l i ng over . T o m a k e the search of the con t ro l t ra jector ies m o r e d i f f icul t , the des i red t r a j ec to ry of the states is not k n o w n . F u r t h e r m o r e , we w o u l d l i ke the m a c h i n e to be o p t i m a l i n t e r m s of energy efficiency, speed, m a r g i n of safety, etc. T h e d y n a m i c s are c o m p l e x . E v e n the s i m p l i f i e d d y n a m i c mode l s t y p i c a l l y used i n w a l k i n g m a c h i n e research are diff icul t to m a n i p u l a t e to y i e l d useful , let alone o p t i m a l , con t ro l laws. A w a l k i n g m a c h i n e is also an i n t u i t i v e p l a t f o r m to i l l u s t r a t e m a n y tasks tha t are be ing inves t iga ted i n robot ics and m a c h i n e l ea rn ing . Severa l po in t s i n th i s thesis w i l l be Chapter 2. Framework of Field 13 e x p l a i n e d w i t h examples f r o m the con t ro l of w a l k i n g mach ines . T h i s chapter shows tha t we have o n l y a few choices for the so lu t i on of the s ta ted p r o b l e m t y p e a n d tha t the j u r y is s t i l l out o n w h i c h of these choices w i l l p rove to be the i d e a l s o l u t i o n . T h e focus is t hen shif ted to a p r o m i s i n g class of m e t h o d s a n d the o u t s t a n d i n g research issues i n v o l v e d . 
T h e fo l lowing sections are o rgan ized a long three m a j o r l ines of d iv i s ions : cont inuous- or d i sc re te - t ime , cont inuous- or discrete-s tate , a n d d e t e r m i n i s t i c or s tochas t ic d y n a m i c s . 2.1 Deterministic Continuous-Time and -State T h e m e t h o d s i n con t inuous - t ime o p t i m a l con t ro l for b o t h d e t e r m i n i s t i c case a n d stochas-t i c case are somewha t r e m o t e l y re la ted to the focus of th is paper . T h e r ev iew be low is m e a n t o n l y to g ive a flavor of how these o p t i m a l con t ro l p r o b l e m s are approached a n a l y t i c a l l y a n d n u m e r i c a l l y . T h i s r ev iew of these approaches m a y serve to b roaden the pe r spec t ive of re inforcement l ea rn ing theoris ts and r emove the sense of vagueness about how th ings are done ou ts ide of tha t f ie ld . 2.1.1 Problem definition T h e non l inea r d y n a m i c s y s t e m i n ques t ion is often desc r ibed i n th is f o r m : where x(t) is the state vec tor a n d u(t) is the con t ro l vec tor . T h e exact na tu re of the f u n c t i o n a l / depends o n the assumpt ions of the m e t h o d used to solve the p r o b l e m . W e are in te res ted i n h a v i n g / at least be m o r e t h a n 1 d i m e n s i o n a n d non l inea r . T h e o p t i m a l i t y c r i t e r i o n is u s u a l l y of the f o r m x = f(x,u,t) (2.1) (2.2) Chapter 2, Framework of Field 14 where t0 is the s tar t t i m e and tf is the end t i m e . T h e po in t of o p t i m a l c o n t r o l is to f i n d a c o n t r o l l aw u(t) w h i c h m a x i m i z e s J. T h e d y n a m i c equa t ion makes the c o n t r o l p r o b l e m a cons t r a ined o p t i m i z a t i o n p r o b l e m . W h i t e a n d J o r d a n [16] r ev i ewed the a n a l y t i c a l a n d n u m e r i c a l s o l u t i o n m e t h o d s for th is class of p rob l ems . 2.1.2 Analytical: Calculus of Variations Since the o p t i m i z a t i o n p r o b l e m is the m i n i m u m of a f unc t i ona l , J , th i s is a p r o b l e m i n the ca lcu lus of va r i a t ions . T o take the d y n a m i c equa t ion in to account i n the o p t i m a l i t y c r i t e r i o n J, costate var iables are i n t r o d u c e d to f o r m an augmen ted o p t i m a l i t y c r i t e r i o n Ja i n the f o l l o w i n g way: where p is a t i m e v a r y i n g vec tor ca l l ed the costates. O p t i m i z i n g Ja w i t h respect to x, p and u is equivalent to o p t i m i z i n g J w i t h respect to u. T h e resul t is a s t a n d a r d set of necessary cond i t ions for o p t i m a l i t y [16, page 187]. T h e i n i t i a l c o n d i t i o n for the state equa t ion a n d the f ina l c o n d i t i o n for one of the equat ions ca l l ed the costate equa t i on have to be found also. T h e o p t i m a l c o n t r o l p r o b l e m is now a two-po in t b o u n d a r y va lue p r o b l e m . A n a l y t i c so lu t ions to these cond i t ions are genera l ly unava i l ab le . C o m m o n l y k n o w n n u m e r i c a l a l g o r i t h m s for the so lu t ion of different ia l equat ions a n d b o u n d a r y value prob-lems are di f f icul t to a p p l y because of the sp l i t b o u n d a r y cond i t i ons a n d the i n s t a b i l i t y of one of the state a n d costate equat ions . T h e n u m e r i c a l m e t h o d s tha t w o u l d w o r k w i l l be desc r ibed i n a la ter sec t ion . (2.3) Chapter 2. 
Framework of Field 15 2.1.3 Analytical: Pontryagin's Minimum Principle If the c o n t r o l p r o b l e m is fur ther cons t r a ined b y inequa l i t i e s o n the cont ro ls or the states, i t c an be h a n d l e d b y P o n t r y a g i n ' s m i n i m u m p r i n c i p l e . A clear char t of the o p t i m a l c o n t r o l a p p r o a c h based o n th is p r i n c i p l e together w i t h a w o r k e d out e x a m p l e is i n (Sage [17]). T h e bas ic steps are the fo l l owing : 1. Def ine a va r i ab l e c a l l e d t he H a m i l t o n i a n : H = g+pTf; (2.4) 2. F i n d the adjoint or canonic equat ions : 3. Def ine a fur ther c o n d i t i o n : H{x\u\p\t) <H{x\ u,p*,t) (2.7) for a l l a d m i s s i b l e u(t). O r a l t e rna t ive ly , th i s c o n d i t i o n c a n be expressed as u*{t) = arg m i n (g(x, u. t) + pTf(x, u,«)) ; (2.8) 4. O b t a i n the two po in t b o u n d a r y cond i t ions : i(t0) = x0, (2.9) /, v _ dh(x(tf),tf)^ P { t f ) ~ dm(tj) ' ( 2 - 1 0 ) 5. So lve the t w o po in t b o u n d a r y value p r o b l e m ei ther a n a l y t i c a l l y or u s ing a n u m e r i c a l a l g o r i t h m . T h e las t s tep, i f done a n a l y t i c a l l y , q u i c k l y becomes u n w i e l d y for rea l i s t i c sys tems. So for r ea l i s t i c p r o b l e m s , n u m e r i c a l me thods are u s u a l l y e m p l o y e d to solve the two po in t b o u n d a r y va lue p r o b l e m . Chapter 2. Framework of Field 16 2.1.4 Numerical: Variation of Extremals T h e V a r i a t i o n of E x t r e m a l s [16] a t t emp t s to i t e r a t i v e l y e s t ima te an i n i t i a l va lue for the costate equa t ions , the reby conve r t i ng the two-po in t sp l i t b o u n d a r y value p r o b l e m in to an eas i ly so lvable i n i t i a l va lue p r o b l e m . A f t e r each fo rward pass, the guess for the i n i t i a l va lue is u p d a t e d u s ing N e w t o n ' s m e t h o d w i t h a n u m e r i c a l l y - d e t e r m i n e d J a c o b i a n m a t r i x tha t relates i n i t i a l costates to final costates. N e w t o n ' s m e t h o d can perhaps be rep laced b y some o ther o p t i m i z a t i o n m e t h o d . T h e success depends o n the closeness of the first guess to the t rue i n i t i a l va lue . 2.1.5 Numerical: Quasilinearization Q u a s i l i n e a r i z a t i o n [16] [18] breaks up the non l inea r o p t i m i z a t i o n i n t o a sequence of l inear o p t i m i z a t i o n p rob l ems . L i k e the v a r i a t i o n of ex t remal s m e t h o d , a first guess needs to be m a d e b u t i t is o n the state a n d costate t ra jector ies . T h e n a r educed ve r s ion of the state a n d costate equat ions are l i nea r i z ed about these t ra jector ies . These l i n e a r i z e d different ia l equa t ions are in t eg ra ted to fit the k n o w n values of i n i t i a l cond i t i ons . T h e i n i t i a l values of the costates f r o m th is so lu t ion is used to in tegra te the o r i g i n a l non l inea r equat ions fo rward . A set of s tate a n d costate t ra jector ies are o b t a i n e d a n d the who le process repeats . T h e success of th is m e t h o d aga in depends o n a good first guess. 2.1.6 Numerical: Steepest Descent T h e m e t h o d of Steepest Descent [16] begins by guessing a c o n t r o l t ra jec tory . Together w i t h the b o u n d a r y value o n the state, the state equat ions are in t eg ra t ed fo rward i n t i m e . 
T h e state t ra jec tor ies p r o d u c e d and the guessed con t ro l t ra jec tor ies enable the costate equat ions to be in teg ra ted b a c k w a r d . A f t e r th i s po in t , to o p t i m i z e the cost func t ion J we jus t need to o p t i m i z e the H a m i l t o n i a n w i t h respect to u a long the state a n d costate Chapter 2. Framework of Field 17 t ra jec tor ies [19] [20] [21]. In the l i t e ra tu re , the o p t i m i z a t i o n m e t h o d is steepest descent w i t h a l l i t s d r awbacks , bu t newer o p t i m i z a t i o n me thods m a y be subs t i t u t ed . 2.1.7 Numerical: Gradient Projection G r a d i e n t P r o j e c t i o n [16] [22] [23] avoids the two-po in t b o u n d a r y value p r o b l e m by con-v e r t i n g the o p t i m a l c o n t r o l p r o b l e m in to a non l inea r o p t i m i z a t i o n p r o b l e m where the cons t ra in t s are h a n d l e d d i r ec t ly , ra ther t h a n i n d i r e c t l y t h r o u g h the m e c h a n i s m of the costate equa t ions . T h e gradient p ro j ec t i on m e t h o d is so n a m e d because the gradient is p ro j ec t ed a long the cons t ra in t boundar ies . F o r systems w i t h convex cost func t ions a n d l inea r cons t ra in t s , the o p t i m a l po in t is found on the cons t ra in t boundar i e s . T h e n o n l i n -ear o p t i m a l c o n t r o l p r o b l e m has to undergo three t r a n s f o r m a t i o n before the non l inea r p r o g r a m m i n g t echn ique can be a p p l i e d . F i r s t , the s y s t e m is conve r t ed to d i sc re te - t ime . Second , the non l inea r d y n a m i c s is p iecewise l i nea r i zed . T h i r d , the dependence of the d y n a m i c s o n the states is r e m o v e d b y so lv ing the d i sc re te - t ime equa t i on i n t e rms of the c o n t r o l s ignals . G r a d i e n t p r o j e c t i o n is i t e ra ted w i t h the second a n d t h i r d t rans format ions to f i n d the o p t i m a l con t ro l t ra jec tory . T h e advantage of th is m e t h o d is i t s a b i l i t y to deal w i t h cons t ra in t s o n con t ro l a n d state t ra jector ies . T h e cons t ra in ts l i m i t the size of the search space. B u t s ince the o p t i m i z a t i o n is done d i r e c t l y i n the space of N step con t ro l vec tors , the c o m p l e x i t y of the search grows q u i c k l y w i t h the fineness of d i s c r e t i z a t i o n and the size of the c o n t r o l vectors . 2.1.8 Remarks on numerical methods T h e r e are m a n y rea l i s t i c p rob lems tha t can be so lved by the f o l l o w i n g n u m e r i c a l me thods . P r o b l e m s arise i f accura te non l inea r mode l s of the d y n a m i c s are not ava i lab le . These m e t h o d s m a y have p rob l ems w i t h convergence i n i n t eg ra t i ng the di f ferent ia l equat ions , p a r t i c u l a r l y the costate equa t ion . Chapter 2. Framework of Field 18 T h e r e are m a n y areas of poss ib le improvemen t s i n these techniques , m o s t o b v i o u s l y i n the choice of t he o p t i m i z a t i o n m e t h o d s a n d i n i t i a l guess se lec t ion m e t h o d s . Ident i f i -c a t i o n u s ing a p p r o x i m a t o r s m a y remove the need for p l an t mode l s expressed i n s y m b o l i c m a t h e m a t i c s . It s h o u l d be no t ed tha t these me thods are not m u t u a l l y exc lus ive of d y n a m i c pro-g r a m m i n g based n u m e r i c a l me thods and others to be covered la ter . 
It m a y be benef ic ia l to use h y b r i d approaches where some of these n u m e r i c a l m e t h o d s presented above can c o m p l e m e n t each other . 2.2 Stochastic Continuous-Time and -State T h e m a t h e m a t i c s of con t inuous - t ime s tochas t ic o p t i m a l con t ro l is qu i t e different f r o m the d e t e r m i n i s t i c t ype . T h e no ta t ions used i n mos t of the sections i n th i s t o p i c are f r o m one m a j o r t r a d i t i o n [24] [25]. 2.2.1 Problem definition T h e d e t e r m i n i s t i c sy s t em w h e n d i s t u r b e d b y w h i t e noise can be r e w r i t t e n i n th is s tochas t ic d i f ferent ia l equa t ion ( S D E ) m o d e l f o r m : where x denotes the state G Rn, wt is a vec tor B r o w n i a n m o t i o n , u : Rn —• Rm is a feedback c o n t r o l , a n d tr is an n x n m a t r i x . x = b(x, u) (2.11) dxt - b(xt, ut)dt + tr(xt)dwt (2.12) Chapter 2. Framework of Field 19 2.2.2 Analytical methods T h e o p t i m a l i t y c r i t e r i o n of a n a l y t i c a l me thods is s l i g h t l y different f r o m tha t i n n u m e r i c a l m e t h o d a n d so i t is desc r ibed separa te ly i n t h i s sec t ion . U n d e r several r e s t r i c t ive L i p s c h i t z cond i t ions , there is a u n i q u e s t rong so l u t i on w h i c h is a f u n c t i o n a l of a g i v e n driving B r o w n i a n m o t i o n [24]. U n d e r m u c h m i l d e r cond i t ions a weak so lu t i on i n a p r o b a b i l i t y space of s tate and d i s tu rbance exis ts . T h i s weak- so lu t ion concept a l lows m a n y t e c h n i c a l res t r ic t ions to be r emoved f r o m the theory. F o r a g iven c o n t r o l u, the cost f unc t i ona l genera l ly take the f o r m of rX J(u) = E i e-atc(xt,ut)dt + e-aT$(xT) Jo (2.13) where a > 0 is the d iscount factor , c(xt,ut) is the ins tan taneous cost , a n d $ ( * x ) the f ina l cost at t i m e T. 2.2.3 Analytical: Dynamic Programming F o r m a l a p p l i c a t i o n of B e l l m a n ' s p r i n c i p l e of o p t i m a l i t y [24] to E q . (2.12) a n d (2.13) leads to the B e l l m a n equa t i on + \ £ aS^C.") - «y('> •) + -Vj (E f£(*> v) + c(x, v)) = o dv dt »,j • •> \ / (2.14) V(T,x) = $(as) where a.j is the i,jth en t ry of the m a t r i x <r<rT. T h e B e l l m a n e q u a t i o n is a quas i - l inear p a r a b o l i c p a r t i a l d i f ferent ia l equa t i on for the value func t i on V(t, x) w i t h the b o u n d a r y c o n d i t i o n g i v e n at t = T. A large b o d y of theory exists for the quest ions of the exis tence of a so lu t i on to E q . (2.14) w i t h specif ied smoothness proper t ies , a n d of the co r respond ing o p t i m a l c o n t r o l [26] [27] [28]. Chapter 2. Framework of Field 20 2.2.4 Analytical: Martingale methods M a r t i n g a l e m e t h o d s [24] [29] [30] [31] are weak so lu t ion m e t h o d s tha t define a t y p e of m i n i m u m of the expec t ed cost E q . (2.13) a n d show tha t i t is a sub -mar t i nga l e [32]. It is poss ib le t o p rove t ha t t h i s m i n i m u m is a m a r t i n g a l e i f a n d o n l y i f t he c o n t r o l is o p t i m a l . V a r i o u s s t anda rd m a r t i n g a l e proper t ies a n d theorems are used to der ive a genera l m i n i m u m p r i n c i p l e a n d a m e t h o d for finding the o p t i m a l c o n t r o l . If the reader is no m o r e en l igh tened after th i s de sc r ip t i on , i t is because the m a t h e m a t i c s i n v o l v e d defy a concise s u m m a r y . 
T o o b t a i n a m o r e de ta i l ed s u m m a r y , see [24, page 3521]. T h e m a r t i n g a l e a p p r o a c h appl ies equa l ly w e l l t o n o n - M a r k o v i a n sys tems where the coefficients of the s y s t e m d y n a m i c s E q . (2.12) depends o n the past state t r a j ec to ry r a the r t h a n jus t the cur ren t s tate. 2.2.5 Analytical: Minimum Principles Stochas t i c m i n i m u m pr inc ip l e s can be d e r i v e d b y me thods analogous to the d e t e r m i n i s t i c case b y convex analys is [33] or by a p p l i c a t i o n of general e x t r e m a l theory [34] [24]. T h e m a i n conce rn is w i t h cha rac t e r i z ing a p r o b a b i l i s t i c ve r s ion of the P o n t r y a g i n costate d y n a m i c s . If « o is a n o p t i m a l M a r k o v i a n con t ro l , t h e n the d y n a m i c s of the costates [35] is desc r ibed by (d$ rT dc \ p(t,x) = -Et,x f ! _ ( « T ) t f ( r , t ) + j f ^[«„u°( 5 ,«.)]*(M)<fa) (2-15) where Ettx is the e x p e c t a t i o n g iven xt = x a n d is the f u n d a m e n t a l m a t r i x so lu t ion of the l i n e a r i z e d equa t i on (2.16) Chapter 2. Framework of Field 21 2.2.6 Numerical: Finite element approximation of the DP equation C o n t r o l l e d di f ferent ia l equat ions p e r t u r b e d b y a noise can be m o d e l e d as c o n t r o l l e d dif-fus ion processes [25]. T h e cond i t ions for o p t i m a l c o n t r o l are o b t a i n e d b y the d y n a m i c p r o g r a m m i n g m e t h o d . These cond i t ions are expressed as a non l inea r p a r t i a l d i f ferent ia l e q u a t i o n ca l l ed the H a m i l t o n - J a c o b i equa t ion c o n t a i n i n g the pa r t i a l s of y , the expec ted f in i t e -ho r i zon d i scoun ted in t eg ra l of the payoffs, w i t h respect to t i m e a n d the states. T h e so lu t i on of th is equa t ion gives y as a f unc t i on of t i m e a n d the states. In genera l , th i s e q u a t i o n cannot be so lved a n a l y t i c a l l y . If the d i m e n s i o n of the states is s m a l l , n u m e r i c a l f in i te-e lement m e t h o d s can be used to a p p r o x i m a t e i ts so lu t i on . In the f o r m u l a t i o n of so lu t ion m e t h o d , i t is a ssumed tha t the s y s t e m m o d e l is k n o w n , the noise m o d e l is k n o w n and belongs to a ce r t a in class, a n d the effect of the noise on the states is k n o w n . T h e r e is also a hypothes i s of s m o o t h d y n a m i c s . T h e a c t u a l f in i te e lement equat ions depend o n the user 's choice of 1) p a r t i t i o n of the t ime-s ta te space and 2) i n t e r n a l a p p r o x i m a t i o n s . T h i s dependence makes the i m p l e m e n t a t i o n of the m e t h o d c u m b e r s o m e . H o w e v e r , w i t h the r ight choice of i n t e r n a l a p p r o x i m a t i o n , a genera l ized H o w a r d a l g o r i t h m gives the so lu t i on . A convergence resul t is ava i lab le w h i c h shows tha t the a p p r o x i m a t e so lu t i on approaches the t rue so lu t ion as the e lement size goes to zero [25]. T h i s m e t h o d leads to g o o d n u m e r i c a l resul ts . T h e disadvantages of the m e t h o d seem to be 1) the r equ i rement of var ious mode l s and 2) the diff icul t t r a n s l a t i o n process f r o m the mode l s a n d des ign choices to the f ini te e lement equa t ions . It is not c lear tha t the second process can be a u t o m a t e d easily. 
Therefore, even if the system model and noise model can be identified, it would still be difficult to automate the entire design process.

No equation is presented in this summary because a large number of equations would have to be presented to adequately show the circuitous derivation of the finite element equations from the dynamic model, the noise model, and the design choices. Such a complete coverage of this method is beyond the scope of this thesis.

2.2.7 Numerical: Discretization of the stochastic control problem

A continuous-time and -state controlled diffusion process can be time-discretized if the system, noise, and cost models satisfy some very specific conditions [25] [36]. It can be shown that such optimal control problems have solutions. For each admissible diffusion process, the discrete-time process approximating it has a precise set of admissible state transition probabilities. A discrete-time dynamic programming equation can then be defined. The finite-horizon sum of the costs is computed iteratively backwards in time. The idea is to approximate the actual control problem by a Markov chain control problem. It can be shown that the probability law of the optimally controlled Markov chain converges to the law of the optimal diffusion as the discretization interval tends to zero [25] [36]. This convergence is sufficient to prove the convergence of the optimal discrete cost to the optimal continuous cost. In addition, there is a further step that can be taken to discretize the system in space [37].

This approach is useful for correctly discretizing a certain class of continuous stochastic control problems. Discretization is one possible way of solving such problems. As with the approach described in the previous section, this is also a dynamic programming method which is useful only when the dimension of the state is small.

2.3 Discrete-Time and Continuous-States

We want to find the optimal control for the system

x(t+1) = f(x(t), u(t), t) + e(t),

where x ∈ R^n is the state vector, u ∈ R^m is the action vector, and e ∈ R^n is the noise vector. f can be nonlinear. The elements of e have an unknown but time-invariant distribution. After u, calculated from some control law, is executed at time t, either some method or the system itself returns an immediate utility value U(t). The purpose is to find a control law which optimizes some criterion. If the system is deterministic, i.e., e(t) = 0 for all t, one type of optimality criterion is

J = \sum_{t=0}^{N} U(x(t), u(t)),   (2.17)

where N may be finite or N = ∞. If the system is stochastic, one type of optimality criterion is

J = E\left[ \sum_{t=0}^{N} U(x(t), u(t)) \right].   (2.18)
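For concreteness, the following short Python fragment sketches how the criteria of Eqs. (2.17) and (2.18) can be evaluated by simulation for a given control law: a single run accumulates the cost for the deterministic case, and a Monte Carlo average over runs estimates the expectation for the stochastic case. The dynamics f, utility U, control law, noise model, and the constants in the example are hypothetical placeholders introduced only for illustration; they are not drawn from the problems discussed in this thesis.

import random

def rollout_cost(f, U, control_law, x0, N, noise=None):
    """Evaluate Eq. (2.17) for one run: the sum of U(x(t), u(t)) over t = 0..N.
    f, U, control_law, and noise are placeholders for the problem at hand."""
    x, total = x0, 0.0
    for t in range(N + 1):
        u = control_law(x, t)
        total += U(x, u)
        e = noise() if noise else 0.0          # e(t) = 0 recovers the deterministic case
        x = f(x, u, t) + e
    return total

def expected_cost(f, U, control_law, x0, N, noise, runs=1000):
    """Monte Carlo estimate of Eq. (2.18): average the run costs over many rollouts."""
    return sum(rollout_cost(f, U, control_law, x0, N, noise)
               for _ in range(runs)) / runs

# Hypothetical one-dimensional example: drive x toward zero with a linear law.
f = lambda x, u, t: 0.9 * x + u
U = lambda x, u: x ** 2 + 0.1 * u ** 2
law = lambda x, t: -0.5 * x
print(rollout_cost(f, U, law, x0=1.0, N=20))
print(expected_cost(f, U, law, x0=1.0, N=20, noise=lambda: random.gauss(0.0, 0.01)))

The sketch also makes plain why testing a candidate control law this way is expensive: every rating of the controller requires whole trajectories, and many of them in the stochastic case.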
There appears to be no great difference between the optimal control approaches for deterministic and stochastic systems. If the plant is deterministic, there are two major alternatives for finding the best sequences:

1. Try all possible control sequences and choose the best one according to the total cost, which is available at the end of the sequence; and

2. Use dynamic programming (DP) to reduce the time it takes to find the best sequence.

Regardless of the particulars of the system (continuous or discrete in time or states), if we know that the mathematical model representing the controller can represent an optimal controller, then we can try to find the parameters of the mathematical model directly. The difficulty lies in the test to determine the optimality of any particular set of parameter values. With the control problem defined above, the controller would have to test a huge number of sequences of state trajectories to obtain an accurate rating of the particular controller. This is an unacceptably cumbersome test. Furthermore, if the structure of the optimal controller is not known beforehand, a universal structure (or approximator) would have to be used. Most approximators have a large number of parameters. Therefore, the search space can be vast. The first alternative is practical only for small problems. To tackle larger and/or stochastic problems, DP is the method of choice because it vastly reduces the complexity of the test mentioned above.

2.3.1 Analytical methods

The problem defined above can have an analytical solution using DP, if the system and the optimality criterion are simple enough. An important example is the linear quadratic regulator [16]. Once again, if the problem is more complex, analytical methods become intractable.

2.3.2 Numerical methods

Most of the methods under this section heading require, or could benefit from, function approximation techniques such as neural networks [38] [39] [40]. Function approximation enables the storage of high dimensional functions. The structure of the function approximator may also be useful for computing derivative information, which is useful for optimization. The approximation capability of neural networks [41] [42] [43] has resulted in many new control methods. The non-optimal neurocontrol methods have in common with RL control the capability to compute a control law and probably store it in a function approximator.
The method of obtaining the control law is the difference. Only optimal neurocontrol will be discussed in this thesis. The reader interested in learning more about neurocontrol in general can consult [44] [45] [46] [15] [47] [48] [49] [50].

2.3.3 Neural network optimization with the Minimum Principle

In (Fong et al. [51]), the tasks in the minimum principle are performed by using a neural network to represent the controller. Instead of optimizing in the space of N-step control vectors, the optimization is done with respect to the weights of the neural net. Together with automatic differentiation, the necessary conditions for optimal control in the minimum principle are used to compute the gradient of the Hamiltonian with respect to the weights. An implicit assumption is that the same control law is applied at all time steps. This treats the finite-time problem as if it were infinite-time and eliminates the explosion of possible control trajectories. It is not known whether this treatment would yield good results in all cases. Good performance was obtained for the inverted pendulum problem. The principal difficulty is the requirement for a good plant model for the purpose of computing the partials of the dynamics with respect to the weights.

2.3.4 Reinforcement Learning

Reinforcement Learning (RL) has long been recognized to be different from supervised and unsupervised learning [52]. In a typical reinforcement learning situation, the learner is not provided instructions on the right actions or outputs to take, but there is a teaching signal which is an evaluation of the learner's actions or outputs. When RL was applied to sequential decision tasks, it became apparent that it is a method of self-learning optimal control for very general systems. But RL is more than simply self-learning optimal control. The generality of the system, state, and action descriptions and RL's ability to learn the right action without instructions mean it has tremendous relevance to machine intelligence. The scope of self-learning control for nonlinear and/or discrete systems is much larger than optimal control for linear systems. Some of the applicable problems are

1. Continuous dynamic systems, e.g., walking machines,

2. Discrete or higher cognitive level systems, e.g., learning artificial intelligence machines, and

3. Hybrid (low/high level and continuous/discrete state) systems, e.g., learning survival machines.

The term low level refers to control using information closer to the finest temporal and spatial resolution of states. The term high level refers to control at a coarser resolution.
2.3.5 Reinforcement Learning: Policy Iteration

Reinforcement learning can be divided into two main classes: policy iteration and value iteration. The difference between them is that policy iteration effectively evaluates one policy at a time by iterating over policies, and value iteration effectively evaluates all policies at the same time by iterating over the evaluations (or values) of policies. Many algorithms deviate from these compact generalizations.

Policy iteration is optimization at one point in the policy space at a time. It is a 2-step method which is best explained by looking at non-incremental versions of policy iteration. The term non-incremental means that each of the 2 steps is completed before the other step is done. In non-incremental versions, the 2 steps are as follows: in the 1st step, the policy is fixed while the learning system computes an evaluation of this policy; in the 2nd step, this evaluation is used to compute or suggest a new policy to try. This is much like optimization of functions. This 2-step method of optimizing a controller is certainly preferable to directly optimizing the controller parameters and using some other means to test each control law thoroughly. The incremental versions do not wait until an evaluation is complete before the next policy is tried. The evaluation and optimization processes take place concurrently. The convergence of the evaluator and the controller adapting concurrently is just beginning to be explored [53]. The description of value iteration will be left to Subsection 2.3.9. The rest of this section is devoted to policy iteration.

[Figure 2.1: The basic adaptive critic control system. A block diagram in which the Controller sends Actions to the Plant, the Plant returns States, and the adaptive critic receives the States and the Primary Reinforcement and outputs the Approximate Secondary Reinforcement.]

In Figure 2.1, the controller and adaptive critic are usually represented by approximators, commonly feedforward neural networks. The 2-net adaptive critic system (also called an actor-critic system) in Figure 2.1 learns to approximate a total cost or performance criterion (secondary reinforcement) at the current state that is inherent to the control problem. At each time step, the Adaptive Critic (AC) is presented the state and the primary reinforcement. The primary reinforcement can be of at least two types: a binary performance indicator and an immediate cost. The true secondary reinforcement is a probabilistically and temporally weighted summation of primary reinforcement sequences, e.g., Eq. 2.17 and 2.18.
The parameters in the adaptive critic are changed via supervised learning to reduce the difference between the approximate secondary reinforcement and the true secondary reinforcement. With time, the output of the adaptive critic is supposed to closely approximate the true secondary reinforcement. While the AC is being trained, the controller is optimized in some manner. Thus, it is advantageous if the AC is able to learn the true mapping quickly. There is a wide variety of optimization techniques available, including local optimization techniques such as those presented in [54] and global optimization techniques such as stochastic optimization [55] and genetic algorithms [56]. The exact manner of training the controller depends on a number of factors, one of which is the form of the AC. Another is how the approximate secondary reinforcement affects the parameters of the controller. In Sections 2.3.6 to 2.3.8, methods of adapting the AC and the controller will be described.

The actor-critic system is more general than low level optimal control. The states can be high level descriptions as found in artificial intelligence (AI). The action could also simply be a high level description of actions and activities internal to the system, for example, a code for different conventional control schemes. That code could indirectly specify a large and highly time-varying action vector.

If the controller can be adapted such that the output of the adaptive critic is at an optimum for all combinations of goals and states, the controller would be optimal for the given performance criterion. The controllers generated by the adaptive critic methods are intended to achieve this objective: perform the job of a conventional controller (tracking and stabilizing) while optimizing an externally imposed criterion. The stability is embedded in the criterion. Unlike a conventional optimal controller, where the control law is derived analytically from the dynamics, the noise models, and the performance criterion, AC methods optimize a performance measure by learning from its values during run time. AC methods are suitable for certain types of problems in control. The following are some examples (a schematic sketch of the overall arrangement follows the list):

• most generally, to find control and state trajectories which would optimize a criterion,

• to find the best control trajectory (according to some criterion) to take the state from one point to another,

• to keep the states out of certain domains of the state space or, conversely, to take the states to a certain domain, and

• to optimize a criterion which is evaluable only at the end of a sequence.
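To make the two-mapping arrangement of Figure 2.1 concrete, the following Python sketch runs a minimal tabular actor-critic on a hypothetical five-state chain task: the critic is updated toward a one-step evaluation target, and the action preferences (the controller) are adjusted by the same error. This is only a schematic illustration of the incremental form of policy iteration described above; the chain task, the softmax action selection, and the learning constants are assumptions made for the example and do not correspond to any specific algorithm from the cited literature.

import math, random

N_STATES, ACTIONS = 5, (0, 1)                 # state 4 is terminal; action 1 moves right
V = [0.0] * N_STATES                          # critic: approximate secondary reinforcement
pref = [[0.0, 0.0] for _ in range(N_STATES)]  # controller: action preferences per state
alpha_v, alpha_p, gamma = 0.1, 0.1, 0.95

def act(s):
    """Softmax over preferences; actions with higher accumulated cost become less likely."""
    z = [math.exp(-p) for p in pref[s]]
    r, cum = random.random() * sum(z), 0.0
    for a in ACTIONS:
        cum += z[a]
        if r <= cum:
            return a
    return ACTIONS[-1]

def step(s, a):
    """Plant: move left or right on the chain; every transition incurs a cost of 1."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, 1.0

for episode in range(2000):
    s = 0
    while s != N_STATES - 1:
        a = act(s)
        s2, cost = step(s, a)
        target = cost + (0.0 if s2 == N_STATES - 1 else gamma * V[s2])
        delta = target - V[s]                 # evaluation error for the current policy
        V[s] += alpha_v * delta               # 1st step: improve the evaluation
        pref[s][a] += alpha_p * delta         # 2nd step: adjust the policy away from costly actions
        s = s2

print([round(v, 2) for v in V])               # roughly the discounted cost-to-go of the learned policy

Because both updates use the same error signal within one pass through the data, the sketch corresponds to the incremental version in which evaluation and policy improvement proceed concurrently.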
2.3.6 Basic TD(λ)

If the primary reinforcement is available only when the system reaches some terminal state, we can use the TD(λ) family of algorithms [10] [12]. The TD(λ) algorithms seem to be a type of prediction algorithm that converges quickly. The mapping used is applicable to nonlinear systems. An example of a problem with such a primary reinforcement is the inverted pendulum problem in [8] and [9]. In those papers, a variant of the TD(λ) algorithm is used to adapt the ACE net of the ACE/ASE 2-net system.

The purpose of the algorithm by itself is to update the weights of a predictor, given multiple and different sequences of states ending in a final primary reinforcement z, such that the predictor can learn to predict z with as few sequences as possible. P_t = P_t(x, W) is the prediction of z at time t, where x is the state vector of a system and W is the weight vector of the adaptive critic. At each time step t, a weight change vector is computed as follows:

\Delta W_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_W P_k,   0 \le \lambda \le 1.

W is changed to W + \sum_{t=1}^{m} \Delta W_t at the end of a complete run, where m is the total number of time steps to reach the end. The rationale for the algorithm can be summarized as follows:

1. At each time step t, in the fashion of typical supervised-learning algorithms, the correct weight change is \Delta W_t = \alpha (z - P_t) \nabla_W P_t.

2. The prediction error can be divided up as follows:

z - P_t = \sum_{k=t}^{m} (P_{k+1} - P_k),

where P_{m+1} = z.

3. With some algebra, we can see

W + \sum_{t=1}^{m} \alpha (z - P_t) \nabla_W P_t = W + \sum_{t=1}^{m} \alpha \sum_{k=t}^{m} (P_{k+1} - P_k) \nabla_W P_t   (2.19)
= W + \sum_{k=1}^{m} \alpha \sum_{t=1}^{k} (P_{k+1} - P_k) \nabla_W P_t   (2.20)
= W + \sum_{t=1}^{m} \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_W P_k   (2.21)

Note that m is a free parameter which depends on how long it takes to reach a terminal state in a particular run.

The above is a deterministic explanation. See [57] and [58] for a stochastic explanation. Proofs of the convergence and optimality of linear TD(0) (i.e., λ = 0) are available [10]. Also, Dayan [57] and then Jaakkola et al. [59] recently proved convergence for the general case of TD(λ). Convergence with probability 1 is proved for the discrete case and for linear representations [58].

The formulation above does not explicitly address control. TD(λ) is equivalent to Q-Learning (to be presented later) if the control law is fixed. With the 2-net design, two controllers [10] [12] based on this algorithm were successful in learning to control an inverted pendulum. The idea is to get a quick estimate of the evaluation of a control law and then use some kind of optimization method to better it. It may be able to handle servo problems, but this is not addressed in [10]. This algorithm has also been tested on a backgammon game with impressive results [60] [61].
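The update rule above can be stated compactly in code for a linear predictor with a one-of-n state encoding, in which case \nabla_W P_k is simply the feature vector of the state visited at step k and the inner sum can be maintained as a decaying trace. The following Python sketch applies it to a hypothetical five-state random walk whose terminal outcome z is 0 or 1; the task and the constants are illustrative only and are not the pendulum or backgammon experiments cited above.

import random

N, alpha, lam = 5, 0.05, 0.8
W = [0.5] * N                                   # weight vector of the linear predictor

def features(state):
    x = [0.0] * N
    x[state] = 1.0                              # one-of-n encoding, so P_t = W[state]
    return x

def run_sequence():
    """One random-walk sequence from the middle state; z = 1 if it exits right, 0 if left."""
    seq, s = [], N // 2
    while 0 <= s < N:
        seq.append(s)
        s += random.choice((-1, 1))
    return seq, (1.0 if s >= N else 0.0)

for _ in range(2000):
    seq, z = run_sequence()
    trace = [0.0] * N                           # running sum_k lambda^(t-k) grad_W P_k
    delta_W = [0.0] * N                         # weight changes gathered over the run
    for t, s in enumerate(seq):
        x = features(s)
        trace = [lam * e + xi for e, xi in zip(trace, x)]
        P_t = W[s]
        P_next = z if t + 1 == len(seq) else W[seq[t + 1]]   # P_{m+1} = z at the terminal step
        err = P_next - P_t
        delta_W = [d + alpha * err * e for d, e in zip(delta_W, trace)]
    W = [w + d for w, d in zip(W, delta_W)]     # weights changed only at the end of the run

print([round(w, 2) for w in W])                 # should approach 1/6, 2/6, ..., 5/6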
2.3.7 Adaptive Heuristic Critic

Adaptive Heuristic Critic (AHC) [13] is a variant of the basic TD(λ) algorithm. The connection with TD(λ) is best illustrated in [10, pages 39-40]. It is appropriate if the primary reinforcement is observable in each time step, instead of a single primary reinforcement at the end of a run. More precisely, if some process generates costs c_{t+1} at each state transition from t to t+1, we may want P_t to predict the discounted sum

\sum_{k=0}^{\infty} \gamma^{k} c_{t+k+1},

where γ, 0 ≤ γ < 1, is the discount-rate parameter. Then the weight update rule given is

\Delta W_t = \alpha (c_{t+1} + \gamma P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_W P_k.   (2.22)

2.3.8 HDP (with A_{r-p} or gradient methods)

The idea of using adaptive critics to approximate dynamic programming was first extensively covered by Werbos. He described a class of algorithms called Heuristic Dynamic Programming (see Section 2.3.8) [14]. A proof of the consistency of the algorithm for one particular problem is available. Werbos claims that HDP is a generalization of TD(λ). Since neither the control u nor a goal are inputs to the adaptive critic, the prediction J is really the equivalent of f_k(i) in Section 2.4.6. This algorithm has been successfully applied to the manufacture of composite materials [62], a difficult control problem. The consistency (convergence of the prediction to the correct value) of HDP has been proven for its application to a simple reinforcement learning problem [14].

[Figure 2.2: The critic system as adapted by HDP. Two copies of the adaptive critic map x(t) to J(t) and x(t+1) to J(t+1); J(t+1), the cost U(t), and the discount factor γ form the training target for J(t).]

Heuristic Dynamic Programming (HDP) [63] is a method of training an adaptive critic. The critic system is illustrated by Figure 2.2. The reinforcement c may be available continuously, unlike that assumed in the TD(λ) family of algorithms. The focus of HDP is not on the function of the overall system but on the adaptation of the adaptive critic. Here, the critic is supposed to learn, given the current control law, a prediction of the long term cost based on the current state. Training is done in the adaptation phase of pass n. The output target of the critic for time t, within pass n, is

J(x(t+1), W^{n-1}) + c(x(t)),

where x is the state and W^{n-1} is the vector of weights of the adaptive critic at pass n-1. It should be mentioned that c may have an implicit c_0 that may cause divergence. There are methods to deal with this [14].
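The pass-based construction of the critic target can be sketched as follows, assuming a critic J(x, W) that is differentiable in its weights. The trajectory, the cost function c, the linear critic form, and the learning rate below are hypothetical stand-ins; only the use of the previous pass's weights W^{n-1} inside the target J(x(t+1), W^{n-1}) + c(x(t)) follows the description above, and the divergence issues mentioned in the text are ignored.

# Schematic of one HDP adaptation pass with a frozen copy of the previous pass's weights.
def hdp_pass(trajectory, c, J, grad_J_W, W_prev, W, lr=0.01):
    """trajectory: list of states x(0..T); W_prev: weights frozen from pass n-1."""
    for t in range(len(trajectory) - 1):
        x, x_next = trajectory[t], trajectory[t + 1]
        target = J(x_next, W_prev) + c(x)          # critic output target for time t
        err = target - J(x, W)
        W = [w + lr * err * g for w, g in zip(W, grad_J_W(x, W))]  # supervised step
    return W

# Hypothetical linear critic on a scalar state: J(x, W) = W[0] + W[1]*x.
J = lambda x, W: W[0] + W[1] * x
grad_J_W = lambda x, W: [1.0, x]
c = lambda x: x * x
traj = [1.0, 0.8, 0.5, 0.2, 0.0]                   # an illustrative recorded state trajectory
W = [0.0, 0.0]
for n in range(50):                                # passes; targets always use the previous pass's weights
    W_prev = list(W)
    W = hdp_pass(traj, c, J, grad_J_W, W_prev, W)
print([round(w, 3) for w in W])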
While the adaptive critic is being trained, the controller can be adapted by the A_{r-p} [64] algorithm. If the critic is implemented with a continuous feedforward network, the gradient of J(t) with respect to x(t) can be computed by the dual form [65, Section 10.6] of the backpropagation algorithm (dual backpropagation for short) and made available to the controller. If the critic mapping is implemented by a different type of function approximator but is still continuous, the same may still apply. The system of passing information from the critic to the controller is called a Backpropagated Critic. The backpropagated critic, a relatively new method, eases the bottleneck presented by the older A_{r-p} algorithm, which passes only a scalar value to the controller and uses it to adjust the controller weights in a manner similar to the mechanism in Stochastic Learning Automata [32].

HDP tends to break down, through very slow learning, as the size of the problem grows bigger [66]. To get around this, Werbos and his group have proposed substantially more complex forms of HDP, called Dual Heuristic Programming (DHP) [67] and Globalized DHP [68]. They make extensive use of dual backpropagation in moving information around more complex systems of mappings. The general idea is to use the critic to approximate the gradient of J with respect to x.

2.3.9 Reinforcement Learning: Value Iteration

The two-mapping systems in policy iteration make it clumsy to compute the optimal actions because of the lack of an action specification in the inputs to the adaptive critic. The Action Dependent Adaptive Critic (ADAC) and a more complex form of ADAC proposed by Werbos [66] can overcome this problem.

[Figure 2.3: The overall ADAC system. The action-dependent critic outputs J(t); its training target is J(t+1) + U(t).]

2.3.10 ADAC or ADHDP

ADAC is also called Action Dependent Heuristic Dynamic Programming (ADHDP). It is the continuous-state version of Q-Learning, to be presented later. Figure 2.3 shows the structure of the ADAC system. J(t+1) + U(t) is the target output to be learned by the critic. The partial derivative of J with respect to u is computed by dual backpropagation and used to change the weights in the controller by moving along the gradient. A complex form of ADAC has the critic approximating the partial derivatives of J with respect to u and x. This form is called Action Dependent Dual Heuristic Programming (ADDHP) [67]. Once again, it makes extensive use of dual backpropagation. At present there is little collective experience in DHP, GDHP, and ADDHP.
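A minimal sketch of the ADAC arrangement follows, assuming a scalar plant, a quadratic action-dependent critic, and a linear control law u = -k x. Only the target J(t+1) + U(t) and the use of dJ/du to adjust the controller follow the description above; everything else (the plant, the cost, the exploration noise, and the learning rates) is a hypothetical choice made so that the fragment runs.

import random

f = lambda x, u: 0.9 * x + u                       # hypothetical scalar plant
U = lambda x, u: x * x + 0.1 * u * u               # immediate cost U(t)

# Critic J(x, u, W) = W[0]*x*x + W[1]*u*u + W[2]*x*u, a simple quadratic form.
J = lambda x, u, W: W[0] * x * x + W[1] * u * u + W[2] * x * u
dJ_dW = lambda x, u, W: [x * x, u * u, x * u]
dJ_du = lambda x, u, W: 2.0 * W[1] * u + W[2] * x  # what dual backpropagation would supply

W = [0.0, 0.0, 0.0]
k = 0.0                                            # controller gain, u = -k*x
lr_c, lr_a = 0.05, 0.01

for episode in range(300):
    x = 1.0
    for t in range(20):
        u = -k * x + random.gauss(0.0, 0.1)        # exploration noise excites the action inputs
        x_next = f(x, u)
        u_next = -k * x_next
        target = J(x_next, u_next, W) + U(x, u)    # J(t+1) + U(t), the critic's target output
        err = target - J(x, u, W)
        W = [w + lr_c * err * g for w, g in zip(W, dJ_dW(x, u, W))]
        dJ_dk = dJ_du(x, u, W) * (-x)              # chain rule through u = -k*x
        k -= lr_a * dJ_dk                          # move the controller down the gradient of J
        x = x_next

print(round(k, 2), [round(w, 2) for w in W])

Because the action is an explicit input to the critic, the controller improvement step needs nothing beyond the derivative of the critic output with respect to that input, which is the point of the action-dependent formulation.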
2.4 Discrete-Time and -State

Discrete-state systems can be thought of as supersets of continuous-state systems. Although the number of states is usually assumed to be finite, the discreteness allows for arbitrary levels of nonlinearity. The discreteness also allows for a greater variety of solution methods coming from different lines of scientific inquiry.

2.4.1 Markovian Decision Tasks

A most widely studied class of discrete-state discrete-time systems is the Markovian Decision Task. A Markovian decision task is defined in terms of a discrete-time stochastic dynamic system with finite state set S = {1, ..., n}. At each time step, a controller observes the system's current state s_t = i and generates a control u from a finite set of admissible actions U(i), which is applied as input to the system. The system's state at the next time step will be s_{t+1} = j with the state-transition probability p_{ij}(u). The application of action u in state i incurs an immediate cost c_i(u). The goal is to find either a control policy μ which optimizes the expected total infinite-horizon discounted cost

E\left[ \sum_{t=0}^{\infty} \gamma^{t} c_{s_t}(\mu(s_t)) \right],

where 0 ≤ γ < 1 is the discount rate, or a control policy vector {μ_0, ..., μ_{N-1}} which optimizes the expected finite-horizon discounted cost

E\left[ \sum_{t=0}^{N-1} \gamma^{t} c_{s_t}(\mu_t(s_t)) \right],

where 0 ≤ γ ≤ 1. In the latter case, the elements of the policy vector may be constrained to be identical.

2.4.2 Search methods

To illustrate the connection between control and higher level problems such as planning, a description of the (deterministic) optimal control problem follows:

1. The states change according to a set of rules (the dynamic equations);

2. The change of the states can be influenced by the inputs;

3. The states can be pushed from one point to another by an appropriate sequence of inputs; and

4. Out of the set of all allowable input sequences, one or more will be optimal in some sense.

Markovian decision tasks are analogous to the control of typical AI problems such as chess games. In chess, the possible future states expand in a tree structure. At t = 0, the player making a particular move will take the state to a new state. The opponent's moves are unknown and can be thought of as being stochastic. After the opponent's move at t = 1, the possible states expand to a large but not complete subset of all possible states. The moves of the two players are alternated and the possible states expand in time. If we think of the consequence of both players' moves from the perspective of the first player, his move generates a large number of possible states. Each of these states has a certain probability of being reached. This system is then exactly the same as the dynamic systems defined. Therefore the search techniques used to solve these kinds of problems can be used to solve control problems.
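Before turning to the learning algorithms, it may help to fix a concrete encoding of a Markovian decision task: the state-transition probabilities p_ij(u) and immediate costs c_i(u) of Section 2.4.1 can be held in simple tables, and the expected discounted cost of a fixed policy (the evaluation function f^μ defined in Section 2.4.6) can then be computed by repeated sweeps. The three-state task, its numbers, and the chosen policy below are purely illustrative.

# A purely illustrative encoding of a Markovian decision task.
n = 3                                     # states 0, 1, 2;  state 2 is absorbing
U = {0: [0, 1], 1: [0, 1], 2: [0]}        # admissible actions per state
p = {                                     # p[i][u][j] = Pr(next state = j | state i, action u)
    0: {0: [0.8, 0.2, 0.0], 1: [0.1, 0.9, 0.0]},
    1: {0: [0.3, 0.6, 0.1], 1: [0.0, 0.2, 0.8]},
    2: {0: [0.0, 0.0, 1.0]},
}
c = {0: {0: 2.0, 1: 4.0}, 1: {0: 1.0, 1: 3.0}, 2: {0: 0.0}}
gamma = 0.9

def policy_cost(mu, sweeps=200):
    """Iteratively evaluate f^mu(i), the expected infinite-horizon discounted cost of the
    fixed policy mu, via f(i) <- c_i(mu(i)) + gamma * sum_j p_ij(mu(i)) * f(j)."""
    f = [0.0] * n
    for _ in range(sweeps):
        f = [c[i][mu[i]] + gamma * sum(p[i][mu[i]][j] * f[j] for j in range(n))
             for i in range(n)]
    return f

mu = [1, 1, 0]                            # an arbitrary stationary policy
print([round(v, 2) for v in policy_cost(mu)])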
2.4.3 Reinforcement Learning

Although the algorithms to be presented in the following sections are formulated for discrete-state systems, most researchers working on discrete-state algorithms intend to also apply them to continuous systems. The discrete-state formulation offers greater theoretical tractability. Some of these methods have counterparts which work on continuous systems, as covered in Sections 2.3.5 and 2.3.9.

2.4.4 Reinforcement Learning: Policy Iteration

Some of the Policy Iteration methods [69] to be presented in this section are discrete-state versions of the ideas presented in Section 2.3.5.

2.4.5 Basic Adaptive Critics

The basic adaptive critic system for continuous-state plants shown in Figure 2.1 is applicable to discrete-state plants also. The difference is in the adaptation algorithms for the adaptive critic (AC) and the controller. The adaptation algorithms for the AC include the discrete-state version of the TD(λ) algorithm and the DP algorithms to be presented in the following sections.

2.4.6 DP in general

Most of the material on DP algorithms to be introduced is summarized from [70]. The reader is encouraged to refer to this paper for a thorough coverage. Although DP was mentioned in Section 2.2.3, it is covered in depth here, because the discrete-time and -state formulation is much clearer.

Carrying on from Section 2.4.1, DP attempts to find an accurate numerical evaluation of the optimal control policy without knowing this policy in advance. Denote the control policy as μ. Then, for any control policy μ and state i, f^μ(i) is defined to be the expected total infinite-horizon discounted cost that will be incurred, given that the controller uses control policy μ and i is the initial state:

f^\mu(i) = E_\mu\left[\sum_{t=0}^{\infty} \gamma^t c_{s_t}(\mu(s_t)) \,\middle|\, s_0 = i\right],

where 0 ≤ γ ≤ 1 is a discount factor, and E_μ is the expectation assuming the controller always uses control policy μ. For simplicity, f^μ(i) is abbreviated to f(i).

In Sections 2.4.7 to 2.4.14, DP algorithms are presented in decreasing rigidity in their updating order. It should be remembered that they are all applicable to the training of an adaptive critic.

2.4.7 Synchronous DP (SDP)

The most rigid form of DP is Synchronous DP [70], which is the locked-step computation of the f(i) values using a complete model of the state transition probabilities and costs. The following equation is effectively executed simultaneously for all states:

f_{k+1}(i) = \min_{u \in U(i)} \Big[ c_i(u) + \gamma \sum_{j \in S} p_{ij}(u) f_k(j) \Big], \quad \forall i \in S, \qquad (2.23)

where k is the index of the update stage at which all f-values are updated. No values of f_{k+1} appear on the right-hand side of the equation. The application of this update equation for state i is referred to as backing up i's cost. We have these Convergence Results [70]:
1. If γ < 1 (the discounted case), repeated synchronous updates produce a sequence of functions that converges to the optimal evaluation function, f*, for any initial approximation. Although the cost of a state need not get closer to its optimal cost on each iteration, the maximum error between f_k(i) and f*(i) over all states i must decrease. The synchronous iteration is a contraction mapping with respect to the maximum norm and has f* as its unique fixed point.

2. If γ = 1 (the undiscounted case), to avoid infinite f-values, one or more states must have properties which distinguish them as terminal states. The set of terminal states is called the terminal set. Once the system enters a terminal state, it can move to a state in the terminal set only. A policy which gives the system a non-zero probability of entering a terminal state starting from any state is called a proper policy. Given the above definitions, this algorithm converges [69] [71] to f* in stochastic optimal path problems under the following conditions:

• The initial cost of every terminal state is zero;
• There is at least one proper policy; and
• All policies that are not proper incur infinite cost for at least one state.

2.4.8 Relaxed forms of SDP

DP does not need to be performed in such a rigid manner as SDP. The backups can be done in different orders and even partially in each stage. There are numerous reasons for this. The important reason for the purpose of this thesis is that this type of relaxed backup is a bridge to a whole new class of DP algorithms, namely incremental DP.

In Gauss-Seidel DP (GSDP) [70], the costs are backed up one state at a time in a sequential sweep of all the states, with the computation for each state using the most recent costs of the other states. This algorithm converges under the same conditions as those in SDP. GSDP tends to converge faster than SDP, because each backup uses the latest costs of the other states. The ordering of the backups also has an influence.

In Asynchronous DP (ADP) [70], at each stage k, the costs of a subset of the states are backed up synchronously, and the costs remain unchanged for the other states. That subset varies from stage to stage. The choice of these subsets distinguishes different ADP algorithms. The convergence condition for discounted ADP is that the cost of each state is backed up infinitely often. In the undiscounted case, ADP converges under these conditions [71]:

• The initial cost of every terminal state is zero.
• There is at least one proper policy. A policy is proper if its use implies a nonzero probability of eventually reaching the terminal set starting from any state. Using a proper policy also implies that the terminal set will be reached eventually from any state with probability one.
• All immediate costs incurred by transitions from non-terminal states are positive, i.e., c_i(u) > 0 for all non-terminal states i and actions u ∈ U(i).

Note that a single backup of a state's cost in ADP does not necessarily improve it as an estimate of the state's optimal cost; it may in fact make it worse. The ordering is again important, as in GSDP.

2.4.9 DP interleaved with control

This section covers algorithms in which the controller performs DP on-line during control, with the computational steps of an off-line DP algorithm interleaved with the generation of actions. It is assumed that a complete and accurate model of the Markovian decision task is known. Interleaving DP with on-line control means that during the time intervals of the decision problem, the controller executes some finite number of stages of ADP. Control decisions are based on the most recent evaluation function. An example is Real-Time DP (RTDP) [70]. In RTDP, the policy is the so-called greedy policy from the evaluation function f_{k_t}, where k_t is the total number of updates of the f function completed up to time t. The greedy policy is the set of actions which minimize the costs computed from f_{k_t}. Since the f function does not have u as a parameter, the policy would have to be chosen and stored at the time that f_{k_t} is computed.

If the state-transition probabilities p_ij(u) or the immediate costs are unknown, then under this section heading we have the choices of Bayesian and Non-Bayesian methods.

Bayesian methods [70] rest on an assumption of a known a priori probability distribution over the class of possible dynamic systems. This distribution is successively revised using Bayes' rule, and at each time step an action is selected by using DP to find a policy that minimizes the expected costs over the set of possible systems as well as time. To do this, DP is applied to a decision problem involving a system whose state consists of an entire posterior distribution over the set of possible systems. The computations required can be prohibitive in all but the simplest cases.

Non-Bayesian methods attempt to find the optimal policy asymptotically for any system within some pre-specified class of systems. These are a type of self-learning control scheme. It is useful to divide these self-learning control methods into the following classes [72]:

1. Indirect methods. These rely on the on-line formation of an explicit model of the dynamic system being controlled, as well as estimates of the immediate costs if the latter are also unknown.
2. Direct methods. These determine a policy without forming an explicit system model.

2.4.10 Adaptive Real-Time DP

This indirect method is exactly the same as RTDP except that

1. a system model is updated using some on-line system identification method, such as the following method for estimating the maximum-likelihood system model:

\hat{p}_{ij}(u) = \frac{n^u_{ij}(t)}{n^u_i(t)},

where n^u_ij(t) is the observed number of times before time step t that action u was generated when the system was in state i and made a transition to state j, and n^u_i(t) = \sum_{j \in S} n^u_{ij}(t) is the number of times action u was generated in state i. If the c_i(u)'s are also unknown, they can be determined by memorizing them as they are observed;

2. the current system model is used in performing the stages of real-time DP instead of the true system model; and

3. the action at each time step is determined by some randomized policy. The degree of randomness is high at the beginning of learning and is gradually lowered later.

A possible disadvantage of this method is that the formation of the model may be more difficult than finding the control policy directly. Obviously this depends on the system in question.

2.4.11 Reinforcement Learning: Value Iteration

Some of the methods to be presented in this section are discrete-state versions of the ideas presented in Section 2.3.9.

2.4.12 An overall view of Q-Learning

The most important algorithm with a strong theoretical foundation for the purpose of self-learning optimal control is a recently invented algorithm type called Q-Learning. It is formulated for the Markovian decision task. Carrying on from Section 2.4.6, Watkins [12] defined a new notation, Q. For each state i and control u ∈ U(i), let

Q^f(i, u) = c_i(u) + \gamma \sum_{j \in S} p_{ij}(u) f(j). \qquad (2.24)

Q^f(i, u) is the cost of action u in state i as evaluated by f. It is the sum of the immediate cost and the discounted expected value of the costs of the possible successor states j under control u. Referring back to the distinctions made in Section 2.4.9, this family of usually direct methods estimates the optimal Q-values for all pairs of states and admissible actions. Q*(i, u), the optimal Q-value for state i and action u ∈ U(i), is the cost of generating action u in state i and thereafter following an optimal policy. After the estimates Q^f(i, u) have converged to Q*(i, u), any greedy action with respect to the optimal Q-values for a state is an optimal action. In the following sections, specific Q-Learning algorithms will be presented.
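As a concrete reading of the synchronous backup of Eq. 2.23 and the Q-values of Eq. 2.24, the sketch below performs repeated synchronous sweeps over a known model and then extracts a greedy policy from the resulting f. It assumes the hypothetical MDP container sketched at the end of Section 2.4.2 and is not a reproduction of any implementation from this thesis.

```python
def synchronous_dp(mdp, n_stages):
    """Synchronous DP: back up every state's cost at each stage,
    f_{k+1}(i) = min_u [ c_i(u) + gamma * sum_j p_ij(u) f_k(j) ]   (Eq. 2.23)."""
    f = [0.0] * mdp.n_states
    for _ in range(n_stages):
        f_next = [min(backed_up_cost(mdp, i, u, f) for u in mdp.actions[i])
                  for i in range(mdp.n_states)]
        f = f_next                 # no f_{k+1} values appear on the right-hand side
    return f

def backed_up_cost(mdp, i, u, f):
    """Q^f(i, u) = c_i(u) + gamma * sum_j p_ij(u) f(j)   (Eq. 2.24)."""
    expected_next = sum(p_j * f_j for p_j, f_j in zip(mdp.p[(i, u)], f))
    return mdp.c[(i, u)] + mdp.gamma * expected_next

def greedy_policy(mdp, f):
    """Greedy policy with respect to f: pick the action minimizing Q^f(i, u)."""
    return [min(mdp.actions[i], key=lambda u: backed_up_cost(mdp, i, u, f))
            for i in range(mdp.n_states)]
```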
2.4.13 Off-Line Q-Learning

Instead of maintaining an estimate of the optimal evaluation function, f*, as is done by all the methods described above, Q-Learning maintains an estimate of the optimal Q-values for each state and admissible action. Since f* is the minimum of the optimal Q-values for each state, the Q_k's implicitly define f_k, a stage-k approximation of f*, defined as follows:

f_k(i) = \min_{u \in U(i)} Q_k(i, u). \qquad (2.25)

At each stage k, Off-Line Q-Learning synchronously updates the Q-values of a subset of the state-action pairs and leaves unchanged the Q-values for the other state-action pairs. For each k = 0, 1, ..., let the update set S_k ⊂ {(i, u) | i ∈ S, u ∈ U(i)} denote the set of state-action pairs whose Q-values are updated at stage k. Then Q_{k+1} is computed as follows:

1. If (i, u) ∈ S_k,

Q_{k+1}(i, u) = (1 - \alpha_k(i, u)) Q_k(i, u) + \alpha_k(i, u) \big[ c_i(u) + \gamma f_k(\mathrm{successor}(i, u)) \big],

where 0 ≤ α_k(i, u) ≤ 1 denotes the learning rate and j = successor(i, u) is a randomly generated next state according to the system's transition probabilities. The role of the successor function is played by the system itself in Real-Time Q-Learning, described in the next section.

2. If (i, u) ∉ S_k, Q_{k+1}(i, u) = Q_k(i, u).

Convergence proofs are available [12] [73] [70] [59] for general systems and [74] for linear quadratic regulation. Under certain conditions, the sequence Q_k(i, u) converges with probability 1 to Q*(i, u) as k → ∞.

2.4.14 Real-Time Q-Learning

This is done by interleaving Off-Line Q-Learning with control steps. If a current system model provides an approximate successor function, the result is an indirect adaptive method identical to adaptive RTDP. However, the term Real-Time Q-Learning is used for the case in which there is no model of the system underlying the decision problem and the real system plays the role of the successor function. This direct adaptive algorithm updates the Q-value for only a single state-action pair at each time step of control. The algorithm computes Q_{t+1} as follows:

Q_{t+1}(s_t, u_t) = (1 - \alpha_t(s_t, u_t)) Q_t(s_t, u_t) + \alpha_t(s_t, u_t) \big[ c_{s_t}(u_t) + \gamma f_t(s_{t+1}) \big], \qquad (2.26)

where f_t(s_{t+1}) = \min_{u \in U(s_{t+1})} Q_t(s_{t+1}, u) and α_t(s_t, u_t) is the learning rate at time t for the current state-action pair. The Q-values for all the other state-action pairs remain the same, i.e., Q_{t+1}(i, u) = Q_t(i, u), ∀(i, u) ≠ (s_t, u_t). This algorithm is the Basic QL algorithm. For reasons to be described in Chapter 3, it should also be called the One-Step Q-Learning Algorithm.
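The update of Eq. 2.26 can be written out directly. The sketch below is a minimal model-free version in which the system (here the hypothetical MDP container from Section 2.4.2, standing in for the real plant) plays the role of the successor function; the ε-greedy exploration rule and the 1/(number of visits) learning rate are illustrative choices only, not prescriptions from this thesis.

```python
import random

def real_time_q_learning(mdp, n_steps, epsilon=0.1, s0=0):
    """One-step (real-time) Q-Learning, Eq. 2.26:
    Q(s_t,u_t) <- (1-a) Q(s_t,u_t) + a [ c_{s_t}(u_t) + gamma * min_u Q(s_{t+1},u) ]."""
    Q = {(i, u): 0.0 for i in range(mdp.n_states) for u in mdp.actions[i]}
    visits = {key: 0 for key in Q}
    s = s0
    for _ in range(n_steps):
        # exploration: occasionally pick a random admissible action
        if random.random() < epsilon:
            u = random.choice(mdp.actions[s])
        else:
            u = min(mdp.actions[s], key=lambda a: Q[(s, a)])
        s_next, cost = mdp.step(s, u)                       # the plant supplies the successor
        f_next = min(Q[(s_next, a)] for a in mdp.actions[s_next])
        visits[(s, u)] += 1
        alpha = 1.0 / visits[(s, u)]                        # one simple learning-rate schedule
        Q[(s, u)] = (1 - alpha) * Q[(s, u)] + alpha * (cost + mdp.gamma * f_next)
        s = s_next
    return Q
```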
2.4.15 Q-Learning implies Action Dependent Dynamic Programming

Just as the gradual relaxation of the scheduling of updates in dynamic programming leads to incremental dynamic programming, it is implicit in Q-Learning that there is a rigidly scheduled algorithm as its ancestor. The same ancestor is also implied by ADHDP. An appropriate name for this ancestor is Action Dependent Dynamic Programming (ADDP). The update equation of this algorithm is simply to compute, for all s_t ∈ S and for all u_t ∈ U(s_t):

Q_{k+1}(s_t, u_t) = c(s_t, u_t) + \gamma \sum_{s_{t+1} \in S} p_{s_t s_{t+1}}(u_t) \min_{u_{t+1} \in U(s_{t+1})} Q_k(s_{t+1}, u_{t+1}),

where k is the index of the DP stage and t is the time index of the dynamic system. The difference between ADDP and its various descendent algorithms is in the scheduling of the updates. A simpler version of this algorithm for deterministic systems will be described in Chapter 5.

2.4.16 Dyna-Q

Sutton [75] developed a family of algorithms called Dyna which are variations of Q-Learning with a mixture of indirect adaptive control. In one particular Dyna algorithm, the Dyna-Q algorithm by Sutton, the update operation in Q-Learning uses both actual state transitions as well as hypothetical state transitions simulated by a system model. The latter part amounts to running multiple stages of Off-Line Q-Learning in the intervals between times at which the controller generates actions. This is only one possible combination. Dyna worked successfully on a small obstacle avoidance and planning problem.

2.4.17 Prioritized Sweeping

The GSDP mentioned earlier sweeps the states according to a schedule. One type of high-performance scheduling heuristic is Moore's family of Prioritized Sweeping algorithms [76]. In these heuristics, a descending sorted queue of states and possibly actions is maintained. The sorting is with respect to a priority value which measures the change to a state-action pair's f or Q value in the most recent update. In each Gauss-Seidel sweep, the f or Q values of a certain number of the top priority state-action pairs are updated. As they move toward convergence, their priority is lowered and other state-action pairs will move toward the top of the queue. These heuristics focus the sweeps in the most interesting parts of the state-action space. A closely related heuristic is the Dyna-Q-queue algorithm by Peng and Williams [77]. Both prioritized sweeping and Dyna-Q-queue greatly lower the number of Bellman's Equation processing steps before convergence.
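The following sketch shows, under stated assumptions, the bookkeeping that prioritized sweeping adds to the backups: after a state's cost is backed up, its predecessors are pushed onto a priority queue keyed by how strongly the change could affect them, so the next backups are spent where they matter most. It is only a schematic reading of the idea in [76]; the predecessors table, the threshold theta, and the reuse of backed_up_cost from the sketch after Section 2.4.12 are assumptions, not details taken from this thesis.

```python
import heapq

def prioritized_sweeping(mdp, predecessors, n_backups, theta=1e-4):
    """Back up f-values in order of priority (magnitude of the most recent change)
    rather than in a fixed Gauss-Seidel sweep over all states."""
    f = [0.0] * mdp.n_states
    queue = [(-1.0, i) for i in range(mdp.n_states)]    # seed every state once
    heapq.heapify(queue)
    for _ in range(n_backups):
        if not queue:
            break
        _, i = heapq.heappop(queue)
        new_f = min(backed_up_cost(mdp, i, u, f) for u in mdp.actions[i])
        change = abs(new_f - f[i])
        f[i] = new_f
        # a large change at i makes backups of i's predecessors worth doing soon
        for (s, u) in predecessors[i]:                   # pairs (s, u) with p_si(u) > 0
            priority = mdp.p[(s, u)][i] * change
            if priority > theta:
                heapq.heappush(queue, (-priority, s))    # max-priority via negation
    return f
```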
2.4.18 Variable Resolution

Q-Learning control may look like a control method in the engineering sense, but we have to keep in mind that the states may be the description of a system at various levels of abstraction. The states may be low level descriptions of a physical system. Alternatively, they may be somewhat higher level, as in the discrete states used in the subsumption architecture [78] [79]. An example of using a DP-type algorithm at varying levels of resolution is Moore's Varying Resolution DP algorithm [80]. He describes using states which are indicators of being in a range of values of the low level state space. After selecting a rough choice of actions or trajectories, a finer resolution of the selected trajectory is used and DP is done on the finer set of state descriptions. This is done successively until the lowest level description is reached. This method has a good chance of finding a good trajectory. These ideas touch on using Q-Learning at an AI level. Indeed, Q-Learning comes under the title of "Inductive Reasoning" in the AI literature.

2.4.19 Transition Point DP

Transition Point DP (TPDP) [81] [82] is inspired by the idea that in a continuous system, the control values are often constant through regions of the state space. Considerable savings in the storage of the control policy can be had by only storing the control values at the boundary of these regions. The TPDP algorithm was developed to find these boundaries and the associated control values, starting with the same information as the other RL algorithms. In many AI problems, the key decision points are at the boundaries. Once a decision is made and the associated action initiated, the course is set and it is no longer necessary to pay attention to alternatives while the situation remains inside the scope of the decision.

More speculatively, TPDP may be a method of partitioning a state space according to the criterion of constant control (or perhaps some other constant quantity). It may be a mechanism for forming intermediate level representations (to be discussed in Chapter 3). If the number of action types is large, the optimal control of a system by state space region triggered action switches in an orthogonal or semi-independent way is intriguing. In other words, if individual responses, triggered by the boundaries of state space regions, can be learned incrementally, the control system may be able to accumulate and link them together to optimize a global goal. This would be something akin to the automatic building and tuning of a subsumption architecture. TPDP seems to be natural as a part of a learning hierarchy.

2.5 Subgoals

A method closely related to optimal control is Neural Subgoal Generators (NSG) [83] [3] [84]. Figure 2.4 illustrates the essential components for a non-recursive subgoal generator. A recursive version can be obtained by a simple re-routing of the signals.

Figure 2.4: An adaptive non-recurrent subgoal generator.
A goal or a subgoal can be a state in time, a state without explicit reference to time, or an abstract goal. In optimal control, we can focus on either the controls or the states. In existing RL theory, the focus is on the controls. But there are a number of ways to find the optimal controls if the optimal trajectory is known (see Sections 2.1.5 and 2.1.6).

Both the subgoal generator and the evaluator are implemented in feedforward neural networks in the implementations in [83] [3] [84]. These networks allow the use of dual backpropagation [65] to pass the derivative of the evaluations with respect to the subgoals. The subgoal generator is updated by moving down the gradient of the evaluations. The purpose of the system in Figure 2.4 is to recursively break down a large goal into smaller goals, ultimately ending up with a sequence of subgoals from the start specification to the goal specification. The evaluators in the figure are identical to each other. The role of each copy is to compute (or map) an evaluation from its inputs. The evaluator is updated using the true evaluations when the subgoal sequence is executed. These approximate evaluations are used to compute a sum of evaluations for the entire sequence.

The breakdown of a large goal is one possible use of the subgoal method. The other is the suggestion of a different way of doing reinforcement learning, something akin to backing up the subgoals. The first key idea in the suggestion is that a lengthy state trajectory (going from the start state to the end state, the end state being the large goal) can be broken down roughly at the center, in time, of the state trajectory. The state at the point of breakage is the subgoal in the nomenclature of NSG. The second key idea is that the evaluations of both halves of the original trajectory are summed up to enable judgement of the optimality of the subgoal. The missing piece in NSG is a way to build up optimal trajectories of long horizon from optimal trajectories of short horizons, without relying on the continuity of the problem and the function approximators (usually neural networks) for hints. In this thesis, some of the essential ideas in NSG are framed in a discrete-state, discrete-time formulation, thus removing the reliance on continuity as a major mechanism. In doing so, a clearer understanding of the idea of subgoals is obtained (see Chapter 4).

2.6 Summary of Literature Review and Focus of this Work

Figure 2.5 shows a tabular summary of the literature review. DP approaches for optimal control have received a great deal of attention in recent years for many reasons.
The major reasons are 1) the availability of a large mass of theory, variations, and experimental results, 2) the simplicity and generality of the applications, and 3) a promising prospect of extension to AI-type problems, possibly via compositional learning [85] in a hierarchy. DP and Q-Learning are the focus of this work.

Figure 2.5: A summary of the literature review. Each box immediately above a thick line is a category which applies to all boxes under that box. There is no additional inheritance relationship between boxes on the same vertical column. The boxes of the figure group the reviewed methods as follows:

• Discrete-Time, Discrete-State (Numerical): Search Methods; RL Value Iteration (Off-Line Q-Learning, Real-Time Q-Learning, Dyna-Q, Transition Point DP); RL Policy Iteration (Actor-Critic); DP (Synchronous DP, Adaptive Real-Time DP, Prioritized Sweeping, Variable Resolution).
• Discrete-Time, Continuous-State (Numerical): Neural Network Optimization with the Minimum Principle; RL Value Iteration (ADAC or ADHDP); RL Policy Iteration (Actor-Critic: Adaptive Heuristic Critic, Heuristic Dynamic Programming).
• Discrete-Time, Continuous-State (Analytical): DP.
• Continuous-Time, Continuous-State, Stochastic (Numerical): Finite Element Approximation of DP; Discretization of the Stochastic Control Problem.
• Continuous-Time, Continuous-State, Stochastic (Analytical): Dynamic Programming; Martingale Methods; Minimum Principles.
• Continuous-Time, Continuous-State, Deterministic (Numerical): Variation of Extremals; Quasilinearization; Steepest Descent; Gradient Projection.
• Continuous-Time, Continuous-State, Deterministic (Analytical): Calculus of Variations; Pontryagin's Maximum Principle.

2.7 Outstanding Issues

2.7.1 Approximation errors in DP and Q-Learning

In DP and Q-Learning theories, it is usually assumed that the mappings, such as the f_k(j) function (or mapping) in Eq. 2.23 and the Q_t(s_t, u_t) and α_t(s_t, u_t) functions (or mappings) in Eq. 2.26, are stored without error. The storage methods used in simulations coming closest to this assumption are discrete storage methods such as lookup tables. When the input space, i.e., the space of j in f_k(j) and the space of {s_t, u_t} in Q_t(s_t, u_t) and α_t(s_t, u_t), becomes large, discrete storage methods are overwhelmed by the curse of dimensionality. To overcome the curse of dimensionality, approximators are used to store these mappings. An example of an approximator is feedforward neural networks using the backpropagation algorithm [38]. For simplicity, we will use f_k(j) as the basis of discussion in the rest of this paragraph, dropping the index k. Typically, universal function approximation is used in the following way:

• During the iterations in a DP algorithm, a sequence of {f, j} pairs is generated.
• As each pair {f, j} is obtained, the value of j is submitted to the approximator as the input and the value of f is submitted as the output.
• The parameters of the approximator are adjusted to reduce the difference between the output value supplied by the DP algorithm and the output value generated by the approximator.

If the DP algorithm is able to randomly sample all possible j's infinitely often, the approximator would have sampled its input space thoroughly and have a chance to become an accurate mapping from j to f. The other mappings described at the beginning of this section are also approximated similarly. The sequence of events described in the last paragraph can vary greatly, depending on the algorithm. In a synchronous algorithm, the pairs {f, j} for all j's are collected before they are submitted to the approximator. In an incremental algorithm, the pairs {f, j} are submitted to the approximator as soon as they are available.

Perhaps the most critical issue with the use of approximators is that, almost always, the output values generated by the approximator cannot exactly match the supplied output values over the entire input space. The general rule is that high accuracy requires more computational and memory resources. The reduction or elimination of this approximation error, using a preferably reasonable amount of computational and memory resources, has been the subject of research for many decades [41] [43].

The problem that approximation error presents is that successive iterations of DP and Q-Learning use the output of the approximators to compute new f-values and Q-values, respectively. The approximator errors from the last iteration are accumulated in the new f- and Q-values. These erroneous values are then again stored in the approximators, which very likely will not store them exactly, introducing new errors. Erroneous f-values and Q-values are critical to the final result obtained from DP and Q-Learning, which is the control policy. As the control policy is computed from the f- and Q-values, its optimality or even stability is compromised by the errors.

As Q-Learning algorithms are relatively new, few experimental results involving approximators have been obtained [86] [87]. The author is aware of one negative result [88] where Q-Learning fails to generate a stable control policy. The cumulative nature of these algorithms and the extraction of the control policy mean that the accuracy of approximation is paramount. The issue of storage error is now among the most pressing concerns.
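To make the three bullets above concrete, the sketch below runs synchronous DP sweeps in which f is stored not in a lookup table but in a deliberately crude approximator (simple state aggregation), so the f-values read back at stage k+1 already carry the storage error of stage k. The aggregation scheme and all names are illustrative assumptions chosen only to expose the structure of the loop, not a recommended approximator.

```python
class AggregationApproximator:
    """A crude approximator: states sharing a group index i // width share one stored
    value, so the f-values submitted to it are generally not reproduced exactly."""

    def __init__(self, n_states, width):
        self.width = width
        self.values = [0.0] * ((n_states + width - 1) // width)
        self.counts = [0] * len(self.values)

    def train(self, j, f_target):
        # incremental averaging of all targets ever submitted for this group of states
        g = j // self.width
        self.counts[g] += 1
        self.values[g] += (f_target - self.values[g]) / self.counts[g]

    def predict(self, j):
        return self.values[j // self.width]


def approximate_synchronous_dp(mdp, approx, n_stages):
    """Synchronous DP with the approximator, not a table, holding f between stages;
    whatever error the approximator makes is folded into the next stage's targets."""
    for _ in range(n_stages):
        targets = []
        for i in range(mdp.n_states):
            q_values = [mdp.c[(i, u)] + mdp.gamma *
                        sum(p_j * approx.predict(j)        # read back possibly erroneous f
                            for j, p_j in enumerate(mdp.p[(i, u)]))
                        for u in mdp.actions[i]]
            targets.append((i, min(q_values)))
        for i, f_target in targets:                        # synchronous: submit after the sweep
            approx.train(i, f_target)
    return approx
```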
Thrun and Schwartz [2] were the first to explore this area. For Q-Learning, they identified the minimization operator in Equations 2.24 and 2.25 as a major reason for the adverse accumulation of errors. There may be other reasons, such as the fact that in a typical learning situation, the approximation errors are temporally and spatially correlated.

There are a number of possible solutions. The most obvious is to use more accurate approximators. The level of accuracy required even for relatively simple problems [2] means the demand on the approximators will be severe. Certainly, accurate approximators should be used whenever practical, but to relieve this demand we should try to solve the problem in other ways also. Thrun and Schwartz [2] made a number of suggestions.

2.7.2 Speed of convergence

Both Q-Learning and Adaptive Critic methods in their basic forms are known to show slow convergence. Some of the causes are fundamental to the control problem and difficult to overcome as long as the assumptions of the problem are maintained. There are also causes which are artifacts of the algorithms themselves. Both types of causes should be examined. In Q-Learning, a fundamental assumption is that the system is stochastic. An artifact is that Q-Learning and DP methods are both one-step in their updates. The Dyna architecture, Prioritized Sweeping, and Variable Resolution DP are partial solutions to the problems introduced by the one-step nature of the base Q-Learning and DP methods.

The actor-critic systems using TD(λ) would likely be faster than those using a variant of DP. The former has a solid base of experimental support [8] [9] [89] [90] [60] [61] [62] [91] [92] [93] [88] [94] [87] [95]. There are some experiments which show that TD(λ) learning using approximators is slow [9] [89] [91]. In situations where much is known about the system and this knowledge strongly constrains the forms of the adaptive critic and the controller, the learning can be very fast [96].

Undoubtedly, the numerous parameters of approximators are a major cause of slow convergence. But perhaps there are other causes as well. The output of the critic mapping in actor-critic systems, e.g., TD(λ), is an estimate of the expected total discounted cost. The controller is updated based on the relative magnitudes of this cost with respect to the states. The problem with this approach is that the estimates are likely to be inaccurate before the overall learning is completed. Inaccurate estimates do not affect some controller adaptation algorithms as long as the relative magnitudes of the points on the critic mapping are correct.
Otherwise, the controller would be given incorrect training signals. At the least, the overall learning would be slowed. At worst, the controller would learn the wrong control law. Even if this is not an issue, the critic should ideally learn as quickly as possible while providing the right kind of information to the controller. An approach toward a solution is to consider using the state transition samples more thoroughly [91]. It should be noted that this issue is distinct from the issue of approximation errors in the previous section.

2.7.3 Hierarchies

A hierarchy is a multi-level system. Each level consists of evaluation and control components. The higher levels deal with the environment in terms of coarser temporal and spatial resolutions, and coarser actions. The lower levels deal with the environment in finer temporal and spatial resolutions, and finer actions. The levels communicate with each other during learning and interaction with the environment. RL is mainly a single-level learning control method. Several attempts have been made to extend RL to function in a hierarchy. There are many motivations for this.

Firstly, while it may seem that an approximator can map optimal actions to the states of a complex system, there are advantages to be gained from breaking up the mapping into a hierarchy. A monolithic evaluation or control structure, e.g., a massive single neural network, is not an ideal structure to update. If the system is complex or if the optimality criterion is difficult to fulfill, the mapping can be highly non-smooth. This would present severe problems to most approximators. Smaller mappings can prevent the learning in one part of the state space from affecting the learning in a distant part, for instance. The smaller mappings may be simpler to parameterize and train.

Secondly, consider an optimal control problem for a complex mechanism with controls and states of temporal resolution of a small fraction of a second and spatial resolution in the range of millimeters. If the optimality criterion is in a high level representation and encompasses a time of years and distances of thousands of kilometers, it is very likely that the controller cannot be a simple mapping from states to actions. A number of levels would probably be necessary to relate the controls and states to the high level optimality criterion.

Thirdly, hierarchies facilitate the conversion between numerical representations and symbolic representations.
The latter is a convenient domain for control at larger time/space scales and more complex dynamics. There is likely to be a speed advantage to learning in more compact representations of state, action, and time. Symbolic representation of the numerical world allows control to take place symbolically. A hierarchy could break a difficult and abstract goal down to concrete actions. A thorough discussion of hierarchies is in [97].

Optimality or near-optimality is important at all levels of a hierarchy. Ultimately, we want the goal at the top of the hierarchy to be achieved optimally. This means each of the subgoals must be achieved optimally. This is just a different statement of the Bellman optimality equation [1] [69]. The construction of optimal skills in each level of a hierarchy is where RL comes into play.

2.8 Increasing Amount of Context

The end product of self-learning is to have a ready-made {state}→{control} mapping. The various amounts of context available or kept determine what can be done to obtain those mappings. If accurate models of the system dynamics and costs are available, the {state}→{control} mapping can be produced by some means. This is the ultimate in adaptivity because if the model is changed, the mapping can be completely updated in one real time step. But to compute the {state}→{control} mapping, there is still no avoiding using one of the following mappings (see Figures 2.6, 2.7, 2.8, and 2.9) and the associated computations. The issue is at which level of context the desired numerical and complexity properties will be obtained. This issue is analogous to the comparison between stricter Stochastic Learning Automata (SLA) algorithms and looser SLA algorithms such as those described in [98]. The stricter SLA algorithms use a minimum amount of data structure (context), but are slow and scale very poorly in terms of convergence speed with respect to the number of actions. The looser SLA algorithms use more context, but are able to converge much more quickly and scale up well.

Figure 2.6: A {state}→{evaluation} mapping.

Figure 2.7: A {state,control}→{evaluation} mapping.

Figure 2.8: A {state,future-state,control}→{evaluation} mapping.

The lowest level of context is to have a {state}→{evaluation} mapping (see Figure 2.6), as is done in the actor-critic architectures. The evaluation is the evaluation of the control policy. This mapping is then used to construct the state-to-action mapping, i.e., the controller.
To do this, one of the following methods can be used: 1) keep a record of the actions at the time of the updates of f, 2) use the A_{R-P} algorithm [99] to train the controller, or 3) use derivative information as in the Backpropagated AC algorithm, the DHP algorithm, or the ADDHP algorithm.

The next higher level is to have a {state,control}→{evaluation} mapping (see Figure 2.7), as done in Q-Learning. This allows the easy extraction of the optimal control from the mapping. But there are a number of disadvantages with this approach, as will be discussed in Chapter 3.

The next higher level is to have a {state,future-state,control}→{evaluation} mapping (see Figure 2.8). Horne [100] talks about linking up {Old-State, Action, New-State} triples to get a workable control sequence, but does not address the optimality of these sequences.

An even higher level is to have a {state,future-state,control,horizon}→{evaluation} mapping (see Figure 2.9). The additional horizon parameter is scalar, so the increase in context is relatively small. This increase buys many advantages but also results in a higher level of complexity which may be partially mitigated by using function approximators. The rest of this thesis will take this point of view to address the issues raised in Section 2.7.

Figure 2.9: A {state,future-state,control,horizon}→{evaluation} mapping.

Chapter 3

The Key Problems and A New Approach

In this chapter, the bases for the new algorithms to be proposed in Chapter 4 are described. First, I will discuss the One-Step Q-Learning algorithm, the issue of approximation errors, and the issue of extension of reinforcement learning to hierarchies. Second, from these discussions I am motivated to look at the QL and DP algorithms from an aggregate perspective. Finally, I will look at the subtleties of doing QL and DP in aggregates.

3.1 The One-Step Q-Learning Algorithm

One-Step Q-Learning is the focus of this chapter. From the description of various algorithms in Chapter 2, it is clear that this algorithm is the foundation of many related algorithms. The notation Q_μ(i, u) denotes the expected total infinite-horizon discounted cost of executing action u in state i as evaluated by a control policy μ. A control policy μ specifies a specific action u for each state i:

Q_\mu(i, u) = E_\mu\left[\sum_{t=0}^{\infty} \gamma^t c_t \,\middle|\, s_0 = i, u_0 = u\right], \qquad (3.27)

where 0 ≤ γ ≤ 1 is a discount factor, and E_μ is the expectation assuming the controller always uses policy μ. The horizon is the largest value that t assumes in the definition of Q; in Eq. (3.27), the horizon is infinite.

For each state i and action u, Q_μ(i, u) is the expected value over all possible successor
states j in the next time step. For each j, the contribution to Q_μ(i, u), weighted by p_ij(u), is c_i(u) + γ f_μ(j), where c_i(u) is the immediate cost incurred for action u in state i, f_μ(j) is called the evaluation function, and

f_\mu(j) = \min_{u \in U(j)} Q_\mu(j, u). \qquad (3.28)

The definitions above contain an inconsistency. It is not clear that μ can be any arbitrary control policy and still satisfy Eq. (3.28) for every state. Yet, it is clear that if μ is optimal, this inconsistency does not exist. If μ is optimal, it is denoted by μ* and we have the following properties:

f^*(j) = f_{\mu^*}(j) = \min_{u \in U(j)} Q_{\mu^*}(j, u)

and

Q^*(i, u) = Q_{\mu^*}(i, u) = \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) f^*(j),

where \bar{c}_i(u) is the expected value of c_i(u). The optimal policy is found by some algorithm. The optimality in DP and Q-Learning is the result of the minimization operator in Eq. (3.28).

The Q-values can be constructed in a number of ways [101]. Watkins' Real-Time Q-Learning (RTQL) algorithm [73] uses an update rule where an estimate of Q*, denoted Q̂, is updated after the execution of an action in each time step according to

Q_{t+1}(s_t, u_t) = (1 - \alpha_t(s_t, u_t)) Q_t(s_t, u_t) + \alpha_t(s_t, u_t) \big[ c_t(s_t, u_t) + \gamma f_t(s_{t+1}) \big], \qquad (3.29)

where

f_t(s_{t+1}) = \min_{u \in U(s_{t+1})} Q_t(s_{t+1}, u) \qquad (3.30)

and α_t(s_t, u_t) is the learning rate at time t for the current state-action pair. The Q-values and α-values for all other state-action pairs remain the same, i.e., Q_{t+1}(i, u) = Q_t(i, u) and α_{t+1}(i, u) = α_t(i, u), ∀(i, u) ≠ (s_t, u_t). The learning rate parameter α_t(s_t, u_t) is scheduled to satisfy the following conditions:

\sum_{d=0}^{\infty} \alpha_d(s, u) = \infty, \qquad \sum_{d=0}^{\infty} \alpha_d^2(s, u) < \infty, \qquad \forall s \in S, \; u \in U(s),

where d is incremented each time the specific {s, u} pair is encountered.

This algorithm essentially passes a slightly modified f value of a state to its successor state. The end result after convergence is that the inconsistency mentioned earlier disappears. Therefore, one interpretation is to see RTQL as a relaxation algorithm. Notice that c_t(s_t, u_t) in Eq. (3.29) is the cost of a state transition which spans only one time step. To compare with the algorithms to be presented later, it is clearer to call this algorithm the One-Step Q-Learning algorithm. Although Eq. (3.27) shows the horizon to be infinite, during the course of learning according to Eq. (3.29), the horizon increases by 1 in each update. Infinite horizon is merely a convenient mathematical concept. It helps to call the horizons in the Q's the actual horizons.

This algorithm has been proven to converge for general [12] [73] [70] [59] and specific [74] learning problems. Yet in practice, convergence is not achieved in all control problems. A number of issues need attention.
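One schedule that satisfies the two conditions above is α_d(s, u) = 1/(1+d), since Σ_d 1/(1+d) diverges while Σ_d 1/(1+d)^2 converges. The sketch below keeps one visit counter d per state-action pair; the class and method names are illustrative, not taken from this thesis.

```python
class LearningRateSchedule:
    """Per-pair learning rates alpha_d(s, u) = 1 / (1 + d), where d counts how many
    times the pair (s, u) has been encountered; this satisfies
    sum_d alpha_d = infinity and sum_d alpha_d^2 < infinity."""

    def __init__(self):
        self.d = {}      # visit counts, keyed by (state, action)

    def next_alpha(self, s, u):
        d = self.d.get((s, u), 0)
        self.d[(s, u)] = d + 1
        return 1.0 / (1 + d)
```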
The next section will describe the inadequacies of this One-Step Q-Learning algorithm. It will be argued that they are the result of computing the Q-values at a single, fixed time resolution.

3.2 Approximation Errors in Q-Learning and DP

A critical difficulty is that the accumulation of approximation errors can defeat One-Step Q-Learning [2] and DP in their present forms. One argument is that zero-mean approximation errors are skewed by the minimization operator, assuming that approximation errors are zero-mean. Even if the errors are not skewed, they would still be a severe problem because they are unlikely to be zero-mean and they are accumulated multiplicatively among Q's of different actual horizons. When these errors are accumulated by the backup operation, Q-Learning can fail as the length of the horizon and/or the number of actions becomes large.

3.3 Goal Change

In many control problems, the goal shifts from moment to moment. For example, the goal of a walking machine may shift from standing to putting a foot forward to walking to running. In RL and DP, the goal is defined by the optimality criterion, which is in turn defined by the immediate costs. In RL, to change the goal of the control system, the entire learning process needs to be repeated with different immediate costs. In many cases, we are interested in constructing an optimal controller which depends on a range of goals. The range of goals may be continuous. To do this, the learning process would have to be repeated for each of the goals and the control laws produced in each iteration would be used to build up a goal-dependent controller. A more efficient way to change goals would greatly benefit the construction of this type of controller.

3.4 Hierarchies

A current goal of reinforcement learning research is to extend the capability of reinforcement learning to include learning optimal control at various temporal and spatial levels, in hierarchies. There are a number of motivations for this, as discussed in Chapter 2.

3.4.1 The Albus model of intelligence and RL

Albus [97] proposed a prototype of hierarchical intelligence which has many subsystems. The part of his proposal that is relevant to reinforcement learning is that each command center in each layer has a world model module, a value judgment module, and a behaviour generation module. The value judgment module evaluates the sensory inputs and the expected sensory inputs in the next time step, and alters the behaviour generation module. This interaction is very close to the idea of reinforcement learning. The Albus model is a good basis for thinking about the structure of a hierarchical RL system.
As the Albus model illustrates, RL is a learning problem that is necessary and pervasive in an intelligence hierarchy. It is clear that an intelligence has to evaluate the performance of its behaviours with respect to its many goals. It is the only way that an intelligence hierarchy could ensure the near satisfaction of many requirements, which leads to a better chance of survival, or to achieving whatever overall goal it has.

3.4.2 The upper levels of a hierarchy and macros

At the upper levels of a hierarchy, the representation should be more symbolic and less like the raw state and control representation at the lowest level. If labels could be assigned to significant and often repeated low level trajectories, combinations of control and state trajectories at a symbolic level could be done. When a useful, easily parameterizable control sequence [102] is found, a label and some associated parameters can be assigned. The parameters help to make the label more general. Let us call such a sequence a macro. At the interface between the numerical and symbolic levels, there would likely be levels where the representation is a numerical and symbolic mixture. For example, a movement like "jump" can take on many start states, end states, and desired heights. The system would also have a rough idea of the control sequence required, the time required, and the costs. This kind of hybrid representation could describe state/control trajectories and symbolic trajectories. As we move upward in the hierarchy, these macros would become more discrete, until they are almost purely symbolic.

Macros can be hand-crafted or learned. The combination of learning with some genetically determined seed goals seems promising. The seed goals trigger the learning of useful macros. Examples of seed goals are the goals to look at novel objects in a visual scene, to reach for an object, to move toward an object, etc. To accomplish these objectives, a whole array of in-between and peripheral accidental achievements would be encountered. These accidental achievements and the seed goals can be used to produce derived goals which exist at various levels in the hierarchy. When an accidental achievement is encountered, the system could determine if it is useful in terms of the seed or derived goals. If it is, and it does not fit in the lower levels, the achievement could become a new derived goal at a higher level. These goals focus the exploration of the state space. The action sequences caused by the goals are the macros. New macros are formed when a sequence of lower level macros is encountered often and is found to be useful, i.e., achieves some built-in or derived goal.
The seed goals should be strategically designed such that they lead to the formation of an adequate set of macros for the satisfaction of the overall goal. The shorter macros can be found by accident. The longer macros would have to be formed by more deliberate methods, such as a new form of RL or the method proposed in [85]. We can build new macros by concatenating several old macros. If the new macro is too complex to be encoded at a low level, the encoding is done at a higher level. In Section 4.6, several new algorithms will be related to these preceding discussions.

3.4.3 Other remarks on hierarchies

In a hierarchy, RL can be done either sequentially or concurrently at various command centers at several levels. At the lowest levels, this is identical to the learning situation in regular RL, where we deal in raw state representations and elemental time steps. This is not true even at the next higher level, where coarser state representations and time intervals are used. The basic RL algorithms do not address inter-level learning. Until recently, RL was a method for learning in a monolithic, not hierarchical, controller. Each backup operation involves an elemental transition of constant time interval and constant abstraction, whereas a hierarchy has a mixture of state transitions of different time intervals and abstractions.

In the next section, we look at Q-Learning, DP, and control problems from a new perspective, to address the issues raised in the preceding sections. A new Q notation will be introduced which has two new input specifiers. Only one of these is indispensable in the algorithms to be presented later which utilize this new Q notation. Compared to the DP notation, which uses only f, the Q notation in One-Step Q-Learning makes it easier to find the best first action. This change is a move to yield a better algorithm by using more input parameters. As with both Q-Learning and DP, the hope for application to high dimension systems is that the Curse of Dimensionality [1] would be partially mitigated by using continuous function approximation. The new Q notation is the next step in the same direction.

3.5 A New Approach: Linkup Action Dependent Dynamic Programming

More can be done with a sample of a transition than just doing a backup for a single Q-value. The word backup is appropriate because the Q-value is updated in the backward direction. We actually want to fill out as many Q's as possible by the linkup of separate Q's that happen to intersect at the transition. The exact definition of a linkup operation will be presented in a later section. We can look at the summation in Eq. (3.27) in another way, if we have new notations for f and Q which have two additional input specifiers each.
This new notation is defined:

    f_{\bar{\mu}}(s_0, s_n, n) = E_{\bar{\mu}}\Big[ \sum_{t=0}^{n-1} \gamma^{t} c_{s_t}(\bar{\mu}(s_t)) \;\Big|\; s_0, s_n \Big].   (3.31)

For a finite horizon, $\bar{\mu} = \{\mu_0, \dots, \mu_{n-1}\}$ is a vector of control policies, one for each time step in the summation. When the horizon is infinite, all these control policies collapse into a single $\mu$, i.e., $\mu_0 = \dots = \mu_{n-1}$. For this notation, $\bar{\mu}$ means using u's according to the policy vector starting at t = 0, i.e., $\bar{\mu} = \{\mu_0, \dots, \mu_{n-1}\}$. $\bar{\mu}$ can be any control policy vector. Let the optimal policy vector $\bar{\mu}^*$ be defined such that

    f^*(s_0, s_n, n) = f_{\bar{\mu}^*}(s_0, s_n, n) = \min_{\bar{\mu}} f_{\bar{\mu}}(s_0, s_n, n).

We now define another new notation:

    Q^*(s_0, s_n, u_0, n) = \bar{c}_{s_0}(u_0) + \gamma \sum_{s_1 \in S} p_{s_0,s_1}(u_0) \, f^*(s_1, s_n, n-1).   (3.32)

This Q* is like the standard Q* except that it is also conditioned on a reached state $s_n$ and a finite horizon n. The inclusion of $s_n$ and n allows an operation to be called linkup. The meaning of this will become clear shortly. Now, an expression is written for Q* having a slightly different meaning for $\bar{\mu}^*$:

    Q^*(s_0, s_n, u_0, n) = E_{\bar{\mu}^*}\Big[ \sum_{t=0}^{n-1} \gamma^{t} c_{s_t}(\bar{\mu}^*(s_t)) \;\Big|\; s_0, s_n, u_0 \Big].   (3.33)

Therefore, $f^*(s_0, s_n, n) = \min_{u_0 \in U(s_0)} Q^*(s_0, s_n, u_0, n)$. It is clear that here $\bar{\mu}^*$ has to mean using $u_0$ at t = 0 and then using u's according to the policy vector starting at t = 1, i.e., $\bar{\mu}^* = \{\mu^*_1, \dots, \mu^*_{n-1}\}$. So the meaning of $\bar{\mu}^*$ depends on the absence or presence of $u_0$ in the conditional part of the expectation term. This is true also for the presentation of the Q notation in the previous section. This distinction simplifies the roundabout definition of Q found in the literature.

For simplicity, $c_t$ will be used for $c_{s_t}(\bar{\mu}^*(s_t))$. Now the linkup relation is derived by looking at a particular $\{s_0, s_n, u_0, n\}$, namely $\{i, l, u, n\}$. We obtain this different breakup of Eq. (3.33):

    Q^*(i, l, u, n) = \underbrace{E_{\bar{\mu}^*}\Big[ \sum_{t=0}^{n_a - 1} \gamma^{t} c_t \;\Big|\; s_0 = i, s_n = l, u_0 = u \Big]}_{\text{Term A}} + \underbrace{E_{\bar{\mu}^*}\Big[ \sum_{t=n_a}^{n-1} \gamma^{t} c_t \;\Big|\; s_0 = i, s_n = l, u_0 = u \Big]}_{\text{Term B}},   (3.34)

where $0 < n_a < n$. To end up at a specific state $s_{n_a}$, the system may pass through a number of states at any intermediate time index t, where $0 < t < n_a$. The probability that, starting from the start state i, the system would reach a particular intermediate state $s_{n_a}$ in $n_a$ time steps while following the control policy $\bar{\mu}^*$ is defined as $p^{u,\bar{\mu}^*}_{i,s_{n_a}}$. This multi-step transition probability is dependent on the dynamics of the system, on $u_0$, and on the control policy $\bar{\mu}^*$. More precisely,

    p^{u,\bar{\mu}^*}_{i,s_{n_a}} = \sum_{s_1 \in S} \sum_{s_2 \in S} \cdots \sum_{s_{n_a-1} \in S} p_{i,s_1}(u)\, p_{s_1,s_2}(\bar{\mu}^*(s_1)) \cdots p_{s_{n_a-2},s_{n_a-1}}(\bar{\mu}^*(s_{n_a-2}))\, p_{s_{n_a-1},s_{n_a}}(\bar{\mu}^*(s_{n_a-1})).

Term B is not in the form of Q*. The differences are that it does not have a definite state at $t = n_a$ and that t begins at $n_a$ rather than 0. The state at $t = n_a$ is reached according to $p^{u,\bar{\mu}^*}_{i,s_{n_a}}$.
$\bar{\mu}^*$ is truncated to $\{\mu^*_{n_a+1}, \dots, \mu^*_{n-1}\}$ with no operational effect on Term B; the truncated vector will be written $\bar{\mu}^*_B$. Term B can then be written as

    \sum_{s_{n_a} \in S} p^{u,\bar{\mu}^*}_{i,s_{n_a}} \, E_{\bar{\mu}^*_B}\Big[ \sum_{t=n_a}^{n-1} \gamma^{t} c_t \;\Big|\; s_{n_a}, s_n = l, u_{n_a} \Big].

Note that $u_0 = u$ in the conditional part would work as well, but it would not be as clear. Defining $t' = t - n_a$ and $n = n_a + n_b$, Term B can be written as

    \sum_{s_{n_a} \in S} p^{u,\bar{\mu}^*}_{i,s_{n_a}} \, \gamma^{n_a} E_{\bar{\mu}^*_B}\Big[ \sum_{t'=0}^{n_b - 1} \gamma^{t'} c_{t'} \;\Big|\; s_{t=n_a}, s_{t'=n_b} = l, u_{t=n_a} \Big]

and then as

    \sum_{s_{n_a} \in S} p^{u,\bar{\mu}^*}_{i,s_{n_a}} \, \gamma^{n_a} \min_{u_{n_a} \in U(s_{n_a})} Q^*(s_{n_a}, l, u_{n_a}, n_b).   (3.35)

Note that Term A is also not in the form of Q*. $\bar{\mu}^*$ is truncated to $\{\mu^*_1, \dots, \mu^*_{n_a - 1}\}$ with no operational effect on Term A and is renamed $\bar{\mu}^+$. The symbol + indicates a different kind of optimality: this $n_a$ element policy vector is not optimal in taking the system from i to $s_{n_a}$ in the sense of Eq. (3.33), but is optimal in taking the system from i to l when followed by $\bar{\mu}^*_B$. This different sense of optimality will be discussed further below. Term A can be rewritten as

    \sum_{s_{n_a} \in S} p^{u,\bar{\mu}^+}_{i,s_{n_a}} \, E_{\bar{\mu}^+}\Big[ \sum_{t=0}^{n_a - 1} \gamma^{t} c_t \;\Big|\; s_0 = i, s_{n_a}, u_0 = u \Big],   (3.36)

and the corresponding Q notation is Q+. The difference between Term A and this expression is that the latter now has a definite end state, $s_{n_a}$. Rewriting Expression (3.36) in Q+ form and adding it to Expression (3.35), we get

    Q^*(i, l, u, n) = \sum_{s_{n_a} \in S} p^{u,\bar{\mu}^+}_{i,s_{n_a}} \Big( Q^+(i, s_{n_a}, u, n_a) + \gamma^{n_a} \min_{u_{n_a} \in U(s_{n_a})} Q^*(s_{n_a}, l, u_{n_a}, n_b) \Big).   (3.37)

Eq. (3.37) will be called the Linkup Relation. An intermediate state is a state at $t = n_a$. A linkup operation is the addition of a Q+ and a Q* at one particular intermediate state. The same name is used for the addition of the estimates of those quantities as well. The word linkup emphasizes the point that a Q* can be constructed by linking up Q+'s and Q*'s, both of which can span multiple time steps and cover a family of state trajectories. Figure 3.10 illustrates the two families of trajectories intersecting at a particular state $s_{n_a}$.

Figure 3.10: Graphical representation of the linkup operation (trajectories from t = 0 to t = $n_a$ meeting trajectories from t = $n_a$ to t = n at the intermediate state).

A circuit is the complete set of trajectories generated by the system under a given control policy or control policy vector, given the start state, the end state, and the number of time steps. Each trajectory occurs with the probability determined by the start state, all the intervening states, the end state, and the policies used. A pre-circuit is a family of trajectories whose evaluation is the value of one Q+ on the RHS of Eq. (3.37). The name also applies in the case of an estimate of Q+. A post-circuit is a family of trajectories whose evaluation is the value of one Q* on the RHS of Eq. (3.37). The name also applies in the case of an estimate of Q*. In Figure (3.10), the mass of trajectories to the left of $t = n_a$ is the pre-circuit and the mass of trajectories to the right of $t = n_a$ is the post-circuit. If there is no need to distinguish between a pre-circuit and a post-circuit, a family of trajectories will simply be referred to as a circuit.
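To make the Linkup Relation concrete, a minimal sketch of one evaluation of Eq. (3.37) over lookup tables is given below. The names (linkup, Q_plus, Q_star, p_pre, actions) are illustrative assumptions rather than notation from this thesis, and the tables are assumed to be already populated for the horizons involved; the point is only that a single linkup combines two families of multi-step trajectories, whereas a One-Step Q-Learning backup touches a single one-step transition.

    # Sketch of one evaluation of the Linkup Relation, Eq. (3.37).
    # Q_plus[(i, s_na, u, na)] : pre-circuit value  Q+(i, s_na, u, na)
    # Q_star[(s_na, l, u, nb)] : post-circuit value Q*(s_na, l, u, nb)
    # p_pre[(i, s_na, u, na)]  : multi-step probability of reaching s_na from i
    #                            under u followed by the + policy (assumed known)
    def linkup(i, l, u, n, na, states, actions, Q_plus, Q_star, p_pre, gamma=0.9):
        """Return Q*(i, l, u, n) by linking pre- and post-circuits at t = na."""
        nb = n - na
        total = 0.0
        for s_na in states:                      # sum over intermediate states
            pre = Q_plus.get((i, s_na, u, na))
            prob = p_pre.get((i, s_na, u, na), 0.0)
            if pre is None or prob == 0.0:
                continue                         # s_na not reachable in na steps
            # inner minimization over the first post-circuit action
            candidates = [Q_star[(s_na, l, u_na, nb)] for u_na in actions(s_na)
                          if (s_na, l, u_na, nb) in Q_star]
            if not candidates:
                continue                         # l not reachable from s_na in nb steps
            total += prob * (pre + gamma ** na * min(candidates))
        return total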
3.5.1 Goal states and reached states

Having $s_{n_a}$ as a goal state and ending up at $s_{n_a}$ are different concepts. The difference shows up in the cost function. The phrase reached state may suggest that this state is the desired goal state, but it simply means the state actually reached at the end of a trajectory. However, if the system is deterministic and the cost function is not defined with a specific goal state, the reached state can be thought of as the desired goal state. Otherwise, a separate set of Q-functions would be needed for each goal state, which would be important to avoid.

3.5.2 The horizon

Infinite horizon cost is a mathematical abstraction. In practice, only finite horizon costs can be learned. A finite horizon controller of a certain length is very different from one of a different length. If we take the extreme case of n = 1, we can see that the optimal policy will be very different from the optimal policy for n = ∞. The n = 1 policy is far too myopic and will not be very useful in controlling anything as complex as a walking machine, for example. If n = ∞, the control policy vector collapses into a single control policy.

3.6 Pre- and Post-Circuits

In this section, Eq. (3.37) will be used as the basis for discussion. There are important differences between a pre-circuit policy, $\bar{\mu}^+$, and a post-circuit policy, $\bar{\mu}^*$. A post-circuit policy is similar to the policy $\bar{\mu}^*$ of Q*(i, l, u, n), differing only in the horizon. $\bar{\mu}^*$ is optimal in that the sum of the Q*'s weighted by the multi-step transition probabilities is optimal. The same is true for the post-circuit policy. But $\bar{\mu}^+$ is optimal in a different sense. $\bar{\mu}^+$ generates pre-circuits from i to $s_{n_a}$ with probabilities $p^{u,\bar{\mu}^+}_{i,s_{n_a}}$ at expected costs of Q+(i, $s_{n_a}$, u, $n_a$). This family of costs and probabilities, combined with the expected costs and probabilities of the post-circuits, is optimal in the sense of Expression (3.38). Confined within the period t = 0 to t = $n_a$, $\bar{\mu}^+$ is not necessarily optimal in that sense. Therefore, when a Q* is used in a pre-circuit capacity, its policy $\bar{\mu}^*$ may not deliver the optimal combination of Q's and multi-step probabilities needed to achieve the overall optimal objective. This complication does not arise in One-Step Q-Learning. The immediate costs already have an action specification and have no further policy specifications, so there is no pre-circuit policy vector to consider.
The complication with the probabilities in the stochastic linkup relation, Eq. (3.37), can be thought of as follows: for controlling stochastic systems in the same manner as One-Step Q-Learning, the Q*'s should be used by minimizing the expected value

    \sum_{l \in S} p^{u,\bar{\mu}^*}_{i,l} \, Q^*(i, l, u, n)   (3.38)

with respect to u. This will give us the same sum as Eq. (3.27). It is straightforward to find $\bar{\mu}^*$; it can be done by One-Step Q-Learning. However, we really want to do it by using the Linkup Relation. To do so, we need to know how to find $\bar{\mu}^+$. This is not easy because this policy does not have an obvious optimality criterion.

3.7 Summary

The stage for the presentation of a new approach to Q-Learning was prepared by the consideration of the problems of approximation errors and the need to extend RL to hierarchies. This approach is centered on the Linkup Relation. It is expected that algorithms based on this relation will be able to at least partially solve these problems. The Linkup Relation presents many new opportunities. It may be possible to use it to control the same stochastic systems that Q-Learning is intended to control [103], but the pre- and post-circuits present several challenges to the application of the relation. They will be explored in depth in the next chapter. If the system is deterministic, or if a fixed policy is operating during learning, the linkup relation can be cleanly exploited. Several algorithms for deterministic systems will be presented in the next chapter. One algorithm for a learning situation suited for TD(λ) will also be presented.

Chapter 4

The Problem Formulation and Proposed Solutions

4.1 The Initial Problem Definition

The standard task studied in recent works in reinforcement learning is the Markovian decision task. The system to be controlled is a finite-state, discrete-time, stochastic dynamic system with finite state set S. Let U be the finite set of actions available to the controller, which may be different depending on the state. At each time step t, the controller observes the current state $s_t \in S$ and executes action $u_t \in U(s_t)$. The system responds to the action and moves to the next state $s_{t+1} \in S$ according to a transition probability $p_{s_t,s_{t+1}}(u_t)$. The controller receives from the system, or computes, a payoff value $c_t(s_t, u_t)$. The system is time-invariant, meaning that the transition probabilities and the payoffs associated with the transitions are constant in time. The object of learning is to find a control policy which optimizes some useful function of the payoffs. We are interested in this expected infinite-horizon discounted cost:

    Q_{\mu}(s_0, u_0) = E_{\mu}\Big[ \sum_{t=0}^{\infty} \gamma^{t} c_t \;\Big|\; s_{t=0} = s_0, u_{t=0} = u_0 \Big],   (4.39)

where $\mu$ is the control policy and $|c_t| < \infty$. The object of the optimization is to find the optimal control policy $\mu^*$.
$\mu^*$ is the policy which minimizes $Q_{\mu}(s_0, u_0)$ for every state-action vector $\{s_0, u_0\}$. Once all the $Q_{\mu^*}$'s are found, the optimal policy can be computed by finding the minimum of $Q_{\mu^*}(s, u)$ at every s with respect to $u \in U(s)$. The infinite horizon is an abstraction. During the course of computing Q, Q's of varying finite actual horizons will have to be constructed. The infinite horizon Q will only be constructed after an infinite period of learning.

4.2 Considerations in Applying the Linkup Relation

To uncover the issues that arise in the application of the Linkup Relation, we need to outline a prototype algorithm first. These issues could not be properly discussed at the end of the last chapter, when a prototype algorithm had not yet been introduced. For the sake of convenience, the word policy may be used in place of policy vector.

4.2.1 Prototype algorithm

A Q is the expected finite-horizon discounted sum of the payoffs associated with a circuit. In the Linkup Relation, the correct value of this sum for a post-circuit with an optimal policy vector is denoted by Q*. The notation Q (without the star) will be used for the estimate of Q*. The general idea is to use the Linkup Relation to link up the Q+ for multiple time-step pre-circuits with the Q* for multiple time-step post-circuits. This generates the Q* for new post-circuits having a horizon which is increased by the length of the pre-circuit horizon. Recursively doing this with successively longer horizon post- and pre-circuit Q's should result in very fast buildup of the horizon. The linkup operation (as opposed to the Linkup Relation) is illustrated in Figure (3.10).

The prototype algorithm is synchronous, meaning that all possible linkup operations at a fixed intermediate time $t = n_a$ and given $s_0$ and $s_n$ take place effectively simultaneously. Referring to Figure (3.10), we want to compute the contribution of the linkup of these two circuits to Q($s_0$, $s_n$, $u_0$, n). Keeping in mind that $n = n_a + n_b$, and considering only a linkup at time $t = n_a$, this intersection of circuits at the particular intermediate state $s_{n_a}$ is just one possibility among all possible states.
T h i s va lue m u l t i p l i e d b y t he p r o b a b i l i t y p " 0 ° i S n a u s i n g the + p o l i c y is the c o n t r i b u t i o n of th is l i n k u p to Q*(so, sn, uo, n), i .e . , th i s c o n t r i b u t i o n , denoted b y AQ*(so, sn, UQ, n), is equa l t o Pnso,sna \Q+(so,Sna,u0,na) +\na min Q*(sna,sn,una,n-na) . (4.40) \ «n a eu (s„ a ) J T h i s process proceeds recurs ive ly , b e g i n n i n g o n l y w i t h Q+ s p a n n i n g 1 t ime-s tep , Q* s p a n n i n g also 1 t i m e step, a n d the 1 t i m e step t r a n s i t i o n p r o b a b i l i t i e s . L o n g e r and longer Q + , s a n d Q*'s are b u i l t u s ing the mos t recent ly c o m p u t e d Q + , s a n d Q* ' s . E v e n t u a l l y , we want to be able to c o m p u t e , for some con t ro l s i tua t ions , Q(so,u0,oo) w i t h o u t reference to a n e n d state a n d , for other s i tua t ions , Q(SQ, SOO, UQ, O O ) . O t h e r t h a n some m i s s i n g pieces, the subject of discussions shor t ly , a ske le ta l frame-w o r k is now i n p lace for the d iscuss ion of the issues of e x p l o i t i n g the L i n k u p R e l a t i o n . W i t h th i s f r amework , efforts to discover a convergence resul t show tha t to do l i n k u p i n a m a n n e r analogous to tha t i n b a c k u p is not s t ra ight fo rward . T h e r e are i m p o r t a n t issues tha t have to be so lved or c i r c u m v e n t e d . These issues are not nea t ly o r t hogona l . 4.2.2 Issue #1 L e t the n a m e * policy refer to the con t ro l p o l i c y associa ted w i t h the o p t i m a l pos t - c i r cu i t , or to see i t i n another way, associa ted w i t h Q*. T h e * p o l i c y for t = 0 to t = n c an be d i v i d e d i n t o a * p o l i c y for the t i m e t = na to t = n w h i c h w i l l be ca l l ed the component * policy a n d a n o p t i m a l p o l i c y f r o m t = 0 to t = na w h i c h w i l l be ca l l ed the component Chapter 4. The Problem Formulation and Proposed Solutions 75 + policy. A + p o l i c y is not s i m i l a r to a * p o l i c y because i t does not o p t i m i z e a s u m l i k e E q . (3.38). Ins tead, i t has to be fo l lowed b y a * p o l i c y to o p t i m i z e such a s u m . F o r i d e n t i c a l sys tems, hor izons and o p t i m a l i t y c r i t e r i a , the + p o l i c y is not necessar i ly i d e n t i c a l t o the * p o l i c y . T h i s s ta tement is a c t u a l l y i l l - posed because a + p o l i c y depends o n the succeed ing * p o l i c y , bu t i t is an i m p o r t a n t po in t and has m a n y consequences. T h e f o l l o w i n g p roo f shows th is is t rue . Proposition 4.1 Given an optimal policy for a finite-horizon problem, the elements of the component + policy of half the horizon are not always identical to the elements of the component * policy of half the horizon. Proof. We begin with the hypothesis that the + policy is a lways identical to the * policy. To show that this is not the case, we need to find one particular system within the system definition at the beginning of this chapter that violates this hypothesis. We choose the case where the pre- and post-circuits are of length 1. This means na = 1 and n = 2 . We choose the case where the intermediate state j2 is identical to so and to s2. In Figure (4-11), the solid lines are the complete family of trajectories from t — 0 to t = 2, starting at a particular start state s0. The transition probability p and the transition cost c are shown on the figure alongside the associated transition. 
Let us say that the discount value, $\gamma$, is equal to 0.9. There are only 2 action choices at the state $s_0$, i.e., also at state j and $s_2$. The two families of trajectories produced from these two action choices are shown by the trajectories emanating from $s_0$ and from j. These choices are completely allowable by the system definition in Section 4.1. Figure (4.11) shows the description.

Figure 4.11: Why the + policy may be different from the * policy (trajectories from t = 0 to t = 2; transition probabilities p = 0.1, 0.9 and transition costs c = 1.1, 1.0 are marked alongside the transitions).

    u(s_0)   u(j)   two-step expected sum given s_{t=0} = s_0, u_{t=0} = u(s_0)
    0        0      1.1 + (0.9*0 + 0.1*1.1) = 1.21
    0        1      1.1 + (0.9*0 + 0.1*1.0) = 1.20
    1        0      1.0 + (0.1*0 + 0.9*1.1) = 1.99
    1        1      1.0 + (0.1*0 + 0.9*1.0) = 1.90

Table 4.1: 2 step expected discounted sum for all possible combinations of u(s_0) and u(j).

The family depicted by the solid lines can be separated into a set of pre-circuits and a set of post-circuits. The intermediate states are at t = 1. All the possible paths from $s_0$ to j are the pre-circuit associated with j. All the possible paths from j to $s_2$ are the post-circuit associated with j. Table 4.1 shows the 2 time-step expected discounted sum from $s_0$ for all possible combinations of u($s_0$) and u(j). As shown in Table 4.1, the best combination is u($s_0$) = 0 and u(j) = 1. Now note that the pre-circuits associated with j are different from the post-circuits associated with j. Remember that, given the beginning state, the end state, and the number of steps, the circuits can only differ because of the different policies. This means the pre-circuit policy is different from the post-circuit policy. Since the overall policy is optimal, the pre-circuit policy must be + optimal and the post-circuit policy must be * optimal. Therefore, one case is found where the + policy is different from the * policy for the same start states, the same end states, and the same time length. This shows that the + policy is not always identical to the * policy. □
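The arithmetic behind Table 4.1 can be checked mechanically. The short sketch below recomputes each row from the first-step cost, the two branch probabilities, and the second-step costs exactly as they appear in the table (it does not attempt to reconstruct Figure 4.11 itself; the variable names are purely descriptive), and confirms that the minimizing combination is u(s_0) = 0, u(j) = 1.

    # Recompute Table 4.1: two-step expected sums for every (u(s0), u(j)) pair.
    first_cost  = {0: 1.1, 1: 1.0}                 # first-step cost, by u(s0)
    branch_prob = {0: (0.9, 0.1), 1: (0.1, 0.9)}   # (zero-cost branch, costed branch), by u(s0)
    second_cost = {0: 1.1, 1: 1.0}                 # second-step cost on the costed branch, by u(j)

    table = {(u0, uj): first_cost[u0]
                       + branch_prob[u0][0] * 0.0
                       + branch_prob[u0][1] * second_cost[uj]
             for u0 in (0, 1) for uj in (0, 1)}

    for key in sorted(table):
        print(key, round(table[key], 2))           # 1.21, 1.20, 1.99, 1.90 as in Table 4.1
    print("best combination:", min(table, key=table.get))   # (0, 1)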
Issue #1 is that there is no elegant way of finding the + policy, or of constructing longer horizon + policies from shorter horizon + policies. This is reflected in the fact that there is no update or constructive equation for Q+. This means we cannot lessen the burden of exhaustive search through the control policy space for the time length from t = 0 to t = $n_a$. Also, the result of this exhaustive search cannot be re-used for * policies of different lengths because they may have different + policies. This is complicated further by Issue #3, to be presented below.

4.2.3 Issue #2

Issue #2 is the need for the estimation of the pre-circuit transition probabilities $p^{u_0,\bar{\mu}^+}_{s_0,s_{n_a}}$ and the post-circuit transition probabilities $p^{\bar{\mu}^*}_{s_{n_a},s_n}$. The post-circuit probabilities are needed to choose an action at t = $n_a$ to minimize the expected value in the Linkup Relation. Obtaining these post-circuit probabilities is relatively simple compared to obtaining the pre-circuit transition probabilities, because it is relatively easy to find the * policy. Suppose the * policy is obtained for the time t = $n_a$ to t = n; we can then use the system to supply the transition probabilities. One-step transition probabilities could be estimated by probing and observing the system over a period of time. Some method must be used to store the estimates. Once the complete one-step transition probabilities are obtained, $p^{\bar{\mu}^*}_{s_{n_a},s_n}$ can be found, for example, by Step 4 in Table 4.9.

The pre-circuit probabilities are needed to compute the contribution of a linkup to the n step Q. But since there is no ready-made + policy vector, if a model of the system is unavailable, then to obtain the pre-circuit transition probabilities we would have to exhaustively test over all possible pre-circuit policy vectors to get the probabilities and, through these probabilities, find the + policy. This process is clearly very expensive.
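For the post-circuit side, where the policy vector is known, composing stored one-step probabilities into multi-step ones is mechanical. The following is a minimal sketch under the assumption of a small tabular system, with numpy arrays standing in for the lookup tables mentioned above; nothing in it touches the harder, pre-circuit half of this issue.

    import numpy as np

    # Sketch: given one-step transition matrices P[u][i, j] = p_{i,j}(u) and a
    # policy vector mu = [mu_0, ..., mu_{n-1}] (each mu_t maps state -> action),
    # build the n-step transition matrix under that policy vector.
    def multi_step_probs(P, mu):
        n_states = next(iter(P.values())).shape[0]
        def step_matrix(policy_t):               # one-step matrix induced by mu_t
            M = np.zeros((n_states, n_states))
            for i in range(n_states):
                M[i, :] = P[policy_t(i)][i, :]
            return M
        result = np.eye(n_states)
        for policy_t in mu:                      # chain the per-time-step matrices
            result = result @ step_matrix(policy_t)
        return result                            # result[i, s_n] = n-step probability

    # toy usage with 2 states and 2 actions (illustrative numbers only)
    P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
         1: np.array([[0.1, 0.9], [0.7, 0.3]])}
    mu = [lambda s: 0, lambda s: 1, lambda s: 0]  # a 3-step policy vector
    print(multi_step_probs(P, mu))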
T h e w o r d " o n l y " is used because the goal is to b u i l d Q for hor izons tha t are effectively in f in i te . If the h o r i z o n is not effectively in f in i t e , the p e n a l t y is the storage of a l l Q* for every l eng th . T h i s p e n a l t y w o u l d o n l y be i n c u r r e d i f we want the a c t u a l expec ted d iscounted s u m of the payoffs to be i den t i ca l to tha t p r e d i c t e d b y the longest h o r i z o n Q*. In p rac t i ce , the h o r i z o n is u s u a l l y receding such tha t the h o r i z o n is fixed l eng th . If th i s is the case, Q*'s for shor ter hor izons can be d i sca rded after b u i l d - u p . T h e m e a n i n g of b u i l d - u p w i l l be clear la ter . B u t th is means the a c t u a l Q for a ce r t a in l eng th of t i m e w i l l be different t h a n the c o m p u t e d Q*. T h e p r o b l e m is m o r e serious for the p re -c i rcu i t po l i cy . In the worst case, we need a comple t e set of o p t i m a l p re -c i rcu i t p o l i c y vectors for pos t -c i rcu i t s of every l eng th . D u r i n g the b u i l d - u p t o w a r d Q* of effectively in f in i te hor izons , a n u m b e r of l i n k u p cycles have Chapter 4. The Problem Formulation and Proposed Solutions 80 f = 0 -— ^ = 1 Q F i g u r e 4.13: I l l u s t r a t i o n of the tree of t ra jector ies for different po l i c ies at each t i m e step. M a x i m u m h o r i z o n is 10. to be c o m p l e t e d . L e t ' s say N is the n u m b e r of l i n k u p cycles a n d i n each cyc le the post-c i r c u i t is different, w h i c h c o u l d ve ry w e l l be the case d u r i n g b u i l d - u p . Issue # 1 means tha t , i n the wors t case, we w o u l d have to p e r f o r m a cos t ly search for the + p o l i c y N t imes . T h e Q for the p re -c i r cu i t p e r i o d is a c t u a l l y specif ied by the i n p u t parameters {so, sna, Uo, u(si), ..., u(sna-i), na}. It is obvious tha t the size of i n p u t space grows q u i c k l y w i t h the n u m b e r of a c t i o n choices, the n u m b e r of states, and the l eng th of the p re -c i rcu i t h o r i z o n . In O n e - S t e p Q - L e a r n i n g , the con t ro l po l ic ies and the i r eva lua t ions are r e a l l y for f in i te hor izons . A s l e a rn ing progresses, the ac tua l hor izons increase f r o m 1 t o w a r d inf in i ty . T h e L i n k u p R e l a t i o n forces us to dea l w i t h the f in i te l eng th h o r i z o n p r o b l e m . In the conven t i ona l forms of Q - L e a r n i n g , th i s p r o b l e m is ignored by c o m p u t i n g expec t ed values over a ve ry l o n g t i m e , bu t i t is i m p l i c i t l y there nonetheless. A n c o - h o r i z o n o p t i m a l p o l i c y is not necessar i ly o p t i m a l i n the short t e r m . T h e stan-d a r d Q L a n d D P a l g o r i t h m s jus t seek a consis tent p o l i c y . If the c o n t r o l p r o b l e m is Chapter 4. The Problem Formulation and Proposed Solutions 81 • * * t = 0 t = n. t = n F i g u r e 4.14: C o n t r i b u t i o n s f r o m different t r a j ec to ry f ami l i e s . f i n i t e - t ime a n d th is t i m e is k n o w n , a co l l ec t i on of shorter t e r m pol ic ies w o u l d y i e l d be t te r pe r fo rmance . B y not d i s t i n g u i s h i n g be tween different a c t u a l ho r i zons , these a lgo r i t hms force the s y s t e m to choose o n l y one po l i cy . 
4.2.5 Issue #4

As shown by Figure (4.14), a linkup can happen at different intermediate states $s_{n_a}$ and for different end states $s_n$. Issue #4 is that, to find the + policy vector, we need to find the effect of all possible policies across all possible $s_{n_a}$'s and all possible $s_n$'s at the correct pre- and post-circuit transition probabilities. The effort it takes to do this is simply unthinkable. For example, referring to Figure (4.14), suppose the intermediate state of a linkup is j; we need to know the + policy vector. But it is unknown whether a pre-circuit policy vector is + optimal until we have computed Q($s_0$, $s_n$, $u_0$, n) for this policy vector, taking into account the pre-circuit transition probabilities produced by this policy vector and the post-circuit probabilities produced by the * policy vector. There is too much data to bring together for the evaluation of just one possible pre-circuit policy vector.

4.2.6 Linkup is useful in two situations

The above four issues can be avoided in at least two situations. First, if the system is deterministic [104]. Having a deterministic system eliminates the complications associated with the estimation, computation, and storage of transition probabilities. It also eliminates the problems with the + policy. Second, if there is only one policy operating at all times. This means the linkup relation can be used for the fast evaluation of the near infinite time performance of a single control policy.

4.2.7 Deterministic systems

If the system is deterministic, the system can be driven to any state reachable in a specific length of time. The purpose of the * policy is to take the system from a specified state to a specified end state in an exact length of time at minimal cost. The purpose of the + policy is to take the system from a specified start state to the best intermediate state at minimal cost. So these policies are similar except for the state specification. The issues with transition probabilities are irrelevant. This eliminates Issues #2 and #4 directly. Without transition probabilities, the meanings of the + policies and the * policies are much simpler. If it is known which intermediate state is the best, and if the pre- and post-circuits are equal in length, then the * policy would be + optimal for the pre-circuit as well. This eliminates Issue #1. The major operation here is choosing the best intermediate state. Regarding Issue #3, the Q* of every length still needs to be kept. This is so the control sequence can be executed to achieve the Q* of the longest horizon.
We also have the option of using a receding horizon during execution to eliminate this need.

4.2.8 Evaluating one policy at a time

If only one policy is being evaluated, there is no distinction between pre- and post-circuit policies and we can disregard Issue #1. Since the same policy is used at all times, Issue #3 can be disregarded as well. The transition probabilities still need to be accounted for, but now they are functions of $\{s_0, s_n, n\}$ rather than $\{s_0, s_n, u_0, \mu_1, \mu_2, \dots, \mu_{n-1}, n\}$. The storage of these probabilities is simplified. The transition probabilities can be used for both pre- and post-circuits. The transition probabilities can also be linked up in some way. These simplifications eliminate much of Issues #2 and #4.

The system may also be used to provide the transition probabilities, i.e., using an on-line algorithm. In this method, no estimates of transition probabilities need to be kept. Instead, in the manner of One-Step Q-Learning, the transitions are observed and the f's are changed by a small amount. These f's will be weighted by the different transitions at their sample frequencies after a sufficiently long period of time.

A Linkup Dynamic Programming (LDP) algorithm for evaluating one policy is useful in an actor-critic system. The critic's function is to quickly evaluate the mapping of the actor with respect to the optimality criterion. Ideally, the actor mapping is constant while it is being evaluated. If both the critic and actor mappings are adapted concurrently, the actor should change slowly enough for the critic to evaluate it correctly. In such a case, the quicker the critic can adapt to changes in the output of the actor, the more direct the relationship will be between the action output by the actor and the secondary reinforcement.

4.3 A Family of LADDP Algorithms for Deterministic Systems

The existing Q-Learning and incremental DP algorithms assume a stochastic system. This assumption is unnecessarily general for many important control problems and has resulted in unnecessarily weak algorithms for these problems. This section presents a variation of the Linkup Relation and a class of algorithms which assume a deterministic system. This class of algorithms has the characteristics of a greatly reduced number of linkup cycles and rapid growth of the horizon during learning.
The additional memory and computational requirements can be compensated for by the use of continuous function approximation, which is already required for most real world applications of Q-Learning algorithms.

The deterministic system would be almost identical to the stochastic one described in Section 4.1. The main difference is that $p_{i,j}(u)$ will be equal to 1 for exactly one state $j \in S$ and 0 for all other states. More so than the stochastic system, the system may be unreachable from some states to some other states in a fixed number of time steps, and sometimes some states are unreachable in any number of time steps from other states.

Without the complication of stochasticity, much of the complexity in the prototype algorithm disappears. We only need to find the intermediate state $s_{n_a}$ which optimizes the payoff of moving from $s_0$ to $s_n$. The deterministic form of the linkup relation (or the Deterministic Linkup Relation) is

    Q^*(i, l, u, n) = \min_{s_{n_a} \in S} \Big( Q^*(i, s_{n_a}, u, n_a) + \gamma^{n_a} \min_{u_{n_a} \in U(s_{n_a})} Q^*(s_{n_a}, l, u_{n_a}, n_b) \Big).   (4.42)

The minimization done inside the large parentheses will be called the inner minimization and the minimization done outside them the outer minimization. Note that the pre-circuit Q+ is now relabelled as Q*. A post-circuit type policy which optimally moves the system from i to l will, when used in the capacity of a pre-circuit type policy, also move the system from i to $s_{n_a}$ optimally. The effects of a pre-circuit policy are entirely captured in $s_{n_a}$. We have no need to contend with multi-step transition probabilities. So within the period from t = 0 to t = $n_a$, the meaning of an optimal policy is exactly the same as that of an optimal policy from t = 0 to t = n or from t = $n_a$ to t = n. Therefore, the minimized state-to-state Q* functions in the previous linkup cycle can be used directly as the pre-circuit Q*'s in the present cycle.
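A single linkup cycle of the Deterministic Linkup Relation is short enough to spell out. The sketch below assumes tabular Q* dictionaries whose horizons are fixed and therefore left out of the keys, with missing entries playing the role of the tag value for unreachable pairs; the names are illustrative, not the thesis's.

    # One linkup cycle of Eq. (4.42): combine na-step pre-circuit Q*'s with
    # nb-step post-circuit Q*'s into (na+nb)-step Q*'s.  Each table maps
    # (start_state, end_state, first_action) -> cost; absent entries mean
    # "unreachable" (the tag value of the value identification phase).
    def linkup_cycle(Q_pre, Q_post, states, actions, na, gamma=0.9):
        Q_new = {}
        for (i, s_na, u), pre_cost in Q_pre.items():
            for l in states:
                # inner minimization: best first action of the post-circuit at s_na
                candidates = [Q_post[(s_na, l, u_na)] for u_na in actions(s_na)
                              if (s_na, l, u_na) in Q_post]
                if not candidates:
                    continue                       # l unreachable from s_na
                value = pre_cost + gamma ** na * min(candidates)
                key = (i, l, u)
                # outer minimization: keep the best intermediate state s_na
                if key not in Q_new or value < Q_new[key]:
                    Q_new[key] = value
        return Q_new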
Three algorithms are devised to use this deterministic linkup relation, Eq. (4.42). Each of the three algorithms consists of two phases: a Value Identification Phase and a Build-up Phase. All three algorithms share the same value identification phase, as described in Table 4.2.

Step 1. Start with time index t = 0. Initialize the system to some state $s_0 \in S$.
Step 2. Excite the system with some action $u_0 \in U(s_0)$. Record the cost $c_0(s_0, u_0)$ in a lookup table as a function of the old state $s_0$, the new state $s_1$, and the action $u_0$. This is Q* for a horizon of 1, $Q^*(s_0, s_1, u_0, 1)$. (For Algorithm 1, this function has an additional input specifier of n = 1.)
Step 3. $s_0 := s_1$. Reset the time index t to 0. Randomly, or using a schedule, choose a new action $u_0$. If all possible combinations of old states $s_0 \in S$ and actions $u_0 \in U(s_0)$ have been tried, go to Step 4. If not, go to Step 2.
Step 4. Give a tag value, e.g., an impossible cost value, to entries in the lookup table which have not been filled. Exit.

Table 4.2: Value identification phase.

The algorithms differ in their build-up phases. A value identification phase observes the system responses and constructs the Q*-values for horizon 1. These values will be re-used to build longer horizon Q*'s. Compared to One-Step Q-Learning, which does not retain or re-use the observations, these algorithms use fewer input-output samples to build an optimal control policy.

The build-up phase of an algorithm follows the value identification phase. When the new control policy is completed, it will go into effect. For adaptability to system change, the value identification and build-up phases can be repeated continuously in parallel during the control of the system. The interleaving of value identification and build-up can be done to various degrees. It is not necessary to wait for a complete value identification cycle to finish before building up the control policy. This would give the control system adaptability between complete value identification cycles. The main price for this adaptability is that the system performance will be affected by the exploration.

4.3.1 Algorithm 1

Algorithm 1 stores pre-circuit Q*'s and post-circuit Q*'s of many different lengths in a common pool and asynchronously links them up to create longer Q*'s. This process follows Eq. (4.42) closely. If an end state l is unreachable from $s_{n_a}$ starting with $u_{n_a}$ in $n_b$ steps, a tag value is given to the corresponding Q*. Table 4.3 describes this algorithm.

In Step 2 of Table 4.3, the "replace" operation performs the outer minimization. The replace operator is used instead of a "sum" operator as in One-Step Q-Learning (see Eq. (3.29)) because the trajectories are deterministic; it is not necessary to stack trajectories together by summation. There are substantial redundancies in this algorithm, but it does permit the Q*'s a loose way of reaching the longer horizon without operationally requiring complete exploration of their input space. This might be a nice algorithm for generating useful control laws quickly without complete information. Obviously, many versions of this algorithm are possible. A useful version is one in which only Q*'s with n's greater than a certain lower bound are kept. This lower bound starts from 1 and is continually revised as the build-up phase progresses. This removes the short horizon Q*'s which are no longer being used to build up longer horizon Q*'s.
Step 1. From the set of existing $Q(s_{n_a}, l, u_{n_a}, n_b)$'s, disregarding their optimality, collect Q's of the same $\{s_{n_a}, l, n_b\}$ specification for all available $u_{n_a}$'s. Find the minimum with respect to $u_{n_a}$, i.e., $\min_{u_{n_a} \in U(s_{n_a})} Q(s_{n_a}, l, u_{n_a}, n_b)$.
Step 2. Compute the new $Q^*(i, l, u, n)$ according to
$Q^*(i, l, u, n) = Q^*(i, s_{n_a}, u, n_a) + \gamma^{n_a} \min_{u_{n_a} \in U(s_{n_a})} Q^*(s_{n_a}, l, u_{n_a}, n_b)$.
This new value replaces any larger value, or non-existent value, of the input specification $\{i, l, u, n\}$, $Q(i, l, u, n)$. Do this for all available Q's of the same pre-circuit input specification, $Q(i, s_{n_a}, u, n_a)$.
Step 3. Go to Step 1.

Table 4.3: Build-up phase of Algorithm 1.

4.3.2 Algorithm 2

Algorithm 2 builds up Q* functions of a fixed horizon, say m, for the pre-circuit and uses them as simple cost functions that replace the one-step costs in ADDP. This method is synchronous and builds Q* of a particular horizon completely before working on Q*'s of m-step longer horizon. Since we know the horizon of the pre-circuits, it is not necessary to have n as an input. This process' update equation is

    Q^*(i, l, u) := \min_{s_m \in S} \Big( Q^*(i, s_m, u) + \gamma^{m} \min_{u_m \in U(s_m)} Q^*(s_m, l, u_m) \Big).   (4.43)

Table 4.4 describes Algorithm 2. There is no need to use an estimate of Q* because all Q-values are optimal while the system is learning. With some function approximators, e.g., multilayer feedforward neural networks, the accuracy of approximation depends, at least initially, on the amount of resources spent on the learning. This means it may be a good idea to spend more resources on refining the approximation for a fixed length pre-circuit and perhaps a comparable amount of resources on the linkups as used in ADDP. The number of linkups is roughly m - 1 times less than the number of backups in a comparable ADDP algorithm. A comparable ADDP algorithm is the Synchronous Q-Learning Algorithm to be described in Chapter 5. The cumulative error of the linkups due to approximation errors may be lower than if a comparable amount of computational power is spent on ADDP.

Step 1. Choose 1 < M < H, where H is the desired horizon and H/M is an integer. Compute $Q^*(s_0, s_M, u_0, M)$ using one of the following options:
Option 1: Compute $Q^*(s_0, s_M, u_0, M)$ using the SQL algorithm described in Table 4.6.
Option 2: Use the Q* of horizon 1 from the value identification phase to build Q* functions of the desired pre-circuit horizon M, following
$Q^*(i, l, u) := \min_{s_m \in S} \big( Q^*(i, s_m, u) + \gamma^{m} \min_{u_m \in U(s_m)} Q^*(s_m, l, u_m) \big)$ with m = 1 for this step.
Option 3: Use any other workable method, such as Algorithm 1 or the Doubling Algorithm described in Section 4.3.3.
Step 2. Using the M-step Q* functions, $Q^*(s_0, s_M, u_0, M)$, computed in Step 1, build Q*'s of the final desired horizon H, $Q^*(s_0, s_H, u_0, H)$, using
$Q^*(i, l, u) := \min_{s_m \in S} \big( Q^*(i, s_m, u) + \gamma^{m} \min_{u_m \in U(s_m)} Q^*(s_m, l, u_m) \big)$ with m = M.
Step 3. Exit.

Table 4.4: Build-up phase of Algorithm 2.

4.3.3 The Deterministic Doubling Algorithm

The third and most intriguing algorithm is called the Deterministic Doubling Algorithm (or simply the Doubling Algorithm). Table 4.5 describes this algorithm. It uses the newly built Q*'s as both the pre-circuit Q*'s and the post-circuit Q*'s in the next linkup cycle. The horizon of the new Q*'s is twice that of the old Q*'s. This means the horizons are built up exponentially. Doubling builds an N-step horizon Q* function using only log2 N linkup cycles. This means an effectively infinite horizon Q function can be built rapidly. This method is synchronous and so does not require n in the input specification. There is no need to use an estimate of Q* because all Q-values are optimal while the system is learning.

Step 1. Use Q*'s of the existing horizon as both pre-circuit and post-circuit Q*'s to build new Q*'s according to
$Q^*(i, l, u) := \min_{s_m \in S} \big( Q^*(i, s_m, u) + \gamma^{m} \min_{u_m \in U(s_m)} Q^*(s_m, l, u_m) \big)$.
Step 2. Update the horizon variable m according to m := 2 * m.
Step 3. If m < M, where M is the desired horizon, go to Step 1.
Step 4. Exit.

Table 4.5: Build-up phase of the Deterministic Doubling Algorithm.

Not to over-emphasize the value and accuracy of analogies or of our own analysis of our thought processes, humans usually know roughly the cost of reaching a certain goal and the time it takes. The generation/retrieval of that knowledge is often an instant response, rather than a much slower mental simulation of our internal world model. This is an argument for the caching of mental simulations; or perhaps the kind of computation done is something like the computations done in this algorithm.
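The loop of Table 4.5 is compact enough to sketch directly. The helper linkup_cycle is assumed to behave like the single-cycle sketch given after Eq. (4.42), and Q1 stands for the horizon-1 table produced by the value identification phase; both names are assumptions for illustration.

    # Sketch of the build-up phase of Table 4.5 (Deterministic Doubling).
    def doubling_build_up(Q1, states, actions, desired_horizon, gamma=0.9):
        Q, m = dict(Q1), 1
        while m < desired_horizon:
            # the current Q*'s serve as both pre- and post-circuit tables,
            # so each cycle doubles the horizon: 1, 2, 4, 8, ...
            Q = linkup_cycle(Q, Q, states, actions, na=m, gamma=gamma)
            m *= 2
        return Q, m          # roughly log2(desired_horizon) linkup cycles in total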
4.3.4 The Synchronous Q-Learning Algorithm

To facilitate comparing various aspects of the algorithms proposed above with ADDP, a synchronous version of One-Step Q-Learning described by Eqs. (3.29) and (3.30) is devised. It will be called the Synchronous Q-Learning (SQL) algorithm. By synchronous it is meant that the Q* values for each actual horizon are built up simultaneously and are completed before the Q* values for the next longer actual horizon are computed. These Q-values are denoted $Q^*(s_0, s_n, u_0)$ because they are also parameterized by $s_n$. Table 4.6 describes the algorithm. The horizon parameters are omitted, since they can be easily tracked during the execution of this algorithm.

Relating to the discussion in Section 2.4.15, the Synchronous Q-Learning algorithm is one of the implied Action Dependent Dynamic Programming algorithms. The main difference between it and the One-Step Q-Learning algorithm is in the scheduling of the backup operation. In terms of the accumulation of approximation errors, this synchronous parent version shares the one-step nature of One-Step Q-Learning, but has the advantage that it allows us to focus on the properties of the backup operation itself rather than on the scheduling of backups.

Step 1. Set the current horizon n to 1.
Step 2. Initialize the system state to some $s_0 \in S$, according to a schedule or randomly.
Step 3. Apply action $u_0 \in U(s_0)$, according to a schedule or randomly. Record $c_t(s_0, u_0)$ in a lookup table. $s_0 := s_1$. If all possible state transitions have not been observed, reinitialize the system state and go to Step 2.
Step 4. Increment the current horizon n by 1.
Step 5. Using the set of $c_t(s_0, u_0)$ recorded in Steps 2 and 3, build up Q* according to
$Q^*(s_0, s_n, u_0) = c_t(s_0, u_0) + \gamma \min_{u_1} Q^*(s_1, s_n, u_1)$
until all Q* for the current horizon n are updated.
Step 6. If n < N, where N is the desired horizon, go to Step 4; else exit.

Table 4.6: The Synchronous Q-Learning Algorithm.
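For comparison, the backup sweep of Table 4.6 can also be sketched. The functions step(s, u) and cost(s, u) below stand for the deterministic one-step model recorded in Steps 2 and 3; they, and the other names, are assumptions for illustration. Unlike the linkup sketches above, the horizon here grows by one per sweep rather than doubling.

    # Sketch of the SQL build-up (Table 4.6): end-state-conditioned Q*'s grown
    # one step of horizon at a time.  Only the final-horizon table is returned
    # here; the algorithm itself keeps the table for every horizon.
    def sql_build_up(states, actions, step, cost, N, gamma=0.9):
        # horizon 1: Q*(s0, s1, u0) = cost of the observed transition
        Q = {(s, step(s, u), u): cost(s, u) for s in states for u in actions(s)}
        for _ in range(2, N + 1):                # one backup sweep per horizon
            Q_next = {}
            for s0 in states:
                for u0 in actions(s0):
                    s1 = step(s0, u0)
                    for sn in states:
                        tails = [Q[(s1, sn, u1)] for u1 in actions(s1)
                                 if (s1, sn, u1) in Q]
                        if tails:                # sn reachable from s1 in n-1 steps
                            Q_next[(s0, sn, u0)] = cost(s0, u0) + gamma * min(tails)
            Q = Q_next
        return Q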
(3.37) is now rep laced by / because the p o l i c y p is a l ready k n o w n . E a c h / is the expec t ed d i scoun ted cost over numerous t ra jec tor ies f r o m the start-i n g state SQ to the reached state sn. T h e / n o t a t i o n requires a s i m p l e r i n p u t p a r a m e t i z a -t i o n t h a n the Q n o t a t i o n . T h i s w i l l s ign i f ican t ly ease the storage issue. T h e m i n i m i z a t i o n opera tor i n E q . (3.37) has been e l i m i n a t e d because there is no longer a need to choose be tween different ac t ions . F r o m E q . (4.45) a n d E q . (3.37), an asynchronous a l g o r i t h m , shown i n T a b l e 4.7, is dev i sed where l i n k u p s t ake p lace as each state t r ans i t i on is observed , s i m i l a r to One-S tep Q - L e a r n i n g . ft is used be low as the es t imate of the / * at t i m e t. at = ctt(s0,sn,n) is the l e a rn ing rate a n d i ts va lue at the dth upda t e of the co r respond ing Q(so, sn, n) is assigned a c c o r d i n g to the s t a n d a r d s tochas t ic a p p r o x i m a t i o n manne r : oo oo 0 < aj, < 1, aj, —> 0 as d —> oo, ^  ad = c o , a n d a\ < oo. O n l y ce r t a in l i n k u p s of p re -c i rcu i t s a n d pos t -c i rcu i t s are p e r m i s s i b l e i n order to f o r m the e x p e c t e d values cor rec t ly . T h e record D i n Step 1 is i m p o r t a n t i n m a k i n g i t poss ible to o b t a i n an unb iased es t imate . T h e reasoning is tha t the n a - s t e p t r a n s i t i o n p r o b a b i l i t y p " j n o is r e a l i z e d i f the t r a n s i t i o n a c t u a l l y took p lace . N o t i c e tha t n o es t imates of t r a n s i t i o n p robab i l i t i e s are e x p l i c i t l y kept . Chapter 4. The Problem Formulation and Proposed Solutions 92 Step 1. E x e c u t e con t ro l ut at t i m e t a cco rd ing to the g i v e n p o l i c y a n d observe the t r a n s i t i o n f r o m state j to state k. U p d a t e a record D of {st} for t € {—T,.. • , 0} where t = —T means T t i m e steps i n the past . S tep 2. U p d a t e the 1-step costs as fol lows: ft(j, k, 1) = ct. Step 3. F o r a l l / a n d na where ft(k, I,, nb — 1) is ava i lab le , p e r f o r m b a c k u p as fol lows: / t + i ( i , I, nb) = (1 - at)ft(j, I, nb) + at (c 4 + -yft(k, I, nb - 1)) . S tep 4. F i n d a l l f(s0,sn,n) w i t h s0 — 5_„ £ D and sn = j, c o l l e c t i v e l y referred to as FA, a n d a l l f(so,sn,n) w i t h So — j, c o l l e c t i v e l y referred t o as FB-C o n s t r u c t new / ' s by l i n k i n g up the elements of FA a n d FB as fol lows: ft+i(i,l,n) = (1 - at)ft(i,l,n) + at (fa{i,j,na) + 'ynafb(j,l,nb)) where fa £ FA, fb G -FBI is a l ea rn ing rate a n d n = na + nb. D o th is for a l l c o m b i n a t i o n s of e lements of FA a n d FB- If -F/i is an e m p t y set, no l i n k u p is pe r fo rmed . It is imposs ib l e for FB to be an e m p t y set after S tep 3. T o ensure tha t the latest i n f o r m a t i o n is i n c o r p o r a t e d i n t o the new values of / ' s , these l i n k u p s are done i n ascend ing order of n. T a b l e 4.7: A s y n c h r o n o u s O n e P o l i c y L i n k u p D y n a m i c P r o g r a m m i n g . T h i s semi-off- l ine a l g o r i t h m is s i m i l a r to T D ( A ) i n purpose , to q u i c k l y evaluate the pe r fo rmance of one con t ro l p o l i c y at a t i m e . In a A C E and A S E s y s t e m [8], the upda te of the A C E a n d A S E m a p p i n g s are done i n lock step w i t h the ra te of d a t a rece ived f r o m the s y s t e m . 
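A minimal sketch of the updates in Table 4.7 may make the interplay of backups and linkups easier to follow. It is not the thesis implementation: the class and method names are invented, the learning rate uses a simple 1/d schedule (which satisfies the stochastic approximation conditions above), and the record D is approximated by a bounded history of recently visited states rather than the exact bookkeeping of which state was visited exactly n_a steps in the past.

```python
# Illustrative tabular sketch of the asynchronous one-policy linkup updates of
# Table 4.7; names and the simplified handling of the record D are assumptions.
from collections import deque

class OnePolicyLinkup:
    def __init__(self, gamma, history_len=64):
        self.gamma = gamma
        self.f = {}          # f[(s0, sn, n)] -> estimated discounted cost of the policy
        self.visits = {}     # per-entry update counts, used for the learning rate
        self.history = deque(maxlen=history_len)   # stands in for the record D

    def _alpha(self, key):
        d = self.visits[key] = self.visits.get(key, 0) + 1
        return 1.0 / d       # a 1/d schedule satisfies the stochastic-approximation conditions

    def _blend(self, key, target):
        a = self._alpha(key)
        self.f[key] = (1.0 - a) * self.f.get(key, target) + a * target

    def observe(self, j, k, cost):
        """Process one observed transition j -> k with its cost under the fixed policy."""
        self._blend((j, k, 1), cost)                                   # Step 2: 1-step cost
        for (s0, sn, n), val in list(self.f.items()):                  # Step 3: backups
            if s0 == k:
                self._blend((j, sn, n + 1), cost + self.gamma * val)
        pre = [(key, v) for key, v in self.f.items()
               if key[1] == j and key[0] in self.history]              # Step 4: F_A (pre-circuits)
        post = [(key, v) for key, v in self.f.items() if key[0] == j]  #         F_B (post-circuits)
        for (i, _, na), fa in sorted(pre, key=lambda kv: kv[0][2]):    # ascending-n approximation
            for (_, l, nb), fb in sorted(post, key=lambda kv: kv[0][2]):
                self._blend((i, l, na + nb), fa + (self.gamma ** na) * fb)
        self.history.append(j)
```

In use, observe(j, k, cost) would be called once per observed transition of the fixed policy, so the f-table grows only along trajectories the policy actually visits.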
T h i s m a y not be des i rable and a semi-off-l ine a l g o r i t h m for the A C E m a y be advantageous. 4.5 A Natural Algorithm for Stochastic Systems E q . (4.45) qu i t e n a t u r a l l y suggests a synchronous a l g o r i t h m where the l e a r n i n g is d i v i d e d i n t o an iden t i f i c a t i on phase a n d a b u i l d - u p phase, s i m i l a r to the D e t e r m i n i s t i c D o u b l i n g A l g o r i t h m presented ear l ier . T h i s a l g o r i t h m is presented here o n l y for d i scuss ion a n d for a n u m e r i c a l test to be presented i n C h a p t e r 5. T h i s is an off-line a l g o r i t h m i n the sense that the 1 t i m e step t r ans i t ions a n d costs of Chapter 4. The Problem Formulation and Proposed Solutions 93 the s y s t e m are iden t i f i ed i n an Identification Phase before the e v a l u a t i o n of the g iven p o l i c y begins i n a Build-up Phase. T h i s a l g o r i t h m w i l l be ca l l ed the Stochastic Dou-bling Algorithm ( S D A ) because i t is the s tochas t ic counte rpar t to the D e t e r m i n i s t i c D o u b l i n g A l g o r i t h m presented ear l ier . T a b l e 4.8 shows the Iden t i f i ca t ion Phase . I n i t i a l i z e the s y s t e m to some state s0 £ S a cco rd ing to a schedule . E x e c u t e the con t ro l p o l i c y PO(SQ). U p d a t e the s amp le f requency Ps 0 ,si ° f the t r a n s i t i o n so —> Si i n the t r a n s i t i o n p r o b a b i l i t y l o o k u p tab le , a cco rd ing to P „ " 0 where n " ° . is the observed n u m b e r of t imes tha t a c t i o n u0 was a p p l i e d w h e n the sys t em was i n state SQ and m a d e a t r a n s i t i o n to state S i , a n d n"° = Z ^ e s ™ " " , ^ ^ s ^ n e n u m h e r of t imes a c t i o n UQ a p p l i e d i n state SQ. R e c o r d the cost ct i n a l o o k u p tab le as a f u n c t i o n of { s 0 ) s i } , Ct{so,S\). T h i s is / for h o r i z o n of 1, / ( s o , s \ , 1). If p]Q are not accura te es t imates of p]0iSl, goto S tep 1. E x i t . ' T a b l e 4.8: Iden t i f i ca t ion Phase of the S D A . In T a b l e 4.8, we m a y want to i n i t i a l i z e the states acco rd ing to an i n i t i a l i z a t i o n schedule des igned to s a mp le the state space thoroughly . T h e design of the schedule is the user's choice a n d w i l l not be de ta i l ed here. In Step 2, r a n d o m noise s h o u l d not be in jec ted in to the c o n t r o l s igna l because the purpose is to accura te ly evaluate the g i v e n p o l i c y . T h e t r a n s i t i o n p r o b a b i l i t y l o o k u p tab le keeps the es t imates of pjSj for a l l d u r i n g the Iden t i f i ca t ion Pha s e and the es t imates of p™Sria for a l l {i, sna} and for m r a n g i n g f r o m 1 to the va lue n i n E x p r e s s i o n (4.44) d u r i n g the B u i l d - u p Phase . In S tep 3, the j u d g m e n t of the a c c u r a c y of the es t imates is done w i t h some s t anda rd s t a t i s t i c a l t echnique . A f t e r the iden t i f i ca t ion phase is done, the t r a n s i t i o n p r o b a b i l i t y t ab le a n d the / t ab le are b o t h for a h o r i z o n of 1. T h e y are used i n the B u i l d - u p Phase to c o m p u t e longer Chapter 4. The Problem Formulation and Proposed Solutions h o r i z o n / ' s . T a b l e 4.9 shows the B u i l d - u p Phase of the synchronous a l g o r i t h m . 
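Before the Build-up Phase in Table 4.9, the Identification Phase of Table 4.8 reduces to straightforward frequency counting. The sketch below is illustrative only: reset, step, and policy are assumed interfaces to the system and to the given policy, and the statistical stopping test of Step 3 is replaced by a fixed number of samples.

```python
# Illustrative sketch of the SDA Identification Phase (Table 4.8).
from collections import defaultdict

def identification_phase(reset, step, policy, num_samples):
    counts = defaultdict(int)   # counts[(s0, u0, s1)]: observed s0 --u0--> s1 transitions
    totals = defaultdict(int)   # totals[(s0, u0)]: times u0 was applied in s0
    f1 = {}                     # f1[(s0, s1)]: recorded 1-step cost, i.e. f(s0, s1, 1)
    for _ in range(num_samples):
        s0 = reset()            # the initialization schedule is the user's choice
        u0 = policy(s0)         # no exploration noise: the given policy is evaluated
        s1, cost = step(s0, u0)
        counts[(s0, u0, s1)] += 1
        totals[(s0, u0)] += 1
        f1[(s0, s1)] = cost
    p1 = {(s0, s1): c / totals[(s0, u0)] for (s0, u0, s1), c in counts.items()}
    return p1, f1
```

The returned tables p1 and f1 are exactly the 1-step transition probabilities and the f(s0, s1, 1) values that the Build-up Phase consumes.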
94 Choose des i red h o r i z o n and set up the fo l lowing counters : n := 2 n 0 := 1 D o u b l e the h o r i z o n of f(i, I, n) as fol lows: f(i,l,n):=ZsnaesPlasna (f(i, Sna,na) +r° f(snaJ,na)) ; Vi,l€S If n = des i red h o r i z o n , goto Step 5. D o u b l e the h o r i z o n of the t r a n s i t i o n p robab i l i t i e s as fol lows: n'a := na na := 2na i>?;ana :=£.„, € spS n i p;£,, . n a ; v?,s„a G s If des i red h o r i z o n is not reached, goto Step 2. E x i t . T a b l e 4.9: B u i l d - u p Phase of the S D A . S ince we k n o w the con t ro l p o l i c y and we are o n l y in teres ted i n the t rue va lue of E x p r e s s i o n (4.44), we can d i s ca rd a l l o lder f(i,l,n) and a s the B u i l d - u p Phase progresses. T h e nota t ions f(i,l,n), f(i,sna,na), and f(sna,l,na) c an be s i m p l i f i e d to / ( « , / ) , / ( « , - s n a ) , a n d f(sna,l), respec t ive ly , because we k n o w the values of n and na associa ted w i t h these var iables at a l l t imes . T h e nota t ions p?" , j3™° , a n d f>^'a . Jrlt3na 1 rl>3nf snf i s " a a T*a can be s i m p l i f i e d to p,-, S n , pilS , , a n d ps , i S n for the same reason. T h e storage of these na na a var iab les can be s i m p l i f i e d s i m i l a r l y . T h i s a l g o r i t h m seems a t t r ac t ive at th is po in t , bu t there is a n u m e r i c a l f law, to be discussed i n C h a p t e r 5, w h i c h is sufficient for the re jec t ion of not o n l y th is a l g o r i t h m bu t also mos t off-l ine, synchronous a lgo r i t hms for s tochas t ic sys tems. Chapter 4. The Problem Formulation and Proposed Solutions 95 4.6 Linkup Action Dependent Dynamic Programming and Hierarchies D e t e r m i n i s t i c L i n k u p A c t i o n Dependen t D y n a m i c P r o g r a m m i n g a l g o r i t h m s , spec i f ica l ly the D o u b l i n g A l g o r i t h m , fit i n t o hierarchies for d e t e r m i n i s t i c sys tems i n th i s manne r : • T h e lowest l eve l of a h i e ra rchy conta ins the 1 t i m e step Q values a n d 1 t i m e step c o n t r o l po l i c ies . • T h e nex t l eve l up conta ins the 2 t i m e step Q values a n d 2 t i m e step c o n t r o l po l i c ies . • T h i s i s r epea ted u p the h ie ra rchy , w i t h t he t op l eve l c o n t a i n i n g n t i m e s tep Q values a n d n t i m e step con t ro l po l ic ies . A l g o r i t h m 2 fits i n to h ierarchies s i m i l a r l y . T h i s correspondence shows how L i n k u p A c t i o n D e p e n d e n t D y n a m i c P r o g r a m m i n g fits i n w i t h a h i e r a rchy i f we cons t ruc t t he h i e r a r chy f r o m the b o t t o m leve l to the top . T h i s m a p p i n g covers o n l y the t e m p o r a l aggregat ion aspect of h ie rarchies . T h e spa t i a l aggregat ion a n d o ther aspects are not addressed. (These two types of aggregat ion is d iscussed i n m o r e d e t a i l i n Sec t i on 4.7.6.) Howeve r , L i n k u p A c t i o n Dependen t D y n a m i c P r o g r a m m i n g shows how to generate o p t i m a l con t ro l laws for th i s aspect of h ierarchies . 4.7 Goal Change Algorithms G o a l change is c lose ly r e l a t ed to h ierarchies because the uppe r levels of a h i e r a rchy shou ld be able to use the lower leve l sk i l l s for different goals. In th is sec t ion , 4 a l g o r i t h m s for chang ing the goa l of t he L i n k u p A c t i o n D e p e n d e n t D y n a m i c P r o g r a m m i n g con t ro l l e r are presented a n d discussed. 
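As a concrete rendering of the Build-up Phase equations in Table 4.9, the doubling of the f-values and of the transition probabilities can be sketched as below. This is for illustration only (the SDA is ultimately rejected in Chapter 5 for a numerical flaw); the function name, the dictionary layout, and the power-of-two horizon are assumptions, and p1 and f1 are the tables produced by the Identification Phase of Table 4.8.

```python
# Illustrative sketch of the SDA Build-up Phase (Table 4.9): repeatedly double
# the horizon of the policy's f-values and of the transition probabilities.
def sda_build_up(p1, f1, states, gamma, desired_horizon):
    p, f = dict(p1), dict(f1)
    na = 1
    while 2 * na <= desired_horizon:
        # Double the horizon of f using the current na-step transition probabilities.
        f_new = {}
        for i in states:
            for l in states:
                total = 0.0
                for s_mid in states:
                    prob = p.get((i, s_mid), 0.0)
                    if prob:
                        total += prob * (f.get((i, s_mid), 0.0)
                                         + (gamma ** na) * f.get((s_mid, l), 0.0))
                f_new[(i, l)] = total
        # Double the horizon of the transition probabilities.
        p_new = {}
        for i in states:
            for l in states:
                p_new[(i, l)] = sum(p.get((i, s_mid), 0.0) * p.get((s_mid, l), 0.0)
                                    for s_mid in states)
        p, f, na = p_new, f_new, 2 * na
    return f, p
```

The second inner loop is the usual composition of n-step transition probabilities (the Chapman-Kolmogorov relation), carried out once per doubling cycle.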
In C h a p t e r 5, the i r correctness a n d efficiency gains are discussed fur ther . W e w i l l r e t u r n to the top ic of h ierarchies i n Sect ions 4.7.5 to 4.7.7. G o a l change can be done w i t h i n L i n k u p A c t i o n Dependen t D y n a m i c P r o g r a m m i n g . T h i s requires a re -def in i t ion of the i m m e d i a t e costs ct. I n the p rev ious sect ions, r e w a r d Chapter 4. The Problem Formulation and Proposed Solutions 96 a n d p e n a l t y are c o m b i n e d in to one value as the i m m e d i a t e cost. In th is sec t ion , we redefine the goa l of the con t ro l sy s t em to that of be ing able to reach a goal s tate o p t i m a l l y , ra ther t h a n the m o r e general i m p l i e d goal of the i m m e d i a t e costs. F u r t h e r m o r e , we define the rewards a n d penal t ies tha t m a k e up the i m m e d i a t e costs i d e n t i c a l l y to those i n [85] for the C Q - L a l g o r i t h m . E a c h state incurs some penal ty , except for the goal s tate where i t also receives a r eward . T h e penal t ies m a y be different w i t h respect to the states bu t are i d e n t i c a l to any goal state we w i s h to define. T h e penal t ies are i d e n t i c a l for a l l goa l states. 4.7.1 One goal state Suppose after u s ing the D o u b l i n g A l g o r i t h m , w i t h o u t loss of general i ty , we have b u i l t u p a set of Q - m a p p i n g s for hor izons of 1 , 2 , 4 , . . . , TV for a ce r t a in goa l state. L e t us c a l l the effects of the r e w a r d componen t of a g iven goal state on the Q - m a p p i n g s the reward effects. W e c o u l d r emove the r e w a r d effects of th is goa l s tate f r o m the Q - m a p p i n g s and t h e n a d d the r e w a r d effects of another goa l state to these m a p p i n g s . If we c o u l d do th i s , we w o u l d be able to re-use the Q - m a p p i n g s to p roduce con t ro l laws for any goal s tate we l i k e . O f course we w o u l d l ike th is goal change p rocedure to be m o r e efficient t h a n the p rocedure to b u i l d up the Q - m a p p i n g s i n the first p lace . A s we w i l l see shor t ly , we can i n fact o b t a i n th is h igher efficiency. T o r emove the r e w a r d effects of the o r i g i n a l goal state, we p a r t i a l l y execute the D o u b l i n g A l g o r i t h m f r o m the 2 t i m e step h o r i z o n m a p p i n g to the n t i m e step h o r i z o n m a p p i n g . T h i s is ca l l ed the r e m o v a l pass. A remova l pass is m a d e up of l o g 2 TV cycles , where TV is the longest h o r i z o n . L e t Ss be a set of s tar t states. L e t Se be a set of end states. L e t P be a set of pa i rs of states sj, S[ where the first e lement of each pa i r is mean t to be a s tar t s tate a n d the last e lement of each pa i r is mean t to be an end state. These 3 sets are defined for a ce r t a in h o r i z o n . L e t a co r respond ing 3 sets be defined for the next Chapter 4. The Problem Formulation and Proposed Solutions 97 1. Ss := {sg}, Se := 0, P := 0, n := 2. 2. Let SV be the set of states reachable from sg in one time step. For all s i € 5 r , Q(sg, si, u0,1) := Q(sg, s i , u0,1) - r(sg) Se := Se U {Si} P:=PU{{sg,Sl}} 3. S'a := 0, ^ := 0, P ' := 0. 4. For all s n € S and for all s s 6 -So, if g(s s,s n,u 0 ,n) = min (<3(s„ sa, «0, f) + 7 f min Q(sa,5n,Ua,f)) then Q(ss,s„,u0,n) := min (Q(S0,SR,U0, f) + 72 min Q(s», 5; := S's U ^ := S'e U {aB} P ' : = P ' U { K , 5 n } } otherwise Q(ss,sn,u0,n) is not changed. 5. 
For all s0 £ S and for all se £ Se, if Q(s0,se,u0,n) = min (0(s o ,sa,u 0 , f) + 7* m j n . Q(*f,*e,«?,f)) then Q(50,se,u 0,n) := min (Q(s0, sa, u0, f) + 7 2 min Q(s<i, 2 2 2 5 e , « f , f ) ) := 5 i U {so} S'e := S'e U {se} P' :=P'U { { * „ , * . } } otherwise Q(so,se,uo,n) is not changed. 6. if n ^ N, n := 2 * n Ss := S's Se := ^ P := P' goto 3. 7. Exit. Table 4.10: The Removal Pass for a Single Goal State. Chapter 4. The Problem Formulation and Proposed Solutions 98 h igher h o r i z o n a n d be denoted by S's, S'e, and P' respect ive ly . T h e p rocedure to r emove the r e w a r d effects is ca l l ed the R e m o v a l P a s s as shown i n T a b l e 4.10. O n e cyc le of the r e m o v a l pass consists of Steps 3 to 6 i n T a b l e 4.10. T h e r a t iona le for the r e m o v a l pass as shown i n T a b l e 4.10 is the f o l l o w i n g , b e g i n n i n g w i t h the shortest h o r i z o n Q-values : W h e n the r eward componen t is r e m o v e d f r o m a goal state, o n l y one t y p e of Q-values is affected. T h i s t y p e is those Q ' s t h a t s tar t w i t h t he goa l state. A t the nex t h igher h o r i z o n , 2, two types of Q-values are affected: those tha t starts w i t h the goa l s tate a n d those tha t have the goal state as the i m m e d i a t e state. T o judge whe the r any p a r t i c u l a r 2 t i m e step Q f a l l i n t o one of these t w o types , w e check whe the r the o r i g i n a l Q is equa l to the resul t of a l i n k u p us ing a 1 t i m e step Q of one of the above types . If so, we k n o w tha t there is a p o s s i b i l i t y tha t r e m o v i n g the r e w a r d c o m p o n e n t w i l l i n v a l i d a t e t he o r i g i n a l Q. T h i s means we have to redo the l i n k u p o p e r a t i o n for t he same s tar t s tate a n d end state of th i s Q. If not , we k n o w tha t there is no p o s s i b i l i t y tha t r e m o v i n g the r e w a r d componen t w i l l i nva l ida t e the o r i g i n a l Q a n d we do not need to redo the l i n k u p o p e r a t i o n for t h i s Q. A f t e r we checked a l l c a n d i d a t e Q's o f h o r i z o n 2 a n d co l l ec t ed the s tar t and end state of a l l affected Q ' s , we use these s tar t a n d end states i n the same process as for the 2 t i m e step Q ' s to de t e rmine w h i c h 4 t i m e step Q's are affected. T h i s process is r epea ted u n t i l we have done the n t i m e step Q ' s . S ince a c y c l e conta ins a m i x t u r e of upda t e opera t ions for the t rue a n d false cases of the tests i n Steps 4 and 5 i n T a b l e 4.10, the upda te ope ra t i on i n the false case is a non -ope ra t i on , the u p d a t e ope ra t i on i n the t rue case is less in tens ive t h a n a L i n k u p ope ra t i on , the s tar t s tate is less t h a n or equa l to S, and the end state is less t h a n or equa l to S, the c o m p u t a t i o n a l effort is less t h a n a fu l l L i n k u p cyc le . N o t e t ha t s ince the D o u b l i n g A l g o r i t h m requires o n l y a few cycles a n d the n u m b e r of e lements i n P increases f r o m a ve ry s m a l l f r ac t ion , i n mos t d y n a m i c sys tems, of a l l poss ib le {s tar t state, end state} pa i r s , there is a large sav ing i n the first few cycles . A n Chapter 4. The Problem Formulation and Proposed Solutions 99 a d d i t i o n pass is also m a d e up of l o g 2 N cycles , where N is the longest h o r i z o n . T o a d d the r e w a r d effects of a new goal state, we do a s i m i l a r process as the R e m o v a l Pass . 
T h e i m m e d i a t e cost at the new goal s tate s'g is changed to i n c l u d e a r e w a r d r(s'g). T h i s process is c a l l ed the A d d i t i o n P a s s a n d i t is shown i n T a b l e 4 .11. T h e ra t iona le for the a d d i t i o n pass is s i m i l a r to the r a t iona le for the r e m o v a l pass. T h e difference is i n the test i n Steps 4 a n d 5. If, for a g iven h o r i z o n , u s ing the affected Q ' s of the i m m e d i a t e l y shor ter h o r i z o n , we can f ind a be t te r new Q t h a n the o r i g i n a l Q, t h e n th i s Q m u s t be affected b y the a d d i t i o n of a r eward c o m p o n e n t at the new goal s tate a n d i n fact i t is e x a c t l y equa l to the new Q. If the new Q ' s is no t be t t e r t h a n the o r i g i n a l Q ' s , we k n o w the o r i g i n a l QJs is the correct va lue . T h e t rue cases i n Steps 4 a n d 5 of T a b l e 4.11 is m u c h less in tens ive t h a n E q . (4.42) because the set of states over w h i c h the outer m i n i m i z a t i o n is done is m u c h less t h a n the f u l l s tate set for the first few cycles . E q . (4.42) is the m a i n u p d a t e equa t i on i n the D o u b l i n g A l g o r i t h m . T h e savings i n a cyc le of the a d d i t i o n pass c o m p a r e d to a cyc l e of genera l L i n k u p A c t i o n D e p e n d e n t D y n a m i c P r o g r a m m i n g c o m e f r o m th i s fact a n d the fact tha t the false case is a non-opera t ion . T h e fact t ha t we do not need to store the Q - m a p p i n g s for a l l i n t e r m e d i a t e hor izons f r o m 1 to i V is an i m p o r t a n t benefit of the D o u b l i n g A l g o r i t h m . T h e same savings c o u l d also be o b t a i n e d for A l g o r i t h m 2, a l t hough to a lesser extent . It is not c lear tha t the r e m o v a l / a d d i t i o n passes are poss ib le i f the a c t u a l h o r i z o n is t r ea ted as i f i t is in f in i t e , as is i n the case of One-S tep Q - L e a r n i n g . S ince the a d d i t i o n a n d r e m o v a l passes cannot be done for Q of different hor izons , they w o u l d l i k e l y to be done for the in f in i t e h o r i z o n Q a n d re i t e ra ted on the same Q ' s a n u m b e r of t imes . E a c h cyc le w o u l d t h e n be essent ia l ly a r e l a x a t i o n cyc le . If r e m o v a l / a d d i t i o n passes are poss ib le , the correct Q - m a p p i n g is p r o b a b l y ach ieved after a n u m b e r of r e l a x a t i o n cycles . T h e l ack of a n e n d state pa r ame te r i n the Q-values i n One-S tep Q - L e a r n i n g also poses a p r o b l e m . Chapter 4. The Problem Formulation and Proposed Solutions 100 's P' 1. | Ss := {s'g}, Se := 0, P := 0, n : = 2. 2. L e t 5,. be the set of states reachable f rom s'g i n one t i m e step. F o r a l l Si G Sr, Q(s'g, s i , u 0 , 1 ) := Q(s'g, si, u0,1) + r(s'g) 5 e := 5 e U P:=PU{{s'g,Sl}} 3. := 0, S'e := 0, P' := 0. 4. F o r a l l sn €. S a n d for a l l ss € 5 S , i f Q ( s „ s „ , u 0 , n ) > m i n (Q(SS,SR,U0,^) + ^ -min Q ( i a , i n ) u a , f ) ) 2 2 2 2 t h e n g ( 5 s , , „ , u 0 , n ) : = ^ g { m i n j € p (Q(S0, S,, „ 0 , f ) + 7 * ^ m i n ^ Q ( s f , s n , u f , f ) ) Si := 5 i U { 5 s } : = ^ U { s n } := P'U{{s3,sn}} otherwise s n , u 0 , n ) is not changed. F o r a l l s 0 € S a n d for a l l s e € Se, i f g(so,s e ,Uo ,rc) > m i n (Q(S0, , u 0 , f ) + 7* ™ j n , , a e , u a , f )) t h e n := ^ U {so} S'e := ^ U { . 
e } P ' : = P ' U { { s 0 , 5 E } } o therwise Q(-so, $e, uo, n) is not changed, i f n ^ N, n :=2 * n Ss := S's Se := S> P := P ' goto 3. E x i t . T a b l e 4.11: T h e A d d i t i o n Pass for a S ing le G o a l S ta te . Chapter 4. The Problem Formulation and Proposed Solutions 101 4.7.2 Two different applications I n one a p p l i c a t i o n , the r e m o v a l / a d d i t i o n passes c a n be used to solve m u l t i p l e goals w i t h a large s av ing i n c o m p u t a t i o n t i m e . In the second a p p l i c a t i o n , we are in te res ted i n be ing able to p roduce a c o n t r o l m a p p i n g w h i c h takes the goal s tate as an i n p u t . T h i s a l lows us to steer the s y s t e m t o w a r d any goal s tate we l i ke i n m i d course a n d k n o w tha t the c o n t r o l is o p t i m a l u n t i l the next goa l change. T h e a p p l i c a t i o n t o w a r d w a l k i n g mach ines , for e x a m p l e , is c lear . T h e opera tor of the m a c h i n e , h u m a n or no t , needs to adjust h is goals f requent ly w h i l e the m a c h i n e is i n m o t i o n because the e n v i r o n m e n t is c o m p l e x a n d m a y even be u n p r e d i c t a b l e . 4.7.3 The best goal change approach If our goal is to cons t ruc t a con t ro l m a p p i n g w i t h the goal s tate as an i n p u t , th i s is the best goa l change approach : T h e r e m o v a l passes are p r o b a b l y s ign i f i can t ly m o r e cos t ly t h a n t he a d d i t i o n passes because i n t he r emova l passes, the ou te r m i n i m i z a t i o n is over S whereas i n the a d d i t i o n passes, the outer m i n i m i z a t i o n is over a subset of Ss or Se. Idea l ly we do not want to pe r fo rm the r emova l passes at a l l d u r i n g the c o n s t r u c t i o n of th i s c o n t r o l m a p p i n g . If d u r i n g the l ea rn ing of the o r i g i n a l Q - m a p p i n g s , a goal state is not def ined, t h e n the m a p p i n g w o u l d not con ta in the r e w a r d effects of a goal state. T h i s means no r e m o v a l pass is r equ i r ed to remove the r e w a r d effects a n d the o r i g i n a l Q - m a p p i n g s c a n be used to p roduce Q - m a p p i n g s for any goal s tate b y us ing o n l y the a d d i t i o n passes. 4.7.4 More than one goal state U n t i l now i n th is sec t ion , the goal s tate is assumed to be a s ingle goa l s tate. In some s i tua t ions , we w o u l d be satisfied i f the sys t em spends as m u c h t i m e as poss ib le i n a set Chapter 4. The Problem Formulation and Proposed Solutions 102 of goa l states, to be ca l l ed the g o a l s e t and denoted by Sg. T h e r e m o v a l a n d a d d i t i o n passes c a n be m o d i f i e d to a c c o m m o d a t e th is . S u p p o s i n g tha t we want to r emove one goal set a n d replace i t w i t h another , the r emova l and a d d i t i o n passes are m o d i f i e d as i n T a b l e 4.12 a n d T a b l e 4.13, respect ive ly . A r e m o v a l pass is m a d e up of l o g 2 cycles , where N is the longest h o r i z o n . O n e c y c l e of the r e m o v a l pass consists of Steps 3 to 6 of T a b l e 4.13. A n a d d i t i o n pass is also m a d e u p of l o g 2 N cyc les , where N is t he longest h o r i z o n . T h e i m m e d i a t e costs of new goal set S'g are adjusted to i n c l u d e new r e w a r d values. O n e c y c l e of the a d d i t i o n pass consists of Steps 3 to 6 of T a b l e 4.13. 
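For the single-goal-state case, one cycle of the Addition Pass in Table 4.11 can be sketched as follows. The sketch is illustrative rather than definitive: it merges Steps 4 and 5 into one loop over the stored Q-values, stores the Q-mappings as dictionaries q[h] keyed by (start state, end state, first action), and omits Step 2 and the outer driver that doubles n and carries S'_s, S'_e, and P' forward to the next cycle; all names are hypothetical.

```python
# Illustrative sketch of one cycle (Steps 3 to 6 combined) of the Addition Pass
# for a single goal state (Table 4.11). q[h] maps (start, end, action) to the
# h-step Q-value; S_s, S_e, P are the affected sets from the previous horizon.
def addition_cycle(q, n, S_s, S_e, P, actions, gamma):
    half = n // 2
    S_s2, S_e2, P2 = set(), set(), set()

    def best_through(s0, s_end, u0, mids):
        # Redo the linkup for Q(s0, s_end, u0, n), but only through the affected
        # intermediate states in `mids`; return None if no route improves.
        best = None
        for s_m in mids:
            pre = q[half].get((s0, s_m, u0))
            if pre is None:
                continue
            post = min((q[half].get((s_m, s_end, u), float("inf"))
                        for u in actions(s_m)), default=float("inf"))
            cand = pre + (gamma ** half) * post
            if best is None or cand < best:
                best = cand
        return best

    for (s0, s_end, u0), old in list(q[n].items()):
        mids = set()
        if s0 in S_s:                      # Step 4: affected pre-circuits start at s0
            mids |= {b for (a, b) in P if a == s0}
        if s_end in S_e:                   # Step 5: affected post-circuits end at s_end
            mids |= {a for (a, b) in P if b == s_end}
        new = best_through(s0, s_end, u0, mids) if mids else None
        if new is not None and new < old:  # the added reward reaches this Q-value
            q[n][(s0, s_end, u0)] = new
            S_s2.add(s0)
            S_e2.add(s_end)
            P2.add((s0, s_end))
    return S_s2, S_e2, P2
```

Only Q-values whose start state lies in S_s or whose end state lies in S_e are ever re-linked, and then only through affected intermediate pairs recorded in P, which is where the savings over a full linkup cycle come from.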
In the ex t r eme , every state c o u l d have" a r eward , p r e s u m a b l y different rewards at different states. A s the set of goa l states approaches the comple t e s tate space, the c o m p u t a t i o n a l t i m e r equ i r ed w i l l approach but not exceed the t i m e r equ i r ed to b u i l d the first Q - m a p p i n g set. T h e r e m o v a l a n d a d d i t i o n passes of th is subsec t ion a n d Subsec t i on 4.7.1 c a n be per-f o r m e d i n d e p e n d e n t l y of each other . F o r e x a m p l e , we can replace a goa l s tate w i t h a goa l set or a goa l set w i t h a goa l state. F o r the r e m o v a l / a d d i t i o n passes, the D o u b l i n g A l g o r i t h m grea t ly reduces the n u m b e r of cycles i n b o t h passes. In contras t , i f Synchronous Q - L e a r n i n g is used , there w o u l d be as m a n y cycles as the l eng th of the h o r i z o n . Synchronous Q - L e a r n i n g differs f r o m O n e - S t e p Q - L e a r n i n g i n tha t i t has the end state as a pa r ame te r i n the Q n o t a t i o n . I n not h a v i n g th i s e n d state pa rame te r beside other i n c o m p a t i b i l i t i e s , O n e - S t e p Q - L e a r n i n g m a y not be able to take advantage of the r e m o v a l / a d d i t i o n passes. 4.7.5 E x t r a c t i n g m a c r o s In Sec t i on 3.4.2, the b u i l d i n g of macros was discussed. In th is sec t ion , we specula te on mac ros m a y be b u i l t u s ing the goal change procedures . N o c l a i m is b e i n g m a d e tha t the Chapter 4. The Problem Formulation and Proposed Solutions 103 1. Ss := Sg, Se := 0, P := 0, n : = 2. 2. D o for a l l sg (E Sg, let SV be the set of states reachable f r o m sg i n one t i m e step. F o r a l l s\ & Sr, Q(sg,suu0,l) := Q(sg,suuQ,l) - r(sg) Se := Se U {si} P : = P U { { 5 s , 5 l } } 3. S's := 0, S£ := 0, P' := 0. 4. F o r a l l sn £ S a n d for a l l s3 £ Ss, i f Q ( s s , s n , u 0 , n ) = m i n (Q(sat SR, u 0 , f ) + 7* m j n x t h e n Q(ss,sn,u0,n) := minL ( Q ( S 0 , s » , u 0 , f ) + 7* m j n , <2(s?, S n , ^ , f ) ) ^ := S'e U K ) P ' : = P ' U { { 5 S , 6 N } } o therwise Q(ss,sn,u0,n) is not changed. 5. F o r a l l s0 € S a n d for a l l se € Se, i f Q (>o , s e , uo , r c ) = m i n (Q{S0, sn., « 0 , f ) + 7* m j n , Q ( a » , s e , u n , 2 ) ) t h e n Q(s0,se,u0,n) := m i n ( Q ( S 0 , s n , « 0 , f ) + 7^ m i r n x Q ( s a , 2 2 2 3 e , t i f , f ) ) 5; := 5 i U {so} S'e := S'e U {se} P' :=P'U{{s0,se}} otherwise Q(SQ, se, Uo, n) is not changed. 6. i f n ^ TV, n := 2 * n 5 . := P := P' goto 3. 7. E x i t . T a b l e 4.12: T h e R e m o v a l Pass for M u l t i p l e G o a l States . Chapter 4. The Problem Formulation and Proposed Solutions 104 1. Ss := S'g, Se := 0, P := 0, n := 2. 2. D o for a l l ^ € S'g, let SV be the set of states reachable f r o m s'g F o r a l l si £ SV, i n one t i m e step. Q ( a ^ , 3 i , u 0 , l ) := Q(a ^ s i,Mo,l) - r ( s ^ ) S'e := S'e U {Si} P:=PU{{s'g,Sl}} 3. S's := 0, 5 e := 0, P ' := 0. 4. F o r a l l sn (E S a n d for a l l ss £ 5 S , i f Q ( s s , s „ , u 0 , n ) > m i n (<3(s s, s a , u 0 , f ) + 7 ^ m i n tine[/(sn) Q(s±,sn,u±,fj) t h e n Q ( s s , s n , u 0 , n ) := m i n ( Q ( s 0 , s a , u 0 . f ) + 7 * m i n Q{s% , 3 „ , « a , | ) ) 5 : := U {,.} ^ := S'e U { s n } P ' : = P ' U { { s s , s n } } o therwise Q(ss, sn, uo, n) is not changed. 5. F o r a l l s 0 € S a n d for a l l s e 6 S e , i f Q(s0,se,u0,n) > m i n ( Q ( S 0 , S £ , U 0 , j ) + 7 * m i n Q ( s ? 
, S e , U f , f ) ) t h e n Q ( s 0 , s e , w 0 , r a ) := m i n (Q(S0,S*,U0 sn.e{sxi,se}eP \ 2 f ) + 7 * m i n Q ( s a , S e , U " , f ) ) 5J := S's U { s 0 } 5 e := S'e U { s e } P ' : = P ' U { { s 0 , S e } } otherwise Q(s0, se, uo, n) is not changed. 6. i f n / N, n := 2 * n : = S"s S'e := S"e P := P' goto 3. 7. E x i t . T a b l e 4.13: T h e A d d i t i o n Pass for M u l t i p l e G o a l States . Chapter 4. The Problem Formulation and Proposed Solutions 105 genera l p roposa l to be m a d e i n th is sec t ion is the comple t e answer to the b u i l d i n g a n d u t i l i z a t i o n of mac ros . Suppose we have b u i l t the Q funct ions w i t h o u t spec i fy ing a goal state. T h e n we w o u l d p e r f o r m the a d d i t i o n pass for a large n u m b e r of goa l states. These goa l states co r re spond to the seed goals d iscussed i n Sec t ion 3.4.2. A s we p e r f o r m the a d d i t i o n passes for these seed goals, we col lec t a set of c o m m o n l y used con t ro l a n d state t ra jec tor ies . S t ra igh t -fo rward names c o u l d be assigned to these con t ro l and state sequences, such as a n a m e c o m p o s e d of the star t state, the end state, and the h o r i z o n . E a c h c o n t r o l sequence is specific to the star t of the co r respond ing state sequence. These t h e n are the macros . A s the set of macros becomes larger and more comple t e i n t e rms of b e i n g able to f o r m effective c o n t r o l a n d state t ra jector ies for most new goals we specify, the sys t em becomes inc reas ing re l ian t on macros ins tead of the comple t e set of Q values for a l l state c o m b i n a t i o n s a n d for var ious hor izons . In a sense, we c o u l d say tha t the sys t em is b e g i n n i n g to l ea rn i n a h igher l eve l representa t ion . F r o m another pe r spec t ive , the sy s t em learns to res t r ic t i ts search space to o n l y those con t ro l a n d state t ra jector ies tha t proves to be genera l ly useful . 4.7.6 Temporal versus spatial aggregation If a n u m b e r of n e i g h b o u r i n g states are g rouped together as a super state, i t c o u l d be ca l l ed spatial aggregation. If a n u m b e r of n e i g h b o u r i n g t i m e per iods or po in t s i n t i m e are g rouped together , i t c o u l d be ca l l ed temporal aggregation. B o t h t e m p o r a l a n d spa t i a l aggregat ions are necessary i n a h i e r a r ch i ca l con t ro l l e r [97]. 4.7.7 Comparison with related works L i n [105] approaches h i e r a r c h i c a l l ea rn ing for M a r k o v i a n , d e t e r m i n i s t i c , a n d discrete sys tems f r o m a state aggregat ion perspec t ive . T h e base state space is d i v i d e d i n t o regions Chapter 4. The Problem Formulation and Proposed Solutions 106 of a p p r o x i m a t e l y equa l size a n d shape. These regions are ca l l ed abs t rac t states. A b s t r a c t states serve as subgoals . A b s t r a c t act ions are close-loop c o n t r o l po l ic ies w h i c h takes the s y s t e m f r o m the i r subset of base states to the i r co r respond ing abs t rac t states. E a c h such subset is c a l l ed the a p p l i c a t i o n space of the abst ract a c t i o n . T h e a p p l i c a t i o n space is the set of base state f r o m w h i c h the abs t rac t state can be reached w i t h i n I steps, where / is the m a x i m u m dis tance be tween any two ne ighbor ing abs t rac t states. 
T h e abs t r ac t ion can be m u l t i - l e v e l , m e a n i n g tha t the abst ract states and ac t ions of one l eve l can be used as base states a n d ac t ions for the nex t h igher l eve l . H a v i n g define these concepts , L i n ' s h i e r a r ch i ca l l ea rn ing t echn ique funct ions is ve ry br ief ly as fol lows: 1) M a n u a l l y select the abst ract states and define the abs t rac t act ions for the base l eve l . 2) T r a i n each abs t rac t ac t ion i n i ts e x p l o r a t i o n space, w h i c h covers i ts a p p l i c a t i o n space. K e e p o n l y those abs t rac t act ions w h i c h succeeds i n t a k i n g the sy s t em f r o m the i r defined a p p l i c a t i o n space to the i r defined abs t rac t space. 3) T r a i n a h igh- l eve l p o l i c y to de t e r mine the o p t i m a l abs t rac t ac t ion for each abs t rac t space. L i n ' s t echn ique differs f r o m the L i n k u p approach i n tha t 1) i t is done i n a s tate aggre-g a t i o n perspec t ive ; 2) the different goals ex i s t i ng s i m u l t a n e o u s l y i n one con t ro l sy s t em means the t op l eve l c o n t r o l sy s t em can be s u b o p t i m a l (see p . 86, [105]). It is possi-b le tha t L i n ' s t echn ique is c o m p l e m e n t a r y w i t h the L i n k u p app roach a n d i f successful ly c o m b i n e d , w o u l d i n c l u d e b o t h spa t i a l and t e m p o r a l aggregations. M a e s a n d B r o o k s [106] desc r ibed a l ea rn ing techn ique for the s u b s u m p t i o n archi tec-tu re [78]. T h e s u b s u m p t i o n a rch i tec tu re is a d i s t r i b u t e d con t ro l l e r m a d e up of m a n u a l l y des igned a t o m i c behav iour s a n d m a n u a l l y designed sys t em of t r iggers to ac t iva te the behav iour s . A b e h a v i o u r is an a c t i o n or ac t ion sequence w h i c h can be ac t iva t ed w h e n i ts p re -cond i t ions are sat isf ied. T h e l ea rn ing techn ique can re l ieve the designer f r o m h a v i n g t o d e t e r m i n e h o w a n d w h e n the behav iours are t r iggered . In the i r d e m o n s t r a t i o n of a 6 legged w a l k i n g robo t l ea rn ing to wa lk , the p r o b l e m is s t ruc tu red such tha t o p t i m i z a t i o n Chapter 4. The Problem Formulation and Proposed Solutions 107 w i t h respect to each i m m e d i a t e cost ra ther t h a n to a sequence of i m m e d i a t e costs is sufficient to cause s i m p l e w a l k i n g behav iours to emerge. A d u a l for M a e s and B r o o k s ' approach is proposed by M a h a d e v a n a n d C o n n e l l [107]. In the fo rmer case, the a t o m i c behav iours are h u m a n designed a n d the c o o r d i n a t i o n of the behav iour s is l ea rned . In the la t te r case, the c o o r d i n a t i o n is first h a n d des igned and t h e n the a t o m i c behav iour s are learned . S i n g h [85] p roposed a t e m p o r a l aggregat ion f o r m a l i s m for the conca t ena t ion of tasks. In h is f o r m a l i s m , a compos i t e task is decomposed in to e l emen ta l tasks a n d lower l eve l c o m p o s i t e tasks. A n asynchronous , c o m p e t i t i v e l ea rn ing a l g o r i t h m is p roposed for the c o n s t r u c t i o n of h igher l eve l Q-funct ions f rom lower leve l Q-func t ions . T h i s app roach does not o p t i m i z e a s ingle goa l o p t i m a l con t ro l task i n the sense tha t each e l e m e n t a l task or lower l eve l c o m p o s i t e task m a y be s o l v i n g a different task, e.g., a different goa l s tate. 
T h i s means some c o m p r o m i s e is i m p l i c i t i n the so lu t ion of the top l eve l c o m p o s i t e task. In c o m p a r i s o n , the goal change approach discards the o r i g i n a l goa l states before accoun t ing for the new goal states. G i v e n the same p r o b l e m , S ingh ' s app roach w i l l p r o b a b l y y i e l d different c o n t r o l l aws f r o m conven t i ona l D P a l g o r i t h m s , whereas a l l a l g o r i t h m s p roposed i n th i s thesis w i l l y i e l d con t ro l laws i d e n t i c a l to those f r o m c o n v e n t i o n a l D P a lgo r i t hms , i f a p p r o x i m a t i o n errors are not present. 4.8 Deviation from the Core Algorithms Sect ions 2.4.7 to 2.4.10 showed the process i n w h i c h the core and r e l a t i ve ly r i g i d D P a l g o r i t h m is g r a d u a l l y r e l axed in to p a r t i a l upda te versions and la ter i n to i n c r e m e n t a l vers ions . L i k e w i s e , the L i n k u p A c t i o n Dependen t D y n a m i c P r o g r a m m i n g a lgo r i t hms are deve loped here i n the i r core forms. It is expec ted tha t some of the fur ther deve lopments w o u l d be i n r e l a x i n g the var ious aspects of the core L i n k u p A c t i o n Dependen t D y n a m i c Chapter 4. The Problem Formulation and Proposed Solutions 108 P r o g r a m m i n g a l g o r i t h m s presented i n th is thesis . T h e s tate reso lu t ions c a n be r e l axed . F o r e x a m p l e , the states m a y be aggregated as i n M o o r e ' s V a r i a b l e R e s o l u t i o n D y n a m i c P r o g r a m m i n g (see Sec t i on 2.4.18). T h e n l i n k u p w o u l d be done over these lower r e so lu t ion aggregates. O n c e reasonably g o o d t ra jector ies are f o u n d t h r o u g h these aggregates are ach ieved , l i n k u p c a n t h e n be done over the nex t h igher r e so lu t i on aggregat ion w i t h i n the t ra jectory . T h e process is g r a d u a l l y ref ined u n t i l the t r a j ec to ry is i n the highest s tate reso lu t ion . T h e r equ i remen t tha t the core a lgo r i t hms of L A D D P updates the Q values over a l l states c a n be r e l axed . F o r e x a m p l e , i t m a y be a good i d e a to keep a queue of the i n t e r m e d i a t e states w h i c h caused the largest r e d u c t i o n i n the Q values i n recent l i n k u p s , s i m i l a r to M o o r e ' s P r i o r i t i z e d Sweep ing me thods . K e e p i n g t r ack of the best i n t e rmed ia t e states is c lose ly re la ted to the b u i l d i n g of useful macros w h i c h is concerned w i t h keep ing t r ack of the best s tate t ra jector ies . T h e i m p l e m e n t a t i o n of the inner and outer m i n i m i z a t i o n m e t h o d s can also be r e l axed . F o r e x a m p l e , we m a y choose to accept a near m i n i m u m i f th i s resul ts i n a large r e d u c t i o n i n c o m p u t a t i o n a l effort. T h e r eco rd keep ing of the t imes i n the L A D D P a l g o r i t h m s can also be r e l axed . Per -haps we c o u l d t reat the f in i te h o r i z o n Q values as in f in i te ho r i zons . T h i s thesis presents the core a l g o r i t h m s , but i n p rac t i ce i t is e x p e c t e d tha t var ious r e l axed versions w i l l a c t u a l l y be used ins tead of t h e m . T h e core a l g o r i t h m s guarantee o p t i m a l i t y o n a f i xed schedule . T h e r e l axed versions w i l l genera l ly t r ade off t h i s guarantee for r educed c o m p u t a t i o n a l effort. Chapter 4. 
4.9 Summary

In this chapter, the dynamic system to be optimally controlled is defined. A prototype algorithm is put forth as a basis for thinking about the issues involved in applying the Linkup Relation. Then 4 issues that complicate the application are described. Two situations are identified which permit these issues to be circumvented. The first situation is when the system is deterministic. The second is when we are interested in evaluating one control policy at a time. A total of 4 algorithms is proposed for these situations. An algorithm which seems natural for Eq. (4.45) is proposed only for discussion and a numerical test later. The correspondence between Linkup Action Dependent Dynamic Programming and learning of optimal control in hierarchies is discussed. Algorithms for changing the goal of the controller are presented. In the next chapter, theoretical and simulation results are presented.

Chapter 5

Evaluation of Proposed Algorithms

In this chapter, theoretical and simulation results are presented for the 3 algorithms proposed in the last chapter for deterministic systems and for the 1 algorithm for stochastic systems. The results for the deterministic algorithms are on the correctness of the algorithms, on their accumulation of approximation errors, and on the effects of the approximation errors on the control policies and their performance. The results for the stochastic algorithms are on the correctness of the algorithms and on the smoothness of the learning.

5.1 On the Algorithms for Deterministic Systems

Several results are obtained for the 3 algorithms proposed for deterministic systems. First, proofs will be presented on the correctness of the Deterministic Linkup Relation and the correctness of the 3 algorithms. Next, the proofs for several upper bounds for the SQL algorithm, Algorithm 2, and the Doubling Algorithm will be presented. The upper bounds are then illustrated numerically. Then, a series of simulations of these 3 algorithms is done incorporating spatially and temporally correlated errors. In the accumulation of approximation errors, Algorithm 2 and the Doubling Algorithm are shown to be much better behaved in the mean and variance, with the Doubling Algorithm being the best of them. In terms of the degradation of the control law and control performance, Algorithm 2 and the Doubling Algorithm are shown to be more tolerant and less sensitive to the accumulated approximation errors.
5.1.1 Proof of the Deterministic Linkup Relation

The deterministic system was described in Section 4.3, where the following is claimed, in more precise terms here:

Theorem 5.1 Given

1. a deterministic system with a set of states $S$ and action sets $U(s \in S)$ which has deterministic transitions $\{s_t, u_t\} \rightarrow s_{t+1}$, where $s_t, s_{t+1} \in S$ and $u_t \in U(s_t)$, and the system has costs $c(s_t, u_t)$ attached to state transitions;

2. a control policy at each time step $t \in \{0, \ldots, n-1\}$ defined as the set $\mu_t$ of actions $u(s) \in U(s)$, $\forall s \in S$, one unique action for each state;

3. a vector of control policies for $t \in \{0, \ldots, n-1\}$ denoted by $\vec{\mu}$, with the optimal vector denoted by $\vec{\mu}^*$; and

4. set(s) of optimal (or minimal) finite-horizon discounted sums of costs,
$$Q^*(i, l, u, m) = \min_{\vec{\mu}} \left[ \sum_{t=0}^{m-1} \gamma^t c_t \;\Big|\; s_0 = i,\; s_m = l,\; u_0 = u \right]$$
for one or more horizons $m$, where $0 < \gamma < 1$ is the discount factor and each set for a particular $m$ is complete, i.e., a set for a particular $m$ contains $Q^*(i, l, u, m)$'s for all possible $\{i, l\} \in S$ and $u(i) \in U(i)$,

we can compute a $Q^*(i, l, u, n)$ for a longer horizon, i.e., $n$ is greater than all $m$'s in Item 4 above, in the following manner:
$$Q^*(i, l, u, n) = \min_{s_{n_a} \in S} \left( Q^*(i, s_{n_a}, u, n_a) + \gamma^{n_a} \min_{u_{n_a} \in U(s_{n_a})} Q^*(s_{n_a}, l, u_{n_a}, n_b) \right)$$
where $n_a$ and $n_b$ are both one of the $m$'s in Item 4 above and $n_a + n_b = n$.

Proof.
s*la must be the element of S which minimizes this expression: Q*(so,sna,u0,na) + m i n (inaQ*{sna,sn,una,n - n 0 ) j . This means Q*(s0,sn,u0,n) = m i n (Q*{so,sna,u0,na) + mm^"Q*(sna,sn,una,n - na)J . (5.51) Chapter 5. Evaluation of Proposed Algorithms 114 • In E q . (5.51), the m i n i m i z a t i o n done ins ide the large parentheses w i l l be ca l l ed in-ner minimization a n d the m i n i m i z a t i o n done outs ide t h e m the outer minimization. N o t e tha t the p re -c i r cu i t Q+ is now re labe l l ed as Q*. If there is no s tochas t ic i ty , the c o m p l i c a t i o n desc r ibed at the end of the last sect ion is e l i m i n a t e d . A pos t - c i r cu i t t y p e p o l i c y w h i c h o p t i m a l l y moves the states f r o m sna to sn w h e n used i n the c a p a c i t y of a p r e - c i r cu i t t y p e p o l i c y , w i l l also moves the states s0 to sna o p t i m a l l y . T h e effect of a p re -c i r cu i t p o l i c y is en t i r e ly cap tu r ed i n sna. W e have no need to con tend w i t h mu l t i - s t ep t r a n s i t i o n p r o b a b i l i t i e s . So w i t h i n the p e r i o d f r o m t — 0 to t = n Q , the m e a n i n g of a n o p t i m a l p o l i c y is e x a c t l y the same as an o p t i m a l p o l i c y f r o m t = 0 to t = n or f r o m t = na to t — n. Therefore , the m i n i m i z e d state-to-state Q* funct ions i n the prev ious l i n k u p c y c l e c a n be used d i r e c t l y as the p re -c i r cu i t Q* i n the present cyc l e . E q . (5.51) is s i m i l a r to E q . (3.37). T h e y are de r ived a long different l ines of reasoning , bu t the resul ts are equivalent versions of each other . T h e jl+ p o l i c y i n E q . (3.37) i n a d e t e r m i n i s t i c s y s t e m w i l l t ake the sys t em, i n na steps, to the s* a t ha t w o u l d be chosen b y the outer m i n i m i z a t i o n i n E q . (5.51). Corollary 5.1 Algorithm 1 and 2, and the Doubling Algorithm will produce correct val-ues of Q*(s0,sn,u0,n), assuming that in the Value Identification Phase all possible state transitions have been observed. For Algorithm 1, it is also assumed that it is not in its loose form, meaning that the observations in Value Identification Phase are complete and its collection schedule in Step 1 of Table Jh3 is exhaustive. Proof. Eq. (5.51) implies that an algorithm which builds up Q* of any horizon in strict accordance with the conditions of Theorem 5.1.1 will produce the correct values. This Chapter 5. Evaluation of Proposed Algorithms 115 clearly implies that Algorithm 2 and 3 are correct. The situation with Algorithm 1 is only slightly more complicated. Its correctness hinges on the asynchronous linkups to have covered all possible intermediate states, sna. If the probability of linkup at all $na 6 S is non-zero, Q's would converge to the corresponding Q* 's as time approaches infinity. The search for Q's in Step 1 of Table 4-3 is assumed to be exhaustive, meaning that each of these probabilities is 1. The "replace" operation in Step 2 of the Build-Up Phase of Algorithm 1 (Table 4-3) ensures that incorrect values of Q* will be erased and not affect the final Q* as time approaches infinity. If the linkups are deterministically scheduled to completely cover the intermediate states, then the situation is even simpler. a 5.1.2 Numerical demonstration of correctness E x t e n s i v e s i m u l a t i o n s have been done w i t h the S Q L a l g o r i t h m a n d a l l 3 p roposed algo-r i t h m s . 
T h e resul ts demons t ra t e the correctness of the de r iva t i on for the D e t e r m i n i s t i c L i n k u p R e l a t i o n . T h e s imu la t i ons are done on r a n d o m l y genera ted d iscre te sys tems w i t h m a n y different n u m b e r of states a n d different n u m b e r of ac t ions per state. T h e systems have t r a n s i t i o n costs be tween 0.0 and 1.0. A n e x a m p l e of the r a n d o m l y generated systems is shown i n T a b l e 5.14. F o r ease of p resen ta t ion , th i s s y s t e m has o n l y 4 states and 4 ac t ions per state. T h e number s i n the lower r igh t sec t ion of the tab le p rov ide two pieces of i n f o r m a t i o n : 1) the presence of a n u m b e r at a n in t e r sec t ion ind ica tes the nex t s tate s h o w n o n the le f tmos t c o l u m n is reached w i t h the g iven ac t i on , and 2) the value ind ica tes the cost of the t r a n s i t i o n . T h e p o s i t i o n a n d the va lue are r a n d o m l y generated. T h e other sys tems tes ted are not s i m i l a r to th is s y s t e m except h a v i n g the bas ic parameters of the n u m b e r of states, the n u m b e r of ac t ions per state, a n d the range of the costs. T h e other pa ramete r s of the con t ro l Chapter 5. Evaluation of Proposed Algorithms 116 ut = 0 ut = 1 u t = 2 u t = 3 St = 0 , 5 t + i = 0 0.697221 st = 0,st+i = 1 0.175798 st = 0 ,s t + i = 2 0.296271 s< - 0, s t + i = 3 0.592444 s t = = 0 s t = l , s i + 1 = 1 0.983885 st = l , s ( + 1 = 2 s t = l , s t + i = 3 0.536482 0.701650 0.283753 s< - 2, s-t+i - 0 0.699833 0.028425 0.640908 st - 2, s < + 1 = 1 s t = 2, s ( + i = 2 0.694930 s< = 2, s ( + i = 3 s t = 3 , s t + i = 0 0.767806 st = 3 , s t + 1 =• 1 S t = 3 , s i + i = 2 0.996451 0.817222 0.369249 s t = 3, s t + i = 3 T a b l e 5.14: T r a n s i t i o n def ini t ions a n d costs of the test sy s t em. B l a n k entr ies i n d i c a t e t r a n s i t i o n not poss ib le i n one t i m e step. p r o b l e m are A = 0.99 a n d n = 1024. T a b l e 5.15 shows the s i m u l a t i o n results for the sy s t em defined above. T h e Q-values genera ted by Synchronous Q - L e a r n i n g are i den t i c a l to the Q-values genera ted by the D o u b l i n g A l g o r i t h m , w h i c h is wha t the s i m u l a t i o n is i n t ended to demons t ra te . D o u b l e p r e c i s i o n var iab les are used t h roughou t t he s i m u l a t i o n . T h e purpose of T a b l e 5.15 is t o f ac i l i t a t e independen t ve r i f i ca t ion . T h i s s i m u l a t i o n resul t is representa t ive of the results o b t a i n e d over a w i d e range of r a n d o m l y genera ted sys tems of v a r y i n g sizes. A n in te res t ing po in t to note is tha t as n o n - s m o o t h as the s y s t e m is , the Q-values i n each h o r i z o n t a l b l o c k of number s differ o n l y b y a few percent w i t h respect t o different ac t ions . T h i s is cons i s ten t ly the case w i t h the o ther sys tems tes ted. A h o r i z o n of 1024 t i m e steps is effect ively in f in i t e for th i s s m a l l Chapter 5. 
Evaluation of Proposed Algorithms 117 u0 - 0 u0 — 1 u 0 = 2 UQ — 3 •so = 0, sn = 0 16.80323 16.83576 16.30154 16.50537 •so = 0 , s n = 1 17.31345 16.74727 16.91243 16.49454 s0 = 0 , s n = 2 16.70020 16.83837 16.16696 16.77944 s 0 = 0, sn = 3 16.90634 17.00172 16.82249 16.37452 s 0 = l , s „ = 0 16.80321 16.83574 16.30156 16.50539 s 0 = 1, s n = 1 17.31347 16.74725 16.91242 16.49452 s 0 = 1, sn = 2 16.70022 16.83835 16.16694 16.77943 s0 = 1, sn = 3 16.90633 17.00174 16.82251 16.37454 s 0 = 2 , s „ = 0 16.80322 16.83574 16.30156 16.50539 s 0 = 2 , s n = 1 17.31348 16.74725 16.91242 16.49452 s 0 = 2, s n = 2 16.70022 16.83836 16.16695 16.77943 s 0 = 2, s n = 3 16.90633 17.00174 16.82251 16.37454 s 0 = 3 , s n = 0 16.80323 16.83575 16.30154 16.50537 •so = 3, sn = 1 17.31346 16.74726 16.91243 16.49454 So = 3, s n = 2 16.70020 16.83837 16.16696 16.77944 So = 3, s n = 3 16.90634 17.00172 16.82250 16.37452 T a b l e 5.15: Q-values generated for the test s y s t e m . s y s t e m . T h i s i l lus t ra tes the h i g h leve l of a p p r o x i m a t i o n accu racy r equ i r ed i n D P and Q - L e a r n i n g . T h e first two a l g o r i t h m s are s i m i l a r i n na tu re to the D o u b l i n g A l g o r i t h m , except for some o p e r a t i o n a l differences. S i m u l a t i o n s demons t r a t ed tha t they also generate the same Q-values as the S Q L a l g o r i t h m and the D o u b l i n g A l g o r i t h m . 5.1.3 Complexity results T h e c o m p l e x i t i e s of the Synchronous Q - L e a r n i n g A l g o r i t h m ( S Q L ) , A l g o r i t h m 2, a n d the D o u b l i n g A l g o r i t h m ( D A ) are c o m p a r e d i n T a b l e 5.16. W e choose not to consider the c o m p l e x i t y of A l g o r i t h m 1 i n i ts in t ended loose f o r m because i ts c o m p l e x i t y is de-pendent o n two a d d i t i o n a l factors: the e x p l o r a t i o n schedule or p robab i l i t i e s , a n d the c o l l e c t i o n schedule i n Step 1 of T a b l e 4.3. T h e effect of these factors m a y differ f r o m Chapter 5. Evaluation of Proposed Algorithms 118 s y s t e m to s y s t e m a n d f r o m r u n to r u n of the a l g o r i t h m , a n d can s ign i f i can t ly change the c o m p l e x i t i e s . S Q L A l g . 2 D A m e m o r y 2S2A 3S2A + S2 + 2SA S2(2A + 1) e x p l o r a t i o n O(SAH) 0{SA) 0 ( S 7 l ) m i n i m i z a t i o n s S2A(H - 1) S2A{M-l) -r(S2 + S2A)(H/M - 1 ) S 2 ( l + A ) l o g 2 # reads S2A2{H - 1) (2S'2A + 2SA)(M - 1) +(S2A + 2S3A)(H/M - 1) ( S 2 A + 2S ' 3 A)log 2 # wri tes S2A(H - 1) (S2 + S2A)(M + H/M - 2 ) S 2 ( l + A ) l o g 2 # b a c k u p / l i n k u p cycles H - 1 M + H/M - 2 l o g 2 # T a b l e 5.16: C o m p a r i s o n of complex i t i e s of the Synchronous Q - L e a r n i n g A l g o r i t h m , A l -g o r i t h m 2, a n d the D o u b l i n g A l g o r i t h m . It is assumed tha t b o t h A l g . 2 a n d D A use an a p p r o x i m a t o r for the / * func t ion . F o r A l g . 2, i t is a ssumed tha t H/M is an integer . Severa l c o m m e n t s about the c o m p a r i s o n shou ld be made . A l l 3 a l g o r i t h m s operate on a d e t e r m i n i s t i c s y s t e m a n d the Q ' s i n b o t h a lgo r i t hms have sn i n the i n p u t specifiers. A s i m p l e t ab le l o o k u p i m p l e m e n t a t i o n of the Q funct ions is used . T h e s t a r tup detai ls are ignored . S denotes the n u m b e r of states i n the sys t em. A denotes the n u m b e r of ac t ions per state. 
It is a ssumed that each state has the same n u m b e r of poss ib le ac t ions . H denotes the h o r i z o n . T h e s ta te-ac t ion space is u n i f o r m l y " exp lo red" by a l l a lgo r i t hms . T h e "reads" a n d "wr i tes" are to a func t ion a p p r o x i m a t o r , i n th is case s i m p l y a l o o k u p tab le , ra ther t h a n to o r d i n a r y c o m p u t e r m e m o r y . T h e s ignif icance of th is c o m p l e x i t y c o m p a r i s o n is tha t , for d e t e r m i n i s t i c sys tems, the c o m p l e x i t i e s of Synchronous Q - L e a r n i n g , A l g o r i t h m 2, and the D o u b l i n g A l g o r i t h m , i n t e rms of m e m o r y , e x p l o r a t i o n , n u m b e r of m i n i m i z a t i o n s , n u m b e r of reads, a n d n u m b e r of wr i t e s , are c o m p a r a b l e , bu t A l g o r i t h m 2 and espec ia l ly the D o u b l i n g A l g o r i t h m requi re m u c h fewer cycles of b a c k u p s / l i n k u p s t h a n S Q L . Chapter 5. Evaluation of Proposed Algorithms 119 5.1.4 Accumulated approximation error upper bounds W e want to c o m p u t e some worst case upper bounds of the absolute c u m u l a t i v e f r ac t iona l a p p r o x i m a t i o n errors tha t w o u l d resul t f r o m the use of the var ious a l g o r i t h m s presented . T o c o m p u t e these worst case uppe r bounds , i t is assumed tha t the m a x i m u m m a g n i t u d e of the nega t ive a p p r o x i m a t i o n errors is at least as large as the m a x i m u m m a g n i t u d e of the pos i t i ve a p p r o x i m a t i o n errors. F i n a l l y , i t is also assumed tha t the m a x i m u m negat ive a p p r o x i m a t i o n error is a lways i n c u r r e d by a l l wr i tes to the a p p r o x i m a t o r . T hese me thods are ca l l ed " M e t h o d 1." W e define th is no t a t i on : £ = m a x |negat ive f rac t iona l approx . error for M e t h o d 1| . W e wan t to look at the m a x i m u m negat ive a p p r o x i m a t i o n error because, i f a l l else are equa l , the m i n i m i z a t i o n operators i n the var ious a lgo r i t hms w i l l choose the values w h i c h have the largest negat ive error . F o r A l g o r i t h m 2, i t is assumed tha t the m-s tep Q f unc t ion i n S tep 1 of the B u i l d -u p P h a s e uses c o m p u t a t i o n a l m e t h o d s s i m i l a r to M e t h o d 1 b u t w h i c h resul ts i n lower m a x i m u m a p p r o x i m a t i o n errors t h a n those f rom M e t h o d 1. T hese m e t h o d s are ca l l ed " M e t h o d 2 ." W e define th is no t a t i on : 8 = m a x |negat ive f rac t iona l approx . error for M e t h o d 21 . A s each t y p e of a l g o r i t h m progresses, e and 8 w i l l a c c u m u l a t e i n the b a c k u p or l i n k u p processes, wha teve r the case m a y be. T h e fo l lowing n o t a t i o n w i l l be used to denote the wors t case uppe r b o u n d for a c c u m u l a t e d error: £c(h) = m a x | c u m u l a t i v e nega t ive f r ac t iona l approx . er ror for h o r i z o n h\ . A n d the fo l l owing n o t a t i o n w i l l be used to denote a loose, m e a n i n g pos s ib ly worse t h a n wors t case, u p p e r b o u n d : e'JJi) < m a x | c u m u l a t i v e negat ive f rac t iona l approx . error for h o r i z o n h\ . Chapter 5. 
No attempt has been made to find the least upper bounds, because that would require the details of the system definition to be integral parts of the upper bound expressions, which would lessen the clarity of these results.

SQL Algorithm

Proposition 5.1  Using the SQL algorithm, the worst case upper bound for the absolute cumulative fractional approximation error for an n-horizon problem is

    ε_c(n) = (1 + ε)^n − 1 .

Proof. The worst case upper bound for the absolute cumulative fractional approximation error for a horizon of 1 is just the maximum magnitude of the negative fractional approximation error in storing the cost of a transition, ε, which is ε_c(1). Now assume the proposition for a horizon of n, i.e.,

    ε_c(n) = (1 + ε)^n − 1 .

Recall that the update equation of the SQL algorithm is the following:

    Q*(s_0, s_n, u_0) := c(s_0, u_0) + γ min_{u_1} Q*(s_1, s_n, u_1) ,    (5.52)

where the first term on the right is the a term and the second is the b term. Only the b term carries over ε_c(n) to Q*(s_0, s_n, u_0). Therefore, the fractional errors accumulate as follows:

    (1 + ε) ( 1 + (bγ / (a + bγ)) ((1 + ε)^n − 1) ) − 1    (5.53)

where a is the magnitude of the a term in Eq. (5.52) and b is the magnitude of the minimized Q* in the b term of the same equation. The factor bγ/(a + bγ) then depends on the details of the system definition. For 0 ≤ a < ∞ and 0 < bγ < ∞, it is monotonically decreasing in a and takes its largest value at a = 0. We want to choose the worse case, which is when a = 0. This choice also has the effect of making the upper bound independent of system details. Expression (5.53) now takes the form

    (1 + ε) ( 1 + ((1 + ε)^n − 1) ) − 1

and then

    (1 + ε)^{n+1} − 1

which is ε_c(n + 1). □

Algorithm 2

Proposition 5.2  Using Algorithm 2, a loose upper bound for the absolute cumulative fractional approximation error for an n = mn′ horizon problem is

    ε′_c(mn′) = (1 + ε)^{m+n′−1} − 1 ,

where m is the horizon of the circuit constructed at the completion of Step 1 of Algorithm 2 and n′ is 1 plus the number of times that this m-step circuit is used to link up to a horizon of mn′ in Step 2 in Table 4.4. It is assumed that Step 1 uses the SQL algorithm, all storage is done with Method 1, and all costs c(s, u) are greater than 0.

Proof. The worst case upper bound for the absolute cumulative fractional approximation error for a horizon of m is just that accumulated in m cycles of the SQL algorithm, i.e.,

    ε_c(m) = (1 + ε)^m − 1 ,

from the error upper bound proof for the SQL algorithm presented earlier. Now assume the proposition for a horizon of mn′, i.e.,

    ε′_c(mn′) = (1 + ε)^{m+n′−1} − 1 .

Recall that the update equation of Algorithm 2 is the following:

    Q*(i, l, u) := min_{s_m ∈ S} ( Q*(i, s_m, u) + γ^m min_{u_m} Q*(s_m, l, u_m) ) ,    (5.54)

where the first term inside the outer minimization is the a term and the second is the b term. Both the a and b terms carry over maximum absolute cumulative fractional approximation error to Q*(s_0, s_n, u_0, m(n′ + 1)). The a term is fixed at m steps and its maximum absolute cumulative fractional approximation error is fixed at (1 + ε)^m − 1. The b term changes as Algorithm 2 progresses. The b term has an mn′-step horizon with a loose maximum absolute cumulative fractional approximation error assumed to be (1 + ε)^{m+n′−1} − 1. The fractional errors accumulate as follows:

    (1 + ε) ( 1 + ( ((1 + ε)^m − 1) a + ((1 + ε)^{m+n′−1} − 1) bγ^m ) / (a + bγ^m) ) − 1    (5.55)

where a is the magnitude of the a term in Eq. (5.54) and b is the magnitude of the minimized Q* in the b term of the same equation. Since ε > 0, a > 0 by assumption, and a + bγ^m > 0 by assumption, Expression (5.55) is less than or equal to

    (1 + ε) ( 1 + ( ((1 + ε)^{m+n′−1} − 1)(a + bγ^m) ) / (a + bγ^m) ) − 1

which is equal to

    (1 + ε)^{m+n′} − 1

which is ε′_c(m(n′ + 1)). □

It was mentioned previously that the m-step Q*'s may be built with Method 2 to decrease their approximation errors, i.e., δ < ε. If this is the case, the worst case upper bound for Algorithm 2 would be lower than (1 + ε)^{m+n′−1} − 1.

Doubling Algorithm

Proposition 5.3  Using the Doubling Algorithm, the worst case upper bound for the absolute cumulative fractional approximation error for an n-horizon problem is

    ε_c(n) = (1 + ε)^{log_2 2n} − 1 .

Proof. The worst case upper bound for the absolute cumulative fractional approximation error for a horizon of 1 is just the maximum magnitude of the fractional approximation error in storing the cost of a transition, ε, which is ε_c(1). Now assume the proposition for a horizon of n, i.e.,

    ε_c(n) = (1 + ε)^{log_2 2n} − 1 .

Recall that the update equation of the Doubling Algorithm is the following:

    Q*(i, l, u, 2n) := min_{s_n ∈ S} ( Q*(i, s_n, u, n) + γ^n min_{u_n} Q*(s_n, l, u_n, n) ) .    (5.56)

Both the a and b terms carry over ε_c(n) to Q*(s_0, s_n, u_0, 2n). Since both terms have identical ε_c(n), the worst case fractional errors accumulate as follows:

    (1 + ε) ( 1 + ( a ε_c(n) + bγ^n ε_c(n) ) / (a + bγ^n) ) − 1    (5.57)

where a is the magnitude of the a term and b is the magnitude of the minimized Q* in the b term in Eq. (5.56). Expression (5.57) now takes the form

    (1 + ε)(1 + ε_c(n)) − 1

and then

    (1 + ε) ( 1 + (1 + ε)^{log_2 2n} − 1 ) − 1

and finally

    (1 + ε)^{log_2 2(2n)} − 1

which is ε_c(2n). □

The worst case upper bound for the Doubling Algorithm can be written as

    ε_c(n) = (2n)^{log_2 (1+ε)} − 1 .

This is an encouraging result because ε_c(n) is a sub-linear function of n if ε < 1.

It is clear that the worst case upper bounds for Algorithm 2 and the Doubling Algorithm are both much lower than that for the SQL algorithm. The next section illustrates numerically the difference in their worst case upper bounds.

5.1.5 Numerical demonstration of error upper bounds

Table 5.17 shows the numerical values of the upper bounds proven above. It is obvious that the worst case upper bound for the SQL algorithm deteriorates quickly with the horizon. The SQL algorithm is merely a rigidly scheduled version of One-Step Q-Learning, which implies that this upper bound also applies to One-Step Q-Learning. In simulations, the actual errors accumulated by One-Step Q-Learning will most likely not be as severe as the worst case upper bound, by many orders of magnitude, because the approximation errors will rarely be at their worst and there are cancellations among positive and negative errors.
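The three closed-form bounds are easy to tabulate. The following short sketch (Python; it is not from the thesis, and the function and variable names are my own) evaluates the worst case expressions of Propositions 5.1 to 5.3 for a given per-write error ε, and can be used to generate tables of the kind shown in Table 5.17.

import math

def sql_bound(eps, horizon):
    # Proposition 5.1: (1 + eps)^n - 1 for an n-horizon problem.
    return (1.0 + eps) ** horizon - 1.0

def alg2_bound(eps, m, n_prime):
    # Proposition 5.2 (loose bound): (1 + eps)^(m + n' - 1) - 1,
    # where m is the Step-1 circuit horizon and the final horizon is m * n'.
    return (1.0 + eps) ** (m + n_prime - 1) - 1.0

def doubling_bound(eps, horizon):
    # Proposition 5.3: (1 + eps)^(log2(2n)) - 1, sub-linear in n when eps < 1.
    return (1.0 + eps) ** math.log2(2.0 * horizon) - 1.0

if __name__ == "__main__":
    H, m, n_prime = 1024, 16, 64          # the setting used for Table 5.17
    for eps in [0.005 * k for k in range(11)]:
        print(f"{eps:5.3f}  {sql_bound(eps, H):12.5e}  "
              f"{alg2_bound(eps, m, n_prime):12.5e}  "
              f"{doubling_bound(eps, H):12.5e}")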
ε        SQL            Alg. 2         DA
0.000    0.00000E+00    0.00000E+00    0.00000E+00
0.005    1.64214E+02    4.82924E-01    5.63958E-02
0.010    2.66116E+04    1.19477E+00    1.15668E-01
0.015    4.18049E+06    2.24203E+00    1.77949E-01
0.020    6.40584E+08    3.77984E+00    2.43374E-01
0.025    9.57719E+10    6.03372E+00    3.12087E-01
0.030    1.39739E+13    9.33096E+00    3.84234E-01
0.035    1.99030E+15    1.41456E+01    4.59970E-01
0.040    2.76783E+17    2.11633E+01    5.39454E-01
0.045    3.75908E+19    3.13733E+01    6.22853E-01
0.050    4.98703E+21    4.62014E+01    7.10339E-01

Table 5.17: Worst case upper bounds for absolute cumulative fractional approximation errors for the SQL algorithm, Algorithm 2, and the Doubling Algorithm. The horizon is 1024. For Algorithm 2, m = 16 and n′ = 64.

Nevertheless, the numerical comparison shown in Table 5.17 clearly favors Algorithm 2 and the Doubling Algorithm.

5.1.6 Numerical results for correlated errors

Different systems are simulated in this section to examine how the accumulation of approximation errors behaves when the errors are spatially and temporally correlated. This type of correlation is encountered if an approximator is used to store the various mappings in these algorithms. While the system in Section 5.1.2 was chosen for ease of presentation of numerical results, systems with 2-dimensional neighbouring properties are chosen in this section. By changing the purpose and method of presentation, a larger system can be shown here.

Lookup tables are used to implement the various mappings required by the SQL algorithm, Algorithm 2, and the Doubling Algorithm. Given the memory constraints, the systems have a rectangular grid of 7 x 7 states and 2 actions per state. A balance is struck between the length of the horizon and the number of states; a final horizon of 128 seems reasonable for a system with 49 states.

A Prototype System

Figure 5.15: State transitions in the 7 x 7 deterministic simulation (left panel: action = 0; right panel: action = 1).

For the sake of clarity, one possible system is shown in Figure 5.15, which gives the state transition at each state for both possible actions. Notice that the transitions are all to neighbouring states. The local nature of the state transitions is intended to allow meaningful tests of the accumulation of spatially and temporally correlated approximation errors.

The actual error accumulation simulations

In the simulation results presented in this section, each individual run uses a different, randomly generated system. For each combination of algorithm and value of ε, 100 systems are simulated.
Step 1. Temporal correlation: For each e′ in the super grid, compute its new value as
        e′(t + 1) = 0.8 e′(t) + 0.2 e′_M e ,
        where e′_M is the maximum fractional error and e is a uniformly distributed random variable in [−1, 1].
Step 2. Spatial correlation via relaxation: Repeat 5 times: for each s in the 7 x 7 core grid, set e′ to the mean of the 3 x 3 block of e′ centered on s.
Step 3. Extraction and normalization: The core 7 x 7 e′ values are extracted from the 9 x 9 grid and normalized such that max |extracted e′| = e′_M.

Table 5.18: Method for generating temporally and spatially correlated approximation error.

For illustrative purposes, a system that we might construct manually is shown in Figure 5.15; as noted above, its state transitions, for both actions, are all to neighbouring states, which is what allows a meaningful test of the accumulation of spatially and temporally correlated approximation errors. The transition costs are as follows. Going in the right direction, with action 0, along a sequence of 5 contiguous transitions produces a high payoff or, equivalently, a low cost; this sequence is indicated by heavy arrows in the left part of the figure, and each transition in the sequence carries a cost of 0. Going backwards on this sequence, with action 1, carries a relatively heavy cost of 2. All other ordinary transitions have a cost of 1. The randomly generated systems are not constructed purposefully like this system.

When the various mappings in the algorithms to be compared are stored using approximators, the approximation errors typically are temporally and spatially correlated.

Figure 5.16: Comparison of the mean accumulated fraction errors for the SQL algorithm, Algorithm 2, and the Doubling Algorithm (horizontal axis: maximum fractional error per backup/linkup).

The temporal correlation comes from the fact that the approximators' parameters are usually adjusted slowly in time, so that the errors in the near future at any point in the input space are influenced by the errors at the present. The spatial correlation comes from the fact that the approximation is continuous. In these simulations, approximation errors are simulated instead of generated by actual function approximators; this gives us greater control over the error magnitude. A 9 x 9 super grid of error variables e′ is used to track the error in time and in space. Each state uses the corresponding fractional error e(s, t) in the 7 x 7 core of the super grid. These e(s, t)'s are generated by the method described in Table 5.18.
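As a concrete illustration of Table 5.18, the following sketch (Python with NumPy; it is not from the thesis and the names are mine) performs one update of the correlated errors on a 9 x 9 super grid with a 7 x 7 core. It reads Step 2 as a simultaneous (copy-based) relaxation, which is an assumption on my part.

import numpy as np

def update_correlated_errors(e_super, eps_max, rng):
    # One application of the Table 5.18 method.
    # e_super: 9 x 9 array of error variables e'; eps_max: e'_M.
    e_super = e_super.copy()

    # Step 1: temporal correlation -- each cell decays toward a fresh uniform draw.
    noise = rng.uniform(-1.0, 1.0, size=e_super.shape)
    e_super = 0.8 * e_super + 0.2 * eps_max * noise

    # Step 2: spatial correlation -- relax the 7 x 7 core five times with 3 x 3 means.
    for _ in range(5):
        smoothed = e_super.copy()
        for i in range(1, 8):            # core rows of the 9 x 9 grid
            for j in range(1, 8):        # core columns
                smoothed[i, j] = e_super[i-1:i+2, j-1:j+2].mean()
        e_super = smoothed

    # Step 3: extract the 7 x 7 core and normalize its largest magnitude to e'_M.
    core = e_super[1:8, 1:8]
    peak = np.abs(core).max()
    if peak > 0.0:
        core = core * (eps_max / peak)
    return e_super, core                 # core holds the e(s, t) used during storage

rng = np.random.default_rng(0)
e_super = np.zeros((9, 9))
e_super, core_errors = update_correlated_errors(e_super, eps_max=0.05, rng=rng)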
T h i s m e t h o d is pe r fo rmed once before the storage of the results of each l i n k u p / b a c k u p cyc le . T h e values at the po in t s i n the ex t r ac t ed 7 x 7 core g r i d are the f r ac t iona l errors used d u r i n g storage. F i g u r e 5.16 shows the m e a n a c c u m u l a t e d f rac t ion error, E[ec(l28)}, as a func t i on of the m a x i m u m f rac t iona l error , e. T h e D o u b l i n g A l g o r i t h m and A l g o r i t h m 2 are c lear ly Chapter 5. Evaluation of Proposed Algorithms 129 Maximum Fractional Error per Backup/Linkup F i g u r e 5.17: C o m p a r i s o n of the s t anda rd d e v i a t i o n of a c c u m u l a t e d f rac t ion errors for the S Q L A l g o r i t h m , A l g o r i t h m 2, and the D o u b l i n g A l g o r i t h m . super io r to the S Q L a l g o r i t h m i n th is c o m p a r i s o n . F i g u r e 5.17 shows tha t the D o u -b l i n g A l g o r i t h m a n d A l g o r i t h m 2 are m u c h bet ter behaved i n t e rms of the v a r i a b i l i t y of -E[e c (128)]. T h e m e a n and s t anda rd d e v i a t i o n are t aken over a l l Q ' s a n d a l l s i m u l a t i o n runs . 5.1.7 Simulations of control policy degradation and performance In Q - L e a r n i n g l i ke a l g o r i t h m s , i t is i m p o r t a n t to d i s t i ngu i sh be tween absolu te app rox i -m a t i o n errors a n d re la t ive a p p r o x i m a t i o n errors. A b s o l u t e errors is the t y p e descr ibed i n th i s thesis thus far, w h i c h are the differences be tween the l ea rned Q-func t ions and the er ror free Q- func t ions . F o r a set of Q or (^-functions of the same s tar t , end states and h o r i z o n bu t different ac t ions , the re la t ive errors are the differences i n absolu te errors be tween the e lements of the set. In Q - L e a r n i n g l ike a l g o r i t h m s , i t is the r e l a t ive errors Chapter 5. Evaluation of Proposed Algorithms 130 w h i c h u l t i m a t e l y de t e rmine the o p t i m a l i t y of the con t ro l laws. B e t t e r absolu te approx-i m a t i o n errors do not i m p l y be t te r con t ro l po l ic ies . If the r e l a t ive errors are zero , t h e n absolute errors are not relevant i n d e t e r m i n i n g the o p t i m a l i t y of the c o n t r o l laws. O n the o ther h a n d , i f the absolu te errors are zero, then re la t ive errors m u s t also be zero. T h i s is a s t rong reason to pursue the r e d u c t i o n of absolute errors as a means to reduce the r e l a t ive errors . In the p rev ious sect ions, i t is shown that 2 p roposed a l g o r i t h m s , the D o u b l i n g A l -g o r i t h m a n d A l g o r i t h m 2, have an advantage i n absolute errors over the Synchronous Q - L e a r n i n g a l g o r i t h m . I n t h i s sec t ion , we are e x a m i n i n g the effect of t he r e d u c t i o n of absolu te errors o n the con t ro l l aw by the proposed a lgo r i t hms . In the set of s i m u l a t i o n s for th is sec t ion , a one- l ink m a n i p u l a t o r is used i n c o m p a r i n g the c o n t r o l l aws a n d per formances a m o n g the D o u b l i n g A l g o r i t h m , A l g o r i t h m 2, a n d the Synch ronous Q - L e a r n i n g a l g o r i t h m . T h e purpose of these s i m u l a t i o n s is not to show tha t these a l g o r i t h m s can con t ro l th i s sys t em. 
The system is intended to be a platform for comparing the relative performance of these algorithms and for demonstrating how one uses the algorithms in a non-conceptual system. The dynamics of the manipulator are as follows:

    M L² θ̈ = T − M G L sin θ ,

where G = 9.81 m/s² is the gravitational constant, θ in radians is the angle as shown in Figure 5.18, T is the torque, M = 1.0 kg is the mass of the end mass, and L = 1.0 m is the length of the massless arm. The range of the angle, θ, is equally quantized into 32 divisions, with 0 radians at the center of one division. The range of the angular velocity, θ̇, is equally quantized into 33 divisions, with 0 radians/second at the center of one division. The control can be 1 of 2 possible values, 7.5 N·m and −7.5 N·m. The frequency of control changes is 12.5 Hz; between changes, the control value is constant. These values are chosen so that there is enough change in the states in one time period to cause the corresponding discretized states to vary from one time step to the next.

Figure 5.18: One-link manipulator.

This dynamic system is chosen over more complex systems because lookup tables are used to store the Q values and the other mappings. It would not have been possible to obtain the chosen level of fineness of quantization with more state variables. The fineness of quantization has priority over the number of state variables because the quantization errors should not be present in a learning system using approximators.

The optimality criterion is defined by Eq. (4.39) and by the form of the immediate costs. The immediate costs for a quantized state are calculated from the values of the state variables:

    c(s_0, u_0) = 0.8 |θ − π| + 0.2 |θ̇| ,

where s_0 is the quantized state obtained from θ and θ̇. The immediate cost specifies {θ = π, θ̇ = 0} as the goal state.

The approximation errors are simulated using the method described in Table 5.18, with the difference that the core grid here has dimension 32 x 33 x 32 x 33 and the super grid has dimension 32 x 35 x 32 x 35. The two θ variables do not require 2 more divisions in the super grid because angles wrap around the unit circle.

At various levels of error magnitude, the control laws of each of the 3 algorithms tested degrade. One measure of the degradation is the fraction of the μ(s)'s, out of all μ(s)'s in a control law, that are different from the error-free control law.
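To make both the simulation setup and this degradation measure concrete, here is a minimal sketch (Python/NumPy; it is not from the thesis). The Euler integration step, the assumed angular-velocity range used for quantization, and the flat Q[s, u] indexing (which ignores the end-state specifier carried by the Q's in the proposed algorithms) are my own simplifications.

import math
import numpy as np

G, M, L = 9.81, 1.0, 1.0              # gravity (m/s^2), end mass (kg), arm length (m)
DT = 1.0 / 12.5                        # control period: controls change at 12.5 Hz
N_TH, N_THDOT = 32, 33                 # quantization of theta and theta-dot
THDOT_MAX = 10.0                       # assumed velocity range (rad/s); not stated in the thesis
TORQUES = (7.5, -7.5)                  # the two admissible torques (N·m)

def step(theta, theta_dot, torque, substeps=10):
    # Euler integration of M L^2 theta'' = T - M G L sin(theta) over one control period.
    h = DT / substeps
    for _ in range(substeps):
        theta_dot += h * (torque - M * G * L * math.sin(theta)) / (M * L * L)
        theta = (theta + h * theta_dot) % (2.0 * math.pi)   # the angle wraps around the circle
    return theta, theta_dot

def quantize(theta, theta_dot):
    # Map the continuous state to one of the 32 x 33 discrete states.
    i = int(theta / (2.0 * math.pi) * N_TH) % N_TH
    v = min(max(theta_dot, -THDOT_MAX), THDOT_MAX)
    j = min(int((v + THDOT_MAX) / (2.0 * THDOT_MAX) * N_THDOT), N_THDOT - 1)
    return i * N_THDOT + j

def immediate_cost(theta, theta_dot):
    # c(s0, u0) = 0.8 |theta - pi| + 0.2 |theta-dot|; the goal is {theta = pi, theta-dot = 0}.
    return 0.8 * abs(theta - math.pi) + 0.2 * abs(theta_dot)

def fractional_degradation(q_noisy, q_clean):
    # Fraction of discrete states whose greedy (cost-minimizing) action differs
    # from the error-free control law -- the measure reported in Tables 5.19 to 5.21.
    return float(np.mean(np.argmin(q_noisy, axis=1) != np.argmin(q_clean, axis=1)))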
Each fractional degradation value is the result of one simulation. Table 5.19 shows this fractional degradation for various values of the maximum fractional error magnitude, ε, using zero-mean e uniformly distributed between −1 and 1.

ε       DA        Alg. 2    SQL
0.05    0.0019    0.0028    0.0038
0.10    0.0114    0.0028    0.0028
0.15    0.0265    0.0095    0.0047
0.20    0.0398    0.0294    0.0057
0.25    0.0417    0.0294    0.0028
0.30    0.0398    0.0369    0.0076
0.35    0.0938    0.0492    0.0085
0.40    0.0852    0.0369    0.0142
0.45    0.0739    0.0616    0.0208
0.50    0.1203    0.0568    0.0303

Table 5.19: Fractional degradation of the control policy with zero-mean e.

Table 5.20 shows this fractional degradation for various values of the maximum fractional error magnitude, ε, using positively skewed e uniformly distributed between −0.5 and 1.

ε       DA        Alg. 2    SQL
0.05    0.0028    0.0057    0.0530
0.10    0.0104    0.0085    0.1042
0.15    0.0170    0.0114    0.1345
0.20    0.0237    0.0133    0.1572
0.25    0.0417    0.0360    0.1837
0.30    0.0682    0.0303    0.2074
0.35    0.0777    0.0407    0.2282
0.40    0.0625    0.0464    0.2491
0.45    0.0919    0.0871    0.2623
0.50    0.1004    0.0625    0.5104

Table 5.20: Fractional degradation of the control policy with positively skewed e.

Table 5.21 shows this fractional degradation for various values of the maximum fractional error magnitude, ε, using negatively skewed e uniformly distributed between −1 and 0.5.

ε       DA        Alg. 2    SQL
0.05    0.0019    0.0019    0.0057
0.10    0.0076    0.0057    0.0104
0.15    0.0237    0.0123    0.0275
0.20    0.0199    0.0199    0.0616
0.25    0.0540    0.0275    0.5104
0.30    0.0275    0.0331    0.0919
0.35    0.0568    0.0341    0.1193
0.40    0.0407    0.0511    0.1780
0.45    0.0606    0.0634    0.2367
0.50    0.0947    0.0748    0.3580

Table 5.21: Fractional degradation of the control policy with negatively skewed e.

Tables 5.19, 5.20, and 5.21 show that the SQL algorithm is more variable with respect to the characteristics of the approximation errors than Algorithm 2 and the Doubling
F igu res 5.22, 5.23, and 5.24 are for the case of p o s i t i v e l y skewed e. F igu re s 5.25, 5.26, and 5.27 are for the case of nega t ive ly skewed e. E a c h g r a p h i n each figure shows 4 t ra jector ies , one for a different i n i t i a l state. T h e top left g r a p h i n each figure shows the error-free state t ra jec tor ies . T h e 4 s tar t states tes ted are {0 r a d , 0 r a d / s } , { | r a d , 0 r a d / s } , {7r r a d , 0 r a d / s } , a n d r a d , 0 r a d / s } . If the c o n t r o l l aw is a near m i r r o r image i n te rms of s ign of the er ror free con t ro l l a w , the signs of the c o n t r o l ac t ions are i nve r t ed so the results w o u l d not m i s i n t e r p r e t e d . T h e f r ac t iona l degrada t ion figures and the con t ro l pe r fo rmance tests show tha t the p roposed a l g o r i t h m s , the D o u b l i n g A l g o r i t h m and A l g o r i t h m 2, are o n average more error to le ran t a n d less error sensi t ive t h a n a parent of One-S tep Q - L e a r n i n g a l g o r i t h m , the S y n c h r o n o u s Q - L e a r n i n g a l g o r i t h m . Chapter 5. Evaluation of Proposed Algorithms 135 F i g u r e 5.19: S ta te t ra jector ies for the D o u b l i n g A l g o r i t h m for ze ro -mean e. T h e large number s be low the graphs i n d i c a t e e'M, the m a x i m u m f rac t iona l error . Chapter 5. Evaluation of Proposed Algorithms 136 Figure 5.20: State trajectory for Algorithm 2 for zero-mean e. The large numbers below the graphs indicate e'M, the maximum fractional error. Chapter 5. Evaluation of Proposed Algorithms 137 F i g u r e 5.21: S ta te t r a jec to ry for the Synchronous Q - L e a r n i n g a l g o r i t h m for zero-mean e. T h e large number s be low the graphs ind ica t e e'M, the m a x i m u m f rac t iona l error. Chapter 5. Evaluation of Proposed Algorithms 138 F i g u r e 5.22: Sta te t ra jector ies for the D o u b l i n g A l g o r i t h m for p o s i t i v e l y skewed e. T h e large n u m b e r s be low the graphs ind ica t e e'M, the m a x i m u m f r ac t i ona l error . Chapter 5. Evaluation of Proposed Algorithms 139 F i g u r e 5.23: S ta te t r a jec to ry for A l g o r i t h m 2 for p o s i t i v e l y skewed e. T h e large numbers be low the graphs i n d i c a t e e'M, the m a x i m u m f rac t iona l error . Chapter 5. Evaluation of Proposed Algorithms 140 0.15 S , e p s 0.35 F i g u r e 5.24: S ta te t r a jec to ry for the Synchronous Q - L e a r n i n g a l g o r i t h m for pos i t i ve ly skewed e. T h e large number s be low the graphs i n d i c a t e s'M, the m a x i m u m f rac t iona l error . Chapter 5. Evaluation of Proposed Algorithms 141 F i g u r e 5.25: Sta te t ra jector ies for the D o u b l i n g A l g o r i t h m for nega t ive ly skewed e. T h e large n u m b e r s be low the graphs ind ica t e e'M, the m a x i m u m f rac t iona l error . Chapter 5. Evaluation of Proposed Algorithms 142 F i g u r e 5.26: Sta te t r a jec to ry for A l g o r i t h m 2 for nega t ive ly skewed e. T h e large numbers be low the graphs i n d i c a t e e'M, the m a x i m u m f rac t iona l error. Chapter 5. Evaluation of Proposed Algorithms 143 F i g u r e 5.27: S ta te t r a jec to ry for the Synchronous Q - L e a r n i n g a l g o r i t h m for nega t ive ly skewed e. T h e large number s be low the graphs i nd i ca t e e'M, t he m a x i m u m f rac t iona l error . Chapter 5. 
It is also shown that the relative performance of these algorithms depends on the error characteristics of the approximator. My speculation, apparently supported by the simulation results in this section, is that approximators which exhibit relatively zero-mean errors might favor the SQL algorithm, because that algorithm offers more opportunity for error cancellation, even after skewing by the minimization operator. By requiring far fewer linkup cycles, the Doubling Algorithm and Algorithm 2 tend to preserve the skewness of the errors by re-using linkups repeatedly. My speculation, again apparently supported by the simulation results in this section, is that if the error characteristics of the approximators are skewed, the proposed algorithms would be favored relative to the SQL algorithm.

5.2 On the Algorithms for Stochastic Systems

In this section, simulation results are presented for the ALDP algorithm, comparing it to the TD(λ) algorithm and the QL algorithm. The apparent advantage of the ALDP algorithm is that it is faster than the QL algorithm and produces Q values with greater orderliness than the TD(λ) algorithm; the meaning of orderliness will be clarified later. General results will also be presented for the Stochastic Doubling Algorithm presented in Chapter 4. It apparently does not have an obvious advantage over a synchronous DP algorithm.

5.2.1 Numerical demonstration of correctness

The purpose of the simulation in this section is to demonstrate numerically the correctness of Eq. (4.45), reproduced here for convenience:

    f(i, l, n) = Σ_{s_{n_a}} P_{i s_{n_a}} ( f(i, s_{n_a}, n_a) + γ^{n_a} f(s_{n_a}, l, n_b) ) ,

where P_{i s_{n_a}} is the probability of occupying state s_{n_a} after n_a steps from state i under the policy, and n = n_a + n_b.

Figure 5.28: State transitions in the 7 x 7 stochastic simulation (left diagram: transitions taken with probability 0.8; right diagram: transitions taken with probability 0.2).

The simulation is on a stochastic system having only one control policy for all time steps. Each state has at least two possible subsequent states, each with a certain probability. The simulation uses the SDA described in Section 4.5 to build the horizon up to a specified value. This algorithm implements the above equation in its minimal form. If this algorithm functions correctly compared to a standard synchronous DP algorithm applied to the same system, it implies that the equation is correct for the tested cases.

Many systems fitting the above description have been tested. To facilitate presentation and third-party reproduction of the results, a relatively small system is chosen for illustration, described by Figure 5.28. In Figure 5.28, the left hand side diagram shows the state transitions with transition probability 0.8 and the right hand side diagram those with probability 0.2.
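To convey the structure being tested, here is a minimal sketch (Python/NumPy; my own naming) of one-policy evaluation by doubling linkups, checked against a standard synchronous DP sweep. It deliberately ignores the end-state specifier and the D-record bookkeeping of the full SDA; the linkup line in the first function is the indefinite-end-state specialization of Eq. (4.45).

import numpy as np

def doubling_policy_evaluation(P1, c, gamma, final_horizon):
    # P1[i, j]: one-step transition probability under the policy; c[i]: immediate cost.
    # Returns the expected discounted cost over final_horizon steps (a power of 2).
    f = c.copy()                      # horizon-1 values
    Pn = P1.copy()                    # n-step transition probabilities
    gamma_n = gamma
    n = 1
    while n < final_horizon:
        # Linkup: f_{2n}(i) = f_n(i) + gamma^n * sum_s Pn[i, s] * f_n(s)
        f = f + gamma_n * (Pn @ f)
        Pn = Pn @ Pn                  # 2n-step transition probabilities
        gamma_n = gamma_n * gamma_n
        n *= 2
    return f

def synchronous_policy_evaluation(P1, c, gamma, final_horizon):
    # Standard synchronous DP: one backup per horizon increment.
    f = np.zeros(len(c))
    for _ in range(final_horizon):
        f = c + gamma * (P1 @ f)
    return f

# Tiny check on a random 5-state system: both routes should agree to rounding.
rng = np.random.default_rng(0)
P1 = rng.random((5, 5)); P1 /= P1.sum(axis=1, keepdims=True)
c = rng.random(5)
print(np.allclose(doubling_policy_evaluation(P1, c, 0.99, 128),
                  synchronous_policy_evaluation(P1, c, 0.99, 128)))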
W i t h these k i n d s of l o c a l state t r ans i t ions a n d the h i g h bias i n the t r a n s i t i o n p robab i l i t i e s , the s y s t e m behaves l ike robo t i c m a n i p u l a t o r s i n tha t g iven the current state a n d current con t ro l , the robot has a h igh p r o b a b i l i t y of m o v i n g to a defini te next state. In o ther words , the ac t ions have a s t rong bu t not t o t a l con t ro l over the state changes. T h e t h i c k arrows i n the left d i a g r a m Chapter 5. Evaluation of Proposed Algorithms 146 7 -6 # of State Transitions (x 1000) F i g u r e 5.29: / - v a l u e s as a func t ion of n u m b e r of state t rans i t ions for each state i n F i g -ure 5.18 w h i l e l e a r n i n g under the O n e P o l i c y D P a l g o r i t h m . i n d i c a t e a t r a n s i t i o n cost of 2. A l l other t rans i t ions have a cost of 1. T h e f ina l desired h o r i z o n is 128. T h e d iscount factor 7 is 0.99. A l l c o m p u t a t i o n a n d storage o f the values use doub le p rec i s ion . T a b l e 5.22 shows that the / - v a l u e s generated by the S D P a l g o r i t h m and the S D A are i d e n t i c a l . T h e same is t rue of a large n u m b e r of s imu la t ions for different sys tems and s i m u l a t i o n cond i t ions . 5.2.2 Comparison of orderliness In Sec t ion 5.1, the focus is on the a c c u m u l a t i o n of a p p r o x i m a t i o n errors . F o r the algo-r i t h m s o n s tochas t ic sys tems, the focus is shif ted to a q u a l i t a t i v e measure t ha t can be a p p r o p r i a t e l y ca l l ed order l iness . T h e concern w i t h errors is she lved a n d the reason is p a r t i a l l y e x p l a i n e d by Sec t ion 5.2.3 w h i c h shows w h y off-line D P or Q L for s tochast ic sys tems is not p r a c t i c a l . Chapter 5. Evaluation of Proposed Algorithms 147 S ta te S D P S D A x = l , u = 1 50.17423 50.17423 x = l , u = 2 49.89191 49.89191 x = 1,2/ = 3 49.53544 49.53544 x = 1,2/ = 4 49.08536 49.08536 x = 1,2/ = 5 49.08536 49.08536 x = 1,2/ = 6 49.08524 49.08524 x = 1,2/ = 7 49.49901 49.49901 x = 2,2/ = 1 50.05454 50.05454 x = 2,2/ = 2 49.74674 49.74674 a; = 2,2/ = 3 49.40497 49.40497 x = 2,2/ = 4 49.08596 49.08596 x = 2, y = 5 48.68226 48.68226 x = 2, j / = 6 49.18237 49.18237 x = 2,2/ = 7 49.56323 49.56323 x = 3,2/ = 1 50.14771 50.14771 x = 3,2/ = 2 49.72291 49.72291 x = 3,y = 3 49.40751 49.40751 x = 3,t/ = 4 49.08902 49.08902 x = 3,2/ = 5 49.10445 49.10445 x = 3 , j / = 6 49.18359 49.18359 x = 3,2/ = 7 49.57594 49.57594 x = 4 ,y = 1 50.48382 50.48382 x = 4, j / = 2 50.28877 50.28877 x = 4 ,y = 3 49.91634 49.91634 x = 4,2/ = 4 49.56194 49.56194 T a b l e 5.22: / - v a l u e s S ta te S D P S D A x = 4,2/ = 5 49.42033 49.42033 x = 4, j / = 6 49.56369 49.56369 x = 4,2/ = 7 50.09675 50.09675 x = 5,2/ = 1 50.95069 50.95069 x = 5, y — 2 50.59444 50.59444 x = 5, y — 3 50.22748 50.22748 x = 5, y = 4 49.76415 49.76415 x = 5,2/ = 5 49.80419 49.80419 x = 5, j / = 6 49.91299 49.91299 x = 5,2/ = 7 50.61930 50.61930 x = 6,2/ = 1 51.17789 51.17789 x = 6, y = 2 50.85817 50.85817 x = 6, y — 3 50.52671 50.52671 x = 6,2/ = 4 50.10820 50.10820 x = 6,y — 5 50.10820 50.10820 x = 6, y = 6 50.21505 50.21505 x = 6,2/ = 7 51.17519 51.17519 x = 7,2/ = 1 51.33838 51.33838 x = 7,2/ = 2 51.00401 51.00401 x = 7,2/ = 3 50.66176 50.66176 x = 7,2/ = 4 50.41711 50.41711 x = 7,2/ = 5 50.41711 50.41711 x = 7,2/ = 6 51.89252 51.89252 x = 7,2/ = 7 53.29184 53.29184 for the test s y s t e m . Chapter 5. 
Figure 5.30: f-values as a function of number of state transitions for each state in Figure 5.18 while learning under the TD(λ) algorithm.

In incremental DP, the actual point values of the f-functions are less important than their values relative to one another. In several methods presented in Chapter 2, the f-values are used to indirectly select the optimal actions. During learning, the f-values typically increase slowly toward asymptotes. With an on-line algorithm, the actor mapping, such as the ASE mapping in the ACE/ASE system [8], would learn concurrently as the f-function learns. Let us define relative order as simply the order of the f-value of a state relative to the f-values of other states. In order for the action function to learn from consistent information, it is important that the relative order of the f-values be maintained as the number of state transitions increases. Furthermore, the relative order should be maintained as time goes toward infinity, because the actor function is learning at all times unless some additional mechanism is used to stop the learning in certain situations. The effective maintenance of relative order is defined here by the term orderliness.
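Orderliness is assessed qualitatively from Figures 5.29 to 5.31, but it can also be quantified. One simple possibility (my own construction, not from the thesis) is to record, at each checkpoint during learning, the fraction of state pairs whose f-values are already in the same relative order as in a reference f-function such as the converged values:

import numpy as np
from itertools import combinations

def pairwise_order_agreement(f_now, f_reference):
    # Fraction of state pairs (i, j) whose relative order of f-values matches the
    # reference ordering; 1.0 means the relative order is fully maintained.
    agree, total = 0, 0
    for i, j in combinations(range(len(f_now)), 2):
        total += 1
        if (f_now[i] - f_now[j]) * (f_reference[i] - f_reference[j]) > 0:
            agree += 1
    return agree / total

# Example with arbitrary illustrative values:
f_final = np.array([50.2, 49.9, 49.5, 49.1, 49.4])
f_snapshot = np.array([10.1, 9.9, 9.6, 9.2, 9.7])
print(pairwise_order_agreement(f_snapshot, f_final))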
Evaluation of Proposed Algorithms 150 A large n u m b e r of s imu la t ions are done, i t e r a t i n g over different values of the pre-c i r c u i t l eng th for the A L D P a l g o r i t h m or over different va lue of m for the T D ( A ) algo-r i t h m , a was f ixed for the O P D P a l g o r i t h m . F i g u r e 5.29 shows the resul t for the O P D P a l g o r i t h m . T h i s a l g o r i t h m produces w e l l o rdered state-to-cost m a p p i n g s qu i te early, bu t the costs take ve ry long to approach the correct values. F i g u r e 5.30 shows the best q u a l i t a t i v e result for the T D ( A ) a l g o r i t h m , w h i c h is for m = 1. T h e T D ( A ) a l g o r i t h m takes the cost to the ne ighbourhood of the correct values q u i c k l y , bu t the m a p p i n g s are not wel l -ordered . F i g u r e 5.31 shows the best q u a l i t a t i v e result for the A L D P a l g o r i t h m , w h i c h is for a p re -c i r cu i t l eng th of 20. T h e A L D P a l g o r i t h m produces w e l l o rdered state-to-cost m a p p i n g s q u i c k l y . O n e P o l i c y D P is s ign i f ican t ly slower t h a n the o ther two types of a l go r i t hms . In th is thesis , the order l iness of regular Q L ( w h i c h o p t i m i z e s the con t ro l p o l i c y as wel l ) is not covered . If regular Q - L e a r n i n g has good order l iness , i t m a y be able to achieve the ove ra l l ob j ec t ive of f i nd ing the o p t i m a l con t ro l p o l i c y faster t h a n the o ther two w i t h gradients . It s h o u l d be sa id tha t 2 stage l ea rn ing a lgo r i t hms l i ke T D ( A ) has achieved greater success i n m a n y app l i ca t i ons t h a n Q - L e a r n i n g , as m e n t i o n e d i n C h a p t e r 2. T h e reason .may have to do w i t h error to lerance and w i t h orderl iness . T h e con t r a ry specu la t ion is tha t regular Q L m a y be m u c h slower t h a n O n e P o l i c y D P such tha t regular Q L is not bet ter . R e g u l a r Q L assumes g loba l m i n i m i z a t i o n w h i c h is ve ry cost ly . If o n l y l o c a l m i n i m i z a t i o n is done, then regular Q L is d o i n g m i n i m i z a t i o n m u c h m o r e i n c r e m e n t a l l y t h a n E q . (3.29) and E q . (3.30) suggest. H o w T D ( A ) funct ions i n a A C E / A S E scheme is w e l l k n o w n (see C h a p t e r 2) . T h e A L D P a l g o r i t h m c a n also func t i on i n the same way. T h i s a l g o r i t h m w o u l d s i m p l y replace T D ( A ) i n the a lgo r i t hms descr ibed i n the sections under Sect ions 2.3.5 a n d 2.4.4. T h e Chapter 5.. Evaluation of Proposed Algorithms 151 defaul t m i n i m i z a t i o n is l oca l bu t there is n o t h i n g to prevent i t f r o m b e i n g g l o b a l . In c o m p a r i n g the 3 a lgo r i t hms , i t seems the A L D P a l g o r i t h m combines the best of b o t h wor lds of the O P D P a l g o r i t h m and the TD(A) a l g o r i t h m by offering fast a n d o rde r ly b u i l d u p of the / - v a l u e s . These are sufficient reasons to t ake o n the increased c o m p l e x i t y of the a l g o r i t h m . 5.2.3 A natural but flawed algorithm A s m e n t i o n e d i n Sec t ion 4.5, the one p o l i c y l i n k u p re la t ion po in t s t o w a r d the S tochas t i c D o u b l i n g A l g o r i t h m . T h i s a l g o r i t h m is used to demons t ra te the correctness of the r e l a t i on i n Sec t i on 5.2.1. 
In the A C E / A S E scheme, the synchronous and asynchronous a lgo r i t hms w o u l d be used for t he same purpose , bu t w i t h different d a t a s t ruc tures . In s imu la t i ons t a k i n g a p p r o x i m a t i o n errors i n to account , a p p r o x i m a t i o n errors co r rup t t w o i m p o r t a n t m a p p i n g s : the / - m a p p i n g and the Pitlna m a p p i n g . T h e errors w i t h i n each l i n k u p stage i n these 2 m a p p i n g s are c o m b i n e d i n Step 2 of T a b l e 4 . 9 , b y s u m m a t i o n a n d m u l t i p l i c a t i o n . T h e a c c u m u l a t i o n of errors by s u m m a t i o n and m u l t i p l i c a t i o n has a s ign i f ican t ly worse effect t h a n a c c u m u l a t i o n by s u m m a t i o n alone, as i n the case of Q -L e a r n i n g . T h i s flaw is enough to negate other advantages th is n a t u r a l a l g o r i t h m m a y offer w h e n i t is a p p l i e d i n rea l con t ro l p rob lems . O t h e r a lgo r i thms w h i c h use a p p r o x i m a t i o n of b o t h the cost and p r o b a b i l i t y m a p p i n g s w o u l d have to deal w i t h th is issue also. T h i s is w h y the m o r e c o m p l e x vers ion descr ibed i n Sec t i on 4.4 a n d tes ted i n the last sec t ion is used. T h e purpose of the finite l eng th Z)-record is to t ake account of m u l t i -s tep t r a n s i t i o n p robab i l i t i e s w i t h o u t a p p r o x i m a t i n g t h e m . O f course, a great dea l of i n f o r m a t i o n o n t r a n s i t i o n p robab i l i t i e s is d i sca rded , bu t th is is appa ren t l y necessary. Chapter 5. Evaluation of Proposed Algorithms 152 5.3 On the Goal Change Algorithms In Sec t ion 4.7.1 a n d 4.7.4, goal state r emova l and a d d i t i o n a lgo r i t hms are p roposed for s ingle goa l states a n d m u l t i p l e goal states. He re the correctness and increased efficiency are demons t r a t ed . W e w i l l beg in w i t h the correctness of the r emova l a n d a d d i t i o n a l -g o r i t h m s . T h e n we w i l l cover the increased efficiency of the a d d i t i o n a lgo r i t hms . T h e efficiency of the r e m o v a l a lgo r i t hms w i l l not be covered because, as discussed i n Sec-t i o n 4.7.1 a n d 4.7.4, i t is not necessary to pe r fo rm t h e m to effect goa l changes. 5.3.1 The correctness of the goal change algorithms T h e correctness of the a d d i t i o n pass for s ingle goal state, as desc r ibed i n T a b l e 4 .11, is a rgued here. T h e correctness of the r emova l pass for s ingle goal state is nea r ly i d e n t i c a l to th is a rgumen t and w i l l not be repeated. A t cyc l e 1, i t is c lear tha t the e l ig ib le start state is sg, the state i n Ss after the c o m p l e t i o n of S tep 2, a n d the e l ig ib le end states are the states i n Se after the c o m p l e t i o n of S tep 2. T h e Q ' s at. the end of S tep 2 are 1 t i m e step l o n g and the Q ' s i n at the end of the first cyc l e t h r o u g h Steps 3 to 6 are 2 t i m e steps long . T h e Q values a l te red i n S tep 2 can e i ther act i n Steps 4 or 5 as the p re -c i rcu i t Q ' s or the pos t - c i r cu i t Q ' s . If t hey act as the p re -c i r cu i t Q ' s , t h e n the o n l y 2 t i m e step Q ' s affected by the a d d i t i o n of the r eward c o m p o n e n t to sg are those tha t beg in w i t h the states i n Sa and end i n states tha t are reachable f r o m the states i n Ss i n 2 t i m e steps. 
If they act as the post-circuit Q's, then the only 2-time-step Q's affected by the addition of the reward component to s_g are those that end with the states in S_e and begin with the states which can reach the states in S_e in 2 time steps.

We can test whether a 2-time-step Q is affected by seeing if its original value is greater than a new Q computed using the affected 1-time-step Q's from Step 2. The greater-than comparison is valid because any change due to the reward component in s_g must decrease the value of the new Q. If the tests in Steps 4 and 5 are true, then we know that the start states and end states of the affected 2-time-step Q's must be added to S_s and S_e for the next cycle, respectively. Pairs of start and end states are added to P, which merely keeps track of the association of start and end states for the purpose of narrowing the domain of the outer minimization in the next cycle. Using the new Q's, the new S_s, the new S_e, and the new P, Steps 4 and 5 are repeated in the next cycle. The logic of these cycles is identical to that of the first cycle through Steps 3 to 6 described above.

5.3.2 Increased efficiency of the goal change algorithms

Let φ be the maximum number of states reachable from any state s ∈ S in 1 time step. Let ψ be the maximum number of states that can reach any particular state s ∈ S in 1 time step. Let Φ be the total number of states in S. Let Φ_s be the number of states in S_s. Let Φ_e be the number of states in S_e.

Let us start with the algorithm described in Table 4.11. The numbers of states in the sets S_s and S_e increase in this sequence:

Initialization:  Φ_s = 1,                       Φ_e = φ.
Cycle 1:         Φ_s = 1 + ψ,                   Φ_e = φ(1 + ψ).
Cycle 2:         Φ_s = (1 + ψ)^2,               Φ_e = φ(1 + ψ)^2.
Cycle n:         Φ_s = (1 + ψ)^{2^{n−1}},        Φ_e = φ(1 + ψ)^{2^{n−1}}.

These are worst case or maximum numbers, in that we are ignoring the overlaps between the states reached from a set of states. Another factor by which the addition pass reduces the amount of computation is that, within each linkup operation (Step 4 or Step 5, whatever the case may be) in the addition pass, the comparison test is less intensive than a full linkup operation as shown in Step 1 of Table 4.5, because the outer minimization is over S_s or S_e, both subsets of S.

The increased efficiency comes partly from the smaller number of linkup operations. In a full execution of the Deterministic Doubling Algorithm, the number of linkup operations performed in each linkup cycle is Φ². If we perform this algorithm without goal states, as discussed in Sections 4.7.1 and 4.7.4, the minimum number of linkup operations not performed (or savings) is, after some algebra, the following:

    Σ_{i=1}^{n} [ (Φ − min(Φ, (1 + ψ)^{2^{i−1}})) Φ + min(Φ, (1 + ψ)^{2^{i−1}}) (Φ − min(Φ, φ(1 + ψ)^{2^{i−1}})) ] ,

where n is the number of linkup cycles performed.
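The expression above is straightforward to evaluate. A small sketch (Python; my own naming, with the number of linkup cycles passed in explicitly) is given below for the kind of setting discussed next (Φ = 1056, ψ = 10).

def addition_pass_savings(total_states, phi, psi, cycles):
    # Worst-case count of linkup operations avoided by the single-goal addition pass,
    # summed over the linkup cycles; total_states is Phi, phi the maximum out-degree,
    # psi the maximum in-degree.
    saved = 0
    for i in range(1, cycles + 1):
        grow = (1 + psi) ** (2 ** (i - 1))
        phi_s = min(total_states, grow)            # eligible start states in cycle i
        phi_e = min(total_states, phi * grow)      # eligible end states in cycle i
        saved += (total_states - phi_s) * total_states + phi_s * (total_states - phi_e)
    return saved

# Expressed as a fraction of the Phi^2 linkup operations in one full linkup cycle:
Phi = 1056
print(addition_pass_savings(Phi, phi=2, psi=10, cycles=7) / Phi ** 2)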
Table 5.23 shows numerically the minimum number of linkup operations not performed for some plausible systems. In this table, Φ = 1056, which is the number of states in the one-link manipulator simulation, and ψ = 10, which is a reasonable number for the same simulation. The first column of the table shows the number of linkup cycles, and the first row shows φ. In the one-link manipulator simulation, the number of linkup cycles is 7 and φ is equal to 2. According to Table 5.23, at φ of 2 and 7 linkup cycles, the minimum savings, expressed as a fraction of the linkup operations in one linkup cycle, is roughly 2. This means that, in the number of linkup operations alone, the addition pass for a single goal state effectively eliminates at least 2 out of 7 linkup cycles.

The analysis of efficiency for the multiple goal state case is a simple extension of the above. The minimum number of linkup operations not performed is the following:

    Σ_{i=1}^{n} [ (Φ − min(Φ, m(1 + ψ)^{2^{i−1}})) Φ + min(Φ, m(1 + ψ)^{2^{i−1}}) (Φ − min(Φ, mφ(1 + ψ)^{2^{i−1}})) ]
T h e sy s t em s i m u l a t i o n is a s m a l l 2 d i m e n s i o n a l sys t em of abst ract states w i t h ne ighborhood proper t ies . A large n u m b e r of r a n d o m l y generated sys t em d y n a m i c s were tested to y i e l d a measure of the a c c u m u l a t e d a p p r o x i m a t i o n errors over a l l states. It is found tha t the Chapter 5. Evaluation of Proposed Algorithms 156 l a t t e r 2 a lgo r i t hms are subs t an t i a l ly be t ter behaved i n the m e a n and the var iance . A l -t h o u g h the abs t rac t sy s t em gives us the a b i l i t y to test a l a rge n u m b e r of poss ib le sy s t em d y n a m i c s , an eas i ly in te rpre tab le sys tem is needed to v i sua l i ze the i m p a c t of the a c c u m u -l a t ed a p p r o x i m a t i o n errors . T h e one- l ink m a n i p u l a t o r is chosen for th i s task . In a series of c o n t r o l l aw degrada t ion and con t ro l per formance s imu la t ions us ing aga in t e m p o r a l l y a n d s p a t i a l l y cor re la ted a p p r o x i m a t i o n errors, i t is found tha t the p roposed a lgo r i t hms are o n average m o r e to lerant of and less sensi t ive to the m a g n i t u d e of the source error e w h i c h genera ted the cor re la ted a p p r o x i m a t i o n errors. O n the s tochas t ic a lgo r i t hms , the focus is shifted f r o m a c c u m u l a t i o n of errors to a q u a l i t a t i v e measure ca l l ed order l iness . It is argued that order l iness is necessary for the effective l e a r n i n g of o p t i m a l s ta te- to-act ion m a p p i n g i n the ac to r - c r i t i c a l go r i t hms presented i n C h a p t e r 2. N u m e r i c a l demons t r a t i on of the correctness of the u n d e r l y i n g re-l a t i o n is presented first . T h e n the orderl iness of the O n e P o l i c y D P a l g o r i t h m , the T D ( A ) a l g o r i t h m , and the A L D P a l g o r i t h m are compared . It is found tha t the A L D P a l g o r i t h m e x h i b i t s the order l iness of the O n e P o l i c y D P a l g o r i t h m and the speed of l ea rn ing of the T D ( A ) a l g o r i t h m . T h e S D A , an a l g o r i t h m i n c l u d e d i n th is w o r k o n l y for d i scuss ion a n d for a n u m e r i c a l test, is shown to pe r fo rm p o o r l y w h e n func t ion a p p r o x i m a t i o n is used. T h i s resul t suggests tha t other a lgo r i thms w h i c h use b o t h cost and p r o b a b i l i t y m a p p i n g s w i l l be at a d isadvantage when used for large scale systems. O n the goal change a l g o r i t h m s , the correctness of the a d d i t i o n a lgo r i t hms is demon-s t ra ted a n d expressions for the m i n i m u m savings i n the n u m b e r of l i n k u p cycles are presented for a d d i t i o n a l g o r i t h m for single goal states and for m u l t i p l e goal states. Fo r one - l ink m a n i p u l a t o r p r o b l e m as s t ruc tu red i n th is chapter , the m i n i m u m theore t i ca l savings is at least 1.94 out of 7 l i n k u p cycles . T h e ac tua l savings w i l l be greater s ince a n u m b e r of factors w h i c h con t r ibu te to greater savings has not been t a k e n in to account . S ince the r e m o v a l a lgo r i t hms are near ly i den t i c a l to the a d d i t i o n a lgo r i t hms and they Chapter 5. Evaluation o f Proposed Algorithms 157 are not necessary i n the mos t c o m m o n goal change s i t u a t i o n , the deve lopment of the correctness and expressions for m i n i m u m savings has not been repeated for t h e m . 
Chapter 6 Conclusions and Future Challenges D y n a m i c P r o g r a m m i n g and Q - L e a r n i n g are general a l g o r i t h m types for a w i d e var ie ty of o p t i m a l con t ro l p rob l ems . Success i n a n u m b e r of app l i ca t ions have shown the i r u t i l i t i e s . T h e r e is also a subs tan t i a l n u m b e r of theore t ica l results w i t h respect to the convergence of a l go r i t hms of these types . Howeve r , there are some results showing the deleterious effects of a p p r o x i m a t i o n errors on l ea rn ing , bu t the au thor is not aware of any sat isfactory so lu t ions . D P a n d Q L are m a i n l y f o r m u l a t e d for l ea rn ing i n one t e m p o r a l reso lu t ion and one spa t i a l r e so lu t ion . A need exis ts for ex tens t ion of R L to hierarchies . R L a lgo r i t hms have been f o r m u l a t e d for a s ingle goa l as defined by the values of the i m m e d i a t e costs. A need exis ts for efficient m e t h o d s to change the goal o f a R L con t ro l l e r w i t h o u t r epea t ing the l e a r n i n g process i n i ts ent i re ty . 6.1 Conclusions In the process of e x t e n d i n g Q L to hierarchies , three L i n k u p R e l a t i o n s are de r ived . T h e y show new ways of c o m p u t i n g mod i f i ed forms of Q- and / - v a l u e s . T h e Re la t i ons use a new u p d a t e ope ra t i on ca l l ed l i n k u p , the n a m e i n d i c a t i n g the l i n k i n g up of the eva lua t ion of t w o fami l i es of t ra jector ies . T h e effect of these re la t ions is to enable new a lgor i thms w h i c h have grea t ly reduced n u m b e r of a p p r o x i m a t i o n cycles . T h e mos t general L i n k u p R e l a t i o n , E q . (3.37), is for a l e a rn ing s i t ua t i on s i m i l a r to tha t i n Q L , w h i c h consists o f a s tochas t ic sy s t em a n d m o r e t h a n one ac t ion poss ible 158 Chapter 6. Conclusions and Future Challenges 159 for each state. In a t t e m p t i n g to app ly th is L i n k u p R e l a t i o n , a n u m b e r of issues were d iscovered w h i c h show tha t th is is not p r a c t i c a l . These issues are va luab le h in t s on h o w h i e r a r c h i c a l l e a rn ing m u s t be done. T h e other two L i n k u p R e l a t i o n s , D e t e r m i n i s t i c L i n k u p R e l a t i o n ( E q . (4.42)) and O n e P o l i c y L i n k u p R e l a t i o n ( E q . (4.45)), are d e r i v e d i n response to the above discovery. It is found that l i n k u p s can be advantageous ly a p p l i e d i f the sy s t em is d e t e r m i n i s t i c or i f we are on ly eva lua t ing one p o l i c y at a t i m e , s i m i l a r tp the cond i t ions of l ea rn ing assumed by T D ( A ) a lgo r i thms . T h e mos t i m p o r t a n t a c c o m p l i s h m e n t is tha t for the d e t e r m i n i s t i c sys tems, two syn-chronous a lgo r i t hms based on the D e t e r m i n i s t i c L i n k u p R e l a t i o n are inven ted w h i c h achieve a large r e d u c t i o n o f a c c u m u l a t e d a p p r o x i m a t i o n errors over a representa t ive ver-s ion of Q L , the S Q L a l g o r i t h m . T h e ga in is ob ta ined w i t h o u t sacr i f ic ing o p t i m a l i t y . T h e r e d u c t i o n t r ans la t ed to be t ter con t ro l per formance and lesser con t ro l p o l i c y degrada t ion i n s i m u l a t i o n s of the one- l ink m a n i p u l a t o r i n the presence of var ious types of a p p r o x i m a -t i o n errors . 
It is also found that it is possible to change the goal of the controller using variations of the deterministic Linkup algorithms, without repeating the entire learning processes of these algorithms. These variations are collectively called the goal change algorithms. Expressions for the savings in using the goal change algorithms are presented, and they show significant savings for the one-link manipulator simulation described in Chapter 5.

Another accomplishment is that, for the one-policy, stochastic, asynchronous algorithm, a large improvement in orderliness is achieved over TD(λ) without the slowness of basic QL. Orderliness is the maintenance of the order of the f-values of the states during learning. In an ACE/ASE system, the ACE stores the f-values and the ASE learns the optimal actions based on the relative values of the states in the ACE. If the order of these values fluctuates during learning, the ASE would be given conflicting signals. It is therefore desirable that an algorithm produces high orderliness, not only during the early and intermediate periods of learning, but also during the final period.

In an attempt to devise an off-line, synchronous algorithm for one-policy, stochastic systems, it is found that, in such algorithms, the errors are multiplied not only between linkup cycles, but also within a linkup cycle. The errors in the approximation of the value function and of the transition probability mapping interact multiplicatively. The result is much worse than a one-policy version of QL. This is a convincing reason for not doing off-line QL or DP learning for a stochastic system, even for one policy only, if an accurate model of either the costs or the transition probabilities is not available.

Referring back to Schmidhuber's Subgoal idea, the algorithms proposed in this thesis are new methods of finding optimal subgoals for deterministic systems. The optimal subgoals are recorded during the buildup phases of these algorithms. The subgoals are the optimal intermediate states found by the inner and outer minimizations.

6.2 Limitation of the Linkup Approach

The Stochastic Linkup Relation is difficult to apply to a general QL situation where the system is stochastic and there is more than one possible action per state. The difficulties are described in Sections 4.2.2 to 4.2.5. Yet, this is perhaps a valuable discovery because it may hint at a fundamental nature of hierarchical learning.
This nature is that it is perhaps computationally impractical to try to find optimal control laws by taking into account every possible outcome at its correct probability and looking at all possible control laws simultaneously, when both trajectory sequences being linked together are at a higher time resolution than the time resolution of the control.

6.3 Discussion

In the RL literature, it is almost taken as fact that on-line, asynchronous algorithms are superior to synchronous algorithms. The result is that problems better suited to off-line, synchronous algorithms are not attended to sufficiently. One argument for this supposed superiority is that on-line, asynchronous algorithms do not require the complete set of state transition samples to obtain good results. It is also argued that the main disadvantage of synchronous algorithms is the difficulty of model building. In the case of stochastic systems, this thesis has found support for this argument, but the opposite conclusion is reached in the case of deterministic systems.

The distinct advantage found for synchronous LADDP on deterministic systems suggests that attention should be paid to synchronous algorithms again, particularly for the control of robots consisting of links and joints. In certain robotic problems, such as the problem of finding the optimal path for a robot through an area populated by obstacles and structured by walls, an analytical model of the problem is probably pointless to derive because we want the robot to be able to cope with new environments. The same is not true for many robotic problems of the links-and-joints variety. The dynamic and kinematic models of a particular robot of the latter type are relatively constant and often derivable, but the solution is difficult. An example is the walking machine. In this problem, the dynamics have largely been worked out, but there is no control law which can cause the machine to walk in a fault-tolerant fashion for a wide variety of start and end positions. This is where off-line, synchronous algorithms for deterministic systems have an advantage over on-line, asynchronous algorithms for stochastic systems, because the former are more tolerant of approximation errors and better suited to the problem.

On the question of hierarchies, LADDP is directly relevant to the chaining of macros. A macro is like a control sequence, except that it is parameterized to increase its generality. The aforementioned difficulty of applying LADDP to stochastic systems may mean that, in a hierarchy, the system being controlled must be assumed to be deterministic. If the system is in fact stochastic, doing so may mean that the system would not be controlled optimally. Perhaps this optimality cannot be achieved in practice in many large-scale systems.

The method of building a macro covering a longer time period is roughly as follows. Each macro has a domain of start states and a range of end states. The new and larger macro would be one with the domain of start states of the first macro and the range of end states of the second macro. The linkup is at the optimal intermediate states which are determined by the two minimization operations. The inner minimization is with respect to the action of the second macro and the outer is with respect to the intersecting states of the two smaller macros. This is a speculation of how linkup can be used in a hierarchy; the precise mechanisms are a large problem which is beyond the scope of this thesis.
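As a purely illustrative sketch of this chaining step, the fragment below represents each macro simply as a table of best costs between start and end states; the tabular representation, the names, and the discounting over the first macro's horizon are assumptions made for illustration only, not the parameterized macros envisioned above, and the inner minimization over the second macro's actions is assumed to be already folded into its table.

# Illustrative only: a "macro" is a dict mapping (start_state, end_state) to
# (best_cost, first_action); the second macro's costs already reflect the
# inner minimization over its own actions.
def chain_macros(macro1, macro2, gamma_T=1.0):
    chained = {}
    for (s, m), (c1, a1) in macro1.items():          # m is a candidate intermediate state
        for (m2, e), (c2, _) in macro2.items():
            if m2 != m:
                continue
            total = c1 + gamma_T * c2                # discount over the first macro's horizon
            key = (s, e)
            # outer minimization over the intersecting (intermediate) states
            if key not in chained or total < chained[key][0]:
                chained[key] = (total, a1, m)        # record the optimal subgoal m
    return chained

# Example with hypothetical states 0..3: chain two short macros into a longer one.
m1 = {(0, 1): (1.0, "a"), (0, 2): (0.5, "b")}
m2 = {(1, 3): (0.2, "c"), (2, 3): (1.0, "d")}
print(chain_macros(m1, m2))   # {(0, 3): (1.2, 'a', 1)}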
6.4 Future Challenges

A number of suggestions to further this work can be made:

1. Continue to push for wider application of the Stochastic Linkup Relation.

2. Come to a conclusion about the practicality of trying to find optimal control laws in the lowest time resolution when the system is stochastic and the trajectories to be linked are in coarser time resolutions. If it is found to be impractical, a different way of handling probabilistic outcomes should be devised, perhaps by using control laws in the same time resolution as the trajectories to be linked. It would probably mean giving up optimality as defined for the lowest time resolution.

3. Find more theoretical results about the error accumulation behaviours of SQL, Algorithm 2, the Doubling Algorithm, and other competing algorithms, in particular the mean accumulated fractional error.

4. The obstacle to the application of LADDP to a hierarchy is the intermediate layer(s), which interface the lower-level state and control representations to higher-level symbolic representations. If a general parameterized intermediate representation (or the form of the macros) is found, LADDP can provide the basis for the optimal linkups of these and higher-level representations.

5. An interesting speculation is whether the aggregated variable in LADDP can be something other than time, in addition to time.
There may be other variables which share some essential features with time which would render it possible to aggregate them. A most rewarding finding would be that spatial dimensions are such variables. This would partly solve the puzzle of learning with respect to different spatial resolutions in Albus' intelligence hierarchy [97]. Another such variable may be something like "exhaustion", which for a continuous period of work by a biological being would increase monotonically.

6. With the current abundance of research on approximators, they are able to handle an increasingly greater number of input variables. The time may be at hand, or is near, when substantial robotic control problems should be attempted. An economically valuable project would be the application of the proposed synchronous algorithms, combined with a powerful and accurate approximator, to a biped walking machine.

Bibliography

[1] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

[2] S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School, Hillsdale, NJ, December 1993. Lawrence Erlbaum.

[3] J. H. Schmidhuber and R. Wahnsiedler. Planning simple trajectories using neural subgoal generators. In The Second International Conference on Simulation of Adaptive Behavior (SAB92), 1992.

[4] M. L. Minsky. Theory of Neural-Analog Reinforcement Learning Systems and its Application to the Brain-Model Problem. PhD thesis, Princeton University, Princeton, NJ, 1954.

[5] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3:210-229, 1959.

[6] J. M. Mendel and R. W. McLaren. Reinforcement learning control and pattern recognition systems. In J. M. Mendel and K. S. Fu, editors, Adaptive, Learning and Pattern Recognition Systems: Theory and Application, pages 287-318. Academic Press, New York, 1970.

[7] D. Michie and R. A. Chambers. Boxes: An experiment in adaptive control. In E. Dale and D. Michie, editors, Machine Intelligence 2, pages 137-152. Oliver and Boyd, Edinburgh, 1968.

[8] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, SMC-13(5):834-846, 1983.

[9] C. W. Anderson. Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, Department of Computer and Information Science, University of Massachusetts, Amherst, MA, 1986.

[10] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.

[11] R. S. Sutton.
Integrating architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216-224, San Mateo, CA, 1990. Morgan Kaufmann.

[12] C. J. C. H. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.

[13] R. S. Sutton. Temporal credit assignment in reinforcement learning. COINS 84-02, Department of Computer and Information Science, University of Massachusetts, Amherst, MA, 1984.

[14] P. J. Werbos. Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3:179-189, 1990.

[15] P. J. Werbos. Backpropagation and neurocontrol: a review and prospectus. In Proceedings of the 1989 International Joint Conference on Neural Networks, pages I-209-I-215, New York, 1989. IEEE Press.

[16] D. A. White and M. I. Jordan. Optimal control: A foundation for intelligent control. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 185-214. Van Nostrand Reinhold, New York, 1992.

[17] A. P. Sage. Linear Systems Control. Matrix Publishers, Inc., 1978.

[18] R. McGill and P. Kenneth. Solution of variational problems by means of a generalized Newton-Raphson operator. AIAA Journal, 1964.

[19] H. J. Kelley. Method of gradients. In Optimization Techniques with Applications to Aerospace Systems. Academic Press, New York, 1962.

[20] H. J. Kelley. Gradient theory of optimal flight paths. ARS Journal, 1960.

[21] A. E. Bryson and W. F. Denham. Optimal programming methods with inequality constraints II: solution by steepest ascent. AIAA Journal, 1964.

[22] J. B. Rosen. The gradient projection method for nonlinear programming. J. SIAM Control, 1960.

[23] J. B. Rosen. Iterative solution of nonlinear optimal control problems. J. SIAM Control Series A, 1966.

[24] M. H. A. Davis. Optimal stochastic control: general aspects. In Systems and Control Encyclopedia, pages 3517-3523. Pergamon Press, New York, 1987.

[25] M. Goursat and J. P. Quadrat. Optimal stochastic control: numerical methods. In Systems and Control Encyclopedia, pages 3517-3523. Pergamon Press, New York, 1987.

[26] W. H. Fleming and R. W. Rishel. Deterministic and Stochastic Optimal Control. Springer, New York, 1975.

[27] N. V. Krylov. Controlled Diffusion Processes. Springer, New York, 1980.

[28] P. L. Lions. Generalized solutions of Hamilton-Jacobi equations. In Research Notes in Mathematics. Pitman, London, 1982.

[29] M. H. A. Davis. Martingale methods in stochastic control, volume 16. 1979.

[30] R. J. Elliott. Stochastic Calculus and Applications. Springer, New York, 1982.

[31] N. El Karoui.
Les aspects probabilistes du controle stochastique. In Lecture Notes in Mathematics, volume 876. Springer, Berlin, 1981.

[32] K. S. Narendra and M. A. L. Thathachar. Learning Automata: An Introduction. Prentice-Hall, New Jersey, 1989.

[33] J. M. Bismut. An introductory approach to duality in optimal stochastic control. SIAM Rev., 20:62-78, 1978.

[34] U. Haussmann. On the stochastic maximum principle. SIAM J. Control Optim., 16:236-51, 1978.

[35] U. Haussmann. On the adjoint process for optimal control of diffusion processes. SIAM J. Control Optim., 19:221-42, 1981.

[36] F. Stroock and S. R. S. Varadhan. Multidimensional Diffusion Processes. Springer, Berlin, 1979.

[37] J. P. Quadrat. Existence de solution et algorithme de resolution numeriques de problemes stochastiques degenerees ou non. SIAM J. Control, 18:199-226, 1980.

[38] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press/Bradford Books, Cambridge, MA, 1986.

[39] G. Girosi and T. Poggio. Networks and the best approximation property. Technical Report 1164, MIT Artificial Intelligence Lab., October 1989.

[40] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1990.

[41] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.

[42] K. Hornik, M. Stinchcombe, and H. White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3:551-560, 1990.

[43] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4:251-257, 1991.

[44] D. Psaltis, A. Sideris, and A. Yamamura. Neural controllers. In Proceedings of the 1987 IEEE International Conference on Neural Networks, pages IV-551-IV-558, 1987.

[45] D. Psaltis, A. Sideris, and A. Yamamura. A multilayered neural network controller. IEEE Control Systems Magazine, pages 17-21, 1988.

[46] A. Guez, J. L. Eilbert, and M. Kam. Neural network architecture for control. IEEE Control Systems Magazine, pages 22-25, April 1988.

[47] A. G. Barto. Connectionist learning for control: An overview. Technical Report 89-89, Department of Computer and Information Science, University of Massachusetts, Amherst, MA, 1989.

[48] W. Li and J.-J. E. Slotine. Neural network control of unknown nonlinear systems.
In Proceedings of the 1989 American Control Conference, pages 1136-1141, 1990.

[49] P. J. Werbos. A current overview of neurocontrol. Tutorial, the 1991 International Joint Conference on Neural Networks, Seattle, WA, July 1991.

[50] E. D. Sontag. Some topics in neural networks and control. Technical Report LS93-02, Department of Mathematics, Rutgers University, New Brunswick, NJ, 1993.

[51] K. F. Fong and A. P. Loh. Discrete-time optimal control using neural nets. In IJCNN 91, pages 1355-1360, Singapore, 1991.

[52] P. J. Werbos. Plenary Speech, the 1990 International Joint Conference on Neural Networks, San Diego, CA, June 1990.

[53] R. J. Williams and L. C. Baird, III. Analysis of some incremental variants of policy iteration: first steps toward understanding actor-critic learning systems. NU-CCS-93-11, College of Computer Science, Northeastern University, Boston, MA, 1993.

[54] D. F. Shanno. Recent advances in numerical techniques for large-scale optimization. In W. T. Miller, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control. MIT Press/Bradford Books, Cambridge, MA, 1990.

[55] N. Baba. A new approach for finding the global minimum of error function of neural networks. Neural Networks, 2:367-373, 1989.

[56] J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI, 1975.

[57] P. Dayan. Convergence of TD(λ) for general λ. Machine Learning, 8(3-4):341, 1992.

[58] P. Dayan and T. J. Sejnowski. TD(λ) converges with probability 1. Machine Learning, 14:295-301, 1993.

[59] T. Jaakkola, M. I. Jordan, and S. P. Singh. Stochastic convergence of iterative DP algorithms. In Advances in Neural Information Processing Systems 6, pages 703-710. Morgan Kaufmann Publishers, San Mateo, 1994.

[60] G. Tesauro. Reinforcement learning in game playing. Plenary Speech, International Joint Conference on Neural Networks, Seattle, WA, July 7-12, 1991.

[61] G. Tesauro. Practical issues in temporal difference learning. Technical report, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, 1991.

[62] D. A. Sofge and D. A. White. Applied learning - optimal control for manufacturing. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 259-282. Van Nostrand Reinhold, New York, 1992.

[63] P. J. Werbos. A menu of designs for reinforcement learning over time. In W. T. Miller, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control. MIT Press, Cambridge, MA, 1990.

[64] A. G. Barto and P. Anandan. Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, SMC-15:360-374, 1985.

[65] P. J.
Werbos, T. McAvoy, and T. Su. Neural networks, system identification, and control in the chemical process industries. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 283-356. Van Nostrand Reinhold, New York, 1992.

[66] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560, 1990.

[67] P. J. Werbos. Approximate dynamic programming for real-time control. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 493-525. Van Nostrand Reinhold, New York, 1992.

[68] P. J. Werbos. Neurocontrol and supervised learning: An overview and evaluation. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 65-89. Van Nostrand Reinhold, New York, 1992.

[69] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ, 1987.

[70] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Technical report, Department of Computer Science, University of Massachusetts, Amherst, MA, January 1993.

[71] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ, 1989.

[72] R. S. Sutton, A. G. Barto, and R. J. Williams. Reinforcement learning is direct adaptive optimal control. IEEE Control Systems Magazine, pages 19-22, April 1992.

[73] C. J. C. H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8:279-292, 1992.

[74] S. J. Bradtke. Reinforcement learning applied to linear quadratic regulation. In Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, 1993.

[75] R. S. Sutton. First results with Dyna, an integrated architecture for learning, planning and reacting. In W. T. Miller, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control. MIT Press/Bradford Books, Cambridge, MA, 1990.

[76] A. W. Moore and C. G. Atkeson. Memory-based reinforcement learning: Efficient computation with prioritized sweeping. In Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992.

[77] J. Peng and R. J. Williams. Efficient search control in Dyna. Technical report, College of Computer Science, Northeastern University, 1992.

[78] R. A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1):14-23, 1986.

[79] S. Mahadevan and J. Connell. Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55:311-365, 1992.

[80] A. W.
Moore. Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces. In Machine Learning: Proceedings of the Eighth International Workshop. Morgan Kaufmann, 1991.

[81] K. M. Buckland. Optimal Control of Dynamic Systems Through the Reinforcement Learning of Transition Points. PhD thesis, Department of Electrical Engineering, The University of British Columbia, Vancouver, B.C., Canada, 1994.

[82] K. M. Buckland and P. D. Lawrence. Transition point dynamic programming. In Advances in Neural Information Processing Systems 6, pages 639-646. Morgan Kaufmann Publishers, San Mateo, 1994.

[83] J. H. Schmidhuber. Towards compositional learning with dynamic neural networks. Technical Report FKI-129-90, Institut für Informatik, Technische Universität München, München, W. Germany, 1990.

[84] M. Eldracher and B. Baginski. Neural subgoal generation using backpropagation. In Proceedings of the World Congress on Neural Networks, pages III-145-III-148, Portland, OR, 1993.

[85] S. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3/4):323-339, 1992.

[86] C. W. Anderson. Q-learning with hidden-unit restarting. In Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA, 1993.

[87] C. K. Tham and R. W. Prager. Reinforcement learning methods for multi-linked manipulator obstacle avoidance and control. In Proceedings of the IEEE Asia-Pacific Workshop on Advances in Motion Control, pages 140-145, Singapore, July 1993.

[88] M. Sheppard, A. Oswald, C. Valenzuela, G. Sullivan, and R. Sotudeh. Reinforcement learning in control. In Proceedings of the 9th International Conference on Mathematical and Computer Modelling, Berkeley, CA, July 1993.

[89] J. Jameson. A neurocontroller based on model feedback and the adaptive heuristic critic. In IJCNN 90, pages II-37-II-44, San Diego, CA, 1990.

[90] D. A. Sofge and D. A. White. Neural network based process optimization and control. In Proceedings of the 29th Conference on Decision and Control, Honolulu, Hawaii, 1990.

[91] T. T. Jervis and F. Fallside. Pole balancing on a real rig using a reinforcement learning controller. CUED/F-INFENG/TR 115, Cambridge University Engineering Department, Cambridge, England, 1992.

[92] C. K. Tham and R. W. Prager. Reinforcement learning for multi-linked manipulator control. CUED/F-INFENG/TR 104, Cambridge University Engineering Department, Cambridge, UK, 1992.

[93] J. C. Hoskins and D. M. Himmelblau. Process control via artificial neural networks and reinforcement learning.
Computers and Chemical Engineering, 16(4):241-251, 1992.

[94] D. Yan and M. Saif. Neural network based controllers for non-linear systems. In Proceedings of the Second IEEE Conference on Control Applications, pages 331-336, Vancouver, B.C., Canada, 1993.

[95] K. KrishnaKumar, S. Sawhney, and R. Wai. Neuro-controllers for adaptive helicopter hover training. IEEE Transactions on Systems, Man, and Cybernetics, 24(8):1142-1152, 1994.

[96] J. Moody and V. Tresp. A trivial but fast reinforcement controller. To appear in Neural Computation, 6, 1994.

[97] J. S. Albus. Outline for a theory of intelligence. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):473-509, 1991.

[98] L. P. Devroye. A class of optimal performance directed probabilistic automata. IEEE Transactions on Systems, Man and Cybernetics, SMC-6:777-784, 1976.

[99] R. J. Williams. Reinforcement learning in connectionist networks: a mathematical analysis. Technical Report 8605, Institute for Cognitive Science, University of California, San Diego, CA, 1986.

[100] B. Horne and M. Jamshidi. A connectionist network for robotic gripper control. In Proceedings of the 27th Conference on Decision and Control, pages 1070-1074, Austin, Texas, December 1988.

[101] A. G. Barto, S. J. Bradtke, and S. P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, Department of Computer Science, University of Massachusetts, Amherst, MA, 1991.

[102] J. C. Houk, S. P. Singh, C. Fisher, and A. G. Barto. An adaptive sensorimotor network inspired by the anatomy and physiology of the cerebellum. In W. T. Miller, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control, pages 301-348. MIT Press, Cambridge, MA, 1990.

[103] J. C. C. Ip, P. D. Lawrence, and M. R. Ito. Linkup accelerated Q-learning. In Proceedings of the World Congress on Neural Networks, pages II-580-II-584, Portland, OR, 1993.

[104] J. C. C. Ip, M. R. Ito, and P. D. Lawrence. A class of linkup Q-learning algorithms for deterministic systems. Submitted to Journal of Artificial Intelligence Research, 1994.

[105] L.-J. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1993.

[106] P. Maes and R. Brooks. Learning to coordinate behaviors. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 796-802. AAAI Press/The MIT Press, 1990.

[107] S. Mahadevan and J. Connell. Scaling reinforcement learning to robotics by exploiting the subsumption architecture. In Proceedings of the Eighth International Workshop on Machine Learning, pages 328-332. Morgan Kaufmann, 1991.
