Reinforcement Learning in Neural Networks with Multiple Outputs John Chong Ching Ip B.A.Sc, The University of British Columbia, 1987 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES DEPARTMENT OF ELECTRICAL ENGINEERING We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA August 1990 © John Chong Ching Ip, 1990 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of &-eciric<n\ G^fin-eer'tn-The University of British Columbia Vancouver, Canada DE-6 (2/88) Abstract Reinforcement learning algorithms comprise a class of learning algorithms for neural networks. Reinforcement learning is distinguished from other classes by the type of problems that it is intended to solve. It is used for learning input-output mappings where the desired outputs are not known and only a scalar reinforcement value is available. Primary Reinforcement learning (PRL) is a core component of the most actively researched form of reinforcement learning. The issues surrounding the convergence characteristics of PRL are considered in this thesis. There have been no convergence proofs for any kind of networks learning under PRL. A convergence theorem is proved in this thesis, showing that under some conditions, a particular reinforcement learning algorithm, the A R . P algorithm, will train a single-layer network correctly. The theorem is demonstrated with a series of simulations. A new PRL algorithm is proposed to deal with the training of multiple layer, binary output networks with continuous inputs. This is a more difficult learning problem than with binary inputs. The new algorithm is shown to be able to surcessfully train a network with multiple outputs when the environment conforms to the conditions of the convergence theorem for a single-layer network. i i Table of Contents Acknowledgments v i i Abstract i i List of Figures v 1 Introduction 1 1.1 Neural Networks 1 1.1.1 Applications of neural networks to engineering 2 1.1.2 Neuron models 2 1.1.3 Network models 5 1.1.4 Capability of feedforward networks 6 1.1.5 Two classes of learning algorithms for feedforward networks 6 1.1.6 The formulation of the reinforcement learning problem 9 1.2 The Significance of Reinforcement Learning 11 1.3 Scope of the Thesis 11 1.4 Thesis Overview 12 2 Review of Reinforcement Learning Literature 14 2.1 The Development of Reinforcement Learning 14 2.2 Primary and Adaptive Critic Reinforcement Learning Algorithms 15 2.2.1 Primary reinforcement learning 16 2.2.2 Adaptive critic reinforcement learning 19 2.3 Relationship between reinforcement learning and Stochastic Learning Automata . . . . 22 2.3.1 The formulation of stochastic learning automata 23 2.3.2 Interacting stochastic learning automata 24 2.4 Summary 25 3 Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm 27 3.1 Existing Theoretical Results 27 3.2 Single-Layer Convergence Theorem 29 3.3 Proof of the Single-Layer Convergence Theorem 32 3.4 Discussion of the Single-Layer Convergence Theorem 35 3.5 A New Algorithm for a Single-Layer of Multilayer Sub-Networks 38 4 Simulations of Single-Layer Networks and the MPRL Algorithm 44 4.1 Simulations of a Single-Layer, 3-Neuron Network 44 4.1.1 Environment conforming to Condition A with lk = 3 45 4.1.2 Environment (informing to Condition A with lk = 2 48 4.1.3 Environment conforming to Condition A with lk = 1 50 4.1.4 Environment conforming to Condition B 53 4.1.5 Performance for various values of A 56 4.1.6 Performance for different number of neurons 57 4.2 Simulations of Multiple Output Networks for Continuous-to-Discrete Mappings . . . . 59 4.2.1 Single-layer A R . P network for classifying continuous inputs 60 4.2.2 Two layer AR.P network for classifying binary and continuous inputs 61 4.2.3 MPRL algorithm for classifying nonlinearly separable continuous inputs 63 i i i 4.3 Conclusions ^ 4.4 Future Simulations 6 5 5 Conclusions ^ 5.1 Summary ° ° 5.2 Contributions ° l 5.3 Future Areas of Research • • 6 7 References 6 9 Appendix A Proof of Barto's A R . P Convergence Theorem 75 i v List of Figures 1.1.1 The basic neuron (or deterministic neuron) model. 4 1.1.2 Typical multi-layer feedforward network 5 1.1.3 The basic reinforcement learning situation 8 1.1.4 The structure of the A R . P neuron 10 2.2.5 A simplified representation of an adaptive critic reinforcement learning system. 16 2.2.6 An adaptive critic reinforcement learning system 19 2.3.7 Comparison between SLA (a) and RL (b) 22 2.3.8 Game matrices for a (2 x 2) game 25 2.4.9 The relationship between the different algorithms or classes of algorithms 26 3.2.10 The reinforcement learning situation for a single-layer network with L neurons 30 3.4.11 Examples of environments satisfying Condition A 36 3.4.12 Example of an environment satisfying Condition B 37 3.5.13 The overall structure of the network for the MPRL algorithm 40 3.5.14 The structure of the subnetworks used in the network for the MPRL algorithm 41 4.1.15 Average results for Condition A with lk = 3 45 4.1.16 Single run results for Condition A with lk = 3 and p\ = 0.05 46 4.1.17 Single run results for Condition A with lk = 3 and p\ = 3.2 47 4.1.18 Single run results for Condition A with lk = 3 and pi = 800.0 47 4.1.19 Average results for Condition A with lk = 2 48 4.1.20 Single run results for Condition A with lk = 2 and p\ = 0.001 49 4.1.21 Single run results for Condition A with lk = 2 and pi = 0.4 49 v 4.1.22 Single run results for Condition A with lk = 2 and pi = 1.6 49 4.1.23 Average results for Condition A with lk = 1 50 4.1.24 Single run results for Condition A with lk = 1 and pi = 0.001 51 4.1.25 Single run results for Condition A with lk = 1 and pi = 0.1 51 4.1.26 Single run results for Condition A with lk = 1 and pi = 0.8 51 4.1.27 Average results for Condition B 54 4.1.28 Single run results for Condition B with pi = 0.01 54 4.1.29 Single run results for Condition B with pi = 1.0 55 4.1.30 Single run results for Condition B with pi = 1000.0 55 4.1.31 Average results for various values of A 56 4.1.32 The dependence of final action probability on A 57 4.1.33 Action probabilities versus number of time steps 58 4.1.34 Time steps taken for the average action probability errors to reach 0.0796 versus the number of neurons 59 4.2.35 Action probability maps for a 3-neuron, single-layer A R . P network. . . . 60 4.2.36 Action probability maps for a single-output, two layer A R . P network. . . 62 4.2.37 Action probability maps for a partially stochastic two layer network trained by the MPRL algorithm 63 vi Acknowledgments I would like to thank my supervisors, Prof. P.D. Lawrence and Prof. M.R. Ito, for their guidance and encouragement throughout the course of my thesis. I would also like to thank my colleagues, Ken Buckland, Greg Grudic, Kevin O'Donnell, Richard Smith, Neil Braithwaite, Steve Mason, Edward Lam, and Shawn Day, for their interesting discussions. Finally, I would like to thank my family and my friends, Christopher Dodds, Paul Tschetter, and Martin Dodds, for their concern and helpful suggestions. v i i Chapter 1: Introduction Chapter 1 Introduction Reiriforcement learning ( R L ) has been recognized as one of the most significant forms of neural network learning and has promise to have much greater technological capabilities than more wel l known forms [1], Its main application is for solving sequential decision problems. To see how it fits into the family of neural network models, a review and discussion of neural networks i n general follows. 1.1 Neural Networks Artif icial neural networks are mathematical models o f biological neural networks. The motivation for investigations of artificial neural networks stems from the fact that biological neural networks are the basis for the best intelligence i n existence; man-made intelligence does not even approach the capability of the highest expression of biological intelligence: the human brain. To move toward the goal of building artificial intelligence to the level that we want, it makes sense to study and imitate biological neural networks. The underlying promise of artificial neural networks is based on the fact that biological neural networks work. If it is true that intelligence is purely a result o f physical causes, a solution for creating intelligence exists; it is a matter o f time before we know enough about the brain to be able to duplicate and even to improve upon its functions. F o r that reason, the study of real neural networks (or something close to them) w i l l pay off i n the long run. Although building artificial intelligence machines o f the type and capability that we are want is a far off goal, there is stil l a number o f good reasons for the study of biological neural networks using artificial neuron networks as a medium. For the time being, we can only study some subsets of the brain's capabilities. Fortunately, they are o f great interest to engineering. These capabilities include pattern recognition, associative memory, pattern clustering, and motor control. There are several levels to the study of neural networks. The first level is the neuron modeL It can be divided into the signal model and the learning model. The signal model describes the short term relationship between the input signals and the output signals. The learning model describes how 1 Chapter 1: Introduction that relationship changes over time. The second level is the architecture. This is concerned with how the neurons are connected to each other. There are several well known basic architectures. The third level is the learning algorithms. They are concerned with how to modify the parameters of a network (usually the weights of connection). There is a great variety of learning algorithms. 1 . 1 . 1 Applications of neural networks to engineering The potential engineering applications of neural networks are found anywhere where there is a perceptual task of the kind easily performed by animals or humans but which is difficult for existing technologies. It is difficult to give a meaningful account of applications of neural networks because there are so many. Much of these applications are only proposed and are done on a proof-of-concept scale thus far. To say that application to a particular area has been made is not complete without giving the details of the results. Suffice it to say that these areas embody the requirements for those capability of biological neural networks mentioned earlier. The potential payoff from continued research in biological modeling is great The human brain is able to solve some problems that are very difficult for existing technology, problems such as pattern recognition, speech recognition and synthesis, natural language understanding, contextual retrieval of information from memory, efficient solution of difficult optimization problems, and motor control. These capabilities are needed for a quantum leap in a machine's ability to perform tasks autonomously in the real world. On the other hand, we should not expect that neural networks would give us capabilities that are not present in animals. 1.1.2 Neuron models The model of neurons used in artificial neural networks captures what is thought to be the essential behavior of biological neurons. The form of the signals coming into a neuron and going out from it is idealized, and less important input-output effects and of course the details of the biochemical and genetic processes are left out. If the neuron model differs sufficiently from real neurons, then efforts to find properties from grouping together of these model neurons can result in a dead end. Also, there is only one neuron model in common use when in reality there are many different kinds 2 Chapter I: Introduction o f neurons. It may be that a network would not work without the right neuron i n the right place. In this thesis, the word "neuron" refers to a simple model o f biological neurons. F i g . 1.1.1 shows the structure o f a model o f a neuron. The model consists o f a group of input lines representing incoming axons from other neurons, a weight for each input line representing the connection efficacy o f the synapse which could be excitatory or inhibitory, a summation block representing the function of the cell body, a sigmoidal function to bound the output signal, and an output line representing the a x o n In a real neuron, the signals arriving have the form of action potentials which are voltage pulses. The frequency of these pulses encodes the magnitude of the signal. The frequency has upper and lower bounds. In the model, the magnitude of the signal is represented by an analog value. The ce l l body is excited or inhibited by a signal, depending on the type of its synapse. If the excitation of the cel l body is above a certain threshold, the neuron w i l l fire and the resulting action potentials w i l l travel down the axon. In the model, the cumulative excitation is calculated by a sum of the magnitude of the input signals weighted by the connection weights. If the sum is above the threshold, the output lines w i l l be high; otherwise, they w i l l be low. This neuron model can also be called a Determinist ic N e u r o n model, to distinguish it from the stochastic neuron (or the A R . P neuron) model to be introduced later. 3 Chapter 1: Introduction y i Sigmoid i Sum N V / \ v ' X / / • • * \ \ Xl I \ 1 Figure 1.1.1: The basic neuron (or deterministic neuron) model. The previous paragraph does not describe how a neuron changes its behavior during learning. There are many different theories regarding this. A prominent one is the Hebbian learning theory [2] which states that a synaptic connection should be made more excitatory if the presynaptic signal is positively correlated with the postsynaptic signal. Another is Klopf's drive reinforcement model [3] which states that a synaptic connection should be made more excitatory if the change in the presynaptic signal is positively correlated with the change in the postsynaptic signal. A large number of learning responses of biological neurons has been predicted by this theory. Another theory by Alkon et al. [4] says that a conditioned input (or an input whose effect changes with learning) should be made to evoke similar responses from the cell body if it is active near the time of an unconditioned input. Very fast pattern recognition learning has been demonstrated with this neuronal model but it appears that the networks used in these demonstrations are easily overwhelmed by the curse of dimensionality. Almost all present neural networks are explicidy or implicitly based on Hebb's theory. 4 Chapter 1: Introduction 1.1.3 Network models There are many different types of network models. Some of them are inseparable from their learning algorithms; examples are the Adaptive Resonance Theory networks [5] and the Neocognitron [6]. Some of them are generic enough that many different learning algorithms have been devised for them. Hopfield's associative memory and feedforward networks are two examples. Feedforward networks have received greatest amount of attention and are the focus of this thesis. Fig. 1.1.2 shows the structure of a stricdy layered feedforward network. It has connections from one layer to the next immediate layer but none in the same layer or between neurons two or more layers away. The feedforward networks in the literature are usually of this type. The calculation of the network outputs is done as follows. Inputs come in from the left and feed into all the neurons in the first hidden layer. A layer is called hidden if its outputs are not the outputs of the network. Each neuron computes its output using the calculations shown in the neuron model in Fig. 1.1.1. The outputs from the first hidden layer are then the inputs of the second hidden layer. This repeats until the output layer. In applications, connections are often allowed to skip forward to higher layers; it is found that learning is faster this way. t 2 2 I Hidden Layer 1 Hidden Layer 2 Output Layer Figure 1.1.2: Typical multi-layer feedforward network. Chapter 1: Introduction 1.1.4 Capability of feedforward networks The primary use of feedforward networks is for pattern classification. Therefore it is important to know what such a network is capable of classifying. Each layer of a multilayer feedforward network divides its own input space into two regions. By stacking the layers, we obtain more complex partitioning of the input space of the network. Lippmann [7] gave an intuitive discussion of what kind of decision regions can be formed by feedforward networks of different number of layers. Each output of a single-layer network can divide the input space into two linearly separable decision regions; a two layer network, convex open or closed decisions regions; and a three layer network, arbitrary decision regions, the complexity of them limited by the number of nodes in each layer. A network is capable of forming a mapping correctly if a set of weights exists for that network to implement that mapping. There have been many attempts to put the above intuitive conclusions into theorems. There is some confusion in the literature about the arbitrary classification of an input space (the mapping of a continuous input space to a discrete output space) and function approximation (the mapping of a continuous input space to a continuous output space). The latter is a superset of the former. A theorem has recently been proven [8] proving that 2-layer standard feedforward networks with sufficiendy many hidden units are able to approximate to any desired degree of accuracy any Borel measurable function from one finite dimensional space to another. This theorem explains the successes that have been reported in literature in the use of feedforward network in many applications. One example of a difficult mapping is the classification of two intertwined spirals [9]. 1.1.5 Two classes of learning algorithms for feedforward networks For the purpose of this thesis, we need to distinguish between supervised and reinforcement learning. In Supervised Learning, the input/output mapping to be learned by the network is known a priori. Usually, the network outputs are deterministic functions of the inputs, calculated in the manner described in Subsection 1.1.3. The network outputs are compared with desired output values known 6 Chapter 1: Introduction to the learning algorithm. This information is usually used by a teacher which coordinates all the weight changes in the network. The weights are changed so that the errors between the desired output values and the actual output values of the network are gradually reduced. There are many algorithms for supervised learning in a feedforward network. Backpropagation [10] was one of the first and most successful algorithms at the time of its publication. Its main drawback is the large number of iterations of input/output presentations required for the network to converge to the correct weights (If it is possible to converge to the correct weights.). Several schemes to improve backpropagation directly were later devised [11][12][13][14][15]. Most of these schemes are variations of backpropagation and they usually differ in how the parameters are adjusted. Later, very different supervised learning algorithms became available [16][17][18][19][20]. What supervised learning tries to accomplish can be easily done using a look-up table and interpolating between the points, providing the number of entries to the table is not large [21]. Of course, there is a number of advantages to storing mappings in the form of weights in a network, such as reduction in memory requirements and similarity to biological neurons. But finally, supervised learning methods are just complex storage methods. In Reinforcement Learning, the input/output mapping to be learned is not known a priori. The only feedback available to the learning algorithm is a scalar evaluative signal called the reinforcement The reinforcement is generated by the environment which is external to the network and the learning algorithm. The reinforcement indicates to the learning algorithm how good the output vector is for the given input vector. The network must try different output combinations for the same input vector to find the output vector which will illicit the best reinforcement. For this reason, the outputs are stochastic functions of the inputs. A variety of reinforcement learning algorithms will be described in Chapter 2. Traditionally, the learning algorithm is local to each neuron in the network. Thus it is synonymous to say that the learning algothm is training a neuron or to say that a neuron is changing itself. The leaning algorithm does not coordinate the weight changes of the entire network as a whole. The learning algorithm changes every neuron as if the other neurons do not exist. Each neuron changes 7 Chapter 1: Introduction itself to illicit the best reinforcement for itself. It is hoped that the network will exhibit useful learning properties from the statistical interaction of these neurons. Supervised learning is a subset of reinforcement learning. The learning system, which may simply be the network itself, must search for the best mappings and store them in the network. The searching is done by the stochastic nature of the outputs and the storing is done by the update rule for the weights. The two components are tied to each other in a complex way. Fig. 1.1.3 shows the basic learning situation of reinforcement learning. Each neuron outputs a binary value generated stochastically. The probability of producing a particular output value (conventionally the high value) is determined by putting the weighted sum of the neuron's inputs through a squashing function which becomes the output of the neuron. When the outputs have propagated to the top, a single reinforcement value is returned to the network to indicate how good the network outputs are for the given network inputs. That single reinforcement value is used by every neuron in the network to change itself. Environment Network biputi Figure 1.1.3: The basic reinforcement learning situation. The reinforcement is also stochastic. If the network outputs are good for the present network inputs, the reinforcement will more likely be positive; if the outputs are poor, the reinforcement will more likely be negative. Fig. 2.3.7 shows why the reinforcement should be stochastic: it allows for a 8 Chapter 1: Introduction more general environment. Given this type of feedback from the environment, reinforcement learning algorithms try to change the output probability of each neuron in certain ways. If the reinforcement is positive, each neuron is changed to increase the probability of producing the same output for the same input pattern; otherwise, changed to decrease. Reinforcement learning is designed to be able to learn in a much less informative environment than the one in supervised learning. In reinforcement learning, a network has access to only two sets of information from the environment: the input values and the reinforcement value. By being so designed, it is a more appropriate learning method in some circumstances. An example would be learning to control an inverted pendulum when the model of the system is not known and only the fact that the pendulum is up or down is known. 1.1.6 The formulation of the reinforcement learning problem The following formulation is one version of pure reinforcement learning [22] for single neurons; there is a number of newer and different versions. The algorithm is similar for each neuron in the network except for possibly different learning parameters. A learning cycle starts with presentation of the input vectors) xk to the input neurons of the network at time k and the activation of the neurons flows forward to the output neurons. Each neuron, also called a stochastic semi-linear unit [23], or an AR.P Neuron (so named because of the learning algorithm, to be introduced later, used to modify the neuron) computes an output ak stochastically in response to its input vector according to where 0k is its weight vector and r)k is a random variable having a distribution function $ which does not vary with k. $ is usually a sigmoid shaped functioa The structure of the AR.P neuron is shown in Fig. 1.1.4. (1) — 1, otherwise 9 Chapter 1: Introduction y Figure 1.1.4: The structure of the AR.P neuron. Why is stochastic output necessary? The main reason is that the reinforcement is stochastic. This is most easily understood if we consider a two-action (or two-output) Stochastic L e a r n i n g A u t o m a t a (or S L A ) ; the basics of S L A will be reviewed in Chapter 2. With a deterministic output, if both actions result in high success probability, then the automaton will tend to be locked to the action it is closest to initially, even when the success probability of the other action is higher. The stochastic output lets the automaton sample both actions and move to the better action asymptotically. The environment evaluates the output vector from the network and returns a reinforcement value bk 6 {-1,1} where -1 (failure or penalty) indicates a poor output vector and 1 (success or reward) indicates a good one. In Barto's A R . P (for Associative Reward-Penalty) algorithm [24][25], the weights in each neuron are updated according to where pk > 0 is a learning factor and 0 < A < 1 is a penalty factor. The input vector is in the set of R" where n is the dimension of the vector. If A = 0, the algorithm is called A R . T (for Associative Reward-Inaction). Pk[h<ik ~ E{ak\9k,x-k})x-k if h = 1 - E{ak\9k,xk}]xk if bk = -1 (2) 10 Chapter I: Introduction If 0 < A < 1, the algorithm is called A R . C P (for Associative Reward-Epsilon Penalty). Some of the update rules by other researchers are closely related to Barto's [23][26], There is an intuitive interpretation of the A R . P algorithm. When bk = 1 which indicates a positive reinforcement, bkak = Ofc- This means the weights are changed to bring the average output of the neuron closer to the current output. When bk = -1 which indicates a negative reinforcement, bkak = -ak. This means the weights are changed to bring the average output of the neuron closer to the opposite of the current output. 1.2 The Significance of Reinforcement Learning From a biological modeling perspective, reinforcement learning represents one of the dominant ideas about how parts of the brain leam. Many simulation experiments have demonstrated success in producing useful behavior from small networks. Lending support for the biological plausibility of reinforcement learning, Mazzoni et al. [27] described simulations which show that a network model of area 7a of primate parietal cortex when trained by a variant of the A R . P algorithm developed hidden neuron responses which are similar to area 7a neurons. From a technological modeling perspective, more complex versions of reinforcement learning have been recognized as some of the principal ways to apply neural networks to learning control [28] and also one which will exhibit much more interesting and useful behavior than supervised learning or associative memory. When the capability of reinforcement learning is better developed, applications will be able make use of neural networks in greater capacity than pattern classification and function approximatioa The sophisticated level of motor control exhibited by animals, even by lower animals such as insects, is the basis for the hope that neural networks would eventually achieve similar levels of performance. Much study remains to be done to see which of the current neural network paradigms best fit into the area of motor control. In the mean time, a promising lead like reinforcement learning should be explored. 13 Scope of the Thesis This thesis is concerned with the core component, to be introduced in the next chapter, 11 Chapter 1: Introduction of the various kinds of reinforcement learning. This core component will be called Primary Reinforcement Learning (or PRL). PRL is purely a learning pattern classification method. This thesis is not concerned with how PRL is supplemented with other mechanisms to producing a form of reinforcement learning claimed to be suitable for learning control and the theory of that form of reinforcement learning. This form will be called Adaptive Critic Reinforcement Learning (or ACRL). Learning control is discussed only to illustrate how PRL fits into the whole picture of reinforcement learning. The dominant version of ACRL is built on the viability of PRL. There are a few theoretical results showing that PRL may work. The theory of PRL will be examined further in this thesis. It will be shown that under certain conditions, a network of neurons will converge to the correct weights as a result of training. Some small scale simulations have been done successfully in the literature. Somewhat larger and different kinds of simulations will be carried out to test PRL more completely. As a result of the theoretical considerations and the simulations, it will be shown that PRL in its present formulation needs to be modified for certain type of problems. This modification results in a new learning algorithm which will be tested by simulation. 1.4 Thesis Overview Chapter 2 presents the development of reinforcement learning, makes a distinction between the two major classes of algorithms, presents the specifics of a number of algorithms, and shows the important connection between stochastic learning automata and reinforcement learning. At the end of the chapter, the reader should have a general overview of the background and have the specifics of the algorithm to be analyzed in Chapter 3. Chapter 3 focuses on a specific algorithm, the A R . P algorithm. The main doubt about the viability of PRL algorithms is how the reinforcement probabilities are related to each neuron in a network. A discussion of this leads up to the proof which shows that under some conditions, a group of neurons will converge under reinforcement learning. The proof then shows how PRL should be modified and a new algorithm is presented. 12 Chapter 1: Introduction Chapter 4 shows the simulation of A R . P algorithm for one- and two-layer networks. Large single-layer networks will also be tested with the same algorithm under the ideal conditions required in the theorem proved in Chapter 3. Finally, the new PRL algorithm presented in Chapter 3 will be tested by simulation. Chapter 5 summarizes the theoretical and simulation results in Chapter 3 and 4. The contributions of this thesis will be listed. Further research following this thesis and in the field of reinforcement learning in general will be discussed. 13 Chapter 2: Review of Reinforcement Learning Literature Chapter 2 Review of Reinforcement Learning Literature In this chapter, the development of reiriforcement learning is summarized. A distinction is made between the form of reinforcement learning called PRL designed to find and learn an optimal input-output mapping and the form called ACRL [26][29][30][31] designed to find and learn an input-output mapping which will produce an optimal output sequence for a given external environment under conditions of noise or uncertainty. ACRL has greater engineering relevance than PRL, but it will be shown that it is necessary to analyze PRL at the same time as developing algorithms for ACRL. The relevance of SLA to the analysis of PRL will be shown. The results of SLA hint at what can be expected from a network with multiple outputs trained by PRL. 2.1 The Development of Reinforcement Learning Reinforcement learning has its beginning in psychological studies. Mathematical models of instrumental conditioning behaviors was studied under the name mathematical psychology (refer to [32][33][34]). Together with sequential estimation theory and studies on abstract models of learning systems such as 2-armed bandit problem and SLA (refer to [35][36][37][38][39][40]), it contributed to the development of reiriforcement learning. Other early ideas include Minsky's "Stochastic Neural-Analog Reinforcement Calculator" (refer to [41]) and the Selective Bootstrap Adaptation Algorithm by Widrow et al. [42]. Later, Barto performed a series of RL experiments which led up the current theory. For a bibliography of these experiments, see [43]. From the neurobiological community, KlopFs concept of networks of neurons working to optimize their own inputs [44] has become a central idea of RL. Neural networks experienced a revival in the 1980's. The impetus for it is the appearance of the backpropagation algorithm for multilayer networks, the Hopfield associative memory, the Boltzmann machine, and some others. A problem that many researchers had wanted to solve was how to train a multilayer network. The primary algorithms developed by Barto and Williams [23] [24] were at one time some of the few algorithms available to train a multilayer network. Now there are a multitude 14 Chapter 2: Review of Reinforcement Learning Literature of algorithms for training these networks. As will be explained later, primary reinforcement learning algorithms' role should now be changed to solving multiple output problems instead of solving multiple layer problems. The development of RL also proceeded along a different direction. Besides primary algorithms, algorithms of the adaptive critic family [26] [29] were designed to optimize time sequences of outputs from a network. These are meant to be applicable to sequential decision tasks such as games and control. Since the field of neural networks needs applications to maintain its present level of activity, these algorithms are now some of the most immediately important ones in neural networks. While these optimizing algorithms can make use of general mapping storage mechanisms and algorithms, how they work with neural networks is still important because they are cast in the form of neural network algorithms and the capability of neural networks is still needed. 2.2 Primary and Adaptive Critic Reinforcement Learning Algorithms There are two distinct types of reinforcement learning algorithms: P R L and A C R L algorithms. Fig. 1.1.3 is the simplest model of the P R L situation. It is the core component of the A C R L situation illustrated by Fig. 2.2.5. From a neuron's perspective, it is performing the same learning in both cases. In fact, the A R . P algorithm and other reinforcement learning algorithms can be used in both P R L and A C R L situations. The only procedural difference between the two cases is how the reinforcement is computed. 15 Chapter 2: Review of Reinforcement Learning Literature Environment Actions k Secondary Primary Reinforcement Adaptive Acdon Network Reinforce-ment Adaptive Critic Network J States Figure 2.2.5: A simplified representation of an adaptive critic reinforcement learning system. The goal of PRL is to leam input-output mappings where the reinforcement is only a function of the present input and the present output. The reinforcement coming directiy from the environment does not necessarily have to be modified before it is used by the network to learn. This reiriforcement is called the primary reinforcement, thus the name Primary Reinforcement Learning. In contrast, the goal of ACRL is to leam a mappings that maximize a cumulative measure of primary reinforcement over time. The outputs of the action network (see Fig. 2.2.6) at any particular time can produce immediate payoff but can also influence payoff available later. The name ACRL comes from a learning function called the Adaptive Critic which calculates a secondary reinforcement based on the sequence of primary reinforcements, inputs and outputs of the network. The secondary reinforcement is then used by the network to learn. There is a long line of research in reinforcement learning algorithms. The early works were of the PRL variety and they were applied to various games and control situations. Later, it was realized that the reinforcement must be processed to give a more ideal reinforcement so that the optimization task is possible. Then the separation between PRL and ACRL algorithms appeared. 2.2.1 Primary reinforcement learning The evolution of PRL resulted in several well known algorithms. Note the terminology in this subsection is the same as in the literature. Since this section is quite isolated from the rest of the 16 Chapter 2: Review of Reinforcement Learning Literature thesis, the terminology is not modified to fit in with the rest of the thesis. Such a modification would be too awkward to be worthwhile. The most prominent PRL algorithm is the AR.P algorithm as shown in Chapter 1. The reason for the prominence of this algorithm is that a convergence proof exists for the case of a single neuron. A form which is more easily compared with other algorithms to be presented in this section is the following (a(yi-Pi)xj , r = l Awy = { (3) [ aA(l - yi - pi)xj ,r = 0 where a > 0 is the primary learning rate, 0 < A < 1 is the penalty rate. «/,-_,- is the weight from input Xj 6 R to ith neuron which produces the output yi according to the action probability Pi = Pr {yi = 11 Wij, XJ}. Here the output of the neuron is in the set of {0,1} and the reinforcement r is in the set of {0,1} where 0 is a penalty and 1 is a reward. If the reinforcement is continuous and is in the range [0,1], the S-model version of the AR.p algorithm [45] can be used to take advantage of the more informative reinforcement: Awy = o[r(jft - pi) + A(l - r)(l - & - Pi)]xj (4) A continuous reinforcement may 1) be unbalanced or 2) change meaning as the system learns. The former happens when a neutral reinforcement is not at 0.5. The neuron would be incorrectly penalized or rewarded if the reinforcement is not interpreted with respect to a good estimate of a neutral value. The latter occurs when a relatively good reinforcement at an early stage of learning becomes relatively poor at a later stage. The Reinforcement Comparison Algorithm [26] was devised to deal with this problem: Awn =a(r- rfred)(y< - pt)XJ (5) where r ? r e d = prediction of E{r | W, X} where W is the vector of all the weights in the network and X is the network input vector. An extra network is required to learn and store rPre<* and its training is done by supervised learning. 17 Chapter 2: Review of Reinforcement Learning Literature Williams [23] [46] [47] identified a general class of reinforcement learning algorithm called REINFORCE algorithms. This class of algorithms is applicable to a general network architecture. The weight updates are done as follows Awn = a,j(r - bij)eij (6) where the learning rates a;j > 0 are independent of the inputs x' to the ith neuron, the reinforcement baseline bij is conditionally independent of the output yi given the total network weights W and x\ and the eligibility e,j is equal to gjrln<7» where </,(£, w'jX*) = Pr {y, = f | w*,x'}. A theorem concerning the average behavior of the network was proven. It will be presented in the next chapter. An interesting feature of this class of algorithm is that it includes the A R J algorithm and the Reinforcement Comparison algorithm. Therefore the theorem applies to both of them. An interesting algorithm was proposed by Munro [48]; it is a dual backpropagation algorithm where a predictive network learns a prediction of reinforcement as a function of the input vector and the action vector. The predictive network learns by flailing, prior to the learning of the action network. The action network is modified by gradient information from the predictive network At»y = where f is the output of the predictive network and Wij are the weights in the action network. His simulations were for a small number of binary inputs and no special attention was paid to the possibility of the reinforcement being probabilistic. For reasons unknown, the simulations converged quite slowly. To leam continuous-to-continuous mappings, Gullapalli [49] [50] devised an algorithm which uses two networks. These networks should be capable of accurate representation of such mappings. The output is generated according to a Gaussian distribution which uses the mean fi stored by one network and the standard deviation a stored by the other network. At the conclusion of learning, the standard deviation network should have small outputs indicating confidence in the means found and the mean network would have the final mapping desired. Out of the above algorithms, this thesis will focus on Barto's A R . P algorithm. This is because there are two theorems proven for it So it is a more promising algorithm for further results. 18 Chapter 2: Review of Reinforcement Learning Literature If the reinforcement is immediate and only relevant to the last input-output vector pair, the credit assignment problem is just structural. A structural credit assignment problem is the problem of assigning responsibility to each of the neurons in the network. If the reinforcement is delayed by an unknown number of time steps, then we have a temporal as well as a structural credit assignment problem. This means when a reinforcement is received, we also do not know which vector in the sequence of output vectors has contributed to the reinforcement. The extended REINFORCE algorithm [51] is one possible solution to this. Sutton [26] and Watkins [52] talked extensively about this problem. Delayed reinforcement sits at the boundary of PRL and ACRL. Where ACRL comes into play is when the reinforcement and the states are generated by a state-transitional environment. It is difficult to leam in such an environment, particularly if the reinforcement is also delayed. 2.2.2 Adaptive critic reinforcement learning Fig. 2.2.6 shows the structure of adaptive critic systems. It shows the flow of the signals in the system and the relationship between the various blocks. Environment Adaptive Action Network Adaptive Critic Network Figure 2.2.6: An adaptive critic reinforcement learning system. At each time step k, both the Adaptive Action Network (or AAN) and the Adaptive Critic Network (or ACN) receive the input vector supplied by the environment. The ACN also receives 19 Chapter 2: Review of Reinforcement Learning Literature the primary reinforcement rk which is used to update the weights in the ACN, ^ ( w ^ j j X j t , ^ ) . The updating of the ACN will be described later but not in great detail because it is beyond the scope of this thesis to do so. After the ACN is updated, it produces a secondary reinforcement bk(xk, w£) where x* is the input vector and wk is the vector of the critic weights in the ACN. The secondary reinforcement bk is then used to modify the AAN. The modification is done using some PRL algorithm. This last statement is important; it shows how PRL fits into ACRL. Then the AAN generates an output vector a^ (xjt, w£) where is the set of weights in the AAN. w£ is changed by bk prior to the calculation of a* because bk is in response to ak-i and x^-i and the actions produced by the AAN should reflect the learning done as a result of the activities in the last time step. The idea behind the ACRL system is for the AAN to find an input/output mapping which maximizes the secondary reinforcement bk. bk is designed such that if the AAN maximize it, the optimization goal of the whole system will be achieved. The optimization goal is to maximize a discounted sum b*k which is calculated by bk = rk + *frk+i + -r2rk+2 + • • • (7) where 0 < 7 < 1 is the discount factor. The update rule for the ACN trains the ACN to learn to approximate b*k, i.e., bk —* b*k as k —• oo. So when bk is maximized, the future primary reinforcements are maximized, with the near term reinforcements having the highest priority. To compute b*k directly requires an infinite number of time steps before all the data are available. To get around this, an approximation to bk is computed by % = rk+jbk(xk,wck). (8) This value, called the corrected 1-step truncated return [53], is then learned by the ACN via supervised learning. The desired behavior of the systems is that with time, the ACN learns to predict the discounted sum of future primary reinforcements from the present inputs (or states) and the AAN learns actions that would illicit a long string of positive reinforcements. It is clear that the environment is supposed to have a state transition mechanism as in dynamic systems. An example of this is x*+i = xk+i(xk,&k,noise). In a number of papers, the adaptive 20 Chapter 2: Review of Reinforcement Learning Literature critic design has been used for learning control of an inverted pendulum and for solving sequential decision problems such as the Tower of Hanoi problem [26] [29] [30] [54]. In [26] and [29], the action network is effectively a look up table. Each entry of the table stores a probability of a push on the cart toward the right side. Therefore, that system is really a group of SLA's where only one automaton is activated at one time. By using a group of automata instead of the more difficult to train neural network, the problem is simplified. In [30], a two layer network is used in place of a look up table. The network is taught the probability of generating an action by supervised learning. It should be noted that the action network has only 1 output. This is a much simpler learning situation than having multiple outputs which is the main thrust of this thesis. The action network does not need to be a feedforward network. There have been a number of recent proposals to use adaptive critics for adjusting different kinds of adaptive action networks [22] [55]. Adaptive critic reinforcement learning has been compared to dynamic programming. In fact, it has been described as approximate dynamic programing [56]. Both methods are used for solving sequential decision problems. The problem of finding an optimum sequence of outputs is far from trivial. If there are N possible alternatives at each time step and if there are K time steps, to do an exhaustive search of all possible sequences would take NK tries. Take the example of trying to take the environment from initial state to a final state and keep it there, as in control problems, we encounter a number of serious problems: 1) the problem is an open-ended sequence optimization task, i.e., K is unknown; 2) if a good action trajectory is found, it has to be remembered the first time, for efficiency sake; and 3) if the action vector has M elements, the total search problem is (MN)K sized. ACRL with an adaptive action network that produces binary outputs performs an exhaustive search of the entire sequence space. The curse of dimensionality problem can be reduced significandy if the system model and the payoff (reinforcement) function are known. One way to make use of such knowledge is to initialize the action network with promising trajectories and the critic network with the payoff function. It is beyond the scope of this thesis to cover adaptive critics methods in any greater detail. This section is included to give the reader some idea of where PRL fits into a larger form of reinforcement 21 Chapter 2: Review of Reinforcement Learning Literature learning which appears to have more application potential. 2.3 Relationship between reinforcement learning and Stochastic Learning Automata Of the different origins of RL, SLA deserves special mention. The concept of SLA is tied strongly to the concept of RL. The analyses done for SLA can be used for RL and the results of SLA are relevant to RL. RL is also known as associative reinforcement learning and SLA has been called "nonassociative reinforcement learning" [57]. Fig. 2.3.7 shows the similarity between them. As shown in Fig. 2.3.7, the unknown process and the evaluator comprise the environment. The Disturbances and other input Environment Unknown Process Description of process state Evaluator Action Learning System Reinforcement (a) Disturbances and other input Environment Unknown Process Action Description of process state Evaluator 3 Z Learning System Reinforcement (b) Figure 2.3.7: Comparison between S L A (a) and R L (b). only difference between the two learning situations is that in RL, the learning system also receives a description of the process state or context. The learning system in SLA only has to choose an action from a set of possible actions whereas the learning system (for example, a single neuron) in RL has 22 Chapter 2: Review of Reinforcement Learning Literature to choose an action from a set specified by the context. In the next section, a review of the basics of SLA is presented. It should help to clarify the idea of RL. 2.3.1 The formulation of stochastic learning automata The following is a brief overview of stochastic learning automata. The interested reader should consult Narendra's book on the topic [40]. A stochastic learning automata (also known as a Variable Structure Stochastic Automata, not to be confused with other types of automata) is miriimally represented by a quadruple {<x,fi,p,A} where a = {01,02,• •-,ar} with r < oo is the action set, /? is the reinforcement coming from the environment, p = {pi,p2, • • *»Pr} is the action probability vector governing the choice of the action at each stage (i.e., at each stage n, p(n) = (pi(n),p2(n)t • • • ,ps(n)) ), and A is an algorithm (also known as the updating scheme or reinforcement scheme) which generates p(n + 1) from p(n). Only an environment with random response characteristics is of interest. At each stage n, the environment has inputs a(n) 6 a and output (reinforcement) (3(n) belonging to a set z. Frequently x = {0,1} with 0 indicating reward and 1 indicating penalty. The probability of the environment emitting a particular reinforcement symbol (say, 1) depends on the input and is denoted Ci, (i = 1, • • •, r). The c,-'s are called the failure probabilities. Its opposites, d,; = 1 — c,-, are called success probabilities. Generally, j3(n) can be in the range [0,1], where 0 and 1 corresponds to the maximum degree of success and failure, respectively. In this case, we need a more general expression for c,-: d = E[(3(n) | o(n) = a,]. (9) If the Ci do not depend on n, the environment is said to be stationary. Otherwise it is nonstationary and is a general case of stationary environments. It is assumed that the c,'s are unknown initially; the problem would be trivial if they are known a priori. It is also assumed that one of the q's is smaller than all other c,'s. The reinforcement scheme A can be represented by p(n+l) = T[p(n),a(n),/?(n)] (10) 23 Chapter 2: Review of Reinforcement Learning Literature where T is a mapping. One can classify the reinforcement schemes on the basis of the nature of the functions appearing in the scheme. If p(n + 1) is a linear function of the components of p(n), the reinforcement scheme is said to be linear, otherwise it is nonlinear. If p(n) is updated according to different schemes depending on the intervals in which the value of p(n) lies, the combined reinforcement scheme is called a hybrid scheme. A general learning scheme is the following: Pi(n + 1) = Pi(n) - [1 - /?(n)]5,(p(n)) + /?(n)/i<(p(n)), if a(n) ? a, (11) Pi(n + 1) = Pi(n) + [1 - /?(n)] £ gj(p(n)) - P(n) £ hj(p(n)), if a(n) = a,-where gj and hj are non-negative continuous functions satisfying the conditions 0 < gj{p(n)) < pj and 0 < E [p,- + hi(p(n))] < 1 for all j = 1,2,.. . ,r. The above scheme becomes linear if gj and hj are chosen to be gj(p(n)) = apj(n) and /jj(p(n)) = b -Pj(n)] with 0 < o < l a n d 0 < 6 < l . If b ^ 0 and 6 $ 0, the scheme is called a L R . P (for Linear Reward-Penalty) scheme. If b = 0, it is called a L R . I (for Linear Reward-Inaction) scheme. If b ^ 0 and b w 0, it is called a L R . C P (for Linear Reward-Epsilon Penalty) scheme. There are a number of theoretical results for different SLA algorithms. L R _ P are "expedient" algorithms meaning that the action probabilities are ergodic and the accuracy is quite sub-optimaL L R . I are "absolutely expedient" or "e-optimal" algorithms which means the final action probabilities will converge to either O's or l's but there is a small chance that they will converge to the wrong values. The advantage of L R . P algorithms is for tracking a changing environment. L R . C P algorithms are compromises between L R . P and L R . I with the action probabilities being able to converge close to the extremes but retaining an ability to move away from a false value. These behaviors are reflected in their associative counterparts. The A R . P algorithm presented in Chapter 1 and its variants, the A R J and the A R . C P algorithms, have similar convergence characteristics. 2.3.2 Interacting stochastic learning automata The A R . P convergence theorem by Barto for a single neuron was inspired by the proofs of SLA convergence. Beyond the relevance of single SLA convergence results to single neurons learning 24 Chapter 2: Review of Reinforcement Learning Literature from reinforcement, we are more interested in the interaction of a group of SLA which is relevant to a group of neurons. A substantial amount of work has been done in analysis and simulation of games played by SLA [40]. The results show that a N-player game with identical reinforcement would be assured to converge to best strategy only if there is a unique maximum in the game matrix. The term strategy means a combination of outputs from the players in the game. The following figure illustrates a (2 x 2) game matrix with (a) one maximum and (b) two maxima. The numbers in the squares are success probabilities. The game in Fig. 2.3.8a would always converge to the strategy $ rt S 2 S 0.9 0.5 0.3 0.1 cd B 2 3 -0.9 0.1 0.3 0.5 action 1 action 2 action 1 action 2 automaton 2 automaton 2 (a) (b) Figure 2.3.8: Game matrices for a (2 x 2) game. {action 2, action 1} for automaton 1 and 2 respectively. On the other hand, the game in Fig. 2.3.8b would converge to {action 2, action 1) in majority of the time but would occasionally converge to {action 1, action 2}. Primary reinforcement learning is analogous to a N-player game of SLA with identical payoff. A network of neurons with binary outputs is like a (2 x 2 x 2 . . . ) game. From the results found for the SLA games, we can expect that the environment must satisfy certain conditions for a group of neurons to converge to the desired mappings. 2.4 Summary In this chapter, a brief review of the historical roots of reinforcement learning is presented. Reinforcement learning can be divided into two major types: PRL and ACRL. The differences and 25 Chapter 2: Review of Reinforcement Learning Literature relationship between the two are explained. A number of PRL algorithms are mentioned. SLA, the precursor of primary reinforcement learning, is reviewed. Fig. 2.4.9 shows the relationship between the various algorithms and classes of algorithms. SLA PRL ACRL A R-P A R - I ^R-eP Munro's Gullapalli's Algorithm Algorithm ^R-p S-model Reinforcement Comparison Algorithms REINFORCE Algorithms Figure 2.4.9: The relationship between the different algorithms or classes of algorithms. The results obtained in S L A in a game situation show what could be expected in a network of neurons interacting in much the same way as a game with identical reinforcement to all the neurons. In the next chapter, a convergence theorem for single-layer networks trained by the AR.P algorithm is derived. 26 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm Chapter 3 Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm In this chapter, we look at the theoretical and experimental issues surrounding PRL. A review of what we know about those issues is presented. The idea of a scalar reinforcement being able to direct the change of the weights of an entire network is examined. The insights thus gained led to the discovery of conditions when a group of neurons (cast in the form of a single-layer network) will converge to the desired mappings. The implications of that discovery is discussed. The chapter is then concluded with the introduction of a new, more effective algorithm for PRL. 3.1 Existing Theoretical Results The AR.P algorithm is an associative version of LR.P SLA algorithms. In Barto's 1985 paper [24], a strong convergence theorem for single neurons trained by the AR.P algorithm (or the AR.P convergence theorem) was presented which states that under certain conditions, the weight vector will converge toward the values that will produce a correct output for each of the input vectors in the training input set The input vectors, the output, and the reinforcement are all bipolar which is a much simpler case than if any of these elements is continuous. Furthermore, one of the conditions in the theorem requires that the input vectors be linearly independent This means the theorem does not cover the learning of the classification of all vectors from an input space. Nevertheless, this theorem is important because it connects reinforcement learning with the well developed field of SLA and because the theoretical convergence of a single neuron is required before a network of neurons can be shown to converge under this kind of learning. Williams [46] later proved a theorem for a general network of a variety of stochastic unit types which states E{6®k | ©*} • VekE{h | ©*} > 0 (12) where 0 j t is the combined vector of all the weights in the network at time k and bk is the reinforcement. This theorem applies to REINFORCE algorithms [23][46][47] which include the 27 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm A R J and Reinforcement Comparison [26] algorithms. The theorem says that a network of neurons will on average be ascending along the gradient of the expected value of the reinforcement function. The theorem lends some hope about the learning ability of a network trained by the A R ^ P algorithm. No strong convergence theorem had been proven for a network [58]. The ultimate goal of the development of primary reinforcement learning algorithms is to reach the stage where a function approximator can learn optimal continuous-input to continuous-output mappings as quickly as possible through the feedback of just a scalar reinforcement value. Since a feedforward network is more suited to learning an continuous-input to binary-output mapping, this type of mapping should be the current goal of primary reinforcement learning using neural networks. Many simulations [25][30][45][59][60][61] have shown that various reinforcement learning algorithms do work for small networks with BINARY inputs. But these simulations did not address what we are most interested in at the moment: learning continuous-input to binary-output mappings. The lack of theory and simulation results have caused the discussion on PRL to be conducted at an intuitive level. The intuition often used is that if each neuron is changing to increase the expected reiriforcement for itself, then the whole network may have interesting group behavior. The first problem with this is of course that numerical methods are highly dependent on actual numbers, not just ideas. Furthermore, the intuition itself is also counter-intuitive in some ways. Optimization of individual interests only leads to optimization of a society of individuals under some special circumstances. A loose analogy is capitalism: it works fairly well but not optimally. The optimization of individuals' interests often leads to pathological competition instead of cooperation and cooperation as a side effect of competition. If a group of self-interested individuals is asked to achieve a specific performance goal, say an altuistic goal, it is highly unlikely that the unconstrained interaction of these agents would be able to achieve it In reinforcement learning, the performance measure is how close in some sense the learned mappings are to the optimal mappings. So as in unsupervised learning, the interesting group behavior may be interesting but may not be what we want Considering the learning of a network mechanistically, we can see that in order to understand how a network can converge correctly under reinforcement learning, we must have some idea what the reinforcement is telling each neuron. Since the reinforcement is the only guide that any neuron 28 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm in the network has for converging to the correct weights, its statistics cannot be arbitrary. Otherwise, the network would be taught nonsense. The statistics must satisfy certain properties. Intuitively, the weights can only be expected to converge correctly if, on average, they are pulled toward the correct values more often than they are pulled away. Furthermore, from the perspective of any one neuron in a network, the environment is nonsta-tionary because of the other neurons in the network. This is true even when the reinforcement is a deterministic function of the input and output vectors. An indication of the problem of nonstationarity can be seen in the convergence analysis performed for SLA. Convergence can be shown for them only if some restrictions are placed on the nonstationarity (refer to p.257 in [40]). With a multi-layer network, the reinforcement is related to a neuron in the interior of the network in a complex way. If the network is fully stochastic, meaning that all neurons have stochastic outputs, the action of an interior neuron is stochastically transformed by the neurons in higher layers. With more layers, this problem becomes more severe. So we can expect that the convergence of a network of neurons to be certain only with some restrictions on the architecture of the network and on the nature of the reinforcement. 3.2 Single-Layer Convergence Theorem It is found that if the environment satisfies one of two conditions given below, the AR.P convergence theorem can be extended from a single neuron to a single layer of neurons. The structure of a single-layer network is shown by Fig. 3.2.10. The extension shows that learning under the AR.P algorithm, a single-layer network will converge to the correct weights. The extension shows that if the conditions of the single neuron theorem are satisfied and if the environment satisfies one of two sufficient conditions, the weights in an arbitrarily large single-layer network will converge with probability 1 to values with which the network correctly classifies the training input set One condition (Condition A) requires that for all output vectors, the probability of reinforcement being 1 (success) has one of two values; the output vectors having at least lk correct elements have the higher probability whereas the output vectors having less than lk correct elements have the lower probability. 29 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm The other condition (Condition B) requires that the reinforcement has a higher probability of being 1 for output vectors having a higher number of correct elements. Fig. 3.2.10 shows the basic parts of the reinforcement learning set up for a single-layer network. There are L neurons in the network. The dimension of the input vector of each neuron can be different and is denoted rij for the jth neuroa Each neuron can be trained by input vectors from its own unique training input set; the input set for the jth neuron is denoted Xj. The size of the different training input sets can be different as well and is denoted TOj for the jth neuron. The ith vector from the Xj training input set is denoted xf?'\ The notations used in this and later sections are similar to those in [24], with some changes added to account for having more than one neuron and for allowing each neuron to have its own input set. The correspondence between the two set of notations is listed in the Appendix. Environment a a Figure 3.2.10. The reinforcement learning situation for a single-layer network with L neurons. The following is based on the generic form of the AR_P algorithm described by eq. (1) and (2). Denote the output vector at time k as a* = { o u t , • • • > « L f c } and the output vector having the highest Pr {bk = 1} at time k as a£ = {a*[k, • • •, a*Lk}. Define lc to be the number of non-zero elements in a* + a£. Consider the following conditions on the environment: 30 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm Condition A. At any k G {1, • • - ,oo} there exists a lk such that Pi{h = l\lc>lk}=Ph (13) ?T{bk = l \ l c < l k } = P l where 0 < pi < Ph < 1 and 1 < lk < L is an integer and all three values may vary with k. for all k G {1, • • •, oo}, where //, are integers and 0 <//<//,< L. Note that this does not require that Pr {bk = 1 | lc = h} = Pr {bk = 1 \ lc = /2} if h,h are integers and are the results of different output vectors and l\ = /2. The following four conditions (l)-(4) are modified versions of the Conditions (C1)-(C4) of Barto's proof of the AR.P convergence theorem as shown in the Appendix. Condition 1. Each input set X,- = {xj1}, • • •, x$mi)} is a linearly independent set, for all Condition 2. For each x ^ 6 Xj and k > 1, Pr lxjk = ^ )) = £}J > 0, for all j G {1, • • •, L} where Xjk is the input vector to the jtii neuron at time k. In other words, every neuron can have its own input set, as mentioned before, and each input vector from every input set has a non-zero probability of appearing at the input of its corresponding neuron. Condition 3. Each of the L independent identically distributed random variables, denoted rfjk for the jth neuron, has a continuous and strictly monotonic distribution function This is true for all j e {1 , • • • , £ } . Condition 4. Denoting the learning factor for the jth neuron as pjk, the sequence pjk has the following properties: Condition B. Pr{6* = l K = /A}>Pr{6fc = l | / c = /,} (14) Pjk > o, (15) k k 31 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm for all j <E {!,•••,£}. Theorem. If Conditions (l)-(4) are true and if the environment satisfies ONE of Conditions A and B, at the limit A —• 0 the AR.P convergence theorem can be applied to each neuron in a single-layer network. This means there exists a 6°jX € 7Zn' for the jth neuron such that the random process {^ fc}fc>1 generated by the AR.P algorithm in a reinforcement learning task converges to 0?A with probability 1 (that is, Pr j^ ni Ojk = ^JA j = *)• where for all Xj e Xj and for all j e {1, • • •, L} f 1, if d)k > djkl where d\k = Pr {&* = 11 Xj f c = x}'', ajk = l | ^ = Pr{6fc = l |xj f c = xV,aifc = -l} are called the success probabilities. (17) 3.3 Proof of the Single-Layer Convergence Theorem The following proof uses terminology similar to those in the proof of the AR.P convergence theorem in the Appendix. It is helpful to read the proof in the Appendix before reading this proof. Proof: Define two intermediate variables: V] = Pr{6* = 1 | aik= aiy • •, a(j_1)A. = ay_i, Ojjt = 1, oy+i)fc = aj+i,- • •, aLk = a^ , X t t = Xi.--- >xy-i)* = Xj-i,Xjjt = x5 , ,'),x(i+1)Jt = Xj+i,---,XLk = XL}, and VJ1 = Pr{6fc = 1 | aik= <x\,- • ;a(j_i)jk = <*j-i, ajk = -1, oy+i)* = aj+i,* • = o-u XlJfe = Xl, • • ' , X ( j _ 1 ) f c = Xj-l,Xjfc = x^0, x (j + 1 ) f c = Xj+i, • • •, xLfe = XL} (18) where a(.),X(.) a1® dummy variables. For any 2 < j < X - 1, the success probabilities of the jth neuron in a network with L > 5 neurons can be written as 32 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm dh= S E W II Prfa* = a*l*** = X*}Pr{*** = X*}] « i € { l , - l > XiGXi ^e{i,---,x> > w ' a^G-fl,-!} X(j-i) e X i- i a y f l ) € {i, _i} X ( j + 1 ) e x i +i Indexed action and input probabilities ar,€{l,-l} X L ^ X L Indices (19) and D7* = E E ^ II Pr = I *** = X*}Pr{a** = x*}]. ai€{l,-l} Xi€Xi «*e{l, •••,£} Qy_l)€{l,-1} X(j-l)€Xj_i X(i+i)eXi+l OL 6 {1>-1} X L ^ X L (20) The expressions are similar for j G {1,2, i — 1, L} and L < 4. Lemma 1. Comparing each term in the expression for djk against the corresponding term in the expression for djk using either Conditions A and B (example in the discussion) shows that d}k>djk\ Yxfew), Vj€{l,-",J}, V*e{l,-,oo}, (21) 33 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm and d)k<djk\ Vx^eu;- 1, Vje{l,-•-,!}, V*e{l,-",oo}. (22) These inequalities are the same as in the A R . P convergence theorem. Proof of Lemma 1: To compare the expressions for djk and djk in eq, (19) and (20), we only need to compare the indexed term Vj against the corresponding indexed term VJ1. This is because, for any index, the indexed action and input probabilities for both dl-k and djk are positive and identical to each other. Since for any particular instance of the indices, the input sets are identical for both Vj and VJ1 in eq. (18), we only need to consider how these terms vary as {ai, • • •, ar,} vary. Supposing the jth element of ak is 1 (the following analysis is similar for the case of the jth element of a*- being -1) , meaning that € u>|, then for the same {ai, • • •,aj-i, a^ +i, • • • , O I L } , the number of correct outputs for the term Vj is always 1 greater than that for the term VJ1. If Condition B is true, every instance of Vj is greater than the corresponding instance of Vj1. Therefore, d)k > djkl. For Condition A , define l\ as the number of correct elements in {ai, • • •, a3-_i, 1, a J +i , • • •, a*,} and I'1 as the number of correct elements in {ct\, • • •,ctj-i, —1,aj+u • • •,ai,}. We can see that 1 < ^ < £ and 0 < l~l < L - 1. For each instance of {ai, • • •,aj_i,aj+i, • • •,OLL}, l\ = l^1 + 1. If Condition A is true, VJ = VJ\ if ll < h and l~l < lk 3 3 (23) VJ = VJ\ iill >/fcand/c"1 >/fc. But when l\ = lk, we have l'1 + 1 = lk and so Z"1 < lk. That is when Vj > VJ1. Since 1 < lk < L, its range is the same as that of Regardless of the actual value of lk, there will be at least one instance of {ai, • • •, aj-i, 1, aj+i, • • •, ax,} where l] = lk. Therefore, djk > djk. Q.E.D. Note that since d]fc and djk are dependent on Pr {a^ = a^ | x^ - X^}. (4> i1 which are in turn dependent on the time-varying weight vectors fy*, (cf> ^ j), djk and djk are not stationary for this reason alone. Nonstationarity interferes with the proof of the A R . P convergence theorem in 34 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm the form of a value of pj(0j) = Pr jo,-* = 1 | 0j* = Oj,Xjk = x^'*} denoted as (3j\ which makes the expression u>V(9j, A) equal to zero. See Lemma A2 in the Appendix. The fact that rfj*,d~* are time-varying makes /3j\ time-varying (see eq. (49)). And this in turn makes 9jX time varying (see eq. (69)) and prevents the proof from reaching the conclusion in [39] or [62]. But acxx>nling to Lemma A2 in Barto's proof, f l , if d] >dj* Mjx=\ , (24) lo, if rf} <djl. So as A -»• 0, 0jX -*• constant for all k if one of Conditions A and B is satisfied. Then the rest of the proof of the AR.P convergence theorem can proceed to conclusion. Q.E.D. 3.4 Discussion of the Single-Layer Convergence Theorem To see how the expressions for djfc and dj*1 compare, look at a simple example where L = 3. Furthermore, all three neurons receive the same input vector (i.e., ii = i 2 = h and xi = X2 = x.i), and x^ € , x2^ 6 u>2, x^ 6 which is equivalent to saying that = 1, a2* = 1> a,u = 1 are the correct (best) actions for the 1st vector from the xi training input set For neuron 1, the success probabilities for the 1st input vector from the xi training input set can be written as: d\k = Pr{&* = 1 | cu = 1,02^ = - M . u = -l.xifc = x(11)}Pr{a2* = -l}Pr{a t U = -1}+ Pr{&* = 1 | aik = l,a 2* = - l , a 3 * = l,xi* = xf)}Pr{a2A: = -l}Pr{o.u = 1}+ Pr {&* = 1 | aik = M 2 j t = 1 ,03* = -l.xufc = x^ } Pr{a2* = l}Pr{a,u = -1}+ Pr jo* = 1 | oi* = l,a 2* = l,o S f c = l,xifc = x^ } Pr {o2k = 1} Pr { 0 3 * = 1} dik =Pr{i f c = 1 | oi* = -1,02* = -1,03* = -l,xu = x(11)}Pr{a2* = -l}Pr{o3* = -1}+ Pr {&* = 1 I a\k = -1,02* = -1,03* = l,xi* = xj1*} Pr{a2* = -l}Pr{o3Jt = 1}+ Pr {&* = 1 | oi* = - l ,o 2 f c = l,a 3 f c = -l,xlfe = x^j Pr{a2* = l}Pr{03* = -1}+ Pr{&* = 1 | alk = - l , a 2 * = 1,03* = 1,*!* = xS1)}Pr{a2* = l}Pr{o3* = 1}. Denoting the four terms in the expression for d\k as f (1,1), t (1,2), * (1,3), and $(1,4), in the same order as in the expressions. The corresponding notations for the four terms of djk are then* (-1,1), f (-1,2), r(-l,3), and i(-l,4). 35 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm With Condition A, t (1,1) = t (-1,1), t (1,2) = t (-1,2), t (1,3) = t (-1,3), t (1,4) > t (-1,4), if lk = 3 *(1,1) = f (-1,1), <(1.2) ><(-!» 2), <(1,3) > t(-l,3), t(l,4) = *(-l,4), if lk = 2 (33) j(l,l) > i ( - l , l ) , i(l,2) = *(-l,2), t(l,3) = t(-l,3), t(l,4) = t(-l,4), iflk = 1 showing that d\k > djkl. With Condition B, t(1,1) > t(-1,1), t(l,2) > *(-l,2), <(1,3) > t(-1,3), and <(1,4) > i (-1,4), again showing that d\k > d~[k. Condition A says that the probability of success is higher if the number of correct outputs is above a certain level rather than below. Pr{6* = l}isa step function of lc. Learning tasks satisfying this condition can be very difficult tasks. At the extreme, Condition A permits environments such as these, for example: P r l M l ) 1-Ol I f I f 1 j 1 1 > 0 lk L 0 L=lk lc, # of correct outputs lc ,# of correct outputs (a) (b) Figure 3.4.11: Examples of environments satisfying Condition A. In Fig. 3.4.11a, the success probabilities are uriiformly high. Here at each time step, the weights are almost always updated to increase the probability of generating the same output again. They are pulled toward the wrong weights almost as often as they are pull toward the correct weights. In Fig. 3.4.11b, the success probabilities are uniformly low. The weights are pulled toward one extreme and then the other in an oscillatory manner. Fortunately in this case, the theorem requires A —• 0; therefore, penalty exerts a very small influence in this case. But convergence would be slow because 1-36 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm the reinforcement would seldom be a reward. Pr{r*=l} • 1-0 L le, # of correct outputs Figure 3.4.12: Example of an environment satisfying Condition B. Condition B says that the higher the number of correct outputs, the higher is the probability of success. Here Pr{6* = 1} is a monotonically increasing function of lc. Fig. 3.4.12 shows an example of this condition. This condition is equivalent to saying that the success probability has a single maximum in the output space. There is an analogous condition in the interaction of a team of SLA in a game with identical reinforcement to each automaton: the game matrix must have a single equilibrium point to guarantee convergence to the optimal game strategy (p.327 in [40]). The two conditions clearly do not cover all possible environments. Even in the relatively simple case of SLA games, there are many game matrices (or environments) which will result in a game converging to a suboptimal strategy. There will probably be more environments in which a single-layer network would converge wrongly or fail to converge . Problematic environments would be the ones that have d}-k < d~k when X j * 6 and/or d^k > d~k when X j * € <*>~l for some subset of k 6 {1, • • •, oo} and/or for some subset of j G {1, • • •, L} and/or for some subset of the vectors in the Xj input sets. In my simulations of multi-layer networks where all the neurons are stochastic, this problem is present for the hidden neurons (data not shown). The above proof is significant because it is a step toward understanding the factors affecting the convergence of a multi-layer network. Furthermore, it is significant because a network with multiple 37 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm outputs is a case that justifies reinforcement learning. For a network with only one output and with a deterministic environment, when bk = 1, we know the desired output is the current ak; when bk = — 1, we know the desired output is the complement of the current ak. In effect, the reinforcement directly shows the network what the desired output should be, making the learning situation equivalent to supervised learning. If the environment is not deterministic, the situation is equivalent to supervised learning using a training set with some wrong mappings. But when a network has multiple outputs, the scalar reinforcement can no longer show the network what the desired outputs should be. This is when the reinforcement learning situation becomes more distinct from other learning situations. The convergence of single-layer networks has relevance for multi-layer networks. The proof decomposes the problem of the conditions for the convergence of a network into the conditions for the convergence of individual neurons. It worked for the single-layer case because the success probabilities for each neuron are simply related to the nature of the environment For multi-layer networks, if for all the neurons in a network the inequalities (21) and (22) can be shown and the Conditions (1) to (4) are satisfied, the network as a whole will converge. A desirable extension to the single-layer convergence proof is to find other conditions beside Conditions A and B where the inequalities (21) and (22) are true. Another extension would be to show that the inequalities (21) and (22) can be relaxed without preventing theoretical convergence. Perhaps it is sufficient if these inequalities are true for most of k 6 {1,•••, oo} and/or most of j € {1, • • •, L} and/or most of the vectors in the X j input sets. This extension would be valuable to the analysis of multi-layer networks. The usefulness of a single-layer network is fairly limited. Such a network can only divide the input spaces into linearly separable regions. Forming multi-layer networks with AR.P elements is not promising for the reasons described earlier. A much better way is described in the next section which is a part of a whole new family of more engineering oriented algorithms. 3.5 A New Algorithm for a Single-Layer of Multilayer Sub-Networks Reinforcement learning can be decomposed into search and storage stages. The search stage can be done by the search element of the AR.P algorithm or by other search methods. The AR.P search 38 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm stage is a random search of all possible output combinations. This is a brute force method which does not make use of knowledge about the environment A search based on some knowledge of the environment may be designed to be much more focused. This is what human beings do; we don't attempt all possible action combinations and sequences; from what we know about a problem, we only attempt a small subset which is likely to achieve our goal. The storage stage can be done by a number of different learning algorithms on top of a number of neural network architecture or other mathematical structures. The storage stage used in the AR.P algorithm is not efficient compared to other methods. It is really a gradient descent procedure which is unnecessary for a linear sum. The speed of the store operations may be increased by using more sophisticated techniques such as recursive least squares for the update of the weights of each neuron. A side benefit of the decomposition is a way to combine reinforcement learning and supervised learning. If the desired output values are available, they are stored by supervised learning and the search component would not be used. Otherwise the best output values are found by the search component and then stored by supervised learning. This decomposition idea has been commented upon in [63]. Anderson [30] used an algorithm having a modified back propagation stage to leam to balance an inverted pendulum; the network has one output. The idea was also used in Yeung's paper on multi-action associative SLA [64]; in this paper the network only generates one output at a time, thus the task is an easier case than in the algorithm below. Here we present a new algorithm in this spirit. It is different from the two algorithms above in that a more noise tolerant supervised learning algorithm is used for the supervised learning stage. In a multi-output network, there are more sources of noise than in the single-output case. Not only are the reinforcement and the output of a neuron stochastic, the environment for each neuron is also non-stationary because of the presence of the other neurons. A number of candidate supervised learning algorithms were tried. A version of PRL using the standard backpropagation algorithm was tried but convergence took an inordinate number of time steps and was also unreliable. Two versions using different recursive least squares algorithms [16] [20] for multi-layer network often fail to converge to the global optimum. A version using a fast algorithm based on the idea that the 39 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm weights changes according to a set of stiff ordinary differential equations [18] was also tried but failed due to the complexity of the implementation of the algorithm and the lack of some details in the paper. A number of other algorithms were not used on the grounds of 1) the unsuitability of the required learning situation, 2) an algorithm only able to converge to a local optimum, and 3) the complexity of an algorithm. Finally, a suitable supervised learning algorithm was found in [65]. This algorithm, which will be called the smoothed backpropagation algorithm, is specifically geared for learning from noisy output exemplars and deterministic input patterns. This is exactly what is needed in PRL. A new algorithm is devised using a modified form of the smoothed backpropagation algorithm; it will be called the Multi-output Primary Reinforcement Learning (or MPRL) algorithm. This algorithm is so named because it is especially suitable for multiple output networks; a multiple output network has more noise sources. The architecture of the resultant network is a single-layer network of sub-networks. Each sub-network is a single-output, 2-layer feedforward network. The overall structure of the new network is shown in Fig. 3.5.13. The fine structure of the subnetworks in the new network is shown in Fig. 3.5.14. E n v i r o n m e n t L 1 i 1 a2k aLk Subnet Subnet Subnet 1 2 * * * L Xlk X2k XLk Figure 3.5.13: The overall structure of the network for the MPRL algorithm. 40 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm a A *1 x2 -^"o * Figure 3.5.14: The structure of the subnetworks used in the network for the M P R L algorithm. For each sub-network, the forward propagation calculations are done i n the standard feedforward manner, starting from the input layer toward the output: yj = f(*$ + E"fi**)- i e { i , • • •,m> where no is the number of elements i n the input vector, n\ is the number o f neurons i n the first (hidden) layer; Xi is the ith element of the input vector, tojP is the weight from xt- to the j t h neuron i n the 1st layer of neurons; w$ is the threshold weight for the j t h neuron i n the 1st layer of neurons; yj is the output of the same neuron; vfp is the weight from the j t h neuron i n the 1st layer to the output neuron; and and wf^ is the threshold weight to the output neuron. 41 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm The squashing function in each neuron is represented by z is the action probability and is used to generate the output a according to ( 1, with probability z, a = { (36) 1 0, with probability 1 - z. The reinforcement b G {1,-1} is returned by the environment. 1 indicates reward and -1 indicates penalty. At each time step, the weights are updated by w(t + l )=W(t+l) + Aw(<+l) (37) where w is a vector of all the weights in the network, W is a vector of the estimates of all the weights, and Aw is a vector of weight changes due to the latest reinforcement. Aw is calculated according to the following equations, in the order presented: (a - z)z(l - z), if b = 1 A(l - a - z)z{l - *), if b = -1 Awj 2 ) = pSyj Av,?=p,6 <38) Aiuj-P = p6wf]yj(l - yj)n Awf) = Pt6wfyj(l - Vi) where p > 0 is the primary learning rate, pt > 0 is the learning rate for the thresholds, and 0 < A < 1 is the penalty factor. W is calculated according to 1 A f - i W(T+1) = ^ E W ( I - A ) (39> • { . M n where M is the size of the averaging window. 42 Chapter 3: Development of the Single-Layer Convergence Theorem and a New Primary Reinforcement Learning Algorithm This new algorithm is significantly more complex than the AR_P algorithm. Given there is no convergence (to global optimum) theorem for any form of backpropagation for multi-layer feedforward networks, a fact (Xinfirmed in the simulations in the literature, it is not likely that strong convergence theorems will be found for the MPRL algorithm which uses a modified form of backpropagation It is possible to obtain convergence results for a future algorithm similar to MPRL. This algorithm would not use multi-layer sub-networks. Instead each sub-network is replaced by a linear weight sum of some orthogonal basis functions such as Chebychev polynomials [66]. Thus we see that simulation is the only tool we have to judge the effectiveness of this algorithm. In the next chapter, the MPRL algorithm and the single-layer convergence theorem will be tested by simulatioa 43 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm Chapter 4 Simulations of Single-Layer Networks and the MPRL Algorithm The workings of PRL using a network with all stochastic neurons have usually been conjectured with words. It has been tested on small networks; these simulations are usually on two layer networks and always with binary inputs and a single output But when PRL is used as a part of ACRL, the purpose of the PRL network is really to learn mappings of multiple continuous inputs to multiple outputs which are preferably continuous, reinforced by a continuous reinforcement The simulations done in this chapter are closer to meeting this purpose than any other simulations in the literature. The thrust of the simulations is to find a network structure and algorithm to learn multiple mappings of continuous inputs to discrete outputs. Such networks would make ACRL more practical. The simulations are divided into two groups. In the first group, we want to confirm the theoretical results on single-layer AR.P networks presented in the last chapter. The results of a series of simulations done on single-layer networks under various environmental conditions will be shown. The second group of simulations is presented which summarizes the search for a multiple output network which will learn more useful mappings under reinforcement learning. These simulations includes two-layered AR.P networks and those trained by the new PRL algorithm. 4.1 Simulations of a Single-Layer, 3-Neuron Network The first five sets of simulations are done for a 3-neuron single-layer network. Each neuron is to learn the correct responses for each of two input vectors. The input vectors are {1, -1} and {1,1}. They have equal probability of being presented to the network. The correct responses of the 3 neurons to these input vectors are {1, —1,1} and {-1,1, -1}, respectively. The inputs to each neuron are independent of other neurons. In the last simulation in this section, networks with varying number of neurons are tested. The input vectors are the same as in the other simulations. The correct responses {a*, a%,a£} for the neurons in a network with L neurons are {1, — 1,1, • • •} and {—1,1, — 1, • • •}, respectively. 44 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm The weights of the neurons are initialized to 0.0. It is not necessary to initialized the weights with random non-zero values to perform symmetry breaking because the network has only one layer. This means the action probabihty for both input vectors has an initial value of 0.5. The learning rate is positive and, to satisfy Condition 4, proportional to 1 /k°-M where k is the iteration count and is initially equal to 1. The density function for the noise term in eq. (1) is 1/(1 + exp (-2.0 * x)) which gives a monotonically increasing distribution function. Thus the Conditions 1 to 4 of Barto's theorem are satisfied by this simulation set up. 4.1.1 Environment conforming to Condition A with 4 = 3 This set of simulations is done for Condition A, with a constant lk = 3, pn = 1.0 and pi = 0.0. The penalty rate A is set to 0.1. Fig. 4.1.15 shows the average results over 30 runs. 1.2 0.8 1 0.6 Xl o 2 0.4 3 0.2 -0.2 Pi = 3.2 f\ - i 1 1 I w 1 pi = 800.0 ... ' _ . ^ ^ ^ ^ pi = 0.05 \J neuron 1 " Pi = 800.0 Pi = 0-05 m A neuron 2 Pi = 3.2 1 i l l 1 \) • t i i 0 100 200 300 400 500 600 700 800 900 1000 Time Step (X 10) Figure 4.1.15: Average results for Condition A with lk = 3. In Fig. 4.1.15, the top 3 curves show the action probability of neuron 1 for the input vector {1,-1} and the lower 3 curves show the same for neuron 2. Only the results for these neurons and 45 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm this input vector are shown for the sake of clarity. The desired final action probability for neuron 1 for that input vector is 1.0 and for neuron 2, 0.0. The behavior for different combinations of neuron and input vector are very similar to these. The two rough curves are for p\ = 800.0, the two smooth but rapidly converging curves are for pi = 3.2, and the two slowly converging curves are for p\ = 0.05. These parameters are chosen to allow the curves to be distinguishable from one another on the graph. The surprising result here is how the learning algorithm is tolerant of very high learning rates. The next 3 figures show individual runs for each of the 3 different values of p\. In Fig. 4.1.16 and 4.1.17, there are three curves in each figure. The curves for neuron 1 and 3 are close together at the top whereas the curves for neuron 2 are at the bottom of these graphs. Their behaviors are very close to what is indicated by the average results in Fig. 4.1.15. In Fig. 4.1.18, the top graph shows a run for neuron 1 and the bottom graph shows a run for neuron 2. Note how the action probabilities fluctuate wildly at the beginning of the run and then settle to the correct final values. This is caused by the fact that the high learning rate causes the action probabilities to oscillate hard from one extreme to the other as long as one of the outputs is incorrect. But as soon as all three outputs are correct, the oscillations would stop. i 1.5 1 0.5 0 -0.5 -1 1 1 1 1 neuron 1 1 1 i 1 i s s = = = = = = = = = : = : = = = = ^ neuron 3 t • --• • I I I neuron 2 0 100 200 300 400 500 600 700 800 900 1000 Time Step (X 10) Figure 4.1.16: Single run results for Condition A with lk = 3 and pi = 0.05. 46 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm 1.5 X> i B 0.5 r o o < -0.5 i 1 1 r -neurons 1 and 3 - i r-neuron 2 0 100 200 300 400 500 600 700 800 900 1000 Time Step (X 10) Figure 4.1.17: Single run results for Condition A with /* = 3 and p\ = 3.2. p c o < 100 150 200 Time Step (X 10) 250 300 XI p c _o o < 50 100 150 200 Time Step (X 10) 250 300 Figure 4.1.18: Single run results for Condition A with /* = 3 and pi = 800.0. 47 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm 4.1.2 Environment conforming to Condition A with /* = 2 This set of simulations is for Condition A with a constant lk = 2, = 1.0, and pi = 0.0. Fig. 4.1.19 shows the average results over 30 runs for 3 different values of p\. 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time Step (X 1000) Figure 4.1.19: Average results for Condition A with /* = 2. In Fig. 4.1.19, the top 3 curves are for the action probability of neuron 1 for the input vector {1,-1} and the lower 3 curves are for the action probability of neuron 2 for the same input vector. The desired final action probabilities are 1.0 for neuron 1 and 0.0 for neuron 2. The two curves in the center of the graph are for p\ = 0.001, the two curves at the top and bottom of the graph are for pi = 0.4, the two rougher curves are for p\ = 1.6. It is clear from the graph that the learning rate can not be very high. The reason for this will be discussed at the end of Subsection 4.1.3 where the simulations from Subsections 4.1.1 to 4.1.3 are compared. More details of the individual runs are presented in the next 3 figures. 48 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm -0.5 1 1 1 1 , 1 "1 1 1 neurons 1 and 3 -1 - , 1 1 neuron 2 , 1 1 i i i , I 1 —- 1 1 1 1 I 1 I 1 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time Step (X 1000) Figure 4.1.20: Single run results for Condition A with /* = 2 and p\ = 0.001. 1.5 1 I 1 0.5 < 0 -0.5 1 1 —T 1 1 1 1 1 1 neurons 1 and 3 neuron 2 1 1 1 1 1 1 1 I 1 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time Step (X 1000) Figure 4.1.21: Single run results for Condition A with lk = 2 and p\ = 0.4. 1.5 1 x I c 0.5 .2 o < 0 -0.5 • 1 1 • 1 I 1 1 1 neuron 1 neuron 3 ^ * * ^ / * " ^ ^ neuron 2 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time Step (X 1000) Figure 4.1.22: Single run results for Condition A with lk = 2 and p\ = 1.6. The single run results presented in Fig. 4.1.20 and Fig. 4.1.21 are close to their average results. The results of other runs using the same parameters as in these figures are similar. We can see that the 49 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm learning is fast and almost monotonia But in Fig. 4.1.22, we see how the learning performance of the network is degraded by a higher p\. The two curves which quickly reach the top and the bottom have attained the correct action probability values. As those curves reach their desired values, the remaining neuron has very little to no information to guide it to its desired value. This phenomenon will be discussed in detail at the end of Subsection 4.1.3. 4.1.3 Environment conforming to Condition A with lk = 1 This simulation is the most difficult of those testing Condition A. The parameters used are lk = 1 which stays constant, pn = 1.0, and pi = 0.0. Fig. 4.1.23 shows the average results over 30 runs. i § •a o < 1.2 0.8 0.6 0.4 0.2 -0.2 Pi = 0.1 P i = 0.8 Pi = 0.001 Pi = 0.001 Pi = 0.8 Pi = 0.1 _J L_ A neuron 1 A neuron 2 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time Step (X 1000) Figure 4.1.23: Average results for Condition A with lk = 1. The two curves at the center of the graph in Fig. 4.1.23 are for pi = 0.001, the two curves at the outer extremes of the graph are for pi = 0.1, and the two rougher curves are for pi = 0.8 The behaviors showed in these simulations are the same as in Subsection 4.1.2. The performance increases with pi up to a point and then is degraded. 5 0 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm The next 3 figures shows the behavior of individual runs. X> i 1.5 1 0.5 0 -0.5 1 1 1— ! 1 1 1 1 1 neurons 1 and 3 1 . 1 neuron 2 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time Step (X 1000) Figure 4.1.24: Single run results for Condition A with /* = 1 and p\ = 0.001. 1.5 1 -i c 0.5 .2 O < 0 r -0.5 ~ i 1 r— neurons 1 and 3 neuron 2 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time Step (X 1000) Figure 4.1.25: Single run results for Condition A with 4 = 1 and pi =0.1. 1.5 1 0.5 r 0 -0.5 ! 1 1 1 1 1 1 1 1 neuron 1 • neuron 3 i i i neuron 2 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time Step (X 1000) Figure 4.1.26: Single run results for Condition A with 4 = 1 and p\ = 0.8. 51 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm The single run curves shown in Fig.'s 4.1.24 and 4.1.25 deviate little from their averages. There are two curves at the top in each of these two graphs and one at the bottom. The two top curves are for neuron 1 and 3, and the lower curve is for neuron 2. In Fig. 4.1.26 how the degradation in performance occurs is clearly shown. The top curve is for neuron 1, the middle for neuron 3, and the bottom for neuron 2. The action probability of neuron 1, quickly reaches its desired value. Notice that it did not come very close to 1 initially. This gave the other neurons a chance to learn. But a little later when the action probability of neuron 2 is close to 0 and that of neuron 1 is closer to 1 yet, the third neuron is very little chance to learn. In comparing the simulations in Subsections 4.1.1 to 4.1.3, we can see that the learning rate has to be reduced as /* is decreased. Also, learning for = L, as in Subsection 4.1.1, is much more robust then for 1 < /* < L — 1, as in Subsections 4.1.2 and 4.1.3. In the single-layer convergence theorem, convergence is shown to be sure for both cases. The discrepancy is caused by the assumption of mfinite precision requirement of the theorem. This requirement is common in many proofs of convergence of algorithms. How finite precision comes into play, most noticably in the case of 1 < / * < £ — 1 is that when d^k is very close to d~k, the quantization errors becomes significant in changing the sign of w-;(0j,A) and the accuracy and precision of the random number generator become important. Fig. 4.1.26 shows an example of this effect, when the action probability of neuron 1 for the input vector {1,-1} (indicated by the top curve) is very close to 1 and that of neuron 2 is close to 0, the success probabilities of neuron 3 (whose action probability for the input vector {1,-1} is indicated by the middle curve), d\k and d^k, are both very close to 1. Then (see eq. (49) in the Appendix) «>3(*3. A) = p1/ (l - p1/) (<# - dT 1 ) + A(l - 1- 1 - (p 1' 1) 2^ 1 « 0 (40) (Note that the ij superscript is 1 for the input vector {1,-1} and the subscript k is omitted for simplicity.) which makes it easy for round off errors to change its sign. Furthermore, if the random number generator is not able to correctly realize the slightly non-zero joint probability of both neuron 1 and 2 producing the wrong actions, then d^k and d^k may have the wrong values and may even be equal to each other, making learning impossible. So the problems with high learning rates for Subsections 4.1.2 and 4.1.3 do not contradict the single-layer convergence theorem. 52 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm At lower learning rates, there is time for every neuron to get an adequate sampling of the consequences of its actions and thus learning is sure. There is no false convergence observed for lower learning rates. Another observation from these simulations is that a low A favors the learning in Subsection 4.1.1, and a higher A favors the learning in Subsections 4.1.2 and 4.1.3, Subsection 4.1.3 in particular. This is intuitive because in Subsection 4.1.1, the reinforcement is often penalty; a low A lessens the cumulative effects of penalty. But in Subsection 4.1.3, the reinforcement is often reward; the neurons should capitalize on a penalty feedback when it arrives. 4.1.4 Environment conforming to Condition B This simulation is to test the convergence properties of networks under Condition B. The reward probabilities given a certain number of correct output are as follows: Pr{o* = l | / c = 3} = l , Pr{6jt = l | / C = 2} = 2, (41) Pr{6fc = 11 / c = 1} = i, Pr{6fc = l | / C = 0} = 0. The penalty rate A is set at 0.01. Fig. 4.1.27 shows the average results over 30 runs. 53 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm 1.2 •8 •a u < -0.2 pi = 1.0 Pi = 0.01 Pi = 1000.0 pi = 1.0 neuron 2 T7 0 20 40 60 80 100 120 140 160 180 200 Time Step (X 200) Figure 4.1.27: Average results for Condition B. In Fig. 4.1.27, the top 3 curves are for neuron 1 and the bottom 3 curves are for neuron 2. All of the curves are converging to their desired values. The two curves at the center of the graph are for pi = 0.01, the two smooth curves at the outer extremes are for pi = 1.0, and the two rougher curves are for pi = 1000.0. Single runs for these parameters are shown in the next three figures. 1.5 1 & tion 0.5 o < 0 -0.5 1 1 1 —1 1 1 1 1 T neuron 1 1 1 I neuron 2 0 20 40 60 80 100 120 140 160 180 200 Time Step (X 200) Figure 4.1.28: Single run results for Condition B with p\ = 0.01. 54 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm < 1.5 0.5 -0.5 -1 1 1 r— - i r-neuron 1 neuron 2 "0 20 40 60 80 100 120 140 160 180 200 Time Step (X 200) Figure 4.1.29: Single run results for Condition B with pi = 1.0. p I <: p •3 U < 1.5 1 0.5 0 -0.5 1.5 1 0.5 0 -0.5 - i r - - i r -J U neuron 1 — i 1 1 i _ - i r -neuron 2 0 20 40 60 80 100 120 140 160 180 200 Time Step (X 200) 0 20 40 60 80 100 120 140 160 180 200 Time Step (X200) Figure 4.1.30: Single run results for Condition B with p\ = 1000.0. The single run results for pi = 0.01 and pi = 1.0 are shown in Fig. 4.1.28 and 4.1.29, respectively. They are well behaved and almost identical to the average curves, as usual. The graphs in Fig. 4.1.30 are more interesting. The top graph is for neuron 1 and the bottom for neuron 2. The curves in these graphs oscillates wildly as they did in Subsection 4.1.1. But both of them 55 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm converge to the right values. In my simulations, no false convergence in this case has ever been observed. The robustness can be attributed to the fact that as long as there is one or more outputs that are incorrect, the system always has a large enough failure probabihty. This means corrections to the action probability will continue until they are all converged to their respective desired values. 4.1.5 Performance for various values of A. In this simulation we are interested in seeing how a major factor in the AR.P algorithm, namely A which should be close to zero, affects learning. Here the environment operates according to Condition B, exactly as in Subsection 4.1.4. pi is set at 0.5. A is varied to produced the following two figures. Fig. 4.1.31 shows the time evolution of the action probabilities of neuron 1 and 2, averaged over 30 runs. Fig. 4.1.32 shows the dependence of the final values of action probability for the two neurons as a function of A. < 1 0.9 0.8 0.7 small A 0.5 0.4 0.6 |r^V*r^fW^ large A large A 0.3-0.2 0.1 0 small A 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time Step (X 1000) Figure 4.1J1: Average results for various values of A. neuron 1 neuron 2 56 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm neuron 1 neuron 2 0 5 10 15 20 25 30 35 40 Lambda Figure 4.1.32: The dependence of final action probability on A. We can see in Fig. 4.1.31 that learning is best with small A on two counts: accuracy and smoothness. As lambda is increased, the curves become rougher and converges to suboptimal values. This agrees with the analysis done in Chapter 3 for the nonstationarity problem and the analysis for the zero-crossing of wl(9, A) in the Appendix. Fig. 4.1.32 shows how the final action probabilities vary with A. If a plot of the zero-crossing of wl(9, A) versus A for various values of rf1' and rf-1' is made from eq. (49) in the Appendix, we can see this plot is similar to Fig. 4.1.32. Thus the simulation results confirm the analysis done by Barto. 4.1.6 Performance for different number of neurons In this subsection we are testing how the learning varies with the number of neurons. The number of neurons ranges from 1 to 100. The length of time of the simulations made it impractical to try larger networks. The environment conforms to Condition B. p \ is set at 0.5 and A is set at 0.01. 57 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm Fig. 4.1.33 shows the action probabihty error versus the number of time steps for different numbers of neurons, averaged over 30 runs and over all neurons for all input vectors. The probability error is calculated by subtracting the actual action probability from the desired action probability for a given input vector. The top curve is for 100 neurons; the next lower curve is for 80 neurons; this continues down to 10 neurons for the bottom curve. It is clear that the convergence is smooth and monotonia 10000 9000 8000 7000 CM & 6000 C u •s 5000 H C A ©. u 4000 tv5 u 3000 2000 1000 01 ' '— ' i" ' 1 1 1 1 1 0 10 20 30 40 50 6J 70 80 90 100 Number of Neurons Figure 4.1.33: Action probabilities versus number of time steps. Fig. 4.1.34 shows the time steps taken for the average action probability errors to reach the value of 0.0796 which is the final value for a network with 100 neurons at the end of 20,000,000 iterations. From this graph, the learning time seems to be an exponential function of the number of neurons. The learning curve is close to 1.75 • e 0 0 8 6 o n where n is the number of neurons. More data points (i.e. larger networks) would be necessary to determine whether the learning problem is exponential hard or polynomial hard in time. There is a good reason for believing that the problem is exponential hard. Suppose the inputs are constant, and suppose the network does an exhaustive 58 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm search of the output space, the number of tries needed to do this search would be 2". The reason that the number of iterations taken in the simulations is much less than 2 n is because the environment conforms to Condition B . 0.51 , , , , , , • • • 1 0.45 -0.4 -$ 0.35 -Time Step (X 2000) Figure 4.1.34: Time steps taken for the average action probability errors to reach 0.0796 versus the number of neurons. f 4.2 Simulations of Multiple Output Networks for Continuous-to-Discrete Mappings In this section, the results of three simulations are presented. They are the results of a search for a reinforcement learning algorithm which can train a multi-layer network. The first simulation is for a single-layer AR.P network being trained to make a linearly separable classification of the input space. The second is for a fully stochastic two-layer A R . P network being trained to classify nonlinearly separable binary and continuous vectors. The third is for a partially stochastic two-layer network trained by the M P R L algorithm to classify nonlinearly separable continuous vectors. In all three simulations, the environment conforms to Condition B with the same success probabilities as in Subsection 4.1.4. 59 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm 4.2.1 Single-layer AR.P network for classifying continuous inputs In this simulation, the network is consisted of 3 neurons. Each neuron is to leam a different linearly separable classification of the input space. The learning algorithm is exactly the same as in the previous section. The difference is that now continuous inputs are used to train the network. The parameters of the AR.P algorithm used here are pi = 0.5 which is kept constant for all k and A = 0.1. The weights are set to 0 initially. The input vector only has two elements, making it easy to visualize the action probability of each neuron in its input space. Fig. 4.2.35 shows the time evolution of the action probability in the neurons' input space; each row is for one neuron. Neuron 1 Neuron 2 Neuron 3 ] Wt§M 0 50 100 500 5000 100000 Figure 4.2.35: Action probability maps for a 3-neuron, single-layer A R . P network. Target The range of the input space is [-1,1] in the x-direction and [-1,1] in the y-direction. Black represents an action probability of 1, and white represents 0. The numbers indicate the iteration count The last column on the right is the target classifications. Fig. 4.2.35 shows that a single-layer AR.P networks can leam linearly separable mappings easily. This fact may not seem significant since linearly separable regions represent a tiny fraction of all possible divisions of a space. But if the weighted sum of the inputs is replaced by a weighted sum 60 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm of some functions of the inputs, the mapping can be general. An example of this is a weighted sum of Chebychev polynomials [66]. The speed of learning can also be greatly improved because the underlying gradient descent supervised learning algorithm is much slower than others that are more suitable, for example, recursive least squares. Even with the AR.P algorithm, the speed of the learning is fairly fast. 4.2.2 Two layer A R . p network for classifying binary and continuous inputs In this simulation, a long standing question is tackled. The question is whether a fully stochastic, multi-layer AR.P network is able to learn a nonlinearly separable mapping when given continuous input vectors. In all the simulations done to answer this question, the answer has been consistently no. This has not been adequately addressed in literature. The failure of this type of learning is difficult to avoid and is fairly surprising. If the weights are initialized with the desired final weights, the weights will maintain their relative values. But if the weights are initialized with values not far from the desired final weights, the weights will drift to a local minimum. The surprising aspect is that a network would learn certain decision regions given binary inputs but would fail to leam the same regions when given continuous inputs which directly specify the same decision boundaries. Fig. 4.2.36 shows the evolution of the classification of a two dimensional input space when a single-output network is trained to leam the XOR task. The input vectors and their desired classifications are the following: {1,1} classified as 1, {-1, 1} classified as -1, {-1, -1} classified as 1, and {1,-1} classified as -1. The inputs and the output are all bipolar. The network is a two layer network with 2 neurons in the hidden layer and 1 neuron at the output layer. p \ is set at 0.5 which is kept constant and A is set at 0.01. The neurons have threshold inputs which are set at 1 as shown in Fig. 1.1.1. This is necessary because the classification is nonlinearly separable. All the weights are initialized to 0. It is not necessary to initialize them with small random weights to break symmetry because the neurons are stochastic. 61 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm 0 3000 6000 9000 12000 15000 Figure 4.2.36: Action probability maps for a single-output, two layer AR.P network. As in the last subsection, the range of the input space is [-1,1] in the x-direction and [— 1,1] in the j/-direction. Black represents an action probability of 1, and white represents 0. The numbers indicate the iteration count. No target map is shown because the desired mapping is only binary-to-binary. In Fig. 4.2.36, the final classification of the input space is close to the optimal that can be achieved with linear decision boundaries. But when the final classification, shown by the map on the right side, is used as the training set to train an identical network, the network invariably converges to a local optimum. The local optimum in this case is for the output probability to be near 0 over the entire input space because this classification is correct in majority of the input space. What happens is that the threshold weight of the output neuron increases in a negative direction faster than all other weights, quickly making the output almost always 0. The reason for this is that the threshold input is fixed at 1 whereas the other inputs are fluctuating. When the classification of the input space is not balanced, the threshold weights are driven to one direction faster than the other direction and they change faster than the non-threshold weights. The convergence of the binary input case is still an open question. The 2-layer network trained to leam the XOR mapping is not a large enough network nor is the mapping difficult enough to allow a conclusion to be drawn. What this simulation does reveal is that if the inputs are continuous, a fully stochastic network is not able to leam a simple mapping even when the solution exists. Since a single-output multi-layer AR.P network is unable to leam a continuous-to-discrete 62 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm mapping, there is no point in testing the convergence of multiple output networks. 4.2.3 MPRL algorithm for classifying nonlinearly separable continuous inputs. In this simulation, a 3-output, single-layer network with 3 subnetworks is trained with continuous 2-dimensional input vectors. The 3 subnetworks are identical: 2 layers with 4 neurons in the hidden layer and 1 neuron at the output layer. All neurons have a threshold input of 1. The learning algorithm is the MPRL algorithm described at the end of Chapter 3. A = 0.5 and p = 0.1 which is held constant The thresholds use pt = 0.05, also held constant An averaging window size of 3 is used. The environment conforms to Condition B as described in Subsection 4.1.4. Fig. 4.2.37 summarizes the performance of the MPRL algorithm. The 3 rows of maps are for the 3 subnetworks. 0 5000 100000 250000 400000 500000 Target Figure 4.137: Action probability maps for a partially stochastic two layer network trained by the MPRL algorithm. The range of the input space is [-1,1] in the x-direction and [-1,1] in the y-direction. Black represents an action probability of 1, and white represents 0. The numbers indicate the iteration count The last column on the right is the target classifications. 63 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm The figure shows that all three networks have learned the correct mappings. After 500,000 iterations, the decision regions already formed will become more and more distinct. The performance of the algorithm as a function of its various parameters has not been extensively examined. The parameters used in this simulation are chosen after a certain amount of trial and error. It is quite possible that the performance can be much improved by only choosing different parameters. 4.3 Conclusions The simulations are designed to move from learning multiple classifications of linearly separable binary inputs to learning multiple classifications of nonlinearly separable continuous inputs. In the first group of the simulations, the single-layer convergence theorem is confirmed. The simulations are done for a number of different initial values of the primary learning rate, pi, for different values of A, and for different numbers of neurons. Except for high values of pi or A, the convergencies of the networks are monotonic and smooth. These simulations give us a basis for the hope that a single-layer network may be able to learn correctly in more demanding situations. In the second group of the simulations, that hope is put to the test in three different simulations. In the first simulation, a single-layer AR.P network identical to those in the previous group of simulations is trained to classify continuous values that are linearly separable. The simulations showed that the convergence is fast and certain. In the second simulation, a search is made for an answer to the speculation in the literature regarding the learning ability of a fully stochastic, multi-layer AR.P network. A two-layer AR.P network is trained to classify nonlinearly separable binary and continuous inputs. The results showed that the multi-layer AR.P networks are not able to leam the classification of continuous inputs. An irony is that the network is able to leam a classification identical to that desired in the continuous case if the network is trained to classify binary inputs. My conclusion is that the binary case is a much easier learning situation. Also, the binary learning task is a minimum parity task. If the binary case had been more difficult, it is not certain that learning would be successful. In the third simulation, the MPRL algorithm is tested. The results are encouraging. They represent the most significant simulation results for a multi-output network up to this time. Using partially 64 Chapter 4: Simulations of Single-Layer Networks and the MPRL Algorithm stochastic networks, the M P R L algorithm is the first algorithm to demonstrate success i n a long string of papers to find an algorithm using fully stochastic networks for this problem. 4.4 Future Simulations There is a number of variation on the simulations that were not tried. Possibly the most important variation is an environment which deviates slighdy from both Condition A and B. M y suspicion is that the network would sti l l be able to converge i n such a case. The idea that the probability o f reward should be higher when any particular neuron produces an output which is identical to the optimal value is fundamental. If this idea is violated significantly, the average change of the weights may not be monotonia If the network converges at a l l , the average weight changes of a subset o f the neurons would probably have to change directions multiple times as the network learns. W h e n this k ind of behavior is exhibited i n learning, it would probably make convergence a chance occurrence and make convergence analysis impossible. Further simulations would help to see the effects o f the deviations and would help to gain some understanding useful for A C R L where the secondary reinforcement may deviates from both Condition A and B which are ideal conditions. 65 Chapter 5: Conclusions Chapter 5 Conclusions 5.1 Summary Reinforcement learning is an interesting type of neural network learning. ACRL algorithms have potentials to solve many sequential decision problems. The main attribute of reinforcement learning is that it searches for the best solution and learns it. ACRL algorithms have been used to solve game problems and learning control problems. Compared to supervised learning, ACRL algorithms require no prior knowledge of a good solution. In this thesis, we focused on PRL which is a component of ACRL. We were motivated by the lack of convergence results for any kind of network and working algorithms for more general type of learning situations, namely learning multiple continuous-to-binary nonlinearly separable classifications. We considered the mechanism of the convergence of a network. The specific question considered is what the reinforcement is telling each neuron in a network. As a result, a convergence theorem is proved for a single-layer network when the environment satisfies certain conditions. Simulation results confirmed the theorem. In the course of running simulations, it became apparent that a multi-layer network with all stochastic neurons and with continuous network inputs would converge only occasionally under the AR_P algorithm. An alternative to this is to use a network which is stochastic at the output only. The desired action probabilities in each iteration can be calculated by a component of some existing PRL algorithm. They are then stored in the network by supervised learning. A new algorithm, MPRL, was devised for PRL from this idea. Simulations showed that the MPRL algorithm converges correctly for the test problem. A long sought after goal in reinforcement learning is to be able to train a general multi-layer network. By general we mean general inputs (continuous) and general outputs (multiple continuous outputs). The closest results in the literature are simulation results for single output multilayer networks with continuous network inputs. The algorithms used in those papers use a concept similar to the MPRL algorithm. The difference between these algorithms and the MPRL algorithm is that the 66 Chapter 5: Conclusions MPRL algorithm uses a supervised learning algorithm which is specifically designed to be tolerant of noisy output exemplars. The result is that the MPRL algorithm is able to train a multiple output network successfully for the test problem. Since there is no solution to the local minima problem in the training of a multi-layer network, the MPRL algorithm can't be expected to always leam correcdy. But the MPRL concept is not tied to any particular supervised learrung algorithm. When a faster, more noise tolerant algorithm and/or a more compact function approximator are available, they should be used instead of the smoothed backpropagation algorithm. The speed of convergence of the smoothed backpropagation algorithm is probably more responsible for the speed of convergence of the MPRL than another other component. 5.2 Contributions This thesis made the following contributions to the development of reinforcement learning: • A convergence theorem for a single-layer neural network trained with the AR.P algorithm. The reinforcement (or environment) must satisfy certain conditions. • Ideas about the underlying leamability of the reinforcement learning situation. • Simulations of multi-layer feedforward network trained by the AR.P algorithm. • A new algorithm, the MPRL algorithm, which is especially suited to the training of multi-output networks. 5.3 Future Areas of Research Reinforcement learning as a whole has been moving towards a better understanding of adaptive critic reinforcement learning algorithms. Most of the work currently being done is in the theory of adaptive critic networks [67]. Yet an ACRL algorithm needs an effective PRL algorithm. The MPRL algorithm presented in this thesis is the only one that has been demonstrated to train a multi-output network successfully. Clearly, better PRL algorithms should be developed when the tools become available. These algorithms should be able to leam multiple continuous-to-continuous mappings. This is important for many types of sequential decision problems, for example, control. 67 Chapter 5: Conclusions To increase the speed of future versions of the MPRL algorithm, there is a need to find faster, more compact, and more noise tolerant function approximation representations and (supervised) learning methods. The ability of the secondary reinforcement to direct a multiple output network to the correct mappings is still unknown. The utility of ACRL depends on the existence of that ability. This means we need to find more conditions where a group of neurons will converge and to determine whether the secondary reinforcement is close to one of these conditions or the two conditions found in this thesis. SLA has been formulated to not to use extra memory to store history of the past reinforcements or estimates of the average reinforcements for different actions. In recent years, some work has been done where this restriction is relaxed [68][69][70]. The result is that the convergence speed of SLA's are dramatically faster. Since the current crop of reinforcement learning algorithms are descended from SLA algorithms, perhaps a reinforcement learning algorithm which is an analogy of these new SLA algorithms can be devised to significandy increase convergence speed. The trade-off for the increased speed is increased memory requirements. Munro's idea [48] is similar to this. In games of SLA, it has been shown [71] that if the actions chosen by other SLA are made known to other automata, estimator algorithms can be used and the game will converge to the global maximum in all game environments. Perhaps an analogy of this idea can be used for PRL for networks with multiple outputs. This approach would result in increased memory requirements and a much more complex algorithm but it would gready increase the ability of PRL. 68 References [I] P. Werbos. Plenary Speech, the 1990 Int'l Joint Conf. on Neural Networks, San Diego, CA, June 1990. [2] D. O. Hebb, The Organization of Behavior. New York: Wiley, 1949. [3] A. H. Klopf, "A drive-reinforcement model of single neuron function: an alternative to the hebbian neuronal model," in Proc. oftheAIP Conference #151, (Snowbird), pp. 265-269,1986. [4] D. L. Alkon, K. T. Blackwell, G. S. Barbour, A. K. Rigler, and T. P. Vogl, "Pattern-recognition by an artificial network derived from biologic neuronal systems," Biological Cybernetics, 1990. [5] S. Grossberg and G. A. Carpenter, "Art 2: self-organization of stable category recognition codes for analog input patterns," Applied Optics, vol. 26, no. 23, pp. 4919-4930, 1987. [6] K. Fukushima, S. Miyake, and T. Ito, "Neocognitron: a neural network model for a mechanism of visual pattern recognition," IEEE Trans. Sys. Man Cybern., vol. SMC-13, pp. 826-834, 1983. [7] R. P. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Magazine, pp. 4-22, April 1987. [8] K. Hornik, M . Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989. [9] K. J. Lang and M . J. Witbrock, "Learning to tell two spirals apart," in Proc. of the 1988 Connectionist Models Summer School, pp. 52-60, 1988. [10] D. Rumelhart, G. E. Hinton, and R. Williams, "Learning internal representation by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations (D. E. Rumelhart and J. L . McClelland, eds.), Cambridge: MIT Press/Bradford Books, 1986. [II] S. Becker and Y. le Cun, "Improving the convergence of back-propagation learning with second order methods," in Proc. of the 1988 Connectionist Models Summer School, pp. 29—37, 1988. [12] S. E. Fahlman, "Faster-learning variations on back-propagation: an empirical study," in Proc. of the 1988 Connectionist Models Summer School, pp. 38-51, 1988. [13] J. P. Cater, "Successfully using peak learning rates of 10 (and greater) in back-propagation networks with the heuristic learning algorithm," in Proc. of the 1987 IEEE Int I Conf. on Neural Networks, pp. 11-645, 1987. [14] T. P. Vogl, J. K. Mangis, A. K. Rigler, W. T. Zink, and D. L. Alkon, "Accelerating the convergence of the back-propagation method," Biological Cybernetics, vol. 59, pp. 257-263, 1988. [15] J. C. Hoskins, "Speeding up artificial neural networks in the 'real* world," in Proc. of the 1989 Int'I. Joint Conf. on Neural Networks, pp. 11-626, 1989. [16] M. R. Azimi-Sadjadi and S. Citrin, "Fast learning process of multi-layer neural nets using recursive least squares technique," in Proc. of the 1989 Intl. Joint Conf. on Neural Networks, pp. 11-623, 1989. [17] D. B. Parker, "Optimal algorithms for adaptive networks: second order back propagation, second order direct propagation, and second order Hebbian learning," in Proc.fo the 1987 IEEE Int'I. Conf. on Neural Networks, pp. 11-593, 1987. [18] A. J. Owens and D. L. Filkin, "Efficient training of the back-propagation network by solving a system of stiff ordinary differential equations," in Proc. of the 1989 Int'I. Joint Conf. on Neural Networks, pp. 11-381, 1989. [19] S. Makram-Ebeid, J.-A. Sirat, and J.-R. Viala, "A rationalized error back-propagation learning algorithm," in Proc. of the 1989 Intl. Joint Conf. on Neural Networks, pp. 11-373, 1989. [20] S. Kollias and D. Anastassiou, "An adaptive least squares algorithm for the efficient training of artificial neural networks," IEEE Trans. Circuits and Syst., vol. 36, no. 8, pp. 1092-1101,1989. [21] M. H. Raibert, "Analytical equations vs. table look-up for manipulation: a unifying concept," in Proceedings of the IEEE Conference on Decision and Control, pp. 576-579, 1977. [22] J. Schmidhuber, "Recurrent networks adjusted by adaptive critics," in Intl. Joint Conf. on Neural Networks 90, (Washington, DC), Jan. 1990. [23] R. J. Williams, "Reinforcement learning in connectionist networks: a mathematical analysis," Tech. Rep. 8605, Institute for Cognitive Science, University of California, San Diego, CA, 1986. [24] A. G. Barto and P. Anandan, "Pattern recognizing stochastic learning automata," IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-15, pp. 360-374, 1985. [25] A. G. Barto, P. Anandan, and C. W. Anderson, "Cooperativity in networks of pattern recognizing 70 stochastic learning automata," in Adaptive and Learning Systems: Theory and Applications (K. S. Narendra, ed.), New York: Plenum, 1985. [26] R. S. Sutton, "Temporal credit assignment in reinforcement learning," COINS 84-02, Dept. of Computer and Information Science, University of Massachusetts, Amherst, MA, 1984. [27] P. Mazzoni, R. A. Andersen, and M. I. Jordan, "A r-p learning applied to a network model of cortical area 7a," in Proc. of the 1990 Intl. Joint Conf. on Neural Networks, (San Diego, CA), pp. II-373-II-379, 1990. [28] T. Miller, R. Sutton, and P. Werbos, eds., Neural Networks for Robotics and Control. Cambridge, MA: MIT Press, 1990. [29] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst. Man and Cybern., vol. SMC-13, no. 5, pp. 834-846, 1983. [30] C. W. Anderson, Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, Dept of Computer and Information Science, University of Massachusetts, Amherst, MA, 1986. [31] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, pp. 9^ 14, 1988. [32] R. R. Bush and F. Mosteller, Stochastic Models for Learning. New York: Wiley, 1955. [33] R. R. Bush and W. K. Estes, eds., Studies in Mathematical Learning Theory. Stanford, CA: Stanford Univ. Press, 1959. [34] R. C. Atkinson, G. H. Bower, and E. J. Crothers, An Introduction to Mathematical Learning Theory. New York: Wiley, 1965. [35] T. M. Cover and M. E. Hellman, "The two-armed bandit problem with time-invariant finite memory," IEEE Trans. Inform. Theory, vol. IT-16, pp. 185-195, 1970. [36] J. M. Mendel and R. W. McLaren, "Reinforcement learning control and pattern recognition systems," in Adaptive, Learning and Pattern Recognition Systems: Theory and Application (J. M. Mendel and K. S. Fu, eds.), New York: Academic Press, 1970. [37] M. L. Tsetlin, Automata Theory and Modeling of Biological Systems. New York: Academic 71 Press, 1973. [38] K. S. Narendra and M. A. L. Thathachar, "Learning automata—a survey," IEEE Trans, on Systems, Man, and Cybernetics, vol. 4, pp. 323-334, 1974. [39] S. Lakshmivarahan, Learning Algorithms Theory and Applications. New York: Springer, 1981. [40] K. S. Narendra and M. A. L. Thathachar, Learning Automata: An Introduction. New Jersey: Prentice-Hall, 1989. [41] M. L. Minsky, Theory of Neural-Analog Reinforcement Learning Systems and its Application to the Brain-Model Problem. PhD thesis, Princeton Univ., Princeton, NJ, 1954. [42] B. Widrow, N. K. Gupta, and S. Maitra, "Punish/reward: learning with a critic in adaptive threshold systems," IEEE Trans, on Syst. Man and Cybern., vol. SMC-3, no. 5, pp. 455-465, 1973. [43] A. G. Barto, "Reinforcement learning." Tutorial, IntT. Joint Conf. on Neural Networks, San Diego, CA, June 17-21 1990. [44] A. H. Klopf, The Hedonistic Neuron: a Theory of Memory, Learning, and Intelligence. Washington, DC: Hemisphere, 1982. [45] A. G. Barto and M. I. Jordan, "Gradient following without back-propagation in layered networks," in Proceedings of the IEEE Conference on Neural Networks, vol. n , pp. 629-636, 1987. [46] R. J. Williams, "A class of gradient-estimating algorithms for reinforcement learning in neural networks," in Proceedings of the First Annual International Conference on Neural Networks, (San Diego, CA), pp. I-263-I-270, 1987. [47] R. J. Williams, 'Toward a theory of reinforcement-learning connectionist systems," Tech. Rep. NU-CCS-88-3, College of Computer Science, Northeastern University, Boston, M.A., 1988. [48] P. Munro, "A dual back-propagation scheme for scalar reward learning," in Proc. of the 9th Annual Conf. of the Cognitive Science Society, pp. 165-176, 1987. [49] V. Gullapalli, "A stochastic reinforcement learning algorithm for learning real-valued functions." Neural Networks, to appear. [50] V. Gullapalli, "A stochastic reinforcement learning algorithm for learning real-valued functions," 72 Tech. Rep. 88-91, University of Massachusetts, Amherst, MA, 1988. [51] R. J. Williams, "Reinforcement-learning connectionist systems," Tech. Rep. NU-CCS-87-3, College of Computer Science, Northeastern Univ., February 1987. [52] C. J. C. H. Watkins, Learning From Delayed Rewards. PhD thesis, Cambridge Univ., Cambridge, England, 1989. [53] C. J. C. H. Watkins, Incremental Dynamic Programming. 1989. [54] C. W. Anderson, "Strategy learning with multilayer connectionist representations," in Proc. of the 4th Int'I. Workshop on Machine Learning, pp. 103-114, 1987. [55] T. Robinson and F. Fallside, "Dynamic reinforcement driven error propagation networks with application to game playing," in Proc. of the 11th Conf. of the Cognitive Science Society, (Ann Arbor), 1989. [56] A. G. Barto, "Connectionist learning for control: an overview," in Neural networks for robotics and control (T. Miller, R. Sutton, and P. Werbos, eds.), Cambridge, MA: MIT Press, 1990. [57] A. G. Barto, R. S. Sutton, and P. S. Brouwer, "Associative search network: a reinforcement learning associative memory," Biological Cybernetics, vol. 40, pp. 201-211, 1981. [58] R. Williams, "Reinforcement learning." Technical Tutorial Seminar (available as IEEE/EAB HVT product no. HV0131-3), International Joint Conference on Neural Networks, Washington, D.C, July 19-22, 1989. [59] A. G. Barto and R. S. Sutton, "Landmark learning: an illustration of associative search," Biological Cybernetics, vol. 42, pp. 1-8, 1981. [60] A. G. Barto, "Adaptive neural networks for learning control: some computational experiments," in Proc. of the IEEE Workshop on Intelligent Control, pp. 170-175, 1985. [61] A. G. Barto, "Learning by statistical cooperation of self-interested neuron-like computing elements," Human Neurobiology, vol. 4, pp. 229-256, 1985. [62] R. L. Kashyap, C. C. Blaydon, and K. S. Fu, "Stochastic approximation," in Adaptation, Learning, Pattern Recognition Systems: Theory and Applications (J. M. Mendel and K. S. Fu, eds.), New York: Academic, 1970. [63] P. J. Werbos, "Backpropagation: past and future," in Proc. of the second int'I. conf. on neural 73 networks, (San Diego, CA), pp. I-343-I353, 1988. [64] D. Y. Yeung, "Supervised learning of action probabilities in associative reinforcement learning," in Proc. of the 1988 Connectionist Models Summer School, pp. 162-171, 1988. [65] T. Guillerm and N. E. Cotter, "Neural networks in noisy environments: a simple temporal higher order learning for feed-forward network," in Proc. of the Intl. Joint Conf. on Neural Networks, (San Diego, CA), pp. III-105-III-112, June 17-21 1990. [66] S. Qian, Y. Lee, R. Jones, C. Barnes, and K. Lee, "Function approximation with an orthogonal basis net," in Proceedings of the International Joint Conference on Neural Networks, (San Diego, CA), pp. III-605-III-619, June 17-21 1990. [67] A. G. Barto, R. S. Sutton, and C. J. C. H. Watkins, "learning and sequential decision making," COINS 89-95, Dept. of Computer and Information Science, Univ. of Massachusetts, Amherst, MA, 1989. [68] L. P. Devroye, "A class of optimal performance directed probabilistic automata," IEEE Trans. Syst. Man Cyber., vol. SMC-6, pp. 777-784, 1976. [69] M. A. L. Thathachar and P. S. Sastry, "A new approach to the design of reinforcement schemes for learning automata," IEEE Trans. Syst. Man Cybern., vol. SMC-15, pp. 168-175, January 1985. [70] R. Simha and J. F. Kurose, "Relative reward strength algorithms for learning automata," IEEE Trans. Syst. Man Cybern., vol. SMC-19, no. 2, 1989. [71] M. A. L. Thathachar and P. S. Sastry, "Learning optimal discriminant functions through a cooperative game of automata," IEEE Trans. Syst. Man and Cybern., vol. SMC-17, no. 1, pp. 73-85, 1987. 74 Appendix A Proof of Barto's AR.P Convergence Theorem The notations in the body of the thesis differ from those in Barto's 1985 paper [24]. The differences are mainly the addition of the letter j 6 {1, • • •, L} to denote the jth neuron in a network. Other changes address the need to denote particular input vectors in particular training input sets. The correspondence between Barto's notations (shown on left) and our notations is as follows: ak a,*, xjt <-»• Xjk, <-• x^\ X *-* Xj, m *-> mj, £' «-+ £j, n rij, T]K <-> 7}jk, * «-• pk pjk, 9 ~ Oj, 9k «. 9jk, 9°x «-• 0°jX, ui ~ uj. u2 uj1, w(9,X) <-> ^(d^X), d\* ~ d)k, <£" <-> djk\ pu(9) <-*• p}(0j). and f5\ *^ (5j\. In the last four new notations, the superscript (ij) is not used for the sake of simplicity. The following material is taken from Barto's 1985 paper with some modifications added to fit in with the rest of thesis. It is included because of the close connection between the proof of the single-layer convergence theorem and this proof. The AR.P algorithm was presented in Chapter 1. Four conditions must be fulfilled for the single neuron theorem to hold. They are the following: Condition C l . The set of input vectors X = {x^, x^ 2^,..., x^ m }^ is a linearly independent set. Condition C2. For^ each x® € X and fe > 1, Pr {x* = x*«>} = ? > 0. Condition C3. Each of the independent identically distributed random variables j]k has a continuous and stricdy monotonic distribution function Condition C4. The sequence pk satisfies pk > 0, Y^Pk = oo, Y^,pk < oo. k k Theorem. Under Conditions (Cl)—(C4), for each A e (0,1], there exists a 9\ e Tln such that the random process {9k}k>1 generated by the AR.P algorithm in an associative reinforcement learning task converges to 9X with probability 1 (that is, Pr \ lim 9k = 9l) = 1), where 75 for all x e X Pr {a = 11 0$,x} > 0.5, if d(x, 1) > rf(x,-l) (42) < 0.5, if d(x,l) < d(x,-l). In addition, for all x 6 X, f l , if d(x,l) > d(x,-l) lim Pr{a = l|0?,x} = \ (43) A - ° U , ifd(x,l)<d(x,-l). Proof: First, some properties of the algorithm are stated: Property PI. Letting pli(0) = Pr \ak = 1 | 0k = 9,xk = x « } * v J (44) results in {^a* | 0* = 0,xfc = x }^ = 2ph'(0) - 1. (45) Note that since $ is continuous and strictly monotonic, so is pu regarded as a function of 0TxM. Property P2. Let 69k = 9k+i - 9k. Then E{60k\0k = 0} = pkw(0,X) where m w(9,X) = ^<fV(0,A)x(,) wi{9,X) = wRi{9)-rXwPi(0) and 2PkwRi(0)xM = E[60k | 0k = 0,xk = x « A 2XPkwPi(9)xW = E{60k \9k = 0,xk = x«\bk 76 (46) (47) (48) It can be easily shown that wRi(0) = pu(9)(l - pu(9)) (du - d~u) (49) wPi(9) = (l-pli(0))2c-u - {p[i(e))2cli where / , * \ ( 5 0 ) <T" = d(xW,-l) = l - c - u . Property P3. If the initial state of the AR.P algorithm is 0i, then for fc > 1 fc-i 0k=9i+Y,6°i- ( 5 1 ) Since each £0,- is a scalar times X j , 0* - B\ is in the linear space spanned by the set X of input vectors; that is, letting M denote the matrix whose columns are the input patterns and letting R(M) denote the range of Af, 9k G R(M) + 9\, for fc > 1. Lemma Al. For A = 1, and for each i, 1 < i < m, there exist unique /?' and /?' G (0,1) such that, for any 9, if pli(9) = /?•', then wPi(9) = 0; and if pu(9) = (f, then w{(9,1) = 0; where ifd 1 , >cr 1 , ' ) then ? > > 0.5 (52) i f d 1 ' < t T \ then/?'</?•< 0.5. Proof: This proof is for the case dl% > d~lt\ the proof for the other case is similar. From (P2) if pu(9) = 0, then wPi(9) = w{(9,1) = c"H > 0 if pli(9) = 1, then wPi(9) = w\9,1) = - c H < 0 (53) ifpli(0) = 0.5, then wPi(9) = 1/4 (c"1'' - c«) > 0. Since u^* is a quadratic function of pu(9) and pK(0) G [0,1] for all 9, the above facts imply that there exists a unique /?' G (0.5,1] such that when ph(9) = /?', wPt(9) = 0. Further, note that for pu{9) G (0,1), wRi(9) > 0. Hence, if pu{0) = fi, then w'(0,1) = wRi(9) + wPi(9) > wPi(9) > 0 (54) 77 and there exists a unique /?' e (/^ > l) such that when pu(9) = /?*', u>'(0, 1) = 0. For future reference, also observe that if pli(9) > /?', thenu>p,'<0 (55) if pli(9) < /?«', then t o p » ' > 0. Lemma A2. For each A 6 (0,1) and for each i, 1 < i < m, there exists a unique (3\ such that if pH(0) > /3[, then wi(9, A) < 0 if pu(9) = /?A, then w{{9, A) = 0 (56) if pu(0) < /?A, then A) > 0. Further, if dli > d _ 1 \ then /?!>/?'' andlim/?A = l (57) A—»0 whereas if du < d~li, then ? x < ? and lim ft = 0. (58) Proof: Again this proof is for the case du > d~u; the proof for the other case is similar. By (P2) it can be seen that for each i, 1 < i < ro, that wi(9,X) = wRi(9) + XwPi(9) = wp,'(0) + wPi{9) - (1 - X)wPi(9) (59) = w{(9,1) - (1 - X)wPi(9). Hence, from the definition of (P and (55) in Lemma Al , it can be seen that if ph(9) = /?*, then wi(9,X) = -{l-X)wPi(9)>0 (60) and then if pli(9) = 1, then wi(9,X) = - A c H < 0. (61) 78 Thus the effect of multiplying wPt(6) by A is to shift the zero-crossing (3* further to the right Consequently, there exists a unique ft > such that if pu(9) = ft, then wt(9, A) = 0. It is clear that the other parts of (56) hold as well. Further, for A' < A, it can be seen that if ph(9) = ft, then to1' (0, A') = - (A - A') wPi(6) > 0. (62) This implies that ft increases as A decreases. Since to'(0,0) = 0 when pil(9) = 1, the least upper bound of 3\ is 1. Hence, lim B\ — 1. A A—i-O Q.E.D. Lemma A3. Under Conditions (C1)-(C3), for each A 6 (0,1] there exists a unique 9X 6 R(M) + $i such that w(0Qx,\) = 0; and for all i, 1 < i < m, r / n x f l , if d h '>r f -« , ]lmph(el ) = l (63) Proof: Since X is a linearly independent set and > 0, 1 < i < m m w($, A) = ^ ?V(0, A)x(i) = 0 & w\9, A) = 0, 1 < i < m. (64) i=i From Lemma A2, we know that for all i, 1 < i < m, there exists a unique (3X such that Pli(9) = (3ix^wi(9,\) = 0. (65) Since pu is stricdy monotonic and continuous as a function of 9Tx^\ it is invertible so that (65) implies that for all i, 1 < i < m, there exists a unique (3\ such that pu(9) = (3^ when 9Tx^ = ft. Letting fi\ denote the vector whose components are ft, this can be written MT9 = ft (66) where M is the matrix whose columns are the input patterns. Since the rows of MT are independent, there is at least one 9X that satisfies (66). Additionally, the condition that 9X £ R(M) + 9\ implies that (66) can be written M r ( M 7 + 0i) = ft (67) 79 for some vector 7 6 lZm. Since the columns of M are independent, MTM is invertible and 7 is the unique vector 1=(MTM)-l(px-MTel). (68) Thus, 9\ = M7 + #i is the unique vector that satisfies (66) and thus is the unique vector in R(M) + Bx for which w(9°x,X) = 0. Finally, combining (65) with Lemma A2, we have that for all i, 1 < i < TO, r / n x f l , if d[i>d~u W ' 0 ° )=P\ = \ ,. • (69) Q.E.D. Lemma A4. Under Conditions (C1HC3), (0 - 0\)Tw(9, A) < 0 for all A e (0,1] and for all 9 6 1ln. Further, 9\ is the unique 9 in R{M) + 9\. such that (9 - 9xl)Tw(9, A) = 0. Proof: From (56) in Lemma A2 and the monotonicity of pu(0) we have for all A 6 (0,1] that tt;'(0,A) <0od rx(" > 0fxW A) = 0 0 T x w = Ofx® (70) and Additionally, «'(0, A) > 0 4* 9Tx.U < 0OTX(.) w (9 - 9lfw(9, \)=(0- 9l)T £ ?V(*, A)x« (72) «=i m = J ] f U ; ' ( 0 , A ) [ 0 r x « - 0 f x « Combining (71) and (72) we have that (9 - 9lfw(9, A) < 0, 9 e 7en. (73) From Lemma A3 we have that, for any A 6 (0,1], 0" is the unique vector in R(M) + 0i such that w%(9x,X) = 0 for all i, 1 < i < TO. Hence 9^ is the unique vector in R(M) + 9\ such that equality in (73) holds. Q.E.D. 80 We now derive an additional property of the algorithm. For each i, 1 < i < ro, let aRi(9) = pli(6){i - Pli(e)) • [(i - pu(6))du + Pli(e)d-U] aPi(9) = (pu($))*cu + (l-pu(e))\-li <J(9) = aRi(9) + \2aPi(9) (74) and Then and a(0) = Y/?ai(9) i-l E{69l69k \6k = 9} = pla{$) E{69Tk69k \9k = 9,xk = x«\bk = l} = pl |x«| 2a f l i(0) (75) (76) api(9). (77) E{69l69k | 9k = d,xk = = -l} = X2p\ Since p[x(9) e [0,1] for all and 9, for any given set of x^ there exists constants L\ and Li such that 2 IK' «=1 a(0)<i1E||x ( , ) < X 2 . (78) Hence, E{69l69k \9k = 9}< L2P\. (79) We are now able to apply a version of the convergence theorem stated in Section UJ (in [24]) for stochastic approximation to prove that under Conditions (C1)-(C4), 9k converges to 9\ with probability 1. A proof, omitted here, can be found in [39]. Here the conditions for the theorem are shown to hold. Lemma A5. Under Conditions (C1)-(C4) the random process {9k}k>\ generated by the A R . P algorithm converges to 9\ with probability 1. Proof: We show that the random process {9k}k>l satisfies Conditions (A1)-(A4) presented in the discussion of Section III (in [24]) of supervised learning pattern classification. (The roles of y(9) and r(9) of Section III (in [24]) are played here respectively by E{69k \ 9k = 9} and w(9, A).) 81 First note that by (P3), 9k 6 R(M) + 0i, k > 1. By Lemma A3 we have for 0 e i?(M) + 0X and A e (0,1] that u>(0, A) = (l/pk)E{69k \ 9k = * > 1. has a unique zero at This is as required by Condition (Al). Condition (A2) is just Condition (C4). By Lemma A4 we have that (9 - 9l)Tw(9, A) < 0, for all 9 e R(M) + 9i such that 9 ± 9\. Given the definition of 69, this implies (A3). By (79), .E{||<$0||2 | #j < Lip\. This is a stronger condition than (A4). It is the analog of Condition 4.8 of Lakshmivarahan [39, p. 112]. The remainder of the proof follows the argument given by Kashyap et al. [62] based on that of Gladyshev, or Theorem 4.1 of Lakshmivarahan [39, p. 112]. Q . E . D . The AR.P convergence theorem follows immediately from Lemmas A l , A3, and A5. 82
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Reinforcement learning in neural networks with multiple...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Reinforcement learning in neural networks with multiple outputs Ip, John Chong Ching 1990
pdf
Page Metadata
Item Metadata
Title | Reinforcement learning in neural networks with multiple outputs |
Creator |
Ip, John Chong Ching |
Publisher | University of British Columbia |
Date Issued | 1990 |
Description | Reinforcement learning algorithms comprise a class of learning algorithms for neural networks. Reinforcement learning is distinguished from other classes by the type of problems that it is intended to solve. It is used for learning input-output mappings where the desired outputs are not known and only a scalar reinforcement value is available. Primary Reinforcement Learning (PRL) is a core component of the most actively researched form of reinforcement learning. The issues surrounding the convergence characteristics of PRL are considered in this thesis. There have been no convergence proofs for any kind of networks learning under PRL. A convergence theorem is proved in this thesis, showing that under some conditions, a particular reinforcement learning algorithm, the A[formula omitted] algorithm, will train a single-layer network correctly. The theorem is demonstrated with a series of simulations. A new PRL algorithm is proposed to deal with the training of multiple layer, binary output networks with continuous inputs. This is a more difficult learning problem than with binary inputs. The new algorithm is shown to be able to successfully train a network with multiple outputs when the environment conforms to the conditions of the convergence theorem for a single-layer network. |
Subject |
Neural networks (Computer science) Neural circuitry -- Computer simulation Artificial intelligence |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-10-28 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0065628 |
URI | http://hdl.handle.net/2429/29624 |
Degree |
Master of Applied Science - MASc |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-UBC_1990_A7 I62.pdf [ 6.43MB ]
- Metadata
- JSON: 831-1.0065628.json
- JSON-LD: 831-1.0065628-ld.json
- RDF/XML (Pretty): 831-1.0065628-rdf.xml
- RDF/JSON: 831-1.0065628-rdf.json
- Turtle: 831-1.0065628-turtle.txt
- N-Triples: 831-1.0065628-rdf-ntriples.txt
- Original Record: 831-1.0065628-source.json
- Full Text
- 831-1.0065628-fulltext.txt
- Citation
- 831-1.0065628.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0065628/manifest