SELF-ORTHOGONALIZING STRATEGIES FOR ENHANCING HEBBIAN LEARNING IN RECURRENT NEURAL NETWORKS By Michael Richard Davenport B. Sc. (Physics) University of Calgary M. Sc. (Physics) University of British Columbia A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES DEPARTMENT OF PHYSICS We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA June 1993 © Michael Richard Davenport, 1993 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of Physics The University of British Columbia 2075 Wesbrook Place Vancouver, Canada V6T 1W5 Date: 3^t 9,613 Abstract A neural network model is presented which extends Hopfield's model by adding hidden neurons. The resulting model remains fully recurrent, and still learns by prescriptive Hebbian learning, but the hidden neurons give it power and flexibility which were not available in Hopfield's original network. The key to the success of the model is that it uses the emerging structure of its own memory space to establish a pattern in the hidden neurons such that each new memory is optimally orthogonal to all previous memories. As a result, the network actually learns a memory set which is "near-orthogonal", even though the visible components of the memories are randomly selected. The performance of the new network is evaluated both experimentally, using computer simulations, and analytically, using mathematical tools derived from the statistical mechanics of magnetic lattices. The simulation results show that, in comparison with Hopfield's original model, the new network can (a) store more memories of a given size, (b) store memories of different lengths at the same time, (c) store a greater amount of information per neuron, (d) retrieve memories from a smaller prompt, and (e) store the XOR set of associations. It is also shown that the memory recovery process developed for the new network can be used to greatly expand the radius of attraction of standard Hopfield networks for "incomplete" (as opposed to "noisy") prompts. The mathematical analysis begins by deriving an expression for the free energy of a Hopfield network when a near-orthogonal memory set has been stored. The associated mean-field eqiiations are solved for the zero-temperature, single-recovered-memory case, yielding an equation for the memory capacity as a function of the level of orthogonality. A separate calculation derives a statistical estimate of the level of orthogonality that can be achieved by the roll-up process. When this is combined with the memory capacity-vs-orthogonality result, it yields a reasonable ll estimate of the memory capacity as a function of the number of hidden neurons. Finally, the theoretical maximum information content of sets of near-orthogonal memories is calculated as a function of the level of orthogonality, and is compared to the amount of information that can be stored in the new network. ii i Table of Contents Abstract List of Tables^ ix List of Figures xiii Acknowledgements^ 1 1.1 2 1 Introduction Important Features of the EH Network ^ 4 1.1.1^Hebbian Learning 4 ^ 1.1.2^Recurrence ^ 6 1.1.3^Hidden Neurons 7 ^ 1.2 Research Goals ^ 9 1.3 Summary ^ 9 The Hopfield Network and Non Random Memory Sets 12 2.1 Review of Hopfield Neural Networks ^ 12 2.1.1^Storing Memories ^ 14 2.1.2^Network Dynamics and Stable States ^ 15 2.1.3^Overlap mµ" and Orthogonality ^ 16 2.1.4^Energy Landscape ^ 17 2.1.5^The Meaning of Energy Peaks 17 - 2.1.6^False Attractors 2.2 ^ ^ Limitations and Extensions of the Hopfield Network ^ iv 18 19 2.3 Non-Random Memory Sets ^ 22 2.3.1 Generating Near-Orthogonal Memory Sets ^ 23 2.3.2 Generating Correlated Memory Sets ^ 24 2.3.3 Benefits of Near-Orthogonal Memory Sets ^ 26 2.4 Working Premise: All Orthogonalizing Processes are Equivalent ^ 28 3 The Extended Hopfield Network ^ 30 3.1 EH Network Terminology ^ 31 3.1.1 Hidden Neurons in an EH Network ^ 31 3.1.2 Partial Overlaps in the EH Network ^ 32 3.1.3 Measuring "Stability" in an EH Network ^ 32 3.1.4 "Level of Correlation" of Memory Sets ^ 33 3.1.5 TriState Neurons Option ^ 33 3.2 Storing Memories in an EH Network ^ 34 3.3 Recovering Memories in an EH Network ^ 36 3.3.1 "Random" Option ^ 38 3.3.2 "TriState" Option ^ 39 3.3.3 "Bi-State" Option ^ 40 3.3.4 Retrieval Efficiency with 10 Memories Stored ^ 42 3.3.5 Radii of Attraction for Random Memories ^ 42 3.3.6 Radii of Attraction for Near-Orthogonal Memories ^ 45 3.4 Performance of the EH Network ^ 45 3.4.1 Memory Storage Capacity ^ 46 3.4.2 Maximum Information Content ^ 48 3.4.3 Correlated Memory Sets ^ 49 3.4.4 Radii of Attraction ^ 51 3.4.5 Storing the XOR Memory Set ^ 54 3.4.6 Flexible-Length Memories ^ 58 4 The Free Energy of a Network with Near Orthogonal Memories 60 4.1^Magnetic Lattice Analog for Neural Networks ^ 62 - 4.2 4.3 5 - 4.1.1 Zero Temperature Ising Model ^ 64 4.1.2 Non-Zero Temperature in Neural Models 65 4.1.3 Justification for Using Thermodynamic Methods ^ 67 4.1.4 Quenched Random Memory States ^ 69 Ensemble Average of the Free Energy ^ ^ 4.2.1 The Energy per Neuron ^ 70 4.2.2 Partition Function for n Replicas ^ 72 4.2.3 Calculation of Average Over the Uncondensed Memories ^ 74 4.2.4 Inverse Gaussian Transforms ^ 77 4.2.5 Average Over Condensed Memories 78 4.2.6 Saddle Point Calculation 4.2.7 Free Energy of the Replica Symmetric System ^ 80 ^ Summary of the Free-Energy Calculation ^ ^ 5.2 5.3 81 84 86 Analytical Estimate of Network Storage Capacity 5.1 69 Hopfield Network Capacity for Near-Orthogonal Memories ^ 86 5.1.1 Mean Field Equations at Zero Temperature ^ 87 5.1.2 Memory Capacity with No Clamped Neurons ^ 87 Statistical Estimate of Overlap After Roll-Up ^ 91 5.2.1 Notation ^ 92 5.2.2 Overlap as a Function of Neural Alignments ^ 93 5.2.3 Probability Distribution of the Neural Alignments ^ 94 5.2.4 Dynamics of the Probability Distribution ^ 94 5.2.5 Numerical Calculations of Average Overlaps ^ 98 5.2.6 Average Overlap of All Memories ^ EH Network Storage Capacity ^ vi 100 101 6 The Information Content of Near-Orthogonal Memories^ 105 6.1 Information Content of Exactly Orthogonal Memories ^ 105 6.1.1 Analytical Calculation of Information Content ^ 106 6.1.2 Brute Force Calculation of Information Content ^ 110 6.1.3 Calculated Information Content of Exactly Orthogonal Memories ^ 113 6.2 Information Content in Near-Orthogonal Memories ^ 116 6.3 Ideal Information Capacity vs Information Content in EH Networks ^ 122 6.3.1 Limits on the Orthogonalization Process ^ 123 6.3.2 Information Content Near Capacity ^ 125 7 Conclusions and Discussion^ 129 7.1 Conclusions ^ 129 7.2 Further Research ^ 131 Bibliography^ 133 A Calculations from Chapter 2 ^ 137 A.1 The Canonical Set {F} of Exactly Orthogonal Memories ^ 137 A.2 Mean and Variance of (es' • el ^ 139 A.3 Expectation of Partial Overlaps in Orthogonal Memories ^ 140 B Calculations from Chapter 3 ^ B.1 Theoretical Minimum Prompt Length ^ C Calculations from Chapter 4^ 145 145 146 C.1 Sum-Product Inversions Equation ^ 146 C.2 Ensemble Average of Noise Term over Orthogonal Sets ^ 147 C.3 Average of e -T3 for Condensed Vectors ^ 151 C.4 Average of (rP•so )(rP.SP) over Condensed Vectors ^ 153 C.5 Variance of T3 in the Space of Condensed Vectors ^ 156 . vii C.6 Details of the Inverse Gaussian Transform ^ 158 C.7 Simplification of Exponent in G1 ^ 161 C.8 Self-Averaging in Condensed Near-Orthogonal Memories ^ 162 C.9 Replica Symmetric Value of G1 ^ 164 C.10 Replica Symmetric Value of G2 ^ 166 C.11 Replica Symmetric Value of F2 ^ 167 C.12 Detailed Calculation of the Mean Field Equations ^ 168 D Calculations from Chapter 5^ D.1 Limiting Behaviour of 13(1 — q) ^ 171 171 D.2 Solution of Mean-field Equations for No Clamped Neurons ^ 174 D.3 Distribution of Xi Before Roll-Up ^ 175 D.4 Effect of Inverting One Bit ^ 176 D.5 Mean Value of X1 for Xi > a ^ 178 D.6 Estimate of the Variance Vx(t) ^ 179 D.7 Time Evolution of the Number of Invertible Neurons ^ 180 viii List of Tables 1.1 Brain features cross-referenced to neural network models. ^ 3 A.1 The first ten "canonical" orthogonal memories ^ ix 138 List of Figures 1.1 Relative scales of components of the nervous system ^ 2 2.1 Diagram of a simple Hopfield network ^ 13 2.2 Mean overlaps between near-orthogonal memories ^ 25 2.3 Memory capacity of a Hopfield network vs orthogonality of the memory set. . ^ 27 2.4 Radius of attraction of a Hopfield network vs orthogonality of the memory set. ^ 28 3.1 Schematic representation of the EH network memory storage process. ^ 35 3.2 Degree of orthogonality achieved by the tri-state memory storage process, as a function of memory loading ^ 36 3.3 Degree of orthogonality achieved by the bi-state memory storage process, as a function of memory loading ^ 37 3.4 Degree of orthogonality achieved by the memory storage process using tri-state neurons, as a function of the number of hidden neurons ^ 37 3.5 Schematic representation of the tri-state EH network memory retrieval process. ^ 39 3.6 Schematic representation of the bi-state EH network memory retrieval process. ^ 41 3.7 Comparison of three ways to handle "don't know" bits during memory recovery. 43 3.8 Comparison of radii of attraction for tri-state and random memory recovery procedures. ^ 44 3.9 Radius of attraction as a function of memory loading, for near-orthogonal memory sets ^ 45 3.10 Comparison of EH network capacity with Hopfield network capacity ^ 47 3.11 Network capacity as a function of the number of visible neurons. ^ 48 3.12 Information capacity of an EH network, as a function of the number of visible neurons used ^ 50 3.13 Storage capacity of an EH network for correlated memory sets. ^ 51 3.14 Type-1 radii of attraction of EH networks ^ 53 3.15 Type-1 radii of attraction of EH networks as a function of how near the network is to capacity. ^ 3.16 Type-2 radii of attraction of EH networks. ^ 54 55 3.17 Retrieval rates for the XOR set of associations with varying numbers of hidden neurons. ^ 58 4.1 Conceptual sketch comparing energy landscapes to free energy landscapes ^ 61 4.2 Sketches of three phases of a magnetic lattice ^ 63 4.3 Sketches of the energy surface around ferromagnetic and spin-glass states . . . ^ 64 4.4 Scaling factor CP as a function of mpv and V. ^ 77 5.1 Plots of Equation 5.7 showing a vs y and m vs y ^ 89 5.2 Plots of Equation 5.7 showing m vs a ^ 90 5.3 Comparison of predicted capacity to measured capacity. ^ 91 5.4 Sketch of the initial distribution of Xi for a = 0.2. ^ 95 5.5 Sketch of the evolution of the distribution of Xi. ^ 96 5.6 Numerical solutions for the time evolution of the mean neural alignment X. . ^ 97 5.7 Illustration of how to calculate the rms overlap from X(r) and b(r) ^ 99 5.8 Predicted and observed overlap of the most recent memory to all previous memories.100 5.9 Predicted and observed overlaps of all memories, as a function of the number of memories stored. ^ 101 5.10 Graphical estimate of EH network capacity for "whole-vector" stability. ^ 103 5.11 Estimate of the true EH network capacity^ 104 6.1 Information content of the ( iu + 1)th orthogonal memory. ^ 114 xi 6.2 Total information content of orthogonal memory sets. ^ 116 6.3 Information in the most recent near-orthogonal memory for k = 1.0. ^ 121 6.4 Information in the most recent near-orthogonal memory for k = 0.88. ^ 121 6.5 Average information per neuron in near-orthogonal memory set, using k = 1.0'. ^ 122 6.6 Average information per neuron in near-orthogonal memory set, using k = 0.88 ^ 123 6.7 Ideal information capacity as a function of the memory loading, compared to the information content of an EH memory ^ 124 6.8 Comparison of the theoretical minimum overlap, to the overlap achievable by the roll-up process. ^ 6.9 Orthogonality of an EH network at capacity. ^ 125 127 6.10 Ideal information capacity at maximum loading, as a function of the ratio of visibles ^ 127 6.11 Comparison of ideal information capacity and actual information content, at maximum loading. ^ 128 A.1 Probability distributions for partially sampled sets ^ 144 A.2 Overlaps between subsets of orthogonal memories ^ 144 C.1 Sketch of the vectors which make up T3 ^ 156 xii Acknowledgements I am grateful to a number of people who contributed to my research and to the production of this thesis. By far the greatest contributer has been my research supervisor, Professor Geoffrey Hoffmann. Professor Hoffmann was the first to recognize the practical uses for peaks in the energy landscape, and many of the other concepts which are developed here were either suggested by him, or grew out of one of our frequent discussions. Thanks also to the other members of my advisory committee, Professors Gordon Semenoff, Luis deSobrino, and Dan Camporese, for reading the thesis and making many helpful suggestions. Professor Semenoff was especially helpful in providing direction for the mathematical derivations in Chapters 4 and 5, and in being available for discussions about the statistical mechanics of neural networks. I am also grateful to Professor Geoffrey Hinton for reading the thesis and making a number of helpful suggestions which have been integrated into the text. Thanks also to my colleague Shawn Day for comments on the text and for the loan of his computer, and to my family for their tolerance and corporeal support. The research was funded by the Natural Sciences and Engineering Research Council of Canada, and the Science Council of British Columbia. Computer equipment and other financial support were provided by MPR Teltech Ltd., of Burnaby B.C. Chapter 1 Introduction The problem of gaining a comprehensive mathematical understanding of brain function is undeniably immense. The human brain is an extremely complex system, made up of at least 10 11 neurons [42, pg 1], of which there are at least scores, and probably hundreds of different types [48, pg 318]. Every neuron connects to approximately 10 4 other neurons so there are on the order of 10 15 synapses. The complexity, moreover, exists at all scales — inter-neural connections are no less complex than the connections between cortical columns or between larger functional areas. As a final obstacle to mathematical analysis, the dynamical equations for components of the brain tend to be both non-linear and time-delayed [24, 7]. Faced with such daunting obstacles, the research community has, in the time-honoured way, divided its resources between examining exact models of individual elements (neuro-physiology), more coarse-grained models of larger subsystems (neural networks), and behavioural models of the complete system (psychology), as illustrated in Figure 1.1. The central role of neural network research is to discover how high-level behaviour such as memory storage and recovery, pattern recognition and, ultimately, intelligence can "emerge" from relatively simple interactions between neurons. The aesthetics of neural network research are therefore closely tied to this concept of "emergent" behaviour. A "good" network is one which is made of simple neural components, assembled and trained according to a simple principle, but which exhibits a complex computational ability. By contrast, a network is not as "good", even if it has the same computational ability, if its ability is only a result of intelligent 1 ^ Chapter 1. Introduction^ ^1 2 m^I Nervous System I I I 1 cm^I^Maps^I I mm^I^Networks^I ^10 ^1 cm^ ^Subsystems^I I^ 100 tun^1^Neurons^I ^1 I pm^I^Synapses^I I I 1 A^ ^Molecules^I Figure 1.1: Relative scales of components of the nervous system. Neural network research examines structures which are typically between lmm and 10cm in size. (Based on Sejnowski [48, pg 305]) interference from a human designer or brute force optimization of its parameters for the specified task. The behaviour of the latter network is better described as "engineered" rather than "emergent". For many researchers, a second aesthetic is that the model incorporate as many aspects of the physiology of the brain as possible. There is some disagreement on this issue, because many neural network researchers are solely interested in the development of alternative computing devices. If a network does interesting things, what does it matter if it resembles the brain or not? The counter-argument to this is that the brain is a phenomenally successful computing device, capable of many functions which are far beyond the capability of current artificial machines. Surely there is a greater chance of reproducing these abilities in a machine which more closely resembles a real brain? With the second aesthetic in mind, it is useful to enumerate some of the salient "design features" of the brain which might be important to its higher functions. A selected list is: Chapter 1. Introduction^ 3 1. Hebbian Learning: learning occurs at each synapse according to a scheme first suggested by Hebb, as discussed below. 2. Recurrence: the output of almost every neuron indirectly feeds back into that neuron. 3. Hidden Neurons: virtually all neurons in the brain are neither sensory nor motor neurons. 4. Low Fractional Fan-Out: the number of neurons connected to a single neuron is much smaller than the total number of neurons in the brain. 5. Time Delays: Significant time delays are present in the interconnections between neurons. Table 1.1 lists some important neural network models and indicates which of these features are associated with each. Hopfield's network model [25], which will be described in detail in Chapter 2, uses a recurrent architecture, and a version of Hebbian learning. Each neuron in a Hopfield network is either Hopfield [25] Extended Hopfield (EH) [13] Boltzmann Machine [22] Time Delay Hopfield [51, 32, 2] Willshaw [52] Kanerva [27, 9] Back-Propagation [45] Pineda [44] Dispersive [15] Kohonen [33] Hebbian F F S F F F recurrent • • • • • • hiddens low fan-out time delays • • • • • • • • • • • • • Table 1.1: Some features of the neurophysiology of the brain cross-referenced to neural network models. Hebbian learning is categorized as fast (F) or slow (S), as discussed in Section 1.1.1. The EH network is the only one which combines recurrence, fast Hebbian learning, and hidden neurons. Chapter 1. Introduction^ 4 "on" or "off". The state of the network is determined by the "on" and "off" states of all its neurons. A "memory" is a network state which has been "learned" by the network using a learning rule which is described below. 1.1 Important Features of the Ell Network The "Extended Hopfield" or "EH" network is the subject of this thesis, and is of particular interest because it is the only one to combine the features of recurrence, fast Hebbian learning, and hidden neurons. These three features of the brain's structure are discussed separately in the following subsections. 1.1.1 Hebbian Learning The Hebbian learning principle was first suggested by D.O. Hebb in 1949 [19], without any experimental support. It can be summarized as follows [46, pg 36]: When neuron A and neuron B are simultaneously excited, increase the strength of the connection between them. Until very recently experimental tools were not available to test Hebb's conjecture, but it remained of interest because it is a simple and fast way to train a network, and all learning is done locally; there is no need for an external "teacher" to influence the changes in synaptic strength. In 1969, Willshaw et al [52] described a very simple associative memory network which uses a form of Hebbian learning. Their network consists of a set of input neurons and output neurons connected by synapses The network is trained by presenting a series of memory Chapter 1. Introduction^ pairs it' and rf, , 5 and setting the synapses as follows: cij = 1^if a7 = 1,7 = 1 for any a 0^otherwise After the network is trained, a memory a can be recalled from a prompt b as follows: ai^1 = —2 [1 — sign (k — J:=1. where k is a positive constant determined by the specific memory set being stored. The capacity of Willshaw's network does not increase proportionally to N [20, pg 53], so its capacity is very limited. The network helped establish the concept of associative memories, however, and the viability of Hebbian learning for artificial neural networks. In 1982, Hopfield (25] presented a neural network model which stored memories using a learning rule which was very similar to Hebb's, and which will be called "pseudo-Hebbian": During learning, if neuron A and neuron B are in the same state (either both excited or both quiescent), then increase the strength of the connection between them. If neuron A and neuron B are in opposite states, reduce the strength of the connection between them. The ability of Hopfield's model to store many memories in a single network was a truly emergent property, and it generated a great deal of new interest in neural networks, and Hebbian learning in particular. Some neural networks [22, 21, eg.] use an iterative form of Hebbian learning in which the training signal must be provided repeatedly over an extended period of time while the connections adapt. Early tests of the Boltzmann Machine, for example, required that each training item be presented 5000 times [22, pg 308] to eliminate the errors. Hopfield's version of Hebbian learning, by contrast, is prescriptive rather than iterative, and requires only a single Chapter 1. Introduction^ 6 presentation of the training items. These two approaches are distinguished as "slow" and "fast" Hebbian learning, respectively, in Table 1.1. The EH network is a "fast" learner, although not as fast as Hopfield's because of time spent determining the optimum state of the hidden neurons. Since the initial surge of interest in Hopfield's paper, much research activity has swung away from Hebbian learning toward iterative learning techniques such as back-propagation [46). Because the back-propagation process maintains tight control of the behaviour of the network, its ability to recall memories is not really an "emergent" property. Such networks have been very successful, however, in industrial applications such as machinery control [39], signal processing [34], trend analysis [17, 11], speech recognition [37], and speech synthesis [47]. Recent improvements to back-propagation have included methods for improving learning speed [41], and for incorporating time delays in the synapses [14]. Hebbian learning has recently caught the interest of researchers again, as a result of important new experimental evidence. Thomas Brown's group at Yale [7, 8] have used microelectrodes to control the voltage above and below isolated synapses in the hippocampus, and have found that the synaptic conductance stays approximately constant until the pre- and post-synaptic membranes are stimulated simulataneously, just as Hebb had postulated. This is the first, and only, direct observation of learning at the cellular level, and it implies that new models of the brain will be far more likely to replicate brain-like behaviour if they incorporate some form of Hebbian learning. 1.1.2 Recurrence Neural networks can be classified as either feed-forward or recurrent. Feed-forward networks are thought to be representative of sensory inputs and motor outputs. Signals in a feedforward network pass from the input neurons, often through layers of hidden neurons, and then arrive at the output neurons without ever looping back. Once trained, the behaviour of artificial Chapter 1. Introduction^ 7 feedforward networks is therefore very simple. Most of their complexity lies in the methods used to train them. In recurrent networks, signals going out from a neuron can eventually influence the input to that same neuron. Networks of this type are thought to be representative of the cognitive centers of the brain, where virtually all the neurons are of this class [42, pg 1]. Both the Hopfield model and the EH model are fully recurrent networks, in that every neuron can be connected to every other neuron. The behaviour of recurrent networks can be quite complex, because of non-linearities and time delays in the signal loops. A network which begins in one state (a "state" is defined by the on/off status of all the neurons in the network), will typically evolve through a series of intermediate states as the neural signals loop around in the network. In both the Hopfield and the EH model, it will eventually reach a stable "attractor" state, and stop evolving. In the brain, the network never reaches a stable state — new thoughts are continually being created in the mind. 1.1.3 Hidden Neurons The overwhelming majority of neurons in the brain are "hidden", meaning that they are not in contact with the outside world. The only neurons which are "visible", rather than hidden, are the sensory neurons and the motor neurons. In an artificial neural network, any neuron which is not initialized, and whose output is not seen, is considered to be a hidden neuron. Experience with feed-forward networks has shown that hidden neurons are of central importance for most applications. Minsky and Papert [40], for example, proved that one of the basic elements of logic, the XOR relationship, cannot be learned by a perceptron network with no hidden neurons. The XOR relationship can be learned by a network when hidden neurons are present or the Chapter 1. Introduction^ 8 output units have non-monotonic transfer functions 1 . During learning, the on/off status of hidden neurons is determined by processes internal to the brain, rather than directly by the training signal. For neural networks to use hidden neurons, the design must include some internal procedure for determining their on/off status. Back-propagation, for example, manipulates the status of the hiddens by iteratively modifying the weights of the synapses leading into the hidden neurons. The biggest limitation of Hopfield's network is that it has no mechanism for including hidden neurons. The model treats all neurons equally, as both inputs and outputs, and requires that all neural states be defined by the trainer. This makes for a very simple and elegant network, but it imposes the following limitations on the network's abilities: • It cannot store the XOR relationship, or higher order parity. • The length of each memory must be exactly equal to the number of neurons. • The maximum number of memories which can be stored is strictly limited by the number of bits in the memories. The EH network makes it possible to add hidden neurons to a Hopfield network without compromising Hopfield's use of Hebbian learning and recurrent architectures. During training, the EH network uses the emerging structure of its own memory space to determine the optimum on/off status of each hidden neuron, as described in Chapter 3. Because that optimum status is one where the memories are mutually orthogonal, the EH network can be thought of as a "self-orthogonalizing" network. 1 The author thanks G. Hinton for pointing out that non-monotonic transfer functions in the output neurons make it possible to store XOR without hidden neurons. Chapter 1. Introduction^ 9 1.2 Research Goals The overall goals of this research are to better understand the structure and the behaviour of the brain, and to develop computing procedures which take advantage of those principles to perform novel computing tasks. The specific goals which led to the results presented here were: • to extend Hopfield's model to include hidden neurons • to demonstrate the improved performance resulting from the extension • to gain "analytical control" [5, pg. 2293] of EH networks. • to derive an analytical expression for the capacity of an EH network as a function of the number of hidden neurons 1.3 Summary Chapters 2 through 6 describe the research that has been done in pursuit of the above goals. It can be summarized as follows: • Chapter 2 reviews Hopfield's neural network and examines its performance with nonrandom memory sets. The chapter begins with a brief review of the structure of Hopfield networks, and the notation used to describe them. It then introduces the idea of "near-orthogonal" memory sets, describes how they can be stochastically generated, and examines how the memory capacity of the Hopfield network changes as the memory sets become more orthogonal. The final section of the chapter introduces an important working assumption which will be used in subsequent derivations. • Chapter 3 defines and characterizes the new EH network. It begins by describing the EH network's memory storage procedure, and then presents measurements of how well Chapter 1. Introduction^ 10 that procedure is able to orthogonalize the memory space. It describes two alternative methods for recalling memories when specific neurons (usually "hidden" neurons) are flagged as unknowns, and examines the effectiveness of the methods for near-orthogonal memory sets. Finally, in Section 3.4, many different aspects of the network's performance are examined, including its memory storage capacity, its information capacity, its ability to store correlated memory sets, and the size of its attractor basins. • Chapter 4 develops a mathematical model for a Hopfield network in which near-orthogonal memories have been stored. The chapter begins by reviewing the analogy between lattices of Ising ferromagnets, and Hopfield-style neural networks. It then derives an expression for the free-energy, using the replica method and assuming replica symmetry. Finally, it derives a set of mean-field equations which are valid at the saddle point in the free energy. • Chapter 5 derives an analytical estimate of the storage capacity of the EH network as a function of the fraction of visible neurons in the network. The chapter begins by using the mean-field equations from Chapter 4 to calculate the capacity of a Hopfield network as a function of the degree of orthogonality of the memories stored. Next, it uses a statistical argument to estimate the degree of orthogonality that can be achieved by the EH network's memory storage process. Finally, it combines these two results to estimate the memory capacity. • Chapter 6 compares the theoretical maximum information capacity of near-orthogonal memories to the actual amount of information that can be stored in an EH network. It begins by calculating the information content of exactly orthogonal memory sets, using both a dimensional argument and a brute-force calculation. It then extends the result analytically to include near-orthogonal memory sets. Finally, a comparison is made between the amount of information which can be stored in a memory set of a certain orthogonality, and the amount of information which is stored in an EH network with the same degree of orthogonality. Chapter 1. Introduction^ 11 • Chapter 7 gives a summary of the conclusions that have been reached in this research, and of further research that needs to be done. Appendices A through D provide details of some of the calculations used in the main body of the thesis. Chapter 2 The Hopfield Network and Non-Random Memory Sets Before the EH network can be introduced, it is important to review the structure and dynamics of Hopfield's original network, and its behaviour for sets of memories which are not randomly generated. Section 2.1 summarizes the design of the network and introduces the notation which will be used in subsequent chapters. Section 2.3 defines the concept of a "near-orthogonal" memory set, and presents simulation results which show how the memory capacity of a Hopfield net varies as the memory sets become less random. Section 2.4 presents an important working assumption about the behaviour of Hopfield networks which will form the basis for mathematical derivations in subsequent chapters. 2.1 Review of Hopfield Neural Networks A Hopfield net consists of N neurons, each of which is connected via synapses to every other neuron, as shown in Figure 2.1. The "synaptic strengths" are in general real numbers ranging from positive to negative. The output of the ith neuron is called the "state", Si of that neuron. The Hopfield net uses bi-state neurons, each of which has only two possible states, (Si = +1) and (Si = -1). The set of N neural states together form an N-dimensional "network state" vector S. A dot product between any two network states can be written as follows: N s 1 . 52 E sis? 12 (2.1) Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 13 Initial State V Current State 2 Weights Figure 2.1: Diagram of a Hopfield network with 5 neurons. The output, Si, from each neuron is multiplied by a weight Jij and summed with other signals before being fed as an input to each of the other neurons. The initial states of the neurons are imposed by an external signal, which is removed once memory recovery begins. State vectors can be written concisely using just the signs of their components, as in the following example: S3^++-+-+++--+--+++ The input Ui to each neuron i is equal to the sum of the outputs of all the other neurons, weighted by a set of connection strengths Ji 1 through JiN, where N is the number of neurons: N =^ (2.2) j=1 Neurons do not synapse back onto themselves, so Jii = 0. Non-zero Jii terms would not affect the location of memories, but can influence the dynamics of the network. Large positive 4, for example, tend to stabilize states which would otherwise be unstable. Negative have the opposite effect, destabilizing some of the shallower attractor states. Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 14 The notation used here is that used by Amit [1], Lautrup [35], Mezard, and Parisi [38]. Hopfield used Tip and Vi rather than Ai and Si in his original paper [25]. 2.1.1 Storing Memories The connection strengths Jij are determined by the memories that have been stored in the network. A "memory", just like a network state, is an N-dimensional vector ei whose components are (+1) or (-1). A set of P memories can be "stored" in a network by setting the strength of each connection as follows: Jib [ P = N^ 0=1 ere; -^, where the bij ensures that the diagonal elements of J are always zero. The k term is often not included in the definition of Jij but, as will become clear, it does not change the dynamics of the network, and it ensures that the connection strengths remain finite in the limit of very large P and N. The definition of Jib implies that each new memory can be added using the following "outer product rule": 4 = r .113 + Tr P.4; — (5131 (2.3) The application of the outer product rule is variously referred to as "storing", "inserting", or "memorizing" a new memory eµ. Note that the outer product rule is an implementation of the pseudo-Hebbian learning rule introduced in Section 1.1.1, in that if two neurons i and j are in the same state = el, the connection Jib between them increases, otherwise it decreases. The "memory loading", a, of a network is equal to the ratio of the number of memories P and the number of neurons N: a =^ (2.4) ^ Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 15 2.1.2 Network Dynamics and Stable States The evolution of a Hopfield network is determined by its initial state, and by the connection strength Jij. In Hopfield's original network, the network evolves in the following way: 1. One of the neurons, i, is selected at random . 2. The input Ui for that neuron is calculated. 3. The state Si is modified according to the following "forward update rule": +1 if Ui > 0 Si =^ (2.5) —1 if Ui < 0 4. A new neuron is selected as in Step (1). This is referred to as the "asynchronous" version of the forward update rule because each neuron is updated independently. Evolution continues until a stable state is reached in which all the inputs Ui have the same sign as the neural states Si. Hopfield proved that the network will always reach such a stable state, and that the stable states tend to be attractors. If the final state is equal to one of the memories e, then the network is said to have "recovered" or "recalled" that memory. Hopfield networks are "autoassociative", meaning that they tend to recover the memory which is most similar to their initial state. In a typical application, the network is initialized to ° a "prompt" state 5 , which is an incomplete or noisy version of one of the memories, and then is allowed to evolve to one of the attractor states. Barring any problems, the final state is the memory state which most resembles the prompt state. The "radius of attraction", Re, of a memory e is a measure of how many bits can be recovered during the memory recovery process. More precisely, it is the maximum number of Chapter 2. The Hopfield Network and Non-Random Memory Sets ^ 16 bits that can be wrong in the prompt before the probability of recovering the correct memory becomes less than 50%. In some situations it is preferable for the network to evolve in the following way: 1. The inputs U1 through UN are all calculated. 2. The states S1 through SN are all modified using the forward update rule (Equation 2.5), using the pre-calculated U1. 3. A new set of inputs Ui is selected as in Step (1). This is referred to as the "synchronous" version of the forward update rule because all neurons are updated at the same time. A network evolving with the synchronous update rule does not always reach a stable state, but may instead enter a limit cycle in which it flips between two states [1, pg 78]. 2.1.3 Overlap miw and Orthogonality The overlap mou between two memories e and is a measure of how similar they are to each other. It is defined as follows: rev 1 N N i=i er (2.6) Overlaps of 1, and -1 indicate that two memories are identical, or bit-inverse to each other, respectively. Two memories are said to be "orthogonal" to each other if their overlap is zero. When m appears with a single superscript, it represents the overlap between the current state S and one of the memories: E 1 N N i=i Chapter 2. The Hopfield Network and Non-Random Memory Sets ^ 17 It will be important in the following chapters to characterize memory sets by the magnitude of typical overlaps between the memories. The expected overlap mou between two randomly selected memories is equivalent to the expected length of a random walk of N steps. The mean of mo' is therefore zero and the variance is ki [1, pg 175]. Overlaps for non-random memory sets are discussed in Section 2.3. 2.1.4 Energy Landscape Hopfield noted that his network is "isomorphic with an Ising model" [25, pg 2556], a fact which will be discussed in detail in Section 4.1. This isomorphism led him to define the following "energy" E for every state S of the network: 1 E =^Ji• S2^ 3^3 :,3=1 (2.7) It can be easily shown [20, pg. 21], when Jii = Jji, that this energy acts as a Liapunov function for the update rule in Equation 2.5. It is common, therefore, to speak of an "energy landscape" in which the network moves. When the forward update rule is used, the network never rolls uphill in this landscape, and always eventually reaches a stable state at the bottom of an "energy valley". When no memories have been stored in the network, and all the J, are zero, the landscape is flat. After one memory has been installed, there are two valleys: one at e and one at el. 2.1.5 The Meaning of Energy Peaks If the valleys in the energy landscape represent stored memories and their bit-inverses, what do the peaks represent? If Equation 2.3 is replaced into Equation 2.7, the energy of a state Si Chapter 2. The Hopfield Network and Non-Random Memory Sets ^ 18 can be written: N [P EE E = — 1 2N i,j=1 µ=11 3 — psi ; SiSi N P^P _2_ . = _^ne)2^ 2 pz (2.8) Peak states are states where E is maximum, so they are also states where (mP)2 is minimum. In other words the peaks mark states which are, in the least-squares sense, optimally orthogonal to all installed memories [1, pg 167]. "Optimally orthogonal" here means that the sum of the squares of the overlaps is minimum compared to all nearby states. If the states were exactly orthogonal, the sum of the squares would of course be zero. 2.1.6 False Attractors There are always stable attractor states which were not explicitly stored as memories. Every time a memory where C — e e tA is stored, for example, the vector eP is automatically stored as well, for all i. This follows from the fact that Jii in Equation 2.3 is unchanged when all the Cs are negated. The vectors are referred to as "bit inverse attractors". When more than two memories have been installed, other "false" attractor states begin to appear which are not so simply related to the real memory states. The locations of many of these states are quite well understood analytically [1, pp 176-177]. The following "symmetric 3-states", for example, will always be false attractors: Si = sign( t + Si + si gnw + Si — + a) Si = sign(Si Si = sign(e — 6? — 6?) si = sign(-e + Si + Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 19 etc. There are similar symmetric 5-states, 7-states, and so on. False attractors are a serious problem for practical applications of attractor neural networks. As new memories are added, the number of false attractors grows far more rapidly than the number of true attractors, which reduces the radii of attraction of the true memories. 2.2 Limitations and Extensions of the Hopfield Network There are a number of fundamental limitations of Hopfield's neural network which reduce its usefulness in many practical applications. These include: 1. The network is unable to store the XOR boolean relationship, for reasons discussed in Section 3.4.5. This fundamentally limits its usefulness as a computational device. 2. The number of synapses in the network is always equal to N(N-1), where N is the number of neurons. Because the number of synapses determines the capacity of the network, this means that the capacity of the network is limited by the width of the stored memories. 3. The capacity of the network is proportional to the number of neurons, which is the square root of the number of synapses. As networks get bigger, therefore, the complexity increases as the square of the capacity. 4. The capacity of the network falls quickly when the memories become more correlated, as discussed in Section 3.4.3. Typical memory sets in the real world are often correlated, so this is an important limitation. The EH network overcomes the first two of these limitations, as discussed in Chapter 3, but is not significantly better than the Hopfield network for the latter two. As discussed in Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 20 Chapter 5, although the capacity of an EH network is larger than that of a Hopfield network, it still increases linearly with N. Similarly, Section 3.4.3 shows that the capacity of the EH network falls almost as rapidly as the Hopfield network when the stored memories are correlated. Baum and Moody [6] have described a feedforward network whose structure bears some similarity to Hopfield's, and which overcomes these limitations. They call the network a "unary Associative and Content Addressable Memory" (Unary ACAM). It has two layers of connections, with one layer of hidden neurons in between. The top set of connections, connecting the input neurons to the hidden neurons, form a Hamming net [36]. When an input is presented, the Hamming net activates exactly one hidden neuron — the one which corresponds to the pattern that most resembles the input pattern. The hidden neurons are therefore "grandmother" cells [20, pg 143] because each one corresponds to one specific memory. It is straightforward to set up connections from the grandmother cells to the output neurons which ensure that the desired output signal is generated in response to each grandmother cell. Baum and Moody point out that if the grandmother cells are replaced by unity gains, and the output is fed back into the input, their network is equivalent to Hopfield's and the training method is equivalent to the Outer Product Rule. As a computing device the unary ACAM is, arguably, a better performer than Hopfield's. Its memory capacity is exactly equal to the number of hidden neurons, and so can be extended almost indefinately just by adding more hidden neurons. Furthermore, memory recall does not degrade as the network approaches capacity. The network also works well for correlated memories, and has no trouble storing the XOR set of associations. As a model for human memory, however, the Unary ACAM is not very satisfactory because it is not trained with Hebbian learning and it is not recurrent. It is, essentially, a machine which has been designed to respond to specific inputs with a specific output. Its behaviour is therefore more engineered than emergent. Chapter 2. The Hopfield Network and Non-Random Memory Sets ^ 21 The most significant limitation of the Unary ACAM as a computing device is its lack of robustness. The failure of any grandmother cell, or any synapse in the output layer, causes an immediate failure in recalling the corresponding memory. This weakness can be addressed in a number of ways. The simplest way, as pointed out by Baum and Moody, is to add redundant grandmother cells. If two grandmother cells are used for each memory the failure of any single cell no longer degrades the performance of the network. Such a network is still far less robust than a Hopfield network, where more than half the neurons can typically be disabled without seriously degrading the performance. Another network which is similar in structure to the Unary ACAM but somewhat more robust, has been described by Kanerva [27, 9]. Like the Unary ACAM, Kanerva's network is a feedforward device with two layers of connections and a single layer of hidden neurons. As in the Unary ACAM, each neuron in the hidden layer is trained to recognize a single vector input, but now the training vectors are randomly chosen rather than equal to the memory set. There is no "winner take all" circuitry on the hidden neurons, so more than one hidden neuron can be activated for each input pattern. The first layer of synapses now effectively encodes each N-bit input into an s-bit pattern, where s is the number of hidden neurons. The second layer of Kanerva's network is similar to a Hopfield network except that it has separate input and output neurons, and is not recurrent. The connections of the second layer are trained with the Outer Product Rule. Kanerva's network performs comparably to the Unary ACAM, and its performance can be fine-tuned by adjusting its parameters. The robustness of the network, for example, can be increased by: • Adding more hidden neurons, which adds redundancy to the network, at the price of added complexity; or Chapter 2. The Hopfield Network and Non-Random Memory Sets ^ 22 • Decreasing the thresholds on the hidden neurons, which brings more hidden neurons into action for each input pattern, at the price of more noise in the input to the second layer. Like the Unary ACAM, the capacity of Kanerva's network grows in proportion to the number of hidden neurons, and hence can be expanded indefinately by adding more hardware. Kanerva's memory, which has been described as a Hopfield network with a preprocessor [9, pg 285], is somewhat more satisfactory than the Unary ACAM as a model for biological memory. It is conceivable that a biological memory might have a hard-wired preprocessor network which encodes input signals into larger dimensional signals; and the Hopfield-like second layer is certainly plausible. Unlike the Hopfield model, however, Kanerva's network is not recurrent. 2.3 Non Random Memory Sets - This section focusses on how the memory capacity of a Hopfield network is affected by the level of correlation between the memories that are being stored. The first step in quantifying that relationship is to define a measure for the level of correlation in a memory set. A natural measure is the root-mean-square of the overlap between all the memories e. For random memories, the rms overlap is N/Tr, so the following "normalized rms overlap" is used: 1 O m, -a - (tleficq (2.9) The double brackets used here are Amit's notation [1, pg 187] for the average over an ensemble of memory sets, each randomly generated according to a given probability distribution. For example, if there are P memories in each memory set, and if all memories are equally likely, then the ensemble average of a quantity G ((Geo))) E al • • .e ) is: 2-PN E^.. • e ) . {e}...{e-} (2.10) Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 23 The ensemble average is included in the definition of O^so that it can be directly related to the probability distribution used to generate the memory set, rather than being unique for each set of memories. In many of the calculations it is convenient to use the variance V a-, (0,,,,) 2^(2.11) rather than Orms . Both O, and V range from 0 for exactly orthogonal memory sets to 1 for random memory sets. For memory sets which are correlated, such as those discussed in [5], they are both larger than unity. 2.3.1 Generating Near Orthogonal Memory Sets - A "near-orthogonal" set of memories is a group of memories in which typical overlaps are smaller than they would be if the memories were randomly selected. One way to generate a set of near-orthogonal memories {} is to start with a set then progressively add random noise. Memories {r} of exactly orthogonal memories and rP and I' are exactly orthogonal if and only if they satisfy the following equation: N Er;r;^Nev. If b is the probability that a particular bit er is not equal to rt% then the probability distribution i of memory bits er is: P(ei ) = (1 — b) 8(er — rn l + b 45(e; + rn,^(2.12) where 6. (a — b) ba b . The value of b ranges from 0 (no noise) to 0.5 (maximum noise). In order for Equation 2.12 to be useful, it is necessary to first have a set ro of exactly orthogonal memory vectors. Section A.1 describes how to generate one such set for any value Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 24 of N which is an integer power of 2. It is a "complete" set, in that the number of vectors is equal to the number of available dimensions. Complete orthogonal sets for other values of N are not possible because of the constraint that vector elements must be either +1 or - 1. The expected overlap between a state 411 and its corresponding pi is: N EP(enOT = (1 — 2b) 41‘1;))^ i=1^fel (2.13) Consider now the average mean and variance of the overlaps between two different states el' and e', as derived in Section A.2. The mean overlap is found to be zero, because there is no reason that a positive overlap should be more likely than a negative overlap. The variance in the overlap is, from Equation A.4, equal to: .r)) = N 11 - (1 - 2b) 4 1 from which it follows immediately that: V = 1 - (1 - 2b) 4 (2.14) 1 b = - [1 - (1 -^. 2 (2.15) Or It is a simple matter, using Equation 2.15, to generate sets of memories for which V has a particular value. This was tested in simulations and proven to be accurate, as shown in Figure 2.2. 2.3.2 Generating Correlated Memory Sets Amit [1][5] generates correlated memories by decreasing the probability of each neuron being +1. He uses a parameter a which varies from 0 (random memories) to 1 (completely correlated Chapter 2. The Hopfield Network and Non-Random Memory Sets ^ 25 1.2 4 1 - ( 1 - 2b ) rn 0 0.2-1 N = 128 0-F 0 0.1^0.2^0.3^0.4^0.5^0.6 b = Prob of Flipping Each Memory Bit^1992 APRIL 21 Figure 2.2: Measured variance in the overlaps between near-orthogonal memories, compared with the predicted variance. The line is a graph of Equation 2.14. The data points are from simulations of 100 memory sets, each with 128 neurons. The memories were generated by adding noise to exactly orthogonal memories, using Equation 2.12. The mean overlap between memories was measured and was close to zero. memories), and writes the probability distribution as follows: P(e) = i1^1 (1 a)6( — 1) -I- — (1 — a)(5(e + 1) 2 (2.16) The mean activation of the neurons is a, and the average overlap between two different memories is NO. If a correlated memory set with zero mean activation is needed, it can be formed by making a cosmetic change to Amit's method. After each new memory is generated using Equation 2.16, a fair coin is flipped. If the coin comes up heads, every bit in the memory is inverted, otherwise the memory is left unchanged. This process creates memory sets with zero average activation, but with connection strengths Jii which are exactly as they were using Amit's process, because of the symmetry in Equation 2.3. If the mean activation is zero for Amit's correlated memories, then the variance V in the Chapter 2. The Hopfield Network and Non-Random Memory Sets ^ 26 overlaps can be used as a measure of the degree of orthogonality, as in Equation 2.11. The value of V is related to a as follows: = 1 N^\2 ((V))^(^ ) i=1 N N =^(E erere.'"q) N N -N1- (E + EE eer4 '"ef) 1=1^j . yi [iv + (Eeer),.,.] 1+ a4 (N — 1) ioi (2.17) Or a = (V — 1\I N -1) 2.3.3 Benefits of Near Orthogonal Memory Sets - A memory 4- is stable if the inputs^have the same sign as the neural states er. This is equivalent to the following condition: 11.5 0 <^ Vi 0 < E .17=1 Jii^ vi < EpP--1^E7-1 e3 q - P^Vi 0 < (N — P) N (2.18) r ntiw^Vi 00 ,, ee The first and second terms in this last equation can be thought of as the "signal" and the "noise", respectively. The sign of the noise term is random, so memories will tend to go unstable when the magnitude of the noise approaches (N — P). This implies that smaller overlaps mill' will allow more memories to be stored before retrieval begins to deteriorate. The EH network is designed to capitalize on this fact. ^\ ^. Chapter 2. The Hopfield Network and Non-Random Memory Sets ^ 27 0.50— 0.45- z 0.40- Random 0.35- 8. 0.30- 0 0.25E E 0.20- a- Correlated 0.150.10- .... Near- +^......... ^ 0.05..... ........ . ... .. . ... . . Orthogonal 0.00— 0.0 1 0^2.0^3.0 Normalized rms Overlap^tem FEB 25 Figure 2.3: Memory capacity of a Hopfield network as a function of the level of orthogonality in the memory set. The orthogonality measure O is defined in Equation 2.9. The memory capacity ce, is times the average number of memories which can be stored before one twentieth of the stored memories are no longer stable. Simulations were done on 100 different sets of random memories for each data point. The network consisted of 128 neurons. The dashed outlines show the standard deviation in memory capacities. -,;„ The derivation of an analytical relationship between storage capacity and Orras is quite complex and will be postponed until Chapter 4, but it is appropriate in this chapter to present simulation results confirming that memory capacity improves with orthogonality. Figure 2.3 shows that the memory capacity increases dramatically as the rms overlap decreases, as expected. Also included in the figure are measurements of the network capacity for correlated memories, generated using the zero-mean-overlap version of Amit's process described in Section 2.3.2. In addition to the need for large memory capacities, it is also important that neural network have basins of attraction which are as large as possible. Cottrell [10] showed that, for any network which uses the outer product rule, the radius of attraction is optimal if all the memories Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 28 0.501 Random 0.45......... ...... c o 0.40'5 P. = < 0 0.35.- . M = 73 x ° .......... ......... - 0.30- 0.250.201 0.0 NearOrthogonal 11511^ 0.5^1 0 Normalized rms Overlap 15 1992 FEB Es Figure 2.4: Radius of attraction of a Hopfield network after 10 memories have been stored, as a function of the level of orthogonality in the memory set. The normalized rms overlap O is defined in Equation 2.9. The radius of attraction was measured as for Figure 2.3. Simulations were done on 100 different sets of random memories for each data point, using a network with 128 neurons. The dashed outlines show the standard deviation in radii of attraction. The memory capacity of the network falls below 10 just above (O nus = 1), so it did not make sense to calculate the radii of attraction above that point. are orthogonal. This suggests that the radii of attraction of the memories will improve as memory sets become nearly orthogonal. Figure 2.4 confirms that this is true, but the improvement in radius of attraction is not dramatic over the range that is plotted. 2.4 Working Premise: All Orthogonalizing Processes are Equivalent Two different methods for generating near-orthogonal memory sets are investigated in this thesis. The first, which was introduced in Section 2.3.1, is quite compatible with mathematical analysis, but could not be used for storing real memories. The second, which will be introduced in Chapter 3 and is the core process in the EH network, lends itself very naturally to real-world implementations, but would be very difficult to analyse mathematically. The mathematical Chapter 2. The Hopfield Network and Non-Random Memory Sets ^ 29 analysis of EH networks therefore proceeds under the following working assumption: • ansatz: the dependence of network behaviour on the orthogonality of the memory set is independent of the manner in which that orthogonality was achieved. Chapter 3 The Extended Hopfield Network This chapter introduces the detailed design of the Extended Hopfield ("EH") network, and presents the results of a number of tests of its capacity and reliability. Some of these results have been published in earlier papers by Hoffmann and Davenport [12, 13]. The central design challenge is to find mechanisms for storing and retrieving memories, which make efficient use of hidden neurons. The following three points are of central importance in determining what direction to proceed: • To add hidden neurons to a Hopfield network, some process must be used to determine the states of the hiddens before the outer product rule installs the memory (Section 1.1.3). • More memories can be installed in a Hopfield network, and with better radii of attraction, if the installed memories are as nearly orthogonal to each other as possible (Section 2.3.3). • Peaks in the energy landscape of a Hopfield network are located at states which are optimally orthogonal to all previously-installed memories (Section 2.1.5). When stated in this way, the following conclusion seems obvious': During memory storage, the states of the hidden neurons should be set in such a way that the network moves as close as possible to an energy peak. This will ensure 'This conclusion seems obvious in retrospect, but it wasn't obvious until Hoffmann pointed it out. 30 Chapter 3. The Extended Hopfield Network ^ 31 that the memory sets are as orthogonal as possible, which will improve the capacity and reliability of the network. The technical details of how this is done are described in the following sections. 3.1 EH Network Terminology The EH network requires some added notational conventions to accomodate the presence of hidden neurons, and the use of neurons with three possible states. These are introduced in subsections 3.1.1 through 3.1.5. 3.1.1 Hidden Neurons in an EH Network Hidden neurons in an EH network are treated identically to visible neurons wherever possible. Each is fully connected to every other neuron, for example, and each obeys the same dynamical rules that the visibles obey. Just as in a Hopfield net, the hidden and visible neurons in an EH network behave in a manner which is completely independent of their position in the network, so for notational simplicity, the first R neurons are assumed to be visible, and the last N — R neurons are assumed to be hidden. A state vector can be written as follows: Si^+-+---+-++++-++-+--+--+ -++--+++--+--+++. R visibles^(N—R) hidden The letter M is used to represent the number of hidden neurons: M^(N — R) Chapter 3. The Extended Hopfield Network^ 32 3.1.2 Partial Overlaps in the EH Network The overlap mou between memories ep and can be split into two components in an EH net- work. The overlap between the visible neurons is written mr and is defined as: ^v 1 R ^R i=1 The overlap between the hidden neurons is written mr and is defined as: ^m fr^E cc i=R+1 1 N The total overlap between vectors is therefore: ma" = 1 (Rmr Mmr) 3.1.3 Measuring "Stability" in an EH Network The most common method for measuring the capacity of an attractor network is to measure how many stable memories can be stored. In a Hopfield network, the test for stability is to set all neurons equal to the memory being tested, and then allow the network to evolve using the update rule. If no evolution takes place, the network is stable. In an EH network, stability is tested by setting only the visible neurons equal to the stored memories. The hidden neurons are treated as unknowns using one of the procedures discussed in Section 3.3. Memory recall then takes place in the normal way, and continues until an attractor is reached. If the state of the visible neurons after recall is equal to the memory state, then the memory is "stable". Chapter 3. The Extended Hopfield Network^ 33 3.1.4 "Level of Correlation" of Memory Sets In Section 3.4, a number of tests are described in which memory sets with various statistical properties are tested. For those tests, the "level of correlation" of the memory set is determined by the rms overlap of the visible neurons only. So if a set of "random" memories are stored in a network with 50 hiddens and 50 visibles, it means that the visible bits have been selected at random, so: (mr )rms = 1 (3.1) R 3.1.5 TriState Neurons Option One of the options which will be discussed for the EH network is the use of tri-state instead of Hopfield's standard bi-state neurons. Tri-state neurons use the standard +1 and 1 states, and - an additional "neutral" state 0. A typical tri-state vector might look like: S 1 = +-+---+-++++-++-+--+--+ -++00000-0+-000+. R visible^(N—R) hiddens The advantage of using tri-state neurons is that when they are in the zero state, they have no influence on the other neurons, which is appropriate if, for example, the correct state of that neuron is completely unknown. The disadvantage, of course, is that tri-state neurons are more complicated, and are more difficult to implement as electronic hardware. It will be shown that everything possible with tri-state neurons can be achieved with bi-state neurons, albeit with somewhat more complicated procedures. Chapter 3. The Extended Hopfield Network^ 34 3.2 Storing Memories in an EH Network An EH memory is stored using the outer product rule, just as in a Hopfield network, but before the memory is stored, the state of the hidden neurons is determined by letting the network "roll up" in the energy landscape. The rolling up process is very similar to the evolution procedures described in Section 2.1.2, except that the forward update rule (Equation 2.5) is replaced by the following "reverse update rule": —1 if^> 0 =^ +1 if^< 0 (3.2) It is not difficult to prove that rolling up leads to an energy peak. From Equation 3.2, a stable state S satisfies Si = —sign(Ui). Using Equations 2.7 and 2.2, the energy E of a stable state S is therefore: 1 1 E = -- V UiSi = — 2 2 i=i Inverting any bit j changes the energy to E': 2 E luil — i= 1 which is always less than E, so the stable state is a point of (locally) maximum energy. The energy of a network with finite N is bounded from above, so a peak will always be encountered when rolling up. The procedure for storing memories is shown schematically in Figure 3.1, and can be summarized as follows: 1. All known bits are set to their known values 2. All unknown or hidden neurons are set either to zero (if tri-state), or to +1 or -1 at random (if bi-state). Chapter 3. The Extended Hopfield Network^ Maximum Energy State (Visible Neurons Unchanged) Reverse Update Rule 2' 1I Initial State (Hidden are Zero) visibles hidden^Energy + - -H-H- - reverse update rule outer product rule New Energy Contour After Memory is Stored 35 I 000000000000 -17 + - + - -H- - - - - -H- +56 Revised J„ Energy "Landscape" + - -H- - - - - -H- -126 State Space (a)^ (b) Figure 3.1: Schematic representation of the Ell network memory storage process, (a) as a "roll-up" process in the energy landscape and (b) as an update process for bits in a state vector. The energy surface as a function of state space has peaks and valley-bottoms. The peaks are locally optimally orthogonal to the valleys. The location of a peak is found using the reverse update rule. The memory is installed, using the outer product rule, at the location of the peak, resulting in a new energy contour with an energy minimum at that point. 3. The reverse update rule (Equation 3.2) is applied, with the visible neurons clamped, until a stable state is reached. The term "clamped" means that the state of the neuron is not allowed to change. 4. The memory is installed using the outer product rule (Equation 2.3). Note that when some of the neurons are clamped, the network evolves in a subspace of the complete network state space. The peaks that are reached in that subspace are not in general true peaks in the complete space, but experience has shown that they are close approximations to them. A good indication of how well this memory storage process works is the degree of orthogonality that it achieves. Figures 3.2 and 3.3 show the rms overlap O ^as a function of the Chapter 3. The Extended Hopfield Network^ 1.1 1.0 xx 36 0 Hiddens x x x x x x x 0.9 a. as 8 10 Hiddens 0.8 0.7 30 Hiddens 0.6 E 0.5 ,i) 0.4 g 0 0.3 50 Hiddens 0.2 90 Hiddens 0.1 0.0 100 Neurons Tri-State 10^20^30^40 Memory Loading 50^60 1902 AUG 22 Figure 3.2: Degree of orthogonality achieved by the tri-state EH memory storage process, as a function of the number of memories added and for various numbers of hidden neurons. The visible neurons in each memory were set at random to either +1 or -1. The rms overlap O rms was calculated using the complete memory vector, both visible and hidden, and was averaged over all pairs of memories that had been stored. The simulations were performed with a 100-neuron network, and each point represents the average result from 50 different sets of memories. The lines were drawn free-hand as a visual aid only. number of memories installed and the number of hidden neurons available, for both tri-state and bi-state neurons. A comparison of the two figures reveals that tri-state and bi-state options are approximately equivalent. When no hiddens are present, the rms overlap is approximately unity, as expected. As a greater percentage of the neurons are hidden, the roll-up process is able to achieve better and better levels of orthogonality. This is examined again in Figure 3.4, where O is plotted as a function of the number of hiddens. 3.3 Recovering Memories in an EH Network During memory recovery in a standard Hopfield network, all bits are treated equally. There is no way to tell the network, for example, that you are very sure about bits 1 through 50, but Chapter 3. The Extended Hopfield Network^ 1.1- 37 0 Hiddens 1.0 x x x x 0.910 Hiddens 0.8- o_ as = 0.7- Hicklensx O 0.6- 0.50.4O 50 Hiddens 0.390 Hiddens 0.20.10.0 100 Neurons Bi-State 10^20^30^40 Memory Loading 50^60 1992 AUG 22 Figure 3.3: Degree of orthogonality achieved by the bi-state EH network memory storage process, as a function of the number of memories added, for various numbers of hidden neurons. The conditions for this simulation were identical to those in Figure 3.2. The lines are identical to those in Figure 3.2 to aid in comparison. 1 + 10 Memories 0.9- AE 20 Memories 0.8a. os 0 a) ts) as x 40 Memories 0.7- ^ 60 Memories 0.60.50.40.30.20.10— o 100 Neurons Tri-State 6.1^0.2^0.3^0:4^0.5^0:6^0.7^0.8 019p_Tri^(Num Hidden Neurons)/N 0.9^1 1992 AUG 22 Figure 3.4: Degree of orthogonality achieved by the EH network memory storage process, as a function of the number of hidden neurons, for various levels of memory loading and for tri-state neurons. The conditions for this simulation were identical to those in Figure 3.2. The lines were drawn free-hand as a visual aid only. Chapter 3. The Extended Hopfield Network^ 38 have no information about bits 51 through 100. There are many applications, however, where the locations of the unknown bits is known. This is particularly true in an EH network, because the states of all of the hidden neurons are, by definition, completely unknown. It would also be true, for example, if a standard Hopfield network were being used to complete a bit-mapped image where the bottom half of the image was known to be obscured. It would be useful to find a memory recovery process in which certain neurons could be flagged as "don't know", so that their initial values would have minimal influence on the final answer. Three different methods for doing this are discussed in Sections 3.3.1 through 3.3.3 and are compared in Sections 3.3.4 and 3.3.5. All tests of the memory recovery process in Sections 3.3.1 through 3.3.5 use sets of memories which are randomly generated, with the bits equally likely to be +1 or -1. The tests described in Section 3.3.6 use near-orthogonal memory sets. None of the memory sets tested in these subsections were stored using the roll-up process. 3.3.1 "Random" Option The first strategy, which is most commonly used with Hopfield networks, is to set all the "don't know" bits to +1 or -1 at random, and then apply the forward update rule (Equation 2.5) until an attractor is reached. This method relies on the randomly selected bits being uncorrelated with all the memories, so that their net influence is negligible. Unfortunately, when a large number of "don't know"s are present, or if the network is nearly at capacity, there is a high probability that the random bits will divert the network into a false attractor. Chapter 3. The Extended Hopfield Network^ Synchronous Update Intermediate State visibles^hidden^Energy Initial State +-+ I F000 - - Li Asynch Update 39 000000000000 +96 +-+-0+----0+ -12 +-+-++----++ -126 synchronous update rule I Final State Energy "Landscape" +-+-1-4-9-asynch update rule State Space (a)^ (b) Figure 3.5: Schematic representation of the tri-state EH network memory retrieval process, (a) as a "roll-down" process in the energy landscape and (b) as an update process for bits in a state vector. Unknown bits, including all hidden neurons, are initially set to zero. The network then rolls down synchronously, with known neurons clamped, until the energy no longer decreases. All neurons are then undamped, and the network rolls down asynchronously to an attractor state. 3.3.2 "TriState" Option If tri-state neurons are used, a very successful memory recovery process is sketched in Figure 3.5, and can be summarized as follows [13, pg 1361: 1. All known neurons are set to their known values 2. All unknown or hidden neurons are set to zero 3. The synchronous forward update rule (Section 2.1.2) is applied, with all known neurons clamped. If some neurons are stuck in zero states, set one to +1 or -1 at random, and continue rolling down, until all neurons are non-zero and the energy no longer decreases. Chapter 3. The Extended Hopfield Network^ 40 4. The asynchronous forward update rule (Section 2.1.2) is applied, with no neurons clamped, until a stable state is reached. It is an unfortunate added complexity that two different phases of roll-down are required, but tests show that both phases are needed. If asynchronous updates were used immediately, the first few neurons to be updated would exert disproportionate influences on the subsequent evolution of the network. If only synchronous updates were used, the network might settle into a limit cycle, and never reach a stable state. 3.3.3 "Bi-State" Option If tri-state neurons are not used, comparable performance can be achieved with bi-state neurons, at the cost of an extra phase in the recall process. Bi-state recovery is shown schematically in Figure 3.6, and can be summarized as follows [12]: 1. All known bits are set to their known values 2. All unknown or hidden neurons are set to +1 or -1 at random. 3. The asynchronous reverse update rule (Equation 3.2 ) is applied, with known bits clamped, until a stable state is reached. This is the "roll-up" phase of bi-state memory recovery. 4. The synchronous forward update rule (Section 2.1.2) is applied with known bits clamped, until the energy no longer decreases. 5. The asynchronous forward update rule (Section 2.1.2) is applied with no bits clamped, until a stable state is reached. The usefulness of step 3 is not immediately obvious, Isn't rolling up to a peak counterproductive when the goal is to arrive in a valley? And why will the peak be any closer to the ^ 41 Chapter 3. The Extended Hopfield Network^ A hiddens visibles Synchronous Update Peak State Intermediate State^ /—\ Reverse Update Rule AsYnch Update Initial State I +-i-H-??? set bits at random State Space i +-+++--+ +--H---+----+ +-++-H-- - Energy "Landscape" ???????????? +23 reverse update synchronous update rule Final State Energy +-++++-- asynch date rule ur_F __I__H__F__I (a)^ +++-+---++-+ +96 i ^+ ^-12 +-+-++ +.1___H_____ -126 (b) Figure 3.6: Schematic representation of the bi-state EH network memory retrieval process, (a) as a "roll-up-then-down" process in the energy landscape and (b) as an update process for bits in a state vector. Unknown bits, including all hidden neurons, are initially set to +1 or -1 at random. The network rolls up asynchronously using the reverse update rule, with known neurons clamped, until a peak is reached. It then rolls down synchronously, with known neurons clamped, until the energy no longer decreases, and finally, with all neurons undamped, it rolls down asynchronously to an attractor state. desired valley than to any other valley? The explanation can be found using the following thought experiment. Let A be the known states of the visible neurons, and let B be the states of the hidden neurons when the network is at the peak, so the complete network state can be written 5 1 = (A + B). For example: $ 1^+-+---+-++++-++-+--+--+ -++--+++--+--+++. A ^By the symmetry of^the bit-inverse state 5 2 = (A + B): 2 Chapter 3. The Extended Hopfield Network ^ 42 must also be a peak, and as such its overlap with all memories is as small as possible. Now invert all the visible neurons A: S 3^+-+---+-++++-++-+--+--+ +--++---++-++---. A 73- The resulting state (A + TY) still has approximately zero overlap with all the memories, except for the one whose visible neurons are A. Its overlap with that memory is very large, which implies that the state (A + B) is an ideal starting point from which to recover the desired memory. Now consider what happens in the real network. When step 4 of the recovery begins, the network is in state (A + B). All the hidden inputs are the negative of the states Si, because it is a peak. That means that the first synchronous update will immediately invert all the hidden neurons, and the network will be in state (A -I- B), which was just shown to be a good starting point for memory recovery. 3.3.4 Retrieval Efficiency with 10 Memories Stored Figure 3.7 shows the measured retrieval efficiency for a network in which 10 random memories have been stored. No roll-up procedure was used when the memories were stored. The figure indicates that the "tri-state" and the "bi-state" retrieval processes are comparable to each other, and are both superior to the "random" process. With the tri-state and bi-state options, there is a good chance of recalling a memory even when only 10 bits are known. 3.3.5 Radii of Attraction for Random Memories Figure 3.8 shows how the radius of attraction of a typical memory varies as more and more random memories are added to the network. No roll-up procedure was used when the memories Chapter 3. The Extended Hopfield Network ^ 43 1.11.00.9- c i^t6 5<- *^* *^* t 0 E 0.80 c 0.7- 0.4- 0.30 S_1. 0.2- . 0. 11 0.04 0 + TriState BiState 10 Memories Random 100 Neurons ^J 10^20 ^4-16 50^do^70^8 0^9'0^100 ReeovTat Number of Prompt Bits Used 1992 MAR 5 1 Figure 3.7: Comparison of three ways to handle "don't know" bits during memory recovery. In all three cases, ten random memories were stored in a standard Hopfield network. The network was then initialized to one of the memories e, except that x of the bits were treated as "don't know" according to one of the three methods listed in the text, and memory recovery was performed. The graph plots the fraction of times that the recovered memory was exactly equal to e' 4 . Fifty different memory sets were tested, and all 10 memories of each set were tested. were stored. The figure shows that the radius of attraction gets smaller as memories are added. This is partly unavoidable, because when there are more memories to choose from, more information is required to specify which one is desired. It is also partly because the memory space becomes cluttered with false attractors, and this makes the effective number of memories much larger. The first of these reasons can be quantified as described in Section B.1. Equation B.1 indicates that the average number of bits required to unambiguously specify one memory out of a set of P is: Mmm = log 2 4(P — 1),^ (3.3) ^ Chapter 3. The Extended Hopfield Network^ 1.0— 44 .......^......„^ Theoretical x *^•:: ........ _._ .......... 1 ^+ ........ - ...^ Maximum 0.9-^. ..........^+ . *---„ .,... ,^..^ 0.8-^„,.. .,, -• ..^ x^"..^„. ..... .. -..^„,.. ... - 0.60.50.40.3- x 0.2-^+ TriState 0.1-^x Random 0.0x^ 0^0.05 100 Neurons 0.1^0.15 Memory Loading (alpha) 0.2 1992 MAR 10 Figure 3.8: Comparison of radii of attraction for tri-state and random memory recovery procedures. Memory loading a is the number of stored memories divided by the number of neurons. The stored memories were completely random — no roll-up procedure was used to orthogonalize them. The fractional radius of attraction is T times the number of bits which were necessary and sufficient to recover a memory, when the other bits were either set to zero (for "tri-state") or to random values (for "random"). The "theoretical maximum" was calculated using Equation 3.4. The dashed lines mark the standard deviation in the observed radii of attraction. -A- so the theoretical maximum fractional radius of attraction of a memory is: Tmax 1 1 log 4(P — 1) N log 2 (3.4) This value is included in Figure 3.8 for reference. The figure confirms that the tri-state retrieval process is much more effective than the random process, especially as the memory loading approaches capacity around a = 0.14. For a < 0.1, the tri-state network's performance is comparable to the theoretical upper limit. Bistate data is not plotted because it was indistinguishable from the tri-state data. Chapter 3. The Extended Hopfield Network^ 45 1.00-1-0.900.800.700.600.500.400.300.200.100.00 ao 0.1^0.2^0.3 Memory Loading (alpha) Or111815^ 0.4^0 5 1982 SEP 18 Figure 3.9: Radius of attraction as a function of memory loading for near-orthogonal memory sets, using the tri-state memory recovery procedure. Memory loading a is the number of stored memories divided by the number of neurons. Memory sets with normalized rms overlaps of 1.0, 0.8, 0.6, 0.4, 0.2, and 0.1 were generated as described in Section 2.3.1 and then stored. The fractional radii of attraction were calculated as the number of bits necessary and sufficient to recover 90% of the memories with fewer than 5% of the bits incorrect. Each data point is an average from 10 different memory sets. The lines were drawn freehand as visual aids only. 3.3.6 Radii of Attraction for Near-Orthogonal Memories Figure 3.9 summarizes how the tri-state recovery process performs with near-orthogonal memory sets. The level of orthogonality makes no difference at low memory loads, but as the memory load increases, the radii of attraction of the more orthogonal networks are better. 3.4 Performance of the EH Network The following subsections examine how well the EH network performs when the roll-up memory storage process described in Section 3.2 and the roll-up-then-down memory recovery process described in Section 3.3 work together. When they are both used some of the neurons can Chapter 3. The Extended Hopfield Network ^ 46 be truly hidden, in that their states are neither imposed on the network when the memory is stored, nor used as prompts when the memory is recovered. The presence of hidden neurons gives the EH network a number of advantages over a standard Hopfield network. Some of these are: • More memories can be stored in a network of a given length, although the memories will be shorter. • A network's capacity for memories of a given length can be increased • The amount of stored information per neuron can be increased. • The radii of attraction of stored memories can be increased. • The XOR set of memories can be stored. • Sets with memories of varying length can be stored. These advantages are documented in Sections 3.4.1 through 3.4.6. In these tests, a memory is "stable" if it meets the criteria described in Section 3.1.3, and the "level of correlation" of a memory set is as defined in Section 3.1.4. Unless otherwise noted, all simulations were done using tri-state neurons, using both the memory storage method from Section 3.2 and the memory recovery process from Section 3.3. 3.4.1 Memory Storage Capacity An important advantage of EH networks is their ability to store more memories than a Hopfield network of similar size. The capacity of a network with a fixed number of neurons can be enhanced by designating some neurons as hiddens. Alternatively, the capacity of a network Chapter 3. The Extended Hopfield Network^ 47 1.2 R=0+ R = 50 x N = 100 "a 0.41 0 L.L 0.20. 0 0 ^0.2^0.3^0.4^0.5^06 cepaci Memory Loading (alpha) 1989 July 12 Figure 3.10: Comparison of an EH network with 50 hidden neurons and 50 visible neurons to a Hopfield network with 100 neurons. Random memory vectors were used for both networks, but the Hopfield memories were twice as long. Tri-state neurons were used for the EH network. Each point represents the average over five different sets of memories. with fixed memory length can be increased by adding hidden neurons, and thus increasing the total number of neurons. Figure 3.10 compares the recovery rates of a Hopfield network with 100 neurons to an EH network with 50 visible and 50 hidden neurons. The EH memories are only 50 bits long, but the EH network is able to store more than twice as many as the Hopfield network. In addition, memory stability remains 100% for up to 25 memories, as opposed to only 10 memories for the Hopfield network. If the "memory capacity" of each network is defined to be the load level for which the fraction of stable states falls below a certain threshold (say 0.9), then the Hopfield network has a capacity per visible neuron of 0.22 compared to 0.3 for the EH network. Figure 3.11 shows the results of more extensive simulations where the memory capacity was measured as a function of R/N, the fraction of visible neurons in the network. The memory capacity increases steadily as R/N goes from 1.0 down to 0.5, because the network is using Chapter 3. The Extended Hopfield Network^ t a_ 13 - 48 0.30- o. 0.20m 0 0.15E 0.10- 0.00 0.0^0.1^0.2^0.3^0.4^0.5^0.6^0.7^0.8^0.9^1.0 Ratio of Visible Neurons (R/N) ^91 oar 12 HUMP? Figure 3.11: Network capacity as a function of the number of visible neurons. To measure the capacity, random memories were added one by one, using the EH storage method. After each memory was added, the number of stable memories was measured as described in Section 3.1.3. In this graph, the "memory capacity" is times the number of memories which can be stored before the fraction of stable memories fell below either 0.9 or 0.5. Each data point represents the average from ten different memory sets. The lines were drawn freehand as a visual aid only. k the hidden neurons to orthogonalize the memory space. As R/N becomes smaller than 0.5, however, the capacity decreases again, because it becomes more and more difficult to recall the memories with so few visible neurons. 3.4.2 Maximum Information Content Even though more memories can be stored in a network with hidden neurons than in a standard Hopfield network, it is not immediately clear that the information capacity of the network increases or decreases. The reason for the confusion is that there are two competing effects as the ratio R/N decreases from unity: • More memories can be stored, because the memory space can be orthogonalized. Chapter 3. The Extended Hopfield Network^ 49 • The amount of information in each memory decreases, because the hidden neurons contain no information. The information capacity of an EH memory can be calculated quite easily. If the network has a total of N neurons, R of which are visible, and if the visible neurons are set at random, then the information content of the pth memory is TEH (it R) = log (2 R ) — log(p + 1) = R log 2 — log(p + 1). (3.5) The first term represents the information in the visible bits and the second term represents the information lost because the order in which the memories are stored is irretrievable. The mathematical justification for these terms is discussed in Section 6.1.1. Figure 3.12 graphs the information capacity as a function of the ratio of visible neurons, and reveals that it can be increased by up to 45% by adding hidden neurons. In the limit of zero hiddens, the information capacity per neuron, per memory, is approximately 0.14 log 2, which is consistent with results reported by Hopfield [25]. 3.4.3 Correlated Memory Sets A serious limitation in the performance of Hopfield networks is their very poor capacity for sets of correlated memories. The decrease in performance occurs because the correlations between the memories add together to form attractors which overwhelm the memories themselves. Amit has proposed a solution for one class of correlated memory sets, where the average activation of the bits is biased away from zero [5]. His solution involves restricting the networks during evolution to states which have a specified average activation. The EH network's performance with correlated memories is of particular interest because its memory storage process very specifically tries to decorrelate the memory set. Simulations Chapter 3. The Extended Hopfield Network ^ 50 0.16 0.141 0.121 o — 0.12 0.10 0.08 -- 0 0.08Ts E 6 0.06o 0. 041-0.021 0.00. 0.0 100 Neurons I 0.1^0.2^0.3^0.4^0.5^0.6^0.7^0.8^0.9^1.0 Hump,^Fraction of Visible Neurons (R/N)^GI OCT 12 Figure 3.12: Information capacity of an EH network, as a function of the number of visible neurons used. Memory capacity was calculated using the "0.9 criterion" data described in Figure 3.11. The ordinate was then calculated as the number of memories stored, times the f number of visible bits per memory, times log 2 / N 2 , so that the graphed value is equal to L)/.i,il times the number of data bits that can be stored for every neuron in the network. Each graph point is the average of tests from approximately 10 sets of memories. The line was drawn freehand as a visual aid only. confirm that it is able to decorrelate the first few members of each memory set, and that this improves the storage capacity. Unfortunately, the decorrelating capacity of the hidden neurons is very quickly used up, so the performance of the network deteriorates quickly after the first few memories have been stored. Figure 3.13 illustrates EH network capacity for memory sets with four different levels of correlation, where "level of correlation" refers only to the visible neurons, as discussed in Section 3.1.4. Notice that in the bottom left corner of the figure, where few memories are stored and many hidden neurons are available, memory capacities are the same for all levels of correlation. As the memory loading increases, the decorrelation process fails first for memory sets with the highest levels of correlation. Chapter 3. The Extended Hopfield Network^ 0.40 0.35-1 0.30*(..) 0.25as 3 :„ 51 Avg Overlaps 0.0 ■ 0.04 + 0.1 )sE 0.2 ^ 0.20- 0 0.10H 0.05H 100 Neurons 0.00I^ 0.0^0.1^0.2^0.3^0.4^0.5^0.6^0.7^0.8^0.9^1.0 OlapHump Fraction of Visible Neurons 186e JAN 16 Figure 3.13: Storage capacity of an EH network for correlated memory sets with three different levels of correlation. The memory sets were generated as described in Section 2.3.2, and then stored and retrieved using a 100-neuron tri-state EH network. Memory capacity was calculated using the "0.9 criterion" data described in Figure 3.11. In the limit where the ratio of visible neurons is 1.0, this is the performance of a Hopfield network. Each point represents the average of 10 different memory sets. The lines were drawn freehand as a visual aid only. Even when the EH network's performance is degraded because the memory set is correlated, it is still able to use the hidden neurons to advantage. An EH network with 50% hidden neurons can store more that twice as many memories as a Hopfield network, for example, when the memories have an average overlap of 0.1. Furthermore, the EH network has a greater capacity for correlated memories than the Hopfield network has for random memories. 3.4.4 Radii of Attraction Most of the measurements of network capacity up to this point have focussed on the number of memories which can be stored before memories become unstable. To be useful, however, it is not good enough for a memory to be stable; it must also have a good radius of attraction. Chapter 3. The Extended Hopfield Network^ 52 In this section, two types of radii of attraction are measured: 1. Type 1: Attraction from an incomplete prompt: This is a measure of how many - bits can be flagged as "don't know" (as discussed in Section 3.3) before the original memory can no longer be recovered. 2. Type 2: Attraction from a noisy prompt: This is a measure of how many bits can - be set incorrectly, but not flagged as "don't know", before the original memory can no longer be recovered. In both cases, the radius of attraction is equal to the number of visible bits which can be wrong (either "don't know" or inverted) before memory recovery begins to fail. If a network with 50 visible and 50 hidden neurons, for example, has a type-1 radius of "40", then recovery is possible when 40 of the visibles, as well as all 50 of the hiddens, are initially flagged as "don't know". Figure 3.14 compares the type-1 radii of attraction of two different EH networks. The "100/0" network has 100 visible neurons and no hiddens, and the "50/50" network has 50 visibles and 50 hiddens. The 100/0 network is identical during memory storage to a 100-neuron Hopfield network. The radius of attraction shown in the figure could not, however, be achieved by a Hopfield network because it relies on the tri-state process described in Section 3.1.5. The figure shows that the 50/50 network maintains a good radius of attraction for much higher memory loadings than the 100/0 network. This is not surprising, given that the 50/50 network has, according to Figure 3.11, a memory capacity of 30, while the 100/0 network's capacity is only 14. Figure 3.15 gives a head-to-head comparison of the two networks as they approach their nominal memory capacities. It shows that the 50/50 network has lost most of its basin of attraction by the time the 30th (and final) memory is added. By contrast, the 100/0 network still has a reasonable basin of attraction when the 14th memory is added. Chapter 3. The Extended Hopfield Network ^ 53 ( recovery from an incomplete prompt ) 6^10^15^20^25^35^40 '^I^I OldRadi^ Number of Memories^1902 SEP IS Figure 3.14: Type-1 radii of attraction of EH networks. The "100/0" network has 100 visible and 0 hidden neurons, and the "50/50" network has 50 visible and 50 hidden neurons. The ordinate is the maximum number of visible neurons which can be set to zero initially without hindering subsequent memory recovery, divided by the total number of visible neurons. Each marked point is the average from five different sets of memories. The solid lines were drawn freehand as a visual aid only. The type-1 radii of attraction of the 50/50 network were not only larger than those of the 100/0 network, they were also more consistent. For each of the points shown in Figure 3.14, both the average and the standard deviation of the fractional type-1 radii of attraction were measured, although only the average radius was plotted. The standard deviations averaged 0.13 for the 50/50 network, and 0.20 for the 100/0 network. This may have an important impact on system reliability for some applications. The type-2 radii of attraction of the 100/0 and the 50/50 networks are shown in Figure 3.16. The figure shows the average maximum number of visible neurons that can be initially set to the bit-inverse of the memory state, without interfering with memory recovery. In this case, because none of the visible neurons were flagged as unknowns, the 100/0 network was entirely identical to a 100-neuron Hopfield network. ^ Chapter 3. The Extended Hopfield Network^ 54 1.2 ( recovery from an incomplete prompt ) 0 t v L1-3 0.8H1 .0 2 0.6 as g 0.4H Lt. 0.2H 01 ,^,^.^i^.^1-1^.^.^i 0^0.2^0.4^0.6^0.8^1^1.2^1.4^1.6^1.8 1992 SEP 19 °Wadi^Memory Load / Memory Capacity 2 Figure 3.15: Type-1 radii of attraction of EH networks as a function of how near the network is to capacity. This figure is identical to Figure 3.14, except that the memory loads were divided by 14 for the 100/0 network, and by 30 for the 50/50 network. A comparison of Figures 3.16 and 3.14 indicates that the type-2 radii of attraction are approximately one third the size of the type-1 radii. The type-2 basin of attraction of the 50/50 network is again larger for much higher memory loadings than the basin for the 100/0 network. The standard deviations of the fractional type-2 radii of attraction were measured and found to be approximately 0.08 for both the 50/50 and the 100/0 networks. 3.4.5 Storing the XOR Memory Set One serious limitation of the Hopfield network is its inability to store the "not-equal" or "exclusive-or" (XOR) set of associations. This set of associations consists of the following Chapter 3. The Extended Hopfield Network ^ 55 0.5-i 0.450.40.350.30.250.20.150.10.050 1 20^25^30^35 OldRadii^Number of Memories^1992 SEP 19 0^5^10^ 40 Figure 3.16: Type-2 radii of attraction of EH networks, using the same networks as in Figure 3.14. The plotted value is the maximum number of visible neurons which can be set to the inverse of the memory initially, without hindering memory recovery, divided by the total number of visible neurons. The 100/0 network can be thought of as either an EH network with no hidden neurons, or as a standard Hopfield network. Each marked point is the average from five different sets of memories. The solid lines were drawn freehand as a visual aid only. four memories: ++ ( stored memories )^( bit-inverse memories ) where the third bit of each stored memory is (-) if the first two bits are equal, and (+) if they are not. The goal is to store these memories in a network in such a way that when the first two "input" bits are provided, and the third "output" bit is flagged as "don't know", the network will roll down. to an attractor in which the third bit is set correctly. There are two severe problems with storing the XOR set in a Hopfield network: ^ Chapter 3. The Extended Hopfield Network^ 56 1. For every stored memory of the XOR set (eg:^-) with the correct answer in the third bit, a bit inverse attractor of a different memory (eg:^+) is also stored which has the opposite answer in the third bit. 2. The sum of bits is zero in each of the three columns, so all the Jiis are zero after the last memory has been stored. The only conceivable solution with a standard Hopfield network is to add extra bits to each memory so that, for example, the memory set becomes: ++ - +++++^ +^ +- + +++++^ -+ ^ - -+ + +++++^ +-- - +++++^ ++ + ^ ( stored memories )^( bit-inverse memories ) The extra bits must be the same for all four memories, or else the user may end up "giving the answer" as part of the question. These extra bits clearly solve problem 2, because the bit inverse attractors can now be recognized by the incorrect values of the added bits. Unfortunately problem 1 remains, and now applies to the extra bits as well. Because the extra bits bits are the same for all memories, the outer product rule 4 E (vie - (54) =N 3 0=1^ zeroes every synapse that connects them to the "output" neuron, as can be readily confirmed by trying a few examples. The problem here is very similar to the problem of storing XOR in a single layer perceptron, as pointed out by Minsky and Papert [40]. The solution for the Hopfield network, just as for perceptron networks, is to add hidden neurons. The hidden neurons can do what the old extra neurons could not do, because the Chapter 3. The Extended Hopfield Network^ 57 hiddens change from one memory to another. This allows the Jiis connecting the hiddens to the output neuron to be non-zero. To test the EH network with the XOR set, the following four memories were stored, with varying numbers of hidden neurons: The first bit of each memory is a single symmetry breaking neuron, which is still required so that bit-inverse attractors can be avoided. The lower trace of Figure 3.17 shows the results of the tests, and indicates that approximately 11 hidden neurons are required before the correct answer is consistently achieved. A disturbing feature of the XOR experiments was that even with quite large numbers of hidden neurons, the network occasionally made mistakes. Among 1200 tests of the network with 13 hiddens, for example, the network got the wrong answer 3 times. After careful tracing through of the network's behaviour, it was found that the errors always happened when the network reached a "plateau" in the energy landscape, where some of the hidden neurons had zero inputs, but were still in a zero state. The standard approach, at such a plateau, is to set one of the zero neurons to +1 or -1 at random and then continue. The problem was solved by implementing a "tie-breaker heuristic" or "TBH". The input Ui to each neuron is the sum of (N — 1) signals ViJii, not all of which have the same magnitude. If the sum of these signals is zero, the TBH counts whether there are more positive or more negative signals (ignoring their magnitude), and sets Ili to +1 or -1 respectively. With the TBH in place, only three hidden neurons were needed to store the XOR set, as shown in the upper trace of Figure 3.17. In 15,000 tests of the TBH network, with three hidden neurons or more, Chapter 3. The Extended Hopfield Network^ 58 to C 0 ^o0 4^6^8^10^12 XORset number of hidden neurons 16 1090 APR 10 Figure 3.17: Retrieval rates for the XOR set of associations with varying numbers of hidden neurons. Each data point marks the average of a total of 1200 tests, during which the XOR set was stored 100 separate times. A test consisted of providing the two input bits, and then seeing if the network set the output bit correctly. For the lower trace, neural dynamics were as described in Section 3.3. For the upper trace, the tie-breaker heuristic was added, as described in the text. The lines were drawn freehand as a visual aid only. zero errors occurred. 3.4.6 Flexible-Length Memories The way that the EH network manages hidden neurons is ideally suited for applications where the memories are not all the same length. The network can be sized to accomodate the longest possible memory, and when a short memory is stored, the unused neurons can be put to good use orthogonalizing the network. The only minor hitch is that during recall, it may not be clear where the memory ends and the hidden neurons begin, but that can be solved quite simply as described below. To demonstrate this and other capabilities of the EH network, a demonstration program was Chapter 3. The Extended Hopfield Network ^ 59 written for use on a desktop computer. The program simulated a 162-neuron EH network, and stored memories which were entered and displayed as strings of characters. Each character was encoded as six bits, and the first six bits of each memory encoded the number of characters in the string, so strings of up to 26 characters could be entered. Any bits which were not required for the given character string were treated as hidden neurons. A typical demonstration task was to store the titles of the works of Shakespeare. The titles ranged in length from 6 characters (Hamlet) to 24 characters (Alls Well That Ends Well), with an average size of about 16 characters. No systematic measurements were made of the performance of the network, but the following observations were made: • All thirty five titles could be stored in approximately 95 seconds, when running on a 25 MHz '386 desktop computer. • The network worked well with approximately 25 memories stored, in the sense that the radii of attraction were still very large. A prompt of "Rome ?", for example, would consistently elicit the response "Romeo and Juliet". • Thirty titles could be stored without any becoming unstable, but the radii of attraction were noticeably reduced at that level of storage. • When 35 memories were stored, approximately half the memories were no longer stable. • The memories were surprisingly robust to "brain damage". Memories could be consistently recovered even when large percentages (70% to 90%) of the synapses were set to zero. Chapter 4 The Free-Energy of a Network with Near-Orthogonal Memories The most successful analytical tools for use with Hopfield-style neural networks are based on the statistical mechanics of Ising spin glasses [3, 4, 1, 38]. The formal ties between spin glasses and Hopfield networks were originally forged for networks in which all neurons were identical, and for memory sets which were completely random [3, 4]. Recent papers have extended the methods to include some forms of non-random memories [5, 43], and modifications to the learning rules [5]. Statistical mechanics methods are useful because they translate the "energy" landscape discussed in Section 2.1.4 into a "free energy" landscape, as described in Section 4.1. The energy landscape exists in the space of all possible states, and is uniquely determined by each set of memories that has been stored, as sketched in Figure 4.1(a). Its topology for one set of memories tends to bear no resemblance to its topology for another set of memories, so there is no "typical" energy landscape. The free energy landscape, on the other hand, exists in a space of order parameters, such as the overlap mo between a state S and memory es, as sketched in Figure 4.1(b). Its topology tends to be very similar from one set of memories to another, so it can be averaged over all possible memory sets and then used to calculate properties of the network which are independent of the specific set of memories that has been installed. This chapter derives the free energy for a network in which some of the neurons are clamped and in which the memory sets are near-orthogonal. The discussion and derivation proceed as 60 Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^61 FA EA ^ >S FA FA 10- EA n? A <<E>>^ A 4 S ^ ne <<F>>^• average ^ Memory Set (a) Energy Landscape ne (b) Free Energy Landscape Figure 4.1: Conceptual sketch comparing energy landscapes to free energy landscapes. Energy landscapes are in the space of all possible states S, and are completely different for every set of stored memories fel. There is no interesting information in the average energy landscape. Free energy landscapes are in a space of order parameters such as m 0 , and tend to be similar from one set of memories to another. The free energy landscapes can be averaged to provide information about typical network behaviour, as discussed in Section 4.1.4. Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^62 follows: • Section 4.1 reviews the analogy between spin glasses and neural networks and justifies the use of a Gibbs distribution for the EH network with non-zero temperature. • Section 4.2 derives a free energy equation for the most general case, and derives mean field equations for the expected overlap mo. • Section 4.3 summarizes the results of the chapter. 4.1 Magnetic Lattice Analog for Neural Networks Hopfield's neural network model [25] was intentionally designed to be analogous to a lattice of interacting magnetic dipoles. At low temperatures, each dipole in the lattice aligns itself with the total magnetic field at its location, which in turn is determined by the orientation of all the other dipoles. Such lattices have been shown [16, 50] to exhibit a "spin-glass" phase, in which the state of the lattice is stationary, but "frustrated", meaning that dipoles are aligned with some others, but misaligned with others. One low-temperature stable state is the ferromagnetic state, where all dipoles are aligned in the same direction, as sketched in Figure 4.2(a). Each dipole is aligned not only with the local magnetic field, but also with dipoles which are very far from it. Spin glass states occur at low temperatures when small groups of dipoles align themselves in one direction while other groups align themselves in the other direction, as in Figure 4.2(b). Although each dipole is aligned with the local field, it is anti-aligned with many of the dipoles in the lattice, so there is a high degree of frustration. These states are meta-stable (and hence glass-like) at finite temperatures, meaning that the lattice will stay in one state for a long period of time, but will eventually change to a new state. Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^63 xxxxxxxxx x xxxxxx x xx -.XXXXXXXX ^ xx xxxxxxxxxxxxxxxxxxx xxxxxxxxxx xxxxxxxxx ...xxxx....x?x. .xxx xxxxxx?xxxxxxxxxxxx --xxxxxx..xxxxxxxx? xxxxxxxxxxxxxxxxxxx .XXXXXXX.XXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXX.. xxxxxxxxxxxxxxxxxxx •XXX•XXXXXXXX•.X.•• x xx xx x xxx xx xxxx ? xx x •xx ^ XXXX••X••• xxxxxxxxxxxxxxxxxxx^...... ?...xxx ^ xxxxxxxxxxxxxxxxxxx ^ xxx ^ xx x xx ?xx xx x xxxxxxxx ^ XXX•••• xxxxxxxxxxxxxxxxxxx ^ xxxx?. XXXXXXXXXXXXXXXXXXX^XX ^ XXXX---. xxxxxxxxxxxxxxxxxxx ?..xxxx ^ x x xxx--XXXXXXXXXXXXXXXXXXX • xxx•• ••xxxx x?x•• xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx ...XXXX ...XXXXXXXXX ..•XXXX•XX•XXXXXX ...XXX xxxxxx xxxxxxxxxxx xxxxxx?x^x ^ xxxxx (a)^ x?•x••x•?x••x•xx•.• -x?x.x.xx.xx?x...x. • x•xx •xx?x••x•xxx?x x?xx••x..••x?.••xx• x.x?x.. ..x.? xx.-xx. xx•xxxx?•xx•xx?••?x ••xxx•x•x•xxxxx•?•x • x•x•xx••?xxx••xx?x xx•xxx•?x•xxxxxxx•• •••?xx•••x•x•x?x••x xxx•?xxx •x •x?x••x•x xxx••xxx••?x •x xx••• ••x•x?x•x•?x•x•x•xx ?•x•x•xx•x•x?xx•x•x x••x•xxx?xx•xx•xx•x • x•••xxx••x•x x• ?x•x x •x x••x•x?•x••x•x•x • x •x•x•?xx•x•xx•x•x (b)^ (c) Figure 4.2: Sketches of steady-state behaviour of (a) ferromagnetic, (b) spin-glass, and (c) paramagnetic phases of a two-dimensional lattice of Ising ferro-magnets. Sites marked x are spin-down, sites marked are spin-up, and sites marked ? have recently changed orientation. Patterns in (a) and (b) are approximately stable in time, whereas the pattern in (c) is constantly changing. At high temperatures, when the interactions between the particles are overwhelmed by thermal effects, the lattice is in a paramagnetic phase. The high temperature allows the dipoles to freely flip up and down, as sketched in Figure 4.2(c). In a neural network at low loading, memory states resemble ferromagnetic states, in that few of the neurons are frustrated, and the energy of the network is low. In the analogy, the x symbols in Figure 4.2(a) represent neurons which are aligned with the recalled memory. Figure 4.3(a) is a sketch of the energy surface around a network in a ferromagnetic phase. As memories are added, more and more false attractors appear, until eventually the space between memories is filled with large numbers of shallow false attractors, as sketched in Figure 4.3(b). A network in one of those states bears little resemblance to any of the memories, and has large numbers of frustrated neurons. If the "temperature" of the network is non-zero (as discussed in Section 4.1.2) then it will occasionally move between the false attractors, just like a spin-glass. Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^64 (b) Energy Landscape (a) Figure 4.3: Sketch of the energy surface around (a) a ferromagnetic and (b) a spin-glass state. The ferromagnetic state is stable and strongly correlated to a memory. The spin-glass state evolves only very slowly, but has no recognizeable correlations to any memories. 4.1.1 Zero Temperature Ising Model The simplest model which exhibits spin-glass phases is the Ising model of ferromagnetism [26]. It consists of N magnetic dipoles, each of which can be thought of as a charge spinning around an axis. The dipole moments are restricted to the z direction, so each lattice location i is allowed only two states, "spin up" (Si = 1) or "spin down" (Si = —1), The neural analog of the spin is the activation Si of the ith neuron in a network, which is also allowed only two states, "on" (Si = 1) or "off" (Si = —1). If an external field hi is imposed at each lattice point i, then the total field at that point, Ui, is the sum of hi and the sum of the magnetic fields from the other dipoles in the lattice: Ui = N hi -I- (4.1) j=1 where Jii is the magnetic coupling coefficient between lattice locations i and j. The total energy of a network is equal to minus the sum of the dot products of the network states Si with the local fields Ui: =— E 1 EN hiSi + —Si i=1^2 J=1 (4.2) The factor of 12 is necessary because the potential energy between each interacting pair SiSi is added twice. Note that this is the same as Equation 2.7, except that the external field hi has Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^65 been added. Equation 4.2 is only correct if the self-interaction terms Jii are taken to be zero, and the magnetic coupling coefficients are symmetric (Jib = .71ii). When the lattice is in thermal contact with a heat sink at absolute zero, the second law of thermodynamics ensures that E will decrease monotonically until a stable state is reached where virtually all the molecular dipoles are aligned with the local magnetic field. Hopfield's model is analogous to the zero-temperature Ising lattice. The neural analog to the coupling coefficient Jib is the synaptic strength Jib, and the analog to the magnetic field Ui is the postsynaptic potential Ui. Hopfield was able to define the Lyapunov function or "energy" of his network using Equation 4.2 by imposing the restrictions = 0) and (Jib = Jji), the latter of which certainly has no biological justification. The neural analog to the external fields hi would be bias inputs provided at each neuron. The major difference between magnetic lattice models and neural network models is the way that the Jib are determined. In a magnetic lattice, they are determined by the proximity of the two lattice locations, and by the magnetic properties of the material forming the lattice. In the Sherrington and Kirkpatrick model [50, 30], for example, the values of Jib are assumed to form a Gaussian distribution around zero. In a neural network, the Jij are determined by a learning process, such as the outer product rule in Equation 2.3. 4.1.2 Non-Zero Temperature in Neural Models Hinton and Sejnowksi [23] and Amit [3] extended Hopfield's model to networks in which neurons occasionally became mis-aligned with their postsynaptic potential. The appropriate analog for such a network is a magnetic lattice which is in contact with a heat bath at non-zero temperature. The temperature 1/# is a measure of the likelihood that individual molecular Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^66 dipoles will become misaligned with the local fields by absorbing energy from the heat bath. Following Glauber [18], if the Ising lattice is in contact with a heat bath at temperature 10, and if all the dipoles except the ith are clamped in a state S., then the probability that the ith dipole will be in state Si is: P(Sa) = 1 [1 tanh (OSA)] . (4.3) 2 — In a neural network, this is the probability that neuron i will be in state Si after it is updated, when all other neurons are kept fixed. In the zero temperature limit, oo, this is identical to the update rule (Equation 2.5). The "temperature" parameter 0 in neural network models determines the probability that a neuron will be in a state Si which is contrary to its post synaptic potential II; when the network is at equilibrium. It is not, of course, related to brain temperature. Following the magnetic analog, Glauber dynamics (Equation 4.3) are chosen to describe the temperature-dependence of the transition probability P(Si). A neural network at non-zero temperature never reaches a perfectly stable state, or "fixed point". This lack of stability can be an advantage, because false attractors will no longer be fixed points of the system. The simulated annealing process [29], for example, uses non-zero temperatures in its search for the global minimum. When there are no fixed points, however, it is somewhat more difficult to recognize a recovered memory. The standard solution is to calculate the recovered memory as a time average (SI) of the network state. At non-zero temperatures there is a finite probability that the network will evolve out of even the deepest attractor — a fact which seems to destroy the usefulness of the the network. If all final states are equally accessible from each initial state, then the system is ergodic, and there is no possibility of using the network to retrieve information. In practise, however, the time spent in an attractor is extremely long compared to the time spent to reach the attractor, so there is no problem recognizing the attractor. Analytically, ergodicity can be avoided by Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^67 carrying out all calculations in the thermodynamic limit, where the number of neurons goes to infinity and attractor states become infinitely deep wells in the energy landscape, so the probability of escaping goes to zero. 4.1.3 Justification for Using Thermodynamic Met hods When Q is non-zero, the evolution of the neural network is stochastic and can no longer be described in terms of a Lyapunov function. Instead, if detailed balance [1, pg 1091 is satisfied, the network can be modelled using the tools of statistical mechanics. A system obeys detailed balance if transitions from state :513 to state S occur exactly as frequently as transitions from state S to state in the thermodynamic limit. If W(gilg`) is the probability of going from state state g' ic to state §1 , and p(S) is the probability of being in P., then this implies that: W(Pigk )P(P) = W(PIP)P(P).^ (4.4) If the system obeys Glauber dynamics (Equation 4.3), and if the only difference between S3 and S is that si -Si, then the requirement for detailed balance is: W(g3 I )^P(SI'S3)^exp(QU2 Si)^P(P ^-w(g,ki§j) --P O.; )^p(Sk). P(-Sil^exp( gj) Any expression which satisfies 4.5 for all states g' 3 and (4.5) S will suffice as an expression for p(g). The natural choice is: N exp Q uisi = e - f3E(§), (4.6) where E(§) is Hopfield's energy for state g', as in Equation 4.2. Note that this expression was derived without making any assumptions about the choice of connections Jii, so it is equally applicable to the EH network as it is to the Hopfield network. ^ Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^68 Equation 4.6 shows that finite-temperature neural networks conform to the Gibbs distribution of standard thermodynamics, with E playing the role of energy and # playing the role of inverse temperature. All the methods of statistical mechanics can therefore be applied to calculate the statistical behaviour of the networks at equilibrium. If changes in an observable 0 affect the energy of the network, then the mean value of 0 can be calculated using the partition function: Z(h, /3) = E exp —0[E — hO(S)]^ (4.7) {s} where the sum is over all accessible states of the network. Here h is the physical observable which must multiply 0 to calculate the energy. In a magnetic lattice, for example, if 0 is the magnetization, then h is the magnitude of an external field. The time-average of On is: 1 On Z(h, 0)1^On F = z(h,p) Ohn l h=0 Ohn On (On) = ^ On 1^F (On) =^ 13n-1 ohn (4.8) where 1 F^-- log Z 15' (4.9) is the free energy. In the following sections, F will be calculated as a function of a set of order parameters, such as m 1 ... m 8 . This means that the sum in Equation 4.7 has been calculated over only those states which are compatible with the given order parameters. Such restricted sums are proportional to the probability that the network states are compatible with the given order parameters. Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^69 4.1.4 Quenched Random Memory States There are, in a sense, two levels of randomness involved in the analysis of attractor neural networks: the random choice of a memory set {eo}, and the random time-evolution of the neurons. To keep these levels distinct, the eis are treated as quenched variables. This means that, when time averages are being calculated, it is assumed that a single set of memories has been learned - no attempt is made to average over all possible memory sets. Equation 4.8, for example, can be used to calculate the time average of an observable for one specific set of memories {eu}. Once the time average is complete, a second average is calculated over an ensemble of all possible memory sets. From Equation 4.8, the ensemble average of a time-averaged observable 0 can be calculated as: («O n ») on^Ohn == ^on-1 1 On Vh An •^ (4.10) where (0) is the same notation that was introduced in Section 2.3. The central goal of the analysis which follows is therefore to calculate ((F)), the ensemble average of the free energy F. 4.2 Ensemble Average of the Free Energy In this section, the ensemble-averaged free energy ((F)) is calculated for a network of N neurons, the first R of which are clamped, and the last M N - R of which are not clamped. The ensemble average is calculated over all possible sets of memories {e}. The free energy F is a function of the overlaps mo, as discussed in the last paragraph of Section 4.1.3, so partition functions are summed only over states which are compatible with the given mo. The number Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^70 of undamped neurons M is assumed to be proportional to N, with the ratio equal to 7: 7 EE 14.1m .. In each memory set there are P memories defined in Section 2.3.1. The probability p M Tv- 41- • • • e , which are near-orthogonal in the sense (er) of a neural state er is therefore: P^= (1 — b) S(1:;' —^b S(; where (4.11) < 1. r;),^(4.12) r is a set of exactly orthogonal memories, and b determines the degree of orthogonality of e. Equivalently, the expected overlap between two memories eP and ev is: ( N rcr) )) = MV = M {1 - (1 - 201 ,^(4.13) i= Rji as derived in Equation 2.14. The expression for the connection strength between neurons i and j is Jij - P ef`.; — abij, E N p=i 1 (4.14) where the 6;4 term ensures that there are no self-interaction terms. At each neuron Si, let there be an external field h i made up of P external fields which are proportional to the stored memories: P hi = E hoer By increasing one of the hP, the network can thus be encouraged to recall one memory particular. 4.2.1 The Energy per Neuron The energy associated with the ith neuron is: 1 Ei =^Ji • Se Se — hiSr 2 j=1 ‘--4 3 3 I et , in Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^71 a = - 2 1 P N 2N0=1Ej=R-1-1^ E^p=1^ E + -217^ Ee .is.; CST j=1 f (4.15) where the superscript p is used with the state SI P in anticipation of the case where more than one state will be considered in the same equation. As in Equation 2.4, the memory loading is represented by a = P/N. It is convenient to work with e(SP), the mean energy per neuron for a network in state So, rather than the energy of a specific neuron Ei. The mean is calculated over the last M neurons, which are the only ones participating in the dynamics: N (8 P)^ E Ei(sP) i=R+1 1 2 2MN P N EEE FfEt.'^e se s p=1 i=R+1 j=R-1-1 P N -— m E E le p=1 i=R+1 P =^– E (mP)2— E 771;,11 14 P P=1^p=1 (4.16) where R + Y' es 2N 3=1 L-d 3 (4.17) and mr, is the overlap between hidden neurons in state SP and memory e 11 : 1 n't^— E of sr, i=R+1 (4.18) Not surprisingly, the only effect that the clamped neurons have on the dynamics of the undamped neurons is to exert a constant external field, but that field will play a different role than the hP field when the limit hP 0 is required (as in Section 5.1.2, for example). If SP and^are chosen at random, and hP = 0, then the magnitude of izP will be on the order of Nilli/N = Nry/VTV. Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^72 4.2.2 Partition Function for n Replicas The partition function for a network in which the average energy per neuron is given by Equation 4.16 is: Z = E e-ome(g) { s ,^ (4.19) } where the sum is over all accessible states S of the network. The ensemble average free energy per neuron is equal to: ((.f)) =^ (4.20) = – —((log Z)), NR and is finite in the thermodynamic limit. The ensemble average is calculated using the replica method [1, 35, 38], ((1)) = li^1 # n—r0 N-1411007177 (((Zn)) — (4.21) where Zn is the partition function of n non-interacting replicas of the neural network. Each replica has the same memories en, and the same external fields hi, but two replicas p and o may be in different states SP and 5'. The replica method strategy is to find an analytical expression for the partition function Zn as a function of n, and then calculate the formal algebraic limit (n 0). It is hard to imagine a physical interpretation of this limiting process — Amit refers to it as "an implausible exercise in a space of matrices with zero dimensions" [1, pg 337] — but it certainly is useful. The exact expression for Zn is: Zn = [ E come(s) { s } Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^73 exp Pory Nn 2 {S"}^ 2 + 137 2 N^x!--, I 2 p=1 1---1^Vn P) p=1 n P + [37N E E 77"0 // 1)1 ti ] p=1 p=1 (m p 2 n P ^a72 N^ = E E e Pa 72 H H exp {.51}^p=1 p=1 ^Nn 2 p )^ [37 Nmil,P1, where the sum is over all possible groups of n states. The next step is to calculate the ensemble average Pl. In the thermodynamic limit where N oo, ((Z')) will be dominated by the terms where the energy is minimum. In general there will be clusters of such terms around different memory states, and no single cluster of terms will dominate the sum. The external fields hiA are used to break the symmetry between the attractors and force the solution to focus on the network's behaviour around a specific attractor. Assume that s of the external fields are non-zero. They might as well be the first s fields, h 1 ...V, because the numbering of the memories is arbitrary. If hp is large enough, the attractor located near memories el ...es will certainly dominate Pl. The corresponding memories e l ... es are then referred to as condensed memories. Once the ensemble average has been calculated using the saddle point method, the external field can be removed by setting hi' = 0. Following Amit [1, pg. 333] and Lautrup [35, pg 33], the ensemble average is split into two parts. An estimate of the amount of noise coming from the (P — s) uncondensed memories is given by G1: Gi^((exp s n P {„y N E E p=1 p=a+1 / 2 (mpi") 2 07Nmii; It" where the e 8 + 1 ...e 13 subscript indicates that the average is only over memory vectors s 1 through P. The entire ensemble average can now be written: ((z")) /3a -yNn = e 2 (( n [07 2 N ( mil) 2 + 07NMI,4, E . E G i exp^ 8 {s1} 2 k (4.23) 01 ta Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^74 4.2.3 Calculation of Average Over the Uncondensed Memories This section calculates the average noise G 1 coming from the uncondensed memories, as defined in Equation 4.22. The quadratic mP terms in the exponent of G 1 can be removed using a Gaussian transform: 1 b2 \\ 4a) exP iii f °° = V Tr f_ 00 a 1 2 exp(-ax2 bx)dx,^(4.24) where: b = yOTV mtc: This gives the following expression for G1: B (12=19.2^ P 2^ G1^ C° 2r^-00 ff P H^exp E E {- -21( x )2 n^n GC,^(4.25) 14=3+1 p=1^p=1 p=8-1-1 where P EE n GC F....^exp ( ( + 011P p=1 14=8+1 7Nmid e+1 (4.26) The ensemble average in Gi is calculated as follows; NP IL p=118+ 1 exp [Breri)) (4.27) (( where E R + 0h 11 ) Sr] According to Equation 2.12, each er must be either rii4 or -r;', with probabilities (1 - b) and b respectively, so the average over all the s can be written as an average over all possible Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^75 choices of {r}: Gi = K(A = (( N P [ ( 1 - b)exp(BM) bexp(—BM)})) 11 fl i=R+1 p=s+i N P H H ( ] + 1 - 2ort: sinh Br )) (4.28) 1 log[cosh x + a sinh x] = ax + — (1 — a 2 )x 2 0(x 3 ) 2 (4.29) A [cosh Br 1=R+1 µ=s+1 The following expansion: makes it possible to write Gi as: P N G11 = A ((exp E > {(1 - 2b)11`.13;1 2b(1 — b)(Br) 2 0[(/4)3 ]}) . (4.30) p=8-1-1 In the limit N oo, B r terms of order 3 and higher can be ignored, because /4 This leaves the following expression for Gil : P N Gi = Aexp [ E E 241_ bon21 + log ((e T2 )) r ,^(4.31) 14=8+1 i=R+1 where P N T2 =a p=s+1 EE (4.32) Digress here for a moment and consider the hypothetical case where {r} is a set of random vectors. In that case, the parameter b should be irrelevent, because adding random noise to random vectors should have no effect on averaged behaviour. The value of ((e T2 )) is calculated for that case in Equation C.5 which, when replaced into Equation 4.31, gives the following expression for G'i. : 1 GI = Aexp — 2 P N n EEE p=8+1 i=R-1-1 p,cr=1 BPPBP-sfsr ^{ Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^76 where Fie N P B"' (\ + . (4.33) This expression is independent of b, which tends to confirm the accuracy of the derivation so far. In Returning to the main calculation, consider the case where is a set of exactly orthogonal vectors. The calculation of ((e T2 )), which is much more difficult for this case, is described in Appendix C.2. In that calculation, ((eT2 )) is first expressed as a function of the overlaps through rl.sP rs•so (Equation C.9), and then it is averaged over all possible vectors r1 through r. (Section C.4). The result (Equation C.12) is: ((eT2)) = exp P^ n EE 0 .8+1 r 8 (1 - 2b) 2 — (1 — V.1i)E(m7,) 2 1 B"B"S 1 •S ° 1 p,o=1^ v=1 where the dot product refers to a sum over the undamped indices only: N S P • St`^ E sfsr. i=R+1 Replacing the expression for ((e 7.2 )) into Equation 4.31 yields the following value of G'1: 1^1 GC = Aexp 1^1.^t [2b(1 — b) + — (1 — 2b) 2 — —(ne) 2 (1 — V i) BPPBI'S° • s° 0=8+1 p,cr=1 P^n E^E 2^P 2 1^ — = Aexp co./300.730-sp • 20=-8+1 p,P=1 s-}, (4.34) where CP^1 — (1 — Vt)(7nut,) 2 . (4.35) and where V is the orthogonality measure defined in Equation 2.11. The scaling factor CP will be used in all the derivations which follow to encapsulate information about the degree of orthogonality of the memory set and how it influences the free energy. Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^77 1.0 V = 0.5 V = 0. 0.8-1^ asrnas a c° 0.4V = 0.0 0.2-1 gamma = 1.0 0.0^0.1 0.2^0.3^0.4^0.5^0.6^0.7 m = Overlap to Nearest Memory 0.8^0.9^1.0 1992 JULY 1 Figure 4.4: Plot of the scaling factor CP as a function of the overlap m pu, for three different degrees of orthogonality in the memory set. The ratio of visible bits, y, is 1.0 for all three plots. In the limit of random memories (V = 1), or zero correlation between the recalled memory and stored memories (mpu = 0), CP becomes unity, and the equations reduce to the standard equations for a Hopfield network, as they should. In the limit of exactly orthogonal memories (V = 0), and exact recall (m pv = 0), CP becomes zero and the noise from the uncondensed memories disappears, as it should. Figure 4.4 summarizes the variation in CP over this range. 4.2.4 Inverse Gaussian Transforms Now that the ensemble average over {E} is complete, the Gaussian transform done in Section 4.2.2 must be reversed. In other words, G 1 must be integrated over the zPv and variables. Expanding Equation 4.34 in terms of = 1 2r 2^ exp 1 • P^n E E p=8-1-1 p,cr=1 [ CP SP•Slex° N^P Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^78 - [2cP0 P_hP sP•s- xp - [cP0 2 (11P) 2 S P •Scr ] (4.36) reveals the problem: there are mixed products of the integration variables el, in the exponent. To remove the mixed products, a unitary transformation must be performed in the integration variables. This procedure is briefly described in [1, 35] for random memory sets. A modified form of the procedure that is suitable for near-orthogonal memories is described in Section C.6. The result of the inverse transform is: E [K-11Pcr — N —car(log K) k,8 Gl = exp — 2^ p,o=1 (4.37) where [K] is the following n x n matrix: [K] p,, E_ (8x, —-CPSP • 5')^ (4.38) and H is the sum of the "crowd noises" from the clamped neurons: E P (4.39) p=s+1 The first term in the exponent of Equation 4.37 represents the noise from undamped neurons in the uncondensed memories, and the second and third terms represent the noise from clamped neurons in the uncondensed memories. 4.2.5 Average Over Condensed Memories The ensemble average over the condensed memories e l ... es can now be calculated. Replacing Equation 4.37 into Equation 4.23 results in quite a complicated function of the overlaps nisi:, because of the intl, term in CP. It is therefore not possible to linearize the exponent using a Gaussian transform, as was done in Equation 4.25. Instead, all occurrences of mil, and SP•5" Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^79 are replaced by new variables iiiii, and qx" using the following 6-functions: 02 p 100^ 1 6. (qp, - --lg. SP .5) = — 271- _co dr I, exp {-0 2 zrp,„T 1 (qp, - -37. SAS°. )] , 132 p 00 5 ( is; - 74) = --^de exp [-O 2 PteP^P (filk` -M 1- eii -sr)1 .th j_^ ,,,c, P^ 27 (4.40) Terms that were previously functions of SP and 41` can now be written as functions of ?VI, and qpu : OP ..-- E: 1- (fit,-)2(1{1- 0) Vp=o P7aP - 1376 0 %, V p Gl = exp -1-V.- { -car(log IC) + // /3 2 E [f(-1] — /31In 1 (4.41) ^p,cr=1^ so that the averaged partition function, Equation 4.23, is equal to: vn)) = e pa-r2Nn ^. E i f c''') (8 (( f n II II dfittii, dyr ) j p=1 p=1 9 Isil Isnl^-°° n(28-1-n-1) 2 X ( 122-) 2 71-^ pa f (n-1 n 00 I-1 11dgimy drp,,) p=1 o=p n-1 n [ _ O1 e.xp EE p=1 cr=p+1 -/3 2 zrixr aN (q -— Pa m S P •S tr )] n 8^N 1 [P672 x exp^ - 2 -I- ,(37Nitei:it° - 0 2 z4aN ( 71.1;`, - -11--i e.SP) i)) ( fni;,) 2^ p=1 1.1=1^ EE This leaves only two terms in the exponent which involve mil, or 100^ 00 /4=1 so ((Zn)) can be written: o n^ o n-1 n (( ZVI ))^j j^C1744114 9) - Sp, p=1 o=p p=1^-00^ Clqfxrdr pa) Old2F21 where F2 aipi n-1 n^ n exp [— rpo sP•syit;e1" • Sp {Sn}^7^p=1 cr=p+1^p=1 p=1 E•E [S 11 EE +EE e - e (4.42) ^ Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^80 and n(28-1- n-1) (32aN ^2^ exp — N G2^ 27r ) n 8^ n-1 n fi a " P 2 a^E 2 P= 1 (7=p+1 2 7 2 yen,PlIP -- Owe^. X exp N IfIEE —(fit") P^P P p=1_^ (1=1 2^P^ (4.43) The ensemble average can be simplified in the standard way [1, pg. 336]: N^ n-1 2 F2 = (( E. • .E H {S 1 } E E gisr) )) E E a 02 2 n-1 n^8 n N^ H E . • • > exp — (E E 7^ = (( i=R+1 S 1 =±1 = a n n y exp afi 2 r,,Sr Sr + 7^ p=1 p=1^ p=1 cr=p+1^ {Sn} i=R+1^ el... es > N ((exp sn=±1^ log { rp,,S P Scr + E E yAPP s ) )) p=1 cr =p+1^p=1 p=1^et... ea E•••E i=R+1^S1 =11 Sn=±1 ao2i n-1 n^a n exp E> r„,„SPS° + E E yigrsP })) ^) 7^p=1cr=p+1^ p=lp=1^e1... e a (4.44) — Section C.8 shows that the expression inside the sum is self-averaging, even though the memory vectors are not entirely random. Let 1' be an 8-dimensional column vector, with each element equal to either +1 or -1. The sum over i is proportional to an average over each of the 2 8 possible vectors fi. through i42 .: F2 = E••.E s1=±1 sn=±1 ( n-1 n^a n [10 2 2 E E r po.SPS° + E E yi:TPSP^ (4.45) )1)) exp N ((log ^exp P= 1 °=P+ 11^p=1 p=1 71-118 4.2.6 Saddle Point Calculation When the expressions for G1, G2, and F2 are replaced into Equation 4.42, the result is a series of integrals over a single exponential in which the exponent is proportional to N. In the limit Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^81 of very large N, the integral is dominated by the value of the integrand at the point where the exponent of the integrand is maximum, so the saddle point method [1, pg. 138] can be used to solve the integral. Applying the saddle point method to Equation 4.42 gives: ((Zn)) = d1d2F2, where O 1 , O 2 , and F2 are evaluated at the saddle point. To find the saddle point, the following 2n(2s n — 1) equations must be solved simulataneously for 2n(2s n — 1) variables qc,,, rte , etzlf, and yl; : 0^ 0 log di + 0 log O2 0qp, {V^: (1 < p < n) A (p < a 5_ n)} 0 0 log O2 0 log F2 Orm,^Orp, {V p, : (1 < p < n) A (p < a 5_ nil 0^ 0 = (4.46) 0 log d i + 8 log O2 p,p : (1 5_ p < s) A (1 5_ p 5_ nil 074^00,i, O log O2 0 log F2 oip + {V a,p: (1 <µ s) A (1 5_ p n)} Equations 4.46 only enforce the requirement that the free energy surface be fiat. If the point of condensation, determined by the selection of hi', is chosen poorly, these equations may give a solution for a maximum or saddle point, rather than a minimum of the energy function. To confirm that the solution is a minimum, the second derivatives must be checked, a task which Lautrup found "quite complicated" [35, pg. 37], even for random memories with no clamped neurons, and which has not been done here. 4.2.7 Free Energy of the Replica Symmetric System The established method for making the calculations manageable is to assume that the solutions will obey replica symmetry [1, 35, 38]. That is, assume that the values of qp,, rte , 711,, and ^ Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^82 yr, will be identical for all replicas. Under that assumption, the values of the variables at the saddle point can be written: qP0 I saddle q rpaI'saddle = r saddle Yt,; inh saddle This reduces the number of variables to (2s + 2), so Equations 4.46 reduce to (2s + 2) equations in q, r, mo, and yii. Rather than n scaling factors CP there is now only one: ^C = 1 — (1 — Vi)(mt) 2 ,^ (4.47) where v is the index of the memory which is most similar to the condensed state. Sections C.9, C.10, and C.11 derive expressions for O1, O2, and F2 for the replica symmetric case and for the limit of small n. When these are used in Equation 4.42, the result is: ((Zn)) = exp [—nNO On]^(4.48) where: oryCq + 411 ((f)) =^-1 a—log[1 + 07C(q — 1)] 20^2 [1 + 07C(q — 1)] 1^ 8^1 + cry — cOr(q — 1)] — v" [-- 2( e )2 + -ymf:h 14 — /3cem hil +— 2 `H L-1 2 ' ‘ 11) I 0=1 — 1 (0.0g Q 2 cosh /3 ( lictrz + oz0tI yt`TP))))^.^(4.49) 14=1^71...T2. Replacing this into Equation 4.21 and using (enx= 1 nx) in the limit n^0 confirms that the expression in Equation 4.49 is the average free energy per neuron. Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^83 Equations for q, r, mf:, and yo are found by requiring that the partial derivatives of ((f)) with respect to all variables are zero, as in Equations 4.46. This ensures a solution which is at a saddle point, but it does not ensure a minimum in the free energy. The calculations are described in Section C.12, and result in the following mean field equations: mh = (((T° tanh^ E (72 /4 + 7 11v G3(q, me)) Tulk /7: z E Y=1^ q = (((tanh 2 p[vaT-z+ E(7 2mf:^ (4.50) 71...T28 8 — N-1 G3(q,mt))TP1))) (4.51) 71...T28 7C(cryCq + 4H) r —^ a [1 + #7C(q — J.)? 2 where the (((• • a (4.52) symbol represents a combined ensemble average and an average over z using a Gaussian window, as defined in Equation C.57, and where 7 2 4(1 — VT) G3(q, mfi`) =^[a — ceP7C(1 — q) 2 — 40(q — 1) H} . [1 — 07C(1 — q)] q)] 2 (4.53) In the limits V = 1 and M = N, the network should behave as a standard Hopfield network, because there is effectively no correlation between the memories and the set of orthogonal memories. These values of V and m imply that y = 1, C = 1, Ito /to, G3 (q, mo ) = 0, and H 0, so the mean-field equations become: q = (((tanh 2 13 arz + E(74 h°)T°1))) 0 .1 71...T28 r = ^ [1 + /3(q — mh = (((T° tanh p[fcir z^(mt h!')Tul))) V=1^ 71...T28 Thus the new equations are equivalent, in the limit, to the standard equations for a Hopfield network [1, pg. 293]. Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^84 4.3 Summary of the Free Energy Calculation - The calculations in Sections 4.1 and 4.2 can be summarized as follows: • The update rule rule given in Equation 2.5 was modified so that neural evolution became probabilistic, with probabilities determined by Glauber dynamics (Equation 4.3). Neural states were allowed to evolve contrary to the postsynaptic potential with a probability which depends on a "temperature" parameter. • The probability of a network at equilibrium and with temperature A being in a state with energy E was shown to be proportional to e OEA. - • The free energy F was defined to be a function of a set of "order parameters" by setting it equal to the logarithm of the partition function Z, where Z is summed over only those states which are compatible with the given order parameters. • The ensemble average partition function, Z", of a set of n identical networks (replicas) was calculated as a function of the order parameters. It was calculated in two parts: first for "uncondensed" memories, which are not associated with a large overlap order parameter mo, and second for "condensed" memories. • It was shown that the "crowd noise" from the uncondensed memories diminishes in proportion to a scaling factor CP (Equation 4.35) as the stored memory sets become more orthogonal. • The exact expression for Z" is given in Equation 4.42 in the form of a series of nested integrals of a complicated exponential function, and was solved using the saddle point method. The location of the saddlepoint is determined by a set of mean field equations in the integration variables, and can be translated into a set of mean field equations in the order parameters. Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories ^85 • The mean field equations were simplified by assuming that all the order parameters are identical for each of the n replicas ("replica symmetry"). The resulting equations, 4.50 through 4.52, are an important result in this chapter. • The ensemble average free energy ((f)) was calculated from Zn using the standard replica "trick", as described in Equation 4.21. Chapter 5 Analytical Estimate of Network Storage Capacity This chapter derives an analytical estimate of the storage capacity of the EH network. The calculation is done in three stages: • Section 5.1 calculates how many near-orthogonal memories can be successfully stored in a Hopfield network. • Section 5.2 estimates the degree of orthogonality which can be achieved using the EH network's roll-up procedure, as a function of the number of memories installed and the number of hidden neurons. • Section 5.3 combines these results to achieve an analytical estimate of the capacity of an EH network as a function of the number of hidden neurons. 5.1 Hopfield Network Capacity for Near-Orthogonal Memories This section uses the mean-field equations derived in Section 4.2.7 to predict the memory capacity of a Hopfield network for near-orthogonal memory sets. The calculation parallels the standard solution for random memories [1, 35], but with some added complexities because of new terms in the mean-field equations. 86 ^ ^ Chapter 5. Analytical Estimate of Network Storage Capacity^ 87 5.1.1 Mean Field Equations at Zero Temperature A very useful simplification of the mean-field equations (Equations 4.50, 4.51, and 4.52) is to take the zero-temperature limit. The first step is to calculate B /3(1 — q) in the limit oo. This is done in Section D.1, and results in the following implicit equation for B: B^lim /3(1 — /3—).00 2 (amr, 411.13)11) 7 v. tarLa [rnh + l' 7(1 V' (1 — 7CB) 2 tar \\e xP p=i^ - )^(5.1) 71—T2a It follows immediately that the zero-temperature mean-field equations can be written in terms of B, using Equation D.8, as follows: ^ amf: 4i5iB) 5.2 ) 7 2(1 VI)^ {^ tar E (72mri+7hu [1 — 7CB] 2 Ti...T2s q = 1^ (5.3) nit = ((Tiled - 1 }) 7C(a7Cq 4H) [1 - 7C.13]2^ (5 .4) 5.1.2 Memory Capacity with No Clamped Neurons Equations 5.2, 5.3, and 5.4, which in general must be solved numerically, can be solved analytically in certain relatively simple cases. One such case, the subject of this section, is the calculation of the memory capacity for a network in which none of the neurons are clamped (y = 1) and the temperature is zero (0 = oo). The memory capacity is equal to the largest value of a for which the expected overlap is dose to unity when a single state is recovered (s = 1). In terms of the parameters used in the mean-field equations, these conditions can be written: = 1 ^ ^ Chapter 5. Analytical Estimate of Network Storage Capacity^ 88 s = 1 = 00 It° = 0^V it ft = E oP) /1=1. 2 = O. With only a single recovered state, the ensemble averages in Equations 5.1 and 5.2 can be calculated explicitly. Equation 5.1 becomes: 2 1, (1 -17i)(12) ^p)2} B = ^ exp 1j e 12a1 r ( 1^(1 - CB) _y2 (5.5) where (1- VT)a) ^ y^(1^ (1- CB) 2 ) (5.6) Note that the variable y has nothing to do with the integration parameter yP that was used in Chapter 4. The mean-field equations, Equations 5.2, 5.3, and 5.4, reduce to: nt` = erf(y) q = 1 C2 r = ^ (1- CB) 2 Section D.2 describes the derivation of a simultaneous solution for these three equations. The result is the following equation for the capacity a, as a function of y and V: CB a =^ ^(Vy-2c2 2( erfo(i_ vt) _ Cy) 2 (1 - Vt. ) erf(y) [ - 2 (5.7) where: B= 2(1 - VI) erf(y) e - Y2 {V y 2 C 2 2(erfy) 2 (1 - Vt) - Cy} (5.8) Chapter 5. Analytical Estimate of Network Storage Capacity ^ 1.2 0.40 V = 0.0 m. 0.35 3 89 -1.0 > 0.30 LL 0.25 urn c „,V=0.25 -0.8 't / 1 /^V = 0.5 u. as /":"NA^V = 0.75 /_ -0.6 2 rOA 11 0.05 0.00 0.0 V = 1.0 5.0 1.0^2.0^3.0^4.0 cap vsv2 :AkshisOrY y - parameter 1892 JULY 5 r0.2 FO.0 6.0 Figure 5.1: Plots of Equation 5.7, showing that for each level of orthogonality, solutions to the mean field equations exist only when memory loading is below a critical level. The y-parameter is defined in Equation 5.6 and is only useful as an aid in the solution of the mean-field equations. Figure 5.1 is a plot of Equation 5.7 for various values of the orthogonality measure V. The plot for V = 1 is identical to Figure 5.5.1 in Lautrup's notes [35, pg. 48], as expected. The plots illustrate that for each level of orthogonality, there is a loading factor a c (V) above which no solution exists. The memory capacity of the network is equal to a c (V). A graph of mf:(y) is included in the figure for reference, and to illustrate that mt: is always close to unity when a is maximum. This means that the state of the network which is compatible with maximum loading factor is also a good-quality retrieval state. The relationship between mf: and a can be seen more clearly in Figure 5.2, where the parameter y has been eliminated. As before, no solutions are available for [a > a c (V)]. The figure shows that as the memory loading a gets smaller, solutions are available in which mf: is loser and closer to unity. This means that recovered memories tend to be exactly equal to stored memories when memory loading is much lower than memory capacity, which is consistent with experiments. Chapter 5. Analytical Estimate of Network Storage Capacity ^ 90 1.2 /V = 0.25 E 1.0 V= ' V = 0.0 a 0 8 V = 0.75 V " 0.6 0. a TD 0.4 II E 0.0 0.0^0.1 cap vat : MOtAlpha 0.2 alpha - loading factor 0.3^0 4 1902 JULY 7 Figure 5.2: Plots of Equation 5.7, showing the retrieval quality m as a function of the load factor a. This figure is comparable to Amit's Figure 6.11 [1]. If the values of a c (V) are plotted versus V, the result is the graph shown in Figure 5.3. It shows the predicted memory capacity of the network as a function of the orthogonality measure V. It is important to keep in mind that this calculation of "memory capacity" assumes that there are no hidden neurons, so all neurons are initialized to the memory state when stability is measured. This form of memory capacity was measured in Section 2.3.3, and the calculated memory capacity matches those measurements very nicely. It may be possible, in future research, to extend the above calculations to predict the stability of near-orthogonal memories in the sense defined in Section 3.1.3, where hidden neurons do not play an "unfair" role in contributing to the stability. The goal of such research would be to analytically predict the locations of the curves in Figure 3.9, rather than just the values of their x-axis intersections. Chapter 5. Analytical Estimate of Network Storage Capacity ^ 91 0.6 a_ 13 0.41 Analytical Result E 0.1-1 + Simulations 0.0^111111111 00^0.1^0.2^0.3^0.4^0.5^0.6^0.7^0.8^0.9 Cap_VsV2 :AMON^V = Orthogonality Measure^1992 JULY 5 10 Figure 5.3: Comparison of predicted capacity a c (V) to measured capacity. The line is a graph of the maximum value of a which is a solution of Equation 5.7. The data points are the same data points as in Figure 2.3. 5.2 Statistical Estimate of Overlap After Roll Up - The following sections derive an approximate expression for the degree of orthogonality which can be achieved by the EH network's roll-up process. The calculations assume that the network is rolling up approximately as described in Section 3.2. The only difference is that all updates are assumed to be asynchronous — the initial synchronous updates described in Section 3.2 are ignored. The presentation proceeds as follows. Sections 5.2.1 and 5.2.2 review the notation and introduce a new variable, the "neural alignment" Xi. Section 5.2.3 introduces P(X,t), the statistical distribution of Xi, and Section 5.2.4 examines how P(X, t) changes as the network rolls up. The resulting equations are solved numerically in Section 5.2.5 to predict the rms overlap between the most recent memory and all previous memories. Section 5.2.6 shows how this can be used to calculate the average overlap between all memories. Chapter 5. Analytical Estimate of Network Storage Capacity ^ 92 5.2.1 Notation The notation used in the following sections is an extension of that introduced in Chapter 3. The network has N neurons, R of which are visible and therefore not free to change, and M = (N - R) of which are hidden and free to change. Because the purpose of the calculation is to determine the mean overlap of the pth memory, it is assumed that there are (p -1) memories already installed when the roll-up process begins. The following two ratios are useful: - ^=sa^P^ (p - 1) _ M 7 =- Although the time-evolution of the network is not the focus of these calculations, some concept of "time" needs to be introduced so that time-varying characteristics can be expressed. The natural measure of time, represented by the symbol t, starts at zero before roll-up begins, and increases by one whenever a neuron in the network changes state. In addition, a scaleinvariant time r, defined as: T t (5.9) N' is used for some calculations. The overlap calculated in this section is given a new symbol, OP , defined as: 1/TV - p-i^ (m µ" ) 2 p - 1 ,,=.1 ^ E )) 2 . (5.10) This is the rms overlap between one newly-added memory el', and each previous memory, and so is not the same as O nus , the normalized rms overlap between all previous memories. The relationship between O rms and O1 is discussed in Section 5.2.6. Chapter 5. Analytical Estimate of Network Storage Capacity^ 93 5.2.2 Overlap as a Function of Neural Alignments This section introduces the "neural alignment" Xi and shows how the expected normalized rms overlap can be calculated using it. Xi is a measure of how well the alignments between bit i of each memory and bit i of the current state g' are matched to the overlaps my between each memory eu and the current state, and is therefore closely related to the negative of the neural energy Ei. It is defined as follows: N 14-1 p 1 - E musi c = [-2E1 + aSi ] I E^(5.11) u=i^ N j.i The value of Xi for each neuron Si determines whether that neuron will be inverted during roll-up. This can be seen by expanding the input, Ui, to each neuron as follows: N Ui = E TijSj 1 N si gf L., L.,^—^— 1) ii - j =1 v=i =^(xi — a) . (5.12) The roll-up process described in Section 3.2 inverts only neurons where the state Si is the same sign as the input Ui, and hence from Equation 5.12, only neurons where (Xi > a). The roll-up process can thus be viewed as a process for shifting all the Xi which are above a until they are all below a. There is a simple relationship between the average neural alignment and the rms overlap. The sum of all neural alignments Xi is equal to: N^p-1 N N^p-1 E xi = — E E E sisi erei; = N E(m,-)2. N i=1^v=1 i=1 j=1^v=1 A comparison of this to Equation 5.10 leads to the following exact relationship: (5.13) -1 )i Xi )) Chapter 5. Analytical Estimate of Network Storage Capacity ^ 94 5.2.3 Probability Distribution of the Neural Alignments To proceed further, it is necessary to define a probability distribution, P(X, t), for the neural alignments Xi. The value of P(X, t) is proportional to the probability that Xi, for some i chosen at random, will be equal to X. P(X, t) acts as a probability density, so that the probability that the value of Xi will be between a and b is equal to: Pab(t)^P(X, X =a t) dX. Strictly speaking, P(X, t) can only be considered a probability density in the limit (N —> oo), where the allowed values of Xi approach a continuum. Calculations of the dynamics of P(X, t) are based on the following ansatz: Working Assumption: P(X,t) is assumed to be Gaussian at all times during roll-up, although the mean, ±(t), and variance, Vx(t), may change. The general form for P(X,t) is therefore 1 ^[—(X — X(t)) 2 P(X,t) =^ OrVx(t) exP^2Vx(0 . (5.14) Section D.3 shows that the mean and the variance are both equal to a initially: X(0) = Vx(0) = a. A sketch of the initial distribution for a = 0.2 is shown in Figure 5.4. 5.2.4 Dynamics of the Probability Distribution This section derives the equations which will be used to calculate the rms overlap between a new memory and all previous memories. Two equations will be derived: one which describes how X varies with time, and another which calculates how long the roll-up process takes. Chapter 5. Analytical Estimate of Network Storage Capacity^ EvolPxt Value of Xi 95 1993 FEB 20 Figure 5.4: Sketch of the distribution of Xi for a --= 0.2, before the network rolls up. The vertical scale is P(X, 0), the probability density of Xi. Neurons for which (Xi > a) are candidates for being inverted during roll-up. The average value of all such candidates is marked (X). The time evolution of P(X, t) is sketched in Figure 5.5, and can be summarized as follows. Initially, P(X, 0) is a Gaussian distribution centered at X = a, and with variance a. As each neuron is inverted, the Gaussian gets narrower, and its center, fC(t), moves to the left. (X), the average of all the Xi above a, also moves to the left. The motion of the Gaussian is self-limiting because its speed is proportional to ((X) — a), and so diminishes with distance travelled. The Gaussian continues to shift to the left and get narrower until all undamped neurons have been used up. All changes in P(X, t) are driven by changes in AM. Section D.4 derives the following expression for the rate of change of _Pt), averaged over many updates: dd =^— (X)) The following equivalent expression using time T (5.15) is independent of N: dX 4(a — (X)) dr = (5.16) Chapter 5. Analytical Estimate of Network Storage Capacity ^ 2.5-1 96 alpha = 0.2 x> (t3) cx>(12 ) <x>(ti) x> (to) 0 0.5-1 -ir.2 T -O.() 0.2^0.4^0.6^0.8 1999 FEB 20 Value of Xi Figure 5.5: Sketch of the evolution of the distribution of Xi as the network rolls up, for a network with a = 0.2. The initial probability distribution is labelled as t o , and examples of subsequent distributions are labelled t 1 , t2 , and t 3 . The average of all Xi > a are marked as (X)(to) through (X)(t3). As the network rolls up, the distribution becomes narrower, its mean moves toward zero, and (X) approaches a. Section D.5 shows that (X) is equal to: (X) = X 11 2Vx e-Y2 r 1 - erf(y) (5.17) where (a - X) Y V21/7 The following estimate for the variance Vx is derived in Section D.6: Vx ks: .7C [1 + a (O rms ) 2 - 1C 1 — - where O is the rms overlap between all previously-stored memories CY. If the O term were explicitly kept in, the equations describing X for each memory loading would become coupled to all the equations for smaller memory loadings. This would increase the complexity of the mathematical analysis enormously, and for very little gain. Chapter 5. Analytical Estimate of Network Storage Capacity ^ 0.0 0.0^0.1^0.2^0.3 MultAtta^ time (tau) 97 0.4 1993 FEB 20 Figure 5.6: Numerical solutions for the time evolution of the mean neural alignment X, for a = 0.05, 0.1, 0.2, and 0.4. The value plotted is (X/a)i. The time variable r is 1/N times the number of neurons which have been inverted since roll-up began. The graphs are solutions of Equation 5.15, with the initial condition (iC(0) = a). For simplicity, a constant has been used in place of O nns for the calculations that follow. The experimental results in Chapter 3, such as those shown in Figure 3.3, suggest that a typical value for O rms is 0.5, so Vx is set equal to: VX =^[1.^(0.5) 2 « - ki (5.18) Equations 5.16, 5.17, and 5.18 can be combined to form a single differential equation in X, which can be solved numerically to yield .t as a function of time. The functional form of X(t) depends on a, both because a appears explicitly in the equations, and because X is initialized to a. Figure 5.6 shows solutions of this equation for four values of a. Before the calculation of O r.,8 can be completed, the length of time required for roll-up must be determined. It is clear from Figure 5.6 that the longer the roll-up takes, the smaller X will be when roll-up stops. The time required to roll up will in turn depend on what fraction of the Chapter 5. Analytical Estimate of Network Storage Capacity ^ 98 neurons are unclamped. If, for example, 90% of the neurons are clamped, the roll-up process will truncate after a short period of time, and the final average overlap will be large. Section D.7 derives a differential equation which describes how the number of "invertible" neurons changes with time. To be invertible, a neuron Si must be unclamped and must have (Xi > a). The total number of invertible neurons is represented by the symbol B(t) and obeys the following differential equation: dB(t)^B(t) [dPT = —1+ 1 , dt^PT^dt + (5.19) where PT is the total probability that Xi is greater than a, and is equal to: PT E 1^a — X [1 - erf fix )1 . The initial number of invertible neurons in the network, B(0), is set to (7N)/2. A scale-invariant version of B(t) is: b(r) = B (t ) N where T was defined in Equation 5.9. Equation 5.19 can then be written using b(r) in a manner which is independent of N, as follows: db(r) dr = _1 + P T ^1 J (5.20) Figure 5.7 shows four solutions of this equation on the same time scale as a solution for X(r). 5.2.5 Numerical Calculations of Average Overlaps The solutions of Equations 5.15 and 5.20 can be used together to predict O0, as illustrated in Figure 5.7. The time Ty at which the graph of b(r) crosses zero in Figure 5.7 is the amount - of time available for the network to roll up. The value of .7C(rf) is then the average neural Chapter 5. Analytical Estimate of Network Storage Capacity^ 99 0.207 :I 0.15 , - ..cas m . z _ R rnC X(t) 0.107 'as 7 Ee cr w cy, 0.05i C gamma = 0.00 ft 11111.."..^ 10.1 0.3 23 .5 b(t) 0 0.05^0.1^0.15^0.2^0.25 Evoner time (tau) e 0.0 z .0.1 11 .0 .0.2 .0.3 •.4 4 •0.5 0.3 0.9 1900 FEB 20 Figure 5.7: Illustration of how to calculate O0 from X(r) and b(r), for a = 0.2, and for 7 = 0.1, 0.3, 0.5, and 0.9. The top graph shows the average neural alignment, X(r), and is a solution of Equation 5.15. The bottom graphs show 1/N times the number of invertible neurons, b(r), and are solutions of Equation 5.20. The time variable r is 1/N times the number of neurons which have been inverted since roll-up began. To calculate X for a given value of 7, draw a line up from the x-intersection of the graph of b(r), and then across to the Xi axis. The rms overlap is then equal to PC/a. alignment when roll-up is completed. It follows from Equation 5.13 that the rms overlap can be calculated from .k(rf) as follows: _ Of.lms = i X(arf ) 1 2 (5.21) Note that this solution is independent of N, and so is applicable to networks of any size. Figure 5.8 compares the results of the numerical integration of Equations 5.15 and 5.20 to the average values of O0 seen in EH network simulations. The analytical solutions match the observed data very well for all values of a and 7, which confirms the integrity of the analysis. Chapter 5. Analytical Estimate of Network Storage Capacity ^ 100 1.1 1.0H 0.9- E °II: 0.6as t13 05 .^ 0 in 0.4E - 0.30.20.10.0* 0^0.1^0.2^03^OA^0.5^0.6^0.7 Statlioll^ Memory Loading (alpha) 1999 FEB 19 Figure 5.8: Predicted and observed overlap of the most recent memory to all previous memories, as a function of the number of memories stored. a is equal to the number of the current memory, divided by N. y is equal to the number of neurons which are unclamped, divided by N. Solid lines are solutions of Equations 5.15 and 5.20 and marked points are averages from 100 different simulated memory sets. This figure shows the average overlaps between the most recent memory and all previous memories, which is in contrast to Figure 3.2, where the overlaps between all memories were included in the average. 5.2.6 Average Overlap of All Memories An analytical estimate of O0, the expected overlap between the most recent memory and each previous memory, has been derived, but an expression for O, the average overlap between all previous memories, would be more useful. The latter can be calculated from the former if O rms is expanded using Equation 2.9: (O rms ) 2 = 1 1 Ep=1 E V-1 ) ^ ( 14 Eu.2 Ep.i 1 v=2 N N This can be written in terms of 0'as follows: (Orms)2 = EL2( 1/ — 1) (O ) 2 211 (F1 — 1) (my/ (5.22) Chapter 5. Analytical Estimate of Network Storage Capacity ^ 101 1.1-1 1.0-1 0.9-1 _ 0.8- E 0.70 a 0.64f -5 0.5- A g 0.40.30.20.10.00 0 0.1^0.2^0.3^0.4^0.5 Statliol St9tAva^Memory 016^017 Loading (alpha) 1993FEB19 Figure 5.9: Predicted and observed overlaps of all memories, as a function of the number of memories stored. Solid lines are solutions of Equations 5.15, 5.20, and 5.23. Marked points are averages from 50 different memory sets in a bi-state network, and were previously presented in Figure 3.3. 2 ti ( — 1 ) ,2 (v — 1) (oz,;,, ) 2 ^ (5.23) Equation 5.23 makes it possible to translate the solution of Equations 5.15 and 5.20 into an expression for O rms rather than O0. The result, as shown in Figure 5.9, matches the simulation data quite well. 5.3 EH Network Storage Capacity The analytical results from Sections 5.1 and 5.2 can now be combined to (almost) yield a prediction of the memory capacity of the EH network as a function of the number of hidden neurons. This is an "almost" prediction because it only estimates the number of memories that will be stable in the Hopfield sense — that is, will be stable when all the visible and hidden Chapter 5. Analytical Estimate of Network Storage Capacity ^ 102 neurons are initialized to the memory. This is a weaker form of stability than "true" EH network stability, defined in Section 3.1.3, where memories are stable when only the visible neurons are initialized to the memory. In the following paragraphs, capacities calculated using this weaker form of stability are referred to as "whole-vector" capacities. Figure 5.10 shows graphically how to calculate the whole-vector capacity of the network using the analytical results from Sections 5.1 and 5.2.5. The line descending to the right is the solution of the mean-field equations in Section 5.1 and was previously presented in Figure 5.3. It delineates the maximum number of near-orthogonal memories that can be safely stored, as a function of their level of orthogonality. The lines ascending to the right were generated using the statistical estimate in Section 5.2.5 and were previously presented in Figure 5.9. They describe the level of orthogonality that can be attained by the roll-up process, for a given number of stored memories. If the ansatz made in Section 2.4 is correct, the intersection points on these two graphs mark the maximum number of memories that can be stored using the EH process without losing "whole vector" stability. The locations of these intersections are plotted in Figure 5.11, and compared to the observed memory capacity of the EH network. When R 0.4N, the memory capacity approximately matches the analytical prediction, but the prediction diverges from the observed for smaller values of R. An explanation for this is that networks with large numbers of visible neurons are limited by their storage capacity, whereas networks with small numbers of visible neurons are limited by their recall capability. "Whole vector" stability ignores the recall process because it assumes that the memory has already been "recalled" and only tests for stability. This explains why the prediction in Figure 5.11 fails for (R/N < 0.4). Better results will only be achieved when an analytical description of the recall process is developed. Three important conclusions can be reached from the results of this section: Chapter 5. Analytical Estimate of Network Storage Capacity ^ 103 1.1 1.0 0.9-1 0.8-1 a. as g 0.71 0 0.6-1 0.5H 0 0.4-1 0.3j Capacity of Near - Orthogonals 0.1H 0.0 0.1^0.2^0.3^0.4^0.5^0.6^0.7 1993 FEB 21 Memory Loading (alpha) R4Theo2 NetCapy1^ Figure 5.10: Graphical estimate of EH network capacity for "whole-vector" stability. The line curving down from left to right is identical to Figure 5.3, except that the axes are reversed and the data is plotted versus the rms overlap O, rather than the orthogonality measure V. The lines rising to the right are identical to those in Figure 5.9. If the ansatz in Section 2.4 is correct, the intersections of the lines will be the memory capacities of the network for each number of visible neurons. • The good match between the analytical predictions of network capacity and the measured capacity adds to the confidence that a good mathematical model of the EH network has been achieved. • The success of the technique illustrated by Figure 5.10 suggests that the ansatz made in Section 2.4 was a good one, and may be used for further calculations. Further research into the recall process needs to be done to clarify what the slight misalignment between the predicted and observed capacity for large R/N means. It may indicate that the equivalance principle introduced in the ansatz is inexact; or it may be an effect due to the unmodelled recall process. • The analytical predictions indicate that roll-up effectiveness and network capacity are independent of the size of the network. This suggests that the performance benefits Chapter 5. Analytical Estimate of Network Storage Capacity ^ 104 0.40 0.35 L o_ 0 0.300.25- S. 0.20as Analytical Estimate of "Whole Vector / Storage Capacity 0 0.15E 2 0.10- 0.05-1 - Retrieval Limit + Measured Capacity 0.00111111.111 0.0^0.1^0.2^0.3^0.4^0.5^0.6^0.7^0.8^0.9 HUMPI NMCPPY2^Ratio of Visible Neurons (R/N)^93 FEB 21 10 Figure 5.11: Estimate of the true EH network capacity. The solid line going down to the right is the analytical prediction of memory capacity, as determined by the intersections in Figure 5.10. The derivation leading to this result considers only the storage capacity of the network, as measured by "whole vector" stability. The "retrieval limit" line marks the theoretical minimum number of visible neurons required to distinguish one memory, as derived in Equation 3.4. Marked points are the observed capacity of the network, and were first presented in Figure 3.11. documented in Chapter 3 for networks of 100 neurons will be observed in networks of all sizes. Chapter 6 The Information Content of Near-Orthogonal Memories This chapter calculates the information content of memory sets which are exactly orthogonal or near-orthogonal. The derivation proceeds as follows: • 6.1 : Information in Exactly Orthogonal Memories. An expression for the information content of a set of exactly orthogonal memories is derived, and confirmed by an exhaustive "brute force" calculation. • 6.2 : Information in Near-Orthogonal Memories. The information content of memory sets which are not exactly orthogonal is derived. • 6.3 : Comparing Information Content and Capacity. The information capacity of the EH network is compared to the information content of near-orthogonal memory sets with the same number of memories and the same average orthogonality. 6.1 Information Content of Exactly Orthogonal Memories This section focuses on the problem of calculating the information content of a set of exactly orthogonal memories. The motivation for this undertaking is that it will serve as a starting point for calculating the information content of near-orthogonal memory sets, as described in Section 6.2. Section 6.1.1 briefly reviews the problems inherent in achieving an analytical 105 Chapter 6. The Information Content of Near-Orthogonal Memories ^106 expression for the information content of orthogonal memories, and discusses some possible analytical approaches and their limitations. Section 6.1.2 describes a numerical method for exactly calculating the information content, based on an apparently reasonable simplifying assumption. In Section 6.1.3 the numerical results are compared to analytical predictions and conclusions are drawn. 6.1.1 Analytical Calculation of Information Content When someone adds a new memory to a set of memories, the amount of information in that new memory is determined by how much freedom that person had to select the memory. This makes intuitive sense — one can communicate more effectively with a large vocabulary than with only five or ten words. Specifically, if Q is the set of all possible combinations of memory vectors which could be selected, then the information contained in the selected set is the logarithm of the size of Q [49][5]: S = log(g(Q))^ (6.1) There is a direct analogy with a statistical mechanical system where Q is the space of all accessible states of the system and S is its entropy. It is also common in the literature [28, 49, 1] to define the information in terms of the probability P(e) of each vector el' being selected: S = E — P (e) log P (e)^(6.2) In a system with G possible states, all equally probable, the probability of any state is G S = E— S = log G G log (— ) G , so: Chapter 6. The Information Content of Near-Orthogonal Memories^ 107 which is consistent with Equation 6.1. For systems where not all vectors are equally probable, Equation 6.2 must be used. It would be natural to calculate the size of Q as the product of the sizes of the state space for each memory in the set. In that case, the information in a set of memories would be the sum of the information in each memory. This would be over-counting, however, because the order in which the memories are added to the set is irrelevant. In other words, if two memories A and B are to be stored, it would count (A,B) as a different memory set from (B,A). To compensate, the size of Q must be divided my M!, where M is the number of memories, with the result that the degeneracy g(Q) is: g(Q) = g (v,p) Mill p=0 Af-i M-1 ( N = n ^ II fig+' (6.3) p=0 The information content in the most recently added memory is: S(N, p) = log(g(N, p)) — log(p + 1)^ (6.4) where p is the index of the memory and, for reasons which become dear in Section 6.1.2, the initial memory is referred to as the zeroeth. It is simple to calculate the information content of randomly chosen memories because each bit is equally likely to be +1 or -1. There are 2 N ways to select each memory, so the information in each new memory is [N log 2 — log(p 1)]. Consider now the case where all elements of a set of memories are required to be mutually orthogonal. If this is the only condition imposed on the memory set, then each bit is still equally likely to be +1 or -1. When the first memory is selected, orthogonality is no constraint at all, so the accessible space for the first memory is: g(N,O) = 2 N Chapter 6. The Information Content of Near-Orthogonal Memories ^ 108 The only constraint on the second memory is that exactly half of its bits be identical to the first memory. If N is odd, this is immediately impossible — so memory sets with odd numbers of neurons are immediately eliminated from the discussion. If N is even, the accessible space is: g(N , 1) = ( NN12 ) Similar arguments lead to the following expression for the third memory: N/2 g(N,2)^(N/4) 2 It is useful to view each memory as one vertex of an N-dimensional hypercube, as is commonly done in the literature [1, pg 32]. According to that view, the first three memories make up a triad of orthogonal vectors in the N-dimensional space. The expressions for g for the first three memories are simple, because all possible sets of three vectors are equivalent, up to a rotation or reflection in the N-dimensional space. Reflections and rotations are achieved by either a gauge transform [1, pg 184], or by reordering the axes. In a gauge transform, a subset of bits in each memory EP is inverted using the transform-vector e: It is possible, by rotations and reflections, to transform any set of three orthogonal vectors into the following set: ++++++++++++++++++++++++++++++++ ++++++++++++++++ ^ ++++++++ ^++++++++ ^ The expression for the size of the accessible space for e3 is more complicated: N/4 g(N,3) = ^(Ntl i.0 )4 (6.5) Chapter 6. The Information Content of Near-Orthogonal Memories ^ 109 This increased complexity reflects the fact that there are many ways to generate e which are not identical up to a rotation or reflection. The measure of the information in e ° must take into account all of these possible "isomers" of the basis vectors. For the fifth and higher memories, each isomer spawns a new family of isomers, and the complexity of the problem quickly becomes intolerable. There are a number of ways to proceed from here. The brute force approach is to do numerical calculations of g(N,p). The complexity of this task is considerable, and one approach is described in 6.1.2. The other approaches involve some form of approximation. The simplest and most successful analytical estimate of the information capacity is based on the N-dimensional hypercube interpretation of the vector space. In an N-dimensional network, which corresponds to an N-dimensional space, there are 2 N ways to insert e ° . In order to be orthogonal to e°, el is restricted to an N - 1 dimensional subspace. If that subspace is topologically similar to the full space, it can be expected that there will be 2N -1 ways to form e 1 . Following this line of reasoning, the number of accessible states for memory g(N,/.1) e' is: 2N-P (6.6) In fact the (N - 1)-dimensional subspace is not very similar to the full space. In geometrical terms, although the reduced-dimension space left over after e has a "volume" of 2N-1 , a large part of that volume is not accessible to el because of the constraint that esl lie on the vertices of the hypercube. Equation 6.6 is therefore an overestimate of the number umber of accessible states. There are two reasons why it is still an acceptable approximation: • Section 6.1.3 shows, for one example set of orthogonal vectors, that the error due to over-estimation is not large. • The argument gains strength when the requirement that all memories be strictly orthogonal is relaxed, because all the "hidden corners" of the space become accessible. Chapter 6. The Information Content of Near-Orthogonal Memories^ 110 Combining Equations 6.4 and 6.6, the predicted information content of the pth memory is: S(N, A)^(N — p) log(2) — log(A + 1)^ (6.7) 6.1.2 Brute Force Calculation of Information Content A method has been developed for the exact calculation of the information contained in each new memory of a memory set which is constrained to be exactly orthogonal. This method assumes that N, the size of the network, is an integer power of two, so that a set of memories can be created which is both exactly orthogonal and complete. That is, it will be possible to generate one memory for every dimension in the hyperspace. The information content S(N, p) of the pth memory is determined by the size g(N, p) of the vector space from which potential vectors can be chosen, as in Equation 6.1. It has already been shown that the complexity of the expressions for g(N, p) begins to blow up when reaches three (Equation 6.5). The problem can be made tractable by assuming that the information contained in each new memory is independent of the specific set of memories stored previously. Under that assumption, it is acceptable to assume a specific set of exactly orthogonal previous memories e ° ...e m-1 when calculating the information in the Mth memory eif . Because those previous memories are known precisely, the size of only one vector subspace needs to be calculated. The canonical memory set P defined in Section A.1 was used as the set of exactly orthogonal memories. There is a direct correspondence between the set of memories constituting the nth level of r(m) and the set I'(n), for all m. Consider, for example, the level 4 memories in table A.1. When the block ++++---- is mapped to a +1 bit, and the block ----++++ is mapped to a -1 Chapter 6. The Information Content of Near-Orthogonal Memories ^ 111 bit, then level 4 of r(32) corresponds exactly to T( 4 ), the canonical set of 4-neuron memories: ++++----++++----++++----++++---- ++++ ++++----++++^ ++++----++++ ++-- ++++ ^++++++++ ^ ++++ + -+- ++++^++++----++++++++---- +--+ (6.8) These features of r(N) simplify the calculation of the information content in some important ways. The most important is that they permit the imposition of progressively stronger restrictions on the sum of the bits in each block of the new memories. Specifically: • Consider a memory as being divided into 2" blocks of width N/2". For all memories numbered p = 2" or higher, the sum of the bits within each of these blocks must be zero. In other words, there must be an equal number of +1 and -1 bits within each block. In mathematical terms, this restriction is: V n : 1^< n < log 2 (N) N/ 2n^ E 4+..2n = 0^y p p> 2"^(6.9) : V m : 0 < m < 2n In table A.1, for example, the sum of bits 9 through 16 must be zero in any memory numbered 4 or higher. The degeneracy for any memory c", where p is an integer power of two, is therefore: 9(N,µ) = (BP ) 2 LP for all p = LP^ (6.10) where BP and LP are the block width and level number defined in Section A.1. Note that this is the number of vector states which could be chosen to be e 2n . For the purposes of calculating the degeneracy of later memories, this memory is allowed to be only one state - the one defined by Equation A.1. ^ Chapter 6. The Information Content of Near-Orthogonal Memories^ 112 States ei where > LP must satisfy the further restriction that they have zero overlap with memories r where v > Li`. The overlap with each of these previous memories can be calculated as a weighted sum of k o through kLm_ i , where the ki are defined as: 1 ki = BA 4 L., il'4.BA-13 =1 where 0 < i < LP^(6.11) h9 Each ki is the quarter-overlap of the ith block in ei with the ith block in rLA. The block furthest to the left is the zeroeth. The number of ways to arrange the bits within a block so that the quarter-overlap with rLP equals ki is: g(BP , ki) = I ( BB A Pl2k i ) 2 4 for B>2 (6.12) 2^for B = 2 Each ki can range from ABP to -F1BP (they must all be zero when B = 2), but as a set, the ki's must satisfy the restrictions which ensure zero overlaps. There are LP blocks, each with the degeneracy given by Equation 6.12, so the total degeneracy is: LA-1 BA/4^2 BA^ =^E^BA^.) g(N,^ D(p, {k})^for all^LP ^(6.13) "1 j=o ki =—Bµ/4 4 where D(p, {k}) enforces the restrictions on the set of ki's. The restriction factors D(p, {k}) always take the form of a product of one or more delta functions. The following formula for D(p, {k}) is derived quite directly using the equivalence property from Equation 6.8: II 6 p-LA D(p, {k}) = j=1 Lµ-1 E ( m ^1 r1_ 1 )3 - (6.14) ki ) i=0 Consider for example g(32, 5), the degeneracy of the fifth memory in table A.1. LP is 4, so there are four blocks, with quarter-overlaps of ko through k 3 . k o , for example, is the quarter Chapter 6. The Information Content of Near-Orthogonal Memories^ 113 overlap between the first eight bits of e 5 and the first eight bits of P4. Using Equation 6.14, the restriction factor is found to be: D(5, Ikl) = b(ko ki k2 k3) and, using Equation 6.13, the total degeneracy is: 8 g(32, 5) = E^(2 —4 ko ) 2 ( 2 L 2 k1 ) 2 = 8 ko,ki ,k2 - (2 —4 k2 ) 2 ( 2 k3 ) 2 45(k° kl k2 k3) Once the degeneracy g(N, p) has been calculated, the information content of the most recent memory ez can be calculated using Equation 6.4. 6.1.3 Calculated Information Content of Exactly Orthogonal Memories The information content of networks with 8, 16, 32, and 64 neurons were calculated using the brute force method derived in Section 6.1.2. Figure 6.1 shows how the information in the most recently added memory decreases as more and more memories are added. The solid lines in the figure show the information content predicted by Equation 6.7. As expected, Equation 6.7 has overestimated the information content, for reasons discussed in Section 6.1.1. A closer approximation would be to replace g(N, p), as defined in Equation 6.6, with an empirical function g(N, µ, k) that is defined as follows: g(N, k)^2k(N-p).^ (6.15) This is equivalent to g(N, it) when k = 1.0. From Equation 6.4, it follows that the information in the (2 + 1)th memory is: S(N,^= k(N - It) log(2) - log(k(it + 1)).^(6.16) Chapter 6. The Information Content of Near-Orthogonal Memories ^ 114 45 xN=64 • + N = 32 35 k= E 30^ in 25^ io^ xx 1.0 x^ x 0 . 88 -C1 20 N = 16 N=8 x 0 15 .1] 51 E 10 . '6 - 5 0 -5 I 0^5^10^15^20^25 mu = Number of Previous Memories 30 1992 APRIL 4 Figure 6.1: Information content of the (p +1)th memory in sets of exactly orthogonal memories with 8, 16, 32, and 64 neurons. The marked data points are from calculations performed using the "brute force" method of Equations 6.10, 6.13, and 6.4. The information capacity predicted by Equation 6.7 is shown as a continuous line above each set of points. The dashed line through the points is a plot of Equation 6.16, with k = 0.88. The dashed line in Figure 6.1 shows S(N, IL, 0.88). This new estimate of the information is a compromise — it is worse than the original for small numbers of memories, but averages out better for large numbers of memories. An interesting feature of Figure 6.1 is that the information content is negative for the last ten percent of the memories. This means that information is actually lost when each of those memories is added. This loss of information has nothing to do with the way the memories are stored, because storage of the memories has not even been considered yet. The explanation can be found in Equation 6.7: the second term, which represents the information lost due to the indistinguishable order of the memories, becomes larger than the first term, which represents the information inherent in the choice of the next vector. Consider, for example, a situation where 63 memories have been added to an orthogonal set Chapter 6. The Information Content of Near-Orthogonal Memories ^ 115 of 64 neuron memories. One more memory can be added to make a complete set of memories. The person choosing that memory may only choose between two vectors, which are the bit inverse of each other, so the information in the 64th vector would be log 2. Once that memory was added to the list, it would be indistinguishable from the other 63 memories. On the other hand if the 64th memory is not added, a careful observer could determine, up to a sign, exactly which memory was missing. The amount of information encoded in the choice of the missing memory is just the log of the number of memories which could have been selected — log 64. Thus more information can be conveyed by not adding the last memory, so the last memory contains a negative amount of information. A second interesting feature of Figure 6.1 is that the error in S(N, p) is zero when (1—) = etc. This is a result of the structure of r(N, p) which, as noted in Equation 6.8, ensures an exact isomorphism between the second half of the r(N) and the memory set with half as many neurons, T(N/ 2 ). The fourth quarter of r(N) can therefore be thought of as the second half of r(N/2), and so on. The brute force calculation of degeneracies at these points will therefore correspond exactly to those estimated in Equation 6.6. In a less highly structured set of exactly orthogonal memories, it is likely that the observed degeneracies would lie along a smoother curve, perhaps similar to the S(N,p, 0.88) curve. The total amount of information I in the network is the sum of the information added by each new memory. Using the approximation from Equation 6.16, this is: 1 I(N, M,k) = kM(N — ——) log 2 — M log k — log M! 2 2 (6.17) Figure 6.2 shows how the total information calculated using the brute force approach compares to the amounts predicted by these two expressions. The fact that graphs using k = 0.88 correspond so well to the data for all four memory sizes suggests that k = 0.88 should be used for all estimates of information content. Chapter 6. The Information Content of Near-Orthogonal Memories ^ 116 0^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ^1^1.1 Memory Load Ratio (M/N) 1992 APRIL 7 Figure 6.2: Total information content of a set of exactly orthogonal memories with 64 neurons. The marked data points are summations of the data shown in Figure 6.1. Lines marked a, b, c, and d are graphs of Equation 6.17 with k = 1.0, for N = 64, 32, 16, and 8, respectively. Lines marked with primes show the case where k = 0.88. 6.2 Information Content in Near-Orthogonal Memories The information content in a near-orthogonal vector set can now be estimated. A "nearorthogonal" vector set is made up of vectors whose bits differ from one of the exactly orthogonal vectors iv according to the probability distribution defined in Equation 2.12: p (v) (1 — b ) b (er + b ov (6.18) In the discussion which follows, it will be useful to have an estimate of how many vectors can be classified as "near" a particular vector ro. There is a finite chance that any vector in the whole vector space might be generated from pi using Equation 6.18, so it would not be possible to just count the number of "near" vectors. Instead, the number of "near" vectors is estimated using a statistical method, based on the amount of information added when the new vector is generated. Chapter 6. The Information Content of Near-Orthogonal Memories^ 117 When a new vector ep is generated using Equation 6.18 the information in eti exceeds the information in PP by an amount equal to the sum of the information in each of its bits: 5+ = — N[b log b (1 — b) log(1 — b)] (6.19) The information is understood to be the logarithm of the number of accessible vectors, so the inverse is also true: the number of vectors which are "near" rP is the exponential of the information added when generating them: g+ (N,b) e'5+ = exp {—N[b log b (1 — b) log(1 — b)1} (6.20) The next step is to estimate the total size, g(N,p,b), of the vector space from which nearorthogonal memories can be selected. A new near-orthogonal vector et' is generated in two steps: first an exactly orthogonal vector rP is selected at random, and then noise is added according to Equation 6.18. The space of all possible vectors e , is therefore the space of all vectors which are "near" any one of the rP vectors. The number of possible rts vectors is g(N, p, k), as derived in Section 6.1. It would be tempting to calculate g(N, p, k ,b) as simply the product of g(N, p, k) and g+ (N , b): g(N, p, k, b) g(N, A, k) • g+ (N,b) (6.21) The problem with this is that it would count the vectors around each possible 1' 0 as being distinct from those around every other rP. Any vectors which were "near" more than one of the possible iv would be counted more than once. This is a reasonable approximation when both g and b are small, but it is possible to derive an expression which is correct in the more general case. The number of vectors which are nearly orthogonal to a set of previously-installed vectors is calculated as the total number of vectors that are close to each possible new ro vector. There are g(N, p, k) possible vectors rP, numbered from r µ[l] through rPm. The number of vectors Chapter 6. The Information Content of Near-Orthogonal Memories ^ 118 which are close to rP[1] was calculated in Equation 6.20: ^gi = exp {—N[b log b (1 — b)log(1 — b)ll^(6.22) The size of the complete vector space is 2 N , so the fraction of all states which are near ri•) is 0+ B^-1 = [2 bb (1 — b) (1—b) ] —N^(6.23) 2N — The definition of 4 , the number of uncounted states close to rP[2], must take into account - the fact that there are now only 2 N (1 — 0) states which are still eligible to be counted. The simplest assumption is that, on average, the memories around rµ[21 are evenly distributed between the reduced space and the excluded space, in which case: gj. = gt (1 — 0) This leaves 2 N (1 — 0) 2 states eligible for counting in g t, and so on: = 02 N (1 — 0) 2 = 02 N (1 — 0) i-1 The total number of possible memory states is the sum of gt through ycgE: g(N,is,k) g(N,p,k,b)^E^= E 0 2N(1 — e)i-1 i=i 00^ = 00 0 2N^— — e 2N Di — i.0^i=g oo = 2 N — 02 N (1 — 0) 9(N ' P ' k) E(1 — 0) 1 i=o = 2 N [i — (1 — 0)9(N,11,101^ (6.24) This is a difficult expression to simplify analytically, but it is tractable in the limits b^0 and b^1. As b goes to zero, 0 also goes to zero, so Equation 6.24 can be simplified as follows: g(N, 11,k,b)^2N [1 — (1 — g(N, µ, k)0)] Chapter 6. The Information Content of Near-Orthogonal Memories ^ 119 = g(N, k)b - Nb (1 - b )-N(i-b) = g ( N,ti,k ), log g(N, k, b) = log g(N , ft, k) - Nb log b - N(1 - b) log(1 - b) (6.25) This result is accurate when (0 g(N„u,k)) << 1, and matches Equation 6.21 as expected. In the domain where it is accurate, the information in the near-orthogonal vectors is just the information in the exactly orthogonal vectors plus the information from inverting some of the bits. As b approaches 0.5, the memory set approaches complete randomness, so the amount of information in each new memory should approach (N log 2 - log(/ +1)). If b is written in terms of a small amount 8: b= 12 then 0 can be written: 0 = = .7,..2, 2 -N ( 1 0 (1+8) — b y 4(1-6) ( 1 + 6^ 2^j + 8/2 \ 25N 2 (1 _ 8) _/2i (1 + 0 _ 1,2, (1 _ 1 ^6 /2 ) s 2)26N 8 f (1 + _6 \ 26N ( 62) -^ L. + v + 1 2) 2 ) 1-N _ t5 a) ( 1 + 262N) ( 2 - (1 + N6 2 ) 2 (6.26) This approximation is accurate if N8 2 << 1. Replacing this into Equation 6.24, g(N,p,k,b) = 2 N 1 - g(N,p,k)] 2^) (6.27) So when b gets close to 0.5, the information content approaches N log 2, as required. The information content of the (it + 1)th near-orthogonal memory is, from Equations 6.4 and 6.24 equal to: S(N, tz,k,b) = log [11(N, tt,k,b)] - log(tt + 1) Chapter 6. The Information Content of Near-Orthogonal Memories ^ 120 = N log 2 + log [1 — (1 — e)9(N,p,k)] log(p + 1)^(6.28) where 9 is defined exactly by Equation 6.23. This value was calculated numerically and is shown in Figures 6.3 and 6.4 for k = 1.0 and k = 0.88 respectively. In the k = 0.88 graph, the information content for b = 0 rather awkwardly does not match the other plots at p = 0. This is because S(N, p, 0.88) is a poor match to the true information content at p = 0, as noted in Figure 6.1. This awkwardness is compensated by the fact that it is a better match for large values of A. As expected, the near-orthogonal graph runs parallel to the exactly-orthogonal graph in the domain where 9 g(N, p, k) < 1. There is a sharp transition to the domain where the information is just N log 2 — log(p + 1) so, to a good approximation: S(N, k) — Nblog b — N(1 — b) log(1 — b) S(N, p,k,b) = minimum of:^ N log 2 — log(p + 1) (6.29) The location, p o , of the elbow in the graph of S(N, p, k, b) is defined implicitly by the following equation: N log 2 — (g o + 1) = S(N, p o , k) — Nb log b — N(1 — b) log(1 — b)^(6.30) where S(N,p,k) is defined in Equation 6.16. Once a set of memories has been installed in a network, the order in which they were installed is irrecoverable, so it is not meaningful to regard early-installed memories as containing more information than later-installed ones. The average amount of information per neuron, after M memories have been installed, is: :§(N, M,k,b) = S(N,p,k, b) Li MN /1=0 E 1 IP° [N log 2 — log(p 1)]+ MN p.0 ^Nr Chapter 6. The Information Content of Near-Orthogonal Memories ^ 121 0.8 g',' 0.7 E 0.6 'C' 0 0.5 0: it 15 0 0.4 M o 0.3 -E2 z 0.2 0 a)^. 1 z 8 0.1 0 0 -0.1 0.0 k = 1.0 0^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8 alpha = Num Memories / N Orth2^ 11 0.9^1 1992 APRIL 12 Figure 6.3: Numerical calculation of the information per neuron in the most recently added near-orthogonal memory, from Equation 6.28. The calculation assumed 100 neurons, and used k = 1.0. The relationship between O m, and b is given in Equations 2.11 and 2.14. 0.8 o 0.7 E a) m 0.6 C a) 0 0.5 cc • 0.4 M 4e) 0.3 'Es 0 = a) 0.2 z 80.1 0 0 -0.1 k = 0.88 0.0 0^0.1^0.2 0.3 0.4 0.5^0.6 0.7 0.8 0.9^1^1 1 alpha = Num Memories / N ^ISM APRIL 12 Figure 6.4: Numerical calculation of the information per neuron in the most recently added near-orthogonal memory. The graph is identical to Figure 6.3 except that k = 0.88. Chapter 6. The Information Content of Near-Orthogonal Memories^ 122 0.8 0.7 fi rms E 0.6 0.8 0.5 0.6 a) 0.4 0.2 0.0 02 2^' t.13)^0.1 0 -0.1 k = 1.0 0^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9^1 Nr_0(1h2^alpha = Num Memories / N ^1992 SEP 27 11 Figure 6.5: Numerical calculation of the average information per neuron, per memory, in a near-orthogonal memory set, as a function of the memory loading a = M/N, for k = 1.0. The calculations were done by integrating and normalizing the data shown in Figure 6.3. A4 -1 E - [k(N - /A) log 2 — log(k(p + 1)) + log gifi^(6.31) p=p0+1 where gif is related to b via Equation 6.22. Figures 6.5 and 6.6 show numerical calculations of .§(N , M, b) for various values of b, and for k equal to unity or 0.88. 6.3 Ideal Information Capacity vs Information Content in EH Networks. This section relates the information capacity calculations of Section 6.2 to the information content of EH networks. The term "ideal capacity" is used to refer to the amount of information that can theoretically be stored in a set of memories which are generated as described in Section 2.3.1, and which have a given rms overlap O. The term "information content" is used to refer to the amount of information represented by the visible neurons in the memory set of an EH network. It is not immediately obvious how the ideal capacity can be related to Chapter 6. The Information Content of Near-Orthogonal Memories^ 123 0.8 0.7 firms ° 0.6 E a) 0.5 0.8 e" 0 0.4 0.4 z 0.3 0.2 0.0 0.6 0 0.2 "E 0 0)^1 >^• 0 k = 0.88 -0.1 0^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9^1 NrOrth2^alpha = Num Memories / N^tsa2 SEP 27 11 Figure 6.6: Numerical calculation of the average information per neuron, per memory, in a near-orthogonal memory set, as a function of the memory loading a = MIN, for k = 0.88. The calculations were done by integrating and normalizing the data shown in Figure 6.4. the information content, because the former was derived for memory sets generated in a very different way from the latter. According to the ansatz in Section 2.4, however, it is assumed that the network's behaviour will be independent of the way in which the orthogonality was achieved. This implies that the ideal information capacity is a theoretical upper limit on the allowed information content of EH memories. Sections 6.3.1 and 6.3.2 discuss how efficiently the EH process matches the information content of each memory set to the ideal capacity of the network. For simplicity, all calculations assume that k = 0.88. 6.3.1 Limits on the Orthogonalization Process Information theory imposes limits on the degree of orthogonalization that can be achieved by the EH network's roll-up process. The limit results from the fact that the more orthogonal Chapter 6. The Information Content of Near-Orthogonal Memories ^ 0.8 E 0 0.7 R/N = 0.9 0.5 R/N = 0.7 g 0.4 0 k = 0.88 0.6 2 "E information capacity.. 0.3 R/N = 0.5 ....... 0.2 R/N = 0.3 ... c`fi ^0.1 a.^• R/N = 0.1 0 -0.1 124 information content 0.0 0^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ^1 1992 OCT 2 Nr_Orth2^alpha = Num Memories / N 11 Figure 6.7: Ideal information capacity as a function of the memory loading, compared to the information content of an EH memory. The solid lines are identical to those in Figure 6.4 and show the amount of information that can be stored (the ideal capacity) in a new memory, for a set of near-orthogonal memories with overlap O. The dashed lines are plots of Equation 3.5 for five different values of R/N, and show the amount of information that is stored (the information content) in the latest EH memory, in a network with R visible neurons. a memory set becomes, the less information can be stored in it, as illustrated in Figures 6.3 and 6.4. An "optimum" orthogonalization process will therefore reduce O mis only until the ideal capacity S(N, k ,b) of each newly-added memory is equal to its information content IER-(a , R). The rms overlaps which would be achieved by such an optimum orthogonalizer can be found using Figure 6.7, where the information content of one EH memory, as calculated in Equation 3.5, is superimposed on the ideal capacity from Figure 6.4. The intersections of these graphs indicate the lower limit on 0.„, for various numbers of visible neurons and various degrees of memory loading. If the memory loading is so low that the graphs do not intersect, the lower limit on O rras is zero. Figure 6.8 summarizes the locations of the intersections in Figure 6.7, and thus shows the Chapter 6. The Information Content of Near-Orthogonal Memories ^ 0^0.2 BestOlep 125 I- ^0.4^0.6^0.8^1 alpha = number of memories / N 1992 Oct Figure 6.8: Comparison of the theoretical minimum overlap, to the overlap which is achieved by the roll-up process. The decimal numbers indicate the ratio of visible neurons, R/N. The lines plot the locations of the intersections in Figure 6.7, and indicate the theoretical minimum rms overlap required in order to store the given number of randomly-selected visible neurons. The marked data points are copied from Figure 3.2, and show the levels of overlap achieved by the EH network's roll-up process. minimum value of O, that is required in order to store a given number of EH memories. The overlap achieved by the roll-up process is shown for comparison. Within the tested range, the achieved overlap is always greater than the theoretical minimum, as expected. 6.3.2 Information Content Near Capacity This section compares the maximum information content of an EH network to the ideal information capacity of a set of memories with the same level of orthogonality. The comparison provides a measure of how efficiently the network is utilizing the hidden neurons. The first step in the calculation is to determine how well the memory sets are orthogonalized when the network is filled to capacity, as shown in Figure 6.9. Given a value of R/N, the Chapter 6. The Information Content of Near-Orthogonal Memories^ 126 memory capacity is given by Figure 3.11, and O. at capacity is given by Figure 6.9, so the ideal information capacity can be found using Figure 6.6. The result, a graph of the ideal information capacity per neuron at maximum memory loading as a function of R/N, is shown in Figure 6.10. When these values are multiplied by the memory capacity in Figure 3.11, the result is a graph of the ideal information capacity at maximum memory loading, as a function of R/N, as shown in Figure 6.11. The plots in Figure 6.11 indicate that the information capacity of the EH network is less than the ideal information capacity, particularly for low numbers of visible neurons. In the zone where the memory capacity is highest, the EH network's capacity is between one half and two thirds as large as the ideal network. ^ Chapter 6. The Information Content of Near-Orthogonal Memories ^ 127 1.2 1.0-1 E 0.4-1 0.20.0-f0.0 0.2^0.4^0.6^0.8^1.0^1 2 ratio of visibles (R/N)^1982 SEP 28 ^OvlAtCap^ Figure 6.9: Orthogonality of an EH network at capacity, as a function of the ratio of visibles, R/N. The data points were calculated by first finding the capacity as a function of R/N, using Figure 3.11, and then finding the corresponding rms overlap, using Figure 3.2. 0.80 f; 0.75as U(.) 0.70- 0.65- CD 0.60- 2 0.550.50c 0.450.40 0.0 0.2^0.4^0.6^0.8 of visibles (R/N) ^OvlAtCap^ratio 1.0^1.2 1982 SEP 28 Figure 6.10: Ideal information capacity at maximum loading, as a function of the ratio of visibles. The data points were derived from Figures 3.11 and 6.9, as described in the text. The line was drawn freehand as a visual aid only. Chapter 6. The Information Content of Near-Orthogonal Memories^ 128 0.25 information in random memories ca 0.20 ideal / information capacity c 0.15- o . 8 8 0.10E information content 0.05- 0.00 0.0 02 zap 0:4^0:6^08 ratio of visibles (R/N) 1.0^1.2 1992 SEP 28 Figure 6.11: Comparison of ideal information capacity and actual information content, at maximum loading. The lowest plot is a copy of Figure 3.12, and shows the total information actually stored in an EH network at maximum loading a,. The center plot shows the ideal information capacity of a set of a, memories with rms overlap equal to the rms overlap of the EH network memories at maximum loading. The top plot shows, for reference, the information capacity of a set of a c N completely random memories. All information capacities are divided by N 2 , so that the plotted value is equal to 11 the number of data bits that can be stored for every neuron in the network. The lower two lines were drawn freehand as visual aids only. Chapter 7 Conclusions and Discussion The following sections review the conclusions that have been reached in this research and propose a number of topics which may be pursued in further research. 7.1 Conclusions All of the research goals listed in section 1.2 have been met. An "Extended Hopfield" or "EH" network has been described which adds hidden neurons to the Hopfield network. The performance of the EH network was compared to the Hopfield network, and it was shown that: • 45% more information can be stored in an EH network with hidden neurons, compared to a Hopfield network with the same total number of neurons (Figure 3.12). • for a fixed total number of neurons, up to twice as many memories can be stored and recalled in an EH network as in a Hopfield network (Figure 3.11). • the radius of attraction for "incomplete" prompts in an EH network, even with no hidden neurons, far exceeds that of a Hopfield network (Figure 3.7). • when recovering a memory from a "noisy" prompt, a greater percentage of visible bits can be initially wrong in an EH network than in a Hopfield network (Figure 3.16).^. 129 Chapter 7. Conclusions and Discussion^ 130 • the EH network is able to store the XOR set of associations, something that a Hopfield network cannot do (Figure 3.17). • the EH network is able to efficiently store and recall sets of memories with variable lengths. For sets of correlated memories, the capacity of the EH network was found to deteriorate just as rapidly as the Hopfield network as the correlation increased, as shown in Figure 3.13. The EH network was able to store more memories than the Hopfield network, but the performance of both networks was severely degraded compared to their performance with random memories. "Analytical control" of the EH network was achieved on a number of fronts. Most importantly, an expression for the free-energy of a Hopfield network with a near-orthogonal memory set was derived (Equation 4.49), and an associated set of mean-field equations was found (Equations 4.50, 4.51, and 4.52). The mean-field equations were then used to predict the memory capacity of a Hopfield network as a function of the degree of orthogonality of the memory set (Figure 5.3). The success of these predictions is an encouraging indication that the mean-field equations are correct. A mathematical description of the average performance of the roll-up process was also derived, resulting in a pair of coupled differential equations (Equations 5.15 and 5.20). These equations were used to predict how well new memories can be orthogonalized as a function of the memory loading and the number of hidden neurons. The analytical predictions matched the observed overlaps very accurately, as shown in Figure 5.8. Analytical tools were developed to describe the information content of near-orthogonal memory sets. The result is Equation 6.31, which describes the average amount of information per neuron as a function of the number of neurons, the number of memories, and the degree of orthogonality of the memory set. The information capacity of an EH network was compared to the ideal information capacity for a set of memories with the same degree of orthogonality. Chapter 7. Conclusions and Discussion ^ 131 The results, shown in Figure 6.11, indicate that the EH network is able to store approximately two thirds of the theoretical maximum amount of information when one third of the neurons are hidden. The final goal, to predict the memory capacity of the EH network as a function of the fraction of hidden neurons, was partially achieved in Section 5.3, as summarized in Figure 5.11. The analytical predictions were poor for networks made up of mostly hidden neurons, because it ignored the problems associated with recalling a memory from a small number of visible prompt bits. The analytical predictions were good, however, for networks where more than half the neurons were visible. Furthermore, the analysis showed that the improved performance observed in networks of 100 neurons, will be observed in networks of all sizes and will be independent of network size. 7.2 Further Research The results of the present research suggest many possible topics for further work. There are, for example, many aspects of the behaviour of EH networks which haven't been explored. The most important, as highlighted in Section 5.3, is to analyse how the radius of attraction of EH networks varies with the memory loading and with the orthogonality of the memory set. One possible approach would be to clamp the "visible" neurons, represented by they (j = 1, R) in Equation 4.17, at varying degrees of correlation with the memory being recalled, in order to simulate a recall process which begins far from the target memory. It might then be possible, by solving the mean field equations as in Section 5.1, to determine the memory capacity as a function of the radius of attraction. Chapter 7. Conclusions and Discussion^ 132 Other research topics which would help define the behaviour of EH networks are: • An evaluation of how robust the network is to synaptic damage. • An analysis of how the EH network behaves at non-zero temperatures • An investigation into whether time delay synapses can be introduced to an EH network • A further analysis of why the tie-breaking heuristic works so well for the XOR set of associations, as discussed in Section 3.4.5, and whether it may be of use for more general sets of memories. There are also a number of topics suggested by this research which do not directly involve the EH network architecture. One of the most interesting would be to investigate how the roll-up process, and hidden neurons, can be used to advantage with correlated sets of memories. It was shown in Section 3.4.3 that the EH network has difficulties with correlated memories, in much the same way that Hopfield's network does. Amit et al [5], have described a version of Hopfield's network which is tolerant of correlated memory sets. By rigidly constraining all network states to match the correlation in the memory set, they achieved a network with a memory capacity for correlated memories that exceeds Hopfield's capacity for random memories. A significant problem with their solution is that it imposes such tight constraints on the memory sets. A network with hidden neurons might be able to achieve the rigid constraints on the memory set while still allowing a broad range of memories to be stored. Perhaps the most valuable result of this research is that it has demonstrated how a neural network can learn new memories, taking advantage of the accumulated experience of all the old memories, but without explicitly reviewing each old memory. This may be an important step forward — the energy landscape is not used just for memory recall, but is also used to optimize its own shape as that shape develops. Further research may find more general applications of this principle to a broader range of network architectures. Bibliography [1] Daniel J. Amit. Modeling Brain Function, The world of attractor neural networks. Cambridge University Press, 1989. [2] Daniel J. Amit. Neural network counting chimes. Proceedings of the National Academy of Science, USA, 85:2141, 1988. [3] Daniel J. Amit and Hanoch Gutfreund. Spin-glass models of neural networks. Physical Review A, 32(2):1007-1018, August 1985. [4] Daniel J. Amit and Hanoch Gutfreund. Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters, 55(14):1530-1533, September 1985. [5] Daniel J. Amit, Hanoch Gutfreund, and H. Sompolinsky. Information storage in neural networks with low levels of activity. Physical Review A, 35(5):2293-2303, March 1987. [6] E.B. Baum, J. Moody, and F. Wilczek. Internal representations for associative memory. Biological Cybernetics, 59:217-228, 1988. [7] T.H. Brown, A.H. Ganong, E.W. Kariss, and C.L.Keenan. Hebbian synapses - computations and biophysical mechanisms. Annual Review of Neuroscience, 12:, 1989. [8] Thomas Brown. Neuronal plasticity and learning. In IJCNN-91 Tutorial Notes, IEEE, Seattle, 1991. (bound separately from other tutorial notes). [9] Philip A. Chou. The capacity of the Kanerva associative memory. IEEE Transactions on Information Theory, 35(2):281-298, March 1989. [10] M. Cottrell. Stability and attractivity in associative memory networks. Biol. Cyber., 58:129-139, 1988. [11] Michael R. Davenport and Shawn P. Day. Chaotic signal emulation using a recurrent time delay neural network. In Neural networks for signal processing 2 - proceedings of the 1992 IEEE workshop, 1992. [12] Michael R. Davenport and Geoffrey W. Hoffmann. Using hidden neurons with Hebbian learning in a bi-state Hopfield neural network. In Proceeding of the third annual conference on advances in communication and control systems, IEEE, Victoria, B.C., October 1991. [13] M.R. Davenport and G.W. Hoffmann. A recurrent neural network using tri-state hidden neurons to orthogonalize the memory space. Intl. J. Neural Systems, 1(2):133-141, 1989. 133 Bibliography^ 134 [14] Shawn P. Day and Michael It. Davenport. Continuous-time temporal back-propagation with adaptable time delays. I.E.E.E. Transactions on Neural Networks, 4(2):348-354, March 1993. [15] Shawn P. Day and Michael R. Davenport. Dispersive networks for nonlinear adaptive filtering. In Neural networks for signal processing 2 - proceedings of the 1992 IEEE workshop, 1992. [16] S.F. Edwards and P.W. Anderson. Theory of spin glasses. J. Phys, F5:965-974, 1975. [17] J.D. Farmer and J.J. Sidorowich. Predicting chaotic time series. Physical Review Letters, 59:845-848, 1987. [18] R.J. Glauber. Time dependent statistics of the Ising model. Journal of Mathematical Physics, 4:294, 1963. [19] D.O. Hebb. The organization of behaviour. Wiley, 1949. [20] John Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of Neural Computation. Santa Fe Institute Studies in the Sciences of Complexity, Addison Wesley, 1991. [21] A. Herz, B. Sulzer, R. Kuhn, and J.L. van Hemmen. Hebbian learning reconsidered: representation of static and dynamic objects in associative neural nets. Biological Cybernetics, 60:457-467, June 1989. [22] G.E. Hinton and T.J. Sejnowski. Learning and relearning in Boltzmann machines. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing Explorations in the Microstructure of Cognition, pages 282-317, MIT Press, 1986. [23] G.E. Hinton and T.J. Sejnowski. Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 448-453, IEEE, Washington, D.C., 1983. [24] A.L. Hodgkin, A.F. Huxley, and B. Katz. Currents carried by sodium and potassium ions through the membrane of the giant axon of loligo. J. Physiology, 116:463, 1952. [25] J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. (USA), 79:2554 - 2558, 1982. [26] E. Ising. Bertrag zur Theorie des Ferromagnetismus. Zeitschrift Fur Physik, 31:253, 1925. [27] P. Kanerva. Self-propagating search: A unified theory of memory. Technical Report CSLI84-7, Stanford Center of Language and Information, Stanford, California, March 1984. [28] A.I. Khinchin. Mathematical Foundations of Information Theory. Dover, 1957. [29] S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi. Optimization by simulated annealing. Science, 220:671, 1983. Bibliography^ 135 [30] Scott Kirkpatrick and David Sherrington. Infinite-ranged models of spin-glasses. Physical Review B, 17(11):4384-4403, 1978. [31] Charles Kittel. Thermal Physics. J. Wiley and Sons, 1969. [32] David Kleinfeld. Sequential state generation by model neural networks. Proc. Natl. Acad. Sci (USA), 83:9469, 1986. [33] T. Kohonen. Self-Organization and Associative Memory. Springer Verlag, Berlin, 1984. [34] S.Y. Kung, F. Fallside, J.Aa. Sorenson, and C.A. Kamm, editors. Neural networks for signal processing II - Proceedings of the 1992 IEEE workshop, Princeton, NJ, 1992. [35] Benny Lautrup. Lecture notes on the theory of the Hopfield model. Technical Report NBIHE-88-06, Niels Bohr Institute, January 1988. [36] Richard P. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, April 1987. [37] Richard P. Lippmann. Review of neural networks for speech recognition. In Alex Waibel and Kai-Fu Lee, editors, Readings in Speech Recognition, pages 374-392, Morgan Kaufmann, San Mateo, 1990. [38] M. Mezard, G. Parisi, and M.A. Virasoro. Spin Glass Theory and Beyond. World Scientific, 1987. [39] W.T. Miller III and R.S. Sutton andP.J. Werbos. Neural Networks for Control. MIT Press, Cambridge MA, 1991. [40] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969. [41] K.S. Narendra and K. Parthasarathy. Gradient methods for the optimization of dynamical systems containing neural networks. I.E.E.E. Transactions on Neural Networks, 2:252-262, March 1991. [42] John Nolte. The Human Brain. C.V. Mosby, St. Louis, 1988. [43] Conrado J. Perez Vicente and Daniel J. Amit. Optimized network for sparsely coded patterns. J. Physics A: Math., 22:559-569, 1989. [44] Fernando J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19):2229, 1987. [45] David E. Rumelhart, Geoffrey E. Hinton, and R.J.Williams. Learning representations by back-propogating errors. Nature, 323:533 - 536, October 1986. [46] David E. Rumelhart and James L. McClelland. Parallel Distributing Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press, 1986. Bibliography^ 136 [47] T. Sejnowski and C.R. Rosenberg. NETtalk: a parallel network that learns to read aloud. Technical Report JHU/EECS-86/01, Johns Hopkins University, 1986. [48] T.J. Sejnowski and P.S. Churchland. Brain and cognition. In M.I. Posner, editor, Foundations of Cognitive Science, MIT Press, Cambridge, MA, 1990. [49] C.E. Shannon. The mathematical theory of communication. Bell Systems Technical Journal, 27:379-423, 1948. [50] David Sherrington and Scott Kirkpatrick. Solvable model of a spin-glass. Physical Review Letters, 35:1792-1796, 1975. [51] H. Sompolinsky and I. Kanter. Temporal association in asymmetric neural networks. Phys. Rev. Lett., 57:2861, December 1986. [52] D.J. Willshaw, O.P. Buneman, and H.C. Longuet-Higgins. Non-holographic associative memory. Nature, 222:960-962, June 1969. Appendix A Calculations from Chapter 2 The following sections provide detailed derivations of results which are used to generate nearorthogonal memories in Section 2.3.1. A.1 The Canonical Set {r} of Exactly Orthogonal Memories This section defines a method for generating a "canonical" set of N-bit long memories, each of which is exactly orthogonal to all the others. This set is given the symbol to the number of neurons in the network. The pth memory in ith neuron is r(N), where N refers r(N) is r(N)P, and the state of its rrti. When there is no ambiguity about the number of neurons in the network, this is generally written rtii. This canonical set is used to generate near-orthogonal memory sets, as in Equation 2.12, and is used in the derivation which begins at Equation 6.8. The symbols BP and L!:` are used to represent the block width and level number in the definition of r. They do not correspond to symbols used elsewhere, except in the derivation immediately following Equation 6.8. The structure chosen for r(N)p is illustrated in table A.1 and defined recursively as follows: r?^+1^for all i 137 138 Appendix A. Calculations from Chapter 2^ 11(32)P p 0 ++++++++++++++++++++++++++++++++ 1 ++++++++++++++++ ^ 2 ++++++++ ^ ++++++++ ^ 3 ++++++++ ^++++++++ 4 ++++----++++----++++ ^++++---5 ++++----++++ ^++++----++++ 6 ++++ ^ ++++++++ ^ ++++ 7 ++++ ^++++----++++++++---..---...—.—,..,—...--„--.......„--/ 8 9 block 0^block 1^block 2^block 3 ++--++--++ ^++--++--++--++--++-++--++--++ ^++----++--++--++--++ Lo Bo 1 2 2 4 4 4 4 32 16 16 8 8 8 8 8 8 4 4 }level 1 } level 2 1 level 4 Table A.1: Listing of the first ten "canonical" orthogonal memories r(32)P, for a 32-neuron network. Li` is the level number, and BP is the block width for the pth memory. The memories are numbered starting at zero. +1^if i < N/2 if i > N/2 —1 = 2trunc[log2(P)) N^ (A.1) for all p for all /2 Lµ/2 Lµ/2 ^mod /V] if Lµ= µ if LI` p where "trunc[x]" means the largest integer less than or equal to x. Table A.1 shows how each memory rP can be assigned a "level" LP, and a "block width" B. Blocks are numbered from 0 through (VA — 1), as illustrated for 1( 32 ) 7 . Appendix A. Calculations from Chapter 2 ^ 139 A.2 Mean and Variance of (ei • ep) This section calculates the mean and variance of (e el. The results are used to derive Equation 2.14. The overlaps between two different states, e and ev, may be analysed in terms of the following sets of bits: symbol definition size A fi : 41:` = rill N(1 — b) B fi : C = In N(1 — b) F (A n B) U (A n 8) N(1 — 2b + 2b 2 ) g (A n 13)U (A n B) 2Nb(1 — b) where "size" is the expected size of the set in the limit N oo. The mean overlap is expected to be zero because the states are generated with no preference for either +1 or -1. It is calculated by splitting the sum using .7 and . g: N ((EErn =^cer-FEeren i=1^ iEG =^— E ieF^iEG = N(1 — 2b^— 2Nb(1 — b)((r 1:111 = N(1 — 4b 4b 2 )((rrin^ (A.2) The orthogonality of iv' and ry ensures that Min = 0, so the mean overlap of pairs of memories is also zero. The variance of e - ev can be calculated as follows: 2 Var(es^= ((^erC) i=1 Appendix A. Calculations from Chapter 2^ = (E - (K Ertm) 140 2 2 (-2 r r , ,^ (A.3) iEg A general solution for expressions of this type is derived in Section A.3. In this case q in Equation A.9 is set to 2Nb(1 — b), the size of g, so: )2)) Var(e° • ev) = 4 (K iEg = 8Nb(1— b)[1 — 2b(1 — b)] = N [1 — (1 — 2b) 4 ] (A.4) A.3 Expectation of Partial Overlaps in Orthogonal Memories This section derives a probability distribution which is useful for calculating partial overlaps in orthogonal and near-orthogonal memory sets. The situation is equivalent to picking q balls out of a jar holding W white and N — W black balls. The goal is to calculate the probability distribution of s, the number of white balls that are picked, as a function of W and q. The following symbols are local to this section, and do not correspond to symbols used elsewhere: N = the total number of balls in the jar W = the number of white balls in the jar q = the total number of balls taken from the jar s = the number of white balls taken from the jar A,C = factors which do not depend on s B = the factor which does depend on s L = shifted version of s ^ Appendix A. Calculations from Chapter 2^ 141 The probability of choosing a set with s white balls is: s ) (n (Ng "W P(s,q,W) = t (N g ) -1 vs :^q° otherwise 0^ The terms which are of interest are those involving s, so this can be written: P(s,q,W) = A(N,W,q)B(N,W,q,^) where 1 B(N,W,q,^) =^ s! (W – s)! (q – s)! (N – W – q – s)! Notice that B is symmetric in the exchange of W and q, so B(N, x, y, s) = B(N, y, x, s). In the limit of large N, this can be approximated as a Gaussian distribution. In the special case where W = a N B the calculation proceeds as follows: , ^1 B(N, – N, q, s) = 2 (I + L)! (If – – L)! (I – L)!(2–^L)! –2) ^(A.5) = G () x G (— N 2^2 2 where L^S - - 2 1 + L)! – L)! G(x) The logarithm of the G(x) functions can be simplified following Kittel [31, pp 19-20]. Note that G(x) is an even function of L, so the derivation can assume L > 0 without any loss of generality: log G(x) = – log(x L)! – log(x – L)! L^ = – log(x!) – E L ^i) – log(x!) E ^– i 1) ^ ^ Appendix A. Calculations from Chapter 2^ 142 L -21og(x!) - E -21og(x!) - E log ( 11 +-i/x) iix log ( x +1 ) i=1^x - z i=i In the limit where x L, the logarithm can be approximated as follows: log G(x)^-21og(x!) - L 2i Ei=1 x Replacing this into Equation A.5, and introducing C and C' to represent terms which do not depend on s: ^log N q^.. [1 1 1 1^ [B(N -N q,^)] = -21ogy! - 2log(-2- - -_-)! -^4t N qi 2^ 2 i=i^q^= C - 2L(L +1)[ l + 1 N-q 1 = C - 2L(L 1) [q(NN - q).1 C - 2 (s — ) ( 2 ) ( q(NN 2 l+s is+ 1 = c.„ ( 2N ^q(N q))^ 2‘ q) ) ( 2N )^ 12 ^C' ^q(N q))^21 , _ 0] 2 Taking the exponential of both sides gives: 1^ -( s- < s >)] var(s) B(N,^q, s) = K(N , q) exp [ where K(N, q) is a normalizing constant chosen so that P(s)ds = 1, and where: (s) = 2^for W = Var(s) = q(A1 47,7 q)^for W = Appendix A. Calculations from Chapter 2 ^ The dependence on W, which was ignored by setting W to 143 -1, can now be inferred from B's symmetry in the interchange of q and W. For this symmetry to be present in the mean and variance, they must have the following values: (s) = Wq N W) Var(s) = q( N - q)W(N N3 (A.6) (A.7) The resulting expression for the probability of picking s white balls is: p(s,g,w) tii K(N,q,W)exp [ - Va74 )2 ] Vs.{ Cis< W ' q>s>q+W-N (A.8) otherwise 0 where K(N , q, W) is a normalizing constant. Figure A.1 illustrates P(s) for various values of W and q, with the normalization calculated numerically. Except in the most extreme cases, the shapes of all the curves appear to be similar to a Gaussian. The variance V(q) in the overlap of a q-bit long subset of two exactly orthogonal vectors and rt. r- can be calculated using this result. If "white balls" represent bits which are identical in the two vectors, then W = 2 and the overlap between the vectors is 2s. The variance in the overlap is therefore: Var(q) = 4(8 2 ) = q( N q) N ' (A.9) Figure A.2 shows the mean measured overlaps between orthogonal vectors for values of q ranging from 0 to N, and confirms that Equation A.9 is correct. • Appendix A. Calculations from Chapter 2^ 144 0.7 W = Total Number White Balls q = Number of Balls Picked 0.6-1 W=5 q = 60 .0 °) 0.5- W = 55 q = 95 t co 0) 0.4- 117W = 55 q = 20 0 za.). 0.3- W = 90 q = 80 W = 55 q = 60 2 0- 0.2N 0.1 r o.o 0^10 Cok& N = 100 100 20 30 40 50 60 70 80 90 s - number of white balls^1992 Oct 4 Figure A.1: Probability distributions calculated using Equation A.8, for example values of W and q. If there are N balls, and W of them are white, this is the probability that a random sample of q balls will include exactly s white balls. 0.3 a3 Q 0.25 . ot II^0.2 a) 0.15 • rn 0 0.1 _c ° 0.05 11 128 Neurons 0 f^I^1^1^1^1^11 0^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ^1 q/N = Fraction of Vectors Examined^1992 MAR 23 Figure A.2: Observed overlaps between subsets of two orthogonal memories of 128 neurons each. The horizontal scale shows what fraction of the bits were used to calculate the overlap. Each data point was generated using 1000 trials. The line is a graph of Equation A.9. Appendix B Calculations from Chapter 3 B.1 Theoretical Minimum Prompt Length This section calculates the number of bits required to uniquely determine one particular vector in a set of P randomly-generated vectors. In other words, it answers the following question: "given a set of P vectors, on average how many bits are required to uniquely specify one of those vectors?" A reasonable interpretation of this question is: "how many bits must be given before the probability drops below 0.5 that more than one vector in the set will match?" Assume that the prompt is M bits long. One prompt bit is immediately used up to distinguish between the memories and their bit-inverse images (ref Section 2.1.6). The probability that one (M - 1)-bit substring will match another is equal to 2 - W -1 ). One memory can be assumed to match the prompt, and the probability that none of the others match is: 1) 2M-1 < 0.5 (P - This simplifies to the following condition on M: M > log 2 4(P - 1).^ (B.1) This is the number of bits typically required to select one memory from among P memories. 145 Appendix C Calculations from Chapter 4 The following sections provide detailed derivations of results which are used to calculate the free energy in Section 4.2. C.1 Sum-Product Inversions Equation A common identity used in the literature [1, 35] provides a method for inverting the order of sums and products of exponentials under certain circumstances. The identity is re-stated here as a reference because it is used frequently in the main calculation. The identity can be stated as follows: E^H N E e fSli=1 aik (C.1) i=1 k EC where the vector :5 is made up of N components Si, and the possible values of each component ; make up the set C. The sum on the left side must include all possible vectors S. The sum on the right side is over all elements of the set C. Most commonly, C = {-1, +1}, and the identity can be written: EH N^N ll E e ats ll = 2 cosh(cei)^(C.2) {S} i=1^i=1 8=±1^i=1 eaisi^ 146 Appendix C. Calculations from Chapter 4^ 147 C.2 Ensemble Average of Noise Term over Orthogonal Sets This section calculates the ensemble average ((eT2 r from Equation 4.31, where: P N T2 .7,- (1 — 2b) E> µ=8+1 1=R+ 1 The result is used in Equation 4.34 to calculate the amount of noise coming from the uncondensed memories. The following symbols are local to this section, and do not correspond to symbols used elsewhere: VT = the variance of T2 VT = the variance of T2 if the r vectors are random, not orthogonal In this Appendix, and those following, the dot product refers to a sum over the undamped indices only, as in for example: E N X -Y -a- XiYi i=R-Fi The tactic for calculating ((eT2 )) r is to calculate the average over r8+1 rP with an integral over all possible values of T2. The result of that integral must then be averaged over all possible 1'1 r.. The central limit theorem implies that T2 will be distributed according to a Gaussian distribution in the {F}-space. The mean value of T2 will be zero because there is no reason why positive or negative components should be favoured in its sum. The variance VT is equal to the ensemble average of (T2 ) 2 : VT ==- (((T2)2)) = (1 - 2b) 2 E^E ^ b B7rii4r ^ pfu=8+1 ))^ rs+1...rp (C.3) 148 Appendix C. Calculations from Chapter 4^ The average of eT2 is calculated using the following Gaussian integral: 00 (( eT2 ))r = foodT2 exp [ —(T2)2 Tz] 2VT (C.4) = (( exP G VT )))rl...r. where the average over rl ...r. is weighted according to the values of mil, and 5 ) • Su. r set was just randomly selected vectors. Nowhere yet in the calculation has the orthogonal nature of r been taken Consider first, as an aside, what the value of eT2 would be if the into account, so this is a reasonable exercise. Call the random vectors P to avoid confusion. The ensemble average overt in Equation C.3 could be moved inside the summation because all terms in the sum would be independently random: VT = ( 1 — 2b) 2 E^E BP ti,v=8+1 id=R-F1 3';. r8+1 FP ... The ensemble average ((fTi; * would be zero unless the indices were identical, so: VT = ( 1 — 2b) 2 P^N E^E 0 ,v=8-F1 P = (1 — 2b) 2 Br B76 41vOii i,j=R+1 N EE (Bµ) 2 . p=s+1 i=R+1 Replacing this into Equation C.4 would give: P N (( e T2 ))r E E (Bn21. exp [ 1 (1 — 2b) 2 2 p=8+1 i=R+1 (C.5) This result is used in Equation 4.33 to check the consistency of the derivation up to this point. When each set r consists of exactly orthogonal memories, the ensemble average in Equa- tion C.3 cannot be moved inside the summations so easily. It is possible , however, to simplify Equation C.3 using the fact that for every set of orthogonal vectors {r}, there is another set {F}' whose vectors are all identical except for one, which is the bit-inverse of its counterpart. This Appendix C. Calculations from Chapter 4^ 149 is illustrated by the following two sets of exactly orthogonal 8-vectors: ++44-++--++-- <=> +—+—+--+ ■1, Because of this anti-symmetry, the only terms in Equation C.3 which will contribute to the average are those where y = 1/, so Equation C.3 reduces to: VT = ( 1 - 2b) 2 P^N E E Bptirit:T7)) p=8+1^ rs+1...rP P = (1 — 2b) 2 E (rP.BP) 2 )) ,A=8+1^ro+1...rP Note that the vectors B 8 + 1 through BP are distinct from each other — they are not, for example, the projection of a single vector onto the r0 vectors. The T vectors form a complete orthogonal set in the M-dimensional space of undamped vector segments, so any vector can be expanded as a sum of its projections along each of the r vectors. Expanding BP in this way gives: E E 1^1 ^ ru.o, ) 2 +^ (r .. B ,i ) 2 iB u )2 = M ( v=1 ,,=.+1 M^ 8 ( 11 ' 43 ') 2 = m 1 13 '1 2 (ru-B") 2 v=8 + 1^ v=i E E If nothing is known about vectors r8+1 through T N , then each term in the sum has the same expected value, so: (((r.BP) 2 )) 1 ^[ M IBP1 2 — s ro+1...rp^ — [11 I B P 1 2 - ^(ru.BP) 2 v=1 E (r.BP)2] 1 (C.6) Appendix C. Calculations from Chapter 4^ 150 The first term in this expression can be calculated directly: 1B1412 = N E (B; )2 i=R+1 N n =E > B"Bi."7 Sr Yi r (C.7) i=R+1 p,cr=1 where BPP is defined in Equation 4.33. The second term in Equation C.6 can be written as a function of the overlaps in the condensed memories, as follows: 8 E •Bµ)2^ E BPPBP- E •sP) (r- v=1^p0=1^v=1 SO (((re •Bµ) 2 )) ra+ 1....rp = n E • BPP.Bm. [msP sa p,cr=1^ — E(r- • sP)(ru • sal v=1 and P VT = (1 - 2b)2 n EE BPPB/- [m-sP -14=8+1 p,cr=1^ • — E(r- • sP)(r- • se)] v=1 Replacing this into Equation C.4 yields the following expression for ((e T2 )) r: P n (( e T2 )) r = ((exp (1 - 2b) 2 X B I4P B 14° E E ts=s+1 p,cr=1 [M SP • S ri' — E(r- • sP)(r- • sel})) v=i^ ri To complete the calculation, it must be averaged over all possible orientations of (C.8) pa r1 through r., and expressed as a function of either el ... e 8 or in t1, ...msp7 as follows: P n EE • = exp { 1 (1 - 2b) 2 BPPBP-sP Sal ((e -T3 )) (( eT2 )) r 2 0=8+1 poy=1 (C.9) 151 Appendix C. Calculations from Chapter 4^ where T 2M (1 — 3 — 2b)2 E E BooBti- E(r- • so)(r- • Scr ) (C.10) v=1 14=s+1 0,17=1 The average ((e -T3 ))^is r i ...r . calculated in Section C.3, with the following result from Equation C.16: 1 = exp -- (1 Vi) ((e-T3))ri...r.^ 2 — P^n^8 E E B".13P ° LE(mpv) 2 1 SP .5a /1=3+1 p,cr=1^=1 (C.11) When this is replaced into Equation C.9, it reduces to: P^n [ ((e T2 )) r = exp^E E (1 - 2b) 2 - (1 - Vi) 14=8+1 pfr=1^ E(m7,)21 BPP/3 1'SP•S° u=i (C.12) C.3 Average of e - T3 for Condensed Vectors This section calculates the average of e -T3 over r1 through T., where T3 is defined in Equa- tion C.10 as 1 T3 = — (1 — 2b) 2 2M E E B ".13 1". E(r- • so)(r- • Se). p=s+1 09=1^v=1 The result is used in Equation C.11 as part of the calculation of the total noise generated by the uncondensed memories. The following symbols are local to this section, and do not correspond to symbols used elsewhere: T3 = the average value of 7 13 VT = the variance of T3 in the space of A = a central factor in ((e - T3 )) ri through P. ^ 152 Appendix C. Calculations from Chapter 4^ The tactic for calculating ((e -T3 ))^is similar to that used for calculating ((e T2 )) in Section C.2. As before, the sum over all possible vectors r1^r8 is replaced with an integral over all possible values of T3 . Although the central limit theorem does not apply in this case, 7'3 is assumed to be distributed according to a Gaussian distribution in the {r}-space. This time, however, the mean value is non-zero. If the mean and variance of T3 are: T3 = OT3D r 1 . .. r e VT -= (((T3) 2 //P...r.^(M)2 then the ensemble average will take the following form: rr \ - EdT3 exp [-( 213^)2 217T^ T3] 1 T3) . = exp (- VT — (C.13) 2 The mean value of T3, P 2M n E E (1 — 2b) 2 BPPBP- E arm • sP)(r- • 5 0 )D r i...p, , p=s1-1 p,cr=1^v=1 is calculated using the average of (TP.so)(rP•sP) derived in Section C.4. In regions where m>> and V << (1- k), which covers virtually all the range of interest, Equation C.20 gives the following expression for T3: 7"; = 2 (1 - V) EE BPPBP- E(np)2 . . p=6+1 p,cr=1^v=1 sP•s-. (C.14) Equation C.23 from Section C.5 presents the following expression for the variance of T3: P^n^8 E E BPPBP- E(mpu)2 SP.5'.^(C.15) 11=8+1 p,cr=1^v=1 Var(T3) = (VI - V) Equation C.13 can now be used to calculate ((e - T3 )) (( e-T3 )) r i ... I'S = p=s+1 1 exp -- 2 1-1 ...1-S P^n^8 E E BPPBP- E(7/111 p,cr=1^v=1 )2(1 — Vt)S P • Su^(C.16) Appendix C. Calculations from Chapter 4^ 153 C.4 Average of (ro•scr)(rP•sP) over Condensed Vectors This section calculates the mean of (ro•sP)(rP.s-) in the space of all orthogonal vectors 1'1 with probabilities determined by the values of mit: and SP • 5°. The result is used to calculate the mean of T3 in Equation C.14. The following symbols are local to this section, and do not correspond to symbols used elsewhere: = the weighted sum of S P vectors Q = the value of Al, (rP • sP)(rti • sff) mo = the overlap between SP and el , M, A, . . . g = the sets defined below M = the number of elements in M A, ... G = the number of elements in each set, divided by M sym definition -1,- x avg size -I— 111,—I x variance M set of all indices 1 = -111-1-(N — R) 0 A = 1(1 + nt 11,) 4 0 en A fiEM:Sr= B { i E M : riii . eii } (B) = (1— b) b(1 — b) C fi E M : Sr = rn (0 . mil,B + 1(1 — mil) [1 — (771;‘,) 2 1 b(1 — b) 1, { i E M : ,59: = Sn D = 1(1 + -1,7 SP • 5' ) 0 F {ieCnD} (F) = CD D(1— D)C(1 — C) G fi E c n cm — 1,} G=C—F 0 Appendix C. Calculations from Chapter 4 154 The averages and variances in this table were calculated using Equations A.6 and A.7. The probability distributions of the sizes of the sets are assumed to be Gaussian, as for example: P(X) = exp (—(X (X))2 2Var(X) ^(C.17) The calculation of (r1' • so)(p • S°) begins with the observation that: , 1 Q^—(rP • sP)(rP • S°. ) M2 E r; S; + =^ E ITsr) (E 11'4' E iED^iEM-D^iED^ ( = m2 (EITSf + E iEM-D P s7) 4 rrsr)^11'4 — E r, iED^iEM-D^iED^ 2^1^2 Sr) iEM-D =^-^E rNsp =ED^— (C.18) iEM-D The size of these two terms can be calculated using the sets defined above: 1 E rfasr = F — (D — F) 2F — D :EP M iEM-D rf'sf = G — (1 — D — G) = 2C — 2F + D — 1. SO: Q = f f f dF dC dB P(F) P(C) P(B) {(2F — D) 2 — (2C — 2F + D — 1)1 dF dC dB P(F) P(C) P(B) {(8C — 4)F + (2 — 4C)D — 4C 2 + 4C — When this is integrated over F it reduces to: Q = f f dC dB P(C) P(B) {(8C — 4) (CD) + (2 — 4C)D — 4C 2 + 4C — = g dC dB P(C) P(B) (2D — 1)14C 2 — 4C + 1] . 155 Appendix C. Calculations from Chapter 4^ The second integration is over C, and yields: ^Q^I dB P(B) (2D – 1) [4(Var(C) (C) 2) – 4(C) + 1] 4 1^2 = I^ dB P(B) (2D –1) {—(1– m 2 )b(1 – b) + 4 {mB + – (1 – m) 2 ] – 4 [mB 1 – m)} + 1}. The terms proportional to -1. are negligible except when all other terms are very small, but 4 this is an important exception, so they are kept in: 4 Q = f dB P(B) (2D – 1) {— (1– m 2 )b(1 – b) + m 2 (4B 2 – 4B + 1)] 1 = I dB P(B)—(SP-S°) [ M (1 m 2 )b(1– b) + m 2 (4B 2 – 4B +1)] ^1^4 = .71-4.–(SP-P) [ 17 (1 – m 2 )b(1 – b) + m 2 (4Var(B) 4(B) 2 – 4(B) + 1)] 1^4 [(1 – m 2 )b(1 – b) + m 2 b(1 – b)} + m 2 [4(1 – b) 2 – 4(1 – b) + 111 1^4 =^(SP .5') — b(1 – b) + m 2 (1 – 2b) 2 }^ (C.19) m The average of (rP•sP)(rP•s-) is therefore: (((rP•sP)(rP•S")D r i ...r . = M(mt;,) 2 (1 – 2b) 2 SP •S`"^4b(1 – b) SP-S°^(C.20) This can be qualitatively checked for some simple cases, and with the aid of the sketch in Figure C.1. When b = 2, the F vectors are not correlated with the e vectors. In that situation the mill overlaps should be irrelevent, which is indeed true in Equation C.20. If b = 0, so that the I' vectors are exactly aligned with the e vectors, and if p = o, so the two S vectors are aligned, then Equation C.20 is equal to the square of el' • S P , which is correct. 156 Appendix C. Calculations from Chapter 4^ Figure C.1: Sketch of the vectors which make up T3, for a case where n = 2 and s = 1. The exact value of T3 is the product of the projections of So and SG' onto PP. The mean value and variance of T3 can be estimated as a function of mtl„, SP.S' , and V. C.5 Variance of T3 in the Space of Condensed Vectors This section estimates the variance of T3 in the space of all orthogonal vectors rl F8. The result is used to calculate the mean of e T3 in Equation C.15. T3, as defined in Equation C.10, is: P^n 1^ T3 E 2 (1 — 2b) 2 E M E 13 1* .13P' Dr- • so)(r- • Yr), 0=8-1-1 p,cr=1^v=1 which is a sum of products of projections of two vectors, S° and So, onto a reference vector To. Figure C.1 helps to visualize this. The variance in T3 is an estimate of how much the sum of these projection products varies away from the mean value, calculated in Section C.4, as the r sible directions. There are only (n 2 s) terms which vary with rl r8 vectors are rotated in all posand are added together inside T3. Because this number of terms is small compared to M, the variance in T3 will be of 157 Appendix C. Calculations from Chapter 4^ the same order as 2'3 for M —÷ oo, and so cannot be ignored. The exact calculation of Var(T3 ) proceeds as follows: Var(T3 ) = (((T3) 2 )> r ,... r. — (T3 ) 2 1 - 204 E BO p BOa-^B013 4M2^L—d 0,0=s+1 p,cr,cr,3=1 xE 8 (((rv.sP)(rv•s-)(rii•sa)(ru•so))) por=1 ri...r. - (T3 ) 2 (C.21) This can be simplified by expanding only terms up to n = 2, because the resulting expression will be used in the limit where n —> 0. With that simplification, there are only four different types of terms to be averaged: = ar.sP)(r.•sP)(rm •sP)(rP•sP)D o... I-% 8 Q^= Ww•sP)(ru.sP)(v•sP)(r"•5')Dri...r. Q 3vm = (((r.SP)(r - •sP)(rP•s(7 )(rP.,s- )Dri ...r. Q7 = arv•5P)( 1'•5 )(P•5P)(r"•5 )Dri...r. • 0 0 When each of these is simplified separately, the result is a series of terms of a form similar to: Var(T3)^fi(V) E^E BOp BOc r BOot B013 0,0=3+1 pfi,a,0=1 8 x E^(5P.50)^ (C.22) 1 11=1 where ni and f1(V) are constants and functions as required. If this calculation was carried out, it would result in a complicated expression in which every term includes a product of four BPP factors. Such a calculation was not done in this research. Instead, the value of Var(T3) was approximated using an expression which is very similar to Equation C.14, as described below. The result is an equation for ((e T3 )) that is much simpler than the full calculation would have been, and is accurate in the domain of interest. 158 Appendix C. Calculations from Chapter 4^ The expression for the variance is derived by first establishing its values at the limits V = 0 and V = 1, and then proposing a simple form for the full range. At V = 0, the Figure C.1 is always exactly aligned with ro vector in eµ, and so has a variance of zero in the space of all allowed I' vectors. When V = 1, the (1 — 2b) factor in T3 ensures that it is always zero, so the variance is again zero. A simple expression for the functional dependence on Var(T3), which is zero in these two limits, is: Var(T3) a V''(1 — Vn2) where n 1 and n 2 are positive exponents to be determined. It was found that n 1 = 0.25 and n2 = 0.75 work very well. Following the form of Equation C.14, the variance is therefore written: P^n^8 Var(T3) = (V 4 — V) EE • BPPBt- E(mp2sP Su.^(C.23) 0 =8+1 p,cr=1^m=1 C.6 Details of the Inverse Gaussian Transform The inverse Gaussian transform of F1 , as defined in Equation 4.36, is most easily calculated if there are no mixed products of the form xi-1,4. Such mixed products can be removed by doing a unitary transformation before the integration, as follows. Define an n x n matrix [K]: [K],^(Sp — 12-CPSP • Scr)^ y N (C.24) The diagonal elements of K are (1 — 7 0cP). There must exist a unitary transformation U„„ which diagonalizes K: •• E per=1 11, p 1C,U;71 =^ (C.25) ^ ^- 159 Appendix C. Calculations from Chapter 4^ where A i, are the eigenvalues of K. From the definition of K it follows that: E uppsp„u; y -E Uvp {C P S P • so] E uvp [CPSP Uay = • so] u;,yi = —(1 -^(C.26) pa=1 so U also diagonalizes [CPSP • Su]. The coordinate transformation proceeds as follows: 2a n Fi ^1^ exp - - = • E 20=6+1 -213V i liPE[CPSP • S'] I 02(110)2 27r — 20/1:--htl exp- -2 ^E E [.. > u-iu[cPsP -xtr`,...}11 -1 UKU -1 U Xa 14 • sciu-iu^- /32w? E C P S P • 5' p,cr=1 p=1^ (P— a)n = ^2 (1 2r) cpsp.scr Pta= 1 p=1 —y- > 1 n A Ppo IP i°6Per E E 2^ p=8+1 p,cr=1 eXp — — 20 ghP - E^- 02 (hp)2 E c p sp. sa p,cr=1^ (C.27) p,a=i where the transformed integration variables i are defined as: is I^U I x;;; I .^ (C.28) 160 Appendix C. Calculations from Chapter 4^ Because this transformation is unitary its Jacobian is 1, and the integral over i will be identical to the integral over x. F1 can be written more simply as: l P an = ( F1 1 2r P n exp ) EE 2 Ap (i';,) 2 µ=s+1 p=1 —2A5Kriel t7p 0. — A p Yit,; — 13 2 (h 1- ) 2 E cPs0 • cr=1 s- ,^(C.29) where Up a^U;p1. (C.30) .7=1 All mixed products of the form xf;xii: have been removed from this expression, so the inverse Gaussian transformation can now be calculated. Using Equation 4.24, the integral of F 1 is equal to: ri ( P n ,./,..9) Gi Fa I I c°^II _00 0=8+1 P=1 F1 ri 11 ^ n 2 22-^{PN(4, 0 ) 2 0p2 (1 — Ap)2 192 _ + — (11°) exp cPsP • s°} 2r p=1 A P^2A P^ 2 ^0=3+1 cr=1 n (i:L eln p = ( I. ) 2^ : . E (1^ )2^3 ^ 1 ^fiNiis,\2 1- t CP SP•ocr P^7,7 U2 det(K) exP 2^)^^ Ap^IV^ PIA-1 P,cr=1 P-1 P^ (C.31) Equation C.31 can be written more simply in terms of the matrix K. The exponent is written in terms of K using the expression derived in Section C.7, the determinant of K is moved into the exponent using the identity log(det(K)) = Tr(log(K)), ^ (C.32) and the sum of all the external field effects h are grouped under a single symbol H E E p=s-1-1 (i10)2.^ (C.33) Appendix C. Calculations from Chapter 4^ 161 In addition, following [35], P - s can be replaced by P for very large N, because P scales with N while s remains finite. The result is the following expression for G1 = exp 2 -aTr(log K)^E [K-1] - Pfln p,a=1 C.7 Simplification of Exponent in G1 This section describes how to simplify the exponent in Equation C.31, which is referred to as "X" for this derivation only: X = PiIN [ n (1 - - E u2P^ ^A A 2 ) P p=1 0 n +— E cPsP.se . N p,cr=1 The matrix U is a contraction of a unitary matrix U, and Up , = U;p1 for any unitary matrix, SO: Op =7.^= E U„. (C.34) 7 a=1 o=1 Using this, and the definition of K from Equation 4.38, X can be written: /3HN X - ^ 2 w-i)„„ - 2 + (Ad Up, + E [ n p,cr ,v=1 E^Kip, .^(C.35) P,a = 1 These are the inverse of the transformations done in Equation C.25, so: 1 E(U-1)pv — Au v=i u„ [K-l]pp- and E(U-1),9v A, U„„. = [IC] v=1 This means that X is equal to: f X = ^ [p,o=1 ir N E 2 p,cr=1 (1 — 1<) 2 4 K E - K] pa por=1 filiNn [K-l po. (C.36) 162 Appendix C. Calculations from Chapter 4^ C.8 Self-Averaging in Condensed Near-Orthogonal Memories This section demonstrates that a sum of the form N E f (d • • • en^ (C.37) is self-averaging when s < N. This conclusion is not immediately obvious in this case, because the elements of each e are generated by adding noise to a set of exactly orthogonal memories 1', rather than randomly. It will be shown, however, that this structured process for generating memories actually works in favour of self-averaging. Self-averaging is best understood if the set of s memory vectors are placed together to form e, an s x N array of bits. A sketch of such an array is shown here, with the 9th column vector, s highlighted: —++—++—+ + ++---+—++—++—++—+---+ —++- -+— . . —+—++—++ + +—++—+—++----+—+—+ ^ +--++ . . e —++—+++— + --+—++---++----++++++—+—++—+ . . •. to = •. +—++—+ —+ — —++++—++--+—++----++ ^ +++—+ . . When N >> 2 8 , every possible column vector occurs many times within the sum in expression C.37, and each time it occurs, f ... C) contributes the same amount to the sum. If the frequency of each column vector can be estimated accurately, then the sum over all bits i can be replaced by a "self-averaged" weighted sum over all possible column vectors fed [1, pg. 188]. The first step is to establish the expected frequency of each possible column vector. Let {T} be the set of all possible column vectors, and number its members from T1 to Tea. Let Oi be the number of times that vector Ti occurs in a given memory set {N. All column vectors are Appendix C. Calculations from Chapter 4^ 163 equally likely for near-orthogonal memory sets because they are generated by adding noise to a set of randomly-seeded orthogonal memory sets, so the expected frequencies of all column vectors are the same: YiDRA} = 2. The next step is to show that N/2' is an increasingly accurate estimate of cki in the thermodynamic limit. More specifically, it must be shown that the standard deviation in srki becomes insignificant compared to 4 as (N —+ oo). The expected variance in 40 .i can be written as: War(95i)He} = (( (.i)2 )) where the frequency deviation µ} 4 is: Cl?^g6i —^ (C.38) The overlap (e •e) between two memories is equal to a weighted sum of the frequency deviations: N e *ei = Eerer i=i 2 Eoi nT7 s j=1 2 8^28 = E q5i T3f 27+,i)y- Tf T! , 4-.1 3=1^j=1 =0 The r7 3 31 !" factor is +1 half the time and —1 half the time, so this sum is equivalent to a random walk of 2 8 steps, with step size equal to the standard deviation in cj. Thus the variance in is related to the variance in 4 ; by: Var(e -ev) = 28Var(4j) (eiel Appendix C. Calculations from Chapter 4^ 164 Using Equation 2.14 as the variance in e.eu, this leads to: Var(0j) = 2 [1 — (1 — 2b) 4 1 The standard deviation in Oj will therefore be proportional to ITV, and will become insignificant as N becomes very large. As b ---+ 0 and the memories become more orthogonal, the variance in cb diminishes, so near-orthogonal memories are even better self-averagers than random memories. In summary, it has been shown that self-averaging does occur when the memories are nearorthogonal, so the sum in expression C.37 can be replaced as follows: E f(61. • •c) = N ((f(771 3^3 ))T1...T2s^ i=i (C.39) C.9 Replica Symmetric Value of C1 This section derives an expression for O 11 in the case of replica symmetry, and in the limit of small n. Equation 4.41 indicates that the determinant and trace of K will be required, so these are calculated first. The matrix K for replica symmetric solutions is of the form: K = A B B -•• B A B ••^- B B A •• - . . . . (C.40) where A -a 1 — 0-yC B^—0-yCq^ (C.41) 165 Appendix C. Calculations from Chapter 4 ^ Its eigenvectors are: /1\ 1^1^\ 1^0\ 1^0\ —1 1 1 0 1 —1 1 0 —1 0 1 0 (C.42) \ 1 / ‘^0^/ \^0^/ \^0^/ and the corresponding eigenvalues are: Ai = A + (n — 1)B A2 ... An = (A — B) (C.43) It follows immediately that the determinant and trace of K are: det(K) = (A — 13) n-1 (A — B nB) Tr(K) = A — B + An (C.44) (C.45) The inverse of K is calculated as follows: 1 0 det(K) = det(K) 0 [K] up (C.46) but: a OA a OB det K = (A — B) n-2 (2A — 2B + nB) det K = —(A — B) n-2 (2B — 2A — 2Bn + An) (C.47) SO: ^2A — 2B + Bn (A — B)(A — B nB) PP 2B — 2A — 2Bn + An [ K-11 (A Ps^ — B)(A — B + nB) [K 1 1 — . (C.48) Equation 4.41 requires the following contraction: p,cr=1 n(2A — 2B + Bn) n(n — 1)(2B — 2A — 2Bn + An) . (C.49) (A - B)(A - B nB) Pa^ [K-i]^= Appendix C. Calculations from Chapter 4 ^ 166 In the limit n 0 this becomes: E n 4n + 0(n 2 ) A-B [K-1 p,cr=1 4n + 0(n 2 ) 1 - 07C(1 - q) (C.50) Using Equation C.32, the trace of the log of K can be simplified as follows: Tr log(K) = log [(A - B) 72-1 (A - B nB)] = nlog(A - B) + log (1 + n B B ^(C.51) In the limit n 0 this becomes: Tr log(K) = nlog(A - B) n B -A + 0(n 2 ) P7Cq = n flog [1 + P7C(q - 1)]^ 1 + P7C(q - 1) O(n2)^(C.52) Replacing C.50 and C.52 into Equation 4.41 yields the following expression for G1: 10}^(C.53) )3(aryCq 411) = exp Nn {-71 log [1 + 07C(q - 1)] + C(q - 1)^2 2^ 2 + 2,37 C.10 Replica Symmetric Value of O2 The value of G2 for a replica-symmetric system is calculated from Equation 4.43. In the limit n^0 the initial factor becomes unity: n(2s+n-1) lim n—r0 0 2 (ff) 2 = ( 27r 1. The sums over p and o in the exponent are straightforward. In the limit of small n, the first sum can be approximated as follows: EE n-1 n p=1 cr=p+1 rq = - n(n - 1)rq 2 1 -- nrq 2 ^ Appendix C. Calculations from Chapter 4^ 167 The resulting expression for G2 is: G2 = exp NO 1 ,--, 8 1^ , m02 + vittillo - fiaymmti ary + Oarq + 2., pr( _-_-__ 2 2^0=1 2 (C.54) C.11 Replica Symmetric Value of F2 Following Amit [1], the expression for F2 from Equation 4.45 is simplified for the replica symmetric case as follows: F2 = exp N ((log E • • • E si.±i sn=±1 2 (E _^ .21 eV' Sp^7 exp {a0 2 [r(V` SP)] )) 2 Lp=1^ --"^2 p=i^p=1 T1 ... T2 8 Equation 4.24 is now used to linearize the expression with regard to So: F2 = exp N ((log E • • . E s1=±1 S n =±1 'n f °° dz [ e (222 2^— exp — -22 -00 0-7-r exp N ((log le - 1 rn F3 ES P p=1 0=1^/ )) I I I p. ... T2 . 1.°3 dz e _z2,2 E E^e-sPF3 )) J_coN/Ti^s1=±1 sn=±1 p=1 71...T2a where: F3^(VaTZ cx0 E eTP)^ p=1 Using Equation C.2, this reduces to: dz^z2 n J^ ll Na B 2 rn^ F2 =^exp N ((log e- -2E e -SF3)) -00 ^p=1S=±1^11...T2 a _42rn^ No dz^z2 = e- 2expqlogf—e - 7 (2 cosh F3 ) n -co firr^Ti...T2 a Ajap_2_ 2^ , // °r) dz^z2 = e -- 2 exp N ^e- 7 exp En log(2 cosh F3 )])) V2r^ .. T2 a (C.55) Appendix C. Calculations from Chapter 4^ 168 In the limit n -- 0, OA = 1 + nX, and log(1 + nX) = nX, so: F2 = e - Noe_v_m^ 2 exp N ((log °° L dz^. 2 e i [ 1 + n log(2 cosh F3 )0 .‘/Ti. 0,22rn - - ^ dz -2 = e 2expllog{1+74--e T. log(2 cosh F3 _0007r^ d)) ..^dz = e 2 exp Nn ((f° ^ e— 4. log(2 cosh F3))) --00V27r 71.•. T2 a itp = exp Nn (((1o$(2 cosh F3 )))) T1 ...Tv^ where the triple brackets 0( ))) T1 ..T2 s 71..T2s ai3271 2 (C.56) represent both an ensemble average over the column vectors T1 and an average over a Gaussian window in z: 2 C() dz^ e 22 x(z ,Ti ))) (V(z,Z)))) E^ (C.57) ^f2Tr^... T2.^ . C.12 Detailed Calculation of the Mean Field Equations The mean field equations are calculated by requiring that the derivative of the free energy ((F)), Equation 4.49, is zero for a set of variables inf i4 , yµ, q, and r. The following symbols are local to this section, and do not correspond to symbols used elsewhere: A = denominator in the expression for = derivative of yo C with respect to m il: Setting the derivative with respect to Int to zero provides an equation for yo: a (-07(1— q) C' ^a7qC' 0 ((F)) =^= 0^ 814^20^A^2A (a7Cq 41)07(1 — q)C' 2A 2 7 274 — 7 11P + Pay' , where: C'^-274(1 A E 1 - Vt) - 137C(1 - q)^ (C.58) 169 Appendix C. Calculations from Chapter 4^ The equation reduces to: yo = __ 1 1 7 2 74 + afl L 'Yip`- (C.59) G3(q, m4)] where, as in Equation 4.53: G3(q, m4)^72741(Al Vt) [a - a i37C(1 - q) 2 + 40(1 - q)ii]^(C.60) Zeroing the derivative with respect to yo provides an equation for mf:: 0 0 y0 = = p ane - (((tanh (Var z^E eTu) aP 2 TP))) v=1 ((KT° tanh (Var z^te Tu )))) v i ts in h (C.61) Inserting the value of yo, this becomes: rnf: = (((T1`tanh^z +^(72m4 + -y itv - G3 (q, mt)) T" v=i ^.62) Zeroing the derivative of ((F)) with respect to r generates the mean field equation for q: a ((F))^0 Or AN/Tr ir(1 — = /3a( q - 1) + (((tanh (01/7:t7z+^ A^0=1 ro dz z e_z212 ((tanh(Az + B))) ev) z \rf.))) B —^L ooViTr This can be integrated by parts, using d/dx tanh x = sech 2 x: )3 NicW(1 - q) =^((tanh(Az B) e' 212 )) 12Tr^ ---, Avai- (((sech 2 (Az + B)))) - q = q '^ si m dz e - z 2/2 ((A sech 2 (Az B))) _00 -00 Orr 1 - (((sech 2 (Az 4- B)))) = (((tanb 2 )3 q = [far z+ ai3tyPTP]))) 14=1 8 (((tanh 2 /1 Hz + 7 2,4 yitt` - G3(q, nzt4 E( 14=1 + )) Tµ Appendix C. Calculations from Chapter 4^ 170 Finally, when the derivative of the free energy with respect to q is set to zero, the result is an equation for r: ((F)) aq r 0=— c 2A 272 ( a7Cq + 4H) 13;7* 7C (ayCq 4if) - aA 2 (C.64) Appendix D Calculations from Chapter 5 The following sections provide detailed derivations of results which are used to calculate the network capacity in Chapter 5. D.1 Limiting Behaviour of 13(1 — q) The following calculation of B p(1 - q) in the limit p 00 is based on Lautrup's derivation [35, pg. 45]. The following symbols are local to this section, and do not correspond to symbols used elsewhere: Y = the tanh expression in Equations 4.50 and 4.51 a = the factor multiplying z in the argument of Y b = the factor not multiplying z in the argument of Y Define Y as follows: Y^tanh P(az b) where 8 b { 0 ) [are 4if Bi} Tv 72 mv^ +7h^(1 v^2 (1- 171 7C Br ^ 172 Appendix D. Calculations from Chapter 5^ Note that: OY Ob^cosh2 13(az b) (D.1) = ,(3(1 — Y 2 ) The limit of /3(1 - q) is calculated as follows: lim ,3(1 - q) =^- /3-4 00^ 0-K,0 Y 2 ))) (D.2) = 0.00 limOk • db where (((Y))) can be written as: lim ((( 1')))= 0-, 00 0 — (((log(1 - 14 )k lira^ p-, 00 2$ O b 0 p Ob 1 lim - — (((log cosh d(az b)))) . (D.3) It is possible to integrate out the z-dependence of this expression in the limit of large by first eliminating the transcendental functions using the following identity: lim log cosh [3(az b) = lim [log ( 6 0(az+b) e-p(az-F0) _ log 2] /3—oo 0 -4 00^ = slim Plaz -1-bl- log 2 ] . 00 The average over z is calculated as follows: Jaz + b I^= f oom d2Z7r e— 1z 2 I dZ bl oo —00b/a dz^ i z2 faz b) e^ ^bia a re_izzi-b/a 1 - 00 ^b/a -b dz e L z2 2 ( az + a r e lz2100 Nig^—b/a dz _ 1-z2^ - -Z b r° d z re^2 2 f_co V27r^.1—b/a (D.4) ^ Appendix D. Calculations from Chapter 5^ 173 The solution of these integrals involves the error function erf(x). The following two properties of erf(x) will be useful: e -°2 dx 2 du CV 1.1 erf ( ) I k c/ 2 _ x2 o_r e '` erf(x) (D.5) Equation D.4 reduces to: Iazz bi) z 1 2- a 7 exP 2a — 2 ) —2 [( 1 erf (N/^ a)) —^erf^))1 =^7 affexp(-21 + berf ( b 2a2 V r^ .12- a) (D.6) and its derivatives are: 0 erf ( b 51,( !az +bl)z^ a) - •\/ 02 )172( laz + bI ) z = _Vex, (—b2) a 7r 1' 2a ) (D.7) Replacing the first derivative into Equation D.3 gives: olLma (( 1)) = ((01 Lino 0 en (-‘/^ a))) ((^(^ire^2^L 2 am erf^4i/B)1)) ^(D.8) 7 m +^7 (1 V4) v 1^[1 — 7C le tar _ and when the second derivative is replaced into Equation D.2 the result is: ^B =^0(1^= Ern 1 ((3 , r_2i e (—b 2 ))) 0.00 T3 ^a V 7r jcP 2a 2 lim b2 ((_. 1 Aexp 0.00^V a 7r 2a2 (D.9) ^ ^ Appendix D. Calculations from Chapter 5^ 174 D.2 Solution of Mean-field Equations for No Clamped Neurons In this section an analytical solution is found for the mean-field equations in Section 5.1.2. The equations are m = erf(y) C2 ^ (1 - CB) 2 r = (D.10) where: y^ (1 (1 - CB) 2 J tar and B = ^ (D.11) ,A7 vrr - e-12. B can be eliminated from the definition of y using Equation D.10 (1^)m 2/Try = m C2 ar. This is a quadratic equation in NicTer whose solution is far = C2 2 477t 2 (1 - Vi) 2y + 2(1 VI rid \ C2 Choosing a positive sign will give the correct value in the limit V 0, so: C^ oa. = - Vt)m Vy2C2 2m 2 (1 -^- yCi (D.12) Replacing this into Equation D.11 gives B= 2(1 - Vi)erf(y) e-Y2 I NfiC / y 2 C 2 + 2(erfy) 2 (1 -^- Cy} -1 . (D.13) The mean-field equation in r can be written: 1^1 - CB C (D.14) Appendix D. Calculations from Chapter 5^ 175 so the square of Equation D.12 is equal to: a = ^1 - CB ^( 1/y 2c2 2 (1 - Vi)erf( y) 2( erfoo. _ vt) - Cy)] . (D.15) When Equation 5.5 is replaced into Equation D.15, the result is an equation for a as an explicit function of y. D.3 Distribution of Xi Before Roll-Up This section calculates the mean and the variance of the neural alignments Xi of a randomly selected memory vector before roll-up has occurred. A definition of the neural alignments is given in Equation 5.11. The ensemble average variance, Vx is, by definition: Vx = ((X 2 )) - ((X)) 2 . The first term ((X 2 )) can be calculated as follows: 14-1 14-1 E E 771 19 7-ftv erC (( p=1 t.•=1 p—lp-1 1^N N = E E N2((E E siskexcer)) p=1 v=1^j=1 k=1 p-1 N-1 e p tl,^N EE 4N^ 1. + E E Si Sk61161:)) p=1 v=1^j=1^j=1 k$j 740 p-1 p-1 = E E 2N 2 + Efue.Pg7)) Jo= 1)2^1 (µ k^ N2 + N 2 jOi p=1 EE E6ferex)) 1, 0 P NO —1) 2 (N - 1)(p - 1) N 2 +^N2 (D.16) Appendix D. Calculations from Chapter 5^ 176 The second term can be calculated as follows: 0-1 E N = i = - v=i N ESiSier0) E 1 + E E si si em)) P-1^((P-1 N N v=1^%v=1 u=i^v=i jOi —1 ^= N (D.17) So the variance in Xi is: (N — 1)(y — 1) Vx =^a N2 (D.18) If, as assumed in Section 5.2.3, the probability distribution is Gaussian, then the probability distribution of the Xi at time t = 0 can be written as follows: 1 ,O) = ^ exp^a)21 P(X^ 2a (D.19) D.4 Effect of Inverting One Bit This section calculates an exact expression for the change in E xi when one neuron in the network is inverted, and uses it to derive an expression for the average time rate of change of X as the network rolls up. The exact expression is derived as follows. The initial states of all the neurons are written Si, and the final states are written Similarly Xi and Xi are used to represent the initial and final neural alignments, as defined in Equation 5.11. Let k be the index of the inverted neuron Sk. The new value of Xi, for (i k), is: -ki(Ok) = P-1 ^ E E sgrr - N P-1 E^S;SkEi Sk v=i jok^v=1 Appendix D. Calculations from Chapter 5^ 177 2 P-1 = X. - — SiSkCa N v=1 = Xi - 2 Zik where: sisi Zji N p-1 E (D.20) The new value of Xi, for (i = k) is: Xk —1 Px-, -1 = N 2., 2_,SkSieM + ^ u=1 j$k = -Xk + 2a The sum of Xi includes (N - 1) terms where i k, and one term where (i = k). The sum is exactly equal to: N i=i N -ki • E Xi - 2 E zi • • k - 2xk + 2a i=i^i,-£k N^N EXi - 2E Zik 2Zkk - 2Xk F 2a i=i^i=i E xi i=i 4 [Xk - a] . (D.21) This result was derived without making any assumptions about the initial distribution of Xi. It is now convenient to bring the notation for Xi back in line with what is used in Section 5.2, so that Xi becomes Xi(t+1), and Xi becomes Xi(t). It follows immediately from Equation D.21 1rE that the change in the average value of Xi in one time step is equal to: dX(t) dt =N - ^N xi(t + 1) - E Xi(t) i=1^1=1 4 7v- [a - Xk] Because this is linear in Xk, the average rate of change in X over many updates will be ^ 178 Appendix D. Calculations from Chapter 5^ proportional to the average of Xk: (D.22) -1v [a - (Xk)] . D.5 Mean Value of X; for Xi > a This section calculates the average value of all the Xis which are greater than a. The desired value is ^(X) 00 C1^dX X P(X,t) = a where P(X, t) is the Gaussian distribution given in Equation 5.14, and where C 1 is a constant which satisfies the following normalization condition: 1 = C1^ 00 a dX P(X,t)^ (D.23) Both of the above integrals are standard, so their solutions need only be briefly summarized here. The normalization constant is: V2 ^_ erf(y)] where a-X The average value of Xi is: (X) = = CI. 1 00 a CC1 dX X exp [ -(X - X) 2 ] 2Vx -X2 1 _ dX(X + X) exp [ 2Vx i cr-x Vz= CiX ^[-Vx [1 - erf(Y)] + Ci 2 ^x exp Replacing the value for C1 gives: _ (X) 2Vx e - Y 2 V r 1 - erf(y) -x2 )] ( ^m 2V^ ^c,_1- . 179 Appendix D. Calculations from Chapter 5 ^ D.6 Estimate of the Variance Vx (t) This section derives a relationship between the average neural alignment X(t), and the variance in the neural alignments, Vx(t). The derivation begins by examining the ensemble average of (X1) 2 : (((X0 2 )) = ^2 (K 1 -3 7F i P -1 E E sisisisgregfef)) ,j,k=1 pw=1 N =1^ S j Skell^ qm"))) N 2 4-4 ^p=1 j,k=1 u= 1 (D.24) The overlap m" between two memories can be used to calculate the probability that any two bits a and Er are identical, as follows: 1+ no' P(ere = Efc ) =^2 1 - m" P(a 0 = 2 The average value of the parenthesized sum in Equation D.24 can therefore be written as follows: m-Perc)) =^[m-P ( 1 + 2') (Er) + ( 1 P^t^p=1 = (( 1. — 2m" ) ( erd)) ( m lip)2 erc )) P=1 p-1 = Dm"? )) pOti^YAP = [( 4 — 2 ) (((m vP ) 2 )) vop + 1 1 a (D.25) k1 where a(0)2 + 1 Appendix D. Calculations from Chapter 5^ 180 and where O is the rms overlap between previously-installed memories. When Equation D.25 is used to replace the parenthesized factor in Equation D.24, the result is: N p-1 (((X1) 2 )) =^ SjSkq j,k=1 v=1 EE = kiX where X is the average value of Xi, as used in Equation 5.14. The variance in Xi is therefore: vx = (((xi ) 2)) ((xi ? - = kiX — X 2 = X [a(O rms ) 2 + 1 — fd - (D.26) D.7 Time Evolution of the Number of Invertible Neurons This section derives a differential equation which describes how the number of invertible neurons varies with time. The result is used in Section 5.2.4 to calculate the rms overlap of the network after roll-up. An "invertible" neuron is one which might be inverted as the network rolls up. For that to be possible, it must be unclamped, and it must have (Xi > a). The symbol B(t) represents the total number of invertible neurons at time t. Let F be the set of all neurons, clamped or unclamped, for which Xi > a. The population of .7" is: = NPT where PT is the total probability that any neuron has Xi > a. The value of PT is equal to the area under the graph of P(X, t) and to the right of a in Figure 5.4, and can be calculated as Appendix D. Calculations from Chapter 5^ 181 follows: PT = j: dX P(X ,t) 1 [1 erf( a X )] 217x (D.27) There are two methods by which neurons leave .7.: 1. they may be inverted (this happens to exactly one neuron in each time step), or 2. their neural alignments may become smaller than a due to the changes in X and V. The total number of neurons that leave P by the second method is equal to one less than the total number that leave: total loss by second method = -- (NPT) — 1. dt It is assumed that invertible neurons are just as likely to leave by the second method as any other neurons, so the reduction in B(t) by the second method is proportional to the total loss by the second method: B^[ d (NPT) 1] NPT dt loss of invertibles by second method =^ The total change in the number of invertibles is negative, and is the sum of the amounts lost by the two methods: dB^B d — dB = —1 +^ { (NPT) 11 ^ N PT dt (D.28) This expression can be made independent of the size of the network N by defining: bE t T = --N- The differential equation can then be written: ^[dPT db_ d T +1 dt : ^+ PT — (D.29)
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Self-orthogonalizing strategies for enhancing Hebbian...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Self-orthogonalizing strategies for enhancing Hebbian learning in recurrent neural networks Davenport, Michael R. 1993
pdf
Page Metadata
Item Metadata
Title | Self-orthogonalizing strategies for enhancing Hebbian learning in recurrent neural networks |
Creator |
Davenport, Michael R. |
Date Issued | 1993 |
Description | A neural network model is presented which extends Hopfield's model by adding hidden neurons. The resulting model remains fully recurrent, and still learns by prescriptive Hebbian learning, but the hidden neurons give it power and flexibility which were not available in Hopfield’s original network. The key to the success of the model is that it uses the emerging structure of its own memory space to establish a pattern in the hidden neurons such that each new memory is optimally orthogonal to all previous memories. As a result, the network actually learns a memory set which is "near-orthogonal", even though the visible components of the memories are randomly selected. The performance of the new network is evaluated both experimentally, using computer simulations, and analytically, using mathematical tools derived from the statistical mechanics of magnetic lattices. The simulation results show that, in comparison with Hopfield's original model, the new network can (a) store more memories of a given size, (b) store memories of different lengths at the same time, (c) store a greater amount of information per neuron, (d) retrieve memories from a smaller prompt, and (e) store the XOR set of associations. It is also shown that the memory recovery process developed for the new network can be used to greatly expand the radius of attraction of standard Hopfield networks for "incomplete" (as opposed to "noisy") prompts. The mathematical analysis begins by deriving an expression for the free energy of a Hopfield network when a near-orthogonal memory set has been stored. The associated mean-field equations are solved for the zero-temperature, single-recovered-memory case, yielding an equation for the memory capacity as a function of the level of orthogonality. A separate calculation derives a statistical estimate of the level of orthogonality that can be achieved by the roll-up process. When this is combined with the memory capacity-vs-orthogonality result, it yields a reasonable estimate of the memory capacity as a function of the number of hidden neurons. Finally, the theoretical maximum information content of sets of near-orthogonal memories is calculated as a function of the level of orthogonality, and is compared to the amount of information that can be stored in the new network. |
Extent | 7045459 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2008-09-10 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0085553 |
URI | http://hdl.handle.net/2429/1769 |
Degree |
Doctor of Philosophy - PhD |
Program |
Physics |
Affiliation |
Science, Faculty of Physics and Astronomy, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1993-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1993_fall_phd_davenport_michael.pdf [ 6.72MB ]
- Metadata
- JSON: 831-1.0085553.json
- JSON-LD: 831-1.0085553-ld.json
- RDF/XML (Pretty): 831-1.0085553-rdf.xml
- RDF/JSON: 831-1.0085553-rdf.json
- Turtle: 831-1.0085553-turtle.txt
- N-Triples: 831-1.0085553-rdf-ntriples.txt
- Original Record: 831-1.0085553-source.json
- Full Text
- 831-1.0085553-fulltext.txt
- Citation
- 831-1.0085553.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0085553/manifest