SELF-ORTHOGONALIZING STRATEGIES FOR ENHANCINGHEBBIAN LEARNING IN RECURRENT NEURAL NETWORKSByMichael Richard DavenportB. Sc. (Physics) University of CalgaryM. Sc. (Physics) University of British ColumbiaA THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinTHE FACULTY OF GRADUATE STUDIESDEPARTMENT OF PHYSICSWe accept this thesis as conformingto the required standardTHE UNIVERSITY OF BRITISH COLUMBIAJune 1993© Michael Richard Davenport, 1993In presenting this thesis in partial fulfilment of the requirements for an advanced degree at theUniversity of British Columbia, I agree that the Library shall make it freely available for refer-ence and study. I further agree that permission for extensive copying of this thesis for scholarlypurposes may be granted by the head of my department or by his or her representatives. Itis understood that copying or publication of this thesis for financial gain shall not be allowedwithout my written permission.Department of PhysicsThe University of British Columbia2075 Wesbrook PlaceVancouver, CanadaV6T 1W5Date:3^t 9,613AbstractA neural network model is presented which extends Hopfield's model by adding hidden neurons.The resulting model remains fully recurrent, and still learns by prescriptive Hebbian learning,but the hidden neurons give it power and flexibility which were not available in Hopfield'soriginal network. The key to the success of the model is that it uses the emerging structure ofits own memory space to establish a pattern in the hidden neurons such that each new memoryis optimally orthogonal to all previous memories. As a result, the network actually learns amemory set which is "near-orthogonal", even though the visible components of the memoriesare randomly selected.The performance of the new network is evaluated both experimentally, using computersimulations, and analytically, using mathematical tools derived from the statistical mechanicsof magnetic lattices. The simulation results show that, in comparison with Hopfield's originalmodel, the new network can (a) store more memories of a given size, (b) store memoriesof different lengths at the same time, (c) store a greater amount of information per neuron,(d) retrieve memories from a smaller prompt, and (e) store the XOR set of associations. Itis also shown that the memory recovery process developed for the new network can be usedto greatly expand the radius of attraction of standard Hopfield networks for "incomplete" (asopposed to "noisy") prompts.The mathematical analysis begins by deriving an expression for the free energy of a Hopfieldnetwork when a near-orthogonal memory set has been stored. The associated mean-field eqiia-tions are solved for the zero-temperature, single-recovered-memory case, yielding an equation forthe memory capacity as a function of the level of orthogonality. A separate calculation derivesa statistical estimate of the level of orthogonality that can be achieved by the roll-up process.When this is combined with the memory capacity-vs-orthogonality result, it yields a reasonablellestimate of the memory capacity as a function of the number of hidden neurons. Finally, thetheoretical maximum information content of sets of near-orthogonal memories is calculated asa function of the level of orthogonality, and is compared to the amount of information that canbe stored in the new network.iiiTable of ContentsAbstractList of Tables^ ixList of FiguresAcknowledgements^ xiii1 Introduction 11.1 Important Features of the EH Network ^ 41.1.1^Hebbian Learning ^ 41.1.2^Recurrence ^ 61.1.3^Hidden Neurons 71.2 Research Goals ^ 91.3 Summary 92 The Hopfield Network and Non-Random Memory Sets 122.1 Review of Hopfield Neural Networks ^ 122.1.1^Storing Memories ^ 142.1.2^Network Dynamics and Stable States ^ 152.1.3^Overlap mµ" and Orthogonality 162.1.4^Energy Landscape ^ 172.1.5^The Meaning of Energy Peaks ^ 172.1.6^False Attractors ^ 182.2 Limitations and Extensions of the Hopfield Network ^ 19iv2.3 Non-Random Memory Sets ^ 222.3.1 Generating Near-Orthogonal Memory Sets ^ 232.3.2 Generating Correlated Memory Sets ^ 242.3.3 Benefits of Near-Orthogonal Memory Sets ^ 262.4 Working Premise: All Orthogonalizing Processes are Equivalent ^ 283 The Extended Hopfield Network^ 303.1 EH Network Terminology 313.1.1 Hidden Neurons in an EH Network ^ 313.1.2 Partial Overlaps in the EH Network 323.1.3 Measuring "Stability" in an EH Network ^ 323.1.4 "Level of Correlation" of Memory Sets 333.1.5 TriState Neurons Option ^ 333.2 Storing Memories in an EH Network 343.3 Recovering Memories in an EH Network ^ 363.3.1 "Random" Option ^ 383.3.2 "TriState" Option 393.3.3 "Bi-State" Option ^ 403.3.4 Retrieval Efficiency with 10 Memories Stored ^ 423.3.5 Radii of Attraction for Random Memories 423.3.6 Radii of Attraction for Near-Orthogonal Memories ^ 453.4 Performance of the EH Network ^ 453.4.1 Memory Storage Capacity 463.4.2 Maximum Information Content ^ 483.4.3 Correlated Memory Sets 493.4.4 Radii of Attraction ^ 513.4.5 Storing the XOR Memory Set ^ 543.4.6 Flexible-Length Memories 584 The Free-Energy of a Network with Near-Orthogonal Memories4.1^Magnetic Lattice Analog for Neural Networks ^60624.1.1 Zero Temperature Ising Model ^ 644.1.2 Non-Zero Temperature in Neural Models ^ 654.1.3 Justification for Using Thermodynamic Methods ^ 674.1.4 Quenched Random Memory States ^ 694.2 Ensemble Average of the Free Energy 694.2.1 The Energy per Neuron ^ 704.2.2 Partition Function for n Replicas ^ 724.2.3 Calculation of Average Over the Uncondensed Memories ^ 744.2.4 Inverse Gaussian Transforms ^ 774.2.5 Average Over Condensed Memories ^ 784.2.6 Saddle Point Calculation ^ 804.2.7 Free Energy of the Replica Symmetric System ^ 814.3 Summary of the Free-Energy Calculation ^ 845 Analytical Estimate of Network Storage Capacity 865.1 Hopfield Network Capacity for Near-Orthogonal Memories ^ 865.1.1 Mean Field Equations at Zero Temperature 875.1.2 Memory Capacity with No Clamped Neurons ^ 875.2 Statistical Estimate of Overlap After Roll-Up ^ 915.2.1 Notation^ 925.2.2 Overlap as a Function of Neural Alignments^ 935.2.3 Probability Distribution of the Neural Alignments ^ 945.2.4 Dynamics of the Probability Distribution ^ 945.2.5 Numerical Calculations of Average Overlaps 985.2.6 Average Overlap of All Memories ^ 1005.3 EH Network Storage Capacity^ 101vi6 The Information Content of Near-Orthogonal Memories^ 1056.1 Information Content of Exactly Orthogonal Memories 1056.1.1 Analytical Calculation of Information Content ^ 1066.1.2 Brute Force Calculation of Information Content 1106.1.3 Calculated Information Content of Exactly Orthogonal Memories ^ 1136.2 Information Content in Near-Orthogonal Memories ^ 1166.3 Ideal Information Capacity vs Information Content in EH Networks^ 1226.3.1 Limits on the Orthogonalization Process ^ 1236.3.2 Information Content Near Capacity 1257 Conclusions and Discussion^ 1297.1 Conclusions ^ 1297.2 Further Research 131Bibliography^ 133A Calculations from Chapter 2^ 137A.1 The Canonical Set {F} of Exactly Orthogonal Memories ^ 137A.2 Mean and Variance of (es' • el ^ 139A.3 Expectation of Partial Overlaps in Orthogonal Memories ^ 140B Calculations from Chapter 3^ 145B.1 Theoretical Minimum Prompt Length ^ 145C Calculations from Chapter 4^ 146C.1 Sum-Product Inversions Equation ^ 146C.2 Ensemble Average of Noise Term over Orthogonal Sets ^ 147C.3 Average of e -T3 for Condensed Vectors ^ 151C.4 Average of (rP•so. )(rP.SP) over Condensed Vectors ^ 153C.5 Variance of T3 in the Space of Condensed Vectors 156viiC.6 Details of the Inverse Gaussian Transform ^ 158C.7 Simplification of Exponent in G1 ^ 161C.8 Self-Averaging in Condensed Near-Orthogonal Memories ^ 162C.9 Replica Symmetric Value of G1 ^ 164C.10 Replica Symmetric Value of G2 166C.11 Replica Symmetric Value of F2 ^ 167C.12 Detailed Calculation of the Mean Field Equations ^ 168D Calculations from Chapter 5^ 171D.1 Limiting Behaviour of 13(1 — q) 171D.2 Solution of Mean-field Equations for No Clamped Neurons ^ 174D.3 Distribution of Xi Before Roll-Up ^ 175D.4 Effect of Inverting One Bit 176D.5 Mean Value of X1 for Xi > a ^ 178D.6 Estimate of the Variance Vx(t) 179D.7 Time Evolution of the Number of Invertible Neurons ^ 180viiiList of Tables1.1 Brain features cross-referenced to neural network models. ^ 3A.1 The first ten "canonical" orthogonal memories ^ 138ixList of Figures1.1 Relative scales of components of the nervous system^ 22.1 Diagram of a simple Hopfield network^ 132.2 Mean overlaps between near-orthogonal memories^ 252.3 Memory capacity of a Hopfield network vs orthogonality of the memory set. . ^ 272.4 Radius of attraction of a Hopfield network vs orthogonality of the memory set. ^ 283.1 Schematic representation of the EH network memory storage process. ^ 353.2 Degree of orthogonality achieved by the tri-state memory storage process, as afunction of memory loading^ 363.3 Degree of orthogonality achieved by the bi-state memory storage process, as afunction of memory loading^ 373.4 Degree of orthogonality achieved by the memory storage process using tri-stateneurons, as a function of the number of hidden neurons^ 373.5 Schematic representation of the tri-state EH network memory retrieval process. ^ 393.6 Schematic representation of the bi-state EH network memory retrieval process. ^ 413.7 Comparison of three ways to handle "don't know" bits during memory recovery. 433.8 Comparison of radii of attraction for tri-state and random memory recoveryprocedures. ^ 443.9 Radius of attraction as a function of memory loading, for near-orthogonal mem-ory sets^ 453.10 Comparison of EH network capacity with Hopfield network capacity^ 473.11 Network capacity as a function of the number of visible neurons. 483.12 Information capacity of an EH network, as a function of the number of visibleneurons used^ 503.13 Storage capacity of an EH network for correlated memory sets. ^ 513.14 Type-1 radii of attraction of EH networks^ 533.15 Type-1 radii of attraction of EH networks as a function of how near the networkis to capacity. ^ 543.16 Type-2 radii of attraction of EH networks. ^ 553.17 Retrieval rates for the XOR set of associations with varying numbers of hiddenneurons. ^ 584.1 Conceptual sketch comparing energy landscapes to free energy landscapes^ 614.2 Sketches of three phases of a magnetic lattice^ 634.3 Sketches of the energy surface around ferromagnetic and spin-glass states . . . ^ 644.4 Scaling factor CP as a function of mpv and V. ^ 775.1 Plots of Equation 5.7 showing a vs y and m vs y 895.2 Plots of Equation 5.7 showing m vs a ^ 905.3 Comparison of predicted capacity to measured capacity. ^ 915.4 Sketch of the initial distribution of Xi for a = 0.2. 955.5 Sketch of the evolution of the distribution of Xi. ^ 965.6 Numerical solutions for the time evolution of the mean neural alignment X. . ^ 975.7 Illustration of how to calculate the rms overlap from X(r) and b(r) ^ 995.8 Predicted and observed overlap of the most recent memory to all previous memories.1005.9 Predicted and observed overlaps of all memories, as a function of the number ofmemories stored. ^ 1015.10 Graphical estimate of EH network capacity for "whole-vector" stability. ^ 1035.11 Estimate of the true EH network capacity^ 1046.1 Information content of the ( iu + 1)th orthogonal memory. ^ 114xi6.2 Total information content of orthogonal memory sets. ^ 1166.3 Information in the most recent near-orthogonal memory for k = 1.0. ^ 1216.4 Information in the most recent near-orthogonal memory for k = 0.88. ^ 1216.5 Average information per neuron in near-orthogonal memory set, using k = 1.0'. ^ 1226.6 Average information per neuron in near-orthogonal memory set, using k = 0.88^ 1236.7 Ideal information capacity as a function of the memory loading, compared to theinformation content of an EH memory^ 1246.8 Comparison of the theoretical minimum overlap, to the overlap achievable by theroll-up process. ^ 1256.9 Orthogonality of an EH network at capacity. ^ 1276.10 Ideal information capacity at maximum loading, as a function of the ratio ofvisibles ^ 1276.11 Comparison of ideal information capacity and actual information content, atmaximum loading. ^ 128A.1 Probability distributions for partially sampled sets^ 144A.2 Overlaps between subsets of orthogonal memories 144C.1 Sketch of the vectors which make up T3^ 156xiiAcknowledgementsI am grateful to a number of people who contributed to my research and to the productionof this thesis. By far the greatest contributer has been my research supervisor, ProfessorGeoffrey Hoffmann. Professor Hoffmann was the first to recognize the practical uses for peaksin the energy landscape, and many of the other concepts which are developed here were eithersuggested by him, or grew out of one of our frequent discussions. Thanks also to the othermembers of my advisory committee, Professors Gordon Semenoff, Luis deSobrino, and DanCamporese, for reading the thesis and making many helpful suggestions. Professor Semenoffwas especially helpful in providing direction for the mathematical derivations in Chapters 4and 5, and in being available for discussions about the statistical mechanics of neural networks.I am also grateful to Professor Geoffrey Hinton for reading the thesis and making a numberof helpful suggestions which have been integrated into the text. Thanks also to my colleagueShawn Day for comments on the text and for the loan of his computer, and to my family fortheir tolerance and corporeal support.The research was funded by the Natural Sciences and Engineering Research Council ofCanada, and the Science Council of British Columbia. Computer equipment and other financialsupport were provided by MPR Teltech Ltd., of Burnaby B.C.Chapter 1IntroductionThe problem of gaining a comprehensive mathematical understanding of brain function is un-deniably immense. The human brain is an extremely complex system, made up of at least1011 neurons [42, pg 1], of which there are at least scores, and probably hundreds of differenttypes [48, pg 318]. Every neuron connects to approximately 10 4 other neurons so there areon the order of 10 15 synapses. The complexity, moreover, exists at all scales — inter-neuralconnections are no less complex than the connections between cortical columns or betweenlarger functional areas. As a final obstacle to mathematical analysis, the dynamical equa-tions for components of the brain tend to be both non-linear and time-delayed [24, 7]. Facedwith such daunting obstacles, the research community has, in the time-honoured way, dividedits resources between examining exact models of individual elements (neuro-physiology), morecoarse-grained models of larger subsystems (neural networks), and behavioural models of thecomplete system (psychology), as illustrated in Figure 1.1.The central role of neural network research is to discover how high-level behaviour such asmemory storage and recovery, pattern recognition and, ultimately, intelligence can "emerge"from relatively simple interactions between neurons. The aesthetics of neural network researchare therefore closely tied to this concept of "emergent" behaviour. A "good" network is onewhich is made of simple neural components, assembled and trained according to a simpleprinciple, but which exhibits a complex computational ability. By contrast, a network is not as"good", even if it has the same computational ability, if its ability is only a result of intelligent1Chapter 1. Introduction^ 2^1 m^I Nervous System I ^10 cm^I^Subsystems^II ^1 cm^I^Maps^II ^1 mm^I^Networks^II^100 tun^1^Neurons^II ^1 pm^I^Synapses^II 1 A^I^Molecules^IFigure 1.1: Relative scales of components of the nervous system. Neural network research ex-amines structures which are typically between lmm and 10cm in size. (Based on Sejnowski [48,pg 305])interference from a human designer or brute force optimization of its parameters for the specifiedtask. The behaviour of the latter network is better described as "engineered" rather than"emergent".For many researchers, a second aesthetic is that the model incorporate as many aspects ofthe physiology of the brain as possible. There is some disagreement on this issue, because manyneural network researchers are solely interested in the development of alternative computingdevices. If a network does interesting things, what does it matter if it resembles the brain or not?The counter-argument to this is that the brain is a phenomenally successful computing device,capable of many functions which are far beyond the capability of current artificial machines.Surely there is a greater chance of reproducing these abilities in a machine which more closelyresembles a real brain?With the second aesthetic in mind, it is useful to enumerate some of the salient "designfeatures" of the brain which might be important to its higher functions. A selected list is:Chapter 1. Introduction^ 31. Hebbian Learning: learning occurs at each synapse according to a scheme first sug-gested by Hebb, as discussed below.2. Recurrence: the output of almost every neuron indirectly feeds back into that neuron.3. Hidden Neurons: virtually all neurons in the brain are neither sensory nor motorneurons.4. Low Fractional Fan-Out: the number of neurons connected to a single neuron is muchsmaller than the total number of neurons in the brain.5. Time Delays: Significant time delays are present in the interconnections between neu-rons.Table 1.1 lists some important neural network models and indicates which of these features areassociated with each.Hopfield's network model [25], which will be described in detail in Chapter 2, uses a recurrentarchitecture, and a version of Hebbian learning. Each neuron in a Hopfield network is eitherHebbian recurrent hiddens low fan-out time delaysHopfield [25] F •Extended Hopfield (EH) [13] F • •Boltzmann Machine [22] S • •Time Delay Hopfield [51, 32, 2] F • •Willshaw [52] FKanerva [27, 9] F •Back-Propagation [45] • •Pineda [44] • • •Dispersive [15] • • •Kohonen [33] • • •Table 1.1: Some features of the neurophysiology of the brain cross-referenced to neural networkmodels. Hebbian learning is categorized as fast (F) or slow (S), as discussed in Section 1.1.1.The EH network is the only one which combines recurrence, fast Hebbian learning, and hiddenneurons.Chapter 1. Introduction^ 4"on" or "off". The state of the network is determined by the "on" and "off" states of all itsneurons. A "memory" is a network state which has been "learned" by the network using alearning rule which is described below.1.1 Important Features of the Ell NetworkThe "Extended Hopfield" or "EH" network is the subject of this thesis, and is of particularinterest because it is the only one to combine the features of recurrence, fast Hebbian learning,and hidden neurons. These three features of the brain's structure are discussed separately inthe following subsections.1.1.1 Hebbian LearningThe Hebbian learning principle was first suggested by D.O. Hebb in 1949 [19], without anyexperimental support. It can be summarized as follows [46, pg 36]:When neuron A and neuron B are simultaneously excited, increase the strengthof the connection between them.Until very recently experimental tools were not available to test Hebb's conjecture, but itremained of interest because it is a simple and fast way to train a network, and all learningis done locally; there is no need for an external "teacher" to influence the changes in synapticstrength.In 1969, Willshaw et al [52] described a very simple associative memory network which usesa form of Hebbian learning. Their network consists of a set of input neurons and outputneurons connected by synapses The network is trained by presenting a series of memoryChapter 1. Introduction^ 5pairs it' and rf, , and setting the synapses as follows:1^if a7 = 1,7 = 1 for any acij =0^otherwiseAfter the network is trained, a memory a can be recalled from a prompt b as follows:ai^1= —2 [1 — sign (k — J:=1.where k is a positive constant determined by the specific memory set being stored. The capacityof Willshaw's network does not increase proportionally to N [20, pg 53], so its capacity is verylimited. The network helped establish the concept of associative memories, however, and theviability of Hebbian learning for artificial neural networks.In 1982, Hopfield (25] presented a neural network model which stored memories using alearning rule which was very similar to Hebb's, and which will be called "pseudo-Hebbian":During learning, if neuron A and neuron B are in the same state (either bothexcited or both quiescent), then increase the strength of the connection betweenthem. If neuron A and neuron B are in opposite states, reduce the strength of theconnection between them.The ability of Hopfield's model to store many memories in a single network was a truly emergentproperty, and it generated a great deal of new interest in neural networks, and Hebbian learningin particular.Some neural networks [22, 21, eg.] use an iterative form of Hebbian learning in whichthe training signal must be provided repeatedly over an extended period of time while theconnections adapt. Early tests of the Boltzmann Machine, for example, required that eachtraining item be presented 5000 times [22, pg 308] to eliminate the errors. Hopfield's version ofHebbian learning, by contrast, is prescriptive rather than iterative, and requires only a singleChapter 1. Introduction^ 6presentation of the training items. These two approaches are distinguished as "slow" and "fast"Hebbian learning, respectively, in Table 1.1. The EH network is a "fast" learner, although notas fast as Hopfield's because of time spent determining the optimum state of the hidden neurons.Since the initial surge of interest in Hopfield's paper, much research activity has swung awayfrom Hebbian learning toward iterative learning techniques such as back-propagation [46). Be-cause the back-propagation process maintains tight control of the behaviour of the network,its ability to recall memories is not really an "emergent" property. Such networks have beenvery successful, however, in industrial applications such as machinery control [39], signal pro-cessing [34], trend analysis [17, 11], speech recognition [37], and speech synthesis [47]. Recentimprovements to back-propagation have included methods for improving learning speed [41],and for incorporating time delays in the synapses [14].Hebbian learning has recently caught the interest of researchers again, as a result of impor-tant new experimental evidence. Thomas Brown's group at Yale [7, 8] have used microelectrodesto control the voltage above and below isolated synapses in the hippocampus, and have foundthat the synaptic conductance stays approximately constant until the pre- and post-synapticmembranes are stimulated simulataneously, just as Hebb had postulated. This is the first, andonly, direct observation of learning at the cellular level, and it implies that new models of thebrain will be far more likely to replicate brain-like behaviour if they incorporate some form ofHebbian learning.1.1.2 RecurrenceNeural networks can be classified as either feed-forward or recurrent. Feed-forward networksare thought to be representative of sensory inputs and motor outputs. Signals in a feedforwardnetwork pass from the input neurons, often through layers of hidden neurons, and then arriveat the output neurons without ever looping back. Once trained, the behaviour of artificialChapter 1. Introduction^ 7feedforward networks is therefore very simple. Most of their complexity lies in the methodsused to train them.In recurrent networks, signals going out from a neuron can eventually influence the inputto that same neuron. Networks of this type are thought to be representative of the cognitivecenters of the brain, where virtually all the neurons are of this class [42, pg 1]. Both the Hopfieldmodel and the EH model are fully recurrent networks, in that every neuron can be connectedto every other neuron.The behaviour of recurrent networks can be quite complex, because of non-linearities andtime delays in the signal loops. A network which begins in one state (a "state" is defined bythe on/off status of all the neurons in the network), will typically evolve through a series ofintermediate states as the neural signals loop around in the network. In both the Hopfield andthe EH model, it will eventually reach a stable "attractor" state, and stop evolving. In thebrain, the network never reaches a stable state — new thoughts are continually being created inthe mind.1.1.3 Hidden NeuronsThe overwhelming majority of neurons in the brain are "hidden", meaning that they are not incontact with the outside world. The only neurons which are "visible", rather than hidden, arethe sensory neurons and the motor neurons. In an artificial neural network, any neuron whichis not initialized, and whose output is not seen, is considered to be a hidden neuron. Experiencewith feed-forward networks has shown that hidden neurons are of central importance for mostapplications. Minsky and Papert [40], for example, proved that one of the basic elements oflogic, the XOR relationship, cannot be learned by a perceptron network with no hidden neurons.The XOR relationship can be learned by a network when hidden neurons are present or theChapter 1. Introduction^ 8output units have non-monotonic transfer functions 1 .During learning, the on/off status of hidden neurons is determined by processes internalto the brain, rather than directly by the training signal. For neural networks to use hiddenneurons, the design must include some internal procedure for determining their on/off status.Back-propagation, for example, manipulates the status of the hiddens by iteratively modifyingthe weights of the synapses leading into the hidden neurons.The biggest limitation of Hopfield's network is that it has no mechanism for including hiddenneurons. The model treats all neurons equally, as both inputs and outputs, and requires thatall neural states be defined by the trainer. This makes for a very simple and elegant network,but it imposes the following limitations on the network's abilities:• It cannot store the XOR relationship, or higher order parity.• The length of each memory must be exactly equal to the number of neurons.• The maximum number of memories which can be stored is strictly limited by the numberof bits in the memories.The EH network makes it possible to add hidden neurons to a Hopfield network withoutcompromising Hopfield's use of Hebbian learning and recurrent architectures. During training,the EH network uses the emerging structure of its own memory space to determine the optimumon/off status of each hidden neuron, as described in Chapter 3. Because that optimum statusis one where the memories are mutually orthogonal, the EH network can be thought of as a"self-orthogonalizing" network.1 The author thanks G. Hinton for pointing out that non-monotonic transfer functions in the output neuronsmake it possible to store XOR without hidden neurons.Chapter 1. Introduction^ 91.2 Research GoalsThe overall goals of this research are to better understand the structure and the behaviour ofthe brain, and to develop computing procedures which take advantage of those principles toperform novel computing tasks. The specific goals which led to the results presented here were:• to extend Hopfield's model to include hidden neurons• to demonstrate the improved performance resulting from the extension• to gain "analytical control" [5, pg. 2293] of EH networks.• to derive an analytical expression for the capacity of an EH network as a function of thenumber of hidden neurons1.3 SummaryChapters 2 through 6 describe the research that has been done in pursuit of the above goals.It can be summarized as follows:• Chapter 2 reviews Hopfield's neural network and examines its performance with non-random memory sets. The chapter begins with a brief review of the structure of Hop-field networks, and the notation used to describe them. It then introduces the idea of"near-orthogonal" memory sets, describes how they can be stochastically generated, andexamines how the memory capacity of the Hopfield network changes as the memory setsbecome more orthogonal. The final section of the chapter introduces an important work-ing assumption which will be used in subsequent derivations.• Chapter 3 defines and characterizes the new EH network. It begins by describing theEH network's memory storage procedure, and then presents measurements of how wellChapter 1. Introduction^ 10that procedure is able to orthogonalize the memory space. It describes two alternativemethods for recalling memories when specific neurons (usually "hidden" neurons) areflagged as unknowns, and examines the effectiveness of the methods for near-orthogonalmemory sets. Finally, in Section 3.4, many different aspects of the network's performanceare examined, including its memory storage capacity, its information capacity, its abilityto store correlated memory sets, and the size of its attractor basins.• Chapter 4 develops a mathematical model for a Hopfield network in which near-orthogonalmemories have been stored. The chapter begins by reviewing the analogy between latticesof Ising ferromagnets, and Hopfield-style neural networks. It then derives an expressionfor the free-energy, using the replica method and assuming replica symmetry. Finally, itderives a set of mean-field equations which are valid at the saddle point in the free energy.• Chapter 5 derives an analytical estimate of the storage capacity of the EH network as afunction of the fraction of visible neurons in the network. The chapter begins by usingthe mean-field equations from Chapter 4 to calculate the capacity of a Hopfield networkas a function of the degree of orthogonality of the memories stored. Next, it uses astatistical argument to estimate the degree of orthogonality that can be achieved by theEH network's memory storage process. Finally, it combines these two results to estimatethe memory capacity.• Chapter 6 compares the theoretical maximum information capacity of near-orthogonalmemories to the actual amount of information that can be stored in an EH network. Itbegins by calculating the information content of exactly orthogonal memory sets, usingboth a dimensional argument and a brute-force calculation. It then extends the result an-alytically to include near-orthogonal memory sets. Finally, a comparison is made betweenthe amount of information which can be stored in a memory set of a certain orthogonality,and the amount of information which is stored in an EH network with the same degreeof orthogonality.Chapter 1. Introduction^ 11• Chapter 7 gives a summary of the conclusions that have been reached in this research,and of further research that needs to be done.Appendices A through D provide details of some of the calculations used in the main bodyof the thesis.(2.1)NE sis?s 1 . 52Chapter 2The Hopfield Network and Non-Random Memory SetsBefore the EH network can be introduced, it is important to review the structure and dynamicsof Hopfield's original network, and its behaviour for sets of memories which are not randomlygenerated. Section 2.1 summarizes the design of the network and introduces the notation whichwill be used in subsequent chapters. Section 2.3 defines the concept of a "near-orthogonal"memory set, and presents simulation results which show how the memory capacity of a Hopfieldnet varies as the memory sets become less random. Section 2.4 presents an important workingassumption about the behaviour of Hopfield networks which will form the basis for mathematicalderivations in subsequent chapters.2.1 Review of Hopfield Neural NetworksA Hopfield net consists of N neurons, each of which is connected via synapses to every otherneuron, as shown in Figure 2.1. The "synaptic strengths" are in general real numbersranging from positive to negative. The output of the ith neuron is called the "state", Si of thatneuron. The Hopfield net uses bi-state neurons, each of which has only two possible states,(Si = +1) and (Si = -1). The set of N neural states together form an N-dimensional "networkstate" vector S. A dot product between any two network states can be written as follows:12CurrentStateWeights2Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 13Initial StateVFigure 2.1: Diagram of a Hopfield network with 5 neurons. The output, Si, from each neuronis multiplied by a weight Jij and summed with other signals before being fed as an input toeach of the other neurons. The initial states of the neurons are imposed by an external signal,which is removed once memory recovery begins.State vectors can be written concisely using just the signs of their components, as in thefollowing example:S3^++-+-+++--+--+++The input Ui to each neuron i is equal to the sum of the outputs of all the other neurons,weighted by a set of connection strengths Ji 1 through JiN, where N is the number of neurons:N=^ (2.2)j=1Neurons do not synapse back onto themselves, so Jii = 0. Non-zero Jii terms would not affectthe location of memories, but can influence the dynamics of the network. Large positive 4,for example, tend to stabilize states which would otherwise be unstable. Negative have theopposite effect, destabilizing some of the shallower attractor states.Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 14The notation used here is that used by Amit [1], Lautrup [35], Mezard, and Parisi [38].Hopfield used Tip and Vi rather than Ai and Si in his original paper [25].2.1.1 Storing MemoriesThe connection strengths Jij are determined by the memories that have been stored in the net-work. A "memory", just like a network state, is an N-dimensional vector ei whose componentsare (+1) or (-1). A set of P memories can be "stored" in a network by setting the strength ofeach connection as follows:P= N^ere; -^,[ where the bij ensures that the diagonal elements of J are always zero. The k term is oftennot included in the definition of Jij but, as will become clear, it does not change the dynamicsof the network, and it ensures that the connection strengths remain finite in the limit of verylarge P and N. The definition of Jib implies that each new memory can be added using thefollowing "outer product rule":r4 = .113 + Tr P.4; — (5131The application of the outer product rule is variously referred to as "storing", "inserting", or"memorizing" a new memory eµ. Note that the outer product rule is an implementation of thepseudo-Hebbian learning rule introduced in Section 1.1.1, in that if two neurons i and j are inthe same state = el, the connection Jib between them increases, otherwise it decreases.The "memory loading", a, of a network is equal to the ratio of the number of memories Pand the number of neurons N:Jib0=1(2.3)a =^ (2.4)Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 152.1.2 Network Dynamics and Stable StatesThe evolution of a Hopfield network is determined by its initial state, and by the connectionstrength Jij. In Hopfield's original network, the network evolves in the following way:1. One of the neurons, i, is selected at random .2. The input Ui for that neuron is calculated.3. The state Si is modified according to the following "forward update rule":+1 if Ui > 0Si =^^ (2.5)—1 if Ui < 04. A new neuron is selected as in Step (1).This is referred to as the "asynchronous" version of the forward update rule because eachneuron is updated independently. Evolution continues until a stable state is reached in whichall the inputs Ui have the same sign as the neural states Si. Hopfield proved that the networkwill always reach such a stable state, and that the stable states tend to be attractors. If thefinal state is equal to one of the memories e, then the network is said to have "recovered" or"recalled" that memory.Hopfield networks are "autoassociative", meaning that they tend to recover the memorywhich is most similar to their initial state. In a typical application, the network is initialized toa "prompt" state 5° , which is an incomplete or noisy version of one of the memories, and thenis allowed to evolve to one of the attractor states. Barring any problems, the final state is thememory state which most resembles the prompt state.The "radius of attraction", Re, of a memory e is a measure of how many bits can berecovered during the memory recovery process. More precisely, it is the maximum number ofChapter 2. The Hopfield Network and Non-Random Memory Sets^ 16bits that can be wrong in the prompt before the probability of recovering the correct memorybecomes less than 50%.In some situations it is preferable for the network to evolve in the following way:1. The inputs U1 through UN are all calculated.2. The states S1 through SN are all modified using the forward update rule (Equation 2.5),using the pre-calculated U1.3. A new set of inputs Ui is selected as in Step (1).This is referred to as the "synchronous" version of the forward update rule because all neuronsare updated at the same time. A network evolving with the synchronous update rule does notalways reach a stable state, but may instead enter a limit cycle in which it flips between twostates [1, pg 78].2.1.3 Overlap miw and OrthogonalityThe overlap mou between two memories e and is a measure of how similar they are to eachother. It is defined as follows:1 Nrev erN i=iOverlaps of 1, and -1 indicate that two memories are identical, or bit-inverse to each other,respectively. Two memories are said to be "orthogonal" to each other if their overlap is zero.When m appears with a single superscript, it represents the overlap between the currentstate S and one of the memories:1 NEN i=i(2.6)Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 17It will be important in the following chapters to characterize memory sets by the magnitudeof typical overlaps between the memories. The expected overlap mou between two randomlyselected memories is equivalent to the expected length of a random walk of N steps. The meanof mo' is therefore zero and the variance is ki [1, pg 175]. Overlaps for non-random memorysets are discussed in Section 2.3.2.1.4 Energy LandscapeHopfield noted that his network is "isomorphic with an Ising model" [25, pg 2556], a fact whichwill be discussed in detail in Section 4.1. This isomorphism led him to define the following"energy" E for every state S of the network:1E =^Ji• S-2^3^3:,3=1It can be easily shown [20, pg. 21], when Jii = Jji, that this energy acts as a Liapunov functionfor the update rule in Equation 2.5. It is common, therefore, to speak of an "energy landscape"in which the network moves. When the forward update rule is used, the network never rollsuphill in this landscape, and always eventually reaches a stable state at the bottom of an"energy valley". When no memories have been stored in the network, and all the J, are zero,the landscape is flat. After one memory has been installed, there are two valleys: one at e andone at el.2.1.5 The Meaning of Energy PeaksIf the valleys in the energy landscape represent stored memories and their bit-inverses, whatdo the peaks represent? If Equation 2.3 is replaced into Equation 2.7, the energy of a state Si(2.7)Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 18can be written:N [PE 1 = — E E 3 — psi; SiSi2N i,j=1 µ=11N P^P= _^ne)2^_2_ .2 pzPeak states are states where E is maximum, so they are also states where (mP)2 isminimum. In other words the peaks mark states which are, in the least-squares sense, optimallyorthogonal to all installed memories [1, pg 167]. "Optimally orthogonal" here means that thesum of the squares of the overlaps is minimum compared to all nearby states. If the states wereexactly orthogonal, the sum of the squares would of course be zero.2.1.6 False AttractorsThere are always stable attractor states which were not explicitly stored as memories. Ev-ery time a memory e tA is stored, for example, the vector eP is automatically stored as well,where C —e for all i. This follows from the fact that Jii in Equation 2.3 is unchanged whenall the Cs are negated. The vectors are referred to as "bit inverse attractors".When more than two memories have been installed, other "false" attractor states begin toappear which are not so simply related to the real memory states. The locations of many ofthese states are quite well understood analytically [1, pp 176-177]. The following "symmetric3-states", for example, will always be false attractors:Si = sign( t + Si +signw + Si —Si = sign(Si + a)Si = sign(e — 6? — 6?)si = sign(-e + Si +(2.8)Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 19etc.There are similar symmetric 5-states, 7-states, and so on.False attractors are a serious problem for practical applications of attractor neural networks.As new memories are added, the number of false attractors grows far more rapidly than thenumber of true attractors, which reduces the radii of attraction of the true memories.2.2 Limitations and Extensions of the Hopfield NetworkThere are a number of fundamental limitations of Hopfield's neural network which reduce itsusefulness in many practical applications. These include:1. The network is unable to store the XOR boolean relationship, for reasons discussed inSection 3.4.5. This fundamentally limits its usefulness as a computational device.2. The number of synapses in the network is always equal to N(N-1), where N is the numberof neurons. Because the number of synapses determines the capacity of the network, thismeans that the capacity of the network is limited by the width of the stored memories.3. The capacity of the network is proportional to the number of neurons, which is the squareroot of the number of synapses. As networks get bigger, therefore, the complexity increasesas the square of the capacity.4. The capacity of the network falls quickly when the memories become more correlated, asdiscussed in Section 3.4.3. Typical memory sets in the real world are often correlated, sothis is an important limitation.The EH network overcomes the first two of these limitations, as discussed in Chapter 3,but is not significantly better than the Hopfield network for the latter two. As discussed inChapter 2. The Hopfield Network and Non-Random Memory Sets^ 20Chapter 5, although the capacity of an EH network is larger than that of a Hopfield network,it still increases linearly with N. Similarly, Section 3.4.3 shows that the capacity of the EHnetwork falls almost as rapidly as the Hopfield network when the stored memories are correlated.Baum and Moody [6] have described a feedforward network whose structure bears some sim-ilarity to Hopfield's, and which overcomes these limitations. They call the network a "unaryAssociative and Content Addressable Memory" (Unary ACAM). It has two layers of connec-tions, with one layer of hidden neurons in between. The top set of connections, connecting theinput neurons to the hidden neurons, form a Hamming net [36]. When an input is presented,the Hamming net activates exactly one hidden neuron — the one which corresponds to the pat-tern that most resembles the input pattern. The hidden neurons are therefore "grandmother"cells [20, pg 143] because each one corresponds to one specific memory. It is straightforwardto set up connections from the grandmother cells to the output neurons which ensure that thedesired output signal is generated in response to each grandmother cell. Baum and Moodypoint out that if the grandmother cells are replaced by unity gains, and the output is fed backinto the input, their network is equivalent to Hopfield's and the training method is equivalentto the Outer Product Rule.As a computing device the unary ACAM is, arguably, a better performer than Hopfield's.Its memory capacity is exactly equal to the number of hidden neurons, and so can be extendedalmost indefinately just by adding more hidden neurons. Furthermore, memory recall doesnot degrade as the network approaches capacity. The network also works well for correlatedmemories, and has no trouble storing the XOR set of associations.As a model for human memory, however, the Unary ACAM is not very satisfactory becauseit is not trained with Hebbian learning and it is not recurrent. It is, essentially, a machinewhich has been designed to respond to specific inputs with a specific output. Its behaviour istherefore more engineered than emergent.Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 21The most significant limitation of the Unary ACAM as a computing device is its lack ofrobustness. The failure of any grandmother cell, or any synapse in the output layer, causes animmediate failure in recalling the corresponding memory. This weakness can be addressed in anumber of ways. The simplest way, as pointed out by Baum and Moody, is to add redundantgrandmother cells. If two grandmother cells are used for each memory the failure of any singlecell no longer degrades the performance of the network. Such a network is still far less robustthan a Hopfield network, where more than half the neurons can typically be disabled withoutseriously degrading the performance.Another network which is similar in structure to the Unary ACAM but somewhat morerobust, has been described by Kanerva [27, 9]. Like the Unary ACAM, Kanerva's network is afeedforward device with two layers of connections and a single layer of hidden neurons. As inthe Unary ACAM, each neuron in the hidden layer is trained to recognize a single vector input,but now the training vectors are randomly chosen rather than equal to the memory set. Thereis no "winner take all" circuitry on the hidden neurons, so more than one hidden neuron canbe activated for each input pattern. The first layer of synapses now effectively encodes eachN-bit input into an s-bit pattern, where s is the number of hidden neurons. The second layerof Kanerva's network is similar to a Hopfield network except that it has separate input andoutput neurons, and is not recurrent. The connections of the second layer are trained with theOuter Product Rule.Kanerva's network performs comparably to the Unary ACAM, and its performance can befine-tuned by adjusting its parameters. The robustness of the network, for example, can beincreased by:• Adding more hidden neurons, which adds redundancy to the network, at the price ofadded complexity; or1(tleficqOm, -a (2.9)Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 22• Decreasing the thresholds on the hidden neurons, which brings more hidden neurons intoaction for each input pattern, at the price of more noise in the input to the second layer.Like the Unary ACAM, the capacity of Kanerva's network grows in proportion to the numberof hidden neurons, and hence can be expanded indefinately by adding more hardware.Kanerva's memory, which has been described as a Hopfield network with a preprocessor [9,pg 285], is somewhat more satisfactory than the Unary ACAM as a model for biological memory.It is conceivable that a biological memory might have a hard-wired preprocessor network whichencodes input signals into larger dimensional signals; and the Hopfield-like second layer iscertainly plausible. Unlike the Hopfield model, however, Kanerva's network is not recurrent.2.3 Non-Random Memory SetsThis section focusses on how the memory capacity of a Hopfield network is affected by thelevel of correlation between the memories that are being stored. The first step in quantifyingthat relationship is to define a measure for the level of correlation in a memory set. A naturalmeasure is the root-mean-square of the overlap between all the memories e. For randommemories, the rms overlap is N/Tr, so the following "normalized rms overlap" is used:The double brackets used here are Amit's notation [1, pg 187] for the average over anensemble of memory sets, each randomly generated according to a given probability distribution.For example, if there are P memories in each memory set, and if all memories are equally likely,then the ensemble average of a quantity Gal • • .e ) is:((Geo))) E 2-PN E^.. • e ) .{e}...{e-}(2.10)Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 23The ensemble average is included in the definition of O^so that it can be directly related tothe probability distribution used to generate the memory set, rather than being unique for eachset of memories.In many of the calculations it is convenient to use the varianceV a-, (0,,,,)2^(2.11)rather than Orms . Both O, and V range from 0 for exactly orthogonal memory sets to 1 forrandom memory sets. For memory sets which are correlated, such as those discussed in [5],they are both larger than unity.2.3.1 Generating Near -Orthogonal Memory SetsA "near-orthogonal" set of memories is a group of memories in which typical overlaps aresmaller than they would be if the memories were randomly selected. One way to generate a setof near-orthogonal memories {} is to start with a set {r} of exactly orthogonal memories andthen progressively add random noise. Memories rP and I' are exactly orthogonal if and onlyif they satisfy the following equation:NEr;r;^Nev.If b is the probability that a particular bit er is not equal to rti% then the probability distributionof memory bits er is:P(eil ) = (1 — b) 8(er — rn + b 45(e; + rn,^(2.12)where 6.(a — b) bab . The value of b ranges from 0 (no noise) to 0.5 (maximum noise).In order for Equation 2.12 to be useful, it is necessary to first have a set ro of exactlyorthogonal memory vectors. Section A.1 describes how to generate one such set for any valueChapter 2. The Hopfield Network and Non-Random Memory Sets^ 24of N which is an integer power of 2. It is a "complete" set, in that the number of vectors isequal to the number of available dimensions. Complete orthogonal sets for other values of Nare not possible because of the constraint that vector elements must be either +1 or - 1.The expected overlap between a state 411 and its corresponding pi is:N41‘1;))^EP(enOT = (1 — 2b)i=1^fel(2.13)Consider now the average mean and variance of the overlaps between two different statesel' and e', as derived in Section A.2. The mean overlap is found to be zero, because there isno reason that a positive overlap should be more likely than a negative overlap. The variancein the overlap is, from Equation A.4, equal to:.r)) = N 11 - (1 - 2b)4 1from which it follows immediately that:V = 1 - (1 - 2b) 4Or1b = -2 [1 - (1 -^.(2.14)(2.15)It is a simple matter, using Equation 2.15, to generate sets of memories for which V hasa particular value. This was tested in simulations and proven to be accurate, as shown inFigure 2.2.2.3.2 Generating Correlated Memory SetsAmit [1][5] generates correlated memories by decreasing the probability of each neuron being +1.He uses a parameter a which varies from 0 (random memories) to 1 (completely correlatedChapter 2. The Hopfield Network and Non-Random Memory Sets^ 251.21 - ( 1 - 2b )4rn00.2-1N = 1280.1^0.2^0.3^0.4^0.5^0.6b = Prob of Flipping Each Memory Bit^1992 APRIL 21Figure 2.2: Measured variance in the overlaps between near-orthogonal memories, comparedwith the predicted variance. The line is a graph of Equation 2.14. The data points are fromsimulations of 100 memory sets, each with 128 neurons. The memories were generated byadding noise to exactly orthogonal memories, using Equation 2.12. The mean overlap betweenmemories was measured and was close to zero.memories), and writes the probability distribution as follows:0-F01^1P(e) = i(1 a)6( — 1) -I- —2(1 — a)(5(e + 1) (2.16)The mean activation of the neurons is a, and the average overlap between two different memoriesis NO.If a correlated memory set with zero mean activation is needed, it can be formed by makinga cosmetic change to Amit's method. After each new memory is generated using Equation 2.16,a fair coin is flipped. If the coin comes up heads, every bit in the memory is inverted, otherwisethe memory is left unchanged. This process creates memory sets with zero average activation,but with connection strengths Jii which are exactly as they were using Amit's process, becauseof the symmetry in Equation 2.3.If the mean activation is zero for Amit's correlated memories, then the variance V in theChapter 2. The Hopfield Network and Non-Random Memory Sets^ 26overlaps can be used as a measure of the degree of orthogonality, as in Equation 2.11. Thevalue of V is related to a as follows:1 N^\2=((V))^(^)i=1N N=^(E erere.'"q)N N-N-1 (E + EE eer4.'"ef)1=1^jyi [iv + (Eeer),.,.]ioi1+ a4 (N — 1) (2.17)Ora =(V — 1\IN -1)2.3.3 Benefits of Near -Orthogonal Memory SetsA memory 4- is stable if the inputs^have the same sign as the neural states er. This isequivalent to the following condition:0 <^ Vi0 < E .17=1 Jii^ vi< EpP--1^E7-1 e3 q - P^Vi0 < (N — P) N 00 ,, eerntiw^ViThe first and second terms in this last equation can be thought of as the "signal" and the "noise",respectively. The sign of the noise term is random, so memories will tend to go unstable whenthe magnitude of the noise approaches (N — P). This implies that smaller overlaps mill' willallow more memories to be stored before retrieval begins to deteriorate. The EH network isdesigned to capitalize on this fact.11.5 (2.18)Chapter 2. The Hopfield Network and Non-Random Memory Sets^ 27RandomNear-Orthogonal0.50—0.45-z 0.40-0.35-.8 0.30-0 0.25-EE 0.20-0.15-a-- 0.10-0.05-0.00—0.0Correlated^\ .... +^.......... ^................... . .....1 0^2.0^3.0Normalized rms Overlap tem FEB 25Figure 2.3: Memory capacity of a Hopfield network as a function of the level of orthogonalityin the memory set. The orthogonality measure O is defined in Equation 2.9. The memorycapacity ce, is -,;„ times the average number of memories which can be stored before one twentiethof the stored memories are no longer stable. Simulations were done on 100 different sets ofrandom memories for each data point. The network consisted of 128 neurons. The dashedoutlines show the standard deviation in memory capacities.The derivation of an analytical relationship between storage capacity and Orras is quitecomplex and will be postponed until Chapter 4, but it is appropriate in this chapter to presentsimulation results confirming that memory capacity improves with orthogonality. Figure 2.3shows that the memory capacity increases dramatically as the rms overlap decreases, as ex-pected. Also included in the figure are measurements of the network capacity for correlatedmemories, generated using the zero-mean-overlap version of Amit's process described in Sec-tion 2.3.2.In addition to the need for large memory capacities, it is also important that neural networkhave basins of attraction which are as large as possible. Cottrell [10] showed that, for any net-work which uses the outer product rule, the radius of attraction is optimal if all the memories0.501co.'50.45-0.40-P.=<.-0M=73-0.35-x° 0.30-0.25-0.2010.0Random......... .........................Near-Orthogonal11511^0.5^1 0Normalized rms Overlap151992 FEB EsChapter 2. The Hopfield Network and Non-Random Memory Sets^ 28Figure 2.4: Radius of attraction of a Hopfield network after 10 memories have been stored, asa function of the level of orthogonality in the memory set. The normalized rms overlap O isdefined in Equation 2.9. The radius of attraction was measured as for Figure 2.3. Simulationswere done on 100 different sets of random memories for each data point, using a network with128 neurons. The dashed outlines show the standard deviation in radii of attraction. Thememory capacity of the network falls below 10 just above (Onus = 1), so it did not make senseto calculate the radii of attraction above that point.are orthogonal. This suggests that the radii of attraction of the memories will improve as mem-ory sets become nearly orthogonal. Figure 2.4 confirms that this is true, but the improvementin radius of attraction is not dramatic over the range that is plotted.2.4 Working Premise: All Orthogonalizing Processes are EquivalentTwo different methods for generating near-orthogonal memory sets are investigated in thisthesis. The first, which was introduced in Section 2.3.1, is quite compatible with mathematicalanalysis, but could not be used for storing real memories. The second, which will be introducedin Chapter 3 and is the core process in the EH network, lends itself very naturally to real-worldimplementations, but would be very difficult to analyse mathematically. The mathematicalChapter 2. The Hopfield Network and Non-Random Memory Sets^ 29analysis of EH networks therefore proceeds under the following working assumption:• ansatz: the dependence of network behaviour on the orthogonality of the memory set isindependent of the manner in which that orthogonality was achieved.Chapter 3The Extended Hopfield NetworkThis chapter introduces the detailed design of the Extended Hopfield ("EH") network, andpresents the results of a number of tests of its capacity and reliability. Some of these resultshave been published in earlier papers by Hoffmann and Davenport [12, 13].The central design challenge is to find mechanisms for storing and retrieving memories, whichmake efficient use of hidden neurons. The following three points are of central importance indetermining what direction to proceed:• To add hidden neurons to a Hopfield network, some process must be used to determinethe states of the hiddens before the outer product rule installs the memory (Section 1.1.3).• More memories can be installed in a Hopfield network, and with better radii of attraction,if the installed memories are as nearly orthogonal to each other as possible (Section 2.3.3).• Peaks in the energy landscape of a Hopfield network are located at states which areoptimally orthogonal to all previously-installed memories (Section 2.1.5).When stated in this way, the following conclusion seems obvious':During memory storage, the states of the hidden neurons should be set in such away that the network moves as close as possible to an energy peak. This will ensure'This conclusion seems obvious in retrospect, but it wasn't obvious until Hoffmann pointed it out.30Chapter 3. The Extended Hopfield Network^ 31that the memory sets are as orthogonal as possible, which will improve the capacityand reliability of the network.The technical details of how this is done are described in the following sections.3.1 EH Network TerminologyThe EH network requires some added notational conventions to accomodate the presence ofhidden neurons, and the use of neurons with three possible states. These are introduced insubsections 3.1.1 through 3.1.5.3.1.1 Hidden Neurons in an EH NetworkHidden neurons in an EH network are treated identically to visible neurons wherever possible.Each is fully connected to every other neuron, for example, and each obeys the same dynamicalrules that the visibles obey. Just as in a Hopfield net, the hidden and visible neurons inan EH network behave in a manner which is completely independent of their position in thenetwork, so for notational simplicity, the first R neurons are assumed to be visible, and the lastN — R neurons are assumed to be hidden. A state vector can be written as follows:Si^+-+---+-++++-++-+--+--+ -++--+++--+--+++.R visibles^(N—R) hiddenThe letter M is used to represent the number of hidden neurons:M^(N — R)Chapter 3. The Extended Hopfield Network^ 323.1.2 Partial Overlaps in the EH NetworkThe overlap mou between memories ep and can be split into two components in an EH net-work. The overlap between the visible neurons is written mr and is defined as:1 R^v ^RThe overlap between the hidden neurons is written mr and is defined as:1 N^m fr^E cci=R+1The total overlap between vectors is therefore:ma" = 1 (Rmr Mmr)3.1.3 Measuring "Stability" in an EH NetworkThe most common method for measuring the capacity of an attractor network is to measurehow many stable memories can be stored. In a Hopfield network, the test for stability is to setall neurons equal to the memory being tested, and then allow the network to evolve using theupdate rule. If no evolution takes place, the network is stable.In an EH network, stability is tested by setting only the visible neurons equal to the storedmemories. The hidden neurons are treated as unknowns using one of the procedures discussedin Section 3.3. Memory recall then takes place in the normal way, and continues until anattractor is reached. If the state of the visible neurons after recall is equal to the memory state,then the memory is "stable".i=11 (3.1)RChapter 3. The Extended Hopfield Network^ 333.1.4 "Level of Correlation" of Memory SetsIn Section 3.4, a number of tests are described in which memory sets with various statisticalproperties are tested. For those tests, the "level of correlation" of the memory set is determinedby the rms overlap of the visible neurons only. So if a set of "random" memories are stored ina network with 50 hiddens and 50 visibles, it means that the visible bits have been selected atrandom, so:(mr )rms =3.1.5 TriState Neurons OptionOne of the options which will be discussed for the EH network is the use of tri-state instead ofHopfield's standard bi-state neurons. Tri-state neurons use the standard +1 and - 1 states, andan additional "neutral" state 0. A typical tri-state vector might look like:S1 = +-+---+-++++-++-+--+--+ -++00000-0+-000+.R visible^(N—R) hiddensThe advantage of using tri-state neurons is that when they are in the zero state, they haveno influence on the other neurons, which is appropriate if, for example, the correct state of thatneuron is completely unknown. The disadvantage, of course, is that tri-state neurons are morecomplicated, and are more difficult to implement as electronic hardware. It will be shown thateverything possible with tri-state neurons can be achieved with bi-state neurons, albeit withsomewhat more complicated procedures.Chapter 3. The Extended Hopfield Network^ 343.2 Storing Memories in an EH NetworkAn EH memory is stored using the outer product rule, just as in a Hopfield network, but beforethe memory is stored, the state of the hidden neurons is determined by letting the network "rollup" in the energy landscape. The rolling up process is very similar to the evolution proceduresdescribed in Section 2.1.2, except that the forward update rule (Equation 2.5) is replaced bythe following "reverse update rule":—1 if^> 0=^ (3.2)+1 if^< 0It is not difficult to prove that rolling up leads to an energy peak. From Equation 3.2, astable state S satisfies Si = —sign(Ui). Using Equations 2.7 and 2.2, the energy E of a stablestate S is therefore:E = --1 V UiSi = —12 2 i=iInverting any bit j changes the energy to E':2 E luil —which is always less than E, so the stable state is a point of (locally) maximum energy. Theenergy of a network with finite N is bounded from above, so a peak will always be encounteredwhen rolling up.The procedure for storing memories is shown schematically in Figure 3.1, and can be sum-marized as follows:i= 11. All known bits are set to their known values2. All unknown or hidden neurons are set either to zero (if tri-state), or to +1 or -1 atrandom (if bi-state).hidden^Energy000000000000+ - + - -H- - - - - -H-2'1IMaximum Energy State(Visible Neurons Unchanged)Reverse Update RuleInitial State(Hidden are Zero)visibles+ - -H-H- -reverse Iupdaterule-17+56outerproductrule Revised J„New Energy ContourAfter Memory is Stored Energy "Landscape"+ - -H- - - - - -H- -126Chapter 3. The Extended Hopfield Network^ 35State Space(a)^ (b)Figure 3.1: Schematic representation of the Ell network memory storage process, (a) as a"roll-up" process in the energy landscape and (b) as an update process for bits in a statevector. The energy surface as a function of state space has peaks and valley-bottoms. Thepeaks are locally optimally orthogonal to the valleys. The location of a peak is found using thereverse update rule. The memory is installed, using the outer product rule, at the location ofthe peak, resulting in a new energy contour with an energy minimum at that point.3. The reverse update rule (Equation 3.2) is applied, with the visible neurons clamped, untila stable state is reached. The term "clamped" means that the state of the neuron is notallowed to change.4. The memory is installed using the outer product rule (Equation 2.3).Note that when some of the neurons are clamped, the network evolves in a subspace of thecomplete network state space. The peaks that are reached in that subspace are not in generaltrue peaks in the complete space, but experience has shown that they are close approximationsto them.A good indication of how well this memory storage process works is the degree of orthog-onality that it achieves. Figures 3.2 and 3.3 show the rms overlap O ^as a function of the1.11.0x xx x x x x x x0 Hiddens0.9,ig) 0.400.30.20.10.00.8a.as 0.78 0.6E 0.5100 Neurons Tri-State10 Hiddens30 Hiddens50 Hiddens90 Hiddens50^601902 AUG 2210^20^30^40Memory LoadingChapter 3. The Extended Hopfield Network^ 36Figure 3.2: Degree of orthogonality achieved by the tri-state EH memory storage process, asa function of the number of memories added and for various numbers of hidden neurons. Thevisible neurons in each memory were set at random to either +1 or -1. The rms overlap O rms wascalculated using the complete memory vector, both visible and hidden, and was averaged overall pairs of memories that had been stored. The simulations were performed with a 100-neuronnetwork, and each point represents the average result from 50 different sets of memories. Thelines were drawn free-hand as a visual aid only.number of memories installed and the number of hidden neurons available, for both tri-stateand bi-state neurons. A comparison of the two figures reveals that tri-state and bi-state optionsare approximately equivalent. When no hiddens are present, the rms overlap is approximatelyunity, as expected. As a greater percentage of the neurons are hidden, the roll-up process isable to achieve better and better levels of orthogonality. This is examined again in Figure 3.4,where O is plotted as a function of the number of hiddens.3.3 Recovering Memories in an EH NetworkDuring memory recovery in a standard Hopfield network, all bits are treated equally. There isno way to tell the network, for example, that you are very sure about bits 1 through 50, but+ 10 MemoriesAE 20 Memoriesx 40 Memories^ 60 Memories100 NeuronsTri-State6.1^0.2^0.3^0:4^0.5^0:6^0.7^0.8019p_Tri^(Num Hidden Neurons)/Na.os0a)ts)as10.9-0.8-0.7-0.6-0.5-0.4-0.3-0.2-0.1-0—o 0.9^11992 AUG 22Chapter 3. The Extended Hopfield Network^ 371.1-1.00.9-0.8-o_as= 0.7-O 0.6-0.5-0.4-O0.3-0.2-0.1-0.00 Hiddensx x x x10 HiddensHicklensx50 Hiddens90 Hiddens100 Neurons Bi-State10^20^30^40Memory Loading50^601992 AUG 22Figure 3.3: Degree of orthogonality achieved by the bi-state EH network memory storageprocess, as a function of the number of memories added, for various numbers of hidden neurons.The conditions for this simulation were identical to those in Figure 3.2. The lines are identicalto those in Figure 3.2 to aid in comparison.Figure 3.4: Degree of orthogonality achieved by the EH network memory storage process, as afunction of the number of hidden neurons, for various levels of memory loading and for tri-stateneurons. The conditions for this simulation were identical to those in Figure 3.2. The lineswere drawn free-hand as a visual aid only.Chapter 3. The Extended Hopfield Network^ 38have no information about bits 51 through 100. There are many applications, however, wherethe locations of the unknown bits is known. This is particularly true in an EH network, becausethe states of all of the hidden neurons are, by definition, completely unknown. It would alsobe true, for example, if a standard Hopfield network were being used to complete a bit-mappedimage where the bottom half of the image was known to be obscured.It would be useful to find a memory recovery process in which certain neurons could beflagged as "don't know", so that their initial values would have minimal influence on the finalanswer. Three different methods for doing this are discussed in Sections 3.3.1 through 3.3.3and are compared in Sections 3.3.4 and 3.3.5.All tests of the memory recovery process in Sections 3.3.1 through 3.3.5 use sets of memorieswhich are randomly generated, with the bits equally likely to be +1 or -1. The tests describedin Section 3.3.6 use near-orthogonal memory sets. None of the memory sets tested in thesesubsections were stored using the roll-up process.3.3.1 "Random" OptionThe first strategy, which is most commonly used with Hopfield networks, is to set all the "don'tknow" bits to +1 or -1 at random, and then apply the forward update rule (Equation 2.5) untilan attractor is reached. This method relies on the randomly selected bits being uncorrelatedwith all the memories, so that their net influence is negligible. Unfortunately, when a largenumber of "don't know"s are present, or if the network is nearly at capacity, there is a highprobability that the random bits will divert the network into a false attractor.visibles^hidden^EnergySynchronous UpdateLiIntermediate StateAsynchUpdateFinalStateInitial StateEnergy "Landscape"synchronousupdateruleI +-+-1-4-9--asynchupdaterule+-+-I-F000+-+-++----+++-+-0+----0+000000000000-126-12+96Chapter 3. The Extended Hopfield Network^ 39State Space(a)^ (b)Figure 3.5: Schematic representation of the tri-state EH network memory retrieval process,(a) as a "roll-down" process in the energy landscape and (b) as an update process for bitsin a state vector. Unknown bits, including all hidden neurons, are initially set to zero. Thenetwork then rolls down synchronously, with known neurons clamped, until the energy no longerdecreases. All neurons are then undamped, and the network rolls down asynchronously to anattractor state.3.3.2 "TriState" OptionIf tri-state neurons are used, a very successful memory recovery process is sketched in Figure 3.5,and can be summarized as follows [13, pg 1361:1. All known neurons are set to their known values2. All unknown or hidden neurons are set to zero3. The synchronous forward update rule (Section 2.1.2) is applied, with all known neuronsclamped. If some neurons are stuck in zero states, set one to +1 or -1 at random, andcontinue rolling down, until all neurons are non-zero and the energy no longer decreases.Chapter 3. The Extended Hopfield Network^ 404. The asynchronous forward update rule (Section 2.1.2) is applied, with no neurons clamped,until a stable state is reached.It is an unfortunate added complexity that two different phases of roll-down are required, buttests show that both phases are needed. If asynchronous updates were used immediately, thefirst few neurons to be updated would exert disproportionate influences on the subsequentevolution of the network. If only synchronous updates were used, the network might settle intoa limit cycle, and never reach a stable state.3.3.3 "Bi-State" OptionIf tri-state neurons are not used, comparable performance can be achieved with bi-state neurons,at the cost of an extra phase in the recall process. Bi-state recovery is shown schematically inFigure 3.6, and can be summarized as follows [12]:1. All known bits are set to their known values2. All unknown or hidden neurons are set to +1 or -1 at random.3. The asynchronous reverse update rule (Equation 3.2 ) is applied, with known bits clamped,until a stable state is reached. This is the "roll-up" phase of bi-state memory recovery.4. The synchronous forward update rule (Section 2.1.2) is applied with known bits clamped,until the energy no longer decreases.5. The asynchronous forward update rule (Section 2.1.2) is applied with no bits clamped,until a stable state is reached.The usefulness of step 3 is not immediately obvious, Isn't rolling up to a peak counter-productive when the goal is to arrive in a valley? And why will the peak be any closer to theReverseUpdate RuleInitial State/—\ AsYnchUpdateEnergy "Landscape"FinalStateSynchronous UpdateIntermediate State^Peak StateAState SpaceChapter 3. The Extended Hopfield Network^ 41visibles hiddens EnergyI +-i-H-??? ????????????set bits atrandomi+-+++--+ +--H---+----+ +23reverseupdate+-++-H-- - +96+++-+---++-+synchronousupdate rulei+-++++-- ^-12 +-+-++ ^+asynchurdate rule_F __I__H__F__I +.1___H_____ -126(a)^ (b)Figure 3.6: Schematic representation of the bi-state EH network memory retrieval process,(a) as a "roll-up-then-down" process in the energy landscape and (b) as an update process forbits in a state vector. Unknown bits, including all hidden neurons, are initially set to +1 or -1at random. The network rolls up asynchronously using the reverse update rule, with knownneurons clamped, until a peak is reached. It then rolls down synchronously, with known neuronsclamped, until the energy no longer decreases, and finally, with all neurons undamped, it rollsdown asynchronously to an attractor state.desired valley than to any other valley?The explanation can be found using the following thought experiment. Let A be the knownstates of the visible neurons, and let B be the states of the hidden neurons when the networkis at the peak, so the complete network state can be written 5 1 = (A + B). For example:^$1^+-+---+-++++-++-+--+--+ -++--+++--+--+++.A^By the symmetry of^the bit-inverse state 5 2 = (A + B):2Chapter 3. The Extended Hopfield Network^ 42must also be a peak, and as such its overlap with all memories is as small as possible. Nowinvert all the visible neurons A:S3^+-+---+-++++-++-+--+--+ +--++---++-++---.A 73-The resulting state (A + TY) still has approximately zero overlap with all the memories, exceptfor the one whose visible neurons are A. Its overlap with that memory is very large, whichimplies that the state (A + B) is an ideal starting point from which to recover the desiredmemory.Now consider what happens in the real network. When step 4 of the recovery begins, thenetwork is in state (A + B). All the hidden inputs are the negative of the states Si, because itis a peak. That means that the first synchronous update will immediately invert all the hiddenneurons, and the network will be in state (A -I- B), which was just shown to be a good startingpoint for memory recovery.3.3.4 Retrieval Efficiency with 10 Memories StoredFigure 3.7 shows the measured retrieval efficiency for a network in which 10 random memorieshave been stored. No roll-up procedure was used when the memories were stored. The figureindicates that the "tri-state" and the "bi-state" retrieval processes are comparable to eachother, and are both superior to the "random" process. With the tri-state and bi-state options,there is a good chance of recalling a memory even when only 10 bits are known.3.3.5 Radii of Attraction for Random MemoriesFigure 3.8 shows how the radius of attraction of a typical memory varies as more and morerandom memories are added to the network. No roll-up procedure was used when the memories5<- *^* *^* cti^t610^20^4-16^J50^do^70^810^9'0^1001.1-1.0-0.9-0E 0.8-0 0.7-0.4-c0.3-.0S_1. 0.2-0. 110.04010 Memories100 Neurons+ TriStateBiStateRandomChapter 3. The Extended Hopfield Network^ 43ReeovTat Number of Prompt Bits Used 1992 MAR 5Figure 3.7: Comparison of three ways to handle "don't know" bits during memory recovery. Inall three cases, ten random memories were stored in a standard Hopfield network. The networkwas then initialized to one of the memories e, except that x of the bits were treated as "don'tknow" according to one of the three methods listed in the text, and memory recovery wasperformed. The graph plots the fraction of times that the recovered memory was exactly equalto e'4 . Fifty different memory sets were tested, and all 10 memories of each set were tested.were stored. The figure shows that the radius of attraction gets smaller as memories areadded. This is partly unavoidable, because when there are more memories to choose from,more information is required to specify which one is desired. It is also partly because thememory space becomes cluttered with false attractors, and this makes the effective number ofmemories much larger.The first of these reasons can be quantified as described in Section B.1. Equation B.1indicates that the average number of bits required to unambiguously specify one memory outof a set of P is:Mmm = log 2 4(P — 1),^ (3.3)0.21992 MAR 10x100 NeuronsChapter 3. The Extended Hopfield Network^ 441.0— ..........^......„^Theoretical^x *^•:: ........ _._ .......... 1-^+ ........ - ...^Maximum0.9-^. ..........^+ . *---„,^.. .,...0.8-^„,.. .,,-• ..^x^"..^„...... .. -..^„,..0.6-0.5-0.4-0.3-0.2-^+ TriState0.1-^x Random0.0x0^0.05 0.1^0.15Memory Loading (alpha)Figure 3.8: Comparison of radii of attraction for tri-state and random memory recovery proce-dures. Memory loading a is the number of stored memories divided by the number of neurons.The stored memories were completely random — no roll-up procedure was used to orthogo-nalize them. The fractional radius of attraction is -AT- times the number of bits which werenecessary and sufficient to recover a memory, when the other bits were either set to zero (for"tri-state") or to random values (for "random"). The "theoretical maximum" was calculatedusing Equation 3.4. The dashed lines mark the standard deviation in the observed radii ofattraction.so the theoretical maximum fractional radius of attraction of a memory is:1 1 log 4(P — 1)N log 2This value is included in Figure 3.8 for reference.The figure confirms that the tri-state retrieval process is much more effective than therandom process, especially as the memory loading approaches capacity around a = 0.14. Fora < 0.1, the tri-state network's performance is comparable to the theoretical upper limit. Bi-state data is not plotted because it was indistinguishable from the tri-state data.Tmax (3.4)Chapter 3. The Extended Hopfield Network^ 451.00-1--0.90-0.80-0.70-0.60-0.50-0.40-0.30-0.20-0.10-0.00ao 0.1^0.2^0.3Or111815^Memory Loading (alpha)0.4^0 51982 SEP 18Figure 3.9: Radius of attraction as a function of memory loading for near-orthogonal memorysets, using the tri-state memory recovery procedure. Memory loading a is the number of storedmemories divided by the number of neurons. Memory sets with normalized rms overlaps of 1.0,0.8, 0.6, 0.4, 0.2, and 0.1 were generated as described in Section 2.3.1 and then stored. Thefractional radii of attraction were calculated as the number of bits necessary and sufficient torecover 90% of the memories with fewer than 5% of the bits incorrect. Each data point is anaverage from 10 different memory sets. The lines were drawn freehand as visual aids only.3.3.6 Radii of Attraction for Near-Orthogonal MemoriesFigure 3.9 summarizes how the tri-state recovery process performs with near-orthogonal memorysets. The level of orthogonality makes no difference at low memory loads, but as the memoryload increases, the radii of attraction of the more orthogonal networks are better.3.4 Performance of the EH NetworkThe following subsections examine how well the EH network performs when the roll-up memorystorage process described in Section 3.2 and the roll-up-then-down memory recovery processdescribed in Section 3.3 work together. When they are both used some of the neurons canChapter 3. The Extended Hopfield Network^ 46be truly hidden, in that their states are neither imposed on the network when the memory isstored, nor used as prompts when the memory is recovered.The presence of hidden neurons gives the EH network a number of advantages over a stan-dard Hopfield network. Some of these are:• More memories can be stored in a network of a given length, although the memories willbe shorter.• A network's capacity for memories of a given length can be increased• The amount of stored information per neuron can be increased.• The radii of attraction of stored memories can be increased.• The XOR set of memories can be stored.• Sets with memories of varying length can be stored.These advantages are documented in Sections 3.4.1 through 3.4.6. In these tests, a memory is"stable" if it meets the criteria described in Section 3.1.3, and the "level of correlation" of amemory set is as defined in Section 3.1.4.Unless otherwise noted, all simulations were done using tri-state neurons, using both thememory storage method from Section 3.2 and the memory recovery process from Section 3.3.3.4.1 Memory Storage CapacityAn important advantage of EH networks is their ability to store more memories than a Hopfieldnetwork of similar size. The capacity of a network with a fixed number of neurons can beenhanced by designating some neurons as hiddens. Alternatively, the capacity of a networkR = 0 +R = 50 xN = 100"a 0.4100.2-L.LChapter 3. The Extended Hopfield Network^ 471.20.00^0.2^0.3^0.4^0.5^06cepaci Memory Loading (alpha) 1989 July 12Figure 3.10: Comparison of an EH network with 50 hidden neurons and 50 visible neurons toa Hopfield network with 100 neurons. Random memory vectors were used for both networks,but the Hopfield memories were twice as long. Tri-state neurons were used for the EH network.Each point represents the average over five different sets of memories.with fixed memory length can be increased by adding hidden neurons, and thus increasing thetotal number of neurons.Figure 3.10 compares the recovery rates of a Hopfield network with 100 neurons to anEH network with 50 visible and 50 hidden neurons. The EH memories are only 50 bits long, butthe EH network is able to store more than twice as many as the Hopfield network. In addition,memory stability remains 100% for up to 25 memories, as opposed to only 10 memories for theHopfield network. If the "memory capacity" of each network is defined to be the load level forwhich the fraction of stable states falls below a certain threshold (say 0.9), then the Hopfieldnetwork has a capacity per visible neuron of 0.22 compared to 0.3 for the EH network.Figure 3.11 shows the results of more extensive simulations where the memory capacity wasmeasured as a function of R/N, the fraction of visible neurons in the network. The memorycapacity increases steadily as R/N goes from 1.0 down to 0.5, because the network is using0.30-ta_13-o. 0.20-m0 0.15-E0.10-0.000.0^0.1^0.2^0.3^0.4^0.5^0.6^0.7^0.8^0.9^1.0Ratio of Visible Neurons (R/N)^91 oar 12HUMP?Chapter 3. The Extended Hopfield Network^ 48Figure 3.11: Network capacity as a function of the number of visible neurons. To measure thecapacity, random memories were added one by one, using the EH storage method. After eachmemory was added, the number of stable memories was measured as described in Section 3.1.3.In this graph, the "memory capacity" is k times the number of memories which can be storedbefore the fraction of stable memories fell below either 0.9 or 0.5. Each data point representsthe average from ten different memory sets. The lines were drawn freehand as a visual aid only.the hidden neurons to orthogonalize the memory space. As R/N becomes smaller than 0.5,however, the capacity decreases again, because it becomes more and more difficult to recall thememories with so few visible neurons.3.4.2 Maximum Information ContentEven though more memories can be stored in a network with hidden neurons than in a standardHopfield network, it is not immediately clear that the information capacity of the networkincreases or decreases. The reason for the confusion is that there are two competing effects asthe ratio R/N decreases from unity:• More memories can be stored, because the memory space can be orthogonalized.Chapter 3. The Extended Hopfield Network^ 49• The amount of information in each memory decreases, because the hidden neurons containno information.The information capacity of an EH memory can be calculated quite easily. If the networkhas a total of N neurons, R of which are visible, and if the visible neurons are set at random,then the information content of the pth memory isTEH (it R) = log (2R) — log(p + 1) = R log 2 — log(p + 1). (3.5)The first term represents the information in the visible bits and the second term representsthe information lost because the order in which the memories are stored is irretrievable. Themathematical justification for these terms is discussed in Section 6.1.1.Figure 3.12 graphs the information capacity as a function of the ratio of visible neurons,and reveals that it can be increased by up to 45% by adding hidden neurons. In the limit ofzero hiddens, the information capacity per neuron, per memory, is approximately 0.14 log 2,which is consistent with results reported by Hopfield [25].3.4.3 Correlated Memory SetsA serious limitation in the performance of Hopfield networks is their very poor capacity for setsof correlated memories. The decrease in performance occurs because the correlations betweenthe memories add together to form attractors which overwhelm the memories themselves. Amithas proposed a solution for one class of correlated memory sets, where the average activationof the bits is biased away from zero [5]. His solution involves restricting the networks duringevolution to states which have a specified average activation.The EH network's performance with correlated memories is of particular interest becauseits memory storage process very specifically tries to decorrelate the memory set. Simulationso 0.100.1210Ts0.08-E6 0.06-o1--0 04-. 0.0210.00.0.0Chapter 3. The Extended Hopfield Network^ 500.160.141— 0.120.08 --100 NeuronsI0.1^0.2^0.3^0.4^0.5^0.6^0.7^0.8^0.9^1.0Hump, Fraction of Visible Neurons (R/N)^GI OCT 12Figure 3.12: Information capacity of an EH network, as a function of the number of visibleneurons used. Memory capacity was calculated using the "0.9 criterion" data described inFigure 3.11. The ordinate was then calculated as the number of memories stored, times thenumber of visible bits per memory, times log 2 / N2 , so that the graphed value is equal to L)./i,fltimes the number of data bits that can be stored for every neuron in the network. Each graphpoint is the average of tests from approximately 10 sets of memories. The line was drawnfreehand as a visual aid only.confirm that it is able to decorrelate the first few members of each memory set, and that thisimproves the storage capacity. Unfortunately, the decorrelating capacity of the hidden neuronsis very quickly used up, so the performance of the network deteriorates quickly after the firstfew memories have been stored.Figure 3.13 illustrates EH network capacity for memory sets with four different levels ofcorrelation, where "level of correlation" refers only to the visible neurons, as discussed in Sec-tion 3.1.4. Notice that in the bottom left corner of the figure, where few memories are storedand many hidden neurons are available, memory capacities are the same for all levels of corre-lation. As the memory loading increases, the decorrelation process fails first for memory setswith the highest levels of correlation.0.400.35-10.30-*(..) 0.25-as3:„ 0.20-0Avg Overlaps0.0 ■0.04 +0.1 )sE0.2 ^0.10H0.05H100 NeuronsChapter 3. The Extended Hopfield Network^ 510.00I^0.0^0.1^0.2^0.3^0.4^0.5^0.6^0.7^0.8^0.9^1.0OlapHump Fraction of Visible Neurons 186e JAN 16Figure 3.13: Storage capacity of an EH network for correlated memory sets with three differentlevels of correlation. The memory sets were generated as described in Section 2.3.2, and thenstored and retrieved using a 100-neuron tri-state EH network. Memory capacity was calculatedusing the "0.9 criterion" data described in Figure 3.11. In the limit where the ratio of visibleneurons is 1.0, this is the performance of a Hopfield network. Each point represents the averageof 10 different memory sets. The lines were drawn freehand as a visual aid only.Even when the EH network's performance is degraded because the memory set is correlated,it is still able to use the hidden neurons to advantage. An EH network with 50% hidden neuronscan store more that twice as many memories as a Hopfield network, for example, when thememories have an average overlap of 0.1. Furthermore, the EH network has a greater capacityfor correlated memories than the Hopfield network has for random memories.3.4.4 Radii of AttractionMost of the measurements of network capacity up to this point have focussed on the numberof memories which can be stored before memories become unstable. To be useful, however, itis not good enough for a memory to be stable; it must also have a good radius of attraction.Chapter 3. The Extended Hopfield Network^ 52In this section, two types of radii of attraction are measured:1. Type-1: Attraction from an incomplete prompt: This is a measure of how manybits can be flagged as "don't know" (as discussed in Section 3.3) before the originalmemory can no longer be recovered.2. Type-2: Attraction from a noisy prompt: This is a measure of how many bits canbe set incorrectly, but not flagged as "don't know", before the original memory can nolonger be recovered.In both cases, the radius of attraction is equal to the number of visible bits which can bewrong (either "don't know" or inverted) before memory recovery begins to fail. If a networkwith 50 visible and 50 hidden neurons, for example, has a type-1 radius of "40", then recoveryis possible when 40 of the visibles, as well as all 50 of the hiddens, are initially flagged as "don'tknow".Figure 3.14 compares the type-1 radii of attraction of two different EH networks. The"100/0" network has 100 visible neurons and no hiddens, and the "50/50" network has 50 vis-ibles and 50 hiddens. The 100/0 network is identical during memory storage to a 100-neuronHopfield network. The radius of attraction shown in the figure could not, however, be achievedby a Hopfield network because it relies on the tri-state process described in Section 3.1.5.The figure shows that the 50/50 network maintains a good radius of attraction for muchhigher memory loadings than the 100/0 network. This is not surprising, given that the 50/50network has, according to Figure 3.11, a memory capacity of 30, while the 100/0 network'scapacity is only 14. Figure 3.15 gives a head-to-head comparison of the two networks as theyapproach their nominal memory capacities. It shows that the 50/50 network has lost most of itsbasin of attraction by the time the 30th (and final) memory is added. By contrast, the 100/0network still has a reasonable basin of attraction when the 14th memory is added.( recovery from an incomplete prompt )Chapter 3. The Extended Hopfield Network^ 536^10^'^I^I15 20^25 35^40OldRadi Number of Memories 1902 SEP ISFigure 3.14: Type-1 radii of attraction of EH networks. The "100/0" network has 100 visibleand 0 hidden neurons, and the "50/50" network has 50 visible and 50 hidden neurons. Theordinate is the maximum number of visible neurons which can be set to zero initially withouthindering subsequent memory recovery, divided by the total number of visible neurons. Eachmarked point is the average from five different sets of memories. The solid lines were drawnfreehand as a visual aid only.The type-1 radii of attraction of the 50/50 network were not only larger than those of the100/0 network, they were also more consistent. For each of the points shown in Figure 3.14,both the average and the standard deviation of the fractional type-1 radii of attraction weremeasured, although only the average radius was plotted. The standard deviations averaged 0.13for the 50/50 network, and 0.20 for the 100/0 network. This may have an important impact onsystem reliability for some applications.The type-2 radii of attraction of the 100/0 and the 50/50 networks are shown in Figure 3.16.The figure shows the average maximum number of visible neurons that can be initially set tothe bit-inverse of the memory state, without interfering with memory recovery. In this case,because none of the visible neurons were flagged as unknowns, the 100/0 network was entirelyidentical to a 100-neuron Hopfield network.1.2( recovery from an incomplete prompt )0tvL1-3 0.8H1.02 0.6asg 0.4HLt.0.2H^01 ,^,^.^i^.^1-1^.^.^i0^0.2^0.4^0.6^0.8^1^1.2^1.4^1.6^1.8 2°Wadi^Memory Load / Memory Capacity 1992 SEP 19Chapter 3. The Extended Hopfield Network^ 54Figure 3.15: Type-1 radii of attraction of EH networks as a function of how near the network isto capacity. This figure is identical to Figure 3.14, except that the memory loads were dividedby 14 for the 100/0 network, and by 30 for the 50/50 network.A comparison of Figures 3.16 and 3.14 indicates that the type-2 radii of attraction areapproximately one third the size of the type-1 radii. The type-2 basin of attraction of the50/50 network is again larger for much higher memory loadings than the basin for the 100/0 net-work. The standard deviations of the fractional type-2 radii of attraction were measured andfound to be approximately 0.08 for both the 50/50 and the 100/0 networks.3.4.5 Storing the XOR Memory SetOne serious limitation of the Hopfield network is its inability to store the "not-equal" or"exclusive-or" (XOR) set of associations. This set of associations consists of the followingChapter 3. The Extended Hopfield Network^ 550.5-i0.45-0.4-0.35-0.3-0.25-0.2-0.15-0.1-0.05-040Figure 3.16: Type-2 radii of attraction of EH networks, using the same networks as in Fig-ure 3.14. The plotted value is the maximum number of visible neurons which can be set tothe inverse of the memory initially, without hindering memory recovery, divided by the totalnumber of visible neurons. The 100/0 network can be thought of as either an EH network withno hidden neurons, or as a standard Hopfield network. Each marked point is the average fromfive different sets of memories. The solid lines were drawn freehand as a visual aid only.four memories:0^5^10^ 20^25^301 ^35OldRadii Number of Memories 1992 SEP 19++( stored memories )^( bit-inverse memories )where the third bit of each stored memory is (-) if the first two bits are equal, and (+) if theyare not. The goal is to store these memories in a network in such a way that when the first two"input" bits are provided, and the third "output" bit is flagged as "don't know", the networkwill roll down. to an attractor in which the third bit is set correctly.There are two severe problems with storing the XOR set in a Hopfield network:Chapter 3. The Extended Hopfield Network^ 561. For every stored memory of the XOR set (eg:^-) with the correct answer in the third^bit, a bit inverse attractor of a different memory (eg:^+) is also stored which has theopposite answer in the third bit.2. The sum of bits is zero in each of the three columns, so all the Jiis are zero after the lastmemory has been stored.The only conceivable solution with a standard Hopfield network is to add extra bits to eachmemory so that, for example, the memory set becomes:++ - +++++^ + ^+- + +++++ -+ ^ --+ + +++++^ +--- - +++++ ++ + ^( stored memories )^( bit-inverse memories )The extra bits must be the same for all four memories, or else the user may end up "givingthe answer" as part of the question. These extra bits clearly solve problem 2, because the bitinverse attractors can now be recognized by the incorrect values of the added bits.Unfortunately problem 1 remains, and now applies to the extra bits as well. Because theextra bits bits are the same for all memories, the outer product rule4= N E (vie - (54)0=1^3zeroes every synapse that connects them to the "output" neuron, as can be readily confirmedby trying a few examples. The problem here is very similar to the problem of storing XOR ina single layer perceptron, as pointed out by Minsky and Papert [40].The solution for the Hopfield network, just as for perceptron networks, is to add hiddenneurons. The hidden neurons can do what the old extra neurons could not do, because theChapter 3. The Extended Hopfield Network^ 57hiddens change from one memory to another. This allows the Jiis connecting the hiddens tothe output neuron to be non-zero.To test the EH network with the XOR set, the following four memories were stored, withvarying numbers of hidden neurons:The first bit of each memory is a single symmetry breaking neuron, which is still required sothat bit-inverse attractors can be avoided. The lower trace of Figure 3.17 shows the results ofthe tests, and indicates that approximately 11 hidden neurons are required before the correctanswer is consistently achieved.A disturbing feature of the XOR experiments was that even with quite large numbers ofhidden neurons, the network occasionally made mistakes. Among 1200 tests of the networkwith 13 hiddens, for example, the network got the wrong answer 3 times. After careful tracingthrough of the network's behaviour, it was found that the errors always happened when thenetwork reached a "plateau" in the energy landscape, where some of the hidden neurons hadzero inputs, but were still in a zero state. The standard approach, at such a plateau, is to setone of the zero neurons to +1 or -1 at random and then continue.The problem was solved by implementing a "tie-breaker heuristic" or "TBH". The input Uito each neuron is the sum of (N — 1) signals ViJii, not all of which have the same magnitude.If the sum of these signals is zero, the TBH counts whether there are more positive or morenegative signals (ignoring their magnitude), and sets Ili to +1 or -1 respectively. With the TBHin place, only three hidden neurons were needed to store the XOR set, as shown in the uppertrace of Figure 3.17. In 15,000 tests of the TBH network, with three hidden neurons or more,Chapter 3. The Extended Hopfield Network^ 58C0to^o0 4^6^8^10^12XORset number of hidden neurons 1090 APR 1016Figure 3.17: Retrieval rates for the XOR set of associations with varying numbers of hiddenneurons. Each data point marks the average of a total of 1200 tests, during which the XOR setwas stored 100 separate times. A test consisted of providing the two input bits, and thenseeing if the network set the output bit correctly. For the lower trace, neural dynamics were asdescribed in Section 3.3. For the upper trace, the tie-breaker heuristic was added, as describedin the text. The lines were drawn freehand as a visual aid only.zero errors occurred.3.4.6 Flexible-Length MemoriesThe way that the EH network manages hidden neurons is ideally suited for applications wherethe memories are not all the same length. The network can be sized to accomodate the longestpossible memory, and when a short memory is stored, the unused neurons can be put to gooduse orthogonalizing the network. The only minor hitch is that during recall, it may not be clearwhere the memory ends and the hidden neurons begin, but that can be solved quite simply asdescribed below.To demonstrate this and other capabilities of the EH network, a demonstration program wasChapter 3. The Extended Hopfield Network^ 59written for use on a desktop computer. The program simulated a 162-neuron EH network, andstored memories which were entered and displayed as strings of characters. Each character wasencoded as six bits, and the first six bits of each memory encoded the number of characters inthe string, so strings of up to 26 characters could be entered. Any bits which were not requiredfor the given character string were treated as hidden neurons.A typical demonstration task was to store the titles of the works of Shakespeare. The titlesranged in length from 6 characters (Hamlet) to 24 characters (Alls Well That Ends Well),with an average size of about 16 characters. No systematic measurements were made of theperformance of the network, but the following observations were made:• All thirty five titles could be stored in approximately 95 seconds, when running on a25 MHz '386 desktop computer.• The network worked well with approximately 25 memories stored, in the sense that theradii of attraction were still very large. A prompt of "Rome ?", for example, wouldconsistently elicit the response "Romeo and Juliet".• Thirty titles could be stored without any becoming unstable, but the radii of attractionwere noticeably reduced at that level of storage.• When 35 memories were stored, approximately half the memories were no longer stable.• The memories were surprisingly robust to "brain damage". Memories could be consis-tently recovered even when large percentages (70% to 90%) of the synapses were set tozero.Chapter 4The Free-Energy of a Network with Near-Orthogonal MemoriesThe most successful analytical tools for use with Hopfield-style neural networks are based onthe statistical mechanics of Ising spin glasses [3, 4, 1, 38]. The formal ties between spin glassesand Hopfield networks were originally forged for networks in which all neurons were identical,and for memory sets which were completely random [3, 4]. Recent papers have extended themethods to include some forms of non-random memories [5, 43], and modifications to thelearning rules [5].Statistical mechanics methods are useful because they translate the "energy" landscapediscussed in Section 2.1.4 into a "free energy" landscape, as described in Section 4.1. Theenergy landscape exists in the space of all possible states, and is uniquely determined by eachset of memories that has been stored, as sketched in Figure 4.1(a). Its topology for one set ofmemories tends to bear no resemblance to its topology for another set of memories, so there isno "typical" energy landscape. The free energy landscape, on the other hand, exists in a spaceof order parameters, such as the overlap mo between a state S and memory es, as sketched inFigure 4.1(b). Its topology tends to be very similar from one set of memories to another, soit can be averaged over all possible memory sets and then used to calculate properties of thenetwork which are independent of the specific set of memories that has been installed.This chapter derives the free energy for a network in which some of the neurons are clampedand in which the memory sets are near-orthogonal. The discussion and derivation proceed as60EA^ >SFAFAFA10- n?A^ neEAS<<E>>^4A<<F>>^•^ neaverageMemorySet (a) Energy Landscape (b) Free Energy LandscapeChapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^61Figure 4.1: Conceptual sketch comparing energy landscapes to free energy landscapes. Energylandscapes are in the space of all possible states S, and are completely different for every setof stored memories fel. There is no interesting information in the average energy landscape.Free energy landscapes are in a space of order parameters such as m 0 , and tend to be similarfrom one set of memories to another. The free energy landscapes can be averaged to provideinformation about typical network behaviour, as discussed in Section 4.1.4.Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^62follows:• Section 4.1 reviews the analogy between spin glasses and neural networks and justifies theuse of a Gibbs distribution for the EH network with non-zero temperature.• Section 4.2 derives a free energy equation for the most general case, and derives meanfield equations for the expected overlap mo.• Section 4.3 summarizes the results of the chapter.4.1 Magnetic Lattice Analog for Neural NetworksHopfield's neural network model [25] was intentionally designed to be analogous to a lattice ofinteracting magnetic dipoles. At low temperatures, each dipole in the lattice aligns itself withthe total magnetic field at its location, which in turn is determined by the orientation of all theother dipoles. Such lattices have been shown [16, 50] to exhibit a "spin-glass" phase, in whichthe state of the lattice is stationary, but "frustrated", meaning that dipoles are aligned withsome others, but misaligned with others.One low-temperature stable state is the ferromagnetic state, where all dipoles are alignedin the same direction, as sketched in Figure 4.2(a). Each dipole is aligned not only with thelocal magnetic field, but also with dipoles which are very far from it.Spin glass states occur at low temperatures when small groups of dipoles align themselvesin one direction while other groups align themselves in the other direction, as in Figure 4.2(b).Although each dipole is aligned with the local field, it is anti-aligned with many of the dipolesin the lattice, so there is a high degree of frustration. These states are meta-stable (and henceglass-like) at finite temperatures, meaning that the lattice will stay in one state for a long periodof time, but will eventually change to a new state.Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^63xxxxxxxxx x xxxxxx x xx -.XXXXXXXX ^ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxx ...xxxx....x?x. .xxxxxxxxx?xxxxxxxxxxxx --xxxxxx..xxxxxxxx?xxxxxxxxxxxxxxxxxxx .XXXXXXX.XXXXXXXXX-XXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXX..xxxxxxxxxxxxxxxxxxx •XXX•XXXXXXXX•.X.••x xx xx x xxx xx xxxx ? xx x •xx ^ XXXX••X•••xxxxxxxxxxxxxxxxxxx^...... ?...xxx ^xxxxxxxxxxxxxxxxxxx xxx xx x xx ?xx xx x xxxxxxxx XXX••••xxxxxxxxxxxxxxxxxxx xxxx?.XXXXXXXXXXXXXXXXXXX^XX ^ XXXX---.xxxxxxxxxxxxxxxxxxx ?..xxxx x x xxx---XXXXXXXXXXXXXXXXXXX • xxx•• ••xxxx x?x••xxxxxxxxxxxxxxxxxxx ...XXXX ...XXXXXXXXXxxxxxxxxxxxxxxxxxxx ..•XXXX•XX•XXXXXXxxxxxxxxxxxxxxxxxxx ...XXX xxxxxxxxxxxxxxxxx xxxxxx?x^x ^ xxxxxx?•x••x•?x••x•xx•.•-x?x.x.xx.xx?x...x.•x•xx •xx?x••x•xxx?xx?xx••x..••x?.••xx•x.x?x.. ..x.? xx.-xx.xx•xxxx?•xx•xx?••?x••xxx•x•x•xxxxx•?•x• x•x•xx••?xxx••xx?xxx•xxx•?x•xxxxxxx•••••?xx•••x•x•x?x••xxxx•?xxx •x •x?x••x•xxxx••xxx••?x •x xx•••••x•x?x•x•?x•x•x•xx?•x•x•xx•x•x?xx•x•xx••x•xxx?xx•xx•xx•x• x•••xxx••x•x x• ?x•xx •x x••x•x?•x••x•x•x•x •x•x•?xx•x•xx•x•x(a)^ (b)^ (c)Figure 4.2: Sketches of steady-state behaviour of (a) ferromagnetic, (b) spin-glass, and (c) para-magnetic phases of a two-dimensional lattice of Ising ferro-magnets. Sites marked x arespin-down, sites marked are spin-up, and sites marked ? have recently changed orientation.Patterns in (a) and (b) are approximately stable in time, whereas the pattern in (c) is constantlychanging.At high temperatures, when the interactions between the particles are overwhelmed bythermal effects, the lattice is in a paramagnetic phase. The high temperature allows the dipolesto freely flip up and down, as sketched in Figure 4.2(c).In a neural network at low loading, memory states resemble ferromagnetic states, in thatfew of the neurons are frustrated, and the energy of the network is low. In the analogy, thex symbols in Figure 4.2(a) represent neurons which are aligned with the recalled memory.Figure 4.3(a) is a sketch of the energy surface around a network in a ferromagnetic phase.As memories are added, more and more false attractors appear, until eventually the spacebetween memories is filled with large numbers of shallow false attractors, as sketched in Fig-ure 4.3(b). A network in one of those states bears little resemblance to any of the memories,and has large numbers of frustrated neurons. If the "temperature" of the network is non-zero(as discussed in Section 4.1.2) then it will occasionally move between the false attractors, justlike a spin-glass.Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^64(b)Energy Landscape(a)Figure 4.3: Sketch of the energy surface around (a) a ferromagnetic and (b) a spin-glass state.The ferromagnetic state is stable and strongly correlated to a memory. The spin-glass stateevolves only very slowly, but has no recognizeable correlations to any memories.4.1.1 Zero Temperature Ising ModelThe simplest model which exhibits spin-glass phases is the Ising model of ferromagnetism [26].It consists of N magnetic dipoles, each of which can be thought of as a charge spinning aroundan axis. The dipole moments are restricted to the z direction, so each lattice location i isallowed only two states, "spin up" (Si = 1) or "spin down" (Si = —1), The neural analog of thespin is the activation Si of the ith neuron in a network, which is also allowed only two states,"on" (Si = 1) or "off" (Si = —1).If an external field hi is imposed at each lattice point i, then the total field at that point,Ui, is the sum of hi and the sum of the magnetic fields from the other dipoles in the lattice:NUi = hi -I-j=1where Jii is the magnetic coupling coefficient between lattice locations i and j. The total energyof a network is equal to minus the sum of the dot products of the network states Si with thelocal fields Ui:1 N= — E hiSi + —Si Ei=1^2 J=1The factor of 1 is necessary because the potential energy between each interacting pair SiSi is2added twice. Note that this is the same as Equation 2.7, except that the external field hi has(4.1)(4.2)Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^65been added.Equation 4.2 is only correct if the self-interaction terms Jii are taken to be zero, and themagnetic coupling coefficients are symmetric (Jib = .71ii). When the lattice is in thermal contactwith a heat sink at absolute zero, the second law of thermodynamics ensures that E will decreasemonotonically until a stable state is reached where virtually all the molecular dipoles are alignedwith the local magnetic field.Hopfield's model is analogous to the zero-temperature Ising lattice. The neural analog tothe coupling coefficient Jib is the synaptic strength Jib, and the analog to the magnetic field Ui isthe postsynaptic potential Ui. Hopfield was able to define the Lyapunov function or "energy" ofhis network using Equation 4.2 by imposing the restrictions = 0) and (Jib = Jji), the latterof which certainly has no biological justification. The neural analog to the external fields hiwould be bias inputs provided at each neuron.The major difference between magnetic lattice models and neural network models is theway that the Jib are determined. In a magnetic lattice, they are determined by the proximityof the two lattice locations, and by the magnetic properties of the material forming the lattice.In the Sherrington and Kirkpatrick model [50, 30], for example, the values of Jib are assumedto form a Gaussian distribution around zero. In a neural network, the Jij are determined by alearning process, such as the outer product rule in Equation 2.3.4.1.2 Non-Zero Temperature in Neural ModelsHinton and Sejnowksi [23] and Amit [3] extended Hopfield's model to networks in which neuronsoccasionally became mis-aligned with their postsynaptic potential. The appropriate analogfor such a network is a magnetic lattice which is in contact with a heat bath at non-zerotemperature. The temperature 1/# is a measure of the likelihood that individual molecularChapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^66dipoles will become misaligned with the local fields by absorbing energy from the heat bath.Following Glauber [18], if the Ising lattice is in contact with a heat bath at temperature 10,and if all the dipoles except the ith are clamped in a state S., then the probability that theith dipole will be in state Si is:P(Sa) = —1 [1 tanh (OSA)] . (4.3)2In a neural network, this is the probability that neuron i will be in state Si after it is updated,when all other neurons are kept fixed. In the zero temperature limit, oo, this is identicalto the update rule (Equation 2.5).The "temperature" parameter 0 in neural network models determines the probability that aneuron will be in a state Si which is contrary to its post synaptic potential II; when the networkis at equilibrium. It is not, of course, related to brain temperature. Following the magneticanalog, Glauber dynamics (Equation 4.3) are chosen to describe the temperature-dependenceof the transition probability P(Si).A neural network at non-zero temperature never reaches a perfectly stable state, or "fixedpoint". This lack of stability can be an advantage, because false attractors will no longer befixed points of the system. The simulated annealing process [29], for example, uses non-zerotemperatures in its search for the global minimum. When there are no fixed points, however,it is somewhat more difficult to recognize a recovered memory. The standard solution is tocalculate the recovered memory as a time average (SI) of the network state.At non-zero temperatures there is a finite probability that the network will evolve out ofeven the deepest attractor — a fact which seems to destroy the usefulness of the the network.If all final states are equally accessible from each initial state, then the system is ergodic, andthere is no possibility of using the network to retrieve information. In practise, however, thetime spent in an attractor is extremely long compared to the time spent to reach the attractor,so there is no problem recognizing the attractor. Analytically, ergodicity can be avoided byChapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^67carrying out all calculations in the thermodynamic limit, where the number of neurons goesto infinity and attractor states become infinitely deep wells in the energy landscape, so theprobability of escaping goes to zero.4.1.3 Justification for Using Thermodynamic Met hodsWhen Q is non-zero, the evolution of the neural network is stochastic and can no longer bedescribed in terms of a Lyapunov function. Instead, if detailed balance [1, pg 1091 is satisfied,the network can be modelled using the tools of statistical mechanics.A system obeys detailed balance if transitions from state :513 to state S occur exactly asfrequently as transitions from state S to state in the thermodynamic limit. If W(gilg`)is the probability of going from state g' ic to state §1 , and p(S) is the probability of being instate P., then this implies that:W(Pigk )P(P) = W(PIP)P(P).^ (4.4)If the system obeys Glauber dynamics (Equation 4.3), and if the only difference between S3and S is that si -Si, then the requirement for detailed balance is:W(g3 I )^P(SI'S3)^exp(QU2 Si)^P(P (4.5)w(g,ki§j) ^--P(-S gj)il exp( --P O.; )^p(Sk).Any expression which satisfies 4.5 for all states g'3 and S will suffice as an expression for p(g).The natural choice is:Nexp Q uisi = e-f3E(§), (4.6)where E(§) is Hopfield's energy for state g', as in Equation 4.2. Note that this expression wasderived without making any assumptions about the choice of connections Jii, so it is equallyapplicable to the EH network as it is to the Hopfield network.Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^68Equation 4.6 shows that finite-temperature neural networks conform to the Gibbs distri-bution of standard thermodynamics, with E playing the role of energy and # playing the roleof inverse temperature. All the methods of statistical mechanics can therefore be applied tocalculate the statistical behaviour of the networks at equilibrium. If changes in an observable 0affect the energy of the network, then the mean value of 0 can be calculated using the partitionfunction:Z(h, /3) = E exp —0[E — hO(S)]^ (4.7){s}where the sum is over all accessible states of the network. Here h is the physical observablewhich must multiply 0 to calculate the energy. In a magnetic lattice, for example, if 0 is themagnetization, then h is the magnitude of an external field.The time-average of On is:1 On Z(h, 0)1^On F^On (On) = ^z(h,p) Ohn lh=0 =OhnOn(On) =^1^F13n-1 ohnwhereF^--1 log Z15'is the free energy.In the following sections, F will be calculated as a function of a set of order parameters,such as m 1 ... m8 . This means that the sum in Equation 4.7 has been calculated over onlythose states which are compatible with the given order parameters. Such restricted sums areproportional to the probability that the network states are compatible with the given orderparameters.(4.8)(4.9)Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^694.1.4 Quenched Random Memory StatesThere are, in a sense, two levels of randomness involved in the analysis of attractor neuralnetworks: the random choice of a memory set {eo}, and the random time-evolution of theneurons. To keep these levels distinct, the eis are treated as quenched variables. This meansthat, when time averages are being calculated, it is assumed that a single set of memories hasbeen learned - no attempt is made to average over all possible memory sets. Equation 4.8,for example, can be used to calculate the time average of an observable for one specific set ofmemories {eu}.Once the time average is complete, a second average is calculated over an ensemble of allpossible memory sets. From Equation 4.8, the ensemble average of a time-averaged observable 0can be calculated as:(«On») ==on^Ohn1 On Vh ^on-1 An •^ (4.10)where (0) is the same notation that was introduced in Section 2.3.The central goal of the analysis which follows is therefore to calculate ((F)), the ensembleaverage of the free energy F.4.2 Ensemble Average of the Free EnergyIn this section, the ensemble-averaged free energy ((F)) is calculated for a network of N neurons,the first R of which are clamped, and the last M N - R of which are not clamped. Theensemble average is calculated over all possible sets of memories {e}. The free energy F is afunction of the overlaps mo, as discussed in the last paragraph of Section 4.1.3, so partitionfunctions are summed only over states which are compatible with the given mo. The numberM7 EE 14.1.m. Tv- < 1. (4.11)Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^70of undamped neurons M is assumed to be proportional to N, with the ratio equal to 7:In each memory set there are P memories 41- • • • e , which are near-orthogonal in the sensedefined in Section 2.3.1. The probability p(er) of a neural state er is therefore:P^= (1 — b) S(1:;' —^b S(; r;),^(4.12)where r is a set of exactly orthogonal memories, and b determines the degree of orthogonalityof e. Equivalently, the expected overlap between two memories eP and ev is:( N rcr) ))= MV = M {1 - (1 - 201 ,^(4.13)i=Rjias derived in Equation 2.14. The expression for the connection strength between neurons iand j isPJij - N E ef`.1; — abij,p=iwhere the 6;4 term ensures that there are no self-interaction terms.(4.14)At each neuron Si, let there be an external field hi made up of P external fields which areproportional to the stored memories:Phi = E hoerBy increasing one of the hP, the network can thus be encouraged to recall one memory et, inparticular.4.2.1 The Energy per NeuronThe energy associated with the ith neuron is:1Ei =^Ji • Se Se — hiSr2 ‘--4 3 3 Ij=1(4.17)(4.18)Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^71a= - -21 P N2N E E^E + - Ee217^.fis.; CST0=1 j=R-1-1 p=1^j=1 (4.15)where the superscript p is used with the state SIP in anticipation of the case where more thanone state will be considered in the same equation. As in Equation 2.4, the memory loading isrepresented by a = P/N. It is convenient to work with e(SP), the mean energy per neuron fora network in state So, rather than the energy of a specific neuron Ei. The mean is calculatedover the last M neurons, which are the only ones participating in the dynamics:N(8 P)^E Ei(sP)i=R+11 P NF Et.'^e2 2MN E E E f se sp=1 i=R+1 j=R-1-1P N- —m E E lep=1 i=R+1P=^– E (mP) — E 7711;4,11P2P=1^p=1whereR+ Y' es2N L-d 33=1and mr, is the overlap between hidden neurons in state SP and memory e 11 :(4.16)n't^—1 E of sr,i=R+1Not surprisingly, the only effect that the clamped neurons have on the dynamics of the un-damped neurons is to exert a constant external field, but that field will play a different rolethan the hP field when the limit hP 0 is required (as in Section 5.1.2, for example). IfSP and^are chosen at random, and hP = 0, then the magnitude of izP will be on the orderof Nilli/N = Nry/VTV.Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^724.2.2 Partition Function for n ReplicasThe partition function for a network in which the average energy per neuron is given by Equa-tion 4.16 is:Z = E e-ome(g),^(4.19){s}where the sum is over all accessible states S of the network. The ensemble average free energyper neuron is equal to:((.f)) =^ (4.20)= – —((log Z)),NRand is finite in the thermodynamic limit.The ensemble average is calculated using the replica method [1, 35, 38],((1)) = li^1# n—r0 N-1411007177 (((Zn)) — (4.21)where Zn is the partition function of n non-interacting replicas of the neural network. Eachreplica has the same memories en, and the same external fields hi, but two replicas p and omay be in different states SP and 5'. The replica method strategy is to find an analyticalexpression for the partition function Zn as a function of n, and then calculate the formalalgebraic limit (n 0). It is hard to imagine a physical interpretation of this limiting process— Amit refers to it as "an implausible exercise in a space of matrices with zero dimensions" [1,pg 337] — but it certainly is useful.The exact expression for Zn is:Zn = [E come(s){s}(4.23)01 taChapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^73exp Pory2 Nn + 13 272N^x!--, I1---1^VnP){S"}^ p=1 p=1n P+ [37N E E 77"011) //ti]p=1 p=1^Nn n P ^a72 N^2= E E e Pa 72 H H exp 2 (m pp )^[37 Nmil,P1,{.51}^p=1 p=1where the sum is over all possible groups of n states.The next step is to calculate the ensemble average Pl. In the thermodynamic limitwhere N oo, ((Z')) will be dominated by the terms where the energy is minimum. In generalthere will be clusters of such terms around different memory states, and no single cluster ofterms will dominate the sum. The external fields hiA are used to break the symmetry betweenthe attractors and force the solution to focus on the network's behaviour around a specificattractor. Assume that s of the external fields are non-zero. They might as well be the first sfields, h 1 ...V, because the numbering of the memories is arbitrary. If hp is large enough,the attractor located near memories el ...es will certainly dominate Pl. The correspondingmemories e l ... es are then referred to as condensed memories. Once the ensemble averagehas been calculated using the saddle point method, the external field can be removed by set-ting hi' = 0.Following Amit [1, pg. 333] and Lautrup [35, pg 33], the ensemble average is split into twoparts. An estimate of the amount of noise coming from the (P — s) uncondensed memories isgiven by G1:n P {„y NGi^((exp E E s /2 (mpi") 2 07Nmii; It"p=1 p=a+1where the e8+ 1 ...e13 subscript indicates that the average is only over memory vectors s 1through P. The entire ensemble average can now be written:2((z")) /3a -yNn ((= e 2 n 8 [072 N (mil) 2 + 07NMI,4 E . E Gi exp^2 k{s1}+ 011P (4.26)7Nmid e + 1( n P (GC F....^exp E Ep=1 14=8+1((N PIL p=118+ 1 exp [Breri)) (4.27)E R + 0h11) Sr]Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^744.2.3 Calculation of Average Over the Uncondensed MemoriesThis section calculates the average noise G 1 coming from the uncondensed memories, as definedin Equation 4.22. The quadratic mP terms in the exponent of G 1 can be removed using aGaussian transform:1 b2 \\ iii °°fexP 4a) = V Tr f_00exp(-ax2 bx)dx,^(4.24)where:12b = yOTV mtc:This gives the following expression for G1:(12=19.2^P n^n P2 1 x )2BC°G1^ H^exp E E {- -2 ( ff2r^-00 14=3+1 p=1 p=1 p=8-1-1whereaGC,^(4.25)The ensemble average in Gi is calculated as follows;whereAccording to Equation 2.12, each er must be either rii4 or -r;', with probabilities (1 - b)and b respectively, so the average over all the s can be written as an average over all possibleP NT2 =a E Ep=s+1(4.32)Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^75choices of {r}:N PGi = K(A 11 fl [ ( 1 - b)exp(BM) bexp(—BM)}))i=R+1 p=s+i= (( N PA H H [cosh Br + (1 - 2ort: sinh Br ]))1=R+1 µ=s+1The following expansion:1log[cosh x + a sinh x] = ax + —2(1 — a2 )x2 0(x 3 )(4.28)(4.29)makes it possible to write Gi as:P NG1 = A ((exp E > {(1 - 2b)11`.13 1; 2b(1 — b)(Br)2 0[(/4)3]}) . (4.30)p=8-1-1In the limit N oo, Br terms of order 3 and higher can be ignored, because /4This leaves the following expression for Gil :GiP N= Aexp [ E E 241_ bon21 + log ((eT2 ))r ,^(4.31)14=8+1 i=R+1whereDigress here for a moment and consider the hypothetical case where {r} is a set of randomvectors. In that case, the parameter b should be irrelevent, because adding random noise torandom vectors should have no effect on averaged behaviour. The value of ((eT2 )) is calculatedfor that case in Equation C.5 which, when replaced into Equation 4.31, gives the followingexpression for G'i. :P N nGI = Aexp 2—1 E E E BPPBP-sfsrp=8+1 i=R-1-1 p,cr=1Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^76whereB"' (\Fie + . (4.33)N PThis expression is independent of b, which tends to confirm the accuracy of the derivation sofar.Returning to the main calculation, consider the case where In is a set of exactly orthogonalvectors. The calculation of ((eT2 )), which is much more difficult for this case, is described inAppendix C.2. In that calculation, ((eT2 )) is first expressed as a function of the overlaps rl.sPthrough rs•so (Equation C.9), and then it is averaged over all possible vectors r1 through r.(Section C.4). The result (Equation C.12) is:=((eT2))P^n rexp E E8(1 - 2b)2 — (1 — V.1i)E(m7,)21 B"B"S1 •S° 10.8+1 p,o=1^ v=1where the dot product refers to a sum over the undamped indices only:Ni=R+1SP • St`^E sfsr.Replacing the expression for ((e7.2 )) into Equation 4.31 yields the following value of G'1:GC = Aexp 1^1.^t0=8+1 p,cr=11^1[2b(1 — b) + —2(1 — 2b)2 — —(ne)2 (1 — V i) BPPBI'S° • s°2^P= Aexp1^P^n^{ —2 E^E co./300.730-sp • s-}, (4.34)0=-8+1 p,P=1whereCP^1 — (1 — Vt)(7nut,)2 . (4.35)and where V is the orthogonality measure defined in Equation 2.11.The scaling factor CP will be used in all the derivations which follow to encapsulate informa-tion about the degree of orthogonality of the memory set and how it influences the free energy.Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^771.00.8-1^ V = 0.rnasasac° 0.4-V = 0.00.2-1gamma = 1.00.2^0.3^0.4^0.5^0.6^0.7m = Overlap to Nearest MemoryFigure 4.4: Plot of the scaling factor CP as a function of the overlap mpu, for three differentdegrees of orthogonality in the memory set. The ratio of visible bits, y, is 1.0 for all three plots.In the limit of random memories (V = 1), or zero correlation between the recalled memoryand stored memories (mpu = 0), CP becomes unity, and the equations reduce to the standardequations for a Hopfield network, as they should. In the limit of exactly orthogonal memories(V = 0), and exact recall (mpv = 0), CP becomes zero and the noise from the uncondensedmemories disappears, as it should. Figure 4.4 summarizes the variation in CP over this range.4.2.4 Inverse Gaussian TransformsNow that the ensemble average over {E} is complete, the Gaussian transform done in Sec-tion 4.2.2 must be reversed. In other words, G 1 must be integrated over the zPv andvariables. Expanding Equation 4.34 in terms of•1=2r2^1 P^nexp E E [ N^PCP SP•Slex°p=8-1-1 p,cr=10.0^0.1 0.8^0.9^1.01992 JULY 1V = 0.5PEp=s+1(4.39)Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^78- [2cP0 P_hP sP•s- xp - [cP02 (11P)2 SP •Scr ] (4.36)reveals the problem: there are mixed products of the integration variables el, in the exponent.To remove the mixed products, a unitary transformation must be performed in the inte-gration variables. This procedure is briefly described in [1, 35] for random memory sets. Amodified form of the procedure that is suitable for near-orthogonal memories is described inSection C.6. The result of the inverse transform is:Gl = exp —N —car(log K) k ,8 E [K-11 —2^ Pcrp,o=1(4.37)where [K] is the following n x n matrix:[K]p,, E_ (8x, —-CPSP • 5')^ (4.38)and H is the sum of the "crowd noises" from the clamped neurons:The first term in the exponent of Equation 4.37 represents the noise from undamped neurons inthe uncondensed memories, and the second and third terms represent the noise from clampedneurons in the uncondensed memories.4.2.5 Average Over Condensed MemoriesThe ensemble average over the condensed memories el ... es can now be calculated. ReplacingEquation 4.37 into Equation 4.23 results in quite a complicated function of the overlaps nisi:,because of the intl, term in CP. It is therefore not possible to linearize the exponent using aGaussian transform, as was done in Equation 4.25. Instead, all occurrences of mil, and SP•5"Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^79are replaced by new variables iiiii, and qx" using the following 6-functions:02 p 100^ 16. (qp, - --lg. SP .5) = —271- _codr I, exp {-02 zrp,„T (qp, - -37. SAS°.)] ,1132 p 005 ( this; - 74) = 27 j ,,,c, P^P^P M--^de exp [-O2 Pte (filk` - 1- eii-sr)1 ._(4.40)Terms that were previously functions of SP and 41` can now be written as functions of ?VI,and qpu :OP ..- E: 1- (fit,-)2(1- 0){1- P7aP- 13760%, V p V p = oexp -1-V.- { -car(log IC) + // /3 E [f(-1] — /31In 12^p,cr=1^(4.41)so that the averaged partition function, Equation 4.23, is equal to:dgvn)) = e pa-r2Nn (( ^. E i fc''') (8 nIsil Isnl^-°° p=1 p=1II II dfittii, dyr9 ) j f 00p=1 o=pI-1 11 imydrp,,)f pa (n-1 nn(28-1-n-1)( 2122-)71-^2O1 e.xp E E -/32 zrixraN (q - —SP •Str)]X n-1 n [ _ Pa mp=1 cr=p+1E E [P672 2^fni-;,) 2 -I- ,(37Nitei:it° - 02z4aN (x exp 71.1;,` - -1-1-ie.SP)i))1n 8^N(p=1 1.1=1 e - eThis leaves only two terms in the exponent which involve mil, or Sp, so ((Zn)) can be written:Gl =100^n^o n-1 n((ZVI ))^j j C17441149)-00 /4=1 p=1^-00^Clqfxrdr pa) Old2F21p=1 o=pwhere(4.42)F2aipi n-1 n^ nE • E exp [— E E rposP•s- + E E yit;e1" • Sp[S 1 {Sn}^7^p=1 cr=p+1^p=1 p=1X exp N IfIEE —(fit")2 yen,PlIP -- Owe^.n 8^2_^7^2^P^P^P Pp=1 (1=1(4.43)Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^80n(28-1- n-1) (32aN27r )^2^ n-1 nG2^ exp —N fi 2a" P2a^EP=1 (7=p+1andThe ensemble average can be simplified in the standard way [1, pg. 336]:E E r,,Sr Sr + E E ygisr) ))N^ a nF2 = (( E. • .E H exp afi22 n-1 n{S1 } {Sn} i=R+1^7^p=1 cr=p+1^p=1 p=1^el... escrN^ 02 n-1 n 8 n2= (( H E . • • > exp — (E E rp,,SP S + E E yAPPs ) ))i=R+1 S1 =±1 sn=±1^7^p=1 =p+1^p=1 p=1^et... eaac = N((exp > log { E • • • Ei=R+1^S1 =11 Sn=±1exp — E > r„,„SPS° + E E yigrsPcr^) }))7^p=1 =p+1^p=lp=1^e1... eaao2i n-1 n a nSection C.8 shows that the expression inside the sum is self-averaging, even though the memoryvectors are not entirely random. Let 1' be an 8-dimensional column vector, with each elementequal to either +1 or -1. The sum over i is proportional to an average over each of the 2 8possible vectors fi. through i42 .:F2 = exp N ((log E • • . Es1=±1 sn=±1n-1 n^a n^exp [10 22 E E rpo.SPS° + E E yi:TPSP^(4.45))1))P=1 °=P+ 11^p=1 p=1(71-1184.2.6 Saddle Point Calculation(4.44)When the expressions for G1, G2, and F2 are replaced into Equation 4.42, the result is a seriesof integrals over a single exponential in which the exponent is proportional to N. In the limitChapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^81of very large N, the integral is dominated by the value of the integrand at the point where theexponent of the integrand is maximum, so the saddle point method [1, pg. 138] can be used tosolve the integral. Applying the saddle point method to Equation 4.42 gives:((Zn)) = d1d2F2,where O 1 , O2 , and F2 are evaluated at the saddle point.To find the saddle point, the following 2n(2s n — 1) equations must be solved simulata-neously for 2n(2s n — 1) variables qc,,, rte , etzlf, and yl; :0^0 log di + 0 log O2 0qp,0 log O2 0 log F2 0Orm,^Orp,0^0 log di + 8 log O2 074^00,i,O log O2 0 log F2{V^: (1 < p < n) A (p < a 5_ n)}{V p, : (1 < p < n) A (p < a 5_ nilp,p : (1 5_ p < s) A (1 5_ p 5_ nil{V a,p: (1 <µ s) A (1 5_ p n)}(4.46)0 =oip +Equations 4.46 only enforce the requirement that the free energy surface be fiat. If the pointof condensation, determined by the selection of hi', is chosen poorly, these equations may givea solution for a maximum or saddle point, rather than a minimum of the energy function. Toconfirm that the solution is a minimum, the second derivatives must be checked, a task whichLautrup found "quite complicated" [35, pg. 37], even for random memories with no clampedneurons, and which has not been done here.4.2.7 Free Energy of the Replica Symmetric SystemThe established method for making the calculations manageable is to assume that the solutionswill obey replica symmetry [1, 35, 38]. That is, assume that the values of qp,, rte , 711,, andinhChapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^82yr, will be identical for all replicas. Under that assumption, the values of the variables at thesaddle point can be written:qP0 I saddle qr Ipa 'saddlesaddleYt,; saddle= rThis reduces the number of variables to (2s + 2), so Equations 4.46 reduce to (2s + 2) equationsin q, r, mo, and yii.Rather than n scaling factors CP there is now only one:^C = 1 — (1 — Vi)(mt)2 ,^ (4.47)where v is the index of the memory which is most similar to the condensed state.Sections C.9, C.10, and C.11 derive expressions for O1, O2, and F2 for the replica symmetriccase and for the limit of small n. When these are used in Equation 4.42, the result is:^((Zn)) = exp [—nNO On]^(4.48)where:((f)) =^--a—log[1 + 07C(q — 1)]1 20oryCq + 411 2 [1 + 07C(q — 1)]1^ 8^1+ —2 `H+ cry — cOr(q — 1)] — v" [--2(e )2 + -ymf:h 14 — /3cemhilL-1 2 ' ‘ 11) I0=1— 1Q (0.0g 2 cosh /3 (lictrz + oz0 It yt`TP))))^.^(4.49)14=1^71...T2.Replacing this into Equation 4.21 and using ( nxe = 1 nx) in the limit n^0 confirms thatthe expression in Equation 4.49 is the average free energy per neuron.Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^83Equations for q, r, mf:, and yo are found by requiring that the partial derivatives of ((f))with respect to all variables are zero, as in Equations 4.46. This ensures a solution which isat a saddle point, but it does not ensure a minimum in the free energy. The calculations aredescribed in Section C.12, and result in the following mean field equations:mh = (((T° tanh^ E/7: z E (72 /4 + 711v G3(q, me)) TuY=1^lk71...T288q = (((tanh 2 p[vaT-z+ E(72mf:^— G3(q,mt))TP1)))N-1 71...T28(4.50)(4.51)7C(cryCq + 4H)r —^ (4.52)a [1 + #7C(q — J.)? 2where the (((• • a symbol represents a combined ensemble average and an average over z usinga Gaussian window, as defined in Equation C.57, and where72 4(1 — VT)q)]2[a — ceP7C(1 — q) 2 — 40(q — 1) .H}G3(q, mfi` ) =^ (4.53)[1 — 07C(1 — q)]In the limits V = 1 and M = N, the network should behave as a standard Hopfield network,because there is effectively no correlation between the memories and the set of orthogonal mem-ories. These values of V and m imply that y = 1, C = 1, Ito /to, G3 (q, mo ) = 0, and H 0,so the mean-field equations become:q = (((tanh2 13 arz + E(74 h°)T°1)))0.1 71...T28r = ^[1 + /3(q —mh = (((T° tanh p[fcir z^(mt h!')Tul)))V=1^ 71...T28Thus the new equations are equivalent, in the limit, to the standard equations for a Hopfieldnetwork [1, pg. 293].Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^844.3 Summary of the Free-Energy CalculationThe calculations in Sections 4.1 and 4.2 can be summarized as follows:• The update rule rule given in Equation 2.5 was modified so that neural evolution becameprobabilistic, with probabilities determined by Glauber dynamics (Equation 4.3). Neuralstates were allowed to evolve contrary to the postsynaptic potential with a probabilitywhich depends on a "temperature" parameter.• The probability of a network at equilibrium and with temperature A being in a state withenergy E was shown to be proportional to e-OEA.• The free energy F was defined to be a function of a set of "order parameters" by settingit equal to the logarithm of the partition function Z, where Z is summed over only thosestates which are compatible with the given order parameters.• The ensemble average partition function, Z", of a set of n identical networks (replicas)was calculated as a function of the order parameters. It was calculated in two parts:first for "uncondensed" memories, which are not associated with a large overlap orderparameter mo, and second for "condensed" memories.• It was shown that the "crowd noise" from the uncondensed memories diminishes in pro-portion to a scaling factor CP (Equation 4.35) as the stored memory sets become moreorthogonal.• The exact expression for Z" is given in Equation 4.42 in the form of a series of nestedintegrals of a complicated exponential function, and was solved using the saddle pointmethod. The location of the saddlepoint is determined by a set of mean field equationsin the integration variables, and can be translated into a set of mean field equations inthe order parameters.Chapter 4. The Free-Energy of a Network with Near-Orthogonal Memories^85• The mean field equations were simplified by assuming that all the order parameters areidentical for each of the n replicas ("replica symmetry"). The resulting equations, 4.50through 4.52, are an important result in this chapter.• The ensemble average free energy ((f)) was calculated from Zn using the standard replica"trick", as described in Equation 4.21.Chapter 5Analytical Estimate of Network Storage CapacityThis chapter derives an analytical estimate of the storage capacity of the EH network. Thecalculation is done in three stages:• Section 5.1 calculates how many near-orthogonal memories can be successfully stored ina Hopfield network.• Section 5.2 estimates the degree of orthogonality which can be achieved using the EH net-work's roll-up procedure, as a function of the number of memories installed and the num-ber of hidden neurons.• Section 5.3 combines these results to achieve an analytical estimate of the capacity of anEH network as a function of the number of hidden neurons.5.1 Hopfield Network Capacity for Near-Orthogonal MemoriesThis section uses the mean-field equations derived in Section 4.2.7 to predict the memorycapacity of a Hopfield network for near-orthogonal memory sets. The calculation parallels thestandard solution for random memories [1, 35], but with some added complexities because ofnew terms in the mean-field equations.86Chapter 5. Analytical Estimate of Network Storage Capacity^ 875.1.1 Mean Field Equations at Zero TemperatureA very useful simplification of the mean-field equations (Equations 4.50, 4.51, and 4.52) is totake the zero-temperature limit. The first step is to calculate B /3(1 — q) in the limit oo.This is done in Section D.1, and results in the following implicit equation for B:B^lim /3(1 —/3—).00\\e-xP27 tarrLa [rnh + l' 7(1 V (amr, 411.13)11)p=i^ ' (1 — 7CB) 2v. )^(5.1)71—T2aIt follows immediately that the zero-temperature mean-field equations can be written interms of B, using Equation D.8, as follows:^1 ^ amf: 4i5iB) })nit = ((Tiled{^E (72mri+7hu 72(1 VI)^ 5.2)tar [1 — 7CB] 2 Ti...T2sq = 1^ (5.3)- 7C(a7Cq 4H) [1 - 7C.13]2^(5 .4)5.1.2 Memory Capacity with No Clamped NeuronsEquations 5.2, 5.3, and 5.4, which in general must be solved numerically, can be solved ana-lytically in certain relatively simple cases. One such case, the subject of this section, is thecalculation of the memory capacity for a network in which none of the neurons are clamped(y = 1) and the temperature is zero (0 = oo).The memory capacity is equal to the largest value of a for which the expected overlap isdose to unity when a single state is recovered (s = 1). In terms of the parameters used in themean-field equations, these conditions can be written:= 1Chapter 5. Analytical Estimate of Network Storage Capacity^ 88s = 1= 00It° = 0^V itft = E oP)2 = O./1=1.With only a single recovered state, the ensemble averages in Equations 5.1 and 5.2 can becalculated explicitly. Equation 5.1 becomes:1,(1 -17i)(12) 2^p)2}B = ^ exp12a1 r (1^(1 - CB)1j e _y2(5.5)where^(1- VT)a) ^y^(1^ (5.6)(1- CB) 2 )Note that the variable y has nothing to do with the integration parameter yP that was used inChapter 4. The mean-field equations, Equations 5.2, 5.3, and 5.4, reduce to:nt` = erf(y)q = 1C2r = ^(1- CB) 2Section D.2 describes the derivation of a simultaneous solution for these three equations.The result is the following equation for the capacity a, as a function of y and V:2[ - CB ^(Vy-2c2 2(erfo(i_ vt) _ Cy)a =^ (5.7)2 (1 - Vt. ) erf(y)where:B = 2(1 - VI) erf(y) e-Y2 {Vy 2C2 2(erfy) 2(1 - Vt) - Cy}(5.8)V = 1.00.050.000.0 5.01.0^2.0^3.0^4.01.2-1.0 >-0.8 'tu.as-0.6 2rOA11r0.2FO.06.00.35m.3„,V=0.25LLu- 0.25rnc/ 1 /^V = 0.5/_/":"NA^V = 0.750.40V = 0.00.30Chapter 5. Analytical Estimate of Network Storage Capacity^ 89cap vsv2 :AkshisOrY y - parameter 1892 JULY 5Figure 5.1: Plots of Equation 5.7, showing that for each level of orthogonality, solutions to themean field equations exist only when memory loading is below a critical level. The y-parameteris defined in Equation 5.6 and is only useful as an aid in the solution of the mean-field equations.Figure 5.1 is a plot of Equation 5.7 for various values of the orthogonality measure V. Theplot for V = 1 is identical to Figure 5.5.1 in Lautrup's notes [35, pg. 48], as expected. Theplots illustrate that for each level of orthogonality, there is a loading factor a c (V) above whichno solution exists. The memory capacity of the network is equal to a c (V). A graph of mf:(y)is included in the figure for reference, and to illustrate that mt: is always close to unity whena is maximum. This means that the state of the network which is compatible with maximumloading factor is also a good-quality retrieval state.The relationship between mf: and a can be seen more clearly in Figure 5.2, where theparameter y has been eliminated. As before, no solutions are available for [a > ac (V)]. Thefigure shows that as the memory loading a gets smaller, solutions are available in which mf:is loser and closer to unity. This means that recovered memories tend to be exactly equal tostored memories when memory loading is much lower than memory capacity, which is consistentwith experiments./V = 0.25' V = 0.0V = 0.75 V "1.2E 1.0V =a08 0.60.aTD 0.4IIE0.00.0^0.1cap vat : MOtAlphaChapter 5. Analytical Estimate of Network Storage Capacity^ 900.2alpha - loading factor0.3^0 41902 JULY 7Figure 5.2: Plots of Equation 5.7, showing the retrieval quality m as a function of the loadfactor a. This figure is comparable to Amit's Figure 6.11 [1].If the values of ac (V) are plotted versus V, the result is the graph shown in Figure 5.3.It shows the predicted memory capacity of the network as a function of the orthogonalitymeasure V. It is important to keep in mind that this calculation of "memory capacity" assumesthat there are no hidden neurons, so all neurons are initialized to the memory state whenstability is measured. This form of memory capacity was measured in Section 2.3.3, and thecalculated memory capacity matches those measurements very nicely.It may be possible, in future research, to extend the above calculations to predict the stabilityof near-orthogonal memories in the sense defined in Section 3.1.3, where hidden neurons do notplay an "unfair" role in contributing to the stability. The goal of such research would be toanalytically predict the locations of the curves in Figure 3.9, rather than just the values of theirx-axis intersections.0.6a_13 0.41Analytical ResultE0.1-1 + Simulations0.0^11111111100^0.1^0.2^0.3^0.4^0.5^0.6^0.7^0.8^0.9Cap_VsV2 :AMON^V = Orthogonality Measure^1992 JULY 510Chapter 5. Analytical Estimate of Network Storage Capacity^ 91Figure 5.3: Comparison of predicted capacity a c (V) to measured capacity. The line is a graphof the maximum value of a which is a solution of Equation 5.7. The data points are the samedata points as in Figure 2.3.5.2 Statistical Estimate of Overlap After Roll-UpThe following sections derive an approximate expression for the degree of orthogonality whichcan be achieved by the EH network's roll-up process. The calculations assume that the networkis rolling up approximately as described in Section 3.2. The only difference is that all updatesare assumed to be asynchronous — the initial synchronous updates described in Section 3.2 areignored.The presentation proceeds as follows. Sections 5.2.1 and 5.2.2 review the notation andintroduce a new variable, the "neural alignment" Xi. Section 5.2.3 introduces P(X,t), the sta-tistical distribution of Xi, and Section 5.2.4 examines how P(X, t) changes as the network rollsup. The resulting equations are solved numerically in Section 5.2.5 to predict the rms overlapbetween the most recent memory and all previous memories. Section 5.2.6 shows how this canbe used to calculate the average overlap between all memories.Chapter 5. Analytical Estimate of Network Storage Capacity^ 925.2.1 NotationThe notation used in the following sections is an extension of that introduced in Chapter 3.The network has N neurons, R of which are visible and therefore not free to change, andM = (N - R) of which are hidden and free to change. Because the purpose of the calculation isto determine the mean overlap of the pth memory, it is assumed that there are (p -1) memoriesalready installed when the roll-up process begins. The following two ratios are useful:a P^^(p - 1)- =s-7 =-Although the time-evolution of the network is not the focus of these calculations, someconcept of "time" needs to be introduced so that time-varying characteristics can be expressed.The natural measure of time, represented by the symbol t, starts at zero before roll-up begins,and increases by one whenever a neuron in the network changes state. In addition, a scale-invariant time r, defined as:tT N'is used for some calculations.M_(5.9)The overlap calculated in this section is given a new symbol, OP , defined as:p-i^))1/TV- ^ E (mµ" ) 2 2 .p - 1 ,,=.1(5.10)This is the rms overlap between one newly-added memory el', and each previous memory, andso is not the same as Onus , the normalized rms overlap between all previous memories. Therelationship between Orms and O1 is discussed in Section 5.2.6.Chapter 5. Analytical Estimate of Network Storage Capacity^ 935.2.2 Overlap as a Function of Neural AlignmentsThis section introduces the "neural alignment" Xi and shows how the expected normalized rmsoverlap can be calculated using it. Xi is a measure of how well the alignments between bit iof each memory and bit i of the current state g' are matched to the overlaps my between eachmemory eu and the current state, and is therefore closely related to the negative of the neuralenergy Ei. It is defined as follows:p- 1 N 14-1E music = [-2E1 + aSi] I E^(5.11)u=i^ N j.iThe value of Xi for each neuron Si determines whether that neuron will be inverted duringroll-up. This can be seen by expanding the input, Ui, to each neuron as follows:NUi = E TijSjN1-gf L., L.,^—^— 1) siiij=1 v=i=^(xi — a) . (5.12)The roll-up process described in Section 3.2 inverts only neurons where the state Si is the samesign as the input Ui, and hence from Equation 5.12, only neurons where (Xi > a). The roll-upprocess can thus be viewed as a process for shifting all the Xi which are above a until they areall below a.There is a simple relationship between the average neural alignment and the rms overlap.The sum of all neural alignments Xi is equal to:N^p-1 N N^p-1E xi = —N E E E sisierei; = N E(m,-)2.i=1^v=1 i=1 j=1 v=1A comparison of this to Equation 5.10 leads to the following exact relationship:-1)i Xi))(5.13)Chapter 5. Analytical Estimate of Network Storage Capacity^ 945.2.3 Probability Distribution of the Neural AlignmentsTo proceed further, it is necessary to define a probability distribution, P(X, t), for the neuralalignments Xi. The value of P(X, t) is proportional to the probability that Xi, for some i chosenat random, will be equal to X. P(X, t) acts as a probability density, so that the probabilitythat the value of Xi will be between a and b is equal to:Pab(t)^P(X, t) dX.X =aStrictly speaking, P(X, t) can only be considered a probability density in the limit (N —> oo),where the allowed values of Xi approach a continuum.Calculations of the dynamics of P(X, t) are based on the following ansatz:Working Assumption: P(X,t) is assumed to be Gaussian at all times during roll-up, al-though the mean, ±(t), and variance, Vx(t), may change.The general form for P(X,t) is thereforeP(X,t) =^1 ^[—(X — X(t)) 2 . OrVx(t) exP^2Vx(0Section D.3 shows that the mean and the variance are both equal to a initially:X(0) = Vx(0) = a.A sketch of the initial distribution for a = 0.2 is shown in Figure 5.4.(5.14)5.2.4 Dynamics of the Probability DistributionThis section derives the equations which will be used to calculate the rms overlap between anew memory and all previous memories. Two equations will be derived: one which describeshow X varies with time, and another which calculates how long the roll-up process takes.Chapter 5. Analytical Estimate of Network Storage Capacity^ 95EvolPxt Value of Xi 1993 FEB 20Figure 5.4: Sketch of the distribution of Xi for a -= 0.2, before the network rolls up. The verticalscale is P(X, 0), the probability density of Xi. Neurons for which (Xi > a) are candidates forbeing inverted during roll-up. The average value of all such candidates is marked (X).The time evolution of P(X, t) is sketched in Figure 5.5, and can be summarized as follows.Initially, P(X, 0) is a Gaussian distribution centered at X = a, and with variance a. As eachneuron is inverted, the Gaussian gets narrower, and its center, fC(t), moves to the left. (X), theaverage of all the Xi above a, also moves to the left. The motion of the Gaussian is self-limitingbecause its speed is proportional to ((X) — a), and so diminishes with distance travelled. TheGaussian continues to shift to the left and get narrower until all undamped neurons have beenused up.All changes in P(X, t) are driven by changes in AM. Section D.4 derives the followingexpression for the rate of change of _Pt), averaged over many updates:d =^— (X))The following equivalent expression using time T is independent of N:(5.15)dXdr = 4(a — (X)) (5.16)2.5-1 alpha = 0.2x> (t3)cx>(12 )<x>(ti)x> (to)00.5-1-ir.2 T -O.() 0.2^0.4^0.6^0.8Value of Xi 1999 FEB 20Chapter 5. Analytical Estimate of Network Storage Capacity^ 96Figure 5.5: Sketch of the evolution of the distribution of Xi as the network rolls up, for anetwork with a = 0.2. The initial probability distribution is labelled as t o , and examples ofsubsequent distributions are labelled t1 , t2 , and t3 . The average of all Xi > a are markedas (X)(to) through (X)(t3). As the network rolls up, the distribution becomes narrower, itsmean moves toward zero, and (X) approaches a.Section D.5 shows that (X) is equal to:(X) = Xwhere112Vx e-Y2r 1 - erf(y) (5.17)(a - X)Y V21/7The following estimate for the variance Vx is derived in Section D.6:Vx ks: .7—C [1 + a (Orms )2 - 1C- 1where O is the rms overlap between all previously-stored memories CY. If the O term wereexplicitly kept in, the equations describing X for each memory loading would become coupledto all the equations for smaller memory loadings. This would increase the complexity of themathematical analysis enormously, and for very little gain.0.00.0^0.1^0.2^0.3MultAtta time (tau)0.41993 FEB 20VX =^[1.^(0.5)2« - ki (5.18)Chapter 5. Analytical Estimate of Network Storage Capacity^ 97Figure 5.6: Numerical solutions for the time evolution of the mean neural alignment X, fora = 0.05, 0.1, 0.2, and 0.4. The value plotted is (X/a)i. The time variable r is 1/N times thenumber of neurons which have been inverted since roll-up began. The graphs are solutions ofEquation 5.15, with the initial condition (iC(0) = a).For simplicity, a constant has been used in place of Onns for the calculations that follow.The experimental results in Chapter 3, such as those shown in Figure 3.3, suggest that a typicalvalue for Orms is 0.5, so Vx is set equal to:Equations 5.16, 5.17, and 5.18 can be combined to form a single differential equation in X,which can be solved numerically to yield .t as a function of time. The functional form of X(t)depends on a, both because a appears explicitly in the equations, and because X is initializedto a. Figure 5.6 shows solutions of this equation for four values of a.Before the calculation of Or.,8 can be completed, the length of time required for roll-up mustbe determined. It is clear from Figure 5.6 that the longer the roll-up takes, the smaller X willbe when roll-up stops. The time required to roll up will in turn depend on what fraction of theChapter 5. Analytical Estimate of Network Storage Capacity^ 98neurons are unclamped. If, for example, 90% of the neurons are clamped, the roll-up processwill truncate after a short period of time, and the final average overlap will be large.Section D.7 derives a differential equation which describes how the number of "invert-ible" neurons changes with time. To be invertible, a neuron Si must be unclamped and musthave (Xi > a). The total number of invertible neurons is represented by the symbol B(t) andobeys the following differential equation:dB(t)^B(t) [dPT= —1+ dt PT^dt +1 ,where PT is the total probability that Xi is greater than a, and is equal to:— XPT E 1^a[1 - erf fix )1 .The initial number of invertible neurons in the network, B(0), is set to (7N)/2.(5.19)A scale-invariant version of B(t) is:(tb(r) = BN)where T was defined in Equation 5.9. Equation 5.19 can then be written using b(r) in a mannerwhich is independent of N, as follows:db(r) = _ 1 + PT ^1Jdr (5.20)Figure 5.7 shows four solutions of this equation on the same time scale as a solution for X(r).5.2.5 Numerical Calculations of Average OverlapsThe solutions of Equations 5.15 and 5.20 can be used together to predict O0, as illustratedin Figure 5.7. The time Ty at which the graph of b(r) crosses zero in Figure 5.7 is the amountof time available for the network to roll up. The value of .7-C(rf) is then the average neural.50.30.05^0.1^0.15^0.2^0.25Evoner01900 FEB 20time (tau)_Of.lms = iX(arf ) 1 2 (5.21)Chapter 5. Analytical Estimate of Network Storage Capacity^ 990.207:I..c-asm .0.15-,_0.107R 70.00X(t)b(t)0.90.0.0.1.0.2.0.3•.44 •0.50.3zrnC'asEecrC23ez11.0wcy, 0.05igamma =ft 11111.."..^10.1Figure 5.7: Illustration of how to calculate O0 from X(r) and b(r), for a = 0.2, and for 7 = 0.1,0.3, 0.5, and 0.9. The top graph shows the average neural alignment, X(r), and is a solution ofEquation 5.15. The bottom graphs show 1/N times the number of invertible neurons, b(r), andare solutions of Equation 5.20. The time variable r is 1/N times the number of neurons whichhave been inverted since roll-up began. To calculate X for a given value of 7, draw a line upfrom the x-intersection of the graph of b(r), and then across to the Xi axis. The rms overlap isthen equal to PC/a.alignment when roll-up is completed. It follows from Equation 5.13 that the rms overlap canbe calculated from .k(rf) as follows:Note that this solution is independent of N, and so is applicable to networks of any size.Figure 5.8 compares the results of the numerical integration of Equations 5.15 and 5.20 tothe average values of O0 seen in EH network simulations. The analytical solutions match theobserved data very well for all values of a and 7, which confirms the integrity of the analysis.( 14V^-1 ) E E (my/N N Eu.2 Ep.i 1 v=2 p=1(Orms )2 =1 1(5.22)Chapter 5. Analytical Estimate of Network Storage Capacity^ 100t13 05.^-0inE 0.4-- 0.3-0.2-0.1-1.11.0H0.9-E °I-I: 0.6-as0.0*0^0.1^0.2^03^OA^0.5^0.6^0.7Statlioll Memory Loading (alpha) 1999 FEB 19Figure 5.8: Predicted and observed overlap of the most recent memory to all previous memories,as a function of the number of memories stored. a is equal to the number of the current memory,divided by N. y is equal to the number of neurons which are unclamped, divided by N. Solidlines are solutions of Equations 5.15 and 5.20 and marked points are averages from 100 differentsimulated memory sets. This figure shows the average overlaps between the most recent memoryand all previous memories, which is in contrast to Figure 3.2, where the overlaps between allmemories were included in the average.5.2.6 Average Overlap of All MemoriesAn analytical estimate of O0, the expected overlap between the most recent memory and eachprevious memory, has been derived, but an expression for O, the average overlap between allprevious memories, would be more useful. The latter can be calculated from the former if O rmsis expanded using Equation 2.9:This can be written in terms of 0'as follows:(Orms)2 = EL2( 1/ — 1) (O ) 2 211 (F1 — 1)1.1-11.0-10.9-10.8-E 0.7-0_a 0.6-4f-A5 0.5-g 0.4-0.3-0.2-0.1-0.000 0.1^0.2^0.3^0.4^0.5Statliol St9tAva^Memory Loading (alpha)016^0171993FEB192(5.23)(v — 1) (oz,;,,^ )2ti( — 1) ,2Chapter 5. Analytical Estimate of Network Storage Capacity^ 101Figure 5.9: Predicted and observed overlaps of all memories, as a function of the number ofmemories stored. Solid lines are solutions of Equations 5.15, 5.20, and 5.23. Marked points areaverages from 50 different memory sets in a bi-state network, and were previously presented inFigure 3.3.Equation 5.23 makes it possible to translate the solution of Equations 5.15 and 5.20 intoan expression for Orms rather than O0. The result, as shown in Figure 5.9, matches thesimulation data quite well.5.3 EH Network Storage CapacityThe analytical results from Sections 5.1 and 5.2 can now be combined to (almost) yield aprediction of the memory capacity of the EH network as a function of the number of hiddenneurons. This is an "almost" prediction because it only estimates the number of memoriesthat will be stable in the Hopfield sense — that is, will be stable when all the visible and hiddenChapter 5. Analytical Estimate of Network Storage Capacity^ 102neurons are initialized to the memory. This is a weaker form of stability than "true" EH networkstability, defined in Section 3.1.3, where memories are stable when only the visible neurons areinitialized to the memory. In the following paragraphs, capacities calculated using this weakerform of stability are referred to as "whole-vector" capacities.Figure 5.10 shows graphically how to calculate the whole-vector capacity of the networkusing the analytical results from Sections 5.1 and 5.2.5. The line descending to the right is thesolution of the mean-field equations in Section 5.1 and was previously presented in Figure 5.3.It delineates the maximum number of near-orthogonal memories that can be safely stored, as afunction of their level of orthogonality. The lines ascending to the right were generated using thestatistical estimate in Section 5.2.5 and were previously presented in Figure 5.9. They describethe level of orthogonality that can be attained by the roll-up process, for a given number ofstored memories. If the ansatz made in Section 2.4 is correct, the intersection points on thesetwo graphs mark the maximum number of memories that can be stored using the EH processwithout losing "whole vector" stability.The locations of these intersections are plotted in Figure 5.11, and compared to the observedmemory capacity of the EH network. When R 0.4N, the memory capacity approximatelymatches the analytical prediction, but the prediction diverges from the observed for smallervalues of R. An explanation for this is that networks with large numbers of visible neurons arelimited by their storage capacity, whereas networks with small numbers of visible neurons arelimited by their recall capability. "Whole vector" stability ignores the recall process because itassumes that the memory has already been "recalled" and only tests for stability. This explainswhy the prediction in Figure 5.11 fails for (R/N < 0.4). Better results will only be achievedwhen an analytical description of the recall process is developed.Three important conclusions can be reached from the results of this section:1.11.00.9-10.8-1a.asg 0.710 0.6-10.5H0.4-100.3j0.1H0.0Capacity ofNear - Orthogonals0.1^0.2^0.3^0.4^0.5^0.6^0.7R4Theo2 NetCapy1^Memory Loading (alpha) 1993 FEB 21Chapter 5. Analytical Estimate of Network Storage Capacity^ 103Figure 5.10: Graphical estimate of EH network capacity for "whole-vector" stability. The linecurving down from left to right is identical to Figure 5.3, except that the axes are reversedand the data is plotted versus the rms overlap O, rather than the orthogonality measure V.The lines rising to the right are identical to those in Figure 5.9. If the ansatz in Section 2.4is correct, the intersections of the lines will be the memory capacities of the network for eachnumber of visible neurons.• The good match between the analytical predictions of network capacity and the measuredcapacity adds to the confidence that a good mathematical model of the EH network hasbeen achieved.• The success of the technique illustrated by Figure 5.10 suggests that the ansatz made inSection 2.4 was a good one, and may be used for further calculations. Further researchinto the recall process needs to be done to clarify what the slight misalignment betweenthe predicted and observed capacity for large R/N means. It may indicate that theequivalance principle introduced in the ansatz is inexact; or it may be an effect due to theunmodelled recall process.• The analytical predictions indicate that roll-up effectiveness and network capacity areindependent of the size of the network. This suggests that the performance benefits0.400.350.30-Lo_0.25-0S. 0.20-as0 0.15-E2 0.10-0.05-1Chapter 5. Analytical Estimate of Network Storage Capacity^ 104Analytical Estimate of"Whole Vector /Storage Capacity-Retrieval Limit+ Measured Capacity0.00111111.1110.0^0.1^0.2^0.3^0.4^0.5^0.6^0.7^0.8^0.9HUMPI NMCPPY2^Ratio of Visible Neurons (R/N)^93 FEB 21Figure 5.11: Estimate of the true EH network capacity. The solid line going down to the right isthe analytical prediction of memory capacity, as determined by the intersections in Figure 5.10.The derivation leading to this result considers only the storage capacity of the network, asmeasured by "whole vector" stability. The "retrieval limit" line marks the theoretical minimumnumber of visible neurons required to distinguish one memory, as derived in Equation 3.4.Marked points are the observed capacity of the network, and were first presented in Figure 3.11.documented in Chapter 3 for networks of 100 neurons will be observed in networks of allsizes.1 0Chapter 6The Information Content of Near-Orthogonal MemoriesThis chapter calculates the information content of memory sets which are exactly orthogonalor near-orthogonal. The derivation proceeds as follows:• 6.1 : Information in Exactly Orthogonal Memories. An expression for the infor-mation content of a set of exactly orthogonal memories is derived, and confirmed by anexhaustive "brute force" calculation.• 6.2 : Information in Near-Orthogonal Memories. The information content ofmemory sets which are not exactly orthogonal is derived.• 6.3 : Comparing Information Content and Capacity. The information capacity ofthe EH network is compared to the information content of near-orthogonal memory setswith the same number of memories and the same average orthogonality.6.1 Information Content of Exactly Orthogonal MemoriesThis section focuses on the problem of calculating the information content of a set of exactlyorthogonal memories. The motivation for this undertaking is that it will serve as a startingpoint for calculating the information content of near-orthogonal memory sets, as describedin Section 6.2. Section 6.1.1 briefly reviews the problems inherent in achieving an analytical105Chapter 6. The Information Content of Near-Orthogonal Memories^106expression for the information content of orthogonal memories, and discusses some possibleanalytical approaches and their limitations. Section 6.1.2 describes a numerical method forexactly calculating the information content, based on an apparently reasonable simplifyingassumption. In Section 6.1.3 the numerical results are compared to analytical predictions andconclusions are drawn.6.1.1 Analytical Calculation of Information ContentWhen someone adds a new memory to a set of memories, the amount of information in that newmemory is determined by how much freedom that person had to select the memory. This makesintuitive sense — one can communicate more effectively with a large vocabulary than with onlyfive or ten words. Specifically, if Q is the set of all possible combinations of memory vectorswhich could be selected, then the information contained in the selected set is the logarithm ofthe size of Q [49][5]:S = log(g(Q))^ (6.1)There is a direct analogy with a statistical mechanical system where Q is the space of allaccessible states of the system and S is its entropy.It is also common in the literature [28, 49, 1] to define the information in terms of theprobability P(e) of each vector el' being selected:S = E — P (e) log P (e)^(6.2)In a system with G possible states, all equally probable, the probability of any state is G , so:S = E — G log (—G )S = log GChapter 6. The Information Content of Near-Orthogonal Memories^ 107which is consistent with Equation 6.1. For systems where not all vectors are equally probable,Equation 6.2 must be used.It would be natural to calculate the size of Q as the product of the sizes of the state spacefor each memory in the set. In that case, the information in a set of memories would be the sumof the information in each memory. This would be over-counting, however, because the order inwhich the memories are added to the set is irrelevant. In other words, if two memories A and Bare to be stored, it would count (A,B) as a different memory set from (B,A). To compensate,the size of Q must be divided my M!, where M is the number of memories, with the result thatthe degeneracy g(Q) is:Af-iMill g(v,p)p=0M-1 (NII fi +'= n ^p=0 gThe information content in the most recently added memory is:S(N, p) = log(g(N, p)) — log(p + 1)^ (6.4)where p is the index of the memory and, for reasons which become dear in Section 6.1.2, theinitial memory is referred to as the zeroeth.It is simple to calculate the information content of randomly chosen memories because eachbit is equally likely to be +1 or -1. There are 2 N ways to select each memory, so the informationin each new memory is [N log 2 — log(p 1)].Consider now the case where all elements of a set of memories are required to be mutuallyorthogonal. If this is the only condition imposed on the memory set, then each bit is stillequally likely to be +1 or -1. When the first memory is selected, orthogonality is no constraintat all, so the accessible space for the first memory is:g(Q) =(6.3)g(N,O) = 2NChapter 6. The Information Content of Near-Orthogonal Memories^ 108The only constraint on the second memory is that exactly half of its bits be identical to thefirst memory. If N is odd, this is immediately impossible — so memory sets with odd numbersof neurons are immediately eliminated from the discussion. If N is even, the accessible spaceis:g(N , 1) = (NN12 )Similar arguments lead to the following expression for the third memory:N/2 2g(N,2)^(N/4)It is useful to view each memory as one vertex of an N-dimensional hypercube, as is com-monly done in the literature [1, pg 32]. According to that view, the first three memories makeup a triad of orthogonal vectors in the N-dimensional space. The expressions for g for thefirst three memories are simple, because all possible sets of three vectors are equivalent, up toa rotation or reflection in the N-dimensional space. Reflections and rotations are achieved byeither a gauge transform [1, pg 184], or by reordering the axes. In a gauge transform, a subsetof bits in each memory EP is inverted using the transform-vector e:It is possible, by rotations and reflections, to transform any set of three orthogonal vectors intothe following set:++++++++++++++++++++++++++++++++++++++++++++++++^++++++++^++++++++^The expression for the size of the accessible space for e3 is more complicated:N/4g(N,3) = ^(Ntl ) 4i.0(6.5)Chapter 6. The Information Content of Near-Orthogonal Memories^ 109This increased complexity reflects the fact that there are many ways to generate e which arenot identical up to a rotation or reflection. The measure of the information in e ° must take intoaccount all of these possible "isomers" of the basis vectors. For the fifth and higher memories,each isomer spawns a new family of isomers, and the complexity of the problem quickly becomesintolerable.There are a number of ways to proceed from here. The brute force approach is to donumerical calculations of g(N,p). The complexity of this task is considerable, and one approachis described in 6.1.2. The other approaches involve some form of approximation.The simplest and most successful analytical estimate of the information capacity is based onthe N-dimensional hypercube interpretation of the vector space. In an N-dimensional network,which corresponds to an N-dimensional space, there are 2N ways to insert e° . In order tobe orthogonal to e°, el is restricted to an N - 1 dimensional subspace. If that subspace istopologically similar to the full space, it can be expected that there will be 2N-1 ways to forme1 . Following this line of reasoning, the number of accessible states for memory e' is:g(N,/.1) 2N-P (6.6)In fact the (N - 1)-dimensional subspace is not very similar to the full space. In geometricalterms, although the reduced-dimension space left over after e has a "volume" of 2N-1 , a largepart of that volume is not accessible to el because of the constraint that el lie on the verticesof the hypercube. Equation 6.6 is therefore an overestimate of the numbers of accessible states.There are two reasons why it is still an acceptable approximation:• Section 6.1.3 shows, for one example set of orthogonal vectors, that the error due toover-estimation is not large.• The argument gains strength when the requirement that all memories be strictly orthog-onal is relaxed, because all the "hidden corners" of the space become accessible.Chapter 6. The Information Content of Near-Orthogonal Memories^ 110Combining Equations 6.4 and 6.6, the predicted information content of the pth memory is:S(N, A)^(N — p) log(2) — log(A + 1)^ (6.7)6.1.2 Brute Force Calculation of Information ContentA method has been developed for the exact calculation of the information contained in eachnew memory of a memory set which is constrained to be exactly orthogonal. This methodassumes that N, the size of the network, is an integer power of two, so that a set of memoriescan be created which is both exactly orthogonal and complete. That is, it will be possible togenerate one memory for every dimension in the hyperspace.The information content S(N, p) of the pth memory is determined by the size g(N, p) ofthe vector space from which potential vectors can be chosen, as in Equation 6.1. It has alreadybeen shown that the complexity of the expressions for g(N, p) begins to blow up when reachesthree (Equation 6.5). The problem can be made tractable by assuming that the informationcontained in each new memory is independent of the specific set of memories stored previously.Under that assumption, it is acceptable to assume a specific set of exactly orthogonal previ-ous memories e° ...em-1 when calculating the information in the Mth memory eif . Becausethose previous memories are known precisely, the size of only one vector subspace needs to becalculated.The canonical memory set P defined in Section A.1 was used as the set of exactly orthogonalmemories. There is a direct correspondence between the set of memories constituting the nthlevel of r(m) and the set I'(n), for all m. Consider, for example, the level 4 memories in table A.1.When the block ++++---- is mapped to a +1 bit, and the block ----++++ is mapped to a -1(6.8)++++++--+-+-+--+Chapter 6. The Information Content of Near-Orthogonal Memories^ 111bit, then level 4 of r(32) corresponds exactly to T( 4), the canonical set of 4-neuron memories:++++----++++----++++----++++----++++----++++^++++----++++++++^++++++++^++++++++^++++----++++++++----These features of r(N) simplify the calculation of the information content in some impor-tant ways. The most important is that they permit the imposition of progressively strongerrestrictions on the sum of the bits in each block of the new memories. Specifically:• Consider a memory as being divided into 2" blocks of width N/2". For all memoriesnumbered p = 2" or higher, the sum of the bits within each of these blocks must be zero.In other words, there must be an equal number of +1 and -1 bits within each block.In mathematical terms, this restriction is:N/ 2n^V n : 1^< n < log2 (N)E 4+..2n = 0^y p : p> 2"^(6.9)V m : 0 < m < 2nIn table A.1, for example, the sum of bits 9 through 16 must be zero in any memory numbered4 or higher.The degeneracy for any memory c", where p is an integer power of two, is therefore:(BP ) LP2where BP and LP are the block width and level number defined in Section A.1. Note that thisis the number of vector states which could be chosen to be e2n . For the purposes of calculatingthe degeneracy of later memories, this memory is allowed to be only one state - the one definedby Equation A.1.9(N,µ) = for all p = LP^ (6.10)14 L., il'4.BA-1- h93 =1where 0 < i < LP^(6.11)BAki =Chapter 6. The Information Content of Near-Orthogonal Memories^ 112States ei where > LP must satisfy the further restriction that they have zero overlapwith memories r where v > Li`. The overlap with each of these previous memories can becalculated as a weighted sum of ko through kLm_ i , where the ki are defined as:Each ki is the quarter-overlap of the ith block in ei with the ith block in rLA. The blockfurthest to the left is the zeroeth. The number of ways to arrange the bits within a block sothat the quarter-overlap with rLP equals ki is:I ( BBA Pl2ki ) 242^for B = 2Each ki can range from ABP to -F1BP (they must all be zero when B = 2), but as a set, theki's must satisfy the restrictions which ensure zero overlaps.There are LP blocks, each with the degeneracy given by Equation 6.12, so the total degen-eracy is:^LA-1 BA/4^BA^2g(N,^"1=^E^BA^.) D(p, {k})^for all^LP4(6.13)j=o ki =—Bµ/4where D(p, {k}) enforces the restrictions on the set of ki's.The restriction factors D(p, {k}) always take the form of a product of one or more deltafunctions. The following formula for D(p, {k}) is derived quite directly using the equivalenceproperty from Equation 6.8:D(p, {k}) =for B > 2g(BP , ki) = (6.12)p-LAII 6j=1Lµ-1( Ei=0m ^1r1_ 1 )3- ki) (6.14)Consider for example g(32, 5), the degeneracy of the fifth memory in table A.1. LP is 4, sothere are four blocks, with quarter-overlaps of ko through k3 . ko , for example, is the quarterChapter 6. The Information Content of Near-Orthogonal Memories^ 113overlap between the first eight bits of e5 and the first eight bits of P4. Using Equation 6.14, therestriction factor is found to be:D(5, Ikl) = b(ko ki k2 k3)and, using Equation 6.13, the total degeneracy is:g(32, 5) =8E^(2 —4 ko ) 2 ( 2 L2 k1 ) 2ko,ki ,k2 =-8(2 —4 k2 ) 2 ( 2 k3 ) 2 45(k° kl k2 k3)Once the degeneracy g(N, p) has been calculated, the information content of the most recentmemory ez can be calculated using Equation 6.4.6.1.3 Calculated Information Content of Exactly Orthogonal MemoriesThe information content of networks with 8, 16, 32, and 64 neurons were calculated using thebrute force method derived in Section 6.1.2. Figure 6.1 shows how the information in the mostrecently added memory decreases as more and more memories are added. The solid lines inthe figure show the information content predicted by Equation 6.7. As expected, Equation 6.7has overestimated the information content, for reasons discussed in Section 6.1.1. A closerapproximation would be to replace g(N, p), as defined in Equation 6.6, with an empiricalfunction g(N, µ, k) that is defined as follows:g(N, k)^2k(N-p).^ (6.15)This is equivalent to g(N, it) when k = 1.0. From Equation 6.4, it follows that the informationin the (2 + 1)th memory is:S(N,^= k(N - It) log(2) - log(k(it + 1)).^(6.16)Chapter 6. The Information Content of Near-Orthogonal Memories^11445• 35E 30^k =1.0x xin 25 x^0 . 88io x xC-1 200 15.151]E 10 .'6-50xN=64+ N = 32N = 16N = 8-5 I0^5^10^15^20^25 30mu = Number of Previous Memories 1992 APRIL 4Figure 6.1: Information content of the (p +1)th memory in sets of exactly orthogonal memorieswith 8, 16, 32, and 64 neurons. The marked data points are from calculations performed usingthe "brute force" method of Equations 6.10, 6.13, and 6.4. The information capacity predictedby Equation 6.7 is shown as a continuous line above each set of points. The dashed line throughthe points is a plot of Equation 6.16, with k = 0.88.The dashed line in Figure 6.1 shows S(N, IL, 0.88). This new estimate of the information is acompromise — it is worse than the original for small numbers of memories, but averages outbetter for large numbers of memories.An interesting feature of Figure 6.1 is that the information content is negative for the lastten percent of the memories. This means that information is actually lost when each of thosememories is added. This loss of information has nothing to do with the way the memories arestored, because storage of the memories has not even been considered yet. The explanation canbe found in Equation 6.7: the second term, which represents the information lost due to theindistinguishable order of the memories, becomes larger than the first term, which representsthe information inherent in the choice of the next vector.Consider, for example, a situation where 63 memories have been added to an orthogonal setChapter 6. The Information Content of Near-Orthogonal Memories^ 115of 64 neuron memories. One more memory can be added to make a complete set of memories.The person choosing that memory may only choose between two vectors, which are the bitinverse of each other, so the information in the 64th vector would be log 2. Once that memorywas added to the list, it would be indistinguishable from the other 63 memories. On the otherhand if the 64th memory is not added, a careful observer could determine, up to a sign, exactlywhich memory was missing. The amount of information encoded in the choice of the missingmemory is just the log of the number of memories which could have been selected — log 64.Thus more information can be conveyed by not adding the last memory, so the last memorycontains a negative amount of information.A second interesting feature of Figure 6.1 is that the error in S(N, p) is zero when (1—) =etc. This is a result of the structure of r(N, p) which, as noted in Equation 6.8, ensuresan exact isomorphism between the second half of the r(N) and the memory set with half asmany neurons, T(N/2). The fourth quarter of r(N) can therefore be thought of as the second halfof r(N/2), and so on. The brute force calculation of degeneracies at these points will thereforecorrespond exactly to those estimated in Equation 6.6. In a less highly structured set of exactlyorthogonal memories, it is likely that the observed degeneracies would lie along a smoothercurve, perhaps similar to the S(N,p, 0.88) curve.The total amount of information I in the network is the sum of the information added byeach new memory. Using the approximation from Equation 6.16, this is:I(N, M,k) = kM(N —12 — 2—) log 2 — M log k — log M! (6.17)Figure 6.2 shows how the total information calculated using the brute force approach comparesto the amounts predicted by these two expressions. The fact that graphs using k = 0.88correspond so well to the data for all four memory sizes suggests that k = 0.88 should be usedfor all estimates of information content.Chapter 6. The Information Content of Near-Orthogonal Memories^ 1160^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9^1^1.1Memory Load Ratio (M/N) 1992 APRIL 7Figure 6.2: Total information content of a set of exactly orthogonal memories with 64 neurons.The marked data points are summations of the data shown in Figure 6.1. Lines marked a, b, c,and d are graphs of Equation 6.17 with k = 1.0, for N = 64, 32, 16, and 8, respectively. Linesmarked with primes show the case where k = 0.88.6.2 Information Content in Near-Orthogonal MemoriesThe information content in a near-orthogonal vector set can now be estimated. A "near-orthogonal" vector set is made up of vectors whose bits differ from one of the exactly orthogonalvectors iv according to the probability distribution defined in Equation 2.12:p(v) (1 — b) b(er + bov (6.18)In the discussion which follows, it will be useful to have an estimate of how many vectorscan be classified as "near" a particular vector ro. There is a finite chance that any vector inthe whole vector space might be generated from pi using Equation 6.18, so it would not bepossible to just count the number of "near" vectors. Instead, the number of "near" vectors isestimated using a statistical method, based on the amount of information added when the newvector is generated.Chapter 6. The Information Content of Near-Orthogonal Memories^ 117When a new vector ep is generated using Equation 6.18 the information in eti exceeds theinformation in PP by an amount equal to the sum of the information in each of its bits:5+ = — N[b log b (1 — b) log(1 — b)] (6.19)The information is understood to be the logarithm of the number of accessible vectors, sothe inverse is also true: the number of vectors which are "near" rP is the exponential of theinformation added when generating them:g+ (N,b) e'5+ = exp {—N[b log b (1 — b) log(1 — b)1} (6.20)The next step is to estimate the total size, g(N,p,b), of the vector space from which near-orthogonal memories can be selected. A new near-orthogonal vector et' is generated in two steps:first an exactly orthogonal vector rP is selected at random, and then noise is added according toEquation 6.18. The space of all possible vectors e, is therefore the space of all vectors which are"near" any one of the rP vectors. The number of possible rts vectors is g(N, p, k), as derived inSection 6.1. It would be tempting to calculate g(N, p, k ,b) as simply the product of g(N, p, k)and g+ (N , b):g(N, p, k, b) g(N, A, k) • g+ (N,b) (6.21)The problem with this is that it would count the vectors around each possible 1'0 as beingdistinct from those around every other rP. Any vectors which were "near" more than one ofthe possible iv would be counted more than once. This is a reasonable approximation whenboth g and b are small, but it is possible to derive an expression which is correct in the moregeneral case.The number of vectors which are nearly orthogonal to a set of previously-installed vectors iscalculated as the total number of vectors that are close to each possible new ro vector. Thereare g(N, p, k) possible vectors rP, numbered from rµ[l] through rPm. The number of vectorsChapter 6. The Information Content of Near-Orthogonal Memories^118which are close to rP[1] was calculated in Equation 6.20:^gi = exp {—N[b log b (1 — b)log(1 — b)ll^(6.22)The size of the complete vector space is 2 N , so the fraction of all states which are near ri•) is0+B^-1 = [2 bb (1 — b) (1—b) ] —N^(6.23)2N —The definition of 4- , the number of uncounted states close to rP[2], must take into accountthe fact that there are now only 2N (1 — 0) states which are still eligible to be counted. Thesimplest assumption is that, on average, the memories around rµ[21 are evenly distributedbetween the reduced space and the excluded space, in which case:gj. = gt (1 — 0)This leaves 2N (1 — 0) 2 states eligible for counting in gt, and so on:= 02N (1 — 0) 2= 02N (1 — 0) i-1The total number of possible memory states is the sum of gt through ycgE:g(N,is,k)g(N,p,k,b)^E^= E 0 2N(1 — e)i-1i=i00^ 00= 0 2N^— — e 2N Di —i.0 i=goo= 2N — 02N(1 — 0)9(N 'P 'k) E(1 — 0) 1i=o= 2N [i — (1 — 0)9(N,11,101^ (6.24)This is a difficult expression to simplify analytically, but it is tractable in the limits b^0and b^1. As b goes to zero, 0 also goes to zero, so Equation 6.24 can be simplified as follows:g(N, 11,k,b)^2N [1 — (1 — g(N, µ, k)0)]Chapter 6. The Information Content of Near-Orthogonal Memories^119= g(N, k)b - Nb (1 - b)-N(i-b) = g(N,ti,k),log g(N, k, b) = log g(N , ft, k) - Nb log b - N(1 - b) log(1 - b) (6.25)This result is accurate when (0 g(N„u,k)) << 1, and matches Equation 6.21 as expected. Inthe domain where it is accurate, the information in the near-orthogonal vectors is just theinformation in the exactly orthogonal vectors plus the information from inverting some of thebits.As b approaches 0.5, the memory set approaches complete randomness, so the amount ofinformation in each new memory should approach (N log 2 - log(/ +1)). If b is written in termsof a small amount 8:1 -b =then 0 can be written:20 = 2-N — by 4(1-6) ( 1 + 6^(1+8)0( 12 2^j= (1 _ 8) _/2i (1 + 0_ 1,2, + 8/2 \ 25N1 ^6 /2 ))f 8.7,..2, (1 _ 62) -^(1 + _6 \ 26N2 ) 1 + v2+(L.s2)26N(1 - N_ t5a) (1 + 262N)2- (1 + N62)2This approximation is accurate if N82 << 1. Replacing this into Equation 6.24,g(N,p,k,b) = 2N 1 - 2^) g(N,p,k)]So when b gets close to 0.5, the information content approaches N log 2, as required.(6.26)(6.27)The information content of the (it + 1)th near-orthogonal memory is, from Equations 6.4and 6.24 equal to:S(N, tz,k,b) = log [11(N, tt,k,b)] - log(tt + 1)Chapter 6. The Information Content of Near-Orthogonal Memories^120= N log 2 + log [1 — (1 — e)9(N,p,k)] log(p + 1)^(6.28)where 9 is defined exactly by Equation 6.23. This value was calculated numerically and isshown in Figures 6.3 and 6.4 for k = 1.0 and k = 0.88 respectively. In the k = 0.88 graph, theinformation content for b = 0 rather awkwardly does not match the other plots at p = 0. Thisis because S(N, p, 0.88) is a poor match to the true information content at p = 0, as noted inFigure 6.1. This awkwardness is compensated by the fact that it is a better match for largevalues of A.As expected, the near-orthogonal graph runs parallel to the exactly-orthogonal graph in thedomain where 9 g(N, p, k) < 1. There is a sharp transition to the domain where the informationis just N log 2 — log(p + 1) so, to a good approximation:S(N, k) — Nblog b — N(1 — b) log(1 — b)S(N, p,k,b) = minimum of:^ (6.29)N log 2 — log(p + 1)The location, po , of the elbow in the graph of S(N, p, k, b) is defined implicitly by thefollowing equation:N log 2 — (go + 1) = S(N, po , k) — Nb log b — N(1 — b) log(1 — b)^(6.30)where S(N,p,k) is defined in Equation 6.16.Once a set of memories has been installed in a network, the order in which they were installedis irrecoverable, so it is not meaningful to regard early-installed memories as containing moreinformation than later-installed ones. The average amount of information per neuron, after Mmemories have been installed, is:S(N,p,k, b)Li MN/1=0MN1 IP°E [N log 2 — log(p 1)]+p.0 :§(N, M,k,b) =k = 0.88 0.0Chapter 6. The Information Content of Near-Orthogonal Memories^121it150 0.4Mo 0.3-E-2z 0.20a)^.g',' 0.7E1 0.6'C'0: 0.5z8 0.100-0.10^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8^Nr Orth2 alpha = Num Memories / N0.8k = 1.01992 APRIL 120.9^10.011Figure 6.3: Numerical calculation of the information per neuron in the most recently addednear-orthogonal memory, from Equation 6.28. The calculation assumed 100 neurons, andused k = 1.0. The relationship between Om, and b is given in Equations 2.11 and 2.14.0.8o 0.7Ea)m 0.6Ca)0 0.5cc• 0.4M4e) 0.3'Es0= 0.2a)z80.100-0.10^0.1^0.2 0.3 0.4 0.5^0.6 0.7 0.8 0.9^1^1 1alpha = Num Memories / N^ISM APRIL 12Figure 6.4: Numerical calculation of the information per neuron in the most recently addednear-orthogonal memory. The graph is identical to Figure 6.3 except that k = 0.88.0.80.7 fi rmsE 0.6 0.80.5a) 0.60.40.20.02^'0 2t.13)^0.10k = 1.0-0.10^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9^1Nr_0(1h2^alpha = Num Memories / N^1992 SEP 2711Chapter 6. The Information Content of Near-Orthogonal Memories^ 122Figure 6.5: Numerical calculation of the average information per neuron, per memory, in anear-orthogonal memory set, as a function of the memory loading a = M/N, for k = 1.0. Thecalculations were done by integrating and normalizing the data shown in Figure 6.3.A4--1E [k(N - /A) log 2 — log(k(p + 1)) + log gifi^(6.31)p=p0+1where gif is related to b via Equation 6.22. Figures 6.5 and 6.6 show numerical calculations of.§(N , M, b) for various values of b, and for k equal to unity or 0.88.6.3 Ideal Information Capacity vs Information Content in EH Networks.This section relates the information capacity calculations of Section 6.2 to the informationcontent of EH networks. The term "ideal capacity" is used to refer to the amount of informationthat can theoretically be stored in a set of memories which are generated as described inSection 2.3.1, and which have a given rms overlap O. The term "information content" isused to refer to the amount of information represented by the visible neurons in the memoryset of an EH network. It is not immediately obvious how the ideal capacity can be related to0.80.7 firms0.80.60.40.20.0E° 0.6a)0.50z 0.30 0.2"E0)^10> •0e" 0.4Chapter 6. The Information Content of Near-Orthogonal Memories^ 123k = 0.88-0.10^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9^1NrOrth2^alpha = Num Memories / N^tsa2 SEP 27Figure 6.6: Numerical calculation of the average information per neuron, per memory, in anear-orthogonal memory set, as a function of the memory loading a = MIN, for k = 0.88. Thecalculations were done by integrating and normalizing the data shown in Figure 6.4.the information content, because the former was derived for memory sets generated in a verydifferent way from the latter. According to the ansatz in Section 2.4, however, it is assumedthat the network's behaviour will be independent of the way in which the orthogonality wasachieved. This implies that the ideal information capacity is a theoretical upper limit on theallowed information content of EH memories.Sections 6.3.1 and 6.3.2 discuss how efficiently the EH process matches the informationcontent of each memory set to the ideal capacity of the network. For simplicity, all calculationsassume that k = 0.88.6.3.1 Limits on the Orthogonalization Process11Information theory imposes limits on the degree of orthogonalization that can be achieved bythe EH network's roll-up process. The limit results from the fact that the more orthogonal0.8k = 0.880.7E0.60 0.5information capacity..R/N = 0.9R/N = 0.7g 0.420.3"E00.2c`fi ^0.1a. • R/N = 0.1R/N = 0.3...information contentR/N = 0.5.......00.0-0.10^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9^1Nr_Orth2 alpha = Num Memories / N111992 OCT 2Chapter 6. The Information Content of Near-Orthogonal Memories^ 124Figure 6.7: Ideal information capacity as a function of the memory loading, compared to theinformation content of an EH memory. The solid lines are identical to those in Figure 6.4 andshow the amount of information that can be stored (the ideal capacity) in a new memory, for a setof near-orthogonal memories with overlap O. The dashed lines are plots of Equation 3.5 forfive different values of R/N, and show the amount of information that is stored (the informationcontent) in the latest EH memory, in a network with R visible neurons.a memory set becomes, the less information can be stored in it, as illustrated in Figures 6.3and 6.4. An "optimum" orthogonalization process will therefore reduce O mis only until the idealcapacity S(N, k ,b) of each newly-added memory is equal to its information content IER-(a , R).The rms overlaps which would be achieved by such an optimum orthogonalizer can befound using Figure 6.7, where the information content of one EH memory, as calculated inEquation 3.5, is superimposed on the ideal capacity from Figure 6.4. The intersections of thesegraphs indicate the lower limit on 0.„, for various numbers of visible neurons and variousdegrees of memory loading. If the memory loading is so low that the graphs do not intersect,the lower limit on Orras is zero.Figure 6.8 summarizes the locations of the intersections in Figure 6.7, and thus shows theChapter 6. The Information Content of Near-Orthogonal Memories^ 125I-0^0.2 ^0.4^0.6^0.8^1BestOlep alpha = number of memories / N 1992 OctFigure 6.8: Comparison of the theoretical minimum overlap, to the overlap which is achievedby the roll-up process. The decimal numbers indicate the ratio of visible neurons, R/N. Thelines plot the locations of the intersections in Figure 6.7, and indicate the theoretical minimumrms overlap required in order to store the given number of randomly-selected visible neurons.The marked data points are copied from Figure 3.2, and show the levels of overlap achieved bythe EH network's roll-up process.minimum value of O, that is required in order to store a given number of EH memories. Theoverlap achieved by the roll-up process is shown for comparison. Within the tested range, theachieved overlap is always greater than the theoretical minimum, as expected.6.3.2 Information Content Near CapacityThis section compares the maximum information content of an EH network to the ideal infor-mation capacity of a set of memories with the same level of orthogonality. The comparisonprovides a measure of how efficiently the network is utilizing the hidden neurons.The first step in the calculation is to determine how well the memory sets are orthogonalizedwhen the network is filled to capacity, as shown in Figure 6.9. Given a value of R/N, theChapter 6. The Information Content of Near-Orthogonal Memories^ 126memory capacity is given by Figure 3.11, and O. at capacity is given by Figure 6.9, so theideal information capacity can be found using Figure 6.6. The result, a graph of the idealinformation capacity per neuron at maximum memory loading as a function of R/N, is shownin Figure 6.10. When these values are multiplied by the memory capacity in Figure 3.11, theresult is a graph of the ideal information capacity at maximum memory loading, as a functionof R/N, as shown in Figure 6.11.The plots in Figure 6.11 indicate that the information capacity of the EH network is lessthan the ideal information capacity, particularly for low numbers of visible neurons. In the zonewhere the memory capacity is highest, the EH network's capacity is between one half and twothirds as large as the ideal network.Chapter 6. The Information Content of Near-Orthogonal Memories^ 1271.21.0-10.4-1E0.2-^0.2^0.4^0.6^0.8^1.0^1 2^OvlAtCap ratio of visibles (R/N) 1982 SEP 28Figure 6.9: Orthogonality of an EH network at capacity, as a function of the ratio of visibles,R/N. The data points were calculated by first finding the capacity as a function of R/N, usingFigure 3.11, and then finding the corresponding rms overlap, using Figure 3.2.0.0-f-0.00.80f; 0.75-asU(.) 0.70-0.65-0.60-CD2 0.55-0.50-c 0.45-0.400.0 0.2^0.4^0.6^0.8^OvlAtCap ratio of visibles (R/N)1.0^1.21982 SEP 28Figure 6.10: Ideal information capacity at maximum loading, as a function of the ratio ofvisibles. The data points were derived from Figures 3.11 and 6.9, as described in the text. Theline was drawn freehand as a visual aid only.information inrandom memories ideal/ informationcapacityChapter 6. The Information Content of Near-Orthogonal Memories^ 1280.250.20cac 0.15-o8.8 0.10-E 0.05-0.000.0 02zapinformationcontent0:4^0:6^08ratio of visibles (R/N)1.0^1.21992 SEP 28Figure 6.11: Comparison of ideal information capacity and actual information content, at maxi-mum loading. The lowest plot is a copy of Figure 3.12, and shows the total information actuallystored in an EH network at maximum loading a,. The center plot shows the ideal informationcapacity of a set of a, memories with rms overlap equal to the rms overlap of the EH networkmemories at maximum loading. The top plot shows, for reference, the information capacity ofa set of acN completely random memories. All information capacities are divided by N 2 , sothat the plotted value is equal to 11 the number of data bits that can be stored forevery neuron in the network. The lower two lines were drawn freehand as visual aids only.Chapter 7Conclusions and DiscussionThe following sections review the conclusions that have been reached in this research andpropose a number of topics which may be pursued in further research.7.1 ConclusionsAll of the research goals listed in section 1.2 have been met. An "Extended Hopfield" or"EH" network has been described which adds hidden neurons to the Hopfield network. Theperformance of the EH network was compared to the Hopfield network, and it was shown that:• 45% more information can be stored in an EH network with hidden neurons, comparedto a Hopfield network with the same total number of neurons (Figure 3.12).• for a fixed total number of neurons, up to twice as many memories can be stored andrecalled in an EH network as in a Hopfield network (Figure 3.11).• the radius of attraction for "incomplete" prompts in an EH network, even with no hiddenneurons, far exceeds that of a Hopfield network (Figure 3.7).• when recovering a memory from a "noisy" prompt, a greater percentage of visible bitscan be initially wrong in an EH network than in a Hopfield network (Figure 3.16).^.129Chapter 7. Conclusions and Discussion^ 130• the EH network is able to store the XOR set of associations, something that a Hopfieldnetwork cannot do (Figure 3.17).• the EH network is able to efficiently store and recall sets of memories with variable lengths.For sets of correlated memories, the capacity of the EH network was found to deteriorate justas rapidly as the Hopfield network as the correlation increased, as shown in Figure 3.13. TheEH network was able to store more memories than the Hopfield network, but the performanceof both networks was severely degraded compared to their performance with random memories."Analytical control" of the EH network was achieved on a number of fronts. Most impor-tantly, an expression for the free-energy of a Hopfield network with a near-orthogonal memoryset was derived (Equation 4.49), and an associated set of mean-field equations was found (Equa-tions 4.50, 4.51, and 4.52). The mean-field equations were then used to predict the memorycapacity of a Hopfield network as a function of the degree of orthogonality of the memory set(Figure 5.3). The success of these predictions is an encouraging indication that the mean-fieldequations are correct.A mathematical description of the average performance of the roll-up process was alsoderived, resulting in a pair of coupled differential equations (Equations 5.15 and 5.20). Theseequations were used to predict how well new memories can be orthogonalized as a function ofthe memory loading and the number of hidden neurons. The analytical predictions matchedthe observed overlaps very accurately, as shown in Figure 5.8.Analytical tools were developed to describe the information content of near-orthogonal mem-ory sets. The result is Equation 6.31, which describes the average amount of information perneuron as a function of the number of neurons, the number of memories, and the degree oforthogonality of the memory set. The information capacity of an EH network was comparedto the ideal information capacity for a set of memories with the same degree of orthogonality.Chapter 7. Conclusions and Discussion^ 131The results, shown in Figure 6.11, indicate that the EH network is able to store approximatelytwo thirds of the theoretical maximum amount of information when one third of the neuronsare hidden.The final goal, to predict the memory capacity of the EH network as a function of thefraction of hidden neurons, was partially achieved in Section 5.3, as summarized in Figure 5.11.The analytical predictions were poor for networks made up of mostly hidden neurons, becauseit ignored the problems associated with recalling a memory from a small number of visibleprompt bits. The analytical predictions were good, however, for networks where more thanhalf the neurons were visible. Furthermore, the analysis showed that the improved performanceobserved in networks of 100 neurons, will be observed in networks of all sizes and will beindependent of network size.7.2 Further ResearchThe results of the present research suggest many possible topics for further work. There are,for example, many aspects of the behaviour of EH networks which haven't been explored. Themost important, as highlighted in Section 5.3, is to analyse how the radius of attraction ofEH networks varies with the memory loading and with the orthogonality of the memory set.One possible approach would be to clamp the "visible" neurons, represented by they (j = 1, R)in Equation 4.17, at varying degrees of correlation with the memory being recalled, in order tosimulate a recall process which begins far from the target memory. It might then be possible,by solving the mean field equations as in Section 5.1, to determine the memory capacity as afunction of the radius of attraction.Chapter 7. Conclusions and Discussion^ 132Other research topics which would help define the behaviour of EH networks are:• An evaluation of how robust the network is to synaptic damage.• An analysis of how the EH network behaves at non-zero temperatures• An investigation into whether time delay synapses can be introduced to an EH network• A further analysis of why the tie-breaking heuristic works so well for the XOR set ofassociations, as discussed in Section 3.4.5, and whether it may be of use for more generalsets of memories.There are also a number of topics suggested by this research which do not directly involve theEH network architecture. One of the most interesting would be to investigate how the roll-upprocess, and hidden neurons, can be used to advantage with correlated sets of memories. It wasshown in Section 3.4.3 that the EH network has difficulties with correlated memories, in muchthe same way that Hopfield's network does. Amit et al [5], have described a version of Hopfield'snetwork which is tolerant of correlated memory sets. By rigidly constraining all network statesto match the correlation in the memory set, they achieved a network with a memory capacityfor correlated memories that exceeds Hopfield's capacity for random memories. A significantproblem with their solution is that it imposes such tight constraints on the memory sets. Anetwork with hidden neurons might be able to achieve the rigid constraints on the memory setwhile still allowing a broad range of memories to be stored.Perhaps the most valuable result of this research is that it has demonstrated how a neuralnetwork can learn new memories, taking advantage of the accumulated experience of all the oldmemories, but without explicitly reviewing each old memory. This may be an important stepforward — the energy landscape is not used just for memory recall, but is also used to optimizeits own shape as that shape develops. Further research may find more general applications ofthis principle to a broader range of network architectures.Bibliography[1] Daniel J. Amit. Modeling Brain Function, The world of attractor neural networks. Cam-bridge University Press, 1989.[2] Daniel J. Amit. Neural network counting chimes. Proceedings of the National Academy ofScience, USA, 85:2141, 1988.[3] Daniel J. Amit and Hanoch Gutfreund. Spin-glass models of neural networks. PhysicalReview A, 32(2):1007-1018, August 1985.[4] Daniel J. Amit and Hanoch Gutfreund. Storing infinite numbers of patterns in a spin-glassmodel of neural networks. Physical Review Letters, 55(14):1530-1533, September 1985.[5] Daniel J. Amit, Hanoch Gutfreund, and H. Sompolinsky. Information storage in neuralnetworks with low levels of activity. Physical Review A, 35(5):2293-2303, March 1987.[6] E.B. Baum, J. Moody, and F. Wilczek. Internal representations for associative memory.Biological Cybernetics, 59:217-228, 1988.[7] T.H. Brown, A.H. Ganong, E.W. Kariss, and C.L.Keenan. Hebbian synapses - computa-tions and biophysical mechanisms. Annual Review of Neuroscience, 12:, 1989.[8] Thomas Brown. Neuronal plasticity and learning. In IJCNN-91 Tutorial Notes, IEEE,Seattle, 1991. (bound separately from other tutorial notes).[9] Philip A. Chou. The capacity of the Kanerva associative memory. IEEE Transactions onInformation Theory, 35(2):281-298, March 1989.[10] M. Cottrell. Stability and attractivity in associative memory networks. Biol. Cyber.,58:129-139, 1988.[11] Michael R. Davenport and Shawn P. Day. Chaotic signal emulation using a recurrent timedelay neural network. In Neural networks for signal processing 2 - proceedings of the 1992IEEE workshop, 1992.[12] Michael R. Davenport and Geoffrey W. Hoffmann. Using hidden neurons with Hebbianlearning in a bi-state Hopfield neural network. In Proceeding of the third annual conferenceon advances in communication and control systems, IEEE, Victoria, B.C., October 1991.[13] M.R. Davenport and G.W. Hoffmann. A recurrent neural network using tri-state hiddenneurons to orthogonalize the memory space. Intl. J. Neural Systems, 1(2):133-141, 1989.133Bibliography^ 134[14] Shawn P. Day and Michael It. Davenport. Continuous-time temporal back-propagationwith adaptable time delays. I.E.E.E. Transactions on Neural Networks, 4(2):348-354,March 1993.[15] Shawn P. Day and Michael R. Davenport. Dispersive networks for nonlinear adaptive filter-ing. In Neural networks for signal processing 2 - proceedings of the 1992 IEEE workshop,1992.[16] S.F. Edwards and P.W. Anderson. Theory of spin glasses. J. Phys, F5:965-974, 1975.[17] J.D. Farmer and J.J. Sidorowich. Predicting chaotic time series. Physical Review Letters,59:845-848, 1987.[18] R.J. Glauber. Time dependent statistics of the Ising model. Journal of MathematicalPhysics, 4:294, 1963.[19] D.O. Hebb. The organization of behaviour. Wiley, 1949.[20] John Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of NeuralComputation. Santa Fe Institute Studies in the Sciences of Complexity, Addison Wesley,1991.[21] A. Herz, B. Sulzer, R. Kuhn, and J.L. van Hemmen. Hebbian learning reconsidered: repre-sentation of static and dynamic objects in associative neural nets. Biological Cybernetics,60:457-467, June 1989.[22] G.E. Hinton and T.J. Sejnowski. Learning and relearning in Boltzmann machines. InDavid E. Rumelhart and James L. McClelland, editors, Parallel Distributed ProcessingExplorations in the Microstructure of Cognition, pages 282-317, MIT Press, 1986.[23] G.E. Hinton and T.J. Sejnowski. Optimal perceptual inference. In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition, pages 448-453, IEEE,Washington, D.C., 1983.[24] A.L. Hodgkin, A.F. Huxley, and B. Katz. Currents carried by sodium and potassium ionsthrough the membrane of the giant axon of loligo. J. Physiology, 116:463, 1952.[25] J.J. Hopfield. Neural networks and physical systems with emergent collective computa-tional abilities. Proc. Natl. Acad. Sci. (USA), 79:2554 - 2558, 1982.[26] E. Ising. Bertrag zur Theorie des Ferromagnetismus. Zeitschrift Fur Physik, 31:253, 1925.[27] P. Kanerva. Self-propagating search: A unified theory of memory. Technical Report CSLI-84-7, Stanford Center of Language and Information, Stanford, California, March 1984.[28] A.I. Khinchin. Mathematical Foundations of Information Theory. Dover, 1957.[29] S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi. Optimization by simulated annealing.Science, 220:671, 1983.Bibliography^ 135[30] Scott Kirkpatrick and David Sherrington. Infinite-ranged models of spin-glasses. PhysicalReview B, 17(11):4384-4403, 1978.[31] Charles Kittel. Thermal Physics. J. Wiley and Sons, 1969.[32] David Kleinfeld. Sequential state generation by model neural networks. Proc. Natl. Acad.Sci (USA), 83:9469, 1986.[33] T. Kohonen. Self-Organization and Associative Memory. Springer Verlag, Berlin, 1984.[34] S.Y. Kung, F. Fallside, J.Aa. Sorenson, and C.A. Kamm, editors. Neural networks forsignal processing II - Proceedings of the 1992 IEEE workshop, Princeton, NJ, 1992.[35] Benny Lautrup. Lecture notes on the theory of the Hopfield model. Technical Report NBI-HE-88-06, Niels Bohr Institute, January 1988.[36] Richard P. Lippmann. An introduction to computing with neural nets. IEEE ASSPMagazine, April 1987.[37] Richard P. Lippmann. Review of neural networks for speech recognition. In Alex Waibeland Kai-Fu Lee, editors, Readings in Speech Recognition, pages 374-392, Morgan Kauf-mann, San Mateo, 1990.[38] M. Mezard, G. Parisi, and M.A. Virasoro. Spin Glass Theory and Beyond. World Scientific,1987.[39] W.T. Miller III and R.S. Sutton andP.J. Werbos. Neural Networks for Control. MIT Press,Cambridge MA, 1991.[40] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MITPress, 1969.[41] K.S. Narendra and K. Parthasarathy. Gradient methods for the optimization of dynamicalsystems containing neural networks. I.E.E.E. Transactions on Neural Networks, 2:252-262,March 1991.[42] John Nolte. The Human Brain. C.V. Mosby, St. Louis, 1988.[43] Conrado J. Perez Vicente and Daniel J. Amit. Optimized network for sparsely codedpatterns. J. Physics A: Math., 22:559-569, 1989.[44] Fernando J. Pineda. Generalization of back-propagation to recurrent neural networks.Physical Review Letters, 59(19):2229, 1987.[45] David E. Rumelhart, Geoffrey E. Hinton, and R.J.Williams. Learning representations byback-propogating errors. Nature, 323:533 - 536, October 1986.[46] David E. Rumelhart and James L. McClelland. Parallel Distributing Processing: Explo-rations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press, 1986.Bibliography^ 136[47] T. Sejnowski and C.R. Rosenberg. NETtalk: a parallel network that learns to read aloud.Technical Report JHU/EECS-86/01, Johns Hopkins University, 1986.[48] T.J. Sejnowski and P.S. Churchland. Brain and cognition. In M.I. Posner, editor, Foun-dations of Cognitive Science, MIT Press, Cambridge, MA, 1990.[49] C.E. Shannon. The mathematical theory of communication. Bell Systems Technical Jour-nal, 27:379-423, 1948.[50] David Sherrington and Scott Kirkpatrick. Solvable model of a spin-glass. Physical ReviewLetters, 35:1792-1796, 1975.[51] H. Sompolinsky and I. Kanter. Temporal association in asymmetric neural networks. Phys.Rev. Lett., 57:2861, December 1986.[52] D.J. Willshaw, O.P. Buneman, and H.C. Longuet-Higgins. Non-holographic associativememory. Nature, 222:960-962, June 1969.Appendix ACalculations from Chapter 2The following sections provide detailed derivations of results which are used to generate near-orthogonal memories in Section 2.3.1.A.1 The Canonical Set {r} of Exactly Orthogonal MemoriesThis section defines a method for generating a "canonical" set of N-bit long memories, each ofwhich is exactly orthogonal to all the others. This set is given the symbol r(N), where N refersto the number of neurons in the network. The pth memory in r(N) is r(N)P, and the state of itsith neuron is rrti. When there is no ambiguity about the number of neurons in the network,this is generally written rtii. This canonical set is used to generate near-orthogonal memorysets, as in Equation 2.12, and is used in the derivation which begins at Equation 6.8.The symbols BP and L!:` are used to represent the block width and level number in thedefinition of r. They do not correspond to symbols used elsewhere, except in the derivationimmediately following Equation 6.8.The structure chosen for r(N)p is illustrated in table A.1 and defined recursively as follows:r?^+1^for all i137Appendix A. Calculations from Chapter 2^ 138p 11(32)P Lo Bo0 ++++++++++++++++++++++++++++++++1 ++++++++++++++++^ 1 322 ++++++++^++++++++^ 2 163 ++++++++ ++++++++ 2 164 ++++----++++----++++^++++---- 4 85 ++++----++++^++++----++++ 4 86 ++++^++++++++^++++ 4 87 ++++ ++++----++++++++----..---...—.—,..,—...--„--.......„--/ 4 8block 0^block 1^block 2^block 38 ++--++--++^++--++--++--++--++-- 8 49 ++--++--++^++----++--++--++--++ 8 4}level 1} level 21 level 4Table A.1: Listing of the first ten "canonical" orthogonal memories r(32)P, for a 32-neuronnetwork. Li` is the level number, and BP is the block width for the pth memory. The memoriesare numbered starting at zero.+1^if i < N/2if i > N/2for all p—1= 2trunc[log2(P))N^for all /2Lµ/2 Lµ/2^mod /V](A.1)if Lµ= µif LI` pwhere "trunc[x]" means the largest integer less than or equal to x.Table A.1 shows how each memory rP can be assigned a "level" LP, and a "block width" B.Blocks are numbered from 0 through (VA — 1), as illustrated for 1( 32 ) 7 .Appendix A. Calculations from Chapter 2^ 139A.2 Mean and Variance of (ei • ep)This section calculates the mean and variance of (e el. The results are used to deriveEquation 2.14.The overlaps between two different states, e and ev, may be analysed in terms of thefollowing sets of bits:symbol definition sizeA fi : 41:` = rill N(1 — b)B fi : C = In N(1 — b)F (A n B) U (A n 8) N(1 — 2b + 2b2 )g (A n 13)U (A n B) 2Nb(1 — b)where "size" is the expected size of the set in the limit N oo. The mean overlap is expected tobe zero because the states are generated with no preference for either +1 or -1. It is calculatedby splitting the sum using .7 . and g:N((EErn =^cer-FEereni=1^ iEG=^— EieF^iEG= N(1 — 2b — 2Nb(1 — b)((r 1:111= N(1 — 4b 4b 2 )((rrin^ (A.2)The orthogonality of iv' and ry ensures that Min = 0, so the mean overlap of pairs ofmemories is also zero.The variance of e - ev can be calculated as follows:2Var(es^= ((^erC)i=1Appendix A. Calculations from Chapter 2^ 1402= (E - Ertm)2(K (-2 r , r,^ (A.3)iEgA general solution for expressions of this type is derived in Section A.3. In this case q inEquation A.9 is set to 2Nb(1 — b), the size of g, so:)2))(A.4)A.3 Expectation of Partial Overlaps in Orthogonal MemoriesThis section derives a probability distribution which is useful for calculating partial overlapsin orthogonal and near-orthogonal memory sets. The situation is equivalent to picking q ballsout of a jar holding W white and N — W black balls. The goal is to calculate the probabilitydistribution of s, the number of white balls that are picked, as a function of W and q. Thefollowing symbols are local to this section, and do not correspond to symbols used elsewhere:Var(e° • ev) = 4 (KiEg= 8Nb(1— b)[1 — 2b(1 — b)]= N [1 — (1 — 2b) 4 ]N =W =q =s =A,C =B =L =the total number of balls in the jarthe number of white balls in the jarthe total number of balls taken from the jarthe number of white balls taken from the jarfactors which do not depend on sthe factor which does depend on sshifted version of sAppendix A. Calculations from Chapter 2^ 141The probability of choosing a set with s white balls is:P(s,q,W) =(n (Ng "Ws ) (Ng ) -1 vs :^q°t 0^ otherwiseThe terms which are of interest are those involving s, so this can be written:P(s,q,W) = A(N,W,q)B(N,W,q,^)whereB(N,W,q,^) =1 s! (W – s)! (q – s)! (N – W – q – s)!Notice that B is symmetric in the exchange of W and q, so B(N, x, y, s) = B(N, y, x, s).In the limit of large N, this can be approximated as a Gaussian distribution. In the specialcase where W = a N , B the calculation proceeds as follows:B(N, –2 N, q, s) = ^1 (I + L)! (If – – L)! (I – L)!(2–^L)!^= G () x G (—N –2) ^(A.5)2^2 whereL^S - -21G(x)+ L)! – L)!The logarithm of the G(x) functions can be simplified following Kittel [31, pp 19-20]. Notethat G(x) is an even function of L, so the derivation can assume L > 0 without any loss ofgenerality:log G(x) = – log(x L)! – log(x – L)!L^ L= – log(x!) – E ^i) – log(x!) E ^– i 1)^Appendix A. Calculations from Chapter 2^ 142L-21og(x!) - E log ( x +1 )i=1^x - z-21og(x!) - E log ( 1 + iixi=i 1 -i/x)In the limit where x L, the logarithm can be approximated as follows:L 2ilog G(x)^-21og(x!) - E -i=1 xReplacing this into Equation A.5, and introducing C and C' to represent terms which do notdepend on s:1^ N q^.. [1 N 1 1^log [B(N 2 2-N q,^)] = -21ogy! - 2log(-2- - -_-)! -^4ti=i^q^- qi= C - 2L(L +1)[ l + 1 N - q1= C - 2L(L 1) [ N q(N - q).1C - 2 (s — 2 ) (l + s 2 ) (q(NN= c.„ ( 2N q) ) is+ , 1 _ 0] 2^q(N - q))^2‘C' ( 2N )^ 12^q(N - q))^21Taking the exponential of both sides gives:1^ s- < sB(N,^q, s) = K(N , q) exp [-( var(s) >)]where K(N, q) is a normalizing constant chosen so thatP(s)ds = 1,and where:(s) = 2^for W =Var(s) = q(A147,7q)^for W =Appendix A. Calculations from Chapter 2^ 143The dependence on W, which was ignored by setting W to -1, can now be inferred fromB's symmetry in the interchange of q and W. For this symmetry to be present in the meanand variance, they must have the following values:(s) = WqNVar(s) = q(N - q)W(N - W)N3The resulting expression for the probability of picking s white balls is:(A.6)(A.7)p(s,g,w) tii K(N,q,W)exp [ -Va74 )2 ]0Vs.{ Cis< W' q>s>q+W-Notherwise(A.8)where K(N , q, W) is a normalizing constant.Figure A.1 illustrates P(s) for various values of W and q, with the normalization calculatednumerically. Except in the most extreme cases, the shapes of all the curves appear to be similarto a Gaussian.The variance V(q) in the overlap of a q-bit long subset of two exactly orthogonal vectors rt.and r- can be calculated using this result. If "white balls" represent bits which are identicalin the two vectors, then W = 2 and the overlap between the vectors is 2s. The variance in theoverlap is therefore:Var(q) = 4(82) = q( N q) ' (A.9)NFigure A.2 shows the mean measured overlaps between orthogonal vectors for values of qranging from 0 to N, and confirms that Equation A.9 is correct.W = 55q = 95W = 90q = 80a3Q. 0.25otII^0.2a) 0.15• 0.1rn0_c°• 0.05110.3128 NeuronsAppendix A. Calculations from Chapter 2^ 1440.7W = Total Number White Ballsq = Number of Balls Picked0.6-1.0°) 0.5-tco0) 0.4-0a.z.) 0.3-20- 0.2-N0.1W = 5q = 60117W = 55q = 20W = 55q = 60o.o r0^10Cok&100N = 10020 30 40 50 60 70 80 90s - number of white balls^1992 Oct 4Figure A.1: Probability distributions calculated using Equation A.8, for example values of Wand q. If there are N balls, and W of them are white, this is the probability that a randomsample of q balls will include exactly s white balls.0 f^I^1^1^1^1^110^0.1^0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9^1q/N = Fraction of Vectors Examined^1992 MAR 23Figure A.2: Observed overlaps between subsets of two orthogonal memories of 128 neuronseach. The horizontal scale shows what fraction of the bits were used to calculate the overlap.Each data point was generated using 1000 trials. The line is a graph of Equation A.9.Appendix BCalculations from Chapter 3B.1 Theoretical Minimum Prompt LengthThis section calculates the number of bits required to uniquely determine one particular vectorin a set of P randomly-generated vectors. In other words, it answers the following question:"given a set of P vectors, on average how many bits are required to uniquely specify one ofthose vectors?" A reasonable interpretation of this question is: "how many bits must be givenbefore the probability drops below 0.5 that more than one vector in the set will match?"Assume that the prompt is M bits long. One prompt bit is immediately used up to distin-guish between the memories and their bit-inverse images (ref Section 2.1.6). The probabilitythat one (M - 1)-bit substring will match another is equal to 2 -W-1 ). One memory can beassumed to match the prompt, and the probability that none of the others match is:(P - 1)2M-1 < 0.5This simplifies to the following condition on M:M > log 2 4(P - 1).^ (B.1)This is the number of bits typically required to select one memory from among P memories.145fSli=1 i=1 k ECAppendix CCalculations from Chapter 4The following sections provide detailed derivations of results which are used to calculate thefree energy in Section 4.2.C.1 Sum-Product Inversions EquationA common identity used in the literature [1, 35] provides a method for inverting the order ofsums and products of exponentials under certain circumstances. The identity is re-stated hereas a reference because it is used frequently in the main calculation. The identity can be statedas follows:NE^H E eaik (C.1)where the vector :5; is made up of N components Si, and the possible values of each componentmake up the set C. The sum on the left side must include all possible vectors S. The sum onthe right side is over all elements of the set C. Most commonly, C = {-1, +1}, and the identitycan be written:N^NE H eaisi^ll E eats = ll 2 cosh(cei)^(C.2){S} i=1^i=1 8=±1^i=1146Appendix C. Calculations from Chapter 4^ 147C.2 Ensemble Average of Noise Term over Orthogonal SetsThis section calculates the ensemble average ((eT2 r from Equation 4.31, where:P NT2 .7,- (1 — 2b) E >µ=8+1 1=R+ 1The result is used in Equation 4.34 to calculate the amount of noise coming from the uncon-densed memories. The following symbols are local to this section, and do not correspond tosymbols used elsewhere:VT = the variance of T2VT = the variance of T2 if the r vectors are random, not orthogonalIn this Appendix, and those following, the dot product refers to a sum over the undampedindices only, as in for example:NX -Y -a- E XiYii=R-FiThe tactic for calculating ((eT2 ))r is to calculate the average over r8+1 rP with anintegral over all possible values of T2. The result of that integral must then be averaged over allpossible 1'1 r.. The central limit theorem implies that T2 will be distributed according to aGaussian distribution in the {F}-space. The mean value of T2 will be zero because there is noreason why positive or negative components should be favoured in its sum. The variance VT isequal to the ensemble average of (T2 )2 :VT ==- (((T2)2))= (1 - 2b)2 ^E^E b B7rii4r ))^ (C.3)pfu=8+1 rs+1...rpP Nexp [1 (1 — 2b) 2 E E (Bn21.p=8+1 i=R+12((e T2 ))r(C.5)Appendix C. Calculations from Chapter 4^ 148The average of eT2 is calculated using the following Gaussian integral:00(( eT2 ))r = foodT2 exp [ —(T2)2 Tz]2VT= ((exP GVT)))rl...r.where the average over rl ...r. is weighted according to the values of mil, and 5 ) • Su.Consider first, as an aside, what the value of eT2 would be if the r set was just randomlyselected vectors. Nowhere yet in the calculation has the orthogonal nature of r been takeninto account, so this is a reasonable exercise. Call the random vectors P to avoid confusion.The ensemble average overt in Equation C.3 could be moved inside the summation becauseall terms in the sum would be independently random:VT = (1 — 2b)2 E^E BP 3';.r8+1 ...FPti,v=8+1 id=R-F1The ensemble average ((fTi;* would be zero unless the indices were identical, so:P^NVT = (1 — 2b) 2 E^E Br B7641vOii0 ,v=8-F1 i,j=R+1P N= (1 — 2b)2 E E (Bµ)2 .p=s+1 i=R+1Replacing this into Equation C.4 would give:(C.4)This result is used in Equation 4.33 to check the consistency of the derivation up to this point.When each set r consists of exactly orthogonal memories, the ensemble average in Equa-tion C.3 cannot be moved inside the summations so easily. It is possible , however, to simplifyEquation C.3 using the fact that for every set of orthogonal vectors {r}, there is another set {F}'whose vectors are all identical except for one, which is the bit-inverse of its counterpart. This++--++--<=>++44--+—+—+--+■1,M IBP12 — ^(ru.BP)2ro+1...rp^— s11 ^[(((r.BP) 2 ))v=1Appendix C. Calculations from Chapter 4^ 149is illustrated by the following two sets of exactly orthogonal 8-vectors:Because of this anti-symmetry, the only terms in Equation C.3 which will contribute to theaverage are those where y = 1/, so Equation C.3 reduces to:P^NVT = (1 - 2b)2 E E Bptirit:T7))p=8+1 rs+1...rPP= (1 — 2b)2 E (rP.BP) 2 )),A=8+1^ro+1...rPNote that the vectors B 8+ 1 through BP are distinct from each other — they are not, for example,the projection of a single vector onto the r0 vectors.The T vectors form a complete orthogonal set in the M-dimensional space of undampedvector segments, so any vector can be expanded as a sum of its projections along each of ther vectors. Expanding BP in this way gives:1^1iBu )2 = ^ ru.o, )2 +^E (r..B,i )2M E ( v=1 ,,=.+1M^ 8E ( 11' 43')2 = m 1 13 '1 2 - E (ru-B")2v=8+1 v=iIf nothing is known about vectors r8+1 through T N , then each term in the sum has thesame expected value, so:[11 I B P 1 2 - E (r.BP)2] (C.6)E BPP.Bm. [msP • sa — E(r- • sP)(ru • salp,cr=1^ v=1n(((re •Bµ) 2 ))ra+1....rp =Appendix C. Calculations from Chapter 4^ 150The first term in this expression can be calculated directly:N1B1412 = E (B; )2i=R+1N n= E > B"Bi."7 Sr Yiri=R+1 p,cr=1where BPP is defined in Equation 4.33.The second term in Equation C.6 can be written as a function of the overlaps in the con-densed memories, as follows:8E •Bµ)2^E BPPBP- E •sP) (r-v=1^p0=1^v=1SO(C.7)andP nVT = (1 - 2b)2- E E BPPB/- [m-sP • — E(r- • sP)(r- • se)] 14=8+1 p,cr=1^ v=1Replacing this into Equation C.4 yields the following expression for ((eT2 ))r :((e T2 )) rP n= ((exp (1 - 2b) 2 E Ets=s+1 p,cr=1X B I4P B14° [M SP • Sri' — E(r- • sP)(r- • sel}))v=i^ ri pa(C.8)To complete the calculation, it must be averaged over all possible orientations of r1 through r.,and expressed as a function of either el ... e8 or int,1 ...msp7 as follows:P n= exp { 1 (1 - 2b) 2 E E BPPBP-sP • Sal ((e -T3 ))2 0=8+1 poy=1 (C.9)((eT2 )) rT3 — 2M (1 — 2b)2 E E BooBti- E(r- • so)(r- • Scr ) (C.10)v=114=s+1 0,17=1Appendix C. Calculations from Chapter 4^ 151whereThe average ((e -T3 ))^is calculated in Section C.3, with the following result fromr i ...r .Equation C.16:((e-T3))ri...r.^P^n^8= exp --12(1 — Vi) E E B".13P ° LE(mpv) 21 SP .5a/1=3+1 p,cr=1^=1(C.11)When this is replaced into Equation C.9, it reduces to:P^n [((eT2 ))r = exp^E E (1 - 2b) 2 - (1 - Vi) E(m7,)21 BPP/3 1'SP•S°14=8+1 pfr=1^ u=i(C.12)C.3 Average of e -T3 for Condensed VectorsThis section calculates the average of e -T3 over r1 through T., where T3 is defined in Equa-tion C.10 asT3 = 2M—1 (1 — 2b) 2 E E B ".13 1". E(r- • so)(r- • Se).p=s+1 09=1^v=1The result is used in Equation C.11 as part of the calculation of the total noise generated by theuncondensed memories. The following symbols are local to this section, and do not correspondto symbols used elsewhere:T3 = the average value of 7 13VT = the variance of T3 in the space of ri through P.A = a central factor in ((e-T3 ))7"; = 2 (1 - V) EE BPPBP- E(np)2 sP•s-.. . p=6+1 p,cr=1^v=1(C.14)Appendix C. Calculations from Chapter 4^ 152The tactic for calculating ((e -T3 ))^is similar to that used for calculating ((eT2 )) inSection C.2. As before, the sum over all possible vectors r1^r8 is replaced with an integralover all possible values of T3 . Although the central limit theorem does not apply in this case,7'3 is assumed to be distributed according to a Gaussian distribution in the {r}-space. Thistime, however, the mean value is non-zero. If the mean and variance of T3 are:T3 = OT3Dr1 . ..reVT -= (((T3) 2 //P...r.^(M)2then the ensemble average will take the following form:- EdT3 exp [-( 213^)2217T^T3]rr \= exp (-12VT — T3) . (C.13)The mean value of T3,P n2M (1 — 2b)2 E E BPPBP- E arm • sP)(r- • 50 )Dri...p, ,p=s1-1 p,cr=1^v=1is calculated using the average of (TP.so)(rP•sP) derived in Section C.4. In regions wherem>> and V << (1- k), which covers virtually all the range of interest, Equation C.20 givesthe following expression for T3:Equation C.23 from Section C.5 presents the following expression for the variance of T3:P^n^8Var(T3) = (VI - V) E E BPPBP- E(mpu)2 SP.5'.^(C.15)11=8+1 p,cr=1^v=1Equation C.13 can now be used to calculate ((e -T3 ))1-1 ...1-S1 P^n^8((e-T3 )) ri ...I'S = exp --2 E E BPPBP- E(7/111 )2(1 — Vt)SP • Su^(C.16)^ p=s+1 p,cr=1^v=1Appendix C. Calculations from Chapter 4^ 153C.4 Average of (ro•scr)(rP•sP) over Condensed VectorsThis section calculates the mean of (ro•sP)(rP.s-) in the space of all orthogonal vectors1'1 with probabilities determined by the values of mit: and SP • 5°. The result is used tocalculate the mean of T3 in Equation C.14. The following symbols are local to this section, anddo not correspond to symbols used elsewhere:= the weighted sum of SP vectorsQ = the value of Al, (rP • sP)(rti • sff)mo = the overlap between SP and el,M, A, . . . g = the sets defined belowM = the number of elements in MA, ... G = the number of elements in each set, divided by Msym definition -1,- x avg size 11 —,I-I— x varianceM set of all indices 1 = -11-1-(N — R) 0A fiEM:Sr= en A = 1(1 + nt 14,) 0B { i E M : riii . eii } (B) = (1— b) b(1 — b)C fi E M : Sr = rn (0 . mil,B + 1(1 — mil) [1 — (771;‘,) 21 b(1 — b)1, { i E M : ,59: = Sn D = 1(1 + -1,7 SP • 5' ) 0F {ieCnD} (F) = CD D(1— D)C(1 — C)G fi E c n cm — 1,} G = C — F 0Appendix C. Calculations from Chapter 4 154The averages and variances in this table were calculated using Equations A.6 and A.7. Theprobability distributions of the sizes of the sets are assumed to be Gaussian, as for example:P(X) = exp (—(X (X))2^(C.17)2Var(X)The calculation of (r1' • so)(p, • S°) begins with the observation that:1Q^—(rP • sP)(rP • S°. )M2=^(E r; S; + E ITsr) (E 11'4' E P4s7)iED^iEM-D^iED^iEM-DThe size of these two terms can be calculated using the sets defined above:1 E rfasr = F — (D — F) 2F — D:EPrf'sf = G — (1 — D — G) = 2C — 2F + D — 1.M iEM-DSO:Q = f f f dF dC dB P(F) P(C) P(B) {(2F — D) 2 — (2C — 2F + D — 1)1dF dC dB P(F) P(C) P(B){(8C — 4)F + (2 — 4C)D — 4C2 + 4C —When this is integrated over F it reduces to:Q = f f dC dB P(C) P(B){(8C — 4) (CD) + (2 — 4C)D — 4C 2 + 4C —= g dC dB P(C) P(B) (2D — 1)14C 2 — 4C + 1] .= m2 (EITSf + E rrsr)^11'4 — E r, Sr)iED^iEM-D^iED^iEM-D2^1 2=^-^E rNsp=ED^— iEM-D (C.18)Appendix C. Calculations from Chapter 4^ 155The second integration is over C, and yields:^Q^I dB P(B) (2D – 1) [4(Var(C) (C) 2 ) – 4(C) + 1]1^2= I^4dB P(B) (2D –1) {—(1– m2 )b(1 – b) + 4 {mB + –2 (1 – m)]– 4 [mB 1 – m)} + 1}.The terms proportional to -1.4 are negligible except when all other terms are very small, butthis is an important exception, so they are kept in:Q = f dB P(B) (2D – 1) {—4 (1– m2 )b(1 – b) + m 2 (4B2 – 4B + 1)]= I dB P(B) 1—(SP-S°) [M(1 m2 )b(1– b) + m 2 (4B2 – 4B +1)]^1^4= .71-4.–(SP-P) [17 (1 – m2 )b(1 – b) + m2 (4Var(B) 4(B) 2 – 4(B) + 1)]1^4[(1 – m2 )b(1 – b) + m2b(1 – b)} + m2 [4(1 – b)2 – 4(1 – b) + 1111^4=^(SP .5') —m b(1 – b) + m2 (1 – 2b)2 }^ (C.19)The average of (rP•sP)(rP•s-) is therefore:(((rP•sP)(rP•S")Dr i ...r. = M(mt;,) 2 (1 – 2b) 2 SP •S`"^4b(1 – b) SP-S°^(C.20)This can be qualitatively checked for some simple cases, and with the aid of the sketch inFigure C.1. When b = 2, the F vectors are not correlated with the e vectors. In that situationthe mill overlaps should be irrelevent, which is indeed true in Equation C.20. If b = 0, so thatthe I' vectors are exactly aligned with the e vectors, and if p = o, so the two S vectors arealigned, then Equation C.20 is equal to the square of el' • SP , which is correct.Appendix C. Calculations from Chapter 4^ 156Figure C.1: Sketch of the vectors which make up T3, for a case where n = 2 and s = 1. Theexact value of T3 is the product of the projections of So and SG' onto PP. The mean value andvariance of T3 can be estimated as a function of mtl„, SP.S' , and V.C.5 Variance of T3 in the Space of Condensed VectorsThis section estimates the variance of T3 in the space of all orthogonal vectors rl F8. Theresult is used to calculate the mean of eT3 in Equation C.15.T3, as defined in Equation C.10, is:1^P^nT3 E 2M (1 — 2b) 2 E E 13 1* .13P' Dr- • so)(r- • Yr),which is a sum of products of projections of two vectors, S° and So, onto a reference vector To.Figure C.1 helps to visualize this.The variance in T3 is an estimate of how much the sum of these projection products variesaway from the mean value, calculated in Section C.4, as the r vectors are rotated in all pos-sible directions. There are only (n2s) terms which vary with rl r8 and are added togetherinside T3. Because this number of terms is small compared to M, the variance in T3 will be of0=8-1-1 p,cr=1^v=10,0=s+1 p,cr,cr,3=11 - 204 E4M2^L—dAppendix C. Calculations from Chapter 4^ 157the same order as 2'3 for M —÷ oo, and so cannot be ignored.The exact calculation of Var(T3 ) proceeds as follows:Var(T3) = (((T3) 2 )>r,...r. — (T3 )2BO p BOa-^B0138x E (((rv.sP)(rv•s-)(rii•sa)(ru•so))) ri...r. - (T3 )2 (C.21)por=1This can be simplified by expanding only terms up to n = 2, because the resulting expressionwill be used in the limit where n —> 0. With that simplification, there are only four differenttypes of terms to be averaged:= ar.sP)(r.•sP)(rm •sP)(rP•sP)Do... I-% 8Q^= Ww•sP)(ru.sP)(v•sP)(r"•5')Dri...r.Q3vm = (((r.SP)(r- •sP)(rP•s(7 )(rP.,s- )Dri ...r.Q7 = arv•5P)(1'•50 )(P•5P)(r"•50 )Dri...r. •When each of these is simplified separately, the result is a series of terms of a form similar to:Var(T3)^fi(V) E^E BOp BOc r BOot B0130,0=3+1 pfi,a,0=18x E^(5P.50)^ (C.22)111=1where ni and f1(V) are constants and functions as required. If this calculation was carried out,it would result in a complicated expression in which every term includes a product of four BPPfactors. Such a calculation was not done in this research. Instead, the value of Var(T3) wasapproximated using an expression which is very similar to Equation C.14, as described below.The result is an equation for ((eT3 )) that is much simpler than the full calculation would havebeen, and is accurate in the domain of interest.Appendix C. Calculations from Chapter 4^ 158The expression for the variance is derived by first establishing its values at the limits V = 0and V = 1, and then proposing a simple form for the full range. At V = 0, the ro vector inFigure C.1 is always exactly aligned with eµ, and so has a variance of zero in the space of allallowed I' vectors. When V = 1, the (1 — 2b) factor in T3 ensures that it is always zero, so thevariance is again zero. A simple expression for the functional dependence on Var(T3), which iszero in these two limits, is:Var(T3) a V''(1 — Vn2)where n 1 and n2 are positive exponents to be determined. It was found that n 1 = 0.25 and n2 =0.75 work very well.Following the form of Equation C.14, the variance is therefore written:P^n^8Var(T3) = (V 4 — V) E E BPPBt- E(mp2sP • Su.^(C.23)0=8+1 p,cr=1^m=1C.6 Details of the Inverse Gaussian TransformThe inverse Gaussian transform of F1 , as defined in Equation 4.36, is most easily calculated ifthere are no mixed products of the form xi-1,4. Such mixed products can be removed by doinga unitary transformation before the integration, as follows.Define an n x n matrix [K]:[K],^(Spy N— 12-CPSP • Scr)^ (C.24)The diagonal elements of K are (1 — 70cP). There must exist a unitary transformation U„„which diagonalizes K:••E 11,p1C,U;71 =^ (C.25)per=1Fi =Appendix C. Calculations from Chapter 4^ 159where A i, are the eigenvalues of K. From the definition of K it follows that:E uppsp„u; y - E Uvp {CPSP • so] Uay =E uvp [CPSP • so] u;,yi = —(1 -^(C.26)pa=1so U also diagonalizes [CPSP • Su].The coordinate transformation proceeds as follows:2a n^1^ •exp - -2 E0=6+1-213V i liPE[CPSP • S'] I 02(110)2 > cpsp.scrPta= 1p=127r—y-^Eexp- 2- [.. -xtr` ,...}11 -1 UKU -1 U— 20/1:--htl > u-iu[cPsP • sciu-iu^- /32w? E CPSP • 5'p=1^ p,cr=1X 14a= 1^2(-2r) (P— a)n 1eXp — — E En2^APpoIP i°6Per^- 20-ghP E^- 02 (hp )2 E cpsp. sa^p,cr=1 p,a=iwhere the transformed integration variables i are defined as:p=8+1 p,cr=1(C.27)is I^U I x;;; I .^ (C.28)Up a^U;p1..7=1(C.30)Appendix C. Calculations from Chapter 4^ 160Because this transformation is unitary its Jacobian is 1, and the integral over i will be identicalto the integral over x. F1 can be written more simply as:P nF1 = (2r1 )Plan exp 2 E E Ap (i';,) 2µ=s+1 p=1—2A5Kriel t7p0. — ApYit,; — 132 (h 1- )2 E cPs0 • s- ,^(C.29)cr=1whereAll mixed products of the form xf;xii: have been removed from this expression, so the inverseGaussian transformation can now be calculated. Using Equation 4.24, the integral of F 1 isequal to:( P nGi Fa I I c°^II ri ,./,..9) F1_00 0=8+1 P=1(i:L:eln p n ^ n= ( 2rI. ) 2^expri 11 _22-^{PN(4,0 )2 0p2 (1 — Ap)2 + —2192(1.1°) 2 E cPsP • s°}^0=3+1 p=1 AP^2AP^cr=11 ^fiNiis,\2^)2^3PIA-1 det(K) exP 2^)^U2 (1^P^7,71- t CP SP•ocrP^Ap IV(C.31)P-1 P,cr=1Equation C.31 can be written more simply in terms of the matrix K. The exponent iswritten in terms of K using the expression derived in Section C.7, the determinant of K ismoved into the exponent using the identitylog(det(K)) = Tr(log(K)),^ (C.32)and the sum of all the external field effects h are grouped under a single symbolH E E (i10)2.^ (C.33)p=s-1-1E U„.o=1(C.34)Appendix C. Calculations from Chapter 4^ 161In addition, following [35], P - s can be replaced by P for very large N, because P scales with Nwhile s remains finite. The result is the following expression forG1 = exp2 -aTr(log K)^E [K-1] - Pflnp,a=1C.7 Simplification of Exponent in G1This section describes how to simplify the exponent in Equation C.31, which is referred to as"X" for this derivation only:X = PiIN [ n - (1 - 2P^APp=1E u2 ^A ) + — E cPsP.se .N p,cr=10 nThe matrix U is a contraction of a unitary matrix U, and Up, = U;p1 for any unitary matrix,SO:Op 7=7.^=a=1Using this, and the definition of K from Equation 4.38, X can be written:/3HNX - ^2 [ E w-i)„„ - 2 + (Ad Up, +np,cr ,v=1E^Kip, .^(C.35)P,a=1These are the inverse of the transformations done in Equation C.25, so:E(U-1)pv —1 u„Auv=iandE(U-1),9v A, U„„. = [IC]v=1This means that X is equal to:[K-l]pp-X = ^ir N2[p,o=1f (1 — 1<)2 4KfiliNnE [K-lpo. p,cr=1E - K] papor=1(C.36)eto =Appendix C. Calculations from Chapter 4^ 162C.8 Self-Averaging in Condensed Near-Orthogonal MemoriesThis section demonstrates that a sum of the formNE f (d • • • en^ (C.37)is self-averaging when s < N. This conclusion is not immediately obvious in this case, becausethe elements of each e are generated by adding noise to a set of exactly orthogonal memories 1',rather than randomly. It will be shown, however, that this structured process for generatingmemories actually works in favour of self-averaging.Self-averaging is best understood if the set of s memory vectors are placed together to forman s x N array of bits. A sketch of such an array is shown here, with the 9th column vector, es ,highlighted:—++—++—+ + ++---+—++—++—++—+---+ —++- -+— . .—+—++—++ + +—++—+—++----+—+—+^+--++ . .—++—+++— + --+—++---++----++++++—+—++—+ . .•.+—++—+ —+•.— —++++—++--+—++----++^+++—+ . .When N >> 28 , every possible column vector occurs many times within the sum in expres-sion C.37, and each time it occurs, f ...C) contributes the same amount to the sum. If thefrequency of each column vector can be estimated accurately, then the sum over all bits i canbe replaced by a "self-averaged" weighted sum over all possible column vectors fed [1, pg. 188].The first step is to establish the expected frequency of each possible column vector. Let {T}be the set of all possible column vectors, and number its members from T1 to Tea. Let Oi bethe number of times that vector Ti occurs in a given memory set {N. All column vectors areAppendix C. Calculations from Chapter 4^ 163equally likely for near-orthogonal memory sets because they are generated by adding noise toa set of randomly-seeded orthogonal memory sets, so the expected frequencies of all columnvectors are the same:YiDRA} = 2.The next step is to show that N/2' is an increasingly accurate estimate of cki in the thermo-dynamic limit. More specifically, it must be shown that the standard deviation in srki becomesinsignificant compared to 4 as (N —+ oo). The expected variance in 40 .i can be written as:War(95i)He} = (( (.i)2 )) µ}where the frequency deviation 4 is:Cl?^g6i —^ (C.38)The overlap (e •e) between two memories is equal to a weighted sum of the frequencydeviations:Ne *ei = Eereri=i2 sEoinT7j=12 8^28= E q5iT3f,27+,i)y-4-.1 TfT!3=1^j=1=0The r71!" factor is +1 half the time and —1 half the time, so this sum is equivalent to a random3 3walk of 28 steps, with step size equal to the standard deviation in cj. Thus the variance in (eielis related to the variance in 4 ; by:Var(e -ev) = 28Var(4j)Appendix C. Calculations from Chapter 4^ 164Using Equation 2.14 as the variance in e.eu, this leads to:Var(0j) = 2 [1 — (1 — 2b)4 1The standard deviation in Oj will therefore be proportional to ITV, and will become insignif-icant as N becomes very large. As b ---+ 0 and the memories become more orthogonal, thevariance in cb diminishes, so near-orthogonal memories are even better self-averagers thanrandom memories.In summary, it has been shown that self-averaging does occur when the memories are near-orthogonal, so the sum in expression C.37 can be replaced as follows:E f(61. • •c) = N ((f(7713^3 ))T1...T2s^ (C.39)i=iC.9 Replica Symmetric Value of C1This section derives an expression for O 11 in the case of replica symmetry, and in the limit ofsmall n. Equation 4.41 indicates that the determinant and trace of K will be required, so theseare calculated first.The matrix K for replica symmetric solutions is of the form:K =whereA B B - • •B A B ••^-(C.40)B.B.A.• • -.A -a 1 — 0-yCB^—0-yCq^ (C.41)OAadet K = —(A — B)n-2 (2B — 2A — 2Bn + An)adet K = (A — B)n-2 (2A — 2B + nB)OB(C.47)Appendix C. Calculations from Chapter 4^ 165Its eigenvectors are:/1\ 1^1^\ 1^0\ 1^0\1 —1 1 01 0 —1 11 0 0 —1\ 1 / ‘^0^/ \^0^/ \^0^/and the corresponding eigenvalues are:Ai = A + (n — 1)BA2 ... An = (A — B)It follows immediately that the determinant and trace of K are:det(K) = (A — 13)n-1 (A — B nB)Tr(K) = A — B + AnThe inverse of K is calculated as follows:1 0 det(K) = det(K) 0 [K]upbut:(C.42)(C.43)(C.44)(C.45)(C.46)—[K 1 1PP^2A — 2B + Bn(A — B)(A — B nB)[ -1K Ps^2B — 2A — 2Bn + An(A — B)(A — B + nB) .SO:(C.48)Equation 4.41 requires the following contraction:n(2A — 2B + Bn) n(n — 1)(2B — 2A — 2Bn + An). (C.49)[K-i]^=Pa (A - B)(A - B nB)p,cr=1Appendix C. Calculations from Chapter 4^ 166In the limit n 0 this becomes:nE [K-1p,cr=14nA - B + 0(n2 )4n + 0(n2 )1 - 07C(1 - q)(C.50)Using Equation C.32, the trace of the log of K can be simplified as follows:Tr log(K) = log [(A - B) 72-1 (A - B nB)]= nlog(A - B) + log (1 + nBB^(C.51)In the limit n 0 this becomes:Tr log(K) = nlog(A - B) nB- A + 0(n2 )= n flog [1 + P7C(q - 1)]^P7Cq O(n2)^(C.52)1 + P7C(q - 1)Replacing C.50 and C.52 into Equation 4.41 yields the following expression for G1:-= exp Nn {-71 log [1 + 07C(q - 1)] + )3(aryCqC(q - 1)411)^210}^(C.53)2^ 2 + 2,37 C.10 Replica Symmetric Value of O2The value of G2 for a replica-symmetric system is calculated from Equation 4.43. In thelimit n^0 the initial factor becomes unity:n(2s+n-1)lim 02(ff 2) = 1.n—r0 ( 27rThe sums over p and o in the exponent are straightforward. In the limit of small n, the firstsum can be approximated as follows:n-1 nE E rq = -2 n(n - 1)rqp=1 cr=p+11--2 nrq[-22 F3 ESP ))p=10 ^/ I I I p. ...T2 .Appendix C. Calculations from Chapter 4^ 167The resulting expression for G2 is:G2 = exp NO 21 _-_-__ary + Oarq + 2., pr(, m02 + vittillo - fiaymmti2^,--8 1^ (C.54)0=1 2C.11 Replica Symmetric Value of F2Following Amit [1], the expression for F2 from Equation 4.45 is simplified for the replica sym-metric case as follows:F2 = exp N ((log E • • • Esi.±i sn=±12exp {a02 r[ (VS`p^7.21 _^eV' (E SP)] ))^2 L--" 2p=1 p=i^p=1T1 ... T2 8Equation 4.24 is now used to linearize the expression with regard to So:F2 = exp N ((log E • • . Es1=±1 Sn =±1e 2^— exp —(222 °° dz'n f-00 0-7-rdz e_z2,2 E E^e-sPF3 ))exp N ((log le -1 rn 1.°3J_coN/Ti^s1=±1 sn=±1 p=1 71...T2awhere:F3^(VaTZ cx0 E eTP)^ (C.55)p=1Using Equation C.2, this reduces to:Na B2 rn^ dz^z2 nF2 =^exp N ((log J^e- -2- ll E e-SF3))-00^p=1S=±1^11...T2 aNo_42rn^ dz^z2= e- 2expqlogf—e- 7 (2 cosh F3 )n-co firr Ti...T2 aAjap_2_,2^ °r) dz^z2= e-- 2 exp N //^e- 7 exp En log(2 cosh F3 )]))V2r .. T2 aAppendix C. Calculations from Chapter 4^ 168In the limit n -- 0, OA = 1 + nX, and log(1 + nX) = nX, so:Noe_v_m^ °° dz^.2F2 = e- 2 exp N ((log L .‘/Ti. e- i- [ 1 + n log(2 cosh F3)00,2 rn ^dz 2= e 2expllog{1+74--e- T. log(2 cosh F3 d))_0007r 71..T2sitp ^dz= e 2 .. exp Nn ((f° ^--00V27r e— 4. log(2 cosh F3)))71.•. T2 a= exp Nn (((1o$(2 cosh F3 ))))T1 ...Tv^ai32712T1 ..T2 s(C.56)where the triple brackets 0( ))) represent both an ensemble average over the column vectors T1and an average over a Gaussian window in z:(V(z,Z)))) E^C() dz^2e 2.2 x(z ,Ti)))^f2Tr ... T2.^(C.57)C.12 Detailed Calculation of the Mean Field EquationsThe mean field equations are calculated by requiring that the derivative of the free energy ((F)),Equation 4.49, is zero for a set of variables inf i4 , yµ, q, and r. The following symbols are localto this section, and do not correspond to symbols used elsewhere:A = denominator in the expression for yo= derivative of C with respect to m il:Setting the derivative with respect to Int to zero provides an equation for yo:0 ((F)) =^=0^a (-07(1— q) C' ^a7qC'814^20^A^2A(a7Cq 41)07(1 — q)C'7274 — 711P + Pay' ,2A2where:C'^-274(1 - Vt)A E 1 - 137C(1 - q)^ (C.58)Appendix C. Calculations from Chapter 4^ 169The equation reduces to:yo = __1 17 2 74 + 'Yip`- G3(q, m4)]afl Lwhere, as in Equation 4.53:(C.59)G3(q, m4)^72741(Al Vt) [a - a i37C(1 - q) 2 + 40(1 - q)ii]^(C.60)Zeroing the derivative with respect to yo provides an equation for mf::0y0 = = pane - (((tanh (Var z^E eTu) aP 2TP)))v=10((KT° tanh (Var z^teTuv i ))))Inserting the value of yo, this becomes:tsinh (C.61)rnf: = (((T1`tanh^z +^(72m4 + -yitv - G3 (q, mt)) T" ^.62)v=iZeroing the derivative of ((F)) with respect to r generates the mean field equation for q:a ((F))^0 = /3a(q - 1) + (((tanh (01/7:t7z+^ev) z\rf.))) —OrA^0=1Bro dz z e_z212 ((tanh(Az + B)))AN/Trir(1 —^L ooViTrThis can be integrated by parts, using d/dx tanh x = sech2x:)3NicW(1 - q) =^((tanh(Az B) e'212 ))'^simdz e-z2/2 ((A sech 2(Az B)))12-Tr _00 -00 Orrqqq---,===Avai- (((sech2 (Az1 - (((sech2 (Az(((tanb 2 )3+ B))))4- B))))[far z+ ai3tyPTP])))14=18Hz + E (72,4 + yitt` - G3(q, nzt4 )) Tµ14=1(((tanh2 /10 = —2A272c (a7Cq + 4H) 13;7*r - 7C (ayCq 4if) aA2((F))aq(C.64)Appendix C. Calculations from Chapter 4^ 170Finally, when the derivative of the free energy with respect to q is set to zero, the result isan equation for r:Appendix DCalculations from Chapter 5The following sections provide detailed derivations of results which are used to calculate thenetwork capacity in Chapter 5.D.1 Limiting Behaviour of 13(1 — q)The following calculation of B p(1 - q) in the limit p 00 is based on Lautrup's deriva-tion [35, pg. 45]. The following symbols are local to this section, and do not correspond tosymbols used elsewhere:Y = the tanh expression in Equations 4.50 and 4.51a = the factor multiplying z in the argument of Yb = the factor not multiplying z in the argument of YDefine Y as follows:Y^tanh P(az b)where8{ 7 m +7h^(1 0 ) [are 4if Bi} Tv2 v^v 2( 1 - 7C Brb171OYOb^cosh2 13(az b)= ,(3(1 — Y2 ) (D.1)Appendix D. Calculations from Chapter 5^ 172Note that:The limit of /3(1 - q) is calculated as follows:lim ,3(1 - q) =^- Y2)))/3-400^0-K,0= limOk •0.00 dbwhere (((Y))) can be written as:lim (((1')))=0-,00lim -1 —0 (((log cosh d(az b)))) .p ObIt is possible to integrate out the z-dependence of this expression in the limit of large byfirst eliminating the transcendental functions using the following identity:lim log cosh [3(az b) = lim [log (60(az+b) e-p(az-F0) _ log 2]0-400^ /3—oo= slim Plaz -1-bl- log 2 ] .00The average over z is calculated as follows:lira^000 2$ O— (((log(1 - 14)kp-, b(D.2)(D.3)Jaz + b I^= foom d2Z7r e— 1z2 I dZ bl—00b/a dz^i 2z faz b)^biaeoo dz - Lz2 (az +e 2a re lz2100^Nig^—b/a1 -00a re_izzi-b/a- b^b/a dz _ 1-z2^b r° dz re^2- -2 Zf_co V27r^.1—b/a(D.4)e-°2 ' 2` dudx erf(x)CV 1.1 erf ( ) Ik c/2 _x2o_r e (D.5)(D.6)(D.7)Appendix D. Calculations from Chapter 5^ 173The solution of these integrals involves the error function erf(x). The following two proper-ties of erf(x) will be useful:Equation D.4 reduces to:Iazz bi) za 172-=^ exP —2a2 ) —2 [(1 erf (N/ a^)) —^erf^))1affexp 2a2(-21 + berf ( bV r^.12- a)and its derivatives are:05-1,( !az +bl)z^erf ( b•\/ a)02)172( laz + bI )z = _Vex, (—b2)a 7r 1' 2a )Replacing the first derivative into Equation D.3 gives:olLma (( 1)) = ((01 Lino 0 en (-‘/ a^)))^((^(^ire^2^L amerf^4i/B)1))72 m +^7 (1 V4) ^(D.8)tar v_ 1^[1 — 7C leand when the second derivative is replaced into Equation D.2 the result is:^B =^0(1^= Ern 1 ((3 , r_2i e (—b2 )))0.00 T3^a V 7r jcP 2a22lim a((_.1 Aexp2a2b(D.9)0.00^V 7r C^Vy2C2 2m2 (1 -^- yCioa. =- Vt)m(D.12)^Appendix D. Calculations from Chapter 5^ 174D.2 Solution of Mean-field Equations for No Clamped NeuronsIn this section an analytical solution is found for the mean-field equations in Section 5.1.2. Theequations areerf(y)C2r = ^ (D.10)(1 - CB) 2where:m =y^(1 (1 - CB) 2 J tarB = ^,A7v -rr e-12.and(D.11)B can be eliminated from the definition of y using Equation D.10(1^)m ar.2/Try = mC2This is a quadratic equation in NicTer whose solution isfar = C22(1 VI2y +2 477t2 (1 - Vi) \ridC2Choosing a positive sign will give the correct value in the limit V 0, so:Replacing this into Equation D.11 givesB = 2(1 - Vi)erfC(y) e-Y2 I /y2C2 + 2(erfy) 2 (1 -^- Cy} -1 .Nfi (D.13)The mean-field equation in r can be written:1^1 - CBC(D.14)Appendix D. Calculations from Chapter 5^ 175so the square of Equation D.12 is equal to:a =^1 - CB ^(1/y 2c2 2(erfoo. _ vt) - Cy)] .2 (1 - Vi)erf(y)(D.15)When Equation 5.5 is replaced into Equation D.15, the result is an equation for a as an explicitfunction of y.D.3 Distribution of Xi Before Roll-UpThis section calculates the mean and the variance of the neural alignments Xi of a randomlyselected memory vector before roll-up has occurred. A definition of the neural alignments isgiven in Equation 5.11.The ensemble average variance, Vx is, by definition:Vx = ((X 2 )) - ((X)) 2 .The first term ((X2 )) can be calculated as follows:((14-1 14-1E 77119 7-ftverC p=1 t.•=1p—lp-1 1^N N= E E N2((E E siskexcer))p=1 v=1^j=1 k=1p-1 N-1 ep tl,^NE E 1.4N + E E Si Sk 61 61:))p=1 v=1^j=1^j=1 k$j740p-1 p-1= E E 22 + Efue.Pg7))N Jo=k^1)2^1(µ N2 + N 2 E EjOi p=1 E6ferex))1,0PNO—1)2 (N - 1)(p - 1)N 2 +^N2(D.16)Appendix D. Calculations from Chapter 5^ 176The second term can be calculated as follows:0-1 N= iN E ESiSier0)v=iP-1^((P-1 N= -N E 1 + E E sisiem))u=i^v=iv 1 %v 1 jOi— 1^ =NSo the variance in Xi is:(N — 1)(y — 1)Vx =^aN2(D.17)(D.18)If, as assumed in Section 5.2.3, the probability distribution is Gaussian, then the probabilitydistribution of the Xi at time t = 0 can be written as follows:P(X^1,O) = ^ exp^a)212a(D.19)D.4 Effect of Inverting One BitThis section calculates an exact expression for the change in E xi when one neuron in thenetwork is inverted, and uses it to derive an expression for the average time rate of change of Xas the network rolls up.The exact expression is derived as follows. The initial states of all the neurons are written Si,and the final states are written Similarly Xi and Xi are used to represent the initial and finalneural alignments, as defined in Equation 5.11. Let k be the index of the inverted neuron Sk.The new value of Xi, for (i k), is:P-1^P-1E E sgrr - EN ^S;SkEi Skv=i jok v=1-ki(Ok) =2 P-1= X. - N— SiSkCav=1= Xi - 2ZikAppendix D. Calculations from Chapter 5^ 177where:sisi p-1Zji N EThe new value of Xi, for (i = k) is:Px-,-1 — 1= N 2., 2_,SkSieM + ^u=1 j$k= -Xk + 2aXk(D.20)The sum of Xi includes (N - 1) terms where i k, and one term where (i = k). The sumis exactly equal to:N-kii=iN• E Xi - 2 E zik - 2xk + 2ai=i^i,-£kN N• EXi - 2E Zik 2Zkk - 2Xk F 2ai=i^i=i• E xi - 4 [Xk - a] .i=i(D.21)This result was derived without making any assumptions about the initial distribution of Xi.It is now convenient to bring the notation for Xi back in line with what is used in Section 5.2,so that Xi becomes Xi(t+1), and Xi becomes Xi(t). It follows immediately from Equation D.21that the change in the average value of Xi in one time step is equal to:dX(t) dt1r^N= - E xi(t + 1) - E Xi(t)N i=1^1=147v- [a - Xk]Because this is linear in Xk, the average rate of change in X over many updates will be-1-v [a - (Xk)] . (D.22)Appendix D. Calculations from Chapter 5^ 178proportional to the average of Xk:D.5 Mean Value of X; for Xi > aThis section calculates the average value of all the Xis which are greater than a. The desiredvalue is00^(X) = C1^dX X P(X,t)awhere P(X, t) is the Gaussian distribution given in Equation 5.14, and where C 1 is a constantwhich satisfies the following normalization condition:00^1 = C1^ dX P(X,t)^ (D.23)aBoth of the above integrals are standard, so their solutions need only be briefly summarizedhere. The normalization constant is:V2 ^_ erf(y)]wherea - XThe average value of Xi is:(X) = CI. 100 dX X expa-(X - X) 2 ]1[2Vx-X2= CC1cr-x_ dX(X + X) exp[ 2Vx i= CiXVz-+ Ci^[-Vx exp (-x2^m2V.x ^c,_1--)][1 - erf(Y)]2Replacing the value for C1 gives:(X) V_ 2Vx e-Y2r 1 - erf(y)Appendix D. Calculations from Chapter 5^ 179D.6 Estimate of the Variance Vx (t)This section derives a relationship between the average neural alignment X(t), and the variancein the neural alignments, Vx(t). The derivation begins by examining the ensemble averageof (X1) 2 :(((X02)) =^2(K 1 N P-17F-3 E E sisisisgregfef))i ,j,k=1 pw=1=1^S j Skell^qm")))N2 4-4j,k=1 u=1^p=1(D.24)The overlap m" between two memories can be used to calculate the probability that anytwo bits a and Er are identical, as follows:1+ no'P(ere = Efc ) =^21 - m"P(a 0 =The average value of the parenthesized sum in Equation D.24 can therefore be written as follows:m-Perc)) =^[m-P ( 1 +2') (Er) + ( 1 —2m" ) ( erd))P^t^p=12= (( 1. (mlip)2 erc ))P=1p-1= Dm"? ))pOti^YAP= [( 4 — 2 ) (((mvP )2 ))vop + 11 ak1wherea(0)2 + 1(D.25)Appendix D. Calculations from Chapter 5^ 180and where O is the rms overlap between previously-installed memories.When Equation D.25 is used to replace the parenthesized factor in Equation D.24, the resultis:N p-1(((X1) 2)) =^E E SjSkqj,k=1 v=1= kiXwhere X is the average value of Xi, as used in Equation 5.14. The variance in Xi is therefore:vx ==(((xi )2)) - ((xi?kiX — X 2= X- [a(Orms )2 + 1 — fd (D.26)D.7 Time Evolution of the Number of Invertible NeuronsThis section derives a differential equation which describes how the number of invertible neuronsvaries with time. The result is used in Section 5.2.4 to calculate the rms overlap of the networkafter roll-up.An "invertible" neuron is one which might be inverted as the network rolls up. For that tobe possible, it must be unclamped, and it must have (Xi > a). The symbol B(t) represents thetotal number of invertible neurons at time t.Let F be the set of all neurons, clamped or unclamped, for which Xi > a. The populationof .7" is:= NPTwhere PT is the total probability that any neuron has Xi > a. The value of PT is equal to thearea under the graph of P(X, t) and to the right of a in Figure 5.4, and can be calculated asAppendix D. Calculations from Chapter 5^ 181follows:PT = j: dX P(X ,t)1 [1 erf(a X)]217xThere are two methods by which neurons leave .7.:(D.27)1. they may be inverted (this happens to exactly one neuron in each time step), or2. their neural alignments may become smaller than a due to the changes in X and V.The total number of neurons that leave P by the second method is equal to one less than thetotal number that leave:total loss by second method = --dt (NPT) — 1.It is assumed that invertible neurons are just as likely to leave by the second method as anyother neurons, so the reduction in B(t) by the second method is proportional to the total lossby the second method:loss of invertibles by second method =^B^ dNPT [ dt (NPT) 1]The total change in the number of invertibles is negative, and is the sum of the amounts lostby the two methods:dB^B d— = —1 +^ { (NPT) 11dB NPT dtThis expression can be made independent of the size of the network N by defining:b EThe differential equation can then be written:tT = --N-(D.28)db_ ^[dPT—dt :^+ PT dT +1 (D.29)
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Self-orthogonalizing strategies for enhancing Hebbian...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Self-orthogonalizing strategies for enhancing Hebbian learning in recurrent neural networks Davenport, Michael R. 1993
pdf
Page Metadata
Item Metadata
Title | Self-orthogonalizing strategies for enhancing Hebbian learning in recurrent neural networks |
Creator |
Davenport, Michael R. |
Date Issued | 1993 |
Description | A neural network model is presented which extends Hopfield's model by adding hidden neurons. The resulting model remains fully recurrent, and still learns by prescriptive Hebbian learning, but the hidden neurons give it power and flexibility which were not available in Hopfield’s original network. The key to the success of the model is that it uses the emerging structure of its own memory space to establish a pattern in the hidden neurons such that each new memory is optimally orthogonal to all previous memories. As a result, the network actually learns a memory set which is "near-orthogonal", even though the visible components of the memories are randomly selected. The performance of the new network is evaluated both experimentally, using computer simulations, and analytically, using mathematical tools derived from the statistical mechanics of magnetic lattices. The simulation results show that, in comparison with Hopfield's original model, the new network can (a) store more memories of a given size, (b) store memories of different lengths at the same time, (c) store a greater amount of information per neuron, (d) retrieve memories from a smaller prompt, and (e) store the XOR set of associations. It is also shown that the memory recovery process developed for the new network can be used to greatly expand the radius of attraction of standard Hopfield networks for "incomplete" (as opposed to "noisy") prompts. The mathematical analysis begins by deriving an expression for the free energy of a Hopfield network when a near-orthogonal memory set has been stored. The associated mean-field equations are solved for the zero-temperature, single-recovered-memory case, yielding an equation for the memory capacity as a function of the level of orthogonality. A separate calculation derives a statistical estimate of the level of orthogonality that can be achieved by the roll-up process. When this is combined with the memory capacity-vs-orthogonality result, it yields a reasonable estimate of the memory capacity as a function of the number of hidden neurons. Finally, the theoretical maximum information content of sets of near-orthogonal memories is calculated as a function of the level of orthogonality, and is compared to the amount of information that can be stored in the new network. |
Extent | 7045459 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2008-09-10 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0085553 |
URI | http://hdl.handle.net/2429/1769 |
Degree |
Doctor of Philosophy - PhD |
Program |
Physics |
Affiliation |
Science, Faculty of Physics and Astronomy, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1993-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1993_fall_phd_davenport_michael.pdf [ 6.72MB ]
- Metadata
- JSON: 831-1.0085553.json
- JSON-LD: 831-1.0085553-ld.json
- RDF/XML (Pretty): 831-1.0085553-rdf.xml
- RDF/JSON: 831-1.0085553-rdf.json
- Turtle: 831-1.0085553-turtle.txt
- N-Triples: 831-1.0085553-rdf-ntriples.txt
- Original Record: 831-1.0085553-source.json
- Full Text
- 831-1.0085553-fulltext.txt
- Citation
- 831-1.0085553.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0085553/manifest