THE DEVELOPMENT OF VLSI IMPLEMENTATION-RELATED NEURAL NETWORK TRAINING ALGORITHMS

By Gang Li

B.A.Sc., Harbin University of Science and Technology, China
M.A.Sc., Shenyang Institute of Computing Technology, Chinese Academy of Sciences, China

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES (ELECTRICAL ENGINEERING)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
July 1994
© Gang Li, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of
The University of British Columbia
Vancouver, Canada
Date: Sep. 30

Abstract

Artificial neural networks are systems composed of interconnected simple computing units, known as artificial neurons, which simulate some properties of their biological counterparts. They have been developed and studied both for understanding how brains function and for computational purposes. Two kinds of Neural Network Model (NNM) architecture are the most popular: recurrent and feed-forward. The recurrent model (the Hopfield network) is one of the simplest NNMs; it is specially designed as a Content Addressable Memory (CAM). The feed-forward models include the Perceptron and multi-layer perceptrons, which have proved useful in many applications. This thesis consists of two parts. In Part 1, recurrent models are investigated and a novel digital CMOS VLSI implementation scheme is proposed.
Synaptic matrix construction rules (training rules) for the proposed model were simulated on SUN workstations. Three widely accepted training rules are simulated and compared: Hebb's rule, the Projection rule, and the Simplex method. A coding scheme, named "dilution", is applied to the three training rules. Both the Dilution Coding Hebb's rule and the Dilution Coding Projection rule were verified to exhibit good performance. In Part 2 of the thesis, feed-forward models are introduced. Variations of the BP algorithm are developed together with considerations of hardware implementation. The proposed DPT (Delta Pre-Training) method can speed up BP training and can also reduce the probability of getting trapped in a local minimum. The implementation of the DPT method will not increase the complexity of the VLSI design.

Table of Contents

Abstract ii
List of Tables vii
List of Figures viii
Acknowledgement x

1 INTRODUCTION 1
  1.1 History 1
  1.2 Motivation 2
  1.3 Thesis Outline 4
  1.4 Neuron Models 5

2 RECURRENT NETWORKS 11
  2.1 Introduction 11
  2.2 Analog Networks Implemented 13
  2.3 Hopfield Network 14
  2.4 Summary of Symbols for Hopfield Network Models 15

3 ALGORITHM ANALYSIS AND SIMULATION RESULTS 18
  3.1 Description of learning rules 18
    3.1.1 Hebb's Rule 18
    3.1.2 Projection Rule 18
    3.1.3 Simplex Method rule 19
    3.1.4 Model Summary 21
  3.2 Dilution Coding Rules with Two-Valued Models 22
    3.2.1 Dilution Coding Hebb's Rule 23
    3.2.2 Dilution Coding Projection Rule 24
    3.2.3 Simplex Method with Unipolar Neurons 24
    3.2.4 Simulation Results 25
  3.3 Comparison 27

4 IMPLEMENTATION OF A FULLY DIGITAL CAM 28
  4.1 Architecture 28

5 FEED FORWARD NETWORKS 34
  5.1 Introduction 34
    5.1.1 Perceptron and Delta Rule 34
    5.1.2 Multi-Layer Perceptron and BP algorithm 39

6 A New Method for Refinement of the BP training Algorithm 44
  6.1 Acceleration of BP trainings 44
    6.1.1 Previous Work 44
    6.1.2 Beginning of BP Training 46
    6.1.3 DPT: a Delta Rule Pre-Training Method 48
  6.2 Avoiding local minima 50
  6.3 Simulation Results 51
    6.3.1 Encoder/Decoder Data Compression Problem 52
    6.3.2 XOR Problem 54
  6.4 Discussion 55
  6.5 Future work on the DPT method 56

7 A Local Minima Free Training Algorithm 58
  7.1 Globally Guided Back Propagation 58
  7.2 Locally Guided BP: suitable for VLSI Implementation 60
    7.2.1 LGBP algorithm 61
    7.2.2 Comparison of LGBP and GGBP 62

8 Implementations of BP Algorithms 64
  8.1 Introduction 64
  8.2 Direct Implementation of BP algorithms 65
    8.2.1 Replacement of derivative with a constant fails on XOR problem 67
    8.2.2 Replacing derivatives of output neurons 69
    8.2.3 Summary 72
  8.3 Weight Perturbation: An Optimal Approach for Analog VLSI Circuits 73
    8.3.1 Weight Perturbation of LGBP 73
  8.4 Comparison of Perturbation and Non-Perturbation 75

9 Conclusions 77
  9.1 Summary 77
  9.2 Future Work 79

A HOPFIELD NETWORK IMPLEMENTATION 81
  A.1 Interface 81
  A.2 RAM Module 83
  A.3 Neural Network Module 85
  A.4 Control Module 89
  A.5 Controller FSM Incorporating Pipeline Structure 93
  A.6 Implementation Considerations 95
  A.7 Testability 96
  A.8 Performance of the proposed network 96
  A.9 Expandability 99

B Summary of Notations 102

Bibliography 103

List of Tables

3.1 Learning Rules for Hopfield Networks with Bipolar Neurons 21
3.2 Simulation Conditions and Characteristics of Two-Valued Models 26
6.1 Simulations on Data compression problems 53
6.2 Simulations on the XOR problem 54
6.3 Comparison on XOR problem 56
8.1 Simulation Results with different back-error functions 70
A.2 Pin Assignment for Neural Network Module chip 88
A.3 Pin Assignment for Control Module chip 92

List of Figures

1.1 A Biological Neuron 6
1.2 An Artificial Neuron 7
1.3 Typical Activation Functions: (a) Linear; (b) Quasi-linear; (c) Threshold-logic; and (d) Sigmoid 8
2.1 Schematic of a Hopfield Neural Network 17
3.1 Comparison of Learning Rules on Hopfield Network (P=10, N=60) 21
3.2 Comparison of Learning Rules for Two-Valued Networks 25
3.3 Comparison of Dilution Projection Rule with Others 27
4.1 Block Diagram of the Associative Memory Co-processor System 30
5.1 A Perceptron Network 35
5.2 A Sigmoid Perceptron Network 37
5.3 A Multi-Layer Perceptron Network 40
8.1 An unconverged XOR network after 10,000 epochs 68
8.2 Back-Error Functions vs Error 72
A.1 Working flow viewed from host computer 82
A.2 Functional diagram of the RAM Module 84
A.3 Functional diagram of the Neural Network Module (1 of two) 86
A.4 Functional diagram of the Neural Network Module (2 of two) 87
A.5 Functional diagram of Control Module 90
A.6 State transition flow of Controller FSM 91
A.7 Performance of the proposed network with Dilution Hebb's rule 98

Acknowledgement

I would like to thank my supervisor, Dr. Hussein Alnuweiri, for his support and encouragement in finishing this thesis. His patience and kindness throughout the writing of the thesis helped bring it to completion. Many thanks to Dr. David L. Pulfrey for his arrangement and help during my graduate studies. Thanks also to Dave Gagne for his help in using the machines and software packages in the department. Thanks to my wife, Hongbing Li, for disturbing me when I was working too hard and pushing me when I was relaxing too much.
Thanks to my lovely son, Yaoyao Li, for his patience when waiting for me to play with him. Finally, this thesis would not have been possible without the support of my parents.

Chapter 1

INTRODUCTION

1.1 History

People have long been curious about how the brain works. The capabilities of the nervous system in performing certain tasks, such as pattern recognition, are far more powerful than those of today's most advanced digital computers. In addition to satisfying intellectual curiosity, it is hoped that by understanding how the brain works we may be able to create machines as powerful as, if not more powerful than, the brain.

The nervous system contains billions of neurons which are organized in networks. A neuron receives stimuli from its neighboring neurons and sends out signals if the cumulative effect of the stimuli is strong enough. Although a great deal is known about individual neurons, little is known about the networks they form. In order to understand the nervous system, various network models, known as artificial neural networks, have been proposed and studied. These networks are composed of simple computing units, referred to as artificial neurons since they approximate their biological counterparts [114].

Work on neural networks has a long history. Development of detailed mathematical models began more than 50 years ago with the works of McCulloch and Pitts [63], Hebb [43], Rosenblatt [71], Widrow [79], and others [70]. More recent works by Hopfield [46, 47, 48], Rumelhart and McClelland [73], Feldman [33], Grossberg [41], and others have led to a new resurgence of the field since the 1980s. This renewed interest is due to the development of new network topologies and algorithms [73, 33, 46, 47, 48], new VLSI implementation techniques [2, 6, 11, 13, 64], and some interesting demonstrations [48], as well as a growing fascination with the functioning of the human brain.
Recent interest is also driven by the realization that human-like performance in areas such as pattern recognition will require enormous amounts of computation. Neural networks, along with other non-conventional parallel computing machines, provide an alternative means of obtaining the required computing capacity. In addition to aiding the understanding of the nervous system, neural networks have been suggested for computational purposes ever since the beginning of their study. McCulloch and Pitts in 1943 [63] showed that neurons could be used to realize arbitrary logic functions. The Perceptron, proposed by Rosenblatt in 1957 [72], was designed intentionally for computational purposes. In 1960 [79], Widrow proposed a network model called Madaline, which has been successfully used in adaptive signal processing and pattern recognition [80]. In 1984, Hopfield demonstrated that his network could be used to solve optimization problems [48, 49]. In 1986, Rumelhart and his colleagues rediscovered the generalized delta rule [73], which enables a type of network known as the back-propagation network to implement any arbitrary function to any specified accuracy. These contributions have laid the foundation for applying neural networks to a wide range of computational problems, especially those which digital computers cannot solve efficiently or take too long to solve.

1.2 Motivation

In recent years artificial neural networks have been successfully applied to problems in many areas, such as pattern recognition and classification [62, 68], signal processing [38, 39], control [24, 25], and optimization [48, 50]. The essence of neural network problem solving is performing functions which are the solutions to the problem. Designing a neural network to perform a specific function can be achieved through a learning process, in which the parameters of the network are adjusted until the network approximates the input-output behavior of the function. The generalization capability of an artificial neural network is a direct result of the learning process. The algorithms used in the learning procedure to update network parameters are normally referred to as training algorithms or learning algorithms.

An important issue in the synthesis of recurrent networks is implementing large recurrent networks in VLSI technology so that the properties of networks with large numbers of neurons can be investigated. Since there are N^2 connections in an N-neuron network, the implementation of a large network may turn out to be difficult, especially if programmable connections are to be realized. A solution to this problem is discussed in Part 1 of the thesis, where training algorithms highly suited to digital VLSI implementation are developed.

With feed-forward networks, the greatest single obstacle to the widespread use of neural networks in real-world applications is the slow speed at which current algorithms can train a network to converge. This means that we are limited to investigating relatively small networks with only a few thousand trainable weights. While some problems of real-world importance can be tackled using networks of this size, most real-world tasks are much too large and complex to be handled by current training algorithms. Two solutions may exist: one is to develop fast training algorithms, and the other is to implement on-chip learning. Advances in training algorithms and in implementation technology are complementary.
If we can combine parallel hardware that runs several orders of magnitude faster than conventional sequential hardware with training algorithms that scale up well to very large networks, we will be in a position to tackle a much larger universe of possible applications. It is the objective of Part 2 of this thesis to develop new training algorithms and to consider hardware implementations of on-chip learning.

1.3 Thesis Outline

This thesis consists of two parts within 9 chapters. Chapter 2 introduces the Hopfield network model; the concept and working principle of Hopfield networks are described. Chapter 3 presents the work on training algorithms for Hopfield networks: three major algorithms are investigated, simulation results are diagrammed, and dilution coding rules suitable for VLSI implementation are proposed. Chapter 4 describes a fully digital CMOS implementation scheme for the dilution coding Hopfield network model. The co-processor system is shown with functional blocks, and detailed descriptions of the functional blocks are given in Appendix A.

Chapters 5 through 8 constitute Part 2 of the thesis. In Chapter 5, various aspects of Feed-Forward Neural Networks (FFNNs) are introduced. The architecture of FFNNs is described from single-layer networks (the Perceptron) to multi-layer networks (Multi-Layer Perceptrons). Training algorithms of FFNNs are presented from the Perceptron training algorithm (the Delta Rule) to the Multi-Layer Perceptron training algorithm (the Back-Propagation (BP) algorithm). For convenience, the definitions of the symbols describing FFNNs are presented in Appendix B.

Chapter 6 develops a new method, named DPT (Delta rule Pre-Training), which allows the BP algorithm to start training a Multi-Layer Perceptron (MLP) with all-zero initial weights. The DPT method maintains the feature of learning by presentation of examples.
It can be viewed as part of a BP algorithm, and the combination of DPT and BP is an extended version of the BP algorithm. Simulation results are presented to show the advantage of using the derived DPT method. With the new method, the problem of slow convergence of BP training can be alleviated, and the probability of getting trapped at local minima is significantly reduced.

Chapter 7 addresses another type of BP-like algorithm, named LGBP (Locally Guided BP). LGBP updates weights under the guidance of output changes at output neurons. In other words, LGBP performs weight changes in the output space, while the standard BP does so in weight space. LGBP updates one weight at a time, so that only local information about the weight being updated is needed. The LGBP has two promising aspects: 1) it guarantees the convergence of an MLP to a global minimum, and 2) it is well suited to high-speed hardware implementation.

Chapter 8 is dedicated to the discussion of hardware implementations of FFNNs. Two implementation methods of training algorithms are studied: 1) Direct implementation, that is, each mathematical expression of a training algorithm is directly implemented. In the discussion of direct implementation, the role of the derivative of the activation function is addressed. The conclusion is that the derivative in a hidden layer cannot be replaced with a constant, otherwise the learning capacity of the MLP will be degraded. 2) Indirect implementation, that is, using weight perturbation to approximate the derivatives of activation functions. With weight perturbation, the realization of LGBP is presented.

Finally, Chapter 9 summarizes the work of the thesis, and the conclusions and contributions of the thesis are outlined. The rest of this chapter presents some fundamental concepts related to neural networks, which will be helpful throughout the reading of the thesis.
1.4 Neuron Models

The human nervous system consists of 10^10 - 10^11 neurons, each of which has 10^3 - 10^4 connections with its neighboring neurons. A neuron cell shares many characteristics with the other cells in the human body, but has unique capabilities to receive, process, and transmit electro-chemical signals over the neural pathways that comprise the brain's communication system.

Figure 1.1 shows the structure of a typical biological neuron. Dendrites extend from the cell body to other neurons, where they receive signals at a connection point called a synapse. On the receiving side of the synapse, these inputs are conducted to the cell body. There they are summed, some inputs tending to excite the cell, others tending to inhibit it. When the cumulative excitation in the cell body exceeds a threshold, the cell "fires", sending a signal down the axon to other neurons. This basic functional outline has many complexities and exceptions; nevertheless, most artificial neurons model these simple characteristics.

Figure 1.1: A Biological Neuron

An artificial neuron is designed to mimic some characteristics of its biological counterpart. Figure 1.2 shows a typical artificial neuron model. Despite the diversity of network paradigms, nearly all neuron models, except for a few models such as the sigma-pi neuron (see [30]), are based upon this configuration.

Figure 1.2: An Artificial Neuron

In Figure 1.2, a set of inputs labeled $(x_1, x_2, \ldots, x_N)$ is applied to the neuron. These inputs correspond to the signal inputs to the synapses of a biological neuron. Each signal $x_i$ is multiplied by an associated weight $w_i \in \{w_1, w_2, \ldots, w_N\}$ before it is applied to the summation block, labeled $\Sigma$. Each weight corresponds to the "strength" of a single biological synapse.
The summation block, corresponding roughly to the biological cell body, adds up all the weighted inputs, producing an output $e = \sum_i w_i x_i$. Then $e$ is compared with a threshold $t$. The difference $\alpha = e - t$ is usually further processed by an activation function to produce the neuron's output. The activation function may be a simple linear function, a threshold-logic function, or a function which more accurately simulates the nonlinear transfer characteristic of the biological neuron and permits more general network functions.

Most neuron models vary only in the forms of their activation functions. Some examples include the linear, the quasi-linear, the threshold-logic, and the sigmoid activation functions. The linear activation function is

$$f_L(\alpha) = \alpha \quad (1.1)$$

The quasi-linear activation function is

$$f_Q(\alpha) = \begin{cases} \alpha & \text{if } \alpha > 0 \\ 0 & \text{otherwise} \end{cases} \quad (1.2)$$

The threshold-logic activation function is

$$f_T(\alpha) = \begin{cases} 1 & \text{if } \alpha > 0 \\ 0 & \text{otherwise} \end{cases} \quad (1.3)$$

And the sigmoid activation function is

$$f_S(\alpha) = \frac{1}{1 + e^{-\alpha}} \quad (1.4)$$

Figure 1.3 shows these four activation functions.

Figure 1.3: Typical Activation Functions: (a) Linear; (b) Quasi-linear; (c) Threshold-logic; and (d) Sigmoid.

The threshold-logic neuron model was used by McCulloch and Pitts [63] to describe neurons which were later known as McCulloch-Pitts neurons. Rosenblatt later used this model to construct the well-known Perceptron [72], and Widrow et al. used it to build the ADAptive LINear Element, known as ADALINE. The Hopfield network proposed by Hopfield [46] also uses threshold-logic neurons as its primitive processing units. The sigmoid neuron model was used out of necessity by Rumelhart and his colleagues to construct what is later known as the back-propagation network [73]. This neuron model is used for the convenience of back-propagating errors. Variants of this model are used in networks which are trained by error back-propagation.
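The neuron model described above can be sketched in a few lines of Python. This is a minimal illustration only (the input and weight values are arbitrary, not from the thesis); it computes the weighted sum $e$, subtracts the threshold $t$, and applies one of the four activation functions of equations (1.1)-(1.4).

```python
import math

def neuron_output(x, w, t, activation):
    """Weighted sum e = sum_i w_i * x_i, compared with threshold t,
    then passed through an activation function f(alpha), alpha = e - t."""
    alpha = sum(wi * xi for wi, xi in zip(w, x)) - t
    return activation(alpha)

# The four activation functions of equations (1.1)-(1.4).
def linear(a):          return a                           # (1.1)
def quasi_linear(a):    return a if a > 0 else 0.0         # (1.2)
def threshold_logic(a): return 1 if a > 0 else 0           # (1.3)
def sigmoid(a):         return 1.0 / (1.0 + math.exp(-a))  # (1.4)

x = [1.0, 0.5, -1.0]   # example inputs (arbitrary)
w = [0.2, 0.4, 0.1]    # example weights (arbitrary)
print(neuron_output(x, w, t=0.1, activation=threshold_logic))  # -> 1
```

Swapping in `sigmoid` for `threshold_logic` turns the same unit into the smooth neuron used by back-propagation networks.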
The linear neuron model was used by Kohonen and Anderson, respectively, to construct networks known as linear associators [57, 22]. Since networks of linear neurons are easy to analyze and their capacities do not increase as the number of layers increases, they are not challenging. Nevertheless, the practical use of linear neurons cannot be overlooked. They are important in performing certain tasks such as matrix computations. Moreover, they are the basis of other, more complex neuron models. The quasi-linear activation function was used by Fukushima to construct the S-cell, which is one of the primitive processing units of his network model [35].

Although a single neuron can perform certain pattern detection functions, the power of neural computing comes from connecting neurons into networks. Larger, more complex networks generally offer greater computational capabilities. According to their structure, neural networks are classified into two categories: feedback (recurrent) and feed-forward networks. Recurrent networks are discussed in more detail in Part 1, and feed-forward networks in Part 2.

PART I: RECURRENT NETWORKS

In this part, recurrent models are investigated and a novel digital CMOS VLSI implementation scheme is proposed. Synaptic matrix construction rules (training rules) for the proposed model were simulated through C programs on SUN workstations. Three widely accepted training rules are compared. Both the Dilution Coding Hebb's rule and the Dilution Coding Projection rule were verified to exhibit good performance.

Chapter 2

RECURRENT NETWORKS

2.1 Introduction

Models of Hopfield neural networks are receiving widespread attention as potential new architectures for computing systems. They have demonstrated functions such as associative memory [46], adaptive learning from examples [1], and combinatorial optimization [48].
The most striking aspects of such a network are the highly distributed manner in which information is stored and the high degree of parallelism with which information is processed by the neurons. Thus the Hopfield network presents two intrinsically promising features: high speed and fault tolerance. Much research concerning Hopfield network models has been done through theoretical studies and numerical computer simulations. However, simulations of large networks on serial computers are painfully slow, and their speed advantage is completely lost. Only with customized parallel hardware is it possible to realize neural networks for useful applications and for the investigation of network properties such as robustness.

This part of the thesis presents fully digital CMOS designs for a Hopfield network model used as a CAM (Content Addressable Memory). Detailed information about the data path and control logic is presented. The number of chips and the number of transistors in each chip are estimated, and the number and functions of the pins in each chip are listed.

There are several reasons for choosing digital techniques over analogue ones to implement a neural network model. One advantage stems from available CMOS VLSI processing technologies: a digital VLSI chip can be designed with standard cells so as to shorten the design cycle, and such digital chips are tolerant to parameter dispersion and noise. The second advantage lies in making use of high-speed external memory to store the connection matrix, so as to implement large, expandable neural networks. Theoretical analysis and simulation results show that a large network size is important, since the performance advantage of a Hopfield network increases with the number of neurons. Allowing the construction of simple and fast simulation programs is also important for investigating properties and finding optimal thresholds of large networks.
The implementation of a fully digital Hopfield model is made practical by the following considerations:

• The availability of low-cost, high-speed memory chips makes external memory a good choice for storing the connection matrix.

• The proposed Dilution Coding rules for constructing connection matrices, shown in the next chapter, are well suited to digital operation.

• The simulations performed on the Dilution Coding rules in the next chapter give acceptable results, which prove the feasibility of implementing a digital Hopfield network based on these rules.

Several implementations of the Hopfield model have been proposed, based on one of three methods: analog VLSI circuits (or mixed analog/digital), fully digital electronic circuits, and optoelectronic VLSI circuits. Some analog implementations have been reported [2, 11, 3, 12]; fully digital implementations are less common [6], and optoelectronic VLSI chips are still in the planning stage [16].

The major obstacle in analog implementation is finding a suitable non-volatile analog memory storage device. One way to solve this problem is by using floating-gate devices; however, slow charge variation in the structure appears to be a major drawback. In both analog and digital implementations there is another obstacle: I/O pin-limitation. Usually, Hopfield networks require a large number of pins to initialize neurons and to program synapses. Optoelectronic VLSI technology is believed to be superior to the first two. By using optical connections to provide inputs to the network chip, data can be sent directly to individual locations on the chip from above, rather than only from the perimeter of the circuitry; hence optoelectronic ICs do not face the pin-limitation bottleneck. Unfortunately, no optoelectronic Hopfield network chip has been reported so far, and the technology is relatively new to users.
For these reasons, only the analog and fully digital approaches are discussed and compared in this part of the thesis, together with the description of our proposed fully digital Hopfield network. It is shown that the proposed model can overcome the two obstacles mentioned above by using external RAM as storage and time-multiplexing to deal with pin-limitation, coupled with the dilution rule to improve system performance.

2.2 Analog Networks Implemented

The largest working network chip with programmable connections reported so far contains 54 neurons and is implemented on a CMOS VLSI chip with about 75000 transistors [2, 13]. Computations inside the chip are analogue, while its input and output data are digital, so that the chip can work within a digital system. A larger network with several hundred neurons was also reported in [11], but its connection strengths are fixed, not programmable. To increase the recall capacity of a network, Moopenn has presented an electronic implementation of a CAM with a 32x32 programmable unipolar matrix [3]. The connection matrix was computed from diluted unipolar patterns using Hebb's rule and implemented with CMOS analogue switches, each in series with a positive-feedback resistor. The system converges to a memory in a single cycle without any oscillation. The cycle time is approximately 15 μs. The ASSOCMEM chip is another neural network chip, containing 22 bipolar neurons [12]. It was designed in 4-micron NMOS technology and has programmable connections. It contains over 20000 transistors.

From the examples above, we can see that the high interconnectivity of Hopfield networks makes electronic implementation difficult, especially for large networks with programmable connection strengths. How to construct large or even expandable neural networks is one of the main issues in neural network research. Our purpose in this part of the thesis is to propose an answer to this problem.
2.3 Hopfield Network

Of the many biologically inspired models, the Hopfield network is one of the simplest. It has been demonstrated empirically that such a network can behave as an associative memory [46, 47]. The schematic of Figure 2.1 illustrates an example of a Hopfield network. The network consists of N interconnected neurons. Each neuron can have two states: ON ($V_k = 1$) and OFF ($V_k = -1$) (bipolar). The connection (synaptic) strength from neuron i to neuron k is denoted by $T_{ki}$. In Figure 2.1, the strengths are controlled by resistors located at the crosspoints of the synaptic matrix.

In operation, a randomly selected neuron examines its inputs and decides its next state. The update rule proposed by Hopfield is given by

$$V_k(t+1) = \begin{cases} \text{ON} & \text{if } \sum_{i=1}^{N} T_{ki} V_i > t_k \\ V_k(t) & \text{if } \sum_{i=1}^{N} T_{ki} V_i = t_k \\ \text{OFF} & \text{if } \sum_{i=1}^{N} T_{ki} V_i < t_k \end{cases}$$

where $t_k$ is the threshold of neuron k.

$$T_{ki}(\text{truncated}) = \begin{cases} +1 & \text{if } T_{ki} > \theta \\ 0 & \text{if } -\theta \le T_{ki} \le \theta \\ -1 & \text{if } T_{ki} < -\theta \end{cases} \quad (3.2)$$

where a suitable threshold $\theta$ for truncating $T_{ki}$ can be chosen through simulation.

3.1.3 Simplex Method rule

Both Hebb's rule and the Projection rule require a wide range of values to express the connection strengths. Having programmable connection strengths with such a wide range of values is prohibitively expensive with current VLSI technology. Truncating the values to a few levels, usually the three values +1, 0, and -1, somewhat degrades the performance of the Hopfield network (Chapter 2 of [10]). The Simplex method rule presented in [6] is based on the principle of stabilizing stored patterns as much as possible. The Simplex method is known as a linear algebra optimization method. The restriction of the connection strengths to the values +1, 0, and -1 is included in the algorithm itself, rather than applied by truncating the values after computation.

The Simplex method can be described as follows. For any neuron k, its input corresponding to pattern i is $S_k^i = \sum_{r=1}^{N} T_{kr} V_r^i$. Assume the neuron k has threshold $t_k$.
If we choose the set of $T_{kr}$ such that the difference between $S_k^i$ and $t_k$ is maximized, we will obtain the highest possible stability for neuron k in pattern i. To use the Simplex method, we employ the following formal relation:

$$Z_k^i = (S_k^i - t_k) \times V_k^i, \qquad S_k^i = \sum_{r=1}^{N} T_{kr} V_r^i \quad (3.3)$$

with the condition $-1 \le T_{kr} \le 1$. For neurons having outputs +1 and -1 (bipolar neurons), $t_k$ is set to zero. The solution is a set of $T_{kr}$ that satisfies

$$\max\{\min_i(Z_k^i)\} \quad \text{for all } k, \quad k = 1, 2, \cdots, N$$

Linear algebra theory states that at least N - P (N = number of neurons, P = number of patterns) coefficients will take the extreme values -1 or +1, and statistical data has shown that all the other coefficient values are near -1, 0, or +1. Simulation showed that only -1, 0, and +1 are generated if N and P are small, say N=12 and P=3, but some values merely near -1, 0, or +1 appear when N and P increase, say to N=60 and P=10. Thus, truncation needs to be performed on these values. The truncation threshold was chosen through simulation.

3.1.4 Model Summary

The parameters of the three original forms of Hopfield models are summarized in Table 3.1. The truncated projection rule is included in the table.

Type of Rule         | Range of Possible Values of T_ki | Possible Values of Neurons | Threshold in Update Rule | Reference
Hebb's               | -P ... +P                        | -1, +1                     | 0                        | [46]
Simplex Method       | -1, 0, +1                        | -1, +1                     | 0                        | [6]
Projection           | -∞ ... +∞                        | -1, +1                     | 0                        | [5]
Truncated Projection | -1, 0, +1                        | -1, +1                     | 0                        |

Table 3.1: Learning Rules for Hopfield Networks with Bipolar Neurons

Figure 3.1: Comparison of Learning Rules on Hopfield Network (P=10, N=60). Frequency of optimal convergence vs. Hamming distance for the Simplex Method, Projection Rule, Truncated Projection, and Hebb's Rule.

To verify that the three Hopfield models do indeed function as associative memories, computer simulations were performed.
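As one illustration of such a simulation, the sketch below trains a small bipolar network with Hebb's rule ($T_{ki} = \sum_r V_k^r V_i^r$, $T_{kk} = 0$) and recalls a stored pattern from a probe at Hamming distance h, using the asynchronous update rule of Chapter 2 with zero thresholds. This is a minimal Python reconstruction of the procedure, not the thesis's original C program; the values of N, P, and h are chosen arbitrarily for the example.

```python
import random

def hebb_train(patterns):
    """Hebb's rule: T_ki = sum_r V_k^r * V_i^r, with T_kk = 0 (bipolar values)."""
    n = len(patterns[0])
    T = [[0] * n for _ in range(n)]
    for p in patterns:
        for k in range(n):
            for i in range(n):
                if k != i:
                    T[k][i] += p[k] * p[i]
    return T

def recall(T, probe, max_sweeps=100):
    """Asynchronous update: V_k <- ON/OFF by the sign of sum_i T_ki V_i,
    keeping the old state when the sum equals the (zero) threshold."""
    v = list(probe)
    n = len(v)
    for _ in range(max_sweeps):
        changed = False
        for k in random.sample(range(n), n):       # random update order
            s = sum(T[k][i] * v[i] for i in range(n))
            new = v[k] if s == 0 else (1 if s > 0 else -1)
            if new != v[k]:
                v[k], changed = new, True
        if not changed:                            # a stable state was reached
            break
    return v

random.seed(0)
N, P, h = 64, 2, 5                                 # arbitrary small example
patterns = [[random.choice((-1, 1)) for _ in range(N)] for _ in range(P)]
T = hebb_train(patterns)
probe = list(patterns[0])
for i in random.sample(range(N), h):               # negate h bits of the base pattern
    probe[i] = -probe[i]
print(recall(T, probe) == patterns[0])
```

With P well below the network's capacity, the corrupted probe settles back to the stored pattern, which is the associative-memory behavior measured in Figure 3.1.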
The simulation results from loading 10 patterns are plotted in Figure 3.1. The programs generate random patterns and store them in the trial networks using one of the three rules above. The network is then probed with vectors having a fixed Hamming distance h to the nearest pattern. The probe vectors are formed by randomly choosing a pattern as the base, then randomly negating h of its bits. Convergence is said to be optimal if the network settles to a nearest pattern. The projection rule with truncation has been simulated too (see Figure 3.1, Truncated Projection). The simulation results show that the Simplex method rule is superior to Hebb's rule and the projection rule in performance. It uses only three values (+1, 0, -1) as connection strengths. One should note, however, that this method takes the longest time in simulation, because it uses floating point calculations and involves iterations in the algorithm.

3.2 Dilution Coding Rules with Two-Valued Models

It should be observed that the original forms of the above three rules are not adaptable to digital implementation, mainly because they produce more than two values of connection strength, and thus require more than one memory cell to store each strength. For digital implementation, unipolar neurons (whose output is either 0 or +1) and two-valued connection strengths (only 0 and 1) are required. To avoid performance degradation when neurons are changed from bipolar to unipolar, a one-hot pattern dilution scheme is used. The unipolar neurons are first divided into groups of b each; each group is then diluted to 2^b neurons, with the group's states translated using a b-to-2^b decoder. To compute the connection matrix from the diluted patterns, Dilution Coding rules are employed. In [10], Mok investigated the Dilution Coding Hebb's rule, coupled with two thresholds for neurons instead of one, and presented his simulation results.
His conclusion is that the Dilution Coding Hebb's rule performs better than the original Hebb's rule. His work is continued in this thesis by investigating the other two rules. The simulations show that both the Dilution Coding Hebb's rule and the Dilution Coding Projection rule provide good performance, but the Simplex method is not suitable for the dilution scheme.

3.2.1 Dilution Coding Hebb's Rule

The Dilution Coding Hebb's rule, first presented by Winston Mok in [10], is described below. The connection matrix is computed from the diluted patterns by using Hebb's rule with truncation:

    Tki = T( sum_{r=1..P} V_k^r V_i^r ),   with Tkk = 0   (3.4)

where

    T(x) = 0 if x = 0;  1 if x >= 1

and V_k^r denotes the k-th neuron in the r-th input pattern. Instead of the fixed threshold tk used in the original Hopfield network (see Figure 1.1), two values, Th_OFF and Th_ON, are adopted here. Th_OFF is used if the neighbours in the same group as neuron k are all OFF, and Th_ON is used if any of the neighbours is ON. The threshold depends only on the state of the neighbours and is independent of the state of neuron k itself. Th_ON is chosen to be an unattainably large value, so that groups are restricted to containing no more than one TRUE bit. The choice of Th_OFF is a compromise between the attractivity and the capacity of a network. The optimal Th_OFF can only be discovered via simulation.

3.2.2 Dilution Coding Projection Rule

The Projection rule on diluted patterns was also simulated. Again, floating point calculations are required for the simulation. The threshold theta for truncating Tki was chosen through experiments.

    T = T[ M ((M)^T M)^{-1} (M)^T ]   (3.5)

where

    T(x) = 0 if x <= theta;  1 if x > theta
    M = [V^1, V^2, ..., V^P]

3.2.3 Simplex Method with Unipolar Neurons

The Simplex method rule failed in simulations on diluted patterns for large networks. One reason is that it takes too much time to simulate; another is that it requires as many thresholds as there are neurons.
As a result, the simulations were carried out on undiluted patterns with unipolar neurons. To work with unipolar neurons, the Simplex method is changed slightly as follows:

    Z_k^i = (S_k^i - tk) x (2 V_k^i - 1),   i = 1, 2, ..., m   (3.6)

[Figure 3.3: Comparison of Dilution Projection Rule with Others. Frequency of optimal convergence versus Hamming distance for the Simplex Method, Projection Rule, and Dilution Projection rule.]

From the simulation results, and taking into consideration the stability, the attractivity, and the capacity, it is clear that the dilution rules are superior to the original forms, particularly in dealing with digital implementations. These rules provide an effective method for programming digitally implemented Hopfield networks.

Chapter 4

IMPLEMENTATION OF A FULLY DIGITAL CAM

In Hopfield network implementations, most of the active area is occupied by the connection matrix, especially with programmable connection strengths. With N neurons, there are N^2 cross connections in the network. In the CMOS VLSI chips reported in [2, 13], almost 90 percent of the silicon area is occupied by the connection matrix [2]. Within the ASSOCMEM chip, 37 out of the 41 transistors in each Tki element are used to implement the programmable interconnections [15]. Implementing a large network with programmable connection strengths using conventional analogue techniques appears to be very difficult. However, with the model proposed in Chapter 3, we can design a large, expandable digital Hopfield network. A conventional RAM can be used to store the connection matrix, since the proposed model has strengths of value 0 or 1 only. The neurons can be integrated on the same or another chip. Based on this idea, a fully digital CMOS Hopfield CAM design is described in the following sections.

4.1 Architecture

Figure 4.1 shows a block diagram of an associative memory co-processor system incorporating the proposed neural network model.
The main features of the proposed system are:

• fully digital implementation

• modular design, cascadable to enlarge the network size

• guaranteed stable memories

• amenable to discrete-time simulation algorithms

• adjustable threshold to accommodate probe characteristics

• programmable connection strengths

• use of external RAM, permitting more neurons per integrated circuit

[Figure 4.1: Block Diagram of the Associative Memory Co-processor System. The co-processor connects to the host's ISA bus through an INTERFACE block (address decoder, tri-state bus transceiver, interface controller) and comprises RAM Modules, Neural Network Modules (NNM), a Control Module (CtrlM), an Adder/Comparator (AComp), a Convergence Detector (CvgDt), a Copy Counter, a Clock Generator (ClkGen), and Configuration Switches.]

This architecture can work with both the Dilution Coding Projection rule and the Dilution Coding Hebb's rule. As previously mentioned, the Dilution Coding Projection rule offers the best performance, but requires a longer time to compute the connection matrix. In contrast, the Dilution Coding Hebb's rule gives acceptable performance, guaranteed stability, and has the advantage of a short simulation time. In practical applications, if the connection matrix is rarely changed, the construction of the connection matrix is not a significant cost, so the former rule (the Dilution Projection rule) is preferred; otherwise, the latter may be used.
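For concreteness, the dilution scheme of Section 3.2 and the Dilution Coding Hebb's rule of equation (3.4) can be sketched in software (an illustrative Python sketch; the names `dilute` and `dilution_hebb`, and the mapping of group bits to a one-hot index, are our own choices, and the two-threshold update is not shown):

```python
def dilute(pattern, b):
    """One-hot dilution: each group of b unipolar bits is expanded to
    2**b bits via a b-to-2**b decoder (exactly one TRUE bit per group)."""
    assert len(pattern) % b == 0
    out = []
    for g in range(0, len(pattern), b):
        value = 0
        for bit in pattern[g:g + b]:
            value = (value << 1) | bit   # group bits read as a binary number
        group = [0] * (2 ** b)
        group[value] = 1
        out.extend(group)
    return out

def dilution_hebb(diluted_patterns):
    """Eq. (3.4): T_ki = T(sum_r V_k^r V_i^r) with T_kk = 0, where
    T(x) = 0 if x = 0 and 1 if x >= 1 -- i.e. a logical OR of the
    outer products of the diluted patterns."""
    n = len(diluted_patterns[0])
    T = [[0] * n for _ in range(n)]
    for v in diluted_patterns:
        for k in range(n):
            for i in range(n):
                if k != i and v[k] * v[i] >= 1:
                    T[k][i] = 1
    return T

d = dilute([1, 0, 0, 1], b=2)   # two groups of b=2 bits, each one-hot in 4
print(d)                        # [0, 0, 1, 0, 0, 1, 0, 0]
T = dilution_hebb([d])
```

Because the resulting T contains only 0s and 1s, it maps directly onto the single-bit-per-strength RAM storage used by the co-processor.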
During design and functional testing of the proposed network, the Dilution Coding Hebb's rule should be used first, so as to take advantage of its short simulation time. The architecture shown in Figure 4.1 is designed with expandability in mind (see Appendix A). A simple system with only 128 neurons is described here. By programming, the 128 neurons can be divided into 4, 8, or 16 groups, with 32, 16, or 8 neurons each, respectively. Enlarging the network size is discussed in detail in Appendix A. The co-processor system is designed to be connected to a host computer. A PC with an ISA (or EISA) bus is considered in the following description. The co-processor appears to the host computer as a memory-mapped slave device. In order to offer expandability, it is partitioned into several functional blocks which will be implemented on a set of chips. The INTERFACE performs handshaking functions with the ISA bus and distributes downloaded data to the appropriate memory in the co-processor system. The RAM Module consists of external memories that store the connection matrix. The Neural Network Module (NNM) contains the probe vector and supplies the final stable state (the returned memory). It also contains a calculator which computes ISum = sum_i Tki Vi as an intermediate sum and sends it to the Adder/Comparator (AComp). The AComp sums the ISums from the NNMs, performs the update rule by comparing the total sum with the Th_OFF supplied by the Control Module (CtrlM), and generates NextState accordingly. The Control Module incorporates a small state machine that controls the operation of the co-processor system. The INTERFACE, RAM Module, NNM and CtrlM are further discussed in Appendix A. As its name suggests, the Convergence Detector (CvgDt) monitors the network convergence. It compares the current state signal CurrState with the next state signal NextState.
It declares that the network has converged, and sets Ready, if all the neurons have had a chance to update their states and no change occurred. The signal Ready can drive Stop/Run high, which forces the system to stop. Minimal hardware is required to implement the CvgDt: a counter with modulus N will suffice. At the end of each update cycle the counter is either set to N - 1 if the current neuron changes its state, or decremented by one if the neuron remains in its previous state. When the counter reaches zero, the network has converged. The Clock Generator (ClkGen) generates the clocks for the system. If SingleStep is selected, the Clock Generator inhibits the clock in each cycle, so that the system stays unchanged until it is restarted. This feature is designed for functional testing. The Copy Counter (CpCt) is designed for expansion: it determines the active NNM, which contains the neuron being updated. The Configuration Switches indicate how many NNMs are present in the system. A more detailed description of the functional blocks is included in Appendix A.

PART II: FEED FORWARD NETWORKS

In this part, feed-forward models are introduced. The theories and conclusions about the feed-forward Perceptron and Back-Propagation (BP) multi-layer networks are presented. The obstacles and drawbacks of the current training algorithms are analyzed. A novel training algorithm for multi-layer networks, named the Delta Pre-Training (DPT) rule, is developed. The new training algorithm improves the performance of BP-like algorithms. It has given the best results on the selected problems in comparison with the results from BP-like algorithms reported by other researchers. The new training algorithm retains the property of learning from examples through a training procedure. It needs little extra circuitry over the original BP algorithm if on-chip learning is designed.

Chapter 5

FEED FORWARD NETWORKS

This chapter reviews the MLP structure and training algorithms.
It starts by introducing the network of single-layer neurons, the Perceptron network, and its training algorithm, the Delta Rule. Then the MLP and its training algorithm, the BP algorithm, are presented. Since there are as yet no standard notations for neural networks, the symbol definitions used in this and the following chapters are given in Appendix B.

5.1 Introduction

Although a single neuron can perform certain pattern detection functions, the power of neural computing stems from the collective computations performed by connecting neurons into networks. Larger, more complex networks generally offer greater, more powerful computational capabilities. In feedforward networks, neurons are arranged in layers. There are connections between neurons in different layers, but no connections between neurons in the same layer. The connections are unidirectional. The output of a feedforward network at time k depends only on the network input at time (k - d), where d is the processing time of the network. In other words, a feedforward network is memoryless.

5.1.1 Perceptron and Delta Rule

A Perceptron network is diagrammed in Figure 5.1. It consists of J threshold-logic neurons. Rosenblatt named it the "Perceptron" [72], while Widrow et al. defined it as the MADALINE (Multi-output ADAptive LINear Element) [96]. In this thesis the term Perceptron is used.

[Figure 5.1: A Perceptron Network. Inputs x0, x1, x2, ..., xI feed a single layer of output neurons.]

A Perceptron is a single-layer network. The term layer refers to the weights between the inputs and the output neurons. The basic function of a Perceptron is to map a set of input vectors to a set of output vectors. Given an input vector of I inputs, the network produces an output vector O = O_1, O_2, ..., O_J. Each element of the output vector is calculated from:

    O_j = T( sum_{i=0..I} w_ji x_i ) = T(y_j),   j = 1, 2, ..., J.   (5.1)

The weight w_ji is the connection strength from input i to output j. The weights in the network are adjustable.
Changing the weights will result in different mapping functions of the network. The special input x_0 is set to 1, and the weight w_j0 works as an adjustable threshold of the output neuron j. That is, by setting x_0 = 1, the network has an adaptable threshold for each of the output neurons. Usually the input x_0 is called the bias input neuron. The activation function T() is a threshold-logic function defined as:

    T(y) = +1 if y > 0;  0 otherwise   (5.2)

The given I inputs, x_1, x_2, ..., x_I, form an input vector, or input pattern. If there is more than one input pattern, the network generates multiple output patterns accordingly. Thus another index is needed to indicate the input/output pattern. For P input patterns, equation (5.1) becomes:

    O_j^p = T( sum_{i=0..I} w_ji x_i^p ) = T(y_j^p),   (5.3)

    j = 1, 2, ..., J and p = 1, 2, ..., P.

To enable the network to perform the input/output mapping function, a training (or learning) procedure is needed. During the training phase, input patterns and the corresponding desired output patterns are presented to the network. A training algorithm is responsible for adjusting the weights so that the output responses closely approach their respective desired responses. For the Perceptron network, the training algorithm, sometimes called the Perceptron convergence procedure, was developed by Rosenblatt [72]. The rule for changing the weights following the presentation of an input/output pair p is given by:

    Delta w_ji = eta (D_j^p - O_j^p) x_i^p = eta delta_j^p x_i^p   (5.4)

where 0 < eta < 1 is the training constant, normally called the learning rate, which controls the step size of the change in the weights during adaptation, and D_j^p is the desired output of neuron j. This training algorithm is normally referred to as the Delta Rule. After Rosenblatt's Perceptron, which used the threshold-logic neuron, the sigmoid neuron has come into wide use in recent years.
The nonlinearity of the sigmoid neuron has the characteristic of differentiability, which facilitates adaptation. Using sigmoid neurons in place of threshold-logic neurons makes the Perceptron network a subclass of the Multi-Layer Perceptron (MLP), which will be presented in the next section. Widrow has named the Perceptron with sigmoid neurons the "sigmoid Madaline" [96]; here it will be termed the "sigmoid Perceptron." A sigmoid Perceptron network corresponding to Figure 5.1 is illustrated in Figure 5.2.

[Figure 5.2: A Sigmoid Perceptron Network. The same structure as Figure 5.1, with sigmoid neurons in place of threshold-logic neurons.]

It should be mentioned that there are different forms of the Delta Rule, based on the neuron activation function. The ultimate goal of the Delta Rule is to minimize the difference between the network output and the desired output through weight updating. In this thesis, the Delta Rule for sigmoid Perceptrons is derived by applying steepest gradient search on a function surface defined as the sum of the squared error function: E = sum (D - O)^2. A more detailed derivation can be found in [73]. See Appendix B for notations. Now define the error function for an input pattern p as follows:

    E^p = (1/2) sum_j (D_j^p - O_j^p)^2   (5.5)

Then the overall error is:

    E = sum_p E^p = (1/2) sum_p sum_j (D_j^p - O_j^p)^2   (5.6)

To minimize E, the steepest gradient method is used to modify the weights. This method changes each weight by an amount proportional to the derivative of the error measure with respect to that weight. When an input pattern p is presented, the change in w_ji is given by:

    Delta w_ji proportional to -dE^p/dw_ji = -(dE^p/dO_j^p)(dO_j^p/dw_ji)   (5.7)

Combining equations (5.1), (5.5) and (5.7), the Delta Rule for a sigmoid Perceptron following a presentation of the input pattern p is:

    Delta w_ji = eta delta_j^p x_i^p   (5.8)
    delta_j^p = (D_j^p - O_j^p) f'(y_j^p)   (5.9)

where y_j^p = sum_i w_ji x_i^p. The Perceptron training procedure based on the Delta Rule involves the following steps:

1. Reset all weights to zero.

2.
Present an input pattern and calculate the output vector using equation (5.3). This is referred to as a forward calculation.

3. The output vector is compared with the desired output pattern. If the difference is small enough, no weight updating takes place. Otherwise, go to step 5.

4. If the differences are small enough for all P input patterns, the training procedure terminates. Otherwise, go to step 2.

5. Change the weights according to the Delta Rule, equation (5.8).

The Perceptron training procedure has been proven capable of finding the optimal weights for any linearly separable set of training patterns [72]. If the training patterns are not linearly separable, the Perceptron training procedure never converges. This is a well-known limitation of Perceptron networks. A Multi-Layer Perceptron network breaks the limitation of the single-layer Perceptron: it can perform any classification task, given a sufficient number of neurons in each Perceptron layer.

5.1.2 Multi-Layer Perceptron and BP Algorithm

A typical Multi-Layer Perceptron network (MLP) is depicted in Figure 5.3. It contains two or more layers of Perceptrons. The first Perceptron is connected to the network's inputs, and its outputs are used as the inputs to a second Perceptron. The second Perceptron's outputs are the inputs of a third Perceptron, and so on. The final Perceptron's outputs are the final outputs of the MLP. Any neuron whose output is not used as a network output is called a hidden neuron. The outputs of hidden neurons are labeled h_1, h_2, ..., h_J in Figure 5.3. It has been proven that a network with two layers of Perceptrons can perform any classification task [113], although a network with more Perceptron layers may be able to tackle certain tasks with fewer neurons. Even though this thesis focuses on two-layer MLPs, the theory and conclusions can be extended to MLPs of more than two Perceptron layers.
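Before moving on to multi-layer networks, the single-layer training procedure above can be sketched in a few lines (a hypothetical Python illustration; the function names and the use of logical AND as the linearly separable example are our own choices):

```python
def threshold(y):
    """T(y) of equation (5.2): +1 if y > 0, else 0."""
    return 1 if y > 0 else 0

def train_perceptron(inputs, targets, eta=0.5, max_epochs=100):
    """Delta Rule training (eq. 5.4) of a single-output Perceptron.
    x0 = 1 is the bias input, so w[0] acts as the adjustable threshold;
    all weights start at zero, as in step 1 of the procedure."""
    n = len(inputs[0]) + 1
    w = [0.0] * n
    for _ in range(max_epochs):
        converged = True
        for x, d in zip(inputs, targets):
            xv = [1] + list(x)                      # prepend the bias input x0
            o = threshold(sum(wi * xi for wi, xi in zip(w, xv)))
            if o != d:                              # update only on an error
                converged = False
                for i in range(n):
                    w[i] += eta * (d - o) * xv[i]   # Delta w_i = eta (d - o) x_i
        if converged:                               # all P patterns correct
            break
    return w

# Logical AND is linearly separable, so the procedure converges.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
D = [0, 0, 0, 1]
w = train_perceptron(X, D)
outs = [threshold(sum(wi * xi for wi, xi in zip(w, [1] + list(x)))) for x in X]
print(outs == D)  # the trained network reproduces the AND truth table
```

On a non-separable task such as XOR the same loop would simply exhaust `max_epochs` without converging, which is exactly the limitation the MLP removes.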
[Figure 5.3: A Multi-Layer Perceptron Network. Inputs x0, x1, x2, ..., xI feed a hidden Perceptron layer whose outputs h1, ..., hJ feed the output layer.]

A two-layer MLP performs the function of input/output pattern mappings as follows:

    h_j^p = f( sum_{i=0..I} w_ji x_i^p )   (5.10)
    O_k^p = f( sum_{j=0..J} w_kj h_j^p )   (5.11)

    j = 1, 2, ..., J, k = 1, 2, ..., K and p = 1, 2, ..., P.

To generate the desired output patterns for a set of input patterns, again a weight updating rule is required. However, deriving a training rule for MLPs is not as easy as for Perceptrons, because of the lack of knowledge about the hidden neuron outputs. Nevertheless, after Rumelhart et al. developed the Back-Propagation (BP) algorithm, the training of an MLP became almost as simple as training a Perceptron. Indeed, the BP algorithm is also called "The Generalized Delta Rule" by Rumelhart et al. [73]. If a weight w_kj is in the output layer (i.e., connected to output neurons), the weight update rule is similar to the one given in equation (5.8), except that the input x_i in (5.8) is replaced by h_j, the hidden neuron output:

    Delta w_kj = eta delta_k^p h_j^p   (5.12)
    delta_k^p = f'(y_k^p) (D_k^p - O_k^p)   (5.13)

where y_k^p = sum_{j=0..J} w_kj h_j^p and h_j^p = f( sum_{i=0..I} w_ji x_i^p ). If the weight w_ji is in the input layer, the weight update rule can be derived in the same manner as the Delta Rule. Thus,

    Delta w_ji = eta delta_j^p x_i^p   (5.14)
    delta_j^p = f'(y_j^p) sum_k delta_k^p w_kj   (5.15)

where y_j^p = sum_{i=0..I} w_ji x_i^p. If there are more than two connection layers, x_i^p is replaced by the hidden neuron output that connects to w_ji. Equations (5.14) and (5.15) give a recursive procedure for updating the weights connected to hidden neurons. The weight update rule of equations (5.12) and (5.14) is usually referred to as the standard BP algorithm. The BP algorithm again involves the step control constant, eta. The larger the constant, the larger the weight changes in each update. But a large step constant will more easily lead to oscillation. One way to increase the training rate without leading to oscillation is to add a momentum term to (5.12) and (5.14).
The modified BP algorithm, referred to as the standard BP algorithm with momentum, becomes:

    Delta w_kj(t+1) = eta delta_k^p h_j^p + alpha Delta w_kj(t)   (5.16)
    Delta w_ji(t+1) = eta delta_j^p x_i^p + alpha Delta w_ji(t)   (5.17)

where 0 < alpha < 1 is a constant. With the BP algorithm, the MLP training procedure involves the following steps:

1. Initialize the weights to small random numbers. (If the weights are all initialized to the same number, the BP algorithm will fail to achieve convergence [73].)

2. Present an input pattern, and generate the output vector. This completes a forward calculation.

3. The output vector is compared with the desired output pattern. If the difference is small enough, no weight updating takes place. Otherwise, go to step 5.

4. If the differences are small enough for all P input patterns, the training procedure terminates. Otherwise, go to step 2.

5. Change the weights according to equations (5.12) and (5.14), or (5.16) and (5.17).

An alternative mode of the BP training procedure is batch mode, or epoch mode. Depending on the requirements and/or convenience, epoch mode or the above pattern-presentation mode is selected. The epoch mode of the BP training procedure is defined by the following steps:

1. Initialize the weights to small random numbers.

2. Present an input pattern to the network and calculate the output vector.

3. Calculate the incremental weight changes Delta w according to the BP algorithm using equations (5.12) and (5.14), or (5.16) and (5.17). Accumulate the Delta w for each weight.

4. If all pairs of the training patterns have been presented, go to the next step. Otherwise, go to step 2.

5. If the differences between the network outputs and the desired outputs are small enough for all P input patterns, the training procedure terminates. Otherwise, apply the accumulated Delta w saved in step 3 to each weight, reset each Delta w to 0, and go to step 2.
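The pattern-presentation procedure with the momentum terms of equations (5.16) and (5.17) can be sketched as follows (a hedged Python illustration; the class name, the 2-2-1 network size, and the XOR example are our own choices, not from the thesis):

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

class TwoLayerMLP:
    """Two-layer MLP trained with the standard BP algorithm with momentum,
    equations (5.12)-(5.17).  Index 0 of the input and hidden vectors is a
    bias fixed at 1."""

    def __init__(self, n_in, n_hid, n_out, seed=1):
        rng = random.Random(seed)  # small random initial weights (step 1)
        row = lambda c: [rng.uniform(-0.5, 0.5) for _ in range(c)]
        self.w1 = [row(n_in + 1) for _ in range(n_hid)]         # w_ji
        self.w2 = [row(n_hid + 1) for _ in range(n_out)]        # w_kj
        self.dw1 = [[0.0] * (n_in + 1) for _ in range(n_hid)]   # last Delta w_ji
        self.dw2 = [[0.0] * (n_hid + 1) for _ in range(n_out)]  # last Delta w_kj

    def forward(self, x):
        xv = [1.0] + list(x)
        h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(r, xv)))
                     for r in self.w1]
        o = [sigmoid(sum(w * hi for w, hi in zip(r, h))) for r in self.w2]
        return xv, h, o

    def train_pattern(self, x, d, eta=0.5, alpha=0.9):
        """One presentation of pattern (x, d); returns the error E^p of (5.5)."""
        xv, h, o = self.forward(x)
        # delta_k = f'(y_k)(D_k - O_k); for the sigmoid, f'(y) = o(1 - o)  (5.13)
        d2 = [(dk - ok) * ok * (1.0 - ok) for dk, ok in zip(d, o)]
        # delta_j = f'(y_j) * sum_k delta_k w_kj                           (5.15)
        d1 = [h[j + 1] * (1.0 - h[j + 1]) *
              sum(d2[k] * self.w2[k][j + 1] for k in range(len(d2)))
              for j in range(len(self.w1))]
        for k in range(len(self.w2)):          # Delta w_kj with momentum  (5.16)
            for j in range(len(self.w2[k])):
                self.dw2[k][j] = eta * d2[k] * h[j] + alpha * self.dw2[k][j]
                self.w2[k][j] += self.dw2[k][j]
        for j in range(len(self.w1)):          # Delta w_ji with momentum  (5.17)
            for i in range(len(self.w1[j])):
                self.dw1[j][i] = eta * d1[j] * xv[i] + alpha * self.dw1[j][i]
                self.w1[j][i] += self.dw1[j][i]
        return 0.5 * sum((dk - ok) ** 2 for dk, ok in zip(d, o))

# A 2-2-1 network on XOR, trained in pattern-presentation mode.
net = TwoLayerMLP(2, 2, 1)
patterns = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
for _ in range(200):
    for x, d in patterns:
        net.train_pattern(x, d)
print([round(net.forward(x)[2][0], 2) for x, _ in patterns])
```

Whether such a run converges, and how fast, depends strongly on the random seed, which is precisely the sensitivity to the initial weights examined in Chapter 6.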
Chapter 6

A New Method for Refinement of the BP Training Algorithm

Training an MLP with the BP algorithm is a nonlinear optimization procedure. The procedure is usually started from a random starting point in the nonlinear system, by setting the weights to random numbers. This chapter develops a new method, named Delta Pre-Training (DPT), to set the starting point. As a result, the training speed is increased and the probability of getting stuck at local minima is decreased. Simulation results strongly support the benefits of the DPT method.

6.1 Acceleration of BP Trainings

In this section, the DPT method is derived with emphasis on increasing the BP training speed. The issue of avoiding local minima will be discussed in Section 6.2.

6.1.1 Previous Work

Acceleration of BP training has been a major challenge to neural network researchers. Several variations of the standard BP algorithm have been proposed, including adapting the learning rate [108], sharpening the slope of the activation functions [109], increasing the effectiveness of the error functions [107], and utilizing second-order derivatives of the activation functions [107]. These techniques result in significant improvements in convergence speed. Nevertheless, all the above methods concentrate on the BP algorithm itself, while the great influence of the random starting point is ignored. An examination of the simulation results of BP algorithms reported in the literature reveals that the fastest convergence rate can differ from the slowest rate by several orders of magnitude [73]-[109]. This difference results mainly from the initial random weight settings. Two random settings of initial weights can result in very different convergence rates, even when small numbers are chosen as initial weights to avoid "premature saturation" [110]. Chen et al. [112] developed linear-algebraic methods to set the initial weights.
The idea is to set the initial weights as close as possible to a global minimum. Training with random initial weights is started in the first connection layer of a two-layer MLP; then a linear-algebraic method is used to calculate the weights in the second layer. With this method, the training procedure is not necessary if the number of hidden units is equal to or greater than the number of training patterns minus one. However, this condition is rarely applicable in practice, since with too many hidden neurons the network may simply learn to replicate the training patterns but fail to perform any generalization. Therefore, our main interest lies in studying networks with a small number of hidden neurons. For the case of a small number of hidden neurons, Chen et al. [112] developed two basic methods, called Linear Algebra Forward Estimation (LA-FE) and Linear Algebra Recurrent Estimation (LA-RE). These methods significantly improve the convergence speed of BP algorithms. However, the linear-algebraic methods have the following disadvantages: 1) they are limited to three-layer networks (i.e., networks with two connection layers); 2) they are hard to implement in VLSI technology; and 3) they start with random initial weight settings in the first layer, which still requires a careful selection of the range of initial values. In [91] and [90], L. Wessels and E. Barnard investigated the initial weights from the point of view of avoiding local minima. They identified three physical conditions correlated with local minima, then set up the initial weights in such a way that the three conditions are not satisfied. The disadvantages of their methods are: 1) VLSI implementation is difficult; and 2) the convergence speed issue is not addressed by their method. That is, to avoid local minima their methods may result in a slower convergence speed [90].
A new method for setting up the initial weights for the BP training procedure is described below. The method is named Delta-rule Pre-Training (DPT), and has the following major advantages:

• The DPT method is easier to implement in hardware. In particular, the complexity of the required hardware is the same as that of the standard BP algorithm specified by equations (5.12) and (5.14).

• The DPT method improves BP training in two ways: by increasing the convergence speed and by reducing the probability of getting trapped at a local minimum.

• The DPT method can be used for training MLPs with any number of layers, while the linear-algebraic methods of [112] are restricted to two-layer MLPs.

In the remainder of this section, the effect of the random initial weight setting is analyzed; then the Delta Pre-Training method is presented.

6.1.2 Beginning of BP Training

The main purpose of random initial weight settings is to break the network symmetry so that the output error can be back-propagated to the input layer. However, random initial weights may cause the network to converge very slowly, for reasons such as:

(a) premature saturation [110], in which only a small fraction of an output error can be back-propagated, requiring a lengthy training period before the network converges;

(b) the output error reduction is made on an irregular error function surface during BP training, which increases the probability of the network becoming trapped in a local minimum.

Therefore it would be ideal if an initial weight setting method could both speed up BP training and decrease the probability of finding a local minimum. One basic consideration is that the method should cause BP training to move the weights along a smooth region of the error surface and allow large BP training step sizes.
If this criterion is achieved, it may even be possible to set the initial weights far away from the global minimum and still achieve a significant improvement in the convergence rate. Suppose we start training an MLP by setting all weights to zero. At this starting point, the network has the following properties:

• There is no premature saturation [110], because all the output neurons produce their middle output value (0.5 if the activation function of equation (1.4) is adopted), irrespective of the input patterns.

• The three identified causes of being trapped at local minima are not satisfied (details are discussed in Section 6.2).

• Both the total error (equation (5.6)) and the derivative f'(y) reach their maximum values, and as a result the training step size (weight update) Delta W(t+1) is maximal.

Unfortunately, BP algorithms are not able to start with the all-zeros weight setting, because they are then rendered incapable of learning the input-output mappings due to the symmetry problem [73]. Thus, a new method must be developed under two basic restrictions:

• It must adjust the weights to bring the network out of the symmetry state so that the BP training procedure can be initiated.
• The resulting starting point for BP training should maintain the properties of the zero starting point mentioned above.

In addition, if the method can adjust the weights towards the global minimum, rather than making a random adjustment, the BP training speed will be further improved. Recall that the Delta Rule can be used to train Perceptron networks, where it starts with zero weight settings. If we partition an MLP into several Perceptron networks, we can use the Delta Rule to train each Perceptron with all-zeros initial weights. To build connections between Perceptrons, we can define the output patterns of the previous Perceptron as the input patterns of the current Perceptron. We cannot expect the Delta Rule to train each Perceptron to convergence, but we have reason to believe that the Delta Rule can adjust the weights towards a proper setting for convergence. After combining all the pre-trained Perceptrons back into one MLP, we may expect the MLP to have a good starting point for the BP training procedure. Based on the above discussion, the Delta rule Pre-Training (DPT) method is proposed in the following.

6.1.3 DPT: a Delta Rule Pre-Training Method

The core of the DPT method is that the initial weights for MLPs are pre-trained with the Delta Rule, instead of being set to random numbers as is commonly done. The DPT is realized in two major steps. First, an MLP is partitioned into single-layer Perceptrons. The Delta Rule can then be applied with zero initial weight settings for each Perceptron. In the second step, input patterns to the hidden layers are defined such that the outputs of one Perceptron layer are the inputs of the next Perceptron layer. This builds connections between adjacent Perceptrons. To implement the second step, four methods are proposed below. Depending on the problem to be solved and the network size, the appropriate method is chosen:

Method 1: With a sufficient number of hidden neurons, specifically 2^J >= P (J is the number of hidden neurons, and P is the number of training patterns), sequentially assign a hidden pattern from the 2^J possible patterns to each of the input patterns. One hidden pattern matches one input pattern. This method is suitable for hardware implementation, because a J-bit counter can be used to generate the hidden patterns.

Method 2: With sufficient hidden neurons, specifically 2^J >= P, randomly assign a hidden pattern from the 2^J possible patterns to each of the input patterns. A unique hidden pattern is required for each input pattern.
Method 3: With few hidden neurons, specifically 2^J < P, randomly assign a hidden pattern to each of the input patterns. One hidden pattern may be used for several input patterns.

Method 4: Randomly initialize the weights of the first layer. Then calculate hidden patterns by presenting each input pattern to the input layer. The hidden patterns are the input patterns for the next Perceptron layer. This method can be used when 2^J < P.

Given the hidden patterns, the Delta Rule is used to pre-train the hidden-to-output layer or input-to-hidden layer. The number of iterations taken by the Delta Rule is a tunable variable, since no explicit solution is known yet. In the experiments of Section 6.3, the number of iterations was in the range of 5P to 10P. Simulation results show that the additional time required by the Delta Rule is small in comparison to the time taken by the BP algorithm. The pre-trained weights from the Delta Rule are used as initial weights in the subsequent BP training procedure. The complete BP training procedure using the DPT can be described as follows:

    Partition the network into Perceptrons;
    Pre-define hidden patterns (using Method 1, for example);
    Reset weights to zero;
    /* pre-training of the input-to-hidden layer starts here */
    for ( number of iterations ) {
        Randomly or sequentially choose an input pattern;
        Update weights according to Equation (5.8);
    }
    /* pre-training of the hidden-to-output layer starts here */
    for ( number of iterations ) {
        Randomly or sequentially choose a hidden pattern;
        Update weights according to Equation (5.8);
    }
    Apply a BP algorithm to train the entire network.

6.2 Avoiding local minima

The simulation results presented in Section 6.3 show the effectiveness of the DPT in increasing convergence speed and decreasing the probability of getting trapped in a local minimum. The reasons are analyzed in this section.
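As an illustration of the pre-training procedure above, the following Python sketch pre-trains a 4-2-4 Encoder/Decoder with Method 1 hidden patterns (generated as a J-bit counter would) and Delta Rule passes. All function names are hypothetical; the Delta Rule is assumed in the common form Δw = η(d − o)x with a sigmoid output, and sequential pattern presentation is used as in the pseudocode:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def method1_hidden_patterns(num_patterns, j):
    # Method 1: sequentially assign one of the 2^J binary patterns to each
    # input pattern, as a J-bit counter would in hardware (needs 2^J >= P).
    assert 2 ** j >= num_patterns
    return [[(p >> bit) & 1 for bit in range(j)] for p in range(num_patterns)]

def delta_pretrain(inputs, targets, iterations, rate=2.0):
    # Delta Rule pre-training of one Perceptron layer from zero weights,
    # with sequential pattern presentation.
    n_in, n_out = len(inputs[0]), len(targets[0])
    w = [[0.0] * n_in for _ in range(n_out)]
    for it in range(iterations):
        x, d = inputs[it % len(inputs)], targets[it % len(targets)]
        for k in range(n_out):
            o = sigmoid(sum(w[k][i] * x[i] for i in range(n_in)))
            for i in range(n_in):
                w[k][i] += rate * (d[k] - o) * x[i]   # Delta Rule step
    return w

# 4-2-4 Encoder/Decoder: one-hot inputs, counter-generated hidden patterns.
inputs = [[1.0 if i == p else 0.0 for i in range(4)] for p in range(4)]
hidden = method1_hidden_patterns(4, 2)
w_in_hid = delta_pretrain(inputs, hidden, iterations=40)            # 10P
w_hid_out = delta_pretrain([[float(b) for b in h] for h in hidden],
                           inputs, iterations=40)
```

The two pre-trained weight matrices would then serve as the starting point for the standard BP pass over the recombined MLP.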
There may be many causes of local minima in BP training. Three of them were identified in [91]:

1. Straying hidden nodes, i.e., hidden nodes that are either strongly active or very inactive for all training samples;

2. Hidden nodes duplicating function, i.e., pairs of hidden nodes that duplicate a function;

3. Hidden nodes arranged such that all are inactive in a certain region of sample space, thus creating a "dead region."

L. Wessels and E. Barnard [90] showed that local minima can be avoided, at least partly, by proper initialization of the interconnection weights. Fortunately, the DPT avoids these local-minima-related conditions for the following reasons. First of all, the DPT phase starts with all-zero weights. The hidden node outputs are all at their middle value, 0.5, according to the activation function of equation (1.4); this is considered the worst situation because the overall error reaches its maximum (see equation (5.6)). At this point, there are neither strongly active nodes nor very inactive nodes. Thus the first cause is avoided. Secondly, if the network under training has a sufficient number of hidden nodes, that is, 2^J > P, then there is no function duplication by hidden nodes when using Method 1. Thus the second cause of reaching local minima is avoided. If the network does not have enough hidden nodes (2^J < P), there will be function duplications when using the hidden pattern assignment methods listed in Section 6.1.3. However, if the hidden patterns are allowed to contain real numbers, such as 0.2, 0.4, ..., 0.8, etc., there will be enough values to assign a unique hidden pattern to each input pattern, so that there is no function duplication at hidden nodes. In real-world applications, the use of networks with enough hidden nodes is normal. Finally, the DPT can also avoid creating a "dead region."
The obvious reason is that all the weight changes are guided by training samples during the DPT phase. The simulation results in the following section support the above analyses.

6.3 Simulation Results

Simulations were carried out on two-layer MLPs. For the DPT phase, weights are updated after the presentation of each pattern. The numbers of presentations are reported with the simulation results. The learning rate for the DPT is adjustable, although a value of 2.0 was used. For the standard BP algorithm (equations (5.16) and (5.17)), the weights are updated after each epoch. For declaring network convergence, the criterion of [107] is adopted. The output values are in the range 0.0 to 1.0. Any value below 0.4 is regarded as 0.0 and any value above 0.6 is regarded as 1.0. This type of thresholding is suitable for digital logic designs. The margin between the two classes of outputs allows a small amount of noise in the output values to be tolerated. When all the output vectors corresponding to input patterns are equal to the desired outputs, the network is said to have converged. If the network does not converge within a certain number of epochs, the learning is said to have failed. The failure threshold is set to 3000 epochs in the experiments. The learning rate for the BP algorithm is listed in the following tables. When random initial weights are used for the standard BP algorithm, the values are restricted to the range [-2.0, +2.0], and the momentum is set to 0.9. Another idea borrowed from [107] is that 0.1 is added to the derivative before it is used to scale the error. To start a simulation, the hidden patterns are pre-defined first. Method 1 is adopted in the experiments: sequentially choose a hidden pattern from the 2^J possible patterns to match an input pattern, assigning each hidden pattern to only one input pattern.
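The convergence criterion just described (outputs below 0.4 read as logic 0, above 0.6 as logic 1, and values in the margin rejected) can be sketched as follows; the helper names are illustrative:

```python
def thresholded(value):
    # Map an output in [0, 1] to a logic level; the 0.4-0.6 band is a
    # noise margin in which the value is treated as undecided.
    if value < 0.4:
        return 0
    if value > 0.6:
        return 1
    return None

def converged(outputs, desired):
    # The network has converged when every output vector, after
    # thresholding, equals the desired binary vector for its pattern.
    return all(thresholded(o) == d
               for out_vec, des_vec in zip(outputs, desired)
               for o, d in zip(out_vec, des_vec))

assert converged([[0.92, 0.13]], [[1, 0]])
assert not converged([[0.55, 0.13]], [[1, 0]])   # 0.55 falls in the margin
```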
During the DPT phase, the first input pattern is picked randomly, and the other patterns are picked sequentially following the first pattern. This arrangement is suitable for hardware implementations of on-chip learning.

6.3.1 Encoder/Decoder Data Compression Problem

In the Encoder/Decoder task, each of the input patterns has only one input bit turned on (set to 1.0); the other input bits are turned off (set to 0.0). The task is to duplicate the input patterns in the output neurons. Since all information must flow through the hidden neurons, the network must develop a unique encoding for each of the P patterns in the J hidden neurons, together with a set of connection weights that, working together, perform the encoding and decoding operations. An I-J-K (K = I) Encoder/Decoder means I input neurons, J hidden neurons and K output neurons. The data compression problem was used by Fahlman to study his QuickProp algorithm, and his simulation results were reported in detail in [107]. These results are the best reported to date on the data compression problem. No direct comparison between the two algorithms is intended: the QuickProp results are used only to evaluate the effectiveness of the DPT method, and the DPT method can also be combined with QuickProp. However, only the standard BP is adopted in this thesis because it has the same complexity as the DPT and can be implemented by similar hardware function blocks.

Problem I-J-K   Trials   DPT # of Ite.   Learning rate   Average epochs DPT-stdBP   Average epochs QuickProp [107]
(E/D)4-2-4      100      40              2.0             3.87                       15.93
(E/D)8-3-8      100      80              2.0             7.36                       21.99
(E/D)16-4-16    100      160             2.0             10.16                      28.93
(E/D)32-5-32    25       320             1.5             17.68                      30.48
(E/D)64-6-64    10       640             1.0             33.70                      33.90
(E/D)10-5-10    100      100             3.0             6.66                       14.01

Table 6.1: Simulations on data compression problems

The simulation results are reported in Table 6.1.
The column labeled Trials lists the number of simulations for each different-sized network. For the QuickProp algorithm, each trial starts a simulation with a new set of initial weights. For the DPT-stdBP algorithm, the initial weights are always zero and each trial starts a simulation with newly pre-defined hidden patterns. Our results are reported in the column labeled Average epochs DPT-stdBP (the time spent on the DPT is not included), while the results from [107] are listed in the column labeled Average epochs QuickProp. As shown in the table, combining the DPT with the standard BP algorithm achieves better performance than the QuickProp method [107]. Since the BP algorithm is not as efficient as QuickProp, combining QuickProp, instead of the standard BP, with the DPT should achieve an even better result.

6.3.2 XOR Problem

The XOR problem is one of the most popular benchmarks for evaluating training algorithms, although the network size is relatively small. In this section, the DPT method is applied to solve the well-known XOR problem. The simulation results, together with the relevant data from [107], are listed in Table 6.2.

Algorithm        Trials   DPT # of Ite.   fails   Learning rate   momentum   Average epochs
DPT-stdBP        100      20              0       3.0             0.9        31.21
DPT-stdBP        100      20              0       4.0             0.9        24.95
DPT-stdBP        100      20              0       5.0             0.9        20.10
DPT-stdBP        100      20              0       6.0             0.9        18.23
DPT-stdBP        100      20              0       7.0             0.9        17.59
stdBP            100      -               37      3.0             0.9        93.17
QuickProp [107]  100      -               14      4.0             2.25       24.22

Table 6.2: Simulations on the XOR problem

Interestingly, Table 6.2 indicates that the best result with the DPT-stdBP converges in as few as 17.59 epochs on average without any failure. This result is believed to be better than any other reported result for this problem using BP-type algorithms.
It demonstrates that the DPT method is not only capable of bringing the network out of the symmetry state of all-zero weights, but is also capable of adjusting the weights to a proper setting for subsequent BP training, so that false local minima are effectively avoided. This result strongly supports the claims made in Section 6.2.

6.4 Discussion

The simulation results reported in the previous section show the advantages of using the proposed DPT method. In fact, the DPT method can be viewed as part of the BP algorithm and the combination as an extended BP algorithm. The extended BP algorithm can start training a MLP with the all-zeros weight setting, so that the performance of the original BP algorithm is improved in terms of a faster convergence speed and a significantly reduced chance of getting locked in local minima. The trade-off is the additional DPT computation. Compared to other methods for selecting initial weights, such as those reported in [112] and [90], the proposed DPT method has the following advantages:

• The DPT method can speed up BP training convergence and decrease the probability of getting locked in local minima.

• The DPT method maintains the feature of learning by presenting training patterns, thus avoiding more complicated non-local calculations.

• The computational complexity of the training procedure is no different from that of the standard BP. Thus the hardware implementation complexity of on-chip learning will not be increased over that of the standard BP.

• In digital implementations, such as digital weight storage, allowing training to start with zero weights may result in simpler circuit requirements, since a single global reset signal can set all the weights to zero.
6.5 Future work on the DPT method

The advantages of employing the DPT method have been demonstrated and listed in the previous sections. However, there are a few deficiencies associated with the DPT which are worth mentioning. One deficiency is how to determine the required number of iterations at which to terminate the DPT process. In our experience, 5P to 10P iterations normally give good results. The exact number of iterations may be related to the problem to be solved and the size of the network to be trained. At present, the optimum number of iterations is not known, but a useful criterion is that the number of iterations must be large enough to break network symmetry. On the other hand, the number of iterations should not be set too large, since this does not guarantee better performance. Table 6.3 gives an example for the XOR problem. As shown in Table 6.3, 40 iterations perform slightly better than 20 iterations, but if the number of iterations is increased further (e.g. 80), performance may degrade.

Algorithm   Trials   DPT # of Ite.   fails   Learning rate   momentum   Average epochs
DPT-stdBP   100      20              0       4.0             0.9        24.95
DPT-stdBP   100      40              0       4.0             0.9        23.50
DPT-stdBP   100      80              0       4.0             0.9        29.73

Table 6.3: Comparison on the XOR problem

The example suggests that pre-training with the Delta rule should not go too far. This conclusion follows from observing that the final result depends on the BP training, not on the Delta rule. To further confirm this statement, the Pocket rule was employed instead of the Delta rule in the pre-training phase. The Pocket rule is an improved Delta rule: it keeps a set of best weights in a "pocket" during Perceptron network training, so that when training stops, the pocket contains the best weight set found. As a result, the Pocket rule guarantees classifying as large a fraction of the training examples as possible [111].
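The Pocket rule as described can be sketched as a wrapper around Perceptron training that remembers the weight vector which has classified the most training samples correctly so far. This is only an illustration of the idea from [111]; the threshold-unit form and the sample problem (AND with a bias input) are assumptions:

```python
def perceptron_out(w, x):
    # Threshold unit: fires 1 when the weighted sum is positive.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def num_correct(w, samples):
    return sum(1 for x, d in samples if perceptron_out(w, x) == d)

def pocket_train(samples, n_inputs, iterations, rate=1.0):
    # Perceptron training that keeps the best weights seen so far in a
    # "pocket"; when training stops, the pocket holds the weight set that
    # classified the largest fraction of the training examples.
    w = [0.0] * n_inputs
    pocket, pocket_score = list(w), num_correct(w, samples)
    for it in range(iterations):
        x, d = samples[it % len(samples)]
        o = perceptron_out(w, x)
        if o != d:
            w = [wi + rate * (d - o) * xi for wi, xi in zip(w, x)]
        score = num_correct(w, samples)
        if score > pocket_score:
            pocket, pocket_score = list(w), score
    return pocket, pocket_score

# AND function with a bias input fixed at 1.0 (linearly separable).
samples = [([0.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 0),
           ([1.0, 0.0, 1.0], 0), ([1.0, 1.0, 1.0], 1)]
w, score = pocket_train(samples, 3, iterations=40)
assert score == 4    # the pocket classifies all four AND examples correctly
```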
However, our simulation results with the Pocket rule in the pre-training phase are not as good as those with the Delta rule. Other unresolved questions include: what is the optimum learning rate for the Delta rule? Is there another Perceptron-based learning rule which gives better results than the Delta rule? How well does the DPT perform with BP-like algorithms other than the standard BP algorithm? These questions are reserved for future investigation.

Chapter 7

A Local Minima Free Training Algorithm

This chapter presents another BP-like algorithm, called Globally Guided Back Propagation (GGBP) [92]. The most attractive property of GGBP is that it is free of local minima, that is, it guarantees convergence to a global minimum. The first section briefly reviews GGBP. Then another version of the GGBP, named Locally Guided BP (LGBP), is derived. The LGBP retains the local-minima-free property and is superior to GGBP from the point of view of VLSI implementation. The implementation of LGBP using weight perturbation techniques will be discussed in the next chapter.

7.1 Globally Guided Back Propagation

The standard BP algorithm is based on the steepest descent method with respect to weight space. Local minima may exist because of the complicated error surface in the weight space. The GGBP, on the other hand, performs a gradient search on the error surface in the output space. The error function (see equation (5.5)) in the output space defines a convex quadratic surface which has only one global minimum. The error measure after presenting the input pattern p can be rewritten as follows:

E^p = \frac{1}{2} \sum_k (D_k^p - O_k^p)^2    (7.1)

When performing the gradient search in the output space, the change in the output that decreases the error becomes:

\Delta O_k^p = -\lambda \frac{\partial E^p}{\partial O_k^p} = \lambda (D_k^p - O_k^p)    (7.2)

where \lambda is the learning rate. The next step is to decide how each weight should change with respect to the output change.
Direct calculation of the weight changes with respect to the output change is extremely difficult. However, the required change in weights can be approximated through a Taylor series expansion, as follows:

\Delta O_k^p = \sum_{w_i \in W_k} \frac{\partial O_k^p}{\partial w_i} \Delta w_i + O(|\Delta W_k|^2)    (7.3)

where W_k denotes all weights contributing to output O_k. If \Delta W_k is small, the higher-order component can be omitted, and the output change can then be approximated by:

\Delta O_k^p \approx \frac{\partial O_k^p}{\partial W_k} \cdot \Delta W_k    (7.4)

Now the output change is expressed as an inner product of two vectors, \partial O_k^p / \partial W_k and \Delta W_k. If each element of \Delta W_k changes in the same direction as the corresponding element of \partial O_k^p / \partial W_k, the output change is maximized and can be rewritten as:

\Delta O_k^p = \left| \frac{\partial O_k^p}{\partial W_k} \right| |\Delta W_k|    (7.5)

For each element w_s of W_k, the following expression holds:

\frac{\Delta w_s}{|\Delta W_k|} = \frac{\partial O_k^p / \partial w_s}{\left| \partial O_k^p / \partial W_k \right|}    (7.6)

Combining equations (7.5) and (7.6), we have:

\Delta w_s = \frac{\Delta O_k^p \, (\partial O_k^p / \partial w_s)}{\left| \partial O_k^p / \partial W_k \right|^2}    (7.7)

Substituting \Delta O_k^p with equation (7.2), if w_s is a weight connecting to an output node O_k, the update rule becomes:

\Delta w_s = \frac{\lambda (D_k^p - O_k^p)(\partial O_k^p / \partial w_s)}{\sum_{w_i \in W_k} (\partial O_k^p / \partial w_i)^2}    (7.8)

If w_s is a weight connecting to a hidden node, the change in w_s contributes to all outputs, so all the output changes affect \Delta w_s. The update rule becomes:

\Delta w_s = \sum_k \frac{\lambda (D_k^p - O_k^p)(\partial O_k^p / \partial w_s)}{\sum_{w_i \in W_k} (\partial O_k^p / \partial w_i)^2}    (7.9)

The denominator of equations (7.8) and (7.9), \sum_{w_i \in W_k} (\partial O_k^p / \partial w_i)^2, is the sum of squared gradients over all weights which contribute to the output to which w_s is connected. In short, w_s is guided globally by the slopes of all the other weights. It is this characteristic which gives the algorithm the name Globally Guided BP. Comparing equations (5.12) and (7.8) shows that the GGBP output-layer update equals the standard BP update with the learning rate replaced by the effective value:

\eta_k = \frac{\lambda}{\sum_{w_i \in W_k} (\partial O_k^p / \partial w_i)^2}    (7.10)

Comparing equations (5.14) and (7.9) yields the same per-output relation for hidden-layer weights, applied inside the sum over k:

\eta_k = \frac{\lambda}{\sum_{w_i \in W_k} (\partial O_k^p / \partial w_i)^2}    (7.11)

Equations (7.10) and (7.11) illustrate the difference between GGBP and standard BP.

7.2 Locally Guided BP: suitable for VLSI Implementation

The GGBP has the unique property of being local minima free [92].
This is particularly useful for tasks requiring accurate training. The GGBP is claimed to have a faster convergence speed than the standard BP [92], and this is the case in the experiments reported in [92]. However, the GGBP algorithm is unlikely to further improve training speed, as the algorithm is derived from a Taylor series approximation (see equation (7.4)) which requires that \Delta W_k be small. The requirement for small changes in the weights ensures slow convergence of GGBP. Furthermore, convergence enhancement techniques, such as the use of second-order derivatives in QuickProp, are unlikely to be applicable to GGBP. In addition, the GGBP algorithm is not suitable for VLSI implementation of on-chip learning because of its globally guided feature, as shown by the denominator of (7.8) and (7.9). Thus an alternative to a GGBP-like algorithm is considered. The new algorithm should retain the local-minima-free feature and be suitable for VLSI implementation. It is from this perspective that the Locally Guided BP (LGBP) is derived.

7.2.1 LGBP algorithm

Starting from equation (7.4), consider one weight change at a time, with all other weights kept constant. This can be viewed as performing a local gradient search. From (7.4), considering only the change in w_s, the Taylor series expansion is approximated as:

\Delta O_k^p = \frac{\partial O_k^p}{\partial w_s} \Delta w_s    (7.12)

Combining equation (7.12) with (7.1) and (7.2), we have:

\Delta w_s = \lambda \sum_k \frac{D_k^p - O_k^p}{\partial O_k^p / \partial w_s}    (7.13)

If w_s belongs to a hidden layer, (7.13) is the update rule. If w_s belongs to an output layer, it contributes to only one output, O_k; thus the update rule becomes:

\Delta w_s = \lambda \frac{D_k^p - O_k^p}{\partial O_k^p / \partial w_s}    (7.14)

The denominator \partial O_k^p / \partial w_s of equations (7.13) and (7.14) shows that if the slope of w_s is small (a flat surface), there should be a large change in w_s to produce the desired \Delta O. If the slope of w_s is large, a small change in w_s is prescribed.
However, in the extreme case where \partial O_k^p / \partial w_s = 0, an infinite value of \Delta w_s results. If this occurs for all weights, then the global minimum has been reached and no more training is needed. However, if this happens for some of the weights, not all of them, equations (7.13) and (7.14) have to be redefined as (7.15) and (7.16), respectively. If w_s belongs to a hidden layer, the update rule is:

\Delta w_s = \lambda \sum_k \Delta w_{s,k}, \quad \Delta w_{s,k} = \begin{cases} \dfrac{D_k^p - O_k^p}{\partial O_k^p / \partial w_s} & \text{if } |\partial O_k^p / \partial w_s| \ge \epsilon \\ 0 & \text{if } |\partial O_k^p / \partial w_s| < \epsilon \end{cases}    (7.15)

If w_s belongs to the output layer, the update rule is:

\Delta w_s = \begin{cases} \lambda \dfrac{D_k^p - O_k^p}{\partial O_k^p / \partial w_s} & \text{if } |\partial O_k^p / \partial w_s| \ge \epsilon \\ 0 & \text{if } |\partial O_k^p / \partial w_s| < \epsilon \end{cases}    (7.16)

Equations (7.15) and (7.16) define the learning algorithm of the LGBP. Its hardware implementation will be discussed in the next chapter.

7.2.2 Comparison of LGBP and GGBP

In the LGBP, the important concept is to consider one weight at a time. This results in a local optimization. The assumption of one weight change at a time does not alter the local-minima-free property. The local-minima-free property of the GGBP is guaranteed by the shape of the error function over the output space, which has only one global minimum. The same principle also guarantees that the LGBP is local minima free. The difference between the GGBP and the LGBP is that the local optimization of the LGBP results in a curved error-reduction path to the global minimum. As a result, the convergence speed of the LGBP will be slower than that of the GGBP in terms of the number of pattern presentations. However, the LGBP has a much simpler form, which makes VLSI implementation feasible. Through the use of weight perturbation, the LGBP appears well suited to analog VLSI circuits. The implementation of BP algorithms in VLSI will be further discussed in the following chapter. The idea of considering one weight change at a time is not new; many researchers have used it in implementing their algorithms on chips [100, 93].
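As a numerical illustration of equation (7.16): for a sigmoid output neuron, \partial O_k / \partial w_s = f'(y_k) x_s, and the LGBP update divides the output error by this slope, zeroing the update when the slope falls below \epsilon. The sketch below evaluates the rule for each weight of one output neuron; the function names are illustrative and the sigmoid activation of equation (1.4) is assumed:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def lgbp_output_update(w, x, d, rate, eps=1e-3):
    # Equation (7.16): for an output-layer weight w_s,
    #   delta_ws = rate * (d - o) / (dO/dw_s)   if |dO/dw_s| >= eps
    #   delta_ws = 0                            otherwise,
    # the eps guard preventing the division from blowing up on flat slopes.
    y = sum(wi * xi for wi, xi in zip(w, x))
    o = sigmoid(y)
    fprime = o * (1.0 - o)
    updates = []
    for xi in x:
        slope = fprime * xi           # dO/dw_s for a sigmoid output neuron
        updates.append(rate * (d - o) / slope if abs(slope) >= eps else 0.0)
    return updates

w = [0.2, -0.4, 0.1]
x = [1.0, 0.5, 0.0]                   # the zero input gives a zero slope
dw = lgbp_output_update(w, x, d=1.0, rate=0.05)
assert dw[2] == 0.0                   # flat-slope weight is left unchanged
assert dw[0] > 0.0 and dw[1] > 0.0    # others move the output toward d = 1
```

In the algorithm proper only one weight is changed at a time; the function simply shows what each such single-weight update would be.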
Computer simulations of the LGBP could be performed to compare its convergence speed with that of the GGBP. However, from the point of view of software simulation, the LGBP algorithm is not expected to outperform the GGBP. The advantage of the LGBP over the GGBP lies in its amenability to simple hardware implementation.

Chapter 8

Implementations of BP Algorithms

A significant amount of research has been done on BP algorithms with software simulations. Progress has been achieved in the theoretical aspects of MLPs, and commercial software simulators are currently available for conventional computers [97]. In contrast, however, hardware VLSI implementations are still at an early stage and attract more and more researchers to this field. This chapter considers the VLSI implementation of BP algorithms, usually referred to as on-chip learning. The work presented here does not involve real silicon layout; instead, variations of the training algorithms that simplify VLSI layout are investigated. The first section discusses the realization of the derivative of the activation function. Qualitative analysis and an experiment on the XOR problem show that the shape of the derivative is important during learning. The replacement of the derivative with a constant [107, 102] degrades the learning capability of a MLP, although this replacement obviously simplifies the design of the on-chip learning function. The remaining half of this chapter deals with the development of the weight perturbation method for the LGBP algorithm. The weight perturbation version of the LGBP can be easily applied to the systems described in [94] and [93].

8.1 Introduction

VLSI hardware implementation of neural networks with on-chip learning capability is an important step towards the use of neural networks in real-time applications. Compared to
purely digital VLSI and purely analog VLSI, hybrid digital/analog VLSI seems the most promising approach. With purely digital circuits, the implemented neural network system can be easily controlled and placed in digital systems. However, analog circuits are also suitable for implementing neural networks, because the BP algorithms are analog in the sense that the weights are real values and the neural activation function is continuous. With on-chip learning, the inherent inaccuracies of analog circuits can be compensated for during training. Currently, whichever VLSI technique is adopted (digital, analog, or digital/analog hybrid), two approaches are the most popular: 1) direct implementation of the calculating function blocks according to the original formulation of the BP algorithms, such as equations (5.12) to (5.15); this method has a relatively long history [98, 99, 64]; 2) weight perturbation to approximate the gradient search \partial E / \partial w. The second approach is of special interest to analog circuit designers [94, 93]. A comparison of the two approaches will be made during the discussion of their implementations in the following sections, and a summary will follow.

8.2 Direct Implementation of BP algorithms

Direct implementation of the standard BP algorithm has been a topic of interest to VLSI designers. The first attempt was reported in [98], where a fully analog approach was employed. To realize the standard BP algorithm of equations (5.12) - (5.15), at least three major function blocks are required: weight storage, activation function generation, and computation of the derivative of the activation function. Obviously, any effort toward simplifying their implementation in terms of the number of components and the complexity of the circuitry is important to reduce the usage of silicon area. To simplify VLSI design and solve real circuit problems, researchers have investigated
Implementations of BP Algorithms 66 certain aspects of the implementation of each of the three function blocks[104]. The most critical part of a VLSI implementation is the weight storage. Holt and Baker[105] re-ported their simulation results employing limited precision calculations. They concluded that integer computation works quite well for the standard BP algorithm. This result provides a good reference for digital VLSI designers. The activation function block is often implemented by a transconductance amplifier [99, 64]. The derivative / ' ( . ) of the activation function / can be implemented either by using / in the form / ' = / ( l — / ) [98], or by dedicated circuits[103]. For the computation of the derivative of the activation function, hardware designers are trying to use extra circuits to design or to approximate the derivative according to the activation function because they want to keep the gradient descent feature. How-ever, computer simulations show that the derivative is not always essential to problems. Fahlman replaced the derivative with a constant in a simulation of the Encoder/Decoder problem and obtained a fairly good result[107]. K. Nakayama, et. al. [102] also successed in their pattern recognition examples with a constant derivative. This is an important aspect in VLSI implementation because no extra circuits are required for the constant if it is combined with the learning rate. Recently L.Song and M.Elmasry et al [95] have designed a MLP substituting the derivative with a constant. It is obvious that using a constant to replace the real derivative will simplify circuit design. It will make a significant difference in layout in terms of the number of compo-nents and the complexity of circuits. It will also make the testing of an implemented circuit easier. However the question remains: at what cost are these benefits obtained? And will this replacement retain the learning power of BP algorithms. 
That is, will the simplified algorithm still be able to achieve convergence of a MLP to a local or a global minimum?

8.2.1 Replacement of derivative with a constant fails on XOR problem

The investigation begins with the simulation of training a network to perform the XOR function. The standard BP algorithm is used and the derivative is replaced with a constant. The goal is to see if the modified BP algorithm can train the 2-2-1 XOR network to convergence. The parameters were set as follows. The fail threshold was set to 10,000 epochs. This is a relatively large number for the XOR problem, as a local minimum is usually reached within a couple of hundred epochs (see Chapter 6). The learning rate was set to 0.1, the momentum term to 0.0, and the constant derivative for the hidden layer to 0.1. The constant for the output layer was set to 0.1 and 0.5, respectively. 25 trials were simulated with each of the two settings. The results showed that all 25 trials failed in each of the two settings. One of the results, terminated at 10,000 epochs, is shown in Figure 8.1. The output does not classify the input patterns after 10,000 epochs, and the network is unlikely to converge if the training continues. The above simulation implies that the results in [107] and [102] may only apply to some special cases. Re-examining the experiments in [107] and [102], we are not surprised that they were successful, as the input patterns in [107] and [102] are actually linearly separable with the given number of hidden neurons. For those special examples, Perceptron training algorithms, like the Delta Rule (5.8), are sufficient. The failure to train a network to solve the XOR problem in the above simulation confirms the prediction that the replacement of the derivative with a constant degrades the training capability of BP algorithms.
However, it is premature to settle on this conclusion, since the simulation is not exhaustive, nor is the XOR problem large enough. More simulations on larger problems could be done, but this is time consuming, and a mathematical derivation to prove this fact seems difficult, if not impossible. A wise choice for further exploring the question is qualitative analysis.

Figure 8.1: An unconverged XOR network after 10,000 epochs. The outputs for the input patterns 00, 01, 10 and 11 are 0.492, 0.501, 0.499 and 0.508, respectively, all near 0.5, so the network does not classify the patterns.

Consider the sources of the real derivative and of a constant used as a derivative: the real derivative comes from a nonlinear sigmoid function, while a constant implies that the activation function is linear, f'_linear = constant. It is well known that a linear activation function in the hidden layer of a MLP provides no advantage over a single-layer Perceptron [73]. With a constant approximating the derivative, the power of the BP algorithm is degraded to that of a Perceptron-like algorithm, which is not able to train a multi-layer Perceptron. As a result, the training algorithm is unable to train a network to realize the XOR function, which is why the simulations failed on the XOR problem.

8.2.2 Replacing derivatives of output neurons

The most important feature of a MLP is its capability of solving higher-order problems, such as implementing the XOR function. The derivative of the output neurons was studied from the point of view of VLSI implementation. There are at least two quantities present at the output layer when a forward calculation is complete: the constant learning rate and the partial-error function expressed in (5.5).
If the derivative can be replaced with either of them, the additional derivative circuits in the output layer are no longer needed. The saved silicon area lies around the output layer, which normally requires more function circuits, e.g. an error accumulator, a convergence detector, etc. The saving would be especially effective for data compression networks, which contain more output neurons than hidden neurons. Note that both the constant learning rate and the partial error are positive values. Replacing the derivative with one of these values does not alter the direction of the weight change; only the magnitude of the weight-update steps is affected. This can be regarded as "crude adjustment" in the output layer and "fine adjustment" in the hidden layer. Simulations were carried out such that only the derivatives of the output neurons were replaced, while the derivatives of the hidden neurons were kept unchanged. Simulations with the real derivatives were also run so that the results can be compared. The role of the derivative is to back-propagate errors, so it will be called the back-error function in the following for convenience. For both the replaced and the non-replaced back-error functions, a value of 0.1 is added, as this always gives better results [107]. The fail threshold was set to 10,000. Two problems were chosen for this experiment: the XOR problem and the "tight" Encoder/Decoder problem [107]. The simulation results are reported in Table 8.1. In all the simulations, the momentum term is not included because it incurs additional hardware.
Parity (XOR) problem (network 2-2-1, 25 trials per row, initial range [-1.0, 1.0]):

  Back-error function   Constant value   Learning rate   Average epochs   Fails
  Real                  -                0.6             1002             4
  Constant              0.25             0.6             1132             3
  Constant              0.5              0.6              563             3
  P-Error               -                0.6              890             3
  Real                  -                1.0              648             3
  Constant              0.25             1.0              587             3
  Constant              0.5              1.0              301             3
  P-Error               -                1.0              468             3

Encoder/Decoder problems (25 trials per row, learning rate 1.0, initial range [-1.0, 1.0]):

  Back-error function   Constant value   Network size   Average epochs   Fails
  Real                  -                4-2-4           100              0
  Constant              0.25             4-2-4            96              0
  Constant              0.5              4-2-4            59              0
  P-Error               -                4-2-4            78              0
  Real                  -                8-3-8           205              0
  Constant              0.25             8-3-8           177              0
  Constant              0.5              8-3-8           121              0
  P-Error               -                8-3-8           138              0

Table 8.1: Simulation results with different back-error functions

To understand qualitatively why the replacements work and even improve the convergence rate, consider Figure 8.2, which plots the differences among the back-error functions. The X-axis represents the error at an output neuron, |D - O|, and the Y-axis represents the magnitude of the error which each back-error function propagates, |D - O| * f'(y). The traces show how much error each function (the real derivative, the constant, and the partial-error function) back-propagates. While the error is below 0.7, all the functions back-propagate more error as the error increases. Once the error exceeds 0.7, the traces diverge: the constant and the partial-error function back-propagate a greater amount of error than the real derivative does. As the error increases further, the real derivative back-propagates less error while the other two functions back-propagate more.
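The divergence of the traces can be reproduced numerically. This is a hedged sketch of Figure 8.2, under the assumption that the target is D = 1, so the output is O = 1 - e for error e = |D - O| and the real sigmoid derivative is O*(1 - O) = e*(1 - e):

```python
# Hedged numeric sketch of Figure 8.2, assuming target D = 1.
def real_deriv(e):    return e * (1.0 - e)   # sigmoid derivative O*(1-O)
def constant(e):      return 0.25            # constant back-error
def partial_error(e): return e               # the partial error |D - O|

# The real derivative's trace e*real_deriv(e) = e^2*(1 - e) peaks near
# e = 2/3 and then falls, while the other two traces keep growing:
assert 0.9 * real_deriv(0.9) < 0.7 * real_deriv(0.7)
assert 0.9 * constant(0.9) > 0.7 * constant(0.7)
assert 0.9 * partial_error(0.9) > 0.7 * partial_error(0.7)
```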
Fortunately, because they back-propagate more error as the error increases, the other two functions prove to be better at dealing with the problem of the "flat spot" [107] or "premature saturation" [110]. This is one of the factors that improves the convergence speed. It should be noticed in Table 8.1 that the larger constant (0.5) gives a better convergence rate, although it is less similar to the real derivative than the smaller constant (0.25). This is because the constant 0.5 back-propagates more error than 0.25 in each updating cycle. However, this does not indicate that the larger the constant, the better the performance. Simulating with a constant of 2.0 on the 2-2-1 XOR network yields a further improved convergence rate, but at the expense of more failed trials. As the network size increases, a smaller constant should be used. This is the same principle as that of reducing the learning rate as the network size increases. Admittedly, convergence is not theoretically guaranteed after the replacement, but from a practical point of view the replacements are successful.

Figure 8.2: Back-error functions vs. error. The traces |D-O|*Partial-Error, |D-O|*Real-Derivative, and |D-O|*Constant(0.25) are plotted against the output error |D-O|.

8.2.3 Summary

With the simulations and qualitative analyses in this section, two observations can be made: 1) replacing the derivative with a constant everywhere, as in [107], [102] and [95], does not work for XOR/parity problems because the replacement degrades the training power of the BP algorithm. There is no mathematical derivation of how much this replacement reduces the BP algorithm's training capability, but a qualitative analysis indicates that it may reduce a BP algorithm to a Perceptron-like algorithm. 2) if the derivatives of the hidden neurons are kept unchanged and only the derivatives of the
output neurons are replaced by a constant, then an effective training algorithm results. This replacement saves silicon area around the output layer, where normally more functional circuits are required, such as the error accumulator, convergence detector, etc. It is especially useful for applications with many more output neurons than hidden neurons.

8.3 Weight Perturbation: An Optimal Approach for Analog VLSI Circuits

As an alternative to direct implementation, weight perturbation (WP) is regarded as an optimal approach [94]. With weight perturbation, the gradient with respect to the weight in equ.(5.7) is evaluated by the approximation:

    dE/dWs = [E(Ws + Wpert) - E(Ws)] / Wpert,    (8.1)

where Wpert is a small perturbation of the weight currently being updated, Ws. The weight update rule becomes:

    dWs = -eta * [E(Ws + Wpert) - E(Ws)] / Wpert.    (8.2)

M. Jabri et al. [94] and S. Satyanarayana et al. [93] have used the above weight perturbation method in a VLSI implementation and in a discrete-circuitry implementation, respectively. In this section, the weight perturbation method applied to LGBP is derived. Both the designs of [94] and [93] can replace their original WP method with the new WP LGBP to take advantage of the local-minima-free property of LGBP.

8.3.1 Weight Perturbation of LGBP

The weight update rule of LGBP consists of equations (7.15) and (7.16). In these equations, the output gradient with respect to the weight, dOk/dWs, is a common term. This term can be approximated through weight perturbation:

    dOk/dWs = [Ok(Ws + Wpert) - Ok(Ws)] / Wpert    (8.3)

if the perturbation Wpert is small. The approximation can be improved by taking Ws as the central point:

    dOk/dWs = [Ok(Ws + Wpert/2) - Ok(Ws - Wpert/2)] / Wpert    (8.4)

if the perturbation Wpert is again small enough. The difference between equ.(8.3) and equ.(8.4) is that equ.(8.3) requires fewer forward calculations than equ.(8.4), while equ.(8.4) is more accurate than equ.(8.3).
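The speed/accuracy trade-off between the two approximations can be sketched numerically. The forward pass here is a stand-in function, not a network from the thesis; the names (output, pert) are illustrative:

```python
import math

# Hedged sketch of equ.(8.3) vs equ.(8.4): one-sided vs central
# difference approximations of dO/dW for a single weight.
def output(w):
    return math.tanh(1.3 * w)      # stand-in for a forward pass O(W)

w, pert = 0.4, 1e-3
exact = 1.3 / math.cosh(1.3 * w) ** 2                        # analytic dO/dw

one_sided = (output(w + pert) - output(w)) / pert            # equ.(8.3)
central = (output(w + pert / 2) - output(w - pert / 2)) / pert  # equ.(8.4)

# equ.(8.3) costs one extra forward pass, equ.(8.4) costs two, but the
# central form's truncation error is O(pert^2) rather than O(pert):
assert abs(central - exact) < abs(one_sided - exact)
```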
To calculate dOk/dWs via equ.(8.3), one additional forward calculation, O(Ws + Wpert), is required, while for equ.(8.4) two additional forward calculations are needed, O(Ws + Wpert/2) and O(Ws - Wpert/2). The choice between equ.(8.3) and equ.(8.4) depends on the speed/accuracy trade-off. Substituting equ.(8.3) into (7.15) and (7.16), the weight update rule of WP LGBP becomes, if Ws belongs to a hidden layer:

    dWs = eta * (Dp - Op) * Wpert / [Op(Ws + Wpert) - Op(Ws)]
              if [Op(Ws + Wpert) - Op(Ws)] / Wpert != 0,
    dWs = 0   otherwise;    (8.5)

and, if Ws belongs to the output layer:

    dWs = eta * (Dl - Ol) * Wpert / [Ol(Ws + Wpert) - Ol(Ws)]
              if [Ol(Ws + Wpert) - Ol(Ws)] / Wpert != 0,
    dWs = 0   otherwise.    (8.6)

Equations (8.5) and (8.6) define the weight perturbation version of LGBP. It is suitable for analog VLSI implementations.

8.4 Comparison of Perturbation and Non-Perturbation

In this section, the WP implementation and the direct implementation are compared, and the advantages and disadvantages of each method are discussed. Compared to direct implementation, the WP method is ideal for VLSI designs for the following reasons: 1. Since the gradient dE/dWs (or dOk/dWs) is approximated by a difference quotient of forward-pass outputs, no back-propagation pass is needed and only forward calculation is required. This means that no bidirectional circuits are needed, which simplifies the implementation considerably. 2. The hardware for back-propagation is totally avoided; in particular, the circuits for the derivative of the activation function are eliminated. In direct implementations, the generation of the derivative is a rather difficult task. On the other hand, WP implementation introduces difficulties which are not involved in the direct method. The disadvantages of the WP method are the advantages of the direct method. 1. The circuits for weight perturbation and weight update need to be designed. The weight update circuits are part of the back-propagation pass in direct implementations.
In the WP method, the weight update circuits have to be designed in addition to the forward pass. 2. In direct implementation, the calculation of dWs (equ.(5.12)/(5.14)) is realized through back-propagation circuits, in which the calculations are local to each weight, while in the WP method the calculation of dWs (equ.(8.5)/(8.6)) is done outside the network. The update to each weight has to be routed through a kind of decoder mechanism. Both [94] and [93] solved this problem by using a host computer. That is not really on-chip learning; the host computer is, however, needed only during the learning phase. In summary, the choice of the WP or the direct method depends on the following factors: Purpose -- in [93], the WP method is adopted to explore the potential of the reconfigurable system for various problems, and in that case the use of a host computer is reasonable. Technology advances -- as VLSI technology advances and new functional circuitry appears, either of the two methods may become easier to implement. Based on the state of the art of VLSI technology, the WP method seems superior.

Chapter 9

Conclusions

9.1 Summary

Two popular neural network models are investigated in this thesis, the Hopfield neural network and the Feed-Forward Neural Network (FFNN). One of the challenging topics in the study of Hopfield network models is to build a large Hopfield network with hardware circuitry, so that the properties, such as robustness, of a network with a large number of neurons can be investigated. The work in Chapter 3 explored feasible models suitable for digital VLSI implementations of Hopfield networks. The dilution coding Hebb's model and the dilution coding projection model are both shown through software simulations to have good performance. The design presented in Chapter 4 to implement the dilution coding models showed the feasibility of designing a Hopfield network that is programmable and expandable.
These properties will be desirable for Hopfield network researchers. Multi-Layer Perceptrons, or BP networks, are well-known neural networks. They can map any input/output pattern pairs through their internal nonlinear transformations if the hidden layer(s) are large enough. This thesis has involved studies of building Multi-Layer Perceptrons. Training an MLP has been a challenging topic for researchers. The most famous BP algorithm has two great obstacles, slow convergence and getting stuck at local minima, although the algorithm looks elegant mathematically. These two obstacles have inhibited the widespread use of MLPs in real-world applications. Solving or alleviating the two drawbacks motivated the work of this thesis. The research on FFNNs in this thesis has proceeded along two roads, developing training algorithms and implementing on-chip learning. Since hardware realization of on-chip learning is the most efficient approach toward real-time applications of neural networks, the training algorithms developed in this thesis have been designed to suit hardware circuits. In Chapter 6, the developed initial-weight pre-training method, named DPT, extended the BP algorithm to be able to start working with zero initial weights. As a result, the extended version of the BP algorithm can speed up the standard BP algorithm by up to 80 percent, and the chances of getting stuck at local minima are dramatically reduced. With respect to hardware realization, the extended BP algorithm has the same complexity as the standard BP algorithm. The work of developing the DPT method was presented in [101]. Chapter 7 derived a hardware-suitable BP-like algorithm, named LGBP (Locally Guided BP). There are two promising features of LGBP in comparison to other BP-like algorithms: convergence to a global minimum is guaranteed, and hardware implementation is not difficult. The issue of designing on-chip learning was discussed in Chapter 8.
For direct implementation of the equations of the BP algorithm, the role of the derivative was addressed. Replacing the derivative in the hidden layer with a constant degrades the BP algorithm's capability of training higher-order problems; an example is the XOR problem, which cannot be learned after the constant replacement. For implementation through weight perturbation, the weight perturbation version of LGBP (WP LGBP) was derived. The WP LGBP can easily be used in the implemented systems reported in [93] and [94] to take advantage of the local-minima-free property of LGBP. In summary, the work presented here makes the following contributions to the neural network literature. First, for the Hopfield network, the dilution coding models and their VLSI implementation scheme with digital CMOS technology lead to a programmable and expandable Hopfield network with a large number of neurons. This result will help researchers to look into Hopfield networks. For feed-forward networks, the DPT method has enabled BP algorithms to start training from a symmetric state, especially the all-zeros starting point. As a result, not only is the BP training speed increased, but the chances of getting stuck at local minima are also reduced. Furthermore, the derivation of the LGBP algorithm has made hardware implementation much simpler for a local-minima-free algorithm. The weight perturbation version of the LGBP has immediate application in the systems of [93] and [94]. Finally, the investigation of replacing the derivative with a constant has clarified the importance of the shape of the real derivative during training. Using a constant as the derivative will likely degrade a BP algorithm to a Perceptron-like rule which cannot train an MLP to solve higher-order problems, such as the XOR problem.

9.2 Future Work

The work in Chapter 6 of developing the DPT method has improved BP performance significantly.
However, questions remain to be solved. As the questions mentioned in Section 6.5 are solved, the performance of BP training will be further improved. The weight perturbation version of LGBP is optimal for analog VLSI implementations of on-chip learning. Bringing the algorithm to the systems of [93] or [94] for an actual experiment will help the future use of LGBP. With those existing systems, the comparison between weight perturbation of the standard BP algorithm and weight perturbation of the LGBP algorithm on large problems should be easily made. Actually, the most exciting work to do will be an implementation of the weight perturbation LGBP with analog VLSI technology. This design should differ from the designs of [93] and [94] mainly in the technique of calculating and routing the dWs's. [93] and [94] used host computers to take care of the calculation and the routing of the dWs's; the new design should integrate those functions into a chip.

Appendix A

HOPFIELD NETWORK IMPLEMENTATION

Continuing from Chapter 4, the detailed discussion of the Hopfield Network implementation is presented in this Appendix. Refer back to Figure 4.1 as each functional block is described.

A.1 Interface

The co-processor system is accessed by the host via the specially designed INTERFACE. It consists of the Interface Controller (IC), Address Decoder (AD), and Tri-State Bus Transceiver (TSBT). The IC deals with the protocol translation for communication between the two processor systems. By writing to the IC, the Stop/Run signal can be set high (Stop) or low (Run). When Stop is active, the co-processor is stopped and can be accessed by the host via the TSBT. When Run is active, the Local Bus is separated from the ISA bus by the TSBT and the co-processor can work by itself; the AD recognizes valid co-processor system addresses. Four contiguous blocks of memory in the host computer's address space are assigned to the co-processor system.
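The host's view of this Stop/Run protocol can be sketched in software. This is a hedged mock of the control flow only, with the co-processor reduced to a dictionary of registers; every name here (run_recall, stop_run, ready) is illustrative, not part of the actual design:

```python
# Hedged sketch of the host-side working flow: halt the co-processor,
# load it, release it, poll for completion, read the result back.
STOP, RUN = 1, 0

def run_recall(coproc, matrix, probe):
    coproc['stop_run'] = STOP          # halt co-processor, open the TSBT
    coproc['ram'] = matrix             # load connection matrix
    coproc['ctrl'] = {'groups': 16}    # load CtrlM control/threshold regs
    coproc['neurons'] = list(probe)    # load probe vector into the NNM
    coproc['stop_run'] = RUN           # start the co-processor
    while not coproc['ready']():       # poll the Interface status register
        pass
    return coproc['neurons']           # read back the stable state

# Minimal mock: "converges" immediately by echoing the probe vector.
mock = {'ready': lambda: True}
state = run_recall(mock, matrix=[[0]], probe=[1, 0, 1])
assert state == [1, 0, 1]
```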
The first block is occupied by the RAM Module, which stores the connection matrix. The second is used by the Neuron Block within the NNM. The third addresses the Control/Threshold Register in the CtrlM. The last block is assigned to the Control Registers in the Interface Controller. Contiguous addresses are used to ease the INTERFACE implementation. The polling method of I/O is adopted in designing the Interface Controller. Accordingly, the co-processor system is initialized, started and finished as shown in Figure A.1.

Figure A.1: Working flow viewed from the host computer. Initialization: set Stop/Run high via the Interface control register, load the RAM Module with the connection matrix, load the control register in the CtrlM, and load the NNM with the probe vector. Starting: reset Stop/Run low. The host then polls the Interface control register until completion, reads the stable state from the NNM, and post-processes it.

A.2 RAM Module

The RAM Module is constructed with high-speed static memory chips. It contains the vertical slices of the connection matrix Tki associated with the neurons in the NNM. Ideally, the memory would have as many words and as many bits in width as there are neurons (128) in the NNM. However, such a wide word would require too many pins on the NNM chip and too many lines on the Local Data Bus to be practical. Therefore the memories are arranged on a 32-bit-wide bus and divided into four blocks. The width of 32 bits matches the 80386 and 80486 CPU data bus. Figure A.2 shows a functional diagram of the RAM Module. It can be accessed by both the host and the CtrlM. When accessed by the host, address lines A3 through A9 address the word, and A1 and A2 are decoded to choose one of the four RAM blocks, RAM0 through RAM3.
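The block/word split of the host address can be sketched as a decode function. The bit assignment shown (A1/A2 as block select, A3..A9 as word select, following the text) is illustrative; the ordering of A1 and A2 within the block index is an assumption:

```python
# Hedged sketch of host-side address decoding for the RAM Module:
# A1/A2 select one of four 32-bit RAM blocks, A3..A9 select the word,
# so the 128x128-bit matrix appears as four 128-word x 32-bit blocks.
def decode(addr):
    block = (addr >> 1) & 0b11        # A1, A2 -> RAM0..RAM3
    word = (addr >> 3) & 0b1111111    # A3..A9 -> word 0..127
    return block, word

# Stepping A1/A2 walks the four blocks of one matrix row before the
# word address advances.
assert decode(0b0000000_00_0) == (0, 0)
assert decode(0b0000000_11_0) == (3, 0)
assert decode(0b0000001_00_0) == (0, 1)
```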
When accessed by the CtrlM, the four RAM blocks are selected simultaneously and 128 bits are read out, followed by a time-multiplexing operation on the Transceivers driven by the signals Psel0 through Psel3. Accordingly, the 128 bits of data are reassembled in the Pipeline Register in the NNM with Psel0 through Psel3. This design allows access at speeds approaching that of the 128-bit-wide configuration with lower pin requirements on the NNM, which solves the pin-limitation problem. The local memory is relatively small in size but has very stringent speed requirements. Within each neuron update cycle, the memory must be able to perform one read access and deliver the data in four parts to the local data bus. A conservative estimate of 115 ns per neuron update cycle is achievable [10].

Figure A.2: Functional diagram of the RAM Module. Four 128x32 RAM blocks (RAM0-RAM3) share a 32-bit data bus through transceivers; a decoder on A1/A2 selects the block for host access, while Psel0-Psel3 time-multiplex the four blocks for CtrlM access.

A.3 Neural Network Module

Figures A.3 and A.4 give a functional diagram of the Neural Network Module (NNM). The Neuron Block contains storage space for 128 neurons. During initialization, a probe vector is loaded by the host computer in four parts of 32 bits each; Nsel0 through Nsel3 trigger each group of neurons (D flip-flops) to accept 32 bits of data, respectively. In operation, at the end of each update cycle, the NextState is latched into the current neuron chosen by the Current Neuron Shift Register (CNSR). The CNSR is initialized to all zeros except for a single one bit, which corresponds to the neuron that is to be updated first. It is shifted in sequence by the signal CnShift generated by the CtrlM.
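The four-part transfer of a 128-bit matrix row over the 32-bit bus can be sketched as a split/reassemble pair. This is a behavioral illustration of the Psel multiplexing described above, with illustrative function names:

```python
# Hedged sketch: a 128-bit connection-matrix row is stored as four
# 32-bit words (RAM0..RAM3) and reassembled in the Pipeline Register
# in four Psel-multiplexed transfers.
ROW_BITS, BUS_BITS, BLOCKS = 128, 32, 4

def split_row(row_bits):
    """RAM view: slice a 128-bit row into four 32-bit words."""
    assert len(row_bits) == ROW_BITS
    return [row_bits[i * BUS_BITS:(i + 1) * BUS_BITS] for i in range(BLOCKS)]

def reassemble(words):
    """NNM view: Psel0..Psel3 latch the words back into one 128-bit row."""
    pipeline_register = []
    for w in words:                    # one 32-bit transfer per Psel pulse
        pipeline_register.extend(w)
    return pipeline_register

row = [i % 2 for i in range(ROW_BITS)]
assert reassemble(split_row(row)) == row
```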
This realization is based on the round-robin neuron selection method. The Pipeline Register contains the row of the connection matrix associated with the current neuron; it was loaded during the previous cycle. As the update rule is carried out, the next row of the connection matrix is read into the Pipeline Register. The multiplications Tki*Vi are performed by logic AND gates, since the values of Tki and Vi are either 0 or 1. The products are then grouped by OR gates, so that there are only as many inputs to the Count Ones block as there are groups. As previously discussed, there can be a maximum of one TRUE bit per group and a maximum of 16 groups; with a zeroed diagonal in the connection matrix, the Count Ones block therefore needs only 16 inputs, rather than 128, and has a 4-bit output to represent a maximum ISum of 15. The state of the current neuron is passed by the CNSR through the Selector block (Figure A.4) and supplied to the CvgDt by the signal CurrState. The signal ThoffState tells the CtrlM whether the current neuron has the threshold ThoFF and needs to be updated. More about ThoffState is described in section A.5 under "Controller FSM".

Figure A.3: Functional diagram of the Neural Network Module (1 of 2). Host data (32 bits) and local memory data are multiplexed into the Pipeline Register under Psel0-Psel3; the 16/8/4 group lines feed the Count Ones block, which produces the 4-bit ISum.

Figure A.4: Functional diagram of the Neural Network Module (2 of 2). The Neuron Block of D flip-flops drives the Selector, which produces CurrState and ThoffState, gated by the Active signal.

The signals Active and CpShift are used for expansion; more details about them are discussed in section A.9.
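The AND/OR/Count-Ones datapath just described can be sketched behaviorally. This is a hedged software model of the update rule for unipolar (0/1) values, not the circuit itself; the toy connection row and the function name next_state are illustrative:

```python
# Hedged sketch of the NNM update rule: products Tki*Vi are AND gates,
# each group of products is ORed (at most one TRUE bit per group in a
# diluted pattern), and Count Ones sums the 16 group lines into a
# 4-bit ISum that is compared against the threshold ThoFF.
GROUPS, GROUP_SIZE = 16, 8             # 16 groups of 8 neurons = 128

def next_state(t_row, v, thoff):
    products = [tk & vi for tk, vi in zip(t_row, v)]              # AND gates
    group_lines = [
        int(any(products[g * GROUP_SIZE:(g + 1) * GROUP_SIZE]))   # OR gates
        for g in range(GROUPS)
    ]
    isum = sum(group_lines)            # Count Ones: at most 15 (4 bits)
    return 1 if isum >= thoff else 0   # threshold comparison

n = GROUPS * GROUP_SIZE
t_row = [1] * n                        # toy connection row (illustrative)
v = [0] * n
v[0] = v[8] = v[16] = 1                # one active neuron in three groups
assert next_state(t_row, v, thoff=3) == 1
assert next_state(t_row, v, thoff=4) == 0
```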
  Pin name    Dir  Pins         To      From    Function description
  Data        I/O  32           LcBus   LcBus   Data transition
  Stop/Run    I    1            -       Interf  Stopping/Running control
  Clock(s)    I    1 (or more)  -       ClkGen  Timing
  Power       -    2            -       -       Supplying +5 V power
  Ground      -    2            -       -       Ground of power
  Psel0-3     I    4            -       CtrlM   Addressing Pipeline Registers
  Nsel0-3     I    4            -       LcBus   Addressing Neurons
  16 Groups   I    1            -       CtrlM   Choosing 16 groups
  08 Groups   I    1            -       CtrlM   Choosing 8 groups
  04 Groups   I    1            -       CtrlM   Choosing 4 groups
  CnShift     I    1            -       CtrlM   Shifting CNSR one bit
  Uncal       I    1            -       CtrlM   Stopping calculating
  Active      I    1            -       CpCt    Selecting current NNM
  NextState   I    1            -       AComp   Next state of current neuron
  ThoffState  O    1            CtrlM   -       Current neuron with Thoff
  CurrState   O    1            CvgDt   -       State of current neuron
  CpShift     O    1            CpCt    -       Shift CpCt one bit
  Sum         O    4            AComp   -       Input of current neuron

Table A.2: Pin assignment for the Neural Network Module chip (60 pins in total)

The NNM is to be integrated on a single chip. Its pins are listed in Table A.2, with 60 pins assigned so far; the rest of its pins could be used for testing. In estimating the chip's complexity, CMOS3 standard cells are referenced, and a total of 12192 transistors will be integrated into the chip. In the practical design, pseudo-NMOS technology may be used in places, so the number of transistors may be reduced.

A.4 Control Module

Figure A.5 shows the functional diagram of the Control Module (CtrlM). At the end of initialization, the host computer resets the signal Stop/Run low by writing to the Interface Controller, and control of the co-processor is handed to the Control Module. Its Control and Threshold Register is loaded by the host during initialization. The Threshold Register contains the value of the threshold ThoFF. As discussed in the last section, with one NNM the maximum ISum of 15 can be represented in 4 bits, so a ThoFF of 4 bits suffices. The extra 3 bits of the 7-bit Threshold Register are reserved for expansion (see section A.9).
The Control Register contains three fields, GROUPS, ERROR, and SINGLE-STEP/NORMAL, as shown in Figure A.5. The 3-bit GROUPS field indicates how the neurons are grouped; it is selectable among 4, 8, and 16 groups, corresponding to 32, 16, and 8 neurons per group respectively. The SINGLE-STEP/NORMAL bit selects between two running modes. In normal mode, the CvgDt reports Ready when a final stable state is reached. In single-step mode, the CvgDt reports Ready at the end of every update cycle and inhibits the ClkGen so as to stop the co-processor. The co-processor is restarted to continue its previous work when the host resets Stop/Run again. This design offers an effective approach to functional testing of the system. The ERROR bit is reserved for reporting error conditions in the co-processor system. Notice that the Data lines are reduced to 7, rather than 32, which greatly reduces the pin count of the chip incorporating the CtrlM.

Figure A.5: Functional diagram of the Control Module. The 7-bit local data bus loads the Control/Threshold Register; the 128-bit Address Shift Register, the clock counter, and the Controller FSM generate Thoff, the group-select signals, Psel0-Psel3, and CnShift.
Figure A.6: State transition flow of the Controller FSM. Six states in two phases: from the initial state (clock generator inhibited), Clock1 starts an update cycle and the RAM read (state 1); ThoffState is checked at Clock2 and steers the FSM either through states 2-4, which complete the read and the calculation and write NextState, or through states 5-6, which abort the cycle, with the ASR and CNSR shifted and the CvgDt counter decremented as each neuron is passed.

The Address Shift Register (ASR) is shifted synchronously with the CNSR so as to supply the appropriate address for reading a row of the connection matrix. The heart of the CtrlM is the Controller FSM block. It synchronizes the operation of each functional block in the system. Its state transition flow is shown in Figure A.6. Its design depends on the choice of pipeline strategy, so section 4.6 is dedicated to a further discussion of it.

  Pin name    Dir  Pins  To              From    Function description
  Data        I/O  7     LcBus           LcBus   Loading Control Register
  R/w         I/O  1     LcBus           LcBus   Reading/Writing
  Power       -    1     -               -       Supplying +5 V power
  Ground      -    1     -               -       Ground of power
  Stop/Run    I    1     -               Interf  Stopping/Running control
  GClock      I    1     -               ClkGen  Supplying general clock
  Cs          I    2     -               LcBus   Selecting Control/Threshold Register
  ThoffState  I    1     -               NNM     Current neuron with Thoff
  Error       O    1     -               -       Reporting errors
  Address     O    7     LcBus           -       Addressing RAMs
  Psel0-3     O    4     NNM             -       Addressing Pipeline Registers
  Thoff       O    7     AComp           -       Supplying Thoff value
  16 Groups   O    1     NNM             -       Choosing 16 groups
  08 Groups   O    1     NNM             -       Choosing 8 groups
  04 Groups   O    1     NNM             -       Choosing 4 groups
  Nor/Sgl     O    1     ClkGen & CvgDt  -       Selecting running mode
  CnShift     O    1     NNM & CvgDt     -       Shifting CNSR one bit; CvgDt decremented by one
  Uncal       O    1     NNM             -       Stopping calculating

Table A.3: Pin assignment for the Control Module chip (40 pins in total)

The CtrlM can be integrated on a single chip with 40 pins using the CMOS3 processing technology. Its pins are listed in Table A.3, with all 40 pins assigned.
A.5 Controller FSM Incorporating Pipeline Structure

To speed up convergence, the co-processor system incorporates a two-stage pipeline. The first is a prefetch stage: the connection matrix row associated with the neuron to be updated in the next cycle is read into the Pipeline Registers in the NNM. The second stage can be viewed as an execution stage; it includes the calculation of the sums, Sum_i Tki*Vi, done by the NNM, and the operation of the update rule performed by the Adder/Comparator. In the first design considered, the neurons are evaluated sequentially one by one, regardless of whether a neuron has ThoFF or ThoN. The advantage of this method is that the system has a regular performance cycle and is easier to design. However, if only neurons with threshold ThoFF are evaluated, the system convergence speed can be increased greatly. The Dilution Coding Hebb's rule guarantees pattern stability by choosing two thresholds, ThoFF and ThoN, for the neurons. ThoN is set high enough that no neuron's input sum can exceed it. Thus neurons with ThoN will never change their states, and only neurons with ThoFF need to be considered for updating. Usually neurons with ThoFF are much fewer than those with ThoN in diluted patterns. Using Mok's estimate for the average convergence time:

    t_converge < 3 x N x 115 ns,    (A.1)

where N denotes the number of neurons, I estimate it as

    t_converge < 3 x (G + F) x 115 ns,  with G + F << N,    (A.2)

where G denotes the number of groups and F the number of neurons which need to be evaluated to correct errors in the probe vector.

The above change implies that the CtrlM has to look ahead to decide which neuron needs to be evaluated, and read the related row of the connection matrix into the Pipeline Register.
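The two estimates can be compared directly. This is a sketch of equ.(A.1) and equ.(A.2) with example values; the sample figures of 16 groups and 4 probe errors are illustrative, not from the thesis:

```python
# Hedged sketch of the convergence-time estimates: equ.(A.1) evaluates
# every neuron; equ.(A.2) evaluates only the G group neurons plus the
# F erroneous ones with threshold ThoFF (115 ns cycle, per Mok).
CYCLE_NS = 115

def t_converge_all(n_neurons):            # equ.(A.1)
    return 3 * n_neurons * CYCLE_NS

def t_converge_diluted(groups, errors):   # equ.(A.2), G + F << N
    return 3 * (groups + errors) * CYCLE_NS

# 128 neurons in 16 groups with, say, 4 probe errors:
assert t_converge_all(128) == 44160          # ns
assert t_converge_diluted(16, 4) == 6900     # ns, roughly 6x faster
```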
Unfortunately this "look ahead" is impossible because the threshold appropriate to the next neuron is not known until the current neuron is updated according to NextState By combining the two methods above, the third method is optimal. The prefetch stage is performed normally as is done in the first method. At the same time, ThoffState is generated within a very short time (between Clockl and Clock2 in Figure A.6). If ThoffState is high, that is, the current neuron has ThoFFi the ongoing cycle is allowed to complete, otherwise, a new cycle is started and checking ThoffState continues. The design of the Controller FSM incorporates the third pipeline method. Shown in Figure A.6, there are six states and two phases. In phase 1, including states 1, 2, and 6, Clockl starts an update cycle and begins reading RAM (statel) . The time between Clockl and Clock2 is the delay for ThoffState to become available. If ThoffState is low, the FSM returns' back to statel via state6, otherwise, it goes into state3 in phase 2 through state2. In phase 2, including states 3, 4, and 5, reading the RAM and updating neurons are simultaneously carried out through state3 and state4. Whenever a ThoffState appears low, the FSM goes back to statel via state5. This strategy of pipelining takes the advantages of first two methods and is the one which should be adopted. Now the average convergence time can be estimated as tconverge < 3 x (G + F) x 115ns + 3 x i V x {Clock! - Clockl) (A.3) Where Clock2 — Clockl C o O co E QL O 0.6 -• 0.4 - j O c CD D- °-2 CD g=16,m=10,T=8 g=8,m=10,T=3 g=4,m=10,T=2 -1 1 1 r-5 10 Hamming Distance 15 Figure A.7: Performance of the proposed network with Dilution Hebb's rule bits until the required Hamming distance is achieved. Neurons are updated in round-robin fashion. Groups of 32, 16 and 8 neurons were simulated with a load of 10 patterns. From Figure 4.8, for all the cases, the frequency of convergence is unity at zero Hamming distance. 
This is a direct result of patterns being unconditionally stable in the proposed network with the Dilution Hebb's rule. The convergence property of the proposed network can be adjusted by varying the threshold ThoFF. The results shown in Figure A.7 were obtained with thresholds set to values that maximize attractivity while maintaining a convergence probability above 95% at a Hamming distance of 1. If the probes are expected to be relatively error-free, a high threshold is desirable in order to increase capacity within a small region about the patterns. Low attractivity is of little concern if the probes are very unlikely to fall outside the small region. On the other hand, if the network must operate on highly corrupted vectors, a lower threshold is desirable in order to increase attractivity and obtain useful results for high-Hamming-distance probes, even though the convergence probability at low Hamming distance is somewhat reduced. In general, the threshold ThoFF should be set according to the application at hand. To describe the effect of the grouping chosen, the dilution factor of a b => 2^b decoder is defined as D = 2^b / b, the ratio of diluted bits to information bits. The dilution factor affects the available bits for storing information, and further affects the memory load. In general, a large dilution factor (fewer groups) can express fewer available information bits but can have a higher memory load, and vice versa [10].

A.9 Expandability

One of the advantages of the proposed network is its expandability in size by increasing the number of neurons. In the first version of the implementation, as described in the previous sections, the simplest form, one NNM with 128 neurons, is proposed; nevertheless, expandability has been considered in designing the system's architecture, as shown in Figure 4.1. One NNM with 128 neurons has been designed as a fundamental element.
Like bit-slice chips, multiple NNMs can be cascaded to increase the number of neurons and thus form a larger network. In theory, hundreds or more NNMs can be present in a system; in practice, however, expandability is limited by implementation complexity. In a second version of the implementation, up to 8 NNMs can be cascaded to form a network with 1024 neurons. As shown in Figures A.3 and A.4, the signals Active0 through Active7 from the CpCt are designed to enable/disable transmission gates that pass the signals CurrState, NextState, ThoffState, CnShift, and CpShift in each NNM. At any time, only one NNM is selected as active and supplies those signals to the appropriate chips; thus those signals from multiple NNMs can be line-ORed when more than one NNM is present in the system. Each NNM supplies the AComp with a 4-bit intermediate sum, ISum, with a maximum value of 15. The AComp first sums at most 8 such inputs, giving a maximum total of 120, then compares the total sum with the Th_OFF supplied by the CtrlM, and finally generates NextState according to the update rule. The adders within the AComp can be built from conventional adders in a tree structure whose depth grows as the logarithm of the number of NNMs present in the network. With 8 NNMs, there are three levels of adders. The adders can be enabled or inhibited according to the number of NNMs. How many NNMs are configured in the system is determined by the 3-bit signal Option, which is preset through the Configuration Switches. The CvgDt incorporates a binary counter with a modulus of N, the number of neurons. The value of N varies with the number of NNMs, so Option is also input to the CvgDt to determine the modulus N. The maximum value of N is 1024, matching 8 NNMs. Little change is required in the design of the CtrlM.
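Behaviourally, the expanded AComp stage can be sketched as below; the function name and the exact comparison rule (fire when the total reaches Th_OFF) are assumptions for illustration:

```python
# Behavioural sketch of the expanded AComp: up to 8 NNMs each contribute a
# 4-bit ISum (0..15), a tree of pairwise adders combines them in
# log2(#NNMs) levels, and the total is compared against Th_OFF to produce
# NextState. The threshold rule shown here is an assumption.
def acomp(isums, th_off):
    assert 1 <= len(isums) <= 8 and all(0 <= s <= 15 for s in isums)
    level, depth = list(isums), 0
    while len(level) > 1:                          # one level of pairwise adders
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        depth += 1
    total = level[0]                               # at most 8 * 15 = 120
    next_state = 1 if total >= th_off else 0       # assumed update rule
    return total, depth, next_state
```

With 8 inputs the tree has three adder levels, matching the design described above.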
The only expandable feature in the CtrlM is Th_OFF, whose size of 7 bits gives a range from 0 to 127, above the maximum total sum of 120 for the 8-NNM system. The size of the RAM increases quadratically with the number of NNMs. Each RAM Module is associated with an NNM through an individual local data bus 32 bits wide. Within a single system cycle, 128 bits of data from the RAM Modules can be delivered to their associated NNMs in four parts. This arrangement leaves the system cycle unchanged (115 ns, as Mok estimated [10]). The depth increment of the RAM Modules is achieved by choosing suitable RAM chips. The CpCt consists of a binary counter that determines the active NNM, which contains the neuron being updated. The counter's modulus is the number of NNMs, and Option states how many NNMs are present in the system. The signal CpShift from the active NNM, generated when the CNSR cycles back to 0 from 127, increments the CpCt counter by one, causing the next NNM to become active. The outputs of the CpCt, Active0 to Active7, are encoded to address eight RAM blocks of 128 words each. The individual word within each block is addressed by the 7 address lines from the CtrlM. Finally, the Tri-State Bus Transceiver in the INTERFACE is enlarged to allow the host to access the expanded system. No change in the design of the Interface Controller is needed for expandability. Naturally, the expanded network has a longer system convergence time. As shown in equ. (A.3), a 256-neuron network (two NNMs) with 32 groups, and assuming F=16, is expected to converge in 28.08 us. When the maximum of 8 NNMs are cascaded in the system, the maximum of 128 groups is chosen, and G=64 is assumed, the system convergence time is 1.15 ms from equ. (A.3), and 13.31 ms from equ. (A.1).
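As a sanity check on these figures, equ. (A.3) can be evaluated directly. The Clock2 - Clock1 interval is not stated numerically in this section; the 15 ns used below is an assumed value chosen because it reproduces the quoted 28.08 us for the two-NNM case:

```python
# Estimate of average convergence time from equ. (A.3):
#   t_converge <= 3*(G + F)*115 ns + 3*N*(Clock2 - Clock1)
# dclk (Clock2 - Clock1) of 15 ns is an assumption that matches the
# quoted 28.08 us for G=32 groups, F=16, N=256 neurons (two NNMs).
def t_converge(G, F, N, cycle=115e-9, dclk=15e-9):
    return 3 * (G + F) * cycle + 3 * N * dclk

print(t_converge(32, 16, 256))   # two-NNM system, in seconds
```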
From the discussions above, the proposed network can be expanded to a network size that would be impossible to construct using analogue techniques, which cannot take advantage of an external memory structure. With such expandability, very large networks come within practical reach.

Appendix B

Summary of Notations

In deriving the Delta Rule and the standard BP algorithm, a set of notations has been introduced. These notations are summarized here for convenient reference.

I = the number of input neurons.
J = the number of hidden neurons.
K = the number of output neurons.
P = the number of training patterns.
y = net input to a neuron (hidden or output).
O = the set of outputs generated by the network output neurons.
D = the set of desired outputs.
x_i^p = the ith element of input pattern p.
h_j^p = the output of hidden neuron j corresponding to input pattern p.
w_ji = weight from the ith input neuron to the jth hidden neuron.
w_kj = weight from the jth hidden neuron to the kth output neuron.
w_s = the weight being updated.
η = learning rate used in the standard BP algorithm.
α = momentum.
λ = learning rate used in the GGBP and LGBP algorithms.

Bibliography

[1] J. Denker, "Neural network models of learning and adaptation," Physica 22D, pp. 216-232, 1986.
[2] H. Graf et al., "VLSI implementation of a neural network model," IEEE Computer, pp. 41-49, 1988.
[3] A. Moopenn et al., "Electronic implementation of associative memory based on neural network models," IEEE Trans. Syst. Man Cybern., Vol. SMC-17, No. 2, pp. 325-331, 1987.
[4] M. Cottrell, "Stability and Attractivity in Associative Memory Networks," Biol. Cybern., Vol. 58, pp. 129-139, 1988.
[5] L. Personnaz, "Collective computational properties of neural networks: New learning mechanisms," Physical Review A, Vol. 34, No. 5, Nov. 1986.
[6] M. Verleysen, "Neural Networks for High-Storage Content-Addressable Memory: VLSI Circuit and Learning Algorithm," IEEE Journal of Solid-State Circuits, Vol. 24, No.
3, June 1989.
[7] Lasdon, Leon S., Optimization Theory for Large Systems, London: Collier-MacMillan, 1970.
[8] Memories Data Book, 1991-92 Edition, Fujitsu Microelectronics Inc., Santa Clara.
[9] Signetics Microprocessor Data Manual, 1992, Signetics Corporation, printed in USA.
[10] W. Mok, "VLSI Implementable Associative Memory Based on Neural Network Architectures," M.A.Sc. Thesis, University of British Columbia, April 1989.
[11] H.P. Graf and L.D. Jackel, "VLSI Implementation of a Neural Network Memory with Several Hundreds of Neurons," Proc. Conf. Neural Networks for Computing, Snowbird, Utah, American Inst. of Physics, 1986.
[12] M. Sivilotti et al., "A novel associative memory implemented using collective computation," Chapel Hill Conference on Very Large Scale Integration, pp. 329-342, 1985.
[13] H.P. Graf, "A CMOS Implementation of a Neural Network Module," Advanced Research in VLSI, Cambridge, MA: MIT Press, 1987.
[14] LOGIC DATA BOOK, Volume 2, 1992, National Semiconductor Corporation.
[15] M. Sivilotti, "VLSI Architectures for Implementation of Neural Networks,"
[16] A. Chandna, "Opto-Programmable Neural Networks: An Initial Study," Proc. of Canadian Conference on VLSI, October 1989.
[17] Guide to the Integrated Circuit Implementation Services of the CMC, GICIS Version 4:0, April 1991, Canadian Microelectronics Corporation.
[18] Akl, S.G., Parallel Sorting Algorithms, Vol. 12 in Werner Rheinboldt (ed.), Notes and Reports in Computer Science and Applied Mathematics, Orlando: Academic Press, 1985.
[19] Albert, A., Regression and the Moore-Penrose Pseudoinverse, Academic Press: New York, 1972.
[20] Aleksander, I. (ed.), Neural Computing Architectures: The Design of Brain-like Machines, The MIT Press: Cambridge, Massachusetts, 1989.
[21] Aleksander, I., Thomas, W.V., and Bowden, P.A., "WISARD, A Radical Step Forward in Image Recognition," Sensor Review, Vol. 4, No. 3, pp. 120-124, 1984.
[22] Anderson, J.A., "A Simple Neural Network Generating An Interactive Memory," Mathematical Biosciences, Vol. 14, pp. 197-220, 1972.
[23] Arbib, M.A., Brains, Machines, and Mathematics, 1987.
[24] Parten, C.R., R.M. Pap, and C. Thomas, "Neurocontrol Applied to Telerobotics for the Space Shuttle," in Proc. of the Intl. Neural Network Conf., July 9-13, 1990, Paris.
[25] Barto, A.G., R.S. Sutton, and C.W. Anderson, "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems," IEEE Trans. on Systems, Man, and Cybernetics, Vol. SMC-13, No. 5, Sept./Oct. 1983.
[26] Baum, E.B., J. Moody, and F. Wilczek, "Internal Representations for Associative Memory," NSF-ITP-86-138, Institute for Theoretical Physics, University of California, Santa Barbara, California, 1986.
[27] Carpenter, G.A. and S. Grossberg, "The ART of Adaptive Pattern Recognition by A Self-Organizing Neural Network," IEEE Computer, March 1988, pp. 77-88.
[28] Chandra, A.K., L. Stockmeyer, and U. Vishkin, "Constant Depth Reducibility," SIAM J. Comput., Vol. 13, pp. 423-439, 1984.
[29] Cooper, L. and D. Steinberg, Introduction to Methods of Optimization, Saunders: Philadelphia, 1970.
[30] Durbin, R. and D.E. Rumelhart, "Product Units: A Computationally Powerful and Biologically Plausible Extension to Back-propagation Networks," Neural Computation, Vol. 1, 1989, pp. 133-142.
[31] Eberhart, Russell C., "Standardization of Neural Network Terminology," Neural Networks, Vol. 1, No. 2, June 1990, pp. 244-245.
[32] Elman, J.L., "Learning the Hidden Structure of Speech," Institute for Cognitive Science, University of California at San Diego, ICS Report 8701, Feb. 1987.
[33] Feldman, J.A. and D.H. Ballard, "Connectionist Models and Their Properties," Cognitive Science, Vol. 6, pp. 205-254, 1982.
[34] Fox, G.C. and P.C. Messina, "Advanced Computer Architectures," IEEE Computers Magazine, pp. 67-74, 1989.
[35] Fukushima, K. and S.
Miyake, "Neocognitron: A New Algorithm For Pattern Recognition Tolerant of Deformations And Shifts In Position," Pattern Recogni-tion, Vol. 15, No. 6, pp. 455-469, 1982. [36] Furst M., J.B. Saxe and M. Sipser, "Parity, Circuits and the Polynomial-time Hierarchy," Proc. 22nd IEEE Symposium on Foundations of Computer Science, 1981, pp. 260-270. [37] Gallager, R.G., Information Theory and Reliable Communication, John Wiley & Sons, New York, 1968. [38] Gevins, A.S. and N.H. Morgan, "Applications of Neural-Network (NN) Signal Pro-cessing in Brain Research," IEEE Trans, on Acoustic Signal Speach Processing, Vol. 36, No. 7, July, 1988. [39] Matthews, M.B., E.S. Moschytz, "Neural Network Nonlinear Adaptive Filtering Using the Exteneded Kalman Filter Algorithm," in Proc. of the Intl. Neural Net-work Conf., July 9-13, 19 90, Paris. [40] Goles Eric and Servet Martinez, Neural and Automata Networks: Dynamical Be-haviour and Applications, Kluwear Academic Publishers: Boston, 1990. Bibliography 106 [41] Grossberg, S., The Adaptive Brain I: Cognition, Learning, Reinforcement, and Rhythm, and The Adaptive Brain II: Vision, Speech, Language, and Motor Control, Elsevier/North-Holland, Amsterdam, 1986. [42] Hagiwara, M., "Accelerated Back Propagation Using Unlearning Based on Hebb Rule," in Proc. of IJCNN'90, 1-617-620, Jan., 1990. [43] Hebb, D.O., The Organization of Behavior, John Wiley h Sons: New York, 1949. [44] Hecht-Nielsen, R., "Theory of the Back-propagation Neural Networks," In Proc. of the Intl. Joint Conf. on Neural Networks, Washington, D.C., New York: IEEE Press, pp. 1593-606, 1989. [45] Hecht-Nielsen, R., "Counter-Propagation Networks," Proc. of IEEE First Interna-tional Conference on Neural Networks, New York, 1987. [46] Hopfield, J.J., "Neural Networks and Physical Systems with Emergent Collective Computational Abilities," Proc. Natl. Acad. Sci. USA, Vol. 79, 2554-2558, April 1982. 
[47] Hopfield, J.J., "Neurons with Graded Response Have Collective Computational Properties Like Those of Two-State Neurons," Proc. Natl. Acad. Sci. USA, Vol. 81, pp. 3088-3092, May 1984.
[48] Hopfield, J.J. and D.W. Tank, "Computing with Neural Circuits: A Model," Science, Vol. 232, pp. 625-633, August 1986.
[49] Hopfield, J.J. and D.W. Tank, "Simple 'Neural' Optimization Networks: An A/D Converter, Signal Decision Circuit, and a Linear Programming Circuit," IEEE Trans. on Circuits and Systems, Vol. CAS-33, No. 5, pp. 533-541, May 1986.
[50] Hopfield, J.J. and D.W. Tank, "Neural Computation of Decisions in Optimization Problems," Biol. Cybern., Vol. 52, pp. 141-152, 1985.
[51] Irie, B. and S. Miyake, "Capabilities of Three-layered Perceptrons," in Proc. of IJCNN'89, pp. I-641-648, June 1989.
[52] Izui, Y. and A. Pentland, "Speed Up Back Propagation," in Proc. of IJCNN'90, pp. I-639-642, Jan. 1990.
[53] Judd, J.S., "Learning in Networks Is Hard," Proc. of the First Intl. Conf. on Neural Networks, pp. 685-692, IEEE, San Diego, California, June 1987.
[54] Judd, J.S., Neural Network Design and the Complexity of Learning, The MIT Press, Cambridge, Massachusetts, 1990.
[55] Kandel, E.R. and J.H. Schwartz, Principles of Neural Science, Elsevier, New York, 1985.
[56] Klassen, M.S. and Y.H. Pao, "Characteristics of the Functional-link Net: A Higher Order Delta Rule Net," Proc. of IEEE Second Annual International Conference on Neural Networks, San Diego, CA, June 1988.
[57] Kohonen, T., "Correlation Matrix Memories," IEEE Trans. on Computers, C-21, pp. 353-359, 1972.
[58] Kosko, B., "Adaptive Bidirectional Associative Memories," Applied Optics, Vol. 26, pp. 4947-4960, 1987.
[59] Lewis, P.M. II and C.L. Coates, Threshold Logic, John Wiley & Sons: New York, 1967.
[60] Lippmann, R.P., "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, pp. 4-22, April 1987.
[61] Lippmann, R.P., B. Gold, and M.L.
Malpass, "A Comparison of Hamming and Hopfield Neural Nets for Pattern Classification," MIT Lincoln Laboratory Technical Report, TR-769, 1987. [62] Lippmann, R.P., "Pattern Classification Using Neural Networks," IEEE Commu-nications Magazine, Vol. 27, No. 11, pp. 47-64, Nov., 1989. [63] McCulloch, W.S. and W. Pit ts , "A Logical Calculus of the Ideas Immanent in Nervous Activity," Bulletin of Mathematical Biophysics, 5, pp. 115-133, 1943. [64] Mead, C.A., Analog VLSI and Neural Systems, Addison-Wesley Publishing Com-pany, 1989. [65] Minsky, M. and S. Papert, Perceptron: An Introduction to Computational Geome-try, Expanded Edition, The MIT Press, 1988. [66] Mota-oka, T. and M. Kitsuregawa, The Fifth Generation Computer: The Japanese Challenge, John Wiley & Sons, 1985. [67] Owens, A.J. and D.L. Filkin, "Efficient Training of the Back-propagation Network by Solving a system of Stiff Ordinary Differential Equations," in Proc. of IJCNN'89, pp. 77-381-386, June, 1989. [68] Pao, Y.H., Adaptive Pattern Recognition and Neural Networks, Addison-Wesley Publishing Company, 1989. Bibliography 108 [69] Peeling, S.M., R.K. Moor, and M.J. Tomlinson, "The Multi-Layer Perceptron as a Tool for Speech Pattern Processing Research," in Proc. IoA Autumn Conf. on Speech and Hearing, 1986. [70] Posch, T.E., "Models of the Generation and Processing of Signals by Nerve Cells: A Categorically Indexed Abridged Bibliography," USCEE Report 290, August 1968. [71] Rosenblatt, R., Principles of Neurodynamics, New York, Spartan Books, 1959. [72] Rosenblatt, R., "The Perceptron: A Probabilistic Model for Information Storage and Organization In The Brain," Psychological Review 65, pp. 386-408, 1957. [73] Rumelhart, D.E., G.E. Hinton, and R.J. Williams, "Learning Internal Representa-tions by Error Propagation," in D.E. Rumelhart & J.L. McClelland (Eds.), Paral-lel Distributed Processing: Explorations in the Micro structure of Cognition. Vol. 1: Fundations, MIT Press, 1986. [74] Sawai, H., A. Waibel, P. 
Haffner, M. Miyatake, and K. Shikano, "Parallelism, Hierarchy, Scaling in Time-delay Neural Networks for Spotting Japanese Phonemes/CV-syllables," in Proc. of IJCNN'89, pp. II-81-88, June 1989.
[75] Siu, Kai-Yeung and Jehoshua Bruck, "Neural Computation of Arithmetic Functions," Proceedings of the IEEE, Vol. 78, No. 10, Oct. 1990, pp. 1669-1675.
[76] Takeda, M. and J.W. Goodman, "Neural Networks for Computation: Number Representations and Programming Complexity," Applied Optics, Vol. 25, No. 18, pp. 3033-3046, 15 Sept. 1986.
[77] Walsh, G.R., Methods of Optimization, John Wiley & Sons: New York, 1975.
[78] Wasserman, P.D., Neural Computing: Theory and Practice, Van Nostrand Reinhold: New York, 1989.
[79] Widrow, B. and M.E. Hoff, "Adaptive Switching Circuits," 1960 IRE WESCON Conv. Record, Part 4, pp. 96-104, August 1960.
[80] Widrow, B. and R. Winter, "Neural Nets for Adaptive Filtering and Adaptive Pattern Recognition," IEEE Computer Magazine, pp. 25-39, March 1988.
[81] Widrow, B. and S.D. Stearns, Adaptive Signal Processing, Prentice-Hall: New Jersey, 1985.
[82] Fahlman, Scott E., "An Empirical Study of Learning Speed in Back-Propagation Networks," CMU Technical Report, CMU-CS-162, Sep. 1988.
[83] Jacobs, R.A., "Increased Rates of Convergence Through Learning Rate Adaptation," Neural Networks, Vol. 1, 1988.
[84] Rezgui, A. et al., "The Effect of the Slope of the Activation Function on the Back Propagation Algorithm," Proc. Int. Joint Conf. Neural Networks, Vol. 1, Jan. 1990.
[85] Lee, Youngjik, Sang-Hoon Oh, and Myung W. Kim, "The Effect of Initial Weights on Premature Saturation in Back-Propagation Learning," IJCNN, Seattle, WA, July 8-12, 1991.
[86] Gallant, S.I., "Perceptron-Based Learning Algorithms," IEEE Trans. on Neural Networks, Vol. 1, No. 2, June 1990.
[87] Chen, C.L. and R.S. Nutter, "Improving the Training Speed of Three-layer Feedforward Neural Nets by Optimal Estimation of the Initial Weights," IJCNN, Singapore, Nov. 1991.
[88] J.
Yang, "The classification of acoustic emission signals bia artificial neural net-work," M.A.Sc. Thesis, University of British Columbia, August, 1991 [89] Fletcher Crowe, "The Impact of Neural Networks of the American Economy: 1990-2010," in Proc. Int. Neural Network Conf. (Paris), July 1990. [90] L.Wessels and E.Barnard, "Avoiding False Local Minima by Proper Initialization of Connections," IEEE Tran. on Neural Networks, Vol.3, No. 6, Nov. 1992. [91] L.Wessels and E.Barnard, "The Physical Correlates of Local Minima," in Proc. Int. Neural Network Conf. (Paris), July 1990. [92] Zaiyong Tang and Gary J.Koehler, "A Convergent Neural Network Learning Algo-rithm," Proc. Int. Joint Conf. Neural Networks, vol. 2, June 1992. [93] S.Satyanarayana, Y.Tsividis, and H.Graf, "A Reconfigurable VLSI Neural Net-work," IEEE Journal of S'olid-State Circuits, Vol.27, No.l , Jan. 1992. [94] M.Jabri and B.Flower, "Weight Perturbation: An Optimal Architecture and Learn-ing Technique for Analog VLSI Feedforward and Recurrent Multilayer Networks," IEEE Tran. on Neural Networks, Vol.3, No. 1, Jan. 1992. [95] L.Song, M.Elmasry and A.Vannelli, "A VLSI Implementation of Feedforward Neu-ral Networks with Learning," CCVLSI, Oct. 1992. [96] Bernard Widrow and Michael Lehr, "30 Years of Adaptive Neural Networks: Per-ceptron, Madaline, and Backpropagation," Proceedings of the IEEE, Vol.78, No.9, September 1990. Bibliography 110 [97] NeuralWare, Inc. "NeuralWorks Professional II/PLUS," 1991 [98] B.Furman, J.White, and A.Abidi, "CMOS analog implementation of back propa-gation algorithm," in Abstract of the First Annual INNS Meeting, Boston, MA, 1988 [99] B.Furman, A.Abidi, "An Analog CMOS Backward Error- Propagation LSI," in Proc. 22nd Asilomar Conf. Signals, Syst. Comput., Oct. 1988 [100] Ajay Patrikar and John Provence, "Learning by Local Variations," in Proc. of IJCNN'90, pp. 1-700-703, Jan. 1990. 
[101] Li, Gang and Hussein Alnuweiri, "The Acceleration of Back Propagation through Initial Weight Pre-Training with Delta Rule," in Proc. of ICNN'93, San Francisco, CA, March 28 - April 3, 1993.
[102] Nakayama, K., S. Inomata, and Y. Takeuchi, "A Digital Multilayer Neural Network with Limited Binary Expressions," in Proc. of IJCNN'90, San Diego, CA, June 1990.
[103] Delbruck, Tobi, "'Bump' Circuits for Computing Similarity and Dissimilarity of Analog Voltages," in Proc. of IJCNN'91, Seattle, WA, July 8-12, 1991.
[104] Reyneri, Leonardo M. and Enrica Filippi, "An Analysis on the Performance of Silicon Implementations of Backpropagation Algorithms for Artificial Neural Networks," IEEE Transactions on Computers, Vol. 40, No. 12, Dec. 1991.
[105] Holt, Jordan L. and Thomas E. Baker, "Back Propagation Simulations Using Limited Precision Calculations," in Proc. of IJCNN'91, Seattle, WA, July 8-12, 1991.
[106] Rosenblatt, F., "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, 65, pp. 386-408, 1957.
[107] Fahlman, Scott E., "An Empirical Study of Learning Speed in Back-Propagation Networks," CMU Technical Report, CMU-CS-162, Sep. 1988.
[108] Jacobs, R.A., "Increased Rates of Convergence Through Learning Rate Adaptation," Neural Networks, Vol. 1, 1988.
[109] Rezgui, A. et al., "The Effect of the Slope of the Activation Function on the Back Propagation Algorithm," Proc. Int. Joint Conf. Neural Networks, Vol. 1, Jan. 1990.
[110] Lee, Youngjik, Sang-Hoon Oh, and Myung W. Kim, "The Effect of Initial Weights on Premature Saturation in Back-Propagation Learning," Proc. Int. Joint Conf. Neural Networks, Seattle, WA, July 8-12, 1991.
[111] Gallant, S.I., "Perceptron-Based Learning Algorithms," IEEE Trans. on Neural Networks, Vol. 1, No. 2, June 1990.
[112] Chen, C.L. and R.S. Nutter, "Improving the Training Speed of Three-layer Feedforward Neural Nets by Optimal Estimation of the Initial Weights," Proc. Int. Joint Conf.
Neural Networks, Singapore, Nov. 1991.
[113] Lippmann, R.P., "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, pp. 4-22, April 1987.
[114] Arbib, Michael, Brains, Machines, and Mathematics, Springer-Verlag, Second Edition, 1987.