DISPERSIVE NEURAL NETWORKS FOR ADAPTIVE SIGNAL PROCESSING By Shawn P. Day B. A. Sc. (Electrical Engineering) University of British Columbia, 1988 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES ELECTRICAL ENGINEERING We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA February 1993 ©ShawnP.Dy,193 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Electrical Engineering The University of British Columbia 2075 Wesbrook Place Vancouver, Canada V6T 1W5 Date: *el 9, mg Abstract Back-propagation is a popular method for training feed-forward neural networks. This thesis extends the back-propagation technique to dispersive networks, which contain internal delay elements. Both the delays and the weights adapt to minimize the error at the output. Dispersive networks can perform many tasks, induding signal prediction, signal production, channel equalization, and spatio-temporal pattern recognition. For comparison with conventional techniques, a dispersive network was trained to predict future values of a chaotic signal using only its present value as an input. With adaptable delays, the network had less than half the prediction error of an identical network with fixed delays, and about one-quarter the error of a conventional back-propagation network. Moreover, a dispersive network can simultaneously adapt and predict, while a conventional network cannot. After training as a predictor, the network was placed in a signal production configuration, where it autonomously generated a close approximation to the training signal. The power spectrum of the network output was a good reproduction of the training signal spectrum. Networks with fixed time delays produced much less accurate power spectra, and conventional back-propagation networks were unstable, generating high-frequency oscillations. Dispersive networks also showed an improvement over conventional techniques in an adaptive channel equalization task, where the channel transfer function was nonlinear. The adaptable delays in the dispersive network allowed it to reach a lower error than other equalizers, including a conventional back-propagation network and an adaptive linear filter. However, the improved performance came at the expense of a longer training time. Dispersive networks can be implemented in serial or parallel form, using digital electronic circuitry. Unlike conventional back-propagation networks, they can operate in a fully pipelined fashion, leading to a higher signal throughput. Their implementation in analog hardware is a promising area for future research. ii Table of Contents Abstract^ ii List of Tables^ vii List of Figures^ viii Glossary^ xi Acknowledgement^ 1 xxiii 1 Introduction 1.1 Adaptive Systems ^ 1 1.1.1^Adaptive Prediction ^ 2 1.1.2^Adaptive System Identification ^ 3 1.1.3^Inverse Modeling or Equalization ^ 4 1.1.4^Adaptive Interference Canceling ^ 5 1.2 Artificial Neural Networks ^ 1.3 2 Thesis Outline ^ The Adaptive Linear Combiner 6 7 9 2.1 Structure ^ 9 2.2 Operation ^ 11 2.2.1^Error Measure ^ 11 2.2.2^Gradient of the Error Measure ^ 13 2.2.3^Gradient Search ^ 13 2.2.4^The LMS Algorithm ^ 16 iii ^ Table of Contents 3 4 2.2.5^Random Local Search ^ 17 2.3^Adaptive Linear Transversal Filter ^ 18 2.4^ALC Classifier ^ 18 2.5^Limitations of the ALC ^ 20 ALC Networks 22 3.1 Multidimensional Outputs ^ 22 3.2 Networks with Multiple Layers ^ 23 3.3 Back-Propagation ^ 25 Nonlinear Adaptive Network Filters 30 4.1 Continuous-Time LMS Adaptation ^ 30 4.2 Continuous-Time Back-Propagation ^ 31 4.3 The Mackey-Glass Signal Prediction Task ^ 33 4.4 Simulation Details ^ 33 4.4.1^Training Configuration ^ 34 4.4.2^Network Topology and Parameters ^ 35 4.4.3^Numerical Integration Technique ^ 37 4.4.4^Performance Measure ^ 37 Prediction Results ^ 38 4.5 5 iv Dispersive Networks 41 5.1 Unfolding in Time ^ 42 5.2 Temporal Back-Propagation ^ 44 5.2.1^Notation ^ 47 5.2.2^The Error, T, and its Functional Derivative ^ 49 5.2.3^Back-Propagated Error Signals ^ 52 5.3 Gradient Descent Adaptation ^ 54 5.4 Relationship To Other Work ^ 56 Table of Contents 5.5^Practical Considerations ^ 6 6.2 57 5.5.2^Approximate Delayed Error Gradient ^ 60 5.5.3^Choosing the Error Signal Ages ^ 63 7 65 Adaptive Signal Prediction ^ 65 6.1.1^Network Topology and Implementation Details ^ 65 6.1.2^Embedding Dimension ^ 67 6.1.3^Prediction Results ^ 68 6.1.4^True Adaptive Prediction ^ 69 6.1.5^Moment um ^ 73 Signal Production ^ 74 6.2.1^Learning Progression ^ 75 6.2.2^Performance ^ 80 6.3 Adaptive Channel Equalization ^ 6.4 57 5.5.1^Exact Delayed Error Gradient ^ Applications of Dispersive Networks 6.1 v 86 6.3.1^Network Topology and Parameters ^ 88 6.3.2^Equalization Results ^ 88 Signal Detection/Classification ^ 89 Implementations 91 7.1 Software Implementations ^ 91 7.2 Parallel Digital Implementations ^ 92 7.2.1^Operation ^ 95 7.2.2^Adaptable Weights and Time Delays ^ 96 7.2.3^Pipelining ^ 97 7.3 Analog Hardware Implementations ^ 103 Table of Contents^ 8 Conclusions and Suggestions for Further Research^ vi 105 8.1 Summary ^ 105 8.2 Contributions to the Literature ^ 106 8.3 Topics for Further Research ^ 107 8.3.1 Other Applications ^ 108 8.3.2 Increased Rates of Convergence ^ 109 8.3.3 Implementation Issues ^ 110 8.3.4 Other Network Models ^ 111 Bibliography^ 112 Appendices^ 123 A Momentum^ 123 B Estimating the Mean Value of a Signal ^ 126 C Linear Interpolation of Delayed Signal Derivatives^ 127 D Third-Order Low-Pass Butterworth Digital Filter^ 129 List of Tables ^ 3.1 On Line Back Propagation Equations 5.1 Exact Dispersive Network Learning Equations ^ - 5.2 Approximate Dispersive Network Learning Equations ^ 5.3 Storage Requirements ^ vii 29 60 61 63 List of Figures 1.1 Adaptive Processor ^ 3 1.2 Applications of Adaptive Processors 1 ^ 4 1.3 Applications of Adaptive Processors 2 ^ 5 2.1 Adaptive Linear Combiner (ALC) ^ 9 2.2 ALC Transfer Function ^ 10 2.3 Error Surface for a Single Weight ^ 14 2.4 Error Surface for Two Weights ^ 15 2.5 Adaptive Linear Transversal Filter ^ 18 2.6 ALC Classifier ^ 19 2.7 Classification Boundary for a Two-Input ALC Classifier ^ 20 3.1 Single Layer ALC Network ^ 23 3.2 Two Layer ALC Network ^ 24 3.3 Sigmoidal Nonlinearity ^ 25 3.4 Two-Layer Back-Propagation Network ^ 26 4.1 Mackey-Glass Signal ^ 34 4.2 Signal Prediction ^ 35 4.3 Transversal Network Topology ^ 36 4.4 Mackey-Glass Signal Prediction Learning Curves ^ 38 4.5 Mackey-Glass Prediction Results ^ 40 5.1 Unfolding in Time ^ 43 5.2 Conventional Back-Propagation Network Connection ^ 44 viii List of Figures^ ix 5.3 Dispersive Network Connection ^ 45 5.4 Conventional Back-Propagation Network Node ^ 46 5.5 Dispersive Network Node ^ 47 5.6 Section of a Dispersive Network ^ 48 5.7 Perturbation Function ^ 50 5.8 Direct Implementation of Dispersive Networks ^ 58 5.9 Efficient Implementation of Dispersive Networks ^ 59 5.10 Dispersive Network Optimized for Storage Efficiency ^ 62 6.1 Dispersive Network Architecture for Prediction ^ 66 6.2 Learning Curves for Chaotic Signal Prediction ^ 69 6.3 True Adaptive Prediction with Two Networks ^ 70 6.4 True Adaptive Prediction with a Dispersive Network ^ 71 6.5 Learning Curves for True Adaptive Prediction ^ 72 6.6 Learning Curves for Training With Momentum ^ 74 6.7 Network in a Signal Production Configuration ^ 75 6.8 Learning Progression of a Signal Production Network ^ 76 6.9 Mackey-Glass Chaotic Attractor ^ 77 6.10 Chaotic Attractors Near Start of Learning ^ 78 6.11 Chaotic Attractors Near End of Learning ^ 79 6.12 Chaotic Attractors for Different Networks ^ 81 6.13 Short Term Emulation of the Mackey-Glass Signal ^ 82 6.14 Long Term Emulation of the Mackey-Glass Signal ^ 83 6.15 Network-Generated Power Spectra ^ 84 6.16 Network-Generated Low-Frequency Power Spectra ^ 85 6.17 Channel Equalization ^ 86 6.18 Nonlinear Communication Channel ^ 87 6.19 Equalization Learning Curves ^ 89 List of Figures x 7.1 Block Diagram of a Digital Dispersive Network ^ 93 7.2 Unique and Shared Signals in a Dispersive Network ^ 94 7.3 Block Diagram of a Single Processing Element ^ 98 7.4 Processing Element Input Link ^ 99 7.5 Processing Element Output Link ^ 100 7.6 Single Processing Element with Latches ^ 101 7.7 Pseudocode for a Single Processing Element ^ 102 C.1 Stored Samples of a Continuous Signal ^ 127 D.1 Signal Flow Graph for a Low-Pass Butterworth Filter ^ 131 Glossary This list defines terms that are used throughout the thesis. They may have broader definitions in other contexts. Within each definition, words or phrases in italics refer to additional glossary entries. accumulator. A data storage element that maintains a running sum of its past inputs. The sum can be reset to zero when required. action potential. An electrical signal that propagates along an axon, transmitting informa- tion from one neuron to other neurons. ADALINE. See adaptive linear neuron. adaptive filter. A filter that can alter its transfer function, usually through performance feedback adaptation. adaptive interference canceler. An adaptive system designed to remove noise from a signal. adaptive linear combiner. An adaptive system whose output signal is a weighted sum of its inputs. The weights can adapt to alter its behavior. adaptive linear element. See adaptive linear combiner. adaptive linear filter. An adaptive linear combiner whose weights multiply delayed versions of its input signal, forming a transversal filter. An adaptive linear filter can respond to temporal information in its input signal. adaptive linear neuron. An adaptive linear combiner cascaded with a binary quantizer. An adaptive linear neuron can operate as a classifier for problems that are linearly separable. adaptive signal processing. The use of adaptive systems to perform filtering and other signal processing tasks. adaptive system. A system that can modify its internal parameters based on environmental feedback. age. The age of a node in a dispersive network is the round trip time required for a signal to propagate from the node to a network output, and for the resulting error signal to propagate back. ALC. See adaptive linear combiner. xi Glossary^ xii artificial neural network. A model or simulation of a neural network, often used for investi- gating its properties and gaining insight into biological computation. In broader contexts, including this thesis, an artificial neural network is any interconnected group of simple processing elements, without any necessary relationship to biological neural networks. See also network. associative memory. A storage device where stored patterns are retrieved based on their similarity to an input or prompting pattern. attractor. An equilibrium state or set of states to which a dynamical system converges. axon. A nerve fiber through which neurons transmit action potentials. back-propagation. A training technique for feed-forward networks that minimizes the output error using gradient descent. back-propagation network. A network that is trained using the back-propagation technique. band-limited signal. A signal whose energy is restricted to a finite frequency range. bandwidth. The difference in frequency between the highest and lowest frequency components of a signal. batch-mode adaptation. A method for training a network, where the internal parameters are updated only after all the training data has been presented. Each such presentation is called a batch or an epoch. Compare with on-line adaptation. bilinear transform. A frequency-warping function used in the design of digital filters. Its purpose is to eliminate inaccuracies caused by a finite sampling frequency. binary quantizer. An element with a continuous-valued input, whose output can take on one of two states. See sign= function. bipolar sigmoid. A sigmoidal function whose output can take on both positive and negative values. bus. One or more conductors or communication channels used for transmitting signals or power. Butterworth filter. A low-pass filter designed for maximally flat frequency response below its cutoff frequency. causal system. A system whose output at time t depends only on events that occur at or before time t. Compare with noncausal system. channel. A conductive wire, medium, or other pathway for transmitting a signal. chaotic attractor. An attractor with a nonintegral or fractal dimension. A chaotic attractor generates behavior that appears random, although the underlying dynamical system may be deterministic. Glossary^ xiii chaotic dynamical system. A dynamical system whose behavior is characterized by a chaotic attractor. chaotic signal. A signal generated by a chaotic dynamical system. Although it is determinis- tic, a chaotic signal may appear random. characteristic time. The inverse of the mean frequency of the power spectrum of a signal. The characteristic time approximates the period of a quasi-periodic signal. classification. The process of separating input patterns or signals into different categories. classifier. A device used to perform classification. CMOS. Complementary Metal-Oxide Semiconductor. A popular technology for building integrated circuits, employing two different types of MOS transistors. complex frequency. A description of the frequency of sinusoidal oscillations, using complex numbers rather than a sum of sin() and cos() functions. computational locality. In a multiprocessor system, a restriction on the distance that data must travel to reach the processor where a computation takes place. Good computational locality implies a short maximum distance. concurrent computation. Two or more computational processes occurring simultaneously. conjugate gradient descent. A method for increasing the speed of gradient descent. Unfortunately, conjugate gradient descent exhibits poor computational locality. connection. A unidirectional communication pathway between two nodes. If the nodes are models of neurons, then the connections usually model synapses. connectionism. The study of networks of simple processing elements, and their computational properties. credit assignment problem. The problem of assigning blame to hidden nodes in a network, to account for the output error. current-mode CMOS. A circuit employing CMOS technology, where computational quan- tities are represented by currents flowing through the electronic devices. cutoff frequency. The lowest or highest frequency component of a signal that is not blocked by a filter. cycle time. The time required for one iteration of an algorithm, or one time step in a digital computational circuit. cyclical connections. Sets of multiple connections within a network that form circular pathways. With cyclical connections, the output of a node can influence its own input, after propagating around the circular pathway. Glossary^ xiv delay buffer. A device for storing past values of a signal. depth. The number of layers in a multilayer network. dispersive network. A network with time delays in its internal connections. The time delays spread out signals in both time and space as they propagate through the network. dynamic range. The difference between the maximum and minimum output values of a node or processing element. dynamical system. A system whose behavior can be described by a set of differential equations. The past behavior of the system influences its future behavior. embedding dimension. The set of past signal samples used by a system to make predictions about future values of the signal. ensemble average. The average or expected sample function of a random process. The ensemble average can be estimated by observing multiple sample functions of the random process. If the random process is ergodic, the ensemble average can be estimated by a time average of a single sample function. epoch. One complete presentation of all the training data during batch-mode adaptation. equalization. The process of using a filter to implement the inverse transfer function of a distorting channel or medium. Placing the filter in series with the channel removes the distortion, retrieving the original, undistorted signal. ergodic process. A random process whose ensemble average can be estimated by a time average of a single sample function. An ergodic process is necessarily a stationary process. error signal. The difference between the output of a network and a target signal, as a function of time. error surface. The multidimensional surface defined by the expected value of an error signal as a function of the internal, adaptable parameters of the network. Euler integration. A simple iterative algorithm for solving systems of differential equations using a digital computer. Compare with Runge-Kutta integration. Fast Fourier Transform. An algorithm for decomposing a signal into a sum of sinusoidal oscillations with different frequencies and phase shifts. feed-forward network. A network without any cyclical connections. filter. A device that performs some type of processing on an input signal, and generates a corresponding output signal. A filter is usually described in terms of its transfer function. finite difference approximation. A general numerical method for solving systems of differential equations in discrete time steps. Glossary ^ xv fractal dimension. The nonintegral dimension of a chaotic attractor. Gabor polynomial predictive method. A method for predicting a signal by using polynomial functions of its past values. The method is typically unstable under iterative prediction. gain. The maximum steepness or slope of a sigmoidal transfer function. As the gain increases, the sigmoid approaches a step function. Gaussian noise. A random signal whose values are taken from a Gaussian (normal) distribution with a fixed mean (typically zero), and a fixed variance. gradient descent. A technique for minimizing the value of a function by adapting its parameters in small steps. Each step adjusts the parameters in a direction opposite to the gradient of the function with respect to the parameters, causing the value of the function to decrease. The procedure can become stuck in local minima of the function. gradient search. See gradient descent. Hamming data window. A weighting function used to preprocess signal data before performing a Fast Fourier Transform. The Hamming window minimizes errors caused by performing a Fourier Transform on a signal sample of finite length. hidden node. A node within a network that is neither an input node nor an output node. Hidden nodes receive inputs from other nodes in the network, and transmit their outputs to other nodes in the network. Compare with input node and output node. IIR filter. See infinite impulse response filter. infinite impulse response filter. A filter whose input is formed from both an input signal and delayed versions of its output. input node. A node that receives input or stimulus from its external environment. Compare with hidden node and output node. integrated circuit. An interconnected group of circuit elements, such as resistors and transistors, on a single small chip of semiconductor material. inverse modeling. The process of determining or implementing the inverse transfer function of a plant or filter. iterative prediction. The process of feeding the output of a signal predictor back to its input, to make additional predictions further into the future. Kalman filter. A type of adaptive filter that assumes a particular model of a plant or dynamical system. latch. A storage device for a single sample of a signal. Glossary^ xvi layer. A set of nodes in a network that are all the same distance, in terms of the number of connections, from the output nodes or the input nodes. leaky integrator. A summation device that forgets older information exponentially as new information arrives. learning rate. A small, usually constant parameter that determines the rate of gradient de- scent. Liapunov exponent. A number describing the rate of exponential convergence or divergence of nearby trajectories of a dynamical system. Negative Liapunov exponents indicate convergence, and positive exponents indicate divergence. A dynamical system with a chaotic attractor possesses at least one positive Liapunov exponent. linear separability. A characteristic of a set of labeled points in a multidimensional space. The points are linearly separable if a hyperplane exists that divides the space into two regions such that all the points in each region have the same label. link. A connection between two processing elements that may contain multiple weights and time delays. LMS algorithm. Least Mean Squares algorithm. A training technique for adaptive linear combiners. local minima. A value of a function that is less than any other in a neighborhood of its parameter space. logistic function. A curve with the equation k y=^ ea+bx where b is less than zero. The logistic function is an example of a sigmoidal transfer function. low-pass filter. A filter with an upper cutoff frequency. A low-pass filter blocks any signal component with a frequency higher than the cutoff frequency. Mackey-Glass signal. A chaotic signal, generated by a model of the regulation of blood formation, that is often used to test signal prediction techniques. manifold. A finite-dimensional subset of the phase space of a dynamical system. mean squared error. The average or expected value of the squared output error of a network. misadjustment. The additional mean squared error of an adaptive system, over an optimal nonadaptive system. model-free technique. A method for performing system identification, inverse modeling, or signal prediction without assuming any knowledge of the internal dynamics of the plant. Glossary^ xvii modular architecture. A computing structure composed of multiple, similar processing ele- ments. The structure can be easily expanded by adding more processing elements. momentum. A technique used to increase the convergence rate of the back-propagation train- ing technique and its variants. MOS. Metal-Oxide Semiconductor. A popular technology for implementing transistors in integrated circuits. multilayer network. A network with individual nodes arranged in a feed-forward, layered structure. A multilayer network is usually described in terms of the number of layers, and the number of nodes within each layer. multiplexing. A method for transmitting two or more different signals over a single channel. network. An interconnected group of simple processing elements or nodes. The nodes can often be divided into input nodes, hidden nodes, and output nodes. neural network. An interconnected group of neurons. The neural outputs propagate along axons, and the connections are formed by synapses. In broader contexts, including this thesis, the term is often used in place of artificial neural network. neuron. A living nerve cell in a biological organism that acts as a simple processing device. Neurons typically receive action potentials from other neurons, and generate action potentials of their own. node. A single processing element, often modeled after a neuron. noncausal system. A system whose output at time t depends on at least one event that occurs after time t. Noncausal systems cannot be physically implemented, but their mathematical properties can still be studied. Compare with causal system. nonlinear optimization. The determination of minimum or maximum values of a nonlinear function, sometimes subject to constraints. Nonlinear functions often have local minima that can make the optimization process difficult. nonlocal computation. A computation by a processing element in a network, involving data from other processing elements that are physically distant. nonstationary process. A random process whose statistics, such as its mean and variance, change over time. normalized rms error. The rms error, divided by the variance of the target signal. NRMSE. See normalized rms error. on-line adaptation. A method for training a network where the internal parameters are updated continuously, or after each integration step, rather than after all the training data has been presented. Compare with batch-mode adaptation. Glossary^ xviii output error. The instantaneous difference between the output of a network and a corresponding target signal. output node. A node that generates a signal that acts upon its external environment, or that can be observed by an external observer. Compare with input node and hidden node. parallel implementation. An implementation of a network or other computational device that uses multiple processing elements. The processing elements perform concurrent computations. Compare with serial implementation. pattern recognition. The process of classifying a spatial or temporal pattern into one category, from a set of possible categories. Perceptron. An early model of a neural network, pioneered by Frank Rosenblatt. Also, single adaptive linear neurons are often referred to as Perceptrons. Perceptron learning rule. A method for adapting the weights of an adaptive linear neuron or Perceptron. performance feedback adaptation. A method for adapting the internal parameters of an adaptive system by observing how it interacts with its environment. phase diagram. A plot of the trajectory of a dynamical system through phase space as it evolves in time. phase shift. The relative position in time between two similar signals. phase space. A space with coordinates representing the state of a dynamical system. The dynamical system follows a trajectory through phase space. pipelined operation. A method for achieving parallel operation between the layers of a computational network. Each layer in the network operates as a single stage, performing concurrent computation with the other network layers. Pipelining increases signal throughput. plant. A system that is to be controlled or modeled. It is affected by applied input signals, and produces corresponding output signals. positive definite matrix. A matrix that can be easily inverted because it possesses a nonzero determinant. positive semidefinite matrix. A matrix that cannot be easily inverted because its determinant is zero. power spectrum. When a signal is decomposed into a sum of sinusoidal oscillations (e.g., by a Fast Fourier Transform), the power in each of these oscillations, as a function of frequency, defines the power spectrum of the signal. procedural programming language. A computer programming language, such as Pascal or C, where emphasis is placed on the flow of control through blocks of code. Glossary^ xix processing element. A single computing device, usually part of a network of similar com- puting devices. pulse stream network. A network in which the information flowing through the connections is encoded as streams of pulses, rather than continuous levels. quantization. The round-off error caused by representing analog values using digital storage devices. quasi-periodic signal. A nonperiodic signal that has a strong frequency component in its power spectrum. A quasi-periodic signal can be partially described by its characteristic time. random access memory. A computer storage device where the retrieval time for each stored value is independent of its storage location. random local search. A method for finding the minimum or maximum value of a function. A random local search is usually slower than a gradient search. random process. A process that can be described by a random variable, whose probability distribution may be a function of time. random variable. A variable that may take on a range of values that cannot be predicted with certainty, but only described by a probability distribution. real estate. The amount of semiconductor surface area required to implement a device in integrated circuitry. recurrent network. A network with cyclical connections. rms error. Root-Mean-Squared error. The square root of the expected value of the squared output error. round-robin. A scheduling technique where each processing element is given access to a shared resource (such as a shared bus) in turn. All the processing elements have equal priority. Runge-Kutta integration. An iterative algorithm for solving systems of differential equa- tions using a digital computer. Runge-Kutta integration produces more accurate results than Euler integration, but requires more complex computations. sample function. The output, as a function of time, of a random process. Multiple instanti- ations of the same random process can generate different sample functions. sampling frequency. The reciprocal of the time interval between sample points when a con- tinuous-time signal is converted to a discrete-time representation. scalable architecture. A modular architecture with no inherent bound on the size of the structure that can be built. xx Glossary^ serial implementation. An implementation of a network or other computational device that uses a single processor, such as a conventional digital computer. The single processor must perform the calculations for each network node and connection in a step-by-step fashion, using a technique like round-robin scheduling. Compare with parallel implementation. shared signal. A signal that is broadcast to multiple processing elements. Compare with unique signal. sigmoidal transfer function. A continuous and monotonically increasing s-shaped function, often used as the transfer function of a node. Common sigmoidal functions are the logistic function and the hyperbolic tangent. signal detection. The reconstruction or classification of signals that may be corrupted by noise. signal emulation. See signal production. signal flow graph. A diagram emphasizing the flow of signals between summation nodes, often used to illustrate the internal structure of a filter. signal prediction. The process of predicting future values of a signal, usually from observations of its past values. signal production. The process of generating a signal with a desired set of properties, often with the goal of imitating another signal with an unknown generating mechanism. signum function. The function sgn(Y) +1 if y 0 ^—1 otherwise, which acts as a binary quantizer. sliding tap. A method for implementing a variable-length delay buffer. spatio-temporal pattern. A pattern or signal that contains both spatial and temporal information. A simple example is a television image. stationary process. A random process whose statistics, such as its mean and variance, are independent of time. steepest descent. See gradient descent. stochastic approximation. A method for finding the minimum or maximum value of a function by using gradient estimates at each time step. The gradient estimates permit an approximate form of gradient descent. synapse. An excitatory or inhibitory connection between two neurons. When the first neuron generates an action potential, an excitatory connection encourages the second neuron to generate its own action potential, while an inhibitory connection discourages it. xxi Glossary^ system identification. The process of determining or implementing the transfer function of an unknown plant. target signal. A signal representing the desired output of a network. Taylor series. The representation of a function as a power series of the form 00 1 Y —i ( x — a)nf(n)(a), n=0 n • where f(n)(a) is the n t h derivative of f at a. TDNN. See time delay neural network. temporal back-propagation. A generalization of the back-propagation technique for training dispersive networks. threshold. The input value of a linear or sigmoidal transfer function that causes its output value to be midway between its extreme values. When the input is less than the threshold, the output lies below the midpoint, and vice versa. threshold logic device. A summation device cascaded with a binary quantizer. Also known as an adaptive linear neuron. throughput. The rate of information flow through a data processing device or filter. time delay neural network. A feed-forward network with fixed-length time delays in the connections. topology. The spatial arrangement of nodes and connections within a network, and their pattern of connectivity. training. The process of adapting the internal parameters of a network to improve its performance on a particular set of data, known as the training set. The internal parameters most often adapted are the weights and the time delays, if present. training technique. A particular method or algorithm for training a network. trajectory. The path followed by a dynamical system through phase space. transfer function. The input-output relationship of a plant, a network, or a single node. transversal filter. A filter formed by feeding an input signal through a chain of delay elements, and then feeding the delayed signal values into a processing device, such as an adaptive linear combiner or a back propagation network. unfolding in time. The process of moving the internal time delays in a dispersive network to the input nodes. The resulting transversal filter may be much larger than the original dispersive network. Glossary^ xxii unique signal. A signal that flows to only a single processing element. Compare with shared signal. unity transfer function. A transfer function whose output is an exact copy of its input. universal approximator. A device that can implement any continuous transfer function to a specified degree of precision. VLSI. Very Large Scale Integration. A class of integrated circuits containing thousands of transistors on a single small chip of semiconductor material. weight. A multiplicative factor that scales the signal flowing through a connection within a network. weight perturbation. The process of adding random values to the weights in a network, and observing the effect on the output error. The observations can yield information that allows the weights to be modified and the output error to be reduced. Acknowledgement I would like to express my appreciation for the guidance and support provided by my supervisor, Dr. D. S. Camporese. I am also indebted to my colleague, Michael Davenport, for many interesting discussions, and for his mathematical insight. We collaborated on several papers in the field of neural networks. Several enlightening conversations with Dr. G. W. Hoffmann at U. B. C. and Eric Wan at Stanford University provided early inspiration for this work. I would also like to thank Dave Gagne, without whose technical support this research would have taken twice as long. The work presented here could not have been completed without the financial assistance provided by the Natural Sciences and Engineering Research Council of Canada, and the computer hardware provided by the Canadian Microelectronics Corporation. Finally, I would like to mention my sincere appreciation for the continuous support of my family and friends throughout the course of this investigation. Chapter 1 Introduction 1.1 Adaptive Systems Adaptive systems can alter their structure or behavior based on environmental feedback. They are prevalent in biology, where species adapt through the process of natural selection [21] and organisms adapt by learning new behaviors. In the latter case, the adaptive ability is often associated with intelligence. It is also possible for artificial, or man-made systems to adapt to their environments, and although they are not usually called intelligent, they often exhibit properties commonly attributed to intelligence. There are many areas of science and engineering where adaptive systems can provide useful solutions to difficult, possibly time-varying problems. Adaptive processors find widespread use in fields such as control, communications, speech recognition, imaging, sonar, radar, and seismology. Most of these systems automatically adjust internal structural or behavioral parameters to maximize some type of performance measure in a changing environment. Alternatively, they may search for optimal parameters in a stationary environment, where adaptation ceases after reaching an acceptable level of performance. In both cases, adaptation occurs through environmental contact. Most artificial adaptive systems can be characterized by one or more of the following properties (paraphrased from [134]): 1. They can automatically adapt to changing environments. 2. They can be trained to perform specific tasks in stationary environments. In a sense, the design of the system takes place automatically through the training process. 1 Chapter 1. Introduction^ 2 3. They can extrapolate their behavior to deal with new situations, after being trained on a finite and often small amount of training data. 4. They are often robust to internal defects, having an ability to adapt around them. 5. They can be viewed as nonlinear systems with time-varying parameters. 6. They are usually more complex and difficult to analyze than nonadaptive systems, but can offer better performance when the input signal is unknown or nonstationary. Much of the work in this area has involved a device known either as an adaptive linear combiner (ALC), or as an adaptive linear element. This device is the central topic of discussion in Chapter 2. The ALC is the primary component of an adaptive linear filter, which is probably the most common adaptive processor in use today. The operation of most adaptive processors can be modeled as shown in Figure 1.1. The processor receives input from its environment, causing it to generate an output signal that acts back upon the environment. This environmental interaction generates a performance feedback signal that the processor can use to increase its performance by modifying its internal parameters. Quite often, the environmental interaction is just a comparison with a desired output signal, and the goal of adaptation is to minimize the difference between the actual output and the desired output. The nature of the input signal and the environmental interaction determines the task that the processor performs. 1.1.1 Adaptive Prediction Figure 1.2(a) illustrates an adaptive prediction task. The input to the adaptive processor is a delayed version of the signal to be predicted, while the desired output is an undelayed version of the same signal. The performance feedback signal, which is the difference between the desired and actual outputs, can be thought of as an error signal that the adaptive processor tries to minimize. Through adaptation, the processor learns to predict the current value of the input signal, based only on observations of its past values. In some situations, this arrangement has 3 Chapter 1. Introduction^ Input Adaptive Processor ( Output Performance Feedback Environment Figure 1.1: Operation of a typical adaptive processor. The processor receives input from its environment, causing it to generate an output that acts back upon the environment. Based on the results of this environmental interaction, a performance feedback signal returns to the adaptive processor, allowing it to modify its internal parameters and thereby increase its performance. The environmental interaction usually involves a comparison of the actual output signal with a desired output signal, and the performance feedback is often just the difference between these two signals. limited utility because the processor "predicts" only the present value of the input signal, which is already available as the desired output. Chapters 5 and 6 describe adaptive processors based on dispersive networks, which can make true predictions about future values of the signal. 1.1.2 Adaptive System Identification Figure 1.2(b) shows the system identification arrangement. Here, an input signal feeds into both the adaptive processor and an unknown plant. As the adaptive processor minimizes the difference between its output and that of the plant, its transfer function will approach the plant transfer function. The plant is "identified" when this difference is sufficiently small. If the transfer function of the plant changes over time, the adaptive processor can track those changes and continue to provide an accurate plant model. 4 Chapter 1. Introduction^ Adaptive Processor Delay Output Performance Feedback Desired Output (a) / Output Adaptive Processor / Performance Feedback Plant^ Desired Output + (b) Figure 1.2: Some applications of adaptive processors: (a) adaptive signal prediction, and (b) system identification. Signal prediction can be viewed as a special case of system identification, where the "plant" is a perfect predictor. In both cases, the performance feedback signal can be thought of as an error signal that the adaptive processor tries to minimize. 1.1.3 Inverse Modeling or Equalization The inverse modeling task is similar to the identification task, except the adaptive processor is connected in series with the plant instead of in parallel. As shown in Figure 1.3(a), the processor learns to produce a delayed version of the plant input signal, and in the process acquires the inverse transfer function of the plant. Inverse models are useful in control applications, where proper input values must be determined from knowledge of the desired output. They can also be used to remove distortions caused by imperfect communication channels. Chapter 1. Introduction^ 5 Figure 1.3: Additional applications of adaptive processors: (a) inverse modeling or equalization, and (b) interference canceling. In inverse modeling, the adaptive processor learns to implement the inverse transfer function of the plant, with a fixed time delay. Its input is the plant output, which may be corrupted by additive noise. In interference canceling, unlike other applications, the output of the adaptive processor is not the desired output of the system. Instead, the processor tries to predict the signal plus noise, using a noise signal at its input that is correlated with the signal noise. However, since the correlated noise input contains no information about the signal itself, the "best" prediction that the processor can make predicts only the noise component of the signal. Subtracting this prediction from the noisy signal gives the clean signal as the system output. 1.1.4 Adaptive Interference Canceling Figure 1.3(b) shows an adaptive processor configured as an interference canceling device. The "desired output" of the adaptive processor is a signal with additive noise, while its input is a second, perhaps distorted, version of the noise. The input noise is correlated with the additive Chapter 1. Introduction^ 6 signal noise, but not with the signal itself. Reproducing the additive noise is the best that the processor can do, based on the noise at its input, since the noise and the signal are uncorrelated. Subtracting the output of the processor from the noisy signal gives an approximation of the noise-free signal. Given the correlated noise, an optimal estimate of the additive noise occurs when the adaptive processor minimizes the power in the output signal. 1.2 Artificial Neural Networks Mammalian brains contain living nerve cells called neurons, that can send electrical signals (action potentials) down communication lines known as axons. The action potentials eventually reach synapses, where the axons connect to the inputs of many other neurons through excitatory or inhibitory connections. A neuron emits an action potential when the difference between its excitatory and inhibitory inputs is sufficiently large. In this way, the behavior of one neuron can influence the behavior of many other neurons. Adaptation in the brain (i.e., learning) occurs, at least in part, by changes in the strengths of the synaptic connections [45]. There has been much hope that the operation of the brain can be explained by modeling the neurons as simple processing elements and modeling the synapses as weighted connections between them, leading to the study of "connectionism". Early work in this area was performed by McCulloch and Pitts [74], whose neuronal model behaved as a threshold logic device with a binary output. This device has since become known as the "McCulloch-Pitts" neuron, and its variants find use in many modern network models. Recently, researchers have found that a wide variety of neural models have interesting or useful computational properties, even if they bear little resemblance to biological brains [44]. Investigation of these "artificial" neural networks has gone far beyond modeling the brain, in search of new types of computational systems. Some of these systems can perform tasks like pattern recognition and pattern classification, or behave as associative memories. Associative memories store patterns that can be retrieved based on their similarity to an input, or prompting pattern. Chapter 1. Introduction^ 7 One class of adaptive networks that has received a great deal of attention is the "multilayer Perceptron", or back-propagation network. This network is closely related to the adaptive linear combiner, containing many interconnected ALCs with intervening nonlinear elements. It has shown great promise as a pattern classifier [104] and, more recently, as a universal function approximator [48]. The research presented here is an extension of the back-propagation network, with adaptable time delays added to all the internal connections. As with conventional back-propagation networks, these new networks can have either discrete or continuous-valued outputs. Discrete outputs are useful for spatio-temporal pattern classification, while continuous outputs are more appropriate for approximating continuous functions. The focus of this research is on networks with continuous-valued outputs. 1.3 Thesis Outline Chapter 2 is a review of adaptive linear filters, and their primary component, the adaptive linear combiner. It describes the adaptation algorithm in detail, and discusses some limitations of the device that arise from its linearity. Chapter 3 introduces networks of adaptive linear combiners, both with and without internal nonlinearities. With the nonlinearities, the networks are equivalent to back-propagation networks, and the back-propagation adaptation algorithm is discussed in detail. Chapter 4 develops continuous-time formulations of the adaptive algorithms for both linear combiners and back-propagation networks. It then presents a simple method for forming a nonlinear adaptive filter from a back-propagation network, and applies it to a signal prediction task. Chapter 5 is the main development of the thesis. It discusses nonlinear networks incorporating internal time delays, and develops a new continuous-time adaptation technique for adjusting the internal parameters, including the lengths of the time delays. Chapter 6 presents simulation results for three applications of the networks in Chapter 5. Chapter 1. Introduction^ 8 In order of presentation, they are adaptive signal prediction, signal production (system identification), and adaptive channel equalization (inverse modeling). Finally, Chapter 7 discusses the software implementation of the network filters used in the simulations, and possible hardware implementations in both digital and analog forms. Chapter 8 summarizes the thesis and provides suggestions for further research. Chapter 2 The Adaptive Linear Combiner Many adaptive signal processing systems make use of a simple device known as an adaptive linear combiner (ALC) [135]. This chapter reviews its structure and operation, for comparison with the nonlinear adaptive systems to be described later. The linearity of the device simplifies the analysis, permitting a better behavioral understanding than is currently possible for nonlinear systems. The ALC can use either the least mean squares (LMS) algorithm or the method of random local search (RLS) to update its internal parameters. 2.1 Structure Figure 2.1 shows the structure of an adaptive linear combiner with N inputs. The output, y(t), ► y(t) Figure 2.1: Adaptive linear combiner with N inputs. The output, y(t), is a sum of products of the form wi(t)xj(t), and a threshold, 0(t). The weights and threshold can adapt to alter the transfer function. 9 Chapter 2. The Adaptive Linear Combiner ^ 10 Figure 2.2: ALC Transfer Function for N = 2. The weights, wi(t), and threshold, 9(t), define an N-dimensional hyperplane representing the transfer function, which intersects the y-axis at —O. A desired transfer function also defines an N-dimensional surface that the ALC can approximate by shifting and tilting the hyperplane to obtain a best fit. Since this target surface is not necessarily a hyperplane, the optimal weights and threshold do not always provide an error-free implementation. is the sum of a threshold, 9(t), and products formed by multiplying the inputs, xj(t), by weights, wi(t): y(t) = —9(t) N E wi(oxi(t).^ j=1 (2.1) The output is positive when the sum of weighted inputs exceeds the threshold, and negative when it falls below. The weights and threshold can adapt to alter the transfer function, but they change on a much longer time scale than the inputs or y(t). For a particular set of weights and a fixed threshold, the transfer function defines an N-dimensional hyperplane, as shown in Figure 2.2. The hyperplane intersects the y-axis at —0, and the weights control its orientation. Note that 9(t) can be implemented as an additional weight, w o (t) = —9(t), and a fixed input, xo(t) = 1, so it can be incorporated into the sum in equation (2.1). The threshold will be treated in this manner without further mention or loss of generality. Chapter 2. The Adaptive Linear Combiner ^ 11 For a particular task, the optimal or desired output of the ALC can be specified for each possible set of input values. This desired behavior of the ALC also defines an N-dimensional surface, but not necessarily a hyperplane. This surface can be approximated by tilting and shifting the ALC hyperplane to obtain a good fit, according to some error measure. Explicit knowledge of the required transfer function permits direct computation of the optimal weight values, but such knowledge is often difficult to obtain. However, some forms of limited information about the desired output allow the correct weights to be found using techniques like performance feedback adaptation. With performance feedback adaptation, the weights and threshold change during operation, so the overall transfer function is nonlinear and the ALC exhibits more complex behavior than a linear combiner with fixed parameters. 2.2 Operation 2.2.1 Error Measure Performance feedback adaptation requires a target signal, y(t), that the actual output, y(t), must follow. Accordingly, an error signal, e(t), measures the instantaneous difference between them: e(t) = p(t)— y(t) = y(t) — W T (t)X(t).^(2.2) Here, vector notation replaces the summation notation for the ALC output in equation (2.1). W(t) is the column vector of weights with w o (t) = —OM, W (t) = [wo (t) w i (t)... w,(0 F,^(2.3) X(t) is the column vector of inputs with x o (t) = 1, X(t) = [x o (ox i ( t)... x N (07,^ (2.4) and T represents the matrix transpose operation. The error can be positive or negative, with no inherent bound on its magnitude. Squaring equation (2.2) gives the instantaneous squared 12 Chapter 2. The Adaptive Linear Combiner^ error, e 2 (t) = [0) — W T (t)X(t)] 2 = 9 2 (t) W T (t)X(t)X T (t)W(t) - 2Kt)X T (t)W(t),^(2.5) which is never negative, and is zero only when the ALC follows the target signal precisely. Under the assumption that the input and target signals are sample functions of random processes [90, 113], the instantaneous values, X(t) and "(t), can be viewed as random variables. Then e(t) is also a random variable, and the mean squared error, e(t), is the expected value of e 2 (t) at time t: MSE = C(t) = E[e 2 (t)] = E[y(t)] W T (t)E[X(t)X T (t)]W(t) — 2E[Xt)X T (t)]W(t). (2.6) Minimizing the MSE by changing the weights makes the ALC follow the target signal more accurately. If X(t) and '(t) are stationary, the expected values on the right hand side of equation (2.6) are constants representing the average power of the target signal, c = E[e(t)],^ (2.7) the input correlation matrix, x8(t)^xo(t)xi(t)^xo(t)xN(t) R = E[X(t)X T (t)] = E (t)x o (t)^4(0^ ...^ (t) xN(t) (2.8) x N (t)x o (t) xN(t)x i (t)^x2N(t) and a cross-correlation vector, P = E[Xt)X(t)] = ED(t)xo(t) (t)xi(t)^.(t)xN(t)1T.^(2.9) The matrix, R, is real and symmetric. It is usually positive definite, but occasionally it can be positive semidefinite [132]. With this notation, the MSE can be rewritten as C(t) = c W T (t)RW(t) — 2P T W(t).^(2.10) 13 Chapter 2. The Adaptive Linear Combiner^ Equation (2.10) is a quadratic function of W that defines a hyperparabolic error surface in N +1 dimensions, not to be confused with the hyperplane representing the ALC output. The error surface exists in the space defined by the weights of the ALC, while the output hyperplane exists in the space defined by the ALC inputs. The error surface is concave upward because the squared error is never negative, and it possesses a unique minimum, e ---= emir" at W = W. 2.2.2 Gradient of the Error Measure The gradient of the hyperparabolic MSE surface with respect to the weight vector is ^v i= , { aE as^as ^w,-Do --.^ a7 v,awl r. —2P + 2RW. (2.11) Knowledge of R and P allows W* to be found by solving for VE = 0 [134], yielding the optimal Wiener filter weights [72, 136]: W* = R -1 P. However, when R and P are not explicitly available, W* cannot be found with this analytical method. As an alternative, suppose the current weight vector represents the location of a ball resting on the MSE surface. If the weights adapt so that the ball follows the negative gradient, it will come to rest at the location of the minimum and the weights will have acquired their optimal values, W. Fortunately, as will be shown in Section 2.2.4, the gradient at a position represented by a particular weight vector can be estimated without knowledge of R or P. This method for finding the optimal weight values is usually called a gradient search. 2.2.3 Gradient Search Figure 2.3 shows a parabolic error surface for a single weight: E(t) = Ervin + A(w* — w(t)) 2 .^ (2.12) Chapter 2. The Adaptive Linear Combiner ^ 14 A gradient search occurs when the weight adapts in a direction opposite to the derivative of E with respect to w: w(t + 1) = w(t) — p c71v- = w(t)-F 20(w* — w(t)). ^(2.13) dE Here, A is a step-size parameter that controls the size of the weight adjustments, and t is an integer representing the time step. Di e( 0) emin w(0) ^ w* ^ Weight Figure 2.3: MSE surface for a single weight. Here, the "surface" is a parabola with its minimum at w*. The initial value of the weight is w(0), and the MSE decreases from g(0) to E rvin as the weight approaches the minimum. To gain some insight into the asymptotic behavior of the recursive expression in equation (2.13), it can be rewritten nonrecursively as [134] w(t) = w* — (1 — 2/LA) t (w* — w(0)).^ (2.14) The second term represents the error due to the initial condition where, generally, w(0) 0 w*. This term will decay to zero and the weight will converge to w* if 1 11 — 2/./Al< 1, i.e., 0 < p < A . (2.15) Substituting equation (2.14) into (2.12) gives 6(0 = Erni. + A(1 — 2pA) 2t (w* — w(0)) 2 , (2.16) Chapter 2. The Adaptive Linear Combiner ^ 15 Figure 2.4: Error surface for two weights. The rings represent contours of constant mean squared error, with the minimum value at the center of the figure. At the point (w i (t),w 2 (t)), (A) shows the direction toward the minimum, W*, and (B) shows the direction of steepest descent, which is always perpendicular to the MSE surface contours. Proceeding along the direction of steepest descent will eventually lead to the minimum, but not necessarily via the shortest (i.e., straight) path. showing that the MSE will converge to Emii, under the same conditions as (2.15). This example illustrates the method of gradient search in a single dimension. Generalization to multiple dimensions yields the method of steepest descent: W(t + 1) = W(t) — IIVE(t).^ (2.17) In the one-dimensional case, an update in the direction opposite to the gradient always moves the weight toward w*. However, Figure 2.4 shows that in multiple dimensions the direction of steepest descent may not be the same as the direction toward W*, and the weights may follow a curved path to the minimum. For a quadratic error surface, iteration of equation (2.17) converges to W* if [134] 0 < it < x 4 1, max ^ (2.18) where A max is the largest eigenvalue of R. This condition is analogous to (2.15), but it is not as useful as it might appear. Since R is often unknown, the best step size parameter, it, is ^ Chapter 2. The Adaptive Linear Combiner ^ 16 usually found by trial and error. In practice, can sometimes be bounded by the reciprocal of the average input signal power [132], if that quantity can be measured. 2.2.4 The LMS Algorithm Gradient search methods require knowledge of Ve(t) for the current weight vector, which can be expanded as v£(t) = VE[e 2 (t)] = E[Ve 2 (t)] = E[-2(p(t) — W T (t)X(t))X(t)] = —2E[e(t)X(t)] (2.19) using equations (2.5) and (2.6). Fortunately, the readily available signals e(t) and X(t) can substitute for knowledge of R and P in equation (2.11). Equation (2.19) represents an infinite ensemble average, but if the signals are sample functions of ergodic processes, the expected value can be estimated by taking a time average [90], giving a gradient estimate, Ve: ^Ve(t) = 2 , E e(t')X(t').^ t =t-N (2.20) The weights should be held constant while each batch of N samples generates a gradient estimate, after which they can be updated according to E 21i^t W(t 1) = W(t) — e(t) = W(t) — e(e)X(e), Nti=t-N resulting in a technique often called "batch-mode" adaptation. A fixed-length segment of the input signal can be iterated once for each batch, or new segments can be used as they arrive. As the size of the batch, N, decreases, a fixed-length segment will not be an adequate representation of the desired transfer function, so the second method must be applied. A smaller batch size yields a noisier gradient estimate, with more frequent updates to W. In the limiting case where N = 1, the instantaneous signal values estimate the gradient, o£(t) = —2e(t)X(t),^ (2.21) resulting in "on-line" adaptation. The gradient search formula becomes W(t + 1) = W(t) 2/te(t)X(t),^ (2.22) Chapter 2. The Adaptive Linear Combiner^ 17 and W approaches W* at a rate controlled by p. The noisy gradient estimate causes W(t) to wander around its optimal value, but it will converge in expectation with an expected variance proportional to p [70]. This wandering increases the MSE over that of an optimal Wiener filter by a quantity known as the misadjustment [134]. The misadjustment measures the cost of adaptability, increasing with p and the square of the number of weights. The noisy gradient estimate is not as bad as it first appears because many small weight updates become averaged over time, effectively providing a better estimate. This technique, the Widrow-Hoff least mean squares (LMS) stochastic gradient algorithm [131, 139], offers significant advantages over Wiener filters because it does not require explicit determination of the input and target signal statistics defined in equations (2.8) and (2.9). 2.2.5 Random Local Search The LMS algorithm is one possible method for selecting the weights of an adaptive linear combiner. The method of Random Local Search (RLS) also performs this task, but convergence to the optimal weight vector, W*, is usually slower. On each iteration, a unit vector, r, with a random orientation perturbs the weight vector, W(t + 1) = W(t) + pr, (2.23) and if the resulting output error decreases, the new weight vector is retained. If the error increases, however, the weight vector is restored to its previous value from time t. After many iterations, the weight vector will approach W*, but it will not converge to a fixed value. As in the LMS algorithm, random weight fluctuations continue due to the estimation of E, with their magnitude depending partly upon p. Large values of p lead to faster adaptation, but they also lead to correspondingly larger steady state error fluctuations. The random vector, r, often lies in the wrong direction, so the LMS algorithm discussed in the previous section usually provides faster convergence. Chapter 2. The Adaptive Linear Combiner ^ 18 y(t) Figure 2.5: Adaptive linear transversal filter. The device incorporates an ALC and several unit-delay elements, represented by z -1 , that provide delayed versions of the input signal. Multidimensional input signals require a set of weights and delays for each dimension. 2.3 Adaptive Linear Transversal Filter The adaptive linear transversal filter in Figure 2.5 incorporates an adaptive linear combiner and several unit-delay elements. The output is a weighted sum, y(t) = N-1 E wi (t)x(i - j),^ i.o (2.24) that responds to temporally distributed information in the input signal, making the device particularly well-suited to signal processing tasks. The diagram shows a scalar input signal — multidimensional signals require additional sets of delayed input lines. 2.4 ALC Classifier Figure 2.6 shows an adaptive linear combiner cascaded with a binary quantizer. This device was studied as an early neural model by Rosenblatt [102}, who used it as a component of his "Perceptron", and by Widrow and Hoff [131], who called it an adaptive linear neuron, or Chapter 2. The Adaptive Linear Combiner ^ 19 x1 (t) y(t) x2 (t) o(t) o. • XN(t) Figure 2.6: ALC classifier. Cascading the ALC with a binary quantizer forces the output, o(t), to be ±1, effectively partitioning the input space into two categories. The weights define a hyperplane that forms the boundary between the categories. ADALINE. Its output is ( N o(t) = sgn E wi xi (t) , (2.25) j =1 where sgn() is the signum function: sgn(y) = 1 +1 if y > 0 —1 otherwise. The quantizer output is —1 if the sum is negative, and +1 otherwise. The critical region where the sum equals zero defines an (N — 1)-dimensional separating hyperplane, shown as a dashed line in Figure 2.2. This separating hyperplane should not be confused with the transfer function hyperplane, which is N-dimensional. In Figure 2.2, N = 2, so the separating hyperplane is a line. The instantaneous input signal represents a pattern that the ADALINE classifies into one of two categories, depending on which side of the separating hyperplane it lies. Weight adaptation changes the position and orientation of this hyperplane, using a target output of +1 if the current pattern belongs to the first category, and —1 otherwise. The output of the ALC, not Chapter 2. The Adaptive Linear Combiner ^ 20 the quantizer, forms the error when using the LMS algorithm. A slightly different technique, known as the Perceptron Learning Rule [7, 79, 102], adapts the weights using the output of the quantizer. Although the ALC can classify any set of linearly separable input patterns, it is possible that the LMS algorithm will not converge to the correct solution. Mean squared error minimization may sacrifice the proper classification of one pattern for a better fit to the rest [10]. However, choosing a threshold LMS error criterion [112] guarantees convergence to the proper set of weights. The threshold LMS error is zero when Y(t) = 1 and y(t) > 1, or when y(t) = —1 and y(t) < —1. 2.5 Limitations of the ALC The ALC can implement a particular transfer function only if it can be described by an Ndimensional hyperplane in the input space. In other circumstances, the minimum value of the hyperparabolic MSE surface will be greater than zero, representing the smallest error that can be achieved by a linear device. x2^ x2 A 1 1^-1 -1 -1 -1 -1 (a) 1^ 1 -1 (b) Figure 2.7: Classification boundary for a two-input ALC classifier. With two inputs, the separating hyperplane is a line bisecting the 2-dimensional input space. In (a), the two categories represented by +1 and —1 can be perfectly separated, but in (b) there is no such line that can accomplish the task. Chapter 2. The Adaptive Linear Combiner ^ 21 When used as a classifier, the ADALINE cannot separate all possible categories [79]. A classification can be implemented only if a hyperplane exists that separates all patterns in the first category from those in the second — a condition known as linear separability. Figure 2.7(a) shows a case that is linearly separable, and Figure 2.7(b) shows a case that is not. In nonseparable cases, the LMS algorithm minimizes the mean squared error between the output and the target signal, and the Perceptron Learning Rule doesn't converge [79]. Unfortunately, the fraction of linearly separable classifications is small, decreasing very rapidly toward zero as the number of input patterns increases [18, 51]. Networks with multiple layers of ALC nodes can overcome the limitations of linear nonseparability. Their structure and capabilities are the subject of the next chapter. Chapter 3 ALC Networks The adaptive linear combiner in Chapter 2 can implement only those transfer functions with a scalar output that can be expressed as a linear combination of the input signal components. With a binary output quantizer, it can classify input patterns into only two linearly separable categories. This chapter describes networks of ALC nodes that overcome these limitations at the expense of more complex adaptation algorithms. 3.1 Multidimensional Outputs Multidimensional outputs present no difficulty, since multiple ALC units can produce the required signal vector, Y(t). Figure 3.1 illustrates the resulting architecture, known as a single layer network. The term "layer" refers to the set of weights between the inputs and the output nodes. Adaptation requires a target signal vector, Y(t), that the actual outputs must follow, and the Euclidean distance between them provides a measure of the instantaneous error: e(t) =^"(t) —^(t)ii•^ (3.1) If the input and target signals are sample functions of random processes, e(t) can be viewed as a random variable, and the mean squared error, E(t), is MSE =^(t) = E[e 2 (t)] = EMY(t)— Y(011 2 ] = E M M m=1 em(t)] = E E[eL(t)]. m=1 (3.2) Here, M is the number of ALC nodes, and e ni (t) is the output error for node m: em(t) =^— y m (t). 22 (3.3) 23 Chapter 3. ALC Networks^ x2 (t) • xN(t) Figure 3.1: Single layer ALC network. Each weight has subscripts to identify its input and output connections. Signals flow from left to right, and multiple ALC nodes produce the output signal vector, Y(t). The LMS algorithm adapts the weights for each node, independently of the other weights in the network. The right side of equation (3.2) shows that the MSE is a sum of M independent terms, each representing the MSE for one ALC and its associated weights. Therefore, the total MSE can be minimized by independently applying the LMS algorithm from Chapter 2 to each node. 3.2 Net works with Multiple Layers The network in Figure 3.1 performs the vector-matrix multiplication, Y(t) = WX(t),^ (3.4) where W is the weight matrix with a separate row for each ALC, and X(t) is the input signal. Cascading another layer, as in Figure 3.2, multiplies the output by a second weight matrix, W', Y(t) = W' (WX(t)), (3.5) forming a multilayer feed-forward network. The expression "feed-forward" means there are no direct or indirect cyclical connections, so the output at time t is completely determined by the input at time t. 24 Chapter 3. ALC Networks^ w' Yi(t) ,(t) x2(t) .^. • xN(t) ^ ym(t) Figure 3.2: Two layer ALC network. The first layer of weights multiplies the input vector by a weight matrix, W, and the next layer multiplies the result by a second matrix, W', producing the output vector, Y(t). As explained in the text, this network is equivalent to a single layer network of the type shown in Figure 3.1. However, cascading the internal nodes with nonlinearities, as in Figure 3.4, enables the network to implement a much wider range of transfer functions. For linear nodes, the product W'W in equation (3.5) is equivalent to a single matrix, A, due to the associative property of matrix multiplication: Y(t) = W'(WX(t)) = AX(t). Therefore, a multilayer network of adaptive linear combiners can implement the same transfer functions as a single layer network, and can classify only linearly separable input patterns. Cascading the internal, or "hidden" ALC nodes with nonlinearities such as binary quantizers invalidates this argument, and the linear separability restriction no longer applies. The resulting network can implement a much wider range of transfer functions, and it can perform any classification task with three layers of weights and enough hidden nodes [69]. However, the lack of a target signal for the hidden nodes makes it impossible to apply the LMS algorithm directly to weights that are not in the output layer. This difficulty, the credit assignment problem, arises because each hidden ALC must be assigned responsibility for an unknown portion of the MSE at the output. 25 Chapter 3. ALC Networks^ Figure 3.3: Sigmoidal nonlinearity. The sigmoid shown here is a scaled and offset logistic function, o(t) = tanh(ay(t)), but any continuous, differentiable, and monotonically increasing function is appropriate. The parameter a, which controls the gain (steepness) of the sigmoid, was unity for this plot. The flat areas at high and low values of y(t) allow the device to behave as a classifier. 3.3 Back Propagation - Although it is possible to train networks with internal binary quantizers using random search techniques [23], a more efficient algorithm known as "back-propagation" solves the credit assignment problem for networks with sigmoidal nonlinearities of the type shown in Figure 3.3 [105]. Figure 3.4 shows a two-layer sigmoidal network, but the discussion applies to networks with any number of layers, and to feed-forward networks without a layered structure. If the node and input indices increase from the inputs to the outputs, then the feed-forward requirement means that a weight, wig, from node j to node i, can exist only if i > j. The output of each node is oi(t) = fi(yi(t)), approximating a binary quantizer as the gain, or steepness of the sigmoid, fi(), increases. Often, all of the nodes have identical sigmoids, but back-propagation can still be applied when each node implements a unique sigmoid, 11 0. For a given task, networks employing continuous sigmoids usually require fewer hidden nodes than those employing binary quantizers [111, 109]. Networks of this type are universal approximators, and with proper output scaling they can implement any continuous transfer function using only a single hidden layer [36, 47, 48, 49, 54, 52, 53]. However, two hidden layers may be required 26 Chapter 3. ALC Networks^ N+1 ^ N+H+1 o(t) X2 (0 o(t) XN(t) o(t) Figure 3.4: Two layer back-propagation network. The circles represent ALC nodes cascaded with sigmoidal nonlinearities of the type shown in Figure 3.3. The node and input indices increase from top to bottom, and from left to right, so the feed-forward restriction means that a weight, can exist only if i > j. The back-propagation algorithm also applies to networks with additional layers, and to networks with weights that skip across layers. to approximate the discontinuous inverses of some continuous functions that arise in control applications [110], and networks with two or more hidden layers may be able to approximate certain functions with fewer nodes [15, 122]. As in equation (3.1), a target signal, 6(0, defines the instantaneous error, ^e(t)= 116(t) — 0(t)ii• ^ The corresponding mean squared error, C(t), is MSE = ((t) E[e 2 (t)] = E[116(t) — 0(011 2 ] = E (3.6) E^(3.7) m€0 where 0 is the set of all output nodes, indexed by m, and e m (t) is the error at output m: e m (t) = 15,7,(t) — o n,(t). (3.8) The weights cannot be found by solving for Ve = 0 because the nonlinearities preclude an analytical solution. However, the method of steepest descent requires the value of VE at only a single point in the space defined by the network weights. This value can be calculated easily 27 Chapter 3. ALC Networks^ because the differentiability of the sigmoids ensures the existence of partial derivatives of the MSE with respect to the weights. The gradient can then be used for weight adaptation: W(t 1) = W(t) — itiC7 (t).^ (3.9) Here, W(t) represents all the weights in the network, including those of the hidden nodes. For each weight, equation (3.9) gives: Oe wij(t + 1) = wij(t) uw•i (3.10) The partial derivative in (3.10) can be expanded as °e „(t)--L ,, (01= —E[A i moi ( t)],^(3.11) (t)] = Er n =^= °^E[e 2 (t)]^E aw ii^uw ij ayi^o wij aw^ ij where Ai(t) is Ai(t) = — 0e2 ayi (t). (3.12) For notational convenience in equation (3.11), oi(t) represents either a network input, or the output of a hidden node. If the signals are sample functions of ergodic processes, equation (3.11) can be estimated with a time average [90], de^1^E^ t (e)ai (e),^ Owii^N t,=t—N (3.13) and the weight updates occur after each batch interval of length N. For the on-line case (N = 1), the instantaneous signal values provide the estimate, ow i; = Ai(t)oi(t).^ (3.14) A slightly different approach, somewhere between batch mode and on-line adaptation, uses an exponentially weighted sum of previous signal values, aw ii = (1 — a) E t,=0 a ti Ai(t — t')oi(t — t'),^ (3.15) ^ 28 Chapter 3. ALC Networks ^ which can be expressed in terms of the estimate from the previous time step: if t = 0 0^ (t) =^ ^owij^(1 - a)Ai(t)oi(t)-1- alL(t — 1) if t > 0. awii (3.16) The momentum parameter, a, adds a kind of inertia to the weight vector, filtering out high frequency changes in the gradient [104]. It often permits a higher learning rate, jt , and may allow the network to bypass local minima in the MSE surface. The method reduces to the instantaneous on-line case, described by equation (3.14), when a is zero. Momentum can also be applied to batch-mode adaptation, averaging the gradient estimates in equation (3.13) from consecutive batches. Whatever the method, estimation of 0e/Owij requires only the output of node j and the Ai(t) term associated with node i. If node i is an output, it is easy to see that ows 5i 0e 2^0e2 Oo• Ai(t)^= -- -(t)- 1.(t) = 2(Oi(t)— oi(t))fl(yi(t)), ^(3.17) where fgyi) is the slope of the sigmoid. For other nodes, ^ae2 0„,,^0e2^ 0e2 0e2 (t) = —„(t).„^ wkioi(t) 00i kEc, 8Yk Ooi kEC, 'IA ^00i^ kECi "k^ kECi E E --(tr"E E — E Ak(t)Wki, (3.18) so that Ai(t) = ii(yi(t)) > Ak(t)wki.^ (3.19) kEC, The recursive expression for Ai(t) in equation (3.19) represents information flowing from the outputs toward the inputs, and is responsible for the term "back-propagation". The backward flow of information permits estimation of the partial derivatives using only locally available signals. That is, the estimate for node i requires data only from other nodes having direct connections to node i. In operation, an input signal propagates forward through all the layers until it reaches the outputs. From there, an error signal derived from the output and target signals propagates back toward the input nodes according to equation (3.19). This process repeats at each time step, and (3.13) or (3.14) estimates the expected value in (3.11) for batch 29 Chapter 3. ALC Networks^ Parameter Formula NM E wii moi ( t) i oi (t) fi(N(0) D i (t) 20j(t) — oi (o)m y,(0)^for j ti (yi(t)) wii(t + 1) E iECJ E0 w ii (t)Ai(t)^otherwise wii(t) + itAi(t)oi(t) Table 3.1: Back-propagation equations for on-line learning. The symbols follow the notation in the text. For notational convenience, oi(t) represents either an input signal, or the output of a node. The first two lines describe forward-propagating signals, the third describes backward-propagating signals, and the final line describes on-line weight adaptation. or on-line adaptation, respectively. Table 3.1 summarizes the network operation for on-line adaptation without momentum. Several researchers independently discovered back-propagation as a technique for nonlinear optimization [68, 92, 104, 105, 128]. Unlike the LMS algorithm, the shape of the error surface is very complex, often containing many local minima. However, most of these minima seem to produce near-optimal performance, so their existence is not a serious problem [29]. For cases where they do cause trouble, it can usually be overcome by adding more hidden nodes [104]. The precise behavior depends on the initial weight vector, which, if assigned small random values, usually results in good performance [29, 104]. Chapter 4 Nonlinear Adaptive Network Filters The back-propagation networks in Chapter 3 provide advantages over linear techniques for many signal processing applications. Passing the input signal through a series of unit delays creates a nonlinear filter analogous to the linear transversal filter in Figure 2.5. For expanded dynamic range, the output nodes can be simple linear combiners, while the hidden nodes retain their sigmoidal nonlinearities to preserve the benefits of multilayer networks. This chapter describes the application of back-propagation networks to a signal prediction task, where they outperform conventional predictive techniques. The results provide a baseline for comparison with the dispersive networks to be discussed in Chapter 5. Since the analysis of dispersive networks benefits from a continuous-time treatment, the next sections set up an appropriate framework by presenting continuous-time formulations of LMS adaptation and back-propagation. The resulting differential equations describe possible analog physical implementations of the devices, while the discrete-time formulations in previous chapters are more appropriate for digital implementations. Much of the previous work in this area has concentrated on digital implementations due to the availability of digital computers. 4.1 Continuous-Time LMS Adaptation A continuous-time version of LMS adaptation arises when t assumes real values, and the weights change according to the differential equation, d wj^= 2pE[e(t)xj(t)i. dt^Owi ^ (4.1) Equation (4.1) is analogous to the discrete-time version developed in Chapter 2. The expected value represents an ensemble average, which, for ergodic signals, can be estimated with the 30 Chapter 4. Nonlinear Adaptive Network Filters^ 31 integral, E[e(t)x At)]^itt T e(t i )x i(e)de ,^ (4.2) leading to batch-mode adaptation. Since the term "batch" implies a collection of discrete events, the intervals of length T are more commonly known as "epochs" during continuoustime adaptation. The weights change in a piece-wise linear fashion, with the rate of change assuming new values at times T, 2T , 3T, etc.: f 2" nT dw j (nT) =^e)x e( i(e)de dt^T in-1)T (4.3) Although the weights continue to change during the integration, the resulting error can be controlled by choosing a sufficiently small value for p, with longer epochs requiring smaller values. On-line adaptation occurs in the limit as T approaches zero, giving dw j^2pe(t)x j(t), dt (4.4) which also approximates equation (4.1) for sufficiently small values of^The slowly accumulating weight changes implicitly perform the averaging required to estimate the expected value. 4.2 Continuous Time Back Propagation - - The network structure for continuous-time back-propagation is identical to Figure 3.4, but all the signals are continuous functions of time. The mean squared error is analogous to equation (3.7): e(t) E[1145(t) — 0 (011 2 1= E[E 4, i (t)].^(4.5) mE0 A differential equation replaces the discrete weight-update formula in (3.9), giving dW = —INC(t),^ dt (4.6) Chapter 4. Nonlinear Adaptive Network Filters ^ 32 which can be written as dWij pE[Pi(t)oi(t)] lipij(t)^(4.7) dt^az% for each weight, following the derivation in Chapter 3. Here, pii(t) represents the negative partial derivative of E(t) with respect to wii, which is defined as an expected value over an ensemble. For ergodic signals, an estimate, Pii, of the expected value results from the integral, fiij = /LTT Ai(e)oi(e)de E[Ai(t)oi(t)].^(4.8) Batch-mode adaptation using equation (4.8) is analogous to equation (4.3) from the LMS version: ^dw^nT i-i (nT) = —^ ^dt^T ^ (n-1)T Ai(e)oi(e)de. (4.9) Again, the error due to the changing weights is small if a is sufficiently small. For on-line learning, a "leaky integrator" can provide an estimate of the partial derivative: d j 77— dt — pij Ai(t)oa(t). ^(4.10) The time constant, 77, is related to the discrete-time momentum parameter, a, as shown in Appendix A. Differentiating equation (4.7) with respect to time and substituting (4.10) allows the weight adaptation formula to be written as d2 wi^dwi itAi(t)oi(t).^ = dt 2" dt^ (4.11) In the limit where 7./ goes to zero, the leaky integrator follows Ai(t)ai(t) precisely, and the weight adaptation formula becomes dtaij = ttAi(t)oj(t). dt (4.12) Therefore, continuous-time on-line adaptation can occur with or without momentum, in a manner similar to the discrete-time case discussed in Chapter 3. Without momentum, the calculation of dwij/dt requires no explicit storage or integration, since the slowly changing weights implicitly estimate the required averages. With the complete formulation of the adaptation techniques, it is now possible to present some benchmark simulation results. Chapter 4. Nonlinear Adaptive Network Filters ^ 33 4.3 The Mackey Glass Signal Prediction Task - Accurate predictions about the future, based on observations of the past, are very useful in science and engineering (not to mention business and economics)! The usual technique of modeling the underlying process cannot always be applied. Often, the process is either completely unknown, or far too complex to permit a practical model, leaving only the observed behavior of the system for use in making predictions. The signal produced by integrating the Mackey-Glass delay-differential equation [71], dx(t) = — x(t r) )io, bx(t) al + (t_ dt 7. (4.13) provides a commonly used benchmark [33, 66, 80, 81, 82, 106, 107, 114] for comparing modelfree predictive techniques. It results from modeling the regulation of blood formation, and its properties have been studied extensively [32]. Although (4.13) gives an explicit form to the underlying process, only the signal, x(t), can be used in a model-free test. For comparison with other published results [32, 33, 66, 80, 82], values of r = 17, a = 0.2, and b = 0.1 were chosen, and the equation was integrated using a four-point Runge-Kutta method with a step size of 0.1 and a constant initial value of x = 0.8 for t < O. Discarding the results from the first 1000 integration steps allowed any start-up transients to die out. The resulting signal, shown in Figure 4.1, is quasi-periodic with a characteristic time of t ch :::-, 50. The characteristic time is defined as the inverse of the mean frequency of the power spectrum. As described in [14, 66, 114], the task is to predict the Mackey-Glass signal an interval, T = 6, into the future. Once trained, additional networks can be cascaded to provide further predictions at intervals T = 12,18, 24, ... This prediction task involves continuous, time-varying inputs and outputs, providing a good test of nonlinear filter performance. 4.4 Simulation Details Simulations of the type described here have many free parameters that must be assigned specific values. To some extent, the network topology, numerical integration method, error measure, Chapter 4. Nonlinear Adaptive Network Filters^ 34 1.4 1.2 1.0 g (4 t c l ,14 c.) \ 0.8 0.6 0.4 0.2 0.0 ^ 0 100^200^300^400^500 Time (0 Figure 4.1: Mackey-Glass signal with r = 17, a = 0.2, and b = 0.1. The Mackey-Glass delay-differential equation was integrated using a four-point Runge-Kutta method with a step size of 0.1, and a constant initial value of x = 0.8. Discarding data from the first 1000 integration steps allowed any initial transients to die out. The resulting signal is quasi-periodic with a characteristic time of t ch„,. ,:-.1 50. learning rate, and derivative estimation method all have an influence on the ultimate performance. 4.4.1 Training Configuration Figure 4.2 shows one possible method for training an adaptive network to predict a signal T time units into the future. The delay, T, at the input allows the undelayed signal, x(t), to be used as the target during training. Minimizing the error, e(t), by adapting the weights causes the network to "predict" the current value of the input signal based on its values up to time t —T. After training, the delay at the input can be removed, allowing the network to make true future predictions. However, without this delay, adaptation must stop due to the lack of a target signal. Chapter 4. Nonlinear Adaptive Network Filters ^ 35 Figure 4.2: Block diagram of a system for training a network to predict its input signal. A delayed version of the input signal feeds into the network so that the undelayed input can be used as the target signal. Minimizing the error, e(t), causes the network to learn to predict x(t) based on its values up to time t — T. The network output is a "prediction" of the current value of the input signal. For true prediction, the delay, T, must be removed. 4.4.2 Network Topology and Parameters Figure 4.3 shows the network topology, which is similar to that used in [66], where it performed well on a similar Mackey-Glass signal prediction task. There are two hidden layers, consisting of ten sigmoidal nodes each, and a single linear output node. The input signal passes through four delay elements to form a 4-dimensional "embedding space". The delay elements are each of length 6 time units, providing four network inputs of x(t — 6), x(t — 12), x(t — 18), and x(t — 24). A single connection joins each pair of nodes in adjacent layers, for a total of 150 adaptable weights. For the hidden nodes, each k 0 was a hyperbolic tangent: tanh(a[xi — O]), (4.14) a(1 — fi(xj))= gi(fi(xj)). (4.15) The hyperbolic tangent is a scaled and offset version of the commonly used logistic function [104], so that the output lies in the range (-1, 1) instead of (0,1). The bipolar nature of this function allows adaptation to occur near the negative "tail" of the sigmoid, where it Chapter 4. Nonlinear Adaptive Network Filters^ 36 Figure 4.3: Transversal network topology. A delayed input signal feeds into the network so that an undelayed version can be used as the target signal during adaptation. Further delays on the input provide the network with three additional signal samples: x(t — 12), x(t — 18), and x(t — 24). Minimizing the error causes the network to predict its input signal 6 time units into the future. During adaptation, however, the extra delay at the input allows the network to "predict" only the present value of the signal. could not occur with the logistic function because the output signal, oi(t), in equation (4.7) would be zero. All nodes had unity gain (a = 1), and adaptable thresholds, 0, were implemented as connections from an additional node with a constant output of 1, as discussed in Section 2.1. The initial weights were pseudorandomly chosen from a uniform distribution on the interval [-0.1,0.1], and a weight learning rate of p = 0.005 was selected, based on a few experimental simulation runs. Figure 4.1 shows that the Mackey-Glass signal is always positive, so the full range of the Chapter 4. Nonlinear Adaptive Network Filters^ 37 input nodes cannot be exploited. The bias weights will eventually correct the situation, but it can take a long time [12]. To encourage faster convergence, the mean value of the input signal should be removed [12, 66]. The mean can be estimated with a running average, using an integrator of the type described by equation (4.10) with a large time constant. Appendix B shows how a similar estimate can be obtained in a discrete-time software simulation. However, to speed up the simulations here, the mean was measured in advance and found to be approximately 0.92534. A preprocessing stage subtracted this constant value from the input signal before presenting it to the network. 4.4.3 Numerical Integration Technique The differential equations describing the network's behavior, including adaptation, represent a nonalgorithmic continuous-time process. Simulation on a digital computer requires conversion to an algorithmic form such as Euler integration, which uses a small, fixed-length time step. The Euler method is a good choice because of its light computational demands during each integration step, and its simplicity of coding. Smaller time steps gave virtually identical results, suggesting that more complex integration schemes would not provide any benefits. However, to train the network on data as close as possible to the "true" Mackey-Glass signal, equation (4.13) was integrated using a four-point Runge-Kutta technique. To obtain a good global fit to the Mackey-Glass data, adaptation took place with a continually evolving input signal, rather than a repeated, finite training segment. 4.4.4 Performance Measure As in [33], the normalized rms error, NRMSE = E[ 116(t) — 0(0112 ]1/2 4.16 (^) E[^— E[O(t)] 1 2 ]1/2^ was used to monitor the network's performance during training. The NRMSE is the rms error at the output, divided by the variance of the input signal. For the Mackey-Glass application, the output "vector", 0(t), has only a single dimension. An NRMSE of zero indicates perfect Chapter 4. Nonlinear Adaptive Network Filters ^ 38 prediction, and an NRMSE of unity means the prediction is no better than a constant predictor at E[Ii. (0]. 4.5 Prediction Results Figure 4.4 shows a typical learning curve for the back-propagation network, along with learning curves for three different linear transversal filters trained on the same task. The curves are 0.5 4 — Input Linear Combiner 0.4 - 25-Input Linear Combiner ..... .................................................................... I^• v s" **- 0.1 — 0.0 — 0 150-Input Linear Combiner Back-Propagation Network 6 Time (t x 10 5 ) ^ 7 ^ 8 ^ 10 Figure 4.4: Mackey-Glass signal prediction learning curves. The back-propagation network had an embedding dimension of 4, and reached an NRMSE of 0.058 at time t = 10 6 . For comparison, linear predictors with embedding dimensions of 4, 25, and 150 reached NRMSEs of only 0.428, 0.305, and 0.222, respectively, after the same length of time. A further increase of the linear filter embedding dimension to 300 resulted in a marginally lower NRMSE, but greatly increased the variance of the prediction error. not identical for different random starting weights, but the graph gives an indication of their relative performances. The first linear filter had four inputs, giving the same embedding space as the back-propagation network. For improved performance, a second linear filter had 25 delayed inputs, and a third had 150, giving it the same number of adaptable parameters as the Chapter 4. Nonlinear Adaptive Network Filters ^ 39 back-propagation network. The results clearly show the advantage of internal nonlinearities for this signal prediction task, even over the linear filter with 150 inputs. A further increase in the size of the linear filter to 300 inputs resulted in only a marginal improvement in error, but a large increase in noise. The noise likely arises because samples of the signal from 300 time units in the past are virtually uncorrelated with its present and future behavior, and because the misadjustment increases with the square of the number of weights [134]. The error in Figure 4.4 is a normalized version of the adaptive prediction error shown in Figure 4.3. True prediction requires removal of the T = 6 delay, thus halting adaptation and increasing the prediction error. Stopping the adaptation at time t = 10 6 allowed the network to be tested in this nonadaptive prediction configuration. Figure 4.5 plots the true prediction output from the two techniques, along with the actual Mackey-Glass signal. The NRMSE observed in this situation was 0.063 for the static network, and 0.36 for the linear filter with 150 inputs. The large increase in error when using the linear filter for true prediction suggests that it was adapting to very recent portions of the input signal, without learning its global structure very well. It may also be related to the misadjustment noise visible on the learning curve, which could cause the weights to be far from their optimal values when adaptation stopped [134]. On the other hand, the back-propagation network performed almost as well on the true prediction task as it did during adaptation, suggesting that it found an accurate global representation of the Mackey-Glass signal. The benefits that nonlinear networks provide over linear prediction [72] and the Gabor polynomial predictive method [37] were clearly demonstrated in [66]. There, a batch-mode adaptation technique, using a nonlocal conjugate gradient descent optimization procedure running on a Cray X-MP supercomputer, obtained near-optimal values for the weights in about an hour of CPU time. The results presented here required less than an hour on a Sun SPARCstation 2 using a local gradient descent technique, so the weights have not converged to their optimal values. Chapter 4. Nonlinear Adaptive Network Filters ^ 0.5 0.4 4 40 Mackey-Glass Linear Prediction Back-Propagation Network Prediction ., 0.3 I 0.2 0 -0.3-0.4 -0.5 150^175^200^225^250 Time (t) 275^300^325^350 Figure 4.5: Mackey-Glass prediction results. A back-propagation network and a linear prediction filter with 150 weights were trained up to time t = 10 6 , and then placed in nonadaptive, true prediction configurations. The graph shows the prediction output of each device, along with the actual Mackey-Glass signal for comparison. The plots begin at t = 150 to allow the delay buffers to fill with valid data, and the prediction time is T = 6. However, the intent is not to duplicate the results in [66], but to show that continuous-time on-line adaptation is practical, and to provide a baseline for comparison with the dispersive networks to be discussed in the next chapter. To update each weight, the nonlocal optimization procedure used in [66] requires data from the entire network, so it does not lend itself well to parallel physical implementations. However, one of the prime motivations for investigating network filters is that they can be implemented in parallel hardware, which will provide an enormous speed advantage over software simulations. Chapter 5 Dispersive Networks The nonlinear back-propagation filters in Chapter 4 are an extension of the linear transversal filters described in Chapter 2. In both cases, the output from a series of unit delays forms the input vector. A more powerful structure can be obtained by replacing each ALC in the back-propagation network with a linear transversal filter. The resulting network contains internal time delays, so its output can depend on temporal activation patterns in the hidden nodes. Features in the input signal disperse in time and space as they propagate along different pathways, influencing the output at many different points in the future. Unfortunately, dispersive networks are more difficult to train than their nondispersive counterparts. This chapter presents a new continuous-time technique for adapting the internal time delays, as well as the weights, in a dispersive network. Gradient descent with dispersive networks presents a difficult problem because the two signals forming the product in equation (3.11) are not simultaneously available. When oi(t) is present at time t, the corresponding Ai(t) requires a signal to propagate to the network outputs, and the resulting error signal to propagate back to node i. However, by this time a different value of oi will be present due to new inputs continually arriving. Correct operation requires the storage of oi(t), and its proper correlation with Ai(t) when Ai(t) becomes available. To make matters worse, the backward propagation of the delta signals requires proper correlation with stored values of f'(y(t)), as seen in equation (3.19). An algorithm presented in [4] solves this problem by including zero-delay feed-forward pathways between all pairs of connected nodes. Weight modifications have an immediate effect on 41 Chapter 5. Dispersive Networks^ 42 the output, so the instantaneous error can be minimized using a slightly altered form of backpropagation. However, the technique ignores the delayed effects of the weight changes on the output signal, and the total error may not be minimized. 5.1 Unfolding in Time One technique for minimizing the total error in a dispersive network is known as "unfolding in time". Figure 5.1 shows the process for a very simple two-layer network with two weights between each connected pair of nodes. Every weight requires three identifying subscripts because a pair of nodes can have multiple connections, each with a different delay. The unfolding process effectively moves all the internal delays to the input layer, yielding an equivalent filter in transversal form with more nodes and weights than the original. Each additional delay in layers other than the input layer generally leads to an additional copy of the network to the left of that point. This new subnetwork will have copies of the weights in the original network, but they must be constrained to maintain a single value. For example, in Figure 5.1(c), WBA1 and wam represent the same weight. When taking the partial derivative of the MSE with respect to this weight, the two components must be added and the resulting weight update must be applied to both copies. Since unfolding in time requires the construction of a network much larger than the original, the computational costs quickly become unmanageable. However, if the internal delays are integral multiples of unit delays, then there is usually a large amount of redundancy that can be exploited in the expansion, leading to a smaller unfolded network than would otherwise be the case. This is the approach usually taken when training a Time Delay Neural Network (TDNN) [123]. In addition to the problem of a potentially large unfolded network, the constraints on the weights require nonlocal communication to ensure that multiple weight copies retain the same value. While not a problem in software simulations, this type of nonlocal communication can create difficult practical problems in a parallel hardware implementation, where the overhead 43 Chapter 5. Dispersive Networks^ W BA1 ^ WCB1 A(t) Figure 5.1: Unfolding in Time. This technique separates the temporal structure from the spatial structure, resulting in a much larger network. The progression from (a) to (c) illustrates the process. Each delay in a hidden or output layer generally requires an additional copy of the entire network to the left of that delay. The result is a transversal filter incorporating a nondispersive network of the type described in Chapter 4, with multiple copies of the weights where necessary. These multiple weights must be constrained to have the same values if they will be transferred back to the dispersive network after training. 44 Chapter 5. Dispersive Networks^ related to connectivity is often a limiting factor. To overcome these limitations, a new technique known as temporal back-propagation operates on the original dispersive network using only local communication. 5.2 Temporal Back Propagation - The term "temporal back-propagation" comes from [125], which describes a discrete-time adaptation algorithm for networks with fixed, integer-valued internal delays. The technique developed here is more general, allowing continuous-time signals and adaptable, real-valued delays. However, it seems logical to extend the scope of the name to include this new technique, rather than trying to invent new terminology in an already overburdened nomenclature. Temporal unfolding is impractical with this new type of network because the adaptable, real-valued delays eliminate any redundancy in the unfolded structure. Before presenting a detailed derivation of temporal back-propagation, it will be helpful to describe its operation in a simplified diagrammatic form. Figure 5.2 shows a connection between two nodes in a conventional back-propagation network. The feed-forward signal flows from left w(t) w(t) dw dt Figure 5.2: A connection between two nodes in a conventional back-propagation network. The feed-forward signal flows from left to right along the top of the diagram, where multiplication with the weight occurs. The back-propagated error flows from right to left along a similar path at the bottom. The product of these two signals, along with a learning rate, p, controls the weight adaptation. Chapter 5. Dispersive Networks^ 45 Figure 5.3: A connection between two nodes in a dispersive network. The gray boxes representing time delays are the only difference between this diagram and Figure 5.2. As in the conventional back-propagation connection, the feed-forward signal flows from left to right along the top of the diagram, while the back-propagated error signal flows from right to left along the bottom. The delay, r, is in the feed-forward path before the multiplicative weight. Note the additional delay in the back-propagation path. Its length is chosen so that the sum of the feed-forward delay and the back-propagation delay is a constant, K, for all connections in a particular layer of the network. The two additional delays ensure that the signals remain properly correlated during the multiplications involving the back-propagated error. to right along the top of the diagram, while the back-propagated error signal flows from right to left along the bottom. Both signal paths contain the same multiplicative weight, w(t). The rate of change for this weight during adaptation is the product of the feed-forward signal, the back-propagated signal, and a learning rate, it. Figure 5.3 shows a similar connection for a dispersive network being trained with temporal back-propagation. Again, the feed-forward signal flows from left to right along the top, while the back-propagated error signal flows from right to left along the bottom. The feed-forward 46 Chapter 5. Dispersive Networks^ d/dx 4^ Figure 5.4: A conventional back-propagation node. The sum of the feed-forward signals from nodes in the preceding layer passes through a sigmoidal nonlinearity. In the back-propagation path, the sum of the error signals from nodes in the succeeding layer forms a product with the derivative of the feed-forward signal. delay, r, is in the feed-forward path before the multiplicative weight, and there is an additional delay in the back-propagation path. The length of this additional delay is chosen so that the sum of the feed-forward delay and the back-propagation delay is a constant for all connections in a given layer of the network. As a result, the time it takes for a signal to propagate from a particular node to the output layer, and for the resulting error signal to propagate back, is constant no matter what path the signal happens to follow. From this constant propagation time, it is obvious that the back-propagated signal will be "older" than the feed-forward signal by a fixed time delay, at which is the "age" of the error sig, nal at this connection. To maintain proper correlation during the weight update multiplication, the feed-forward signal must be delayed by at units of time before the multiplication can be performed. Finally, a delayed version of the changing weight must be used in the back-propagation path, so that its age matches the age of the back-propagating error signal. Figure 5.4 shows a conventional back-propagation network node. The sum of the input signals from the preceding layer on the left passes through a sigmoidal nonlinearity to form the output. In the back-propagation path, the derivative of the feed-forward signal with respect to the feed-forward sum forms a product with the sum of the error signals from nodes in the succeeding layer. 47 Chapter 5. Dispersive Networks^ d/dx Figure 5.5: A dispersive network node. This node is nearly identical to the conventional back-propagation node. However, since the back-propagating signal is older than the feed-forward signal, the latter must be delayed by an appropriate amount before the multiplication can be performed. Figure 5.5 is the corresponding diagram for a dispersive network node. Since the backpropagated error signal is older than the feed-forward signal, the derivative of the feed-forward signal must be delayed by an appropriate amount before the multiplication can be performed. Apart from this delay, there is no difference between a dispersive network node and a conventional back-propagation node. With this introductory explanation, the details of the mathematical derivation can now be presented. 5.2.1 Notation Figure 5.6 shows a section of a typical feed-forward dispersive network. Connections can skip across layers, and there can be more than one connection between two particular nodes. Every connection has both a weight and a time delay, but any of the time delays can be fixed at zero 48 Chapter 5. Dispersive Networks^ Figure 5.6: Section of a typical feed-forward dispersive network. Each connection has both a weight, Wijk, and a time delay, riik, that can adapt to minimize the error at the output. There can be multiple connections with different delays between any two nodes in different layers, and they can skip across layers like the connection between node j and node h. Features in the input signal spread out in time and space as they travel from left to right. Error signals propagate from right to left, but they are absent from the diagram for clarity. if desired. The following notation describes the network structure and behavior: wijk(t) = weight of kth connection from node j to node i at time t. rijk(t) = delay of kth connection from node j to node i at time t. oi(t) = Xijk(t) = either the output of node i, or a network input at time input to node i from the k th connection from node j at time t. yi(t) = total input to node i at time t. fi() = sigmoidal nonlinearity of node i. bni (t) = e m (t) = s m (t) = t. target signal for output node m. the error at output node m at time t. the squared error at output node m at time t. 49 Chapter 5. Dispersive Networks^ 0 = set of all output nodes. Ci = set of all nodes that node j feeds into. Aii = set of all connections between node j and node i. Nm i = the total number of signal paths, by any route, from node i to node m. zm ip = the total delay from node i to node m, via the pth route (p = 1... Nm i). The following relationships summarize the feed-forward behavior of the network: xiik(t) = wijk(t)oAt — riik(t)), yi(t) = Oi(t) = (5.1) EE x iik (t) = EE wiik (t)oi (t — 7- 3k (t)), (5.2) fi(yi(t)). (5.3) j^k^i^k For some values of j, oi(t) represents a network input. The function, fi(), is usually sigmoidal, but must be continuous and monotonically increasing. Adaptation requires an input signal vector, and a corresponding target signal vector, OW. The observed states of the output units constitute an output vector, 0(t), and the Euclidean distance, s(t) = IIO(t) — o(t)112 = E (O m (t) — Om(t))2 = E em(t) = E s m (t),^(5.4) m€0^m€0^mE0 measures the instantaneous squared error between the output signal and the target signal, where e m (t) = bm (t) — om (t). 5.2.2 The Error, W, and its Functional Derivative In a standard back-propagation network, a short-lived change to a weight results in a single, immediate, short-lived change to the output node activations. With time delayed connections, a short-lived change to a weight results in many future, short-lived changes at the outputs. Adaptation requires knowledge of how the error at the output, integrated over all time, depends 50 Chapter 5. Dispersive Networks^ , Csiik(^E ) E T 0 ^► time td t Figure 5.7: The perturbation function criik(t,td,h,c), centered at time td, with magnitude h and duration c. In the derivation of the error gradient formulae, this function is added to wijk(t) and rijk(t) while considering the limit where c and h go to zero. on the weights and time delays. Accordingly, the total squared error, IP, is 00 =^s(t)dt =f c.°^E s m (t)dt.^ (5.5) =-00 mE0 Although 'F may be infinite, it has finite functional derivatives with respect to time-localized changes in the weights and time delays. Changes in can be described by these functional derivatives with respect to wijk(td) and riik(td) at a particular moment in time, td. Figure 5.7 shows a perturbation function, cr ijk, centered at time td, with magnitude h and duration c: ij k(t td, h if (td — c/2) < t < (td c/2) h, c) =^ 0 otherwise (5.6) Perturbing either wijk(t) or riik(t) by criik(t,td,h,e) causes measurable changes in^that are approximately proportional to ch when c and h are small. The functional derivatives, 6AF bwijk(td) ST 6rijk(td) lim {T[tviik(t)-F aijk(t,td,h, E)] — [wijk(t)] Eh lim^T[rijk(t)^Cri jk(t td, h, E)] [rip, (t)] h,e-.0^ Eh (5.7) (5.8) describe how changes to the weights and time delays influence the total squared error. They will be used as error gradients in Section 5.3. 51 Chapter 5. Dispersive Networks^ The dependence of '1 on Wijk involves many nonlinear terms, so T[wijk(t) o-ijk(t, td, h, €)] can be calculated only in the limit of vanishingly small h and E. Expanding equation (5.7) using equation (5.5) gives ^bT „ = lim j ll mEO t 1 = -oo 1 ^E ^,c--■^E 0 Wijk(td)^h A particular weight, Woo {.9 m (e)[wijk (t)+ criik (t,t d ,h,c)] — s m (e)[wijk(t)il del . (5.9) influences each output error, s m (t'), through p E 1... Nm i different signal paths, each with a total delay of z m ip . Thus, s m (t') can be viewed as a function not of a single parameter, Wijk, but of several parameters, wijk(t' — zm ip ), where p acts as a parameter label. According to this interpretation, the notation c9s m (t') Owijk(t' — zmip) represents the dependence of the output error at time t' on the weight at time (t' — zm ip ), through path p. The term s m (e)[wijk(t) Crijk(tItd,h,c)] can now be expanded as a Taylor series, to first order in o: s m (e)[wijk(t)-1- criik(t,td,h,c))= Nm i Sm(t 1 ){Wijk(t)]+ 0me) + 0 (Ot k). E^Zmip,td,h,E)^ OWijk8 (tt (- Zmip) (5.10) P= 1 Substituting equation (5.10) into equation (5.9) and neglecting higher order terms yields bT^1 ^„ = (5WijWd) oo E Crijk(t f Zmip,td,h,E)^8sm(t') Ls^ h,c-■0^mEO ^- co p=i Owijk(t' — zmip) ^cle} . (5.11) In each of these integrals, the independent variable t' can be replaced by (t" zm ip ), giving W^1 lim bwijk (t d )^h,f —o){ Eh =^i; v, 1, _ Nm i 00 = mE0 p=1 t {— E h,e-o0 Eh mEO Nmi^ !VI 4 d+E12) 0 Srn (t jo „ ) 1, 14^ 98m (t^ZmiP) " 00^owi zmip) dt" " Owijk(t") P=1 (td —E12) 52 Chapter 5. Dispersive Networks^ This replacement doesn't alter the value of the expression because the limits on the integrals are infinite. In the limit where c goes to zero, the dependence on t" disappears and the derivative becomes Nmi v. v. 1^ eh aS m (td + zmip) ^ST lim^ hviik(td)^h,c--40 € h. niL-‘06 pL=-11 ^Owijk(td) Nmi OS (td + Zmip) =^ ^(5.12) m ^OWijk(td)^ mE0 p=1 E . A similar derivation yields the following expression for the functional derivative of 11 with , respect to each time delay: N 03 m (td Zm ip ) ^ =^ SLY (td) 8Ti j k(td)^mE0 p=1^°Tiak (5.13) Related derivations using a finite difference approximation have been presented in [94, 93] for recurrently connected networks. 5.2.3 Back Propagated Error Signals - If an error signal, j(t) =— a S m (t Z j ) E^ Oyj(t) m p (5.14) mE0 p=1 is associated with each node, j, in the network, then the functional derivatives from equations (5.12) and (5.13) can be expressed in terms of Ai(t). The weight derivatives can be expanded using the chain rule, 5T^ Nn" 0.5 m (t = ^ E Swijk(t) mE0 p =1 = Zm i p ) awiik(t) v. "" aSm(t Zmip) agi(t) ^Oyi(t)^awijk(t) lriik)Ai(t) m_Eoo i (tp:^, ^==^ and the delay derivatives can be expanded similarly: =^NE. as + zm ip) orijk(t)^mE0 P^ ariik(t) =1 6T m (t (5.15) 53 Chapter 5. Dispersive Networks^ z m ip ) Oyi(t) Ox iik (t) Oyi(t)^Oxijk(t) Oriik(t) v,^Osm(t Z-d mEO P=L^ = Wijk(i) j(t — riik) (5.16) For output nodes, equation (5.14) is Osm(t) Am(t) —^ Oym (t) Os m (t) Oo m (t) Oo m (t)Oym (t) = 2 00)— 0.(t)M(Y.(t)), (5.17) and for the hidden nodes, it can be expanded using the chain rule: ^Nv.■ " n 0 Sm(t zmjp) j( t) = — mEO p=1^ Oyj(t) Nmi osmo Tijk + zmip) oyio + riik) = -EE E E ayi(t rijk)^ova(t) mEO lECi kEAij p=1 -E E ay* iEc; kEAi i .6(0)) E iEci + Tij k)^Osm(t^zmip) ayi(t + riik) ao) mEO^ p=1 > Ai(t + rii k)wii k(t + Two. (5.18) This recursive expression for Ai(t) is valid for any feed-forward network, but its evaluation requires future values of making it noncausal. To overcome this problem, an "aged" error signal, Ai(t, ai), can be defined so that its value at time t represents the true error signal, Ai(t), at time t — aj: Ai(t,ai)= Ai(t — aj).^ (5.19) There is some flexibility in choosing the age, aj, for each node, but it must be greater than the maximum value of zm ip for all m and p. For output nodes causality is not a problem, so aj = 0 will suffice, and equation (5.17) describes the error signal. For hidden nodes, aj must be chosen to satisfy a • —^ >max zm.77, • = max max (ai m,P^lECj (5.20) Chapter 5. Dispersive Networks^ so that Aj(t,aj) 54 can be defined recursively and causally as Aj(t, aj) = f.;(yj(t — aj)) EE Ai(t — [aj — at — rijk], ai)wijk(t — [ai — riik]).^(5.21) iEci kE.Ati ; 5.3 Gradient Descent Adaptation Assuming the input and target signals are sample functions of random processes, the mean total squared error can be defined as E = E[11], ^ (5.22) where the expected value represents an ensemble average. To minimize E, the weights and time delays must adapt with changes proportional to the negative gradients, dwijk (5.23) dt drijk dt (5.24) where p and v are learning rates. Expanding these equations gives SE dwij k dt P Swijk(t) driik dt SE v^= Srijk(t) ILE [ 81 owijk(t) , I = pE[oj(t — riik) Ai(t)], SW vE[, „I = —vE[wijk(t) di (t — riik) Ai(t)]. oriikW (5.25) (5.26) Due to causality constraints, it is difficult to estimate the required ensemble averages until time t aj, so a delayed version of gradient descent must be used: SE dwijk dt • • P Swijk(t — ai) —pE [^I Swijk(t — ai) = pE[ol (t — ai —^) Ai(t, ai)], drijk dt = • (5.27) SE Sriik(t — ai) —vE SWY [ I 57 ijk(t — ai) - • —vE[wijk(t — ai) di (t — ai — rijk)Ai(t,ai)]. (5.28) Chapter 5. Dispersive Networks^ 55 If the learning rates, ti and v, are small enough that the weights do not change significantly during a period equal to the maximum delay through the network, then these equations are a good approximation to true gradient descent. At high learning rates, the time delays can cause the adaptable parameters to overshoot their optimal values, leading to instability. This problem can be easily overcome by choosing smaller learning rates, but the issue of stability deserves further analysis. For ergodic signals and slow adaptation rates, the expected values in (5.27) and (5.28) can be estimated by time averages. One such estimate for the weights, pijk(t), can be obtained with a leaky integrator, dPijk Pijk(t) F 7)— - dt — oi(t — ai — riik) Ai(t, ai), and a similar estimate, qijk(t), can be obtained for the delays: dqiik qiik(t)^dt = wijk(t — ai)dj (t — ai — rijk) The time constant, j, is similar to the discrete-time momentum parameter discussed in [104]. With these estimates, the adaptation equations become dwijk^drijk PPijk(t) and^= —vqiik(t)• dt^ dt In the limit where 77 goes to zero, the leaky integrators follow their inputs precisely, giving the on-line adaptation equations: dwijk dt drijk dt poi(t — ai — rijk)Ai(t,ai),^ (5.29) —vinijk(t — ai)dj (t — ai — rijk)Ai(t,ai).^(5.30) The on-line use of the instantaneous gradient is analogous to the stochastic approximation algorithm introduced by Robbins and Monro [99]. Stochastic approximation was first applied to function minimization (or maximization) in [60], and since then it has become a well-known technique [64]. The method is an iterative procedure that updates a set of parameters based on estimates of the true gradient. A set of gradient samples (possibly a single sample) forms an Chapter 5. Dispersive Networks^ 56 estimate of the true gradient during each iteration. The gradient information effectively becomes averaged over many parameter updates, which are each small enough to ensure convergence. In the technique presented here, a continuously evolving network state replaces the discrete steps, and the instantaneous gradient is a single-sample estimate of the true gradient. On-line learning, as described, applies to a continuously arriving training signal. Finitelength training signals can be readily extended for on-line learning by presenting them repeatedly, but adaptation should be suspended during an interval equal to the longest propagation delay through the network immediately after each repetition boundary to avoid learning any transients incurred at that point. During adaptation, the internal network buffers explicitly store recent information from the input signal, and the values of the weights and time delays implicitly store older information. The network can track some types of nonstationary signals because recent information slowly replaces older information in the weights and time delays. 5.4 Relationship To Other Work If the delay learning rate is zero and the network equations are integrated using a simple Euler method, the technique is equivalent to the temporal back-propagation algorithm presented in [124]. Furthermore, with all the time delays fixed at zero, temporal back-propagation is identical to standard back-propagation. A related technique for learning time delay "windows" in feed-forward networks has been demonstrated in [8] for a speech recognition task, but without a detailed derivation. Adaptable time delays were also discussed in [98], where recurrent networks without sigmoidal nonlinearities separated sets of superimposed and delayed signals. A different method for performing gradient descent in networks with fixed time delays has been presented in [87] and [88]. However, that method, while being much more general, does not take advantage of the computational savings incurred by back-propagating through the time delays, and is not appropriate for networks with time delays in all of the connections. The method of unfolding in time [123] can be 57 Chapter 5. Dispersive Networks ^ used to train networks with delays in all the connections, but it requires extensive storage and nonlocal computations. All these techniques, except [8] and [98], use fixed time delays. Other types of networks that can process temporal information employ context units, which remember past values and include them with the current inputs [43]. Methods for adapting the decay rates of input integrators to match time constants observed in the input signals were described in [83], and an interesting time delay structure known as a Gamma memory was reported in [28]. 5.5 Practical Considerations At this point, practical issues associated with building and training a dispersive network with adaptable delays must be considered. The implementations assume on-line adaptation using only computationally local information. 5.5.1 Exact Delayed Error Gradient Table 5.1 summarizes the equations describing the complete operation of a dispersive network with adaptable time delays. These equations describe a network with information flowing in two directions: input signals flow forward and error signals, Ai(t), flow backwards. The nodes and connections can be modeled as shown in Figure 5.8. Each node contains a delay line of length a2, providing the delayed value ri (yj(t — a2)), and each connection contains tapped delay lines to provide the delayed values of oj, Wijk, and Ai. The time delays, Tijk, can be adjusted by moving the taps. In a typical network, the number of connections increases with the square of the number of nodes, N, so the implementation requires 0(N 2 ) storage for the delayed signals. Figure 5.9 shows a functionally equivalent but more efficient implementation. The of buffers have been moved from the connections to the nodes, with multiple taps for multiple connections, reducing the number of buffers required for of from 0(N 2 ) to 0(N). Also, it is not necessary to store values of yj to calculate rj (yi). Since f() is a monotonically increasing function of yj, the derivative can be calculated directly using a function, Moi), to calculate f;(yi) from oi. . Chapter 5. Dispersive Networks ^ 58 Y./ (a) Node (t) JO ^ (t [af ar'Cijk] - - , ad wuk(t-[af-tift]) ijk(t) -41 'Lift 4- *F—^a• a i + tijk —01 ijk 14 - Wijk (b) Connection A i (t, Figure 5.8: Block diagrams of (a) a node, and (b) a connection, for a direct but inefficient implementation of the network equations. Elongated boxes labeled with an age, such as aj, represent tapped delay lines with a maximum delay proportional to the length of the box. Input signals flow from left to right and error signals flow from right to left. Chapter 5. Dispersive Networks 59 ► xfik (t) oi(t tijk ) - oi(t-[a i + ok]) 0j f() a• tijk 14— '4^ ai+Cijk (t, (a) Node ----1 g () P yit-a) ) Ai(t- f aj artijki, ad Wyk [ aj-tijk - Wiik(t) Oj (t-tiik) xok(t) ► oi(t-[ai+T ijk ]) wiik(o _q ai 4^ (t, (b) Connection Figure 5.9: Block diagrams of (a) a node, and (b) a connection, for a network that is functionally equivalent to that shown in Figure 5.8, but with smaller total time delay buffer requirements. The weight delay buffer is now a fixed length, independent of riik. The function g(f(yj)) calculates nyi) from values of f(y.i). 60 Chapter 5. Dispersive Networks^ Network Equations Using Exact Delayed Gradient Parameter Formula yi(t) EEtvijk(t),(t-Tijk(o) j^k oi (t) fi(N(t)) A m (t) 2(6,7,(t) — o n,,(t))f„., (y,„(t))^for m E 0 Ai(t, ai) dWijk dt drijk dt f"i (ys(t — as)) > > A (t - [a, - ai — i rijkl, ai)wijk(t — IEC3 kEA; [ai — niki) poi(t — [ai + riik])Ai(t,ai) —1/Wijk(i — ai)o'i (t — [ai + riskpAi(t,ai) Table 5.1: Equations describing the complete operation of a dispersive network with adaptable time delays, using an exact form of gradient descent. The subscript m denotes output nodes. Finally, in Figure 5.9, the weight multiplication occurs before the delay in the feedback path, resulting in a fixed-length delay line that is shorter than the weight delay in Figure 5.8. The reduction in storage compared to Figure 5.8 is approximately proportional to N 2 (2aj — 5.5.2 Approximate Delayed Error Gradient The total storage requirements can be reduced from 0(N 2 ) to 0(N) by replacing each delayed occurrence of wijk with its current value, and moving all the delay buffers from the connections to the nodes. To accomplish this reduction, the delay buffers must have a separate tap for each connection. Table 5.2 shows equations for the resulting approximated error signal, j(t, as), and for dwijk/dt and drisk/dt. The error signal for the output nodes remains the same as in Table 5.1. For small learning rates, the weights and time delays change slowly and this 61 Chapter 5. Dispersive Networks^ Network Equations Using Approximate Delayed Gradient Parameter Formula yi(t) rijk (o) EEwiik(t)oi(t_ j^k Oi(t) Om(t) Aj(t,aj) fi(Yi(t)) 20,(0_ ri(yj(t — aj)) E orn (t))f,Vyni (t))^for E iii(t - [aj mE0 — ai — rijk], ai )wijk(t) iEC, kEAi; dwijk dt drijk dt ,uoi(t — [ai + rijk])Ä i(t, ai) —1/wijk(t)oli(t . — [ai + rii k])Ai(t,ai) Table 5.2: Equations describing the complete operation of a dispersive network with adaptable time delays, using an approximate form of gradient descent. All the delayed values of wijk in Table 5.1 have been replaced with wijk(t). The simplified delayed gradient has smaller storage requirements, and is a good approximation to the exact delayed gradient for small learning rates. The subscript m denotes output nodes. approximation introduces little error into the gradient calculations. Figure 5.10 shows the node and connection diagrams for this implementation, and Table 5.3 summarizes the storage requirements for all three implementations. With this approximation, the connections no longer require any storage buffers. The feedback delay lines are now associated with the nodes, and provided with multiple taps in the same way as the feed-forward delays. While no claims can be made for the biological plausibility of this neural network model, it is interesting that delay lines with multiple taps parallel the axons in biological neural networks. Braitenberg, for example, in his discussion of neural interconnections in the cerebellum, suggests that "Engineers will have no difficulty appraising the general usefulness of a system of delay 62 Chapter 5. Dispersive Networks xjlk (t) JO - a.(t_[al_ar tfik i,a.,) 1 a• 4 T ijk 14- ^ go) ai +t uk I ^>o(t 'tick) oi(t-[ai+Tuk]) - f(yi (t-ad) --oi tia 1 ai-ai •^ 4^ ai(t- [ai -art ukl,^wijk(t) A. (t, a.) <1^ (a) Node of (t-t uk ) oi(t-[a i +T ijk P -dldt Wijk(t) Ch ijk dt Wijk (t) adt_[ ai -art iik ],a i (b) Connection wijk(t) ii i (t-[ai-al-tuk], Figure 5.10: Block diagrams of (a) a node, and (b) a connection, for a network optimized for storage efficiency. These diagrams are similar to those in Figure 5.9, except that current values of wijk(t) replace the delayed values. For small learning rates, adaptation still approximates gradient descent. With all the delay lines moved out of the connections, the total storage requirement scales linearly with the number of nodes. 63 Chapter 5. Dispersive Networks^ Storage Requirements Implementation Figure 5.8 Storage For yj of Ai Wijk yj Figure 5.9 0i Ai Wijk Figure 5.10 yj of Ai Wijk Buffer Length aj aj ai a i aj — aj aj a i ai — aj maxi (aj — ai) — — - Buffers Needed oc N a N2 oc N 2 cc N 2 0 cc N a N2 a N2 0 oc N oc N 0 Table 5.3: Buffers required for the implementations shown in Figs. 5.8, 5.9 and 5.10. The estimates are based on networks containing N nodes, and on the assumption that each node has a fan-in and fan-out approximately proportional to N. As shown in the third column, the number of buffers needed scales approximately as a power of N. lines with provision for multiple tapping along the way" [11]. 5.5.3 Choosing the Error Signal Ages The best choice for the age parameters, aj, will depend on the requirements of a particular implementation. It is desirable to make each aj as small as possible while keeping Aj causal, because small age values help to minimize storage requirements, maintain stability, and ensure that adaptation most closely approximates true gradient descent. For practical purposes, each node, j, can be labeled with a "layer number", nj, which is the distance from node j to the farthest output node in terms of the number of connection layers. For example, in the network of Figure 5.6, if node h is an output node, then ng = 1, nh = 0, ni = 1, and nj = 2. Two nodes, i and j, are "adjacent" if ni and nj differ by 1. If the maximum permissible delay between two adjacent nodes is z 0 , then the maximum permissible delay between any other pair of nodes, i and j, is (nj — ni)zo . With these definitions, the Chapter 5. Dispersive Networks^ 64 formula aj = njzo^(5.31) gives a reasonable and consistent set of ages for all nodes in the network. Note that any particular choice of the aj parameters places limits on the maximum values for the time delays. Thus, each rijk must be kept smaller than aj — ai and greater than zero by imposing some kind of limiting procedure during adaptation. One possibility is a hard threshold, where the value of Tijk stops changing once it has reached its limit. Another possibility is to optimize a different set of parameters, -yijk, where aj — at riik =^exp(--yijk) (5.32) • Now, with unrestricted values of Nik, rijk will lie in the range 0 < T;jk < aj — at. By removing restrictions on the parameters, this technique can yield a better-conditioned optimization problem [89]. Chapter 6 Applications of Dispersive Networks The dispersive networks described in the previous chapter can be applied to many different tasks. This chapter presents simulation results for adaptive signal prediction, signal production, and adaptive channel equalization, demonstrating several benefits of dispersive networks over more conventional techniques. 6.1 Adaptive Signal Prediction The Mackey-Glass signal prediction task described in Chapter 4 provides a convenient benchmark for measuring the performance of adaptive signal processing systems. The training configuration shown in Figure 4.2 permits a direct comparison between dispersive networks and conventional back-propagation networks. 6.1.1 Network Topology and Implementation Details Figure 6.1 shows the network topology, which is similar to the topology used in Chapter 4 and in [66]. There are two hidden layers, consisting of ten nodes each, and a single output node. Instead of four input nodes as in Figure 4.3, there is a single input node with four adaptable connections to each node in the first hidden layer. A single adaptable connection joins each other pair of nodes in adjacent layers. 65 Chapter 6. Applications of Dispersive Networks ^ 66 Input Figure 6.1: Dispersive network architecture for the Mackey-Glass prediction task. Each heavy line between the linear input node and the first hidden layer represents four adaptable weights and time delays. All other connections contain a single adaptable weight and delay, and all feed-forward signals flow from left to right. There are a total of 150 adaptable connections and 21 adaptable biases. Each node has either a linear or a sigmoidal transfer function, represented by the node symbols. Before training, the weights and delays are set to pseudorandom values. During training, the network operates in an on-line mode, continuously adapting to the input signal. During the simulations, all 150 network connections contained adaptable weights and time delays, and the aj parameters were chosen as explained in Section 5.5.3, with z o = 25. As in Chapter 4, each fi() for the hidden nodes was a hyperbolic tangent, fi(yi) = tanh(a[y — O]), Myi) = gi(fi(xj)) = a(1— ffiyi)), and all the input and output nodes had linear transfer functions for expanded dynamic range. Adaptable thresholds, p, were implemented by connections from an additional node with a constant output, and all the nodes had unity gain (a = 1). The initial delays were chosen pseudorandomly from a uniform distribution on the interval [0, 20], and the initial weights were chosen from a uniform distribution on the interval [-0.1,0.1]. A weight learning rate of p = 0.005 and a delay learning rate of v = 1.0 were selected, based on a few experimental Chapter 6. Applications of Dispersive Networks ^ 67 training runs. As in Chapter 4, the network equations were integrated using a simple Euler method with a small fixed-length time step, and the input was the continually evolving Mackey-Glass signal, integrated using a four point Runge-Kutta method. Delayed values of time-varying signals that fell between sample points were obtained using simple linear interpolation. The adaptable time delays also require delayed values of signal derivatives, and these were interpolated between sample points as described in Appendix C. 6.1.2 Embedding Dimension The behavior of the Mackey-Glass signal generated by equation (4.13) depends on its past values from time t — r to the present. The signal history within that interval can be divided into an infinite number of points, which all affect the future behavior of the signal. Therefore, the Mackey-Glass equation can be viewed as an infinite-dimensional mapping from past values to a single future value. The signal values from time t — r to the present define a location in an infinite-dimensional phase space, and the evolution of equation (4.13) causes the system to follow a path through this space. The present location of the system completely determines its future behavior. Imagine a system evolving from its starting location in phase space. Its trajectory defines a flow line that is an "attractor" if it converges with other flow lines that are "near" it in phase space. A particular trajectory may or may not eventually lead back to its starting location. If it does, then the attractor is periodic, and the system will exhibit cyclical behavior if its starting location is within the region of attraction. A second type of attractor is a stable, fixed point where the state of the system stops evolving. A third type, known as a chaotic attractor, is neither a fixed point, nor periodic. The system continues to evolve, with the flow lines tracing out a particular shape in the phase space. There may be some regions of phase space that the system never enters, and others that it passes near frequently. However, due to a strong sensitivity to initial conditions, the Chapter 6. Applications of Dispersive Networks ^ 68 actual trajectory cannot be predicted with much accuracy very far into the future. The overall structure of the attractor is often a finite-dimensional manifold within the phase space. For the Mackey-Glass signal, the attractor has a fractal dimension of approximately 2.1 [32]. It was demonstrated in [91] that a chaotic attractor can be reconstructed from a finite set of samples of the signal history. For example, x(t + T) = f(x(t), x(t — r 1 ), x(t — r2 ), ... , x(t — rn,)), where T is the prediction time into the future, and (6.1) 10 is a mapping function. Since the mapping function uses m+ 1 samples to make the prediction, the "embedding dimension" is m+ 1. The embedding dimension must be at least as large as the dimension of the chaotic attractor itself. The choice of embedding dimension, in terms of its size and temporal spacing, has a significant effect on the network's learning ability [76]. In Chapter 4, the embedding dimension was 4 with an even spacing of 6, since these values gave good results in [66]. However, the optimal embedding dimension of a system, and even the attractor dimension, might not be known in advance. Dispersive networks can automatically learn an optimal embedding dimension, due to the adaptability of the internal time delays. The number of delay elements within the network imposes an upper limit on the embedding dimension, but its detailed structure develops as the network adapts. 6.1.3 Prediction Results Figure 6.2 shows learning curves for the adaptable time delay network, an identical network with fixed time delays, and a conventional nondispersive network of the type discussed in Chapter 4. All the networks showed approximately exponential convergence, and after 10 8 time units appeared to be approaching asymptotes. At that point, the NRMSE values were 0.0200, 0.0082, and 0.0035 for the conventional network, the dispersive network with fixed delays, and the dispersive network with adaptable delays, respectively. A 150-tap linear filter trained on the same task had a prediction error of 0.2, which is well off the scale in Figure 6.2. Chapter 6. Applications of Dispersive Networks ^ 69 0.06 Conventional Back-Propagation Network Dispersive Network with Fixed Delays Dispersive Network with Adaptable Delays 0.05 0.04 0.03 - ,, „ (a) .‘ ss 0.02 0.01 0.00 (c) 0^ 1^2^3 ---------------------------- 4^5^6^7^8^9^10 Time (t x 10 7 ) Figure 6.2: Learning curves showing the normalized rms error (NRMSE) of networks trained to predict the Mackey-Glass chaotic signal an interval, T = 6, into the future. The curves are for: (a) A nondispersive network with identical topology to that used in [66]. Four input nodes represented fixed delays of 0, 6, 12, and 18. (b) A dispersive network with randomly chosen fixed delays. (c) A dispersive network with adaptable delays, initially identical to (b). In each case there were two hidden layers with ten nodes each, and a total of 150 connections. Adaptation occurred on-line, with a weight learning rate of = 0.005 and, for part (c), a delay learning rate of v = 1. The averages used to estimate the NRMSE were over time intervals of length 10 5 . It should be noted that each integration time step for the network with adaptable delays took approximately 1.7 times as long as each time step for the other networks, due to the overhead involved in adapting the time delays. This penalty could be avoided in a parallel hardware implementation, where the weight and time delay updates could occur simultaneously. 6.1.4 True Adaptive Prediction During adaptation, the output of the network "predicts" the current value of the input signal. To perform true prediction of future values, the delay at the input in Figure 4.3 must be removed, but the resulting lack of a target signal prevents further adaptation. Figure 6.3 shows one method for performing true adaptive future prediction. A second copy of the network connects Chapter 6. Applications of Dispersive Networks ^ 70 directly to the input signal, without the time delay, T. The lower network adapts as in the original configuration, but its internal parameters (weights and time delays) are continuously copied to the second network above it. As the parameters of the lower network adapt, the parameters of the upper network follow, allowing it to perform true adaptive prediction. Figure 6.3: True adaptive prediction with two networks. The lower network adapts as in Figure 4.2, but its internal parameters (weights and times delays) are continuously copied to the top prediction network. As the parameters adapt, those in the top network follow, allowing it to perform true adaptive prediction. If the networks are physically separate, then the parameter copying procedure requires a high-bandwidth communication channel. Additionally, all of the elements within the two networks must be precisely matched to ensure that copying the parameters will lead to accurate prediction. Unfortunately, the arrangement in Figure 6.3 doubles the hardware required to perform the prediction task. In addition, if the networks are physically separate, then the parameter copying procedure requires a high-bandwidth communication channel between them. Finally, the processing elements within the two networks must be very closely matched to ensure that a simple copying of the parameters will lead to the desired behavior. Slight variations between the two networks, coupled with their internal nonlinearities, can lead to large differences in the 71 Chapter 6. Applications of Dispersive Networks^ Dispersive Network^ x(t) Prediction Network o(t+T) o(t) 1111-111. e(t) b(t) Figure 6.4: True adaptive prediction with a dispersive network. The delay at the output can be viewed as an additional layer in the dispersive network, permitting simultaneous adaptation and prediction. The true prediction output comes from a tap before the final delay, while the output from the final delay, along with the target (input) signal, forms the back-propagated error. outputs. One desirable feature of an adaptive system is its ability to "adapt around" internal defects. However, the two-network configuration effectively disables this feature. With a dispersive network, the delay can be moved to the output, as shown in Figure 6.4. This delay can be viewed as an additional network layer with a single node, connected through a fixed delay, T, with a fixed weight of unity. If the network adapts to produce a unity transfer function, then a tap taken off before the fixed delay will provide true future prediction. This configuration permits on-line adaptive prediction without requiring two networks, but the network must adapt using temporal back-propagation, due to the delay present in the final layer. Figure 6.5 shows learning curves for the adaptable time delay network and an identical network with fixed time delays, both in true adaptive prediction configurations. The figure also includes the learning curves from Figure 6.2 for reference. Again, the networks showed approximately exponential convergence, and after 10 8 time units appeared to be approaching asymptotes. At that point, the NRMSE values were 0.0084 and 0.0035 for the dispersive ^ Chapter 6. Applications of Dispersive Networks ^ 72 0.06 0.05 - Conventional Back-Propagation Network Dispersive Networks with Nonpredictive Adaptation Dispersive Networks with Predictive Adaptation 0.04 0.03 0.02 - (a) ••^...^ (b) •^•^•^ ••^.^•^ •••^ •.•• ...... 0.01 0.00 ^^ 0 1^2^3^4^5^6^7^8^9^10 Time x 10 7 ) Figure 6.5: Learning curves showing the normalized rms error (NRMSE) of networks trained to predict the Mackey-Glass chaotic signal an interval, T = 6, into the future. The solid lines are for networks trained in a true prediction configuration, so that they could predict and adapt simultaneously. For reference, the graph includes the learning curves from Figure 6.2 as broken lines. The curves are for: (a) A nondispersive network with identical topology to that used in [66]. The nondispersive network cannot be trained in a true predictive configuration. (b) A dispersive network with randomly chosen fixed delays. (c) A dispersive network with adaptable delays, initially identical to (b). In each case there were two hidden layers with ten nodes each, and a total of 150 connections. Adaptation occurred on-line with a weight learning rate of tt = 0.005 and, for part (c), a delay learning rate of v = 1. The averages used to calculate the NRMSE were over time intervals of length 10 5 . network with fixed delays, and the dispersive network with adaptable delays, respectively. It is apparent from this diagram that the cost of true prediction, incurred by placing the fixed delay at the output, is insignificant compared with the benefit of a dispersive network over a conventional back-propagation network. It is important to remember that the conventional network cannot perform true prediction while adapting, so there is no true prediction learning curve for a conventional back-propagation network in Figure 6.5. To predict further into the future, identical copies of the T = 6 network can be cascaded to provide outputs for T = 12,18,24,... Alternatively, the network can be trained to predict Chapter 6. Applications of Dispersive Networks ^ 73 further into the future by increasing the fixed delay, T, in Figure 6.4. Increasing this delay may degrade the stability of the prediction filter, requiring smaller learning rates. Theoretically, the cascade method should provide more accurate predictions for chaotic signals [14]. Some results in [66] confirm the higher accuracy of the cascade method with nondispersive networks, but more simulations need to be performed to verify its superiority with dispersive networks. 6.1.5 Moment um It is apparent from the learning curves in Figures 6.2 and 6.5 that the back-propagation adaptation technique can take a long time to converge. Momentum, as discussed in Section 3.3 and Appendix A, may increase the rate of convergence and allow the network to escape from local minima in the error surface [104]. To determine the effect of momentum on networks with internal time delays, the same networks were trained with a momentum parameter of 77 = 10, as defined in Appendix A. Figure 6.6 shows the results, with learning curves for networks trained without momentum included as dashed lines. Momentum apparently speeds up the initial convergence, but dispersive networks trained without momentum ultimately reach lower error values. The nondispersive network seems to show the opposite effect, but by time 10 8 , the difference between the curves with and without momentum is virtually nonexistent. Figure 6.6 suggests that it may be useful to include momentum during the early stages of training, and then gradually reduce i as training progresses. More simulations need to be performed to investigate this possibility, but the rate at which to decrease 77 is another parameter that must be determined by trial and error. Finally, it has been reported that momentum doesn't often provide much improvement for on-line adaptation [132]. Perhaps with a batch update technique, as in equation (3.13), the momentum parameter would prove to be more useful. 74 Chapter 6. Applications of Dispersive Networks ^ 0.06 Without Momentum With Momentum 0.05 LL1 0.04 0.03 0.02 0.01 1 4^5^6 10 Time (t x 10 7) Figure 6.6: Learning curves similar to those in Figure 6.5, for networks trained with and without momentum. The solid lines are for networks trained with a momentum parameter of 77 = 10. For reference, the dashed lines are the learning curves with n = 0 from Figure 6.5. The curves are for: (a) A nondispersive network with identical topology to that used in [66]. (b) A dispersive network with randomly chosen fixed delays. (c) A dispersive network with adaptable delays, initially identical to (b). In each case there were two hidden layers with ten nodes each, and a total of 150 connections. Adaptation occurred on-line with a weight learning rate of it = 0.005 and, for part (c), a delay learning rate of li = 1. The averages used to calculate the NRMSE were over time intervals of length 10 5 . 6.2 Signal Production An interesting application of the trained network is to freeze the weights and time delays, remove the input signal, and connect the output labeled o(t) in Figure 6.4 directly to the input node, creating a recurrent network. Now the network can use its past predictions as inputs, and make additional predictions further into the future. The resulting dynamical system, shown in Figure 6.7, spontaneously generates its own version of the Mackey-Glass signal. Due to the strong dependence on initial conditions, the resulting signal will soon deviate from the actual Mackey-Glass signal, but it can still be used to make long term statistical predictions, and to Chapter 6. Applications of Dispersive Networks^ 75 Figure 6.7: A dispersive network in a recurrent signal production configuration. After training as a predictor, the weights and time delays can be frozen, and the output fed back to the network's input. The resulting dynamical system spontaneously generates its own version of the Mackey-Glass signal, or it can be initialized with a segment of the signal, and then left to run freely. investigate the nature of the attractor. 6.2.1 Learning Progression During the learning progression plotted in Figure 6.5, adaptation was periodically halted, and the network placed in the recurrent configuration shown in Figure 6.7. Figure 6.8 summarizes the results for the dispersive network with adaptable time delays, trained in the true predictive configuration. The behavior did not progress smoothly, but went through periods where the recurrent network was unstable. After each period of instability, the network appeared to produce a better imitation of the true Mackey-Glass signal. A better way to view the learning progression is through the aid of phase diagrams. Figure 6.9 plots x(t) vs. x(t — 17) for the Mackey-Glass signal, resulting in a projection of the Mackey-Glass chaotic attractor onto a two-dimensional surface. It is evident from the diagram that the signal is approximately periodic, but it is also evident that the system follows a slightly different trajectory during each "period". By creating similar phase diagrams for the network at different points during training, the learning progression is more readily apparent. Chapter 6. Applications of Dispersive Networks ^ 10 5 10 6 10 7^ 76 108 Time Figure 6.8: Summary of a dispersive network's progress while learning to produce the Mackey-Glass signal. The network was trained as shown in Figure 6.4, and periodically reconfigured as in Figure 6.7 to observe its signal production performance. The heavy line, partly dashed, shows how the prediction NRMSE decreased as training progressed, on a logarithmic time scale. The network went through periods of instability, indicated by the dashed segments of the learning curve. The insets show representative output from the network at four selected stable points during training. Note that the network first learns to produce a periodic output, followed by a period of instability, and then an output similar to the Mackey-Glass signal. Figure 6.10 shows four phase diagrams from the early stages of training. In each case, the network was initialized with Mackey-Glass data from time t = 0 to time t = 100, and then placed in the recurrent configuration shown in Figure 6.7. After training up to time 1.14 • 10 4 , the network converges to a stable fixed point very quickly after initialization, as shown in Figure 6.10(a). Figure 6.10(b) shows that after training up to time 1.53 • 10 4 , the network has learned to produce an oscillatory pattern, but it still settles to a fixed point in about 500 time units. In Figure 6.10(c), at time 3.71 • 10 4 , the network finds a stable periodic attractor, never reaching a fixed point. Finally, at time 7.45 • 10 4 (Figure 6.10(d)), the shape of the attractor 77 Chapter 6. Applications of Dispersive Networks ^ 0 2^0.4^0.6^0.8^1.0 ^ 1.2 ^ 1.4 X(t-17) Figure 6.9: Projection of the Mackey-Glass chaotic attractor onto a two-dimensional surface. The vertical axis plots the signal at time t, and the horizontal axis plots the same signal at time t — 17. Note the roughly periodic behavior. begins to resemble the actual Mackey-Glass attractor in Figure 6.9. Figure 6.11 presents four phase diagrams from the later stages of training. Figure 6.11(a) shows that at time 2.42.10 5 , the recurrently connected network was unstable. However, stability returns by time 5.4 - 10 5 (Figure 6.11(b)), where the attractor contains a double loop structure similar to that in Figure 6.9. The network is again unstable at time 3.43 • 10 6 (Figure 6.11(c)), but by time 2.84 • 10 7 (Figure 6.11(d)) it has finally learned to emulate the Mackey-Glass attractor remarkably well. Chapter 6. Applications of Dispersive Networks ^ (a) (c) (d) Figure 6.10: Behavior of the dispersive network during the early stages of learning. The attractors represent the network output after training up to times: (a) 1.14 • 10 4 , (b) 1.53 • 10 4 , (c) 3.71 • 10 4 , and (d) 7.45 • 10 4 . The diagrams are not all plotted to the same scale. In (a) and (b), the network reaches a stable fixed point, but in (c) and (d) it eventually reaches a stable, periodic orbit. The training times for these plots coincide with points where the network exhibited interesting new behavior. 78 Chapter 6. Applications of Dispersive Networks ^ (a) (b) (c) (d) Figure 6.11: Behavior of the dispersive network during the later stages of learning. In this figure, the diagrams are all plotted to the same scale. The attractors represent the network output after training up to times: (a) 2.42-10 5 , (b) 5.4-10 5 , (c) 3.43 -10 6 , and (d) 2.84 -10 7 . These training times coincide with points where the network exhibited interesting new behavior. In plots (a) and (c), the network is unstable, but plot (d) is a close approximation to the actual Mackey-Glass attractor shown in Figure 6.9. 79 Chapter 6. Applications of Dispersive Networks^ 80 6.2.2 Performance Figure 6.12 shows the final attractors reached after training up to time 10 8 for a conventional back-propagation network, a dispersive network with fixed delays, and a dispersive network with adaptable delays. The diagram also shows the actual Mackey-Glass attractor for comparison. The conventional back-propagation network was unstable, and it quickly deviated from the Mackey-Glass signal. However, both dispersive networks produced reasonable imitations of the true attractor, with the adaptable delay network giving a slightly more accurate representation. The phase diagrams do not provide complete information about the behaviors of the different networks. Although they show the shape of the attractors, they provide no indication of the rates at which the network states evolve through phase space. To show this temporal information, Figure 6.13 plots the signals at the network outputs, along with the actual MackeyGlass signal for comparison. For the first 100 time units, the networks were in a strictly feedforward predictive arrangement, with the actual Mackey-Glass signal applied at the input. At time t = 100, the networks were reconfigured into the arrangement of Figure 6.7 and left to run freely. In Figure 6.13, initial start-up transients are apparent before the internal network delays have filled with valid data, and then very accurate prediction occurs up to time T = 100. At that point, the freely running networks begin to diverge from the Mackey-Glass signal, with the conventional back-propagation network diverging the fastest. This network starts to become unstable before time t ---, 500. The two dispersive networks, with fixed and adaptable delays in Figures 6.13(b) and 6.13(c), respectively, both provide accurate signal emulation up to time t = 500. To show the difference between networks with fixed and adaptable delays, Figure 6.14 extends the plots in Figure 6.13 from time t = 750 to time t = 1250. Here, the accuracy of the fixed-delay network begins to deteriorate, exhibited mostly by a phase-shift of the signal pattern. The network with adaptable delays still provides quite accurate emulation up to time t = 1250, while the conventional back-propagation network is clearly unstable. Chapter 6. Applications of Dispersive Networks ^ (a) (b) (c) (d) Figure 6.12: Final attractors for (a) a conventional back-propagation network, (b) a dispersive network with fixed time delays, and (c) a dispersive network with adaptable time delays. All the networks were trained up to time 10 8 and then placed in the configuration shown in Figure 6.7. The actual Mackey-Glass attractor is reproduced in (d) for comparison. 81 Chapter 6. Applications of Dispersive Networks ^ Inn^ 82 i 1r^1, IA24 ..ii A ,,ali A^A AL Ai AI A .e. A .ii A A A 1' 1 Y^ T^ T^T T "'V (4 A, A 4. A .r A A A ■ IT "Y TV v y (13) 0 Time 250 500 Figure 6.13: Network emulation of the Mackey-Glass signal from time t 0 to time t 500. The networks are (a) a conventional back-propagation network, (b) a dispersive network with fixed time delays, and (c) a dispersive network with adaptable time delays. Before time t = 100, the networks were in a strictly predictive configuration, with the actual Mackey-Glass signal applied at the input. Initial start-up transients occur before the internal network delays have filled with valid data, but once filled, all the networks provide accurate predictive performance up to t = 100. At this point, shown by the vertical dashed line, the networks were reconfigured into the arrangement of Figure 6.7 and left to run freely. Their outputs gradually diverge from the true Mackey-Glass signal. Another way to compare the performance of the three networks is to plot the power spectra of the generated signals, as shown in Figure 6.15. Each plot also shows the "true" MackeyGlass power spectrum, from the signal obtained by integrating the Mackey-Glass equation using a four-point Runge-Kutta method with a step size of 0.1. It is obvious that the dispersive network with adaptable time delays is able to reproduce the Mackey-Glass spectrum much more accurately than either of the other networks. The discrepancy between the network-generated Chapter 6. Applications of Dispersive Networks ^ li ■ r^ !^ , , ',^(a), LA ,^'^ ii^- ^ .; f'iii^AIL^I , 11!^ AI^ ,,,,^ '^ 1 1^.^ 83 All^Iii Ai, ► ! A .A^;A Ai^. A ill ,si I\ ati: sr V^V^I^v '^,,, (b) A (c.AL.A^A.AAA_AAA yv^v^TTITT 750 Time 1000 1250 Figure 6.14: Network emulation of the Mackey-Glass signal from time t = 750 to time t = 1250. The networks are (a) a conventional back-propagation network, (b) a dispersive network with fixed time delays, and (c) a dispersive network with adaptable time delays. These plots are an extension of Figure 6.13. spectra and the Mackey-Glass spectrum appears to grow larger at higher frequencies, but the log power scale in Figure 6.15 exaggerates this effect. In fact, most of the Mackey-Glass signal power is concentrated below a frequency of r/8, so Figure 6.16 shows an expanded view of that region. It is apparent from Figure 6.16(c) that the dispersive network with adaptable time delays produces a remarkably accurate spectrum at low frequencies. These results demonstrate that dispersive networks with adaptable time delays can learn to predict and reproduce signals generated by chaotic dynamical systems. Chaotic systems probably underlie many physical phenomena and economic patterns, and similar predictive techniques have been successfully applied to the prediction of sunspot numbers [126], weather Chapter 6. Applications of Dispersive Networks 10 10 2 1 10 o i s^I i, •, 1 *^• h^A ■iot ^ -1^ .h. 0^ r$04 10 10 2 10 1 VAS16 ?\i/j1jf1V )ykiSwsu -2^(a)^ 10 84 liwis„,...„ _________________ IA • ''' ^10 0 10 10 10 -2^ ••1 • _____^ (b) kl^ I ? , 2 1 10 0 1 0 -i 10 (c) -2 0 it/4 it/2 Frequency 3 7t/4 it Figure 6.15: A comparison of power spectra generated by (a) a conventional back-propagation network, (b) a dispersive network with fixed time delays, and (c) a dispersive network with adaptable time delays. Each network was trained as a predictor, and then connected in the recurrent configuration shown in Figure 6.7. Each resulting dynamical system spontaneously generated its own version of the Mackey-Glass signal, with the power spectra shown above. For comparison, the dashed line is the actual Mackey-Glass power spectrum. The spectra are the result of averaging 100 Fast Fourier Transforms of segments with 4096 samples taken at unit intervals, using a Hamming data window. Chapter 6. Applications of Dispersive Networks ^ 10 10 85 3 (a) 2 IS • I% I • I I 10 1 ' 10 0 10 ( ***I 1/0 3 (b) 64 bO 10 10 2 1 •, I 10 10 10 o 3 (c) 2 10 1 10 10 o 1 ^ ^ 37c/32 7c/8 0^7c/32^7c/16 Frequency Figure 6.16: A comparison of power spectra generated by (a) a conventional back-propagation network, (b) a dispersive network with fixed time delays, and (c) a dispersive network with adaptable time delays. This graph is a magnification of the low-frequency region from Figure 6.15, where most of the signal power is concentrated. The dashed line is the actual Mackey-Glass power spectrum. The spectra are the result of averaging 100 Fast Fourier Transforms of segments with 4096 samples taken at unit intervals, using a Hamming data window. Chapter 6. Applications of Dispersive Networks^ 86 Figure 6.17: Adaptive channel equalization. The internal network parameters adapt to minimize the error, e(t). The resulting output, o(t), is a time-delayed estimate of the channel input. Channel noise, represented by a white Gaussian signal, n(t), prevents the error from reaching zero. patterns [101], stock markets [61], and commodity markets [16]. The prediction error for a chaotic system typically increases exponentially with the prediction interval, T. The exponent gives an upper estimate of the largest positive Liapunov exponent, which describes the divergence of nearby trajectories in phase space. Despite the exponential decrease in predictive ability, qualitative predictions about the future behavior of a chaotic system can still be made from a short segment of training data. The network can be trained by repeatedly presenting it with the finite data segment, until it has reached a sufficiently low error. Connecting the trained network in a recurrent signal production configuration will cause it to generate an output that is statistically similar to the training signal. This method could be used to investigate the nature of the chaotic attractor underlying the training signal without requiring observation of that signal for a long period of time. 6.3 Adaptive Channel Equalization Figure 6.17 shows how an adaptive network can be trained to compensate for distortion in a communication channel. As adaptation minimizes the error between o(t) and x(t — T), the network learns to implement the inverse of the channel transfer function with a fixed Chapter 6. Applications of Dispersive Networks^ 87 lag, T. Linear filters often work well in this configuration [134], but their accuracy suffers when the channel distortion is nonlinear. Figure 6.18 is an example of a nonlinear channel for which linear filters do not provide a satisfactory solution. The channel consists of three Figure 6.18: Nonlinear communication channel. The channel consists of three elements in series. The first and last elements are linear, with identical transfer functions. The center element is a nonlinearity chosen for its mathematical convenience. elements in series. The first and last elements are linear, with identical transfer functions of 0.2602 + 0.9298z -1 + 0.2602z -2 , and the middle element introduces the nonlinear distortion, 5x1x1(1 — x 2 ) Y =^(1 + x 2 ) • (6 .2) To find the inverse transfer function of this channel, a dispersive network was placed in the arrangement shown in Figure 6.17. The input signal, x(t), was zero mean white Gaussian noise with unit variance, fed through a third-order low-pass Butterworth filter with a cutoff frequency of r/16to, where to was the sampling period. Appendix D describes the exact form of Chapter 6. Applications of Dispersive Networks ^ 88 the Butterworth filter. The lag in Figure 6.17 was T = 15, and n(t) was white Gaussian noise having a power of -30dB with respect to the channel output. 6.3.1 Network Topology and Parameters The network had two hidden layers with five nodes each, and a single output node. For the hidden nodes, each .60 was a hyperbolic tangent with unity gain, and the output node had a linear transfer function for expanded dynamic range. Each node in the first hidden layer was connected to the input signal through five delayed connections, and the internal layers were fully connected with three delayed connections between each pair of nodes in adjacent layers. Adaptable thresholds were implemented by connections from an additional node with a constant output. The initial delays were pseudorandom, uniformly distributed values on the interval [0, 24], and the initial weights were similarly distributed on [-0.1, 0.1]. The ages, cti, were 50 for the first hidden layer, and 25 for the second. A weight learning rate of p = 0.01 and a delay learning rate of v = 1.0 were selected, and the network equations were integrated using a simple Euler method with a small fixed-length time step. A conventional back-propagation network was trained for comparison, with the same number of nodes and the same number of input connections, formed from evenly-spaced delay taps ranging from 0 to 24. 6.3.2 Equalization Results Figure 6.19 shows learning curves for the dispersive network, the conventional back-propagation network, and an adaptive linear filter with 25 taps. The graph plots the normalized rms error as a function of training time. Both networks easily outperformed the linear filter, but the adaptable delays in the dispersive network allowed it to achieve the lowest error. At time 3.10 6 , the NRMSE values were 0.254, 0.158, and 0.065 for the linear filter, the back-propagation network, and the dispersive network, respectively. Chapter 6. Applications of Dispersive Networks ^ 89 0.40 0.35 I- 0.30 o 4.1^ 07 0.25 Linear Filter 0.20 Back-Propagation Network 0.15 Dispersive Network - 0.10 -^ 0.05 - 0.00 ^ 0^0.5^1:^' 0^1 5^2.0^2.5^3 0 . Time X 106 Figure 6.19: Learning curves showing the NRMSE as a function of training time for adaptive filters trained to compensate for a nonlinear channel distortion. The linear filter and the back-propagation network both had 25 delayed taps for the input signal. The top curve shows that the linear filter converges very quickly, but cannot reach an error less than about 0.25. Initially, the two network filters show approximately exponential convergence, but the adaptable delays in the dispersive network allow it to surpass the conventional back-propagation network as training progresses. The averages used to estimate the NRMSE were over time intervals of length 10 4 . 6.4 Signal Detection/Classification If the output nodes of a dispersive network are cascaded with binary quantizers, the network becomes a classifier that can separate its input pattern into different categories, based on its spatial or temporal properties. The output nodes each correspond to one category, and a node can signify the presence of a pattern in its corresponding category by producing 1, and its absence by producing -1. This type of signal detector or classifier could be very useful in tasks Chapter 6. Applications of Dispersive Networks ^ 90 like speech recognition. Although the dispersive networks described here have not yet been applied to classification problems, the project is definitely worthy of future consideration. Chapter 7 Implementations Dispersive networks can be implemented in algorithmic form using either serial or parallel digital hardware, or they can be implemented as continuous-time systems using parallel analog hardware. Since the present state of the art for integrated digital circuitry is more advanced than it is for analog circuitry, the emphasis of this chapter is on algorithmic implementations. However, analog hardware may ultimately permit higher operating speeds, lower power requirements, and smaller devices, so a brief discussion of analog implementations suggests some areas for future research. 7.1 Software Implementations All the experimental results presented in this thesis were obtained using a serial computer and conventional programming techniques. The program can be viewed as a "simulation" of the differential equations that describe continuous-time dispersive networks, or as a separate algorithmic form of dispersive network that can be studied in its own right. In dispersive networks, there is a tradeoff between computational locality and storage requirements. For a network node to calculate its output value, it requires data from other nodes in the network. Computational locality refers to the required communication bandwidth and the physical distance to these other nodes. Serial software implementations have different constraints than either digital or analog parallel implementations. A random access memory (RAM) stores all the parameters describing the network, so computational locality is not a consideration. To minimize the required RAM, a serial implementation can use the configuration shown in Figure 5.10. Coding the adaptation 91 Chapter 7. Implementations^ 92 algorithm in a procedural programming language is a straightforward task, and the technique described in Appendix C can be used to obtain the delayed signal derivatives. 7.2 Parallel Digital Implementations Parallel implementations typically employ separate hardware elements for each node, and perhaps each connection, within the network [118]. These elements must transmit and receive data in both the feed-forward and back-propagation phases of operation. Consider a fully connected network, where each node in one layer connects to all the nodes in the next layer. For fully parallel operation, every node requires a separate communication link for each one of these connections, and the number of links increases approximately with the square of the number of nodes. In a discrete-component electrical implementation, the links could be implemented using conductive wires. As the number of nodes increases, the volume of space between the layers will become increasingly filled with links, limiting the size of the network that can be constructed. In integrated circuitry, this constraint is more severe because the space between the layers is only two-dimensional. A truly scalable architecture requires the elimination of any spatial constraints by some form of signal multiplexing on fewer physical wires. The resulting system may not be completely parallel in operation, but a significant increase in speed over serial implementations can still be accomplished. Figure 7.1 shows a scalable dispersive network architecture that trades operational speed for scalability. Each pair of communicating nodes can have access to the bus in sequence, storing the received data for later internal computations. As the number of nodes increases, the overall operating speed will decrease because there will be more nodes sharing the bus. In a straightforward implementation, the operating speed will be inversely proportional to the square of the number of nodes. However, there are no spatial connectivity constraints because the design is completely modular. 93 Chapter 7. Implementations^ Node Node Output ^/ Node ^/ Node > Output Input ----\\ Node Input ----\ Node .._..._\ Input ____/, Node A---- \ --\ \ /--- ---1., \ ^ Bidirectional Bus 7/ Node /---- --N. \ \^ Bidirectional Bus 7/ Node > Output Figure 7.1: Block diagram of a parallel digital dispersive network implementation. Each processing element shares a communication bus with other processing elements in the same layer. If two nodes need to communicate, they can periodically obtain access to the bus, in a round-robin fashion with other network nodes. The length of time between bus accesses for a given pair of nodes determines the operating speed of the network. This design avoids spatial connectivity constraints, but the operating speed decreases as the number of nodes in each layer increases. To minimize the required bus bandwidth in Figure 7.1, the dispersive network must be intelligently partitioned into separate processing elements. One obvious choice is to use a processing element for each network node, and to associate the weights and time delays with the nodes in a way that minimizes the required bandwidth. Figure 7.2 reproduces the node and connection diagrams from Figure 5.8, with the optional weight delay buffer for the back-propagated signal removed. The signal lines in Figure 7.2 are classified into two categories: unique and shared. Unique signals, represented by solid lines in Figure 7.2, flow only between two particular nodes, and can be viewed as forming the connection between them. Shared signals, represented by dashed lines, are common to multiple nodes in the preceding or succeeding layers, and are good candidates for broadcast over a shared bus. The dotted line in Figure 7.2 partitions each link into two sections, one associated with the Chapter 7. Implementations^ 94 Xflk (t) of Yi a• 4- ^ f(yi(t-a)) Aj (t, Ar" (a) Node AO- [ai-a;tijk], ^wuk(t [artuk]) - Figure 7.2: Unique and shared signals in a dispersive network. The solid lines represent unique signals associated with two specific nodes. They can be viewed as forming the connection between the two nodes. The dashed lines represent shared signals that must be transmitted to multiple nodes in the succeeding (for feed-forward) or preceding (for back-propagation) layers. Shared signals are good candidates for broadcast over a shared bus. The dotted line partitions the link into two sections, one associated with the node in the preceding layer, and the other associated with the node in the succeeding layer. The processing elements representing these two nodes each incorporate the appropriate link section. The partition cuts only shared signals, minimizing the bus bandwidth required between the layers. Chapter 7. Implementations^ 95 node in the preceding layer, and the other associated with the node in the succeeding layer. The calculations required in each section are performed by the processing element associated with the corresponding node. If the link partition were to cut a unique signal, the multiplexing requirements would depend on the number of nodes in both layers connected by the link, leading to a time complexity of 0(N 2 ). To minimize the required bus bandwidth, the link partition in Figure 7.2 cuts only shared signals, leading to a time complexity of 0(N). 7.2.1 Operation The network operation occurs in cycles, with each cycle incorporating a feed-forward pass, an error calculation at the output, and a back-propagation pass. The following discussion ignores updates to the adaptable weights and time delays, which will be detailed in the next section. To start a cycle, the first processing element, j = 1, in the input layer broadcasts its input signal to all the processing elements (PEs) in the first hidden layer. Each PE, i, in the first hidden layer contains storage for the feed-forward delay, rijk, shown in Figure 7.2(b). This delay line stores the incoming signal, and the delayed output from the line forms a product with the weight, wijk(t). Finally, an accumulator or summation device latches the resulting signal, xij k(t). Each PE in the first hidden layer performs these operations in parallel, while the first input PE broadcasts its input signal. Multiple connections, indexed by k, between two particular nodes can be implemented as multiple taps off the delay buffer, aj. Each tap must be multiplied by the appropriate weight, and accumulated as part of the input sum for the node. When the nodes in the first hidden layer have finished, the second PE in the input layer broadcasts its input signal. Again, each PE in the first hidden layer stores the incoming signal in a delay buffer and accumulates the signals, xijk(t). This process continues until all the processing elements in the input layer have had a chance to broadcast their inputs. At this point, the input sums, yi, for the nodes in the first hidden layer are available in the accumulators, so they can be fed through the sigmoidal nonlinearities, fi(), to produce the 96 Chapter 7. Implementations^ node outputs. They also feed into a delay buffer, at, associated with each node, to provide the sigmoidal derivative, fl(), for multiplication by the back-propagating signal. Now each node in the first hidden layer can broadcast its output to all the nodes in the second hidden layer, using the same procedure as the input nodes. This process repeats for each layer until it reaches the output nodes, when the network output becomes available. At the network output, the error signal, Ai, is formed by subtracting the target signal from the node outputs. Then the first node, i, in the output layer broadcasts its error signal to every node in the preceding layer. Each node, j, in the preceding layer stores its incoming signal in the time-delay buffer, ai — shown in Figure 7.2(b). The delayed output from this buffer is multiplied by the weight, wijk(t), and fed into the node's back-propagation accumulator. As for the feed-forward case, multiple connections (indexed by k) between two particular nodes can be implemented as multiple taps off the delay buffer, a ; — ai. Each tap must be multiplied . by the appropriate weight, and accumulated as part of the back-propagation sum for the node. When all the nodes in the preceding layer have finished, the second node in the output layer broadcasts its error signal. Again, each PE in the preceding layer stores the incoming signal in a delay buffer and accumulates the signals AjtViik. This process continues until all the processing elements in the output layer have had a chance to broadcast their errors. Now the back-propagation sums for all the nodes in the last hidden layer are available, and can be multiplied by the delayed sigmoidal derivatives, f;(), to form the error signals, Ai. Back. propagation proceeds in this fashion, layer by layer, until the error signals reach the first hidden layer. At this point, except for adapting the weights and time delays, one cycle is complete. 7.2.2 Adaptable Weights and Time Delays Figure 7.2(b) shows that the components for calculating the weight and time delay updates for a link between two nodes, i and j, are associated with the processing element for node i. However, the figure also shows that node j requires the weight and time delay values during the back-propagation phase. As the weights and time delays adapt, node j must maintain their Chapter 7. Implementations^ 97 most recent values. Since these values are calculated by the PE associated with node i, it might seem logical to send them to node j over the shared bus. However, these signals would be unique to each pair of nodes, wasting valuable bus bandwidth and increasing the cycle time. To maintain an optimal tradeoff between system speed and hardware requirements, it is better for each node to calculate the weight and time delay updates separately, and maintain its own set of values. Since all the calculations are digital, the two sets of parameters will remain synchronized if they start out with the same initial values before adaptation commences. However, it is important to note that this technique would not be feasible in an analog system, because minor variations in the analog multipliers and accumulators would eventually cause the two sets of parameters to deviate from one another. Having each PE keep track of its own versions of the weights and time delays leads to some hardware duplication. Figure 7.3 is the block diagram for a single processing element, incorporating multiple input and output links that connect to the shared busses. The summation nodes represent the feed-forward and back-propagation accumulators, where summation occurs serially during each cycle. Figure 7.4 details the components required for a single input link, and Figure 7.5 is the corresponding diagram for an output link. Multiple input or output links for a particular PE can share their weight and time delay update circuitry, since only one input or output link will be active at any given time. However, they each require separate storage for the weight and time delay values, the feed-forward signal, and (for output links) the backpropagation signal. The duplicated storage for the feed-forward signals allows each node to calculate its own weight and time delay updates. 7.2.3 Pipelining The cycle time for the implementation described above increases with the number of PEs or nodes, and with the number of network layers. However, a small restriction on the time-delay values permits pipelined operation, making the cycle time independent of the number of layers. Consider the processing element for node j. Signals from the previous layer, represented Chapter 7. Implementations^ Bidirectional Bus Bidirectional Bus Figure 7.3: Block diagram of a single processing element for a parallel digital hardware implementation. Figures 7.4 and 7.5 show the input and output links in more detail. 98 99 Chapter 7. Implementations^ Xj11(0 01(0 -01 "CA al wiscW Figure 7.4: Processing element input link. Each processing element requires one input link for every node in the previous layer. However, since only one input link can be active at any particular time, the links can share update circuitry, and only the storage for ai and the weight and time-delay values must be duplicated for each link. Multiple connections to the same node in the previous layer can be implemented as multiple taps off the delay buffer, and extra storage for the additional weights and time delays. For clarity, however, the diagram shows only a single connection. by o r in Figure 7.4, enter the time-delay buffers, al, and the delayed values form the sum in the feed-forward accumulator. However, these delayed values of or are already present in the input link storage buffer, having been placed there during a previous cycle. With these delayed values, it is possible for a particular node to calculate its output for the present cycle before receiving its input signals from the previous layer. Therefore, at the beginning of the cycle, all the nodes in the network can calculate their outputs in parallel. With all the node outputs available, the busses for the entire network can be used in parallel to perform the feed-forward communication, 100 Chapter 7. Implementations^ r of (t H^ wift(t) a• ai+Tok wift(t) Ai(t, ai Figure 7.5: Processing element output link. Each processing element requires one output link for every node in the succeeding layer. Since only one output link can be active at any particular time, the links can share update circuitry, and only the storage for the feed-forward signal, the back-propagated signal, the weight values, and the time delay values must be duplicated. Multiple connections to the same node in the succeeding layer can be implemented as multiple taps off the delay buffers, ai and ai — ai, and extra storage for the additional weights and time delays. For clarity, however, the diagram shows only a single connection. instead of sending the information layer by layer. Fortunately, the symmetry of the network permits the same parallelism to be employed during the back-propagation phase. Furthermore, by replacing each bidirectional bus with two unidirectional busses, both the feed-forward and back-propagation phases can proceed in parallel. The preceding discussion assumes that the bus communication occurs after the sum accumulations. Adding a simple latch to the feed-forward signal, and another to the back-propagation signal, permits even these two operations to proceed in parallel. Figure 7.6 shows the locations 101 Chapter 7. Implementations^ of the latches, which effectively add a one-cycle delay to the signal paths. Now the latches within each node provide the outputs during the bus communication, while the accumulators simultaneously calculate the sums required for the next cycle. The resulting network provides the fastest possible throughput with a completely modular architecture. Figure 7.6: Block diagram of a single processing element with latches, for a parallel digital hardware implementation. For clarity, this diagram omits the input and output links shown in Figure 7.3. The latches permit each layer in the network to operate as one stage in a pipelined architecture. The only restriction for pipelined operation is that each time delay, Tijk, must be greater than or equal to the network cycle time. The network cannot contain any zero-delay connections, of the type present in a conventional back-propagation network. The latches in Figure 7.6 effectively add the one-cycle delay to each Tijk, so the lower limit for the delays in the input and output links is still zero. The pipelining ability is another advantage of the dispersive network over conventional back-propagation, where signals must proceed layer by layer in a serial fashion. The lack of zero-delay connections shouldn't adversely affect the performance of the network, since at worst they will lead to a small time lag between the input and output. The pseudocode in Figure 7.7 summarizes the operation of each node. The constant Lmax represents the largest number of nodes in any network layer, and ffsurn and bpsum represent the values in the feed-forward and back-propagation accumulators, respectively. The operation of the node can viewed as two concurrent processes, one for the feed-forward operation and the other for back-propagation. There must also be some type of global synchronization, so Chapter 7. Implementations ^ - transfer ffsum to feed-forward latch - ffsum = 0 - for WI to L - shift al buffer - for all k do - ffsum = ffsum + wilk oi (t-tja) - end for - update weights and time delays - end for (a) 102 - transfer bpsum to back-propagation latch - bpsum = 0 - for i=1 to L - shift a.) buffer - shift arai buffer - for all k do - bpsum = bpsum + wiik Ai(t-[ai -artiik],a) - end for - update weights and time delays - end for (b) Figure 7.7: Pseudocode for a single processing element. Each PE implements two parallel processes, one for the feed-forward calculations, and the other for the back-propagation calculations. L max is the maximum number of nodes in a layer, and ffsum and bpsum represent the values in the accumulators. The code is for (a) feed-forward operation and (b) back-propagation operation. that each node drives the separate feed-forward and back-propagation busses at the appropriate times. The cycle time of this pipelined architecture scales as 0(Lm„.). Thus, wider networks, which may be required for more complex problems, will be constrained to sample the input signal at a lower frequency than simpler networks. However, the cycle time is independent of the network depth. The storage space required for delayed signals scales approximately with the square of the number of nodes in the network, unlike the serial implementation, which scales linearly with the number of nodes. The cycle time for a similar conventional back-propagation network scales as O(Lm NL), where NL is the number of network layers. However, the computational complexity for each processing element is reduced because only the weights need to be updated. With parallel hardware, a dispersive network processing element can perform the weight and time delay updates concurrently, and achieve approximately the same time complexity as a conventional back-propagation processing element. Chapter 7. Implementations^ 103 7.3 Analog Hardware Implementations The implementation of neural computational systems using analog electronic hardware is still in its infancy, with many challenges remaining [39]. Pioneering work in this area is presently being performed using current-mode CMOS VLSI hardware [2, 3, 40, 75, 108]. Currents within the MOS transistors represent the signals or weights, and dedicated circuitry performs the necessary multiplications and summations. Other work on analog VLSI networks has investigated on-chip learning [13] and the fabrication of analog VLSI synapse matrices [50, 56, 103, 115]. Analog networks using discrete components have been used to control real-time adaptive optics, which cancel distortions in telescopic images caused by turbulence in the earth's atmosphere [34]. A significant problem with integrated analog circuitry is the variation among components that are supposed to be identical. For proper operation, the adaptation techniques presented in Chapter 5 require these components to be closely matched. Some effects of nonidealities in analog network hardware have been investigated in [35], and other adaptation techniques like weight perturbation [55] may help to solve the problem. Perhaps the most serious challenge for dispersive network implementations involves the adaptable time delays. Even fixed delay lines are difficult to manufacture in integrated circuitry, and usually require a large amount of real estate. Adding adaptability by means of a sliding tap implies some type of mechanical coupling, which is beyond current technology. Furthermore, the sum of the feed-forward and back-propagation delays must be a constant value for all links in a given layer. Slight variations in the lengths of the delays will lead to inaccuracies during adaptation, possibly extending the time required for convergence, or even preventing convergence altogether. Perhaps a modified adaptation mechanism could be developed that would accommodate variations in these time-delay sums. For some applications, however, it may not be necessary to have fully adaptable time delays. A software simulation of the proposed network could learn an optimal set of time delays for one representative task from a problem domain. Then the analog network could be fabricated with this fixed set of delays, having only adaptable weights. The adaptable weights would Chapter 7. Implementations^ 104 provide the necessary adjustment for any task or time-varying environment in the problem domain. Another possibility is a hybrid analog/digital system, with digital parameter storage for accuracy, but analog multiplication and summation for speed. Such hybrid systems have been investigated in [24] and [121] for networks without internal time delays. Other possibilities include pulse stream networks [85], which use digital pulses to encode the propagating signals. Chapter 8 Conclusions and Suggestions for Further Research 8.1 Summary The conventional back-propagation network is an extension of the adaptive linear combiner, incorporating nonlinear elements that allow it to be used in applications where linear processors are inadequate. It is well known that back-propagation networks are universal approximators. With a single hidden layer of nonlinear nodes, they can approximate any continuous function if the hidden layer is large enough. More than one hidden layer may be required to approximate the discontinuous inverses of some functions that arise in control applications. Chapter 4 showed how conventional back-propagation networks could be used as nonlinear signal processors. Simulations of a back-propagation network in a chaotic signal prediction task demonstrated its superiority over the adaptive linear filter. However, the increased accuracy came at the expense of a significantly longer training time. The work presented here has extended the back-propagation adaptation technique to dispersive networks with internal adaptable time delays, adding a temporal dimension to the problem. Both the weights and the time delays adapt, using a delayed form of gradient descent to minimize the error at the output. Previously, most dispersive networks used fixed-length time delays that were chosen before training commenced [123, 124]. This thesis describes a natural extension to those techniques. Chapter 5 presented a detailed derivation of the adaptation equations, along with some practical considerations for implementation. These derivations have recently been published in [25] and [26]. Dispersive networks can perform many tasks, including adaptive signal prediction, signal production, channel equalization, and spatio-temporal pattern recognition [42]. Chapter 6 105 Chapter 8. Conclusions and Suggestions for Further Research^ 106 showed that dispersive networks can outperform conventional techniques, including standard back-propagation, when applied to a chaotic signal prediction task. However, the training time can be long, and the use of momentum did not provide any speed improvement during on-line adaptation. Placing a fully trained prediction network in a recurrent configuration allows it to generate a close approximation to the original chaotic training signal. This procedure can be used to make long term statistical predictions about the training signal, and to investigate the structure of its underlying attractor. Chapter 6 also described the application of dispersive networks to a nonlinear channel equalization task, where they again demonstrated their improved accuracy over conventional techniques, although they required longer training times. Related discussions of dispersive networks in some of these application areas appeared in [22] and [27]. Finally, Chapter 7 presented some possible implementations and their associated tradeoffs, including digital versus analog, and serial versus parallel. A detailed description of a parallel digital implementation showed that the resulting architecture can be efficiently pipelined, leading to a higher throughput than can be achieved with a conventional back-propagation network. 8.2 Contributions to the Literature This work presents several contributions to the state of the art in nonlinear network filters. Most of these contributions result from the internal adaptable time delays present in dispersive networks. First, the adaptation technique permits dispersive networks to be trained without the temporal unfolding previously required for time delay neural networks (TDNNs). Temporal unfolding can lead to very large expanded networks, especially when the time delays are adaptable and are not integer multiples of unit delays. Second, in addition to adapting the weights, the continuous-time formulation leads to a simple mechanism for adapting the time delays. In most conventional discrete-time formulations, the time delays are fixed and evenly distributed. In Chapter 6, the adaptable delays led to Chapter 8. Conclusions and Suggestions for Further Research^ 107 increased accuracy in adaptive signal prediction and adaptive channel equalization tasks. Third, in a signal prediction configuration, dispersive networks can perform simultaneous adaptation and prediction, while two identical copies of a conventional back-propagation network must be used to accomplish the same task. Apart from the additional cost of doubling the hardware, the two copies of the network must be very closely matched. Any variation between the components, coupled with the internal nonlinearities, can lead to large discrepancies in the network outputs. Fourth, when learning to predict a chaotic signal, the adaptable time delays permit a dispersive network to find an appropriate embedding space automatically. With other techniques, the embedding space must be chosen based on a priori information or trial-and-error. This ability can be very important when investigating the nature of the attractor underlying a chaotic signal with unknown dimensionality. Fifth, for the Mackey-Glass signal emulation task, the quality of the power spectrum generated by a network with adaptable delays was far superior to the power spectra produced by similar networks with fixed delays. Conventional back-propagation networks were unstable in this configuration, generating high-frequency oscillations. Finally, the new technique is suitable for parallel hardware implementations. In digital implementations, the internal time delays permit efficient pipelining, resulting in a much higher throughput than is possible with conventional back-propagation networks. In parallel implementations, the storage requirements scale with the square of the number of nodes, while in serial implementations, the storage requirements are directly proportional to the number of nodes. 8.3 Topics for Further Research There are many interesting areas where work on dispersive networks can be extended. The following discussion is divided into four sections on applications, increased rates of convergence, implementations, and different network models. Chapter 8. Conclusions and Suggestions for Further Research ^ 108 8.3.1 Other Applications The chaotic signal prediction problem in Chapters 4 and 6 assumed that a noise-free input signal was available. In any physical implementation, it is inevitable that some degree of noise will be present in the input signal, perhaps affecting the adaptation process. The noise might permit the prediction network to find a more robust solution [119], because it will provide training examples in regions of the phase space that do not lie on the attractor. Without noisy input data during training, the network response to an input that does not lie exactly on the attractor could be quite unpredictable. In addition, the noise might help the network to escape from local minima in the error function, leading to a better solution. Simulations of chaotic signals with additive noise could help to confirm or refute these conjectures. It would also be useful to investigate the dispersive network prediction performance on chaotic signals other than that produced by the Mackey-Glass equation. In particular, some chaotic attractors have multiple "lobes", leading to nonstationary signal statistics. The signal can spend much of its time in one lobe, and then switch over to another lobe where it behaves differently. To maintain acceptable prediction accuracy, it is important for the network to be able to adapt quickly to the new behavior. Some nonstationary learning characteristics of linear filters were investigated in [133]. The nonlinear communication channel used as a benchmark in Chapter 6 is defined by very idealized equations, chosen for their mathematical convenience. It would be useful to develop more realistic nonlinear models of real-world communication channels, and compare the performance of dispersive network equalizers with more conventional techniques. In particular, it would be interesting to see how recursive equalizers based on Kalman filters compare with dispersive networks. Other applications to which dispersive networks could be applied include adaptive noise cancelling [130], and spatio-temporal pattern classification [42]. Classifying patterns based on their temporal properties is a very important problem in the area of speech recognition, where dispersive networks could perhaps find some utility [9, 65, 120, 123]. In pattern classification Chapter 8. Conclusions and Suggestions for Further Research^ 109 tasks, it may be important to use an error criterion other than the Euclidean distance between the output and target vectors [6, 41, 112]. It has also been reported that additive noise can help to improve the generalization ability of conventional back-propagation networks [67, 73], and it may lead to similar benefits with dispersive networks. 8.3.2 Increased Rates of Convergence There are some known results about the conditions under which the back-propagation training algorithm converges to a stable solution [63, 64]. It would be very useful, although perhaps difficult, to obtain similar results for dispersive networks with adaptable time delays. Knowledge of these conditions could lead to new techniques for speeding up the convergence, which is presently too slow for some real-time applications. One technique that did not improve convergence was the use of momentum, as discussed in Chapter 6. Simulations using batch-mode adaptation could be performed to find out if the poor performance of momentum was due to the on-line adaptation technique used throughout this investigation. Of course, batch-mode adaptation has additional complexities that must be addressed, such as the accumulation of parameter update values during each batch. Another technique that might increase the rate of convergence is to allow the sigmoid gains (a in equation (4.15)) to adapt along with the other network parameters [62]. This method can speed up conventional back-propagation, but its utility with dispersive networks remains to be investigated. A significant problem in all gradient descent methods involves the choice of an optimal learning rate. The choice is usually made by trial and error, but adaptive techniques for automatically finding the optimum learning rate can bypass this step in the design process and lead to faster adaptation [20, 127]. In [57], Jacobs suggested that each weight in a back-propagation network be given its own learning rate, and that these learning rates adapt separately to provide faster convergence. Variations on this technique have been reported in [31, 78, 116]. All these methods can potentially speed up convergence in dispersive networks as well. Chapter 8. Conclusions and Suggestions for Further Research^ 110 8.3.3 Implementation Issues It would be interesting to build the parallel hardware discussed in Chapter 7, and apply it to real-world problems. With any digital implementation, the effects of signal quantization can lead to variations in performance [38, 46]. For example, if the error is small but nonzero, then weight updates might stop because the calculated weight increment is less than one quantization step. A full understanding of the behavior of digital implementations requires an investigation of how this phenomenon affects dispersive networks with adaptable time delays. One way to avoid quantization problems is to build dispersive networks using analog hardware, but this approach leads to a host of new problems that may be even more difficult to deal with. As discussed in Chapter 7, analog circuits suffer from variations in components that should be identical. The effects of these variations need to be investigated, and new methods of adaptation that are less sensitive to them may need to be developed. Perhaps the most difficult challenge for the analog implementation of dispersive networks lies in building adaptable time delays. Even fixed delays are difficult to build consistently, and they require a large physical area in integrated circuitry. With adaptable delays, the sum of the feed-forward and back-propagation delays must be an accurately determined constant, which might be difficult to achieve in an analog implementation. An investigation of temporal back-propagation's sensitivity to variations in this sum would give an idea of how accurate the analog implementation must be. For some applications, completely adaptable time delays may not be necessary. Training with adaptable delays could be performed in software for a particular problem, and the resulting delays would determine the length of fixed delays in the analog hardware. Then the hardware could learn to solve similar problems with only adaptable weights. Finally, there is the possibility of hybrid implementations, combining both analog and digital techniques. Chapter 8. Conclusions and Suggestions for Further Research^ 111 8.3.4 Other Network Models One exciting possibility for future research is to extend the ideas of dispersive networks and adaptable time delays to other types of network models. For example, the time delays discussed here represent perfect delay elements, but physical delay lines distort and spread out the propagating signal, acting more like filters. It may be possible to derive adaptation techniques for these types of delays as well. When choosing a network topology for a particular application, it is difficult to find the optimal number of layers, hidden nodes, and interconnections. Adequate structures can often be found by trial and error, but the procedure is time consuming and not automatic. In addition to modifying the weights, some adaptation algorithms add [19, 30, 114] or remove [58, 59] nodes and connections to automatically optimize the topology. Similar techniques could be applied to dispersive networks to remove the trial and error phase from the design process. Another interesting project would be to investigate the use of adaptable time delays in recurrently connected networks. There have already been several studies applying back-propagation to recurrent networks [1, 5, 95, 96, 97, 117, 129, 137, 138], and Pearlmutter suggested the use of adaptable time delays in recurrent networks [94]. An advantage of recurrent networks is that their outputs can depend on the input signal from times very distant in the past, but their training procedures usually suffer from either large storage requirements, or the need to perform complex or nonlocal computations. Finally, there has been much recent interest in pulse-stream networks, where the output of a node is represented by a series of pulses, instead of a continuous signal level [17, 77, 84, 85, 86]. The inspiration for these models comes from a study of biological neurons, where the pulse-stream representation is the primary mode of operation [11]. Extending the time delay adaptation methods to pulse-coded networks is a promising area for future exploration. Bibliography [1] L. B Almeida. A learning rule for asynchronous perceptrons with feedback in a combi- natorial environment. In M. Caudil and C. Butler, editors, Proceedings of the IEEE First International Conference on Neural Networks, volume 2, pages 609-618, San Diego, CA, June 1987. [2] Andreas G. Andreou and Kwabena A. Boahen. Synthetic neural circuits using currentdomain signal representations. Neural Computation, 1(4):489-501, 1989. [3] Andreas G. Andreou, Kwabena A. Boahen, Philippe O. Pou li quen, Aleksandra PavasoviC, Robert E. Jenkins, and Kim Strohbehn. Current-mode subthreshold MOS circuits for analog VLSI neural systems. IEEE Transactions on Neural Networks, 2(2):205213, March 1991. [4] A. D. Back and A. C. Tsoi. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3(3):375-385, 1991. [5] Jacob Barhen, Nikzad Toomarian, and Sandeep Gulati. Adjoint operator algorithms for faster learning in dynamical neural networks. In David S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 498-508. San Mateo, CA: Morgan Kaufmann, 1990. [6] Etienne Barnard. Performance and generalization of the classification figure of merit criterion function. IEEE Transactions on Neural Networks, 2(2):322-325, March 1991. [7] H. D. Block. The Perceptron: A model for brain functioning. I. Reviews of Modern Physics, 34:123-135, 1962. [8] Ulrich Bodenhausen and Alex Waibel. The Tempo 2 algorithm: Adjusting time-delays by supervised learning. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 155-161. San Mateo, CA: Morgan Kaufmann, 1991. [9] L. Bottou and F. Fogelman Soulie. Speaker-independent isolated digit recognition: Multilayer perceptrons vs. dynamic time warping. Neural Networks, 3(4):453-465, 1990. [10] Martin L. Brady, Raghu Raghavan, and Joseph Slawny. Back propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665-674, May 1989. [11] Valentino Braitenberg. On the Texture of Brains. Springer-Verlag, New York, 1977. 112 Bibliography^ 113 [12] David J. Burr. Experiments on neural net recognition of spoken and written text. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(7):1162-1168, July 1988. [13] H. C. Card and W. R. Moore. VLSI devices and circuits for neural networks. International Journal of Neural Systems, 1(2):149-165, 1989. [14] Martin Casdagli. Nonlinear prediction of chaotic time series. Physica D 35, pages 335356, 1989. [15] Daniel L. Chester. Why two hidden layers are better than one. In International Joint Conference on Neural Networks 1990, volume 1, pages 265-268, Washington, D. C., January 1990. [16] Joseph E. Collard. A B-P ANN commodity trader. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 551-556. San Mateo, CA: Morgan Kaufmann, 1991. [17] Neil E. Cotter, Kent Smith, and Martin Gaspar. A pulse-width modulation design approach and path-programmable logic for artificial neural networks. In Jonathan Allen and F. Thomson Leighton, editors, Advanced Research in VLSI: Proceedings of the Fifth MIT Conference, March 1988, pages 1-17, Cambridge, MA, 1988. MIT Press. [18] Thomas M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14:326-334, June 1965. [19] R. Scott Crowder, III. Predicting the Mackey-Glass timeseries with cascade-correlation learning. In David S. Touretzky, Jeffrey L. Elman, Terrence J. Sejnowski, and Geoffrey E. Hinton, editors, Proceedings of the 1990 Connectionist Models Summer School, pages 117123, San Mateo, CA, 1990. Morgan Kaufmann. [20] Christian Darken, Joseph Chang, and John Moody. Learning rate schedules for faster stochastic gradient search. In S. Y. Kung, F. Fallside, J. Aa. Sorenson, and C. A. Kamm, editors, Neural Networks for Signal Processing II — Proceedings of the 1992 IEEE Workshop, pages 3-12, 445 Hoes Lane, Piscataway, NJ 08854-4150, 1992. IEEE Press. [21] Charles Darwin. On the Origin of Species. John Murray, London, 1859. [22] Michael R. Davenport and Shawn P. Day. Chaotic signal emulation using a recurrent time delay neural network. In S. Y. Kung, F. Fallside, J. Aa. Sorenson, and C. A. Kamm, editors, Neural Networks for Signal Processing II — Proceedings of the 1992 IEEE Workshop, pages 454-463, 445 Hoes Lane, Piscataway, NJ 08854-4150, 1992. IEEE Press. [23] Shawn P. Day and Daniel S. Camporese. A stochastic training technique for feed-forward neural networks. In International Joint Conference on Neural Networks 1990, volume 1, pages 607-612, San Diego, CA, June 1990. Bibliography^ 114 [24] Shawn P. Day and Daniel S. Camporese. A VLSI neural network with on-chip learning. In David S. Touretzky, Jeffrey L. Elman, Terrence J. Sejnowski, and Geoffrey E. Hinton, editors, Proceedings of the 1990 Connectionist Models Summer School, pages 387-395. San Mateo, CA: Morgan Kaufmann, 1990. [25] Shawn P. Day and Daniel S. Camporese. Continuous-time temporal back-propagation. In International Joint Conference on Neural Networks 1991, volume 2, pages 95-100, Seattle, WA, July 1991. [26] Shawn P. Day and Michael R. Davenport. Continuous-time temporal back-propagation with adaptable time delays. To appear in: IEEE Transactions on Neural Networks, 4(2), March 1993. [27] Shawn P. Day, Michael R. Davenport, and Daniel S. Camporese. Dispersive networks for nonlinear adaptive filtering. In S. Y. Kung, F. Fallside, J. Aa. Sorenson, and C. A. Kamm, editors, Neural Networks for Signal Processing II — Proceedings of the 1992 IEEE Workshop, pages 540-549,445 Hoes Lane, Piscataway, NJ 08854-4150,1992. IEEE Press. [28] Bert De Vries and Jose C. Principe. The Gamma model — A new neural model for temporal processing. Neural Networks, 5(4):565-576, 1992. [29] Scott E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, September 1988. [30] Scott E. Fahlman and Christian Lebiere. The cascade-correlation learning architecture. Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, February 1990. [31] Yan Fang and Terrence J. Sejnowski. Faster learning for dynamic recurrent backpropagation. Neural Computation, 2(3):270-273, 1990. [32] J. Doyne Farmer. Chaotic attractors of an infinite-dimensional dynamical system. Physica D 4, pages 366-393,1982. [33] J. Doyne Farmer and John J. Sidorowich. Predicting chaotic time series. Physical Review Letters, 59(8):845-848, August 1987. [34] William A. Fisher, Robert J. Fujimoto, and Robert C. Smithson. A programmable analog neural network processor. IEEE Transactions on Neural Networks, 2(2):222-229, March 1991. [35] Robert C. Frye, Edward A. Rietman, and Chee C. Wong. Back-propagation learning and nonidealities in analog neural network hardware. IEEE Transactions on Neural Networks, 2(1):110-117, January 1991. [36] Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183-192, 1989. Bibliography^ 115 [37] D. Gabor, W. P. L. Wilby, and R. Woodcock. A universal non-linear filter, predictor and simulator which optimizes itself by a learning process. IEE Proceedings, 108B:422-438, 1961. [38] Richard D. Gitlin, J. E. Mazo, and Michael G. Taylor. On the design of gradient algorithms for digitally implemented adaptive filters. IEEE Transactions on Circuit Theory, CT-20(2):125-136, March 1973. [39] Karl Goser, Ulrich Hilleringmann, Ulrich Rueckert, and Klaus Schumacher. VLSI technologies for artificial neural networks. IEEE Micro, 9(6):28-44, December 1989. [40] Hans P. Graf and Paul deVegvar. A CMOS implementation of a neural network model. In Paul Losleben, editor, Advanced Research in VLSI: Proceedings of the 1987 Stanford Conference, pages 351-367, Cambridge, MA, 1987. MIT Press. [41] John B. Hampshire, II and Alexander H. Waibel. A novel objective function for improved phoneme recognition using time-delay neural networks. IEEE Transactions on Neural Networks, 1(2):216-228, June 1990. [42] Robert Hecht-Nielson. Neurocomputing. Addison-Wesley, Reading, MA, 1990. [43] John Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1991. [44] Geoffrey E. Hinton. Connectionist learning procedures. Artificial Intelligence, 40:185234, 1989. [45] Geoffrey E. Hinton. How neural networks learn from experience. Scientific American, 267(3):144-151, September 1992. [46] Paul W. Hollis, John S. Harper, and John J. Paulos. The effects of precision constraints in a backpropagation learning network. Neural Computation, 2(3):363-373, 1990. [47] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991. [48] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989. [49] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5):551-560, 1990. [50] W. Hubbard, D. Schwartz, J. Denker, H. P. Graf, R. Howard, L. Jackel, B. Straughn, and D. Tennant. Electronic neural networks. In John S. Denker, editor, Neural Networks for Computing, Snowbird, UT, A IP Conference Proceedings 151, pages 227-234, New York, 1986. American Institute of Physics. Bibliography^ 116 [51] Don R. Hush and Bill G. Horne. Progress in supervised neural networks. IEEE Signal Processing Magazine, 10(1):8-39, January 1983. [52] Yoshifusa Ito. Approximation of functions on a compact set by finite sums of a sigmoid function without scaling. Neural Networks, 4(6):817-826, 1991. [53] Yoshifusa Ito. Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory. Neural Networks, 4(3):385-394, 1991. [54] Yoshifusa Ito. Approximation of continuous functions on Rd by linear combinations of shifted rotations of a sigmoid function with and without scaling. Neural Networks, 5(1):105-115, 1992. [55] Marwan Jabri and Barry Flower. Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. Neural Computation, 3(4):546-565, 1991. [56] L. D. Jackel, H. P. Graf, and R. E. Howard. Electronic neural network chips. Applied Optics, 26(23):5077-5080, December 1987. [57] Robert A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295-307, 1988. [58] Chuanyi Ji, Robert R. Snapp, and Demetri Psaltis. Generalizing smoothness constraints from discrete samples. Neural Computation, 2(2):188-197, 1990. [59] Ehud D. Karnin. A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2):239-242, June 1990. [60] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics, 23:462-466, 1952. [61] Takashi Kimoto, Kazuo Asakawa, Morio Yoda, and Masakazu Takeoka. Stock market prediction system with modular neural networks. In International Joint Conference on Neural Networks 1990, volume 1, pages 1-6, Washington, D. C., January 1990. [62] John K. Kruschke and Javier R. Movellan. Benefits of gain: Speeded learning and minimal hidden layers in back-propagation networks. IEEE Transactions on Systems, Man, and Cybernetics, 21(1):273-280, January/February 1991. [63] Chung-Ming Kuan and Kurt Hornik. Convergence of learning algorithms with constant learning rates. IEEE Transactions on Neural Networks, 2(5):484-489, September 1991. [64] Harold J. Kushner and Dean S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York, 1978. [65] Kevin J. Lang, Alex H. Waibel, and Geoffrey E. Hinton. A time-delay neural network architecture for isolated word recognition. Neural Networks, 3(1):23-43, 1990. Bibliography^ 117 [66] Alan Lapedes and Robert Farber. Nonlinear signal processing using neural networks: Prediction and system modeling. Technical Report LA-UR-87-2662, Los Alamos National Laboratory, July 1987. [67] Lasse HolmstrOm and Petri Koistinen. Using additive noise in back-propagation training. IEEE Transactions on Neural Networks, 3(1):24-38, January 1992. [68] Y. Le Cun. A learning scheme for asymmetric threshold networks. In Proceedings Cognitiva 85, pages 599-604, Paris, 1985. [69] Richard P. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, pages 4-22, April 1987. [70] Odile Macchi and Eweda Eweda. Second-order convergence analysis of stochastic adaptive linear filtering. IEEE Transactions on Automatic Control, AC-28(1):76-85, January 1983. [71] Michael C. Mackey and Leon Glass. Oscillation and chaos in physiological control systems. Science, 197:287-289, July 1977. [72] John Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561580, April 1975. [73] Kiyotoshi Matsuoka. Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics, 22(3):436-440, May/June 1992. [74] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133, 1943. [75] Carver Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989. [76] W. C. Mead, R. D. Jones, Y. C. Lee, C. W. Barnes, G. W. Flake, L. A. Lee, and M. K. O'Rourke. Using CNLS-Net to predict the Mackey-Glass chaotic time series. In International Joint Conference on Neural Networks 1991, volume 2, pages 485-490, Seattle, WA, July 1991. [77] Jack L. Meador, Angus Wu, Clint Cole, Novat Nintunze, and Pichet Chintrakulchai. Programmable impulse neural circuits. IEEE Transactions on Neural Networks, 2(1):101109, January 1991. [78] Ali A. Minai and Ronald D. Williams. Back-propagation heuristics: A study of the extended delta-bar-delta algorithm. In International Joint Conference on Neural Networks 1990, volume 1, pages 595-600, San Diego, CA, June 1990. [79] Marvin L. Minsky and Seymour A. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, 1969. [80] John Moody. Fast learning in multi-resolution hierarchies. In David S. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 29-39. San Mateo, CA: Morgan Kaufmann, 1989. Bibliography^ 118 [81] John Moody and Christian Darken. Learning with localized receptive fields. In David S. Touretzky, Geoffrey E. Hinton, and Terrence J. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 133-143. Morgan Kaufmann, 1988. [82] John Moody and Christian J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281-294, 1989. [83] M. C. Mozer. A focussed backpropagation algorithm for temporal pattern recognition. Complex Systems, 3:349-381, 1989. [84] Alan F. Murray. Pulse arithmetic in VLSI neural networks. IEEE Micro, 9(6):64-74, December 1989. [85] Alan F. Murray, Dante Del Corso, and Lionel Tarassenko. Pulse-stream VLSI neural networks mixing analog and digital techniques. IEEE Transactions on Neural Networks, 2(2):193-204, March 1991. [86] Alan F. Murray and Anthony V. W. Smith. Asynchronous VLSI neural networks using pulse-stream arithmetic. IEEE Journal of Solid-State Circuits, 23(3):688-697, June 1988. [87] Kumpati S. Narendra and Kannan Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4-27, March 1990. [88] Kumpati S. Narendra and Kannan Parthasarathy. Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Transactions on Neural Networks, 2(2):252-262, March 1991. [89] Steven J. Nowlan and Geoffrey E. Hinton. Simplifying neural networks by soft weightsharing. Neural Computation, 4:473-493, 1992. [90] Alan V. Oppenheim and Ronald W. Schafer. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, New Jersey, 1975. [91] N. H. Packard, J. P. Crutchfield, J. D. Farmer, and R. S. Shaw. Geometry from a time series. Physical Review Letters, 45(9):712-716, September 1980. [92] D. B. Parker. Learning-logic. Technical Report TR-47, Sloan School of Management, M.I.T., Cambridge, MA, 1985. [93] Barak A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263-269, 1989. [94] Barak A. Pearlmutter. Dynamic recurrent neural networks. Technical Report CMU-CS90-196, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, December 1990. [95] Fernando J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19):2229-2232, November 1987. Bibliography^ 119 [96] Fernando J. Pineda. Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Computation, 1(2):161-172, 1989. [97] Fernando J. Pineda. Time dependent adaptive neural networks. In David S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 710-718. San Mateo, CA: Morgan Kaufmann, 1990. [98] John C. Platt and Federico Faggin. Networks for the separation of sources that are superimposed and delayed. In R. Lippmann J. Moody, S. Hanson, editor, Advances in Neural Information Processing Systems 4, pages 730-737. San Mateo, CA: Morgan Kaufmann, 1992. [99] Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951. [100] Richard A. Roberts and Clifford T. Mullis. Digital Signal Processing. Addison-Wesley, Don Mills, Ontario, 1987. [101] David Rogers. Predicting weather using a genetic memory: a combination of Kanerva's sparse distributed memory with Holland's genetic algorithms. In David S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 455-464. San Mateo, CA: Morgan Kaufmann, 1990. [102] Frank Rosenblatt. Principles of Neurodynamics. Spartan Books, New York, 1962. [103] Olivier Rossetto, Christian Jutten, Jeanny Herault, and Ingo Kreuzer. Analog VLSI synaptic matrices as building blocks for neural networks. IEEE Micro, 9(6):56-63, 1989. [104] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, chapter 8, pages 318-362. MIT Press, Cambridge, MA, 1986. [105] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(9):533-536, October 1986. [106] Terence D. Sanger. A tree-structured adaptive network for function approximation in high-dimensional spaces. IEEE Transactions on Neural Networks, 2(2):285-293, March 1991. [107] Terence D. Sanger. A tree-structured algorithm for reducing computation in networks with separable basis functions. Neural Computation, 3(1):67-78, 1991. [108] Massimo A. Sivilotti, Michael R. Emerling, and Carver A. Mead. VLSI architectures for implementations of neural networks. In John S. Denker, editor, Neural Networks for Computing, Snowbird, UT, A IP Conference Proceedings 151, pages 408-413, New York, 1986. American Institute of Physics. Bibliography^ 120 [109] Eduardo D. Sontag. Sigmoids distinguish better than heavisides. Neural Computation, 1:470-472, 1989. [110] Eduardo D. Sontag. Feedback stabilization using two-hidden-layer nets. Technical Report SYCON-90-11, Rutgers Center for Systems and Control, October 1990. [111] Eduardo D. Sontag. Remarks on interpolation and recognition using neural nets. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 939-945. San Mateo, CA: Morgan Kaufmann, 1991. [112] Eduardo D. Sontag and Hector J. Sussmann Back propagation separates where perceptrons do. Neural Networks, 4(2):243-249, 1991. [113] Ferrel G. Stremler. Introduction to Communication Systems. Addison-Wesley, Reading, Massachusetts, second edition edition, 1982. [114] Manoel F. Tenorio and Wei-Tsih Lee. Self-organizing network for optimum supervised learning. IEEE Transactions on Neural Networks, 1(1):100-110, March 1990. [115] A. P. Thakoor, A. Moopenn, John Lambe, and S. K. Khanna. Electronic hardware implementations of neural networks. Applied Optics, 26(23):5085-5092, December 1987. [116] Tom Tollenaere. SuperSAB: Fast adaptive back propagation with good scaling properties. Neural Networks, 3(5):561-573, 1990. [117] N. Toomarian and J. Barhen. Adjoint-functions and temporal learning algorithms in neural networks. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 113-120. San Mateo, CA: Morgan Kaufmann, 1991. [118] Philip Treleaven, Marco Pacheco, and Marley Vellasco. VLSI architectures for neural networks. IEEE Micro, 9(6):8-27, December 1989. [119] T. Troudet and W. Merrill. Neuromorphic learning of continuous-valued mappings from noise-corrupted data. IEEE Transactions on Neural Networks, 2(2):294-301, March 1991. [120] K. P. Unnikrishnan, John J. Hopfield, and David W. Tank Connected-digit speakerdependent speech recognition using a neural network with time-delayed connections. IEEE Transactions on Signal Processing, 39(3):698-713, March 1991. [121] Michel Verleysen, Bruno Sirletti, Andre M. Vandemeulebroecke, and Paul G. A. Jespers. Neural networks for high-storage content-addressable memory: VLSI circuit and learning algorithm. IEEE Journal of Solid-State Circuits, 24(3):562-568, June 1989. [122] Vera Kiirkova. Kolmogorov's theorem is relevant. Neural Computation, 3(4):617-622, 1991. Bibliography^ 121 [123] Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328-339, March 1989. [124] Eric A. Wan. Temporal backpropagation: An efficient algorithm for finite impulse response neural networks. In David S. Touretzky, Jeffrey L. Elman, Terrence J. Sejnowski, and Geoffrey E. Hinton, editors, Proceedings of the 1990 Connectionist Models Summer School, pages 131-137, San Mateo, CA, 1990. Morgan Kaufmann. [125] Eric A. Wan. Temporal backpropagation for FIR neural networks. In International Joint Conference on Neural Networks 1990, volume 1, pages 575-580, San Diego, CA, June 1990. [126] Andreas S. Weigend, Bernardo A. Huberman, and David E. Rumelhart. Predicting the future: A connectionist approach. International Journal of Neural Systems, 1(3):193-209, 1990. [127] Michael K. Weir. A method for self-determination of adaptive learning rates in back propagation. Neural Networks, 4(3):371-379, 1991. [128] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974. [129] Paul J. Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560, 1990. [130] Bernard Widrow, John R. Glover, Jr., John M. McCool, John Kaunitz, Charles S. Williams, Robert H. Hearn, James R. Zeidler, Eugene Dong, Jr., and Robert C. Goodlin Adaptive noise cancelling: Principles and applications. Proceedings of the IEEE, 63(12):1692-1716, December 1975. [131] Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. In 1960 IRE WESCON Convention Record, pages 96-104, New York, 1960. IRE. [132] Bernard Widrow and Michael A. Lehr. 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proceedings of the IEEE, 78(9):1415-1442, 1990. [133] Bernard Widrow, John M. McCool, Michael G. Larimore, and C. Richard Johnson, Jr. Stationary and nonstationary learning characteristics of the LMS adaptive filter. Proceedings of the IEEE, 64(8):1151-1162, August 1976. [134] Bernard Widrow and Samuel D. Stearns. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, New Jersey, 1985. [135] Bernard Widrow and Rodney Winter. Neural nets for adaptive filtering and adaptive pattern recognition. Computer, 21(3):25-39, March 1988. [136] N. Wiener. Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications. MIT Press, Cambridge, MA, 1949. Bibliography^ 122 [137] Ronald J. Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4):490-501, 1990. [138] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989. [139] James R. Zeidler. Performance analysis of LMS adaptive prediction filters. Proceedings of the IEEE, 78(12):1780-1806, 1990. Appendix A Momentum Gradient descent often converges very slowly for many problems of practical interest. To speed convergence, a simple technique known as "momentum" was first employed in [104]. Adaptation with momentum depends on previous weight updates, as well as the direction of the gradient, leading to the following discrete-time formula: if n < 0 0^ wii(t„ ) =^OE^ —(1 — a)p-071 —(t n ) aAwij(t n _i) if n > O. (A.1) Although the gradient usually must be estimated, it is not necessary to complicate the notation with that detail here. The weights change at discrete points in time with a period of T = t o — t n _i. The integer, n, is the step number, and the momentum parameter, a, must lie in the range 0 < a < 1. Equation (A.1), which describes a discrete-time first order low-pass filter, can also be written as LIWij(tn)::: —(1 — a) E C n t/1=1 (tn tm), m=0^ulv (A .2) with n > O. Thus, Awij is an exponentially weighted sum of past gradient values. Looking at the special case where the gradient is constant gives oe^am n (1 — a)/27-7, lim Awii(tn) = —lim n—■co n--■oo utik? m=0 \ OE 1 —(1— °IP Owii 1 — a OE n^• uwii (A.3) Therefore, for smooth error surfaces, where 0 2 E/Otqj^0, the weight changes follow simple 123 Appendix A. Momentum^ 124 gradient descent. Substituting the continuous variables t and t' for t n, and t, gives a continuoustime version of equation (A.2): dwij^t - = –K(1 – a) J a ti f T iz(t – t')dt'.^(A.4) dt at% Here, K is a gain constant chosen according to the constant gradient criterion. If 0 2 e/04 = 0, then dwij lim t—,00 dt – lim K(1 – a)(9e Ii510i j 0f t at'IT t^ —.00 1–a OE = KT ^ IL^• Ina Owij (A.5) Approximating gradient descent for smooth error surfaces requires that KT 1– a OE^OE ^=^,^ Ina awij^awij (A.6) SO Inaa K^ T(1 – a) (A.7) and finally, dwij^in a f t a tia OS #^di, dt T o^owij – In a ft ,n l a .„ O E^^ –^ exp (--z )p—kt – t „)ar T 0^T^Owij (A.8) Equation (A.8) represents a simple first order low pass filter with a time constant of –T/ In a and an impulse response of In a^In a^ T t)u(t), h(t) = T --exp(— (A.9) where u(t) is the unit step function, given by u(t) = 0 if t < 0 1 if t > O. (A.10) Differentiating equation (A.8) with respect to time gives –T d 2 wij^dw•-^OE Un a dt 2 =^d; 3^(t) atv ii^7 (A.11) 125 Appendix A. Momentum^ which is analogous to equation (4.11). The relationship between j in (4.11) and the momentum parameter, a, is 17 and 77 is positive since 0 < a < 1. = T In a (A.12) , Appendix B Estimating the Mean Value of a Signal Adaptation in back-propagation networks with bipolar sigmoidal nonlinearities occurs fastest for input signals with zero mean [12]. Therefore, it is desirable to preprocess the signal by removing its mean value before feeding it into the network. If the mean is not known in advance, it can be estimated using some type of on-line adaptive technique. One simple possibility is to maintain a running average of the input signal vector, and subtract it from the input before feeding it into the network. In discrete time, the input signal can be represented as a sequence of vectors, In , with n = 1,2, 3, ... being the sample number. The running average, denoted by (I) n , can be updated with each sample as (I) n+i = (I) n + p n (I n — (I) n ), (B.1) where 0 < p„ < 1 is a rate parameter that can vary with n. If pn, = 1/n, then M t, will be the simple average of all previous input vectors. On the other hand, if p n is constant, then (I) n represents an exponentially weighted average of past input vectors, with the heaviest weighting on the most recent. Constant values of p can be useful for some types of nonstationary input signals that have a slowly varying mean. The time scale on which the mean varies determines the best value for p. Other schedules for changing pn, may be useful for different types of nonstationary input signals. 126 Appendix C Linear Interpolation of Delayed Signal Derivatives In digital implementations of dispersive networks, it is often necessary to estimate delayed signal values and time derivatives that fall between stored sample points. For the signal values, it is a simple matter to interpolate linearly between the sample points on either side of the desired value, but estimating the time derivative of the signal at a particular point requires a more complex technique. Figure C.1 shows several stored sample points from a signal, o(t), and the task is to estimate the signal derivative at a particular delay, ro. The derivative can easily be estimated for delays o(t) Delay Ti+2 T+1 Values Time to Figure C.1: Stored sample points from a continuous signal. Time is shown along the horizontal axis, with the length of the delay, r, increasing to the left. The task is to estimate the time derivative of the signal at a particular delay, r0 , which may lie between sample points. 127 Appendix C. Linear Interpolation of Delayed Signal Derivatives ^ 128 that are midway between sample points, such as a delay of r + 1/2: o'(t — — 1/2) = h(ri) — h(ri + 1).^(C.1) Here, h(r) o(t — r) for notational convenience. For values of ro that are not midway between sample points, interpolation can be used between derivative estimates on either side of the required delay. To begin, the delay, ro, can be decomposed into the sum of an integer portion, Ti, and a fractional portion, rf , where T0 =^rf • ^ (C.2) There are two cases to consider. First, if Tf < 0.5, then the two nearest estimated derivatives are h'(Ti + 1/ 2 ) = h(Ti) — h(ri + 1), h'(Ti — 1/2) = h(ri — 1) — h(ri), and the required derivative can be interpolated as h'(ro ) = h'(Ti Tf) (Tf 1/2)W(ri + 1/2) + (1/2 — rf)11/(ri — 1/2) (Tf +1/2)[h(ri) — h(ri + 1)] + (1/2 — rf)[h(ri — 1) — h(ri)].^(C.3) In the second case, Tf > 0.5, and the two nearest derivative estimates are + 1/2) = h(ri)— h(ri +1), h'(Ti + 3/2) = h(ri + 1) — h(ri + 2). Here, the required interpolation gives h' (ro ) = h'(Ti + rf) = (Tf — 1/2)hVi + 3/2) + (3/2 — rf)hVi + 1/2) = (Tf — 1/2)[h(Ti + 1) — h(ri + 2)] + (3/2 — rf)[h(ri) — h(ri + 1)].^(C.4) In both cases, the required derivative estimate is obtained from a weighted average of derivative estimates on either side of To. Appendix D Third-Order Low-Pass Butterworth Digital Filter This Appendix describes the low-pass Butterworth digital filter used to provide a band-limited input for the channel equalization problem in Chapter 6. The filter was designed using the bilinear transform method [100], which guarantees stability. To begin, an analog, third-order, low-pass Butterworth filter with unity gain has the transfer function (s) = ^ H^(s^t c )(s 2 sfi c f2D' (D.1) where s represents complex frequency, and it, is the cutoff frequency. The bilinear transform maps the complex s-plane into the complex z-plane according to z= W(s)= 1+s , 1—s (D.2) which causes a "frequency warping" given by 2 w(ft) = to — arctan (D.3) where fl represents the frequency of the analog filter and co represents the frequency of the digital filter. If coo = 2r/t o is the sampling frequency of the digital filter, then c,./ increases from —(J.)0 to w0 /2 as S-2 increases continuously from —co to +oo. The analog filter must be designed with a cutoff frequency that, when warped by equation (D.3), will yield the desired cutoff frequency for the digital filter. Therefore, the "inverse warp" of equation (D.3) is fl(w) = tan(wt o /2).^ For the digital filter in Chapter 6, the desired cutoff frequency was = 16to ' 129 (D.4) Appendix D. Third-Order Low-Pass Butterworth Digital Filter ^ 130 so the cutoff frequency of the analog filter must be fic = 12(coc) = tan(ir/32) r.,-, 0.0985 in equation (D.1). The inverse of equation (D.2) is z —1 8 = 1-1 (z) = z + 1 (D.5) , so that the transfer function of the desired digital filter is A + Bz -1 + Cz -2 + Dz -3 H() z = fi i z - 1) = z-F 1^1 A-Ez -1 A-Fz -2 + Gz--3 • (D.6) Solving for the constants in equation (D.6) gives A = 0.000727 B = 0.00218 C = 0.00218 D = 0.000727 E = -2.327 F = 2.072 G = -0.686. Figure D.1 shows the resulting signal flow graph, which represents a stable infinite-impulseresponse (IIR) filter. Appendix D. Third-Order Low-Pass Butterworth Digital Filter ^ Figure D.1: Signal flow graph for a third-order digital low-pass Butterworth filter. The input signal enters from the left, and the connections labeled z -1 represent unit-delay elements. The structure is a stable infinite-impulse-response (IIR) filter. 131
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Dispersive neural networks for adaptive signal processing
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Dispersive neural networks for adaptive signal processing Day, Shawn P. 1993
pdf
Page Metadata
Item Metadata
Title | Dispersive neural networks for adaptive signal processing |
Creator |
Day, Shawn P. |
Date Issued | 1993 |
Description | Back-propagation is a popular method for training feed-forward neural networks. This thesis extends the back-propagation technique to dispersive networks, which contain internal delay elements. Both the delays and the weights adapt to minimize the error at the output. Dispersive networks can perform many tasks, including signal prediction, signal production, channel equalization, and spatio-temporal pattern recognition. For comparison with conventional techniques, a dispersive network was trained to predict future values of a chaotic signal using only its present value as an input. With adaptable delays, the network had less than half the prediction error of an identical network with fixed delays, and about one-quarter the error of a conventional back-propagation network. Moreover, a dispersive network can simultaneously adapt and predict, while a conventional network cannot. After training as a predictor, the network was placed in a signal production configuration, where it autonomously generated a close approximation to the training signal. The power spectrum of the network output was a good reproduction of the training signal spectrum. Networks with fixed time delays produced much less accurate power spectra, and conventional back-propagation networks were unstable, generating high-frequency oscillations. Dispersive networks also showed an improvement over conventional techniques in an adaptive channel equalization task, where the channel transfer function was nonlinear. The adaptable delays in the dispersive network allowed it to reach a lower error than other equalizers, including a conventional back-propagation network and an adaptive linear filter. However, the improved performance came at the expense of a longer training time. Dispersive networks can be implemented in serial or parallel form, using digital electronic circuitry. Unlike conventional back-propagation networks, they can operate in a fully pipelined fashion, leading to a higher signal throughput. Their implementation in analog hardware is a promising area for future research. |
Extent | 6237316 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2008-09-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0064917 |
URI | http://hdl.handle.net/2429/2255 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1993-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1993_spring_phd_day_shawn.pdf [ 5.95MB ]
- Metadata
- JSON: 831-1.0064917.json
- JSON-LD: 831-1.0064917-ld.json
- RDF/XML (Pretty): 831-1.0064917-rdf.xml
- RDF/JSON: 831-1.0064917-rdf.json
- Turtle: 831-1.0064917-turtle.txt
- N-Triples: 831-1.0064917-rdf-ntriples.txt
- Original Record: 831-1.0064917-source.json
- Full Text
- 831-1.0064917-fulltext.txt
- Citation
- 831-1.0064917.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0064917/manifest