UBC Theses and Dissertations
Dispersive neural networks for adaptive signal processing Day, Shawn P. 1993



Full Text

DISPERSIVE NEURAL NETWORKS FOR ADAPTIVE SIGNAL PROCESSING

By Shawn P. Day
B. A. Sc. (Electrical Engineering), University of British Columbia, 1988

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (ELECTRICAL ENGINEERING)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
February 1993
© Shawn P. Day, 1993

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Electrical Engineering
The University of British Columbia
2075 Wesbrook Place
Vancouver, Canada
V6T 1W5

Abstract

Back-propagation is a popular method for training feed-forward neural networks. This thesis extends the back-propagation technique to dispersive networks, which contain internal delay elements. Both the delays and the weights adapt to minimize the error at the output.

Dispersive networks can perform many tasks, including signal prediction, signal production, channel equalization, and spatio-temporal pattern recognition. For comparison with conventional techniques, a dispersive network was trained to predict future values of a chaotic signal using only its present value as an input. With adaptable delays, the network had less than half the prediction error of an identical network with fixed delays, and about one-quarter the error of a conventional back-propagation network.
Moreover, a dispersive network can simultaneously adapt and predict, while a conventional network cannot.

After training as a predictor, the network was placed in a signal production configuration, where it autonomously generated a close approximation to the training signal. The power spectrum of the network output was a good reproduction of the training signal spectrum. Networks with fixed time delays produced much less accurate power spectra, and conventional back-propagation networks were unstable, generating high-frequency oscillations.

Dispersive networks also showed an improvement over conventional techniques in an adaptive channel equalization task, where the channel transfer function was nonlinear. The adaptable delays in the dispersive network allowed it to reach a lower error than other equalizers, including a conventional back-propagation network and an adaptive linear filter. However, the improved performance came at the expense of a longer training time.

Dispersive networks can be implemented in serial or parallel form, using digital electronic circuitry. Unlike conventional back-propagation networks, they can operate in a fully pipelined fashion, leading to a higher signal throughput.
Their implementation in analog hardware is a promising area for future research.

Table of Contents

Abstract
List of Tables
List of Figures
Glossary
Acknowledgement

1 Introduction
1.1 Adaptive Systems
1.1.1 Adaptive Prediction
1.1.2 Adaptive System Identification
1.1.3 Inverse Modeling or Equalization
1.1.4 Adaptive Interference Canceling
1.2 Artificial Neural Networks
1.3 Thesis Outline

2 The Adaptive Linear Combiner
2.1 Structure
2.2 Operation
2.2.1 Error Measure
2.2.2 Gradient of the Error Measure
2.2.3 Gradient Search
2.2.4 The LMS Algorithm
2.2.5 Random Local Search
2.3 Adaptive Linear Transversal Filter
2.4 ALC Classifier
2.5 Limitations of the ALC

3 ALC Networks
3.1 Multidimensional Outputs
3.2 Networks with Multiple Layers
3.3 Back-Propagation

4 Nonlinear Adaptive Network Filters
4.1 Continuous-Time LMS Adaptation
4.2 Continuous-Time Back-Propagation
4.3 The Mackey-Glass Signal Prediction Task
4.4 Simulation Details
4.4.1 Training Configuration
4.4.2 Network Topology and Parameters
4.4.3 Numerical Integration Technique
4.4.4 Performance Measure
4.5 Prediction Results

5 Dispersive Networks
5.1 Unfolding in Time
5.2 Temporal Back-Propagation
5.2.1 Notation
5.2.2 The Error, T, and its Functional Derivative
5.2.3 Back-Propagated Error Signals
5.3 Gradient Descent Adaptation
5.4 Relationship To Other Work
5.5 Practical Considerations
5.5.1 Exact Delayed Error Gradient
5.5.2 Approximate Delayed Error Gradient
5.5.3 Choosing the Error Signal Ages

6 Applications of Dispersive Networks
6.1 Adaptive Signal Prediction
6.1.1 Network Topology and Implementation Details
6.1.2 Embedding Dimension
6.1.3 Prediction Results
6.1.4 True Adaptive Prediction
6.1.5 Momentum
6.2 Signal Production
6.2.1 Learning Progression
6.2.2 Performance
6.3 Adaptive Channel Equalization
6.3.1 Network Topology and Parameters
6.3.2 Equalization Results
6.4 Signal Detection/Classification

7 Implementations
7.1 Software Implementations
7.2 Parallel Digital Implementations
7.2.1 Operation
7.2.2 Adaptable Weights and Time Delays
7.2.3 Pipelining
7.3 Analog Hardware Implementations

8 Conclusions and Suggestions for Further Research
8.1 Summary
8.2 Contributions to the Literature
8.3 Topics for Further Research
8.3.1 Other Applications
8.3.2 Increased Rates of Convergence
8.3.3 Implementation Issues
8.3.4 Other Network Models

Bibliography

Appendices
A Momentum
B Estimating the Mean Value of a Signal
C Linear Interpolation of Delayed Signal Derivatives
D Third-Order Low-Pass Butterworth Digital Filter

List of Tables

3.1 On-Line Back Propagation Equations
5.1 Exact Dispersive Network Learning Equations
5.2 Approximate Dispersive Network Learning Equations
5.3 Storage Requirements

List of Figures

1.1 Adaptive Processor
1.2 Applications of Adaptive Processors 1
1.3 Applications of Adaptive Processors 2
2.1 Adaptive Linear Combiner (ALC)
2.2 ALC Transfer Function
2.3 Error Surface for a Single Weight
2.4 Error Surface for Two Weights
2.5 Adaptive Linear Transversal Filter
2.6 ALC Classifier
2.7 Classification Boundary for a Two-Input ALC Classifier
3.1 Single Layer ALC Network
3.2 Two Layer ALC Network
3.3 Sigmoidal Nonlinearity
3.4 Two-Layer Back-Propagation Network
4.1 Mackey-Glass Signal
4.2 Signal Prediction
4.3 Transversal Network Topology
4.4 Mackey-Glass Signal Prediction Learning Curves
4.5 Mackey-Glass Prediction Results
5.1 Unfolding in Time
5.2 Conventional Back-Propagation Network Connection
5.3 Dispersive Network Connection
5.4 Conventional Back-Propagation Network Node
5.5 Dispersive Network Node
5.6 Section of a Dispersive Network
5.7 Perturbation Function
5.8 Direct Implementation of Dispersive Networks
5.9 Efficient Implementation of Dispersive Networks
5.10 Dispersive Network Optimized for Storage Efficiency
6.1 Dispersive Network Architecture for Prediction
6.2 Learning Curves for Chaotic Signal Prediction
6.3 True Adaptive Prediction with Two Networks
6.4 True Adaptive Prediction with a Dispersive Network
6.5 Learning Curves for True Adaptive Prediction
6.6 Learning Curves for Training With Momentum
6.7 Network in a Signal Production Configuration
6.8 Learning Progression of a Signal Production Network
6.9 Mackey-Glass Chaotic Attractor
6.10 Chaotic Attractors Near Start of Learning
6.11 Chaotic Attractors Near End of Learning
6.12 Chaotic Attractors for Different Networks
6.13 Short Term Emulation of the Mackey-Glass Signal
6.14 Long Term Emulation of the Mackey-Glass Signal
6.15 Network-Generated Power Spectra
6.16 Network-Generated Low-Frequency Power Spectra
6.17 Channel Equalization
6.18 Nonlinear Communication Channel
6.19 Equalization Learning Curves
7.1 Block Diagram of a Digital Dispersive Network
7.2 Unique and Shared Signals in a Dispersive Network
7.3 Block Diagram of a Single Processing Element
7.4 Processing Element Input Link
7.5 Processing Element Output Link
7.6 Single Processing Element with Latches
7.7 Pseudocode for a Single Processing Element
C.1 Stored Samples of a Continuous Signal
D.1 Signal Flow Graph for a Low-Pass Butterworth Filter

Glossary

This list defines terms that are used throughout the thesis. They may have broader definitions in other contexts. Within each definition, words or phrases in italics refer to additional glossary entries.

accumulator. A data storage element that maintains a running sum of its past inputs.
The sum can be reset to zero when required.

action potential. An electrical signal that propagates along an axon, transmitting information from one neuron to other neurons.

ADALINE. See adaptive linear neuron.

adaptive filter. A filter that can alter its transfer function, usually through performance feedback adaptation.

adaptive interference canceler. An adaptive system designed to remove noise from a signal.

adaptive linear combiner. An adaptive system whose output signal is a weighted sum of its inputs. The weights can adapt to alter its behavior.

adaptive linear element. See adaptive linear combiner.

adaptive linear filter. An adaptive linear combiner whose weights multiply delayed versions of its input signal, forming a transversal filter. An adaptive linear filter can respond to temporal information in its input signal.

adaptive linear neuron. An adaptive linear combiner cascaded with a binary quantizer. An adaptive linear neuron can operate as a classifier for problems that are linearly separable.

adaptive signal processing. The use of adaptive systems to perform filtering and other signal processing tasks.

adaptive system. A system that can modify its internal parameters based on environmental feedback.

age. The age of a node in a dispersive network is the round trip time required for a signal to propagate from the node to a network output, and for the resulting error signal to propagate back.

ALC. See adaptive linear combiner.

artificial neural network. A model or simulation of a neural network, often used for investigating its properties and gaining insight into biological computation. In broader contexts, including this thesis, an artificial neural network is any interconnected group of simple processing elements, without any necessary relationship to biological neural networks. See also network.

associative memory. A storage device where stored patterns are retrieved based on their similarity to an input or prompting pattern.

attractor.
An equilibrium state or set of states to which a dynamical system converges.

axon. A nerve fiber through which neurons transmit action potentials.

back-propagation. A training technique for feed-forward networks that minimizes the output error using gradient descent.

back-propagation network. A network that is trained using the back-propagation technique.

band-limited signal. A signal whose energy is restricted to a finite frequency range.

bandwidth. The difference in frequency between the highest and lowest frequency components of a signal.

batch-mode adaptation. A method for training a network, where the internal parameters are updated only after all the training data has been presented. Each such presentation is called a batch or an epoch. Compare with on-line adaptation.

bilinear transform. A frequency-warping function used in the design of digital filters. Its purpose is to eliminate inaccuracies caused by a finite sampling frequency.

binary quantizer. An element with a continuous-valued input, whose output can take on one of two states. See signum function.

bipolar sigmoid. A sigmoidal function whose output can take on both positive and negative values.

bus. One or more conductors or communication channels used for transmitting signals or power.

Butterworth filter. A low-pass filter designed for maximally flat frequency response below its cutoff frequency.

causal system. A system whose output at time t depends only on events that occur at or before time t. Compare with noncausal system.

channel. A conductive wire, medium, or other pathway for transmitting a signal.

chaotic attractor. An attractor with a nonintegral or fractal dimension. A chaotic attractor generates behavior that appears random, although the underlying dynamical system may be deterministic.

chaotic dynamical system. A dynamical system whose behavior is characterized by a chaotic attractor.

chaotic signal. A signal generated by a chaotic dynamical system.
Although it is deterministic, a chaotic signal may appear random.

characteristic time. The inverse of the mean frequency of the power spectrum of a signal. The characteristic time approximates the period of a quasi-periodic signal.

classification. The process of separating input patterns or signals into different categories.

classifier. A device used to perform classification.

CMOS. Complementary Metal-Oxide Semiconductor. A popular technology for building integrated circuits, employing two different types of MOS transistors.

complex frequency. A description of the frequency of sinusoidal oscillations, using complex numbers rather than a sum of sin() and cos() functions.

computational locality. In a multiprocessor system, a restriction on the distance that data must travel to reach the processor where a computation takes place. Good computational locality implies a short maximum distance.

concurrent computation. Two or more computational processes occurring simultaneously.

conjugate gradient descent. A method for increasing the speed of gradient descent. Unfortunately, conjugate gradient descent exhibits poor computational locality.

connection. A unidirectional communication pathway between two nodes. If the nodes are models of neurons, then the connections usually model synapses.

connectionism. The study of networks of simple processing elements, and their computational properties.

credit assignment problem. The problem of assigning blame to hidden nodes in a network, to account for the output error.

current-mode CMOS. A circuit employing CMOS technology, where computational quantities are represented by currents flowing through the electronic devices.

cutoff frequency. The lowest or highest frequency component of a signal that is not blocked by a filter.

cycle time. The time required for one iteration of an algorithm, or one time step in a digital computational circuit.

cyclical connections. Sets of multiple connections within a network that form circular pathways.
With cyclical connections, the output of a node can influence its own input, after propagating around the circular pathway.

delay buffer. A device for storing past values of a signal.

depth. The number of layers in a multilayer network.

dispersive network. A network with time delays in its internal connections. The time delays spread out signals in both time and space as they propagate through the network.

dynamic range. The difference between the maximum and minimum output values of a node or processing element.

dynamical system. A system whose behavior can be described by a set of differential equations. The past behavior of the system influences its future behavior.

embedding dimension. The set of past signal samples used by a system to make predictions about future values of the signal.

ensemble average. The average or expected sample function of a random process. The ensemble average can be estimated by observing multiple sample functions of the random process. If the random process is ergodic, the ensemble average can be estimated by a time average of a single sample function.

epoch. One complete presentation of all the training data during batch-mode adaptation.

equalization. The process of using a filter to implement the inverse transfer function of a distorting channel or medium. Placing the filter in series with the channel removes the distortion, retrieving the original, undistorted signal.

ergodic process. A random process whose ensemble average can be estimated by a time average of a single sample function. An ergodic process is necessarily a stationary process.

error signal. The difference between the output of a network and a target signal, as a function of time.

error surface. The multidimensional surface defined by the expected value of an error signal as a function of the internal, adaptable parameters of the network.

Euler integration. A simple iterative algorithm for solving systems of differential equations using a digital computer.
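As a concrete illustration of Euler integration, the sketch below integrates the Mackey-Glass delay-differential equation, dx/dt = a x(t - tau) / (1 + x(t - tau)^10) - b x(t), in fixed time steps, using a delay buffer to supply the delayed sample. The parameter values (a = 0.2, b = 0.1, tau = 17) are the ones commonly used in the prediction literature; the step size and initial history here are illustrative choices, not the thesis's own simulation settings.

```python
from collections import deque

def mackey_glass_euler(n_steps, dt=1.0, a=0.2, b=0.1, tau=17.0, x0=1.2):
    """Euler integration of the Mackey-Glass delay-differential equation.

    dx/dt = a*x(t - tau) / (1 + x(t - tau)**10) - b*x(t)

    A delay buffer holds past samples so x(t - tau) can be read back.
    """
    delay_len = int(round(tau / dt))      # number of samples spanning the delay
    buf = deque([x0] * delay_len, maxlen=delay_len)  # constant initial history
    x = x0
    out = []
    for _ in range(n_steps):
        x_delayed = buf[0]                # oldest stored sample ~ x(t - tau)
        dxdt = a * x_delayed / (1.0 + x_delayed ** 10) - b * x
        x += dt * dxdt                    # one Euler step
        buf.append(x)                     # newest sample enters the buffer
        out.append(x)
    return out

signal = mackey_glass_euler(3000, dt=0.5)
```

A smaller step size dt brings the Euler trajectory closer to the true solution; Runge-Kutta integration (defined later in this glossary) achieves comparable accuracy with larger steps at the cost of more computation per step.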
Compare with Runge-Kutta integration.

Fast Fourier Transform. An algorithm for decomposing a signal into a sum of sinusoidal oscillations with different frequencies and phase shifts.

feed-forward network. A network without any cyclical connections.

filter. A device that performs some type of processing on an input signal, and generates a corresponding output signal. A filter is usually described in terms of its transfer function.

finite difference approximation. A general numerical method for solving systems of differential equations in discrete time steps.

fractal dimension. The nonintegral dimension of a chaotic attractor.

Gabor polynomial predictive method. A method for predicting a signal by using polynomial functions of its past values. The method is typically unstable under iterative prediction.

gain. The maximum steepness or slope of a sigmoidal transfer function. As the gain increases, the sigmoid approaches a step function.

Gaussian noise. A random signal whose values are taken from a Gaussian (normal) distribution with a fixed mean (typically zero), and a fixed variance.

gradient descent. A technique for minimizing the value of a function by adapting its parameters in small steps. Each step adjusts the parameters in a direction opposite to the gradient of the function with respect to the parameters, causing the value of the function to decrease. The procedure can become stuck in local minima of the function.

gradient search. See gradient descent.

Hamming data window. A weighting function used to preprocess signal data before performing a Fast Fourier Transform. The Hamming window minimizes errors caused by performing a Fourier Transform on a signal sample of finite length.

hidden node. A node within a network that is neither an input node nor an output node. Hidden nodes receive inputs from other nodes in the network, and transmit their outputs to other nodes in the network. Compare with input node and output node.

IIR filter.
See infinite impulse response filter.

infinite impulse response filter. A filter whose input is formed from both an input signal and delayed versions of its output.

input node. A node that receives input or stimulus from its external environment. Compare with hidden node and output node.

integrated circuit. An interconnected group of circuit elements, such as resistors and transistors, on a single small chip of semiconductor material.

inverse modeling. The process of determining or implementing the inverse transfer function of a plant or filter.

iterative prediction. The process of feeding the output of a signal predictor back to its input, to make additional predictions further into the future.

Kalman filter. A type of adaptive filter that assumes a particular model of a plant or dynamical system.

latch. A storage device for a single sample of a signal.

layer. A set of nodes in a network that are all the same distance, in terms of the number of connections, from the output nodes or the input nodes.

leaky integrator. A summation device that forgets older information exponentially as new information arrives.

learning rate. A small, usually constant parameter that determines the rate of gradient descent.

Liapunov exponent. A number describing the rate of exponential convergence or divergence of nearby trajectories of a dynamical system. Negative Liapunov exponents indicate convergence, and positive exponents indicate divergence. A dynamical system with a chaotic attractor possesses at least one positive Liapunov exponent.

linear separability. A characteristic of a set of labeled points in a multidimensional space. The points are linearly separable if a hyperplane exists that divides the space into two regions such that all the points in each region have the same label.

link. A connection between two processing elements that may contain multiple weights and time delays.

LMS algorithm. Least Mean Squares algorithm. A training technique for adaptive linear combiners.

local minima.
A value of a function that is less than any other in a neighborhood of its parameter space.

logistic function. A curve with the equation

    y = k / (1 + e^(a + bx)),

where b is less than zero. The logistic function is an example of a sigmoidal transfer function.

low-pass filter. A filter with an upper cutoff frequency. A low-pass filter blocks any signal component with a frequency higher than the cutoff frequency.

Mackey-Glass signal. A chaotic signal, generated by a model of the regulation of blood formation, that is often used to test signal prediction techniques.

manifold. A finite-dimensional subset of the phase space of a dynamical system.

mean squared error. The average or expected value of the squared output error of a network.

misadjustment. The additional mean squared error of an adaptive system, over an optimal nonadaptive system.

model-free technique. A method for performing system identification, inverse modeling, or signal prediction without assuming any knowledge of the internal dynamics of the plant.

modular architecture. A computing structure composed of multiple, similar processing elements. The structure can be easily expanded by adding more processing elements.

momentum. A technique used to increase the convergence rate of the back-propagation training technique and its variants.

MOS. Metal-Oxide Semiconductor. A popular technology for implementing transistors in integrated circuits.

multilayer network. A network with individual nodes arranged in a feed-forward, layered structure. A multilayer network is usually described in terms of the number of layers, and the number of nodes within each layer.

multiplexing. A method for transmitting two or more different signals over a single channel.

network. An interconnected group of simple processing elements or nodes. The nodes can often be divided into input nodes, hidden nodes, and output nodes.

neural network. An interconnected group of neurons.
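The LMS algorithm defined above can be sketched in a few lines. This is a minimal illustration of the weight update for an adaptive linear combiner acting as a transversal filter, not the thesis's own implementation; the system-identification task, plant coefficients, and learning rate are invented for the example.

```python
import random

def lms_identify(x, d, n_taps=4, mu=0.05, w=None):
    """Least Mean Squares adaptation of an adaptive linear combiner.

    Each step moves the weights opposite the instantaneous gradient of
    the squared error: w <- w + 2*mu*e*x (stochastic gradient descent).
    """
    w = [0.0] * n_taps if w is None else list(w)
    errors = []
    for k in range(n_taps - 1, len(x)):
        taps = x[k - n_taps + 1 : k + 1][::-1]       # current and delayed inputs
        y = sum(wi * xi for wi, xi in zip(w, taps))  # combiner output
        e = d[k] - y                                 # output error
        w = [wi + 2.0 * mu * e * xi for wi, xi in zip(w, taps)]
        errors.append(e * e)
    return w, errors

# Hypothetical task: identify an unknown FIR plant with taps [0.5, -0.3, 0.2].
random.seed(0)
x = [random.uniform(-1, 1) for _ in range(2000)]
plant = [0.5, -0.3, 0.2]
d = [sum(p * x[k - i] for i, p in enumerate(plant)) if k >= 2 else 0.0
     for k in range(len(x))]
w, errors = lms_identify(x, d, n_taps=3, mu=0.05)
```

Because the plant lies within the combiner's model class and the data is noiseless, the weights converge toward the plant coefficients and the squared error falls toward zero.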
The neural outputs propagate along axons, and the connections are formed by synapses. In broader contexts, including this thesis, the term is often used in place of artificial neural network.

neuron. A living nerve cell in a biological organism that acts as a simple processing device. Neurons typically receive action potentials from other neurons, and generate action potentials of their own.

node. A single processing element, often modeled after a neuron.

noncausal system. A system whose output at time t depends on at least one event that occurs after time t. Noncausal systems cannot be physically implemented, but their mathematical properties can still be studied. Compare with causal system.

nonlinear optimization. The determination of minimum or maximum values of a nonlinear function, sometimes subject to constraints. Nonlinear functions often have local minima that can make the optimization process difficult.

nonlocal computation. A computation by a processing element in a network, involving data from other processing elements that are physically distant.

nonstationary process. A random process whose statistics, such as its mean and variance, change over time.

normalized rms error. The rms error, divided by the variance of the target signal.

NRMSE. See normalized rms error.

on-line adaptation. A method for training a network where the internal parameters are updated continuously, or after each integration step, rather than after all the training data has been presented. Compare with batch-mode adaptation.

output error. The instantaneous difference between the output of a network and a corresponding target signal.

output node. A node that generates a signal that acts upon its external environment, or that can be observed by an external observer. Compare with input node and hidden node.

parallel implementation. An implementation of a network or other computational device that uses multiple processing elements. The processing elements perform concurrent computations.
Compare with serial implementation.

pattern recognition. The process of classifying a spatial or temporal pattern into one category, from a set of possible categories.

Perceptron. An early model of a neural network, pioneered by Frank Rosenblatt. Also, single adaptive linear neurons are often referred to as Perceptrons.

Perceptron learning rule. A method for adapting the weights of an adaptive linear neuron or Perceptron.

performance feedback adaptation. A method for adapting the internal parameters of an adaptive system by observing how it interacts with its environment.

phase diagram. A plot of the trajectory of a dynamical system through phase space as it evolves in time.

phase shift. The relative position in time between two similar signals.

phase space. A space with coordinates representing the state of a dynamical system. The dynamical system follows a trajectory through phase space.

pipelined operation. A method for achieving parallel operation between the layers of a computational network. Each layer in the network operates as a single stage, performing concurrent computation with the other network layers. Pipelining increases signal throughput.

plant. A system that is to be controlled or modeled. It is affected by applied input signals, and produces corresponding output signals.

positive definite matrix. A matrix that can be easily inverted because it possesses a nonzero determinant.

positive semidefinite matrix. A matrix that cannot be easily inverted because its determinant is zero.

power spectrum. When a signal is decomposed into a sum of sinusoidal oscillations (e.g., by a Fast Fourier Transform), the power in each of these oscillations, as a function of frequency, defines the power spectrum of the signal.

procedural programming language. A computer programming language, such as Pascal or C, where emphasis is placed on the flow of control through blocks of code.

processing element.
A single computing device, usually part of a network of similar computing devices.

pulse stream network. A network in which the information flowing through the connections is encoded as streams of pulses, rather than continuous levels.

quantization. The round-off error caused by representing analog values using digital storage devices.

quasi-periodic signal. A nonperiodic signal that has a strong frequency component in its power spectrum. A quasi-periodic signal can be partially described by its characteristic time.

random access memory. A computer storage device where the retrieval time for each stored value is independent of its storage location.

random local search. A method for finding the minimum or maximum value of a function. A random local search is usually slower than a gradient search.

random process. A process that can be described by a random variable, whose probability distribution may be a function of time.

random variable. A variable that may take on a range of values that cannot be predicted with certainty, but only described by a probability distribution.

real estate. The amount of semiconductor surface area required to implement a device in integrated circuitry.

recurrent network. A network with cyclical connections.

rms error. Root-Mean-Squared error. The square root of the expected value of the squared output error.

round-robin. A scheduling technique where each processing element is given access to a shared resource (such as a shared bus) in turn. All the processing elements have equal priority.

Runge-Kutta integration. An iterative algorithm for solving systems of differential equations using a digital computer. Runge-Kutta integration produces more accurate results than Euler integration, but requires more complex computations.

sample function. The output, as a function of time, of a random process. Multiple instantiations of the same random process can generate different sample functions.

sampling frequency.
The reciprocal of the time interval between sample points when a continuous-time signal is converted to a discrete-time representation.

scalable architecture. A modular architecture with no inherent bound on the size of the structure that can be built.

serial implementation. An implementation of a network or other computational device that uses a single processor, such as a conventional digital computer. The single processor must perform the calculations for each network node and connection in a step-by-step fashion, using a technique like round-robin scheduling. Compare with parallel implementation.

shared signal. A signal that is broadcast to multiple processing elements. Compare with unique signal.

sigmoidal transfer function. A continuous and monotonically increasing s-shaped function, often used as the transfer function of a node. Common sigmoidal functions are the logistic function and the hyperbolic tangent.

signal detection. The reconstruction or classification of signals that may be corrupted by noise.

signal emulation. See signal production.

signal flow graph. A diagram emphasizing the flow of signals between summation nodes, often used to illustrate the internal structure of a filter.

signal prediction. The process of predicting future values of a signal, usually from observations of its past values.

signal production. The process of generating a signal with a desired set of properties, often with the goal of imitating another signal with an unknown generating mechanism.

signum function. The function

    sgn(y) = +1 if y >= 0, and -1 otherwise,

which acts as a binary quantizer.

sliding tap. A method for implementing a variable-length delay buffer.

spatio-temporal pattern. A pattern or signal that contains both spatial and temporal information. A simple example is a television image.

stationary process. A random process whose statistics, such as its mean and variance, are independent of time.

steepest descent. See gradient descent.

stochastic approximation.
A method for finding the minimum or maximum value of a function by using gradient estimates at each time step. The gradient estimates permit an approximate form of gradient descent.

synapse. An excitatory or inhibitory connection between two neurons. When the first neuron generates an action potential, an excitatory connection encourages the second neuron to generate its own action potential, while an inhibitory connection discourages it.

system identification. The process of determining or implementing the transfer function of an unknown plant.

target signal. A signal representing the desired output of a network.

Taylor series. The representation of a function as a power series of the form

    f(x) = sum from n = 0 to infinity of (1/n!) (x - a)^n f^(n)(a),

where f^(n)(a) is the nth derivative of f at a.

TDNN. See time delay neural network.

temporal back-propagation. A generalization of the back-propagation technique for training dispersive networks.

threshold. The input value of a linear or sigmoidal transfer function that causes its output value to be midway between its extreme values. When the input is less than the threshold, the output lies below the midpoint, and vice versa.

threshold logic device. A summation device cascaded with a binary quantizer. Also known as an adaptive linear neuron.

throughput. The rate of information flow through a data processing device or filter.

time delay neural network. A feed-forward network with fixed-length time delays in the connections.

topology. The spatial arrangement of nodes and connections within a network, and their pattern of connectivity.

training. The process of adapting the internal parameters of a network to improve its performance on a particular set of data, known as the training set. The internal parameters most often adapted are the weights and the time delays, if present.

training technique. A particular method or algorithm for training a network.

trajectory. The path followed by a dynamical system through phase space.

transfer function.
The input-output relationship of a plant, a network, or a single node.

transversal filter. A filter formed by feeding an input signal through a chain of delay elements, and then feeding the delayed signal values into a processing device, such as an adaptive linear combiner or a back-propagation network.

unfolding in time. The process of moving the internal time delays in a dispersive network to the input nodes. The resulting transversal filter may be much larger than the original dispersive network.

unique signal. A signal that flows to only a single processing element. Compare with shared signal.

unity transfer function. A transfer function whose output is an exact copy of its input.

universal approximator. A device that can implement any continuous transfer function to a specified degree of precision.

VLSI. Very Large Scale Integration. A class of integrated circuits containing thousands of transistors on a single small chip of semiconductor material.

weight. A multiplicative factor that scales the signal flowing through a connection within a network.

weight perturbation. The process of adding random values to the weights in a network, and observing the effect on the output error. The observations can yield information that allows the weights to be modified and the output error to be reduced.

Acknowledgement

I would like to express my appreciation for the guidance and support provided by my supervisor, Dr. D. S. Camporese. I am also indebted to my colleague, Michael Davenport, for many interesting discussions, and for his mathematical insight. We collaborated on several papers in the field of neural networks. Several enlightening conversations with Dr. G. W. Hoffmann at U. B. C. and Eric Wan at Stanford University provided early inspiration for this work. I would also like to thank Dave Gagne, without whose technical support this research would have taken twice as long.
The work presented here could not have been completed without the financial assistance provided by the Natural Sciences and Engineering Research Council of Canada, and the computer hardware provided by the Canadian Microelectronics Corporation. Finally, I would like to mention my sincere appreciation for the continuous support of my family and friends throughout the course of this investigation.

Chapter 1

Introduction

1.1 Adaptive Systems

Adaptive systems can alter their structure or behavior based on environmental feedback. They are prevalent in biology, where species adapt through the process of natural selection [21] and organisms adapt by learning new behaviors. In the latter case, the adaptive ability is often associated with intelligence. It is also possible for artificial, or man-made, systems to adapt to their environments, and although they are not usually called intelligent, they often exhibit properties commonly attributed to intelligence.

There are many areas of science and engineering where adaptive systems can provide useful solutions to difficult, possibly time-varying problems. Adaptive processors find widespread use in fields such as control, communications, speech recognition, imaging, sonar, radar, and seismology. Most of these systems automatically adjust internal structural or behavioral parameters to maximize some type of performance measure in a changing environment. Alternatively, they may search for optimal parameters in a stationary environment, where adaptation ceases after reaching an acceptable level of performance. In both cases, adaptation occurs through environmental contact.

Most artificial adaptive systems can be characterized by one or more of the following properties (paraphrased from [134]):

1. They can automatically adapt to changing environments.

2. They can be trained to perform specific tasks in stationary environments. In a sense, the design of the system takes place automatically through the training process.

3.
They can extrapolate their behavior to deal with new situations, after being trained on a finite and often small amount of training data.

4. They are often robust to internal defects, having an ability to adapt around them.

5. They can be viewed as nonlinear systems with time-varying parameters.

6. They are usually more complex and difficult to analyze than nonadaptive systems, but can offer better performance when the input signal is unknown or nonstationary.

Much of the work in this area has involved a device known either as an adaptive linear combiner (ALC), or as an adaptive linear element. This device is the central topic of discussion in Chapter 2. The ALC is the primary component of an adaptive linear filter, which is probably the most common adaptive processor in use today.

The operation of most adaptive processors can be modeled as shown in Figure 1.1. The processor receives input from its environment, causing it to generate an output signal that acts back upon the environment. This environmental interaction generates a performance feedback signal that the processor can use to increase its performance by modifying its internal parameters. Quite often, the environmental interaction is just a comparison with a desired output signal, and the goal of adaptation is to minimize the difference between the actual output and the desired output. The nature of the input signal and the environmental interaction determines the task that the processor performs.

1.1.1 Adaptive Prediction

Figure 1.2(a) illustrates an adaptive prediction task. The input to the adaptive processor is a delayed version of the signal to be predicted, while the desired output is an undelayed version of the same signal. The performance feedback signal, which is the difference between the desired and actual outputs, can be thought of as an error signal that the adaptive processor tries to minimize.
Through adaptation, the processor learns to predict the current value of the input signal, based only on observations of its past values. In some situations, this arrangement has limited utility because the processor "predicts" only the present value of the input signal, which is already available as the desired output. Chapters 5 and 6 describe adaptive processors based on dispersive networks, which can make true predictions about future values of the signal.

Figure 1.1: Operation of a typical adaptive processor. The processor receives input from its environment, causing it to generate an output that acts back upon the environment. Based on the results of this environmental interaction, a performance feedback signal returns to the adaptive processor, allowing it to modify its internal parameters and thereby increase its performance. The environmental interaction usually involves a comparison of the actual output signal with a desired output signal, and the performance feedback is often just the difference between these two signals.

1.1.2 Adaptive System Identification

Figure 1.2(b) shows the system identification arrangement. Here, an input signal feeds into both the adaptive processor and an unknown plant. As the adaptive processor minimizes the difference between its output and that of the plant, its transfer function will approach the plant transfer function. The plant is "identified" when this difference is sufficiently small. If the transfer function of the plant changes over time, the adaptive processor can track those changes and continue to provide an accurate plant model.

Figure 1.2: Some applications of adaptive processors: (a) adaptive signal prediction, and (b) system identification.
Signal prediction can be viewed as a special case of system identification, where the "plant" is a perfect predictor. In both cases, the performance feedback signal can be thought of as an error signal that the adaptive processor tries to minimize.

1.1.3 Inverse Modeling or Equalization

The inverse modeling task is similar to the identification task, except the adaptive processor is connected in series with the plant instead of in parallel. As shown in Figure 1.3(a), the processor learns to produce a delayed version of the plant input signal, and in the process acquires the inverse transfer function of the plant. Inverse models are useful in control applications, where proper input values must be determined from knowledge of the desired output. They can also be used to remove distortions caused by imperfect communication channels.

Figure 1.3: Additional applications of adaptive processors: (a) inverse modeling or equalization, and (b) interference canceling. In inverse modeling, the adaptive processor learns to implement the inverse transfer function of the plant, with a fixed time delay. Its input is the plant output, which may be corrupted by additive noise. In interference canceling, unlike other applications, the output of the adaptive processor is not the desired output of the system. Instead, the processor tries to predict the signal plus noise, using a noise signal at its input that is correlated with the signal noise. However, since the correlated noise input contains no information about the signal itself, the "best" prediction that the processor can make predicts only the noise component of the signal. Subtracting this prediction from the noisy signal gives the clean signal as the system output.

1.1.4 Adaptive Interference Canceling

Figure 1.3(b) shows an adaptive processor configured as an interference canceling device.
The "desired output" of the adaptive processor is a signal with additive noise, while its input is a second, perhaps distorted, version of the noise. The input noise is correlated with the additive signal noise, but not with the signal itself. Reproducing the additive noise is the best that the processor can do, based on the noise at its input, since the noise and the signal are uncorrelated. Subtracting the output of the processor from the noisy signal gives an approximation of the noise-free signal. Given the correlated noise, an optimal estimate of the additive noise occurs when the adaptive processor minimizes the power in the output signal.

1.2 Artificial Neural Networks

Mammalian brains contain living nerve cells called neurons, which can send electrical signals (action potentials) down communication lines known as axons. The action potentials eventually reach synapses, where the axons connect to the inputs of many other neurons through excitatory or inhibitory connections. A neuron emits an action potential when the difference between its excitatory and inhibitory inputs is sufficiently large. In this way, the behavior of one neuron can influence the behavior of many other neurons. Adaptation in the brain (i.e., learning) occurs, at least in part, by changes in the strengths of the synaptic connections [45].

There has been much hope that the operation of the brain can be explained by modeling the neurons as simple processing elements and modeling the synapses as weighted connections between them, leading to the study of "connectionism". Early work in this area was performed by McCulloch and Pitts [74], whose neuronal model behaved as a threshold logic device with a binary output.
This device has since become known as the "McCulloch-Pitts" neuron, and its variants find use in many modern network models.

Recently, researchers have found that a wide variety of neural models have interesting or useful computational properties, even if they bear little resemblance to biological brains [44]. Investigation of these "artificial" neural networks has gone far beyond modeling the brain, in search of new types of computational systems. Some of these systems can perform tasks like pattern recognition and pattern classification, or behave as associative memories. Associative memories store patterns that can be retrieved based on their similarity to an input, or prompting pattern.

One class of adaptive networks that has received a great deal of attention is the "multilayer Perceptron", or back-propagation network. This network is closely related to the adaptive linear combiner, containing many interconnected ALCs with intervening nonlinear elements. It has shown great promise as a pattern classifier [104] and, more recently, as a universal function approximator [48].

The research presented here is an extension of the back-propagation network, with adaptable time delays added to all the internal connections. As with conventional back-propagation networks, these new networks can have either discrete or continuous-valued outputs. Discrete outputs are useful for spatio-temporal pattern classification, while continuous outputs are more appropriate for approximating continuous functions. The focus of this research is on networks with continuous-valued outputs.

1.3 Thesis Outline

Chapter 2 is a review of adaptive linear filters, and their primary component, the adaptive linear combiner. It describes the adaptation algorithm in detail, and discusses some limitations of the device that arise from its linearity. Chapter 3 introduces networks of adaptive linear combiners, both with and without internal nonlinearities.
With the nonlinearities, the networks are equivalent to back-propagation networks, and the back-propagation adaptation algorithm is discussed in detail.

Chapter 4 develops continuous-time formulations of the adaptive algorithms for both linear combiners and back-propagation networks. It then presents a simple method for forming a nonlinear adaptive filter from a back-propagation network, and applies it to a signal prediction task. Chapter 5 is the main development of the thesis. It discusses nonlinear networks incorporating internal time delays, and develops a new continuous-time adaptation technique for adjusting the internal parameters, including the lengths of the time delays.

Chapter 6 presents simulation results for three applications of the networks in Chapter 5. In order of presentation, they are adaptive signal prediction, signal production (system identification), and adaptive channel equalization (inverse modeling).

Finally, Chapter 7 discusses the software implementation of the network filters used in the simulations, and possible hardware implementations in both digital and analog forms. Chapter 8 summarizes the thesis and provides suggestions for further research.

Chapter 2

The Adaptive Linear Combiner

Many adaptive signal processing systems make use of a simple device known as an adaptive linear combiner (ALC) [135]. This chapter reviews its structure and operation, for comparison with the nonlinear adaptive systems to be described later. The linearity of the device simplifies the analysis, permitting a better behavioral understanding than is currently possible for nonlinear systems. The ALC can use either the least mean squares (LMS) algorithm or the method of random local search (RLS) to update its internal parameters.

2.1 Structure

Figure 2.1 shows the structure of an adaptive linear combiner with N inputs.

Figure 2.1: Adaptive linear combiner with N inputs. The output, y(t), is a sum of products of the form w_j(t)x_j(t), and a threshold, \theta(t). The weights and threshold can adapt to alter the transfer function.

Figure 2.2: ALC Transfer Function for N = 2. The weights, w_j(t), and threshold, \theta(t), define an N-dimensional hyperplane representing the transfer function, which intersects the y-axis at -\theta. A desired transfer function also defines an N-dimensional surface that the ALC can approximate by shifting and tilting the hyperplane to obtain a best fit. Since this target surface is not necessarily a hyperplane, the optimal weights and threshold do not always provide an error-free implementation.

The output, y(t), is the sum of a threshold, \theta(t), and products formed by multiplying the inputs, x_j(t), by weights, w_j(t):

    y(t) = -\theta(t) + \sum_{j=1}^{N} w_j(t) x_j(t).    (2.1)

The output is positive when the sum of weighted inputs exceeds the threshold, and negative when it falls below. The weights and threshold can adapt to alter the transfer function, but they change on a much longer time scale than the inputs or y(t). For a particular set of weights and a fixed threshold, the transfer function defines an N-dimensional hyperplane, as shown in Figure 2.2. The hyperplane intersects the y-axis at -\theta, and the weights control its orientation. Note that \theta(t) can be implemented as an additional weight, w_0(t) = -\theta(t), and a fixed input, x_0(t) = 1, so it can be incorporated into the sum in equation (2.1). The threshold will be treated in this manner without further mention or loss of generality.

For a particular task, the optimal or desired output of the ALC can be specified for each possible set of input values. This desired behavior of the ALC also defines an N-dimensional surface, but not necessarily a hyperplane. This surface can be approximated by tilting and shifting the ALC hyperplane to obtain a good fit, according to some error measure.
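The ALC output of equation (2.1), with the threshold folded in as an extra weight on a fixed unit input, can be sketched in a few lines of Python. This is an illustrative sketch only; the function name and example values are not from the thesis.

```python
def alc_output(weights, inputs):
    """ALC output per equation (2.1), with the threshold carried as
    weights[0] = -theta and a fixed input x0 = 1."""
    x = [1.0] + list(inputs)  # prepend x0 = 1 so w0 supplies -theta
    return sum(w * xi for w, xi in zip(weights, x))

# Example: theta = 0.5 (so w0 = -0.5), w1 = 1.0, w2 = -2.0
print(alc_output([-0.5, 1.0, -2.0], [2.0, 0.25]))  # prints 1.0
```

The positive result indicates that the weighted input sum exceeds the threshold, matching the sign behavior described above.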
Explicit knowledge of the required transfer function permits direct computation of the optimal weight values, but such knowledge is often difficult to obtain. However, some forms of limited information about the desired output allow the correct weights to be found using techniques like performance feedback adaptation. With performance feedback adaptation, the weights and threshold change during operation, so the overall transfer function is nonlinear and the ALC exhibits more complex behavior than a linear combiner with fixed parameters.

2.2 Operation

2.2.1 Error Measure

Performance feedback adaptation requires a target signal, \hat{y}(t), that the actual output, y(t), must follow. Accordingly, an error signal, e(t), measures the instantaneous difference between them:

    e(t) = \hat{y}(t) - y(t) = \hat{y}(t) - W^T(t)X(t).    (2.2)

Here, vector notation replaces the summation notation for the ALC output in equation (2.1). W(t) is the column vector of weights with w_0(t) = -\theta(t),

    W(t) = [w_0(t)\; w_1(t) \ldots w_N(t)]^T,    (2.3)

X(t) is the column vector of inputs with x_0(t) = 1,

    X(t) = [x_0(t)\; x_1(t) \ldots x_N(t)]^T,    (2.4)

and T represents the matrix transpose operation. The error can be positive or negative, with no inherent bound on its magnitude. Squaring equation (2.2) gives the instantaneous squared error,

    e^2(t) = [\hat{y}(t) - W^T(t)X(t)]^2 = \hat{y}^2(t) + W^T(t)X(t)X^T(t)W(t) - 2\hat{y}(t)X^T(t)W(t),    (2.5)

which is never negative, and is zero only when the ALC follows the target signal precisely.

Under the assumption that the input and target signals are sample functions of random processes [90, 113], the instantaneous values, X(t) and \hat{y}(t), can be viewed as random variables. Then e(t) is also a random variable, and the mean squared error, \mathcal{E}(t), is the expected value of e^2(t) at time t:

    MSE = \mathcal{E}(t) = E[e^2(t)] = E[\hat{y}^2(t)] + W^T(t)E[X(t)X^T(t)]W(t) - 2E[\hat{y}(t)X^T(t)]W(t).    (2.6)

Minimizing the MSE by changing the weights makes the ALC follow the target signal more accurately.

If X(t) and \hat{y}(t) are stationary, the expected values on the right hand side of equation (2.6) are constants representing the average power of the target signal,

    c = E[\hat{y}^2(t)],    (2.7)

the input correlation matrix,

    R = E[X(t)X^T(t)] = E\begin{bmatrix} x_0^2(t) & x_0(t)x_1(t) & \cdots & x_0(t)x_N(t) \\ x_1(t)x_0(t) & x_1^2(t) & \cdots & x_1(t)x_N(t) \\ \vdots & \vdots & \ddots & \vdots \\ x_N(t)x_0(t) & x_N(t)x_1(t) & \cdots & x_N^2(t) \end{bmatrix},    (2.8)

and a cross-correlation vector,

    P = E[\hat{y}(t)X(t)] = E[\hat{y}(t)x_0(t)\; \hat{y}(t)x_1(t) \ldots \hat{y}(t)x_N(t)]^T.    (2.9)

The matrix, R, is real and symmetric. It is usually positive definite, but occasionally it can be positive semidefinite [132]. With this notation, the MSE can be rewritten as

    \mathcal{E}(t) = c + W^T(t)RW(t) - 2P^TW(t).    (2.10)

Equation (2.10) is a quadratic function of W that defines a hyperparabolic error surface in N + 1 dimensions, not to be confused with the hyperplane representing the ALC output. The error surface exists in the space defined by the weights of the ALC, while the output hyperplane exists in the space defined by the ALC inputs. The error surface is concave upward because the squared error is never negative, and it possesses a unique minimum, \mathcal{E} = \mathcal{E}_{min}, at W = W^*.

2.2.2 Gradient of the Error Measure

The gradient of the hyperparabolic MSE surface with respect to the weight vector is

    \nabla\mathcal{E} = \left[ \frac{\partial\mathcal{E}}{\partial w_0}\; \frac{\partial\mathcal{E}}{\partial w_1} \ldots \frac{\partial\mathcal{E}}{\partial w_N} \right]^T = -2P + 2RW.    (2.11)

Knowledge of R and P allows W^* to be found by solving for \nabla\mathcal{E} = 0 [134], yielding the optimal Wiener filter weights [72, 136]:

    W^* = R^{-1}P.

However, when R and P are not explicitly available, W^* cannot be found with this analytical method.

As an alternative, suppose the current weight vector represents the location of a ball resting on the MSE surface. If the weights adapt so that the ball follows the negative gradient, it will come to rest at the location of the minimum and the weights will have acquired their optimal values, W^*.
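The closed-form Wiener solution W^* = R^{-1}P can be checked numerically. The sketch below is illustrative only (the function name and the particular R and P are invented for the example, with R a 2-by-2 correlation matrix inverted in closed form):

```python
def wiener_weights(R, P):
    """Solve W* = R^{-1} P for a 2x2 correlation matrix R, using the
    closed-form 2x2 matrix inverse."""
    (a, b), (c, d) = R
    det = a * d - b * c
    Rinv = [[d / det, -b / det], [-c / det, a / det]]
    return [Rinv[0][0] * P[0] + Rinv[0][1] * P[1],
            Rinv[1][0] * P[0] + Rinv[1][1] * P[1]]

# R and P chosen so the optimum is W* = [2, -1]; here P = R W*.
R = [[1.0, 0.5], [0.5, 1.0]]
P = [1.5, 0.0]
print(wiener_weights(R, P))  # approximately [2.0, -1.0]
```

This is exactly the computation that becomes impossible when R and P are unknown, which motivates the gradient search methods that follow.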
Fortunately, as will be shown in Section 2.2.4, the gradient at a position represented by a particular weight vector can be estimated without knowledge of R or P. This method for finding the optimal weight values is usually called a gradient search.

2.2.3 Gradient Search

Figure 2.3 shows a parabolic error surface for a single weight:

    \mathcal{E}(t) = \mathcal{E}_{min} + \lambda(w^* - w(t))^2.    (2.12)

A gradient search occurs when the weight adapts in a direction opposite to the derivative of \mathcal{E} with respect to w:

    w(t + 1) = w(t) - \mu \frac{d\mathcal{E}}{dw} = w(t) + 2\mu\lambda(w^* - w(t)).    (2.13)

Here, \mu is a step-size parameter that controls the size of the weight adjustments, and t is an integer representing the time step.

Figure 2.3: MSE surface for a single weight. Here, the "surface" is a parabola with its minimum at w^*. The initial value of the weight is w(0), and the MSE decreases from \mathcal{E}(0) to \mathcal{E}_{min} as the weight approaches the minimum.

To gain some insight into the asymptotic behavior of the recursive expression in equation (2.13), it can be rewritten nonrecursively as [134]

    w(t) = w^* - (1 - 2\mu\lambda)^t (w^* - w(0)).    (2.14)

The second term represents the error due to the initial condition where, generally, w(0) \neq w^*. This term will decay to zero and the weight will converge to w^* if

    |1 - 2\mu\lambda| < 1, \quad \text{i.e.,} \quad 0 < \mu < \frac{1}{\lambda}.    (2.15)

Substituting equation (2.14) into (2.12) gives

    \mathcal{E}(t) = \mathcal{E}_{min} + \lambda(1 - 2\mu\lambda)^{2t} (w^* - w(0))^2,    (2.16)

Figure 2.4: Error surface for two weights. The rings represent contours of constant mean squared error, with the minimum value at the center of the figure. At the point (w_1(t), w_2(t)), (A) shows the direction toward the minimum, W^*, and (B) shows the direction of steepest descent, which is always perpendicular to the MSE surface contours.
Proceeding along the direction of steepest descent will eventually lead to the minimum, but not necessarily via the shortest (i.e., straight) path.

showing that the MSE will converge to \mathcal{E}_{min} under the same conditions as (2.15).

This example illustrates the method of gradient search in a single dimension. Generalization to multiple dimensions yields the method of steepest descent:

    W(t + 1) = W(t) - \mu\nabla\mathcal{E}(t).    (2.17)

In the one-dimensional case, an update in the direction opposite to the gradient always moves the weight toward w^*. However, Figure 2.4 shows that in multiple dimensions the direction of steepest descent may not be the same as the direction toward W^*, and the weights may follow a curved path to the minimum.

For a quadratic error surface, iteration of equation (2.17) converges to W^* if [134]

    0 < \mu < \frac{1}{\lambda_{max}},    (2.18)

where \lambda_{max} is the largest eigenvalue of R. This condition is analogous to (2.15), but it is not as useful as it might appear. Since R is often unknown, the best step size parameter, \mu, is
Equation (2.19) represents an infiniteensemble average, but if the signals are sample functions of ergodic processes, the expectedvalue can be estimated by taking a time average [90], giving a gradient estimate, Ve:^Ve(t) = 2 E e(t')X(t').^ (2.20)t, =t-NThe weights should be held constant while each batch of N samples generates a gradient esti-mate, after which they can be updated according to21i^tW(t 1) = W(t) — e(t) = W(t) —N E e(e)X(e),ti=t-Nresulting in a technique often called "batch-mode" adaptation. A fixed-length segment of theinput signal can be iterated once for each batch, or new segments can be used as they arrive. Asthe size of the batch, N, decreases, a fixed-length segment will not be an adequate representationof the desired transfer function, so the second method must be applied.A smaller batch size yields a noisier gradient estimate, with more frequent updates to W.In the limiting case where N = 1, the instantaneous signal values estimate the gradient,o£(t) = —2e(t)X(t),^ (2.21)resulting in "on-line" adaptation. The gradient search formula becomes^W(t + 1) = W(t) 2/te(t)X(t),^ (2.22)Chapter 2. The Adaptive Linear Combiner^ 17and W approaches W* at a rate controlled by p. The noisy gradient estimate causes W(t) towander around its optimal value, but it will converge in expectation with an expected varianceproportional to p [70]. This wandering increases the MSE over that of an optimal Wiener filterby a quantity known as the misadjustment [134]. The misadjustment measures the cost ofadaptability, increasing with p and the square of the number of weights.The noisy gradient estimate is not as bad as it first appears because many small weightupdates become averaged over time, effectively providing a better estimate. 
This technique,the Widrow-Hoff least mean squares (LMS) stochastic gradient algorithm [131, 139], offerssignificant advantages over Wiener filters because it does not require explicit determination ofthe input and target signal statistics defined in equations (2.8) and (2.9).2.2.5 Random Local SearchThe LMS algorithm is one possible method for selecting the weights of an adaptive linearcombiner. The method of Random Local Search (RLS) also performs this task, but convergenceto the optimal weight vector, W*, is usually slower.On each iteration, a unit vector, r, with a random orientation perturbs the weight vector,W(t + 1) = W(t) + pr, (2.23)and if the resulting output error decreases, the new weight vector is retained. If the errorincreases, however, the weight vector is restored to its previous value from time t. After manyiterations, the weight vector will approach W*, but it will not converge to a fixed value. As inthe LMS algorithm, random weight fluctuations continue due to the estimation of E, with theirmagnitude depending partly upon p. Large values of p lead to faster adaptation, but they alsolead to correspondingly larger steady state error fluctuations. The random vector, r, often liesin the wrong direction, so the LMS algorithm discussed in the previous section usually providesfaster convergence.Chapter 2. The Adaptive Linear Combiner^ 18 y(t)Figure 2.5: Adaptive linear transversal filter. The device incorporates an ALCand several unit-delay elements, represented by z -1 , that provide delayed ver-sions of the input signal. Multidimensional input signals require a set of weightsand delays for each dimension.2.3 Adaptive Linear Transversal FilterThe adaptive linear transversal filter in Figure 2.5 incorporates an adaptive linear combinerand several unit-delay elements. 
The output is a weighted sum,N-1y(t) = E wi (t)x(i - j),^ (2.24)i.othat responds to temporally distributed information in the input signal, making the deviceparticularly well-suited to signal processing tasks. The diagram shows a scalar input signal —multidimensional signals require additional sets of delayed input lines.2.4 ALC ClassifierFigure 2.6 shows an adaptive linear combiner cascaded with a binary quantizer. This devicewas studied as an early neural model by Rosenblatt [102}, who used it as a component ofhis "Perceptron", and by Widrow and Hoff [131], who called it an adaptive linear neuron, orChapter 2. The Adaptive Linear Combiner^ 19x1 (t)x2(t)•XN(t) y(t) o(t) o.N(o(t) = sgn E wixi(t) ,j = 1(2.25)Figure 2.6: ALC classifier. Cascading the ALC with a binary quantizer forcesthe output, o(t), to be ±1, effectively partitioning the input space into twocategories. The weights define a hyperplane that forms the boundary betweenthe categories.ADALINE. Its output iswhere sgn() is the signum function:1sgn(y) =The quantizer output is —1 if the sum is negative, and +1 otherwise. The critical region wherethe sum equals zero defines an (N — 1)-dimensional separating hyperplane, shown as a dashedline in Figure 2.2. This separating hyperplane should not be confused with the transfer functionhyperplane, which is N-dimensional. In Figure 2.2, N = 2, so the separating hyperplane is aline.The instantaneous input signal represents a pattern that the ADALINE classifies into one oftwo categories, depending on which side of the separating hyperplane it lies. Weight adaptationchanges the position and orientation of this hyperplane, using a target output of +1 if thecurrent pattern belongs to the first category, and —1 otherwise. The output of the ALC, not+1 if y > 0—1 otherwise.A-1-1Chapter 2. The Adaptive Linear Combiner^ 20the quantizer, forms the error when using the LMS algorithm. 
A slightly different technique,known as the Perceptron Learning Rule [7, 79, 102], adapts the weights using the output of thequantizer.Although the ALC can classify any set of linearly separable input patterns, it is possible thatthe LMS algorithm will not converge to the correct solution. Mean squared error minimizationmay sacrifice the proper classification of one pattern for a better fit to the rest [10]. However,choosing a threshold LMS error criterion [112] guarantees convergence to the proper set ofweights. The threshold LMS error is zero when Y(t) = 1 and y(t) > 1, or when y(t) = —1and y(t) < —1.2.5 Limitations of the ALCThe ALC can implement a particular transfer function only if it can be described by an N-dimensional hyperplane in the input space. In other circumstances, the minimum value of thehyperparabolic MSE surface will be greater than zero, representing the smallest error that canbe achieved by a linear device.x2^ x211^1(a) (b)Figure 2.7: Classification boundary for a two-input ALC classifier. With twoinputs, the separating hyperplane is a line bisecting the 2-dimensional inputspace. In (a), the two categories represented by +1 and —1 can be perfectlyseparated, but in (b) there is no such line that can accomplish the task.1^-1-1-1-1Chapter 2. The Adaptive Linear Combiner^ 21When used as a classifier, the ADALINE cannot separate all possible categories [79]. Aclassification can be implemented only if a hyperplane exists that separates all patterns in thefirst category from those in the second — a condition known as linear separability. Figure 2.7(a)shows a case that is linearly separable, and Figure 2.7(b) shows a case that is not.In nonseparable cases, the LMS algorithm minimizes the mean squared error between theoutput and the target signal, and the Perceptron Learning Rule doesn't converge [79]. 
Unfortu-nately, the fraction of linearly separable classifications is small, decreasing very rapidly towardzero as the number of input patterns increases [18, 51]. Networks with multiple layers of ALCnodes can overcome the limitations of linear nonseparability. Their structure and capabilitiesare the subject of the next chapter.Chapter 3ALC NetworksThe adaptive linear combiner in Chapter 2 can implement only those transfer functions witha scalar output that can be expressed as a linear combination of the input signal components.With a binary output quantizer, it can classify input patterns into only two linearly separablecategories. This chapter describes networks of ALC nodes that overcome these limitations atthe expense of more complex adaptation algorithms.3.1 Multidimensional OutputsMultidimensional outputs present no difficulty, since multiple ALC units can produce the re-quired signal vector, Y(t). Figure 3.1 illustrates the resulting architecture, known as a singlelayer network. The term "layer" refers to the set of weights between the inputs and the outputnodes.Adaptation requires a target signal vector, Y(t), that the actual outputs must follow, andthe Euclidean distance between them provides a measure of the instantaneous error:e(t) =^"(t) —^(t)ii•^ (3.1)If the input and target signals are sample functions of random processes, e(t) can be viewed asa random variable, and the mean squared error, E(t), isM MMSE =^(t) = E[e 2 (t)] = EMY(t)— Y(011 2 ] = E em(t)]m=1= E E[eL(t)].m=1(3.2)Here, M is the number of ALC nodes, and eni (t) is the output error for node m:em(t) =^— ym (t). (3.3)22Chapter 3. ALC Networks^ 23x2(t)•xN(t)Figure 3.1: Single layer ALC network. Each weight has subscripts to identifyits input and output connections. Signals flow from left to right, and multipleALC nodes produce the output signal vector, Y(t). 
The LMS algorithm adapts the weights for each node, independently of the other weights in the network.

The right side of equation (3.2) shows that the MSE is a sum of M independent terms, each representing the MSE for one ALC and its associated weights. Therefore, the total MSE can be minimized by independently applying the LMS algorithm from Chapter 2 to each node.

3.2 Networks with Multiple Layers

The network in Figure 3.1 performs the vector-matrix multiplication,

    Y(t) = W X(t),    (3.4)

where W is the weight matrix with a separate row for each ALC, and X(t) is the input signal. Cascading another layer, as in Figure 3.2, multiplies the output by a second weight matrix, W',

    Y(t) = W' (W X(t)),    (3.5)

forming a multilayer feed-forward network. The expression "feed-forward" means there are no direct or indirect cyclical connections, so the output at time t is completely determined by the input at time t.

Figure 3.2: Two layer ALC network. The first layer of weights multiplies the input vector by a weight matrix, W, and the next layer multiplies the result by a second matrix, W', producing the output vector, Y(t). As explained in the text, this network is equivalent to a single layer network of the type shown in Figure 3.1. However, cascading the internal nodes with nonlinearities, as in Figure 3.4, enables the network to implement a much wider range of transfer functions.

For linear nodes, the product W'W in equation (3.5) is equivalent to a single matrix, A, due to the associative property of matrix multiplication:

    Y(t) = W'(W X(t)) = A X(t).

Therefore, a multilayer network of adaptive linear combiners can implement the same transfer functions as a single layer network, and can classify only linearly separable input patterns. Cascading the internal, or "hidden", ALC nodes with nonlinearities such as binary quantizers invalidates this argument, and the linear separability restriction no longer applies.
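The collapse of a linear cascade into a single matrix can be checked numerically. The layer shapes and random matrices in this sketch are arbitrary illustrations:

```python
import numpy as np

# Two cascaded linear layers are equivalent to one: Y = W'(W X) = (W'W) X.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))    # first layer of weights
Wp = rng.normal(size=(3, 5))   # second layer, W'
A = Wp @ W                     # equivalent single-layer matrix

X = rng.normal(size=4)         # an arbitrary input vector
two_layer = Wp @ (W @ X)       # Y(t) = W'(W X(t))
one_layer = A @ X              # Y(t) = A X(t)
```

However deep the cascade, a purely linear network reduces the same way, which is why the hidden nonlinearities introduced in Section 3.3 are essential.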
The resulting network can implement a much wider range of transfer functions, and it can perform any classification task with three layers of weights and enough hidden nodes [69]. However, the lack of a target signal for the hidden nodes makes it impossible to apply the LMS algorithm directly to weights that are not in the output layer. This difficulty, the credit assignment problem, arises because each hidden ALC must be assigned responsibility for an unknown portion of the MSE at the output.

Figure 3.3: Sigmoidal nonlinearity. The sigmoid shown here is a scaled and offset logistic function, o(t) = tanh(a y(t)), but any continuous, differentiable, and monotonically increasing function is appropriate. The parameter a, which controls the gain (steepness) of the sigmoid, was unity for this plot. The flat areas at high and low values of y(t) allow the device to behave as a classifier.

3.3 Back-Propagation

Although it is possible to train networks with internal binary quantizers using random search techniques [23], a more efficient algorithm known as "back-propagation" solves the credit assignment problem for networks with sigmoidal nonlinearities of the type shown in Figure 3.3 [105]. Figure 3.4 shows a two-layer sigmoidal network, but the discussion applies to networks with any number of layers, and to feed-forward networks without a layered structure. If the node and input indices increase from the inputs to the outputs, then the feed-forward requirement means that a weight, w_{ij}, from node j to node i, can exist only if i > j.

The output of each node is o_i(t) = f_i(y_i(t)), approximating a binary quantizer as the gain, or steepness, of the sigmoid, f_i(\cdot), increases. Often, all of the nodes have identical sigmoids, but back-propagation can still be applied when each node implements a unique sigmoid, f_i(\cdot). For a given task, networks employing continuous sigmoids usually require fewer hidden nodes than those employing binary quantizers [111, 109].
Networks of this type are universal approximators, and with proper output scaling they can implement any continuous transfer function using only a single hidden layer [36, 47, 48, 49, 54, 52, 53]. However, two hidden layers may be required to approximate the discontinuous inverses of some continuous functions that arise in control applications [110], and networks with two or more hidden layers may be able to approximate certain functions with fewer nodes [15, 122].

Figure 3.4: Two layer back-propagation network. The circles represent ALC nodes cascaded with sigmoidal nonlinearities of the type shown in Figure 3.3. The node and input indices increase from top to bottom, and from left to right, so the feed-forward restriction means that a weight, w_{ij}, can exist only if i > j. The back-propagation algorithm also applies to networks with additional layers, and to networks with weights that skip across layers.

As in equation (3.1), a target signal, \hat{O}(t), defines the instantaneous error,

    e(t) = \| \hat{O}(t) - O(t) \|.    (3.6)

The corresponding mean squared error, \xi(t), is

    MSE = \xi(t) = E[e^2(t)] = E[ \| \hat{O}(t) - O(t) \|^2 ] = \sum_{m \in O} E[e_m^2(t)],    (3.7)

where O is the set of all output nodes, indexed by m, and e_m(t) is the error at output m:

    e_m(t) = \hat{o}_m(t) - o_m(t).    (3.8)

The weights cannot be found by solving for \nabla\xi = 0 because the nonlinearities preclude an analytical solution. However, the method of steepest descent requires the value of \nabla\xi at only a single point in the space defined by the network weights. This value can be calculated easily because the differentiability of the sigmoids ensures the existence of partial derivatives of the MSE with respect to the weights. The gradient can then be used for weight adaptation:

    W(t+1) = W(t) - \mu \nabla\xi(t).    (3.9)

Here, W(t) represents all the weights in the network, including those of the hidden nodes.
For each weight, equation (3.9) gives:

    w_{ij}(t+1) = w_{ij}(t) - \mu \, \partial\xi/\partial w_{ij}.    (3.10)

The partial derivative in (3.10) can be expanded as

    \partial\xi/\partial w_{ij} = \partial E[e^2(t)]/\partial w_{ij} = E[ (\partial e^2/\partial y_i)(t) \, (\partial y_i/\partial w_{ij})(t) ] = -E[\Delta_i(t) o_j(t)],    (3.11)

where \Delta_i(t) is

    \Delta_i(t) = -(\partial e^2/\partial y_i)(t).    (3.12)

For notational convenience in equation (3.11), o_j(t) represents either a network input, or the output of a hidden node.

If the signals are sample functions of ergodic processes, equation (3.11) can be estimated with a time average [90],

    \partial\xi/\partial w_{ij} \approx -(1/N) \sum_{t'=t-N+1}^{t} \Delta_i(t') o_j(t'),    (3.13)

and the weight updates occur after each batch interval of length N. For the on-line case (N = 1), the instantaneous signal values provide the estimate,

    \partial\xi/\partial w_{ij} \approx -\Delta_i(t) o_j(t).    (3.14)

A slightly different approach, somewhere between batch mode and on-line adaptation, uses an exponentially weighted sum of previous signal values,

    \partial\xi/\partial w_{ij} \approx -(1 - \alpha) \sum_{t'=0}^{t} \alpha^{t'} \Delta_i(t - t') o_j(t - t'),    (3.15)

which can be expressed in terms of the estimate from the previous time step, writing \hat{\rho}_{ij}(t) for the estimate of -\partial\xi/\partial w_{ij}:

    \hat{\rho}_{ij}(t) = 0                                                    if t = 0,
    \hat{\rho}_{ij}(t) = (1 - \alpha) \Delta_i(t) o_j(t) + \alpha \hat{\rho}_{ij}(t-1)    if t > 0.    (3.16)

The momentum parameter, \alpha, adds a kind of inertia to the weight vector, filtering out high frequency changes in the gradient [104]. It often permits a higher learning rate, \mu, and may allow the network to bypass local minima in the MSE surface. The method reduces to the instantaneous on-line case, described by equation (3.14), when \alpha is zero. Momentum can also be applied to batch-mode adaptation, averaging the gradient estimates in equation (3.13) from consecutive batches.

Whatever the method, estimation of \partial\xi/\partial w_{ij} requires only the output of node j and the \Delta_i(t) term associated with node i. If node i is an output, it is easy to see that

    \Delta_i(t) = -(\partial e^2/\partial y_i)(t) = -(\partial e^2/\partial o_i)(t) \, (\partial o_i/\partial y_i)(t) = 2(\hat{o}_i(t) - o_i(t)) f_i'(y_i(t)),    (3.17)

where f_i'(y_i) is the slope of the sigmoid.
For other nodes, with C_i denoting the set of nodes that receive the output of node i,

    (\partial e^2/\partial o_i)(t) = \sum_{k \in C_i} (\partial e^2/\partial y_k)(t) \, (\partial y_k/\partial o_i)(t) = \sum_{k \in C_i} (\partial e^2/\partial y_k)(t) \, w_{ki} = -\sum_{k \in C_i} \Delta_k(t) w_{ki},    (3.18)

so that

    \Delta_i(t) = f_i'(y_i(t)) \sum_{k \in C_i} \Delta_k(t) w_{ki}.    (3.19)

The recursive expression for \Delta_i(t) in equation (3.19) represents information flowing from the outputs toward the inputs, and is responsible for the term "back-propagation". The backward flow of information permits estimation of the partial derivatives using only locally available signals. That is, the estimate for node i requires data only from other nodes having direct connections to node i. In operation, an input signal propagates forward through all the layers until it reaches the outputs. From there, an error signal derived from the output and target signals propagates back toward the input nodes according to equation (3.19). This process repeats at each time step, and (3.13) or (3.14) estimates the expected value in (3.11) for batch or on-line adaptation, respectively. Table 3.1 summarizes the network operation for on-line adaptation without momentum.

    y_i(t) = \sum_j w_{ij} o_j(t)
    o_i(t) = f_i(y_i(t))
    \Delta_j(t) = 2(\hat{o}_j(t) - o_j(t)) f_j'(y_j(t))                    for j \in O
    \Delta_j(t) = f_j'(y_j(t)) \sum_{i \in C_j} w_{ij}(t) \Delta_i(t)      otherwise
    w_{ij}(t+1) = w_{ij}(t) + \mu \Delta_i(t) o_j(t)

Table 3.1: Back-propagation equations for on-line learning. The symbols follow the notation in the text. For notational convenience, o_j(t) represents either an input signal, or the output of a node. The first two lines describe forward-propagating signals, the third and fourth describe backward-propagating signals, and the final line describes on-line weight adaptation.

Several researchers independently discovered back-propagation as a technique for nonlinear optimization [68, 92, 104, 105, 128]. Unlike the LMS algorithm, the shape of the error surface is very complex, often containing many local minima. However, most of these minima seem to produce near-optimal performance, so their existence is not a serious problem [29].
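The on-line equations of Table 3.1 can be sketched in a few lines. The fully connected two-layer shape, the tanh sigmoids with unity gain, and the finite-difference check below are our own illustrative choices, not the thesis's implementation:

```python
import numpy as np

def forward(Ws, x):
    """Forward pass; returns the node outputs of every layer, input included."""
    outs = [x]
    for W in Ws:
        outs.append(np.tanh(W @ outs[-1]))        # o = f(y), with f = tanh
    return outs

def deltas(Ws, outs, target):
    """Back-propagated Delta terms, per equations (3.17) and (3.19)."""
    ds = [2.0 * (target - outs[-1]) * (1.0 - outs[-1] ** 2)]   # output layer
    for W, o in zip(reversed(Ws), reversed(outs[1:-1])):
        ds.append((W.T @ ds[-1]) * (1.0 - o ** 2))             # hidden layers
    return list(reversed(ds))

def online_step(Ws, x, target, mu):
    """w_ij(t+1) = w_ij(t) + mu * Delta_i(t) * o_j(t), for every weight."""
    outs = forward(Ws, x)
    for W, d, o in zip(Ws, deltas(Ws, outs, target), outs[:-1]):
        W += mu * np.outer(d, o)

# Check one back-propagated derivative against a finite difference of e^2.
rng = np.random.default_rng(1)
Ws = [rng.normal(scale=0.5, size=(3, 2)), rng.normal(scale=0.5, size=(1, 3))]
x, target = rng.normal(size=2), np.array([0.3])

analytic = deltas(Ws, forward(Ws, x), target)[0][0] * x[1]  # Delta_i * o_j
h = 1e-6
Ws[0][0, 1] += h
e_plus = np.sum((target - forward(Ws, x)[-1]) ** 2)
Ws[0][0, 1] -= 2 * h
e_minus = np.sum((target - forward(Ws, x)[-1]) ** 2)
Ws[0][0, 1] += h                                 # restore the weight
numeric = -(e_plus - e_minus) / (2 * h)          # -de^2/dw_ij

# A few on-line steps on this single pattern should reduce the error.
e_before = float(np.sum((target - forward(Ws, x)[-1]) ** 2))
for _ in range(50):
    online_step(Ws, x, target, mu=0.05)
e_after = float(np.sum((target - forward(Ws, x)[-1]) ** 2))
```

The finite-difference comparison is a standard way to verify that the Delta recursion really computes the gradient of the squared error.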
For cases where they do cause trouble, it can usually be overcome by adding more hidden nodes [104]. The precise behavior depends on the initial weight vector, which, if assigned small random values, usually results in good performance [29, 104].

Chapter 4

Nonlinear Adaptive Network Filters

The back-propagation networks in Chapter 3 provide advantages over linear techniques for many signal processing applications. Passing the input signal through a series of unit delays creates a nonlinear filter analogous to the linear transversal filter in Figure 2.5. For expanded dynamic range, the output nodes can be simple linear combiners, while the hidden nodes retain their sigmoidal nonlinearities to preserve the benefits of multilayer networks. This chapter describes the application of back-propagation networks to a signal prediction task, where they outperform conventional predictive techniques. The results provide a baseline for comparison with the dispersive networks to be discussed in Chapter 5. Since the analysis of dispersive networks benefits from a continuous-time treatment, the next sections set up an appropriate framework by presenting continuous-time formulations of LMS adaptation and back-propagation. The resulting differential equations describe possible analog physical implementations of the devices, while the discrete-time formulations in previous chapters are more appropriate for digital implementations. Much of the previous work in this area has concentrated on digital implementations due to the availability of digital computers.

4.1 Continuous-Time LMS Adaptation

A continuous-time version of LMS adaptation arises when t assumes real values, and the weights change according to the differential equation,

    dw_j/dt = -\mu \, \partial\xi/\partial w_j = 2\mu E[e(t) x_j(t)].    (4.1)

Equation (4.1) is analogous to the discrete-time version developed in Chapter 2. The expected value represents an ensemble average, which, for ergodic signals, can be estimated with the integral,

    E[e(t) x_j(t)] \approx (1/T) \int_{t-T}^{t} e(t') x_j(t') \, dt',    (4.2)

leading to batch-mode adaptation. Since the term "batch" implies a collection of discrete events, the intervals of length T are more commonly known as "epochs" during continuous-time adaptation. The weights change in a piece-wise linear fashion, with the rate of change assuming new values at times T, 2T, 3T, etc.:

    (dw_j/dt)(nT) = (2\mu/T) \int_{(n-1)T}^{nT} e(t') x_j(t') \, dt'.    (4.3)

Although the weights continue to change during the integration, the resulting error can be controlled by choosing a sufficiently small value for \mu, with longer epochs requiring smaller values.

On-line adaptation occurs in the limit as T approaches zero, giving

    dw_j/dt = 2\mu e(t) x_j(t),    (4.4)

which also approximates equation (4.1) for sufficiently small values of \mu. The slowly accumulating weight changes implicitly perform the averaging required to estimate the expected value.

4.2 Continuous-Time Back-Propagation

The network structure for continuous-time back-propagation is identical to Figure 3.4, but all the signals are continuous functions of time. The mean squared error is analogous to equation (3.7):

    \xi(t) = E[ \| \hat{O}(t) - O(t) \|^2 ] = E[ \sum_{m \in O} e_m^2(t) ].    (4.5)

A differential equation replaces the discrete weight-update formula in (3.9), giving

    dW/dt = -\mu \nabla\xi(t),    (4.6)

which can be written as

    dw_{ij}/dt = -\mu \, \partial\xi/\partial w_{ij} = \mu E[\Delta_i(t) o_j(t)] = \mu \rho_{ij}(t)    (4.7)

for each weight, following the derivation in Chapter 3.
Here, \rho_{ij}(t) represents the negative partial derivative of \xi(t) with respect to w_{ij}, which is defined as an expected value over an ensemble.

For ergodic signals, an estimate, \hat{\rho}_{ij}, of the expected value results from the integral,

    \hat{\rho}_{ij} = (1/T) \int_{t-T}^{t} \Delta_i(t') o_j(t') \, dt' \approx E[\Delta_i(t) o_j(t)].    (4.8)

Batch-mode adaptation using equation (4.8) is analogous to equation (4.3) from the LMS version:

    (dw_{ij}/dt)(nT) = (\mu/T) \int_{(n-1)T}^{nT} \Delta_i(t') o_j(t') \, dt'.    (4.9)

Again, the error due to the changing weights is small if \mu is sufficiently small.

For on-line learning, a "leaky integrator" can provide an estimate of the partial derivative:

    \eta \, d\hat{\rho}_{ij}/dt = -\hat{\rho}_{ij} + \Delta_i(t) o_j(t).    (4.10)

The time constant, \eta, is related to the discrete-time momentum parameter, \alpha, as shown in Appendix A. Differentiating equation (4.7) with respect to time and substituting (4.10) allows the weight adaptation formula to be written as

    \eta \, d^2 w_{ij}/dt^2 = -dw_{ij}/dt + \mu \Delta_i(t) o_j(t).    (4.11)

In the limit where \eta goes to zero, the leaky integrator follows \Delta_i(t) o_j(t) precisely, and the weight adaptation formula becomes

    dw_{ij}/dt = \mu \Delta_i(t) o_j(t).    (4.12)

Therefore, continuous-time on-line adaptation can occur with or without momentum, in a manner similar to the discrete-time case discussed in Chapter 3. Without momentum, the calculation of dw_{ij}/dt requires no explicit storage or integration, since the slowly changing weights implicitly estimate the required averages. With the complete formulation of the adaptation techniques, it is now possible to present some benchmark simulation results.

4.3 The Mackey-Glass Signal Prediction Task

Accurate predictions about the future, based on observations of the past, are very useful in science and engineering (not to mention business and economics)! The usual technique of modeling the underlying process cannot always be applied.
Often, the process is either completely unknown, or far too complex to permit a practical model, leaving only the observed behavior of the system for use in making predictions.

The signal produced by integrating the Mackey-Glass delay-differential equation [71],

    dx(t)/dt = -b x(t) + a x(t-\tau) / (1 + x(t-\tau)^{10}),    (4.13)

provides a commonly used benchmark [33, 66, 80, 81, 82, 106, 107, 114] for comparing model-free predictive techniques. It results from modeling the regulation of blood formation, and its properties have been studied extensively [32]. Although (4.13) gives an explicit form to the underlying process, only the signal, x(t), can be used in a model-free test.

For comparison with other published results [32, 33, 66, 80, 82], values of \tau = 17, a = 0.2, and b = 0.1 were chosen, and the equation was integrated using a four-point Runge-Kutta method with a step size of 0.1 and a constant initial value of x = 0.8 for t < 0. Discarding the results from the first 1000 integration steps allowed any start-up transients to die out. The resulting signal, shown in Figure 4.1, is quasi-periodic with a characteristic time of t_char ≈ 50. The characteristic time is defined as the inverse of the mean frequency of the power spectrum.

Figure 4.1: Mackey-Glass signal with \tau = 17, a = 0.2, and b = 0.1. The Mackey-Glass delay-differential equation was integrated using a four-point Runge-Kutta method with a step size of 0.1, and a constant initial value of x = 0.8. Discarding data from the first 1000 integration steps allowed any initial transients to die out. The resulting signal is quasi-periodic with a characteristic time of t_char ≈ 50.

As described in [14, 66, 114], the task is to predict the Mackey-Glass signal an interval, T = 6, into the future. Once trained, additional networks can be cascaded to provide further predictions at intervals T = 12, 18, 24, ... This prediction task involves continuous, time-varying inputs and outputs, providing a good test of nonlinear filter performance.

4.4 Simulation Details

Simulations of the type described here have many free parameters that must be assigned specific values. To some extent, the network topology, numerical integration method, error measure, learning rate, and derivative estimation method all have an influence on the ultimate performance.

4.4.1 Training Configuration

Figure 4.2 shows one possible method for training an adaptive network to predict a signal T time units into the future. The delay, T, at the input allows the undelayed signal, x(t), to be used as the target during training. Minimizing the error, e(t), by adapting the weights causes the network to "predict" the current value of the input signal based on its values up to time t - T. After training, the delay at the input can be removed, allowing the network to make true future predictions. However, without this delay, adaptation must stop due to the lack of a target signal.

Figure 4.2: Block diagram of a system for training a network to predict its input signal. A delayed version of the input signal feeds into the network so that the undelayed input can be used as the target signal. Minimizing the error, e(t), causes the network to learn to predict x(t) based on its values up to time t - T. The network output is a "prediction" of the current value of the input signal. For true prediction, the delay, T, must be removed.

4.4.2 Network Topology and Parameters

Figure 4.3 shows the network topology, which is similar to that used in [66], where it performed well on a similar Mackey-Glass signal prediction task.
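The signal generation described in Section 4.3 might be sketched as follows. Holding the delayed term x(t - \tau) constant across each Runge-Kutta step, and the variable names, are simplifying assumptions of this sketch rather than the thesis's exact procedure:

```python
# Mackey-Glass integration sketch: four-point Runge-Kutta, step h = 0.1,
# tau = 17, a = 0.2, b = 0.1, with x = 0.8 for t <= 0 as constant history.
a, b, tau, h = 0.2, 0.1, 17.0, 0.1
lag = int(tau / h)                      # 170 steps of stored history
hist = [0.8] * (lag + 1)                # x(t) for t <= 0

def f(x, x_tau):
    """Right side of equation (4.13)."""
    return -b * x + a * x_tau / (1.0 + x_tau ** 10)

for step in range(5000):
    x, x_tau = hist[-1], hist[-1 - lag]  # current value and x(t - tau)
    k1 = f(x, x_tau)
    k2 = f(x + 0.5 * h * k1, x_tau)      # delayed term held fixed (assumption)
    k3 = f(x + 0.5 * h * k2, x_tau)
    k4 = f(x + h * k3, x_tau)
    hist.append(x + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))

signal = hist[lag + 1 + 1000:]          # discard the start-up transient
```

After discarding the first 1000 steps, the stored values oscillate in the band shown in Figure 4.1.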
There are two hidden layers, consisting of ten sigmoidal nodes each, and a single linear output node. The input signal passes through four delay elements to form a 4-dimensional "embedding space". The delay elements are each of length 6 time units, providing four network inputs of x(t-6), x(t-12), x(t-18), and x(t-24). A single connection joins each pair of nodes in adjacent layers, for a total of 150 adaptable weights.

For the hidden nodes, each f_j(\cdot) was a hyperbolic tangent:

    f_j(x_j) = \tanh(a[x_j - \theta_j]),    (4.14)
    f_j'(x_j) = a(1 - f_j^2(x_j)) = g_j(f_j(x_j)).    (4.15)

The hyperbolic tangent is a scaled and offset version of the commonly used logistic function [104], so that the output lies in the range (-1, 1) instead of (0, 1). The bipolar nature of this function allows adaptation to occur near the negative "tail" of the sigmoid, where it could not occur with the logistic function because the output signal, o_j(t), in equation (4.7) would be zero. All nodes had unity gain (a = 1), and adaptable thresholds, \theta_j, were implemented as connections from an additional node with a constant output of 1, as discussed in Section 2.1. The initial weights were pseudorandomly chosen from a uniform distribution on the interval [-0.1, 0.1], and a weight learning rate of \mu = 0.005 was selected, based on a few experimental simulation runs.

Figure 4.3: Transversal network topology. A delayed input signal feeds into the network so that an undelayed version can be used as the target signal during adaptation. Further delays on the input provide the network with three additional signal samples: x(t-12), x(t-18), and x(t-24). Minimizing the error causes the network to predict its input signal 6 time units into the future. During adaptation, however, the extra delay at the input allows the network to "predict" only the present value of the signal.

Figure 4.1 shows that the Mackey-Glass signal is always positive, so the full range of the input nodes cannot be exploited. The bias weights will eventually correct the situation, but it can take a long time [12]. To encourage faster convergence, the mean value of the input signal should be removed [12, 66]. The mean can be estimated with a running average, using an integrator of the type described by equation (4.10) with a large time constant. Appendix B shows how a similar estimate can be obtained in a discrete-time software simulation. However, to speed up the simulations here, the mean was measured in advance and found to be approximately 0.92534. A preprocessing stage subtracted this constant value from the input signal before presenting it to the network.

4.4.3 Numerical Integration Technique

The differential equations describing the network's behavior, including adaptation, represent a nonalgorithmic continuous-time process. Simulation on a digital computer requires conversion to an algorithmic form such as Euler integration, which uses a small, fixed-length time step. The Euler method is a good choice because of its light computational demands during each integration step, and its simplicity of coding. Smaller time steps gave virtually identical results, suggesting that more complex integration schemes would not provide any benefits. However, to train the network on data as close as possible to the "true" Mackey-Glass signal, equation (4.13) was integrated using a four-point Runge-Kutta technique. To obtain a good global fit to the Mackey-Glass data, adaptation took place with a continually evolving input signal, rather than a repeated, finite training segment.

4.4.4 Performance Measure

As in [33], the normalized rms error,

    NRMSE = E[ \| \hat{O}(t) - O(t) \|^2 ]^{1/2} / E[ \| \hat{O}(t) - E[\hat{O}(t)] \|^2 ]^{1/2},    (4.16)

was used to monitor the network's performance during training. The NRMSE is the rms error at the output, divided by the standard deviation of the target signal. For the Mackey-Glass application, the output "vector", O(t), has only a single dimension.
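Equation (4.16) might be sketched as follows for a one-dimensional output; the sinusoidal test signal is an arbitrary stand-in for the network target:

```python
import numpy as np

def nrmse(target, output):
    """Normalized rms error (4.16): rms error over target standard deviation."""
    rms_error = np.sqrt(np.mean((target - output) ** 2))
    return rms_error / np.std(target)

t = np.linspace(0.0, 10.0, 1000)
target = np.sin(t)                                             # stand-in target
perfect = nrmse(target, target)                                # 0: perfect
constant = nrmse(target, np.full_like(target, target.mean()))  # about 1
```

A perfect predictor scores zero, and a constant predictor at the target mean scores one, matching the interpretation given in the text.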
An NRMSE of zero indicates perfect prediction, and an NRMSE of unity means the prediction is no better than a constant predictor at E[\hat{O}(t)].

4.5 Prediction Results

Figure 4.4 shows a typical learning curve for the back-propagation network, along with learning curves for three different linear transversal filters trained on the same task.

Figure 4.4: Mackey-Glass signal prediction learning curves. The back-propagation network had an embedding dimension of 4, and reached an NRMSE of 0.058 at time t = 10^6. For comparison, linear predictors with embedding dimensions of 4, 25, and 150 reached NRMSEs of only 0.428, 0.305, and 0.222, respectively, after the same length of time. A further increase of the linear filter embedding dimension to 300 resulted in a marginally lower NRMSE, but greatly increased the variance of the prediction error.

The curves are not identical for different random starting weights, but the graph gives an indication of their relative performances. The first linear filter had four inputs, giving the same embedding space as the back-propagation network. For improved performance, a second linear filter had 25 delayed inputs, and a third had 150, giving it the same number of adaptable parameters as the back-propagation network. The results clearly show the advantage of internal nonlinearities for this signal prediction task, even over the linear filter with 150 inputs. A further increase in the size of the linear filter to 300 inputs resulted in only a marginal improvement in error, but a large increase in noise.
The noise likely arises because samples of the signal from 300 time units in the past are virtually uncorrelated with its present and future behavior, and because the misadjustment increases with the square of the number of weights [134].

The error in Figure 4.4 is a normalized version of the adaptive prediction error shown in Figure 4.3. True prediction requires removal of the T = 6 delay, thus halting adaptation and increasing the prediction error. Stopping the adaptation at time t = 10^6 allowed the network to be tested in this nonadaptive prediction configuration. Figure 4.5 plots the true prediction output from the two techniques, along with the actual Mackey-Glass signal. The NRMSE observed in this situation was 0.063 for the static network, and 0.36 for the linear filter with 150 inputs.

The large increase in error when using the linear filter for true prediction suggests that it was adapting to very recent portions of the input signal, without learning its global structure very well. It may also be related to the misadjustment noise visible on the learning curve, which could cause the weights to be far from their optimal values when adaptation stopped [134]. On the other hand, the back-propagation network performed almost as well on the true prediction task as it did during adaptation, suggesting that it found an accurate global representation of the Mackey-Glass signal.

The benefits that nonlinear networks provide over linear prediction [72] and the Gabor polynomial predictive method [37] were clearly demonstrated in [66]. There, a batch-mode adaptation technique, using a nonlocal conjugate gradient descent optimization procedure running on a Cray X-MP supercomputer, obtained near-optimal values for the weights in about an hour of CPU time.

The results presented here required less than an hour on a Sun SPARCstation 2 using a local gradient descent technique, so the weights have not converged to their optimal values.
Figure 4.5: Mackey-Glass prediction results. A back-propagation network and a linear prediction filter with 150 weights were trained up to time t = 10^6, and then placed in nonadaptive, true prediction configurations. The graph shows the prediction output of each device, along with the actual Mackey-Glass signal for comparison. The plots begin at t = 150 to allow the delay buffers to fill with valid data, and the prediction time is T = 6.

However, the intent is not to duplicate the results in [66], but to show that continuous-time on-line adaptation is practical, and to provide a baseline for comparison with the dispersive networks to be discussed in the next chapter. To update each weight, the nonlocal optimization procedure used in [66] requires data from the entire network, so it does not lend itself well to parallel physical implementations. However, one of the prime motivations for investigating network filters is that they can be implemented in parallel hardware, which will provide an enormous speed advantage over software simulations.

Chapter 5

Dispersive Networks

The nonlinear back-propagation filters in Chapter 4 are an extension of the linear transversal filters described in Chapter 2. In both cases, the output from a series of unit delays forms the input vector. A more powerful structure can be obtained by replacing each ALC in the back-propagation network with a linear transversal filter. The resulting network contains internal time delays, so its output can depend on temporal activation patterns in the hidden nodes. Features in the input signal disperse in time and space as they propagate along different pathways, influencing the output at many different points in the future. Unfortunately, dispersive networks are more difficult to train than their nondispersive counterparts.
This chapter presents a new continuous-time technique for adapting the internal time delays, as well as the weights, in a dispersive network.

Gradient descent with dispersive networks presents a difficult problem because the two signals forming the product in equation (3.11) are not simultaneously available. When o_j(t) is present at time t, the corresponding \Delta_i(t) requires a signal to propagate to the network outputs, and the resulting error signal to propagate back to node i. However, by this time a different value of o_j will be present due to new inputs continually arriving. Correct operation requires the storage of o_j(t), and its proper correlation with \Delta_i(t) when \Delta_i(t) becomes available. To make matters worse, the backward propagation of the delta signals requires proper correlation with stored values of f'(y(t)), as seen in equation (3.19).

An algorithm presented in [4] solves this problem by including zero-delay feed-forward pathways between all pairs of connected nodes. Weight modifications have an immediate effect on the output, so the instantaneous error can be minimized using a slightly altered form of back-propagation. However, the technique ignores the delayed effects of the weight changes on the output signal, and the total error may not be minimized.

5.1 Unfolding in Time

One technique for minimizing the total error in a dispersive network is known as "unfolding in time". Figure 5.1 shows the process for a very simple two-layer network with two weights between each connected pair of nodes. Every weight requires three identifying subscripts because a pair of nodes can have multiple connections, each with a different delay. The unfolding process effectively moves all the internal delays to the input layer, yielding an equivalent filter in transversal form with more nodes and weights than the original.
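A key piece of bookkeeping in the unfolded network is that a weight duplicated by unfolding must keep a single value, so the derivative with respect to the shared value is the sum of the per-copy partial derivatives. A toy scalar check, with an arbitrary quadratic loss and values of our own choosing:

```python
def loss(w1, w2, x0, x1):
    """Squared output of a toy unfolded filter; w1 and w2 are copies of one weight."""
    return (w1 * x0 + w2 * x1) ** 2

w, x0, x1, h = 0.3, 1.2, -0.5, 1e-6

# Gradient treating the two copies as one shared weight (central difference).
shared = (loss(w + h, w + h, x0, x1) - loss(w - h, w - h, x0, x1)) / (2 * h)

# Sum of the per-copy gradients.
g1 = (loss(w + h, w, x0, x1) - loss(w - h, w, x0, x1)) / (2 * h)
g2 = (loss(w, w + h, x0, x1) - loss(w, w - h, x0, x1)) / (2 * h)
```

The shared gradient equals g1 + g2, which is why the components for each copy must be added before the common weight update is applied.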
Each additional delay inlayers other than the input layer generally leads to an additional copy of the network to theleft of that point. This new subnetwork will have copies of the weights in the original network,but they must be constrained to maintain a single value. For example, in Figure 5.1(c), WBA1and wam represent the same weight. When taking the partial derivative of the MSE withrespect to this weight, the two components must be added and the resulting weight updatemust be applied to both copies.Since unfolding in time requires the construction of a network much larger than the orig-inal, the computational costs quickly become unmanageable. However, if the internal delaysare integral multiples of unit delays, then there is usually a large amount of redundancy thatcan be exploited in the expansion, leading to a smaller unfolded network than would other-wise be the case. This is the approach usually taken when training a Time Delay NeuralNetwork (TDNN) [123].In addition to the problem of a potentially large unfolded network, the constraints on theweights require nonlocal communication to ensure that multiple weight copies retain the samevalue. While not a problem in software simulations, this type of nonlocal communication cancreate difficult practical problems in a parallel hardware implementation, where the overheadChapter 5. Dispersive Networks^ 43W BA1^ WCB1A(t)Figure 5.1: Unfolding in Time. This technique separates the temporal structurefrom the spatial structure, resulting in a much larger network. The progressionfrom (a) to (c) illustrates the process. Each delay in a hidden or output layergenerally requires an additional copy of the entire network to the left of thatdelay. The result is a transversal filter incorporating a nondispersive networkof the type described in Chapter 4, with multiple copies of the weights wherenecessary. 
These multiple weights must be constrained to have the same values if they will be transferred back to the dispersive network after training.

related to connectivity is often a limiting factor. To overcome these limitations, a new technique known as temporal back-propagation operates on the original dispersive network using only local communication.

5.2 Temporal Back-Propagation

The term "temporal back-propagation" comes from [125], which describes a discrete-time adaptation algorithm for networks with fixed, integer-valued internal delays. The technique developed here is more general, allowing continuous-time signals and adaptable, real-valued delays. However, it seems logical to extend the scope of the name to include this new technique, rather than trying to invent new terminology in an already overburdened nomenclature. Temporal unfolding is impractical with this new type of network because the adaptable, real-valued delays eliminate any redundancy in the unfolded structure.

Before presenting a detailed derivation of temporal back-propagation, it will be helpful to describe its operation in a simplified diagrammatic form. Figure 5.2 shows a connection between two nodes in a conventional back-propagation network. The feed-forward signal flows from left

Figure 5.2: A connection between two nodes in a conventional back-propagation network. The feed-forward signal flows from left to right along the top of the diagram, where multiplication with the weight occurs. The back-propagated error flows from right to left along a similar path at the bottom. The product of these two signals, along with a learning rate, \mu, controls the weight adaptation.

Figure 5.3: A connection between two nodes in a dispersive network. The gray boxes representing time delays are the only difference between this diagram and Figure 5.2.
As in the conventional back-propagation connection, the feed-forward signal flows from left to right along the top of the diagram, while the back-propagated error signal flows from right to left along the bottom. The delay, \tau, is in the feed-forward path before the multiplicative weight. Note the additional delay in the back-propagation path. Its length is chosen so that the sum of the feed-forward delay and the back-propagation delay is a constant, K, for all connections in a particular layer of the network. The two additional delays ensure that the signals remain properly correlated during the multiplications involving the back-propagated error.

to right along the top of the diagram, while the back-propagated error signal flows from right to left along the bottom. Both signal paths contain the same multiplicative weight, w(t). The rate of change for this weight during adaptation is the product of the feed-forward signal, the back-propagated signal, and a learning rate, \mu.

Figure 5.3 shows a similar connection for a dispersive network being trained with temporal back-propagation. Again, the feed-forward signal flows from left to right along the top, while the back-propagated error signal flows from right to left along the bottom. The feed-forward

Figure 5.4: A conventional back-propagation node. The sum of the feed-forward signals from nodes in the preceding layer passes through a sigmoidal nonlinearity. In the back-propagation path, the sum of the error signals from nodes in the succeeding layer forms a product with the derivative of the feed-forward signal.

delay, \tau, is in the feed-forward path before the multiplicative weight, and there is an additional delay in the back-propagation path. The length of this additional delay is chosen so that the sum of the feed-forward delay and the back-propagation delay is a constant for all connections in a given layer of the network.
As a result, the time it takes for a signal to propagate from a particular node to the output layer, and for the resulting error signal to propagate back, is constant no matter what path the signal happens to follow.

From this constant propagation time, it is obvious that the back-propagated signal will be "older" than the feed-forward signal by a fixed time delay, a_i, which is the "age" of the error signal at this connection. To maintain proper correlation during the weight update multiplication, the feed-forward signal must be delayed by a_i units of time before the multiplication can be performed. Finally, a delayed version of the changing weight must be used in the back-propagation path, so that its age matches the age of the back-propagating error signal.

Figure 5.4 shows a conventional back-propagation network node. The sum of the input signals from the preceding layer on the left passes through a sigmoidal nonlinearity to form the output. In the back-propagation path, the derivative of the feed-forward signal with respect to the feed-forward sum forms a product with the sum of the error signals from nodes in the succeeding layer.

Figure 5.5: A dispersive network node. This node is nearly identical to the conventional back-propagation node. However, since the back-propagating signal is older than the feed-forward signal, the latter must be delayed by an appropriate amount before the multiplication can be performed.

Figure 5.5 is the corresponding diagram for a dispersive network node. Since the back-propagated error signal is older than the feed-forward signal, the derivative of the feed-forward signal must be delayed by an appropriate amount before the multiplication can be performed. Apart from this delay, there is no difference between a dispersive network node and a conventional back-propagation node.
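The delay-balancing rule described above can be stated in a few lines. In this sketch, the forward delays for one layer are made-up values, and the constant K is chosen as the largest forward delay so that every backward-path delay is nonnegative:

```python
# Sketch of the delay-balancing rule: for every connection in a given
# layer, the back-propagation delay is chosen so that
# (feed-forward delay + back-propagation delay) equals the same
# constant K.  The delays below are hypothetical example values.

forward_delays = [0.7, 2.3, 1.1, 3.0]   # tau for four connections in one layer
K = max(forward_delays)                 # smallest feasible constant for this layer

backward_delays = [K - tau for tau in forward_delays]

# Every forward/backward pair now sums to K, so a signal leaving a node
# and the error returning to it incur the same round-trip delay,
# whichever connection in this layer they traverse.
for tau, back in zip(forward_delays, backward_delays):
    assert abs((tau + back) - K) < 1e-12
    assert back >= 0.0
```

Choosing K as the layer's largest forward delay keeps the extra buffering to a minimum, which matters because the backward delays consume storage.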
With this introductory explanation, the details of the mathematical derivation can now be presented.

5.2.1 Notation

Figure 5.6 shows a section of a typical feed-forward dispersive network. Connections can skip across layers, and there can be more than one connection between two particular nodes. Every connection has both a weight and a time delay, but any of the time delays can be fixed at zero

Figure 5.6: Section of a typical feed-forward dispersive network. Each connection has both a weight, w_{ijk}, and a time delay, \tau_{ijk}, that can adapt to minimize the error at the output. There can be multiple connections with different delays between any two nodes in different layers, and they can skip across layers like the connection between node j and node h. Features in the input signal spread out in time and space as they travel from left to right. Error signals propagate from right to left, but they are absent from the diagram for clarity.

if desired. The following notation describes the network structure and behavior:

    w_{ijk}(t)   = weight of the kth connection from node j to node i at time t.
    \tau_{ijk}(t) = delay of the kth connection from node j to node i at time t.
    o_i(t)       = either the output of node i, or a network input at time t.
    x_{ijk}(t)   = input to node i from the kth connection from node j at time t.
    y_i(t)       = total input to node i at time t.
    f_i()        = sigmoidal nonlinearity of node i.
    \delta_m(t)  = target signal for output node m.
    e_m(t)       = the error at output node m at time t.
    s_m(t)       = the squared error at output node m at time t.
    O            = set of all output nodes.
    C_j          = set of all nodes that node j feeds into.
    A_{ij}       = set of all connections between node j and node i.
    N_{mi}       = the total number of signal paths, by any route, from node i to node m.
    z_{mip}      = the total delay from node i to node m, via the pth route (p = 1 ... N_{mi}).

The following relationships summarize the feed-forward behavior of the network:

    x_{ijk}(t) = w_{ijk}(t) o_j(t - \tau_{ijk}(t)),                                          (5.1)

    y_i(t) = \sum_j \sum_k x_{ijk}(t) = \sum_j \sum_k w_{ijk}(t) o_j(t - \tau_{ijk}(t)),     (5.2)

    o_i(t) = f_i(y_i(t)).                                                                    (5.3)

For some values of j, o_j(t) represents a network input. The function, f_i(), is usually sigmoidal, but must be continuous and monotonically increasing.

Adaptation requires an input signal vector, and a corresponding target signal vector, \delta(t). The observed states of the output units constitute an output vector, o(t), and the Euclidean distance,

    s(t) = ||\delta(t) - o(t)||^2 = \sum_{m \in O} (\delta_m(t) - o_m(t))^2 = \sum_{m \in O} e_m^2(t) = \sum_{m \in O} s_m(t),   (5.4)

measures the instantaneous squared error between the output signal and the target signal, where e_m(t) = \delta_m(t) - o_m(t).

5.2.2 The Error, \Psi, and its Functional Derivative

In a standard back-propagation network, a short-lived change to a weight results in a single, immediate, short-lived change to the output node activations. With time delayed connections, a short-lived change to a weight results in many future, short-lived changes at the outputs. Adaptation requires knowledge of how the error at the output, integrated over all time, depends on the weights and time delays. Accordingly, the total squared error, \Psi, is

    \Psi = \int_{-\infty}^{\infty} s(t) dt = \int_{-\infty}^{\infty} \sum_{m \in O} s_m(t) dt.   (5.5)

Although \Psi may be infinite, it has finite functional derivatives with respect to time-localized changes in the weights and time delays. Changes in \Psi can be described by these functional derivatives with respect to w_{ijk}(t_d) and \tau_{ijk}(t_d) at a particular moment in time, t_d.

Figure 5.7 shows a perturbation function, \sigma_{ijk}, centered at time t_d, with magnitude h and duration \epsilon:

    \sigma_{ijk}(t, t_d, h, \epsilon) = { h  if (t_d - \epsilon/2) < t < (t_d + \epsilon/2);  0  otherwise }   (5.6)

Figure 5.7: The perturbation function \sigma_{ijk}(t, t_d, h, \epsilon), centered at time t_d, with magnitude h and duration \epsilon. In the derivation of the error gradient formulae, this function is added to w_{ijk}(t) and \tau_{ijk}(t) while considering the limit where \epsilon and h go to zero.

Perturbing either w_{ijk}(t) or \tau_{ijk}(t) by \sigma_{ijk}(t, t_d, h, \epsilon) causes measurable changes in \Psi that are approximately proportional to \epsilon h when \epsilon and h are small. The functional derivatives,

    \delta\Psi / \delta w_{ijk}(t_d) = \lim_{h,\epsilon \to 0} (1/\epsilon h) { \Psi[w_{ijk}(t) + \sigma_{ijk}(t, t_d, h, \epsilon)] - \Psi[w_{ijk}(t)] },   (5.7)

    \delta\Psi / \delta\tau_{ijk}(t_d) = \lim_{h,\epsilon \to 0} (1/\epsilon h) { \Psi[\tau_{ijk}(t) + \sigma_{ijk}(t, t_d, h, \epsilon)] - \Psi[\tau_{ijk}(t)] },   (5.8)

describe how changes to the weights and time delays influence the total squared error. They will be used as error gradients in Section 5.3.

The dependence of \Psi on w_{ijk} involves many nonlinear terms, so \Psi[w_{ijk}(t) + \sigma_{ijk}(t, t_d, h, \epsilon)] can be calculated only in the limit of vanishingly small h and \epsilon. Expanding equation (5.7) using equation (5.5) gives

    \delta\Psi / \delta w_{ijk}(t_d) = \lim_{h,\epsilon \to 0} (1/\epsilon h) \sum_{m \in O} \int_{-\infty}^{\infty} { s_m(t')[w_{ijk}(t) + \sigma_{ijk}(t, t_d, h, \epsilon)] - s_m(t')[w_{ijk}(t)] } dt'.   (5.9)

A particular weight, w_{ijk}, influences each output error, s_m(t'), through p \in 1 ... N_{mi} different signal paths, each with a total delay of z_{mip}. Thus, s_m(t') can be viewed as a function not of a single parameter, w_{ijk}, but of several parameters, w_{ijk}(t' - z_{mip}), where p acts as a parameter label. According to this interpretation, the notation

    \partial s_m(t') / \partial w_{ijk}(t' - z_{mip})

represents the dependence of the output error at time t' on the weight at time (t' - z_{mip}), through path p.

The term s_m(t')[w_{ijk}(t) + \sigma_{ijk}(t, t_d, h, \epsilon)] can now be expanded as a Taylor series, to first order in \sigma:

    s_m(t')[w_{ijk}(t) + \sigma_{ijk}(t, t_d, h, \epsilon)] = s_m(t')[w_{ijk}(t)] + \sum_{p=1}^{N_{mi}} \sigma_{ijk}(t' - z_{mip}, t_d, h, \epsilon) \partial s_m(t') / \partial w_{ijk}(t' - z_{mip}) + O(\sigma_{ijk}^2).   (5.10)

Substituting equation (5.10) into equation (5.9) and neglecting higher order terms yields

    \delta\Psi / \delta w_{ijk}(t_d) = \lim_{h,\epsilon \to 0} (1/\epsilon h) \sum_{m \in O} \int_{-\infty}^{\infty} \sum_{p=1}^{N_{mi}} \sigma_{ijk}(t' - z_{mip}, t_d, h, \epsilon) [\partial s_m(t') / \partial w_{ijk}(t' - z_{mip})] dt'.   (5.11)

In each of these integrals, the independent variable t' can be replaced by (t'' + z_{mip}), giving

    \delta\Psi / \delta w_{ijk}(t_d) = \lim_{h,\epsilon \to 0} (1/\epsilon h) \sum_{m \in O} \sum_{p=1}^{N_{mi}} \int_{t_d - \epsilon/2}^{t_d + \epsilon/2} h [\partial s_m(t'' + z_{mip}) / \partial w_{ijk}(t'')] dt''.

This replacement doesn't alter the value of the expression because the limits on the integrals are infinite. In the limit where \epsilon goes to zero, the dependence on t'' disappears and the derivative becomes

    \delta\Psi / \delta w_{ijk}(t_d) = \lim_{h,\epsilon \to 0} (1/\epsilon h) \sum_{m \in O} \sum_{p=1}^{N_{mi}} \epsilon h [\partial s_m(t_d + z_{mip}) / \partial w_{ijk}(t_d)]
                                     = \sum_{m \in O} \sum_{p=1}^{N_{mi}} \partial s_m(t_d + z_{mip}) / \partial w_{ijk}(t_d).   (5.12)

A similar derivation yields the following expression for the functional derivative of \Psi with respect to each time delay:

    \delta\Psi / \delta\tau_{ijk}(t_d) = \sum_{m \in O} \sum_{p=1}^{N_{mi}} \partial s_m(t_d + z_{mip}) / \partial\tau_{ijk}(t_d).   (5.13)

Related derivations using a finite difference approximation have been presented in [94, 93] for recurrently connected networks.

5.2.3 Back-Propagated Error Signals

If an error signal,

    \Delta_j(t) = - \sum_{m \in O} \sum_{p=1}^{N_{mj}} \partial s_m(t + z_{mjp}) / \partial y_j(t),   (5.14)

is associated with each node, j, in the network, then the functional derivatives from equations (5.12) and (5.13) can be expressed in terms of \Delta_j(t). The weight derivatives can be expanded using the chain rule,

    \delta\Psi / \delta w_{ijk}(t) = \sum_{m \in O} \sum_{p=1}^{N_{mi}} \partial s_m(t + z_{mip}) / \partial w_{ijk}(t)
                                   = \sum_{m \in O} \sum_{p=1}^{N_{mi}} [\partial s_m(t + z_{mip}) / \partial y_i(t)] [\partial y_i(t) / \partial w_{ijk}(t)]
                                   = -o_j(t - \tau_{ijk}) \Delta_i(t),   (5.15)

and the delay derivatives can be expanded similarly:

    \delta\Psi / \delta\tau_{ijk}(t) = \sum_{m \in O} \sum_{p=1}^{N_{mi}} \partial s_m(t + z_{mip}) / \partial\tau_{ijk}(t)
                                     = \sum_{m \in O} \sum_{p=1}^{N_{mi}} [\partial s_m(t + z_{mip}) / \partial y_i(t)] [\partial y_i(t) / \partial x_{ijk}(t)] [\partial x_{ijk}(t) / \partial\tau_{ijk}(t)]
                                     = w_{ijk}(t) o'_j(t - \tau_{ijk}) \Delta_i(t).   (5.16)

For output nodes, equation (5.14) is

    \Delta_m(t) = -\partial s_m(t) / \partial y_m(t)
                = -[\partial s_m(t) / \partial o_m(t)] [\partial o_m(t) / \partial y_m(t)]
                = 2(\delta_m(t) - o_m(t)) f'_m(y_m(t)),   (5.17)

and for the hidden nodes, it can be expanded using the chain rule:

    \Delta_j(t) = - \sum_{m \in O} \sum_{p=1}^{N_{mj}} \partial s_m(t + z_{mjp}) / \partial y_j(t)
                = - \sum_{m \in O} \sum_{i \in C_j} \sum_{k \in A_{ij}} \sum_{p=1}^{N_{mi}} [\partial s_m(t + \tau_{ijk} + z_{mip}) / \partial y_i(t + \tau_{ijk})] [\partial y_i(t + \tau_{ijk}) / \partial y_j(t)]
                = f'_j(y_j(t)) \sum_{i \in C_j} \sum_{k \in A_{ij}} \Delta_i(t + \tau_{ijk}) w_{ijk}(t + \tau_{ijk}).   (5.18)

This recursive expression for \Delta_j(t) is valid for any feed-forward network, but its evaluation requires future values of \Delta_i, making it noncausal. To overcome this problem, an "aged" error signal, \Delta_j(t, a_j), can be defined so that its value at time t represents the true error signal, \Delta_j(t), at time t - a_j:

    \Delta_j(t, a_j) = \Delta_j(t - a_j).   (5.19)

There is some flexibility in choosing the age, a_j, for each node, but it must be greater than the maximum value of z_{mjp} for all m and p.

For output nodes causality is not a problem, so a_j = 0 will suffice, and equation (5.17) describes the error signal. For hidden nodes, a_j must be chosen to satisfy

    a_j \ge \max_{m,p} z_{mjp} = \max_{i \in C_j} \max_{k \in A_{ij}} (a_i + \tau_{ijk}),   (5.20)

so that \Delta_j(t, a_j) can be defined recursively and causally as

    \Delta_j(t, a_j) = f'_j(y_j(t - a_j)) \sum_{i \in C_j} \sum_{k \in A_{ij}} \Delta_i(t - [a_j - a_i - \tau_{ijk}], a_i) w_{ijk}(t - [a_j - \tau_{ijk}]).   (5.21)

5.3 Gradient Descent Adaptation

Assuming the input and target signals are sample functions of random processes, the mean total squared error can be defined as

    \mathcal{E} = E[\Psi],   (5.22)

where the expected value represents an ensemble average. To minimize \mathcal{E}, the weights and time delays must adapt with changes proportional to the negative gradients,

    dw_{ijk}/dt = -\mu \delta\mathcal{E} / \delta w_{ijk}(t),   (5.23)

    d\tau_{ijk}/dt = -\nu \delta\mathcal{E} / \delta\tau_{ijk}(t),   (5.24)

where \mu and \nu are learning rates. Expanding these equations gives

    dw_{ijk}/dt = -\mu E[\delta\Psi / \delta w_{ijk}(t)] = \mu E[o_j(t - \tau_{ijk}) \Delta_i(t)],   (5.25)

    d\tau_{ijk}/dt = -\nu E[\delta\Psi / \delta\tau_{ijk}(t)] = -\nu E[w_{ijk}(t) o'_j(t - \tau_{ijk}) \Delta_i(t)].   (5.26)

Due to causality constraints, it is difficult to estimate the required ensemble averages until time t + a_i, so a delayed version of gradient descent must be used:

    dw_{ijk}/dt = -\mu E[\delta\Psi / \delta w_{ijk}(t - a_i)] = \mu E[o_j(t - a_i - \tau_{ijk}) \Delta_i(t, a_i)],   (5.27)

    d\tau_{ijk}/dt = -\nu E[\delta\Psi / \delta\tau_{ijk}(t - a_i)] = -\nu E[w_{ijk}(t - a_i) o'_j(t - a_i - \tau_{ijk}) \Delta_i(t, a_i)].   (5.28)

If the learning rates, \mu and \nu, are small enough that the weights do not change significantly during a period equal to the maximum delay through the network, then these equations are a good approximation to true gradient descent. At high learning rates, the time delays can cause the adaptable parameters to overshoot their optimal values, leading to instability. This problem can be easily overcome by choosing smaller learning rates, but the issue of stability deserves further analysis.

For ergodic signals and slow adaptation rates, the expected values in (5.27) and (5.28) can be estimated by time averages. One such estimate for the weights, p_{ijk}(t), can be obtained with a leaky integrator,

    p_{ijk}(t) + \eta dp_{ijk}/dt = o_j(t - a_i - \tau_{ijk}) \Delta_i(t, a_i),

and a similar estimate, q_{ijk}(t), can be obtained for the delays:

    q_{ijk}(t) + \eta dq_{ijk}/dt = w_{ijk}(t - a_i) o'_j(t - a_i - \tau_{ijk}) \Delta_i(t, a_i).

The time constant, \eta, is similar to the discrete-time momentum parameter discussed in [104]. With these estimates, the adaptation equations become

    dw_{ijk}/dt = \mu p_{ijk}(t)   and   d\tau_{ijk}/dt = -\nu q_{ijk}(t).

In the limit where \eta goes to zero, the leaky integrators follow their inputs precisely, giving the on-line adaptation equations:

    dw_{ijk}/dt = \mu o_j(t - a_i - \tau_{ijk}) \Delta_i(t, a_i),   (5.29)

    d\tau_{ijk}/dt = -\nu w_{ijk}(t - a_i) o'_j(t - a_i - \tau_{ijk}) \Delta_i(t, a_i).   (5.30)

The on-line use of the instantaneous gradient is analogous to the stochastic approximation algorithm introduced by Robbins and Monro [99].
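As a rough illustration of equations (5.29) and (5.30), the sketch below Euler-integrates the on-line updates for a single linear connection feeding a linear output node, so that a_i = 0 and the error signal reduces to \Delta(t) = 2(target - output). The input signal, learning rates, and "true" parameter values are invented for the example and are not taken from the thesis simulations:

```python
# Sketch: Euler-discretized on-line updates (5.29)-(5.30) for one
# linear connection, y(t) = w * x(t - tau), adapting toward a target
# generated by hypothetical true parameters.  All numbers are made up.
import math

dt = 0.01
mu, nu = 0.05, 0.05          # weight and delay learning rates
w_true, tau_true = 1.5, 0.8  # parameters that generate the target
w, tau = 0.5, 0.3            # initial guesses

def x(t):                    # smooth two-tone input signal
    return math.sin(t) + 0.5 * math.sin(2.3 * t)

def dx(t):                   # its derivative, needed for the delay update
    return math.cos(t) + 1.15 * math.cos(2.3 * t)

def run(steps):
    """Adapt w and tau on-line for `steps` Euler steps; return the
    average squared prediction error over the run."""
    global w, tau
    err2 = 0.0
    for n in range(steps):
        t = n * dt
        target = w_true * x(t - tau_true)
        out = w * x(t - tau)
        delta = 2.0 * (target - out)       # output-node error signal
        # On-line gradient descent, eqs. (5.29) and (5.30):
        w   += dt * mu * x(t - tau) * delta
        tau -= dt * nu * w * dx(t - tau) * delta
        err2 += (target - out) ** 2
    return err2 / steps

early = run(5000)
late = run(5000)
assert late < early          # the error shrinks as w and tau adapt
```

In the full network, x(t - tau) and its derivative come from tapped delay lines rather than closed-form functions, and delta arrives through the aged back-propagation path, but the per-connection arithmetic is exactly this product of three terms.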
Stochastic approximation was first applied to function minimization (or maximization) in [60], and since then it has become a well-known technique [64]. The method is an iterative procedure that updates a set of parameters based on estimates of the true gradient. A set of gradient samples (possibly a single sample) forms an estimate of the true gradient during each iteration. The gradient information effectively becomes averaged over many parameter updates, which are each small enough to ensure convergence. In the technique presented here, a continuously evolving network state replaces the discrete steps, and the instantaneous gradient is a single-sample estimate of the true gradient.

On-line learning, as described, applies to a continuously arriving training signal. Finite-length training signals can be readily extended for on-line learning by presenting them repeatedly, but adaptation should be suspended during an interval equal to the longest propagation delay through the network immediately after each repetition boundary, to avoid learning any transients incurred at that point.

During adaptation, the internal network buffers explicitly store recent information from the input signal, and the values of the weights and time delays implicitly store older information. The network can track some types of nonstationary signals because recent information slowly replaces older information in the weights and time delays.

5.4 Relationship To Other Work

If the delay learning rate is zero and the network equations are integrated using a simple Euler method, the technique is equivalent to the temporal back-propagation algorithm presented in [124]. Furthermore, with all the time delays fixed at zero, temporal back-propagation is identical to standard back-propagation.

A related technique for learning time delay "windows" in feed-forward networks has been demonstrated in [8] for a speech recognition task, but without a detailed derivation.
Adaptable time delays were also discussed in [98], where recurrent networks without sigmoidal nonlinearities separated sets of superimposed and delayed signals. A different method for performing gradient descent in networks with fixed time delays has been presented in [87] and [88]. However, that method, while being much more general, does not take advantage of the computational savings incurred by back-propagating through the time delays, and is not appropriate for networks with time delays in all of the connections. The method of unfolding in time [123] can be used to train networks with delays in all the connections, but it requires extensive storage and nonlocal computations. All these techniques, except [8] and [98], use fixed time delays.

Other types of networks that can process temporal information employ context units, which remember past values and include them with the current inputs [43]. Methods for adapting the decay rates of input integrators to match time constants observed in the input signals were described in [83], and an interesting time delay structure known as a Gamma memory was reported in [28].

5.5 Practical Considerations

At this point, practical issues associated with building and training a dispersive network with adaptable delays must be considered. The implementations assume on-line adaptation using only computationally local information.

5.5.1 Exact Delayed Error Gradient

Table 5.1 summarizes the equations describing the complete operation of a dispersive network with adaptable time delays. These equations describe a network with information flowing in two directions: input signals flow forward and error signals, \Delta_j(t), flow backwards. The nodes and connections can be modeled as shown in Figure 5.8. Each node contains a delay line of length a_j, providing the delayed value f'_j(y_j(t - a_j)), and each connection contains tapped delay lines to provide the delayed values of o_j, w_{ijk}, and \Delta_i.
The time delays, \tau_{ijk}, can be adjusted by moving the taps. In a typical network, the number of connections increases with the square of the number of nodes, N, so the implementation requires O(N^2) storage for the delayed signals.

Figure 5.9 shows a functionally equivalent but more efficient implementation. The o_j buffers have been moved from the connections to the nodes, with multiple taps for multiple connections, reducing the number of buffers required for o_j from O(N^2) to O(N). Also, it is not necessary to store values of y_j to calculate f'_j(y_j). Since f_j() is a monotonically increasing function of y_j, the derivative can be calculated directly using a function, g(o_j), to calculate f'_j(y_j) from o_j.

Figure 5.8: Block diagrams of (a) a node, and (b) a connection, for a direct but inefficient implementation of the network equations. Elongated boxes labeled with an age, such as a_j, represent tapped delay lines with a maximum delay proportional to the length of the box. Input signals flow from left to right and error signals flow from right to left.

Figure 5.9: Block diagrams of (a) a node, and (b) a connection, for a network that is functionally equivalent to that shown in Figure 5.8, but with smaller total time delay buffer requirements. The weight delay buffer is now a fixed length, independent of \tau_{ijk}. The function g(f_j(y_j)) calculates f'_j(y_j) from values of f_j(y_j).

    Network Equations Using Exact Delayed Gradient
    ------------------------------------------------------------------
    y_i(t)            = \sum_j \sum_k w_{ijk}(t) o_j(t - \tau_{ijk}(t))
    o_i(t)            = f_i(y_i(t))
    \Delta_m(t)       = 2(\delta_m(t) - o_m(t)) f'_m(y_m(t))   for m \in O
    \Delta_j(t, a_j)  = f'_j(y_j(t - a_j)) \sum_{i \in C_j} \sum_{k \in A_{ij}} \Delta_i(t - [a_j - a_i - \tau_{ijk}], a_i) w_{ijk}(t - [a_j - \tau_{ijk}])
    dw_{ijk}/dt       = \mu o_j(t - [a_i + \tau_{ijk}]) \Delta_i(t, a_i)
    d\tau_{ijk}/dt    = -\nu w_{ijk}(t - a_i) o'_j(t - [a_i + \tau_{ijk}]) \Delta_i(t, a_i)

Table 5.1: Equations describing the complete operation of a dispersive network with adaptable time delays, using an exact form of gradient descent. The subscript m denotes output nodes.

Finally, in Figure 5.9, the weight multiplication occurs before the delay in the feedback path, resulting in a fixed-length delay line that is shorter than the weight delay in Figure 5.8. The reduction in storage compared to Figure 5.8 is approximately proportional to N^2 (2a_j - a_i).

5.5.2 Approximate Delayed Error Gradient

The total storage requirements can be reduced from O(N^2) to O(N) by replacing each delayed occurrence of w_{ijk} with its current value, and moving all the delay buffers from the connections to the nodes. To accomplish this reduction, the delay buffers must have a separate tap for each connection. Table 5.2 shows equations for the resulting approximated error signal, \Delta_j(t, a_j), and for dw_{ijk}/dt and d\tau_{ijk}/dt. The error signal for the output nodes remains the same as in Table 5.1. For small learning rates, the weights and time delays change slowly and this

    Network Equations Using Approximate Delayed Gradient
    ------------------------------------------------------------------
    y_i(t)            = \sum_j \sum_k w_{ijk}(t) o_j(t - \tau_{ijk}(t))
    o_i(t)            = f_i(y_i(t))
    \Delta_m(t)       = 2(\delta_m(t) - o_m(t)) f'_m(y_m(t))   for m \in O
    \Delta_j(t, a_j)  = f'_j(y_j(t - a_j)) \sum_{i \in C_j} \sum_{k \in A_{ij}} \Delta_i(t - [a_j - a_i - \tau_{ijk}], a_i) w_{ijk}(t)
    dw_{ijk}/dt       = \mu o_j(t - [a_i + \tau_{ijk}]) \Delta_i(t, a_i)
    d\tau_{ijk}/dt    = -\nu w_{ijk}(t) o'_j(t - [a_i + \tau_{ijk}]) \Delta_i(t, a_i)

Table 5.2: Equations describing the complete operation of a dispersive network with adaptable time delays, using an approximate form of gradient descent. All the delayed values of w_{ijk} in Table 5.1 have been replaced with w_{ijk}(t). The simplified delayed gradient has smaller storage requirements, and is a good approximation to the exact delayed gradient for small learning rates. The subscript m denotes output nodes.

approximation introduces little error into the gradient calculations. Figure 5.10 shows the node and connection diagrams for this implementation, and Table 5.3 summarizes the storage requirements for all three implementations. With this approximation, the connections no longer require any storage buffers. The feedback delay lines are now associated with the nodes, and provided with multiple taps in the same way as the feed-forward delays.

While no claims can be made for the biological plausibility of this neural network model, it is interesting that delay lines with multiple taps parallel the axons in biological neural networks. Braitenberg, for example, in his discussion of neural interconnections in the cerebellum, suggests that "Engineers will have no difficulty appraising the general usefulness of a system of delay

Figure 5.10: Block diagrams of (a) a node, and (b) a connection, for a network optimized for storage efficiency. These diagrams are similar to those in Figure 5.9, except that current values of w_{ijk}(t) replace the delayed values. For small learning rates, adaptation still approximates gradient descent. With all the delay lines moved out of the connections, the total storage requirement scales linearly with the number of nodes.

    Storage Requirements
    --------------------------------------------------------------------
    Implementation   Storage For   Buffer Length           Buffers Needed
    Figure 5.8       y_j           a_j                     \propto N
                     o_j           a_j                     \propto N^2
                     \Delta_i      a_j - a_i               \propto N^2
                     w_{ijk}       a_j                     \propto N^2
    Figure 5.9       y_j           --                      0
                     o_j           a_j                     \propto N
                     \Delta_i      a_j - a_i               \propto N^2
                     w_{ijk}       a_i                     \propto N^2
    Figure 5.10      y_j           --                      0
                     o_j           a_j                     \propto N
                     \Delta_i      max_i(a_j - a_i)        \propto N
                     w_{ijk}       --                      0

Table 5.3: Buffers required for the implementations shown in Figs. 5.8, 5.9 and 5.10. The estimates are based on networks containing N nodes, and on the assumption that each node has a fan-in and fan-out approximately proportional to N. As shown in the third column, the number of buffers needed scales approximately as a power of N.

lines with provision for multiple tapping along the way" [11].

5.5.3 Choosing the Error Signal Ages

The best choice for the age parameters, a_j, will depend on the requirements of a particular implementation. It is desirable to make each a_j as small as possible while keeping \Delta_j causal, because small age values help to minimize storage requirements, maintain stability, and ensure that adaptation most closely approximates true gradient descent.

For practical purposes, each node, j, can be labeled with a "layer number", n_j, which is the distance from node j to the farthest output node in terms of the number of connection layers. For example, in the network of Figure 5.6, if node h is an output node, then n_g = 1, n_h = 0, n_i = 1, and n_j = 2. Two nodes, i and j, are "adjacent" if n_i and n_j differ by 1. If the maximum permissible delay between two adjacent nodes is z_0, then the maximum permissible delay between any other pair of nodes, i and j, is (n_j - n_i)z_0. With these definitions, the
One possibility is a hard threshold,where the value of Tijk stops changing once it has reached its limit. Another possibility is tooptimize a different set of parameters, -yijk, whereaj — atriik =^exp(--yijk) •(5.32)Now, with unrestricted values of Nik, rijk will lie in the range 0 < T;jk < aj — at. By remov-ing restrictions on the parameters, this technique can yield a better-conditioned optimizationproblem [89].Chapter 6Applications of Dispersive NetworksThe dispersive networks described in the previous chapter can be applied to many differenttasks. This chapter presents simulation results for adaptive signal prediction, signal production,and adaptive channel equalization, demonstrating several benefits of dispersive networks overmore conventional techniques.6.1 Adaptive Signal PredictionThe Mackey-Glass signal prediction task described in Chapter 4 provides a convenient bench-mark for measuring the performance of adaptive signal processing systems. The training con-figuration shown in Figure 4.2 permits a direct comparison between dispersive networks andconventional back-propagation networks.6.1.1 Network Topology and Implementation DetailsFigure 6.1 shows the network topology, which is similar to the topology used in Chapter 4 andin [66]. There are two hidden layers, consisting of ten nodes each, and a single output node.Instead of four input nodes as in Figure 4.3, there is a single input node with four adaptableconnections to each node in the first hidden layer. A single adaptable connection joins eachother pair of nodes in adjacent layers.65Chapter 6. Applications of Dispersive Networks^ 66InputFigure 6.1: Dispersive network architecture for the Mackey-Glass predictiontask. Each heavy line between the linear input node and the first hidden layerrepresents four adaptable weights and time delays. All other connections containa single adaptable weight and delay, and all feed-forward signals flow from left toright. 
There are a total of 150 adaptable connections and 21 adaptable biases. Each node has either a linear or a sigmoidal transfer function, represented by the node symbols. Before training, the weights and delays are set to pseudorandom values. During training, the network operates in an on-line mode, continuously adapting to the input signal.

During the simulations, all 150 network connections contained adaptable weights and time delays, and the a_j parameters were chosen as explained in Section 5.5.3, with z_0 = 25. As in Chapter 4, each f_i() for the hidden nodes was a hyperbolic tangent,

    f_i(y_i) = \tanh(\alpha [y_i - \theta_i]),

    f'_i(y_i) = g_i(f_i(y_i)) = \alpha (1 - f_i^2(y_i)),

and all the input and output nodes had linear transfer functions for expanded dynamic range. Adaptable thresholds, \theta_i, were implemented by connections from an additional node with a constant output, and all the nodes had unity gain (\alpha = 1). The initial delays were chosen pseudorandomly from a uniform distribution on the interval [0, 20], and the initial weights were chosen from a uniform distribution on the interval [-0.1, 0.1]. A weight learning rate of \mu = 0.005 and a delay learning rate of \nu = 1.0 were selected, based on a few experimental training runs.

As in Chapter 4, the network equations were integrated using a simple Euler method with a small fixed-length time step, and the input was the continually evolving Mackey-Glass signal, integrated using a four point Runge-Kutta method. Delayed values of time-varying signals that fell between sample points were obtained using simple linear interpolation. The adaptable time delays also require delayed values of signal derivatives, and these were interpolated between sample points as described in Appendix C.

6.1.2 Embedding Dimension

The behavior of the Mackey-Glass signal generated by equation (4.13) depends on its past values from time t - \tau to the present.
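The Mackey-Glass integration mentioned above can be sketched as follows. The delay-differential equation and parameter values used here (dx/dt = a x(t-\tau)/(1 + x(t-\tau)^{10}) - b x(t) with a = 0.2, b = 0.1, \tau = 17) are the commonly used ones and are assumed rather than taken from equation (4.13); the delayed term is linearly interpolated for the half-step evaluations, in the spirit of the interpolation described above:

```python
# Sketch: four-point Runge-Kutta integration of the Mackey-Glass
# delay-differential equation.  Parameters are the standard assumed
# values, not necessarily those of equation (4.13).
import math

a, b, tau, dt = 0.2, 0.1, 17.0, 0.1
hist_len = int(tau / dt)            # grid samples covering one delay interval
history = [1.2 + 0.1 * math.sin(i) for i in range(hist_len + 1)]  # arbitrary initial history

def deriv(x_now, x_tau):
    return a * x_tau / (1.0 + x_tau ** 10) - b * x_now

def rk4_step(k):
    """Advance one step of length dt from sample k; the delayed term is
    linearly interpolated between grid samples for the half steps."""
    x0 = history[k]
    xt0 = history[k - hist_len]          # x(t - tau)
    xt1 = history[k - hist_len + 1]      # x(t + dt - tau)
    xt_half = 0.5 * (xt0 + xt1)          # linear interpolation at t + dt/2
    k1 = deriv(x0, xt0)
    k2 = deriv(x0 + 0.5 * dt * k1, xt_half)
    k3 = deriv(x0 + 0.5 * dt * k2, xt_half)
    k4 = deriv(x0 + dt * k3, xt1)
    return x0 + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

for k in range(hist_len, hist_len + 5000):
    history.append(rk4_step(k))

signal = history[hist_len:]
assert all(0.0 < v < 2.0 for v in signal)   # stays on the bounded attractor
```

With \tau = 17 the signal is chaotic, which is what makes it a useful prediction benchmark.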
The signal history within that interval can be dividedinto an infinite number of points, which all affect the future behavior of the signal. Therefore,the Mackey-Glass equation can be viewed as an infinite-dimensional mapping from past valuesto a single future value. The signal values from time t — r to the present define a location inan infinite-dimensional phase space, and the evolution of equation (4.13) causes the system tofollow a path through this space. The present location of the system completely determines itsfuture behavior.Imagine a system evolving from its starting location in phase space. Its trajectory defines aflow line that is an "attractor" if it converges with other flow lines that are "near" it in phasespace. A particular trajectory may or may not eventually lead back to its starting location. If itdoes, then the attractor is periodic, and the system will exhibit cyclical behavior if its startinglocation is within the region of attraction.A second type of attractor is a stable, fixed point where the state of the system stopsevolving. A third type, known as a chaotic attractor, is neither a fixed point, nor periodic.The system continues to evolve, with the flow lines tracing out a particular shape in the phasespace. There may be some regions of phase space that the system never enters, and othersthat it passes near frequently. However, due to a strong sensitivity to initial conditions, theChapter 6. Applications of Dispersive Networks^ 68actual trajectory cannot be predicted with much accuracy very far into the future. The overallstructure of the attractor is often a finite-dimensional manifold within the phase space. For theMackey-Glass signal, the attractor has a fractal dimension of approximately 2.1 [32].It was demonstrated in [91] that a chaotic attractor can be reconstructed from a finite setof samples of the signal history. For example,x(t + T) = f(x(t), x(t — r1 ), x(t — r2 ), ... 
, x(t − τ_m)),    (6.1)

where T is the prediction time into the future, and f(·) is a mapping function. Since the mapping function uses m + 1 samples to make the prediction, the "embedding dimension" is m + 1. The embedding dimension must be at least as large as the dimension of the chaotic attractor itself.

The choice of embedding dimension, in terms of its size and temporal spacing, has a significant effect on the network's learning ability [76]. In Chapter 4, the embedding dimension was 4 with an even spacing of 6, since these values gave good results in [66]. However, the optimal embedding dimension of a system, and even the attractor dimension, might not be known in advance.

Dispersive networks can automatically learn an optimal embedding dimension, due to the adaptability of the internal time delays. The number of delay elements within the network imposes an upper limit on the embedding dimension, but its detailed structure develops as the network adapts.

6.1.3 Prediction Results

Figure 6.2 shows learning curves for the adaptable time delay network, an identical network with fixed time delays, and a conventional nondispersive network of the type discussed in Chapter 4. All the networks showed approximately exponential convergence, and after 10^8 time units appeared to be approaching asymptotes. At that point, the NRMSE values were 0.0200, 0.0082, and 0.0035 for the conventional network, the dispersive network with fixed delays, and the dispersive network with adaptable delays, respectively. A 150-tap linear filter trained on the same task had a prediction error of 0.2, which is well off the scale in Figure 6.2.
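For concreteness, the driving signal and the delay-coordinate vectors of equation (6.1) can both be sketched in a few lines. The Mackey-Glass parameters a = 0.2, b = 0.1, and τ = 17 used below are the commonly quoted values and are an assumption here, since equation (4.13) itself is not reproduced in this chapter; this is an illustrative sketch, not the thesis simulator.

```python
import math

def mackey_glass(n, h=0.1, tau=17.0, a=0.2, b=0.1, x0=1.2):
    """Integrate dx/dt = a*x(t-tau)/(1 + x(t-tau)**10) - b*x(t) with a
    four-point Runge-Kutta step; delayed values that fall between stored
    samples are obtained by linear interpolation, as in the text.
    Returns the constant initial history plus n new samples of size h."""
    d = int(round(tau / h))            # the delay expressed in steps
    xs = [x0] * (d + 1)                # constant initial history
    def x_delayed(k, offset):
        pos = k + offset - d           # (possibly fractional) sample index
        i = int(math.floor(pos))
        f = pos - i
        j = min(i + 1, len(xs) - 1)
        return xs[i] * (1.0 - f) + xs[j] * f
    def deriv(x, xd):
        return a * xd / (1.0 + xd ** 10) - b * x
    for k in range(d, d + n):
        x = xs[-1]
        k1 = deriv(x, x_delayed(k, 0.0))
        k2 = deriv(x + 0.5 * h * k1, x_delayed(k, 0.5))
        k3 = deriv(x + 0.5 * h * k2, x_delayed(k, 0.5))
        k4 = deriv(x + h * k3, x_delayed(k, 1.0))
        xs.append(x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4))
    return xs

def embed(xs, delays, step=1):
    """Delay-coordinate vectors [x(t), x(t - r1), ..., x(t - rm)]:
    the argument list of the mapping f in equation (6.1)."""
    start = max(delays)
    return [[xs[t - r] for r in [0] + list(delays)]
            for t in range(start, len(xs), step)]
```

With h = 0.1, a delay of 6 time units is 60 steps, so embed(xs, [60, 120, 180]) builds the four-dimensional embedding with even spacing 6 used in Chapter 4.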
Figure 6.2: Learning curves showing the normalized rms error (NRMSE) of networks trained to predict the Mackey-Glass chaotic signal an interval, T = 6, into the future. The curves are for: (a) A nondispersive network with identical topology to that used in [66]. Four input nodes represented fixed delays of 0, 6, 12, and 18. (b) A dispersive network with randomly chosen fixed delays. (c) A dispersive network with adaptable delays, initially identical to (b). In each case there were two hidden layers with ten nodes each, and a total of 150 connections. Adaptation occurred on-line, with a weight learning rate of μ = 0.005 and, for part (c), a delay learning rate of ν = 1. The averages used to estimate the NRMSE were over time intervals of length 10^5.

It should be noted that each integration time step for the network with adaptable delays took approximately 1.7 times as long as each time step for the other networks, due to the overhead involved in adapting the time delays. This penalty could be avoided in a parallel hardware implementation, where the weight and time delay updates could occur simultaneously.

6.1.4 True Adaptive Prediction

During adaptation, the output of the network "predicts" the current value of the input signal. To perform true prediction of future values, the delay at the input in Figure 4.3 must be removed, but the resulting lack of a target signal prevents further adaptation. Figure 6.3 shows one method for performing true adaptive future prediction. A second copy of the network connects directly to the input signal, without the time delay, T.
The lower network adapts as in the original configuration, but its internal parameters (weights and time delays) are continuously copied to the second network above it. As the parameters of the lower network adapt, the parameters of the upper network follow, allowing it to perform true adaptive prediction.

Figure 6.3: True adaptive prediction with two networks. The lower network adapts as in Figure 4.2, but its internal parameters (weights and time delays) are continuously copied to the top prediction network. As the parameters adapt, those in the top network follow, allowing it to perform true adaptive prediction. If the networks are physically separate, then the parameter copying procedure requires a high-bandwidth communication channel. Additionally, all of the elements within the two networks must be precisely matched to ensure that copying the parameters will lead to accurate prediction.

Unfortunately, the arrangement in Figure 6.3 doubles the hardware required to perform the prediction task. In addition, if the networks are physically separate, then the parameter copying procedure requires a high-bandwidth communication channel between them. Finally, the processing elements within the two networks must be very closely matched to ensure that a simple copying of the parameters will lead to the desired behavior. Slight variations between the two networks, coupled with their internal nonlinearities, can lead to large differences in the

Figure 6.4: True adaptive prediction with a dispersive network. The delay at the output can be viewed as an additional layer in the dispersive network, permitting simultaneous adaptation and prediction. The true prediction output comes from a tap before the final delay, while the output from the final delay, along with the target (input) signal, forms the back-propagated error.
One desirable feature of an adaptive system is its ability to "adapt around" internaldefects. However, the two-network configuration effectively disables this feature.With a dispersive network, the delay can be moved to the output, as shown in Figure 6.4.This delay can be viewed as an additional network layer with a single node, connected througha fixed delay, T, with a fixed weight of unity. If the network adapts to produce a unitytransfer function, then a tap taken off before the fixed delay will provide true future prediction.This configuration permits on-line adaptive prediction without requiring two networks, but thenetwork must adapt using temporal back-propagation, due to the delay present in the finallayer.Figure 6.5 shows learning curves for the adaptable time delay network and an identicalnetwork with fixed time delays, both in true adaptive prediction configurations. The figurealso includes the learning curves from Figure 6.2 for reference. Again, the networks showedapproximately exponential convergence, and after 10 8 time units appeared to be approachingasymptotes. At that point, the NRMSE values were 0.0084 and 0.0035 for the dispersiveConventional Back-Propagation NetworkDispersive Networks with Nonpredictive AdaptationDispersive Networks with Predictive Adaptation(a)••^...^•^•^•^••^.^•^•••^•.•• ......(b)Chapter 6. Applications of Dispersive Networks^ 720.060.05 -0.040.03 -0.02 -0.01 -0.00 ^^0 1^2^3^4^5^6^7^8^9^10Time x 107)Figure 6.5: Learning curves showing the normalized rms error (NRMSE) of net-works trained to predict the Mackey-Glass chaotic signal an interval, T = 6, intothe future. The solid lines are for networks trained in a true prediction configu-ration, so that they could predict and adapt simultaneously. For reference, thegraph includes the learning curves from Figure 6.2 as broken lines. 
The curves are for: (a) A nondispersive network with identical topology to that used in [66]. The nondispersive network cannot be trained in a true predictive configuration. (b) A dispersive network with randomly chosen fixed delays. (c) A dispersive network with adaptable delays, initially identical to (b). In each case there were two hidden layers with ten nodes each, and a total of 150 connections. Adaptation occurred on-line with a weight learning rate of μ = 0.005 and, for part (c), a delay learning rate of ν = 1. The averages used to calculate the NRMSE were over time intervals of length 10^5.

network with fixed delays, and the dispersive network with adaptable delays, respectively. It is apparent from this diagram that the cost of true prediction, incurred by placing the fixed delay at the output, is insignificant compared with the benefit of a dispersive network over a conventional back-propagation network. It is important to remember that the conventional network cannot perform true prediction while adapting, so there is no true prediction learning curve for a conventional back-propagation network in Figure 6.5.

To predict further into the future, identical copies of the T = 6 network can be cascaded to provide outputs for T = 12, 18, 24, .... Alternatively, the network can be trained to predict further into the future by increasing the fixed delay, T, in Figure 6.4. Increasing this delay may degrade the stability of the prediction filter, requiring smaller learning rates. Theoretically, the cascade method should provide more accurate predictions for chaotic signals [14]. Some results in [66] confirm the higher accuracy of the cascade method with nondispersive networks, but more simulations need to be performed to verify its superiority with dispersive networks.

6.1.5 Momentum

It is apparent from the learning curves in Figures 6.2 and 6.5 that the back-propagation adaptation technique can take a long time to converge.
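One standard remedy, discussed next, is to add a momentum term to the weight update. The thesis's own continuous-time formulation is given in its Appendix A (where the momentum parameter η = 10 is defined) and differs in detail; the block below is the generic discrete-time form, a sketch rather than the thesis rule.

```python
def momentum_step(w, dw_prev, grad, mu=0.005, eta=0.9):
    """One weight update with momentum: the new change is the plain
    gradient step plus a fraction eta of the previous change."""
    dw = [-mu * g + eta * d for g, d in zip(grad, dw_prev)]
    w_new = [wi + di for wi, di in zip(w, dw)]
    return w_new, dw
```

Driving this rule with grad = w on the quadratic error E(w) = w^2/2 shows the characteristic damped oscillation toward the minimum, faster initially than plain gradient descent with the same μ.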
Momentum, as discussed in Section 3.3 and Appendix A, may increase the rate of convergence and allow the network to escape from local minima in the error surface [104].

To determine the effect of momentum on networks with internal time delays, the same networks were trained with a momentum parameter of η = 10, as defined in Appendix A. Figure 6.6 shows the results, with learning curves for networks trained without momentum included as dashed lines. Momentum apparently speeds up the initial convergence, but dispersive networks trained without momentum ultimately reach lower error values. The nondispersive network seems to show the opposite effect, but by time 10^8, the difference between the curves with and without momentum is virtually nonexistent.

Figure 6.6 suggests that it may be useful to include momentum during the early stages of training, and then gradually reduce η as training progresses. More simulations need to be performed to investigate this possibility, but the rate at which to decrease η is another parameter that must be determined by trial and error. Finally, it has been reported that momentum doesn't often provide much improvement for on-line adaptation [132]. Perhaps with a batch update technique, as in equation (3.13), the momentum parameter would prove to be more useful.

Figure 6.6: Learning curves similar to those in Figure 6.5, for networks trained with and without momentum. The solid lines are for networks trained with a momentum parameter of η = 10. For reference, the dashed lines are the learning curves with η = 0 from Figure 6.5. The curves are for: (a) A nondispersive network with identical topology to that used in [66]. (b) A dispersive network with randomly chosen fixed delays. (c) A dispersive network with adaptable delays, initially identical to (b).
In each case there were two hidden layers with ten nodes each, and a total of 150 connections. Adaptation occurred on-line with a weight learning rate of μ = 0.005 and, for part (c), a delay learning rate of ν = 1. The averages used to calculate the NRMSE were over time intervals of length 10^5.

6.2 Signal Production

An interesting application of the trained network is to freeze the weights and time delays, remove the input signal, and connect the output labeled o(t) in Figure 6.4 directly to the input node, creating a recurrent network. Now the network can use its past predictions as inputs, and make additional predictions further into the future. The resulting dynamical system, shown in Figure 6.7, spontaneously generates its own version of the Mackey-Glass signal. Due to the strong dependence on initial conditions, the resulting signal will soon deviate from the actual Mackey-Glass signal, but it can still be used to make long term statistical predictions, and to investigate the nature of the attractor.

Figure 6.7: A dispersive network in a recurrent signal production configuration. After training as a predictor, the weights and time delays can be frozen, and the output fed back to the network's input. The resulting dynamical system spontaneously generates its own version of the Mackey-Glass signal, or it can be initialized with a segment of the signal, and then left to run freely.

6.2.1 Learning Progression

During the learning progression plotted in Figure 6.5, adaptation was periodically halted, and the network placed in the recurrent configuration shown in Figure 6.7. Figure 6.8 summarizes the results for the dispersive network with adaptable time delays, trained in the true predictive configuration. The behavior did not progress smoothly, but went through periods where the recurrent network was unstable.
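The reconfiguration of Figure 6.7 amounts to seeding the network's delay line with a segment of the true signal and then feeding each output back as the next input. A minimal sketch of that loop, with a stand-in `predict` function in place of the trained dispersive network (the two-tap predictor below is purely illustrative):

```python
import math

def run_closed_loop(predict, seed, n_steps):
    """Free-running signal production: after the history is seeded with
    a segment of the true signal, each new output is appended to the
    history and becomes part of the input for the next prediction."""
    history = list(seed)
    for _ in range(n_steps):
        history.append(predict(history))
    return history[len(seed):]

# Example stand-in: a two-tap linear predictor that continues any sampled
# sinusoid exactly, since sin(w(k+1)) = 2*cos(w)*sin(wk) - sin(w(k-1)).
w = 0.3
seed = [math.sin(w * k) for k in range(10)]
out = run_closed_loop(lambda h: 2.0 * math.cos(w) * h[-1] - h[-2], seed, 50)
```

For a chaotic signal the same loop structure applies, but, as the text notes, sensitivity to initial conditions makes the free-running output track the true signal only for a limited time.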
After each period of instability, the network appeared to produce a better imitation of the true Mackey-Glass signal.

A better way to view the learning progression is through the aid of phase diagrams. Figure 6.9 plots x(t) vs. x(t − 17) for the Mackey-Glass signal, resulting in a projection of the Mackey-Glass chaotic attractor onto a two-dimensional surface. It is evident from the diagram that the signal is approximately periodic, but it is also evident that the system follows a slightly different trajectory during each "period". By creating similar phase diagrams for the network at different points during training, the learning progression is more readily apparent.

Figure 6.8: Summary of a dispersive network's progress while learning to produce the Mackey-Glass signal. The network was trained as shown in Figure 6.4, and periodically reconfigured as in Figure 6.7 to observe its signal production performance. The heavy line, partly dashed, shows how the prediction NRMSE decreased as training progressed, on a logarithmic time scale. The network went through periods of instability, indicated by the dashed segments of the learning curve. The insets show representative output from the network at four selected stable points during training. Note that the network first learns to produce a periodic output, followed by a period of instability, and then an output similar to the Mackey-Glass signal.

Figure 6.10 shows four phase diagrams from the early stages of training. In each case, the network was initialized with Mackey-Glass data from time t = 0 to time t = 100, and then placed in the recurrent configuration shown in Figure 6.7. After training up to time 1.14 × 10^4, the network converges to a stable fixed point very quickly after initialization, as shown in Figure 6.10(a).
Figure 6.10(b) shows that after training up to time 1.53 × 10^4, the network has learned to produce an oscillatory pattern, but it still settles to a fixed point in about 500 time units. In Figure 6.10(c), at time 3.71 × 10^4, the network finds a stable periodic attractor, never reaching a fixed point. Finally, at time 7.45 × 10^4 (Figure 6.10(d)), the shape of the attractor begins to resemble the actual Mackey-Glass attractor in Figure 6.9.

Figure 6.9: Projection of the Mackey-Glass chaotic attractor onto a two-dimensional surface. The vertical axis plots the signal at time t, and the horizontal axis plots the same signal at time t − 17. Note the roughly periodic behavior.

Figure 6.11 presents four phase diagrams from the later stages of training. Figure 6.11(a) shows that at time 2.42 × 10^5, the recurrently connected network was unstable. However, stability returns by time 5.4 × 10^5 (Figure 6.11(b)), where the attractor contains a double loop structure similar to that in Figure 6.9. The network is again unstable at time 3.43 × 10^6 (Figure 6.11(c)), but by time 2.84 × 10^7 (Figure 6.11(d)) it has finally learned to emulate the Mackey-Glass attractor remarkably well.

Figure 6.10: Behavior of the dispersive network during the early stages of learning. The attractors represent the network output after training up to times: (a) 1.14 × 10^4, (b) 1.53 × 10^4, (c) 3.71 × 10^4, and (d) 7.45 × 10^4. The diagrams are not all plotted to the same scale. In (a) and (b), the network reaches a stable fixed point, but in (c) and (d) it eventually reaches a stable, periodic orbit. The training times for these plots coincide with points where the network exhibited interesting new behavior.

Figure 6.11: Behavior of the dispersive network during the later stages of learning.
In this figure, the diagrams are all plotted to the same scale. The attractors represent the network output after training up to times: (a) 2.42 × 10^5, (b) 5.4 × 10^5, (c) 3.43 × 10^6, and (d) 2.84 × 10^7. These training times coincide with points where the network exhibited interesting new behavior. In plots (a) and (c), the network is unstable, but plot (d) is a close approximation to the actual Mackey-Glass attractor shown in Figure 6.9.

6.2.2 Performance

Figure 6.12 shows the final attractors reached after training up to time 10^8 for a conventional back-propagation network, a dispersive network with fixed delays, and a dispersive network with adaptable delays. The diagram also shows the actual Mackey-Glass attractor for comparison. The conventional back-propagation network was unstable, and it quickly deviated from the Mackey-Glass signal. However, both dispersive networks produced reasonable imitations of the true attractor, with the adaptable delay network giving a slightly more accurate representation.

The phase diagrams do not provide complete information about the behaviors of the different networks. Although they show the shape of the attractors, they provide no indication of the rates at which the network states evolve through phase space. To show this temporal information, Figure 6.13 plots the signals at the network outputs, along with the actual Mackey-Glass signal for comparison. For the first 100 time units, the networks were in a strictly feed-forward predictive arrangement, with the actual Mackey-Glass signal applied at the input.
At time t = 100, the networks were reconfigured into the arrangement of Figure 6.7 and left to run freely.

In Figure 6.13, initial start-up transients are apparent before the internal network delays have filled with valid data, and then very accurate prediction occurs up to time t = 100. At that point, the freely running networks begin to diverge from the Mackey-Glass signal, with the conventional back-propagation network diverging the fastest. This network starts to become unstable before time t ≈ 500. The two dispersive networks, with fixed and adaptable delays in Figures 6.13(b) and 6.13(c), respectively, both provide accurate signal emulation up to time t = 500.

To show the difference between networks with fixed and adaptable delays, Figure 6.14 extends the plots in Figure 6.13 from time t = 750 to time t = 1250. Here, the accuracy of the fixed-delay network begins to deteriorate, exhibited mostly by a phase-shift of the signal pattern. The network with adaptable delays still provides quite accurate emulation up to time t = 1250, while the conventional back-propagation network is clearly unstable.

Figure 6.12: Final attractors for (a) a conventional back-propagation network, (b) a dispersive network with fixed time delays, and (c) a dispersive network with adaptable time delays. All the networks were trained up to time 10^8 and then placed in the configuration shown in Figure 6.7. The actual Mackey-Glass attractor is reproduced in (d) for comparison.

Figure 6.13: Network emulation of the Mackey-Glass signal from time t = 0 to time t = 500. The networks are (a) a conventional back-propagation network, (b) a dispersive network with fixed time delays, and (c) a dispersive network with adaptable time delays.
Before time t = 100, the networks were in a strictly predictive configuration, with the actual Mackey-Glass signal applied at the input. Initial start-up transients occur before the internal network delays have filled with valid data, but once filled, all the networks provide accurate predictive performance up to t = 100. At this point, shown by the vertical dashed line, the networks were reconfigured into the arrangement of Figure 6.7 and left to run freely. Their outputs gradually diverge from the true Mackey-Glass signal.

Figure 6.14: Network emulation of the Mackey-Glass signal from time t = 750 to time t = 1250. The networks are (a) a conventional back-propagation network, (b) a dispersive network with fixed time delays, and (c) a dispersive network with adaptable time delays. These plots are an extension of Figure 6.13.

Another way to compare the performance of the three networks is to plot the power spectra of the generated signals, as shown in Figure 6.15. Each plot also shows the "true" Mackey-Glass power spectrum, from the signal obtained by integrating the Mackey-Glass equation using a four-point Runge-Kutta method with a step size of 0.1. It is obvious that the dispersive network with adaptable time delays is able to reproduce the Mackey-Glass spectrum much more accurately than either of the other networks. The discrepancy between the network-generated spectra and the Mackey-Glass spectrum appears to grow larger at higher frequencies, but the log power scale in Figure 6.15 exaggerates this effect. In fact, most of the Mackey-Glass signal power is concentrated below a frequency of π/8, so Figure 6.16 shows an expanded view of that region.
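The spectra in Figures 6.15 and 6.16 were formed by averaging windowed FFT magnitudes. A small sketch of that estimator, using much shorter segments and a naive DFT purely for illustration (the thesis averaged 100 segments of 4096 samples):

```python
import math

def hamming(n):
    """Hamming data window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1))
            for k in range(n)]

def averaged_spectrum(x, seg_len=128):
    """Average the magnitude-squared DFTs of consecutive Hamming-windowed
    segments of x.  Bin m corresponds to angular frequency 2*pi*m/seg_len."""
    win = hamming(seg_len)
    n_seg = len(x) // seg_len
    spec = [0.0] * (seg_len // 2)
    for s in range(n_seg):
        seg = [x[s * seg_len + k] * win[k] for k in range(seg_len)]
        for m in range(seg_len // 2):       # naive O(n^2) DFT, fine at this size
            ang = -2.0 * math.pi * m / seg_len
            re = sum(v * math.cos(ang * k) for k, v in enumerate(seg))
            im = sum(v * math.sin(ang * k) for k, v in enumerate(seg))
            spec[m] += (re * re + im * im) / n_seg
    return spec
```

Applied to a sinusoid that lies exactly on bin m, the estimate peaks at that bin; applied to a network-generated signal, it produces the kind of averaged spectrum plotted in the figures.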
It is apparent from Figure 6.16(c) that the dispersive network with adaptable time delays produces a remarkably accurate spectrum at low frequencies.

These results demonstrate that dispersive networks with adaptable time delays can learn to predict and reproduce signals generated by chaotic dynamical systems. Chaotic systems probably underlie many physical phenomena and economic patterns, and similar predictive techniques have been successfully applied to the prediction of sunspot numbers [126], weather

Figure 6.15: A comparison of power spectra generated by (a) a conventional back-propagation network, (b) a dispersive network with fixed time delays, and (c) a dispersive network with adaptable time delays. Each network was trained as a predictor, and then connected in the recurrent configuration shown in Figure 6.7. Each resulting dynamical system spontaneously generated its own version of the Mackey-Glass signal, with the power spectra shown above. For comparison, the dashed line is the actual Mackey-Glass power spectrum. The spectra are the result of averaging 100 Fast Fourier Transforms of segments with 4096 samples taken at unit intervals, using a Hamming data window.

Figure 6.16: A comparison of power spectra generated by (a) a conventional back-propagation network, (b) a dispersive network with fixed time delays, and (c) a dispersive network with adaptable time delays. This graph is a magnification of the low-frequency region from Figure 6.15, where most of the signal power is concentrated.
The dashed line is the actual Mackey-Glass power spectrum. The spectra are the result of averaging 100 Fast Fourier Transforms of segments with 4096 samples taken at unit intervals, using a Hamming data window.

Figure 6.17: Adaptive channel equalization. The internal network parameters adapt to minimize the error, e(t). The resulting output, o(t), is a time-delayed estimate of the channel input. Channel noise, represented by a white Gaussian signal, n(t), prevents the error from reaching zero.

patterns [101], stock markets [61], and commodity markets [16].

The prediction error for a chaotic system typically increases exponentially with the prediction interval, T. The exponent gives an upper estimate of the largest positive Liapunov exponent, which describes the divergence of nearby trajectories in phase space. Despite the exponential decrease in predictive ability, qualitative predictions about the future behavior of a chaotic system can still be made from a short segment of training data. The network can be trained by repeatedly presenting it with the finite data segment, until it has reached a sufficiently low error. Connecting the trained network in a recurrent signal production configuration will cause it to generate an output that is statistically similar to the training signal. This method could be used to investigate the nature of the chaotic attractor underlying the training signal without requiring observation of that signal for a long period of time.

6.3 Adaptive Channel Equalization

Figure 6.17 shows how an adaptive network can be trained to compensate for distortion in a communication channel. As adaptation minimizes the error between o(t) and x(t − T), the network learns to implement the inverse of the channel transfer function with a fixed lag, T.
Linear filters often work well in this configuration [134], but their accuracy suffers when the channel distortion is nonlinear. Figure 6.18 is an example of a nonlinear channel for which linear filters do not provide a satisfactory solution.

Figure 6.18: Nonlinear communication channel. The channel consists of three elements in series. The first and last elements are linear, with identical transfer functions. The center element is a nonlinearity chosen for its mathematical convenience.

The channel consists of three elements in series. The first and last elements are linear, with identical transfer functions of 0.2602 + 0.9298z^-1 + 0.2602z^-2, and the middle element introduces the nonlinear distortion,

y = 5x|x|(1 − x^2)/(1 + x^2).    (6.2)

To find the inverse transfer function of this channel, a dispersive network was placed in the arrangement shown in Figure 6.17. The input signal, x(t), was zero mean white Gaussian noise with unit variance, fed through a third-order low-pass Butterworth filter with a cutoff frequency of π/16t_0, where t_0 was the sampling period. Appendix D describes the exact form of
Adaptable thresholds were implemented by connections from an additional node with aconstant output.The initial delays were pseudorandom, uniformly distributed values on the interval [0, 24],and the initial weights were similarly distributed on [-0.1, 0.1]. The ages, cti, were 50 for thefirst hidden layer, and 25 for the second. A weight learning rate of p = 0.01 and a delay learningrate of v = 1.0 were selected, and the network equations were integrated using a simple Eulermethod with a small fixed-length time step.A conventional back-propagation network was trained for comparison, with the same numberof nodes and the same number of input connections, formed from evenly-spaced delay tapsranging from 0 to 24.6.3.2 Equalization ResultsFigure 6.19 shows learning curves for the dispersive network, the conventional back-propagationnetwork, and an adaptive linear filter with 25 taps. The graph plots the normalized rms erroras a function of training time. Both networks easily outperformed the linear filter, but theadaptable delays in the dispersive network allowed it to achieve the lowest error. At time 3.10 6 ,the NRMSE values were 0.254, 0.158, and 0.065 for the linear filter, the back-propagationnetwork, and the dispersive network, respectively.Chapter 6. Applications of Dispersive Networks^ 890.400.35 -oI- 0.30 -4.1^ Linear Filter07 0.250.20 Back-Propagation Network0.150.10 -^ Dispersive Network -0.05 -0.00^0 0.5^1:^'0 1 .5^2.0^2.5^3 0Time X 106Figure 6.19: Learning curves showing the NRMSE as a function of training timefor adaptive filters trained to compensate for a nonlinear channel distortion.The linear filter and the back-propagation network both had 25 delayed tapsfor the input signal. The top curve shows that the linear filter converges veryquickly, but cannot reach an error less than about 0.25. 
Initially, the two network filters show approximately exponential convergence, but the adaptable delays in the dispersive network allow it to surpass the conventional back-propagation network as training progresses. The averages used to estimate the NRMSE were over time intervals of length 10^4.

6.4 Signal Detection/Classification

If the output nodes of a dispersive network are cascaded with binary quantizers, the network becomes a classifier that can separate its input pattern into different categories, based on its spatial or temporal properties. The output nodes each correspond to one category, and a node can signify the presence of a pattern in its corresponding category by producing 1, and its absence by producing -1. This type of signal detector or classifier could be very useful in tasks like speech recognition. Although the dispersive networks described here have not yet been applied to classification problems, the project is definitely worthy of future consideration.

Chapter 7

Implementations

Dispersive networks can be implemented in algorithmic form using either serial or parallel digital hardware, or they can be implemented as continuous-time systems using parallel analog hardware. Since the present state of the art for integrated digital circuitry is more advanced than it is for analog circuitry, the emphasis of this chapter is on algorithmic implementations. However, analog hardware may ultimately permit higher operating speeds, lower power requirements, and smaller devices, so a brief discussion of analog implementations suggests some areas for future research.

7.1 Software Implementations

All the experimental results presented in this thesis were obtained using a serial computer and conventional programming techniques.
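The central data structure such a program needs is a history buffer for each signal, read back with a possibly fractional delay. A minimal sketch, using the linear interpolation mentioned in Chapter 6 for values between sample points (illustrative only; the thesis's actual code, and the derivative interpolation of its Appendix C, are not reproduced):

```python
class DelayLine:
    """Circular buffer holding a signal's recent samples.  read() returns
    the value a (possibly fractional) number of steps in the past, using
    linear interpolation between the two bracketing samples."""
    def __init__(self, max_delay_steps):
        self.buf = [0.0] * (max_delay_steps + 2)
        self.head = 0                      # index of the newest sample
    def push(self, value):
        self.head = (self.head + 1) % len(self.buf)
        self.buf[self.head] = value
    def read(self, delay_steps):
        i = int(delay_steps)
        f = delay_steps - i
        n = len(self.buf)
        newer = self.buf[(self.head - i) % n]
        older = self.buf[(self.head - i - 1) % n]
        return newer * (1.0 - f) + older * f
```

One such buffer per delayed connection (or per node output, in the shared arrangement of Figure 5.10) supports both the feed-forward delays and the adaptable-delay gradient terms.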
The program can be viewed as a "simulation" of the differential equations that describe continuous-time dispersive networks, or as a separate algorithmic form of dispersive network that can be studied in its own right.

In dispersive networks, there is a tradeoff between computational locality and storage requirements. For a network node to calculate its output value, it requires data from other nodes in the network. Computational locality refers to the required communication bandwidth and the physical distance to these other nodes.

Serial software implementations have different constraints than either digital or analog parallel implementations. A random access memory (RAM) stores all the parameters describing the network, so computational locality is not a consideration. To minimize the required RAM, a serial implementation can use the configuration shown in Figure 5.10. Coding the adaptation algorithm in a procedural programming language is a straightforward task, and the technique described in Appendix C can be used to obtain the delayed signal derivatives.

7.2 Parallel Digital Implementations

Parallel implementations typically employ separate hardware elements for each node, and perhaps each connection, within the network [118]. These elements must transmit and receive data in both the feed-forward and back-propagation phases of operation. Consider a fully connected network, where each node in one layer connects to all the nodes in the next layer. For fully parallel operation, every node requires a separate communication link for each one of these connections, and the number of links increases approximately with the square of the number of nodes.

In a discrete-component electrical implementation, the links could be implemented using conductive wires.
As the number of nodes increases, the volume of space between the layers will become increasingly filled with links, limiting the size of the network that can be constructed. In integrated circuitry, this constraint is more severe because the space between the layers is only two-dimensional. A truly scalable architecture requires the elimination of any spatial constraints by some form of signal multiplexing on fewer physical wires. The resulting system may not be completely parallel in operation, but a significant increase in speed over serial implementations can still be accomplished.

Figure 7.1 shows a scalable dispersive network architecture that trades operational speed for scalability. Each pair of communicating nodes can have access to the bus in sequence, storing the received data for later internal computations. As the number of nodes increases, the overall operating speed will decrease because there will be more nodes sharing the bus. In a straightforward implementation, the operating speed will be inversely proportional to the square of the number of nodes. However, there are no spatial connectivity constraints because the design is completely modular.

Figure 7.1: Block diagram of a parallel digital dispersive network implementation. Each processing element shares a communication bus with other processing elements in the same layer. If two nodes need to communicate, they can periodically obtain access to the bus, in a round-robin fashion with other network nodes. The length of time between bus accesses for a given pair of nodes determines the operating speed of the network.
This design avoids spatial connectivity constraints, but the operating speed decreases as the number of nodes in each layer increases.

To minimize the required bus bandwidth in Figure 7.1, the dispersive network must be intelligently partitioned into separate processing elements. One obvious choice is to use a processing element for each network node, and to associate the weights and time delays with the nodes in a way that minimizes the required bandwidth. Figure 7.2 reproduces the node and connection diagrams from Figure 5.8, with the optional weight delay buffer for the back-propagated signal removed. The signal lines in Figure 7.2 are classified into two categories: unique and shared. Unique signals, represented by solid lines in Figure 7.2, flow only between two particular nodes, and can be viewed as forming the connection between them. Shared signals, represented by dashed lines, are common to multiple nodes in the preceding or succeeding layers, and are good candidates for broadcast over a shared bus.

Figure 7.2: Unique and shared signals in a dispersive network. The solid lines represent unique signals associated with two specific nodes. They can be viewed as forming the connection between the two nodes. The dashed lines represent shared signals that must be transmitted to multiple nodes in the succeeding (for feed-forward) or preceding (for back-propagation) layers. Shared signals are good candidates for broadcast over a shared bus. The dotted line partitions the link into two sections, one associated with the node in the preceding layer, and the other associated with the node in the succeeding layer. The processing elements representing these two nodes each incorporate the appropriate link section. The partition cuts only shared signals, minimizing the bus bandwidth required between the layers.

The dotted line in Figure 7.2 partitions each link into two sections, one associated with the node in the preceding layer, and the other associated with the node in the succeeding layer. The calculations required in each section are performed by the processing element associated with the corresponding node. If the link partition were to cut a unique signal, the multiplexing requirements would depend on the number of nodes in both layers connected by the link, leading to a time complexity of O(N^2). To minimize the required bus bandwidth, the link partition in Figure 7.2 cuts only shared signals, leading to a time complexity of O(N).

7.2.1 Operation

The network operation occurs in cycles, with each cycle incorporating a feed-forward pass, an error calculation at the output, and a back-propagation pass. The following discussion ignores updates to the adaptable weights and time delays, which will be detailed in the next section.

To start a cycle, the first processing element, j = 1, in the input layer broadcasts its input signal to all the processing elements (PEs) in the first hidden layer. Each PE, i, in the first hidden layer contains storage for the feed-forward delay, τijk, shown in Figure 7.2(b). This delay line stores the incoming signal, and the delayed output from the line forms a product with the weight, wijk(t). Finally, an accumulator or summation device latches the resulting signal, xijk(t).

Each PE in the first hidden layer performs these operations in parallel, while the first input PE broadcasts its input signal. Multiple connections, indexed by k, between two particular nodes can be implemented as multiple taps off the delay buffer, αj. Each tap must be multiplied by the appropriate weight, and accumulated as part of the input sum for the node. When the nodes in the first hidden layer have finished, the second PE in the input layer broadcasts its input signal.
Again, each PE in the first hidden layer stores the incoming signal in a delay buffer and accumulates the signals, xijk(t). This process continues until all the processing elements in the input layer have had a chance to broadcast their inputs.

At this point, the input sums, yi, for the nodes in the first hidden layer are available in the accumulators, so they can be fed through the sigmoidal nonlinearities, fi(·), to produce the node outputs. They also feed into a delay buffer, αi, associated with each node, to provide the sigmoidal derivative, fi′(·), for multiplication by the back-propagating signal. Now each node in the first hidden layer can broadcast its output to all the nodes in the second hidden layer, using the same procedure as the input nodes. This process repeats for each layer until it reaches the output nodes, when the network output becomes available.

At the network output, the error signal, Δi, is formed by subtracting the target signal from the node outputs. Then the first node, i, in the output layer broadcasts its error signal to every node in the preceding layer. Each node, j, in the preceding layer stores its incoming signal in the time-delay buffer, αj − αi, shown in Figure 7.2(b). The delayed output from this buffer is multiplied by the weight, wijk(t), and fed into the node's back-propagation accumulator. As for the feed-forward case, multiple connections (indexed by k) between two particular nodes can be implemented as multiple taps off the delay buffer, αj − αi. Each tap must be multiplied by the appropriate weight, and accumulated as part of the back-propagation sum for the node. When all the nodes in the preceding layer have finished, the second node in the output layer broadcasts its error signal. Again, each PE in the preceding layer stores the incoming signal in a delay buffer and accumulates the signals Δi wijk.
This process continues until all the processing elements in the output layer have had a chance to broadcast their errors.

Now the back-propagation sums for all the nodes in the last hidden layer are available, and can be multiplied by the delayed sigmoidal derivatives, fj′(·), to form the error signals, Δj. Back-propagation proceeds in this fashion, layer by layer, until the error signals reach the first hidden layer. At this point, except for adapting the weights and time delays, one cycle is complete.

7.2.2 Adaptable Weights and Time Delays

Figure 7.2(b) shows that the components for calculating the weight and time delay updates for a link between two nodes, i and j, are associated with the processing element for node i. However, the figure also shows that node j requires the weight and time delay values during the back-propagation phase. As the weights and time delays adapt, node j must maintain their most recent values. Since these values are calculated by the PE associated with node i, it might seem logical to send them to node j over the shared bus. However, these signals would be unique to each pair of nodes, wasting valuable bus bandwidth and increasing the cycle time.

To maintain an optimal tradeoff between system speed and hardware requirements, it is better for each node to calculate the weight and time delay updates separately, and maintain its own set of values. Since all the calculations are digital, the two sets of parameters will remain synchronized if they start out with the same initial values before adaptation commences. However, it is important to note that this technique would not be feasible in an analog system, because minor variations in the analog multipliers and accumulators would eventually cause the two sets of parameters to deviate from one another.

Having each PE keep track of its own versions of the weights and time delays leads to some hardware duplication.
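The synchronization argument can be illustrated with a toy sketch: two deterministic digital copies of the same parameter, given identical initial values and identical update inputs, never diverge. The update rule, learning rate, and gradient values below are hypothetical stand-ins, not the adaptation equations of Chapter 5.

```python
# Two processing elements, i and j, each hold a private copy of the same
# link weight. With identical initial values and a deterministic digital
# update rule, the copies stay exactly synchronized without any bus
# traffic. (The rule and the numbers here are hypothetical.)
def update(weight, grad, lr=0.01):
    return weight - lr * grad

w_node_i = w_node_j = 0.5               # same initial value in both PEs
for grad in [0.2, -0.1, 0.05]:          # both PEs see the same update inputs
    w_node_i = update(w_node_i, grad)
    w_node_j = update(w_node_j, grad)

assert w_node_i == w_node_j             # bit-identical copies
```

An analog realization would add device-dependent error to each update, so the two copies would gradually drift apart, which is why this duplication trick is restricted to digital hardware.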
Figure 7.3 is the block diagram for a single processing element, incorporating multiple input and output links that connect to the shared busses. The summation nodes represent the feed-forward and back-propagation accumulators, where summation occurs serially during each cycle. Figure 7.4 details the components required for a single input link, and Figure 7.5 is the corresponding diagram for an output link. Multiple input or output links for a particular PE can share their weight and time delay update circuitry, since only one input or output link will be active at any given time. However, they each require separate storage for the weight and time delay values, the feed-forward signal, and (for output links) the back-propagation signal. The duplicated storage for the feed-forward signals allows each node to calculate its own weight and time delay updates.

Figure 7.3: Block diagram of a single processing element for a parallel digital hardware implementation. Figures 7.4 and 7.5 show the input and output links in more detail.

Figure 7.4: Processing element input link. Each processing element requires one input link for every node in the previous layer. However, since only one input link can be active at any particular time, the links can share update circuitry, and only the storage for the delay buffer, αj, and the weight and time-delay values must be duplicated for each link. Multiple connections to the same node in the previous layer can be implemented as multiple taps off the delay buffer, and extra storage for the additional weights and time delays. For clarity, however, the diagram shows only a single connection.

Figure 7.5: Processing element output link. Each processing element requires one output link for every node in the succeeding layer. Since only one output link can be active at any particular time, the links can share update circuitry, and only the storage for the feed-forward signal, the back-propagated signal, the weight values, and the time delay values must be duplicated. Multiple connections to the same node in the succeeding layer can be implemented as multiple taps off the delay buffers, αj and αj − αi, and extra storage for the additional weights and time delays. For clarity, however, the diagram shows only a single connection.

7.2.3 Pipelining

The cycle time for the implementation described above increases with the number of PEs or nodes, and with the number of network layers. However, a small restriction on the time-delay values permits pipelined operation, making the cycle time independent of the number of layers.

Consider the processing element for node j. Signals from the previous layer enter the time-delay buffers shown in Figure 7.4, and the delayed values form the sum in the feed-forward accumulator. However, these delayed values are already present in the input link storage buffer, having been placed there during a previous cycle. With these delayed values, it is possible for a particular node to calculate its output for the present cycle before receiving its input signals from the previous layer. Therefore, at the beginning of the cycle, all the nodes in the network can calculate their outputs in parallel. With all the node outputs available, the busses for the entire network can be used in parallel to perform the feed-forward communication, instead of sending the information layer by layer. Fortunately, the symmetry of the network permits the same parallelism to be employed during the back-propagation phase.
Furthermore, by replacing each bidirectional bus with two unidirectional busses, both the feed-forward and back-propagation phases can proceed in parallel.

The preceding discussion assumes that the bus communication occurs after the sum accumulations. Adding a simple latch to the feed-forward signal, and another to the back-propagation signal, permits even these two operations to proceed in parallel. Figure 7.6 shows the locations of the latches, which effectively add a one-cycle delay to the signal paths. Now the latches within each node provide the outputs during the bus communication, while the accumulators simultaneously calculate the sums required for the next cycle. The resulting network provides the fastest possible throughput with a completely modular architecture.

Figure 7.6: Block diagram of a single processing element with latches, for a parallel digital hardware implementation. For clarity, this diagram omits the input and output links shown in Figure 7.3. The latches permit each layer in the network to operate as one stage in a pipelined architecture.

The only restriction for pipelined operation is that each time delay, τijk, must be greater than or equal to the network cycle time. The network cannot contain any zero-delay connections, of the type present in a conventional back-propagation network. The latches in Figure 7.6 effectively add the one-cycle delay to each τijk, so the lower limit for the delays in the input and output links is still zero. The pipelining ability is another advantage of the dispersive network over conventional back-propagation, where signals must proceed layer by layer in a serial fashion. The lack of zero-delay connections should not adversely affect the performance of the network, since at worst it will lead to a small time lag between the input and output.

The pseudocode in Figure 7.7 summarizes the operation of each node.
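The feed-forward half of that per-node cycle can also be expressed as a small executable sketch. The array sizes, integer tap positions, and variable names below are hypothetical stand-ins for the buffers and signals of Figure 7.7, with integer steps replacing the continuous delays.

```python
import numpy as np

L, K, BUF = 4, 2, 16                     # previous-layer nodes, taps per link, buffer length
rng = np.random.default_rng(1)
w = rng.normal(size=(L, K))              # weights for this node's input links
tau = rng.integers(1, BUF, size=(L, K))  # integer tap positions standing in for the delays
buf = np.zeros((L, BUF))                 # one delay buffer per incoming link

def feed_forward_cycle(sigma_prev):
    """One cycle: shift each incoming delay buffer by one step and
    re-accumulate the feed-forward sum from the weighted, delayed taps."""
    ffsum = 0.0
    for j in range(L):                   # previous-layer nodes broadcast in turn
        buf[j] = np.roll(buf[j], 1)      # shift the delay line by one step
        buf[j, 0] = sigma_prev[j]        # newest broadcast sample at the head
        for k in range(K):
            ffsum += w[j, k] * buf[j, tau[j, k]]
    return ffsum
```

The back-propagation half has the same shape, with a second set of buffers holding the delayed error signals; in the parallel hardware, the two loops would run concurrently in separate accumulators rather than sequentially as here.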
The constant Lmax represents the largest number of nodes in any network layer, and ffsum and bpsum represent the values in the feed-forward and back-propagation accumulators, respectively. The operation of the node can be viewed as two concurrent processes, one for the feed-forward operation and the other for back-propagation. There must also be some type of global synchronization, so that each node drives the separate feed-forward and back-propagation busses at the appropriate times.

(a) Feed-forward operation:
- transfer ffsum to feed-forward latch
- ffsum = 0
- for j = 1 to L
  - shift αj buffer
  - for all k do
    - ffsum = ffsum + wijk σj(t − τijk)
  - end for
  - update weights and time delays
- end for

(b) Back-propagation operation:
- transfer bpsum to back-propagation latch
- bpsum = 0
- for i = 1 to L
  - shift αj buffer
  - shift αj − αi buffer
  - for all k do
    - bpsum = bpsum + wijk Δi(t − [αj − αi − τijk])
  - end for
  - update weights and time delays
- end for

Figure 7.7: Pseudocode for a single processing element. Each PE implements two parallel processes, one for the feed-forward calculations, and the other for the back-propagation calculations. Lmax is the maximum number of nodes in a layer, and ffsum and bpsum represent the values in the accumulators. The code is for (a) feed-forward operation and (b) back-propagation operation.

The cycle time of this pipelined architecture scales as O(Lmax). Thus, wider networks, which may be required for more complex problems, will be constrained to sample the input signal at a lower frequency than simpler networks. However, the cycle time is independent of the network depth. The storage space required for delayed signals scales approximately with the square of the number of nodes in the network, unlike the serial implementation, which scales linearly with the number of nodes.

The cycle time for a similar conventional back-propagation network scales as O(Lmax NL), where NL is the number of network layers.
However, the computational complexity for each processing element is reduced because only the weights need to be updated. With parallel hardware, a dispersive network processing element can perform the weight and time delay updates concurrently, and achieve approximately the same time complexity as a conventional back-propagation processing element.

7.3 Analog Hardware Implementations

The implementation of neural computational systems using analog electronic hardware is still in its infancy, with many challenges remaining [39]. Pioneering work in this area is presently being performed using current-mode CMOS VLSI hardware [2, 3, 40, 75, 108]. Currents within the MOS transistors represent the signals or weights, and dedicated circuitry performs the necessary multiplications and summations. Other work on analog VLSI networks has investigated on-chip learning [13] and the fabrication of analog VLSI synapse matrices [50, 56, 103, 115]. Analog networks using discrete components have been used to control real-time adaptive optics, which cancel distortions in telescopic images caused by turbulence in the earth's atmosphere [34].

A significant problem with integrated analog circuitry is the variation among components that are supposed to be identical. For proper operation, the adaptation techniques presented in Chapter 5 require these components to be closely matched. Some effects of nonidealities in analog network hardware have been investigated in [35], and other adaptation techniques like weight perturbation [55] may help to solve the problem.

Perhaps the most serious challenge for dispersive network implementations involves the adaptable time delays. Even fixed delay lines are difficult to manufacture in integrated circuitry, and usually require a large amount of real estate. Adding adaptability by means of a sliding tap implies some type of mechanical coupling, which is beyond current technology.
Furthermore, the sum of the feed-forward and back-propagation delays must be a constant value for all links in a given layer. Slight variations in the lengths of the delays will lead to inaccuracies during adaptation, possibly extending the time required for convergence, or even preventing convergence altogether. Perhaps a modified adaptation mechanism could be developed that would accommodate variations in these time-delay sums.

For some applications, however, it may not be necessary to have fully adaptable time delays. A software simulation of the proposed network could learn an optimal set of time delays for one representative task from a problem domain. Then the analog network could be fabricated with this fixed set of delays, having only adaptable weights. The adaptable weights would provide the necessary adjustment for any task or time-varying environment in the problem domain. Another possibility is a hybrid analog/digital system, with digital parameter storage for accuracy, but analog multiplication and summation for speed. Such hybrid systems have been investigated in [24] and [121] for networks without internal time delays. Other possibilities include pulse stream networks [85], which use digital pulses to encode the propagating signals.

Chapter 8

Conclusions and Suggestions for Further Research

8.1 Summary

The conventional back-propagation network is an extension of the adaptive linear combiner, incorporating nonlinear elements that allow it to be used in applications where linear processors are inadequate. It is well known that back-propagation networks are universal approximators. With a single hidden layer of nonlinear nodes, they can approximate any continuous function if the hidden layer is large enough.
More than one hidden layer may be required to approximate the discontinuous inverses of some functions that arise in control applications.

Chapter 4 showed how conventional back-propagation networks could be used as nonlinear signal processors. Simulations of a back-propagation network in a chaotic signal prediction task demonstrated its superiority over the adaptive linear filter. However, the increased accuracy came at the expense of a significantly longer training time.

The work presented here has extended the back-propagation adaptation technique to dispersive networks with internal adaptable time delays, adding a temporal dimension to the problem. Both the weights and the time delays adapt, using a delayed form of gradient descent to minimize the error at the output. Previously, most dispersive networks used fixed-length time delays that were chosen before training commenced [123, 124]. This thesis describes a natural extension to those techniques. Chapter 5 presented a detailed derivation of the adaptation equations, along with some practical considerations for implementation. These derivations have recently been published in [25] and [26].

Dispersive networks can perform many tasks, including adaptive signal prediction, signal production, channel equalization, and spatio-temporal pattern recognition [42]. Chapter 6 showed that dispersive networks can outperform conventional techniques, including standard back-propagation, when applied to a chaotic signal prediction task. However, the training time can be long, and the use of momentum did not provide any speed improvement during on-line adaptation. Placing a fully trained prediction network in a recurrent configuration allows it to generate a close approximation to the original chaotic training signal.
This procedure can be used to make long term statistical predictions about the training signal, and to investigate the structure of its underlying attractor.

Chapter 6 also described the application of dispersive networks to a nonlinear channel equalization task, where they again demonstrated their improved accuracy over conventional techniques, although they required longer training times. Related discussions of dispersive networks in some of these application areas appeared in [22] and [27].

Finally, Chapter 7 presented some possible implementations and their associated tradeoffs, including digital versus analog, and serial versus parallel. A detailed description of a parallel digital implementation showed that the resulting architecture can be efficiently pipelined, leading to a higher throughput than can be achieved with a conventional back-propagation network.

8.2 Contributions to the Literature

This work presents several contributions to the state of the art in nonlinear network filters. Most of these contributions result from the internal adaptable time delays present in dispersive networks. First, the adaptation technique permits dispersive networks to be trained without the temporal unfolding previously required for time delay neural networks (TDNNs). Temporal unfolding can lead to very large expanded networks, especially when the time delays are adaptable and are not integer multiples of unit delays.

Second, in addition to adapting the weights, the continuous-time formulation leads to a simple mechanism for adapting the time delays. In most conventional discrete-time formulations, the time delays are fixed and evenly distributed. In Chapter 6, the adaptable delays led to
increased accuracy in adaptive signal prediction and adaptive channel equalization tasks.

Third, in a signal prediction configuration, dispersive networks can perform simultaneous adaptation and prediction, while two identical copies of a conventional back-propagation network must be used to accomplish the same task. Apart from the additional cost of doubling the hardware, the two copies of the network must be very closely matched. Any variation between the components, coupled with the internal nonlinearities, can lead to large discrepancies in the network outputs.

Fourth, when learning to predict a chaotic signal, the adaptable time delays permit a dispersive network to find an appropriate embedding space automatically. With other techniques, the embedding space must be chosen based on a priori information or trial-and-error. This ability can be very important when investigating the nature of the attractor underlying a chaotic signal with unknown dimensionality.

Fifth, for the Mackey-Glass signal emulation task, the quality of the power spectrum generated by a network with adaptable delays was far superior to the power spectra produced by similar networks with fixed delays. Conventional back-propagation networks were unstable in this configuration, generating high-frequency oscillations.

Finally, the new technique is suitable for parallel hardware implementations. In digital implementations, the internal time delays permit efficient pipelining, resulting in a much higher throughput than is possible with conventional back-propagation networks. In parallel implementations, the storage requirements scale with the square of the number of nodes, while in serial implementations, the storage requirements are directly proportional to the number of nodes.

8.3 Topics for Further Research

There are many interesting areas where work on dispersive networks can be extended.
The following discussion is divided into four sections on applications, increased rates of convergence, implementations, and different network models.

8.3.1 Other Applications

The chaotic signal prediction problem in Chapters 4 and 6 assumed that a noise-free input signal was available. In any physical implementation, it is inevitable that some degree of noise will be present in the input signal, perhaps affecting the adaptation process. The noise might permit the prediction network to find a more robust solution [119], because it will provide training examples in regions of the phase space that do not lie on the attractor. Without noisy input data during training, the network response to an input that does not lie exactly on the attractor could be quite unpredictable. In addition, the noise might help the network to escape from local minima in the error function, leading to a better solution. Simulations of chaotic signals with additive noise could help to confirm or refute these conjectures.

It would also be useful to investigate the dispersive network prediction performance on chaotic signals other than that produced by the Mackey-Glass equation. In particular, some chaotic attractors have multiple "lobes", leading to nonstationary signal statistics. The signal can spend much of its time in one lobe, and then switch over to another lobe where it behaves differently. To maintain acceptable prediction accuracy, it is important for the network to be able to adapt quickly to the new behavior. Some nonstationary learning characteristics of linear filters were investigated in [133].

The nonlinear communication channel used as a benchmark in Chapter 6 is defined by very idealized equations, chosen for their mathematical convenience. It would be useful to develop more realistic nonlinear models of real-world communication channels, and compare the performance of dispersive network equalizers with more conventional techniques.
In particular, it would be interesting to see how recursive equalizers based on Kalman filters compare with dispersive networks.

Other applications for dispersive networks include adaptive noise cancelling [130] and spatio-temporal pattern classification [42]. Classifying patterns based on their temporal properties is a very important problem in the area of speech recognition, where dispersive networks could perhaps find some utility [9, 65, 120, 123]. In pattern classification tasks, it may be important to use an error criterion other than the Euclidean distance between the output and target vectors [6, 41, 112]. It has also been reported that additive noise can help to improve the generalization ability of conventional back-propagation networks [67, 73], and it may lead to similar benefits with dispersive networks.

8.3.2 Increased Rates of Convergence

There are some known results about the conditions under which the back-propagation training algorithm converges to a stable solution [63, 64]. It would be very useful, although perhaps difficult, to obtain similar results for dispersive networks with adaptable time delays. Knowledge of these conditions could lead to new techniques for speeding up the convergence, which is presently too slow for some real-time applications.

One technique that did not improve convergence was the use of momentum, as discussed in Chapter 6. Simulations using batch-mode adaptation could be performed to find out if the poor performance of momentum was due to the on-line adaptation technique used throughout this investigation. Of course, batch-mode adaptation has additional complexities that must be addressed, such as the accumulation of parameter update values during each batch.

Another technique that might increase the rate of convergence is to allow the sigmoid gains (α in equation (4.15)) to adapt along with the other network parameters [62].
This method can speed up conventional back-propagation, but its utility with dispersive networks remains to be investigated.

A significant problem in all gradient descent methods involves the choice of an optimal learning rate. The choice is usually made by trial and error, but adaptive techniques for automatically finding the optimum learning rate can bypass this step in the design process and lead to faster adaptation [20, 127]. In [57], Jacobs suggested that each weight in a back-propagation network be given its own learning rate, and that these learning rates adapt separately to provide faster convergence. Variations on this technique have been reported in [31, 78, 116]. All these methods can potentially speed up convergence in dispersive networks as well.

8.3.3 Implementation Issues

It would be interesting to build the parallel hardware discussed in Chapter 7, and apply it to real-world problems. With any digital implementation, the effects of signal quantization can lead to variations in performance [38, 46]. For example, if the error is small but nonzero, then weight updates might stop because the calculated weight increment is less than one quantization step. A full understanding of the behavior of digital implementations requires an investigation of how this phenomenon affects dispersive networks with adaptable time delays.

One way to avoid quantization problems is to build dispersive networks using analog hardware, but this approach leads to a host of new problems that may be even more difficult to deal with. As discussed in Chapter 7, analog circuits suffer from variations in components that should be identical. The effects of these variations need to be investigated, and new methods of adaptation that are less sensitive to them may need to be developed.

Perhaps the most difficult challenge for the analog implementation of dispersive networks lies in building adaptable time delays.
Even fixed delays are difficult to build consistently, and they require a large physical area in integrated circuitry. With adaptable delays, the sum of the feed-forward and back-propagation delays must be an accurately determined constant, which might be difficult to achieve in an analog implementation. An investigation of temporal back-propagation's sensitivity to variations in this sum would give an idea of how accurate the analog implementation must be.

For some applications, completely adaptable time delays may not be necessary. Training with adaptable delays could be performed in software for a particular problem, and the resulting delays would determine the length of fixed delays in the analog hardware. Then the hardware could learn to solve similar problems with only adaptable weights. Finally, there is the possibility of hybrid implementations, combining both analog and digital techniques.

8.3.4 Other Network Models

One exciting possibility for future research is to extend the ideas of dispersive networks and adaptable time delays to other types of network models. For example, the time delays discussed here represent perfect delay elements, but physical delay lines distort and spread out the propagating signal, acting more like filters. It may be possible to derive adaptation techniques for these types of delays as well.

When choosing a network topology for a particular application, it is difficult to find the optimal number of layers, hidden nodes, and interconnections. Adequate structures can often be found by trial and error, but the procedure is time consuming and not automatic. In addition to modifying the weights, some adaptation algorithms add [19, 30, 114] or remove [58, 59] nodes and connections to automatically optimize the topology.
Similar techniques could be applied to dispersive networks to remove the trial and error phase from the design process.

Another interesting project would be to investigate the use of adaptable time delays in recurrently connected networks. There have already been several studies applying back-propagation to recurrent networks [1, 5, 95, 96, 97, 117, 129, 137, 138], and Pearlmutter suggested the use of adaptable time delays in recurrent networks [94]. An advantage of recurrent networks is that their outputs can depend on the input signal from times very distant in the past, but their training procedures usually suffer from either large storage requirements or the need to perform complex or nonlocal computations.

Finally, there has been much recent interest in pulse-stream networks, where the output of a node is represented by a series of pulses instead of a continuous signal level [17, 77, 84, 85, 86]. The inspiration for these models comes from a study of biological neurons, where the pulse-stream representation is the primary mode of operation [11]. Extending the time delay adaptation methods to pulse-coded networks is a promising area for future exploration.

Bibliography

[1] L. B. Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In M. Caudill and C. Butler, editors, Proceedings of the IEEE First International Conference on Neural Networks, volume 2, pages 609-618, San Diego, CA, June 1987.

[2] Andreas G. Andreou and Kwabena A. Boahen. Synthetic neural circuits using current-domain signal representations. Neural Computation, 1(4):489-501, 1989.

[3] Andreas G. Andreou, Kwabena A. Boahen, Philippe O. Pouliquen, Aleksandra Pavasović, Robert E. Jenkins, and Kim Strohbehn. Current-mode subthreshold MOS circuits for analog VLSI neural systems. IEEE Transactions on Neural Networks, 2(2):205-213, March 1991.

[4] A. D. Back and A. C. Tsoi. FIR and IIR synapses, a new neural network architecture for time series modeling.
Neural Computation, 3(3):375-385, 1991.

[5] Jacob Barhen, Nikzad Toomarian, and Sandeep Gulati. Adjoint operator algorithms for faster learning in dynamical neural networks. In David S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 498-508. San Mateo, CA: Morgan Kaufmann, 1990.

[6] Etienne Barnard. Performance and generalization of the classification figure of merit criterion function. IEEE Transactions on Neural Networks, 2(2):322-325, March 1991.

[7] H. D. Block. The Perceptron: A model for brain functioning. I. Reviews of Modern Physics, 34:123-135, 1962.

[8] Ulrich Bodenhausen and Alex Waibel. The Tempo 2 algorithm: Adjusting time-delays by supervised learning. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 155-161. San Mateo, CA: Morgan Kaufmann, 1991.

[9] L. Bottou and F. Fogelman Soulie. Speaker-independent isolated digit recognition: Multilayer perceptrons vs. dynamic time warping. Neural Networks, 3(4):453-465, 1990.

[10] Martin L. Brady, Raghu Raghavan, and Joseph Slawny. Back propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665-674, May 1989.

[11] Valentino Braitenberg. On the Texture of Brains. Springer-Verlag, New York, 1977.

[12] David J. Burr. Experiments on neural net recognition of spoken and written text. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(7):1162-1168, July 1988.

[13] H. C. Card and W. R. Moore. VLSI devices and circuits for neural networks. International Journal of Neural Systems, 1(2):149-165, 1989.

[14] Martin Casdagli. Nonlinear prediction of chaotic time series. Physica D, 35:335-356, 1989.

[15] Daniel L. Chester. Why two hidden layers are better than one. In International Joint Conference on Neural Networks 1990, volume 1, pages 265-268, Washington, D. C., January 1990.

[16] Joseph E. Collard. A B-P ANN commodity trader. In Richard P.
Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 551-556. San Mateo, CA: Morgan Kaufmann, 1991.

[17] Neil E. Cotter, Kent Smith, and Martin Gaspar. A pulse-width modulation design approach and path-programmable logic for artificial neural networks. In Jonathan Allen and F. Thomson Leighton, editors, Advanced Research in VLSI: Proceedings of the Fifth MIT Conference, March 1988, pages 1-17, Cambridge, MA, 1988. MIT Press.

[18] Thomas M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14:326-334, June 1965.

[19] R. Scott Crowder, III. Predicting the Mackey-Glass timeseries with cascade-correlation learning. In David S. Touretzky, Jeffrey L. Elman, Terrence J. Sejnowski, and Geoffrey E. Hinton, editors, Proceedings of the 1990 Connectionist Models Summer School, pages 117-123, San Mateo, CA, 1990. Morgan Kaufmann.

[20] Christian Darken, Joseph Chang, and John Moody. Learning rate schedules for faster stochastic gradient search. In S. Y. Kung, F. Fallside, J. Aa. Sorenson, and C. A. Kamm, editors, Neural Networks for Signal Processing II — Proceedings of the 1992 IEEE Workshop, pages 3-12, 445 Hoes Lane, Piscataway, NJ 08854-4150, 1992. IEEE Press.

[21] Charles Darwin. On the Origin of Species. John Murray, London, 1859.

[22] Michael R. Davenport and Shawn P. Day. Chaotic signal emulation using a recurrent time delay neural network. In S. Y. Kung, F. Fallside, J. Aa. Sorenson, and C. A. Kamm, editors, Neural Networks for Signal Processing II — Proceedings of the 1992 IEEE Workshop, pages 454-463, 445 Hoes Lane, Piscataway, NJ 08854-4150, 1992. IEEE Press.

[23] Shawn P. Day and Daniel S. Camporese. A stochastic training technique for feed-forward neural networks. In International Joint Conference on Neural Networks 1990, volume 1, pages 607-612, San Diego, CA, June 1990.

[24] Shawn P.
Day and Daniel S. Camporese. A VLSI neural network with on-chip learning. In David S. Touretzky, Jeffrey L. Elman, Terrence J. Sejnowski, and Geoffrey E. Hinton, editors, Proceedings of the 1990 Connectionist Models Summer School, pages 387-395. San Mateo, CA: Morgan Kaufmann, 1990.

[25] Shawn P. Day and Daniel S. Camporese. Continuous-time temporal back-propagation. In International Joint Conference on Neural Networks 1991, volume 2, pages 95-100, Seattle, WA, July 1991.

[26] Shawn P. Day and Michael R. Davenport. Continuous-time temporal back-propagation with adaptable time delays. To appear in: IEEE Transactions on Neural Networks, 4(2), March 1993.

[27] Shawn P. Day, Michael R. Davenport, and Daniel S. Camporese. Dispersive networks for nonlinear adaptive filtering. In S. Y. Kung, F. Fallside, J. Aa. Sorenson, and C. A. Kamm, editors, Neural Networks for Signal Processing II — Proceedings of the 1992 IEEE Workshop, pages 540-549, 445 Hoes Lane, Piscataway, NJ 08854-4150, 1992. IEEE Press.

[28] Bert De Vries and Jose C. Principe. The Gamma model — A new neural model for temporal processing. Neural Networks, 5(4):565-576, 1992.

[29] Scott E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, September 1988.

[30] Scott E. Fahlman and Christian Lebiere. The cascade-correlation learning architecture. Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, February 1990.

[31] Yan Fang and Terrence J. Sejnowski. Faster learning for dynamic recurrent backpropagation. Neural Computation, 2(3):270-273, 1990.

[32] J. Doyne Farmer. Chaotic attractors of an infinite-dimensional dynamical system. Physica D, 4:366-393, 1982.

[33] J. Doyne Farmer and John J. Sidorowich. Predicting chaotic time series. Physical Review Letters, 59(8):845-848, August 1987.

[34] William A. Fisher, Robert J. Fujimoto, and Robert C. Smithson.
A programmable analog neural network processor. IEEE Transactions on Neural Networks, 2(2):222-229, March 1991.

[35] Robert C. Frye, Edward A. Rietman, and Chee C. Wong. Back-propagation learning and nonidealities in analog neural network hardware. IEEE Transactions on Neural Networks, 2(1):110-117, January 1991.

[36] Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183-192, 1989.

[37] D. Gabor, W. P. L. Wilby, and R. Woodcock. A universal non-linear filter, predictor and simulator which optimizes itself by a learning process. IEE Proceedings, 108B:422-438, 1961.

[38] Richard D. Gitlin, J. E. Mazo, and Michael G. Taylor. On the design of gradient algorithms for digitally implemented adaptive filters. IEEE Transactions on Circuit Theory, CT-20(2):125-136, March 1973.

[39] Karl Goser, Ulrich Hilleringmann, Ulrich Rueckert, and Klaus Schumacher. VLSI technologies for artificial neural networks. IEEE Micro, 9(6):28-44, December 1989.

[40] Hans P. Graf and Paul deVegvar. A CMOS implementation of a neural network model. In Paul Losleben, editor, Advanced Research in VLSI: Proceedings of the 1987 Stanford Conference, pages 351-367, Cambridge, MA, 1987. MIT Press.

[41] John B. Hampshire, II and Alexander H. Waibel. A novel objective function for improved phoneme recognition using time-delay neural networks. IEEE Transactions on Neural Networks, 1(2):216-228, June 1990.

[42] Robert Hecht-Nielsen. Neurocomputing. Addison-Wesley, Reading, MA, 1990.

[43] John Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.

[44] Geoffrey E. Hinton. Connectionist learning procedures. Artificial Intelligence, 40:185-234, 1989.

[45] Geoffrey E. Hinton. How neural networks learn from experience. Scientific American, 267(3):144-151, September 1992.

[46] Paul W. Hollis, John S. Harper, and John J. Paulos.
The effects of precision constraints in a backpropagation learning network. Neural Computation, 2(3):363-373, 1990.

[47] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991.

[48] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

[49] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5):551-560, 1990.

[50] W. Hubbard, D. Schwartz, J. Denker, H. P. Graf, R. Howard, L. Jackel, B. Straughn, and D. Tennant. Electronic neural networks. In John S. Denker, editor, Neural Networks for Computing, Snowbird, UT, AIP Conference Proceedings 151, pages 227-234, New York, 1986. American Institute of Physics.

[51] Don R. Hush and Bill G. Horne. Progress in supervised neural networks. IEEE Signal Processing Magazine, 10(1):8-39, January 1993.

[52] Yoshifusa Ito. Approximation of functions on a compact set by finite sums of a sigmoid function without scaling. Neural Networks, 4(6):817-826, 1991.

[53] Yoshifusa Ito. Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory. Neural Networks, 4(3):385-394, 1991.

[54] Yoshifusa Ito. Approximation of continuous functions on R^d by linear combinations of shifted rotations of a sigmoid function with and without scaling. Neural Networks, 5(1):105-115, 1992.

[55] Marwan Jabri and Barry Flower. Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. Neural Computation, 3(4):546-565, 1991.

[56] L. D. Jackel, H. P. Graf, and R. E. Howard. Electronic neural network chips. Applied Optics, 26(23):5077-5080, December 1987.

[57] Robert A. Jacobs. Increased rates of convergence through learning rate adaptation.
Neural Networks, 1(4):295-307, 1988.

[58] Chuanyi Ji, Robert R. Snapp, and Demetri Psaltis. Generalizing smoothness constraints from discrete samples. Neural Computation, 2(2):188-197, 1990.

[59] Ehud D. Karnin. A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2):239-242, June 1990.

[60] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics, 23:462-466, 1952.

[61] Takashi Kimoto, Kazuo Asakawa, Morio Yoda, and Masakazu Takeoka. Stock market prediction system with modular neural networks. In International Joint Conference on Neural Networks 1990, volume 1, pages 1-6, Washington, D. C., January 1990.

[62] John K. Kruschke and Javier R. Movellan. Benefits of gain: Speeded learning and minimal hidden layers in back-propagation networks. IEEE Transactions on Systems, Man, and Cybernetics, 21(1):273-280, January/February 1991.

[63] Chung-Ming Kuan and Kurt Hornik. Convergence of learning algorithms with constant learning rates. IEEE Transactions on Neural Networks, 2(5):484-489, September 1991.

[64] Harold J. Kushner and Dean S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York, 1978.

[65] Kevin J. Lang, Alex H. Waibel, and Geoffrey E. Hinton. A time-delay neural network architecture for isolated word recognition. Neural Networks, 3(1):23-43, 1990.

[66] Alan Lapedes and Robert Farber. Nonlinear signal processing using neural networks: Prediction and system modeling. Technical Report LA-UR-87-2662, Los Alamos National Laboratory, July 1987.

[67] Lasse Holmström and Petri Koistinen. Using additive noise in back-propagation training. IEEE Transactions on Neural Networks, 3(1):24-38, January 1992.

[68] Y. Le Cun. A learning scheme for asymmetric threshold networks. In Proceedings Cognitiva 85, pages 599-604, Paris, 1985.

[69] Richard P. Lippmann.
An introduction to computing with neural nets. IEEE ASSP Magazine, pages 4-22, April 1987.

[70] Odile Macchi and Eweda Eweda. Second-order convergence analysis of stochastic adaptive linear filtering. IEEE Transactions on Automatic Control, AC-28(1):76-85, January 1983.

[71] Michael C. Mackey and Leon Glass. Oscillation and chaos in physiological control systems. Science, 197:287-289, July 1977.

[72] John Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561-580, April 1975.

[73] Kiyotoshi Matsuoka. Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics, 22(3):436-440, May/June 1992.

[74] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133, 1943.

[75] Carver Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989.

[76] W. C. Mead, R. D. Jones, Y. C. Lee, C. W. Barnes, G. W. Flake, L. A. Lee, and M. K. O'Rourke. Using CNLS-Net to predict the Mackey-Glass chaotic time series. In International Joint Conference on Neural Networks 1991, volume 2, pages 485-490, Seattle, WA, July 1991.

[77] Jack L. Meador, Angus Wu, Clint Cole, Novat Nintunze, and Pichet Chintrakulchai. Programmable impulse neural circuits. IEEE Transactions on Neural Networks, 2(1):101-109, January 1991.

[78] Ali A. Minai and Ronald D. Williams. Back-propagation heuristics: A study of the extended delta-bar-delta algorithm. In International Joint Conference on Neural Networks 1990, volume 1, pages 595-600, San Diego, CA, June 1990.

[79] Marvin L. Minsky and Seymour A. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, 1969.

[80] John Moody. Fast learning in multi-resolution hierarchies. In David S. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 29-39. San Mateo, CA: Morgan Kaufmann, 1989.

[81] John Moody and Christian Darken.
Learning with localized receptive fields. In David S. Touretzky, Geoffrey E. Hinton, and Terrence J. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 133-143. Morgan Kaufmann, 1988.

[82] John Moody and Christian J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281-294, 1989.

[83] M. C. Mozer. A focussed backpropagation algorithm for temporal pattern recognition. Complex Systems, 3:349-381, 1989.

[84] Alan F. Murray. Pulse arithmetic in VLSI neural networks. IEEE Micro, 9(6):64-74, December 1989.

[85] Alan F. Murray, Dante Del Corso, and Lionel Tarassenko. Pulse-stream VLSI neural networks mixing analog and digital techniques. IEEE Transactions on Neural Networks, 2(2):193-204, March 1991.

[86] Alan F. Murray and Anthony V. W. Smith. Asynchronous VLSI neural networks using pulse-stream arithmetic. IEEE Journal of Solid-State Circuits, 23(3):688-697, June 1988.

[87] Kumpati S. Narendra and Kannan Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4-27, March 1990.

[88] Kumpati S. Narendra and Kannan Parthasarathy. Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Transactions on Neural Networks, 2(2):252-262, March 1991.

[89] Steven J. Nowlan and Geoffrey E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4:473-493, 1992.

[90] Alan V. Oppenheim and Ronald W. Schafer. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, New Jersey, 1975.

[91] N. H. Packard, J. P. Crutchfield, J. D. Farmer, and R. S. Shaw. Geometry from a time series. Physical Review Letters, 45(9):712-716, September 1980.

[92] D. B. Parker. Learning-logic. Technical Report TR-47, Sloan School of Management, M.I.T., Cambridge, MA, 1985.

[93] Barak A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263-269, 1989.

[94] Barak A.
Pearlmutter. Dynamic recurrent neural networks. Technical Report CMU-CS-90-196, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, December 1990.

[95] Fernando J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19):2229-2232, November 1987.

[96] Fernando J. Pineda. Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Computation, 1(2):161-172, 1989.

[97] Fernando J. Pineda. Time dependent adaptive neural networks. In David S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 710-718. San Mateo, CA: Morgan Kaufmann, 1990.

[98] John C. Platt and Federico Faggin. Networks for the separation of sources that are superimposed and delayed. In J. Moody, S. Hanson, and R. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 730-737. San Mateo, CA: Morgan Kaufmann, 1992.

[99] Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.

[100] Richard A. Roberts and Clifford T. Mullis. Digital Signal Processing. Addison-Wesley, Don Mills, Ontario, 1987.

[101] David Rogers. Predicting weather using a genetic memory: a combination of Kanerva's sparse distributed memory with Holland's genetic algorithms. In David S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 455-464. San Mateo, CA: Morgan Kaufmann, 1990.

[102] Frank Rosenblatt. Principles of Neurodynamics. Spartan Books, New York, 1962.

[103] Olivier Rossetto, Christian Jutten, Jeanny Herault, and Ingo Kreuzer. Analog VLSI synaptic matrices as building blocks for neural networks. IEEE Micro, 9(6):56-63, 1989.

[104] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, chapter 8, pages 318-362.
MIT Press, Cambridge, MA, 1986.

[105] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(9):533-536, October 1986.

[106] Terence D. Sanger. A tree-structured adaptive network for function approximation in high-dimensional spaces. IEEE Transactions on Neural Networks, 2(2):285-293, March 1991.

[107] Terence D. Sanger. A tree-structured algorithm for reducing computation in networks with separable basis functions. Neural Computation, 3(1):67-78, 1991.

[108] Massimo A. Sivilotti, Michael R. Emerling, and Carver A. Mead. VLSI architectures for implementations of neural networks. In John S. Denker, editor, Neural Networks for Computing, Snowbird, UT, AIP Conference Proceedings 151, pages 408-413, New York, 1986. American Institute of Physics.

[109] Eduardo D. Sontag. Sigmoids distinguish better than heavisides. Neural Computation, 1:470-472, 1989.

[110] Eduardo D. Sontag. Feedback stabilization using two-hidden-layer nets. Technical Report SYCON-90-11, Rutgers Center for Systems and Control, October 1990.

[111] Eduardo D. Sontag. Remarks on interpolation and recognition using neural nets. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 939-945. San Mateo, CA: Morgan Kaufmann, 1991.

[112] Eduardo D. Sontag and Hector J. Sussmann. Back propagation separates where perceptrons do. Neural Networks, 4(2):243-249, 1991.

[113] Ferrel G. Stremler. Introduction to Communication Systems. Addison-Wesley, Reading, Massachusetts, second edition, 1982.

[114] Manoel F. Tenorio and Wei-Tsih Lee. Self-organizing network for optimum supervised learning. IEEE Transactions on Neural Networks, 1(1):100-110, March 1990.

[115] A. P. Thakoor, A. Moopenn, John Lambe, and S. K. Khanna. Electronic hardware implementations of neural networks. Applied Optics, 26(23):5085-5092, December 1987.

[116] Tom Tollenaere.
SuperSAB: Fast adaptive back propagation with good scaling properties. Neural Networks, 3(5):561-573, 1990.

[117] N. Toomarian and J. Barhen. Adjoint-functions and temporal learning algorithms in neural networks. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 113-120. San Mateo, CA: Morgan Kaufmann, 1991.

[118] Philip Treleaven, Marco Pacheco, and Marley Vellasco. VLSI architectures for neural networks. IEEE Micro, 9(6):8-27, December 1989.

[119] T. Troudet and W. Merrill. Neuromorphic learning of continuous-valued mappings from noise-corrupted data. IEEE Transactions on Neural Networks, 2(2):294-301, March 1991.

[120] K. P. Unnikrishnan, John J. Hopfield, and David W. Tank. Connected-digit speaker-dependent speech recognition using a neural network with time-delayed connections. IEEE Transactions on Signal Processing, 39(3):698-713, March 1991.

[121] Michel Verleysen, Bruno Sirletti, Andre M. Vandemeulebroecke, and Paul G. A. Jespers. Neural networks for high-storage content-addressable memory: VLSI circuit and learning algorithm. IEEE Journal of Solid-State Circuits, 24(3):562-568, June 1989.

[122] Věra Kůrková. Kolmogorov's theorem is relevant. Neural Computation, 3(4):617-622, 1991.

[123] Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328-339, March 1989.

[124] Eric A. Wan. Temporal backpropagation: An efficient algorithm for finite impulse response neural networks. In David S. Touretzky, Jeffrey L. Elman, Terrence J. Sejnowski, and Geoffrey E. Hinton, editors, Proceedings of the 1990 Connectionist Models Summer School, pages 131-137, San Mateo, CA, 1990. Morgan Kaufmann.

[125] Eric A. Wan. Temporal backpropagation for FIR neural networks.
In International Joint Conference on Neural Networks 1990, volume 1, pages 575-580, San Diego, CA, June 1990.

[126] Andreas S. Weigend, Bernardo A. Huberman, and David E. Rumelhart. Predicting the future: A connectionist approach. International Journal of Neural Systems, 1(3):193-209, 1990.

[127] Michael K. Weir. A method for self-determination of adaptive learning rates in back propagation. Neural Networks, 4(3):371-379, 1991.

[128] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.

[129] Paul J. Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560, 1990.

[130] Bernard Widrow, John R. Glover, Jr., John M. McCool, John Kaunitz, Charles S. Williams, Robert H. Hearn, James R. Zeidler, Eugene Dong, Jr., and Robert C. Goodlin. Adaptive noise cancelling: Principles and applications. Proceedings of the IEEE, 63(12):1692-1716, December 1975.

[131] Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. In 1960 IRE WESCON Convention Record, pages 96-104, New York, 1960. IRE.

[132] Bernard Widrow and Michael A. Lehr. 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proceedings of the IEEE, 78(9):1415-1442, 1990.

[133] Bernard Widrow, John M. McCool, Michael G. Larimore, and C. Richard Johnson, Jr. Stationary and nonstationary learning characteristics of the LMS adaptive filter. Proceedings of the IEEE, 64(8):1151-1162, August 1976.

[134] Bernard Widrow and Samuel D. Stearns. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, New Jersey, 1985.

[135] Bernard Widrow and Rodney Winter. Neural nets for adaptive filtering and adaptive pattern recognition. Computer, 21(3):25-39, March 1988.

[136] N. Wiener. Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications. MIT Press, Cambridge, MA, 1949.

[137] Ronald J. Williams and Jing Peng.
An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4):490-501, 1990.

[138] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989.

[139] James R. Zeidler. Performance analysis of LMS adaptive prediction filters. Proceedings of the IEEE, 78(12):1780-1806, 1990.

Appendix A

Momentum

Gradient descent often converges very slowly for many problems of practical interest. To speed convergence, a simple technique known as "momentum" was first employed in [104]. Adaptation with momentum depends on previous weight updates, as well as the direction of the gradient, leading to the following discrete-time formula:

\Delta w_{ij}(t_n) = \begin{cases} 0 & \text{if } n \le 0 \\ -(1-\alpha)\mu \dfrac{\partial \mathcal{E}}{\partial w_{ij}}(t_n) + \alpha \Delta w_{ij}(t_{n-1}) & \text{if } n > 0. \end{cases} \quad (A.1)

Although the gradient usually must be estimated, it is not necessary to complicate the notation with that detail here. The weights change at discrete points in time with a period of T = t_n - t_{n-1}. The integer, n, is the step number, and the momentum parameter, \alpha, must lie in the range 0 < \alpha < 1.

Equation (A.1), which describes a discrete-time first-order low-pass filter, can also be written as

\Delta w_{ij}(t_n) = -(1-\alpha)\mu \sum_{m=0}^{n-1} \alpha^m \frac{\partial \mathcal{E}}{\partial w_{ij}}(t_n - t_m), \quad (A.2)

with n > 0. Thus, \Delta w_{ij} is an exponentially weighted sum of past gradient values. Looking at the special case where the gradient is constant gives

\lim_{n \to \infty} \Delta w_{ij}(t_n) = -\lim_{n \to \infty} (1-\alpha)\mu \frac{\partial \mathcal{E}}{\partial w_{ij}} \sum_{m=0}^{n-1} \alpha^m = -(1-\alpha)\mu \frac{\partial \mathcal{E}}{\partial w_{ij}} \frac{1}{1-\alpha} = -\mu \frac{\partial \mathcal{E}}{\partial w_{ij}}. \quad (A.3)

Therefore, for smooth error surfaces, where \partial^2 \mathcal{E}/\partial w_{ij}^2 \approx 0, the weight changes follow simple gradient descent. Substituting the continuous variables t and t' for t_n and t_m gives a continuous-time version of equation (A.2):

\frac{dw_{ij}}{dt} = -K(1-\alpha) \int_0^t \alpha^{t'/T} \frac{\partial \mathcal{E}}{\partial w_{ij}}(t - t')\, dt'. \quad (A.4)

Here, K is a gain constant chosen according to the constant gradient criterion.
If \partial^2 \mathcal{E}/\partial w_{ij}^2 = 0, then

\lim_{t \to \infty} \frac{dw_{ij}}{dt} = -\lim_{t \to \infty} K(1-\alpha) \frac{\partial \mathcal{E}}{\partial w_{ij}} \int_0^t \alpha^{t'/T}\, dt' = KT \frac{1-\alpha}{\ln \alpha} \frac{\partial \mathcal{E}}{\partial w_{ij}}. \quad (A.5)

Approximating gradient descent for smooth error surfaces requires that

KT \frac{1-\alpha}{\ln \alpha} \frac{\partial \mathcal{E}}{\partial w_{ij}} = -\mu \frac{\partial \mathcal{E}}{\partial w_{ij}}, \quad (A.6)

so

K = -\frac{\mu \ln \alpha}{T(1-\alpha)}, \quad (A.7)

and finally,

\frac{dw_{ij}}{dt} = \frac{\ln \alpha}{T} \int_0^t \alpha^{t'/T} \mu \frac{\partial \mathcal{E}}{\partial w_{ij}}(t - t')\, dt' = \frac{\ln \alpha}{T} \int_0^t \exp\!\left(\frac{t' \ln \alpha}{T}\right) \mu \frac{\partial \mathcal{E}}{\partial w_{ij}}(t - t')\, dt'. \quad (A.8)

Equation (A.8) represents a simple first-order low-pass filter with a time constant of -T/\ln \alpha and an impulse response of

h(t) = -\frac{\ln \alpha}{T} \exp\!\left(\frac{t \ln \alpha}{T}\right) u(t), \quad (A.9)

where u(t) is the unit step function, given by

u(t) = \begin{cases} 0 & \text{if } t < 0 \\ 1 & \text{if } t \ge 0. \end{cases} \quad (A.10)

Differentiating equation (A.8) with respect to time gives

-\frac{T}{\ln \alpha} \frac{d^2 w_{ij}}{dt^2} = -\frac{dw_{ij}}{dt} - \mu \frac{\partial \mathcal{E}}{\partial w_{ij}}(t), \quad (A.11)

which is analogous to equation (4.11). The relationship between \eta in (4.11) and the momentum parameter, \alpha, is

\eta = -\frac{T}{\ln \alpha}, \quad (A.12)

and \eta is positive since 0 < \alpha < 1.

Appendix B

Estimating the Mean Value of a Signal

Adaptation in back-propagation networks with bipolar sigmoidal nonlinearities occurs fastest for input signals with zero mean [12]. Therefore, it is desirable to preprocess the signal by removing its mean value before feeding it into the network. If the mean is not known in advance, it can be estimated using some type of on-line adaptive technique.

One simple possibility is to maintain a running average of the input signal vector, and subtract it from the input before feeding it into the network. In discrete time, the input signal can be represented as a sequence of vectors, I_n, with n = 1, 2, 3, \ldots being the sample number. The running average, denoted by \langle I \rangle_n, can be updated with each sample as

\langle I \rangle_{n+1} = \langle I \rangle_n + \rho_n (I_n - \langle I \rangle_n), \quad (B.1)

where 0 < \rho_n \le 1 is a rate parameter that can vary with n.

If \rho_n = 1/n, then \langle I \rangle_{n+1} will be the simple average of all previous input vectors. On the other hand, if \rho_n is constant, then \langle I \rangle_n represents an exponentially weighted average of past input vectors, with the heaviest weighting on the most recent.
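The update rule (B.1) can be sketched in a few lines of code. The function and schedule names below are illustrative, not from the thesis; the sketch simply shows that a schedule of \rho_n = 1/n reproduces the simple average, while a constant \rho gives the exponentially weighted average.

```python
# Sketch of the running-average update (B.1); running_mean and rho_schedule
# are hypothetical names introduced for illustration.
def running_mean(samples, rho_schedule):
    """Apply m <- m + rho_n * (x_n - m) over the sample sequence, as in (B.1)."""
    m = 0.0
    for n, x in enumerate(samples, start=1):
        m += rho_schedule(n) * (x - m)
    return m

# With rho_n = 1/n the update reproduces the simple average of all samples:
simple = running_mean([1.0, 2.0, 3.0, 4.0], lambda n: 1.0 / n)   # 2.5

# A constant rho gives an exponentially weighted average, with the heaviest
# weighting on the most recent samples:
ewma = running_mean([1.0, 2.0, 3.0, 4.0], lambda n: 0.5)
```

For vector-valued inputs, the same recursion applies componentwise, so the scalar sketch carries over directly.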
Constant values of \rho can be useful for some types of nonstationary input signals that have a slowly varying mean. The time scale on which the mean varies determines the best value for \rho. Other schedules for changing \rho_n may be useful for different types of nonstationary input signals.

Appendix C

Linear Interpolation of Delayed Signal Derivatives

In digital implementations of dispersive networks, it is often necessary to estimate delayed signal values and time derivatives that fall between stored sample points. For the signal values, it is a simple matter to interpolate linearly between the sample points on either side of the desired value, but estimating the time derivative of the signal at a particular point requires a more complex technique.

Figure C.1 shows several stored sample points from a signal, o(t), and the task is to estimate the signal derivative at a particular delay, \tau_0.

Figure C.1: Stored sample points from a continuous signal. Time is shown along the horizontal axis, with the length of the delay, \tau, increasing to the left. The task is to estimate the time derivative of the signal at a particular delay, \tau_0, which may lie between sample points.

The derivative can easily be estimated for delays that are midway between sample points, such as a delay of \tau_i + 1/2:

o'(t - \tau_i - 1/2) = h(\tau_i) - h(\tau_i + 1). \quad (C.1)

Here, h(\tau) \equiv o(t - \tau) for notational convenience.

For values of \tau_0 that are not midway between sample points, interpolation can be used between derivative estimates on either side of the required delay. To begin, the delay, \tau_0, can be decomposed into the sum of an integer portion, \tau_i, and a fractional portion, \tau_f, where

\tau_0 = \tau_i + \tau_f. \quad (C.2)

There are two cases to consider.
First, if $\tau_f < 1/2$, then the two nearest estimated derivatives are
\[
h'(\tau_i + 1/2) = h(\tau_i) - h(\tau_i + 1) ,
\]
\[
h'(\tau_i - 1/2) = h(\tau_i - 1) - h(\tau_i) ,
\]
and the required derivative can be interpolated as
\[
\begin{aligned}
h'(\tau_0) = h'(\tau_i + \tau_f)
&= (\tau_f + 1/2)\, h'(\tau_i + 1/2) + (1/2 - \tau_f)\, h'(\tau_i - 1/2) \\
&= (\tau_f + 1/2)\left[ h(\tau_i) - h(\tau_i + 1) \right] + (1/2 - \tau_f)\left[ h(\tau_i - 1) - h(\tau_i) \right] .
\end{aligned} \tag{C.3}
\]
In the second case, $\tau_f > 1/2$, and the two nearest derivative estimates are
\[
h'(\tau_i + 1/2) = h(\tau_i) - h(\tau_i + 1) ,
\]
\[
h'(\tau_i + 3/2) = h(\tau_i + 1) - h(\tau_i + 2) .
\]
Here, the required interpolation gives
\[
\begin{aligned}
h'(\tau_0) = h'(\tau_i + \tau_f)
&= (\tau_f - 1/2)\, h'(\tau_i + 3/2) + (3/2 - \tau_f)\, h'(\tau_i + 1/2) \\
&= (\tau_f - 1/2)\left[ h(\tau_i + 1) - h(\tau_i + 2) \right] + (3/2 - \tau_f)\left[ h(\tau_i) - h(\tau_i + 1) \right] .
\end{aligned} \tag{C.4}
\]
In both cases, the required derivative estimate is obtained from a weighted average of derivative estimates on either side of $\tau_0$.

Appendix D

Third-Order Low-Pass Butterworth Digital Filter

This appendix describes the low-pass Butterworth digital filter used to provide a band-limited input for the channel equalization problem in Chapter 6. The filter was designed using the bilinear transform method [100], which guarantees stability. To begin, an analog, third-order, low-pass Butterworth filter with unity gain has the transfer function
\[
H(s) = \frac{\Omega_c^3}{(s + \Omega_c)(s^2 + \Omega_c s + \Omega_c^2)} , \tag{D.1}
\]
where $s$ represents complex frequency, and $\Omega_c$ is the cutoff frequency.

The bilinear transform maps the complex $s$-plane into the complex $z$-plane according to
\[
z = W(s) = \frac{1 + s}{1 - s} , \tag{D.2}
\]
which causes a "frequency warping" given by
\[
\omega(\Omega) = \frac{2}{t_0} \arctan \Omega , \tag{D.3}
\]
where $\Omega$ represents the frequency of the analog filter and $\omega$ represents the frequency of the digital filter. If $\omega_0 = 2\pi/t_0$ is the sampling frequency of the digital filter, then $\omega$ increases from $-\omega_0/2$ to $\omega_0/2$ as $\Omega$ increases continuously from $-\infty$ to $+\infty$.

The analog filter must be designed with a cutoff frequency that, when warped by equation (D.3), will yield the desired cutoff frequency for the digital filter.
Therefore, the "inverse warp" of equation (D.3) is
\[
\Omega(\omega) = \tan(\omega t_0 / 2) . \tag{D.4}
\]
For the digital filter in Chapter 6, the desired cutoff frequency was
\[
\omega_c = \frac{\pi}{16\, t_0} ,
\]
so the cutoff frequency of the analog filter must be
\[
\Omega_c = \Omega(\omega_c) = \tan(\pi/32) \approx 0.0985
\]
in equation (D.1).

The inverse of equation (D.2) is
\[
s = W^{-1}(z) = \frac{z - 1}{z + 1} , \tag{D.5}
\]
so that the transfer function of the desired digital filter is
\[
H(z) = H\!\left( \frac{z - 1}{z + 1} \right) = \frac{A + B z^{-1} + C z^{-2} + D z^{-3}}{1 + E z^{-1} + F z^{-2} + G z^{-3}} . \tag{D.6}
\]
Solving for the constants in equation (D.6) gives
\[
\begin{aligned}
A &= 0.000727 \\
B &= 0.00218 \\
C &= 0.00218 \\
D &= 0.000727 \\
E &= -2.327 \\
F &= 2.072 \\
G &= -0.686 .
\end{aligned}
\]
Figure D.1 shows the resulting signal flow graph, which represents a stable infinite-impulse-response (IIR) filter.

Figure D.1: Signal flow graph for a third-order digital low-pass Butterworth filter. The input signal enters from the left, and the connections labeled $z^{-1}$ represent unit-delay elements. The structure is a stable infinite-impulse-response (IIR) filter.
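As an illustration, the transfer function of equation (D.6) can be realized directly as a difference equation. The following Python sketch is not code from the thesis; it uses the constants listed above in a naive direct-form loop (in practice a library routine such as `scipy.signal.lfilter` would serve the same purpose).

```python
import numpy as np

# Coefficients of the digital filter H(z) in equation (D.6)
b = [0.000727, 0.00218, 0.00218, 0.000727]  # A, B, C, D (numerator)
a = [1.0, -2.327, 2.072, -0.686]            # 1, E, F, G (denominator)

def iir_filter(x, b, a):
    # Direct-form realization of the difference equation implied by H(z):
    # y[n] = sum_k b[k] x[n-k] - sum_{k>=1} a[k] y[n-k]
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n >= k)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n >= k)
        y[n] = acc
    return y

# Because the filter is stable, its response to a unit step settles
# toward the DC gain H(1) = sum(b) / sum(a).
out = iir_filter(np.ones(400), b, a)
```

Feeding a long unit step through the loop and comparing the settled output against $H(1)$ is a quick sanity check on the coefficient values.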
