The classification of acoustic emission signals via artificial neural network Yang, Jian 1992

The classification of acoustic emission signals via artificial neural network

by

Jian Yang

B.A.Sc., Aeronautic and Astronautic University, Shenyang, China, 1982

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
August, 1991
© Jian Yang, 1991

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver, Canada

Abstract

Automatic identification of wood species is a long desired goal in the pulp and paper industry. Motivated by this, researchers have investigated Pattern Recognition (PR) for the classification of wood chip species based on acoustic emission signals. In this thesis, a new Artificial Neural Network (ANN) approach is proposed to perform this task. One of the purposes of the thesis is to explore the connection between traditional methods of statistics and modern approaches of neural networks regarding pattern recognition. We attempt to understand how neural networks perform signal processing functions such as the Karhunen-Loeve Transform (KLT) and pattern recognition functions such as Principal Component Analysis (PCA).
The configuration of the classifier is a multilayer feedforward network in which the supervised and unsupervised learning algorithms are implemented. Based on the established orthogonal data space, feature extraction and data compression are accomplished through the training process. The reconstructed data, which are in orthogonal and compressed form with maximal variance, are used to improve the classification efficiency. Since the primary goal of the thesis is to identify wood species from their acoustic emission, simulations are used to characterize classification accuracy, classification efficiency, noise immunity, and generalization capability.

Table of Contents

Abstract  ii
List of Figures  vi
List of Tables  viii
List of Abbreviations  ix
Acknowledgments  x
1 Introduction  1
1.1 Neural network  1
1.1.1 Neuron models  1
1.1.2 Network models  3
1.2 Pattern recognition  5
1.2.1 Pattern model  5
1.2.2 Pattern recognition model  6
1.3 Neural network as a classifier  7
1.4 Capability of feedforward network  7
1.5 Application of neural networks in engineering  9
1.6 The scope of the thesis  9
2 Acoustic emission signals  11
2.1 Identification requirements in chip refining  11
2.2 Acquisition of acoustic emission signals  12
2.2.1 The process of data acquisition  12
2.2.2 The parameters of data acquisition  13
2.3 Signal analysis  14
2.3.1 General waveform analysis  14
2.3.2 Effect of operating variables  15
2.4 Similarity between categories and difference within category  16
3 The architectures and learning rules of neural networks  19
3.1 The architectures  19
3.2 Supervised learning  21
3.2.1 Backpropagation  21
3.2.2 The capability and limitation in backpropagation  25
3.3 Unsupervised learning  26
3.3.1 Competitive learning  26
3.3.2 Feature extraction as a good representation  27
3.3.3 Data compression as an optimality principle  28
4 Subspace and orthogonality  30
4.1 Subspace as representation for categories  30
4.2 Orthogonal bases  31
4.3 Karhunen-Loeve transform  33
4.4 Principal component analysis  35
5 The Generalized Hebbian Algorithm  37
5.1 Terminology  37
5.2 Basic Hebbian algorithm  37
5.3 Oja algorithm  40
5.4 Gram-Schmidt process  43
5.5 Combination of the basic algorithms  45
5.6 Significance  46
6 The performance of data reconstruction  49
6.1 The eigenvector decomposition  49
6.2 The Gabor filters  51
6.3 The masks  54
6.3.1 General signal preprocessing  54
6.3.2 The function of the neural network  55
6.4 The matched filters  57
6.5 The whitening filters  58
7 The classification performance  61
7.1 The noise resistance  61
7.1.1 Capability of the system  61
7.1.2 The resistance to process noise  63
7.1.3 The resistance to Gaussian noise  65
7.1.4 The recovery with incomplete signals  67
7.2 Generalization  68
7.3 The classification efficiency  70
7.3.1 The compression criteria  71
7.3.2 The structure reduction  73
7.3.3 The training time reduction  73
7.3.4 The iteration reduction  74
7.3.5 The learning curve  75
7.4 The classification accuracy  76
7.5 The single and mixed wood species  77
8 Conclusion  81
8.1 Summary of the thesis  81
8.2 Contributions  82
8.3 Future work  82
References  83

List of Figures

1.1 Artificial network with activation function  2
1.2 Network models  3
1.3 Neural network model  4
1.4 The classifier  5
1.5 Pattern recognition model  6
1.6 Separability problem  8
2.7 Chemical acoustic emission data  13
2.8 The power spectra of wood species for 5 categories  15
2.9 The comparison of power spectra between Background and Hemlock  16
2.10 Nonlinear problem in pattern recognition  17
3.11 The architecture of heterogeneous network  19
3.12 Backpropagation  22
3.13 Architecture of competitive learning  27
3.14 The pattern recognition chain  28
4.15 Geometric interpretation of the orthogonal decomposition theorem  32
6.16 Eigenvector matrix CC^T = I and covariance matrix of outputs, with white input CQC^T = Λ  50
6.17 Eigenvector matrix CC^T = I and covariance matrix of outputs, with original input  51
6.18 The whitened outputs for 5 categories of wood species  59
7.19 The classification performance in Gaussian noisy input  66
7.20 The classification performance with incomplete signals  67
7.21 The training scheme  71
7.22 The training time with different compression inputs  72
7.23 The learning curves  75
7.24 The mixture of wood species  78
7.25 The testing procedure for mixtures  79

List of Tables

2.1 The parameters of data acquisition of the acoustic emission signals  14
7.2 The classification performance with zero mean and non-zero mean inputs  65
7.3 The classification performance with different training sets  69
7.4 The classification performance with different inputs  70
7.5 The comparison of classification performance  76
7.6 The four responses of the system for various values of a and Q  79

List of Abbreviations

AE: Acoustic Emission
AI: Artificial Intelligence
ANN: Artificial Neural Network
BP: BackPropagation
GHA: Generalized Hebbian Algorithm
KLT: Karhunen-Loeve Transform
NN: Neural Network
PCA: Principal Component Analysis
PR: Pattern Recognition

Acknowledgments

I wish to thank my supervisor Prof. G. A. Dumont for his invaluable insight and assistance during my thesis work. I would also like to thank Prof. A. P. Wade and his group. I would like to thank my previous supervisor Prof. C. C. H. Ma.

Chapter 1 Introduction

Recently, there has been a renewed interest in the study of neural networks, with many reported applications from both science and engineering, in tasks such as statistical Pattern Recognition (PR) or classification.
As shown in theoretical and application studies, it seems natural to use a neural network for pattern recognition. The classification task is an important practical problem for industry. The industrial environment requires that the PR system be capable of obtaining the desired classification performance in a robust manner. The multilayer feedforward network is chosen as the basic configuration, and both supervised and unsupervised learning are employed to increase the network efficiency. With an additional network to extract features of the signals, backpropagation is used as the primary training algorithm to solve the problem of nonlinear pattern classification. Another goal of this thesis, independent of the classification performance of the proposed algorithm, is to find the relationship between traditional methods of statistics and the modern approaches of neural networks.

1.1 Neural network

Artificial Neural Networks (ANN) are an approximate mathematical model of the structure of the human brain. ANNs attempt to imitate the thinking process of the human brain. Regardless of the category of neural network, the system is basically composed of two fundamental items: a neuron, which is constructed after the biological neuron, and a network, in which many neurons are connected together to form layers. In their various configurations, neural networks provide powerful and numerous functions in problem solving.

1.1.1 Neuron models

The model of neurons used in artificial neural networks is essentially a mathematical model of the biological neurons of the human brain [22]. The artificial neuron is designed to imitate the first-order characteristics of the biological neuron. Despite the diversity of network paradigms, nearly all artificial neurons are based upon the same configuration.

[Figure 1.1: Artificial neuron with activation function: inputs x1...xn are multiplied by weights w1...wn, summed into NET, and passed through a threshold to produce OUT]
Figure 1.1 illustrates the fundamental structure of the most common form of artificial neuron. A neuron model is roughly composed of three basic components: multiplication, summation and threshold, as indicated in Figure 1.1 [59]. A set of inputs x1, x2, ..., xn is applied, either from the outside environment or from the previous layer. Each input xi is multiplied by a corresponding weight wi, i = 1, 2, ..., n. All of the weighted inputs are then summed to determine the activation level of the neuron. This summation of products is termed NET and is calculated for each neuron in the network. After NET is calculated, it is passed through a nonlinear function F to produce the signal OUT [59]. The magnitude of each element of the input vector is represented by either an analog or binary value. Each weight corresponds to the "strength" of a single biological synaptic connection. The summation block roughly corresponds to the biological cell body. To summarize, the function of the single neuron is expressed as

NET = x_1 w_1 + x_2 w_2 + ... + x_n w_n = \sum_{i=1}^{n} x_i w_i,    OUT = F(NET).    (1.1)

Currently, all neural networks are explicitly or implicitly based on Hebb's theory [17]. The weights on the inputs can be either positive or negative, representing excitatory or inhibitory synaptic connections. Hebb states that a synaptic connection should be made more excitatory if the presynaptic signal is positively correlated with the postsynaptic signal. Many learning algorithms are based on Hebb's theory. The model of the simple artificial single neuron ignores many of the characteristics of its biological counterpart. For example, it does not take into account delays that affect the dynamics of the system. Despite these limitations, the single neuron model exhibits attributes that are strongly reminiscent of the biological system, and captures enough essential features of the biological system.

[Figure 1.2: Network models: a layer of neurons mapping neural inputs to neural outputs]
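The weighted-sum-and-threshold computation of equation (1.1) can be sketched in a few lines. This is a hypothetical illustration, not code from the thesis; the sigmoid is chosen here as one common choice for the nonlinearity F, and the input and weight values are arbitrary:

```python
import numpy as np

def neuron(x, w, f=lambda net: 1.0 / (1.0 + np.exp(-net))):
    """Single artificial neuron: NET = sum_i x_i * w_i, OUT = F(NET)."""
    net = np.dot(x, w)   # multiplication and summation blocks
    return f(net)        # nonlinear activation (threshold block)

x = np.array([1.0, -0.5, 0.25])   # inputs x1..xn
w = np.array([0.4, 0.3, -0.2])    # synaptic weights w1..wn
out = neuron(x, w)                # sigmoid output, 0 < out < 1
```

Here NET = 0.4 - 0.15 - 0.05 = 0.2, and OUT is the sigmoid of that sum; replacing `f` with a hard threshold recovers the original binary neuron.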
The single neuron model can perform simple pattern detection, and has been broadly used in artificial neural networks.

1.1.2 Network models

Although a single neuron performs certain simple pattern detection functions, the power of neural computation comes from connecting neurons into networks. A simple network is a group of neurons arranged in a layer, as shown in Figure 1.2. A neural network has more complex capabilities than a single neuron. Since the time that McCulloch, Pitts and Hebb laid the foundations of neural network research, there has been much growth in this field. A large number of neural network models have been investigated, so it is necessary to classify them. The network models can be categorized according to their characteristics, the network topology and the learning rules [34]. Figure 1.3 shows how the most popular neural network models are categorized according to this classification scheme.

[Figure 1.3: Neural network models categorized by topology (feedforward vs. recurrent) and learning rule (supervised vs. unsupervised). Recurrent, supervised: Hopfield network, Boltzmann machine, Hamming network, bidirectional associative memory. Recurrent, unsupervised: adaptive resonance theory, immune system model. Feedforward, supervised: perceptron, backpropagation. Feedforward, unsupervised: Kohonen feature map, learning vector quantization, competitive learning.]

Individual neurons in a network can be characterized by these attributes: the natural sensory inputs can be either binary or analogue; the weights can be positive or negative, representing excitatory and inhibitory synaptic connections; and the outputs of the nonlinear functions can be taken as analogue or as binary to represent the categories to which the input patterns belong. Topologically, networks can be divided into two broad categories, called feedforward and recurrent structures. A recurrent network allows any type of interconnection, including feedforward and feedback paths.
The feedback paths are connections from the output of any neuron to the input of any neuron in previous layers. A feedforward network contains no feedback paths. The output of one particular neuron cannot contribute to the input of the neurons in the same layer. Signals between layers are strictly fed forward from one layer to the next.

The training rules can be divided into two categories, called supervised and unsupervised learning. In supervised learning, the sensory input pattern and the desired output pattern form a training pair. When the network is presented with an example as an input pattern, the desired output is available to the learning algorithm to compute the weight adjustments. Both the input pattern and the desired output pattern compose a training pair and are presented to the network by a "teacher". In unsupervised learning, no such "teacher" exists. The network develops its own response to input patterns according to its competitive results during the training process. Generally, supervised learning is applied to classify patterns into explicit categories. Unsupervised learning is implemented to classify input patterns into a number of categories based on the similarity of the input patterns. It is difficult to discuss the network functions independently of their applications. The network performance should be analyzed by combining its algorithm and task. Feedforward networks receive the greatest attention in this thesis since they are more suitable for static pattern recognition.

[Figure 1.4: The classifier: a pattern x1, x2, x3, ... (data to be classified) is mapped to a response i = 0, 1, 2, ..., K (the classification)]

1.2 Pattern recognition

Pattern recognition is an active field in engineering and is of great practical importance. The problem is described by a set of basic models. A pattern usually consists of a set of numbers representing sensory measurements, and classification is carried out to create a set of codes representing categories.
Different approaches in this field have been explored for many years. The neural network method is proposed because it may have some advantages over traditional algorithms.

1.2.1 Pattern model

Many real world tasks involve the ability to classify or sort data, such as speech recognition, character recognition and species classification. The mechanization of these tasks requires a device which accepts input data and responds with an output indicating the classification of this data [5]. We assume that each set of data to be classified is a set of n real numbers x1, x2, ..., xn, described as an input vector x = [x1, ..., xn]. Such a set is called a pattern, and each number is called a component of the pattern. Any device for sorting patterns into categories is called a pattern classifier. Any pattern can be represented by a point in an n-dimensional Euclidean space R^n called the pattern space. A pattern classifier is thus a device which maps the points of R^n into the category numbers 1, ..., K; see Figure 1.4. Let the symbol y(i) denote the set of points in R^n which are mapped into the number i. Then for each category number, we have a set of points in R^n denoted by one of the symbols y(1), y(2), ..., y(K). The classifier is a system which maps the input vector x of R^n into the categories y(1), y(2), ..., y(K).

1.2.2 Pattern recognition model

A large number of models have been suggested for pattern recognition. The commonly used models for pattern recognition are listed in Figure 1.5.

[Figure 1.5: Pattern recognition models: parametric models (e.g. the Bayesian classifier, based on P(x|i)) and nonparametric models defined by discriminant surfaces (e.g. the nearest neighbour and linear classifiers, based on P(i|x))]

In statistics, the Bayesian classifier is used for parametric identification. Practical pattern recognition almost always makes a crucial assumption about the statistical structure of the world. We describe an input pattern as a vector x = [x1, ..., xn].
The Bayesian classifier classifies by the probability

Pr(Ci|x) = Pr(Ci) Pr(x|Ci) / Pr(x)    (1.2)

where Ci is the category. Thus the Bayesian classifier is a parametric method based on the a priori distribution of classes Pr(Ci), the probability distributions Pr(x|Ci) and the a posteriori distributions of classes Pr(Ci|x) [5]. In a Bayesian classifier, the distribution of the possible inputs is assumed to be known, that is, the probability that a particular point belongs to any given classification.

The nearest neighbour method is used for nonparametric identification. Suppose we have a set of points in the pattern space that have been given classifications, and a new pattern is to be classified. The distance from the new pattern to all the old ones is computed. The new pattern is given the classification of the previously classified point that is closest to it, that is, its nearest neighbour. Nearby points are more likely to be given the same classification than distant ones.

In classification, we wish to obtain a tessellation of the pattern space using discriminant functions di(x). If pattern x satisfies di(x) > dj(x) for all j ≠ i, then x is in class Ci. Discriminant functions can be obtained from a parametric method such as the Bayesian classifier or a nonparametric method such as the nearest neighbour method.

1.3 Neural network as a classifier

Based on the introductions to neural networks and pattern recognition, the connection between the two classes of systems is not difficult to find. Many theories and applications show that pattern classification is a natural way to use a neural network [44][26]. We wish to know how classification works in a neural network. We have an input x = [x1, ..., xn], which is the data from some sensory measurement. We also have an output category vector y = [y(1), ..., y(K)], which consists of bipolar numbers representing class memberships.
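The winner-take-all decision used in such a network, in which the input is assigned to the category of the most active output neuron, can be sketched as follows. This is a hypothetical illustration: the weight matrix W and input below are arbitrary values, not weights trained on the thesis data:

```python
import numpy as np

def classify(x, W):
    """Winner-take-all: the category is the output neuron with the highest activity."""
    out = np.tanh(W @ x)        # bipolar outputs in (-1, 1)
    return int(np.argmax(out))  # index of the winning output neuron

W = np.array([[ 1.0, -1.0,  0.5],   # one row of synaptic weights per category
              [-0.5,  1.0,  1.0],
              [ 0.2,  0.3, -1.0]])
x = np.array([0.9, -0.8, 0.1])
print(classify(x, W))               # prints 0: the first output neuron wins
```

Training (discussed in the following sections) amounts to adjusting W so that this argmax decision agrees with the desired categories.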
The weight matrix, which is composed of the synaptic connections between the inputs and outputs, is defined as W. Each class of patterns will be represented by the activity of a unique output neuron y. If the input pattern x causes the output in an output neuron y to be higher than in other output neurons, it is classified as belonging to the category associated with that neuron. The virtue of neural networks in pattern recognition is that by presenting a limited set of examples of pairs of input patterns and their associated categories to the network, the weights in the network can be changed by an algorithm so that the network will correctly classify new input patterns. Problem solving is achieved by a training process. Self-training networks provide the possibility of solving a problem that we do not completely understand. If we cannot specify how an algorithm should perform on particular examples, and if examples of the functions of the algorithm are available, then it is very valuable to have a technique for generating an approximation of the algorithm with a training procedure. Additionally, with the easy availability of powerful computers, we can partially analyze a complex system with simulations before a full understanding can be had with mathematics.

1.4 Capability of feedforward network

From the literature, we know that the feedforward network performs well in pattern classification. This section discusses the classification capability of such a network. Feedforward networks can be categorized as layered or non-layered. In a layered feedforward network, connections are allowed only between immediately adjacent layers. Non-layered networks allow arbitrary forward connections. Most of the research done to date has been on layered networks only.

[Figure 1.6: Separability problem: a linearly separable arrangement of the classes -1 and 1 in the plane versus a linearly inseparable one]

The perceptron is a single-layered feedforward network for pattern recognition.
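The separability issue illustrated in Figure 1.6 can be made concrete with a small experiment. The sketch below is a hypothetical illustration, not code from the thesis: a single threshold neuron trained with the delta rule learns the linearly separable AND function over bipolar inputs, whereas no such neuron can represent the exclusive-or function:

```python
import numpy as np

def train_perceptron(patterns, targets, eta=0.1, epochs=100):
    """Delta-rule training of a single threshold neuron with a bias weight."""
    w = np.zeros(patterns.shape[1] + 1)              # weights plus bias
    x = np.hstack([patterns, np.ones((len(patterns), 1))])
    for _ in range(epochs):
        for xi, t in zip(x, targets):
            y = 1.0 if np.dot(w, xi) > 0 else -1.0   # threshold output
            w += eta * (t - y) * xi                  # delta rule: eta * delta * x
    return w

def predict(w, p):
    return 1.0 if np.dot(w, np.append(p, 1.0)) > 0 else -1.0

pats = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
and_t = np.array([-1, -1, -1, 1], dtype=float)       # linearly separable
w = train_perceptron(pats, and_t)
print([predict(w, p) for p in pats])                 # prints [-1.0, -1.0, -1.0, 1.0]
```

Replacing `and_t` with the XOR targets [-1, 1, 1, -1] leaves the weights oscillating without ever classifying all four patterns correctly, which is the failure mode discussed next.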
The input can be analogue or digital, and the output is a binary number representing the category of patterns. The weights are adjusted as follows:

Δw_i = η δ x_i,    w_i(t+1) = w_i(t) + Δw_i    (1.3)

where η is the learning rate, and δ = (t − y) is the difference between the target output t and the actual output y. The learning rule utilized in a perceptron is called the delta rule because of the variable δ. The limitation of the perceptron is that it can only solve linearly separable problems; see Figure 1.6. If categories are linearly inseparable, the system will not converge. We have seen that it is impossible to draw a straight line subdividing the x-y plane so that the two classes are separated, so the exclusive-or function cannot be represented by a perceptron. Unfortunately, this is not an isolated example; there exists a large class of functions that cannot be represented by a single-layer network. Because a perceptron is limited to classifying linearly separable problems, it is extended to the multilayer feedforward network to solve nonlinearly separable problems. For instance, a hidden layer is added between the input and output layers. The participation of the hidden layer allows the network to overcome the limitation of the perceptron. There are many algorithms for supervised learning in feedforward networks, and backpropagation has been one of the more popular and successful algorithms since its publication. Another advantage of multilayer networks comes from Kolmogorov's theorem, which states that any continuous function f : [0, 1]^m → R^n, y = f(x), can be approximated to any desired degree of accuracy by a three-layer standard feedforward network with sufficiently many hidden units [39].

1.5 Application of neural networks in engineering

The majority of applications of neural networks are in statistical pattern recognition.
Unlike the psychologists and physicists who attempt to understand the ideas of neural network theory, engineers and computer scientists are mostly interested in small systems that often function as pattern classifiers. Some successful applications of neural networks have been reported in pattern recognition, in both computer simulation and hardware implementation. The mathematical model of an artificial neural network is implemented on powerful computers to imitate the style of the thinking process of the human brain. Many theories are further investigated and demonstrated by extensive computer simulations. Additionally, advanced parallel computers, such as the Thinking Machine or a network of transputers, have an advantage over sequential computers due to their parallel processing. Neural networks offer potentially powerful collective computation techniques for designing special-purpose hardware for fast implementation of pattern recognition. In hardware implementation, neural networks are relatively robust and fault tolerant with respect to malfunction or loss of components. It is very important to recognize that current methods do not perform well on real world data and large systems. Neural network implementations are limited by finite data sets, problem complexity and system size. Due to the limited capability of current computers and specialized VLSI hardware, neural networks have so far been able to solve only relatively simple tasks, and their application is still limited to simple real world problems [49].

1.6 The scope of the thesis

This thesis is divided into five parts, in which we:
1. Introduce the classification task: the acoustic emission signals;
2. Establish the architecture of the neural network for the identification task;
3. Analyze the theoretical aspects of the learning algorithms and functions;
4. Analyze the network performance;
5. Analyze the classification performance.
In Chapter 2, the characteristics of the acoustic emission signals of wood species will be analyzed. The classification performance depends on both the network structure and the signal properties, so the analysis of the signals themselves must be included. In Chapters 3, 4 and 5, a system will be developed for the given task. The network architecture, learning algorithm and function will be introduced and analyzed for the specific classification problems. Supervised and unsupervised learning algorithms will be introduced separately. A backpropagation network is used as the primary classifier in this task. Since the backpropagation algorithm is popular, we assume that the reader is familiar with it and will not introduce it in detail. We will focus on unsupervised learning for feature extraction. The generalized Hebbian algorithm will be analyzed to see how the network functions are used as general approaches to signal processing in artificial intelligence. The data subspace is established to perform feature extraction and form an orthogonal data space according to optimality principles of feature extraction and data compression. We then focus on finding the connection between signal processing and neural networks in pattern recognition. In Chapters 6 and 7, the network analysis will be accomplished in two steps. The first step is the analysis of how the linear network functions are used as traditional approaches to pattern recognition, such as eigenvector decomposition, filters, signal preprocessing and principal component analysis. The second step focuses on the classification performance on the acoustic emission signals, including noise resistance, generalization, classification accuracy and classification efficiency.

Chapter 2 Acoustic emission signals

The classification strategy of the network cannot be readily understood by simply inspecting weight displays.
Different signal patterns always interact with the weight patterns in different ways, so the signals themselves must also be included in the analysis. The task of this thesis is the classification of the acoustic emission signals of wood species. Although the signals in the frequency domain display broadband and active information, there are still some problems in the classification. Therefore, the analysis of the signals themselves receives the first attention of this research. It is important to analyze the input vectors and characterize the signals that produce the highest activation data in pattern recognition.

2.1 Identification requirements in chip refining

An increasingly important process for producing wood pulp is the thermomechanical pulping process. At the heart of the thermomechanical wood pulping process is the chip refiner, which breaks down wood chips into fibres and develops the papermaking properties of the fibres by beating them between two grooved stainless steel discs that rotate relative to each other. The variability of the raw material characteristics, such as wood species, chip size and density, greatly influences the process. At the present time, samples of pulp from a refiner are monitored off-line in a test laboratory in order to determine pulp quality and consistency. Results from these tests are generally available on an hourly basis. However, on-line freeness tests have been unsatisfactory and have led to the adoption of on-line drainage sensors. Therefore, an improved on-line method for predicting the end-pulp quality during the refining process is needed to address the shortcomings of the present off-line empirical laboratory tests and current on-line control systems [11]. Because wood is a natural material, both its chemical composition and physical properties are highly variable. The wood mix significantly influences the behavior of the refining process and the pulp quality.
Because in a mill the feed rarely consists of a pure species, the more interesting case is when the feed contains an unknown mixture of a number of known species. Therefore, there is not only a requirement to identify a single wood species, but also a requirement to know whether the wood being processed is a single wood species or a mixture [10]. The simulations will be based on these two requirements. Thus, pulp quality monitoring systems have long been desired, and identification of wood species is fundamentally required to offer useful operating parameters to the monitoring system [9]. The classification is based on the power spectra of acoustic emission signals, which display active and broadband information. In the first stage, we measure the energy and frequencies of the acoustic emission from the wood chips and fibres; then, utilizing the acoustic emission signals from the wood chips and fibres, we identify the wood species being processed; finally, utilizing the species content identity profile, we monitor the refining action on line and predict pulp quality [11].

2.2 Acquisition of acoustic emission signals

Acoustic emissions are present in many natural and composite materials under stress, and are thus naturally utilized for testing wood species. It has been found that the acoustic emission from wood chips and fibres is distinct from the vibration of the refiner plates or discs. By measuring the emissions from the wood chips and fibres, we hope to directly indicate the wood material being processed.

2.2.1 The process of data acquisition

Acoustic Emission (AE) is the release of ultrasonic energy by a system as it attempts to reattain a state of equilibrium. Energy is measured both in terms of energy units (which may be arbitrary units) and in terms of frequencies.
The refining process has been found to generate broadband acoustic emission signals whose characteristics are affected by the wood species being pulped [10]. The acquisition process is very simple and requires little apparatus. The only necessary preparation is to attach a transducer to the samples. The components of the AE monitoring setup are shown in Figure 2.7 [11].

[Figure 2.7: Chemical acoustic emission data: the source is picked up by a transducer, passed through filtering and amplification, digitized by an oscilloscope for data acquisition, and sent to a computer for signal processing]

To pick up acoustic emission from the inside of the refiner, a broadband piezoelectric acoustic transducer is needed. With a built-in 40 dB pre-amplifier and a cut-off frequency of about 1 MHz, the transducer is attached to an outside bolt in contact with the stationary refiner plate using a steel bracket and an extension piece. To ensure a good acoustic coupling, a smear of high vacuum grease is applied to all interfaces. The transducer is connected to a combining amplifier which provides selectable gain and bandpass filters. All signals are filtered first through a 50 kHz high pass filter and then through a high quality 50 kHz-2 MHz bandpass filter. The signals are then sent to a digital oscilloscope for digitization and capture. Every two seconds, a 1024-point, 8-bit resolution record is captured at a sampling rate of 2.5 MHz over 0.4096 ms. These data are then transferred to a PC via an IEEE-488 instrumentation interface [11].

2.2.2 The parameters of data acquisition

The sampling rate for this process is 2.5 MHz. From experiments, we know that the original digitized signals are not well suited to pattern recognition, and that the signals have more active information in the frequency domain. Thus, each averaged signal is calculated directly from the time records of the signals.
Through the Fast Fourier Transform (FFT), the signals in the frequency domain are obtained as 512 frequency components, covering the frequency range from 0 to 1.25 MHz. The magnitudes of the frequency components are utilized as the components of the input vector. The transformed data provide better quantitative information in the power spectra. Additionally, data are obtained under different operating conditions. Major operating variables are background noise, feed rate, and plate gap. Table 2.1 shows the data acquisition and operating variables [10].

    The sampling rate           2.5 MHz
    The sampling points         1024 points
    The sampling time           0.4096 ms
    The band width              1.25 MHz
    The frequency components    512
    The machine feed rate       4, 5, 7
    The plate gaps (mm)         0.10, 0.20, 0.25, 0.30, 0.35
    Dilution water rate (r)     0.4 to 1

Table 2.1 The parameters of data acquisition of the acoustic emission signals

2.3 Signal analysis

The ability to identify the sources of signals is important. In order to distinguish among different generating signals, it is necessary to look at the characteristics of the signals themselves. There are several common approaches to analyzing ultrasonic signals. Generally, it is convenient to utilize frequency analysis. The general and special frequency analyses are given in this section.

2.3.1 General waveform analysis

Further signal processing is then performed on the computer. The power spectra are computed via the Fast Fourier Transform: the 1024 sampling points of each raw signal generate 512 frequency components, covering the frequency range from 0 to 1.25 MHz. The average power spectrum is obtained for each data set by averaging the power spectra over 100 individual signals [11]. The averaged power spectra of wood species for 5 categories are shown in Figure 2.8. The acoustic emission signal can be separated from the disc vibration of the equipment, since the disc vibration lies mostly below 25 kHz.
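The FFT step described above (a 1024-sample record reduced to 512 one-sided power components) can be sketched in a short program. This is an illustrative sketch, not the thesis's actual processing code, and the function names are mine:

```python
import cmath
import math

def fft(x):
    # Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def power_spectrum(record):
    # 1024 time-domain samples -> 512 one-sided power components,
    # covering 0 to half the sampling rate (here 0 to 1.25 MHz).
    n = len(record)
    spec = fft([complex(s) for s in record])
    return [abs(c) ** 2 / n for c in spec[: n // 2]]
```

Averaging such spectra over 100 records, as described in Section 2.3.1, then gives the averaged power spectrum for a data set.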
The signals from wood fibres appear mostly above 50 kHz; thus the acoustic emission from wood chips and fibres can be distinguished from the vibration of the refiner discs. By measuring the emissions from the wood chips and fibres, we may be able to indicate the wood material being processed in the chip refiner [10]. Different wood species produce different characteristic frequency intensity spectra in different frequency ranges.

One of the most commonly used approaches in statistical pattern recognition is Principal Component Analysis (PCA). In this technique, we do not use all of the information; we only need certain principal features of the signals for the classification task.

[Figure 2.8: The power spectra of wood species for 5 categories; panels include Hemlock, Jack Pine, Black Spruce and Lodgepole Pine; x-axis: Frequency, MHz]

Specifically, analyzing the signals in the frequency domain, we found that the frequency distribution of the signals is characterized by certain broadband and multichannel features. The active frequency content only extends up to 800 kHz. The difference used to infer wood species is indicated by a group of peak descriptors (centre, bandwidth, magnitude, and number) between 0 and 1.25 MHz. This group of parameters can provide sufficient information to distinguish the sources of the signals [57].

2.3.2 Effect of operating variables

The interpretation of the power spectra is complicated by the fact that the acoustic signals generated in the system are distorted by several factors. The appearance of the power spectra is distorted by the influence of operating conditions.
The power spectra of acoustic emission provide sufficient information for our classification, but some dependent frequency components also bring difficulties for pattern recognition. The signal amplitude varies with the operating parameters, such as feed rate, plate gap, dilution water flow rate, and processing temperature and pressure. Feed rate and plate gap are varied in the data acquisition process, and these two parameters have a major influence on the amplitude of the acoustic emission. One of the factors is the effect of the plate gap, which is used to adjust the operation of the chip refiner. Changes in the plate gap do not alter the band structure of the power spectrum but only the peak intensities. Analysis in [8] shows that the feed rate has an effect on acoustic energy when the plate gap is below a certain value, and some high acoustic energy is emitted due to the operating requirement. Because of these two factors, the signals appear amplified with some minor distortion, specifically in the high frequency range.

A special phenomenon which attracted our attention is that a major peak is due to background noise. The background refers to the operation of the refiner with dilution water flowing through it and the higher motor load required for refining of wood chips. The averaged power spectrum for the background is shown in Figure 2.9. There is a frequency band of width 50 kHz with a peak centered at 120 kHz [8].

[Figure 2.9: The comparison of power spectra between Background and Hemlock]

Different wood species correspond to different spectral shapes at high frequencies. Comparing the power spectrum of Hemlock to that of the background (see Figure 2.9), one may speculate that hemlock is just acting as an amplifier for the background and that it emits very little acoustic energy of its own.
The four other species have more energy emitted in different frequency bands, as shown in Figure 2.8. The high frequency shapes provide crucial information to distinguish the sources of acoustic emission.

2.4 Similarity between categories and difference within category

Wood is a natural material; both its chemical composition and physical properties are highly variable. Variations exist not only between different trees but also within the same tree, such as rot content, bark content and wood age. Additionally, the data acquisition system brings certain influences to the power spectra, particularly the transducer, its frequency response, and the transducer location. This indicates that the nature of the acoustic signals is associated with fundamental processes occurring in some cases, and that the signals in the frequency domain are affected by the operating variables.

Because of this natural variability, the categories are not well separated: there are differences within the same category, and similarities are exhibited among different categories. This brings certain difficulties to the pattern recognition problem. By experimental analysis, we know that misclassification usually occurs in two cases: either the categories are linearly inseparable, or they lie close to the boundary, as shown in Figure 2.10. There are subregions within a single category and overlapping between different categories.

[Figure 2.10: Nonlinear problems in pattern recognition: misclassification near the optimal class boundary, the linearly inseparable problem, and overlapping between neighbouring classes (classes c1 to c5)]

In data space, the n-dimensional sample space is defined as R^n. Let V and W be arbitrary sets and let ∅ denote the empty set. V ∩ W is defined as the intersection of the two sets. If x is an element of V, we write x ∈ V. If y = f(x) for x ∈ V, we write y ∈ f(V). If V is a subset of W, we write V ⊂ W. If x ∈ R^n, then y = f(x) lies in the corresponding output space.
Thus the overlapping problem is defined as V ∩ W ≠ ∅. Linear inseparability comes from the fact that if x ∈ V and x ∈ W, then the output y belongs to both classes. If the given patterns appear variable or distorted, the expected performance of the network is that the system is able to separate the different classes as well as group together all distorted members of the same class. The system design in pattern recognition is motivated by eliminating the two problems of similarity between categories and difference within a category.

Acoustic emission signals provide active information in the frequency domain to distinguish the sources of the signals. The general and special signal analyses have been discussed to give us better insight into the input information.

Chapter 3 The architectures and learning rules of neural networks

The feedforward network has been successfully utilized in applications of pattern recognition. Although the power spectra of acoustic emission provide sufficient information, the distortion of the signals brings certain difficulty to identification. The network performance is affected by both its inherent discriminant capability and certain properties of the signals. The problem cannot be solved simply by choosing a proper structure. Therefore, a heterogeneous network is designed with two individual subnetworks to overcome specific problems in the signals. Supervised and unsupervised learning algorithms are implemented separately in the two subnetworks to perform different functions in categorization.

3.1 The architectures

A multilayer feedforward neural network is utilized as a classifier in this thesis. The heterogeneous network is composed of two subnetworks, a compression subnetwork at the bottom and a classification subnetwork at the top, as shown in Figure 3.11. Each of them has its own individual structure, learning rule and function.
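The overlap condition V ∩ W ≠ ∅ can be illustrated with a toy check on discretized feature sets. This is a hypothetical sketch of my own; the thesis does not formalize the condition as code:

```python
def overlaps(V, W):
    # Two classes overlap when their intersection is non-empty (V ∩ W != ∅);
    # patterns in the intersection cannot be assigned unambiguously.
    return len(set(V) & set(W)) > 0

# Toy example: classes represented by sets of quantized feature values.
class_v = {(1, 2), (2, 2), (3, 1)}
class_w = {(3, 1), (4, 0)}
ambiguous = set(class_v) & set(class_w)  # patterns claimed by both classes
```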
[Figure 3.11: The architecture of the heterogeneous network: a compression subnetwork (feature extraction and data reduction by the Generalized Hebbian Algorithm, GHA) at the bottom, whose output feeds the classification subnetwork (backpropagation, BP) at the top]

Topologically, both subnetworks have a feedforward configuration but different structures. The bottom one, which is used as a compression network, consists only of input and output layers. The top one, which is used as a classification network, is a multilayer feedforward network with a hidden layer added in between. Supervised and unsupervised learning are individually employed in the two networks. Unsupervised learning is utilized in the bottom network for feature extraction and data compression. The top network is a typical classifier in current use, called backpropagation, in which supervised learning is used to make the final discriminant decision.

The different functions are accomplished by the two subnetworks individually. The compression network does feature extraction and data reduction. The linearly reconstructed data include the principal components of the original information in optimal form and compressed size. The transferred data are used as the input for backpropagation. Backpropagation is implemented as a nonlinear classifier to categorize the patterns into classes.

Parameters of the architecture are set as follows. There are 150 input neurons, representing the magnitudes of the power spectra. These 150 components are used in the form of an input vector for the whole system. The number of BP outputs is 5, representing the 5 categories to which input patterns belong. The hidden neurons of BP are fixed at 5 for all the trials to compare the classification performance between different tests.
The output of the compression network is used as the input of backpropagation. The number of neurons in this layer is flexible. It plays a crucial role in the network and depends on the compression performance.

The training is processed bottom-up [31]. First of all, the compression network is trained using all 150 coefficients of a power spectrum as the components of the input vector. The compression network is then used as a compressor, and both the training set and the testing set are passed through the compressor to obtain the transformed version of the data. Finally, the classifier is trained and tested with the compressed data sets. The final classification performance is tested in the top backpropagation network. Since the training is processed bottom-up sequentially, the training performance at the top layer depends upon the results of the bottom layer, as shown in Figure 3.11. The optimal training for the succeeding layer depends on the statistical properties of the outputs of the previous layer and the weights between the two layers. Using this scheme, we successively train all the layers of the multilayered network.

3.2 Supervised learning

Supervised learning is used as the training algorithm in backpropagation. Supervised training requires a training pair, i.e., an input vector with a desired output vector. In the training process, the network is presented with a training pair and the weights between units are modified in such a way that the network will produce the desired output in response to the same input. An input vector is applied, the output of the network is calculated and compared to the corresponding target vector, and the error is fed back through the network while the weights are changed according to an algorithm that tends to minimize the error. Usually a network is trained over a number of such training pairs, and the vectors of the training set are applied sequentially.
Errors are calculated and weights adjusted for each vector, until the error for the entire training set is at an acceptably low level. Success in training is measured by the mean squared difference between desired and actual outputs. This constitutes the optimality principle for many supervised algorithms [59].

3.2.1 Backpropagation

One of the earliest supervised algorithms for a single-layer network is the Widrow-Hoff algorithm, introduced by Widrow and Hoff in 1960 [63]. A successor to the Widrow-Hoff algorithm is the perceptron, in which the outputs take the value 1 or −1 [48]. This algorithm is used to train a single-layer network in which the output units are processed by a threshold nonlinearity which forces them to take one of two values. The perceptron is limited to linearly separable problems. The application of single-layer networks is limited by their inability to handle nonlinearity. Thus the multilayered network is proposed to overcome this drawback. For training multilayer feedforward networks, the most commonly used algorithm is backpropagation, discovered independently in different forms by Werbos, Le Cun, Parker and Rumelhart [62] [30] [45] [51]. The architecture of backpropagation is shown in Figure 3.12.

[Figure 3.12: Backpropagation: input pattern vector x_k, output vector y_k, desired responses d_k, and the Euclidean error fed back to adjust the weights]

The learning rule implemented in BP is called the generalized delta rule, based on the standard delta rule. The delta rule is so called because the amount of learning is proportional to the difference δ between the actual and desired outputs. The backpropagation algorithm performs gradient descent in the space of the weights between neurons, along gradients of an error function defined by the mean squared error in the outputs.
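The Widrow-Hoff (LMS) rule mentioned above adjusts a single linear unit in proportion to the output error. A minimal sketch, with function and variable names of my own:

```python
def delta_rule_step(w, x, t, eta=0.1):
    # Widrow-Hoff / LMS update for one linear unit:
    # y = w . x ;  w <- w + eta * (t - y) * x
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]

# Learn the linear map t = 2 * x from repeated presentations.
w = [0.0]
for _ in range(100):
    w = delta_rule_step(w, [1.0], 2.0)
```

Each presentation shrinks the remaining error by a constant factor, so the weight converges geometrically toward the target map.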
The Lyapunov, or energy, function is defined as the summed squared error between the desired output t_{pj} and the actual output o_{pj}:

E_p = \frac{1}{2} \sum_{j=1}^{n} (t_{pj} - o_{pj})^2    (3.4)

and the overall measure of the energy is defined as

E = \sum_{p} E_p    (3.5)

To get the correct generalization of the delta rule, we set the increment of the weight matrix as follows:

\Delta_p w_{ij} \propto -\frac{\partial E_p}{\partial w_{ij}}    (3.6)

where w_{ij} refers to the connection from the i-th neuron to the j-th neuron. The derivative of the energy function is

\frac{\partial E_p}{\partial w_{ij}} = \frac{\partial E_p}{\partial net_{pj}} \frac{\partial net_{pj}}{\partial w_{ij}}    (3.7)

This derivative is the product of two terms: the first term in equation 3.7 reflects the change in error as a function of the change in the net input to the unit, and the second term in equation 3.7 represents the effect of changing a particular weight on the net input. The net total input is

net_{pj} = \sum_{i} w_{ij} o_{pi}    (3.8)

where o_{pi} = i_{pi} if unit i is an input unit. Thus the second term in equation 3.7 is given by

\frac{\partial net_{pj}}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum_{k} w_{kj} o_{pk} = o_{pi}    (3.9)

It is more complicated to get the first term. Define

\delta_{pj} = -\frac{\partial E_p}{\partial net_{pj}} = -\frac{\partial E_p}{\partial o_{pj}} \frac{\partial o_{pj}}{\partial net_{pj}}    (3.10)

From the definition of the energy function 3.4, we have

\frac{\partial E_p}{\partial o_{pj}} = \frac{\partial}{\partial o_{pj}} \left( \frac{1}{2} \sum_{j} (t_{pj} - o_{pj})^2 \right) = -(t_{pj} - o_{pj})    (3.11)

A nonlinear activation function is one in which

o_{pj} = f_j(net_{pj})    (3.12)

The derivative of this equation is

\frac{\partial o_{pj}}{\partial net_{pj}} = f'_j(net_{pj})    (3.13)

Then we have

\delta_{pj} = (t_{pj} - o_{pj}) f'_j(net_{pj})    (3.14)

The increment of the weight matrix is

\Delta_p w_{ij} = \eta \, \delta_{pj} \, o_{pi}    (3.15)

In the output layer, the nonlinear function is a squashing function

o_{pj} = \frac{1}{1 + e^{-net_{pj}}}    (3.16)

Then the derivative in the output layer is

\frac{\partial o_{pj}}{\partial net_{pj}} = o_{pj}(1 - o_{pj})    (3.17)

In the hidden layer, there is no such target output to follow. Thus we compute the error for each neuron based on its proportion of the output layer error.
It is computed as

\delta_{pj} = f'_j(net_{pj}) \sum_{k} \delta_{pk} w_{jk}    (3.18)

A typical backpropagation network has three layers, that is, an input layer, a hidden layer and an output layer. The training pair is composed of an input pattern and a desired output. When each training pattern is presented, each neuron goes through two paths in such a network: a forward activation flow and a backward error flow. BP updates the weights incrementally after each training pair is presented. The training is completed when every training pattern has been presented and the output error reaches the desired level [50].

Let the input vector be x_j and the target output vector be t_j. The number of units in the input layer is A, in the hidden layer B, and in the output layer C. First, the calculation is carried out along the forward activation path. The activation in input layer i is initialized by the input pattern x:

i_j = x_j    (3.19)

where i_j refers to the j-th neuron of input layer i. The activations are propagated from the input layer to the hidden layer:

h_j = \frac{1}{1 + \exp\left(-\sum_{k=0}^{A} w_{kj} i_k\right)}, \quad k = 0 \ldots A, \; j = 0 \ldots B    (3.20)

Here, a sigmoid is used as the nonlinear function. Furthermore, the activations are propagated from the hidden layer to the output layer:

o_j = \frac{1}{1 + \exp\left(-\sum_{k=0}^{B} w_{kj} h_k\right)}, \quad k = 0 \ldots B, \; j = 1 \ldots C    (3.21)

where o_j refers to the j-th neuron in output layer o.

The weights are updated along the backward path according to the error information δ. In the output layer, the output error, denoted by δ_j^o, is calculated according to the difference between the actual and desired outputs:

\delta_j^o = o_j (1 - o_j)(t_j - o_j), \quad j = 1 \ldots C    (3.22)

The weight adjustment between the output layer and the hidden layer is given by

\Delta w_{kj} = \eta \, \delta_j^o \, h_k, \quad j = 1 \ldots C, \; k = 0 \ldots B    (3.23)

where η refers to the learning rate, which controls the convergence speed of the system.
In the hidden layer, the error, denoted by δ_j^h, is given by the equation

\delta_j^h = h_j (1 - h_j) \sum_{k=1}^{C} \delta_k^o w_{jk}, \quad j = 1 \ldots B    (3.24)

The weights between the hidden layer and the input layer are updated by the following equation:

\Delta w_{kj} = \eta \, \delta_j^h \, i_k, \quad j = 1 \ldots B, \; k = 0 \ldots A    (3.25)

The training is completed when the output error is below the expected tolerance.

3.2.2 The capability and limitations of backpropagation

Backpropagation is a learning algorithm designed to solve the problem of nonlinearity. The participation of the hidden layer provides a more powerful capability for nonlinear problems than the perceptron. The BP algorithm has been used to discover nonlinear topological mappings. In learning distributed representations, weights are automatically evolved to show relationships which are not explicit in the inputs. The introduction of hidden units gives a feedforward network the potential for an arbitrary mapping in both auto-associative and hetero-associative memories [47].

The reported success of backpropagation has made it the most popular scheme currently in use. But the application of BP is limited by its own inherent drawbacks. The gradient descent nature of the BP algorithm causes very slow convergence during the training process. Training time is an exponential function of the number of hidden units in the network. The network needs many patterns for training and, when each pattern is presented, each neuron requires two passes: a feedforward activation flow and a backward error flow. When the number of neurons is large, the training speed is very slow and the application of BP becomes impractical or unacceptable. Backpropagation is successful in pattern recognition for nonlinear systems, but there is no well-developed theory to describe the properties of BP with respect to function approximation or generalization.
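Equations 3.19 through 3.25 translate directly into a small program. The following is a minimal sketch of a three-layer network trained with the generalized delta rule; the class and variable names are mine, and bias units are omitted for brevity:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class BackpropNet:
    # Minimal A-B-C feedforward network trained with the generalized
    # delta rule (equations 3.19-3.25); bias units omitted for brevity.
    def __init__(self, A, B, C, eta=0.5, seed=0):
        rnd = random.Random(seed)
        self.eta = eta
        self.w1 = [[rnd.uniform(-0.5, 0.5) for _ in range(B)] for _ in range(A)]
        self.w2 = [[rnd.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(B)]

    def forward(self, x):
        # Forward activation flow (sigmoid units, eqs. 3.20 and 3.21).
        h = [sigmoid(sum(x[k] * self.w1[k][j] for k in range(len(x))))
             for j in range(len(self.w1[0]))]
        o = [sigmoid(sum(h[k] * self.w2[k][j] for k in range(len(h))))
             for j in range(len(self.w2[0]))]
        return h, o

    def train_pair(self, x, t):
        # Backward error flow: deltas (eqs. 3.22, 3.24), updates (3.23, 3.25).
        h, o = self.forward(x)
        d_o = [o[j] * (1 - o[j]) * (t[j] - o[j]) for j in range(len(o))]
        d_h = [h[j] * (1 - h[j]) *
               sum(d_o[k] * self.w2[j][k] for k in range(len(o)))
               for j in range(len(h))]
        for k in range(len(h)):
            for j in range(len(o)):
                self.w2[k][j] += self.eta * d_o[j] * h[k]
        for k in range(len(x)):
            for j in range(len(h)):
                self.w1[k][j] += self.eta * d_h[j] * x[k]
        return 0.5 * sum((t[j] - o[j]) ** 2 for j in range(len(o)))
```

On a tiny two-pattern problem, repeated presentations drive the summed squared error (eq. 3.4) down toward the tolerance.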
If we want to solve a particular approximation problem, convergence through the training process can be observed but is very difficult to analyze.

3.3 Unsupervised learning

The inherent drawbacks of backpropagation limit its application to practical problems. This motivates us to find systems which are easily understood and implemented. One of the unsupervised learning rules, competitive learning, is suggested for pattern recognition. The network is simply composed of input and output layers, and is readily analyzed. Additionally, the algorithm for weight adjustment can be flexibly modified according to optimality principles based on the special requirements of the classification task. The optimality principles are suggested in competitive learning to make the system more powerful in pattern classification.

3.3.1 Competitive learning

The typical competitive topology is a one-layer feedforward model. If there are fewer output states than input patterns, the network will produce the same output for different input patterns. Competitive learning is a mechanism whereby the network learns to produce the same outputs for input patterns which are "similar" [29]. The structure of a competitive learning system is shown in Figure 3.13. Neurons compete for the activation induced by randomly sampled pattern vectors x ∈ R^n. During competitive learning, a set of input patterns is presented and the network is allowed to produce its own output. Each neuron in layer 2 receives an excitatory input from each input neuron in layer 1. Within layer 2, the neurons are broken into sets of inhibitory clusters, shown as dark circles in Figure 3.13. The neurons in each inhibitory cluster inhibit all of the other neurons in the same cluster. Thus, the neurons in an inhibitory cluster compete with each other and the one with the highest activation eventually wins [20].
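The winner-take-all competition can be sketched as a toy program of my own; the winning unit is the one whose weight vector lies closest to the input, and only its weights are updated:

```python
def competitive_step(weights, x, eta=0.2):
    # Winner-take-all: find the unit whose weight vector is closest to x,
    # then move only that unit's weights toward the input.
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    winner = dists.index(min(dists))
    weights[winner] = [wi + eta * (xi - wi)
                       for wi, xi in zip(weights[winner], x)]
    return winner

# Two units pulled toward two clusters of inputs.
w = [[0.1, 0.1], [0.9, 0.9]]
for _ in range(20):
    competitive_step(w, [0.0, 0.0])
    competitive_step(w, [1.0, 1.0])
```

After training, similar inputs consistently activate the same winning unit, which is the clustering behaviour described above.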
[Figure 3.13: Architecture of competitive learning: input neurons in layer 1 with excitatory connections to layer 2, where the neurons form inhibitory clusters]

In competitive learning, no target vector is required for the outputs, and hence there is no comparison to predetermined ideal responses. The training set consists solely of input vectors. The training algorithm modifies the network weights to produce output vectors according to the clustering of the input vectors. If input vectors are sufficiently similar to each other, they produce the same output vector. The training process, therefore, extracts the statistical properties of the training set and groups similar vectors into classes [32]. The "winner-take-all" principle makes the network closer to natural human behavior but less accurate in approximation, and the data are not reconstructed in optimal form. In order to improve the network performance, some optimality principles are suggested below and the learning algorithms are modified accordingly.

3.3.2 Feature extraction as a good representation

Effective internal representation of input data is the motivation for feature analysis in pattern recognition. Usually, in the model of pattern recognition as an engineering discipline, most systems consist of three consecutive processes: in the first stage, sensor measurements are taken and preprocessed by typical data preprocessing to remove the unimportant portion; in the second stage, features are extracted and a more abstract representation is obtained; finally, this representation is utilized as the input vector for classification. Figure 3.14 illustrates the process of pattern recognition as implemented in ordinary methods of artificial intelligence.
[Figure 3.14: The pattern recognition chain: pattern, data acquisition, data preprocessing, data reduction and feature extraction, feature representation of the pattern, classification, decision]

Here, the features of the data play a crucial role in pattern classification, and the reliability of these features is essential. Thus the first optimality rule is based on the requirement of feature extraction. The primary goal of feature selection is to obtain features which maximize the similarity of objects in the same class while maximizing the dissimilarity of objects in different classes. It is desirable to use signal representations which are both sufficiently descriptive and robust to signal and environment variations. Maximizing the variance of the input is the first optimality principle in data reconstruction.

In artificial intelligence, feature extraction involves statistical pattern estimation, called the parametric method, which assumes the forms of the known class densities. In nonparametric methods, or neural network methods, which are not based on an explicitly given probability distribution, parameter estimation is replaced by a training algorithm through the learning process.

3.3.3 Data compression as an optimality principle

One major motivation for feature analysis is dimensionality reduction. A good feature must be sufficiently discriminating. In order to keep the classification problem tractable, the total number of features must be limited. Statistical pattern recognition techniques typically involve the reduction of the signals into primitive shape features. Dimensionality reduction is a typical task of data preprocessing; therefore data compression is used as an optimality principle in data reconstruction. Data compression is based on the fact that some points can be represented by values at other points. The goal of compression is to find a representation in which certain features of the signals are made explicit.
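The compression subnetwork of Figure 3.11 uses the Generalized Hebbian Algorithm for exactly this purpose. A minimal sketch of one GHA (Sanger's rule) update, under the assumption of linear output units; the names are mine:

```python
import math

def gha_step(W, x, eta=0.01):
    # One step of the Generalized Hebbian Algorithm (Sanger's rule):
    # y_i = w_i . x ;  w_ij <- w_ij + eta * y_i * (x_j - sum_{k<=i} y_k w_kj)
    # The rows of W converge to the leading eigenvectors of the input
    # correlation matrix, i.e., the principal components.
    y = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    for i in range(len(W)):
        for j in range(len(x)):
            back = sum(y[k] * W[k][j] for k in range(i + 1))
            W[i][j] += eta * y[i] * (x[j] - back)
    return y
```

With a single output row this reduces to Oja's rule, and the row converges to the unit-norm principal eigenvector of the input correlation.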
If there are fewer outputs than inputs, the system attempts to find the underlying structure of the input distribution. Data reconstruction attempts to remove unimportant signal features and keeps only the principal components to represent the original signals. Compression makes good sense for generalization. If we have M < N, then similar inputs will be encoded to a simple symbol that represents them. The statistical distribution of the inputs will be discovered, and the input vectors are roughly categorized by their distribution. Neural network technology offers techniques for selecting, developing, clustering, and compressing features into useful sets.

Actually, feature extraction and data compression are already included in backpropagation, done implicitly in the hidden layer. If the number of outputs M is less than that of the inputs N, feature extraction is certainly included. The weakness of feature extraction in backpropagation is the lack of optimality principles; therefore, the features formed in the hidden layer are not necessarily in optimal form.

The neural network configuration has been established. In order to improve classification efficiency, feature extraction is introduced to the basic backpropagation classifier.

Chapter 4 Subspace and orthogonality

Effective internal representation of the input data is the motivation for feature analysis. In feature extraction, the primary goal is to build an explicit or implicit model for the pattern classes. Given the representation for the classes, the classification rules and the class regions are completely defined in the high-dimensional pattern space. Feature extraction was intuitively analyzed in chapter 3. In this chapter, combining mathematical definitions and geometrical perspectives, we attempt to analyze how the network performs feature extraction in the subspace.
Theoretically and experimentally, the most efficient data representation for pattern recognition is in orthogonal form. Thus orthogonality is naturally suggested as an optimality rule in feature extraction. The connection between traditional signal processing algorithms and the modern NN approach will be discussed together with the orthogonalization process.

4.1 Subspace as representation for categories

The subspace is initially established for feature extraction. The original input vectors are mapped from the sample space to the subspace; this is a typical task of linear data reconstruction. The input vector consists of the acoustic power spectrum emitted by wood chips and fibres. The elements of the input vectors are the energies in the frequency components. If we look at a power spectrum as being composed of randomly excited intensities, then we can assume that the spectrum is a linear combination of a finite set of basis vectors. We attempt to find the best fitting model for the specified observation measurement of x. This is equivalent to subspace projection. The subspace defines a class as the general linear combination of some spanning basis vectors.

In competitive learning, one of the assumptions is a zero-mean Gaussian distribution for the signals. But the original form of our input vector does not have this property. To make the vectors satisfy the zero-mean condition, we average the training set of signals and subtract the mean from the whole data set. Through this data preprocessing, each input vector is represented as a zero-mean Gaussian random vector, in set form x = {x ∈ R^n}, or in vector form x = [x_1, ..., x_n]^T. The sample space is defined as R^n, which is an n-dimensional Euclidean vector space.
Any set of p linearly independent vectors u_1, ..., u_p in R^n (with p < n) spans a subspace, say L^p, which is the set

L^p = L(u_1, u_2, \ldots, u_p) = \left\{ x \,\middle|\, x = \sum_{i=1}^{p} a_i u_i \right\}    (4.26)

where the {a_i} are asymptotically stable equilibrium points of a system (S), and (S) has a solution space [41]. We define the set L^p as L^p = {x ∈ R^n : −1 ≤ x_i ≤ 1, i = 1, ..., n}. Suppose we are given p vectors {a_i} in R^n which are to be stored as asymptotically stable equilibrium points of an n-dimensional system (S). A method is proposed to find a group of parameters {a_i} to map an input vector from the original sample space to the subspace. Assume that there are K classes, y^{(1)}, ..., y^{(K)}, each with basis vectors in the subspace; the projection subspaces are represented as K classes

L^{(i)}, \quad i = 1, \ldots, K    (4.27)

Thus the classification rule is given by the distance function [41]: if d(x, L^{(i)}) < d(x, L^{(j)}) for all j ≠ i, then classify x in class y^{(i)}.

4.2 Orthogonal bases

Theoretically and experimentally, the most efficient representation of the data is to store an orthogonal data space in the network. Based on this fact, three rules are suggested for the neural network: the outer product rule, the projection rule and the eigenstructure rule. Each of them can be used to find an associative mapping between input and output. Here, we attempt to find an optimal form of data reconstruction.

1. The outer product rule: The Hebbian learning algorithm is based on the outer product rule to form the auto-associative map between the input and output. But this rule is limited to a certain range; its tolerance for distortion and error in the raw data is not good enough for the required classification task.

2. The projection rule: Projection is a basic subspace operation. The input vectors can be mapped into a certain subspace through a rotation; the data are then linearly reconstructed with new basis vectors. But the data reconstruction is required to be in optimal form.

3.
The eigenstructure rule: this rule is motivated by the optimization requirement. The most economical and efficient linear representation is to store orthonormal basis vectors; the orthogonalized data space has the optimal form for data reconstruction. The orthogonalization is performed by decomposing the data space into its eigenvectors, after which we say the data space is in eigenstructure form.

Based on this discussion, we conclude that it is optimal to choose orthonormal vectors as the basis vectors in linear data reconstruction, and thus orthogonality is a fundamental requirement in neural networks. The orthogonal decomposition theorem and its geometric interpretation are introduced by Apostol in [2]. In this thesis, the geometric interpretation of the theorem is illustrated in Figure 4.15, and the theorem is cited as follows:

Orthogonal decomposition theorem: Let R^n be a Euclidean space and let L be a finite-dimensional subspace of R^n. Then every element x in R^n can be represented uniquely as a sum of two elements, one in L and one in L⊥. That is, we have x = l + l⊥, where l ∈ L and l⊥ ∈ L⊥.

Figure 4.15: Geometric interpretation of the orthogonal decomposition theorem

The vector x is decomposed into the two vectors l and l⊥, as illustrated in Figure 4.15. According to the projection principle, the closest point in L to any vector x is its projection

  x̂        (4.28)

which is computed from

  x̂ = Σ_{i=1}^{p} (x · u_i) u_i = ( Σ_{i=1}^{p} u_i u_i^T ) x = P x        (4.29)

where P is the orthogonal projection matrix, built from the basis (eigenvector) matrix of the subspace L. With this data representation, the classification rule, which is a determination of the distance function, becomes a measurement of the dissimilarity between x and the subspace L:

  d(x, L) = ||x − x̂||        (4.30)

It equals ||x|| if x ⊥ L, and 0 if x ∈ L.
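The projection distance d(x, L) and the nearest-subspace classification rule of Section 4.1 can be sketched numerically. This is a minimal illustration assuming Python with numpy; the two one-dimensional "class subspaces" and the test vector are invented for the example:

```python
import numpy as np

def subspace_distance(x, U):
    """d(x, L) = ||x - x_hat||, where x_hat = U U^T x is the projection of x
    onto the subspace spanned by the orthonormal columns of U (eqs. 4.29-4.30)."""
    x_hat = U @ (U.T @ x)
    return np.linalg.norm(x - x_hat)

def classify(x, subspaces):
    """Assign x to the class whose subspace gives the smallest residual."""
    return int(np.argmin([subspace_distance(x, U) for U in subspaces]))

# Two invented one-dimensional class subspaces along the first two axes.
U0 = np.array([[1.0], [0.0], [0.0]])
U1 = np.array([[0.0], [1.0], [0.0]])
x = np.array([0.9, 0.1, 0.05])
print(classify(x, [U0, U1]))   # -> 0
```

A vector orthogonal to a subspace has distance ||x|| to it, and a vector inside the subspace has distance 0, as stated for equation 4.30.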
We have the same distance function as in the previous section, but now in an orthogonalized subspace; the classification rule is thus: if d(x, L^(i)) < d(x, L^(j)) for all j ≠ i, then classify x into class y^(i).

4.3 Karhunen-Loève transform

The optimal linear reconstruction was originally suggested by Hotelling for discrete random processes [19]. The algorithm was subsequently introduced as a series expansion for continuous random processes by Karhunen [25] and Loève [36]. The Karhunen-Loève Transform (KLT) is therefore also called the Hotelling transform, or the method of principal components, because it is the extension of Hotelling's algorithm. Mathematically, the KLT is equivalent to eigenvector decomposition.

Let the input vector x = [x_1, ..., x_n]^T be a random process with autocorrelation matrix Q, and let V be an n × n unitary matrix which reduces Q to its diagonal form Λ [23]. For any Hermitian matrix Q there exists a unitary matrix V such that

  V^T Q V = Λ        (4.31)

or, column by column,

  Q v_k = λ_k v_k,  k = 1, ..., n        (4.32)

where {λ_k} and {v_k} are the eigenvalues and eigenvectors of Q respectively. The eigenvectors satisfy

  v_i^T v_j = 1 if i = j,  v_i^T v_j = 0 if i ≠ j        (4.33)

and the eigenvalues are arranged in descending order

  λ_1 ≥ λ_2 ≥ ... ≥ λ_n        (4.34)

We order the set of n eigenvectors {v_i} by decreasing eigenvalue {λ_i}, so that the first eigenvector v_1, called the "principal" eigenvector, corresponds to the largest eigenvalue λ_1 [39]. For Hermitian matrices, the eigenvectors corresponding to distinct eigenvalues are orthogonal. For repeated eigenvalues, the eigenvectors form a subspace that can be orthogonalized to yield a complete set of orthogonal eigenvectors. Normalization of these eigenvectors yields an orthonormal set.
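The decomposition V^T Q V = Λ, with the eigenvalues re-ordered as in equation 4.34, can be checked numerically. A minimal sketch assuming Python with numpy and synthetic correlated data (the mixing matrix A is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated zero-mean data; rows of X are samples x^T.
A = rng.normal(size=(3, 3))
X = rng.normal(size=(1000, 3)) @ A.T
Q = X.T @ X / len(X)                 # sample covariance Q = E[x x^T]

lam, V = np.linalg.eigh(Q)           # eigenvalues come back in ascending order
order = np.argsort(lam)[::-1]        # re-sort descending, as in eq. 4.34
lam, V = lam[order], V[:, order]

# V^T Q V = Lambda (eq. 4.31): the transformed data are decorrelated.
Y = X @ V                            # each row is y^T = (V^T x)^T
Lam = Y.T @ Y / len(Y)
off_diag = Lam - np.diag(np.diag(Lam))
print(np.max(np.abs(off_diag)))      # ~ 0
```

Note that `numpy.linalg.eigh` returns eigenvalues in ascending order, so the explicit re-sort is needed to match the descending convention used here.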
The matrix V transforms the input vector x = [x_1, ..., x_n]^T into another n-dimensional vector y = [y_1, ..., y_n]^T:

  y = V^T x        (4.35)

with

  E[y y^T] = V^T E[x x^T] V = V^T Q V = Λ        (4.36)

If Q represents the covariance matrix of x, then the sequence y is uncorrelated:

  E[y_i y_j] = λ_i δ_ij,  in particular E[y_i^2] = λ_i        (4.37)

The KLT has many desirable properties which make it optimal in many applications. It offers the linear transformation of a digital signal for which the mean squared error between the original and the reconstructed signal is minimized [23]. This optimality property provides a useful tool for developing data reconstruction techniques. The reconstructed data are also functionally useful because they are decorrelated: the input vectors are transformed into uncorrelated form by the group of eigenvectors corresponding to the eigenvalues. The largest eigenvalue represents the maximum variance of the input information; consequently, the reconstructed data are more easily classified in PR.

The KLT is widely used in general signal processing, but one problem in utilizing it is the difficulty of dimensionality. The KLT depends not only on the statistics but also on the size of the signals. The computation of both the transformation matrix and the data reconstruction grows considerably with the input size; because of this, the data processing becomes unacceptable and impractical when the input is large.

4.4 Principal component analysis

In statistics, Principal Component Analysis (PCA) is a standard identification method in pattern recognition. We attempt to describe the data set by a small number of components. We are going to throw away some information, without knowing in advance which particular information will be needed and which we can afford to discard. In this data reconstruction, it is therefore required to preserve as much information as possible.
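"Preserving as much information as possible" can be made concrete with a small numerical sketch: keep only the leading M principal directions of a sample covariance and measure what is lost. This assumes Python with numpy; the data are synthetic, with variance deliberately concentrated in two directions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic zero-mean data whose variance is concentrated in two directions.
X = rng.normal(size=(500, 4)) * np.array([3.0, 1.0, 0.3, 0.1])
Q = X.T @ X / len(X)
lam, V = np.linalg.eigh(Q)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

M = 2                               # keep only the first M principal components
VM = V[:, :M]
X_hat = (X @ VM) @ VM.T             # x_hat = sum_i (v_i . x) v_i
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
# The discarded information is exactly the sum of the dropped eigenvalues.
print(mse, lam[M:].sum())
```

The two printed numbers agree: the mean squared reconstruction error equals the sum of the eigenvalues that were discarded, which is why keeping the largest-eigenvalue components preserves the most variance.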
The principal component is the eigenvector which maximizes the variance of the output. Each eigenvalue measures how much of the data set's variance its eigenvector accounts for: the larger the eigenvalue, the better that eigenvector describes the data set. Therefore the orthogonalized data contribute the best set of features [33].

The principle of preserving maximal information thus appears to be an extremely natural and attractive one to use for network construction [1]. A properly designed network, when learning and responding, performs good statistical inference, namely principal component analysis. This is equivalent to performing the Karhunen-Loève transform in a neural network. In engineering, the KLT is the best way to represent a set of data with the least number of descriptors. Given a set of data vectors x = [x_1, ..., x_n]^T that we want to describe, the first step is to form the matrix Q = E[x x^T], the covariance matrix of the input vectors. There exists a transformation matrix V composed of the eigenvectors {v_i} corresponding to the eigenvalues {λ_i} in descending order. The eigenvectors {v_i} are then used to transform the input vector x = [x_1, ..., x_n]^T into the output vector y = [y_1, ..., y_m]^T by computing the dot-product features v_i · x. The vector x can be approximately reconstructed as the sum

  x̂ = Σ_{i=1}^{m} (v_i · x) v_i        (4.38)

After this linear data transformation, the covariance matrix Λ = E[y y^T] captures the maximal variance of the input distribution, and the representation error (x − x̂) tends to the minimum.

Many applications prove that it is possible to approximate the KLT, or to perform PCA, via a neural network [33] [40] [53]. In a classic paper [40], Oja shows the relation between Hebbian learning and PCA, treating a single weight vector as an estimate of the largest eigenvector.
He shows that the weight vector converges to the principal eigenvector of the input correlation matrix, thus maximizing the output variance. The extension of Oja's algorithm used to find the subsequent principal components is referred to as Sanger's algorithm: Sanger developed an extension of Oja's algorithm in which a projection-like process is used to extract as many principal components as desired [53]. Oja and Sanger developed the learning rules that perform the KLT and explored how to implement principal component analysis in a neural network. These learning algorithms are analyzed mathematically in the next chapter.

The subspace has been established to perform feature extraction and data compression. In combination with orthogonal data transformation, we have discussed two classical algorithms, the KLT and PCA, both of which are very useful in signal processing and PR. Based on these analyses, we explore how neural networks can accomplish the same tasks as the traditional methods do.

Chapter 5
The Generalized Hebbian Algorithm

The Generalized Hebbian Algorithm (GHA) is a learning rule for linear networks. It is based on the standard Hebbian learning rule, was further developed in Oja's algorithm, and most recently resulted in Sanger's algorithm. Through the learning process, data orthogonalization and compression are accomplished simultaneously. The algorithm shows the significance of feature extraction and data compression.

5.1 Terminology

The notation is defined for a single-layer network. The sensory input is represented by a vector x = [x_1, ..., x_N]^T in the N-dimensional sample space R^N. The output is the Gaussian vector y = [y_1, ..., y_M]^T, which is computed from x by the equation y = Cx, where C = [c_ij] is an M × N matrix of synaptic weights.
The covariance matrix of the input vector is Q = R_xx = E[x x^T], while the output covariance matrix is

  R_yy = E[Cx (Cx)^T] = E[C x x^T C^T] = C E[x x^T] C^T = C Q C^T        (5.39)

5.2 Basic Hebbian algorithm

Most of today's training algorithms have evolved from the concepts of D. O. Hebb [17]. He proposed a model for unsupervised learning in which the synaptic strength is increased if both the source and destination neurons are activated; in this way, often-used paths in the network are strengthened through iterative learning. The algorithm modifies the synaptic connection between two units in proportion to the cross-correlation of the units' activities, adjusting each weight by a small increment proportional to the product of input and output [17]. This can be written in scalar form as

  c_ij(t+1) = c_ij(t) + γ y_i(t) x_j(t)        (5.40)

or in matrix form

  C(t+1) = C(t) + γ y(t) x^T(t)        (5.41)

where γ is a learning rate which determines the convergence speed of the system, and c_ij is the value of the weight connecting neuron j to neuron i.

In the supervised learning case, the desired values are supplied for y, and we assume that y is a deterministic function of x, i.e., y = f(x). In the linear case we have y = Lx, and

  E[y x^T] = E[L x x^T] = L E[x x^T] = L Q        (5.42)

If x is white noise, then Q = I and C becomes proportional to L; the learning result is simply determined by the linear transformation. In the unsupervised learning case, the problem becomes more complicated. Since y is then a time-varying function of x, equation 5.41 is difficult to analyze, and unsupervised algorithms require some constraints to force different outputs to learn different things.
If we assume that γ is sufficiently small that C changes slowly relative to x, then we can average equation 5.41:

  C(t+1) = C(t) + γ C(t) E[x x^T] = C(t) + γ C(t) Q        (5.43)

Under certain conditions on γ, the convergence of C can be approximately described by the differential equation

  Ċ = C Q        (5.44)

This differential equation is solved, for the initial condition C(0), by

  C(t) = C(0) e^{Qt}        (5.45)

If x is in Euclidean space and Q is a symmetric positive definite matrix, we have Q = V Λ V^T, where V is the matrix whose columns are the orthonormal eigenvectors {e_j} and Λ is the diagonal matrix of eigenvalues λ_j. Expanding the matrix exponential as a power series, with V V^T = I, we can write e^{Qt} = V e^{Λt} V^T, so that

  C(t) = C(0) V e^{Λt} V^T = Σ_{j=1}^{N} e^{λ_j t} (C(0) e_j) e_j^T        (5.46)

Dividing and multiplying by e^{λ_1 t}, where λ_1 is the largest eigenvalue of the correlation matrix, we obtain

  C(t) = e^{λ_1 t} [ (C(0) e_1) e_1^T + Σ_{j=2}^{N} e^{(λ_j − λ_1) t} (C(0) e_j) e_j^T ]        (5.47)

Equation 5.47 does not converge, because of the exponential factor e^{λ_1 t}. However, if we divide by a normalization factor in order to keep the rows of C at norm 1, the rows of the matrix C have finite norms. Since λ_1 is the largest eigenvalue, the exponents λ_j − λ_1 are all negative, and the terms within the sum converge to zero as t → ∞. Therefore, with normalization, we have

  C → (C(0) e_1) e_1^T        (5.48)

and each row c_i of C satisfies

  c_i → (c_i(0) · e_1) e_1        (5.49)

Therefore, all the rows of C converge to the direction of the principal eigenvector e_1. The above equations show how the Hebbian algorithm maximizes the variance of each output.
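This convergence to the principal eigenvector can be checked by simulation. A minimal sketch of the basic Hebbian rule (5.40) with the explicit renormalization discussed above, assuming Python with numpy and synthetic zero-mean inputs (the variances are chosen so the principal eigenvector is the first coordinate axis):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic zero-mean inputs; the principal eigenvector of Q is e1 = [1, 0].
X = rng.normal(size=(2000, 2)) * np.array([2.0, 0.5])

c = rng.normal(size=2)
c /= np.linalg.norm(c)
gamma = 0.01
for x in X:
    y = c @ x
    c = c + gamma * y * x            # Hebbian increment, eq. 5.40
    c /= np.linalg.norm(c)           # explicit renormalization keeps ||c|| = 1

print(abs(c[0]))                     # close to 1: c aligns with e1 (up to sign)
```

Without the renormalization line the weights grow without bound, which is exactly the problem Oja's modification addresses in the next section.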
The energy term is defined as

  E = − Σ_{i=1}^{M} E[y_i^2] = −trace[R_yy] = −trace[C Q C^T]        (5.50)

Writing E = −Σ_i Σ_k Σ_l c_ik Q_kl c_il and differentiating with respect to a single weight,

  ∂E/∂c_ij = −( Σ_{l=1}^{N} Q_jl c_il + Σ_{k=1}^{N} c_ik Q_kj ) = −2 Σ_{k=1}^{N} Q_jk c_ik        (5.51)

using the symmetry of Q. To perform gradient descent, we set the weight dynamics equal to the negative gradient of the energy,

  ċ_ij = −∂E/∂c_ij = 2 [C Q]_ij        (5.52)

or, up to a constant factor,

  Ċ ∝ C Q        (5.53)

We see that the Hebbian learning rule performs gradient descent on the energy function specified by equation 5.50. The significance of the Hebbian algorithm is that it attempts to maximize the variance of each output: the energy function is defined as the negative of the sum of the output variances, so minimization of E leads to maximization of the variance. Since there is only one best solution, all the outputs come to have the same synaptic weights from the inputs. Each output therefore computes the same function of the input, and the learning algorithm chooses the function which maximizes the total output variance [52].

5.3 Oja's algorithm

Normalization is very important from a mathematical point of view. Simply using the basic Hebbian scheme without normalization leads to unbounded growth of the synaptic efficacies, and this motivates us to find a normalized form of the weight matrix. Although the Hebbian algorithm is able to maximize the variance of the outputs, it needs an extra operation to keep the weights normalized. Oja's algorithm is a modified version of the basic Hebbian algorithm: it keeps the norm at 1 automatically, without external normalization [40]. In his algorithm, if the diagonal elements of C C^T are maintained equal to 1, then a Hebbian learning rule causes the rows of C to converge to the first principal eigenvector of Q.
Oja's synaptic modification rule is given in scalar form as

  c_ij(t+1) = [c_ij(t) + γ y_i(t) x_j(t)] / { Σ_{j=1}^{N} [c_ij(t) + γ y_i(t) x_j(t)]^2 }^{1/2}        (5.54)

where c_ij refers to the connection from the j-th input neuron to the i-th output neuron. The matrix form of equation 5.54 is

  C(t+1) = Γ^{−1} (C(t) + γ y(t) x^T(t)),
  Γ = diag[ (C(t) + γ y(t) x^T(t)) (C(t) + γ y(t) x^T(t))^T ]^{1/2}        (5.55)

where the matrix operation "diag" takes the diagonal elements of the matrix. Equation 5.54 is the result of using the Euclidean vector norm: the norm of each row of the matrix C is kept equal to 1 by an external normalization operation.

The next step is to normalize the weight matrix by changing the learning scheme itself. Assume that the learning rate γ is very small, and that the synaptic connection c_ij varies according to

  c_ij(t+1) = ĉ_ij(t+1) / ω[ĉ_i1(t+1), ..., ĉ_iN(t+1)]        (5.56)

where the un-normalized update in equation 5.56 is given by the approximation

  ĉ_ij(t+1) = c_ij(t) + γ φ[c_ij(t), x_j(t), y_i(t)]        (5.57)

Here φ[·] is the basic increment at step t, and ω[·] is a constraint function which normalizes ĉ_ij(t+1). The weight adjustment is the special (Hebbian) case

  φ[c_ij(t), x_j(t), y_i(t)] = y_i(t) x_j(t)        (5.58)

The constraint function ω[·] has the form

  ω[ĉ_i1(t+1), ..., ĉ_iN(t+1)] = [ Σ_{j=1}^{N} ĉ_ij(t+1)^2 ]^{1/2}        (5.59)

The ω function satisfies the homogeneity property

  ω[β ĉ_i1(t+1), ..., β ĉ_iN(t+1)] = β ω[ĉ_i1(t+1), ..., ĉ_iN(t+1)]        (5.60)

According to equation 5.60, we have

  ω(c_i1, ..., c_iN) = ω( ĉ_i1/ω(ĉ_i1, ..., ĉ_iN), ..., ĉ_iN/ω(ĉ_i1, ..., ĉ_iN) )
   = ω(ĉ_i1, ..., ĉ_iN) / ω(ĉ_i1, ..., ĉ_iN) = 1        (5.61)

This is the explicit constraint on the synaptic connection c_ij.
Then we obtain, expanding ω to first order in γ,

  ω[ĉ_i1(t+1), ..., ĉ_iN(t+1)]
   = ω[c_i1(t) + γ φ(c_i1(t), x_1(t), y_i(t)), ..., c_iN(t) + γ φ(c_iN(t), x_N(t), y_i(t))]
   = ω(c_i1(t), ..., c_iN(t)) + γ ∂ω/∂γ + O(γ^2)        (5.62)

Thus, with small γ and using ω(c_i1(t), ..., c_iN(t)) = 1 from equation 5.61, equation 5.56 becomes

  c_ij(t+1) = ĉ_ij(t+1) / (1 + γ ∂ω/∂γ + O(γ^2))
   = ĉ_ij(t+1) − γ ĉ_ij(t+1) ∂ω/∂γ + O(γ^2)
   = c_ij(t) + γ φ[c_ij(t), x_j(t), y_i(t)] − γ c_ij(t) ∂ω/∂γ + O(γ^2)        (5.63)

Furthermore, for the derivative term, evaluated at γ = 0 with φ = y_i x_j and ω as in equation 5.59,

  ∂ω/∂γ = ∂/∂γ { Σ_{j=1}^{N} [c_ij(t) + γ x_j(t) y_i(t)]^2 }^{1/2} = Σ_{j=1}^{N} c_ij(t) x_j(t) y_i(t) = y_i^2(t)        (5.64)

since Σ_j c_ij x_j = y_i. Ignoring the second-order term O(γ^2) and substituting 5.64 into equation 5.63, we finally obtain the weight adjustment in normalized form:

  c_ij(t+1) = c_ij(t) + γ y_i(t) [x_j(t) − y_i(t) c_ij(t)]        (5.65)

or in matrix form

  C(t+1) = C(t) + γ y(t) x^T(t) − γ diag[y(t) y^T(t)] C(t)        (5.66)

In equation 5.66, the first two terms are the same as in the basic Hebbian algorithm, and the third term reflects the effect of normalization [40]. Replacing y(t) by C(t) x(t), with Q = E[x x^T] and sufficiently small γ, we rewrite equation 5.66 as

  C(t+1) = C(t) + γ C(t) x(t) x^T(t) − γ diag[C(t) x(t) x^T(t) C^T(t)] C(t)
   ≈ C(t) + γ C(t) E[x x^T] − γ diag[C(t) E[x x^T] C^T(t)] C(t)        (5.67)

The approximating differential equation is then

  Ċ = C Q − diag[C Q C^T] C        (5.68)

Oja proves that, for positive semidefinite matrices Q in which the largest eigenvalue has multiplicity one, this differential equation always converges to a matrix in which each row is a normalized principal eigenvector. We can then say that Oja's algorithm shows how to carry out the extraction of the principal feature, i.e., eigenvector decomposition.

5.4 Gram-Schmidt process

By implementing Oja's algorithm, we can successfully extract the principal eigenvector through the training process.
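Oja's rule (equation 5.65) can be simulated in a few lines to see both claimed effects: the weight vector aligns with the principal eigenvector, and its norm settles at 1 without any explicit normalization step. A sketch assuming Python with numpy and synthetic input statistics:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic zero-mean inputs; the principal axis is the first coordinate.
X = rng.normal(size=(5000, 3)) * np.array([2.0, 1.0, 0.4])

c = rng.normal(size=3)
gamma = 0.005
for x in X:
    y = c @ x
    c = c + gamma * y * (x - y * c)  # Oja's rule, eq. 5.65;
                                     # note: no explicit normalization step

print(np.linalg.norm(c), abs(c[0]))  # norm -> ~1, direction -> principal axis
```

The term −γ y² c is exactly the normalization correction derived in equations 5.62 to 5.65.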
In pattern recognition, one principal component is not enough; we need to extract more information to make the classification complete. We therefore need to extract a group of eigenvectors to perform learning vector quantization. Based on Oja's algorithm, Sanger derives a learning rule which keeps each row of C at norm 1 in order to obtain a set of eigenvectors that are orthogonal to each other and ordered by descending eigenvalue [52]. The way to extract the subsequent eigenvectors in orthogonal form is to use the Gram-Schmidt orthogonalization process.

Every finite-dimensional linear space has a finite basis. If the space is Euclidean, we can always construct an orthogonal basis; the construction is called the Gram-Schmidt orthogonalization process [2].

Assume that the patterns lie in the Euclidean vector space R^n. Each object to be classified is represented as an n-dimensional vector of real-valued elements x = [x_1, ..., x_n]. Let x_1, x_2, ... be a finite or infinite sequence of elements of R^n, and let L(x_1, ..., x_k) denote the subspace spanned by the first k of these elements. Then there is a corresponding sequence of elements y_1, y_2, ... in R^n, which has the following properties for each integer n:

(a) The element y_n is orthogonal to every element of the subspace L(y_1, ..., y_{n−1}).

(b) The subspace spanned by y_1, ..., y_n is the same as that spanned by x_1, ..., x_n:

  L(y_1, ..., y_n) = L(x_1, ..., x_n)        (5.69)

In other words, if the first n elements x_1, ..., x_n are independent, then the corresponding elements y_1, ..., y_n are nonzero.
In this case, we define y_{r+1} by the equation

  y_{r+1} = x_{r+1} − Σ_{i=1}^{r} a_i y_i        (5.70)

The coefficients a_i are given by

  a_i = (x_{r+1}, y_i) / (y_i, y_i)        (5.71)

and the formulas defining y_1, ..., y_n become

  y_1 = x_1,  y_{r+1} = x_{r+1} − Σ_{i=1}^{r} [ (x_{r+1}, y_i) / (y_i, y_i) ] y_i,  for r = 1, 2, ..., n−1        (5.72)

These formulas describe the Gram-Schmidt process for constructing an orthogonal set of nonzero elements y_1, ..., y_n which spans the same subspace as a given independent set x_1, ..., x_n. In particular, if x_1, ..., x_n is a basis for the finite-dimensional Euclidean space, then y_1, ..., y_n is an orthogonal basis for the same space. We can also convert this to an orthonormal basis by normalizing each element y_i. In the Gram-Schmidt process, we construct the element y_{r+1} by subtracting from x_{r+1} the projection of x_{r+1} along each of the earlier elements y_1, ..., y_r. In the weight adjustment, the orthogonalization process is interpreted as

  C(t+1) = C(t) − γ lower[C C^T] C        (5.73)

that is, orthogonal eigenvectors are constructed by taking the lower triangular elements of the matrix.

5.5 Combination of the basic algorithms

Summarizing the requirements of optimal linear data reconstruction, three principles are suggested for the new algorithm:

1. Find the principal eigenvectors in descending order of eigenvalues.
2. Normalize the eigenvectors to magnitude 1.
3. Keep the row vectors of the weight matrix C orthogonal to each other.

Based on the standard Hebbian algorithm, Oja's algorithm, and the Gram-Schmidt process, Sanger developed a new scheme for linear unsupervised learning, called the Generalized Hebbian Algorithm [53]. The final form of the GHA is

  c_ij(t+1) = c_ij(t) + γ ( y_i(t) x_j(t) − y_i(t) Σ_{k≤i} c_kj(t) y_k(t) )        (5.74)

or in matrix form

  C(t+1) = C(t) + γ ( y(t) x^T(t) − LT[y(t) y^T(t)] C(t) )        (5.75)

where the matrix operation "LT" takes the lower triangular portion (including the diagonal) of the matrix.
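The update (5.75) can be sketched directly as a simulation. This is a minimal illustration assuming Python with numpy and synthetic inputs whose eigen-axes lie along the coordinates, so convergence is easy to inspect; `np.tril` plays the role of the LT operation:

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic zero-mean inputs with eigen-axes along the coordinates.
X = rng.normal(size=(20000, 4)) * np.array([3.0, 1.5, 0.7, 0.2])

M, N = 2, 4
C = 0.1 * rng.normal(size=(M, N))
gamma = 0.002
for x in X:
    y = C @ x
    # Sanger's rule, eq. 5.75; np.tril keeps the lower triangle incl. diagonal
    C += gamma * (np.outer(y, x) - np.tril(np.outer(y, y)) @ C)

# Row 0 -> first eigenvector (axis 0), row 1 -> second (axis 1), up to sign.
print(np.abs(C).round(2))
```

The diagonal part of LT reproduces Oja's normalization for each row, while the strictly lower part performs the Gram-Schmidt-like deflation against the earlier rows.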
The corresponding differential equation is

  Ċ = C Q − LT[C Q C^T] C        (5.76)

Through the training process, the system converges first to the principal eigenvector, then to the eigenvector with the next largest eigenvalue, and so on. The eigenvectors are arranged orthogonally to each other and in descending order of eigenvalues. The reconstructed data satisfy all the principles required for data reconstruction, with maximal variance and minimum squared error.

In order to keep the eigenvector matrix bounded and to draw the reconstructed data into a more compact subset, a threshold is used to limit the magnitude of the weight matrix [53]. Choose A to be the compact region in R^{MN} given by the set of matrices C whose entries are limited in magnitude by a, i.e., |c_ij| ≤ a, where a is a fixed positive value. Choose a to be sufficiently large that D ⊆ A, where D is the domain of attraction including all matrices with bounded weights. Then C(t) enters some compact subset of D infinitely often with probability 1, and with probability one lim_{t→∞} C(t) ∈ S, where S is the equilibrium set of the system. The system will therefore converge from any initial weight value C(0). The learning rate is chosen as γ(t) = 1/t to satisfy the requirements of stochastic convergence; in practice, γ is set to a constant value within a certain range in order to make the system converge faster.

5.6 Significance

The Karhunen-Loève transform is accomplished in the training process by implementing the generalized Hebbian algorithm. The original input signals are reconstructed with maximal variance and minimum error; the reconstructed data are in optimal form. This is one of the contributions of the generalized Hebbian algorithm. Some other competitive learning methods also attempt to maximize the variance of the outputs, such as the Kohonen feature map [26] and Grossberg competitive learning [16].
The problem with these algorithms is that they maximize the variances independently: the principal components are captured by a "winner-take-all" scheme, and the resulting eigenvectors are not orthogonal to each other. In the GHA, the system converges to the first eigenvector first, and each succeeding eigenvector is constructed according to the Gram-Schmidt process, so they are automatically orthogonal to each other. The weight matrix composed of these eigenvectors is in optimal form, and the reconstructed data have minimum error.

In the learning rule, the Linear Least Squared Error (LLSE) is set as the optimality principle for linear data reconstruction [52]. For the case M < N, we estimate the N-vector x given the value of the M-vector y = Cx, where C is an M × N matrix. We choose the estimate x̂ which minimizes the expected mean squared error E[||x − x̂||²]. The best estimate is given by

  x̂ = R_xy R_yy^{−1} y        (5.77)

With the error vector e = x − x̂, the error covariance matrix is

  R_ee = E[(x − x̂)(x − x̂)^T] = Q − R_xy R_yy^{−1} R_xy^T        (5.78)

where Q = E[x x^T] is the covariance matrix of the input space. The cross-correlation and output covariance matrices are defined as

  R_xy = E[x y^T] = E[x x^T C^T] = Q C^T,
  R_yy = E[C x x^T C^T] = C Q C^T        (5.79)

Substituting these into equation 5.77, the estimate becomes

  x̂ = Q C^T (C Q C^T)^{−1} y        (5.80)

and the error covariance matrix is

  R_ee = Q − Q C^T (C Q C^T)^{−1} C Q        (5.81)

where E[||x − x̂||²] = trace[R_ee]. Since C has rank M and the rows of C have norm 1, the error is minimized if C Q C^T = Λ_M, where Λ_M is the M × M diagonal matrix of the M largest eigenvalues of Q.

Using the generalized Hebbian algorithm to decompose eigenvectors can save computational work. Mathematically, the generalized Hebbian algorithm is equivalent to the Karhunen-Loève transform, but with the classical methods of accomplishing the KLT we must first calculate the covariance matrix Q = E[x x^T].
When the input dimension is large, the size of this matrix makes the computation difficult. In pattern recognition, if the problem is not too complicated, we need only the first few eigenvectors. With the neural network method it is not necessary to calculate a huge matrix to obtain them: if the number of outputs is much smaller than the number of inputs, i.e., M < N, the network extracts only the M eigenvectors through the training process, and considerable computation is saved by using the training procedure.

It should be noted that eigenvector decomposition in the network is not a substitute for the traditional approach. The system converges to the first eigenvector, and the approximation of each succeeding eigenvector depends on the previous one; the estimation error accumulates and influences the subsequent eigenvectors. The approximation is therefore poor when M is not small, and only in the case M << N does the network offer benefits over the traditional method.

The GHA, the learning algorithm for feature extraction, has been analyzed precisely from a mathematical point of view. This helps us to understand how neural networks accomplish the KLT, a classical method for orthogonalizing the data space.

Chapter 6
The performance of data reconstruction

The single-layer network has the advantage not only of being easily modified, but also of being mathematically analyzable. The weight matrix may be regarded as a group of functional filters implemented in data preprocessing: these linear filters transform the coloured environmental inputs into their principal uncorrelated components. Feature extraction and data compression are successfully accomplished through the training procedure.
Based on these results, the connections between conventional PR methods and the modern NN approach are investigated by analyzing the performance of linear data reconstruction.

6.1 The eigenvector decomposition

In this section, we show how the neural network performs feature extraction. Usually, feature extraction is required before data are sent on for the category decision, and in general signal processing a linear data reconstruction is accomplished by the KLT. As seen in the previous chapter, in statistics the term Karhunen-Loève Transform (KLT) refers to eigenvector decomposition, or singular value decomposition. The original idea of using neural networks to learn eigenvectors was suggested by Linsker [33].

Assume that x is an N-dimensional input vector with N × N covariance matrix Q = E[x x^T]; an eigenvector of Q is defined by c_i Q = λ_i c_i, where λ_i is the corresponding eigenvalue. The eigenvector set {c_i} is ordered by descending eigenvalue {λ_i}, so that c_1 corresponds to the largest eigenvalue and is called the principal eigenvector.

The output of the network is a vector of dimension M, with y = Cx, where C is the M × N weight matrix; assume that M << N for problems of pattern recognition. If each row c_i of C is an eigenvector of Q, then the output vector y has a diagonal covariance matrix Λ = E[y y^T]. If the rows are ordered by eigenvalue, we say that the components of y are the "principal components", or the "Karhunen-Loève transform of x", and that the matrix C is an optimal eigenvector estimator. The weight matrix C is described by the three following properties:

1. C C^T = I, i.e., the rows of C are orthonormal:

  C C^T = diag(1, 1, ..., 1) = I_M        (6.82)

2. C Q C^T = Λ is a diagonal matrix with the eigenvalues {λ_i} in descending order:

  C Q C^T = diag(λ_1, λ_2, ..., λ_M) = Λ        (6.83)

3.
trace[C Q C^T] is maximized over all M × N matrices satisfying the first two properties.

Figure 6.16: Eigenvector matrix check C C^T = I and covariance matrix of outputs C Q C^T = Λ, with white input

Figure 6.17: Eigenvector matrix check C C^T = I and covariance matrix of outputs C Q C^T = Λ, with the original input

Figure 6.16, obtained with random inputs, and Figure 6.17, obtained with acoustic emission signals of wood species, show the performance of eigenvector decomposition by the network. Comparing the two results, we find that the eigenvector decomposition is better with random input than with the acoustic emission signals. First, as we analyzed in chapter 5, the traditional method of eigenvector decomposition cannot be entirely replaced by the network approach: in the NN approach, the system converges to the first eigenvector, and the second eigenvector is obtained by deflating (withdrawing) the previous one, so the approximation error accumulates as the number of outputs increases. In the case where the number of outputs is close to the number of inputs, the network cannot give a good approximation. Additionally, a zero-mean Gaussian distribution of the input vectors is assumed in the GHA; if the input vectors cannot satisfy this assumption, the precision of the approximation will not reach the desired level. Even though we transform the original signals to zero-mean form, the input distribution still cannot entirely satisfy the assumption of the GHA. With the coloured inputs, only bounded diagonal elements of the output covariance matrix are obtained, as shown in Figure 6.17.

6.2 The Gabor filters

The previous section showed how neural networks perform feature extraction. In this section, we show how a neural network can function like a Gabor filter, which is a typical filter for data compression. Data compression provides some of the simplest examples of neural network applications.
An approach to data compression was originally suggested by Dennis Gabor in 1946 [13]. In his paper, Gabor proposed a new method for analyzing arbitrary signals, in which the optimal set of basis functions for analyzing a signal consists of sinusoidal portions of the signal itself. Gabor showed that for a signal of finite duration, the use of such basis functions minimizes the joint uncertainty, the product of the effective duration and the effective bandwidth; no other set of basis functions has this property.

In principle, data compression is based on two facts: one is that the values at some points can be represented by values at other points, and the other is that the original signals can be represented with minimum error [6]. According to these two facts, two optimality principles are chosen: maximum variance and minimum error. Correlated signals can take many forms and can be ordered by different statistical properties; the existence of such statistical structure implies that the entropy of the source is less than the entropy of the channel. Data compression thus becomes possible by removing some unimportant information.

The entropy of a source is defined as the average information generated by it, i.e.,

  H = − Σ_{k=1}^{N} p_k log₂ p_k        (6.84)

where p_k is the probability associated with the source symbol x_k. For a signal, the entropy can be estimated from its histogram, and the entropy of a source is maximal for a uniform distribution. The entropy of a source gives a lower bound on the number of bits required to encode its output. According to Shannon's noiseless coding theorem, it is possible to encode a source of entropy H without distortion; it is also possible to code the source with about H bits per symbol such that the distortion in the decoded signals is arbitrarily small.
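Equation 6.84 is easy to evaluate directly. A minimal sketch in Python (standard library only; the probability vectors are invented examples):

```python
import math

def entropy(probabilities):
    """Source entropy H = -sum_k p_k log2 p_k (equation 6.84), in bits.

    Zero-probability symbols contribute nothing and are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A uniform source has maximal entropy: 2 bits for four equally likely symbols.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # -> 2.0
```

A deterministic source (one symbol with probability 1) has entropy 0, the lower extreme of the same formula.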
As described in the subspace implementation, linear data reconstruction maps a vector from the input space R^n onto an orthogonal subspace formed by its basis vectors:

L = L(u_1, u_2, \ldots, u_m) = \left\{ x \mid x = \sum_{i=1}^{m} a_i u_i \right\}   (6.85)

The task of the data transformation is to find a group of coefficients {a_i} that projects the input vectors onto the orthogonal space. The original inputs {x_i} might be non-orthogonal, incomplete, or a set of linearly dependent vectors plus noise. We seek an optimal projection satisfying a global optimization criterion. We write this projection operation as

H(x) = \sum_{i=1}^{m} a_i G_i(x)   (6.86)

If the elementary functions {G_i(x)} form a complete orthogonal set, then the solution for {a_i} is given by

a_i = \frac{\sum_{j=1}^{n} x_j G_i(x_j)}{\sum_{j=1}^{n} G_i^2(x_j)}   (6.87)

The desired set of coefficients {a_i} must be determined by an optimization criterion; usually, this criterion is to minimize the squared norm of the difference between x and H(x), that is

E = \|x - H(x)\|^2 = \sum_{j=1}^{n} \left[ x_j - \sum_{i=1}^{m} a_i G_i(x_j) \right]^2   (6.88)

Taking the partial derivative of equation 6.88 with respect to the coefficients {a_i},

\frac{\partial E}{\partial a_i} = -2 \sum_{j=1}^{n} x_j G_i(x_j) + 2 \sum_{j=1}^{n} \sum_{k=1}^{m} a_k G_k(x_j) G_i(x_j) = 0   (6.89)

the solution can then be found from equation 6.89. In practice, two problems arise in solving it: if m < n, {a_i} is not unique, and if n is large, the solution takes a lot of computational time, since the cost of the matrix manipulation grows rapidly with the number of simultaneous equations.

How is data compression accomplished in neural networks? A neural network adjusts its weights during the training process, in the direction of gradient descent on the energy

E = -\frac{1}{2} c_i^T Q c_i   (6.90)

where c_i is the weight vector and Q is the covariance matrix of the input.
Ignoring the scalar factor, the definition of E is just (the negative of) the variance of the outputs, since

E[y y^T] = C Q C^T = \Lambda   (6.91)

So E is minimized subject to c_i c_i^T = 1, with

\frac{\partial c_{ij}}{\partial t} = -\frac{\partial E}{\partial c_{ij}} = [Q c_i]_j   (6.92)

and

\frac{\partial c_i}{\partial t} = Q c_i   (6.93)

This indicates that performing gradient descent on the energy function is equivalent to maximizing the variance of the outputs. In this algorithm, the weights are adjusted according to the variance of the outputs. If a network is trained to compute M < N outputs, the outputs are uncorrelated, and their variance is a measure of their usefulness. The entropy of the information is maximized as far as possible for all values of M. So we can say that the training procedure that fixes the weight matrix is also a process of finding a group of Gabor filter coefficients.

6.3 The masks

The functions of neural networks can also be interpreted as a series of masks. In signal processing, a mask is a gradient operator that detects significant features of the input vectors over different ranges. The motivation for implementing masks is that correlation is local, extending only over a finite range. We now look at frequency-domain masks to see how masks act on signals in neural networks.

6.3.1 General signal preprocessing

To find the connection between traditional signal processing and neural networks, we list the functions of general signal processing below. To keep the notation consistent, we define the raw input vector as w = [w_1, ..., w_n] and the vector after signal processing as x = [x_1, ..., x_n]. General signal preprocessing with a series of Gaussian functions is given by

x_j = m_j (g * w)_j   (6.94)

for the jth component of the input vector. In equation 6.94, the input vector w is convolved with a Gaussian mask g, and the result is multiplied by a Gaussian window m.
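The preprocessing step of equation 6.94 can be sketched directly; a minimal NumPy version, in which the two Gaussian widths are illustrative assumptions:

```python
import numpy as np

def gaussian(n, sigma):
    """A length-n Gaussian centred in the window."""
    t = np.arange(n) - (n - 1) / 2.0
    return np.exp(-t**2 / (2.0 * sigma**2))

def preprocess(w, mask_sigma=2.0, window_sigma=8.0):
    """x_j = m_j * (g * w)_j (eq. 6.94): convolve the input with a
    Gaussian mask g, then multiply pointwise by a Gaussian window m."""
    n = len(w)
    g = gaussian(n, mask_sigma)
    g /= g.sum()                        # unit-gain smoothing mask
    m = gaussian(n, window_sigma)       # spatial window
    smoothed = np.convolve(w, g, mode="same")
    return m * smoothed

rng = np.random.default_rng(1)
w = rng.normal(size=64)
x = preprocess(w)
```

The smoothing mask suppresses high-frequency content and the window limits the spatial extent, so the variance of x is below that of a white-noise input w.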
If we define the weight c_ij as the synaptic connection from input x_j to output y_i, the corresponding output is

y_i = \sum_{j=1}^{N} c_{ij} x_j   (6.95)

In equation 6.94, the convolution is defined as

(g * w)_j = \sum_{l=1}^{N} g_{j-l} w_l   (6.96)

Because we always need to know the signals in both the spatial and the frequency domain, we give here the frequency-domain definition and some operations relating the two domains. The Discrete Fourier Transform (DFT) is given by

W_k = \sum_{j=1}^{N} w_j e^{-2\pi i jk/N}   (6.97)

Lowercase letters stand for the spatial domain and uppercase letters for the frequency domain. The convolution theorem is

DFT(g * w) = DFT(g) \cdot DFT(w)   (6.98)

which indicates that convolution in the spatial domain becomes multiplication in the frequency domain. Parseval's relation states that

\sum_{t=0}^{N-1} |g(t)|^2 = \frac{1}{N} \sum_{w=0}^{N-1} |G(w)|^2   (6.99)

that is, the energy in the spatial domain equals the energy in the frequency domain (up to the normalization of the transform).

6.3.2 The function of the neural network

We now examine the Gabor filter function from the viewpoint of the neural network.
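The transform identities of equations 6.98 and 6.99 can be checked numerically. A short NumPy sketch (NumPy's unnormalized DFT convention is assumed; circular convolution is used, which is what the DFT product corresponds to):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 64
g = rng.normal(size=N)
w = rng.normal(size=N)

# Convolution theorem (eq. 6.98): DFT(g * w) = DFT(g) . DFT(w),
# where * denotes circular convolution.
circ = np.real(np.fft.ifft(np.fft.fft(g) * np.fft.fft(w)))
direct = np.array([sum(g[l] * w[(j - l) % N] for l in range(N))
                   for j in range(N)])
assert np.allclose(circ, direct)

# Parseval's relation (eq. 6.99): sum |g(t)|^2 = (1/N) sum |G(w)|^2
G = np.fft.fft(g)
assert np.isclose(np.sum(g**2), np.sum(np.abs(G)**2) / N)
```

Both identities hold to machine precision, which is what licenses moving the output-correlation analysis below into the frequency domain.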
Regarding the outputs, the correlation between them is

E[y_i y_k] = E\left[\sum_{j=1}^{N} \sum_{l=1}^{N} c_{ij} x_j c_{kl} x_l\right] = \sum_{j=1}^{N} \sum_{l=1}^{N} c_{ij} c_{kl} E[x_j x_l]   (6.100)

Substituting the signal-processing operation x_j = m_j (g * w)_j, we obtain the correlation as

E[y_i y_k] = \sum_{j=1}^{N} \sum_{l=1}^{N} c_{ij} c_{kl} E[m_j (g * w)_j m_l (g * w)_l] = \sum_{j=1}^{N} \sum_{l=1}^{N} (c_{ij} m_j)(c_{kl} m_l) E[(g * w)_j (g * w)_l]   (6.101)

Furthermore, combining the definition of the DFT, the convolution theorem, and Parseval's relation, the output correlation is written in the frequency domain as

E[y_i y_k] = \frac{1}{N^4} \sum_{w_1=1}^{N} \sum_{w_2=1}^{N} (C_i * M)_{w_1} (C_k * M)_{w_2} E[(GW)_{w_1} (GW)_{w_2}] = \frac{1}{N^4} \sum_{w_1=1}^{N} \sum_{w_2=1}^{N} (C_i * M)_{w_1} G_{w_1} (C_k * M)_{w_2} G_{w_2} E[W_{w_1} W_{w_2}]   (6.102)

The term E[W_{w_1} W_{w_2}] is zero-mean Gaussian with variance N^2 and vanishes for w_1 \neq w_2, so equation 6.102 becomes

E[y_i y_k] = \frac{1}{N^2} \sum_{w_1=1}^{N} G_{w_1}^2 (C_i * M)_{w_1} (C_k * M)_{w_1}   (6.103)

The variance of any output i is given by

E[y_i^2] = \frac{1}{N^2} \sum_{w_1=1}^{N} (G_{w_1})^2 (C_i * M)_{w_1}^2   (6.104)

Analyzing the components of equation 6.104, we observe that the output variance is the product of two terms: (G_{w_1})^2, the power spectrum of the Gaussian mask, and (C_i * M)_{w_1}^2, the power spectrum of the weight vector smoothed by the Gaussian window M. If there is no window M, the variance of output i is given by

E[y_i^2] = \frac{1}{N^2} \sum_{w_1=1}^{N} (G_{w_1})^2 (C_{i,w_1})^2   (6.105)

From Shannon's theorem, we know that the eigenvectors {c_i} are the Fourier basis and the eigenvalues {\lambda_i} are Fourier components. Such sinusoidal eigenvectors, however, have unlimited spatial extent and infinitesimal bandwidth. By implementing the mask g, the filters acquire a non-infinitesimal bandwidth: the Gaussian mask g determines the spatial extent of the input and specifies the bandwidth of the filters.
Therefore we can say that the function of the Gaussian mask g is to force the eigenvectors to have a finite spatial width. Intuitively, we can understand this in the frequency domain. The first mask learns the first peak, which carries the maximum variance of the inputs; the second mask learns the next one; each successive mask learns the peak nearest to the previous one, and so on. Viewed as filters, the first is a low-pass filter and the rest are bandpass filters covering the frequency domain from low to high. Most of the input energy is distributed near DC; observing the signal shapes in the frequency domain, the input energy decreases as the frequency increases. We can see that the frequency responses of successive masks are ordered by frequency.

6.4 The matched filters

If the input is generated by a fixed vector in white random noise, the first principal component can be interpreted as a matched filter for the signal. Let n be a noise vector with covariance \sigma^2 I. Then the input vector is given as

x(t) = w + n(t)   (6.106)

The correlation matrix Q of x(t) becomes

Q = w w^T + E[n n^T] = w w^T + \sigma^2 I   (6.107)

If w is the first eigenvector, then we have

Q w = w w^T w + \sigma^2 w = w (\|w\|^2 + \sigma^2)   (6.108)

In equation 6.108, w is an eigenvector of Q with eigenvalue \|w\|^2 + \sigma^2. If c is any other eigenvector, orthogonal to w, then w^T c = 0 and

Q c = w w^T c + \sigma^2 c = \sigma^2 c   (6.109)

Equation 6.109 shows that any other eigenvector of Q is such a vector c with eigenvalue \sigma^2. Clearly, all the other eigenvectors have eigenvalues \sigma^2 < \|w\|^2 + \sigma^2. Equation 6.108 shows that w is the principal component of the data, and the Hebbian learning algorithm will find a matched filter for that vector. In the case of multiple signals, each eigenvector will converge to a matched filter for one of the input patterns.
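The matched-filter property of the first principal component can be verified numerically; a NumPy sketch in which the dimensions, sample count, and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, sigma = 32, 20000, 0.5
w = rng.normal(size=N)
w /= np.linalg.norm(w)                 # fixed unit signal vector

# x(t) = w + n(t): a fixed pattern in white noise (eq. 6.106)
X = w + sigma * rng.normal(size=(T, N))
Q = X.T @ X / T                        # sample correlation matrix

vals, vecs = np.linalg.eigh(Q)
c1 = vecs[:, -1]                       # eigenvector of the largest eigenvalue
# The first principal component is a matched filter for w ...
assert abs(np.dot(c1, w)) > 0.99
# ... with eigenvalue approximately ||w||^2 + sigma^2 (eq. 6.108)
assert np.isclose(vals[-1], 1.0 + sigma**2, rtol=0.05)
```

The leading eigenvector aligns (up to sign) with the embedded pattern, while all the remaining eigenvalues cluster around sigma^2, exactly as equations 6.108 and 6.109 predict.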
We have shown that the neural network performs as a matched filter from the stochastic-process viewpoint. The same result can be obtained in a linear vector space. Let the input vector be

x = W v   (6.110)

where W is a matrix composed of a set of fixed N-dimensional orthonormal vectors and v = [v_1, ..., v_M]^T is a vector of independent zero-mean Gaussian variables with variances \sigma_1^2, ..., \sigma_M^2. The correlation matrix Q is given as

Q = E[x x^T] = E[W v v^T W^T] = W \mathrm{diag}(\sigma_1^2, \ldots, \sigma_M^2) W^T = \sum_{i=1}^{M} w_i w_i^T \sigma_i^2   (6.111)

Assume w_k is the kth eigenvector; then we find the eigenvalue corresponding to this eigenvector from

Q w_k = \sum_{i=1}^{M} w_i w_i^T w_k \sigma_i^2   (6.112)

and since the w_i are assumed orthonormal, we have

Q w_k = w_k \|w_k\|^2 \sigma_k^2 = w_k \sigma_k^2   (6.113)

This shows that w_k is an eigenvector of Q with eigenvalue \sigma_k^2. Therefore, the extended Hebbian algorithm will converge to a matrix of M different "matched filters" for the different fixed patterns, and the vectors will be ordered by their average power \|w_k\|^2 \sigma_k^2.

6.5 The whitening filters

The network is called a whitening filter because the outputs of the neural network become uncorrelated with each other. After the training procedure, a linear filter transforms the environmental coloured noise into its principal uncorrelated components. The whitened outputs are then used as input vectors to improve the classification performance.

Figure 6.18: The whitened outputs for 5 categories of wood species

We are able to estimate x in an optimal form. Normally, the optimality principle chosen for an estimator is to minimize the squared error between the input x and the estimated vector \hat{x}, E[(x - \hat{x})^2].
An optimal mean-squared error estimator was suggested by Ljung in 1977 [35]:

\hat{x} = R_{nx} Q^{-1} x   (6.114)

where R_{nx} = E[n x^T] and Q is the covariance matrix of the inputs. A network trained by backpropagation converges to this optimal estimator, as proved by Stone in 1986 [56]. A further implementation of this filter in the GHA was given by Sanger in 1989 [52]. Many such results show that neural networks can be used as optimal filters to whiten coloured input vectors by minimizing the squared error between the input x and the estimate \hat{x}.

The whitening filter is common in several fields. Learning vector quantization is a frequently used technique for feature extraction; one of the most used algorithms is the Kohonen feature map. For signal processing, the KLT is a more advanced data compression technique because of its optimal quantization: it orders the eigenvalues in descending order so as to capture the maximal variance of the input information. Figure 6.18 shows that the reconstructed outputs are more easily distinguished than the original ones in Figure 2.8.

The connection between traditional methods of signal processing and modern approaches of neural networks has thus been explored. The network performs filter functions to process signals, and this motivates the application of NN to PR.

Chapter 7 The classification performance

Finally, the network performance is measured in terms of its ability to classify acoustic emission signals as a function of wood species. The first requirement is that the system be able to cope with environmental noise. Noise resistance is strongly desired so that the system can still perform well even with corrupted data. Generalization will then be analyzed in combination with the simulation results on noisy signals.
The purpose of introducing the heterogeneous network is to overcome the drawbacks of single multilayered networks; therefore the classification accuracy and efficiency of the heterogeneous network will be compared with those of backpropagation. Finally, simulations are run to classify single and mixed wood species.

7.1 The noise resistance

One of the capabilities of networks is to correct erroneous information when the input vectors are distorted. When dealing with real-world data, a system always has to estimate parameters from corrupted or incomplete data, so immunity to noise is one of the most important properties in pattern recognition. The term noise immunity describes the maximum noise level that can be added to a system while the system still maintains error-free operation. If the network learns a set of data in one environment, then tests in a changed environment should still be relatively robust. The immunity to noise is analyzed theoretically below for the learning rules based on the Hebbian algorithm. The simulations testing immunity to noise are run with different types of noise: process noise, Gaussian noise, and incomplete signals.

7.1.1 Capability of the system

The standard Hebbian training rule, which is designed to adjust the synaptic connections between neurons, is a typical learning algorithm for finding an auto-associative map between input and output. In the Hebbian algorithm, the weight increment, \Delta c_i = \gamma y_i x, is modified by following the errors
Final category decision is made in the output layer according to the encoded data in the hidden layer. The variance of the inputs is maximized in the hidden layer and the noise of the signals is compressed in this middle layer. Due to the lack of constraints, feature extraction is not optimal and difficult to be analyzed in BP. The noise resistance with the generalized Hebbian algorithm is analyzed below. In GHA, if uncorrelated noise with variance a 2 is added to the input vector g, then the noise vector will be n with covariance aq. Then we have the noisy input vector  x(t)= g n(t)  ^  (7.115)  The corresponding covariance matrix of the inputs becomes  Q  =  ggT E [ nn T] = ggT cr2I^  (7.116)  where cr 2 1 is the diagonal noise correlation matrix. We analyze the covariance matrix of the outputs A = CQC T^(7.117) Substituting the noisy input vector x(t)=g(t)+n(t),  CQC T = C(ggT  0.2 .0cT  =  A 4_ 0.21^ (7.118)  The output matrix is diagonal, and the eigenvectors of A + cr 2 1 are the same as the eigenvector of  Q. We thus have  (Q a 2 /)C T = (A + a 2 .0 CT^(7.119) The eigenvalues are given by A + a -2 in equation 7.119. If noise variance is less than the variance of the input vector, the noise will not significantly affect the values of eigenvectors [52j. Therefore, 62  Chapter 7: The classification performance  the weights, which are trained by the generalized Hebbian algorithm, are not very much influenced by the noise term in the input. The eigenvectors of the correlation matrix with noise are nearly the same as that of matrix without noise. Another interesting phenomenon is observed from the signal to noise ratio. The ratio is rearranged from xi in to Ai in. The higher the correlation of input, and the larger the initial eigenvalue 'Nan will be, the lesser influence of noise will be. In our simulation problem M < N, only the first few eigenvalues are needed, and the S/N is rearranged by descending order of eigenvalues. 
The S/N, \lambda_i/n, is naturally higher than that of the original form of the data, and the noise thus has less influence on the reconstructed signals.

7.1.2 The resistance to process noise

Wood is a natural material, and data acquisition is influenced by the operation of the process. Operating variables certainly have some effect on the power spectra, and process noise is defined to include this effect. The system, however, is required to be robust to environmental changes. Simulations are run with a group of real data from acoustic emission of wood chips. As analyzed in chapter 2, similarities exist between categories, and differences exist within the same category due to the operating variables. The error sources are mostly sensory inputs that are linearly inseparable or that lie close to the boundaries between classes. Figure 2.10 shows that a single category occupies more than one terrain in the topological map, and overlap exists between neighbouring classes.

We employ data preprocessing techniques and neural network strategies to make the information more consistent within the same category and more distinct between different categories. The overlap problem is automatically overcome by the generalized Hebbian algorithm, whose optimality principle is chosen to orthogonalize the input vectors. The system is designed to provide an orthogonal data space. Mathematically, the orthogonalized vectors are in the optimal data form for pattern recognition: the decomposed input vectors are uncorrelated, mutually orthogonal, and of maximal variance, so that they can be more easily discriminated. In this linearly inseparable problem, one source of inseparability is the influence of the operating parameters.
Further analysis shows that changes in plate gap do not alter the band structure of the power spectrum but only the peak intensities. Normalization with respect to the magnitude of the power spectra is therefore employed to reduce this diversity. With this strategy, we observe that the signals are more consistent within the categories, although some differences in frequency magnitude remain. On the other hand, because of normalization, we lose some entropy information by ignoring the signal energy. In traditional statistical approaches such as principal component analysis, the energy usually offers very important information as a descriptor of the signal characteristics.

Comparing the two approaches, the neural network and principal component analysis, we notice that they employ different models: the neural network implements a nonparametric model of pattern recognition, while a parametric model is used for principal component analysis. The neural network works more with information about the shape of the signals than with precise numerical calculation. Regardless of intensity, if signals have the same shape, the system recognizes them as belonging to the same category.

The following strategy is used to satisfy one of the algorithm's assumptions, thus making the network perform more efficiently. One of the assumptions of the GHA is that the input vectors have a zero-mean Gaussian distribution, but most problems cannot inherently satisfy this requirement. To be more consistent with this assumption, we can either subtract the mean from the signals or modify the basic GHA to a non-zero-mean form. The basic GHA for zero-mean inputs is

C(t+1) = C(t) + \gamma(t) \left( y(t) x^T(t) - \mathrm{Lower}\left[ y(t) y^T(t) \right] C(t) \right)   (7.120)

and it assumes the following form for non-zero-mean inputs:

C(t+1) = C(t) + \gamma(t) \left( y(t) x^T(t) - y(t) y^T(t) C(t) \right)   (7.121)

where Lower[·] retains the lower-triangular elements of the matrix.
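A minimal sketch of the zero-mean update rule (eq. 7.120) in NumPy; the learning rate, epoch count, and synthetic test data are illustrative assumptions:

```python
import numpy as np

def gha_train(X, m, lr=0.01, epochs=200, seed=0):
    """Sanger's generalized Hebbian algorithm (eq. 7.120):
    C <- C + lr * (y x^T - Lower[y y^T] C), with y = C x."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    C = 0.1 * rng.normal(size=(m, n))
    for _ in range(epochs):
        for x in X:
            y = C @ x
            C += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ C)
    return C

# Zero-mean data with one dominant direction
rng = np.random.default_rng(5)
X = rng.normal(size=(500, 8)) * np.array([3.0, 1.0] + [0.1] * 6)
X -= X.mean(axis=0)

C = gha_train(X, m=2)
# The rows converge to the leading eigenvectors of the data covariance
_, vecs = np.linalg.eigh(X.T @ X / len(X))
assert abs(np.dot(C[0] / np.linalg.norm(C[0]), vecs[:, -1])) > 0.95
```

The lower-triangular term deflates each output against the ones above it, which is what orders the extracted components by variance.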
In equation 7.120, the network produces a set of eigenvectors of the input covariance, while in equation 7.121 the system converges to the eigenvectors of the auto-correlation. In the former case, the network maximizes the variance of the input vectors. In the latter case, because of the non-zero-mean input distribution, the network only rearranges the signals so that the expectations of the magnitudes of the power spectra are in descending order.

In this group of simulations, we test the network with normalized input vectors with and without zero mean. The classification performance is listed in Table 7.2. With normalized, zero-mean inputs, the system achieves the best classification accuracy and efficiency. Non-zero-mean inputs cause slower convergence; the signals do not capture the maximal variance when the input distribution is not zero-mean Gaussian, so the reconstructed signals are not in the optimal form for pattern classification, and we cannot obtain the best classification performance. One of the simulations in Table 7.2 is tested with non-normalized input vectors. As noted, normalization is used in this identification to reduce the differences within a category; without normalization, the classification accuracy is degraded since the signals within a category are less consistent.

Techniques                  Classification accuracy   Training time (GHA)   Training time (BP)   Total training time
Normalized & zero mean      97.2%                     84                    20                   104
Normalized & non-zero mean  92.15%                    107                   43                   150
Non-normalized              89.4%                     150                   56                   206

Table 7.2 The classification performance with zero-mean and non-zero-mean inputs

7.1.3 The resistance to Gaussian noise

To test the capability of noise resistance, simulation runs are made with artificial Gaussian noise.
The noisy input patterns are generated by adding zero-mean Gaussian noise with a given standard deviation to the key pattern. From the system analysis, we know that noise resistance is improved by the data encoding technique: the reconstructed data captures the principal features with maximal variance while also maximizing the signal-to-noise ratio. If the noise variance is less than the variance of the input vectors, the noise is compressed in the process.

Figure 7.19: The classification performance with Gaussian noisy input (classification accuracy versus the standard deviation of the noise)

Figure 7.19 shows the classification performance for the noisy inputs. The performance is given for BP with both compressed data and original data, as indicated in the figure. In both cases, when the noise is 10% of the peak magnitude of the power spectra, the classification accuracy stays the same as without noise. As the noise magnitude increases further, the classification performance degrades. Analyzing Figure 7.19 further, we find that the classification accuracy is higher with compressed data when the noise level is below 0.07; above this noise level, the system achieves better classification accuracy with the original data than with the compressed data. This can be explained by the analysis in section 7.1, where the noise resistance of the whole system was analyzed for both BP and the GHA. In the process of data reconstruction, the signals are reconstructed according to the optimality rule, after which the signal-to-noise ratio is rearranged from x_i/n to \lambda_i/n. If the noise level is low, the signal-to-noise ratio \lambda_i/n of the first few components is higher than x_i/n of the original form; the noise is compressed and does not have much influence on the signals.
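The SNR improvement from keeping only the leading components can be illustrated with a synthetic fixed pattern in white noise (the sine pattern, dimensions, and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
N, T, sigma = 50, 2000, 0.5
g = np.sin(2 * np.pi * np.arange(N) / N)        # a fixed "key pattern"

X = g + sigma * rng.normal(size=(T, N))          # noisy observations
Q = X.T @ X / T
vals, vecs = np.linalg.eigh(Q)
c1 = vecs[:, -1]                                 # leading component

# Reconstruct each observation from the first principal component only
X_rec = np.outer(X @ c1, c1)
mse_raw = np.mean((X - g)**2)
mse_rec = np.mean((X_rec - g)**2)
assert mse_rec < mse_raw                         # the noise is compressed
```

Projecting onto the leading component retains the pattern but only the one-dimensional slice of the noise that lies along it, so the reconstruction error drops well below the raw noise floor.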
This shows that the encoding technique successfully enhances noise resistance. When the noise level is high, the signal-to-noise ratio \lambda_i/n correspondingly goes down, since the denominator n is larger. The encoded signals retain only a few components, so they are very sensitive to noise; the classification accuracy is then lower with the compressed representation than with the original one.

7.1.4 The recovery with incomplete signals

Many applications of neural networks also require that the network be able to fill in missing information: when the input is partially missing, the system should still converge to the correct outputs. To test the recovery capability of the system, the simulation is run using incomplete input patterns. The input patterns with missing bits are randomly generated according to a Gaussian distribution with given values. The performance, classification accuracy versus the number of missing neurons, is plotted for compressed data and original data respectively in Figure 7.20.

Figure 7.20: The classification performance with incomplete signals (classification accuracy versus the number of missing neurons)

The capability of filling in missing information is called auto-association. As is well known, auto-association refers to the capability of finding the topological map between input and output: whenever a portion of the input pattern is presented, the remainder of the pattern is to be filled in or completed. With both the generalized Hebbian algorithm and backpropagation, the system is good at auto-association. When the number of missing bits is up to 15% of the length of the input pattern, the system still achieves a classification accuracy of 96.2%. The performance in Figure 7.20 shows the same phenomenon as that in Figure 7.19.
When the input patterns are corrupted beyond a certain degree, the network performance degrades with compressed data: the classification accuracy deteriorates faster with compressed data than with the original vectors when the missing information reaches 20%. This indicates that if the compressed data is not of optimal size, the reconstruction makes the system more susceptible to noisy inputs or incomplete patterns. It is usually difficult to determine the optimal size analytically because of the many factors involved; in practice, the optimal compression size is decided by testing the classification performance.

Concluding from the simulations with noisy inputs and incomplete signals, we find that the system performs well with distorted signals. This network, designed according to the Hebbian algorithm, is a good example of an auto-associative system, and associative networks have satisfactory noise resistance. The idea of an associative network is also a good example of linking engineering methods with ideas from computer science and artificial intelligence: the mechanism that neural networks use to recognize patterns is similar to the pattern recognition process performed by the human brain.

7.2 Generalization

The use of networks to explore a new algorithm depends crucially on their capability to generalize. Generalization refers to the capability of producing a correct output for an input not seen during the training process: we train the network with finitely many patterns but test it on data drawn from an effectively unlimited set. The capability of generalization is strongly desired in practice. For instance, in our simulations we only have patterns that represent wood species under a few operating variables; it is impossible to include every change brought about by the many variable factors. In the identification of wood species, we want to know the species, which indicates the fibre properties.
This is a frequently encountered and very typical problem in pattern recognition, so generalization is a fundamental requirement. The test of noise resistance already gave a good example of generalization: the training is run with pure wood species or limited samples, and the test is made with corrupted signals. Further simulations are run to test the system performance with limited training patterns. The 76 original patterns are used to examine generalization by checking the classification accuracy. The data set is equally divided into a training set and a test set covering 15 different operating conditions. We gradually vary the training set size to see how the performance improves. Table 7.3 shows the classification accuracy tested with the different training sets.

Training patterns        5             10             15             38
Classification accuracy  68% (26/38)   92.1% (35/38)  94.7% (36/38)  97.2% (37/38)

Table 7.3 The classification performance with different training sets

The simulation results show good generalization. In the first trial, the training set includes only 5 patterns, representing the 5 categories, and the classification accuracy is quite poor. This can be explained from Figure 2.10: because of the diversity within a category, some categories occupy more than one terrain in the topological map, described in subspace terms as U_k = {{U_k1}, {U_k2}, ..., {U_kn}}. When the system is trained with only 5 patterns, the subterrains cannot all be covered during training by a single pattern per category. In the second trial with 10 patterns, two patterns are included for each category, most of the subterrains are covered by the better data representation, and the classification accuracy is considerably improved. Generalization can thus be explained by the inherent properties of the network.
In both backpropagation and the generalized Hebbian algorithm, the network has a special layer, the feature extraction layer, in which the network attempts to find the statistical distribution of the input vectors. Although signals differ, if they have similar statistical distributions, the encoding process provides the same representation for them; the distorted portion is compressed away in feature extraction.

One advantage is that neural networks can learn to perform tasks for which computational algorithms may not exist. This property is important for practical purposes: self-training networks provide the possibility of solving a problem that we do not completely understand. If a network can be trained to explore an algorithm from examples, then it might be able to generalize to solve the problem for a case it has not yet seen. If the algorithm is found, it means the network has found the fundamental structure of the problem, and any problem with the same structure will give the same result. In some cases, generalization is better with nonparametric identification than with numerical calculation.

7.3 The classification efficiency

The primary goal of using the heterogeneous network is to improve the classification efficiency. The classification performance is compared between the heterogeneous network and BP. Table 7.4 shows the classification performance with different inputs.

Input number   Training time (GHA)   Training time (BP)   Total training time   Training patterns   Classification accuracy
8              27                    24                   51                    160                 84%
10             41                    22                   63                    150                 92.1%
12             53                    19                   72                    120                 88.2%
14             67                    19                   86                    120                 94.7%
16             84                    20                   104                   130                 97.2%
18             105                   20                   125                   130                 97.2%
20             133                   22                   155                   120                 96.05%
22             157                   21                   178                   120                 97.2%
150            X                     172                  172                   480                 97.2%

Table 7.4 The classification performance with different inputs
The performance is tested with the original data set, in which the 76 patterns are equally divided into a training set and a testing set. The test with plain backpropagation is listed in Table 7.4 in order to compare the classification performance between the compressed data and the original data. The criterion for checking classification efficiency is set at the same classification accuracy: with the 150 original frequency components, the classification accuracy is 97.2%, so the classification efficiency is analyzed at a classification accuracy of 97.2%. Observing Table 7.4, we can see that classification efficiency is significantly improved by using the compression network. The system reduction contributes in terms of structure, training time, and iterations; the reduction performance is analyzed on those factors respectively below.

Figure 7.21: The training scheme (compression training over a moving window of the weight vectors 1-M, with threshold checking, followed by classification training)

7.3.1 The compression criteria

The first step in the tests is to find the optimal encoding size. Theoretically, the optimal encoding size is decided according to the descending order of the eigenvalues. But the performance of the network is limited by its inherent drawbacks, and the eigenvalues from the decomposition are not in the desired descending order. In practice, the encoding size is therefore not only theoretically estimated but also experimentally tested against the classification performance. Experimentally, the optimal encoding size is decided by two criteria: the first is that the system achieves the same classification accuracy with compressed data as with the original data, and the second is the reduction in training time. Two techniques are utilized to decide the optimal compression size in feature extraction, the moving window and the automatic threshold checking, shown in Figure 7.21.
The moving window is utilized to improve the system convergence, and the threshold checking is utilized to make the system process in parallel. Each time the window moves, the corresponding covariance matrix of the outputs is calculated as $\Lambda = E[yy^T] = CQC^T$, and the result is compared with a desired tolerance. If the eigenvalue, calculated from the outputs of the network, is below the threshold, training of the compression network is stopped and training of the classifier is started. But if the diagonal elements of the output covariance matrix are not in precisely decreasing order, the classification accuracy is checked to help decide the encoding size.

In our simulations, it is difficult to obtain the desired compression results; both the network and the signals themselves must be considered. First of all, as analyzed in Chapter 5, the neural network approach to data reconstruction is not exactly equivalent to traditional eigenvector decomposition. Secondly, the degree of correlation between input vectors is a decisive factor in classification performance: for different signals, the system achieves different performance. For these two reasons, the approximation precision of the compression cannot depend solely on the learning curve of the eigenvector decomposition. In order to check the classification efficiency, we check the training time for the whole system. The training times for the compression network and BP are shown in Figure 7.22. The training time for backpropagation on compressed data is significantly reduced by the large reduction in input dimensionality; the training curve of backpropagation is almost constant for small input sizes. On the other hand, the training time for compression increases rapidly with the number of outputs.
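The threshold check described above can be sketched as follows. The window contents, window size, and tolerance below are illustrative assumptions, and the diagonal of a sample covariance stands in for the approximated eigenvalues:

```python
# A sketch of the automatic threshold check: estimate the output
# covariance Lambda = E[y y^T] over a window of compression-network
# outputs and stop compression training once the last diagonal element
# (the smallest approximated eigenvalue) falls below a tolerance.
# Window contents and tolerance are illustrative assumptions.
def covariance_diagonal(window):
    """Diagonal of the sample covariance of a list of output vectors."""
    n, m = len(window), len(window[0])
    means = [sum(y[i] for y in window) / n for i in range(m)]
    return [sum((y[i] - means[i]) ** 2 for y in window) / n
            for i in range(m)]

def stop_compression(window, tol=1e-3):
    """True once the smallest output variance drops below tol."""
    return covariance_diagonal(window)[-1] < tol

# Toy window: the second output carries almost no variance, so the
# compression stage would stop training here.
window = [[1.0, 0.0010], [0.5, 0.0012], [-0.8, 0.0009], [1.2, 0.0011]]
print(stop_compression(window))  # -> True
```

When the diagonal is not in precisely decreasing order, the thesis falls back on checking classification accuracy directly, so this test is a stopping heuristic rather than a guarantee.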
The training curve of the compressor scales up as each additional eigenvector participates in the decomposition. We therefore have to consider the total training time for the system. For instance, with compression size 22, although the training time of BP is almost the same as with size 16, the training time of the compression network jumps from 84 to 157, so the total training time for the whole system is not reduced. The optimal compression size is thus crucial for the whole system.

Figure 7.22: The training time with different compression inputs (training time versus number of inputs for BP, GHA, and GHA&BP)

7.3.2 The structure reduction

The data reduction primarily contributes to the structure reduction, reflected in the reduced dimensionality of the input space [64]. In data reconstruction, we utilized two optimality principles: feature extraction and data compression. The reconstructed data attempt to include as much information as possible while keeping the network size as small as possible. The original data include some correlated portion, which does not help the classification performance but degrades it. After data encoding, the correlated and unimportant portions are removed from the original data. The reconstructed signals include the maximal variance of the input information. The compression scheme makes the signals more compact and makes the classification complete. The number of input neurons drops from 150 to 16, so a dimensionality reduction by a factor of about 9 is achieved in the simulation. The participation of the compression network gives the whole system more layers, but it makes the BP configuration more compact. The weight adjustment is more complicated in BP than in the linear network because of its learning algorithm.
7.3.3 The training time reduction

The structure reduction leads to training time reduction. In the heterogeneous network, the total training time for the whole system consists of two portions: the training times of the compression network and of the classification network. First of all, the training time is significantly reduced in BP, owing to the reduced input dimensionality. In backpropagation, the system must be presented with many patterns, and for each presentation every neuron requires two passes, a feedforward activation flow and a backward error flow; the training time therefore grows significantly for large input sizes. In Table 7.4, the training time drops dramatically as the number of input neurons decreases: with encoded data and 16 input neurons, backpropagation needs only 15% of the original training time. Secondly, the training time of the compression network increases rapidly with the participation of each additional output. As shown in Figure 7.22, the linear data reconstruction converges slowly when the number of outputs is large, and this cost cannot be ignored. Including the compressor's training time at compression size 16, the whole system needs roughly half the training time required with the full 150 components. Increasing the number of outputs further does not improve the classification accuracy but degrades the classification efficiency in terms of training time. The simulation results further confirm that the optimal compression size is crucial for the performance of the whole system. In practice, both classification accuracy and efficiency are important in pattern recognition.

7.3.4 The iteration reduction

A further consequence of the structure reduction is the iteration reduction.
In backpropagation, the compressed data not only reduce the system size but also cut down the iterations, making the system converge faster. With the original inputs, the network needs 480 pattern presentations, but only 130 are needed with compressed signals. Fewer patterns make BP converge faster under the same tolerance. Certainly, reduced input dimensionality means less time to calculate the synaptic connections, but this cannot be explained simply by the network size. We cannot explain exactly how networks perform recognition; there are many theories that explain memory function intuitively, but they are difficult to analyze mathematically. Memory reminiscence is a complicated process imitating the human brain. A network can identify signals through its capability of memory recall and pattern generalization. In traditional statistical approaches like principal component analysis, the approximation is based on calculating signal descriptors: the more descriptors, the more precise the calculation. The neural network is a model of nonparametric approximation. During the identification process, the network captures the shape of the signals rather than performing a precise calculation. A large input size provides more information, but the quality of the signals also influences the approximation process. The correlated information does not help classification; it costs more space and more time in both the training and the test processes. Thus obtaining the crucial information in minimal size makes the classification process more efficient. Feature extraction plays a very important role in pattern recognition, and the feature layer attempts to find the distribution of the input signals.

Figure 7.23: The learning curves (error versus number of iterations for the original and the compressed inputs)

As explained, the correlated information does not help the classification.
On the other hand, the system also needs a longer time to converge when the encoding provides insufficient information. In Table 7.4, with compression size 8, the network cannot offer sufficient information and the system needs more patterns to converge. If the compression size is optimal, the information is sufficient, and the network processes the signals more easily and efficiently.

7.3.5 The learning curve

The learning curves for backpropagation with compression input size 16 and with the full 150 input components are plotted in Figure 7.23. The dotted line is for the network with compressed input and the solid line for the network with the original input. Comparing the learning curves of pure backpropagation and the heterogeneous network, we observe that the learning curve of BP is smoother, which means that backpropagation takes longer to converge; the heterogeneous network converges faster than BP. BP is a gradient descent algorithm. The desired and actual outputs are compared at the output layer, and the modification of the weights is

$\Delta_p w_{ji} = \eta\, \delta_{pj}\, i_{pi}$    (7.122)

where $t_{pj}$ is the desired output, $o_{pj}$ is the actual output, and $\delta_{pj} = t_{pj} - o_{pj}$ is the error. This update decreases the energy

$E_p = \frac{1}{2} \sum_j (t_{pj} - o_{pj})^2$    (7.123)

The error is calculated and the weights are adjusted according to the difference between the desired and actual outputs. Each synaptic connection must be modified according to the learning rule. Certainly, the more neurons, the more paths between neurons and the more computational work.

| Number of neurons | Training time (GHA) | Training time (BP) | Total training time | Number of iterations | Classification accuracy |
| 150 | X  | 172 | 172 | 48 | 97.2% |
| 16  | 84 | 20  | 104 | 13 | 97.2% |

Table 7.5 The comparison of classification performance
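The output-layer delta rule of equations (7.122) and (7.123) can be illustrated with a minimal sketch. A single linear output layer is used here; the learning rate, sizes, and data are illustrative assumptions, not the thesis configuration:

```python
# Worked sketch of the output-layer delta rule:
# Delta_p w_ji = eta * delta_pj * i_pi, with energy
# E_p = (1/2) * sum_j (t_pj - o_pj)^2 as in equation (7.123).
# Learning rate, sizes, and data are illustrative assumptions.
def delta_rule_step(w, i_p, t_p, eta=0.1):
    """One gradient-descent update; returns the energy E_p before it."""
    m, n = len(w), len(i_p)
    o_p = [sum(w[j][i] * i_p[i] for i in range(n)) for j in range(m)]
    delta = [t_p[j] - o_p[j] for j in range(m)]          # delta_pj
    for j in range(m):
        for i in range(n):
            w[j][i] += eta * delta[j] * i_p[i]           # Delta_p w_ji
    return 0.5 * sum(d * d for d in delta)               # E_p

w = [[0.0, 0.0]]                        # one output neuron, two inputs
errors = [delta_rule_step(w, [1.0, 0.5], [1.0]) for _ in range(20)]
print(errors[0] > errors[-1])  # -> True: the energy decreases
```

Each presentation performs the two passes described above, a feedforward activation flow to compute o_p and a backward error flow to adjust the weights, which is why training cost grows with the number of input neurons.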
Furthermore, feature extraction also helps the system converge. Summarizing the above, we conclude that the classification efficiency is improved by the compression network in terms of structure reduction, training time reduction and iteration reduction. Compressed signals need fewer inputs, fewer iterations, and a shorter training time for the whole system. This series of reductions leads to improvement in both classification accuracy and efficiency. In practice, the structure reduction offers additional advantages for hardware implementation of ANNs: with current VLSI techniques it is difficult to build huge circuits, and a small-dimensional representation reduces the memory and storage requirements and the hardware implementation cost.

7.4 The classification accuracy

BP is currently a very popular and quite successful training scheme for pattern classification. Using BP, the classification accuracy can usually reach the desired level, which is about 90%. The significance of using the compression network is to improve the classification efficiency. The system performances of backpropagation and the heterogeneous network are compared in Table 7.5. The GHA and BP, used in the linear network and the nonlinear network respectively, are based on the standard Hebbian algorithm. With such learning rules, the topological map between input and output is gradually established through the training process, so the networks inherently have an associative property. Kolmogorov's theorem indicates that BP can reach any arbitrary accuracy level with a hidden layer of sufficient size. Table 7.5 shows the results: using BP, the classification accuracy reaches 97.2%, and the data compression is carried out without sacrificing classification accuracy. The significance of the compression work is its structure reduction and training time reduction for the whole system.
7.5 The single and mixed wood species

As analyzed in Chapter 2, the wood mix significantly influences the refining process and pulp quality; therefore, it is necessary to know whether the wood being processed is a single species or a mixture of wood species. In the refining process, the feed consists of an unknown mixture of a number of known species. We do not intend to consider complicated cases in this thesis; we simply simulate mixed signals of two wood species. We average the power spectra of two species and obtain 10 combination patterns which represent the pairwise mixtures of the 5 categories, as shown in Figure 7.24. Thus the test set of mixed species, which originally contained 38 patterns, consists of 90 patterns representing mixed signals of two different species. We choose BP as the learning algorithm of the network and train it on the 5 pure wood species. To accommodate the mixed species, we calibrate them and establish a look-up table to indicate the components of each mixture. In the test stage, the network is tested with a mixture of single and mixed wood species. First, each signal is checked to distinguish whether it comes from a single species or a mixture. If it is single, the network indicates its category; otherwise, the network checks the look-up table to find the corresponding components of the mixed species. If a signal belongs to neither a single species nor a mixture, we say that it is spurious. The test scheme is shown in Figure 7.25.

In the test procedure, the network generates four responses; any pattern will converge to one of the four possible responses during the test, either single or mixed species. The responses are defined as follows: y = [y(1), ..., y(m)] is the output vector, with neuron k the firing neuron. Let y_k be a scalar of the
output vector y corresponding to a category in the case of a single species, while y_k and y_j are scalars in the case of mixed species. Let y be defined as a vector with components

$y = \alpha y_k + \beta y_j + Q \sum_{r \neq k,j} y_r$    (7.124)

The four possible network responses are listed in Table 7.6.

Figure 7.24: The mixture of wood species (averaged power spectra of the pairwise mixtures)

Figure 7.25: The testing procedure for mixtures (the scheme of testing mixtures and the classification distribution)

| Classification | α | β | Q |
| Single correct | 1 | 0 | 0 |
| Single incorrect | 0 | 0 | 1 |
| Mixed correct | 1 | 1 | 0 |
| Mixed confusion | 0 | 0 | >2 |

Table 7.6 The four responses of the system for various values of α, β and Q

In single species cases, with β=0, if α=1 and Q=0, then the system correctly classifies the input patterns. The case of α=0 and Q=1 is an incorrect response; the signals cannot converge to the correct patterns, and probably converge to a neighboring category because of the overlap with it. As indicated in Section 7.3, a classification accuracy of 97.2% is achieved for single wood species.

In the single species case, each class of patterns is represented by a unique active neuron in y. If input pattern x causes activity in a cell in y, it is classified as the category associated with that cell.
In mixed species cases, the components of the mixture are represented by two active neurons in y; in a correct identification, these two active neurons correspond to the component categories. For instance, for the mixture of categories 1 and 3, the most active neurons in y are at positions 1 and 3, which we indicate as the desired outputs. As shown in the testing scheme of Figure 7.25, if the network does not converge to a single-species category, then we check for a mixture. In the case of the response characterized by α=1, β=1 and Q=0, the system converges to the correct mixture, and the corresponding components of the pattern are found in the look-up table. In the case of α=0, β=0, and Q>2, the classification of the mixed species fails, and we say that the pattern is spurious. The classification distribution of single and mixed species is shown in Figure 7.25. We test the network with a testing set of single and mixed species. In fact, 30% of the mixed species are indicated as a single species that is one component of the mixture. Beyond this, another 15% of the mixed species cannot be correctly classified; we say that they belong to spurious species. Concluding from the simulation results, we know that the power spectra of the mixed species still include their principal features, which indicate their physical properties. If a network trained only on single species is capable of handling mixed species, this indicates that the neural network is capable of generalization.

The simulations run to identify acoustic emission signals of wood species have shown good results in terms of noise resistance, capability of generalization, classification accuracy and efficiency. The tests used to identify whether a signal comes from a single species or a mixture further prove that the network generalizes well.
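The four-response decision of Table 7.6 can be sketched as a simple rule on the active output neurons. The activation threshold and the look-up table entries below are hypothetical, chosen only to illustrate the scheme:

```python
# Hedged sketch of the four-response decision: count the output
# neurons active above a threshold. One active neuron -> single
# species; two active neurons found in the look-up table -> a correct
# mixture; anything else -> spurious. The threshold and the look-up
# table entries are illustrative assumptions, not the thesis's values.
MIX_TABLE = {(0, 2): "species 1 + species 3"}   # hypothetical entry

def classify(y, threshold=0.5):
    """Map an output vector to ('single', k), ('mixed', ...) or spurious."""
    active = [k for k, v in enumerate(y) if v > threshold]
    if len(active) == 1:
        return ("single", active[0])
    if len(active) == 2 and tuple(active) in MIX_TABLE:
        return ("mixed", MIX_TABLE[tuple(active)])
    return ("spurious", None)

print(classify([0.9, 0.1, 0.0, 0.1, 0.0]))  # -> ('single', 0)
print(classify([0.8, 0.1, 0.7, 0.1, 0.0]))  # -> ('mixed', 'species 1 + species 3')
print(classify([0.6, 0.6, 0.6, 0.1, 0.0]))  # -> ('spurious', None)
```

The three calls correspond to the "single correct", "mixed correct", and "mixed confusion" rows of Table 7.6 respectively.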
Chapter 8 Conclusion

8.1 Summary of the thesis

Simulation results show that neural networks can potentially solve practical pattern recognition problems. In this thesis, a neural network is successfully implemented as a classifier to identify acoustic emission signals of wood species, and the desired classification accuracy is achieved by a heterogeneous configuration. The heterogeneous network is designed to perform classification in two steps: linear feature extraction and nonlinear classification. Unsupervised learning is implemented in a compression network for feature extraction and data compression. The outputs of the compression network are used as inputs to a classification network, where supervised learning is utilized to make the category decisions. The connection between the traditional statistical methods and the modern neural network approaches is mathematically analyzed and experimentally demonstrated with numerous simulations. This application strongly shows that it is natural to use neural networks as classifiers. The Karhunen-Loeve Transform, a classical method of data reconstruction, is approximated by a neural network designed according to the generalized Hebbian algorithm. During this process, the network performs signal processing functions in the fashion of an eigenvector decomposer in feature extraction, a Gabor filter in data compression, and a whitening filter in the decorrelation of the input vectors. The reconstructed data with maximum variance are used as the representation of the original data, improving the classification efficiency. Simulations are performed to identify the acoustic emission signals of wood species. The expected classification accuracy is achieved with both the BP algorithm and the GHA. Simulation results show that the system has good noise immunity and generalization.
With the inherent noise immunity of the Hebbian learning algorithm, the system can achieve good classification accuracy even with distorted data.

8.2 Contributions

This thesis has made the following contributions to the application of neural networks:

1. A systematic analysis of the functions of neural networks in signal processing and pattern recognition.
2. The achievement of the desired classification results using a heterogeneous neural network.
3. Experiments on the identification of single and mixed wood species.

8.3 Future work

With current theories, there are still a number of unknowns in the use of neural networks. Although the introduction of the linear network makes the system easier to understand, most applications still depend mainly on experimental work. Based on the previous discussion of the current GHA, further research is needed to improve the performance of feature extraction. When neural networks are used to approximate the KLT, the network is very sensitive to the system parameters, such as the network size, the learning rate and the initial weight values. Specifically, the performance of the eigenvector decomposition needs to be improved when the number of approximated eigenvectors is large. The next consideration is the structure of the neural networks. The current system is composed of two networks connected in sequence: training of the top network cannot start until the bottom one has finished. The network should be modified to operate in parallel instead of sequentially. Furthermore, a more advanced network could be formed by combining the two networks into one structure, in which linear feature extraction and nonlinear classification are implemented in a single network with different layers. A desirable goal of future research is to improve the performance of automatic identification so that the requirements of the pulp and paper industry are satisfied.
References

[1] J. A. Anderson, A. Pellionisz, and E. Rosenfeld. Neurocomputing 2. 1990.

[2] T. M. Apostol. Calculus. 1969.

[3] K. J. Åström. Introduction to Stochastic Control Theory. Mathematics in Science and Engineering, Vol. 70, pp. 13-43, 1970.

[4] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2, pp. 53-58, 1989.

[5] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967.

[6] J. G. Daugman. Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 36, No. 7, July 1988.

[7] S. P. Day. A CMOS VLSI implementation of a self-organizing feed-forward neural network. Ph.D. thesis proposal, The University of British Columbia, 1989.

[8] B. S. Dayal. Application of acoustic emission technology to thermomechanical pulp refining. Technical report, Paprican, 1989.

[9] G. A. Dumont. Control techniques in the pulp and paper industry. Control and Dynamic Systems, Vol. 37, pp. 106-111, 1990.

[10] G. A. Dumont, A. P. Wade, and O. Lee. Acoustic emission monitoring of wood chip refiners. Patent, 1990.

[11] G. A. Dumont, A. P. Wade, and O. Lee. Acoustic emission monitoring of chip refiners. UBC Pulp and Paper Centre, 1991.

[12] O. K. Ersoy and D. Hong. Parallel, self-organizing, hierarchical neural networks. IEEE Transactions on Neural Networks, Vol. 1, No. 2, June 1990.

[13] D. Gabor. Theory of communication. J. Inst. Elec. Eng., Vol. 93, pp. 429-459, 1946.

[14] W. A. Gardner. Introduction to Random Processes with Applications to Signals & Systems. McGraw-Hill, 1990.

[15] H. P. Graf, L. D. Jackel, and W. E. Hubbard. VLSI implementation of a neural network model. IEEE Computer, March 1988, pp. 27-32, 1988.

[16] S. Grossberg.
Neural Networks and Natural Intelligence. MIT Press, 1988.

[17] D. O. Hebb. The Organization of Behavior. Science Editions, New York, 1949.

[18] J. J. Hopfield. The effectiveness of analogue 'neural network' hardware. Network, Vol. 1, pp. 27-40, 1990.

[19] H. Hotelling. Analysis of a complex of statistical variables into principal components. J. Educ. Psychology, 24, pp. 417-441, 1933.

[20] T. Hrycej. A non-competitive model for unsupervised learning. Proc. Eleventh International Joint Conference on Artificial Intelligence, Detroit, pp. 170-175, 1989.

[21] T. Hrycej. Self-organization by delta rule. Proceedings IEEE, pp. 307-312, 1990.

[22] J. C. C. Ip. Reinforcement learning in neural networks with multiple outputs. Master's thesis, 1990.

[23] A. K. Jain. Fundamentals of Digital Image Processing. Prentice Hall, Englewood Cliffs, NJ 07632, 1989.

[24] W. H. James. Acoustic emission monitoring of wood chip refiners. Patent, 1990.

[25] K. Karhunen. Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae, Ser. A.I. 37, Helsinki, 1947.

[26] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, pp. 59-69, 1982.

[27] T. Kohonen. The 'neural' phonetic typewriter. IEEE Computer, 1988.

[28] T. Kohonen. The self-organizing map. Proceedings of the IEEE, Vol. 78, No. 9, pp. 1464-1480, September 1990.

[29] B. Kosko. Stochastic competitive learning. International Joint Conference on Neural Networks, Vol. 2, pp. 215-226, 1990.

[30] Y. Le Cun. A learning scheme for asymmetric threshold networks. Proceedings of Cognitiva 85, pp. 599-604, Paris, France, 1985.

[31] T. Leen, M. Rudnick, and D. Hammerstrom. Hebbian feature discovery improves classifier efficiency. Proceedings IEEE, Vol. 1, pp. 51-56, 1990.

[32] E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks.
Proceedings of the IEEE, Vol. 78, No. 10, pp. 1568-1574, October 1990.

[33] R. Linsker. Self-organization in a perceptual network. IEEE Computer, 1988.

[34] W. A. Little. The existence of persistent states in the brain. Mathematical Biosciences, 19, pp. 101-120, 1974.

[35] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Trans. Automatic Control, Vol. 22, pp. 551-575, 1977.

[36] M. Loève. Fonctions aléatoires de second ordre. In P. Lévy, Processus Stochastiques et Mouvement Brownien. Hermann, Paris, France, 1948.

[37] H. A. Malki and A. Moghaddamjoo. Using the Karhunen-Loeve transformation in the back-propagation training algorithm. IEEE Transactions on Neural Networks, Vol. 2, No. 1, pp. 162-165, 1991.

[38] A. N. Michel, J. Si, and G. Yen. Analysis and synthesis of a class of discrete-time neural networks described on hypercubes. IEEE Transactions on Neural Networks, Vol. 2, No. 1, January 1991.

[39] R. H. Nielsen. Neurocomputing. 1990.

[40] E. Oja. A simplified neuron model as a principal component analyzer. J. Math. Biology, 15, pp. 267-273, 1982.

[41] E. Oja. Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1(1), pp. 61-68, 1989.

[42] E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Math. Anal. and Appl., Vol. 106, pp. 69-84, 1985.

[43] E. Oja and T. Kohonen. The subspace learning algorithm as a formalism for pattern recognition and neural networks. Proceedings IEEE International Conference on Neural Networks, San Diego, pp. 277-284, 1988.

[44] Y. H. Pao. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, MA, 1989.

[45] D. B. Parker. Learning-logic. Tech. Report TR-47, Sloan School of Management, MIT, 1985.

[46] D. L. Reilly and L. N. Cooper. A neural model for category learning. Biological Cybernetics, 45, pp. 35-41, 1982.

[47] D.
Rosenblatt, A. Lelu, and A. Georgel. Learning in a single pass: a neural model for principal component analysis and linear regression. Proceedings of the First IEE International Conference on Artificial Neural Networks, London, pp. 252-256, 1989.

[48] F. Rosenblatt. Principles of Neurodynamics. Spartan Books, New York, 1962.

[49] M. W. Roth. Survey of neural network technology for automatic target recognition. IEEE Transactions on Neural Networks, Vol. 1, No. 1, pp. 28-43, 1990.

[50] D. E. Rumelhart. Parallel Distributed Processing. MIT Press, pp. 318-364, 1989.

[51] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(9), pp. 533-536, 1986.

[52] T. D. Sanger. Optimal unsupervised learning in feedforward neural networks. MIT AI Tech Report 1086, 1989.

[53] T. D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, Vol. 2, No. 6, pp. 459-473, 1989.

[54] T. D. Sanger. Analysis of the two-dimensional receptive fields learned by the generalized Hebbian algorithm in response to random input. Biological Cybernetics, Vol. 63, pp. 221-238, 1990.

[55] D. F. Specht. Probabilistic neural networks. Neural Networks, Vol. 3, pp. 109-118, 1990.

[56] G. O. Stone. An analysis of the delta rule and the learning of statistical associations. Parallel Distributed Processing, Chapter 11, pp. 444-459, 1986.

[57] A. P. Wade and D. B. Sibbald. An analytical perspective on acoustic emission. Analytical Chemistry, Vol. 63, No. 9, May 1, 1991.

[58] A. P. Wade, D. B. Sibbald, and K. A. Soulsury. A rules-based approach to signal classification for chemical acoustic emission. Presentation at the Pulp and Paper Centre, 1991.

[59] P. D. Wasserman. Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York, 1989.

[60] A. R. Webb and D. Lowe.
The optimised internal representation of multilayer classifier networks performs nonlinear discriminant analysis. Neural Networks, Vol. 3, pp. 367-375, 1990.

[61] P. D. Wentzell and A. P. Wade. Chemical acoustic emission analysis in the frequency domain. Analytical Chemistry, 61, 2638, 1991.

[62] P. J. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, 1974.

[63] B. Widrow and M. E. Hoff. Adaptive switching circuits. In IRE WESCON Conv. Record, Part 4, pp. 96-104, 1960.

[64] J. Yang and G. A. Dumont. Classification of acoustic emission signals via Hebbian feature extraction. International Joint Conference on Neural Networks, Seattle, Vol. 1, pp. 113-118, 1991.

