NONLINEAR MULTIVARIATE AND TIME SERIES ANALYSIS BY NEURAL NETWORK METHODS

William W. Hsieh
Department of Earth and Ocean Sciences, University of British Columbia, Vancouver, British Columbia, Canada

Received 29 March 2002; revised 6 November 2003; accepted 12 November 2003; published 18 March 2004.

[1] Methods in multivariate statistical analysis are essential for working with large amounts of geophysical data, whether from observational arrays, from satellites, or from numerical model output. In classical multivariate statistical analysis there is a hierarchy of methods, starting with linear regression at the base, followed by principal component analysis (PCA), and finally canonical correlation analysis (CCA). A multivariate time series method, singular spectrum analysis (SSA), has been a fruitful extension of the PCA technique. The common drawback of these classical methods is that only linear structures can be extracted correctly from the data. Since the late 1980s, neural network methods have become popular for performing nonlinear regression and classification. More recently, neural network methods have been extended to perform nonlinear PCA (NLPCA), nonlinear CCA (NLCCA), and nonlinear SSA (NLSSA). This paper presents a unified view of the NLPCA, NLCCA, and NLSSA techniques and their applications to various data sets of the atmosphere and the ocean (especially for the El Niño-Southern Oscillation and the stratospheric quasi-biennial oscillation). These data sets reveal that the linear methods are often too simplistic to describe real-world systems, with a tendency to scatter a single oscillatory phenomenon into numerous unphysical modes or higher harmonics, a problem largely alleviated in the new nonlinear paradigm.

INDEX TERMS: 3299 Mathematical Geophysics: General or miscellaneous; 3394 Meteorology and Atmospheric Dynamics: Instruments and techniques; 4294 Oceanography: General: Instruments and techniques; 4522 Oceanography: Physical: El Niño. KEYWORDS: neural networks, principal component analysis, canonical correlation analysis, singular spectrum analysis, El Niño.

Citation: Hsieh, W. W. (2004), Nonlinear multivariate and time series analysis by neural network methods, Rev. Geophys., 42, RG1003, doi:10.1029/2002RG000112.

1. INTRODUCTION

[2] In a standard text on classical multivariate statistical analysis [e.g., Mardia et al., 1979] the chapters typically proceed from linear regression to principal component analysis and then to canonical correlation analysis. In regression one tries to find how the response variable $y$ is linearly affected by the predictor variables $\mathbf{x} = [x_1, \ldots, x_l]$, i.e.,

$$y = \mathbf{r} \cdot \mathbf{x} + r_0 + \epsilon, \qquad (1)$$

where $\epsilon$ is the error (or residual), and the regression coefficients $\mathbf{r}$ and $r_0$ are found by minimizing the mean of $\epsilon^2$.
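As a minimal numerical sketch of equation (1), the least squares solution for $\mathbf{r}$ and $r_0$ can be obtained in a few lines; the synthetic data and coefficient values below are purely illustrative and not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, l = 200, 3                                   # N samples, l predictors
x = rng.normal(size=(N, l))                     # predictor variables x_1, ..., x_l
r_true = np.array([1.5, -0.7, 0.2])             # illustrative "true" coefficients
y = x @ r_true + 0.4 + 0.1 * rng.normal(size=N) # response with noise epsilon

# Append a column of ones so r_0 is estimated along with r;
# least squares minimizes the mean of epsilon^2, as in equation (1).
X = np.column_stack([x, np.ones(N)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
r, r0 = coef[:-1], coef[-1]
```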
1.1. Principal Component Analysis

[3] However, in many data sets one cannot separate the variables into predictors and responses. For instance, one may have a data set of monthly sea surface temperatures (SST) collected at 1000 grid locations over several decades; that is, the data set is of the form $\mathbf{x}(t) = [x_1, \ldots, x_l]$, where each variable $x_i$ ($i = 1, \ldots, l$) has $N$ samples labeled by the index $t$. Very often $t$ is simply the time, and each $x_i$ is a time series of $N$ observations. Principal component analysis (PCA), also known as empirical orthogonal function (EOF) analysis, looks for $u$, a linear combination of the $x_i$, and an associated vector $\mathbf{a}$, with

$$u(t) = \mathbf{a} \cdot \mathbf{x}(t), \qquad (2)$$

such that $\langle \| \mathbf{x}(t) - \mathbf{a}\,u(t) \|^2 \rangle$ is minimized, where $\langle \cdot \rangle$ denotes a sample or time mean. Here $u$, called the first principal component (PC) (or score), is often a time series, while $\mathbf{a}$, the first eigenvector of the data covariance matrix (also called an EOF or loading), often describes a spatially standing oscillation pattern. Together $u$ and $\mathbf{a}$ make up the first PCA mode. In essence, the data set is approximated by a straight line (oriented in the direction of $\mathbf{a}$) that accounts for the maximum amount of variance in the data; in a scatterplot of the data, the straight line found by PCA passes through the "middle" of the data set. From the residual, $\mathbf{x} - \mathbf{a}u$, the second PCA mode can similarly be extracted, and so on for the higher modes. In practice, the common algorithms for PCA extract all modes simultaneously [Jolliffe, 2002; Preisendorfer, 1988]. By retaining only the leading modes, PCA is commonly used to reduce the dimensionality of a data set and to extract its main patterns. PCA has also been extended to the singular spectrum analysis (SSA) technique for time series analysis [Elsner and Tsonis, 1996; von Storch and Zwiers, 1999; Ghil et al., 2002].
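A minimal sketch of extracting the first PCA mode of equation (2) via the singular value decomposition (one of several equivalent algorithms); the function name is illustrative.

```python
import numpy as np

def pca_first_mode(X):
    """X: (N, l) data matrix, one row per sample x(t).
    Returns the first PC time series u(t) and eigenvector a,
    minimizing <|x(t) - a u(t)|^2> as in equation (2)."""
    Xc = X - X.mean(axis=0)                  # remove the time mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    a = Vt[0]                                # first eigenvector (EOF/loading)
    u = Xc @ a                               # first principal component (score)
    return u, a
```

Higher modes correspond to the subsequent rows of `Vt`, which is why standard algorithms extract all modes simultaneously.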
1.2. Canonical Correlation Analysis

[4] Next consider two data sets $\{x_i(t)\}$ and $\{y_j(t)\}$, each with $N$ samples. We group the $\{x_i(t)\}$ variables to form the vector $\mathbf{x}(t)$ and the $\{y_j(t)\}$ variables to form the vector $\mathbf{y}(t)$. Canonical correlation analysis (CCA) [Mardia et al., 1979; Bretherton et al., 1992; von Storch and Zwiers, 1999] looks for the linear combinations

$$u(t) = \mathbf{a} \cdot \mathbf{x}(t), \qquad v(t) = \mathbf{b} \cdot \mathbf{y}(t), \qquad (3)$$

where the canonical variates $u$ and $v$ have maximum correlation; that is, the weight vectors $\mathbf{a}$ and $\mathbf{b}$ are chosen so that cor($u$, $v$), the Pearson correlation coefficient between $u$ and $v$, is maximized. For instance, if $\mathbf{x}(t)$ is the sea level pressure (SLP) field and $\mathbf{y}(t)$ is the SST field, then CCA can be used to extract the correlated spatial patterns $\mathbf{a}$ and $\mathbf{b}$ in the SLP and SST fields. Unlike regression, which studies how each $y_j$ is related to the $\mathbf{x}$ variables, CCA examines how the entire $\mathbf{y}$ field is related to the $\mathbf{x}$ field. This holistic view has made CCA popular [Barnett and Preisendorfer, 1987; Barnston and Ropelewski, 1992; Shabbar and Barnston, 1996].

[5] In the environmental sciences, researchers work with large data sets ranging from satellite images of the Earth's surface and global climate data to voluminous output from large numerical models. Multivariate techniques such as PCA and CCA have become indispensable for extracting essential information from these massive data sets [von Storch and Zwiers, 1999]. However, the restriction to linear relations means that nonlinear relations are either missed or misinterpreted by these methods. The introduction of nonlinear multivariate and time series techniques is crucial to further advancement in the environmental sciences.

1.3. Feed Forward Neural Network Models

[6] Nonlinear neural network (NN) models originated from research trying to understand how the brain functions with its networks of interconnected neurons [McCulloch and Pitts, 1943]. There are many types of NN models; some are mainly of interest to neurological researchers, while others are general nonlinear data techniques. There are now many good textbooks on NN models [Bishop, 1995; Rojas, 1996; Ripley, 1996; Cherkassky and Mulier, 1998; Haykin, 1999].

[7] The most widely used NN models are the feed forward NNs, also called multilayer perceptrons [Rumelhart et al., 1986], which perform nonlinear regression and classification. The basic architecture (Figure 1) consists of a layer of input neurons $x_i$ (a "neuron" is simply a variable in NN jargon) linked to one or more layers of "hidden" neurons, which are, in turn, linked to a layer of output neurons $y_j$. In Figure 1 there is only one layer of hidden neurons $h_k$. A transfer function (an "activation" function in NN jargon) maps from the inputs to the hidden neurons. There is a variety of choices for the transfer function, the hyperbolic tangent being a common one, i.e.,

$$h_k = \tanh\Big( \sum_i w_{ki} x_i + b_k \Big), \qquad (4)$$

where $w_{ki}$ and $b_k$ are the weight and bias parameters, respectively. The $\tanh(z)$ function is sigmoidal shaped; its two asymptotic values of $\pm 1$ as $z \to \pm\infty$ can be viewed as the two states of a neuron (at rest or activated), depending on the strength of the excitation $z$. (If there is more than one hidden layer, equations of the same form as equation (4) are used to calculate the values of the next layer of hidden neurons from the current layer.) When the feed forward NN is used for nonlinear regression, the output neurons $y_j$ are usually calculated as a linear combination of the neurons in the preceding layer, i.e.,

$$y_j = \sum_k \tilde{w}_{jk} h_k + \tilde{b}_j. \qquad (5)$$

[8] Given observed data $y_{oj}$, the optimal values of the weight and bias parameters ($w_{ki}$, $\tilde{w}_{jk}$, $b_k$, and $\tilde{b}_j$) are found by "training" the NN, i.e., performing a nonlinear optimization in which the cost function (or objective function)

$$J = \Big\langle \sum_j \big( y_j - y_{oj} \big)^2 \Big\rangle \qquad (6)$$

is minimized, $J$ being simply the mean squared error (MSE) of the output. The NN has then found a set of nonlinear regression relations $y_j = f_j(\mathbf{x})$. To approximate a set of continuous functions $f_j$, only one layer of hidden neurons is needed, provided enough hidden neurons are used in that layer [Hornik et al., 1989; Cybenko, 1989]. The NN with one hidden layer is commonly called a two-layer NN, as there are two layers of mapping (equations (4) and (5)) going from input to output; however, there are other conventions for counting layers, and some authors refer to our two-layer NN as a three-layer NN since there are three layers of neurons.

Figure 1. Schematic diagram of the feed forward neural network (NN) model, with one "hidden" layer of neurons (i.e., variables) (denoted by circles) sandwiched between the input layer and the output layer. In the feed forward NN model, information flows only forward, starting from the input neurons. Increasing the number of hidden neurons increases the number of model parameters. Adapted from Hsieh and Tang [1998].
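The forward pass of equations (4) and (5) and the MSE cost of equation (6) translate directly into a few lines of NumPy; this is only a sketch of the mapping, not of any particular training scheme, and the function names are illustrative.

```python
import numpy as np

def nn_forward(x, W, b, W2, b2):
    """One-hidden-layer feed forward NN.
    x: (l,) inputs; W: (k, l) and b: (k,) hidden weights/biases;
    W2: (m, k) and b2: (m,) output weights/biases."""
    h = np.tanh(W @ x + b)       # hidden neurons, equation (4)
    y = W2 @ h + b2              # linear output neurons, equation (5)
    return y

def cost(Y_pred, Y_obs):
    """Mean squared error cost over all samples, equation (6)."""
    return np.mean(np.sum((Y_pred - Y_obs) ** 2, axis=-1))
```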
1.4. Local Minima and Overfitting

[9] The main difficulty of the NN method is that the nonlinear optimization often encounters multiple local minima in the cost function: starting from different initial guesses for the parameters, the optimization algorithm may converge to different local minima. Many approaches have been proposed to alleviate this problem [Bishop, 1995; Hsieh and Tang, 1998]; a common one is to run multiple optimizations starting from different random initial parameters so that, hopefully, not all runs will be stranded at shallow local minima.

[10] Another pitfall of the NN method is overfitting, i.e., fitting to the noise in the data, a consequence of the tremendous flexibility of the NN. With enough hidden neurons the NN can fit the data, including the noise, to arbitrary accuracy; thus, for a network with many parameters, reaching the global minimum may mean nothing more than finding a badly overfitted solution. Usually only a portion of the data record is used to train (i.e., fit) the NN model, with the remainder reserved to validate the model. If too many hidden neurons are used, the NN fit to the training data will be excellent, but the fit to the validation data will be poor, allowing researchers to gauge the appropriate number of hidden neurons. During the optimization it is also common to monitor the MSE over the training data and over the validation data separately. As the number of iterations of the optimization algorithm increases, the MSE over the training data decreases; however, beyond a certain number of iterations, the MSE over the validation data begins to increase, indicating the start of overfitting and hence the appropriate time to stop the optimization. Another way to avoid overfitting is to add weight penalty terms to the cost function, as discussed in Appendix A. Yet another is to compute an ensemble of NN models starting from different random initial parameters; the mean of the ensemble of NN solutions tends to be smoother than the individual solutions.
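The validation-based stopping rule just described can be illustrated with a deliberately simple sketch: a linear model trained by gradient descent while the validation MSE is monitored. The data, model, learning rate, and patience threshold below are illustrative stand-ins, not the settings used in the papers reviewed here.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=300)     # noisy target
Xtr, ytr, Xva, yva = X[:200], y[:200], X[200:], y[200:]

w = np.zeros(4)                       # simple linear model standing in for a NN
lr, best_val, best_w, patience = 0.01, np.inf, w.copy(), 0
for it in range(5000):
    grad = 2 * Xtr.T @ (Xtr @ w - ytr) / len(ytr)    # gradient of training MSE
    w = w - lr * grad
    val = np.mean((Xva @ w - yva) ** 2)              # validation MSE
    if val < best_val:
        best_val, best_w, patience = val, w.copy(), 0
    else:
        patience += 1
        if patience > 50:   # validation MSE no longer improving: stop early
            break
w = best_w                  # keep the parameters with the best validation MSE
```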
[11] If forecast skills are to be estimated, another unused part of the data record has to be reserved as independent test data, as the validation data have already been used to determine the model architecture. Some authors interchange the terminology for "validation" data and "test" data; the terminology here follows Bishop [1995]. For poor quality data sets (e.g., short, noisy records) the problems of local minima and overfitting can render nonlinear NN methods incapable of offering any advantage over linear methods.

[12] The feed forward NN has been applied to a variety of nonlinear regression and classification problems in environmental sciences such as meteorology and oceanography and has been reviewed by Gardner and Dorling [1998] and Hsieh and Tang [1998]. Recent applications include tornado diagnosis [Marzban, 2000], efficient radiative transfer computation in atmospheric general circulation models [Chevallier et al., 2000], multiparameter satellite retrievals from the Special Sensor Microwave/Imager [Gemmill and Krasnopolsky, 1999], wind retrieval from scatterometer [Richaume et al., 2000], adaptive nonlinear model output statistics [Yuval and Hsieh, 2003], efficient computation of seawater density or salinity from a nonlinear equation of state [Krasnopolsky et al., 2000], tropical Pacific sea surface temperature prediction [Tang et al., 2000; Yuval, 2001], and an empirical atmosphere in a hybrid coupled atmosphere-ocean model of the tropical Pacific [Tang and Hsieh, 2002]. For NN applications in geophysics (seismic exploration, well log lithology determination, electromagnetic exploration, and earthquake seismology), see Sandham and Leggett [2003].

[13] To keep within the scope of a review paper, I omit the numerous fine papers on using NN for nonlinear regression and classification and focus on how the feed forward NN can be extended from its original role of nonlinear regression to nonlinear PCA (section 2), nonlinear CCA (section 3), and nonlinear SSA (section 4), illustrated by examples from the tropical Pacific atmosphere-ocean interactions and the equatorial stratospheric wind variations. These examples reveal various disadvantages of the linear methods, the most common being the tendency to scatter a single oscillatory phenomenon into numerous modes or higher harmonics.

2. NONLINEAR PRINCIPAL COMPONENT ANALYSIS (NLPCA)

2.1. Open Curves

[14] As PCA finds a straight line passing through the "middle" of the data cluster, the obvious next step is to generalize the straight line to a curve. Kramer [1991] proposed a neural network based NLPCA model in which the straight line is replaced by a continuous open curve for approximating the data.

[15] The fundamental difference between NLPCA and PCA is that PCA allows only a linear mapping (equation (2)) between $\mathbf{x}$ and the PC $u$, while NLPCA allows a nonlinear mapping. To perform NLPCA, the feed forward NN in Figure 2a contains three hidden layers of neurons between the input and output layers of variables. The NLPCA is basically a standard feed forward NN with four layers of transfer functions mapping from the inputs to the outputs. One can view the NLPCA network as two standard two-layer feed forward NNs placed one after the other. The first two-layer network maps from the inputs $\mathbf{x}$ through a hidden layer to the bottleneck layer with only one neuron $u$, i.e., a nonlinear mapping $u = f(\mathbf{x})$. The next two-layer feed forward NN inversely maps from the nonlinear PC (NLPC) $u$ back to the original higher-dimensional $\mathbf{x}$ space, with the objective that the outputs $\mathbf{x}' = \mathbf{g}(u)$ be as close as possible to the inputs $\mathbf{x}$ (thus the NN is said to be autoassociative). Note that $\mathbf{g}(u)$ nonlinearly generates a curve in the $\mathbf{x}$ space and hence a one-dimensional (1-D) approximation of the original data. To minimize the MSE of this approximation, the cost function $J = \langle \| \mathbf{x} - \mathbf{x}' \|^2 \rangle$ is minimized to solve for the weight and bias parameters of the NN. Squeezing the input information through a bottleneck layer with only one neuron accomplishes the dimensional reduction. Details of the NLPCA are given in Appendix A.

[16] In effect, the linear relation (2) in PCA is generalized to $u = f(\mathbf{x})$, where $f$ can be any nonlinear continuous function representable by a feed forward NN mapping from the input layer to the bottleneck layer; and instead of $\langle \| \mathbf{x}(t) - \mathbf{a}\,u(t) \|^2 \rangle$, $\langle \| \mathbf{x} - \mathbf{g}(u) \|^2 \rangle$ is minimized. The residual, $\mathbf{x} - \mathbf{g}(u)$, can be input into the same network to extract the second NLPCA mode, and so on for the higher modes.

[17] That the classical PCA is indeed a linear version of this NLPCA can be readily seen by replacing all the transfer functions with the identity function, thereby removing the nonlinear modeling capability of the NLPCA. Then the forward map to $u$ involves only a linear combination of the original variables, as in PCA.
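A bare-bones sketch of a Kramer-type autoassociative NN for a one-dimensional NLPC is given below. It omits the weight penalty and multiple random restarts described in Appendix A, and it uses SciPy's general-purpose optimizer (with numerical gradients) purely for brevity; the function name, layer sizes, and optimizer choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def nlpca_1d(X, m=3, seed=0):
    """Minimal NLPCA sketch: X (N, l) -> encoding layer (m tanh neurons)
    -> bottleneck u (1 neuron) -> decoding layer (m tanh neurons)
    -> output x' (l neurons), minimizing J = <|x - x'|^2>."""
    N, l = X.shape
    sizes = [(m, l), (m,), (1, m), (1,), (m, 1), (m,), (l, m), (l,)]
    idx = np.cumsum([0] + [int(np.prod(s)) for s in sizes])

    def unpack(p):
        return [p[idx[i]:idx[i + 1]].reshape(sizes[i]) for i in range(len(sizes))]

    def forward(p):
        W1, b1, W2, b2, W3, b3, W4, b4 = unpack(p)
        h1 = np.tanh(X @ W1.T + b1)     # encoding layer
        u = h1 @ W2.T + b2              # bottleneck (the NLPC)
        h2 = np.tanh(u @ W3.T + b3)     # decoding layer
        Xp = h2 @ W4.T + b4             # output x', a curve g(u) in x space
        return u.ravel(), Xp

    def cost(p):
        _, Xp = forward(p)
        return np.mean(np.sum((X - Xp) ** 2, axis=1))

    rng = np.random.default_rng(seed)
    p0 = 0.5 * rng.normal(size=idx[-1])
    res = minimize(cost, p0, method="L-BFGS-B")
    return forward(res.x)               # NLPC u(t) and approximation x'(t)
```

Usage would be `u, Xp = nlpca_1d(X)`; in practice one would rerun from several random `seed` values and keep the solution with the lowest cost, for the local minima reasons discussed in section 1.4.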
[18] The NLPCA has been applied to the radiometric inversion of atmospheric profiles [Del Frate and Schiavon, 1999] and to the Lorenz [1963] three-component chaotic system [Monahan, 2000; Hsieh, 2001a]. For the tropical Pacific climate variability, the NLPCA has been used to study the SST field [Monahan, 2001; Hsieh, 2001a] and the SLP field [Monahan, 2001]. The Northern Hemisphere atmospheric variability [Monahan et al., 2000, 2001], the Canadian surface air temperature [Wu et al., 2002], and the subsurface thermal structure of the Pacific Ocean [Tang and Hsieh, 2003] have also been investigated by the NLPCA.

[19] In the classical linear approach there is a well-known dichotomy between PCA and rotated PCA (RPCA) [Richman, 1986]. PCA seeks the linear mode that accounts for the most variance of the data set. However, as illustrated by Preisendorfer [1988, Figure 7.3], the resulting eigenvectors may not align closely with local data clusters, so they may not represent actual physical states well. One application of RPCA methods is to rotate the PCA eigenvectors so that they point closer to the local clusters of data points [Preisendorfer, 1988]. The rotated eigenvectors may thus bear greater resemblance to actual physical states (though they account for less variance) than the unrotated eigenvectors; hence RPCA is also widely used [Richman, 1986; von Storch and Zwiers, 1999]. As there are many possible criteria for rotation, there are many RPCA schemes, of which the varimax scheme [Kaiser, 1958] is perhaps the most popular.

[20] The tropical Pacific climate system contains the famous interannual variability known as the El Niño-Southern Oscillation (ENSO), a coupled atmosphere-ocean interaction involving the oceanic phenomenon El Niño and the associated atmospheric phenomenon, the Southern Oscillation. The coupled interaction results in anomalously warm SST in the eastern equatorial Pacific during El Niño episodes and cool SST in the central equatorial Pacific during La Niña episodes [Philander, 1990; Diaz and Markgraf, 2000]. ENSO is an irregular oscillation, but spectral analysis does reveal a broad spectral peak at the 4- to 5-year period. Hsieh [2001a] used the tropical Pacific SST data (1950–1999) to make a three-way comparison between NLPCA, RPCA, and PCA. The tropical Pacific SST anomaly (SSTA) data (i.e., the SST data with the climatological seasonal cycle removed) were prefiltered by PCA, with only the three leading modes retained. PCA modes 1, 2, and 3 accounted for 51.4%, 10.1%, and 7.2%, respectively, of the variance in the SSTA data. The first three PCs (PC1, PC2, and PC3) were used as the input x for the NLPCA network.

Figure 2. (a) Schematic diagram of the NN model for calculating the nonlinear principal component analysis (NLPCA). There are three layers of hidden neurons sandwiched between the input layer x on the left and the output layer x′ on the right. Next to the input layer is the encoding layer, followed by the "bottleneck" layer (with a single neuron u), which is then followed by the decoding layer. A nonlinear function maps from the higher-dimension input space to the one-dimensional bottleneck space, followed by an inverse transform mapping from the bottleneck space back to the original space represented by the outputs, which are to be as close to the inputs as possible by minimizing the cost function J = ⟨‖x − x′‖²⟩. Data compression is achieved by the bottleneck, with the bottleneck neuron giving u, the nonlinear principal component (NLPC). (b) Schematic diagram of the NN model for calculating the NLPCA with a circular node at the bottleneck (NLPCA(cir)). Instead of one bottleneck neuron u, there are two neurons p and q constrained to lie on a unit circle in the p-q plane, so there is only one free angular variable θ, the NLPC. This network is suited for extracting a closed curve solution. From Hsieh [2001a], reprinted with permission from Blackwell Science, Oxford.
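For reference, one common NumPy formulation of Kaiser's varimax iteration, which is used in the three-way comparison that follows, is sketched below. Normalization and convergence details vary between implementations, so treat this as an assumed sketch rather than the rotation code used in the studies cited here.

```python
import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    """Varimax rotation of a loading matrix L (p variables x k modes),
    via the classical SVD-based iteration; returns rotated loadings."""
    p, k = L.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag((Lr ** 2).sum(axis=0))))
        R = u @ vt                      # best orthogonal rotation this step
        if s.sum() < var * (1 + tol):   # varimax criterion no longer improving
            break
        var = s.sum()
    return L @ R
```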
[21] The data are shown as dots in a scatterplot in the PC1-PC2 plane (Figure 3), where the cool La Niña states lie in the top left corner and the warm El Niño states lie in the top right corner. The NLPCA solution is a U-shaped curve linking the La Niña states at one end (low u) to the El Niño states at the other end (high u), similar to the solution found originally by Monahan [2001]. In contrast, the first PCA eigenvector lies along the horizontal line and the second along the vertical line (Figure 3); neither comes close to the El Niño states in the top right corner or the La Niña states in the top left corner, demonstrating the inadequacy of PCA. For comparison, a varimax rotation [Kaiser, 1958; Preisendorfer, 1988] was applied to the first three PCA eigenvectors. (The varimax criterion can be applied to either the loadings or the PCs depending on one's objectives [Richman, 1986; Preisendorfer, 1988]; here it is applied to the PCs.) The resulting first RPCA eigenvector, shown as a dashed line in Figure 3, spears through the cluster of El Niño states in the top right corner, thereby yielding a more accurate description of the El Niño anomalies (Figure 4c) than the first PCA mode (Figure 4a), which does not fully represent the intense warming of Peruvian waters. The second RPCA eigenvector, also shown as a dashed line in Figure 3, does not improve much on the second PCA mode, with the PCA spatial pattern shown in Figure 4b and the RPCA pattern in Figure 4d. In terms of variance explained, the first NLPCA mode explained 56.6% of the variance, versus 51.4% explained by the first PCA mode and 47.2% by the first RPCA mode.

[22] With the NLPCA, for a given value of the NLPC u, one can map from u to the three PCs. This is done by assigning the value u to the bottleneck neuron and mapping forward through the second half of the network in Figure 2a. Each of the three PCs can be multiplied by its associated PCA (spatial) eigenvector, and the three added together to yield the spatial pattern for that particular value of u. Unlike PCA, which gives the same spatial anomaly pattern except for changes in amplitude as the PC varies, the NLPCA spatial pattern generally varies continuously as the NLPC changes. Figures 4e and 4f show the spatial anomaly patterns when u has its maximum value (corresponding to the strongest El Niño) and when u has its minimum value (strongest La Niña), respectively. Clearly, the asymmetry between El Niño and La Niña (i.e., the cool anomalies during La Niña episodes (Figure 4f) are centered much farther west than the warm anomalies during El Niño (Figure 4e) [Hoerling et al., 1997]) is well captured by the first NLPCA mode; in contrast, PCA mode 1 gives a La Niña that is simply the mirror image of the El Niño (Figure 4a). While El Niño has been known to Peruvian fishermen for many centuries because of its strong SSTA off the coast of Peru and its devastation of the Peruvian fishery, La Niña, with its weak manifestation in Peruvian waters, was not appreciated until the last two decades of the twentieth century.

Figure 3. Scatterplot of the sea surface temperature (SST) anomaly (SSTA) data (shown as dots) in the PC1-PC2 plane, with the El Niño states lying in the top right corner and the La Niña states in the top left corner. The PC2 axis is stretched relative to the PC1 axis for better visualization. The first-mode NLPCA approximation to the data is shown by the (overlapping) small circles, which trace out a U-shaped curve. The first principal component analysis (PCA) eigenvector is oriented along the horizontal line, and the second along the vertical line. The varimax method rotates the two PCA eigenvectors counterclockwise, with the rotated PCA (RPCA) eigenvectors oriented along the dashed lines. (As the varimax method generates an orthogonal rotation, the angle between the two RPCA eigenvectors is 90° in the three-dimensional PC1-PC2-PC3 space.) From Hsieh [2001a], reprinted with permission from Blackwell Science, Oxford.
[23] In summary, PCA is used for two main purposes: (1) to reduce the dimensionality of the data set and (2) to extract features or recognize patterns from the data set. It is the second purpose where PCA can be improved upon. Both RPCA and NLPCA take the PCs from PCA as input. However, instead of multiplying the PCs by a fixed orthonormal rotation matrix, as in the varimax RPCA approach, NLPCA performs a nonlinear mapping of the PCs. RPCA sacrifices some of the variance explained, but by rotating the PCA eigenvectors, the RPCA eigenvectors tend to point more toward local data clusters and are therefore more representative of physical states than the PCA eigenvectors.

[24] With a linear approach it is generally impossible for a solution simultaneously to (1) explain maximum global variance of the data set and (2) approach local data clusters; hence the dichotomy between PCA and RPCA, with PCA aiming for objective 1 and RPCA for objective 2. Hsieh [2001a] pointed out that with the more flexible NLPCA method, both objectives 1 and 2 may be attained together; thus the nonlinearity in NLPCA unifies the PCA and RPCA approaches. It is easy to see why the dichotomy between PCA and RPCA in the linear approach automatically vanishes in the nonlinear approach: by increasing m, the number of hidden neurons in the encoding layer (and the decoding layer), the solution is capable of passing through all local data clusters while maximizing the global variance explained. (In fact, for large enough m, NLPCA can pass through all data points, although this will in general give an undesirable, overfitted solution.)

Figure 4. SSTA patterns (in °C) of the PCA, RPCA, and NLPCA. (a) First and (b) second PCA spatial modes (both with their corresponding PCs at maximum value). (c) First and (d) second varimax RPCA spatial modes (both with their corresponding RPCs at maximum value). Anomaly patterns as the NLPC u of the first NLPCA mode varies from (e) its maximum (strong El Niño) to (f) its minimum (strong La Niña). With a contour interval of 0.5°C, positive contours are shown as solid curves, negative contours as dashed curves, and the zero contour as a thick curve. Adapted from Hsieh [2001a], reprinted with permission from Blackwell Science, Oxford.
[25] The tropical Pacific SST example illustrates that with a complicated oscillation like the El Niño-La Niña phenomenon, a linear method such as PCA scatters the nonlinear mode into several linear modes. (In fact, all three leading PCA modes are related to this phenomenon.) This brings to mind the famous parable of the blind men and their disparate descriptions of an elephant; hence the importance of NLPCA as a unifier of the separate linear modes. In the study of climate variability the wide use of PCA methods has created the somewhat misleading view that our climate is dominated by a number of spatially fixed oscillatory patterns, which is in fact an artifact of the linear method. Applying NLPCA to the tropical Pacific SSTA, we found no spatially fixed oscillatory patterns but an oscillation evolving in space as well as in time.

2.2. Closed Curves

[26] The NLPCA is capable of finding a continuous open curve solution, but many geophysical phenomena involve waves or quasiperiodic fluctuations, which call for a continuous closed curve solution. Kirby and Miranda [1996] introduced an NLPCA with a circular node at the network bottleneck (henceforth referred to as the NLPCA(cir)), so that the NLPC, as represented by the circular node, is an angular variable θ, and the NLPCA(cir) is capable of approximating the data by a closed continuous curve. Figure 2b shows the NLPCA(cir) network, which is almost identical to the NLPCA of Figure 2a except at the bottleneck, where there are now two neurons p and q constrained to lie on a unit circle in the p-q plane, leaving only one free angular variable θ, the NLPC. Details of the NLPCA(cir) are given in Appendix B.
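A minimal sketch of the circular bottleneck constraint: the two bottleneck pre-activations are normalized onto the unit circle, leaving the angle θ as the single free NLPC. It assumes the pre-activation pair stays away from the origin (in practice the formulation in Appendix B keeps the radius from vanishing); the function name is illustrative.

```python
import numpy as np

def circular_node(p_tilde, q_tilde):
    """Circular bottleneck of NLPCA(cir): normalize the two
    pre-activation values onto the unit circle p^2 + q^2 = 1."""
    r = np.sqrt(p_tilde ** 2 + q_tilde ** 2)   # assumed nonzero
    p, q = p_tilde / r, q_tilde / r
    theta = np.arctan2(q, p)                   # the angular NLPC, in (-pi, pi]
    return p, q, theta
```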
[27] The NLPCA(cir) has been applied to the tropical Pacific SST [Hsieh, 2001a] and to the equatorial stratospheric zonal wind (i.e., the east-west component of the wind) for the quasi-biennial oscillation (QBO) [Hamilton and Hsieh, 2002]. The QBO dominates over the annual cycle and other variations in the equatorial stratosphere, with the period of oscillation varying roughly between 22 and 32 months and averaging about 28 months. After the 45-year means were removed, the zonal wind u at seven vertical levels in the stratosphere became the seven inputs to the NLPCA(cir) network. The NLPCA(cir) mode 1 solution is a closed curve in a seven-dimensional space; the system goes around the closed curve once as the NLPC θ varies through one cycle of the QBO. Figure 5 shows the solution in three of the seven dimensions, namely, the wind anomalies at the 70-, 30-, and 10-hPa pressure levels (corresponding to elevations ranging roughly between 20 and 30 km above sea level). The NLPCA(cir) mode 1 explains 94.8% of the variance. For comparison, the linear PCA yields seven modes explaining 57.8%, 35.4%, 3.1%, 2.1%, 0.8%, 0.5%, and 0.3% of the variance, respectively. To compare with the NLPCA(cir) mode 1, Hamilton and Hsieh [2002] constructed a linear model of θ: in the plane spanned by PC1 and PC2 (each normalized by its standard deviation), an angle θ can be defined as the arctangent of the ratio of the two normalized PCA coefficients. This linear model accounts for 83.0% of the variance in the zonal wind, considerably less than the 94.8% accounted for by the NLPCA(cir) mode 1. The QBO as θ varies over one cycle is shown in Figure 6 for the NLPCA(cir) mode 1 and for the linear model. The observed strong asymmetries between the easterly and westerly phases of the QBO [Hamilton, 1998; Baldwin et al., 2001] are captured by the nonlinear mode but not by the linear mode.

Figure 5. NLPCA(cir) mode 1 solution for the equatorial stratospheric zonal wind shown by the (overlapping) circles, with the data shown as dots. For comparison, the PCA mode 1 solution is shown as a thin straight line. Only three of the seven dimensions are shown, namely, u at the top, middle, and bottom levels (10, 30, and 70 hPa). (a)-(c) Two-dimensional views. (d) Three-dimensional view. From Hamilton and Hsieh [2002].

[28] The actual time series of the wind measured at a particular height level is somewhat noisy, and it is often desirable to have a smoother representation of the QBO time series that captures the essential features at all vertical levels. Also, the reversal of the wind from westerly to easterly and vice versa occurs at different times at different height levels, making it difficult to define the phase of the QBO. Hamilton and Hsieh [2002] found that the phase of the QBO as defined by the NLPC θ is more accurate than previous attempts to characterize the phase, leading to a stronger link between the QBO and Northern Hemisphere polar stratospheric temperatures in winter (the Holton-Tan effect) [Holton and Tan, 1980] than previously found.
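The linear phase model used for comparison above reduces to a couple of lines; the four-quadrant `arctan2` is used here as the natural robust form of "the arctangent of the ratio of the two normalized PCA coefficients", and the function name is illustrative.

```python
import numpy as np

def linear_qbo_phase(pc1, pc2):
    """Linear phase model: normalize PC1 and PC2 by their standard
    deviations and take the angle in the normalized PC1-PC2 plane."""
    z1, z2 = pc1 / pc1.std(), pc2 / pc2.std()
    return np.arctan2(z2, z1)      # phase angle theta in (-pi, pi]
```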
2.3. Other Approaches (Principal Curves and Self-Organizing Maps)

[29] Besides the autoassociative NN, several other approaches have been developed to generalize PCA [Cherkassky and Mulier, 1998]. The principal curve method [Hastie and Stuetzle, 1989; Hastie et al., 2001] finds a nonlinear curve that passes through the middle of the data points. Developed originally in the statistics community, this method does not appear to have been applied in the environmental sciences or geophysics. There is a subtle but important difference between NLPCA (by autoassociative NN) and principal curves. In the principal curve approach, each point in the data space is projected to the point on the curve closest to it. In the NLPCA approach, while the mean square error (hence distance) between the data points and their projections is minimized, it is only the mean that is minimized; there is no guarantee that an individual data point will be mapped to the closest point on the curve found by NLPCA. Hence, unlike the projection in principal curves, the projection used in NLPCA is suboptimal [Malthouse, 1998]. However, NLPCA has an advantage over the principal curve method in that its NN architecture provides a continuous (and differentiable) mapping function.

[30] Newbigging et al. [2003] used the principal curve projection concept to improve the NLPCA solution. Malthouse [1998] compared principal curves with the NLPCA model by autoassociative NN. Unfortunately, when testing a closed curve solution, he used NLPCA instead of NLPCA(cir) (which would have extracted the closed curve easily), and so concluded that the NLPCA was not satisfactory for extracting closed curve solutions.

[31] Another popular NN method is the self-organizing map (SOM) [Kohonen, 1982, 2001], used widely for clustering. Since this approach fits a grid (usually a 1-D or 2-D grid) to a data set, it can be thought of as a discrete version of nonlinear PCA [Cherkassky and Mulier, 1998]. SOM has been applied to the clustering of winter daily precipitation data [Cavazos, 1999], to satellite ocean color classification [Yacoub et al., 2001], and to high-dimensional hyperspectral Airborne Visible/Infrared Imaging Spectrometer data to classify the geology of the land surface [Villmann et al., 2003]. For seismic data, SOM has been used to identify and classify multiple events [Essenreiter et al., 2001] and in well log calibration [Taner et al., 2001].

Figure 6. (a) Contour plot of the NLPCA(cir) mode 1 zonal wind anomalies as a function of pressure and θ_weighted, where θ_weighted is θ weighted by the histogram distribution of θ. Thus θ_weighted is more representative of actual time during a cycle than θ. Contour interval is 5 m s⁻¹, with westerly winds indicated by solid lines, easterlies by dashed lines, and zero contours by thick lines. (b) Similar plot for a linear circular model of θ_weighted. From Hamilton and Hsieh [2002].

[32] Another way to generalize PCA is via independent component analysis (ICA) [Comon, 1994; Hyvärinen et al., 2001], which was developed from information theory and has been applied to study the tropical Pacific SST variability by Aires et al. [2000]. Since ICA uses higher-order statistics (e.g., kurtosis, which is very sensitive to outliers), it may not be robust enough for the noisy data sets commonly encountered in climate or seismic studies (T. Ulrych, personal communication, 2003).

3. NONLINEAR CANONICAL CORRELATION ANALYSIS (NLCCA)

[33] While many techniques have been developed to nonlinearly generalize PCA, there has been much less activity in developing nonlinear CCA. A number of different approaches have recently been proposed [Lai and Fyfe, 1999, 2000; Hsieh, 2000; Melzer et al., 2003]. Hsieh [2000] proposed using three feed forward NNs to accomplish NLCCA, where the linear mappings in equation (3) for the CCA are replaced by nonlinear mapping functions using two-layer feed forward NNs. The mappings from x to u and from y to v are represented by the double-barreled NN on the left-hand side of Figure 7. By minimizing the cost function J = −cor(u, v), one finds the parameters that maximize the correlation cor(u, v). After the forward mapping with the double-barreled NN has been solved, inverse mappings from the canonical variates u and v to the original variables, represented by the two standard feed forward NNs on the right side of Figure 7, are solved, where the MSE of their outputs x′ and y′ are minimized with respect to x and y, respectively. For details see Appendix C.

[34] Consider the following test problem from Hsieh [2000]. Let

$$X_1 = t - 0.3t^2, \quad X_2 = t + 0.3t^3, \quad X_3 = t^2, \qquad (7)$$

$$Y_1 = \tilde{t}^3, \quad Y_2 = -\tilde{t} + 0.3\tilde{t}^3, \quad Y_3 = \tilde{t} + 0.3\tilde{t}^2, \qquad (8)$$

where $t$ and $\tilde{t}$ are uniformly distributed random numbers in $[-1, 1]$. Also let

$$X'_1 = -s - 0.3s^2, \quad X'_2 = s - 0.3s^3, \quad X'_3 = -s^4, \qquad (9)$$

$$Y'_1 = \mathrm{sech}(4s), \quad Y'_2 = s + 0.3s^3, \quad Y'_3 = s - 0.3s^2, \qquad (10)$$

where $s$ is a uniformly distributed random number in $[-1, 1]$. The shapes described by the X and X′ vector functions are displayed in Figure 8, and those described by Y and Y′ in Figure 9.
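The test data of equations (7)-(10) can be generated directly; the sketch below applies the "mode 2 has one third the variance of mode 1" scaling per variable, which is one plausible reading of the description, and the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
t, tt = rng.uniform(-1, 1, N), rng.uniform(-1, 1, N)   # independent t and t~
s = rng.uniform(-1, 1, N)                               # shared sequence s

# First (mutually unrelated) modes, equations (7) and (8)
X = np.column_stack([t - 0.3 * t**2, t + 0.3 * t**3, t**2])
Y = np.column_stack([tt**3, -tt + 0.3 * tt**3, tt + 0.3 * tt**2])

# Second (correlated) modes, equations (9) and (10); sech = 1/cosh
Xp = np.column_stack([-s - 0.3 * s**2, s - 0.3 * s**3, -s**4])
Yp = np.column_stack([1 / np.cosh(4 * s), s + 0.3 * s**3, s - 0.3 * s**2])

# Scale mode 2 to one third the variance of mode 1, add the modes,
# add Gaussian noise at 10% of the signal standard deviation, standardize.
Xp *= np.sqrt(X.var(axis=0) / (3 * Xp.var(axis=0)))
Yp *= np.sqrt(Y.var(axis=0) / (3 * Yp.var(axis=0)))
x, y = X + Xp, Y + Yp
x += 0.1 * x.std(axis=0) * rng.normal(size=x.shape)
y += 0.1 * y.std(axis=0) * rng.normal(size=y.shape)
x = (x - x.mean(axis=0)) / x.std(axis=0)
y = (y - y.mean(axis=0)) / y.std(axis=0)
```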
To lowest order, equation (7) for X describes a quadratic curve, and equation (9) for X′ a quartic. Similarly, to lowest order, Y is a cubic and Y′ a hyperbolic secant. The signal in the test data was produced by adding the second mode (X′, Y′) to the first mode (X, Y), with the variance of the second mode being one third that of the first. A small amount of Gaussian random noise, with standard deviation equal to 10% of the signal standard deviation, was also added, and the data set of N = 500 points was then standardized (i.e., each variable had its mean removed and was normalized by its standard deviation). Note that different sequences of random numbers tₙ and t̃ₙ (n = 1, . . ., N) were used to generate the first modes X and Y, respectively; hence these two dominant modes in the x space and the y space are unrelated. In contrast, as X′ and Y′ were generated from the same sequence of random numbers sₙ, they are strongly related. The NLCCA was applied to the data, and the first NLCCA mode retrieved (Figures 10 and 11) resembles the expected theoretical mode (X′, Y′). This is quite remarkable considering that X′ and Y′ have only one third the variance of X and Y; that is, the NLCCA ignores the large variance of X and Y and succeeds in detecting the nonlinearly correlated mode (X′, Y′). In contrast, if NLPCA is applied to x and y separately, the first NLPCA mode retrieved from x will be X, and the first mode from y will be Y. This illustrates the essential difference between NLPCA and NLCCA.

Figure 7. The three feed forward NNs used to perform nonlinear canonical correlation analysis (NLCCA). The double-barreled NN on the left maps from the inputs x and y to the canonical variates u and v, respectively. The cost function J forces the correlation between u and v to be maximized. On the right side, the top NN maps from u to the output layer x′, with the cost function J1 basically minimizing the mean square error (MSE) of x′ relative to x; the third NN maps from v to the output layer y′, with the cost function J2 basically minimizing the MSE of y′ relative to y. From Hsieh [2001b], reprinted with permission from the American Meteorological Society.

[35] The NLCCA has been applied to analyze the tropical Pacific sea level pressure anomaly (SLPA) and SSTA fields [Hsieh, 2001b], where the six leading PCs of the SLPA and the six PCs of the SSTA during 1950–2000 were the inputs to the NLCCA model. The first NLCCA mode is plotted in the PC spaces of the SLPA and the SSTA (Figure 12), where only the three leading PCs are shown. For the SLPA (Figure 12a) in the PC1-PC2 plane, the La Niña states are in the left corner (corresponding to low u values), while the El Niño states are in the top right corner (high u values). The CCA solutions are shown as thin straight lines. For the SSTA (Figure 12b) in the PC1-PC2 plane, the first NLCCA mode is a U-shaped curve linking the La Niña states in the top left corner (low v values) to the El Niño states in the top right corner (high v values). In general, the nonlinearity is greater in the SSTA than in the SLPA, as the difference between the CCA mode and the NLCCA mode is greater in Figure 12b than in Figure 12a.

Figure 12. NLCCA mode 1 between the tropical Pacific (a) sea level pressure anomaly (SLPA) and (b) SSTA, plotted as (overlapping) squares in the PC1-PC2-PC3 three-dimensional (3-D) space. The linear (CCA) mode is shown as a dashed line. The NLCCA mode and the CCA mode are also projected onto the PC1-PC2, PC1-PC3, and PC2-PC3 planes, where the projected NLCCA is indicated by (overlapping) circles, the CCA by thin solid lines, and the projected data points (during 1950–2000) by the scattered dots. There is no time lag between the SLPA and the corresponding SSTA data. The NLCCA solution was obtained with the number of hidden neurons l2 = m2 = 2; with l2 = m2 = 1, only a linear solution can be found. Adapted from Hsieh [2001b].
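The cost function of the double-barreled network in paragraph [33], J = −cor(u, v), is simply the negative Pearson correlation between the canonical variates; a minimal sketch (illustrative function name):

```python
import numpy as np

def neg_corr(u, v):
    """NLCCA cost for the double-barreled NN: J = -cor(u, v)."""
    uc, vc = u - u.mean(), v - v.mean()
    return -(uc @ vc) / np.sqrt((uc @ uc) * (vc @ vc))
```

Minimizing this over the weights of the two forward-mapping NNs maximizes cor(u, v).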
[36] The MSE of the NLCCA divided by the MSE of the CCA is a useful measure of how different the nonlinear solution is from the linear solution: a smaller ratio means greater nonlinearity, while a ratio of 1 means the NLCCA can find only a linear solution. This ratio is 0.951 for the SLPA and 0.935 for the SSTA, confirming that the mapping for the SSTA was more nonlinear than that for the SLPA. When the data record was divided into two halves (1950–1975 and 1976–1999) to be analyzed separately by the NLCCA, Hsieh [2001b] found that this ratio decreased in the second half, implying an increase in the nonlinearity of ENSO during the more recent period.

[37] For the NLCCA mode 1, as u varies from its minimum value to its maximum value, the SLPA field varies from the strong La Niña phase to the strong El Niño phase (Figure 13). The zero contour is farther west during La Niña (Figure 13a) than during strong El Niño (Figure 13b). Similarly, as v varies from its minimum to its maximum, the SSTA field varies from strong La Niña to strong El Niño (Figure 14), revealing that the SST anomalies during La Niña are centered farther west than the anomalies during El Niño.

Figure 13. SLPA field when the canonical variate u of the NLCCA mode 1 is at (a) its minimum (strong La Niña) and (b) its maximum (strong El Niño). Contour interval is 0.5 mbar. Reprinted from Hsieh [2001b], with permission from the American Meteorological Society.

Figure 14. SSTA field when the canonical variate v is at (a) its minimum (strong La Niña) and (b) its maximum (strong El Niño). Contour interval is 0.5°C. Reprinted from Hsieh [2001b], with permission from the American Meteorological Society.

[38] Wu and Hsieh [2002, 2003] studied the relation between the tropical Pacific wind stress anomaly (WSA) and SSTA fields using the NLCCA. Wu and Hsieh [2003] found notable interdecadal changes of ENSO behavior before and after the mid-1970s climate regime shift, with greater nonlinearity found during 1981–1999 than during 1961–1975. Spatial asymmetry (for both SSTA and WSA) between El Niño and La Niña episodes was significantly enhanced in the later period. During 1981–1999 the location of the equatorial easterly WSA in the NLCCA solution during La Niña was unchanged from the earlier period, but during El Niño the westerly WSA was shifted eastward by up to 30°. From dynamical considerations based on the delay oscillator theory for ENSO (where the farther east the location of the WSA, the longer the duration of the resulting SSTA in the eastern equatorial Pacific), Wu and Hsieh [2003] concluded that this interdecadal change would lengthen the duration of the ENSO warm phase but leave the duration of the cool phase unchanged, which was confirmed with numerical model experiments. This is an example of a nonlinear data analysis detecting a feature missed by previous studies using linear techniques, which in turn leads to new dynamical insight.

Figure 8. First theoretical mode X generated from equation (7), shown by the curve made up of small circles. The solid curve shows the second mode X′ generated from equation (9). A projection in the (a) x1-x2, (b) x1-x3, and (c) x2-x3 planes and (d) a three-dimensional plot. The actual data set of 500 points (shown by dots) was generated by adding mode 2 to mode 1 (with mode 2 having one third the variance of mode 1) and adding a small amount of Gaussian noise. Follows Hsieh [2000], with permission from Elsevier Science.

[39] The NLCCA has also been applied to study the relation between the tropical Pacific SSTA and the Northern Hemisphere midlatitude winter atmospheric variability (500-mbar geopotential height and North American surface air temperature) simulated in an atmospheric general circulation model (GCM), demonstrating the value of NLCCA as a nonlinear diagnostic tool for GCMs [Wu et al., 2003].

4. NONLINEAR SINGULAR SPECTRUM ANALYSIS (NLSSA)

[40] By the 1980s, interest in chaos theory and dynamical systems had led to a further extension of the PCA method, to singular spectrum analysis [Elsner and Tsonis, 1996; Golyandina et al., 2001; Ghil et al., 2002].
Given a time series $y_j = y(t_j)$ ($j = 1, \ldots, N$), lagged copies of the time series are stacked to form the augmented matrix $\mathbf{Y}$,

$$\mathbf{Y} = \begin{bmatrix} y_1 & y_2 & \cdots & y_{N-L+1} \\ y_2 & y_3 & \cdots & y_{N-L+2} \\ \vdots & \vdots & \ddots & \vdots \\ y_L & y_{L+1} & \cdots & y_N \end{bmatrix}. \qquad (11)$$

This matrix has the same form as a data matrix produced by $L$ variables, each being a time series of length $n = N - L + 1$. $\mathbf{Y}$ can also be viewed as composed of its column vectors $\mathbf{y}_j$, forming a vector time series $\mathbf{y}(t_j)$ ($j = 1, \ldots, n$). The standard PCA can be performed on the augmented data matrix $\mathbf{Y}$, resulting in

$$\mathbf{y}(t_j) = \sum_i x_i(t_j)\,\mathbf{e}_i, \qquad (12)$$

where $x_i$ is the $i$th PC, a time series of length $n$, and $\mathbf{e}_i$ is the $i$th eigenvector (or loading vector) of length $L$. Together, $x_i$ and $\mathbf{e}_i$ represent the $i$th SSA mode. The resulting method is the SSA with window $L$.

[41] In the multivariate case, with $M$ variables $y_k(t_j) \equiv y_{kj}$ ($k = 1, \ldots, M$; $j = 1, \ldots, N$), the augmented matrix can be formed by letting

$$\mathbf{Y} = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1,N-L+1} \\ \vdots & \vdots & \ddots & \vdots \\ y_{M1} & y_{M2} & \cdots & y_{M,N-L+1} \\ \vdots & \vdots & \ddots & \vdots \\ y_{1L} & y_{1,L+1} & \cdots & y_{1N} \\ \vdots & \vdots & \ddots & \vdots \\ y_{ML} & y_{M,L+1} & \cdots & y_{MN} \end{bmatrix}. \qquad (13)$$

PCA can again be applied to $\mathbf{Y}$ to get the SSA modes, resulting in the multichannel SSA (MSSA) method, also called the space-time PCA method or the extended EOF (EEOF) method (though in typical EEOF applications only a small number of lags are used). For brevity we use the term SSA to denote both SSA and MSSA. Commonly used in the meteorological and oceanographic communities [Ghil et al., 2002], SSA has also been used to analyze solar activity [Watari, 1996; Rangarajan and Barreto, 2000] and storms on Mars [Hollingsworth et al., 1997].

Figure 9. First theoretical mode Y generated from equation (8), shown by the curve made up of small circles. The solid curve shows the second mode Y′ generated from equation (10). A projection in the (a) y1-y2, (b) y1-y3, and (c) y2-y3 planes and (d) a three-dimensional plot. The data set of 500 points was generated by adding mode 2 to mode 1 (with mode 2 having one third the variance of mode 1) and adding a small amount of Gaussian noise. Follows Hsieh [2000], with permission from Elsevier Science.

Figure 10. NLCCA mode 1 in x space shown as a string of (densely overlapping) small circles. The theoretical mode X′ is shown as a thin solid curve, and the linear CCA mode as a thin dashed line. The dots display the 500 data points. The number of hidden neurons used (see Appendix C) is l2 = m2 = 3. Follows Hsieh [2000], with permission from Elsevier Science.

Figure 11. NLCCA mode 1 in y space shown as a string of overlapping small circles. The thin solid curve is the theoretical mode Y′, and the thin dashed line is the CCA mode. Follows Hsieh [2000], with permission from Elsevier Science.
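A minimal univariate SSA sketch following equations (11) and (12): build the augmented (lagged) matrix with window L and apply PCA via the eigendecomposition of its covariance matrix. Whether and how the mean is removed varies between SSA conventions; the centering and function name here are assumptions for illustration.

```python
import numpy as np

def ssa(y, L):
    """Univariate SSA with window L.
    Returns eigenvectors E (L x L, columns sorted by variance explained)
    and the PC time series pcs (L x n), with n = N - L + 1."""
    N = len(y)
    n = N - L + 1
    Y = np.column_stack([y[j:j + L] for j in range(n)])  # equation (11), (L, n)
    Yc = Y - Y.mean(axis=1, keepdims=True)               # one common centering
    E, s, _ = np.linalg.svd(Yc @ Yc.T / n)               # covariance eigenvectors
    pcs = E.T @ Yc                                       # PCs, equation (12)
    return E, pcs
```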
[42] Hsieh and Wu [2002] proposed the NLSSA method: Assume SSA has been applied to the data set, and after discarding the higher modes we have retained the leading PCs, $\mathbf{x}(t) = [x_1, \ldots, x_l]$, where each variable $x_i$ ($i = 1, \ldots, l$) is a time series of length $n$. The variables $\mathbf{x}$ are the inputs to the NLPCA(cir) network (Figure 2b). The NLPCA(cir), with its ability to extract closed curve solutions, is particularly suited to extracting periodic or wave modes in the data. In SSA it is common to encounter periodic modes, each of which has to be split into a pair of SSA modes [Elsner and Tsonis, 1996], as the underlying PCA technique is not capable of modeling a periodic mode (a closed curve) by a single mode (a straight line). Thus two (or more) SSA modes can easily be combined by NLPCA(cir) into one NLSSA mode, taking the shape of a closed curve. When implementing NLPCA(cir), Hsieh [2001a] found that there were two possible configurations, a restricted configuration and a general configuration (see Appendix B); we use the general configuration here. After the first NLSSA mode has been extracted, it can be subtracted from $\mathbf{x}$ to get the residual, which can be input again into the same network to extract the second NLSSA mode, and so forth for the higher modes.

[43] To illustrate the difference between the NLSSA and the SSA, consider a test problem with a nonsinusoidal wave of the form

$$f(t) = \begin{cases} 3 & t = 1, \ldots, 7 \\ -1 & t = 8, \ldots, 28, \end{cases} \qquad (14)$$

and periodic thereafter. This is a square wave with the peak stretched to be 3 times as tall but only one third as broad as the trough; it has a period of 28. Gaussian noise with twice the standard deviation of this signal was added, and the time series was normalized to unit standard deviation (Figure 15). The time series has 600 data points.
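The stretched square wave of equation (14), with its noise and normalization, can be generated as follows (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 600
t = np.arange(1, N + 1)
f = np.where((t - 1) % 28 < 7, 3.0, -1.0)   # 3 for t = 1..7, -1 for t = 8..28

y = f + 2 * f.std() * rng.normal(size=N)    # Gaussian noise, twice signal std
y = (y - y.mean()) / y.std()                # normalize to unit std
```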
[44] SSA with window L = 50 was applied to this time series, with the first eight eigenvectors shown in Figure 16. The first eight modes individually accounted for 6.3%, 5.6%, 4.6%, 4.3%, 3.3%, 3.3%, 3.2%, and 3.1% of the variance of the augmented time series y. The leading pair of modes displays oscillations of period 28, while the next pair manifests oscillations at a period of 14, i.e., the first harmonic. The nonsinusoidal nature of the SSA eigenvectors can be seen in mode 2 (Figure 16), where the trough is broader and shallower than the peak but nowhere as intense as in the original stretched square wave signal. The PCs for modes 1–4 are shown in Figure 17. Both the eigenvectors (Figure 16) and the PCs (Figure 17) tend to appear in pairs, each member of a pair having a similar appearance except for the quadrature phase difference.

Figure 16. First eight SSA eigenvectors as a function of time lag: (a) mode 1 (solid curve) and mode 2 (dashed curve), (b) mode 3 (solid curve) and mode 4 (dashed curve), (c) mode 5 (solid curve) and mode 6 (dashed curve), and (d) mode 7 (solid curve) and mode 8 (dashed curve).

Figure 17. PC time series of SSA (a) mode 1 (solid curve) and mode 2 (dashed curve), (b) mode 3 (solid curve) and mode 4 (dashed curve), and (c) θ, the nonlinear PC from NLSSA mode 1. Note θ is periodic, here bounded between −π and π radians.

[45] The first eight PC time series served as inputs to the NLPCA(cir) network, with m (the number of hidden neurons in the encoding layer) ranging from 2 to 8 (and the weight penalty parameter P = 1; see Appendix B). The MSE dropped with increasing m until m = 5, beyond which the MSE showed no further improvement. The resulting NLSSA mode 1 (with m = 5) is shown in Figure 18. Not surprisingly, PC1 and PC2 are united by the approximately circular curve. More surprising are the Lissajous-like curves found in the PC1-PC3 plane (Figure 18b) and in the PC1-PC4 plane (Figure 18c), indicating relations between the first SSA mode and the higher modes 3 and 4. (It is well known that for two sinusoidal time series z1(t) and z2(t) oscillating at frequencies ω1 and ω2, respectively, a plot of the trajectory in the z1-z2 plane reveals a closed Lissajous curve if and only if ω2/ω1 is a rational number.) There was no relation found between PC1 and PC5, as PC5 appeared independent of PC1 (Figure 18d). However, with less noise in the input, relations can be found between PC1 and PC5 and even higher PCs.

[46] The NLSSA reconstructed component 1 (NLRC1) is the approximation of the original time series by the NLSSA mode 1. The neural network outputs x′ are the NLSSA mode 1 approximation of the eight leading PCs. Multiplying these approximated PCs by their corresponding SSA eigenvectors and summing over the eight modes allows the reconstruction of the time series from the NLSSA mode 1. As each eigenvector contains the loading over a range of lags, each value in the reconstructed time series at time tⱼ also involves averaging over the contributions at tⱼ from the various lags.

Figure 15. Noisy time series y containing a stretched square wave signal, shown by curve a. Curve b shows the stretched square wave signal, which we will try to extract from the noisy time series. Curves c, d, and e are the reconstructed components (RC) from the SSA leading modes, using one, three, and eight modes, respectively. Curve f is the NLSSA mode 1 RC (NLRC1). The dashed lines show the means of the various curves, which have been vertically shifted for better visualization.

[47] In Figure 15, NLRC1 (curve f) from NLSSA is to be compared with the reconstructed component (RC) from SSA mode 1 (RC1) (curve c). The nonsinusoidal nature of the oscillations is not revealed by the RC1 but is clearly manifested in the NLRC1, where each strong narrow peak is followed by a weak broad trough, similar to the original stretched square wave. Also, the wave amplitude is steadier in the NLRC1 than in the RC1. Using contributions from the first two SSA modes, RC1-2 (not shown) is rather similar to RC1 in appearance except for a larger amplitude.

[48] In Figure 15, curves d and e show the RC from SSA using the first three modes and the first eight modes, respectively. These curves, referred to as RC1-3 and RC1-8, show increasing noise as more modes are used. Among the RCs, with respect to the stretched square wave time series (curve b), RC1-3 attains the most favorable correlation (0.849) and root-mean-square error (RMSE) (0.245) but remains behind the NLRC1, with its correlation of 0.875 and RMSE of 0.225.

[49] The stretched square wave signal accounted for only 22.6% of the variance in the noisy data. For comparison, NLRC1 accounted for 17.9%, RC1 for 9.4%, and RC1-2 for 14.1% of the variance. With more modes the RCs account for increasingly more variance, but beyond RC1-3 the increased variance comes only from fitting to the noise in the data.
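The RC computation described in paragraph [46], multiplying the (approximated) PCs by their eigenvectors and averaging the contributions at each time over all lags, can be sketched as follows; the function pairs with the `ssa` sketch given earlier and shares its indexing convention.

```python
import numpy as np

def reconstruct(E, pcs, modes):
    """Reconstructed component from selected SSA modes:
    rebuild the rank-reduced augmented matrix, then average the
    contributions to each time t_j over all lags (anti-diagonals)."""
    L, n = E.shape[0], pcs.shape[1]
    Yhat = E[:, modes] @ pcs[modes, :]     # (L, n) rank-reduced matrix
    N = n + L - 1
    rc, cnt = np.zeros(N), np.zeros(N)
    for i in range(L):                     # lag i contributes to times i..i+n-1
        rc[i:i + n] += Yhat[i]
        cnt[i:i + n] += 1.0
    return rc / cnt
```

For NLRC1, the rows of `pcs` would be replaced by the NLPCA(cir) outputs x′ before reconstruction; `reconstruct(E, pcs, [0])` would give RC1 in this notation.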
[50] When classical Fourier spectral analysis was performed, the most energetic bands were the sine and cosine at a period of 14, the two together accounting for 7.0% of the variance. In this case, the strong scattering of energy to higher harmonics by the Fourier technique has actually assigned 38% more energy to the first harmonic (at period 14) than to the fundamental period of 28. Next, the data record was slightly shortened from 600 to 588 points, so that the record is exactly 21 times the fundamental period of our known signal; this avoids violating the periodicity assumption of Fourier analysis and the resulting spurious energy scatter into higher spectral bands. The most energetic Fourier bands were then the sine and cosine at the fundamental period of 28, the two together accounting for 9.8% of the variance, compared with 14.1% of the variance accounted for by the first two SSA modes. Thus, even with great care, the Fourier method scatters the spectral energy considerably more than the SSA method.

[51] The SSA has also been applied to the multivariate case by Hsieh and Wu [2002]. The tropical Pacific monthly SLPA data [Woodruff et al., 1987] during 1950–2000 were used. The first eight SSA modes of the SLPA accounted for 7.9%, 7.1%, 5.0%, 4.9%, 4.0%, 3.1%, 2.5%, and 1.9%, respectively, of the total variance of the augmented data. In Figure 19 the first two modes displayed the Southern Oscillation (SO), the east-west seesaw of SLPA at around the 50-month period, while the higher modes displayed fluctuations at around the QBO [Hamilton, 1998] average period of 28 months.

[52] The eight leading PCs of the SSA were then used as inputs, x1, . . ., x8, to the NLPCA(cir) network, yielding the NLSSA mode 1 for the SLPA. This mode accounts for 17.1% of the variance of the augmented data, more than the variance explained by the first two SSA modes (15.0%). This is not surprising, as the NLSSA mode did more than just combine SSA modes 1 and 2: it also connects SSA mode 3 to SSA modes 1 and 2 (Figure 20). In the x1-x3 plane, the bowl-shaped projected solution implies that PC3 tends to be positive when PC1 takes on either large positive or large negative values. Similarly, in the x2-x3 plane, the hill-shaped projected solution indicates that PC3 tends to be negative when PC2 takes on large positive or negative values. These curves reveal interactions between the longer-timescale (50-month) SSA modes 1 and 2 and the shorter-timescale (28-month) SSA mode 3.
[53] In the linear case of PCA or SSA, as the PC varies, the loading pattern is unchanged except for scaling by the PC. In the nonlinear case, as the NLPC varies, the loading pattern changes, as it does not generally lie along a fixed eigenvector. The space-time loading patterns for the NLSSA mode 1 at various values of the NLPC θ (Figure 21) prominently manifest the growth and decay of the negative phase of the SO (i.e., negative SLPA in the eastern equatorial Pacific and positive SLPA in the west) as time progresses. The negative phase of the SO here is much shorter and more intense than the positive phase, in agreement with observations and in contrast to the SSA modes 1 and 2 (Figures 19a and 19b), where the negative and positive phases of the SO are about equal in duration and magnitude.

[54] The tropical Pacific SSTA field was also analyzed by the NLSSA method by Hsieh and Wu [2002]. Comparing the NLSSA mode 1 loading patterns with the patterns from the first two SSA modes of the SSTA, Hsieh and Wu [2002] found three notable differences: (1) The presence of warm anomalies for 24 months followed by cool anomalies for 24 months in the first two SSA modes is replaced in the NLSSA mode 1 by warm anomalies for 18 months followed by cool anomalies for about 33 months; although the cool anomalies can be quite mild for long periods, they can develop into full La Niña cool episodes. (2) The El Niño warm episodes are strongest near the eastern boundary, while the La Niña episodes are strongest near the central equatorial Pacific in the NLSSA mode 1, an asymmetry not found in the individual SSA modes. (3) The magnitude of the peak positive anomalies is significantly larger than that of the peak negative anomalies in the NLSSA mode 1, again an asymmetry not found in the individual SSA modes. All three differences indicate that the NLSSA mode 1 is much closer to the observed ENSO properties than the first two SSA modes are.

[55] Furthermore, from the residual the NLSSA mode 2 has been extracted by Hsieh and Wu [2002] for the SLPA field and for the SSTA field. For both variables the NLSSA mode 2 has a 39-month period, considerably longer than the QBO periods typically reported by previous studies using linear techniques [Ghil et al., 2002]. Intriguingly, the coupling between the SLPA and SSTA fields at the 39-month period of the second nonlinear mode was found to be considerably stronger than their coupling at the 51-month period of the first nonlinear "ENSO" mode [Hsieh and Wu, 2002]. The NLSSA technique has also been used to study the stratospheric equatorial winds for the QBO phenomenon [Hsieh and Hamilton, 2003].

5. SUMMARY AND CONCLUSIONS

[56] This paper has reviewed the recent extension of the feed forward NN from its original role in nonlinear regression and classification to nonlinear PCA (for open and closed curves), nonlinear CCA, and nonlinear SSA. With examples from the atmosphere and the ocean, notably the ENSO and the stratospheric QBO phenomena, these NN methods can be seen to advance our understanding of geophysical phenomena. To highlight only a few of the many new findings by the nonlinear techniques, we note that the nonlinearity in the tropical Pacific interannual variability has been found to have increased in recent decades [Hsieh, 2001b; Wu and Hsieh, 2003]; that besides the main coupling at the ENSO timescale of about 51 months, the strongest coupling between the tropical Pacific SLP and SST has been identified at a second nonlinear mode with a 39-month period, this unusual period itself arising from the interaction between the linear modes with ENSO and QBO timescales [Hsieh and Wu, 2002]; and that the phase of the stratospheric QBO can be better defined, resulting in an enhancement of the Holton-Tan effect [Hamilton and Hsieh, 2002]. The nonlinear PCA, CCA, and SSA codes (written in MATLAB) are freely downloadable from the author's web site (http://www.ocgy.ubc.ca/projects/clim.pred).

Figure 16. First eight SSA eigenvectors as a function of time lag: (a) mode 1 (solid curve) and mode 2 (dashed curve), (b) mode 3 (solid curve) and mode 4 (dashed curve), (c) mode 5 (solid curve) and mode 6 (dashed curve), and (d) mode 7 (solid curve) and mode 8 (dashed curve).
[57] PCA is widely used for two main purposes: (1) to reduce the dimensionality of the data set and (2) to extract features or recognize patterns from the data set. It is purpose 2 where PCA can be improved upon. Rotated PCA (RPCA) sacrifices some of the variance explained, but by rotating the PCA eigenvectors, the RPCA eigenvectors can point more toward local data clusters and can therefore be more representative of physical states than the PCA eigenvectors. With the tropical Pacific SST as an example, it was shown that RPCA represented El Niño states better than PCA, but neither method represented La Niña states well. In contrast, nonlinear PCA (NLPCA) passed through both the El Niño and the La Niña clusters, thus representing both well within a single mode; the NLPCA first mode also accounted for more variance of the data set than the first mode of PCA or RPCA.

[58] With PCA the straight line explaining the maximum variance of the data is found. With NLPCA the straight line is replaced by a continuous, open curve. NLPCA(cir) (NLPCA with a circular node at the bottleneck) replaces the open curve with a closed curve, so periodic or wave solutions can be modeled. When dealing with data containing a nonlinear or periodic structure, the linear methods scatter the energy into multiple modes, a problem largely avoided by the nonlinear methods.

[59] With two data fields x and y, the classical CCA method finds the canonical variate u (from a linear combination of the x variables) and its partner v (from the y variables) so that the correlation between u and v is maximized. CCA finds a line in the x space such that fluctuations of the x data projected onto this line are most highly correlated with fluctuations of the y data projected onto another line in the y space. NN can perform nonlinear CCA (NLCCA), where u and v can be nonlinear functions of the x and y variables, respectively. NLCCA finds a curve in the x space such that fluctuations of the x data projected onto this curve are most highly correlated with fluctuations of the y data projected onto another curve in the y space.

Figure 17. PC time series of SSA (a) mode 1 (solid curve) and mode 2 (dashed curve), (b) mode 3 (solid curve) and mode 4 (dashed curve), and (c) θ, the nonlinear PC from NLSSA mode 1. Note that θ is periodic, here bounded between −π and π radians.
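Before turning to the nonlinear generalization, it is worth noting how compactly classical CCA itself can be computed. The following Python/NumPy sketch is my own illustration (not the author's MATLAB code): it whitens each centered data set and reads the canonical correlations and weight vectors off an SVD, assuming nondegenerate covariance matrices:

```python
import numpy as np

def cca(X, Y):
    """Classical CCA. X is (N, lx), Y is (N, ly), rows are samples.

    Returns the canonical correlations and the weight matrices whose
    columns map the centered data onto the canonical variates u and v.
    Assumes both covariance matrices are full rank.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    def inv_sqrt_cov(Z):
        # C^(-1/2) via the eigendecomposition of the covariance matrix.
        C = Z.T @ Z / (n - 1)
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Wx, Wy = inv_sqrt_cov(X), inv_sqrt_cov(Y)
    # Cross covariance of the whitened variables; its singular values
    # are the canonical correlations.
    Cxy = (X @ Wx).T @ (Y @ Wy) / (n - 1)
    U, rho, Vt = np.linalg.svd(Cxy)
    a = Wx @ U       # columns give u = X_centered @ a[:, i]
    b = Wy @ Vt.T    # columns give v = Y_centered @ b[:, i]
    return rho, a, b
```

The leading columns of a and b define the lines in the x space and the y space mentioned above; NLCCA replaces these straight lines by curves parameterized through the hidden layers described in Appendix C.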
[60] For univariate and multivariate time series analysis the PCA method has been extended to the SSA technique. NN can also be used to perform nonlinear SSA (NLSSA): The data set is first condensed by the SSA; then several leading PCs from the SSA are chosen as inputs to the NLPCA(cir) network, which extracts the NLSSA mode by nonlinearly combining the various SSA modes.

[61] In general, NLSSA has several advantages over SSA: (1) The PCs from different SSA modes are linearly uncorrelated; however, they may have relationships detectable by the NLSSA. (2) Although the SSA modes are not restricted to sinusoidal oscillations in time like the Fourier spectral components, in practice they are inefficient at modeling nonsinusoidal periodic signals (e.g., the stretched square wave of section 4), scattering the signal energy into many SSA modes, much as Fourier spectral analysis scatters the energy of a nonsinusoidal wave into its higher harmonics. The NLSSA recombines the SSA modes to extract the nonsinusoidal signal, alleviating the spurious transfer of energy to higher frequencies. In the tropical Pacific the NLSSA mode 2 of both the SSTA field and the SLPA field yielded a 39-month signal, considerably lower in frequency than the QBO signals found by linear methods.

Figure 18. First NLSSA mode indicated by the (overlapping) small circles, with the input data shown as dots. The input data were the first eight PCs from the SSA of the time series y containing the stretched square wave. The NLSSA solution is a closed curve in an eight-dimensional PC space, projected here onto (a) the PC1-PC2 plane, (b) the PC1-PC3 plane, (c) the PC1-PC4 plane, and (d) the PC1-PC5 plane.

Figure 19. (a-f) SSA modes 1-6 for the tropical Pacific sea level pressure anomalies (SLPA). The contour plots display the SSA space-time eigenvectors (loading patterns), showing the SLPA along the equator as a function of the lag. Solid contours indicate positive anomalies and dashed contours negative anomalies, with the zero contour indicated by the thick solid curve. In a separate graph beneath each contour plot the PC of each SSA mode is plotted as a time series (each tick mark on the abscissa indicates the start of a year). The time of the PC is synchronized to the lag time of 0 months in the space-time eigenvector. Figure 19 courtesy of A. Wu.

[62] In summary, the linear methods currently used are often too simplistic to describe complicated real-world systems, with a tendency to scatter a single oscillatory phenomenon into numerous modes or higher harmonics, thereby introducing unphysical spatially standing patterns or spurious high-frequency energy. These problems are shown to be largely alleviated by the use of nonlinear methods.

[63] The main disadvantage of NN methods compared with the linear methods lies in their instability or nonuniqueness: With local minima in the cost function, optimizations started from different initial parameters often end up at different minima. A number of optimization runs starting from different random initial parameters is needed, with the best run chosen as the solution; even then, there is no guarantee that the global minimum has been found. Proper scaling of the input data is essential to avoid having the nonlinear optimization algorithm search for parameters with widely different magnitudes. Regularization by adding weight penalty terms to the cost functions generally improves the stability of the NN methods. Nevertheless, for short records with noisy data one may not be able to find a reliable nonlinear solution, and the linear solution may be the best one can extract from the data. Time averaging of data (e.g., averaging daily data to yield monthly data) may also, through the central limit theorem, severely reduce the nonlinearity that can be detected [Yuval and Hsieh, 2002].
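The restart-and-select strategy just described is generic and can be written down once for any of the NN methods, combining the multiple random restarts mentioned here with the validation screening described later in Appendix A. In the sketch below, train_cost, valid_cost, and init_params are hypothetical stand-ins for whichever network cost is being minimized, and the BFGS optimizer is this sketch's choice, not necessarily that of the original implementation:

```python
import numpy as np
from scipy.optimize import minimize

def fit_with_restarts(train_cost, valid_cost, init_params, n_runs=30, seed=0):
    """Run the optimization from several random starts and keep the best run.

    Runs whose cost on withheld validation data exceeds their training cost
    are rejected as likely overfitted; among the remainder, the run with the
    lowest training cost is selected.
    """
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_runs):
        res = minimize(train_cost, init_params(rng), method="BFGS")
        if valid_cost(res.x) > train_cost(res.x):
            continue  # worse on withheld data: reject as overfitted
        if best is None or res.fun < best.fun:
            best = res
    return best  # may be None if every run was rejected
```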
[64] Hence, whether the nonlinear approach has a significant advantage over the linear approach is highly dependent on the data set: The nonlinear approach is generally ineffective if the data record is short and noisy or if the underlying physics is essentially linear. For the Earth's climate, tropical variability such as the ENSO and the stratospheric QBO has a strong signal-to-noise ratio and is handled well by the nonlinear methods; in contrast, in the middle and high latitudes the signal-to-noise ratio is much weaker, rendering the nonlinear methods less effective. At present, the number of hidden neurons in the NN and the weight penalty parameters are often determined by trial and error; adopting techniques such as generalized cross validation [Yuval, 2000] and information criteria [Burnham and Anderson, 1998] may in the future provide more guidance on the choice of the most appropriate NN architecture. While the NN has been widely used as the main workhorse in nonlinear multivariate and time series analysis, newly emerging techniques such as kernel-based methods [Vapnik, 1998] may play an increasingly important role in the future.

APPENDIX A: NLPCA MODEL

[65] In Figure 2a the transfer function f1 maps from x, the input column vector of length l, to the first hidden layer (the encoding layer), represented by h(x), a column vector of length m, with elements

$h^{(x)}_k = f_1\big[\big(\mathbf{W}^{(x)}\,\mathbf{x} + \mathbf{b}^{(x)}\big)_k\big]$,   (A1)

where (with capital boldface reserved for matrices and lowercase boldface for vectors) W(x) is an m × l weight matrix, b(x) is a column vector of length m containing the bias parameters, and k = 1, ..., m. Similarly, a second transfer function f2 maps from the encoding layer to the bottleneck layer containing a single neuron, which represents the nonlinear principal component u:

$u = f_2\big(\mathbf{w}^{(x)} \cdot \mathbf{h}^{(x)} + \bar{b}^{(x)}\big)$.   (A2)

The transfer function f1 is generally nonlinear (usually the hyperbolic tangent or the sigmoidal function, though the exact form is not critical), while f2 is usually taken to be the identity function.

Figure 20. NLSSA mode 1 for the tropical Pacific SLPA. The PCs of SSA modes 1-8 were used as inputs x1, ..., x8 to the NLPCA(cir) network, with the resulting NLSSA mode 1 shown as (densely overlapping) crosses in the x1-x2-x3 3-D PC space. The projections of this mode onto the x1-x2, x1-x3, and x2-x3 planes are denoted by the (densely overlapping) small circles, and the projected data are shown by dots. For comparison, the linear SSA mode 1 is shown by the dashed line in the 3-D space and by the projected solid lines on the 2-D planes. From Hsieh and Wu [2002].
Figure 21. SLPA NLSSA mode 1 space-time loading patterns for various values of the NLPC θ: (a) θ = 0°, (b) θ = 60°, (c) θ = 120°, (d) θ = 180°, (e) θ = 240°, and (f) θ = 300°. The contour plots display the SLPA along the equator as a function of the lag time. Contour interval is 0.2 mbar. Figure 21 courtesy of A. Wu.

[66] Next a transfer function f3 maps from u to the final hidden layer (the decoding layer) h(u),

$h^{(u)}_k = f_3\big[\big(\mathbf{w}^{(u)} u + \mathbf{b}^{(u)}\big)_k\big]$   (A3)

(k = 1, ..., m), followed by f4 mapping from h(u) to x′, the output column vector of length l, with

$x'_i = f_4\big[\big(\mathbf{W}^{(u)}\,\mathbf{h}^{(u)} + \bar{\mathbf{b}}^{(u)}\big)_i\big]$.   (A4)

[67] The cost function J = ⟨‖x − x′‖²⟩ is minimized by finding the optimal values of W(x), b(x), w(x), b̄(x), w(u), b(u), W(u), and b̄(u); the mean square error (MSE) between the NN output x′ and the original data x is thus minimized. The NLPCA was implemented using the hyperbolic tangent function for f1 and f3 and the identity function for f2 and f4, so that

$u = \mathbf{w}^{(x)} \cdot \mathbf{h}^{(x)} + \bar{b}^{(x)}$,   (A5)

$x'_i = \big(\mathbf{W}^{(u)}\,\mathbf{h}^{(u)} + \bar{\mathbf{b}}^{(u)}\big)_i$.   (A6)

[68] Furthermore, we adopt the normalization conditions ⟨u⟩ = 0 and ⟨u²⟩ = 1. These conditions are approximately satisfied by modifying the cost function to

$J = \langle \|\mathbf{x} - \mathbf{x}'\|^2 \rangle + \langle u \rangle^2 + \big(\langle u^2 \rangle - 1\big)^2$.   (A7)

The total number of (weight and bias) parameters used by the NLPCA is 2lm + 4m + l + 1, though the number of effectively free parameters is two less because of the constraints on ⟨u⟩ and ⟨u²⟩.

[69] The choice of m, the number of hidden neurons in both the encoding and decoding layers, follows a general principle of parsimony. A larger m increases the nonlinear modeling capability of the network but can also lead to overfitted solutions (i.e., wiggly solutions that fit the noise in the data). If f4 is the identity function and m = 1, then equation (A6) implies that all x′_i are linearly related to a single hidden neuron; hence there can be only a linear relation between the x′_i variables. Thus for nonlinear solutions we need m ≥ 2. It is also possible to have more than one neuron at the bottleneck layer. For instance, with two bottleneck neurons the extracted mode spans a 2-D surface instead of a 1-D curve.

[70] The nonlinear optimization was carried out by the MATLAB function "fminu," a quasi-Newton algorithm. Because of local minima in the cost function, there is no guarantee that the optimization algorithm reaches the global minimum. Hence a number of runs with random initial weights and bias parameters was made. Also, 20% of the data were randomly selected as validation data and withheld from the training of the NNs. Runs where the MSE was larger for the validation data set than for the training data set were rejected to avoid overfitted solutions. Then the run with the smallest MSE was selected as the solution.

[71] In general, the most serious problem with NLPCA is the presence of local minima in the cost function. As a result, optimizations started from different initial parameters often converge to different minima, rendering the solution unstable or nonunique. Regularization of the cost function by adding weight penalty terms is one answer.

[72] The purpose of the weight penalty terms is to limit the nonlinear power of the NLPCA, which comes from the nonlinear transfer functions in the network. The transfer function tanh has the property that given x in the interval [−L, L], one can choose a small enough weight w so that tanh(wx) ≈ wx; that is, the transfer function is almost linear. Similarly, one can choose a large enough w so that tanh approaches a step function, thus yielding Z-shaped solutions. If we can penalize the use of excessive weights, we can limit the degree of nonlinearity in the NLPCA solution. This is achieved with the modified cost function

$J = \langle \|\mathbf{x} - \mathbf{x}'\|^2 \rangle + \langle u \rangle^2 + \big(\langle u^2 \rangle - 1\big)^2 + P \sum_{ki} \big(W^{(x)}_{ki}\big)^2$,   (A8)

where P is the weight penalty parameter. A large P increases the concavity of the cost function and forces the weights W(x) to be small in magnitude, thereby yielding smoother and less nonlinear solutions than when P is small or zero. Hence increasing P also reduces the number of effectively free parameters of the model.
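To make equations (A1)-(A8) concrete, here is a minimal NumPy sketch of the NLPCA forward pass and the penalized cost. The parameter-packing scheme and the function name are mine, and a quasi-Newton routine (e.g., scipy.optimize.minimize) would play the role of the MATLAB "fminu" used originally:

```python
import numpy as np

def nlpca_cost(params, x, l, m, penalty):
    """Penalized NLPCA cost (A8) for data x of shape (N, l), m hidden neurons.

    params packs W(x), b(x), w(x), bbar(x), w(u), b(u), W(u), bbar(u) in
    that order; total length 2*l*m + 4*m + l + 1, as in the text.
    """
    i = 0
    def take(shape):
        nonlocal i
        size = int(np.prod(shape))
        block = params[i:i + size].reshape(shape)
        i += size
        return block
    Wx, bx = take((m, l)), take((m,))
    wx, bbx = take((m,)), take((1,))
    wu, bu = take((m,)), take((m,))
    Wu, bbu = take((l, m)), take((l,))

    hx = np.tanh(x @ Wx.T + bx)          # encoding layer, eq. (A1)
    u = hx @ wx + bbx                    # bottleneck with f2 = identity, eq. (A5)
    hu = np.tanh(np.outer(u, wu) + bu)   # decoding layer, eq. (A3)
    xout = hu @ Wu.T + bbu               # output with f4 = identity, eq. (A6)

    mse = np.mean(np.sum((x - xout) ** 2, axis=1))
    norm = np.mean(u) ** 2 + (np.mean(u ** 2) - 1) ** 2
    return mse + norm + penalty * np.sum(Wx ** 2)

# Hypothetical usage with a quasi-Newton optimizer:
#   from scipy.optimize import minimize
#   theta0 = 0.1 * np.random.default_rng(0).standard_normal(2*l*m + 4*m + l + 1)
#   res = minimize(nlpca_cost, theta0, args=(x, l, m, 0.01), method="BFGS")
```

Minimizing this from multiple random starts, as in the selection procedure above, gives the NLPCA mode; setting penalty = 0 recovers the unpenalized cost (A7).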
The percentage of the variance explained by the NLPCA mode is simply

$100\% \times \left(1 - \frac{\langle \|\mathbf{x} - \mathbf{x}'\|^2 \rangle}{\langle \|\mathbf{x} - \bar{\mathbf{x}}\|^2 \rangle}\right)$,   (A9)

with x̄ the mean of x.

APPENDIX B: NLPCA(cir) MODEL

[73] At the bottleneck in Figure 2b, analogous to u in equation (A5), we calculate the prestates p_o and q_o by

$p_o = \mathbf{w}^{(x)} \cdot \mathbf{h}^{(x)} + \bar{b}^{(x)}, \qquad q_o = \tilde{\mathbf{w}}^{(x)} \cdot \mathbf{h}^{(x)} + \tilde{b}^{(x)}$,   (B1)

where w(x) and w̃(x) are weight parameter vectors and b̄(x) and b̃(x) are bias parameters. Let

$r = \big(p_o^2 + q_o^2\big)^{1/2}$;   (B2)

then the circular node is defined by

$p = p_o / r, \qquad q = q_o / r$,   (B3)

satisfying the unit circle equation p² + q² = 1. Thus, even though there are two variables p and q at the bottleneck, there is only one angular degree of freedom, from the angle θ (Figure 2b), because of the circle constraint. The mapping from the bottleneck to the output proceeds as in Appendix A, with equation (A3) replaced by

$h^{(u)}_k = \tanh\big[\big(\mathbf{w}^{(u)} p + \tilde{\mathbf{w}}^{(u)} q + \mathbf{b}^{(u)}\big)_k\big]$.   (B4)

[74] When implementing NLPCA(cir), Hsieh [2001a] found that there are actually two possible configurations: (1) a restricted configuration, where the constraints ⟨p⟩ = 0 = ⟨q⟩ are applied, and (2) a general configuration without the constraints. With configuration 1 the constraints can be satisfied approximately by adding the extra terms ⟨p⟩² and ⟨q⟩² to the cost function. If a closed curve solution is sought, then configuration 1 is better than configuration 2, as it has effectively two fewer parameters. However, configuration 2, being more general, can more readily model open curve solutions like a regular NLPCA. The reason is that if the input data mapped onto the p-q plane cover only a segment of the unit circle instead of the whole circle, then the inverse mapping from the p-q space to the output space will yield a solution resembling an open curve. Hence, given a data set, configuration 2 may yield either a closed curve or an open curve solution. Its generality comes with a price, namely, that there may be more local minima to contend with. The number of parameters is 2lm + 6m + l + 2, though under configuration 1 the number of effectively free parameters is two less because of the imposed constraints. Unlike NLPCA, which reduces to PCA when only linear transfer functions are used, NLPCA(cir) does not appear to have a linear counterpart.
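The only nonstandard ingredient relative to Appendix A is the circular bottleneck, so a short sketch of equations (B1)-(B3) alone may be helpful. The variable names here are mine, with the tilde quantities carrying a _t suffix:

```python
import numpy as np

def circular_bottleneck(hx, wx, bbx, wx_t, bx_t):
    """Map the encoding layer hx, shape (N, m), to the unit circle, eqs. (B1)-(B3).

    wx and wx_t are weight vectors of length m; bbx and bx_t are scalar
    biases. Assumes r > 0 for every sample.
    """
    po = hx @ wx + bbx               # prestate p_o, eq. (B1)
    qo = hx @ wx_t + bx_t            # prestate q_o, eq. (B1)
    r = np.sqrt(po ** 2 + qo ** 2)   # eq. (B2)
    p, q = po / r, qo / r            # on the unit circle: p**2 + q**2 = 1, eq. (B3)
    theta = np.arctan2(q, p)         # the single angular NLPC, in (-pi, pi]
    return p, q, theta
```

The angle theta returned here is the one angular degree of freedom noted in the text, consistent with the NLPC of Figure 17 being bounded between −π and π.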
APPENDIX C: NLCCA MODEL

[75] In Figure 7 the inputs x and y are mapped to the neurons in the hidden layer:

$h^{(x)}_k = \tanh\big[\big(\mathbf{W}^{(x)}\,\mathbf{x} + \mathbf{b}^{(x)}\big)_k\big], \qquad h^{(y)}_n = \tanh\big[\big(\mathbf{W}^{(y)}\,\mathbf{y} + \mathbf{b}^{(y)}\big)_n\big]$,   (C1)

where W(x) and W(y) are weight matrices and b(x) and b(y) are bias parameter vectors. The dimensions of x, y, h(x), and h(y) are l1, m1, l2, and m2, respectively.

[76] The canonical variate neurons u and v are calculated from a linear combination of the hidden neurons h(x) and h(y), respectively, with

$u = \mathbf{w}^{(x)} \cdot \mathbf{h}^{(x)} + \bar{b}^{(x)}, \qquad v = \mathbf{w}^{(y)} \cdot \mathbf{h}^{(y)} + \bar{b}^{(y)}$.   (C2)

These mappings are standard feed forward NNs and are capable of representing any continuous function mapping from x to u and from y to v to any given accuracy, provided large enough l2 and m2 are used.

[77] To maximize cor(u, v), the cost function J = −cor(u, v) is minimized by finding the optimal values of W(x), W(y), b(x), b(y), w(x), w(y), b̄(x), and b̄(y). We also adopt the constraints ⟨u⟩ = 0 = ⟨v⟩ and ⟨u²⟩ = 1 = ⟨v²⟩, which are approximately satisfied by modifying the cost function to

$J = -\mathrm{cor}(u, v) + \langle u \rangle^2 + \langle v \rangle^2 + \big(\langle u^2 \rangle^{1/2} - 1\big)^2 + \big(\langle v^2 \rangle^{1/2} - 1\big)^2$.   (C3)

[78] On the right side of Figure 7 the top NN (a standard feed forward NN) maps from u to x′ in two steps:

$h^{(u)}_k = \tanh\big[\big(\mathbf{w}^{(u)} u + \mathbf{b}^{(u)}\big)_k\big], \qquad \mathbf{x}' = \mathbf{W}^{(u)}\,\mathbf{h}^{(u)} + \bar{\mathbf{b}}^{(u)}$.   (C4)

The cost function J1 = ⟨‖x′ − x‖²⟩ is minimized by finding the optimal values of w(u), b(u), W(u), and b̄(u); the MSE between the NN output x′ and the original data x is thus minimized.

[79] Similarly, the bottom NN on the right side of Figure 7 maps from v to y′:

$h^{(v)}_n = \tanh\big[\big(\mathbf{w}^{(v)} v + \mathbf{b}^{(v)}\big)_n\big], \qquad \mathbf{y}' = \mathbf{W}^{(v)}\,\mathbf{h}^{(v)} + \bar{\mathbf{b}}^{(v)}$,   (C5)

with the cost function J2 = ⟨‖y′ − y‖²⟩ minimized. The total number of parameters used by the NLCCA is 2(l1 l2 + m1 m2) + 4(l2 + m2) + l1 + m1 + 2, though the number of effectively free parameters is four less because of the constraints on ⟨u⟩, ⟨v⟩, ⟨u²⟩, and ⟨v²⟩.

[80] A number of runs mapping from (x, y) to (u, v), using random initial parameters, were performed, and the run attaining the highest cor(u, v) was selected as the solution. Next a number of runs mapping from u to x′ were used to find the solution with the smallest MSE in x′. Finally, a number of runs were used to find the solution yielding the smallest MSE in y′. After the first NLCCA mode has been retrieved from the data, the method can be applied again to the residual to extract the second mode, and so forth.

[81] That the CCA is indeed a linear version of this NLCCA can be readily seen by replacing the hyperbolic tangent transfer functions in equations (C1), (C4), and (C5) with the identity function, thereby removing the nonlinear modeling capability of the NLCCA. Then the forward maps to u and v involve only a linear combination of the original variables x and y, as in the CCA.

[82] With three NNs in the NLCCA, overfitting can occur in any of the three networks. With noisy data the three cost functions are modified to

$J = -\mathrm{cor}(u, v) + \langle u \rangle^2 + \langle v \rangle^2 + \big(\langle u^2 \rangle^{1/2} - 1\big)^2 + \big(\langle v^2 \rangle^{1/2} - 1\big)^2 + P \Big[\sum_{ki} \big(W^{(x)}_{ki}\big)^2 + \sum_{nj} \big(W^{(y)}_{nj}\big)^2\Big]$,   (C6)

$J_1 = \langle \|\mathbf{x}' - \mathbf{x}\|^2 \rangle + P_1 \sum_k \big(w^{(u)}_k\big)^2$,   (C7)

$J_2 = \langle \|\mathbf{y}' - \mathbf{y}\|^2 \rangle + P_2 \sum_n \big(w^{(v)}_n\big)^2$,   (C8)

where P, P1, and P2 are nonnegative weight penalty parameters. Since the nonlinearity of a network is controlled by the weights in the hyperbolic tangent transfer functions, only those weights are penalized.
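As a concrete rendering of the forward part of the NLCCA network, the sketch below (again my own illustration, with the same caveats as before) computes the canonical variates from equations (C1)-(C2) and the unpenalized cost of equation (C3):

```python
import numpy as np

def canonical_variates(x, y, Wx, bx, wx, bbx, Wy, by, wy, bby):
    """Forward maps of eqs. (C1)-(C2); x is (N, l1), y is (N, m1)."""
    hx = np.tanh(x @ Wx.T + bx)   # hidden layer h(x), length l2
    hy = np.tanh(y @ Wy.T + by)   # hidden layer h(y), length m2
    u = hx @ wx + bbx             # canonical variate neuron u
    v = hy @ wy + bby             # canonical variate neuron v
    return u, v

def nlcca_cost(u, v):
    """Cost (C3): negative correlation plus normalization terms."""
    cor = np.corrcoef(u, v)[0, 1]
    return (-cor + np.mean(u) ** 2 + np.mean(v) ** 2
            + (np.sqrt(np.mean(u ** 2)) - 1) ** 2
            + (np.sqrt(np.mean(v ** 2)) - 1) ** 2)
```

The two inverse maps (C4) and (C5) are ordinary one-hidden-layer regressions and can be fitted separately with their own MSE costs, exactly as described in paragraph [80].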
[83] ACKNOWLEDGMENTS. I would like to thank Aiming Wu, who supplied Figures 19 and 21. The American Meteorological Society, Elsevier Science, and Blackwell Science kindly granted permission to reproduce or adapt figures from their publications. Support through strategic and research grants from the Natural Sciences and Engineering Research Council of Canada and the Canadian Foundation for Climate and Atmospheric Sciences is gratefully acknowledged.

[84] Kendal McGuffie was the Editor responsible for this paper. He thanks one technical reviewer and one cross-disciplinary reviewer.

REFERENCES

Aires, F., A. Chédin, and J. P. Nadal (2000), Independent component analysis of multivariate time series: Application to the tropical SST variability, J. Geophys. Res., 105(D13), 17,437–17,455.
Baldwin, M., et al. (2001), The quasi-biennial oscillation, Rev. Geophys., 39, 179–229.
Barnett, T. P., and R. Preisendorfer (1987), Origins and levels of monthly and seasonal forecast skill for United States surface air temperatures determined by canonical correlation analysis, Mon. Weather Rev., 115, 1825–1850.
Barnston, A. G., and C. F. Ropelewski (1992), Prediction of ENSO episodes using canonical correlation analysis, J. Clim., 5, 1316–1345.
Bishop, C. M. (1995), Neural Networks for Pattern Recognition, 482 pp., Clarendon, Oxford, U.K.
Bretherton, C. S., C. Smith, and J. M. Wallace (1992), An intercomparison of methods for finding coupled patterns in climate data, J. Clim., 5, 541–560.
Burnham, K. P., and D. R. Anderson (1998), Model Selection and Inference, 353 pp., Springer-Verlag, New York.
Cavazos, T. (1999), Large-scale circulation anomalies conducive to extreme precipitation events and derivation of daily rainfall in northeastern Mexico and southeastern Texas, J. Clim., 12, 1506–1523.
Cherkassky, V., and F. Mulier (1998), Learning From Data, 441 pp., John Wiley, Hoboken, N. J.
Chevallier, F., J. J. Morcrette, F. Cheruy, and N. A. Scott (2000), Use of a neural-network-based long-wave radiative-transfer scheme in the ECMWF atmospheric model, Q. J. R. Meteorol. Soc., 126, 761–776.
Comon, P. (1994), Independent component analysis—A new concept?, Signal Process., 36, 287–314.
Cybenko, G. (1989), Approximation by superpositions of a sigmoidal function, Math. Control Signals Syst., 2, 303–314.
Del Frate, F., and G. Schiavon (1999), Nonlinear principal component analysis for the radiometric inversion of atmospheric profiles by using neural networks, IEEE Trans. Geosci. Remote Sens., 37, 2335–2342.
Diaz, H. F., and V. Markgraf (2000), El Niño and the Southern Oscillation: Multiscale Variability and Global and Regional Impacts, 496 pp., Cambridge Univ. Press, New York.
Elsner, J. B., and A. A. Tsonis (1996), Singular Spectrum Analysis, 164 pp., Plenum, New York.
Essenreiter, R., M. Karrenbach, and S. Treitel (2001), Identification and classification of multiple reflections with self-organizing maps, Geophys. Prospect., 49, 341–352.
Gardner, M. W., and S. R. Dorling (1998), Artificial neural networks (the multilayer perceptron)—A review of applications in the atmospheric sciences, Atmos. Environ., 32, 2627–2636.
Gemmill, W. H., and V. M. Krasnopolsky (1999), The use of SSM/I data in operational marine analysis, Weather Forecasting, 14, 789–800.
Ghil, M., et al. (2002), Advanced spectral methods for climatic time series, Rev. Geophys., 40(1), 1003, doi:10.1029/2000RG000092.
Golyandina, N. E., V. V. Nekrutkin, and A. A. Zhigljavsky (2001), Analysis of Time Series Structure: SSA and Related Techniques, 320 pp., Chapman and Hall, New York.
Hamilton, K. (1998), Dynamics of the tropical middle atmosphere: A tutorial review, Atmos. Ocean, 36, 319–354.
Hamilton, K., and W. W. Hsieh (2002), Representation of the QBO in the tropical stratospheric wind by nonlinear principal component analysis, J. Geophys. Res., 107(D15), 4232, doi:10.1029/2001JD001250.
Hastie, T., and W. Stuetzle (1989), Principal curves, J. Am. Stat. Assoc., 84, 502–516.
Hastie, T., R. Tibshirani, and J. Friedman (2001), The Elements of Statistical Learning, 552 pp., Springer-Verlag, New York.
Haykin, S. (1999), Neural Networks: A Comprehensive Foundation, 842 pp., Prentice-Hall, Old Tappan, N. J.
Hoerling, M. P., A. Kumar, and M. Zhong (1997), El Niño, La Niña and the nonlinearity of their teleconnections, J. Clim., 10, 1769–1786.
Hollingsworth, J. L., R. M. Haberle, and J. Schaeffer (1997), Seasonal variations of storm zones on Mars, Adv. Space Res., 19, 1237–1240.
Holton, J. R., and H.-C. Tan (1980), The influence of the equatorial quasi-biennial oscillation on the global circulation at 50 mb, J. Atmos. Sci., 37, 2200–2208.
Hornik, K., M. Stinchcombe, and H. White (1989), Multilayer feedforward networks are universal approximators, Neural Networks, 2, 359–366.
Hsieh, W. W. (2000), Nonlinear canonical correlation analysis by neural networks, Neural Networks, 13, 1095–1105.
Hsieh, W. W. (2001a), Nonlinear principal component analysis by neural networks, Tellus, Ser. A, 53, 599–615.
Hsieh, W. W. (2001b), Nonlinear canonical correlation analysis of the tropical Pacific climate variability using a neural network approach, J. Clim., 14, 2528–2539.
Hsieh, W. W., and K. Hamilton (2003), Nonlinear singular spectrum analysis of the tropical stratospheric wind, Q. J. R. Meteorol. Soc., 129, 2367–2382.
Hsieh, W. W., and B. Tang (1998), Applying neural network models to prediction and data analysis in meteorology and oceanography, Bull. Am. Meteorol. Soc., 79, 1855–1870.
Hsieh, W. W., and A. Wu (2002), Nonlinear multichannel singular spectrum analysis of the tropical Pacific climate variability using a neural network approach, J. Geophys. Res., 107(C7), 3076, doi:10.1029/2001JC000957.
Hyvärinen, A., J. Karhunen, and E. Oja (2001), Independent Component Analysis, 504 pp., John Wiley, Hoboken, N. J.
Jolliffe, I. T. (2002), Principal Component Analysis, 502 pp., Springer-Verlag, New York.
Kaiser, H. F. (1958), The varimax criterion for analytic rotation in factor analysis, Psychometrika, 23, 187–200.
Kirby, M. J., and R. Miranda (1996), Circular nodes in neural networks, Neural Comput., 8, 390–402.
Kohonen, T. (1982), Self-organized formation of topologically correct feature maps, Biol. Cybern., 43, 59–69.
Kohonen, T. (2001), Self-Organizing Maps, 501 pp., Springer-Verlag, New York.
Kramer, M. A. (1991), Nonlinear principal component analysis using autoassociative neural networks, AIChE J., 37, 233–243.
Krasnopolsky, V. M., D. Chalikov, L. C. Breaker, and D. Rao (2000), Application of neural networks for efficient calculation of sea water density or salinity from the UNESCO equation of state, in Proceedings of the Second Conference on Artificial Intelligence, pp. 27–30, Am. Math. Soc., Providence, R. I.
Lai, P. L., and C. Fyfe (1999), A neural implementation of canonical correlation analysis, Neural Networks, 12, 1391–1397.
Lai, P. L., and C. Fyfe (2000), Kernel and non-linear canonical correlation analysis, Int. J. Neural Syst., 10, 365–377.
Lorenz, E. N. (1963), Deterministic nonperiodic flow, J. Atmos. Sci., 20, 130–141.
Malthouse, E. C. (1998), Limitations of nonlinear PCA as performed with generic neural networks, IEEE Trans. Neural Networks, 9, 165–173.
Mardia, K. V., J. T. Kent, and J. M. Bibby (1979), Multivariate Analysis, 518 pp., Academic, San Diego, Calif.
Marzban, C. (2000), A neural network for tornado diagnosis: Managing local minima, Neural Comput. Appl., 9, 133–141.
McCulloch, W. S., and W. Pitts (1943), A logical calculus of the ideas immanent in neural nets, Bull. Math. Biophys., 5, 115–137.
Melzer, T., M. Reiter, and H. Bischof (2003), Appearance models based on kernel canonical correlation analysis, Pattern Recognit., 36, 1961–1971.
Monahan, A. H. (2000), Nonlinear principal component analysis by neural networks: Theory and application to the Lorenz system, J. Clim., 13, 821–835.
Monahan, A. H. (2001), Nonlinear principal component analysis: Tropical Indo-Pacific sea surface temperature and sea level pressure, J. Clim., 14, 219–233.
Monahan, A. H., J. C. Fyfe, and G. M. Flato (2000), A regime view of Northern Hemisphere atmospheric variability and change under global warming, Geophys. Res. Lett., 27, 1139–1142.
Monahan, A. H., L. Pandolfo, and J. C. Fyfe (2001), The preferred structure of variability of the Northern Hemisphere atmospheric circulation, Geophys. Res. Lett., 28, 1019–1022.
Newbigging, S. C., L. A. Mysak, and W. W. Hsieh (2003), Improvements to the nonlinear principal component analysis method, with applications to ENSO and QBO, Atmos. Ocean, 41, 291–299.
Philander, S. G. (1990), El Niño, La Niña, and the Southern Oscillation, 293 pp., Academic, San Diego, Calif.
Preisendorfer, R. W. (1988), Principal Component Analysis in Meteorology and Oceanography, 425 pp., Elsevier Sci., New York.
Rangarajan, G. K., and L. M. Barreto (2000), Long term variability in solar wind velocity and IMF intensity and the relationship between solar wind parameters and geomagnetic activity, Earth Planets Space, 52, 121–132.
Richaume, P., F. Badran, M. Crepon, C. Mejía, H. Roquet, and S. Thiria (2000), Neural network wind retrieval from ERS-1 scatterometer data, J. Geophys. Res., 105(C4), 8737–8751.
Richman, M. B. (1986), Rotation of principal components, J. Clim., 6, 293–335.
Ripley, B. D. (1996), Pattern Recognition and Neural Networks, 403 pp., Cambridge Univ. Press, New York.
Rojas, R. (1996), Neural Networks—A Systematic Introduction, 502 pp., Springer-Verlag, New York.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986), Learning internal representations by error propagation, in Parallel Distributed Processing, vol. 1, edited by D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, pp. 318–362, MIT Press, Cambridge, Mass.
Sandham, W., and M. Leggett (2003), Geophysical Applications of Artificial Neural Networks and Fuzzy Logic, 348 pp., Kluwer Acad., Norwell, Mass.
Shabbar, A., and A. G. Barnston (1996), Skill of seasonal climate forecasts in Canada using canonical correlation analysis, Mon. Weather Rev., 124, 2370–2385.
Taner, M. T., T. Berge, J. A. Walls, M. Smith, G. Taylor, D. Dumas, and M. B. Carr (2001), Well log calibration of Kohonen-classified seismic attributes using Bayesian logic, J. Pet. Geol., 24, 405–416.
Tang, B. Y., W. W. Hsieh, A. H. Monahan, and F. T. Tangang (2000), Skill comparisons between neural networks and canonical correlation analysis in predicting the equatorial Pacific sea surface temperatures, J. Clim., 13, 287–293.
Tang, Y., and W. W. Hsieh (2002), Hybrid coupled models of the tropical Pacific-ENSO prediction, Clim. Dyn., 19, 343–353.
Tang, Y., and W. W. Hsieh (2003), Nonlinear modes of decadal and interannual variability of the subsurface thermal structure in the Pacific Ocean, J. Geophys. Res., 108(C3), 3084, doi:10.1029/2001JC001236.
Vapnik, V. N. (1998), Statistical Learning Theory, 736 pp., John Wiley, Hoboken, N. J.
Villmann, T., E. Merenyi, and B. Hammer (2003), Neural maps in remote sensing image analysis, Neural Networks, 16, 389–403.
von Storch, H., and F. W. Zwiers (1999), Statistical Analysis in Climate Research, 484 pp., Cambridge Univ. Press, New York.
Watari, S. (1996), Separation of periodic, chaotic, and random components in solar activity, Sol. Phys., 168, 413–422.
Woodruff, S. D., R. J. Slutz, R. L. Jenne, and P. M. Steurer (1987), A comprehensive ocean-atmosphere data set, Bull. Am. Meteorol. Soc., 68, 1239–1250.
Wu, A., and W. W. Hsieh (2002), Nonlinear canonical correlation analysis of the tropical Pacific wind stress and sea surface temperature, Clim. Dyn., 19, 713–722, doi:10.1007/s00382-002-0262-8.
Wu, A., and W. W. Hsieh (2003), Nonlinear interdecadal changes of the El Niño-Southern Oscillation, Clim. Dyn., 21, 719–730, doi:10.1007/s00382-003-0361-1.
Wu, A., W. W. Hsieh, and A. Shabbar (2002), Nonlinear characteristics of the surface air temperature over Canada, J. Geophys. Res., 107(D21), 4571, doi:10.1029/2001JD001090.
Wu, A., W. W. Hsieh, and F. W. Zwiers (2003), Nonlinear modes of North American winter climate variability detected from a general circulation model, J. Clim., 16, 2325–2339.
Yacoub, M., F. Badran, and S. Thiria (2001), A topological hierarchical clustering: Application to ocean color classification, in Artificial Neural Networks—ICANN 2001, International Conference, Vienna, Austria, August 21–25, 2001: Proceedings, Lect. Notes Comput. Sci., vol. 2130, edited by G. Dorffner, H. Bischof, and K. Hornik, pp. 492–499, Springer-Verlag, New York.
Yuval (2000), Neural network training for prediction of climatological time series, regularized by minimization of the generalized cross validation function, Mon. Weather Rev., 128, 1456–1473.
Yuval (2001), Enhancement and error estimation of neural network prediction of Niño 3.4 SST anomalies, J. Clim., 14, 2150–2163.
Yuval, and W. W. Hsieh (2002), The impact of time-averaging on the detectability of nonlinear empirical relations, Q. J. R. Meteorol. Soc., 128, 1609–1622.
Yuval, and W. W. Hsieh (2003), An adaptive nonlinear MOS scheme for precipitation forecasts using neural networks, Weather Forecasting, 18, 303–310.

W. W. Hsieh, Department of Earth and Ocean Sciences, University of British Columbia, 6339 Stores Road, Vancouver, BC V6T 1Z4, Canada. (whsieh@eos.ubc.ca)
