UBC Faculty Research and Publications

Nonlinear multivariate and time series analysis by neural network methods Hsieh, William W. 2004-03-18


Full Text

NONLINEAR MULTIVARIATE AND TIME SERIES ANALYSIS BY NEURAL NETWORK METHODS

William W. Hsieh
Department of Earth and Ocean Sciences, University of British Columbia, Vancouver, British Columbia, Canada

Received 29 March 2002; revised 6 November 2003; accepted 12 November 2003; published 18 March 2004.

[1] Methods in multivariate statistical analysis are essential for working with large amounts of geophysical data, such as data from observational arrays, from satellites, or from numerical model output. In classical multivariate statistical analysis, there is a hierarchy of methods, starting with linear regression at the base, followed by principal component analysis (PCA) and finally canonical correlation analysis (CCA). A multivariate time series method, singular spectrum analysis (SSA), has been a fruitful extension of the PCA technique. The common drawback of these classical methods is that only linear structures can be correctly extracted from the data. Since the late 1980s, neural network methods have become popular for performing nonlinear regression and classification. More recently, neural network methods have been extended to perform nonlinear PCA (NLPCA), nonlinear CCA (NLCCA), and nonlinear SSA (NLSSA). This paper presents a unified view of the NLPCA, NLCCA, and NLSSA techniques and their applications to various data sets of the atmosphere and the ocean (especially for the El Niño-Southern Oscillation and the stratospheric quasi-biennial oscillation). These data sets reveal that the linear methods are often too simplistic to describe real-world systems, with a tendency to scatter a single oscillatory phenomenon into numerous unphysical modes or higher harmonics, which can be largely alleviated in the new nonlinear paradigm.

INDEX TERMS: 3299 Mathematical Geophysics: General or miscellaneous; 3394 Meteorology and Atmospheric Dynamics: Instruments and techniques; 4294 Oceanography: General: Instruments and techniques; 4522 Oceanography: Physical: El Niño. KEYWORDS: neural networks, principal component analysis, canonical correlation analysis, singular spectrum analysis, El Niño.

Citation: Hsieh, W. W. (2004), Nonlinear multivariate and time series analysis by neural network methods, Rev. Geophys., 42, RG1003, doi:10.1029/2002RG000112.

1. INTRODUCTION

[2] In a standard text on classical multivariate statistical analysis [e.g., Mardia et al., 1979] the chapters typically proceed from linear regression to principal component analysis and then to canonical correlation analysis. In regression one tries to find how the response variable y is linearly affected by the predictor variables $\mathbf{x} \equiv [x_1, \ldots, x_l]$, i.e.,

$$y = \mathbf{r} \cdot \mathbf{x} + r_0 + \epsilon, \qquad (1)$$

where $\epsilon$ is the error (or residual) and the regression coefficients $\mathbf{r}$ and $r_0$ are found by minimizing the mean of $\epsilon^2$.

1.1. Principal Component Analysis

[3] However, in many data sets one cannot separate variables into predictor and response variables. For instance, one may have a data set of the monthly sea surface temperatures (SST) collected at 1000 grid locations over several decades; that is, the data set is of the form x(t) = [x_1, ..., x_l], where each variable x_i (i = 1, ..., l) has N samples labeled by the index t. Very often, t is simply the time, and each x_i is a time series containing N observations.
Principal component analysis (PCA), also known as empirical orthogonal function (EOF) analysis, looks for u, a linear combination of the x_i, and an associated vector a, with

$$u(t) = \mathbf{a} \cdot \mathbf{x}(t), \qquad (2)$$

so that ⟨‖x(t) − a u(t)‖²⟩ is minimized, where ⟨·⟩ denotes a sample or time mean. Here u, called the first principal component (PC) (or score), is often a time series, while a, called the first eigenvector (also called an EOF or loading), is the first eigenvector of the data covariance matrix, and a often describes a spatially standing oscillation pattern. Together u and a make up the first PCA mode. In essence, a given data set is approximated by a straight line (oriented in the direction of a), which accounts for the maximum amount of variance in the data; pictorially, in a scatterplot of the data the straight line found by PCA passes through the "middle" of the data set. From the residual, x − a u, the second PCA mode can similarly be extracted, and so on for the higher modes. In practice, the common algorithms for PCA extract all modes simultaneously [Jolliffe, 2002; Preisendorfer, 1988]. By retaining only the leading modes, PCA has been commonly used to reduce the dimensionality of the data set and to extract the main patterns from the data set. PCA has also been extended to the singular spectrum analysis (SSA) technique for time series analysis [Elsner and Tsonis, 1996; von Storch and Zwiers, 1999; Ghil et al., 2002].

1.2. Canonical Correlation Analysis

[4] Next consider two data sets {x_i(t)} and {y_j(t)}, each with N samples. We group the {x_i(t)} variables to form the vector x(t) and the {y_j(t)} variables to form the vector y(t). Canonical correlation analysis (CCA) [Mardia et al., 1979; Bretherton et al., 1992; von Storch and Zwiers, 1999] looks for linear combinations

$$u(t) = \mathbf{a} \cdot \mathbf{x}(t), \quad v(t) = \mathbf{b} \cdot \mathbf{y}(t), \qquad (3)$$

where the canonical variates u and v have maximum correlation; that is, the weight vectors a and b are chosen such that cor(u, v), the Pearson correlation coefficient between u and v, is maximized. For instance, if x(t) is the sea level pressure (SLP) field and y(t) is the SST field, then CCA can be used to extract the correlated spatial patterns a and b in the SLP and SST fields. Unlike regression, which tries to study how each y_j is related to the x variables, CCA examines how the entire y field is related to the x field. This holistic view has made CCA popular [Barnett and Preisendorfer, 1987; Barnston and Ropelewski, 1992; Shabbar and Barnston, 1996].

[5] In the environmental sciences, researchers have to work with large data sets, from satellite images of the Earth's surface and global climate data to voluminous output from large numerical models. Multivariate techniques such as PCA and CCA have become indispensable in extracting essential information from these massive data sets [von Storch and Zwiers, 1999]. However, the restriction of finding only linear relations means that nonlinear relations are either missed or misinterpreted by these methods. The introduction of nonlinear multivariate and time series techniques is crucial to further advancement in the environmental sciences.
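Before turning to the nonlinear methods, it may help to fix ideas with a minimal sketch of classical PCA as defined by equation (2). The sketch below (in Python with NumPy; the function name and test data are illustrative and not from the original paper) extracts the first mode by eigendecomposition of the sample covariance matrix.

```python
import numpy as np

def pca_first_mode(X):
    """First PCA mode of X with shape (N, l): N samples of l variables.

    Returns the leading eigenvector a (length l) and the PC time series
    u(t) = a . x(t) of equation (2).
    """
    Xc = X - X.mean(axis=0)               # remove the sample mean
    C = Xc.T @ Xc / (len(Xc) - 1)         # l x l data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh, since C is symmetric
    a = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    u = Xc @ a                            # first principal component (score)
    return a, u

# Illustrative use: 500 samples of 3 correlated variables
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, 500)
X = np.column_stack([t, 0.5 * t, t + 0.1 * rng.standard_normal(500)])
a, u = pca_first_mode(X)
```

The residual matrix Xc − outer(u, a) could then be fed back in to extract the second mode, mirroring the deflation described above.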
1.3. Feed Forward Neural Network Models

[6] The nonlinear neural network (NN) models originated from research trying to understand how the brain functions with its networks of interconnected neurons [McCulloch and Pitts, 1943]. There are many types of NN models; some are only of interest to neurological researchers, while others are general nonlinear data techniques. There are now many good textbooks on NN models [Bishop, 1995; Rojas, 1996; Ripley, 1996; Cherkassky and Mulier, 1998; Haykin, 1999].

[7] The most widely used NN models are the feed forward NNs, also called multilayer perceptrons [Rumelhart et al., 1986], which perform nonlinear regression and classification. The basic architecture (Figure 1) consists of a layer of input neurons x_i (a "neuron" is simply a variable in NN jargon) linked to one or more layers of "hidden" neurons, which are, in turn, linked to a layer of output neurons y_j. In Figure 1 there is only one layer of hidden neurons h_k. A transfer function (an "activation" function in NN jargon) maps from the inputs to the hidden neurons. There is a variety of choices for the transfer function, the hyperbolic tangent function being a common one, i.e.,

$$h_k = \tanh\left(\sum_i w_{ki} x_i + b_k\right), \qquad (4)$$

where w_ki and b_k are the weight and bias parameters, respectively. The tanh(z) function is a sigmoidal-shaped function, whose two asymptotic values of ±1 as z → ±∞ can be viewed as representing the two states of a neuron (at rest or activated), depending on the strength of the excitation z. (If there is more than one hidden layer, then equations of the same form as equation (4) are used to calculate the values of the next layer of hidden neurons from the current layer.) When the feed forward NN is used for nonlinear regression, the output neurons y_j are usually calculated by a linear combination of the neurons in the preceding layer, i.e.,

$$y_j = \sum_k \tilde{w}_{jk} h_k + \tilde{b}_j. \qquad (5)$$

Figure 1. Schematic diagram of the feed forward neural network (NN) model, with one "hidden" layer of neurons (i.e., variables) (denoted by circles) sandwiched between the input layer and the output layer. In the feed forward NN model the information only flows forward, starting from the input neurons. Increasing the number of hidden neurons increases the number of model parameters. Adapted from Hsieh and Tang [1998].

[8] Given observed data y_oj, the optimal values for the weight and bias parameters (w_ki, w̃_jk, b_k, and b̃_j) are found by "training" the NN, i.e., performing a nonlinear optimization in which the cost function or objective function

$$J = \left\langle \sum_j \left(y_j - y_{oj}\right)^2 \right\rangle \qquad (6)$$

is minimized, J being simply the mean squared error (MSE) of the output. The NN has then found a set of nonlinear regression relations y_j = f_j(x). To approximate a set of continuous functions f_j, only one layer of hidden neurons is enough, provided enough hidden neurons are used in that layer [Hornik et al., 1989; Cybenko, 1989]. The NN with one hidden layer is commonly called a two-layer NN, as there are two layers of mapping (equations (4) and (5)) going from input to output; however, there are other conventions for counting the number of layers, and some authors refer to our two-layer NN as a three-layer NN since there are three layers of neurons.
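As a concrete illustration of equations (4)-(6), the following sketch computes the forward pass and the cost of a two-layer feed forward NN. The weights here are random placeholders standing in for trained values; all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2

# Weight and bias parameters; training would tune these by optimization.
W, b = rng.standard_normal((n_hid, n_in)), rng.standard_normal(n_hid)
W2, b2 = rng.standard_normal((n_out, n_hid)), rng.standard_normal(n_out)

def forward(x):
    """Map an input vector to the outputs through one hidden layer."""
    h = np.tanh(W @ x + b)    # hidden neurons, equation (4)
    return W2 @ h + b2        # linear output layer, equation (5)

def cost(X, Y_obs):
    """Cost function J of equation (6): MSE of the outputs over all samples."""
    Y = np.array([forward(x) for x in X])
    return np.mean(np.sum((Y - Y_obs) ** 2, axis=1))
```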
1.4. Local Minima and Overfitting

[9] The main difficulty of the NN method is that the nonlinear optimization often encounters multiple local minima in the cost function. This means that starting from different initial guesses for the parameters, the optimization algorithm may converge to different local minima. Many approaches have been proposed to alleviate this problem [Bishop, 1995; Hsieh and Tang, 1998]; a common approach involves multiple optimization runs starting from different random initial parameters so that, hopefully, not all runs will be stranded at shallow local minima.

[10] Another pitfall with the NN method is overfitting, i.e., fitting to the noise in the data, because of the tremendous flexibility of the NN to fit the data. With enough hidden neurons the NN can fit the data, including the noise, to arbitrary accuracy. Thus, for a network with many parameters, reaching the global minimum may mean nothing more than finding a badly overfitted solution. Usually, only a portion of the data record is used to train (i.e., fit) the NN model; the remainder is reserved to validate the model. If too many hidden neurons are used, then the NN model fit to the training data will be excellent, but the model fit to the validation data will be poor, thereby allowing the researchers to gauge the appropriate number of hidden neurons. During the optimization process it is also common to monitor the MSE over the training data and over the validation data separately. As the number of iterations of the optimization algorithm increases, the MSE calculated over the training data decreases; however, beyond a certain number of iterations the MSE over the validation data begins to increase, indicating the start of overfitting and hence the appropriate time to stop the optimization process. Another approach to avoiding overfitting is to add weight penalty terms to the cost function, as discussed in Appendix A. Yet another approach is to compute an ensemble of NN models starting from different random initial parameters; the mean of the ensemble of NN solutions tends to be smoother than the individual NN solutions.

[11] If forecast skills are to be estimated, then another unused part of the data record has to be reserved as independent test data for estimating the forecast skills, as the validation data have already been used to determine the model architecture. Some authors interchange the terminology for "validation" data and "test" data; the terminology here follows Bishop [1995]. For poor quality data sets (e.g., short, noisy data records) the problems of local minima and overfitting could render nonlinear NN methods incapable of offering any advantage over linear methods.
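The early stopping procedure described in section 1.4 can be sketched generically. Here `step` and `mse` are hypothetical callables standing in for one iteration of the nonlinear optimizer and for MSE evaluation over the training or validation split; this is a sketch of the monitoring logic, not any particular library's API.

```python
import numpy as np

def train_with_early_stopping(step, mse, max_iter=1000, patience=20):
    """Run optimization iterations while monitoring the validation MSE.

    step():     perform one iteration of the nonlinear optimization
    mse(split): current MSE over 'train' or 'val' data
    Stops once the validation MSE has not improved for `patience`
    iterations, the point where overfitting is judged to begin.
    """
    best_val, best_iter = np.inf, 0
    for i in range(max_iter):
        step()
        val_mse = mse('val')
        if val_mse < best_val:
            best_val, best_iter = val_mse, i
        elif i - best_iter >= patience:
            break          # validation error no longer improving
    return best_val
```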
[12] The feed forward NN has been applied to a variety of nonlinear regression and classification problems in environmental sciences such as meteorology and oceanography and has been reviewed by Gardner and Dorling [1998] and Hsieh and Tang [1998]. Some examples of recent applications using NN include the following: tornado diagnosis [Marzban, 2000], efficient radiative transfer computation in atmospheric general circulation models [Chevallier et al., 2000], multiparameter satellite retrievals from the Special Sensor Microwave/Imager [Gemmill and Krasnopolsky, 1999], wind retrieval from scatterometer data [Richaume et al., 2000], adaptive nonlinear model output statistics [Yuval and Hsieh, 2003], efficient computation of sea water density or salinity from a nonlinear equation of state [Krasnopolsky et al., 2000], tropical Pacific sea surface temperature prediction [Tang et al., 2000; Yuval, 2001], and an empirical atmosphere in a hybrid coupled atmosphere-ocean model of the tropical Pacific [Tang and Hsieh, 2002]. For NN applications in geophysics (seismic exploration, well log lithology determination, electromagnetic exploration, and earthquake seismology), see Sandham and Leggett [2003].

[13] To keep within the scope of a review paper, I have to omit reviewing the numerous fine papers on using NN for nonlinear regression and classification and focus on the topic of how the feed forward NN can be extended from its original role of nonlinear regression to nonlinear PCA (section 2), nonlinear CCA (section 3), and nonlinear SSA (section 4), illustrated by examples from the tropical Pacific atmosphere-ocean interactions and the equatorial stratospheric wind variations. These examples reveal various disadvantages of the linear methods; the most common one is the tendency to scatter a single oscillatory phenomenon into numerous modes or higher harmonics.

2. NONLINEAR PRINCIPAL COMPONENT ANALYSIS (NLPCA)

2.1. Open Curves

[14] As PCA finds a straight line which passes through the "middle" of the data cluster, the obvious next step is to generalize the straight line to a curve. Kramer [1991] proposed a neural-network-based NLPCA model in which the straight line is replaced by a continuous open curve for approximating the data.

[15] The fundamental difference between NLPCA and PCA is that PCA allows only a linear mapping (equation (2)) between x and the PC u, while NLPCA allows a nonlinear mapping. To perform NLPCA, the feed forward NN in Figure 2a contains three hidden layers of neurons between the input and output layers of variables. The NLPCA is basically a standard feed forward NN with four layers of transfer functions mapping from the inputs to the outputs. One can view the NLPCA network as composed of two standard two-layer feed forward NNs placed one after the other. The first two-layer network maps from the inputs x through a hidden layer to the bottleneck layer with only one neuron u, i.e., a nonlinear mapping u = f(x). The next two-layer feed forward NN inversely maps from the nonlinear PC (NLPC) u back to the original higher-dimensional x space, with the objective that the outputs x′ = g(u) be as close as possible to the inputs x (thus the NN is said to be autoassociative). Note that g(u) nonlinearly generates a curve in the x space and hence a one-dimensional (1-D) approximation of the original data. To minimize the MSE of this approximation, the cost function J = ⟨‖x − x′‖²⟩ is minimized to solve for the weight and bias parameters of the NN. Squeezing the input information through a bottleneck layer with only one neuron accomplishes the dimensional reduction.
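A minimal sketch of this autoassociative architecture follows (forward pass and cost only; the parameter shapes and names are illustrative assumptions, and actually extracting a mode would require feeding the cost to a nonlinear optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)
l, m = 3, 4   # l input variables, m neurons in the encoding/decoding layers
params = (rng.standard_normal((l, m)), rng.standard_normal(m),   # W1, b1: encoding
          rng.standard_normal(m), 0.0,                           # w2, b2: bottleneck
          rng.standard_normal(m), rng.standard_normal(m),        # W3, b3: decoding
          rng.standard_normal((m, l)), rng.standard_normal(l))   # W4, b4: output

def nlpca_forward(X, params):
    """Forward pass of the autoassociative NN of Figure 2a.

    X: (N, l) data. Returns the NLPC u = f(x) (one value per sample)
    and the outputs x' = g(u), the 1-D curve approximation of the data.
    """
    W1, b1, w2, b2, W3, b3, W4, b4 = params
    H1 = np.tanh(X @ W1 + b1)            # encoding layer
    u = H1 @ w2 + b2                     # bottleneck: a single linear neuron
    H2 = np.tanh(np.outer(u, W3) + b3)   # decoding layer
    return u, H2 @ W4 + b4               # output layer x' = g(u)

def nlpca_cost(X, params):
    """J = <||x - x'||^2>, minimized over the weight and bias parameters."""
    _, X_out = nlpca_forward(X, params)
    return np.mean(np.sum((X - X_out) ** 2, axis=1))
```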
Details of the NLPCA are given in Appendix A.

Figure 2. (a) Schematic diagram of the NN model for calculating the nonlinear principal component analysis (NLPCA). There are three layers of hidden neurons sandwiched between the input layer x on the left and the output layer x′ on the right. Next to the input layer is the encoding layer, followed by the "bottleneck" layer (with a single neuron u), which is then followed by the decoding layer. A nonlinear function maps from the higher-dimensional input space to the one-dimensional bottleneck space, followed by an inverse transform mapping from the bottleneck space back to the original space represented by the outputs, which are to be as close to the inputs as possible by minimizing the cost function J = ⟨‖x − x′‖²⟩. Data compression is achieved by the bottleneck, with the bottleneck neuron giving u, the nonlinear principal component (NLPC). (b) Schematic diagram of the NN model for calculating the NLPCA with a circular node at the bottleneck (NLPCA(cir)). Instead of having one bottleneck neuron u, there are now two neurons p and q constrained to lie on a unit circle in the p-q plane, so there is only one free angular variable θ, the NLPC. This network is suited for extracting a closed curve solution. From Hsieh [2001a], reprinted with permission from Blackwell Science, Oxford.

[16] In effect, the linear relation (2) in PCA is now generalized to u = f(x), where f can be any nonlinear continuous function representable by a feed forward NN mapping from the input layer to the bottleneck layer; and instead of ⟨‖x(t) − a u(t)‖²⟩, ⟨‖x − g(u)‖²⟩ is minimized. The residual, x − g(u), can be input into the same network to extract the second NLPCA mode, and so on for the higher modes.

[17] That the classical PCA is indeed a linear version of this NLPCA can be readily seen by replacing all the transfer functions with the identity function, thereby removing the nonlinear modeling capability of the NLPCA. Then the forward map to u involves only a linear combination of the original variables, as in the PCA.

[18] The NLPCA has been applied to the radiometric inversion of atmospheric profiles [Del Frate and Schiavon, 1999] and to the Lorenz [1963] three-component chaotic system [Monahan, 2000; Hsieh, 2001a]. For the tropical Pacific climate variability the NLPCA has been used to study the SST field [Monahan, 2001; Hsieh, 2001a] and the SLP field [Monahan, 2001]. The Northern Hemisphere atmospheric variability [Monahan et al., 2000, 2001], the Canadian surface air temperature [Wu et al., 2002], and the subsurface thermal structure of the Pacific Ocean [Tang and Hsieh, 2003] have also been investigated by the NLPCA.

[19] In the classical linear approach there is a well-known dichotomy between PCA and rotated PCA (RPCA) [Richman, 1986]. In PCA the linear mode that accounts for the most variance of the data set is sought. However, as illustrated by Preisendorfer [1988, Figure 7.3], the resulting eigenvectors may not align close to local data clusters, so the eigenvectors may not represent actual physical states well. One application of RPCA methods is to rotate the PCA eigenvectors so that they point closer to the local clusters of data points [Preisendorfer, 1988]. Thus the rotated eigenvectors may bear greater resemblance to actual physical states (though they account for less variance) than the unrotated eigenvectors; hence RPCA is also widely used [Richman, 1986; von Storch and Zwiers, 1999]. As there are many possible criteria for rotation, there are many RPCA schemes, among which the varimax scheme [Kaiser, 1958] is perhaps the most popular.

[20] The tropical Pacific climate system contains the famous interannual variability known as the El Niño-Southern Oscillation (ENSO), a coupled atmosphere-ocean interaction involving the oceanic phenomenon El Niño and the associated atmospheric phenomenon, the Southern Oscillation. The coupled interaction results in anomalously warm SST in the eastern equatorial Pacific during El Niño episodes and cool SST in the central equatorial Pacific during La Niña episodes [Philander, 1990; Diaz and Markgraf, 2000]. ENSO is an irregular oscillation, but spectral analysis does reveal a broad spectral peak at the 4- to 5-year period. Hsieh [2001a] used the tropical Pacific SST data (1950-1999) to make a three-way comparison between NLPCA, RPCA, and PCA. The tropical Pacific SST anomaly (SSTA) data (i.e., the SST data with the climatological seasonal cycle removed) were prefiltered by
PCA, with only the three leading modes retained. PCA modes 1, 2, and 3 accounted for 51.4%, 10.1%, and 7.2%, respectively, of the variance in the SSTA data. The first three PCs (PC1, PC2, and PC3) were used as the input x for the NLPCA network.

[21] The data are shown as dots in a scatterplot in the PC1-PC2 plane (Figure 3), where the cool La Niña states lie in the top left corner and the warm El Niño states lie in the top right corner. The NLPCA solution is a U-shaped curve linking the La Niña states at one end (low u) to the El Niño states at the other end (high u), similar to the solution found originally by Monahan [2001]. In contrast, the first PCA eigenvector lies along the horizontal line, and the second PCA eigenvector lies along the vertical line (Figure 3); neither comes close to the El Niño states in the top right corner or the La Niña states in the top left corner, thus demonstrating the inadequacy of PCA. For comparison, a varimax rotation [Kaiser, 1958; Preisendorfer, 1988] was applied to the first three PCA eigenvectors. (The varimax criterion can be applied to either the loadings or the PCs depending on one's objectives [Richman, 1986; Preisendorfer, 1988]; here it is applied to the PCs.) The resulting first RPCA eigenvector, shown as a dashed line in Figure 3, spears through the cluster of El Niño states in the top right corner, thereby yielding a more accurate description of the El Niño anomalies (Figure 4c) than the first PCA mode (Figure 4a), which did not fully represent the intense warming of Peruvian waters. The second RPCA eigenvector, also shown as a dashed line in Figure 3, did not improve much on the second PCA mode, with the PCA spatial pattern shown in Figure 4b and the RPCA pattern shown in Figure 4d. In terms of variance explained, the first NLPCA mode explained 56.6% of the variance, versus 51.4% explained by the first PCA mode and 47.2% explained by the first RPCA mode.

Figure 3. Scatterplot of the sea surface temperature (SST) anomaly (SSTA) data (shown as dots) in the principal component (PC) PC1-PC2 plane, with the El Niño states lying in the top right corner and the La Niña states lying in the top left corner. The PC2 axis is stretched relative to the PC1 axis for better visualization. The first-mode NLPCA approximation to the data is shown by the (overlapping) small circles, which trace out a U-shaped curve. The first principal component analysis (PCA) eigenvector is oriented along the horizontal line, and the second PCA eigenvector along the vertical line. The varimax method rotates the two PCA eigenvectors in a counterclockwise direction, so that the rotated PCA (RPCA) eigenvectors are oriented along the dashed lines. (As the varimax method generates an orthogonal rotation, the angle between the two RPCA eigenvectors is 90° in the three-dimensional PC1-PC2-PC3 space.) From Hsieh [2001a], reprinted with permission from Blackwell Science, Oxford.

[22] With the NLPCA, for a given value of the NLPC u, one can map from u to the three PCs. This is done by assigning the value u to the bottleneck neuron and mapping forward using the second half of the network in Figure 2a. Each of the three PCs can be multiplied by its associated PCA (spatial) eigenvector, and the three added together to yield the spatial pattern for that particular value of u. Unlike PCA, which gives the same spatial anomaly pattern except for changes in the amplitude as the PC varies, the NLPCA spatial pattern generally varies continuously as the NLPC changes.
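A sketch of this reconstruction step (here `decode`, standing for the second half of the trained network, and the EOF array are assumed inputs; the names are illustrative):

```python
import numpy as np

def spatial_pattern(u, decode, eofs):
    """Spatial anomaly pattern for a given value of the NLPC u.

    decode(u): maps the bottleneck value to the approximated leading PCs
               (the second half of the network in Figure 2a)
    eofs:      (l, n_grid) array; row i is the ith PCA spatial eigenvector
    """
    pcs = np.asarray(decode(u))   # here l = 3 leading PCs
    return pcs @ eofs             # sum of the PC-weighted EOF patterns
```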
Figures 4e and 4f show the spatial anomaly patterns when u has its maximum value (corresponding to the strongest El Niño) and when u has its minimum value (strongest La Niña), respectively. Clearly, the asymmetry between El Niño and La Niña (i.e., the cool anomalies during La Niña episodes (Figure 4f) are centered much farther west than the warm anomalies during El Niño (Figure 4e) [Hoerling et al., 1997]) is well captured by the first NLPCA mode; in contrast, PCA mode 1 gives a La Niña that is simply the mirror image of the El Niño (Figure 4a). While El Niño has been known to Peruvian fishermen for many centuries because of its strong SSTA off the coast of Peru and its devastation of the Peruvian fishery, La Niña, with its weak manifestation in the Peruvian waters, was not appreciated until the last two decades of the twentieth century.

Figure 4. SSTA patterns (in °C) of the PCA, RPCA, and NLPCA. (a) First and (b) second PCA spatial modes (both with their corresponding PCs at maximum value). (c) First and (d) second varimax RPCA spatial modes (both with their corresponding RPCs at maximum value). Anomaly pattern as the NLPC u of the first NLPCA mode varies from (e) its maximum (strong El Niño) to (f) its minimum (strong La Niña). With a contour interval of 0.5°C, the positive contours are shown as solid curves, negative contours as dashed curves, and the zero contour as a thick curve. Adapted from Hsieh [2001a], reprinted with permission from Blackwell Science, Oxford.

[23] In summary, PCA is used for two main purposes: (1) to reduce the dimensionality of the data set and (2) to extract features or recognize patterns from the data set. It is the second purpose where PCA can be improved upon. Both RPCA and NLPCA take the PCs from PCA as input. However, instead of multiplying the PCs by a fixed orthonormal rotation matrix, as performed in the varimax RPCA approach, NLPCA performs a nonlinear mapping of the PCs. RPCA sacrifices some of the variance explained, but by rotating the PCA eigenvectors, RPCA eigenvectors tend to point more toward local data clusters and are therefore more representative of physical states than the PCA eigenvectors.

[24] With a linear approach it is generally impossible for a solution simultaneously to (1) explain the maximum global variance of the data set and (2) approach local data clusters; hence the dichotomy between PCA and RPCA, with PCA aiming for objective 1 and RPCA aiming for objective 2. Hsieh [2001a] pointed out that with the more flexible NLPCA method both objectives 1 and 2 may be attained together; thus the nonlinearity in NLPCA unifies the PCA and RPCA approaches. It is easy to see why the dichotomy between PCA and RPCA in the linear approach automatically vanishes in the nonlinear approach.
By increasing m, the number of hidden neurons in the encoding layer (and the decoding layer), the solution is capable of passing through all local data clusters while maximizing the global variance explained. (In fact, for large enough m, NLPCA can pass through all data points, although this will in general give an undesirable, overfitted solution.)

[25] The tropical Pacific SST example illustrates that with a complicated oscillation like the El Niño-La Niña phenomenon, using a linear method such as PCA results in the nonlinear mode being scattered into several linear modes. (In fact, all three leading PCA modes are related to this phenomenon.) This brings to mind the famous parable of the three blind men and their disparate descriptions of an elephant. Thus we see the importance of NLPCA as a unifier of the separate linear modes. In the study of climate variability the wide use of PCA methods has created the somewhat misleading view that our climate is dominated by a number of spatially fixed oscillatory patterns, which is, in fact, due to the limitation of the linear method. Applying NLPCA to the tropical Pacific SSTA, we found no spatially fixed oscillatory patterns but an oscillation evolving in space as well as in time.

2.2. Closed Curves

[26] The NLPCA is capable of finding a continuous open curve solution, but there are many geophysical phenomena involving waves or quasiperiodic fluctuations, which call for a continuous closed curve solution. Kirby and Miranda [1996] introduced an NLPCA with a circular node at the network bottleneck (henceforth referred to as the NLPCA(cir)), so that the NLPC, as represented by the circular node, is an angular variable θ, and the NLPCA(cir) is capable of approximating the data by a closed continuous curve. Figure 2b shows the NLPCA(cir) network, which is almost identical to the NLPCA of Figure 2a, except at the bottleneck, where there are now two neurons p and q constrained to lie on a unit circle in the p-q plane, so there is only one free angular variable θ, the NLPC. Details of the NLPCA(cir) are given in Appendix B.

[27] Applications of the NLPCA(cir) have been made to the tropical Pacific SST [Hsieh, 2001a] and to the equatorial stratospheric zonal wind (i.e., the east-west component of the wind) for the quasi-biennial oscillation (QBO) [Hamilton and Hsieh, 2002]. The QBO dominates over the annual cycle or other variations in the equatorial stratosphere, with the period of oscillation varying roughly between 22 and 32 months, with a mean of about 28 months. After the 45-year means were removed, the zonal wind u at seven vertical levels in the stratosphere became the seven inputs to the NLPCA(cir) network. The NLPCA(cir) mode 1 solution is a closed curve in a seven-dimensional space. The system goes around the closed curve once as the NLPC θ varies through one cycle of the QBO.
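A sketch of the circular bottleneck constraint (illustrative names; the pre-bottleneck values z_p and z_q would come from the encoding layer of the network):

```python
import numpy as np

def circular_node(z_p, z_q):
    """Circular bottleneck of NLPCA(cir) (Figure 2b).

    The two pre-bottleneck values are projected onto the unit circle,
    so that p^2 + q^2 = 1 and only the angle theta, the NLPC, is free.
    """
    r = np.hypot(z_p, z_q)          # radius before normalization
    p, q = z_p / r, z_q / r         # constrained to the unit circle
    theta = np.arctan2(q, p)        # angular NLPC, in (-pi, pi]
    return p, q, theta
```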
Figure 5 shows the solution in three of the seven dimensions, namely, the wind anomalies at the 70-, 30-, and 10-hPa pressure levels (corresponding to elevations ranging roughly between 20 and 30 km above sea level). The NLPCA(cir) mode 1 explains 94.8% of the variance. For comparison, the linear PCA yields seven modes explaining 57.8%, 35.4%, 3.1%, 2.1%, 0.8%, 0.5%, and 0.3% of the variance, respectively. To compare with the NLPCA(cir) mode 1, Hamilton and Hsieh [2002] constructed a linear model of θ. In the plane spanned by PC1 and PC2 (each normalized by its standard deviation), an angle θ can be defined as the arctangent of the ratio of the two normalized PCA coefficients. This linear model accounts for 83.0% of the variance in the zonal wind, considerably less than the 94.8% accounted for by the NLPCA(cir) mode 1. The QBO as θ varies over one cycle is shown in Figure 6 for the NLPCA(cir) mode 1 and for the linear model. The observed strong asymmetries between the easterly and westerly phases of the QBO [Hamilton, 1998; Baldwin et al., 2001] are captured by the nonlinear mode but not by the linear mode.

Figure 5. NLPCA(cir) mode 1 solution for the equatorial stratospheric zonal wind shown by the (overlapping) circles, with the data shown as dots. For comparison, the PCA mode 1 solution is shown as a thin straight line. Only three out of seven dimensions are shown, namely, u at the top, middle, and bottom levels (10, 30, and 70 hPa). (a)-(c) Two-dimensional views. (d) Three-dimensional view. From Hamilton and Hsieh [2002].

Figure 6. (a) Contour plot of the NLPCA(cir) mode 1 zonal wind anomalies as a function of pressure and θ_weighted, where θ_weighted is θ weighted by the histogram distribution of θ. Thus θ_weighted is more representative of actual time during a cycle than θ. Contour interval is 5 m s⁻¹, with westerly winds indicated by solid lines, easterlies indicated by dashed lines, and zero contours indicated by thick lines. (b) Similar plot for a linear circular model of θ_weighted. From Hamilton and Hsieh [2002].

[28] The actual time series of the wind measured at a particular height level is somewhat noisy, and it is often desirable to have a smoother representation of the QBO time series that captures the essential features at all vertical levels. Also, the reversal of the wind from westerly to easterly and vice versa occurs at different times at different height levels, making it difficult to define the phase of the QBO. Hamilton and Hsieh [2002] found that the phase of the QBO as defined by the NLPC θ is more accurate than previous attempts to characterize the phase, leading to a stronger link between the QBO and Northern Hemisphere polar stratospheric temperatures in winter (the Holton-Tan effect) [Holton and Tan, 1980] than previously found.
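For reference, the linear phase model used above for comparison can be sketched in a couple of lines (the function name is illustrative; np.arctan2 is used so that the quadrant of the arctangent is resolved):

```python
import numpy as np

def linear_qbo_phase(pc1, pc2):
    """Linear phase model: the angle in the plane of the two leading PCs,
    each normalized by its standard deviation, as in the comparison by
    Hamilton and Hsieh [2002]."""
    return np.arctan2(pc2 / pc2.std(), pc1 / pc1.std())
```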
2.3. Other Approaches (Principal Curves and Self-Organizing Maps)

[29] Besides the autoassociative NN, several other approaches have been developed to generalize PCA [Cherkassky and Mulier, 1998]. The principal curve method [Hastie and Stuetzle, 1989; Hastie et al., 2001] finds a nonlinear curve which passes through the middle of the data points. Developed originally in the statistics community, this method does not appear to have been applied in the environmental sciences or geophysics. There is a subtle but important difference between NLPCA (by autoassociative NN) and principal curves. In the principal curve approach each point in the data space is projected to the point on the curve closest to it. In the NLPCA approach, while the mean square error (hence distance) between the data points and their projected points is minimized, it is only the mean which is minimized; there is no guarantee that an individual data point will be mapped to the closest point on the curve found by NLPCA. Hence, unlike the projection in principal curves, the projection used in NLPCA is suboptimal [Malthouse, 1998]. However, NLPCA has an advantage over the principal curve method in that its NN architecture provides a continuous (and differentiable) mapping function.

[30] Newbigging et al. [2003] used the principal curve projection concept to improve the NLPCA solution. Malthouse [1998] made a comparison between principal curves and the NLPCA model by autoassociative NN. Unfortunately, when testing a closed curve solution, he used NLPCA instead of NLPCA(cir) (which would have extracted the closed curve easily), thereby arriving at the conclusion that the NLPCA was not satisfactory for extracting closed curve solutions.

[31] Another popular NN method is the self-organizing map (SOM) [Kohonen, 1982, 2001], used widely for clustering. Since this approach fits a grid (usually a 1-D or 2-D grid) to a data set, it can be thought of as a discrete version of nonlinear PCA [Cherkassky and Mulier, 1998]. SOM has been applied to the clustering of winter daily precipitation data [Cavazos, 1999], to satellite ocean color classification [Yacoub et al., 2001], and to high-dimensional hyperspectral Airborne Visible/Infrared Imaging Spectrometer data to classify the geology of the land surface [Villmann et al., 2003]. For seismic data, SOM has been used to identify and classify multiple events [Essenreiter et al., 2001] and in well log calibration [Taner et al., 2001].

[32] Another way to generalize PCA is via independent component analysis (ICA) [Comon, 1994; Hyvärinen et al., 2001], which was developed from information theory and has been applied to study the tropical Pacific SST variability by Aires et al. [2000]. Since ICA uses higher-order statistics (e.g., kurtosis, which is very sensitive to outliers), it may not be robust enough for the noisy data sets commonly encountered in climate or seismic studies (T. Ulrych, personal communication, 2003).

3. NONLINEAR CANONICAL CORRELATION ANALYSIS (NLCCA)

[33] While many techniques have been developed for nonlinearly generalizing PCA, there has been much less activity in developing nonlinear CCA. A number of different approaches have recently been proposed to nonlinearly generalize CCA [Lai and Fyfe, 1999, 2000; Hsieh, 2000; Melzer et al., 2003]. Hsieh [2000] proposed using three feed forward NNs to accomplish NLCCA, where the linear mappings in equation (3) for the CCA are replaced by nonlinear mapping functions using two-layer feed forward NNs. The mappings from x to u and y to v are represented by the double-barreled NN on the left-hand side of Figure 7. By minimizing the cost function J = −cor(u, v), one finds the parameters that maximize the correlation cor(u, v). After the forward mapping with the double-barreled NN has been solved, inverse mappings from the canonical variates u and v to the original variables, as represented by the two standard feed forward NNs on the right side of Figure 7, are solved, where the MSE of their outputs x′ and y′ is minimized with respect to x and y, respectively.

Figure 7. Three feed forward NNs used to perform nonlinear canonical correlation analysis (NLCCA). The double-barreled NN on the left maps from the inputs x and y to the canonical variates u and v, respectively. The cost function J forces the correlation between u and v to be maximized. On the right side the top NN maps from u to the output layer x′; the cost function J1 basically minimizes the mean square error (MSE) of x′ relative to x. The third NN maps from v to the output layer y′; the cost function J2 basically minimizes the MSE of y′ relative to y. From Hsieh [2001b], reprinted with permission from the American Meteorological Society.
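The correlation-maximizing cost for the double-barreled network is simple to write down. A sketch (the canonical variate series u and v are assumed given by the two forward mappings):

```python
import numpy as np

def neg_correlation(u, v):
    """Cost J = -cor(u, v) for the double-barreled NN of Figure 7.

    Minimizing J maximizes the Pearson correlation between the
    canonical variates u and v.
    """
    uc, vc = u - u.mean(), v - v.mean()
    return -(uc @ vc) / np.sqrt((uc @ uc) * (vc @ vc))
```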
For details, see Appendix C.

[34] Consider the following test problem from Hsieh [2000]. Let

$$X_1 = t - 0.3t^2, \quad X_2 = t + 0.3t^3, \quad X_3 = t^2, \qquad (7)$$

$$Y_1 = \tilde{t}^3, \quad Y_2 = -\tilde{t} + 0.3\tilde{t}^3, \quad Y_3 = \tilde{t} + 0.3\tilde{t}^2, \qquad (8)$$

where t and t̃ are uniformly distributed random numbers in [−1, 1]. Also let

$$X'_1 = -s - 0.3s^2, \quad X'_2 = s - 0.3s^3, \quad X'_3 = -s^4, \qquad (9)$$

$$Y'_1 = \mathrm{sech}(4s), \quad Y'_2 = s + 0.3s^3, \quad Y'_3 = s - 0.3s^2, \qquad (10)$$

where s is a uniformly distributed random number in [−1, 1]. The shapes described by the X and X′ vector functions are displayed in Figure 8, and those described by Y and Y′ are displayed in Figure 9. To lowest order, equation (7) for X describes a quadratic curve, and equation (9) for X′ describes a quartic. Similarly, to lowest order, Y is a cubic, and Y′ is a hyperbolic secant. The signal in the test data was produced by adding the second mode (X′, Y′) to the first mode (X, Y), with the variance of the second mode being 1/3 that of the first mode. A small amount of Gaussian random noise, with standard deviation equal to 10% of the signal standard deviation, was also added to the data set. The data set of N = 500 points was then standardized (i.e., each variable with mean removed was normalized by the standard deviation). Note that different sequences of random numbers t_n and t̃_n (n = 1, ..., N) were used to generate the first modes X and Y, respectively; hence these two dominant modes in the x space and the y space are unrelated. In contrast, as X′ and Y′ were generated from the same sequence of random numbers s_n, they are strongly related. The NLCCA was applied to the data, and the first NLCCA mode retrieved (Figures 10 and 11) resembles the expected theoretical mode (X′, Y′). This is quite remarkable considering that X′ and Y′ have only 1/3 the variance of X and Y; that is, the NLCCA ignored the large variance of X and Y and succeeded in detecting the nonlinear correlated mode (X′, Y′). In contrast, if the NLPCA is applied to x and y separately, then the first NLPCA mode retrieved from x will be X, and the first mode from y will be Y. This illustrates the essential difference between NLPCA and NLCCA.

Figure 8. First theoretical mode X generated from equation (7), shown by the curve made up of small circles. The solid curve shows the second mode X′ generated from equation (9). A projection in the (a) x1-x2, (b) x1-x3, and (c) x2-x3 planes and (d) a three-dimensional plot. The actual data set of 500 points (shown by dots) was generated by adding mode 2 to mode 1 (with mode 2 having 1/3 the variance of mode 1) and adding a small amount of Gaussian noise. Follows Hsieh [2000], with permission from Elsevier Science.

Figure 9. First theoretical mode Y generated from equation (8), shown by the curve made up of small circles. The solid curve shows the second mode Y′ generated from equation (10). A projection in the (a) y1-y2, (b) y1-y3, and (c) y2-y3 planes and (d) a three-dimensional plot. The data set of 500 points was generated by adding mode 2 to mode 1 (with mode 2 having 1/3 the variance of mode 1) and adding a small amount of Gaussian noise. Follows Hsieh [2000], with permission from Elsevier Science.

Figure 10. NLCCA mode 1 in x space shown as a string of (densely overlapping) small circles. The theoretical mode X′ is shown as a thin solid curve, and the linear canonical correlation analysis (CCA) mode is shown as a thin dashed line. The dots display the 500 data points. The number of hidden neurons (see Appendix C) used is l2 = m2 = 3. Follows Hsieh [2000], with permission from Elsevier Science.

Figure 11. NLCCA mode 1 in y space shown as a string of overlapping small circles. The thin solid curve is the theoretical mode Y′, and the thin dashed line is the CCA mode. Follows Hsieh [2000], with permission from Elsevier Science.
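For concreteness, here is one way to generate such a test data set. The scaling of the second mode to 1/3 the variance of the first follows the description above; the exact construction in Hsieh [2000] may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
t = rng.uniform(-1, 1, N)     # drives mode X in x space, equation (7)
tt = rng.uniform(-1, 1, N)    # independent sequence drives Y, equation (8)
s = rng.uniform(-1, 1, N)     # shared sequence: makes X', Y' correlated

X = np.column_stack([t - 0.3 * t**2, t + 0.3 * t**3, t**2])                 # eq (7)
Y = np.column_stack([tt**3, -tt + 0.3 * tt**3, tt + 0.3 * tt**2])           # eq (8)
Xp = np.column_stack([-s - 0.3 * s**2, s - 0.3 * s**3, -s**4])              # eq (9)
Yp = np.column_stack([1 / np.cosh(4 * s), s + 0.3 * s**3, s - 0.3 * s**2])  # eq (10)

def combine(mode1, mode2, noise_frac=0.1):
    """Add mode 2 (scaled to 1/3 the variance of mode 1) plus 10% Gaussian
    noise, then standardize each variable."""
    scale = np.sqrt(mode1.var() / (3.0 * mode2.var()))
    data = mode1 + scale * mode2
    data = data + noise_frac * data.std() * rng.standard_normal(data.shape)
    return (data - data.mean(axis=0)) / data.std(axis=0)

x_data, y_data = combine(X, Xp), combine(Y, Yp)
```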
[35] The NLCCA has been applied to analyze the tropical Pacific sea level pressure anomaly (SLPA) and SSTA fields [Hsieh, 2001b], where the six leading PCs of the SLPA and the six PCs of the SSTA during 1950-2000 were the inputs to an NLCCA model. The first NLCCA mode is plotted in the PC spaces of the SLPA and the SSTA (Figure 12), where only the three leading PCs are shown. For the SLPA (Figure 12a) in the PC1-PC2 plane, the La Niña states are in the left corner (corresponding to low u values), while the El Niño states are in the top right corner (high u values). The CCA solutions are shown as thin straight lines. For the SSTA (Figure 12b) in the PC1-PC2 plane, the first NLCCA mode is a U-shaped curve linking the La Niña states in the top left corner (low v values) to the El Niño states in the top right corner (high v values). In general, the nonlinearity is greater in the SSTA than in the SLPA, as the difference between the CCA mode and the NLCCA mode is greater in Figure 12b than in Figure 12a.

Figure 12. NLCCA mode 1 between the tropical Pacific (a) sea level pressure anomaly (SLPA) and (b) SSTA, plotted as (overlapping) squares in the PC1-PC2-PC3 three-dimensional (3-D) space. The linear (CCA) mode is shown as a dashed line. The NLCCA mode and the CCA mode are also projected onto the PC1-PC2 plane, the PC1-PC3 plane, and the PC2-PC3 plane, where the projected NLCCA is indicated by (overlapping) circles, the CCA by thin solid lines, and the projected data points (during 1950-2000) by the scattered dots. There is no time lag between the SLPA and the corresponding SSTA data. The NLCCA solution was obtained with the number of hidden neurons l2 = m2 = 2; with l2 = m2 = 1, only a linear solution can be found. Adapted from Hsieh [2001b].

[36] The MSE of the NLCCA divided by the MSE of the CCA is a useful measure of how different the nonlinear solution is from the linear solution; a smaller ratio means greater nonlinearity, while a ratio of 1 means that the NLCCA can find only a linear solution. This ratio is 0.951 for the SLPA and 0.935 for the SSTA, confirming that the mapping for the SSTA was more nonlinear than that for the SLPA. When the data record was divided into two halves (1950-1975 and 1976-1999) to be separately analyzed by the NLCCA, Hsieh [2001b] found that this ratio decreased for the second half, implying an increase in the nonlinearity of ENSO during the more recent period.

[37] For the NLCCA mode 1, as u varies from its minimum value to its maximum value, the SLPA field varies from the strong La Niña phase to the strong El Niño phase (Figure 13). The zero contour is farther west during La Niña (Figure 13a) than during strong El Niño (Figure 13b). Similarly, as v varies from its minimum to its maximum, the SSTA field varies from strong La Niña to strong El Niño (Figure 14), revealing that the SST anomalies during La Niña are centered farther west of the anomalies during El Niño.

Figure 13. SLPA field when the canonical variate u of the NLCCA mode 1 is at (a) its minimum (strong La Niña) and (b) its maximum (strong El Niño). Contour interval is 0.5 mbar. Reprinted from Hsieh [2001b], with permission from the American Meteorological Society.

Figure 14. SSTA field when the canonical variate v is at (a) its minimum (strong La Niña) and (b) its maximum (strong El Niño). Contour interval is 0.5°C. Reprinted from Hsieh [2001b], with permission from the American Meteorological Society.

[38] Wu and Hsieh [2002, 2003] studied the relation between the tropical Pacific wind stress anomaly (WSA) and SSTA fields using the NLCCA. Wu and Hsieh [2003] found notable interdecadal changes of ENSO behavior before and after the mid-1970s climate regime shift, with greater nonlinearity found during 1981-1999 than during 1961-1975. Spatial asymmetry (for both SSTA and WSA) between El Niño and La Niña episodes was significantly enhanced in the later period. During 1981-1999 the location of the equatorial easterly WSA in the NLCCA solution during La Niña was unchanged from the earlier period, but during El Niño the westerly WSA was shifted eastward by up to 30°. From dynamical considerations based on the delay oscillator theory for ENSO (where the farther east the location of the WSA, the longer the duration of the resulting SSTA in the eastern equatorial Pacific), Wu and Hsieh [2003] concluded that this interdecadal change would lengthen the duration of the ENSO warm phase but leave the duration of the cool phase unchanged, which was confirmed with numerical model experiments. This is an example of a nonlinear data analysis detecting a feature missed by previous studies using linear techniques, which, in turn, leads to new dynamical insight.

[39] The NLCCA has also been applied to study the relation between the tropical Pacific SSTA and the Northern Hemisphere midlatitude winter atmospheric variability (500-mbar geopotential height and North American surface air temperature) simulated in an atmospheric general circulation model (GCM), demonstrating the value of NLCCA as a nonlinear diagnostic tool for GCMs [Wu et al., 2003].
4. NONLINEAR SINGULAR SPECTRUM ANALYSIS (NLSSA)

[40] By the 1980s, interest in chaos theory and dynamical systems had led to a further extension of the PCA method, singular spectrum analysis [Elsner and Tsonis, 1996; Golyandina et al., 2001; Ghil et al., 2002]. Given a time series y_j = y(t_j) (j = 1, ..., N), lagged copies of the time series are stacked to form the augmented matrix Y,

$$\mathbf{Y} = \begin{bmatrix} y_1 & y_2 & \cdots & y_{N-L+1} \\ y_2 & y_3 & \cdots & y_{N-L+2} \\ \vdots & \vdots & \ddots & \vdots \\ y_L & y_{L+1} & \cdots & y_N \end{bmatrix}. \qquad (11)$$

This matrix has the same form as a data matrix produced by L variables, each being a time series of length n = N − L + 1. Y can also be viewed as composed of its column vectors y_j, forming a vector time series y(t_j) (j = 1, ..., n). The standard PCA can be performed on the augmented data matrix Y, resulting in

$$\mathbf{y}(t_j) = \sum_i x_i(t_j)\, \mathbf{e}_i, \qquad (12)$$

where x_i is the ith PC, a time series of length n, and e_i is the ith eigenvector (or loading vector) of length L. Together x_i and e_i represent the ith SSA mode. The resulting method is the SSA with window L.

[41] In the multivariate case, with M variables y_k(t_j) ≡ y_kj (k = 1, ..., M; j = 1, ..., N), the augmented matrix can be formed by letting

$$\mathbf{Y} = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1,N-L+1} \\ \vdots & \vdots & \ddots & \vdots \\ y_{M1} & y_{M2} & \cdots & y_{M,N-L+1} \\ \vdots & \vdots & \ddots & \vdots \\ y_{1L} & y_{1,L+1} & \cdots & y_{1N} \\ \vdots & \vdots & \ddots & \vdots \\ y_{ML} & y_{M,L+1} & \cdots & y_{MN} \end{bmatrix}. \qquad (13)$$

PCA can again be applied to Y to get the SSA modes, resulting in the multichannel SSA (MSSA) method, also called the space-time PCA method or the extended EOF (EEOF) method (though in typical EEOF applications only a small number of lags are used). For brevity, we will use the term SSA to denote both SSA and MSSA. Commonly used in the meteorological and oceanographic communities [Ghil et al., 2002], SSA has also been used to analyze solar activity [Watari, 1996; Rangarajan and Barreto, 2000] and storms on Mars [Hollingsworth et al., 1997].

[42] Hsieh and Wu [2002] proposed the NLSSA method: Assume SSA has been applied to the data set, and after discarding the higher modes, we have retained the leading PCs, x(t) = [x_1, ..., x_l], where each variable x_i (i = 1, ..., l) is a time series of length n. The variables x are the inputs to the NLPCA(cir) network (Figure 2b). The NLPCA(cir), with its ability to extract closed curve solutions, is particularly well suited for extracting periodic or wave modes in the data.
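A sketch of SSA for a single time series, following equations (11) and (12) (assuming y is already an anomaly series with zero mean; function and variable names are illustrative):

```python
import numpy as np

def ssa_modes(y, L):
    """SSA of a time series y (length N, assumed zero-mean) with window L.

    Builds the L x (N - L + 1) augmented matrix of equation (11) and
    performs PCA on it. Returns the eigenvectors e_i (columns of E,
    length L) and the PCs x_i (rows, length n = N - L + 1) of
    equation (12).
    """
    N = len(y)
    n = N - L + 1
    Y = np.column_stack([y[j:j + L] for j in range(n)])  # lagged copies
    C = Y @ Y.T / n                        # L x L lag-covariance matrix
    eigvals, E = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]      # sort modes by variance explained
    E = E[:, order]
    return E, E.T @ Y                      # eigenvectors and PC time series
```

The rows of the returned PC array are then ready to serve as the inputs to the NLPCA(cir) network, as in the test problem that follows.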
In SSA it is common to encounter periodic modes, each of which has to be split into a pair of SSA modes [Elsner and Tsonis, 1996], as the underlying PCA technique is not capable of modeling a periodic mode (a closed curve) by a single mode (a straight line). Thus two (or more) SSA modes can easily be combined by NLPCA(cir) into one NLSSA mode, taking the shape of a closed curve. When implementing NLPCA(cir), Hsieh [2001a] found that there were two possible configurations, a restricted configuration and a general configuration (see Appendix B). We will use the general configuration here. After the first NLSSA mode has been extracted, it can be subtracted from x to get the residual, which can be input again into the same network to extract the second NLSSA mode, and so forth for the higher modes.

[43] To illustrate the difference between the NLSSA and the SSA, consider a test problem with a nonsinusoidal wave of the form

$$f(t) = \begin{cases} 3 & t = 1, \ldots, 7 \\ -1 & t = 8, \ldots, 28, \end{cases} \qquad (14)$$

and periodic thereafter. This is a square wave with the peak stretched to be 3 times as tall but only 1/3 as broad as the trough; it has a period of 28. Gaussian noise with twice the standard deviation of this signal was added, and the time series was normalized to unit standard deviation (Figure 15). The time series has 600 data points.

Figure 15. Noisy time series y containing a stretched square wave signal, shown by curve a. Curve b shows the stretched square wave signal, which we will try to extract from the noisy time series. Curves c, d, and e are the reconstructed components (RC) from the SSA leading modes, using one, three, and eight modes, respectively. Curve f is the NLSSA mode 1 RC (NLRC1). The dashed lines show the means of the various curves, which have been vertically shifted for better visualization.
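Generating this test signal is straightforward; a sketch (with 0-based time indexing, so t % 28 < 7 marks the 7-point peak of equation (14)):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, period = 600, 28
t = np.arange(n_points)

f = np.where(t % period < 7, 3.0, -1.0)   # peak 3x as tall, 1/3 as broad as trough
y = f + 2.0 * f.std() * rng.standard_normal(n_points)  # noise std twice signal std
y = (y - y.mean()) / y.std()              # normalize to unit standard deviation
```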
[44] SSA with window L = 50 was applied to this time series, with the first eight eigenvectors shown in Figure 16. The first eight modes individually accounted for 6.3%, 5.6%, 4.6%, 4.3%, 3.3%, 3.3%, 3.2%, and 3.1% of the variance of the augmented time series y. The leading pair of modes displays oscillations of period 28, while the next pair manifests oscillations at a period of 14, i.e., the first harmonic. The nonsinusoidal nature of the SSA eigenvectors can be seen in mode 2 (Figure 16), where the trough is broader and shallower than the peak but nowhere as intense as in the original stretched square wave signal. The PCs for modes 1-4 are also shown in Figure 17. Both the eigenvectors (Figure 16) and the PCs (Figure 17) tend to appear in pairs, each member of the pair having a similar appearance except for the quadrature phase difference.

Figure 16. First eight SSA eigenvectors as a function of time lag: (a) mode 1 (solid curve) and mode 2 (dashed curve), (b) mode 3 (solid curve) and mode 4 (dashed curve), (c) mode 5 (solid curve) and mode 6 (dashed curve), and (d) mode 7 (solid curve) and mode 8 (dashed curve).

Figure 17. PC time series of SSA (a) mode 1 (solid curve) and mode 2 (dashed curve), (b) mode 3 (solid curve) and mode 4 (dashed curve), and (c) θ, the nonlinear PC from NLSSA mode 1. Note that θ is periodic, here bounded between −π and π radians.

[45] The first eight PC time series served as inputs to the NLPCA(cir) network, with m (the number of hidden neurons in the encoding layer) ranging from 2 to 8 (and the weight penalty parameter P = 1; see Appendix B). The MSE dropped with increasing m until m = 5, beyond which the MSE showed no further improvement. The resulting NLSSA mode 1 (with m = 5) is shown in Figure 18. Not surprisingly, PC1 and PC2 are united by the approximately circular curve. What is more surprising are the Lissajous-like curves found in the PC1-PC3 plane (Figure 18b) and in the PC1-PC4 plane (Figure 18c), indicating relations between the first SSA mode and the higher modes 3 and 4. (It is well known that for two sinusoidal time series z1(t) and z2(t) oscillating at frequencies ω1 and ω2, respectively, a plot of the trajectory in the z1-z2 plane reveals a closed Lissajous curve if and only if ω2/ω1 is a rational number.) There was no relation found between PC1 and PC5, as PC5 appeared independent of PC1 (Figure 18d). However, with less noise in the input, relations can be found between PC1 and PC5 and even higher PCs.

[46] The NLSSA reconstructed component 1 (NLRC1) is the approximation of the original time series by the NLSSA mode 1. The neural network outputs x′ are the NLSSA mode 1 approximation of the eight leading PCs. Multiplying these approximated PCs by their corresponding SSA eigenvectors and summing over the eight modes allows the reconstruction of the time series from the NLSSA mode 1. As each eigenvector contains the loading over a range of lags, each value in the reconstructed time series at time t_j also involves averaging over the contributions at t_j from the various lags.
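This reconstruction, whether from the raw SSA PCs (giving the RCs) or from the network-approximated PCs (giving NLRC1), amounts to the diagonal averaging sketched below (names are illustrative):

```python
import numpy as np

def reconstruct(E, PCs, modes, N):
    """Reconstructed component from selected SSA modes.

    E:     (L, K) matrix of eigenvectors (columns)
    PCs:   (K, n) PC time series, n = N - L + 1; for NLRC1 these would
           be the PCs as approximated by the NLPCA(cir) network output
    modes: list of mode indices to include, e.g. [0] for RC1
    Each reconstructed value at time t_j averages the contributions
    from all lags that overlap t_j.
    """
    L, n = E.shape[0], PCs.shape[1]
    Y_hat = E[:, modes] @ PCs[modes, :]   # approximation of the augmented matrix
    rc, count = np.zeros(N), np.zeros(N)
    for j in range(n):                    # column j spans times j, ..., j+L-1
        rc[j:j + L] += Y_hat[:, j]
        count[j:j + L] += 1
    return rc / count                     # average over the lag contributions
```

For example, `rc1 = reconstruct(E, PCs, [0], len(y))` would give RC1 from the output of the earlier SSA sketch.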
[47] In Figure 15, NLRC1 (curve f) from NLSSA is to be compared with the reconstructed component (RC) from SSA mode 1 (RC1) (curve c). The nonsinusoidal nature of the oscillations is not revealed by the RC1 but is clearly manifested in the NLRC1, where each strong narrow peak is followed by a weak broad trough, similar to the original stretched square wave. Also, the wave amplitude is steadier in the NLRC1 than in the RC1. Using contributions from the first two SSA modes, RC1-2 (not shown) is rather similar to RC1 in appearance except for a larger amplitude.

[48] In Figure 15, curves d and e show the RC from SSA using the first three modes and the first eight modes, respectively. These curves, referred to as RC1-3 and RC1-8, respectively, show increasing noise as more modes are used. Among the RCs, with respect to the stretched square wave time series (curve b), RC1-3 attains the most favorable correlation (0.849) and root-mean-square error (RMSE) (0.245) but remains behind the NLRC1, with its correlation of 0.875 and RMSE of 0.225.

[49] The stretched square wave signal accounted for only 22.6% of the variance in the noisy data. For comparison, NLRC1 accounted for 17.9%, RC1 for 9.4%, and RC1-2 for 14.1% of the variance. With more modes the RCs account for increasingly more variance, but beyond RC1-3 the increased variance comes only from fitting the noise in the data.

[50] When classical Fourier spectral analysis was performed, the most energetic bands were the sine and cosine at a period of 14, the two together accounting for 7.0% of the variance. In this case the strong scattering of energy to higher harmonics by the Fourier technique has actually assigned 38% more energy to the first harmonic (at period 14) than to the fundamental period of 28. Next the data record was slightly shortened from 600 to 588 points, so that the record is exactly 21 times the fundamental period of our known signal; this avoids violating the periodicity assumption of Fourier analysis and the resulting spurious energy scatter into higher spectral bands. The most energetic Fourier bands were then the sine and cosine at the fundamental period of 28, the two together accounting for 9.8% of the variance, compared with 14.1% of the variance accounted for by the first two SSA modes. Thus, even with great care, the Fourier method scatters the spectral energy considerably more than the SSA method.

[51] The SSA has also been applied to the multivariate case by Hsieh and Wu [2002]. The tropical Pacific monthly SLPA data [Woodruff et al., 1987] during 1950-2000 were used. The first eight SSA modes of the SLPA accounted for 7.9%, 7.1%, 5.0%, 4.9%, 4.0%, 3.1%, 2.5%, and 1.9%, respectively, of the total variance of the augmented data. In Figure 19 the first two modes displayed the Southern Oscillation (SO), the east-west seesaw of SLPA at around the 50-month period, while the higher modes displayed fluctuations at around the QBO [Hamilton, 1998] average period of 28 months.

[52] The eight leading PCs of the SSA were then used as inputs, x_1, ..., x_8, to the NLPCA(cir) network, yielding the NLSSA mode 1 for the SLPA. This mode accounts for 17.1% of the variance of the augmented data, more than the variance explained by the first two SSA modes (15.0%). This is not surprising, as the NLSSA mode did more than just combine the SSA modes 1 and 2: It also connects the SSA mode 3 to the SSA modes 1 and 2 (Figure 20). In the x1-x3 plane the bowl-shaped projected solution implies that PC3 tends to be positive when PC1 takes on either large positive or large negative values. Similarly, in the x2-x3 plane the hill-shaped projected solution indicates that PC3 tends to be negative when PC2 takes on large positive or negative values. These curves reveal interactions between the longer-timescale (50 months) SSA modes 1 and 2 and the shorter-timescale (28 months) SSA mode 3.

[53] In the linear case of PCA or SSA, as the PC varies, the loading pattern is unchanged except for scaling by the PC. In the nonlinear case, as the NLPC varies, the loading pattern changes, as it does not generally lie along a fixed eigenvector. The space-time loading patterns for the NLSSA mode 1 at various values of the NLPC θ (Figure 21) prominently manifest the growth and decay of the negative phase of the SO (i.e., negative SLPA in the eastern equatorial Pacific and positive SLPA in the west) as time progresses. The negative phase of the SO here is much shorter and more intense than the positive phase, in agreement with observations and in contrast to the SSA modes 1 and 2 (Figures 19a and 19b), where the negative and positive phases of the SO are about equal in duration and magnitude.

[54] The tropical Pacific SSTA field was also analyzed by the NLSSA method by Hsieh and Wu [2002].
[52] The eight leading PCs of the SSA were then used as inputs, x1, ..., x8, to the NLPCA(cir) network, yielding the NLSSA mode 1 for the SLPA. This mode accounts for 17.1% of the variance of the augmented data, more than the variance explained by the first two SSA modes together (15.0%). This is not surprising, as the NLSSA mode did more than just combine the SSA modes 1 and 2: It also connects the SSA mode 3 to the SSA modes 1 and 2 (Figure 20). In the x1-x3 plane the bowl-shaped projected solution implies that PC3 tends to be positive when PC1 takes on either large positive or large negative values. Similarly, in the x2-x3 plane the hill-shaped projected solution indicates that PC3 tends to be negative when PC2 takes on large positive or negative values. These curves reveal interactions between the longer-timescale (50 months) SSA modes 1 and 2 and the shorter-timescale (28 months) SSA mode 3.

[53] In the linear case of PCA or SSA, as the PC varies, the loading pattern is unchanged except for scaling by the PC. In the nonlinear case, as the NLPC varies, the loading pattern changes, as it does not generally lie along a fixed eigenvector. The space-time loading patterns for the NLSSA mode 1 at various values of the NLPC θ (Figure 21) manifest prominently the growth and decay of the negative phase of the SO (i.e., negative SLPA in the eastern equatorial Pacific and positive SLPA in the west) as time progresses. The negative phase of the SO here is much shorter and more intense than the positive phase, in agreement with observations and in contrast to the SSA modes 1 and 2 (Figures 19a and 19b), where the negative and positive phases of the SO are about equal in duration and magnitude.

[54] The tropical Pacific SSTA field was also analyzed by the NLSSA method by Hsieh and Wu [2002]. Comparing the NLSSA mode 1 loading patterns with the patterns from the first two SSA modes of the SSTA, Hsieh and Wu [2002] found three notable differences: (1) The presence of warm anomalies for 24 months followed by cool anomalies for 24 months in the first two SSA modes is replaced in the NLSSA mode 1 by warm anomalies for 18 months followed by cool anomalies for about 33 months; although the cool anomalies can be quite mild for long periods, they can develop into full La Niña cool episodes. (2) The El Niño warm episodes are strongest near the eastern boundary, while the La Niña episodes are strongest near the central equatorial Pacific in the NLSSA mode 1, an asymmetry not found in the individual SSA modes. (3) The magnitude of the peak positive anomalies is significantly larger than that of the peak negative anomalies in the NLSSA mode 1, again an asymmetry not found in the individual SSA modes. All three differences indicate that the NLSSA mode 1 is much closer to the observed ENSO properties than the first two SSA modes are.

[55] Furthermore, from the residual the NLSSA mode 2 has been extracted by Hsieh and Wu [2002] for the SLPA field and for the SSTA field. For both variables the NLSSA mode 2 has a 39-month period, considerably longer than the QBO periods typically reported by previous studies using linear techniques [Ghil et al., 2002]. Intriguingly, the coupling between the SLPA and the SSTA fields for the second nonlinear mode of a 39-month period was found to be considerably stronger than their coupling for the first nonlinear "ENSO" mode of a 51-month period [Hsieh and Wu, 2002]. The NLSSA technique has also been used to study the stratospheric equatorial winds for the QBO phenomenon [Hsieh and Hamilton, 2003].

5. SUMMARY AND CONCLUSIONS

[56] This paper has reviewed the recent extension of the feed forward NN from its original role in nonlinear regression and classification to nonlinear PCA (for open and closed curves), nonlinear CCA, and nonlinear SSA. With examples from the atmosphere and the ocean, notably the ENSO and the stratospheric QBO phenomena, these NN methods can be seen to advance our understanding of geophysical phenomena. To highlight only a few of the many new findings by the nonlinear techniques, we note that the nonlinearity in the tropical Pacific interannual variability has been found to have increased in recent decades [Hsieh, 2001b; Wu and Hsieh, 2003]; that besides the main coupling at the ENSO timescale of about 51 months the strongest coupling between the tropical Pacific SLP and SST has been identified at a second nonlinear mode of a 39-month period, this unusual period itself arising from the interaction between the linear modes with ENSO and QBO timescales [Hsieh and Wu, 2002]; and that the phase of the stratospheric QBO can be better defined, resulting in an enhancement of the Holton-Tan effect [Hamilton and Hsieh, 2002]. The nonlinear PCA, CCA, and SSA codes (written in MATLAB) are freely downloadable from the author's web site (http://www.ocgy.ubc.ca/projects/clim.pred).

Figure 16. First eight SSA eigenvectors as a function of time lag: (a) mode 1 (solid curve) and mode 2 (dashed curve), (b) mode 3 (solid curve) and mode 4 (dashed curve), (c) mode 5 (solid curve) and mode 6 (dashed curve), and (d) mode 7 (solid curve) and mode 8 (dashed curve).

[57] PCA is widely used for two main purposes: (1) to reduce the dimensionality of the data set and (2) to extract features or recognize patterns from the data set. It is purpose 2 where PCA can be improved upon.
Rotated PCA (RPCA) sacrifices some of the variance explained, but by rotating the PCA eigenvectors, the RPCA eigenvectors can point more toward local data clusters and can therefore be more representative of physical states than the PCA eigenvectors. With the tropical Pacific SST as an example it was shown that RPCA represented El Niño states better than PCA, but neither method represented La Niña states well. In contrast, nonlinear PCA (NLPCA) passed through both the El Niño and La Niña clusters, thus representing both well within a single mode; the NLPCA first mode also accounted for more variance of the data set than the first mode of PCA or RPCA.

[58] With PCA the straight line explaining the maximum variance of the data is found. With NLPCA the straight line is replaced by a continuous, open curve. NLPCA(cir) (NLPCA with a circular node at the bottleneck) replaces the open curve with a closed curve, so periodic or wave solutions can be modeled. When dealing with data containing a nonlinear or periodic structure, the linear methods scatter the energy into multiple modes, which is usually prevented when the nonlinear methods are used.

[59] With two data fields x and y the classical CCA method finds the canonical variate u (from a linear combination of the x variables) and its partner v (from the y variables), so that the correlation between u and v is maximized. CCA finds a line in the x space, where fluctuations of the x data projected onto this line are most highly correlated with fluctuations of the y data projected onto another line in the y space. NN can perform nonlinear CCA (NLCCA), where u and v can be nonlinear functions of the x and y variables, respectively. NLCCA finds a curve in the x space where fluctuations of the x data projected onto this curve are most highly correlated with fluctuations of the y data projected onto another curve in the y space.
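For readers who want the linear baseline stated concretely, classical CCA can be computed by whitening each field and taking an SVD of the cross covariance, as in the following sketch (a standard textbook construction, not the author's code; the function names are illustrative):

import numpy as np

def inv_sqrt(s):
    # Inverse square root of a symmetric positive definite matrix
    w, v = np.linalg.eigh(s)
    return v @ np.diag(1.0 / np.sqrt(w)) @ v.T

def cca_first_mode(x, y, eps=1e-8):
    # x: (n, p) and y: (n, q) data matrices; returns the leading canonical
    # correlation and weight vectors a, b with u = x @ a and v = y @ b.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    sxx = x.T @ x / n + eps * np.eye(x.shape[1])   # small ridge for stability
    syy = y.T @ y / n + eps * np.eye(y.shape[1])
    sxy = x.T @ y / n
    kmat = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)     # whitened cross covariance
    uu, ss, vvt = np.linalg.svd(kmat)
    a = inv_sqrt(sxx) @ uu[:, 0]                   # weight vector in x space
    b = inv_sqrt(syy) @ vvt[0]                     # weight vector in y space
    return ss[0], a, b

NLCCA replaces the two linear projections u = x @ a and v = y @ b with the double-barreled NN mappings detailed in Appendix C.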
Figure 17. PC time series of SSA (a) mode 1 (solid curve) and mode 2 (dashed curve), (b) mode 3 (solid curve) and mode 4 (dashed curve), and (c) θ, the nonlinear PC from NLSSA mode 1. Note θ is periodic, here bounded between −π and π radians.

[60] For univariate and multivariate time series analysis the PCA method has been extended to the SSA technique. NN can also be used to perform nonlinear SSA (NLSSA): The data set is first condensed by the SSA; then several leading PCs from the SSA are chosen as inputs to the NLPCA(cir) network, which extracts the NLSSA mode by nonlinearly combining the various SSA modes.

[61] In general, NLSSA has several advantages over SSA: (1) The PCs from different SSA modes are linearly uncorrelated; however, they may have relationships that can be detected by the NLSSA. (2) Although the SSA modes are not restricted to sinusoidal oscillations in time like the Fourier spectral components, in practice they are inefficient in modeling nonsinusoidal periodic signals (e.g., the stretched square wave in section 4), scattering the signal energy into many SSA modes, similar to the way Fourier spectral analysis scatters the energy of a nonsinusoidal wave to its higher harmonics. The NLSSA recombines the SSA modes to extract the nonsinusoidal signal, alleviating the spurious transfer of energy to higher frequencies. In the tropical Pacific the NLSSA mode 2 of both the SSTA field and the SLPA field yielded a 39-month signal, considerably lower in frequency than the QBO frequency signals found by linear methods.

Figure 18. First NLSSA mode indicated by the (overlapping) small circles, with the input data shown as dots. The input data were the first eight PCs from the SSA of the time series y containing the stretched square wave. The NLSSA solution is a closed curve in an eight-dimensional PC space, shown projected onto (a) the PC1-PC2 plane, (b) the PC1-PC3 plane, (c) the PC1-PC4 plane, and (d) the PC1-PC5 plane.

Figure 19. (a-f) SSA modes 1-6 for the tropical Pacific sea level pressure anomalies (SLPA), respectively. The contour plots display the SSA space-time eigenvectors (loading patterns), showing the SLPA along the equator as a function of the lag. Solid contours indicate positive anomalies, and dashed contours indicate negative anomalies, with the zero contour indicated by the thick solid curve. In a separate graph beneath each contour plot the PC of each SSA mode is also plotted as a time series (where each tick mark on the abscissa indicates the start of a year). The time of the PC is synchronized to the lag time of 0 months in the space-time eigenvector. Figure 19 courtesy of A. Wu.

[62] In summary, the linear methods currently used are often too simplistic to describe complicated real-world systems, resulting in a tendency to scatter a single oscillatory phenomenon into numerous modes or higher harmonics. This introduces unphysical spatially standing patterns or spurious high-frequency energy. These problems are shown to be largely alleviated by the use of the nonlinear methods.

[63] The main disadvantage of NN methods compared with the linear methods lies in their instability or nonuniqueness: With local minima in the cost function, optimizations started from different initial parameters often end up at different minima for the NN approach. A number of optimization runs starting from different random initial parameters is needed, where the best run is chosen as the solution; even then, there is no guarantee that the global minimum has been found. Proper scaling of the input data is essential to avoid having the nonlinear optimization algorithm search for parameters with widely differing magnitudes. Regularization by adding weight penalty terms to the cost functions generally improves the stability of the NN methods. Nevertheless, for short records with noisy data one may not be able to find a reliable nonlinear solution, and the linear solution may be the best one can extract from the data. Time averaging of data (e.g., averaging daily data to yield monthly data) may also, through the central limit theorem, severely reduce the nonlinearity that can be detected [Yuval and Hsieh, 2002].
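A minimal sketch of this multistart strategy is given below, using a toy nonconvex cost in place of a real NN cost function (the cost, sizes, and number of restarts here are hypothetical; the author's implementations use MATLAB's quasi-Newton optimizer, as noted in Appendix A):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=1)

def cost(w):
    # Stand-in nonconvex cost with many local minima; in the NLPCA or
    # NLCCA applications this would be the network cost function.
    return float(np.sum(np.sin(3.0 * w) + 0.1 * w ** 2))

best = None
for _ in range(30):                       # repeated runs from random weights
    w0 = rng.normal(scale=2.0, size=10)   # random initial parameters
    res = minimize(cost, w0, method="BFGS")
    if best is None or res.fun < best.fun:
        best = res                        # keep the lowest-cost run
print(best.fun)                           # best of 30 runs, still no guarantee

Screening the runs against withheld validation data, as described in Appendix A, can additionally reject overfitted solutions.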
[64] Hence, whether the nonlinear approach has a significant advantage over the linear approach is highly dependent on the data set: The nonlinear approach is generally ineffective if the data record is short and noisy or if the underlying physics is essentially linear. For the Earth's climate, tropical variability such as ENSO and the stratospheric QBO has a strong signal-to-noise ratio and is handled well by the nonlinear methods; in contrast, in the middle and high latitudes the signal-to-noise ratio is much weaker, rendering the nonlinear methods less effective. Presently, the number of hidden neurons in the NN and the weight penalty parameters are often determined by a trial and error approach; adopting techniques such as generalized cross validation [Yuval, 2000] and information criteria [Burnham and Anderson, 1998] may in the future provide more guidance on the choice of the most appropriate NN architecture. While the NN has been widely used as the main workhorse in nonlinear multivariate and time series analysis, newly emerging techniques such as kernel-based methods [Vapnik, 1998] may play an increasingly important role in the future.

Figure 20. NLSSA mode 1 for the tropical Pacific SLPA. The PCs of SSA modes 1-8 were used as inputs x1, ..., x8 to the NLPCA(cir) network, with the resulting NLSSA mode 1 shown as (densely overlapping) crosses in the x1-x2-x3 3-D PC space. The projections of this mode onto the x1-x2, x1-x3, and x2-x3 planes are denoted by the (densely overlapping) small circles, and the projected data are shown by dots. For comparison, the linear SSA mode 1 is shown by the dashed line in the 3-D space and by the projected solid lines on the 2-D planes. From Hsieh and Wu [2002].

Figure 21. SLPA NLSSA mode 1 space-time loading patterns for various values of the NLPC θ: (a) θ = 0°, (b) θ = 60°, (c) θ = 120°, (d) θ = 180°, (e) θ = 240°, and (f) θ = 300°. The contour plots display the SLPA along the equator as a function of the lag time. Contour interval is 0.2 mbar. Figure 21 courtesy of A. Wu.

APPENDIX A: NLPCA MODEL

[65] In Figure 2a the transfer function f1 maps from x, the input column vector of length l, to the first hidden layer (the encoding layer), represented by h^(x), a column vector of length m, with elements

h_k^{(x)} = f1[(W^{(x)} x + b^{(x)})_k],    (A1)

where (with the capital bold font reserved for matrices and the small bold font for vectors) W^(x) is an m × l weight matrix, b^(x) is a column vector of length m containing the bias parameters, and k = 1, ..., m. Similarly, a second transfer function f2 maps from the encoding layer to the bottleneck layer containing a single neuron, which represents the nonlinear principal component u,

u = f2(w^{(x)} · h^{(x)} + b̄^{(x)}).    (A2)

The transfer function f1 is generally nonlinear (usually the hyperbolic tangent or the sigmoidal function, though the exact form is not critical), while f2 is usually taken to be the identity function.

[66] Next a transfer function f3 maps from u to the final hidden layer (the decoding layer) h^(u),

h_k^{(u)} = f3[(w^{(u)} u + b^{(u)})_k]    (A3)

(k = 1, ..., m), followed by f4 mapping from h^(u) to x′, the output column vector of length l, with

x′_i = f4[(W^{(u)} h^{(u)} + b̄^{(u)})_i].    (A4)

[67] The cost function J = ⟨‖x − x′‖²⟩ is minimized by finding the optimal values of W^(x), b^(x), w^(x), b̄^(x), w^(u), b^(u), W^(u), and b̄^(u). The mean-square error (MSE) between the NN output x′ and the original data x is thus minimized. The NLPCA was implemented using the hyperbolic tangent function for f1 and f3 and the identity function for f2 and f4, so that

u = w^{(x)} · h^{(x)} + b̄^{(x)},    (A5)

x′_i = (W^{(u)} h^{(u)} + b̄^{(u)})_i.    (A6)
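Collecting equations (A1)-(A6), the forward pass of the NLPCA network takes only a few lines. The NumPy sketch below is illustrative only (the parameter names are invented for readability, and the released codes are in MATLAB):

import numpy as np

def nlpca_forward(x, Wx, bx, wx, bxbar, wu, bu, Wu, bubar):
    # Equations (A1)-(A6) with f1 = f3 = tanh and f2 = f4 = identity.
    # x: (l,) input; Wx: (m, l); bx, wx, wu, bu: (m,); bxbar: scalar;
    # Wu: (l, m); bubar: (l,).
    hx = np.tanh(Wx @ x + bx)             # encoding layer, equation (A1)
    u = wx @ hx + bxbar                   # bottleneck NLPC, equation (A5)
    hu = np.tanh(wu * u + bu)             # decoding layer, equation (A3)
    xprime = Wu @ hu + bubar              # network output, equation (A6)
    return u, xprime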
[68] Furthermore, we adopt the normalization conditions that ⟨u⟩ = 0 and ⟨u²⟩ = 1. These conditions are approximately satisfied by modifying the cost function to

J = ⟨‖x − x′‖²⟩ + ⟨u⟩² + (⟨u²⟩ − 1)².    (A7)

The total number of (weight and bias) parameters used by the NLPCA is 2lm + 4m + l + 1, though the number of effectively free parameters is two less because of the constraints on ⟨u⟩ and ⟨u²⟩.

[69] The choice of m, the number of hidden neurons in both the encoding and decoding layers, follows a general principle of parsimony. A larger m increases the nonlinear modeling capability of the network but could also lead to overfitted solutions (i.e., wiggly solutions which fit to the noise in the data). If f4 is the identity function and m = 1, then equation (A6) implies that all x′_i are linearly related to a single hidden neuron; hence there can only be a linear relation between the x′_i variables. Thus, for nonlinear solutions we need to look at m ≥ 2. It is also possible to have more than one neuron at the bottleneck layer. For instance, with two bottleneck neurons the mode extracted will span a 2-D surface instead of a 1-D curve.

[70] The nonlinear optimization was carried out by the MATLAB function "fminu," a quasi-Newton algorithm. Because of local minima in the cost function, there is no guarantee that the optimization algorithm reaches the global minimum. Hence a number of runs with random initial weights and bias parameters was made. Also, 20% of the data were randomly selected as validation data and withheld from the training of the NNs. Runs where the MSE was larger for the validation data set than for the training data set were rejected to avoid overfitted solutions. Then the run with the smallest MSE was selected as the solution.

[71] In general, the most serious problem with NLPCA is the presence of local minima in the cost function. As a result, optimizations started from different initial parameters often converge to different minima, rendering the solution unstable or nonunique. Regularization of the cost function by adding weight penalty terms is an answer.

[72] The purpose of the weight penalty terms is to limit the nonlinear power of the NLPCA, which came from the nonlinear transfer functions in the network. The transfer function tanh has the property that given x in the interval [−L, L], one can find a small enough weight w, so that tanh(wx) ≈ wx; that is, the transfer function is almost linear. Similarly, one can choose a large enough w so that tanh approaches a step function, thus yielding Z-shaped solutions. If we can penalize the use of excessive weights, we can limit the degree of nonlinearity in the NLPCA solution. This is achieved with a modified cost function

J = ⟨‖x − x′‖²⟩ + ⟨u⟩² + (⟨u²⟩ − 1)² + P Σ_{k,i} (W_{ki}^{(x)})²,    (A8)

where P is the weight penalty parameter. A large P increases the concavity of the cost function and forces the weights W^(x) to be small in magnitude, thereby yielding smoother and less nonlinear solutions than when P is small or zero. Hence increasing P also reduces the number of effectively free parameters of the model. The percentage of the variance explained by the NLPCA mode is simply

100% × (1 − ⟨‖x − x′‖²⟩ / ⟨‖x − x̄‖²⟩),    (A9)

with x̄ being the mean of x.
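The cost function and variance diagnostic of equations (A7)-(A9) translate directly into code, as in the following illustrative NumPy sketch (parameter names are invented, and the optimization loop that minimizes this cost is omitted):

import numpy as np

def nlpca_cost(x, xprime, u, Wx, P=0.0):
    # Equation (A8); setting P = 0 recovers equation (A7).
    # x, xprime: (n, l) data and network outputs; u: (n,) bottleneck values;
    # Wx: (m, l) encoding weights; P: the weight penalty parameter.
    mse = np.mean(np.sum((x - xprime) ** 2, axis=1))     # <||x - x'||^2>
    constraints = np.mean(u) ** 2 + (np.mean(u ** 2) - 1.0) ** 2
    return mse + constraints + P * np.sum(Wx ** 2)

def variance_explained(x, xprime):
    # Equation (A9): percentage of variance captured by the mode.
    resid = np.mean(np.sum((x - xprime) ** 2, axis=1))
    total = np.mean(np.sum((x - x.mean(axis=0)) ** 2, axis=1))
    return 100.0 * (1.0 - resid / total)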
APPENDIX B: NLPCA(cir) MODEL

[73] At the bottleneck in Figure 2b, analogous to u in equation (A5), we calculate the prestates p0 and q0 by

p0 = w^{(x)} · h^{(x)} + b̄^{(x)},  q0 = w̃^{(x)} · h^{(x)} + b̃^{(x)},    (B1)

where w^(x) and w̃^(x) are weight parameter vectors and b̄^(x) and b̃^(x) are bias parameters. Let

r = (p0² + q0²)^{1/2};    (B2)

then the circular node is defined with

p = p0/r,  q = q0/r,    (B3)

satisfying the unit circle equation p² + q² = 1. Thus, even though there are two variables p and q at the bottleneck, there is only one angular degree of freedom, from θ (Figure 2b), because of the circle constraint. The mapping from the bottleneck to the output proceeds as in Appendix A, with equation (A3) replaced by

h_k^{(u)} = tanh[(w^{(u)} p + w̃^{(u)} q + b^{(u)})_k].    (B4)

[74] When implementing NLPCA(cir), Hsieh [2001a] found that there are actually two possible configurations: (1) a restricted configuration where the constraints ⟨p⟩ = 0 = ⟨q⟩ are applied and (2) a general configuration without the constraints. With configuration 1 the constraints can be satisfied approximately by adding the extra terms ⟨p⟩² and ⟨q⟩² to the cost function. If a closed curve solution is sought, then configuration 1 is better than configuration 2, as it has effectively two fewer parameters. However, configuration 2, being more general than configuration 1, can more readily model open curve solutions like a regular NLPCA. The reason is that if the input data mapped onto the p-q plane cover only a segment of the unit circle instead of the whole circle, then the inverse mapping from the p-q space to the output space will yield a solution resembling an open curve. Hence, given a data set, configuration 2 may yield either a closed curve or an open curve solution. Its generality comes with a price, namely, that there may be more local minima to contend with. The number of parameters is 2lm + 6m + l + 2, though under configuration 1 the number of effectively free parameters is two less because of the imposed constraints. Unlike NLPCA, which reduces to PCA when only linear transfer functions are used, NLPCA(cir) does not appear to have a linear counterpart.

APPENDIX C: NLCCA MODEL

[75] In Figure 7 the inputs x and y are mapped to the neurons in the hidden layer:

h_k^{(x)} = tanh[(W^{(x)} x + b^{(x)})_k],  h_n^{(y)} = tanh[(W^{(y)} y + b^{(y)})_n],    (C1)

where W^(x) and W^(y) are weight matrices and b^(x) and b^(y) are bias parameter vectors. The dimensions of x, y, h^(x), and h^(y) are l1, m1, l2, and m2, respectively.

[76] The canonical variate neurons u and v are calculated from a linear combination of the hidden neurons h^(x) and h^(y), respectively, with

u = w^{(x)} · h^{(x)} + b̄^{(x)},  v = w^{(y)} · h^{(y)} + b̄^{(y)}.    (C2)

These mappings are standard feed forward NNs and are capable of representing any continuous functions mapping from x to u and from y to v to any given accuracy, provided large enough l2 and m2 are used.

[77] To maximize cor(u, v), the cost function J = −cor(u, v) is minimized by finding the optimal values of W^(x), W^(y), b^(x), b^(y), w^(x), w^(y), b̄^(x), and b̄^(y). We also adopt the constraints ⟨u⟩ = 0 = ⟨v⟩ and ⟨u²⟩ = 1 = ⟨v²⟩, which are approximately satisfied by modifying the cost function to

J = −cor(u, v) + ⟨u⟩² + ⟨v⟩² + (⟨u²⟩^{1/2} − 1)² + (⟨v²⟩^{1/2} − 1)².    (C3)

[78] On the right side of Figure 7 the top NN (a standard feed forward NN) maps from u to x′ in two steps:

h_k^{(u)} = tanh[(w^{(u)} u + b^{(u)})_k],  x′ = W^{(u)} h^{(u)} + b̄^{(u)}.    (C4)

The cost function J1 = ⟨‖x′ − x‖²⟩ is minimized by finding the optimal values of w^(u), b^(u), W^(u), and b̄^(u). The MSE between the NN output x′ and the original data x is thus minimized.

[79] Similarly, the bottom NN on the right side of Figure 7 maps from v to y′:

h_n^{(v)} = tanh[(w^{(v)} v + b^{(v)})_n],  y′ = W^{(v)} h^{(v)} + b̄^{(v)},    (C5)

with the cost function J2 = ⟨‖y′ − y‖²⟩ minimized.
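As an illustration, the double-barreled cost of equation (C3) can be written in a few lines, as in this NumPy sketch (the function name is invented; the weight penalty terms of equations (C6)-(C8) below can be added analogously):

import numpy as np

def nlcca_cost(u, v):
    # Equation (C3): maximize cor(u, v) under the soft constraints
    # <u> = <v> = 0 and <u^2> = <v^2> = 1.
    cor = np.corrcoef(u, v)[0, 1]
    return (-cor
            + np.mean(u) ** 2 + np.mean(v) ** 2
            + (np.sqrt(np.mean(u ** 2)) - 1.0) ** 2
            + (np.sqrt(np.mean(v ** 2)) - 1.0) ** 2)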
The total number of parameters used by the NLCCA is 2(l1 l2 + m1 m2) + 4(l2 + m2) + l1 + m1 + 2, though the number of effectively free parameters is four less because of the constraints on ⟨u⟩, ⟨v⟩, ⟨u²⟩, and ⟨v²⟩.

[80] A number of runs mapping from (x, y) to (u, v), using random initial parameters, were performed. The run attaining the highest cor(u, v) was selected as the solution. Next a number of runs (mapping from u to x′) was used to find the solution with the smallest MSE in x′. Finally, a number of runs were used to find the solution yielding the smallest MSE in y′. After the first NLCCA mode has been retrieved from the data, the method can be applied again to the residual to extract the second mode and so forth.

[81] That the CCA is indeed a linear version of this NLCCA can be readily seen by replacing the hyperbolic tangent transfer functions in equations (C1), (C4), and (C5) with the identity function, thereby removing the nonlinear modeling capability of the NLCCA. Then the forward maps to u and v involve only a linear combination of the original variables x and y, as in the CCA.

[82] With three NNs in NLCCA, overfitting can occur in any of the three networks. With noisy data the three cost functions are modified to

J = −cor(u, v) + ⟨u⟩² + ⟨v⟩² + (⟨u²⟩^{1/2} − 1)² + (⟨v²⟩^{1/2} − 1)² + P[Σ_{k,i} (W_{ki}^{(x)})² + Σ_{n,j} (W_{nj}^{(y)})²],    (C6)

J1 = ⟨‖x′ − x‖²⟩ + P1 Σ_k (w_k^{(u)})²,    (C7)

J2 = ⟨‖y′ − y‖²⟩ + P2 Σ_n (w_n^{(v)})²,    (C8)

where P, P1, and P2 are nonnegative weight penalty parameters. Since the nonlinearity of a network is controlled by the weights in the hyperbolic tangent transfer functions, only those weights are penalized.

[83] ACKNOWLEDGMENTS. I would like to thank Aiming Wu, who supplied Figures 19 and 21. The American Meteorological Society, Elsevier Science, and Blackwell Science kindly granted permission to reproduce or adapt figures from their publications. Support through strategic and research grants from the Natural Sciences and Engineering Research Council of Canada and the Canadian Foundation for Climate and Atmospheric Sciences is gratefully acknowledged.

[84] Kendal McGuffie was the Editor responsible for this paper. He thanks one technical reviewer and one cross-disciplinary reviewer.

REFERENCES

Aires, F., A. Chédin, and J. P. Nadal (2000), Independent component analysis of multivariate time series: Application to the tropical SST variability, J. Geophys. Res., 105(D13), 17,437–17,455.
Baldwin, M., et al. (2001), The quasi-biennial oscillation, Rev. Geophys., 39, 179–229.
Barnett, T. P., and R. Preisendorfer (1987), Origins and levels of monthly and seasonal forecast skill for United States surface air temperatures determined by canonical correlation analysis, Mon. Weather Rev., 115, 1825–1850.
Barnston, A. G., and C. F. Ropelewski (1992), Prediction of ENSO episodes using canonical correlation analysis, J. Clim., 5, 1316–1345.
Bishop, C. M. (1995), Neural Networks for Pattern Recognition, 482 pp., Clarendon, Oxford, U.K.
Bretherton, C. S., C. Smith, and J. M. Wallace (1992), An intercomparison of methods for finding coupled patterns in climate data, J. Clim., 5, 541–560.
Burnham, K. P., and D. R. Anderson (1998), Model Selection and Inference, 353 pp., Springer-Verlag, New York.
Cavazos, T. (1999), Large-scale circulation anomalies conducive to extreme precipitation events and derivation of daily rainfall in northeastern Mexico and southeastern Texas, J. Clim., 12, 1506–1523.
Cherkassky, V., and F. Mulier (1998), Learning From Data, 441 pp., John Wiley, Hoboken, N. J.
Chevallier, F., J. J. Morcrette, F. Cheruy, and N. A. Scott (2000), Use of a neural-network-based long-wave radiative-transfer scheme in the ECMWF atmospheric model, Q. J. R. Meteorol. Soc., 126, 761–776.
Comon, P. (1994), Independent component analysis—A new concept?, Signal Process., 36, 287–314.
Cybenko, G. (1989), Approximation by superpositions of a sigmoidal function, Math. Control Signals Syst., 2, 303–314.
Del Frate, F., and G. Schiavon (1999), Nonlinear principal component analysis for the radiometric inversion of atmospheric profiles by using neural networks, IEEE Trans. Geosci. Remote Sens., 37, 2335–2342.
Diaz, H. F., and V. Markgraf (2000), El Niño and the Southern Oscillation: Multiscale Variability and Global and Regional Impacts, 496 pp., Cambridge Univ. Press, New York.
Elsner, J. B., and A. A. Tsonis (1996), Singular Spectrum Analysis, 164 pp., Plenum, New York.
Essenreiter, R., M. Karrenbach, and S. Treitel (2001), Identification and classification of multiple reflections with self-organizing maps, Geophys. Prospect., 49, 341–352.
Gardner, M. W., and S. R. Dorling (1998), Artificial neural networks (the multilayer perceptron)—A review of applications in the atmospheric sciences, Atmos. Environ., 32, 2627–2636.
Gemmill, W. H., and V. M. Krasnopolsky (1999), The use of SSM/I data in operational marine analysis, Weather Forecasting, 14, 789–800.
Ghil, M., et al. (2002), Advanced spectral methods for climatic time series, Rev. Geophys., 40(1), 1003, doi:10.1029/2000RG000092.
Golyandina, N. E., V. V. Nekrutin, and A. A. Zhigljavsky (2001), Analysis of Time Series Structure: SSA and Related Techniques, 320 pp., Chapman and Hall, New York.
Hamilton, K. (1998), Dynamics of the tropical middle atmosphere: A tutorial review, Atmos. Ocean, 36, 319–354.
Hamilton, K., and W. W. Hsieh (2002), Representation of the QBO in the tropical stratospheric wind by nonlinear principal component analysis, J. Geophys. Res., 107(D15), 4232, doi:10.1029/2001JD001250.
Hastie, T., and W. Stuetzle (1989), Principal curves, J. Am. Stat. Assoc., 84, 502–516.
Hastie, T., R. Tibshirani, and J. Friedman (2001), The Elements of Statistical Learning, 552 pp., Springer-Verlag, New York.
Haykin, S. (1999), Neural Networks: A Comprehensive Foundation, 842 pp., Prentice-Hall, Old Tappan, N. J.
Hoerling, M. P., A. Kumar, and M. Zhong (1997), El Niño, La Niña and the nonlinearity of their teleconnections, J. Clim., 10, 1769–1786.
Hollingsworth, J. L., R. M. Haberle, and J. Schaeffer (1997), Seasonal variations of storm zones on Mars, Adv. Space Res., 19, 1237–1240.
Holton, J. R., and H.-C. Tan (1980), The influence of the equatorial quasi-biennial oscillation on the global circulation at 50 mb, J. Atmos. Sci., 37, 2200–2208.
Hornik, K., M. Stinchcombe, and H. White (1989), Multilayer feedforward networks are universal approximators, Neural Networks, 2, 359–366.
Hsieh, W. W. (2000), Nonlinear canonical correlation analysis by neural networks, Neural Networks, 13, 1095–1105.
Hsieh, W. W. (2001a), Nonlinear principal component analysis by neural networks, Tellus, Ser. A, 53, 599–615.
Hsieh, W. W. (2001b), Nonlinear canonical correlation analysis of the tropical Pacific climate variability using a neural network approach, J. Clim., 14, 2528–2539.
Hsieh, W. W., and K. Hamilton (2003), Nonlinear singular spectrum analysis of the tropical stratospheric wind, Q. J. R. Meteorol. Soc., 129, 2367–2382.
Hsieh, W. W., and B. Tang (1998), Applying neural network models to prediction and data analysis in meteorology and oceanography, Bull. Am. Meteorol. Soc., 79, 1855–1870.
Hsieh, W. W., and A. Wu (2002), Nonlinear multichannel singular spectrum analysis of the tropical Pacific climate variability using a neural network approach, J. Geophys. Res., 107(C7), 3076, doi:10.1029/2001JC000957.
Hyvärinen, A., J. Karhunen, and E. Oja (2001), Independent Component Analysis, 504 pp., John Wiley, Hoboken, N. J.
Jolliffe, I. T. (2002), Principal Component Analysis, 502 pp., Springer-Verlag, New York.
Kaiser, H. F. (1958), The varimax criterion for analytic rotation in factor analysis, Psychometrika, 23, 187–200.
Kirby, M. J., and R. Miranda (1996), Circular nodes in neural networks, Neural Comput., 8, 390–402.
Kohonen, T. (1982), Self-organized formation of topologically correct feature maps, Biol. Cybern., 43, 59–69.
Kohonen, T. (2001), Self-Organizing Maps, 501 pp., Springer-Verlag, New York.
Kramer, M. A. (1991), Nonlinear principal component analysis using autoassociative neural networks, AIChE J., 37, 233–243.
Krasnopolsky, V. M., D. Chalikov, L. C. Breaker, and D. Rao (2000), Application of neural networks for efficient calculation of sea water density or salinity from the UNESCO equation of state, in Proceedings of the Second Conference on Artificial Intelligence, pp. 27–30, Am. Math. Soc., Providence, R. I.
Lai, P. L., and C. Fyfe (1999), A neural implementation of canonical correlation analysis, Neural Networks, 12, 1391–1397.
Lai, P. L., and C. Fyfe (2000), Kernel and non-linear canonical correlation analysis, Int. J. Neural Syst., 10, 365–377.
Lorenz, E. N. (1963), Deterministic nonperiodic flow, J. Atmos. Sci., 20, 130–141.
Malthouse, E. C. (1998), Limitations of nonlinear PCA as performed with generic neural networks, IEEE Trans. Neural Networks, 9, 165–173.
Mardia, K. V., J. T. Kent, and J. M. Bibby (1979), Multivariate Analysis, 518 pp., Academic, San Diego, Calif.
Marzban, C. (2000), A neural network for tornado diagnosis: Managing local minima, Neural Comput. Appl., 9, 133–141.
McCulloch, W. S., and W. Pitts (1943), A logical calculus of the ideas immanent in neural nets, Bull. Math. Biophys., 5, 115–137.
Melzer, T., M. Reiter, and H. Bischof (2003), Appearance models based on kernel canonical correlation analysis, Pattern Recognit., 36, 1961–1971.
Monahan, A. H. (2000), Nonlinear principal component analysis by neural networks: Theory and application to the Lorenz system, J. Clim., 13, 821–835.
Monahan, A. H. (2001), Nonlinear principal component analysis: Tropical Indo-Pacific sea surface temperature and sea level pressure, J. Clim., 14, 219–233.
Monahan, A. H., J. C. Fyfe, and G. M. Flato (2000), A regime view of Northern Hemisphere atmospheric variability and change under global warming, Geophys. Res. Lett., 27, 1139–1142.
Monahan, A. H., L. Pandolfo, and J. C. Fyfe (2001), The preferred structure of variability of the Northern Hemisphere atmospheric circulation, Geophys. Res. Lett., 28, 1019–1022.
Newbigging, S. C., L. A. Mysak, and W. W. Hsieh (2003), Improvements to the nonlinear principal component analysis method, with applications to ENSO and QBO, Atmos. Ocean, 41, 291–299.
Philander, S. G. (1990), El Niño, La Niña, and the Southern Oscillation, 293 pp., Academic, San Diego, Calif.
Preisendorfer, R. W. (1988), Principal Component Analysis in Meteorology and Oceanography, 425 pp., Elsevier Sci., New York.
Rangarajan, G. K., and L. M. Barreto (2000), Long term variability in solar wind velocity and IMF intensity and the relationship between solar wind parameters and geomagnetic activity, Earth Planets Space, 52, 121–132.
Richaume, P., F. Badran, M. Crepon, C. Mejía, H. Roquet, and S. Thiria (2000), Neural network wind retrieval from ERS-1 scatterometer data, J. Geophys. Res., 105(C4), 8737–8751.
Richman, M. B. (1986), Rotation of principal components, J. Climatol., 6, 293–335.
Ripley, B. D. (1996), Pattern Recognition and Neural Networks, 403 pp., Cambridge Univ. Press, New York.
Rojas, R. (1996), Neural Networks—A Systematic Introduction, 502 pp., Springer-Verlag, New York.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986), Learning internal representations by error propagation, in Parallel Distributed Processing, vol. 1, edited by D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, pp. 318–362, MIT Press, Cambridge, Mass.
Sandham, W., and M. Leggett (2003), Geophysical Applications of Artificial Neural Networks and Fuzzy Logic, 348 pp., Kluwer Acad., Norwell, Mass.
Shabbar, A., and A. G. Barnston (1996), Skill of seasonal climate forecasts in Canada using canonical correlation analysis, Mon. Weather Rev., 124, 2370–2385.
Taner, M. T., T. Berge, J. A. Walls, M. Smith, G. Taylor, D. Dumas, and M. B. Carr (2001), Well log calibration of Kohonen-classified seismic attributes using Bayesian logic, J. Pet. Geol., 24, 405–416.
Tang, B. Y., W. W. Hsieh, A. H. Monahan, and F. T. Tangang (2000), Skill comparisons between neural networks and canonical correlation analysis in predicting the equatorial Pacific sea surface temperatures, J. Clim., 13, 287–293.
Tang, Y., and W. W. Hsieh (2002), Hybrid coupled models of the tropical Pacific-ENSO prediction, Clim. Dyn., 19, 343–353.
Tang, Y., and W. W. Hsieh (2003), Nonlinear modes of decadal and interannual variability of the subsurface thermal structure in the Pacific Ocean, J. Geophys. Res., 108(C3), 3084, doi:10.1029/2001JC001236.
Vapnik, V. N. (1998), Statistical Learning Theory, 736 pp., John Wiley, Hoboken, N. J.
Villmann, T., E. Merenyi, and B. Hammer (2003), Neural maps in remote sensing image analysis, Neural Networks, 16, 389–403.
von Storch, H., and F. W. Zwiers (1999), Statistical Analysis in Climate Research, 484 pp., Cambridge Univ. Press, New York.
Watari, S. (1996), Separation of periodic, chaotic, and random components in solar activity, Sol. Phys., 168, 413–422.
Woodruff, S. D., R. J. Slutz, R. L. Jenne, and P. M. Steurer (1987), A comprehensive ocean-atmosphere data set, Bull. Am. Meteorol. Soc., 68, 1239–1250.
Wu, A., and W. W. Hsieh (2002), Nonlinear canonical correlation analysis of the tropical Pacific wind stress and sea surface temperature, Clim. Dyn., 19, 713–722, doi:10.1007/s00382-002-0262-8.
Wu, A., and W. W. Hsieh (2003), Nonlinear interdecadal changes of the El Niño-Southern Oscillation, Clim. Dyn., 21, 719–730, doi:10.1007/s00382-003-0361-1.
Wu, A., W. W. Hsieh, and A. Shabbar (2002), Nonlinear characteristics of the surface air temperature over Canada, J. Geophys. Res., 107(D21), 4571, doi:10.1029/2001JD001090.
Wu, A., W. W. Hsieh, and F. W. Zwiers (2003), Nonlinear modes of North American winter climate variability detected from a general circulation model, J. Clim., 16, 2325–2339.
Yacoub, M., F. Badran, and S. Thiria (2001), A topological hierarchical clustering: Application to ocean color classification, in Artificial Neural Networks—ICANN 2001, Lect. Notes Comput. Sci., vol. 2130, edited by G. Dorffner, H. Bischof, and K. Hornik, pp. 492–499, Springer-Verlag, New York.
Yuval (2000), Neural network training for prediction of climatological time series, regularized by minimization of the generalized cross validation function, Mon. Weather Rev., 128, 1456–1473.
Yuval (2001), Enhancement and error estimation of neural network prediction of Niño 3.4 SST anomalies, J. Clim., 14, 2150–2163.
Yuval, and W. W. Hsieh (2002), The impact of time-averaging on the detectability of nonlinear empirical relations, Q. J. R. Meteorol. Soc., 128, 1609–1622.
Yuval, and W. W. Hsieh (2003), An adaptive nonlinear MOS scheme for precipitation forecasts using neural networks, Weather Forecasting, 18, 303–310.

W. W. Hsieh, Department of Earth and Ocean Sciences, University of British Columbia, 6339 Stores Road, Vancouver, BC V6T 1Z4, Canada. (whsieh@eos.ubc.ca)

