UBC Faculty Research and Publications

Applying Neural Network Models to Prediction and Data Analysis in Meteorology and Oceanography. Hsieh, William W.; Tang, Benyang 1998-09-30

Full Text

Applying Neural Network Models to Prediction and Data Analysis in Meteorology and Oceanography

William W. Hsieh and Benyang Tang
Department of Earth and Ocean Sciences, University of British Columbia, Vancouver, British Columbia, Canada

Bulletin of the American Meteorological Society, Vol. 79, No. 9, September 1998.

ABSTRACT

Empirical or statistical methods have been introduced into meteorology and oceanography in four distinct stages: 1) linear regression (and correlation), 2) principal component analysis (PCA), 3) canonical correlation analysis, and recently 4) neural network (NN) models. Despite the great popularity of NN models in many fields, there are three obstacles to adapting the NN method to meteorology–oceanography, especially in large-scale, low-frequency studies: (a) nonlinear instability with short data records, (b) large spatial data fields, and (c) difficulties in interpreting the nonlinear NN results. Recent research shows that these three obstacles can be overcome. For obstacle (a), ensemble averaging was found to be effective in controlling nonlinear instability. For (b), the PCA method was used as a prefilter for compressing the large spatial data fields. For (c), the mysterious hidden layer could be given a phase space interpretation, and spectral analysis aided in understanding the nonlinear NN relations. With these and future improvements, the nonlinear NN method is evolving into a versatile and powerful technique capable of augmenting traditional linear statistical methods in data analysis and forecasting; for example, the NN method has been used for El Niño prediction and for nonlinear PCA. The NN model is also found to be a type of variational (adjoint) data assimilation, which allows it to be readily linked to dynamical models under adjoint data assimilation, resulting in a new class of hybrid neural–dynamical models.

Corresponding author address: Dr. William Hsieh, Oceanography, Dept. of Earth and Ocean Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada. E-mail: william@eos.ubc.ca. In final form 4 May 1998. © 1998 American Meteorological Society.

1. Introduction

The introduction of empirical or statistical methods into meteorology and oceanography can be broadly classified as having occurred in four distinct stages: 1) Regression (and correlation analysis) was invented by Galton (1885) to find a linear relation between a pair of variables x and z. 2) With more variables, principal component analysis (PCA), also known as empirical orthogonal function (EOF) analysis, was introduced by Pearson (1901) to find the correlated patterns in a set of variables x1, . . ., xn. It became popular in meteorology only after Lorenz (1956) and was later introduced into physical oceanography in the mid-1970s. 3) To linearly relate a set of variables x1, . . ., xn to another set of variables z1, . . ., zm, canonical correlation analysis (CCA) was invented by Hotelling (1936), and became popular in meteorology and oceanography after Barnett and Preisendorfer (1987). It is remarkable that in each of these three stages, the technique was first invented in a biological–psychological field, long before its adaptation by meteorologists and oceanographers many years or decades later. Not surprisingly, stage 4, involving the neural network (NN) method, which is just starting to appear in meteorology–oceanography, also had a biological–psychological origin, as it developed from investigations into human brain function. It can be viewed as a generalization of stage 3, as it nonlinearly relates a set of variables x1, . . ., xn to another set of variables z1, . . ., zm.

The human brain is one of the great wonders of nature: even a very young child can recognize people and objects much better than the most advanced artificial intelligence program running on a supercomputer.
The brain is vastly more robust and fault tolerant than a computer. Though brain cells continually die off with age, a person continues to recognize friends last seen years ago. This robustness is the result of the brain's massively parallel computing structure, which arises from the neurons being interconnected by a network. It is this parallel computing structure that allows the neural network, with typical neuron "clock speeds" of a few milliseconds, about a millionth that of a computer, to outperform the computer in vision, motor control, and many other tasks.

Fascinated by the brain, scientists began studying logical processing in neural networks in the 1940s (McCulloch and Pitts 1943). In the late 1950s and early 1960s, F. Rosenblatt and B. Widrow independently investigated the learning and computational capability of neural networks (Rosenblatt 1962; Widrow and Sterns 1985). The perceptron model of Rosenblatt consists of a layer of input neurons interconnected to an output layer of neurons. After the limits of the perceptron model were found (Minsky and Papert 1969), interest in neural network computing waned, as it was felt that neural networks (also called connectionist models) did not offer any advantage over conventional computing methods.

While it was recognized that the perceptron model was limited by having the output neurons linked directly to the input neurons, the more interesting problem of having one or more "hidden" layers of neurons (Fig. 1) between the input and output layers was mothballed owing to the lack of an algorithm for finding the weights interconnecting the neurons. The hiatus ended in the second half of the 1980s with the rediscovery of the back-propagation algorithm (Rumelhart et al. 1986), which successfully solved the problem of finding the weights in a model with hidden layers.
Since then, NN research has rapidly become popular in artificial intelligence, robotics, and many other fields (Crick 1989). Neural networks were found to outperform the linear Box–Jenkins models (Box and Jenkins 1976) in forecasting time series with short memory and at longer lead times (Tang et al. 1991). During 1991–92, the Santa Fe Institute sponsored the Santa Fe Time Series Prediction and Analysis Competition, where all prediction methods were invited to forecast several time series. For every time series in the competition, the winner turned out to be an NN model (Weigend and Gershenfeld 1994).

Among the countless applications of NN models, pattern recognition provides many intriguing examples; for example, for security purposes, the next generation of credit cards may carry a recorded electronic image of the owner. Neural networks can now correctly verify the bearer based on the recorded image, a nontrivial task as the illumination, hair, glasses, and mood of the bearer may be very different from those in the recorded image (Javidi 1997). In artificial speech, NNs have been successful in pronouncing English text (Sejnowski and Rosenberg 1987). Also widely applied to financial forecasting (Trippi and Turban 1993; Gately 1995; Zirilli 1997), NN models are now increasingly being used by bank and fund managers for trading stocks, bonds, and foreign currencies. There are well over 300 books on artificial NNs in the subject guide to Books in Print, 1996–97 (published by R.R. Bowker), including some good texts (e.g., Bishop 1995; Ripley 1996; Rojas 1996).

By now, the reader must be wondering why these ubiquitous NNs are not more widely applied to meteorology and oceanography. Indeed, the purpose of this paper is to examine the serious problems that can arise when applying NN models to meteorology and oceanography, especially in large-scale, low-frequency studies, and the attempts to resolve these problems.
Section 2 explains the difficulties encountered with NN models, while section 3 gives a basic introduction to NN models. Sections 4–6 examine various solutions to overcome the difficulties mentioned in section 2. Section 7 applies the NN method to El Niño forecasting, while section 8 shows how the NN can be used as nonlinear PCA. The NN model is linked to variational (adjoint) data assimilation in section 9.

FIG. 1. A schematic diagram illustrating a neural network model with one hidden layer of neurons between the input layer and the output layer. For a feed-forward network, the information only flows forward, starting from the input neurons.

2. Difficulties in applying neural networks to meteorology–oceanography

Elsner and Tsonis (1992) applied the NN method to forecasting various time series and compared the results with forecasts by autoregressive (AR) models. Unfortunately, because of the computer bug noted in their corrigendum, the advantage of the NN method was not as conclusive as in their original paper. Early attempts to use neural networks for seasonal climate forecasting also met with mixed success at best (Derr and Slutz 1994; Tang et al. 1994), showing little evidence that exotic, nonlinear NN models could beat standard linear statistical methods such as the CCA method (Barnett and Preisendorfer 1987; Barnston and Ropelewski 1992; Shabbar and Barnston 1996) or its variant, the singular value decomposition (SVD) method (Bretherton et al. 1992), or even the simpler principal oscillation pattern (POP) method (Penland 1989; von Storch et al. 1995; Tang 1995).

There are three main reasons why NNs have difficulty being adapted to meteorology and oceanography: (a) nonlinear instability occurs with short data records, (b) large spatial data fields have to be dealt with, and (c) the nonlinear relations found by NN models are far less comprehensible than the linear relations found by regression methods.
Let us examine each of these difficulties.

(a) Relative to the timescale of the phenomena one is trying to analyze or forecast, most meteorological and oceanographic data records are short, especially for climate studies. While capable of modeling the nonlinear relations in the data, NN models have many free parameters. With a short data record, the problem of solving for many parameters is ill conditioned; that is, when searching for the global minimum of the cost function associated with the problem, the algorithm is often stuck in one of the numerous local minima surrounding the true global minimum (Fig. 2). Such an NN could give very noisy or completely erroneous forecasts, thus performing far worse than a simple linear statistical model. This situation is analogous to that found in the early days of numerical weather prediction, where the addition of the nonlinear advective terms to the governing equations, instead of improving the forecasts of the linear models, led to disastrous nonlinear numerical instabilities (Phillips 1959), which were overcome only after extensive research on how to discretize the advective terms correctly. Thus the challenge for NN models is how to control nonlinear instability with only the relatively short data records available.

(b) Meteorological and oceanographic data tend to cover numerous spatial grids or stations, and if each station serves as an input to an NN, then the NN will have a very large number of inputs and associated weights. The optimal search for many weights over the relatively short temporal records would be an ill-conditioned problem. Hence, even though there had been some success with NNs in seasonal forecasting using a small number of predictors (Navone and Ceccatto 1994; Hastenrath et al. 1995), it was unclear how the method could be generalized to large data fields in the manner of the CCA or SVD methods.

(c) The interpretation of the nonlinear relations found by an NN is not easy.
Unlike the parameters from a linear regression model, the weights found by an NN model are nearly incomprehensible. Furthermore, the "hidden" neurons have always been a mystery.

FIG. 2. A schematic diagram illustrating the cost function surface, where, depending on the starting condition, the search algorithm often gets trapped in one of the numerous deep local minima. The local minima labeled 2, 4, and 5 are likely to be reasonable local minima, while the minimum labeled 1 is likely to be a bad one (in that the data were not well fitted at all). The minimum labeled 3 is the global minimum, which could correspond to an overfitted solution (i.e., fitted closely to the noise in the data) and may, in fact, be a poorer solution than the minima labeled 2, 4, and 5.

Over the last few years, the University of British Columbia Climate Prediction Group has been trying to overcome these three obstacles. The purpose of this paper is to show how these obstacles can be overcome: For obstacle (a), ensemble averaging was found to be effective in controlling nonlinear instability. Various penalty, pruning, and nonconvergent methods also helped. For (b), the PCA method was used to greatly reduce the dimension of the large spatial data fields. For (c), new measures and visualization techniques helped in understanding the nonlinear NN relations and the mystery of the hidden neurons. With these improvements, the NN approach has evolved into a technique capable of augmenting the traditional linear statistical methods currently used.

Since the focus of this review is on the application of the NN method to low-frequency studies, we will only briefly mention some of the higher-frequency applications. As the problem of short data records is no longer a major obstacle in higher-frequency applications, the NN method has been successfully used in areas such as satellite imagery analysis and ocean acoustics (Lee et al. 1990; Badran et al. 1991; IEEE 1991; French et al. 1992; Peak and Tag 1992; Bankert 1994; Peak and Tag 1994; Stogryn et al. 1994; Krasnopolsky et al. 1995; Butler et al. 1996; Marzban and Stumpf 1996; Liu et al. 1997).

3. Neural network models

To keep within the scope of this paper, we will limit our survey of NN models to the feed-forward neural network. Figure 1 shows a network with one hidden layer, where the jth neuron in this hidden layer is assigned the value y_j, given in terms of the input values x_i by

y_j = \tanh\left( \sum_i w_{ij} x_i + b_j \right),    (1)

where w_{ij} and b_j are the weight and bias parameters, respectively, and the hyperbolic tangent function is used as the activation function (Fig. 3). Other functions besides the hyperbolic tangent could be used for the activation function, which was designed originally to simulate the firing or nonfiring of a neuron upon receiving input signals from its neighbors. If there are additional hidden layers, then equations of the same form as (1) are used to calculate the values of the next layer of hidden neurons from the current layer of neurons. The output neurons z_k are usually calculated by a linear combination of the neurons in the layer just before the output layer, that is,

z_k = \sum_j \tilde{w}_{jk} y_j + \tilde{b}_k.    (2)

To construct an NN model for forecasting, the predictor variables are used as the input, and the predictands (either the same variables or other variables at some lead time) as the output. With z_{dk} denoting the observed data, the NN is trained by finding the optimal values of the weight and bias parameters (w_{ij}, \tilde{w}_{jk}, b_j, and \tilde{b}_k), which minimize the cost function

J = \sum_k (z_k - z_{dk})^2,    (3)

where the right-hand side is simply the sum-squared error of the output. The optimal parameters can be found by a back-propagation algorithm (Rumelhart et al. 1986; Hertz et al. 1991).
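The forward pass (1)–(2), the cost function (3), and a gradient-descent training step can be sketched numerically as follows. This is a minimal illustration, assuming NumPy; the function names (`init_net`, `forward`, `train_step`) and all settings are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_net(n_in, n_hid, n_out, scale=0.5):
    """Small random weights: w, b feed the hidden layer (Eq. 1);
    w_out, b_out form the linear output layer (Eq. 2)."""
    return {"w": scale * rng.standard_normal((n_hid, n_in)),
            "b": np.zeros(n_hid),
            "w_out": scale * rng.standard_normal((n_out, n_hid)),
            "b_out": np.zeros(n_out)}

def forward(p, x):
    y = np.tanh(p["w"] @ x + p["b"])        # Eq. (1): hidden neurons
    z = p["w_out"] @ y + p["b_out"]         # Eq. (2): linear output
    return y, z

def cost(p, X, Z):
    """Eq. (3): sum-squared output error over all training pairs."""
    return sum(np.sum((forward(p, x)[1] - z_d) ** 2) for x, z_d in zip(X, Z))

def train_step(p, X, Z, lr=0.002):
    """One back-propagation step: accumulate dJ/d(parameter) over the
    training pairs, then descend the gradient."""
    g = {k: np.zeros_like(v) for k, v in p.items()}
    for x, z_d in zip(X, Z):
        y, z = forward(p, x)
        dz = 2.0 * (z - z_d)                     # dJ/dz
        g["w_out"] += np.outer(dz, y)
        g["b_out"] += dz
        dy = (p["w_out"].T @ dz) * (1.0 - y**2)  # back through tanh
        g["w"] += np.outer(dy, x)
        g["b"] += dy
    for k in p:
        p[k] -= lr * g[k]
    return p
```

Plain gradient descent is used here only for transparency; as noted in the text, more efficient optimizers (e.g., conjugate gradient) are normally used in practice.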
For the reader familiar with variational data assimilation methods, we would point out that this back propagation is equivalent to the backward integration of the adjoint equations in variational assimilation (see section 9). The back-propagation algorithm has now been superseded by more efficient optimization algorithms, such as the conjugate gradient method, simulated annealing, and genetic algorithms (Hertz et al. 1991).

FIG. 3. The activation function y = tanh(wx + b), shown with b = 0. The neuron is activated (i.e., outputs a value of nearly +1) when the input signal x is above a threshold; otherwise, it remains inactive (with a value of around −1). The location of the threshold along the x axis is changed by the bias parameter b, and the steepness of the threshold is changed by the weight w.

Once the optimal parameters are found, the training is finished, and the network is ready to perform forecasting with new input data. Normally the data record is divided into two, with the first piece used for network training and the second for testing forecasts. Due to the large number of parameters and the great flexibility of the NN, the model output may fit the data very well during the training period yet produce poor forecasts during the test period. This results from overfitting; that is, the NN fitted the training data so well that it fitted the noise, which of course resulted in the poor forecasts over the test period (Fig. 4). An NN model is usually capable of learning the signals in the data, but as training progresses, it often starts learning the noise in the data; that is, the forecast error of the model over the test period first decreases and then increases as the model starts to learn the noise in the training data.
Overfitting is often a serious problem with NN models, and we will discuss some solutions in the next section.

Consider the special case of an NN with no hidden layers, with the inputs being several values of a time series, the output being the prediction for the next value, and the input and output layers connected by linear activation functions. Training this simple network is then equivalent to determining an AR model through least squares regression, with the weights of the NN corresponding to the weights of the AR model. Hence the NN model reduces to the well-known AR model in this limit.

In general, most NN applications have only one or two hidden layers, since it is known that to approximate a set of reasonable functions f_k({x_i}) to a given accuracy, at most two hidden layers are needed (Hertz et al. 1991, 142). Furthermore, to approximate continuous functions, only one hidden layer is enough (Cybenko 1989; Hornik et al. 1989). In our models for forecasting the tropical Pacific (Tangang et al. 1997; Tangang et al. 1998a; Tangang et al. 1998b; Tang et al. 1998, manuscript submitted to J. Climate, hereafter THMT), we have not used more than one hidden layer. There is also an interesting distinction between the nonlinear modeling capability of NN models and that of polynomial expansions. With only a few parameters, a polynomial expansion is only capable of learning low-order interactions. In contrast, even a small NN is fully nonlinear and is not limited to learning low-order interactions. Of course, a small NN can learn only a few interactions, while a bigger one can learn more.

When forecasting at longer lead times, there are two possible approaches. The iterated forecasting approach trains the network to forecast one time step forward, then uses the forecasted result as input to the same network for the next time step, and this process is iterated until a forecast for the nth time step is obtained.
The other is the direct or "jump" forecast approach, where a network is trained for forecasting at a lead time of n time steps, with a different network for each value of n. For deterministic chaotic systems, iterated forecasts seem to be better than direct forecasts (Gershenfeld and Weigend 1994). For noisy time series, it is not clear which approach is better. Our experience with climate data suggested that the direct forecast approach is better. Even with "clearning" and continuity constraints (Tang et al. 1996), the iterated forecast method was unable to prevent the forecast errors from being amplified during the iterations.

4. Nonlinear instability

Let us examine obstacle (a); that is, the application of NNs to relatively short data records leads to an ill-conditioned problem. In this case, as the cost function is full of deep local minima, the optimal search would likely end up trapped in one of the local minima (Fig. 2). The situation gets worse when we either (i) make the NN more nonlinear (by increasing the number of neurons, and hence the number of parameters) or (ii) shorten the data record. With a different initialization of the parameters, the search would usually end up in a different local minimum.

FIG. 4. A schematic diagram illustrating the problem of overfitting: The dashed curve illustrates a good fit to noisy data (indicated by the squares), while the solid curve illustrates overfitting, where the fit is perfect on the training data (squares) but is poor on the test data (circles). Often the NN model begins by fitting the training data as the dashed curve but, with further iterations, ends up overfitting as the solid curve.

A simple way to deal with the local minima problem is to use an ensemble of NNs, where their parameters are randomly initialized before training.
The individual NN solutions would be scattered around the global minimum, but by averaging the solutions from the ensemble members, we would likely obtain a better estimate of the true solution. Tangang et al. (1998a) found the ensemble approach useful in their forecasts of tropical Pacific sea surface temperature anomalies (SSTA).

An alternative ensemble technique is "bagging" (abbreviated from bootstrap aggregating) (Breiman 1996; THMT). First, training pairs, consisting of the predictor data available at the time of a forecast and the forecast target data at some lead time, are formed. The available training pairs are separated into a training set and a forecast test set, where the latter is reserved for testing the model forecasts only and is not used for training the model. The training set is used to generate an ensemble of NN models, where each member of the ensemble is trained by a subset of the training set. The subset is drawn at random with replacement from the training set. The subset has the same number of training pairs as the training set, where some pairs in the training set appear more than once in the subset, and about 37% of the training pairs in the training set are unused in the subset. These unused pairs are not wasted, as they are used to determine when to stop training. To avoid overfitting, THMT stopped training when the error variance from applying the model to the set of "unused" pairs started to increase. By averaging the output from all the individual members of the ensemble, a final output is obtained.

We found ensemble averaging to be most effective in preventing nonlinear instability and overfitting. However, if the individual ensemble members are severely overfitting, the effectiveness of ensemble averaging is reduced.
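The bootstrap resampling at the heart of bagging is easy to sketch. The code below is our illustration (assuming NumPy), not the authors': it draws a subset of the same size as the training set with replacement, collects the unused "out-of-bag" pairs that serve to decide when to stop training, and averages the member outputs. For a large training set, the out-of-bag fraction approaches 1/e, the "about 37%" quoted above.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_indices(n_pairs):
    """Draw a bootstrap subset (same size, with replacement) of the
    training-pair indices; the pairs never drawn form the out-of-bag
    set, reserved for deciding when to stop training."""
    idx = rng.integers(0, n_pairs, size=n_pairs)
    oob = np.setdiff1d(np.arange(n_pairs), np.unique(idx))
    return idx, oob

def ensemble_mean(member_outputs):
    """Final forecast: average the outputs of all ensemble members."""
    return np.mean(member_outputs, axis=0)

n = 10_000
idx, oob = bootstrap_indices(n)
oob_fraction = len(oob) / n   # approaches 1/e ~ 0.368 for large n
```

Each ensemble member would be trained on its own `idx` resample, monitored on its own `oob` pairs, and the final forecast taken as `ensemble_mean` over the members.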
There are several ways to prevent overfitting in the individual NN models of the ensemble, as presented below.

"Stopped training" as used in THMT is a type of nonconvergent method, which was initially viewed with much skepticism, as the cost function does not in general converge to the global minimum. Under the enormous weight of empirical evidence, theorists have finally begun to rigorously study the properties of stopped training (Finnoff et al. 1993). Intuitively, it is not difficult to see why the global minimum is often not the best solution. When using linear methods, there is little danger of overfitting (provided one is not using too many parameters); hence, the global minimum is the best solution. But with powerful nonlinear methods, the global minimum would be a very close fit to the data (including the noise), like the solid curve in Fig. 4. Stopped training, in contrast, would only have time to converge to the dashed curve in Fig. 4, which is actually a better solution.

Another approach to prevent overfitting is to penalize the excessive parameters in the NN model. The ridge regression method (Chauvin 1990; Tang et al. 1994; Tangang et al. 1998a, their appendix) modifies the cost function in (3) by adding weight penalty terms, that is,

J = \sum_k (z_k - z_{dk})^2 + c_1 \sum w_{ij}^2 + c_2 \sum \tilde{w}_{jk}^2,    (4)

with c_1, c_2 positive constants, thereby forcing unimportant weights to zero.

Another alternative is network pruning, where insignificant weights are removed. Such methods have oxymoronic names like "optimal brain damage" (Le Cun et al. 1990). With appropriate use of penalty and pruning methods, the global minimum solution may be able to avoid overfitting. A comparison of the effectiveness of nonconvergent methods, penalty methods, and pruning methods is given by Finnoff et al. (1993).

In summary, the nonlinearity in the NN model introduces two problems: (i) the presence of local minima in the cost function and (ii) overfitting. Let D be the number of data points and P the number of parameters in the model.
For many applications of NNs in robotics and pattern recognition, there is almost an unlimited amount of data, so that D >> P, whereas, in contrast, D ~ P in low-frequency meteorological–oceanographic studies. The local minima problem is present when D >> P, as well as when D ~ P. However, overfitting tends to occur when D ~ P but not when D >> P. Ensemble averaging has been found to be effective in helping the NNs to cope with both the local minima and the overfitting problems.

5. Prefiltering the data fields

Let us now examine obstacle (b), that is, the spatially large data fields. Clearly, if data from each spatial grid point are to be an input neuron, the NN will have so many parameters that the problem becomes ill conditioned. We need to effectively compress the input (and output) data fields by a prefiltering process. PCA analysis is widely used to compress large data fields (Preisendorfer 1988).

In the PCA representation, we have

x_i(t) = \sum_n a_n(t) e_{in},    (5)

and

a_n(t) = \sum_i e_{in} x_i(t),    (6)

where e_n and a_n are the nth PCA mode and its time coefficient, respectively, with the PCAs forming an orthonormal basis. PCA maximizes the variance of a_1(t) and then, from the residual, maximizes the variance of a_2(t), and so forth for the higher modes, under the constraint of orthogonality for the {e_n}. If the original data x_i (i = 1, . . ., I) imply a very large number of input neurons, we can usually capture the main variance of the input data and filter out noise by using the first few PCA time series a_n (n = 1, . . ., N), with N << I, thereby greatly reducing the number of input neurons and the size of the NN model.

In practice, we would need to provide information on how the {a_n} have been evolving in time prior to making our forecast. This is usually achieved by providing {a_n(t)}, {a_n(t − Δt)}, . . ., {a_n(t − mΔt)}, that is, some of the earlier values of {a_n}, as input.
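Equations (5)–(6) and the lagged-input construction can be sketched with the singular value decomposition. This is our illustration, assuming NumPy; the synthetic field and all sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data field x_i(t): I = 50 spatial points, T = 200 times,
# two coherent patterns plus noise (a stand-in for a real gridded field).
I, T = 50, 200
t = np.arange(T)
s_grid = np.linspace(0.0, np.pi, I)
X = (np.outer(np.sin(s_grid), np.sin(2 * np.pi * t / 24))
     + 0.5 * np.outer(np.cos(s_grid), np.sin(2 * np.pi * t / 60))
     + 0.1 * rng.standard_normal((I, T)))

# PCA via the SVD: the columns of E are the orthonormal patterns e_n of
# Eq. (5); the rows of A are the time coefficients a_n(t) of Eq. (6),
# a_n(t) = sum_i e_in x_i(t).
Xm = X - X.mean(axis=1, keepdims=True)
E, s, _ = np.linalg.svd(Xm, full_matrices=False)
A = E.T @ Xm

# Keep only the first N << I modes as NN inputs; they capture the bulk
# of the variance and filter out noise.
N = 2
var_frac = np.sum(s[:N] ** 2) / np.sum(s ** 2)

# Lagged inputs {a_n(t), a_n(t - dt), ..., a_n(t - m dt)}, here with
# dt = 1 time step and m = 2 earlier values per mode.
m = 2
lagged = np.concatenate([A[:N, m - k:T - k] for k in range(m + 1)], axis=0)
```

Here `lagged` has N × (m + 1) rows, i.e., each retained mode contributes its current and m earlier values as input neurons.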
Compression of the input data in the time dimension is possible with the extended PCA or EOF (EPCA or EEOF) method (Weare and Nasstrom 1982; Graham et al. 1987). In the EPCA analysis, copies of the original data matrix X_{ij} = x_i(t_j) are stacked with time lag τ into a larger matrix X′,

X' = \left( X_{ij}^T, X_{ij+\tau}^T, . . ., X_{ij+n\tau}^T \right)^T,    (7)

where the superscript T denotes the transpose and X_{ij+nτ} = x_i(t_j + nτ). Applying the standard PCA analysis to X′ yields the EPCAs, with the corresponding time series {a_n′}, which, because time-lag information has already been incorporated into the EPCAs, could be used instead of {a_n(t)}, {a_n(t − Δt)}, . . ., {a_n(t − mΔt)} from the PCA analysis, thereby drastically reducing the number of input neurons.

This reduction in the input neurons by EPCAs comes with a price, namely the time-domain filtering automatically associated with the EPCA analysis, which results in some loss of input information. Monahan et al. (1998, manuscript submitted to Atmos.–Ocean) found that the EPCAs could become degenerate if the lag τ was close to the integral timescale of standing waves in the data. While there are clearly trade-offs between using PCAs and using EPCAs, our experiments with forecasting the tropical Pacific SST found that the much smaller networks resulting from the use of EPCAs tended to be less prone to overfitting than the networks using PCAs.

We are also studying other possible prefiltering processes. Since CCA often uses EPCA to prefilter the predictor and predictand fields (Barnston and Ropelewski 1992), we are investigating the possibility of the NN model using the CCA as a prefilter, that is, using the CCA modes instead of the PCA or EPCA modes. Alternatively, we may argue that all these prefilters are linear processes, whereas we should use a nonlinear prefilter for a nonlinear method such as the NN. We are investigating the possibility of using NN models as nonlinear PCAs to do the prefiltering (section 8).

6. Interpreting the NN model

We now turn to obstacle (c), namely, the great difficulty in understanding the nonlinear NN model results. In particular, is there a meaningful interpretation of those mysterious neurons in the hidden layer?

Consider a simple NN for forecasting the tropical Pacific wind stress field. The input consists of the first 11 EPCA time series of the wind stress field (from The Florida State University), plus a sine and a cosine function to indicate the phase with respect to the annual cycle. The single hidden layer has three neurons, and the output layer contains the same 11 EPCA time series one month later. As the values of the three hidden neurons can be plotted in 3D space, Fig. 5 shows their trajectory for selected years. From Fig. 5 and the trajectories of other years (not shown), we can identify regions in the 3D phase space as the El Niño warm event phase and its precursor phase, and the cold event and its precursor. One can issue warm event or cold event forecasts whenever the system enters the warm event precursor region or the cold event precursor region, respectively. The hidden layer thus spans a 3D phase space for the El Niño–Southern Oscillation (ENSO) system and is a higher-dimensional generalization of the 2D phase space based on the POP method (Tang 1995).

Hence we interpret the NN as a projection from the input space onto a phase space, as spanned by the neurons in the hidden layer. The state of the system in the phase space then allows a projection onto the output space, which can be the same input variables some time in the future, or other variables in the future. This interpretation provides a guide for choosing the appropriate number of neurons in the hidden layer; namely, the number of hidden neurons should be the same as the dimension of the embedding manifold for the system.
Since the ENSO system is thought to have only a few degrees of freedom (Grieger and Latif 1994), we have limited the number of hidden neurons to about 3–7 in most of our NN forecasts for the tropical Pacific, in contrast to the earlier study by Derr and Slutz (1994), where 30 hidden neurons were used. Our approach of generating a phase space for the ENSO attractor with an NN model is very different from that of Grieger and Latif (1994), since their phase space was generated from the model output instead of from the hidden layer. Their approach generated a 4D phase space by using four PCA time series as input and as output, whereas our approach did not have to limit the number of input or output neurons to generate a low-dimensional phase space for the ENSO system.

In many situations, the NN model is compared with a linear model, and one wants to know how nonlinear the NN model really is. A useful diagnostic tool to measure nonlinearity is spectral analysis (Tangang et al. 1998b). Once a network has been trained, we replace the N input signals by artificial sinusoidal signals with frequencies ω_1, ω_2, . . ., ω_N, which are carefully chosen so that the nonlinear interaction of two signals of frequencies ω_i and ω_j would generate frequencies ω_i + ω_j and ω_i − ω_j not equal to any of the original input frequencies. The amplitude of a sinusoidal signal is chosen so as to yield the same variance as that of the original real input data. The output from the NN with the sinusoidal inputs is then spectrally analyzed (Fig. 6). If the NN is basically linear, spectral peaks will be found only at the original input frequencies. The presence of an unexpected peak at frequency ω′, equaling ω_i + ω_j or ω_i − ω_j for some i, j, indicates a nonlinear interaction between the ith and the jth predictor time series.
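The sinusoidal-probe diagnostic just described can be sketched as follows. For a self-contained example (assuming NumPy) we substitute a simple quadratic interaction for the trained network; the function `model` is our stand-in, not the paper's NN. The two cosine probes sit at FFT bin frequencies k1 and k2, chosen so that k1 + k2 and k2 − k1 coincide with neither input frequency.

```python
import numpy as np

T = 256
t = np.arange(T)
k1, k2 = 5, 12               # probe bins: k1 + k2 = 17 and k2 - k1 = 7,
                             # neither equal to an input frequency
x1 = np.cos(2 * np.pi * k1 * t / T)
x2 = np.cos(2 * np.pi * k2 * t / T)

def model(x1, x2):
    """Stand-in for a trained NN: linear response plus one
    quadratic (nonlinear) interaction term."""
    return x1 + x2 + 0.5 * x1 * x2

z = model(x1, x2)
power = np.abs(np.fft.rfft(z)) ** 2
power[0] = 0.0               # discard the mean

# Peaks away from the input bins reveal nonlinear interactions:
# cos(a)cos(b) = [cos(a+b) + cos(a-b)] / 2 puts power at bins 17 and 7.
input_bins = {k1, k2}
nonlinear = sum(p for k, p in enumerate(power) if k not in input_bins)
nonlin_fraction = nonlinear / power.sum()
```

Here `nonlin_fraction` is the share of output spectral power lying away from the input frequencies (zero for a purely linear model).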
For an overall measure of the degree of nonlinearity of an NN, we can calculate the total area under the output spectrum, excluding the contribution from the original input frequencies, and divide it by the total area under the spectrum, thereby yielding an estimate of the portion of output variance that is due to nonlinear interactions. Using this measure of nonlinearity while forecasting the regional SSTA in the equatorial Pacific, Tangang et al. (1998b) found that the nonlinearity of the NN tended to vary with forecast lead time and with geographical location.

7. Forecasting the tropical Pacific SST

Many dynamical and statistical methods have been applied to forecasting the ENSO system (Barnston et al. 1994). Tangang et al. (1997) forecasted the SSTA in the Niño 3.4 region (Fig. 7) with NN models using several PCA time series from the tropical Pacific wind stress field as predictors. Tangang et al. (1998a) compared the relative merit of using the sea level pressure (SLP) as predictor versus the wind stress as predictor, with the SLP emerging as the better predictor, especially at longer lead times.

FIG. 5. The values of the three hidden neurons plotted in 3D space for the years 1972, 1973, 1976, 1982, 1983, and 1988. Projections onto 2D planes are also shown. The small circles are for the months from January to December, and the two "+" signs for January and February of the following year. El Niño warm events occurred during 1972, 1976, and 1982, while a cold event occurred in 1988. In 1973 and 1983, the Tropics returned to cooler conditions from an El Niño. Notice the similarity between the trajectories during 1972, 1976, and 1982, and during 1973, 1983, and 1988. In years with neither warm nor cold events, the trajectories oscillate randomly near the center of the cube. From these trajectories, we can identify the precursor phase regions for warm events and cold events, which could allow us to forecast these events.
The reason may be that the SLP is less noisy than the wind stress; for example, the first seven modes of the tropical Pacific SLP accounted for 81% of the total variance, versus only 54% for the seven wind stress modes. In addition, Tangang et al. (1998a) introduced the approach of ensemble averaging NN forecasts with randomly initialized weights to give better forecasts. The best forecast skills were found over the western-central and central equatorial regions (Niño 4 and 3.4), with lesser skills over the eastern regions (Niño 3, P4, and P5) (Fig. 7).

To further simplify the NN models, Tangang et al. (1998b) used EPCAs instead of PCAs and a simple pruning method. The spectral analysis method was also introduced to interpret the nonlinear interactions in the NN, showing that the nonlinearity of the networks tended to increase with lead time and to become stronger for the eastern regions of the equatorial Pacific Ocean.

THMT compare the correlation skills of NN, CCA, and multiple linear regression (LR) in forecasting the Niño 3 SSTA index. First the PCAs of the tropical Pacific SSTA field and the SLP anomaly field were calculated. The same predictors were chosen for all three methods, namely the first seven SLP anomaly PCA time series at the initial month, and at 3 months, 6 months, and 9 months before the initial month (a total of 7 × 4 = 28 predictors); the first 10 SSTA PCA time series at the initial month; and the Niño 3 SSTA at the initial month. These 39 predictors were then further compressed to 12 predictors by an EPCA, and cross-validated model forecasts were made at various lead times after the initial month. Ten CCA modes were used in the CCA model, as fewer CCA modes degraded the forecasts. Figure 8 shows that the NN has better forecast correlation skills than CCA and LR at all lead times, especially at the 12-month lead time (where the NN has a correlation skill of 0.54 vs 0.49 for CCA). Figure 9 shows the cross-validated forecasts of the Niño 3 SSTA at 6-month lead time (THMT).
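The ensemble averaging of randomly initialized networks mentioned above can be sketched as follows: several identical networks, differing only in their random initial weights, are trained on the same record and their forecasts averaged. The toy record and the tiny backpropagation trainer here are illustrative stand-ins, not the actual model of that study.

```python
import numpy as np

def train_net(x, y, n_hid=4, lr=0.05, epochs=2000, seed=0):
    """Train one small tanh network by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(1, n_hid)); b1 = np.zeros(n_hid)
    W2 = rng.normal(scale=0.5, size=(n_hid, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        H = np.tanh(x @ W1 + b1)              # hidden layer, (n, n_hid)
        err = H @ W2 + b2 - y                 # forecast error, (n, 1)
        gW2 = H.T @ err / len(x); gb2 = err.mean(0)
        dH = (err @ W2.T) * (1 - H ** 2)      # backpropagated error
        gW1 = x.T @ dH / len(x); gb1 = dH.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return lambda xq: np.tanh(xq @ W1 + b1) @ W2 + b2

# A toy "record": a nonlinear signal plus noise.
rng = np.random.default_rng(42)
x = np.linspace(-2, 2, 80).reshape(-1, 1)
y = np.sin(1.5 * x) + 0.2 * rng.normal(size=x.shape)

# Ensemble: same data, different random initial weights.
members = [train_net(x, y, seed=s) for s in range(5)]
ensemble_forecast = np.mean([m(x) for m in members], axis=0)
```

By convexity, the squared error of the averaged forecast never exceeds the average squared error of the individual members, which is one motivation for the ensemble.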
In other tropical regions, the advantage of NN over CCA was smaller, and in Niño 4, the CCA slightly outperformed the NN model. In this comparison, CCA had an advantage over the NN and LR: while the predictand for the NN and LR was the Niño 3 SSTA, the predictands for the CCA were actually the first 10 PCA modes of the tropical Pacific SSTA field, from which the regional Niño 3 SSTA was then calculated. Hence during training, the CCA had more information available by using a much broader predictand field than the NN and LR models, which did not have predictand information outside Niño 3.

As the extratropical ENSO variability is much more nonlinear than in the Tropics (Hoerling et al. 1997), it is possible that the performance gap between the nonlinear NN and the linear CCA may widen for forecasts outside the Tropics.

FIG. 6. A schematic spectral analysis of the output from an NN model, where the inputs were artificial sinusoidal time series of frequencies 2.0, 3.0, and 4.5 (arbitrary units). The three main peaks labeled A, B, and C correspond to the frequencies of the input time series. The nonlinear NN generates extra peaks in the output spectrum (labeled a–i). The peaks c and g, at frequencies of 2.5 and 6.5, respectively, arose from the nonlinear interactions between the main peaks at 2.0 (peak A) and 4.5 (peak C), as the difference of the frequencies between C and A is 2.5, and the sum of their frequencies is 6.5.

FIG. 7. Regions of interest in the Pacific. SSTA for these regions are used as the predictands in forecast models.

8. Nonlinear principal component analysis

PCA is popular because it offers the most efficient linear method for reducing the dimensions of a dataset and extracting the main features. If we are not restricted to using only linear transformations, even more powerful data compression and feature extraction is generally possible.
The NN offers a way to perform nonlinear PCA (NLPCA) (Kramer 1991; Bishop 1995). In NLPCA, the NN outputs are the same as the inputs, and data compression is achieved by having relatively few hidden neurons forming a "bottleneck" layer (Fig. 10). Since there are few bottleneck neurons, it would in general not be possible to reproduce the inputs exactly by the output neurons. How many hidden layers would such an NN require in order to perform NLPCA? At first, one might think only one hidden layer would be enough. Indeed, with one hidden layer and linear activation functions, the NN solution should be identical to the PCA solution. However, even with nonlinear activation functions, the NN solution is still basically that of the linear PCA solution (Bourlard and Kamp 1988). It turns out that for NLPCA, three hidden layers are needed (Fig. 10) (Kramer 1991). The reason is that to properly model nonlinear continuous functions, we need at least one hidden layer between the input layer and the bottleneck layer, and another hidden layer between the bottleneck layer and the output layer (Cybenko 1989). Hence, a nonlinear function maps from the higher-dimension input space to the lower-dimension space represented by the bottleneck layer, and then an inverse transform maps from the bottleneck space back to the original higher-dimensional space represented by the output layer, with the requirement that the output be as close to the input as possible. One approach chooses to have only one neuron in the bottleneck layer, which will extract a single NLPCA mode. To extract higher modes, this first mode is subtracted from the original data, and the procedure is repeated to extract the next NLPCA mode.

Using the NN in Fig. 10, A. H. Monahan (1998, personal communication) extracted the first NLPCA mode for data from the Lorenz (1963) three-component chaotic system. Figure 11 shows the famous Lorenz attractor as a scatterplot of data in the x–z plane.
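The bottleneck architecture just described can be sketched structurally as follows. The layer sizes and random weights are placeholders; a real application would train the weights to minimize the reconstruction cost.

```python
import numpy as np

rng = np.random.default_rng(1)

# Layer sizes for Kramer-style NLPCA on 3D data (e.g., the Lorenz
# system): 3 inputs, an encoding hidden layer, a 1-neuron bottleneck,
# a decoding hidden layer, and 3 outputs matching the inputs.
sizes = [3, 5, 1, 5, 3]
weights = [rng.normal(scale=0.3, size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(X):
    """Return (bottleneck values, reconstruction) for data X of shape (n, 3)."""
    h = X
    activations = []
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:   # nonlinear transfer except at the output
            h = np.tanh(h)
        activations.append(h)
    return activations[1], activations[-1]   # bottleneck layer, output layer

X = rng.normal(size=(100, 3))
u, Xhat = forward(X)
cost = np.mean((Xhat - X) ** 2)   # training would minimize this
print(u.shape, Xhat.shape)        # (100, 1) (100, 3)
```

The single bottleneck value per sample, `u`, plays the role of the first NLPCA mode once the reconstruction cost has been minimized.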
The first PCA mode is simply a horizontal line, explaining 60% of the total variance, while the first NLPCA mode is the U-shaped curve, explaining 73% of the variance. In general, PCA models data with lines, planes, and hyperplanes for higher dimensions, while NLPCA uses curves and curved surfaces.

Malthouse (1998) pointed out a limitation of the Kramer (1991) NLPCA method, with its three hidden layers. When the curve from the NLPCA crosses itself, for example forming a circle, the method fails. The reason is that with only one hidden layer between the input and the bottleneck layer, and again one hidden layer between the bottleneck and the output, the nonlinear mapping functions are limited to continuous functions (Cybenko 1989), which of course cannot identify 0° with 360°. However, we believe this failure can be corrected by having two hidden layers between the input and bottleneck layers, and two hidden layers between the bottleneck and output layers, since any reasonable function can be modeled with two hidden layers (Hertz et al. 1991, 142). As the PCA is a cornerstone of modern meteorology–oceanography, a nonlinear generalization of the PCA method by NN is indeed exciting.

FIG. 8. Forecast correlation skills for the Niño 3 SSTA by the NN, the CCA, and the LR at various lead times (THMT 1998). The cross-validated forecasts were made from the data record (1950–97) by first reserving a small segment of test data, training the models using data not in the segment, then computing forecast skills over the test data, with the procedure repeated by shifting the segment of test data around the entire record.

FIG. 9. Forecasts of the Niño 3 SSTA (in °C) at 6-month lead time by an NN model (THMT 1998), with the forecasts indicated by circles and the observations by the solid line. With cross-validation, only the forecasts over the test data are shown here.

9. Neural networks and variational data assimilation

Variational data assimilation (Daley 1991) arose from the need to use data to guide numerical models [including coupled atmosphere–ocean models (Lu and Hsieh 1997, 1998a, 1998b)], whereas neural network models arose from the desire to model the vast empirical learning capability of the brain. With such diverse origins, it is no surprise that these two methods have evolved to prominence completely independently. Yet from section 3, the minimization of the cost function (3) by adjusting the parameters of the NN model is exactly what is done in variational (adjoint) data assimilation, except that here the governing equations are the neural network equations, (1) and (2), instead of the dynamical equations (see appendix for details).

Functionally, as an empirical modeling technique, NN models appear closely related to the familiar linear empirical methods, such as CCA, SVD, PCA, and POP, which belong to the class of singular value or eigenvalue methods. This apparent similarity is somewhat misleading, as structurally, the NN model is a variational data assimilation method. An analogy would be the dolphin, which lives in the sea like a fish but is in fact a highly evolved mammal, hence its natural bond with humans. Similarly, the fact that the NN model is a variational data assimilation method allows it to be bonded naturally to a dynamical model under a variational assimilation formulation. The dynamical model equations can be placed on an equal footing with the NN model equations, with both the dynamical model parameters and initial conditions and the NN parameters found by minimizing a single cost function. This integrated treatment of the empirical and dynamical parts is very different from present forecast systems such as Model Output Statistics (Wilks 1995; Vislocky and Fritsch 1997), where the dynamical model is run first before the statistical method is applied.

What good would such a union bring?
Our present dynamical models have good forecast skills for some variables and poor skills for others (e.g., precipitation, snowfall, ozone concentration, etc.). Yet there may be sufficient data available for these difficult variables that empirical methods such as NN may be useful in improving their forecast skills. Also, hybrid coupled models are already being used for El Niño forecasting (Barnett et al. 1993), where a dynamical ocean model is coupled to an empirical atmospheric model. A combined neural–dynamical approach may allow the NN to complement the dynamical model, leading to an improvement of modeling and prediction skills.

10. Conclusions

The introduction of empirical or statistical methods into meteorology and oceanography has been broadly classified as having occurred in four distinct stages: 1) linear regression (and correlation analysis), 2) PCA, 3) CCA (and SVD), and 4) NN. These four stages correspond respectively to the evolving needs of finding 1) a linear relation (or correlation) between two variables x and z; 2) the correlated patterns within a set of variables x1, . . ., xn; 3) the linear relations between a set of variables x1, . . ., xn and another set of variables z1, . . ., zm; and 4) the nonlinear relations between x1, . . ., xn and z1, . . ., zm.

FIG. 10. The NN model for calculating NLPCA. There are three hidden layers between the input layer and the output layer. The middle hidden layer is the "bottleneck" layer. A nonlinear function maps from the higher-dimension input space to the lower-dimension bottleneck space, followed by an inverse transform mapping from the bottleneck space back to the original space represented by the outputs, which are to be as close to the inputs as possible. Data compression is achieved by the bottleneck, with the NLPCA modes described by the bottleneck neurons.
Without exception, in all four stages, the method was first invented in a biological–psychological field, long before its adaptation by meteorologists and oceanographers many years or decades later.

Despite the great popularity of NN models in many fields, there are three obstacles in adapting the NN method to meteorology–oceanography: (a) nonlinear instability, especially with a short data record; (b) overwhelmingly large spatial data fields; and (c) difficulties in interpreting the nonlinear NN results. In large-scale, low-frequency studies, where data records are in general short, the potential success of the unstable nonlinear NN model against stable linear models, such as the CCA, depends critically on our ability to tame nonlinear instability.

Recent research shows that all three obstacles can be overcome. For obstacle a, ensemble averaging is found to be effective in controlling nonlinear instability. Penalty and pruning methods and nonconvergent methods also help. For b, the PCA method is found to be an effective prefilter for greatly reducing the dimension of the large spatial data fields. Other possible prefilters include the EPCA, rotated PCA, CCA, and nonlinear PCA by NN models. For c, the mysterious hidden layer can be given a phase space interpretation, and a spectral analysis method aids in understanding the nonlinear NN relations. With these and future improvements, the nonlinear NN method is evolving into a versatile and powerful technique capable of augmenting traditional linear statistical methods in data analysis and in forecasting. The NN model is a type of variational (adjoint) data assimilation, which further allows it to be linked to dynamical models under adjoint data assimilation, potentially leading to a new class of hybrid neural–dynamical models.

Acknowledgments. Our research on NN models has been carried out with the help of many present and former students, especially Fredolin T. Tangang and Adam H. Monahan, to whom we are most grateful. Encouragements from Dr.
Anthony Barnston and Dr. Francis Zwiers are much appreciated. W. Hsieh and B. Tang have been supported by grants from the Natural Sciences and Engineering Research Council of Canada, and Environment Canada.

FIG. 11. Data from the Lorenz (1963) three-component (x, y, z) chaotic system were used to perform PCA and NLPCA (using the NN model shown in Fig. 10), with the results displayed as a scatterplot in the x–z plane. The horizontal line is the first PCA mode, while the curve is the first NLPCA mode (A. H. Monahan 1998, personal communication).

Appendix: Connecting neural networks with variational data assimilation

With a more compact notation, the NN model in section 3 can be simply written as

$$\mathbf{z} = \mathbf{N}(\mathbf{x}, \mathbf{q}), \qquad (A1)$$

where x and z are the input and output vectors, respectively, and q is the parameter vector containing $w_{ij}$, $\tilde{w}_{jk}$, $b_j$, and $\tilde{b}_k$. A Lagrange function, L, can be introduced,

$$L = J + \sum \boldsymbol{\mu} \cdot (\mathbf{z} - \mathbf{N}), \qquad (A2)$$

where the model constraint (A1) is incorporated in the optimization with the help of the Lagrange multipliers (or adjoint variables) μ. The NN model has now been cast in a standard adjoint data assimilation formulation, where the Lagrange function L is to be optimized by finding the optimal control parameters q, via the solution of the adjoint equations for the adjoint variables μ (Daley 1991). Thus, the NN jargon "back-propagation" is simply "backward integration of the adjoint model" in adjoint data assimilation jargon.

Often, the predictands are the same variables as the predictors, only at a different time; that is, (A1) and (A2) become

$$\mathbf{z}(t + \Delta t) = \mathbf{N}(\mathbf{z}_d(t), \mathbf{q}), \qquad (A3)$$

$$L = J + \sum \boldsymbol{\mu} \cdot [\mathbf{z}(t + \Delta t) - \mathbf{N}(\mathbf{z}_d(t), \mathbf{q})], \qquad (A4)$$

where Δt is the forecast lead time.
For notational simplicity, we have ignored some details in (A3) and (A4): for example, the forecast made at time t could use not only $\mathbf{z}_d(t)$ but also earlier data, and N may also depend on other predictor or forcing data $\mathbf{x}_d$.

There is one subtle difference between the feedforward NN model (A4) and standard adjoint data assimilation: the NN model starts with the data $\mathbf{z}_d$ at every time step, whereas the dynamical model takes the data as the initial condition only at the first step. For subsequent steps, the dynamical model takes the model output of the previous step and integrates forward. So in adjoint data assimilation with dynamical models, $\mathbf{z}_d(t)$ in Eq. (A4) is replaced by $\mathbf{z}(t)$; that is,

$$L = J + \sum \boldsymbol{\mu} \cdot [\mathbf{z}(t + \Delta t) - \mathbf{N}(\mathbf{z}(t), \mathbf{q})], \qquad (A5)$$

where only at the initial time t = 0 is $\mathbf{z}(0) = \mathbf{z}_d(0)$. We can think of this as a strong constraint of continuity (Daley 1991), since it imposes that during the data assimilation period [0, T] the solution has to be continuous. In contrast, the training scheme of the NN has no constraint of continuity.

However, the NN training scheme does not have to be without a continuity constraint. Tang et al. (1996) proposed a new NN training scheme, where the Lagrange function is

$$L = J + \alpha \sum [\tilde{\mathbf{z}}(t) - \mathbf{z}_d(t)]^2 + \beta \sum [\tilde{\mathbf{z}}(t) - \mathbf{z}(t)]^2 + \sum \boldsymbol{\mu} \cdot [\mathbf{z}(t + \Delta t) - \mathbf{N}(\tilde{\mathbf{z}}(t), \mathbf{q})], \qquad (A6)$$

where $\tilde{\mathbf{z}}$, the input to the NN model, is also adjusted in the optimization process, along with the model parameter vector q. The second term on the right-hand side of (A6), the relative importance of which is controlled by the coefficient α, is a constraint forcing the adjustable inputs to be close to the data. The third term, whose relative importance is controlled by the coefficient β, is a constraint forcing the inputs to be close to the outputs of the previous step. However, this term does not dictate that the inputs have to be the outputs of the previous step. It is thus a weak constraint of continuity (Daley 1991).
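A numerical sketch of this weak-constraint cost follows, for a toy scalar model with the hard forecast constraint substituted directly rather than enforced through the multipliers μ; the "network" N and the series here are made up for illustration.

```python
import numpy as np

def weak_constraint_cost(z_tilde, z_data, q, alpha, beta):
    """Cost of the Tang et al. (1996) scheme for a toy scalar model.

    z_tilde : adjustable inputs, one per time step (optimized along with q)
    z_data  : the observed record z_d(t)
    q       : parameters of the toy model N(z, q) = q[0] * tanh(q[1] * z)
    """
    N = lambda z: q[0] * np.tanh(q[1] * z)
    z_out = N(z_tilde[:-1])                            # model outputs z(t + dt)
    cost = np.sum((z_out - z_data[1:]) ** 2)           # J: output-data misfit
    cost += alpha * np.sum((z_tilde - z_data) ** 2)    # inputs close to data
    cost += beta * np.sum((z_tilde[1:] - z_out) ** 2)  # weak continuity
    return cost

z_data = np.sin(0.5 * np.arange(10))   # made-up observed record
q = np.array([1.2, 0.8])
cost_at_data = weak_constraint_cost(z_data, z_data, q, 0.5, 0.5)
```

With the inputs pinned to the data the α term vanishes; as α grows, departures of the adjustable inputs from the data are penalized more heavily, while β pulls the inputs toward the previous outputs without forcing equality.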
Note that α and β are scalar constants, whereas μ(t) is a vector of adjoint variables. The weak continuity constraint version (A6) can thus be thought of as the middle ground between the strong continuity constraint version (A5) and the no continuity constraint version (A4).

Let us now couple an NN model to a dynamical model under an adjoint assimilation formulation. This kind of hybrid model may benefit a system where some variables are better simulated by a dynamical model, while other variables are better simulated by an NN model.

Suppose we have a dynamical model with governing equations in discrete form,

$$\mathbf{u}(t + \delta t) = \mathbf{M}(\mathbf{u}, \mathbf{v}, \mathbf{p}, t), \qquad (A7)$$

where u denotes the vector of state variables in the dynamical model, v denotes the vector of variables not modeled by the dynamical model, and p denotes a vector of model parameters and/or initial conditions. Suppose the v variables, which could not be forecasted well by a dynamical model, could be forecasted with better skill by an NN model; that is,

$$\mathbf{v}(t + \Delta t) = \mathbf{N}(\mathbf{u}, \mathbf{v}, \mathbf{q}, t), \qquad (A8)$$

where the NN model N has inputs u and v, and parameters q.

If observed data $\mathbf{u}_d$ and $\mathbf{v}_d$ are available, then we can define a cost function,

$$J = \sum (\mathbf{u} - \mathbf{u}_d)^{\mathrm{T}} \mathbf{U} (\mathbf{u} - \mathbf{u}_d) + \sum (\mathbf{v} - \mathbf{v}_d)^{\mathrm{T}} \mathbf{V} (\mathbf{v} - \mathbf{v}_d), \qquad (A9)$$

where the superscript T denotes the transpose, and U and V are the weighting matrices, often computed from the inverses of the covariance matrices of the observational errors. For simplicity, we have omitted the observational operator matrices and terms representing a priori estimates of the parameters, both commonly used in actual adjoint data assimilation. The Lagrange function L is given by

$$L = J + \sum \boldsymbol{\lambda}^{\mathrm{T}} [\mathbf{u}(t + \delta t) - \mathbf{M}] + \sum \boldsymbol{\mu}^{\mathrm{T}} [\mathbf{v}(t + \Delta t) - \mathbf{N}], \qquad (A10)$$

where λ and μ are the vectors of adjoint variables.
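The single-cost-function idea behind (A7)–(A10) can be illustrated with toy scalar models, again with the constraints substituted directly rather than enforced through adjoint variables; M, N, and the parameter values below are made up.

```python
import numpy as np

def M(u, v, p):
    """Toy dynamical rule advancing u, forced by v."""
    return p[0] * u + p[1] * v

def N(u, v, q):
    """Toy stand-in for an NN advancing v."""
    return q[0] * np.tanh(u + q[1] * v)

def run(u0, v0, p, q, nsteps):
    """Integrate the hybrid model forward from (u0, v0)."""
    u, v = [u0], [v0]
    for _ in range(nsteps):
        u_new = M(u[-1], v[-1], p)
        v_new = N(u[-1], v[-1], q)
        u.append(u_new)
        v.append(v_new)
    return np.array(u), np.array(v)

# Synthetic "observations" generated with known parameters.
p_true, q_true = (0.9, 0.1), (1.0, 0.5)
u_d, v_d = run(1.0, 0.2, p_true, q_true, 20)

def J(p, q, U=1.0, V=1.0):
    """Single cost function as in (A9): both p and q enter one misfit."""
    u, v = run(u_d[0], v_d[0], p, q, 20)
    return U * np.sum((u - u_d) ** 2) + V * np.sum((v - v_d) ** 2)
```

Minimizing J over (p, q) jointly recovers both the dynamical parameters and the network parameters from the same cost, which is the sense in which (A10) places M and N on an equal footing.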
This formulation places the dynamical model M and the neural network model N on an equal footing, as both are optimized (i.e., optimal values of p and q are found) by minimizing the Lagrange function L; that is, the adjoint equations for λ and μ are obtained from the variation of L with respect to u and v, while the gradients of the cost function with respect to p and q are found from the variation of L with respect to p and q. Note that without v and N, Eqs. (A7), (A9), and (A10) simply reduce to the standard adjoint assimilation problem for a dynamical model, whereas without u and M, Eqs. (A8), (A9), and (A10) simply reduce to finding the optimal parameters q for the NN model.

Here, the hybrid neural–dynamical data assimilation model (A10) is in a strong continuity constraint form. Similar hybrid models can be derived for a weak continuity constraint or no continuity constraint.

References

Badran, F., S. Thiria, and M. Crepon, 1991: Wind ambiguity removal by the use of neural network techniques. J. Geophys. Res., 96, 20 521–20 529.

Bankert, R. L., 1994: Cloud classification of AVHRR imagery in maritime regions using a probabilistic neural network. J. Appl. Meteor., 33, 909–918.

Barnett, T. P., and R. Preisendorfer, 1987: Origins and levels of monthly and seasonal forecast skill for United States surface air temperatures determined by canonical correlation analysis. Mon. Wea. Rev., 115, 1825–1850.

——, M. Latif, N. Graham, M. Flugel, S. Pazan, and W. White, 1993: ENSO and ENSO-related predictability. Part I: Prediction of equatorial Pacific sea surface temperature with a hybrid coupled ocean–atmosphere model. J. Climate, 6, 1545–1566.

Barnston, A. G., and C. F. Ropelewski, 1992: Prediction of ENSO episodes using canonical correlation analysis. J. Climate, 5, 1316–1345.

——, and Coauthors, 1994: Long-lead seasonal forecasts—where do we stand? Bull. Amer. Meteor. Soc., 75, 2097–2114.

Bishop, C. M., 1995: Neural Networks for Pattern Recognition. Clarendon Press, 482 pp.

Bourlard, H., and Y.
Kamp, 1988: Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern., 59, 291–294.

Box, G. E. P., and G. M. Jenkins, 1976: Time Series Analysis: Forecasting and Control. Holden-Day, 575 pp.

Breiman, L., 1996: Bagging predictors. Mach. Learning, 24, 123–140.

Bretherton, C. S., C. Smith, and J. M. Wallace, 1992: An intercomparison of methods for finding coupled patterns in climate data. J. Climate, 5, 541–560.

Butler, C. T., R. V. Meredith, and A. P. Stogryn, 1996: Retrieving atmospheric temperature parameters from DMSP SSM/T-1 data with a neural network. J. Geophys. Res., 101, 7075–7083.

Chauvin, Y., 1990: Generalization performance of overtrained back-propagation networks. Neural Networks. EURASIP Workshop Proceedings, L. B. Almeida and C. J. Wellekens, Eds., Springer-Verlag, 46–55.

Crick, F., 1989: The recent excitement about neural networks. Nature, 337, 129–132.

Cybenko, G., 1989: Approximation by superpositions of a sigmoidal function. Math. Control, Signals, Syst., 2, 303–314.

Daley, R., 1991: Atmospheric Data Analysis. Cambridge University Press, 457 pp.

Derr, V. E., and R. J. Slutz, 1994: Prediction of El Niño events in the Pacific by means of neural networks. AI Applic., 8, 51–63.

Elsner, J. B., and A. A. Tsonis, 1992: Nonlinear prediction, chaos, and noise. Bull. Amer. Meteor. Soc., 73, 49–60; Corrigendum, 74, 243.

Finnoff, W., F. Hergert, and H. G. Zimmermann, 1993: Improving model selection by nonconvergent methods. Neural Networks, 6, 771–783.

French, M. N., W. F. Krajewski, and R. R. Cuykendall, 1992: Rainfall forecasting in space and time using a neural network. J. Hydrol., 137, 1–31.

Galton, F. J., 1885: Regression towards mediocrity in hereditary stature. J. Anthropological Inst., 15, 246–263.

Gately, E., 1995: Neural Networks for Financial Forecasting. Wiley, 169 pp.

Gershenfeld, N. A., and A. S. Weigend, 1994: The future of time series: Learning and understanding. Time Series Prediction: Forecasting the Future and Understanding the Past, A.
S. Weigend and N. A. Gershenfeld, Eds., Addison-Wesley, 1–70.

Graham, N. E., J. Michaelsen, and T. P. Barnett, 1987: An investigation of the El Niño–Southern Oscillation cycle with statistical models, 1, predictor field characteristics. J. Geophys. Res., 92, 14 251–14 270.

Grieger, B., and M. Latif, 1994: Reconstruction of the El Niño attractor with neural networks. Climate Dyn., 10, 267–276.

Hastenrath, S., L. Greischar, and J. van Heerden, 1995: Prediction of the summer rainfall over South Africa. J. Climate, 8, 1511–1518.

Hertz, J., A. Krogh, and R. G. Palmer, 1991: Introduction to the Theory of Neural Computation. Addison-Wesley, 327 pp.

Hoerling, M. P., A. Kumar, and M. Zhong, 1997: El Niño, La Niña, and the nonlinearity of their teleconnections. J. Climate, 10, 1769–1786.

Hornik, K., M. Stinchcombe, and H. White, 1989: Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.

Hotelling, H., 1936: Relations between two sets of variates. Biometrika, 28, 321–377.

IEEE, 1991: IEEE Conference on Neural Networks for Ocean Engineering. IEEE, 397 pp.

Javidi, B., 1997: Securing information with optical technologies. Physics Today, 50, 27–32.

Kramer, M. A., 1991: Nonlinear principal component analysis using autoassociative neural networks. AIChE J., 37, 233–243.

Krasnopolsky, V. M., L. C. Breaker, and W. H. Gemmill, 1995: A neural network as a nonlinear transfer function model for retrieving surface wind speeds from the Special Sensor Microwave Imager. J. Geophys. Res., 100, 11 033–11 045.

Le Cun, Y., J. S. Denker, and S. A. Solla, 1990: Optimal brain damage. Advances in Neural Information Processing Systems II, D. S. Touretzky, Ed., Morgan Kaufmann, 598–605.

Lee, J., R. C. Weger, S. K. Sengupta, and R. M. Welch, 1990: A neural network approach to cloud classification. IEEE Trans. Geosci. Remote Sens., 28, 846–855.

Liu, Q. H., C. Simmer, and E.
Ruprecht, 1997: Estimating longwave net radiation at sea surface from the Special Sensor Microwave/Imager (SSM/I). J. Appl. Meteor., 36, 919–930.

Lorenz, E. N., 1956: Empirical orthogonal functions and statistical weather prediction. Statistical Forecasting Project, Dept. of Meteorology, Massachusetts Institute of Technology, Cambridge, MA, 49 pp.

——, 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141.

Lu, J., and W. W. Hsieh, 1997: Adjoint data assimilation in coupled atmosphere–ocean models: Determining model parameters in a simple equatorial model. Quart. J. Roy. Meteor. Soc., 123, 2115–2139.

——, and ——, 1998a: Adjoint data assimilation in coupled atmosphere–ocean models: Determining initial conditions in a simple equatorial model. J. Meteor. Soc. Japan, in press.

——, and ——, 1998b: On determining initial conditions and parameters in a simple coupled atmosphere–ocean model by adjoint data assimilation. Tellus, in press.

Malthouse, E. C., 1998: Limitations of nonlinear PCA as performed with generic neural networks. IEEE Trans. Neural Networks, 9, 165–173.

Marzban, C., and G. J. Stumpf, 1996: A neural network for tornado prediction based on Doppler radar–derived attributes. J. Appl. Meteor., 35, 617–626.

McCulloch, W. S., and W. Pitts, 1943: A logical calculus of the ideas immanent in neural nets. Bull. Math. Biophys., 5, 115–137.

Minsky, M., and S. Papert, 1969: Perceptrons. MIT Press, 292 pp.

Navone, H. D., and H. A. Ceccatto, 1994: Predicting Indian monsoon rainfall—A neural network approach. Climate Dyn., 10, 305–312.

Peak, J. E., and P. M. Tag, 1992: Toward automated interpretation of satellite imagery for navy shipboard applications. Bull. Amer. Meteor. Soc., 73, 995–1008.

——, and ——, 1994: Segmentation of satellite imagery using hierarchical thresholding and neural networks. J. Appl. Meteor., 33, 605–616.

Pearson, K., 1901: On lines and planes of closest fit to system of points in space. Philos. Mag., Ser.
6, 2, 559–572.

Penland, C., 1989: Random forcing and forecasting using principal oscillation pattern analysis. Mon. Wea. Rev., 117, 2165–2185.

Phillips, N. A., 1959: An example of non-linear computational instability. The Atmosphere and the Sea in Motion, Rossby Memorial Volume, Rockefeller Institute Press, 501–504.

Preisendorfer, R. W., 1988: Principal Component Analysis in Meteorology and Oceanography. Elsevier, 425 pp.

Ripley, B. D., 1996: Pattern Recognition and Neural Networks. Cambridge University Press, 403 pp.

Rojas, R., 1996: Neural Networks—A Systematic Introduction. Springer, 502 pp.

Rosenblatt, F., 1962: Principles of Neurodynamics. Spartan, 616 pp.

Rumelhart, D. E., G. E. Hinton, and R. J. Williams, 1986: Learning internal representations by error propagation. Parallel Distributed Processing, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Eds., Vol. 1, MIT Press, 318–362.

Sejnowski, T. J., and C. R. Rosenberg, 1987: Parallel networks that learn to pronounce English text. Complex Syst., 1, 145–168.

Shabbar, A., and A. G. Barnston, 1996: Skill of seasonal climate forecasts in Canada using canonical correlation analysis. Mon. Wea. Rev., 124, 2370–2385.

Stogryn, A. P., C. T. Butler, and T. J. Bartolac, 1994: Ocean surface wind retrievals from Special Sensor Microwave Imager data with neural networks. J. Geophys. Res., 99, 981–984.

Tang, B., 1995: Periods of linear development of the ENSO cycle and POP forecast experiments. J. Climate, 8, 682–691.

——, G. M. Flato, and G. Holloway, 1994: A study of Arctic sea ice and sea-level pressure using POP and neural network methods. Atmos.–Ocean, 32, 507–529.

——, W. Hsieh, and F. Tangang, 1996: "Cleaning" neural networks with continuity constraint for prediction of noisy time series. Proc. Int. Conf. on Neural Information Processing, Hong Kong, China, 722–725.

Tang, Z., C. de Almeida, and P. A. Fishwick, 1991: Time series forecasting using neural networks vs. Box–Jenkins methodology. Simulation, 57, 303–310.

Tangang, F. T., W. W.
Hsieh, and B. Tang, 1997: Forecasting the equatorial Pacific sea surface temperatures by neural network models. Climate Dyn., 13, 135–147.

——, ——, and ——, 1998a: Forecasting the regional sea surface temperatures of the tropical Pacific by neural network models, with wind stress and sea level pressure as predictors. J. Geophys. Res., 103, 7511–7522.

——, B. Tang, A. H. Monahan, and W. W. Hsieh, 1998b: Forecasting ENSO events: A neural network–extended EOF approach. J. Climate, 11, 29–41.

Trippi, R. R., and E. Turban, Eds., 1993: Neural Networks in Finance and Investing. Probus, 513 pp.

Vislocky, R. L., and J. M. Fritsch, 1997: Performance of an advanced MOS system in the 1996–97 National Collegiate Weather Forecasting Contest. Bull. Amer. Meteor. Soc., 78, 2851–2857.

von Storch, H., G. Burger, R. Schnur, and J.-S. von Storch, 1995: Principal oscillation patterns: A review. J. Climate, 8, 377–400.

Weare, B. C., and J. S. Nasstrom, 1982: Example of extended empirical orthogonal functions. Mon. Wea. Rev., 110, 481–485.

Weigend, A. S., and N. A. Gershenfeld, Eds., 1994: Time Series Prediction: Forecasting the Future and Understanding the Past. Santa Fe Institute Studies in the Sciences of Complexity, Proceedings Vol. XV, Addison-Wesley, 643 pp.

Widrow, B., and S. D. Sterns, 1985: Adaptive Signal Processing. Prentice-Hall, 474 pp.

Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.

Zirilli, J. S., 1997: Financial Prediction Using Neural Networks. International Thomson, 135 pp.

