818 VOLUME 129MONTHLY WEATHER REVIEWq 2001 American Meteorological SocietyCoupling Neural Networks to Incomplete Dynamical Systems viaVariational Data AssimilationYOUMIN TANG AND WILLIAM W. H SIEHOceanography, Department of Earth and Ocean Sciences, University of British Columbia,Vancouver, British Columbia, Canada(Manuscript received 16 March 2000, in final form 30 August 2000)ABSTRACTThe advent of the feed-forward neural network (N) model opens the possibility of hybrid neural–dynamicalmodels via variational data assimilation. Such a hybrid model may be used in situations where some variables,difficult to model dynamically, have sufficient data for modeling them empirically with an N. This idea of usingan N to replace missing dynamical equations is tested with the Lorenz three-component nonlinear system, whereone of the three Lorenz equations is replaced by an N equation. In several experiments, the 4DVAR assimilationapproach is used to estimate 1) the N model parameters (26 parameters), 2) two dynamical parameters and threeinitial conditions for the hybrid model, and 3) the dynamical parameters, initial conditions, and the N parameters(28 parameters plus three initial conditions).Two cases of the Lorenz model—(i) the weakly nonlinear case of quasiperiodic oscillations, and (ii) the highlynonlinear, chaotic case—were chosen to test the forecast skills of the hybrid model. Numerical experimentsshowed that for the weakly nonlinear case, the hybrid model can be very successful, with forecast skills similarto the original Lorenz model. For the highly nonlinear case, the hybrid model could produce reasonable predictionsfor at least one cycle of oscillation for most experiments, although poor results were obtained for some exper-iments. In these failed experiments, the data used for assimilation were often located on one wing of the Lorenzbutterfly-shaped attractor, while the system moved to the second wing during the forecast period. The forecastsfailed as the model had never been trained with data from the second wing.1. IntroductionNumerical models have been widely used to simulatedynamical systems such as the atmosphere and theocean. However, the physics in these models is usuallyincomplete, and some empirical approach is needed topatch up the missing physics. The first example is theinability of a numerical model, with its finite resolution,to represent subgrid-scale physical processes, therebyforcing numerical modelers to adopt parameterizationschemes for these processes. A second example arisesfrom the fact that some variables of interest are notvariables in the numerical model, for example, precip-itation and surface air temperature (which is generallynot equivalent to the temperature at the lowest level ofan atmospheric model). Often multiple linear regression(MLR) is used to empirically relate the model variablesto the variables of interest, via schemes such as perfectprog and model output statistics (Wilks 1995). A thirdexample arises in situations where replacing the physicalCorresponding author address: William W. Hsieh, Oceanography/EOS, University of British Columbia, Vancouver, BC V6T 1Z4, Can-ada.E-mail: william@ocgy.ubc.caequations by empirical ones results in large computa-tional savings (and greater stability): in the equatorialPacific, a simple empirical atmospheric model can becoupled to a dynamical ocean model, to form a hybridcoupled model. Here the wind stress is empirically es-timated from the ocean variables either by a linear sta-tistical method, such as MLR (Barnett et al. 1993) orsingular value decomposition (Syu et al. 1995), or bya neural network (Tang et al. 2001).While it is an attractive idea to replace missing dy-namical equations with empirical equations, there areserious technical problems; for example, the dynamicalequations are generally nonlinear, while the standardstatistical methods are usually linear, hence not capableof simulating the evolution of nonlinear dynamicalequations for extended periods. Neural networks (N)have been known to be capable of simulating any non-linear function, given a large enough network (Cybenko1989). Recent advances in N modeling have led to newtechniques capable of nonlinearly generalizing the clas-sical multivariate methods such as MLR, principal com-ponent analysis, and canonical correlation analysis(Hsieh and Tang 1998). As Ns originated from the fieldof artificial intelligence and robotics, it is tempting toponder whether Ns can benefit dynamical systems withAPRIL 2001 819TANG AND HSIEHFIG. 1. The Lorenz system integrated over 15 000 time steps for (left) the weakly nonlinear case and (right) thehighly nonlinear case.incomplete physics the way robotic limbs have helpedprosthetics. More specifically, could N equations beused to simulate missing nonlinear dynamical equa-tions? How to effectively couple an N model to a dy-namical model is a major challenge.Variational data assimilation, especially via the ad-joint approach, has become popular in assimilating datainto numerical models (Daley 1991; Ghil and Malan-otte-Rizzoli 1991; Bennett 1992; Navon 1998). Themethod is commonly used to optimally estimate modelparameters or initial conditions, and is being imple-mented in operational weather and climate predictionmodels (e.g., Zhu and Navon 1999; Gauthier et al. 1999;Courtier et al. 1998; Ji et al. 1998). Hsieh and Tang(1998) noted that Ns could be formulated as a varia-tional assimilation problem, and suggested that Ns anddynamical models might be combined most naturallyvia a variational assimilation approach.The objective of this paper is to show how an N modelcan be coupled to a dynamical model (with incompletephysics) via variational assimilation. The chosen dy-namical model is the simple Lorenz (1963) model, withthree variables, arising from the Fourier truncation ofthe Rayleigh–Be´nard equations describing atmosphericconvection. In the field of data assimilation, the cele-brated Lorenz model has served as a test bed for ex-amining the properties of various data assimilationmethods (Gauthier 1992; Miller et al. 1994; Evensen1997) as the Lorenz model shares many common fea-tures with the atmospheric circulation and climate sys-tem in terms of variability and predictability (Palmer1993). By adjusting the model parameters that controlthe nonlinearity of the system, the model can be usedto simulate near-regular oscillations or highly nonlinearfluctuations.This paper is structured as follows: section 2 describesthe hybrid Lorenz model, while section 3 shows somesimple experiments involving the hybrid model. Section4 gives a general formulation for coupling an N to adynamical model via variational data assimilation. Sec-tion 5 uses variation assimilation to estimate the N pa-rameters in the hybrid Lorenz model. Section 6 studiesnot only estimating the N parameters, but also retrievingthe dynamical model parameters and initial conditions.2. Hybrid neural–dynamical Lorenz modelThe nondimensionalized Lorenz (1963) nonlinearsystem of three differential equations are820 VOLUME 129MONTHLY WEATHER REVIEWFIG. 2. The ensemble mean of 25 Ns (dot–dash curve) for approximating the third Lorenz equation (solid curve)for (left) the weakly nonlinear case and (right) the highly nonlinear case. (a) and (b) The training period of 3000 timesteps, and (c) and (d) the test period of 1000 time steps.dX52aX 1 aY, (1)dtdY52XZ 1 bX 2 Y, (2)dtdZ5 XY 2 cZ, (3)dtwhere variables X, Y, and Z are related to the intensityof convective motion, and the temperature gradients inthe horizontal and vertical directions, respectively. Theparameters a, b, and c will be referred to as dynamicalparameters, in contrast to the empirical parameters ofthe N model.The so-called observed data or true data are from theintegration of the Lorenz equations (1)–(3) over 15 000time steps at a step size h of 0.001, using a fourth-orderRunge–Kutta integration scheme. As this system is verysensitive to changes in the initial conditions and modelparameters, two cases are studied (Fig. 1). The first case,called the weakly nonlinear case, with the parametersa, b, and c set to 10, 28, and 8/3, respectively, and initialconditions for (X, Y, and Z)to(29.42, 29.43, 28.3)(as in Gauthier 1992), displays near-regular oscillationswith a gradually increasing amplitude in the devisedintegration period. The other case is the highly nonlinearcase, with a, b, and c set to 16.0, 120.1, and 4.0, re-spectively (as in Elsner and Tsonis 1992), and initialconditions to (22.8, 35.7, 114.9).Next we assume the third Lorenz equation is un-available, and we must approximate it empirically withan N equation. Our hybrid model thus consists ofdX52aX 1 aY, (4)dtdY52XZ 1 bX 2 Y, (5)dtdZ5 N (X , Y , Z ), (6)tttdtwhere N is a feed-forward N model (Hsieh and Tang1998). The three input neurons Xt, Yt, and Ztare (alsodenoted by yi, i 5 1, 2, 3), inputting the values of X,Y, and Z at time t, and the single output neuron is dZ/dt.More details on the N model are given in appendix A.APRIL 2001 821TANG AND HSIEHFIG. 3. The hybrid model (dot–dash curve) and the true Lorenz model (solid curve) integrated from identical initialconditions for (left) the weakly nonlinear case and (right) the highly nonlinear case.3. Simple hybrid model experimentsA simple-minded approach to the missing third Lo-renz equation is to use data to get the N Eq. (6), andthen integrate Eqs. (4)–(6), with (6) as a replacementfor the missing Lorenz equation. The N has X, Y, andZ as the inputs and (Zt112 Zt)/h as the target duringtraining [with the optimization minimizing the meansquare error between the target and the model output],so the model output is approximately dZ/dt. The opti-mization of the N model was determined over a trainingperiod of 3000 time steps, and the following period of1000 time steps was used to independently test the Nmodel skills in estimating dZ/dt. The appropriate num-ber of neurons in the hidden layer is chosen based onthe trade-off between underfitting (too few neurons) andoverfitting (too many neurons) (Hsieh and Tang 1998).After trying models with different numbers of hiddenneurons, we found that five hidden neurons yielded thebest skills over the test period.To alleviate the instability problems associated withN modelling (Hsieh and Tang 1998), an ensemble of 25identical Ns were trained with random initial weightsand biases, with the outputs from the 25 Ns used toprovide an ensemble mean output. The skills of theensemble mean generally exceeded the skills of an in-dividual member (Hsieh and Tang 1998). The ensemblemean N simulations of dZ/dt for both the weakly non-linear and the highly nonlinear cases during the trainingperiod and the test period are shown in Fig. 2, wherethe N gave good simulations of dZ/dt, although the os-cillation peaks were underestimated.Now that the third Lorenz equation seems to be rea-sonably approximated by the N Eq. (6), we proceed tointegrate the hybrid system (4)–(6) from perfect initialconditions and compare the output with respect to thatfrom the true Lorenz system (1)–(3). This simple-mind-ed hybrid approach worked reasonably well in simu-lating the true Lorenz system for the weakly nonlinearcase (Fig. 3), but failed completely for the highly non-linear case, where the hybrid model yielded X and Yvalues that were off by orders of magnitude. The highlynonlinear Lorenz system is known for its extremely highsensitivity to perturbations. Clearly, the N approxima-tion to the third Lorenz equation introduced significanterrors in Z, which quickly caused the X and Y valuesto deviate far off course.The defect of this simple approach lies in the factthat the optimization of the N was achieved withoutimposing the dynamical constraints of (4) and (5). Abetter approach is variational assimilation, where thedynamical constraints are imposed.822 VOLUME 129MONTHLY WEATHER REVIEWFIG. 4. Schematic diagram for (a) the SC, (b) NC, and (c) PSCassimilation schemes. The solid curves are the model trajectory; dots,the observed values; and dash lines, the model errors to be minimized.Here a model trajectory starts from an observed value.FIG. 5. The NC assimilation results, with the model training results shown in the left panels, and predictions over thetest period in the right panels, where the solid curves show the true data and dot–dash curves the outputs of the hybridmodel. The dot–dash curves overlap the solid curves in the left panels.4. The hybrid model using variational dataassimilationThe scalar equations (4)–(6) can be expressed as avector equation:dv5 f(v, p, t), (7)dtwhere v is the vector denoting (X, Y, Z) and p is theparameter vector (which could contain the parametersof the N or of the dynamical equations).The variational assimilation method involves mini-mizing a quadratic cost function J subject to model con-straint (7), where J is defined asTJ(v, p, v ) 5 D(v, t) dt and (8)0 E0T 21D 5 (v 2 v ) W (v 2 v ), (9)obs obswhere the subscript obs denotes the observed values,the superscriptTthe transpose, T the length of the as-similation window (also called training period some-times), v0the initial conditions, and W the covariancematrix of the measurement errors, assumed here to bediagonal.APRIL 2001 823TANG AND HSIEHFIG. 6. The PSC assimilation results with AS 5 200 time steps. (left) The model training results, and (right) modelpredictions, with solid curves for the true data and dot–dash curves for the outputs of the hybrid model.Introduce the Lagrange function L,TdvTL 5 J 1 v* 2 f dt, (10)E12dt0where v*(t) is a vector of Lagrange multipliers (oradjoint variables). After integration by parts, L be-comesTTdv*T T TL 5 (v* v) 1 D 2 v 1 v* f dt. (11)0 E12[]dt0The first-order variation of L isTT TTdv* ]f ]fT T TdL 5 [v* dv] 1 = D dv 2 dv 1 v*dv 1 v*dp dt. (12)0 E v12 12 12[[ ]dt ]v ]p0The hybrid adjoint model can be obtained by lettingdL/dv 5 0:dv* ]f25v*(t) 2 = D, (13)vdt ]vv*(T) 5 0. (14)According to (13) and (14), the formulas for com-puting the gradients of J with respect to p and v0canbe obtained by differentiating (12) with respect to theseunknowns (Lu and Hsieh 1998)TTdJ ]f52 v* dt and (15)E12dp ]p0dJ52v*(0). (16)dv0824 VOLUME 129MONTHLY WEATHER REVIEWFIG. 7. Same as Fig. 6 but for AS 5 1000 time steps.Details of the variational assimilation are given inappendix B.As the above assimilation process, also referred to asthe training process, is completely subject to the model(4)–(6) after the initial guesses are given, it is calledthe strong continuity constraint assimilation scheme(SC).5. Determining N parameters via variationalassimilationIn this section, we determine the N parameters, thatis, the weights and biases, in the hybrid model by var-iational data assimilation. An N with five hidden neu-rons is used. Preliminary experiments are done in sec-tions 5a and 5b to determine the best first guesses andassimilation window sizes, respectively, before the mainexperiments are run in section 5c. In this section, allexperiments were performed with perfect initial con-ditions and dynamical parameters a and b.a. The initial guessesWhether an assimilation process is successful de-pends greatly on the choice of the first guesses. To helpin getting good initial guesses, two assimilation schemesare introduced in addition to the SC scheme. Under SC,at each step of the integration, the initial conditions arethe model outputs from the previous step (Fig. 4a). Inthe no-continuity constraint assimilation scheme (NC),at each integration step, the initial conditions are theobserved data, rather than the model output from theprevious step (Fig. 4b).Between these two extremes, we can have the par-tially strong continuity constraint assimilation scheme(PSC) (Fig. 4c), where the assimilation window T isdivided into smaller assimilation segments (ASs). With-in each AS, integration at each step uses the modeloutput from the previous step as the initial conditions.Only at the start of the AS are the data used directly asthe initial conditions. Both SC and NC can be regardedas special cases of PSC where the AS is T in SC andone time step in NC.The reason for introducing NC and PSC is that usingSC without good initial guesses generally does not leadto successful assimilation over a long assimilation win-dow. Chopping the window into shorter ASs means thatthe dynamic model constraints are no longer imposedover a long period, thereby making the nonlinear op-timization a much easier task. Our strategy is to dividethe assimilation process into two stages: (a) use lessdemanding schemes such as NC and PSC to obtain rea-sonable parameters estimates, which will then be usedAPRIL 2001 825TANG AND HSIEHFIG. 8. Correlation skills (averaged over 10 experiments) for thevariables X, Y, and Z at various assimilation windows (T 5 100, 300,500, and 800 time steps) during (a) the training period, and (b) thetest period.as initial guesses in (b) the full variational assimilationunder SC.For the weakly nonlinear case, the NC assimilationscheme was first applied to the hybrid model (4)–(6),with initial guesses for the N parameters taken randomlyfrom one of the 25 Ns in section 3. The assimilationwindow is 3000 time steps. The NC scheme can producegood results during the training period but poor skillduring the test period (Fig. 5). Since the AS is veryshort (one time step) in the NC scheme, the initial valuesat each integration step vastly outweigh the importanceof the dynamical constraints, hence the poor skill duringthe test period. However the result from the NC canprovide reasonable first guesses for the following PSC.A series of PSC assimilations with AS 5 100, 200,500, and 1000 time steps was further performed withthe hybrid model (in the weakly nonlinear case) to grad-ually improve the first guesses to be used in the stageb SC assimilation. A longer AS increases the importanceof dynamical constraints relative to initial conditions ininfluencing the model outputs. The first guesses of theN parameters in a PSC experiment are the parametersobtained from the previous PSC experiment with theshorter AS. The first PSC experiment (not shown) withAS 5 100 time steps uses the results from the NC case(i.e., AS 5 one time step) for its first guesses.The PSC assimilation over the training period, thatis, a window of 3000 time steps, with AS 5 200 timesteps (Fig. 6), and AS 5 1000 time steps (Fig. 7), revealsthat the test skills improved with the increase of the AS.The parameters estimated from the PSC experiment withAS51000 time steps will serve as the first guess forthe stage b SC assimilation (in the next section) for theweakly nonlinear case, where the assimilation windowis also taken to be 1000 time steps.This approach of using a series of PSC assimilationsto improve on the parameter estimates to be used asinitial guesses in the stage b SC assimilation did notwork for the highly nonlinear case. Unlike the weaklynonlinear case, the first guesses from PSC, if found, isonly effective for a specific experimental configuration(e.g., AS, window length, etc.). Hence, for the highlynonlinear case, the parameters estimated from the NCassimilation over 3000 time steps are directly used asthe first guesses for the stage b SC assimilation.b. Impact of assimilation window in the highlynonlinear caseThe greatest difficulty in variational assimilation,namely the presence of multiple minima in the costfunction arising from the nonlinearity of the problem,often renders the problem intractable. The number oflocal minima dramatically increases with the window Tfor a strongly nonlinear dynamical system. Gauthier(1992) and Miller et al. (1994) have shown with theLorenz model that variational assimilation is effectiveonly for sufficiently short windows. In the weakly non-linear case (section 5a), T did not have a major impact;hence, the following discussion will only focus on thehighly nonlinear case.To study the impact of the window, SC assimilationexperiments were performed, where T was varied be-tween 100, 300, 500, and 800 time steps. For each T,10 experiments were performed with the same firstguesses, but the assimilation periods were shifted by100 time steps between one experiment and the next.The effect of T on the assimilation is different for thetraining period (Fig. 8a) and the test period (Fig. 8b).With a short window (T 5 100 time steps), the corre-lation skills were excellent for all three variables duringtraining, but yielded no prediction skills over the fol-lowing test period of 300 time steps. As T increased to300 time steps and then 500 time steps, there were in-creases in the prediction skills (Fig. 8b), but as T in-creased to 800, the prediction skill dropped sharply, asthe problem with local minima in the cost function wors-826 VOLUME 129MONTHLY WEATHER REVIEWFIG. 10. Bar charts from 100 assimilation experiments for the highlynonlinear case showing the number of experiments with (a) corre-lation above a particular correlation level and (b) REE below a par-ticular REE level, during the test period.←FIG. 9. (a) Correlation skills and (b) REE for the variables X, Y,and Z during the training and test periods (averaged over 100 ex-periments), for the weakly nonlinear case, where the SC assimilationretrieves the N parameters. The error bars show the standard devi-ation. Results from the simple hybrid model without data assimilation(section 3) are also shown for comparison.ened with increasing T. Hence in the following SC as-similation experiments, the window will be 500 timesteps for the highly nonlinear case, and 1000 time stepsfor the weakly nonlinear case. Following the training(assimilation) period, there is a test period of 300 timesteps for the highly nonlinear case, and 1000 time stepsfor the weakly nonlinear case.c. SC assimilationWe now perform the main SC assimilation experi-ments, repeated 100 times, to examine the skills of theAPRIL 2001 827TANG AND HSIEHFIG. 11. A typical experiment with class 1 results for the highly nonlinear case, showing (a) the fitting during thetraining period (left panels) and (b) the forecasting during the testing period (right panels). Solid curves denote the truevalue, while dot–dash curves denote model results.hybrid model for the weakly nonlinear and highly non-linear cases. The choices of windows and first guessesare those found in sections 5a and 5b. The 100 exper-iments are implemented with their assimilation periodsshifted by 100 time steps from one experiment to theother.For the weakly nonlinear case, the average correlationbetween model and data (over 100 experiments) exceeds0.96 for each of X, Y, and Z during both the trainingperiod and the test period (Fig. 9a). The relative estimateerror [REE, i.e., S (estimated value-true value)2/S (truevalue)2] is very small, less than 0.004 (Fig. 9b). Thus,via variational assimilation, a hybrid model has suc-cessfully reconstructed the Lorenz model in the weaklynonlinear case. For comparison, the results of the simplehybrid model without data assimilation from section 3are also show in Fig. 9. The improvement due to var-iational data assimilation is dramatic for X and Y.In contrast, the highly nonlinear case has much lowerskill. During training, the skill was good—80% of theexperiments had correlation above 0.75, and 60% hadREEs under 0.05 (not shown). However, during the testperiod, the correlation skill and REE attained were muchpoorer (Fig. 10).Generally, for the highly nonlinear case, there werethree classes of assimilation results found in the 100experiments. A typical example of the class 1 results isshown in Fig. 11, showing successful assimilation.About 15% of the total experiments is of this class.Almost 60% of the total experiments belongs to a class2 of moderately successful assimilation, as illustratedby the example in Fig. 12, where during the test period,the forecast skills were fairly good for the first 200 timesteps. The remaining 25%–30% of the experiments be-longed to a class 3 of failed forecasting, as illustratedin Fig. 13, where despite excellent fitting during thetraining period, the forecast results were totally wrong.To see why there is a drastic difference between class1 and class 3 behavior, we plotted the system trajectoriesof the two cases in the (X, Y, Z) space (Fig. 14). TheLorenz attractor for the highly nonlinear case is knownto have a ‘‘butterfly’’ shape with two wings. In the class1 experiment, both wings of the Lorenz attractor werecovered by the training data trajectory. In contrast, withthe class 3 example, the training data trajectory residedin one wing. The system then evolved to the other wingduring the testing period. As the model had never beentrained with data from the second wing, it obviously828 VOLUME 129MONTHLY WEATHER REVIEWFIG. 12. Same as Fig. 11 but for a typical experiment with class 2 (i.e., moderately successful forecast) results.could not forecast properly in that regime, hence thedisastrously poor forecast results found in Fig. 13. Thusthe bimodality of the highly nonlinear Lorenz systemcreated cases where the training data covered only onewing of the attractor, and the model thus trained had nocapability of forecasting the transition to the other wing.One would be tempted to think that the problem couldbe corrected by simply using longer training periods,so that the full Lorenz attractor would have been learnedby the model. Unfortunately, a longer training periodgreatly increases the difficulty in the nonlinear opti-mization due to multiple local minima in the cost func-tion, leading eventually to the failure of the variationalassimilation method.6. Determining dynamical parameters and initialconditions by the hybrid modelIn variational data assimilation, parameters and initialconditions can be jointly estimated (e.g., Le Dimet andTalagrand 1986; Tziperman and Thacker 1989; Lu andHsieh 1998; Zhu and Navon 1999). We next conductedan experiment, for the weakly nonlinear case, to estimatesimultaneously the dynamical parameters a and b, andthe initial conditions of X, Y, and Z in the hybrid Lorenzsystem (4)–(6). The N from section 5a (i.e., PSC assim-ilation with AS 5 1000 time steps) was used.For the initial guesses, a and b and the initial con-ditions were all scaled down by 10% from their truevalues. The result of retrieving the two parameters andthree initial conditions by the hybrid model with anassimilation period of 3000 steps under SC showed theretrieval to be very good, as the REEs were all of theorder of 1023–1024(not shown). Next the true valueswere scaled down by 20% to provide the initial guesses,and the retrieved results were still generally good (Fig.15). But when the initial values were the true valuesscaled down by 30%, the retrieval nearly failed (notshown).Next, we had the N parameters retrieved as well dur-ing the assimilation; that is, we conducted 100 additionalSC experiments to estimate simultaneously the N pa-rameters, the dynamical parameters a and b, and theinitial conditions of X, Y, and Z, for the weakly nonlinearcase. The first guesses for the initial conditions and theparameters a and b were simply the true values scaleddown by 10%. The choice of the first guesses of the Nmodel parameters, cost function, and assimilation win-dow were the same as those discussed in the previoussection.APRIL 2001 829TANG AND HSIEHFIG. 13. Same as Fig. 11 but for a typical experiment with class 3 (i.e., unsuccessful forecast) results.The average correlation skills over the 100 experi-ments are shown in Fig. 16, while the average REEsare all around 0.05 (not shown). The correlation andREE were generally poorer than those in Fig. 9, whereonly the N parameters were retrieved. This agrees withmany studies that found that the assimilation qualitydeteriorated while retrieving more model unknowns (Luand Hsieh 1998), as the problem of multiple minima inthe cost function could become worse when retrievingmore model unknowns.Likewise, 100 experiments for the highly nonlinearcase were also carried out to simultaneously estimatethe N parameters, the dynamical parameters a and b,and the initial conditions. The first guesses were the truevalues scaled down by 10%, and the first guesses of theN model, the cost function, and assimilation windowwere still taken from section 5. While the skills in thetraining period were as good as those retrieving the Nparameters only (not shown), the skills during the testingperiod (Fig. 17) were considerably lower than thosefound when only N parameters were retrieved (Fig. 10).Class 1 examples like Fig. 11 can hardly be found now.7. Summary and discussionIn this study, a hybrid neural–dynamical variationaldata assimilation procedure aimed at simulating missingdynamical equations was formulated. This procedurewas then applied to the Lorenz model, where the thirddynamical equation of the model was missing and hadto be simulated empirically by an N equation.First guesses in variational data assimilation can bevital to retrieving the model parameters and initial con-ditions successfully. In order to get reasonable firstguesses, the no continuity constraint and partial strongcontinuity constraint schemes were proposed in thisstudy. Experiments show that such a treatment of choos-ing first guesses is effective in the weakly nonlinearcase, which allows the cost function to be successfullyoptimized in almost all experiments under strong con-tinuity constraint. For the highly nonlinear case, the NCscheme provides reasonable first guesses for the SC ex-periments.The length of the assimilation window is very sig-nificant for the highly nonlinear case. When the windowis either too short or too long, the prediction skills arepoor. This strict demand for an appropriate assimilatingwindow results from a delicate balance between the needfor more data (longer window) and the avoidance ofsevere multiple minima during optimization with thehighly nonlinear model, which calls for a short window.Our numerical experiments showed that the hybridmodel based on N and variational assimilation is suc-830 VOLUME 129MONTHLY WEATHER REVIEWFIG. 14. The system trajectories in the (X, Y, Z) space during the training period (thin curves) and the testingperiod (thick curves) for (a) the class 1 experiment and (b) the class 3 experiment.FIG. 16. The correlation skill averaged over 100 experiments wherethe N parameters, the dynamical parameters a and b, and the initialconditions were retrieved for the weakly nonlinear case. Error barsindicate the standard deviation.FIG. 15. The results from retrieving the dynamical parameters aand b, and the initial conditions of the hybrid model, where the initialguesses were the true values scaled down by 20%.cessful in simulating the weakly nonlinear Lorenz mod-el, with forecast skills similar to the original Lorenzmodel. For a highly nonlinear Lorenz system, the hybridmodel can also produce reasonable skills of simulationsand predictions for most experiments, although the as-similation basically failed for a fair portion of the ex-APRIL 2001 831TANG AND HSIEHFIG. 17. Bar charts from 100 assimilation experiments for the highlynonlinear case showing the number of experiments with (a) corre-lation above a particular correlation level and (b) REE below a par-ticular REE level, during the test period. The N parameters, the dy-namical parameters a and b, and the initial conditions were retrieved.FIG. A1. An example of a neural network model, where there arethree neurons in the input layer, five in the hidden layer, and one inthe output layer. The parameters wijand w˜jare the weights, and bjand b are the biases. The parameters bjand b can also be regardedas the weights for constant inputs of value 1.periments. Deterministic optimization algorithms oftenhave difficulties reaching the global minimum in thecost function, due to the presence of local minima. Ithas been shown that the variational assimilation withdeterministic optimization algorithms often loses itspower for strongly nonlinear models (Miller et al. 1994;Evensen 1997). Stochastic optimization methods suchas simulated annealing may help (Kruger 1993; Han-nachi and Legras 1995). However, even if the globalminimum is found, the data in the assimilation windowmight still only consist of data from one wing of thebutterfly-shaped attractor, whereas the data in the fore-casting test period could lie in the second wing, therebyresulting in failed forecasts. Alternatively, other assim-ilation approaches, for example, the extended Kalmanfilter, could be more powerful in dealing with stronglynonlinear systems (Miller et al. 1994; Evensen 1997).The possibility of an N coupled with a dynamical modelwith an extended Kalman filter is being investigated.Acknowledgments. The authors are grateful to Dr.Benyang Tang, Dr. Jingxi Lu, Dr. Adam Monahan, andYuval for helpful discussion, and to Dr. Ralf Gieringfor his TAMC compiler for adjoint codes. This workwas supported by research and strategies grants to W.Hsieh from the Natural Sciences and Engineering Re-search Council of Canada.APPENDIX ANeural Network ModelA feed-forward neural network is a nonparametricstatistical model for extracting nonlinear relations in thedata. A common N model configuration is to place be-tween the input and output variables (also called ‘‘neu-rons’’), a layer of ‘‘hidden neurons’’ (Fig. A1). Thevalue of the jth hidden neuron isy 5 tanh wx1 b , (A1)Ojijij12iwhere xiis the ith input, wijthe weight parameters, andbjthe bias parameters. The hyperbolic tangent functionis used as the transfer function (Bishop 1995, p. 127).The output neuron is given byz 5 w˜ y 1 b . (A2)OjjjA cost function832 VOLUME 129MONTHLY WEATHER REVIEWJ 5^(z 2 zobs)2& (A3)measures the mean square error between the model out-put z and the observed values zobs. The parameters wij,w˜j, bj, and b are adjusted as the cost function is mini-mized by the optimization algorithm of Levenberg–Mar-quardt (Masters 1995), without any constraints imposed.The procedure, known as network training, yields theoptimal parameters for the network. The random numbergenerator in MATLAB, which generates uniformly dis-tributed random numbers on the interval (0.0, 1.0), wasused to initialize these parameters.For an N with m1inputs and m2hidden neurons, thenumber of model parameters is m13 m21 2 3 m211. In this paper, we set the number of hidden neuronsto 5, so the N has 26 parameters.The above is a brief description of a traditional feed-forward N algorithm, used in section 3. In later sections,the optimization will be constrained by the dynamicalequations.APPENDIX BDiscrete Form of the Hybrid Model for CodingThe computer code is directly generated from the dis-crete form of the hybrid model. The fourth-order Runge–Kutta is used to discretize the vector equation 7:k1 k2 k3 k4nnnnv 5 v 1111, (B1)n11 n6336k1 5 hf(v , p, t ), (B2)nnnk2 5 hf(v 1 0.5k1 , p, t 1 0.5h), (B3)nn nnk3 5 hf(v 1 0.5k2 , p, t 1 0.5h), (B4)nn nnk4 5 hf(v 1 k3 , p, t 1 h), (B5)nnnnwhere the subscript n indicates the time level, f 5(f1, f2, f3)T, and f1, f2, and f3 are scalar functions:f1 52aX 1 aY, (B6)f2 52XZ 1 bX 2 Y, (B7)f3 5 w˜ [tanh(wX1 wY1 wZ1 b )]Oj 1j 2j 3jjj1 b , (B8)j 5 1,2,...,M (M 5 5, the number of hidden neurons).Equation (B1) can be written asvn115 vn1 Kn. (B9)The Lagrange function of Eq. (10) is discretized asN21TL 5 J 1 v*(v 2 v 2 K ), (B10)Onn11 nnn50Nobs T 21 obsJ 5 (v 2 v ) W (v 2 v ), (B11)Onn nnn51where v* is the vector of Lagrange multipliers and itsdimension is the same that of v.The equations for the best fit are obtained by requiringthat all derivatives of L vanish:]L5 0, (B12)]v*n]L5 0, (B13)]vn]L5 0, (B14)]pwhere the partial derivative with respect to a vectorindicates a partial derivative with respect to each of thevector components. Equation (B12) gives nothing morethan (B9), whereas (B10) givesN21]L ]Kn52v*. (B15)On12]p ]pn50The gradients from (B15) are then provided to a qua-si-Newton method to seek the optimal parameters.The equations describing the trajectory of the La-grange multipliers , that is, the adjoint equations, canv*nbe obtained from ]L/]vnand (B10) (also see Andersonand Willebrand 1989):N]Kn21 obs2 W (v 2 v ) 1 v* 2 v* 2 v*Onn n21 nn12]vn51n5 0, n 5 1,...,N, (B16)]L52v*, (B17)0]v0v* 5 0. (B18)NThe ]Kn/]vnand ]Kn/]p can be obtained by chain rule;for example,]k 1 ]k1 1 ]k2 ]k1 1 ]k3 ]k2 ]k1nnnnnnn51]v 6 ]v 3 ]k1 ]v 3 ]k2 ]k1 ]vnn nn nnn1 ]k4 ]k3 ]k2 ]k1nnnn1 .6 ]k3 ]k2 ]k1 ]vnnnnThus, the cycling procedure for computing the bestfit of the hybrid model (4)–(6) to Lorenz model (1)–(3)is as follows:1) Given initial guess values for the unknown param-eters, use (B9) to step the hybrid model forward intime from 1 to N to compute a first approximationto the Lorenz model.2) Step equations (B16)–(B18) backward in time fromN to 1 to compute approximate values of the La-grange multipliers.3) Evaluate the gradient of the cost function with re-spect to these parameters using (B15) [and/or Eq.(B17) if initial conditions need to be retrieved].APRIL 2001 833TANG AND HSIEHFIG. B1. Variations of the normalized cost function (J/J0) (solid line) and normalized gradient (\g\\g0\21) (dash line)the number of iterations for (a) the weakly nonlinear case of Fig. 7 and (b) the strongly nonlinear case of Fig. 11. Thebetter initial guesses in (a) meant that the cost function and its gradients had relatively less distance to drop to convergencethan in (b).4) Use a descent algorithm to find the minimum of thecost function in the direction opposite to the gradientby a line search.5) Take results of the line search as the next guessesfor the parameters and repeat the process startingfrom step 1 until convergence criteria are satisfied.The quasi-Newton method of Broyden–Fletcher–Goldfarb–Shanno (BFGS) (Press et al. 1992) is used inthe above procedure for searching the optimal solutions.Figure B1 is the variation of the normalized cost func-tion and normalized gradient with the number of iter-ations for a weakly nonlinear case (Fig. 7) and a stronglynonlinear case (Fig. 11). The BFGS algorithm has amemory requirement of O(N2) where N is the numberof parameters involved in the optimization. In our case,the memory is 183 KB. For problems with larger N, theconjugate gradient method with a memory requirementof O(N) could be more suitable. A typical case hererequires about 50 iteration for convergence, using about3 min of CPU time on an SGI Origin 200 server.In practice, the adjoint codes were not developed from(B16) to (B18), but by the Tangent and Adjoint ModelCompiler (TAMC; Giering and Kaminski 1996) via thetangent linear equations. These equations are derivedfrom (B9):dvn115 Andvn, (B19)where the matrix Anis the tangent linear model at timestep n, and]KnA 5 I 1 , (B20)n]vnwith I the the identity matrix. The adjointness has beenchecked using the relation between the tangent linearcode A and its adjoint code A* (Talagrand 1991; Navonet al. 1992):{a, Ab} 5 {A*a, b}, (B21)where {·,·} denotes the inner product, and a and barbitrary vectors.834 VOLUME 129MONTHLY WEATHER REVIEWREFERENCESAnderson, D. L. T., and J. Willebrand, Eds., 1989: Oceanic Circu-lation Models: Combining Data and Dynamics. Kluwer Aca-demic Publishers, 605 pp.Barnett, T. P., M. Latif, N. E. Graham, M. Flugel, S. Pazan, and W.White, 1993: ENSO and ENSO-related predictability. Part I:Prediction of equatorial sea surface temperature with a hybridcoupled ocean–atmosphere model. J. Climate, 6, 1545–1566.Bennett, A. F., 1992: Inverse Methods in Physical Oceanography.Cambridge University Press, 346 pp.Bishop, C. M., 1995: Neural Networks for Pattern Recognition. Clar-endon Press, 482 pp.Courtier, P., and Coauthors, 1998: The ECMWF implementation ofthree-dimensional variational assimilation (3D-VAR). I: For-mulation. Quart. J. Roy. Meteor. Soc., 124, 1783–1807.Cybenko, G., 1989: Approximation by superposition of a sigmoidalfunction. Math. Control, Signal, Syst., 2, 303–314.Daley, R., 1991: Atmospheric Data Analysis. Cambridge Atmosphericand Space Science Series, Cambridge University Press, 457 pp.Elsner, J. B., and A. A. Tsonis, 1992: Nonlinear prediction, chaosand noise. Bull. Amer. Meteor. Soc., 73, 49–60.Evensen, G., 1997: Advanced data assimilation for strongly nonlineardynamics. Mon. Wea. Rev., 125, 1342–1354.Gauthier, P., 1992: Chaos and quadric-dimensional data assimilation:A study based on the Lorenz model. Tellus, 44A, 2–17., C. Charette, L. Fillion, P. Koclas, and S. Laroche, 1999: Im-plementation of a 3D variational data assimilation system at theCanadian Meteorological Centre. Part I: The global analysis.Atmos.–Ocean, 37, 103–156.Ghil, M., and P. Malanotte-Rizzoli, 1991: Data assimilation in me-teorology and oceanography. Advances in Geophysics, Vol. 33,Academic Press, 141–265.Giering, R., and T. Kaminski, 1998: Recipes for adjoint code con-struction. ACM TOMS, 24 (4), 437–474.Hannachi, A., and B. Legras, 1995: Simulated annealing and weatherregimes classification. Tellus, 47A, 955–973.Hsieh, W. W., and B. Tang, 1998: Applying neural network modelsto prediction and analysis in meteorology and oceanography.Bull. Amer. Meteor. Soc., 79, 1855–1870.Ji, M., D. W. Behringer, and A. Leetmaa, 1998: An improved coupledmodel for ENSO prediction and implications for ocean initial-ization. Part II: The coupled model. Mon. Wea. Rev., 126, 1022–1034.Kruger, J., 1993: Simulated annealing—A tool for data assimilationinto an almost steady model state. J. Phys. Oceanogr. 23, 679–688.Le Dimet, F., and O. Talagrand, 1986: Variational algorithms foranalysis and assimilation of meteorology observations: Theo-retical aspects. Tellus, 38A, 97–110.Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci.,20, 130–141.Lu, J., and W. W. Hsieh, 1998: On determining initial conditions andparameters in a simple coupled model by adjoint data assimi-lation. Tellus, 50A, 534–544.Masters, T., 1995: Advanced Algorithms for Neural Networks—A C1Source Book. John Wiley and Sons, 431 pp.Miller, R. N., M. Ghil, and F. Gauthiez, 1994: Advanced data assim-ilation in strongly nonlinear dynamical systems. J. Atmos. Sci.,51, 1037–1056.Navon, I. M., 1998: Practical and theoretical aspects of adjoint pa-rameter estimation and identifiability in meteorology and ocean-ography. Dyn. Atmos. Oceans, 27, 55–79., X. Zhou, J. Derber, and J. Sela, 1992: Variational data assim-ilation with an adiabatic version of the NMC spectral model.Mon. Wea. Rev., 120, 1433–1446.Palmer, T. N., 1993: Extended-range atmospheric prediction and theLorenz model. Bull. Amer. Meteor. Soc., 74, 49–65.Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery,1992: Numerical Recipes in Fortran. 2d ed. Cambridge Uni-versity Press, 963 pp.Syu, H. H., J. D. Neelin, and D. Gutzler, 1995: Seasonal and inter-annual variability in a hybrid coupled GCM. J. Climate, 8, 2121–2143.Talagrand, O., 1991: The use of adjoint equations in numerical mod-elling of the atmospheric circulation. Automatic Differentiationof Algorithms: Theory, Implementation and Application, A. Grie-wank and G. F. Corliess, Eds., SIAM, 169–180,Tang, Y., W. W. Hsieh, B. Tang, and K. Haines, 2001: A neuralnetwork atmospheric model for hybrid coupled modeling. Cli-mate Dyn., in press.Tziperman, E., and W. C. Thacker, 1989: An optimal control/adjointequations approach to studying the oceanic general circulation.J. Phys. Oceanogr., 19, 1471–1485.Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences.Academic Press, 467 pp.Zhu, Y., and I. M. Navon, 1999: Impact of key parameters estimationon the performance of the FSU spectral model using the fullphysics adjoint. Mon. Wea. Rev., 127, 1497–1517.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Faculty Research and Publications /
- Coupling Neural Networks to Incomplete Dynamical Systems...
Open Collections
UBC Faculty Research and Publications
Coupling Neural Networks to Incomplete Dynamical Systems via Variational Data Assimilation. Tang, Youmin; Hsieh, William W. 2001
pdf
Page Metadata
Item Metadata
Title | Coupling Neural Networks to Incomplete Dynamical Systems via Variational Data Assimilation. |
Creator |
Tang, Youmin Hsieh, William W. |
Publisher | American Meteorological Society |
Date Issued | 2001-04 |
Description | The advent of the feed-forward neural network (N) model opens the possibility of hybrid neural–dynamical models via variational data assimilation. Such a hybrid model may be used in situations where some variables, difficult to model dynamically, have sufficient data for modeling them empirically with an N. This idea of using an N to replace missing dynamical equations is tested with the Lorenz three-component nonlinear system, where one of the three Lorenz equations is replaced by an N equation. In several experiments, the 4DVAR assimilation approach is used to estimate 1) the N model parameters (26 parameters), 2) two dynamical parameters and three initial conditions for the hybrid model, and 3) the dynamical parameters, initial conditions, and the N parameters (28 parameters plus three initial conditions). Two cases of the Lorenz model—(i) the weakly nonlinear case of quasiperiodic oscillations, and (ii) the highly nonlinear, chaotic case—were chosen to test the forecast skills of the hybrid model. Numerical experiments showed that for the weakly nonlinear case, the hybrid model can be very successful, with forecast skills similar to the original Lorenz model. For the highly nonlinear case, the hybrid model could produce reasonable predictions for at least one cycle of oscillation for most experiments, although poor results were obtained for some experiments. In these failed experiments, the data used for assimilation were often located on one wing of the Lorenz butterfly-shaped attractor, while the system moved to the second wing during the forecast period. The forecasts failed as the model had never been trained with data from the second wing. Copyright 2001 American Meteorological Society (AMS). Permission to use figures, tables, and brief excerpts from this work in scientific and educational works is hereby granted provided that the source is acknowledged. Any use of material in this work that is determined to be “fair use” under Section 107 of the U.S. Copyright Act or that satisfies the conditions specified in Section 108 of the U.S. Copyright Act (17 USC §108, as revised by P.L. 94-553) does not require the AMS’s permission. Republication, systematic reproduction, posting in electronic form, such as on a web site or in a searchable database, or other uses of this material, except as exempted by the above statement, requires written permission or a license from the AMS. Additional details are provided in the AMS Copyright Policy, available on the AMS Web site located at (http://www.ametsoc.org/) or from the AMS at 617-227-2425 or copyright@ametsoc.org. |
Genre |
Article |
Type |
Text |
Language | eng |
Date Available | 2011-04-05 |
Provider | Vancouver : University of British Columbia Library |
DOI | 10.14288/1.0041820 |
URI | http://hdl.handle.net/2429/33323 |
Affiliation |
Science, Faculty of Earth and Ocean Sciences, Department of |
Citation | Tang, Youmin, Hsieh, William W. 2001. Coupling Neural Networks to Incomplete Dynamical Systems via Variational Data Assimilation. Monthly Weather Review 129(4) 818-834. dx.doi.org/10.1175/1520-0493(2001)129<0818:CNNTID>2.0.CO;2 |
Peer Review Status | Reviewed |
Scholarly Level | Faculty |
Copyright Holder | Hsieh, William W. |
Aggregated Source Repository | DSpace |
Download
- Media
- Hsieh_AMS_2001_MWR818.pdf [ 420.44kB ]
- Metadata
- JSON: 1.0041820.json
- JSON-LD: 1.0041820+ld.json
- RDF/XML (Pretty): 1.0041820.xml
- RDF/JSON: 1.0041820+rdf.json
- Turtle: 1.0041820+rdf-turtle.txt
- N-Triples: 1.0041820+rdf-ntriples.txt
- Original Record: 1.0041820 +original-record.json
- Full Text
- 1.0041820.txt
- Citation
- 1.0041820.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
United States | 5 | 1 |
City | Views | Downloads |
---|---|---|
Wilmington | 2 | 0 |
Lewes | 2 | 0 |
Ashburn | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52383.1-0041820/manifest