@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix skos: . vivo:departmentOrSchool "Science, Faculty of"@en, "Earth and Ocean Sciences, Department of"@en ; edm:dataProvider "DSpace"@en ; ns0:identifierCitation "Tang, Youmin, Hsieh, William W. 2001. Coupling Neural Networks to Incomplete Dynamical Systems via Variational Data Assimilation. Monthly Weather Review 129(4) 818-834."@en ; ns0:rightsCopyright "Hsieh, William W."@en ; dcterms:creator "Tang, Youmin"@en, "Hsieh, William W."@en ; dcterms:issued "2011-04-05T23:33:34Z"@en, "2001-04"@en ; dcterms:description """The advent of the feed-forward neural network (N) model opens the possibility of hybrid neural–dynamical models via variational data assimilation. Such a hybrid model may be used in situations where some variables, difficult to model dynamically, have sufficient data for modeling them empirically with an N. This idea of using an N to replace missing dynamical equations is tested with the Lorenz three-component nonlinear system, where one of the three Lorenz equations is replaced by an N equation. In several experiments, the 4DVAR assimilation approach is used to estimate 1) the N model parameters (26 parameters), 2) two dynamical parameters and three initial conditions for the hybrid model, and 3) the dynamical parameters, initial conditions, and the N parameters (28 parameters plus three initial conditions). Two cases of the Lorenz model—(i) the weakly nonlinear case of quasiperiodic oscillations, and (ii) the highly nonlinear, chaotic case—were chosen to test the forecast skills of the hybrid model. Numerical experiments showed that for the weakly nonlinear case, the hybrid model can be very successful, with forecast skills similar to the original Lorenz model. For the highly nonlinear case, the hybrid model could produce reasonable predictions for at least one cycle of oscillation for most experiments, although poor results were obtained for some experiments. In these failed experiments, the data used for assimilation were often located on one wing of the Lorenz butterfly-shaped attractor, while the system moved to the second wing during the forecast period. The forecasts failed as the model had never been trained with data from the second wing. Copyright 2001 American Meteorological Society (AMS). Permission to use figures, tables, and brief excerpts from this work in scientific and educational works is hereby granted provided that the source is acknowledged. Any use of material in this work that is determined to be “fair use” under Section 107 of the U.S. Copyright Act or that satisfies the conditions specified in Section 108 of the U.S. Copyright Act (17 USC §108, as revised by P.L. 94-553) does not require the AMS’s permission. Republication, systematic reproduction, posting in electronic form, such as on a web site or in a searchable database, or other uses of this material, except as exempted by the above statement, requires written permission or a license from the AMS. Additional details are provided in the AMS Copyright Policy, available on the AMS Web site located at (http://www.ametsoc.org/) or from the AMS at 617-227-2425 or copyright@ametsoc.org."""@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/33323?expand=metadata"@en ; skos:note "818 VOLUME 129M O N T H L Y W E A T H E R R E V I E W q 2001 American Meteorological Society Coupling Neural Networks to Incomplete Dynamical Systems via Variational Data Assimilation YOUMIN TANG AND WILLIAM W. HSIEH Oceanography, Department of Earth and Ocean Sciences, University of British Columbia, Vancouver, British Columbia, Canada (Manuscript received 16 March 2000, in final form 30 August 2000) ABSTRACT The advent of the feed-forward neural network (N) model opens the possibility of hybrid neural–dynamical models via variational data assimilation. Such a hybrid model may be used in situations where some variables, difficult to model dynamically, have sufficient data for modeling them empirically with an N. This idea of using an N to replace missing dynamical equations is tested with the Lorenz three-component nonlinear system, where one of the three Lorenz equations is replaced by an N equation. In several experiments, the 4DVAR assimilation approach is used to estimate 1) the N model parameters (26 parameters), 2) two dynamical parameters and three initial conditions for the hybrid model, and 3) the dynamical parameters, initial conditions, and the N parameters (28 parameters plus three initial conditions). Two cases of the Lorenz model—(i) the weakly nonlinear case of quasiperiodic oscillations, and (ii) the highly nonlinear, chaotic case—were chosen to test the forecast skills of the hybrid model. Numerical experiments showed that for the weakly nonlinear case, the hybrid model can be very successful, with forecast skills similar to the original Lorenz model. For the highly nonlinear case, the hybrid model could produce reasonable predictions for at least one cycle of oscillation for most experiments, although poor results were obtained for some exper- iments. In these failed experiments, the data used for assimilation were often located on one wing of the Lorenz butterfly-shaped attractor, while the system moved to the second wing during the forecast period. The forecasts failed as the model had never been trained with data from the second wing. 1. Introduction Numerical models have been widely used to simulate dynamical systems such as the atmosphere and the ocean. However, the physics in these models is usually incomplete, and some empirical approach is needed to patch up the missing physics. The first example is the inability of a numerical model, with its finite resolution, to represent subgrid-scale physical processes, thereby forcing numerical modelers to adopt parameterization schemes for these processes. A second example arises from the fact that some variables of interest are not variables in the numerical model, for example, precip- itation and surface air temperature (which is generally not equivalent to the temperature at the lowest level of an atmospheric model). Often multiple linear regression (MLR) is used to empirically relate the model variables to the variables of interest, via schemes such as perfect prog and model output statistics (Wilks 1995). A third example arises in situations where replacing the physical Corresponding author address: William W. Hsieh, Oceanography/ EOS, University of British Columbia, Vancouver, BC V6T 1Z4, Can- ada. E-mail: william@ocgy.ubc.ca equations by empirical ones results in large computa- tional savings (and greater stability): in the equatorial Pacific, a simple empirical atmospheric model can be coupled to a dynamical ocean model, to form a hybrid coupled model. Here the wind stress is empirically es- timated from the ocean variables either by a linear sta- tistical method, such as MLR (Barnett et al. 1993) or singular value decomposition (Syu et al. 1995), or by a neural network (Tang et al. 2001). While it is an attractive idea to replace missing dy- namical equations with empirical equations, there are serious technical problems; for example, the dynamical equations are generally nonlinear, while the standard statistical methods are usually linear, hence not capable of simulating the evolution of nonlinear dynamical equations for extended periods. Neural networks (N) have been known to be capable of simulating any non- linear function, given a large enough network (Cybenko 1989). Recent advances in N modeling have led to new techniques capable of nonlinearly generalizing the clas- sical multivariate methods such as MLR, principal com- ponent analysis, and canonical correlation analysis (Hsieh and Tang 1998). As Ns originated from the field of artificial intelligence and robotics, it is tempting to ponder whether Ns can benefit dynamical systems with APRIL 2001 819T A N G A N D H S I E H FIG. 1. The Lorenz system integrated over 15 000 time steps for (left) the weakly nonlinear case and (right) the highly nonlinear case. incomplete physics the way robotic limbs have helped prosthetics. More specifically, could N equations be used to simulate missing nonlinear dynamical equa- tions? How to effectively couple an N model to a dy- namical model is a major challenge. Variational data assimilation, especially via the ad- joint approach, has become popular in assimilating data into numerical models (Daley 1991; Ghil and Malan- otte-Rizzoli 1991; Bennett 1992; Navon 1998). The method is commonly used to optimally estimate model parameters or initial conditions, and is being imple- mented in operational weather and climate prediction models (e.g., Zhu and Navon 1999; Gauthier et al. 1999; Courtier et al. 1998; Ji et al. 1998). Hsieh and Tang (1998) noted that Ns could be formulated as a varia- tional assimilation problem, and suggested that Ns and dynamical models might be combined most naturally via a variational assimilation approach. The objective of this paper is to show how an N model can be coupled to a dynamical model (with incomplete physics) via variational assimilation. The chosen dy- namical model is the simple Lorenz (1963) model, with three variables, arising from the Fourier truncation of the Rayleigh–Bénard equations describing atmospheric convection. In the field of data assimilation, the cele- brated Lorenz model has served as a test bed for ex- amining the properties of various data assimilation methods (Gauthier 1992; Miller et al. 1994; Evensen 1997) as the Lorenz model shares many common fea- tures with the atmospheric circulation and climate sys- tem in terms of variability and predictability (Palmer 1993). By adjusting the model parameters that control the nonlinearity of the system, the model can be used to simulate near-regular oscillations or highly nonlinear fluctuations. This paper is structured as follows: section 2 describes the hybrid Lorenz model, while section 3 shows some simple experiments involving the hybrid model. Section 4 gives a general formulation for coupling an N to a dynamical model via variational data assimilation. Sec- tion 5 uses variation assimilation to estimate the N pa- rameters in the hybrid Lorenz model. Section 6 studies not only estimating the N parameters, but also retrieving the dynamical model parameters and initial conditions. 2. Hybrid neural–dynamical Lorenz model The nondimensionalized Lorenz (1963) nonlinear system of three differential equations are 820 VOLUME 129M O N T H L Y W E A T H E R R E V I E W FIG. 2. The ensemble mean of 25 Ns (dot–dash curve) for approximating the third Lorenz equation (solid curve) for (left) the weakly nonlinear case and (right) the highly nonlinear case. (a) and (b) The training period of 3000 time steps, and (c) and (d) the test period of 1000 time steps. dX 5 2aX 1 aY, (1) dt dY 5 2XZ 1 bX 2 Y, (2) dt dZ 5 XY 2 cZ, (3) dt where variables X, Y, and Z are related to the intensity of convective motion, and the temperature gradients in the horizontal and vertical directions, respectively. The parameters a, b, and c will be referred to as dynamical parameters, in contrast to the empirical parameters of the N model. The so-called observed data or true data are from the integration of the Lorenz equations (1)–(3) over 15 000 time steps at a step size h of 0.001, using a fourth-order Runge–Kutta integration scheme. As this system is very sensitive to changes in the initial conditions and model parameters, two cases are studied (Fig. 1). The first case, called the weakly nonlinear case, with the parameters a, b, and c set to 10, 28, and 8/3, respectively, and initial conditions for (X, Y, and Z) to (29.42, 29.43, 28.3) (as in Gauthier 1992), displays near-regular oscillations with a gradually increasing amplitude in the devised integration period. The other case is the highly nonlinear case, with a, b, and c set to 16.0, 120.1, and 4.0, re- spectively (as in Elsner and Tsonis 1992), and initial conditions to (22.8, 35.7, 114.9). Next we assume the third Lorenz equation is un- available, and we must approximate it empirically with an N equation. Our hybrid model thus consists of dX 5 2aX 1 aY, (4) dt dY 5 2XZ 1 bX 2 Y, (5) dt dZ 5 N (X , Y , Z ), (6)t t tdt where N is a feed-forward N model (Hsieh and Tang 1998). The three input neurons Xt, Yt, and Zt are (also denoted by y i, i 5 1, 2, 3), inputting the values of X, Y, and Z at time t, and the single output neuron is dZ/dt. More details on the N model are given in appendix A. APRIL 2001 821T A N G A N D H S I E H FIG. 3. The hybrid model (dot–dash curve) and the true Lorenz model (solid curve) integrated from identical initial conditions for (left) the weakly nonlinear case and (right) the highly nonlinear case. 3. Simple hybrid model experiments A simple-minded approach to the missing third Lo- renz equation is to use data to get the N Eq. (6), and then integrate Eqs. (4)–(6), with (6) as a replacement for the missing Lorenz equation. The N has X, Y, and Z as the inputs and (Zt11 2 Zt)/h as the target during training [with the optimization minimizing the mean square error between the target and the model output], so the model output is approximately dZ/dt. The opti- mization of the N model was determined over a training period of 3000 time steps, and the following period of 1000 time steps was used to independently test the N model skills in estimating dZ/dt. The appropriate num- ber of neurons in the hidden layer is chosen based on the trade-off between underfitting (too few neurons) and overfitting (too many neurons) (Hsieh and Tang 1998). After trying models with different numbers of hidden neurons, we found that five hidden neurons yielded the best skills over the test period. To alleviate the instability problems associated with N modelling (Hsieh and Tang 1998), an ensemble of 25 identical Ns were trained with random initial weights and biases, with the outputs from the 25 Ns used to provide an ensemble mean output. The skills of the ensemble mean generally exceeded the skills of an in- dividual member (Hsieh and Tang 1998). The ensemble mean N simulations of dZ/dt for both the weakly non- linear and the highly nonlinear cases during the training period and the test period are shown in Fig. 2, where the N gave good simulations of dZ/dt, although the os- cillation peaks were underestimated. Now that the third Lorenz equation seems to be rea- sonably approximated by the N Eq. (6), we proceed to integrate the hybrid system (4)–(6) from perfect initial conditions and compare the output with respect to that from the true Lorenz system (1)–(3). This simple-mind- ed hybrid approach worked reasonably well in simu- lating the true Lorenz system for the weakly nonlinear case (Fig. 3), but failed completely for the highly non- linear case, where the hybrid model yielded X and Y values that were off by orders of magnitude. The highly nonlinear Lorenz system is known for its extremely high sensitivity to perturbations. Clearly, the N approxima- tion to the third Lorenz equation introduced significant errors in Z, which quickly caused the X and Y values to deviate far off course. The defect of this simple approach lies in the fact that the optimization of the N was achieved without imposing the dynamical constraints of (4) and (5). A better approach is variational assimilation, where the dynamical constraints are imposed. 822 VOLUME 129M O N T H L Y W E A T H E R R E V I E W FIG. 4. Schematic diagram for (a) the SC, (b) NC, and (c) PSC assimilation schemes. The solid curves are the model trajectory; dots, the observed values; and dash lines, the model errors to be minimized. Here a model trajectory starts from an observed value. FIG. 5. The NC assimilation results, with the model training results shown in the left panels, and predictions over the test period in the right panels, where the solid curves show the true data and dot–dash curves the outputs of the hybrid model. The dot–dash curves overlap the solid curves in the left panels. 4. The hybrid model using variational data assimilation The scalar equations (4)–(6) can be expressed as a vector equation: dv 5 f (v, p, t), (7) dt where v is the vector denoting (X, Y, Z) and p is the parameter vector (which could contain the parameters of the N or of the dynamical equations). The variational assimilation method involves mini- mizing a quadratic cost function J subject to model con- straint (7), where J is defined as T J(v, p, v ) 5 D(v, t) dt and (8)0 E 0 T 21D 5 (v 2 v ) W (v 2 v ), (9)obs obs where the subscript obs denotes the observed values, the superscript T the transpose, T the length of the as- similation window (also called training period some- times), v0 the initial conditions, and W the covariance matrix of the measurement errors, assumed here to be diagonal. APRIL 2001 823T A N G A N D H S I E H FIG. 6. The PSC assimilation results with AS 5 200 time steps. (left) The model training results, and (right) model predictions, with solid curves for the true data and dot–dash curves for the outputs of the hybrid model. Introduce the Lagrange function L, T dv TL 5 J 1 v* 2 f dt, (10)E 1 2dt0 where v*(t) is a vector of Lagrange multipliers (or adjoint variables). After integration by parts, L be- comes T Tdv* T T TL 5 (v* v) 1 D 2 v 1 v* f dt. (11)0 E 1 2[ ]dt0 The first-order variation of L is T T TT dv* ]f ]f T T TdL 5 [v* dv] 1 = D dv 2 dv 1 v*dv 1 v*dp dt. (12)0 E v 1 2 1 2 1 2[ [ ]dt ]v ]p0 The hybrid adjoint model can be obtained by letting dL/dv 5 0: dv* ]f 2 5 v*(t) 2 = D, (13)vdt ]v v*(T ) 5 0. (14) According to (13) and (14), the formulas for com- puting the gradients of J with respect to p and v0 can be obtained by differentiating (12) with respect to these unknowns (Lu and Hsieh 1998) TTdJ ]f 5 2 v* dt and (15)E 1 2dp ]p0 dJ 5 2v*(0). (16) dv0 824 VOLUME 129M O N T H L Y W E A T H E R R E V I E W FIG. 7. Same as Fig. 6 but for AS 5 1000 time steps. Details of the variational assimilation are given in appendix B. As the above assimilation process, also referred to as the training process, is completely subject to the model (4)–(6) after the initial guesses are given, it is called the strong continuity constraint assimilation scheme (SC). 5. Determining N parameters via variational assimilation In this section, we determine the N parameters, that is, the weights and biases, in the hybrid model by var- iational data assimilation. An N with five hidden neu- rons is used. Preliminary experiments are done in sec- tions 5a and 5b to determine the best first guesses and assimilation window sizes, respectively, before the main experiments are run in section 5c. In this section, all experiments were performed with perfect initial con- ditions and dynamical parameters a and b. a. The initial guesses Whether an assimilation process is successful de- pends greatly on the choice of the first guesses. To help in getting good initial guesses, two assimilation schemes are introduced in addition to the SC scheme. Under SC, at each step of the integration, the initial conditions are the model outputs from the previous step (Fig. 4a). In the no-continuity constraint assimilation scheme (NC), at each integration step, the initial conditions are the observed data, rather than the model output from the previous step (Fig. 4b). Between these two extremes, we can have the par- tially strong continuity constraint assimilation scheme (PSC) (Fig. 4c), where the assimilation window T is divided into smaller assimilation segments (ASs). With- in each AS, integration at each step uses the model output from the previous step as the initial conditions. Only at the start of the AS are the data used directly as the initial conditions. Both SC and NC can be regarded as special cases of PSC where the AS is T in SC and one time step in NC. The reason for introducing NC and PSC is that using SC without good initial guesses generally does not lead to successful assimilation over a long assimilation win- dow. Chopping the window into shorter ASs means that the dynamic model constraints are no longer imposed over a long period, thereby making the nonlinear op- timization a much easier task. Our strategy is to divide the assimilation process into two stages: (a) use less demanding schemes such as NC and PSC to obtain rea- sonable parameters estimates, which will then be used APRIL 2001 825T A N G A N D H S I E H FIG. 8. Correlation skills (averaged over 10 experiments) for the variables X, Y, and Z at various assimilation windows (T 5 100, 300, 500, and 800 time steps) during (a) the training period, and (b) the test period. as initial guesses in (b) the full variational assimilation under SC. For the weakly nonlinear case, the NC assimilation scheme was first applied to the hybrid model (4)–(6), with initial guesses for the N parameters taken randomly from one of the 25 Ns in section 3. The assimilation window is 3000 time steps. The NC scheme can produce good results during the training period but poor skill during the test period (Fig. 5). Since the AS is very short (one time step) in the NC scheme, the initial values at each integration step vastly outweigh the importance of the dynamical constraints, hence the poor skill during the test period. However the result from the NC can provide reasonable first guesses for the following PSC. A series of PSC assimilations with AS 5 100, 200, 500, and 1000 time steps was further performed with the hybrid model (in the weakly nonlinear case) to grad- ually improve the first guesses to be used in the stage b SC assimilation. A longer AS increases the importance of dynamical constraints relative to initial conditions in influencing the model outputs. The first guesses of the N parameters in a PSC experiment are the parameters obtained from the previous PSC experiment with the shorter AS. The first PSC experiment (not shown) with AS 5 100 time steps uses the results from the NC case (i.e., AS 5 one time step) for its first guesses. The PSC assimilation over the training period, that is, a window of 3000 time steps, with AS 5 200 time steps (Fig. 6), and AS 5 1000 time steps (Fig. 7), reveals that the test skills improved with the increase of the AS. The parameters estimated from the PSC experiment with AS51000 time steps will serve as the first guess for the stage b SC assimilation (in the next section) for the weakly nonlinear case, where the assimilation window is also taken to be 1000 time steps. This approach of using a series of PSC assimilations to improve on the parameter estimates to be used as initial guesses in the stage b SC assimilation did not work for the highly nonlinear case. Unlike the weakly nonlinear case, the first guesses from PSC, if found, is only effective for a specific experimental configuration (e.g., AS, window length, etc.). Hence, for the highly nonlinear case, the parameters estimated from the NC assimilation over 3000 time steps are directly used as the first guesses for the stage b SC assimilation. b. Impact of assimilation window in the highly nonlinear case The greatest difficulty in variational assimilation, namely the presence of multiple minima in the cost function arising from the nonlinearity of the problem, often renders the problem intractable. The number of local minima dramatically increases with the window T for a strongly nonlinear dynamical system. Gauthier (1992) and Miller et al. (1994) have shown with the Lorenz model that variational assimilation is effective only for sufficiently short windows. In the weakly non- linear case (section 5a), T did not have a major impact; hence, the following discussion will only focus on the highly nonlinear case. To study the impact of the window, SC assimilation experiments were performed, where T was varied be- tween 100, 300, 500, and 800 time steps. For each T, 10 experiments were performed with the same first guesses, but the assimilation periods were shifted by 100 time steps between one experiment and the next. The effect of T on the assimilation is different for the training period (Fig. 8a) and the test period (Fig. 8b). With a short window (T 5 100 time steps), the corre- lation skills were excellent for all three variables during training, but yielded no prediction skills over the fol- lowing test period of 300 time steps. As T increased to 300 time steps and then 500 time steps, there were in- creases in the prediction skills (Fig. 8b), but as T in- creased to 800, the prediction skill dropped sharply, as the problem with local minima in the cost function wors- 826 VOLUME 129M O N T H L Y W E A T H E R R E V I E W FIG. 10. Bar charts from 100 assimilation experiments for the highly nonlinear case showing the number of experiments with (a) corre- lation above a particular correlation level and (b) REE below a par- ticular REE level, during the test period. ← FIG. 9. (a) Correlation skills and (b) REE for the variables X, Y, and Z during the training and test periods (averaged over 100 ex- periments), for the weakly nonlinear case, where the SC assimilation retrieves the N parameters. The error bars show the standard devi- ation. Results from the simple hybrid model without data assimilation (section 3) are also shown for comparison. ened with increasing T. Hence in the following SC as- similation experiments, the window will be 500 time steps for the highly nonlinear case, and 1000 time steps for the weakly nonlinear case. Following the training (assimilation) period, there is a test period of 300 time steps for the highly nonlinear case, and 1000 time steps for the weakly nonlinear case. c. SC assimilation We now perform the main SC assimilation experi- ments, repeated 100 times, to examine the skills of the APRIL 2001 827T A N G A N D H S I E H FIG. 11. A typical experiment with class 1 results for the highly nonlinear case, showing (a) the fitting during the training period (left panels) and (b) the forecasting during the testing period (right panels). Solid curves denote the true value, while dot–dash curves denote model results. hybrid model for the weakly nonlinear and highly non- linear cases. The choices of windows and first guesses are those found in sections 5a and 5b. The 100 exper- iments are implemented with their assimilation periods shifted by 100 time steps from one experiment to the other. For the weakly nonlinear case, the average correlation between model and data (over 100 experiments) exceeds 0.96 for each of X, Y, and Z during both the training period and the test period (Fig. 9a). The relative estimate error [REE, i.e., S (estimated value-true value)2/S (true value)2] is very small, less than 0.004 (Fig. 9b). Thus, via variational assimilation, a hybrid model has suc- cessfully reconstructed the Lorenz model in the weakly nonlinear case. For comparison, the results of the simple hybrid model without data assimilation from section 3 are also show in Fig. 9. The improvement due to var- iational data assimilation is dramatic for X and Y. In contrast, the highly nonlinear case has much lower skill. During training, the skill was good—80% of the experiments had correlation above 0.75, and 60% had REEs under 0.05 (not shown). However, during the test period, the correlation skill and REE attained were much poorer (Fig. 10). Generally, for the highly nonlinear case, there were three classes of assimilation results found in the 100 experiments. A typical example of the class 1 results is shown in Fig. 11, showing successful assimilation. About 15% of the total experiments is of this class. Almost 60% of the total experiments belongs to a class 2 of moderately successful assimilation, as illustrated by the example in Fig. 12, where during the test period, the forecast skills were fairly good for the first 200 time steps. The remaining 25%–30% of the experiments be- longed to a class 3 of failed forecasting, as illustrated in Fig. 13, where despite excellent fitting during the training period, the forecast results were totally wrong. To see why there is a drastic difference between class 1 and class 3 behavior, we plotted the system trajectories of the two cases in the (X, Y, Z) space (Fig. 14). The Lorenz attractor for the highly nonlinear case is known to have a ‘‘butterfly’’ shape with two wings. In the class 1 experiment, both wings of the Lorenz attractor were covered by the training data trajectory. In contrast, with the class 3 example, the training data trajectory resided in one wing. The system then evolved to the other wing during the testing period. As the model had never been trained with data from the second wing, it obviously 828 VOLUME 129M O N T H L Y W E A T H E R R E V I E W FIG. 12. Same as Fig. 11 but for a typical experiment with class 2 (i.e., moderately successful forecast) results. could not forecast properly in that regime, hence the disastrously poor forecast results found in Fig. 13. Thus the bimodality of the highly nonlinear Lorenz system created cases where the training data covered only one wing of the attractor, and the model thus trained had no capability of forecasting the transition to the other wing. One would be tempted to think that the problem could be corrected by simply using longer training periods, so that the full Lorenz attractor would have been learned by the model. Unfortunately, a longer training period greatly increases the difficulty in the nonlinear opti- mization due to multiple local minima in the cost func- tion, leading eventually to the failure of the variational assimilation method. 6. Determining dynamical parameters and initial conditions by the hybrid model In variational data assimilation, parameters and initial conditions can be jointly estimated (e.g., Le Dimet and Talagrand 1986; Tziperman and Thacker 1989; Lu and Hsieh 1998; Zhu and Navon 1999). We next conducted an experiment, for the weakly nonlinear case, to estimate simultaneously the dynamical parameters a and b, and the initial conditions of X, Y, and Z in the hybrid Lorenz system (4)–(6). The N from section 5a (i.e., PSC assim- ilation with AS 5 1000 time steps) was used. For the initial guesses, a and b and the initial con- ditions were all scaled down by 10% from their true values. The result of retrieving the two parameters and three initial conditions by the hybrid model with an assimilation period of 3000 steps under SC showed the retrieval to be very good, as the REEs were all of the order of 1023–1024 (not shown). Next the true values were scaled down by 20% to provide the initial guesses, and the retrieved results were still generally good (Fig. 15). But when the initial values were the true values scaled down by 30%, the retrieval nearly failed (not shown). Next, we had the N parameters retrieved as well dur- ing the assimilation; that is, we conducted 100 additional SC experiments to estimate simultaneously the N pa- rameters, the dynamical parameters a and b, and the initial conditions of X, Y, and Z, for the weakly nonlinear case. The first guesses for the initial conditions and the parameters a and b were simply the true values scaled down by 10%. The choice of the first guesses of the N model parameters, cost function, and assimilation win- dow were the same as those discussed in the previous section. APRIL 2001 829T A N G A N D H S I E H FIG. 13. Same as Fig. 11 but for a typical experiment with class 3 (i.e., unsuccessful forecast) results. The average correlation skills over the 100 experi- ments are shown in Fig. 16, while the average REEs are all around 0.05 (not shown). The correlation and REE were generally poorer than those in Fig. 9, where only the N parameters were retrieved. This agrees with many studies that found that the assimilation quality deteriorated while retrieving more model unknowns (Lu and Hsieh 1998), as the problem of multiple minima in the cost function could become worse when retrieving more model unknowns. Likewise, 100 experiments for the highly nonlinear case were also carried out to simultaneously estimate the N parameters, the dynamical parameters a and b, and the initial conditions. The first guesses were the true values scaled down by 10%, and the first guesses of the N model, the cost function, and assimilation window were still taken from section 5. While the skills in the training period were as good as those retrieving the N parameters only (not shown), the skills during the testing period (Fig. 17) were considerably lower than those found when only N parameters were retrieved (Fig. 10). Class 1 examples like Fig. 11 can hardly be found now. 7. Summary and discussion In this study, a hybrid neural–dynamical variational data assimilation procedure aimed at simulating missing dynamical equations was formulated. This procedure was then applied to the Lorenz model, where the third dynamical equation of the model was missing and had to be simulated empirically by an N equation. First guesses in variational data assimilation can be vital to retrieving the model parameters and initial con- ditions successfully. In order to get reasonable first guesses, the no continuity constraint and partial strong continuity constraint schemes were proposed in this study. Experiments show that such a treatment of choos- ing first guesses is effective in the weakly nonlinear case, which allows the cost function to be successfully optimized in almost all experiments under strong con- tinuity constraint. For the highly nonlinear case, the NC scheme provides reasonable first guesses for the SC ex- periments. The length of the assimilation window is very sig- nificant for the highly nonlinear case. When the window is either too short or too long, the prediction skills are poor. This strict demand for an appropriate assimilating window results from a delicate balance between the need for more data (longer window) and the avoidance of severe multiple minima during optimization with the highly nonlinear model, which calls for a short window. Our numerical experiments showed that the hybrid model based on N and variational assimilation is suc- 830 VOLUME 129M O N T H L Y W E A T H E R R E V I E W FIG. 14. The system trajectories in the (X, Y, Z ) space during the training period (thin curves) and the testing period (thick curves) for (a) the class 1 experiment and (b) the class 3 experiment. FIG. 16. The correlation skill averaged over 100 experiments where the N parameters, the dynamical parameters a and b, and the initial conditions were retrieved for the weakly nonlinear case. Error bars indicate the standard deviation. FIG. 15. The results from retrieving the dynamical parameters a and b, and the initial conditions of the hybrid model, where the initial guesses were the true values scaled down by 20%. cessful in simulating the weakly nonlinear Lorenz mod- el, with forecast skills similar to the original Lorenz model. For a highly nonlinear Lorenz system, the hybrid model can also produce reasonable skills of simulations and predictions for most experiments, although the as- similation basically failed for a fair portion of the ex- APRIL 2001 831T A N G A N D H S I E H FIG. 17. Bar charts from 100 assimilation experiments for the highly nonlinear case showing the number of experiments with (a) corre- lation above a particular correlation level and (b) REE below a par- ticular REE level, during the test period. The N parameters, the dy- namical parameters a and b, and the initial conditions were retrieved. FIG. A1. An example of a neural network model, where there are three neurons in the input layer, five in the hidden layer, and one in the output layer. The parameters wij and w̃j are the weights, and bj and b÷ are the biases. The parameters bj and b÷ can also be regarded as the weights for constant inputs of value 1. periments. Deterministic optimization algorithms often have difficulties reaching the global minimum in the cost function, due to the presence of local minima. It has been shown that the variational assimilation with deterministic optimization algorithms often loses its power for strongly nonlinear models (Miller et al. 1994; Evensen 1997). Stochastic optimization methods such as simulated annealing may help (Kruger 1993; Han- nachi and Legras 1995). However, even if the global minimum is found, the data in the assimilation window might still only consist of data from one wing of the butterfly-shaped attractor, whereas the data in the fore- casting test period could lie in the second wing, thereby resulting in failed forecasts. Alternatively, other assim- ilation approaches, for example, the extended Kalman filter, could be more powerful in dealing with strongly nonlinear systems (Miller et al. 1994; Evensen 1997). The possibility of an N coupled with a dynamical model with an extended Kalman filter is being investigated. Acknowledgments. The authors are grateful to Dr. Benyang Tang, Dr. Jingxi Lu, Dr. Adam Monahan, and Yuval for helpful discussion, and to Dr. Ralf Giering for his TAMC compiler for adjoint codes. This work was supported by research and strategies grants to W. Hsieh from the Natural Sciences and Engineering Re- search Council of Canada. APPENDIX A Neural Network Model A feed-forward neural network is a nonparametric statistical model for extracting nonlinear relations in the data. A common N model configuration is to place be- tween the input and output variables (also called ‘‘neu- rons’’), a layer of ‘‘hidden neurons’’ (Fig. A1). The value of the jth hidden neuron is y 5 tanh w x 1 b , (A1)Oj ij i j1 2i where xi is the ith input, wij the weight parameters, and bj the bias parameters. The hyperbolic tangent function is used as the transfer function (Bishop 1995, p. 127). The output neuron is given by z 5 w̃ y 1 b÷ . (A2)O j j j A cost function 832 VOLUME 129M O N T H L Y W E A T H E R R E V I E W J 5 ^(z 2 zobs)2& (A3) measures the mean square error between the model out- put z and the observed values zobs. The parameters wij, w̃j, bj, and b÷ are adjusted as the cost function is mini- mized by the optimization algorithm of Levenberg–Mar- quardt (Masters 1995), without any constraints imposed. The procedure, known as network training, yields the optimal parameters for the network. The random number generator in MATLAB, which generates uniformly dis- tributed random numbers on the interval (0.0, 1.0), was used to initialize these parameters. For an N with m1 inputs and m2 hidden neurons, the number of model parameters is m1 3 m2 1 2 3 m2 1 1. In this paper, we set the number of hidden neurons to 5, so the N has 26 parameters. The above is a brief description of a traditional feed- forward N algorithm, used in section 3. In later sections, the optimization will be constrained by the dynamical equations. APPENDIX B Discrete Form of the Hybrid Model for Coding The computer code is directly generated from the dis- crete form of the hybrid model. The fourth-order Runge– Kutta is used to discretize the vector equation 7: k1 k2 k3 k4n n n n v 5 v 1 1 1 1 , (B1)n11 n 6 3 3 6 k1 5 hf (v , p, t ), (B2)n n n k2 5 hf (v 1 0.5k1 , p, t 1 0.5h), (B3)n n n n k3 5 hf (v 1 0.5k2 , p, t 1 0.5h), (B4)n n n n k4 5 hf (v 1 k3 , p, t 1 h), (B5)n n n n where the subscript n indicates the time level, f 5 (f1, f2, f3)T, and f1, f2, and f3 are scalar functions: f1 5 2aX 1 aY, (B6) f2 5 2XZ 1 bX 2 Y, (B7) f3 5 w̃ [tanh(w X 1 w Y 1 w Z 1 b )]O j 1j 2 j 3 j j j 1 b÷ , (B8) j 5 1, 2, . . . , M (M 5 5, the number of hidden neurons). Equation (B1) can be written as vn11 5 vn 1 Kn. (B9) The Lagrange function of Eq. (10) is discretized as N21 TL 5 J 1 v* (v 2 v 2 K ), (B10)O n n11 n n n50 N obs T 21 obsJ 5 (v 2 v ) W (v 2 v ), (B11)O n n n n n51 where v* is the vector of Lagrange multipliers and its dimension is the same that of v. The equations for the best fit are obtained by requiring that all derivatives of L vanish: ]L 5 0, (B12) ]v*n ]L 5 0, (B13) ]vn ]L 5 0, (B14) ]p where the partial derivative with respect to a vector indicates a partial derivative with respect to each of the vector components. Equation (B12) gives nothing more than (B9), whereas (B10) gives N21]L ]Kn5 2 v*. (B15)O n1 2]p ]pn50 The gradients from (B15) are then provided to a qua- si-Newton method to seek the optimal parameters. The equations describing the trajectory of the La- grange multipliers , that is, the adjoint equations, canv*n be obtained from ]L/]vn and (B10) (also see Anderson and Willebrand 1989): N ]Kn21 obs2 W (v 2 v ) 1 v* 2 v* 2 v*O n n n21 n n1 2]vn51 n 5 0, n 5 1, . . . , N, (B16) ]L 5 2v*, (B17)0]v0 v* 5 0. (B18)N The ]Kn/]vn and ]Kn/]p can be obtained by chain rule; for example, ]k 1 ]k1 1 ]k2 ]k1 1 ]k3 ]k2 ]k1n n n n n n n5 1 1 ]v 6 ]v 3 ]k1 ]v 3 ]k2 ]k1 ]vn n n n n n n 1 ]k4 ]k3 ]k2 ]k1n n n n1 . 6 ]k3 ]k2 ]k1 ]vn n n n Thus, the cycling procedure for computing the best fit of the hybrid model (4)–(6) to Lorenz model (1)–(3) is as follows: 1) Given initial guess values for the unknown param- eters, use (B9) to step the hybrid model forward in time from 1 to N to compute a first approximation to the Lorenz model. 2) Step equations (B16)–(B18) backward in time from N to 1 to compute approximate values of the La- grange multipliers. 3) Evaluate the gradient of the cost function with re- spect to these parameters using (B15) [and/or Eq. (B17) if initial conditions need to be retrieved]. APRIL 2001 833T A N G A N D H S I E H FIG. B1. Variations of the normalized cost function (J/J0) (solid line) and normalized gradient (\\g\\ \\g0\\21) (dash line) the number of iterations for (a) the weakly nonlinear case of Fig. 7 and (b) the strongly nonlinear case of Fig. 11. The better initial guesses in (a) meant that the cost function and its gradients had relatively less distance to drop to convergence than in (b). 4) Use a descent algorithm to find the minimum of the cost function in the direction opposite to the gradient by a line search. 5) Take results of the line search as the next guesses for the parameters and repeat the process starting from step 1 until convergence criteria are satisfied. The quasi-Newton method of Broyden–Fletcher– Goldfarb–Shanno (BFGS) (Press et al. 1992) is used in the above procedure for searching the optimal solutions. Figure B1 is the variation of the normalized cost func- tion and normalized gradient with the number of iter- ations for a weakly nonlinear case (Fig. 7) and a strongly nonlinear case (Fig. 11). The BFGS algorithm has a memory requirement of O(N 2) where N is the number of parameters involved in the optimization. In our case, the memory is 183 KB. For problems with larger N, the conjugate gradient method with a memory requirement of O(N) could be more suitable. A typical case here requires about 50 iteration for convergence, using about 3 min of CPU time on an SGI Origin 200 server. In practice, the adjoint codes were not developed from (B16) to (B18), but by the Tangent and Adjoint Model Compiler (TAMC; Giering and Kaminski 1996) via the tangent linear equations. These equations are derived from (B9): dvn11 5 Andvn, (B19) where the matrix An is the tangent linear model at time step n, and ]KnA 5 I 1 , (B20)n ]vn with I the the identity matrix. The adjointness has been checked using the relation between the tangent linear code A and its adjoint code A* (Talagrand 1991; Navon et al. 1992): {a, Ab} 5 {A*a, b}, (B21) where { · , · } denotes the inner product, and a and b arbitrary vectors. 834 VOLUME 129M O N T H L Y W E A T H E R R E V I E W REFERENCES Anderson, D. L. T., and J. Willebrand, Eds., 1989: Oceanic Circu- lation Models: Combining Data and Dynamics. Kluwer Aca- demic Publishers, 605 pp. Barnett, T. P., M. Latif, N. E. Graham, M. Flugel, S. Pazan, and W. White, 1993: ENSO and ENSO-related predictability. Part I: Prediction of equatorial sea surface temperature with a hybrid coupled ocean–atmosphere model. J. Climate, 6, 1545–1566. Bennett, A. F., 1992: Inverse Methods in Physical Oceanography. Cambridge University Press, 346 pp. Bishop, C. M., 1995: Neural Networks for Pattern Recognition. Clar- endon Press, 482 pp. Courtier, P., and Coauthors, 1998: The ECMWF implementation of three-dimensional variational assimilation (3D-VAR). I: For- mulation. Quart. J. Roy. Meteor. Soc., 124, 1783–1807. Cybenko, G., 1989: Approximation by superposition of a sigmoidal function. Math. Control, Signal, Syst., 2, 303–314. Daley, R., 1991: Atmospheric Data Analysis. Cambridge Atmospheric and Space Science Series, Cambridge University Press, 457 pp. Elsner, J. B., and A. A. Tsonis, 1992: Nonlinear prediction, chaos and noise. Bull. Amer. Meteor. Soc., 73, 49–60. Evensen, G., 1997: Advanced data assimilation for strongly nonlinear dynamics. Mon. Wea. Rev., 125, 1342–1354. Gauthier, P., 1992: Chaos and quadric-dimensional data assimilation: A study based on the Lorenz model. Tellus, 44A, 2–17. , C. Charette, L. Fillion, P. Koclas, and S. Laroche, 1999: Im- plementation of a 3D variational data assimilation system at the Canadian Meteorological Centre. Part I: The global analysis. Atmos.–Ocean, 37, 103–156. Ghil, M., and P. Malanotte-Rizzoli, 1991: Data assimilation in me- teorology and oceanography. Advances in Geophysics, Vol. 33, Academic Press, 141–265. Giering, R., and T. Kaminski, 1998: Recipes for adjoint code con- struction. ACM TOMS, 24 (4), 437–474. Hannachi, A., and B. Legras, 1995: Simulated annealing and weather regimes classification. Tellus, 47A, 955–973. Hsieh, W. W., and B. Tang, 1998: Applying neural network models to prediction and analysis in meteorology and oceanography. Bull. Amer. Meteor. Soc., 79, 1855–1870. Ji, M., D. W. Behringer, and A. Leetmaa, 1998: An improved coupled model for ENSO prediction and implications for ocean initial- ization. Part II: The coupled model. Mon. Wea. Rev., 126, 1022– 1034. Kruger, J., 1993: Simulated annealing—A tool for data assimilation into an almost steady model state. J. Phys. Oceanogr. 23, 679– 688. Le Dimet, F., and O. Talagrand, 1986: Variational algorithms for analysis and assimilation of meteorology observations: Theo- retical aspects. Tellus, 38A, 97–110. Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141. Lu, J., and W. W. Hsieh, 1998: On determining initial conditions and parameters in a simple coupled model by adjoint data assimi- lation. Tellus, 50A, 534–544. Masters, T., 1995: Advanced Algorithms for Neural Networks—A C1 Source Book. John Wiley and Sons, 431 pp. Miller, R. N., M. Ghil, and F. Gauthiez, 1994: Advanced data assim- ilation in strongly nonlinear dynamical systems. J. Atmos. Sci., 51, 1037–1056. Navon, I. M., 1998: Practical and theoretical aspects of adjoint pa- rameter estimation and identifiability in meteorology and ocean- ography. Dyn. Atmos. Oceans, 27, 55–79. , X. Zhou, J. Derber, and J. Sela, 1992: Variational data assim- ilation with an adiabatic version of the NMC spectral model. Mon. Wea. Rev., 120, 1433–1446. Palmer, T. N., 1993: Extended-range atmospheric prediction and the Lorenz model. Bull. Amer. Meteor. Soc., 74, 49–65. Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 1992: Numerical Recipes in Fortran. 2d ed. Cambridge Uni- versity Press, 963 pp. Syu, H. H., J. D. Neelin, and D. Gutzler, 1995: Seasonal and inter- annual variability in a hybrid coupled GCM. J. Climate, 8, 2121– 2143. Talagrand, O., 1991: The use of adjoint equations in numerical mod- elling of the atmospheric circulation. Automatic Differentiation of Algorithms: Theory, Implementation and Application, A. Grie- wank and G. F. Corliess, Eds., SIAM, 169–180, Tang, Y., W. W. Hsieh, B. Tang, and K. Haines, 2001: A neural network atmospheric model for hybrid coupled modeling. Cli- mate Dyn., in press. Tziperman, E., and W. C. Thacker, 1989: An optimal control/adjoint equations approach to studying the oceanic general circulation. J. Phys. Oceanogr., 19, 1471–1485. Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp. Zhu, Y., and I. M. Navon, 1999: Impact of key parameters estimation on the performance of the FSU spectral model using the full physics adjoint. Mon. Wea. Rev., 127, 1497–1517."@en ; edm:hasType "Article"@en ; edm:isShownAt "10.14288/1.0041820"@en ; dcterms:language "eng"@en ; ns0:peerReviewStatus "Reviewed"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "American Meteorological Society"@en ; ns0:publisherDOI "10.1175/1520-0493(2001)129<0818:CNNTID>2.0.CO;2"@en ; dcterms:rights "Attribution-NonCommercial-NoDerivatives 4.0 International"@en ; ns0:rightsURI "http://creativecommons.org/licenses/by-nc-nd/4.0/"@en ; ns0:scholarLevel "Faculty"@en ; dcterms:title "Coupling Neural Networks to Incomplete Dynamical Systems via Variational Data Assimilation."@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/33323"@en .