818 MONTHLY WEATHER REVIEW VOLUME 129 Coupling Neural Networks to Incomplete Dynamical Systems via Variational Data Assimilation YOUMIN TANG AND WILLIAM W. HSIEH Oceanography, Department of Earth and Ocean Sciences, University of British Columbia, Vancouver, British Columbia, Canada (Manuscript received 16 March 2000, in final form 30 August 2000) ABSTRACT The advent of the feed-forward neural network (N) model opens the possibility of hybrid neural–dynamical models via variational data assimilation. Such a hybrid model may be used in situations where some variables, difficult to model dynamically, have sufficient data for modeling them empirically with an N. This idea of using an N to replace missing dynamical equations is tested with the Lorenz three-component nonlinear system, where one of the three Lorenz equations is replaced by an N equation. In several experiments, the 4DVAR assimilation approach is used to estimate 1) the N model parameters (26 parameters), 2) two dynamical parameters and three initial conditions for the hybrid model, and 3) the dynamical parameters, initial conditions, and the N parameters (28 parameters plus three initial conditions). Two cases of the Lorenz model—(i) the weakly nonlinear case of quasiperiodic oscillations, and (ii) the highly nonlinear, chaotic case—were chosen to test the forecast skills of the hybrid model. Numerical experiments showed that for the weakly nonlinear case, the hybrid model can be very successful, with forecast skills similar to the original Lorenz model. For the highly nonlinear case, the hybrid model could produce reasonable predictions for at least one cycle of oscillation for most experiments, although poor results were obtained for some experiments. In these failed experiments, the data used for assimilation were often located on one wing of the Lorenz butterfly-shaped attractor, while the system moved to the second wing during the forecast period. The forecasts failed as the model had never been trained with data from the second wing. 1. Introduction Numerical models have been widely used to simulate dynamical systems such as the atmosphere and the ocean. However, the physics in these models is usually incomplete, and some empirical approach is needed to patch up the missing physics. The first example is the inability of a numerical model, with its finite resolution, to represent subgrid-scale physical processes, thereby forcing numerical modelers to adopt parameterization schemes for these processes. A second example arises from the fact that some variables of interest are not variables in the numerical model, for example, precipitation and surface air temperature (which is generally not equivalent to the temperature at the lowest level of an atmospheric model). Often multiple linear regression (MLR) is used to empirically relate the model variables to the variables of interest, via schemes such as perfect prog and model output statistics (Wilks 1995). A third example arises in situations where replacing the physical Corresponding author address: William W. Hsieh, Oceanography/ EOS, University of British Columbia, Vancouver, BC V6T 1Z4, Canada. E-mail: william@ocgy.ubc.ca ᭧ 2001 American Meteorological Society equations by empirical ones results in large computational savings (and greater stability): in the equatorial Pacific, a simple empirical atmospheric model can be coupled to a dynamical ocean model, to form a hybrid coupled model. Here the wind stress is empirically estimated from the ocean variables either by a linear statistical method, such as MLR (Barnett et al. 1993) or singular value decomposition (Syu et al. 1995), or by a neural network (Tang et al. 2001). While it is an attractive idea to replace missing dynamical equations with empirical equations, there are serious technical problems; for example, the dynamical equations are generally nonlinear, while the standard statistical methods are usually linear, hence not capable of simulating the evolution of nonlinear dynamical equations for extended periods. Neural networks (N) have been known to be capable of simulating any nonlinear function, given a large enough network (Cybenko 1989). Recent advances in N modeling have led to new techniques capable of nonlinearly generalizing the classical multivariate methods such as MLR, principal component analysis, and canonical correlation analysis (Hsieh and Tang 1998). As Ns originated from the field of artificial intelligence and robotics, it is tempting to ponder whether Ns can benefit dynamical systems with APRIL 2001 TANG AND HSIEH 819 FIG. 1. The Lorenz system integrated over 15 000 time steps for (left) the weakly nonlinear case and (right) the highly nonlinear case. incomplete physics the way robotic limbs have helped prosthetics. More specifically, could N equations be used to simulate missing nonlinear dynamical equations? How to effectively couple an N model to a dynamical model is a major challenge. Variational data assimilation, especially via the adjoint approach, has become popular in assimilating data into numerical models (Daley 1991; Ghil and Malanotte-Rizzoli 1991; Bennett 1992; Navon 1998). The method is commonly used to optimally estimate model parameters or initial conditions, and is being implemented in operational weather and climate prediction models (e.g., Zhu and Navon 1999; Gauthier et al. 1999; Courtier et al. 1998; Ji et al. 1998). Hsieh and Tang (1998) noted that Ns could be formulated as a variational assimilation problem, and suggested that Ns and dynamical models might be combined most naturally via a variational assimilation approach. The objective of this paper is to show how an N model can be coupled to a dynamical model (with incomplete physics) via variational assimilation. The chosen dynamical model is the simple Lorenz (1963) model, with three variables, arising from the Fourier truncation of the Rayleigh–Be´nard equations describing atmospheric convection. In the field of data assimilation, the celebrated Lorenz model has served as a test bed for ex- amining the properties of various data assimilation methods (Gauthier 1992; Miller et al. 1994; Evensen 1997) as the Lorenz model shares many common features with the atmospheric circulation and climate system in terms of variability and predictability (Palmer 1993). By adjusting the model parameters that control the nonlinearity of the system, the model can be used to simulate near-regular oscillations or highly nonlinear fluctuations. This paper is structured as follows: section 2 describes the hybrid Lorenz model, while section 3 shows some simple experiments involving the hybrid model. Section 4 gives a general formulation for coupling an N to a dynamical model via variational data assimilation. Section 5 uses variation assimilation to estimate the N parameters in the hybrid Lorenz model. Section 6 studies not only estimating the N parameters, but also retrieving the dynamical model parameters and initial conditions. 2. Hybrid neural–dynamical Lorenz model The nondimensionalized Lorenz (1963) nonlinear system of three differential equations are 820 MONTHLY WEATHER REVIEW VOLUME 129 FIG. 2. The ensemble mean of 25 Ns (dot–dash curve) for approximating the third Lorenz equation (solid curve) for (left) the weakly nonlinear case and (right) the highly nonlinear case. (a) and (b) The training period of 3000 time steps, and (c) and (d) the test period of 1000 time steps. dX ϭ ϪaX ϩ aY, dt (1) dY ϭ ϪXZ ϩ bX Ϫ Y, dt (2) dZ ϭ XY Ϫ cZ, dt (3) where variables X, Y, and Z are related to the intensity of convective motion, and the temperature gradients in the horizontal and vertical directions, respectively. The parameters a, b, and c will be referred to as dynamical parameters, in contrast to the empirical parameters of the N model. The so-called observed data or true data are from the integration of the Lorenz equations (1)–(3) over 15 000 time steps at a step size h of 0.001, using a fourth-order Runge–Kutta integration scheme. As this system is very sensitive to changes in the initial conditions and model parameters, two cases are studied (Fig. 1). The first case, called the weakly nonlinear case, with the parameters a, b, and c set to 10, 28, and 8/3, respectively, and initial conditions for (X, Y, and Z) to (Ϫ9.42, Ϫ9.43, 28.3) (as in Gauthier 1992), displays near-regular oscillations with a gradually increasing amplitude in the devised integration period. The other case is the highly nonlinear case, with a, b, and c set to 16.0, 120.1, and 4.0, respectively (as in Elsner and Tsonis 1992), and initial conditions to (22.8, 35.7, 114.9). Next we assume the third Lorenz equation is unavailable, and we must approximate it empirically with an N equation. Our hybrid model thus consists of dX ϭ ϪaX ϩ aY, dt (4) dY ϭ ϪXZ ϩ bX Ϫ Y, dt (5) dZ ϭ N (X t , Yt , Z t ), dt (6) where N is a feed-forward N model (Hsieh and Tang 1998). The three input neurons X t , Y t , and Z t are (also denoted by i , i ϭ 1, 2, 3), inputting the values of X, Y, and Z at time t, and the single output neuron is dZ/dt. More details on the N model are given in appendix A. APRIL 2001 TANG AND HSIEH 821 FIG. 3. The hybrid model (dot–dash curve) and the true Lorenz model (solid curve) integrated from identical initial conditions for (left) the weakly nonlinear case and (right) the highly nonlinear case. 3. Simple hybrid model experiments A simple-minded approach to the missing third Lorenz equation is to use data to get the N Eq. (6), and then integrate Eqs. (4)–(6), with (6) as a replacement for the missing Lorenz equation. The N has X, Y, and Z as the inputs and (Z tϩ1 Ϫ Z t )/h as the target during training [with the optimization minimizing the mean square error between the target and the model output], so the model output is approximately dZ/dt. The optimization of the N model was determined over a training period of 3000 time steps, and the following period of 1000 time steps was used to independently test the N model skills in estimating dZ/dt. The appropriate number of neurons in the hidden layer is chosen based on the trade-off between underfitting (too few neurons) and overfitting (too many neurons) (Hsieh and Tang 1998). After trying models with different numbers of hidden neurons, we found that five hidden neurons yielded the best skills over the test period. To alleviate the instability problems associated with N modelling (Hsieh and Tang 1998), an ensemble of 25 identical Ns were trained with random initial weights and biases, with the outputs from the 25 Ns used to provide an ensemble mean output. The skills of the ensemble mean generally exceeded the skills of an in- dividual member (Hsieh and Tang 1998). The ensemble mean N simulations of dZ/dt for both the weakly nonlinear and the highly nonlinear cases during the training period and the test period are shown in Fig. 2, where the N gave good simulations of dZ/dt, although the oscillation peaks were underestimated. Now that the third Lorenz equation seems to be reasonably approximated by the N Eq. (6), we proceed to integrate the hybrid system (4)–(6) from perfect initial conditions and compare the output with respect to that from the true Lorenz system (1)–(3). This simple-minded hybrid approach worked reasonably well in simulating the true Lorenz system for the weakly nonlinear case (Fig. 3), but failed completely for the highly nonlinear case, where the hybrid model yielded X and Y values that were off by orders of magnitude. The highly nonlinear Lorenz system is known for its extremely high sensitivity to perturbations. Clearly, the N approximation to the third Lorenz equation introduced significant errors in Z, which quickly caused the X and Y values to deviate far off course. The defect of this simple approach lies in the fact that the optimization of the N was achieved without imposing the dynamical constraints of (4) and (5). A better approach is variational assimilation, where the dynamical constraints are imposed. 822 MONTHLY WEATHER REVIEW VOLUME 129 4. The hybrid model using variational data assimilation The scalar equations (4)–(6) can be expressed as a vector equation: dv ϭ f (v, p, t), dt (7) where v is the vector denoting (X, Y, Z) and p is the parameter vector (which could contain the parameters of the N or of the dynamical equations). The variational assimilation method involves minimizing a quadratic cost function J subject to model constraint (7), where J is defined as J(v, p, v0 ) ϭ ͵ T D(v, t) dt and (8) 0 D ϭ (v Ϫ vobs ) T WϪ1 (v Ϫ vobs ), FIG. 4. Schematic diagram for (a) the SC, (b) NC, and (c) PSC assimilation schemes. The solid curves are the model trajectory; dots, the observed values; and dash lines, the model errors to be minimized. Here a model trajectory starts from an observed value. (9) where the subscript obs denotes the observed values, the superscript T the transpose, T the length of the assimilation window (also called training period sometimes), v 0 the initial conditions, and W the covariance matrix of the measurement errors, assumed here to be diagonal. FIG. 5. The NC assimilation results, with the model training results shown in the left panels, and predictions over the test period in the right panels, where the solid curves show the true data and dot–dash curves the outputs of the hybrid model. The dot–dash curves overlap the solid curves in the left panels. APRIL 2001 823 TANG AND HSIEH FIG. 6. The PSC assimilation results with AS ϭ 200 time steps. (left) The model training results, and (right) model predictions, with solid curves for the true data and dot–dash curves for the outputs of the hybrid model. Introduce the Lagrange function L, LϭJϩ ͵ T 0 v*T dt Ϫ f dt, dv adjoint variables). After integration by parts, L becomes ͵[ T ١v D T␦ v Ϫ 0 [ dv* ץf ϭ v*(t) Ϫ ١v D, dt ץv v*(T ) ϭ 0. DϪ (13) (14) According to (13) and (14), the formulas for computing the gradients of J with respect to p and v 0 can T T ] dt. ] dv* ץf ץf ␦v ϩ v*␦ v ϩ v*␦ p dt. dt ץv ץp The hybrid adjoint model can be obtained by letting ␦L/␦v ϭ 0: Ϫ dv*T v ϩ v*T f dt 0 The first-order variation of L is L ϭ (v*T v) 0T ϩ where v*(t) is a vector of Lagrange multipliers (or ␦ L ϭ [v*T␦ v] 0T ϩ ͵[ T (10) T (11) (12) be obtained by differentiating (12) with respect to these unknowns (Lu and Hsieh 1998) ␦J ϭϪ ␦p ͵ T 0 ץf v* dt and ץp ␦J ϭ Ϫv*(0). ␦ v0 T (15) (16) 824 MONTHLY WEATHER REVIEW VOLUME 129 FIG. 7. Same as Fig. 6 but for AS ϭ 1000 time steps. Details of the variational assimilation are given in appendix B. As the above assimilation process, also referred to as the training process, is completely subject to the model (4)–(6) after the initial guesses are given, it is called the strong continuity constraint assimilation scheme (SC). 5. Determining N parameters via variational assimilation In this section, we determine the N parameters, that is, the weights and biases, in the hybrid model by variational data assimilation. An N with five hidden neurons is used. Preliminary experiments are done in sections 5a and 5b to determine the best first guesses and assimilation window sizes, respectively, before the main experiments are run in section 5c. In this section, all experiments were performed with perfect initial conditions and dynamical parameters a and b. a. The initial guesses Whether an assimilation process is successful depends greatly on the choice of the first guesses. To help in getting good initial guesses, two assimilation schemes are introduced in addition to the SC scheme. Under SC, at each step of the integration, the initial conditions are the model outputs from the previous step (Fig. 4a). In the no-continuity constraint assimilation scheme (NC), at each integration step, the initial conditions are the observed data, rather than the model output from the previous step (Fig. 4b). Between these two extremes, we can have the partially strong continuity constraint assimilation scheme (PSC) (Fig. 4c), where the assimilation window T is divided into smaller assimilation segments (ASs). Within each AS, integration at each step uses the model output from the previous step as the initial conditions. Only at the start of the AS are the data used directly as the initial conditions. Both SC and NC can be regarded as special cases of PSC where the AS is T in SC and one time step in NC. The reason for introducing NC and PSC is that using SC without good initial guesses generally does not lead to successful assimilation over a long assimilation window. Chopping the window into shorter ASs means that the dynamic model constraints are no longer imposed over a long period, thereby making the nonlinear optimization a much easier task. Our strategy is to divide the assimilation process into two stages: (a) use less demanding schemes such as NC and PSC to obtain reasonable parameters estimates, which will then be used APRIL 2001 TANG AND HSIEH as initial guesses in (b) the full variational assimilation under SC. For the weakly nonlinear case, the NC assimilation scheme was first applied to the hybrid model (4)–(6), with initial guesses for the N parameters taken randomly from one of the 25 Ns in section 3. The assimilation window is 3000 time steps. The NC scheme can produce good results during the training period but poor skill during the test period (Fig. 5). Since the AS is very short (one time step) in the NC scheme, the initial values at each integration step vastly outweigh the importance of the dynamical constraints, hence the poor skill during the test period. However the result from the NC can provide reasonable first guesses for the following PSC. A series of PSC assimilations with AS ϭ 100, 200, 500, and 1000 time steps was further performed with the hybrid model (in the weakly nonlinear case) to gradually improve the first guesses to be used in the stage b SC assimilation. A longer AS increases the importance of dynamical constraints relative to initial conditions in influencing the model outputs. The first guesses of the N parameters in a PSC experiment are the parameters obtained from the previous PSC experiment with the shorter AS. The first PSC experiment (not shown) with AS ϭ 100 time steps uses the results from the NC case (i.e., AS ϭ one time step) for its first guesses. The PSC assimilation over the training period, that is, a window of 3000 time steps, with AS ϭ 200 time steps (Fig. 6), and AS ϭ 1000 time steps (Fig. 7), reveals that the test skills improved with the increase of the AS. The parameters estimated from the PSC experiment with ASϭ1000 time steps will serve as the first guess for the stage b SC assimilation (in the next section) for the weakly nonlinear case, where the assimilation window is also taken to be 1000 time steps. This approach of using a series of PSC assimilations to improve on the parameter estimates to be used as initial guesses in the stage b SC assimilation did not work for the highly nonlinear case. Unlike the weakly nonlinear case, the first guesses from PSC, if found, is only effective for a specific experimental configuration (e.g., AS, window length, etc.). Hence, for the highly nonlinear case, the parameters estimated from the NC assimilation over 3000 time steps are directly used as the first guesses for the stage b SC assimilation. b. Impact of assimilation window in the highly nonlinear case The greatest difficulty in variational assimilation, namely the presence of multiple minima in the cost function arising from the nonlinearity of the problem, often renders the problem intractable. The number of local minima dramatically increases with the window T for a strongly nonlinear dynamical system. Gauthier (1992) and Miller et al. (1994) have shown with the Lorenz model that variational assimilation is effective only for sufficiently short windows. In the weakly non- 825 FIG. 8. Correlation skills (averaged over 10 experiments) for the variables X, Y, and Z at various assimilation windows (T ϭ 100, 300, 500, and 800 time steps) during (a) the training period, and (b) the test period. linear case (section 5a), T did not have a major impact; hence, the following discussion will only focus on the highly nonlinear case. To study the impact of the window, SC assimilation experiments were performed, where T was varied between 100, 300, 500, and 800 time steps. For each T, 10 experiments were performed with the same first guesses, but the assimilation periods were shifted by 100 time steps between one experiment and the next. The effect of T on the assimilation is different for the training period (Fig. 8a) and the test period (Fig. 8b). With a short window (T ϭ 100 time steps), the correlation skills were excellent for all three variables during training, but yielded no prediction skills over the following test period of 300 time steps. As T increased to 300 time steps and then 500 time steps, there were increases in the prediction skills (Fig. 8b), but as T increased to 800, the prediction skill dropped sharply, as the problem with local minima in the cost function wors- 826 MONTHLY WEATHER REVIEW VOLUME 129 FIG. 10. Bar charts from 100 assimilation experiments for the highly nonlinear case showing the number of experiments with (a) correlation above a particular correlation level and (b) REE below a particular REE level, during the test period. ened with increasing T. Hence in the following SC assimilation experiments, the window will be 500 time steps for the highly nonlinear case, and 1000 time steps for the weakly nonlinear case. Following the training (assimilation) period, there is a test period of 300 time steps for the highly nonlinear case, and 1000 time steps for the weakly nonlinear case. c. SC assimilation We now perform the main SC assimilation experiments, repeated 100 times, to examine the skills of the ← FIG. 9. (a) Correlation skills and (b) REE for the variables X, Y, and Z during the training and test periods (averaged over 100 experiments), for the weakly nonlinear case, where the SC assimilation retrieves the N parameters. The error bars show the standard deviation. Results from the simple hybrid model without data assimilation (section 3) are also shown for comparison. APRIL 2001 TANG AND HSIEH 827 FIG. 11. A typical experiment with class 1 results for the highly nonlinear case, showing (a) the fitting during the training period (left panels) and (b) the forecasting during the testing period (right panels). Solid curves denote the true value, while dot–dash curves denote model results. hybrid model for the weakly nonlinear and highly nonlinear cases. The choices of windows and first guesses are those found in sections 5a and 5b. The 100 experiments are implemented with their assimilation periods shifted by 100 time steps from one experiment to the other. For the weakly nonlinear case, the average correlation between model and data (over 100 experiments) exceeds 0.96 for each of X, Y, and Z during both the training period and the test period (Fig. 9a). The relative estimate error [REE, i.e., ⌺ (estimated value-true value) 2 /⌺ (true value) 2 ] is very small, less than 0.004 (Fig. 9b). Thus, via variational assimilation, a hybrid model has successfully reconstructed the Lorenz model in the weakly nonlinear case. For comparison, the results of the simple hybrid model without data assimilation from section 3 are also show in Fig. 9. The improvement due to variational data assimilation is dramatic for X and Y. In contrast, the highly nonlinear case has much lower skill. During training, the skill was good—80% of the experiments had correlation above 0.75, and 60% had REEs under 0.05 (not shown). However, during the test period, the correlation skill and REE attained were much poorer (Fig. 10). Generally, for the highly nonlinear case, there were three classes of assimilation results found in the 100 experiments. A typical example of the class 1 results is shown in Fig. 11, showing successful assimilation. About 15% of the total experiments is of this class. Almost 60% of the total experiments belongs to a class 2 of moderately successful assimilation, as illustrated by the example in Fig. 12, where during the test period, the forecast skills were fairly good for the first 200 time steps. The remaining 25%–30% of the experiments belonged to a class 3 of failed forecasting, as illustrated in Fig. 13, where despite excellent fitting during the training period, the forecast results were totally wrong. To see why there is a drastic difference between class 1 and class 3 behavior, we plotted the system trajectories of the two cases in the (X, Y, Z) space (Fig. 14). The Lorenz attractor for the highly nonlinear case is known to have a ‘‘butterfly’’ shape with two wings. In the class 1 experiment, both wings of the Lorenz attractor were covered by the training data trajectory. In contrast, with the class 3 example, the training data trajectory resided in one wing. The system then evolved to the other wing during the testing period. As the model had never been trained with data from the second wing, it obviously 828 MONTHLY WEATHER REVIEW VOLUME 129 FIG. 12. Same as Fig. 11 but for a typical experiment with class 2 (i.e., moderately successful forecast) results. could not forecast properly in that regime, hence the disastrously poor forecast results found in Fig. 13. Thus the bimodality of the highly nonlinear Lorenz system created cases where the training data covered only one wing of the attractor, and the model thus trained had no capability of forecasting the transition to the other wing. One would be tempted to think that the problem could be corrected by simply using longer training periods, so that the full Lorenz attractor would have been learned by the model. Unfortunately, a longer training period greatly increases the difficulty in the nonlinear optimization due to multiple local minima in the cost function, leading eventually to the failure of the variational assimilation method. 6. Determining dynamical parameters and initial conditions by the hybrid model In variational data assimilation, parameters and initial conditions can be jointly estimated (e.g., Le Dimet and Talagrand 1986; Tziperman and Thacker 1989; Lu and Hsieh 1998; Zhu and Navon 1999). We next conducted an experiment, for the weakly nonlinear case, to estimate simultaneously the dynamical parameters a and b, and the initial conditions of X, Y, and Z in the hybrid Lorenz system (4)–(6). The N from section 5a (i.e., PSC assimilation with AS ϭ 1000 time steps) was used. For the initial guesses, a and b and the initial conditions were all scaled down by 10% from their true values. The result of retrieving the two parameters and three initial conditions by the hybrid model with an assimilation period of 3000 steps under SC showed the retrieval to be very good, as the REEs were all of the order of 10Ϫ3–10Ϫ4 (not shown). Next the true values were scaled down by 20% to provide the initial guesses, and the retrieved results were still generally good (Fig. 15). But when the initial values were the true values scaled down by 30%, the retrieval nearly failed (not shown). Next, we had the N parameters retrieved as well during the assimilation; that is, we conducted 100 additional SC experiments to estimate simultaneously the N parameters, the dynamical parameters a and b, and the initial conditions of X, Y, and Z, for the weakly nonlinear case. The first guesses for the initial conditions and the parameters a and b were simply the true values scaled down by 10%. The choice of the first guesses of the N model parameters, cost function, and assimilation window were the same as those discussed in the previous section. APRIL 2001 TANG AND HSIEH 829 FIG. 13. Same as Fig. 11 but for a typical experiment with class 3 (i.e., unsuccessful forecast) results. The average correlation skills over the 100 experiments are shown in Fig. 16, while the average REEs are all around 0.05 (not shown). The correlation and REE were generally poorer than those in Fig. 9, where only the N parameters were retrieved. This agrees with many studies that found that the assimilation quality deteriorated while retrieving more model unknowns (Lu and Hsieh 1998), as the problem of multiple minima in the cost function could become worse when retrieving more model unknowns. Likewise, 100 experiments for the highly nonlinear case were also carried out to simultaneously estimate the N parameters, the dynamical parameters a and b, and the initial conditions. The first guesses were the true values scaled down by 10%, and the first guesses of the N model, the cost function, and assimilation window were still taken from section 5. While the skills in the training period were as good as those retrieving the N parameters only (not shown), the skills during the testing period (Fig. 17) were considerably lower than those found when only N parameters were retrieved (Fig. 10). Class 1 examples like Fig. 11 can hardly be found now. 7. Summary and discussion In this study, a hybrid neural–dynamical variational data assimilation procedure aimed at simulating missing dynamical equations was formulated. This procedure was then applied to the Lorenz model, where the third dynamical equation of the model was missing and had to be simulated empirically by an N equation. First guesses in variational data assimilation can be vital to retrieving the model parameters and initial conditions successfully. In order to get reasonable first guesses, the no continuity constraint and partial strong continuity constraint schemes were proposed in this study. Experiments show that such a treatment of choosing first guesses is effective in the weakly nonlinear case, which allows the cost function to be successfully optimized in almost all experiments under strong continuity constraint. For the highly nonlinear case, the NC scheme provides reasonable first guesses for the SC experiments. The length of the assimilation window is very significant for the highly nonlinear case. When the window is either too short or too long, the prediction skills are poor. This strict demand for an appropriate assimilating window results from a delicate balance between the need for more data (longer window) and the avoidance of severe multiple minima during optimization with the highly nonlinear model, which calls for a short window. Our numerical experiments showed that the hybrid model based on N and variational assimilation is suc- 830 MONTHLY WEATHER REVIEW VOLUME 129 FIG. 14. The system trajectories in the (X, Y, Z ) space during the training period (thin curves) and the testing period (thick curves) for (a) the class 1 experiment and (b) the class 3 experiment. cessful in simulating the weakly nonlinear Lorenz model, with forecast skills similar to the original Lorenz model. For a highly nonlinear Lorenz system, the hybrid model can also produce reasonable skills of simulations and predictions for most experiments, although the assimilation basically failed for a fair portion of the ex- FIG. 15. The results from retrieving the dynamical parameters a and b, and the initial conditions of the hybrid model, where the initial guesses were the true values scaled down by 20%. FIG. 16. The correlation skill averaged over 100 experiments where the N parameters, the dynamical parameters a and b, and the initial conditions were retrieved for the weakly nonlinear case. Error bars indicate the standard deviation. APRIL 2001 831 TANG AND HSIEH Acknowledgments. The authors are grateful to Dr. Benyang Tang, Dr. Jingxi Lu, Dr. Adam Monahan, and Yuval for helpful discussion, and to Dr. Ralf Giering for his TAMC compiler for adjoint codes. This work was supported by research and strategies grants to W. Hsieh from the Natural Sciences and Engineering Research Council of Canada. APPENDIX A Neural Network Model A feed-forward neural network is a nonparametric statistical model for extracting nonlinear relations in the data. A common N model configuration is to place between the input and output variables (also called ‘‘neurons’’), a layer of ‘‘hidden neurons’’ (Fig. A1). The value of the jth hidden neuron is w x ϩ b , yj ϭ tanh ij i j (A1) i where x i is the ith input, w ij the weight parameters, and b j the bias parameters. The hyperbolic tangent function is used as the transfer function (Bishop 1995, p. 127). The output neuron is given by zϭ w˜ y ϩ b˜ . j j (A2) j A cost function FIG. 17. Bar charts from 100 assimilation experiments for the highly nonlinear case showing the number of experiments with (a) correlation above a particular correlation level and (b) REE below a particular REE level, during the test period. The N parameters, the dynamical parameters a and b, and the initial conditions were retrieved. periments. Deterministic optimization algorithms often have difficulties reaching the global minimum in the cost function, due to the presence of local minima. It has been shown that the variational assimilation with deterministic optimization algorithms often loses its power for strongly nonlinear models (Miller et al. 1994; Evensen 1997). Stochastic optimization methods such as simulated annealing may help (Kruger 1993; Hannachi and Legras 1995). However, even if the global minimum is found, the data in the assimilation window might still only consist of data from one wing of the butterfly-shaped attractor, whereas the data in the forecasting test period could lie in the second wing, thereby resulting in failed forecasts. Alternatively, other assimilation approaches, for example, the extended Kalman filter, could be more powerful in dealing with strongly nonlinear systems (Miller et al. 1994; Evensen 1997). The possibility of an N coupled with a dynamical model with an extended Kalman filter is being investigated. FIG. A1. An example of a neural network model, where there are three neurons in the input layer, five in the hidden layer, and one in the output layer. The parameters w ij and w˜ j are the weights, and b j and b˜ are the biases. The parameters b j and b˜ can also be regarded as the weights for constant inputs of value 1. 832 MONTHLY WEATHER REVIEW J ϭ ͗(z Ϫ zobs ) 2 ͘ (A3) measures the mean square error between the model output z and the observed values zobs . The parameters w ij , w˜ j , b j , and b˜ are adjusted as the cost function is minimized by the optimization algorithm of Levenberg–Marquardt (Masters 1995), without any constraints imposed. The procedure, known as network training, yields the optimal parameters for the network. The random number generator in MATLAB, which generates uniformly distributed random numbers on the interval (0.0, 1.0), was used to initialize these parameters. For an N with m1 inputs and m 2 hidden neurons, the number of model parameters is m1 ϫ m 2 ϩ 2 ϫ m 2 ϩ 1. In this paper, we set the number of hidden neurons to 5, so the N has 26 parameters. The above is a brief description of a traditional feedforward N algorithm, used in section 3. In later sections, the optimization will be constrained by the dynamical equations. APPENDIX B Discrete Form of the Hybrid Model for Coding The computer code is directly generated from the discrete form of the hybrid model. The fourth-order Runge– Kutta is used to discretize the vector equation 7: VOLUME 129 where v* is the vector of Lagrange multipliers and its dimension is the same that of v. The equations for the best fit are obtained by requiring that all derivatives of L vanish: ץL ϭ 0, ץv*n (B12) ץL ϭ 0, ץvn (B13) ץL ϭ 0, ץp (B14) where the partial derivative with respect to a vector indicates a partial derivative with respect to each of the vector components. Equation (B12) gives nothing more than (B9), whereas (B10) gives ץL ϭ ץp Ϫ ץKץp v*. NϪ1 n The gradients from (B15) are then provided to a quasi-Newton method to seek the optimal parameters. The equations describing the trajectory of the Lagrange multipliers v*n , that is, the adjoint equations, can be obtained from ץL/ץv n and (B10) (also see Anderson and Willebrand 1989): N vnϩ1 k1 k2 k3 k4 ϭ vn ϩ n ϩ n ϩ n ϩ n , 6 3 3 6 (B1) 2 WϪ1 (vn Ϫ vnobs ) ϩ v* nϪ1 Ϫ v* n Ϫ nϭ1 ץv v* ץK n n n k1n ϭ hf (vn , p, t n ), (B2) ϭ 0, k2 n ϭ hf (vn ϩ 0.5k1n , p, t n ϩ 0.5h), (B3) (B17) k3n ϭ hf (vn ϩ 0.5k2 n , p, t n ϩ 0.5h), (B4) ץL ϭ Ϫv*, 0 ץv0 k4n ϭ hf (vn ϩ k3n , p, t n ϩ h), v*N ϭ 0. (B18) (B5) where the subscript n indicates the time level, f ϭ (f1, f2, f3)T , and f1, f2, and f3 are scalar functions: f1 ϭ ϪaX ϩ aY, (B6) f2 ϭ ϪXZ ϩ bX Ϫ Y, w˜ [tanh(w X ϩ w Y ϩ w Z ϩ b )] (B7) ϩ b˜ , (B8) f3 ϭ (B15) n nϭ0 j 1j 2j 3j j j j ϭ 1, 2, . . . , M (M ϭ 5, the number of hidden neurons). Equation (B1) can be written as v nϩ1 ϭ v n ϩ K n . (B9) The Lagrange function of Eq. (10) is discretized as v* (v NϪ1 LϭJϩ T n nϩ1 Ϫ vn Ϫ K n ), (B10) nϭ0 (v Ϫ v N Jϭ n nϭ1 ) WϪ1 (vn Ϫ vnobs ), obs T n (B11) n ϭ 1, . . . , N, (B16) The ץK n /ץv n and ץK n /ץp can be obtained by chain rule; for example, ץk n 1 ץk1n 1 ץk2 n ץk1n 1 ץk3n ץk2 n ץk1n ϭ ϩ ϩ ץvn 6 ץvn 3 ץk1n ץvn 3 ץk2 n ץk1n ץvn ϩ 1 ץk4n ץk3n ץk2 n ץk1n . 6 ץk3n ץk2 n ץk1n ץvn Thus, the cycling procedure for computing the best fit of the hybrid model (4)–(6) to Lorenz model (1)–(3) is as follows: 1) Given initial guess values for the unknown parameters, use (B9) to step the hybrid model forward in time from 1 to N to compute a first approximation to the Lorenz model. 2) Step equations (B16)–(B18) backward in time from N to 1 to compute approximate values of the Lagrange multipliers. 3) Evaluate the gradient of the cost function with respect to these parameters using (B15) [and/or Eq. (B17) if initial conditions need to be retrieved]. APRIL 2001 833 TANG AND HSIEH FIG. B1. Variations of the normalized cost function (J/J 0 ) (solid line) and normalized gradient (g g 0Ϫ1 ) (dash line) the number of iterations for (a) the weakly nonlinear case of Fig. 7 and (b) the strongly nonlinear case of Fig. 11. The better initial guesses in (a) meant that the cost function and its gradients had relatively less distance to drop to convergence than in (b). 4) Use a descent algorithm to find the minimum of the cost function in the direction opposite to the gradient by a line search. 5) Take results of the line search as the next guesses for the parameters and repeat the process starting from step 1 until convergence criteria are satisfied. (B16) to (B18), but by the Tangent and Adjoint Model Compiler (TAMC; Giering and Kaminski 1996) via the tangent linear equations. These equations are derived from (B9): The quasi-Newton method of Broyden–Fletcher– Goldfarb–Shanno (BFGS) (Press et al. 1992) is used in the above procedure for searching the optimal solutions. Figure B1 is the variation of the normalized cost function and normalized gradient with the number of iterations for a weakly nonlinear case (Fig. 7) and a strongly nonlinear case (Fig. 11). The BFGS algorithm has a memory requirement of O(N 2 ) where N is the number of parameters involved in the optimization. In our case, the memory is 183 KB. For problems with larger N, the conjugate gradient method with a memory requirement of O(N) could be more suitable. A typical case here requires about 50 iteration for convergence, using about 3 min of CPU time on an SGI Origin 200 server. In practice, the adjoint codes were not developed from where the matrix A n is the tangent linear model at time step n, and ␦v nϩ1 ϭ A n ␦v n , An ϭ I ϩ ץK n , ץvn (B19) (B20) with I the the identity matrix. The adjointness has been checked using the relation between the tangent linear code A and its adjoint code A* (Talagrand 1991; Navon et al. 1992): {a, Ab} ϭ {A*a, b}, (B21) where { · , · } denotes the inner product, and a and b arbitrary vectors. 834 MONTHLY WEATHER REVIEW REFERENCES Anderson, D. L. T., and J. Willebrand, Eds., 1989: Oceanic Circulation Models: Combining Data and Dynamics. Kluwer Academic Publishers, 605 pp. Barnett, T. P., M. Latif, N. E. Graham, M. Flugel, S. Pazan, and W. White, 1993: ENSO and ENSO-related predictability. Part I: Prediction of equatorial sea surface temperature with a hybrid coupled ocean–atmosphere model. J. Climate, 6, 1545–1566. Bennett, A. F., 1992: Inverse Methods in Physical Oceanography. Cambridge University Press, 346 pp. Bishop, C. M., 1995: Neural Networks for Pattern Recognition. Clarendon Press, 482 pp. Courtier, P., and Coauthors, 1998: The ECMWF implementation of three-dimensional variational assimilation (3D-VAR). I: Formulation. Quart. J. Roy. Meteor. Soc., 124, 1783–1807. Cybenko, G., 1989: Approximation by superposition of a sigmoidal function. Math. Control, Signal, Syst., 2, 303–314. Daley, R., 1991: Atmospheric Data Analysis. Cambridge Atmospheric and Space Science Series, Cambridge University Press, 457 pp. Elsner, J. B., and A. A. Tsonis, 1992: Nonlinear prediction, chaos and noise. Bull. Amer. Meteor. Soc., 73, 49–60. Evensen, G., 1997: Advanced data assimilation for strongly nonlinear dynamics. Mon. Wea. Rev., 125, 1342–1354. Gauthier, P., 1992: Chaos and quadric-dimensional data assimilation: A study based on the Lorenz model. Tellus, 44A, 2–17. , C. Charette, L. Fillion, P. Koclas, and S. Laroche, 1999: Implementation of a 3D variational data assimilation system at the Canadian Meteorological Centre. Part I: The global analysis. Atmos.–Ocean, 37, 103–156. Ghil, M., and P. Malanotte-Rizzoli, 1991: Data assimilation in meteorology and oceanography. Advances in Geophysics, Vol. 33, Academic Press, 141–265. Giering, R., and T. Kaminski, 1998: Recipes for adjoint code construction. ACM TOMS, 24 (4), 437–474. Hannachi, A., and B. Legras, 1995: Simulated annealing and weather regimes classification. Tellus, 47A, 955–973. Hsieh, W. W., and B. Tang, 1998: Applying neural network models to prediction and analysis in meteorology and oceanography. Bull. Amer. Meteor. Soc., 79, 1855–1870. Ji, M., D. W. Behringer, and A. Leetmaa, 1998: An improved coupled model for ENSO prediction and implications for ocean initialization. Part II: The coupled model. Mon. Wea. Rev., 126, 1022– 1034. Kruger, J., 1993: Simulated annealing—A tool for data assimilation VOLUME 129 into an almost steady model state. J. Phys. Oceanogr. 23, 679– 688. Le Dimet, F., and O. Talagrand, 1986: Variational algorithms for analysis and assimilation of meteorology observations: Theoretical aspects. Tellus, 38A, 97–110. Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141. Lu, J., and W. W. Hsieh, 1998: On determining initial conditions and parameters in a simple coupled model by adjoint data assimilation. Tellus, 50A, 534–544. Masters, T., 1995: Advanced Algorithms for Neural Networks—A C ϩ Source Book. John Wiley and Sons, 431 pp. Miller, R. N., M. Ghil, and F. Gauthiez, 1994: Advanced data assimilation in strongly nonlinear dynamical systems. J. Atmos. Sci., 51, 1037–1056. Navon, I. M., 1998: Practical and theoretical aspects of adjoint parameter estimation and identifiability in meteorology and oceanography. Dyn. Atmos. Oceans, 27, 55–79. , X. Zhou, J. Derber, and J. Sela, 1992: Variational data assimilation with an adiabatic version of the NMC spectral model. Mon. Wea. Rev., 120, 1433–1446. Palmer, T. N., 1993: Extended-range atmospheric prediction and the Lorenz model. Bull. Amer. Meteor. Soc., 74, 49–65. Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 1992: Numerical Recipes in Fortran. 2d ed. Cambridge University Press, 963 pp. Syu, H. H., J. D. Neelin, and D. Gutzler, 1995: Seasonal and interannual variability in a hybrid coupled GCM. J. Climate, 8, 2121– 2143. Talagrand, O., 1991: The use of adjoint equations in numerical modelling of the atmospheric circulation. Automatic Differentiation of Algorithms: Theory, Implementation and Application, A. Griewank and G. F. Corliess, Eds., SIAM, 169–180, Tang, Y., W. W. Hsieh, B. Tang, and K. Haines, 2001: A neural network atmospheric model for hybrid coupled modeling. Climate Dyn., in press. Tziperman, E., and W. C. Thacker, 1989: An optimal control/adjoint equations approach to studying the oceanic general circulation. J. Phys. Oceanogr., 19, 1471–1485. Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp. Zhu, Y., and I. M. Navon, 1999: Impact of key parameters estimation on the performance of the FSU spectral model using the full physics adjoint. Mon. Wea. Rev., 127, 1497–1517.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Faculty Research and Publications /
- Coupling Neural Networks to Incomplete Dynamical Systems...
Open Collections
UBC Faculty Research and Publications
Coupling Neural Networks to Incomplete Dynamical Systems via Variational Data Assimilation. Tang, Youmin; Hsieh, William W. Apr 30, 2001
pdf
Page Metadata
Item Metadata
Title | Coupling Neural Networks to Incomplete Dynamical Systems via Variational Data Assimilation. |
Creator |
Tang, Youmin Hsieh, William W. |
Publisher | American Meteorological Society |
Date Issued | 2001-04 |
Description | The advent of the feed-forward neural network (N) model opens the possibility of hybrid neural–dynamical models via variational data assimilation. Such a hybrid model may be used in situations where some variables, difficult to model dynamically, have sufficient data for modeling them empirically with an N. This idea of using an N to replace missing dynamical equations is tested with the Lorenz three-component nonlinear system, where one of the three Lorenz equations is replaced by an N equation. In several experiments, the 4DVAR assimilation approach is used to estimate 1) the N model parameters (26 parameters), 2) two dynamical parameters and three initial conditions for the hybrid model, and 3) the dynamical parameters, initial conditions, and the N parameters (28 parameters plus three initial conditions). Two cases of the Lorenz model—(i) the weakly nonlinear case of quasiperiodic oscillations, and (ii) the highly nonlinear, chaotic case—were chosen to test the forecast skills of the hybrid model. Numerical experiments showed that for the weakly nonlinear case, the hybrid model can be very successful, with forecast skills similar to the original Lorenz model. For the highly nonlinear case, the hybrid model could produce reasonable predictions for at least one cycle of oscillation for most experiments, although poor results were obtained for some experiments. In these failed experiments, the data used for assimilation were often located on one wing of the Lorenz butterfly-shaped attractor, while the system moved to the second wing during the forecast period. The forecasts failed as the model had never been trained with data from the second wing. Copyright 2001 American Meteorological Society (AMS). Permission to use figures, tables, and brief excerpts from this work in scientific and educational works is hereby granted provided that the source is acknowledged. Any use of material in this work that is determined to be “fair use” under Section 107 of the U.S. Copyright Act or that satisfies the conditions specified in Section 108 of the U.S. Copyright Act (17 USC §108, as revised by P.L. 94-553) does not require the AMS’s permission. Republication, systematic reproduction, posting in electronic form, such as on a web site or in a searchable database, or other uses of this material, except as exempted by the above statement, requires written permission or a license from the AMS. Additional details are provided in the AMS Copyright Policy, available on the AMS Web site located at (http://www.ametsoc.org/) or from the AMS at 617-227-2425 or copyright@ametsoc.org. |
Genre |
Article |
Type |
Text |
Language | eng |
Date Available | 2011-04-05 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0041820 |
URI | http://hdl.handle.net/2429/33323 |
Affiliation |
Science, Faculty of Earth and Ocean Sciences, Department of |
Citation | Tang, Youmin, Hsieh, William W. 2001. Coupling Neural Networks to Incomplete Dynamical Systems via Variational Data Assimilation. Monthly Weather Review 129(4) 818-834. |
Publisher DOI | 10.1175/1520-0493(2001)129<0818:CNNTID>2.0.CO;2 |
Peer Review Status | Reviewed |
Scholarly Level | Faculty |
Copyright Holder | Hsieh, William W. |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 52383-Hsieh_AMS_2001_MWR818.pdf [ 420.44kB ]
- Metadata
- JSON: 52383-1.0041820.json
- JSON-LD: 52383-1.0041820-ld.json
- RDF/XML (Pretty): 52383-1.0041820-rdf.xml
- RDF/JSON: 52383-1.0041820-rdf.json
- Turtle: 52383-1.0041820-turtle.txt
- N-Triples: 52383-1.0041820-rdf-ntriples.txt
- Original Record: 52383-1.0041820-source.json
- Full Text
- 52383-1.0041820-fulltext.txt
- Citation
- 52383-1.0041820.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52383.1-0041820/manifest