STATISTICAL MODELLING OF SEDIMENT CONCENTRATION

by

MAI PHUONG THOMPSON

B.Sc., The University of British Columbia, 1984

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (The Department of Statistics)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
August, 1987
© Mai Phuong Thompson, 1987

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Statistics
The University of British Columbia
1956 Main Mall
Vancouver, Canada
V6T 1Y3

DE-6(3/81)

ABSTRACT

One technique that is commonly used to replace the costly daily sampling of sediment concentration in assessing sediment discharge is the "rating curve" technique. This technique relies on the form of the relationship between sediment concentration and water discharge to estimate long-term sediment loads. In this study, a regression/time-series approach to modelling the relationship between sediment concentration and water discharge is developed. The model comprises a linear regression of the natural logarithm of sediment concentration on the natural logarithm of water discharge and an autoregressive time series of order one or two for the errors of the regression equation. The main inferences from the model are the estimation of annual sediment loads and the calculation of their standard errors.
Bias correction factors for the bias resulting from the inverse transformation of the natural logarithm of sediment concentration are studied. The accuracy of the load estimates is checked by comparing them to figures published by the Water Survey of Canada.

TABLE OF CONTENTS

ABSTRACT ii
TABLE OF CONTENTS iii
LIST OF TABLES v
LIST OF FIGURES vi
ACKNOWLEDGEMENT vii
CHAPTER 1 INTRODUCTION 1
1.1 Outline of Study 3
1.2 Description of Data 3
a. Sediment sampling stations 3
b. Suspended sediment concentration and load 4
c. Discharge 5
1.3 Variable Names and Descriptions 5
CHAPTER 2 REVIEW OF SOME PREVIOUS WORK ON RATING CURVES 7
2.1 Kellerhals et al.'s (1974) Consulting Report 8
2.2 Walling and Webb (1981) 8
2.3 Smillie and Koch (1984) and Ferguson (1986) 11
2.4 Church, Kellerhals and Ward's (1985) Consulting Report 12
2.5 Discussion 15
CHAPTER 3 STATISTICAL ANALYSIS AND MODELLING 18
3.0 Introduction 18
3.1 Exploratory Data Analysis for Mission Data 19
a. A basic model 20
b. The seasonal effect 20
c. Other factors 21
d. The "dog leg" effect 23
e. Outliers 25
f. Model for the errors 25
3.2 Model for Mission Data 28
a. The proposed model 28
b. Model adequacy and assumptions 28
3.3 Estimation of The Annual Sediment Load 29
3.4 Other Data Sets 31
CHAPTER 4 COMPARISON OF ESTIMATED LOADS WITH PUBLISHED DATA 34
REFERENCES 36
APPENDIX 39
A.1 Bias Correction Factor for C with Normal Errors 39
A.2 A Comparison of Two Estimated Bias Correction Terms 40
A.3 Bias Correction Factor for C with Generalized Gaussian Errors 41
A.4 Integration Process for Sediment Load Estimation 42
A.5 Expectation and Variance of the Sediment Load 43
A.6 Maximum Likelihood Estimators 45

LIST OF TABLES

2.1 The prediction factors investigated by Kellerhals et al. 48
3.1 Cp and adjusted R² for rising stage submodels of Mission 49
3.2 Cp and adjusted R² for falling stage submodels of Mission 50
3.3 Sample autocorrelations and sample partial autocorrelations of the errors in the Mission model 51
3.4 AIC(k) - n ln σ̂² for the errors in the Mission model 51
3.5 Least squares fitted models 52
3.6 Rising stage model's coefficients estimated using least squares, minimum absolute residual, Huber and Andrews regressions 53
3.7 Falling stage model's coefficients estimated using least squares, minimum absolute residual, Huber and Andrews regressions 53
3.8 The proposed models for Mission, Hansard, Brocket and Walsingham 54
3.9 Numerical comparison between two estimates of exp(

E(C)E(Q); hence on the average, the product of the average sediment concentration and the average flow is less than the average of the products of the two. The great increase in the variability of methods 2 and 4 as the sampling interval increases is due to the fact that in systematic sampling, a longer interval means fewer points and poorer precision. It should also be noted that systematic sampling may give poor precision when periodicity is present (Cochran (1977)). In evaluating the load estimates obtained using rating curves, four data sets from a seven-year record, each containing fifty replicates representing four sampling strategies, were assembled. The resultant rating curve relationships were applied to several frequently used load calculation procedures. The sampling strategies employed to generate the replicate data sets included regular sampling at weekly intervals and coverage of storm events with additional random sampling when flows exceeded certain thresholds.
Fifty replicate relationships of the form C = aQ^b, where Q is the instantaneous discharge at the time of sampling, were calculated using least squares regression. In addition, two of the data sets were subdivided into four stages: winter and rising stage, winter and falling stage, summer and rising stage, summer and falling stage; a separate relationship was developed for each stage. The results showed that accuracy and precision increased with the coverage of storm events. Most noteworthy is the finding that rating relationships subdivided according to season and stage tendency increased the accuracy of the estimate provided by a particular sampling strategy. In general, all four strategies underestimated the total load for the seven-year period. This could be the result of poor sampling and/or poor modelling, in addition to the lack of a bias correction factor.

2.3 SMILLIE AND KOCH (1984) AND FERGUSON (1986)

For years, hydrologists have overlooked the bias in their estimation of sediment loads or concentrations that results from simply transforming the fitted model lnC = a + b lnQ to obtain C = exp(a)Q^b. Smillie and Koch (1984) and Ferguson (1986) showed that if certain assumptions were satisfied, this bias could be easily corrected. Smillie and Koch illustrated this biased behaviour for data on the Yampa River, near Maybell, and the Little Snake River, at Lilly Park, in northwestern Colorado, U.S.A., but found that for both rivers, the bias correction factor derived overcorrected the estimates of the annual sediment loads. To further illustrate the tendency towards bias and to test the ability of the correction factor to improve a prediction, Smillie and Koch created a data set for which the following properties were satisfied:
1. A log-transformed linear relationship exists between the two variables.
2. Errors are normally distributed.
3. There is temporal stationarity in the system.
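The least squares rating-curve fit described above (regress lnC on lnQ, then back-transform to the power-law form C = aQ^b) can be sketched as follows. This is a minimal illustration of the technique, not code from any of the studies reviewed; the function name is mine.

```python
import math

def fit_rating_curve(Q, C):
    """Fit lnC = ln(a) + b*lnQ by ordinary least squares and return
    (a, b) for the power-law rating curve C = a * Q**b."""
    x = [math.log(q) for q in Q]
    y = [math.log(c) for c in C]
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope on the log-log scale
    a = math.exp(ybar - b * xbar) # back-transformed intercept
    return a, b
```

For noise-free data generated from C = 2Q^1.5 the fit recovers a = 2 and b = 1.5; with real data the back-transform of the intercept is the source of the bias discussed in this section.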
As in the case with the real data, the uncorrected model underpredicted the average of the dependent variable. On the other hand, the use of the bias correction factor improved the estimates considerably, and no sign of overcorrection was detected. The consistent overestimation noted earlier for the Yampa River and the Little Snake River could be the result of the following possibilities:
• The data collected were not representative.
• The model used was too simplistic.
• The normality assumption was not satisfied.
Smillie and Koch suggested that the serial correlation between the errors might have caused the overcorrection by the bias correction factor, but as shown in Appendix A.4, this should affect only the variance of the estimate and not the estimate itself. Ferguson applied the bias correction factor to sixteen simulated and real data sets. For the nine simulation experiments, he obtained corrected estimates that lie within three percent of the true load. For the seven real data sets, the corrected estimates range from 91 to 104 percent of the true load. So, when the normality assumption is reasonable, the bias correction factor derived appears to be an adequate one.

2.4 CHURCH, KELLERHALS AND WARD'S (1985) CONSULTING REPORT

As in Walling and Webb (1981), Church, Kellerhals and Ward divided long-term sediment load estimation into two groups: one involving the use of rating curves and the other involving more direct approaches. In the latter group, the authors divided the methods into three main approaches:
1. Total load = K Σ_{i=1}^{n} C_i Q_i / n
2. Total load = K (Σ_{i=1}^{n} C_i / n)(Σ_{j=1}^{m} Q_j / m)
3. Total load = K Σ_m C̄_m Q̄_m, where C̄_m and Q̄_m are averages of C and Q over sub-periods m
They dismissed the second and third approaches for reasons similar to those mentioned in Section 2.2 and recommended the first approach, it being the only "unbiased" one. It should be noted that the first approach is not necessarily unbiased if sampling is not random. The second approach is valid only if Q is of no use in predicting C.
The third approach is useful where C and Q are approximately "independent" within the time periods over which they were averaged. Turning to the methods involving rating curves, Church, Kellerhals and Ward used data from six Western Canadian stations to run the following regressions:
1. Daily concentration vs. daily flow for all sampled days in each year,
2. As in 1 but including only every second, every fifth, or every tenth sample,
3. Daily concentration vs. daily flow for all samples in one, two, three and five years combined,
4. Annual sediment load vs. annual water volume for complete years of record.
In analysis 1, the data of a year were used to derive a rating equation, and this equation was in turn used to estimate the year's total sediment load. The results were very close to the figures provided by the Water Survey of Canada for the stations with big drainage areas and were much less accurate for those with small drainage areas. This is in agreement with the findings of other workers (Kellerhals et al. (1974)). The estimates, however, were generally smaller than the ones published by the Water Survey of Canada. This general underestimation was corrected when the bias correction factor suggested by Smillie and Koch (1984) was used. In analysis 2, the results were comparable to those obtained in analysis 1. The authors concluded that if long-term sediment yield was the main object of observation, twenty to thirty samples a year were easily sufficient. There was no systematic change in the quality of the annual load estimates obtained using the rating curves derived from two, three or five years in analysis 3 in comparison with those obtained in analysis 1, where annual rating curves were used.
This suggested that the annual relationship between water discharge and sediment concentration does not vary drastically from year to year, so that using the rating curve developed from data of past years to estimate future annual loads from flow data is not out of line. This would, however, be subject to the vagaries of the long-term behaviour of the contributing drainage basin. Analysis 4 showed good results for stations with large drainage areas. The procedure, however, was not very useful for stations with small drainage areas. Church, Kellerhals and Ward went further by examining flow during restricted portions of the year for one of the stations with larger drainage areas. The combinations were: May-June, May-July, April-July, April-August, May-September, and April-October. The authors discovered that all combinations yielded better R² than the annual flow. This suggested that the linear relationship assumed between the logarithm of the sediment load and that of the flow is more of a high-flow period relationship than an annual one. These R² values should not, however, be used to compare the portions of the year as time periods over which the modelling should be done, since the dependent variable in each case is different. In the end, Church, Kellerhals and Ward suggested a new method, to be used in the case of sparse data, called the shifting rating curve. Under this method, all available data are used to compute the best fitting curve of the form C = aQ^b. Then, assuming that b is fixed, a is recalculated for each sampled concentration of sediment as follows:

a_i = C_i Q_i^{-b}

Linear interpolation was used to estimate the coefficient between samples. This scheme was tested on two stations, one using every tenth sample and the other using every fifth sample.
These same samples were employed to derive the yearly rating curves, and comparisons amongst the annual loads computed by this new method, by the fixed yearly rating curve and by the Water Survey of Canada were carried out. The results showed substantial improvement in the annual loads estimated by the new method over those derived from the single-year, fixed rating curve method. However, this method does not have a solid foundation. Furthermore, an implicit assumption is that the data be of the form shown in Figure 2.1.

2.5 DISCUSSION

With the exception of Smillie and Koch (1984), all investigators had problems with underestimating the load, even with the bias correction, when assuming that the daily relationship between lnC and lnQ over the year or over the high-flow period can be described by the model

lnC = β_0 + β_1 lnQ + ε (2.1)

Kellerhals et al. (1974) tried including several other independent variables, but the results showed that only for rivers with small drainage areas did the addition of at most one variable increase the percent of explained variation in lnC. Even then, there was no obvious second best factor. Assumptions required for the use of the adopted statistical model were mentioned by Kellerhals et al., but none of the papers reviewed in this chapter reported on whether the data satisfied the assumptions. Also, checks for inadequacies of the models used were not mentioned in any of the papers. Some authors noted that the residuals were highly serially correlated, but the problem was not pursued. With a record of continuous data, Walling and Webb (1981) were able to create replicate sets of data and obtain the sample standard deviation of the load. With the correct procedures and a sampling interval of one day, they obtained at best a standard deviation of about ten percent of the load. None of the investigators attempted to calculate the standard error of the load estimate based on a model.
In this study (described in Chapter 3), the relationship between the mean daily sediment concentration and the mean daily discharge is also found to be well represented by the model (2.1). However, as Walling and Webb (1981) discovered, the fit is much improved when the seasonal effect is taken into account. Here, the data are split into two parts, before and after the day of the peak flow, and a different relationship is fitted to each set. Following the work of Kellerhals et al. (1974), several discharge-related variables are included in the relationship mentioned in the preceding paragraph. Of these, only the inclusion of a certain lag of the mean daily discharge is found to increase the portion of explained variance in lnC by a significant amount, and even then, only for one of the relationships. As noted by Kellerhals et al. (1974), Smillie and Koch (1984), and Church et al. (1985), the residuals are positively serially correlated. This is not unexpected, as the observations are ordered in time; hence adjacent cases influence each other. Analysis of the residuals suggests that the errors be modelled using a first-order autoregressive model

ε_t = ρ ε_{t-1} + η_t,

where {η_t} are independent and identically distributed (i.i.d.) normal random variables with mean zero and constant variance. The standard assumption in regression analysis that the errors, ε, are uncorrelated no longer applies. This violation of the uncorrelated errors assumption does not affect the bias correction factor for C, as suggested by Smillie and Koch (1984). The bias correction factor remains exp(σ²/2) in the case of normally distributed errors. However, the calculation of the load variance is very sensitive to the assumption of uncorrelated errors, as shown in Appendix A.4. Up to now, only the bias correction factor for the case of normally distributed errors has been mentioned in the rating curves literature. In practice, the normality assumption may not always hold.
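The normal-theory retransformation bias and its exp(σ²/2) correction, mentioned above, can be illustrated numerically. The Monte Carlo sketch below is my own, under the assumption ε ~ N(0, σ²): the naive back-transform exp(0) = 1 underestimates E(exp(ε)), while exp(σ²/2) matches it.

```python
import math
import random

def bias_correction_factor(sigma2):
    """Normal-theory correction factor for E(exp(eps)), eps ~ N(0, sigma2)."""
    return math.exp(sigma2 / 2.0)

# Monte Carlo check of E(exp(eps)) against the correction factor.
random.seed(0)
sigma2 = 0.5
draws = [math.exp(random.gauss(0.0, math.sqrt(sigma2))) for _ in range(200_000)]
mc_mean = sum(draws) / len(draws)  # should be near exp(0.25) ~ 1.284, not 1
```

With σ² = 0.5 the simulated mean of exp(ε) comes out close to exp(0.25) ≈ 1.28, about 28 percent above the naive value of 1, which is the order of underestimation the uncorrected rating curve suffers.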
In Section 3.3 and Appendix A.2, the sensitivity of the bias correction factor to the normality assumption is studied. Also, a nonparametric estimate of the bias correction factor is introduced.

CHAPTER 3 STATISTICAL ANALYSIS AND MODELLING

3.0 INTRODUCTION

The models used in this chapter, which are based on the exploratory data analysis described in Section 3.1, have the form

lnC_t = F(lnQ_t, ...) + ε_t, ε_t = ρ ε_{t-1} + η_t, (3.1)

where {η_t} are independent and identically distributed error terms. That is, lnC_t is some function of lnQ_t and some other variables, plus an error term, which is serially correlated in time. The model is detailed in Section 3.2. One important inference from the model is the estimation of the sediment load for the high-flow period (Section 3.3). This involves a bias correction factor from the inverse transformation of the first term in (3.1). Some theory behind the bias correction factor is presented in Appendices A.1, A.2 and A.3. Sections 3.1 and 3.2 consist of a detailed analysis and modelling of the Mission data. The Mission data are the most complete and "densely" collected set of data among the four stations under study, and perhaps in Canada. For this reason, a model developed for Mission can be used as a guide for modelling other stations, in particular Hansard, Brocket and Walsingham. Section 3.4 summarizes the analysis of these other three data sets. Most statistical analyses were carried out using S and SAS. S is especially useful for exploratory data analysis; the main reference for it is Becker and Chambers (1984).

3.1 EXPLORATORY DATA ANALYSIS FOR MISSION DATA

To study the relationship between C and Q in relation to time, the yearly plots of C and Q versus time are examined. As shown in Figure 3.1, the variation of C and Q by several orders of magnitude makes it very difficult to identify the relationship between the two. To facilitate this task, the plots of the natural logarithm of C, lnC, vs.
the natural logarithm of Q, lnQ, are inspected. These plots, as presented in Figure 3.2, show that there exists a relationship between lnC and lnQ during the high-flow period, usually between March and October; but no clear pattern can be detected during the low-flow period between November and February (due to few observations). In addition, these plots indicate that the relationship between lnC and lnQ varies from the rising stage of the high-flow period to the falling stage of the same period. This observation is made obvious by the yearly plots of lnC vs. lnQ, with each point identified by the month in which it was sampled. As shown in Figure 3.3, these plots reveal a systematic annual hysteresis. That is, the trend of lnC vs. lnQ is reasonably consistent between June and October and in March, but between April and May, departs toward substantially higher lnC for a given lnQ. For the remaining months, there appears to be no trend between lnC and lnQ. Due to the few observations and the lack of a clear pattern between lnC and lnQ in the low-flow months, only data from the high-flow period, between March and October, are included in further analyses. This omission should not affect the estimation of the annual sediment load, as the amount of sediment transported during these low-flow months is observed in plots such as Figure 3.1 to contribute very little to the year's total load. A sample average of the mean daily sediment concentration multiplied by the total flow will suffice as an estimate of the total sediment transport for these low-flow months. Also omitted from the data are all samples from 1980: as a result of the very unusual weather in that year, the data do not follow the general pattern exhibited by other years. For example, the plot of lnC vs. lnQ for this year shows a very different pattern.

a. A basic model

There is no theory which determines the form of the sediment rating curve.
For suspended sediment, nearly all investigators (those whose papers are reviewed in Chapter 2 and references therein) have used a power law

C_t = a Q_t^b exp(ε_t) (3.2)

where {ε_t} are random errors. This relationship is no doubt inspired by the linear trend discovered when plotting lnC against lnQ. That is, by applying the natural logarithm to both sides of equation (3.2), a linear model of the form

lnC_t = ln a + b lnQ_t + ε_t

is obtained. The method of linear regression can then be applied.

b. The seasonal effect

Many investigators (e.g., Miller (1951), Sharma and Dickinson (1980), Walling and Webb (1981)) believe that time is an important factor in studying the relationship between lnC and lnQ. As observed earlier, the trend of lnC vs. lnQ appeared to vary from the rising limb of the hydrograph to the falling limb. To substantiate this conjecture, the following two models are fitted to the data and the results compared. The gain in the portion of explained variance, the adjusted R², of the two-part model is approximately 16 percent. This suggests that two separate submodels are required: one for the pre-peak flow days, the rising stage; and the other for the post-peak flow days, the falling stage.

c. Other factors

The lags of the daily discharge are considered to be influential factors. It is generally observed that sediment transport depends not only on the momentary discharge but also on its recent history (Bogen (1980)). To investigate this observation, the natural logarithms of the lag 1 to lag 10 daily flows were examined. The scatter plots of the lags vs. lnC all exhibit a linear trend, although increasing variability is observed with the larger lags. An algorithm for selecting the "best" subsets of predictor variables in regression (LEAPS in S) is applied to the lags along with lnQ. The results show that lnQ is the most dominant factor, explaining 59 percent of the variability in lnC.
The addition of the lag 10 discharge, lnQ_{-10}, is accompanied by an increase of about 21 percent in the portion of explained variance in lnC. (It should be noted that the addition of lag 7, lag 8 or lag 9 yielded comparable increases in the portion of explained variance to that of lag 10.) After these two variables have been introduced, further gain in the adjusted R² is minor. Thus, lnQ and lnQ_{-10} are used as predictors for lnC. (The two models compared in part b above are

lnC_t = β_{0r} + β_{1r} lnQ_t + ε_t, if day t is before the peak flow,
lnC_t = β_{0f} + β_{1f} lnQ_t + ε_t, if day t is after the peak flow,

and

lnC_t = β_0 + β_1 lnQ_t + ε_t.)

According to Abraham (1969), Walling and Teed (1971) and Church (1972), other factors believed to influence sediment transport include the daily rate of change of discharge (rate), the number of days since the last peak flow of the year (dslp) and the previous maximum discharge of the year (prMadis). These variables are thought to be related to the amounts of sediment available for transport. These variables, along with lnQ_{-10} and lnQ, will now be examined for the rising stage and the falling stage separately. For the rising stage, all scatter plots (not given in the thesis) of the predictor variables vs. lnC, except the dslp vs. lnC and rate vs. lnC plots, exhibit some linear trend. The plot of rate vs. lnC shows that the magnitude of rate, abs(rate), is a better variable to use. Next, the best subsets algorithm is used to pick out the important factors. The results are summarized in Table 3.1, with p as the number of parameters (the number of predictors plus the intercept). With the adjusted R² serving as the criterion, the model with lnQ and lnQ_{-10} is the best model. The model with lnQ alone explains only 66 percent of the variability in lnC. By including lnQ_{-10}, the portion of explained variance in lnC is increased to 84 percent, which is the highest for any model with just two predictors.
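The adjusted R², used throughout this section as the selection criterion, can be computed as in this sketch (the helper function is mine; the definition is the usual one, penalizing the number of parameters p):

```python
def adjusted_r2(y, fitted, p):
    """Adjusted R^2 = 1 - (RSS/(n-p)) / (TSS/(n-1)), for a model with
    p parameters (intercept included) fitted to n observations."""
    n = len(y)
    ybar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual SS
    tss = sum((yi - ybar) ** 2 for yi in y)                 # total SS
    return 1.0 - (rss / (n - p)) / (tss / (n - 1))
```

Unlike raw R², this quantity can fall when a predictor such as an extra lag is added without a compensating drop in residual sum of squares, which is why "further gain is minor" translates into stopping at lnQ and the lag-10 term.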
With lnQ and lnQ_{-10} already in the model, further gain in the adjusted R² is insignificant. The Mallows' Cp criterion (Draper and Smith (1981)) is also considered. The Mallows' Cp statistic for a p-parameter model is defined by

Cp = RSS_p / σ̂² + 2p - n,

where RSS_p = Σ (lnC_t - lnĈ_t)² is the residual sum of squares of the p-parameter model, σ̂² is an estimate of the error variance, and n is the total number of observations. Mallows (1973) suggests that good models will have negative or small Cp - p. Although the results do not appear to favour the above model, the ranking is the same as that provided by the adjusted R² criterion. For the falling stage, the plots of the predictor variables vs. lnC are examined. Except for the plot of prMadis (which is now a constant, i.e., the maximum discharge for the year) vs. lnC, all scatter plots follow a definite linear trend. Again, the magnitude of rate is used instead of rate itself. The results obtained from running the best subsets algorithm are presented in Table 3.2. According to the adjusted R² criterion, lnQ is the only variable needed. The model with just lnQ yields an adjusted R² of 81 percent. Only a maximum of about 1 percent can be gained by adding more variables to the model. Once more, the ranking based on the Mallows' Cp criterion is very close to that provided by the adjusted R² criterion. In choosing the "best" subset model, it should be kept in mind that although criteria are provided to aid the investigator in making the selection, a model with a simpler form that does not give up much criteria-based performance is often preferred over more complex models. In keeping with this rule, the models mentioned above are preferred to their competitors.

d. The "dog leg" effect

Next, a "dog leg" effect is considered; that is, the upper segment of the plot of lnC vs. lnQ has a tendency to flatten off or even approach horizontal.
This can be explained by the fact that above a certain limit, sediment concentrations do not continue to increase with discharge at a uniform rate, but may even remain constant once the supply potential of the catchment has been fully achieved (Gregory and Walling (1973)). This effect can be seen in Figures 3.4a and b, where the plots of lnQ vs. lnC and lnC vs. lnQ_{-10} for the rising stage both follow a curved line. The curved lines running through the data are the nonparametric smooths proposed by Cleveland (1975). In S, these can be obtained through the function LOWESS; for the figures, the smoothing factor used is 0.2. When the model

lnC_t = β_0 + β_1 lnQ_t + β_2 lnQ_{-10,t} + ε_t

is fitted to the rising stage data, the plots of the Studentized residuals vs. the predicted values and the Studentized residuals vs. lnQ both betray signs of nonlinearity, as shown in Figures 3.5a and b. The plot of the Studentized residuals vs. lnQ_{-10}, on the other hand, shows no indication of nonlinearity. This suggests that the model failure is more a function of lnQ than of lnQ_{-10}. Thus some transformation of lnQ is required. Careful examination of the lnC vs. lnQ plot reveals that only a slight transformation of lnQ is necessary. Figure 3.4a shows that the relationship between lnC and lnQ is approximately piecewise linear with a corner at lnQ = 8. Thus the data for the rising stage can be fitted using two different lines joined at lnQ = 8, namely

lnC_t = β_0 + β_1 lnQ_t + β_2 lnQ_{-10,t} + ε_t, for lnQ_t > 8,
lnC_t = β'_0 + β'_1 lnQ_t + β_2 lnQ_{-10,t} + ε_t, for lnQ_t ≤ 8,

subject to continuity at lnQ_t = 8. This is equivalent to fitting the following linear model

lnC_t = β_0 + β_1 X1_t + β_2 X2_t + β_3 lnQ_{-10,t} + ε_t

where X1 = (lnQ - 8) × (1 - d) and X2 = (lnQ × d) + ((1 - d) × 8), with d = 1 if lnQ > 8 and d = 0 otherwise.

e. Outliers

The Studentized residuals (Weisberg (1980)) vs.
the predicted values plots for the rising stage and the falling stage both display a few candidates for outliers. Because the least squares estimators can be very sensitive to outliers, alternative estimation methods, generally called robust regression methods, are used for comparison. These methods basically give lower weights to observations which seem inconsistent with the rest of the data, assuming that the fitted model is true. So, instead of minimizing Σ (lnC_i - lnĈ_i)², the function Σ ρ([lnC_i - lnĈ_i]/s) is minimized, where ρ is a function that penalizes outliers (less severely than the least squares procedure) and s is a fixed scale factor. Two such well-known methods are by Huber (1972) and Andrews (1974). The loss functions for these two are as follows:

ρ(u) = u²/2 for |u| ≤ K, and K|u| - K²/2 otherwise (Huber);
ρ(u) = A²(1 - cos(u/A)) for |u| ≤ Aπ, and 2A² otherwise (Andrews).

Displayed in Tables 3.6 and 3.7 are the results obtained using the Huber and Andrews criteria (with the constants A and K being 1.339 and 1.345 respectively) and the sum of absolute deviations criterion. These were obtained using the functions RREG and L1FIT in S. The estimates are found to be very close to those obtained using least squares. Thus, it is concluded that the possible outliers observed are not influential in the estimation of the parameters.

f. Model for the errors

So far, all the regression methods have been used in an exploratory analysis sense, because the assumption of independent and identically normally distributed errors is not necessarily valid. This assumption is now checked by studying the residual plots. The normality assumption is checked by plotting the normal probability plot of the Studentized residuals. The plot for the rising stage is presented in Figure 3.6a and the plot for the falling stage in Figure 3.6b. Both these plots indicate slightly longer tails than those of the normal distribution. Otherwise, neither suggests any sign of serious violation of the normality assumption.
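The Huber and Andrews loss functions used in part e can be sketched as below. These are the standard textbook forms with the constants quoted in the text (K = 1.345 for Huber, A = 1.339 for Andrews); the function names are mine.

```python
import math

def huber_rho(u, K=1.345):
    """Huber loss: quadratic near zero, linear beyond K."""
    return 0.5 * u * u if abs(u) <= K else K * abs(u) - 0.5 * K * K

def andrews_rho(u, A=1.339):
    """Andrews' sine-type loss: bounded, constant beyond A*pi."""
    if abs(u) <= A * math.pi:
        return A * A * (1.0 - math.cos(u / A))
    return 2.0 * A * A
```

Both functions grow more slowly than the squared-error loss for large residuals, which is what downweights the outlier candidates; Andrews' loss is even bounded, so a gross outlier contributes at most the constant 2A².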
Because the data in this study form a time series, adjacent errors would be expected to be serially correlated. Residuals of both the rising stage and the falling stage were joined together, and daily runs of residuals were plotted against time. The plots show that successive residuals tend to be more alike than otherwise, indicating the presence of positive serial correlation. On a set of residuals for which lag 1 up to lag 6 exist, the plots of the residuals vs. their lags were examined. The plots exhibit a "lower left to upper right" tendency that is representative of positive serial correlation. The amount of scatter in the plots increases with the lag. The plots of the residuals vs. lag 1 and lag 2 are provided in Figures 3.7a and 3.7b. An estimate of the lag i correlation is

ρ̂(i) = [ (1/n_i) Σ_t e_t e_{t-i} ] / [ (1/n_0) Σ_t e_t² ],

where the numerator sum is over the n_i values of e_t for which e_{t-i} also exists (i.e., a measurement was taken) and n_0 is the length of the time series. The estimates of the lag 1 to lag 10 correlations are summarized in Table 3.3. These estimates are then plotted against their respective lags. The plot, as shown in Figure 3.8, indicates a decay of the correlation as the lag increases. The partial autocorrelations (Box and Jenkins (1976)), pac(i), are also estimated and the estimates are provided in Table 3.3. The partial autocorrelations form a set of statistical measurements, similar to autocorrelations, that reveal how time series values are related to each other at specified lags. The plot of the partial autocorrelations vs. the lags is shown in Figure 3.9. This figure indicates that a first or second order autoregressive model might be appropriate, because the pac(i) for i ≥ 3 are approximately zero, whereas pac(1) and pac(2) are considerably larger. The Akaike Information Criterion (Priestley (1981)), AIC, is a method that can be used to select an appropriate order of an autoregressive model.
This criterion picks the order $k$ for which
$$AIC(k) = n\ln\hat{\sigma}^2 + n\sum_{i=1}^{k}\ln\bigl(1 - \mathrm{pac}(i)^2\bigr) + 2k,$$
where $\hat{\sigma}^2$ is the residual mean square from the regression model for $\ln C$, is minimized. However, it is also desirable to have as simple a model as possible. With this additional objective, the selected $k$ should have a small $AIC(k)$, and none of the $AIC(i)$, for $i > k$, should be much smaller than $AIC(k)$ (that is, $AIC(i)$ stabilizes for $i > k$). The $AIC(i) - n\ln\hat{\sigma}^2$, for $i$ from one to ten, are listed in Table 3.4. Following the above guidelines, a first order autoregressive model is preferred over the second order autoregressive model. This choice is also supported by the adjusted $R^2$ criterion. The adjusted $R^2$'s obtained by fitting the models $e_t = \rho_1 e_{t-1} + \eta_t$ and $e_t = \rho_1 e_{t-1} + \rho_2 e_{t-2} + \eta_t$ to the set of residuals for which lag 1 and lag 2 exist are almost equal.

3.2 MODEL FOR MISSION DATA

a. The proposed model

Summarizing the analysis in Section 3.1, the models proposed are
$$\ln C_t = \begin{cases} \beta_0 + \beta_1\ln Q_t + \beta_2\ln Q_{-10,t} + e_t, & \text{for } \ln Q_t > 8; \\ \beta_0' + \beta_1'\ln Q_t + \beta_2\ln Q_{-10,t} + e_t, & \text{for } \ln Q_t \le 8, \end{cases}$$
subject to continuity at $\ln Q_t = 8$, for the rising stage, and
$$\ln C_t = \beta_0 + \beta_1\ln Q_t + e_t$$
for the falling stage, where
1. $\ln Q_t$ is considered fixed and $\ln C_t$ is random.
2. $\{e_t\}$ is a sequence of normal random variables with mean zero and variance $\sigma^2$ such that $e_t = \rho e_{t-1} + \eta_t$, where $\{\eta_t\}$ are independent and identically normally distributed with mean zero and constant variance.
The parameters $\beta_0$, $\beta_0'$, $\beta_1$, $\beta_1'$, $\beta_2$, $\sigma$ and $\rho$ can be estimated using the maximum likelihood estimators derived in Appendix A.6. However, this method is not available in statistical packages and it may not be robust to the normality assumption. An alternative method that involves two stages is to estimate $\beta_0$, $\beta_0'$, $\beta_1$, $\beta_1'$, $\beta_2$ and $\sigma$ using least squares regression and to estimate $\rho$ using least squares regression on the residuals of the first regression. The estimates obtained by this latter method are presented in Table 3.5. b.
Model adequacy and assumptions

Some model inadequacies are more serious than others. A small deviation from normality for the errors is among the less serious ones. As checked earlier, the normal probability plots of the Studentized residuals for the rising stage and the falling stage both look satisfactory. A potentially more serious failure is heterogeneity of the error variance. Heterogeneity of variance can be detected by plotting the Studentized residuals against the predicted values and the independent variables. For the rising stage, the plots of the Studentized residuals vs. the predicted values and the independent variables, presented in Figures 3.10a - c, all display a slight degree of heterogeneity. However, since a main goal is to estimate the annual load of sediment transport and not the daily concentration of sediment, the slight nonconstancy of variance, which may or may not be a result of outliers, should hopefully not be a factor. The plots of the Studentized residuals vs. the predicted values and the independent variable for the falling stage are presented in Figures 3.11a - b. Neither of the plots shows any indication of heterogeneity of variance. A much more serious failure is nonlinearity of the model. To check for nonlinearity, the plots of the Studentized residuals vs. the predicted values are studied. As shown in Figures 3.10a and 3.11a, such plots for the rising stage and the falling stage are well behaved. The specified models are considered adequate.

3.3 ESTIMATION OF THE ANNUAL SEDIMENT LOAD

In Section 3.2, a model for the natural logarithm of the mean daily sediment concentration was constructed. To obtain an estimate of the mean daily sediment concentration itself, the transformed-scale prediction $\widehat{\ln C}$ is retransformed by $\exp(\cdot)$. However, because the transformation is nonlinear and because $E(C) = \exp(\beta_0)Q^{\beta_1}E(\exp(e)) \neq \exp(\beta_0)Q^{\beta_1}$, $\hat{C} = \exp(\widehat{\ln C})$ is a biased estimator (see Duan (1983) and Miller (1984), for example).
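The need for a correction can be seen in a small simulation: for normal errors, $E(\exp(e)) = \exp(\sigma^2/2) > 1$, so exponentiating the log-scale prediction systematically understates $E(C)$. A minimal sketch in Python, with $\sigma = 0.5$ as an arbitrary illustrative value rather than a thesis estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.5
e = rng.normal(0.0, sigma, 1_000_000)   # simulated regression errors

mc_factor = np.exp(e).mean()            # Monte Carlo estimate of E[exp(e)]
theory = np.exp(sigma**2 / 2)           # lognormal mean: exp(sigma^2 / 2)
```

With $\sigma = 0.5$ the factor is about 1.13, i.e. ignoring the retransformation bias would understate the mean concentration by roughly 13 percent.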
Still, if the errors are normally distributed, a major portion of the bias can be eliminated by applying the adjustment factor $\exp(\hat{\sigma}^2/2)$ of Smillie and Koch (1984) (see Appendix A.1). In practice, departures from the normality assumption are not uncommon. A frequent departure is that the underlying distribution of the errors is approximately symmetric but nonnormal, possibly "more peaked" than the normal with "lighter tails", or "less peaked" than the normal with "heavier tails". A family of distributions that includes some of the distributions described above is the generalized Gaussian distribution $f(x) = \mathrm{const}\cdot\exp(-|x/b|^p)$. A study of the sensitivity of the bias correction factor to the change in $p$, for $1 < p \le 2$, is presented in Appendix A.3. The biasing factor $E(\exp(e))$ can also be estimated nonparametrically using $n^{-1}\sum_{i=1}^{n}\exp(\hat{e}_i)$. This estimator is approximately unbiased and does not assume any distribution for the errors. A comparison between the estimates of this bias correction factor and the "normal" bias correction factor is presented in Table 4.5. The two are very close, suggesting that any deviation from the normality assumption is slight. A theoretical relative efficiency comparison between the two bias correction factors, under the normality assumption, is provided in Appendix A.2.

A.1 Bias Correction Factor for $\hat{C}$ with Normal Errors

Consider the model $\ln C = \beta_0 + \beta_1\ln Q + e$ and its fitted counterpart $\widehat{\ln C} = \hat{\beta}_0 + \hat{\beta}_1\ln Q$. Inversely transforming the true and the fitted model yields $C = \exp(\beta_0)Q^{\beta_1}\exp(e)$ and $\hat{C} = \exp(\hat{\beta}_0)Q^{\hat{\beta}_1}$. $\hat{C}$ is a biased estimator of $E(C) = \exp(\beta_0)Q^{\beta_1}E(\exp(e))$. The bias of $\exp(\hat{\beta}_0)$ for $\exp(\beta_0)$ and of $Q^{\hat{\beta}_1}$ for $Q^{\beta_1}$ will be small since both $\hat{\beta}_0$ and $\hat{\beta}_1$ are based on a large number of observations. Hence, an approximately unbiased estimator of $E(C)$ can be obtained by multiplying $\hat{C}$ by an estimate of $E(\exp(e))$ (Duan (1983) and Miller (1984)). If $e \sim N(0, \sigma^2)$, then $E(\exp(e)) = \exp(\sigma^2/2)$, by evaluating the moment generating function of $e$ at 1.
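The sensitivity study of Appendix A.3 can be mimicked numerically: choose the scale $b$ so that the variance equals $\sigma^2$ (for this family, $\mathrm{Var}(e) = b^2\Gamma(3/p)/\Gamma(1/p)$, a standard property) and integrate $\exp(x)f(x)$. A sketch in Python, with $\sigma = 0.3$ as an arbitrary value rather than a thesis estimate:

```python
import numpy as np
from math import gamma, exp

def gg_bias_factor(sigma, p, half_width=40.0, n=200_001):
    """E[exp(e)] for e following the generalized Gaussian density
    f(x) = p / (2 b Gamma(1/p)) exp(-|x/b|^p), with b chosen so that
    Var(e) = sigma^2; evaluated by simple numerical quadrature."""
    b = sigma * np.sqrt(gamma(1.0 / p) / gamma(3.0 / p))
    x = np.linspace(-half_width, half_width, n)
    f = p / (2.0 * b * gamma(1.0 / p)) * np.exp(-np.abs(x / b) ** p)
    return float(np.sum(np.exp(x) * f) * (x[1] - x[0]))

factor_normal = gg_bias_factor(0.3, 2.0)   # p = 2 recovers the normal case
factor_heavy = gg_bias_factor(0.3, 1.2)    # heavier-tailed member
```

At $p = 2$ the quadrature reproduces $\exp(\sigma^2/2)$; as $p$ decreases toward 1 (heavier tails at the same variance), the bias factor grows, which is the sensitivity examined in Appendix A.3.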
So, assuming a normal error, an estimated correction term is $\exp(\hat{\sigma}^2/2)$, where $\hat{\sigma}^2$ is the residual mean square obtained from the regression of $\ln C$ on $\ln Q$. Similarly, this extends to a more general regression model. In this study, the correction is not done until after the load for the high-flow period has been estimated; hence, the estimated correction term used is $\exp(\hat{\sigma}_p^2/2)$.

A.2 A Comparison of Two Estimated Bias Correction Terms

The efficiency of $\exp(\hat{\sigma}^2/2)$ with respect to $n^{-1}\sum_{i=1}^{n}\exp(\hat{e}_i)$ is greater than one. This is expected, as a parametric estimate is generally more efficient than a nonparametric estimate when the parametric model is true.

A.3 Bias Correction Factor for $\hat{C}$ with Generalized Gaussian Errors

In Appendix A.1, a bias correction factor was derived for $\hat{C}$ under the assumption that the errors are normally distributed. In practice, the underlying distribution of the errors will not be exactly normal; it may be "more peaked" with "lighter tails", or "less peaked" with "heavier tails". For the data at hand, the distribution of the residuals appears to be approximately symmetric but with "heavier tails" than the normal. In this Appendix, the biasing factor $E(\exp(e))$ will be derived for the case of the errors being distributed according to the generalized Gaussian distribution
$$f(x) = \frac{p}{2b\,\Gamma(1/p)}\exp\bigl(-|x/b|^p\bigr), \qquad b > 0,\; 1 < p \le 2.$$
Since $p > 1$, $E(\exp(e))$ is finite for all values of $b$ (this follows from the fact that if $p > 1$, $E(\exp(tX))$ exists for all $t$). A study of the sensitivity of the bias correction term to the change in $p$ can be carried out as follows. Fix $\sigma^2 = \hat{\sigma}^2$.

The matrix $A$ is block diagonal,
$$A = \begin{pmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma \end{pmatrix},$$
with $\Sigma$ being the covariance matrix in (A6.2). It should be noted that there are two cases to consider. In one case, only the data during the high-flow period are included in the analysis. In the other case, data over the whole year are analyzed. In the former case, $n$ and $T$ are defined as above.
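The comparison in Appendix A.2 between the parametric factor $\exp(\hat{\sigma}^2/2)$ and the nonparametric factor $n^{-1}\sum_i \exp(\hat{e}_i)$ can be illustrated on simulated regression data; the Python sketch below uses synthetic data, not the station records:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
lnQ = rng.uniform(7.0, 9.0, n)
lnC = -0.4 + 1.6 * lnQ + rng.normal(0.0, 0.3, n)   # normal errors by design

X = np.column_stack([np.ones(n), lnQ])
beta, *_ = np.linalg.lstsq(X, lnC, rcond=None)
e = lnC - X @ beta                       # regression residuals

s2 = np.dot(e, e) / (n - 2)              # residual mean square
normal_factor = np.exp(s2 / 2)           # parametric ("normal") correction
smearing = np.exp(e).mean()              # nonparametric correction (Duan, 1983)
```

When the errors really are normal the two estimates agree closely, as they do for most of the stations in Table 3.9; the parametric estimate is then also the more efficient of the two, which is the point of the efficiency result above.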
For the latter case, all the years on record can be joined into a sequence of observations, $T$ is the total number of days on record and $n$ is taken to be 1. The maximum likelihood estimates of $\beta_0$, $\beta_1$, $\rho$ and $\sigma^2$ can be obtained by solving the derivative equations of the logarithm of the likelihood function with respect to these parameters. These equations are simplified to
$$\hat{\beta} = (X'A^{-1}X)^{-1}X'A^{-1}Y, \qquad (A6.3)$$
$$\hat{\sigma}^2 = (nT)^{-1}\,\hat{e}'\Omega^{-1}\hat{e}, \qquad (A6.4)$$
and
$$\hat{\rho} = \sum_{i=1}^{n}\sum_{j=2}^{T}\hat{e}_{ij}\hat{e}_{i,j-1} \Big/ \sum_{i=1}^{n}\sum_{j=2}^{T}\hat{e}_{i,j-1}^{2}, \qquad (A6.5)$$
where
$$X = \begin{pmatrix} 1 & \ln Q_{11} \\ \vdots & \vdots \\ 1 & \ln Q_{nT} \end{pmatrix}, \qquad Y = \begin{pmatrix} \ln C_{11} \\ \vdots \\ \ln C_{nT} \end{pmatrix},$$
and $\Omega = A/\sigma^2$, which depends only on $\rho$. Combining (A6.4) and (A6.5) and simplifying yield $\hat{\sigma}^2 \approx (nT)^{-1}\sum_{i=1}^{n}\sum_{j=2}^{T}\hat{\eta}_{ij}^2$. For these calculations, the identities $\det\Omega = (1-\rho^2)^{-n}$ and
$$\sigma^2\Sigma^{-1} = \begin{pmatrix} 1 & -\rho & 0 & \cdots & 0 & 0 \\ -\rho & 1+\rho^2 & -\rho & \cdots & 0 & 0 \\ 0 & -\rho & 1+\rho^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1+\rho^2 & -\rho \\ 0 & 0 & 0 & \cdots & -\rho & 1 \end{pmatrix}$$
are used, where $\Sigma/\sigma^2$ is a single diagonal block of $\Omega$. No explicit forms exist for the m.l.e.'s of these parameters, but they can be obtained using the following iterative algorithm.
1. Estimate $\beta_0$ and $\beta_1$ using ordinary least squares.
2. Estimate $\rho$ using the residuals obtained in the previous step with
$$\hat{\rho} = \sum_{i=1}^{n}\sum_{j=2}^{T}\hat{e}_{ij}\hat{e}_{i,j-1} \Big/ \sum_{i=1}^{n}\sum_{j=2}^{T}\hat{e}_{i,j-1}^{2}.$$
3. Estimate $\beta_0$ and $\beta_1$ in (A6.3) by substituting $\rho$ with $\hat{\rho}$ in $A = A(\rho)$.
4. Estimate $\rho$ as in step 2 using the residuals obtained in step 3.
5. Repeat steps 3 and 4 until the estimates of $\beta_0$, $\beta_1$ and $\rho$ converge.
6. Estimate $\sigma^2$ with $\hat{\sigma}^2 = (nT)^{-1}\sum_{i=1}^{n}\sum_{j=2}^{T}\hat{\eta}_{ij}^2$, where $\hat{\eta}_{ij} = \hat{e}_{ij} - \hat{\rho}\,\hat{e}_{i,j-1}$.

Table 3.5 Least squares fitted models.

station      variable    least squares   min. abs. res.   Huber    Andrews
Mission      intercept       -0.40           -0.68        -0.51     -0.52
             X1               3.40            3.42         3.41      3.43
             X2               2.05            2.09         2.05      2.05
             ln Q_-10        -1.40           -1.41        -1.38     -1.38
Hansard      intercept       -1.29           -1.00        -1.30     -1.31
             ln Q             1.59            1.55         1.58      1.58
             ln Q_-10        -0.66           -0.66        -0.64     -0.64
Brocket      intercept        0.33            0.51         0.34      0.36
             ln Q             1.82            1.97         1.89      1.91
             ln Q_-8         -1.00           -1.23        -1.08     -1.11
Walsingham   intercept        1.90            1.91         1.91      1.92
             ln Q             1.11            1.09         1.10      1.09

Table 3.6 Rising stage models' coefficients estimated using least squares, minimum absolute residual, Huber and Andrews regressions.
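The iterative algorithm of Appendix A.6 amounts to a Cochrane-Orcutt style iteration. A minimal sketch for a single uninterrupted series ($n = 1$), in Python with simulated data and with quasi-differencing as an assumed stand-in for the full GLS step:

```python
import numpy as np

def iterative_ar1_fit(lnQ, lnC, n_iter=50, tol=1e-10):
    """Alternate between the regression coefficients and the AR(1)
    parameter, in the spirit of the six-step iteration in Appendix A.6."""
    X = np.column_stack([np.ones_like(lnQ), lnQ])
    beta, *_ = np.linalg.lstsq(X, lnC, rcond=None)       # step 1: OLS
    rho = 0.0
    for _ in range(n_iter):
        e = lnC - X @ beta
        rho_new = np.dot(e[1:], e[:-1]) / np.dot(e[:-1], e[:-1])  # steps 2, 4
        # step 3: re-estimate beta after quasi-differencing by rho
        Xs = X[1:] - rho_new * X[:-1]
        ys = lnC[1:] - rho_new * lnC[:-1]
        beta_new, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        done = (abs(rho_new - rho) < tol and
                np.max(np.abs(beta_new - beta)) < tol)   # step 5
        beta, rho = beta_new, rho_new
        if done:
            break
    e = lnC - X @ beta
    eta = e[1:] - rho * e[:-1]                           # innovations eta_t
    sigma2 = np.dot(eta, eta) / len(lnC)                 # step 6
    return beta, rho, sigma2

# simulated series: ln C = -0.4 + 1.6 ln Q + e,  e_t = 0.6 e_{t-1} + eta_t
rng = np.random.default_rng(5)
T = 4000
lnQ = rng.uniform(7.0, 9.0, T)
e = np.zeros(T)
for t in range(1, T):
    e[t] = 0.6 * e[t - 1] + 0.2 * rng.normal()
lnC = -0.4 + 1.6 * lnQ + e
beta, rho, sigma2 = iterative_ar1_fit(lnQ, lnC)
```

Because the intercept column is also quasi-differenced (it becomes the constant $1 - \rho$), the returned coefficients are on the original scale and so are directly comparable with two-stage estimates of the kind shown in Table 3.5.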
station      variable    least squares   min. abs. res.   Huber    Andrews
Mission      intercept       -9.09           -8.97        -8.93     -8.93
             ln Q             1.59            1.58         1.57      1.57
Hansard      intercept       -1.63           -1.07        -1.42     -1.35
             ln Q             0.98            0.89         0.95      0.94
Brocket      intercept       -6.52           -6.73        -6.62     -6.65
             Y1               0.32            0.26         0.28      0.29
             Y2               2.09            2.13         2.11      2.12
Walsingham   intercept        3.45            3.46         3.45      3.45
             ln Q             2.03            1.83         1.92      1.91
             ln Q_-1         -1.51           -1.32        -1.41     -1.40

Table 3.7 Falling stage models' coefficients estimated using least squares, minimum absolute residual, Huber and Andrews regressions.

Mission
  rising stage:  $\ln C_i = \beta_0 + \beta_1\ln Q_i + \beta_2\ln Q_{-10,i} + e_i$ for $\ln Q_i > 8$; $\ln C_i = \beta_0' + \beta_1'\ln Q_i + \beta_2\ln Q_{-10,i} + e_i$ for $\ln Q_i \le 8$, subject to continuity at $\ln Q_i = 8$
  falling stage: $\ln C_i = \beta_0 + \beta_1\ln Q_i + e_i$
  error model:   $e_i = \rho e_{i-1} + \eta_i$
Hansard
  rising stage:  $\ln C_i = \beta_0 + \beta_1\ln Q_i + \beta_2\ln Q_{-10,i} + e_i$
  falling stage: $\ln C_i = \beta_0 + \beta_1\ln Q_i + e_i$
  error model:   $e_i = \rho e_{i-1} + \eta_i$
Brocket
  rising stage:  $\ln C_i = \beta_0 + \beta_1\ln Q_i + \beta_2\ln Q_{-8,i} + e_i$
  falling stage: $\ln C_i = \beta_0 + \beta_1\ln Q_i + e_i$ for $\ln Q_i > 4$; $\ln C_i = \beta_0' + \beta_1'\ln Q_i + e_i$ for $\ln Q_i \le 4$, subject to continuity at $\ln Q_i = 4$
  error model:   $e_i = \rho e_{i-1} + \eta_i$
Walsingham
  rising stage:  $\ln C_i = \beta_0 + \beta_1\ln Q_i + e_i$
  falling stage: $\ln C_i = \beta_0 + \beta_1\ln Q_i + \beta_2\ln Q_{-1,i} + e_i$
  error model:   $e_i = \rho e_{i-1} + \eta_i$

Table 3.8 The proposed models for Mission, Hansard, Brocket and Walsingham.

station      $(\sum\exp(\hat{e}_i))/n$   $\exp(\hat{\sigma}^2/2)$
Mission             1.0661                    1.0626
Hansard             1.1123                    1.1067
Brocket             1.4618                    1.3893
Walsingham          1.2712                    1.2458

Table 3.9 Numerical comparison between two estimates of exp(