A STRUCTURED COVARIANCE MODEL FOR DOUGLAS-FIR PSP DATA AND SOME IMPLICATIONS FOR SAMPLING

By S. Northway
B.S.F., University of British Columbia, 1976

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF FORESTRY in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF FORESTRY

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
DECEMBER 1995
© Steven Northway, 1995

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

The error structure resulting from the repeated measurements of permanent sample plots (PSPs) can be addressed using a general linear model. Maximum likelihood methods provide a unified approach for estimating the parameters and making subsequent inferences. Using a predicted error structure, the efficiency of proposed PSP sampling schemes can be compared prior to sampling. This is possible because the sampling errors of the model parameters are independent of the dependent variable. A sampling scheme is defined by the intended frequencies of the independent variables and a sample size. Statistical and cost efficiency can be calculated by combining the sampling scheme with the sampling costs and the predicted error structure. In the data set examined, the correlation in the error structure was constant among the measured increments within a plot. The third increment on a plot was no more independent of the first increment than the second.
This symmetric error structure was a better fit than the more usually assumed autoregressive structure. The data set, with an assumed symmetric error structure, was used to address two questions as examples of how this type of analysis might be used. In the first example, two sampling schemes were compared. Sampling Scheme 1 involved three increments on 150 plots, resulting in a total of 450 increments at a cost of $540,000. Sampling Scheme 2 involved two increments on 150 plots and one increment on 111 plots, resulting in 411 increments at a cost of $642,000. The second scheme resulted in fewer increments, but more measurements and more plots. The two schemes had nearly identical statistical efficiencies. Scheme 2 was more efficient per increment, but less cost efficient. The cost efficiency of remeasuring the same PSP overwhelms the increased statistical efficiency of independent increments. However, for modelling purposes, repeated measurements from the same PSP do not represent an irreplaceable asset. If a PSP is lost to development, a reasonable cost recovery would be the cost of its last remeasurement and the first measurement of a replacement. As a second example, sampling Scheme 3, composed of 150 plots with a 5-year and a subsequent 10-year increment, was compared to Scheme 4, which combined 141 plots with three 5-year increments and nine plots with one 5-year increment. Scheme 3 utilizes 300 increments at a cost of $420,000, while Scheme 4 utilizes 432 increments at a cost of $525,600. The parameter covariance matrices of Schemes 3 and 4 are nearly identical, resulting in the same precision in their respective estimates. Scheme 3 is therefore statistically more efficient per increment and is more cost efficient. Comparing these two schemes tests the trade-off of skipping an intermediate measurement on all the plots against making all the measurements on a subset of the plots.
Skipping an intermediate measurement is far more efficient, both statistically and financially.

Table of Contents

Abstract
List of Tables
Acknowledgement
1 Introduction
2 Linear Models from Repeated Measurements
  2.1 Structured covariance matrices
  2.2 Parameter estimation
    2.2.1 Selection of joint maximum likelihood
  2.3 Model selection
  2.4 Inference
    2.4.1 Implication for sampling
3 Related Forestry Literature
  3.1 Structured covariance modelling in forestry
  3.2 Implications to sampling for linear models in forestry literature
4 Application to PSP Sampling Schemes
  4.1 Results of the error structure analysis
  4.2 Simplified calculations for predicting efficiencies
  4.3 Example sample scheme representing plot replacement
  4.4 Example sample scheme representing skipped measurements
5 General Discussion
6 Summary
Literature Cited

List of Tables

2.1: Covariance error structures
2.2: Multiple subjects with identical error structures
2.3: Multiple subjects with different error structures
4.1: Estimated correlated error structures
4.2: AIC and LRT significance tests for nested error structures
4.3: Covariance structure (R and R⁻¹) for 3, 2 and 1 increments
4.4: Independent variables (X) for 3, 2 and 1 remeasurements
4.5: Predicted covariance matrix (C) of the parameters for Schemes 1 and 2
4.6: Predicted covariance matrix (C) of the parameters for Schemes 3 and 4

Acknowledgement

Thanks to all who have contributed to the joy I have found in intellectual pursuits. With special thanks to Dr. P. Marshall and the other members of my thesis committee, Drs. A. Kozak, V. LeMay and D. Tait.

Chapter 1

Introduction

Permanent Sample Plots (PSPs) are designed to allow repeated measures of the growth of a portion of the forest. This information is used in calibrating growth models. The growth models are then used in projecting growth of parts of the forest for use
in making management decisions.

Efficiency in sampling design is always a concern, but perhaps more so than usual in the case of PSP sampling. The sampling costs of PSPs are high, at $800 to $1200 per measurement, and they represent a commitment to continued measurement over a period of 20 or more years. While it is not necessary to estimate growth by measuring the same plot on successive occasions, it is more efficient than using independent plots on successive occasions (Cochran, 1977). The precision of growth models is largely defined by the PSP sampling scheme. More precise growth models result from more individual PSPs, more measurements on each PSP, and PSPs sampling a wider range of conditions.

Ideally, the efficiency of a PSP sampling scheme should be based on its effectiveness in leading to "good" decisions in its ultimate use. As the ultimate use is in forest management decisions, this is the true basis for a measure of efficiency. A calculation of imparted effectiveness in forest management requires relating the PSP sampling scheme to the "goodness" of the outcomes of the decisions. This can be accomplished only through propagating the effects of imprecision in the growth model through the whole of the decision-making process. A simpler approach is to define the efficiency of a PSP sampling scheme through the precision imparted to the growth model. The rationale for this follows the assumption that more precise growth models lead to more effective forest management decisions. While basing efficiency on ultimate decisions is desirable, in this case it would divert a good deal of effort into modelling the decision-making process and finding a suitable measurement of "good" decisions.

The precision of estimates resulting from fitting a linear model to a set of data is well understood. In fitting a linear model, the dependent variable is estimated from a linear combination of the independent variables.
In the simplest case, each data point is independent of all other data points and the distribution of the dependent variable, conditional on any set of independent variables, is normal and identical. The precision of this simple linear model can be estimated prior to sampling, because the precision of estimates from a linear model is not affected by the particular values of the dependent variables found in the sample (Judge et al., 1985). The precision is simply a function of the values of the independent variables and the residual variance.

Unfortunately, in an overzealous quest for randomness, many neophyte samplers think the subjects, and so the resulting values for the independent variables, should be randomly selected. While it is necessary for the conditional value of the dependent variable to be independent of the probability of a subject being sampled, it is desirable for the sampling to be purposeful over the independent variables (Demaerschalk and Kozak, 1974). It is the possibility of purposefully sampling across the independent variables that gives rise to the opportunity to affect the efficiency of the sampling scheme.

One complicating aspect of PSP data is possible inter-correlations among measurements. The series of growth measurements from one plot may be more closely related to one another than to measurements from any other plot. The presence of inter-correlations suggests the use of something other than simple linear modelling, where each measurement is assumed to be independent of every other measurement. This type of inter-correlation is found in other fields of study (Gumpertz and Pantula, 1989). This type of structure is called panel data in econometrics, longitudinal data in medicine, repeated-measurement or growth curve data in agronomics, and serially correlated or auto-correlated data in statistics. As in the simple case, the statistical theory for modelling with repeated measurement data is well developed.
The statistical theory can be found under techniques known as error component modelling (e.g., Wallace and Hussain, 1969), random coefficient modelling (e.g., Rosenberg, 1973), and the general linear model (e.g., Wolfinger, 1993). The precision of estimates from this kind of model is a function of the values of the independent variables, the correlation in the repeated measures, and the residual variance.

When I say that the theory is well developed, I am oversimplifying. The theory is well defined as long as the correlation in the repeated measures and the residual variance are known. Even when this is true, estimation may be difficult because of the need to invert large matrices. In practice, the technique of estimating the correlations, variances and model parameters from a sample is not simple, and no one technique has been universally adopted.

In forestry, several analytical techniques have been used. Some practitioners ignore the inter-correlations, accepting inefficient (though unbiased) estimates of the model parameters and biased variance estimates (e.g., Buckman, 1962). Increasingly over the last 20 years, the error structure of the repeated measurements has been considered in model development. A variety of error structures and estimation techniques have been applied (e.g., Arner and Seegrist, 1979; Gregoire et al., 1995; Leak, 1966). The aim of these efforts has been efficient estimates of the parameters and sometimes unbiased estimates of the variance.

There is a group of researchers in forestry assessing sampling schemes prior to sampling, ensuring subsequent model estimates achieve a defined level of precision (e.g., Demaerschalk and Kozak, 1974; Gertner, 1984; Marshall and Demaerschalk, 1986; Penner, 1989). While not extensive, this literature includes a look at the problem of simple linear models, and extensions into non-linear models from correlated measurements.
An examination of the forestry literature yielded no reference to assessing the relative efficiencies of alternative PSP sampling schemes. The objective of this thesis was to select a method and then present examples of just such an assessment procedure.

This thesis is divided into five subsequent chapters. Chapter 2 presents an overview of the theory involved in the analysis of linear models with repeated measures, and the separate problem of optimal sample design. Chapter 3 reviews the analysis of repeated measurement data in the forestry literature and reviews the forestry literature related to assessing sampling schemes prior to sampling. Chapter 4 presents two examples of the application of linear models to repeated measurements for determining the relative efficiencies of different PSP sampling schemes. Chapter 5 contains the general discussion. Chapter 6 is a summary.

Chapter 2

Linear Models from Repeated Measurements

Data containing repeated measurements can be analyzed using the general linear model. Following the notation of Wolfinger (1993), the model is:

y = Xβ + e,

where y is the column vector of the observed dependent variables of length n, X is the matrix of k observed fixed-effects independent variables of size n x k, β is the column vector of unknown fixed-effect parameters of length k, and e is a column vector of unobserved random errors of length n. The elements of e follow distributions with 0 expectations. β can be estimated using:

β̂ = (X'R⁻¹X)⁻¹(X'R⁻¹y),

where R is an n x n matrix containing the covariance structure of e. In the simple linear model, the elements of e have independent, identical normal distributions (iid) and R is a diagonal matrix with equal entries along the diagonal. The theory of ordinary least squares (OLS) is appropriate in this case.

2.1 Structured covariance matrices

The general linear model allows for R with different parameters along the diagonal and non-diagonal structures.
The entries along the diagonal are the error variances of the individual observations, and the entries off the diagonal are the covariances between the errors of two observations. Correlations can be calculated as a function of the relevant variances and covariances in the matrix. This allows for modelling correlated error structures. The errors of two observations might be correlated because they come from the same individual, are geographically close, or perhaps are measured in the same time period.

Error component models, also known as crossed-error structures, have been proposed for dealing with repeated measures data (Fuller and Battese, 1974; Wallace and Hussain, 1969). This is a specific error structure with three independent error components. The components cover variability associated with 1) sampling unit or subject, 2) time period, and 3) residual effects. This model assumes that each sampling unit is measured in each time period.

Error component models are not generally applicable to PSP data. In the context of PSPs, the sampling unit is the plot and the time period is the calendar years covered by the measured increment. The time periods are not consistent from one plot to another because subsets of PSPs are usually measured in staggered years. Certainly, time period effects can have a role in the covariance structure of PSP modelling, but only in a more complicated fashion than that reflected in error component models. Because of their limitations, error component models were not considered further.

Some useful within-subject error structures for R are presented in Table 2.1 (after Wolfinger, 1993). All are symmetric around the main diagonal. The unstructured error structure (UN) allows for different parameters throughout the matrix, presuming no relationship among the variances or covariances of the errors.
It is the only structure in Table 2.1 that allows heterogeneity of variances to be represented by different parameters along the main diagonal, and is the assumption implicit in the standard multivariate repeated measures ANOVA (Jennrich and Schluchter, 1986).

The Toeplitz structure (TOEP) is a restricted version of the UN. It has a unique parameter for the main diagonal and a unique parameter for each of the bands of covariances parallel to the main diagonal, presuming equal variances for remeasurements and equal covariances among remeasurements with the same number of intervening measurements. This can be viewed as a moving average structure (Wolfinger, 1993).

A first-order autoregressive structure (AR(1)) is a restricted version of the TOEP. Each parallel band of covariances is related to the preceding one by a constant multiplier, presuming equal variances of remeasurements and a geometrically smaller covariance among remeasurements as the number of intervening remeasurements increases. It seems reasonable in many situations that remeasurements close in time will have higher covariance than those further apart (Finney, 1990).

The compound symmetry structure (CS) is also a restricted version of the TOEP. It has one parameter along the main diagonal and a different parameter throughout the matrix off the main diagonal, presuming equal variances among remeasurements, and equal covariance between any pair of remeasurements. This is the assumption implicit in the usual univariate repeated measures analysis of variance (ANOVA), where repeated measurements are treated as measurements within split-plots. This is also equivalent to an error component model with no time period effect. While Meredith and Stehman (1991) suggested that there is no biologically plausible mechanism to generate this error structure, it is a common assumption.
The independent-identical structure (IID) is the simplest structure, with a single parameter along the main diagonal. It is a restricted version of UN, TOEP, AR(1) and CS. IID is the usual assumption in regression, presuming equal variances for each measurement and absence of covariance among remeasurements.

The TOEP, AR(1), CS and IID error structures assume equal variances for each repeated measurement within a subject. If this assumption of homogeneity is not valid, then it is also likely that the equalities assumed in the covariance structures are not valid. Often these assumptions are not valid for the original measurements, but can be made so through a transformation (Diggle, 1988).

Table 2.1: Covariance error structures (after Wolfinger, 1993)

unstructured (UN):
  | σ1²  σ12  σ13 |
  | σ12  σ2²  σ23 |
  | σ13  σ23  σ3² |

Toeplitz (TOEP):
  | σ²   σ1   σ2 |
  | σ1   σ²   σ1 |
  | σ2   σ1   σ² |

first-order autoregressive (AR(1)):
     | 1    ρ    ρ² |
  σ² | ρ    1    ρ  |
     | ρ²   ρ    1  |

compound symmetry (CS):
  | σ²   σ1   σ1 |
  | σ1   σ²   σ1 |
  | σ1   σ1   σ² |

independent identical (IID):
  | σ²   0    0 |
  | 0    σ²   0 |
  | 0    0    σ² |

Assuming independent errors among subjects leads to a block diagonal structure for R. In a block diagonal structure, the parameters within the blocks represent the relationships among remeasurements on a subject. The blocks are centred along the main diagonal of the matrix, with 0s elsewhere. The parameters can be identical among blocks or can differ, depending on the assumed similarity among subjects. Table 2.2 illustrates a compound symmetry structure for three subjects with two measurements each and identical parameters among subjects. Table 2.3 illustrates a compound symmetry structure for three subjects with two measurements each and different parameters among subjects.
Table 2.2: Multiple subjects with identical error structures

  | σ²   σ1   0    0    0    0  |
  | σ1   σ²   0    0    0    0  |
  | 0    0    σ²   σ1   0    0  |
  | 0    0    σ1   σ²   0    0  |
  | 0    0    0    0    σ²   σ1 |
  | 0    0    0    0    σ1   σ² |

Table 2.3: Multiple subjects with different error structures

  | σ1²  σ1   0    0    0    0  |
  | σ1   σ1²  0    0    0    0  |
  | 0    0    σ2²  σ2   0    0  |
  | 0    0    σ2   σ2²  0    0  |
  | 0    0    0    0    σ3²  σ3 |
  | 0    0    0    0    σ3   σ3² |

In the simplest application, homogeneous error structures for all subjects and independence among subjects are assumed. Transformations may achieve the homogeneity in some situations; in others it cannot be achieved. If plot size differs or the remeasurement period differs among plots, variance and covariance parameters will differ among those sets of plots, each requiring unique parameter estimation. Furthermore, there may be reason to expect correlations among subjects. This might arise through temporal or geographic proximity. Inter-subject correlations and non-homogeneous error structures can be dealt with in repeated measurement models through more complicated structures of R (i.e., non-block diagonal and non-equal blocks).

2.2 Parameter estimation

When the error structure matrix (R) is known, β can be estimated with:

β̂ = (X'R⁻¹X)⁻¹(X'R⁻¹y).

This is commonly known as the Aitken or generalized least squares (GLS) estimator; it is also the maximum likelihood estimator under normally distributed errors (Judge et al., 1985; Magnus, 1978; Parks, 1967). This is the best linear unbiased estimator (BLUE).

When R is not known, it has to be estimated from the sample data. The most general form of R cannot be estimated from the sample data alone: a sample size of n leads to an n x n matrix for R, with n variance parameters and many more covariance parameters to be estimated. It is not even possible to estimate n variance parameters from n observations, so some assumptions about the structure of R must be made. The assumed structure must reduce the number of parameters to be estimated to a quantity less than n.
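The block-diagonal error structures and the GLS estimator above can be sketched numerically. The following is a minimal illustration with hypothetical variance values and made-up data; the helper names (cs_block, ar1_block, gls) are mine for illustration, not from the thesis or from any particular software package.

```python
import numpy as np
from scipy.linalg import block_diag

def cs_block(m, var, cov):
    """Compound symmetry: a common variance on the diagonal,
    a common covariance everywhere off it (Table 2.1, CS)."""
    return np.full((m, m), cov) + (var - cov) * np.eye(m)

def ar1_block(m, var, rho):
    """First-order autoregressive: var * rho**|i-j| (Table 2.1, AR(1))."""
    idx = np.arange(m)
    return var * rho ** np.abs(idx[:, None] - idx[None, :])

def gls(X, y, R):
    """Aitken/GLS estimator and the parameter covariance C = (X'R^-1 X)^-1."""
    Ri = np.linalg.inv(R)
    C = np.linalg.inv(X.T @ Ri @ X)
    beta_hat = C @ (X.T @ Ri @ y)
    return beta_hat, C

# Three plots, two increments each, identical CS blocks (as in Table 2.2).
R = block_diag(*[cs_block(2, var=1.0, cov=0.4)] * 3)

# Hypothetical design matrix (intercept + one covariate) and response.
X = np.column_stack([np.ones(6), [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]])
y = np.array([1.1, 2.0, 0.9, 2.1, 1.0, 1.9])

beta_hat, C = gls(X, y, R)
```

With R equal to the identity matrix this reduces to OLS; the off-diagonal 0.4 entries down-weight the overlapping information in the two increments from the same plot.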
Special cases involving an unknown R, but assumed structure, are tractable through traditional least squares techniques. If the data are balanced¹ and the treatment intervals are equal, the traditional univariate ANOVA for repeated measures assumes a CS error structure for each subject, and similarly the traditional multivariate ANOVA assumes a UN error structure for each subject (Jennrich and Schluchter, 1986; Meredith and Stehman, 1991; Wolfinger, 1993). Typically, analyses involving an unknown R and unbalanced data sets are not amenable to analysis with traditional techniques (Jennrich and Schluchter, 1986; Laird and Ware, 1982; Oberhofer and Kmenta, 1974).

Under very general conditions², estimates of R result in an unbiased estimate of β (Henderson, 1971; Magnus, 1978; Judge et al., 1985), so the effort has gone into searching for efficiency or ease of estimation. Three techniques have been used to deal with the problem of repeated measures: 1) random coefficient or random function modelling, 2) some version of multi-stage modelling, and 3) joint maximum likelihood estimation.

Random coefficient models and random function models avoid the problem of estimating R. As a first step, separate parameter estimates are made for each subject. In a random coefficient model, these estimates are then averaged to create the final parameter estimate (Gumpertz and Pantula, 1989). In a random function model, the estimates from the first step are used as dependent variables to be estimated from other independent variables (Gumpertz and Pantula, 1989; Rosenberg, 1973).

¹ All treatment differences can be estimated with the same precision. This can be accomplished by having equal numbers of observations in each cell, or through more specialized designs as in balanced incomplete block designs, fractional factorials and Youden squares (Kendall et al., 1983).
These approaches require sufficient data to estimate the parameters independently for each subject.

In the multi-stage approach, the error structure is ignored in the first estimate of β. R is then estimated from the residuals of the first fit of β. The estimated R is then used to obtain a better estimate of β. This process of alternately refining the estimates of β and R can stop at this point, or iteration can continue until some tolerance level is met. A variety of estimation techniques can be used within the multi-stage model. Parks (1967) used a combination of OLS, transformations, and a type of Aitken estimator in a single pass. Oberhofer and Kmenta (1974) iterated between maximum likelihood estimates of β and R. Magnus (1978) iterated between an Aitken-type estimator for β and a maximum likelihood estimate of R. A variety of names are associated with these techniques: iterative Aitken, zig-zag iterative procedure, Parks estimation, Cochrane-Orcutt iterative method, estimated generalized least squares, feasible generalized least squares, and others.

An alternative to the multi-stage approach is to simultaneously maximize the joint likelihood of β and R. Algorithms for maximizing the joint likelihood of β and R are presented in Henderson (1953), Laird and Ware (1982), Jennrich and Schluchter (1986) and Wolfinger (1993). They used combinations of several algorithms to maximize the likelihood function: EM (a generalization of Henderson's algorithm), Newton-Raphson and Fisher scoring algorithms.

² The conditions require the residuals to be symmetric around zero and the estimate of R to be invariant to a sign change in the residuals (i.e., R is an "even" function of the residuals).
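The multi-stage (iterated Aitken) loop just described can be sketched as follows. This is a toy illustration under an assumed compound-symmetry structure with synthetic data, not the estimator used in this thesis (which relies on joint maximum likelihood); feasible_gls is a hypothetical helper name.

```python
import numpy as np

def feasible_gls(X, y, n_subjects, m, max_iter=50, tol=1e-8):
    """Alternate between estimating beta given R and re-estimating a
    compound-symmetry R from the residuals, until beta stabilizes.
    Assumes rows of X and y are ordered subject by subject."""
    n = n_subjects * m
    R = np.eye(n)                 # first pass ignores the error structure (OLS)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        Ri = np.linalg.inv(R)
        beta_new = np.linalg.solve(X.T @ Ri @ X, X.T @ Ri @ y)
        e = (y - X @ beta_new).reshape(n_subjects, m)   # one row per subject
        var = (e ** 2).mean()                           # pooled variance
        cov = np.mean([(e[:, i] * e[:, j]).mean()       # pooled within-subject covariance
                       for i in range(m) for j in range(m) if i != j])
        block = np.full((m, m), cov) + (var - cov) * np.eye(m)
        R = np.kron(np.eye(n_subjects), block)          # identical blocks, 0s elsewhere
        if np.allclose(beta_new, beta, atol=tol):
            break
        beta = beta_new
    return beta_new, R

# Synthetic example: 40 plots, 3 increments each, correlated within-plot errors.
rng = np.random.default_rng(1)
n_subjects, m = 40, 3
x = rng.uniform(0, 10, size=n_subjects * m)
X = np.column_stack([np.ones_like(x), x])
true_block = np.full((m, m), 0.5) + 0.5 * np.eye(m)     # CS: variance 1.0, covariance 0.5
errors = rng.multivariate_normal(np.zeros(m), true_block, size=n_subjects).ravel()
y = X @ np.array([2.0, 0.3]) + errors

beta_hat, R_hat = feasible_gls(X, y, n_subjects, m)
```

The loop typically converges in a few iterations here; stopping after one refit of R corresponds to the single-pass variants mentioned above.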
The approaches involving iterative multi-stage algorithms and the joint maximum likelihood methods give equivalent results under these circumstances (Oberhofer and Kmenta, 1974): 1) the errors are normally distributed; 2) the errors have an expectation of 0; 3) the independent variables are fixed; and 4) the parameters in β are independent of those in R. These conditions are assumed in the remainder of the discussion.

2.2.1 Selection of joint maximum likelihood

Joint maximum likelihood methods are used for the examples in Chapter 4, in preference to the random coefficient or multi-stage approaches. Random coefficient models require that a regression equation be fit for each PSP individually; this was not possible with the available data set. This technique also deals with the covariance structure of the individual plot parameter estimates, not with the remeasurement error structure. As my interest is in the effect of the remeasurement error structure on efficient sampling schemes, this technique was of no direct use. The multi-stage approaches do deal directly with the remeasurement error structures, but they do not provide a direct method of performing significance tests on alternative error structures and model forms. The joint maximum likelihood methods have these desirable characteristics.

While joint maximum likelihood methods are computationally intensive, they offer the advantage of a general and systematic approach to specifying a model and estimating its parameters. The technique has been made generally available through release 6.07 of the SAS/STAT software, which includes a maximum likelihood implementation of the general linear model in PROC MIXED (SAS Institute Inc., 1992). The remainder of this discussion is based on the use of joint maximum likelihood estimation.
2.3 Model selection

Diggle (1988) suggested using three steps to select a model: 1) develop a potential fixed effect model; 2) develop potential error structure models; and 3) test the potential models for adequacy.

At the first stage, the provisional fixed effect model should be overparameterized rather than underparameterized. As an example of the dangers of underfitting, fitting a linear trend to a quadratic relationship will result in finding an autocorrelation structure in the errors where none exists. This stage also includes searching for transformations that linearize the parameters and deal with heterogeneous error variance.

Stage two consists of selecting potential error structure models. This is accomplished by selecting likely error structures from theoretical considerations. Residuals from an ordinary least squares fit of the potential fixed effect model can then be used to judge their reasonableness.

In the third stage, the proposed models are compared using information-based criteria like Akaike's Information Criterion (AIC) (Bozdogan, 1987), or statistically-based tests such as likelihood ratio tests (LRT) (Jennrich and Schluchter, 1986). AIC is a measure of the quality of the fit with a penalty for overparameterization (Bozdogan, 1987). AIC can be calculated as -2 times the log likelihood plus 2 times the number of parameters. The model with the smallest AIC is chosen as the best model. If two models have identical log likelihoods, the one with the fewest parameters will be chosen. The LRT can be used to test for a significant difference between nested models (Jennrich and Schluchter, 1986). The test statistic, the difference in -2 times the log likelihoods, is distributed as a chi square with degrees of freedom equal to the difference in the number of parameters. UN, TOEP, AR(1), CS and IID structures are sequentially nested in that order.
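The AIC and LRT comparisons can be written directly from the definitions above. The log-likelihood values below are hypothetical placeholders chosen for illustration, not results from the thesis's data set.

```python
from scipy.stats import chi2

def aic(loglik, n_params):
    """AIC = -2 log L + 2 k; the smallest value indicates the preferred model."""
    return -2.0 * loglik + 2.0 * n_params

def lrt(loglik_full, loglik_nested, df):
    """Likelihood ratio test for nested models: -2 times the difference in
    log likelihoods, chi-square with df = difference in parameter counts."""
    stat = -2.0 * (loglik_nested - loglik_full)
    return stat, chi2.sf(stat, df)

# Hypothetical fits: CS (variance + common covariance) vs IID (variance only).
ll_cs, k_cs = -512.3, 2
ll_iid, k_iid = -520.9, 1

aic_cs, aic_iid = aic(ll_cs, k_cs), aic(ll_iid, k_iid)
stat, p = lrt(ll_cs, ll_iid, k_cs - k_iid)   # is the extra covariance parameter worth it?
```

Because IID is nested in CS, the LRT applies; here CS would be preferred on both criteria when aic_cs is smaller than aic_iid and p is small.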
2.4 Inference

Confidence intervals can be placed around the conditional means of the dependent variable using the normal distribution (Wolfinger, 1993). If R is known, the covariance matrix of β̂ is given by Wolfinger (1993) as:

C = (X'R⁻¹X)⁻¹.

A confidence interval can be placed on L'β, any linear combination of the parameters, by using:

L'β̂ ± z₁₋α/₂ √(L'CL),

where z is from the normal distribution at a chosen percentile (1-α/2). Substituting an estimate for R into the above calculations leads to confidence intervals that are only approximate.

2.4.1 Implication for sampling

If R is known, the widths of confidence intervals on the estimated parameters or predictions are independent of the values obtained for the observed dependent variable, y. They depend only on X and R, the values of the independent variables and the error covariance matrix. The width of the confidence intervals can be predicted prior to sampling. The values for X are defined by the distribution of the sampling and the sampling intensity. The matrix R can be estimated prior to sampling, through a pilot sample or previous experience. Predicting confidence intervals prior to sampling affords the opportunity to compare alternative sampling schemes.

Several texts comprehensively treat the development of optimal sampling strategies for regression models (e.g., Pazman, 1986; Silvey, 1980). The treatment deals with defining different criteria of optimality and how to implement them. The following paragraphs briefly summarize the subject.

A sampling scheme is defined by the design (i.e., the proportion of the sample coming from specified combinations of the independent variables) and the overall sample size. The design itself is independent of the overall sample size. A design is called universally optimal if it results in a minimum variance for every possible prediction. A design is said to dominate another if it results in a smaller variance for every possible prediction.
Designs are called equivalent if they result in the same variance for every possible prediction. In practice, it is not possible to find the universally optimum design, so other criteria are considered.

Two common definitions of optimality are D-optimality and G-optimality. D-optimality involves minimizing the confidence ellipsoid for the estimated parameters. This is a global optimality criterion because it includes all the parameters. G-optimality involves minimizing the least precise confidence interval for a set of predictions. This is a partial optimality criterion, as it only includes some of the parameters or some function of the parameters. D-optimality may result in large confidence intervals for some predictions. This can happen when the parameters' confidence ellipsoid is small because it is long and narrow. G-optimality is more intuitive and likely closer to an experimenter's objective.

Optimal designs can be calculated directly only for very simple regression models and some criteria of optimality (D-optimality is especially amenable). Generally, iterative methods are necessary to find optimal or near-optimal designs. This has led to considerable work in matching optimality criteria with search methods that ensure convergence to an optimum. Convergence can be particularly difficult with partial optimality criteria, because the search space can include discontinuities. These statistically optimal designs concentrate the sampling on the boundary of the possible sampling region.

Although there are few outstanding problems in optimal design for linear models, the theory has not proven to be very useful in practice (Silvey, 1980). In part, this can be attributed to the omission of sampling costs (Marshall and Demaerschalk, 1986; Penner, 1989), and, in part, to the fact that sampling at the boundaries precludes testing the adequacy of a proposed model form (Penner, 1989).
However, more importantly, the optimality criteria do not reflect utilitarian concerns that relate to the application of the model. Some utilitarian concerns in the application of optimal design for linear models are addressed through the use of a Bayesian decision theoretic approach. Brooks (1972) presented the theory for optimal sampling design in linear models. The result of the process is an optimal design, considering: 1) prior knowledge about the regression coefficients; 2) sampling costs; 3) the cost of collecting the variables needed to use the regression; and 4) the cost associated with errors in the predictions. In Brooks' (1972) simple linear regression example, he used: 1) a vague prior; 2) a sample cost function quadratic in the independent variables; and 3) a cost related to errors in prediction proportional to the residual sum of squares. For this example, he derived a closed form solution to the optimization problem. The decision theoretic approach overcomes some of the inadequacies in the classical approach to optimal design for linear models, but is a difficult computational problem in itself (Silvey, 1980).

Chapter 3

Related Forestry Literature

Forestry data sets often have a repeated measures component to them. Permanent sample plots are designed to capture repeated measurement data, through measuring the same set of trees at intervals through time. Site index data sets are traditionally comprised of stem analysis data, where the ring pattern is used to measure many past heights from the same tree. Taper equations are developed from repeated measurements up the bole of the same tree. In addition to chronological sequences, repeated measurement techniques can be used to analyze data sets where geographical proximity may lead to correlations in the data. Temporal correlations may also be exhibited in the data, through similarities in measurements from different plots in the same year.
Foresters have used a wide range of parameter estimation techniques in analyzing repeated measurement data. The random coefficient model, the multi-stage model and the joint maximum likelihood methods are all represented in the forestry literature. Random coefficient models have found limited use because of the necessity to have sufficient data to fit the model to each subject. Analytical ease has favoured the use of multi-stage models for dealing directly with combinations of serial, geographical, and temporal correlations. The relative ease comes from avoiding inversion of large matrices, a necessary part of the joint maximum likelihood methods. Joint maximum likelihood methods have seen only limited use in the past, because of the computational burden. Improvements in hardware and software are overcoming this impediment. There has been less interest in efficient sampling for regression data within forestry. The effort has been limited to selecting efficient sampling schemes for simple and non-linear regression under iid error structures, and in finding optimal measurement intervals for increment cores.

3.1 Structured covariance modelling in forestry

The first located reference to the problem of repeated measurements from PSP data was Buckman (1962). He recognised the correlation between errors in successive measurements from the same plot, and contemporaneous measurements on proximate plots. His analysis ignored the problem, but warned of biases in the variance estimates of the parameters and the possibility of liberal errors in the associated significance tests. The first suggested solution to the repeated measurements problem in PSPs was found in Leak (1966). He suggested an analysis known as a random coefficient model; it involved two steps. The first step required estimating regression coefficients independently for each plot.
In the second step, coefficient means and variance-covariances were calculated from the coefficients resulting from the first step. The second step coefficient means were used for prediction. The variance of the predictions was estimated as a linear combination of the calculated variance-covariances from the second step. Ferguson and Leech (1978) used an extension of the random coefficient model, known as the random function model. They estimated a yield model for stand volume using PSP remeasurement data. In the first step, regression coefficients were calculated independently for each plot. In the second step, another regression was developed to estimate the first stage coefficients using other variables. This second step made use of an Aitken-type estimator (with UN) to deal with the correlations among the coefficients from the first step. Newberry and Burkhart (1986) used a random function model to estimate a stem profile equation from multiple measurements on several trees. They used a non-linear equation at the first step, estimating the resulting coefficients with an Aitken-type estimator (with UN) at the second step. Random coefficient and random function models do not require an estimate of the error correlations among the repeated measurements. However, the approach does require sufficient remeasurements to estimate the model for each sample unit. This limitation has made the approach not generally applicable. Multi-stage analysis for repeated measurements in forestry has been a continual area of interest and work. It requires more specialized analytical techniques than random function models, but its more limited data requirements make it more generally useful. Gertner (1985) used an iterative Aitken-type estimator for predicting cumulative diameter yield using multiple measurements within several trees. An AR(1) error structure was selected as the result of analysis and estimated using least squares techniques.
He worked with a non-linear equation, but used a linear approximation for inferences about the variance-covariances of the resulting estimates. Gregoire (1987) used several modelling techniques to estimate a stand basal area equation. The techniques were used to estimate successive degrees of complication in the error components. The most complicated error structure included autocorrelations due to repeated measurements, heteroscedasticity, and contemporaneous correlations among plots measured in the same time period. He used a combination of transformations, simplifying the data, two-stage estimation, and maximum likelihood. His results indicated that no one error specification and estimation technique dominated the others. The OLS coefficient estimates performed the best under many measures. He concluded that the maximum likelihood estimate of a CS error structure was most acceptable, if inferences about the coefficients are of interest. Magnussen and Park (1991) developed a height-age equation from repeated measurements on several trees, using transformations and a two-stage method. The transformation was used to deal with heterogeneity; the coefficients were then estimated using an Aitken-type estimator, where the error structure (AR) was estimated using maximum likelihood. A joint maximum likelihood method was developed by Arner and Seegrist (1979). They wrote an implementation for a very specific model and data set, but suggested generalizations. They utilized a combination of Newton-Raphson, Fisher scoring and iterative Aitken for solving the maximum likelihood equations with a UN error structure to predict subsequent stand basal area from initial basal area. The possibility of using LRT to test the significance of variables was also discussed. Gregoire and others (1995) used a joint maximum likelihood method for developing a mixed effect model for stand basal area.
The "mixed" indicates using "random" covariates to model the error structure. (This is perhaps more familiar as "random" effects in contrast to "fixed" effects in ANOVA.) They suggested SAS, BMDP, and GAUSS as commercial products that are capable of estimating linear models with complicated error structures. LRT was used to test the incremental improvement of nested models, and AIC to otherwise rank models. Two examples were included, one for white pine (Pinus strobus L.) and another for Douglas-fir (Pseudotsuga menziesii (Mirb.) Franco). The final white pine model included an error structure with AR(1) and a random effects component for the inverse of the stand age; the inclusion of the latter component resulted in only a marginal improvement. The final Douglas-fir model included an AR component related to the square-root of age rather than age, and a random effects component for the inverse of the stand age. The inclusion of inverse-age as a random effect, as well as a fixed effect, allowed it to function as a random coefficient. Surprisingly, the appropriateness of an AR error structure was not generally tested against alternatives. This may be because most of the studies have used yield rather than growth as their dependent variable.

3.2 Implications to sampling for linear models in forestry literature

Little work has been published in forestry related to sampling designs for the estimation of models. No reference was located that discussed the implications of a complex error structure on sampling for PSPs. Demaerschalk and Kozak (1974 and 1975) argued for efficiency in sampling for simple linear regression. They illustrated the possibility of predicting confidence intervals prior to sampling, and developed estimates of the relative efficiencies of several different sampling schemes.
Gertner (1984) used a multiple linear regression approximation to predict the standard deviations of the parameters of a non-linear model prior to sampling. He assumed iid errors. Marshall and Demaerschalk (1986) extended the search for efficient sampling schemes to include the costs of gathering the samples. They also presented an iterative procedure for finding an efficient sampling scheme to meet predefined confidence intervals for a series of predictions in simple linear regression. Contrary to statistically optimal designs, the cost efficient design may include sampling inside the boundary of the sampling region. Penner (1989) extended the discussion into multiple linear regression with variable cost per sample and heteroscedasticity. This problem also required an iterative search procedure to find an efficient sampling scheme. It was not possible to generalize the increase in efficiency over a random or uniform sampling scheme. The relative efficiencies will be unique, depending on the design requirements and the relative costs. As in simple linear regression, sampling costs may have the effect of moving the optimal design from the boundary of the possible sampling region. Gertner (1985) used a linear approximation of a non-linear model to calculate the relative efficiencies resulting from different measurement intervals in developing a yield model from increment cores. He included an AR component in the error structure. Increments of five years were the most efficient. Measurements were so highly correlated at short intervals that 30 measurements at two-year intervals were just as efficient as 60 at one-year intervals.

Chapter 4

Application to PSP Sampling Schemes

Included in this chapter is a description and illustration of a method for testing PSP sampling schemes prior to sampling. The first step was to estimate the appropriate error structure for a simple growth model; this was done using a large data set and was identical for the two examples.
The next step was to propose alternative sampling schemes. The final step was to predict the precision of the model parameters resulting from the sampling scheme and the estimated error structure. Two examples are used in illustrating the method, utilizing the estimated error structure. In the absence of a utilitarian definition of optimality, it is impossible to meaningfully order alternative sampling schemes. The problem was side-stepped in the examples by comparing alternatives that resulted in equal precision. This was accomplished by sampling a similar distribution of the independent variables, and finding a sample size that equalized the precision. The first example explores the desirability of measuring a replacement set of independent PSPs against another remeasurement on the existing plots. The distribution of the independent variables was similar in the alternative schemes; only the dependence among the remeasurements and the number of plots were altered. The second example explored the desirability of lengthening the remeasurement period from five to 10 years. Again, the distribution of the independent variables was similar in the alternative schemes; only the remeasurement interval and number of plots were altered. The examples both used Douglas-fir (greater than 80% Douglas-fir by basal area) PSP data from the drier areas of eastern Vancouver Island. The data came from the MacMillan Bloedel Ltd. growth and yield program, representing approximately 200 plots with between four and seven measurements at five-year intervals. A total of 888 increments were represented in the data set.
A version of Schumacher's (1939) growth curve was selected for fitting, and increments were used as the basis for modelling:

ln(V_{i+1}) - ln(V_i) = ln(a_{i+1}/a_i) β₁ + (1/a_{i+1} - 1/a_i) β₂ + ε ,

where ln() indicates a natural logarithm, V_i represents the stand volume at measurement i, a_i represents the stand age at measurement i, the β's represent the coefficients to be estimated, and ε represents the error term. This model was selected for simplicity and familiarity. A graphical examination of the residuals confirmed their homogeneity, symmetry around zero, and lack of trend over the dependent variable.

4.1 Results of the error structure analysis

The first step in comparing the alternative sampling schemes was to estimate the error structure. The 888 increments were used to estimate a fixed effect model with a structured covariance matrix. The error structure was assumed to be block diagonal, with each block representing one plot. Error structures from separate plots were assumed to be independent and equal. This reduced the number of parameters to be estimated to a manageable number. The estimation was done through the joint maximum likelihood implementation in PROC MIXED (SAS Institute Inc., 1992). This procedure simultaneously estimated β and R for the specified model and error structure. Table 4.1 presents the resulting estimates for a correlation matrix of four remeasurements, under several alternative error structures. The correlations between error terms of adjacent increments were similar for the three non-iid error structures. In part, this was because of the weighting caused by the relative abundance of adjacent increments compared to non-adjacent increments. The AR(1) estimates of correlations for non-adjacent increments were very different from the TOEP and CS estimates, due to the forced geometric reduction in the correlations.
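The two regressors in this form follow directly from the measurement ages. A small sketch (the helper name is invented; the printed values correspond to a plot established at age 20 and remeasured at 5-year intervals):

```python
import math

# Sketch of the two regressors of the fitted increment model:
#   (1/a_{i+1} - 1/a_i)  and  ln(a_{i+1}/a_i)
def regressors(a0, a1):
    """Return (1/a1 - 1/a0, ln(a1/a0)) for a remeasurement from age a0 to a1."""
    return 1.0 / a1 - 1.0 / a0, math.log(a1 / a0)

for a0, a1 in [(20, 25), (25, 30), (30, 35)]:
    v1, v2 = regressors(a0, a1)
    print(round(v1, 4), round(v2, 3))
```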
Table 4.1: Estimated correlated error structures

Structure   Estimated correlations
TOEP        1    .33  .30  .24
            .33  1    .33  .30
            .30  .33  1    .33
            .24  .30  .33  1
AR(1)       1    .31  .10  .03
            .31  1    .31  .10
            .10  .31  1    .31
            .03  .10  .31  1
CS          1    .30  .30  .30
            .30  1    .30  .30
            .30  .30  1    .30
            .30  .30  .30  1
IID         1    0    0    0
            0    1    0    0
            0    0    1    0
            0    0    0    1

A function of the log likelihood (-2ll) and Akaike's information criterion (AIC) were calculated for each of the error structures. Probabilities for likelihood ratio tests (LRT) were calculated for nested error structures. The results are shown in Table 4.2. The AIC favoured the CS structure. The TOEP structure had a higher likelihood, but a worse AIC because of the penalty associated with additional parameters. The LRT showed no significant improvement in going from CS to TOEP. The AR(1) and IID were significantly worse than either the TOEP or CS structures.

Table 4.2: AIC and LRT significance tests for nested error structures

Structure   Parameters   -2ll     AIC      P(LRT) (Base)
TOEP        8            -2306    -2290    NS @ .1 (CS)
CS          2            -2300    -2296    <.005 (IID)
AR(1)       2            -2270    -2269    <.005 (IID)
IID         1            -2201    -2199

The CS structure was chosen as the best fit for the data. The error covariance structure implies a correlation of .30 between the errors of any pair of increments on a plot, regardless of the number of intervening measurements. Thus the 7th increment is just as correlated with the 1st increment as the 2nd. Meredith and Stehman (1991) suggested there is no biologically plausible mechanism to generate a CS error structure, but relied on appeals to common sense to support an AR structure. The CS structure has found implicit support in its continued use in univariate repeated measures ANOVA and as part of the error component model. While the CS fit is better than the AR(1), the TOEP had a higher likelihood (Table 4.2). TOEP shows a decreasing correlation between subsequent measurements, but not to the extent implied by AR(1) (Table 4.1).
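The LRT entries in Table 4.2 can be re-checked from the printed -2ll values alone. The sketch below writes out the chi-square survival function with stdlib math (erfc for one degree of freedom, the finite series for an even number) rather than a statistics package:

```python
import math

# Chi-square survival function, stdlib only, for the two cases needed here.
def chi2_sf(x, df):
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    assert df % 2 == 0               # even df: closed-form series
    h = x / 2.0
    return math.exp(-h) * sum(h ** k / math.factorial(k) for k in range(df // 2))

# CS vs TOEP: statistic -2300 - (-2306) = 6, with 8 - 2 = 6 extra parameters
p_toep = chi2_sf(-2300 - (-2306), 6)
# IID vs CS: statistic -2201 - (-2300) = 99, with 1 extra parameter
p_cs = chi2_sf(-2201 - (-2300), 1)
print(p_toep > 0.10, p_cs < 0.005)
```

These reproduce the table's "NS @ .1" verdict for TOEP over CS and the < .005 significance of CS over IID.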
This suggests that an intermediate structure, between the constant correlation of the CS and the geometric reductions of the AR(1), may be appropriate. Gregoire and others (1995) explored the possibility of using some transformation of time in an AR model. The confidence in these estimates has to be high, as they have been estimated from over 200 PSPs. In conjunction with this form of growth model, the CS structure is a better fit than the AR(1).

4.2 Simplified calculations for predicting efficiencies

Using the estimated covariance matrix from the repeated measurements, the efficiency of different sampling schemes can be estimated. The approximate variance for any arbitrary linear combination of β (L'β) is given by:

VAR(L'β) = L'CL ,

where C is the estimated covariance matrix for β:

C = (X'R⁻¹X)⁻¹ .

β is a column vector made up of an estimate of the p parameters. L is a column vector of length p, defining the linear combination of β of interest. X is an n by p matrix of independent variables from the proposed sampling scheme, where a total of n measurements are proposed to be taken from m plots. R is a square matrix of dimension n by n, containing estimates of the covariance structure; it has to be inverted to obtain C. C is a square matrix of dimension p. R can be a very large matrix. Inverting a large matrix can be difficult and expensive. However, if errors among plots are independent, inverting R is made simpler by its block diagonal structure. Let r_i be a typical block from R, containing the error structure for plot i. If there are four plots with three remeasurements each, then R has the following structure:

R = | r_1  0    0    0   |
    | 0    r_2  0    0   |
    | 0    0    r_3  0   |
    | 0    0    0    r_4 | ,

where each r_i is a 3 by 3 matrix representing the error structure among the remeasurements on plot i. (The r_i need not all be the same size.)
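Each CS block can, in fact, be inverted in closed form via the Sherman-Morrison identity. This is standard linear algebra rather than something from the thesis, but it shows why the per-block inversions are cheap:

```python
import numpy as np

# Closed-form inverse of a compound-symmetry block (variance a on the
# diagonal, covariance b off it), via Sherman-Morrison:
#   ((a-b)I + bJ)^-1 = (1/(a-b)) * (I - (b/(a + (n-1)b)) J),  J = all-ones.
a, b, n = 0.0050, 0.0015, 3
r = (a - b) * np.eye(n) + b * np.ones((n, n))
r_inv = (np.eye(n) - (b / (a + (n - 1) * b)) * np.ones((n, n))) / (a - b)
print(np.allclose(r @ r_inv, np.eye(n)))   # check against the definition
print(round(r_inv[0, 0], 1), round(r_inv[0, 1], 1))
```

With the estimated variance .0050 and covariance .0015, this gives diagonal and off-diagonal inverse entries of about 232.1 and -53.6, matching Table 4.3 below to its printed rounding.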
The inverse of R can be easily obtained by inverting each block as follows:

R⁻¹ = | r_1⁻¹  0      0      0     |
      | 0      r_2⁻¹  0      0     |
      | 0      0      r_3⁻¹  0     |
      | 0      0      0      r_4⁻¹ | .

The calculation of C can be further simplified. Let x_i be a typical set of measurements from plot i, containing the values of the independent variables for each remeasurement. X has the following structure:

X = | x_1 |
    | x_2 |
    | x_3 |
    | x_4 | ,

where each x_i has size 3 by 2, representing three remeasurements of two independent variables. C can then be easily calculated as:

C = ( Σ_i x_i' r_i⁻¹ x_i )⁻¹ .

Thus, taking advantage of the block diagonal structure of the matrices, the inversion of large matrices can be avoided. The calculations necessary for the examples found in this chapter could be performed within most spreadsheet packages.

4.3 Example sample scheme representing plot replacement

Once the error covariance matrix has been estimated, it can be used to predict the confidence intervals resulting from alternative sampling schemes. As an example, the following calculations compare sampling Scheme 1, with 150 plots each with three increments, to Scheme 2, which combines 150 plots with two increments and 111 plots with one increment. The remeasurements are five years apart. One third of the three- and two-increment plots begin at each of 20, 40 and 60 years of age. One third of the one-increment plots begin at each of 30, 50 and 70 years of age. Sampling costs are based on $1200 for establishment and $800 for a remeasurement. Comparing these two schemes will test the desirability of supplementing the two increments on the original 150 plots with one increment measured on 111 new plots against the desirability of measuring the third increment on the original plots. These particular sample designs have very similar distributions of independent variables. This made it possible to find sample sizes that resulted in essentially equivalent designs.
The blocks of R, and their inverses, for the assumed CS structure are presented in Table 4.3. The structure is expressed as correlations in Table 4.1 and as covariances in Table 4.3. These blocks are 3 by 3 for three increments, 2 by 2 for two increments, and 1 by 1 for one increment. The R for Scheme 1 is made of 150 blocks of the 3 by 3 r. The R for Scheme 2 is made of 150 blocks of the 2 by 2 r and 111 blocks of the 1 by 1 r.

Table 4.3: Covariance structure (r and r⁻¹) for 3, 2 and 1 increments

INCREMENTS   r                         r⁻¹
3            .0050  .0015  .0015       233    -53.7  -53.7
             .0015  .0050  .0015      -53.7    233   -53.7
             .0015  .0015  .0050      -53.7   -53.7   233
2            .0050  .0015              221    -66.0
             .0015  .0050             -66.0    221
1            .0050                     201

The values of the two independent variables that make up the blocks of X for the two schemes are shown in Table 4.4. The X for Scheme 1 is made up of 150 blocks of three remeasurements, 50 from each of the age groups. The X for Scheme 2 is made up of 150 blocks of two remeasurements and 111 blocks of one remeasurement, divided equally among the age groups.

Table 4.4: Independent variables (x) for 3, 2 and 1 remeasurements. Each pair of columns gives (1/a_{i+1} - 1/a_i) and ln(a_{i+1}/a_i).

INCREMENTS   LOWER AGES       MIDDLE AGES      OLDER AGES
3            -.0100   .223    -.0028   .118    -.0013   .080
             -.0068   .182    -.0022   .105    -.0011   .074
             -.0048   .154    -.0018   .095    -.0010   .069
2            -.0100   .223    -.0028   .118    -.0013   .080
             -.0068   .182    -.0022   .105    -.0011   .074
1            -.0048   .154    -.0018   .095    -.0010   .069

The full-sized X and R can be used to calculate X'R⁻¹X, or the simplified calculations can be used. In the simplified calculations, X'R⁻¹X for the first scheme is equal to the sum, over the three age classes, of 50 times the x'r⁻¹x of the three-increment blocks:

X'R⁻¹X = 50 | .0232   -.555 | + 50 | .00208  -.0941 | + 50 | .000481  -.0317 |
            | -.555    13.7 |      | -.0941    4.29 |      | -.0317     2.09 |

       = | 1.29    -34.1 |
         | -34.1    1004 | .
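As a numerical check (the values are typed in from Tables 4.3 and 4.4, so agreement with the printed sum is only to table-rounding accuracy), Scheme 1's X'R⁻¹X can be computed both by the full-matrix route and by the blockwise shortcut:

```python
import numpy as np

# Scheme 1: 50 three-increment plots in each of three age classes.
r = np.full((3, 3), .0015) + np.diag([.0035] * 3)   # CS block of Table 4.3
blocks = [np.array(x) for x in (
    [[-.0100, .223], [-.0068, .182], [-.0048, .154]],   # lower ages
    [[-.0028, .118], [-.0022, .105], [-.0018, .095]],   # middle ages
    [[-.0013, .080], [-.0011, .074], [-.0010, .069]],   # older ages
)]
# Blockwise route: sum of small products, no large inversion
total = sum(50 * (x.T @ np.linalg.inv(r) @ x) for x in blocks)
# Full-matrix route: 450 x 2 X and 450 x 450 block-diagonal R
X = np.vstack(blocks * 50)
R = np.kron(np.eye(150), r)
full = X.T @ np.linalg.inv(R) @ X
print(np.allclose(total, full))
print(abs(total[0, 0] - 1.29) < .02, abs(total[0, 1] + 34.1) < .2)
```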
The X'R⁻¹X for the second scheme is equal to the sum of 50 times the x'r⁻¹x of the two-increment blocks for each age class and 37 times the x'r⁻¹x of the one-increment blocks for each age class:

X'R⁻¹X = 50 | .0231   -.542 | + 50 | .00198  -.0873 | + 50 | .000443  -.0286 |
            | -.542    12.9 |      | -.0873    3.85 |      | -.0286     1.83 |

       + 37 | .00456  -.148 | + 37 | .000664  -.0348 | + 37 | .000182  -.0132 |
            | -.148     4.77 |     | -.0348     1.83 |      | -.0132     .957 |

       = | 1.48    -40.1 |
         | -40.1    1209 | .

The resulting predictions of C, the covariance matrix of the estimated parameters, are calculated as (X'R⁻¹X)⁻¹. Table 4.5 shows the results.

Table 4.5: Predicted covariance matrix (C) of the parameters for Schemes 1 and 2

SCHEME 1 (150 x 3 incr)    SCHEME 2 (150 x 2 + 111 x 1 incr)
| 6.84   .230   |          | 6.80   .225   |
| .230   .00873 |          | .225   .00827 |

The parameter covariance matrices of Schemes 1 and 2 are nearly identical, resulting in the same precision in their respective estimates. Scheme 1 utilizes 450 increments, while Scheme 2 utilizes only 411. Scheme 2 is therefore statistically more efficient per increment. However, Scheme 1 costs $540,000 to get the 450 increments, while Scheme 2 costs $642,000 to get its 411 increments. Scheme 1 is therefore more cost efficient. The cost efficiency of remeasuring the same PSP overwhelmed the increased statistical efficiency of independent increments. However, for modelling purposes, repeated measurements from the same PSP do not represent an irreplaceable asset. If a PSP is lost for some reason, a reasonable cost recovery would be the cost of its last remeasurement and the first measurement of a replacement.
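The increment counts and dollar figures for the two schemes follow directly from the stated $1200 establishment and $800 remeasurement costs; the helper below is a sketch with an invented name:

```python
# Each increment after establishment requires one $800 remeasurement.
def scheme_cost(plot_groups, establish=1200, remeasure=800):
    """plot_groups: list of (number_of_plots, increments_per_plot) pairs."""
    increments = sum(n * k for n, k in plot_groups)
    cost = sum(n * (establish + k * remeasure) for n, k in plot_groups)
    return increments, cost

print(scheme_cost([(150, 3)]))            # Scheme 1: 450 increments
print(scheme_cost([(150, 2), (111, 1)]))  # Scheme 2: 411 increments
```

This reproduces the $540,000 and $642,000 totals quoted above.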
4.4 Example sample scheme representing skipped measurements

As a further example, the following calculations compare sampling Scheme 3, 150 plots with a 5-year and a subsequent 10-year increment, to Scheme 4, which combines 141 plots with three 5-year increments and nine plots with one 5-year increment. One third of the initial measurements begin at each of 20, 40 and 60 years of age. The costs are based on an establishment cost of $1200 and a remeasurement cost of $800. Comparing these two schemes tests the trade-off of skipping an intermediate measurement on all the plots against making all the measurements on a subset of the plots. Scheme 3 begins with 150 plots with a 5-year increment; it continues by skipping a measurement, ending up with a subsequent 10-year increment on all 150 plots. Scheme 4 also begins with 150 (141 + 9) plots, but nine are dropped while the other 141 continue to be measured for two additional 5-year increments. These particular sample designs have very similar distributions of independent variables. This made it possible to find sample sizes that resulted in essentially equivalent designs. The resulting estimates of C, the covariance matrix of the estimated parameters, are calculated as in the first example, as (X'R⁻¹X)⁻¹. Table 4.6 shows the results.

Table 4.6: Predicted covariance matrix (C) of the parameters for Schemes 3 and 4

SCHEME 3 (150 5+10 yr incr)    SCHEME 4 (141 x 3 + 9 x 1 incr)
| 7.16   .240   |              | 7.15   .239   |
| .240   .00901 |              | .239   .00903 |

The parameter covariance matrices of Schemes 3 and 4 are nearly identical, resulting in the same precision in their respective estimates. Scheme 3 utilizes 300 increments, while Scheme 4 utilizes 432. Scheme 3 is therefore statistically more efficient per increment. Also, Scheme 3 costs only $420,000, while Scheme 4 costs $525,600. Scheme 3 is therefore also more cost efficient.
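The same cost arithmetic applies here, with one wrinkle: a Scheme 3 plot is visited three times (establishment plus two remeasurements) but yields only two increments, so remeasurements and increments are counted separately in this sketch:

```python
# Cost = plots * (establishment + remeasurements * $800), per the stated rates.
def cost(plots, remeasures, establish=1200, remeasure=800):
    return plots * (establish + remeasures * remeasure)

incr3, cost3 = 150 * 2, cost(150, 2)               # Scheme 3: 2 increments/plot
incr4, cost4 = 141 * 3 + 9 * 1, cost(141, 3) + cost(9, 1)   # Scheme 4
print(incr3, cost3)
print(incr4, cost4)
```

This reproduces the 300 increments at $420,000 and 432 increments at $525,600 quoted above.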
In a choice between skipping a measurement or remeasuring a subset of the PSPs, skipping a measurement provides both increased statistical and cost efficiency. However, there are no new data for a 10-year period; this will result in prolonged use of the model based on the first set of increments.

Chapter 5

General Discussion

This chapter presents a generalization of the conclusions from the examples of the last chapter, and suggests other questions about PSP sampling schemes that could be addressed. The techniques in this thesis were based on model building being the ultimate use of PSP data. PSPs may also be used as a basis for an inventory or for monitoring long-term changes in growth patterns. Even though these objectives might alter the relative efficiency of different PSP sampling schemes, they are not dealt with in this chapter. The efficiency of alternative PSP sampling schemes was illustrated in two examples. In the first, the cost efficiency of remeasuring the same PSP overwhelmed the increased statistical efficiency of independent increments. However, the statistical contribution of a PSP could be replaced with a last remeasurement and the first measurement of a replacement. For some reason in growth and yield, PSPs representing many years of growth are valued disproportionately to their contributions to model development. The only statistical advantage to a long history of measurements is the assurance of covering the whole of a weather or growth cycle. However, this long history of measurement need not come from one plot. It can come from a series of shorter-lived, independent plots. In the second example, a longer remeasurement interval was shown to be preferable to dropping plots. This conclusion has to be qualified with a cautionary comment about the timeliness of the application of the resulting model. The statistical efficiency defined here ignores the benefit of having improved decisions from the intermediate data.
A 5-year remeasurement schedule provides the opportunity to improve decision making every five years, rather than the longer period associated with a 10-year cycle. The extra costs incurred with shorter remeasurement schedules may be balanced by more timely improvements in decision making. Several questions would be interesting to address with these techniques, in addition to those dealt with in the examples. However, some questions do not seem suited to this treatment. Sample size should not be decided without the application of a decision theoretic approach. Sample size defines the precision of the model, and an appropriate level of precision cannot be defined without relating the precision in the model to the costs of imperfect decisions. As well, the distribution of the independent variables should only be cautiously explored with this technique, because of its tendency to suggest sampling at boundaries. Sampling at the boundaries precludes testing the form of the model. This leaves questions about the PSPs themselves. PSP plot size is a question that can be addressed with this technique. Different plot sizes would introduce different error variances and correlations. Smaller plots would be cheaper to install, but would likely have higher error variances and covariances (though likely lower error correlations). The related question of plot clusters could also be explored. Presumably, as plots get further apart they become less correlated. The analysis would need estimates of variance and covariance for different plot sizes, and the relationship between inter-plot distance and covariance. Travel and measurement costs would then define the optimum combination of plot size, cluster size, and distance between plots. Optimal remeasurement intervals could be defined. A remeasurement interval could be based on the condition of a PSP. It may be efficient to measure slow growing PSPs less often than fast growing ones.
It may also be more efficient to measure small PSPs more frequently than large ones. Guidelines could be developed defining efficient remeasurement intervals. The technique developed in this thesis has a general utility in answering some of the questions about PSP sampling that have never been critically examined.

Chapter 6

Summary

The objective of this thesis was to find a method for comparing the statistical efficiency of alternative sampling schemes for PSPs. My interest in the subject followed from the pioneering forestry application of Demaerschalk and Kozak (1974) in sampling for simple linear regression. This introduced me to the idea of testing the relative efficiencies of sampling schemes prior to sampling. A utilitarian measure of PSP sampling efficiency would measure the resulting improvement in meeting forest management objectives. Estimating this impact requires understanding two relationships: 1) the relationship between a sampling scheme and the precision of the resulting model; and 2) the cost of that level of precision on the resulting management decisions. The latter was elusive; the theory of optimal sampling design for linear models seemed unsuccessful in providing a meaningful definition of relative efficiency. The usual criteria are divorced from the use of the model. The only satisfactory ranking came from the trivial case of one scheme dominating the other in all predictions. PSP data have the added problem of coming in sequentially and over a prolonged time span. A Bayesian decision theory approach contains the comprehensive theory necessary to define an optimal sampling scheme, but requires a complete system dynamics model of the forest management process. This larger problem was outside the scope of this thesis. The theory of estimating general linear models was successfully employed to predict the confidence bounds for model parameters resulting from any PSP sampling scheme.
Structured error components were necessary to model the error correlations between successive increments on the same plot. Error component models and random coefficient models were both considered and rejected. Error components represent only one error structure, where it seemed advisable to test the suitability of several alternatives. Random coefficient models circumvent the necessity of estimating the correlations among measurements on the same plot, making it impossible to explore the question of interest.

Joint maximum likelihood estimation was successful in simultaneously estimating a specified growth equation and error structure. This technique allowed for the ranking of alternative models using Akaike's Information Criterion and the formal testing of nested models with Likelihood Ratio Tests. Multi-stage approaches to parameter estimation were considered and rejected: while efficient in estimating the parameters, they lacked the advantage of providing information for model selection. I feel they have been popular only because of the computational burden of joint maximum likelihood methods.

In the particular situation dealt with herein, an error structure of equal correlation among all increments on a PSP was found to be far superior to the more usual first-order autoregressive structure. This surprising result may be due to the use of growth rather than the more usual yield, or possibly the autoregressive structure has been uncritically adopted.

One example explored the cost of replacing existing PSPs with new ones. New plots contributed statistical efficiency through their independence, but required two new measurements to measure an increment; existing plots required only one. The cost savings of measuring an increment on an existing PSP overwhelmed the statistical efficiency of the new plot.
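The error-structure comparison described above (compound symmetry versus first-order autoregressive, ranked by maximum likelihood and AIC) can be sketched as follows. The data are simulated, and a simple grid search over the covariance parameters stands in for a full joint ML fit of growth equation and error structure:

```python
import numpy as np

def cs_cov(m, sigma2, rho):
    """Compound-symmetric covariance: equal correlation among increments."""
    return sigma2 * ((1 - rho) * np.eye(m) + rho * np.ones((m, m)))

def ar1_cov(m, sigma2, rho):
    """First-order autoregressive covariance: correlation decays with lag."""
    idx = np.arange(m)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

def profile_loglik(resids, cov_fn):
    """Maximize the Gaussian log-likelihood over (sigma2, rho) by grid search."""
    best = -np.inf
    n, m = resids.shape
    for rho in np.linspace(0.0, 0.95, 20):
        for sigma2 in np.linspace(0.2, 3.0, 30):
            V = cov_fn(m, sigma2, rho)
            _, logdet = np.linalg.slogdet(V)
            Vinv = np.linalg.inv(V)
            quad = np.einsum('ij,jk,ik->', resids, Vinv, resids)
            ll = -0.5 * (n * m * np.log(2 * np.pi) + n * logdet + quad)
            best = max(best, ll)
    return best

# Simulate 150 plots x 3 increments under equal correlation 0.6.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(cs_cov(3, 1.0, 0.6))
resids = rng.standard_normal((150, 3)) @ L.T

results = {}
for name, fn in [("compound symmetry", cs_cov), ("AR(1)", ar1_cov)]:
    ll = profile_loglik(resids, fn)
    results[name] = -2 * ll + 2 * 2   # AIC: two covariance parameters each
    print(f"{name}: AIC = {results[name]:.1f}")
```

With both structures carrying two covariance parameters, the AIC ranking here reduces to the likelihood ranking; a likelihood ratio test would apply only if one structure were nested in the other.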
A reasonable cost recovery, for the destruction of an existing plot, is a last measurement on the existing PSP and an establishment measurement on a new PSP (in similar conditions).

In a second example, skipping a measurement on a series of plots was shown to be preferable to maintaining a remeasurement schedule on a subset of plots. This conclusion assumes the delay in having new information is not crucial.

The objective of this thesis was met in finding a method for comparing the statistical efficiency of alternative sampling schemes for PSPs. While the larger problem of defining an optimal sampling scheme for PSPs still has to be addressed, this thesis presents a new piece of the puzzle.

Chapter 7

Literature Cited

Arner, S.L. and D.W. Seegrist. 1979. A computer program for the maximum likelihood estimator of the general multivariate linear model with correlated errors. USDA For. Serv. Gen. Tech. Rep. NE-51. 10pp.

Bozdogan, H. 1987. Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika. 52:345-370.

Brooks, R.J. 1972. A decision theory approach to optimal regression design. Biometrika. 59:563-571.

Buckman, R.E. 1962. Growth and yield of red pine in Minnesota. USDA For. Serv. Tech. Bull. 1272. 50pp.

Cochran, W.G. 1977. Sampling Techniques, 3rd edition. John Wiley and Sons, New York. 428pp.

Demaerschalk, J.P. and A. Kozak. 1974. Suggestions and criteria for more effective regression sampling. Can. J. For. Res. 4:341-348.

Demaerschalk, J.P. and A. Kozak. 1975. Suggestions and criteria for more effective regression sampling. 2. Can. J. For. Res. 5:496-497.

Diggle, P.J. 1988. An approach to the analysis of repeated measurements. Biometrics. 44:959-971.

Ferguson, I.S. and J.W. Leech. 1978. Generalized least squares estimation of yield functions. For. Sci. 24:27-42.

Finney, D.J. 1990. Repeated measurements: what is measured and what repeats? Stat. Med. 9:639-644.

Fuller, W.A. and G.E. Battese. 1974. Estimation of linear models with crossed-error structure. J. Econometrics. 2:67-78.

Gertner, G.Z. 1984. Approximating the precision of the parameter estimates of a nonlinear model prior to sampling. For. Sci. 30:836-841.

Gertner, G.Z. 1985. Efficient nonlinear growth model estimation: its relationship to measurement interval. For. Sci. 31:821-826.

Gregoire, T.G. 1987. Generalized error structure for forest yield models. For. Sci. 33:423-444.

Gregoire, T.G., O. Schabenberger and J.P. Barrett. 1995. Linear modelling of irregularly spaced, unbalanced, longitudinal data from permanent-plot measurements. Can. J. For. Res. 25:137-156.

Gumpertz, M. and S.G. Pantula. 1989. A simple approach to inference in random coefficient models. J. Amer. Stat. Assoc. 43:203-210.

Henderson, C.R. 1953. Estimation of variance and covariance components. Biometrics. 9:226-252.

Henderson, C.R. 1971. Comments on "The use of error components models in combining cross section with time series data". Econometrica. 39:397-401.

Jennrich, R.I. and M.D. Schluchter. 1986. Unbalanced repeated measures models with structured covariance matrices. Biometrics. 42:805-820.

Judge, G.G., W.E. Griffiths, R.C. Hill, H. Lutkepohl and T.C. Lee. 1985. The theory and practice of econometrics. Second edition. John Wiley and Sons, New York. 1019pp.

Kendall, M., Stuart, A. and J.K. Ord. 1983. The advanced theory of statistics. Volume 3. Fourth edition. Charles Griffin and Company Limited, London. 780pp.

Laird, N.M. and J.H. Ware. 1982. Random-effects models for longitudinal data. Biometrics. 38:963-974.

Leak, W. 1966. Analysis of multiple systematic remeasurements. For. Sci. 12:69-73.

Magnus, J.R. 1978. Maximum likelihood estimation of the GLS model with unknown parameters in the disturbance covariance matrix. J. Econometrics. 7:281-312.

Magnussen, S. and Y.S. Park. 1991. Growth-curve differentiation among Japanese larch provenances. Can. J. For. Res. 21:504-513.

Marshall, P.L. and J.P. Demaerschalk. 1986. A strategy for efficient sample selection in simple linear regression problems with unequal per unit sampling costs. For. Chron. 84:16-19.

Meredith, M.P. and S.V. Stehman. 1991. Repeated measures experiments in forestry: focus on analysis of response curves. Can. J. For. Res. 21:957-965.

Newberry, J.D. and H.E. Burkhart. 1986. Variable-form stem profile models for loblolly pine. Can. J. For. Res. 16:109-114.

Oberhofer, W. and J. Kmenta. 1974. A general procedure for obtaining maximum likelihood estimates in generalized regression models. Econometrica. 42:579-590.

Parks, R.W. 1967. Efficient estimation of a system of regression equations when disturbances are both serially and contemporaneously correlated. J. Amer. Stat. Assn. 62:500-509.

Pazman, A. 1986. Foundations of optimal experimental design. D. Reidel Publishing Company, Dordrecht, Holland. 228pp.

Penner, M. 1989. Optimal design with variable cost and precision requirements. Can. J. For. Res. 19:1591-1597.

Rosenberg, B. 1973. Linear regression with randomly dispersed parameters. Biometrika. 60:65-72.

SAS Institute Inc. 1992. SAS Technical Report P-229, SAS/STAT software: changes and enhancements, release 6.07. Cary, NC: SAS Institute Inc. 620pp.

Schumacher, F.X. 1939. A new growth curve and its application to timber yield studies. J. For. 37:819-820.

Silvey, S.D. 1980. Optimal designs. Chapman & Hall, London. 86pp.

Wallace, T.D. and A. Hussian. 1969. The use of error components models in combining cross section with time series data. Econometrica. 37:55-77.

Wolfinger, R. 1993. Covariance structure selection in general mixed models. Commun. Statist.-Simula. 22:1079-1106.