"Science, Faculty of"@en . "Statistics, Department of"@en . "DSpace"@en . "UBCV"@en . "Chui, Grace Shung-Lai"@en . "2009-02-12T17:06:00Z"@en . "1996"@en . "Master of Science - MSc"@en . "University of British Columbia"@en . "One can imagine a possible loss of parameter estimation efficiency when response correlation\r\nis ignored or misspecified in modeling a response-covariate relationship. Under\r\nwhat conditions is efficiency lost? How much is lost?\r\nWhether the responses are correlated or independent, standard theory for the distribution\r\nof least squares parameter estimates in linear models (Gaussian responses) can\r\nbe readily determined. We find that the linear regression analysis assuming independent\r\nresponses is (theoretically) never more efficient than that incorporating response\r\ndependence. The \"difference\" in efficiencies between these two analyses \u00E2\u0080\u0094 measured by\r\nhow much more readily the latter detects a non-zero regression coefficient \u00E2\u0080\u0094 generally\r\nincreases as the coefficient-to-noise ratio increases.\r\nTo incorporate response correlation in G LM parameter estimation, Liang &; Zeger\r\n(1986) extended the quasi-likelihood theory and developed the generalized estimating\r\nequations (GEE) approach. Despite being a popular method, the effects of misspecifying\r\nresponse correlation (e.g. assuming independence when responses are correlated) on\r\nparameter estimation efficiency using GEE are not obvious. To investigate such effects,\r\nwe use simulation studies in which we generate count data and use the GEE approach to\r\nestimate the model parameters, using both the correct and misspecified correlation structures.\r\nThe generated counts, the number of correlated responses in each cluster/replicate,\r\nand the total number of replicates are all small to imitate health impact studies (in which\r\nhospital admission counts are often the responses). Despite possible loss of parameter\r\nestimation efficiency due to such \"obstacles\" intrinsic in the model, simulation results\r\nindicate that the GEE approach produces' \r\n1. regression parameter estimates with relatively small empirical biases using either a\r\ncorrect or misspecified response correlation;\r\n2. a good estimate of the response correlation matrix if its structure is correctly specified;\r\n3. naive and robust variance estimates both of which estimate the true variance well\r\nwhen the response correlation structure is correctly specified; and\r\n4. good robust variance estimates even when the response correlation is misspecified.\r\nFurthermore, in a G LM with exchangeably correlated Poisson data and no covariates,\r\nspecifying independence or exchangeable dependence yields the same intercept estimate\r\nand estimation efficiency, provided that inference is based on the robust variance estimate.\r\nThe naive variance estimate can significantly underestimate the true variance if\r\nthe responses are assumed independent when analyzing such a GLM."@en . "https://circle.library.ubc.ca/rest/handle/2429/4521?expand=metadata"@en . "6024464 bytes"@en . "application/pdf"@en . "Effect of Misspecif ied Response Cor re la t ion in Regression Analys i s by Grace Shung-Lai Chiu B.Sc. 
University of British Columbia, 1994

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, Department of Statistics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
June 1996
© Grace S. Chiu, 1996

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Statistics
The University of British Columbia
Vancouver, Canada

Abstract

One can imagine a possible loss of parameter estimation efficiency when response correlation is ignored or misspecified in modeling a response-covariate relationship. Under what conditions is efficiency lost? How much is lost?

Whether the responses are correlated or independent, standard theory for the distribution of least squares parameter estimates in linear models (Gaussian responses) can be readily determined. We find that the linear regression analysis assuming independent responses is (theoretically) never more efficient than that incorporating response dependence. The "difference" in efficiencies between these two analyses, measured by how much more readily the latter detects a non-zero regression coefficient, generally increases as the coefficient-to-noise ratio increases.

To incorporate response correlation in GLM parameter estimation, Liang & Zeger (1986) extended the quasi-likelihood theory and developed the generalized estimating equations (GEE) approach. Although GEE is a popular method, the effects of misspecifying the response correlation (e.g. assuming independence when responses are correlated) on parameter estimation efficiency are not obvious. To investigate such effects, we use simulation studies in which we generate count data and use the GEE approach to estimate the model parameters, using both the correct and misspecified correlation structures. The generated counts, the number of correlated responses in each cluster/replicate, and the total number of replicates are all small to imitate health impact studies (in which hospital admission counts are often the responses). Despite possible loss of parameter estimation efficiency due to such "obstacles" intrinsic in the model, simulation results indicate that the GEE approach produces:

1. regression parameter estimates with relatively small empirical biases using either a correct or misspecified response correlation;

2. a good estimate of the response correlation matrix if its structure is correctly specified;

3. naive and robust variance estimates, both of which estimate the true variance well when the response correlation structure is correctly specified; and

4. good robust variance estimates even when the response correlation is misspecified.
Furthermore, in a GLM with exchangeably correlated Poisson data and no covariates, specifying independence or exchangeable dependence yields the same intercept estimate and estimation efficiency, provided that inference is based on the robust variance estimate. The naive variance estimate can significantly underestimate the true variance if the responses are assumed independent when analyzing such a GLM.

Table of Contents

Abstract ... ii
List of Tables ... vii
List of Figures ... viii
Acknowledgements ... ix

1 Introduction ... 1

2 Multiple Linear Regression ... 6
  2.1 Introduction ... 6
  2.2 Literature Review ... 8
  2.3 Notation ... 11
  2.4 Which estimator is appropriate? ... 12
  2.5 The Exchangeable Model ... 14
  2.6 Common Spatial Correlation Structures ... 21
  2.7 The Simulation Experiments ... 23
    2.7.1 Generating Correlated Responses ... 25
    2.7.2 Where are Naive and Enlightened? ... 28
    2.7.3 Simulation Results ... 29
  2.8 How bad is Naive's analysis? ... 43
    2.8.1 Empirical Power Curves ... 44
    2.8.2 Histograms of p-value Ratios and p-value Scatter Plots ... 51
    2.8.3 The case of a chosen $\Sigma_\epsilon \approx I$ ... 64
  2.9 Discussion ... 66
  2.10 CHAPTER APPENDIX ... 69

3 Introduction to Multiple Regression of Count Data ... 71
  3.1 Literature Review ... 73
  3.2 GEE in S-Plus ... 76
  3.3 Some Notation and Theory in GEE ... 77
  3.4 Terminology ... 83

4 Simulations of Multiple Regression of Count Data - Part One ... 85
  4.1 Generating Count Data and their (Random) Covariates ... 85
  4.2 Multiple Regression via Generalized Estimating Equations ... 89
    4.2.1 GEE in the Conditional Poisson Model ... 89
  4.3 Computer Simulations of Multiple Regression of Count Data ... 91
    4.3.1 Simulations with Exchangeable and AR(1) Correlation Structures of the Covariates ... 93
    4.3.2 Detailed Results for Simulations with Exchangeable and AR(1) Covariates ... 98
    4.3.3 Additional observations ... 106
    4.3.4 Other Simulations ... 107
    4.3.5 Concluding Remarks ... 110
  4.4 CHAPTER APPENDIX ... 111
    4.4.1 Fitting an intercept $\beta_0$ in the GLM ... 111
    4.4.2 Examples of calculating (marginal) Cor(Y) with a specified $\Sigma$ ... 111

5 Simulations of Multiple Regression of Count Data - Part Two ... 114
  5.1 Generating Correlated Poisson Responses ... 114
    5.1.1 The Link Function ... 116
  5.2 Model Fitting using the GEE approach ... 116
  5.3 The Simulations ... 119
  5.4 The Naive and Robust Variances for $\hat\beta_0$ and $\hat\beta_1$ ... 121
    5.4.1 More on the simulation results ... 127
  5.5 Concluding Remarks ... 129

6 Discussion ... 131

Bibliography ... 133

List of Tables

2.1 An example of a p-value contingency table for $\beta_1$ ... 25
2.2 Results of Simulation 1, with AR(1) response dependence ... 30
2.3 Results of Simulation 2, with AR(1) response dependence ... 31
2.4 Results of Simulation 3 ... 35
2.5 Results of Simulation 4 ... 36
2.6 Results of Simulation 5 ... 39
2.7 Results of Simulation 6 ... 40
4.8 Results of selected simulations ... 97
4.9 Sample S.D.'s and approximate Relative S.E.'s of coefficient estimates in the independence model in selected simulations ... 99
4.10 p-value comparisons of Simulation 3 in Table 4.8 ... 101
4.11 p-value comparisons of Simulation 4 in Table 4.8 ... 101
4.12 p-value comparisons of Simulation 7 in Table 4.8 ... 103
4.13 p-value comparisons of Simulation 8 in Table 4.8 ... 103
4.14 p-value comparisons of a re-run of Simulation 3 in Table 4.8, this time without fixing R ... 105
4.15 Final GEE R as K increases ... 108
5.16 S.E. estimate comparison based on the first 50 simulated datasets ... 128
List of Figures

2.1 Empirical Power Curves for the GLS and naive OLS analyses ... 45
2.2 More Empirical Power Curves for the GLS and naive OLS analyses ... 49
2.3 Histogram Sketches of logged p-value Ratios and their Scatter Plots: $\beta_1$ varies ... 55
2.4 Histogram Sketches of logged p-value Ratios and their Scatter Plots: $\sigma$ varies ... 58
2.5 Histogram Sketches of logged p-value Ratios and their Scatter Plots: $\beta_1/\sigma = 1$ ... 60
2.6 Histogram Sketches of logged p-value Ratios and their Scatter Plots: $\beta_1/\sigma = 1/2$ ... 62
2.7 Histogram Sketches of logged p-value Ratios and their Scatter Plots: $\beta_1/\sigma = 2$ ... 63
2.8 Histogram and Scatter Plots with Variogram $\Sigma_\epsilon$: $\sigma$ varies ... 65
4.9 Boxplots of estimates of intraclass correlation coefficient (in R) with various sample sizes, K ... 109

Acknowledgements

I would like to express my most sincere thanks to my two supervisors, Dr. Nhu Le and Dr. James Zidek, for their patience and guidance throughout the past twelve months. It has been an honor and a wonderful experience working for them.

I would also like to thank Dr. John Petkau for being my mentor for the past four years. His guidance, support, and numerous pieces of advice will be eternally appreciated.

Many thanks to the rest of the faculty members for being kind and helpful throughout my Master's program; to the Statistics Department office staff for always being helpful and patient; to Paige Eveson for her friendship, patience, help with my writing and MVN r.v. generator, and for being the best officemate one can ever have; and to the rest of the Statistics graduate students for sharing both tough and wonderful times with me.

Chapter 1

Introduction

Researchers are often interested in the relationship between a response variable and explanatory variables (covariates). In many health care studies, for instance, the response may be some measure of health status of a population, and the covariates may be indices of different pollution types to which the population is exposed.

The relationship between response and covariates can be determined through regression analysis. Which regression method to use depends on the nature of the response variable. For example, linear regression via least squares estimation can be applied to response data which can be assumed to follow a normal distribution. When we can no longer assume Gaussian response data, we may sometimes use a generalized linear model (GLM) for the response-covariate relationship. Such a model may be appropriate, for instance, when the response data are discrete and consist of small counts, such as hospital admission counts as a measure of health status of a population.

Regardless of the regression method, researchers should be very careful in fitting the model when any correlation possibly exists among the response data. For example, people living in neighboring geographical regions may tend to be at similar risk of developing respiratory ailments due to similar environmental factors shared by these regions. Efficient statistical inference is essential to, say, ensure appropriate allocation of government funding for controlling pollutant levels. Thus, introducing response correlation into the statistical model when studying the health impacts of pollution may be necessary. (This is an example of spatial correlation among the responses.) However, some researchers in the past would not incorporate response correlation when modeling a response-covariate relationship.
This was sometimes due to the belief that the only source of variability came from independent and identically distributed (i.i.d.) measurement errors (see Cressie, 1991, page 25). It might also be due to the complexity of incorporating response correlation into some non-linear statistical models such as a GLM.

Using data which resemble those in health care studies, in this thesis we investigate the effect of misspecified response correlation on the efficiency of parameter estimation in both linear and generalized linear regression analyses. For example, ignoring response correlation is a form of correlation misspecification. The effect of such misspecification in linear regression is closely investigated in Chapter 2. We then turn our attention to GLM's in the following chapters. Our GLM's involve small Poisson count responses which resemble hospital admission counts observed from neighboring regions in studies of rare diseases. We use the generalized estimating equations (GEE) approach for fitting our GLM's, which allows the specification of a working response correlation structure (see page 80). The effect of ignoring response correlation in a GLM is studied in Chapter 5, while Chapter 4 examines the effect of specifying some non-independence structure for independent responses.

Simulation experiments are used in our investigation. Each experiment involves generation of many datasets (responses and covariates), regression analyses using the true and incorrect response correlation structures respectively for each dataset, and the (empirical) efficiency comparison between the two analyses. We also include analytical arguments to support some of our simulation results.

We measure (empirical) efficiency by how often a regression analysis rejects the null hypothesis of a zero regression coefficient in a simulation experiment; that is, by the sample probability of hypothesis rejection by the analysis in the simulation experiment. Rejecting such a hypothesis is equivalent to detecting a non-zero coefficient in the regression model. Naturally, we say that the analysis that detects an existing non-zero coefficient more readily is more efficient. However, efficiency in this context is directly influenced by the variability of the coefficient estimate produced by the analysis. If the estimate has a very large standard error, then the test statistic for the above hypothesis test will be very small, the null hypothesis will be retained, and hence the analysis will be inefficient (not very powerful) for detecting this coefficient. For example, we find that in a linear regression, the generalized least squares estimator (see Section 2.3) of a coefficient often has much smaller variability than the ordinary least squares estimator. Thus, the analysis based on the former estimator (which incorporates the response correlation) is more efficient/powerful than that based on the latter (which assumes response independence). We also find that an analysis generally increases in its power to detect a non-null regression coefficient as the coefficient-to-noise ratio increases, where the noise is the standard deviation of the (zero-mean) random errors in the model.

The GLM coefficient estimates produced by the GEE approach have a known asymptotic Gaussian distribution. To assess the efficiency of a GLM analysis, we can compute the naive and robust variance estimates (see Section 3.3) for the GEE coefficient estimates based on this asymptotic distribution.
However, each simulation experiment involves a finite number of data replicates (see Section 3.3). Thus, the asymptotic variance estimates may not estimate the true variances well. The assessment of efficiency is then affected by how good (close to the true variances) these variance estimates are. In particular, an analysis that produces a smaller but severely biased variance estimate may not be more efficient than another that produces a larger but unbiased variance estimate. Yet how do we know if an estimate is unbiased?

Unlike the least squares estimates in a linear model, the exact distribution of a GEE coefficient estimate with a finite number of replicates is typically unavailable. However, given enough datasets in a simulation experiment, we may use the sample variance of the GEE estimates (each dataset produces one vector of coefficient estimates) as the "gold standard" for the true variance. In a simulation experiment, for each analysis (using correct/incorrect response correlation), we first assess how good the naive and robust variance estimates are by comparing them to this gold standard. Then we compare the efficiencies of the analyses based on their naive and robust estimates respectively. Finally, considering how well the true variance is estimated, we can conclude which analysis based on which variance estimate is more efficient. For example, we find that for the GLM in Chapter 4, the analyses using the correct and misspecified response correlations respectively yield good coefficient estimates (that is, estimates with small empirical biases) and good robust variance estimates. Thus, both analyses based on their robust variance estimates have similar efficiencies. (We do not examine analyses based on the naive variance estimates in this chapter.)

For the GLM in Chapter 5, assuming either independent responses (incorrect) or exchangeably correlated responses (correct) yields the same GEE coefficient estimate and robust estimate of its variance. (Technically, specifying a working correlation structure is not an assumption in model fitting. However, we will use the word "assume" to reflect the belief that the response data have the correlation structure as specified by the working correlation. See Section 3.4 for more details.) Both are good estimates of the true parameters. However, the naive estimate can considerably underestimate the true variance when assuming response independence. On the other hand, the naive estimate is good if the response correlation structure is correctly specified. In other words, the analysis assuming either independent or exchangeable responses based on the robust variance estimate yields the same efficiency. The latter analysis based on the naive variance estimate yields similar efficiency to that of these two analyses based on the robust estimate. However, assuming independent responses and using the naive variance estimate yields an incorrect and thus inefficient analysis, because the variance of the coefficient estimate is incorrect/underestimated.

Chapter 2

Multiple Linear Regression

2.1 Introduction

Suppose we are interested in the relationship between some response variable and g explanatory variables, or covariates. The objective is to estimate the regression parameters associated with each covariate.
Suppose we believe that the response-covariate relationship is linear. In other words, we believe that the response variable can be expressed as a linear combination of the covariates, with the coefficients in the linear combination being the regression parameters/coefficients. Then we may estimate these parameters via least squares linear regression. If we measure r sets of the response variable and the covariates, with r > g + 1 (see Section 2.3), least squares parameter estimates in such a regression can then be explicitly expressed in terms of the r responses, their covariances (if they are correlated), and the observed covariates.

In many real life situations, the measurements of the response variable of interest may not be independent. For example, in a clinical trial to determine the effects of a drug, we measure some response on each patient several times over the period of the trial. Thus, repeated measurements on the same subject may be correlated over time.

In this thesis, we are more interested in spatially correlated response variables. Suppose we take a measurement of a single response from each of several sites within the same geographical domain of interest. The measurements (of the same response) across the different sites may be correlated when they come from proximate sites. Correlation among the measurements can be induced by similar environmental factors acting across the geographical domain. One such factor would be the air pollution level. Our scenario might well be realized in health care studies, where the impact of air pollution on human health is measured through some response variable on a human subject from each site, together with pollution and meteorological covariates.

Researchers may overlook the correlation in the responses. In particular, they often believe that i.i.d. measurement errors constitute the only source of response variability (see Cressie, 1991, page 25). Furthermore, it may be tempting to assume that the geographical sites are so far apart that correlation among the responses may be ignored without affecting the inferences about regression parameters. However, the strength of the correlations may depend not only on the distance between the sites, but also on the similarity of their shared environmental factors.

In this chapter, we concentrate on potential dangers which may arise in multiple linear regression if one overlooks the correlation among the response measurements. We will assume that the covariances of the responses are known. That assumption will not usually be realistic, in which case the covariance matrix must be estimated from the observed data. (See Johnson & Wichern, 1982 for details on how regression inference changes when the covariance matrix is estimated.) However, when accounting for correlated responses in regression, we expect that the conclusions based on estimated parameters will not change significantly if the correlation is well estimated. We only compare conclusions based on (1) assumed response independence to (2) admitted response dependence in parameter estimation. We are not interested in the possible efficiency loss in parameter estimation due to a poorly estimated response covariance matrix.

Our investigation uses simulation studies. In each simulation, we generate realizations of model (2.1) (see Section 2.3). With each realization/dataset, the regression parameters are estimated with and without taking response dependence into account.
Note that in all our simulation experiments, we fit the correct regression model (i.e. the model with the correct mean response structure) with each dataset. Thus, if the realizations of model (2.1) are generated, say, with an intercept and two slope coefficients, then the fitted model will contain an intercept and two slope coefficients. On the other hand, if the realizations are generated with no intercept and only one slope coefficient in model (2.1), then the fitted model will contain solely one slope regression parameter. We do not fit models that do not coincide with the true models to investigate the impact of factors such as an incorrectly specified mean response structure. We only consider the impact on inference due to response dependence.

2.2 Literature Review

Numerous publications address multiple linear regression. Many of them, such as Weisberg, 1985, Johnson & Wichern, 1982, and Marascuilo & Levin, 1983, are undergraduate text books. Most discuss how to account for correlated responses in regression parameter estimation. However, they usually do not investigate the impact of ignoring such correlation.

Advanced books and articles have been written since the 1960's on the loss of efficiency in linear regression parameter estimation when response dependence is ignored. For example, Plackett, 1960, Rao, 1965, Anderson, 1971, and Martin, 1982 all mention that, when the responses are dependent, the matrix $A - B$ is always non-negative definite, $A$ being the covariance matrix of the regression parameter estimate vector under the independent response model, and $B$ that under the model with a correctly specified response correlation matrix. (Also mentioned is the unbiasedness of the parameter estimates despite ignoring response dependence.) In other words, admitting the known covariance of the responses in regression analysis yields parameter estimates which are at least as efficient as those produced by ignoring response dependence.

A natural question arises: "Under what conditions can response dependence be ignored without loss of efficiency?" Some of these conditions are stated in Plackett, 1960 (as an end-of-chapter exercise); Rao, 1965; Zyskind, 1967; McElroy, 1967; Anderson, 1971; Seber, 1977; Martin, 1982; and Cressie, 1991. Some of these articles supply necessary and sufficient conditions for the equivalence of $A$ and $B$. One of these conditions, less technical than the others, is when the responses have an exchangeable correlation structure. This is explained in Section 2.5.

However, none of these articles mentions the fact that, when an investigator overlooks the response dependence, s/he naturally does statistical inference based on parameter estimates whose standard errors are calculated as if the responses were truly independent. We then have three covariance matrices for two parameter estimate vectors:

1. the (true) covariance matrix of the correct parameter estimates (considering the known response dependence);

2. the true covariance matrix of the incorrect parameter estimates (obtained while ignoring response dependence); and

3. the incorrect covariance matrix of the incorrect parameter estimates (both obtained assuming the responses are uncorrelated).
We can think of the first covariance matrix as that computed by the "enlightened" statistician, who knows the true covariance matrix of the responses and obtains his/her parameter estimates and their covariances taking account of this knowledge. The third covariance matrix is that computed by the "naive" statistician, who does not know the responses are correlated, and thus obtains his/her parameter estimates, as well as their covariances, without accounting for response dependence. Finally, the second covariance matrix is that computed by Enlightened for Naive: Enlightened knows Naive's covariance matrix is incorrect, and does Naive a favor by computing a correct one for him/her.

For reasons given in Section 2.3, we refer to the least squares parameter vector estimate assuming response dependence as the GLS estimator. We call the one obtained ignoring response dependence the OLS estimator. The first covariance in the above list is thus the covariance matrix of the GLS estimator. Let us call it the GLS covariance. We call the true covariance matrix of the OLS estimator the true OLS covariance. This is the one given in the second item of the above list. Finally, we call the third covariance matrix in the list the naive OLS covariance, for obvious reasons.

In this chapter, we will investigate the loss of efficiency resulting from using the OLS instead of the GLS estimator, by essentially comparing the GLS covariance to the true OLS covariance, as well as the GLS covariance to the naive OLS covariance, with different covariate (design) matrices and regression parameter values. (For reasons given in Section 2.7, we calculate p-values for hypothesis tests based on these covariances. We then compare the p-values, and hence functions of the covariances, instead of directly comparing the covariance matrices.) In this way, we hope to understand the loss of efficiency in parameter estimation when ignoring response dependence. At the same time, we also try to understand what the effects on parameter estimation are when Naive bases his/her inference on his/her naive parameter estimates and covariances. We will base our investigation on computer simulations implemented in the statistical software S-Plus.

Later in the chapter, we address the question of how to model spatial correlation. A few such models are presented in Cressie, 1991, and Basu and Reinsel, 1994.

2.3 Notation

Suppose the spatial domain of interest consists of r regions. In an investigation, we observe in region i the response (dependent) variable $Y_i$, $i = 1, 2, \ldots, r$. We then decide to include in our regression model, say, g covariates (explanatory variables) $x_{i1}, \ldots, x_{ig}$ for each region i, together with an intercept $\beta_0$ and a slope $\beta_j$ for the jth covariate, $j = 1, 2, \ldots, g$. In matrix notation, we may then write our model as

$$ Y = X\beta + \epsilon \qquad (2.1) $$

where

$$ X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1g} \\ 1 & x_{21} & x_{22} & \cdots & x_{2g} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{r1} & x_{r2} & \cdots & x_{rg} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_g \end{pmatrix}, \qquad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_r \end{pmatrix}. $$

(The intercept $\beta_0$ and the column of 1's in the design matrix $X$ are of course omitted if we decide not to fit an intercept in the above model.) It is often convenient to assume that $X$ is of full rank, and that rank$(X) = g + 1 < r$. We also assume that the expected value and the correlation matrix of the error vector $\epsilon$ are respectively $E(\epsilon) = 0$ and $\mathrm{Cor}(\epsilon) = \Sigma_\epsilon$, where $\Sigma_\epsilon$ is symmetric and positive definite. Assuming that the variance of each $\epsilon_i$ is $\sigma^2$, the covariance matrix of $\epsilon$ is then $\mathrm{Cov}(\epsilon) = \sigma^2 \Sigma_\epsilon$.
A simple case of (2.1) obtains when $\Sigma_\epsilon = I$, the identity matrix. In this case, the best linear unbiased estimator (BLUE) of the parameter $\beta$ is simply the ordinary least squares (OLS) estimator

$$ \hat\beta^{OLS} = (X^T X)^{-1} X^T Y \qquad (2.2) $$

with covariance matrix

$$ \mathrm{Cov}(\hat\beta^{OLS}) = \sigma^2 (X^T X)^{-1}. \qquad (2.3) $$

With a known $\Sigma_\epsilon \ne I$, however, the BLUE for $\beta$ is the generalized least squares (GLS) estimator

$$ \hat\beta^{GLS} = (X^T \Sigma_\epsilon^{-1} X)^{-1} X^T \Sigma_\epsilon^{-1} Y \qquad (2.4) $$

with covariance matrix

$$ \mathrm{Cov}(\hat\beta^{GLS}) = \sigma^2 (X^T \Sigma_\epsilon^{-1} X)^{-1}. \qquad (2.5) $$

(See Seber, 1977, page 49 for derivation of the BLUE's.) The covariance formulas for these BLUE's are derived in standard textbooks on regression theory.

Recall the notion of the naive and enlightened statisticians. In the case $\Sigma_\epsilon \ne I$, Enlightened uses the estimator in (2.4) to estimate the regression parameters. S/he uses the formula in (2.5) to calculate $\mathrm{Cov}(\hat\beta^{GLS})$. On the other hand, since Naive assumes $\Sigma_\epsilon = I$ in all situations, s/he uses the estimator in (2.2) and the covariance formula in (2.3) in his/her analyses. However, if the responses are truly not independent, i.e. $\Sigma_\epsilon \ne I$, then the true OLS covariance is actually

$$ \mathrm{Cov}(\hat\beta^{OLS}) = \mathrm{Cov}[(X^T X)^{-1} X^T Y] = (X^T X)^{-1} X^T \mathrm{Cov}(Y) [(X^T X)^{-1} X^T]^T = \sigma^2 (X^T X)^{-1} (X^T \Sigma_\epsilon X)(X^T X)^{-1}. \qquad (2.6) $$

Thus, we see that, in general, the covariance matrices in (2.3), (2.5), and (2.6) will differ.

2.4 Which estimator is appropriate?

In principle, when the responses are known to be correlated, we should always account for their correlation in fitting the model. However, sometimes $\Sigma_\epsilon$ may be difficult to determine (or estimate). Other times, it is believed that the correlation among the responses is too weak to make much difference in the inference. Also, we can easily see from (2.2) and (2.4) that both $\hat\beta^{OLS}$ and $\hat\beta^{GLS}$ are unbiased for $\beta$ regardless of whether $\Sigma_\epsilon = \mathrm{Cor}(\epsilon) = \mathrm{Cor}(Y)$ is the identity matrix. In fact, the expected value of $\hat\beta^{OLS}$ does not depend on the value of $\Sigma_\epsilon$. For reasons such as these, some investigators may ignore the presence of response correlation. However, the following theorem indicates a potential danger in using $\hat\beta^{OLS}$ to estimate $\beta$ when $\Sigma_\epsilon \ne I$.

Theorem 1. $\mathrm{Cov}(\hat\beta^{OLS}) - \mathrm{Cov}(\hat\beta^{GLS})$ is always non-negative definite.

Rao, 1965 presents a proof of this theorem. Here, we are referring to the true OLS covariance as $\mathrm{Cov}(\hat\beta^{OLS})$.

In other words, although both $\hat\beta^{OLS}$ and $\hat\beta^{GLS}$ are unbiased for $\beta$, Theorem 1 states that the former is never "better" than the latter in that there is at least as much variability in $\hat\beta^{OLS}$ as in $\hat\beta^{GLS}$. When the matrix in Theorem 1 is positive definite, the parameter estimation based on $\hat\beta^{OLS}$ is lower in efficiency (with the true OLS covariance) than that based on $\hat\beta^{GLS}$. We will show some simulation results later in the chapter to further illustrate and quantify this point.

When is the matrix in Theorem 1 a zero matrix, then? That is, when are the GLS covariance and the true OLS covariance equal? Two obvious conditions for the two covariances to be equal are:

1. $\hat\beta^{OLS} = \hat\beta^{GLS}$, and in particular,

2. $\Sigma_\epsilon = I$.

The second condition is not of much interest to us. We are interested in situations where $\hat\beta^{OLS}$ is as efficient as $\hat\beta^{GLS}$ while the responses are not independent.
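Theorem 1 and the three covariance formulas can be checked numerically. The following is a minimal S-Plus/R-style sketch, not part of the thesis; the design matrix, $\sigma^2$, and the AR(1)-type correlation below are arbitrary illustrative choices.

    # Compare the covariances in (2.3), (2.5), and (2.6) for a small example.
    r <- 5
    X <- cbind(1, c(1.4, 1.5, 3.9, 2.1, 2.8))      # intercept plus one covariate
    sigma2 <- 1
    rho <- 0.6
    Sigma.eps <- rho^abs(outer(1:r, 1:r, "-"))      # AR(1)-type correlation matrix

    naive.OLS <- sigma2 * solve(t(X) %*% X)                        # formula (2.3)
    GLS <- sigma2 * solve(t(X) %*% solve(Sigma.eps) %*% X)         # formula (2.5)
    A <- solve(t(X) %*% X) %*% t(X)
    true.OLS <- sigma2 * A %*% Sigma.eps %*% t(A)                  # formula (2.6)

    # Theorem 1: true.OLS - GLS should be non-negative definite,
    # i.e. all of its eigenvalues should be >= 0 (up to rounding error).
    eigen(true.OLS - GLS)$values

In examples like this one, the three matrices are visibly different, and the eigenvalues of the difference are non-negative, as Theorem 1 requires.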
Necessary and sufficient conditions for $\hat\beta^{OLS} = \hat\beta^{GLS}$ are presented in Rao, 1967; Zyskind, 1967; and McElroy, 1967. However, most of these conditions are very technical and difficult to verify. Thus, we believe investigators should always use $\hat\beta^{GLS}$ to estimate $\beta$ whenever the responses are known to be correlated. However, one realistic exception to the need to use $\hat\beta^{GLS}$ stems from the exchangeable structure, also known as the intraclass correlation structure, often met in analyses of repeated measurements, and sometimes in spatially correlated data. (The exception concerns the estimator itself; it does not mean that we can use the naive OLS covariance in place of the GLS covariance.) With such structure, fitting an intercept in model (2.1) will entail the actual equality of $\hat\beta^{OLS}$ and $\hat\beta^{GLS}$.

2.5 The Exchangeable Model

An exchangeable model for correlation structure assumes that the correlation matrix of the errors is

$$ \mathrm{Cor}(\epsilon) = \Sigma_\epsilon = \begin{pmatrix} 1 & \rho & \rho & \cdots & \rho \\ \rho & 1 & \rho & \cdots & \rho \\ \vdots & & \ddots & & \vdots \\ \rho & \rho & \cdots & \rho & 1 \end{pmatrix} = (1-\rho)I + \rho J \qquad (2.7) $$

where $J$ is a matrix with all elements equal to 1, and $0 < \rho < 1$. This last inequality guarantees that $\Sigma_\epsilon$ is positive definite for any $r$. For a given $r$, we can replace this inequality by $-1/(r-1) < \rho < 1$ (see Graybill, 1983, page 215).

Theorem 2. Assume that we fit an intercept in the regression model (2.1). Then $\hat\beta^{OLS} = \hat\beta^{GLS}$ for all $Y$ if and only if $\Sigma_\epsilon$ has the form $\Sigma_\epsilon = (1-\rho)I + \rho J$.

To prove Theorem 2, we need the following lemma.

Lemma 1. Let the matrix $W$ have the form $W = aI + bP$, where $P$ is an $r \times r$ matrix such that $P^2 = tP$, with $t > 0$. Then

$$ W^{-1} = \frac{1}{a} I - \frac{b}{a(a+bt)} P. $$

Proof.

$$ W^{-1} W = \left[ \frac{1}{a} I - \frac{b}{a(a+bt)} P \right](aI + bP) = I + \frac{b}{a}P - \frac{b}{a+bt}P - \frac{b^2 t}{a(a+bt)}P = I + \frac{b}{a(a+bt)}\left[(a+bt) - a - bt\right]P = I. $$

Similarly, $W W^{-1} = I$, and hence the result. $\square$

Proof of Theorem 2. (This proof is an elaboration of the one by McElroy, 1967.) Note that

$$ \hat\beta^{OLS} = \hat\beta^{GLS} \iff (X^T X)^{-1} X^T Y = (X^T \Sigma_\epsilon^{-1} X)^{-1} X^T \Sigma_\epsilon^{-1} Y \iff (X^T \Sigma_\epsilon^{-1} X)(X^T X)^{-1} X^T Y = X^T \Sigma_\epsilon^{-1} Y. \qquad (*) $$

Therefore, proving Theorem 2 is equivalent to proving that (*) holds for all $Y$ if and only if $\Sigma_\epsilon = (1-\rho)I + \rho J$.

Now, (*) automatically holds if $Y$ is in the range space of $X$, i.e. if $Y = XZ$ for some $Z$. This is so because then, in (*),

$$ \mathrm{LHS} = (X^T \Sigma_\epsilon^{-1} X)(X^T X)^{-1}(X^T X)Z = X^T \Sigma_\epsilon^{-1}(XZ) = X^T \Sigma_\epsilon^{-1} Y = \mathrm{RHS}. $$

Therefore, we now need only to consider those $Y$ in the null space of $X$, i.e. those $Y$ such that $X^T Y = 0$.

($\Leftarrow$) Assume $\Sigma_\epsilon = (1-\rho)I + \rho J$. Denote the column vector of 1's as $\mathbf{1}$. Then $J = \mathbf{1}\mathbf{1}^T$, so that $J^2 = \mathbf{1}(\mathbf{1}^T \mathbf{1})\mathbf{1}^T = r\mathbf{1}\mathbf{1}^T = rJ$. Since $r > 0$, by Lemma 1 (with $a = 1-\rho$, $b = \rho$, $t = r$), we have

$$ \Sigma_\epsilon^{-1} = \frac{1}{1-\rho} I - \frac{\rho}{(1-\rho)(1-\rho+\rho r)} J = \frac{1}{1-\rho}\left[ I - \frac{\rho}{1+(r-1)\rho} J \right]. $$

Now, let $y$ be in the null space of $X$, i.e.

$$ 0 = X^T y = \begin{pmatrix} \mathbf{1}^T \\ x_1^T \\ \vdots \\ x_g^T \end{pmatrix} y $$

where $x_j^T = (x_{1j}, x_{2j}, \ldots, x_{rj})$. Then, in particular, $\mathbf{1}^T y = 0$. With $y$ in place of $Y$ in (*), we now have

$$ \mathrm{LHS} = (X^T \Sigma_\epsilon^{-1} X)(X^T X)^{-1} X^T y = 0 \quad (\text{since } X^T y = 0) $$

and

$$ \mathrm{RHS} = X^T \Sigma_\epsilon^{-1} y = \frac{1}{1-\rho}\left[ X^T y - \frac{\rho}{1+(r-1)\rho} X^T \mathbf{1}(\mathbf{1}^T y) \right] = 0 = \mathrm{LHS} \quad (\text{since } X^T y = 0 \text{ and } \mathbf{1}^T y = 0). $$

Thus, (*) holds for all $y$ such that $X^T y = 0$. Hence, $\hat\beta^{OLS} = \hat\beta^{GLS}$.

($\Rightarrow$) Suppose $\hat\beta^{OLS} = \hat\beta^{GLS}$, so that (*) holds for all $Y$. Let $y$ be such that $X^T y = 0$. Hence, by (*),

$$ X^T \Sigma_\epsilon^{-1} y = (X^T \Sigma_\epsilon^{-1} X)(X^T X)^{-1} X^T y = 0. \qquad (2.8) $$

Now, fix $i, j, k$ such that $1 \le i, j, k \le r$ and $i \ne j, k$. Let $e_i$ denote the elementary vector all of whose co-ordinates are 0's except the $i$th being 1. In turn, let $U$ be an $r \times (g+1)$ matrix of the form

$$ U = (\mathbf{1}, e_i, v_3, \ldots, v_s) $$

where the set of distinct elementary vectors $v_3, \ldots, v_s$ is chosen such that none of $v_3, \ldots, v_s$ equals $e_i$, $e_j$, or $e_k$. (That is, the vectors $v_3, \ldots, v_s$, together with any of $e_i$, $e_j$, or $e_k$, form a basis.) Then $U$ has the specified form of the design matrix $X$ in model (2.1), and we can substitute $U$ for $X$ in (2.8). Now,

$$ U^T(e_j - e_k) = \begin{pmatrix} \mathbf{1}^T \\ e_i^T \\ v_3^T \\ \vdots \\ v_s^T \end{pmatrix} (e_j - e_k) = 0, $$

since $\mathbf{1}^T(e_j - e_k) = 1 - 1 = 0$ and none of $e_i, v_3, \ldots, v_s$ equals $e_j$ or $e_k$. This implies that the vector $(e_j - e_k)$ is in the null space of $U$. Thus, by (2.8), we have

$$ U^T \Sigma_\epsilon^{-1}(e_j - e_k) = 0. \qquad (2.9) $$

Write the $(i,j)$th element of $\Sigma_\epsilon^{-1}$ as $(\Sigma_\epsilon^{-1})_{ij} = \sigma^{ij}$. The second row of $U^T \Sigma_\epsilon^{-1}$ is $e_i^T \Sigma_\epsilon^{-1} = (\sigma^{i1}, \sigma^{i2}, \ldots, \sigma^{ir})$, so the second component of (2.9) gives $\sigma^{ij} = \sigma^{ik}$. Since $i$, $j$, and $k$ were arbitrary (subject to $i \ne j, k$), all off-diagonal elements of $\Sigma_\epsilon^{-1}$ within a row are equal and, by the symmetry of $\Sigma_\epsilon^{-1}$, equal to a common value across rows. Moreover, the first component of (2.9), $\mathbf{1}^T \Sigma_\epsilon^{-1}(e_j - e_k) = 0$, says that the $j$th and $k$th column sums of $\Sigma_\epsilon^{-1}$ are equal, which, combined with the equality of the off-diagonal elements, forces the diagonal elements to share a common value as well. Hence $\Sigma_\epsilon^{-1}$ has the form $\alpha I + \gamma J$ for some constants $\alpha$ and $\gamma$; applying Lemma 1 to this matrix shows that $\Sigma_\epsilon$ also has this form, and since $\Sigma_\epsilon$ is a correlation matrix (unit diagonal), it must be $(1-\rho)I + \rho J$. $\square$
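Theorem 2 is also easy to confirm numerically. Below is a small S-Plus/R-style check, illustrative only; the dimensions, $\rho$, and the randomly drawn covariates and responses are arbitrary choices, not taken from the thesis.

    # With an intercept column in X and an exchangeable Sigma.eps,
    # the OLS and GLS estimates coincide for any response vector y.
    r <- 6; rho <- 0.6
    X <- cbind(1, rnorm(r), rnorm(r))               # intercept + 2 covariates
    Sigma.eps <- (1 - rho) * diag(r) + rho * matrix(1, r, r)
    y <- rnorm(r)                                   # any y works, per Theorem 2

    beta.ols <- solve(t(X) %*% X) %*% t(X) %*% y
    Si <- solve(Sigma.eps)
    beta.gls <- solve(t(X) %*% Si %*% X) %*% t(X) %*% Si %*% y

    max(abs(beta.ols - beta.gls))                   # zero up to rounding error

Dropping the intercept column from X, or perturbing one off-diagonal entry of Sigma.eps, makes the two estimates differ, in line with the "only if" direction of the theorem.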
                              GLS
                     p < 0.05    p > 0.05
    OLS, Σ_ε = I
      p < 0.05
      p > 0.05
    OLS, Σ_ε known
      p < 0.05
      p > 0.05

Table 2.1: An example of a p-value contingency table for $\beta_1$.

2.7.1 Generating Correlated Responses

The (model) parameters in model (2.1) include $\sigma^2$, $\Sigma_\epsilon$, and $\beta$. In a simulation experiment, we need to specify all of these parameters, together with the covariate matrix $X$, in order to obtain the response vector $Y$. In one simulation study of n realizations of model (2.1), we would first specify $X$, the design/covariate matrix, and the regression coefficient vector $\beta$. Instead of arbitrarily "inventing" these quantities every time we ran a simulation experiment, we decided to have a computer-implemented procedure which could be used to generate both $X$ and $\beta$. In this way, we would only need to specify a limited number of program arguments instead of each of the $r \times g$ covariate values and $g + 1$ coefficient values.

The procedure for obtaining the design matrix and the regression coefficients is presented in Section 2.10, the CHAPTER APPENDIX. This procedure requires the specification of $\mu_x$ and $\Sigma_x$, the common expectation and covariance matrix of the g covariate vectors. These were truly arbitrarily chosen. For example, we sometimes took $\mu_x = 0$, and other times $\mu_x = (1, 2, 3)^T$ for a model with an intercept and two slope coefficients. $\Sigma_x$ was taken to be the identity in some cases. We also tried $\Sigma_x$'s which have exchangeable structures.

Once $X$ and $\beta$ were generated, they were used to generate all the n $Y$'s from model (2.1). In fact, we did not generate a new $X$ and $\beta$ in every simulation. Once generated, the same $X$ and $\beta$ were sometimes re-used in other simulations, varying only $\Sigma_\epsilon$ and $\sigma^2$. The choices of $\Sigma_\epsilon$ included:

1. ...;

2. ..., for $i, j = 1, 2, \ldots, r$;

3. a correlation matrix, similar to the above, with its $(i,j)$th element equal to

   (a) 1 if $i = j$,
   (b) 0.8 if $|i - j| = 1$,
   (c) 0.45 if $|i - j| = 2$,
   (d) 0.15 if $|i - j| = 3$, and
   (e) 0 if $|i - j| > 3$,

   for $i, j = 1, 2, \ldots, r$;

4. an exchangeable correlation matrix with an intraclass correlation coefficient of 0.6 or 0.9; and

5. the variogram computed according to equation (2.12), with a matrix of site coordinates whose two columns were generated independently from a Uniform(25, 50) and a Uniform(100, 110) distribution respectively. (Again, the choice of the moments was arbitrary.)

Define $\epsilon^{(s)}$ to be the sth sample of errors, $s = 1, 2, \ldots, n$, generated from $\epsilon \sim N(0, \sigma^2 \Sigma_\epsilon)$.
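The thesis's own generating module is part of the Section 2.10 appendix and is not reproduced here, but the standard construction is to scale i.i.d. normal draws by a matrix square root of the covariance. A hedged S-Plus/R-style sketch, assuming the Cholesky route (the function name gen.responses is ours, not the thesis's):

    # Draw n samples of epsilon ~ N(0, sigma^2 * Sigma.eps) via the
    # Cholesky factor, then form the responses Y = X beta + epsilon
    # of model (2.1).
    gen.responses <- function(n, X, beta, sigma2, Sigma.eps) {
      r <- nrow(Sigma.eps)
      L <- chol(sigma2 * Sigma.eps)    # upper triangular: t(L) %*% L = sigma^2 Sigma.eps
      Y <- matrix(0, r, n)
      for (s in 1:n) {
        eps <- t(L) %*% rnorm(r)       # one correlated error sample
        Y[, s] <- X %*% beta + eps
      }
      Y                                # r x n; column s is the sth realization
    }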
                                        GLS
                           β₀               β₁               β₂
                      p<0.05  p>0.05   p<0.05  p>0.05   p<0.05  p>0.05
    OLS, Σ_ε = I
      p < 0.05          11       0       12       0       11       0
      p > 0.05           2     187        3     185        1     188
    OLS, Σ_ε known
      p < 0.05          13       1       15       0       12       0
      p > 0.05           0     186        0     185        0     188

Table 2.2: Results of Simulation 1, with AR(1) response dependence.

Program input variables:
    r = 4,  g = 2,  n = 200,  μ_x = (1, 2, 3)ᵀ,  Σ_x: exchangeable, intraclass ρ = 0.6,
    σ² = 1,  Σ_ε: AR(1), lag-1 correlation = 0.6,  fit intercept? yes

Program output:

        ( 1   1.44   2.80 )
    X = ( 1   1.52   3.09 ),    β = (-0.875, 0.375, 0.375)ᵀ
        ( 1   3.86   4.59 )
        ( 1   2.14   2.82 )

                                     OLS estimates              GLS estimates
                                  β̂₀      β̂₁      β̂₂        β̂₀      β̂₁      β̂₂
    sample mean of n estimates  -0.544   0.563   0.137     -0.602   0.494   0.197
    sample variance of n est.    6.02    1.70    2.49       5.73    1.47    2.27
    true variance of each est.   6.36    1.57    2.33       6.24    1.40    2.20

                                        GLS
                           β₀               β₁               β₂
                      p<0.05  p>0.05   p<0.05  p>0.05   p<0.05  p>0.05
    OLS, Σ_ε = I
      p < 0.05          12       0       14       0       12       0
      p > 0.05           2     186        0     186        3     185
    OLS, Σ_ε known
      p < 0.05          14       3       14       2       15       2
      p > 0.05           0     183        0     184        0     183

Table 2.3: Results of Simulation 2, with AR(1) response dependence.

Simulations 1 and 2 also served to determine the approximate computational time needed for larger scale computer simulations. However, technicalities such as the computational efficiency of our simulation program and the capacity of the computer system on which the simulation experiments were run are beyond the scope of this thesis.

Simulations 1 and 2 were run mostly with the same program input variables. One difference lay in $\Sigma_x$, the covariance matrix for generating the covariates and the regression coefficients. However, only in Simulation 1 did we discard the regression coefficients generated by the corresponding module and replace them with the vector $(1, 0, 0)^T$. We did this to examine how often the GLS or OLS methods falsely detected non-zero coefficients. Results of Simulation 1 are shown in Table 2.2. Those of Simulation 2 are shown in Table 2.3.

First, let us examine the results of Simulation 1. We notice that the true variance of each OLS regression coefficient estimate is larger than that of the corresponding GLS estimate. This illustrates the result presented in Theorem 1. However, the covariate matrix generated from the program yielded very similar OLS and GLS estimates, and hence similar (true and sample) variances.

Notice that both the OLS and the GLS estimates for the intercept are substantially smaller than the true intercept (= 1). The empirical biases may be due to chance, as suggested by the large variances of the intercept estimates. Nevertheless, in most regression analyses, the analyst's prime interest lies not in the intercept of the model, but in the slope coefficients. S/he would be concerned about how much each unit change in a covariate would affect the response. This change is measured by the corresponding slope estimate and its variability. The estimates for the slope coefficients seem quite accurate.

Now, to assess the consistency (see page 24) of the methods of Enlightened and Naive, we examine the off-diagonal values of each 2 × 2 p-value contingency table. First, we see that the consistency ratio wrt the naive OLS analysis for the intercept is 2:0. That is, Naive (in the naive OLS analysis) twice (out of 200) failed to detect the non-zero intercept at the 0.05 significance level when Enlightened succeeded (in the GLS analysis). On the other hand, this ratio wrt the true OLS analysis is 0:1. That is, Enlightened (in his/her GLS analysis) failed to detect it once when Naive succeeded (in his/her naive OLS analysis). However, the small number of inconsistent conclusions made by the two statisticians could well be due to chance. Similar reasoning obtains for the estimates of the two slopes.

To compare the empirical power of the three analyses, we compare the sample probabilities of detecting a non-zero regression coefficient at significance level 0.05. For example, Enlightened's power for detecting a non-zero intercept was (11 + 2)/200 = 0.065 = 6.5%. This power was slightly lower at 11/200 = 5.5% for Naive.
In the true OLS analysis, Enlightened's power for the OLS estimator was (13 + 1)/200 = 7.0%. The power is slightly lower (6.5%, as above) for his/her GLS analysis. These powers are all very low, possibly due to the large variances in the intercept estimates. Also, the powers are similar because of the particular choice of covariate matrix, which gave similar OLS and GLS estimators.

We do not see striking power differences either between those associated with the OLS and GLS slope estimates. In fact, if we compare the GLS power and that of either OLS analysis, the Pearson $\chi^2$ test of independence of analysis (GLS/OLS) and power ($p <$ or $> 0.05$) of detecting each of $\beta_0$, $\beta_1$, and $\beta_2$ retains the null hypothesis (of independence). (In such a $\chi^2$ test, the two categories of analysis are "GLS" and "OLS," each with two power levels, namely "$p < 0.05$" and "$p > 0.05$"; we test the independence of analysis and power, i.e. of category and level.) For example, the p-value for the $\chi^2$ test, obtained using the S-Plus function chisq.test, is 0.83 for the GLS and naive OLS analyses in detecting $\beta_0$. (In this $\chi^2$ test, the data are (naive OLS $p < 0.05$, GLS $p < 0.05$) = (11, 13), and similarly (189, 187) for the "$p > 0.05$" level.) Thus, the power differences in this simulation experiment are very likely to be results of chance (in the terminology of Freedman, Pisani, Purves & Adhikari, 1991, Chapter 26) according to the $\chi^2$ test. Finally, the fact that the two powers were low in detecting non-zero slopes in this simulation is desirable, because the true slopes were zero!

In Simulation 2, much the same sort of results were obtained.

All the (true) OLS estimate variances are noticeably larger than their GLS counterparts. We again see large variability in the slope coefficient estimates for both the OLS and the GLS methods. In fact, this is true for every co-ordinate of the $\beta$ vector. The sample mean values of the parameter estimates indicate that overall, the coefficients were not well estimated, possibly due to chance as a result of their large variabilities. The large variances were mainly due to the choice of $X$ and $\Sigma_\epsilon$. However, the GLS estimates appear to be slightly more accurate, with smaller empirical biases than the OLS estimates.

Large variability in the coefficient estimates also entails low power in detecting non-zero true coefficients. This power for all three regression coefficients is under 10% based on any of the naive OLS, true OLS, and GLS covariances. Finally, minimal inconsistency is found between the methods in the detection of non-zero coefficients. However, notice that in a few cases (3 for $\beta_0$ and 2 for both $\beta_1$ and $\beta_2$), the true OLS analysis actually out-performed the GLS analysis in detecting non-zero coefficients. Of course, the small number of such cases does not strongly suggest much about the overall power of this true OLS analysis. In fact, this out-performance of the true OLS analysis is likely due to chance: the $\chi^2$ tests (as described above for Simulation 1) for this simulation experiment all yielded large p-values. Also, theoretically, the true OLS analysis is never more efficient than the GLS analysis, as indicated by Theorem 1.
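The $\chi^2$ comparison above is easy to reproduce. The following S-Plus/R-style sketch uses the Simulation 1 counts for $\beta_0$ with the standard chisq.test function mentioned in the text:

    # Pearson chi-squared test of independence of analysis (naive OLS vs GLS)
    # and power level (p < 0.05 vs p > 0.05), with the Simulation 1 counts
    # for beta_0: 11/200 rejections for naive OLS, 13/200 for GLS.
    counts <- matrix(c(11, 189,     # naive OLS: p < 0.05, p > 0.05
                       13, 187),    # GLS:       p < 0.05, p > 0.05
                     nrow = 2, byrow = TRUE)
    chisq.test(counts)              # p-value approx. 0.83 with the default
                                    # continuity correction, as quoted above

The large p-value says the 11-versus-13 difference in rejection counts is entirely consistent with chance variation.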
30 1 500 see below (3,1.5)'* 1 item 2 on page 26 yes Program output: sample mean of n estimates sample variance of n estimates true variance of each estimate OLS estimates 3.11 11.54 13.19 Pi 1.49 0.11 0.13 GLS estimates Po 3.05 8.60 9.93 Pi 1.49 0.08 0.10 GLS Po Pi p < 0.05 p > 0.05 p < 0.05 p > 0.05 OLS St = I p < 0.05 35 8 482 0 p > 0.05 46 411 15 3 OLS Et known p < 0.05 46 . 24 491 1 p > 0.05 35 395 6 2 Table 2.4: Results of Simulation 3 (ii) Simulations 3 and 4 The next two simulations were of larger scale than the previous two. Instead of n \u00E2\u0080\u0094 200, we had 500 realizations of model (2.1) in Simulations 3 and 4. Such larger simulations were naturally more time consuming. To save computational time, we fixed the covariate matrix and the regression coefficient vector. X was set to be the 30 X 2 covariate matrix described on page 26, and /3 = (3,1.5) T. The results are presented in Tables 2.4 and 2.5. The two simulations were done with the same program input variables, except the response dependence correlation. The responses have a covariance matrix as described in item 2 on page 26 for Simulation 3, and item 3 on page 26 for Simulation 4. Chapter 2. Multiple Linear Regression 36 Program input variables: r 9 n X P CT2 \u00C2\u00A3 \u00C2\u00A3 fit intercept? 30 1 500 see page 35 (3,1.5)'' 1 item 3 on page 26 yes Program output: OLS estimates GLS estimates Po Pi Po Pi sample mean of n estimates 3.14 1.49 3.07 1.50 sample variance of n estimates 8.96 0.089 1.25 0.011 true variance of each estimate 9.41 0.093 1.24 0.011 GLS Po 0i p < 0.05 p > 0.05 p < 0.05 p > 0.05 OLS EC = I p < 0.05 43 1 487 0 p > 0.05 348 108 13 0 OLS Se known p < 0.05 91 2 498 0 p > 0.05 300 107 2 0 Table 2.5: Results of Simulation 4 Chapter 2. Multiple Linear Regression 37 In Simulation 3, because of the choice of X and 27\u00C2\u00A3, the variance is very large for the intercept estimate, but it is very small for the slope estimate. Using a slightly different St in Simulation 4, the variances for the GLS estimates decrease considerably (by about eight times). In both simulations, both regression coefficients in the model are well estimated by even the OLS estimators, although the GLS estimators seem to have slightly smaller empirical biases. However, due to the large difference in the variances of the GLS intercept estimate, the p-value tables of the two simulations for /?o are very different. In particular, the GLS analysis obviously out-performed the OLS analysis (naive or true) in Simulation 4 (in which the variance of the GLS intercept estimate is much smaller). In terms of consistency, the consistency ratio wrt the naive OLS analysis is 348:1. The true OLS analysis is slightly better than the naive in terms of consistency. Instead of a 348:1 ratio, it is a 300:2 ratio wrt the true OLS analysis. As for power, the empirical powers of the GLS, true OLS, and naive OLS analyses are respectively 78%, 19%, and 9%. The x2 test of independence (of analysis and power) yielded a virtually zero p-value for comparing the GLS analysis to either OLS analysis. (In other words, the power of detecting the intercept is significantly associated with the analysis (GLS/OLS).) Here, the difference in power is striking. Although the intercept is typically not the main interest of a regression analysis, we can nonetheless see \"the power of the enlightened\" in detecting f30 in this simulation experiment. This contrast in power, with respect to the intercept estimates, is not as striking in Simulation 3 as in 4. 
Nonetheless, the GLS analysis seems more efficient than the two OLS analyses. The consistency ratios wrt the naive and the true OLS analyses are respectively 46:8 and 35:24. Thus, the GLS analysis was the most efficient in this simulation, followed by the true OLS analysis, then the naive OLS analysis. This efficiency ranking is heralded by their empirical powers, which are 16%, 14%, and 9% respectively. In fact, the x 2 test Chapter 2. Multiple Linear Regression 38 of independence for the GLS and naive OLS analyses for B0 yielded a near-zero p-value. Thus, the difference in power between the the analyses was extremely unlikely to be a result of chance. We may say that Enlightened is significantly more powerful than Naive in detecting the intercept, according to the x2 test. However, the \ 2 test for the GLS and true OLS analyses indicated insignificant association between the analyses and their powers. Although the difference in efficiency is not as extreme as in the intercept analysis, the GLS analysis seems more efficient in both simulations than either of the OLS analyses in detecting the slope coefficient. The difference is slightly more noticeable in Simulation 3 than Simulation 4. We did not apply the x2 test of independence for the slope estimate of either simulation due to the presence of zero counts. When we compare the results of these two simulations, we notice that with stronger response dependence (in Simulation 4), more efficiency is lost in the OLS analyses (wrt the GLS analysis). (iii) Simulations 5 and 6 Simulations 5 and 6 were run with an exchangeable Se. Recall that with such a response dependence structure and a non-zero intercept in model (2.1), the OLS and the GLS estimators are identical. Thus, the true OLS covariance is no different from the GLS covariance, as stated in Theorem 2. However, the naive OLS covariance is not the same as the GLS covariance, unless St \u00E2\u0080\u0094 I. With St ^ I, we expect to see loss of efficiency in parameter estimation. In these two simulations, we used the same program input variables as in Simulations 3 and 4, except that St has the exchangeable structure. The intraclass correlation coefficient in St is 0.6 in Simulation 5 and 0.9 in Simulation 6. The results of these two simulations are shown in Tables 2.6 and 2.7. Their bottom panels tabulate the p-value Chapter 2. Multiple Linear Regression 39 Program input variables: exchangeable Se, intraclass p = 0.6 (see Table 2.4 for other program input variables) Program output: GLS = OLS estimates Po Pi sample mean of n estimates 3.14 1.49 sample variance of n estimates 9.76 0.077 true variance of each estimate 7.19 0.066 GLS Po Pi p < 0.05 p > 0.05 p < 0.05 p > 0.05 OLS Et = I p < 0.05 109 0 500 0 p > 0.05 122 269 0 0 GLS power (= true OLS power) 46% 100% naive OLS power 22% 100% Table 2.6: Results of Simulation 5 Chapter 2. Multiple Linear Regression 40 Program input variables: exchangeable \u00C2\u00A3 t , intraclass p = 0.9 (see Table 2.4 for other program input variables) Program output: GLS = OLS estimates Po Pi sample mean of 2.96 1.50 n estimates sample variance of 2.53 0.018 n estimates true variance of 2.55 0.017 each estimate GLS Po Pi p < 0.05 p > 0.05 p < 0.05 p > 0.05 OLS EC = I p < 0.05 297 0 500 0 p > 0.05 155 48 0 0 GLS power (= true OLS power) 90 % 10( ]% naive OLS power 59% 100% Table 2.7: Results of Simulation 6 Chapter 2. Multiple Linear Regression 41 of the GLS analysis versus the naive analysis. 
We will not show the actual contingency table corresponding to the true OLS analysis because this analysis is equivalent to the GLS analysis. Thus, the off-diagonal values of this contingency table will be the same. (Hence, the consistency ratio wrt the true OLS analysis must be k : k for some non-negative k.) Also for this reason, we will only quote the empirical power of the GLS analysis and the naive OLS analysis. Tables 2.6 and 2.7 show the extent of efficiency loss in estimating B0 in the naive OLS analysis. (The \ 2 test of independence of analysis and power of detecting B0 yielded a virtually zero p-value in either simulation experiment.) Comparing the two tables, we see that stronger response dependence heralds greater efficiency loss in the naive analysis. This relationship between the response dependence and efficiency loss was also seen in the previous two simulations. As we will see in some other simulations (see the next subsection), this relationship is generally true. (iv) Other Simulation Experiments We ran about fifteen more simulations in addition to the six described above. They were mainly intended to investigate under what conditions the GLS analysis would significantly out-perform either of the OLS analyses when the responses are positively correlated.2 1 Tables 2.2 to 2.7 were shown in full to illustrate how we interpret the program output. For brevity, we will merely highlight the more notable results, summarized as follows: 1. In general, the stronger the response correlation, the larger is the consistency ratio wrt either OLS analysis.2 2 2 1 Positively correlated responses are often encountered in studies of repeated measurements or in spatial statistics. 2 2 In the case of a k : 0 ratio for some non-negative k, a \"large\" ratio means a large k. Chapter 2. Multiple Linear Regression 42 2. While the last observation is generally true, in some specific cases, the two OLS analyses were empirically more powerful than the GLS analysis. This happened mainly in the analysis for the intercept, which usually had a very large variance. A large variance could have led to a slightly higher empirical power of an OLS analysis, by chance. (Note that the covariance of a /3 estimate depends on the choice of X, St, and cr2.) In the trials without an intercept, an empirically more powerful OLS analysis for a slope could also be due to a large variance of the slope estimate. 3. In some cases, with relatively weak response correlation, the GLS analysis proved to be more powerful than the OLS analyses. For example, a simulation experiment was done with the response dependence variogram described on page 27. Our choice of the region bearings (longitudes and latitudes) gave us a variogram which was extremely close to the identity matrix. However, for one of the two slope coefficients, the GLS analysis turned out to have 89% power, compared to 77% for the true OLS analysis, and 75% for the naive OLS analysis. The \ 2 test of independence of analysis and power of detecting this slope yielded a virtually zero p-value for each of the \"GLS/true OLS\" and \"GLS/naive OLS\" combinations. Such a small p-value indicates that the difference in power was extremely unlikely to be a result of chance. 4. In general, the loss in efficiency (wrt to the GLS analysis) of the true OLS analysis is less than that of the naive OLS analysis. The difference in efficiency lost between the two OLS analyses can sometimes be extreme. 
For example, in one simulation, the consistency ratio wrt the true OLS analysis was 4:2, but it was 147:0 wrt the naive OLS analysis.

5. As expected, in cases for which the OLS (naive or true) variances are much larger than the GLS variances, the GLS analysis is much more powerful than the OLS analysis.

6. In most cases, both the OLS and the GLS estimators accurately estimate the true regression parameters, in that they both have small empirical biases. However, the empirical bias of both estimators can be quite large if their variances are large.

7. In general, the OLS and the GLS estimates are very close to each other. (They are the same if the response dependence is exchangeable, with an intercept in the regression model.) Sometimes the GLS estimate has less empirical bias, as shown in Simulations 1 to 4.

2.8 How bad is Naive's analysis?

In the previous section, we saw that ignoring response dependence in linear regression can, in general, be dangerous: a considerable loss in efficiency can result.

There is established theory on the relative efficiency in parameter estimation of the true OLS analysis to the GLS analysis. As Theorem 1 indicates, the true OLS analysis can never be more efficient than the GLS analysis.²³ Our simulations also support this difference in efficiency, in terms of consistency and power, in detecting non-zero regression coefficients. However, we are not aware of theory which extensively compares the naive OLS analysis with the GLS analysis. From our simulation studies, we do know that in general the naive OLS analysis is the least efficient in parameter estimation, followed by the true OLS analysis,²⁴ then the GLS analysis. Now we direct our focus onto understanding the extent of efficiency loss of the naive OLS analysis relative to the GLS analysis.

2.8.1 Empirical Power Curves

First, let us examine how the power of the naive OLS analysis in detecting a non-zero regression coefficient compares to that of the GLS analysis. We compare them under varying conditions. For example, we may fix a regression coefficient and Σε, letting σ² vary. Similarly, we may fix the covariances of the responses and let the regression coefficient vary.

We will once again use simulation experiments as the basis of our investigation. We ran several such experiments for each set of varying conditions, and for each of these simulations we recorded the empirical powers of both the GLS and the naive OLS analyses. Each simulation consisted of 500 realizations of model (2.1), with no intercept and one slope coefficient. Having only one slope in the model provides enough insight into the question at hand while reducing the computational burden. We used a model with no intercept term because, as explained before, the slope is the primary determinant of the relationship between the response and the covariate. We chose a 30 × 1 covariate matrix X in this case, assuming there were 30 regions in the geographical domain of interest.

²³ In practice, the empirical relative efficiency of the true OLS analysis to the GLS analysis may be greater than unity. This is the case in, for example, Simulation 2 in Section 2.7.3. This is mainly due to chance. With more data sets (i.e. more than 500) in this simulation, the GLS analysis could have been empirically more efficient than the true OLS analysis.
The covariates in this X matrix (vector) were arbitrarily generated, independently, from a normal distribution with mean 0 and standard deviation 0.2. Once generated, the X matrix was fixed throughout the 500 simulation trials. Σε was taken to be the one described in item 3 on page 26 in all simulations. Such a Σε was chosen because, in the previous simulations with this Σε, the GLS analysis was often noticeably more powerful than the naive OLS analysis. We wish to exploit to a greater extent this ...

... 1.5, then decreases very gradually and stabilizes as σ grows beyond 8. Meanwhile, the naive OLS power starts decreasing as soon as σ grows beyond 0. It decreases much faster than the GLS power, stabilizing when σ ≈ 4.

Finally, in the third panel of Figure 2.1, we let the ratio β₁/σ = 1, i.e. β₁ = σ, while letting both parameters vary together. The other program input arguments are fixed. The graph indicates that both the GLS and the naive OLS power functions remain roughly constant, but the latter is much lower. The naive OLS power seems to remain at around 0.2, whereas the GLS power remains close to 1 at all values of β₁ (= σ).

Indeed, the fact that the GLS power remains relatively constant agrees with theoretical results. The following argument shows that the theoretical GLS power function is constant if the ratio β₁/σ is constant.

Let ν = β₁/σ, and let π(ν) be the power function of the GLS analysis evaluated at ν, i.e. the probability of detecting a non-zero β₁, given ν. The t-test statistic is

    β̂₁ᴳᴸˢ / SE(β̂₁ᴳᴸˢ),

where SE(β̂₁ᴳᴸˢ) = σ̂ √[(Xᵀ Σε⁻¹ X)⁻¹]. The estimate σ̂ is the square root of the MSE obtained from the regression analysis of variance; that is,

    σ̂² = êᵀê / (r − g),  where  ê = Y − Xβ̂.

Recall that the covariate matrix X has dimension r × g, where r = 30 and g = 1. Now, from classical theory, we know that β̂₁ᴳᴸˢ / SE(β̂₁ᴳᴸˢ) follows a non-central t distribution with r − g degrees of freedom and non-centrality parameter equal to²⁵

    β₁ / [σ √((Xᵀ Σε⁻¹ X)⁻¹)] = ν / √((Xᵀ Σε⁻¹ X)⁻¹),

which varies only with ν, because the values in the denominator of the last expression are all fixed. Now, let t_c be the critical value of the t test at significance level 0.05. It is not difficult to see that

    π(ν) = Pr( |β̂₁ᴳᴸˢ / SE(β̂₁ᴳᴸˢ)| > t_c )

is constant if ν is constant.

²⁴ The naive OLS analysis may appear to be more efficient than the true OLS analysis when the matrix (XᵀX)⁻¹(XᵀΣεX)(XᵀX)⁻¹ − (XᵀX)⁻¹ is positive definite (i.e. when the true OLS covariance is "larger than" the naive). However, technically the naive analysis is not more efficient in this case because ...
²⁵ See Encyclopedia of Statistical Sciences, Wiley & Sons, 1985, vol. 6, pp. 286–288.
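To make the constancy of the GLS power in ν concrete, here is a small R sketch of the power computation via the non-central t distribution. The covariate vector and Σε below are illustrative stand-ins (the actual X and Σε of these simulations are described above), and the function name is ours.

    # Theoretical power of the two-sided level-0.05 GLS t-test of
    # H0: beta1 = 0, as a function of nu = beta1/sigma alone.
    r <- 30; g <- 1
    set.seed(1)
    X         <- matrix(rnorm(r, mean = 0, sd = 0.2), ncol = 1)  # fixed covariates
    Sigma.eps <- diag(r)              # illustrative response covariance structure

    gls.power <- function(nu) {
      # non-centrality parameter: nu / sqrt((X' Sigma^{-1} X)^{-1})
      ncp <- nu / sqrt(drop(solve(t(X) %*% solve(Sigma.eps, X))))
      tc  <- qt(0.975, df = r - g)    # central-t critical value under H0
      # P(|T| > tc), where T ~ non-central t with r - g d.f.:
      pt(-tc, df = r - g, ncp = ncp) + 1 - pt(tc, df = r - g, ncp = ncp)
    }

    gls.power(1)   # identical for any (beta1, sigma) with beta1/sigma = 1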
From the same empirical power graph, we also see that the naive OLS analysis has constant power. In fact, we can deduce that, theoretically, the naive OLS analysis should also have constant power. This is true despite the fact that the true standard error of β̂₁ᴼᴸˢ is not equal to the naive S.E. (= SE(β̂₁ᴼᴸˢ)naive), so that β̂₁ᴼᴸˢ / SE(β̂₁ᴼᴸˢ)naive does not have a non-central t distribution.

Recall that β̂₁ᴼᴸˢ, in the case of Σε ≠ I, has true S.E. and naive S.E. equal to

    σ √[(XᵀX)⁻¹(XᵀΣεX)(XᵀX)⁻¹]    and    σ √[(XᵀX)⁻¹]

respectively. That is,

    SE(β̂₁ᴼᴸˢ)true = √[(XᵀΣεX)(XᵀX)⁻¹] · SE(β̂₁ᴼᴸˢ)naive.

Now, notice that β̂₁ᴼᴸˢ / SE(β̂₁ᴼᴸˢ)true, the t-statistic for the true OLS analysis, has a non-central t distribution with non-centrality parameter equal to β₁ / SE(β̂₁ᴼᴸˢ)true. Hence, the power function of the true OLS analysis should be constant if β₁/σ is constant (like that of the GLS analysis). The naive t-statistic is

    β̂₁ᴼᴸˢ / SE(β̂₁ᴼᴸˢ)naive = [β̂₁ᴼᴸˢ / SE(β̂₁ᴼᴸˢ)true] · √[(XᵀΣεX)(XᵀX)⁻¹],

which is simply a multiple of the t-statistic of the true OLS analysis. Therefore, the power function of the naive OLS analysis should also be constant if β₁/σ is constant.

Finally, the fact that the empirical power curve of the naive OLS analysis lies below that of the GLS analysis in all three panels of Figure 2.1 illustrates that ignoring response dependence in regression analysis can lead to a significant loss in parameter estimation efficiency.

Next, let us examine Figure 2.2. The two panels of this figure show empirical power curves of the two statisticians' analyses with all program input arguments fixed as before, except that β₁ and σ both ...

... Suppose the conditional density of Yᵢ⁽ˢ⁾ given Xᵢ⁽ˢ⁾ has the exponential-family form

    f(yᵢ⁽ˢ⁾ | Xᵢ⁽ˢ⁾) = exp{ [yᵢ⁽ˢ⁾θᵢ⁽ˢ⁾ − a(θᵢ⁽ˢ⁾) + b(yᵢ⁽ˢ⁾)] φ },    (3.17)

where θᵢ⁽ˢ⁾ = h(ηᵢ⁽ˢ⁾), for some functions a, b, and h. Then

    E(Yᵢ⁽ˢ⁾ | Xᵢ⁽ˢ⁾) = a′(θᵢ⁽ˢ⁾),    Var(Yᵢ⁽ˢ⁾ | Xᵢ⁽ˢ⁾) = a″(θᵢ⁽ˢ⁾)/φ,    (3.18)

where a′ and a″ are the first and second derivatives of a, respectively. φ is called the scale or the dispersion parameter. Note that the link function g is related to h through g[E(Yᵢ⁽ˢ⁾ | Xᵢ⁽ˢ⁾)] = ηᵢ⁽ˢ⁾ = h⁻¹(θᵢ⁽ˢ⁾).

The r × r matrices A⁽ˢ⁾, Δ⁽ˢ⁾, and V⁽ˢ⁾ are defined as

    A⁽ˢ⁾ = diag( a″(θ₁⁽ˢ⁾), ..., a″(θᵣ⁽ˢ⁾) ),
    Δ⁽ˢ⁾ = diag( dθ₁⁽ˢ⁾/dη₁⁽ˢ⁾, ..., dθᵣ⁽ˢ⁾/dηᵣ⁽ˢ⁾ ),
    V⁽ˢ⁾ = [A⁽ˢ⁾]^{1/2} R [A⁽ˢ⁾]^{1/2} / φ,    (3.19)

where R is any r × r matrix that satisfies the requirements of being a correlation matrix. This R is the working correlation matrix; similarly, V⁽ˢ⁾ is the working covariance matrix for the s-th replicate. Note that V⁽ˢ⁾ is the true covariance matrix of [Y⁽ˢ⁾|X⁽ˢ⁾] (i.e. V⁽ˢ⁾ = Cov(Y⁽ˢ⁾|X⁽ˢ⁾)) if R is the true correlation matrix of each [Y⁽ˢ⁾|X⁽ˢ⁾]. Also note that since all the [Y⁽ˢ⁾|X⁽ˢ⁾]'s have the common correlation matrix Cor(Y|X), the working correlation matrix R (which may be used to estimate Cor(Y|X)) does not depend on s.

Next, define the r × g matrix D⁽ˢ⁾ = A⁽ˢ⁾ Δ⁽ˢ⁾ X⁽ˢ⁾. Then the generalized estimating equations are defined as

    Σ_{s=1}^{K} D⁽ˢ⁾ᵀ [V⁽ˢ⁾]⁻¹ [Y⁽ˢ⁾ − a′(θ⁽ˢ⁾)] = 0,    (3.20)

where a′(θ⁽ˢ⁾) = (a′(θ₁⁽ˢ⁾), ..., a′(θᵣ⁽ˢ⁾))ᵀ. The estimate of β is then obtained by solving (3.20) for β.

Since the quantities D⁽ˢ⁾, V⁽ˢ⁾, and a′(θ⁽ˢ⁾) in (3.20) are all functions of some unknown parameters, we need an iterative numerical method which updates, at each iteration, each unknown parameter estimate so that these quantities in (3.20) are consequently updated. Upon convergence, we obtain the final estimates of these unknown parameters, such as β and those in the working correlation matrix R. If the working correlation structure (i.e. the structure of R) is specified according to the true response correlation structure, then this final "estimate" of R is the estimate of the conditional correlation matrix of the response Y, i.e. Cor(Y|X).¹

¹ In our investigation, we assume that the initial specification of R as the working correlation matrix is based on the believed structure of Cor(Y|X).
The final estimate of R may not be a good estimate of Cor(Y|X) if its specified structure does not coincide with that of Cor(Y|X).

More advanced statistical packages such as S-Plus provide standard procedures for fitting GLM's using the GEE approach.

Notice that if R = I, i.e. if the responses are assumed² to be independent (given the covariates), the estimating equations (3.20) reduce to those derived from quasi-likelihood theory. In other words, the GEE parameter estimates reduce to those of the ordinary GLIM software approach when R = I.

The following theorem is a re-statement of that presented in Liang & Zeger (1986):

Theorem 3  Suppose regularity conditions hold, and that there exist √K-consistent estimators for φ and the parameters in R. Define β̂_G to be the solution of (3.20) for β, with φ and the parameters in R replaced by their estimators if they were not fully specified. Then √K (β̂_G − β) has an asymptotic MVN distribution with mean 0 and covariance

    lim_{K→∞} Cov[√K (β̂_G − β)]
      = lim_{K→∞} K ( Σ_{s=1}^K D⁽ˢ⁾ᵀ[V⁽ˢ⁾]⁻¹D⁽ˢ⁾ )⁻¹
          { Σ_{s=1}^K D⁽ˢ⁾ᵀ[V⁽ˢ⁾]⁻¹ Cov(Y⁽ˢ⁾|X⁽ˢ⁾) [V⁽ˢ⁾]⁻¹ D⁽ˢ⁾ }
          ( Σ_{s=1}^K D⁽ˢ⁾ᵀ[V⁽ˢ⁾]⁻¹D⁽ˢ⁾ )⁻¹.

In a GEE analysis, the above theorem is used for estimating the covariance of β̂_G in the following manner:

    Cov(β̂_G) = Cov(β̂_G − β) = (1/K) Cov[√K (β̂_G − β)]
      ≈ ( Σ_{s=1}^K D⁽ˢ⁾ᵀ[V⁽ˢ⁾]⁻¹D⁽ˢ⁾ )⁻¹
          { Σ_{s=1}^K D⁽ˢ⁾ᵀ[V⁽ˢ⁾]⁻¹ Cov(Y⁽ˢ⁾|X⁽ˢ⁾) [V⁽ˢ⁾]⁻¹ D⁽ˢ⁾ }
          ( Σ_{s=1}^K D⁽ˢ⁾ᵀ[V⁽ˢ⁾]⁻¹D⁽ˢ⁾ )⁻¹.    (3.21)

Recall that R and V⁽ˢ⁾ are respectively the working correlation matrix and the working covariance matrix for the responses Y⁽ˢ⁾ given the covariates X⁽ˢ⁾. If we assume V⁽ˢ⁾ to be the true covariance matrix Cov(Y⁽ˢ⁾|X⁽ˢ⁾), then equation (3.21) reduces to

    Cov(β̂_G) ≈ ( Σ_{s=1}^K D⁽ˢ⁾ᵀ[V⁽ˢ⁾]⁻¹D⁽ˢ⁾ )⁻¹.    (3.22)

The GEE estimate of the covariance of β̂_G using equation (3.22) is called the naive covariance estimate. The one using equation (3.21) is known as the robust covariance estimate of β̂_G.

In practice, we usually do not know the true correlation structure of the responses. The working correlation R which we specify in running the GEE analysis is based only on a belief about what the true structure is. The naive covariance estimate might then be an inappropriate estimate of the true covariance of β̂_G if the specified R is very different from the true Cor(Y|X).³ On the other hand, the robust covariance estimate accounts for the true covariance of the responses, Cov(Y⁽ˢ⁾|X⁽ˢ⁾), and is thus a "corrected version" of the naive covariance estimate.

Of course, Cov(Y⁽ˢ⁾|X⁽ˢ⁾) is usually unknown. In particular, misspecification of the response correlation structure is usually a result of not knowing the true Cov(Y|X). In fact, Cov(Y⁽ˢ⁾|X⁽ˢ⁾) in the robust variance estimate formula is estimated from the data using

    [Y⁽ˢ⁾ − a′(θ̂⁽ˢ⁾)] [Y⁽ˢ⁾ − a′(θ̂⁽ˢ⁾)]ᵀ.

Here, a′(θ̂⁽ˢ⁾) is computed using X⁽ˢ⁾ and β̂_G, for each s. Since the observed data should be representative of the population, the estimate of Cov(Y⁽ˢ⁾|X⁽ˢ⁾) based on the data should usually resemble the true covariance more closely than V⁽ˢ⁾ does (when Cor(Y⁽ˢ⁾|X⁽ˢ⁾) is misspecified). Hence, the robust estimate is often a better estimate of the covariance of β̂_G.

² See Section 3.4.
³ That is, the asymptotic Cov(β̂_G) does not equal the naive covariance estimate if the structure of R is different from that of Cor(Y|X).
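In code, (3.21) and (3.22) amount to a few matrix products. The R sketch below assumes that the D⁽ˢ⁾ and V⁽ˢ⁾ matrices and the residual vectors Y⁽ˢ⁾ − a′(θ̂⁽ˢ⁾) have already been computed and stored in lists D, V, and res (hypothetical names); it shows the structure of the calculation, not the internal code of any GEE package.

    # Naive (3.22) and robust (3.21) covariance estimates of beta_G.
    # D[[s]]: r x g;  V[[s]]: r x r working covariance (symmetric);
    # res[[s]]: Y^(s) - a'(theta.hat^(s)), an r-vector.
    gee.cov <- function(D, V, res) {
      g     <- ncol(D[[1]])
      bread <- matrix(0, g, g)        # sum_s D' V^{-1} D
      meat  <- matrix(0, g, g)        # sum_s D' V^{-1} (res res') V^{-1} D
      for (s in seq_along(D)) {
        VinvD <- solve(V[[s]], D[[s]])
        Vinvr <- solve(V[[s]], res[[s]])
        bread <- bread + t(D[[s]]) %*% VinvD
        u     <- t(D[[s]]) %*% Vinvr  # g x 1
        meat  <- meat + u %*% t(u)
      }
      naive <- solve(bread)                                   # equation (3.22)
      list(naive = naive, robust = naive %*% meat %*% naive)  # equation (3.21)
    }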
3.4 Terminology

In the next two chapters, we will assume the scenario of an investigator specifying the working correlation matrix/structure in GEE according to what s/he believes the response correlation matrix/structure to be. Misspecification of the response correlation refers to the case when the investigator's belief/assumption (and thus specification) of the response correlation is incorrect when using GEE for model fitting.

We will refer to the GLM in which the responses are assumed to be independent as the independence model. Similarly, we will refer to the GLM in which a correlation structure of the response dependence is taken into account as the dependence model. In the simulations, we fit both models to a given dataset using the GEE approach.

Notice that either model may be incorrect. For example, suppose the true response correlation is exchangeable. Then the independence model is incorrect, but the dependence model that assumes exchangeably correlated responses is correct. On the other hand, the dependence model is incorrect if it assumes the response correlation structure to be anything other than exchangeable. In other words, a model is correct only if the specified correlation structure is the true response correlation structure.

We say that one model is more efficient than another if its parameter estimates produced by the GEE approach are closer to the true parameters, and if these estimates have smaller variability. One way to measure unbiasedness and variability of a parameter estimate is to observe many realizations of the estimate. For each simulation, we repeat the GEE model fitting process on n datasets. Summary statistics of these n parameter estimates are then obtained. Unbiasedness can be measured by how close the sample mean of these n estimates is to the true parameter. Similarly, variability can be measured by the sample variance of the n estimates.

Moreover, small variability in a parameter estimate also means higher power in detecting a non-zero value of the parameter. This leads us to the notion of empirical power of the independence and dependence models. The power of detecting a particular non-zero regression parameter can be measured by the proportion of the n datasets in which the parameter is statistically significant at a given level. A model is then also more efficient than another if it is empirically more powerful in detecting a given parameter from the same datasets.

As in the previous chapter, we may wish to determine whether the difference in power between the independence and dependence models is statistically significant. We will again use Pearson's χ² test of independence to assess the association between model (independence/dependence) and power (βⱼ significant or not) for each regression coefficient in the model. (See page 33 for examples of the χ² test.) If the association is significant, we conclude that the difference in (empirical) power is significant. Otherwise, the difference in power is likely due to chance.⁴

⁴ ... in the terminology of Freedman, Pisani, Purves & Adhikari, 1991, Chapter 26.
Chapter 4

Simulations of Multiple Regression of Count Data — Part One

In this part of our simulations, we generate count data which are independently Poisson distributed given the covariates. (Marginally, the responses are correlated, with a Poisson-lognormal distribution.) We are interested in the effect of misspecifying a response correlation structure on parameter estimation using GEE. Thus, any non-independence structure is a misspecification.

4.1 Generating Count Data and their (Random) Covariates

Recall from Section 3.4 the notation for the response vector and the covariate matrix for the s-th replicate. Fixing s, the equations in this section hold for all s = 1, ..., K. In other words, the sets of replicates {Y⁽ˢ⁾, X⁽ˢ⁾} are independent and identically distributed (i.i.d.), and (4.1) to (4.7) hold for all s.

We generate the Yᵢ⁽ˢ⁾'s through generating covariate vectors Xᵢ.⁽ˢ⁾'s. In a given region i, the g-dimensional covariate vector is Xᵢ.⁽ˢ⁾, and we assume that the j-th covariate is distributed as a normal random variable with mean μᵢⱼ and variance σᵢᵢ, for i = 1, 2, ..., r. That is,

    X.ⱼ⁽ˢ⁾ ~ MVN(μ.ⱼ, Σ),  independently for j = 1, ..., g,    (4.1)

and, conditional on the g covariates, the response in the i-th region has a Poisson distribution with mean exp(Xᵢ.⁽ˢ⁾ᵀβ):

    [Yᵢ⁽ˢ⁾ | Xᵢ.⁽ˢ⁾] ~ Poisson( exp(Xᵢ.⁽ˢ⁾ᵀβ) ).    (4.2)

Note that we generate the Y⁽ˢ⁾'s so that [Yᵢ⁽ˢ⁾|Xᵢ.⁽ˢ⁾] is independent of [Yₖ⁽ˢ⁾|Xₖ.⁽ˢ⁾] for i ≠ k. Also note that Xᵢ.⁽ˢ⁾ᵀβ = Σ_{j=1}^{g} βⱼXᵢⱼ⁽ˢ⁾ is a normal random variable with mean μᵢ.ᵀβ and variance Σ_{j=1}^{g} βⱼ²σᵢᵢ = σᵢᵢ βᵀβ, and that

    Cov(Xᵢ.⁽ˢ⁾ᵀβ, Xₖ.⁽ˢ⁾ᵀβ) = Cov( Σⱼ Xᵢⱼ⁽ˢ⁾βⱼ, Σₗ Xₖₗ⁽ˢ⁾βₗ )
                           = Σⱼ βⱼ² Cov(Xᵢⱼ⁽ˢ⁾, Xₖⱼ⁽ˢ⁾) = σᵢₖ βᵀβ

for all i ≠ k. Thus, letting ν = μβ and W = βᵀβ Σ, we have

    X⁽ˢ⁾β ~ MVN(ν, W).    (4.3)

Now, let Zᵢ⁽ˢ⁾ = E(Yᵢ⁽ˢ⁾|Xᵢ.⁽ˢ⁾) = exp(Xᵢ.⁽ˢ⁾ᵀβ). Therefore, Z⁽ˢ⁾ = exp(X⁽ˢ⁾β) is an r-dimensional lognormal random vector with parameters ν and W. Hence, marginally (unconditional on X⁽ˢ⁾), Y⁽ˢ⁾ has a multivariate Poisson-lognormal distribution (Aitchison & Ho, 1989) with

    E(Yᵢ⁽ˢ⁾) = exp(νᵢ + ½Wᵢᵢ) = exp(μᵢ.ᵀβ + ½σᵢᵢβᵀβ)    (4.4)

    Var(Yᵢ⁽ˢ⁾) = E(Yᵢ⁽ˢ⁾) + [E(Yᵢ⁽ˢ⁾)]²[exp(Wᵢᵢ) − 1]
              = E(Yᵢ⁽ˢ⁾) + [E(Yᵢ⁽ˢ⁾)]²[exp(σᵢᵢβᵀβ) − 1]    (4.5)

    Cov(Yᵢ⁽ˢ⁾, Yₖ⁽ˢ⁾) = E(Yᵢ⁽ˢ⁾)E(Yₖ⁽ˢ⁾)[exp(Wᵢₖ) − 1]
                     = E(Yᵢ⁽ˢ⁾)E(Yₖ⁽ˢ⁾)[exp(σᵢₖβᵀβ) − 1]    (4.6)

    Cor(Yᵢ⁽ˢ⁾, Yₖ⁽ˢ⁾) = [exp(Wᵢₖ) − 1] /
        √{ [exp(Wᵢᵢ) − 1 + 1/E(Yᵢ⁽ˢ⁾)] [exp(Wₖₖ) − 1 + 1/E(Yₖ⁽ˢ⁾)] }
                     = [exp(σᵢₖβᵀβ) − 1] /
        √{ [exp(σᵢᵢβᵀβ) − 1 + 1/E(Yᵢ⁽ˢ⁾)] [exp(σₖₖβᵀβ) − 1 + 1/E(Yₖ⁽ˢ⁾)] }    (4.7)

for all i and k where i ≠ k.

Finally, we can generate the covariate matrix and the response vector according to (4.1) and (4.2) by specifying the μ and Σ matrices and the coefficient vector β. We thus know the exact marginal distribution of the response variables according to equations (4.3) to (4.7). Notice that so far, we have assumed that there is no intercept in the GLM. Some of the equations presented above must be slightly modified if an intercept is to be fitted; see Section 4.4 in the CHAPTER APPENDIX for details.
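The generation scheme (4.1)-(4.2) translates directly into code. The R sketch below generates one set of K replicates with r = 5 and K = 20, as in the simulations of this chapter; the particular μ, Σ, and β shown are illustrative choices in the spirit of those simulations, not the exact inputs. Conditionally on X the counts are independent Poisson; marginally they follow (4.4)-(4.7).

    # Generate {Y^(s), X^(s)}, s = 1..K, per (4.1)-(4.2); no intercept.
    r <- 5; g <- 2; K <- 20
    mu    <- matrix(c(3, 1, 2, 1, 2,           # E(X): r x g, illustrative
                      1, 2, 1, 2, 1), nrow = r)
    rho.x <- 0.9
    Sigma <- rho.x^abs(outer(1:r, 1:r, "-"))   # AR(1) covariate correlation
    beta  <- c(0.01, 1)

    L <- t(chol(Sigma))                        # L %*% z ~ MVN(0, Sigma)
    gen.one <- function() {
      # each covariate column X_{.j} ~ MVN(mu_{.j}, Sigma), independently:
      X <- mu + sapply(1:g, function(j) L %*% rnorm(r))
      Y <- rpois(r, lambda = exp(X %*% beta))  # conditionally indep. Poisson
      list(Y = Y, X = X)
    }
    data <- lapply(1:K, function(s) gen.one())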
Technically, in our simulations, applying a Poisson regression to these Y⁽ˢ⁾'s to obtain an estimate of β, and the subsequent data analysis, does not require knowledge of the above marginal distribution. However, as mentioned in the previous chapter, some response correlation structure needs to be specified in the dependence model when using GEE for parameter estimation. When deliberately misspecifying the response correlation, we may make use of the marginal response correlation structure, which is generally very different from the independence structure. In fact, the responses are often strongly correlated, marginally, because of the correlation among the covariates.

4.2 Multiple Regression via Generalized Estimating Equations

In the dependence model, we assume that the responses are correlated, given the covariates, with a certain correlation structure. In this part of our simulations, since the responses are uncorrelated given the covariates, we may examine whether specifying a non-independence structure for the responses in the dependence model decreases the efficiency of the regression parameter estimation. We do so by comparing the efficiency of such a model to that of the independence model (both models being fitted using the GEE approach).

4.2.1 GEE in the Conditional Poisson Model

The conditional density of [Yᵢ⁽ˢ⁾ | Xᵢ.⁽ˢ⁾] in (4.2) is

    f(yᵢ⁽ˢ⁾ | Xᵢ.⁽ˢ⁾) = exp{−e^{ηᵢ⁽ˢ⁾}} (e^{ηᵢ⁽ˢ⁾})^{yᵢ⁽ˢ⁾} / yᵢ⁽ˢ⁾!
                     = exp{−e^{ηᵢ⁽ˢ⁾} + yᵢ⁽ˢ⁾ log e^{ηᵢ⁽ˢ⁾} − log yᵢ⁽ˢ⁾!}
                     = exp{yᵢ⁽ˢ⁾ηᵢ⁽ˢ⁾ − e^{ηᵢ⁽ˢ⁾} − log yᵢ⁽ˢ⁾!},    (4.8)

where ηᵢ⁽ˢ⁾ = Xᵢ.⁽ˢ⁾ᵀβ for i = 1, 2, ..., r and s = 1, 2, ..., K. Comparing (3.17) and (4.8), we have

    θᵢ⁽ˢ⁾ = ηᵢ⁽ˢ⁾,  a(θᵢ⁽ˢ⁾) = e^{θᵢ⁽ˢ⁾},  b(yᵢ⁽ˢ⁾) = −log yᵢ⁽ˢ⁾!,

and φ = 1. Hence, (3.18) in the conditional Poisson model becomes

    E(Yᵢ⁽ˢ⁾ | Xᵢ.⁽ˢ⁾) = a′(θᵢ⁽ˢ⁾) = e^{ηᵢ⁽ˢ⁾},   Var(Yᵢ⁽ˢ⁾ | Xᵢ.⁽ˢ⁾) = a″(θᵢ⁽ˢ⁾) = e^{ηᵢ⁽ˢ⁾}.    (4.9)

Note that the link function in a generalized linear model is defined to be the function that maps the mean (the conditional mean in our case) of the response to ηᵢ⁽ˢ⁾. Thus, we have the log link in the conditional Poisson model, i.e. log E(Yᵢ⁽ˢ⁾|Xᵢ.⁽ˢ⁾) = ηᵢ⁽ˢ⁾.

Now, the matrices A⁽ˢ⁾ and Δ⁽ˢ⁾ simply become

    A⁽ˢ⁾ = diag( e^{η₁⁽ˢ⁾}, ..., e^{ηᵣ⁽ˢ⁾} )  and  Δ⁽ˢ⁾ = I,

so that

    V⁽ˢ⁾ = [A⁽ˢ⁾]^{1/2} R [A⁽ˢ⁾]^{1/2}.

By specifying an initial working correlation matrix R with a certain structure (e.g. exchangeable), we may now substitute the above matrices into equations (3.20) and solve for the coefficient vector β. Hence, we obtain a GEE estimate β̂_G of β. If the specified structure of R coincides with that of Cor(Y|X), then the final R¹ is the estimate of Cor(Y|X), the (spatial) correlation matrix of the conditional Poisson responses.

To estimate the covariance of β̂_G, we need to know Cov(Y⁽ˢ⁾|X⁽ˢ⁾). Recall that in this part of our simulations, we generate count data which are independent given the covariates. Thus, we can replace Cov(Y⁽ˢ⁾|X⁽ˢ⁾) in equation (3.21) by

1. the covariance obtained with the true conditional correlation matrix I, i.e. [A⁽ˢ⁾]^{1/2} I [A⁽ˢ⁾]^{1/2}; or

2. [Y⁽ˢ⁾ − a′(θ̂⁽ˢ⁾)][Y⁽ˢ⁾ − a′(θ̂⁽ˢ⁾)]ᵀ (see page 82).

In practice, the true correlation structure of the responses is not known, in which case item 1 above may not be used. Moreover, although we may sometimes believe that the data are so weakly correlated that they are close to independent, we should base our inference on the robust S.E. produced by gee. This is because the robust S.E. is computed as a "corrected version" of the naive S.E., accounting for the true covariance structure of the responses, which is estimated by item 2 above.

¹ See page 80.
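The iterative scheme for solving (3.20) in this log-link Poisson setting can be sketched in a few lines of R. The function below (name ours) performs a bare-bones Fisher-scoring update with a fixed working correlation R; unlike gee, it omits the updating of R and φ, and is intended only to show the structure of the computation. It can be applied to the data list from the earlier generation sketch.

    # Solve the GEE (3.20) for beta in the Poisson log-link model,
    # holding the working correlation R fixed.  'data' is a list of
    # K replicates, each with components Y (r-vector) and X (r x g).
    gee.poisson <- function(data, R, beta, tol = 1e-8, maxit = 25) {
      g <- length(beta)
      for (it in 1:maxit) {
        lhs <- matrix(0, g, g); rhs <- numeric(g)
        for (d in data) {
          mu    <- drop(exp(d$X %*% beta))     # a'(theta) = e^eta
          D     <- mu * d$X                    # D = A Delta X, with Delta = I
          V     <- sqrt(mu) * t(sqrt(mu) * R)  # A^{1/2} R A^{1/2}
          VinvD <- solve(V, D)
          lhs   <- lhs + t(D) %*% VinvD        # sum_s D' V^{-1} D
          rhs   <- rhs + t(VinvD) %*% (d$Y - mu)
        }
        step <- drop(solve(lhs, rhs))          # Fisher-scoring step
        beta <- beta + step
        if (max(abs(step)) < tol) break
      }
      beta
    }

    # e.g., with an exchangeable working correlation:
    # R.ex <- matrix(0.3, 5, 5); diag(R.ex) <- 1
    # gee.poisson(data, R = R.ex, beta = c(0, 0))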
4.3 Computer Simulations of Multiple Regression of Count Data

Our goal is to examine the effect of incorrectly specifying the response correlation on the efficiency of parameter estimation in a generalized linear model. (The response correlation is specified through the structure of the working correlation matrix R, or by fixing the matrix itself, when running gee.) In particular, count data are of special interest because many health impact studies use hospital admission counts to indicate health conditions of a population.

Throughout the simulations presented in this chapter, our specification of the incorrect response correlation was based on the marginal correlation of Y, not conditional on X. See Section 4.4.2 in the CHAPTER APPENDIX for details of how we specified R.

We used the statistical software S-Plus for all our simulations. Each simulation included three main parts, namely,

1. generation of the count responses and the covariates² according to equations (4.1) and (4.2), with r = 5 and K = 20;

2. regression (GLM fitting) of the generated count responses on the covariates using gee (a function in S-Plus), assuming

   (a) independent responses (R = I), and
   (b) dependent responses (R ≠ I), where
       i.  only the structure of R was specified, as either exchangeable or AR(1); or
       ii. R was fully specified as a fixed matrix based on the marginal distribution (Poisson-lognormal) of Y (as a misspecification of the correlation structure); and

3. comparison of the empirical relative efficiency of the independence and the dependence models, and other analyses.

Note that estimating Cor(Y|X) (by the final R) is not the primary concern of a GEE analysis. However, a measure of the variability of β̂_G is crucial. In S-Plus, the function gee estimates the standard error (S.E.) of each regression coefficient estimate, giving both the naive and the robust S.E. estimates. As mentioned before, in computing the robust S.E., the fact that the final estimate of R is only an estimate of Cor(Y|X) is taken into consideration.

In this part of our simulations, since the responses are in fact conditionally independent, R = I (fixed) as specified in the independence model is actually the true Cor(Y|X) (so that the working covariance matrix V⁽ˢ⁾ is the true Cov(Y|X) in the case of R = I). Hence, the naive S.E. in the fit of the independence model should be a better estimate of the true S.E. of a regression parameter estimate than the robust S.E. However, in our simulations, we compared only the robust S.E.'s of the parameter estimates produced by the independence and dependence models. This is because with real data, one is never certain whether they are truly independent or not; analyses of real data should naturally be based on the robust S.E. estimates.

In addition, we also computed the sample covariance of the β̂_G's in each simulation study³ for comparison with the robust S.E.'s produced by the two models. With enough observations of β̂_G, the sample covariance should most resemble the true covariance of β̂_G, which can only be roughly estimated using equation (3.21) in a regression of real data.

² We would need to specify some Σ in (4.1) to generate the covariates.
³ Each sample of K response vectors produced one estimate β̂_G. In each simulation study, we generated n such samples, and then obtained the sample covariance of the n β̂_G's.
This brings us back to our objective of the simulations: to determine how misspecification of the response correlation structure affects the performance of GEE parameter estimation in a GLM. A model performs well if the parameter estimates have small empirical biases and variability. Hence, a good model should be more effective in detecting departures from the null model of β = 0. Intuitively, as supported by the previous chapter, the model that accounts for the true response correlation should perform no worse than one with a misspecified correlation structure. However, the effect of such misspecification on the efficiency of parameter estimation is not so clear in the current non-Gaussian setting. Our simulations are intended to help us understand, at least empirically, what this effect is in our non-Gaussian GLM.

4.3.1 Simulations with Exchangeable and AR(1) Correlation Structures of the Covariates

The most convenient correlation structures to use in the simulations are the exchangeable and the AR(1) correlation structures, because each involves only one unknown parameter. Hence, if the responses approximately have either of these correlation structures, then a small sample size K and a small number of regions r should provide sufficient information for GEE to estimate, in a few iterations, all the unknown parameters in the GLM. Indeed, conditional on the covariates, the responses generated according to equations (4.1) and (4.2) have the independence correlation structure, which is a special case of both the exchangeable and the AR(1) structures.

At the same time, the exchangeable and AR(1) correlation structures are also convenient to specify when generating the covariates. We ran numerous simulations in which the covariates had either of these two correlation structures. Let ρx denote the intraclass correlation coefficient in Σ for exchangeable covariates, and the lag one correlation in Σ for AR(1) covariates. We specified different values of ρx, together with different μ's and β's, to generate different sets of count data, each set being marginally correlated but independent conditional on the covariates.

With each generated dataset, we used the S-Plus function gee to separately fit the independence and dependence models. We will denote by β̂ᴵ the β̂_G produced by the independence model, and by β̂ᴰ that produced by the dependence model. We would like to compare the empirical relative efficiency of the two models as reflected by how frequently the models reject the null hypotheses (for each j) of βⱼ = 0, each at significance level α = 0.05. The z-test statistics were computed based on normal approximations for the distributions of the estimates β̂ᴵ and β̂ᴰ. The S.E.'s of the β̂ⱼ's in these z-statistics were taken to be the robust S.E.'s produced by the S-Plus function gee.

We used two ways to specify a response correlation structure in the dependence model: (1) by specifying the structure of the working correlation R (i.e. exchangeable or AR(1)) as a functional argument of gee; and (2) by specifying a fixed R. In our experiments, the fixed R matrices were often very different from the identity matrix.
We determined whether the dependence model, either having an extra unknown parameter in R (case (1)) or a misspecified response correlation structure (case (2)), would be less efficient than the independence model.

However, in all the simulations, both the independence and the dependence models generally produced estimates of β with small empirical biases. This result may not be surprising at first glance, because the consistency part of Theorem 3 applies to a model with any working correlation R. However, that is merely an asymptotic result.⁴ In fact, our study indicates that GEE parameter estimation for our GLM is reasonably robust to misspecification of the response correlation even for small sample sizes; after all, K = 20 is a small sample size. Slight bias was apparent only in isolated cases, particularly in estimating the intercept. This bias could merely be due to chance. As revealed in Table 4.9, the intercept estimate was typically the most variable (it had the largest relative S.E.⁵) among all the regression coefficient estimates.

As for the empirical relative efficiency of β̂ᴵ to β̂ᴰ, we examine the p-value comparisons in the simulations. We tabulate the p-values as in Table 2.1 from Chapter 2. For each regression coefficient in one such comparison table, we are interested in comparing:

1. the margins of the 2×2 p-value table, and

2. the off-diagonals of the 2×2 p-value table.

In item 1, we are essentially comparing the empirical powers of the independence and the dependence GLM's. Hence, we can compute the empirical relative efficiency of the two models from the p-value tables as we did in Chapter 2.

We will not present the results of all of our simulations. However, the results presented in Tables 4.8 to 4.14 are representative of all of the simulations. In these selected simulations, there were g = 2 covariates, and the 5 × 2 matrix μ that we used in the "no-intercept" GLM's had small integer entries (between 0 and 3). We used the same μ for the GLM's with an intercept, where we simply added a column of 1's to the covariate matrix generated with the above expectation.

Before examining the simulation results more closely, let us first explain the meaning of the items in Table 4.8.

In running the S-Plus function gee, we specify corstr="ex" as an argument to request an exchangeable response correlation structure. Similarly, corstr="fixed" specifies a fixed response correlation matrix, in which case such a matrix must be supplied to gee in full. In simulations with corstr="fixed", we supplied the fixed matrix computed in the way described in Section 4.4.2 (i.e.

⁴ The misspecification of the correlation structure of the responses does not affect the consistency of the GEE estimate of the regression coefficients. However, this is true only provided that, among other criteria, the link function is correctly specified. See Liang & Zeger, 1986 for other criteria for the consistency of β̂_G.
⁵ We define the relative S.E. of an estimator as the ratio of the S.E. of the estimator to the estimator itself. This is the inverse of the standardized estimator.
the response correlation unconditional on the covariates), with β̂ᴵ (obtained from the independence model) in place of β.

Again, ρx represents the intraclass correlation coefficient, or alternately the lag 1 correlation, in Σ, the correlation matrix of the covariates. In our simulations, we noticed that the marginal correlation structure of the responses was often approximately exchangeable if the covariates were either exchangeable or AR(1). (See Example 1 in Section 4.4.2.) We use ρy to denote the approximate value of the marginal correlation of each pair (Yᵢ⁽ˢ⁾, Yₖ⁽ˢ⁾), i ≠ k, for each s. (That is, unconditional on the covariates, the responses are approximately exchangeable, with an intraclass correlation coefficient approximately equal to ρy.) Since this marginal correlation matrix is what we would fix as R for the dependence model in simulations with corstr="fixed", ρy indicates how badly the response correlation structure was misspecified in the model. In fact, most of these marginal response correlation matrices turned out to be quite different from the identity. However, this led to only slight loss of efficiency of the dependence model in some isolated cases. In the next section, we will see the simulation results which illustrate this point.

    Sim | corstr  | ρx  | ρy           | β                  | Fit       | n   | sample mean of         | sample mean of
        |         |     |              |                    | intercept?|     | n β̂ᴵ's                 | n β̂ᴰ's
    3ᵃ  | "fixed" | 0.9 | 0.8          | (0.1, 0.01, 1)ᵀ    | yes       | 500 | (0.100, 0.010, 1.001)ᵀ | (0.101, 0.010, 1.000)ᵀ
    4ᵃ  | "fixed" | 0.9 | 0.8          | (0.01, 1)ᵀ         | no        | 500 | (0.011, 1.000)ᵀ        | (0.010, 1.000)ᵀ
    7   | "ex"    | 0.9 | 0.03 to 0.04 | (0.15, 0.1, 0.12)ᵀ | yes       | 100 | (0.138, 0.098, 0.123)ᵀ | (0.139, 0.097, 0.123)ᵀ
    8   | "ex"    | 0.9 | 0.03         | (0.1, 0.12)ᵀ       | no        | 100 | (0.100, 0.117)ᵀ        | (0.100, 0.116)ᵀ

    ᵃ In these simulations, when fitting the dependence GLM, we assumed ρx to be known, and then substituted it and the final β̂ᴵ into equations (4.4) and (4.7). We then specified in gee a correlation structure fixed as that in equation (4.7).

    Table 4.8: Results of selected simulations.

As for the regression parameter vector β, we wanted its components to be small enough that badly misspecifying the response correlation might deter the dependence model from detecting them,⁶ making any possible loss of estimation efficiency noticeable. However, the size of a coefficient of course depends on the variability of its (GEE) estimate. Without any a priori information about the standard errors of the estimates before the actual simulation experiment was done, we specified "near-zero" regression coefficients in the model, trying values of the order of 10⁻² to 10¹ across the simulations.

Next, denote in each simulation by n the number of datasets generated, i.e. the number of GLM's fitted. For each dataset, the sample size K (= the number of replicates) was fixed at 20 per region, and r (= the number of regions) at 5. The items "sample mean of n β̂ᴵ's" and "sample mean of n β̂ᴰ's" are self-explanatory.

⁶ The larger the regression coefficients, the more easily they are detected in any regression model.

4.3.2 Detailed Results for Simulations with Exchangeable and AR(1) Covariates

Among the simulations with exchangeable or AR(1) covariates, Simulations 3 and 4 are representative of those run with corstr="fixed", and are the more interesting to examine.
Similarly, Simulations 7 and 8 are representative of those run with corstr="ex". Notice that the two pairs of simulations are paired by their regression coefficient values. For example, the latter pair have β₁ = 0.1 and β₂ = 0.12; the only difference between the two simulations is the intercept, β₀ = 0.15, in the GLM of Simulation 7, whereas the GLM of Simulation 8 had no intercept.

As mentioned before, both β̂ᴵ and β̂ᴰ seem to have small empirical biases in most of our simulations. Table 4.8 shows the typical size of the empirical biases of these two estimates. The βⱼ's are in most cases extremely well estimated by the β̂ⱼᴵ's and β̂ⱼᴰ's. The intercept estimates are usually the β̂ⱼ's with slightly larger empirical biases. In fact, the intercepts usually have larger relative S.E.'s than the other β̂ⱼ's. Table 4.9 shows the sample S.D. of the n β̂ⱼᴵ's produced by the GEE analysis for our selected simulations, together with the approximate relative S.E. of each β̂ⱼᴵ, computed as

    (sample S.D. of n β̂ⱼᴵ's) / (sample mean of n β̂ⱼᴵ's)

for each j. For example, β₀ in Simulation 7 is not particularly well estimated. Not coincidentally, its relative S.E. is more than twice the relative S.E. of each of the two slope coefficient estimates.

    Simu-  |     sample S.D. for      | approximate relative S.E. for
    lation | β̂₀ᴵ     β̂₁ᴵ      β̂₂ᴵ    | β̂₀ᴵ     β̂₁ᴵ      β̂₂ᴵ
    3      | 0.093   0.019   0.026   | 0.930   1.90    0.026
    4      | —       0.017   0.0096  | —       1.55    0.0096
    7      | 0.16    0.041   0.064   | 1.16    0.418   0.520
    8      | —       0.043   0.048   | —       0.430   0.410

    Table 4.9: Sample S.D.'s and approximate relative S.E.'s of coefficient estimates in the independence model in selected simulations.

We will not include a similar table for the β̂ᴰ's (from the dependence model), because their sample S.D.'s and sample means are very similar to those of the β̂ᴵ's (from the independence model).

This brings us to the efficiency/power comparisons of the independence and the dependence models. The sample S.D.'s should be the closest (compared with the GEE naive and robust S.E. estimates) to the true S.E.'s of the coefficient estimates, because n was large. However, in a real-life investigation, we have only one set of data, to which we fit a GLM using GEE. In that case, we can only estimate the true S.E.'s of the regression coefficient estimates, usually by the robust S.E. estimates from the GEE analysis.

To compare the efficiency of the two models, we examine their p-value comparison tables. Tables 4.10 to 4.13 are the p-value comparison tables for Simulations 3, 4, 7, and 8 respectively. The following is an example of how one can examine a p-value comparison table.

Example  Table 4.12 shows the p-value comparisons for each of β₀, β₁, and β₂ in Simulation 7. The 2×2 p-value comparison for β₁ (the middle block) shows that the empirical power⁷ of the independence model (R = I) is 58/100 = 0.58. The empirical power of the dependence model (R ≠ I) is slightly higher, at 60/100 = 0.60. We say that the dependence model is empirically 2% more powerful; this is obtained from (60 − 58)/100 = 0.02 = 2%. The off-diagonal values in this 2×2 table show the discordance of the two models in detecting a non-zero β₁. For example, among the 100 cases, there are 4 + 2 = 6 cases where the two models did not agree on the rejection of the null hypothesis.
Considering the overall p-value comparison for the three regression coefficients in this simulation, the dependence model seems to be slightly more powerful than the independence model. However, the χ² test of independence of model and power yields p-values⁸ of 0.66, 0.89, and 0.67 for β₀, β₁, and β₂ respectively. Thus, the difference in power of the two models (in detecting any of the regression coefficients) is likely due to chance. Both models are powerful in detecting β₁ and β₂, but not powerful in detecting β₀. This is because β̂₀ᴳ has the largest relative S.E. (i.e. SE(β̂ⱼᴳ)/β̂ⱼᴳ), as shown in Table 4.9. •

From our simulation results, we see that generally, both the dependence and the independence models decrease in power of detecting a non-zero regression coefficient as the coefficient becomes smaller. Also, both models had very similar empirical power in a given simulation. For example, in Table 4.10, both models had only approximately 70/500 = 14% empirical power in detecting β₁ = 0.01, but had 100% empirical power in detecting β₂ = 1.

                               R ≠ I
                      p < 0.05   p > 0.05   total
    β₀  R = I  p < 0.05    58         76      134
               p > 0.05    47        319      366
               total      105        395      500

    β₁  R = I  p < 0.05    38         35       73
               p > 0.05    31        396      427
               total       69        431      500

    β₂  R = I  p < 0.05   500          0      500
               p > 0.05     0          0        0
               total      500          0      500

    Table 4.10: p-value comparisons of Simulation 3 in Table 4.8.

                               R ≠ I
                      p < 0.05   p > 0.05   total
    β₁  R = I  p < 0.05    34         43       77
               p > 0.05    33        390      423
               total       67        433      500

    β₂  R = I  p < 0.05   500          0      500
               p > 0.05     0          0        0
               total      500          0      500

    Table 4.11: p-value comparisons of Simulation 4 in Table 4.8.

⁷ By power, we mean the probability of rejecting the null hypothesis in testing H₀: βⱼ = 0 vs. H₁: βⱼ ≠ 0 for that j, at significance level α = 0.05, when H₁ is true. We consider hypothesis tests of the βⱼ's individually for convenience; one can easily modify this into a test of H₀: β = 0, at a certain significance level.
⁸ ... obtained by the S-Plus function chisq.test.
Therefore, the two models both had 100% empirical power in detecting it in both simulations. In some simulations, such as Simulations 3 and 4, we specified a fixed correlation matrix \u00E2\u0080\u0094 the marginal Cor(Y) \u00E2\u0080\u0094 which was often very different from the identity, as can be seen from the py column in Table 4.8. In these simulations, the dependence model often had slightly less empirical power than the independence model. For example, in Simulation 3, the independence model is 134 - 105 500 x 100% = 5.8% Chapter 4. Simulations of Multiple Regression of Count Data \u00E2\u0080\u0094 Part One 103 R ^ I p < 0.05 p > 0.05 total Po R = I p < 0.05 10 0 10 p > 0.05 3 87 90 total 13 87 100 01 R = I p < 0.05 56 2 58 p > 0.05 4 38 42 total 60 40 100 R = I p < 0.05 46 1 47 p > 0.05 5 48 53 total 51 49 100 Table 4.12: p-value comparisons of Simulation 7 in Table 4.8. R ^ I p < 0.05 p > 0.05 total 01 R = I p < 0.05 66 2 68 p > 0.05 3 29 32 total 69 31 100 02 R = I p < 0.05 74 0 74 p > 0.05 1 25 26 total 75 25 100 Table 4.13: p-value comparisons of Simulation 8 in Table 4.8. Chapter 4. Simulations of Multiple Regression of Count Data \u00E2\u0080\u0094 Part One 104 more powerful than the dependence model in detecting B0. In fact, the x2 test \u00C2\u00B0f inde-pendence of model and power yields a p-value of 0.038, implying that this difference in power between the models is significant (at level 0.05, say). This difference in empirical power in detecting Q\ is 0.8% in Simulation 3, and is 2.0% in Simulation 4, although the differences in both cases seem to be results of chance according the x2 tests. The p-value discordance also seems to be more severe in those simulations where corstr=\"f ixed\". We re-ran Simulation 3 in which we saw a noticeable difference in empirical power of the two models. In the re-run, we let the G E E analysis update the values in the work-ing correlation matrix R. In running gee here, we specified the exchangeable response correlation structure. To save computational time, we set n \u00E2\u0080\u0094 100, but K remained 20. Table 4.14 shows the p-value comparison for this simulation. From Table 4.14, we can see that the dependence model indeed had a better empirical power in this re-run than the original simulation. In fact, in the re-run, the empirical powers of the independence and dependence models did not differ in testing B0 = 0, as opposed to the dependence model being 5.8% weaker in power in the original Simulation 3. In addition, the dependence model in the re-run was actually 3.0% more powerful than the independence model in the re-run in testing Bx = 0, as opposed to 0.8% less powerful in the original simulation. (However, x2 tests indicate that these differences were likely due to chance in both the original simulation and its re-run.) The p-value discordance was also much less severe in the re-run than in the original simulation. The smaller power of the dependence model in those cases such as Simulations 3 and 4 might be due to the inappropriately specified marginal response correlation matrix for the dependence model, when data were in fact independent. However, the dependence model did not have substantially lower power than the independence model in these cases. The power difference was the most \"extreme\" in Simulation 3, but the dependence model turned out to be only 5.8% weaker than the independence model in detecting B0 = 0.1. Chapter 4. 
                               R ≠ I
                      p < 0.05   p > 0.05   total
    β₀  R = I  p < 0.05    22          1       23
               p > 0.05     1         76       77
               total       23         77      100

    β₁  R = I  p < 0.05    21          1       22
               p > 0.05     4         74       78
               total       25         75      100

    β₂  R = I  p < 0.05   100          0      100
               p > 0.05     0          0        0
               total      100          0      100

    Table 4.14: p-value comparisons of a re-run of Simulation 3 in Table 4.8, this time without fixing R.

The small difference in power between the two models is particularly surprising if we consider how different the marginal response correlation matrix was from the identity in most of these cases. For example, in Simulations 3 and 4, we specified corstr="fixed", and the fixed correlation matrix had most off-diagonal values of about 0.8. This in fact illustrates the robustness of the GEE approach to misspecification of the response correlation structure.

There is another observation from our simulation results worth noting. Recall that we specified corstr="ex" in the dependence model when we ran gee in some of the simulations. Since the responses were truly independent given the covariates, the independence model should be the correct model. However, since independence is a special case of exchangeability, the final R matrix produced by gee should be close to the identity matrix, because the GEE approach is supposed to estimate the true response correlation reasonably well when the structure is correctly specified. Therefore, one would expect the dependence and independence models to have the same efficiencies here. In fact, considering the extra parameter (the response intraclass correlation coefficient) that GEE needs to estimate in the dependence model, one might even expect the independence model to be more efficient than the dependence model.

Indeed, in some of our simulations with corstr="ex" in the dependence model, this model was slightly less powerful than the independence model, possibly because of the extra unknown parameter in Cor(Y|X). However, some other simulations with corstr="ex" in the dependence model indicate that this model could also be slightly more powerful than the independence model. For example, in Simulation 7, as presented in Table 4.12, the dependence model is empirically 3%, 2%, and 4% more powerful than the independence model in detecting the three regression coefficients respectively. However, such small differences in power seem to have resulted from chance, according to χ² tests.

4.3.3 Additional Observations

We have so far discussed how the powers of the independence and dependence models compare in terms of their p-values. We may also compare their power by comparing their actual robust S.E.'s for the regression coefficient estimates. Among the simulations for which we retained a record of these S.E.'s,⁹ the S.E.'s for the independence and the dependence models were very similar. This explains why both models had virtually the same power in detecting a non-zero regression coefficient.

The GEE robust S.E.'s seemed to be relatively close to the sample S.D.'s of the coefficient estimates. A few robust S.E.'s were smaller than the corresponding sample S.D. by a factor of about 2, making the regression analyses appear more powerful than they really were. Only a few other robust S.E.'s differed from the sample S.D.'s by a slightly larger factor; many were extremely close to the sample S.D.'s.

⁹ These include simulations yet to be discussed.
Although the difference could well have been due to chance, the reason behind a robust S.E. smaller than the sample S.D. in some cases is not obvious. Further investigation may be needed.

4.3.4 Other Simulations

In addition to those simulations with exchangeable and AR(1) covariates, we also ran several other simulations with very different covariate correlation structures. Some were "unstructured," while others had the variogram structure as in Chapter 2. Some of these extra simulations were also run with different μ = E(X) matrices. Their results were very similar to those of the previous simulations. That is, we found in these simulations that

1. both the independence and dependence models produced regression coefficient estimates that have relatively small empirical biases;

2. the independence model tended to be slightly more powerful than the dependence model in those simulations with a fixed response correlation matrix;

3. the dependence model was slightly more powerful than the independence model in some of the simulations where only the response correlation structure was specified;

4. the difference in power between the two models mentioned in the previous two items was not always significant according to the χ² test;

5. both models had very similar overall power in detecting the non-zero regression coefficients in all simulations, suggesting the robustness of GEE to misspecification of the response correlation structure;

6. this power tended to decrease as the coefficient size decreased, and/or as the number of parameters in the GLM increased; and

7. the robust S.E.'s from the GEE analysis were often very close to the sample S.D.'s of the coefficient estimates, even with such small r (= 5) and K (= 20).

In addition, we also investigated how well the GEE approach estimates Cor(Y|X). We ran small-scale simulations (n ≤ 30) with increasing K, while r was held at 5 (as in the previous simulations). Each was run with μ = (3, 1, 2, 1, 2)ᵀ, a common Σ matrix, corstr="ex", and a GLM that consisted of only one slope coefficient, β₁ = −1, and no intercept. Since the final R matrix estimates Cor(Y|X) in our assumed scenario (see Section 3.4), its intraclass correlation coefficient should estimate that of Cor(Y|X), which is zero.

Intuitively, the larger the sample size K (= number of replicates per dataset), the better the response correlation matrix is estimated. We found that with K ranging from 10 to 800, the estimated intraclass correlation coefficient (in the final R) produced by gee was always close to zero. The simulation results are shown in Table 4.15; boxplots of the estimates are shown in Figure 4.9.

         summary statistics of the n intraclass correlation coefficient estimates
    K    | n  | mean       | median    | sample S.D.
    10   | 30 | −0.028     | −0.029    | 0.080
    20   | 20 | −0.012     | −0.020    | 0.056
    50   | 15 | 0.0039     | 0.0042    | 0.048
    100  | 15 | −0.0011    | −0.0029   | 0.024
    200  | 15 | −0.011     | −0.013    | 0.022
    500  | 10 | −0.000084  | −0.00059  | 0.012
    800  | 10 | −0.00093   | −0.0023   | 0.010

    Table 4.15: Final GEE R as K increases.

The second column of Table 4.15 shows the number of regressions in each simulation experiment. As K increases, the computational time for the simulation experiment increases.¹⁰ This explains our choice of small n's for the larger K's.

¹⁰ ... with our implementation of the simulation experiments in S-Plus.
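For reference, the exchangeable parameter tabulated above can be estimated by a simple moment estimator based on standardized residuals. The R sketch below shows one such variant, assuming replicates stored as in the earlier generation sketch (a list of lists with components Y and X); the exact formula implemented by gee, e.g. its degrees-of-freedom adjustment, may differ.

    # Moment estimate of the exchangeable (intraclass) correlation from
    # Pearson residuals e_i = (Y_i - mu_i)/sqrt(mu_i), Poisson log link.
    intraclass.estimate <- function(data, beta) {
      num <- 0; den <- 0
      for (d in data) {
        mu <- drop(exp(d$X %*% beta))
        e  <- (d$Y - mu) / sqrt(mu)                          # Pearson residuals
        r  <- length(e)
        num <- num + (sum(e)^2 - sum(e^2)) / (r * (r - 1))   # mean of e_i e_k, i != k
        den <- den + mean(e^2)
      }
      num / den    # approximate common correlation of each pair (i, k)
    }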
    Figure 4.9: Boxplots of estimates of the intraclass correlation coefficient (in R) with various sample sizes K. (Vertical axis: intraclass correlation coefficient estimate; horizontal axis: K = number of replicates.)

The sample median of the estimates may be more appropriate than the sample mean for assessing how well the intraclass correlation coefficient is estimated in the cases with small n's. In fact, the boxplots in Figure 4.9 show asymmetric empirical distributions of the estimates for most K values, suggesting comparison of the sample medians instead of the means. From Table 4.15, we see that the median of the estimates appears to approach zero (the true intraclass correlation) more closely as K increases, except for K = 200 and K = 800. The exceptions could be due to chance, but the real reason is not obvious, since n was small in both cases. However, the variability (as reflected by the sample S.D.) of the estimate seems to decrease steadily with increasing K. Figure 4.9 also shows that as K increases, the sample median of the estimates stabilizes around 0, and the estimates become less variable. Although the intraclass correlation coefficient estimate seems to become better (smaller bias and variability) as K increases, we are quite satisfied with the GEE approach in that, with such a short correlated series (r = 5), even a small sample size of K = 10 seems to yield a reasonable estimate of Cor(Y|X).

4.3.5 Concluding Remarks

We have thus far examined the effect of misspecifying response correlation, when the data are independent, on parameter estimation using the GEE approach. Another type of misspecification of response correlation is using the independence model to estimate regression parameters when the data are in fact correlated. In the next chapter, we will investigate the relative efficiency of the independence and dependence models in such a setting. This time, we will specify the correct correlation structure in the dependence model when running gee.

4.4 CHAPTER APPENDIX

4.4.1 Fitting an intercept β₀ in the GLM

As in Chapter 2, when fitting an intercept in the model, we have to add a column of 1's to the covariate matrix X. Let us refer to this column as X.₀. Then μ = E(X) also has a column of 1's, denoted by μ.₀. The distribution in (4.1) remains unchanged, except that j runs only from 1 to g (i.e. j ≠ 0). The coefficient vector β now becomes β = (β₀, β₁, ..., β_g)ᵀ.

Now, the row vectors Xᵢ., i = 1, 2, ..., r, all include an extra first element of 1. Nevertheless, the conditional distribution in (4.2) is unchanged with the new definitions of X, μ, and β. However, now Xᵢ.β = β₀ + Σ_{j=1}^{g} βⱼXᵢⱼ, and hence,

    Cov(Xᵢ.β, Xₖ.β) = σᵢₖ β₋₀ᵀβ₋₀

for all i and k from 1 to r, where β₋₀ = (β₁, β₂, ..., β_g)ᵀ. Note that Xᵢ₀β₀ = β₀ is non-random, and hence has no variance.
4.4.2 Examples of calculating (marginal) Cor(Y) with a specified Σ

First, assume that we do not fit an intercept in the GLM. Let the r × g covariate matrix X have expected value μ, and let each of its columns have covariance Σ.

Example 1

Suppose r = 5 and g = 1, and

    μ = (50, 35, 43, 26, 32)^T   and   Σ = [ 1    0.9  0.9  0.9  0.9
                                             0.9  1    0.9  0.9  0.9
                                             0.9  0.9  1    0.9  0.9
                                             0.9  0.9  0.9  1    0.9
                                             0.9  0.9  0.9  0.9  1   ].

That is, the covariate vector X = X.₁ has an exchangeable correlation structure, with an intraclass correlation coefficient of 0.9. Also, let β = (β₁) = (0.1). Substituting these values into equations (4.4) and (4.7), we have

    E(Y) = (149.16, 33.28, 74.07, 13.53, 24.66)^T

and

    Cor(Y) = [ 1     0.35  0.46  0.24  0.31
               0.35  1     0.29  0.16  0.20
               0.46  0.29  1     0.20  0.26
               0.24  0.16  0.20  1     0.14
               0.31  0.20  0.26  0.14  1    ].

Here, Cor(Y) does not have a standard structure, but the structure is roughly exchangeable with an approximate intraclass correlation coefficient of 0.2 to 0.3. As an entry in Table 4.8, this approximate coefficient is denoted by ρ_Y, to indicate how badly the response correlation is misspecified when the dependence model, with corstr="fixed", takes this marginal Cor(Y) as the working correlation.

In the next example, we fit an intercept β₀ in the GLM. With an extra column of 1's, the matrices X and μ now have dimension r × (g + 1), and β is now a (g + 1)-vector.

Example 2

Suppose r = 3 and g = 1, and

    μ = [ 1  10        and   Σ = [ 1     0.9   0.81
          1   8                    0.9   1     0.9
          1  25 ]                  0.81  0.9   1    ].

That is, the covariate vector X.₁ has an AR(1) correlation structure, with a lag-1 correlation of 0.9. Now, let β = (β₀, β₁)^T = (5, −0.15)^T. Substituting these values into the modified equations (4.4) and (4.7) (see Section 4.4.1), we have

    E(Y) = (33.49, 45.21, 3.53)^T   and   Cor(Y) = [ 1      0.421  0.145
                                                     0.421  1      0.175
                                                     0.145  0.175  1     ].

We see that the response vector Y has a correlation structure that is approximately AR(1) with an approximate lag-1 correlation of 0.4. As an entry in Table 4.8, this approximate lag-1 correlation is denoted by ρ_Y, to indicate how badly the response correlation is misspecified when the dependence model, with corstr="fixed", takes this marginal Cor(Y) as the working correlation.
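The arithmetic in these examples is easy to reproduce. Under the model of this chapter (conditionally Poisson counts, a log link, and Gaussian covariates), the marginal mean and covariance of Y take the standard Poisson-log-normal closed forms; the S-Plus/R-style sketch below uses those forms, and it reproduces the numbers of Example 1. The exact statements of equations (4.4) and (4.7) appear earlier in the chapter and are not repeated here, so treat the formulas in the sketch as our reading of them rather than a verbatim transcription.

    # Sketch: marginal E(Y) and Cor(Y) for Example 1 via Poisson-log-normal
    # moment formulas (these match the values displayed above).
    r <- 5
    mu    <- c(50, 35, 43, 26, 32)
    Sigma <- 0.9 * matrix(1, r, r) + 0.1 * diag(r)   # exchangeable, rho = 0.9
    beta  <- 0.1

    EY   <- exp(beta * mu + beta^2 * diag(Sigma) / 2)           # marginal mean
    CovY <- outer(EY, EY) * (exp(beta^2 * Sigma) - 1)           # i != k entries
    diag(CovY) <- EY + EY^2 * (exp(beta^2 * diag(Sigma)) - 1)   # Var(Y_i)
    CorY <- CovY / sqrt(outer(diag(CovY), diag(CovY)))
    round(CorY, 2)   # roughly exchangeable, entries between 0.14 and 0.46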
Chapter 5

Simulations of Multiple Regression of Count Data - Part Two

In this part of our simulations, we generate Poisson responses that are correlated (conditional on the covariates). We apply the GEE approach to the same correlated Poisson data to estimate the regression parameters in each of the independence and dependence models. When the correct response correlation structure is specified for the dependence model in running GEE, we expect it to be more efficient in parameter estimation than the independence model (which assumes the data to be uncorrelated). However, generating correlated Poisson responses with covariates is a difficult task. Instead, we use a GLM which has only an intercept and no covariates. In other words, the covariate matrix is simply a column of 1's, and the regression coefficient vector consists of only one element, the intercept β₀. As explained in the following section, the resulting Poisson counts have an exchangeable correlation structure.

5.1 Generating Correlated Poisson Responses

As before, we assume that there are r regions. From each region, we sample/generate K independent replicates. Each time we sample, i.e. for each fixed s, s = 1, 2, ..., K, we would like the r observations Y₁^(s), Y₂^(s), ..., Y_r^(s) to be spatially correlated.

Now, let us fix an s. We generate {W₁^(s), ..., W_r^(s)} and Z^(s) independently, according to

    W_i^(s) ~ Poisson(λ₁)   for all i = 1, 2, ..., r

and

    Z^(s) ~ Poisson(λ₂),

for some positive λ₁ and λ₂. We then let the response in region i be Y_i^(s) = W_i^(s) + Z^(s), and hence

    Y_i^(s) ~ Poisson(λ₁ + λ₂)                                              (5.1)

    E(Y_i^(s)) = Var(Y_i^(s)) = λ₁ + λ₂                                     (5.2)

    Cov(Y_i^(s), Y_k^(s)) = Cov(W_i^(s) + Z^(s), W_k^(s) + Z^(s))
                          = Var(Z^(s)) = λ₂   for all i ≠ k                 (5.3)

    Cor(Y_i^(s), Y_k^(s)) = λ₂ / sqrt( Var(Y_i^(s)) Var(Y_k^(s)) )
                          = λ₂ / (λ₁ + λ₂)   for all i ≠ k.                 (5.4)

Let λ = λ₁ + λ₂. Hence, these responses Y₁, ..., Y_r are exchangeably correlated, with intraclass correlation coefficient ρ = λ₂/λ.

Let us now put the above into the GLM format. The only observations here are the r-dimensional response vectors, Y^(s) = (Y₁^(s), ..., Y_r^(s))^T. In other words, we do not include any covariates in the GLM (a regression model). However, an intercept, β₀, does exist in the regression model. In other words, the GLM is

    Q[E(Y_i^(s) | X_i^(s))] = X_i^(s) β₀,   where X_i^(s) = 1 for all i = 1, ..., r,

for some link function Q. In vector notation, the covariate (design) matrix X^(s) is thus simply a vector of 1's. The GLM in vector notation then becomes

    Q[E(Y^(s) | X^(s))] = X^(s) β₀,   where X^(s) = 1.                      (5.5)

The goal is to estimate the unknown regression parameter, β₀, in the above GLM. The use of the GEE approach in estimating β₀ allows the incorporation of the correlation among the Y_i^(s)'s (for each s).

5.1.1 The Link Function

To be consistent with the earlier GEE notation, we define η^(s) = X^(s) β₀. In other words, η^(s) is simply the vector of β₀'s. As explained before, the link function Q in a GLM maps E(Y_i^(s) | X_i^(s)) to η_i^(s) for all i = 1, ..., r. An easy parametrization obtains by letting λ = λ₁ + λ₂ = β₀, so that Q is the identity link, since

    Q[E(Y_i^(s) | X_i^(s))] = Q(λ) = λ = β₀ = η_i^(s)   for all i.

In other words, this GLM is a true linear regression model because of the linear response-intercept relationship. However, there is a difference between this GLM and the linear model in Chapter 2: here the individual responses have a Poisson distribution, whereas before they were normally distributed.

Having identified the link function, we may proceed to the actual model fitting for the simulated data.
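Before moving on, here is a minimal S-Plus/R-style sketch of the generation scheme of Section 5.1 for a single replicate s (the λ values shown are those of the second simulation in Section 5.3):

    # Sketch: r exchangeably correlated Poisson responses via a shared
    # component.
    r <- 5
    lambda1 <- 5; lambda2 <- 50

    W <- rpois(r, lambda1)   # region-specific components W_i
    Z <- rpois(1, lambda2)   # shared component Z, common to all r regions
    Y <- W + Z               # Y_i ~ Poisson(lambda1 + lambda2), with pairwise
                             # correlation lambda2/(lambda1 + lambda2) = 10/11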
5.2 Model Fitting using the GEE Approach

Once again, we fit two different GLMs to these simulated data, namely the independence and the dependence models.

In fitting the independence model, we ignore the response correlation. We use the S-Plus function gee to fit this model, by specifying (5.5) as the linear model, Poisson as the distribution of [Y_i^(s) | X_i^(s)] (the gee function asks for the response distribution as a means of specifying the relationship between the mean and the variance of each response), the identity link function, and the independence response correlation structure. gee then produces an estimate of β₀ and the naive and robust estimates of its S.E.

We may also use the same gee function to incorporate the response correlation structure in parameter estimation. Recall that the GEE approach allows the estimation of the true response correlation matrix, Cor(Y|X) (see Section 3.3, page 78), provided that the structure of Cor(Y|X) is correctly specified. To use the S-Plus gee function in such a case, we specify the function arguments as in the independence model, except that we specify some non-independence correlation structure for [Y_i^(s) | X_i^(s)] instead of the independence structure. Since the responses we generate are exchangeably correlated, we may specify "exchangeable" for the corstr function argument. gee then uses an initial working correlation matrix, R, which has the exchangeable correlation structure, in the estimation process. Thus, in addition to estimating the regression parameter β₀ and the naive and robust S.E.'s of its estimate, gee also updates the intraclass correlation coefficient in R throughout the estimation process. The final R is then the estimate of Cor(Y|X) (= Cor(Y) here). In other words, the non-diagonal value in the final R is an estimate of ρ.

Before fitting the independence and the dependence models, we notice that with an identity link in the GLM (5.5) and no true covariates (other than the 1 vector as the design matrix), the GEE estimate of β₀ is simply the sample mean of the responses. This is true for both the independence and dependence models, and will be shown in the proof of Theorem 4 in Section 5.4. In fact, we can see that the sample mean is the best linear unbiased estimator (BLUE) of the intercept for either model (as in ordinary and generalized least squares estimation in Chapter 2), due to the linearity of the response-intercept relationship and the exchangeability of the responses. Denoting by β̂₀^I and β̂₀^D the β₀ estimates produced by the independence and dependence models respectively, we have

    β̂₀^I = β̂₀^D = Ȳ = (1/(Kr)) Σ_{s=1}^K Σ_{i=1}^r Y_i^(s).               (5.6)

However, the naive and robust S.E. estimates for β̂₀^I and β̂₀^D may be different, since their variance formulae differ. In particular, for the independence model, R = I (throughout the GEE iterations) in equation (3.19); hence V^(s) = A^(s) from equation (3.19), and thus also in (3.21) and (3.22). (The scale parameter in our Poisson model is simply 1, assumed known when using gee.) For the dependence model, R has an exchangeable correlation structure. Thus, the R matrix in the final iteration of GEE has some ρ̂ as its non-diagonal value, possibly non-zero. This is because we expect ρ̂ ≈ ρ = λ₂/λ ≠ 0. In the case of ρ̂ ≠ 0, V^(s) in equation (3.19) (computed using the final R matrix) is no longer equal to A^(s). This will possibly produce a different value in each of equations (3.21) and (3.22) from the one for the independence model (where V^(s) = A^(s)). In other words, the naive and robust Var(β̂₀^D) estimates are possibly different from the naive and robust Var(β̂₀^I) estimates.
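In code, the two fits differ only in the corstr argument. A sketch follows, with Y the vector of all Kr responses stacked by replicate and id the corresponding replicate labels; the component names (coefficients, working.correlation, naive.variance, robust.variance) are assumptions about the R gee package's interface, not necessarily those of the S-Plus version we used.

    # Sketch: independence and dependence fits of the intercept-only GLM.
    fit.I <- gee(Y ~ 1, id = id, family = poisson(link = "identity"),
                 corstr = "independence")
    fit.D <- gee(Y ~ 1, id = id, family = poisson(link = "identity"),
                 corstr = "exchangeable")

    c(fit.I$coefficients, fit.D$coefficients)   # both equal mean(Y); see (5.6)
    fit.D$working.correlation[1, 2]             # final R's estimate of rho
    sqrt(c(fit.I$naive.variance,  fit.D$naive.variance))    # naive S.E.'s
    sqrt(c(fit.I$robust.variance, fit.D$robust.variance))   # robust S.E.'s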
Since the GEE approach allows the consideration of response correlation when fitting the model, our first thought was that the two variance estimates for β̂₀^D (the parameter estimate of the correct, dependence model) would be no bigger than their independence model (incorrect) counterparts. However, this is not exactly what we saw in our simulations, as explained in the next section.

5.3 The Simulations

In our first simulation, we tried to fit n = 100 GLM regressions. For each regression, we generated K = 20 independent samples from each of r = 5 regions according to Section 5.1. We let λ₁ = 0.05 and λ₂ = 0.5, so that ρ = 0.5/0.55 ≈ 0.91 would give very highly correlated Poisson responses. The two λ's were deliberately chosen to be this small because we wanted the generated responses to resemble hospital admission counts (for rare diseases), in other words, very small. However, we overlooked the possibility of non-convergence of the GEE estimation process due to a high number of zero counts. This was in fact what happened in fitting the dependence model in some of the regressions. The GEE analysis converged for the independence model despite the zero counts because β̂₀^I could be readily computed as the sample mean, and no other unknown parameter needed to be estimated. On the other hand, for the dependence model, the estimation of ρ might become difficult with too many zero observations.

Despite the non-convergence of some of the regressions, we did observe some very interesting results. Firstly, as we expected, β̂₀^I and β̂₀^D were exactly the same value for all the regressions that converged. However, contrary to what we believed, the naive Var(β̂₀^I) estimate was always smaller than the naive Var(β̂₀^D) estimate! Another curious observation is that the robust variance estimate of β̂₀ was exactly the same for both the independence and the dependence regression models. We can summarize the observations of these simulations as follows:

    β̂₀^I = β̂₀^D                                                            (5.7)

    Var(β̂₀^I)_robust = Var(β̂₀^D)_robust                                    (5.8)

    Var(β̂₀^I)_naive < Var(β̂₀^D)_naive.                                     (5.9)

To double-check that the results in (5.8) and (5.9) were not by chance, we ran a second set of simulations, this time with a much larger λ₁ and λ₂ to avoid the non-convergence of the GEE method. We used the same n, r, and K as before, but λ₁ = 5 and λ₂ = 50. Thus λ = 5 + 50 = 55, and ρ = 50/55 ≈ 0.91 is the same large intraclass correlation coefficient as in the first simulation. In this simulation, the GEE analysis successfully converged for both the independence and dependence models for each of the 100 regressions. From the results here, we confirmed that (5.8) and (5.9) were indeed not due to chance.
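End to end, the second simulation can be sketched as follows (with the same hedges as above about the gee interface; this is an illustration rather than our original S-Plus script):

    # Sketch: n = 100 datasets with K = 20, r = 5, lambda1 = 5, lambda2 = 50,
    # recording the quantities compared in (5.7) to (5.9).
    n <- 100; K <- 20; r <- 5
    lambda1 <- 5; lambda2 <- 50
    id  <- rep(1:K, each = r)
    out <- matrix(NA, n, 5)   # beta0-hat, naive vars (I, D), robust vars (I, D)

    for (j in 1:n) {
      Y <- rpois(K * r, lambda1) + rep(rpois(K, lambda2), each = r)
      fit.I <- gee(Y ~ 1, id = id, family = poisson(link = "identity"),
                   corstr = "independence")
      fit.D <- gee(Y ~ 1, id = id, family = poisson(link = "identity"),
                   corstr = "exchangeable")
      out[j, ] <- c(fit.D$coefficients,
                    fit.I$naive.variance,  fit.D$naive.variance,
                    fit.I$robust.variance, fit.D$robust.variance)
    }
    # Empirically: out[, 2] < out[, 3] in every dataset (eq. (5.9)), while
    # out[, 4] and out[, 5] coincide (eq. (5.8)).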
The result in equation (5.8) actually looks familiar. Recall that in the linear regression setting of Chapter 2, both the dependence and the independence models produce exactly the same regression parameter estimate (and thus the same covariance matrix for the estimate) if the responses are exchangeably correlated, provided that an intercept is fitted in the model. However, the naive covariance for the independence model estimate is never smaller than that for the dependence model estimate in linear Gaussian regression. Why, then, is (5.9) true? That is, why is the naive variance estimate always smaller for the independence model than for the dependence model in a linear Poisson regression?

Note that the word naive has the same meaning in both Chapter 2 and this chapter. In Chapter 2, the naive covariance is the one calculated believing that the responses were truly independent. Here, we are naive in calculating the variance estimate for β̂₀ if we take the working covariance matrix V as the true covariance matrix of the responses. In other words, we believe that V is the true response covariance matrix, and the covariance estimate for β̂₀ obtained according to such a belief is thus naive.

In the next section, we present the theoretical reasoning for the results in (5.6) to (5.9). We also further compare the notion of naivete in the contexts of Gaussian and Poisson regression.

5.4 The Naive and Robust Variances for β̂₀^I and β̂₀^D

We now present (5.6) to (5.9), with a slight variation to (5.9), as the next theorem, followed by a proof.

Theorem 4. Suppose we have responses Y₁^(s), ..., Y_r^(s) sampled from a Poisson distribution with positive parameter λ = λ₁ + λ₂, such that they follow equations (5.1) to (5.4). In particular, for each s, the Y_i^(s)'s are exchangeably correlated with intraclass correlation coefficient ρ = λ₂/λ. Note that λ₂ > 0, and hence ρ > 0. Now suppose we fit a Poisson regression model (a GLM) according to (5.5), respectively assuming the responses to be (1) uncorrelated (the independence model) and (2) exchangeably correlated (the dependence model). We use the identity link in such a GLM. For the dependence model, we specify in GEE an exchangeable response working correlation structure without specifying the value of the intraclass correlation coefficient. Then equations (5.6) to (5.8) are true. Furthermore, Var(β̂₀^I)_naive ≠ Var(β̂₀^D)_naive in general.

To prove this theorem, we first need to translate our notation into the GEE notation. In particular, we need to express equations (3.17) to (3.22) in terms of λ₁, λ₂, and ρ.

Let us fix s. Since the Y^(s)'s are K i.i.d. vectors, the following holds for all s = 1, ..., K. Recall that the design matrix is simply the r-vector of 1's; denote this vector by 1. Also, we let η_i^(s) = λ = β₀ for all i, so that we have the identity link. Hence, the conditional density of each Y_i^(s) given X_i^(s) is

    f(y_i^(s)) = exp{ y_i^(s) log η_i^(s) − η_i^(s) − log y_i^(s)! }
               = exp{ y_i^(s) log λ − λ − log y_i^(s)! }.                   (5.10)

Comparing (5.10) to (3.17), we have θ_i^(s) = log η_i^(s) = log λ and a(θ_i^(s)) = exp{θ_i^(s)} = λ. Also,

    E(Y_i^(s) | X_i^(s)) = a′(θ_i^(s)) = exp{θ_i^(s)} = λ,
    Var(Y_i^(s) | X_i^(s)) = a″(θ_i^(s)) = λ.

Thus,

    A^(s) = diag{ a″(θ_i^(s)) } = diag(λ, ..., λ) = λI,

    D^(s) = A^(s) Δ^(s) X^(s) = 1   (since Δ^(s) = (1/λ)I here),            (5.11)

    V^(s) = [A^(s)]^{1/2} R [A^(s)]^{1/2} = λR,                             (5.12)

    [V^(s)]^{-1} = (1/λ) R^{-1},                                            (5.13)

where R is the working correlation matrix that GEE uses in its estimation process. Since we specify an exchangeable response correlation structure for the dependence model, R has all non-diagonal values equal to some α.
In each iteration of the estimation process, the values of both α and β₀ are updated. Upon convergence, the final value of α is the estimate of ρ; that is, ρ̂ = α̂. Also, upon convergence of the GEE estimation process, equations (3.21) and (3.22) are used to estimate Var(β̂₀^I) and Var(β̂₀^D) (using the final values of β̂₀ and α̂).

Note that in the GEE estimation process, we do not expect the matrix product A^(s) Δ^(s) to be computed based on the data. This is because when we specify the Poisson distribution of the responses and the identity link function, with a design matrix which is simply the 1 vector, the matrix product A^(s) Δ^(s) = I is readily obtained, without the need to estimate any unknown parameter in this product. However, the computation of A^(s) and Δ^(s) (and hence of V^(s)) individually should be data based, because they both involve the unknown parameter λ. By the identity link, we have λ = β₀, and hence λ̂ = β̂₀. Finally, as mentioned on page 82, the covariance matrix Cov(Y^(s)|X^(s)) in equation (3.21) is estimated by S^(s) S^(s)T in the GEE estimation process, where S_i^(s) = Y_i^(s) − a′(θ̂_i^(s)).

Now we may substitute equations (5.11) to (5.13) into equations (3.20) to (3.22) to prove Theorem 4.

Proof of Theorem 4. First, let us prove equation (5.6). For either the independence or the dependence model, substituting equations (5.11) to (5.13) into estimating equation (3.20) gives

    0 = Σ_{s=1}^K 1^T (1/λ) R^{-1} [Y^(s) − λ1]
        (since a′(θ_i^(s)) = λ for all i and all s)

    ⇔ 1^T R^{-1} Σ_{s=1}^K Y^(s) = β₀ K 1^T R^{-1} 1                        (5.14)

(since λ = β₀ by the identity link).

For the independence model, R = R^{-1} = I, so that equation (5.14) becomes

    Σ_{s=1}^K 1^T Y^(s) = β₀ K 1^T 1                                        (5.15)

    ⇔ Σ_{s=1}^K Σ_{i=1}^r Y_i^(s) = β₀ K r

    ⇔ β̂₀ = (1/(Kr)) Σ_{s=1}^K Σ_{i=1}^r Y_i^(s) = Ȳ.

Hence β̂₀^I = Ȳ.

Now, recall from Chapter 2 that for the dependence model we can write R(α) = (1 − α)I + α 1 1^T. (Again, α̂ is the estimate of ρ, the intraclass correlation coefficient of the exchangeably correlated responses.) Also, from Lemma 1 and the proof of Theorem 2, we know that

    R^{-1} = (1/(1 − α)) [ I − (α / (1 + (r − 1)α)) 1 1^T ].                (5.16)

Thus, we have

    1^T R^{-1} = (1/(1 − α)) [ 1^T − (α / (1 + (r − 1)α)) 1^T 1 1^T ]
               = (1/(1 − α)) [ 1 − rα / (1 + (r − 1)α) ] 1^T
               = (1/(1 − α)) [ (1 − α) / (1 + (r − 1)α) ] 1^T
               = (1 / (1 + (r − 1)α)) 1^T.                                  (5.17)

Substituting equation (5.17) into estimating equation (5.14), we have

    (1 / (1 + (r − 1)α)) 1^T Σ_{s=1}^K Y^(s) = β₀ K (1 / (1 + (r − 1)α)) 1^T 1,

which is the same as estimating equation (5.15), whose solution is β̂₀ = Ȳ. In other words, β̂₀^D = Ȳ = β̂₀^I. This proves equation (5.6) (and thus (5.7)).

Next, we prove equation (5.8). For either the independence or the dependence model, substituting equations (5.11) to (5.13) into equation (3.21) gives

    Var(β̂₀)_robust
      = [ Σ_{s=1}^K 1^T R^{-1} S^(s) S^(s)T R^{-1} 1 ] / [ K² (1^T R^{-1} 1)² ].   (5.18)

For the independence model, R = I, and equation (5.18) becomes

    Var(β̂₀^I)_robust = [ 1^T ( Σ_{s=1}^K S^(s) S^(s)T ) 1 ] / (K² r²).

For the dependence model, first notice that R^{-1} = (R^{-1})^T, because R (and thus R^{-1}) is symmetric. Thus, by equation (5.17),

    R^{-1} 1 = (1^T R^{-1})^T = (1 / (1 + (r − 1)α)) 1.                     (5.19)

Now we substitute equations (5.17) and (5.19) into equation (5.18) and obtain

    Var(β̂₀^D)_robust
      = [ (1 + (r − 1)α)^{-2} 1^T ( Σ_{s=1}^K S^(s) S^(s)T ) 1 ]
        / [ K² ( (1 + (r − 1)α)^{-1} 1^T 1 )² ]
      = [ 1^T ( Σ_{s=1}^K S^(s) S^(s)T ) 1 ] / (K² r²)
      = Var(β̂₀^I)_robust.

This proves equation (5.8).
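Before turning to the naive variances, note that the matrix algebra above is easy to check numerically; a small S-Plus/R-style sketch with hypothetical α and residual cross-products:

    # Sketch: numeric check of (5.16), (5.17), and the equality in (5.8).
    r <- 5; alpha <- 0.91; K <- 20
    one  <- rep(1, r)
    R    <- (1 - alpha) * diag(r) + alpha * one %*% t(one)   # exchangeable R
    Rinv <- solve(R)

    # (5.17): 1^T R^{-1} = 1^T / (1 + (r - 1) alpha)
    max(abs(t(one) %*% Rinv - t(one) / (1 + (r - 1) * alpha)))   # ~ 0

    # Robust variance (5.18) with R = I and with exchangeable R agree for
    # any residual cross-product matrix M = sum_s S^(s) S^(s)T:
    M  <- crossprod(matrix(rnorm(20 * r), 20, r))   # hypothetical M
    vI <- (t(one) %*% M %*% one) / (K^2 * c(t(one) %*% one)^2)
    vD <- (t(one) %*% Rinv %*% M %*% Rinv %*% one) /
          (K^2 * c(t(one) %*% Rinv %*% one)^2)
    c(vI, vD)   # identical, illustrating (5.8)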
Now, to show that Var(β̂₀^I)_naive ≠ Var(β̂₀^D)_naive, we substitute equations (5.11) to (5.13) into equation (3.22). Then, for either the dependence or the independence model,

    Var(β̂₀)_naive = λ̂ / (K 1^T R^{-1} 1)                                   (5.20)

(since λ = β₀ by the identity link, λ̂ = β̂₀).

First, for the dependence model, the working correlation matrix R has the inverse given in equation (5.16). Thus, we can substitute equation (5.17) into equation (5.20) and obtain

    Var(β̂₀^D)_naive = β̂₀^D (1 + (r − 1)α̂) / (Kr).

By equation (5.7), we may denote both β̂₀^I and β̂₀^D by a common β̂₀. The above equation then becomes

    Var(β̂₀^D)_naive = (β̂₀ / (Kr)) (1 + (r − 1)α̂).                          (5.21)

Again, for the independence model, R = R^{-1} = I. Thus, equation (5.20) simply becomes

    Var(β̂₀^I)_naive = β̂₀ / (K 1^T 1) = β̂₀ / (Kr).

Therefore, in general, unless α̂ = 0, we will have Var(β̂₀^I)_naive ≠ Var(β̂₀^D)_naive. Q.E.D.

Notice that in our simulations we generated the Poisson responses with an extremely high intraclass correlation ρ (> 0.9). Since α̂ estimates ρ, we naturally observed all positive α̂ values in our simulations. In fact, in both sets of simulations, α̂ turned out to be an extremely good estimate of ρ (α̂ between 0.8 and 1) in all of the n regressions which converged. This large positive α̂ explains the result in (5.9). In principle, α̂ can take on negative values. In particular, if we generated the responses with a near-zero (or zero) ρ, then negative α̂ values would be possible. From (5.21), we can easily see that Var(β̂₀^I)_naive > Var(β̂₀^D)_naive if −1/(r − 1) < α̂ < 0.

Also, notice that β̂₀^I = Ȳ is the MLE of β₀ under the assumption of response independence. Recall from Chapter 2 the notion of the naive statistician who blindly believes that the responses are truly independent. Thus, the theoretical variance of β̂₀^I that he computes is

    Var(β̂₀^I) = Var(Ȳ) = λ / (Kr).

The estimate of this variance is naturally computed by replacing λ with λ̂ = β̂₀^I. It is then indeed the naive variance estimate, Var(β̂₀^I)_naive. From this setup of a GLM, it is clear that the two notions of naivete, that of the naive statistician and that of the naive variance estimate, coincide.
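To put a number on (5.21): with the r = 5 and ρ ≈ 0.91 of our simulations, the dependence model's naive variance exceeds the independence model's by the factor 1 + (r − 1)α̂ ≈ 4.6. A one-line check (with β̂₀ set to the true λ = 55 for illustration):

    # Sketch: relative size of the two naive variance estimates, per (5.21).
    r <- 5; K <- 20; alpha <- 0.91; beta0 <- 55
    vI <- beta0 / (K * r)              # naive variance, independence model
    vD <- vI * (1 + (r - 1) * alpha)   # naive variance, dependence model
    c(vI, vD, vD / vI)                 # 0.55, 2.552, 4.64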
5.4.1 More on the Simulation Results

Let us now return to the question of why the naive variance estimate is always smaller than the robust variance estimate in our simulations. Such a phenomenon creates an illusion that Naive, using his/her naive variance estimate, achieves better efficiency in his/her inference than Enlightened, who bases his/hers on the robust variance estimate. However, a closer look reveals that the former estimate is simply the MLE variance estimate for the case of (assumed) independent responses coming from a single population, forming one big i.i.d. sample (of size Kr). Naturally, such a variance estimate must be smaller than one computed assuming the r responses per replicate to be (positively) correlated, and hence allowing for more between-group response variability among the K replicates.

As for the robust variance estimates, we have the results of Theorem 4 for both the independence and dependence models. Such variance estimates are derived from the asymptotic distribution of β̂₀ presented in Theorem 3.

For convenience, instead of determining the true variance of β̂₀, we use the sample S.D. of the n β̂₀'s to estimate the true SE(β̂₀). With enough datasets n, we may assume this sample S.D. to be reasonably close to the true S.E. On the other hand, our goal here is only to obtain some idea of the relative variabilities of the intercept estimates. Thus, in this investigation, we decided against large-scale simulation experiments, although they would allow a better understanding of the relative variabilities of the estimates.

Let us revisit our latter simulation, in which the GEE method converged for all of its n = 100 datasets. Our results, based on the first 50 datasets of this simulation (the way we recorded the simulation output made it difficult to base the results here on all 100 datasets), are summarized in Table 5.16.

    true β₀                                           55
    sample mean of the 50 β̂₀'s (β̂₀^I = β̂₀^D)         54.90
    SE(β̂₀^I)_sample = SE(β̂₀^D)_sample                1.69
    SE(β̂₀^D)_naive                                    1.66
    SE(β̂₀^I)_robust = SE(β̂₀^D)_robust                1.53
    SE(β̂₀^I)_naive                                    0.77

    Table 5.16: S.E. estimate comparison based on the first 50 simulated datasets.

For the independence model, SE(β̂₀^I)_sample denotes the square root of the sample variance of the 50 β̂₀^I's, while SE(β̂₀^I)_naive and SE(β̂₀^I)_robust respectively denote the sample means of the 50 naive and the 50 robust estimates of SE(β̂₀^I). We denote the S.E. estimates for β̂₀^D (using the dependence model) in a similar way. Of course, SE(β̂₀^D)_sample = SE(β̂₀^I)_sample by equation (5.7), and SE(β̂₀^I)_robust = SE(β̂₀^D)_robust by equation (5.8).

The results in Table 5.16 show that Ȳ, the sample mean of the responses, is a reasonably good estimate of the true β₀. Using SE(β̂₀^I)_sample (= SE(β̂₀^D)_sample) as the "gold standard" (g.s.) of the true SE(β̂₀) for both models, SE(β̂₀^D)_naive is closest to the g.s. among the naive and robust estimates. It is followed by SE(β̂₀^I)_robust = SE(β̂₀^D)_robust, and SE(β̂₀^I)_naive is farthest from the g.s. In fact, SE(β̂₀^I)_naive is the only S.E. estimate which gives a 95% confidence interval that does not contain the true β₀. This is an indication that, in general (under the conditions of Theorem 4), the naive variance estimate of the independence model is a poor estimate which substantially underestimates the true Var(β̂₀). The underestimation can be explained by the fact that the between-group variability among the K replicates is ignored in the computation of this variance estimate. In other words, if response correlation is ignored when fitting the GLM, then inference based on the naive variance estimate may be far from the truth. Fortunately, ignoring response correlation but basing inference on the robust variance estimate does not hurt estimation efficiency, in the sense that this variance estimate is the same as the robust variance estimate that accounts for response correlation. Of course, the advantage of the dependence model over the independence model is the possibility of estimating the parameters in the response correlation structure.
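Given the out matrix from the simulation sketch in Section 5.3, the quantities of Table 5.16 could be computed along the following lines (illustrative, and tied to the hypothetical column layout used there):

    # Sketch: Table 5.16-style summary. Columns of `out`: beta0 estimate,
    # naive variances (I, D), robust variances (I, D).
    c(mean(out[, 1]),          # sample mean of the beta0 estimates
      sqrt(var(out[, 1])),     # "gold standard": sample S.D. of the estimates
      mean(sqrt(out[, 3])),    # mean naive S.E., dependence model
      mean(sqrt(out[, 4])),    # mean robust S.E. (same for both models)
      mean(sqrt(out[, 2])))    # mean naive S.E., independence model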
5.5 Concluding Remarks

The above results of Theorem 4, and the size comparison of the different S.E. estimates of β̂₀, are all based on the particular GLM we deal with here. Therefore, we may not generalize our observations to such settings as GLMs with covariates and/or non-linear link functions. Simulations of non-linear Poisson regressions may be difficult, mainly because of the complexity of generating Poisson count data that are correlated (conditional on the covariates) and non-linearly linked to the covariates. (A method for this task is developed in the Ph.D. thesis of Jianmeng Xu (1996).) However, based on our results, we recommend using the robust variance estimates for such non-linear regressions of real data, regardless of whether the responses are assumed correlated or independent. This is because we have seen, from the simulation results here and in Chapter 4, that the data analyses based on the robust variance estimates for the independence and dependence models are generally very similar. We have also seen that, in our setup of a GLM in this chapter, the naive variance estimate of the dependence model estimates the true variance overwhelmingly better than that of the independence model. Moreover, the dependence model allows estimation of the true response correlation matrix if its structure is known or can be approximated; in this case, one should base inference on the dependence model rather than the independence model. An extra remark is that the GEE method seems to estimate all the model parameters reasonably well, i.e. the regression parameters and those in the response correlation structure, even with relatively small r and K (5 and 20 in our simulations).

Chapter 6

Discussion

In this thesis, we have presented some results on the effect of misspecified response correlation in regression analysis. We focussed on regression analyses applied to health impact studies, in which the responses are often obtained from neighboring geographical sites and thus are spatially correlated (see Chapter 1). In Chapter 2, we made use of the storyline of the naive and enlightened statisticians to illustrate the severe loss of parameter estimation efficiency that can result from ignoring response correlation when fitting a linear response-covariate relationship with Gaussian responses. In particular, both theory and our simulation results indicate that Naive's regression analysis is almost always less powerful/efficient than Enlightened's. (Theoretically, Naive's analysis is never more powerful, although it can be more powerful empirically due to chance.) These and other forms of response correlation misspecification in GLMs were examined in Chapters 4 and 5. The GLMs were fitted via the GEE approach, which allows the incorporation of response correlation in the parameter estimation process. Our results suggest reasonable robustness of the GEE approach (for those GLMs that we investigated) to misspecification of response correlation, with a small number (about 20) of independent replicates/clusters in the sample, if the analysis is based on the robust variance estimates. In other words, the GLM analyses based on such variance estimates, using the correct and incorrect response correlations respectively, both have similar efficiencies. The analyses based on the naive variance estimates are similarly efficient if the response correlation structure is correctly specified. Otherwise, the true variances may be considerably underestimated, thus falsely suggesting high efficiencies of the analyses.
Moreover, when the response correlation structure is correctly specified, the parameters in the correlation matrix are relatively well estimated3 even with very few replicates (10 to 20) and correlated measurements per series4 (about 5). However, we have not examined the G L M setting in which the clusters are not inde-pendent. For example, if the replicates are repeated measurements obtained over a short period of time, they may be temporally correlated. Thus, both spatial and temporal correlations may exist in the responses: the measurements within a replicate may be spatially correlated, and the replicates may be temporally correlated. Studying the effect of ignoring correlation among replicates in a G L M analysis may be very complex although important to health impact investigations. The complexity lies in the incorporation of such correlation into the G L M and hence the parameter estimation process. In fact, the G E E approach only allows incorporation of correlation along a series, but strictly assumes the replicates to be independent. This assumption is often met in longitudinal data analyses in which the measurements are correlated temporally, but the measurement replicates are obtained from separate individuals and thus independent. On the other hand, the assumption of independent replicates may fail in some health impact studies for reasons mentioned earlier. Unfortunately, an alternative method to the G E E approach which accounts for response correlation along both spatial and temporal \"axes\" in parameter estimation is apparently not available. 3 . . . provided that the GEE estimation process converges. Non-convergence of the process may result due to insufficient information in the data (e.g. too many unknown parameters or too many zero counts in a Poisson regression). 4See page 110. Bibl iography [1] Aitchison, J . , Ho, C. H . (1989). The multivariate Poisson-log normal distribution. Biometrika 76, 643-653. [2] Anderson, Theodore W. (1971). The Statistical Analysis of Time Series. John Wiley & Sons. [3] Basu, Sabyasachi, Reinsel, Gregory C. (1994). Regression Models With Spatially Correlated Errors. American Statistical Association Journal 89, 88-99. [4] Cressie, Noel A . C. (1991). Statistics For Spatial Data. John Wiley & Sons. [5] Fitzmaurice, Garrett M . (1995). A Caveat Concerning Independence Estimating Equations With Multivariate Binary Data. Biometrics 51, 309-317. [6] Freedman, D., Pisani, R., Purves, R., Adhikari, A . (1991). Statistics, 2nd Ed. W. W. Norton & Co. [7] Graybill, Franklin A . (1983). Matrices with Applications in Statistics, 2nd Ed. Wadsworth. [8] Herzberg, Agnes M . (1988). Some further results for the equivalence of ordinary least squares and weighted least squares estimators: an advantage for rotable designs. SIAM Institute for Mathematics and Society Technical Report #116. [9] Johnson, Richard A . , Wichern, Dean W. (1982). Applied Multivariate Statistical Analysis. Prentice Hall. [10] Kotz, Samuel, Johnson, Norman L . , Read, Campbell B . (1985). Encyclopedia of Statistical Sciences. John Wiley & Sons. [11] Liang, Kung-Yee, Zeger, Scott L. (1986). Longitudinal data analysis using general-ized linear models. Biometrika 73, 13-22. [12] McElroy, F. W. (1967). A necessary and sufficient condition that ordinary least squares estimators be best linear unbiased. American Statistical Association Jour-nal 62, 1302-1304. [13] Marascuilo, Leonard A . , Levin, Joel R. (1983). Multivariate Statistics in the Social Sciences. Brooks/Cole. 