Effect of Misspecif ied Response Cor re la t ion in Regression Analys i s by Grace Shung-Lai Chiu B.Sc. University of British Columbia, 1994 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T OF T H E R E Q U I R E M E N T S F O R T H E D E G R E E OF M A S T E R OF SCIENCE in T H E F A C U L T Y OF G R A D U A T E STUDIES Department of Statistics We accept this thesis as conforming to the required standard T H E U N I V E I ^ f T Y OF BRITISH C O L U M B I A June 1996 ©Grace S. Chiu, 1996 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it i freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of The University of British Columbia Vancouver, Canada DE-6 (2/88) Abst rac t One can imagine a possible loss of parameter estimation efficiency when response cor-relation is ignored or misspecified in modeling a response-covariate relationship. Under what conditions is efficiency lost? How much is lost? Whether the responses are correlated or independent, standard theory for the distri-bution of least squares parameter estimates in linear models (Gaussian responses) can be readily determined. We find that the linear regression analysis assuming indepen-dent responses is (theoretically) never more efficient than that incorporating response dependence. The "difference" in efficiencies between these two analyses — measured by how much more readily the latter detects a non-zero regression coefficient — generally increases as the coefficient-to-noise ratio increases. To incorporate response correlation in G L M parameter estimation, Liang &; Zeger (1986) extended the quasi-likelihood theory and developed the generalized estimating equations (GEE) approach. Despite being a popular method, the effects of misspecify-ing response correlation (e.g. assuming independence when responses are correlated) on parameter estimation efficiency using G E E are not obvious. To investigate such effects, we use simulation studies in which we generate count data and use the G E E approach to estimate the model parameters, using both the correct and misspecified correlation struc-tures. The generated counts, the number of correlated responses in each cluster/replicate, and the total number of replicates are all small to imitate health impact studies (in which hospital admission counts are often the responses). Despite possible loss of parameter estimation efficiency due to such "obstacles" intrinsic in the model, simulation results indicate that the G E E approach produces' ii 1. regression parameter estimates with relatively small empirical biases using either a correct or misspecified response correlation; 2. a good estimate of the response correlation matrix if its structure is correctly spec-ified; 3. naive and robust variance estimates both of which estimate the true variance well when the response correlation structure is correctly specified; and 4. good robust variance estimates even when the response correlation is misspecified. Furthermore, in a G L M with exchangeably correlated Poisson data and no covariates, specifying independence or exchangeable dependence yields the same intercept estimate and estimation efficiency, provided that inference is based on the robust variance esti-mate. The naive variance estimate can significantly underestimate the true variance if the responses are assumed independent when analyzing such a G L M . in Table of Contents Abs t rac t i i L is t of Tables v i i L i s t of Figures v i i i Acknowledgements ix 1 In t roduct ion 1 2 M u l t i p l e Linear Regression 6 2.1 Introduction 6 2.2 Literature Review 8 2.3 Notation 11 2.4 Which estimator is appropriate? 12 2.5 The Exchangeable Model 14 2.6 Common Spatial Correlation Structures 21 2.7 The Simulation Experiments 23 2.7.1 Generating Correlated Responses 25 2.7.2 Where are Naive and Enlightened? 28 2.7.3 Simulation Results 29 2.8 How bad is Naive's analysis? 43 2.8.1 Empirical Power Curves 44 2.8.2 Histograms of p-value Ratios and p-value Scatter Plots 51 iv 2.8.3 The case of a chosen St « I 64 2.9 Discussion 66 2.10 C H A P T E R A P P E N D I X 69 3 In t roduct ion to M u l t i p l e Regression of Count D a t a 71 3.1 Literature Review 73 3.2 G E E in S-Plus 76 3.3 Some Notation and Theory in G E E 77 3.4 Terminology 83 4 Simulat ions of M u l t i p l e Regression of Count D a t a — Par t One 85 4.1 Generating Count Data and their (Random) Covariates 85 4.2 Multiple Regression via Generalized Estimating Equations 89 4.2.1 G E E in the Conditional Poisson Model 89 4.3 Computer Simulations of Multiple Regression of Count Data 91 4.3.1 Simulations with Exchangeable and AR(1) Correlation Structures of the Covariates 93 4.3.2 Detailed Results for Simulations with Exchangeable and AR(1) Covariates 98 4.3.3 Additional observations 106 4.3.4 Other Simulations 107 4.3.5 Concluding Remarks 110 4.4 C H A P T E R A P P E N D I X I l l 4.4.1 Fitting an intercept f30 in the G L M I l l 4.4.2 Examples of calculating (marginal) Cor(Y) with a specified S . . I l l 5 Simulat ions of M u l t i p l e Regression of Count D a t a — Par t Two 114 v 5.1 Generating Correlated Poisson Responses 114 5.1.1 The Link Function 116 5.2 Model Fitting using the G E E approach 116 5.3 The Simulations 119 5.4 The Naive and Robust Variances for J3q and 8^ 121 5.4.1 More on the simulation results 127 5.5 Concluding Remarks 129 6 Discussion 131 Bib l iography 133 vi Lis t of Tables 2.1 An example of a p-value contingency table for fli 25 2.2 Results of Simulation 1, with AR(1) response dependence 30 2.3 Results of Simulation 2, with AR(1) response dependence 31 2.4 Results of Simulation 3 35 2.5 Results of Simulation 4 36 2.6 Results of Simulation 5 39 2.7 Results of Simulation 6 40 4.8 Results of selected simulations 97 4.9 Sample S.D.'s and approximate Relative S.E.'s of coefficient estimates in the independence model in selected simulations 99 4.10 p-value comparisons of Simulation 3 in Table 4.8 101 4.11 p-value comparisons of Simulation 4 in Table 4.8 101 4.12 p-value comparisons of Simulation 7 in Table 4.8 103 4.13 p-value comparisons of Simulation 8 in Table 4.8 103 4.14 p-value comparisons of a re-run of Simulation 3 in Table 4.8, this time without fixing R 105 4.15 Final G E E R as K increases 108 5.16 S.E. estimate comparison based on the first 50 simulated datasets 128 vii Lis t of Figures 2.1 Empirical Power Curves for the GLS and naive OLS analyses 45 2.2 More Empirical Power Curves for the GLS and naive OLS analyses. . . . 49 2.3 Histogram Sketches of logged p-value Ratios and their Scatter Plots: B\ varies 55 2.4 Histogram Sketches of logged p-value Ratios and their Scatter Plots: a varies 58 2.5 Histogram Sketches of logged p-value Ratios and their Scatter Plots: /?i/tr = 1 60 2.6 Histogram Sketches of logged p-value Ratios and their Scatter Plots: fii/a = 1/2 62 2.7 Histogram Sketches of logged p-value Ratios and their Scatter Plots: Bi/a = 2 63 2.8 Histogram and Scatter Plots with Variogram Dt: a varies 65 4.9 Boxplots of estimates of intraclass correlation coefficient (in R) with var-ious sample sizes, K 109 vin Acknowledgements I would like to express my most sincere thanks to my two supervisors, Dr. Nhu Le and Dr. James Zidek, for their patience and guidance throughout the past twelve months. It has been an honor and a wonderful experience working for them. I would also like to thank Dr. John Petkau for being my mentor for the past four years. His guidance, support, and numerous advices will be eternally appreciated. Many thanks to the rest of the faculty members for being kind and helpful throughout my Master's program; to the Statistics Department office staff for always being helpful and patient; to Paige Eveson for her friendship, patience, help with my writing, M V N r.v. generator, and being the best officemate one can ever have; and to the rest of the Statistics graduate students for sharing both tough and wonderful times with me. ix Chapter 1 Introduction Researchers are often interested in the relationship between a response variable and explanatory variables (covariates). In many health care studies, for instance, the response may be some measure of health status of a population, and the covariates may be indices of different pollution types to which the population is exposed. The relationship between response and covariates can be determined through regres-sion analysis. Which regression method to use depends on the nature of the response variable. For example, linear regression via least squares estimation can be applied to response data which can be assumed to follow a normal distribution. When we can no longer assume Gaussian response data, we may sometimes use a generalized linear model (GLM) for the response-covariate relationship. Such a model may be appropriate, for instance, when the response data are discrete and consist of small counts, such as hospital admission counts as a measure of health status of a population. Regardless of the regression method, researchers should be very careful in fitting the model when any correlation possibly exists among the response data. For example, peo-ple living in neighboring geographical regions may tend to be at similar risk of developing respiratory ailments due to similar environmental factors shared by these regions. Effi-cient statistical inference is essential to, say, ensure appropriate allocation of government funding for controlling pollutant levels. Thus, introducing response correlation into the statistical model when studying the health impacts of pollution may be necessary. (This is an example of spatial correlation among the responses.) However, some researchers in 1 Chapter 1. Introduction 2 the past would not incorporate response correlation when modeling a response-covariate relationship. This was sometimes due to the belief that the only source of variability came from independent and identically distributed (i.i.d.) measurement errors.1 It might also be due to the complexity of incorporating response correlation into some non-linear statistical models such as a G L M . Using data which resemble those in health care studies, in this thesis we investigate the effect of misspecified response correlation on the efficiency of parameter estimation in both linear and generalized linear regression analyses. For example, ignoring response correlation is a form of correlation misspecification. The effect of such misspecification in linear regression is closely investigated in Chapter 2. We then turn our attention to GLM's in the following chapters. Our GLM's involve small Poisson count responses which resemble hospital admission counts observed from neighboring regions in studies of rare diseases. We use the generalized estimating equations (GEE) approach for fitting our GLM's , which allows the specification of a working response correlation structure.2 The effect of ignoring response correlation in a G L M is studied in Chapter 5, while Chapter 4 examines the effect of specifying some non-independence structure for independent responses. Simulation experiments are used in our investigation. Each experiment involves gener-ation of many datasets (responses and covariates), regression analyses using the true and incorrect response correlation structures respectively for each dataset, and the (empirical) efficiency comparison between the two analyses. We also include analytical arguments to support some of our simulation results. We measure (empirical) efficiency by how often a regression analysis rejects the null hypothesis of a zero regression coefficient in a simulation experiment.3 Rejecting such a 1See Cressie, 1991, page 25. 2 See page 80. 3That is, the sample probability of hypothesis rejection by the analysis in the simulation experiment. Chapter 1. Introduction 3 hypothesis is equivalent to detecting a non-zero coefficient in the regression model. Natu-rally, we say that the analysis that detects an existing non-zero coefficient more readily is more efficient. However, efficiency in this context is directly influenced by the variability of the coefficient estimate produced by the analysis. If the estimate has a very large stan-dard error, then the test statistic for the above hypothesis test will be very small, the null hypothesis will be retained and hence the analysis will be inefficient (not very powerful) for detecting this coefficient. For example, we find that in a linear regression, the gen-eralized least squares estimator (see Section 2.3) of a coefficient often has much smaller variability than the ordinary least squares estimator. Thus, the analysis based on the former estimator (which incorporates the response correlation) is more efficient/powerful than that based on the latter (which assumes response independence). We also find that an analysis generally increases in its power to detect a non-null regression coefficient as the coefficient-to-noise4 ratio increases. The G L M coefficient estimates produced by the G E E approach have a known asymp-totic Gaussian distribution. To assess efficiency of a G L M analysis, we can compute the naive and robust variance estimates5 for the G E E coefficient estimates based on this asymptotic distribution. However, each simulation experiment involves a finite number of data replicates.6 Thus, the asymptotic variance estimates may not estimate the true variances well. The assessment of efficiency is then affected by how good (close to the true variances) these variance estimates are. In particular, an analysis that produces a smaller but severely biased variance estimate may not be more efficient than another that produces a larger but unbiased variance estimate. Yet how do we know if an estimate is unbiased? Unlike the least squares estimates in a linear model, the exact distribution of a G E E 4Noise is the standard deviation of the random errors (with zero mean) in the model. 5See Section 3.3. 6See Section 3.3. Chapter 1. Introduction 4 coefficient estimate with finite number of replicates is typically unavailable. However, given enough datasets in a simulation experiment, we may use the sample variance of the G E E estimates7 as the "gold standard" for the true variance. In a simulation experiment, for each analysis (using correct/incorrect response correlation), we first assess how good the naive and robust variance estimates are by comparing them to this gold standard. Then we compare the efficiencies of the analyses based on their naive and robust esti-mates respectively. Finally, considering how well the true variance is estimated, we can conclude which analysis based on which variance estimate is more efficient. For example, we find that for the G L M in Chapter 4, the analyses using the correct and misspeci-fied response correlations respectively yield good coefficient estimates8 and good robust variance estimates. Thus, both analyses based on their robust variance estimates have similar efficiencies. (We do not examine analyses based on the naive variance estimates in this chapter.) For the G L M in Chapter 5, assuming9 either independent responses (incorrect) or exchangeably correlated responses (correct) yields the same G E E coefficient estimate and robust estimate of its variance. Both are good estimates of the true parameters. However, the naive estimate can considerably underestimate the true variance when assuming response independence. On the other hand, the naive estimate is good if the response correlation structure is correctly specified. In other words, the analysis assuming either independent or exchangeable responses based on the robust variance estimate yields the same efficiency. The latter analysis based on the naive variance estimate yields similar efficiency to that of these two analyses based on the robust estimate. However, assuming 7Each dataset produces one vector of coefficient estimates. 8That is, the estimates have small empirical biases. 9Technically, specifying a working correlation structure is not an assumption in model fitting. How-ever, we will use the word "assume" to reflect the belief that the response data have the correlation structure as specified by the working correlation. See Section 3.4 for more details. Chapter 1. Introduction 5 independent responses and using the naive variance estimate yields an incorrect 1 0 and thus inefficient analysis. ... because the variance of the coefficient estimate is incorrect/underestimated. Chapter 2 Multiple Linear Regression 2.1 Introduction Suppose we are interested in the relationship between some response variable and g ex-planatory variables, or covariates. The objective is to estimate the regression parameters associated with each covariate. Suppose we believe that the response-covariate relation-ship is linear. In other words, we believe that the response variable can be expressed as a linear combination of the covariates, with the coefficients in the linear combination being the regression parameters/coefficients. Then we may estimate these parameters via least squares linear regression. If we measure r sets of the response variable and the covariates, r > g + l , 1 least squares parameter estimates in such a regression can then be explicitly expressed in terms of the (r) responses, their covariances (if they are correlated), and the observed covariates. In many real life situations, the measurements of the response variable of interest may not be independent. For example, in a clinical trial to determine the effects of a drug, we measure some response on each patient several times over the period of the trial. Thus, repeated measurements on the same subject may be correlated over time. In this thesis, we are more interested in spatially correlated response variables. Sup-pose we take a measurement of a single response from each of several sites within the same geographical domain of interest. The measurements (of the same response) across the different sites may be correlated when they come from proximate sites. Correlation 1See Section 2.3. 6 Chapter 2. Multiple Linear Regression 7 among the measurements can be induced by similar environmental factors acting across the geographical domain. Such a factor would be air pollution level. Our scenario might well be realized in health care studies, where the impact of air pollution on human health is measured through some response variable on a human subject from each site, together with pollution and meteorological covariates. Researchers may overlook the correlation in the responses. In particular, they often believe that i.i.d. measurement errors constitute the only source of response variability.2 Furthermore, it may be tempting to assume that the geographical sites are so far apart that correlation among the responses may be ignored without affecting the inferences about regression parameters. However, the strength of the correlations may depend not only on the distance between the sites, but also on the similarity of their shared environmental factors. In this chapter, we concentrate on potential dangers which may arise in multiple linear regression if one overlooks the correlation among the response measurements. We will assume that the covariances of the responses are known. That assumption will not usually be realistic, in which case the covariance matrix must be estimated from the observed data. (See Johnson & Wichern, 1982 for details on how regression inference changes when the covariance matrix is estimated.) However, when accounting for correlated responses in regression, we expect that the conclusions based on estimated parameters will not change significantly if the correlation is well estimated. We only compare conclusions based on (1) assumed response independence to (2) admitted response dependence in parameter estimation. We are not interested in the possible efficiency loss in parameter estimation due to a poorly estimated response covariance matrix. Our investigation uses simulation studies. In each simulation, we generate realizations of model (2.1) (see Section 2.3). With each realization/dataset, the regression parameters 2See Cressie, 1991, page 25. Chapter 2. Multiple Linear Regression 8 are estimated with and without taking response dependence into account. Note that in all our simulation experiments, we fit the correct regression model (i.e. the model with the correct mean response structure) with each dataset. Thus, if the realizations of model (2.1) are generated, say, with an intercept and two slope coefficients, then the fitted model will contain an intercept and two slope coefficients. On the other hand, if the realizations are generated with no intercept and only one slope coefficient in model (2.1), then the fitted model will contain solely one slope regression parameter. We do not fit models that do not coincide with the true models to investigate the impact of factors such as incorrectly specified mean response structure. We only consider the impact on inference due to response dependence. 2.2 Literature Review Numerous publications address multiple linear regression. Many of them, such as Weis-berg, 1985, Johnson h Wichern, 1982, and Marascuilo h Levin, 1983, are undergraduate text books. Most discuss how to account for correlated responses in regression param-eter estimation. However, they usually do not investigate the impact of ignoring such correlation. Advanced books and articles have been written since the 1960's on the loss of efficiency in linear regression parameter estimation when response dependence is ignored. For example, Plackett, 1960, Rao, 1965, Anderson, 1971, and Martin, 1982 all mention that, when the responses are dependent, the matrix A — B is always non-negative definite, A being the covariance matrix of the regression parameter estimate vector under the independent response model, B that under the model with a correctly specified response correlation matrix. (Also mentioned is the unbiasedness of the parameter estimates despite ignoring response dependence.) In other words, admitting the known covariance Chapter 2. Multiple Linear Regression 9 of the responses in regression analysis yields parameter estimates which are at least as efficient as those produced by ignoring response dependence. A natural question arises: "Under what conditions can response dependence be ig-nored without loss of efficiency?" Some of these conditions are stated in Plackett, 1960 (as an end-of-chapter exercise); Rao, 1965; Zyskind, 1967; McElroy, 1967; Anderson, 1971; Seber, 1977; Martin, 1982; and Cressie, 1991. Some of these articles supply neces-sary and sufficient conditions for the equivalence of A and B. One of these conditions, less technical than the others, is when the responses have an exchangeable correlation structure. This is explained in Section 2.5. However, none of these articles mentions the fact that, when an investigator overlooks the response dependence, s/he naturally does statistical inference based on parameter estimates whose standard errors are calculated as if the responses were truly independent. We then have three covariance matrices for two parameter estimate vectors: 1. the (true) covariance matrix of the correct parameter estimates (considering the known response dependence); 2. the true covariance matrix of the incorrect parameter estimates (obtained while ignoring response dependence); and 3. the incorrect covariance matrix of the incorrect parameter estimates (both obtained assuming the responses are uncorrelated). We can think of the first covariance matrix as that computed by the "enlightened" statistician, who knows the true covariance matrix of the responses and obtains his/her parameter estimates and their covariances taking account of this knowledge. The third covariance matrix is that computed by the "naive" statistician, who does not know the responses are correlated, and thus obtains his/her parameter estimates, as well as their Chapter 2. Multiple Linear Regression 10 covariances, without accounting for response dependence. Finally, the second covariance matrix is that computed by Enlightened for Naive: Enlightened knows Naive's covariance matrix is incorrect, and does Naive a favor by computing a correct one for him/her. For reasons given in Section 2.3, we refer to the least squares parameter vector esti-mate assuming response dependence as the GLS estimator. We call the one obtained ignoring response dependence the OLS estimator. The first covariance in the above list is thus the covariance matrix of the GLS estimator. Let us call it the GLS covariance. We call the true covariance matrix of the OLS estimator the true OLS covariance. This is the one given in the second item of the above list. Finally, we call the third covariance matrix in the list the naive OLS covariance for obvious reasons. In this chapter, we will investigate the loss of efficiency resulting from using the OLS instead of the GLS estimator, by essentially comparing the GLS covariance to the true OLS covariance, as well as the GLS covariance to the naive OLS covariance, with different covariate (design) matrices and regression parameter values.3 In this way, we hope to understand the loss of efficiency in parameter estimation when ignoring response dependence. At the same time, we also try to understand what the effects on parameter estimation are when Naive bases his/her inference on his/her naive parameter estimates and covariances. We will base our investigation on computer simulations implemented in the statistical software S-Plus. Later in the chapter, we address the question of how to model spatial correlation. A few such models are presented in Cressie, 1991, and Basu and Reinsel, 1994. 3For reasons given in Section 2.7, we calculate p-values (for hypothesis tests) based on these covari-ances. We then compare the p-values (hence, functions of the covariances) instead of directly comparing the covariance matrices. Chapter 2. Multiple Linear Regression 11 2.3 Notation Suppose the spatial domain of interest consists of r regions. In an investigation, we observe in region i the response (dependent) variable Y{, i = 1,2,..., r. We then decide to include in our regression model, say, g covariates (explanatory variables) xn,..., Xig for each region i, together with an intercept /?o and a slope f3j for the jth. covariate, j = 1,2,... ,g. In matrix notation, we may then write our model as Y = X(3 + e (2.1) where 1 xn x12 1 x21 x22 xlg x2g ( a \ 01 I , and e = (The intercept /?o and the column of l's in the design matrix, X , are of course omitted if we decide not to fit an intercept in the above model.) It is often convenient to assume that X is of full rank, and that rank(X) = g + 1 < r. We also assume that the expected value and the correlation matrix of the error vector e are respectively E(e) = 0 and Cor(e) = 2?£, where St is symmetric and positive definite. Assuming that the variance of each e; is cr2, the covariance matrix of e is then Cov(e) = a2U€. A simple case of (2.1) obtains when Dt = I, the identity matrix. In this case, the best linear unbiased estimator (BLUE) of the parameter /3 is simply the ordinary least squares (OLS) estimator pOLS = ( x T X ) " 1 X T y (2.2) with covariance matrix Cov(3 O L S) = cr2 (XTX)~\ (2.3) Chapter 2. Multiple Linear Regression 12 With a known St ^ I, however, the B L U E for f3 is the generalized least squares (GLS) estimator (3GLS = (XTE-1X)-1XTE;1Y (2.4) with covariance matrix Cov(3 G L S) = a2 (XTS-1X)~1. (2.5) (See Seber, 1977, page 49 for derivation of the BLUE's . ) The covariance formulas for these B L U E ' s are derived in standard textbooks on regression theory. Recall the notion of the naive and enlightened statisticians. In the case Se ^ J , Enlightened uses the estimator in (2.4) to estimate the regression parameters. S/he uses the formula in (2.5) to calculate Cov( /3 G L 5 ) . On the other hand, since Naive assumes St = I in all situations, s/he uses the estimator in (2.2) and the covariance formula in (2.3) in his/her analyses. However, if the responses are truly not independent, i.e. St ^ J , then the true OLS covariance is actually Cov(3 O L S) = Cov[(XTX)-*XTY] = (XTX)-*XT Cov(Y) [(XTX)-'XT]T (2.6) = a2(XTX)-1(XTSeX){XTX)-\ Thus, we see that, in general, the covariance matrices in (2.3), (2.5), and (2.6) will differ. 2.4 Which estimator is appropriate? In principle, when the responses are known to be correlated, we should always account for their correlation in fitting the model. However, sometimes Uc may be difficult to determine (or estimate). Other times, it is believed that the correlation among the responses is too weak to make much difference in the inference. Also, we can easily see Chapter 2. Multiple Linear Regression 13 from (2.2) and (2.4) that both J3OLS and /3GLS are unbiased for (3 regardless of whether 27e = Cor(e) = COT(Y) is the identity matrix. In fact, the expected value of / 3 O L S does not depend on the value of ST. For reasons such as these, some investigators may ignore the presence of response correlation. However, the following theorem indicates a potential danger in using f3OLS to estimate p when 27£ ^ I. Theorem 1 Cov(f3OLS) — Cov((3GLS) is always non-negative definite. Rao, 1965 presents a proof of this theorem. Here, we are referring to the true OLS covariance as Cov((3OLS). In other words, although both f3OLS and (5GLS are unbiased for (3, Theorem 1 states that the former is never "better" than the latter in that there is at least as much variability in flOLS as in (3GLS. When the matrix (in Theorem 1) is positive definite, the parameter estimation based on / 3 O L S is lower in efficiency (with the true OLS covariance) than that based on pGLS_ yye W J V J s r i o w some simulation results later in the chapter to further illustrate and quantify this point. When is the matrix in Theorem 1 a zero matrix, then? That is, when are these two covariances4 equal? Two obvious conditions for the two covariances to be equal are: 1. 3 O L 5 = 3 G L 5 , and in particular, 2. E€ = I. The second condition is not of much interest to us. We are interested in situations where pOLS j g a g e m c i e n t a s pGLS w h j i e t n e responses are not independent. 4 G L S covariance and true OLS covariance. Chapter 2. Multiple Linear Regression 14 Necessary and sufficient conditions for f3OLS = (3GLS are presented in Rao, 1967; Zyskind, 1967; and McElroy, 1967. However, most of these conditions are very technical and difficult to verify. Thus, we believe investigators should always use flGLS to estimate /3 whenever the responses are known to be correlated. However, one realistic exception to the need to use f$GLSh stems from the exchangeable structure, also known as the intraclass correlation structure, often met in analyses of repeated measurements, and sometimes in spatially correlated data. With such structure, fitting an intercept in model (2.1) will entail the actual equality of flOLS and flGLS\ 2.5 The Exchangeable Model A n exchangeable model for correlation structure assumes that the correlation matrix of the errors is Cor(e) = E€ 1 P P P P 1 P P P P 1 P ••• P \ = (1 - p)I + PJ (2.7) P P 1 P \P P P 1 / where J is a matrix with all elements equal to 1, and 0 < p < 1. This last inequality guarantees that Se is positive definite for any r. For a given r, we can replace this inequality by [ - l / ( r - 1)] < p < l . 6 Theorem 2 Assume that we fit an intercept in the regression model (2.1). Then OLS $GLS for all Y St has the form 5However, this does not mean that we can use the naive OLS covariance in place of the GLS covariance. 6See Graybill, 1983, page 215. Chapter 2. Multiple Linear Regression Se = (l-p)I + pJ. To prove Theorem 2, we need the following lemma. Lemma 1 Let the matrix W have the form W = al + bP where P is an r x r matrix such that P2 = tP, with t > 0. Then W1 = - I - -i-^f-r P . a a(a+bt) Proof W-W = ' P ) ( f l J + bP) a(a+bt) b D 6 D b 2 t 1+ b-P- zhr.P~ a+bt a (a+6t ) J 6 = / + ^T6FjK« + bt) - a - bt]P = I Similarly, WW'1 = I, and hence the result. Proof of Theorem 2 7 Note that J3OLS _ pGLS {XTX)-1XTY = (XTE^X)-*XTZ (XTE^X)(XTX)-1XTY = XTE^Y. 7This proof is an elaboration of the one by McElroy (1967). Chapter 2. Multiple Linear Regression 16 Therefore, proving Theorem 2 is equivalent to proving (*) holds for all Y if and only if Et = (1 - p)I + PJ. Now, (*) automatically holds for all X 1 " 1 if Y is in the range space of X, i.e. if Y — XZ for some Z. This is so because then, in (*), LHS = (XTZ-1X)(XTX)-1(XTX)Z = XTS~1(XZ) = XT2-*Y = RHS. Therefore, we now need only to consider those Y in the null space of X , i.e. those Y such that XTY = 0. (<=) Assume Ut = (1 — p)I + pJ. Denote the column vector of l's as 1. Then J = 11 T ==> J2 = 1(1T1)1T = rllT = rJ. Since r > 0, by Lemma 1, we have 27- 1 = T^I -l - P ( l - p ) ( l - p + p r ) ' I-P* ( l -p) [ l + (7-l)p]' e. -l + (r-\)p-Now, let y be in the null space of X , i.e. l-p i l + f r - D p 1 1 J 0 = XTy = \ X9 J y Chapter 2. Multiple Linear Regression 17 where xj = (xu, x2i, ..., xri). Then, lTy = 0. With y in place of Y in (*), we now have LHS = (XTS-lX){XTX)-1XTy = 0 (since XTy = 0) and RHS = XTS-'y = 0 - 0 (since XTy = 0 and lTy = 0) = LHS. Thus, (*) holds for all y such that XTy = 0. Hence, P°LS = pGLS. (=>) Suppose (3OLS = 3 G i S - Therefore, (*) holds for all Y. Let y be such that XTy = 0. Hence, by (*), XTE;1y = (XTE;1X){XTX)-1XTy = 0. (2.8) Now, fix i,j, and such that 1 < i,j, k < r, and i =^ j , A:. Let e,- denote the elementary vector all of whose co-ordinates are O's except the z'th being 1. In turn, let U be an r x (g + 1) matrix of the form U = (1, e,-, « a , . . . , v s ) where the set of distinct elementary vectors r 3 , . . . ,?>s is chosen such that t73,.. .,t;a ^ e,-or ej or e^. (That is, the vectors v 3 , . . together with any of e,-, ej, or e^, form a basis.) Then, U has the specified form of the design matrix X in model (2.1), and we Chapter 2. Multiple Linear Regression 18 can substitute U for X in (2.8). Now, / XT \ UT{e3 - ek) e1 - e1 0. This implies that the vector (ej — ek) is in the null space of U. Thus, by (2.8), we have UTS;1(ej-ek) = 0. (2.9) Write the ( i , j ) th element of U~x as (E~1)ij = cr*j. Now, we can write CT 1/ ' 2 / Then, the matrix ( n r \ X « a* ... <) = V 1 2 7* / 1 T < . eT<r* . . e^V t i t r VT(T* . . VT(T* 2 1 2 r VTCT* . . vTtr* 9 i 9 T Chapter 2. Multiple Linear Regression 19 has its second row equal to . . . , e f < ) = (a*,, C T * 2 , o*r) ith row of X 1 1 . Hence, equation (2.9) implies <3~<k = o. (2.10) This is true for all 1 < i,j, k < r such that i ^ j or fc. Thus , X 1 " 1 has al l off-diagonal elements equal. Equation (2.9) also implies 0 = (1 T <, . . . , l V ) ( e i - e t ) = ( E L l ^ l , E / = l ^ r ) ( C i - C f c ) Er * v^r * Now, (2.10) and (2.11) together imply = K - ^ ) + K - ^ ) (since 27£, and hence 2? e - 1 , is symmetric). Finally, we have a*j = a*kk for all 1 < j , k < r. Chapter 2. Multiple Linear Regression 20 Hence, St 1 has all diagonal elements equal. Note that Lemma 1 implies that St must also have its off-diagonal elements equal, and its diagonal elements equal. But Ht is positive definite, and thus it has positive diagonal elements. In addition, a correlation matrix has the property trace = r. Therefore, the diagonal elements of St must all equal 1. Hence, St is of the form 27e = (1 — p)I + pj for some p £ [0,1), as was to be proved. Q.E.D. Although (3OLS = (3GLS when the responses have an exchangeable correlation struc-ture and when an intercept exists in model (2.1), we need to be very cautious in computing its covariance in such situations. Recall that we have the GLS covariance, the true OLS covariance, and the naive OLS covariance of the corresponding (3 estimates. Theorem 2 says that, since /3OLS = / 3 G L 5 , their covariance matrices are the same. This is to say that the true OLS covariance equals the GLS covariance. This is not to say that the naive OLS covariance and the GLS covariance are the same, unless St = I. When St ^ I, the naive OLS covariance can be considerably different from the true OLS covariance (which is the same as the GLS covariance). The loss of efficiency due to the use of the incorrect covariance formula will be illustrated in Section 2.7.3. Thus, although the (3 estimates obtained by both Naive and Enlightened may be the same, the covariances that they compute for the estimates may be very different, regardless of what the response correlation structure may be. In other words, Theorem 2 only allows Enlightened to use f3OLS to estimate /3 for simplicity, but does not allow either to ignore Se in the covariance formula for their estimates. Chapter 2. Multiple Linear Regression 21 2.6 Common Spatial Correlation Structures We have now seen that with an exchangeable response correlation structure, we may use either the OLS or the GLS estimator to estimate the regression coefficients in model (2.1) without affecting estimation efficiency, provided that (1) the model includes an intercept and (2) Cov ( /3 O L S ) is taken to be the true OLS covariance in (2.6). However, when we have a different response correlation structure, using the OLS estimator may result in significant loss of efficiency in parameter estimation, as will be shown later in the chapter. So what response correlation structures are seen in spatial statistics? Let us first assume that the regions of interest have a "linear spatial arrangement," as do the major cities in the Pacific Northwest. Then the spatial correlation in the responses may resemble the common autocorrelation structures in repeated measurement or time series analyses.8 Some such structures are the exchangeable and the AR(1) (autoregressive process with lag 1) structures. The former is useful when we believe that observations from regions either nearby or far apart are correlated to the same extent. The latter is used in cases where the strength of correlation declines with increasing geographical separation. However, regions of interest are usually spread out in a two-dimensional geographical domain. The natural measure of how far apart two regions are is then the physical distance between them. This distance is often measured from the centroid of one region to that of the other.9 For each centroid, we record its bearing, i.e. its latitude and longitude. The distance between two centroids can then be calculated from their bearings (see equation (2.12)). It is often believed that correlation between the observations from two distinct regions 8See Cressie, 1991, page 340. 9See Rosychuk, 1994. Chapter 2. Multiple Linear Regression 22 declines with increasing separation.1 0 This belief suggests that the variogram,11 Uc, be modeled to have a form like: 1 2 (Se)ij = the (i, j ) th element of Ee = e x p { - | | z ; - Z j | | a } (2.12) where ' latitude of region i ^ Zi = ^ longitude of region i j Note that — Zj\\2 is a measure of the distance between regions i and j. In other words, the correlation between two regions decreases exponentially as their separation increases. This structure is widely used in spatial statistical models. Another way to model spatial correlation in a two-dimensional layout is presented in Basu & Reinsel, 1994. This type of model requires that the regions be arranged in a two-dimensional grid. The observation from each region is thus assigned a co-ordinate, (i,j). The correlation is introduced by a unilateral first order ARM A error process that depends on the co-ordinates. The drawback of using such a model is the difficulty in estimating the extra parameters (in the A R M A process) intrinsic in the regression model. In our computer simulations of linear regressions, we used both standard and non-standard correlation structures. The standard ones include the exchangeable, AR(1), and common variogram structures. We did not use the model presented by Basu & Reinsel (1994) for the reasons mentioned above. The non-standard correlation structures include some which resemble the AR(1) structure. A l l of these will be introduced in the next section. 10See Cressie, 1991, page 24. 1 1That is, the covariance function of spatially correlated observations. See Cressie, 1991, page 40 for a detailed definition. 12See Cressie, 1991. Chapter 2. Multiple Linear Regression 23 2.7 The Simulation Experiments Our objective is mainly to investigate how Naive's inferences compare with those of Enlightened. First, we specified the model parameters: /3 (the vector of regression parame-ters), Ue (the correlation matrix of the errors, and hence, of the response data), and cr2 (the noise level, or the "scale" of the covariances of the errors/responses). Next, we fixed an arbitrary covariate (design) matrix, X. We then generate error data, e;'s, from a multivariate normal (MVN) distribution with zero mean vector and covariance matrix equal to a2Se (a2 and St specified separately). Finally, X, (3, and e were "put together" to give correlated response data, Y, according to model (2.1). With this dataset Y, Naive and Enlightened fitted the linear regression model. Know-ing the true regression parameters /3, we can judge how well these two statisticians have estimated (3. We are concerned in particular with how much efficiency in parameter estimation is lost in Naive's analysis. Recall that Naive believes the response data to be independent, and hence computes a naive covariance matrix of according to formula (2.3). Let us denote it by Cov ( / 3 O L S ) n a i v e . Enlightened knows the true St, and hence computes the correct co-variance matrix of (3GLS, according to (2.5). We denote this by Cov(/3G L S). Finally, Enlightened realizes Naive's mistake and computes the true covariance matrix of (3OLS for Naive, according to (2.6). We denote it by Cov(/3OLS)tlue. In each simulation experiment, we fixed one set of model parameters with a covariate matrix, and carried out n repetitions of the process of: 1. generating e and calculating Y; 2. fitting Naive's regression model to obtain /3 O L S , then estimating a2 from this Chapter 2. Multiple Linear Regression 24 naively fitted model, obtaining Cov ( /3 0 i 5 )naive and Cov(/3 o z , 5 )true using the esti-mated (OLS) cr2, but using the true £ e ; 1 3 3. fitting Enlightened's regression model to obtain /3GLS, then estimating <72 from this fitted model, obtaining Cov(/3GLS) using the estimated (GLS) a 2 , but using the true ST; 4. obtaining t-test statistics based on each of these three covariance matrices for testing the hypotheses H0 : fa = 0 vs. Hi : fa ^ 0 at significance level 0.05, for all j = 1,2,..., g; and 5. calculating the corresponding p-values of the above t-test statistics. The number of covariates g was usually set to anywhere from 1 to 30. After n such sets of p-values were obtained, we compared two pairs of analyses. We first compared Enlightened's analysis to that of Naive. We then compared Enlightened's analysis to that s/he did for Naive. The comparisons were done via some p-value comparisons. For each regression coefficient, we put the corresponding p-values of the two pairs of analyses into a contingency table such as Table 2.1, to compare the consistency of hypothesis rejection (number of times both methods rejected H0) and the empirical power (number of times H0 was rejected) of the two methods. 11 Ue = J" in the second column of Table 2.1 means that the p-values were computed based on Cov(/3OL5)naive5 "-^ e known" that the p-values were computed based on Cov(/3O L S)true-1 3That is, when fitting a regression model, we assume the response correlation matrix, £ £ , is given, but the "scale" a is not. We adopted our assumption from Weisberg, 1985, page 40. Chapter 2. Multiple Linear Regression 25 GLS fix p < 0.05 p > 0.05 OLS Ee = I p < 0.05 p > 0.05 Ue known p < 0.05 p > 0.05 Table 2.1: An example of a p-value contingency table for 8\. 2.7.1 Generating Correlated Responses The (model) parameters in model (2.1) include cr2, 27£, and /3. In a simulation experi-ment, we need to specify all of these parameters, together with the covariate matrix X, in order to obtain the response vector Y. In one simulation study of n realizations of model (2.1), we would first specify X, the design/covariate matrix, and the regression coefficient vector, (3. Instead of arbitrarily "inventing" these quantities everytime we ran a simulation experiment, we decided to have a computer implemented procedure which could be used to generate both X and (3. In this way, we would only need to specify a limited number of program arguments instead of each of the r x g covariate values and g + 1 coefficient values. The procedure for obtaining the design matrix and the regression coefficients is pre-sented in Section 2.10, the C H A P T E R A P P E N D I X . This procedure requires the specifi-cation of fix and ZJx, the common expectation and covariance matrix of the g covariate vectors. These were truly arbitrarily chosen. For example, we sometimes took fix — 0, and other times fix = (1,2,3) T for a model with an intercept and two slope coeffi-cients. Ijx was taken to be the identity in some cases. We also tried JC^'s which have exchangeable structures. Once X and {3 were generated, they were used to generate all the n Y"'s from model Chapter 2. Multiple Linear Regression 26 (2.1). In fact, we did not generate a new X and f3 in every simulation. Once generated, the same X and /3 were sometimes re-used in other simulations, varying only St and <r2 of the distribution of the random errors in the model. In other simulations, we fixed the same 30 x 2 covariate matrix for those cases in which an intercept along with one slope coefficient was present in model (2.1). The first column of this X is simply the 1 vector. The second column was generated from a normal distribution with mean 10 and standard deviation 1/2.1 4 The first five elements of this second column are 9.73, 9.61, 9.33, 10.62, and 10.57. Next, we needed to generate the random errors in the model by specifying cr2 and St. The matrix ZJe was chosen from the following: 1. an AR(1) correlation matrix with a lag one correlation of 0.6 or 0.9; 2. a correlation matrix (which resembles the AR(1) structure) that has its ( i , j ) th element equal to (a) 1 if i = (b) 0.5 ii\i-j\ = 1, (c) 0.2 if \i-j\=2, (d) 0.1 if \i- j\ =3 , and (e) 0 if \i-j\ > 3, for i,j = l , 2 , . . . , r ; 3. a correlation matrix, similar to the above, with its (i,j)th element equal to (a) 1 if i = j, 1 4Again, the choice of the moments was arbitrary. Chapter 2. Multiple Linear Regression 27 (b) 0.8 if \i-j \ = 1, (c) 0.45 if \i-j\ = 2, (d) 0.15 if \i- j\ = 3, and (e) Oif \i-j\ >3 , for i,j = 1,2,... ,r; 4. an exchangeable correlation matrix with an intraclass correlation coefficient of 0.6 or 0.9; and 5. the variogram computed according to equation (2.12) with / ZT \ whose columns were generated independently from a Uniform(25, 50) and a Uni-form(100, 110) distribution respectively.15 Define e's) to be the sth sample of errors, s = 1,2,..., n, generated from e<s) ~ MVN (0 ,<r 2 i : e ) . (2.13) After generating these n (independent) samples of e^'s, we would use the X matrix and the (3 vector (same for all s) obtained using our algorithm mentioned earlier, and compute = X/3 + e<s'. (2.14) 1 5Once generated, the Z matrix (and hence the variogram) was fixed throughout those simulations in which the variogram was used. Chapter 2. Multiple Linear Regression 28 Thus, these Y( s)'s are i.i.d. M V N , with mean X/3 and covariance Et. Finally, we provided Naive and Enlightened with the n observed response vectors and the set of covariates. (Again, Naive believes that the responses are independent, whereas Enlightened knows the value of £ t . ) They returned their n sets of j3 estimates with their corresponding p-values. We then obtained the sample mean and variance of each regression coefficient estimate as summaries of the simulation experiment. Their p-values were then tabulated into a contingency table such as Table 2.1 for comparison of their estimation efficiencies. 2.7.2 Where are Naive and Enlightened? We brought these two statisticians to life (figuratively speaking) by a module in a com-puter program which we implemented in the statistical software S-Plus. Other modules in the program include one for generating the n e^'s, and another for tabulating the p-values that the statisticians have computed. A n independent module was implemented to generate covariates and regression coefficients. The program user has to specify \ix and Ex in this module if s/he wishes to generate the covariates and regression coefficients. These values can be stored and re-used in other simulations to save computational time. The user then has to specify a covariate matrix and a vector of regression coefficients, one or both of which may be generated by the above module, or fixed by the user. In addition, s/he has to specify a and Se in the main program. This program will first compute the responses, Y, and then "call on" the two statisticians to obtain p-values from them. Next, the main program will hand these p-values over to the "output department" to produce the p-value contingency tables. Chapter 2. Multiple Linear Regression 29 Terminology In the rest of this chapter, we will refer to Enlightened's analysis as the GLS analysis, and to Naive's as the naive OLS analysis. Finally, the analysis done by Enlightened by using the OLS estimator and its true OLS covariance will be referred to as the true OLS analysis. When examining the p-value contingency table for a particular regression coefficient, we will compare the off-diagonal values. For example, we will refer to the ratio of the (2,1) element to the (1,2) element in the contingency table for the GLS analysis to the naive OLS analysis as the consistency ratio with respect to (wrt) the naive OLS analysis. That is, a 10-to-l consistency ratio wrt the OLS analysis means that there are 10 datasets of a simulation in which the GLS analysis detected a non-zero regression coefficient (at significance level 0.05) when the naive OLS analysis did not; whereas there is only one dataset in which the naive OLS analysis detected the coefficient, but the GLS analysis did not. In other words, the larger this consistency ratio, the more efficient the GLS analysis relative to the naive OLS analysis (provided that the detected coefficient is truly non-zero). Similarly, the corresponding consistency ratio with respect to the true OLS analysis will be the consistency ratio wrt the true OLS analysis. By "detecting a non-zero regression coefficient," we mean that such a coefficient exists in the regression model and is detected (by the particular analysis). 2.7.3 Simulation Results (i) Simulations 1 and 2 Simulations 1 and 2 were somewhat small scale studies with few correlated responses (r = 4) and only n = 200 realizations of model (2.1). The small r and n allowed us to Chapter 2. Multiple Linear Regression 30 Program input variables: r 9 n Sx <72 2. fit intercept? 4 2 200 ( l , 2 , 3 f I 1 AR(1) lag 1 correlation =0.6 yes Program output: / 1 1.36 3.33 \ 1 1.66 4.82 1 1.67 4.86 V 1 0.81 4.94 / P = (1,0,0) T (fixed; see page 32) OLS estimates GLS estimates 00 0i 02 00 0i 02 sample mean of n estimates 0.67 0.05 0.06 0.69 0.06 0.05 sample variance of n estimates 8.54 1.10 0.41 7.85 1.06 0.35 true variance of each estimate 8.02 1.06 0.38 7.59 0.98 0.33 GLS 00 0i 02 p < 0.05 p > 0.05 p < 0.05 p > 0.05 p < 0.05 p > 0.05 OLS £t = I p < 0.05 11 0 12 0 11 0 p > 0.05 2 187 3 185 1 188 OLS Et known p < 0.05 13 1 15 0 12 0 p > 0.05 0 186 0 185 0 188 Table 2.2: Results of Simulation 1, with AR(1) response dependence Chapter 2. Multiple Linear Regression 31 Program input variables: r 9 n Vx Ex cr2 Ee fit intercept? 4 2 200 (1,2,3)'J' exchangeable, intraclass p = 0.6 1 AR(1), lag 1 correlation = 0.6 yes Program output: / 1 1.44 1 1.52 1 3.86 \ 1 2.14 2.80 \ 3.09 4.59 2.82 J (3 = (-0.875,0.375,0.375)2 OLS estimates GLS estimates A. & fa Pi sample mean of n estimates -0.544 0.563 0.137 -0.602 0.494 0.197 sample variance of n estimates 6.02 1.70 2.49 5.73 1.47 2.27 true variance of each estimate 6.36 1.57 2.33 6.24 1.40 2.20 GLS A) Pi Pi p < 0.05 p > 0.05 p < 0.05 p > 0.05 p < 0.05 p > 0.05 OLS Et = I p < 0.05 12 0 14 0 12 0 p > 0.05 2 186 0 186 3 185 OLS Ee known p < 0.05 14 3 14 2 15 2 p > 0.05 0 183 0 184 0 183 Table 2.3: Results of Simulation 2, with AR(1) response dependence Chapter 2. Multiple Linear Regression 32 determine the approximate computational time needed for larger scale computer simu-lations. However, technicalities such as the computational efficiency of our simulation program and the capacity of the computer system on which the simulation experiments were run are beyond the scope of this thesis. Simulations 1 and 2 were run mostly with the same program input variables. One difference lay in Ex, the covariance matrix for generating the covariates and the regres-sion coefficients. However, only in Simulation 1 did we discard the regression coefficients generated by the corresponding module and replace it with the vector (1,0,0) T . We did this to examine how often the GLS or OLS methods falsely detected non-zero coefficients. Results of Simulation 1 are shown in Table 2.2. Those of Simulation 2 are shown in Table 2.3. First, let us examine the results of Simulation 1. We notice that the true variance of each OLS regression coefficient estimate is larger than that of the corresponding GLS estimate. This illustrates the result presented in Theorem 1. However, the covariate matrix generated from the program yielded very similar OLS and GLS estimates, and hence, similar (true and sample) variances. Notice that both the OLS and the GLS estimates for the intercept are substantially smaller than the true intercept (=1). The empirical biases may be due to chance, as sug-gested by the large variances of the intercept estimates. Nevertheless, in most regression analyses, the analyst's prime interest lies not in the intercept of the model, but the slope coefficients. S/he would be concerned about how much each unit change in a covariate would affect the response. This change is measured by the corresponding slope estimate and its variability. The estimates for the slope coefficients seem quite accurate. Now, to assess the consistency (see page 24) of the methods of Enlightened and Naive, we examine the off-diagonal values of each 2 x 2 p-value contingency table. First, we see Chapter 2. Multiple Linear Regression 33 that the consistency ratio wrt the naive OLS analysis for the intercept is 2:0. That is, Naive (in the naive OLS analysis) twice (out of 200) failed to detect the non-zero intercept at the 0.05 significance level when Enlightened succeeded (in the GLS analysis). On the other hand, this ratio wrt the true OLS analysis is 0:1. That is, Enlightened (in his/her GLS analysis) failed to detect it once when Naive succeeded (in his/her naive OLS analysis). However, the small number of inconsistent conclusions made by the two statisticians could well be due to chance. Similar reasoning obtains for the estimates of the two slopes. To compare the empirical power of the three analyses, we compare the sample proba-bilities of detecting a non-zero regression coefficient at significance level 0.05. For exam-ple, Enlightened's power for detecting a non-zero intercept was (11 + 2)/200 = 0.065 = 6.5%. This power was slightly lower at 11/200 = 5.5% for Naive. In the true OLS anal-ysis, Enlightened's power for the OLS estimator was (13 + l)/200 = 7.0%. The power is slightly lower (6.5% as above) for his/her GLS analysis. These powers are all very low, possibly due to the large variances in the intercept estimates. Also, the powers are similar because of the particular choice of covariate matrix which gave similar OLS and GLS estimators. We do not see striking power differences either between those associated with the OLS and GLS slope estimates. In fact, if we compare the GLS power and that of either OLS analysis, the Pearson's x 2 test of independence of analysis (GLS/OLS) and power (p < or > 0.05)1 6 of detecting each of do, and B2 retains the null hypothesis (of independence). For example, the p-value1 7 for the \ 2 test is 0.83 for the GLS and naive OLS analyses in detecting B0.18 Thus, the power differences in this simulation experiment are very likely 1 6 In such a x2 test, the two categories of analysis are "GLS" and "OLS," each with two power levels, namely, "p < 0.05" and "p > 0.05." We test the independence of analysis and power (category and level). 1 1 . . . obtained using the S-Plus function chisq.test . 1 8 In this x2 test, the data are (naive OLS p < 0.05, GLS p < 0.05) = (11, 13), and similarly (189, Chapter 2. Multiple Linear Regression 34 to be results of chance1 9 according to the x2 test. Finally, the fact that the two powers were low in detecting non-zero slopes in this simulation is desirable, because the true slopes were zero! In Simulation 2, much the same sort of results were obtained. A l l the (true) OLS estimate variances are noticeably larger than their GLS counter-parts. We again see large variability in the slope coefficient estimate for both the OLS and the GLS methods. In fact, this is true for every co-ordinate of the (3 vector. The sample mean values of the parameter estimates indicate that overall, the coefficients were not well estimated, possibly due to chance as a result of their large variabilities. The large variances were mainly due to the choice of X and St. However, the GLS estimates appear to be slightly more accurate with smaller empirical biases than the OLS estimates. Large variability in the coefficient estimates also entails low power in detecting non-zero true coefficients. This power for all three regression coefficients is under 10% based on any of the naive OLS, true OLS, and GLS covariances. Finally, minimal inconsistency is found between the methods in the detection of non-zero coefficients. However, notice that in a few cases (3 for j30 and 2 for both f3i and B2), the true OLS analysis actually out-performed the GLS analysis in detecting non-zero coefficients. Of course, the small number of such cases does not strongly suggest much about the overall power of this true OLS analysis. In fact, this out-performance of the true OLS analysis is likely due to chance: the \ 2 tests2 0 for this simulation experiment all yielded large p-values. Also, theoretically, the true OLS analysis is never more efficient than the GLS analysis, as indicated by Theorem 1. 187) for the "p > 0.05" level. v 1 9 . . . in the terminology of Freedman, Pisani, Purves & Adhikari, 1991, Chapter 26. 20See footnotes on page 33. Chapter 2. Multiple Linear Regression 35 Program input variables: r 9 n X P a2 Et fit intercept? 30 1 500 see below (3,1.5)'* 1 item 2 on page 26 yes Program output: sample mean of n estimates sample variance of n estimates true variance of each estimate OLS estimates 3.11 11.54 13.19 Pi 1.49 0.11 0.13 GLS estimates Po 3.05 8.60 9.93 Pi 1.49 0.08 0.10 GLS Po Pi p < 0.05 p > 0.05 p < 0.05 p > 0.05 OLS St = I p < 0.05 35 8 482 0 p > 0.05 46 411 15 3 OLS Et known p < 0.05 46 . 24 491 1 p > 0.05 35 395 6 2 Table 2.4: Results of Simulation 3 (ii) Simulations 3 and 4 The next two simulations were of larger scale than the previous two. Instead of n — 200, we had 500 realizations of model (2.1) in Simulations 3 and 4. Such larger simulations were naturally more time consuming. To save computational time, we fixed the covariate matrix and the regression coefficient vector. X was set to be the 30 X 2 covariate matrix described on page 26, and /3 = (3,1.5) T. The results are presented in Tables 2.4 and 2.5. The two simulations were done with the same program input variables, except the response dependence correlation. The responses have a covariance matrix as described in item 2 on page 26 for Simulation 3, and item 3 on page 26 for Simulation 4. Chapter 2. Multiple Linear Regression 36 Program input variables: r 9 n X P CT2 £ £ fit intercept? 30 1 500 see page 35 (3,1.5)'' 1 item 3 on page 26 yes Program output: OLS estimates GLS estimates Po Pi Po Pi sample mean of n estimates 3.14 1.49 3.07 1.50 sample variance of n estimates 8.96 0.089 1.25 0.011 true variance of each estimate 9.41 0.093 1.24 0.011 GLS Po 0i p < 0.05 p > 0.05 p < 0.05 p > 0.05 OLS EC = I p < 0.05 43 1 487 0 p > 0.05 348 108 13 0 OLS Se known p < 0.05 91 2 498 0 p > 0.05 300 107 2 0 Table 2.5: Results of Simulation 4 Chapter 2. Multiple Linear Regression 37 In Simulation 3, because of the choice of X and 27£, the variance is very large for the intercept estimate, but it is very small for the slope estimate. Using a slightly different St in Simulation 4, the variances for the GLS estimates decrease considerably (by about eight times). In both simulations, both regression coefficients in the model are well estimated by even the OLS estimators, although the GLS estimators seem to have slightly smaller empirical biases. However, due to the large difference in the variances of the GLS intercept estimate, the p-value tables of the two simulations for /?o are very different. In particular, the GLS analysis obviously out-performed the OLS analysis (naive or true) in Simulation 4 (in which the variance of the GLS intercept estimate is much smaller). In terms of consistency, the consistency ratio wrt the naive OLS analysis is 348:1. The true OLS analysis is slightly better than the naive in terms of consistency. Instead of a 348:1 ratio, it is a 300:2 ratio wrt the true OLS analysis. As for power, the empirical powers of the GLS, true OLS, and naive OLS analyses are respectively 78%, 19%, and 9%. The x2 test of independence (of analysis and power) yielded a virtually zero p-value for comparing the GLS analysis to either OLS analysis. (In other words, the power of detecting the intercept is significantly associated with the analysis (GLS/OLS).) Here, the difference in power is striking. Although the intercept is typically not the main interest of a regression analysis, we can nonetheless see "the power of the enlightened" in detecting f30 in this simulation experiment. This contrast in power, with respect to the intercept estimates, is not as striking in Simulation 3 as in 4. Nonetheless, the GLS analysis seems more efficient than the two OLS analyses. The consistency ratios wrt the naive and the true OLS analyses are respectively 46:8 and 35:24. Thus, the GLS analysis was the most efficient in this simulation, followed by the true OLS analysis, then the naive OLS analysis. This efficiency ranking is heralded by their empirical powers, which are 16%, 14%, and 9% respectively. In fact, the x 2 test Chapter 2. Multiple Linear Regression 38 of independence for the GLS and naive OLS analyses for B0 yielded a near-zero p-value. Thus, the difference in power between the the analyses was extremely unlikely to be a result of chance. We may say that Enlightened is significantly more powerful than Naive in detecting the intercept, according to the x2 test. However, the \ 2 test for the GLS and true OLS analyses indicated insignificant association between the analyses and their powers. Although the difference in efficiency is not as extreme as in the intercept analysis, the GLS analysis seems more efficient in both simulations than either of the OLS analyses in detecting the slope coefficient. The difference is slightly more noticeable in Simulation 3 than Simulation 4. We did not apply the x2 test of independence for the slope estimate of either simulation due to the presence of zero counts. When we compare the results of these two simulations, we notice that with stronger response dependence (in Simulation 4), more efficiency is lost in the OLS analyses (wrt the GLS analysis). (iii) Simulations 5 and 6 Simulations 5 and 6 were run with an exchangeable Se. Recall that with such a response dependence structure and a non-zero intercept in model (2.1), the OLS and the GLS estimators are identical. Thus, the true OLS covariance is no different from the GLS covariance, as stated in Theorem 2. However, the naive OLS covariance is not the same as the GLS covariance, unless St — I. With St ^ I, we expect to see loss of efficiency in parameter estimation. In these two simulations, we used the same program input variables as in Simulations 3 and 4, except that St has the exchangeable structure. The intraclass correlation coefficient in St is 0.6 in Simulation 5 and 0.9 in Simulation 6. The results of these two simulations are shown in Tables 2.6 and 2.7. Their bottom panels tabulate the p-value Chapter 2. Multiple Linear Regression 39 Program input variables: exchangeable Se, intraclass p = 0.6 (see Table 2.4 for other program input variables) Program output: GLS = OLS estimates Po Pi sample mean of n estimates 3.14 1.49 sample variance of n estimates 9.76 0.077 true variance of each estimate 7.19 0.066 GLS Po Pi p < 0.05 p > 0.05 p < 0.05 p > 0.05 OLS Et = I p < 0.05 109 0 500 0 p > 0.05 122 269 0 0 GLS power (= true OLS power) 46% 100% naive OLS power 22% 100% Table 2.6: Results of Simulation 5 Chapter 2. Multiple Linear Regression 40 Program input variables: exchangeable £ t , intraclass p = 0.9 (see Table 2.4 for other program input variables) Program output: GLS = OLS estimates Po Pi sample mean of 2.96 1.50 n estimates sample variance of 2.53 0.018 n estimates true variance of 2.55 0.017 each estimate GLS Po Pi p < 0.05 p > 0.05 p < 0.05 p > 0.05 OLS EC = I p < 0.05 297 0 500 0 p > 0.05 155 48 0 0 GLS power (= true OLS power) 90 % 10( ]% naive OLS power 59% 100% Table 2.7: Results of Simulation 6 Chapter 2. Multiple Linear Regression 41 of the GLS analysis versus the naive analysis. We will not show the actual contingency table corresponding to the true OLS analysis because this analysis is equivalent to the GLS analysis. Thus, the off-diagonal values of this contingency table will be the same. (Hence, the consistency ratio wrt the true OLS analysis must be k : k for some non-negative k.) Also for this reason, we will only quote the empirical power of the GLS analysis and the naive OLS analysis. Tables 2.6 and 2.7 show the extent of efficiency loss in estimating B0 in the naive OLS analysis. (The \ 2 test of independence of analysis and power of detecting B0 yielded a virtually zero p-value in either simulation experiment.) Comparing the two tables, we see that stronger response dependence heralds greater efficiency loss in the naive analysis. This relationship between the response dependence and efficiency loss was also seen in the previous two simulations. As we will see in some other simulations (see the next subsection), this relationship is generally true. (iv) Other Simulation Experiments We ran about fifteen more simulations in addition to the six described above. They were mainly intended to investigate under what conditions the GLS analysis would significantly out-perform either of the OLS analyses when the responses are positively correlated.2 1 Tables 2.2 to 2.7 were shown in full to illustrate how we interpret the program output. For brevity, we will merely highlight the more notable results, summarized as follows: 1. In general, the stronger the response correlation, the larger is the consistency ratio wrt either OLS analysis.2 2 2 1 Positively correlated responses are often encountered in studies of repeated measurements or in spatial statistics. 2 2 In the case of a k : 0 ratio for some non-negative k, a "large" ratio means a large k. Chapter 2. Multiple Linear Regression 42 2. While the last observation is generally true, in some specific cases, the two OLS analyses were empirically more powerful than the GLS analysis. This happened mainly in the analysis for the intercept, which usually had a very large variance. A large variance could have led to a slightly higher empirical power of an OLS analysis, by chance. (Note that the covariance of a /3 estimate depends on the choice of X, St, and cr2.) In the trials without an intercept, an empirically more powerful OLS analysis for a slope could also be due to a large variance of the slope estimate. 3. In some cases, with relatively weak response correlation, the GLS analysis proved to be more powerful than the OLS analyses. For example, a simulation experiment was done with the response dependence variogram described on page 27. Our choice of the region bearings (longitudes and latitudes) gave us a variogram which was extremely close to the identity matrix. However, for one of the two slope coefficients, the GLS analysis turned out to have 89% power, compared to 77% for the true OLS analysis, and 75% for the naive OLS analysis. The \ 2 test of independence of analysis and power of detecting this slope yielded a virtually zero p-value for each of the "GLS/true OLS" and "GLS/naive OLS" combinations. Such a small p-value indicates that the difference in power was extremely unlikely to be a result of chance. 4. In general, the loss in efficiency (wrt to the GLS analysis) of the true OLS analysis is less than that of the naive OLS analysis. The difference in efficiency lost between the two OLS analyses can sometimes be extreme. For example, in one simulation, the consistency ratio wrt the true OLS analysis was 4:2, but it was 147:0 wrt the naive OLS analysis. Chapter 2. Multiple Linear Regression 43 5. As expected, in cases for which the OLS (naive or true) variances are much larger than the GLS variances, the GLS analysis is much more powerful than the OLS analysis. 6. In most cases, both the OLS and the GLS estimators accurately estimate the true regression parameters, in that they both have small empirical biases. However, the empirical bias for both estimators can be quite large if their variances are large. 7. In general, the OLS and the GLS estimates are very close to each other. (They are the same if the response dependence is exchangeable, with an intercept in the regression model.) Sometimes, the GLS estimate has less empirical bias, as shown in Simulations 1 to 4. 2.8 H o w bad is Naive ' s analysis? In the previous section, we have seen that in general, ignoring response dependence in linear regression can be dangerous. A considerable loss in efficiency can result. There has been some established theory on the relative efficiency in parameter estimation of the true OLS analysis to the GLS analysis. As Theorem 1 indicates, the true OLS analysis can never be more efficient than the GLS analysis. 2 3 Our simulations also support this difference in efficiency, in terms of consistency and power, in detecting non-zero regression coefficients. However, we are not aware of theory which extensively compares the naive OLS analysis with the GLS analysis. From our simulation studies, we do know that in general, the naive OLS analysis is the most inefficient in parameter estimation, followed by the true 2 3 In practice, the empirical relative efficiency of the true OLS analysis to the GLS analysis may be greater than unity. This is the case in, for example, Simulation 2 in Section 2.7.3. This is mainly due to chance. If we had more data sets (i.e. more than 500) in this simulation, the GLS analysis could have been empirically more efficient than the true OLS analysis. Chapter 2. Multiple Linear Regression 44 OLS analysis, 2 4 then by the GLS analysis. Now, we direct our focus onto understanding the extent of efficiency loss using the naive OLS analysis relative to the GLS analysis. 2.8.1 Empirical Power Curves First, let us examine how the power in detecting a non-zero regression coefficient of the naive OLS analysis compares to that of the GLS analysis. We compare them under vary-ing conditions. For example, we may fix a regression coefficient and 27e, letting a2 vary. Similarly, we may fix the covariances of the responses, and let the regression coefficient vary. We will once again use simulation experiments as the basis of our investigation. We ran several such experiments for each set of varying conditions. For each of these simulations, we recorded the empirical powers of both the GLS and the naive OLS analyses. Each simulation consisted of 500 realizations of model (2.1), with no intercept and one slope coefficient. Having only one slope in the model provides us with enough insight into the question at hand while reducing the computational burden. We used a model with no intercept term because, as explained before, the slope is the primary determinant of the relationship between the response and the covariate. We chose a 30 x 1 covariate matrix X in this case, assuming there were 30 regions in the geographical domain of interest. The covariates in this X matrix (vector) were arbitrarily generated, independently, from a normal distribution with mean 0 and standard deviation 0.2. Once generated, the X matrix was fixed throughout the 500 simulation trials. Uc was taken to be the one described in item 3 on page 26 in all simulations. Such a S£ was chosen because from the previous simulations with this St, the GLS analysis was often noticeably more powerful than the naive OLS analysis. We wish to exploit to a greater extent this 2 4The naive OLS analysis may appear to be more efficient than the true OLS analysis when the matrix (XTS-lX)-1 - (XTX)-1 is positive definite (i.e. when the true OLS covariance is "larger than" the naive). However, technically the naive analysis is not more efficient in this case because <r2(XTX)-1 is not the true covariance of f3OLS. Chapter 2. Multiple Linear Regression 45 Empirical Power from 500 samples and only slope varies w/ SD(error)=1 • • + + 0 + + + + + + + ; Stl + + + * + * slope Empirical Power from 500 samples and only SD(error) varies w/ slope=1.5 1.5 slope - SD(error) Figure 2.1: Empirical Power Curves for the GLS and naive OLS analyses. difference between the two analyses to gain increased clarity in the result. A graph of all the recorded empirical powers from the several simulations were then produced. The above process was repeated under each set of varying conditions. We have pro-vided the graphs of the empirical powers in these sets of simulations in Figures 2.1 and 2.2. In the two figures, the graph titles designate the parameters chosen to vary. For example, the first panel of Figure 2.1 has X and 27e described as above, with ( 7 = 1 Chapter 2. Multiple Linear Regression 46 fixed, but a varying 8\. The third panel of the same figure has the ratio 81/o~ fixed, although both 8-y and a vary together. Let us examine Figure 2.1 closely. In the first panel, we let the slope &\ increase from 0 to 5 in the simulations. With a = 1 and every other parameter fixed, both the GLS and the naive OLS analyses increase in power as fl\ increases. Of course, this is not surprising, since the larger the true coefficient, the easier it is to detect, and hence, the higher the power of either test. However, the power of the GLS analysis increases at a much higher rate than the naive OLS analysis, as 8\ increases. The GLS power reaches almost 1 when 8\ « 1, whereas the naive OLS power increases very gradually, and becomes close to 1 only when 8\ is beyond 4. The loss of efficiency in using the OLS analysis is especially big in the interval (0.8, 1.2). In the second panel of Figure 2.1, we let a increase from near-zero to beyond 12. fli = 1.5 is fixed, as are other parameters in the model. As a increases, the power of either analysis should decrease, because detecting a non-zero coefficient becomes more difficult as the variation of the data increases. However, the GLS power remains very close to 1 until a > 1.5, then decreases very gradually and stabilizes as a grows beyond 8. Meanwhile, the naive OLS power starts decreasing as soon as a grows beyond 0. It decreases much faster than the GLS power, stabilizing when a « 4. Finally, in the third panel of Figure 2.1, we let the ratio 8^ja = 1, i.e. B\ = cr, although letting both parameters vary together. The other program input arguments are fixed. The graph indicates that both the GLS and the naive OLS power functions remain roughly constant, but the latter is much lower. The naive OLS power seems to remain at around 0.2, whereas the GLS power remains close to 1 at all values of 8\ (= cr). Indeed, the fact that the GLS power remains relatively constant agrees with theoret-ical results. The following argument shows that the theoretical GLS power function is Chapter 2. Multiple Linear Regression 47 constant if the ratio fa/cr is constant. Let v = fa/cr. Let TT(V) be the power function of the GLS analysis evaluated at £/, i.e. the probability of failing to detect a non-zero fa, given v. The t-test statistic is -~~GLS fa SE(faGLS) where estimates SE(faGLS) = a^{XTS^X)-\ The estimate a is the square-root of the M S E obtained from the regression analysis of variance. That is, «72 = e1 e where r-g Y - X{3. Recall that the covariate matrix X has dimension r xg, where r = 30 and g = 1. Now, —GLS — —GLS from classical theory, we know that fa /SE(/?i ) follows a non-central t distribution with r — g degrees of freedom, and non-centrality parameter equal to 2 5 fa fa _ v S E ( A G L 5 ) <ry/(XTZ:*X)-> y/(XTErX)-> which only varies in v, because the values in the denominator of the last expression are all fixed. Now, let tc be the critical value of this non-central t distribution at significance level 0.05. It is not difficult to see that fa SE(faGLS) > tc 25 See Encyclopedia of Statistical Sciences, Wiley & Sons, 1985, vol. 6, pp. 286 - 288. Chapter 2. Multiple Linear Regression 48 is constant if v is constant. From the same empirical power graph, we also see that the naive OLS analysis has constant power. In fact, we can deduce that theoretically, the naive OLS analysis should also have constant power. This is true despite the fact that the true standard error of ^Oj_j§ * OLS OLS ^~~^OLS fi\ is not equal to the naive S.E. (= SE(/9i ) n a i ve ) so that 3\ /SE(/?i ) n a i v e does not have a non-central t distribution. -—OLS Recall that 3\ , in the case of 27e ^ / , has true S.E. and naive S.E. equal to ayJ(XTX)-i{XTZ;tX){XTX)-i and a^J{XTX)- respectively. That is, S E ( S ° L 5 ) t r u e = y/{XTE.X)(XTX)-* S E ( ^ ° L 5 ) n a i v e . , ^OLS OLS Now, notice that B\ /SE(/?i ) t r ue, the i-statistic for the true OLS analysis, has non-R ^~~OLS central t distribution with non-centrality parameter equal to /?i/SE(/?i ) t r a e . Hence, the power function of the true OLS analysis should be constant if 3\/a is constant (like that of the GLS analysis). Now, the naive t-statistic is -t~OLS <~-OLS Pi Pi SE(/?i )naive true J(xTEcx)(xTxy which is simply a multiple of the i-statistic of the true OLS analysis. Therefore, the power function of the naive OLS analysis should also be constant if P\/o~ is constant. Finally, the fact that the empirical power curve of the naive OLS analysis is below that of the GLS analysis in all three panels of Figure 2.1 illustrates that ignoring response dependence in regression analysis can lead to significant loss in parameter estimation efficiency. Next, let us examine Figure 2.2. The two panels of this figure show empirical power curves of the two statisticians' analyses with all program input arguments fixed as before, except that B\ and a both Chapter 2. Multiple Linear Regression CD Q. O CO O q CD CO E oo to d <D O CD o CO & ci T> CM <D XJ o CO C\l o o ci CD 5 o Q. CD Q. O CO O p 0) n E CO u5 ci CD CD o O CO £ •<t ci T3 oo <D T3 O CO C\I O O ci 0) o Q. Empirical Power from 500 samples and ratio of slope/SD(error)=.5 o.o 0.5 1.0 slope = .5*SD(error) 1.5 2.0 Empirical Power from 500 samples and ratio of slope/SD(error)=2 o.o 0.5 1.0 slope = 2*SD(error) 1.5 2.0 Figure 2.2: More Empirical Power Curves for the GLS and naive OLS analyses. Chapter 2. Multiple Linear Regression 50 vary with the ratio v = f3\/o~ constant. This ratio was fixed as 1/2 and 2 respectively in the two graphs. Again, we see that the naive OLS analysis is clearly less powerful than the GLS analysis. Comparing the two panels in Figure 2.2 with the last one in Figure 2.1, the gap between the two power curves seems to decrease as the ratio of Q\ to a moves away from unity in either direction. Although the GLS power curve is everywhere above the naive OLS power curve, the gap between them is the largest when the ratio is 1, a little smaller when the ratio is 1/2, and even smaller when the ratio is 2. Therefore, the power difference between the two statisticians is most apparent when the slope and the noise level (i.e. a) are the same. Furthermore, from the relative positions of the power curves as the ratio of the slope to the noise level changes, we again see the following: the noisier the data with respect to the size of the slope coefficient, the less sensitive either analysis, in terms of the detection of the slope. The GLS analysis has almost 100% power when the slope is at least as large as the noise (cr). On the other hand, the OLS analysis has very low power when the noise is at least at large as the slope. These observations seem reasonable, by the following argument. When 3\ja = 1, the two empirical curves are almost as far apart as they can be. That is, the GLS power is almost 1, and the OLS power is close to 0.05, the size of the hypothesis test. As the ratio increases from 1, the size of the slope relative to the noise level increases. Then both analyses improve in their power in detecting the slope, but the OLS analysis has more room to improve in its power (from nearly 0.05 to anywhere below 1). Afterall, how much more can the GLS power improve when it has almost reached 1 at Q\ju — 1? Similarly, as the ratio decreases from 1, the size of the slope relative to the noise level decreases. Then the powers of both analyses decrease, but that of the GLS analysis has more room to decline. Chapter 2. Multiple Linear Regression 51 2.8.2 Histograms of p-value Ratios and p-value Scatter Plots We have compared the efficiency of parameter estimation using GLS analysis and that using naive OLS analysis by examining their empirical power curves. Another way of comparing their efficiencies is to compare their actual p-values in each simulation. In regression parameter estimation, a method is more efficient than another if its p-value is smaller. This is because a smaller p-value gives stronger evidence against a zero regression coefficient. In a simulation experiment with, say, 500 data sets, there are 500 pairs of (GLS p, naive OLS p) recorded for each coefficient estimate. Suppose we have only one slope coefficient and no intercept in the regression model. How then should we compare all the 500 pairs of p-values of this slope estimate? We decided to compute, for each pair of p-values, the ratio of naive OLS p to the GLS p. Since we expected the naive OLS analysis to be less efficient, the ratio should be larger than unity most of the time. 2 6 We then log-transformed the square of each of these 500 p-value ratios. Log transformations are often used to map ratios in the interval (0, oo) to ( — 0 0 , 0 0 ) . Moreover, our choice of squaring the p-value ratio before log-transformation was inspired by Wilks' Theorem: twice the log-likelihood ratio statistic has an asymptotic Chi-square distribution. We expected that such a transformation of the p-value ratios would allow nicer graphical displays of the simulation results. Histograms of these log-transformed p-value ratios were then plotted to examine their distribution. For example, a histogram with most of its mass at, say, 10, would indicate that in most of the 500 cases in the simulation, the naive OLS p-values were exp{10/2}(« 148) times larger than its GLS counterpart. This in turn would indicate that the GLS analysis was much more powerful than the naive OLS analysis. On the other hand, a 2 6 The ratios which are smaller than 1 are expected to be due to chance. Chapter 2. Multiple Linear Regression 52 histogram centered at 0 would indicate that the naive OLS p-value was just as likely to be larger than its corresponding GLS p-value as it was smaller. In other words, such a histogram would indicate that the two analyses were equally powerful, empirically. Before presenting some detailed observations, let us introduce an analogy of the power comparisons between Enlightened and Naive. A rabbit, under normal conditions, usually runs much faster than a tortoise. If both the rabbit and the tortoise wear good running shoes (large B\ relative to cr), it may be more apparent that the rabbit is a faster runner than the tortoise. Afterall, how much improvement can running shoes do to an animal which simply cannot run? On the other hand, if both animals have a broken leg (large cr relative to the hindrance to the rabbit's running may be a lot more severe than that to the tortoise's, because the rabbit's speed has much more room to decline. Hence, if we measure the difference between the two animals' running speeds, this difference should be more apparent when both wear good running shoes, but less so when they both have a broken leg. We will leave for the reader the task of linking this analogy between these two animals in a race and the two statisticians in an efficiency competition of regression parameter estimation. Now, using this analogy, we wish to understand how much faster the rabbit out-runs the tortoise under different conditions. Thus, we need the two animals to compete in several races (simulation experiments). In each race, they wear the same kind of running shoes assigned to that particular race. Running shoes of a different quality are assigned to a different race. Over several races, we hope to recognize which quality of running shoes makes the rabbit's superiority the most or the least apparent. To do this, we can display histograms of logged p-value ratios from several simulation experiments in the same graph. This way, the change in spread and location of the histogram due to the change in the "quality of the running shoes" can be spotted easily. Chapter 2. Multiple Linear Regression 53 Some technical difficulties might arise in plotting some of the histograms. For ex-ample, the log of zero is undefined. To get around the problem of taking the log of a virtually zero p-value, we used started logs27 and added 1 0 - 1 0 to each of the p-values, regardless of its size, 2 8 before taking logs. Another difficulty emerges in displaying the histogram itself. A regular histogram resembling a bar chart may not provide an effective display. Since our main interest in these histograms concerns location and spread, we will present their smoothed sketches instead of the actual histograms. Plotting the histogram of the logged p-value ratios is only one way to compare the p-values, and hence, the power of the two analyses. In addition, we also produced a scatter plot of all the logged OLS p-values versus their GLS counterparts for each simulation. For example, if the scatter cloud is concentrated around a point, this corresponds to a histogram which has a tall spike. In the previous section, empirical power curves were given for different sets of varying conditions. For example, in one graph, we fixed all program input arguments but let 8\ vary. A set of varying conditions is analogous to "different models/qualities of running shoes of the same brand." That is, the brand could be 81, and the different models could be By = 0.1, 8\ = 0.2, etc. The noise level a could be considered a different brand of running shoes, when it varies but all other parameters (and program input arguments) are fixed. For a certain brand, several races are run between the two animals, and each race is run with a different shoe model. Let us now relate this back to our p-value comparisons. For each brand for which, say, five different races (simulations) were run, five sets of race results in terms of the histogram sketches and the corresponding scatter plots were displayed. Hence, we could compare the five histograms and/or the five scatter plots to examine how the varying values (shoe models) of this parameter (brand) affected 2 1 . . . in the terminology of Mosteller k Tukey, 1977, page 91. 2 8 For the sake of consistency, we added the infinitesimal amount to all p-values. Chapter 2. Multiple Linear Regression 54 the power comparison of the naive OLS and the GLS analyses (speed comparison of the two animals). The Graphs Several sets of varying conditions were investigated. Here, we used the same X and ST matrices as those used in the previous section where simulations were run to produce empirical power curves. Thus, we have assumed 30 regions, and no intercept but only one slope coefficient, fa, in regression model (2.1). Again, each simulation had 500 realizations (datasets) of the model. The varying parameter is indicated in the title of each histogram graph. Notice also that the title of each scatter plot indicates its corresponding histogram sketch. The x-axis of a scatter plot is always "log (GLS p + I O - 1 0 ) " and the y-axis is always "log (naive OLS p + IO - 1 0 ) . " A 45-degree line is drawn on each scatter plot to indicate clearly the relative position of the scatter cloud. Figure 2.3 shows the histogram sketches and scatter plots for nine simulations under one set of varying conditions. X and Ut were fixed as described above, <r was fixed as 1, but fa varied from 0 to 10. Each histogram sketch is labeled with a letter from the alphabet for two reasons: (1) by comparing the labeling letter to those in the legend, we can match the histogram to the fa value with which the simulation was run; (2) as fa increases, the shape and the location of the histogram changes, and we can trace the changes of the histogram by following the alphabetically ordered sketches. That is, increasing fa to the next value corresponds to changing the labeling letter to the following letter in the alphabet. With this in mind, we may trace how the histogram "moves" as fa increases. From the histogram sketches in this figure, we notice a "tidal effect" in the location of the histograms as fa increases. That is, with an increasing slope coefficient, the center of the histogram of the logged p-value ratios "moves" from small values (< 0) up to Chapter 2. Multiple Linear Regression 55 Figure 2.3: Histogram Sketches of logged p-value Ratios and their Scatter Plots: /3i varies. Sketches of Histograms ~ slopes vary from 0 to 10, with fixed X matrix, fixed error correlation matrix, and fixed Var(error)=1 0 10 20 30 2"ln((naive ols p value + epsilon)/(gls p value + epsilon)) 40 Corresponding Scatter Plot for Histogram a I Corresponding Scatter Plot (or Histogram d •20 -16 -10 -6 ki(g)i p value * epslon) Corresponding Scatter Plot for Histogram g -20 -16 -10 -S ln(gli p value + epsilon) Corresponding Scatter Plot for Histogram c -20 -15 -tO -6 0 ln(gls p value + aptlkm) Corresponding Scatter Plot lor Histogram e Corresponding Scatter Plot for Histogram h -SO -1G -10 -6 ln{gli p valu* + epsilon) Corresponding Scatter Plot for Histogram t -20 -IS -10 -6 ln(gli p value * epsilon] Corresponding Scatter Plot for Histogram i -20 -IE -10 -E ln<Qj» p value • epsilon) Chapter 2. Multiple Linear Regression 56 larger values, then returns down to smaller values as the tide does. Note that a slope of moderate size, such as 0.8 (histogram "c"), gives a histogram mode of about 9. That is, for such a slope (and a = 1), the OLS p-value is typically exp{4.5} ~ 90 times as large as the GLS p-value! Another interesting observation is the tidal effect in the spread of the histogram which accompanies that of the location. Apparently, extreme (very small and very large) values of the slope are associated with smaller spread in the distribution of the log-transformed p-value ratios. Moderate values of the slope are associated with wider spread in this distribution. The tidal effect in the location and spread of the histogram can also be seen from the logged p-value scatter plots. First, we see the changing location of the scatter cloud. As 3\ increases, the scatter cloud moves from being around the top portion of the 45-degree line to being away from the line to finally being solely on the left of the line. Thus, when the slope is small (difficult to detect with either the OLS or the GLS method), both analyses have big p-values (so that the logged p-value is around 0). A small slope also means that empirically, the naive OLS analysis has more probability of out-performing the GLS analysis than with a larger slope. (Recall the rabbit and the tortoise racing against each other when both have a broken leg.) This explains the portion of the scatter cloud below the 45-degree line in the first few scatter plots. With a slope that is easier to detect using either method, the GLS analysis' superiority in power becomes more apparent. Having a large slope is analogous to having good quality running shoes in the rabbit-tortoise race. Recall how we explained that the rabbit's superiority in speed is most apparent when the two animals are wearing the best running shoes. Thus, as the slope increases, more and more GLS p-values are smaller than their naive OLS counterparts, and hence, the cloud moves further and further to the left of the 45-degree line. Finally, when Q\ = 10, an extremely Chapter 2. Multiple Linear Regression 57 large slope, the scatter cloud converges along a vertical line, almost solely above the 45-degree line, at a highly negative value on the rc-axis. Such a vertical line indicates that almost all GLS p-values were at least as small as their naive OLS counterparts in the simulation. In other words, the GLS analysis was at least as empirically powerful in almost all 500 cases in the simulation. Notice that this last scatter plot corresponds to histogram sketch " i , " which shows that there are almost no log-transformed p-value ratios that are negative, out of the 500 cases in the simulation. (A negative value corresponds to a naive OLS p-value smaller than the GLS p-value.) This is simply the exact translation of the observation of the last scatter plot. In addition to the moving location, the spread of the scatter cloud also changes as 3\ increases. From being very concentrated around one point, it becomes somewhat sparsely scattered, then becomes concentrated again. Although the cloud appears concentrated along a vertical line at large values of the slope coefficient, the last plot actually shows some tendency of the cloud converging to a point near (-25, -25). This change in the spread of the cloud simply translates back to the changing spread of the histograms. Now, putting back together the observations of both the histogram sketches and the scatter plots in Figure 2.3, it is not difficult to understand the tidal effect in both the location and the spread of either the histogram or the scatter cloud. Basically, it depends on the relationship between the varying quality of the animals' running shoes and the varying extent of superiority of the rabbit over the tortoise in a race. Here, the larger 0i is, the better the quality of the shoes, and hence the larger the rabbit's "margin of victory" over the tortoise. Figure 2.4 shows some similar tidal effects in the histogram and scatter cloud, but in the opposite direction of that in Figure 2.3. The simulation experiments were run with X and 2/e fixed as described above, Q\ fixed as 1.5, but a increased from exp{ —1.2} to Chapter 2. Multiple Linear Regression 58 Figure 2.4: Histogram Sketches of logged p-value Ratios and their Scatter Plots: a varies. Chapter 2. Multiple Linear Regression 59 exp{2.6}. In fact, the opposite direction of the tide is not surprising if we understand why the histogram and the scatter cloud "travel" the way they do in Figure 2.3. In Figure 2.3, a large fa corresponds to high quality running shoes in the rabbit-tortoise race. On the other hand, a large a (noise level in data) in Figure 2.4 corresponds to a broken leg, whereas having a small a is analogous to having good running shoes. This brings us back to the relative size of the slope coefficient with respect to the noise level in the data. In other words, if both parameters vary together, it is not so clear what shoe is considered to have good quality. In the simulation experiments for Figure 2.5, the ratio v = fa/cr was fixed as unity, and fa and a increased together from 0.2 to 3. Similarly, v was fixed as 1/2 and 2 respectively in the sets of simulation experiments for Figures 2.6 and 2.7. The slope fa ranged from 0.01 to 2 in both of these figures. From these three figures, we find that their histograms (and scatter clouds) remain at a relatively constant location and have almost the same shapes as fa and a vary together at a fixed rate. This observation agrees with the constant power (of detecting fa) of the GLS and OLS analyses seen in Figures 2.1 and 2.2. The logged p-value ratio is similar to the ratio of the powers of the two analyses. If the powers are constant for all values of fa and a (while fa/a is fixed), then their ratio must also be constant for all values of the two parameters. Hence, the locations and shapes of the histogram sketches almost coincide in each of Figures 2.5 to 2.7. However, if we compare these three figures, the tide moves from left to right as the ratio of fa/cr increases. When v = fa/cr = 1/2, the OLS p-value is typically larger than the GLS p-value by a factor of exp{1.5} ~ 4.5. When v — 1, this factor is exp{6} ~ 400. When v = 2, it is exp{16} ~ 8.9 million! The increasing factor indicates that, when the ratio of the slope to the noise level is fixed, the size of this ratio is essentially what determines the relative location of the tide. In other words, both fa and cr increasing together is like having two opposing forces Chapter 2. Multiple Linear Regression 60 Figure 2.5: Pi/v = 1. Histogram Sketches of logged p-value Ratios and their Scatter Plots Sketches of Histograms - slopes vary from .2 to 3, with fixed ratio of slope/SD(error)=1, fixed X matrix, & Cor(error) matrix a E E a slope=SD(error)=.2 b slope=SD(error)=2.2 c slope=SD(error)=2.4 d slope=SD(error)=3 -10 -J— 10 -J— 20 —I— 30 I 40 2*ln((naive ols p value + epsilon)/(gls p value + epsilon)) Corresponding Scatter Plot tor Histogram d -SO -IS -10 -6 0 ln(tfs p value + sptf on) Chapter 2. Multiple Linear Regression 61 "pulling the tide." A large Br makes more apparent the GLS analysis' victory over the naive OLS analysis. On the other hand, a large a makes this less noticeable. Again, with a fixed, a large Q\ corresponds to good running shoes. With Q\ fixed, a large a corresponds to a broken leg. When only the ratio of the two parameters is fixed, but the parameters themselves vary together, the efficiency competition between the two analyses is analogous to the two animals running a race in which they both have a broken leg but wear good running shoes. From the locations of the histogram sketches in Figures 2.5 to 2.7, we deduce that the rabbit's superiority in speed over the tortoise is constant as the parameters increase together with their ratio fixed, but increases if the ratio increases. (This can also be seen from the location of the scatter clouds — the clouds move farther towards the left of the 45-degree line as the ratio increases.) That is, the superiority of the GLS analysis over the naive OLS analysis is more apparent as the relative size of Q\ to the noise level increases. One may argue that this trend seems to disagree with what we observed from the empirical power curves. They indicate that the superiority of the GLS analysis was most apparent when Q\ = <7, but less so if Q\ ^ a. From the histogram sketches, we see that this superiority increases as the ratio Q\jo increases. However, the empirical power graphs were obtained based on a fixed significance level of the t-tests for the slope estimates. On the other hand, the histogram sketches were obtained using the actual p-values computed from these t-tests. Then, a larger OLS p-value would indicate a lower OLS power than the GLS power, regardless of how small both p-values are. Thus, if we measure the superiority of the GLS analysis based on how much smaller its p-values are, then we see the trend of increasing superiority as the ratio 0\/o- increases. If we measure superiority based on how often the null of B\ = 0 is rejected, then this superiority decreases as 3i/o~ moves farther away from unity. This is because both p-values might be smaller than the significance level so that both analyses Chapter 2. Multiple Linear Regression 62 Figure 2.6: Histogram Sketches of logged p-value Ratios and their Scatter Plots: Pi/v = 1/2. Sketches of Histograms - slopes vary from .01 to 2, with fixed ratio of slope/SD(error)=.5, fixed X matrix, & Cor(error) matrix slope=.5*SD(error)=.01 slope=.5*SD(error)=.5 slope=.5*SD(error)=1.5 slope=.5*SD(error)=2 E CO b b b b b b b b 30 -10 10 20 2*ln((naive ols p value + epsilon)/(gls p value + epsilon)) Corresponding Scatter Plot for Histogram d Chapter 2. Multiple Linear Regression 63 Figure 2.7: Histogram Sketches of logged p-value Ratios and their Scatter Plots: Pi/tr = 2. Sketches of Histograms - slopes vary from .01 to 2, with fixed ratio of slope/SD(error)=2, fixed X matrix, & Cor(error) matrix a slope=2*SD(error)=.01 b slope=2*SD(error)=.5 c slope=2*SD(error)=1.5 d slope=2*SD(error)=2 o o o o o o o o I 40 10 20 30 2*ln((naive ols p value + epsilon)/(gls p value + epsilon)) Corresponding Scatter Plot tor Histogram d Chapter 2. Multiple Linear Regression 64 appear powerful, despite the larger OLS p-value . 2.8.3 The case of a chosen Et « I So far in this section, the results are based on simulation experiments with Ht that might produce a potentially noticeable difference between the two statisticians' analyses. What happens if Uc is chosen to be very close to the identity? Naturally, we do not expect the GLS analysis to out-perform the naive OLS analysis. Afterall, the two analyses differ only in what the two statisticians believe Uc to be. Out of curiosity, we ran a few sets of simulations using S€ as the variogram structure described in item 5 of page 27. Recall that our choice of the region bearings produced a which was very close to J. X was again the 30 X 1 covariate vector that we used in the previous few sets of simulations. As expected, the histograms for these sets of simulations all locate at around 0, even under different sets of varying conditions. This indicates that in general, with such a 27£, the GLS analysis is not noticeably more powerful than the naive OLS analysis. Nonetheless, we have included the histograms and scatter plots for one of these sets of simulations in Figure 2.8. In these simulations, all program input arguments were fixed (3i = 1.5), except that cr varied from exp{ —1.5} to exp{2.5}. The other sets of simulations produced very similar graphs as Figure 2.8. Perhaps the only interesting observation of these simulations lies in the movement of the scatter cloud as a parameter varies. In Figure 2.8, the scatter cloud moves along the 45-degree line as a increases. It starts out at around (-20, -20), then spreads upwards along the line, until it concentrates at around (0, 0). The fact that the cloud is scattered around the 45-degree is because with St rH J , both the GLS and the naive OLS analyses are about as powerful as each other. At small values of cr, the running shoes are of very good quality, and hence, both Chapter 2. Multiple Linear Regression 65 Figure 2.8: Histogram and Scatter Plots with Variogram Ue: a varies. Sketches of Histograms - SD(error)'s vary from exp(-1.5) to exp(2.5), with fixed X matrix, Cor(error) matrix (variogram), and slope=1.5 a SD(error) = exp(-1.5) b SD(error) = exp(-0.5) c SD(error) = exp(O) = 1 d SD(error) = exp(1.5) e SD(error) = exp(2.5) -4 -2 0 2 2*ln((naive ols p value + epsilon)/(gls p value + epsilon)) Corresponding Scatter Plot tor Histogram a Corresponding Scatter Plot tor Histogram b Corresponding Scatter Plot for Histogram c Wo* p villi* . •pawn) p vtlu* . tptfo.) ln(Bkpvahi* *fnOon) I Chapter 2. Multiple Linear Regression 66 analyses are powerful. Thus, the cloud is concentrated at highly negative co-ordinates (extremely small p-values). As a increases, the quality of the running shoes worsens, and hence, the cloud moves up to near-zero co-ordinates (p-values close to 1). 2.9 Discussion Recall that our main goal in the chapter was to investigate the loss of parameter esti-mation efficiency in a linear regression model if one ignores the response dependence. We have used the "storyline" of the two analysts to clarify the distinctions among three different approaches to a regression problem: 1. the naive OLS analysis: Naive simply assumes that the data are uncorrelated, estimates the regression parameters with / 3 O L 5 (OLS estimator), and ignores the response dependence in computing the covariance of this (3OLS; 2. the GLS analysis: Enlightened knows the true response dependence structure, St, and hence obtains f3GLS (GLS estimator) as the regression parameter vector esti-mate and bases his/her regression analysis on the knowledge of U€; and 3. the true OLS analysis: Enlightened realizes Naive's mistake in computing the co-variance of f3OLS, and carries out Naive's regression analysis, using the true OLS covariance, a function of Se. Before our investigation, theory has been established on the loss of efficiency using the true OLS analysis compared to the GLS analysis. When St ^ J , that is, when the responses are correlated, the GLS estimator is the B L U E for /3 in model (2.1). When St = I, the GLS estimator reduces to the OLS estimator, and thus both are the B L U E ' s for j3. Theorem 1 indicates that for all U€, the GLS estimator is always at least as efficient as the OLS estimator, based on the true covariance of the OLS estimator. In Chapter 2. Multiple Linear Regression 67 other words, the true OLS analysis (regression analysis based on the OLS estimator and its true covariance) can never be more efficient than the GLS analysis. Of course, the two analyses are equivalent (as efficient as each other) when the two estimators are the same. However, we would like to know the conditions under which the two estimators are equivalent. First, Et = I is one such condition. It is not an interesting condition because we are trying to investigate cases where St ^ I. Some research has been done in this area since the 1960's. (See Section 2.2.) Among the complicated algebraic conditions which guarantee that j3OLS = f3GLS, one condition is readily verifiable. We have stated this condition in Theorem 2. It says that 3 ° L S = PGLS if and only if the model in (2.1) contains an intercept and that 27£ is exchangeable. However, Theorem 2 does not imply that under these conditions, the naive OLS analysis is equivalent to the GLS analysis. This brought us to our simulation studies. In each simulation experiment, we generated data according to model (2.1) with some St ^ J . The three analyses (naive OLS, true OLS, and GLS) were compared. The analyses consisted of independent hypothesis tests of zero regression coefficients, each at 5% significance level. We compared the analyses by comparing their p-values of these hypothesis tests. Empirical power (sample probability of rejection of the null hypothesis) was calculated for each regression coefficient estimate for each analysis. We found that in general, the stronger the dependence of the responses, the more apparent the GLS analysis' superiority in power. The naive OLS analysis is the least powerful, followed by the true OLS analysis, then by the GLS analysis. Even when the true OLS and the GLS analyses are equivalent, the naive OLS analysis can have a considerably lower power than the other two. We then decided to only compare the naive OLS analysis and the GLS analysis in the remaining part of our investigation. Chapter 2. Multiple Linear Regression 68 We ran more sets of simulations. Empirical power curves of the GLS and naive OLS analyses were produced to display their powers graphically. They show that the GLS empirical power curve is almost 2 9 everywhere above that of the naive OLS analysis. As the size of the slope relative to the noise level (cr) either increases or decreases, the gap between the power curves always increases and then decreases. In addition, instead of explicitly calculating the empirical power of each analysis, we graphically displayed their p-values using "logged p-value ratio histograms" and "logged p-value scatter plots." We could deduce which analysis was more powerful from these graphs. (An analysis is more powerful if its p-value is smaller.) We found that with all other program input arguments (X, St, and cr) fixed, the larger the regression coeffi-cient in the model, the larger the "margin of Enlightened's victory" over Naive in their efficiency competition. On the other hand, with the regression coefficient and the other program input arguments fixed, a large cr (high noise level in the data) corresponds to a smaller margin of victory. What if the ratio of the regression coefficient to the noise level in the data is fixed? We ran sets of simulation experiments with this ratio fixed, and let the two parameters vary together, keeping their ratio constant. In these simulations, we found that the powers of the two analysts remain roughly constant as both parameters increase together at a fixed ratio. (We have seen this from both the empirical power graphs and the histogram sketches.) However, Enlightened's margin of victory varies according to the size of the coefficient-to-noise ratio, as can be seen by comparing the locations of the histogram sketches in Figures 2.5 to 2.7. 3 0 In general, when the coefficient is smaller than the noise level (cr), Enlightened barely wins over Naive. When the coefficient and the noise level are equal, Enlightened wins by a larger margin. Finally, if the coefficient is bigger than 2 9 A higher naive OLS empirical power is due to chance. 3 0Figures 2.3 and 2.4 also show the effect of a varying slope-to-noise ratio. Chapter 2. Multiple Linear Regression 69 the noise level, s/he wins "by a landslide." 2.10 C H A P T E R A P P E N D I X The following is the algorithm for the computer implemented procedure for generating the design matrix and regression coefficients in a simulation experiment of linear regression. 1. Generate the (g + 1) x 1 vectors X 1 5 X 2 , .. ., Xr (Xj = ( X ; 0 , X t l , . . . ,X{g)T for all i) independently from an M V N distribution with mean fix and covariance Ex as specified by the program user, (r = number of regions.) Thus, all r of these {g x l)-vectors are i.i.d. 2. Write fj,x = (/*o,A*i, ••• ,Pg)T, and where o~oo = oo Var(Xio) (i.e. variance of the first element of any of the (g + l)-vectors) C o v ( X i 0 , X,-_ 0) [Cov(X,-o,X,- 1),Cov(X t-o,X« ) , . . . , C o v ( X , - 0 , X i s ) ] r where X{-0 = ' , • 2 , . . . , Xig)T for all i — 1,2,..., r, and < T _ 0 , - o = Cov(X,- )_0) = the covariance matrix of X 8 > _ 3. According to the above, we thus have [Xio|X t i_ 0 aj] ~ N(i/, ^ 2 ) (2.15) Chapter 2. Multiple Linear Regression 70 where X = (Xl,X2, • • • ,Xg)T v = p0 + ( * r 0 , _ 0 ) r ( o - - i _ o ) ( a j - fJL-o) (2.16) ^ 2 = POO - ( o - 0 _ 0 ) T ( < T ^ _ o ) ( « T 0 _ 0 ) . (See Johnson & Wichern, 1982, Result 4.6 for the derivation of v and ifi2.) 4. If we let 0_o = (#!, . . . ,0 5 ) r be (tr0 , - 0 ) T ( < T ^ _ 0 ) , then we see that (2.15) itself specifies a linear regression model with as the response, X j ' s as the covariates, and #j's as the slopes. From equation (2.16), we derive the intercept 0O in this regression model as 90 = Po - (0-o)TV-o-5. Now, we extract from the regression model specified by (2.15) its covariates and regression coefficients, and specify these as the corresponding program input ar-guments in (2.1). That is, in the case that model (2.1) has no intercepts, we let the p x g covariate matrix and the ^-vector of slope coefficients to be X — ( X l r 0 , - X " 2 ] _ 0 , . . . , Xg-0)T and j3 = (9_0. If we fit an intercept in model (2.1), we simply add a column of l's to this X. The (g + l)-vector of regression coefficients (including the intercept) in model (2.1) then becomes (3 = 9 = (90, 6-0)T. Chapter 3 Introduction to Multiple Regression of Count Data In the previous chapter, we have seen that ignoring response dependence may result in serious reduction of efficiency in parameter estimation in a linear regression model with normally distributed random errors. In real life, however, the data we need to analyze may not be normally distributed. For example, in health care studies, hospital admission counts are often used to indicate the health of a population. These counts are usually small (less than 20) for rare diseases, and thus should not be assumed to follow a normal distribution. In such cases, linear models may no longer be efficient in modeling the relationship between counts and possible covariates. In fact, small admission counts are usually assumed to follow a Poisson distribution in health care studies. As a result, these counts are often analyzed using Poisson regression. However, theoretical results are generally not widely available for models with non-Gaussian responses, especially when the responses are correlated. Perhaps this is why health care studies in the past generally ignored correlation among responses. In 1986, Liang & Zeger proposed the generalized estimating equations (GEE) approach to incorporate response correlation for estimating parameters in generalized linear models (GLM's) . A Poisson regression model, for example, is a G L M . The generalized estimating equations are an extension of the estimating equations in quasi-likelihood theory, imple-mented in the widely used G L I M software. Response independence is assumed in the original quasi-likelihood approach, whereas the extension (GEE) allows the incorpora-tion of any response correlation structure in the G L M . G E E has become a popular model 71 Chapter 3. Introduction to Multiple Regression of Count Data 72 fitting method in the past decade for dealing with correlated responses in GLM's , ever since its computer implementation became readily available. The approach is widely used in longitudinal data analyses in which the repeated responses over time are correlated within each independent subject (given the covariates). When the response correlation matrix or structure is known, it is natural to incorpo-rate this known matrix/structure into the regression model when using G E E to fit the model. However, misspecification of response correlation results when an investigator in-corporates an incorrect matrix/structure into the model while believing it to be true. For example, ignoring response correlation (i.e. believing that the responses are independent) is a misspecification of the true correlation structure. In the next two chapters, we examine the impact of misspecification of the response correlation structure on parameter estimation in multiple regression models for count data using G E E . Our investigation is based mainly on computer simulations. The first part of our simulations is presented in Chapter 4. In these simulations, we generate count data through Poisson-lognormal distributions. Conditioned on the covariates, the data are independent Poisson counts. Marginally, the counts have a Poisson-lognormal distribution, and are correlated through the correlation among the covariates. The G E E approach is used to fit the data using correct and incorrect corre-lation structures. We then examine the performance of the approach with the different structures, through its empirical power (of detecting non-zero regression coefficients). We can use this empirical power as a measure of parameter estimation efficiency. Recall from the previous chapter the notion of the model assuming independent re-sponses (while the responses are correlated) and that assuming correlated responses with a known (correlation) structure. In the next two chapters, the independence model refers to the G L M in which the responses are assumed independent. The dependence model, Chapter 3. Introduction to Multiple Regression of Count Data 73 on the other hand, is the G L M in which the responses are assumed to have certain cor-relation structure. (See Section 3.4 for detailed explanations of terminology.) These two models are then fitted to the data using G E E , and their efficiencies (empirical powers) are compared to determine their relative efficiency. We present the second part of our simulations in Chapter 5. In this chapter, we con-centrate on studying the impact of ignoring response correlation on parameter estimation in a G L M . We generate exchangeably correlated Poisson data. Generating correlated Poisson data with non-linear relationship to the covariates is generally a difficult task. Instead, the Poisson data that we generate are linearly related to the intercept — the only regression coefficient — without true covariates in the model. The following section is a summary of literature references for Chapters 4 and 5. In the remainder of this chapter, we describe the software we use for our simulations in Section 3.2, the theory of the G E E approach in Section 3.3, and the terminology we use throughout the next two chapters in Section 3.4. 3.1 Literature Review Since the development of the G E E theory by Liang & Zeger (1986), many longitudinal data studies involving non-Gaussian data have been carried out through the G E E ap-proach for model fitting. For example, Zeger h Liang (1986) applied their theory on a dichotomous dataset (response and covariates) intended to describe the association of a mother's stress and her child's morbidity. Apparently, the G E E approach for model fitting with such data types is simple and easy to use. However, the application of the G E E approach requires data replication. For example, the time series measurements on each individual subject in a longitudinal data analysis constitute one set of data replication, or a cluster. When there is only one correlated Chapter 3. Introduction to Multiple Regression of Count Data 74 series in the study, such as the case in the previous chapter, the G E E approach is not appropriate for model fitting. In the case of count data for a single time series, Zeger (1988) developed a "parameter-driven regression model for time series of counts." In his paper, he mentions both a parameter-driven model, in which autocorrelation of the counts (Poisson) is introduced through a latent process {e*}, and an observation-driven model, in which the distribu-tion of the current count observation is a function of all the past counts. He concentrates on the parameter-driven model in this paper, and develops an estimating equation ap-proach for parameter estimation. Simulations are done to compare the efficiency of the parameter-driven model with two common log-linear models. Results show that the parameter-driven model leads to "more valid inferences." The effect of ignoring correlation among response variables on the efficiency of regres-sion parameter estimation in non-linear models has been studied in different contexts, mostly involving longitudinal data analyses using the G E E approach. In some of these " G E E studies," the dependence model is shown to be more efficient than the indepen-dence model in parameter estimation. It is hence believed that the model that takes into account the correlation among the responses should be more efficient than the model which does not. However, the effect of ignoring response correlation on the efficiency of parameter estimation is still not clearly understood because such effect varies across dif-ferent contexts. Fitzmaurice (1995) mentioned that different research papers concluded different findings about the relative efficiency of the independence model to the depen-dence model. Some conclude that the latter is clearly more efficient than the former, while others conclude that the former is nearly as efficient as the latter. Fitzmaurice illustrates in his paper that, with binary data, this relative efficiency depends on how strongly the responses are correlated, and on the covariate design. In particular, with a fixed strength of response correlation, the G E E independence model (the model fitted Chapter 3. Introduction to Multiple Regression of Count Data 75 using G E E assuming independent responses) is noticeably less efficient than the G E E dependence model (the model fitted using G E E with a correctly specified response corre-lation structure) when a covariate varies within the cluster. On the other hand, ignoring response dependence does not lead to loss of efficiency in parameter estimation using G E E if "group is a cluster-level covariate." Inspired by health care studies using hospital admission counts as responses, here we study the impact of misspecifying the spatial correlation on parameter estimation. In this context, the correlated "series" is relatively short compared to those in a longitudinal study. This is because the spatially correlated responses are collected from the regions of interest, which are often of a limited number. Since the implementation of the G E E algorithm is readily available, we will not try to implement the algorithm for fitting a parameter-driven model as developed by Zeger (1988). Instead, we use the S-Plus function gee in our regression simulations with count data. Data replication is present in our simulation studies so that the G E E approach can be applied. Each spatially correlated series is analogous to the correlated repeated mea-surements (on each subject) over time in a longitudinal data analysis. The analog of the independent subjects in a longitudinal study is the independent replicates, i.e. the repeated observations (of the response and explanatory variables), within each region. Thus, instead of treating the subjects as the means of data replication in a longitudinal study, we can independently observe replicates of the response and covariates in each region in the context of spatially correlated data. For example, datasets collected by two independent investigators can be viewed as two sets of data replicates. Datasets observed at different time points far apart (such as a year) from each other may also be thought of as independent data replicates. Chapter 3. Introduction to Multiple Regression of Count Data 76 3.2 G E E in S-Plus In using the G E E method to fit a model to correlated responses with fixed covariates, we need to specify the response correlation structure. The S-Plus function gee asks for either a correlation structure (e.g. exchangeable, AR(1)) or a fixed correlation matrix for these responses as one of the input arguments. If we specify the correlation structure, then gee will estimate the parameters in the response correlation matrix with the specified structure. (For example, the intraclass correlation coefficient is the parameter for the exchangeable structure.) If we specify a fixed correlation matrix, then gee will treat it as known and fixed throughout the parameter estimation iterations. In addition to the response correlation structure, using gee also requires the specifi-cation of the link function and the response distribution family (given the covariates) as a means to specify the relationship between the mean and variance of each response, gee then outputs the parameter estimates and their standard error estimates, with the final estimate of the response correlation matrix if it was not fixed as an input argument. This S-Plus function gee version 3.5 that we use in our simulations was obtained from the share-ware organization, StatLib, and then installed into S-Plus version 3.1. This function allows the user to specify the following response correlation structures: 1. independence (in which case the gee estimates of the regression coefficients are equivalent to those produced by the G L I M software), 2. f i x e d (user-specified/known correlation matrix), 3. exchangeable, 4. AR-M (with a user-specified M ) , 5. stat_M_dep (stationary M-dependence, with a user-specified M ) , Chapter 3. Introduction to Multiple Regression of Count Data 77 6. non_stat_M_dep (non-stationary M-dependence), and 7. unstructured (gee estimates the structure and the values in the working correla-tion matrix). (For convenience, we do not apply items 5 and 6 in the above list to our simulation studies.) Notice that if we specify the "independence structure" in gee, the G E E co-efficient estimates are equivalent to those produced by the usual G L I M software. Also, the independence structure is a special case of both the exchangeable and the AR(1) structures. For the exchangeable structure, a 0 intraclass correlation coefficient gives the independence structure. Similarly, a 0 lag one correlation reduces the AR(1) structure to the independence structure. If the responses are independent (given the covariates), one would expect the independence model to provide very similar results as those of the dependence model with an exchangeable or AR(1) structure. On the other hand, fixing a correlation matrix that is very different from the identity matrix in the dependence model may yield very different results from those of the independence model. In particular, one would expect the estimation efficiency (in terms of bias and power) of the dependence model to be lower than that of the independence model in this case, simply because the structure is incorrectly specified in the former model. This will be closely investigated in the next chapter. Similarly, we will closely investigate in Chapter 5 the efficiencies of the two models when the responses are exchangeably correlated. 3.3 Some Notation and Theory in G E E In this section, we present the applicable results in Liang & Zeger, 1986 in the context of spatially correlated responses. First, assume that the geographical domain of interest consists of r regions. Denote the response vector by Y = (Yi,..., Yr)T. For each of the r regions, we observe the same Chapter 3. Introduction to Multiple Regression of Count Data 78 g covariates. In the ith region, denote the <JHdimensional covariate vector for Y{ by X ; . (Xn,Xi2,... ,Xig)T. Now, we can write the r x g covariate/design matrix X as / X X\\ X\2 X21 X22 X 13 X-2s 1. = XT 2. / 1 X r . j {X.!) X . 2 , . . . , X,g). y Xr\ Xr2 • • • Xr The relationship between the responses and the covariates is expressed in terms of the generalized linear model g [E(Yi\Xi.)] = XT/3 for a l i i = 1,... , r for some link function Q and fixed parameters /3. To apply the G E E approach in estimating (3, we need data replication. Let {Y^ : s = 1,2,..., K} be a random sample of r x 1 response vectors from the spatial domain of interest. Associated with this sample is the set of covariate matrices, { X W : s = 1,2,..., A"}. That is, we have K independent sets of replicates of the response vector and the covariate matrix. In this set up, it is natural to assume that the K replicates of the responses given the covariates, |X"(s)]'s, all have a common correlation matrix. Denote this matrix by Cov(Y\X). In other words, C o r ( Y ^ | X ( s ) ) is the same C o r ( Y | X ) matrix for all s. Now, define the r x 1 vector rj^ to be / X[*)T(3 ^ 1 v[s) ^ XW/3 Chapter 3. Introduction to Multiple Regression of Count Data 79 If, for all s, the conditional density of the ith response given the covariates, X^], can be written as / (yh^) = exp {yVoV - a ( ^ ) + b(y^)cf>} (3.17) where 9^ = h(r]^), for some functions a,b, and h, then E ( y / W ) = a>(0W) W a x ( Y W \ x V ) = where a' and a" are the first and second derivatives of a respectively. (f> is called the scale or the dispersion parameter. Note that the link function Q is related to h in the following manner: «[E(yW)]=«?fe' The r x r matrices A ^ , A ^ s \ and V(s) are defined as ( a"(0[s)) 0 0 a"(^ s )) 0 0 d^p o dew djT 0 a"(0W) ... o N 0 0 0 ^ / Chapter 3. Introduction to Multiple Regression of Count Data \ 80 y(») = 1 J L j _ where R is any r xr matrix that satisfies the requirements of being a correlation matrix. This R is the working correlation matrix. Similarly, is the working covariance matrix for the sth replicate. Note that is the true covariance matrix of [F( S ) |XW] (i.e. = Cov ( y W | X ( s ) ) ) if R is the true correlation matrix of each [Y^\X^]. Also note that since all [y(*)|A"(')]'s have the common correlation matrix Cor(Y\X), the working correlation matrix R (which may be used to estimate C o r ( Y | X ) ) does not depend on s. Next, define the r x g matrix £>(*) = A^ A^X(SK Then the generalized estimating equations are defined as K J2 D W ^ V W j - ' i y W - a'(0<*))] = 0 (3.20) where a'{6^) = (a'(9[s)), • • -,a'(9^))T. The estimate of p is then obtained by solving (3.20) for p. Since the quantities D^s\ V^s\ and a'(6^) in (3.20) are all functions of some un-known parameters, we need an iterative numerical method for updating (at each iteration) each unknown parameter estimate so that these quantities in (3.20) are consequently up-dated. Upon convergence, we obtain the final estimates of these unknown parameters such as P, and those in the working correlation matrix R. If the working correlation structure (i.e. the structure of R) is specified according to the true response correlation structure, then this final "estimate" of R is the estimate of the conditional correlation matrix of the response Y, i.e. C o r C K l X ) . 1 1In our investigation, we assume that the initial specification of R as the working correlation matrix is based on the believed structure of Cor(Y"|X). The final estimate of R may not be a good estimate of Cox(Y\X) if its specified structure does not coincide with that of Cot(Y\X). Chapter 3. Introduction to Multiple Regression of Count Data 81 More advanced statistical packages such as S-Plus provide standard procedures for fitting GLM's using the G E E approach. Notice that if R = J , i.e. assuming2 the responses to be independent (given the covariates), the estimating equations (3.20) reduce to those derived from quasi-likelihood theory. In other words, the G E E parameter estimates reduce to those of the ordinary G L I M software approach when R = I. The following theorem is a re-statement of that presented in Liang & Zeger (1986): Theorem 3 Suppose regularity conditions hold, and that there exist K-consistent esti-mators for (j) and the parameters in R. Define (3G to be the solution of (3 in (3.20), with (j) and the parameters in R being replaced by their estimators, if they were not fully specified. Then \/~K ((3G — (3) has an asymptotic MVN distribution with mean 0 and covariance l i m ^ o o COY[VK (3 g - (3)} = limK^K (J2f=1 D^lV^-'D^y1 {z*=1 D ^ T [ V ( s ) ] - 1 C o v ( y W { X ^ V ^ ] - 1 D^} ( Z ^ £>0)TryO ) ] - iD(*)y\ In a G E E analysis, the above theorem is used for estimating the covariance of f3G', in the following manner: Cov(3G) = | c o v ( 3 G - / 3 ) = ^ C o v [ v ^ ( 3 G - /3) ] ~ (E^ ( S ) T[V ( S )]_ 1^ ( S )) J E r > ( s ) T [ v ( s ) ] - i C o v ( y W | x ( s ) ) [ y ( s ) ] - i i > ^ | K \ - 1 ^ / j W ^ y W j - D W (3.21) S=l / Recall that R and are respectively the working correlation matrix and the work-ing covariance matrix for the responses given the covariates X ^ . If we assume V( s ) 2See Section 3.4. Chapter 3. Introduction to Multiple Regression of Count Data 82 to be the true covariance matrix C o v ( Y ( S ) | X ^ ) , then equation (3.21) reduces to Cov(/3G) « (j2 £ ) ( s ) r [ V ( s ) ] _ 1 £ ) ( s ) j . (3.22) The G E E estimate of the covariance of (3G using equation (3.22) is called the naive covariance estimate. The one using equation (3.21) is known as the robust covariance estimate of f3G. In practice, we usually do not know the true correlation structure of the responses. The working correlation, R, which we specify in running the G E E analysis, is only based on the belief of what the true structure is. The naive covariance estimate might then be an inappropriate estimate of the true covariance of (3G if R is specified as very different from the true C o r ( Y | X ) . 3 On the other hand, the robust covariance estimate accounts for the true covariance of the responses, C o v ( Y ( s ' | X ^ ) , and thus is a "corrected version" of the naive covariance estimate. Of course, Cov(Y^\X^) is usually unknown. In particular, misspecification of the response correlation structure is usually a result of not knowing the true Cov(Y\X). In fact, Cov(y( s) |X< s)) in the robust variance estimate formula is estimated from the data using [Y^-a'{d^)][Y^-a'(d^)}T. Here, a ' (0 w ) is computed using X W and J3G, for each s. Since the observed data should be representative of the population, the estimate of Cov(y( s ) |X( s ) ) based on the data should usually resemble the true covariance more than does (when Cor(Y( 5 ) |X( s ) ) is misspecified). Hence, the robust estimate is often a better estimate of the covariance of (3G. 3That is, the asymptotic Cov(/3G) does not equal the naive covariance estimate if the structure of R is different from that of Cor(T'"|.X'). Chapter 3. Introduction to Multiple Regression of Count Data 83 3.4 Terminology In the next two chapters, we will assume the scenario of an investigator specifying the working correlation matrix/structure in G E E according to what s/he believes the re-sponse correlation matrix/structure to be. Misspecification of the response correlation refers to the case when the investigator's belief/assumption (and thus specification) of the response correlation is incorrect when using G E E for model fitting. We will refer to the G L M in which the responses are assumed to be independent as the independence model. Similarly, we will refer to the G L M in which a correlation structure of the response dependence is taken into account as the dependence model. In the simulations, we fit both models to a given dataset using the G E E approach. Notice that either model may be incorrect. For example, suppose the true response correlation is exchangeable. Then the independence model is incorrect, but the dependence model that assumes exchangeably correlated responses is correct. On the other hand, the dependence model is incorrect if it assumes the response correlation structure to be anything other than exchangeable. In other words, the model is only correct if the specified correlation structure is the true response correlation structure. We say that one model is more efficient than another if its parameter estimates produced by the G E E approach are closer to the true parameters, and if these estimates have smaller variability. One way to measure unbiasedness and variability of a parameter estimate is to observe many realizations of the estimate. For each simulation, we repeat the G E E model fitting process to n datasets. Summary statistics of these n parameter estimates are then obtained. Unbiasedness can be measured by how close the sample mean of these n estimates is to the true parameter. Similarly, variability can be measured by the sample variance of the n estimates. Chapter 3. Introduction to Multiple Regression of Count Data 84 However, small variability in a parameter estimate also means higher power in de-tecting a non-zero value of the parameter. This leads us to the notion of empirical power of the independence and dependence models. The power of detecting a particular non-zero regression parameter can be measured by the proportion of the n datasets in which the parameter is statistically significant at a given level. A model is then also more efficient than the other if it is empirically more powerful than the other in detecting a given parameter from the same dataset. As in the previous chapter, we may wish to determine if the difference in power be-tween the independence and dependence models is statistically significant. We will again use the Pearson's x 2 test °f independence to assess the association between model (in-dependence/dependence) and power {8j significant or not) for each regression coefficient in the model. (See page 33 for examples of the x 2 test.) If the association is significant, we conclude that the difference in (empirical) power is significant. Otherwise, the difference in power is likely due to chance.4 4 . . . in the terminology of Freedman, Pisani, Purves k, Adhikari, 1991, Chapter 26. Chapter 4 Simulations of Multiple Regression of Count Data — Part One In this part of our simulations, we generate count data which are independently Pois-son distributed when given the covariates. (Marginally, the responses are correlated with a Poisson-lognormal distribution.) We are interested in the effect of misspecifying a response correlation structure on parameter estimation using G E E . Thus, any non-independence structure is a misspecification. 4.1 Generating Count Data and their (Random) Covariates Recall from Section 3.4 the notation for the response vector and the covariate matrix for the 5 t h replicate. Fixing s, the equations in this section hold for all s = 1,... ,K. In other words, each set of replicates {Y^s\ X^} are independent and identically distributed (i.i.d.). Thus, (4.1) to (4.7) hold for all s. We can generate F / ^ ' S through generating covariate vectors X-S^s. In a given re-gion i, the g-dimensional covariate vector is X;*\ and we assume that the j t h covari-ate is distributed as a normal random variable with mean fi^ and variance <T,-J. Also, within the iih region, the jth and the /th covariates are assumed to be uncorrelated, i.e. Cav(XJj\X^) = 0, for all j ^ /. Fixing j , we assume that the j t h covariate in the ith region and that in the kth region have covariance a^. In matrix notation, we have 85 Chapter 4. Simulations of Multiple Regression of Count Data — Part One 86 the r x 1 covariate vector / x(s) ^ l i A2j M V N ( / * . i , 17) for all j (4.1) where »". V2j and 27 C l l CT 1 2 (7i3 • • • <r l r <7i2 CT22 <T23 - - • °~2r \ prj J \ 0~lr 0 - r r J and that X^ and X^f* are independent vectors for all j ^ /. In health impact studies of air pollution, for instance, this distribution for the covari-ate vectors -X" j^'s (for a given s) is reasonable where the log-scale pollutant concentration often follows a normal distribution. Moreover, given an explanatory variable, observed across all regions, we expect the readings in the different regions to be spatially cor-related. To reduce the complexity of our simulation study to computationally realistic levels, we assume that the readings across all regions are spatially correlated in the same way for each explanatory variable. Hence, we have the same covariance matrix S for all j in (4.1). For the same reason, we assume the readings of the different explanatory variables within a region to be independent to avoid singularity problems in model fitting due to multi-collinearity in the explanatory variables. Now, we can define the expectation of X^ in the matrix form li = E (X^) I T \ T V ^ J — ( A * . I i A * . 2 5 • • • » Chapter 4. Simulations of Multiple Regression of Count Data — Part One 87 Again, let the r x 1 response vector be = (Y^ , . . . , Y^*)) T, where for the ith region, the response (e.g. hospital admission count) has the conditional Poisson distri-bution K ( , ) | X ( S ) ] ind p o . g s o n ^ x ^ T ^ { I 2 ) for some g x 1 coefficient vector [3 = (81: 32,.. ., 8g)T •> for i = 1,2, . . . , r . That is, conditional on the g covariates, the response in the ith region has a Poisson distribution with mean exp(X,- s ^ T ^). Note that we generate the Y^'s so that [Y^ \XJ']] is independent of [Yk(s)\X{ks)] for i / k. Also note that x \ s ) T p = Y?j=1 0jX\-] is a normal random variable with mean fijft and variance ]Cf=i 0j°~n = °~n PTPi a n d that Cov(xls)T(3,X{ks)Tf3) = cov(ExWpi,£lxtfpl) 3 = 1 1=1 = tp]Cov(xi;\xif) j=i 3=1 = crik/3T/3 for alii ^ k. Thus, letting v = fx(3 and W = (3T f3E, we have X^(3 ~ MVN ( i / ,W) . (4.3) Now, let z | s ) = E ( y / s ) | x i s ) ) = exp(x[ s ) r/3). Therefore, Z^ = exp{X^/3) is an r-dimensional lognormal random vector with pa-rameters v and W. Hence, marginally (unconditional on X^), Y^ has a multivariate Chapter 4. Simulations of Multiple Regression of Count Data — Part One 88 Poisson-lognormal distribution (Aitchison & Ho, 1989) with E(Y;. ( s )) = expfc + \Wu) = exp(/xf/3 + lo-u(3T(3) var(y;.(s)) = E(y; ( s ) ) + [ E ( y ; ( s ) ) ] 2 [ e x p ( ^ ) - i ] = E ( F / S ) ) + [E(y;. ( s ))] 2[ex P(cT u/3 r/3) - 1 ] Cov(y,w,yW) = E(y 1 W)E(n w )[ex P (Wi f c ) - 1 ] = E(y/ s ) )E(n ( s ) )[exp( (T i f c /3^) - 1] Cor(y,w, y f c w ) = e x p ( l ^ f c ) - 1 exp(Wii) - 1 + \ eMWkk) - 1 + exp(aikpTf3)- 1 (4.4) (4.5) (4.6) e x p ( ^ / 3 ) - 1 + — ^ E ( y / " ) exp(crfcfc/3T/3) - 1 + E(yW)J (4.7) for all i and k where i ^ k. Finally, we can generate the covariate matrix and the response vector according to (4.1) and (4.2) by specifying the ii and £ matrices, and the coefficient vector /3. We thus know the exact marginal distribution of the response variables according to equations (4.3) to (4.7). Notice that so far, we have assumed that we do not have an intercept in the G L M . Some of the equations presented above should be slightly modified if an intercept is to be fitted. Please refer to Section 4.4 in the C H A P T E R A P P E N D I X for details. Technically, in our simulations, applying a Poisson regression to these y( s ) 's to obtain an estimate of ft and the subsequent data analysis does not require the knowledge of the above marginal distribution. However, as mentioned in the previous chapter, some Chapter 4. Simulations of Multiple Regression of Count Data — Part One 89 response correlation structure needs to be specified in the dependence model when using G E E for parameter estimation. When deliberately misspecifying the response correlation, we may make use of the marginal response correlation structure, which is generally very different from the independence structure. In fact, the responses are often strongly correlated, marginally, due to the correlation among the covariates. 4.2 Multiple Regression via Generalized Estimating Equations In the dependence model, we assume that the responses are correlated, given the co-variates, with certain correlation structure. In this part of our simulations, since the responses are uncorrelated given the covariates, we may examine if specifying a non-independence structure for the responses in the dependence model would decrease the efficiency of the regression parameter estimation. We do so by comparing the efficiency of such a model to that of the independence model (both models are fitted using the G E E approach). 4.2.1 G E E in the Conditional Poisson Model where 77,- = X^' /3 for i = 1 ,2 , . . . , r and s = 1 ,2 , . . . , K. Comparing (3.17) and (4.8), we have 0\s) = j]\s\ a(6\s)) = e^'\ b{y\s)) = - logj / J s ) ! , The conditional density of [5^- | X\ ] in (4.2) i s e x p j - e ^ H e ^ ) ^ _ 001 = exp{—eVi + y\s^ log eVi —\ogy\s^\} = exp{?/Js)7/js) - - logy\s )\} (4.8) Chapter 4. Simulations of Multiple Regression of Count Data — Part One 90 and 4> — 1- Hence, (3.18) in the conditional Poisson model becomes (4.9) E ( ^ | X | S ) ) = a'(0\s)) = e* Var(lf>|x! s ) ) = = e*\ Note that the link function in a generalized linear model is defined to be the function that maps the mean (the conditional mean in our case) of the response to rjjs\ Thus, we have the log-link in the conditional Poisson model, i.e. log E ( F / S ^ | X ^ ) = n\sK Now, the matrices A<s), A^ S\ V^s\ and £>W simply become \ e 2 0 0 0 = I = [AM]$R[AM]> By specifying an initial working correlation matrix R with a certain structure (e.g. exchangeable), we may now substitute the above matrices into equations (3.20) and solve for the coefficient vector {3. Hence, we obtain a G E E estimate f3G for /3. If the specified structure of R coincides with that of C o r ( y | X ) , then the final R 1 is the estimate of C o r ( Y | X ) , the (spatial) correlation matrix of the conditional Poisson responses. To estimate the covariance of /3 G , we need to know Cov(Y( S)|X"( S)). Recall that in this part of our simulations, we generate count data which are independent given the covariates. Thus, we can replace Cov(Y( s ) |X( s ) ) in equation (3.21) by 1See page 80. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 91 1. I, or 2. [y(s) - a'(6^)][Y^ - a'(6^)]T (see page 82). In practice, the true correlation structure of the responses is not known, in which case item 1 above may not be used. Moreover, although we may sometimes believe that the data are so weakly correlated that they are close to independent, we should base our inference on the robust S.E. produced by gee. This is because the computation of the robust S.E. is a "corrected version" of the naive S.E., accounting for the true covariance structure of the responses which is estimated by item 2 above. 4.3 Computer Simulations of Multiple Regression of Count Data Our goal is to examine the effect of incorrectly specifying the response correlation on the efficiency of parameter estimation in a generalized linear model. (The response correlation is specified through the structure of the working correlation matrix, R, or fixing the matrix itself, when running gee.) In particular, count data are of special interest because many health impact studies use hospital admission counts to indicate health conditions of a population. Throughout our simulations presented in this chapter, our specification of the incor-rect response correlation was based on the marginal correlation of Y, not conditional on X. See Section 4.4.2 in the C H A P T E R A P P E N D I X for details of how we specified R. We used the statistical software S-Plus for all our simulations. Each simulation included three main parts, namely, 1. generation of the count responses and the covariates2 according to equations (4.1) and (4.2), with r = 5 and K = 20; 2We would need to specify some £ in (4.1) to generate the covariates. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 92 2. regression ( G L M fitting) of the generated count responses on the covariates using gee (function in S-Plus) assuming (a) independent responses (R = J) , and (b) dependent responses (R ^ I), where i . only the structure of R was specified as either exchangeable or AR(1); or i i . R was fully specified as a fixed matrix based on the marginal distribu-tion (Poisson-lognormal) of Y (as a misspecification of the correlation structure); and 3. comparison of empirical relative efficiency of the independence and the dependence models, and other analyses. Note that estimating C o r ( Y | X ) (by the final R) is not the primary concern of a G E E analysis. However, a measure of the variability of j3G is crucial. In S-Plus, the function gee estimates the standard error (S.E.) of each regression coefficient estimate. It gives both the naive and the robust S.E. estimates. As mentioned before, in computing the robust S.E., the fact that the final estimate of R is only an estimate of C o r ( Y | X ) is taken into consideration. In this part of our simulations, since the responses are in fact conditionally independent, R = I (fixed) as specified in the independence model is actually the true C o r ( Y | X ) (so that the working covariance matrix V is the true C o v ( Y | X ) in the case of R = I). Hence, the naive S.E. in the fit for the independence model should be a better estimate of the true S.E. of a regression parameter estimate than the robust S.E. However, in our simulations, we only compared the robust S.E.'s of the parameter estimates produced by the independence and dependence models. This is because with Chapter 4. Simulations of Multiple Regression of Count Data — Part One 93 real data, one is never certain that they are truly independent or not. Analyses with real data should naturally be based on the robust S.E. estimates. In addition, we also computed the sample covariance of the /3G 's in each simulation study 3 for comparison to the robust S.E.'s produced by the two models. With enough observations of / 3 G , the sample covariance should resemble most the true covariance of flG, which can only be roughly estimated using equation (3.21) in a regression of real data. This brings us back to our objective of the simulations — to determine how misspec-ification of the response correlation structure affects the performance of G E E parameter estimation in a G L M . A model performs well if the parameter estimates have small em-pirical biases and variability. Hence, a good model should be more effective in detecting departures from the null model of j3 = 0 . Intuitively, as supported by the previous chapter, the model that accounts for the true response correlation should perform no worse than that with a misspecified correlation structure. However, the effect of such misspecification on the efficiency of parameter estimation is not so clear in the current non-Gaussian setting. Our simulations are intended to help us understand at least em-pirically what this effect is on our non-Gaussian G L M . 4.3.1 Simulations with Exchangeable and AR(1) Correlation Structures of the Covariates The most convenient correlation structures to use in the simulations are the exchangeable and the AR(1) correlation structures. These structures only consist of one unknown pa-rameter. Hence, if the responses approximately have either of these correlation structures, then a small sample size, K, and a small number of regions, r, should provide sufficient information for G E E to estimate in a few iterations all the unknown parameters in the 3 Each sample of K response vectors produced one estimate f3G. In each simulation study, we gener-ated n such samples, and then obtained the sample covariance of the n /3G's. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 94 G L M . Indeed, conditioned on the covariates, the responses generated according to equa-tions (4.1) and (4.2) have the independence correlation structure, which is a special case of both the exchangeable and the AR(1) structures. At the same time, the exchangeable and AR(1) correlation structures are also con-venient to specify in generating the covariates. We ran numerous simulations in which the covariates had either of these two correlation structures. Let px denote the intraclass correlation coefficient in £ for the exchangeable covariates, and the lag one correlation in E for the AR(1) covariates. We specified different values of px, together with different /x's and /3's to generate different sets of count data, each set being marginally correlated, but independent conditional on the covariates. With each generated dataset, we used the S-Plus function gee to separately fit the independence and dependence models. We will denote by /31 the / 3 G produced by the independence model, and by f3D that produced by the dependence model. We would like to compare the empirical relative efficiency of the two models as reflected by how frequently the models reject the null hypotheses (for each j) of 8j = 0, each at significance level a = 0.05. The z-test-statistics were computed based on normal approximations for the distributions of the estimates /31 and / 3 D . The S.E.'s of the d^s in these ^-statistics were taken to be the robust S.E.'s produced by the S-Plus function gee. We used two ways to specify a response correlation structure in the dependence model: (1) by specifying the structure for the working correlation, R (i.e. exchangeable or AR(1)) as a functional argument of gee; and (2) by specifying a fixed R. In our experiments, the fixed R matrices were often very different from the identity matrix. We determined if the dependence model — either having an extra unknown parameter in R (i.e. case (1)), or a misspecified response correlation structure (i.e. case (2)) — would be less efficient than the independence model. However, in all the simulations, both the independence and the dependence models Chapter 4. Simulations of Multiple Regression of Count Data — Part One 95 generally produced estimates of f3 with small empirical biases. This result may not be surprising at first glance, because the consistency part of Theorem 3 applies to a model with any working correlation H . However, it is merely an asymptotic result.4 In fact, our study indicates that the G E E parameter estimates for our G L M is reasonably robust to misspecification of response correlation for small sample sizes; afterall, K = 20 is a small sample size. Slight bias was apparent only in isolated cases, particularly in estimating the intercept. This bias could merely be due to chance. As revealed in Table 4.9, the intercept estimate was typically the most variable (had the largest relative S.E. 5) among all the regression coefficient estimates. As for the empirical relative efficiency of /31 to (3D, we examine the p-value compar-isons in the simulations. We tabulate the p-values as in Table 2.1 from Chapter 2. For each regression coefficient in one such comparison table, we are interested in comparing the following: 1. the margins of the 2x2 p-value table, and 2. the off-diagonals of the 2x2 p-value table. In item 1, we are essentially comparing the the empirical powers of the independence and the dependence GLM's . Hence, we can compute the empirical relative efficiency of the two models from the p-value tables as we have done in Chapter 2. We will not present the results of all of our simulations. However, the results we will present in Tables 4.8 to 4.14 are representative of all of the simulations. In these selected simulations, there were g = 2 covariates, and the \i matrix that we 4The misspecification of the correlation structure of the responses does not affect the (inconsistency of the GEE estimate for the regression coefficients. However, this is only true provided that, among other criteria, the link function is correctly specified. See Liang & Zeger, 1986 for other criteria for the consistency of /3 G . 5We define the relative S.E. of an estimator as the ratio of the S.E. of the estimator to the estimator itself. This is actually the inverse of the standardized estimator. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 96 used in the "no-intercept" GLM's was / 0 3 1 li 2 1 1 We used the same fj, for those GLM's with an intercept, where we simply added a column of l's to the covariate matrix generated with the above expectation. Before examining more closely the simulation results, let us first explain the meaning of the items in Table 4.8. In running the S-Plus function gee, we specify corstr="ex" as an argument to spec-ify an exchangeable response correlation structure. Similarly, corstr="f ixed" specifies a fixed response correlation matrix, in which case such a matrix needs to be completely supplied to gee. In simulations with corstr="f ixed", we would supply that fixed matrix computed in the way described in Section 4.4.2 (i.e. the response correlation uncondi-tioned on the covariates), with (31 (obtained from the independence model) in place of Again, px represents the intraclass correlation coefficient or alternately the lag 1 correlation in £ , the correlation matrix of the covariates. In our simulations, we no-ticed that the marginal correlation structure of the responses were often approximately exchangeable if the covariates were either exchangeable or AR(1). (See Example 1 in Section 4.4.2.) We use py to denote the approximate value of the marginal correlation of each pair (Y^s\ Y^), i ^ k, for each s. (That is, unconditioned on the covariates, the responses are approximately exchangeable, with an intraclass correlation coefficient approximately equal to py.) Since this marginal correlation matrix was what we would fix as R for the dependence model in simulations with corstr="f ixed", py indicates how 0-Chapter 4. Simulations of Multiple Regression of Count Data — Part One 97 Sim cor s t r Px Py P Fit inter-cept? n sample mean of n 3 / 's sample mean of n 3 D 's 3 a " f i x e d " 0.9 0.8 ( ° - 1 ^ 0.01 11 ) yes 500 ( 0,100 \ 0.010 ^ 1 . 0 0 1 ) ( 0.101 \ 0.010 1.000 / 4a " f i x e d " 0.9 0.8 r no 500 ( 0.011 \ \ 1 .000 ) / 0.010 \ I 1.000 J 7 "ex" 0.9 0.03 to 0.04 I 0.15 \ 0.1 0.12 ) yes 100 ( 0.138 \ 0.098 ^ 0.123 ) f 0.139 \ 0.097 ^ 0.123 j 8 "ex" 0.9 0.03 ( 0 .12 ) no 100 ( 0.100 \ \ 0.117 J ( 0 . 1 0 0 \ [ 0.116 ) a: In these simulations, when fitting the dependence G L M , we assumed px to be known, and then substituted it and the final fl1 into equations (4.4) and (4.7). We then specified in ge a correlation structure fixd as that in equation (4.7). Table 4.8: Results of selected simulations. badly the response correlation structure was misspecified in the model. In fact, most of these marginal response correlation matrices turned out to be quite different from the identity. However, this led to only slight loss of efficiency of the dependence model in some isolated cases. In the next section, we will see the simulation results which illustrate this point. As for the regression parameter vector /3, we wanted its components to be small enough such that badly misspecifying the response correlation might deter the depen-dence model from detecting them, 6 making any possible loss of estimation efficiency noticeable. However, the size of a coefficient of course depends on the variability of its (GEE) estimate. Without any a priori information about the standard error of the estimates before the actual simulation experiment was done, we specified "near-zero" regression coefficients in the model. We tried values in the order of between 1 0 - 2 and 101 in all the simulations. 6The larger the regression coefficients, the easier they are detected in any regression model. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 98 Next, denote in each simulation by n the number of datasets generated, i.e. the number of GLM's fitted. For each dataset, the sample size K (= the number of replicates) was fixed as 20 per region, and r (= the number of regions) as 5. The items "sample mean of n /3 J's" and "sample mean of n /3D 's" are self-explanatory. 4.3.2 Detailed Results for Simulations with Exchangeable and AR(1) Co-variates Among the simulations with exchangeable or AR(1) covariates, Simulations 3 and 4 are representative of those which were run with corstr="f ixed", and are the more interesting to examine. Similarly, Simulations 7 and 8 are representative of those which were run with corstr="ex". Notice that the two pairs of simulations are paired by their regression coefficient values. For example, the latter pair have Q\ = 0.1 and 82 = 0.12. The only difference between the two simulations is the intercept, B0 = 0.15, in the G L M of Simulation 7, whereas in Simulation 8 the G L M had no intercept. As mentioned before, both (31 and (3D seem to have small empirical biases in most of our simulations. Table 4.8 shows the typical amount of empirical biases of these two estimates. The Bj's in most cases are extremely well estimated by the dps and fif's. The intercept estimates are usually those Bfs that have slightly larger empirical biases. In fact, the intercepts usually have larger relative S.E.'s compared to the other B^s. Table 4.9 shows the sample S.D. of the n Bps produced by the G E E analysis for our selected simulations, together with the approximate relative S.E. of 8j, computed by sample S.D. of n Bps sample mean of n Bps for each j. For example, 8Q in Simulation 7 is not particulary well estimated. Not coincidentally, its relative S.E. is more than twice the relative S.E. of each of the two slope coefficient Chapter 4. Simulations of Multiple Regression of Count Data — Part One 99 Simu- sample S.D. for approximate relative S.E. for lation % Pi Pi PI Pi Pi 3 0.093 0,019 0.026 0.930 1.90 0.026 4 — 0.017 0.0096 — 1.55 0.0096 7 0.16 0.041 0.064 1.16 0.418 0.520 8 — 0.043 0.048 — 0.430 0.410 Table 4.9: Sample S.D.'s and approximate Relative S.E.'s of coefficient estimates in the independence model in selected simulations. estimates. We will not include a similar table for the /3D,s (by the dependence model), because their sample S.D.'s and sample means are very similar to those of the /37's (by the independence model). This brings us to the efficiency/power comparisons of the independence and the de-pendence models. The sample S.D.'s should be the closest (compared to the G E E naive and robust S.E. estimates) to the true S.E.'s of the coefficient estimates, because n was large. However, in a real life investigation, we only have one set of data, on which we fit a G L M using G E E . In that case, we can only estimate the true S.E.'s of the regression coefficient estimates, usually by the robust S.E. estimates from the G E E analysis. To compare the efficiency of the two models, we examine their p-value comparison tables. Tables 4.10 to 4.13 are the p-value comparison tables for Simulations 3, 4, 7, and 8 respectively. The following is an example of how one can examine a p-value comparison table. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 100 Example Table 4.12 shows the p-value comparisons for each of 30, B\, and 02 in Simultion 7. The 2x2 p-value comparison for B\ (middle row) shows that the empirical power7 of the independence model (R = I) is 58/100 = 0.58. The empirical power of the dependence model (R ^ I) is slightly higher at 60/100 = 0.60. We say that the dependence model is empirically 2% more powerful. This is obtained from (60 — 58)/100 = 0.02 = 2%. The off-diagonal values in this 2x2 table show the discordance of the two models in detecting a non-zero 3\. For example, among the 100 cases, there are 4 + 2 = 6 cases where the two models did not agree in the rejection of the null hypothesis. In considering the overall p-value comparison for the three regression coefficients in this simulation, the dependence model seems to be slightly more powerful than the independence model. However, the X2 test of independence of model and power yields p-values8 of 0.66, 0.89, and 0.67 for do, 3\, and 82 respectively. Thus, the difference in power of the two models (in detecting any of the regression coefficients) is likely due to chance. Both models are powerful in detecting 3\ and 32, but not powerful in detecting Bo- This is because ft® has the largest relative S.E. (i.e. SE0G)/^), as is shown in Table 4.9. • From our simulation results, we see that generally, both the dependence and the in-dependence models would decrease in power of detecting a non-zero regression coefficient as the coefficient became smaller. Also, both models had very similar empirical power in a given simulation. For example, in Table 4.10, both models only had approximately 7 By power, we mean the probability of rejecting the null hypothesis in testing Ho :/?.•= 0 vs. Hi : f3j ^ 0 for that j, at significance level a = 0.05, when H\ is true. We consider hypothesis tests of fa's individually for convenience. One can easily modify this into a test of Ho : /3 = 0, at a certain significance level. 8 . . . obtained by the S-Plus function chisq.test . Chapter 4. Simulations of Multiple Regression of Count Data — Part One 101 R ^ I p < 0.05 p > 0.05 total ft R = I p < 0.05 p > 0.05 58 47 76 319 134 366 total 105 395 500 fa R = I p < 0.05 p > 0.05 38 31 35 396 73 427 total 69 431 500 02 R — I p < 0.05 p > 0.05 500 0 0 0 500 0 total 500 0 500 Table 4.10: p-value comparisons of Simulation 3 in Table 4.8. R ^ I p < 0.05 p > 0.05 total 0i R = I p < 0.05 p > 0.05 34 33 43 390 77 423 total 67 433 500 02 R = I p < 0.05 p > 0.05 500 0 0 0 500 0 total 500 0 500 Table 4.11: p-value comparisons of Simulation 4 in Table 4.8. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 102 70/500 = 14% empirical power in detecting Q\ = 0.01, but had 100% empirical power in detecting 32 = 1. We also noticed that both models tended to decrease in power if the number of parameters increased. That is, in those simulations where the G L M did not include an intercept, the two models were sometimes somewhat more powerful (empirically) than if the G L M included an intercept. This is illustrated by Simulations 7 and 8. We may compare Tables 4.12 and 4.13. In Simulation 7, the independence and dependence models both had only about 60% and 50% empirical power in detecting the two slope coefficients (3\ = 0.1, 32 = 0.12) respectively, but had about 70% and 75% empirical power for the respective slopes in Simulation 8. This trend was not always apparent, mainly because of other factors which would also affect power, such as the actual size of the regression coefficients. For example, in both Simulations 3 and 4, the two models had similar empirical power in detecting the two slopes. 3\ = 0.01 (whose estimate has an approximate S.E. of 0.02) is considered small, whether the G L M had many or few coefficients. Thus, the two models had low but similar empirical power (about 14%) in detecting it in both simulations. On the other hand, 32 = 1 (whose estimate has an approximate S.E. of 0.03) is considered very large, with or without many extra coefficients in the G L M . Therefore, the two models both had 100% empirical power in detecting it in both simulations. In some simulations, such as Simulations 3 and 4, we specified a fixed correlation matrix — the marginal Cor(Y) — which was often very different from the identity, as can be seen from the py column in Table 4.8. In these simulations, the dependence model often had slightly less empirical power than the independence model. For example, in Simulation 3, the independence model is 134 - 105 500 x 100% = 5.8% Chapter 4. Simulations of Multiple Regression of Count Data — Part One 103 R ^ I p < 0.05 p > 0.05 total Po R = I p < 0.05 10 0 10 p > 0.05 3 87 90 total 13 87 100 01 R = I p < 0.05 56 2 58 p > 0.05 4 38 42 total 60 40 100 R = I p < 0.05 46 1 47 p > 0.05 5 48 53 total 51 49 100 Table 4.12: p-value comparisons of Simulation 7 in Table 4.8. R ^ I p < 0.05 p > 0.05 total 01 R = I p < 0.05 66 2 68 p > 0.05 3 29 32 total 69 31 100 02 R = I p < 0.05 74 0 74 p > 0.05 1 25 26 total 75 25 100 Table 4.13: p-value comparisons of Simulation 8 in Table 4.8. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 104 more powerful than the dependence model in detecting B0. In fact, the x2 test °f inde-pendence of model and power yields a p-value of 0.038, implying that this difference in power between the models is significant (at level 0.05, say). This difference in empirical power in detecting Q\ is 0.8% in Simulation 3, and is 2.0% in Simulation 4, although the differences in both cases seem to be results of chance according the x2 tests. The p-value discordance also seems to be more severe in those simulations where corstr="f ixed". We re-ran Simulation 3 in which we saw a noticeable difference in empirical power of the two models. In the re-run, we let the G E E analysis update the values in the work-ing correlation matrix R. In running gee here, we specified the exchangeable response correlation structure. To save computational time, we set n — 100, but K remained 20. Table 4.14 shows the p-value comparison for this simulation. From Table 4.14, we can see that the dependence model indeed had a better empirical power in this re-run than the original simulation. In fact, in the re-run, the empirical powers of the independence and dependence models did not differ in testing B0 = 0, as opposed to the dependence model being 5.8% weaker in power in the original Simulation 3. In addition, the dependence model in the re-run was actually 3.0% more powerful than the independence model in the re-run in testing Bx = 0, as opposed to 0.8% less powerful in the original simulation. (However, x2 tests indicate that these differences were likely due to chance in both the original simulation and its re-run.) The p-value discordance was also much less severe in the re-run than in the original simulation. The smaller power of the dependence model in those cases such as Simulations 3 and 4 might be due to the inappropriately specified marginal response correlation matrix for the dependence model, when data were in fact independent. However, the dependence model did not have substantially lower power than the independence model in these cases. The power difference was the most "extreme" in Simulation 3, but the dependence model turned out to be only 5.8% weaker than the independence model in detecting B0 = 0.1. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 105 R ^ I p < 0.05 p > 0.05 total 00 R = I p < 0.05 22 1 23 p > 0.05 1 76 77 total 23 77 100 01 R = I p < 0.05 21 1 22 p > 0.05 4 74 78 total 25 75 100 02 R = I p < 0.05 100 0 100 P > 0.05 0 0 0 total 100 0 100 Table 4.14: p-value comparisons of a re-run of Simulation 3 in Table 4.8, this time without fixing R. The small difference in power of the two models is particularly surprising if we consider how different the marginal response correlation matrix was from the the identity in most of these cases. For example, in Simulations 3 and 4, we specified cors t r="f ixed", and this fixed correlation matrix had most non-diagonal values of about 0.8. This in fact illustrates the robustness of the G E E approach to the misspecification of the response correlation structure. There is another worth-noting observation from our simulation results. Recall that we specified corstr="ex" in the dependence model when we ran gee in some of the simula-tions. Since the responses were truly independent given the covariates, the independence model should be the correct model. However, since independence is a special case of ex-changeability, the final R matrix produced by gee should be close to the identity matrix, because the G E E approach is supposed to estimate the true response correlation reason-ably well when the structure is correctly specified. Therefore, one would expect both the dependence and independence models to have the same efficiencies here. In fact, if we consider the extra parameter (response intraclass correlation coefficient) that G E E needs Chapter 4. Simulations of Multiple Regression of Count Data — Part One 106 to estimate in the dependence model, one might even expect the independence model to be more efficient than the dependence model. Indeed, in some of our simulations with corstr="ex" in the dependence model, this model was slightly less powerful than the independence model, possibly due to the extra unknown parameter in Cor(Y\X). However, some other simulations with corstr="ex" in the dependence model indicate that this model could also be slightly more powerful than the independence model. For example, in Simulation 7, as presented in Table 4.12, the dependence model is empirically 3%, 2%, and 4% more powerful than the independence model in detecting the three regression coefficients respectively. However, such small differences in power seem to have resulted from chance according to x 2 tests. 4.3.3 Addit ional observations We have so far discussed how the power of independence and dependence models compare in terms of their p-values. We may also compare their power by comparing their actual robust S.E.'s for the regression coefficient estimates. Among the simulations for which we have a retained record of these S.E.'s, 9 the S.E.'s for both the independence and the dependence models were very similar. This explains why both models had virtually the same power in detecting a non-zero regression coefficient. The G E E robust S.E.'s seemed to be relatively close to the sample S.D.'s of the coefficient estimates. A few robust S.E.'s were smaller than the corresponding sample S.D. by a factor of about 2, making the regression analyses appear more powerful than they really were. Only a few other robust S.E.'s differed from the sample S.D.'s by a slightly larger factor. However, many of them were extremely close to the sample S.D.'s. Although the difference would well have been due to chance, the reason behind a smaller 9These include simulations yet to be discussed. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 107 robust S.E. than the sample S.D. in some cases is not obvious. Further investigation may be needed. 4.3.4 Other Simulations In addition to those simulations with exchangeable and AR(1) covariates, we also ran several other simulations with very different covariate correlation structures. Some were "unstructured," while others had the variogram structure as in Chapter 2. Some of these extra simulations were also run with different ft = E ( X ) matrices. Their results were very similar to those of the previous simulations. That is, we found in these simulations that 1. both the independence and dependence models produced regression coefficient es-timates that have relatively small empirical biases; 2. the independence model tended to be slightly more powerful than the dependence model in those simulations with fixed response correlation matrix; 3. the dependence model was slightly more powerful than the independence model in some of the simulations where only the response correlation structure was specified; 4. the difference in power between the two models mentioned in the previous two items was not always significant according to the x 2 test; 5. both models had very similar overall power in detecting the non-zero regression coefficients in all simulations, suggesting the robustness of G E E to misspecification of response correlation structure; 6. this power tended to decrease as the coefficient size decreased, and/or as the number of parameters in the G L M increased; and Chapter 4. Simulations of Multiple Regression of Count Data — Part One 108 summary statistics of n K n intraclass correlation coefficient estimates mean median sample S.D. 10 30 -0.028 -0.029 0.080 20 20 -0.012 -0.020 0.056 50 15 0.0039 0.0042 0.048 100 15 -0.0011 -0.0029 0.024 200 15 -0.011 -0.013 0.022 500 10 -0.000084 -0.00059 0.012 800 10 -0.00093 -0.0023 0.010 Table 4.15: Final G E E R as K increases. 7. the robust S.E.'s from the G E E analysis were often very close to the sample S.D.'s of the coefficient estimates, even with such small r (= 5) and K (= 20). In addition, we also investigated how well the G E E approach estimates Cox{Y\X). We ran small scale simulations (n < 30) with increasing K but r was held at 5 (as in the previous simulations). Each was run with \x = (3,1,2,1,2) r , a common £ matrix, corstr="ex", and a G L M that consisted of only one slope coefficient, Q\ = —1, but no intercept. Since the final R matrix estimates C o r ( y | X ) in our assumed scenario (see Section 3.4), its intraclass correlation coefficient should estimate that of Cor(Y \X) which is zero. Intuitively, the larger the sample size K (= number of replicates per dataset), the better the response correlation matrix is estimated. We found that with K ranging from 10 to 800, the estimated intraclass correlation coefficient (in the final R) produced by gee was always close to zero. The simulation results are shown in Table 4.15. Boxplots of the estimates are shown in Figure 4.9. The second column of Table 4.15 shows the number of regressions in the simula-tion experiment. As K increases, the computational time for the simulation experiment Chapter 4. Simulations of Multiple Regression of Count Data — Part One 109 Intraclass Correlation Coefficient Estimate 50 100 200 K = number of replicates 500 800 Figure 4.9: Boxplots of estimates of intraclass correlation coefficient (in R) with various sample sizes, K. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 110 increases.10 This explains our choice of small n's for the larger K's. The sample median of the estimates may be more appropriate than the sample mean for assessing how well the intraclass correlation coefficient is estimated in the cases with small n's. In fact, the boxplots in Figure 4.9 show asymmetric empirical distributions of the estimates with [ most K values, suggesting the comparisons of the sample medians instead of means. From Table 4.15, we see that the median of the estimates appears to approach closer to zero (the true intraclass correlation) as K increases, except for K = 200 and K = 800. The exceptions could be due to chance, but the real reason is not obvious since n was small for both cases. However, the variability (as reflected by the sample S.D.) of the estimate seems to decrease steadily with an increasing K. Figure 4.9 also shows that as K increases, the sample median of the estimates seems to stabilize around 0, and the estimates become less variable. Although the intraclass correlation coefficient estimate seems to become better (smaller bias and variability) as K increases, we are quite satisfied with the G E E approach in that with such a short correlated series (r = 5), even a small sample size of 10 seems to yield a reasonable estimate of C o r ( Y | X ) . 4.3.5 Concluding Remarks We have thus far examined the effect of misspecifying response correlation, when the data are independent, on parameter estimation using the G E E approach. Another type of misspecification of response correlation is using the independence model to estimate regression parameters, when the data are in fact correlated. In the next chapter, we will investigate the relative efficiency of the independence and dependence models in such a setting. This time, we will specify the correct correlation structure in the dependence model when running gee. 1 0 . . . with our implementation of the simulation experiments in S-Plus. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 111 4.4 C H A P T E R A P P E N D I X 4.4.1 Fitting an intercept (30 in the G L M As in Chapter 2, when fitting an intercept in the model, we have to add a column of l's to the covariate matrix X. Let us refer to this column as X_0. Then fi = E (X ) also has a column of l 's, denoted by /*.0. The distribution in (4.1) remains unchanged, except that j only runs from 1 to r (i.e. j ^ 0). The coefficient vector (3 now becomes f3 = {0o, 0\i • • •, 0g)T• Now, the row vectors X{, and i = 1,2,..., r, all include an extra first element of 1. Nevertheless, the conditional distribution in (4.2) is unchanged with the new definitions of X, fi, and f3. However, now Xj(3 = 0O + J29j=i 0jX{j, and hence, Cov{Xff3, Xff3) =aik (/3_ 0) T/3- 0 for all i and k from 1 to r, where /3_ 0 = {0i, 02, • • •, 0g)T'• Note that Xiodo = 0o is non-random, and it does not have a variance. Thus, the distribution in (4.3) should now be written as X(3 ~ MVN(*v, W) where v = fx/3 (fi with column of l's and (3 with 0O) and W = {{3-0)T(3.0E. Equations (4.4) to (4.7) basically remain the same, except that is = (j,(3 is to be computed using the new fi (with the column of l 's), the new (3 (with 0O), and W = (J3-0)Tj3-0E. The rest of the equations in the chapter remain unchanged. 4.4.2 Examples of calculating (marginal) Cor(Y) with a specified £ First, assume that we do not fit an intercept in the G L M . Let the r x g covariate matrix X have expected value fi, and let each of its column have covariance £ . Chapter 4. Simulations of Multiple Regression of Count Data — Part One 112 Example 1 Suppose r = 5 and g = 1, and A* = M i 35 43 26 and 27 V 1 0.9 0.9 0.9 0.9 0.9 1 0.9 0.9 0.9 0.9 0.9 1 0.9 0.9 0.9 0.9 0.9 1 0.9 0.9 0.9 0.9 0.9 1 J That is, the covariate vector X = X L has an exchangeable correlation structure, with an intraclass correlation coefficient of 0.9. Also, let (3 = (0i) = (0.1). Substituting these values into equations (4.4) and (4.7), we have / - i n o r n AO C\ O A C\ OI \ E(Y) ( 149.16 ^ 33.28 74.07 13.53 24.66 V and Cor(Y) V 1 0.35 0.46 0.24 0.31 0.35 1 0.29 0.16 0.20 0.46 0.29 1 0.20 0.26 0.24 0.16 0.20 1 0.14 0.31 0.20 0.26 0.14 1 Here, Cor(Y) does not have a standard structure, but the structure is roughly exchange-able with an approximate intraclass correlation coefficient of 0.2 to 0.3. As an entry in Table 4.8, this approximate coefficient is denoted by py to indicate how badly the re-sponse correlation is misspecified through the dependence model with corstr="f ixed" as this marginal Cor(Y) . In the next example, we will fit an intercept S0 in the G L M . With an extra column of l 's, the matrices X and fi now have dimension r x (g + 1). (3 is now a (g + l)-vector. Chapter 4. Simulations of Multiple Regression of Count Data — Part One 113 Example 2 Suppose r = 3 and g = 1, and A* V 1 10 1 8 1 25 and £ 1 0.9 0.81 0.9 1 0.9 0.81 0.9 1 That is, the covariate vector X.x has an AR(1) correlation structure, with a lag 1 corre-lation of 0.9. Now, let (3 = (0O, Pi)T = (5, -0 .15) T . Substituting these values into the modified 1 1 equations (4.4) and (4.7), we have / i n Am n i /i c \ ( 33.49 ^ E ( Y ) 45.21 3.53 and Cor(F) 1 0.421 0.145 0.421 1 0.175 \ 0.145 0.175 1 ) We see that the response vector Y has a correlation structure that is approximately AR(1) with an approximate lag 1 correlation of 0.4. As an entry in Table 4.8, this approximate lag 1 correlation is denoted by py to indicate how badly the response correlation is mis-specified through the dependence model with corstr="f ixed" as this marginal Cor(Y). n See Section 4.4.1. Chapter 5 Simulations of Multiple Regression of Count Data — Part Two In this part of our simulations, we generate Poisson responses that are correlated (con-ditional on the covariates). We apply the G E E approach to the same correlated Poisson data for estimating the regression parameters in each of the independence and dependence models. When the correct response correlation structure is specified for the dependence model in running G E E , we expect it to be more efficient in parameter estimation than the independence model (which assumes the data to be uncorrelated). However, generating correlated Poisson responses with covariates is a difficult task. Instead, we use a G L M which only has an intercept, with no covariates. In other words, the covariate matrix is simply a column of l 's, and the regression coefficient vector consists of only one element — the intercept value (30. As explained in the following section, the resulting Poisson counts have an exchangeable correlation structure. 5.1 Generating Correlated Poisson Responses As before, we assume that there are r regions. From each region, we sample/generate K independent replicates. Each time we sample, i.e. for each fixed s for s = 1,2,..., K, we would like the r observations Y / s \ Y ^ , • • •, Yr^ to be spatially correlated. Now, let us fix an s. We generate {Wia\ ..., W^} and Z^s\ independently, according to Wla)~ Poisson (Aj) 114 Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 115 and Z{s) ~ Poisson(A2) for a l H = 1,2,... ,r, for some positive Ai and A 2 . We then let the response in region i to be y / s ) = w}a) + Z(S\ and hence F / s ) - Poisson(A1 + A 2 ) (5.1) E ( F / S ) ) = Var(y/ S ) ) = Ax + A 2 (5.2) C O V ( Y / 4 ) , n ( s )) = C o v ( w i ( a ) + z < s \ iy f c ( s ) + z ^ ) = Var(Z ( s>) = A 2 for all i + k (5.3) C o r ( ^ , ^ ) . V /Var(y/ s ))Var(n ( s )) A 2 = T — — for all i / (5.4) Ai + A 2 ^ • v ' Let A = Ai + A 2 . Hence, these responses Y\,... ,Yr are exchangeably correlated, with intraclass correlation coefficient p = A 2 / A . Let us now put the above into the G L M format. The only observations here are the r-dimensional response vector, Y^ — (Y}s\ . . . , Yr^)T. In other words, we do not include any covariates in the G L M (a regression model). How-ever, an intercept, BQ, does exist in the regression model. In other words, the G L M is G [E \X\S))\ = X\s)30 where X^ = 1 for all i = 1,..., r for some link function Q. In vector notation, the covariate (design) matrix X^ is thus simply a vector of l 's. The G L M in vector notation then becomes Q [E ( Y ( s ) | X ^ ) ] = X^30 where = 1. (5.5) Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 116 The goal is to estimate the unknown regression parameter, 0o, in the above G L M . The use of the G E E approach in estimating do allows the incorporation of the correlation among Y/ S^'s (for each s). 5.1.1 The Link Function To be consistent with the G E E notation as before, we define rj^ = X^0o- In other words, 77^' is simply the vector of 0o's. As explained before, the link function Q in a G L M maps E ( Y ^ | A , ^ ) to r^f1 for all i = 1,... , r. A n easy parametrization obtains by letting \ = \ 1 + \ 2 = 0O so that Q is the identity link, since Q [E ( Y / W ) ] = 0(A) = W ) = A ) = ^ for all i. In other words, this G L M is a true linear regression model because of the linear response-intercept relationship. However, there is a difference between this G L M and the linear model in Chapter 2. Here, the individual responses have a Poisson distribution, whereas before, they were normally distributed. After recognizing the link function, we may proceed to the actual model fitting for the simulated data. 5.2 Model Fitting using the G E E approach Once again, we fit two different GLM's to these simulated data, namely, the independence and the dependence models. Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 117 In fitting the independence model, we ignore the response correlation. We use the S-Plus function gee to fit this model, by specifying (5.5) as the linear model, Poisson as the distribution of [F/^IXJ^], 1 the identity link function, and the independence response correlation structure, gee then produces an estimate of Bo and the naive and robust estimates of its S.E. We may also use the same gee function for incorporating the response correlation structure in parameter estimation. Recall that the G E E approach allows the estima-tion of the true response correlation matrix, C o r ( F | X ) , 2 provided that the structure of C o r ( F | X ) is correctly specified. To use the S-Plus gee function in such a case, we specify the function arguments as in the independence model, except for some non-independence correlation structure for [F/^|Xf s ^] instead of the independence structure. Since the re-sponses we generate are exchangeably correlated, we may specify "exchangeable" for the c o r s t r function argument, gee would then use an initial working correlation matrix, R, which has the exchangeable correlation structure, in the estimation process. Thus, in addition to estimating the regression parameter B0 and the naive and robust S.E.'s of its estimate, gee also updates the intraclass correlation coefficient in R throughout the estimating process. The final R would then be the estimate of C o r ( F | X ) (= Cor(F) here). In other words, the non-diagonal value in the final R would be an estimate of p. Before fitting the independence and the dependence models, we notice that with an identity link in the G L M (5.5), and no true covariates (except the 1 vector as the design matrix), the G E E estimate of 0O is simply the sample mean of the responses. This is true for both the independence and dependence models, and will be proved in the proof of Theorem 4 in Section 5.4. In fact, we can see that the sample mean is the best linear unbiased estimator (BLUE) of the intercept for either model (as in ordinary and 1The function ge in S-Plus asks for the response distribution as a means of specifying the relationship between the mean and the variance of each response. 2See Section 3.3 page 78. Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 118 generalized least squares estimation in Chapter 2) due to the linearity of the response-intercept relationship and the exchangeability of the responses. Denoting BQ and BQ as the Bo estimates produced by the independence and dependence models respectively, we have, M = ffi = Y = ± 1 t £ Y W . (5.6) n r s=l i = l However, the naive and robust S.E. estimates for BQ and BQ may be different, since their variance formulae differ. In particular, for the independence model, R = I (throughout the G E E iterations) in equation (3.19); hence, = from equation (3.19)3 and thus in (3.21) and (3.22). For the dependence model, R has an exchangeable correlation structure. Thus, the R matrix in the final iteration of G E E has some p as its non-diagonal value, possibly non-zero. This is because we expect p m p = \2/A ^ 0. In the case of p / 0, in equation (3.19) (computed using the final R matrix) is no longer equal to A^S\ This will possibly produce a different value in each of equations (3.21) and (3.22) from the one for the independence model (when = A^). In other words, the naive and robust Var(Bo) estimates are possibly different from the naive and robust V a r ( ^ ) estimates. Since the G E E approach allows the consideration of response correlation when fitting the model, our first thought was that the two variance estimates for 0Q (the dependence model (correct) parameter estimate) would be no bigger than their independence model (incorrect) counterparts. However, this is not exactly what we saw in our simulations, as explained in the next section. 3The scale parameter <f> in our Poisson model is simply 1 (assumed known when using gee). Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 119 5.3 The Simulations In our first simulation, we tried to fit n = 100 G L M regressions. For each regression, we generated K = 20 independent samples from each of r = 5 regions according to Section 5.1. We let Ai = 0.05 and A 2 = 0.5, so that p = 0.5/0.55 « 0.91 would give very highly correlated Poisson responses. The two A's were deliberately chosen to have such minute size because we would like the generated responses to resemble hospital admission counts (for rare diseases) — in other words, very small. However, we overlooked the possibility of non-convergence of the G E E estimation process due to high number of zero counts. This was in fact what happened in fitting the dependence model in some of the regressions. The G E E analysis converged for the independence model despite the zero counts because BQ could be readily computed as the sample mean, and no other unknown parameter needed to be estimated. On the other hand, for the dependence model, the estimation of p might become difficult with too many zero observations. Despite the non-convergence of some of the regressions, we did observe some very interesting results. Firstly, as we expected, BQ and 8® were the exact same value for all the regressions that converged. However, contrary to what we believed, the naive Var(/?o) estimate was always smaller than the naive V a r ( ^ ) estimate! Another curious observation is that the robust variance estimate of B0 was exactly the same for both the independence and the dependence regression models. We can summarize the observations of these simulations as follows: M = Po (5-7) V a ^ o 7 ) r o b u s t = V a ^ ) r o b u s t (5.8) Va^oOnaive < V a r ^ ) n a i v e . (5.9) Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 120 To double-check that the result in (5.8) and (5.9) were not by chance, we ran a second set of simulations. However, this time, we used a much larger X\ and A 2 to avoid the non-convergence of the G E E method. In this simulation, we used the same n, r, and K as before, but Ai = 5 and A 2 = 50. Thus, A = 5 + 50 = 55, and p = 50/55 « 0.91 is the same large intraclass correlation coefficient as in the first simulation. In this simulation, the G E E analysis successfully converged for both the indepen-dence and dependence models for each of the 100 regressions. From the results here, we confirmed that (5.8) and (5.9) were indeed not due to chance. The result in equation (5.8) actually seems familiar to us. Recall that in the linear regression setting from Chapter 2, both the dependence and the independence models produce the exact same regression parameter estimate (and thus the same covariance matrix for the estimate) if the responses are exchangeably correlated, provided that an intercept is fitted in the model. However, the naive covariance for the independence model estimate is never smaller than that for the dependence model estimate in linear Gaussian regression. Why, then, is (5.9) true? That is, why is the naive variance estimate always smaller in the independence model than the one in the dependence model for a linear Poisson regression? Note that the word naive in both Chapter 2 and this chapter has the same meaning. In Chapter 2, the naive covariance is the one calculated believing that the responses were truly independent. Here, we are naive in calculating the variance estimate for 30 if we naively take the working covariance matrix V as the true covariance matrix of the responses. In other words, we believe that V is the true response covariance matrix, and thus the covariance estimate for 80 obtained according to such a belief is naive. In the next section, we present the theoretical reasoning for the results in (5.6) to (5.9). We also further compare the notion of naivete in the contexts of Gaussian and Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 121 Poisson regression. 5.4 The Naive and Robust Variances for /3Q and We now present (5.6) to (5.9), with a slight variation to (5.9), as the next theorem, with a proof following it. Theorem 4 Suppose we have responses Y}s\ ... ,Yj-s) sampled from a Poisson distribu-tion with positive parameter A = Ai + A 2 , such that they follow equations (5.1) to (5-4)-In particular, for each s, Y^ 's are exchangeably correlated with intraclass correlation coefficient p = A 2 / A . Note that A 2 > 0, and hence p > 0. Now, suppose we fit a Poisson regression model (a GLM) according to (5.5), re-spectively assuming the responses to be (1) uncorrelated (the independence model) and (2) exchangeably correlated (the dependence model). We use the identity link in such a GLM. For the dependence model, we specify in GEE an exchangeable response working correlation structure without specifying the value of the intraclass correlation coefficient. Then, equations (5.6) to (5.8) are true. Furthermore, V a r ( ^ ) n a i v e ^ V a r ( ^ 0 D ) n a i v e in general. To prove this theorem, we first need to "translate" our notations into the G E E nota-tions. In particular, we need to express equations (3.17) to (3.22) in terms of A i , A 2 , and P-Let us fix s. Since Y ^ ' s are K i.i.d. vectors, the following is true for all s = 1,..., K. Recall that the design matrix is simply the r-vector of l 's. Denote such a vector by 1. Also, we let r)^ = A = 0O for all i so that we have the identity link. Hence, the conditional density for each Y^ given X^ is Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 122 = exp{yj s ) log 7yJ s ) - rj^ - log = e x p ^ ' l o g A - A - l o g y ^ ! } . (5.10) Comparing (5.10) to (3.17), we have 6\s) = logr/J s ) = log A and a(0| s )) = exp{0J s )} - A. Also, E ( y . M | X j ' > ) = a'(6V) = exp{0\s)} = \ Var(y/S )|x| s )) = a"(0Js)) = A. Thus, A « = diag { « ' ( » " ) } = dia,g(A,..., A) = ^ £ > ( s ) = A( S )^( S )X< S ) - J l = 1 (5.11) y(*) = [AW]tjR[AW]i = A i l (5.12) = j - [ y ( « ) ] - i = l i l " 1 (5.13) where J i is the working correlation matrix that G E E uses in its estimation process. Since we specify an exchangeable response correlation structure for the dependence model, R has all non-diagonal values equal to some a. In each iteration of the estimation process, the values of both a and B0 are updated. Upon convergence, the final value of a is the estimate of p. That is, p = a. Also, upon convergence of the G E E estimation process, equations (3.21) and (3.22) are used to estimate Var(/?g) and Var( (50 D) (using the final values of B0 and a). Note that in the G E E estimation process, we do not expect the matrix to be computed based on the data. This is because when we specify the Poisson distribution of the responses, and the identity link function, with a design matrix which is simply Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 123 the 1 vector, the matrix product A^A^ = I is readily obtained, without the need to estimate any unknown parameter in this product. However, the computation of A^ and A^ (and hence, of V^) individually should be data based, because they both involve the unknown parameter A. By the identity link, we have A = 0O, and hence, A = 0O. Finally, as mentioned on page 82, the covariance matrix C o v ( Y ( s ) | X ^ ) in equation (3.21) is estimated by S^S^T in the G E E estimation process, where = - a'(6^). Now, we may substitute equations (5.11) to (5.13) into equations (3.20) to (3.22) to prove Theorem 4. P r o o f of Theorem 4 First, let us prove equation (5.6). For either the independence or dependence model, substituting equations (5.11) to (5.13) into estimating equation (3.20) gives 0 = ] £ l T - J R - 1 [Y^ - Al] s=i A (since a'(0\ s^) = A for all i for all s) K <=> lTR-'^Y^ = 0QK1TR-'1 (5.14) S=l (since A = 0O by the identity link). For the independence model, R = H _ 1 = J , so that equation (5.14) becomes K J2lTY(s) = 0oKlTl (5.15) s=l s=l t = l ^ 0o = j-r£±Y}s)=Y. IX ' s=l i=l Hence, /?o = Y. Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 124 Now, recall from Chapter 2 that for the dependence model, we can write R(ct) = (1 — o j ) / + Q ! l l T . (Again, a is the estimate of p, the intraclass correlation coefficient of the exchangeably correlated responses.) Also, from Lemma 1 and the proof of Theorem 2, we know that R 1 = T ^ ii ~ r ^ i i r ] - (5-16) Thus, we have lTR1 = 1J 1 - a I-1 + (r - l)a 11J 1 1 - a 1 a 1 + (r - l)a I-a. rl1 I-a [1 + (r - l)a 1 1 T. 1 + (r - l ) a Substituting equation (5.17) into estimating equation (5.14), we have 1 (5.17) 1 K TlTY( 1 + (r - \)ai £ i r y w = B0K1T1 1J1 K s=l which is the same as estimating equation (5.15) whose solution is B0 = Y. In other words, BQ — Y = B-. This proves equation (5.6) (and thus (5.7)). Next, we will prove equation (5.8). For either the independence or dependence model, substituting equations (5.11) to (5.13) into equation (3.21) gives Var(ft) robust S=l K - j r 1 A 1> A E i 5 i A 1 -R-1 A Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 125 ^ E f = 1 i T f l - 1 s W s w r f l - i i l T R i (Efl, S^SW) R - l K2(iTR-*iy ' For the independence model, R = I, and equation (5.18) becomes 1 T (EL S^SW) 1 i<2(iTiy (5.18) V a r ( ^ ) r o b u s t K2r2 For the dependence model, first notice that R'1 = (R~1)T because R (and thus R'1) is symmetric. Thus, by equation (5.17), we have R-l = ( 1 T J R - 1 ) T 1 -1. (5.19) 1 + (r - l)ct Now, we substitute equations (5.17) and (5.19) into equation (5.18) and have Vxv(RD\ _ Z " 5 = 1 [l + ( r - l ) a ] a V a r l P o Jrobust -x2? = V a r ( ^ ) r o b u s t . This proves equation (5.8). Now, to prove that Var(/?o) n a i v e / ^ a r ( ^ 6 D ) n a i v e ' w e substitute equations (5.11) to (5.13) into equation (3.22). Then, for either the dependence or independence model, 1 Var( /? 0 ) n a h KlTRr*l (since A = (30 by the identity link). (5.20) Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 126 First, for the dependence model, the working correlation matrix R has inverse as in equation (5.16). Thus, we can substitute equation (5.17) into equation (5.20) and obtain By equation (5.7), we may denote both B® and BQ by a common B0. The above equation then becomes V « 7 j t " ) ^ = ! ( i ± ! ^ ) . (5.21) Again, for the independence model, R = R'1 = I. Thus, equation (5.20) simply becomes V a ^ 0 7 ) r K1T1 A Kr 1 + (r - l ) a ' Therefore, in general, unless a = 0, we will have Var ($J ) n a i v e ^ Var( / 5 0 9 ) n a i v e . Q.E.D. Notice that, in our simulations, we generated the Poisson responses with an extremely high intraclass correlation p (> 0.9). Since a estimates p, we naturally observed all positive a values in our simulations. In fact, in both sets of simulations, a turned out to be an extremely good estimate of p (a between 0.8 and 1) in all of the n observations which converged. This large positive a explains the result in (5.9). In principal, a can take on negative values. In particular, if we generated the responses with a near-zero (or zero) p, then it is possible to have negative a values. We can easily see that V a ^ ) n a i v e > V a ^ ) n a i v e if a < - l / ( r - 1). Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 127 Also, notice that 80 = Y is the M L E of 0Q when assuming response independence. Recall from Chapter 2 the notion of the naive statistician who blindly believes that the responses are truly independent. Thus, the theoretical variance of 0Q that he computes is Var(&) = Var(F) = A = A . The estimate of this variance is naturally computed by replacing 0Q with 0^. It, then, is indeed the naive variance estimate, Var(/?o) n a i v e . From this set up of a G L M , it is clear that the two notions of naivete — that of the naive statistician and that of the naive variance estimate — coincide. 5.4.1 More on the simulation results Let us now return to the question of why the naive variance estimate is always smaller than the robust variance estimate in our simulations. Such a phenomenon creates an illusion that Naive, using his/her naive variance estimate, has better efficiency in his/her inference than that of Enlightened who bases it on his/her naive variance estimate. However, a closer look reveals that the former estimate is simply the M L E variance estimate in the case of (assumed) independent responses coming from a single population, forming one big i.i.d. sample (of size Kr). Naturally, such a variance estimate must be smaller than the one computed assuming the r responses per replicate to be (positively) correlated, and hence more between-group response variability among the K replicates. As for the robust variance estimates, we have the results from Theorem 4 for both the independence and dependence models. Such variance estimates are derived from the asymptotic distribution of 0O presented in Theorem 3. For convenience, instead of determining the true variance of 0Q, we use the sample S.D. of the n / V s to estimate the true SE(/?o). With enough datasets, n, we may assume this sample S.D. to be reasonably Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 128 sample mean of 50 fa S E (A))sample S E ( A D r o b u s t ( = ftD's) = S E ( ^ ) g a m p l e S E ( A ? ) n a i v e = SE(^ 0 D) r obust 55 59.90 1.69 1.66 1.53 0.77 Table 5.16: S.E. estimate comparison based on the first 50 simulated datasets. close to the true S.E. On the other hand, our goal is only to obtain some idea of the relative variabilities of the intercept estimates. Thus, in this investigation, we decide not to use large scale simulation experiments although they would allow better understanding of the relative variabilities of the estimates. Let us revisit our latter simulation where the G E E method converged for all its n = 100 datasets. Our results, based on the first 50 datasets in this simulation,4 are summarized in Table 5.16. For the independence model, S E ( / 5 0 r ) s a m p l e denotes the square root of the sample variance of the 50 0QS. SE(/3o) n a i v e and SE(/?o) r o b u g t respectively denote the sample means of the 50 naive and the 50 robust estimates of SE ( / ?Q ) . We denote those S.E. estimates for 0Q (using the dependence model) in a similar way. Of course, S E ^ ^ i e = S E ^ L m p , e by equation (5.7), and S E l ^ L b u s t = S E ^ U u s t by equation (5.8). The results in Table 5.16 show that Y, the sample mean of the responses, is a reasonbly good estimate of the true 0O. Using SE(/%) g a m p l e (= S E ( ( L ? ^ , ) s a m p l e ) as the "gold standard" (g.d.) of the true SE(ft) for both models, S E ( ^ 0 D ) n a i v e is thus closest to the g.d. among the naive and robust estimates. It is followed by S E ( / ^ ) r o b u s t = S E ( ^ 0 D ) r o b u s t , and S E ( / ^ ) n a i v e is the farthest away from the g.d. In fact, SE(/3o) n a i v e is the only S.E. estimate which gives a 95% confidence interval that does not contain the true 0Q. This is an indication that, in 4The way we recorded the simulation output made it difficult to base our results here on all 100 datasets. Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 129 general,5 the naive variance estimate of the independence model is a poor estimate which substantially underestimates the true Var(/?0). The underestimation can be explained by the between-group variability among the K replicates being ignored in the computation of this variance estimate. In other words, if response correlation is ignored when fitting the G L M , then inference based on the naive variance estimate may be far off from the truth. Fortunately, inference done by ignoring response correlation but using the robust variance estimate does not affect estimation efficiency, in the sense that this variance estimate is the same as the robust variance estimate considering response correlation. Of course, the advantage of the dependence model over the independence model is the possibility of estimation of the parameters in the response correlation structure. 5.5 Concluding Remarks The above results of Theorem 4 and the size comparison of the different S.E. estimates of 0O are all based on the particular G L M we are dealing with here. Therefore, we may not generalize our observations to cover such conditions as GLM's with covariates and/or non-linear link functions. Simulations of non-linear Poisson regressions may be difficult, mainly due to the complexity of generating correlated (conditional on the covariates) Poisson count data with non-linear link to the covariates.6 However, based on our results, we recommend the use of the robust variance estimates, regardless of the assumption of correlated or independent responses, for such non-linear regressions of real data. This is because we have seen from simulation results here and in Chapter 4 that the data analyses based on the robust variance estimates for both the independence and dependence models are generally very similar. We have also seen that in our set up of a G L M in this 5 . . . for the conditions as in Theorem 4. 6 A method for this task is developed in the Ph.D. thesis by Jianmeng Xu (1996). Chapter 5. Simulations of Multiple Regression of Count Data — Part Two 130 chapter, the naive variance estimate of the dependence model estimates the true variance overwhelmingly better than that of the independence model. However, assuming the dependence model allows the estimation of the true response correlation matrix if its structure is known or can be approximated. In this case, one should base inference on the dependence model instead of the independence model. An extra remark is that the G E E method seems to estimate reasonably well all the model parameters, i.e. the regression parameters, and those in the response correlation structure, even with relatively small r and K (5 and 20 in our simulations). Chapter 6 Discussion In this thesis, we have presented some results on the effect of misspecified response corre-lation in regression analysis. We focussed on regression analyses applied to health impact studies, in which the responses are often obtained from neighboring geographical sites and thus are spatially correlated. (See Chapter 1.) In Chapter 2, we made use of the sto-ryline of the naive and enlightened statisticians to illustrate the severe loss of parameter estimation efficiency that might result from ignoring response correlation when fitting a linear response-covariate relationship with Gaussian responses. In particular, both the-ory and our simulation results indicate that Naive's regression analysis is almost always1 less powerful/efficient than Enlightened's. Such and other forms of response correlation misspecification in GLM's were examined in Chapters 4 and 5. The GLM's were fitted via the G E E approach, which allows the incorporation of response correlation in the parameter estimation process. Our results suggest reasonable robustness2 of the G E E approach to misspecification of response correlation with a small number (about 20) of independent replicates/clusters in the sample if the analysis is based on the robust vari-ance estimates. In other words, the G L M analyses based on such variance estimates using the correct and incorrect response correlations respectively both have similar efficiencies. The analyses based on the naive variance estimates are similarly efficient if the response correlation structure is correctly specified. Otherwise, the true variance estimates may 1 Theoretically, Naive's analysis is never more powerful. However, it can be more powerful empirically due to chance. 2 . . . with those GLM's that we investigated. 131 Chapter 6. Discussion 132 be considerably underestimated, thus falsely revealing high efficiencies of the analyses. Moreover, when the response correlation structure is correctly specified, the parameters in the correlation matrix are relatively well estimated3 even with very few replicates (10 to 20) and correlated measurements per series4 (about 5). However, we have not examined the G L M setting in which the clusters are not inde-pendent. For example, if the replicates are repeated measurements obtained over a short period of time, they may be temporally correlated. Thus, both spatial and temporal correlations may exist in the responses: the measurements within a replicate may be spatially correlated, and the replicates may be temporally correlated. Studying the effect of ignoring correlation among replicates in a G L M analysis may be very complex although important to health impact investigations. The complexity lies in the incorporation of such correlation into the G L M and hence the parameter estimation process. In fact, the G E E approach only allows incorporation of correlation along a series, but strictly assumes the replicates to be independent. This assumption is often met in longitudinal data analyses in which the measurements are correlated temporally, but the measurement replicates are obtained from separate individuals and thus independent. On the other hand, the assumption of independent replicates may fail in some health impact studies for reasons mentioned earlier. Unfortunately, an alternative method to the G E E approach which accounts for response correlation along both spatial and temporal "axes" in parameter estimation is apparently not available. 3 . . . provided that the GEE estimation process converges. Non-convergence of the process may result due to insufficient information in the data (e.g. too many unknown parameters or too many zero counts in a Poisson regression). 4See page 110. Bibl iography [1] Aitchison, J . , Ho, C. H . (1989). The multivariate Poisson-log normal distribution. Biometrika 76, 643-653. [2] Anderson, Theodore W. (1971). The Statistical Analysis of Time Series. John Wiley & Sons. [3] Basu, Sabyasachi, Reinsel, Gregory C. (1994). Regression Models With Spatially Correlated Errors. American Statistical Association Journal 89, 88-99. [4] Cressie, Noel A . C. (1991). Statistics For Spatial Data. John Wiley & Sons. [5] Fitzmaurice, Garrett M . (1995). A Caveat Concerning Independence Estimating Equations With Multivariate Binary Data. Biometrics 51, 309-317. [6] Freedman, D., Pisani, R., Purves, R., Adhikari, A . (1991). Statistics, 2nd Ed. W. W. Norton & Co. [7] Graybill, Franklin A . (1983). Matrices with Applications in Statistics, 2nd Ed. Wadsworth. [8] Herzberg, Agnes M . (1988). Some further results for the equivalence of ordinary least squares and weighted least squares estimators: an advantage for rotable designs. SIAM Institute for Mathematics and Society Technical Report #116. [9] Johnson, Richard A . , Wichern, Dean W. (1982). Applied Multivariate Statistical Analysis. Prentice Hall. [10] Kotz, Samuel, Johnson, Norman L . , Read, Campbell B . (1985). Encyclopedia of Statistical Sciences. John Wiley & Sons. [11] Liang, Kung-Yee, Zeger, Scott L. (1986). Longitudinal data analysis using general-ized linear models. Biometrika 73, 13-22. [12] McElroy, F. W. (1967). A necessary and sufficient condition that ordinary least squares estimators be best linear unbiased. American Statistical Association Jour-nal 62, 1302-1304. [13] Marascuilo, Leonard A . , Levin, Joel R. (1983). Multivariate Statistics in the Social Sciences. Brooks/Cole. 133 Bibliography 134 [14] Martin, R. J . (1982). Some aspects of experimental design and analysis when errors are correlated. Biometrika 69, 597-612. [15] Mosteller, Frederick, Tukey, John W. (1977). Data Analysis and Regression. Addison-Wesley. [16] Plackett, Roger L. (1960). Principles of Regression Analysis. Oxford @ The Claren-don Press. [17] Rao, C. Radhakrishna (1965). Least squares theory using an estimated dispersion matrix and its application to measurement of signals. Fifth Berkeley Symposium 1, 355-372. [18] Rosychuk, Rhonda J . (1994). Cancer cluster detection in British Columbia school districts 1983-1989. Master of Science Thesis. University of British Columbia. [19] Seber, George A . F. (1977). Linear Regression Analysis. John Wiley & Sons. [20] Weisberg, Sanford (1985). Applied Linear Regression, 2nd Ed. John Wiley &; Sons. [21] Zeger, Scott L. (1988). A regression model for time series of counts. Biometrika 75, 621-629. [22] Zeger, Scott L . , Liang, Kung-Yee (1986). Longitudinal Data Analysis for Discrete and Continuous Outcomes. Biometrics 42, 121-130. [23] Zyskind, George (1967). On canonical forms, non-negative covariance matrices and best and simple least squares linear estimators in linear models. Annals of Mathe-matical Statistics 38, 1092-1109.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Effect of misspecified response correlation in regression...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Effect of misspecified response correlation in regression analysis Chui, Grace Shung-Lai 1996
pdf
Page Metadata
Item Metadata
Title | Effect of misspecified response correlation in regression analysis |
Creator |
Chui, Grace Shung-Lai |
Date Issued | 1996 |
Description | One can imagine a possible loss of parameter estimation efficiency when response correlation is ignored or misspecified in modeling a response-covariate relationship. Under what conditions is efficiency lost? How much is lost? Whether the responses are correlated or independent, standard theory for the distribution of least squares parameter estimates in linear models (Gaussian responses) can be readily determined. We find that the linear regression analysis assuming independent responses is (theoretically) never more efficient than that incorporating response dependence. The "difference" in efficiencies between these two analyses — measured by how much more readily the latter detects a non-zero regression coefficient — generally increases as the coefficient-to-noise ratio increases. To incorporate response correlation in G LM parameter estimation, Liang &; Zeger (1986) extended the quasi-likelihood theory and developed the generalized estimating equations (GEE) approach. Despite being a popular method, the effects of misspecifying response correlation (e.g. assuming independence when responses are correlated) on parameter estimation efficiency using GEE are not obvious. To investigate such effects, we use simulation studies in which we generate count data and use the GEE approach to estimate the model parameters, using both the correct and misspecified correlation structures. The generated counts, the number of correlated responses in each cluster/replicate, and the total number of replicates are all small to imitate health impact studies (in which hospital admission counts are often the responses). Despite possible loss of parameter estimation efficiency due to such "obstacles" intrinsic in the model, simulation results indicate that the GEE approach produces' 1. regression parameter estimates with relatively small empirical biases using either a correct or misspecified response correlation; 2. a good estimate of the response correlation matrix if its structure is correctly specified; 3. naive and robust variance estimates both of which estimate the true variance well when the response correlation structure is correctly specified; and 4. good robust variance estimates even when the response correlation is misspecified. Furthermore, in a G LM with exchangeably correlated Poisson data and no covariates, specifying independence or exchangeable dependence yields the same intercept estimate and estimation efficiency, provided that inference is based on the robust variance estimate. The naive variance estimate can significantly underestimate the true variance if the responses are assumed independent when analyzing such a GLM. |
Extent | 6024464 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-02-12 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0086986 |
URI | http://hdl.handle.net/2429/4521 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1996-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1996-0352.pdf [ 5.75MB ]
- Metadata
- JSON: 831-1.0086986.json
- JSON-LD: 831-1.0086986-ld.json
- RDF/XML (Pretty): 831-1.0086986-rdf.xml
- RDF/JSON: 831-1.0086986-rdf.json
- Turtle: 831-1.0086986-turtle.txt
- N-Triples: 831-1.0086986-rdf-ntriples.txt
- Original Record: 831-1.0086986-source.json
- Full Text
- 831-1.0086986-fulltext.txt
- Citation
- 831-1.0086986.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0086986/manifest