MODELING ZERO INFLATED COUNT DATA

by

CHERYL ELLEN GARDEN

B.Comm., University of British Columbia, 1992

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE STUDIES
(Department of Statistics)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 1996
© Cheryl Ellen Garden, 1996

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Statistics
The University of British Columbia
Vancouver, Canada
Date: April 25, 1996

Abstract

A natural approach to analyzing the effect of covariates on a count response variable is to use a Poisson regression model. A complication is that the counts are often more variable than can be explained by a Poisson model. This problem, referred to as overdispersion, has received a great deal of attention in recent literature, and a number of variations on the Poisson regression model have been developed. As a result, statistical consultants are faced with the difficult task of identifying which of these alternative models is best suited to their particular application. In this thesis, two applications where the data exhibit overdispersion are investigated. In the first application, two treatments for chronic urinary tract infections are compared. The response variable represents the number of resistant strains of bacteria cultured from rectal swabs.
In the second application, the number of units sold of a product is modeled as depending on two factors representing the day of the week and the store. Two alternative models that allow for overdispersion are used in both applications. The negative binomial regression model and the zero inflated Poisson regression model, so named by Lambert (Lambert, 1992), provide improved fits. Further, the zero inflated Poisson regression model performs particularly well when the overdispersion is suspected to be due to a large number of zeroes occurring in the data. The zero inflated Poisson regression model allows one both to fit the data well and to make some inference regarding the nature of the overdispersion present. This little known model may prove to be valuable, as there exist a number of applications where observed overdispersion in a count response variable is clearly due to an inflated number of zeroes.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgement
1 Introduction
2 Motivating Application and Initial Data Analyses
  2.1 General Approach
    2.1.1 Testing for Overdispersion
    2.1.2 Testing for Zero Inflation
  2.2 Application 1: A Clinical Trial Comparing Two Treatments for Chronic Urinary Tract Infections
    2.2.1 Description of Data Set
    2.2.2 Exploratory Analysis
    2.2.3 Preliminary Models and Fits
  2.3 Application 2: Sales in a Retail Environment
    2.3.1 Description of Data Set
    2.3.2 Exploratory Analysis
    2.3.3 Preliminary Models and Fits
3 Random Effect Model - Negative Binomial
  3.1 Methodology
  3.2 Application 1: Results and Discussion
  3.3 Application 2: Results and Discussion
4 Zero Inflated Poisson Regression Model
  4.1 Methodology
  4.2 Application 1: Results and Discussion
  4.3 Application 2: Results and Discussion
5 Discussion and Suggestions for Further Study
Bibliography
Appendix

List of Tables

1. Frequencies of NSB resistant, emergent resistant, and combined at time 1
2. Frequencies of NSB resistant, emergent resistant, and combined at time 2
3. Summary of results of Poisson regression models for NSB at time 1
4. Summary of results of Poisson regression models for NSB at time 2
5. Comparison of observed and expected number of patients with NSB equal zero
6. Estimated model parameters for Poisson regression models of NSB at time 2
7. Average units sold by day of the week and store
8. Comparison of the actual number of days with zero units sold with the number expected under the Poisson model
9. Summary of estimated model parameters for Poisson model of units sold
10. Annual sales by store
11. Summary of fits for NB regression models for NSB resistant at time 2
12. Observed and expected frequencies for NSB resistant, NB regression model
13. Summary of fits, NB regression models for NSB emergent resistant at time 2
14. Observed and expected frequencies for NSB emergent resistant, NB regression model
15. A summary of fits of NB regression models for NSB combined at time 2
16. Observed and expected frequencies for NSB combined, NB regression model
17. Estimated model parameters and standard errors for the NB regression model for the units sold
18. Observed and expected number of days with zero units sold, NB regression model
19. Log-likelihood based on ZIP regression models for NSB resistant
20. Observed and expected frequencies for NSB resistant, ZIP regression model
21. Log-likelihood based on ZIP regression models for NSB emergent resistant
22. Observed and expected frequencies for NSB emergent resistant, ZIP regression model
23. Log-likelihood based on ZIP regression models for NSB combined
24. Observed and expected frequencies for NSB combined, ZIP regression model
25. Log-likelihood for ZIP regression models for the daily units sold
26. Estimated model parameters, ZIP regression model for units sold
27. Observed and expected number of days with zero units sold, ZIP regression model
28. Model performance comparison

List of Figures

1. Comparison of NSB resistant, emergent resistant, and combined in group A and group B
2. Comparison of histogram for NSB resistant with the Poisson distribution
3. Comparison of histogram for NSB emergent resistant with the Poisson distribution
4. Comparison of histogram for NSB combined with the Poisson distribution
5. Time 2 versus time 1 plots for NSB resistant, emergent resistant and combined
6. Time series plots of units sold by store
7. Auto correlation functions for the units sold in each of the 21 stores
8. Units sold by day of the week
9. Units sold by store

Acknowledgement

The advice and encouragement of a number of people helped me in the preparation of this thesis. I would like to thank my thesis supervisor, Professor Martin Puterman, for his many years of support and encouragement and for opening my eyes to the study of statistics. His advice and encouragement were much appreciated. I would also like to thank Professor Harry Joe for his helpful comments on improving the manuscript and for his advice and assistance in the later stages of model fitting. In addition, I would like to thank my friend and fellow graduate student, Sonia Mazzi, for her words of advice that helped me to focus on completing the work. Thanks also to my close friend and partner, Todd Shea, for listening to my thinking out loud and tolerating my domination of the computer.

1 Introduction

When analyzing the effects of covariates on a non-normal response variable, generalized linear models (GLMs) provide an invaluable tool. More specifically, Poisson regression allows one to analyze the effect of covariates on a count response variable.
Frequently, however, observed count data are more dispersed than is expected under a Poisson distribution. That is, conditional on the covariates, the variance exceeds the mean. This problem, referred to as overdispersion, arises in many applications and has received a great deal of attention in recent literature (Breslow, 1990; Breslow and Clayton, 1993; Dean and Lawless, 1989). Several adaptations of the basic GLM methodology have been developed, so that there are now several modifications to Poisson regression which allow for overdispersion (Lambert, 1992; Lawless, 1987; Wang, Puterman, Cockburn, Le, 1996). Consequently, statistical consultants are faced with the difficult task of identifying which approach is most appropriate for their particular situation. It is the purpose of this thesis to provide some insights into this issue.

The extra variation that frequently arises in count data can often have a physical explanation. When one wishes to model such data it is important to consider these possible physical explanations and incorporate them into the model used. Thus, in comparing competing models, one should compare them with regard to the inferences they make possible as well as with regard to how sensible they are in light of this information.

Many features of the data may give rise to overdispersion. When the response is a count for different subjects, overdispersion is particularly common and is usually considered to be caused by individual random effects. Wedderburn first proposed a method to deal with this problem, referred to as quasi-likelihood (Wedderburn, 1974). With the quasi-likelihood method one need only specify the form of the first and second moments rather than the exact distribution of the data. In this way, one is not restricted, as in Poisson regression, to the case of mean-variance equality.
Instead, the method assumes that the mean-variance relationship is given by

\[ \mathrm{Var}(Y) = \sigma^2 E(Y), \]

where $\sigma^2$ is referred to as the dispersion parameter. Applying this methodology to investigating the effects of covariates, one specifies that

\[ \log(E(Y)) = x^T \beta, \]

where $x$ represents a vector of covariates and $\beta$ represents a vector of corresponding coefficients, the model parameters. Estimating or quasi-likelihood equations are obtained by minimizing a weighted sum of squared deviations of the data from the values specified by the model (McCullagh and Nelder, 1983), whereas in Poisson regression the estimating equations are obtained by maximizing the log-likelihood. When making inferences regarding the effects of the covariates, the estimated standard errors for the parameters are multiplied by the square root of the dispersion parameter or its estimate. When the dispersion parameter is quite large, the effects of the covariates may not be detectable. The quasi-likelihood method treats the extra variation as a nuisance parameter interfering with inference and does not allow any inference regarding the nature of the overdispersion to be made.

Random effect, or mixture, models have more recently been used to account for this extra variation (Jansen, 1993; Stein, 1988). The overdispersion is again considered a nuisance parameter; in this case, however, the nuisance parameter represents the subject to subject variation. These models are essentially mixture models where the response conditioned on the mean is assumed Poisson but the mean is random and follows some distribution. A gamma distribution is often the choice for the distribution of the Poisson mean, resulting in the negative binomial distribution for the unconditional response. Other choices for the distribution of the Poisson mean appearing in recent literature are the log-normal (Aitchison and Ho, 1989) and the inverse Gaussian (Dean and Lawless, 1989).
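The effect of gamma mixing can be illustrated with a short simulation. The sketch below is an illustration only, not part of the analyses in this thesis: it draws a multiplicative random effect with mean 1 and variance a from a gamma distribution, then draws a Poisson count whose mean is the random effect times a fixed rate. The rate of 3, the value a = 0.5, the sample size, the seed and all function names are assumptions chosen for the example. For such a mixture the marginal variance is the mean times (1 + a times the mean), so the sample variance should sit near 7.5 while the sample mean sits near 3.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_gamma_poisson(lam, a, n, seed=0):
    """Draw n counts from a Poisson whose mean is lam times a gamma effect.

    The gamma effect has shape 1/a and scale a, hence mean 1 and variance a.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        nu = rng.gammavariate(1.0 / a, a)
        draws.append(sample_poisson(nu * lam, rng))
    return draws


def mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    v = sum((x - m) ** 2 for x in values) / (n - 1)
    return m, v


# Illustrative settings (assumptions, not values taken from the data sets).
draws = sample_gamma_poisson(lam=3.0, a=0.5, n=20000)
m, v = mean_and_variance(draws)
print(f"sample mean = {m:.3f}, sample variance = {v:.3f}")
```

A plain Poisson sample with the same mean would have a variance close to 3, so the gap between the two printed numbers is the overdispersion introduced by the random effect.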
Overdispersion may also be the result of the data coming from two or more distinct but unobserved populations, each with observed counts from a Poisson distribution whose mean depends upon a vector of covariates (Wang, Puterman, Cockburn, Le, 1996). A specific case occurs when the overdispersion is due to the existence of one unobserved population with strictly zero counts and another with counts following a Poisson distribution with a positive mean. This type of data has been examined by Lambert (Lambert, 1992) as well as by Wang, Puterman, Cockburn, and Le (Wang, Puterman, Cockburn, Le, 1996). In Lambert's paper the counts represent the number of defects in a printed circuit board. A large number of these boards have zero defects, more than can be explained by a Poisson model, and other boards have a positive number of defects. Lambert describes the boards as being either in a perfect state, in which there will be zero defects, or in an imperfect state, in which there may be some defects. She refers to this type of Poisson mixture model as a zero inflated Poisson model or ZIP model (Lambert, 1992). Such models have the advantage of being simple and reasonable in many situations that share this structure. Wang, Puterman, Cockburn, and Le use the same general approach to analyze the effect of a number of covariates on the daily seizure frequency in epileptic children. ZIP models are really a specific case of the finite mixture model described by Wang, Puterman, Cockburn, and Le, whose model formulation allows for the possibility of more than two unobserved populations with various distinct means.

In a situation where there appear to be more zero counts than can be explained by a Poisson distribution, a ZIP model should be considered. One can then analyze the effect of covariates both on the rate of the unobserved non-zero population and on the probability that an observation arises from the zero population, that is, that the observation is in the zero state.
In addition, a score test has been developed by van den Broek to detect the presence of zero inflation in Poisson regression (van den Broek, 1995).

In this thesis I consider two different data sets where the effects of covariates on a count response variable are of interest. In both situations, the counts appear to be overdispersed and there are physical explanations for why this might be so. In Chapter 2, the data are examined and the inadequacy of an ordinary Poisson regression is demonstrated in the context of initial analyses. In addition, score tests for detecting overdispersion due to individual subject or random effects and due to zero inflation are applied to determine whether more complicated models that allow for overdispersion are necessary. In Chapter 3, I describe the negative binomial regression model and use it to analyze both data sets. The fits are evaluated and all appropriate inferences are made. In Chapter 4, I describe the ZIP model and use it to analyze both data sets. Again the fits are evaluated and inferences are made where required. Finally, in Chapter 5, I discuss the results of both approaches and offer some conclusions and suggestions for further study.

This work was motivated by my involvement in the analysis of data from a clinical study. In the study, two different treatments for chronic urinary tract infections (UTIs) in acutely spinal cord injured patients were compared. Chronic UTI is a major cause of morbidity in this unique population (Anderson, 1985) and treatment is controversial. Many infections are asymptomatic and repeated treatment often leads to re-infection with antimicrobial resistant bacteria (Gribble, McCallum and Schechter, 1988).

2 Motivating Application and Initial Data Analyses

2.1 General Approach

A thorough initial examination of a data set is necessary before a reasonable model can be identified.
In this section I investigate both data sets in an effort to understand their basic structure and develop appropriate models that allow for the required inferences. Graphical and numerical summaries facilitate the identification of preliminary models, which are fit to the data. These fits are investigated and are shown to be inadequate. The preliminary models are Poisson regression models, and with both data sets overdispersion is present. Two score tests based on these models are performed to detect overdispersion. The first test is designed to detect overdispersion that may be modeled by a random effects model, and the second test aims to identify whether the overdispersion is due to zero inflation.

2.1.1 Testing for Overdispersion

Several tests for overdispersion in Poisson regression have been developed (Cameron and Trivedi, 1990; Dean, 1992; Dean and Lawless, 1989; Ganio and Schafer, 1992). Dean discusses three score tests of the null hypothesis that an ordinary Poisson regression model holds against three different alternative models that allow for overdispersion (Dean, 1992). Since, with both data sets, the overdispersion may be explained by random effects, an appropriate adjustment to the Poisson regression model would be the inclusion of a random effect term. The model,

\[ (Y_i \mid \nu, x_i) \sim \mathrm{Poisson}(\nu \lambda_i), \qquad \nu \sim \mathrm{Gamma}(a^{-1}, a), \qquad E(\nu) = 1, \qquad \mathrm{Var}(\nu) = a < \infty, \]

so that $E(Y_i \mid \nu) = \nu \lambda_i$, $E(Y_i) = E(\nu)\lambda_i = \lambda_i = e^{x_i^T \beta}$, and

\[ \mathrm{Var}(Y_i) = E(\mathrm{Var}(Y_i \mid \nu)) + \mathrm{Var}(E(Y_i \mid \nu)) = E(\nu)\lambda_i + \mathrm{Var}(\nu)\lambda_i^2 = \lambda_i (1 + a\lambda_i), \]

has an additional variable, $\nu$, representing a multiplicative random effect. Dean and Lawless derived a score test for $H_0: a = 0$ (Dean and Lawless, 1989). Under $H_0$, the test statistic, reported as $P_B$ in the tables of this chapter, is asymptotically standard normal (Dean and Lawless, 1989). Large values of the test statistic indicate the presence of overdispersion.
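As a hedged illustration of how such a score statistic can be computed, the sketch below uses the form commonly given for the score test against this gamma random effect alternative: the sum over observations of (y_i - mu_i)^2 - y_i, divided by the square root of twice the sum of mu_i^2, where mu_i is the fitted Poisson mean. That form is an assumption here rather than a quotation from this section. The data are the Table 1 frequencies of NSB resistant at time 1. Because the only covariate is group membership, the maximum likelihood fitted means are simply the two group means, so no iterative fitting is needed. The function names are illustrative.

```python
import math

# Table 1 frequencies of NSB resistant at time 1 (list index = NSB value).
FREQ_GROUP_A = [4, 1, 7, 8, 10, 6, 4, 1]
FREQ_GROUP_B = [6, 1, 7, 5, 7, 8, 5, 1]


def expand(freq):
    """Turn a frequency list into one count per patient."""
    return [value for value, count in enumerate(freq) for _ in range(count)]


def overdispersion_score(groups):
    """Score statistic for overdispersion in a Poisson model with one group factor.

    Each element of `groups` holds the counts for one group; the fitted mean
    for every patient in a group is that group's sample mean.
    """
    numerator = 0.0
    sum_mu_squared = 0.0
    for counts in groups:
        mu = sum(counts) / len(counts)
        numerator += sum((y - mu) ** 2 - y for y in counts)
        sum_mu_squared += len(counts) * mu ** 2
    return numerator / math.sqrt(2.0 * sum_mu_squared)


def upper_tail_normal(z):
    """One-sided p-value P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


pb = overdispersion_score([expand(FREQ_GROUP_A), expand(FREQ_GROUP_B)])
p_value = upper_tail_normal(pb)
print(f"P_B = {pb:.2f}, one-sided p-value = {p_value:.2f}")
```

With these frequencies the statistic comes out at about 0.24 with a one-sided p-value of about 0.41, which agrees with the entry reported for NSB resistant in Table 3.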
2.1.2 Testing for Zero Inflation

Zero inflated regression models may provide a valuable analysis tool for a number of applications. These models are members of a more general class, Poisson mixture models, in which an extra mass at zero is allowed. The proportion of zeroes is allowed to be larger than that expected under a Poisson regression model by adding a positive probability of observation at zero that may or may not depend on a vector of covariates. If one ignores the effect of covariates on the probability that the observation is from the zero group, the extra proportion of zeroes is added as follows (Johnson, Kotz, and Kemp, 1992):

\[ P(Y_i = 0) = p + (1 - p)\, e^{-\lambda_i}, \]
\[ P(Y_i = y_i) = (1 - p)\, \frac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}, \qquad \text{for } y_i > 0. \]

Recently, van den Broek developed a score test of the hypothesis that there are too many zeroes for a Poisson regression to provide an adequate fit for the data (van den Broek, 1995). In order to perform the test, one must fit a Poisson regression model. The test statistic is then based on comparing the number of zeroes occurring in the data with the number expected under a Poisson regression model, taking into account the overall mean of the response variable. The test statistic, denoted $S(\hat\beta)$, is based on the vector of parameter estimates $\hat\beta$ from a Poisson regression model through the fitted zero probabilities $\hat p_{0i} = e^{-\hat\lambda_i}$, where $\hat\lambda_i = e^{x_i^T \hat\beta}$. This statistic is asymptotically distributed as a $\chi^2_1$ random variable, so values of $S(\hat\beta)$ can be compared with quantiles of the $\chi^2_1$ distribution to test the null hypothesis $H_0: \theta = 0$, where $\theta = p/(1-p)$, which is equivalent to the null hypothesis that there are no extra zeroes, i.e. $p = 0$. Large values of this test statistic indicate the presence of zero inflation.
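The computation can be sketched in a few lines of Python. The form of the statistic used below is the one usually attributed to van den Broek (1995): the square of the sum of (1{y_i = 0} - p0_i) / p0_i, divided by the sum of (1 - p0_i) / p0_i minus n times the overall mean, with p0_i = exp(-lambda_i) the fitted Poisson probability of a zero. Treat that form as an assumption of this sketch. The data are the Table 1 frequencies of NSB resistant at time 1, and with group membership as the only covariate the fitted means are the two group means. Function names are illustrative.

```python
import math

# Table 1 frequencies of NSB resistant at time 1 (list index = NSB value).
FREQ_GROUP_A = [4, 1, 7, 8, 10, 6, 4, 1]
FREQ_GROUP_B = [6, 1, 7, 5, 7, 8, 5, 1]


def expand(freq):
    """Turn a frequency list into one count per patient."""
    return [value for value, count in enumerate(freq) for _ in range(count)]


def zero_inflation_score(groups):
    """Score statistic for zero inflation in a Poisson model with one group factor."""
    score = 0.0        # sum of (indicator of zero - p0) / p0
    information = 0.0  # sum of (1 - p0) / p0
    total = 0          # sum of all counts, i.e. n times the overall mean
    for counts in groups:
        lam = sum(counts) / len(counts)  # fitted mean for this group
        p0 = math.exp(-lam)              # fitted probability of a zero
        for y in counts:
            score += ((1.0 if y == 0 else 0.0) - p0) / p0
            information += (1.0 - p0) / p0
            total += y
    return score ** 2 / (information - total)


def upper_tail_chi2_1(s):
    """P(X > s) for X chi-square with one degree of freedom."""
    return math.erfc(math.sqrt(s / 2.0))


s_stat = zero_inflation_score([expand(FREQ_GROUP_A), expand(FREQ_GROUP_B)])
s_p_value = upper_tail_chi2_1(s_stat)
print(f"S = {s_stat:.2f}, p-value = {s_p_value:.4f}")
```

For these frequencies the statistic is about 22.65, matching the value reported for NSB resistant in Table 3, and the corresponding p-value is far below 0.01.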
2.2 Application 1: A Clinical Trial Comparing Two Treatments for Chronic Urinary Tract Infections

2.2.1 Description of Data Set

The motivating study was a clinical trial which aimed to compare two different protocols for antimicrobial therapy in the treatment of chronic urinary tract infections (UTIs).¹ The participants in the study were all spinal cord injury patients admitted to Vancouver General Hospital during the period of July 1990 to July 1993. Urinary tract infections are considered by experts to be responsible for a large number of deaths among this unique group of patients, as the majority are unable to void spontaneously and must be on intermittent catheterization (Anderson, 1985; Merrit, 1976; Pearman, 1984). Physicians caring for such patients face the dilemma of when to treat an infection, as symptoms are rarely reported by patients even when an infection is present. There is disagreement as to whether treatment of these asymptomatic infections is beneficial, as it often leads to re-infection with antimicrobial resistant bacteria (Gribble, McCallum and Schechter, 1988).

¹ This research was conducted at Vancouver General Hospital by Dr. Marie Gribble of the University of British Columbia.

A total of 81 patients were recruited to the study; 41 were randomly assigned to treatment group A and the other 40 were assigned to treatment group B. Treatment A might be considered a standard treatment for chronic UTIs in spinal cord injury patients. This treatment consists of administering antimicrobial therapy (oral antibiotics) on every occasion where there is a positive test result for the presence of bacteria in the urine (a positive urine culture). Treatment B is a less standard treatment where patients are only prescribed antibiotics if there is both a positive urine culture and the presence of at least two specific symptoms of bacteriuria. That is, in group B, the patients are only given the antimicrobial therapy if they are "sick".
The hypothesis in this study is that the patients in group B will somehow fare better than those in group A. The idea is that having an asymptomatic UTI makes it difficult for other bacteria to take over in the bladder, thus providing some protection against a symptomatic infection.

The 81 patients were kept in the hospital and monitored for a period of up to sixteen weeks; however, a large number of patients dropped out before the sixteen week period was up. Each week a sample of urine was taken and tested for the presence of bacteria. The presence or absence of bacteria was recorded, along with variables representing the strain of bacteria and whether or not the particular strain was resistant to certain antimicrobials. In addition, a record of the presence or absence of specific and non-specific symptoms of UTI was maintained, as well as a number of variables measured only at the beginning and end of the study. One such variable was a count of the number of different strains of resistant bacteria (NSB) cultured from a rectal swab, together with the drugs to which they were resistant. As the frequency of infection with resistant bacteria is of major interest, this work focuses on this last variable as the response and thus the indicator of patient wellness. There are obviously many ways in which one can describe a patient's well-being; for the purpose of this thesis, however, we will concentrate on only this one measure.

The rectal swab data consist of pairs of counts for each of the 81 patients. The counts are further sub-classified into one of two categories: strains resistant to the drugs norfloxacin and cotrimoxazole, and strains with emergent resistance to norfloxacin or cotrimoxazole. A strain is classified as emergent resistant if the infected patient has already been treated, either during or just preceding the study, with the drug to which the strain has resistance.
If we ignore the distinction between resistance and emergent resistance we can consider a combined count of the number of strains with some resistance. We naturally assume that each of the six counts (before and after for each of the three classifications) follows a Poisson distribution. Consequently, a natural first approach to assessing the effect of the treatment group on the number of resistant strains of bacteria (NSB) would be to use Poisson regression with the treatment group as a binary covariate. These data are complicated in two ways, however. First, the data appear to show more dispersion than a Poisson distribution allows. Second, the before and after counts may be correlated because both counts are from the same patient.

2.2.2 Exploratory Analysis

The main interest of this study is to compare the two protocols with respect to NSB at the beginning and end of the sixteen week study period. NSB at the commencement of the study should serve as a baseline measurement. Thus, due to the randomization, we expect no difference between the two groups at this time. NSB at the end of the study should somehow be dependent upon the patients' treatment during the study. If the two protocols have different effects we should see a difference between the two study groups. Essentially, how the patients changed from the beginning to the end of the study is of interest. The data are summarized in Tables 1 and 2 and Figure 1.

Figure 1: Comparison of NSB resistant, emergent resistant and combined in group A and group B.
[Six histogram panels: NSB resistant, NSB emergent resistant and NSB combined, each at time 1 and at time 2. Horizontal axis: NSB. Separate bars for Group A and Group B.]

Table 1: Frequencies of NSB resistant, emergent resistant and combined at time 1

              Resistant           Emergent Resistant    Combined
NSB           Group A   Group B   Group A   Group B     Group A   Group B
0                   4         6        33        34           3         5
1                   1         1         0         0           1         0
2                   7         7         0         0           5         4
3                   8         5         1         4           6         6
4                  10         7         2         1           9         8
5                   6         8         4         0           7         9
6                   4         5         1         1           4         6
7                   1         1         0         0           1         1
8                   0         0         0         0           2         0
9                   0         0         0         0           1         0
10                  0         0         0         0           1         0
11                  0         0         0         0           1         1
Mean            3.415     3.375     0.902     0.550       4.317     3.925
Variance        3.149     4.087     3.590     1.946       6.322     4.943

Table 2: Frequencies of NSB resistant, emergent resistant and combined at time 2

              Resistant           Emergent Resistant    Combined
NSB           Group A   Group B   Group A   Group B     Group A   Group B
0                  29        12        12        15          10         6
1                   1         2         0         0           0         1
2                   2         2         1         0           0         1
3                   3         8         1         4           2         2
4                   4         7        10         3           4         5
5                   0         6         6         7           5         3
6                   0         2         6         5           7         5
7                   2         1         4         4           7         3
8                   0         0         1         2           5         3
9                   0         0         0         0           0         2
10                  0         0         0         0           0         3
11                  0         0         0         0           0         3
12                  0         0         0         0           0         1
13                  0         0         0         0           1         0
14                  0         0         0         0           0         2
Mean            1.073     2.675     3.585     3.325       4.659     6.000
Variance        3.770     4.584     6.799     8.225      10.030    15.846

NSB resistant and NSB emergent resistant do not appear to differ greatly between the two groups at the beginning of the study. NSB resistant appears to differ between the two groups at time 2, with patients in group B having, on average, more than twice as many resistant strains. As a strain is classified as emergent resistant if the patient has been previously treated with the antimicrobial drug to which the strain has resistance, and since patients in group B were treated less often, it is reasonable that patients in group B would have more resistant strains and fewer emergent resistant strains by the end of the study. NSB emergent resistant at time 2, however, does not appear to differ greatly between the two groups.
In summary, it appears that patients in group B tend to get more strains with some resistance to norfloxacin and cotrimoxazole, but fewer of these strains are classified as emergent resistant. We can come to the same conclusions upon inspection of the histograms in Figure 1.

Another feature of these data is that the variance exceeds the mean for almost all of the counts, suggesting that a Poisson distribution may not provide a good fit for the data. However, a Poisson regression analysis will be appropriate if, conditional on the covariates, the mean and variance are similar. For the counts at time 1 the only covariate identified is group membership, in which case the overdispersion will still be a concern. For the counts at time 2, additional relevant covariates are the number of weeks the patient remained in the study and the NSB count at time 1. It appears that overdispersion will be a problem for modeling at least the counts at time 1 and possibly those at time 2 as well.

From Tables 1 and 2 one can also see that the number of patients with NSB equal to zero is particularly high for all three classifications. Perhaps the counts can be modeled as coming from two different Poisson distributions, one with a positive mean and the other with a mean of zero. This would then account for the inflated number of zeroes that are seen. The zero inflation of the counts is better illustrated in Figures 2, 3 and 4, where the histograms for the actual data are compared with simulated data from Poisson distributions with the same means.

Figure 2: Comparison of histogram for NSB resistant with the Poisson distribution.
[Panels: NSB resistant, time 1, with simulated data; NSB resistant, time 2, with simulated data.]

Figure 3: Comparison of histogram for NSB emergent resistant with the Poisson distribution.
[Panels: NSB emergent resistant, time 1, with simulated data; NSB emergent resistant, time 2, with simulated data.]

Figure 4: Comparison of histogram for NSB combined with the Poisson distribution.
[Panels: NSB combined, time 1, with simulated data; NSB combined, time 2, with simulated data.]

With the possible exception of NSB combined at time 1, the counts do tend to show an inflation in the number of zeroes relative to the number expected under a Poisson distribution. Finally, scatter plots of the counts at time 2 versus the counts at time 1, provided in Figure 5, show that the dependence between counts at time 1 and at time 2 appears to be negligible. As a result, a model that ignores this dependence may be reasonable and vastly simplifies the analysis.

Figure 5: Time 2 versus time 1 plots for NSB resistant, emergent resistant and combined.
[Three scatter plot panels: NSB resistant, NSB emergent resistant and NSB combined, each time 2 versus time 1. Horizontal axis: NSB time 1.]

2.2.3 Preliminary Models and Fits

Although overdispersion is anticipated, reasonable preliminary models for these data are ordinary Poisson regression models. We will model NSB resistant, NSB emergent resistant, and NSB combined at the beginning and the end separately, thus concentrating on models for six different univariate responses. The Poisson regression model is specified as follows:

\[ (Y_i \mid x_i) \sim \mathrm{Poisson}(\lambda_i), \quad \text{so that } E(Y_i \mid x_i) = \lambda_i \text{ and } \mathrm{Var}(Y_i \mid x_i) = \lambda_i, \qquad \ln(\lambda_i) = x_i^T \beta, \quad i = 1, 2, \ldots, n, \qquad (1) \]

where $x_i$ represents a vector of covariates for observation $i$, and $\beta$ represents a vector of model parameters. The only covariate available at time 1 is group membership. The analysis essentially simplifies to one of comparing two Poisson means; however, this is easily performed in the context of generalized linear models.
In this case, $Y_i$ = NSB at time 1 (resistant, emergent resistant or combined), and

\[ x_i = \begin{cases} 0 & \text{if patient } i \text{ is in group A} \\ 1 & \text{if patient } i \text{ is in group B.} \end{cases} \]

Table 3: Summary of results of Poisson regression models for NSB at time 1

NSB                   Residual Deviance (G²)   d.f.   Null Deviance   P_B (p-value)   S(β̂) (p-value)
Resistant             114.0744                 79     114.0838         0.24 (0.41)     22.65 (0.00)
Emergent Resistant    207.5104                 79     211.0048        16.83 (0.00)    115.58 (0.00)
Combined              130.1418                 79     130.8972         2.13 (0.02)     33.68 (0.00)

Upon inspection of the residual deviance in Table 3, the number of resistant strains at time 1 appears to be fit moderately well by the Poisson model; the other two counts, however, are not fit adequately by these models. The effect of group membership appears negligible, as the residual deviance is only slightly reduced compared to the null deviance. This is not surprising, as we expected to see no significant difference between the two groups at the beginning of the study.

A goodness of fit test may be performed by comparing the observed residual deviance with quantiles of the chi-square distribution with the corresponding degrees of freedom (McCullagh and Nelder, 1983). The deviance for Poisson regression has become commonly known as G². A poor fit is indicated by a large value of G² relative to the degrees of freedom. As can be seen in Table 3 above, the values of G² of 114, 207 and 130 for NSB resistant, emergent resistant and combined respectively, on 79 degrees of freedom, indicate a poor fit for all three responses. In addition, if we compare the actual number of zero counts observed with that expected under the Poisson model, we have further evidence of a lack of fit. For NSB resistant, the model expects about 1 or 2 zero counts among the patients; however, there are actually 10 patients with zero counts.
For NSB emergent resistant the model expects about 16 and 23 zero counts among the patients in group A and B, respectively, but there are 33 patients in group A and 34 in group B with NSB of zero. Finally, for NSB combined, the model expects about 1 zero count among all patients in either group A or B, but there are 3 such patients in group A and 5 in group B.

Since the Poisson model implies that the variance equals the mean, larger variances indicate the presence of overdispersion. Under the quasi-likelihood model one can allow for the variance to exceed the mean by assuming $\mathrm{Var}(Y_i \mid x_i) = \sigma^2 \lambda_i$. An estimate of $\sigma^2$ can be obtained by

$$\hat\sigma^2 = \frac{X^2}{n - p}, \qquad (2)$$

where $n$ = number of observations and $p$ = number of covariates in the model (McCullagh and Nelder, 1983). This is the Pearson $X^2$ statistic divided by the residual degrees of freedom. For NSB resistant $\hat\sigma^2 = 1.06$, for NSB emergent resistant $\hat\sigma^2 = 3.53$ and for NSB combined $\hat\sigma^2 = 1.36$. This indicates the presence of overdispersion for NSB emergent resistant and possibly also for NSB combined. Upon inspection of the columns $P_B$ and $S(\tilde\beta)$, the test statistics described in Sections 2.1.1 and 2.1.2 respectively, our two tests for overdispersion also indicate its presence for NSB emergent resistant and combined; they do not, however, make clear how the overdispersion may have arisen. The overdispersion may be well explained by either a ZIP regression model or a negative binomial model.

Our main interest is in determining the effect of the different treatment protocols on NSB; however, at the end of the study there are two additional covariates that need to be taken into account: the number of weeks in the study and NSB at time 1 (NSB1). Since not all of the patients remained in the ward for the full sixteen week duration of the study, we feel the effect of the length of time must be considered.
In addition, since the longer the patients are in the ward, the longer they are treated according to regime A or B, the interaction between the group and the number of weeks should also be considered. NSB1 serves as a baseline measure for the patients. The patients do vary considerably with respect to this measure and ignoring this variability would be a mistake. In addition, since we are mainly interested in how the patients have changed, we must take this baseline into account before attributing any effect on NSB at time 2 to the treatment. Using NSB1 as a covariate allows one to consider the dependence of the second count on the first, as well as differences in the baseline measure.

The model as specified in (1) applies with $Y_i$ = NSB at time 2 (resistant, emergent resistant or combined). We will consider the covariates group membership (grp), number of weeks in the study (weeks), and NSB at the beginning of the study (NSB1), as well as the interaction between weeks and grp.

Table 4 Summary of results of Poisson regression models for NSB at time 2

NSB                  Model            Residual Deviance ($G^2$)   df   $P_B$ (p-value)   $S(\tilde\beta)$ (p-value)
Resistant            grp*weeks+NSB1   207.31                      76   6.68 (0.00)       65.865 (0.00)
                     grp*weeks        208.05                      77   6.73 (0.00)       65.297 (0.00)
                     grp+weeks+NSB1   209.34                      77   6.81 (0.00)       67.104 (0.00)
                     grp+weeks        209.82                      78   6.86 (0.00)       66.105 (0.00)
Emergent Resistant   grp*weeks+NSB1   211.78                      76   4.48 (0.00)       143.173 (0.00)
                     grp*weeks        213.50                      77   4.52 (0.00)       123.490 (0.00)
                     grp+weeks+NSB1   211.96                      77   4.49 (0.00)       155.321 (0.00)
                     grp+weeks        213.57                      78   4.52 (0.00)       130.412 (0.00)
Combined             grp*weeks+NSB1   231.03                      76   6.76 (0.00)       130.394 (0.00)
                     grp*weeks        231.12                      77   6.78 (0.00)       145.106 (0.00)
                     grp+weeks+NSB1   231.07                      77   6.77 (0.00)       131.596 (0.00)
                     grp+weeks        231.17                      78   6.79 (0.00)       146.674 (0.00)

In Table 4, the fits of four models for each of the three classifications of NSB are summarized.
The full model considers the covariates treatment group, weeks in the study, NSB at time 1 and the interaction between the treatment group and the weeks in the study. This model is abbreviated as grp*weeks+NSB1 in the table.

Examining the values of the goodness of fit statistic, $G^2$, reveals that not even the full model provides an adequate fit to these data. Based on the full models, our estimated dispersion parameter is $\hat\sigma^2 = 2.74$ for NSB resistant, $\hat\sigma^2 = 2.10$ for NSB emergent resistant and $\hat\sigma^2 = 2.50$ for NSB combined. It appears that overdispersion is present in these data. This is further evidenced by the large values for the two test statistics, $P_B$ and $S(\tilde\beta)$, for all models.

In addition, there appear to be far too many counts of zero to be adequately explained by a Poisson regression model. Of the patients who remained in the study the entire 16 weeks, the model that includes both covariates, group and weeks, as well as the interaction between the two, expects 4 of the 11 patients in group A and 1 of the 18 in group B to have NSB resistant equal to zero, when the actual numbers are 8 and 5. For NSB emergent resistant the model expects no zero counts in either of the groups A or B for the 29 patients who remained the entire 16 weeks in the study. The actual number of zero counts was 0 for group A, but 6 for group B. For NSB combined, we again see evidence of a lack of fit. However, for those patients staying the entire 16 weeks, this time the number of zeroes expected under the model does not differ greatly from that observed, indicating that perhaps zero inflation is not a problem and that the overdispersion is better explained by individual random effects. This, however, ignores the possibility of zero inflated counts for patients who did not remain in the study for the full sixteen weeks. In Table 5 below, the observed number of patients with NSB equal to zero is compared with the number of patients expected under the model.
Table 5 Comparison of observed and expected number of patients with NSB equal to zero. (Expected numbers are given in parentheses.)

Weeks in study:             1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Group A, resistant:         1(0) 2(1) 0(0) 1(0) 1(1) 1(1) 1(0) 1(0) 1(1) 4(1) 0(0) 2(1) 1(1) 2(1) 3(1) 8(4)
Group A, emergent resistant; Group A, combined; Group B, resistant:
                            1(0) 3(1) 0(0) 1(0) 1(0) 1(0) 0(0) 0(0) 1(0) 1(0) 0(0) 1(0) 2(0) 0(0) 1(0) 1(0)
                            1(0) 0(0) 1(0) 0(0) 1(0) 0(0) 0(0) 1(0) 0(0) 2(0) 0(0) 0(0) 1(0) 0(0) 1(0) 0(0)
                            0(0) 0(0) 1(0) 0(0) 0(0) 0(0) 0(0) 2(0) 5(1) 1(0) 1(0) 0(0) 0(0) 1(0) 1(0) 0(0)
Group B, emergent resistant: 2(1) 1(0) 0(0) 1(0) 0(0) 2(0) 0(0) 0(0) 0(0) 3(0) 0(0) 0(0) 0(0) 0(0) 0(0) 6(0)
Group B, combined:          2(0) 0(0) 0(0) 1(0) 0(0) 1(0) 0(0) 0(0) 0(0) 1(0) 0(0) 0(0) 0(0) 0(0) 0(0) 1(0)

The model appears to consistently underestimate the number of patients that have NSB equal to zero. It is not clear, however, from inspection of the two test statistics, $S(\tilde\beta)$ and $P_B$, whether a random effect model or a ZIP model would better explain the observed overdispersion, as both statistics have small p-values. The model parameters estimated under the full model, along with their standard errors corrected for overdispersion and a t-value, are summarized below in Table 6.

Table 6 Estimated model parameters for Poisson regression models of NSB at time 2.

NSB                  Parameter                 Parameter Estimate   Standard Error   t-value
Resistant            interaction(group:week)    0.0530635           0.0614822         0.863
                     group                      0.32692938          0.7267554         0.450
                     week                      -0.03415314          0.0533424        -0.640
                     NSB1                       0.03941149          0.0759832         0.519
Emergent Resistant   interaction(group:week)   -0.012603994         0.0422624        -0.298
                     group                     -0.001734265         0.5749374        -0.003
                     week                       0.08687913          0.0289098         3.005
                     NSB1                       0.046401763         0.0500387         0.927
Combined             interaction(group:week)   -0.004443031         0.0342365        -0.130
                     group                      0.240052530         0.4497721         0.534
                     week                       0.055320357         0.0259616         2.131
                     NSB1                      -0.006739101         0.0352619        -0.191

Few of the covariates appear to have a significant effect on NSB upon inspection of the t-values.
It does appear that the covariate representing the number of weeks the patients remained in the study may have a significant effect on NSB emergent resistant and perhaps NSB combined. Any further inference is left until a model that accounts for the apparent overdispersion is fit, as the overdispersion may be disguising other effects.

2.3 Application 2: Sales in a Retail Environment

2.3.1 Description of Data Set

The second data set considered was obtained from a study undertaken at the initiative of a large retail hardware chain, Canadian Tire Pacific Associates. The data consist of daily sales of one of many products sold by the chain, at the 21 stores owned and operated in the lower mainland. Sales were monitored for a period of 314 consecutive days, from October 9, 1992 until August 18, 1993. Sales are recorded as the number of units sold and as such are discrete count data.

The initial goal of the study was to use the data in an effort to develop an optimal inventory control policy. This was the topic of an M.Sc. thesis written by Kapalka while a student at the University of British Columbia in the Faculty of Commerce (Kapalka, Katircioglu and Puterman, 1995). Customer demand for the product varies from day to day and must be accounted for in the model. It was noted in the initial study that no theoretical distribution was sufficient for describing the demand of products having medium demand. A model for the daily demand is required and is the focus of this analysis. The model will investigate the effect of two factors, the day of the week and the store location.

A natural first approach to modeling the daily units sold is to use the Poisson regression model. Again, however, the data are somewhat complicated. The possibility of serial correlation between the daily units sold must be examined. In addition, on many days zero units of the product are sold, whereas on other days some positive number of units is sold.
The units sold do exhibit some overdispersion relative to the Poisson model, which needs to be taken into account. This overdispersion may be explained either by individual random effects associated with the particular day or by a model that allows for zero inflated counts.

2.3.2 Exploratory Analysis

The units sold in each of the 21 stores were recorded on a series of 314 days, from October 9, 1992 until August 18, 1993. Time series plots for each of the 21 stores are provided in Figure 6. No obvious trend can be identified from inspection of these plots for any of the 21 stores. Demand for this particular product appears not to be affected by seasonality.

The presence of serial correlation between the daily observations would necessitate a considerably more complex model. Serial correlation is, however, not a concern here, as an initial examination of the autocorrelation functions rules out this complication. The autocorrelation functions for each of the twenty-one series of observations are shown below in Figure 7. None of the lags show significant dependence between the observations. This vastly simplifies the subsequent analysis.

Figure 6 Time series plots of units sold by store.
[Figure 6 panels: Store 1 to Store 21. Horizontal axis: time in days, 10/09/92 to 08/15/93.]

Figure 7 Autocorrelation functions for the units sold in each of the 21 stores.
[Figure 7 panels: Series Store1 to Store21. Horizontal axis: lag, 0 to 20.]

Our primary interest in analyzing these data is to determine if we can better model the units sold knowing at which store and on which day the observation was made. In Table 7 the mean units sold for each day of the week and each of the 21 stores are summarized. The variances are given in parentheses. From inspection of Table 7, one notes that, on average, more units are sold on Saturdays than on any other day of the week, and more in general on weekends than on weekdays. This trend is consistent across most of the 21 stores. In addition, one can also see that the stores tend to vary considerably with respect to the number of units sold on any particular day. Together, both of these factors appear to be important in determining the number of units sold. One can also note that the majority of the variances of the units sold by day of the week and store, given in parentheses, exceed the mean.
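The comparison of variances with means can be summarised by the ratio of the variance to the mean, which equals one under a Poisson distribution. The sketch below computes this ratio for a few of the All Stores entries of Table 7. The function name is hypothetical, and the (mean, variance) pairs are copied from the table for those days whose entries are unambiguous in the source.

```python
def dispersion_index(mean, variance):
    """Ratio of variance to mean; values above 1 suggest overdispersion
    relative to the Poisson distribution."""
    return variance / mean

# (mean, variance) of units sold over all 21 stores, taken from Table 7.
all_stores = {
    "Monday": (0.75, 1.18),
    "Tuesday": (0.68, 0.97),
    "Friday": (0.73, 1.25),
    "Saturday": (0.94, 1.50),
    "Sunday": (0.86, 1.37),
}

ratios = {day: dispersion_index(m, v) for day, (m, v) in all_stores.items()}
for day, ratio in ratios.items():
    print(f"{day}: variance/mean = {ratio:.2f}")
```

Every ratio in this subset is well above one, which is the pattern the text describes for the majority of the store by day cells.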
Variances in excess of the mean indicate that overdispersion may be a problem even after the two factors have been taken into consideration. The number of units sold for each day of the week over all 21 stores is summarized in histograms in Figure 8 below. Also, for each of the 21 stores, histograms of the number of units sold over all days are provided in Figure 9. In all of the histograms a large number of observations of zero units sold are evident. The overdispersion present in these data may be due to the presence of zero inflation. More simply put, the overdispersion could be due to the fact that on many days zero units are sold, whereas on the days when purchases are made a number of units are sold, which suggests a compound Poisson model for these data.

Table 7 Average units sold by day of the week and store. (Variances are given in parentheses.)

Store        Monday        Tuesday       Wednesday     Thursday      Friday        Saturday      Sunday        All Days
1            0.55 (0.53)   0.53 (1.03)   0.31 (0.99)   0.66 (0.83)   0.69 (1.04)   0.77 (0.95)   0.51 (0.53)   0.58 (0.85)
2            0.40 (0.56)   0.40 (0.52)   0.55 (0.80)   0.41 (0.53)   0.31 (0.27)   0.75 (0.78)   0.55 (0.66)   0.48 (0.60)
3            0.84 (0.82)   1.00 (1.41)   1.22 (1.81)   1.34 (2.37)   1.20 (1.62)   1.18 (1.87)   1.38 (2.06)   1.17 (1.70)
4            1.13 (1.30)   1.02 (1.29)   1.04 (1.91)   1.02 (1.46)   0.82 (0.97)   2.22 (3.27)   1.58 (3.52)   1.26 (2.12)
5            0.62 (0.60)   0.60 (1.29)   0.62 (1.10)   0.48 (0.39)   0.49 (0.57)   0.69 (1.08)   0.60 (0.65)   0.59 (0.81)
6            0.78 (2.68)   0.53 (0.75)   0.58 (0.57)   0.86 (1.56)   0.60 (2.47)   0.55 (0.71)   0.71 (0.98)   0.66 (1.38)
7            0.98 (1.84)   1.11 (1.70)   1.02 (2.11)   1.16 (2.09)   1.33 (2.36)   1.24 (2.14)   1.38 (2.10)   1.18 (2.03)
8            0.38 (0.60)   0.62 (0.97)   0.47 (0.80)   0.18 (0.43)   0.69 (1.40)   0.53 (0.75)   0.24 (0.28)   0.45 (0.77)
9            1.00 (1.13)   0.96 (1.18)   0.58 (0.98)   1.05 (2.14)   0.84 (1.95)   0.95 (0.95)   0.80 (0.57)   0.88 (1.27)
10           0.91 (0.99)   0.84 (1.04)   0.84 (0.95)   0.68 (0.83)   0.67 (1.09)   1.00 (1.09)   0.78 (0.81)   0.82 (0.97)
11           0.29 (0.39)   0.44 (0.66)   0.44 (0.53)   0.39 (0.34)   0.60 (1.61)   0.62 (0.56)   0.40 (0.61)   0.46 (0.67)
12           1.00 (1.41)   0.76 (1.01)   1.18 (1.56)   0.70 (0.59)   0.96 (1.63)   1.27 (2.84)   1.04 (1.23)   0.99 (1.48)
13           0.87 (1.21)   0.51 (0.66)   0.69 (0.95)   0.36 (0.47)   0.53 (0.62)   0.58 (0.57)   0.58 (0.98)   0.59 (0.79)
14           1.20 (1.98)   0.89 (0.79)   0.73 (1.06)   0.89 (0.94)   0.78 (1.18)   1.13 (1.71)   1.16 (1.95)   0.97 (1.38)
15           0.51 (0.53)   0.69 (0.86)   1.02 (2.07)   0.84 (1.40)   0.82 (1.56)   1.07 (1.88)   0.89 (0.83)   0.83 (1.31)
16           0.47 (0.53)   0.38 (0.74)   0.42 (0.48)   0.45 (0.49)   0.62 (0.79)   0.78 (0.90)   0.69 (0.86)   0.55 (0.69)
17           0.76 (0.87)   0.58 (0.75)   0.53 (0.71)   0.75 (2.66)   0.73 (1.02)   0.95 (1.23)   0.98 (1.84)   0.75 (1.29)
18           0.78 (1.09)   0.82 (1.01)   0.71 (1.21)   0.70 (0.82)   1.02 (1.61)   1.44 (2.93)   1.76 (2.42)   1.04 (1.70)
19           0.73 (1.29)   0.31 (0.36)   0.56 (0.66)   0.68 (1.34)   0.58 (0.70)   0.71 (0.85)   0.38 (0.38)   0.56 (0.80)
20           0.73 (2.61)   0.58 (0.98)   0.47 (0.44)   0.66 (0.83)   0.47 (0.48)   0.62 (0.97)   0.58 (0.98)   0.59 (1.03)
21           0.89 (1.01)   0.71 (0.76)   0.91 (1.58)   0.68 (0.83)   0.53 (0.62)   0.69 (0.99)   1.07 (1.65)   0.78 (1.07)
All Stores   0.75 (1.18)   0.68 (0.97)   0.71 (1.15)   0.71 (1.16)   0.73 (1.25)   0.94 (1.50)   0.86 (1.37)

Figure 8 Units sold by day of the week
[Panels: one histogram per day of the week. Horizontal axis: units, 0 to 10.]

Figure 9 Units sold by store
[Horizontal axis: units.]

2.3.3 Preliminary Models and Fits

A natural first approach to modeling these data is to use Poisson regression. We will model the number of units sold on any particular day as depending on the two factors we have at our disposal, store and day of the week. The model is specified as follows:

$$(Y_{ij} \mid \lambda_{ij}) \sim \text{Poisson}(\lambda_{ij}), \quad \text{so that } E(Y_{ij} \mid \lambda_{ij}) = \lambda_{ij} \text{ and } \mathrm{Var}(Y_{ij} \mid \lambda_{ij}) = \lambda_{ij},$$
$$\ln(\lambda_{ij}) = \alpha + \beta_i + \gamma_j, \qquad i = 1, \ldots, 21 \text{ and } j = 1, \ldots, 7,$$

where $\beta_i$ represents the effect of store $i$ on the units sold and $\gamma_j$ represents the effect of the day of the week $j$. Monday is considered to be day one and store 1 is naturally considered to be store 1. The constraints put on these parameters are such that $\beta_1$ and $\gamma_1$ are set to zero.
Under these constraints, the other parameters represent the effect of the particular store compared to store 1, or the effect of the particular day of the week compared to Monday. Estimates of the parameters under these constraints are easily obtained with the Splus function glm, provided that the contrasts are set to treatment contrasts prior to fitting the model (Chambers and Hastie, 1992, p. 32).

After fitting the model to the data, we find that the residual deviance is relatively large compared to the degrees of freedom, which suggests a lack of fit and provides an indication that overdispersion may be present. The residual deviance is 9032.086 on 6567 degrees of freedom. The estimate of the dispersion parameter given by (2) is 1.51, and the two test statistics $P_B$ and $S(\tilde\beta)$ take on values of 27.68 and 346.54 respectively. It is fairly clear that the Poisson regression model inadequately describes these data.

A comparison of the actual number of days with zero units sold with the number expected under the model is provided in Table 8. Upon inspection of Table 8 one can note that, for the majority of stores for each day of the week, the model underestimates the number of days when zero units are sold. This suggests that a ZIP model may provide an improved fit by allowing more days with zero sales.

Table 8 Comparison of the actual number of days with zero units sold with the number expected under the Poisson regression model.
(Expected numbers are given in parentheses.)

Store   Monday    Tuesday   Wednesday   Thursday   Friday    Saturday   Sunday
1       26 (26)   28 (27)   38 (26)     25 (26)    27 (26)   22 (22)    27 (24)
2       33 (28)   31 (29)   29 (29)     32 (28)    32 (28)   23 (25)    28 (26)
3       20 (14)   20 (16)   18 (15)     16 (15)    18 (15)   17 (11)    13 (12)
4       14 (13)   17 (15)   20 (14)     19 (14)    23 (14)    7 (10)    13 (11)
5       24 (26)   31 (27)   29 (26)     26 (26)    29 (26)   25 (22)    26 (23)
6       27 (24)   30 (25)   26 (24)     24 (24)    30 (24)   28 (20)    26 (22)
7       23 (14)   20 (16)   24 (15)     20 (15)    18 (15)   17 (11)    15 (12)
8       35 (29)   28 (30)   30 (30)     39 (29)    28 (30)   29 (26)    36 (27)
9       18 (19)   20 (21)   28 (20)     19 (19)    26 (20)   18 (15)    17 (17)
10      20 (20)   23 (22)   19 (21)     23 (22)    26 (22)   18 (17)    21 (18)
11      35 (29)   31 (30)   20 (30)     29 (29)    31 (29)   23 (26)    32 (27)
12      22 (17)   24 (19)   16 (18)     20 (18)    24 (18)   24 (13)    19 (15)
13      21 (25)   28 (27)   26 (26)     32 (26)    28 (26)   26 (22)    30 (23)
14      20 (17)   17 (19)   24 (18)     19 (18)    25 (18)   20 (14)    17 (15)
15      28 (20)   26 (22)   19 (21)     22 (20)    25 (20)   22 (16)    17 (18)
16      29 (26)   33 (18)   31 (27)     28 (27)    27 (27)   22 (23)    27 (24)
17      23 (21)   28 (23)   20 (22)     27 (23)    24 (22)   21 (18)    25 (19)
18      24 (16)   21 (18)   27 (17)     23 (17)    20 (17)   17 (13)    11 (14)
19      25 (27)   33 (27)   29 (27)     26 (26)    27 (26)   23 (23)    30 (24)
20      29 (25)   28 (27)   27 (26)     25 (26)    28 (26)   29 (22)    29 (23)
21      21 (21)   23 (23)   22 (22)     24 (21)    26 (21)   25 (17)    21 (19)

The model we use to fit these data uses a fixed effect for both the store and the day of the week factors. Although this makes sense for the day of the week, it implies that these twenty-one stores are the only stores of interest. This may not coincide with management's perspective, particularly when new stores are built or market conditions change. As an alternative to the factor representing the store, it would be more useful for management if covariates could be identified that acted as a surrogate for this factor.
Our aim is to find a form of model that fits the data well, and the inclusion of an effect particular to the store in essence captures many attributes of the store without requiring knowledge of which attributes are important in determining the units sold of a particular product. However, new locations will not have sufficient data to be helpful in this respect and additional covariates may be necessary.

The estimated model parameters together with their standard errors, and the annual sales of the 21 locations, are summarized in Tables 9 and 10. It is interesting to note that the parameters representing the effect of stores 3, 4 and 7 are similar, so the model predicts that these stores will sell a similar number of units on any particular day. All of these stores have sales in excess of $7 million annually, considerably higher annual sales than most other stores. This suggests that covariates such as annual sales may be able to capture the effect of any particular store on the units sold. This is promising for management in that it would enable forecasts of the units sold particular to new or growing stores. If management could identify the attributes of a store that are important in determining how many units of a certain product are sold, the model may be more useful for estimating demand. Unfortunately such additional data were not available at the time of the preparation of this thesis.
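As a rough illustration of how the fitted Poisson model translates into expected sales and expected numbers of zero-sales days, the sketch below uses the intercept and the store 4 and Saturday estimates reported in Table 9 (store 1 and Monday are the reference levels with effect zero). The function names are hypothetical, and the figure of 45 days per weekday is an assumption based on the 314 day observation period rather than a number stated in the text.

```python
import math

# Estimates copied from Table 9; reference levels have effect 0.
INTERCEPT = -0.5719
STORE_EFFECT = {1: 0.0, 4: 0.7854}
DAY_EFFECT = {"Monday": 0.0, "Saturday": 0.2231}

def fitted_mean(store, day):
    """Fitted Poisson mean: exp(alpha + beta_store + gamma_day)."""
    return math.exp(INTERCEPT + STORE_EFFECT[store] + DAY_EFFECT[day])

def expected_zero_days(store, day, n_days):
    """Expected number of days with zero units sold: n * P(Y = 0) = n * exp(-lambda)."""
    return n_days * math.exp(-fitted_mean(store, day))

N_DAYS = 45  # assumed number of Mondays (or Saturdays) in the 314 day period

zero_store1_monday = expected_zero_days(1, "Monday", N_DAYS)
zero_store4_saturday = expected_zero_days(4, "Saturday", N_DAYS)
print(round(zero_store1_monday), round(zero_store4_saturday))
```

Under these assumptions the rounded values agree with the expected counts shown in parentheses for store 1 on Mondays and store 4 on Saturdays in Table 8.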
Table 9 Summary of estimated model parameters for Poisson model of units sold

Parameter    Parameter Estimate   Standard Error
intercept    -0.5719              0.1005
store 2      -0.1746              0.1348
store 3       0.7041              0.1114
store 4       0.7854              0.1099
store 5       0.0164              0.1282
store 6       0.1343              0.1244
store 7       0.7123              0.111
store 8      -0.2567              0.1375
store 9       0.4255              0.1171
store 10      0.3505              0.1189
store 11     -0.2356              0.1370
store 12      0.5380              0.1146
store 13      0.0218              0.1281
store 14      0.5185              0.1151
store 15      0.3698              0.1184
store 16     -0.0569              0.1307
store 17      0.2696              0.1209
store 18      0.5853              0.1136
store 19     -0.0224              0.1295
store 20      0.0165              0.1281
store 21      0.3068              0.1200
Tuesday      -0.1019              0.0667
Wednesday    -0.0593              0.0660
Thursday     -0.0564              0.0663
Friday       -0.0343              0.0655
Saturday      0.2231              0.0616
Sunday        0.1314              0.0630

Table 10 Annual sales by store

Store   Annual Sales ($)
1       2,901,969
2       4,015,303
3       8,138,673
4       7,307,820
5       4,102,110
6       4,153,736
7       8,175,105
8       3,856,571
9       3,714,372
10      3,532,752
11      2,417,992
12      7,899,398
13      4,647,794
14      3,825,372
15      3,960,262
16      2,705,602
17      4,161,227
18      4,606,817
19      5,855,022
20      3,114,180
21      6,553,310

3 Random Effect Model - Negative Binomial

3.1 Methodology

When the deviance associated with a Poisson regression model is excessively large, one often considers adding a random effect to the model so that the observed overdispersion may be included in the model. A common approach to incorporating random effects into a Poisson regression model is to use a negative binomial regression model (NB regression model) (Jansen, 1993, Stein, 1988). Lawless (Lawless, 1987) describes the negative binomial regression model and examines its performance.

The NB regression model is a member of a more general class known as mixture models. In a Poisson mixture model, one assumes that the count response variable conditioned on the mean follows a Poisson distribution; however, one further assumes that the Poisson mean is random and follows some known distribution.
When the distribution for the Poisson mean is Gamma, the marginal distribution for the count response is the negative binomial. The NB regression model is specified below. Let $Y_i$ be the $i$th observation of the count response and $x_i$ be a vector of covariates. Then

$$(Y_i \mid \upsilon_i, \lambda_i) \sim \text{Poisson}(\upsilon_i \lambda_i), \qquad \upsilon_i \sim \text{Gamma with } E(\upsilon_i) = 1, \ \mathrm{Var}(\upsilon_i) = a, \ 0 < a < \infty,$$

so that

$$E(Y_i) = E\big(E(Y_i \mid \upsilon_i)\big) = E(\upsilon_i)\lambda_i = \lambda_i \quad \text{and} \quad \mathrm{Var}(Y_i) = \lambda_i (1 + a\lambda_i). \qquad (3)$$

The marginal distribution for the count response is determined by integrating over the support of the Gamma distribution with respect to $\upsilon$. That is,

$$f(y_i \mid x_i) = \frac{\Gamma(y_i + a^{-1})}{y_i!\,\Gamma(a^{-1})} \left(\frac{a\lambda_i}{1 + a\lambda_i}\right)^{y_i} \left(\frac{1}{1 + a\lambda_i}\right)^{a^{-1}}.$$

This is the probability function of the negative binomial distribution, so $(Y_i \mid x_i) \sim \text{NB}(\lambda_i, a)$. The parameter $a$ is referred to as the index or dispersion parameter (Lawless, 1987).

Estimates for the parameters in $\beta$ can be obtained by maximum likelihood, and for the parameter $a$ by either the method of moments or maximum likelihood. We will focus on maximum likelihood estimation, since the maximum likelihood estimator (MLE) is noted to be more efficient than the moment estimator (Lawless, 1987) and the estimation is performed with relative ease. The likelihood function for a vector of $n$ independent observations of the count response is given by

$$L(\beta, a) = \prod_{i=1}^{n} \frac{\Gamma(y_i + a^{-1})}{\Gamma(a^{-1})\, y_i!} \left(\frac{a\lambda_i}{1 + a\lambda_i}\right)^{y_i} \left(\frac{1}{1 + a\lambda_i}\right)^{a^{-1}}.$$

Lawless points out that since, for any $c > 0$, $\Gamma(y + c)/\Gamma(c) = c(c + 1)\cdots(c + y - 1)$ if $y$ is an integer $\geq 1$, the log-likelihood, $\log\{L(\beta, a)\}$, can be written as

$$l(\beta, a) = \sum_{i=1}^{n} \left\{ \sum_{j=0}^{y_i - 1} \log(1 + aj) + y_i \log(\lambda_i) - (y_i + a^{-1}) \log(1 + a\lambda_i) - \log(y_i!) \right\} \qquad (4)$$

(Lawless, 1987). Score or estimating equations are given by setting the first derivatives of (4) to zero, and MLEs are obtained by solving these equations with respect to the parameter sought.
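The negative binomial probability function and log-likelihood above can be checked numerically with a short sketch. The function names are hypothetical, and the values of the mean and of the index parameter are illustrative only; the code evaluates the probability function on the log scale with the log-gamma function and confirms that the probabilities sum to one with mean $\lambda$ and variance $\lambda(1 + a\lambda)$, as in (3).

```python
import math

def nb_log_pmf(y, lam, a):
    """Log of the negative binomial probability function with mean lam
    and index (dispersion) parameter a > 0."""
    inv_a = 1.0 / a
    return (math.lgamma(y + inv_a) - math.lgamma(inv_a) - math.lgamma(y + 1)
            + y * math.log(a * lam / (1.0 + a * lam))
            - inv_a * math.log(1.0 + a * lam))

def nb_pmf(y, lam, a):
    return math.exp(nb_log_pmf(y, lam, a))

def nb_log_likelihood(ys, lams, a):
    """Log-likelihood for independent counts ys with means lams."""
    return sum(nb_log_pmf(y, lam, a) for y, lam in zip(ys, lams))

# Illustrative values only (not estimates from the thesis).
lam, a = 2.0, 0.5
probs = [nb_pmf(y, lam, a) for y in range(200)]  # tail beyond 200 is negligible here
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(total, mean, var)
```

In a fit, `nb_log_likelihood` would be maximised over the regression parameters (through the means) and over `a`; the sketch only evaluates it.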
Standard errors for the parameter estimates are obtained from the inverse of the Fisher information matrix, since for $n$ large, $\sqrt{n}(\hat\beta - \beta, \hat a - a)$ follows a normal distribution with expectation $(0, 0)$ and covariance matrix given by the inverse of the Fisher information matrix. The maximization can be performed by first assuming a fixed value for $a$, maximizing with respect to $\beta$ to obtain the MLEs $\hat\beta(a)$, and then using the profile likelihood $l(\hat\beta(a), a)$ to determine the MLE $\hat a$. Alternatively, one may use the statistical software Splus. Ripley and Venables discuss fitting this model using the software S-plus (Ripley and Venables, 1994, p. 200). They also provide the function glm.nb, which fits an NB regression model. I used this function to fit NB regression models to both data sets.

3.2 Application 1: Results and Discussion

The observations of the number of resistant strains of bacteria were obtained from 81 different patients. It was noted in Section 2.2.2 that there was considerable variability between the counts from different patients at baseline. In addition, even after covariates had been considered, a great deal of extra variation remained in the data. The negative binomial regression model described above would allow for counts from each patient to follow a Poisson distribution with a random mean. This allows us to consider a random subject effect and allows for overdispersion in the model.

The negative binomial model specified in (3) is applied to these data with $Y_i$ = NSB resistant for patient $i$ at time 2, and $x_i$ = a vector of covariates for patient $i$. The fits for models of NSB resistant are summarized in Table 11. Since the dispersion parameter was re-estimated each time a new model was fit to these data, the residual deviance did not monotonically increase as covariates were dropped from the model. As such, we use the log-likelihood provided by the function glm.nb to choose between models.
The function glm.nb ignores the term $-\sum_{i=1}^{n} \log(y_i!)$ in the log-likelihood for the NB regression model, and so the values for the log-likelihood may take on positive values.

Table 11 Summary of fits for NB regression models for NSB resistant at time 2

Model            2 x log likelihood   Estimated Index of Dispersion   df
grp*weeks+NSB1   -37.69               1.41                            76
grp+weeks+NSB1   -38.46               1.45                            77
grp*weeks        -38.02               1.43                            77
grp+weeks        -38.65               1.46                            78
grp              -38.65               1.46                            79
weeks            -45.64               1.82                            79

Likelihood ratio tests can be performed by comparing twice the difference in log-likelihoods associated with two competing models with quantiles of the chi-square distribution having degrees of freedom equal to the difference in the number of parameters of the competing models. For example, comparing the model that includes both covariates, group and weeks, with the model that only includes the covariate weeks, twice the difference in the log-likelihoods is 13.98, which is large for a chi-square random variable having 1 degree of freedom. This indicates that the effect of group is significant. Similarly performed likelihood ratio tests indicate that weeks and NSB1 have an insignificant effect on NSB resistant. The model which considers the single covariate representing group membership provides as adequate a fit as any including additional covariates.

The residual deviance associated with this model is 83.77, which is not large enough relative to the degrees of freedom to suggest a lack of fit. The index of dispersion for the NB regression model, $a$, is estimated at about 1.4 and, compared with its standard error, is significantly different from zero. It appears that the NB regression model in this case adequately explains these data, as it allows for the overdispersion to be present. The likelihood ratio test comparing the model with both covariates, group and weeks, and the model with the covariate weeks only indicates that the effect of the group is significant.
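The likelihood ratio comparison described above can be sketched as follows. The Table 11 entries for the grp+weeks and weeks models are treated here as log-likelihoods, which is how the text uses them when it reports twice their difference as 13.98. The function name is hypothetical, and the p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper tail probability is erfc(sqrt(x/2)).

```python
import math

def lr_test_1df(loglik_full, loglik_reduced):
    """Likelihood ratio statistic and chi-square (1 df) p-value
    for two nested models differing by one parameter."""
    stat = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    p_value = math.erfc(math.sqrt(stat / 2.0))  # P(chi-square_1 > stat)
    return stat, p_value

# Values from Table 11: grp+weeks versus weeks only.
stat, p_value = lr_test_1df(-38.65, -45.64)
print(f"LR statistic = {stat:.2f}, p-value = {p_value:.5f}")
```

The statistic reproduces the 13.98 quoted in the text, and the associated p-value is far below conventional significance levels, in line with the conclusion that the group effect is significant.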
A t-test based on the group parameter estimate and its standard error agrees with the likelihood ratio test. The estimated model parameter for the effect of group membership is 0.913. This indicates that patients in group B tend to be infected with more resistant strains by the end of the study than patients in group A. Patients in group A are expected to be infected with $e^{0.0706} = 1.07$ and patients in group B are expected to have $e^{0.9836} = 2.67$ resistant strains by the end of the study. This was an effect that was not detected in the original Poisson regression analysis. The fit appears to have been greatly improved by the inclusion of a random effect term.

However, we can examine the fit in more detail by comparing the observed frequencies of zero counts with the expected frequencies under the model. The observed and expected frequencies for NSB resistant in groups A and B are provided in Table 12. The expected frequencies are determined by the negative binomial probability mass function with parameters given by $a = 1.4$, and $\lambda = e^{0.0706}$ for group A and $\lambda = e^{0.9836}$ for group B. Based on this comparison, the fit appears to be fairly adequate; however, we still underestimate the number of zero counts in group A and overestimate the number of low counts. This suggests that separating the zero counts from the positive counts, as will be done in the ZIP regression model, may provide a further improvement in the fit and give us an analysis in which we can have greater confidence.

Table 12 Observed and expected frequencies for NSB resistant, NB regression model.
(Expected frequencies based on the model are given in parentheses)

NSB resistant   Group A   Group B   Group A (cumulative)   Group B (cumulative)
0               29 (21)   12 (13)   29 (21)                12 (13)
1               1 (9)     2 (7)     30 (30)                14 (20)
2               2 (5)     2 (5)     32 (35)                16 (25)
3               3 (3)     8 (4)     35 (38)                24 (29)
4               4 (1)     7 (3)     39 (39)                31 (32)
5               0 (1)     6 (2)     39 (40)                37 (34)
6               0 (0)     2 (1)     39 (40)                39 (35)
7               2 (0)     1 (1)     41 (40)                40 (36)

The NB model specified in (3) is applied with $Y_i$ = NSB emergent resistant for patient $i$ at time 2, and $x_i$ = a vector of covariates for patient $i$. A summary of the model fits is provided in Table 13.

Table 13 Summary of fits, NB regression models for NSB emergent resistant at time 2.

Model             2 x log likelihood   Estimated Index of Dispersion   df
grp*weeks+NSB1    200.07               0.57                            76
grp+weeks+NSB1    199.95               0.57                            77
grp*weeks         199.05               0.59                            77
grp+weeks         199.02               0.59                            78
grp               187.27               0.84                            79
weeks             198.36               0.60                            79

Comparing the values of the log-likelihood for the competing models indicates that the model considering the single covariate representing the number of weeks provides as adequate a fit as models including additional terms. Adding the covariate representing group membership does not significantly improve the fit. Using a likelihood ratio test comparing the model with both covariates, group and weeks, with the model that considers only group, we find that twice the difference in log-likelihoods is 23.5. This is large relative to a Chi-square distribution with 1 degree of freedom, indicating that the number of weeks has a significant effect upon NSB emergent resistant.

The residual deviance associated with this model is 105.71 on 79 degrees of freedom. Although the residual deviance has decreased substantially from that obtained using the ordinary Poisson regression model, it is still large relative to the degrees of freedom. There is still some evidence of a lack of fit.
The index of dispersion for the NB regression model is estimated at about 0.59 and was also found to be significantly different from zero. The parameter representing the effect of the number of weeks is estimated at 0.088. This suggests that the longer a patient remained in the study, the more likely they were to be infected with an emergent resistant strain of bacteria.

Using the model with the single covariate representing the number of weeks, we examine the fit of this model in greater detail. Again we compare the observed frequencies of NSB emergent resistant with the expected frequencies under the model in Table 14. Patients did not all stay for the same number of weeks; however, twenty-nine patients stayed the entire sixteen weeks, and so we compare the observed and expected frequencies for these twenty-nine patients. Again we underestimate the number of patients with zero NSB and we overestimate those with small positive counts. Our model does not separate the zero counts from the positive counts; doing so would intuitively provide an improved fit.

Table 14 Observed and expected frequencies for NSB emergent resistant, NB regression model. (Expected frequencies under the model are given in parentheses)

NSB emergent resistant   Frequency   Cumulative frequency
0                        6 (3)       6 (3)
1                        0 (4)       6 (7)
2                        0 (4)       6 (11)
3                        0 (3)       6 (14)
4                        6 (3)       12 (17)
5                        8 (2)       20 (19)
6                        2 (2)       22 (21)
7                        5 (2)       27 (23)
8                        2 (1)       29 (24)

Finally, the NB regression model specified in (3) is applied with $Y_i$ = NSB combined for patient $i$ at time 2, and $x_i$ = a vector of covariates for patient $i$. The fits for these models are summarized in Table 15.

Table 15 A summary of fits of NB regression models for NSB combined at time 2.
Model             2 x log likelihood   Estimated Index of Dispersion   df
grp*weeks+NSB1    650.95               0.38                            76
grp+weeks+NSB1    650.95               0.38                            77
grp*weeks         650.95               0.38                            77
grp+weeks         650.95               0.38                            78
grp               642.23               0.48                            79
weeks             649.78               0.40                            79

Comparing the values of the log-likelihood for the above models indicates that the model containing the single covariate representing the number of weeks provides as adequate a fit as models including additional terms. The residual deviance is vastly improved over ordinary Poisson regression; however, the residual deviance associated with this model is 108.32 on 79 degrees of freedom, which still provides some evidence of a lack of fit. The effect of group membership does not appear to be significant, but the number of weeks a patient remains in the study does appear to have a significant effect on NSB combined at time 2. The estimate of the parameter for the number of weeks is 0.0597, indicating that the longer a patient remains in the study, the more likely they are to be infected with some form of resistant strain of bacteria.

Using the model with the single covariate, number of weeks, we can examine the fit of this model in more detail. In Table 16, the observed and expected frequencies for NSB combined are compared. Although the model does not underestimate the number of zero counts this time, it does still overestimate the number of low positive counts and underestimate the number of somewhat higher positive counts. Clearly, the zero counts are well separated from the positive counts, and this feature is not captured by the NB regression model.

Table 16 Observed and expected frequencies for NSB combined, NB regression model.
(Expected frequencies based on the model are given in parentheses)

NSB combined   Frequency   Cumulative frequency
0              1 (1)       1 (1)
1              1 (2)       2 (3)
2              1 (3)       3 (6)
3              0 (3)       3 (9)
4              2 (3)       5 (12)
5              7 (3)       12 (15)
6              1 (2)       13 (17)
7              5 (2)       18 (19)
8              6 (2)       24 (21)
9              1 (2)       25 (23)
10             1 (1)       26 (24)
11             2 (1)       28 (25)
12             1 (1)       29 (26)
13             0 (1)       29 (27)
14             0 (1)       29 (28)

The NB regression model did offer some improvement over the ordinary Poisson regression model in terms of its ability to account for the apparent overdispersion in these data. However, the model was still unable to explain the large number of zero counts that were observed. This is further evidence that a ZIP model, where we can separate the zero data, may provide improved fits for all three counts: NSB resistant, emergent resistant and combined.

3.3 Application 2: Results and Discussion

The daily units sold appear to vary more than can be accounted for by a Poisson distribution. The ordinary Poisson regression was consequently unable to adequately model these data. If one allows for a random effect associated with the individual observations (or days) to be incorporated into the model, the NB regression model can be applied to these data. Instead of covariates, however, our model will use the two factors, store location and day of the week, to predict the Poisson mean. The model is specified below.

\[
(Y_{ij} \mid \nu_{ij}, \lambda_{ij}) \sim \text{Poisson}(\nu_{ij}\lambda_{ij}), \qquad
\nu_{ij} \sim \text{Gamma}(a, a), \quad E(\nu_{ij}) = 1, \quad \text{Var}(\nu_{ij}) = a < \infty,
\]
so that $E(Y_{ij}) = \lambda_{ij}$ and $\text{Var}(Y_{ij}) = \lambda_{ij}(1 + a\lambda_{ij})$, with
\[
\ln(\lambda_{ij}) = \alpha + \beta_i + \gamma_j, \qquad i = 1, \dots, 21 \text{ and } j = 1, \dots, 7.
\]

Here $\beta_i$ represents the effect of store $i$ on the units sold and $\gamma_j$ represents the effect of day of the week $j$. The parameters for store 1 and Monday are set to zero, so the rate for store 1 on Monday is simply determined by the parameter $\alpha$.

Fitting this model to the data resulted in a significant improvement. The deviance dropped to a value of 6434.70 on 6567 degrees of freedom and the index of dispersion, $a$, is estimated at 0.60.
A goodness of fit test finds that this deviance is not large enough relative to the degrees of freedom to provide evidence of a lack of fit. At first glance it appears that incorporating random effects into the Poisson regression model has allowed us to adequately explain the overdispersion observed in these data.

Comparing the value of the log-likelihood of the model with both factors to that of more parsimonious models where one of the factors is dropped indicates that both store and day of the week are significant in determining the number of units sold on a particular day at a particular location, and so both factors are retained. A summary of the estimated model parameters along with their standard errors is provided in Table 17.

Table 17 Estimated model parameters and standard errors for the NB regression model for the units sold

Parameter    Parameter estimate   Standard error
intercept    -0.566               0.096
store 2      -0.177               0.126
store 3      0.708                0.110
store 4      0.775                0.109
store 5      0.018                0.122
store 6      0.138                0.119
store 7      0.715                0.110
store 8      -0.253               0.128
store 9      0.429                0.114
store 10     0.352                0.115
store 11     -0.253               0.128
store 12     0.537                0.112
store 13     0.024                0.121
store 14     0.518                0.112
store 15     0.370                0.115
store 16     -0.060               0.123
store 17     0.267                0.117
store 18     0.577                0.112
store 19     -0.021               0.122
store 20     0.019                0.122
store 21     0.309                0.116
Tuesday      -0.106               0.066
Wednesday    -0.063               0.065
Thursday     -0.063               0.065
Friday       -0.035               0.065
Saturday     0.215                0.062
Sunday       0.114                0.063

In order to evaluate the performance of this model we compare the observed number of days when zero units were sold with the number of days expected under the NB regression model. Table 18 provides the observed and expected number of days broken down by store and by day of the week. We do not underestimate the number of zeroes as often as with the Poisson regression model. The NB regression model appears to have improved the fit with respect to the number of days when zero units are sold.
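The expected number of zero days under the NB model follows from its zero probability. For a negative binomial with mean $\lambda$ and variance $\lambda(1 + a\lambda)$, $P(Y = 0) = (1 + a\lambda)^{-1/a}$. The sketch below evaluates this for two cells using the estimates in Table 17 and the estimated index of dispersion of 0.60. It is an illustration only and not the code used in the thesis; the function names are hypothetical, and the Poisson column uses the same NB-fitted mean purely for comparison rather than the fitted Poisson regression.

```python
import math

def nb_zero_prob(lam, a):
    """P(Y = 0) for a negative binomial with mean lam and variance lam * (1 + a * lam)."""
    return (1.0 + a * lam) ** (-1.0 / a)

def poisson_zero_prob(lam):
    """P(Y = 0) for a Poisson distribution with mean lam."""
    return math.exp(-lam)

a_hat = 0.60        # estimated index of dispersion reported in the text
intercept = -0.566  # Table 17: store 1 on Monday (both reference levels)
saturday = 0.215    # Table 17: Saturday effect relative to Monday

cells = [("store 1, Monday", intercept),
         ("store 1, Saturday", intercept + saturday)]

for label, eta in cells:
    lam = math.exp(eta)  # fitted mean on the original scale
    print(f"{label}: mean = {lam:.3f}, "
          f"NB P(0) = {nb_zero_prob(lam, a_hat):.3f}, "
          f"Poisson P(0) = {poisson_zero_prob(lam):.3f}")
```

Multiplying the NB zero probability by the number of days observed for a given store and weekday gives an expected count of the kind reported in parentheses in Table 18. For any positive mean, the NB zero probability exceeds the Poisson zero probability with the same mean, which is why the NB fit underestimates zeroes less often.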
Table 18 Observed and expected number of days with zero units sold, NB regression model. (Expected number of zeroes are given in parentheses)

Store   Monday    Tuesday   Wednesday   Thursday   Friday    Saturday   Sunday
1       26 (28)   28 (29)   38 (28)     25 (28)    27 (28)   22 (25)    27 (26)
2       33 (30)   31 (31)   29 (30)     32 (30)    32 (30)   23 (27)    28 (28)
3       20 (19)   20 (20)   18 (20)     16 (19)    18 (19)   17 (16)    13 (17)
4       14 (18)   17 (19)   20 (19)     19 (18)    23 (18)   7 (15)     13 (16)
5       24 (27)   31 (28)   29 (28)     26 (27)    29 (28)   25 (25)    26 (26)
6       27 (26)   30 (27)   26 (27)     24 (26)    30 (26)   28 (23)    26 (25)
7       23 (19)   20 (20)   24 (19)     20 (19)    18 (19)   17 (16)    15 (17)
8       35 (30)   28 (32)   30 (31)     39 (30)    28 (31)   29 (28)    36 (29)
9       18 (22)   20 (24)   28 (23)     19 (23)    26 (23)   18 (20)    17 (21)
10      20 (23)   23 (25)   19 (24)     23 (23)    26 (24)   18 (21)    21 (22)
11      35 (30)   31 (31)   20 (31)     29 (30)    31 (31)   23 (28)    32 (29)
12      22 (21)   24 (22)   16 (22)     20 (21)    24 (21)   24 (18)    19 (19)
13      21 (27)   28 (29)   26 (28)     32 (27)    28 (28)   26 (25)    30 (26)
14      20 (21)   17 (23)   24 (22)     19 (21)    25 (22)   20 (18)    17 (20)
15      28 (23)   26 (24)   19 (24)     22 (23)    25 (23)   22 (20)    17 (22)
16      29 (28)   33 (29)   31 (29)     28 (28)    27 (29)   22 (26)    27 (27)
17      23 (24)   28 (26)   20 (25)     27 (25)    24 (25)   21 (22)    25 (23)
18      24 (20)   21 (22)   27 (21)     23 (21)    20 (21)   17 (17)    11 (19)
19      25 (28)   33 (29)   29 (29)     26 (28)    27 (28)   23 (25)    30 (26)
20      29 (27)   28 (29)   27 (28)     25 (27)    28 (28)   29 (25)    29 (26)
21      21 (24)   23 (25)   22 (25)     24 (24)    26 (24)   25 (21)    21 (22)

4 Zero Inflated Poisson Regression Model

4.1 Methodology

When the variance conditioned on the covariates in a Poisson regression model exceeds the mean, overdispersion is indicated. If there are a large number of observations when the count response variable is equal to zero and other observations when the count is relatively large, the variance will also exceed the mean. In this situation, however, the overdispersion is more accurately explained by the fact that more zeroes than can be accounted for by a Poisson distribution are observed.
In addition, there are often physical explanations for why this is so. For example, Wang, Puterman, Cockburn and Le (Wang, Puterman, Cockburn, Le, 1996) model daily seizure frequency in epileptic children. The children have bad days when they experience a number of seizures and good days when they are seizure free. A second example is Lambert's model for the number of defects in printed circuit boards, which can be produced in a perfect state, in which no defects occur, or an imperfect state, in which a number of defects may occur (Lambert, 1992). This kind of information can be incorporated into a Poisson regression model by allowing an extra mass at zero. Using ZIP regression, one can model this extra probability of zeroes, in addition to the rate, as depending on covariates. In this way, the overdispersion can be modeled and inference can be made regarding this inherent feature of the data.

The ZIP regression model is specified as follows.

\[
Y_i = 0 \quad \text{with probability } p_i, \qquad
Y_i \sim \text{Poisson}(\lambda_i) \quad \text{with probability } 1 - p_i,
\]
so that
\[
P(Y_i = k) =
\begin{cases}
p_i + (1 - p_i)\, e^{-\lambda_i}, & \text{for } k = 0 \\
(1 - p_i)\, e^{-\lambda_i} \lambda_i^{k} / k!, & \text{for } k > 0.
\end{cases}
\tag{5}
\]
Further, $\lambda_i$ and $p_i$ satisfy
\[
\log(\lambda_i) = x_i \beta,
\]
where $\beta$ represents a vector of parameters and $x_i$ a vector of covariates, and
\[
\log\left(\frac{p_i}{1 - p_i}\right) = z_i \gamma,
\]
where $\gamma$ represents a vector of parameters and $z_i$ a vector of covariates.

This model is flexible in that one can use different covariates to model the extra probability of zeroes and the Poisson rate. In an alternative model, one can allow $p$ and $\lambda$ to be functionally related (see Lambert, 1992).

The likelihood for the ZIP regression model specified in (5) is
\[
L = \prod_{y_i = 0} \frac{e^{z_i\gamma} + \exp(-e^{x_i\beta})}{1 + e^{z_i\gamma}}
    \prod_{y_i > 0} \frac{\exp(-e^{x_i\beta})\, e^{y_i x_i\beta}}{(1 + e^{z_i\gamma})\, y_i!}.
\]
Ignoring the term $-\sum_{y_i > 0} \log(y_i!)$, the log-likelihood is
\[
\ell = -\sum_{i=1}^{n} \log\left(1 + e^{z_i\gamma}\right)
     + \sum_{y_i = 0} \log\left(\exp(-e^{x_i\beta}) + e^{z_i\gamma}\right)
     + \sum_{y_i > 0} \left(y_i x_i\beta - e^{x_i\beta}\right).
\tag{6}
\]
MLEs for the model parameters can be obtained by maximizing (6) with respect to the parameters in $\beta$ and $\gamma$.
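To make the log-likelihood in (6) concrete, the following is a minimal Python sketch that evaluates it for given parameter values. It is not the S-PLUS code referred to in Appendix 1; the function name `zip_loglik`, the argument layout and the toy data are all hypothetical and serve only as an illustration.

```python
import math

def zip_loglik(beta, gamma, X, Z, y):
    """Evaluate the ZIP log-likelihood, omitting the constant -sum(log(y_i!)).

    beta, gamma : coefficient lists for log(lambda_i) and logit(p_i)
    X, Z        : lists of covariate rows (include a 1.0 for any intercept)
    y           : list of observed counts
    """
    total = 0.0
    for x_i, z_i, y_i in zip(X, Z, y):
        xb = sum(b * v for b, v in zip(beta, x_i))   # log(lambda_i)
        zg = sum(g * v for g, v in zip(gamma, z_i))  # logit(p_i)
        total -= math.log1p(math.exp(zg))            # -log(1 + exp(z_i gamma))
        if y_i == 0:
            # zero counts: mixture of the extra mass and the Poisson zero
            total += math.log(math.exp(-math.exp(xb)) + math.exp(zg))
        else:
            # positive counts: Poisson kernel without log(y_i!)
            total += y_i * xb - math.exp(xb)
    return total

# Toy data (hypothetical): intercept-only model for both submodels.
y = [0, 0, 3, 1, 0, 2]
X = [[1.0] for _ in y]
Z = [[1.0] for _ in y]
print(zip_loglik([0.5], [-0.3], X, Z, y))
```

A function of this form, negated, could be handed to a general-purpose numerical minimizer, which is the role nlmin plays in the text. For extreme linear predictors a numerically safer evaluation of the exponentials would be advisable.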
The maximization can be performed using a number of alternative algorithms. Lambert discusses the use of the EM algorithm to maximize (6), and Chambers and Hastie discuss using the Newton-Raphson algorithm used by the Splus function ms, which requires that at least the first partial derivatives of (6) with respect to the parameters be specified. I use the Splus function nlmin instead to find the minimum of the negative log-likelihood. This function uses an algorithm based on a quasi-Newton method developed by Dennis, Gay, and Welsch (Dennis, Gay, and Welsch, 1981). The estimated parameters from a Poisson regression model for only the positive counts provide initial estimates for the parameters in $\beta$, and the estimated parameters from a logistic regression on the indicator $y_i^* = 1$ if $y_i = 0$ and $y_i^* = 0$ if $y_i > 0$ provide initial estimates for the parameters in $\gamma$. The procedure and code used to fit the ZIP regression models are provided in Appendix 1 together with the first partial derivatives of (6) with respect to the model parameters.

4.2 Application 1: Results and Discussion

The exploratory analysis in section 2.2.2 and the results of the Poisson regressions in section 2.2.3 both suggest that the observed overdispersion in these data may be modeled well by allowing an extra mass at zero. The ZIP regression model specified in (5) was applied to these data and here we summarize the results.

We allowed for a different set of covariates to be used in modeling the extra probability of zeroes and the Poisson mean. For each of NSB resistant, emergent resistant and combined we fit a total of sixteen models. This was because we considered four models for each of the extra probability of zeroes and the rate. The four models were based on including all or a subset of the covariates representing group membership, the number of weeks in the study, and the interaction between the two.
Specifically, the covariates for each of the four models are:
- group*weeks, group and weeks (group*weeks represents the interaction)
- group and weeks
- group only
- weeks only

In order to evaluate the performance of the ZIP regression model, one of the sixteen models for each of NSB resistant, emergent resistant and combined is selected. The fit of the model selected is then examined. The model selection procedure is based on comparing the values for the log-likelihood associated with each of the fitted models. The selected model will exclude the covariates that are found to have an insignificant effect on NSB, and as such the model selection procedure also allows us to make inference regarding the effects of the covariates on NSB.

Table 19 provides the values of the log-likelihood for each of the sixteen models for NSB resistant. The values for the log-likelihood in Table 19 were calculated ignoring the negative constant, $-\sum_{y_i > 0} \log(y_i!)$, and so may take on positive values.

Table 19 Log-likelihood based on ZIP regression models for NSB resistant. (Degrees of freedom are given in parentheses)

                   Covariates for $\lambda$
Covariates for p   group*weeks    group+weeks    group          weeks          no covariates
group*weeks        2.4022 (74)    2.3388 (75)    2.0329 (76)    2.3383 (76)
group+weeks        1.7700 (75)    1.7462 (76)    1.3839 (77)    1.7461 (77)
group              1.7331 (76)    1.7067 (77)    1.3708 (78)    1.7066 (78)    1.3411 (79)
weeks              -5.0978 (76)   -5.1009 (77)   -5.1323 (78)   -5.3714 (78)

One can determine the significance of a covariate in both of the submodels (one for $p$ and one for $\lambda$) by comparing log-likelihoods in the diagonal cells in Table 19. The log-likelihood changes little when the interaction is dropped, demonstrating the insignificance of this term for NSB. The log-likelihood also changes little when the covariate weeks is further dropped from the model.
The model that considers the single covariate representing group membership for the extra probability of zeroes provides as adequate a fit to these data as any models containing additional terms. This model is selected as being the most parsimonious to provide an adequate fit. When the covariate for $p$ representing group membership is dropped from the model, the log-likelihood decreases dramatically. The likelihood ratio test based on comparing the model for $p$ with the single covariate, weeks, with the model including the covariates group and weeks indicates that group membership has a significant effect upon the probability of extra zeroes. The model states that
\[
\log(\lambda_i) = 1.303, \qquad
\log\left(\frac{p_i}{1 - p_i}\right) =
\begin{cases}
0.846, & \text{if patient } i \text{ is in group A} \\
0.846 - 1.781, & \text{if patient } i \text{ is in group B.}
\end{cases}
\]
This indicates that patients in group B are more likely to be infected with at least one resistant strain of bacteria by the end of the study. It also indicates that, although patients in group B are more likely to have at least one resistant strain of bacteria, the rate of infection among those infected does not differ significantly between the two groups. Of those patients suffering from an infection with a resistant strain of bacteria at the end of the study, the expected number of strains is $e^{1.303} = 3.68$.

This inference is dramatically different from that taken from the NB regression analysis. Instead of concluding that the two groups experience different rates of infection, we find that the two groups experience a different probability of being infected at the end of the study. Patients in group B are found to be more likely to be infected with resistant strains of bacteria at the end of the study, but the rate of infection does not differ between the two groups.

In order to evaluate how well this model fits the data, we compare the observed frequencies with those we expect under the model. Table 20 provides the observed and expected frequencies for NSB resistant.
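The expected frequencies under the fitted model follow directly from the ZIP probability mass function and the estimates quoted above. The sketch below is an illustration under stated assumptions rather than the thesis's own computation: the helper names are hypothetical, and the group sizes of 41 and 40 are an assumption read off the cumulative totals in Table 12.

```python
import math

def inv_logit(eta):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def zip_pmf(k, lam, p):
    """P(Y = k) for a zero inflated Poisson with rate lam and extra-zero probability p."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return p + (1.0 - p) * poisson
    return (1.0 - p) * poisson

lam = math.exp(1.303)           # common Poisson mean, about 3.68
p_a = inv_logit(0.846)          # extra-zero probability, group A
p_b = inv_logit(0.846 - 1.781)  # extra-zero probability, group B

# Assumed group sizes (taken from the cumulative totals in Table 12).
n_a, n_b = 41, 40

expected_zero_a = n_a * zip_pmf(0, lam, p_a)
expected_zero_b = n_b * zip_pmf(0, lam, p_b)
print(f"p_A = {p_a:.3f}, p_B = {p_b:.3f}")
print(f"expected zeroes: group A = {expected_zero_a:.1f}, group B = {expected_zero_b:.1f}")
```

Under these assumptions the extra-zero probability is roughly 0.70 for group A and 0.28 for group B, and the expected numbers of zero counts round to the observed values of 29 and 12.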
Table 20 Observed and expected frequencies for NSB resistant, ZIP regression model. (Expected frequencies are given in parentheses)

NSB resistant   Group A   Group B
0               29 (29)   12 (12)
1               1 (1)     2 (3)
2               2 (2)     2 (5)
3               3 (3)     8 (6)
4               4 (2)     7 (6)
5               0 (2)     6 (4)
6               0 (1)     2 (2)
7               0 (1)     1 (1)

The ZIP regression model provides a remarkably improved fit for these data over the alternative models. The model no longer underestimates the number of zeroes nor overestimates the number of low counts. The Poisson regression model seriously underestimated the number of patients with zero counts. Of those patients who remained in the study the entire sixteen weeks, the Poisson regression model expected 4 of the patients in group A and 1 of the patients in group B to have zero counts when there were 11 and 18 respectively. Among all patients in the study, under the NB regression model, we expect to see 21 patients in group A and 13 in group B with zero counts when in actuality there are 29 in group A and 12 in group B. The ZIP regression model appears to have an improved fit over the NB regression model in this case.

Table 21 provides the values for the log-likelihood associated with the sixteen ZIP regression models for NSB emergent resistant.

Table 21 Log-likelihood based on ZIP regression models for NSB emergent resistant. (Degrees of freedom are given in parentheses)

                   Covariates for $\lambda$
Covariates for p   group*weeks    group+weeks    group          weeks          no covariates
group*weeks        137.806 (74)   137.791 (75)   137.167 (76)   137.754 (76)
group+weeks        137.239 (75)   137.193 (76)   136.566 (77)   137.151 (77)
group              130.809 (76)   130.768 (77)   139.969 (78)   130.733 (78)
weeks              136.412 (76)   136.368 (77)   135.746 (78)   136.340 (78)   135.679 (79)

The model that considers the single covariate, weeks, for $p$ provides as adequate a fit for the data as any models that include additional terms. Group membership has no significant effect upon either the extra probability of zeroes or the Poisson mean.
That is, the two treatments compared do not differ significantly in their effect on NSB emergent resistant. A likelihood ratio test comparing the model with the two covariates, group and weeks, with the model using only group for $p$ indicates that the number of weeks the patients remain in the study does have a significant effect on the extra probability of zeroes. The model states that
\[
\log(\lambda_i) = 1.64 \qquad \text{and} \qquad \log\left(\frac{p_i}{1 - p_i}\right) = 1.14 - 0.17\, x_i,
\]
where $x_i$ represents the number of weeks that patient $i$ remained in the study. This implies that the longer a patient remains in the study, the more likely they are to suffer from at least one infection with an emergent resistant strain of bacteria. Of those patients suffering from such an infection, the expected number of different emergent resistant strains is $e^{1.64} = 5.15$.

In order to evaluate how well the model fits the data, we compare the observed frequencies of NSB emergent resistant with the expected frequencies under the model. Table 22 provides the observed and expected frequencies for those patients who remained in the study the full sixteen weeks.

Table 22 Observed and expected frequencies for NSB emergent resistant, ZIP regression model. (Expected frequencies are given in parentheses)

NSB emergent resistant   Frequency
0                        6 (5)
1                        0 (1)
2                        0 (2)
3                        0 (3)
4                        6 (4)
5                        8 (4)
6                        2 (4)
7                        5 (3)
8                        2 (2)

Under the Poisson regression model we expected to see no zero counts of NSB emergent resistant, and under the NB regression model we expected to see 3 such patients, when in actuality there are 6. The ZIP regression model appears to fit these data fairly well. We no longer seriously underestimate the number of patients with zero or high counts for NSB emergent resistant; however, we still overestimate the number of patients with low counts.

Table 23 provides the values for the log-likelihood associated with the sixteen ZIP regression models for NSB combined.

Table 23 Log-likelihood based on ZIP regression models for NSB combined.
(Degrees of freedom are given in parentheses)

                   Covariates for $\lambda$
Covariates for p   group*weeks    group+weeks    group          weeks          no covariates
group*weeks        354.723 (74)   354.454 (75)   354.440 (76)   353.491 (76)
group+weeks        354.707 (75)   354.440 (76)   354.426 (77)   353.479 (77)
group              346.015 (76)   345.750 (77)   345.100 (78)   344.793 (78)
weeks              354.438 (76)   354.119 (77)   354.098 (78)   353.140 (78)   353.094 (79)

The model that includes the single covariate representing the number of weeks for $p$ and no covariates for the Poisson mean provides as adequate a fit as any models containing additional terms. The effect of weeks on NSB combined is significant, as the log-likelihood drops dramatically when this covariate is removed from the model. The model states that
\[
\log(\lambda_i) = 1.89 \qquad \text{and} \qquad \log\left(\frac{p_i}{1 - p_i}\right) = 0.98 - 0.25\, x_i,
\]
where $x_i$ represents the number of weeks that patient $i$ remained in the study. This implies again that the longer a patient remains in the study, the more likely they are to suffer from at least one infection with a resistant or an emergent resistant bacteria. Among those patients suffering from such an infection, the expected number of different strains is $e^{1.89} = 6.62$.

We can compare the observed frequencies of NSB combined with those expected under the above model in order to evaluate the model's performance. Table 24 provides a comparison between the observed frequencies and those that are expected under the above model for those patients who remained in the study the entire sixteen weeks.

Table 24 Observed and expected frequencies for NSB combined, ZIP regression model. (Expected frequencies are given in parentheses)

NSB combined   Frequency
0              1 (2)
1              1 (1)
2              1 (2)
3              0 (4)
4              2 (5)
5              7 (5)
6              1 (4)
7              5 (3)
8              6 (2)
9              1 (1)
10             1 (1)
11             2 (0)
12             1 (0)

It was not so apparent that, at least for those patients remaining in the study the full sixteen weeks, there was an excessive number of zero counts for NSB combined.
The ZIP model does not, however, overestimate the number of zero counts, and it provides a somewhat improved fit over the Poisson regression model.

4.3 Application 2: Results and Discussion

Both the exploratory data analysis in section 2.3.2 and the investigation into the fits of the preliminary models in section 2.3.3 suggest that these data may also be well modeled by a ZIP regression model. In this section we fit a ZIP regression model to the daily units sold. The model is specified as follows.

\[
Y_{ij} = 0 \quad \text{with probability } p_{ij}, \qquad
Y_{ij} \sim \text{Poisson}(\lambda_{ij}) \quad \text{with probability } 1 - p_{ij},
\]
so that
\[
P(Y_{ij} = k) =
\begin{cases}
p_{ij} + (1 - p_{ij})\, e^{-\lambda_{ij}}, & \text{for } k = 0 \\
(1 - p_{ij})\, e^{-\lambda_{ij}} \lambda_{ij}^{k} / k!, & \text{for } k > 0.
\end{cases}
\]
Further, $\lambda_{ij}$ and $p_{ij}$ satisfy
\[
\log(\lambda_{ij}) = \alpha_{\lambda} + \beta_{\lambda,i} + \gamma_{\lambda,j}
\qquad \text{and} \qquad
\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \alpha_{p} + \beta_{p,i} + \gamma_{p,j},
\]
where $\beta_{\lambda,i}$ and $\beta_{p,i}$ represent the effect of store $i$, and $\gamma_{\lambda,j}$ and $\gamma_{p,j}$ represent the effect of the $j$th day of the week. The parameters $\beta_{\lambda,1}$, $\beta_{p,1}$, $\gamma_{\lambda,1}$ and $\gamma_{p,1}$ are all taken to be zero, so $\beta_{\lambda,i}$ and $\beta_{p,i}$ represent the effect of store $i$ relative to store 1, and $\gamma_{\lambda,j}$ and $\gamma_{p,j}$ represent the effect of the $j$th day of the week relative to Monday.

We consider three models for $\lambda_{ij}$ and three models for $p_{ij}$, for a total of nine models. The three models for $\lambda_{ij}$ and $p_{ij}$ consider either both factors, store and day of the week, store only, or day of the week only.

In order to evaluate the performance of the ZIP regression model we need to select one of the nine models. The selection will again be based on comparing the values of the log-likelihood. Table 25 provides the values of the log-likelihood for the nine models.

Table 25 Log-likelihood for ZIP regression models for the daily units sold (Degrees of freedom are given in parentheses)

Factors for the extra      Factors for the Poisson mean
prob. of zeroes            store + day of the week   store               day of the week
store + day of the week    -5916.183 (6540)          -5927.524 (6546)    -5989.490 (6560)
store                      -5919.010 (6546)          -5942.470 (6552)    -5991.091 (6566)
day of the week            -5937.067 (6560)          -5948.819 (6566)    -6117.093 (6580)

The model that considers both factors, store and day of the week, for the mean and the single factor representing the store for the extra probability of zeroes provides as adequate a fit as models containing the additional factor. Likelihood ratio tests indicate that both the store and the day of the week have a significant effect on the mean number of units sold. The effect of the store on the probability of extra zeroes is also significant; however, the effect of the day of the week is not significant after the store has been taken into account. Table 26 provides the estimated model parameters for the ZIP regression model which considers both factors for the mean and the single factor, store, for the extra probability of zeroes.

Table 26 Estimated model parameters, ZIP regression model for units sold

Parameter                    For $\lambda_{ij}$   For $p_{ij}$
intercept (Store 1, Mon.)    -0.059               -0.484
store 2                      -0.217               0.025
store 3                      0.410                -0.966
store 4                      0.470                -1.106
store 5                      -0.106               -0.234
store 6                      0.162                0.137
store 7                      0.556                -0.385
store 8                      -0.012               0.628
store 9                      0.149                -0.833
store 10                     0.065                -0.879
store 11                     -0.280               0.019
store 12                     0.390                -0.370
store 13                     -0.074               -0.164
store 14                     0.276                -0.700
store 15                     0.182                -0.494
store 16                     -0.160               -0.155
store 17                     0.222                -0.057
store 18                     0.381                -0.531
store 19                     -0.141               -0.215
store 20                     -0.012               0.008
store 21                     0.112                -0.528
Tuesday                      -0.109
Wednesday                    -0.053
Thursday                     -0.058
Friday                       -0.030
Saturday                     0.221
Sunday                       0.095

Table 27 provides the observed number of days when zero units were sold and the expected number under the above ZIP regression model. Again the fit appears to have improved over the Poisson regression.

Table 27 Observed and expected number of days with zero units sold, ZIP regression model.
(Expected frequencies are given in parentheses)

Store   Monday    Tuesday   Wednesday   Thursday   Friday    Saturday   Sunday
1       26 (28)   28 (29)   38 (28)     25 (28)    27 (28)   22 (26)    27 (27)
2       33 (30)   31 (31)   29 (31)     32 (30)    32 (31)   23 (28)    28 (29)
3       20 (17)   20 (19)   18 (18)     16 (18)    18 (18)   17 (15)    13 (16)
4       14 (16)   17 (17)   20 (17)     19 (16)    23 (16)   7 (13)     13 (15)
5       24 (28)   31 (29)   29 (28)     26 (28)    29 (28)   25 (25)    26 (27)
6       27 (27)   30 (28)   26 (28)     24 (27)    30 (28)   28 (25)    26 (26)
7       23 (19)   20 (20)   24 (20)     20 (20)    18 (20)   17 (17)    15 (18)
8       35 (32)   28 (33)   30 (33)     39 (32)    28 (33)   29 (31)    36 (32)
9       18 (21)   20 (23)   28 (22)     19 (22)    26 (22)   18 (19)    17 (20)
10      20 (22)   23 (24)   19 (23)     23 (23)    26 (23)   18 (19)    21 (21)
11      35 (31)   31 (32)   20 (31)     29 (31)    31 (31)   23 (29)    32 (30)
12      22 (21)   24 (23)   16 (22)     20 (21)    24 (22)   24 (19)    19 (20)
13      21 (28)   28 (29)   26 (28)     32 (28)    28 (28)   26 (25)    30 (27)
14      20 (21)   17 (22)   24 (21)     19 (21)    25 (21)   20 (18)    17 (19)
15      28 (23)   26 (24)   19 (23)     22 (23)    25 (23)   22 (20)    17 (22)
16      29 (29)   33 (30)   31 (29)     28 (29)    27 (29)   22 (26)    27 (28)
17      23 (25)   28 (26)   20 (26)     27 (25)    24 (26)   21 (23)    25 (24)
18      24 (20)   21 (22)   27 (21)     23 (20)    20 (21)   17 (18)    11 (19)
19      25 (28)   33 (29)   29 (29)     26 (28)    27 (28)   23 (26)    30 (27)
20      29 (28)   28 (29)   27 (29)     25 (28)    28 (28)   29 (26)    29 (27)
21      21 (23)   23 (25)   22 (24)     24 (24)    26 (24)   25 (21)    21 (22)

5 Discussion and Suggestions for Further Study

The ZIP regression model clearly provided a superior fit to that of Poisson regression. It also performed at least as well as the NB regression and, in the case of modeling the number of resistant strains in the first application, was clearly superior to both alternative models.

In the first application, both the Poisson and the NB regression models underestimated the number of patients having zero counts for NSB resistant. Under the ZIP regression model, the expected number of patients with zero counts agreed perfectly with the actual data.
The fit was also improved for NSB emergent resistant, but the improvement was not as dramatic. In addition, the inferences with respect to NSB resistant made using the three analyses differed dramatically. With Poisson regression, the effect of the treatment group on NSB resistant was found to be insignificant after we adjusted the standard errors using the quasi-likelihood method. Both the NB regression and ZIP regression analyses found that the treatment group had a significant effect upon NSB resistant. However, the ZIP regression analysis found this effect to be significant only in determining the probability that a patient has at least one resistant strain and not in the rate. In this way, the ZIP regression model allowed us to gain insight as to the nature of the cause of the overdispersion itself rather than treating the overdispersion as a nuisance parameter. A summary of the comparison of the three models used in the first application is provided in Table 28.

Table 28: Model performance comparison

NSB resistant
  Poisson regression
    Fit: Poor. Seriously underestimates the number of zeroes in the data.
    Inference: Effect of treatment group not significant.
  NB regression
    Fit: Poor. Underestimates the number of zeroes and overestimates the number of low counts.
    Inference: Effect of treatment group found significant.
  ZIP regression
    Fit: Vastly improved.
    Inference: Effect of treatment group significant only for the prob. of at least one strain.

NSB emergent resistant
  Poisson regression
    Fit: Poor. Underestimates the number of zeroes.
    Inference: Effect of treatment group not significant. Effect of weeks found to be significant.
  NB regression
    Fit: Improved but still underestimates the number of zeroes and overestimates the number of low counts.
    Inference: Effect of treatment group not significant. Effect of weeks found to be significant.
  ZIP regression
    Fit: Vastly improved. Still overestimates the number of low counts.
    Inference: Effect of treatment group not significant. Effect of weeks found to be significant but only for the prob. of at least one strain.

NSB combined
  Poisson regression
    Fit: Difficult to evaluate. Does not appear to seriously underestimate the number of zeroes.
    Inference: Effect of treatment group not significant. Effect of weeks marginally significant.
  NB regression
    Fit: Does not appear to seriously underestimate the number of zeroes but does overestimate the number of low counts.
    Inference: Effect of treatment group not significant. Effect of weeks significant.
  ZIP regression
    Fit: Appears to be marginally improved. Still overestimates the number of low counts.
    Inference: Effect of treatment group not significant. Effect of weeks significant only for the prob. of at least one strain.

In the second application, both the NB and the ZIP regression models provided an improved fit over Poisson regression. The Poisson regression seriously underestimated the number of days when zero units of the product were sold. With both the NB and ZIP regression models this problem was alleviated. The inferences drawn from the NB regression model did not differ from those drawn from the Poisson regression model. On the other hand, the inferences drawn from the ZIP regression model differed from those of both the Poisson and the NB regression models. The ZIP regression model suggested that although both factors, store and day of the week, had a significant effect on the number of units sold, only the store had a significant effect on the probability that at least one unit was sold. This again allows us some insight as to the nature of the overdispersion present in these data.

Both tests for overdispersion described in Sections 2.1.1 and 2.1.2 clearly indicated its presence. However, these tests could not be used as the basis for selecting one of the two alternative models to provide an improved fit. A thorough exploratory analysis, instead, provided a relatively clear indication that a ZIP model would provide an improved fit.
This stresses the importance of such an exploratory analysis as an aid to the statistician attempting to identify appropriate models. In addition, whenever possible, physical models should also be considered.

To my knowledge, there is currently no standard software that can be used to obtain the MLEs in the ZIP regression model. Since I began working with zero inflated data, a number of colleagues have inquired as to how such a model may be implemented. The procedure is not straightforward and must be adapted to the specific form of model used. The development of software that is flexible enough to accommodate a number of variations on the ZIP regression model used here would likely be valuable for a large number of applied statisticians and subject area researchers. Although the Splus function, nlmin, was used here to fit the ZIP regression models, other methods exist which use different algorithms to obtain the MLEs. A serious investigation that compares the performance of the EM algorithm suggested by Lambert, the Newton-Raphson algorithm and the quasi-Newton algorithm used to obtain the MLEs here may provide some assistance to those wishing to fit ZIP regression models in future applications and in the development of software.

In summary, as overdispersion is a relatively common feature in Poisson regression models, models which can accommodate the overdispersion are greatly needed. Identification of which alternative model may be best suited to a particular application may be aided by a number of tests developed to detect specific forms of overdispersion together with a thorough examination of the data using exploratory tools. The ZIP regression model provides yet another alternative model that is appropriate in a number of situations and may provide both an improved fit for the data and allow a greater depth of inference.
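As an illustration of the EM idea mentioned above, the following is a minimal Python sketch (not the Splus code used in this thesis) for the simplest possible case: a ZIP model with no covariates, where the M-step has a closed form. The function name zip_em, the starting values and the toy counts are assumptions made for illustration only. In the regression setting the M-step would instead require weighted Poisson and logistic regression fits, so this sketch should be read as a simplified stand-in rather than a full implementation.

```python
import math

def zip_em(y, p=0.5, lam=1.0, iters=200):
    """EM for an intercept-only ZIP model.

    p   : probability of an "extra" (structural) zero
    lam : Poisson mean of the count component
    """
    n = len(y)
    total = sum(y)
    n_zero = sum(1 for v in y if v == 0)
    for _ in range(iters):
        # E-step: posterior probability that an observed zero is an extra zero.
        w0 = p / (p + (1.0 - p) * math.exp(-lam))
        extra = w0 * n_zero  # expected number of extra zeroes
        # M-step: closed-form updates given the expected memberships.
        p = extra / n
        lam = total / (n - extra)
    return p, lam

# Made-up counts (not from the thesis): 12 zeroes out of 20 observations.
y = [0] * 12 + [1, 1, 2, 2, 2, 3, 3, 4]
p_hat, lam_hat = zip_em(y)
fitted_zero_prob = p_hat + (1.0 - p_hat) * math.exp(-lam_hat)
fitted_mean = (1.0 - p_hat) * lam_hat
```

At convergence the fitted probability of a zero matches the observed proportion of zeroes and the fitted mean matches the sample mean, which gives a quick sanity check on the iteration.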
Bibliography

Aitchison, J., and Ho, C.H., (1989), "The Multivariate Poisson log-normal Distribution", Biometrika, 76, 643-653.
Anderson, R.U., (1985), Editorial, Journal of the American Paraplegia Association, 8, 18.
Breslow, N.E., (1990), "Tests of Hypotheses in Overdispersed Poisson Regression and Other Quasi-likelihood Models", Journal of the American Statistical Association, 85, 565-571.
Breslow, N.E., and Clayton, D.G., (1993), "Approximate Inference in Generalized Linear Mixed Models", Journal of the American Statistical Association, 88, 9-25.
Cameron, A.C., and Trivedi, P.K., (1990), "Regression-Based Tests for Overdispersion in the Poisson Model", Journal of Econometrics, 46, 347-364.
Chambers, J.M., and Hastie, T.J., (1992), Statistical Models in Splus, California: Wadsworth and Brooks Cole.
Dean, C.B., (1992), "Testing for Overdispersion in Poisson and Binomial Regression Models", Journal of the American Statistical Association, 87, 451-457.
Dean, C.B., Lawless, J.F., and Willmot, G.E., (1989), "A Mixed Poisson-Inverse Gaussian Regression Model", The Canadian Journal of Statistics, 17, 171-181.
Dean, C., and Lawless, J.F., (1989), "Test for Detecting Overdispersion in Poisson Regression Models", Journal of the American Statistical Association, 84, 467-472.
Ganio, L.M., and Schafer, P.W., (1992), "Diagnostics for Overdispersion", Journal of the American Statistical Association, 87, 795-804.
Gribble, M.J., McCallum, N.M., and Schechter, M.T., (1988), "Evaluation of Diagnostic Criteria for Bacteriuria in Acutely Spinal Cord Injured Patients Undergoing Intermittent Catheterization", Diagnostic Microbiology and Infectious Diseases, 9, 197-206.
Jansen, J., (1993), "Analysis of Counts Involving Random Effects with Applications in Experimental Biology", Biometrical Journal, 6, 745-757.
Johnson, N.L., Kotz, S., and Kemp, A.W., (1992), Univariate Discrete Distributions, second edition, New York: John Wiley and Sons, Inc.
Kapalka, B.A., Katircioglu, K., Puterman, M.L., (1995), "Inventory Control in a Retail Environment with Lost Sales and Service Constraints", Bramss working paper, 3, Faculty of Commerce and Business Administration, Vancouver, B.C.: The University of British Columbia.
Lambert, D., (1992), "Zero Inflated Poisson Regression, With an Application to Defects in Manufacturing", Technometrics, 34, 1-13.
Lawless, J.F., (1987), "Negative Binomial and Mixed Poisson Regression", The Canadian Journal of Statistics, 15, 209-225.
McCullagh, P., and Nelder, J.A., (1983), Generalized Linear Models, London: Chapman and Hall.
Merrit, J.L., (1976), "Urinary Tract Infections, Causes and Management, with Particular Reference to the Patient with Spinal Cord Injury: A Review", Archives of Physical Medicine and Rehabilitation, 57, 365-378.
Pearman, J.W., (1984), "Infection Hazards in Patients with Neuropathic Bladder Dysfunction", Journal of Hospital Infection, 5, 355-372.
Ripley, B.D., and Venables, W.N., (1994), Modern Applied Statistics with Splus, New York: Springer-Verlag.
Stein, G.Z., (1988), "Modelling Counts in Biological Populations", Mathematical Scientist, 13, 56-65.
Van den Broek, J., (1995), "A Score Test for Zero Inflation in a Poisson Distribution", Biometrics, 51, 738-743.
Wang, P.M., Puterman, M.L., Cockburn, I., Le, N., (1996), "Mixed Poisson Regression Models with Covariate Dependent Rates", Biometrics, in Press.
Wedderburn, R.W.M., (1974), "Quasi-likelihood Functions, Generalized Linear Models and the Gauss-Newton Method", Biometrika, 61, 439-477.

Appendix

Procedure for Fitting a ZIP Regression Model

Using the software package, Splus, a ZIP regression model can be used to model a count response variable with the following procedure.

    # y: the count response variable; covariates: the covariate(s)
    pos <- y > 0
    y0 <- ifelse(y == 0, 1, 0)              # indicator of a zero count
    posy <- y[pos]

    # Starting values: Poisson fit to the positive counts,
    # logistic fit to the zero indicator
    m1 <- glm(y[pos] ~ covariates[pos], family = poisson)
    m2 <- glm(y0 ~ covariates, family = binomial)
    th <- c(coefficients(m1), coefficients(m2))
    p <- length(coefficients(m1))           # no. of parameters in m1

    # Minimize the negative log-likelihood defined in zip() below
    zipmodel <- nlmin(zip, th, ...)

    zip <- function(theta) {
        k <- length(theta)
        beta <- theta[1:p]
        gam <- theta[(p + 1):k]             # parentheses needed: p+1:k is p + (1:k)
        S <- Xdes1 %*% beta                 # log of the Poisson mean
        G <- Xdes2 %*% gam                  # logit of the extra probability of a zero
        eS <- exp(S)
        zero <- y < 0.5; pos <- !zero
        lkh <- logexp(G)
        lkh[pos] <- lkh[pos] - y[pos] * S[pos] + eS[pos]
        lkh[zero] <- lkh[zero] + eS[zero] - logexp(eS[zero] + G[zero])
        sum(lkh)                            # negative log-likelihood (up to a constant)
    }

where,

    # Both design matrices need one row per observation, zeroes included.
    # model.matrix(m1) would cover only the positive counts, so Xdes1 is taken
    # from m2 here, which assumes both parts use the same covariates.
    Xdes1 <- model.matrix(m2)
    Xdes2 <- model.matrix(m2)

    # logexp(x) computes log(1 + exp(x)) without overflow
    logexp <- function(x) {
        y <- x
        pos <- x > 0
        y[pos] <- x[pos] + logexp.neg(-x[pos])
        y[!pos] <- logexp.neg(x[!pos])
        y
    }

    # logexp.neg(x) computes log(1 + exp(x)) for x <= 0
    logexp.neg <- function(x) {
        y <- exp(x)
        small <- y < 4 * .Machine$double.eps
        y[small] <- y[small] * (1 - 0.5 * y[small])
        y[!small] <- log(1 + y[!small])
        y                                   # return the result
    }

The functions zip, logexp and logexp.neg and the design matrices must be defined before nlmin is called.

Note: The design matrix can be made to give the constraints used in sections 2.3.3, 3.3 and 4.3 by issuing the command, options(contrasts = c("contr.treatment", "contr.poly")), prior to fitting the models in m1 and m2.

This procedure was adapted from the procedure provided by Chambers and Hastie (Chambers, and Hastie, 1992) who use instead the function, ms, to perform the minimization of the negative log-likelihood. Using this procedure requires that at least the first partial derivatives of the negative log-likelihood be supplied. The first partial derivatives of $\ell_i(\beta,\gamma)$, the contribution of observation $i$ to the negative log-likelihood, are as follows.

\[
\frac{\partial \ell_i(\beta,\gamma)}{\partial \beta_j} =
\begin{cases}
-x_{ij}\left(y_i - e^{x_i\beta}\right) & \text{for } y_i > 0 \\[1ex]
x_{ij}e^{x_i\beta} - \dfrac{x_{ij}e^{x_i\beta}\exp\left(e^{x_i\beta} + z_i\gamma\right)}{1 + \exp\left(e^{x_i\beta} + z_i\gamma\right)} & \text{for } y_i = 0
\end{cases}
\]

\[
\frac{\partial \ell_i(\beta,\gamma)}{\partial \gamma_j} =
\begin{cases}
\dfrac{z_{ij}e^{z_i\gamma}}{1 + e^{z_i\gamma}} & \text{for } y_i > 0 \\[1ex]
\dfrac{z_{ij}e^{z_i\gamma}}{1 + e^{z_i\gamma}} - \dfrac{z_{ij}\exp\left(e^{x_i\beta} + z_i\gamma\right)}{1 + \exp\left(e^{x_i\beta} + z_i\gamma\right)} & \text{for } y_i = 0
\end{cases}
\]

where $x_{ij}$ is the $ij$th entry of the design matrix, Xdes1, and $z_{ij}$ is the $ij$th entry of the design matrix, Xdes2. Note that no information toward the covariates in the extra probability of zeroes is provided from positive values of the response variable.
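Because analytic derivatives are easy to get wrong, it can help to check them against finite differences before passing them to a minimizer. The following is a minimal Python sketch of such a check, not part of the original Splus procedure. It re-expresses the negative log-likelihood coded in zip above, with log1pexp playing the role of logexp; all function names and the toy data are assumptions made for illustration.

```python
import math

def log1pexp(x):
    """Numerically stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))

def sigmoid(x):
    """Numerically stable exp(x) / (1 + exp(x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dot(row, coef):
    return sum(r * c for r, c in zip(row, coef))

def zip_negloglik(beta, gamma, X, Z, y):
    """Negative ZIP log-likelihood, dropping the constant sum(log(y_i!))."""
    total = 0.0
    for xi, zi, yi in zip(X, Z, y):
        s, g = dot(xi, beta), dot(zi, gamma)
        lam = math.exp(s)
        total += log1pexp(g) + lam
        total -= yi * s if yi > 0 else log1pexp(lam + g)
    return total

def zip_gradient(beta, gamma, X, Z, y):
    """Analytic first partial derivatives with respect to beta and gamma."""
    gb = [0.0] * len(beta)
    gg = [0.0] * len(gamma)
    for xi, zi, yi in zip(X, Z, y):
        s, g = dot(xi, beta), dot(zi, gamma)
        lam = math.exp(s)
        if yi > 0:
            db = lam - yi              # equals -(y_i - exp(x_i beta))
            dg = sigmoid(g)
        else:
            w = sigmoid(lam + g)
            db = lam * (1.0 - w)       # lam - lam * w
            dg = sigmoid(g) - w
        for j, xij in enumerate(xi):
            gb[j] += xij * db
        for j, zij in enumerate(zi):
            gg[j] += zij * dg
    return gb, gg

def numeric_gradient(f, theta, h=1e-6):
    """Central finite differences of f at theta."""
    grad = []
    for j in range(len(theta)):
        up = list(theta); up[j] += h
        dn = list(theta); dn[j] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

# Made-up toy data (not from the thesis): intercept plus one covariate.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.5], [1.0, 0.2]]
Z = X
y = [0, 3, 0, 1, 0, 2]
beta, gamma = [0.3, -0.2], [-0.5, 0.4]

analytic = sum(zip_gradient(beta, gamma, X, Z, y), [])
numeric = numeric_gradient(
    lambda t: zip_negloglik(t[:2], t[2:], X, Z, y), beta + gamma)
max_abs_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

If max_abs_diff is not close to zero, either the derivative formulas or their coding should be revisited before they are supplied to an optimizer.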
Modeling zero inflated count data Garden, Cheryl Ellen 1996
Item Metadata
Title | Modeling zero inflated count data |
Creator |
Garden, Cheryl Ellen |
Date Issued | 1996 |
Extent | 3575052 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-02-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0099036 |
URI | http://hdl.handle.net/2429/4495 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1996-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |