0 M L E ' s for the model parameters can be obtained by maximizing (6) with respect to the parameters in P and y . 60 The maximization can be performed using a number of alternative algorithms. Lambert discusses the use of the E M algorithm to maximize (6) and Chambers and Hastie discuss using the Newton-Rapheson algorithm used by the Splus function ms which requires that at least the partial first derivatives with respect to the parameters of (6) be specified. I use the Splus function, nlmin, instead to find the minimum of the negative log-likelihood. This function uses an algorithm based on a quasi-Newton method developed by Dennis, Gay, and Welsch (Dennis, Gay, and Welsch, 1981). The estimated parameters from a Poisson regression model for only the positive counts provides initial estimates for the parameters in p and the estimated f l , i f^ = 0 parameters from a logistic regression on >>' = 0 Table 19 Log-likelihood based on ZIP regression models for NSB resistant. (Degrees of freedom are given in parentheses) 1 Covariates for A, Covariates for p group*weeks group+weeks group weeks no covariates group*weeks 2.4022 (74) 2.3388 (75) 2.0329 (76) 2.3383 (76) group+weeks 1.7700 (75) 1.7462 (76) 1.3839(77) 1.7461 (77) group 1.7331 (76) 1.7067 (77) 1.3708 (78) 1.7066 (78) 1.3411(79) weeks -5.0978 (76) -5.1009 (77) -5.1323 (78) -5.3714(78) 62 One can determine the significance of a covariate in both of the submodels (one for p and one for X) by comparing log-likelihoods in the diagonal cells in Table 19. The log-likelihood changes little when the interaction is dropped demonstrating the insignificant of this term on NSB. The log-likelihood changes little when the covariate, weeks, is further dropped from the model. The model that considers the single covariate representing group membership for the extra probability of zeroes provides as adequate a fit to these data as any models containing additional terms. This model is selected as being the most parsimonious to provide an adequate fit. When the covariate for p representing group membership is dropped from the model, the log-likelihood decreases dramatically. The likelihood ratio test based on comparing the model for p with the single covariate, weeks with the model including the covariates, group and weeks, indicates that group membership has a significant effect upon the probability of extra zeroes. The model states that, log(X,) = 1.303, and log f p. \ \ 0.846, if patient i is in group A [0.846 -1.781, if patient i is in group B ' This indicates that patients in group B are more likely to be infected with at least one resistant strain of bacteria by the end of the study. It also indicates that although patients in group B are more likely to have at least on resistant strain of bacteria, that the rate of infection among those infected does not differ significantly between the two groups. Of those patients suffering from an infection with a resistant strain of bacteria at the end of the study, the expected number of strains is e 1 3 0 3 = 3.68. This inference is dramatically different from that taken from the N B regression analysis. Instead of concluding that the two groups experience different rates of infection we find that the two groups experience a different probability of being infected at the end of the study. Patients in group B are found to be more likely to be infected with resistant 63 strains of bacteria at the end of the study but that the rate of infection does not differ between the two groups. In order to evaluate how well this model fits the data, we compare the observed frequencies with those we expect under the model. Table 20 provides the observed and expected frequencies for NSB resistant. Table 20 Observed and expected frequencies for NSB resistant, ZIP regression model. (Expected frequencies are given in parentheses) NSB resistant Group A Group B 0 29 (29) 12(12) 1 1(1) 2(3) 2 2(2) 2(5) 3 3(3) 8(6) 4 4(2) 7(6) 5 0(2) 6(4) 6 0(1) 2(2) 7 0(1) 1(1) The ZIP regression model provides a remarkably improved fit for these data over the alternative models. The model no longer underestimates the number of zeroes nor overestimates the number of low counts. The Poisson regression model seriously underestimated the number of patients with zero counts. Of those patients who remained in the study the entire sixteen weeks, the Poisson regression model expected 4 of the patients in group A and 1 of the patients in group B to have zero counts when there were 11 and 18 respectively. Among all patients in the study, under the N B regression model, we expect to see 21 patients in group A and 13 in group B with zero counts when in actuality there are 29 in group A and 12 in group B. The ZIP regression model appears to have and improved fit over the N B regression model in this case. 64 Table 21 provides the values for the log-likelihood associated with the sixteen ZIP regression models for NSB emergent resistant. Table 21 Log-likelihood based on ZIP regression models for NSB emergent resistant. (Degrees of freedom are given in parentheses) Covariates for X Covariates for p group*weeks group+weeks group weeks no covariates group*weeks 137.806 (74) 137.791 (75) 137.167 (76) 137.754 (76) group+weeks 137.239 (75) 137.193 (76) 136.566 (77) 137.151 (77) group 130.809(76) 130.768 (77) 139.969 (78) 130.733 (78) weeks 136.412 (76) 136.368 (77) 135.746 (78) 136.340 (78) 135.679 (79) The model that considers the single covariate, weeks, for p provides as adequate a fit for the data as any models the include additional terms. Group membership has no significant effect upon either the extra probability of zeroes or the Poisson mean. That is, the two treatments compared do not differ significantly in their effect on NSB emergent resistant. A likelihood ratio test comparing the model with the two covariates, group and weeks, with the model using only group for p, indicates that the number of weeks the patients remain in the study does have a significant effect on the extra probability of zeroes. The model states that, log(A,() = 1.64 and log 1.14 - 0.17(x,.), where x, represents the number of weeks that patient / remained in the study. This implies that the longer a patient remains in the study, the more likely they are to suffer from at least one infection with an emergent resistant strain of bacteria. Of those patients suffering from such an infection, the expected number of different emergent resistant strains is e 1 6 4 = 5.15 . In order to evaluate the how well the model fits the data we compare, the observed frequencies of NSB emergent resistant and the expected frequencies under the model. Table 22 65 provides the observed and expected frequencies for those patients who remained in the study the full sixteen weeks. Table 22 Observed and expected frequencies for NSB emergent resistant, ZIP regression model. (Expected frequencies are given in parentheses) NSB resistant Frequency 0 6(5) 1 0(1) 2 0(2) 3 0(3) 4 6(4) 5 8(4) 6 2(4) 7 5(3) 8 2(2) Under the Poisson regression model we expected to see no zero counts of NSB emergent resistant and under the NB regression model we expected to see 3 such patients when in actuality there are 6. The ZIP regression model appears to fit these data fairly well. We no longer seriously underestimate the number of patients with zero or high counts for NSB emergent resistant, however we still overestimate the number of patients with low counts. Table 23 provides the values for the log-likelihood associated with the sixteen ZIP regression models for NSB combined. 66 Table 23 Log-likelihood based on ZIP regression models for NSB combined. (Degrees of freedom are given in parentheses) Covariates for X Covariates for p group*weeks group+weeks group weeks no covariates group*weeks 354.723 (74) 354.454 (75) 354.440 (76) 353.491 (76) group+weeks 354.707 (75) 354.440 (76) 354.426 (77) 353.479 (77) group 346.015 (76) 345.750 (77) 345.100 (78) 344.793 (78) weeks 354.438 (76) 354.119(77) 354.098 (78) 353.140 (78) 353.094 (79) The model that includes the single covariate representing the number of weeks for p and no covariates for the Poisson mean provides as adequate a fit as any models containing additional terms. The effect of weeks on NSB combined is significant as the log-likelihood drops dramatically when this covariate is removed from the model. The model states that, log(^) = 1.89 and log Pi U - J V = 0.98 - 0.25(x,.), where xt represents the number of weeks that patient i remained in the study. This implies again that the longer a patient remains in the study the more likely they are to suffer from at least one infection with a resistant or an emergent resistant bacteria. Among those patients suffering from such an infection the expected number of different strains is e 1 8 9 = 6.62 . We can compare the observed frequencies of NSB combined with those expected under the above model in order to evaluate the model's performance. Table 24 provides a comparison between the observed frequencies and those that are expected under the above model for those patients who remained in the study the entire sixteen weeks. 67 Table 24 Observed and expected frequencies for NSB combined, ZIP regression model. (Expected frequencies are given in parentheses) NSB resistant Frequency 0 1(2) 1 1(1) 2 1(2) 3 0(4) 4 2(5) 5 7(5) 6 1(4) 7 5(3) 8 6(2) 9 1(1) 10 1(1) 11 2(0) 12 1(0) It was not so apparent that at least for those patients remaining in the study the full sixteen weeks there was an excessive number of zero counts for NSB combined. The ZIP model does not however overestimate the number of zero counts and provides a somewhat improved fit over the Poisson regression model. 4.3 Application 2: Results and Discussion Both the exploratory data analysis in section 2.3.2 and the investigation into the fits of the preliminary models in section 2.3.3 suggest that these data may also be well modeled by a ZIP regression model. In this section we fit a ZIP regression model to the daily units sold. The model is specified as follows. Y) i. = 0, with probability /?, . Y,. ~ Poisson( A. y) with probability 1 - pt . 68 P(Y =k) = Further Xy and P j satisfy p,y{\-Pi.)e- \ for/< = 0 (1 - P l j )e~h>X , ,*/*! , f o r £ > 0 l o g ( \ ) = a + p. + y , and log = a + P , + y j, where P, , P , represent th the effect of store i and y ., y . represent the effect of the j day of the week. The parameters, P,, P 1 and y , , y, are all taken to be zero, so P, and P , represent the effect of store i relative to store 1 and y . and y . represent the effect of the j day of the week relative to Monday. We consider three models for X {j and three models for ptj for a total of nine models. The three models for X / ; and P j j consider either both factors, store and day of the week, store only, or day of the week only. In order to evaluate the performance of the ZIP regression model we need to select one of the nine models. The selection will be again based on comparing the values of the log-likelihood. Table 25 provides the values of the log-likelihood for the nine models. Table 25 Log-likelihood for ZIP regression models for the daily units sold (Degrees of freedom are given in parentheses) Factors for the Poisson mean Factors for the extra prob. of zeroes store + day of the week store day of the week store + day of the week -5916.183 (6540) -5927.524 (6546) -5989.490 (6560) store -5919.010 (6546) -5942.470 (6552) -5991.091 (6566) day of the week -5937.067 (6560) -5948.819 (6566) -6117.093 (6580) 69 The model that considers both factors, store and day of the week for the mean and the single factor representing the store for the extra probability of zeroes provides as adequate a fit as models containing the additional factor. Likelihood ratio tests indicate that both the store and the day of the week have a significant effect on the mean number of units sold. The effect of the store on the probability of extra zeroes is also significant however the effect of the day of the week is not significant after the store has been taken into account. Table 26 provides the estimated model parameters for the ZIP regression model which considers both factors for the mean and the single factor, store, for the extra probability of zeroes. 70 Table 26 Estimated model parameters, ZIP regression model for units sold a , Pz , 1j For X y For pu a , intercept (Store 1, Mon.) -0.059 -0.484 store 2 -0.217 0.025 store 3 0.410 -0.966 store 4 0.470 -1.106 store 5 -0.106 -0.234 store 6 0.162 0.137 store 7 0.556 -0.385 store 8 -0.012 0.628 store 9 0.149 -0.833 store 10 0.065 -0.879 store 11 -0.280 0.019 store 12 0.390 -0.370 store 13 -0.074 -0.164 store 14 0.276 -0.700 store 15 0.182 -0.494 store 16 -0.160 -0.155 store 17 0.222 -0.057 store 18 0.381 -0.531 store 19 -0.141 -0.215 store 20 -0.012 0.008 store 21 0.112 -0.528 Tuesday -0.109 Wednesday -0.053 Thursday -0.058 Friday -0.030 Saturday 0.221 Sunday 0.095 Table 27 provides the observed number of days when zero units were sold and the expected number under the above ZIP regression model. Again the fit appears to have improved over the Poisson regression. Table 27 Observed and expected number of days with zero units sold, ZIP regression model. (Expected frequencies are given in parentheses) Store Monday Tuesday Wednesday Thursday Friday Saturday Sunday 1 26 (28) 28 (29) 38 (28) 25 (28) 27 (28) 22 (26) 27 (27) 2 33 (30) 31 (31) 29 (31) 32 (30) 32(31) 23 (28) 28 (29) 3 20(17) 20(19) 18(18) 16(18) 18(18) 17(15) 13 (16) 4 14(16) 17(17) 20(17) 19(16) 23 (16) 7(13) 13(15) 5 24 (28) 31 (29) 29 (28) 26 (28) 29 (28) 25 (25) 26 (27) 6 27 (27) 30 (28) 26 (28) 24 (27) 30 (28) 28 (25) 26 (26) 7 23 (19) 20 (20) 24 (20) 20 (20) 18 (20) 17(17) 15 (18) 8 35 (32) 28 (33) 30 (33) 39 (32) 28 (33) 29 (31) 36 (32) 9 18(21) 20 (23) 28 (22) 19(22) 26 (22) 18 (19) 17 (20) 10 20 (22) 23(24) 19(23) 23 (23) 26 (23) 18(19) 21 (21) 11 35 (31) 31 (32) 20 (31) 29 (31) 31 (31) 23 (29) 32 (30) 12 22 (21) 24 (23) 16(22) 20 (21) 24 (22) 24(19) 19(20) 13 21 (28) 28 (29) 26 (28) 32 (28) 28 (28) 26 (25) 30 (27) 14 20 (21) 17(22) 24(21) 19(21) 25 (21) 20(18) 17(19) 15 28 (23) 26 (24) 19 (23) 22 (23) 25 (23) 22 (20) 17(22) 16 29 (29) 33 (30) 31 (29) 28 (29) 27 (29) 22 (26) 27 (28) 17 23 (25) 28 (26) 20 (26) 27 (25) 24 (26) 21 (23) 25 (24) 18 24 (20) 21 (22) 27 (21) 23 (20) 20 (21) 17(18) 11(19) 19 25 (28) 33 (29) 29 (29) 26 (28) 27 (28) 23 (26) 30 (27) 20 29 (28) 28 (29) 27 (29) 25 (28) 28 (28) 29 (26) 29 (27) 21 21 (23) 23 (25) 22 (24) 24 (24) 26(24) 25 (21) 21(22) 72 5 Discussion and Suggestions for Further Study The ZIP regression model clearly provided a superior fit to that of Poisson regression. It also performed at least as well as the NB regression, and in the case of modeling the number of resistant strains in the first application, was clearly superior to both alternative models. In the first application, both the Poisson and the N B regression models underestimated the number of patients having zero counts for NSB resistant. Under the ZIP regression model, the expected number of patients with zero counts agreed perfectly with the actual data. The fit was also improved for NSB emergent resistant but the improvement was not as dramatic. In addition the inferences with respect to NSB resistant made using the three analyses differed dramatically. With Poisson regression, the effect of the treatment group on NSB resistant was found to be insignificant after we adjusted the standard errors using the quasi-likelihood method. Both the N B regression and ZIP regression analyses found that the treatment group had a significant effect upon NSB resistant. However, the ZIP regression analysis found this effect only to be significant in determining the probability that a patient have at least one resistant strain and not in the rate. In this way, the ZIP regression model allowed us to gain insight as to the nature of the cause of the overdispersion itself rather than treating the overdispersion as a nuisance parameter. A summary of the comparison of the three models used in the first application is provided in Table 28. 73 Table 28: Model performance comparison NSB resistant NSB emergent resistant NSB combined Model Fit Inference Fit Inference Fit Inference Poisson Poor. Effect of Poor. Effect of Difficult to Effect of regression Seriously treatment Under- treatment evaluate. treatment under- group not estimates the group not Does not group not estimates the significant. number of significant. appear to significant. number of zeroes. Effect of seriously Effect of zeroes in the weeks found under-estimate weeks data. to be significant. the number of zeroes. marginally significant. NB Poor. Effect of Improved Effect of Does not Effect of regression Under- treatment but still treatment appear to treatment estimates the group under- group not seriously group not number of found estimates the significant. under-estimate significant. zeroes and significant. number of Effect of the number of Effect of over- zeroes and weeks found zeroes but weeks estimates the over- to be does over- significant. number of estimates the significant. estimate the low counts. number of low counts. number of low counts ZIP Vastly Effect of Vastly Effect of Appears to be Effect of regression improved. treatment improved. treatment marginally treatment group Still over- group not improved. group not significant estimates the significant. Still over- significant. only for number of Effect of estimates the Effect of the prob. low counts. weeks found number of low weeks of at least to be counts. significant one strain. significant but only for the prob. of at least one strain. only for the prob. of at least one strain. In the second application, both the NB and the ZIP regression models provided an improved fit over Poisson regression. The Poisson regression seriously underestimated the number of days when zero units of the product were sold. With both the N B and ZIP regression models this problem was alleviated. The inferences drawn from the N B regression model did not differ from those drawn from the Poisson regression model. On the other hand, the inferences drawn from the ZIP regression model differed from those of both the Poisson and the N B 74 regression models. The ZIP regression model suggested that although both factors, store and day of the week, had a significant effect on the number of units sold, only the store had a significant effect on the probability that at least one units was sold. This again allows us some insight as to the nature of the overdispersion present in this data. Both tests for overdispersion described in Sections 2.1.1 and 2.1.2 clearly indicated its presence. However, these tests could not be used as the basis for selecting one of the two alternative models to provide and improved fit. A thorough exploratory analysis, instead, provided a relatively clear indication that a ZIP model would provide an improved fit. This stresses the importance of such an exploratory analysis as an aid to the statistician attempting to identify appropriate models. In addition, whenever possible, physical models should also be considered. To my knowledge, there is no standard software available currently that can be used to obtain the M L E ' s in the ZIP regression model. Since working with zero inflated data, a number of colleagues have inquired as to how such a model may be implemented. The procedure is not straightforward and must be adapted to the specific form of model used. The development of software that is flexible enough to accommodate a number of variations on the ZIP regression model used here would likely be valuable for a large number of applied statisticians and subject area researchers. Although the Splus function, nlmin, was used here to fit the ZIP regression models, other methods exist which use different algorithms to obtain the M L E ' s . A serious investigation that compares the performance of the E M algorithm suggested by Lambert, the Newton-Raphson algorithm and the quasi-Newton algorithm used to obtain M L E ' s here may provide some 75 assistance to those wishing to fit ZIP regression models in future applications and in the development of software. In summary, as overdispersion is a relatively common feature in Poisson regression models, models which can accommodate the overdispersion are greatly needed. Identification of which alternative model may be best suited to a particular application may be aided by a number of tests developed to detect specific forms of overdispersion together with a thorough examination of the data using exploratory tools. The ZIP regression model provides yet another alternative model that is appropriate in a number of situations and may provide both an improved fit for the data and allow a greater depth of inference. Bibliography Aitchison, J., and Ho, C.H., (1989), "The Multivariate Poisson log-normal Distribution", Biometrika, 76, 643-653. Anderson, R.U., (1985), Editorial, Journal of the American Paraplegia Association, 8, 18. Breslow, N.E. , (1990), "Tests of Hypotheses in Overdispersed Poisson Regression and Other Quasi-likelihood Models", Journal of the American Statistical Association, 85, 565-571. Breslow, N.E. , and Clayton, D.G., (1993), "Approximate Inference in Generalized Linear Mixed Models", Journal of the American Statistical Association, 88, 9-25. Cameron, A.C. , and Trivedi, P.K., (1990), "Regression-Based Tests for Overdispersion in the Poisson Model", Journal of Econometrics, 46, 347-364. Chambers, J .M., and Hastie, T.J., (1992) Statistical Models in Splus, California: Wadsworth and Brooks Cole. Dean, C.B., (1992), "Testing for Overdispersion in Poisson and Binomial Regression Models", Journal of the American Statistical Association, 87, 451-457. Dean, C.B., Lawless, J.F., and Willmot, G.E., (1989), " A Mixed Poisson-Inverse Gaussian Regression Model", The Canadian Journal of Statistics, 17, 171-181. Dean, C , and Lawless, J.F., (1989), "Test for Detecting Overdispersion in Poisson Regression Models", Journal of the American Statistical Association, 84, 467-472. Ganio, L . M . , and Schafer, P.W., (1992), "Diagnostics for Overdispersion", Journal of the American Statistical Association, 87, 795-804. Gribble, M.J. , McCallum, N . M . , and Schechter, M.T., (1988), "Evaluation of Diagnostic Criteria for Bacteriuria in Acutely Spinal Cord Injured Patients Undergoing Intermittent Catheterization", Diagnostic Microbiology and Infectious Diseases, 9, 197-206. Jansen, J., (1993), "Analysis of Counts Involving Random Effects with Applications in Experimental Biology", Biometrical Journal, 6, 745-757. Johnson, N .L . , Kotz, S., and Kemp, A.W., (1992), Univariate Discrete Distributions, second edition, New York: John Wiley and Sons, Inc. 77 Kapalka, B.A. , Katircioglu, K. , Puterman, M.L . , (1995), "Inventory Control in a Retail Environment with Lost Sales and Service Constraints", Bramss working paper, 3, Faculty of Commerce and Business Administration, Vancouver, B.C.: The University of British Columbia. Lambert, D., (1992), "Zero Inflated Poisson Regression, With an Application to Defects in Manufacturing", Technometrics, 34, 1-13. Lawless, J.F., (1987), "Negative Binomial and Mixed Poisson Regression", The Canadian Journal of Statistics, 15, 209-225. McCullagh, P., and Nelder, J.A., (1983), Generalized Linear Models, London:Chapman and Hall. Merrit, J.L., (1976), "Urinary Tract Infections, Causes and Management, with Particular Reference to the Patient with Spinal Cord Injury: A Review", Archives of Physical Medicine and Rehabilitation, 57, 365-378. Pearman, J.W., (1984), "Infection Hazards in Patients with Neuropathic Bladder Dysfunction", Journal of Hospital Infection, 5, 355-372. Ripley, B.D., and Venables, W.N. (1994), Modern Applied Statistics with Splus, New York: Springer-Verlag. Stein, G.Z., (1988), "Modelling Counts in Biological Populations", Mathematical Scientist, 13, 56-65. Van den Broek, J., (1995), " A Score Test for Zero Inflation in a Poisson Distribution", Biometrics, 51, 738-743. Wang, P .M. , Puterman, M.L . , Cockburn, I., Le, N . , (1996) "Mixed Poisson Regression Models with Covariate Dependent Rates", Biometrics, in Press. Wedderburn, R.W.M. , (1974), "Quasi-likelihood Functions, Generalized Linear Models and the Gauss-Newton Method", Biometrika, 61, 439-477. 78 Appendix Procedure for Fitting a ZIP Regression Model Using the software package, Splus, a ZIP regression model can be used to model a count response variable with the following procedure. y<-response variable pos<-y>0 yO<-ifelse(y<-0,l,0) posy<-y[pos] m 1 <-glm(y[pos]~covariates[pos], family=poisson) m2<-glm(y0~covariates, family^binomial) th<-c(coefficients(m 1), coefficients(m2) p<-no. of parameters in m l zipmodel<-nlmin(zip, th, ...) zip<-function(theta) { k<-length(theta) beta<-theta[l:p] gam<-theta[p+l:k] S<-Xdesl %*% beta G<-Xdes2 %*% gam eS<-exp(S) zero<-y<0.5; pos<-!zero lkh<-logexp(G) lkh[pos]<-lkh[pos] - y[pos] * S[pos] + eS[pos] lkh[zero]<-lkh[zero] + eS[zero] - logexp(eS[zero] + G[zero]) sum(lkh) } where, Xdes 1 <-model .matrix(m 1) Xdes2<-model. matrix(m2) 79 logexp<-function(x) { y<-x pos<-x>0 y[pos]<-x[pos] + logexp.neg(- x[pos]) y[!pos]<- logexp.neg(x[!pos]) y } logexp.neg<-function(x) { y<-exp(x) smalK- y<4*.Machine$double.eps y[small]<-y[small] * (l-0.5*y[small]) y[!small]<-log(l +y [Ismail]) } Note: The design matrix can be made to give the constraints used in sections 2.3.3, 3.3 and 4.3 by issuing the command, options(constraints= "contr.treatment") prior to fitting the models in m l and m2. This procedure was adapted from the procedure provided by Chambers and Hastie (Chambers, and Hastie, 1992) who use instead the function, ms, to perform the minimization of the negative log-likelihood. Using this procedure requires that at least the first partial derviatives of the negative log-likelihood be supplied. The first partial derivatives of /,.(P,y ) are as follows. d(?,(P,Y) 9(P,) * f fCv,-e*p), for v, > 0 exp(e*'p +zjy)xjjeXil for v, = 0 — x e u + (l + exp(e*''p +z,y) d(/,(P,Y) 8